Symmetric Matrices
Symmetric matrices arise naturally throughout applied mathematics: in least-squares problems,
where \(X^\top X\) is always symmetric, in quadratic forms in optimization, and in covariance matrices
in statistics. They have special properties that make them particularly tractable: real eigenvalues,
orthogonal eigenvectors, and guaranteed diagonalizability.
Definition: Symmetric Matrix
A symmetric matrix is a square matrix \(A\) such that
\[
A^\top = A.
\]
For example, in the context of least-squares solutions, \(\hat{\beta}\) satisfies the normal equations \(X^\top X\hat{\beta} = X^\top Y\).
In this expression, \(X^\top X\) is always symmetric. To confirm, let \(X\) be an \(m \times n\) matrix. Then
\[
(X^\top X)^\top = X^\top (X^\top)^\top = X^\top X.
\]
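As a quick numerical sanity check (a minimal NumPy sketch; the matrix \(X\) here is just a random example), we can confirm that \(X^\top X\) is symmetric:
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))        # an arbitrary m x n example, m = 5, n = 3

G = X.T @ X                            # the Gram matrix X^T X
print(np.allclose(G, G.T))             # True: X^T X equals its own transpose
```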
Definition: Orthogonal Diagonalization
An \(n \times n\) matrix \(A\) is said to be orthogonally diagonalizable if
there exist an orthogonal matrix \(P\) (with \(P^{-1} = P^\top \)) and a diagonal matrix \(D\) such that:
\[
A = PDP^\top = PDP^{-1}.
\]
For such a diagonalization, \(A\) must have \(n\) orthonormal (and hence linearly independent) eigenvectors.
Importantly, in such cases, \(A\) is symmetric because
\[
\begin{align*}
A^\top = (PDP^\top)^\top &= (P^\top)^\top D^\top P^\top \\\\
&= PDP^\top \\\\
&= A.
\end{align*}
\]
The converse is also always true. If \(A\) is symmetric, it is guaranteed to be orthogonally diagonalizable.
Theorem 1: The Spectral Theorem
An \(n \times n\) matrix \(A\) is orthogonally diagonalizable if and only if \(A\) is
a symmetric matrix.
Example:
In Eigenvalues & Eigenvectors, we considered the example:
\[
\begin{align*}
A &= \begin{bmatrix} 4 & 1 & 1\\ 1 & 4 & 1 \\ 1 & 1 & 4 \\ \end{bmatrix} \\\\
&= PDP^{-1} \\\\
&= \begin{bmatrix} -1 & -1 & 1 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ \end{bmatrix}
\begin{bmatrix} 3 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 6 \\ \end{bmatrix}
\begin{bmatrix} \frac{-1}{3} & \frac{-1}{3} & \frac{2}{3} \\
\frac{-1}{3} & \frac{2}{3} & \frac{-1}{3} \\
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \end{bmatrix}.
\end{align*}
\]
Since \(A\) is symmetric, it must be orthogonally diagonalizable. To make \(P\) orthogonal, we need
orthonormal eigenvectors, which we can obtain by applying the Gram-Schmidt algorithm to the columns of \(P\).
Here only \(v_1\) and \(v_2\) fail to be orthogonal (\(v_1 \cdot v_2 \neq 0\)), so only \(v_2\) needs adjustment:
\[
\begin{align*}
v'_2 &= v_2 - \frac{v_2 \cdot v_1}{v_1 \cdot v_1}v_1 \\\\
&= \begin{bmatrix} -1\\ 1\\ 0\\ \end{bmatrix} - \frac{1}{2}\begin{bmatrix} -1\\ 0\\ 1\\ \end{bmatrix} \\\\
&= \begin{bmatrix} \frac{-1}{2}\\ 1 \\ \frac{-1}{2} \end{bmatrix}.
\end{align*}
\]
After normalization, the orthonormal eigenvectors are:
\[
u_1 = \begin{bmatrix} \frac{-1}{\sqrt{2}} \\ 0\\ \frac{1}{\sqrt{2}} \\ \end{bmatrix}
,\,
u_2 = \begin{bmatrix} \frac{-1}{\sqrt{6}}\\ \frac{2}{\sqrt{6}}\\ \frac{-1}{\sqrt{6}}\\ \end{bmatrix},
\, \text{and }
u_3 = \begin{bmatrix} \frac{1}{\sqrt{3}} \\ \frac{1}{\sqrt{3}} \\ \frac{1}{\sqrt{3}} \\ \end{bmatrix}
\]
Therefore,
\[
\begin{align*}
A &= PDP^\top \\\\
&= \begin{bmatrix} \frac{-1}{\sqrt{2}} & \frac{-1}{\sqrt{6}} & \frac{1}{\sqrt{3}} \\
0 & \frac{2}{\sqrt{6}} & \frac{1}{\sqrt{3}}\\
\frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{6}} & \frac{1}{\sqrt{3}} \end{bmatrix}
\begin{bmatrix} 3 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 6 \\ \end{bmatrix}
\begin{bmatrix} \frac{-1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} \\
\frac{-1}{\sqrt{6}} & \frac{2}{\sqrt{6}} & \frac{-1}{\sqrt{6}} \\
\frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} \end{bmatrix} \\\\
&= PDP^{-1}
\end{align*}
\]
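We can verify this orthogonal diagonalization numerically. The sketch below uses NumPy's np.linalg.eigh, which is designed for symmetric matrices; its eigenvectors for the repeated eigenvalue \(\lambda = 3\) may differ from the hand-computed ones above, but they still form an orthonormal basis of the same eigenspace.
```python
import numpy as np

A = np.array([[4.0, 1.0, 1.0],
              [1.0, 4.0, 1.0],
              [1.0, 1.0, 4.0]])

# eigh returns real eigenvalues in ascending order and orthonormal
# eigenvectors as the columns of P.
eigvals, P = np.linalg.eigh(A)
D = np.diag(eigvals)

print(eigvals)                           # [3. 3. 6.]
print(np.allclose(P.T @ P, np.eye(3)))   # True: P is orthogonal
print(np.allclose(P @ D @ P.T, A))       # True: A = P D P^T
```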
Quadratic Forms
The connection between symmetric matrices and their eigenvalues becomes especially clear through
quadratic forms. These scalar-valued functions \(Q(x) = x^\top Ax\) appear throughout optimization,
physics, and machine learning. Understanding how to eliminate cross-product terms reveals the
geometric meaning of eigenvalues.
Definition: Quadratic Form
A quadratic form on \(\mathbb{R}^n\) is a function
\[
Q(x) = x^\top Ax \in \mathbb{R}
\]
where \(A\) is an \(n \times n\) symmetric matrix.
Example:
Let's use our symmetric matrix again:
\[
A = \begin{bmatrix} 4 & 1 & 1\\ 1 & 4 & 1 \\ 1 & 1 & 4 \\ \end{bmatrix}
\]
Then:
\[
\begin{align*}
Q(x) &= x^\top Ax \\\\
&= 4x_1^2 + 4x_2^2 + 4x_3^2 + 2x_1x_2 + 2x_2x_3 + 2x_3x_1.
\end{align*}
\]
We can transform this quadratic form into one with no cross-product terms.
Since \(A\) is symmetric, by Theorem 1 it is orthogonally diagonalizable as \(A = PDP^\top\).
Let \(x = Py\) be a change of variable, where \(P\) is this orthogonal (hence invertible) matrix and \(y\) is a
new variable in \(\mathbb{R}^n\). Thus
\[
\begin{align*}
x^\top Ax &= (Py)^\top A(Py) \\\\
&= y^\top P^\top APy \\\\
&= y^\top (P^\top AP)y \\\\
&= y^\top(P^\top PDP^\top P)y \\\\
&= y^\top Dy
\end{align*}
\]
where \(y = P^{-1}x = P^\top x\).
Since \(P^\top A P = D\), where \(D\) is a diagonal matrix, we get:
\[
\begin{align*}
x^\top A x &= y^\top D y \\\\
&= \lambda_1 y_1^2 + \lambda_2 y_2^2 + \cdots + \lambda_n y_n^2
\end{align*}
\]
where \(\lambda_1, \lambda_2, \ldots, \lambda_n\) are the eigenvalues of \(A\).
Example:
For our example, \(Q(x)\) becomes:
\[
Q(x) = 3y_1^2 + 3y_2^2 + 6 y_3^2.
\]
Thus, the eigenvalues of \(A\) are directly connected to its quadratic form. First,
we define the classification of quadratic forms.
Definition: Classification of Quadratic Forms
A quadratic form \(Q(x)\) is classified as:
- Positive definite:
\(\qquad \qquad A \succ 0\) if \(\, \forall x \neq 0, \, Q(x) > 0\).
- Positive semi-definite:
\(\qquad \qquad A \succeq 0\) if \(\, \forall x, \, Q(x) \geq 0\).
- Negative definite:
\(\qquad \qquad A \prec 0\) if \(\, \forall x \neq 0, \, Q(x) < 0\).
- Negative semi-definite:
\(\qquad \qquad A \preceq 0\) if \(\, \forall x, \, Q(x) \leq 0\).
- Indefinite:
\(\qquad \qquad\) otherwise.
Theorem 2:
Let \(A\) be an \(n \times n\) symmetric matrix. Then a quadratic form \(x^\top Ax\) is
- positive definite iff the eigenvalues of \(A\) are all positive.
- negative definite iff the eigenvalues of \(A\) are all negative.
Recall from our derivation that an orthogonal change of variable \(x = Py\) transforms
\(x^\top Ax\) into
\[
Q(x) = \lambda_1 y_1^2 + \cdots + \lambda_n y_n^2,
\]
where
\(\lambda_1, \ldots, \lambda_n\) are the eigenvalues of \(A\).
There is a one-to-one correspondence between all \(x \neq 0\) and \(y \neq 0\) because \(P\)
is an invertible matrix. Therefore, \(Q(x)\) is determined by the signs of the eigenvalues
(or spectrum) of \(A\).
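In code, this classification reduces to inspecting the signs of the eigenvalues. The helper below (classify_quadratic_form and its tolerance tol are illustrative choices, not a standard API) sketches the idea:
```python
import numpy as np

def classify_quadratic_form(A, tol=1e-10):
    """Classify the quadratic form x^T A x by the signs of A's eigenvalues."""
    eigvals = np.linalg.eigvalsh(A)      # real eigenvalues of a symmetric matrix
    if np.all(eigvals > tol):
        return "positive definite"
    if np.all(eigvals >= -tol):
        return "positive semi-definite"
    if np.all(eigvals < -tol):
        return "negative definite"
    if np.all(eigvals <= tol):
        return "negative semi-definite"
    return "indefinite"

A = np.array([[4.0, 1.0, 1.0],
              [1.0, 4.0, 1.0],
              [1.0, 1.0, 4.0]])
print(classify_quadratic_form(A))        # positive definite (eigenvalues 3, 3, 6)
```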
The columns of \(P\) provide the coordinate system in which the quadratic form becomes
diagonal. These columns are called the principal axes of the quadratic
form \(x^\top Ax\), and they align with the eigenvectors of \(A\).
Note:
To check whether a matrix \(A\) is positive definite, an effective approach is to verify the existence of a
Cholesky factorization \(A = R^\top R\), where \(R\)
is an upper triangular matrix with positive diagonal entries. This factorization is a modified version of
the LU factorization.
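As a sketch of this check, NumPy's np.linalg.cholesky attempts the factorization (in the lower-triangular convention \(A = LL^\top\), i.e., \(L = R^\top\)) and raises an error when \(A\) is not positive definite:
```python
import numpy as np

def is_positive_definite(A):
    """Attempt a Cholesky factorization; it exists iff A is positive definite."""
    try:
        np.linalg.cholesky(A)            # returns lower-triangular L with A = L L^T
        return True
    except np.linalg.LinAlgError:
        return False

A = np.array([[4.0, 1.0, 1.0],
              [1.0, 4.0, 1.0],
              [1.0, 1.0, 4.0]])
print(is_positive_definite(A))           # True
```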
Insight:
Positive definiteness of \(A\) means the quadratic form \(x^\top Ax\) defines a "bowl" shape
(all eigenvalues positive). This guarantees a unique minimum for optimization problems,
making positive definite matrices central to convex optimization and machine learning.
Singular Value Decomposition (SVD)
While the spectral theorem applies only to symmetric matrices, the Singular Value Decomposition
extends orthogonal diagonalization to any \(m \times n\) matrix. SVD is arguably the
most important matrix factorization in applied mathematics, with applications ranging from data
compression to solving ill-conditioned linear systems.
Let \(A\) be an \(m \times n\) matrix. The symmetric matrix \(A^\top A\) is orthogonally diagonalizable.
Let \(\{v_1, \cdots, v_n\}\) be an orthonormal basis for \(\mathbb{R}^n\) consisting of eigenvectors
of \(A^\top A\) with corresponding eigenvalues \(\lambda_1, \cdots, \lambda_n\).
For \(1 \leq i \leq n\), the following holds:
\[
\begin{align*}
\| Av_i \|^2 &= (Av_i)^\top Av_i = v_i^\top (A^\top Av_i) \\\\
&= v_i^\top (\lambda_iv_i) \\\\
&= \lambda_i.
\end{align*}
\]
Since \(\| Av_i \|^2 \geq 0\), we conclude that \(\lambda_i \geq 0 \).
Definition: Singular Values
Let \(A\) be an \(m \times n\) matrix, and let \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n\)
be the eigenvalues of \(A^\top A\) arranged in descending order. The singular values of
\(A\) are defined as:
\[
\sigma_i = \sqrt{\lambda_i} = \|Av_i\| \geq 0 \quad (1 \leq i \leq n)
\]
where \(v_i\) are the corresponding orthonormal eigenvectors of \(A^\top A\).
If \(A\) has \(r\) nonzero singular values, then \(\{Av_1, \cdots, Av_r\}\) forms an orthogonal basis
for \(\text{Col }A\) and
\[\text{rank } A = \dim \text{ Col }A = r.
\]
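A small numerical experiment (the \(4 \times 3\) matrix below is a random example) confirms that the singular values computed three ways agree: as \(\sqrt{\lambda_i}\), as \(\|Av_i\|\), and directly from NumPy's SVD routine.
```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))          # a hypothetical 4 x 3 example

eigvals, V = np.linalg.eigh(A.T @ A)     # eigh returns ascending order
eigvals, V = eigvals[::-1], V[:, ::-1]   # re-sort to descending order

sigma_from_eigs = np.sqrt(np.clip(eigvals, 0, None))   # sqrt(lambda_i)
sigma_from_norms = np.linalg.norm(A @ V, axis=0)       # ||A v_i||, column by column
sigma_from_svd = np.linalg.svd(A, compute_uv=False)    # NumPy's own singular values

print(np.allclose(sigma_from_eigs, sigma_from_norms))  # True
print(np.allclose(sigma_from_eigs, sigma_from_svd))    # True
```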
Theorem 3: Singular Value Decomposition
Let \(A\) be an \(m \times n\) matrix with rank \(r\). Then there exist an \(m \times n\)
matrix \(\Sigma = \begin{bmatrix} D & 0 \\ 0 & 0 \\ \end{bmatrix}\), where \(D\) is an \(r \times r\)
diagonal matrix whose diagonal entries are the first \(r\) nonzero singular values of \(A\), \(\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0\),
an \(m \times m\) orthogonal matrix \(U\), and an \(n \times n\) orthogonal matrix \(V\) such that
\[
A = U\Sigma V^\top.
\]
Proof:
Normalizing the orthogonal basis \(\{Av_1, \cdots, Av_r\}\) for \(\text{Col }A\), we obtain an orthonormal basis
\(\{u_1, \cdots, u_r\}\), where
\[
u_i = \frac{1}{\| Av_i \|}Av_i = \frac{1}{\sigma_i}Av_i \quad 1 \leq i \leq r
\]
or,
\[
Av_i = \sigma_i u_i. \tag{1}
\]
We extend \(\{u_1, \cdots, u_r\}\) to an orthonormal basis for \(\mathbb{R}^m\), \(\{u_1, \cdots, u_r, u_{r+1}, \cdots, u_m\}\).
Let \(U = \begin{bmatrix}u_1 & \cdots & u_m \end{bmatrix}\) and \(V = \begin{bmatrix}v_1 & \cdots & v_n \end{bmatrix}\).
Clearly, by construction, \(U\) and \(V\) are orthogonal matrices. Also, by Equation (1),
\[
AV = \begin{bmatrix} \sigma_1u_1 & \cdots & \sigma_ru_r & 0 & \cdots & 0 \end{bmatrix}.
\]
Let \(D\) be a diagonal matrix with diagonal entries \(\sigma_1, \cdots, \sigma_r\) and
let \(\Sigma =\begin{bmatrix} D & 0 \\ 0 & 0 \\ \end{bmatrix}\). Then
\[
\begin{align*}
U\Sigma &= \begin{bmatrix}\sigma_1u_1 & \cdots & \sigma_ru_r & 0 & \cdots & 0\end{bmatrix} \\\\
&= AV.
\end{align*}
\]
Finally, since \(V\) is an orthogonal matrix,
\[
(U\Sigma ) V^\top = (AV)V^\top = AI = A.
\]
Algorithm: Computing SVD
- Compute eigendecomposition of \(A^\top A\):
Compute the eigenvalues and eigenvectors of \(A^\top A\).
Note: In practice, avoid directly computing \(A^\top A\) to prevent numerical errors.
- Construct \(V\) and \(\Sigma\):
Arrange eigenvalues in descending order.
Use the corresponding unit eigenvectors as the columns of \(V\).
- Construct \(U\):
For nonzero singular values, compute \(u_k = \frac{1}{\sigma_k}Av_k\).
If there are not enough nonzero singular values, extend to an orthonormal basis.
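A minimal NumPy sketch of this procedure (using a random example matrix, and forming \(A^\top A\) only because the demo is small) looks like this:
```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))          # a hypothetical example matrix

# Steps 1-2: eigendecomposition of A^T A (fine for a small demo; in practice
# use np.linalg.svd, which avoids forming A^T A explicitly).
eigvals, V = np.linalg.eigh(A.T @ A)
order = np.argsort(eigvals)[::-1]        # descending order
eigvals, V = eigvals[order], V[:, order]
sigma = np.sqrt(np.clip(eigvals, 0, None))

# Step 3: U from u_k = (1/sigma_k) A v_k for the nonzero singular values.
r = np.sum(sigma > 1e-12)
U = np.zeros((A.shape[0], A.shape[0]))
U[:, :r] = (A @ V[:, :r]) / sigma[:r]
# (If r < m, the remaining columns would be extended to an orthonormal basis.)

Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, sigma)

print(np.allclose(U[:, :r] @ Sigma[:r, :] @ V.T, A))   # True: A = U Sigma V^T
```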
Beyond its use in matrix factorization, the SVD reveals important numerical properties of a
matrix through its singular values. The condition number measures how strongly \(A\) can amplify
small relative errors (like floating-point rounding) in the input \(x\) when computing the output \(Ax\), or when solving \(Ax = b\).
Definition: Condition Number
For a matrix \(A \in \mathbb{R}^{m \times n}\) with rank \(r\), the condition number
\(\kappa(A)\) is defined as the ratio of the largest to the smallest nonzero singular value:
\[
\kappa(A) = \frac{\sigma_1}{\sigma_r}
\]
where \(\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0\) are the nonzero singular values.
If \(A\) is a square, invertible matrix, this simplifies to \(\kappa(A) = \sigma_1 / \sigma_n\).
Often, a matrix with \(\kappa(A) \gg 1\) is called ill-conditioned, meaning it is
numerically unstable for solving linear systems because it is "almost singular." From a CS perspective,
this is a critical numerical red flag. Solving \(Ax=b\) for an ill-conditioned \(A\) can lead to highly
inaccurate results, as small rounding errors in \(b\) or \(A\) get magnified into large errors in the solution \(x\).
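The following sketch illustrates the point with a deliberately ill-conditioned \(2 \times 2\) example (a hypothetical matrix with nearly dependent columns): the condition number is huge, and a tiny perturbation of \(b\) changes the solution of \(Ax = b\) drastically.
```python
import numpy as np

# An ill-conditioned example: two nearly linearly dependent columns.
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-8]])

sigma = np.linalg.svd(A, compute_uv=False)
print(sigma[0] / sigma[-1])              # ~4e8: ratio of largest to smallest singular value
print(np.linalg.cond(A))                 # the same value via NumPy's built-in

# A tiny perturbation of b produces a large change in the solution of Ax = b.
b = np.array([2.0, 2.0])
b_perturbed = b + np.array([0.0, 1e-6])
print(np.linalg.solve(A, b))             # [2, 0]
print(np.linalg.solve(A, b_perturbed))   # roughly [-98, 100]: a wildly different solution
```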
Pseudo-inverse via SVD
The SVD provides the most numerically stable method for computing the Moore-Penrose pseudo-inverse.
Unlike the normal equations approach \((A^\top A)^{-1}A^\top\), the SVD-based computation avoids squaring
the condition number and naturally handles rank-deficient matrices.
The Moore-Penrose Pseudo-inverse (\(A^\dagger\)) is most stably defined and computed
using the Singular Value Decomposition (SVD).
Given the SVD of \(A = U\Sigma V^\top\), where \(U\) is \(m \times m\), \(\Sigma\) is \(m \times n\), and
\(V\) is \(n \times n\), the pseudo-inverse is:
\[
A^\dagger = V\Sigma^\dagger U^\top
\]
Here, \(\Sigma^\dagger\) is an \(n \times m\) matrix formed by:
- Transposing \(\Sigma\) to get \(\Sigma^\top\) (which is \(n \times m\)).
- Inverting every nonzero singular value \(\sigma_i\) on the diagonal to get \(1/\sigma_i\).
If \(\Sigma\) has diagonal entries \(\sigma_1, \dots, \sigma_r > 0\), then \(\Sigma^\dagger\) will have
diagonal entries \(1/\sigma_1, \dots, 1/\sigma_r\). All other entries are zero.
This definition has a beautiful geometric interpretation. Recall that \(\hat{x} = A^\dagger b\) is the
minimum-norm least-squares solution.
- A least-squares solution \(\hat{x}\) must map to the projection of \(b\) onto \(\text{Col}(A)\).
- The minimum-norm requirement forces the solution \(\hat{x}\) to lie entirely within the Row Space of
\(A\), i.e., \(\hat{x} \in \text{Row}(A)\). (Any component in \(\text{Nul}(A)\) would be orthogonal to \(\text{Row}(A)\), adding
to the norm of \(\hat{x}\) without changing the result \(A\hat{x}\).)
The SVD beautifully handles this. It decomposes \(A\) into its fundamental subspaces; the pseudo-inverse essentially "inverts" the
mapping from \(\text{Row}(A)\) to \(\text{Col}(A)\) and ignores \(\text{Nul}(A)\).
You can also see that
\[
A^\dagger A = V\Sigma^\dagger U^\top U\Sigma V^\top = V\Sigma^\dagger\Sigma V^\top.
\]
The matrix \(\Sigma^\dagger\Sigma\) is an \(n \times n\) diagonal matrix with 1s for the first \(r\) entries and 0s after.
This makes \(A^\dagger A = V \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix} V^\top\), which is the orthogonal projection matrix onto the Row Space of \(A\).
Similarly, \(A A^\dagger\) is the orthogonal projection matrix onto the Column Space of \(A\).
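These identities are easy to verify numerically. The sketch below builds \(A^\dagger\) from the SVD of a random rank-deficient example and checks it against np.linalg.pinv and the projection properties (the tolerance tol is an illustrative choice):
```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))   # a rank-3, 5 x 4 example

U, sigma, Vt = np.linalg.svd(A)
tol = 1e-10                                   # cutoff relative to sigma_1
r = np.sum(sigma > tol * sigma[0])

# Build Sigma^dagger: transposed shape, with the nonzero singular values inverted.
Sigma_dag = np.zeros((A.shape[1], A.shape[0]))
Sigma_dag[:r, :r] = np.diag(1.0 / sigma[:r])

A_dag = Vt.T @ Sigma_dag @ U.T
print(np.allclose(A_dag, np.linalg.pinv(A, rcond=tol)))   # True

# A^dagger A projects onto Row(A); A A^dagger projects onto Col(A).
P_row = A_dag @ A
P_col = A @ A_dag
print(np.allclose(P_row @ P_row, P_row))      # idempotent
print(np.allclose(P_col @ P_col, P_col))      # idempotent
print(np.allclose(P_row @ A.T, A.T))          # P_row fixes Row(A) (the columns of A^T)
```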
Practical Note: Truncation and Regularization
In real-world applications, we rarely have singular values that are exactly zero. Instead, we have a spectrum of values,
some of which are very small.
\[
\sigma_1 \geq \sigma_2 \geq \dots \gg \dots \geq \sigma_n \approx 0.
\]
Inverting these tiny singular values (\(1/\sigma_i\)) would introduce massive instability, as we'd be amplifying components
related to noise.
The practical computation of the pseudo-inverse involves setting a tolerance \(\epsilon\). We treat any
\(\sigma_i < \epsilon\) as zero.
\[
(\sigma_i^\dagger) = \begin{cases} 1/\sigma_i & \text{if } \sigma_i \geq \epsilon \\ 0 & \text{if } \sigma_i < \epsilon \end{cases}
\]
This truncated SVD is the basis for Principal Component Regression (PCR) and is a powerful form of
regularization. It explicitly discards the "directions" (singular vectors) that contribute least to the transformation and are most
likely to be noise, building a more robust, lower-rank model.
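In NumPy, this truncation is exposed through the rcond argument of np.linalg.pinv, which zeroes out singular values below rcond times the largest one. The sketch below uses a synthetic regression problem with one nearly redundant column (all data here are made up for illustration) to show the effect:
```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 3))
X = np.column_stack([X, X[:, 0] + 1e-9 * rng.standard_normal(100)])   # nearly redundant column
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal(100)

# Full pseudo-inverse: the tiny singular value is inverted and amplifies noise.
beta_unstable = np.linalg.pinv(X, rcond=1e-15) @ y

# Truncated pseudo-inverse: singular values below rcond * sigma_1 are treated as zero.
beta_truncated = np.linalg.pinv(X, rcond=1e-6) @ y

print(np.linalg.svd(X, compute_uv=False))   # note one singular value near zero
print(np.round(beta_unstable, 2))           # huge, noise-driven coefficients
print(np.round(beta_truncated, 2))          # much smaller, more stable coefficients
```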
Interactive SVD Visualization
This interactive demo visualizes the Singular Value Decomposition process for \(2 \times 2\) matrices. SVD expresses any matrix \(A\) as a
product \(A = U \Sigma V^\top \), which can be understood geometrically as a composition of three transformations:
- \(U\): Rotation/reflection using left singular vectors (orthogonal)
- \(\Sigma\): Scaling along the axes (singular values)
- \(V^\top\): Rotation/reflection using right singular vectors (orthogonal)
Use the visualization to see how each component of SVD contributes to the overall transformation:
Try different matrices to observe how SVD works with various transformations:
- Identity: No transformation occurs
- Scaling: Only the singular values differ from 1
- Rotation: \(U\) and \(V\) are identical rotation matrices
- Symmetric: \(U = V\) (special case for symmetric matrices)
- Shear: Combination of rotation and non-uniform scaling
SVD with Fundamental Subspaces
The SVD has a beautiful geometric interpretation: it provides orthonormal bases for all four
fundamental subspaces of a matrix simultaneously. This reveals the deep connection between
the singular value decomposition and the structure of linear transformations.
Theorem 5: SVD and Fundamental Subspaces
Let \(A\) be an \(m \times n\) matrix with SVD \(A = U \Sigma V^\top\), where \(A\) has
rank \(r\). Let \(\{u_1, \ldots, u_m\}\) be the columns of \(U\) (left singular vectors)
and \(\{v_1, \ldots, v_n\}\) be the columns of \(V\) (right singular vectors). Then:
- \(\{u_1, \ldots, u_r\}\) is an orthonormal basis for \(\text{Col }A\)
- \(\{u_{r+1}, \ldots, u_m\}\) is an orthonormal basis for \(\text{Nul }A^\top\)
- \(\{v_{r+1}, \ldots, v_n\}\) is an orthonormal basis for \(\text{Nul }A\)
- \(\{v_1, \ldots, v_r\}\) is an orthonormal basis for \(\text{Row }A\)
Proof:
Column Space:
From the construction of the SVD, \(\{Av_1, \ldots, Av_r\} = \{\sigma_1 u_1, \ldots, \sigma_r u_r\}\)
is an orthogonal basis for \(\text{Col }A\). Since \(\sigma_i > 0\) for \(i \leq r\),
\(\{u_1, \ldots, u_r\}\) is an orthonormal basis for \(\text{Col }A\).
Left Null Space:
Using the orthogonality property \((\text{Col }A)^\perp = \text{Nul }A^\top\) and the fact that the columns of \(U\)
are orthonormal, the vectors \(\{u_{r+1}, \ldots, u_m\}\) are orthogonal to the first \(r\) columns, which span
\(\text{Col }A\). Since \(\dim(\text{Nul }A^\top) = m - r\), they form an orthonormal basis for the left null space \(\text{Nul }A^\top\).
Null Space:
For \(i > r\), \(\sigma_i = 0\) implies \(\|Av_i\| = \sigma_i = 0\), so \(Av_i = 0\) and \(v_i \in \text{Nul }A\).
The set \(\{v_{r+1}, \ldots, v_n\}\) is orthonormal and has \(n - r\) elements.
Since \(\dim(\text{Nul }A) = n - r\), this set is an orthonormal basis for \(\text{Nul }A\).
Row Space:
Since \((\text{Nul }A)^\perp = \text{Col }A^\top = \text{Row }A\), the first \(r\) columns of \(V\),
\(\{v_1, \ldots, v_r\}\) being orthogonal to \(\text{Nul }A\), span the row space of \(A\).
Geometric Insight:
The SVD reveals that any linear transformation can be understood as acting on orthogonal
subspaces: the row space maps via scaling by singular values to the column space, while
the null space maps to zero. The transformation is entirely characterized by how it acts
on these complementary subspaces.
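Reading the bases directly off a computed SVD makes Theorem 5 concrete. The sketch below (with a random rank-deficient example) extracts all four subspace bases and checks the expected orthogonality relations:
```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))   # 5 x 4, rank 3

U, sigma, Vt = np.linalg.svd(A)
r = np.sum(sigma > 1e-10 * sigma[0])
V = Vt.T

col_A  = U[:, :r]        # orthonormal basis for Col A
nul_At = U[:, r:]        # orthonormal basis for Nul A^T
row_A  = V[:, :r]        # orthonormal basis for Row A
nul_A  = V[:, r:]        # orthonormal basis for Nul A

print(np.allclose(A @ nul_A, 0))          # A sends its null space to zero
print(np.allclose(A.T @ nul_At, 0))       # A^T sends Nul A^T to zero
print(np.allclose(col_A.T @ nul_At, 0))   # Col A is orthogonal to Nul A^T
print(np.allclose(row_A.T @ nul_A, 0))    # Row A is orthogonal to Nul A
```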
Connection to the Invertible Matrix Theorem
When \(A\) is an \(n \times n\) invertible matrix, the SVD characterization of fundamental
subspaces provides additional equivalent conditions for invertibility.
Additional Invertible Matrix Conditions
Let \(A\) be an \(n \times n\) matrix. Then \(A\) is invertible if and only if any of
the following equivalent conditions hold:
- \((\text{Col }A)^\perp = \{0\}\)
- \((\text{Nul }A)^\perp = \mathbb{R}^n\)
- \(\text{Row }A = \mathbb{R}^n\)
- \(A\) has \(n\) nonzero singular values
These conditions extend the Invertible Matrix Theorem
from earlier sections.
The equivalence follows from the fact that if \(A\) is invertible, then \(\text{rank}(A) = n\),
which means \(r = n\). Therefore, \(\text{Nul }A = \{0\}\) (there are no vectors
\(v_{r+1}, \ldots, v_n\)), \(\text{Row }A = \mathbb{R}^n\) (all \(v_1, \ldots, v_n\) span
the row space), and all \(n\) singular values are nonzero. Conversely, if any of these
conditions hold, they force \(r = n\), making \(A\) invertible.
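As a final sketch, the last condition gives a simple numerical invertibility test (the helper is_invertible and its tolerance are illustrative choices):
```python
import numpy as np

def is_invertible(A, tol=1e-12):
    """A square matrix is invertible iff all of its singular values are nonzero."""
    sigma = np.linalg.svd(A, compute_uv=False)
    return bool(np.all(sigma > tol * sigma[0]))

print(is_invertible(np.array([[4.0, 1.0, 1.0],
                              [1.0, 4.0, 1.0],
                              [1.0, 1.0, 4.0]])))   # True: three nonzero singular values
print(is_invertible(np.array([[1.0, 2.0],
                              [2.0, 4.0]])))        # False: rank 1
```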