Symmetry


Symmetric Matrices

Symmetric matrices arise naturally throughout applied mathematics: from least-squares problems, where \(X^\top X\) is always symmetric, to quadratic forms in optimization, to covariance matrices in statistics. They have special properties that make them particularly tractable: real eigenvalues, orthogonal eigenvectors, and guaranteed diagonalizability.

Definition: Symmetric Matrix

A symmetric matrix is a square matrix \(A\) such that \[ A^\top = A. \]

For example, in the context of least-squares solutions, \(\hat{\beta}\) satisfies the normal equations \(X^\top X\hat{\beta} = X^\top Y\). In this expression, \(X^\top X\) is always symmetric. To confirm, let \(X\) be an \(m \times n\) matrix. Then \[ (X^\top X)^\top = X^\top (X^\top)^\top = X^\top X. \]
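As a quick numerical check, here is a minimal sketch (assuming NumPy; the matrix \(X\) below is an arbitrary example, not one from the text) confirming that \(X^\top X\) is symmetric:

```python
import numpy as np

# A hypothetical 4 x 2 data matrix X; any values would work.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [7.0, 8.0]])

G = X.T @ X                      # the Gram matrix X^T X
print(np.allclose(G, G.T))       # True: X^T X equals its own transpose
```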

Definition: Orthogonal Diagonalization

An \(n \times n\) matrix \(A\) is said to be orthogonally diagonalizable if there exist an orthogonal matrix \(P\) (with \(P^{-1} = P^\top \)) and a diagonal matrix \(D\) such that: \[ A = PDP^\top = PDP^{-1}. \]

For this diagonalization, \(A\) must have \(n\) orthonormal (hence linearly independent) eigenvectors. Importantly, in such cases, \(A\) is symmetric because \[ \begin{align*} A^\top = (PDP^\top)^\top &= (P^\top)^\top D^\top P^\top \\\\ &= PDP^\top \\\\ &= A, \end{align*} \] using the fact that \(D^\top = D\) for a diagonal matrix. The converse is also true: if \(A\) is symmetric, it is guaranteed to be orthogonally diagonalizable.

Theorem 1: The Spectral Theorem

An \(n \times n\) matrix \(A\) is orthogonally diagonalizable if and only if \(A\) is a symmetric matrix.

Example:

On Eigenvalues & Eigenvectors, we considered the example: \[ \begin{align*} A &= \begin{bmatrix} 4 & 1 & 1\\ 1 & 4 & 1 \\ 1 & 1 & 4 \\ \end{bmatrix} \\\\ &= PDP^{-1} \\\\ &= \begin{bmatrix} -1 & -1 & 1 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ \end{bmatrix} \begin{bmatrix} 3 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 6 \\ \end{bmatrix} \begin{bmatrix} \frac{-1}{3} & \frac{-1}{3} & \frac{2}{3} \\ \frac{-1}{3} & \frac{2}{3} & \frac{-1}{3} \\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \end{bmatrix}. \end{align*} \]

Since \(A\) is symmetric, it must be orthogonally diagonalizable. To ensure \(P\) is orthogonal, its columns must be orthonormal eigenvectors. This can be achieved by applying the Gram-Schmidt algorithm to the columns of \(P\). Since \(v_1 \cdot v_2\) is the only nonzero pairwise dot product, only \(v_2\) needs adjustment: \[ \begin{align*} v'_2 &= v_2 - \frac{v_2 \cdot v_1}{v_1 \cdot v_1}v_1 \\\\ &= \begin{bmatrix} -1\\ 1\\ 0\\ \end{bmatrix} - \frac{1}{2}\begin{bmatrix} -1\\ 0\\ 1\\ \end{bmatrix} \\\\ &= \begin{bmatrix} \frac{-1}{2}\\ 1 \\ \frac{-1}{2} \end{bmatrix}. \end{align*} \] After normalization, the orthonormal eigenvectors are: \[ u_1 = \begin{bmatrix} \frac{-1}{\sqrt{2}} \\ 0\\ \frac{1}{\sqrt{2}} \\ \end{bmatrix} ,\, u_2 = \begin{bmatrix} \frac{-1}{\sqrt{6}}\\ \frac{2}{\sqrt{6}}\\ \frac{-1}{\sqrt{6}}\\ \end{bmatrix}, \, \text{and } u_3 = \begin{bmatrix} \frac{1}{\sqrt{3}} \\ \frac{1}{\sqrt{3}} \\ \frac{1}{\sqrt{3}} \\ \end{bmatrix} \] Therefore, \[ \begin{align*} A &= PDP^\top \\\\ &= \begin{bmatrix} \frac{-1}{\sqrt{2}} & \frac{-1}{\sqrt{6}} & \frac{1}{\sqrt{3}} \\ 0 & \frac{2}{\sqrt{6}} & \frac{1}{\sqrt{3}}\\ \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{6}} & \frac{1}{\sqrt{3}} \end{bmatrix} \begin{bmatrix} 3 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 6 \\ \end{bmatrix} \begin{bmatrix} \frac{-1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} \\ \frac{-1}{\sqrt{6}} & \frac{2}{\sqrt{6}} & \frac{-1}{\sqrt{6}} \\ \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} \end{bmatrix} \\\\ &= PDP^{-1} \end{align*} \]
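The same computation can be reproduced numerically. Below is a minimal sketch (assuming NumPy) using `np.linalg.eigh`, which is designed for symmetric matrices and already returns orthonormal eigenvectors; for the repeated eigenvalue 3 the eigenvectors may differ from the hand-computed \(u_1, u_2\), but they still orthogonally diagonalize \(A\):

```python
import numpy as np

A = np.array([[4.0, 1.0, 1.0],
              [1.0, 4.0, 1.0],
              [1.0, 1.0, 4.0]])

# eigh is specialized for symmetric matrices: eigenvalues come out real
# (in ascending order) and the eigenvector matrix is orthogonal.
eigvals, P = np.linalg.eigh(A)
D = np.diag(eigvals)

print(eigvals)                              # [3. 3. 6.]
print(np.allclose(P.T @ P, np.eye(3)))      # True: P^T P = I
print(np.allclose(P @ D @ P.T, A))          # True: A = P D P^T
```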

Quadratic Forms

The connection between symmetric matrices and their eigenvalues becomes especially clear through quadratic forms. These scalar-valued functions \(Q(x) = x^\top Ax\) appear throughout optimization, physics, and machine learning. Understanding how to eliminate cross-product terms reveals the geometric meaning of eigenvalues.

Definition: Quadratic Form

A quadratic form on \(\mathbb{R}^n\) is a function \[ Q(x) = x^\top Ax \in \mathbb{R} \] where \(A\) is an \(n \times n\) symmetric matrix.

Example:

Let's use our symmetric matrix again: \[ A = \begin{bmatrix} 4 & 1 & 1\\ 1 & 4 & 1 \\ 1 & 1 & 4 \\ \end{bmatrix} \] Then: \[ \begin{align*} Q(x) &= x^\top Ax \\\\ &= 4x_1^2 +4 x_2^2 + 4x_3^2 + 2x_1x_2 + 2x_2x_3 + 2x_3x_1. \end{align*} \]

We can transform this quadratic form into one with no cross-product terms.

Let \(x = Py\) be a change of variable, where \(P\) is the orthogonal matrix that diagonalizes \(A\) and \(y\) is a new variable in \(\mathbb{R}^n\). Since \(A\) is symmetric, by Theorem 1 such a \(P\) exists, with \(A = PDP^\top\). Thus \[ \begin{align*} x^\top Ax &= (Py)^\top A(Py) \\\\ &= y^\top P^\top APy \\\\ &= y^\top (P^\top AP)y \\\\ &= y^\top(P^\top PDP^\top P)y \\\\ &= y^\top Dy \end{align*} \] where \(y = P^{-1}x = P^\top x\).

Since \(P^\top A P = D\), where \(D\) is a diagonal matrix, we get: \[ \begin{align*} x^\top A x &= y^\top D y \\\\ &= \lambda_1 y_1^2 + \lambda_2 y_2^2 + \cdots + \lambda_n y_n^2 \end{align*} \] where \(\lambda_1, \lambda_2, \ldots, \lambda_n\) are the eigenvalues of \(A\).

Example:

For our example, \(Q(x)\) becomes: \[ Q(x) = 3y_1^2 + 3y_2^2 + 6 y_3^2. \]
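A short numerical check (a sketch assuming NumPy; the test vector is arbitrary) that the change of variable \(y = P^\top x\) removes the cross-product terms:

```python
import numpy as np

A = np.array([[4.0, 1.0, 1.0],
              [1.0, 4.0, 1.0],
              [1.0, 1.0, 4.0]])
eigvals, P = np.linalg.eigh(A)       # eigenvalues [3, 3, 6], orthonormal columns

rng = np.random.default_rng(0)
x = rng.standard_normal(3)           # an arbitrary test vector

y = P.T @ x                          # change of variable y = P^T x
Q_x = x @ A @ x                      # original form: x^T A x
Q_y = np.sum(eigvals * y**2)         # diagonal form: 3 y1^2 + 3 y2^2 + 6 y3^2

print(np.isclose(Q_x, Q_y))          # True
```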

Thus, the eigenvalues of \(A\) are directly connected to its quadratic form. First, we define the classification of quadratic forms.

Definition: Classification of Quadratic Forms

A quadratic form \(Q(x)\) is classified as:

  1. Positive definite:
    \(\qquad \qquad A \succ 0\) if \(\, \forall x \neq 0, \, Q(x) > 0\).
  2. Positive semi-definite:
    \(\qquad \qquad A \succeq 0\) if \(\, \forall x, \, Q(x) \geq 0\).
  3. Negative definite:
    \(\qquad \qquad A \prec 0\) if \(\, \forall x \neq 0, \, Q(x) < 0\).
  4. Negative semi-definite:
    \(\qquad \qquad A \preceq 0\) if \(\, \forall x, \, Q(x) \leq 0\).
  5. Indefinite:
    \(\qquad \qquad\) otherwise.

Theorem 2:

Let \(A\) be an \(n \times n\) symmetric matrix. Then a quadratic form \(x^\top Ax\) is

  1. positive definite iff the eigenvalues of \(A\) are all positive.
  2. negative definite iff the eigenvalues of \(A\) are all negative.

Recall from our derivation that an orthogonal change of variable \(x = Py\) transforms \(x^\top Ax\) into \[ Q(x) = \lambda_1 y_1^2 + \cdots + \lambda_n y_n^2, \] where \(\lambda_1, \ldots, \lambda_n\) are the eigenvalues of \(A\).

There is a one-to-one correspondence between all \(x \neq 0\) and \(y \neq 0\) because \(P\) is an invertible matrix. Therefore, the classification of \(Q(x)\) is determined by the signs of the eigenvalues (the spectrum) of \(A\).

The columns of \(P\) provide the coordinate system in which the quadratic form becomes diagonal. These columns are called the principal axes of the quadratic form \(x^\top Ax\), and they align with the eigenvectors of \(A\).

Note:
To check whether a matrix \(A\) is positive definite, an effective test is to check whether a Cholesky factorization \(A = R^\top R\) exists, where \(R\) is an upper triangular matrix with positive diagonal entries. This factorization is a modified version of the LU factorization.
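A minimal sketch of this test (assuming NumPy; the helper `is_positive_definite` is our own name). `np.linalg.cholesky` returns a lower-triangular factor \(L\) with \(A = LL^\top\) (take \(R = L^\top\) for the \(A = R^\top R\) form) and raises `LinAlgError` when \(A\) is not positive definite:

```python
import numpy as np

def is_positive_definite(A: np.ndarray) -> bool:
    """Return True if the symmetric matrix A is positive definite."""
    try:
        np.linalg.cholesky(A)        # succeeds only for positive definite A
        return True
    except np.linalg.LinAlgError:
        return False

A = np.array([[4.0, 1.0, 1.0],
              [1.0, 4.0, 1.0],
              [1.0, 1.0, 4.0]])

print(is_positive_definite(A))       # True: eigenvalues 3, 3, 6 are all positive
print(is_positive_definite(-A))      # False: all eigenvalues of -A are negative
```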

Insight:

Positive definiteness of \(A\) means the quadratic form \(x^\top Ax\) defines a "bowl" shape (all eigenvalues positive). This guarantees a unique minimum for optimization problems, making positive definite matrices central to convex optimization and machine learning.

Singular Value Decomposition (SVD)

While the spectral theorem applies only to symmetric matrices, the Singular Value Decomposition extends orthogonal diagonalization to any \(m \times n\) matrix. SVD is arguably the most important matrix factorization in applied mathematics, with applications ranging from data compression to solving ill-conditioned linear systems.

Let \(A\) be an \(m \times n\) matrix. The symmetric matrix \(A^\top A\) is orthogonally diagonalizable. Let \(\{v_1, \cdots, v_n\}\) be an orthonormal basis for \(\mathbb{R}^n\) consisting of eigenvectors of \(A^\top A\) with corresponding eigenvalues \(\lambda_1, \cdots, \lambda_n\).

For \(1 \leq i \leq n\), the following holds: \[ \begin{align*} \| Av_i \|^2 &= (Av_i)^\top Av_i = v_i^\top (A^\top Av_i) \\\\ &= v_i^\top (\lambda_iv_i) \\\\ &= \lambda_i v_i^\top v_i = \lambda_i, \end{align*} \] since \(v_i\) is a unit vector. Because \(\| Av_i \|^2 \geq 0\), we conclude that \(\lambda_i \geq 0\).

Definition: Singular Values

Let \(A\) be an \(m \times n\) matrix, and let \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n\) be the eigenvalues of \(A^\top A\) arranged in descending order. The singular values of \(A\) are defined as: \[ \sigma_i = \sqrt{\lambda_i} = \|Av_i\| \geq 0 \quad (1 \leq i \leq n) \] where \(v_i\) are the corresponding orthonormal eigenvectors of \(A^\top A\).

If \(A\) has \(r\) nonzero singular values, then \(\{Av_1, \cdots, Av_r\}\) forms an orthogonal basis for \(\text{Col }A\) and \[\text{rank } A = \dim \text{ Col }A = r. \]

Theorem 3: Singular Value Decomposition

Let \(A\) be an \(m \times n\) matrix with rank \(r\). Then there exist an \(m \times n\) matrix \(\Sigma = \begin{bmatrix} D & 0 \\ 0 & 0 \\ \end{bmatrix}\), where \(D\) is an \(r \times r\) diagonal matrix whose diagonal entries are the \(r\) nonzero singular values of \(A\), \(\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0\), an \(m \times m\) orthogonal matrix \(U\), and an \(n \times n\) orthogonal matrix \(V\) such that \[ A = U\Sigma V^\top. \]

Proof:

Normalizing the orthogonal basis \(\{Av_1, \cdots, Av_r\}\) for \(\text{Col }A\), we obtain an orthonormal basis \(\{u_1, \cdots, u_r\}\), where \[ u_i = \frac{1}{\| Av_i \|}Av_i = \frac{1}{\sigma_i}Av_i \quad 1 \leq i \leq r \] or, equivalently, \[ Av_i = \sigma_i u_i. \tag{1} \]

We extend \(\{u_1, \cdots, u_r\}\) to an orthonormal basis \(\{u_1, \cdots, u_r, u_{r+1}, \cdots, u_m\}\) for \(\mathbb{R}^m\). Let \(U = \begin{bmatrix}u_1 & \cdots & u_m \end{bmatrix}\) and \(V = \begin{bmatrix}v_1 & \cdots & v_n \end{bmatrix}\). By construction, \(U\) and \(V\) are orthogonal matrices. Also, by Equation (1), \[ AV = \begin{bmatrix} \sigma_1u_1 & \cdots & \sigma_ru_r & 0 & \cdots & 0 \end{bmatrix}. \]

Let \(D\) be a diagonal matrix with diagonal entries \(\sigma_1, \cdots, \sigma_r\) and let \(\Sigma =\begin{bmatrix} D & 0 \\ 0 & 0 \\ \end{bmatrix}\). Then \[ \begin{align*} U\Sigma &= \begin{bmatrix}\sigma_1u_1 & \cdots & \sigma_ru_r & 0 & \cdots & 0\end{bmatrix} \\\\ &= AV. \end{align*} \] Finally, since \(V\) is an orthogonal matrix, \[ (U\Sigma ) V^\top = (AV)V^\top = AI = A. \]

Algorithm: Computing SVD
  1. Compute eigendecomposition of \(A^\top A\):
    Compute the eigenvalues and eigenvectors of \(A^\top A\).
    Note: In practice, avoid directly computing \(A^\top A\) to prevent numerical errors.
  2. Construct \(V\) and \(\Sigma\):
    Arrange eigenvalues in descending order.
    Use the corresponding unit eigenvectors as the columns of \(V\).
  3. Construct \(U\):
    For nonzero singular values, compute \(u_k = \frac{1}{\sigma_k}Av_k\).
    If \(r < m\) (not enough nonzero singular values), extend \(\{u_1, \ldots, u_r\}\) to an orthonormal basis for \(\mathbb{R}^m\). A code sketch of these steps follows this list.
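Here is a sketch of these steps (assuming NumPy; the \(2 \times 3\) matrix is an arbitrary full-row-rank example). Forming \(A^\top A\) is shown only to mirror the algorithm; in practice one calls `np.linalg.svd` directly:

```python
import numpy as np

A = np.array([[4.0, 11.0, 14.0],
              [8.0,  7.0, -2.0]])

# Steps 1-2: eigendecomposition of A^T A, sorted in descending order.
eigvals, V = np.linalg.eigh(A.T @ A)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]
sigmas = np.sqrt(np.clip(eigvals, 0.0, None))     # singular values

# Step 3: columns of U for the nonzero singular values (u_k = A v_k / sigma_k).
r = int(np.sum(sigmas > 1e-12))
U_r = (A @ V[:, :r]) / sigmas[:r]

Sigma = np.zeros(A.shape)
Sigma[:r, :r] = np.diag(sigmas[:r])

print(np.allclose(U_r @ Sigma[:r, :] @ V.T, A))   # True: reduced SVD reconstructs A
print(np.linalg.svd(A, compute_uv=False))         # same singular values: ~[18.97, 9.49]
```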

Beyond its use in matrix factorization, the SVD reveals important numerical properties of a matrix through its singular values. The condition number measures how sensitive the output \(Ax\) is to small changes (like floating-point errors) in the input \(x\).

Definition: Condition Number

For a matrix \(A \in \mathbb{R}^{m \times n}\) with rank \(r\), the condition number \(\kappa(A)\) is defined as the ratio of the largest to the smallest nonzero singular value: \[ \kappa(A) = \frac{\sigma_1}{\sigma_r} \] where \(\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0\) are the nonzero singular values.

If \(A\) is a square, invertible matrix, this simplifies to \(\kappa(A) = \sigma_1 / \sigma_n\).

A matrix with \(\kappa(A) \gg 1\) is called ill-conditioned: it is "almost singular," and using it to solve linear systems is numerically unstable. From a CS perspective, this is a critical numerical red flag. Solving \(Ax=b\) for an ill-conditioned \(A\) can lead to highly inaccurate results, as small rounding errors in \(b\) or \(A\) get magnified into large errors in the solution \(x\).
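A small sketch (assuming NumPy; the Hilbert matrix is a classic ill-conditioned example) computing \(\kappa(A)\) from the singular values and comparing with `np.linalg.cond`, which uses the 2-norm by default:

```python
import numpy as np

def condition_number(A: np.ndarray) -> float:
    """Ratio of the largest to the smallest nonzero singular value."""
    s = np.linalg.svd(A, compute_uv=False)
    nonzero = s[s > 1e-12 * s[0]]          # discard numerically zero singular values
    return float(nonzero[0] / nonzero[-1])

# The 8 x 8 Hilbert matrix H_ij = 1 / (i + j - 1) is famously ill-conditioned.
n = 8
H = 1.0 / (np.arange(1, n + 1)[:, None] + np.arange(n))

print(condition_number(H))                                  # ~1.5e10
print(np.isclose(condition_number(H), np.linalg.cond(H)))   # True
```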

Pseudo-inverse via SVD

The SVD provides the most numerically stable method for computing the Moore-Penrose pseudo-inverse. Unlike the normal equations approach \((A^\top A)^{-1}A^\top\), the SVD-based computation avoids squaring the condition number and naturally handles rank-deficient matrices.

The Moore-Penrose Pseudo-inverse (\(A^\dagger\)) is most stably defined and computed using the Singular Value Decomposition (SVD).

Given the SVD of \(A = U\Sigma V^\top\), where \(U\) is \(m \times m\), \(\Sigma\) is \(m \times n\), and \(V\) is \(n \times n\), the pseudo-inverse is: \[ A^\dagger = V\Sigma^\dagger U^\top \] Here, \(\Sigma^\dagger\) is an \(n \times m\) matrix formed by:

  1. Transposing \(\Sigma\) to get \(\Sigma^\top\) (which is \(n \times m\)).
  2. Inverting every nonzero singular value \(\sigma_i\) on the diagonal to get \(1/\sigma_i\).

If \(\Sigma\) has diagonal entries \(\sigma_1, \dots, \sigma_r > 0\), then \(\Sigma^\dagger\) will have diagonal entries \(1/\sigma_1, \dots, 1/\sigma_r\). All other entries are zero. This definition has a beautiful geometric interpretation. Recall that \(\hat{x} = A^\dagger b\) is the minimum-norm least-squares solution.

When \(A\) is rank-deficient, the least-squares solution is not unique; the SVD handles this gracefully. It decomposes \(A\) into its fundamental subspaces: the pseudo-inverse essentially "inverts" the mapping from \(\text{Row}(A)\) to \(\text{Col}(A)\) and places no component in \(\text{Nul}(A)\), which is exactly what makes \(\hat{x} = A^\dagger b\) the minimum-norm solution.

You can also see that \[ A^\dagger A = V\Sigma^\dagger U^\top U\Sigma V^\top = V\Sigma^\dagger\Sigma V^\top. \] The matrix \(\Sigma^\dagger\Sigma\) is an \(n \times n\) diagonal matrix with 1s for the first \(r\) entries and 0s after. This makes \(A^\dagger A = V \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix} V^\top\), which is the orthogonal projection matrix onto the Row Space of \(A\). Similarly, \(A A^\dagger\) is the orthogonal projection matrix onto the Column Space of \(A\).
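The following sketch (assuming NumPy; `pinv_via_svd` is our own helper, and the rank-deficient matrix is an arbitrary example) builds \(A^\dagger = V\Sigma^\dagger U^\top\) and checks that \(A^\dagger A\) behaves like an orthogonal projection:

```python
import numpy as np

def pinv_via_svd(A: np.ndarray, tol: float = 1e-12) -> np.ndarray:
    """Moore-Penrose pseudo-inverse A^+ = V Sigma^+ U^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.zeros_like(s)
    nonzero = s > tol * s[0]
    s_inv[nonzero] = 1.0 / s[nonzero]       # invert only the nonzero singular values
    return Vt.T @ np.diag(s_inv) @ U.T

A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])                  # rank 1: second column = 2 * first

A_pinv = pinv_via_svd(A)
print(np.allclose(A_pinv, np.linalg.pinv(A)))    # True: matches NumPy's pinv

P_row = A_pinv @ A                          # should project onto Row(A)
print(np.allclose(P_row @ P_row, P_row))    # True: idempotent
print(np.allclose(P_row, P_row.T))          # True: symmetric, so an orthogonal projection
```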

Practical Note: Truncation and Regularization

In real-world applications, we rarely have singular values that are exactly zero. Instead, we have a spectrum of values, some of which are very small. \[ \sigma_1 \geq \sigma_2 \geq \dots \gg \dots \geq \sigma_n \approx 0. \] Inverting these tiny singular values (\(1/\sigma_i\)) would introduce massive instability, as we'd be amplifying components related to noise.

The practical computation of the pseudo-inverse involves setting a tolerance \(\epsilon\). We treat any \(\sigma_i < \epsilon\) as zero. \[ (\sigma_i^\dagger) = \begin{cases} 1/\sigma_i & \text{if } \sigma_i \geq \epsilon \\ 0 & \text{if } \sigma_i < \epsilon \end{cases} \] This truncated SVD is the basis for Principal Component Regression (PCR) and is a powerful form of regularization. It explicitly discards the "directions" (singular vectors) that contribute least to the transformation and are most likely to be noise, building a more robust, lower-rank model.
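A small illustration of why truncation matters (a sketch assuming NumPy; the data, noise levels, and tolerance are made up). The third column of \(A\) is almost a linear combination of the first two, so \(\sigma_3\) is tiny; inverting it amplifies the noise, while truncating it gives a stable solution:

```python
import numpy as np

rng = np.random.default_rng(1)

# 50 x 3 design matrix whose third column is nearly redundant.
X = rng.standard_normal((50, 2))
A = np.column_stack([X, X[:, 0] + X[:, 1] + 1e-8 * rng.standard_normal(50)])

b = A @ np.array([1.0, 2.0, 0.0]) + 1e-3 * rng.standard_normal(50)   # noisy targets

x_full = np.linalg.pinv(A) @ b                 # tiny sigma_3 is inverted: noise blows up
x_trunc = np.linalg.pinv(A, rcond=1e-6) @ b    # sigma_i < 1e-6 * sigma_1 treated as zero

print(np.round(x_full, 2))     # large, unstable coefficients
print(np.round(x_trunc, 2))    # modest, stable coefficients
```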

Interactive SVD Visualization

This interactive demo visualizes the Singular Value Decomposition process for \(2 \times 2\) matrices. SVD expresses any matrix \(A\) as a product \(A = U \Sigma V^\top \), which can be understood geometrically as a composition of three transformations:

  1. \(U\): Rotation/reflection using left singular vectors (orthogonal)
  2. \(\Sigma\): Scaling along the axes (singular values)
  3. \(V^\top\): Rotation/reflection using right singular vectors (orthogonal)

Use the visualization to see how each component of SVD contributes to the overall transformation.

Try different matrices to observe how SVD works with various transformations.
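As a numerical counterpart to the demo, here is a minimal sketch (assuming NumPy; the shear matrix is an arbitrary example) that factors a \(2 \times 2\) matrix and confirms that applying \(V^\top\), then \(\Sigma\), then \(U\) reproduces the original transformation:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])             # a 2 x 2 shear

U, s, Vt = np.linalg.svd(A)
print("V^T (rotation/reflection applied first):\n", Vt)
print("Sigma (scaling along the axes):\n", np.diag(s))
print("U (rotation/reflection applied last):\n", U)

x = np.array([2.0, -1.0])              # an arbitrary input vector
print(np.allclose(U @ (np.diag(s) @ (Vt @ x)), A @ x))   # True: U Sigma V^T x = A x
```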

SVD with Fundamental Subspaces

The SVD has a beautiful geometric interpretation: it provides orthonormal bases for all four fundamental subspaces of a matrix simultaneously. This reveals the deep connection between the singular value decomposition and the structure of linear transformations.

Theorem 5: SVD and Fundamental Subspaces

Let \(A\) be an \(m \times n\) matrix with SVD \(A = U \Sigma V^\top\), where \(A\) has rank \(r\). Let \(\{u_1, \ldots, u_m\}\) be the columns of \(U\) (left singular vectors) and \(\{v_1, \ldots, v_n\}\) be the columns of \(V\) (right singular vectors). Then:

  1. \(\{u_1, \ldots, u_r\}\) is an orthonormal basis for \(\text{Col }A\)
  2. \(\{u_{r+1}, \ldots, u_m\}\) is an orthonormal basis for \(\text{Nul }A^\top\)
  3. \(\{v_{r+1}, \ldots, v_n\}\) is an orthonormal basis for \(\text{Nul }A\)
  4. \(\{v_1, \ldots, v_r\}\) is an orthonormal basis for \(\text{Row }A\)

Proof:

Column Space:
From the construction of the SVD, \(\{Av_1, \ldots, Av_r\} = \{\sigma_1 u_1, \ldots, \sigma_r u_r\}\) is an orthogonal basis for \(\text{Col }A\). Since \(\sigma_i > 0\) for \(i \leq r\), \(\{u_1, \ldots, u_r\}\) is an orthonormal basis for \(\text{Col }A\).

Left Null Space:
By the orthogonality property \((\text{Col }A)^\perp = \text{Nul }A^\top\): the remaining columns \(\{u_{r+1}, \ldots, u_m\}\) are orthogonal to \(\{u_1, \ldots, u_r\}\), which spans \(\text{Col }A\), so they lie in \(\text{Nul }A^\top\). Since this orthonormal set has \(m - r\) vectors and \(\dim(\text{Nul }A^\top) = m - r\), it is an orthonormal basis for the left null space \(\text{Nul }A^\top\).

Null Space:
For \(i > r\), \(\|Av_i\| = \sigma_i = 0\), so \(Av_i = 0\) and \(v_i \in \text{Nul }A\). The set \(\{v_{r+1}, \ldots, v_n\}\) is orthonormal and contains \(n - r\) vectors. Since \(\dim(\text{Nul }A) = n - r\), this set is an orthonormal basis for \(\text{Nul }A\).

Row Space:
Since \((\text{Nul }A)^\perp = \text{Col }A^\top = \text{Row }A\), the first \(r\) columns of \(V\), \(\{v_1, \ldots, v_r\}\), are orthogonal to \(\text{Nul }A\) and therefore lie in \(\text{Row }A\). Because they are orthonormal and \(\dim(\text{Row }A) = r\), they form an orthonormal basis for the row space of \(A\).
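A short sketch (assuming NumPy; the rank-1 matrix is an arbitrary example) that reads off orthonormal bases for all four fundamental subspaces from the SVD:

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [2.0, 4.0, 0.0]])        # m = 2, n = 3, rank 1

U, s, Vt = np.linalg.svd(A)
tol = max(A.shape) * np.finfo(float).eps * s[0]
r = int(np.sum(s > tol))               # numerical rank

col_basis       = U[:, :r]             # orthonormal basis for Col A
left_null_basis = U[:, r:]             # orthonormal basis for Nul A^T
row_basis       = Vt[:r, :].T          # orthonormal basis for Row A
null_basis      = Vt[r:, :].T          # orthonormal basis for Nul A

print(r)                                        # 1
print(np.allclose(A @ null_basis, 0.0))         # True: Nul A maps to zero
print(np.allclose(A.T @ left_null_basis, 0.0))  # True: Nul A^T is annihilated by A^T
```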

Geometric Insight:

The SVD reveals that any linear transformation can be understood as acting on orthogonal subspaces: the row space maps via scaling by singular values to the column space, while the null space maps to zero. The transformation is entirely characterized by how it acts on these complementary subspaces.

Connection to the Invertible Matrix Theorem

When \(A\) is an \(n \times n\) invertible matrix, the SVD characterization of fundamental subspaces provides additional equivalent conditions for invertibility.

Additional Invertible Matrix Conditions

Let \(A\) be an \(n \times n\) matrix. Then \(A\) is invertible if and only if any of the following equivalent conditions hold:

  1. \((\text{Col }A)^\perp = \{0\}\)
  2. \((\text{Nul }A)^\perp = \mathbb{R}^n\)
  3. \(\text{Row }A = \mathbb{R}^n\)
  4. \(A\) has \(n\) nonzero singular values

These conditions extend the Invertible Matrix Theorem from earlier sections.

The equivalence follows from the fact that if \(A\) is invertible, then \(\text{rank}(A) = n\), which means \(r = n\). Therefore, \(\text{Nul }A = \{0\}\) (there are no vectors \(v_{r+1}, \ldots, v_n\)), \(\text{Row }A = \mathbb{R}^n\) (the vectors \(v_1, \ldots, v_n\) span the row space), and all \(n\) singular values are nonzero. Conversely, if any of these conditions holds, it forces \(r = n\), making \(A\) invertible.
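A final sketch (assuming NumPy; `is_invertible` is our own helper and the tolerance is a judgment call) that tests condition 4 by counting the nonzero singular values:

```python
import numpy as np

def is_invertible(A: np.ndarray, tol: float = 1e-12) -> bool:
    """A square matrix is invertible iff it has n nonzero singular values."""
    m, n = A.shape
    if m != n:
        return False
    s = np.linalg.svd(A, compute_uv=False)
    return bool(np.all(s > tol * s[0]))

A = np.array([[4.0, 1.0, 1.0],
              [1.0, 4.0, 1.0],
              [1.0, 1.0, 4.0]])
B = np.array([[1.0, 2.0],
              [2.0, 4.0]])             # rank 1, hence singular

print(is_invertible(A))                # True: singular values 6, 3, 3 are all nonzero
print(is_invertible(B))                # False: one singular value is (numerically) zero
```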