Covariance

Covariance Matrix Principal Component Analysis (PCA) PCA with Singular Value Decomposition(SVD)

Covariance Matrix

Earlier, we introduced the variance as a measure of how a single random variable deviates from its mean. However, real-world phenomena rarely involve just one variable. In machine learning, a data point is typically a vector of features, and understanding how these features relate to one another is essential for tasks ranging from dimensionality reduction to generative modeling. This motivates the transition from univariate to multivariate statistics, which begins with a systematic way to measure how pairs of random variables vary together.

Definition: Covariance

The covariance between two random variables \(X\) and \(Y\) is defined as \[ \operatorname{Cov}[X, Y] = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]. \]

Here, \((X - \mathbb{E}[X])\) and \((Y - \mathbb{E}[Y])\) are mean deviations, representing how far \(X\) and \(Y\) deviate from their respective means. The covariance measures how these deviations vary together (joint variation).

To work with two random variables together, we need a way to describe their joint behavior. The individual distributions of \(X\) and \(Y\) — each described by its own pmf or pdf — are not enough: two random variables may have identical individual distributions while being either independent or perfectly correlated. The required object is the joint distribution, which packages how \(X\) and \(Y\) take values together.

Definition: Joint Probability Mass Function

Let \(X\) and \(Y\) be discrete random variables on a common probability space, taking values in countable sets \(\mathcal{X}, \mathcal{Y} \subseteq \mathbb{R}\). The joint probability mass function is \[ p_{X,Y}(x, y) = P(X = x,\ Y = y), \qquad (x, y) \in \mathcal{X} \times \mathcal{Y}, \] with \(p_{X,Y}(x,y) \geq 0\) and \(\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y) = 1\). The marginal pmfs of \(X\) and \(Y\) are recovered by summing out the other variable: \[ p_X(x) = \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y), \qquad p_Y(y) = \sum_{x \in \mathcal{X}} p_{X,Y}(x, y). \]

Definition: Joint Probability Density Function

Random variables \(X\) and \(Y\) on a common probability space are said to be jointly continuous if there exists a non-negative function \(f_{X,Y} : \mathbb{R}^2 \to [0, \infty)\), called the joint probability density function, such that for every reasonable region \(B \subseteq \mathbb{R}^2\), \[ P\big((X, Y) \in B\big) = \iint_B f_{X,Y}(x, y)\, dx\, dy, \] with \(\iint_{\mathbb{R}^2} f_{X,Y}(x, y)\, dx\, dy = 1\). The marginal densities are recovered by integrating out the other variable: \[ f_X(x) = \int_{\mathbb{R}} f_{X,Y}(x, y)\, dy, \qquad f_Y(y) = \int_{\mathbb{R}} f_{X,Y}(x, y)\, dx. \] The precise meaning of "reasonable region" — namely, Borel subsets of \(\mathbb{R}^2\) — and the rigorous justification of these identities require the language of product measures, developed in our later treatment of measure-theoretic probability.

The expectation of a function \(g(X, Y)\) of two jointly distributed random variables is, accordingly, \[ \mathbb{E}[g(X, Y)] = \sum_{x, y} g(x, y)\, p_{X,Y}(x, y) \quad \text{(discrete)}, \qquad \mathbb{E}[g(X, Y)] = \iint_{\mathbb{R}^2} g(x, y)\, f_{X,Y}(x, y)\, dx\, dy \quad \text{(continuous)}, \] provided the sum or integral converges absolutely.

Definition: Independence of Random Variables (elementary)

Two random variables \(X\) and \(Y\) on a common probability space are independent if their joint distribution factors into the product of the marginal distributions:

  1. If \(X, Y\) are jointly discrete: \(\;p_{X,Y}(x, y) = p_X(x)\, p_Y(y)\) for all \((x, y)\).
  2. If \(X, Y\) are jointly continuous: \(\;f_{X,Y}(x, y) = f_X(x)\, f_Y(y)\) for all \((x, y)\).

This is the random-variable analog of independence of events: if \(X\) and \(Y\) take values in \(\{0, 1\}\) recording whether two events \(A\) and \(B\) occur, the two definitions coincide with independence of events from the foundations chapter. A fully rigorous formulation in terms of \(\sigma\)-algebras is developed in our later treatment of measure-theoretic probability.

Before deriving the standard computational identity for covariance, we record the bivariate extension of linearity of expectation, which goes beyond the scalar version proved earlier (a single random variable scaled and shifted by constants).

Theorem: Bivariate Linearity of Expectation

Let \(X\) and \(Y\) be random variables defined on a common probability space, and let \(a, b \in \mathbb{R}\) be constants. Then \[ \mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y], \] provided each individual expectation exists. The result does not require \(X\) and \(Y\) to be independent.

Proof Sketch:

For jointly continuous \((X, Y)\) with joint density \(f_{X,Y}(x, y)\), expanding the definition of expectation gives \[ \mathbb{E}[aX + bY] = \iint_{\mathbb{R}^2} (a x + b y)\, f_{X,Y}(x, y)\, dx\, dy. \] Two operations are then performed. First, by linearity of the integral, the integrand splits into two pieces: \[ \iint a x\, f_{X,Y}(x, y)\, dx\, dy + \iint b y\, f_{X,Y}(x, y)\, dx\, dy. \] Second, we write each double integral as an iterated integral with the appropriate inner variable, so that the inner integral of \(f_{X,Y}\) recovers a marginal density. For the first piece, integrating out \(y\) inside gives \(f_X(x) = \int f_{X,Y}(x, y)\, dy\); for the second, integrating out \(x\) inside gives \(f_Y(y) = \int f_{X,Y}(x, y)\, dx\). Thus \[ \mathbb{E}[aX + bY] = a \int x\, f_X(x)\, dx + b \int y\, f_Y(y)\, dy = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]. \] The discrete case is analogous, replacing integrals by sums and densities by joint and marginal mass functions. Independence is not invoked — only the existence of the joint distribution and the ability to recover marginals from it. The substantive point is that the double integral may be written as an iterated integral in either order: this is Fubini's theorem, valid here under the existence-of-expectation hypothesis (which guarantees absolute integrability). A fully rigorous treatment of Fubini's theorem requires the machinery of product measures, which we develop in our later treatment of measure-theoretic probability.

With bivariate linearity in hand, we obtain a computational identity for covariance that is often more convenient than the definition.

Proposition: Computational Identity for Covariance

For any random variables \(X\) and \(Y\) with finite second moments, \[ \operatorname{Cov}[X, Y] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]. \]

Proof:

Let \(\mu_X = \mathbb{E}[X]\) and \(\mu_Y = \mathbb{E}[Y]\) (constants). Expanding the product inside the definition, \[ (X - \mu_X)(Y - \mu_Y) = XY - \mu_X Y - X \mu_Y + \mu_X \mu_Y. \] Apply linearity of expectation term-by-term — bivariate linearity (extended to four terms by repeated application) splits the sum, and the scalar identity \(\mathbb{E}[cZ] = c\,\mathbb{E}[Z]\) handles the constant factors \(\mu_X\), \(\mu_Y\): \[ \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \mathbb{E}[XY] - \mu_X \mathbb{E}[Y] - \mathbb{E}[X] \mu_Y + \mu_X \mu_Y. \] Substituting back \(\mu_X = \mathbb{E}[X]\) and \(\mu_Y = \mathbb{E}[Y]\) into the last three terms gives \[ - \mathbb{E}[X]\,\mathbb{E}[Y] - \mathbb{E}[X]\,\mathbb{E}[Y] + \mathbb{E}[X]\,\mathbb{E}[Y] = -\mathbb{E}[X]\,\mathbb{E}[Y], \] which yields the claimed identity \(\operatorname{Cov}[X, Y] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]\).

Independence of random variables has a powerful consequence for expectations of products: it lets the expectation factor across the two variables.

Theorem: Product Expectation Under Independence

Let \(X\) and \(Y\) be independent random variables on a common probability space, with finite expectations. Then \(XY\) has finite expectation, and \[ \mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]. \]

Proof Sketch:

For jointly continuous \((X, Y)\), independence means the joint density factors as \(f_{X,Y}(x, y) = f_X(x)\, f_Y(y)\). Substituting into the expectation formula, \[ \mathbb{E}[XY] = \iint_{\mathbb{R}^2} x y\, f_{X,Y}(x, y)\, dx\, dy = \iint_{\mathbb{R}^2} \big(x f_X(x)\big)\big(y f_Y(y)\big)\, dx\, dy. \] Writing the double integral as an iterated integral (Fubini's theorem under the absolute-integrability hypothesis), the integrand factors and the two integrals separate: \[ \mathbb{E}[XY] = \left( \int_{\mathbb{R}} x\, f_X(x)\, dx \right) \left( \int_{\mathbb{R}} y\, f_Y(y)\, dy \right) = \mathbb{E}[X]\, \mathbb{E}[Y]. \] For jointly discrete \((X, Y)\), the same argument applies with sums in place of integrals: \(p_{X,Y}(x, y) = p_X(x)\,p_Y(y)\) gives \(\mathbb{E}[XY] = \sum_{x, y} xy\, p_X(x)\,p_Y(y) = \big(\sum_x x\,p_X(x)\big)\big(\sum_y y\,p_Y(y)\big) = \mathbb{E}[X]\,\mathbb{E}[Y]\). The general case (without joint density or pmf) is treated in measure-theoretic probability via product measures.

Corollary: Independence Implies Zero Covariance

If \(X\) and \(Y\) are independent random variables with finite second moments, then \(\operatorname{Cov}[X, Y] = 0\).

Proof:

Combine the Computational Identity with the Product Expectation Theorem: \(\operatorname{Cov}[X, Y] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y] = \mathbb{E}[X]\,\mathbb{E}[Y] - \mathbb{E}[X]\,\mathbb{E}[Y] = 0\).

The converse is not true in general: zero covariance indicates only the absence of linear dependence, not full independence. A standard counter-example is \(X \sim \mathcal{N}(0, 1)\) and \(Y = X^2\). The standard normal density \(\varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\) is an even function (\(\varphi(-x) = \varphi(x)\)), and \(x^3\) is an odd function, so \(\mathbb{E}[X^3] = \int_{\mathbb{R}} x^3 \varphi(x)\, dx = 0\) by odd-times-even symmetry on \(\mathbb{R}\); together with \(\mathbb{E}[X] = 0\), this gives \(\operatorname{Cov}[X, Y] = \mathbb{E}[X^3] - \mathbb{E}[X]\,\mathbb{E}[X^2] = 0\). Yet \(Y\) is fully determined by \(X\), so \(X\) and \(Y\) are far from independent. A positive (resp. negative) covariance indicates a positive (resp. negative) linear association between \(X\) and \(Y\); the absence of such linear association is what zero covariance certifies.

In practice, we extend this idea to \(n\) random variables. Consider a vector \(\boldsymbol{x} \in \mathbb{R}^n\) whose entries represent random variables \(X_1, \ldots, X_n\).

Definition: Population Covariance Matrix

The population covariance matrix of a random vector \(\boldsymbol{x} = (X_1, \ldots, X_n)^\top\) with mean \(\boldsymbol{\mu} = \mathbb{E}[\boldsymbol{x}]\) is \[ \begin{align*} \Sigma &= \operatorname{Cov}[\boldsymbol{x}] \\\\ &= \mathbb{E}[(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])^\top] \\\\ &= \begin{bmatrix} \operatorname{Var}[X_1] & \operatorname{Cov}[X_1, X_2] & \cdots & \operatorname{Cov}[X_1, X_n] \\\\ \operatorname{Cov}[X_2, X_1] & \operatorname{Var}[X_2] & \cdots & \operatorname{Cov}[X_2, X_n] \\\\ \vdots & \vdots & \ddots & \vdots \\\\ \operatorname{Cov}[X_n, X_1] & \operatorname{Cov}[X_n, X_2] & \cdots & \operatorname{Var}[X_n] \end{bmatrix} \\\\ &= \mathbb{E}[\boldsymbol{x}\boldsymbol{x}^\top] - \boldsymbol{\mu}\boldsymbol{\mu}^\top. \end{align*} \]

Each diagonal entry \(\Sigma_{ii}\) is the variance of \(X_i\), since \[ \operatorname{Cov}[X_i, X_i] = \mathbb{E}[(X_i - \mathbb{E}[X_i])^2] = \sigma_i^2. \] The total variance is defined as the trace of \(\Sigma\): \[ \operatorname{tr}(\Sigma) = \sum_{i=1}^n \operatorname{Var}[X_i]. \]

The covariance matrix has several important structural properties that connect probability theory to linear algebra.

First, \(\Sigma\) is symmetric because covariance itself is symmetric: \[ \Sigma_{ij} = \operatorname{Cov}[X_i, X_j] = \operatorname{Cov}[X_j, X_i] = \Sigma_{ji}. \] Equivalently at the matrix level: for every realization of the random vector, the outer product \((\boldsymbol{x} - \boldsymbol{\mu})(\boldsymbol{x} - \boldsymbol{\mu})^\top\) is a symmetric matrix, and the expectation of a symmetric-matrix-valued random variable is again symmetric (since \(\mathbb{E}\) is taken entrywise and commutes with transposition).

Since \(\Sigma\) is symmetric, by the spectral theorem, it is orthogonally diagonalizable: \[ \Sigma = P D P^\top \] where \(P\) is an orthogonal matrix whose columns are unit eigenvectors of \(\Sigma\), and \(D\) is a diagonal matrix of the corresponding eigenvalues. In particular, because the trace is invariant under orthogonal similarity, the total variance equals the sum of eigenvalues: \[ \operatorname{tr}(\Sigma) = \sum_{i=1}^n \operatorname{Var}[X_i] = \operatorname{tr}(D) = \sum_{i=1}^n \lambda_i. \] Note that the individual eigenvalues do not correspond to individual variances \(\operatorname{Var}[X_i]\); only the totals agree.

Theorem: Positive Semi-Definiteness of the Covariance Matrix

Every covariance matrix \(\Sigma\) is positive semi-definite. That is, for any \(\boldsymbol{v} \in \mathbb{R}^n\), \[ \boldsymbol{v}^\top \Sigma\, \boldsymbol{v} \geq 0. \]

Proof:

For any \(\boldsymbol{v} \in \mathbb{R}^n\), \[ \begin{align*} \boldsymbol{v}^\top \Sigma\, \boldsymbol{v} &= \boldsymbol{v}^\top \mathbb{E}[(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])^\top]\, \boldsymbol{v} \\\\ &= \mathbb{E}[\boldsymbol{v}^\top (\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])^\top \boldsymbol{v}] \\\\ &= \mathbb{E} \left[(\boldsymbol{v}^\top (\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}]))^2\right] \geq 0. \end{align*} \] The last expression is the expectation of a squared real number, which is always non-negative.

As a consequence, the eigenvalues of \(\Sigma\) satisfy \[ \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0. \] This non-negativity of eigenvalues is precisely what makes the spectral decomposition of \(\Sigma\) meaningful for data analysis, as we shall see next in Principal Component Analysis.

SideNote: \(\mathbb{E}[\boldsymbol{x}\boldsymbol{x}^\top]\) is called the autocorrelation matrix denoted as \(R_{\boldsymbol{x}\boldsymbol{x}}\): \[ \begin{align*} R_{\boldsymbol{x}\boldsymbol{x}} &= \mathbb{E}[\boldsymbol{x}\boldsymbol{x}^\top] \\\\ &= \begin{bmatrix} \mathbb{E}[X_1^2] & \mathbb{E}[X_1 X_2] & \cdots & \mathbb{E}[X_1 X_n] \\\\ \mathbb{E}[X_2 X_1] & \mathbb{E}[X_2^2] & \cdots & \mathbb{E}[X_2 X_n] \\\\ \vdots & \vdots & \ddots & \vdots \\\\ \mathbb{E}[X_n X_1] & \mathbb{E}[X_n X_2] & \cdots & \mathbb{E}[X_n^2]. \end{bmatrix} \end{align*} \]

Principal Component Analysis (PCA)

The previous section established that every covariance matrix admits an orthogonal eigendecomposition with non-negative eigenvalues. This spectral structure has a direct and powerful interpretation. The eigenvectors identify the directions along which the data varies the most, and the eigenvalues quantify how much variance lies along each direction. Exploiting this insight leads to Principal Component Analysis (PCA), one of the most widely used techniques in data science for dimensionality reduction.

Before describing PCA, we establish the precise sense in which "eigenvalue equals variance along its eigenvector." This is what makes the spectral decomposition of \(\Sigma\) statistically meaningful in the first place.

Proposition: Variance Along a Direction

Let \(\boldsymbol{x}\) be a random vector in \(\mathbb{R}^n\) with covariance matrix \(\Sigma\). For any unit vector \(\boldsymbol{v} \in \mathbb{R}^n\), the scalar projection \(\boldsymbol{v}^\top \boldsymbol{x}\) is a real-valued random variable whose variance is \[ \operatorname{Var}[\boldsymbol{v}^\top \boldsymbol{x}] = \boldsymbol{v}^\top \Sigma\, \boldsymbol{v}. \] In particular, if \(\boldsymbol{v} = \boldsymbol{p}_i\) is a unit eigenvector of \(\Sigma\) with eigenvalue \(\lambda_i\), then \[ \operatorname{Var}[\boldsymbol{p}_i^\top \boldsymbol{x}] = \lambda_i. \]

Proof:

By repeated application of bivariate linearity of expectation (which extends to any finite linear combination by induction), the mean of the scalar \(\boldsymbol{v}^\top \boldsymbol{x} = \sum_i v_i X_i\) is \(\mathbb{E}[\boldsymbol{v}^\top \boldsymbol{x}] = \sum_i v_i \mathbb{E}[X_i] = \boldsymbol{v}^\top \boldsymbol{\mu}\) (a deterministic constant, since \(\boldsymbol{v}\) and \(\boldsymbol{\mu}\) are non-random). By the definition of variance, \[ \operatorname{Var}[\boldsymbol{v}^\top \boldsymbol{x}] = \mathbb{E}\!\left[\left(\boldsymbol{v}^\top \boldsymbol{x} - \boldsymbol{v}^\top \boldsymbol{\mu}\right)^2\right] = \mathbb{E}\!\left[\left(\boldsymbol{v}^\top (\boldsymbol{x} - \boldsymbol{\mu})\right)^2\right]. \] This is precisely the quantity computed in the proof of the positive semi-definiteness theorem above; carrying out the same algebraic manipulation gives \[ \mathbb{E}\!\left[\left(\boldsymbol{v}^\top (\boldsymbol{x} - \boldsymbol{\mu})\right)^2\right] = \boldsymbol{v}^\top \Sigma\, \boldsymbol{v}. \] For the eigenvector specialization, substituting \(\boldsymbol{v} = \boldsymbol{p}_i\) and using \(\Sigma \boldsymbol{p}_i = \lambda_i \boldsymbol{p}_i\) gives \(\boldsymbol{p}_i^\top \Sigma\, \boldsymbol{p}_i = \lambda_i\, \boldsymbol{p}_i^\top \boldsymbol{p}_i = \lambda_i\) since \(\boldsymbol{p}_i\) is a unit vector.

This proposition has an immediate optimization consequence. Expanding any unit vector \(\boldsymbol{v}\) in the orthonormal eigenbasis \(\{\boldsymbol{p}_1, \ldots, \boldsymbol{p}_n\}\) as \(\boldsymbol{v} = \sum_i c_i \boldsymbol{p}_i\) with \(\sum_i c_i^2 = 1\), a direct computation using \(\Sigma \boldsymbol{p}_i = \lambda_i \boldsymbol{p}_i\) and orthonormality yields \(\boldsymbol{v}^\top \Sigma\, \boldsymbol{v} = \sum_i c_i^2 \lambda_i\), a convex combination of the eigenvalues. Ordering \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0\), this combination is maximized by placing all weight on the largest eigenvalue: \[ \max_{\|\boldsymbol{v}\|=1} \operatorname{Var}[\boldsymbol{v}^\top \boldsymbol{x}] = \max_{\|\boldsymbol{v}\|=1} \boldsymbol{v}^\top \Sigma\, \boldsymbol{v} = \lambda_1, \] attained at \(\boldsymbol{v} = \boldsymbol{p}_1\). The variance-maximizing direction is the eigenvector of the largest eigenvalue.

The orthogonal diagonalization of the covariance matrix \(\Sigma\) is important in statistics and machine learning for analyzing data and reducing dimensionality: \[ \Sigma = P D P^\top \] Here, the column vectors of \(P\) (unit eigenvectors of \(\Sigma\)) are called the principal components (PCs) of the data, and the diagonal entries of \(D\) (eigenvalues of \(\Sigma\)) represent the variances along these principal components. Each eigenvalue indicates "how much" of the total variance of \(\Sigma\) is captured by its corresponding principal component.

The first principal component (PC1) corresponds to the largest eigenvalue and represents the direction in the data space where the distribution varies the most (the direction maximizing the variance of the data). The second principal component (PC2) captures the next largest variance and is orthogonal to PC1. Similarly, each subsequent PC captures less variance and maintains orthogonality to all previous PCs.

For example, suppose you have an observed data matrix \(X \in \mathbb{R}^{m \times 10}\), which contains \(m\) data points, each represented as a vector \(\boldsymbol{x}_i \in \mathbb{R}^{10}\). Computing the sample covariance matrix and diagonalizing it: \[ S = \frac{1}{m-1}(X-\bar{X})^\top (X-\bar{X}) = PDP^\top, \] where \(\bar{X} \in \mathbb{R}^{m \times 10}\) is the matrix whose every row equals the column-mean vector \(\bar{\boldsymbol{x}} = \frac{1}{m}\sum_{i=1}^m \boldsymbol{x}_i\). You then examine the eigenvalues to determine how much variance each principal component captures, comparing with the total variance of \(S\).

Let's say PC1: 40%, PC2: 30%, and PC3: 25%. Then, you project the data onto the subspace spanned by the first 3 most significant principal components. Even though you discarded the remaining 7 dimensions (which carry only a small fraction of the total variance, typically interpreted as noise or fine-scale variation), you still retain the most significant patterns (trends) in the data. \[ \boldsymbol{x}_i \in \mathbb{R}^{10} \to \boldsymbol{z}_i \in \mathbb{R}^3 \] The vector \(\boldsymbol{z}_i\) is known as the latent vector.

This dimensionality reduction process is called principal component analysis (PCA). By reducing dimensions, PCA enables efficient analysis of large datasets while retaining their most meaningful structure. PCA is widely used in machine learning, image processing, and exploratory data analysis for dimensionality reduction and noise filtering.

PCA with Singular Value Decomposition (SVD)

In the derivation above, PCA was formulated through the eigendecomposition of the sample covariance matrix \(S = PDP^\top\). While conceptually clean, this approach requires explicitly forming \(X^\top X\), which can be expensive for high-dimensional data and may suffer from numerical instability. The singular value decomposition (SVD) provides an elegant alternative that avoids these issues while yielding the same principal components.

Given a data matrix in mean-deviation form \(X \in \mathbb{R}^{m \times n}\) (so that \(\bar{X} = 0\), where \(m\) is the number of data points and \(n\) is the number of features), the sample covariance matrix simplifies to \[ S = \frac{1}{m-1} X^\top X. \] Note: in PCA, it is common to work with \(X^\top X \in \mathbb{R}^{n \times n}\) rather than \(XX^\top \in \mathbb{R}^{m \times m}\) because we are typically more interested in understanding the relationships among the \(n\) features than among the \(m\) data points.

Now apply the SVD to \(X\): write \(X = U \Sigma V^\top\), where \(U \in \mathbb{R}^{m \times m}\) and \(V \in \mathbb{R}^{n \times n}\) are orthogonal and \(\Sigma \in \mathbb{R}^{m \times n}\) is the rectangular diagonal matrix of singular values \(\sigma_1 \geq \sigma_2 \geq \cdots \geq 0\). (Here \(\Sigma\) denotes the SVD's singular-value matrix in the standard linear-algebra convention; it is unrelated to the population covariance matrix introduced earlier in this page, which does not appear in this section.) Substituting, \[ \begin{align*} X^\top X &= (U\Sigma V^\top)^\top (U\Sigma V^\top) \\\\ &= V\Sigma^\top U^\top U\Sigma V^\top \\\\ &= V\,(\Sigma^\top \Sigma)\,V^\top, \end{align*} \] using \(U^\top U = I_m\) (orthogonality of \(U\)). The product \(\Sigma^\top \Sigma\) is the \(n \times n\) square diagonal matrix \(\operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2)\). Dividing by \(m-1\) yields the sample covariance matrix in eigendecomposition form: \[ S = \frac{1}{m-1} X^\top X = V \cdot \operatorname{diag}\!\left(\frac{\sigma_1^2}{m-1}, \ldots, \frac{\sigma_n^2}{m-1}\right) \cdot V^\top. \] Comparing with the eigendecomposition \(S = PDP^\top\) from the previous section, and writing \(\boldsymbol{v}_i\) for the \(i\)-th column of \(V\) (the \(i\)-th right singular vector), we read off \[ P = V, \qquad \boldsymbol{p}_i = \boldsymbol{v}_i, \qquad \lambda_i = \frac{\sigma_i^2}{m-1}. \] The right singular vectors are therefore the principal components, and the eigenvalues of \(S\) are the squared singular values divided by \(m-1\).

To interpret these eigenvalues as variances, recall that the Variance Along a Direction proposition was stated for a random vector \(\boldsymbol{x}\) with population covariance \(\Sigma\). The proposition's algebraic identity \(\boldsymbol{v}^\top M \boldsymbol{v} = \mathbb{E}[(\boldsymbol{v}^\top (\boldsymbol{x} - \boldsymbol{\mu}))^2]\) does not depend on which matrix \(M\) we feed in: applied to the empirical distribution that places mass \(1/m\) on each observed data point, \(M = X^\top X / m\) gives the empirical variance along direction \(\boldsymbol{v}\); applied with the Bessel-corrected matrix \(M = S = X^\top X / (m-1)\), it gives the unbiased sample variance. Thus the eigenvalues of \(S\) are precisely the sample variances along the principal components: \[ \widehat{\operatorname{Var}}\!\left[\boldsymbol{v}_i^\top \boldsymbol{x}\right] = \frac{\sigma_i^2}{m-1}, \qquad \text{sample standard deviation along PC } i \;=\; \frac{\sigma_i}{\sqrt{m-1}}. \]

SVD offers several numerical advantages for PCA. For high-dimensional data, computing \(X^\top X\) explicitly would be expensive or memory-intensive, but SVD can be computed directly from \(X\), giving access to the principal components and variances without ever forming \(X^\top X\).

                        import numpy as np

                        # Random data matrix ( m data points with n features)
                        def generate_data(m, n):
                            data = np.random.randn(m, n) 

                            # Make some correlations 
                            data[:, 2] = 0.7 * data[:, 0] + 0.5 * data[:, 2]
                            data[:, 3] = 0.2 * data[:, 0] + 0.5 * data[:, 1] 
                            data[:, 4] = -0.3 * data[:, 1] + 0.2 * data[:, 2]
                            data[:, 5] = 0.4 * data[:, 0] + 0.1 * data[:, 1]
                            data[:, 6] = 0.8 * data[:, 3] + -0.3 * data[:, 2]

                            # Mean-deviation form 
                            return (data - np.mean(data, axis=0) )

                        # PCA with covariance matrix (As reference)
                        def pca_via_covariance(data):
                            
                            # "Sample" covariance matrix 
                            # Note: Dividing by (m-1) provides "unbiased" estimate of the population covariance. 
                            cov_matrix = np.dot(data.T, data) / (data.shape[0] - 1)

                            # Eigendecomposition 
                            # np.linalg.eigh() is for symmetric matrix (real, diagonaizable), better than np.linalg.eig() 
                            eigvals, eigvecs = np.linalg.eigh(cov_matrix) 

                            #  Make eigenvalues & eigenvectors in descending order
                            idx = np.argsort(eigvals)[::-1] 
                            eigvals = eigvals[idx]
                            eigvecs = eigvecs[:, idx] 

                            # Each variance vs total variance
                            ratio = eigvals / np.sum(eigvals)

                            return eigvals, eigvecs, ratio

                        # PCA with SVD (we use this function)
                        def pca_with_svd(data):
                            
                            # Singular Value Decomposition (we don't need the matrix U: use "_" )
                            _, S, vt = np.linalg.svd(data, full_matrices=False)
                            
                            # Get eigenvalues via singular values: lambda_i = (S_i)^2  / m - 1
                            eigvals = (S ** 2) / (data.shape[0] - 1)
                            
                            # Each variance vs total variance
                            ratio = eigvals / np.sum(eigvals)
                            
                            return eigvals, vt.T, ratio 

                        # Set 90% threshold for the number of PCs
                        def threshold(var_ratio, t):
                            
                            # Compute cumulative variance
                            cum_variance = np.cumsum(var_ratio)
                            
                            # Find the number of components for 90% variance retention
                            num_pcs = np.argmax(cum_variance >= t) + 1
                            return num_pcs

                        # Dimension of data 
                        m = 1000000 
                        n = 10 

                        # Run PCA
                        eigvals, eigvecs, ratio = pca_with_svd(generate_data(m, n))

                        # Threshold for variance retention
                        t = 0.95
                        num_pcs = threshold(ratio, t)

                        # Print results
                        print("Eigenvalues:")
                        for i, val in enumerate(eigvals):
                            print(f"  Lambda {i + 1}: {val:.6f}")

                        print("\nExplained Variance Ratio (%):")
                        for i, var in enumerate(ratio):
                            print(f"  PC{i + 1}: {var * 100:.2f}%")

                        print(f"\nTo retain {t * 100:.0f}% of variance, use the first {num_pcs} PCs.")
                            

Connections to Machine Learning

The covariance matrix and PCA are foundational across modern machine learning. In kernel PCA, the same eigendecomposition idea is applied in a feature space induced by a kernel, enabling nonlinear dimensionality reduction.

In multivariate Gaussian models, the covariance matrix \(\Sigma\) fully determines the shape and orientation of the distribution, and its inverse (the precision matrix) encodes conditional independence structure. In Gaussian processes, the covariance function generalizes \(\Sigma\) to infinite-dimensional function spaces.

While covariance captures the magnitude and direction of linear association, its values depend on the scales of the variables involved. In the next part, we introduce the correlation coefficient and correlation matrix, which standardize covariance to produce "scale-free" measures of linear relationship.