Covariance

Covariance Matrix

In Part 2, we introduced the variance as a measure of how a single random variable deviates from its mean. However, real-world phenomena rarely involve just one variable. In machine learning, a data point is typically a vector of features, and understanding how these features relate to one another is essential for tasks ranging from dimensionality reduction to generative modeling. This motivates the transition from univariate to multivariate statistics, which begins with a systematic way to measure how pairs of random variables vary together.

Definition: Covariance

The covariance between two random variables \(X\) and \(Y\) is defined as \[ \text{Cov}[X, Y] = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]. \]

Here, \((X - \mathbb{E}[X])\) and \((Y - \mathbb{E}[Y])\) are mean deviations, representing how far \(X\) and \(Y\) deviate from their respective means. The covariance measures how these deviations vary together (joint variation).

Note: A useful computational identity is \[ \text{Cov}[X, Y] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]. \] If \(X\) and \(Y\) are independent, then \[ \mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y], \], so \[ \text{Cov}[X, Y] = 0. \] However, the converse is not true in general: zero covariance indicates only the absence of linear dependence. A positive (negative) covariance indicates a positive (negative) "linear" association between \(X\) and \(Y\).

In practice, we extend this idea to \(n\) random variables. Consider a vector \(\boldsymbol{x} \in \mathbb{R}^n\) whose each entries represent random variables \(X_1, \ldots X_n\).

Definition: Population Covariance Matrix

The population covariance matrix of a random vector \(\boldsymbol{x} = (X_1, \ldots, X_n)^\top\) with mean \(\boldsymbol{\mu} = \mathbb{E}[\boldsymbol{x}]\) is \[ \begin{align*} \Sigma &= \text{Cov}[\boldsymbol{x}] \\\\ &= \mathbb{E}[(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])^\top] \\\\ &= \begin{bmatrix} \text{Var}[X_1] & \text{Cov}[X_1, X_2] & \cdots & \text{Cov}[X_1, X_n] \\ \text{Cov}[X_2, X_1] & \text{Var}[X_2] & \cdots & \text{Cov}[X_2, X_n] \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}[X_n, X_1] & \text{Cov}[X_n, X_2] & \cdots & \text{Var}[X_n] \end{bmatrix} \\\\ &= \mathbb{E}[\boldsymbol{x}\boldsymbol{x}^\top] - \boldsymbol{\mu}\boldsymbol{\mu}^\top. \end{align*} \]

Each diagonal entry \(\Sigma_{ii}\) is the variance of \(X_i\), since \[ \text{Cov}[X_i, X_i] = \mathbb{E}[(X_i - \mathbb{E}[X_i])^2] = \sigma_i^2. \] The total variance is defined as the trace of \(\Sigma\): \[ \text{tr}(\Sigma) = \sum_{i=1}^n \text{Var}[X_i]. \]

The covariance matrix has several important structural properties that connect probability theory to linear algebra.

First, \(\Sigma\) is symmetric because covariance itself is symmetric: \[ \Sigma_{ij} = \text{Cov}[X_i, X_j] = \text{Cov}[X_j, X_i] = \Sigma_{ji}. \] Equivalently, writing \(\Sigma = \mathbb{E}[(\boldsymbol{x} - \boldsymbol{\mu})(\boldsymbol{x} - \boldsymbol{\mu})^\top]\), symmetry follows from the general fact that \((AA^\top)^\top = (A^\top)^\top A^\top = AA^\top \).

Since \(\Sigma\) is symmetric, by the spectral theorem, it is orthogonally diagonalizable: \[ \Sigma = P D P^\top \] where \(P\) is an orthogonal matrix whose columns are unit eigenvectors of \(\Sigma\), and \(D\) is a diagonal matrix of the corresponding eigenvalues. In particular, because the trace is invariant under orthogonal similarity, the total variance equals the sum of eigenvalues: \[ \text{tr}(\Sigma) = \sum_{i=1}^n \text{Var}[X_i] = \text{tr}(D) = \sum_{i=1}^n \lambda_i. \] Note that the individual eigenvalues do not correspond to individual variances \(\text{Var}[X_i]\); only the totals agree.

Theorem 1: Positive Semi-Definiteness of the Covariance Matrix

Every covariance matrix \(\Sigma\) is positive semi-definite. That is, for any \(\boldsymbol{v} \in \mathbb{R}^n\), \[ \boldsymbol{v}^\top \Sigma\, \boldsymbol{v} \geq 0. \]

Proof:

For any \(\boldsymbol{v} \in \mathbb{R}^n\), \[ \begin{align*} \boldsymbol{v}^\top \Sigma\, \boldsymbol{v} &= \boldsymbol{v}^\top \mathbb{E}[(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])^\top]\, \boldsymbol{v} \\\\ &= \mathbb{E}[\boldsymbol{v}^\top (\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])^\top \boldsymbol{v}] \\\\ &= \mathbb{E} \left[(\boldsymbol{v}^\top (\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}]))^2\right] \geq 0. \end{align*} \] The last expression is the expectation of a squared real number, which is always non-negative.

As a consequence, the eigenvalues of \(\Sigma\) satisfy \[ \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0. \] This non-negativity of eigenvalues is precisely what makes the spectral decomposition of \(\Sigma\) meaningful for data analysis, as we shall see next in Principal Component Analysis.

SideNote: \(\mathbb{E }[\boldsymbol{x}\boldsymbol{x}^\top]\) is called the autocorrelation matrix denoted as \(R_{\boldsymbol{x}\boldsymbol{x}}\): \[ \begin{align*} R_{\boldsymbol{x}\boldsymbol{x}} &= \mathbb{E }[\\boldsymbol{x}\boldsymbol{x}^\top] \\\\ &= \begin{bmatrix} \mathbb{E }[X_1^2] & \mathbb{E }[X_1 X_2] & \cdots & \mathbb{E }[X_1 X_n] \\ \mathbb{E }[X_2 X_1] & \mathbb{E }[X_2^2] & \cdots & \mathbb{E }[X_2 X_n] \\ \vdots & \vdots & \ddots & \vdots \\ \mathbb{E }[X_n X_1] & \mathbb{E }[X_n X_2] & \cdots & \mathbb{E }[X_n^2]. \end{bmatrix} \\\\ \end{align*} \]

Principal Component Analysis (PCA)

The previous section established that every covariance matrix admits an orthogonal eigendecomposition with non-negative eigenvalues. This spectral structure has a direct and powerful interpretation. The eigenvectors identify the directions along which the data varies the most, and the eigenvalues quantify how much variance lies along each direction. Exploiting this insight leads to Principal Component Analysis (PCA), one of the most widely used techniques in data science for dimensionality reduction.

The orthogonal diagonalization of the covariance matrix \(\Sigma\) is important in statistics and machine learning for analyzing data and reducing dimensionality: \[ \Sigma = P D P^\top \] Here, the column vectors of \(P\) (unit eigenvectors of \(\Sigma \)) are called the principal components(PCs) of the data, and the diagonal entries of \(D\) (eigenvalues of \(\Sigma \)) represent the variances along these principal components. Each eigenvalue indicates "how much" of the total variance of \(\Sigma \) is captured by its corresponding principal component.

The first principal component(PC1) corresponds to the largest eigenvalue and represents the direction in the data space where the distribution varies the most(= the direction of maximizing the variance in data). The second principal component(PC2) captures the next largest variance and is orthogonal to PC1. Similarly, each subsequent PC capturing less variance and maintaining orthogonality to all previous PCs.

For example, suppose you have an observed data matrix \(X \in \mathbb{R}^{m \times 10}\), which contains \(m\) data points, each represented as a vector \(\boldsymbol{x}_i \in \mathbb{R}^{10} \). Computing the sample covariance matrix and diagonalizing it: \[ S = \frac{1}{m-1}(X-\bar{X})^\top (X-\bar{X}) = PDP^\top, \] you examine the eigenvalues to determine how much variance each principal component captures comparing with the total variance of \(S\).

Let's say PC1: 40%, PC2: 30%, and PC3: 25%. Then, you project the data onto the subspace spanned by the first 3 most significant principal components. Even though you discarded the remaining 7 dimensions (noise dimensions), you still retain the most significant patterns(trends) in the data. \[ \boldsymbol{x}_i \in \mathbb{R}^{10} \to \boldsymbol{z}_i \in \mathbb{R}^3 \] The vector \(z_i\) is known as the latent vector.

This dimensionality reduction process is called principal component analysis(PCA). By reducing dimensions, PCA enables efficient analysis of large datasets while retaining their most meaningful structure. PCA is widely used in machine learning, image processing, and exploratory data analysis for dimensionality reduction and noise filtering.

PCA with Singular Value Decomposition(SVD)

In the derivation above, PCA was formulated through the eigendecomposition of the covariance matrix \(S = PDP^\top\). While conceptually clean, this approach requires explicitly forming \(X^\top X\), which can be expensive for high-dimensional data and may suffer from numerical instability. The singular value decomposition (SVD) provides an elegant alternative that avoids these issues while yielding the same principal components.

Given a data matrix in mean-deviation form \(X \in \mathbb{R}^{m \times n}\) where \(m\) is the number of data points and \(n\) is the number of features, the matrix \(X^\top X\) is the covariance matrix.

Note: in PCA, it is common to use \(X^\top X \in \mathbb{R}^{n \times n}\) rather than \(XX^\top \in \mathbb{R}^{m \times m}\) because we are typically more interested in understanding the relationships among the \(n\) "features" than among \(m\) data points.

Using the SVD of \(X\), we compute the covariance matrix \(X^\top X\) : \[ \begin{align*} X^\top X &= (U\Sigma V^\top )^\top (U\Sigma V^\top) \\\\ &= V\Sigma^\top U^\top U\Sigma V^\top \\\\ &= V\Sigma^2V^\top \end{align*} \] This result is equivalent to the eigendecomposition \(X^\top X = PDP^\top\). \(\Sigma^2\) contains the eigenvalues \(\lambda_i\) of \(X^\top X\), with \(\sigma_i ^2 = \lambda_i\). Also, \(V\) (right singular vectors) corresponds to the eigenvectors (principal components) of \(X^\top X\).

Thus, the singular values \(\sigma_i\) represent the standard deviations along the principal components, and the squared singular values \(\sigma_i^2\) represent the variances.

SVD offers several numerical advantages for PCA. For high-dimensional data, computing \(X^\top X\) would be expensive or memory-intensive, but SVD can avoid explicitly forming \(X^\top X\) working directly with \(X\), which allows direct access to the principal components and variances.

                        import numpy as np

                        # Random data matrix ( m data points with n features)
                        def generate_data(m, n):
                            data = np.random.randn(m, n) 

                            # Make some correlations 
                            data[:, 2] = 0.7 * data[:, 0] + 0.5 * data[:, 2]
                            data[:, 3] = 0.2 * data[:, 0] + 0.5 * data[:, 1] 
                            data[:, 4] = -0.3 * data[:, 1] + 0.2 * data[:, 2]
                            data[:, 5] = 0.4 * data[:, 0] + 0.1 * data[:, 1]
                            data[:, 6] = 0.8 * data[:, 3] + -0.3 * data[:, 2]

                            # Mean-deviation form 
                            return (data - np.mean(data, axis=0) )

                        # PCA with covariance matrix (As refference)
                        def pca_via_covariance(data):
                            
                            # "Sample" covariance matrix 
                            # Note: Dividing by (m-1) provides "unbiased" estimate of the population covariance. 
                            cov_matrix = np.dot(data.T, data) / (data.shape[0] - 1)

                            # Eigendecomposition 
                            # np.linalg.eigh() is for symmetric matrix (real, diagonaizable), better than np.linalg.eig() 
                            eigvals, eigvecs = np.linalg.eigh(cov_matrix) 

                            #  Make eigenvalues & eigenvectors in descending order
                            idx = np.argsort(eigvals)[::-1] 
                            eigvals = eigvals[idx]
                            eigvecs = eigvecs[:, idx] 

                            # Each variance vs total variance
                            ratio = eigvals / np.sum(eigvals)

                            return eigvals, eigvecs, ratio

                        # PCA with SVD (we use this function)
                        def pca_with_svd(data):
                            
                            # Singular Value Decomposition (we don't need the matrix U: use "_" )
                            _, S, vt = np.linalg.svd(data, full_matrices=False)
                            
                            # Get eigenvalues via singular values: lambda_i = (S_i)^2  / m - 1
                            eigvals = (S ** 2) / (data.shape[0] - 1)
                            
                            # Each variance vs total variance
                            ratio = eigvals / np.sum(eigvals)
                            
                            return eigvals, vt.T, ratio 

                        # Set 90% threshold for the number of PCs
                        def threshold(var_ratio, t):
                            
                            # Compute cumulative variance
                            cum_variance = np.cumsum(var_ratio)
                            
                            # Find the number of components for 90% variance retention
                            num_pcs = np.argmax(cum_variance >= t) + 1
                            return num_pcs

                        # Dimension of data 
                        m = 1000000 
                        n = 10 

                        # Run PCA
                        eigvals, eigvecs, ratio = pca_with_svd(generate_data(m, n))

                        # Threshold for variance retention
                        t = 0.95
                        num_pcs = threshold(ratio, t)

                        # Print results
                        print("Eigenvalues:")
                        for i, val in enumerate(eigvals):
                            print(f"  Lambda {i + 1}: {val:.6f}")

                        print("\nExplained Variance Ratio (%):")
                        for i, var in enumerate(ratio):
                            print(f"  PC{i + 1}: {var * 100:.2f}%")

                        print(f"\nTo retain {t * 100:.0f}% of variance, use the first {num_pcs} PCs.")

Connections to Machine Learning

The covariance matrix and PCA are foundational across modern machine learning. In kernel PCA, the same eigendecomposition idea is applied in a feature space induced by a kernel, enabling nonlinear dimensionality reduction.

In multivariate Gaussian models, the covariance matrix \(\Sigma\) fully determines the shape and orientation of the distribution, and its inverse (the precision matrix) encodes conditional independence structure. In Gaussian processes, the covariance function generalizes \(\Sigma\) to infinite-dimensional function spaces.

While covariance captures the magnitude and direction of linear association, its values depend on the scales of the variables involved. In the next part, we introduce the correlation coefficient and correlation matrix, which standardize covariance to produce "scale-free" measures of linear relationship.

Loading...

Covariance Matrix

Principal Component Analysis (PCA)

PCA with Singular Value Decomposition(SVD)

Connections to Machine Learning