Covariance Matrix
Earlier, we introduced the variance as a measure of how a single random variable deviates from its mean. However, real-world phenomena rarely involve just one variable. In machine learning, a data point is typically a vector of features, and understanding how these features relate to one another is essential for tasks ranging from dimensionality reduction to generative modeling. This motivates the transition from univariate to multivariate statistics, which begins with a systematic way to measure how pairs of random variables vary together.
The covariance between two random variables \(X\) and \(Y\) is defined as \[ \operatorname{Cov}[X, Y] = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]. \]
Here, \((X - \mathbb{E}[X])\) and \((Y - \mathbb{E}[Y])\) are mean deviations, representing how far \(X\) and \(Y\) deviate from their respective means. The covariance measures how these deviations vary together (joint variation).
To work with two random variables together, we need a way to describe their joint behavior. The individual distributions of \(X\) and \(Y\) — each described by its own pmf or pdf — are not enough: two random variables may have identical individual distributions while being either independent or perfectly correlated. The required object is the joint distribution, which packages how \(X\) and \(Y\) take values together.
Let \(X\) and \(Y\) be discrete random variables on a common probability space, taking values in countable sets \(\mathcal{X}, \mathcal{Y} \subseteq \mathbb{R}\). The joint probability mass function is \[ p_{X,Y}(x, y) = P(X = x,\ Y = y), \qquad (x, y) \in \mathcal{X} \times \mathcal{Y}, \] with \(p_{X,Y}(x,y) \geq 0\) and \(\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y) = 1\). The marginal pmfs of \(X\) and \(Y\) are recovered by summing out the other variable: \[ p_X(x) = \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y), \qquad p_Y(y) = \sum_{x \in \mathcal{X}} p_{X,Y}(x, y). \]
Random variables \(X\) and \(Y\) on a common probability space are said to be jointly continuous if there exists a non-negative function \(f_{X,Y} : \mathbb{R}^2 \to [0, \infty)\), called the joint probability density function, such that for every reasonable region \(B \subseteq \mathbb{R}^2\), \[ P\big((X, Y) \in B\big) = \iint_B f_{X,Y}(x, y)\, dx\, dy, \] with \(\iint_{\mathbb{R}^2} f_{X,Y}(x, y)\, dx\, dy = 1\). The marginal densities are recovered by integrating out the other variable: \[ f_X(x) = \int_{\mathbb{R}} f_{X,Y}(x, y)\, dy, \qquad f_Y(y) = \int_{\mathbb{R}} f_{X,Y}(x, y)\, dx. \] The precise meaning of "reasonable region" — namely, Borel subsets of \(\mathbb{R}^2\) — and the rigorous justification of these identities require the language of product measures, developed in our later treatment of measure-theoretic probability.
The expectation of a function \(g(X, Y)\) of two jointly distributed random variables is, accordingly, \[ \mathbb{E}[g(X, Y)] = \sum_{x, y} g(x, y)\, p_{X,Y}(x, y) \quad \text{(discrete)}, \qquad \mathbb{E}[g(X, Y)] = \iint_{\mathbb{R}^2} g(x, y)\, f_{X,Y}(x, y)\, dx\, dy \quad \text{(continuous)}, \] provided the sum or integral converges absolutely.
Two random variables \(X\) and \(Y\) on a common probability space are independent if their joint distribution factors into the product of the marginal distributions:
- If \(X, Y\) are jointly discrete: \(\;p_{X,Y}(x, y) = p_X(x)\, p_Y(y)\) for all \((x, y)\).
- If \(X, Y\) are jointly continuous: \(\;f_{X,Y}(x, y) = f_X(x)\, f_Y(y)\) for all \((x, y)\).
This is the random-variable analog of independence of events: if \(X\) and \(Y\) take values in \(\{0, 1\}\) recording whether two events \(A\) and \(B\) occur, the two definitions coincide with independence of events from the foundations chapter. A fully rigorous formulation in terms of \(\sigma\)-algebras is developed in our later treatment of measure-theoretic probability.
Before deriving the standard computational identity for covariance, we record the bivariate extension of linearity of expectation, which goes beyond the scalar version proved earlier (a single random variable scaled and shifted by constants).
Let \(X\) and \(Y\) be random variables defined on a common probability space, and let \(a, b \in \mathbb{R}\) be constants. Then \[ \mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y], \] provided each individual expectation exists. The result does not require \(X\) and \(Y\) to be independent.
For jointly continuous \((X, Y)\) with joint density \(f_{X,Y}(x, y)\), expanding the definition of expectation gives \[ \mathbb{E}[aX + bY] = \iint_{\mathbb{R}^2} (a x + b y)\, f_{X,Y}(x, y)\, dx\, dy. \] Two operations are then performed. First, by linearity of the integral, the integrand splits into two pieces: \[ \iint a x\, f_{X,Y}(x, y)\, dx\, dy + \iint b y\, f_{X,Y}(x, y)\, dx\, dy. \] Second, we write each double integral as an iterated integral with the appropriate inner variable, so that the inner integral of \(f_{X,Y}\) recovers a marginal density. For the first piece, integrating out \(y\) inside gives \(f_X(x) = \int f_{X,Y}(x, y)\, dy\); for the second, integrating out \(x\) inside gives \(f_Y(y) = \int f_{X,Y}(x, y)\, dx\). Thus \[ \mathbb{E}[aX + bY] = a \int x\, f_X(x)\, dx + b \int y\, f_Y(y)\, dy = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]. \] The discrete case is analogous, replacing integrals by sums and densities by joint and marginal mass functions. Independence is not invoked — only the existence of the joint distribution and the ability to recover marginals from it. The substantive point is that the double integral may be written as an iterated integral in either order: this is Fubini's theorem, valid here under the existence-of-expectation hypothesis (which guarantees absolute integrability). A fully rigorous treatment of Fubini's theorem requires the machinery of product measures, which we develop in our later treatment of measure-theoretic probability.
With bivariate linearity in hand, we obtain a computational identity for covariance that is often more convenient than the definition.
For any random variables \(X\) and \(Y\) with finite second moments, \[ \operatorname{Cov}[X, Y] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]. \]
Let \(\mu_X = \mathbb{E}[X]\) and \(\mu_Y = \mathbb{E}[Y]\) (constants). Expanding the product inside the definition, \[ (X - \mu_X)(Y - \mu_Y) = XY - \mu_X Y - X \mu_Y + \mu_X \mu_Y. \] Apply linearity of expectation term-by-term — bivariate linearity (extended to four terms by repeated application) splits the sum, and the scalar identity \(\mathbb{E}[cZ] = c\,\mathbb{E}[Z]\) handles the constant factors \(\mu_X\), \(\mu_Y\): \[ \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \mathbb{E}[XY] - \mu_X \mathbb{E}[Y] - \mathbb{E}[X] \mu_Y + \mu_X \mu_Y. \] Substituting back \(\mu_X = \mathbb{E}[X]\) and \(\mu_Y = \mathbb{E}[Y]\) into the last three terms gives \[ - \mathbb{E}[X]\,\mathbb{E}[Y] - \mathbb{E}[X]\,\mathbb{E}[Y] + \mathbb{E}[X]\,\mathbb{E}[Y] = -\mathbb{E}[X]\,\mathbb{E}[Y], \] which yields the claimed identity \(\operatorname{Cov}[X, Y] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]\).
Independence of random variables has a powerful consequence for expectations of products: it lets the expectation factor across the two variables.
Let \(X\) and \(Y\) be independent random variables on a common probability space, with finite expectations. Then \(XY\) has finite expectation, and \[ \mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]. \]
For jointly continuous \((X, Y)\), independence means the joint density factors as \(f_{X,Y}(x, y) = f_X(x)\, f_Y(y)\). Substituting into the expectation formula, \[ \mathbb{E}[XY] = \iint_{\mathbb{R}^2} x y\, f_{X,Y}(x, y)\, dx\, dy = \iint_{\mathbb{R}^2} \big(x f_X(x)\big)\big(y f_Y(y)\big)\, dx\, dy. \] Writing the double integral as an iterated integral (Fubini's theorem under the absolute-integrability hypothesis), the integrand factors and the two integrals separate: \[ \mathbb{E}[XY] = \left( \int_{\mathbb{R}} x\, f_X(x)\, dx \right) \left( \int_{\mathbb{R}} y\, f_Y(y)\, dy \right) = \mathbb{E}[X]\, \mathbb{E}[Y]. \] For jointly discrete \((X, Y)\), the same argument applies with sums in place of integrals: \(p_{X,Y}(x, y) = p_X(x)\,p_Y(y)\) gives \(\mathbb{E}[XY] = \sum_{x, y} xy\, p_X(x)\,p_Y(y) = \big(\sum_x x\,p_X(x)\big)\big(\sum_y y\,p_Y(y)\big) = \mathbb{E}[X]\,\mathbb{E}[Y]\). The general case (without joint density or pmf) is treated in measure-theoretic probability via product measures.
If \(X\) and \(Y\) are independent random variables with finite second moments, then \(\operatorname{Cov}[X, Y] = 0\).
Combine the Computational Identity with the Product Expectation Theorem: \(\operatorname{Cov}[X, Y] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y] = \mathbb{E}[X]\,\mathbb{E}[Y] - \mathbb{E}[X]\,\mathbb{E}[Y] = 0\).
The converse is not true in general: zero covariance indicates only the absence of linear dependence, not full independence. A standard counter-example is \(X \sim \mathcal{N}(0, 1)\) and \(Y = X^2\). The standard normal density \(\varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\) is an even function (\(\varphi(-x) = \varphi(x)\)), and \(x^3\) is an odd function, so \(\mathbb{E}[X^3] = \int_{\mathbb{R}} x^3 \varphi(x)\, dx = 0\) by odd-times-even symmetry on \(\mathbb{R}\); together with \(\mathbb{E}[X] = 0\), this gives \(\operatorname{Cov}[X, Y] = \mathbb{E}[X^3] - \mathbb{E}[X]\,\mathbb{E}[X^2] = 0\). Yet \(Y\) is fully determined by \(X\), so \(X\) and \(Y\) are far from independent. A positive (resp. negative) covariance indicates a positive (resp. negative) linear association between \(X\) and \(Y\); the absence of such linear association is what zero covariance certifies.
In practice, we extend this idea to \(n\) random variables. Consider a vector \(\boldsymbol{x} \in \mathbb{R}^n\) whose entries represent random variables \(X_1, \ldots, X_n\).
The population covariance matrix of a random vector \(\boldsymbol{x} = (X_1, \ldots, X_n)^\top\) with mean \(\boldsymbol{\mu} = \mathbb{E}[\boldsymbol{x}]\) is \[ \begin{align*} \Sigma &= \operatorname{Cov}[\boldsymbol{x}] \\\\ &= \mathbb{E}[(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])^\top] \\\\ &= \begin{bmatrix} \operatorname{Var}[X_1] & \operatorname{Cov}[X_1, X_2] & \cdots & \operatorname{Cov}[X_1, X_n] \\\\ \operatorname{Cov}[X_2, X_1] & \operatorname{Var}[X_2] & \cdots & \operatorname{Cov}[X_2, X_n] \\\\ \vdots & \vdots & \ddots & \vdots \\\\ \operatorname{Cov}[X_n, X_1] & \operatorname{Cov}[X_n, X_2] & \cdots & \operatorname{Var}[X_n] \end{bmatrix} \\\\ &= \mathbb{E}[\boldsymbol{x}\boldsymbol{x}^\top] - \boldsymbol{\mu}\boldsymbol{\mu}^\top. \end{align*} \]
Each diagonal entry \(\Sigma_{ii}\) is the variance of \(X_i\), since \[ \operatorname{Cov}[X_i, X_i] = \mathbb{E}[(X_i - \mathbb{E}[X_i])^2] = \sigma_i^2. \] The total variance is defined as the trace of \(\Sigma\): \[ \operatorname{tr}(\Sigma) = \sum_{i=1}^n \operatorname{Var}[X_i]. \]
The covariance matrix has several important structural properties that connect probability theory to linear algebra.
First, \(\Sigma\) is symmetric because covariance itself is symmetric: \[ \Sigma_{ij} = \operatorname{Cov}[X_i, X_j] = \operatorname{Cov}[X_j, X_i] = \Sigma_{ji}. \] Equivalently at the matrix level: for every realization of the random vector, the outer product \((\boldsymbol{x} - \boldsymbol{\mu})(\boldsymbol{x} - \boldsymbol{\mu})^\top\) is a symmetric matrix, and the expectation of a symmetric-matrix-valued random variable is again symmetric (since \(\mathbb{E}\) is taken entrywise and commutes with transposition).
Since \(\Sigma\) is symmetric, by the spectral theorem, it is orthogonally diagonalizable: \[ \Sigma = P D P^\top \] where \(P\) is an orthogonal matrix whose columns are unit eigenvectors of \(\Sigma\), and \(D\) is a diagonal matrix of the corresponding eigenvalues. In particular, because the trace is invariant under orthogonal similarity, the total variance equals the sum of eigenvalues: \[ \operatorname{tr}(\Sigma) = \sum_{i=1}^n \operatorname{Var}[X_i] = \operatorname{tr}(D) = \sum_{i=1}^n \lambda_i. \] Note that the individual eigenvalues do not correspond to individual variances \(\operatorname{Var}[X_i]\); only the totals agree.
Every covariance matrix \(\Sigma\) is positive semi-definite. That is, for any \(\boldsymbol{v} \in \mathbb{R}^n\), \[ \boldsymbol{v}^\top \Sigma\, \boldsymbol{v} \geq 0. \]
For any \(\boldsymbol{v} \in \mathbb{R}^n\), \[ \begin{align*} \boldsymbol{v}^\top \Sigma\, \boldsymbol{v} &= \boldsymbol{v}^\top \mathbb{E}[(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])^\top]\, \boldsymbol{v} \\\\ &= \mathbb{E}[\boldsymbol{v}^\top (\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])^\top \boldsymbol{v}] \\\\ &= \mathbb{E} \left[(\boldsymbol{v}^\top (\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}]))^2\right] \geq 0. \end{align*} \] The last expression is the expectation of a squared real number, which is always non-negative.
As a consequence, the eigenvalues of \(\Sigma\) satisfy \[ \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0. \] This non-negativity of eigenvalues is precisely what makes the spectral decomposition of \(\Sigma\) meaningful for data analysis, as we shall see next in Principal Component Analysis.
SideNote: \(\mathbb{E}[\boldsymbol{x}\boldsymbol{x}^\top]\) is called the autocorrelation matrix denoted as \(R_{\boldsymbol{x}\boldsymbol{x}}\): \[ \begin{align*} R_{\boldsymbol{x}\boldsymbol{x}} &= \mathbb{E}[\boldsymbol{x}\boldsymbol{x}^\top] \\\\ &= \begin{bmatrix} \mathbb{E}[X_1^2] & \mathbb{E}[X_1 X_2] & \cdots & \mathbb{E}[X_1 X_n] \\\\ \mathbb{E}[X_2 X_1] & \mathbb{E}[X_2^2] & \cdots & \mathbb{E}[X_2 X_n] \\\\ \vdots & \vdots & \ddots & \vdots \\\\ \mathbb{E}[X_n X_1] & \mathbb{E}[X_n X_2] & \cdots & \mathbb{E}[X_n^2]. \end{bmatrix} \end{align*} \]