Correlation

Cross-Covariance

Earlier, the covariance matrix \[ \Sigma = \operatorname{Cov}[\boldsymbol{x}] \] captured pairwise relationships among the components of a single random vector \(\boldsymbol{x}\). In many applications, however, we work with two distinct sets of variables: for instance, relating a set of input features to a set of output measurements, or comparing gene expression levels across different experimental conditions. The cross-covariance extends the idea of covariance to pairs of variables, one drawn from each set.

As with the covariance matrix, we have both a population and a sample version. The population version is the natural object when \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) are random vectors with known joint distribution; the sample version is its empirical estimate from observed data.

Definition: Population Cross-Covariance Matrix

Let \(\boldsymbol{X} \in \mathbb{R}^{n_1}\) and \(\boldsymbol{Y} \in \mathbb{R}^{n_2}\) be random vectors on a common probability space, with all components having finite second moments and means \(\boldsymbol{\mu}_X = \mathbb{E}[\boldsymbol{X}]\), \(\boldsymbol{\mu}_Y = \mathbb{E}[\boldsymbol{Y}]\). The population cross-covariance matrix is \[ \Sigma_{XY} = \operatorname{Cov}[\boldsymbol{X}, \boldsymbol{Y}] = \mathbb{E}\!\left[(\boldsymbol{X} - \boldsymbol{\mu}_X)(\boldsymbol{Y} - \boldsymbol{\mu}_Y)^\top\right] \in \mathbb{R}^{n_1 \times n_2}. \] Its \((i, j)\) entry is the covariance between the \(i\)-th component of \(\boldsymbol{X}\) and the \(j\)-th component of \(\boldsymbol{Y}\): \(\bigl(\Sigma_{XY}\bigr)_{ij} = \operatorname{Cov}[X_i, Y_j]\).

Definition: Sample Cross-Covariance Matrix

Given two data matrices \(A \in \mathbb{R}^{m \times n_1}\) and \(B \in \mathbb{R}^{m \times n_2}\) with the same number of observations \(m \geq 2\), the sample cross-covariance matrix is \[ K_{AB} = \frac{1}{m-1}(A - \bar{A})^\top (B - \bar{B}) \in \mathbb{R}^{n_1 \times n_2}, \] where the rows of \(A\) are written as column vectors \(\boldsymbol{a}_1, \ldots, \boldsymbol{a}_m \in \mathbb{R}^{n_1}\), the column-mean vector is \(\bar{\boldsymbol{a}} = \frac{1}{m}\sum_{k=1}^m \boldsymbol{a}_k\), and \(\bar{A} \in \mathbb{R}^{m \times n_1}\) is the matrix whose every row equals \(\bar{\boldsymbol{a}}^\top\); \(\boldsymbol{b}_k\), \(\bar{\boldsymbol{b}}\), and \(\bar{B}\) are defined analogously. The divisor \(m-1\) is Bessel's correction: it makes \(K_{AB}\) an unbiased estimator of the population cross-covariance \(\Sigma_{XY}\) when the paired rows \((\boldsymbol{a}_k, \boldsymbol{b}_k)\) are i.i.d. draws from the joint distribution of \((\boldsymbol{X}, \boldsymbol{Y})\).

Note that in general \(K_{AB}\) is not square (unless \(n_1 = n_2\)), and even when it is square it need not be symmetric. Swapping the roles of the two datasets transposes the matrix: \(K_{BA} = K_{AB}^\top\). In the special case \(A = B\), the cross-covariance reduces to the ordinary sample covariance matrix of the data in \(A\) (sometimes called the auto-covariance in the cross-covariance framing), which is exactly the matrix \(S\) introduced earlier.
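The definition above translates almost line by line into NumPy. The following is a minimal sketch rather than a reference implementation: the function name `sample_cross_covariance` and the synthetic matrices `A` and `B` are illustrative assumptions, not part of the text.

```python
import numpy as np

def sample_cross_covariance(A, B):
    """Sample cross-covariance K_AB = (A - Abar)^T (B - Bbar) / (m - 1).

    Hypothetical helper: rows are observations, columns are variables.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    if A.shape[0] != B.shape[0]:
        raise ValueError("A and B must have the same number of rows (observations)")
    m = A.shape[0]
    if m < 2:
        raise ValueError("need at least two observations")
    # Broadcasting subtracts the column-mean vector from every row.
    A_centered = A - A.mean(axis=0)
    B_centered = B - B.mean(axis=0)
    return A_centered.T @ B_centered / (m - 1)

# Synthetic data (an assumption for illustration): m = 200, n1 = 3, n2 = 2.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 3))
B = A[:, :2] + 0.5 * rng.normal(size=(200, 2))

K_AB = sample_cross_covariance(A, B)   # shape (3, 2): not square
K_AA = sample_cross_covariance(A, A)   # special case A = B
```

Here `K_AA` should agree with `np.cov(A, rowvar=False)`, which also uses the divisor \(m-1\) by default, and `sample_cross_covariance(B, A)` should equal `K_AB.T`.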

While cross-covariance quantifies how variables from two datasets co-vary, its magnitude depends on the scales of the variables involved. To obtain a scale-free measure, we turn to the correlation coefficient.

Correlation

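Covariance tells us whether two variables tend to increase or decrease together, but its numerical value depends on the units and scales of the variables. For example, if \(X\) and \(Y\) are lengths, converting both from kilometers to millimeters multiplies \(\operatorname{Cov}[X, Y]\) by \(10^{12}\), even though the relationship between the variables is unchanged. A value such as \(\operatorname{Cov}[X, Y] = 500\) is therefore uninterpretable on its own: whether it signals a strong or a weak relationship depends on the standard deviations of \(X\) and \(Y\). To obtain a scale-free measure of linear association, we normalize the covariance by the standard deviations.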

Definition: Population Correlation Coefficient

Let \(X\) and \(Y\) be random variables with finite, non-zero variances and standard deviations \(\sigma_X = \sqrt{\mathbb{E}[(X - \mu_X)^2]}\) and \(\sigma_Y = \sqrt{\mathbb{E}[(Y - \mu_Y)^2]}\). Then the population correlation coefficient is \[ \begin{align*} \rho_{X, Y} &= \operatorname{Corr}[X, Y] \\\\ &= \frac{\operatorname{Cov}[X, Y]}{\sigma_X\, \sigma_Y} \\\\ &= \frac{\mathbb{E}[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X\, \sigma_Y}. \end{align*} \]

The correlation coefficient is a dimensionless quantity that measures the strength and direction of the linear relationship between \(X\) and \(Y\).

Definition: Sample Correlation Coefficient

Given observed data \((x_1, y_1), \ldots, (x_m, y_m)\), the sample correlation coefficient is \[ r_{xy} = \frac{1}{(m-1)\,s_x\, s_y}\sum_{i=1}^m (x_i - \bar{x})(y_i - \bar{y}) \] where \(\bar{x}, \bar{y}\) are sample means and \(s_x, s_y\) are the corrected sample standard deviations: \[ s_x = \sqrt{\frac{1}{m-1}\sum_{i=1}^m (x_i - \bar{x})^2}. \] The factor \(m - 1\) (Bessel's correction) ensures that \(s_x^2\) is an unbiased estimator of \(\sigma_X^2\).
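As a sanity check on the formula, the sketch below computes \(r_{xy}\) directly from the definition and compares it with NumPy's built-in `np.corrcoef`. The helper name `sample_correlation` and the five data points are made up for illustration.

```python
import numpy as np

def sample_correlation(x, y):
    """Sample correlation r_xy computed from the definition (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D arrays of equal length")
    m = x.size
    if m < 2:
        raise ValueError("need at least two observations")
    dx = x - x.mean()
    dy = y - y.mean()
    # Corrected sample standard deviations (divisor m - 1).
    s_x = np.sqrt(np.sum(dx**2) / (m - 1))
    s_y = np.sqrt(np.sum(dy**2) / (m - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("correlation is undefined when a sample variance is zero")
    return np.sum(dx * dy) / ((m - 1) * s_x * s_y)

# Illustrative data: y is roughly 2x plus small deviations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = sample_correlation(x, y)
r_rescaled = sample_correlation(1000.0 * x, y + 7.0)  # change of units and origin
r_numpy = np.corrcoef(x, y)[0, 1]
```

Note that the factors of \(m-1\) cancel between the numerator and \(s_x s_y\), so \(r_{xy}\) would be the same with divisor \(m\) throughout. Rescaling by a positive constant or shifting either variable leaves \(r_{xy}\) unchanged, which is the scale-free property motivating the definition.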

A natural question is: what range of values can \(\rho\) take? The following theorem shows that it is always bounded between \(-1\) and \(1\).

Theorem: Boundedness of the Correlation Coefficient

For any random variables \(X, Y\) with finite, non-zero variances, \[ -1 \leq \rho_{X,Y} \leq 1. \] Three values have specific interpretations:

  1. \(\rho = 1\) indicates a perfect positive linear relationship.
  2. \(\rho = -1\) indicates a perfect negative linear relationship.
  3. \(\rho = 0\) indicates no linear relationship (but possibly nonlinear dependence: the \(X \sim \mathcal{N}(0,1),\, Y = X^2\) counter-example introduced in our covariance discussion shows this can happen).

Proof:

The space of zero-mean random variables with finite second moment forms a (real) inner-product space under the bilinear form \(\langle U, V \rangle := \mathbb{E}[UV]\), with induced norm \(\|U\| = \sqrt{\mathbb{E}[U^2]}\). (A fully rigorous treatment of this \(L^2\) inner-product structure on a probability space — including the standard convention of identifying random variables that agree almost surely, so that \(\|U\| = 0\) implies \(U = 0\) — is developed in our later treatment of measure-theoretic probability; the classical inner-product axioms (bilinearity, symmetry, and non-negativity of \(\langle U, U \rangle\)) follow directly from linearity of expectation and the non-negativity of \(\mathbb{E}[U^2]\).) The Cauchy-Schwarz inequality in any inner-product space gives \(|\langle U, V \rangle| \leq \|U\|\,\|V\|\), which translates to \[ \bigl|\mathbb{E}[UV]\bigr| \;\leq\; \sqrt{\mathbb{E}[U^2]\, \mathbb{E}[V^2]}. \] Applying this to the centered variables \(U = X - \mu_X\) and \(V = Y - \mu_Y\) yields \(\bigl|\operatorname{Cov}[X, Y]\bigr| \leq \sigma_X\, \sigma_Y\), and dividing both sides by \(\sigma_X \sigma_Y > 0\) gives \[ \bigl|\rho_{X, Y}\bigr| = \frac{\bigl|\operatorname{Cov}[X, Y]\bigr|}{\sigma_X\, \sigma_Y} \;\leq\; 1, \] equivalently \(-1 \leq \rho_{X, Y} \leq 1\). The equality cases \(\rho = \pm 1\) correspond to equality in Cauchy-Schwarz, which holds if and only if \(U\) and \(V\) are linearly dependent — i.e., \(Y - \mu_Y = c (X - \mu_X)\) almost surely for some constant \(c\), with the sign of \(c\) matching that of \(\rho\).
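The special values in the theorem can be illustrated numerically. The simulation below is a sketch under stated assumptions (a fixed seed, 200,000 standard normal draws, and an arbitrary slope of \(-2\)); it is a check on intuition, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200_000)

# Nonlinear dependence: y_sq is a deterministic function of x,
# yet Cov[X, X^2] = E[X^3] = 0 for a standard normal X.
y_sq = x**2
r_nonlinear = np.corrcoef(x, y_sq)[0, 1]

# Exact linear dependence with negative slope: equality case of Cauchy-Schwarz.
y_lin = -2.0 * x + 3.0
r_linear = np.corrcoef(x, y_lin)[0, 1]
```

The sample value `r_nonlinear` should be close to \(0\) even though \(Y\) is completely determined by \(X\), while `r_linear` should equal \(-1\) up to floating-point rounding, with the sign matching that of the slope.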

What \(\rho\) does not establish: causation. Even when \(|\rho_{X, Y}|\) is close to \(1\), the correlation coefficient cannot adjudicate among the principal causal scenarios consistent with the data: \(X\) causes \(Y\), \(Y\) causes \(X\), or both are driven by an unobserved third variable \(Z\) — a confounder. Without observing additional variables, these scenarios produce the same bivariate joint distribution of \(X\) and \(Y\), and hence the same value of \(\rho\). Distinguishing them requires either an intervention (a controlled experiment in which \(X\) is set by the experimenter rather than passively observed) or additional structural assumptions about the data-generating process — neither of which the correlation coefficient supplies. The slogan "correlation does not imply causation" records this limitation; the formal framework that resolves it — causal inference — is developed in standalone treatments of the subject and lies outside the scope of this page.
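A toy simulation makes the confounding scenario concrete. Everything in it is an assumption chosen for illustration: the linear model, the noise level \(0.3\), and the variable names. In this model \(Y\) never depends on \(X\), yet the two are strongly correlated through \(Z\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Unobserved confounder Z drives both X and Y (toy model, assumed for illustration).
z = rng.normal(size=n)
x_obs = z + 0.3 * rng.normal(size=n)
y = z + 0.3 * rng.normal(size=n)      # y depends on z only, never on x

rho_observational = np.corrcoef(x_obs, y)[0, 1]

# Intervention: the experimenter sets X independently of Z.
# Because y has no dependence on x in this model, y is unchanged.
x_do = rng.normal(size=n)
rho_interventional = np.corrcoef(x_do, y)[0, 1]
```

In this model the population correlation between the observed \(X\) and \(Y\) is \(1/1.09 \approx 0.92\), while under the intervention it is \(0\). The observational value alone cannot reveal that the association disappears once \(X\) is set by hand.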

Having established the correlation coefficient for a pair of random variables, we now extend this idea to an entire random vector. Just as the covariance matrix collects all pairwise covariances, the correlation matrix collects all pairwise correlations. It is a standardized version of the covariance matrix in which every entry lies in \([-1, 1]\). This standardization is particularly useful when features have significantly different scales: for example, in PCA, performing eigendecomposition on the correlation matrix rather than the covariance matrix ensures that no single feature dominates the analysis merely because of its scale.
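The effect of scale on PCA can be seen in a small experiment. The sketch below uses two hypothetical, independently generated features (a height in meters and an income in currency units) and compares the eigendecomposition of the sample covariance matrix with that of the sample correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 5_000

# Two independent features on very different scales (hypothetical data).
height_m = rng.normal(1.7, 0.1, size=m)
income = rng.normal(50_000.0, 10_000.0, size=m)
data = np.column_stack([height_m, income])

Sigma = np.cov(data, rowvar=False)       # sample covariance matrix
R = np.corrcoef(data, rowvar=False)      # sample correlation matrix

# eigh returns eigenvalues in ascending order for symmetric matrices.
eigvals_cov, eigvecs_cov = np.linalg.eigh(Sigma)
eigvals_corr, eigvecs_corr = np.linalg.eigh(R)

top_direction_cov = eigvecs_cov[:, -1]                  # leading principal direction
share_cov = eigvals_cov[-1] / eigvals_cov.sum()         # variance share, covariance PCA
share_corr = eigvals_corr[-1] / eigvals_corr.sum()      # variance share, correlation PCA
```

With the covariance matrix, the leading direction aligns almost exactly with the income axis and accounts for nearly all of the variance, purely because of units. With the correlation matrix, the two independent features receive roughly equal weight.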

Definition: Correlation Matrix

For a random vector \(\boldsymbol{x} = (X_1, \ldots, X_n)^\top \in \mathbb{R}^n\) with all \(\sigma_i = \sqrt{\operatorname{Var}[X_i]} > 0\), the correlation matrix \(R \in \mathbb{R}^{n \times n}\) has entries \[ R_{ij} = \operatorname{Corr}[X_i, X_j] = \frac{\operatorname{Cov}[X_i, X_j]}{\sigma_i\, \sigma_j}. \] Diagonal entries are \(R_{ii} = 1\), and off-diagonal entries lie in \([-1, 1]\) by the boundedness theorem above. In matrix form, \[ R = \begin{bmatrix} 1 & \operatorname{Corr}[X_1, X_2] & \cdots & \operatorname{Corr}[X_1, X_n] \\\\ \operatorname{Corr}[X_2, X_1] & 1 & \cdots & \operatorname{Corr}[X_2, X_n] \\\\ \vdots & \vdots & \ddots & \vdots \\\\ \operatorname{Corr}[X_n, X_1] & \operatorname{Corr}[X_n, X_2] & \cdots & 1 \end{bmatrix}. \]

The correlation matrix is determined entirely by the covariance matrix \(\Sigma\) of \(\boldsymbol{x}\): each correlation entry is just the corresponding covariance entry divided by the product of the relevant standard deviations. The following formula encodes this normalization compactly as a matrix product.

Theorem: Standardization Formula for the Correlation Matrix

Let \(\Sigma\) be the covariance matrix of a random vector \(\boldsymbol{x} \in \mathbb{R}^n\) with all \(\sigma_i = \sqrt{\operatorname{Var}[X_i]} > 0\), and let \[ D = \operatorname{diag}\bigl(\operatorname{Var}[X_1], \ldots, \operatorname{Var}[X_n]\bigr), \qquad D^{-1/2} = \operatorname{diag}\!\left(\frac{1}{\sigma_1}, \ldots, \frac{1}{\sigma_n}\right). \] Then the correlation matrix \(R\) of \(\boldsymbol{x}\) satisfies \[ R = D^{-1/2}\, \Sigma\, D^{-1/2}. \]

Proof:

Compute the \((i, j)\) entry of the right-hand side directly. Since \(D^{-1/2}\) is diagonal with entries \((D^{-1/2})_{kk} = 1/\sigma_k\) and zeros elsewhere, the matrix product collapses to a single non-zero term in the double sum: \[ \bigl(D^{-1/2}\, \Sigma\, D^{-1/2}\bigr)_{ij} = \sum_{k, \ell} (D^{-1/2})_{ik}\, \Sigma_{k\ell}\, (D^{-1/2})_{\ell j} = \frac{1}{\sigma_i}\, \Sigma_{ij}\, \frac{1}{\sigma_j} = \frac{\operatorname{Cov}[X_i, X_j]}{\sigma_i\, \sigma_j} = R_{ij}. \] For the diagonal case \(i = j\), this gives \(\operatorname{Var}[X_i] / (\sigma_i \cdot \sigma_i) = 1\), matching the diagonal of \(R\). Thus \(R\) and \(D^{-1/2} \Sigma D^{-1/2}\) agree entry-wise.

This formulation normalizes all rows and columns simultaneously through a single matrix expression, avoiding the need to compute each pairwise correlation separately.
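In code, the formula can be applied either with explicit diagonal matrices or, equivalently, by scaling entry \((i, j)\) by \(1/(\sigma_i \sigma_j)\). The sketch below shows both; the function name `correlation_from_covariance` and the example matrix `Sigma` are illustrative assumptions.

```python
import numpy as np

def correlation_from_covariance(Sigma):
    """Return R = D^{-1/2} Sigma D^{-1/2} (hypothetical helper)."""
    Sigma = np.asarray(Sigma, dtype=float)
    variances = np.diag(Sigma)
    if np.any(variances <= 0):
        raise ValueError("all variances must be strictly positive")
    d_inv_sqrt = 1.0 / np.sqrt(variances)
    # Entry (i, j) is Sigma_ij / (sigma_i * sigma_j); the outer product
    # avoids forming the diagonal matrices explicitly.
    return Sigma * np.outer(d_inv_sqrt, d_inv_sqrt)

# Example covariance matrix with standard deviations 2, 3, and 4.
Sigma = np.array([[4.0,  2.0,  0.0],
                  [2.0,  9.0, -3.0],
                  [0.0, -3.0, 16.0]])

R = correlation_from_covariance(Sigma)

# The same result via explicit matrix products, as in the theorem.
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(Sigma)))
R_explicit = D_inv_sqrt @ Sigma @ D_inv_sqrt
```

For this `Sigma`, the off-diagonal correlations work out to \(2/(2 \cdot 3) = 1/3\), \(0\), and \(-3/(3 \cdot 4) = -1/4\), and the diagonal is all ones. The element-wise version costs \(O(n^2)\) operations, whereas the two dense matrix products cost \(O(n^3)\) if the diagonal structure is not exploited.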

Connections to Machine Learning

The correlation matrix plays a key role in feature engineering and model diagnostics. Highly correlated features indicate multicollinearity, which can destabilize regression models and inflate variance in coefficient estimates. In practice, examining the correlation matrix before training helps identify redundant features, guiding decisions about feature selection or dimensionality reduction via PCA. In Gaussian processes, the kernel function specifies the covariance between function values at pairs of inputs; normalizing it by the marginal standard deviations gives the corresponding correlation structure.
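One simple way to screen for redundant features is to list the feature pairs whose absolute sample correlation exceeds a threshold. The sketch below is one possible approach, not a standard recipe: the helper name `highly_correlated_pairs`, the threshold \(0.9\), and the synthetic features are all assumptions.

```python
import numpy as np

def highly_correlated_pairs(X, threshold=0.9):
    """List (i, j, r_ij) for feature pairs with |r_ij| >= threshold (hypothetical helper)."""
    R = np.corrcoef(X, rowvar=False)          # columns of X are features
    rows, cols = np.triu_indices_from(R, k=1)  # strictly upper triangle: each pair once
    return [
        (int(i), int(j), float(R[i, j]))
        for i, j in zip(rows, cols)
        if abs(R[i, j]) >= threshold
    ]

# Synthetic features: f2 is nearly a rescaled copy of f0, f1 is unrelated.
rng = np.random.default_rng(3)
f0 = rng.normal(size=1000)
f1 = rng.normal(size=1000)
f2 = 2.0 * f0 + 0.1 * rng.normal(size=1000)
X = np.column_stack([f0, f1, f2])

pairs = highly_correlated_pairs(X)
```

Here only the pair of columns `0` and `2` should be flagged. Pairwise screening of this kind misses multicollinearity that involves three or more features jointly, so it is a first check rather than a complete diagnostic.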

With covariance and correlation established for random vectors, we are now prepared to study the most important multivariate distribution in statistics and machine learning. We next introduce the multivariate normal distribution, whose shape is entirely determined by a mean vector and a covariance matrix.