Cross-Covariance
Earlier, the covariance matrix
\[
\Sigma = \operatorname{Cov}[\boldsymbol{x}]
\]
captured pairwise relationships among the components of a single random vector \(\boldsymbol{x}\).
In many applications, however, we work with two distinct sets of variables: for instance, relating a set
of input features to a set of output measurements, or comparing gene expression levels across different
experimental conditions. The cross-covariance extends the idea of covariance to pairs of
variables, one drawn from each set.
As with the covariance matrix, we have both a population and a sample version. The population version is the
natural object when \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) are random vectors with known joint distribution;
the sample version is its empirical estimate from observed data.
Definition: Population Cross-Covariance Matrix
Let \(\boldsymbol{X} \in \mathbb{R}^{n_1}\) and \(\boldsymbol{Y} \in \mathbb{R}^{n_2}\) be random vectors on a common
probability space, with all components having finite second moments and means
\(\boldsymbol{\mu}_X = \mathbb{E}[\boldsymbol{X}]\), \(\boldsymbol{\mu}_Y = \mathbb{E}[\boldsymbol{Y}]\).
The population cross-covariance matrix is
\[
\Sigma_{XY} = \operatorname{Cov}[\boldsymbol{X}, \boldsymbol{Y}]
= \mathbb{E}\!\left[(\boldsymbol{X} - \boldsymbol{\mu}_X)(\boldsymbol{Y} - \boldsymbol{\mu}_Y)^\top\right]
\in \mathbb{R}^{n_1 \times n_2}.
\]
Its \((i, j)\) entry is the covariance between the \(i\)-th component of \(\boldsymbol{X}\) and the \(j\)-th component of \(\boldsymbol{Y}\):
\(\bigl(\Sigma_{XY}\bigr)_{ij} = \operatorname{Cov}[X_i, Y_j]\).
Definition: Sample Cross-Covariance Matrix
Given two data matrices \(A \in \mathbb{R}^{m \times n_1}\) and
\(B \in \mathbb{R}^{m \times n_2}\) with the same number of observations \(m\),
the sample cross-covariance matrix is
\[
K_{AB} = \frac{1}{m-1}(A - \bar{A})^\top (B - \bar{B}) \in \mathbb{R}^{n_1 \times n_2},
\]
where the rows of \(A\) are written as column vectors \(\boldsymbol{a}_1, \ldots, \boldsymbol{a}_m \in \mathbb{R}^{n_1}\),
the column-mean vector is \(\bar{\boldsymbol{a}} = \frac{1}{m}\sum_{k=1}^m \boldsymbol{a}_k\), and
\(\bar{A} \in \mathbb{R}^{m \times n_1}\) is the matrix whose every row equals \(\bar{\boldsymbol{a}}^\top\);
\(\boldsymbol{b}_k\), \(\bar{\boldsymbol{b}}\), and \(\bar{B}\) are defined analogously.
The factor \(m-1\) is Bessel's correction: it makes \(K_{AB}\) an unbiased estimator of the population cross-covariance \(\Sigma_{XY}\)
when the paired rows \((\boldsymbol{a}_k, \boldsymbol{b}_k)\) are i.i.d. draws from the joint distribution of \((\boldsymbol{X}, \boldsymbol{Y})\).
Note that \(K_{AB}\) is not square unless \(n_1 = n_2\), and even when it is square it is generally not symmetric; instead, \(K_{BA} = K_{AB}^\top\).
In the special case \(A = B\), the sample cross-covariance reduces to the ordinary
(sample) covariance matrix of the data in \(A\), sometimes called the auto-covariance in the cross-covariance framing,
which is exactly the matrix \(S\) introduced earlier.
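The sample definition above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the helper name `sample_cross_cov`, the seed, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def sample_cross_cov(A, B):
    """Sample cross-covariance K_AB = (A - A_bar)^T (B - B_bar) / (m - 1).

    Hypothetical helper for illustration; rows are observations, columns are variables.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    if A.shape[0] != B.shape[0]:
        raise ValueError("A and B must have the same number of rows (observations)")
    m = A.shape[0]
    # Broadcasting subtracts the column-mean row from every row, i.e. A - A_bar.
    A_centered = A - A.mean(axis=0)
    B_centered = B - B.mean(axis=0)
    return A_centered.T @ B_centered / (m - 1)

rng = np.random.default_rng(0)                   # arbitrary seed, for reproducibility only
A = rng.normal(size=(200, 3))                    # m = 200 observations, n1 = 3
B = 2.0 * A[:, :2] + rng.normal(size=(200, 2))   # n2 = 2, partly driven by A (synthetic)

K_AB = sample_cross_cov(A, B)   # shape (3, 2): not square here
K_BA = sample_cross_cov(B, A)   # shape (2, 3): the transpose of K_AB
S = sample_cross_cov(A, A)      # special case A = B: ordinary sample covariance
```

As a sanity check, `S` should agree with `np.cov(A, rowvar=False)`, and `K_AB` should equal the upper-right block of the sample covariance matrix of the column-wise concatenation of `A` and `B`.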
While cross-covariance quantifies how variables from two datasets co-vary, its magnitude depends on the scales of
the variables involved. To obtain a scale-free measure, we turn to the correlation coefficient.
Correlation
Covariance tells us whether two variables tend to increase or decrease together, but its numerical value
depends on the units and scales of the variables. For example, if \(X\) and \(Y\) are lengths, the same data give
\(\operatorname{Cov}[X, Y] = 500\) when measured in millimeters but \(5 \times 10^{-10}\) when measured in kilometers,
so the raw number alone says nothing about the strength of the relationship.
To obtain a scale-free measure of linear association, we normalize the covariance by the standard deviations.
Definition: Population Correlation Coefficient
Let \(X\) and \(Y\) be random variables with finite, non-zero variances and standard deviations
\(\sigma_X = \sqrt{\mathbb{E}[(X - \mu_X)^2]}\) and
\(\sigma_Y = \sqrt{\mathbb{E}[(Y - \mu_Y)^2]}\). Then
the population correlation coefficient is
\[
\begin{align*}
\rho_{X, Y} &= \operatorname{Corr}[X, Y] \\\\
&= \frac{\operatorname{Cov}[X, Y]}{\sigma_X\, \sigma_Y} \\\\
&= \frac{\mathbb{E}[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X\, \sigma_Y}.
\end{align*}
\]
The correlation coefficient is a dimensionless quantity that measures the strength
and direction of the linear relationship between \(X\) and \(Y\).
Definition: Sample Correlation Coefficient
Given observed data \((x_1, y_1), \ldots, (x_m, y_m)\), the
sample correlation coefficient is
\[
r_{xy} = \frac{1}{(m-1)\,s_x\, s_y}\sum_{i=1}^m (x_i - \bar{x})(y_i - \bar{y})
\]
where \(\bar{x}, \bar{y}\) are the sample means and \(s_x, s_y\) are the
corrected sample standard deviations (with \(s_y\) defined analogously):
\[
s_x = \sqrt{\frac{1}{m-1}\sum_{i=1}^m (x_i - \bar{x})^2}.
\]
The factor \(m - 1\) (Bessel's correction) ensures that \(s_x^2\) is an
unbiased estimator of \(\sigma_X^2\).
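A short NumPy sketch of the sample correlation coefficient, written to mirror the formula term by term. The function name `sample_corr` and the synthetic data are assumptions for illustration; in practice `np.corrcoef` computes the same quantity. The example also rescales the data from kilometers to millimeters to show that the covariance changes by a factor of \(10^{12}\) while \(r_{xy}\) does not change.

```python
import numpy as np

def sample_corr(x, y):
    """Sample correlation r_xy with Bessel-corrected standard deviations.

    Hypothetical helper for illustration; assumes both inputs have non-zero variance.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m = x.size
    dx = x - x.mean()
    dy = y - y.mean()
    s_x = np.sqrt(np.sum(dx**2) / (m - 1))
    s_y = np.sqrt(np.sum(dy**2) / (m - 1))
    return np.sum(dx * dy) / ((m - 1) * s_x * s_y)

rng = np.random.default_rng(1)                     # arbitrary seed
x_km = rng.normal(size=500)                        # synthetic lengths in kilometers
y_km = 0.8 * x_km + 0.6 * rng.normal(size=500)

# The same data expressed in millimeters (1 km = 10^6 mm).
x_mm, y_mm = x_km * 1e6, y_km * 1e6

cov_km = np.cov(x_km, y_km)[0, 1]   # sample covariance in km^2
cov_mm = np.cov(x_mm, y_mm)[0, 1]   # sample covariance in mm^2: 10^12 times larger
r_km = sample_corr(x_km, y_km)
r_mm = sample_corr(x_mm, y_mm)      # identical to r_km up to rounding
```

Note that the factors of \(m - 1\) cancel between the numerator and the two standard deviations, so using \(1/m\) consistently throughout would give the same \(r_{xy}\).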
A natural question is: what range of values can \(\rho\) take? The following
theorem shows that it is always bounded between \(-1\) and \(1\).
Theorem: Boundedness of the Correlation Coefficient
For any random variables \(X, Y\) with finite, non-zero variances,
\[
-1 \leq \rho_{X,Y} \leq 1.
\]
The extreme values and zero have specific interpretations:
- \(\rho = 1\) indicates a perfect positive linear relationship.
- \(\rho = -1\) indicates a perfect negative linear relationship.
- \(\rho = 0\) indicates no linear relationship (but possibly nonlinear dependence — the \(X \sim \mathcal{N}(0,1),\, Y = X^2\) counter-example introduced in our covariance discussion shows this can happen).
Proof:
The space of zero-mean random variables with finite second moment forms a (real) inner-product space under the bilinear form
\(\langle U, V \rangle := \mathbb{E}[UV]\), with induced norm \(\|U\| = \sqrt{\mathbb{E}[U^2]}\). (A fully rigorous treatment
of this \(L^2\) inner-product structure on a probability space — including the standard convention of identifying random
variables that agree almost surely, so that \(\|U\| = 0\) implies \(U = 0\) — is developed in our later treatment of
measure-theoretic probability; the classical inner-product axioms (bilinearity, symmetry, and non-negativity
of \(\langle U, U \rangle\)) follow directly from linearity of expectation and the non-negativity of \(\mathbb{E}[U^2]\).)
The Cauchy-Schwarz inequality in any inner-product space gives \(|\langle U, V \rangle| \leq \|U\|\,\|V\|\), which translates to
\[
\bigl|\mathbb{E}[UV]\bigr| \;\leq\; \sqrt{\mathbb{E}[U^2]\, \mathbb{E}[V^2]}.
\]
Applying this to the centered variables \(U = X - \mu_X\) and \(V = Y - \mu_Y\) yields
\(\bigl|\operatorname{Cov}[X, Y]\bigr| \leq \sigma_X\, \sigma_Y\), and dividing both sides by \(\sigma_X \sigma_Y > 0\) gives
\[
\bigl|\rho_{X, Y}\bigr|
= \frac{\bigl|\operatorname{Cov}[X, Y]\bigr|}{\sigma_X\, \sigma_Y} \;\leq\; 1,
\]
equivalently \(-1 \leq \rho_{X, Y} \leq 1\). The equality cases \(\rho = \pm 1\) correspond to equality in Cauchy-Schwarz,
which holds if and only if \(U\) and \(V\) are linearly dependent, i.e., \(Y - \mu_Y = c (X - \mu_X)\) almost surely for some
constant \(c \neq 0\) (nonzero because both variances are nonzero), with the sign of \(c\) matching that of \(\rho\).
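The bound and its equality cases can be checked numerically. The sketch below is only an empirical illustration on synthetic data (the seed, sample size, and coefficients are arbitrary assumptions), not a substitute for the proof. It uses `np.corrcoef` to estimate the correlation for two exact linear relationships, for \(Y = X^2\) with standard normal \(X\), and for a noisy linear relationship.

```python
import numpy as np

rng = np.random.default_rng(2)      # arbitrary seed
x = rng.normal(size=100_000)        # samples of X ~ N(0, 1)

def corr(u, v):
    """Sample correlation via NumPy (off-diagonal entry of the 2x2 matrix)."""
    return np.corrcoef(u, v)[0, 1]

r_pos = corr(x, 3.0 * x + 1.0)      # exact increasing linear relation: r = 1 up to rounding
r_neg = corr(x, -0.5 * x + 2.0)     # exact decreasing linear relation: r = -1 up to rounding
r_sq = corr(x, x**2)                # Y = X^2: dependent, but population correlation is 0
r_noisy = corr(x, x + rng.normal(size=x.size))  # population value is 1/sqrt(2)
```

The last case follows from the definition: with independent unit-variance noise, \(\operatorname{Cov}[X, X + N] = 1\) and the standard deviations are \(1\) and \(\sqrt{2}\), so \(\rho = 1/\sqrt{2} \approx 0.707\), strictly inside the interval.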
What \(\rho\) does not establish: causation.
Even when \(|\rho_{X, Y}|\) is close to \(1\), the correlation coefficient cannot adjudicate among the principal causal scenarios
consistent with the data: \(X\) causes \(Y\), \(Y\) causes \(X\), or both are driven by an unobserved third variable \(Z\) — a confounder.
Without observing additional variables, these scenarios can all produce the same bivariate joint distribution of \(X\) and \(Y\), and
hence the same value of \(\rho\). Distinguishing them requires either an intervention
(a controlled experiment in which \(X\) is set by the experimenter rather than passively observed) or additional structural assumptions
about the data-generating process — neither of which the correlation coefficient supplies. The slogan "correlation does not imply causation" records
this limitation; the formal framework that resolves it — causal inference — is developed in standalone treatments of the subject and
lies outside the scope of this page.
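The confounder scenario can be made concrete with a small simulation. Everything here is an assumption chosen for illustration: a hypothetical linear mechanism in which an unobserved \(Z\) drives both \(X\) and \(Y\), Gaussian noise with scale \(0.3\), and an arbitrary seed. By construction \(X\) has no effect on \(Y\), yet the observed correlation is high; when \(X\) is instead set independently, as in a controlled experiment, the correlation vanishes.

```python
import numpy as np

rng = np.random.default_rng(3)   # arbitrary seed
m = 50_000

# Observational regime: Z is a hidden common cause of X and Y.
z = rng.normal(size=m)
x = z + 0.3 * rng.normal(size=m)         # X depends on Z only
y = z + 0.3 * rng.normal(size=m)         # Y depends on Z only, never on X
r_observed = np.corrcoef(x, y)[0, 1]     # high, although X does not cause Y

# Interventional regime: the experimenter sets X independently of Z.
x_set = rng.normal(size=m)
y_after = z + 0.3 * rng.normal(size=m)   # Y's mechanism is unchanged by the intervention
r_intervened = np.corrcoef(x_set, y_after)[0, 1]   # close to 0
```

Under these assumptions the population correlation in the observational regime is \(1 / 1.09 \approx 0.92\), a value that a direct causal link from \(X\) to \(Y\) could produce just as well.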
Having established the correlation coefficient for a pair of random variables, we now extend this idea to an entire
random vector. Just as the covariance matrix collects all pairwise covariances, the correlation matrix
collects all pairwise correlations. It is a standardized version of the covariance matrix in which every entry
lies in \([-1, 1]\). This standardization is particularly useful when features have significantly different scales:
for example, in PCA, performing eigendecomposition on the correlation matrix
rather than the covariance matrix ensures that no single feature dominates the analysis merely because of its scale.
Definition: Correlation Matrix
For a random vector \(\boldsymbol{x} = (X_1, \ldots, X_n)^\top \in \mathbb{R}^n\) with all
\(\sigma_i = \sqrt{\operatorname{Var}[X_i]} > 0\), the correlation matrix
\(R \in \mathbb{R}^{n \times n}\) has entries
\[
R_{ij} = \operatorname{Corr}[X_i, X_j] = \frac{\operatorname{Cov}[X_i, X_j]}{\sigma_i\, \sigma_j}.
\]
Diagonal entries are \(R_{ii} = 1\), and off-diagonal entries lie in \([-1, 1]\) by the
boundedness theorem above.
In matrix form,
\[
R = \begin{bmatrix}
1 & \operatorname{Corr}[X_1, X_2] & \cdots & \operatorname{Corr}[X_1, X_n] \\\\
\operatorname{Corr}[X_2, X_1] & 1 & \cdots & \operatorname{Corr}[X_2, X_n] \\\\
\vdots & \vdots & \ddots & \vdots \\\\
\operatorname{Corr}[X_n, X_1] & \operatorname{Corr}[X_n, X_2] & \cdots & 1
\end{bmatrix}.
\]
The correlation matrix is determined entirely by the covariance matrix \(\Sigma\) of \(\boldsymbol{x}\):
each correlation entry is just the corresponding covariance entry divided by the product
of the relevant standard deviations. The following formula encodes this normalization compactly as a matrix product.
Theorem: Standardization Formula for the Correlation Matrix
Let \(\Sigma\) be the covariance matrix of a random vector \(\boldsymbol{x} \in \mathbb{R}^n\) with all
\(\sigma_i = \sqrt{\operatorname{Var}[X_i]} > 0\), and let
\[
D = \operatorname{diag}\bigl(\operatorname{Var}[X_1], \ldots, \operatorname{Var}[X_n]\bigr), \qquad
D^{-1/2} = \operatorname{diag}\!\left(\frac{1}{\sigma_1}, \ldots, \frac{1}{\sigma_n}\right).
\]
Then the correlation matrix \(R\) of \(\boldsymbol{x}\) satisfies
\[
R = D^{-1/2}\, \Sigma\, D^{-1/2}.
\]
Proof:
Compute the \((i, j)\) entry of the right-hand side directly. Since \(D^{-1/2}\) is diagonal with entries
\((D^{-1/2})_{kk} = 1/\sigma_k\) and zeros elsewhere, only the term with \(k = i\) and \(\ell = j\) survives in
the double sum:
\[
\bigl(D^{-1/2}\, \Sigma\, D^{-1/2}\bigr)_{ij}
= \sum_{k, \ell} (D^{-1/2})_{ik}\, \Sigma_{k\ell}\, (D^{-1/2})_{\ell j}
= \frac{1}{\sigma_i}\, \Sigma_{ij}\, \frac{1}{\sigma_j}
= \frac{\operatorname{Cov}[X_i, X_j]}{\sigma_i\, \sigma_j}
= R_{ij}.
\]
For the diagonal case \(i = j\), this gives \(\operatorname{Var}[X_i] / (\sigma_i \cdot \sigma_i) = 1\), matching the
diagonal of \(R\). Thus \(R\) and \(D^{-1/2} \Sigma D^{-1/2}\) agree entry-wise.
This formulation normalizes all rows and columns simultaneously through a single matrix expression, avoiding the need
to compute each pairwise correlation separately.
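The standardization formula translates directly into NumPy. The sketch below applies it to a sample covariance matrix, which yields the sample correlation matrix; the helper name `corr_from_cov`, the seed, and the synthetic data with deliberately mismatched scales are assumptions for illustration.

```python
import numpy as np

def corr_from_cov(Sigma):
    """Return R = D^{-1/2} Sigma D^{-1/2} for a covariance matrix Sigma.

    Hypothetical helper for illustration; requires strictly positive variances.
    """
    Sigma = np.asarray(Sigma, dtype=float)
    variances = np.diag(Sigma)                       # diagonal of D
    if np.any(variances <= 0):
        raise ValueError("all variances must be strictly positive")
    D_inv_sqrt = np.diag(1.0 / np.sqrt(variances))   # diag(1/sigma_1, ..., 1/sigma_n)
    return D_inv_sqrt @ Sigma @ D_inv_sqrt

rng = np.random.default_rng(4)   # arbitrary seed
# Four synthetic features on very different scales.
data = rng.normal(size=(300, 4)) * np.array([1.0, 10.0, 100.0, 0.01])
data[:, 1] += 5.0 * data[:, 0]   # make features 0 and 1 correlated

S = np.cov(data, rowvar=False)   # sample covariance matrix
R = corr_from_cov(S)             # sample correlation matrix

# Equivalent entry-wise form, without building the diagonal matrices.
sigma = np.sqrt(np.diag(S))
R_outer = S / np.outer(sigma, sigma)
```

Both `R` and `R_outer` should match `np.corrcoef(data, rowvar=False)` up to rounding. The entry-wise form avoids two matrix products, which may matter when \(n\) is large.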
Connections to Machine Learning
The correlation matrix plays a key role in feature engineering and model diagnostics. Highly correlated
features indicate multicollinearity, which can destabilize regression models
and inflate variance in coefficient estimates. In practice, examining the correlation matrix
before training helps identify redundant features, guiding decisions about feature selection
or dimensionality reduction via PCA. In
Gaussian processes, the kernel
function specifies the covariance between function values at any two inputs and, once normalized by the corresponding standard deviations, their correlation.
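As a rough sketch of the screening step described above, the following lists feature pairs whose sample correlation exceeds a threshold in absolute value. The function name `highly_correlated_pairs` and the threshold of \(0.9\) are illustrative assumptions, not a standard rule; a suitable cut-off depends on the model and the data.

```python
import numpy as np

def highly_correlated_pairs(X, threshold=0.9):
    """List (i, j, r_ij) for feature pairs i < j with |r_ij| > threshold.

    Hypothetical helper for illustration; rows of X are observations.
    """
    R = np.corrcoef(X, rowvar=False)
    rows, cols = np.triu_indices(R.shape[0], k=1)   # strictly upper triangle: each pair once
    return [(int(i), int(j), float(R[i, j]))
            for i, j in zip(rows, cols)
            if abs(R[i, j]) > threshold]

rng = np.random.default_rng(5)   # arbitrary seed
X = rng.normal(size=(1000, 4))   # four synthetic, independent features
# Overwrite feature 3 so that it is nearly a rescaled copy of feature 0.
X[:, 3] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=1000)

pairs = highly_correlated_pairs(X, threshold=0.9)   # flags only the pair (0, 3)
```

A flagged pair is a candidate for removing one feature or for combining the two, but pairwise correlations alone can miss multicollinearity that involves three or more features jointly.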
With covariance and correlation established for random vectors, we are now prepared to study the most important
multivariate distribution in statistics and machine learning. We next introduce the
multivariate normal distribution, whose shape is entirely determined
by a mean vector and a covariance matrix.