Cross-Covariance
In Part 6, the covariance matrix
\[
\Sigma = \text{Cov}[\boldsymbol{x}]
\]
captured pairwise relationships among the components of a single random vector \(\boldsymbol{x}\).
In many applications, however, we work with two distinct sets of variables: for instance, relating a set
of input features to a set of output measurements, or comparing gene expression levels across different
experimental conditions. The cross-covariance extends the idea of covariance to pairs of
variables drawn from different datasets.
Definition: Sample Cross-Covariance Matrix
Given two data matrices \(A \in \mathbb{R}^{m \times n_1}\) and
\(B \in \mathbb{R}^{m \times n_2}\) with the same number of observations \(m\),
the sample cross-covariance matrix is
\[
K_{AB} = \frac{1}{m-1}(A - \bar{A})^\top (B - \bar{B}) \in \mathbb{R}^{n_1 \times n_2},
\]
where \(\bar{A}\) and \(\bar{B}\) denote the matrices of column means.
Note that in general \(K_{AB}\) is not square (unless \(n_1 = n_2\)) and not symmetric; instead, \(K_{BA} = K_{AB}^\top\).
In the special case \(A = B\), the cross-covariance reduces to the ordinary (sample) covariance matrix,
sometimes called the auto-covariance matrix and denoted \(K_{AA}\), which is exactly
the matrix \(S\) we studied in Part 6.
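To make the definition concrete, here is a minimal NumPy sketch (the helper name `sample_cross_covariance` and the random test matrices are illustrative choices, not part of the text above):

```python
import numpy as np

def sample_cross_covariance(A, B):
    """K_AB = (A - A_bar)^T (B - B_bar) / (m - 1) for data matrices with m rows."""
    m = A.shape[0]
    A_c = A - A.mean(axis=0)          # subtract column means of A
    B_c = B - B.mean(axis=0)          # subtract column means of B
    return A_c.T @ B_c / (m - 1)      # (n1, n2) cross-covariance matrix

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))         # m = 100 observations, n1 = 3 variables
B = rng.normal(size=(100, 2))         # same m, n2 = 2 variables
K_AB = sample_cross_covariance(A, B)
print(K_AB.shape)                     # (3, 2): not square, not symmetric in general

# Special case A = B: reduces to the ordinary sample covariance matrix.
assert np.allclose(sample_cross_covariance(A, A), np.cov(A, rowvar=False))
```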
While cross-covariance quantifies how variables from two datasets co-vary, its magnitude depends on the scales of
the variables involved. To obtain a scale-free measure, we turn to the correlation coefficient.
Correlation
Covariance tells us whether two variables tend to increase or decrease together, but its numerical value
depends on the units and scales of the variables. For example, measuring the same pair of lengths in
millimeters rather than kilometers multiplies their covariance by \(10^{12}\), even though the underlying
relationship is unchanged, so the raw magnitude of \(\text{Cov}[X, Y]\) says little on its own.
To obtain a scale-free measure of linear association, we normalize the covariance
by the standard deviations.
Definition: Population Correlation Coefficient
Let \(X\) and \(Y\) be random variables with standard deviations
\(\sigma_X = \sqrt{\mathbb{E}[(X - \mu_X)^2]}\) and
\(\sigma_Y = \sqrt{\mathbb{E}[(Y - \mu_Y)^2]}\), respectively. Then
the population correlation coefficient is
\[
\begin{align*}
\rho_{X, Y} &= \text{Corr}[X, Y] \\\\
&= \frac{\text{Cov}[X, Y]}{\sigma_X \sigma_Y} \\\\
&= \frac{\mathbb{E}[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}.
\end{align*}
\]
The correlation coefficient is a dimensionless quantity that measures the strength
and direction of the linear relationship between \(X\) and \(Y\).
Definition: Sample Correlation Coefficient
Given observed data \((x_1, y_1), \ldots, (x_n, y_n)\), the
sample correlation coefficient is
\[
r_{xy} = \frac{1}{(n-1)\,s_x\, s_y}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})
\]
where \(\bar{x}, \bar{y}\) are sample means and \(s_x, s_y\) are the
corrected sample standard deviations:
\[
s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2},
\]
and analogously for \(s_y\). The factor \(n - 1\) (Bessel's correction) ensures that \(s_x^2\) is an
unbiased estimator of the population variance \(\sigma_X^2\).
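As a quick check of these formulas, the sketch below computes \(r_{xy}\) directly from the definition and compares it with NumPy's built-in `np.corrcoef` (the function name `sample_correlation` and the synthetic data are our own, purely for illustration):

```python
import numpy as np

def sample_correlation(x, y):
    """r_xy = sum((x_i - x_bar)(y_i - y_bar)) / ((n - 1) s_x s_y)."""
    n = len(x)
    x_c, y_c = x - x.mean(), y - y.mean()
    s_x = np.sqrt(np.sum(x_c**2) / (n - 1))    # corrected sample standard deviation
    s_y = np.sqrt(np.sum(y_c**2) / (n - 1))
    return np.sum(x_c * y_c) / ((n - 1) * s_x * s_y)

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)  # noisy positive linear relationship
r = sample_correlation(x, y)
print(round(r, 3))                             # close to +1
assert np.isclose(r, np.corrcoef(x, y)[0, 1])  # agrees with NumPy's implementation
```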
A natural question is: what range of values can \(\rho\) take? The following
theorem shows that it is always bounded between \(-1\) and \(1\).
Theorem 1: Boundedness of Correlation Coefficient
\[
-1 \leq \rho \leq 1
\]
Note:
- \(\rho = 1\) indicates a perfect positive linear relationship.
- \(\rho = -1\) indicates a perfect negative linear relationship.
- \(\rho = 0\) indicates no linear relationship (though \(X\) and \(Y\) may still be nonlinearly dependent).
Proof:
We use the Cauchy-Schwarz inequality for random variables:
\[
(\mathbb{E}[XY])^2 \leq \mathbb{E}[X^2]\, \mathbb{E}[Y^2].
\]
Note: here the Cauchy-Schwarz inequality is applied with \(\langle X, Y \rangle = \mathbb{E}[XY]\), which defines an inner product on the space of square-integrable random variables.
Substituting the standardized variables \((X - \mathbb{E}[X])/\sigma_X\) and \((Y - \mathbb{E}[Y])/\sigma_Y\) into this inequality gives
\[
\begin{align*}
& \left(\mathbb{E}\left[\frac{X - \mathbb{E}[X]}{\sigma_X}\cdot \frac{Y - \mathbb{E}[Y]}{\sigma_Y}\right]\right)^2
\leq \mathbb{E}\left[\left(\frac{X - \mathbb{E}[X]}{\sigma_X}\right)^2\right] \, \mathbb{E}\left[\left(\frac{Y - \mathbb{E}[Y]}{\sigma_Y}\right)^2\right] \\\\
&\Longrightarrow \rho^2 \leq 1 \cdot 1 \\\\
&\Longrightarrow -1 \leq \rho \leq 1.
\end{align*}
\]
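The bound can also be checked numerically. This small sketch (with arbitrary synthetic data) confirms that sample correlations never leave \([-1, 1]\), up to floating-point rounding:

```python
import numpy as np

# Empirical check of Theorem 1 on many randomly generated data sets.
rng = np.random.default_rng(2)
for _ in range(1000):
    x = rng.normal(size=50)
    y = rng.uniform(-2, 2) * x + rng.normal(size=50)  # random linear mixing plus noise
    r = np.corrcoef(x, y)[0, 1]
    assert -1.0 - 1e-12 <= r <= 1.0 + 1e-12
print("all sample correlations lie within [-1, 1]")
```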
Having established the correlation coefficient for a pair of random variables, we now extend this idea to an entire
random vector. Just as the covariance matrix collects all pairwise covariances, the correlation matrix
collects all pairwise correlations. It is a standardized version of the covariance matrix in which every entry
lies in \([-1, 1]\). This standardization is particularly useful when features have significantly different scales:
for example, in PCA, performing eigendecomposition on the correlation matrix
rather than the covariance matrix ensures that no single feature dominates the analysis merely because of its scale.
For a random vector \(\boldsymbol{x} \in \mathbb{R}^n\), the correlation matrix is defined as:
\[
\begin{align*}
R &= \text{Corr }[\boldsymbol{x}] \\\\
&= \begin{bmatrix}
1 & \frac{\mathbb{E }[(X_1-\mu_1)(X_2-\mu_2)]}{\sigma_1 \sigma_2} & \cdots & \frac{\mathbb{E }[(X_1-\mu_1)(X_n-\mu_n)]}{\sigma_1 \sigma_n} \\
\frac{\mathbb{E }[(X_2-\mu_2)(X_1-\mu_1)]}{\sigma_2 \sigma_1} & 1 & \cdots & \frac{\mathbb{E }[(X_2-\mu_2)(X_n-\mu_n)]}{\sigma_2 \sigma_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\mathbb{E }[(X_n-\mu_n)(X_1-\mu_1)]}{\sigma_n \sigma_1} & \frac{\mathbb{E }[(X_n-\mu_n)(X_2-\mu_2)]}{\sigma_n \sigma_2} & \cdots & 1
\end{bmatrix} \\\\
&= \begin{bmatrix}
1 & \text{Corr }[X_1, X_2] & \cdots & \text{Corr }[X_1, X_n] \\
\text{Corr }[X_2, X_1] & 1 & \cdots & \text{Corr }[X_2, X_n] \\
\vdots & \vdots & \ddots & \vdots \\
\text{Corr }[X_n, X_1] & \text{Corr }[X_n, X_2] & \cdots & 1
\end{bmatrix} \\\\
\end{align*}
\]
where \(\mu_i = \mathbb{E}[X_i]\) is the mean and \(\sigma_i = \sqrt{\text{Var }[X_i]}\) is the
standard deviation of \(X_i\).
The correlation matrix \(R\) can be obtained from the auto-covariance matrix \(K_{xx}\) alone:
\[
\begin{align*}
K_{xx} &= \mathbb{E}[(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])^\top] \\\\
&= \begin{bmatrix}
\text{Var }[X_1] & \text{Cov }[X_1, X_2] & \cdots & \text{Cov }[X_1, X_n] \\
\text{Cov }[X_2, X_1] & \text{Var }[X_2] & \cdots & \text{Cov }[X_2, X_n] \\
\vdots & \vdots & \ddots & \vdots \\
\text{Cov }[X_n, X_1] & \text{Cov }[X_n, X_2] & \cdots & \text{Var }[X_n]
\end{bmatrix}.
\end{align*}
\]
Next, define \(\text{diag }(K_{xx})\) as the diagonal matrix whose diagonal entries are the variances
\(\text{Var }[X_i]\). Its inverse square root is given by:
\[
(\text{diag }(K_{xx}))^{-\frac{1}{2}} = \begin{bmatrix}
\frac{1}{\sqrt{\text{Var }[X_1]}} & 0 & \cdots & 0 \\
0 & \frac{1}{\sqrt{\text{Var }[X_2]}} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{1}{\sqrt{\text{Var }[X_n]}}
\end{bmatrix}.
\]
To standardize the auto-covariance matrix \(K_{xx}\) into the correlation matrix \(R\), each covariance entry \(\text{Cov }[X_i, X_j]\)
is divided by the product of the standard deviations \(\sqrt{\text{Var }[X_i]} \cdot \sqrt{\text{Var }[X_j]}\).
Pre- and post-multiplying \(K_{xx}\) by \((\text{diag }(K_{xx}))^{-\frac{1}{2}}\) achieves exactly this:
\[
R = (\text{diag }(K_{xx}))^{-\frac{1}{2}}K_{xx}(\text{diag }(K_{xx}))^{-\frac{1}{2}}.
\]
This formulation normalizes all rows and columns simultaneously, by dividing each covariance by the corresponding
standard deviations, and avoids computing each pairwise correlation explicitly.
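The identity is easy to verify numerically. In the sketch below (synthetic data with deliberately mismatched feature scales, chosen only for illustration), sandwiching the covariance matrix between \((\text{diag }(K_{xx}))^{-\frac{1}{2}}\) on both sides reproduces NumPy's correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
# Four features on very different scales (scale factors are arbitrary).
X = rng.normal(size=(500, 4)) * np.array([1.0, 10.0, 0.1, 100.0])

K = np.cov(X, rowvar=False)                       # auto-covariance matrix K_xx
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(K)))   # (diag(K_xx))^{-1/2}
R = D_inv_sqrt @ K @ D_inv_sqrt                   # R = D^{-1/2} K_xx D^{-1/2}

assert np.allclose(R, np.corrcoef(X, rowvar=False))  # matches pairwise correlations
assert np.allclose(np.diag(R), 1.0)                  # unit diagonal, entries in [-1, 1]
```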
Connections to Machine Learning
The correlation matrix plays a key role in feature engineering and model diagnostics. Highly correlated
features indicate multicollinearity, which can destabilize
linear regression
and inflate the variance of coefficient estimates. In practice, examining the correlation matrix
before training helps identify redundant features, guiding decisions about feature selection
or dimensionality reduction via PCA. In
Gaussian processes, the kernel
function can be viewed as defining a correlation structure over function values.
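As a rough illustration of such a diagnostic, the sketch below flags feature pairs whose absolute sample correlation exceeds a threshold; the helper `highly_correlated_pairs` and the 0.9 cutoff are illustrative choices, not a standard API:

```python
import numpy as np

def highly_correlated_pairs(X, threshold=0.9):
    """Return (i, j, r_ij) for i < j with |r_ij| above the threshold."""
    R = np.corrcoef(X, rowvar=False)
    n = R.shape[0]
    return [(i, j, R[i, j]) for i in range(n) for j in range(i + 1, n)
            if abs(R[i, j]) > threshold]

# Example: feature 2 is nearly a copy of feature 0, a typical source of multicollinearity.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=300)
print(highly_correlated_pairs(X))  # expected: something like [(0, 2, 0.999...)]
```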
With covariance and correlation established for random vectors, we are now prepared to study the most important
multivariate distribution in statistics and machine learning. In the next part,
we introduce the multivariate normal distribution, whose shape is entirely determined
by a mean vector and a covariance matrix.