Cross-Covariance
In Part 6, the covariance matrix
\[
\Sigma = \text{Cov}[\boldsymbol{x}]
\]
captured pairwise relationships among the components of a single random vector \(\boldsymbol{x}\).
In many applications, however, we work with two distinct sets of variables: for instance, relating a set
of input features to a set of output measurements, or comparing gene expression levels across different
experimental conditions. The cross-covariance extends the idea of covariance to pairs of
variables drawn from different datasets.
Definition: Sample Cross-Covariance Matrix
Given two data matrices \(A \in \mathbb{R}^{m \times n_1}\) and
\(B \in \mathbb{R}^{m \times n_2}\) with the same number of observations \(m\),
the sample cross-covariance matrix is
\[
K_{AB} = \frac{1}{m-1}(A - \bar{A})^\top (B - \bar{B}) \in \mathbb{R}^{n_1 \times n_2},
\]
where \(\bar{A}\) and \(\bar{B}\) denote the matrices of column means.
Note that in general \(K_{AB}\) is not square (unless \(n_1 = n_2\)) and not symmetric; instead, \(K_{BA} = K_{AB}^\top\).
In the special case \(A = B\), the cross-covariance reduces to the ordinary (sample) covariance matrix,
sometimes called the auto-covariance matrix and denoted \(K_{AA}\), which is exactly
the matrix \(S\) we studied in Part 6.
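To make the definition concrete, here is a minimal NumPy sketch (the helper name `sample_cross_covariance` and the random test matrices are illustrative choices, not part of the text above):

```python
import numpy as np

def sample_cross_covariance(A, B):
    """K_AB = (A - A_bar)^T (B - B_bar) / (m - 1) for data matrices with m rows."""
    m = A.shape[0]
    A_c = A - A.mean(axis=0)          # subtract column means of A
    B_c = B - B.mean(axis=0)          # subtract column means of B
    return A_c.T @ B_c / (m - 1)      # (n1, n2) cross-covariance matrix

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))         # m = 100 observations, n1 = 3 variables
B = rng.normal(size=(100, 2))         # same m, n2 = 2 variables
K_AB = sample_cross_covariance(A, B)
print(K_AB.shape)                     # (3, 2): not square, not symmetric in general

# Special case A = B: reduces to the ordinary sample covariance matrix.
assert np.allclose(sample_cross_covariance(A, A), np.cov(A, rowvar=False))
```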
While cross-covariance quantifies how variables from two datasets co-vary, its magnitude depends on the scales of
the variables involved. To obtain a scale-free measure, we turn to the correlation coefficient.
Correlation
Covariance tells us whether two variables tend to increase or decrease together, but its numerical value
depends on the units and scales of the variables. For example, measuring the same pair of lengths in
millimeters rather than kilometers multiplies their covariance by \(10^{12}\), even though the underlying
relationship is unchanged, so the raw magnitude of \(\text{Cov}[X, Y]\) says little on its own.
To obtain a scale-free measure of linear association, we normalize the covariance
by the standard deviations.
Definition: Population Correlation Coefficient
Let \(X\) and \(Y\) be random variables with standard deviations
\(\sigma_X = \sqrt{\mathbb{E}[(X - \mu_X)^2]}\) and
\(\sigma_Y = \sqrt{\mathbb{E}[(Y - \mu_Y)^2]}\), respectively. Then
the population correlation coefficient is
\[
\begin{align*}
\rho_{X, Y} &= \text{Corr}[X, Y] \\\\
&= \frac{\text{Cov}[X, Y]}{\sigma_X \sigma_Y} \\\\
&= \frac{\mathbb{E}[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}.
\end{align*}
\]
The correlation coefficient is a dimensionless quantity that measures the strength
and direction of the linear relationship between \(X\) and \(Y\).
Definition: Sample Correlation Coefficient
Given observed data \((x_1, y_1), \ldots, (x_n, y_n)\), the
sample correlation coefficient is
\[
r_{xy} = \frac{1}{(n-1)\,s_x\, s_y}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})
\]
where \(\bar{x}, \bar{y}\) are sample means and \(s_x, s_y\) are the
corrected sample standard deviations:
\[
s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2},
\]
and analogously for \(s_y\). The factor \(n - 1\) (Bessel's correction) ensures that \(s_x^2\) is an
unbiased estimator of the population variance \(\sigma_X^2\).
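As a quick check of these formulas, the sketch below computes \(r_{xy}\) directly from the definition and compares it with NumPy's built-in `np.corrcoef` (the function name `sample_correlation` and the synthetic data are our own, purely for illustration):

```python
import numpy as np

def sample_correlation(x, y):
    """r_xy = sum((x_i - x_bar)(y_i - y_bar)) / ((n - 1) s_x s_y)."""
    n = len(x)
    x_c, y_c = x - x.mean(), y - y.mean()
    s_x = np.sqrt(np.sum(x_c**2) / (n - 1))    # corrected sample standard deviation
    s_y = np.sqrt(np.sum(y_c**2) / (n - 1))
    return np.sum(x_c * y_c) / ((n - 1) * s_x * s_y)

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)  # noisy positive linear relationship
r = sample_correlation(x, y)
print(round(r, 3))                             # close to +1
assert np.isclose(r, np.corrcoef(x, y)[0, 1])  # agrees with NumPy's implementation
```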
A natural question is: what range of values can \(\rho\) take? The following
theorem shows that it is always bounded between \(-1\) and \(1\).
Theorem 1: Boundedness of Correlation Coefficient
\[
-1 \leq \rho \leq 1
\]
Note:
- \(\rho = 1\) indicates a perfect positive linear relationship.
- \(\rho = -1\) indicates a perfect negative linear relationship.
- \(\rho = 0\) indicates no linear relationship (though \(X\) and \(Y\) may still be nonlinearly dependent).
Proof:
We use the Cauchy-Schwarz inequality for random variables:
\[
(\mathbb{E}[XY])^2 \leq \mathbb{E}[X^2]\, \mathbb{E}[Y^2].
\]
Note: here the Cauchy-Schwarz inequality is applied with \(\langle X, Y \rangle = \mathbb{E}[XY]\), which defines an inner product on the space of square-integrable random variables.
Substituting the standardized variables \((X - \mathbb{E}[X])/\sigma_X\) and \((Y - \mathbb{E}[Y])/\sigma_Y\) into this inequality gives
\[
\begin{align*}
& \left(\mathbb{E}\left[\frac{X - \mathbb{E}[X]}{\sigma_X}\cdot \frac{Y - \mathbb{E}[Y]}{\sigma_Y}\right]\right)^2
\leq \mathbb{E}\left[\left(\frac{X - \mathbb{E}[X]}{\sigma_X}\right)^2\right] \, \mathbb{E}\left[\left(\frac{Y - \mathbb{E}[Y]}{\sigma_Y}\right)^2\right] \\\\
&\Longrightarrow \rho^2 \leq 1 \cdot 1 \\\\
&\Longrightarrow -1 \leq \rho \leq 1.
\end{align*}
\]
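The bound can also be checked numerically. This small sketch (with arbitrary synthetic data) confirms that sample correlations never leave \([-1, 1]\), up to floating-point rounding:

```python
import numpy as np

# Empirical check of Theorem 1 on many randomly generated data sets.
rng = np.random.default_rng(2)
for _ in range(1000):
    x = rng.normal(size=50)
    y = rng.uniform(-2, 2) * x + rng.normal(size=50)  # random linear mixing plus noise
    r = np.corrcoef(x, y)[0, 1]
    assert -1.0 - 1e-12 <= r <= 1.0 + 1e-12
print("all sample correlations lie within [-1, 1]")
```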
Having established the correlation coefficient for a pair of random variables, we now extend this idea to an entire
random vector. Just as the covariance matrix collects all pairwise covariances, the correlation matrix
collects all pairwise correlations. It is a standardized version of the covariance matrix in which every entry
lies in \([-1, 1]\). This standardization is particularly useful when features have significantly different scales:
for example, in PCA, performing eigendecomposition on the correlation matrix
rather than the covariance matrix ensures that no single feature dominates the analysis merely because of its scale.
For a random vector \(\boldsymbol{x} \in \mathbb{R}^n\), the correlation matrix is defined as:
\[
\begin{align*}
R &= \text{Corr }[\boldsymbol{x}] \\\\
&= \begin{bmatrix}
1 & \frac{\mathbb{E }[(X_1-\mu_1)(X_2-\mu_2)]}{\sigma_1 \sigma_2} & \cdots & \frac{\mathbb{E }[(X_1-\mu_1)(X_n-\mu_n)]}{\sigma_1 \sigma_n} \\
\frac{\mathbb{E }[(X_2-\mu_2)(X_1-\mu_1)]}{\sigma_2 \sigma_1} & 1 & \cdots & \frac{\mathbb{E }[(X_2-\mu_2)(X_n-\mu_n)]}{\sigma_2 \sigma_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\mathbb{E }[(X_n-\mu_n)(X_1-\mu_1)]}{\sigma_n \sigma_1} & \frac{\mathbb{E }[(X_n-\mu_n)(X_2-\mu_2)]}{\sigma_n \sigma_2} & \cdots & 1
\end{bmatrix} \\\\
&= \begin{bmatrix}
1 & \text{Corr }[X_1, X_2] & \cdots & \text{Corr }[X_1, X_n] \\
\text{Corr }[X_2, X_1] & 1 & \cdots & \text{Corr }[X_2, X_n] \\
\vdots & \vdots & \ddots & \vdots \\
\text{Corr }[X_n, X_1] & \text{Corr }[X_n, X_2] & \cdots & 1
\end{bmatrix} \\\\
\end{align*}
\]
where \(\mu_i = \mathbb{E}[X_i]\) is the mean and \(\sigma_i = \sqrt{\text{Var }[X_i]}\) is the
standard deviation of \(X_i\).
The correlation matrix \(R\) can be obtained from the auto-covariance matrix \(K_{xx}\) alone:
\[
\begin{align*}
K_{xx} &= \mathbb{E}[(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])^\top] \\\\
&= \begin{bmatrix}
\text{Var }[X_1] & \text{Cov }[X_1, X_2] & \cdots & \text{Cov }[X_1, X_n] \\
\text{Cov }[X_2, X_1] & \text{Var }[X_2] & \cdots & \text{Cov }[X_2, X_n] \\
\vdots & \vdots & \ddots & \vdots \\
\text{Cov }[X_n, X_1] & \text{Cov }[X_n, X_2] & \cdots & \text{Var }[X_n]
\end{bmatrix}.
\end{align*}
\]
Next, define \(\text{diag }(K_{xx})\) as the diagonal matrix whose diagonal entries are the variances
\(\text{Var }[X_i]\). Its inverse square root is given by:
\[
(\text{diag }(K_{xx}))^{-\frac{1}{2}} = \begin{bmatrix}
\frac{1}{\sqrt{\text{Var }[X_1]}} & 0 & \cdots & 0 \\
0 & \frac{1}{\sqrt{\text{Var }[X_2]}} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{1}{\sqrt{\text{Var }[X_n]}}
\end{bmatrix}.
\]
To standardize the auto-covariance matrix \(K_{xx}\) into the correlation matrix \(R\), each covariance entry \(\text{Cov }[X_i, X_j]\)
is divided by the product of the standard deviations \(\sqrt{\text{Var }[X_i]} \cdot \sqrt{\text{Var }[X_j]}\).
Pre- and post-multiplying \(K_{xx}\) by \((\text{diag }(K_{xx}))^{-\frac{1}{2}}\) achieves exactly this:
\[
R = (\text{diag }(K_{xx}))^{-\frac{1}{2}}K_{xx}(\text{diag }(K_{xx}))^{-\frac{1}{2}}.
\]
This formulation normalizes all rows and columns simultaneously, by dividing each covariance by the corresponding
standard deviations, and avoids computing each pairwise correlation explicitly.
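The identity is easy to verify numerically. In the sketch below (synthetic data with deliberately mismatched feature scales, chosen only for illustration), sandwiching the covariance matrix between \((\text{diag }(K_{xx}))^{-\frac{1}{2}}\) on both sides reproduces NumPy's correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
# Four features on very different scales (scale factors are arbitrary).
X = rng.normal(size=(500, 4)) * np.array([1.0, 10.0, 0.1, 100.0])

K = np.cov(X, rowvar=False)                       # auto-covariance matrix K_xx
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(K)))   # (diag(K_xx))^{-1/2}
R = D_inv_sqrt @ K @ D_inv_sqrt                   # R = D^{-1/2} K_xx D^{-1/2}

assert np.allclose(R, np.corrcoef(X, rowvar=False))  # matches pairwise correlations
assert np.allclose(np.diag(R), 1.0)                  # unit diagonal, entries in [-1, 1]
```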
Connections to Machine Learning
The correlation matrix plays a key role in feature engineering and model diagnostics. Highly correlated
features indicate multicollinearity, which can destabilize
linear regression
and inflate the variance of coefficient estimates. In practice, examining the correlation matrix
before training helps identify redundant features, guiding decisions about feature selection
or dimensionality reduction via PCA. In
Gaussian processes, the kernel
function can be viewed as defining a correlation structure over function values.
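As a rough illustration of such a diagnostic, the sketch below flags feature pairs whose absolute sample correlation exceeds a threshold; the helper `highly_correlated_pairs` and the 0.9 cutoff are illustrative choices, not a standard API:

```python
import numpy as np

def highly_correlated_pairs(X, threshold=0.9):
    """Return (i, j, r_ij) for i < j with |r_ij| above the threshold."""
    R = np.corrcoef(X, rowvar=False)
    n = R.shape[0]
    return [(i, j, R[i, j]) for i in range(n) for j in range(i + 1, n)
            if abs(R[i, j]) > threshold]

# Example: feature 2 is nearly a copy of feature 0, a typical source of multicollinearity.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=300)
print(highly_correlated_pairs(X))  # expected: something like [(0, 2, 0.999...)]
```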
With covariance and correlation established for random vectors, we are now prepared to study the most important
multivariate distribution in statistics and machine learning. In the next part,
we introduce the multivariate normal distribution, whose shape is entirely determined
by a mean vector and a covariance matrix.