Multivariate Distributions

  • Multivariate Normal Distribution
  • Cholesky Decomposition
  • Dirichlet Distribution
  • Wishart Distribution

Multivariate Normal Distribution

In the previous two parts, we developed the covariance matrix and the correlation coefficient as tools for quantifying linear relationships among random variables. We also saw in Part 4 that the univariate Gaussian \(\mathcal{N}(\mu, \sigma^2)\) is characterized by its mean and variance. The multivariate normal distribution (MVN) is the natural extension of the Gaussian to random vectors, and it is arguably the most important joint distribution in all of statistics and machine learning. Its shape is entirely determined by a mean vector and a covariance matrix, making the concepts from Parts 6 and 7 indispensable for understanding its geometry.

Definition: Multivariate Normal Distribution

An \(n\)-dimensional random vector \(\boldsymbol{x} \in \mathbb{R}^n\) follows a multivariate normal distribution, written \(\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)\), if its p.d.f. is \[ f(\boldsymbol{x}) = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}} \exp\!\left[ -\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) \right], \tag{1} \] where \(\boldsymbol{\mu} = \mathbb{E}[\boldsymbol{x}] \in \mathbb{R}^n\) is the mean vector and \(\Sigma = \text{Cov}[\boldsymbol{x}] \in \mathbb{R}^{n \times n}\) is the (symmetric, positive definite) covariance matrix.

The expression inside the exponential (ignoring the factor of \(-\tfrac{1}{2}\)) is the squared Mahalanobis distance between \(\boldsymbol{x}\) and \(\boldsymbol{\mu}\): \[ d_{\Sigma}(\boldsymbol{x}, \boldsymbol{\mu})^2 = (\boldsymbol{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu}). \] Unlike the Euclidean distance, the Mahalanobis distance accounts for the correlation structure among variables. The contours of constant density are ellipsoids defined by \[ (\boldsymbol{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) = c, \] whose axes are aligned with the eigenvectors of \(\Sigma\) and whose lengths are proportional to the square roots of the corresponding eigenvalues.
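As a quick sanity check on this geometry, here is a minimal sketch (assuming NumPy and SciPy are available; the particular \(\boldsymbol{\mu}\) and \(\Sigma\) values are arbitrary illustrative choices). It computes the squared Mahalanobis distance directly from the definition, plugs it into Equation (1), compares with SciPy's implementation, and reads off the contour axes from the eigendecomposition of \(\Sigma\).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative (assumed) parameters for a 2-D MVN.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

x = np.array([2.5, -1.0])

# Squared Mahalanobis distance: (x - mu)^T Sigma^{-1} (x - mu).
diff = x - mu
d2 = diff @ np.linalg.solve(Sigma, diff)   # solve() avoids forming Sigma^{-1} explicitly
print("squared Mahalanobis distance:", d2)

# The density in Eq. (1) depends on x only through this quadratic form.
n = len(mu)
manual_pdf = np.exp(-0.5 * d2) / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
print("manual pdf:", manual_pdf)
print("scipy pdf: ", multivariate_normal(mean=mu, cov=Sigma).pdf(x))

# Contour axes: eigenvectors of Sigma give the directions,
# sqrt(eigenvalues) the relative axis lengths.
eigvals, eigvecs = np.linalg.eigh(Sigma)
print("axis directions (columns):\n", eigvecs)
print("relative axis lengths:", np.sqrt(eigvals))
```

The two pdf values should agree to floating-point precision, confirming that Equation (1) matches SciPy's implementation.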

To build concrete intuition, let us work out the special case \(n = 2\) in full detail. When \(\boldsymbol{x} \in \mathbb{R}^2\), the MVN is called the bivariate normal distribution. In this case, \[ \begin{align*} \Sigma &= \begin{bmatrix} \operatorname{Var}(X_1) & \operatorname{Cov}[X_1, X_2] \\ \operatorname{Cov}[X_2, X_1] & \operatorname{Var}(X_2) \end{bmatrix} \\\\ &= \begin{bmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{bmatrix}, \end{align*} \] where \(\rho\) is the correlation coefficient, \[ \rho = \operatorname{Corr}[X_1, X_2] = \frac{\operatorname{Cov}[X_1, X_2]}{\sqrt{\operatorname{Var}(X_1)\operatorname{Var}(X_2)}}. \]

Then \[ \begin{align*} \det (\Sigma) &= \sigma_1^2 \sigma_2^2 - \rho^2 \sigma_1^2 \sigma_2^2 \\\\ &= \sigma_1^2 \sigma_2^2 (1 - \rho^2) \end{align*} \] and \[ \begin{align*} \Sigma^{-1} &= \frac{1}{\det (\Sigma )} \begin{bmatrix} \sigma_2^2 & -\rho \sigma_1 \sigma_2 \\ -\rho \sigma_1 \sigma_2 & \sigma_1^2 \end{bmatrix} \\\\ &= \frac{1}{1 - \rho^2} \begin{bmatrix} \frac{1}{\sigma_1^2 } & \frac{-\rho} {\sigma_1 \sigma_2} \\ \frac{-\rho} {\sigma_1 \sigma_2} & \frac{1}{\sigma_2^2 } \end{bmatrix} \end{align*} \]

Note that in Expression (1), \[ (\boldsymbol{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu}) \] is a quadratic form. So, \[ \begin{align*} (\boldsymbol{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu}) &= \frac{1}{1 - \rho^2} \begin{bmatrix} X_1 - \mu_1 & X_2 - \mu_2 \end{bmatrix} \begin{bmatrix} \frac{1}{\sigma_1^2 } & \frac{-\rho} {\sigma_1 \sigma_2} \\ \frac{-\rho} {\sigma_1 \sigma_2} & \frac{1}{\sigma_2^2 } \end{bmatrix} \begin{bmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \end{bmatrix} \\\\ &= \frac{1}{1 - \rho^2}\left[\frac{1}{\sigma_1^2 }(X_1 - \mu_1)^2 -\frac{2\rho} {\sigma_1 \sigma_2}(X_1 - \mu_1)(X_2 - \mu_2) +\frac{1}{\sigma_2^2 }(X_2 - \mu_2)^2 \right]. \end{align*} \] Therefore, we obtain the p.d.f. of the bivariate normal distribution: \[ f(\boldsymbol{x}) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp\Big\{-\frac{1}{2(1 - \rho^2)} \Big[\Big(\frac{X_1 - \mu_1}{\sigma_1}\Big)^2 -2\rho \Big(\frac{X_1 - \mu_1} {\sigma_1}\Big) \Big(\frac{X_2 - \mu_2} {\sigma_2}\Big) +\Big(\frac{X_2 - \mu_2}{\sigma_2}\Big)^2 \Big] \Big\}. \] When \(\rho = \pm 1\), \(\Sigma\) is singular, so \(\Sigma^{-1}\) does not exist and this p.d.f. is undefined; the distribution is then said to be degenerate.
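To verify the algebra, the sketch below (with arbitrary illustrative parameter values) evaluates this closed-form bivariate density and compares it against the general formula via scipy.stats.multivariate_normal.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative (assumed) bivariate parameters.
mu1, mu2 = 0.5, -1.0
s1, s2, rho = 1.2, 0.7, 0.6

Sigma = np.array([[s1**2,         rho * s1 * s2],
                  [rho * s1 * s2, s2**2        ]])

x1, x2 = 1.0, -0.5

# Closed-form bivariate normal density derived above.
z = ((x1 - mu1) / s1) ** 2 \
    - 2 * rho * ((x1 - mu1) / s1) * ((x2 - mu2) / s2) \
    + ((x2 - mu2) / s2) ** 2
f_bivariate = np.exp(-z / (2 * (1 - rho**2))) / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2))

# General MVN density for comparison.
f_general = multivariate_normal(mean=[mu1, mu2], cov=Sigma).pdf([x1, x2])

print(f_bivariate, f_general)   # should agree to floating-point precision
```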

Cholesky Decomposition

The previous section defined the MVN in terms of its density function, but in practice we frequently need to sample from \(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\), for instance when training variational autoencoders, implementing the reparameterization trick, or running Monte Carlo simulations. Generating independent standard normal samples is straightforward, but we need a way to introduce the correlation structure encoded in \(\boldsymbol{\Sigma}\). The Cholesky decomposition provides a numerically stable factorization that transforms uncorrelated samples into correlated ones.

Cholesky Decomposition:

Let \( \boldsymbol{A} \in \mathbb{R}^{d \times d} \) be a symmetric positive definite matrix. Then there exists a unique lower triangular matrix \( \boldsymbol{L} \) with positive diagonal entries such that \[ \boldsymbol{A} = \boldsymbol{L}\boldsymbol{L}^\top. \] (Equivalently, \( \boldsymbol{A} = \boldsymbol{R}^\top \boldsymbol{R} \) where \( \boldsymbol{R} = \boldsymbol{L}^\top \) is upper triangular.)
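As an illustration (the matrix below is an arbitrary assumed example), np.linalg.cholesky returns exactly this lower triangular factor:

```python
import numpy as np

# An assumed symmetric positive definite matrix.
A = np.array([[4.0, 2.0, 0.6],
              [2.0, 5.0, 1.5],
              [0.6, 1.5, 3.0]])

L = np.linalg.cholesky(A)        # lower triangular factor with positive diagonal
print(L)
print(np.allclose(L @ L.T, A))   # True: A = L L^T
```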

Let \( \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \) and apply the Cholesky decomposition to its covariance matrix: \[ \boldsymbol{\Sigma} = \boldsymbol{L}\boldsymbol{L}^\top. \] If \( \boldsymbol{x} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}) \), which can be obtained by sampling \( d \) independent standard normal variables, then \[ \boldsymbol{y} = \boldsymbol{L}\boldsymbol{x} + \boldsymbol{\mu} \] has the desired covariance: \[ \begin{align*} \text{Cov}[\boldsymbol{y}] &= \boldsymbol{L}\,\text{Cov}[\boldsymbol{x}]\,\boldsymbol{L}^\top \\ &= \boldsymbol{L}\boldsymbol{I}\boldsymbol{L}^\top \\ &= \boldsymbol{\Sigma}. \end{align*} \]

Proof:

By linearity of expectation, \[ \mathbb{E}[\boldsymbol{y}] = \mathbb{E}[\boldsymbol{L}\boldsymbol{x} + \boldsymbol{\mu}] = \boldsymbol{L}\mathbb{E}[\boldsymbol{x}] + \boldsymbol{\mu} = \boldsymbol{\mu}. \] The covariance of a random vector is given by \[ \text{Cov}[\boldsymbol{y}] = \mathbb{E}\!\left[(\boldsymbol{y} - \mathbb{E}[\boldsymbol{y}]) (\boldsymbol{y} - \mathbb{E}[\boldsymbol{y}])^\top \right]. \] Substituting \( \boldsymbol{y} = \boldsymbol{L}\boldsymbol{x} + \boldsymbol{\mu} \) and using \( \mathbb{E}[\boldsymbol{x}] = \boldsymbol{0} \), \[ \begin{align*} \text{Cov}[\boldsymbol{y}] &= \mathbb{E}\!\left[ (\boldsymbol{L}\boldsymbol{x}) (\boldsymbol{L}\boldsymbol{x})^\top \right] \\\\ &= \boldsymbol{L}\,\mathbb{E}\!\left[\boldsymbol{x}\boldsymbol{x}^\top\right]\boldsymbol{L}^\top \\\\ &= \boldsymbol{L}\,\text{Cov}[\boldsymbol{x}]\,\boldsymbol{L}^\top \\\\ &= \boldsymbol{L}\boldsymbol{I}\boldsymbol{L}^\top \\\\ &= \boldsymbol{L}\boldsymbol{L}^\top \\\\ &= \boldsymbol{\Sigma}. \end{align*} \] Therefore, \( \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{L}\boldsymbol{L}^\top) \).
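Here is a minimal sampling sketch, assuming NumPy and arbitrary illustrative values of \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\), showing that \(\boldsymbol{y} = \boldsymbol{L}\boldsymbol{x} + \boldsymbol{\mu}\) indeed reproduces the target mean and covariance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed target parameters.
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

L = np.linalg.cholesky(Sigma)            # Sigma = L L^T

# Draw standard normal vectors x ~ N(0, I), then transform: y = L x + mu.
n_samples = 200_000
x = rng.standard_normal((n_samples, 3))
y = x @ L.T + mu                         # row-wise version of y = L x + mu

print("empirical mean:\n", y.mean(axis=0))                  # close to mu
print("empirical covariance:\n", np.cov(y, rowvar=False))   # close to Sigma
```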

Dirichlet Distribution

The multivariate normal distribution models random vectors taking values in all of \(\mathbb{R}^n\). However, many quantities of interest in machine learning are probability vectors, whose entries are non-negative and sum to one. For example, the class probabilities output by a softmax layer, the topic proportions in a document, and the mixing weights in a mixture model all live on a probability simplex. We need a distribution that respects this constraint.

The Dirichlet distribution is a multivariate generalization of the beta distribution. It has support over the \((K - 1)\)-dimensional probability simplex, defined by \[ S_K = \Bigl\{(x_1, x_2, \dots, x_K) \in \mathbb{R}^K: x_k \ge 0,\ \sum_{k=1}^K x_k = 1 \Bigr\}. \] A random vector \(\boldsymbol{x} \in \mathbb{R}^K\) is said to have a Dirichlet distribution with parameters \(\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \dots, \alpha_K)\) (with each \(\alpha_k > 0\)) if its probability density function is given by \[ f(x_1, \dots, x_K; \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^K x_k^{\alpha_k - 1}, \quad (x_1, \dots, x_K) \in S_K, \] or equivalently \[ \text{Dir }(\boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^K x_k^{\alpha_k - 1} \mathbb{I}(\boldsymbol{x} \in S_K), \] where the multivariate beta function \(B(\boldsymbol{\alpha})\) is defined as \[ B(\boldsymbol{\alpha}) = \frac{\prod_{k=1}^K \Gamma(\alpha_k)}{\Gamma\Bigl(\sum_{k=1}^K \alpha_k\Bigr)}. \]
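A short check of this density (assuming SciPy; the \(\boldsymbol{\alpha}\) values and the point on the simplex are arbitrary illustrative choices) compares the formula above, with \(B(\boldsymbol{\alpha})\) computed from gamma functions, against scipy.stats.dirichlet:

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import dirichlet

alpha = np.array([2.0, 3.0, 5.0])        # assumed concentration parameters
x = np.array([0.2, 0.3, 0.5])            # a point on the probability simplex

# Multivariate beta function B(alpha).
B = np.prod(gamma(alpha)) / gamma(alpha.sum())

# Density from the definition: (1/B) * prod_k x_k^(alpha_k - 1).
f_manual = np.prod(x ** (alpha - 1)) / B

print(f_manual, dirichlet(alpha).pdf(x))   # should agree
```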

Moments:

Let \(\alpha_0 = \sum_{k=1}^K \alpha_k\). Then for each \(k\) (see the numerical check after this list),

  • Mean:
    \[ \mathbb{E}[x_k] = \frac{\alpha_k}{\alpha_0}. \]
  • Variance:
    \[ \operatorname{Var}[x_k] = \frac{\alpha_k (\alpha_0 - \alpha_k)}{\alpha_0^2 (\alpha_0+1)}. \]
  • Note: Often we use a symmetric Dirichlet prior of the form \(\alpha_k = \frac{\alpha}{K}\). Then \[ \mathbb{E}[x_k] = \frac{1}{K}, \quad \operatorname{Var}[x_k] = \frac{K-1}{K^2 (\alpha +1)}. \] We can see that increasing \(\alpha\) increases the precision (decreases the variance) of the distribution.
  • Covariance:
    For \(i \neq j\), \[ \operatorname{Cov}[x_i, x_j] = \frac{-\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0+1)}. \]
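The following sketch (assuming NumPy; the \(\boldsymbol{\alpha}\) vector is an arbitrary illustrative choice) estimates these moments from Monte Carlo samples and compares them with the formulas above:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 4.0])        # assumed concentration parameters
a0 = alpha.sum()

samples = rng.dirichlet(alpha, size=500_000)   # each row lies on the simplex

# Mean: alpha_k / alpha_0.
print("empirical mean:   ", samples.mean(axis=0))
print("theoretical mean: ", alpha / a0)

# Variance: alpha_k (alpha_0 - alpha_k) / (alpha_0^2 (alpha_0 + 1)).
print("empirical var:    ", samples.var(axis=0))
print("theoretical var:  ", alpha * (a0 - alpha) / (a0**2 * (a0 + 1)))

# Covariance for i != j: -alpha_i alpha_j / (alpha_0^2 (alpha_0 + 1)).
emp_cov = np.cov(samples, rowvar=False)
print("empirical Cov[x_1, x_2]:  ", emp_cov[0, 1])
print("theoretical Cov[x_1, x_2]:", -alpha[0] * alpha[1] / (a0**2 * (a0 + 1)))
```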

The parameters \(\alpha_k\) can be thought of as "pseudocounts" or prior observations of each category. When all \(\alpha_k\) are equal (i.e., \(\boldsymbol{\alpha} = \alpha\, \mathbf{1}\)), the distribution is called a symmetric Dirichlet; in the special case \(\alpha = 1\) it is uniform over the simplex. This symmetry makes the Dirichlet distribution a natural prior in Bayesian models where no category is favored a priori.

The Dirichlet distribution is widely used as a conjugate prior for the parameters of a multinomial distribution in Bayesian statistics, as well as in machine learning models such as latent Dirichlet allocation (LDA) for topic modeling.

Wishart Distribution

The Dirichlet distribution placed a prior on probability vectors. In Bayesian multivariate analysis, we also need priors on covariance matrices themselves, for example when inferring the parameters of a multivariate normal model. The Wishart distribution fills this role: it generalizes the gamma distribution from positive scalars to positive definite matrices.

Definition: Wishart Distribution

The p.d.f. of the Wishart distribution is defined as follows: \[ \operatorname{Wi}(\boldsymbol{\Sigma} \mid \boldsymbol{S}, \nu) = \frac{1}{Z} |\boldsymbol{\Sigma}| ^{(\nu-D-1)/2} \exp \Bigl( - \frac{1}{2} \operatorname{tr} (\boldsymbol{S}^{-1} \boldsymbol{\Sigma})\Bigr), \] where the normalization constant is \[ Z = 2^{\nu D/2}\, | \boldsymbol{S} |^{\nu / 2}\,\Gamma_D \Bigl(\frac{\nu}{2}\Bigr). \] Here, \(\Gamma_D(\cdot)\) denotes the multivariate gamma function. The parameters are:

  • \(\nu\): the degrees of freedom (which must satisfy \(\nu \ge D\) for the distribution to be well-defined),
  • \(\boldsymbol{S}\): the scale matrix (a \(D \times D\) positive definite matrix). Note that the mean of the distribution is \(\mathbb{E}[\boldsymbol{\Sigma}] = \nu \boldsymbol{S}\).

If \(D=1\), it reduces to the gamma distribution: \[ \operatorname{Wi}(\lambda \mid s, \nu) = \operatorname{Gamma}\Bigl(\lambda \Bigm| \text{ shape } = \frac{\nu}{2}, \text{ rate } = \frac{1}{2s}\Bigr). \] If we further set \(s=1\), this reduces to the chi-squared distribution with \(\nu\) degrees of freedom.
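This reduction to the gamma (and, for \(s = 1\), the chi-squared) distribution can be checked numerically; the sketch below assumes SciPy and arbitrary illustrative values of \(\nu\) and \(s\):

```python
import numpy as np
from scipy.stats import wishart, gamma, chi2

nu, s = 5, 0.7                      # assumed degrees of freedom and 1-D scale
lam = np.linspace(0.1, 10.0, 5)

# 1-D Wishart density vs. Gamma(shape = nu/2, rate = 1/(2s)), i.e. scale = 2s.
print(wishart.pdf(lam, df=nu, scale=s))
print(gamma.pdf(lam, a=nu / 2, scale=2 * s))

# With s = 1 the density is chi-squared with nu degrees of freedom.
print(wishart.pdf(lam, df=nu, scale=1))
print(chi2.pdf(lam, df=nu))
```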

Applications:

The Wishart distribution arises in two complementary roles. In frequentist statistics, it is the sampling distribution of the scatter matrix: if \(\boldsymbol{x}_1, \dots, \boldsymbol{x}_\nu\) are i.i.d. draws from \(\mathcal{N}(\boldsymbol{0}, \boldsymbol{S})\), then \(\sum_{i=1}^{\nu} \boldsymbol{x}_i \boldsymbol{x}_i^\top \sim \operatorname{Wi}(\boldsymbol{S}, \nu)\), which describes the uncertainty in the sample covariance estimator. In Bayesian statistics, it serves as the conjugate prior for the precision matrix of a multivariate normal (with the inverse Wishart playing the analogous role for the covariance matrix). Note that one role is about sampling properties of an estimator (frequentist), and the other is about prior beliefs and posterior inference (Bayesian). In practice, both deal with uncertainty in covariance matrices, and the same distribution arises in both roles - just with different interpretations.
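Both interpretations can be illustrated with a small simulation (assuming SciPy and an arbitrary \(2 \times 2\) scale matrix): scatter matrices of \(\nu\) zero-mean Gaussian draws and direct Wishart samples both average out to roughly \(\nu \boldsymbol{S}\).

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(0)

# Assumed parameters.
S = np.array([[1.0, 0.3],
              [0.3, 2.0]])
nu = 6
n_reps = 20_000

# Frequentist view: scatter matrix of nu i.i.d. draws from N(0, S).
draws = rng.multivariate_normal(np.zeros(2), S, size=(n_reps, nu))
scatter = np.einsum('rni,rnj->rij', draws, draws)   # sum_i x_i x_i^T per replicate
print("mean scatter matrix:\n", scatter.mean(axis=0))

# Bayesian-modelling view: direct sampling from Wi(S, nu).
W = wishart(df=nu, scale=S).rvs(size=n_reps)
print("mean Wishart sample:\n", W.mean(axis=0))

print("nu * S:\n", nu * S)   # both averages should be close to this
```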

Connections to Machine Learning

The multivariate distributions introduced in this part form the backbone of probabilistic machine learning. The MVN appears in Gaussian processes (where it defines a prior over functions), in the E-step of Gaussian mixture models, and in the reparameterization trick for variational autoencoders. The Dirichlet distribution is the standard prior for topic models (LDA) and categorical distributions in Bayesian inference. The Wishart and inverse Wishart distributions are essential whenever the covariance matrix itself is a parameter to be inferred, as in Bayesian multivariate regression and Bayesian Gaussian mixture models.

With the multivariate framework now in place from covariance matrices (Part 6) through correlation (Part 7) to multivariate distributions (this part), we have the probabilistic foundations needed for statistical inference. In Part 9, we turn to maximum likelihood estimation, the principal method for fitting probabilistic models to observed data.