Multivariate Distributions

Multivariate Normal Distribution Cholesky Decomposition Dirichlet Distribution Wishart Distribution

Multivariate Normal Distribution

In our earlier development of the covariance matrix and the correlation coefficient, we built tools for quantifying linear relationships among random variables. We also saw that the univariate Gaussian \(\mathcal{N}(\mu, \sigma^2)\) is characterized by its mean and variance. The multivariate normal distribution (MVN) is the natural extension of the Gaussian to random vectors, and it is arguably the most important joint distribution in all of statistics and machine learning. Its shape is entirely determined by a mean vector and a covariance matrix, making the concepts from our covariance and correlation discussions indispensable for understanding its geometry.

Definition: Multivariate Normal Distribution

A \(D\)-dimensional random vector \(\boldsymbol{x} \in \mathbb{R}^D\) follows a multivariate normal distribution, written \(\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)\), if its p.d.f. is \[ f(\boldsymbol{x}) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma)}} \exp\!\left[ -\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) \right], \] where \(\boldsymbol{\mu} = \mathbb{E}[\boldsymbol{x}] \in \mathbb{R}^D\) is the mean vector and \(\Sigma = \operatorname{Cov}[\boldsymbol{x}] \in \mathbb{R}^{D \times D}\) is the covariance matrix, here required to be strictly positive definite so that \(\Sigma^{-1}\) exists and \(\det(\Sigma) > 0\). When \(\Sigma\) is only positive semi-definite — a property guaranteed in general by the PSD theorem for covariance matrices — the distribution becomes degenerate and is supported on a lower-dimensional affine subspace, requiring a different formulation that we do not pursue here.

The expression inside the exponential (ignoring the factor of \(-\tfrac{1}{2}\)) is the squared Mahalanobis distance between \(\boldsymbol{x}\) and \(\boldsymbol{\mu}\).

Definition: Mahalanobis Distance

Let \(\Sigma \in \mathbb{R}^{D \times D}\) be a strictly positive definite matrix. The Mahalanobis distance between \(\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^D\) (with respect to \(\Sigma\)) is \[ d_{\Sigma}(\boldsymbol{x}, \boldsymbol{y}) = \sqrt{(\boldsymbol{x} - \boldsymbol{y})^\top \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{y})}. \]

Unlike the Euclidean distance, the Mahalanobis distance accounts for the correlation structure among variables. The contours of constant density of \(\mathcal{N}(\boldsymbol{\mu}, \Sigma)\) are ellipsoids defined by \[ (\boldsymbol{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) = c, \] whose axes are aligned with the eigenvectors of \(\Sigma\) and whose lengths are proportional to the square roots of the corresponding eigenvalues.

To build concrete intuition, let us work out the special case \(D = 2\) in full detail. When \(\boldsymbol{x} \in \mathbb{R}^2\), the MVN is called the bivariate normal distribution. In this case, \[ \begin{align*} \Sigma &= \begin{bmatrix} \operatorname{Var}[X_1] & \operatorname{Cov}[X_1, X_2] \\\\ \operatorname{Cov}[X_2, X_1] & \operatorname{Var}[X_2] \end{bmatrix} \\\\ &= \begin{bmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\\\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{bmatrix} \end{align*} \] where \(\rho\) is the correlation coefficient of \(X_1\) and \(X_2\), \[ \rho = \operatorname{Corr}[X_1, X_2] = \frac{\operatorname{Cov}[X_1, X_2]}{\sqrt{\operatorname{Var}[X_1]\,\operatorname{Var}[X_2]}}. \]

Then \[ \begin{align*} \det (\Sigma) &= \sigma_1^2 \sigma_2^2 - \rho^2 \sigma_1^2 \sigma_2^2 \\\\ &= \sigma_1^2 \sigma_2^2 (1 - \rho^2) \end{align*} \] and \[ \begin{align*} \Sigma^{-1} &= \frac{1}{\det (\Sigma )} \begin{bmatrix} \sigma_2^2 & -\rho \sigma_1 \sigma_2 \\\\ -\rho \sigma_1 \sigma_2 & \sigma_1^2 \end{bmatrix} \\\\ &= \frac{1}{1 - \rho^2} \begin{bmatrix} \frac{1}{\sigma_1^2 } & \frac{-\rho} {\sigma_1 \sigma_2} \\\\ \frac{-\rho} {\sigma_1 \sigma_2} & \frac{1}{\sigma_2^2 } \end{bmatrix} \end{align*} \]

Note that the exponent in the bivariate density is a quadratic form: \[ (\boldsymbol{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu}). \] Expanding, \[ \begin{align*} (\boldsymbol{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu}) &= \frac{1}{1 - \rho^2} \begin{bmatrix} X_1 - \mu_1 & X_2 - \mu_2 \end{bmatrix} \begin{bmatrix} \frac{1}{\sigma_1^2 } & \frac{-\rho} {\sigma_1 \sigma_2} \\\\ \frac{-\rho} {\sigma_1 \sigma_2} & \frac{1}{\sigma_2^2 } \end{bmatrix} \begin{bmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \end{bmatrix} \\\\ &= \frac{1}{1 - \rho^2}\left[\frac{1}{\sigma_1^2 }(X_1 - \mu_1)^2 -\frac{2\rho} {\sigma_1 \sigma_2}(X_1 - \mu_1)(X_2 - \mu_2) +\frac{1}{\sigma_2^2 }(X_2 - \mu_2)^2 \right]. \end{align*} \] Therefore, the p.d.f. of the bivariate normal distribution becomes \[ f(\boldsymbol{x}) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp\!\left\{-\frac{1}{2(1 - \rho^2)} \left[\left(\frac{X_1 - \mu_1}{\sigma_1}\right)^2 -2\rho \left(\frac{X_1 - \mu_1} {\sigma_1}\right) \left(\frac{X_2 - \mu_2} {\sigma_2}\right) +\left(\frac{X_2 - \mu_2}{\sigma_2}\right)^2 \right] \right\}. \] When \(\rho = -1\) or \(\rho = 1\), this density is undefined and \(f\) is said to be degenerate.

Cholesky Decomposition

The previous section defined the MVN in terms of its density function, but in practice we frequently need to sample from \[ \mathcal{N}(\boldsymbol{\mu}, \Sigma). \] For instance, when training variational autoencoders, implementing the reparameterization trick, or running Monte Carlo simulations. Generating independent standard normal samples is straightforward, but we need a way to introduce the correlation structure encoded in \(\Sigma\). The Cholesky decomposition provides a numerically stable factorization that transforms uncorrelated samples into correlated ones.

Theorem: Cholesky Decomposition

Let \(\boldsymbol{A} \in \mathbb{R}^{D \times D}\) be a symmetric positive definite matrix. Then there exists a unique lower triangular matrix \(\boldsymbol{L}\) with positive diagonal entries such that \[ \boldsymbol{A} = \boldsymbol{L}\boldsymbol{L}^\top. \] (Equivalently, \(\boldsymbol{A} = \boldsymbol{R}^\top \boldsymbol{R}\) where \(\boldsymbol{R} = \boldsymbol{L}^\top\) is upper triangular.) The existence and uniqueness of \(\boldsymbol{L}\) is a standard result in numerical linear algebra; we take it as given here and focus on its application to MVN sampling below.

The Cholesky factor \(\boldsymbol{L}\) of \(\Sigma\) provides a recipe for converting standard normal samples into MVN samples with arbitrary mean and covariance.

Theorem: MVN Sampling via Cholesky Reparametrization

Let \(\Sigma \in \mathbb{R}^{D \times D}\) be strictly positive definite with Cholesky decomposition \(\Sigma = \boldsymbol{L}\boldsymbol{L}^\top\), and let \(\boldsymbol{\mu} \in \mathbb{R}^D\). If \(\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\) (obtained by sampling \(D\) independent standard normal variables), then \[ \boldsymbol{y} = \boldsymbol{L}\boldsymbol{x} + \boldsymbol{\mu} \] is a multivariate normal random vector with mean \(\boldsymbol{\mu}\) and covariance \(\Sigma\): \[ \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma). \]

Proof:

By linearity of expectation, applied component-wise, \[ \mathbb{E}[\boldsymbol{y}] = \mathbb{E}[\boldsymbol{L}\boldsymbol{x} + \boldsymbol{\mu}] = \boldsymbol{L}\,\mathbb{E}[\boldsymbol{x}] + \boldsymbol{\mu} = \boldsymbol{\mu}, \] since \(\mathbb{E}[\boldsymbol{x}] = \boldsymbol{0}\). For the covariance, by the definition of the covariance matrix, \[ \operatorname{Cov}[\boldsymbol{y}] = \mathbb{E}\!\left[(\boldsymbol{y} - \mathbb{E}[\boldsymbol{y}]) (\boldsymbol{y} - \mathbb{E}[\boldsymbol{y}])^\top \right]. \] Substituting \(\boldsymbol{y} = \boldsymbol{L}\boldsymbol{x} + \boldsymbol{\mu}\) and using \(\mathbb{E}[\boldsymbol{x}] = \boldsymbol{0}\) (so that \(\operatorname{Cov}[\boldsymbol{x}] = \mathbb{E}[\boldsymbol{x}\boldsymbol{x}^\top]\)), the matrix-valued linearity of expectation (entry-wise application of scalar linearity) gives \[ \begin{align*} \operatorname{Cov}[\boldsymbol{y}] &= \mathbb{E}\!\left[ (\boldsymbol{L}\boldsymbol{x}) (\boldsymbol{L}\boldsymbol{x})^\top \right] \\\\ &= \boldsymbol{L}\,\mathbb{E}\!\left[\boldsymbol{x}\boldsymbol{x}^\top\right]\boldsymbol{L}^\top \\\\ &= \boldsymbol{L}\,\operatorname{Cov}[\boldsymbol{x}]\,\boldsymbol{L}^\top \\\\ &= \boldsymbol{L}\,\boldsymbol{I}\,\boldsymbol{L}^\top \\\\ &= \boldsymbol{L}\boldsymbol{L}^\top \\\\ &= \Sigma. \end{align*} \] Finally, \(\boldsymbol{y}\) is multivariate normal because it is an affine transformation of a standard MVN: each component is a linear combination of independent normals plus a constant, and affine transformations preserve normality (a standard property of the Gaussian family). Therefore \(\boldsymbol{y} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)\).

Dirichlet Distribution

The multivariate normal distribution models random vectors taking values in all of \(\mathbb{R}^D\). However, many quantities of interest in machine learning are probability vectors, whose entries are non-negative and sum to one. For example, the class probabilities output by a softmax layer, the topic proportions in a document, or the mixing weights in a mixture model all live on a probability simplex. We need a distribution that respects this constraint.

The Dirichlet distribution is a multivariate generalization of the beta distribution. It has support over the \((K - 1)\)-dimensional probability simplex, defined by \[ S_K = \Bigl\{(x_1, x_2, \dots, x_K) \in \mathbb{R}^K: x_k \ge 0,\ \sum_{k=1}^K x_k = 1 \Bigr\}. \] A random vector \(\boldsymbol{x} \in \mathbb{R}^K\) is said to have a Dirichlet distribution with parameters \(\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \dots, \alpha_K)\) (with each \(\alpha_k > 0\)) if its probability density function is given by \[ f(x_1, \dots, x_K; \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^K x_k^{\alpha_k - 1}, \quad (x_1, \dots, x_K) \in S_K, \] or equivalently \[ \operatorname{Dir}(\boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^K x_k^{\alpha_k - 1}\, \mathbb{1}\{\boldsymbol{x} \in S_K\}, \] where the multivariate beta function \(B(\boldsymbol{\alpha})\) is defined as \[ B(\boldsymbol{\alpha}) = \frac{\prod_{k=1}^K \Gamma(\alpha_k)}{\Gamma\Bigl(\sum_{k=1}^K \alpha_k\Bigr)}. \]

Theorem: Moments of the Dirichlet Distribution

Let \(\alpha_0 = \sum_{k=1}^K \alpha_k\). Then for each \(k\),

  1. Mean: \[ \mathbb{E}[x_k] = \frac{\alpha_k}{\alpha_0}. \]
  2. Variance: \[ \operatorname{Var}[x_k] = \frac{\alpha_k (\alpha_0 - \alpha_k)}{\alpha_0^2 (\alpha_0+1)}. \] For the symmetric Dirichlet prior with \(\alpha_k = \frac{\alpha}{K}\), these reduce to \[ \mathbb{E}[x_k] = \frac{1}{K}, \quad \operatorname{Var}[x_k] = \frac{K-1}{K^2 (\alpha +1)}, \] so increasing \(\alpha\) increases the precision (decreases the variance) of the distribution.
  3. Covariance: for \(i \neq j\), \[ \operatorname{Cov}[x_i, x_j] = \frac{-\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0+1)}. \]

The parameters \(\alpha_k\) can be thought of as "pseudocounts" or prior observations of each category. When all \(\alpha_k\) are equal (i.e., \(\boldsymbol{\alpha} = \alpha\, \boldsymbol{1}\)), the distribution is said to be symmetric, and reduces to the uniform distribution over the simplex when \(\alpha = 1\). This symmetry makes the Dirichlet distribution a natural prior in Bayesian models where no category is favored a priori.

The Dirichlet distribution is widely used as a conjugate prior for the parameters of a multinomial distribution in Bayesian statistics, as well as in machine learning models such as latent Dirichlet allocation (LDA) for topic modeling.

Wishart Distribution

The Dirichlet distribution placed a prior on probability vectors. In Bayesian multivariate analysis, we also need priors on covariance matrices themselves — for example, when inferring the parameters of a multivariate normal model. The Wishart distribution fills this role: it generalizes the gamma distribution from positive scalars to positive definite matrices. In this section we follow the standard machine-learning notation (see, e.g., Murphy, Probabilistic Machine Learning): \(D\) denotes the matrix dimension and \(N\) the sample size.

Definition: Wishart Distribution

A \(D \times D\) symmetric positive definite random matrix \(\Sigma\) follows a Wishart distribution with parameters \(\boldsymbol{S}\) (scale matrix) and \(\nu\) (degrees of freedom) if its p.d.f. is \[ \operatorname{Wi}(\Sigma \mid \boldsymbol{S}, \nu) = \frac{1}{Z}\, |\Sigma|^{(\nu - D - 1)/2} \exp\!\left(-\frac{1}{2}\operatorname{tr}(\boldsymbol{S}^{-1} \Sigma)\right), \] where the normalization constant is \[ Z = |\boldsymbol{S}|^{\nu/2}\, 2^{\nu D/2}\, \Gamma_D\!\left(\frac{\nu}{2}\right), \] and \(\Gamma_D(\cdot)\) denotes the multivariate gamma function. The parameters are constrained as follows:

  • \(\nu\): the degrees of freedom, which must satisfy \(\nu > D - 1\) for the normalization to exist;
  • \(\boldsymbol{S}\): the scale matrix, a \(D \times D\) symmetric positive definite matrix.

With these parameters, the distribution has mean \(\nu \boldsymbol{S}\) and (when \(\nu > D + 1\)) mode \((\nu - D - 1)\boldsymbol{S}\).

If \(D = 1\), the Wishart reduces to the gamma distribution: \[ \operatorname{Wi}(\lambda \mid s, \nu) = \operatorname{Gamma}\!\left(\lambda \,\Big|\, \text{shape} = \frac{\nu}{2},\ \text{rate} = \frac{1}{2s}\right), \] so that the mean \(\nu s\) agrees with the general formula \(\nu \boldsymbol{S}\) specialized to \(\boldsymbol{S} = s\). Setting \(s = 1\) further reduces this to the chi-squared distribution \(\chi^2_\nu\), whose density is \(\operatorname{Gamma}(\text{shape} = \nu/2,\ \text{rate} = 1/2)\).

Applications

Note that the first application concerns the sampling distribution of an estimator (a frequentist usage), while the second concerns prior beliefs and posterior inference (a Bayesian usage). Both deal with uncertainty in covariance matrices, and the same distribution arises in both roles — just with different interpretations.

Connections to Machine Learning

The multivariate distributions introduced here form the backbone of probabilistic machine learning. The MVN appears in Gaussian processes (where it defines a prior over functions), in the E-step of Gaussian mixture models, and in the reparameterization trick for variational autoencoders. The Dirichlet distribution is the standard prior for topic models (LDA) and categorical distributions in Bayesian inference. The Wishart and inverse Wishart distributions are essential whenever the covariance matrix itself is a parameter to be inferred, as in Bayesian multivariate regression and Bayesian Gaussian mixture models.

With the multivariate framework now in place — covariance, correlation, and the principal multivariate distributions — we have the probabilistic foundations needed for statistical inference. We next turn to maximum likelihood estimation, the principal method for fitting probabilistic models to observed data.