Multivariate Normal Distribution
In our earlier development of the
covariance matrix and the
correlation coefficient,
we built tools for quantifying linear relationships among random variables. We also saw that the univariate
Gaussian
\(\mathcal{N}(\mu, \sigma^2)\) is characterized by its mean and variance. The multivariate normal distribution
(MVN) is the natural extension of the Gaussian to random vectors, and it is arguably the most important joint distribution
in all of statistics and machine learning. Its shape is entirely determined by a mean vector and a covariance matrix,
making the concepts from our covariance and correlation discussions indispensable for understanding its geometry.
Definition: Multivariate Normal Distribution
A \(D\)-dimensional random vector \(\boldsymbol{x} \in \mathbb{R}^D\) follows a multivariate normal distribution,
written \(\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)\), if its p.d.f. is
\[
f(\boldsymbol{x}) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma)}}
\exp\!\left[
-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu})
\right],
\]
where \(\boldsymbol{\mu} = \mathbb{E}[\boldsymbol{x}] \in \mathbb{R}^D\) is the mean vector
and \(\Sigma = \operatorname{Cov}[\boldsymbol{x}] \in \mathbb{R}^{D \times D}\) is the
covariance matrix, here required to be strictly positive definite so that \(\Sigma^{-1}\)
exists and \(\det(\Sigma) > 0\). When \(\Sigma\) is only positive semi-definite —
a property guaranteed in general by the
PSD theorem for covariance matrices —
the distribution becomes degenerate and is supported on a lower-dimensional affine subspace,
requiring a different formulation that we do not pursue here.
The expression inside the exponential (ignoring the factor of \(-\tfrac{1}{2}\)) is the squared
Mahalanobis distance between \(\boldsymbol{x}\) and \(\boldsymbol{\mu}\).
Definition: Mahalanobis Distance
Let \(\Sigma \in \mathbb{R}^{D \times D}\) be a strictly positive definite matrix. The
Mahalanobis distance between \(\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^D\) (with respect to \(\Sigma\)) is
\[
d_{\Sigma}(\boldsymbol{x}, \boldsymbol{y})
= \sqrt{(\boldsymbol{x} - \boldsymbol{y})^\top \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{y})}.
\]
Unlike the Euclidean distance, the Mahalanobis distance accounts for the correlation
structure among variables. The contours of constant density of \(\mathcal{N}(\boldsymbol{\mu}, \Sigma)\) are
ellipsoids defined by
\[
(\boldsymbol{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) = c,
\]
whose axes are aligned with the eigenvectors of \(\Sigma\) and whose lengths are
proportional to the square roots of the corresponding eigenvalues.
To build concrete intuition, let us work out the special case \(D = 2\) in full detail.
When \(\boldsymbol{x} \in \mathbb{R}^2\), the MVN is called the bivariate normal distribution.
In this case,
\[
\begin{align*}
\Sigma &= \begin{bmatrix}
\operatorname{Var}[X_1] & \operatorname{Cov}[X_1, X_2] \\\\
\operatorname{Cov}[X_2, X_1] & \operatorname{Var}[X_2]
\end{bmatrix} \\\\
&= \begin{bmatrix}
\sigma_1^2 & \rho \sigma_1 \sigma_2 \\\\
\rho \sigma_1 \sigma_2 & \sigma_2^2
\end{bmatrix}
\end{align*}
\]
where \(\rho\) is the
correlation coefficient
of \(X_1\) and \(X_2\),
\[
\rho = \operatorname{Corr}[X_1, X_2] = \frac{\operatorname{Cov}[X_1, X_2]}{\sqrt{\operatorname{Var}[X_1]\,\operatorname{Var}[X_2]}}.
\]
Then
\[
\begin{align*}
\det (\Sigma) &= \sigma_1^2 \sigma_2^2 - \rho^2 \sigma_1^2 \sigma_2^2 \\\\
&= \sigma_1^2 \sigma_2^2 (1 - \rho^2)
\end{align*}
\]
and
\[
\begin{align*}
\Sigma^{-1} &= \frac{1}{\det (\Sigma )}
\begin{bmatrix}
\sigma_2^2 & -\rho \sigma_1 \sigma_2 \\\\
-\rho \sigma_1 \sigma_2 & \sigma_1^2
\end{bmatrix} \\\\
&= \frac{1}{1 - \rho^2}
\begin{bmatrix}
\frac{1}{\sigma_1^2 } & \frac{-\rho} {\sigma_1 \sigma_2} \\\\
\frac{-\rho} {\sigma_1 \sigma_2} & \frac{1}{\sigma_2^2 }
\end{bmatrix}
\end{align*}
\]
Note that the exponent in the bivariate density is a
quadratic form:
\[
(\boldsymbol{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu}).
\]
Expanding,
\[
\begin{align*}
(\boldsymbol{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu})
&= \frac{1}{1 - \rho^2} \begin{bmatrix} X_1 - \mu_1 & X_2 - \mu_2 \end{bmatrix}
\begin{bmatrix}
\frac{1}{\sigma_1^2 } & \frac{-\rho} {\sigma_1 \sigma_2} \\\\
\frac{-\rho} {\sigma_1 \sigma_2} & \frac{1}{\sigma_2^2 }
\end{bmatrix}
\begin{bmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \end{bmatrix} \\\\
&= \frac{1}{1 - \rho^2}\left[\frac{1}{\sigma_1^2 }(X_1 - \mu_1)^2
-\frac{2\rho} {\sigma_1 \sigma_2}(X_1 - \mu_1)(X_2 - \mu_2)
+\frac{1}{\sigma_2^2 }(X_2 - \mu_2)^2 \right].
\end{align*}
\]
Therefore, the p.d.f. of the bivariate normal distribution becomes
\[
f(\boldsymbol{x}) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}}
\exp\!\left\{-\frac{1}{2(1 - \rho^2)}
\left[\left(\frac{X_1 - \mu_1}{\sigma_1}\right)^2
-2\rho \left(\frac{X_1 - \mu_1} {\sigma_1}\right) \left(\frac{X_2 - \mu_2} {\sigma_2}\right)
+\left(\frac{X_2 - \mu_2}{\sigma_2}\right)^2
\right]
\right\}.
\]
When \(\rho = -1\) or \(\rho = 1\), this density is undefined and \(f\) is said to be degenerate.
Cholesky Decomposition
The previous section defined the MVN in terms of its density function, but
in practice we frequently need to sample from
\[
\mathcal{N}(\boldsymbol{\mu}, \Sigma).
\]
For instance, when training variational autoencoders, implementing the reparameterization trick,
or running Monte Carlo simulations. Generating independent standard normal samples is straightforward,
but we need a way to introduce the correlation structure encoded in \(\Sigma\).
The Cholesky decomposition provides a numerically stable factorization that transforms
uncorrelated samples into correlated ones.
Theorem: Cholesky Decomposition
Let \(\boldsymbol{A} \in \mathbb{R}^{D \times D}\) be a
symmetric positive definite matrix.
Then there exists a unique lower triangular matrix \(\boldsymbol{L}\) with positive diagonal
entries such that
\[
\boldsymbol{A} = \boldsymbol{L}\boldsymbol{L}^\top.
\]
(Equivalently, \(\boldsymbol{A} = \boldsymbol{R}^\top \boldsymbol{R}\) where
\(\boldsymbol{R} = \boldsymbol{L}^\top\) is upper triangular.) The existence and uniqueness of
\(\boldsymbol{L}\) is a standard result in numerical linear algebra; we take it as given here and
focus on its application to MVN sampling below.
The Cholesky factor \(\boldsymbol{L}\) of \(\Sigma\) provides a recipe for converting standard normal
samples into MVN samples with arbitrary mean and covariance.
Theorem: MVN Sampling via Cholesky Reparametrization
Let \(\Sigma \in \mathbb{R}^{D \times D}\) be strictly positive definite with Cholesky decomposition
\(\Sigma = \boldsymbol{L}\boldsymbol{L}^\top\), and let \(\boldsymbol{\mu} \in \mathbb{R}^D\).
If \(\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\) (obtained by sampling
\(D\) independent standard normal variables), then
\[
\boldsymbol{y} = \boldsymbol{L}\boldsymbol{x} + \boldsymbol{\mu}
\]
is a multivariate normal random vector with mean \(\boldsymbol{\mu}\) and covariance \(\Sigma\):
\[
\boldsymbol{y} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma).
\]
Proof:
By
linearity of expectation,
applied component-wise,
\[
\mathbb{E}[\boldsymbol{y}]
= \mathbb{E}[\boldsymbol{L}\boldsymbol{x} + \boldsymbol{\mu}]
= \boldsymbol{L}\,\mathbb{E}[\boldsymbol{x}] + \boldsymbol{\mu}
= \boldsymbol{\mu},
\]
since \(\mathbb{E}[\boldsymbol{x}] = \boldsymbol{0}\). For the covariance, by the
definition of the covariance matrix,
\[
\operatorname{Cov}[\boldsymbol{y}]
= \mathbb{E}\!\left[(\boldsymbol{y} - \mathbb{E}[\boldsymbol{y}])
(\boldsymbol{y} - \mathbb{E}[\boldsymbol{y}])^\top \right].
\]
Substituting \(\boldsymbol{y} = \boldsymbol{L}\boldsymbol{x} + \boldsymbol{\mu}\) and using
\(\mathbb{E}[\boldsymbol{x}] = \boldsymbol{0}\) (so that \(\operatorname{Cov}[\boldsymbol{x}] = \mathbb{E}[\boldsymbol{x}\boldsymbol{x}^\top]\)),
the matrix-valued linearity of expectation (entry-wise application of scalar linearity) gives
\[
\begin{align*}
\operatorname{Cov}[\boldsymbol{y}]
&= \mathbb{E}\!\left[ (\boldsymbol{L}\boldsymbol{x})
(\boldsymbol{L}\boldsymbol{x})^\top \right] \\\\
&= \boldsymbol{L}\,\mathbb{E}\!\left[\boldsymbol{x}\boldsymbol{x}^\top\right]\boldsymbol{L}^\top \\\\
&= \boldsymbol{L}\,\operatorname{Cov}[\boldsymbol{x}]\,\boldsymbol{L}^\top \\\\
&= \boldsymbol{L}\,\boldsymbol{I}\,\boldsymbol{L}^\top \\\\
&= \boldsymbol{L}\boldsymbol{L}^\top \\\\
&= \Sigma.
\end{align*}
\]
Finally, \(\boldsymbol{y}\) is multivariate normal because it is an affine transformation of a
standard MVN: each component is a linear combination of independent normals plus a constant, and
affine transformations preserve normality (a standard property of the Gaussian family). Therefore
\(\boldsymbol{y} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)\).
Dirichlet Distribution
The multivariate normal distribution models random vectors taking values in all of \(\mathbb{R}^D\).
However, many quantities of interest in machine learning are probability vectors, whose
entries are non-negative and sum to one. For example, the class probabilities output by a softmax layer,
the topic proportions in a document, or the mixing weights in a mixture model all live on a
probability simplex. We need a distribution that respects this constraint.
The Dirichlet distribution is a multivariate generalization of the
beta distribution.
It has support over the \((K - 1)\)-dimensional probability simplex, defined by
\[
S_K = \Bigl\{(x_1, x_2, \dots, x_K) \in \mathbb{R}^K: x_k \ge 0,\ \sum_{k=1}^K x_k = 1 \Bigr\}.
\]
A random vector \(\boldsymbol{x} \in \mathbb{R}^K\) is said to have a Dirichlet distribution with parameters
\(\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \dots, \alpha_K)\) (with each \(\alpha_k > 0\)) if its probability
density function is given by
\[
f(x_1, \dots, x_K; \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^K x_k^{\alpha_k - 1}, \quad (x_1, \dots, x_K) \in S_K,
\]
or equivalently
\[
\operatorname{Dir}(\boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^K x_k^{\alpha_k - 1}\, \mathbb{1}\{\boldsymbol{x} \in S_K\},
\]
where the multivariate beta function \(B(\boldsymbol{\alpha})\) is defined as
\[
B(\boldsymbol{\alpha}) = \frac{\prod_{k=1}^K \Gamma(\alpha_k)}{\Gamma\Bigl(\sum_{k=1}^K \alpha_k\Bigr)}.
\]
Theorem: Moments of the Dirichlet Distribution
Let \(\alpha_0 = \sum_{k=1}^K \alpha_k\). Then for each \(k\),
- Mean:
\[
\mathbb{E}[x_k] = \frac{\alpha_k}{\alpha_0}.
\]
- Variance:
\[
\operatorname{Var}[x_k] = \frac{\alpha_k (\alpha_0 - \alpha_k)}{\alpha_0^2 (\alpha_0+1)}.
\]
For the symmetric Dirichlet prior with \(\alpha_k = \frac{\alpha}{K}\), these reduce to
\[
\mathbb{E}[x_k] = \frac{1}{K}, \quad \operatorname{Var}[x_k] = \frac{K-1}{K^2 (\alpha +1)},
\]
so increasing \(\alpha\) increases the precision (decreases the variance) of the distribution.
- Covariance: for \(i \neq j\),
\[
\operatorname{Cov}[x_i, x_j] = \frac{-\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0+1)}.
\]
The parameters \(\alpha_k\) can be thought of as "pseudocounts" or prior observations of each category. When all
\(\alpha_k\) are equal (i.e., \(\boldsymbol{\alpha} = \alpha\, \boldsymbol{1}\)), the distribution is said to be
symmetric, and reduces to the uniform distribution over the simplex when
\(\alpha = 1\). This symmetry makes the Dirichlet distribution a natural
prior in Bayesian models where no category is favored a priori.
The Dirichlet distribution is widely used as a conjugate prior for the parameters
of a multinomial distribution in Bayesian statistics, as well as in machine learning models
such as latent Dirichlet allocation (LDA) for topic modeling.
Wishart Distribution
The Dirichlet distribution placed a prior on probability vectors. In Bayesian multivariate analysis,
we also need priors on covariance matrices themselves — for example, when inferring the parameters
of a multivariate normal model. The Wishart distribution fills this role: it generalizes the
gamma distribution from positive scalars to
positive definite matrices.
In this section we follow the standard machine-learning notation (see, e.g., Murphy, Probabilistic Machine Learning):
\(D\) denotes the matrix dimension and \(N\) the sample size.
Definition: Wishart Distribution
A \(D \times D\) symmetric positive definite random matrix \(\Sigma\) follows a
Wishart distribution with parameters \(\boldsymbol{S}\) (scale matrix) and
\(\nu\) (degrees of freedom) if its p.d.f. is
\[
\operatorname{Wi}(\Sigma \mid \boldsymbol{S}, \nu)
= \frac{1}{Z}\, |\Sigma|^{(\nu - D - 1)/2} \exp\!\left(-\frac{1}{2}\operatorname{tr}(\boldsymbol{S}^{-1} \Sigma)\right),
\]
where the normalization constant is
\[
Z = |\boldsymbol{S}|^{\nu/2}\, 2^{\nu D/2}\, \Gamma_D\!\left(\frac{\nu}{2}\right),
\]
and \(\Gamma_D(\cdot)\) denotes the multivariate gamma function. The parameters are constrained as follows:
- \(\nu\): the degrees of freedom, which must satisfy \(\nu > D - 1\) for the normalization to exist;
- \(\boldsymbol{S}\): the scale matrix, a \(D \times D\) symmetric positive definite matrix.
With these parameters, the distribution has mean \(\nu \boldsymbol{S}\) and (when \(\nu > D + 1\)) mode \((\nu - D - 1)\boldsymbol{S}\).
If \(D = 1\), the Wishart reduces to the gamma distribution:
\[
\operatorname{Wi}(\lambda \mid s, \nu) = \operatorname{Gamma}\!\left(\lambda \,\Big|\, \text{shape} = \frac{\nu}{2},\ \text{rate} = \frac{1}{2s}\right),
\]
so that the mean \(\nu s\) agrees with the general formula \(\nu \boldsymbol{S}\) specialized to \(\boldsymbol{S} = s\).
Setting \(s = 1\) further reduces this to the chi-squared distribution \(\chi^2_\nu\),
whose density is \(\operatorname{Gamma}(\text{shape} = \nu/2,\ \text{rate} = 1/2)\).
Applications
- Covariance Matrix Estimation:
In multivariate statistics, the Wishart distribution arises naturally in the sampling distribution
of empirical
covariance matrices.
Specifically, given \(N\) i.i.d. samples \(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N \sim \mathcal{N}(\boldsymbol{0}, \Sigma)\),
the scatter matrix
\[
\boldsymbol{S}_0 = \sum_{n=1}^N \boldsymbol{x}_n \boldsymbol{x}_n^\top
\]
(the subscript \(0\) emphasizes that this is the \(\boldsymbol{\mu} = \boldsymbol{0}\) centered scatter,
distinct from the Wishart scale parameter \(\boldsymbol{S}\) introduced above) follows a Wishart distribution:
\[
\boldsymbol{S}_0 \sim \operatorname{Wi}(\Sigma, N).
\]
- Bayesian Inference:
The Wishart serves as a conjugate prior on the precision matrix \(\Sigma^{-1}\) in multivariate normal models,
enabling closed-form updates of the posterior distribution. For a prior on the covariance matrix \(\Sigma\) directly,
we use the inverse Wishart distribution: for \(\nu > D - 1\) and symmetric positive definite \(\boldsymbol{S}\),
\[
\operatorname{IW}(\Sigma \mid \boldsymbol{S}, \nu)
= \frac{1}{Z_{IW}}\, |\Sigma|^{-(\nu + D + 1)/2} \exp\!\left(-\frac{1}{2}\operatorname{tr}(\boldsymbol{S}\, \Sigma^{-1})\right),
\]
where
\[
Z_{IW} = |\boldsymbol{S}|^{-\nu/2}\, 2^{\nu D/2}\, \Gamma_D\!\left(\frac{\nu}{2}\right).
\]
The mean of this distribution is \(\boldsymbol{S} / (\nu - D - 1)\) for \(\nu > D + 1\), and the relationship to the Wishart
is the matrix analogue of the scalar fact \(\lambda \sim \operatorname{Gamma}(a, b) \Rightarrow 1/\lambda \sim \operatorname{IG}(a, b)\):
specifically, \(\Sigma \sim \operatorname{IW}(\boldsymbol{S}, \nu)\) if and only if \(\Sigma^{-1} \sim \operatorname{Wi}(\boldsymbol{S}^{-1}, \nu)\),
so the inverse Wishart is the multivariate generalization of the inverse gamma. Specializing to \(D = 1\) confirms this:
\[
\operatorname{IW}(\sigma^2 \mid s, \nu) = \operatorname{IG}\!\left(\sigma^2 \,\Big|\, \text{shape} = \frac{\nu}{2},\ \text{scale} = \frac{s}{2}\right),
\]
with mean \(s/(\nu - 2)\) for \(\nu > 2\), agreeing with the general formula \(\boldsymbol{S}/(\nu - D - 1)\) at \(\boldsymbol{S} = s\), \(D = 1\).
This mirrors the Wishart-to-gamma reduction shown earlier.
Note that the first application concerns the sampling distribution of an estimator (a frequentist usage),
while the second concerns prior beliefs and posterior inference (a Bayesian usage). Both deal with uncertainty
in covariance matrices, and the same distribution arises in both roles — just with different interpretations.
Connections to Machine Learning
The multivariate distributions introduced here form the backbone of probabilistic machine learning.
The MVN appears in Gaussian processes (where it defines a
prior over functions), in the E-step of Gaussian mixture models,
and in the reparameterization trick for variational autoencoders. The Dirichlet distribution is the standard prior
for topic models (LDA) and categorical distributions in Bayesian inference. The Wishart and inverse Wishart distributions are essential
whenever the covariance matrix itself is a parameter to be inferred, as in Bayesian multivariate regression and Bayesian Gaussian mixture models.
With the multivariate framework now in place — covariance, correlation, and the principal multivariate distributions —
we have the probabilistic foundations needed for statistical inference. We next turn to
maximum likelihood estimation, the principal method for fitting probabilistic
models to observed data.