The Exponential Family

The Exponential Family

In the previous part, we saw that certain likelihood-prior pairs such as Binomial-Beta and Normal-Normal produce posteriors in the same family as the prior. This is not a coincidence. These distributions all belong to the exponential family, a unified framework that encompasses the vast majority of distributions used in statistics and machine learning including the Bernoulli, Poisson, Normal, Gamma, and Dirichlet distributions, and many others.

The exponential family is fundamental for three reasons:

  • It admits conjugate priors, which explains the conjugacy we observed in the previous part.
  • Its log-partition function is convex and generates all the cumulants of the sufficient statistics, so inference and optimization are well-behaved.
  • Maximum likelihood estimation reduces to a simple moment-matching condition, derived in the second section of this part.

Definition: Exponential Family

The exponential family is a family of probability distributions parameterized by natural parameters (or canonical parameters) \(\eta \in \mathbb{R}^K\) with support over \(\mathcal{X}^D \subseteq \mathbb{R}^D\) such that \[ \begin{align*} p(x \mid \eta ) &= \frac{1}{Z(\eta)} h(x) \exp\{\eta^\top \mathcal{T}(x)\}\\\\ &= h(x) \exp\{\eta^\top \mathcal{T}(x) - A(\eta)\} \end{align*} \] where

  • \(h(x)\) is the base measure, a scaling function of \(x\) that is often simply 1.
  • \(\mathcal{T}(x) \in \mathbb{R}^K\) is the vector of sufficient statistics.
  • \(Z(\eta)\) is the normalization constant.

Each exponential family is defined by a particular choice of \(h(x)\) and \(\mathcal{T}(x)\); varying \(\eta\) then traces out the individual distributions within that family.

The normalization constant \(Z(\eta)\) is often referred to as the partition function in statistical physics and machine learning. In particular, the log-partition function \[ A(\eta) = \log Z(\eta) \] is convex over the convex set \(\Omega = \{\eta \in \mathbb{R}^K : A(\eta) < \infty\}\). In the context of optimization and information geometry, it acts as a convex potential function that links natural parameters to moment parameters.

Definition: Minimal Representation

An exponential family is said to be minimal if there is no nonzero \(\eta \in \mathbb{R}^K\) such that \[ \eta^\top \mathcal{T}(x) = \text{const} \] holds for all \(x\) in the support.

Equivalently, the components of the sufficient statistic vector \(\mathcal{T}(x)\) (along with the constant function \(1\)) are linearly independent as functions of \(x\). This property ensures the identifiability of the model: there is a unique, one-to-one mapping between the natural parameters \(\eta\) and the resulting probability distribution.

Math vs. ML Perspectives: Handling the Constant

In many machine learning contexts, you may see the condition simplified to \[ \eta^\top \mathcal{T}(x) = 0. \] This assumes that the bias (constant) term has already been absorbed into the sufficient statistic vector \(\mathcal{T}(x)\) by augmenting it with a constant \(1\).

However, in our canonical definition where the log-partition function \(A(\eta)\) and base measure \(h(x)\) are explicitly separated, we must use the more rigorous \(\text{const}\) condition. If \(\eta^\top \mathcal{T}(x)\) were a constant, that value would be absorbed by \(A(\eta)\), meaning multiple \(\eta\) values could represent the same distribution. A minimal representation prevents this redundancy, ensuring that the Fisher information matrix is strictly positive definite and the optimization surface is well-behaved.

A common example of a non-minimal representation is the \(K\)-class multinomial distribution. Because the probabilities must sum to one (\(\sum p_i = 1\)), the natural parameters are over-specified. While we can work with this redundant form (as often done in Softmax regression), we can always reparameterize it into a minimal representation using \(K-1\) independent components to restore unique identifiability.
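To see this redundancy concretely, here is a minimal NumPy sketch (the `softmax` helper below is ours, not taken from any particular library): shifting every natural parameter by the same constant leaves the distribution unchanged, while pinning the last component to zero gives a minimal \(K-1\) parameterization.

```python
import numpy as np

def softmax(eta):
    """Map natural parameters (logits) to multinomial probabilities."""
    e = np.exp(eta - eta.max())          # subtract the max for numerical stability
    return e / e.sum()

eta = np.array([1.0, -0.5, 2.0])         # over-specified: K logits

# Adding the same constant to every logit yields the same distribution
# -> the representation is non-minimal.
print(np.allclose(softmax(eta), softmax(eta + 7.3)))       # True

# Pinning the last logit to 0 leaves K-1 free components (a minimal representation).
eta_minimal = eta - eta[-1]
print(np.allclose(softmax(eta), softmax(eta_minimal)))     # True
```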

Let \(\eta = f(\phi)\), where \(\phi\) is some other, possibly lower-dimensional, set of parameters, so that \[ p(x \mid \phi ) = h(x) \exp\{ f(\phi)^\top \mathcal{T}(x) - A(f(\phi))\}. \] If the mapping \(\phi \to \eta\) is nonlinear, the model is said to be a curved exponential family.

If \(\eta = f(\phi) = \phi\), the model is in canonical form; if, in addition, \(\mathcal{T}(x) = x\), we call it a natural exponential family (NEF): \[ p(x \mid \eta ) = h(x) \exp\{\eta^\top x - A(\eta)\}. \] Finally, we define the moment parameters as follows: \[ m = \mathbb{E}[\mathcal{T}(x)] \in \mathbb{R}^K. \]

Example 1: Bernoulli Distribution

\[ \begin{align*} \text{Ber}(x \mid \mu) &= \mu^x (1-\mu)^{1-x} \\\\ &= \exp\{x \log (\mu) + (1-x) \log (1-\mu)\}\\\\ &= \exp\{\mathcal{T}(x)^\top \eta\} \end{align*} \] where

  • \(\mathcal{T}(x) = [\mathbb{I}(x=1), \, \mathbb{I}(x=0)]\).
  • \(\eta = [\log(\mu), \, \log(1-\mu)]\).
  • \(\mu\) is the mean parameter.

In this representation there is a linear dependence between the features, since \(\mathbb{I}(x=1) + \mathbb{I}(x=0) = 1\), so \(\eta\) is not uniquely defined. It is common to use a minimal representation instead, so that there is a unique \(\eta\) associated with the distribution. \[ \text{Ber}(x \mid \mu) = \exp\left\{x \log \left(\frac{\mu}{1-\mu}\right) + \log (1-\mu)\right\} \] where

  • \(\mathcal{T}(x) = x\).
  • \(\eta = \log \left(\frac{\mu}{1-\mu}\right)\).
  • \(A(\eta) = -\log (1-\mu) = \log(1+ e^{\eta})\).
  • \(h(x) = 1\).

The Bernoulli example reveals a fundamental connection to classification models. The mean parameter \(\mu\) can be recovered from the canonical parameter \(\eta\) via the logistic (sigmoid) function: \[ \mu = \sigma(\eta) = \frac{1}{1+e^{-\eta}}. \] Also, notice that \[ \begin{align*} \frac{dA}{d \eta} &= \frac{d}{d\eta} \log(1+ e^{\eta}) \\\\ &= \frac{e^{\eta}}{1 + e^{\eta}} \\\\ &= \frac{1}{1+e^{-\eta}} \\\\ &= \mu. \end{align*} \]
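As a quick numerical sanity check (a minimal sketch assuming NumPy; the helper names are illustrative), we can verify that the sigmoid recovers \(\mu\) from \(\eta\) and that a finite-difference derivative of \(A(\eta) = \log(1+e^{\eta})\) indeed equals \(\mu\):

```python
import numpy as np

def A(eta):
    """Bernoulli log-partition function: A(eta) = log(1 + e^eta)."""
    return np.log1p(np.exp(eta))

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

mu = 0.3
eta = np.log(mu / (1 - mu))     # natural parameter = log-odds

print(sigmoid(eta))             # 0.3 -> recovers the mean parameter

# Central-difference approximation of dA/deta should also give mu.
eps = 1e-6
print((A(eta + eps) - A(eta - eps)) / (2 * eps))   # ~0.3
```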

In general, the log partition function \(A(\eta)\) acts as the cumulant generating function for the sufficient statistics \(\mathcal{T}(x)\). In other words, the derivatives of \(A(\eta)\) generate all the cumulants of \(\mathcal{T}(x)\).

The first cumulant is given by \[ \nabla A(\eta) = \mathbb{E}[\mathcal{T}(x)]. \] Note that these expectations are often called moment parameters, denoted by \(\mu\) or \(m\).

The second cumulant is given by \[ \nabla^2 A(\eta) = \text{Cov }[\mathcal{T}(x)] \] which means that for a minimal exponential family, the Hessian is strictly positive definite, and thus the log partition function \(A(\eta)\) is strictly convex in \(\eta\). This convexity guarantees that the MLE (derived in the next section) has a unique solution.
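The same finite-difference idea confirms the second cumulant. For the Bernoulli family, \(A''(\eta)\) should equal the variance \(\mu(1-\mu)\); a small sketch (NumPy assumed):

```python
import numpy as np

def A(eta):
    return np.log1p(np.exp(eta))    # Bernoulli log-partition function

mu = 0.3
eta = np.log(mu / (1 - mu))
eps = 1e-4

# Second-order central difference approximates A''(eta) = Var[T(x)].
d2A = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps**2
print(d2A, mu * (1 - mu))           # both ~0.21
```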

We now verify these properties for the two most important continuous distributions in machine learning.

Example 2: Normal Distribution

\[ \begin{align*} N(x | \mu, \, \sigma^2) &= \frac{1}{\sigma\sqrt{2\pi}}\exp \left\{-\frac{1}{2\sigma^2}(x - \mu)^2 \right\} \\\\ &= \frac{1}{\sqrt{2\pi}} \exp \left\{ \frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}x^2 -\frac{1}{2\sigma^2}\mu^2 -\log \sigma \right\} \end{align*} \] where

  • \(\mathcal{T}(x) = \begin{bmatrix}x \\ x^2 \end{bmatrix}\)
  • \(\eta = \begin{bmatrix} \frac{\mu}{\sigma^2} \\ -\frac{1}{2\sigma^2} \end{bmatrix} \)
  • \(A(\eta) = \frac{\mu^2}{2\sigma^2}+\log \sigma = -\frac{\eta_1^2}{4\eta_2}-\frac{1}{2}\log(-2\eta_2)\)
  • \(h(x) = \frac{1}{\sqrt{2\pi}}\).

Also, the moment parameters are given by: \[ m = \begin{bmatrix} \mu \\ \mu^2 + \sigma^2 \end{bmatrix}. \] Note: If \(\sigma = 1\), the distribution becomes a natural exponential family such that

  • \(\mathcal{T}(x) = x\)
  • \(\eta = \mu\)
  • \(A(\eta) = \frac{\mu^2}{2\sigma^2}+\log \sigma = \frac{\mu^2}{2}\)
  • \(h(x) = \frac{1}{\sqrt{2\pi}}\exp\{-\frac{x^2}{2}\} = N(x \mid 0, 1)\) : Not constant.
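To make the natural-parameter mapping concrete, here is a small numerical check (NumPy assumed; the helper names are illustrative) that the gradient of \(A(\eta)\) for the general univariate Gaussian returns the moment parameters \([\mu, \; \mu^2 + \sigma^2]\):

```python
import numpy as np

mu, sigma2 = 1.5, 0.8
eta = np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])    # natural parameters

def A(eta):
    """Gaussian log-partition: A(eta) = -eta1^2/(4 eta2) - (1/2) log(-2 eta2)."""
    return -eta[0]**2 / (4.0 * eta[1]) - 0.5 * np.log(-2.0 * eta[1])

# Central-difference gradient of A; should match [mu, mu^2 + sigma^2].
eps = 1e-6
e1, e2 = np.array([eps, 0.0]), np.array([0.0, eps])
grad = np.array([
    (A(eta + e1) - A(eta - e1)) / (2 * eps),
    (A(eta + e2) - A(eta - e2)) / (2 * eps),
])
print(grad)                       # ~[1.5, 3.05]
print(mu, mu**2 + sigma2)         # 1.5, 3.05
```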

The univariate case extends naturally to higher dimensions. The multivariate normal distribution introduces the information form parameterization, which is the natural exponential family representation for Gaussians and plays a central role in graphical models and message passing algorithms.

Example 3: Multivariate Normal Distribution (MVN)

\[ \begin{align*} N(x \mid \mu, \Sigma) &= \frac{1}{(2\pi)^{\frac{D}{2}}\sqrt{\det(\Sigma)}} \exp \left\{ -\frac{1}{2}x^\top \Sigma^{-1}x + x^\top \Sigma^{-1}\mu -\frac{1}{2}\mu^\top \Sigma^{-1}\mu \right\}\\\\ &= c \exp\left \{x^\top \Sigma^{-1}\mu -\frac{1}{2}x^\top \Sigma^{-1}x \right\} \end{align*} \] where \[ c = \frac{\exp \left\{-\frac{1}{2}\mu^\top \Sigma^{-1} \mu\right\}}{(2\pi)^{\frac{D}{2}}\sqrt{\det(\Sigma)}} \] and \(\Sigma\) is the covariance matrix.

Now, we represent this model using canonical parameters. \[ N_c (x \mid \xi, \Lambda) = c' \exp \left\{x^\top \xi - \frac{1}{2}x^\top \Lambda x \right\} \] where

  • \(\Lambda = \Sigma^{-1}\) is a precision matrix
  • \(\xi = \Sigma^{-1}\mu\) is a precision-weighted mean vector
  • \(c' = \frac{\exp \left\{-\frac{1}{2}\xi^\top \Lambda^{-1} \xi \right\}}{(2\pi)^{\frac{D}{2}}\sqrt{\det(\Lambda^{-1})}}\).

This representation is called information form and can be converted to exponential family notation as follows: \[ \begin{align*} N_c (x \mid \xi, \Lambda) &= (2\pi)^{-\frac{D}{2}} \exp \left\{\frac{1}{2}\log | \Lambda | -\frac{1}{2}\xi^\top \Lambda^{-1}\xi \right\} \exp \left\{-\frac{1}{2}x^\top \Lambda x + x^\top \xi \right\} \\\\ &= h(x)g(\eta)\exp \left\{-\frac{1}{2}x^\top \Lambda x + x^\top \xi \right\} \\\\ &= h(x)g(\eta)\exp \left \{-\frac{1}{2}(\sum_{i, j}x_i x_j \Lambda_{ij}) + x^\top \xi \right\} \\\\ &= h(x)g(\eta)\exp \left\{-\frac{1}{2}\text{vec}(\Lambda)^\top \text{vec}(xx^\top) + x^\top \xi \right\} \\\\ &= h(x)\exp\{\eta^\top \mathcal{T}(x) - A(\eta)\} \end{align*} \] where

  • \(\mathcal{T}(x) = [x ; \text{vec}(xx^\top)]\)
  • \(\eta = [\xi ; -\frac{1}{2}\text{vec}(\Lambda)] = [\Sigma^{-1}\mu ; -\frac{1}{2}\text{vec}(\Sigma^{-1})]\)
  • \(A(\eta) = -\log g(\eta) = -\frac{1}{2} \log | \Lambda | + \frac{1}{2}\xi^\top \Lambda^{-1} \xi \)
  • \(h(x) = (2\pi)^{-\frac{D}{2}}\).
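The key algebraic step above is the identity \(x^\top \Lambda x = \text{vec}(\Lambda)^\top \text{vec}(xx^\top)\). A two-line numerical check (NumPy assumed; plain flattening plays the role of \(\text{vec}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
Lam = rng.normal(size=(3, 3))
Lam = (Lam + Lam.T) / 2            # symmetrize, like a precision matrix

# The quadratic form equals the inner product of the flattened matrices.
lhs = x @ Lam @ x
rhs = Lam.flatten() @ np.outer(x, x).flatten()
print(np.isclose(lhs, rhs))        # True
```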

The moment parameters are given by: \[ m = [\mu ; \mu\mu^\top + \Sigma]. \] Note: This form is non-minimal because the precision matrix \(\Lambda\) is symmetric (\(\Lambda_{ij} = \Lambda_{ji}\)), so the off-diagonal entries are duplicated in \(\text{vec}(\Lambda)\); a minimal representation would keep only the upper (or lower) triangular part. However, in practice, the non-minimal representation is easier to plug into algorithms and more stable for certain operations, while the minimal representation is better suited to mathematical derivations.
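Conversion between the moment form \((\mu, \Sigma)\) and the information form \((\xi, \Lambda)\) is just a pair of matrix operations. A minimal sketch (NumPy assumed):

```python
import numpy as np

# Moment form: mean vector and covariance matrix.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Information form: precision matrix and precision-weighted mean.
Lambda_ = np.linalg.inv(Sigma)
xi = Lambda_ @ mu

# Converting back recovers the moment form.
Sigma_back = np.linalg.inv(Lambda_)
mu_back = Sigma_back @ xi
print(np.allclose(Sigma, Sigma_back), np.allclose(mu, mu_back))   # True True

# Moment parameters of the exponential-family representation: E[x] and E[x x^T].
m1 = mu
m2 = np.outer(mu, mu) + Sigma
```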

MLE for the Exponential Family

One of the most elegant consequences of the exponential family structure is that maximum likelihood estimation reduces to a simple condition: matching empirical moments to theoretical moments. This is because the gradient of the log-likelihood involves only the sufficient statistics and the log-partition function.

The likelihood of an exponential family model is given by: \[ \begin{align*} p(\mathcal{D} \mid \eta) &= \left\{\prod_{n=1}^N h(x_n)\right\} \exp \left\{\eta^\top \left[\sum_{n=1}^N \mathcal{T}(x_n)\right] - N A(\eta)\right\} \\\\ &\propto \exp\{\eta^\top \mathcal{T}(\mathcal{D}) - N A(\eta)\} \end{align*} \] where \(\mathcal{T}(\mathcal{D})\) are the sufficient statistics: \[ \mathcal{T}(\mathcal{D}) = \begin{bmatrix} \sum_{n=1}^N \mathcal{T}_1 (x_n) \\ \vdots \\ \sum_{n=1}^N \mathcal{T}_K (x_n) \end{bmatrix}. \]
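The practical payoff is that the whole dataset enters the likelihood only through the fixed-size vector \(\mathcal{T}(\mathcal{D})\). A small sketch for the univariate Gaussian case (NumPy assumed; the function below is illustrative, not from any library):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.5, size=1000)

# Sufficient statistics of the entire dataset: [sum x_n, sum x_n^2].
T_D = np.array([data.sum(), (data**2).sum()])
N = len(data)

def log_likelihood(eta, T_D, N):
    """eta^T T(D) - N A(eta), dropping the constant sum of log h(x_n)."""
    A = -eta[0]**2 / (4.0 * eta[1]) - 0.5 * np.log(-2.0 * eta[1])
    return eta @ T_D - N * A

# Evaluating the likelihood needs only the two numbers stored in T_D.
eta = np.array([2.0 / 1.5**2, -1.0 / (2.0 * 1.5**2)])
print(log_likelihood(eta, T_D, N))
```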

Since the gradient of the log partition function yields the expected value of the sufficient statistic vector, the gradient of the log-likelihood is: \[ \begin{align*} \nabla_{\eta} \log p(\mathcal{D} \mid \eta) &= \nabla_{\eta} \eta^\top \mathcal{T}(\mathcal{D}) - N \nabla_{\eta} A(\eta) \\\\ &= \mathcal{T}(\mathcal{D}) - N \mathbb{E }[\mathcal{T}(x)]. \end{align*} \]

Setting this gradient to zero, we obtain the MLE \(\hat{\eta}\), which must satisfy \[ \mathbb{E}_{\hat{\eta}}[\mathcal{T}(x)] = \frac{1}{N}\sum_{n=1}^N \mathcal{T}(x_n). \] This means that the model's theoretical expected sufficient statistics must equal the empirical average of the sufficient statistics. This principle is called moment matching.

For example, for a Gaussian distribution, the empirical mean (the first moment) is given by \[ \bar{x} = \frac{1}{N}\sum_{n=1}^N x_n, \] and moment matching sets the model's theoretical first moment equal to it: \[ \mathbb{E}[x] = \mu = \bar{x}. \]
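A brief sketch of moment matching in code (NumPy assumed; variable names are ours): the Bernoulli MLE is simply the empirical mean of \(x\), and the Gaussian MLE matches the first two empirical moments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli: T(x) = x, so the MLE matches the empirical mean.
x = rng.binomial(1, 0.7, size=10_000)
mu_hat = x.mean()                            # E[T(x)] = (1/N) sum T(x_n)
eta_hat = np.log(mu_hat / (1 - mu_hat))      # corresponding natural parameter
print(mu_hat, eta_hat)                       # ~0.7, ~0.85

# Gaussian: T(x) = [x, x^2], so the MLE matches the first two empirical moments.
y = rng.normal(loc=2.0, scale=1.5, size=10_000)
m1, m2 = y.mean(), (y**2).mean()
mu_mle = m1
sigma2_mle = m2 - m1**2                      # Var[x] = E[x^2] - E[x]^2
print(mu_mle, sigma2_mle)                    # ~2.0, ~2.25
```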

In Variational Inference (VI), approximating an intractable posterior with an exponential family distribution is standard practice: the moment matching condition from MLE generalizes to moment matching between the true and approximate posteriors. The log-partition function \(A(\eta)\) reappears as the Fenchel conjugate of the negative entropy, connecting the exponential family to convex duality and Information Geometry.

The exponential family structure reveals that the log-partition function \(A(\eta)\) encodes all distributional information through its derivatives. In the next part, we formalize this connection: the second derivative of \(A(\eta)\) - the covariance of the sufficient statistics - is precisely the Fisher information matrix (Part 16), which measures how sensitive the distribution is to changes in the natural parameter and governs the geometry of the statistical model.