The Exponential Family
In the previous chapter on Bayesian statistics, we saw that certain likelihood-prior pairs such as Binomial-Beta and Normal-Normal produce posteriors in the same family as the prior. This is not a coincidence. These distributions all belong to the exponential family, a unified framework that encompasses the vast majority of distributions used in statistics and machine learning, including the Bernoulli, Poisson, Normal, Gamma, and Dirichlet distributions.
The exponential family is fundamental for three reasons:
- Conjugacy: Every member possesses a natural conjugate prior, ensuring tractable Bayesian updates.
- Efficiency: Maximum Likelihood Estimation (MLE) reduces to moment matching, significantly simplifying the optimization landscape.
- Geometry: The log-partition function serves as a cumulant generating function, providing a direct bridge to Fisher information and information geometry.
The exponential family is a family of probability distributions parameterized by natural parameters (or canonical parameters) \(\eta \in \mathbb{R}^K\) with support over \(\mathcal{X} \subseteq \mathbb{R}^D\) such that \[ \begin{align*} p(x \mid \eta ) &= \frac{1}{Z(\eta)} h(x) \exp\{\eta^\top \mathcal{T}(x)\}\\\\ &= h(x) \exp\{\eta^\top \mathcal{T}(x) - A(\eta)\} \end{align*} \] where
- \(h(x)\) is a base measure (or carrier measure): a function depending only on \(x\) and not on \(\eta\); often equal to \(1\).
- \(\mathcal{T}(x) \in \mathbb{R}^K\) is the sufficient statistic vector. It is called sufficient because the likelihood depends on \(x\) only through \(\mathcal{T}(x)\): all information in \(x\) relevant to estimating \(\eta\) is captured by \(\mathcal{T}(x)\).
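- \(Z(\eta) = \int_{\mathcal{X}} h(x) \exp\{\eta^\top \mathcal{T}(x)\} \, dx\) (a sum when \(x\) is discrete) is the normalization function, defined so that \(p(x \mid \eta)\) integrates to one over the support.
Each specific exponential family is determined by its choice of \(h(x)\) and \(\mathcal{T}(x)\).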
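To make the definition concrete, here is a minimal sketch that evaluates \(\log p(x \mid \eta) = \log h(x) + \eta \, \mathcal{T}(x) - A(\eta)\) for a scalar family. The helper names (`log_density`, `poisson_log_h`, and so on) are hypothetical and chosen only for illustration. The Poisson decomposition used here (\(h(x) = 1/x!\), \(\mathcal{T}(x) = x\), \(\eta = \log \lambda\), \(A(\eta) = e^{\eta}\)) is the standard one and is not derived in this section.

```python
import math

def log_density(x, eta, log_h, suff_stat, log_partition):
    """Scalar exponential family: log h(x) + eta * T(x) - A(eta)."""
    return log_h(x) + eta * suff_stat(x) - log_partition(eta)

# Poisson(lam) pieces (assumed standard decomposition, see lead-in):
#   h(x) = 1 / x!,  T(x) = x,  eta = log(lam),  A(eta) = exp(eta)
def poisson_log_h(x):
    return -math.lgamma(x + 1)      # log(1 / x!)

def poisson_T(x):
    return x

def poisson_A(eta):
    return math.exp(eta)

lam = 3.5
eta = math.log(lam)
for x in range(5):
    ef = math.exp(log_density(x, eta, poisson_log_h, poisson_T, poisson_A))
    direct = lam**x * math.exp(-lam) / math.factorial(x)
    print(x, ef, direct)            # the two columns should agree
```

Swapping in a different triple of functions gives a different family without touching `log_density`, which is the sense in which \(h\) and \(\mathcal{T}\) determine the family.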
The normalization function \(Z(\eta)\) is often referred to as the partition function in statistical physics and machine learning. In particular, the log-partition function \[ A(\eta) = \log Z(\eta) \] is convex over the convex set \(\Omega = \{\eta \in \mathbb{R}^K : A(\eta) < \infty\}\), and is strictly convex when the family is minimal. We verify these properties below via the Hessian formula \(\nabla^2 A(\eta) = \operatorname{Cov}[\mathcal{T}(x)]\). In the context of optimization and information geometry, \(A(\eta)\) acts as a convex potential function that links natural parameters to moment parameters.
An exponential family is said to be minimal if there is no nonzero \(\eta \in \mathbb{R}^K\) such that \[ \eta^\top \mathcal{T}(x) = \text{const} \] holds for all \(x\) in the support.
Equivalently, the components of the sufficient statistic vector \(\mathcal{T}(x)\) (along with the constant function \(1\)) are linearly independent as functions of \(x\). This property ensures the identifiability of the model: there is a unique, one-to-one mapping between the natural parameters \(\eta\) and the resulting probability distribution.
Math vs. ML Perspectives: Handling the Constant
In many machine learning contexts, you may see the condition simplified to \[ \eta^\top \mathcal{T}(x) = 0. \] This assumes that the bias (constant) term has already been absorbed into the sufficient statistic vector \(\mathcal{T}(x)\) by augmenting it with a constant \(1\).
However, in our canonical definition where the log-partition function \(A(\eta)\) and base measure \(h(x)\) are explicitly separated, we must use the more rigorous \(\text{const}\) condition. If \(\eta^\top \mathcal{T}(x)\) were a constant, that value would be absorbed by \(A(\eta)\), meaning multiple \(\eta\) values could represent the same distribution. A minimal representation prevents this redundancy, ensuring that the Fisher information matrix is strictly positive definite and the optimization surface is well-behaved.
A common example of a non-minimal representation is the \(K\)-class multinomial distribution. Because the probabilities must sum to one (\(\sum p_i = 1\)), the natural parameters are over-specified. While we can work with this redundant form (as often done in Softmax regression), we can always reparameterize it into a minimal representation using \(K-1\) independent components to restore unique identifiability.
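The redundancy described above can be seen numerically. The sketch below assumes the usual softmax map from \(K\) natural parameters to class probabilities, and pins the last component to zero to obtain \(K-1\) free parameters. The choice of reference class and the helper names `to_minimal` and `from_minimal` are illustrative assumptions.

```python
import numpy as np

def softmax(eta):
    """Class probabilities from K (redundant) natural parameters."""
    z = eta - np.max(eta)           # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def to_minimal(eta):
    """Keep K-1 parameters by measuring everything relative to the last class."""
    return eta[:-1] - eta[-1]

def from_minimal(eta_min):
    """Re-embed K-1 parameters as a K-vector whose last entry is 0."""
    return np.append(eta_min, 0.0)

eta = np.array([0.2, -1.0, 1.5])
shifted = eta + 7.0                 # move along the all-ones direction

p = softmax(eta)
p_shifted = softmax(shifted)        # same distribution as p
p_min = softmax(from_minimal(to_minimal(eta)))

print(p, p_shifted, p_min)
print(to_minimal(eta), to_minimal(shifted))   # identical minimal parameters
```

Both `eta` and `shifted` collapse to the same \(K-1\) vector, so the minimal form assigns exactly one parameter value to each distribution.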
Let \(\eta = f(\phi)\), where \(\phi \in \mathbb{R}^M\) is some other possibly smaller set of parameters (\(M \leq K\)), and then \[ p(x \mid \phi ) = h(x) \exp\{ f(\phi)^\top \mathcal{T}(x) - A(f(\phi))\}. \] If \(M < K\) and the mapping \(\phi \to \eta\) is nonlinear, it is said to be a curved exponential family.
If \(\eta = f(\phi) = \phi\), the model is in canonical form. If, in addition, \(\mathcal{T}(x) = x\), we call it a natural exponential family (NEF): \[ p(x \mid \eta ) = h(x) \exp\{\eta^\top x - A(\eta)\}. \] Finally, we define the moment parameters as follows: \[ m = \mathbb{E}[\mathcal{T}(x)] \in \mathbb{R}^K. \]
\[ \begin{align*} \operatorname{Ber}(x \mid \mu) &= \mu^x (1-\mu)^{1-x} \\\\ &= \exp\{x \log (\mu) + (1-x) \log (1-\mu)\}\\\\ &= \exp\{\mathcal{T}(x)^\top \eta\} \end{align*} \] where
- \(\mathcal{T}(x) = [\mathbb{1}\{x=1\}, \, \mathbb{1}\{x=0\}]\).
- \(\eta = [\log(\mu), \, \log(1-\mu)]\).
- \(\mu\) is the mean parameter.
In this representation, the components \(\mathbb{1}\{x=1\}\) and \(\mathbb{1}\{x=0\}\) of \(\mathcal{T}(x)\) satisfy \(\mathbb{1}\{x=1\} + \mathbb{1}\{x=0\} = 1\) for all \(x\) in the support, violating the minimality condition with the nonzero vector \(\eta = (1, 1)\). Consequently, shifting \(\eta\) by any vector parallel to \((1, 1)\) yields the same distribution, so \(\eta\) is not uniquely determined. To restore uniqueness, we use a minimal representation: \[ \operatorname{Ber}(x \mid \mu) = \exp\left\{x \log \left(\frac{\mu}{1-\mu}\right) + \log (1-\mu)\right\} \] where
- \(\mathcal{T}(x) = x\).
- \(\eta = \log \left(\frac{\mu}{1-\mu}\right)\).
- \(A(\eta) = -\log (1-\mu) = \log(1+ e^{\eta})\).
- \(h(x) = 1\).
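As a quick sanity check of the minimal representation, the following sketch compares \(\mu^x (1-\mu)^{1-x}\) with \(\exp\{\eta x - A(\eta)\}\) for both values of \(x\). The function names are hypothetical, and \(\mu = 0.3\) is an arbitrary test value.

```python
import math

def bernoulli_direct(x, mu):
    """Standard pmf: mu^x * (1 - mu)^(1 - x)."""
    return mu**x * (1.0 - mu)**(1 - x)

def bernoulli_expfam(x, eta):
    """Minimal form with h(x) = 1, T(x) = x, A(eta) = log(1 + e^eta)."""
    A = math.log1p(math.exp(eta))
    return math.exp(eta * x - A)

mu = 0.3
eta = math.log(mu / (1.0 - mu))     # natural parameter = log-odds

for x in (0, 1):
    print(x, bernoulli_direct(x, mu), bernoulli_expfam(x, eta))
```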
The Bernoulli example reveals a fundamental connection to classification models. The mean parameter \(\mu\) can be recovered from the canonical parameter \(\eta\) via the logistic (sigmoid) function: \[ \mu = \sigma(\eta) = \frac{1}{1+e^{-\eta}}. \] Notice also that \[ \begin{align*} \frac{dA}{d \eta} &= \frac{d}{d\eta} \log(1+ e^{\eta}) \\\\ &= \frac{e^{\eta}}{1 + e^{\eta}} \\\\ &= \frac{1}{1+e^{-\eta}} \\\\ &= \mu. \end{align*} \]
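The derivative identity for the Bernoulli case can also be checked numerically. This sketch uses central finite differences (an illustrative choice, with hypothetical helper names and step sizes) to compare \(dA/d\eta\) with \(\sigma(\eta)\). It also compares the second derivative with the Bernoulli variance \(\mu(1-\mu)\), in line with the Hessian formula \(\nabla^2 A(\eta) = \operatorname{Cov}[\mathcal{T}(x)]\).

```python
import math

def A(eta):
    """Bernoulli log-partition function log(1 + e^eta)."""
    return math.log1p(math.exp(eta))

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def central_diff(f, t, step=1e-5):
    """First derivative by central differences."""
    return (f(t + step) - f(t - step)) / (2.0 * step)

def second_diff(f, t, step=1e-4):
    """Second derivative by central differences."""
    return (f(t + step) - 2.0 * f(t) + f(t - step)) / step**2

eta = 0.8
mu = sigmoid(eta)
dA = central_diff(A, eta)           # should be close to mu
d2A = second_diff(A, eta)           # should be close to mu * (1 - mu)

print(dA, mu)
print(d2A, mu * (1.0 - mu))
```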
In general, the log-partition function \(A(\eta)\) acts as the cumulant generating function for the sufficient statistics \(\mathcal{T}(x)\). In other words, the derivatives of \(A(\eta)\) generate all the cumulants of \(\mathcal{T}(x)\). The first two are particularly important.
For an exponential family with log-partition function \(A(\eta)\), the gradient and Hessian of \(A\) recover the first two cumulants of the sufficient statistic vector: \[ \nabla A(\eta) = \mathbb{E}[\mathcal{T}(x)], \qquad \nabla^2 A(\eta) = \operatorname{Cov}[\mathcal{T}(x)]. \]
The first identity says the gradient of \(A\) returns the moment parameters \(m := \mathbb{E}[\mathcal{T}(x)]\) (sometimes denoted \(\mu\)). The second identity says the Hessian of \(A\) returns the covariance of the sufficient statistics. As a consequence, for a minimal exponential family, the Hessian is strictly positive definite, and thus \(A(\eta)\) is strictly convex in \(\eta\). This strict convexity guarantees that the MLE (derived in the next section), when it exists, is unique.
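A short derivation sketch shows where both identities come from. It is written for continuous \(x\) (replace integrals by sums in the discrete case) and assumes \(\eta\) lies in the interior of \(\Omega\), so that differentiation under the integral sign is permitted. Starting from \(A(\eta) = \log \int h(x) \exp\{\eta^\top \mathcal{T}(x)\} \, dx\): \[ \begin{align*} \nabla A(\eta) &= \frac{1}{Z(\eta)} \int \mathcal{T}(x) \, h(x) \exp\{\eta^\top \mathcal{T}(x)\} \, dx \\\\ &= \int \mathcal{T}(x) \, p(x \mid \eta) \, dx \\\\ &= \mathbb{E}[\mathcal{T}(x)]. \end{align*} \] Differentiating once more and using \(\nabla_\eta \, p(x \mid \eta) = p(x \mid \eta)\,(\mathcal{T}(x) - \nabla A(\eta))\): \[ \begin{align*} \nabla^2 A(\eta) &= \int \mathcal{T}(x) \, (\mathcal{T}(x) - \mathbb{E}[\mathcal{T}(x)])^\top p(x \mid \eta) \, dx \\\\ &= \mathbb{E}[\mathcal{T}(x)\mathcal{T}(x)^\top] - \mathbb{E}[\mathcal{T}(x)]\,\mathbb{E}[\mathcal{T}(x)]^\top \\\\ &= \operatorname{Cov}[\mathcal{T}(x)]. \end{align*} \]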
We now verify these properties for the two most important continuous distributions in machine learning.
\[ \begin{align*} \mathcal{N}(x \mid \mu, \, \sigma^2) &= \frac{1}{\sigma\sqrt{2\pi}}\exp \left\{-\frac{1}{2\sigma^2}(x - \mu)^2 \right\} \\\\ &= \frac{1}{\sqrt{2\pi}} \exp \left\{ \frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}x^2 -\frac{1}{2\sigma^2}\mu^2 -\log \sigma \right\} \end{align*} \] where
- \(\mathcal{T}(x) = \begin{bmatrix}x \\ x^2 \end{bmatrix}\)
- \(\eta = \begin{bmatrix} \frac{\mu}{\sigma^2} \\ -\frac{1}{2\sigma^2} \end{bmatrix} \)
- \(A(\eta) = \frac{\mu^2}{2\sigma^2}+\log \sigma = -\frac{\eta_1^2}{4\eta_2}-\frac{1}{2}\log(-2\eta_2)\)
- \(h(x) = \frac{1}{\sqrt{2\pi}}\).
Also, the moment parameters are given by: \[ m = \begin{bmatrix} \mu \\ \mu^2 + \sigma^2 \end{bmatrix}. \] Note. If \(\sigma = 1\), the distribution becomes a natural exponential family such that
- \(\mathcal{T}(x) = x\)
- \(\eta = \mu\)
- \(A(\eta) = \frac{\mu^2}{2} = \frac{\eta^2}{2}\)
- \(h(x) = \frac{1}{\sqrt{2\pi}}\exp\{-\frac{x^2}{2}\} = \mathcal{N}(x \mid 0, 1)\), which is not constant in \(x\).
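For the general two-parameter univariate normal, the relation \(\nabla A(\eta) = m\) can be checked numerically. This is a minimal sketch with hypothetical helper names: it converts \((\mu, \sigma^2)\) to natural parameters, evaluates \(A(\eta)\) from the closed form given above, and approximates the gradient by central finite differences (an illustrative choice, not the only way to differentiate \(A\)).

```python
import numpy as np

def to_natural(mu, sigma2):
    """(mu, sigma^2) -> eta = [mu / sigma^2, -1 / (2 sigma^2)]."""
    return np.array([mu / sigma2, -0.5 / sigma2])

def to_mean_var(eta):
    """Inverse map: eta -> (mu, sigma^2)."""
    sigma2 = -0.5 / eta[1]
    return eta[0] * sigma2, sigma2

def A(eta):
    """Log-partition function of the univariate normal in natural parameters."""
    return -eta[0]**2 / (4.0 * eta[1]) - 0.5 * np.log(-2.0 * eta[1])

def grad_fd(f, eta, step=1e-6):
    """Central finite-difference gradient."""
    g = np.zeros_like(eta)
    for i in range(eta.size):
        e = np.zeros_like(eta)
        e[i] = step
        g[i] = (f(eta + e) - f(eta - e)) / (2.0 * step)
    return g

mu, sigma2 = 1.2, 0.7
eta = to_natural(mu, sigma2)

m_numeric = grad_fd(A, eta)
m_expected = np.array([mu, mu**2 + sigma2])   # [E[x], E[x^2]]

print(m_numeric, m_expected)
```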
The univariate case extends naturally to higher dimensions. The multivariate normal distribution introduces the information form parameterization, which is the natural-parameter (canonical) representation for Gaussians and plays a central role in graphical models and message passing algorithms.
\[ \begin{align*} \mathcal{N}(x \mid \mu, \Sigma) &= \frac{1}{(2\pi)^{\frac{D}{2}}\sqrt{\det(\Sigma)}} \exp \left\{ -\frac{1}{2}x^\top \Sigma^{-1}x + x^\top \Sigma^{-1}\mu -\frac{1}{2}\mu^\top \Sigma^{-1}\mu \right\}\\\\ &= c \exp\left \{x^\top \Sigma^{-1}\mu -\frac{1}{2}x^\top \Sigma^{-1}x \right\} \end{align*} \] where \[ c = \frac{\exp \left\{-\frac{1}{2}\mu^\top \Sigma^{-1} \mu\right\}}{(2\pi)^{\frac{D}{2}}\sqrt{\det(\Sigma)}} \] and \(\Sigma\) is a covariance matrix.
Now, we represent this model using canonical parameters. \[ \mathcal{N}_c (x \mid \xi, \Lambda) = c' \exp \left\{x^\top \xi - \frac{1}{2}x^\top \Lambda x \right\} \] where
- \(\Lambda = \Sigma^{-1}\) is a precision matrix
- \(\xi = \Sigma^{-1}\mu\) is a precision-weighted mean vector
- \(c' = \frac{\exp \left\{-\frac{1}{2}\xi^\top \Lambda^{-1} \xi \right\}}{(2\pi)^{\frac{D}{2}}\sqrt{\det(\Lambda^{-1})}}\).
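The change of parameters can be sketched in a few lines of NumPy. The function names are hypothetical and the test values are arbitrary. The check at the end compares \(\log c\) (moment form) with \(\log c'\) (information form), which should agree because \(\xi^\top \Lambda^{-1} \xi = \mu^\top \Sigma^{-1} \mu\) and \(\det(\Lambda^{-1}) = \det(\Sigma)\).

```python
import numpy as np

def moment_to_information(mu, Sigma):
    """(mu, Sigma) -> (xi, Lambda) with Lambda = Sigma^{-1}, xi = Lambda mu."""
    Lam = np.linalg.inv(Sigma)
    return Lam @ mu, Lam

def information_to_moment(xi, Lam):
    """(xi, Lambda) -> (mu, Sigma)."""
    Sigma = np.linalg.inv(Lam)
    return Sigma @ xi, Sigma

def log_c(mu, Sigma):
    """log of the constant c in the moment form."""
    D = mu.size
    _, logdet = np.linalg.slogdet(Sigma)
    quad = mu @ np.linalg.solve(Sigma, mu)
    return -0.5 * quad - 0.5 * D * np.log(2 * np.pi) - 0.5 * logdet

def log_c_prime(xi, Lam):
    """log of the constant c' in the information form."""
    D = xi.size
    _, logdet = np.linalg.slogdet(Lam)
    quad = xi @ np.linalg.solve(Lam, xi)
    # -0.5 * log det(Lam^{-1}) equals +0.5 * log det(Lam)
    return -0.5 * quad - 0.5 * D * np.log(2 * np.pi) + 0.5 * logdet

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 0.5]])

xi, Lam = moment_to_information(mu, Sigma)
mu_back, Sigma_back = information_to_moment(xi, Lam)

print(log_c(mu, Sigma), log_c_prime(xi, Lam))
```

For large or ill-conditioned matrices, an explicit inverse is usually avoided in favor of a Cholesky-based solve. The direct `inv` call here is only for readability.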
This representation is called information form and can be converted to exponential family notation as follows: \[ \begin{align*} \mathcal{N}_c (x \mid \xi, \Lambda) &= (2\pi)^{-\frac{D}{2}} \exp \left\{\frac{1}{2}\log | \Lambda | -\frac{1}{2}\xi^\top \Lambda^{-1}\xi \right\} \exp \left\{-\frac{1}{2}x^\top \Lambda x + x^\top \xi \right\} \\\\ &= h(x)g(\eta)\exp \left\{-\frac{1}{2}x^\top \Lambda x + x^\top \xi \right\} \\\\ &= h(x)g(\eta)\exp \left \{-\frac{1}{2}(\sum_{i, j}x_i x_j \Lambda_{ij}) + x^\top \xi \right\} \\\\ &= h(x)g(\eta)\exp \left\{-\frac{1}{2}\operatorname{vec}(\Lambda)^\top \operatorname{vec}(xx^\top) + x^\top \xi \right\} \\\\ &= h(x)\exp\{\eta^\top \mathcal{T}(x) - A(\eta)\} \end{align*} \] where
- \(\mathcal{T}(x) = [x ; \operatorname{vec}(xx^\top)]\)
- \(\eta = [\xi ; -\frac{1}{2}\operatorname{vec}(\Lambda)] = [\Sigma^{-1}\mu ; -\frac{1}{2}\operatorname{vec}(\Sigma^{-1})]\)
- \(A(\eta) = -\log g(\eta) = -\frac{1}{2} \log | \Lambda | + \frac{1}{2}\xi^\top \Lambda^{-1} \xi \)
- \(h(x) = (2\pi)^{-\frac{D}{2}}\).
The moment parameters are given by: \[ m = [\mu ; \operatorname{vec}(\mu\mu^\top + \Sigma)]. \] Note. This form is non-minimal. Because \(x_i x_j = x_j x_i\), the vector \(\operatorname{vec}(xx^\top)\) contains \(D(D-1)/2\) duplicated entries, and correspondingly the symmetry \(\Lambda_{ij} = \Lambda_{ji}\) of the precision matrix imposes \(D(D-1)/2\) linear constraints on \(\operatorname{vec}(\Lambda)\), so the natural parameter \(\eta\) is over-specified by \(D(D-1)/2\) redundant components. A minimal representation would use only the upper-triangular (or lower-triangular) part of \(\Lambda\), including the diagonal. However, in practice, the non-minimal representation is easier to plug into algorithms and is stable for certain operations, while the minimal form is preferred for mathematical derivations.
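The following sketch assembles \(\mathcal{T}(x)\), \(\eta\), and \(A(\eta)\) for the multivariate normal and compares the resulting log-density with the standard formula. All function names and test values are illustrative assumptions. `ravel()` flattens row by row, whereas \(\operatorname{vec}\) usually stacks columns. The inner product is unaffected as long as both vectors are flattened the same way. The last lines exhibit a nonzero vector `v` with \(v^\top \mathcal{T}(x) = 0\) for every \(x\), which is exactly the failure of minimality noted above.

```python
import numpy as np

def suff_stat(x):
    """T(x) = [x ; vec(x x^T)]."""
    return np.concatenate([x, np.outer(x, x).ravel()])

def natural_params(mu, Sigma):
    """eta = [Lambda mu ; -0.5 vec(Lambda)] with Lambda = Sigma^{-1}."""
    Lam = np.linalg.inv(Sigma)
    return np.concatenate([Lam @ mu, -0.5 * Lam.ravel()])

def log_partition(mu, Sigma):
    """A(eta) = -0.5 log|Lambda| + 0.5 xi^T Lambda^{-1} xi."""
    Lam = np.linalg.inv(Sigma)
    xi = Lam @ mu
    _, logdet = np.linalg.slogdet(Lam)
    return -0.5 * logdet + 0.5 * (xi @ np.linalg.solve(Lam, xi))

def log_density_expfam(x, mu, Sigma):
    log_h = -0.5 * x.size * np.log(2 * np.pi)
    return log_h + natural_params(mu, Sigma) @ suff_stat(x) - log_partition(mu, Sigma)

def log_density_direct(x, mu, Sigma):
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * (x.size * np.log(2 * np.pi) + logdet + quad)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 0.5]])
x = np.array([0.4, -1.1])

lp_ef = log_density_expfam(x, mu, Sigma)
lp_direct = log_density_direct(x, mu, Sigma)
print(lp_ef, lp_direct)

# Redundancy: +1 on the x_0*x_1 entry and -1 on the x_1*x_0 entry cancel.
D = x.size
v = np.zeros(D + D * D)
v[D + 0 * D + 1] = 1.0
v[D + 1 * D + 0] = -1.0
print(v @ suff_stat(x))             # zero for any x
```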