Fisher Information Matrix


On the previous page, we saw that the second derivative of the log-partition function \(A(\eta)\) equals the covariance of the sufficient statistics. This object turns out to have significance beyond the exponential family: it measures how much information data carries about unknown parameters, governs the precision of maximum likelihood estimates, and defines a natural geometry on the space of probability distributions.

The Fisher information matrix (FIM) captures the curvature of the log-likelihood function. In frequentist statistics, it characterizes the asymptotic variance of the MLE. In Bayesian statistics, it defines Jeffreys prior. In optimization, it gives rise to natural gradient descent.

Definition: Fisher Information Matrix

The Fisher information matrix (FIM) is the covariance of the score function \(s(\theta)\): \[ F(\theta) = \mathbb{E}_{x \sim p(x \mid \theta)} [s(\theta)s(\theta)^\top] \] where \(s(\theta)\) is the gradient of the log-likelihood with respect to the parameter vector \(\theta\): \[ s(\theta) = \nabla_{\theta} \log p(x \mid \theta). \]

Note: The score function has zero mean, \(\mathbb{E}[s(\theta)] = 0\), under the regularity conditions stated below. This is why the FIM equals \(\mathbb{E}[s\,s^\top]\) rather than the full covariance \(\mathbb{E}[s\,s^\top] - \mathbb{E}[s]\,\mathbb{E}[s]^\top\).
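As a small sanity check of the definition, the sketch below evaluates the mean of the score and the Fisher information exactly for a Bernoulli model by summing over the two outcomes. The Bernoulli choice and the function names are illustrative assumptions, not part of the text above.

```python
def bernoulli_pmf(x, theta):
    """p(x | theta) for x in {0, 1}."""
    return theta if x == 1 else 1.0 - theta

def score(x, theta):
    """d/dtheta log p(x | theta) = x/theta - (1 - x)/(1 - theta)."""
    return x / theta - (1 - x) / (1.0 - theta)

def score_mean(theta):
    """E[s(theta)], computed exactly over x in {0, 1}."""
    return sum(bernoulli_pmf(x, theta) * score(x, theta) for x in (0, 1))

def fisher_information(theta):
    """E[s(theta)^2], computed exactly over x in {0, 1}."""
    return sum(bernoulli_pmf(x, theta) * score(x, theta) ** 2 for x in (0, 1))

theta = 0.3
print(score_mean(theta))                 # should be (numerically) zero
print(fisher_information(theta))         # should match the closed form below
print(1.0 / (theta * (1.0 - theta)))     # closed form 1 / (theta (1 - theta))
```

Because the score has zero mean, the second moment computed here is also its variance.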

Theorem 1:

If \( \log p(x \mid \theta)\) is twice differentiable, and under certain regularity conditions, the Fisher information matrix equals the expected Hessian of the negative log likelihood (NLL): \[ F(\theta) = - \mathbb{E}_{x \sim \theta} [\nabla_{\theta}^2 \log p(x \mid \theta)]. \]

Proof:

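First, we show that the expected value of the score function \(s(\theta)\) is zero. In the scalar case, assuming that \(p(x \mid \theta)\) is differentiable in \(\theta\) and that differentiation and integration can be interchanged (in particular, the bounds of the integral do not depend on \(\theta\)), we have \[ \begin{align*} &\int p(x \mid \theta) dx = 1 \\\\ &\Longrightarrow \frac{\partial}{\partial \theta} \int p(x \mid \theta) dx = 0 \\\\ &\Longrightarrow \int \frac{\partial}{\partial \theta} p(x \mid \theta) dx = 0 \\\\ &\Longrightarrow \int \left[\frac{\partial}{\partial \theta} \log p(x \mid \theta)\right] p(x \mid \theta) dx = 0 \tag{1} \\\\ &\Longrightarrow \mathbb{E}[s(\theta)] = 0 \end{align*} \] where \(\frac{\partial}{\partial \theta} \log p(x \mid \theta) = s(\theta)\) and the step to Equation (1) uses \(\frac{\partial}{\partial \theta} p(x \mid \theta) = p(x \mid \theta) \frac{\partial}{\partial \theta} \log p(x \mid \theta)\).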

Differentiating Equation (1) with respect to \(\theta\) and applying the product rule under the integral, we obtain: \[ \begin{align*} 0 &= \frac{\partial}{\partial \theta} \int \left[\frac{\partial}{\partial \theta} \log p(x \mid \theta)\right] p(x \mid \theta) dx \\\\ &= \int \left[\frac{\partial^2}{\partial \theta^2} \log p(x \mid \theta)\right] p(x \mid \theta) dx + \int \left[\frac{\partial}{\partial \theta} \log p(x \mid \theta)\right] \frac{\partial}{\partial \theta} p(x \mid \theta) dx. \end{align*} \]

Using the identity \(\frac{\partial}{\partial \theta} p(x \mid \theta) = p(x \mid \theta) \frac{\partial}{\partial \theta} \log p(x \mid \theta)\) (derived from the chain rule), the second term can be rewritten as: \[ 0 = \int \left[\frac{\partial^2}{\partial \theta^2} \log p(x \mid \theta)\right] p(x \mid \theta) dx + \int \left[\frac{\partial}{\partial \theta} \log p(x \mid \theta)\right]^2 p(x \mid \theta) dx. \]

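Therefore, \[ - \mathbb{E}_{x \sim \theta } \left[\frac{\partial^2}{\partial \theta^2} \log p(x \mid \theta)\right] = \mathbb{E}_{x \sim \theta } \left[ \left(\frac{\partial}{\partial \theta} \log p(x \mid \theta)\right)^2 \right]. \] The right-hand side is the scalar FIM \(\mathbb{E}[s(\theta)^2]\), which proves the claim in the scalar case. The multivariate case follows from the same argument applied to each pair of partial derivatives.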
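The identity in Theorem 1 can be checked numerically. The following is a minimal sketch, assuming a Poisson model with rate \(\lambda\) (an illustrative choice, not taken from the text), derivatives approximated by central finite differences, and the expectation truncated at a hypothetical cutoff `x_max`. Both forms should come out close to the known value \(1/\lambda\).

```python
import math

def log_pmf(x, lam):
    """log p(x | lam) for a Poisson distribution with rate lam."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def d1(f, t, h=1e-3):
    """Central-difference approximation of the first derivative."""
    return (f(t + h) - f(t - h)) / (2 * h)

def d2(f, t, h=1e-3):
    """Central-difference approximation of the second derivative."""
    return (f(t + h) - 2 * f(t) + f(t - h)) / (h * h)

def fisher_two_ways(lam, x_max=200):
    """Return (E[s^2], -E[d^2 log p / d lam^2]), truncating the sum at x_max."""
    score_form, hessian_form = 0.0, 0.0
    for x in range(x_max + 1):
        p = math.exp(log_pmf(x, lam))

        def f(t, x=x):
            return log_pmf(x, t)

        score_form += p * d1(f, lam) ** 2
        hessian_form -= p * d2(f, lam)
    return score_form, hessian_form

print(fisher_two_ways(3.0))  # both entries should be close to 1/3
```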

FIM for the Exponential Family

For general parametric models, computing the FIM requires evaluating an expectation that may have no closed form. The exponential family is special: the FIM reduces to a simple derivative of the log-partition function, which we have already computed.

Consider an exponential family distribution with natural parameter vector \(\eta \in \mathbb{R}^K\): \[ p(x \mid \eta) = h(x) \exp \{\eta^\top \mathcal{T}(x) - A(\eta)\}. \]

Recall that the gradient of the log-partition function \(A(\eta)\) is the expected value of the sufficient statistics \(\mathcal{T}(x)\), known as the moment parameters \(m\): \[ \nabla_{\eta} A(\eta) = \mathbb{E}[\mathcal{T}(x)] = m. \]

Also, the gradient of the log likelihood is the sufficient statistics minus their expected value: \[ \begin{align*} &\log p(x \mid \eta) = \log h(x) + \eta^\top \mathcal{T}(x) - A(\eta) \\\\ &\Longrightarrow \nabla_{\eta} \log p(x \mid \eta) = \mathcal{T}(x) - \mathbb{E}[\mathcal{T}(x)] = \mathcal{T}(x) - m. \end{align*} \]

Therefore, the Hessian of the log partition function is the same as the FIM, which is the same as the covariance of the sufficient statistics: \[ \begin{align*} F(\eta) &= - \mathbb{E}_{p(x \mid \eta)} \left[\nabla_{\eta}^2 (\eta^\top \mathcal{T}(x) - A(\eta))\right] \\\\ &= \nabla_{\eta}^2 A(\eta) \\\\ &= \text{Cov}[\mathcal{T}(x)]. \end{align*} \] So, the FIM is indeed the second cumulant of the sufficient statistics.
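To make the identity \(F(\eta) = \nabla_{\eta}^2 A(\eta) = \text{Cov}[\mathcal{T}(x)]\) concrete, here is a small sketch for the Bernoulli distribution in its natural parameterization, where \(\mathcal{T}(x) = x\) and \(A(\eta) = \log(1 + e^{\eta})\). The Bernoulli choice and the finite-difference step are assumptions made for illustration.

```python
import math

def log_partition(eta):
    """A(eta) = log(1 + exp(eta)) for the Bernoulli distribution."""
    return math.log1p(math.exp(eta))

def sigmoid(eta):
    """Mean parameter m = A'(eta)."""
    return 1.0 / (1.0 + math.exp(-eta))

def hessian_of_A(eta, h=1e-3):
    """Second derivative of A by central finite differences."""
    return (log_partition(eta + h) - 2 * log_partition(eta)
            + log_partition(eta - h)) / (h * h)

def variance_of_T(eta):
    """Var[T(x)] with T(x) = x, summed exactly over x in {0, 1}."""
    mu = sigmoid(eta)
    return sum(p * (x - mu) ** 2 for x, p in ((0, 1.0 - mu), (1, mu)))

eta = 0.7
print(hessian_of_A(eta), variance_of_T(eta))  # the two values should agree closely
```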

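Note: Sometimes we need the FIM with respect to the moment parameters \(m\). Since \(m = \nabla_{\eta} A(\eta)\), the Jacobian of the map \(\eta \mapsto m\) is \[ \frac {dm}{d\eta} = \nabla_{\eta}^2 A(\eta) = F(\eta). \] Under a change of parameters, the FIM transforms as \(F(m) = J^\top F(\eta) J\) with \(J = \frac{d\eta}{dm} = \left(\frac{dm}{d\eta}\right)^{-1} = F(\eta)^{-1}\). Because \(F(\eta)\) is symmetric, \[ F(m) = F(\eta)^{-1} F(\eta) F(\eta)^{-1} = F(\eta)^{-1}. \]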
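The relation \(F(m) = F(\eta)^{-1}\) can be verified in the scalar Bernoulli case, where \(m = \sigma(\eta)\). The sketch below computes the Fisher information in the moment parameter directly from the score and compares it with the reciprocal of \(A''(\eta)\); the model choice and function names are illustrative assumptions.

```python
import math

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def fisher_natural(eta):
    """F(eta) = A''(eta) = m (1 - m) for the Bernoulli distribution."""
    m = sigmoid(eta)
    return m * (1.0 - m)

def fisher_moment(m):
    """F(m) = E[(d/dm log p(x | m))^2], summed exactly over x in {0, 1}."""
    score_x1 = 1.0 / m             # derivative of log m
    score_x0 = -1.0 / (1.0 - m)    # derivative of log(1 - m)
    return m * score_x1 ** 2 + (1.0 - m) * score_x0 ** 2

eta = 0.7
m = sigmoid(eta)
print(fisher_moment(m), 1.0 / fisher_natural(eta))  # the two values should agree
```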

Natural Gradient Descent

Standard gradient descent treats parameter space as Euclidean: it moves in the direction of steepest descent measured by \(\|\delta\|_2\). But when the parameters define a probability distribution, two parameter vectors that are close in Euclidean distance can correspond to very different distributions (and vice versa). Natural gradient descent (NGD) corrects this by measuring "steepest descent" in terms of the KL divergence between distributions rather than the Euclidean distance between parameter vectors.

For any input \(x \in \mathbb{R}^n\), we can approximate the KL divergence in terms of the FIM by a second-order Taylor series expansion: \[ \begin{align*} D_{\mathbb{KL}}(p_{\theta}(y \mid x) \, \| \, p_{\theta + \delta} (y \mid x)) &\approx -\delta^\top \mathbb{E}_{p_{\theta}(y \mid x)} [\nabla \log p_{\theta}(y \mid x) ] - \frac{1}{2}\delta^\top \mathbb{E}_{p_{\theta}(y \mid x)}[\nabla^2 \log p_{\theta}(y \mid x) ] \delta \\\\ &= 0 - \frac{1}{2}\delta^\top \left(-F_x(\theta)\right) \delta \\\\ &= \frac{1}{2}\delta^\top F_x(\theta) \delta \end{align*} \] where \(\delta\) represents the change in the parameters. The first term vanishes because the score has zero mean, and the second uses Theorem 1, \(\mathbb{E}[\nabla^2 \log p_{\theta}] = -F_x(\theta)\).
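The quality of the quadratic approximation \(\frac{1}{2}\delta^\top F \delta\) can be inspected numerically. The following sketch assumes an unconditional scalar Bernoulli model (a simplification of the conditional setting above, chosen for illustration) and compares the exact KL divergence with the quadratic form for shrinking \(\delta\).

```python
import math

def kl_bernoulli(p, q):
    """Exact KL divergence D(Bernoulli(p) || Bernoulli(q))."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def quadratic_approx(theta, delta):
    """0.5 * delta^2 * F(theta) with F(theta) = 1 / (theta (1 - theta))."""
    fisher = 1.0 / (theta * (1.0 - theta))
    return 0.5 * delta * delta * fisher

theta = 0.3
for delta in (0.1, 0.01, 0.001):
    exact = kl_bernoulli(theta, theta + delta)
    approx = quadratic_approx(theta, delta)
    # The relative gap should shrink roughly in proportion to delta.
    print(delta, exact, approx, abs(exact - approx) / exact)
```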

The average KL divergence between the updated distribution and the previous one is then approximated by \[ \frac{1}{2}\delta^\top F \delta \] where \(F\) is the FIM averaged over the data distribution: \[ F(\theta) = \mathbb{E}_{p_{\mathcal{D}}(x)}[F_x(\theta)]. \]

In NGD, we use the inverse FIM as a preconditioning matrix and update the parameters by \[ \theta_{t+1} = \theta_{t} - \alpha_{t} F(\theta_t)^{-1} \nabla \mathcal{L}(\theta_t), \] where \(\alpha_t > 0\) is the learning rate (or step size) at iteration \(t\).

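Here, we define the natural gradient: \[ \widetilde{\nabla} \mathcal{L}(\theta_t) = F(\theta_t)^{-1} \nabla \mathcal{L}(\theta_t) = F^{-1}g_t, \] where \(g_t = \nabla \mathcal{L}(\theta_t)\). Note: \(F\) is always positive semidefinite (and positive definite whenever it is nonsingular, which the update above requires), whereas the Hessian of the loss may be indefinite. It is also often easier to compute and approximate than the Hessian matrix.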
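A minimal sketch of the update rule, assuming a one-parameter Bernoulli model in its natural parameter \(\eta\) with the mean negative log-likelihood as the loss. The sample mean `x_bar`, the starting point, the step size, and the number of steps are hypothetical values chosen for illustration. Here \(F\) is a \(1 \times 1\) matrix, so \(F^{-1} g\) is a scalar division; in higher dimensions one would solve a linear system instead.

```python
import math

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def grad_nll(eta, x_bar):
    """Gradient of the mean Bernoulli NLL, -x_bar * eta + log(1 + exp(eta))."""
    return sigmoid(eta) - x_bar

def fisher(eta):
    """FIM of the Bernoulli model in its natural parameter."""
    mu = sigmoid(eta)
    return mu * (1.0 - mu)

def run(x_bar, eta0=0.0, alpha=0.5, steps=50, natural=True):
    """Run plain or natural gradient descent and return the final eta."""
    eta = eta0
    for _ in range(steps):
        g = grad_nll(eta, x_bar)
        if natural:
            g = g / fisher(eta)  # natural gradient F^{-1} g
        eta -= alpha * g
    return eta

x_bar = 0.9                                # hypothetical sample mean of 0/1 data
target = math.log(x_bar / (1.0 - x_bar))   # maximum likelihood estimate of eta
print("NGD:", run(x_bar, natural=True))
print("GD: ", run(x_bar, natural=False))
print("MLE:", target)
```

In this particular setup the natural gradient iterate ends much closer to the maximum likelihood estimate than plain gradient descent with the same step size, but this is an illustration for one toy problem, not a general guarantee.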

Jeffreys Prior

In Bayesian statistics, the FIM is used to derive the Jeffreys prior, a widely used uninformative prior that allows the posterior to be driven primarily by the data. We seek a prior that is invariant under reparameterization, so that inference remains consistent regardless of the choice of parameterization. Given a prior \(p_{\theta}(\theta)\) and a smooth invertible transformation \(\phi = f(\theta)\), the prior should transform as: \[ p_{\phi}(\phi) = p_{\theta}(\theta) \left| \frac{d\theta}{d\phi} \right| \] or in multiple dimensions, \[ p_{\phi}(\phi) = p_{\theta}(\theta) | \det J | \] where \(J\) is the Jacobian matrix with entries \(J_{ij} = \frac{\partial\theta_i}{\partial\phi_j}\).

Definition: Jeffreys Prior

The Jeffreys prior is \[ p(\theta) \propto \sqrt{F(\theta)} \] where \(F\) is the Fisher information.

Or, in multiple dimensions, it has the form \[ p(\theta) \propto \sqrt{\det F(\theta)} \] where \(F\) is the Fisher information matrix.

In the 1D case, suppose \(p_{\theta}(\theta) \propto \sqrt{F(\theta)}\). We can derive a prior for \(\phi\) in terms of \(\theta\) as follows: \[ \begin{align*} p_{\phi}(\phi) &= p_{\theta}(\theta) \left| \frac{d\theta}{d\phi} \right| \\\\ &\propto \sqrt{F(\theta)\left(\frac{d\theta}{d\phi}\right)^2} \\\\ &= \sqrt{\mathbb{E}\left[\left(\frac{d \log p(x \mid \theta)}{d\theta}\right)^2\right] \left(\frac{d\theta}{d\phi}\right)^2}\\\\ &= \sqrt{\mathbb{E}\left[\left(\frac{d \log p(x \mid \theta)}{d\theta}\frac{d\theta}{d\phi}\right)^2\right] }\\\\ &= \sqrt{\mathbb{E}\left[\left(\frac{d \log p(x \mid \phi)}{d\phi}\right)^2\right] }\\\\ &= \sqrt{F(\phi)} \end{align*} \] where the second-to-last step uses the chain rule. So the Jeffreys prior is invariant under reparameterization; the KL divergence is invariant under reparameterization as well.

Example:

Consider the Binomial distribution: \[ X \sim \text{Bin}(n, \theta), \, 0 \leq \theta \leq 1 \] \[ p(x \mid \theta) = \binom{n}{x} \theta^x (1-\theta)^{n-x} \] Since \(\log \binom{n}{x}\) does not depend on \(\theta\), the log likelihood is, up to an additive constant, \[ l(\theta \mid x) = x\log \theta + (n-x)\log(1-\theta) + \text{const}. \]

Using \(\mathbb{E}[x] = n\theta\), the Fisher information is given by \[ \begin{align*} F(\theta) &= -\mathbb{E}_{x \sim \theta}\left[\frac{d^2 l}{d \theta^2}\right] \\\\ &= -\mathbb{E}_{x \sim \theta}\left[-\frac{x}{\theta^2}-\frac{n-x}{(1-\theta)^2}\right] \\\\ &= \frac{n\theta}{\theta^2} + \frac{n(1-\theta)}{(1-\theta)^2} \\\\ &= \frac{n}{\theta(1-\theta)} \\\\ &\propto \theta^{-1}(1-\theta)^{-1}. \end{align*} \] Thus, the Jeffreys prior for the parameter \(\theta\) is given by \[ p_{\theta} (\theta) \propto \sqrt{F(\theta)} \propto \theta^{-\frac{1}{2}} (1 - \theta)^{-\frac{1}{2}}, \] which is the kernel of a \(\text{Beta}(\frac{1}{2}, \frac{1}{2})\) distribution.

Now, consider the parameterization by \(\phi = \frac{\theta}{1 - \theta}\). Solving this expression for \(\theta\), we obtain \[ \theta = \frac{\phi}{\phi+1}. \] Then we have \[ \begin{align*} p(x \mid \phi) &\propto \left(\frac{\phi}{\phi+1}\right)^x \left(1 - \frac{\phi}{\phi + 1}\right)^{n-x} \\\\ &= \phi^x (\phi +1)^{-x} (\phi +1)^{-n +x} \\\\ &= \phi^x (\phi + 1)^{-n} \end{align*} \] and the log likelihood is given by \[ l(\phi \mid x) = x\log \phi -n\log(\phi + 1). \]

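Using \(\mathbb{E}[x] = n\theta = \frac{n\phi}{\phi + 1}\), the Fisher information is given by \[ \begin{align*} F(\phi) &= -\mathbb{E}_{x \sim \phi}\left[\frac{d^2 l}{d \phi^2}\right] \\\\ &= -\mathbb{E}_{x \sim \phi}\left[-\frac{x}{\phi^2}+\frac{n}{(\phi + 1)^2}\right] \\\\ &= \frac{n\phi}{\phi + 1} \cdot \frac{1}{\phi^2} - \frac{n}{(\phi + 1)^2} \\\\ &= \frac{n(\phi+1)-n\phi}{\phi(\phi+1)^2} \\\\ &= \frac{n}{\phi(\phi+1)^2} \\\\ &\propto \phi^{-1} (\phi+1)^{-2} \end{align*} \] Thus, the Jeffreys prior for the reparameterized variable \(\phi\) is given by: \[ p_{\phi}(\phi) \propto \sqrt{F(\phi)} \propto \phi^{-\frac{1}{2}}(\phi +1)^{-1}. \] This agrees with the change-of-variables formula: since \(\frac{d\theta}{d\phi} = (\phi+1)^{-2}\), \[ p_{\theta}(\theta) \left| \frac{d\theta}{d\phi} \right| \propto \left(\frac{\phi}{\phi+1}\right)^{-\frac{1}{2}} \left(\frac{1}{\phi+1}\right)^{-\frac{1}{2}} (\phi+1)^{-2} = \phi^{-\frac{1}{2}}(\phi +1)^{-1}. \]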
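The two Jeffreys priors in this example should be related by the change-of-variables formula \(p_{\phi}(\phi) = p_{\theta}(\theta) \left| \frac{d\theta}{d\phi} \right|\). The sketch below checks this numerically with unnormalized densities; the function names and test points are illustrative assumptions. If the priors are consistent, the printed ratio is the same at every \(\phi\).

```python
def jeffreys_theta(theta):
    """Unnormalized Jeffreys prior in theta: theta^(-1/2) (1 - theta)^(-1/2)."""
    return theta ** -0.5 * (1.0 - theta) ** -0.5

def jeffreys_phi(phi):
    """Unnormalized Jeffreys prior in phi: phi^(-1/2) (phi + 1)^(-1)."""
    return phi ** -0.5 / (phi + 1.0)

def transformed_prior(phi):
    """p_theta(theta(phi)) * |d theta / d phi| with theta = phi / (phi + 1)."""
    theta = phi / (phi + 1.0)
    dtheta_dphi = 1.0 / (phi + 1.0) ** 2
    return jeffreys_theta(theta) * abs(dtheta_dphi)

for phi in (0.1, 1.0, 7.5):
    # A constant ratio across phi means the two priors describe the same distribution.
    print(phi, transformed_prior(phi) / jeffreys_phi(phi))
```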

Note: In 1D, the Jeffreys prior is the same as the reference prior, which maximizes the expected KL divergence between posterior and prior. In other words, it maximizes the information provided by the data relative to the prior. For multidimensional parameters, the two priors generally differ.

Also, finding a reference prior is equivalent to finding the prior that maximizes mutual information between \(\theta\) and \(\mathcal{D}\). \[ \begin{align*} p^*(\theta) &= \arg \max_{p(\theta)} \mathbb{E}_{\mathcal{D}}[D_{\mathbb{KL}}(p(\theta \mid \mathcal{D}) \,\|\, p(\theta) )] \\\\ &= \arg \max_{p(\theta)}\, \mathbb{I}(\theta ; \mathcal{D}) \end{align*} \]

In the continuous case, \[ \begin{align*} \mathbb{I}(\theta \, ; \mathcal{D}) &= \int_{\mathcal{D}} p(\mathcal{D}) D_{\mathbb{KL}}(p(\theta \mid \mathcal{D}) \,\|\, p(\theta) ) d\mathcal{D} \\\\ &= \int p(\mathcal{D}) \left( \int p(\theta \mid \mathcal{D}) \log \frac{p(\theta \mid \mathcal{D})}{p(\theta)}d\theta \right) d\mathcal{D} \\\\ &= \int \int p(\theta \mid \mathcal{D})p(\mathcal{D}) \log \frac{p(\theta \mid \mathcal{D})}{p(\theta)} d\theta d\mathcal{D} \\\\ &= \int \int p(\theta, \mathcal{D}) \log \frac{p(\theta, \mathcal{D})}{p(\theta)p(\mathcal{D})}d\theta d\mathcal{D} \end{align*} \] where \(p(\mathcal{D})\) is the marginal likelihood: \[ p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)p(\theta)d\theta \] and \(p(\theta, \mathcal{D})\) represents the joint probability distribution between \(\theta\) and \(\mathcal{D}\): \[ p(\theta, \mathcal{D}) = p(\theta \mid \mathcal{D})p(\mathcal{D}). \] Note: Remember, the mutual information itself is invariant under reparameterization.
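The equality between the expected posterior-to-prior KL divergence and the mutual information can be checked on a toy problem. The sketch below assumes a discrete grid of \(\theta\) values with a uniform prior and a Binomial likelihood, so the integrals become finite sums; the grid, the prior, and the function names are illustrative assumptions.

```python
import math

def binomial_pmf(x, n, theta):
    return math.comb(n, x) * theta ** x * (1.0 - theta) ** (n - x)

def expected_kl_and_mi(thetas, prior, n):
    """Return (E_D[KL(posterior || prior)], I(theta; D)) for a discrete theta grid."""
    xs = range(n + 1)
    # Joint p(theta, D) = p(theta) p(D | theta), with D = x successes in n trials.
    joint = {(t, x): prior[i] * binomial_pmf(x, n, t)
             for i, t in enumerate(thetas) for x in xs}
    # Marginal likelihood p(D) = sum_theta p(theta, D).
    marginal = {x: sum(joint[(t, x)] for t in thetas) for x in xs}

    expected_kl, mutual_info = 0.0, 0.0
    for i, t in enumerate(thetas):
        for x in xs:
            posterior = joint[(t, x)] / marginal[x]
            expected_kl += marginal[x] * posterior * math.log(posterior / prior[i])
            mutual_info += joint[(t, x)] * math.log(
                joint[(t, x)] / (prior[i] * marginal[x]))
    return expected_kl, mutual_info

thetas = [0.2, 0.5, 0.8]
prior = [1.0 / 3.0] * 3
print(expected_kl_and_mi(thetas, prior, n=5))  # the two entries should agree
```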

Connections to Machine Learning

The Fisher information matrix (FIM) plays an important role in modern deep learning optimization. The natural gradient update \(\theta \leftarrow \theta - \alpha F^{-1} \nabla \mathcal{L}\) often provides a better-conditioned descent direction than the standard gradient because it accounts for the curvature of the statistical manifold.

Inverting the full FIM costs \(O(n^3)\) in the number of parameters \(n\), which is computationally impractical for large networks. Efficient approximations such as K-FAC (Kronecker-Factored Approximate Curvature) exploit block-diagonal and Kronecker-product structure to scale these methods to deep neural networks. You can explore this further on our interactive Natural Gradient Descent page.

In the field of information geometry, the FIM acts as a Riemannian metric on the parameter space. From this perspective, the FIM is the Hessian of the KL divergence at \(d\theta = 0\), which gives the second-order Taylor approximation: \[ D_{\mathbb{KL}}(p_{\theta} \| p_{\theta+d\theta}) \approx \frac{1}{2} d\theta^\top F(\theta) d\theta. \] This identifies the FIM as a bridge between probability theory and the geometric structure of model parameters.