Fisher Information Matrix
On the previous page, we saw that the second derivative of the
log-partition function \(A(\eta)\) equals the covariance of the sufficient statistics. This object turns out
to have significance beyond the exponential family: it measures how much information data carries about unknown
parameters, governs the precision of maximum likelihood estimates, and defines a natural geometry on the space of
probability distributions.
The Fisher information matrix (FIM) captures the curvature of the log-likelihood function.
In frequentist statistics, it characterizes the asymptotic variance of the MLE. In Bayesian statistics, it defines Jeffreys prior.
In optimization, it gives rise to natural gradient descent.
Definition: Fisher Information Matrix
The Fisher information matrix (FIM) is the covariance of
the score function \(s(\theta)\):
\[
F(\theta) = \mathbb{E}_{x \sim p(x \mid \theta)} [s(\theta)s(\theta)^\top]
\]
where \(s(\theta)\) is the gradient of the log-likelihood with respect to the parameter vector \(\theta\):
\[
s(\theta) = \nabla_{\theta} \log p(x \mid \theta).
\]
Note: The score function has zero mean, \(\mathbb{E}[s(\theta)] = 0\), under the regularity conditions stated below.
This is why the FIM equals \(\mathbb{E}[s\,s^\top]\) rather than the full covariance
\(\mathbb{E}[s\,s^\top] - \mathbb{E}[s]\,\mathbb{E}[s]^\top\).
Theorem 1:
If \( \log p(x \mid \theta)\) is twice differentiable, and under certain regularity conditions,
the Fisher information matrix equals the expected Hessian of the negative log likelihood (NLL):
\[
F(\theta) = - \mathbb{E}_{x \sim \theta} [\nabla_{\theta}^2 \log p(x \mid \theta)].
\]
Proof:
First, we show that the expected value of the score function \(s(\theta)\) is zero. In the scalar case, assuming that
\(p(x \mid \theta)\) is differentiable and that the bounds of the integral do not depend on \(\theta\)
(so that differentiation and integration can be interchanged), we have
\[
\begin{align*}
&\int p(x \mid \theta) dx = 1 \\\\
&\Longrightarrow \frac{\partial}{\partial \theta} \int p(x \mid \theta) dx = 0 \\\\
&\Longrightarrow \int \left[\frac{\partial}{\partial \theta} \log p(x \mid \theta)\right] p(x \mid \theta) dx = 0 \tag{1} \\\\
&\Longrightarrow \mathbb{E}[s(\theta)] = 0
\end{align*}
\]
where \(\frac{\partial}{\partial \theta} \log p(x \mid \theta) = s(\theta)\).
Differentiating Equation (1) with respect to \(\theta\) and applying the product rule, we obtain:
\[
\begin{align*}
0 &= \frac{\partial}{\partial \theta} \int \left[\frac{\partial}{\partial \theta} \log p(x \mid \theta)\right] p(x \mid \theta) dx \\\\
&= \int \left[\frac{\partial^2}{\partial \theta^2} \log p(x \mid \theta)\right] p(x \mid \theta) dx
+ \int \left[\frac{\partial}{\partial \theta} \log p(x \mid \theta)\right] \frac{\partial}{\partial \theta} p(x \mid \theta) dx
\end{align*}
\]
Using the identity \(\frac{\partial}{\partial \theta} p(x \mid \theta) = p(x \mid \theta) \frac{\partial}{\partial \theta} \log p(x \mid \theta)\)
(derived from the chain rule), the second term can be rewritten as:
\[
0 = \int \left[\frac{\partial^2}{\partial \theta^2} \log p(x \mid \theta)\right] p(x \mid \theta) dx
+ \int \left[\frac{\partial}{\partial \theta} \log p(x \mid \theta)\right]^2 p(x \mid \theta) dx.
\]
Therefore,
\[
- \mathbb{E}_{x \sim \theta } \left[\frac{\partial^2}{\partial \theta^2} \log p(x \mid \theta)\right]
= \mathbb{E}_{x \sim \theta } \left[ \left(\frac{\partial}{\partial \theta} \log p(x \mid \theta)\right)^2 \right].
\]
which is the claimed identity in the scalar case; the multivariate statement follows by applying the same argument to each pair of parameter components.
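As a sanity check of Theorem 1, the following sketch (assuming a Poisson(\(\lambda\)) model, chosen only for illustration) compares Monte Carlo estimates of both sides of the identity; each should be close to \(1/\lambda\), the Fisher information of the Poisson rate.
```python
import numpy as np

rng = np.random.default_rng(1)
lam = 2.5                                     # assumed Poisson rate
x = rng.poisson(lam, size=200_000)

# log p(x | lam) = x log(lam) - lam - log(x!)
first_deriv = x / lam - 1                     # d/d(lam)     log p(x | lam)
second_deriv = -x / lam**2                    # d^2/d(lam)^2 log p(x | lam)

print("E[score^2]            :", np.mean(first_deriv**2))   # ~ 1 / lam
print("-E[second derivative] :", -np.mean(second_deriv))    # ~ 1 / lam
print("exact 1 / lam         :", 1 / lam)
```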
FIM for the Exponential Family
For general parametric models, computing the FIM requires evaluating an expectation that may have no closed form. The
exponential family is special: the FIM reduces to a simple derivative of
the log-partition function, which we have already computed.
Consider an exponential family distribution with natural parameter vector \(\eta \in \mathbb{R}^K\):
\[
p(x \mid \eta) = h(x) \exp \{\eta^\top \mathcal{T}(x) - A(\eta)\}.
\]
Recall that the gradient of the log partition function \(A(\eta)\) is the vector of
expected sufficient statistics \(\mathbb{E}[\mathcal{T}(x)]\), also called the moment parameters \(m\):
\[
\nabla_{\eta} A(\eta) = \mathbb{E}[\mathcal{T}(x)] = m.
\]
Also, the gradient of the log likelihood is the sufficient statistics minus their expected value:
\[
\begin{align*}
&\log p(x \mid \eta) = \log h(x) + \eta^\top \mathcal{T}(x) - A(\eta) \\\\
&\Longrightarrow \nabla_{\eta} \log p(x \mid \eta) = \mathcal{T}(x) - \mathbb{E}[\mathcal{T}(x)] = \mathcal{T}(x) - m.
\end{align*}
\]
Therefore, the Hessian of the log partition function is the same as the FIM, which is the same as the
covariance of the sufficient statistics:
\[
\begin{align*}
F(\eta) &= - \mathbb{E}_{p(x \mid \eta)} \left[\nabla_{\eta}^2 (\eta^\top \mathcal{T}(x) - A(\eta))\right] \\\\
&= \nabla_{\eta}^2 A(\eta) \\\\
&= \text{Cov}[\mathcal{T}(x)].
\end{align*}
\]
So, the FIM is indeed the second cumulant of the sufficient statistics.
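To make this concrete, the sketch below writes the Bernoulli distribution in natural form (an assumed example), with \(\mathcal{T}(x) = x\) and \(A(\eta) = \log(1 + e^{\eta})\), and checks that a finite-difference estimate of \(A''(\eta)\) matches the sampled variance of the sufficient statistic.
```python
import numpy as np

def A(eta):
    # Log-partition function of the Bernoulli in natural parameterization
    return np.log1p(np.exp(eta))

eta = 0.7                                     # assumed natural parameter
p = 1 / (1 + np.exp(-eta))                    # corresponding mean, sigmoid(eta)

# Second derivative of A via a central finite difference
h = 1e-4
A_second = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2

# Variance of the sufficient statistic T(x) = x under p(x | eta)
rng = np.random.default_rng(2)
x = rng.binomial(1, p, size=200_000)

print("A''(eta) (finite diff):", A_second)    # ~ p * (1 - p)
print("Var[T(x)] (sampled)   :", x.var())     # ~ p * (1 - p)
```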
Note: Sometimes we need the FIM with respect to the moment parameters \(m\) rather than \(\eta\):
\[
m = \nabla_{\eta} A(\eta) \Longrightarrow \frac{dm}{d\eta} = \nabla_{\eta}^2 A(\eta) = F(\eta).
\]
Under a reparameterization the FIM transforms as \(F(m) = J^\top F(\eta)\, J\) with Jacobian \(J = \frac{d\eta}{dm} = F(\eta)^{-1}\), and since \(F(\eta)\) is symmetric:
\[
F(m) = F(\eta)^{-1} F(\eta) F(\eta)^{-1} = F(\eta)^{-1}.
\]
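Continuing with the Bernoulli example (still an assumption made for illustration), the sketch below checks that the FIM in the moment parameterization, \(F(m) = \frac{1}{m(1-m)}\), is the inverse of \(F(\eta) = \sigma(\eta)(1-\sigma(\eta))\), and that \(\frac{dm}{d\eta}\) and \(\frac{d\eta}{dm}\) are indeed reciprocal.
```python
import numpy as np

def sigmoid(eta):
    return 1 / (1 + np.exp(-eta))

eta = 0.7
m = sigmoid(eta)                              # moment parameter of the Bernoulli

F_eta = m * (1 - m)                           # FIM w.r.t. the natural parameter
F_m = 1 / (m * (1 - m))                       # FIM w.r.t. the moment parameter
print("F(eta) * F(m) =", F_eta * F_m)         # ~ 1, i.e. F(m) = F(eta)^(-1)

# dm/deta and deta/dm via central finite differences
h = 1e-6
dm_deta = (sigmoid(eta + h) - sigmoid(eta - h)) / (2 * h)
deta_dm = (np.log((m + h) / (1 - m - h)) - np.log((m - h) / (1 - m + h))) / (2 * h)
print("dm/deta ~ F(eta):", dm_deta)
print("deta/dm ~ F(m)  :", deta_dm)
```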
Natural Gradient Descent
Standard gradient descent treats parameter space as Euclidean: it moves in the
direction of steepest descent measured by \(\|\delta\|_2\). But when the parameters define a probability distribution, two parameter
vectors that are close in Euclidean distance can correspond to very different distributions (and vice versa). Natural gradient descent (NGD)
corrects this by measuring "steepest descent" in terms of the KL divergence between distributions
rather than the Euclidean distance between parameter vectors.
For any input \(x \in \mathbb{R}^n\), we can approximate the KL divergence in terms of the FIM using a
second-order Taylor series expansion:
\[
\begin{align*}
D_{\mathbb{KL}}(p_{\theta}(y \mid x) \, \| \, p_{\theta + \delta} (y \mid x))
&\approx -\delta^\top \mathbb{E}_{p_{\theta}(y \mid x)} [\nabla \log p_{\theta}(y \mid x) ] - \frac{1}{2}\delta^\top \mathbb{E}_{p_{\theta}(y \mid x)}[\nabla^2 \log p_{\theta}(y \mid x) ] \delta \\\\
&= 0 + \frac{1}{2}\delta^\top F_x(\theta) \delta \\\\
&= \frac{1}{2}\delta^\top F_x(\theta) \delta
\end{align*}
\]
where \(\delta\) represents the change in the parameters.
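This quadratic approximation is easy to verify numerically. The sketch below drops the conditioning on \(x\) for simplicity and uses a Bernoulli(\(\theta\)) model (both assumptions for illustration), comparing the exact \(D_{\mathbb{KL}}(\text{Ber}(\theta) \,\|\, \text{Ber}(\theta+\delta))\) with \(\frac{1}{2}\delta^2 F(\theta)\) for a small perturbation \(\delta\).
```python
import numpy as np

theta, delta = 0.3, 0.01                      # assumed parameter and small perturbation
q = theta + delta

# Exact KL divergence between Ber(theta) and Ber(theta + delta)
kl_exact = theta * np.log(theta / q) + (1 - theta) * np.log((1 - theta) / (1 - q))

# Second-order approximation 0.5 * delta^2 * F(theta), with F(theta) = 1 / (theta (1 - theta))
F = 1 / (theta * (1 - theta))
kl_approx = 0.5 * delta**2 * F

print("exact KL     :", kl_exact)
print("0.5 * d^2 * F:", kl_approx)            # agrees with the exact KL up to O(delta^3)
```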
We compute the average KL divergence between the updated distribution and the previous one as
\[
\frac{1}{2}\delta^\top F \delta
\]
where \(F\) is the averaged FIM:
\[
F(\theta) = \mathbb{E}_{p_{\mathcal{D}}(x)}[F_x(\theta)].
\]
In NGD, we use the inverse FIM as a preconditioning matrix and update the parameters as
\[
\theta_{t+1} = \theta_{t} - \alpha_{t} F(\theta_t)^{-1} \nabla \mathcal{L}(\theta_t),
\]
where \(\alpha_t > 0\) is the learning rate (or step size) at iteration \(t\).
Here, we define the natural gradient
\[
\widetilde{\nabla} \mathcal{L}(\theta_t) = F(\theta_t)^{-1} \nabla \mathcal{L}(\theta_t) = F^{-1} g_t,
\]
where \(g_t = \nabla \mathcal{L}(\theta_t)\) denotes the ordinary gradient.
Note: \(F\) is always positive semidefinite (and positive definite under mild identifiability conditions), unlike the Hessian of the loss, and it is typically easier to compute and approximate than the Hessian matrix.
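As a small illustration of this update, the sketch below fits a Bernoulli parameter \(\theta\) to simulated coin flips by minimizing the NLL (a toy problem chosen for illustration). Since \(F(\theta) = \frac{N}{\theta(1-\theta)}\) for \(N\) i.i.d. observations, the natural gradient step with \(\alpha = 1\) simplifies to \(\theta \leftarrow \theta - (\theta - \bar{x})\) and lands on the MLE \(\bar{x}\) in a single iteration, while vanilla gradient descent with a fixed small step size makes much slower progress.
```python
import numpy as np

rng = np.random.default_rng(3)
N = 1000
x = rng.binomial(1, 0.8, size=N)              # assumed data: 1000 flips of a Bernoulli(0.8)
k = x.sum()

def grad_nll(theta):
    # Gradient of the negative log likelihood of N Bernoulli observations
    return -(k / theta - (N - k) / (1 - theta))

def fim(theta):
    # Fisher information of N i.i.d. Bernoulli observations
    return N / (theta * (1 - theta))

theta_ngd = theta_gd = 0.2                    # same (poor) initialization for both methods
for t in range(5):
    theta_ngd = theta_ngd - 1.0 * grad_nll(theta_ngd) / fim(theta_ngd)   # natural gradient step
    theta_gd = theta_gd - 1e-5 * grad_nll(theta_gd)                      # vanilla gradient step

print("MLE (sample mean) :", k / N)
print("NGD after 5 steps :", theta_ngd)       # hits the MLE after the first step
print("GD  after 5 steps :", theta_gd)        # still far from the MLE at this step size
```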
Jeffreys Prior
In Bayesian statistics, the FIM is used to derive the Jeffreys prior, a widely used uninformative prior
that allows the posterior to be driven primarily by the data. Given a prior \(p_{\theta}(\theta)\) and a transformation \(\phi = f(\theta)\),
we seek a prior that is invariant under reparameterization, so that inference remains consistent regardless of the choice
of parameterization. Under a change of variables, the prior transforms as:
\[
p_{\phi}(\phi) = p_{\theta}(\theta) \left| \frac{d\theta}{d\phi} \right|
\]
or in multiple dimensions,
\[
p_{\phi}(\phi) = p_{\theta}(\theta) | \det J |
\]
where \(J\) is the Jacobian matrix with entries \(J_{ij} = \frac{\partial\theta_i}{\partial\phi_j}\).
Definition: Jeffreys Prior
The Jeffreys prior is
\[
p(\theta) \propto \sqrt{F(\theta)}
\]
where \(F\) is the Fisher information.
Or, in multiple dimensions, it has the form
\[
p(\theta) \propto \sqrt{\det F(\theta)}
\]
where \(F\) is the Fisher information matrix.
In the 1d case, suppose \(p_{\theta}(\theta) \propto \sqrt{F(\theta)}\). We can derive a prior for \(\phi\) in terms of \(\theta\) as follows:
\[
\begin{align*}
p_{\phi}(\phi) &= p_{\theta}(\theta) \left| \frac{d\theta}{d\phi} \right| \\\\
&\propto \sqrt{F(\theta)\left(\frac{d\theta}{d\phi}\right)^2} \\\\
&= \sqrt{\mathbb{E}\left[\left(\frac{d \log p(x \mid \theta)}{d\theta}\right)^2\right] \left(\frac{d\theta}{d\phi}\right)^2}\\\\
&= \sqrt{\mathbb{E}\left[\left(\frac{d \log p(x \mid \theta)}{d\theta}\frac{d\theta}{d\phi}\right)^2\right] }\\\\
&= \sqrt{\mathbb{E}\left[\left(\frac{d \log p(x \mid \phi)}{d\phi}\right)^2\right] }\\\\
&= \sqrt{F(\phi)}
\end{align*}
\]
So the Jeffreys prior is invariant to reparameterization; notably, the KL divergence is also invariant to reparameterization.
Example:
Consider the Binomial distribution:
\[
X \sim \text{Bin }(n, \theta), \, 0 \leq \theta \leq 1
\]
\[
p(x \mid \theta) = \binom{n}{x} \theta^x (1-\theta)^{n-x}
\]
Dropping the constant term \(\log\binom{n}{x}\), its log likelihood is given by
\[
l(\theta \mid x) = x\log \theta + (n-x)\log(1-\theta).
\]
The Fisher information is given by
\[
\begin{align*}
F(\theta) &= -\mathbb{E}_{x \sim \theta}\left[\frac{d^2 l}{d \theta^2}\right] \\\\
&= -\mathbb{E}_{x \sim \theta}\left[-\frac{x}{\theta^2}-\frac{n-x}{(1-\theta)^2}\right] \\\\
&= \frac{n\theta}{\theta^2} + \frac{n(1-\theta)}{(1-\theta)^2} \\\\
&= \frac{n}{\theta(1-\theta)} \\\\
&\propto \theta^{-1}(1-\theta)^{-1}.
\end{align*}
\]
Thus, the Jeffreys prior for the parameter \(\theta\) is given by
\[
p_{\theta} (\theta) = \sqrt{F(\theta)} \propto \theta^{-\frac{1}{2}} (1 - \theta)^{-\frac{1}{2}},
\]
which is the kernel of a \(\text{Beta}\left(\frac{1}{2}, \frac{1}{2}\right)\) distribution.
Now, consider the parameterization by the odds, \(\phi = \frac{\theta}{1 - \theta}\).
Solving this expression for \(\theta\), we obtain
\[
\theta = \frac{\phi}{\phi+1}.
\]
Then we have
\[
\begin{align*}
p(x \mid \phi) &\propto \left(\frac{\phi}{\phi+1}\right)^x \left(1 - \frac{\phi}{\phi + 1}\right)^{n-x} \\\\
&= \phi^x (\phi +1)^{-x} (\phi +1)^{-n +x} \\\\
&= \phi^x (\phi + 1)^{-n}
\end{align*}
\]
and the log likelihood is given by
\[
l(\phi \mid x) = x\log \phi -n\log(\phi + 1).
\]
The Fisher information is given by
\[
\begin{align*}
F(\phi) &= -\mathbb{E}_{x \sim \phi}\left[\frac{d^2 l}{d \phi^2}\right] \\\\
&= -\mathbb{E}_{x \sim \phi}\left[-\frac{x}{\phi^2}+\frac{n}{(\phi + 1)^2}\right] \\\\
&= \frac{n\phi}{\phi + 1} \cdot \frac{1}{\phi^2} - \frac{n}{(\phi + 1)^2} \\\\
&= \frac{n(\phi+1)-n\phi}{\phi(\phi+1)^2} \\\\
&= \frac{n}{\phi(\phi+1)^2} \\\\
&\propto \phi^{-1} (\phi+1)^{-2}
\end{align*}
\]
Thus, the Jeffreys prior for the reparameterized variable \(\phi\) is given by:
\[
p_{\phi}(\phi) = \sqrt{F(\phi)} \propto \phi^{-\frac{1}{2}}(\phi +1)^{-1}.
\]
This agrees with transforming \(p_{\theta}(\theta)\) directly: since \(\theta = \frac{\phi}{\phi+1}\), \(1-\theta = \frac{1}{\phi+1}\), and \(\left|\frac{d\theta}{d\phi}\right| = (\phi+1)^{-2}\), we have \(p_{\theta}(\theta)\left|\frac{d\theta}{d\phi}\right| \propto \phi^{-\frac{1}{2}}(\phi+1)\cdot(\phi+1)^{-2} = \phi^{-\frac{1}{2}}(\phi+1)^{-1}\), confirming the invariance.
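The invariance can also be checked numerically. The sketch below (assuming \(n = 10\) and \(\theta = 0.3\) as illustrative values) estimates \(F(\theta)\) and \(F(\phi)\) from simulated binomial draws via the mean squared score and confirms that \(F(\phi) \approx F(\theta)\left(\frac{d\theta}{d\phi}\right)^2\), which is exactly the property that makes \(\sqrt{F}\) transform like a density.
```python
import numpy as np

rng = np.random.default_rng(4)
n, theta = 10, 0.3                            # assumed illustrative values
phi = theta / (1 - theta)                     # odds parameterization
x = rng.binomial(n, theta, size=500_000)

# Scores in the two parameterizations
score_theta = x / theta - (n - x) / (1 - theta)    # d/d(theta) log p(x | theta)
score_phi = x / phi - n / (phi + 1)                # d/d(phi)   log p(x | phi)

F_theta = np.mean(score_theta**2)             # ~ n / (theta (1 - theta))
F_phi = np.mean(score_phi**2)                 # ~ n / (phi (phi + 1)^2)

dtheta_dphi = 1 / (phi + 1)**2                # Jacobian of theta = phi / (phi + 1)
print("F(phi), Monte Carlo          :", F_phi)
print("F(theta) * (dtheta/dphi)^2   :", F_theta * dtheta_dphi**2)
print("closed form n/(phi(phi+1)^2) :", n / (phi * (phi + 1)**2))
```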
Note: In 1d, the Jeffreys prior coincides with the reference prior, which maximizes
the expected KL divergence between the posterior and the prior; in other words, it maximizes the information
provided by the data relative to the prior. For multidimensional parameters, the two priors generally differ.
Finding a reference prior is equivalent to finding the prior that maximizes the mutual information
between \(\theta\) and \(\mathcal{D}\):
\[
\begin{align*}
p^*(\theta) &= \arg \max_{p(\theta)} \mathbb{E}_{\mathcal{D}}[D_{\mathbb{KL}}(p(\theta \mid \mathcal{D}) \,\|\, p(\theta) )] \\\\
&= \arg \max_{p(\theta)}\, \mathbb{I}(\theta ; \mathcal{D})
\end{align*}
\]
In the continuous case,
\[
\begin{align*}
\mathbb{I}(\theta \, ; \mathcal{D}) &= \int_{\mathcal{D}} p(\mathcal{D}) D_{\mathbb{KL}}(p(\theta \mid \mathcal{D}) \,\|\, p(\theta) ) d\mathcal{D} \\\\
&= \int p(\mathcal{D}) \left( \int p(\theta \mid \mathcal{D}) \log \frac{p(\theta \mid \mathcal{D})}{p(\theta)}d\theta \right) d\mathcal{D} \\\\
&= \int \int p(\theta \mid \mathcal{D})p(\mathcal{D}) \log \frac{p(\theta \mid \mathcal{D})}{p(\theta)} d\theta d\mathcal{D} \\\\
&= \int \int p(\theta, \mathcal{D}) \log \frac{p(\theta, \mathcal{D})}{p(\theta)p(\mathcal{D})}d\theta d\mathcal{D}
\end{align*}
\]
where \(p(\mathcal{D})\) is the marginal likelihood:
\[
p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)p(\theta)d\theta
\]
and \(p(\theta, \mathcal{D})\) represents the joint probability distribution between \(\theta\) and \(\mathcal{D}\):
\[
p(\theta, \mathcal{D}) = p(\theta \mid \mathcal{D})p(\mathcal{D}).
\]
Note: Remember, the mutual information itself is invariant under reparameterization.
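The mutual-information objective above can be evaluated numerically for simple models. The sketch below (an assumed setup: a single \(\text{Bin}(10, \theta)\) observation and a \(\text{Beta}(a, b)\) prior discretized on a grid) computes \(\mathbb{I}(\theta;\mathcal{D}) = \sum_x \int p(\theta)\, p(x \mid \theta) \log \frac{p(x \mid \theta)}{p(x)}\, d\theta\), so that different priors, e.g. the Jeffreys \(\text{Beta}(\tfrac{1}{2}, \tfrac{1}{2})\) versus the uniform \(\text{Beta}(1, 1)\), can be compared.
```python
import numpy as np
from math import comb

def mutual_information(a, b, n=10, grid_size=2000):
    # I(theta; D) for one Binomial(n, theta) observation under a Beta(a, b) prior,
    # approximated with a Riemann sum over a uniform grid on theta.
    theta = np.linspace(1e-6, 1 - 1e-6, grid_size)
    dtheta = theta[1] - theta[0]

    prior = theta**(a - 1) * (1 - theta)**(b - 1)
    prior /= prior.sum() * dtheta                              # normalize on the grid

    xs = np.arange(n + 1)
    # likelihood[i, j] = p(x = xs[i] | theta[j])
    likelihood = np.array([comb(n, x) * theta**x * (1 - theta)**(n - x) for x in xs])
    marginal = (likelihood * prior).sum(axis=1) * dtheta       # p(x) = int p(x|theta) p(theta) dtheta

    # I = sum_x int p(theta) p(x|theta) log[ p(x|theta) / p(x) ] dtheta
    integrand = prior * likelihood * np.log(likelihood / marginal[:, None])
    return integrand.sum() * dtheta

print("I(theta; D), Jeffreys Beta(1/2, 1/2):", mutual_information(0.5, 0.5))
print("I(theta; D), uniform Beta(1, 1)     :", mutual_information(1.0, 1.0))
```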
Connections to Machine Learning
The Fisher information matrix (FIM) is central to modern deep learning optimization. The natural gradient update
\(\theta \leftarrow \theta - \alpha F^{-1} \nabla \mathcal{L}\) provides a more effective descent direction than the standard gradient
by accounting for the curvature of the statistical manifold.
While inverting the full FIM is computationally impractical for large networks due to its \(O(n^3)\) complexity, efficient approximations like
K-FAC (Kronecker-Factored Approximate Curvature) leverage block-diagonal and Kronecker product structures to scale these methods
to deep neural networks.
You can explore this further on our interactive Natural Gradient Descent page.
In the field of Information Geometry, the FIM acts as a Riemannian metric.
From this perspective, the FIM is essentially the second-order Taylor approximation of the KL divergence:
\[
D_{\mathbb{KL}}(p_{\theta} \| p_{\theta+d\theta}) \approx \frac{1}{2} d\theta^\top F(\theta) d\theta.
\]
This identifies the FIM as the fundamental bridge between probability theory and the geometric structure of model parameters.