Fisher Information Matrix

Fisher Information Matrix

In the previous chapter on the exponential family, we saw that the second derivative of the log-partition function \(A(\eta)\) equals the covariance of the sufficient statistics. This object turns out to have significance beyond the exponential family: it measures how much information data carries about unknown parameters, governs the precision of maximum likelihood estimates, and defines a natural geometry on the space of probability distributions.

The Fisher information matrix (FIM) captures the curvature of the log-likelihood function. In frequentist statistics, it characterizes the asymptotic variance of the MLE. In Bayesian statistics, it defines the Jeffreys prior. In optimization, it gives rise to natural gradient descent.

Definition: Score Function

The score function is the gradient of the log-likelihood with respect to the parameter vector \(\theta\): \[ s(\theta) := \nabla_{\theta} \log p(x \mid \theta). \]

Note. Under the regularity conditions stated in Theorem 1 below, the score function has zero mean: \(\mathbb{E}[s(\theta)] = 0\). Consequently, its covariance \(\mathbb{E}[s\,s^\top] - \mathbb{E}[s]\,\mathbb{E}[s]^\top\) reduces to \(\mathbb{E}[s\,s^\top]\), motivating the following definition.

Definition: Fisher Information Matrix

The Fisher information matrix (FIM) is the covariance of the score function: \[ F(\theta) = \mathbb{E}_{x \sim p(x \mid \theta)} [s(\theta)\,s(\theta)^\top]. \]

Theorem 1 (FIM as Expected Negative Hessian):

Assume \(p(x \mid \theta)\) is twice differentiable in \(\theta\), and the support of \(p(x \mid \theta)\) does not depend on \(\theta\) (so that differentiation may be exchanged with integration). Then the Fisher information matrix equals the expected negative Hessian of the log-likelihood: \[ F(\theta) = - \mathbb{E}_{x \sim p(x \mid \theta)} \!\left[\nabla_{\theta}^2 \log p(x \mid \theta)\right]. \]

Proof:

We work component-wise: fix indices \(i, j\) and show \(F_{ij}(\theta) = -\mathbb{E}\!\left[\partial^2_{\theta_i \theta_j} \log p(x \mid \theta)\right]\). Stacking these into a matrix gives the claim.

Starting from the normalization condition \(\int p(x \mid \theta)\, dx = 1\) and differentiating both sides with respect to \(\theta_i\) (exchanging derivative with integral by the regularity assumption): \[ \begin{align*} 0 &= \frac{\partial}{\partial \theta_i} \int p(x \mid \theta)\, dx \\\\ &= \int \frac{\partial p(x \mid \theta)}{\partial \theta_i}\, dx. \end{align*} \] Using the chain-rule identity \(\frac{\partial p}{\partial \theta_i} = p \cdot \frac{\partial \log p}{\partial \theta_i}\) (valid where \(p > 0\)): \[ 0 = \int \left[\frac{\partial}{\partial \theta_i} \log p(x \mid \theta)\right] p(x \mid \theta)\, dx = \mathbb{E}[s_i(\theta)]. \tag{1} \] Hence each component of the score has zero expectation, i.e., \(\mathbb{E}[s(\theta)] = 0\).

We now differentiate identity (1) once more, this time with respect to \(\theta_j\), to bring out the second-order information. Applying the product rule under the integral: \[ \begin{align*} 0 &= \frac{\partial}{\partial \theta_j} \int \left[\frac{\partial}{\partial \theta_i} \log p\right] p\, dx \\\\ &= \int \left[\frac{\partial^2}{\partial \theta_i \partial \theta_j} \log p\right] p\, dx + \int \left[\frac{\partial}{\partial \theta_i} \log p\right] \frac{\partial p}{\partial \theta_j}\, dx. \end{align*} \] Applying the chain-rule identity \(\frac{\partial p}{\partial \theta_j} = p \cdot \frac{\partial \log p}{\partial \theta_j}\) to the second term: \[ 0 = \int \left[\frac{\partial^2}{\partial \theta_i \partial \theta_j} \log p\right] p\, dx + \int \left[\frac{\partial}{\partial \theta_i} \log p\right]\left[\frac{\partial}{\partial \theta_j} \log p\right] p\, dx. \] Rearranging, \[ -\mathbb{E}\!\left[\frac{\partial^2 \log p}{\partial \theta_i \partial \theta_j}\right] = \mathbb{E}\!\left[\frac{\partial \log p}{\partial \theta_i} \cdot \frac{\partial \log p}{\partial \theta_j}\right] = \mathbb{E}[s_i(\theta)\, s_j(\theta)] = F_{ij}(\theta), \] where the last equality is the \((i, j)\) entry of the FIM definition. Stacking over all \((i, j)\) yields the matrix identity \(F(\theta) = -\mathbb{E}[\nabla_\theta^2 \log p(x \mid \theta)]\), establishing the claim.
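The identity can be checked numerically on a small model. The following is a minimal sketch, assuming a Bernoulli model with mean parameter \(\theta\) (an illustrative choice, not one made in the text above); the helper names `log_p`, `score`, `second_derivative`, and `expectation` are hypothetical. Derivatives are approximated by central finite differences, and expectations are computed exactly by summing over the two outcomes.

```python
import math

def log_p(x, theta):
    """Log-probability of a Bernoulli(theta) outcome x in {0, 1}."""
    return x * math.log(theta) + (1 - x) * math.log(1 - theta)

def score(x, theta, h=1e-5):
    """Central finite-difference approximation of d/dtheta log p(x | theta)."""
    return (log_p(x, theta + h) - log_p(x, theta - h)) / (2 * h)

def second_derivative(x, theta, h=1e-4):
    """Central finite-difference approximation of d^2/dtheta^2 log p(x | theta)."""
    return (log_p(x, theta + h) - 2 * log_p(x, theta) + log_p(x, theta - h)) / h**2

def expectation(f, theta):
    """Exact expectation of f(x) under x ~ Bernoulli(theta)."""
    return (1 - theta) * f(0) + theta * f(1)

theta = 0.3
mean_score = expectation(lambda x: score(x, theta), theta)             # should be ~0
fim_outer = expectation(lambda x: score(x, theta) ** 2, theta)         # E[s^2]
fim_hessian = -expectation(lambda x: second_derivative(x, theta), theta)  # -E[d^2 log p]
closed_form = 1.0 / (theta * (1 - theta))  # known Bernoulli Fisher information

print(f"E[s]             = {mean_score:.2e}")
print(f"E[s^2]           = {fim_outer:.6f}")
print(f"-E[d^2 log p]    = {fim_hessian:.6f}")
print(f"1/(theta(1-theta)) = {closed_form:.6f}")
```

Up to finite-difference error, the two forms of the Fisher information agree with each other and with the closed form \(1/(\theta(1-\theta))\).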

FIM for the Exponential Family

For general parametric models, computing the FIM requires evaluating an expectation that may have no closed form. The exponential family is special: the FIM reduces to a simple derivative of the log-partition function, which we have already computed.

Consider an exponential family distribution with natural parameter vector \(\eta \in \mathbb{R}^K\): \[ p(x \mid \eta) = h(x) \exp \{\eta^\top \mathcal{T}(x) - A(\eta)\}. \]

By the cumulant identities established in the previous chapter, the gradient of the log partition function \(A(\eta)\) is the expected sufficient statistic \(\mathcal{T}(x)\), which is called the moment parameter \(m\): \[ \nabla_{\eta} A(\eta) = \mathbb{E}[\mathcal{T}(x)] = m. \]

Differentiating the log-likelihood \(\log p(x \mid \eta) = \log h(x) + \eta^\top \mathcal{T}(x) - A(\eta)\) with respect to \(\eta\), using \(\nabla_\eta \log h(x) = 0\) (since \(h\) is \(\eta\)-independent) and the first-cumulant identity \(\nabla_\eta A(\eta) = \mathbb{E}[\mathcal{T}(x)]\) above: \[ \nabla_{\eta} \log p(x \mid \eta) = \nabla_\eta \big[\eta^\top \mathcal{T}(x) - A(\eta)\big] = \mathcal{T}(x) - \mathbb{E}[\mathcal{T}(x)] = \mathcal{T}(x) - m. \] This is the score function \(s(\eta)\) for the exponential family — manifestly mean-zero, in agreement with the general zero-mean property of the score established earlier.

By the Hessian form of the FIM above, \(F(\eta) = -\mathbb{E}_{p(x|\eta)}[\nabla_\eta^2 \log p(x \mid \eta)]\). Computing the Hessian of \(\log p = \log h(x) + \eta^\top \mathcal{T}(x) - A(\eta)\): the term \(\log h(x)\) is \(\eta\)-independent (Hessian zero); the term \(\eta^\top \mathcal{T}(x)\) is linear in \(\eta\) (Hessian zero); only \(-A(\eta)\) contributes. Since \(A(\eta)\) does not depend on \(x\), the expectation passes through: \[ F(\eta) = -\mathbb{E}_{p(x|\eta)}\!\left[-\nabla_\eta^2 A(\eta)\right] = \nabla_\eta^2 A(\eta). \] Combining with the second-cumulant identity \(\nabla_\eta^2 A(\eta) = \operatorname{Cov}[\mathcal{T}(x)]\) established in the previous chapter, \[ F(\eta) = \nabla_\eta^2 A(\eta) = \operatorname{Cov}[\mathcal{T}(x)]. \] So, the FIM is indeed the second cumulant of the sufficient statistics.
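As a numerical illustration, the sketch below checks both cumulant identities for the Bernoulli family written in natural form, where \(\mathcal{T}(x) = x\) and \(A(\eta) = \log(1 + e^{\eta})\). The Bernoulli choice and the helper names (`log_partition`, `first_derivative`, `second_derivative`) are assumptions made for this example only.

```python
import math

def log_partition(eta):
    """Bernoulli log-partition function A(eta) = log(1 + exp(eta))."""
    return math.log1p(math.exp(eta))

def first_derivative(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

eta = 0.8
m = 1.0 / (1.0 + math.exp(-eta))   # moment parameter m = E[T(x)] = sigmoid(eta)
var_T = m * (1.0 - m)              # Var[T(x)] for T(x) = x ~ Bernoulli(m)

grad_A = first_derivative(log_partition, eta)    # should match m
hess_A = second_derivative(log_partition, eta)   # should match Var[T(x)], i.e. F(eta)

print(f"dA/deta     = {grad_A:.8f}   m        = {m:.8f}")
print(f"d2A/deta2   = {hess_A:.8f}   Var[T(x)] = {var_T:.8f}")
```

The second derivative of \(A\) matches the variance of the sufficient statistic, so no expectation over \(x\) has to be evaluated to obtain \(F(\eta)\).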

Note. Sometimes we need the FIM in moment-parameter coordinates \(m\). Under a reparametrization \(\eta \mapsto m\), the FIM transforms as a metric tensor: \[ F(m) = J^\top F(\eta)\, J, \qquad \text{where } J = \frac{d\eta}{dm}. \] Since \(m = \nabla_\eta A(\eta)\), the Jacobian is \(dm/d\eta = \nabla_\eta^2 A(\eta) = F(\eta)\); assuming \(F(\eta)\) is invertible, \(J = d\eta/dm = F(\eta)^{-1}\). Substituting, \[ F(m) = F(\eta)^{-1\,\top}\, F(\eta)\, F(\eta)^{-1} = F(\eta)^{-1}, \] where the last equality uses the symmetry of \(F(\eta)\).
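As a quick sanity check of this inverse relationship, consider the Bernoulli distribution (chosen here purely for illustration), for which \(\mathcal{T}(x) = x\), \(A(\eta) = \log(1 + e^{\eta})\), and \(m = \sigma(\eta) = 1/(1 + e^{-\eta})\). In natural coordinates, \[ F(\eta) = \nabla_\eta^2 A(\eta) = \sigma(\eta)\big(1 - \sigma(\eta)\big) = m(1 - m). \] Computing directly in the moment parameter, with \(\log p(x \mid m) = x \log m + (1 - x)\log(1 - m)\) and \(\mathbb{E}[x] = m\), \[ F(m) = -\mathbb{E}\!\left[-\frac{x}{m^2} - \frac{1 - x}{(1 - m)^2}\right] = \frac{1}{m} + \frac{1}{1 - m} = \frac{1}{m(1 - m)} = F(\eta)^{-1}. \]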

Natural Gradient Descent

Standard gradient descent treats parameter space as Euclidean: it moves in the direction of steepest descent measured by \(\|\delta\|_2\). But when the parameters define a probability distribution, two parameter vectors that are close in Euclidean distance can correspond to very different distributions (and vice versa). Natural gradient descent (NGD) corrects this by measuring "steepest descent" in terms of the KL divergence between distributions rather than the Euclidean distance between parameter vectors.

When the model is a conditional distribution \(p_\theta(y \mid x)\) for input \(x\) and output \(y\), the conditional FIM is defined as \[ F_x(\theta) := \mathbb{E}_{y \sim p_\theta(y \mid x)}\!\left[\nabla_\theta \log p_\theta(y \mid x) \, \nabla_\theta \log p_\theta(y \mid x)^\top\right]. \] By a second-order Taylor expansion of \(\log p_{\theta+\delta}(y \mid x)\) around \(\theta\), and taking expectation under \(p_\theta(y \mid x)\): \[ \begin{align*} D_{\mathbb{KL}}\!\big(p_\theta(y \mid x) \,\|\, p_{\theta+\delta}(y \mid x)\big) &\approx -\delta^\top \mathbb{E}_{p_\theta(y \mid x)}[\nabla_\theta \log p_\theta(y \mid x)] - \frac{1}{2}\delta^\top \mathbb{E}_{p_\theta(y \mid x)}[\nabla_\theta^2 \log p_\theta(y \mid x)]\, \delta \\\\ &= 0 + \frac{1}{2}\delta^\top F_x(\theta)\, \delta, \end{align*} \] where the first term vanishes by the zero-mean property of the score, and the second uses the Hessian form of the FIM: \(-\mathbb{E}[\nabla_\theta^2 \log p] = F_x(\theta)\). Here \(\delta\) represents the change in the parameters.
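The quality of the quadratic approximation can be seen on a one-parameter example. The sketch below assumes an unconditional Bernoulli model with mean parameter \(\theta\), for which \(F(\theta) = 1/(\theta(1-\theta))\); the model choice and the function names `kl_bernoulli` and `fisher_bernoulli` are illustrative assumptions. It compares the exact KL divergence with \(\frac{1}{2}\delta^2 F(\theta)\) for shrinking \(\delta\).

```python
import math

def kl_bernoulli(p, q):
    """KL divergence D_KL(Bern(p) || Bern(q)) for p, q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def fisher_bernoulli(theta):
    """Fisher information of Bernoulli(theta) in the mean parameterization."""
    return 1.0 / (theta * (1 - theta))

theta = 0.3
results = []
for delta in (1e-1, 1e-2, 1e-3):
    exact = kl_bernoulli(theta, theta + delta)
    quadratic = 0.5 * delta**2 * fisher_bernoulli(theta)
    rel_error = abs(exact - quadratic) / exact
    results.append((delta, exact, quadratic, rel_error))
    print(f"delta={delta:g}  KL={exact:.3e}  quadratic={quadratic:.3e}  "
          f"relative error={rel_error:.2e}")
```

The relative error shrinks roughly in proportion to \(\delta\), as expected when the leading neglected term is cubic in \(\delta\).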

To compute the average KL divergence between the updated and previous distributions across the data, we use \[ \frac{1}{2}\delta^\top F(\theta)\, \delta, \] where \(F(\theta)\) denotes the averaged FIM over the input distribution (extending the unconditional FIM definition to the conditional-model setting): \[ F(\theta) = \mathbb{E}_{p_{\mathcal{D}}(x)}[F_x(\theta)]. \]

In NGD, we use the inverse FIM as a preconditioning matrix and update parameters: \[ \theta_{t+1} = \theta_{t} - \alpha_{t} F(\theta_t)^{-1} \nabla \mathcal{L}(\theta_t), \] where \(\alpha_t > 0\) is the learning rate (or step size) at iteration \(t\), and \(\mathcal{L}(\theta)\) is the loss function being minimized.

Definition: Natural Gradient

Given a loss function \(\mathcal{L}(\theta)\) and Fisher information matrix \(F(\theta)\), the natural gradient is the FIM-preconditioned gradient: \[ \widetilde{\nabla} \mathcal{L}(\theta) := F(\theta)^{-1} \nabla \mathcal{L}(\theta). \] The NGD update can then be written compactly as \(\theta_{t+1} = \theta_t - \alpha_t \widetilde{\nabla}\mathcal{L}(\theta_t)\).

Note. \(F\) is always positive semi-definite (and strictly positive definite under standard identifiability conditions); it is also often easier to compute and approximate than the Hessian of the loss.
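To make the update concrete, here is a minimal sketch of NGD for fitting a univariate Gaussian \(\mathcal{N}(\mu, \sigma^2)\) by minimizing the average negative log-likelihood. The model, the parameterization \((\mu, \sigma)\), the toy data, and the function names are all assumptions made for this example. For this parameterization the FIM is \(F = \operatorname{diag}(1/\sigma^2,\; 2/\sigma^2)\), so the inverse is available in closed form.

```python
def nll_gradient(mu, sigma, data):
    """Gradient of the average Gaussian negative log-likelihood w.r.t. (mu, sigma)."""
    n = len(data)
    mean_resid = sum(x - mu for x in data) / n
    mean_sq = sum((x - mu) ** 2 for x in data) / n
    grad_mu = -mean_resid / sigma**2
    grad_sigma = 1.0 / sigma - mean_sq / sigma**3
    return grad_mu, grad_sigma

def natural_gradient(mu, sigma, data):
    """Precondition the gradient with F^{-1}, where F = diag(1/sigma^2, 2/sigma^2)."""
    g_mu, g_sigma = nll_gradient(mu, sigma, data)
    return sigma**2 * g_mu, 0.5 * sigma**2 * g_sigma

def ngd_step(mu, sigma, data, lr):
    """One NGD update: theta <- theta - lr * F(theta)^{-1} grad L(theta)."""
    d_mu, d_sigma = natural_gradient(mu, sigma, data)
    return mu - lr * d_mu, sigma - lr * d_sigma

data = [1.2, 0.7, 1.9, 1.4, 0.8]   # toy data, made up for illustration
mu, sigma = -3.0, 5.0              # deliberately poor initialization
for _ in range(50):
    mu, sigma = ngd_step(mu, sigma, data, lr=0.5)

sample_mean = sum(data) / len(data)
sample_std = (sum((x - sample_mean) ** 2 for x in data) / len(data)) ** 0.5
print(f"mu = {mu:.6f} (sample mean {sample_mean:.6f})")
print(f"sigma = {sigma:.6f} (MLE std {sample_std:.6f})")
```

In this example the natural-gradient step in \(\mu\) is \(-(\bar{x} - \mu)\) regardless of \(\sigma\), whereas the plain gradient step scales with \(1/\sigma^2\) and would be very small at the initial \(\sigma = 5\).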

Cramér-Rao Lower Bound

Returning to the frequentist thread anticipated at the start of this chapter: the FIM does not just describe the geometry of the model — it also sets a hard limit on how accurately any unbiased estimator can recover \(\theta\) from data. The smaller the FIM, the less information the data carries, and the larger the unavoidable variance of estimation.

Recall from the section on bias of estimators that an estimator \(\hat\theta = \hat\theta(\mathcal{D})\) is unbiased if \(\mathbb{E}[\hat\theta] = \theta^*\) for the true parameter \(\theta^*\). Being unbiased is not enough for a good estimator: its variance \[ \mathbb{V}[\hat\theta] := \mathbb{E}\!\left[\hat\theta^2\right] - \big(\mathbb{E}[\hat\theta]\big)^2 \] measures how much the estimate fluctuates as the data \(\mathcal{D} \sim p(\mathcal{D} \mid \theta^*)\) varies. The natural question is: how small can this variance be made? The Cramér-Rao bound provides the answer.

Theorem: Cramér-Rao Lower Bound

Let \(X_1, \ldots, X_N \overset{\text{iid}}{\sim} p(x \mid \theta^*)\), and let \(\hat\theta = \hat\theta(X_1, \ldots, X_N)\) be an unbiased estimator of the scalar parameter \(\theta^*\). Assume the regularity conditions of the Hessian form of the FIM, supplemented by an integrability condition on \(\hat\theta\) ensuring that differentiation under the integral sign is valid for the unbiasedness identity \(\int \hat\theta(\mathcal{D})\, p(\mathcal{D} \mid \theta)\, d\mathcal{D} = \theta\). Then the variance of \(\hat\theta\) satisfies \[ \mathbb{V}[\hat\theta] \;\geq\; \frac{1}{N \, F(\theta^*)}, \] where \(F(\theta^*)\) is the Fisher information (per single observation) evaluated at \(\theta^*\). For vector-valued \(\theta\), the inequality generalizes to \(\operatorname{Cov}[\hat\theta] \succeq \big(N\, F(\theta^*)\big)^{-1}\) in the positive-semi-definite order.

Proof (scalar case).

Write the score for the full sample as \(s(\theta^*) := \nabla_\theta \log p(\mathcal{D} \mid \theta^*) = \sum_{n=1}^N \nabla_\theta \log p(X_n \mid \theta^*)\), where \(\mathcal{D} = (X_1, \ldots, X_N)\). Each term is a single-observation score, which has zero mean by the zero-mean property of the score; by linearity, \(\mathbb{E}[s(\theta^*)] = 0\).

Step 1. Differentiating the unbiasedness identity \(\mathbb{E}_\theta[\hat\theta(\mathcal{D})] = \theta\) under the integral sign, \[ 1 = \nabla_\theta\!\int \hat\theta(\mathcal{D})\, p(\mathcal{D} \mid \theta)\, d\mathcal{D} = \int \hat\theta(\mathcal{D})\, p(\mathcal{D} \mid \theta)\, \nabla_\theta \log p(\mathcal{D} \mid \theta)\, d\mathcal{D} = \mathbb{E}[\hat\theta\, s]. \] Combined with \(\mathbb{E}[s] = 0\), this gives \(\operatorname{Cov}[\hat\theta, s] = \mathbb{E}[\hat\theta\, s] - \mathbb{E}[\hat\theta]\,\mathbb{E}[s] = 1\).

Step 2. Independence of the samples gives additivity of the Fisher information: \[ \mathbb{V}[s(\theta^*)] = \mathbb{V}\!\left[\sum_{n=1}^N \nabla_\theta \log p(X_n \mid \theta^*)\right] = N \, \mathbb{V}\!\left[\nabla_\theta \log p(X_1 \mid \theta^*)\right] = N\, F(\theta^*). \]

Step 3. Cauchy–Schwarz applied to the random variables \(\hat\theta\) and \(s\) gives \(\operatorname{Cov}[\hat\theta, s]^2 \leq \mathbb{V}[\hat\theta] \, \mathbb{V}[s]\). Substituting Steps 1–2, \[ 1 \;\leq\; \mathbb{V}[\hat\theta] \cdot N\, F(\theta^*), \] which rearranges to the claimed scalar bound \(\mathbb{V}[\hat\theta] \geq 1/(N\, F(\theta^*))\). \(\square\)

For vector-valued \(\theta \in \mathbb{R}^d\), the same differentiation argument applied component-wise — that is, differentiating \(\mathbb{E}_\theta[\hat\theta_i] = \theta_i\) with respect to \(\theta_j\) — yields the matrix identity \(\operatorname{Cov}[\hat\theta, s] = I_d\). Combined with \(\operatorname{Cov}[s] = N\, F(\theta^*)\), positive semi-definiteness of the joint covariance \[ \operatorname{Cov}\!\begin{pmatrix} \hat\theta \\ s \end{pmatrix} = \begin{pmatrix} \operatorname{Cov}[\hat\theta] & I_d \\ I_d & N\, F(\theta^*) \end{pmatrix} \;\succeq\; 0 \] forces, by the Schur complement criterion, \(\operatorname{Cov}[\hat\theta] \succeq \big(N\, F(\theta^*)\big)^{-1}\). The scalar argument above captures the essential mechanism.

The factor \(N\) in the denominator captures the fundamental scaling: doubling the sample size halves the minimum achievable variance. The Fisher information \(F(\theta^*)\) controls the constant: parameter values where the likelihood is sharply peaked (large \(F\)) admit lower-variance estimators than values where the likelihood is flat (small \(F\)).
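A small simulation illustrates the bound. The sketch below assumes Bernoulli data with \(F(\theta) = 1/(\theta(1-\theta))\), and uses the sample mean, which is unbiased and in this particular model attains the bound exactly: \(\mathbb{V}[\bar{X}] = \theta(1-\theta)/N = 1/(N F(\theta))\). The function name, seed, and sample sizes are illustrative choices.

```python
import random

def sample_mean_variance(theta, n, trials, seed=0):
    """Monte Carlo estimate of the variance of the sample mean of n Bernoulli(theta) draws."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        successes = sum(1 for _ in range(n) if rng.random() < theta)
        estimates.append(successes / n)
    mean_est = sum(estimates) / trials
    return sum((e - mean_est) ** 2 for e in estimates) / (trials - 1)

theta, n = 0.3, 50
fisher = 1.0 / (theta * (1 - theta))   # per-observation Fisher information
crlb = 1.0 / (n * fisher)              # Cramér-Rao lower bound on the variance
empirical = sample_mean_variance(theta, n, trials=20000)

print(f"Cramér-Rao bound   : {crlb:.6f}")
print(f"empirical variance : {empirical:.6f}")
```

The empirical variance should agree with the bound up to Monte Carlo noise; an estimator that does not attain the bound would show a visibly larger variance.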

MLE is Asymptotically Optimal

Under the same regularity conditions, the MLE \(\hat\theta_{\text{MLE}}\) is asymptotically unbiased and achieves the Cramér-Rao bound in the large-sample limit. A Taylor expansion of the score equation \(\nabla_\theta \log p(\mathcal{D} \mid \hat\theta_{\text{MLE}}) = 0\) around \(\theta^*\), combined with the law of large numbers and the central limit theorem, yields the asymptotic distribution \[ \sqrt{N}\,(\hat\theta_{\text{MLE}} - \theta^*) \;\xrightarrow{d}\; \mathcal{N}\!\big(0,\; F(\theta^*)^{-1}\big). \] A full proof requires the convergence theory of random variables (developed in the material on convergence of random variables). The takeaway is that no other (sufficiently regular) unbiased estimator can have smaller asymptotic variance than the MLE; in this sense the MLE is asymptotically optimal. This is the precise sense in which "maximum likelihood" is, in the large-sample frequentist regime, the right thing to do.
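The asymptotic variance can be observed in simulation. The sketch below assumes an exponential model \(p(x \mid \lambda) = \lambda e^{-\lambda x}\), a choice made only for this illustration, for which \(F(\lambda) = 1/\lambda^2\) and the MLE is \(\hat\lambda = 1/\bar{x}\). It estimates the variance of \(\sqrt{N}(\hat\lambda - \lambda)\) and compares it with \(F(\lambda)^{-1} = \lambda^2\). The function name, seed, and sizes are hypothetical.

```python
import random

def scaled_mle_errors(rate, n, trials, seed=0):
    """Draw `trials` datasets of size n and return sqrt(n) * (MLE - rate) for each."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        total = sum(rng.expovariate(rate) for _ in range(n))
        mle = n / total                      # MLE of the rate is 1 / sample mean
        errors.append((n ** 0.5) * (mle - rate))
    return errors

rate, n = 2.0, 200
errors = scaled_mle_errors(rate, n, trials=4000)
mean_err = sum(errors) / len(errors)
var_err = sum((e - mean_err) ** 2 for e in errors) / (len(errors) - 1)
asymptotic_var = rate**2                     # F(rate)^{-1} for the exponential model

print(f"mean of sqrt(N)(MLE - rate)     : {mean_err:.4f}")
print(f"variance of sqrt(N)(MLE - rate) : {var_err:.4f}")
print(f"asymptotic variance 1/F         : {asymptotic_var:.4f}")
```

At finite \(N\) the MLE here is slightly biased and its scaled variance sits a little above \(\lambda^2\); both discrepancies shrink as \(N\) grows.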

This optimality result complements the bias-variance decomposition discussed in the context of regularized regression: while regularized estimators trade a small bias for a much larger variance reduction (often beating the MLE in finite samples), the Cramér-Rao bound certifies that, asymptotically, no unbiased estimator can do better than the MLE in raw variance.

Jeffreys Prior

In Bayesian statistics, the FIM is used to derive the Jeffreys prior, a widely used uninformative prior that lets the posterior be driven primarily by the data. Given a prior \(p_{\theta}(\theta)\) and an invertible transformation \(\phi = f(\theta)\), we seek a prior whose construction rule is invariant under reparameterization: that is, deriving the prior in \(\theta\)-coordinates and then transforming via change-of-variables should give the same result as deriving the prior directly in \(\phi\)-coordinates. The prior should transform as \[ p_{\phi}(\phi) = p_{\theta}(\theta) \left| \frac{d\theta}{d\phi} \right|, \] or, in multiple dimensions, \[ p_{\phi}(\phi) = p_{\theta}(\theta)\, | \det J |, \] where \(J\) is the Jacobian matrix with entries \(J_{ij} = \frac{\partial\theta_i}{\partial\phi_j}\).

Definition: Jeffreys Prior

The Jeffreys prior is \[ p(\theta) \propto \sqrt{F(\theta)} \] where \(F\) is the Fisher information.

Or, in multiple dimensions, it has the form \[ p(\theta) \propto \sqrt{\det F(\theta)} \] where \(F\) is the Fisher information matrix.

In the one-dimensional case, suppose \(p_{\theta}(\theta) \propto \sqrt{F(\theta)}\). We can derive the induced prior for \(\phi\) as follows, using the chain rule in the fourth step: \[ \begin{align*} p_{\phi}(\phi) &= p_{\theta}(\theta) \left| \frac{d\theta}{d\phi} \right| \\\\ &\propto \sqrt{F(\theta)\left(\frac{d\theta}{d\phi}\right)^2} \\\\ &= \sqrt{\mathbb{E}\left[\left(\frac{d \log p(x \mid \theta)}{d\theta}\right)^2\right] \left(\frac{d\theta}{d\phi}\right)^2}\\\\ &= \sqrt{\mathbb{E}\left[\left(\frac{d \log p(x \mid \theta)}{d\theta}\frac{d\theta}{d\phi}\right)^2\right] }\\\\ &= \sqrt{\mathbb{E}\left[\left(\frac{d \log p(x \mid \phi)}{d\phi}\right)^2\right] }\\\\ &= \sqrt{F(\phi)}. \end{align*} \] So the Jeffreys prior is invariant under reparameterization. This is consistent with the fact that the KL divergence between two distributions is itself reparameterization-invariant: both the Jeffreys prior and the KL divergence live on the statistical manifold, not on a particular coordinate chart.

Example:

Consider the Binomial distribution: \[ X \sim \operatorname{Bin}(n, \theta), \quad 0 \leq \theta \leq 1, \] \[ p(x \mid \theta) = \binom{n}{x} \theta^x (1-\theta)^{n-x}. \] Dropping the additive constant \(\log \binom{n}{x}\), which does not depend on \(\theta\), the log-likelihood is \[ l(\theta \mid x) = x\log \theta + (n-x)\log(1-\theta). \]

The Fisher information is given by \[ \begin{align*} F(\theta) &= -\mathbb{E}_{x \sim \theta}\left[\frac{d^2 l}{d \theta^2}\right] \\\\ &= -\mathbb{E}_{x \sim \theta}\left[-\frac{x}{\theta^2}-\frac{n-x}{(1-\theta)^2}\right] \\\\ &= \frac{n\theta}{\theta^2} + \frac{n(1-\theta)}{(1-\theta)^2} \\\\ &= \frac{n}{\theta(1-\theta)} \\\\ &\propto \theta^{-1}(1-\theta)^{-1}, \end{align*} \] where the third line uses \(\mathbb{E}[x] = n\theta\). Thus, the Jeffreys prior for the parameter \(\theta\) is given by \[ p_{\theta} (\theta) \propto \sqrt{F(\theta)} \propto \theta^{-\frac{1}{2}} (1 - \theta)^{-\frac{1}{2}}. \] This is the Beta\((1/2,\, 1/2)\) distribution, which is also a special case of the Beta conjugate prior for the Binomial likelihood with hyperparameters \(a = b = 1/2\).

Now, consider the parameterization by \(\phi = \frac{\theta}{1 - \theta}\). Solving this expression for \(\theta\), we obtain \[ \theta = \frac{\phi}{\phi+1}. \] Then we have \[ \begin{align*} p(x \mid \phi) &\propto \left(\frac{\phi}{\phi+1}\right)^x \left(1 - \frac{\phi}{\phi + 1}\right)^{n-x} \\\\ &= \phi^x (\phi +1)^{-x} (\phi +1)^{-n +x} \\\\ &= \phi^x (\phi + 1)^{-n} \end{align*} \] and the log likelihood is given by \[ l(\phi \mid x) = x\log \phi -n\log(\phi + 1). \]

The Fisher information is given by \[ \begin{align*} F(\phi) &= -\mathbb{E}_{x \sim \phi}\left[\frac{d^2 l}{d \phi^2}\right] \\\\ &= -\mathbb{E}_{x \sim \phi}\left[-\frac{x}{\phi^2}+\frac{n}{(\phi + 1)^2}\right] \\\\ &= \frac{n\phi}{\phi + 1} \cdot \frac{1}{\phi^2} - \frac{n}{(\phi + 1)^2} \\\\ &= \frac{n(\phi+1)-n\phi}{\phi(\phi+1)^2} \\\\ &= \frac{n}{\phi(\phi+1)^2} \\\\ &\propto \phi^{-1} (\phi+1)^{-2}, \end{align*} \] where the third line uses \(\mathbb{E}[x] = n\theta = n\phi/(\phi+1)\). Thus, the Jeffreys prior for the reparameterized variable \(\phi\) is given by \[ p_{\phi}(\phi) \propto \sqrt{F(\phi)} \propto \phi^{-\frac{1}{2}}(\phi +1)^{-1}. \]
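The two priors can be checked against the change-of-variables rule directly. Since \(\theta = \phi/(\phi+1)\), we have \(1 - \theta = 1/(\phi+1)\) and \(d\theta/d\phi = 1/(\phi+1)^2\), so \[ \begin{align*} p_{\theta}(\theta) \left|\frac{d\theta}{d\phi}\right| &\propto \theta^{-\frac{1}{2}}(1-\theta)^{-\frac{1}{2}} \cdot \frac{1}{(\phi+1)^2} \\\\ &= \phi^{-\frac{1}{2}}(\phi+1)^{\frac{1}{2}} \cdot (\phi+1)^{\frac{1}{2}} \cdot (\phi+1)^{-2} \\\\ &= \phi^{-\frac{1}{2}}(\phi+1)^{-1}, \end{align*} \] which matches the Jeffreys prior obtained by computing the Fisher information directly in \(\phi\)-coordinates.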

Note. In one dimension, the Jeffreys prior coincides with the reference prior, which maximizes the expected KL divergence between posterior and prior. In other words, it maximizes the information provided by the data relative to the prior. For multidimensional parameters, the two are not the same.

Also, finding a reference prior is equivalent to finding the prior that maximizes mutual information between \(\theta\) and \(\mathcal{D}\). \[ \begin{align*} p^*(\theta) &= \arg \max_{p(\theta)} \mathbb{E}_{\mathcal{D}}[D_{\mathbb{KL}}(p(\theta \mid \mathcal{D}) \,\|\, p(\theta) )] \\\\ &= \arg \max_{p(\theta)}\, \mathbb{I}(\theta ; \mathcal{D}) \end{align*} \]

In the continuous case, \[ \begin{align*} \mathbb{I}(\theta \, ; \mathcal{D}) &= \int_{\mathcal{D}} p(\mathcal{D}) D_{\mathbb{KL}}(p(\theta \mid \mathcal{D}) \,\|\, p(\theta) ) d\mathcal{D} \\\\ &= \int p(\mathcal{D}) \left( \int p(\theta \mid \mathcal{D}) \log \frac{p(\theta \mid \mathcal{D})}{p(\theta)}d\theta \right) d\mathcal{D} \\\\ &= \int \int p(\theta \mid \mathcal{D})p(\mathcal{D}) \log \frac{p(\theta \mid \mathcal{D})}{p(\theta)} d\theta d\mathcal{D} \\\\ &= \int \int p(\theta, \mathcal{D}) \log \frac{p(\theta, \mathcal{D})}{p(\theta)p(\mathcal{D})}d\theta d\mathcal{D}, \end{align*} \] where \(p(\mathcal{D})\) is the marginal likelihood, \[ p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)p(\theta)d\theta, \] and \(p(\theta, \mathcal{D})\) is the joint distribution of \(\theta\) and \(\mathcal{D}\): \[ p(\theta, \mathcal{D}) = p(\theta \mid \mathcal{D})p(\mathcal{D}). \]

Note. The mutual information itself is invariant under reparameterization.

Connections to Machine Learning

The Fisher information matrix (FIM) plays a central role in second-order-style optimization for deep learning. The natural gradient update \(\theta \leftarrow \theta - \alpha F^{-1} \nabla \mathcal{L}\) often provides a more effective descent direction than the standard gradient by accounting for the curvature of the statistical manifold.

While inverting the full FIM is computationally impractical for large networks, since a dense inverse costs \(O(d^3)\) in the number of parameters \(d\), efficient approximations like K-FAC (Kronecker-Factored Approximate Curvature) leverage block-diagonal and Kronecker product structures to scale these methods to deep neural networks. You can explore this further on our interactive Natural Gradient Descent page.

In the field of information geometry, the FIM acts as a Riemannian metric. From this perspective, the FIM is the Hessian of the KL divergence at zero displacement, so the quadratic form it defines is the second-order Taylor approximation of the KL divergence: \[ D_{\mathbb{KL}}(p_{\theta} \| p_{\theta+d\theta}) \approx \frac{1}{2} d\theta^\top F(\theta) d\theta. \] This identifies the FIM as a bridge between probability theory and the geometric structure of model parameters.