Introduction
Around 2025, "Physical AI" entered the industry vocabulary as a label for machine learning
systems that act on the physical world — robotics, autonomous vehicles, and embodied agents.
Whether the label survives or fades, the underlying technical problem it points to is real
and constraining: when a learned policy controls a physical actuator, the path the optimizer
takes through parameter space affects stability, sample efficiency, and energy cost during
deployment, not just final model quality.
Standard Stochastic Gradient Descent (SGD)
implicitly equips the parameter space with the Euclidean metric, treating all directions in
\(\theta\) as equally costly to move along. Information geometry shows that
probability distributions instead live on a curved statistical manifold whose natural
metric is given by the
Fisher Information Matrix (FIM).
Natural Gradient Descent (NGD)
respects this intrinsic geometry by using the FIM as the metric in place of the Euclidean
inner product. Let's visualize how this geometry distorts our path in a simple mixture model.
The Geometry of Information: Interactive GMM Manifold
The Model: A Two-Component Gaussian Mixture
The entire canvas represents a two-dimensional parameter space
\[
\Theta = \{(\mu_1, \mu_2) \in \mathbb{R}^2\}
\]
Each point on this plane specifies a one-dimensional Gaussian Mixture Model
\[
p(x \mid \theta) = \tfrac{1}{2}\mathcal{N}(x \mid \mu_1, \sigma^2) + \tfrac{1}{2}\mathcal{N}(x \mid \mu_2, \sigma^2),
\]
where the mixing weights are fixed at \(\tfrac{1}{2}\) and the variance \(\sigma^2\) is shared.
The two draggable handles correspond to the current values of \(\mu_1\) and \(\mu_2\).
By moving them, you are traversing this parameter space and selecting different mixture distributions.
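As a minimal sketch of the model above (the function name `gmm_pdf` and the example values of \(\mu_1, \mu_2\) are illustrative choices, not taken from the demo's source), the density at a single point of the parameter plane can be evaluated as follows. The snippet also checks numerically that the density integrates to one and that swapping the two means yields the same distribution.

```python
import numpy as np

def gmm_pdf(x, mu1, mu2, sigma=1.0):
    """Density of the equal-weight mixture 0.5*N(mu1, sigma^2) + 0.5*N(mu2, sigma^2)."""
    norm = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    comp1 = norm * np.exp(-0.5 * ((x - mu1) / sigma) ** 2)
    comp2 = norm * np.exp(-0.5 * ((x - mu2) / sigma) ** 2)
    return 0.5 * comp1 + 0.5 * comp2

# One point theta = (mu1, mu2) of the parameter plane (values chosen arbitrarily).
x = np.linspace(-12.0, 12.0, 4801)
dx = x[1] - x[0]

# Riemann sum of the density over a grid that covers essentially all the mass.
total_mass = np.sum(gmm_pdf(x, -1.0, 2.0)) * dx

# Largest pointwise difference between theta = (-1, 2) and the swapped theta = (2, -1).
swap_gap = np.max(np.abs(gmm_pdf(x, -1.0, 2.0) - gmm_pdf(x, 2.0, -1.0)))

print(f"total mass ~ {total_mass:.6f}, swap difference = {swap_gap:.1e}")
```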
The Fisher Information Matrix as a Riemannian Metric
The family of distributions \(\{p(x \mid \theta) : \theta \in \Theta\}\) forms a statistical manifold —
a smooth manifold where each point is a probability distribution.
The Fisher Information Matrix (FIM) provides the natural Riemannian metric on this manifold.
(The formal definitions of manifold structures and Riemannian metrics will be developed in our forthcoming
smooth-manifold and Riemannian-geometry pages.)
FIM in Component Form (for the Demo)
The Fisher Information Matrix
is defined abstractly as the covariance of the score function. For the GMM demo it is convenient to
write it component-wise; the score-square form on the first line equals the negative-Hessian form on
the second by the standard identity
(FIM as Expected Negative Hessian):
\[
\begin{align*}
F(\theta_0)_{ij} &= \mathbb{E}_{x \sim p(\cdot \mid \theta_0)}\!\left[\frac{\partial \log p(x \mid \theta)}{\partial \theta_i}\,\frac{\partial \log p(x \mid \theta)}{\partial \theta_j}\right]_{\theta = \theta_0} \\
&= -\,\mathbb{E}_{x \sim p(\cdot \mid \theta_0)}\!\left[\frac{\partial^2 \log p(x \mid \theta)}{\partial \theta_i \,\partial \theta_j}\right]_{\theta = \theta_0}
\end{align*}
\]
This \(2 \times 2\) matrix is positive definite (almost everywhere) and encodes how sensitively the distribution \(p(x \mid \theta)\)
changes as we perturb \(\theta\) in each direction. The FIM defines an infinitesimal distance on the manifold:
for a small displacement \(d\theta\), the squared statistical distance is
\[
ds^2 = d\theta^\top F(\theta)\, d\theta.
\]
This is precisely the second-order Taylor expansion of
the KL divergence:
\[
D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta + d\theta}) \approx \tfrac{1}{2}\,d\theta^\top F(\theta)\,d\theta.
\]
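The two formulas above can be checked numerically for the demo's mixture. The sketch below is one possible way to do it, not the demo's actual implementation: it assumes a fixed quadrature grid `X`, uses the score-square form of the FIM, and the helper names (`mixture`, `fisher`, `kl`) and the test point are hypothetical.

```python
import numpy as np

X = np.linspace(-15.0, 15.0, 6001)   # quadrature grid, assumed to cover all the mass
DX = X[1] - X[0]

def mixture(theta, sigma):
    """Return (component densities, mixture density) evaluated on the grid X."""
    mu = np.asarray(theta, dtype=float)[:, None]            # shape (2, 1)
    comp = np.exp(-0.5 * ((X - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    p = np.maximum(0.5 * comp.sum(axis=0), 1e-300)          # floor avoids 0/0 in far tails
    return comp, p

def fisher(theta, sigma=1.0):
    """F_ij = E[s_i s_j] with score s_i = d log p / d mu_i, by quadrature."""
    comp, p = mixture(theta, sigma)
    mu = np.asarray(theta, dtype=float)[:, None]
    score = 0.5 * comp * (X - mu) / sigma**2 / p            # shape (2, len(X))
    return (score * p) @ score.T * DX

def kl(theta_a, theta_b, sigma=1.0):
    """KL(p_a || p_b) by quadrature on the same grid."""
    _, pa = mixture(theta_a, sigma)
    _, pb = mixture(theta_b, sigma)
    return np.sum(pa * (np.log(pa) - np.log(pb))) * DX

theta = np.array([-1.0, 1.5])
d_theta = np.array([0.02, -0.01])      # small displacement

F = fisher(theta)
quadratic = 0.5 * d_theta @ F @ d_theta
exact = kl(theta, theta + d_theta)
print("FIM:\n", F)
print(f"KL = {exact:.3e}, 0.5 * d^T F d = {quadratic:.3e}")
```

The two printed numbers should be close for small `d_theta` and drift apart as the displacement grows, since the quadratic form is only the leading term of the expansion.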
What the Ellipses Show: Local Quadratic Forms of the Metric
Each ellipse in the tessellation is an iso-distance contour of the Fisher metric, namely the level set
\[
\{\delta\theta : \delta\theta^\top F(\theta_0)\,\delta\theta = \varepsilon\}
\]
of the local quadratic form at that grid point.
The ellipse axes are determined by the eigendecomposition \(F = Q \Lambda Q^\top\):
the eigenvectors of \(F\) (the columns of \(Q\)) give the principal directions, and the semi-axis lengths are \(\sqrt{\varepsilon/\lambda_i}\), that is, proportional to \(1/\sqrt{\lambda_i}\),
where \(\lambda_1, \lambda_2\) are the eigenvalues. A long axis in some direction means the FIM eigenvalue is small along
that direction: the distribution is insensitive to parameter changes there, so a large step in parameter space produces only a small
change in KL divergence.
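The construction of one ellipse from a given \(2 \times 2\) metric can be sketched as below. The matrix `F` is a hand-picked example rather than a value from the demo, and `metric_ellipse` is a hypothetical helper name.

```python
import numpy as np

def metric_ellipse(F, eps=0.1, n_points=64):
    """Semi-axes, principal directions and boundary points of {d : d^T F d = eps}."""
    eigvals, Q = np.linalg.eigh(F)            # F = Q diag(eigvals) Q^T, eigvals ascending
    semi_axes = np.sqrt(eps / eigvals)        # lengths proportional to 1/sqrt(lambda_i)
    t = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    unit_circle = np.stack([np.cos(t), np.sin(t)])
    boundary = Q @ (semi_axes[:, None] * unit_circle)   # shape (2, n_points)
    return semi_axes, Q, boundary

# Example metric with eigenvalues 0.2 (along (1, -1)) and 0.4 (along (1, 1)).
F = np.array([[0.30, 0.10],
              [0.10, 0.30]])
semi_axes, Q, boundary = metric_ellipse(F)

# Every boundary point should satisfy d^T F d = eps.
on_level_set = np.einsum("ik,ij,jk->k", boundary, F, boundary)
print("semi-axes:", semi_axes)
print("long-axis direction:", Q[:, 0])
```

Because `eigh` returns eigenvalues in ascending order, the first column of `Q` is the direction of the longest axis, i.e. the direction in which the distribution is least sensitive.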
The Information Singularity: When \(\mu_1 \to \mu_2\)
Drag the two handles toward each other and observe the ellipses elongating dramatically along the \(\mu_1 \leftrightarrow \mu_2\) axis.
This visualizes the information singularity of mixture models.
When \(\mu_1 \approx \mu_2\), the model reduces to
\[
p(x) \approx \mathcal{N}(x \mid \mu_1, \sigma^2),
\]
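and the two parameters are no longer separately identifiable: to first order only the mean \((\mu_1 + \mu_2)/2\) affects the distribution, while moving \(\mu_1\) and \(\mu_2\) in opposite directions leaves it unchanged.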
Formally, the FIM becomes rank-deficient (positive semi-definite but not definite) as the eigenvalue along the "exchange direction" \((\mu_1 - \mu_2)\) approaches zero
while the eigenvalue along \((\mu_1 + \mu_2)\) (the "mean shift" direction) remains bounded.
The condition number
\[
\kappa(F) = \lambda_{\max}/\lambda_{\min}
\]
diverges, which you can verify in the panel on the right.
This singularity is not a numerical artifact; it reflects a genuine geometric property of the statistical manifold.
The manifold has a cusp-like structure near the identifiability boundary,
and standard Euclidean gradient descent slows catastrophically in this region
because it does not account for the degenerate geometry.
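The divergence of the condition number can be reproduced outside the demo with a few lines of quadrature. This is a sketch under assumptions: \(\sigma = 1\), means placed symmetrically around zero, and a fixed integration grid; the function name `fisher` is illustrative. At exactly coincident means the score components are identical, so the FIM is \(\tfrac{1}{4\sigma^2}\begin{pmatrix}1 & 1\\ 1 & 1\end{pmatrix}\), which has rank one.

```python
import numpy as np

X = np.linspace(-15.0, 15.0, 6001)   # quadrature grid (assumed wide enough for sigma = 1)
DX = X[1] - X[0]

def fisher(theta, sigma=1.0):
    """FIM of the equal-weight two-component mixture, score-square form."""
    mu = np.asarray(theta, dtype=float)[:, None]
    comp = np.exp(-0.5 * ((X - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    p = np.maximum(0.5 * comp.sum(axis=0), 1e-300)   # floor avoids 0/0 in far tails
    score = 0.5 * comp * (X - mu) / sigma**2 / p
    return (score * p) @ score.T * DX

gaps = [6.0, 2.0, 1.0, 0.5, 0.1]     # values of |mu_1 - mu_2|
kappas = []
for gap in gaps:
    eigvals = np.linalg.eigvalsh(fisher([-gap / 2.0, gap / 2.0]))   # ascending
    kappas.append(eigvals[-1] / eigvals[0])
    print(f"gap = {gap:4.1f}   lambda_min = {eigvals[0]:.5f}   kappa = {kappas[-1]:.1f}")

# Exactly coincident means: rank-one FIM, all entries equal to 1 / (4 sigma^2).
F_coincident = fisher([0.0, 0.0])
print(F_coincident)
```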
Well-Separated Regime: When \(\|\mu_1 - \mu_2\| \gg \sigma\)
Pull the handles far apart and observe that the ellipses become nearly circular: the FIM approaches a scalar
multiple of the identity
\[
F(\theta) \approx \frac{1}{2\sigma^2} I.
\]
In this regime, the two Gaussian components have negligible overlap, and each parameter \(\mu_i\) is informed almost
exclusively by data generated from its own component. The metric is then approximately a constant multiple of the
Euclidean one, so the manifold is locally flat and the natural gradient points in the same direction as the ordinary
gradient, differing only by the scalar factor \(2\sigma^2\).
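A short derivation sketch of this limit, using only the model definition. The responsibility \(r_i(x)\) is a symbol introduced here for illustration. The score of the mixture is
\[
\frac{\partial \log p(x \mid \theta)}{\partial \mu_i} = r_i(x)\,\frac{x - \mu_i}{\sigma^2},
\qquad
r_i(x) = \frac{\tfrac{1}{2}\mathcal{N}(x \mid \mu_i, \sigma^2)}{p(x \mid \theta)}.
\]
When the components barely overlap, \(r_i(x) \approx 1\) for samples drawn from component \(i\) and \(r_i(x) \approx 0\) otherwise, so \(r_1(x)\,r_2(x) \approx 0\) and
\[
F_{ii} = \int \tfrac{1}{2}\mathcal{N}(x \mid \mu_i, \sigma^2)\, r_i(x)\,\frac{(x - \mu_i)^2}{\sigma^4}\,dx
\approx \frac{1}{2}\cdot\frac{\sigma^2}{\sigma^4} = \frac{1}{2\sigma^2},
\qquad
F_{12} \approx 0.
\]
The factor \(\tfrac{1}{2}\) is the mixing weight: only half of the samples carry information about each mean.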
Euclidean vs. Natural Gradient: The Red and Cyan Arrows
Click anywhere on the canvas to place a probe point. Two arrows appear:
- The red arrow:
The Euclidean (steepest-descent) gradient \(\nabla_\theta L\), computed in the standard \(\ell^2\) geometry of parameter space.
This is the direction along which the loss \(L\) changes most rapidly per unit Euclidean distance \(\|d\theta\|_2\).
- The cyan arrow:
The natural gradient \(\tilde{\nabla}_\theta L = F(\theta)^{-1}\nabla_\theta L\), which is the steepest-descent direction under the Fisher-Rao metric.
This is the direction along which \(L\) changes most rapidly per unit statistical distance \(\sqrt{d\theta^\top F(\theta)\,d\theta}\).
Equivalently, to first order the step \(d\theta \propto -F(\theta)^{-1}\nabla_\theta L\) solves \(\min_{d\theta}\, L(\theta + d\theta)\) subject to \(D_{\mathrm{KL}}(p_\theta \| p_{\theta+d\theta}) \leq \varepsilon\).
Both arrows are normalized to the same display length so that you can compare their directions.
The angle \(\Delta\theta\) between them, displayed above the probe point, measures how much the manifold geometry
distorts the gradient direction at that point. In the well-separated regime this angle is near \(0°\);
near the singularity it can exceed \(45°\), showing that Euclidean descent and natural-gradient descent
select substantially different directions in this regime. Whether the natural-gradient direction is
operationally preferable depends on the loss landscape and the optimization objective: the
information-geometric metric measures distance between distributions, which is the relevant scale for
likelihood-based objectives, but is not universally the "correct" notion of distance for every learning
problem.
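The angle between the two arrows can be computed as in the following sketch. The text does not specify the demo's loss, so this example assumes \(L(\theta) = D_{\mathrm{KL}}(p_{\text{target}} \,\|\, p_\theta)\) for a fixed target mixture; the target values, probe points, small damping term and function names are all hypothetical.

```python
import numpy as np

X = np.linspace(-15.0, 15.0, 6001)   # quadrature grid (assumed wide enough for sigma = 1)
DX = X[1] - X[0]

def score_and_pdf(theta, sigma):
    """Score d log p / d mu_i (shape (2, N)) and mixture density (shape (N,)) on X."""
    mu = np.asarray(theta, dtype=float)[:, None]
    comp = np.exp(-0.5 * ((X - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    p = np.maximum(0.5 * comp.sum(axis=0), 1e-300)   # floor avoids 0/0 in far tails
    return 0.5 * comp * (X - mu) / sigma**2 / p, p

def fisher(theta, sigma):
    score, p = score_and_pdf(theta, sigma)
    return (score * p) @ score.T * DX

def loss_gradient(theta, target, sigma):
    """Gradient of the ASSUMED loss KL(p_target || p_theta), i.e. -E_target[score]."""
    score, _ = score_and_pdf(theta, sigma)
    _, p_target = score_and_pdf(target, sigma)
    return -(score * p_target).sum(axis=1) * DX

def gradient_angle_deg(theta, target, sigma=1.0, damping=1e-8):
    """Angle between the Euclidean gradient and the (lightly damped) natural gradient."""
    g = loss_gradient(theta, target, sigma)
    nat = np.linalg.solve(fisher(theta, sigma) + damping * np.eye(2), g)
    cos = g @ nat / (np.linalg.norm(g) * np.linalg.norm(nat))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

target = [-1.0, 3.0]
angle_far = gradient_angle_deg([-5.0, 5.0], target)    # well-separated probe
angle_near = gradient_angle_deg([0.0, 0.1], target)    # probe near the singularity
print(f"well separated: {angle_far:.2f} deg, near singularity: {angle_near:.2f} deg")
```

Since \(F^{-1}\) is positive definite, the angle always stays below \(90°\): the natural gradient never points "uphill" relative to the Euclidean gradient, it only rotates and rescales it.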
Flow Mode: Particle Trajectories Along the Natural Gradient Field
Activating Flow releases particles that follow the negative natural gradient field
\[
-F(\theta)^{-1}\nabla L(\theta).
\]
Each particle follows the steepest descent direction in the Fisher-Rao metric at each point —
the direction of fastest decrease in loss per unit statistical (KL) distance. This is distinct from both
Euclidean steepest descent and the manifold's geodesics, but shares with the latter the property of respecting
the underlying geometry. Observe how particles near the singularity region are deflected along the
manifold curvature rather than heading straight toward the nearest handle in Euclidean coordinates. This is the geometric correction
that algorithms such as Natural Gradient Descent (Amari, 1998), K-FAC, and the closely related Fisher-preconditioned methods exploit
for faster convergence on the loss landscapes of neural networks and mixture models.
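A single particle trajectory of this kind can be approximated by explicit Euler steps along the negative natural gradient. The sketch below makes several assumptions that are not taken from the demo: the loss is \(D_{\mathrm{KL}}(p_{\text{target}} \,\|\, p_\theta)\) for a fixed target mixture, \(\sigma = 1\), and the step size, damping and start point are arbitrary illustrative values.

```python
import numpy as np

X = np.linspace(-15.0, 15.0, 6001)   # quadrature grid (assumed wide enough for sigma = 1)
DX = X[1] - X[0]

def score_and_pdf(theta, sigma):
    """Score d log p / d mu_i (shape (2, N)) and mixture density (shape (N,)) on X."""
    mu = np.asarray(theta, dtype=float)[:, None]
    comp = np.exp(-0.5 * ((X - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    p = np.maximum(0.5 * comp.sum(axis=0), 1e-300)   # floor avoids 0/0 in far tails
    return 0.5 * comp * (X - mu) / sigma**2 / p, p

def natural_gradient_flow(theta0, target, sigma=1.0, step=0.1, n_steps=200, damping=1e-3):
    """Euler integration of d(theta)/dt = -F(theta)^{-1} grad L(theta)."""
    theta = np.asarray(theta0, dtype=float)
    _, p_target = score_and_pdf(target, sigma)
    path = [theta.copy()]
    for _ in range(n_steps):
        score, p = score_and_pdf(theta, sigma)
        fim = (score * p) @ score.T * DX
        grad = -(score * p_target).sum(axis=1) * DX        # gradient of the assumed KL loss
        # Damping keeps the linear solve well-posed near the singularity.
        direction = np.linalg.solve(fim + damping * np.eye(2), grad)
        theta = theta - step * direction
        path.append(theta.copy())
    return np.array(path)

target = np.array([-2.0, 2.0])
path = natural_gradient_flow([-0.5, 1.0], target)
print("start:", path[0], " end:", path[-1])
```

The damping term is a practical safeguard rather than part of the definition of the flow: with an exactly singular \(F\) the natural gradient is undefined, so implementations typically regularize the metric before inverting it.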
The Controls and What They Change Geometrically
The \(\sigma\) slider controls the shared standard deviation.
Decreasing \(\sigma\) sharpens both components, increasing the base Fisher information \(\sim 1/\sigma^2\) and
making the singularity zone narrower (the components must be closer before they overlap).
Increasing \(\sigma\) broadens the overlap region and makes the singularity dominate a larger portion of parameter space.
The grid density slider controls the tessellation resolution. More ellipses give a finer picture of how the
metric tensor varies across the manifold, at the cost of more computation per frame.
The ellipse scale slider uniformly scales the display size of the ellipses
without altering the underlying eigenvalues. This is a visualization aid, not a mathematical parameter.
Scaling to Deep Learning: The Kronecker-factored Approximation (K-FAC)
While the \(2 \times 2\) Fisher Information Matrix (FIM) in our demo is easy to invert,
modern neural networks have millions, or even billions, of parameters. In such cases, the FIM becomes an
\(n \times n\) matrix, where \(n\) is the number of weights. Storing it takes \(O(n^2)\) memory and inverting
it takes \(O(n^3)\) time, making the exact natural gradient impractical for
large-scale models.
For a single fully connected layer with 1,000 inputs and 1,000 outputs, \(n = 10^6\), so the FIM would be a \(10^6 \times 10^6\) matrix with \(10^{12}\) entries.
Standard Natural Gradient Descent fails here, not because the math is wrong, but because the hardware cannot keep up.
To bridge the gap between information-geometric rigor and practical deep learning, we need a way to approximate
the curvature.
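The arithmetic behind this claim is easy to make concrete. The sketch below assumes a dense layer without biases and 4-byte (float32) entries, and treats \(n^3\) as an order-of-magnitude operation count for a dense inverse; the function name is illustrative.

```python
def full_fim_cost(d_in, d_out, bytes_per_entry=4):
    """Size of the exact FIM for one dense layer with d_in * d_out weights (no bias)."""
    n = d_in * d_out                  # number of weights in the layer
    entries = n * n                   # the FIM is n x n
    memory_bytes = entries * bytes_per_entry
    inversion_flops = n ** 3          # order-of-magnitude count for a dense O(n^3) inverse
    return n, entries, memory_bytes, inversion_flops

n, entries, memory_bytes, flops = full_fim_cost(1000, 1000)
print(f"weights: {n:.0e}, FIM entries: {entries:.0e}, "
      f"float32 storage: {memory_bytes / 1e12:.0f} TB, ~{flops:.0e} flops to invert")
```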
The K-FAC Idea: Block-Diagonal & Kronecker Factorization
K-FAC (Kronecker-factored Approximate Curvature) introduces two major simplifications:
- Block-Diagonal Approximation:
We neglect the Fisher blocks that couple weights in different layers, treating the FIM as a block-diagonal matrix
where each block corresponds to a single layer.
- Kronecker Factorization:
For a specific layer, the FIM block \(F_\ell\) is approximated as
the Kronecker product (\(\otimes\)) of two
much smaller matrices:
\[
F_\ell \approx A_{\ell-1} \otimes G_\ell
\]
where \(A_{\ell-1}\) is the (uncentered) second-moment matrix of the layer's inputs (activations) and \(G_\ell\) is the
corresponding second-moment matrix of the gradients with respect to the layer's outputs.
The mechanism of this factorization is the standard Kronecker-inverse identity:
\[
(A \otimes G)^{-1} = A^{-1} \otimes G^{-1}
\]
Instead of inverting a massive \(10^6 \times 10^6\) matrix, we only need to invert two \(1000 \times 1000\) matrices.
For a layer with \(d_{\text{in}}\) inputs and \(d_{\text{out}}\) outputs, this reduces the inversion cost from \(O((d_{\text{in}} d_{\text{out}})^3)\)
to \(O(d_{\text{in}}^3 + d_{\text{out}}^3)\), allowing us to exploit the "natural"
path even in deep neural networks.
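The Kronecker-inverse identity, and the fact that the preconditioned gradient can be formed without ever building the large matrix, can be verified on a toy layer. This is a sketch and not a K-FAC implementation: the factors `A` and `G` are random symmetric positive-definite stand-ins, the sizes are tiny, and the gradient is stored as a `(d_in, d_out)` array flattened in NumPy's row-major order, for which \((A \otimes G)^{-1}\,\mathrm{vec}(V) = \mathrm{vec}(A^{-1} V G^{-1})\).

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3                    # toy sizes; a real layer would be far larger

def random_spd(d):
    """Random symmetric positive-definite matrix (stand-in for a curvature factor)."""
    m = rng.standard_normal((d, d))
    return m @ m.T + d * np.eye(d)

A = random_spd(d_in)                  # plays the role of the input-side factor
G = random_spd(d_out)                 # plays the role of the output-gradient factor
grad_W = rng.standard_normal((d_in, d_out))

# Direct route: build the (d_in*d_out) x (d_in*d_out) matrix and solve with it.
full = np.kron(A, G)
direct = np.linalg.solve(full, grad_W.reshape(-1))

# Factored route: two small solves, A^{-1} V G^{-1}, using the symmetry of G.
left = np.linalg.solve(A, grad_W)             # A^{-1} V
factored = np.linalg.solve(G, left.T).T       # (G^{-1} (A^{-1} V)^T)^T = A^{-1} V G^{-1}

inverse_identity_holds = np.allclose(
    np.linalg.inv(full), np.kron(np.linalg.inv(A), np.linalg.inv(G))
)
print("identity holds:", inverse_identity_holds)
print("max difference:", np.max(np.abs(direct - factored.reshape(-1))))
```

The factored route only ever touches a `d_in x d_in` and a `d_out x d_out` system, which is where the savings come from.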
Further Exploration & Implementations
K-FAC remains an active research area, with variants for convolutional (CNN), recurrent (RNN), and Transformer architectures.
Rather than focusing on a single implementation, we recommend referring to the foundational research and high-performance libraries
that realize these geometric insights: