So far we have studied differentials of scalar-input, vector-input, and matrix-input / matrix-output functions.
This page closes the natural square of cases: scalar-output functions of a matrix
argument, \(f : \mathbb{R}^{m \times n} \to \mathbb{R}\). In machine learning these
arise everywhere — regularization terms, log-likelihoods, and scalar losses are all such
functions of weight matrices.
The Frobenius norm measures the total
magnitude of a matrix \(X \in \mathbb{R}^{m \times n}\), serving as the matrix-space analogue of
the Euclidean \(L_2\) norm. It is defined through the trace:
\[
\|X\|_F \;=\; \sqrt{\operatorname{tr}(X^T X)} \;=\; \sqrt{\sum_{i,j} X_{ij}^2}.
\]
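The two expressions in the definition can be compared numerically. The following is a minimal sketch, assuming NumPy is available; the seed and the \(3 \times 4\) shape are arbitrary choices made for illustration.

```python
import numpy as np

# Arbitrary test matrix (seed and shape are illustrative assumptions)
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))

norm_trace = np.sqrt(np.trace(X.T @ X))   # sqrt(tr(X^T X))
norm_entries = np.sqrt(np.sum(X ** 2))    # sqrt(sum_ij X_ij^2)
norm_numpy = np.linalg.norm(X, "fro")     # NumPy's built-in Frobenius norm

print(norm_trace, norm_entries, norm_numpy)
```

All three printed values should agree up to floating-point rounding.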
The role of the norm, made concrete. The abstract remark made on the
Linear Approximations page (that
derivatives compare the magnitudes of input and output displacements, and a norm is what supplies
that magnitude) becomes palpable once inputs are matrices. As with vectors, matrices admit several
genuinely different norms: the Frobenius norm \(\|\cdot\|_F\) we use throughout this page, the
operator (spectral) norm \(\|\cdot\|_2\), the nuclear norm \(\|\cdot\|_*\), and more. In finite
dimensions these norms are all equivalent and induce the same topology, but they produce qualitatively different
geometric statements: a ball of radius \(1\) in Frobenius norm is not the same shape as
a ball of radius \(1\) in operator norm. Choosing a norm is therefore a modelling decision: it
fixes what "a small perturbation of \(X\)" means. For standard gradient-based optimization of
loss functions on matrix parameters, the Frobenius norm is the natural choice because it is the
norm induced by the entrywise \(L_2\)-type inner product, which is exactly what the next
definition records.
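To see that these norms really do measure different things, the sketch below evaluates all three on two small matrices. It assumes NumPy; the helper name `three_norms` and the test matrices are hypothetical, chosen only for illustration.

```python
import numpy as np

def three_norms(X):
    """Return (Frobenius, operator/spectral, nuclear) norms of a 2-D array."""
    return (
        np.linalg.norm(X, "fro"),   # sqrt of the sum of squared singular values
        np.linalg.norm(X, 2),       # largest singular value
        np.linalg.norm(X, "nuc"),   # sum of singular values
    )

I2 = np.eye(2)               # singular values (1, 1)
E11 = np.diag([1.0, 0.0])    # singular values (1, 0)

fro_I, op_I, nuc_I = three_norms(I2)
fro_E, op_E, nuc_E = three_norms(E11)

print(f"I2 : fro={fro_I:.4f}  op={op_I:.4f}  nuc={nuc_I:.4f}")
print(f"E11: fro={fro_E:.4f}  op={op_E:.4f}  nuc={nuc_E:.4f}")
```

On the rank-one matrix the three norms coincide, while on the identity they are \(\sqrt{2}\), \(1\), and \(2\) respectively. In particular \(I/\sqrt{2}\) lies on the Frobenius unit sphere but strictly inside the operator-norm unit ball, which is one concrete way the two unit balls differ in shape.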
Before differentiating, we establish the gradient convention for scalar-output matrix functions —
the direct analogue of the
gradient convention for vector
inputs.
Definition: Gradient for scalar functions of a matrix
Let \(f : \mathbb{R}^{m \times n} \to \mathbb{R}\) be differentiable at \(X\). The
gradient \(\nabla f \in \mathbb{R}^{m \times n}\) is the unique matrix
satisfying
\[
df \;=\; \langle \nabla f,\, dX \rangle_F \;=\; \operatorname{tr}\bigl((\nabla f)^T dX\bigr),
\]
where the Frobenius inner product is
\(\langle A, B \rangle_F := \operatorname{tr}(A^T B) = \sum_{i,j} A_{ij} B_{ij}\).
Equivalently, \((\nabla f)_{ij} = \partial f / \partial X_{ij}\).
In practice, we compute \(df\) by manipulating differentials, collect it into the form
\(df = \operatorname{tr}(M^T dX)\) for some matrix \(M\), and read off \(\nabla f = M\).
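The defining identity \(df = \operatorname{tr}((\nabla f)^T dX)\) can be probed directly with a small random perturbation. The sketch below uses the illustrative function \(f(X) = \operatorname{tr}(X^T X)\), whose differential is \(2\,\operatorname{tr}(X^T dX)\), so that \(\nabla f = 2X\) under the convention above. It assumes NumPy; the seed, shape, and perturbation scale are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 2))
dX = 1e-6 * rng.standard_normal((3, 2))   # small perturbation of X

def f(M):
    """Illustrative scalar function of a matrix: f(M) = tr(M^T M)."""
    return np.trace(M.T @ M)

grad = 2 * X                               # candidate gradient, same shape as X

df_actual = f(X + dX) - f(X)               # true change in f
df_linear = np.trace(grad.T @ dX)          # first-order prediction <grad, dX>_F

print(f"actual change   : {df_actual:.12e}")
print(f"tr(grad^T dX)   : {df_linear:.12e}")
print(f"difference      : {abs(df_actual - df_linear):.2e}")
```

The remaining difference is the second-order term \(\operatorname{tr}(dX^T dX)\), which is quadratic in the perturbation size.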
Derivation of \(\nabla \|X\|_F\).
Let \(f(X) = \|X\|_F = \sqrt{\operatorname{tr}(X^T X)}\). Assume \(X \neq 0\), so that
\(\operatorname{tr}(X^T X) > 0\) and \(f\) is differentiable at \(X\) (as the composition of
the smooth map \(\sqrt{\,\cdot\,}\) on \((0,\infty)\) with the polynomial map
\(X \mapsto \operatorname{tr}(X^T X)\)). Applying the
chain rule
with the outer scalar function \(u \mapsto \sqrt{u}\) (whose scalar derivative is
\(1/(2\sqrt{u})\)):
\[
df \;=\; \frac{1}{2\sqrt{\operatorname{tr}(X^T X)}}\; d\bigl[\operatorname{tr}(X^T X)\bigr].
\]
To handle the inner expression, we use linearity of the trace:
\[
d\bigl(\operatorname{tr}(A)\bigr) \;=\; \operatorname{tr}(A + dA) - \operatorname{tr}(A)
\;=\; \operatorname{tr}(dA),
\]
so \(d\) and \(\operatorname{tr}\) commute. Applying the
product rule
to \(X^T X\) (using \(d(X^T) = (dX)^T\), since transposition is linear and therefore commutes
with taking the differential):
\[
d(X^T X) \;=\; (dX^T) X + X^T (dX) \;=\; (dX)^T X + X^T (dX).
\]
Taking the trace and using that \(\operatorname{tr}(B) = \operatorname{tr}(B^T)\) to combine
the two terms:
\[
\operatorname{tr}\bigl((dX)^T X\bigr) \;=\; \operatorname{tr}\bigl(((dX)^T X)^T\bigr)
\;=\; \operatorname{tr}(X^T dX),
\]
so \(\operatorname{tr}\bigl(d(X^T X)\bigr) = 2\,\operatorname{tr}(X^T dX)\). Substituting back:
\[
df \;=\; \frac{1}{\sqrt{\operatorname{tr}(X^T X)}}\, \operatorname{tr}(X^T dX)
\;=\; \operatorname{tr}\!\left(\frac{X^T}{\|X\|_F}\, dX\right).
\]
Comparing with \(df = \operatorname{tr}((\nabla f)^T dX)\) from the gradient convention:
\[
\nabla f \;=\; \frac{X}{\|X\|_F}. \qquad \blacksquare
\]
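The result can be cross-checked entry by entry using \((\nabla f)_{ij} = \partial f / \partial X_{ij}\). The following is a minimal sketch, assuming NumPy; `numerical_gradient` is a hypothetical helper written for this illustration, not a library function, and the step size `h` is an arbitrary choice.

```python
import numpy as np

def numerical_gradient(f, X, h=1e-6):
    """Central-difference estimate of the matrix of partials df/dX_ij."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h                                  # perturb a single entry
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 4))                     # nonzero, so f is differentiable here

grad_formula = X / np.linalg.norm(X, "fro")         # closed form derived above
grad_numeric = numerical_gradient(lambda M: np.linalg.norm(M, "fro"), X)

max_abs_diff = np.max(np.abs(grad_formula - grad_numeric))
print(f"max |formula - finite difference| = {max_abs_diff:.2e}")
print(f"Frobenius norm of the gradient    = {np.linalg.norm(grad_formula, 'fro'):.6f}")
```

The second printed value reflects the fact that \(X / \|X\|_F\) always has Frobenius norm \(1\): the gradient is the unit-length matrix pointing in the direction of \(X\).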
Matrix analogue of the \(L_2\)-norm gradient
This matches the
vector \(L_2\)-norm result
\(\nabla \|\mathbf{x}\|_2 = \mathbf{x}/\|\mathbf{x}\|_2\), with \(X\) playing the role of
\(\mathbf{x}\) and \(\|\cdot\|_F\) the role of \(\|\cdot\|_2\). This is no coincidence: the Frobenius norm
is the Euclidean norm of \(\operatorname{vec}(X)\), so all of its calculus is inherited
directly from the vector case.
Example: gradient of a bilinear form \(f(A) = \mathbf{x}^T A \mathbf{y}\)
The trace carries a single additional tool that matrix calculus uses constantly: the
cyclic property, \(\operatorname{tr}(ABC) = \operatorname{tr}(BCA) = \operatorname{tr}(CAB)\)
whenever the products are defined. This turns almost every matrix-gradient calculation into a
one-step transformation.
Example.
Let \(f(A) = \mathbf{x}^T A \mathbf{y}\) where \(A \in \mathbb{R}^{m \times n}\),
\(\mathbf{x} \in \mathbb{R}^m\), \(\mathbf{y} \in \mathbb{R}^n\). The differential is
\[
df \;=\; \mathbf{x}^T (dA)\, \mathbf{y},
\]
a scalar. A scalar equals its own trace; applying the cyclic property:
\[
df \;=\; \operatorname{tr}\bigl(\mathbf{x}^T (dA)\, \mathbf{y}\bigr)
\;=\; \operatorname{tr}\bigl(\mathbf{y}\mathbf{x}^T\, dA\bigr)
\;=\; \operatorname{tr}\bigl((\mathbf{x}\mathbf{y}^T)^T\, dA\bigr).
\]
Comparing with \(df = \operatorname{tr}((\nabla f)^T dA)\):
\[
\nabla_A f \;=\; \mathbf{x}\mathbf{y}^T. \qquad \blacksquare
\]
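Because \(f\) is linear in \(A\), the identity \(f(A + dA) - f(A) = \operatorname{tr}((\mathbf{x}\mathbf{y}^T)^T dA)\) holds exactly, not just to first order, so it can be tested with a perturbation of any size. A minimal sketch, assuming NumPy, with arbitrary dimensions and seed:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 4
x = rng.standard_normal(m)
y = rng.standard_normal(n)
A = rng.standard_normal((m, n))
dA = rng.standard_normal((m, n))            # perturbation; need not be small here

grad_formula = np.outer(x, y)               # x y^T, an m x n matrix like A

df_actual = x @ (A + dA) @ y - x @ A @ y    # change in f(A) = x^T A y
df_trace = np.trace(grad_formula.T @ dA)    # tr((x y^T)^T dA)

print(f"f(A + dA) - f(A)   = {df_actual:.12f}")
print(f"tr((x y^T)^T dA)   = {df_trace:.12f}")
```

The gradient has the same shape as \(A\), and its \((i,j)\) entry is \(x_i y_j\), the coefficient of \(A_{ij}\) in \(\mathbf{x}^T A \mathbf{y}\).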
Derivative of the Determinant
The determinant \(\det X\)
measures the oriented volume scaling factor of a linear transformation
\(X : \mathbb{R}^n \to \mathbb{R}^n\). Its differential is central to
Normalizing Flows, to log-determinant losses in Gaussian models, and to any
optimization involving covariance matrices.
The key formula is due to Jacobi.
Theorem: Jacobi's formula
For a differentiable matrix-valued function \(A : t \mapsto A(t) \in \mathbb{R}^{n \times n}\),
or more generally for \(A \in \mathbb{R}^{n \times n}\) viewed as a variable,
\[
d(\det A) \;=\; \operatorname{tr}\bigl(\operatorname{adj}(A)\, dA\bigr).
\]
When \(A\) is invertible, using \(\operatorname{adj}(A) = (\det A)\, A^{-1}\),
\[
d(\det A) \;=\; (\det A)\, \operatorname{tr}(A^{-1}\, dA).
\]
Equivalently, via the
gradient convention,
\[
\nabla (\det A) \;=\; \operatorname{cofactor}(A) \;=\; (\det A)(A^{-1})^T \qquad (A \text{ invertible}).
\]
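As a sanity check, the statement can be worked by hand for \(n = 2\). The entry names \(p, q, r, s\) are introduced only for this illustration. With
\[
A \;=\; \begin{pmatrix} p & q \\ r & s \end{pmatrix}, \qquad \det A \;=\; ps - qr,
\]
the four partial derivatives assemble into
\[
\nabla(\det A) \;=\; \begin{pmatrix} s & -r \\ -q & p \end{pmatrix} \;=\; \operatorname{cofactor}(A),
\qquad
\operatorname{adj}(A) \;=\; \begin{pmatrix} s & -q \\ -r & p \end{pmatrix},
\]
and expanding the trace gives
\[
\operatorname{tr}\bigl(\operatorname{adj}(A)\, dA\bigr) \;=\; s\,dp - q\,dr - r\,dq + p\,ds \;=\; d(ps - qr),
\]
in agreement with the theorem.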
Proof.
Since \(\det A\) is a polynomial in the entries of \(A\), it is \(C^\infty\) and in particular
differentiable everywhere on \(\mathbb{R}^{n \times n}\). Fix indices \(i, j\) and recall the
cofactor expansion
along the \(i\)-th row:
\[
\det A \;=\; \sum_{k=1}^n A_{ik}\, C_{ik},
\]
where \(C_{ik}\) is the \((i,k)\) cofactor. The decisive structural fact is that
\(C_{ik}\) is defined via the \((n-1) \times (n-1)\) minor obtained by deleting row \(i\) and
column \(k\), so \(C_{ik}\) does not depend on any entry of the \(i\)-th row, and in
particular does not depend on \(A_{ij}\). Differentiating the expansion with respect to
\(A_{ij}\) therefore picks out the single term \(A_{ij}\, C_{ij}\):
\[
\frac{\partial \det A}{\partial A_{ij}} \;=\; C_{ij}.
\]
By the gradient convention, the matrix whose \((i,j)\) entry is \(\partial \det A / \partial A_{ij}\)
is \(\nabla(\det A)\):
\[
\nabla (\det A) \;=\; \operatorname{cofactor}(A) \;=\; \operatorname{adj}(A)^T.
\]
The corresponding differential is
\[
d(\det A) \;=\; \operatorname{tr}\bigl((\nabla \det A)^T\, dA\bigr)
\;=\; \operatorname{tr}\bigl(\operatorname{adj}(A)\, dA\bigr).
\]
When \(A\) is invertible, \(\operatorname{adj}(A) = (\det A)\, A^{-1}\), and since \(\det A\)
is a scalar it factors out of the trace:
\[
d(\det A) \;=\; (\det A)\, \operatorname{tr}(A^{-1}\, dA). \qquad \blacksquare
\]
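Both forms of the gradient can be compared numerically: the cofactor matrix built directly from its definition, and the inverse-based expression \((\det A)(A^{-1})^T\). The sketch below assumes NumPy; `cofactor_matrix` is a hypothetical helper written for this illustration, and the random test matrix is assumed to be invertible (which holds for generic draws).

```python
import numpy as np

def cofactor_matrix(A):
    """Cofactor matrix from the definition: C_ij = (-1)^(i+j) * det(minor_ij)."""
    n = A.shape[0]
    C = np.zeros_like(A)
    for i in range(n):
        for j in range(n):
            # Delete row i and column j to obtain the (n-1) x (n-1) minor
            minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
            C[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return C

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))

C = cofactor_matrix(A)
grad_inverse_form = np.linalg.det(A) * np.linalg.inv(A).T   # (det A) (A^{-1})^T

# First-order check of d(det A) = tr(adj(A) dA), using adj(A) = C^T
dA = 1e-6 * rng.standard_normal((4, 4))
d_det_actual = np.linalg.det(A + dA) - np.linalg.det(A)
d_det_jacobi = np.trace(C.T @ dA)

print(f"max |cofactor - (det A) A^-T| = {np.max(np.abs(C - grad_inverse_form)):.2e}")
print(f"actual change in det          = {d_det_actual:.10e}")
print(f"tr(adj(A) dA)                 = {d_det_jacobi:.10e}")
```

The cofactor route never inverts \(A\), so it stays meaningful for singular matrices, whereas the inverse form applies only when \(A\) is invertible.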
Where Jacobi's formula appears in ML
In Normalizing Flows, the change-of-variables formula for a bijective
transformation \(\mathbf{z} = f_\theta(\mathbf{x})\) requires
\(\log |\det J_{f_\theta}|\) in the log-likelihood; training by gradient descent on
\(\theta\) relies on Jacobi's formula to differentiate through the determinant. In
Gaussian-likelihood losses (maximum-likelihood estimation of a covariance
matrix, Gaussian processes, variational inference with full-covariance posteriors), the
gradient of the \(\log \det \Sigma\) term is, via
\(d(\log \det A) = \operatorname{tr}(A^{-1} dA)\) (valid for \(\det A > 0\)), simply \((\Sigma^{-1})^T\), which equals \(\Sigma^{-1}\) when \(\Sigma\) is symmetric. In
physics-informed learning, Jacobians of coordinate transformations enter
Hamiltonians and action functionals the same way.
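The log-determinant identity used for Gaussian likelihoods can be checked in the same first-order fashion. The following is a minimal sketch, assuming NumPy; the construction of a positive-definite `Sigma` and the helper `logdet` are illustrative choices. The perturbation is deliberately not symmetric, matching the convention on this page that all entries are treated as independent variables.

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((4, 4))
Sigma = B @ B.T + 4 * np.eye(4)            # symmetric positive definite by construction

def logdet(M):
    """log|det M| computed stably via slogdet (det M > 0 for the inputs used here)."""
    sign, logabsdet = np.linalg.slogdet(M)
    return logabsdet

Sigma_inv = np.linalg.inv(Sigma)
grad_formula = Sigma_inv.T                 # gradient of log det under the trace convention

dS = 1e-6 * rng.standard_normal((4, 4))    # small, unstructured perturbation
d_actual = logdet(Sigma + dS) - logdet(Sigma)
d_linear = np.trace(Sigma_inv @ dS)        # tr(Sigma^{-1} dSigma)

print(f"actual change in log det = {d_actual:.10e}")
print(f"tr(Sigma^-1 dSigma)      = {d_linear:.10e}")
```

Using `slogdet` rather than `log(det(...))` is a common precaution against overflow or underflow of the determinant for larger matrices.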
Corollary: derivative of the characteristic polynomial
The Jacobi formula specializes cleanly to the one-variable derivative of the
characteristic polynomial \(p(x) = \det(xI - A)\), a quantity that appears in
iterative eigenvalue algorithms (for instance, Newton's method on \(p\)).
Corollary.
For a fixed \(A \in \mathbb{R}^{n \times n}\), set \(M(x) := xI - A\). Then \(dM = (dx)\, I\),
and for \(x\) such that \(M(x)\) is invertible (i.e., \(x\) not an eigenvalue of \(A\)),
Jacobi's formula gives
\[
d\bigl(\det M(x)\bigr) \;=\; (\det M(x))\, \operatorname{tr}\bigl(M(x)^{-1}\, (dx)\, I\bigr)
\;=\; (\det M(x))\, \operatorname{tr}\bigl(M(x)^{-1}\bigr)\, dx.
\]
Dividing by \(dx\):
\[
p'(x) \;=\; p(x)\, \operatorname{tr}\bigl((xI - A)^{-1}\bigr). \qquad \blacksquare
\]
This identity has a striking consequence: since
\(\operatorname{tr}\bigl((xI-A)^{-1}\bigr) = \sum_{k=1}^n 1/(x - \lambda_k)\) where
\(\lambda_k\) are the eigenvalues of \(A\), the formula recovers the familiar logarithmic
derivative identity \(p'(x)/p(x) = \sum_k 1/(x - \lambda_k)\) — a bridge between matrix calculus
and the algebra of polynomial roots.
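The eigenvalue form of the trace can be confirmed numerically. The sketch below assumes NumPy and reuses the evaluation point \(x = 5\); the seed and matrix size are arbitrary. A real matrix may have complex eigenvalues, but they occur in conjugate pairs, so the sum of reciprocals is real up to rounding.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4
A = rng.standard_normal((n, n))
x = 5.0                                            # assumed not to be an eigenvalue of A

eigs = np.linalg.eigvals(A)                        # possibly complex eigenvalues of A
trace_resolvent = np.trace(np.linalg.inv(x * np.eye(n) - A))
sum_reciprocals = np.sum(1.0 / (x - eigs))         # sum_k 1 / (x - lambda_k)

print(f"tr((xI - A)^-1)        = {trace_resolvent:.10f}")
print(f"sum_k 1/(x - lambda_k) = {sum_reciprocals.real:.10f}")
print(f"imaginary residue      = {abs(sum_reciprocals.imag):.2e}")
```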
In numerical practice, the inverse-based formula becomes fragile when \(xI - A\) is poorly
conditioned or close to singular (\(\det(xI - A) \approx 0\)). In that regime, reverse-mode
automatic differentiation can be preferable to the analytical formula, since it need not
form the inverse explicitly.
Numerical verification
We validate the Jacobi-formula corollary against a
finite-difference approximation: for a
fixed random matrix \(A\) and a chosen evaluation point \(x = 5\) (well separated from the
eigenvalues of \(A\), so \(xI - A\) is safely invertible), we compare
\[
p'(x) \;\approx\; \frac{\det((x+dx)I - A) - \det(xI - A)}{dx}
\qquad \text{versus} \qquad
p'(x) \;=\; p(x)\, \operatorname{tr}\bigl((xI-A)^{-1}\bigr).
\]
import numpy as np
np.random.seed(0)
n = 4
# Random matrix A
A = np.random.randn(n, n)
# Evaluation point x chosen well away from any eigenvalue of A
# (for a 4x4 matrix with standard-normal entries, |eigenvalues| are typically below ~3)
x = 5.0
dx = 1e-8
# Finite-difference approximation of p'(x) where p(x) = det(xI - A)
I = np.eye(n)
p_at_x = np.linalg.det(x * I - A)
p_approx = (np.linalg.det((x + dx) * I - A) - p_at_x) / dx
# Analytical formula from Jacobi: p'(x) = p(x) * tr((xI - A)^{-1})
p_exact = p_at_x * np.trace(np.linalg.inv(x * I - A))
# Relative error
rel_err = abs(p_approx - p_exact) / abs(p_exact)
print(f"Matrix A:\n{A}\n")
print(f"Evaluation point x = {x}")
print(f"p(x) = det(xI - A) = {p_at_x:.6e}\n")
print(f"p'(x) finite-difference : {p_approx:.6e}")
print(f"p'(x) analytical (Jacobi): {p_exact:.6e}")
print(f"\nRelative error: {rel_err:.2e}")
Typical output shows a relative error near \(10^{-8}\), consistent with the best accuracy a
forward difference can achieve at \(dx \approx \sqrt{\varepsilon_{\text{mach}}}\) (as analyzed in the
previous page). The match confirms the
Jacobi-based formula and illustrates how analytical and numerical tools routinely cross-check
each other in practice.