Derivative of the Frobenius Norm
In machine learning, we frequently map high-dimensional weight matrices to a single scalar "cost" or "penalty" term.
The Frobenius norm provides a natural way to measure
the total magnitude of a matrix \(X \in \mathbb{R}^{m \times n}\), acting as the matrix-space equivalent of the Euclidean
\(L_2\) norm.
It is defined through the trace operator, which we will see is the fundamental tool for extracting
scalar information from matrix products:
\[
f(X) = \| X \|_F = \sqrt{\text{tr}(X^TX)} = \sqrt{\sum_{i,j} X_{ij}^2}
\]
Derivatives implicitly rely on norms to measure the magnitude of changes in both the input \(dx\)
and the output \(df\), ensuring a consistent comparison of their scales.
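As a quick numerical illustration of the two equivalent forms of the definition (a minimal sketch; the matrix entries are made up):

```python
import math

# A small 2x3 example matrix (hypothetical values)
X = [[1.0, -2.0, 3.0],
     [0.5,  4.0, -1.0]]

# Entry-wise form: sqrt of the sum of squared entries
entry_form = math.sqrt(sum(v * v for row in X for v in row))

# Trace form: the j-th diagonal entry of X^T X is the squared norm of column j
diag_XtX = [sum(row[j] ** 2 for row in X) for j in range(len(X[0]))]
trace_form = math.sqrt(sum(diag_XtX))

print(abs(entry_form - trace_form) < 1e-12)  # True: both definitions agree
```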
Derivative of the Frobenius Norm
Let's take the differential \(df\) of
\[
f(X) = \| X \|_F = \sqrt{\text{tr}(X^TX)} = \sqrt{\sum_{i,j} X_{ij}^2}.
\]
First, by the chain rule:
\[
df = \frac{1}{2\sqrt{\text{tr }(X^TX)}}d[\text{tr }(X^TX)]
\]
Note: for any matrix \(A\),
\[
\begin{align*}
d(\text{tr }(A)) &= \text{tr }(A + dA) - \text{tr }(A) \\\\
&= \text{tr }(A) + \text{tr }(dA) - \text{tr }(A) \\\\
&= \text{tr }(dA)
\end{align*}
\]
Thus:
\[
df = \frac{1}{2\sqrt{\text{tr }(X^TX)}}\text{tr }[d(X^TX)]
\]
By the product rule:
\[
\begin{align*}
df &= \frac{1}{2\sqrt{\text{tr }(X^TX)}}\text{tr }[dX^TX + X^TdX]\\\\
&= \frac{1}{2\sqrt{\text{tr }(X^TX)}}\left[\text{tr }(dX^TX) + \text{tr }(X^TdX)\right]
\end{align*}
\]
Since
\[
\text{tr }(dX^TX) = \text{tr }((dX^TX)^T) = \text{tr }(X^TdX),
\]
we obtain
\[
\begin{align*}
df &= \frac{1}{2\sqrt{\text{tr }(X^TX)}}2\text{tr }(X^TdX)\\\\
&= \frac{1}{\sqrt{\text{tr }(X^TX)}}\text{tr }(X^TdX)
\end{align*}
\]
Here, \(\text{tr }(X^TdX)\) represents the Frobenius inner product of \(X\) and \(dX\). Then:
\[
df = \left\langle \frac{X}{\sqrt{\text{tr }(X^TX)}}, dX \right\rangle_F \tag{1}
\]
Therefore,
\[
\nabla f = \frac{X}{ \| X \|_F}.
\]
Note: The expression in (1) is equivalent to
\[
df = \text{tr }((\nabla f)^TdX) \tag{2}
\]
The trace operator satisfies linearity and the cyclic property, making it a convenient way to express derivatives
in terms of gradients.
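The closed form \(\nabla f = X / \|X\|_F\) can be checked against PyTorch's autograd (a sketch assuming PyTorch is available; the matrix values are arbitrary):

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 3, dtype=torch.float64, requires_grad=True)

# f(X) = ||X||_F, written via the trace definition
f = torch.sqrt(torch.trace(X.T @ X))
f.backward()

# Analytical gradient: X / ||X||_F
with torch.no_grad():
    analytic = X / torch.linalg.norm(X, ord="fro")

print(torch.allclose(X.grad, analytic))  # True
```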
Applying the Cyclic Property
Consider \(f(A) = x^TAy\), where \(A\) is an \(m \times n\) matrix, \(x \in \mathbb{R}^m\), and
\(y \in \mathbb{R}^n\).
Since \(x\) and \(y\) are constant, the product rule gives
\[
df = x^TdAy
\]
Since \(df\) is a scalar, taking the trace does not change its value:
\[
df = \text{tr }(x^TdAy)
\]
By the cyclic property of the trace:
\[
df = \text{tr }(yx^TdA)
\]
Therefore, comparing this with \(df = \text{tr }((\nabla f)^TdA)\),
\[
\nabla f = (yx^T)^T = xy^T
\]
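This result is easy to verify numerically (a sketch with random vectors, assuming PyTorch):

```python
import torch

torch.manual_seed(0)
m, n = 3, 4
x = torch.randn(m, dtype=torch.float64)
y = torch.randn(n, dtype=torch.float64)
A = torch.randn(m, n, dtype=torch.float64, requires_grad=True)

# f(A) = x^T A y
f = x @ A @ y
f.backward()

print(torch.allclose(A.grad, torch.outer(x, y)))  # True: grad is x y^T
```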
Derivative of the Determinant
The determinant \(\det(X)\) represents the oriented volume scaling
factor of a linear transformation. Its derivative is crucial for Normalizing Flows and any optimization
involving Covariance Matrices.
The derivative of the determinant of a square matrix \(A \in \mathbb{R}^{n \times n}\)
can be expressed in several equivalent forms. Recall that
\[
\text{adj }(A) = \text{cofactor }(A)^T = (\det A)A^{-1}.
\]
This implies:
\[
\text{cofactor }(A) = \text{adj }(A)^T = (\det A)(A^{-1})^T.
\]
By the cofactor expansion of the determinant along the \(i\)-th row of \(A\):
\[
\det (A) = A_{i1}C_{i1} + A_{i2}C_{i2} + \cdots + A_{in}C_{in}
\]
Since the cofactors \(C_{i1}, \ldots, C_{in}\) do not depend on the entry \(A_{ij}\), we have \(\frac{\partial \det A}{\partial A_{ij}} = C_{ij}\), and then:
\[
\nabla (\det A) = \text{cofactor } (A)
\]
Equivalently, using the expression (2):
\[
\begin{align*}
d (\det A) &= \text{tr }(\text{cofactor }(A)^T dA) \\\\
&= \text{tr }(\text{adj }(A) dA) \\\\
&= \text{tr }((\det A)A^{-1} dA) \tag{3}
\end{align*}
\]
Therefore,
\[
\begin{align*}
\nabla (\det A) &= \text{cofactor } (A) \\\\
&= \text{adj }(A)^T \\\\
&= (\det A)(A^{-1})^T
\end{align*}
\]
Since \(\det A\) is a scalar, the expression (3) can be written as:
\[
d(\det A) = (\det A)\text{tr }(A^{-1} dA) \tag{4}
\]
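Both forms of the gradient can be cross-checked with autograd (a sketch on a random matrix, assuming PyTorch; a random Gaussian matrix is almost surely invertible):

```python
import torch

torch.manual_seed(0)
A = torch.randn(3, 3, dtype=torch.float64, requires_grad=True)

d = torch.det(A)
d.backward()

# Analytical gradient: (det A) (A^{-1})^T
with torch.no_grad():
    analytic = torch.det(A) * torch.inverse(A).T

print(torch.allclose(A.grad, analytic))  # True
```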
Characteristic Polynomial
Consider the scalar function \(p(x) = \det(xI - A)\), which is the characteristic polynomial of \(A\).
In practice, when approximating eigenvalues using numerical methods, we often need to compute the derivative
of \(p(x)\) at different values of \(x\).
Applying the expression (4), we have:
\[
\begin{align*}
&d(\det (xI-A)) \\\\
&= (\det (xI-A)) \text{tr } ((xI-A)^{-1} d(xI -A)).
\end{align*}
\]
Since \(A\) is constant, \(d(xI - A) = (dx)I\), and because \(dx\) is a scalar it can be pulled out of the trace. Thus,
\[
\begin{align*}
&d(\det (xI-A)) \\\\
&= (\det (xI-A)) \text{tr } ((xI-A)^{-1})dx
\end{align*}
\]
Note: While the analytical approach provides a useful formula for the differential of many functions, in practice
reverse-mode automatic differentiation offers a more stable and
efficient way to compute gradients. Auto-diff computes the derivative with respect to matrix parameters directly,
without explicitly forming quantities like \(A^{-1}\), which can introduce numerical instability in certain cases (e.g., when \(\det(xI - A) \approx 0\)).
Here's an example using auto-diff in PyTorch vs using the analytical formula to compute the derivative of \(p(x)\):
import torch

# Random square matrix
def generate_matrix(n):
    return torch.randn(n, n, dtype=torch.float64, requires_grad=False)

# Characteristic polynomial p(x) = det(xI - A)
def p(x, A):
    return torch.det(x * torch.eye(A.shape[0], dtype=A.dtype, device=A.device) - A)

# Derivative of p(x) using auto-differentiation
def dp_torch(x, A):
    x_tensor = torch.tensor([x], requires_grad=True, dtype=A.dtype, device=A.device)
    grad = torch.autograd.grad(p(x_tensor, A), x_tensor)[0]
    return grad.item()

# Analytical formula: d(p(x)) = (det(xI - A)) * tr((xI - A)^{-1}) dx
def dp(x, A):
    return (
        p(x, A).item() *
        torch.trace(
            torch.inverse(x * torch.eye(A.shape[0], dtype=A.dtype, device=A.device) - A)
        ).item()
    )
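As a sanity check, the analytical formula (4) can also be compared against a central finite difference (a self-contained sketch; the evaluation point \(x_0 = 2.0\) and the matrix size are arbitrary choices):

```python
import torch

torch.manual_seed(0)
n = 4
A = torch.randn(n, n, dtype=torch.float64)
I = torch.eye(n, dtype=torch.float64)

def char_poly(x):
    # p(x) = det(xI - A)
    return torch.det(x * I - A)

x0 = 2.0
M = x0 * I - A

# Analytical: p'(x) = det(xI - A) * tr((xI - A)^{-1})
analytic = (torch.det(M) * torch.trace(torch.inverse(M))).item()

# Central finite difference approximation of p'(x0)
h = 1e-6
numeric = ((char_poly(x0 + h) - char_poly(x0 - h)) / (2 * h)).item()

print(abs(analytic - numeric) <= 1e-5 * max(1.0, abs(analytic)))
```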