Matrix Calculus

Derivatives of Matrix-Valued Matrix Functions

So far we have discussed differentials of scalar functions and Jacobians of vector-valued functions. We now extend the same differential calculus to matrix-valued functions of a matrix argument, \(F : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}\). Our running examples will be \(F(X) = X^k\) and \(F(X) = X^{-1}\); at the end of the page we use the same framework to differentiate an LU decomposition.

Operator form is essential here. For scalar-in / scalar-out and vector-in / vector-out cases, the differential admits the compact matrix-product form \(df = f'(x)\,dx\). For matrix-in / matrix-out functions this compact form breaks down: the "Jacobian" is naturally a \(4\)-index tensor, or equivalently an \(n^2 \times n^2\) matrix acting on \(\mathrm{vec}(dX)\). In day-to-day calculation, the operator notation \[ dF \;=\; F'(X)[\,dX\,] \] is both cleaner and more honest: the differential \(F'(X)\) is a linear map \(\mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}\), applied to the input perturbation \(dX\), and its action may involve \(dX\) in multiple positions. We will read off \(F'(X)[\,\cdot\,]\) directly from the expansion of \(F(X + dX) - F(X)\).

The product rule preserves order. Because matrix multiplication is not commutative, the differential product rule must preserve the relative positions of the factors: \[ d(AB) \;=\; (dA)\,B + A\,(dB). \] Moving a \(dX\) past a constant matrix changes the value; this is the single most common source of error in matrix calculus.

Warm-up: affine matrix functions

For an affine map \(F(X) = AXB + C\) with \(A, B, C\) constant, the increment is exact: \[ \Delta F \;=\; A(X + dX)B + C - (AXB + C) \;=\; A\,(dX)\,B. \] There is no \(o(\|dX\|)\) remainder, so the differential is simply \[ dF \;=\; F'(X)[\,dX\,] \;=\; A\,(dX)\,B. \] The linear map \(F'(X)\) is the "sandwich operator" \(Z \mapsto AZB\), independent of \(X\). Two special cases: \(F(X) = AX\) gives \(dF = A\,dX\); \(F(X) = X\) gives \(dF = dX\). The full sandwich form \(A(\cdot)B\) is a pattern we will see repeatedly, notably in the inverse and LU examples below.

Example 1: Power of a matrix, \(F(X) = X^3\)

Example 1.

Let \(F(X) = X^3\) with \(X \in \mathbb{R}^{n \times n}\). Expanding the increment and keeping only linear-in-\(dX\) terms: \[ \begin{align*} (X+dX)^3 - X^3 &= X^2(dX) + X(dX)X + (dX)X^2 \\\\ &\quad + X(dX)^2 + (dX)X(dX) + (dX)^2 X + (dX)^3. \end{align*} \] The four nonlinear terms (three quadratic and one cubic in \(dX\)) are \(o(\|dX\|)\): each is bounded by a constant multiple of \(\|dX\|^2\) via submultiplicativity of the operator norm (e.g. \(\|X(dX)^2\| \leq \|X\|_{\mathrm{op}}\|dX\|^2\)). The three linear terms give the differential: \[ dF \;=\; F'(X)[\,dX\,] \;=\; (dX)X^2 + X(dX)X + X^2(dX). \] Note how each of the three factors of \(X\) in \(X^3\) contributes one term, with \(dX\) inserted at that factor's position; the non-commutativity of multiplication forbids collecting them into a single "coefficient".

Why there is no single "\(\partial F / \partial X\)" matrix

One might hope to write \(dF = (\partial F / \partial X)\, dX\) in analogy with the scalar case, but Example 1 shows this is impossible: \(dX\) appears in three different positions, and no single constant matrix \(M\) satisfies \(M\,dX = (dX)X^2 + X(dX)X + X^2(dX)\) for all \(dX\). The correct object — the linear map \(F'(X)\) — is described by how it acts on \(dX\), not by a single multiplicand. This is why the operator notation \(F'(X)[\,dX\,]\) is the right one for matrix-in / matrix-out calculus.

Example 2: Inverse, \(F(X) = X^{-1}\)

Example 2.

Let \(X \in \mathbb{R}^{n \times n}\) be invertible, and set \(F(X) = X^{-1}\). The set of invertible matrices is open in \(\mathbb{R}^{n \times n}\), and the entries of \(X^{-1}\) are rational functions of the entries of \(X\) (via the cofactor formula \(X^{-1} = (\det X)^{-1}\mathrm{adj}(X)\)) with nonvanishing denominator, so \(F\) is \(C^\infty\) — in particular differentiable — at every such \(X\). We may therefore compute \(dF\) as follows. From the identity \(X^{-1} X = I\) we take the differential of both sides; since \(I\) is constant, \(d(I) = 0 \in \mathbb{R}^{n \times n}\). Applying the product rule to the left side: \[ d(X^{-1})\,X + X^{-1}(dX) \;=\; 0 \qquad \Longrightarrow \qquad d(X^{-1})\,X \;=\; -X^{-1}(dX). \] Right-multiplying by \(X^{-1}\): \[ dF \;=\; d(X^{-1}) \;=\; -X^{-1}(dX)\,X^{-1}. \] In operator form: \(F'(X)[\,dX\,] = -X^{-1}(dX)X^{-1}\) — again a sandwich operator, now with \(A = B = -X^{-1}\) (and with a single sign).

From autodiff to optimization

The formula \(d(X^{-1}) = -X^{-1}(dX)X^{-1}\) is the backbone of automatic differentiation through linear-algebra primitives such as matrix solves. When an ML loss depends on \(X^{-1} b\) — for instance in Gaussian process marginal likelihoods or in second-order optimization with a Hessian solve — the gradient with respect to \(X\) is obtained by contracting the upstream gradient against \(-X^{-1}(\cdot)X^{-1}\), never by materializing an inverse and differentiating elementwise.

Derivative of the LU Decomposition

LU decomposition factors a square matrix \(X\) as \(X = LU\), where \(L\) is unit lower triangular (diagonal entries equal to one) and \(U\) is upper triangular. Understanding how \(L\) and \(U\) change when \(X\) is perturbed is fundamental for stability analysis in numerical linear algebra and for differentiable linear-algebra primitives in modern autodiff systems.

Applying the product rule to \(X = LU\): \[ dX \;=\; (dL)\,U + L\,(dU). \] The structural constraints on \(L\) and \(U\) carry over to their differentials: \(dL\) is strictly lower triangular (the unit-diagonal constraint freezes the diagonal of \(L\), so \((dL)_{ii} = 0\)), and \(dU\) is upper triangular. Between the two of them \(dL\) and \(dU\) have exactly \(n^2\) independent entries — the same as \(dX\). Under the standard condition that guarantees the existence and uniqueness of an LU decomposition — namely, all leading principal minors of \(X\) are nonzero, equivalently all diagonal entries \(U_{ii}\) are nonzero — the linear map \((dL, dU) \mapsto dX\) defined by \(dX = (dL)U + L(dU)\) is invertible, which is what allows us to solve for \(dL, dU\) given \(dX\).

A \(2 \times 2\) worked example

Take \[ L \;=\; \begin{bmatrix} 1 & 0 \\ L_{21} & 1 \end{bmatrix}, \qquad U \;=\; \begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix}, \] with differentials constrained by the triangular shapes: \[ dL \;=\; \begin{bmatrix} 0 & 0 \\ dL_{21} & 0 \end{bmatrix}, \qquad dU \;=\; \begin{bmatrix} dU_{11} & dU_{12} \\ 0 & dU_{22} \end{bmatrix}. \] Computing \(dX = (dL)U + L(dU)\) term by term: \[ (dL)\,U \;=\; \begin{bmatrix} 0 & 0 \\ dL_{21}\, U_{11} & dL_{21}\, U_{12} \end{bmatrix}, \qquad L\,(dU) \;=\; \begin{bmatrix} dU_{11} & dU_{12} \\ L_{21}\, dU_{11} & L_{21}\, dU_{12} + dU_{22} \end{bmatrix}. \] Adding these, \[ dX \;=\; \begin{bmatrix} dU_{11} & dU_{12} \\ L_{21}\, dU_{11} + U_{11}\, dL_{21} & L_{21}\, dU_{12} + U_{12}\, dL_{21} + dU_{22} \end{bmatrix}. \]

Reading this system in reverse — solving for \((dU_{11}, dU_{12}, dL_{21}, dU_{22})\) given the entries of \(dX\) — yields the forward-mode sensitivity of each LU factor entry. The upper row of \(dX\) determines the top row of \(dU\) immediately (\(dU_{11} = dX_{11}\), \(dU_{12} = dX_{12}\)); assuming \(U_{11} \neq 0\), the bottom row then gives \[ dL_{21} \;=\; \frac{dX_{21} - L_{21}\,dU_{11}}{U_{11}}, \qquad dU_{22} \;=\; dX_{22} - L_{21}\,dU_{12} - U_{12}\,dL_{21}, \] by back-substitution. This triangular-then-triangular solve pattern — and the role of the diagonal entries \(U_{ii}\) as the divisors that must be nonzero — is exactly how the general LU-differentiation algorithm works at scale.

Application: differentiable linear algebra in modern ML

The identity \(dX = (dL)U + L(dU)\) is the starting point for reverse-mode autodiff through matrix factorizations — a key ingredient in frameworks that differentiate through linear solves, determinant computations, and Cholesky factorizations. Given an upstream gradient \(\bar X\) (the sensitivity of a downstream loss with respect to \(X\)), one recovers \(\bar L\) and \(\bar U\) by solving the adjoint equations derived from the identity above. This is what allows end-to-end training of models that embed linear solves — from Gaussian processes and differentiable physics simulators to structured-prediction layers — without ever writing hand-tuned backward passes.

Loading...