Derivatives of Matrix-Valued Matrix Functions
So far we have discussed differentials of scalar functions and Jacobians of vector-valued functions. We now extend the same differential calculus to matrix-valued functions of a matrix argument, \(F : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}\). Our running examples will be \(F(X) = X^k\) and \(F(X) = X^{-1}\); at the end of the page we use the same framework to differentiate an LU decomposition.
Operator form is essential here. For scalar-in / scalar-out and vector-in / vector-out cases, the differential admits the compact matrix-product form \(df = f'(x)\,dx\). For matrix-in / matrix-out functions this compact form breaks down: the "Jacobian" is naturally a \(4\)-index tensor, or equivalently an \(n^2 \times n^2\) matrix acting on \(\mathrm{vec}(dX)\). In day-to-day calculation, the operator notation \[ dF \;=\; F'(X)[\,dX\,] \] is both cleaner and more honest: the differential \(F'(X)\) is a linear map \(\mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}\), applied to the input perturbation \(dX\), and its action may involve \(dX\) in multiple positions. We will read off \(F'(X)[\,\cdot\,]\) directly from the expansion of \(F(X + dX) - F(X)\).
The product rule preserves order. Because matrix multiplication is not commutative, the differential product rule must preserve the relative positions of the factors: \[ d(AB) \;=\; (dA)\,B + A\,(dB). \] Moving a \(dX\) past a constant matrix changes the value; this is the single most common source of error in matrix calculus.
Warm-up: affine matrix functions
For an affine map \(F(X) = AXB + C\) with \(A, B, C\) constant, the increment is exact: \[ \Delta F \;=\; A(X + dX)B + C - (AXB + C) \;=\; A\,(dX)\,B. \] There is no \(o(\|dX\|)\) remainder, so the differential is simply \[ dF \;=\; F'(X)[\,dX\,] \;=\; A\,(dX)\,B. \] The linear map \(F'(X)\) is the "sandwich operator" \(Z \mapsto AZB\), independent of \(X\). Two special cases: \(F(X) = AX\) gives \(dF = A\,dX\); \(F(X) = X\) gives \(dF = dX\). The full sandwich form \(A(\cdot)B\) is a pattern we will see repeatedly, notably in the inverse and LU examples below.
Example 1: Power of a matrix, \(F(X) = X^3\)
Let \(F(X) = X^3\) with \(X \in \mathbb{R}^{n \times n}\). Expanding the increment and keeping only linear-in-\(dX\) terms: \[ \begin{align*} (X+dX)^3 - X^3 &= X^2(dX) + X(dX)X + (dX)X^2 \\\\ &\quad + X(dX)^2 + (dX)X(dX) + (dX)^2 X + (dX)^3. \end{align*} \] The four nonlinear terms (three quadratic and one cubic in \(dX\)) are \(o(\|dX\|)\): each is bounded by a constant multiple of \(\|dX\|^2\) via submultiplicativity of the operator norm (e.g. \(\|X(dX)^2\| \leq \|X\|_{\mathrm{op}}\|dX\|^2\)). The three linear terms give the differential: \[ dF \;=\; F'(X)[\,dX\,] \;=\; (dX)X^2 + X(dX)X + X^2(dX). \] Note how each of the three factors of \(X\) in \(X^3\) contributes one term, with \(dX\) inserted at that factor's position; the non-commutativity of multiplication forbids collecting them into a single "coefficient". \(\blacksquare\)
Why there is no single "\(\partial F / \partial X\)" matrix
One might hope to write \(dF = (\partial F / \partial X)\, dX\) in analogy with the scalar case, but Example 1 shows this is impossible: \(dX\) appears in three different positions, and no single constant matrix \(M\) satisfies \(M\,dX = (dX)X^2 + X(dX)X + X^2(dX)\) for all \(dX\). The correct object — the linear map \(F'(X)\) — is described by how it acts on \(dX\), not by a single multiplicand. This is why the operator notation \(F'(X)[\,dX\,]\) is the right one for matrix-in / matrix-out calculus.
Example 2: Inverse, \(F(X) = X^{-1}\)
Let \(X \in \mathbb{R}^{n \times n}\) be invertible, and set \(F(X) = X^{-1}\). The set of invertible matrices is open in \(\mathbb{R}^{n \times n}\), and the entries of \(X^{-1}\) are rational functions of the entries of \(X\) (via the cofactor formula \(X^{-1} = (\det X)^{-1}\mathrm{adj}(X)\)) with nonvanishing denominator, so \(F\) is \(C^\infty\) — in particular differentiable — at every such \(X\). We may therefore compute \(dF\) as follows. From the identity \(X^{-1} X = I\) we take the differential of both sides; since \(I\) is constant, \(d(I) = 0 \in \mathbb{R}^{n \times n}\). Applying the product rule to the left side: \[ d(X^{-1})\,X + X^{-1}(dX) \;=\; 0 \qquad \Longrightarrow \qquad d(X^{-1})\,X \;=\; -X^{-1}(dX). \] Right-multiplying by \(X^{-1}\): \[ dF \;=\; d(X^{-1}) \;=\; -X^{-1}(dX)\,X^{-1}. \] In operator form: \(F'(X)[\,dX\,] = -X^{-1}(dX)X^{-1}\) — again a sandwich operator, now with \(A = B = -X^{-1}\) (and with a single sign). \(\blacksquare\)
From autodiff to optimization
The formula \(d(X^{-1}) = -X^{-1}(dX)X^{-1}\) is the backbone of automatic differentiation through linear-algebra primitives such as matrix solves. When an ML loss depends on \(X^{-1} b\) — for instance in Gaussian process marginal likelihoods or in second-order optimization with a Hessian solve — the gradient with respect to \(X\) is obtained by contracting the upstream gradient against \(-X^{-1}(\cdot)X^{-1}\), never by materializing an inverse and differentiating elementwise.