Jacobian

Jacobian Chain Rule Backpropagation

Jacobian

In the previous page we defined the differential of a scalar function as a linear map on displacements. We now extend this to a vector-valued function \(\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m\). The linear map representing its differential is encoded by a matrix — the Jacobian.

Definition: Differentiability (vector-valued)

A function \(\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m\) is differentiable at \(\mathbf{x}\) if there exists a linear map \(L : \mathbb{R}^n \to \mathbb{R}^m\) such that \[ \mathbf{f}(\mathbf{x} + \mathbf{h}) - \mathbf{f}(\mathbf{x}) \;=\; L(\mathbf{h}) + o(\|\mathbf{h}\|) \quad \text{as } \mathbf{h} \to \mathbf{0}. \] When such \(L\) exists it is unique (two linear maps differing by \(o(\|\mathbf{h}\|)\) must coincide, by the same argument as in the scalar case) and is called the differential of \(\mathbf{f}\) at \(\mathbf{x}\), denoted \(d\mathbf{f}_{\mathbf{x}}\) or simply \(\mathbf{f}'(\mathbf{x})\).

Remark. Differentiability is a stronger condition than the existence of all partial derivatives \(\partial f_i/\partial x_j\): a function can have all partials at a point yet fail to be differentiable there (the standard counterexample is \(f(x,y) = xy/(x^2+y^2)\) extended by \(f(0,0) = 0\)). Differentiability implies the existence of all partials; the converse holds under continuity of the partials in a neighborhood.

Definition: Jacobian

Let \(\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m\) be differentiable at \(\mathbf{x}\). The Jacobian of \(\mathbf{f}\) at \(\mathbf{x}\), denoted \(\mathbf{f}'(\mathbf{x})\) or \(J_{\mathbf{f}}(\mathbf{x})\), is the \(m \times n\) matrix representing the differential \(d\mathbf{f}_{\mathbf{x}}\) in the standard bases. Its entries are the partial derivatives \[ J_{ij} \;=\; \frac{\partial f_i}{\partial x_j}, \qquad 1 \leq i \leq m,\; 1 \leq j \leq n, \] and the defining linear approximation takes the form \[ \mathbf{f}(\mathbf{x} + d\mathbf{x}) - \mathbf{f}(\mathbf{x}) \;=\; \mathbf{f}'(\mathbf{x})\, d\mathbf{x} + o(\|d\mathbf{x}\|) \quad \text{as } d\mathbf{x} \to \mathbf{0}. \] (The identification \(J_{ij} = \partial f_i/\partial x_j\) follows from evaluating the linear approximation along each coordinate direction \(\mathbf{h} = t\,\mathbf{e}_j\), \(t \to 0\).)

Two notations for the derivative-as-operator. The Jacobian \(\mathbf{f}'(\mathbf{x})\) is simultaneously (i) an \(m \times n\) matrix of partial derivatives and (ii) a linear map \(\mathbb{R}^n \to \mathbb{R}^m\). Both views are standard, and the corresponding notations are: \[ d\mathbf{f} \;=\; \mathbf{f}'(\mathbf{x})\, d\mathbf{x} \qquad \text{(matrix-vector product)} \] \[ d\mathbf{f} \;=\; \mathbf{f}'(\mathbf{x})[d\mathbf{x}] \qquad \text{(linear map applied to an argument)} \] For finite-dimensional calculations the two are interchangeable. The bracket form \(\mathbf{f}'(\mathbf{x})[d\mathbf{x}]\) is the notation of choice once we leave finite dimensions: for Fréchet derivatives on Banach spaces, for directional derivatives on manifolds, and for the operator-valued maps \(\mathrm{Ad}_g[X]\), \(\mathrm{ad}_X[Y]\) that appear in Lie theory and representation theory. This page uses the matrix-product form by default and the bracket form where operator composition is cleaner (notably in the Chain Rule).

The \(i\)-th row of \(J\) records the partial derivatives of the \(i\)-th output component; the \(j\)-th column records the effect of perturbing the \(j\)-th input coordinate. Writing out \(d\mathbf{f} = J\, d\mathbf{x}\) component by component for \(n = m = 2\): \[ \begin{align*} d\mathbf{f} &= \begin{bmatrix} \partial f_1/\partial x_1 & \partial f_1/\partial x_2 \\ \partial f_2/\partial x_1 & \partial f_2/\partial x_2 \end{bmatrix} \begin{bmatrix} dx_1 \\ dx_2 \end{bmatrix} \\\\ &= \begin{bmatrix} (\partial f_1/\partial x_1)\,dx_1 + (\partial f_1/\partial x_2)\,dx_2 \\ (\partial f_2/\partial x_1)\,dx_1 + (\partial f_2/\partial x_2)\,dx_2 \end{bmatrix}. \end{align*} \] The Jacobian acts as a linear operator that maps input displacements \(d\mathbf{x}\) to the corresponding linear part of the output change \(d\mathbf{f}\).

Convention (shape). For \(\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m\) the Jacobian \(\mathbf{f}'(\mathbf{x})\) is \(m \times n\). Two special cases recover earlier objects:

As a first example, consider the affine map \(\mathbf{f}(\mathbf{x}) = A\mathbf{x}\) with \(A \in \mathbb{R}^{m \times n}\) constant. Expanding the increment, \[ \Delta \mathbf{f} \;=\; \mathbf{f}(\mathbf{x} + d\mathbf{x}) - \mathbf{f}(\mathbf{x}) \;=\; A(\mathbf{x} + d\mathbf{x}) - A\mathbf{x} \;=\; A\, d\mathbf{x}, \] the linear part is exact and there is no \(o(\|d\mathbf{x}\|)\) remainder. Reading off the matrix-product form identifies the Jacobian as \(\mathbf{f}'(\mathbf{x}) = A\) — the constant matrix \(A\) itself is its own linearization, as expected for an affine map.

Local Transformation of Hypervolumes

When \(m = n\), the Jacobian describes how an infinitesimal hypercube in the input space is mapped to a parallelotope in the output space.

The determinant of the Jacobian, \(\det J\), is the local volume expansion factor. In generative modeling — specifically, Normalizing Flows — this determinant is what allows the change-of-variables formula to preserve total probability mass under complex nonlinear transformations of a base distribution.

Chain Rule

In deep learning, a neural network is a composition of differentiable layers. The chain rule turns differentiation of such a composition into the composition of the individual differentials — and, because differentials are linear maps, composition becomes matrix multiplication of Jacobians.

Theorem: Chain Rule

Let \(\mathbf{h} : \mathbb{R}^n \to \mathbb{R}^p\) be differentiable at \(\mathbf{x}\) and \(\mathbf{g} : \mathbb{R}^p \to \mathbb{R}^m\) be differentiable at \(\mathbf{u} := \mathbf{h}(\mathbf{x})\). Then the composition \(\mathbf{f} = \mathbf{g} \circ \mathbf{h}\) is differentiable at \(\mathbf{x}\), and its differential is the composition of the individual differentials: \[ \mathbf{f}'(\mathbf{x})[d\mathbf{x}] \;=\; \mathbf{g}'(\mathbf{h}(\mathbf{x}))\bigl[\,\mathbf{h}'(\mathbf{x})[d\mathbf{x}]\,\bigr]. \] Equivalently, at the level of Jacobian matrices, \[ \mathbf{f}'(\mathbf{x}) \;=\; \mathbf{g}'(\mathbf{h}(\mathbf{x}))\,\mathbf{h}'(\mathbf{x}) \qquad \text{(an } m \times n \text{ matrix, from } m \times p \text{ times } p \times n\text{).} \]

Proof.

Let \(J_{\mathbf{h}} := \mathbf{h}'(\mathbf{x})\) and \(J_{\mathbf{g}} := \mathbf{g}'(\mathbf{u})\). We use throughout the fact that in finite dimensions every linear map is automatically bounded: there exists \(C > 0\) with \(\|L\mathbf{v}\| \leq C\|\mathbf{v}\|\) for all \(\mathbf{v}\) (this is the operator norm, and it is finite because the unit sphere is compact). This is what lets us push \(o(\|\cdot\|)\) and \(O(\|\cdot\|)\) bounds through \(J_{\mathbf{h}}\) and \(J_{\mathbf{g}}\) below.

By differentiability of \(\mathbf{h}\) at \(\mathbf{x}\), \[ \mathbf{h}(\mathbf{x} + d\mathbf{x}) - \mathbf{h}(\mathbf{x}) \;=\; J_{\mathbf{h}}\, d\mathbf{x} + r_{\mathbf{h}}(d\mathbf{x}), \qquad r_{\mathbf{h}}(d\mathbf{x}) = o(\|d\mathbf{x}\|). \] Let \(d\mathbf{u} := J_{\mathbf{h}}\, d\mathbf{x} + r_{\mathbf{h}}(d\mathbf{x})\); note that \(\|d\mathbf{u}\| = O(\|d\mathbf{x}\|)\). By differentiability of \(\mathbf{g}\) at \(\mathbf{u}\), \[ \mathbf{g}(\mathbf{u} + d\mathbf{u}) - \mathbf{g}(\mathbf{u}) \;=\; J_{\mathbf{g}}\, d\mathbf{u} + r_{\mathbf{g}}(d\mathbf{u}), \qquad r_{\mathbf{g}}(d\mathbf{u}) = o(\|d\mathbf{u}\|). \] Composing these and expanding, \[ \mathbf{f}(\mathbf{x} + d\mathbf{x}) - \mathbf{f}(\mathbf{x}) \;=\; J_{\mathbf{g}} J_{\mathbf{h}}\, d\mathbf{x} + \underbrace{J_{\mathbf{g}}\, r_{\mathbf{h}}(d\mathbf{x})}_{o(\|d\mathbf{x}\|)} + \underbrace{r_{\mathbf{g}}(d\mathbf{u})}_{o(\|d\mathbf{u}\|) = o(\|d\mathbf{x}\|)}. \] The first error term is \(o(\|d\mathbf{x}\|)\) because \(\|J_{\mathbf{g}}\,r_{\mathbf{h}}(d\mathbf{x})\| \leq \|J_{\mathbf{g}}\|_{\mathrm{op}}\,\|r_{\mathbf{h}}(d\mathbf{x})\| = \|J_{\mathbf{g}}\|_{\mathrm{op}} \cdot o(\|d\mathbf{x}\|) = o(\|d\mathbf{x}\|)\). The second requires combining \(r_{\mathbf{g}}(d\mathbf{u}) = o(\|d\mathbf{u}\|)\) with \(\|d\mathbf{u}\| = O(\|d\mathbf{x}\|)\): since \(d\mathbf{x} \to 0\) forces \(d\mathbf{u} \to 0\), \[ \frac{\|r_{\mathbf{g}}(d\mathbf{u})\|}{\|d\mathbf{x}\|} \;=\; \underbrace{\frac{\|r_{\mathbf{g}}(d\mathbf{u})\|}{\|d\mathbf{u}\|}}_{\to\, 0} \cdot \underbrace{\frac{\|d\mathbf{u}\|}{\|d\mathbf{x}\|}}_{\text{bounded}} \;\longrightarrow\; 0, \] so \(r_{\mathbf{g}}(d\mathbf{u}) = o(\|d\mathbf{x}\|)\) as well (the case \(d\mathbf{u} = \mathbf{0}\) gives \(r_{\mathbf{g}} = \mathbf{0}\) directly). Identifying the linear part gives \(\mathbf{f}'(\mathbf{x}) = J_{\mathbf{g}} J_{\mathbf{h}}\). \(\blacksquare\)

Order matters: non-commutativity of matrix multiplication

The Jacobian of a composition is a product of Jacobians in a specific order: outer-first, inner-last. The dimensional count confirms this — if \(\mathbf{x} \in \mathbb{R}^n\), \(\mathbf{h}(\mathbf{x}) \in \mathbb{R}^p\), \(\mathbf{g}(\mathbf{h}(\mathbf{x})) \in \mathbb{R}^m\), then only \[ \underbrace{\mathbf{g}'(\mathbf{h}(\mathbf{x}))}_{m \times p} \;\cdot\; \underbrace{\mathbf{h}'(\mathbf{x})}_{p \times n} \] produces a well-defined \(m \times n\) matrix. Reversing the order would, in general, not even be dimensionally compatible. This non-commutativity is the reason chain-rule applications in machine learning must keep careful track of which direction the composition runs.

Backpropagation

We now specialize the chain rule to the situation that drives neural-network training: a scalar loss computed by a deep composition. This specialization is what the reverse-mode automatic differentiation algorithm — commonly called backpropagation in neural network training — computes efficiently.

Consider three composed layers with compatible dimensions: \[ \mathbf{x} \;\xrightarrow{\mathbf{f}_1}\; \mathbb{R}^{p} \;\xrightarrow{\mathbf{f}_2}\; \mathbb{R}^{q} \;\xrightarrow{f_3}\; \mathbb{R}, \] with \(\mathbf{x} \in \mathbb{R}^n\) the parameters and \(f_3\) scalar-valued so that the final output is the loss \[ L(\mathbf{x}) \;=\; f_3\bigl(\mathbf{f}_2(\mathbf{f}_1(\mathbf{x}))\bigr). \] By the chain rule, the Jacobian of \(L\) — a \(1 \times n\) row vector, since the output is scalar — factors as the product of the layer Jacobians: \[ L'(\mathbf{x}) \;=\; f_3'\,\mathbf{f}_2'\,\mathbf{f}_1' \qquad \text{with shapes } (1 \times q)(q \times p)(p \times n). \] By the gradient convention, the column vector \(\nabla L\) is the transpose of this row: \(L'(\mathbf{x}) = (\nabla L)^T\).

Associativity and evaluation order

Matrix multiplication is associative but not commutative, so the value of the product \(f_3'\,\mathbf{f}_2'\,\mathbf{f}_1'\) does not depend on the order of the parenthesization — but the cost of computing it does. The two natural parenthesizations are:

Tracking the shapes through reverse mode: \[ \begin{align*} \bigl(f_3'\,\mathbf{f}_2'\bigr)\,\mathbf{f}_1' &= \bigl[(1 \times q)(q \times p)\bigr](p \times n) \\\\ &= (1 \times p)(p \times n) \\\\ &= (1 \times n) = (\nabla L)^T. \end{align*} \] Every intermediate result is a row vector of length at most \(\max(p, q, n)\); no \(p \times n\) or \(q \times n\) matrix is ever materialized. Each layer contributes a single vector-Jacobian product (VJP) — the left-multiplication of a row vector \(\mathbf{v}^T\) by a layer's Jacobian \(J\).

Forward mode, by contrast, accumulates full intermediate Jacobians: \[ \begin{align*} f_3'\,\bigl(\mathbf{f}_2'\,\mathbf{f}_1'\bigr) &= (1 \times q)\bigl[(q \times p)(p \times n)\bigr] \\\\ &= (1 \times q)(q \times n) \\\\ &= (1 \times n) = (\nabla L)^T. \end{align*} \] The intermediate matrix \(\mathbf{f}_2'\,\mathbf{f}_1'\) has \(q \times n\) entries, which is much larger than any row vector in reverse mode when \(n\) is large (the typical parameter regime). This is the fundamental reason deep networks are trained with reverse-mode autodiff rather than forward-mode: for scalar loss and high-dimensional parameters, reverse mode is optimal.

The VJP and Memory

Materializing a full Jacobian for a layer with, say, \(10{,}000\) inputs and \(10{,}000\) outputs would require \(10^8\) entries. To avoid this, modern frameworks such as PyTorch and JAX implement each layer's differentiation as a vector-Jacobian product \[ \mathbf{v}^T J, \] where \(\mathbf{v}^T\) is the gradient flowing back from the downstream layers. This allows the system to compute every gradient needed for training without ever materializing a full layer Jacobian — which is what makes training of "deep" models physically feasible.