Jacobian
In the previous page we defined the differential of a scalar function as a
linear map on displacements.
We now extend this to a vector-valued function
\(\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m\). The linear map representing its differential is
encoded by a matrix — the Jacobian.
Definition: Differentiability (vector-valued)
A function \(\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m\) is differentiable
at \(\mathbf{x}\) if there exists a linear map \(L : \mathbb{R}^n \to \mathbb{R}^m\) such that
\[
\mathbf{f}(\mathbf{x} + \mathbf{h}) - \mathbf{f}(\mathbf{x})
\;=\; L(\mathbf{h}) + o(\|\mathbf{h}\|) \quad \text{as } \mathbf{h} \to \mathbf{0}.
\]
When such \(L\) exists it is unique (two linear maps differing by \(o(\|\mathbf{h}\|)\) must
coincide, by the same argument as in the
scalar case) and is called the
differential of \(\mathbf{f}\) at \(\mathbf{x}\), denoted
\(d\mathbf{f}_{\mathbf{x}}\) or simply \(\mathbf{f}'(\mathbf{x})\).
Remark. Differentiability is a stronger condition than the existence of all partial
derivatives \(\partial f_i/\partial x_j\): a function can have all partials at a point yet fail
to be differentiable there (the standard counterexample is
\(f(x,y) = xy/(x^2+y^2)\) extended by \(f(0,0) = 0\)). Differentiability implies the existence
of all partials; the converse holds under continuity of the partials in a neighborhood.
Definition: Jacobian
Let \(\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m\) be differentiable at \(\mathbf{x}\). The
Jacobian of \(\mathbf{f}\) at \(\mathbf{x}\), denoted \(\mathbf{f}'(\mathbf{x})\)
or \(J_{\mathbf{f}}(\mathbf{x})\), is the \(m \times n\) matrix representing the differential
\(d\mathbf{f}_{\mathbf{x}}\) in the standard bases. Its entries are the partial derivatives
\[
J_{ij} \;=\; \frac{\partial f_i}{\partial x_j}, \qquad 1 \leq i \leq m,\; 1 \leq j \leq n,
\]
and the defining linear approximation takes the form
\[
\mathbf{f}(\mathbf{x} + d\mathbf{x}) - \mathbf{f}(\mathbf{x})
\;=\; \mathbf{f}'(\mathbf{x})\, d\mathbf{x} + o(\|d\mathbf{x}\|) \quad \text{as } d\mathbf{x} \to \mathbf{0}.
\]
(The identification \(J_{ij} = \partial f_i/\partial x_j\) follows from evaluating the linear
approximation along each coordinate direction \(\mathbf{h} = t\,\mathbf{e}_j\), \(t \to 0\).)
Two notations for the derivative-as-operator. The Jacobian \(\mathbf{f}'(\mathbf{x})\)
is simultaneously (i) an \(m \times n\) matrix of partial derivatives and (ii) a linear map
\(\mathbb{R}^n \to \mathbb{R}^m\). Both views are standard, and the corresponding notations are:
\[
d\mathbf{f} \;=\; \mathbf{f}'(\mathbf{x})\, d\mathbf{x}
\qquad \text{(matrix-vector product)}
\]
\[
d\mathbf{f} \;=\; \mathbf{f}'(\mathbf{x})[d\mathbf{x}]
\qquad \text{(linear map applied to an argument)}
\]
For finite-dimensional calculations the two are interchangeable. The bracket form
\(\mathbf{f}'(\mathbf{x})[d\mathbf{x}]\) is the notation of choice once we leave finite dimensions:
for Fréchet derivatives on Banach spaces, for directional derivatives on
manifolds, and for the operator-valued maps \(\mathrm{Ad}_g[X]\), \(\mathrm{ad}_X[Y]\) that appear
in Lie theory and representation theory. This page uses the matrix-product form by default and the
bracket form where operator composition is cleaner (notably in the Chain Rule).
The \(i\)-th row of \(J\) records the partial derivatives of the \(i\)-th output component; the
\(j\)-th column records the effect of perturbing the \(j\)-th input coordinate. Writing out
\(d\mathbf{f} = J\, d\mathbf{x}\) component by component for \(n = m = 2\):
\[
\begin{align*}
d\mathbf{f}
&= \begin{bmatrix}
\partial f_1/\partial x_1 & \partial f_1/\partial x_2 \\
\partial f_2/\partial x_1 & \partial f_2/\partial x_2
\end{bmatrix}
\begin{bmatrix} dx_1 \\ dx_2 \end{bmatrix} \\\\
&= \begin{bmatrix}
(\partial f_1/\partial x_1)\,dx_1 + (\partial f_1/\partial x_2)\,dx_2 \\
(\partial f_2/\partial x_1)\,dx_1 + (\partial f_2/\partial x_2)\,dx_2
\end{bmatrix}.
\end{align*}
\]
The Jacobian acts as a linear operator that maps input displacements \(d\mathbf{x}\) to the
corresponding linear part of the output change \(d\mathbf{f}\).
Convention (shape). For \(\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m\) the Jacobian
\(\mathbf{f}'(\mathbf{x})\) is \(m \times n\). Two special cases recover earlier objects:
- \(m = 1\) (scalar output): \(\mathbf{f}'(\mathbf{x})\) is a \(1 \times n\) row vector,
equal to \((\nabla f)^T\) in the
gradient convention
established previously.
- \(n = 1\) (curve): \(\mathbf{f}'(\mathbf{x})\) is an \(m \times 1\) column vector,
the velocity vector of the curve.
As a first example, consider the affine map \(\mathbf{f}(\mathbf{x}) = A\mathbf{x}\) with
\(A \in \mathbb{R}^{m \times n}\) constant. Expanding the increment,
\[
\Delta \mathbf{f} \;=\; \mathbf{f}(\mathbf{x} + d\mathbf{x}) - \mathbf{f}(\mathbf{x})
\;=\; A(\mathbf{x} + d\mathbf{x}) - A\mathbf{x} \;=\; A\, d\mathbf{x},
\]
the linear part is exact and there is no \(o(\|d\mathbf{x}\|)\) remainder. Reading off the
matrix-product form identifies the Jacobian as \(\mathbf{f}'(\mathbf{x}) = A\) — the constant
matrix \(A\) itself is its own linearization, as expected for an affine map.
Local Transformation of Hypervolumes
When \(m = n\), the Jacobian describes how an infinitesimal hypercube in the
input space is mapped to a parallelotope in the output space.
The determinant of the
Jacobian, \(\det J\), is the local volume expansion factor. In generative
modeling — specifically, Normalizing Flows — this determinant is what
allows the change-of-variables formula to preserve total probability mass under complex
nonlinear transformations of a base distribution.
Chain Rule
In deep learning, a neural network
is a composition of differentiable layers. The chain rule turns differentiation of such a
composition into the composition of the individual differentials — and, because
differentials are linear maps, composition becomes matrix multiplication of Jacobians.
Theorem: Chain Rule
Let \(\mathbf{h} : \mathbb{R}^n \to \mathbb{R}^p\) be differentiable at \(\mathbf{x}\) and
\(\mathbf{g} : \mathbb{R}^p \to \mathbb{R}^m\) be differentiable at \(\mathbf{u} := \mathbf{h}(\mathbf{x})\).
Then the composition \(\mathbf{f} = \mathbf{g} \circ \mathbf{h}\) is differentiable at
\(\mathbf{x}\), and its differential is the composition of the individual differentials:
\[
\mathbf{f}'(\mathbf{x})[d\mathbf{x}]
\;=\; \mathbf{g}'(\mathbf{h}(\mathbf{x}))\bigl[\,\mathbf{h}'(\mathbf{x})[d\mathbf{x}]\,\bigr].
\]
Equivalently, at the level of Jacobian matrices,
\[
\mathbf{f}'(\mathbf{x}) \;=\; \mathbf{g}'(\mathbf{h}(\mathbf{x}))\,\mathbf{h}'(\mathbf{x})
\qquad \text{(an } m \times n \text{ matrix, from } m \times p \text{ times } p \times n\text{).}
\]
Proof.
Let \(J_{\mathbf{h}} := \mathbf{h}'(\mathbf{x})\) and \(J_{\mathbf{g}} := \mathbf{g}'(\mathbf{u})\).
We use throughout the fact that in finite dimensions every linear map is automatically
bounded: there exists \(C > 0\) with \(\|L\mathbf{v}\| \leq C\|\mathbf{v}\|\) for all
\(\mathbf{v}\) (this is the operator norm, and it is finite because the unit sphere is compact).
This is what lets us push \(o(\|\cdot\|)\) and \(O(\|\cdot\|)\) bounds through \(J_{\mathbf{h}}\)
and \(J_{\mathbf{g}}\) below.
By differentiability of \(\mathbf{h}\) at \(\mathbf{x}\),
\[
\mathbf{h}(\mathbf{x} + d\mathbf{x}) - \mathbf{h}(\mathbf{x})
\;=\; J_{\mathbf{h}}\, d\mathbf{x} + r_{\mathbf{h}}(d\mathbf{x}),
\qquad r_{\mathbf{h}}(d\mathbf{x}) = o(\|d\mathbf{x}\|).
\]
Let \(d\mathbf{u} := J_{\mathbf{h}}\, d\mathbf{x} + r_{\mathbf{h}}(d\mathbf{x})\); note that
\(\|d\mathbf{u}\| = O(\|d\mathbf{x}\|)\). By differentiability of \(\mathbf{g}\) at
\(\mathbf{u}\),
\[
\mathbf{g}(\mathbf{u} + d\mathbf{u}) - \mathbf{g}(\mathbf{u})
\;=\; J_{\mathbf{g}}\, d\mathbf{u} + r_{\mathbf{g}}(d\mathbf{u}),
\qquad r_{\mathbf{g}}(d\mathbf{u}) = o(\|d\mathbf{u}\|).
\]
Composing these and expanding,
\[
\mathbf{f}(\mathbf{x} + d\mathbf{x}) - \mathbf{f}(\mathbf{x})
\;=\; J_{\mathbf{g}} J_{\mathbf{h}}\, d\mathbf{x}
+ \underbrace{J_{\mathbf{g}}\, r_{\mathbf{h}}(d\mathbf{x})}_{o(\|d\mathbf{x}\|)}
+ \underbrace{r_{\mathbf{g}}(d\mathbf{u})}_{o(\|d\mathbf{u}\|) = o(\|d\mathbf{x}\|)}.
\]
The first error term is \(o(\|d\mathbf{x}\|)\) because
\(\|J_{\mathbf{g}}\,r_{\mathbf{h}}(d\mathbf{x})\| \leq \|J_{\mathbf{g}}\|_{\mathrm{op}}\,\|r_{\mathbf{h}}(d\mathbf{x})\|
= \|J_{\mathbf{g}}\|_{\mathrm{op}} \cdot o(\|d\mathbf{x}\|) = o(\|d\mathbf{x}\|)\).
The second requires combining \(r_{\mathbf{g}}(d\mathbf{u}) = o(\|d\mathbf{u}\|)\) with
\(\|d\mathbf{u}\| = O(\|d\mathbf{x}\|)\): since \(d\mathbf{x} \to 0\) forces
\(d\mathbf{u} \to 0\),
\[
\frac{\|r_{\mathbf{g}}(d\mathbf{u})\|}{\|d\mathbf{x}\|}
\;=\; \underbrace{\frac{\|r_{\mathbf{g}}(d\mathbf{u})\|}{\|d\mathbf{u}\|}}_{\to\, 0}
\cdot
\underbrace{\frac{\|d\mathbf{u}\|}{\|d\mathbf{x}\|}}_{\text{bounded}}
\;\longrightarrow\; 0,
\]
so \(r_{\mathbf{g}}(d\mathbf{u}) = o(\|d\mathbf{x}\|)\) as well (the case
\(d\mathbf{u} = \mathbf{0}\) gives \(r_{\mathbf{g}} = \mathbf{0}\) directly). Identifying
the linear part gives \(\mathbf{f}'(\mathbf{x}) = J_{\mathbf{g}} J_{\mathbf{h}}\). \(\blacksquare\)
Order matters: non-commutativity of matrix multiplication
The Jacobian of a composition is a product of Jacobians in a specific order:
outer-first, inner-last. The dimensional count confirms this — if
\(\mathbf{x} \in \mathbb{R}^n\), \(\mathbf{h}(\mathbf{x}) \in \mathbb{R}^p\),
\(\mathbf{g}(\mathbf{h}(\mathbf{x})) \in \mathbb{R}^m\), then only
\[
\underbrace{\mathbf{g}'(\mathbf{h}(\mathbf{x}))}_{m \times p}
\;\cdot\;
\underbrace{\mathbf{h}'(\mathbf{x})}_{p \times n}
\]
produces a well-defined \(m \times n\) matrix. Reversing the order would, in general, not even
be dimensionally compatible. This non-commutativity is the reason chain-rule applications in
machine learning must keep careful track of which direction the composition runs.
Backpropagation
We now specialize the chain rule to the situation that drives neural-network training: a
scalar loss computed by a deep composition. This specialization is what the
reverse-mode automatic differentiation
algorithm — commonly called backpropagation in
neural network training —
computes efficiently.
Consider three composed layers with compatible dimensions:
\[
\mathbf{x} \;\xrightarrow{\mathbf{f}_1}\; \mathbb{R}^{p}
\;\xrightarrow{\mathbf{f}_2}\; \mathbb{R}^{q}
\;\xrightarrow{f_3}\; \mathbb{R},
\]
with \(\mathbf{x} \in \mathbb{R}^n\) the parameters and \(f_3\) scalar-valued so that the final
output is the loss
\[
L(\mathbf{x}) \;=\; f_3\bigl(\mathbf{f}_2(\mathbf{f}_1(\mathbf{x}))\bigr).
\]
By the chain rule, the Jacobian of
\(L\) — a \(1 \times n\) row vector, since the output is scalar — factors as the product of the
layer Jacobians:
\[
L'(\mathbf{x}) \;=\; f_3'\,\mathbf{f}_2'\,\mathbf{f}_1'
\qquad \text{with shapes } (1 \times q)(q \times p)(p \times n).
\]
By the
gradient convention, the column
vector \(\nabla L\) is the transpose of this row: \(L'(\mathbf{x}) = (\nabla L)^T\).
Associativity and evaluation order
Matrix multiplication is associative but not commutative, so the value of the product
\(f_3'\,\mathbf{f}_2'\,\mathbf{f}_1'\) does not depend on the order of the parenthesization — but
the cost of computing it does. The two natural parenthesizations are:
- Reverse mode (backpropagation): left-to-right evaluation
\(\bigl((f_3' \, \mathbf{f}_2')\, \mathbf{f}_1'\bigr)\). In terms of the computation graph,
this corresponds to starting at the output layer and propagating gradients
backward through the network — hence the name.
- Forward mode: right-to-left evaluation
\(\bigl(f_3'\,(\mathbf{f}_2'\, \mathbf{f}_1')\bigr)\), propagating perturbations forward from
inputs toward outputs.
Tracking the shapes through reverse mode:
\[
\begin{align*}
\bigl(f_3'\,\mathbf{f}_2'\bigr)\,\mathbf{f}_1'
&= \bigl[(1 \times q)(q \times p)\bigr](p \times n) \\\\
&= (1 \times p)(p \times n) \\\\
&= (1 \times n) = (\nabla L)^T.
\end{align*}
\]
Every intermediate result is a row vector of length at most \(\max(p, q, n)\); no \(p \times n\)
or \(q \times n\) matrix is ever materialized. Each layer contributes a single
vector-Jacobian product (VJP) — the left-multiplication of a row vector
\(\mathbf{v}^T\) by a layer's Jacobian \(J\).
Forward mode, by contrast, accumulates full intermediate Jacobians:
\[
\begin{align*}
f_3'\,\bigl(\mathbf{f}_2'\,\mathbf{f}_1'\bigr)
&= (1 \times q)\bigl[(q \times p)(p \times n)\bigr] \\\\
&= (1 \times q)(q \times n) \\\\
&= (1 \times n) = (\nabla L)^T.
\end{align*}
\]
The intermediate matrix \(\mathbf{f}_2'\,\mathbf{f}_1'\) has \(q \times n\) entries, which is much
larger than any row vector in reverse mode when \(n\) is large (the typical parameter regime).
This is the fundamental reason deep networks are trained with reverse-mode autodiff rather than
forward-mode: for scalar loss and high-dimensional parameters, reverse mode is optimal.
The VJP and Memory
Materializing a full Jacobian for a layer with, say, \(10{,}000\) inputs and \(10{,}000\)
outputs would require \(10^8\) entries. To avoid this, modern frameworks such as
PyTorch and JAX implement each layer's differentiation as a
vector-Jacobian product
\[
\mathbf{v}^T J,
\]
where \(\mathbf{v}^T\) is the gradient flowing back from the downstream layers. This allows
the system to compute every gradient needed for training without ever materializing a full
layer Jacobian — which is what makes training of "deep" models physically feasible.