Jacobian
In the previous chapter, we defined the derivative as a linear operator
that maps a small input change \(dx\) to an output change \(df\). When we extend this to a vector-valued function \(f: \mathbb{R}^n \to \mathbb{R}^m\),
this operator is represented by the Jacobian matrix.
Consider \(f(x) \in \mathbb{R}^m\), where \(x \in \mathbb{R}^n\). In differential notation,
\[
df = f'(x)dx
\]
where \(df \in \mathbb{R}^m\) and \(dx \in \mathbb{R}^n\). Here, the linear operator \(f'(x)\) is
the \(m \times n\) Jacobian matrix with entries
\[
J_{ij} = \frac{\partial f_i}{\partial x_j}.
\]
Here, \(J_{ij}\) represents the rate of change of the \(i\)-th output \(f_i\) with respect to the
\(j\)-th input \(x_j\).
For example, let \(f(x) = \begin{bmatrix}f_1(x) \\ f_2(x) \end{bmatrix}\),
and \(x = \begin{bmatrix}x_1 \\ x_2 \end{bmatrix} \). Then
\[
\begin{align*}
df &= \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2}
\end{bmatrix}
\begin{bmatrix} dx_1 \\ dx_2 \end{bmatrix} \\\\
&= \begin{bmatrix} \frac{\partial f_1}{\partial x_1}dx_1 + \frac{\partial f_1}{\partial x_2}dx_2 \\
\frac{\partial f_2}{\partial x_1}dx_1 + \frac{\partial f_2}{\partial x_2}dx_2
\end{bmatrix}.
\end{align*}
\]
As you can see, the Jacobian matrix \(f'(x)\) acts as a linear operator that maps changes in \(x\)
encoded in \(dx\) to corresponding changes in \(f(x)\) encoded in \(df\).
Note: vector input & vector output \(\Longrightarrow\) first derivative: Jacobian matrix.
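As a quick numerical check, here is a minimal sketch (assuming PyTorch is available; the function \(f\) below is a made-up example, not one from the text) that builds the \(2 \times 2\) Jacobian with automatic differentiation and compares it with the hand-derived partial derivatives.

```python
import math
import torch
from torch.autograd.functional import jacobian

# Made-up example: f(x) = [x1^2 * x2, 5*x1 + sin(x2)]
def f(x):
    return torch.stack([x[0] ** 2 * x[1], 5 * x[0] + torch.sin(x[1])])

x1, x2 = 1.0, 2.0
x = torch.tensor([x1, x2])

# Autograd builds the full Jacobian J_ij = d f_i / d x_j
J = jacobian(f, x)

# Hand-derived partial derivatives for comparison
J_manual = torch.tensor([
    [2 * x1 * x2, x1 ** 2],        # row for f_1
    [5.0,         math.cos(x2)],   # row for f_2
])

print(torch.allclose(J, J_manual))  # True
```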
Instead of expressing the Jacobian component by component, it is often more convenient to use a
symbolic representation.
For example, consider a trivial case: the matrix equation \(f(x) = Ax\) where \(A \in \mathbb{R}^{m \times n}\)
is a constant matrix.
Then
\[
\begin{align*}
df &= f(x + dx) - f(x) \\\\
&= A(x + dx) - Ax \\\\
&= Adx \\\\
&= f'(x)dx
\end{align*}
\]
Thus \(f'(x) = A \) is the Jacobian matrix.
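This symbolic result is easy to confirm numerically. Below is a minimal sketch (again assuming PyTorch; the sizes \(m = 3\), \(n = 2\) are arbitrary choices for illustration):

```python
import torch
from torch.autograd.functional import jacobian

m, n = 3, 2                   # arbitrary sizes for illustration
A = torch.randn(m, n)         # constant matrix
x = torch.randn(n)

# Jacobian of the linear map f(x) = Ax
J = jacobian(lambda x: A @ x, x)

print(torch.allclose(J, A))   # True: f'(x) = A for every x
```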
Local Transformation of Hypervolumes
The Jacobian describes how an infinitesimal hypercube in the input space \(\mathbb{R}^n\) is
mapped to a parallelotope in the output space \(\mathbb{R}^m\).
The determinant of the Jacobian, \(\det(J)\) (defined when \(m = n\)), is the
local volume scaling factor. In Generative AI, specifically in Normalizing Flows,
this determinant is critical to the change of variables formula for probability densities, ensuring that
the total probability still integrates to one after a complex nonlinear transformation.
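Concretely, for an invertible, differentiable map \(y = f(x)\) with \(m = n\), the change of variables formula behind this reads
\[
p_Y(y) = p_X(x) \, \lvert \det J_f(x) \rvert^{-1}, \quad x = f^{-1}(y),
\]
so the density is rescaled by exactly the local volume factor \(\lvert \det J_f(x) \rvert\).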
Chain Rule
In Deep Learning, we treat a neural network as a composition of differentiable layers.
If \(x \xrightarrow{f} u \xrightarrow{g} y\), then the total derivative is the product of the Jacobian matrices:
\[
\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx} = J_g(u) J_f(x)
\]
Since matrix multiplication represents the composition of linear transformations, the Chain Rule essentially
states that the linearization of a composed function is the composition of the linearizations of its individual components.
Theorem: Chain Rule
Suppose \(f(x) = g(h(x))\) where both \(g\) and \(h\) are differentiable. Then
\[
\begin{align*}
df &= f'(x)[dx] \\\\
&= g'(h(x))[h'(x)[dx]]
\end{align*}
\]
Intuitively, if we think about a higher-dimensional case, it is clear that the order of the Jacobian factors in the chain rule cannot be swapped in general.
For example, consider
\[
x \in \mathbb{R}^n, \, h(x) \in \mathbb{R}^p, \, \text{and } g(h(x)) \in \mathbb{R}^m
\]
Then the derivative \(f'(x)\) must be an \(m \times n\) Jacobian matrix.
To obtain this \(m \times n\) matrix, we must take the product of the \(m \times p\) Jacobian matrix \(g'(h(x))\)
and the \(p \times n\) Jacobian matrix \(h'(x)\), in that order.
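The sketch below (PyTorch again, with made-up dimensions \(n = 3\), \(p = 4\), \(m = 2\) and two arbitrary smooth maps) checks numerically that the Jacobian of the composition equals the product of the individual Jacobians taken in exactly this order.

```python
import torch
from torch.autograd.functional import jacobian

n, p, m = 3, 4, 2                 # made-up dimensions
B = torch.randn(p, n)
C = torch.randn(m, p)

def h(x):                         # h: R^n -> R^p
    return torch.tanh(B @ x)

def g(u):                         # g: R^p -> R^m
    return C @ (u ** 2)

x = torch.randn(n)

J_f = jacobian(lambda x: g(h(x)), x)   # (m x n) Jacobian of the composition
J_g = jacobian(g, h(x))                # (m x p), evaluated at h(x)
J_h = jacobian(h, x)                   # (p x n)

print(torch.allclose(J_f, J_g @ J_h, atol=1e-6))  # True
```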
Backpropagation
Consider a neural network with a scalar loss function \(L: \mathbb{R}^n \to \mathbb{R}\). Here, the final Jacobian
is a row vector (\(1 \times n\)). The order of matrix multiplication in the Chain Rule determines the computational cost.
Backpropagation, or more generally reverse-mode automatic differentiation, is a widely used
algorithm in machine learning for training neural networks.
It computes gradients of the loss function with respect to the network's parameters (weights and biases) to minimize
the loss and improve model performance.
Backpropagation works by passing gradients backward layer by layer, maintaining the order
dictated by the Chain Rule. At each layer (or function), the gradient of the loss is computed with
respect to the output of the previous layer.
Consider a small neural network with three layers (functions): \(f_1, f_2, f_3\). This network represents
a composition of functions:
\[
L(x) = f_3(f_2(f_1(x)))
\]
where \(x \in \mathbb{R}^n\) represents the inputs (parameters) and \(L(x) \in \mathbb{R}^m\) is the
output, often called the loss.
Suppose the layer outputs satisfy \(f_1(x) \in \mathbb{R}^p\), \(f_2(f_1(x)) \in \mathbb{R}^q\), and \(f_3(f_2(f_1(x))) \in \mathbb{R}^m\).
To compute the derivative of this function, we apply the chain rule:
\[
\frac{dL}{dx} = f_3'f_2'f_1'
= (m \times q)(q \times p)(p \times n)
\]
This is the multiplication of Jacobian matrices for each layer.
In reverse mode, the algorithm evaluates this product from the left, starting with \(f_3'\) and working backward.
So, in the context of neural network structure, the backpropagation process appears to follow a "reverse" order,
starting from the output layer and moving backward through the network. However, from a mathematical perspective,
this corresponds to a standard left-to-right multiplication of Jacobian matrices.
This reverse order makes it computationally efficient for large-scale neural networks with many inputs (\(n\))
and a single output (\(m = 1\)).
\[
\begin{align*}
(f_3'f_2')f_1' &= [(1 \times q)(q \times p)](p \times n) \\\\
&= (1 \times p)(p \times n) \\\\
&= (1 \times n) \\\\
&= (\nabla L)^T
\end{align*}
\]
This efficiency comes from backpropagation requiring only vector-Jacobian products, which are
far cheaper than full matrix-matrix multiplications. At each step, only the product of a row
vector (the transpose of a local gradient) with a Jacobian matrix is calculated.
In contrast, forward-mode differentiation requires full matrix-matrix multiplications.
\[
\begin{align*}
f_3'(f_2'f_1') &= (1 \times q)[(q \times p)(p \times n)] \\\\
&= (1 \times q)(q \times n) \\\\
&= (1 \times n) \\\\
&= (\nabla L)^T
\end{align*}
\]
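A rough operation count makes the contrast concrete: multiplying an \(a \times b\) matrix by a \(b \times c\) matrix costs about \(a \cdot b \cdot c\) scalar multiplications. For some made-up sizes (\(n = 10{,}000\) inputs, \(p = q = 1{,}000\) intermediate dimensions, \(m = 1\) output), the sketch below compares the two association orders.

```python
def matmul_cost(a, b, c):
    # Approximate scalar-multiplication count for an (a x b) @ (b x c) product
    return a * b * c

n, p, q, m = 10_000, 1_000, 1_000, 1   # made-up layer sizes

# Reverse mode: ((1 x q)(q x p))(p x n) -- two vector-Jacobian products
reverse = matmul_cost(m, q, p) + matmul_cost(m, p, n)

# Forward mode: (1 x q)((q x p)(p x n)) -- one full matrix-matrix product first
forward = matmul_cost(q, p, n) + matmul_cost(m, q, n)

print(f"reverse mode: {reverse:,} multiplications")   # 11,000,000
print(f"forward mode: {forward:,} multiplications")   # 10,010,000,000
```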
The VJP (Vector-Jacobian Product) and Memory
Storing a full Jacobian matrix for a layer with 10,000 inputs and 10,000 outputs would require
100 million entries. To avoid this, modern frameworks like PyTorch only compute the
Vector-Jacobian Product (VJP):
\[
v^T J
\]
where \(v^T\) is the gradient flowing back from the output. This allows the system to compute the necessary
gradients without ever materializing the full Jacobian matrix in memory, making the training of "Deep" models
physically possible.
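As a minimal sketch of this idea (assuming PyTorch; the single layer below is arbitrary and its width is scaled down from the 10,000 in the text so it runs quickly), torch.autograd.grad accepts the incoming gradient \(v\) via its grad_outputs argument and returns \(v^T J\) directly, so the layer's full Jacobian is never materialized.

```python
import torch

n = 2_000                              # scaled-down layer width for a quick run
W = torch.randn(n, n) / n ** 0.5       # one wide linear layer
x = torch.randn(n, requires_grad=True)

y = torch.tanh(W @ x)                  # layer output, shape (n,)
v = torch.randn(n)                     # gradient flowing back from the layers above

# v^T J, computed without ever forming the n x n Jacobian of y w.r.t. x
vjp, = torch.autograd.grad(y, x, grad_outputs=v)

print(vjp.shape)                       # torch.Size([2000])
```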