In Part 4, we introduced backpropagation as an algorithm for computing gradients
in multilayer perceptrons, exploiting the chain rule to propagate gradients
layer by layer. However, modern neural network architectures - with residual connections, attention mechanisms, and dynamic control
flow - do not fit neatly into the sequential layer abstraction. We need a more general framework.
Automatic differentiation (AD) is that framework. Rather than relying on a layer-by-layer structure, AD operates on
arbitrary computational graphs: directed acyclic graphs (DAGs) where each node represents an elementary operation
and edges encode data dependencies. Given such a graph, AD systematically applies the chain rule to compute exact derivatives
- not finite-difference approximations, and not symbolic expressions that can explode in size.
Automatic Differentiation:
Let \(f : \mathbb{R}^n \to \mathbb{R}^m\) be a function decomposed into a sequence of elementary operations represented as
a computational graph. Automatic differentiation computes the Jacobian \(J_f \in \mathbb{R}^{m \times n}\) (or
products involving it) by applying the chain rule systematically through the graph. Two modes exist:
Forward-mode AD:
propagates Jacobian-vector products \(J_f \cdot v\) from inputs to outputs. Cost: one forward pass per input
direction. Efficient when \(n \ll m\).
Reverse-mode AD:
propagates vector-Jacobian products \(u^\top J_f\) from outputs to inputs. Cost: one backward pass per output
direction. Efficient when \(m \ll n\).
Since neural network training involves a scalar loss (\(m = 1\)) with millions of parameters (\(n \gg 1\)),
reverse-mode AD - which is backpropagation in this context - computes the full gradient in a single backward pass.
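To make forward-mode concrete before the reverse-mode walkthrough below, here is a minimal sketch using dual numbers: each value carries a `dot` component that propagates the directional derivative alongside the primal value. The class `Dual`, the helper `d_sin`, and the toy function `g` are our own illustrative names, not any framework's API.

```python
import math

class Dual:
    """Dual number a + b*eps with eps^2 = 0: val carries the value,
    dot carries the directional derivative along the seeded input direction."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        # sum rule
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # product rule
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

def d_sin(x):
    # chain rule for sin: d/dt sin(x(t)) = cos(x) * x'(t)
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def g(x1, x2):
    # toy function g(x1, x2) = x1*x2 + sin(x1)
    return x1 * x2 + d_sin(x1)

# One forward pass per input direction: seed dot = 1 on the chosen input.
dg_dx1 = g(Dual(1.5, 1.0), Dual(0.5, 0.0)).dot  # analytically: x2 + cos(x1)
dg_dx2 = g(Dual(1.5, 0.0), Dual(0.5, 1.0)).dot  # analytically: x1
```

Note the cost asymmetry stated above: with two inputs and one output, forward mode needs two passes to recover the full gradient, whereas reverse mode would need only one.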
Analytic Example of Reverse-Mode AD
To make the process of automatic differentiation concrete, let's walk through an analytic example using a
composite scalar-valued function of two variables. We'll decompose the function into primitive operations,
represent it as a computational graph, and compute its gradients using reverse-mode automatic differentiation
(i.e., backpropagation).
Consider the following function:
\[
f(x_1, x_2) = \log \left((x_1 + x_2)^2 + \sin(x_1 x_2) \right).
\]
Note: For this function and its derivatives to be strictly defined in the real domain, we require the input
\((x_1, x_2)\) to satisfy \((x_1 + x_2)^2 + \sin(x_1 x_2) > 0\).
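Concretely, one possible decomposition into primitives (the intermediate labels \(v_i\) are our own) is
\[
v_1 = x_1 + x_2, \qquad v_2 = v_1^2, \qquad v_3 = x_1 x_2, \qquad v_4 = \sin(v_3), \qquad v_5 = v_2 + v_4, \qquad f = \log(v_5).
\]
The reverse pass seeds \(\bar{f} = 1\) and propagates adjoints \(\bar{v}_i = \partial f / \partial v_i\) backward through the graph:
\[
\bar{v}_5 = \frac{1}{v_5}, \qquad \bar{v}_2 = \bar{v}_5, \qquad \bar{v}_4 = \bar{v}_5, \qquad \bar{v}_1 = 2 v_1 \bar{v}_2, \qquad \bar{v}_3 = \cos(v_3)\,\bar{v}_4.
\]
Since \(x_1\) feeds both \(v_1\) and \(v_3\), its adjoint sums over the two paths (and likewise for \(x_2\)):
\[
\bar{x}_1 = \bar{v}_1 + x_2 \bar{v}_3 = \frac{2(x_1 + x_2) + x_2 \cos(x_1 x_2)}{(x_1 + x_2)^2 + \sin(x_1 x_2)}, \qquad
\bar{x}_2 = \bar{v}_1 + x_1 \bar{v}_3 = \frac{2(x_1 + x_2) + x_1 \cos(x_1 x_2)}{(x_1 + x_2)^2 + \sin(x_1 x_2)}.
\]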
The power of automatic differentiation lies in its systematic approach:
Decompose complex functions into simple primitive operations
Apply the chain rule mechanically through the computational graph
Sum gradients when variables contribute through multiple paths
This process can be fully automated, making it the backbone of modern deep learning frameworks.
Note: In the AD literature, the accumulated gradient \(\frac{\partial f}{\partial v_i}\) of the output with respect to a node \(v_i\) is called the adjoint and written \(\bar{v}_i\); in particular, \(\bar{x}_i = \frac{\partial f}{\partial x_i}\) for the inputs.
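The full automation can be sketched with a small tape-based engine. Below is a minimal scalar reverse-mode implementation in Python, applied to the example function \(f\); the class `Var` and the helper names `v_sin`, `v_log`, and `backward` are our own illustrative choices, not any framework's API.

```python
import math

class Var:
    """Scalar node in a computational graph; .grad holds the adjoint after backward()."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # list of (parent_node, local_derivative) pairs
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        # product rule: local derivatives are the other factor's value
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def v_sin(x):
    return Var(math.sin(x.value), [(x, math.cos(x.value))])

def v_log(x):
    return Var(math.log(x.value), [(x, 1.0 / x.value)])

def backward(out):
    # Topologically order the graph, then accumulate adjoints in reverse.
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)
    visit(out)
    out.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local  # chain rule; paths are summed

# The example function f(x1, x2) = log((x1 + x2)^2 + sin(x1 * x2)), at (1, 2)
x1, x2 = Var(1.0), Var(2.0)
s = x1 + x2
y = v_log(s * s + v_sin(x1 * x2))
backward(y)  # x1.grad and x2.grad now hold the full gradient
```

A single call to `backward` fills in both partial derivatives, illustrating the one-backward-pass cost of reverse mode; the `+=` in the accumulation loop is exactly the "sum gradients over multiple paths" rule, since `x1` and `x2` each feed two primitives.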
Applications of AD
Automatic differentiation is a core component of modern computational systems that require efficient, accurate derivatives. In particular, it powers nearly every major deep learning framework:
PyTorch - dynamic computational graphs with reverse-mode AD via autograd
TensorFlow - supports both eager and static (graph) modes of AD with tf.GradientTape
JAX - composable transformations like grad, vmap, jit based on function tracing and XLA compilation