On the previous page on neural networks, we introduced backpropagation as an algorithm for computing gradients
in multilayer perceptrons, exploiting the chain rule to propagate gradients
layer by layer. However, modern neural network architectures — with residual connections, attention mechanisms, and dynamic control
flow — do not fit neatly into the sequential layer abstraction. We need a more general framework.
Automatic differentiation (AD) is that framework. Rather than relying on a layer-by-layer structure, AD operates on
arbitrary computational graphs: directed acyclic graphs (DAGs) where each node represents an elementary operation
and edges encode data dependencies. Given such a graph, AD systematically applies the chain rule to compute derivatives that are
exact up to floating-point rounding: they are neither finite-difference approximations nor symbolic expressions that can explode in size.
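To illustrate why finite differences are only approximations, the following minimal sketch compares a one-sided difference quotient against a derivative known in closed form. The test function (sin), the evaluation point, and the step sizes are arbitrary choices made for illustration.

```python
import math

def forward_difference(f, x, h):
    """Approximate f'(x) with the one-sided difference (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

x = 1.0
exact = math.cos(x)  # derivative of sin, known in closed form

# Absolute error of the approximation for a coarse and a fine step size.
err_coarse = abs(forward_difference(math.sin, x, 1e-1) - exact)
err_fine = abs(forward_difference(math.sin, x, 1e-6) - exact)

print(f"h=1e-1: error {err_coarse:.2e}")
print(f"h=1e-6: error {err_fine:.2e}")
```

Shrinking the step reduces the truncation error, but the error never reaches rounding level, and for very small steps cancellation in the numerator typically makes it grow again. AD avoids this trade-off because it never takes a difference quotient.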
Definition: Automatic Differentiation
Let \(f : \mathbb{R}^n \to \mathbb{R}^m\) be a function decomposed into a sequence of elementary operations represented as
a computational graph. Automatic differentiation computes the Jacobian \(J_f \in \mathbb{R}^{m \times n}\) (or
products involving it) by applying the chain rule systematically through the graph. Two modes exist:
Forward-mode AD: propagates Jacobian-vector products \(J_f\, \mathbf{v}\) from inputs to outputs. Cost: one forward pass per input
direction. Efficient when \(n \ll m\).
Reverse-mode AD: propagates vector-Jacobian products \(\mathbf{u}^\top J_f\) from outputs to inputs. Cost: one backward pass per output
direction. Efficient when \(m \ll n\).
Since neural network training involves a scalar loss (\(m = 1\)) with millions of parameters (\(n \gg 1\)),
reverse-mode AD, which is backpropagation in this context, computes the full gradient with one forward pass followed by a single backward pass.
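For contrast with the reverse-mode case, here is a minimal sketch of forward-mode AD using dual numbers, where each value carries its directional derivative alongside it. The class `Dual`, the helpers `dsin` and `jvp`, and the test function `g` are hypothetical names introduced only for this illustration; only addition, multiplication, and sine are supported.

```python
import math
from dataclasses import dataclass

@dataclass
class Dual:
    """A value paired with its directional derivative (tangent)."""
    val: float
    tan: float

    def __add__(self, other):
        return Dual(self.val + other.val, self.tan + other.tan)

    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)

def dsin(u):
    # Chain rule: (sin u)' = cos(u) * u'
    return Dual(math.sin(u.val), math.cos(u.val) * u.tan)

def g(x1, x2):
    # Illustrative function g(x1, x2) = x1 * x2 + sin(x1)
    return x1 * x2 + dsin(x1)

def jvp(x1, x2, v1, v2):
    """One forward pass: returns g(x) and the Jacobian-vector product J_g v."""
    out = g(Dual(x1, v1), Dual(x2, v2))
    return out.val, out.tan

# One pass per input direction: two passes recover the full gradient.
_, dg_dx1 = jvp(2.0, 3.0, 1.0, 0.0)
_, dg_dx2 = jvp(2.0, 3.0, 0.0, 1.0)
print(dg_dx1, dg_dx2)
```

With two inputs this costs two passes; with millions of parameters it would cost millions, which is why the scalar-loss setting favors reverse mode.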
Analytic Example of Reverse-Mode AD
To make the process of automatic differentiation concrete, let's walk through an analytic example using a
composite scalar-valued function of two variables. We'll decompose the function into primitive operations,
represent it as a computational graph, and compute its gradients using reverse-mode automatic differentiation
(i.e., backpropagation).
Consider the following function:
\[
f(x_1, x_2) = \log \left((x_1 + x_2)^2 + \sin(x_1 x_2) \right).
\]
Note: For this function and its derivatives to be strictly defined in the real domain, we require the input
\((x_1, x_2)\) to satisfy \((x_1 + x_2)^2 + \sin(x_1 x_2) > 0\).
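One possible decomposition into primitive operations is sketched below. The intermediate labels \(v_1, \dots, v_6\) are our own choice for this illustration, and we write \(\bar{v}_i = \partial f / \partial v_i\) for the derivative of the output with respect to each intermediate. The forward pass evaluates the primitives in order:
\[
\begin{aligned}
v_1 &= x_1 + x_2, & v_2 &= v_1^2, & v_3 &= x_1 x_2, \\
v_4 &= \sin(v_3), & v_5 &= v_2 + v_4, & v_6 &= \log(v_5) = f.
\end{aligned}
\]
The backward pass starts from \(\bar{v}_6 = 1\) and visits the same nodes in reverse order, multiplying by one local derivative at each step:
\[
\begin{aligned}
\bar{v}_5 &= \bar{v}_6 \cdot \frac{1}{v_5}, \\
\bar{v}_2 &= \bar{v}_5, \qquad \bar{v}_4 = \bar{v}_5, \\
\bar{v}_3 &= \bar{v}_4 \cos(v_3), \\
\bar{v}_1 &= \bar{v}_2 \cdot 2 v_1, \\
\bar{x}_1 &= \bar{v}_1 + \bar{v}_3\, x_2, \qquad \bar{x}_2 = \bar{v}_1 + \bar{v}_3\, x_1.
\end{aligned}
\]
Each input receives two contributions because it feeds both \(v_1\) and \(v_3\). Substituting back gives
\[
\frac{\partial f}{\partial x_1} = \frac{2(x_1 + x_2) + x_2 \cos(x_1 x_2)}{(x_1 + x_2)^2 + \sin(x_1 x_2)},
\qquad
\frac{\partial f}{\partial x_2} = \frac{2(x_1 + x_2) + x_1 \cos(x_1 x_2)}{(x_1 + x_2)^2 + \sin(x_1 x_2)}.
\]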
The power of automatic differentiation lies in its systematic approach:
Decompose complex functions into simple primitive operations
Apply the chain rule mechanically through the computational graph
Sum gradients when variables contribute through multiple paths
This process can be fully automated, making it the backbone of modern deep learning frameworks.
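Note: In the AD literature, the accumulated gradient of the output with respect to any variable \(v\) in the graph (input or
intermediate) is called its adjoint and written \(\bar{v}\); for an input, \(\bar{x}_i = \frac{\partial f}{\partial x_i}\).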
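The three steps listed above can be sketched as a very small reverse-mode implementation in plain Python, applied to the example function \(f\). The class `Var` and the helpers `sin`, `log`, and `backward` are hypothetical names for this illustration; this is not how PyTorch or JAX are implemented internally, only a minimal instance of the same idea.

```python
import math

class Var:
    """A node in the computational graph: a value, its adjoint, and its parents."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0          # adjoint, accumulated during the backward pass
        self.parents = parents   # pairs (parent_node, local_partial_derivative)

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def sin(u):
    return Var(math.sin(u.value), ((u, math.cos(u.value)),))

def log(u):
    return Var(math.log(u.value), ((u, 1.0 / u.value),))

def backward(output):
    """Propagate adjoints from the output to all nodes in reverse topological order."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)  # parents are appended before their children

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            # Chain rule, summed over every path that reaches the parent.
            parent.grad += node.grad * local

def f(x1, x2):
    s = x1 + x2
    return log(s * s + sin(x1 * x2))

x1, x2 = Var(1.0), Var(2.0)
y = f(x1, x2)
backward(y)
print(y.value, x1.grad, x2.grad)
```

The `+=` in `backward` is where gradients from multiple paths are summed: both inputs feed the sum and the product, so each receives two contributions. The evaluation point (1, 2) satisfies the positivity condition stated earlier.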
Applications of AD
Automatic differentiation is the computational engine behind modern deep learning frameworks. Two frameworks are especially prominent in today's ML ecosystem, with complementary design philosophies that reflect different sides of the AD abstraction:
PyTorch — imperative, eager-execution framework with dynamic computational graphs and reverse-mode AD via autograd. The default choice for LLM training, NLP, and computer vision research; the framework used in our sample code below.
JAX — functional framework offering composable transformations (grad, vmap, jit) that map directly onto the mathematical structure of AD discussed above. Preferred for differentiable physics, scientific computing, and equivariant neural networks — domains we will revisit in our discussion of geometric deep learning.
These systems rely on automatic differentiation to:
Train neural networks by computing gradients of loss functions with respect to millions to billions of parameters (and beyond, in modern frontier models)
Optimize the parameters of differentiable programs in physics simulation, robotics, and finance
Perform end-to-end differentiation through control flow, dynamic loops, and even solver calls (e.g., differentiable physics)
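As a small illustration of differentiating through control flow and an iterative solver, the sketch below pushes a dual number through a Newton-type square-root iteration whose loop length depends on the data. It uses forward-mode dual numbers purely for brevity; the names `Dual` and `newton_sqrt`, the tolerance, and the initial guess are assumptions made for this example, not part of any framework's API.

```python
from dataclasses import dataclass

@dataclass
class Dual:
    """A value paired with its derivative with respect to one chosen input."""
    val: float
    tan: float

    def __add__(self, other):
        return Dual(self.val + other.val, self.tan + other.tan)

    def __truediv__(self, other):
        # Quotient rule: (u/v)' = (u'v - uv') / v^2
        return Dual(self.val / other.val,
                    (self.tan * other.val - self.val * other.tan) / other.val ** 2)

def newton_sqrt(a, tol=1e-12, max_iter=100):
    """Heron's iteration x <- (x + a/x) / 2 with a data-dependent stopping rule."""
    x = Dual(max(a.val, 1.0), 0.0)  # initial guess, treated as a constant
    two = Dual(2.0, 0.0)
    for _ in range(max_iter):
        x_new = (x + a / x) / two
        converged = abs(x_new.val - x.val) < tol
        x = x_new
        if converged:
            break  # the number of iterations depends on the input value
    return x

# Seed tangent 1.0: differentiate the solver output with respect to a.
root = newton_sqrt(Dual(9.0, 1.0))
print(root.val, root.tan)
```

No derivative of the solver was written by hand: the derivative of the square root, \(1 / (2\sqrt{a})\), emerges from applying the sum and quotient rules to whatever operations the loop actually executed.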
The same structural pattern, attaching information to each primitive operation and letting composition
rules carry it through automatically, recurs outside calculus. The Jacobian is one instance of
"structured side-information" that propagates by a composition law; the formal methods
page develops another instance, where the side-information attached to each program step is a proof of
correctness and the composition law is logical inference.
With automatic differentiation providing the computational engine for gradient-based learning, we have covered the core tools of
modern ML: from regularized regression through
classification and neural networks.
On the upcoming page on support vector machines, we return to classification from a geometric perspective,
replacing probabilistic modeling with direct margin maximization — connecting to the duality theory
and constrained optimization developed in our optimization pages.