Automatic Differentiation

On the previous page on neural networks, we introduced backpropagation as an algorithm for computing gradients in multilayer perceptrons, exploiting the chain rule to propagate gradients layer by layer. However, modern neural network architectures — with residual connections, attention mechanisms, and dynamic control flow — do not fit neatly into the sequential layer abstraction. We need a more general framework.

Automatic differentiation (AD) is that framework. Rather than relying on a layer-by-layer structure, AD operates on arbitrary computational graphs: directed acyclic graphs (DAGs) where each node represents an elementary operation and edges encode data dependencies. Given such a graph, AD systematically applies the chain rule to compute derivatives that are exact up to floating-point rounding. They are neither finite-difference approximations, which carry truncation error, nor symbolic expressions, which can explode in size.

Definition: Automatic Differentiation

Let \(f : \mathbb{R}^n \to \mathbb{R}^m\) be a function decomposed into a sequence of elementary operations represented as a computational graph. Automatic differentiation computes the Jacobian \(J_f \in \mathbb{R}^{m \times n}\) (or products involving it) by applying the chain rule systematically through the graph. Two modes exist:

Forward-mode AD:
propagates Jacobian-vector products \(J_f\, \mathbf{v}\) from inputs to outputs. Cost: one forward pass per input direction. Efficient when \(n \ll m\).
Reverse-mode AD:
propagates vector-Jacobian products \(\mathbf{u}^\top J_f\) from outputs to inputs. Cost: one backward pass per output direction. Efficient when \(m \ll n\).

Since neural network training involves a scalar loss (\(m = 1\)) with millions of parameters (\(n \gg 1\)), reverse-mode AD — which is backpropagation in this context — computes the full gradient in a single backward pass.
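
To make the cost asymmetry concrete, here is a minimal sketch of forward-mode AD using dual numbers. The names `Dual`, `dsin`, `g`, and `jvp` are illustrative and do not come from any library. Each forward pass carries a single tangent direction and returns one Jacobian-vector product, so recovering the full gradient of a function with \(n\) inputs takes \(n\) passes.

```python
import math

class Dual:
    """Hypothetical dual number: a value paired with a tangent (directional derivative)."""
    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Sum rule: tangents add
        return Dual(self.value + other.value, self.tangent + other.tangent)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)
    __rmul__ = __mul__

def dsin(x):
    # Chain rule for sine: (sin u)' = cos(u) * u'
    return Dual(math.sin(x.value), math.cos(x.value) * x.tangent)

def g(x, y):
    # Illustrative test function g(x, y) = x*y + sin(x)
    return x * y + dsin(x)

def jvp(func, point, direction):
    """One forward pass: returns func(point) and the derivative along `direction`."""
    inputs = [Dual(p, d) for p, d in zip(point, direction)]
    out = func(*inputs)
    return out.value, out.tangent

point = (2.0, 3.0)
_, dg_dx = jvp(g, point, (1.0, 0.0))  # pass 1: direction e_1
_, dg_dy = jvp(g, point, (0.0, 1.0))  # pass 2: direction e_2
print(dg_dx, dg_dy)
```

With two inputs this costs two passes; with millions of parameters it would cost millions, which is why reverse mode is the natural choice for scalar losses.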

Analytic Example of Reverse-Mode AD

To make the process of automatic differentiation concrete, let's walk through an analytic example using a composite scalar-valued function of two variables. We'll decompose the function into primitive operations, represent it as a computational graph, and compute its gradients using reverse-mode automatic differentiation (i.e., backpropagation).

Consider the following function \[ f(x_1, x_2) = \log \left((x_1 + x_2)^2 + \sin(x_1 x_2) \right). \] Note: For this function and its derivatives to be strictly defined in the real domain, we require the input \((x_1, x_2)\) to satisfy \((x_1 + x_2)^2 + \sin(x_1 x_2) > 0\).

We decompose this into primitive operations: \[ \begin{align*} &x_3 = x_1 + x_2 \\\\ &x_4 = x_3^2 \\\\ &x_5 = x_1 x_2 \\\\ &x_6 = \sin(x_5) \\\\ &x_7 = x_4 + x_6 \\\\ &x_8 = \log(x_7) = f \\\\ \end{align*} \]

The resulting computational graph is a directed acyclic graph (DAG). Notice how:

Each input variable (x₁ and x₂) has multiple outgoing edges, contributing to different intermediate computations
Data flows from the inputs x₁ and x₂ through the intermediate nodes x₃, ..., x₇ to the single output x₈ = f
During backpropagation, gradients flow in the reverse direction (from f back to x₁ and x₂)
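
The structure described above can be sketched as a mapping from each node to the nodes it reads from. The dictionary `parents` and the function `topological_order` are illustrative names, not part of any framework. A topological order gives a valid schedule for the forward pass; reversing it gives a valid schedule for the backward pass, since every consumer of a node is then processed before the node itself.

```python
# Each node lists the nodes it takes as direct inputs (edges point input -> consumer).
parents = {
    "x1": [], "x2": [],
    "x3": ["x1", "x2"],   # x3 = x1 + x2
    "x4": ["x3"],         # x4 = x3^2
    "x5": ["x1", "x2"],   # x5 = x1 * x2
    "x6": ["x5"],         # x6 = sin(x5)
    "x7": ["x4", "x6"],   # x7 = x4 + x6
    "x8": ["x7"],         # x8 = log(x7) = f
}

def topological_order(parents):
    """Depth-first post-order: every node appears after all of its inputs."""
    order, visited = [], set()

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for p in parents[node]:
            visit(p)
        order.append(node)

    for node in parents:
        visit(node)
    return order

forward_order = topological_order(parents)
backward_order = list(reversed(forward_order))

# Count outgoing edges to see which nodes feed more than one consumer.
fan_out = {name: 0 for name in parents}
for deps in parents.values():
    for dep in deps:
        fan_out[dep] += 1

print("forward: ", forward_order)
print("backward:", backward_order)
print("fan-out of inputs:", fan_out["x1"], fan_out["x2"])
```

Nodes with a fan-out greater than one are exactly those whose gradient is a sum over several paths.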

Starting from the output and working backwards: \[ \begin{align*} \frac{\partial f}{\partial x_8} &= 1 \\\\ \frac{\partial f}{\partial x_7} &= \frac{\partial f}{\partial x_8} \cdot \frac{\partial x_8}{\partial x_7} \\\\ &= 1 \cdot \frac{1}{x_7} = \frac{1}{x_7} \\\\ \frac{\partial f}{\partial x_4} &= \frac{\partial f}{\partial x_7} \cdot \frac{\partial x_7}{\partial x_4} \\\\ &= \frac{1}{x_7} \cdot 1 = \frac{1}{x_7} \\\\ \frac{\partial f}{\partial x_6} &= \frac{\partial f}{\partial x_7} \cdot \frac{\partial x_7}{\partial x_6} \\\\ &= \frac{1}{x_7} \cdot 1 = \frac{1}{x_7} \\\\ \frac{\partial f}{\partial x_3} &= \frac{\partial f}{\partial x_4} \cdot \frac{\partial x_4}{\partial x_3} \\\\ &= \frac{1}{x_7} \cdot 2 x_3\\\\ \frac{\partial f}{\partial x_5} &= \frac{\partial f}{\partial x_6} \cdot \frac{\partial x_6}{\partial x_5} \\\\ &= \frac{1}{x_7} \cdot \cos(x_5)\\\\ \end{align*} \]

Notice that the input variables \(x_1\) and \(x_2\) each contribute to multiple intermediate nodes:

\(x_1\) influences both \(x_3\) (via addition) and \(x_5\) (via multiplication)
\(x_2\) influences both \(x_3\) (via addition) and \(x_5\) (via multiplication)

Because each input reaches the output along more than one path, its gradient is the sum of the contributions from all of those paths:

For \(\frac{\partial f}{\partial x_1}\): \[ \begin{align*} \frac{\partial f}{\partial x_1} &= \frac{\partial f}{\partial x_3} \cdot \frac{\partial x_3}{\partial x_1} + \frac{\partial f}{\partial x_5} \cdot \frac{\partial x_5}{\partial x_1} \\\\ &= \frac{2x_3}{x_7} \cdot 1 + \frac{\cos(x_5)}{x_7} \cdot x_2 \\\\ &= \frac{1}{x_7} \left[2x_3 + x_2 \cos(x_5)\right] \end{align*} \]

For \(\frac{\partial f}{\partial x_2}\): \[ \begin{align*} \frac{\partial f}{\partial x_2} &= \frac{\partial f}{\partial x_3} \cdot \frac{\partial x_3}{\partial x_2} + \frac{\partial f}{\partial x_5} \cdot \frac{\partial x_5}{\partial x_2} \\\\ &= \frac{2x_3}{x_7} \cdot 1 + \frac{\cos(x_5)}{x_7} \cdot x_1 \\\\ &= \frac{1}{x_7} \left[2x_3 + x_1 \cos(x_5)\right] \end{align*} \]

Replacing the intermediate variables with their expressions in terms of \(x_1\) and \(x_2\):

\(x_3 = x_1 + x_2\)
\(x_5 = x_1 x_2\)
\(x_7 = (x_1 + x_2)^2 + \sin(x_1 x_2)\)

Finally, we get the derivatives: \[ \boxed{ \begin{align*} \frac{\partial f}{\partial x_1} &= \frac{2(x_1 + x_2) + x_2 \cos(x_1 x_2)}{(x_1 + x_2)^2 + \sin(x_1 x_2)} \\\\ \frac{\partial f}{\partial x_2} &= \frac{2(x_1 + x_2) + x_1 \cos(x_1 x_2)}{(x_1 + x_2)^2 + \sin(x_1 x_2)} \end{align*} } \]

The power of automatic differentiation lies in its systematic approach:

Decompose complex functions into simple primitive operations
Apply the chain rule mechanically through the computational graph
Sum gradients when variables contribute through multiple paths

This process can be fully automated, making it the backbone of modern deep learning frameworks.
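
As a rough illustration of what that automation can look like, the sketch below records the graph through operator overloading and then runs a generic backward pass. The class `Var` and the helpers `square`, `sin`, `log`, and `backward` are hypothetical names chosen for this example; real frameworks are considerably more elaborate, and this version handles scalars only.

```python
import math

class Var:
    """Hypothetical graph node: a value, an accumulated gradient, and links to its inputs."""
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents  # tuple of (input node, local partial derivative)

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def square(x):
    return Var(x.value ** 2, ((x, 2.0 * x.value),))

def sin(x):
    return Var(math.sin(x.value), ((x, math.cos(x.value)),))

def log(x):
    return Var(math.log(x.value), ((x, 1.0 / x.value),))

def backward(output):
    """Reverse-mode sweep: visit nodes in reverse topological order."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0  # seed: df/df = 1
    for node in reversed(order):
        for parent, local in node.parents:
            # Chain rule, summed over every path that reaches `parent`
            parent.grad += node.grad * local

x1, x2 = Var(1.0), Var(0.5)
f = log(square(x1 + x2) + sin(x1 * x2))
backward(f)
print(f.value, x1.grad, x2.grad)
```

The `+=` in the backward sweep is where the three steps meet: each primitive contributes a local derivative, the chain rule multiplies it by the incoming gradient, and contributions from multiple paths accumulate on the shared input.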

Note: In AD literature, the accumulated gradient \(\frac{\partial f}{\partial x_i}\) is often denoted as the adjoint \(\bar{x}_i\).
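
In this notation, the backward pass worked out above can be restated compactly. Seeding the output with \(\bar{x}_8 = 1\), every other adjoint follows from \[ \bar{x}_i = \sum_{j \,:\, x_i \to x_j} \bar{x}_j \, \frac{\partial x_j}{\partial x_i}, \] where the sum runs over all nodes \(x_j\) that take \(x_i\) as a direct input. For the example, this gives \[ \bar{x}_1 = \bar{x}_3 \cdot 1 + \bar{x}_5 \cdot x_2, \qquad \bar{x}_2 = \bar{x}_3 \cdot 1 + \bar{x}_5 \cdot x_1, \] which matches the two-path sums computed earlier.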

Applications of AD

Automatic differentiation is the computational engine behind modern deep learning frameworks. Two frameworks dominate today's ML ecosystem, with complementary design philosophies that reflect different sides of the AD abstraction:

PyTorch — imperative, eager-execution framework with dynamic computational graphs and reverse-mode AD via autograd. The default choice for LLM training, NLP, and computer vision research; the framework used in our sample code below.
JAX — functional framework offering composable transformations (grad, vmap, jit) that map directly onto the mathematical structure of AD discussed above. Preferred for differentiable physics, scientific computing, and equivariant neural networks — domains we will revisit in our discussion of geometric deep learning.

These systems rely on automatic differentiation to:

Train neural networks by computing gradients of loss functions with respect to millions to billions of parameters (and beyond, in modern frontier models)
Optimize differentiable objectives in physics simulation, robotics, and finance, where the objective is itself a program whose operations AD can trace
Perform end-to-end differentiation through control flow, dynamic loops, and even solver calls (e.g., differentiable physics)
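
To illustrate the last point, the sketch below differentiates through a data-dependent loop by propagating a derivative alongside the value by hand, in forward mode. The function name `newton_sqrt_with_derivative` and its parameters are assumptions made for this example; a framework would perform the same bookkeeping automatically for whichever iterations actually execute.

```python
def newton_sqrt_with_derivative(a, tol=1e-12, max_iter=100):
    """Approximate sqrt(a) by Newton's iteration, carrying d(x)/d(a) through the loop.

    Hypothetical helper: returns (x, dx_da).
    """
    x, dx = a, 1.0  # initial guess x0 = a, hence dx0/da = 1
    for _ in range(max_iter):
        # Update rule: x_new = 0.5 * (x + a / x)
        x_new = 0.5 * (x + a / x)
        # Differentiate the update with respect to a (quotient rule on a / x):
        # d(a / x)/da = (x - a * dx) / x^2
        dx_new = 0.5 * (dx + (x - a * dx) / (x * x))
        converged = abs(x_new - x) < tol
        x, dx = x_new, dx_new
        if converged:  # the number of iterations depends on the input
            break
    return x, dx

root, droot_da = newton_sqrt_with_derivative(9.0)
print(root, droot_da)
```

The derivative is taken along the path the program actually executed, so the iteration count never needs to be known in advance. For comparison, the exact derivative of \(\sqrt{a}\) is \(1/(2\sqrt{a})\).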

The same structural pattern recurs outside calculus: attach information to each primitive operation and let composition rules carry it through automatically. The Jacobian is one instance of such "structured side-information" propagated by a composition law; the formal methods page develops another, where the side-information attached to each program step is a proof of correctness and the composition law is logical inference.

Sample Code

import numpy as np

class AutoDiffNode:
    """Node in the computational graph for automatic differentiation"""
    def __init__(self, value, grad=0.0):
        self.value = value
        self.grad = grad
        self.children = []  # Nodes that depend on this node
        self.local_gradients = []  # Local gradients to children

def manual_autodiff_example(x1_val, x2_val):
    """
    Manual implementation of automatic differentiation for:
    f(x1, x2) = log((x1 + x2)^2 + sin(x1 * x2))

    This demonstrates the forward and backward pass explicitly.
    """
    print(f"Computing f({x1_val}, {x2_val}) = log((x1 + x2)² + sin(x1 * x2))")
    print("="*60)

    # Forward Pass - Compute function value
    print("FORWARD PASS:")
    x1 = x1_val
    x2 = x2_val
    print(f"x1 = {x1}")
    print(f"x2 = {x2}")

    x3 = x1 + x2
    print(f"x3 = x1 + x2 = {x3}")

    x4 = x3**2
    print(f"x4 = x3² = {x4}")

    x5 = x1 * x2
    print(f"x5 = x1 * x2 = {x5}")

    x6 = np.sin(x5)
    print(f"x6 = sin(x5) = {x6}")

    x7 = x4 + x6
    print(f"x7 = x4 + x6 = {x7}")

    x8 = np.log(x7)
    f = x8
    print(f"x8 = log(x7) = {f}")
    print(f"\nFunction value: f = {f}")

    # Backward Pass - Compute gradients
    print("\n" + "="*60)
    print("BACKWARD PASS:")

    # Initialize gradient
    df_dx8 = 1.0
    print(f"∂f/∂x8 = {df_dx8}")

    # x8 = log(x7)
    df_dx7 = df_dx8 * (1.0 / x7)
    print(f"∂f/∂x7 = ∂f/∂x8 * ∂x8/∂x7 = {df_dx8} * (1/{x7}) = {df_dx7}")

    # x7 = x4 + x6
    df_dx4 = df_dx7 * 1.0
    df_dx6 = df_dx7 * 1.0
    print(f"∂f/∂x4 = ∂f/∂x7 * ∂x7/∂x4 = {df_dx7} * 1 = {df_dx4}")
    print(f"∂f/∂x6 = ∂f/∂x7 * ∂x7/∂x6 = {df_dx7} * 1 = {df_dx6}")

    # x4 = x3²
    df_dx3 = df_dx4 * (2 * x3)
    print(f"∂f/∂x3 = ∂f/∂x4 * ∂x4/∂x3 = {df_dx4} * 2*{x3} = {df_dx3}")

    # x6 = sin(x5)
    df_dx5 = df_dx6 * np.cos(x5)
    print(f"∂f/∂x5 = ∂f/∂x6 * ∂x6/∂x5 = {df_dx6} * cos({x5}) = {df_dx5}")

    # Now accumulate gradients for x1 and x2
    # x3 = x1 + x2
    df_dx1_from_x3 = df_dx3 * 1.0
    df_dx2_from_x3 = df_dx3 * 1.0

    # x5 = x1 * x2
    df_dx1_from_x5 = df_dx5 * x2
    df_dx2_from_x5 = df_dx5 * x1

    # Sum gradients from all paths
    df_dx1 = df_dx1_from_x3 + df_dx1_from_x5
    df_dx2 = df_dx2_from_x3 + df_dx2_from_x5

    print(f"\n∂f/∂x1 = ∂f/∂x3 * ∂x3/∂x1 + ∂f/∂x5 * ∂x5/∂x1")
    print(f"       = {df_dx3} * 1 + {df_dx5} * {x2}")
    print(f"       = {df_dx1_from_x3} + {df_dx1_from_x5}")
    print(f"       = {df_dx1}")

    print(f"\n∂f/∂x2 = ∂f/∂x3 * ∂x3/∂x2 + ∂f/∂x5 * ∂x5/∂x2")
    print(f"       = {df_dx3} * 1 + {df_dx5} * {x1}")
    print(f"       = {df_dx2_from_x3} + {df_dx2_from_x5}")
    print(f"       = {df_dx2}")

    # Verify with the closed form
    print("\n" + "="*60)
    print("VERIFICATION WITH CLOSED FORM:")
    expected_df_dx1 = (2*(x1 + x2) + x2*np.cos(x1*x2)) / ((x1 + x2)**2 + np.sin(x1*x2))
    expected_df_dx2 = (2*(x1 + x2) + x1*np.cos(x1*x2)) / ((x1 + x2)**2 + np.sin(x1*x2))

    print(f"Expected ∂f/∂x1 = {expected_df_dx1}")
    print(f"Expected ∂f/∂x2 = {expected_df_dx2}")
    print(f"Error in ∂f/∂x1: {abs(df_dx1 - expected_df_dx1)}")
    print(f"Error in ∂f/∂x2: {abs(df_dx2 - expected_df_dx2)}")

    return f, df_dx1, df_dx2


def pytorch_autodiff_example(x1_val, x2_val):
    """
    PyTorch implementation showing how modern autodiff frameworks handle this
    """
    import torch

    print("\n" + "="*60)
    print("PYTORCH AUTOMATIC DIFFERENTIATION:")

    # Create tensors with gradient tracking
    x1 = torch.tensor(x1_val, requires_grad=True, dtype=torch.float32)
    x2 = torch.tensor(x2_val, requires_grad=True, dtype=torch.float32)

    # Define the function
    f = torch.log((x1 + x2)**2 + torch.sin(x1 * x2))

    # Compute gradients
    f.backward()

    print(f"f({x1_val}, {x2_val}) = {f.item()}")
    print(f"∂f/∂x1 = {x1.grad.item()}")
    print(f"∂f/∂x2 = {x2.grad.item()}")

    return f.item(), x1.grad.item(), x2.grad.item()


def gradient_check(x1, x2, epsilon=1e-7):
    """
    Numerical gradient checking using finite differences
    """
    # Compute analytical gradients
    f_val, df_dx1, df_dx2 = manual_autodiff_example(x1, x2)

    print("\n" + "="*60)
    print("NUMERICAL GRADIENT CHECK:")

    # Numerical gradient for x1
    def eval_f(x1_val, x2_val):
        return np.log((x1_val + x2_val)**2 + np.sin(x1_val * x2_val))

    f_plus_x1 = eval_f(x1 + epsilon, x2)
    f_minus_x1 = eval_f(x1 - epsilon, x2)
    numerical_df_dx1 = (f_plus_x1 - f_minus_x1) / (2 * epsilon)

    # Numerical gradient for x2
    f_plus_x2 = eval_f(x1, x2 + epsilon)
    f_minus_x2 = eval_f(x1, x2 - epsilon)
    numerical_df_dx2 = (f_plus_x2 - f_minus_x2) / (2 * epsilon)

    print(f"Analytical ∂f/∂x1: {df_dx1}")
    print(f"Numerical  ∂f/∂x1: {numerical_df_dx1}")
    print(f"Difference: {abs(df_dx1 - numerical_df_dx1)}")

    print(f"\nAnalytical ∂f/∂x2: {df_dx2}")
    print(f"Numerical  ∂f/∂x2: {numerical_df_dx2}")
    print(f"Difference: {abs(df_dx2 - numerical_df_dx2)}")


if __name__ == "__main__":
    # Test with specific values
    x1 = 1.0
    x2 = 0.5

    # Manual implementation
    f_manual, grad_x1_manual, grad_x2_manual = manual_autodiff_example(x1, x2)

    # PyTorch implementation 
    f_pytorch, grad_x1_pytorch, grad_x2_pytorch = pytorch_autodiff_example(x1, x2)

    # Numerical gradient check
    gradient_check(x1, x2)

With automatic differentiation providing the computational engine for gradient-based learning, we have covered the core tools of modern ML: from regularized regression through classification and neural networks. On the upcoming page on support vector machines, we return to classification from a geometric perspective, replacing probabilistic modeling with direct margin maximization — connecting to the duality theory and constrained optimization developed in our optimization pages.
