Linear Approximations
In the study of complex systems, from the orbits of planets to the loss landscapes of deep neural networks, most functions are
inherently nonlinear and difficult to analyze directly. Linear approximation is the fundamental strategy of calculus: it approximates a
complex function \(f(x)\) near a specific point \(x_o\) using the simplest possible tool, a linear (more precisely, affine) function.
Asymptotic notation. Throughout this page (and the rest of this section) we use
the Landau symbols. We write \(\varphi(h) = o(h)\) as \(h \to 0\) to mean
\(\lim_{h \to 0} \varphi(h)/|h| = 0\) — that is, \(\varphi\) vanishes strictly faster than
\(h\). More generally \(\varphi(h) = o(|h|^k)\) means \(\varphi(h)/|h|^k \to 0\). The companion
symbol \(\varphi(h) = O(h)\) means \(|\varphi(h)| \leq C|h|\) for some constant \(C\) and all \(h\)
near \(0\). In the multivariate case, \(h\) is replaced by a vector \(\mathbf{h}\) and \(|h|\) by a
norm \(\|\mathbf{h}\|\). These abbreviations let us separate the linear part of a change
from everything that vanishes faster — which is exactly what a differential, and then a linearization,
capture.
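As a quick illustration of these definitions (the specific functions are chosen here only as examples), consider three cases as \(h \to 0\):
\[
h^2 = o(h), \qquad 3h = O(h) \ \text{but not } o(h), \qquad |h|^{3/2} = o(h).
\]
The first holds because \(h^2/|h| = |h| \to 0\); the second because \(|3h|/|h| = 3\) is bounded but does not tend to zero; the third because \(|h|^{3/2}/|h| = |h|^{1/2} \to 0\).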
Why norms enter. A derivative is fundamentally a comparison between the size of
an input displacement and the size of the resulting output change. On \(\mathbb{R}\) the absolute
value \(|h|\) plays this role implicitly, so the point passes unnoticed. Once inputs or outputs
live in higher-dimensional spaces — vectors, matrices, and eventually functions — we need a
genuine measure of magnitude, and that measure is a norm. The statement
"\(f(\mathbf{x}+\mathbf{h}) - f(\mathbf{x}) - f'(\mathbf{x})\mathbf{h} = o(\|\mathbf{h}\|)\)" is not just compact notation;
it is the definition of differentiability, formulated so that both the displacement and
the residual are measured in compatible units. In finite dimensions all norms are equivalent, so
the choice is inessential for pointwise calculus. In infinite-dimensional settings the choice
becomes substantive — different norms induce different notions of continuity and differentiability —
which is why functional analysis begins, rather than ends, with the choice of norm.
Definition: Linearization
Let \(f : \mathbb{R} \to \mathbb{R}\) be differentiable at \(x_o\), with derivative
\[
f'(x_o) = \lim_{x \to x_o} \frac{f(x)-f(x_o)}{x-x_o}.
\]
The linearization of \(f\) at \(x_o\) is the affine function
\[
L(x) \;=\; f(x_o) + f'(x_o)(x - x_o).
\]
It is the unique affine function satisfying \(L(x_o) = f(x_o)\) and
\(L'(x_o) = f'(x_o)\); equivalently, it is characterized by
\[
f(x) - L(x) \;=\; o(x - x_o) \quad \text{as } x \to x_o.
\]
Geometrically, the graph of \(L\) is the tangent line at \((x_o, f(x_o))\). The characterization
\(f(x) - L(x) = o(x-x_o)\) is the precise statement of what "\(L(x) \approx f(x)\) near \(x_o\)"
means: the approximation error vanishes faster than the displacement itself.
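A minimal numerical sketch of this characterization, assuming the illustrative choice \(f(x) = \sqrt{x}\) and \(x_o = 4\) (neither appears in the text above, and the helper name `linearize` is hypothetical): the ratio of the approximation error to the displacement should shrink as the displacement does.

```python
import math

def linearize(f, fprime, x0):
    """Return the affine function L(x) = f(x0) + f'(x0) * (x - x0)."""
    f0, slope = f(x0), fprime(x0)
    return lambda x: f0 + slope * (x - x0)

# Illustrative choice (an assumption, not from the text): f(x) = sqrt(x) at x0 = 4.
f = math.sqrt
fprime = lambda x: 0.5 / math.sqrt(x)
x0 = 4.0
L = linearize(f, fprime, x0)

# The ratio |f(x) - L(x)| / |x - x0| should tend to 0 as x -> x0.
ratios = []
for k in range(1, 6):
    h = 10.0 ** (-k)
    ratio = abs(f(x0 + h) - L(x0 + h)) / h
    ratios.append(ratio)
    print(f"h = {h:.0e}   error / h = {ratio:.3e}")
```

For this choice the ratio falls roughly in proportion to \(h\), which is consistent with an error that is second order in the displacement.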
This local linear model is what allows optimization algorithms to take principled "steps" toward a
minimum, even when the global shape of the function is unknown. We now refine this idea into the
language of differentials, which isolates the linear part of the change
in \(f\) as a first-class object.
Differentials
Linearization produces an affine approximation \(L(x)\) to \(f(x)\). The differential
isolates the linear part of the change in \(f\) as an object in its own right. This
shift — from "value near \(x_o\)" to "linear map on displacements" — is the conceptual move that
scales from scalar calculus to the vector, matrix, and operator settings used throughout optimization
and machine learning.
Definition: Differential (scalar case)
Let \(f : \mathbb{R} \to \mathbb{R}\) be differentiable at \(x\). The differential
of \(f\) at \(x\) is the linear map
\[
df : \mathbb{R} \to \mathbb{R}, \qquad df(dx) \;=\; f'(x)\, dx.
\]
Here \(dx\) denotes an arbitrary real displacement (the input of the linear map), and we
write the value of that map as \(df\) or \(df(dx)\) interchangeably when no confusion arises.
The differential is related to — but not equal to — the finite increment
\(\Delta f := f(x+dx) - f(x)\). Differentiability at \(x\) is precisely the statement that these two
agree up to a term that vanishes faster than \(dx\):
\[
\Delta f \;=\; f(x+dx) - f(x) \;=\; f'(x)\,dx + o(dx) \quad \text{as } dx \to 0.
\]
In other words, \(df = f'(x)\,dx\) is the unique linear map whose deviation from the true increment
\(\Delta f\) is higher-order. (Uniqueness: if two linear maps \(L_1, L_2\) both satisfy
\(\Delta f = L_i(dx) + o(dx)\), then \((L_1 - L_2)(dx) = o(dx)\). A linear map \(dx \mapsto c\,dx\) is
\(o(dx)\) only when \(c = 0\), so \(L_1 = L_2\).)
When \(dx \neq 0\), dividing gives the familiar Leibniz ratio:
\[
\frac{df}{dx} \;=\; f'(x),
\]
where the left side is now genuinely a ratio of values of the linear map \(df\) to its input.
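The distinction between the differential and the finite increment can be sketched numerically. The example below assumes the illustrative function \(f(x) = x^3\) at \(x = 2\) (not taken from the text) and uses hypothetical helper names; it checks that \(df\) is additive in \(dx\) while \(\Delta f\) is not, and that their gap divided by \(dx\) shrinks.

```python
def differential(fprime, x):
    """Return the linear map dx -> f'(x) * dx at the base point x."""
    slope = fprime(x)
    return lambda dx: slope * dx

# Illustrative choice (an assumption): f(x) = x^3 at x = 2, so f'(2) = 12.
f = lambda x: x ** 3
fprime = lambda x: 3 * x ** 2
x = 2.0
df = differential(fprime, x)

def increment(dx):
    """Finite increment Delta f = f(x + dx) - f(x)."""
    return f(x + dx) - f(x)

# df is linear in dx; the increment is not.
print("df:       ", df(0.1) + df(0.2), "vs", df(0.3))
print("increment:", increment(0.1) + increment(0.2), "vs", increment(0.3))

# |Delta f - df| / |dx| should tend to 0 as dx -> 0.
remainders = []
for dx in (1e-1, 1e-2, 1e-3):
    remainders.append(abs(increment(dx) - df(dx)) / dx)
    print(f"dx = {dx:.0e}   |Delta f - df| / dx = {remainders[-1]:.6f}")
```

For this cubic the gap is \(6\,dx^2 + dx^3\), so the printed ratio behaves like \(6\,dx\).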
Derivative as an Operator
In foundational calculus we often picture \(f'(x)\) as a static number — the slope. The
differential reframes it as a linear operator: the single-number derivative
\(f'(x)\) is the \(1 \times 1\) matrix of this operator.
In the identity \(df = f'(x)\,dx\), the derivative acts as a "transformer" that maps a
displacement in the input space (\(dx\)) to a displacement in the output space (\(df\)). This
perspective is essential once \(x\) is no longer a scalar but a vector \(\mathbf{x} \in \mathbb{R}^n\)
or a matrix \(X \in \mathbb{R}^{m \times n}\); there, \(f'\) is no longer a number but a matrix
of partial derivatives — the Jacobian — and its transpose in the scalar-output
case yields the gradient.
Multivariate Extension and the Gradient
We now pass from scalar inputs \(x \in \mathbb{R}\) to vector inputs
\(\mathbf{x} \in \mathbb{R}^n\). From here on, bold symbols
\(\mathbf{x}, \mathbf{y}, d\mathbf{x}\) denote vectors; plain italic \(x_i\) denotes the \(i\)-th
scalar component. For a scalar-valued differentiable function
\(f : \mathbb{R}^n \to \mathbb{R}\), the differential at \(\mathbf{x}\) is the linear map
\[
df : \mathbb{R}^n \to \mathbb{R}, \qquad
df(d\mathbf{x}) \;=\; \sum_{i=1}^n \frac{\partial f}{\partial x_i}\, dx_i.
\]
This linear functional, being represented by a single row of partial derivatives, can be written
as an inner product with a uniquely determined column vector.
Convention (Gradient). For a scalar-valued function
\(f : \mathbb{R}^n \to \mathbb{R}\), the gradient \(\nabla f\) is the
column vector in \(\mathbb{R}^n\) defined so that the differential takes the inner-product
form
\[
df \;=\; (\nabla f)^T\, d\mathbf{x}
\qquad \text{equivalently} \qquad
(\nabla f)_i \;=\; \frac{\partial f}{\partial x_i}.
\]
In practice, when a computation yields \(df = \mathbf{v}^T\, d\mathbf{x}\) for some column vector
\(\mathbf{v}\) (so that \(\mathbf{v}^T\) is the row of coefficients), we read off the gradient as \(\nabla f = \mathbf{v}\).
The same increment-and-linear-part story transfers verbatim:
\[
\Delta f \;=\; f(\mathbf{x} + d\mathbf{x}) - f(\mathbf{x})
\;=\; (\nabla f)^T\, d\mathbf{x} + o(\|d\mathbf{x}\|) \quad \text{as } d\mathbf{x} \to \mathbf{0}.
\]
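One common way to test a gradient obtained from this pattern is a finite-difference check. The sketch below assumes NumPy, a hypothetical helper `numerical_gradient` based on central differences, and an illustrative test function that does not come from the text.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of f: R^n -> R at x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

# Illustrative function (an assumption): f(x) = x_1^2 + 3 x_1 x_2,
# whose gradient is (2 x_1 + 3 x_2, 3 x_1).
f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]
grad_f = lambda x: np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

x = np.array([1.0, -2.0])
dx = 1e-4 * np.array([0.8, 0.6])      # displacement with norm 1e-4

delta_f = f(x + dx) - f(x)            # true increment
df = grad_f(x) @ dx                   # linear part (grad f)^T dx
remainder_ratio = abs(delta_f - df) / np.linalg.norm(dx)

print("analytic gradient :", grad_f(x))
print("numerical gradient:", numerical_gradient(f, x))
print("|Delta f - df| / ||dx|| =", remainder_ratio)
```

The remainder ratio is small compared with the gradient entries, as the \(o(\|d\mathbf{x}\|)\) term requires; repeating the check with a smaller `dx` should shrink it further.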
The three examples that follow all instantiate this single pattern: expand \(\Delta f\), collect
terms linear in \(d\mathbf{x}\), and read off \(\nabla f\) from the resulting row vector.
Vector Derivatives: Optimization Foundations
In machine learning, we rarely
optimize single scalar variables — we optimize weight vectors. The differential
calculus just developed is the right tool: rather than writing out \(n\) partial derivatives by
hand, we manipulate \(d(\cdot)\) symbolically and read off the gradient from the row-vector
coefficient of \(d\mathbf{x}\). The three examples below (the squared norm, the quadratic form,
and the \(L_2\) norm) cover the gradients that appear most often in standard regularization and
second-order optimization. We first state the product rule in the differential form that will be
used throughout.
Theorem: Differential Product Rule
Let \(g\) and \(h\) be differentiable at \(x\), and let \(f(x) = g(x)h(x)\). Then \(f\) is
differentiable at \(x\), and its differential satisfies
\[
df \;=\; (dg)\,h + g\,(dh).
\]
In product-compatible settings (vectors, matrices) where \(g\) and \(h\) need not commute, the
order of factors in each term is preserved.
Proof.
Let \(\Delta g := g(x+dx) - g(x)\) and \(\Delta h := h(x+dx) - h(x)\). By differentiability,
\(\Delta g = dg + o(dx)\) and \(\Delta h = dh + o(dx)\). Expanding the increment of \(f\):
\[
\begin{align*}
\Delta f &= g(x+dx)\,h(x+dx) - g(x)\,h(x) \\\\
&= \bigl[g(x) + \Delta g\bigr]\bigl[h(x) + \Delta h\bigr] - g(x)\,h(x) \\\\
&= (\Delta g)\,h(x) + g(x)\,(\Delta h) + (\Delta g)(\Delta h).
\end{align*}
\]
The first two terms contribute the linear part \((dg)\,h(x) + g(x)\,(dh)\), up to \(o(dx)\); here
we use that \(h(x)\) and \(g(x)\) are fixed (independent of \(dx\)), so multiplication by either
preserves \(o(dx)\). In the vector/matrix setting the same step uses submultiplicativity of the
operator norm. The cross term \((\Delta g)(\Delta h)\) is \(O(dx)\cdot O(dx) = O(|dx|^2)\), hence \(o(dx)\). Therefore
\[
\Delta f \;=\; (dg)\,h + g\,(dh) + o(dx),
\]
which identifies \(df = (dg)\,h + g\,(dh)\) as the linear part of \(\Delta f\). \(\blacksquare\)
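The remark about preserving the order of factors can be illustrated with matrices. The sketch below assumes NumPy and two illustrative \(2 \times 2\) matrices (not from the text), and takes \(g(X) = h(X) = X\), so that \(f(X) = X^2\) and the rule gives \(df = (dX)\,X + X\,(dX)\).

```python
import numpy as np

# Illustrative matrices (an assumption), chosen so that X and D do not commute.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
D = np.array([[0.0, 1.0], [1.0, 0.0]])

def f(M):
    """f(X) = g(X) h(X) with g(X) = h(X) = X, i.e. the matrix square."""
    return M @ M

ordered_errs, commuted_errs = [], []
for t in (1e-2, 1e-3, 1e-4):
    dX = t * D
    delta = f(X + dX) - f(X)        # true increment
    df_ordered = dX @ X + X @ dX    # (dg) h + g (dh), order preserved
    df_commuted = 2 * X @ dX        # scalar-style rule that ignores the order
    scale = np.linalg.norm(dX)
    ordered_errs.append(np.linalg.norm(delta - df_ordered) / scale)
    commuted_errs.append(np.linalg.norm(delta - df_commuted) / scale)
    print(f"t = {t:.0e}   ordered: {ordered_errs[-1]:.3e}   commuted: {commuted_errs[-1]:.3e}")
```

With the order preserved, the relative error shrinks with \(t\); the commuted version leaves an error that does not vanish, because \((dX)\,X \neq X\,(dX)\) for these matrices.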
Squared \(L_2\) Norm \(f(\mathbf{x}) = \mathbf{x}^T\mathbf{x}\)
This function represents the squared distance from the origin, the building block of the
Mean Squared Error (MSE). The input is the vector \(\mathbf{x} \in \mathbb{R}^n\), and the output is
the scalar \(\mathbf{x}^T\mathbf{x}\).
Example 1.
Applying the differential product rule
to \(f(\mathbf{x}) = \mathbf{x}^T\mathbf{x}\):
\[
\begin{align*}
d(\mathbf{x}^T\mathbf{x})
&= (d\mathbf{x}^T)\,\mathbf{x} + \mathbf{x}^T\,(d\mathbf{x}) \\\\
&= \mathbf{x}^T\,d\mathbf{x} + \mathbf{x}^T\,d\mathbf{x} \\\\
&= 2\,\mathbf{x}^T\, d\mathbf{x},
\end{align*}
\]
where in the first line we use the fact that \(d(\mathbf{x}^T) = (d\mathbf{x})^T\) — since
transposition is itself a linear operation — and in the second line we use the scalar identity
\(\mathbf{a}^T\mathbf{b} = \mathbf{b}^T\mathbf{a}\) with \(\mathbf{a} = d\mathbf{x}\),
\(\mathbf{b} = \mathbf{x}\).
Reading off the row vector \(2\,\mathbf{x}^T\) as the transpose of the gradient:
\[
\nabla f \;=\; (2\,\mathbf{x}^T)^T \;=\; 2\mathbf{x}. \qquad \blacksquare
\]
Sanity check (entrywise). Writing
\(f(\mathbf{x}) = \sum_{i=1}^n x_i^2\) and differentiating componentwise yields
\(\partial f / \partial x_i = 2 x_i\), so
\[
\nabla f \;=\; \begin{bmatrix} 2x_1 \\ 2x_2 \\ \vdots \\ 2x_n \end{bmatrix} \;=\; 2\mathbf{x},
\]
in agreement with the differential derivation. For the remaining examples the entrywise route
would be progressively more painful; the differential calculation generalizes effortlessly.
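A short numerical check of this result, assuming NumPy and an illustrative point and displacement (both chosen here, not taken from the text): for \(f(\mathbf{x}) = \mathbf{x}^T\mathbf{x}\) the remainder \(\Delta f - 2\,\mathbf{x}^T d\mathbf{x}\) should equal \(d\mathbf{x}^T d\mathbf{x}\).

```python
import numpy as np

# Illustrative point and displacement (assumptions for the check).
x = np.array([1.0, -2.0, 3.0])
dx = 1e-3 * np.array([2.0, 1.0, -2.0])   # norm 3e-3

f = lambda v: v @ v                      # f(x) = x^T x

delta_f = f(x + dx) - f(x)               # true increment
df = 2 * x @ dx                          # linear part 2 x^T dx
remainder = delta_f - df                 # algebraically equal to dx^T dx

print("Delta f   =", delta_f)
print("df        =", df)
print("remainder =", remainder, "  dx^T dx =", dx @ dx)
```

The remainder is quadratic in the displacement, so it is \(o(\|d\mathbf{x}\|)\) as the definition of the differential requires.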
Quadratic Forms \(f(\mathbf{x}) = \mathbf{x}^TA\mathbf{x}\)
Quadratic forms are essential for
modeling local curvature: in the second-order Taylor expansion of a function, the curvature term is
the quadratic form defined by the Hessian matrix. Let \(\mathbf{x} \in \mathbb{R}^n\) and \(A \in \mathbb{R}^{n \times n}\).
Example 2.
Expanding the increment and dropping the quadratic term \(d\mathbf{x}^T A\, d\mathbf{x}\) (which is \(O(\|d\mathbf{x}\|^2)\), hence \(o(\|d\mathbf{x}\|)\)):
\[
\begin{align*}
\Delta f &= (\mathbf{x} + d\mathbf{x})^T A\,(\mathbf{x} + d\mathbf{x}) - \mathbf{x}^T A\mathbf{x} \\\\
&= d\mathbf{x}^T A\,\mathbf{x} + \mathbf{x}^T A\, d\mathbf{x} + d\mathbf{x}^T A\, d\mathbf{x},
\end{align*}
\]
so the linear part is
\[
df \;=\; d\mathbf{x}^T A\,\mathbf{x} + \mathbf{x}^T A\, d\mathbf{x}.
\]
The first term is a scalar, hence equal to its own transpose:
\((d\mathbf{x}^T A\,\mathbf{x})^T = \mathbf{x}^T A^T\, d\mathbf{x}\). Substituting,
\[
df \;=\; \mathbf{x}^T A^T\, d\mathbf{x} + \mathbf{x}^T A\, d\mathbf{x}
\;=\; \mathbf{x}^T(A + A^T)\, d\mathbf{x}.
\]
Reading off the row vector \(\mathbf{x}^T(A + A^T)\) as the transpose of the gradient (per
the convention established in Differentials):
\[
\nabla f \;=\; (A + A^T)\,\mathbf{x}. \qquad \blacksquare
\]
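The role of the \(A^T\) term can be checked numerically. The sketch below assumes NumPy, an illustrative non-symmetric \(2 \times 2\) matrix, and a central-difference estimate of the gradient; none of these specifics come from the text.

```python
import numpy as np

# Illustrative non-symmetric matrix and point (assumptions for the check).
A = np.array([[1.0, 2.0], [0.0, 3.0]])
x = np.array([1.0, 1.0])

f = lambda v: v @ A @ v                  # f(x) = x^T A x

# Central-difference estimate of the gradient, one coordinate at a time.
eps = 1e-6
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(2)])

grad = (A + A.T) @ x                     # formula derived above
naive = 2 * A @ x                        # only valid when A is symmetric

print("numerical   :", num_grad)
print("(A + A^T) x :", grad)
print("2 A x       :", naive)
```

Here \(f(\mathbf{x}) = x_1^2 + 2x_1x_2 + 3x_2^2\), so the numerical estimate agrees with \((A + A^T)\,\mathbf{x}\), while \(2A\,\mathbf{x}\) differs because this \(A\) is not symmetric.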
Symmetry and Optimization Efficiency
In most machine learning contexts (for instance, the second-order Taylor expansion of a loss
function) the matrix \(A\) is the Hessian, which is symmetric whenever the second partial derivatives are continuous:
\(A = A^T\). In this case the gradient simplifies to
\[
\nabla f \;=\; (A + A^T)\,\mathbf{x} \;=\; 2A\,\mathbf{x}.
\]
Symmetry is not a mathematical curiosity; numerical solvers exploit it to reduce memory
and computation by roughly half, for example by storing only the upper triangle. When \(A\) is also
positive definite, Cholesky factorization can replace general LU and the conjugate-gradient method can replace GMRES.
Moreover, Example 1 is recovered as the special case \(A = I\): then
\(\nabla (\mathbf{x}^T\mathbf{x}) = 2I\mathbf{x} = 2\mathbf{x}\), matching our earlier result.
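A brief sketch of why the symmetric formula is all one needs in practice, assuming NumPy and illustrative values: the quadratic form only depends on \(S = \tfrac{1}{2}(A + A^T)\) (often called the symmetric part of \(A\), a term not used above), so \((A + A^T)\,\mathbf{x}\) and \(2S\,\mathbf{x}\) coincide.

```python
import numpy as np

# Illustrative values (assumptions for the check).
A = np.array([[1.0, 2.0], [0.0, 3.0]])   # deliberately non-symmetric
S = 0.5 * (A + A.T)                      # symmetric matrix with the same quadratic form
x = np.array([1.0, -2.0])

same_value = np.isclose(x @ A @ x, x @ S @ x)
grad_general = (A + A.T) @ x             # general formula
grad_symmetric = 2 * S @ x               # symmetric formula applied to S
grad_identity = 2 * np.eye(2) @ x        # special case A = I, i.e. f(x) = x^T x

print("x^T A x == x^T S x:", same_value)
print("(A + A^T) x =", grad_general)
print("2 S x       =", grad_symmetric)
print("A = I case  =", grad_identity)
```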
The \(L_2\) Norm \(f(\mathbf{x}) = \|\mathbf{x}\|_2\)
The \(L_2\) norm, or Euclidean length,
underlies the most common regularization terms used to prevent overfitting in machine learning models.
Understanding its derivative, alongside that of the squared norm in Example 1, is key to understanding how weight decay works.
Example 3.
Let \(r := \|\mathbf{x}\|_2\), so that \(r^2 = \mathbf{x}^T\mathbf{x}\). Assume
\(\mathbf{x} \neq \mathbf{0}\), so that \(r > 0\) and \(r\) is differentiable at \(\mathbf{x}\)
(this follows from the smoothness of \(\sqrt{\,\cdot\,}\) away from \(0\) composed with the
polynomial \(\mathbf{x}^T\mathbf{x}\); the chain rule is formalized in the
Jacobian chapter). Taking the differential of
both sides of \(r^2 = \mathbf{x}^T\mathbf{x}\) — using \(d(r^2) = 2r\,dr\) on the left (scalar chain rule)
and the result of Example 1 on the right:
\[
\begin{align*}
d(r^2) &= d(\mathbf{x}^T\mathbf{x}) \\\\
2r\, dr &= 2\,\mathbf{x}^T\, d\mathbf{x} \\\\
dr &= \frac{\mathbf{x}^T}{r}\, d\mathbf{x}
\;=\; \frac{\mathbf{x}^T}{\|\mathbf{x}\|_2}\, d\mathbf{x}.
\end{align*}
\]
By the gradient convention (Differentials), the gradient is therefore
\[
\nabla \|\mathbf{x}\|_2 \;=\; \frac{\mathbf{x}}{\|\mathbf{x}\|_2}. \qquad \blacksquare
\]
Note that the norm is not differentiable at \(\mathbf{x} = \mathbf{0}\); the formula above holds
only on \(\mathbb{R}^n \setminus \{\mathbf{0}\}\).
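Both the formula and the failure at the origin can be checked numerically. The sketch below assumes NumPy and the illustrative point \(\mathbf{x} = (3, 4)\), and compares one-sided difference quotients at \(\mathbf{0}\) along the first coordinate axis.

```python
import numpy as np

# Illustrative point (an assumption): x = (3, 4), so ||x||_2 = 5.
x = np.array([3.0, 4.0])
grad = x / np.linalg.norm(x)             # formula derived above

# Central-difference estimate of the gradient of the norm at x.
eps = 1e-6
num_grad = np.array([(np.linalg.norm(x + eps * e) - np.linalg.norm(x - eps * e)) / (2 * eps)
                     for e in np.eye(2)])

print("x / ||x||  :", grad)
print("numerical  :", num_grad)
print("||grad||   :", np.linalg.norm(grad))

# At the origin the one-sided difference quotients along e_1 disagree,
# so no single linear map can approximate the increment there.
e1 = np.array([1.0, 0.0])
right = (np.linalg.norm(eps * e1) - 0.0) / eps
left = (np.linalg.norm(-eps * e1) - 0.0) / (-eps)
print("one-sided quotients at 0:", right, left)
```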
Geometric Interpretation: The Unit Vector
The gradient of the \(L_2\) norm is the unit vector pointing in the direction of
\(\mathbf{x}\).
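In gradient descent, this means a penalty equal to the unsquared norm exerts a pull whose
magnitude is independent of the weights' scale: each step moves the weights directly toward the
origin by a fixed distance (the learning rate times the regularization coefficient).
Standard \(L_2\) regularization (weight decay) instead penalizes the squared norm of Example 1,
whose gradient \(2\mathbf{x}\) shrinks in proportion to the weights. That proportional pull is the
mathematical reason \(L_2\) regularization effectively shrinks weights but rarely
drives them to exactly zero, unlike \(L_1\) regularization, whose subgradient keeps constant
magnitude in each coordinate no matter how small the weight becomes.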
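The scale-independence of the pull can be sketched numerically. The snippet below assumes NumPy, plain gradient descent with an illustrative learning rate of 0.1, and hypothetical helper names; it takes one step on the unsquared penalty \(\|\mathbf{w}\|_2\) and, for contrast, one step on the squared penalty \(\mathbf{w}^T\mathbf{w}\) from Example 1.

```python
import numpy as np

def step_norm_penalty(w, lr):
    """One gradient-descent step on the unsquared penalty ||w||_2 (requires w != 0)."""
    return w - lr * w / np.linalg.norm(w)

def step_squared_penalty(w, lr):
    """One gradient-descent step on the squared penalty w^T w, whose gradient is 2w."""
    return w - lr * 2 * w

lr = 0.1                                 # illustrative learning rate (an assumption)
moved_norm, moved_squared = [], []
for scale in (1.0, 10.0, 100.0):
    w = scale * np.array([0.6, 0.8])     # ||w||_2 equals scale
    moved_norm.append(np.linalg.norm(w - step_norm_penalty(w, lr)))
    moved_squared.append(np.linalg.norm(w - step_squared_penalty(w, lr)))
    print(f"||w|| = {scale:6.1f}   unsquared step: {moved_norm[-1]:.3f}"
          f"   squared step: {moved_squared[-1]:.3f}")
```

The step on the unsquared norm has length equal to the learning rate at every scale, while the step on the squared norm grows and shrinks with \(\|\mathbf{w}\|_2\).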