Linear Approximations

Linear Approximations

In the study of complex systems, from the orbits of planets to the loss landscapes of deep neural networks, most functions of interest are nonlinear and hard to analyze directly. Linear approximation is the fundamental strategy of calculus: it approximates a complex function \(f(x)\) near a specific point \(x_o\) by the simplest nontrivial model, an affine function (informally, a "linear" one).

Asymptotic notation. Throughout this page (and the rest of this section) we use the Landau symbols. We write \(\varphi(h) = o(h)\) as \(h \to 0\) to mean \(\lim_{h \to 0} \varphi(h)/|h| = 0\) — that is, \(\varphi\) vanishes strictly faster than \(h\). More generally \(\varphi(h) = o(|h|^k)\) means \(\varphi(h)/|h|^k \to 0\). The companion symbol \(\varphi(h) = O(h)\) means \(|\varphi(h)| \leq C|h|\) for some constant \(C\) and all \(h\) near \(0\). In the multivariate case, \(h\) is replaced by a vector \(\mathbf{h}\) and \(|h|\) by a norm \(\|\mathbf{h}\|\). These abbreviations let us separate the linear part of a change from everything that vanishes faster — which is exactly what a differential, and then a linearization, capture.

Why norms enter. A derivative is fundamentally a comparison between the size of an input displacement and the size of the resulting output change. On \(\mathbb{R}\) the absolute value \(|h|\) plays this role implicitly, so the point passes unnoticed. Once inputs or outputs live in higher-dimensional spaces — vectors, matrices, and eventually functions — we need a genuine measure of magnitude, and that measure is a norm. The statement "\(f(x+\mathbf{h}) - f(x) - f'(x)\mathbf{h} = o(\|\mathbf{h}\|)\)" is not just a compact notation; it is the definition of differentiability, formulated so that both the displacement and the residual are measured in compatible units. In finite dimensions all norms are equivalent, so the choice is inessential for pointwise calculus. In infinite-dimensional settings the choice becomes substantive — different norms induce different notions of continuity and differentiability — which is why functional analysis begins, rather than ends, with the choice of norm.

Definition: Linearization

Let \(f : \mathbb{R} \to \mathbb{R}\) be differentiable at \(x_o\), with derivative \[ f'(x_o) = \lim_{x \to x_o} \frac{f(x)-f(x_o)}{x-x_o}. \] The linearization of \(f\) at \(x_o\) is the affine function \[ L(x) \;=\; f(x_o) + f'(x_o)(x - x_o). \] It is the unique affine function satisfying \(L(x_o) = f(x_o)\) and \(L'(x_o) = f'(x_o)\); equivalently, it is characterized by \[ f(x) - L(x) \;=\; o(x - x_o) \quad \text{as } x \to x_o. \]

Geometrically, the graph of \(L\) is the tangent line at \((x_o, f(x_o))\). The characterization \(f(x) - L(x) = o(x-x_o)\) is the precise statement of what "\(L(x) \approx f(x)\) near \(x_o\)" means: the approximation error vanishes faster than the displacement itself.
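The claim that the error vanishes faster than the displacement can be checked numerically. The sketch below is only an illustration: the function \(f(x) = \sqrt{x}\), the base point \(x_o = 4\), and the helper name `linearize` are assumptions chosen for the example, not part of the definition above.

```python
import math

def linearize(f, df, x0):
    """Return the affine function L(x) = f(x0) + f'(x0) * (x - x0)."""
    fx0, slope = f(x0), df(x0)
    return lambda x: fx0 + slope * (x - x0)

# Illustrative choice: f(x) = sqrt(x), linearized at x0 = 4.
f = math.sqrt
df = lambda x: 0.5 / math.sqrt(x)   # derivative of sqrt, supplied by hand
x0 = 4.0
L = linearize(f, df, x0)

# If f - L = o(x - x0), then error / |x - x0| must shrink as x -> x0.
ratios = []
for h in (1.0, 0.1, 0.01, 0.001):
    x = x0 + h
    err = abs(f(x) - L(x))
    ratios.append(err / h)
    print(f"h={h:<6} f={f(x):.8f} L={L(x):.8f} error/h={err / h:.2e}")
```

The printed ratio error/h falls roughly in proportion to h, which is the behaviour expected when the error is of second order in the displacement.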

This local linear model is what allows optimization algorithms to take principled "steps" toward a minimum, even when the global shape of the function is unknown. We now refine this idea into the language of differentials, which isolates the linear part of the change in \(f\) as a first-class object.

Differentials

Linearization produces an affine approximation \(L(x)\) to \(f(x)\). The differential isolates the linear part of the change in \(f\) as an object in its own right. This shift — from "value near \(x_o\)" to "linear map on displacements" — is the conceptual move that scales from scalar calculus to the vector, matrix, and operator settings used throughout optimization and machine learning.

Definition: Differential (scalar case)

Let \(f : \mathbb{R} \to \mathbb{R}\) be differentiable at \(x\). The differential of \(f\) at \(x\) is the linear map \[ df : \mathbb{R} \to \mathbb{R}, \qquad df(dx) \;=\; f'(x)\, dx. \] Here \(dx\) denotes an arbitrary real displacement (the input of the linear map), and we write the value of that map as \(df\) or \(df(dx)\) interchangeably when no confusion arises.

The differential is related to, but not equal to, the finite increment \(\Delta f := f(x+dx) - f(x)\). Differentiability at \(x\) is precisely the statement that these two agree up to a term that vanishes faster than \(dx\): \[ \Delta f \;=\; f(x+dx) - f(x) \;=\; f'(x)\,dx + o(dx) \quad \text{as } dx \to 0. \] In other words, \(df = f'(x)\,dx\) is the unique linear map whose deviation from the true increment \(\Delta f\) is higher-order. (Uniqueness: if two linear maps \(L_1, L_2\) both satisfy \(\Delta f = L_i(dx) + o(dx)\), then \((L_1 - L_2)(dx) = o(dx)\). A linear map on \(\mathbb{R}\) has the form \(c\,dx\), and \(c\,dx/|dx| \to 0\) forces \(c = 0\), so \(L_1 = L_2\).)

When \(dx \neq 0\), dividing gives the familiar Leibniz ratio: \[ \frac{df}{dx} \;=\; f'(x), \] where the left side is now genuinely a ratio of values of the linear map \(df\) to its input.
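As a concrete instance (an illustrative choice, not taken from the text above), let \(f(x) = x^3\). Expanding the increment directly gives \[ \Delta f \;=\; (x+dx)^3 - x^3 \;=\; 3x^2\,dx + 3x\,dx^2 + dx^3. \] The part that is linear in \(dx\) is the differential \(df = 3x^2\,dx\), and the remainder satisfies \[ \frac{\Delta f - df}{dx} \;=\; 3x\,dx + dx^2 \;\to\; 0 \quad \text{as } dx \to 0, \] so it is \(o(dx)\). The Leibniz ratio is then \(df/dx = 3x^2 = f'(x)\), as expected.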

Derivative as an Operator

In foundational calculus we often picture \(f'(x)\) as a static number — the slope. The differential reframes it as a linear operator: the single-number derivative \(f'(x)\) is the \(1 \times 1\) matrix of this operator.

In the identity \(df = f'(x)\,dx\), the derivative acts as a "transformer" that maps a displacement in the input space (\(dx\)) to a displacement in the output space (\(df\)). This perspective is essential once \(x\) is no longer a scalar but a vector \(\mathbf{x} \in \mathbb{R}^n\) or a matrix \(X \in \mathbb{R}^{m \times n}\); there, \(f'\) is no longer a number but a matrix of partial derivatives — the Jacobian — and its transpose in the scalar-output case yields the gradient.

Multivariate Extension and the Gradient

We now pass from scalar inputs \(x \in \mathbb{R}\) to vector inputs \(\mathbf{x} \in \mathbb{R}^n\). From here on, bold symbols \(\mathbf{x}, \mathbf{y}, d\mathbf{x}\) denote vectors; plain italic \(x_i\) denotes the \(i\)-th scalar component. For a scalar-valued differentiable function \(f : \mathbb{R}^n \to \mathbb{R}\), the differential at \(\mathbf{x}\) is the linear map \[ df : \mathbb{R}^n \to \mathbb{R}, \qquad df(d\mathbf{x}) \;=\; \sum_{i=1}^n \frac{\partial f}{\partial x_i}\, dx_i. \] This linear functional, being represented by a single row of partial derivatives, can be written as an inner product with a uniquely determined column vector.

Convention (Gradient). For a scalar-valued function \(f : \mathbb{R}^n \to \mathbb{R}\), the gradient \(\nabla f\) is the column vector in \(\mathbb{R}^n\) defined so that the differential takes the inner-product form \[ df \;=\; (\nabla f)^T\, d\mathbf{x} \qquad \text{equivalently} \qquad (\nabla f)_i \;=\; \frac{\partial f}{\partial x_i}. \] In practice, when a computation yields \(df = \mathbf{v}^T\, d\mathbf{x}\) for some column vector \(\mathbf{v}\) (so that \(\mathbf{v}^T\) is the row vector multiplying \(d\mathbf{x}\)), we read off the gradient as \(\nabla f = \mathbf{v}\), the transpose of that row vector.

The same increment-and-linear-part story transfers verbatim: \[ \Delta f \;=\; f(\mathbf{x} + d\mathbf{x}) - f(\mathbf{x}) \;=\; (\nabla f)^T\, d\mathbf{x} + o(\|d\mathbf{x}\|) \quad \text{as } d\mathbf{x} \to \mathbf{0}. \] The three examples that follow all instantiate this single pattern: expand \(\Delta f\), collect terms linear in \(d\mathbf{x}\), and read off \(\nabla f\) from the resulting row vector.
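The first-order relation above can be tested numerically before trusting a hand-derived gradient. The following is a minimal sketch under stated assumptions: the test function \(f(x_1, x_2) = x_1^2 x_2 + \sin x_2\), the base point, and the displacement direction are all hypothetical choices made for illustration.

```python
import math

def f(x):
    # Illustrative function f(x1, x2) = x1^2 * x2 + sin(x2).
    return x[0] ** 2 * x[1] + math.sin(x[1])

def grad_f(x):
    # Hand-computed partial derivatives, stacked as the gradient.
    return [2.0 * x[0] * x[1], x[0] ** 2 + math.cos(x[1])]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

x = [1.5, -0.7]
direction = [0.6, -0.8]   # fixed unit direction for the displacement

ratios = []
for t in (1e-1, 1e-2, 1e-3):
    dx = [t * d for d in direction]
    x_new = [xi + di for xi, di in zip(x, dx)]
    delta_f = f(x_new) - f(x)      # true increment
    df = dot(grad_f(x), dx)        # linear part (grad f)^T dx
    ratios.append(abs(delta_f - df) / norm(dx))
    print(f"|dx|={t:.0e}  delta_f={delta_f:+.8f}  df={df:+.8f}  "
          f"residual/|dx|={ratios[-1]:.2e}")
```

If the residual divided by \(\|d\mathbf{x}\|\) did not shrink as the displacement shrinks, that would indicate an error in the hand-computed gradient.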

Vector Derivatives: Optimization Foundations

In machine learning, we rarely optimize single scalar variables; we optimize weight vectors. The differential calculus just developed is the right tool: rather than writing out \(n\) partial derivatives by hand, we manipulate \(d(\cdot)\) symbolically and read off the gradient from the row-vector coefficient of \(d\mathbf{x}\). The three examples below (the squared norm, the quadratic form, and the \(L_2\) norm) cover the gradients that appear most often in standard regularization and in the quadratic models used by second-order optimization. We first state the product rule in the differential form that will be used throughout.

Theorem: Differential Product Rule

Let \(g\) and \(h\) be differentiable at \(x\), and let \(f(x) = g(x)h(x)\). Then \(f\) is differentiable at \(x\), and its differential satisfies \[ df \;=\; (dg)\,h + g\,(dh). \] In product-compatible settings (vectors, matrices) where \(g\) and \(h\) need not commute, the order of factors in each term is preserved.

Proof.

Let \(\Delta g := g(x+dx) - g(x)\) and \(\Delta h := h(x+dx) - h(x)\). By differentiability, \(\Delta g = dg + o(dx)\) and \(\Delta h = dh + o(dx)\). Expanding the increment of \(f\): \[ \begin{align*} \Delta f &= g(x+dx)\,h(x+dx) - g(x)\,h(x) \\\\ &= \bigl[g(x) + \Delta g\bigr]\bigl[h(x) + \Delta h\bigr] - g(x)\,h(x) \\\\ &= (\Delta g)\,h(x) + g(x)\,(\Delta h) + (\Delta g)(\Delta h). \end{align*} \] The first two terms contribute the linear part \((dg)\,h(x) + g(x)\,(dh)\), up to \(o(dx)\); here we use that \(h(x)\) and \(g(x)\) are fixed (independent of \(dx\)), so multiplication by either preserves \(o(dx)\). In the vector/matrix setting the same step uses submultiplicativity of the operator norm. For the cross term, note that \(dg\) and \(dh\) are linear in \(dx\), so \(\Delta g = O(dx)\) and \(\Delta h = O(dx)\); hence \((\Delta g)(\Delta h) = O(dx)\cdot O(dx) = O(|dx|^2)\), which is \(o(dx)\). Therefore \[ \Delta f \;=\; (dg)\,h + g\,(dh) + o(dx), \] which identifies \(df = (dg)\,h + g\,(dh)\) as the linear part of \(\Delta f\). \(\blacksquare\)
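The remark about preserving the order of factors can be illustrated with matrices, where the product rule reads \(d(GH) = (dG)\,H + G\,(dH)\). The sketch below is an assumption-laden illustration: the \(2 \times 2\) matrices, the displacement directions, and the helper names are hypothetical. It compares the correctly ordered linear part with a version in which one pair of factors is swapped.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def max_abs(A):
    return max(abs(a) for row in A for a in row)

# Illustrative 2x2 matrices that do not commute.
G = [[1.0, 2.0], [0.0, 1.0]]
H = [[0.0, 1.0], [3.0, 1.0]]

def residuals(t):
    # Displacements of size t in fixed (arbitrary) directions.
    dG = [[t, 0.0], [2 * t, -t]]
    dH = [[-t, t], [0.0, 3 * t]]
    increment = sub(matmul(add(G, dG), add(H, dH)), matmul(G, H))
    ordered = add(matmul(dG, H), matmul(G, dH))   # (dG) H + G (dH)
    swapped = add(matmul(H, dG), matmul(G, dH))   # H (dG) + G (dH): wrong order
    return max_abs(sub(increment, ordered)), max_abs(sub(increment, swapped))

for t in (1e-1, 1e-2, 1e-3):
    good, bad = residuals(t)
    print(f"t={t:.0e}  ordered residual/t={good / t:.2e}  "
          f"swapped residual/t={bad / t:.2e}")
```

With the factors in the correct order the residual is exactly the cross term \((dG)(dH)\), so residual/t shrinks with t; with the factors swapped it stays of order one.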

Example 1: Squared \(L_2\) Norm \(f(\mathbf{x}) = \mathbf{x}^T\mathbf{x}\), \(\mathbf{x} \in \mathbb{R}^n\)

This function represents the squared distance from the origin — a core component of Mean Squared Error (MSE). The input is the vector \(\mathbf{x}\), and the output is the scalar \(\mathbf{x}^T\mathbf{x}\).

Example 1.

Applying the differential product rule to \(f(\mathbf{x}) = \mathbf{x}^T\mathbf{x}\): \[ \begin{align*} d(\mathbf{x}^T\mathbf{x}) &= (d\mathbf{x}^T)\,\mathbf{x} + \mathbf{x}^T\,(d\mathbf{x}) \\\\ &= \mathbf{x}^T\,d\mathbf{x} + \mathbf{x}^T\,d\mathbf{x} \\\\ &= 2\,\mathbf{x}^T\, d\mathbf{x}, \end{align*} \] where in the first line we use the fact that \(d(\mathbf{x}^T) = (d\mathbf{x})^T\) — since transposition is itself a linear operation — and in the second line we use the scalar identity \(\mathbf{a}^T\mathbf{b} = \mathbf{b}^T\mathbf{a}\) with \(\mathbf{a} = d\mathbf{x}\), \(\mathbf{b} = \mathbf{x}\).

Reading off the row vector \(2\,\mathbf{x}^T\) as the transpose of the gradient: \[ \nabla f \;=\; (2\,\mathbf{x}^T)^T \;=\; 2\mathbf{x}. \qquad \blacksquare \]

Sanity check (entrywise). Writing \(f(\mathbf{x}) = \sum_{i=1}^n x_i^2\) and differentiating componentwise yields \(\partial f / \partial x_i = 2 x_i\), so \[ \nabla f \;=\; \begin{bmatrix} 2x_1 \\ 2x_2 \\ \vdots \\ 2x_n \end{bmatrix} \;=\; 2\mathbf{x}, \] in agreement with the differential derivation. For the remaining examples the entrywise route would be progressively more painful; the differential calculation generalizes effortlessly.

Quadratic Forms \(f(\mathbf{x}) = \mathbf{x}^TA\mathbf{x}\)

Quadratic forms are essential for modeling local curvature: the second-order term of a Taylor expansion is the quadratic form \(\tfrac{1}{2}\, d\mathbf{x}^T H\, d\mathbf{x}\) defined by the Hessian matrix \(H\). Let \(\mathbf{x} \in \mathbb{R}^n\) and \(A \in \mathbb{R}^{n \times n}\).

Example 2.

Expanding the increment and dropping the quadratic term (which is \(o(\|d\mathbf{x}\|)\)): \[ \begin{align*} \Delta f &= (\mathbf{x} + d\mathbf{x})^T A\,(\mathbf{x} + d\mathbf{x}) - \mathbf{x}^T A\mathbf{x} \\\\ &= d\mathbf{x}^T A\,\mathbf{x} + \mathbf{x}^T A\, d\mathbf{x} + d\mathbf{x}^T A\, d\mathbf{x}, \end{align*} \] so the linear part is \[ df \;=\; d\mathbf{x}^T A\,\mathbf{x} + \mathbf{x}^T A\, d\mathbf{x}. \] The first term is a scalar, hence equal to its own transpose: \((d\mathbf{x}^T A\,\mathbf{x})^T = \mathbf{x}^T A^T\, d\mathbf{x}\). Substituting, \[ df \;=\; \mathbf{x}^T A^T\, d\mathbf{x} + \mathbf{x}^T A\, d\mathbf{x} \;=\; \mathbf{x}^T(A + A^T)\, d\mathbf{x}. \] Reading off the row vector \(\mathbf{x}^T(A + A^T)\) as the transpose of the gradient (per the convention established in Differentials): \[ \nabla f \;=\; (A + A^T)\,\mathbf{x}. \qquad \blacksquare \]

Symmetry and Optimization Efficiency

In most machine learning contexts (for instance, the second-order Taylor expansion of a twice continuously differentiable loss function) the matrix \(A\) is the Hessian, which is symmetric: \(A = A^T\). In this case the gradient simplifies to \[ \nabla f \;=\; (A + A^T)\,\mathbf{x} \;=\; 2A\,\mathbf{x}. \] Symmetry is not a mathematical curiosity; numerical solvers exploit it to reduce memory and computation by roughly half, e.g., by storing only the upper triangle. When \(A\) is also positive definite, further savings are available: Cholesky factorization in place of general LU, and the conjugate-gradient method instead of GMRES.

Moreover, Example 1 is recovered as the special case \(A = I\): then \(\nabla (\mathbf{x}^T\mathbf{x}) = 2I\mathbf{x} = 2\mathbf{x}\), matching our earlier result.
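The formula \(\nabla f = (A + A^T)\,\mathbf{x}\) and its two special cases can be checked against central differences. This is a sketch under assumptions: the non-symmetric matrix `A`, the point `x`, and the helper `numeric_grad` are hypothetical choices made only for illustration.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def quad_form(A, x):
    # f(x) = x^T A x
    return sum(xi * yi for xi, yi in zip(x, matvec(A, x)))

def grad_quad_form(A, x):
    # (A + A^T) x, the gradient derived in Example 2.
    n = len(x)
    sym = [[A[i][j] + A[j][i] for j in range(n)] for i in range(n)]
    return matvec(sym, x)

def numeric_grad(f, x, eps=1e-6):
    # Central differences, one coordinate at a time.
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        g.append((f(xp) - f(xm)) / (2 * eps))
    return g

A = [[1.0, 4.0], [-2.0, 3.0]]          # deliberately non-symmetric
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [0.5, -1.5]

analytic = grad_quad_form(A, x)
numeric = numeric_grad(lambda v: quad_form(A, v), x)
naive = [2 * y for y in matvec(A, x)]  # 2 A x, valid only when A = A^T

print("(A + A^T) x :", analytic)
print("numeric     :", numeric)
print("2 A x       :", naive)
print("A = I case  :", grad_quad_form(identity, x))
```

For this non-symmetric `A`, the shortcut \(2A\mathbf{x}\) disagrees with the numerical gradient while \((A + A^T)\mathbf{x}\) matches it, and the identity case returns \(2\mathbf{x}\).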

The \(L_2\) Norm \(f(\mathbf{x}) = \|\mathbf{x}\|_2\)

The \(L_2\) norm, or Euclidean length, underlies the most common regularization terms used to prevent overfitting in machine learning models. Understanding its derivative, alongside that of the squared norm in Example 1, is key to understanding how weight decay works.

Example 3.

Let \(r := \|\mathbf{x}\|_2\), so that \(r^2 = \mathbf{x}^T\mathbf{x}\). Assume \(\mathbf{x} \neq \mathbf{0}\), so that \(r > 0\) and \(r\) is differentiable at \(\mathbf{x}\) (this follows from the smoothness of \(\sqrt{\,\cdot\,}\) away from \(0\) composed with the polynomial \(\mathbf{x}^T\mathbf{x}\); the chain rule is formalized in the Jacobian chapter). Taking the differential of both sides of \(r^2 = \mathbf{x}^T\mathbf{x}\) — using \(d(r^2) = 2r\,dr\) on the left (scalar chain rule) and the result of Example 1 on the right: \[ \begin{align*} d(r^2) &= d(\mathbf{x}^T\mathbf{x}) \\\\ 2r\, dr &= 2\,\mathbf{x}^T\, d\mathbf{x} \\\\ dr &= \frac{\mathbf{x}^T}{r}\, d\mathbf{x} \;=\; \frac{\mathbf{x}^T}{\|\mathbf{x}\|_2}\, d\mathbf{x}. \end{align*} \] By the gradient convention (Differentials), the gradient is therefore \[ \nabla \|\mathbf{x}\|_2 \;=\; \frac{\mathbf{x}}{\|\mathbf{x}\|_2}. \qquad \blacksquare \]

Note that the norm is not differentiable at \(\mathbf{x} = \mathbf{0}\); the formula above holds only on \(\mathbb{R}^n \setminus \{\mathbf{0}\}\).
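Both the formula and its failure at the origin can be seen numerically. In the sketch below, the point \((3, -4)\), the step sizes, and the function names are illustrative assumptions. The last part probes the origin along one coordinate axis, where the norm reduces to \(|t|\).

```python
import math

def norm2(x):
    return math.sqrt(sum(xi * xi for xi in x))

def grad_norm2(x):
    # x / ||x||_2; undefined at the origin, so that input is rejected.
    r = norm2(x)
    if r == 0.0:
        raise ValueError("the L2 norm is not differentiable at the origin")
    return [xi / r for xi in x]

x = [3.0, -4.0]                 # ||x||_2 = 5
g = grad_norm2(x)
print("gradient:", g, " length:", norm2(g))

# Central-difference check of each partial derivative.
eps = 1e-6
for i in range(len(x)):
    xp, xm = list(x), list(x)
    xp[i] += eps
    xm[i] -= eps
    numeric = (norm2(xp) - norm2(xm)) / (2 * eps)
    print(f"partial {i}: numeric={numeric:+.6f}  formula={g[i]:+.6f}")

# At the origin the one-sided slopes along a coordinate axis disagree.
h = 1e-6
right = (norm2([h, 0.0]) - norm2([0.0, 0.0])) / h
left = (norm2([-h, 0.0]) - norm2([0.0, 0.0])) / (-h)
print("one-sided slopes at 0:", right, left)
```

The two one-sided slopes at the origin come out close to \(+1\) and \(-1\), so no single linear map can serve as a differential there.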

Geometric Interpretation: The Unit Vector

The gradient of the \(L_2\) norm is the unit vector pointing in the direction of \(\mathbf{x}\).

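In gradient descent, this means that a penalty \(\lambda\|\mathbf{x}\|_2\) (with penalty weight \(\lambda > 0\)) exerts a pull whose magnitude is independent of the weights' scale: it always points directly toward the origin with constant strength \(\lambda\), scaled by the learning rate at each step. The standard \(L_2\) regularizer (weight decay) instead uses the squared norm of Example 1, whose gradient \(2\mathbf{x}\) shrinks in proportion to the weights themselves. This is the mathematical reason \(L_2\) regularization effectively shrinks weights but rarely drives them to exactly zero, unlike \(L_1\) regularization, whose subgradient keeps full magnitude along each coordinate axis however small the corresponding weight becomes.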
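The difference between a pull of constant size and a pull proportional to the weights can be seen by running a few plain gradient steps on each penalty. This is a minimal sketch: the learning rate, the starting point, and the function names are hypothetical, and no data-fitting loss is included, so only the penalty acts on the weights.

```python
import math

def norm2(x):
    return math.sqrt(sum(xi * xi for xi in x))

def step_norm(x, lr):
    # Gradient step on ||x||_2; the gradient is the unit vector x / ||x||_2.
    # Assumes x is not the origin, where the gradient does not exist.
    r = norm2(x)
    return [xi - lr * xi / r for xi in x]

def step_squared(x, lr):
    # Gradient step on x^T x; the gradient is 2x (Example 1).
    return [xi - lr * 2.0 * xi for xi in x]

def distance(u, v):
    return norm2([a - b for a, b in zip(u, v)])

lr = 0.1                 # hypothetical learning rate
x_norm = [3.0, 4.0]      # iterate under the unsquared penalty
x_sq = [3.0, 4.0]        # iterate under the squared penalty
moved_norm, moved_sq = [], []

for k in range(1, 6):
    new_norm, new_sq = step_norm(x_norm, lr), step_squared(x_sq, lr)
    moved_norm.append(distance(new_norm, x_norm))
    moved_sq.append(distance(new_sq, x_sq))
    x_norm, x_sq = new_norm, new_sq
    print(f"step {k}: norm penalty moved {moved_norm[-1]:.4f}, "
          f"squared penalty moved {moved_sq[-1]:.4f}")
```

Under the unsquared norm every step has the same length, equal to the learning rate. Under the squared norm each step multiplies the weights by \(1 - 2\,\text{lr}\), so the steps shrink geometrically and the weights approach zero without reaching it.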