Linear Approximations

In the study of complex systems - from the orbits of planets to the loss landscapes of deep neural networks - most functions are inherently nonlinear and difficult to solve directly. Linear approximation is the fundamental strategy of calculus: it approximates a complex function \(f(x)\) near a specific point \(x_o\) using the simplest possible tool - a linear function.

Definition: Linearization

The linearization of \(f\) at \(x_o\) is the linear function \(L(x)\) defined by: \[ L(x) = f(x_o) + f'(x_o)(x - x_o) \approx f(x) \tag{1} \] where \(f'(x_o)\) represents the instantaneous rate of change at \(x_o\): \[ f'(x_o) = \lim_{x \to x_o} \frac{f(x)-f(x_o)}{x-x_o}. \]

By constructing the tangent line at \((x_o, f(x_o))\), we create a local model where the function behaves predictably. This local predictability is what allows optimization algorithms to take "steps" toward a minimum, even when the global shape of the function is unknown.

Equivalently, from equation (1), \[ f(x) - f(x_o) \approx f'(x_o)(x -x_o) \] The graph of \(L\) is the tangent line at \((x_o, f(x_o))\). Linear approximations form the foundation of differentiation and provide a local linear model for \(f(x)\). This local predictability is what allows optimization algorithms to take "steps" toward a minimum, even when the global shape of the function is unknown. Now, we extend this concept to differentials.

Differentials

While linearization focuses on the function value, differentials describe the relationship between infinitesimally small changes in input quantities (\(dx\)) and the resulting change in output quantities (\(dy\)).

Consider a differentiable function \(y = f(x)\). We define the differential \(dy\) as: \[ dy = f'(x)dx. \] In practice, \(dx\) and \(dy\) are arbitrary small numbers. Also, if \(dx \neq 0\), then we recover the familiar derivative form: \[ \frac{dy}{dx} = f'(x) \] where the left side now represents the ratio of differentials.

As \(dx \to 0\), the differential \(dy\) becomes an increasingly accurate approximation of the true change \(\Delta y = f(x + dx) - f(x)\). More rigorously, we express this using the asymptotic notation: \[ f(x + dx) - f(x) = f'(x)dx + o(dx). \] where \(o(dx)\) represents higher-order terms that vanish faster than \(dx\) as \(dx \to 0\).
This highlights that the differential \(dy = f'(x)dx\) serves as a linear approximation to \(\Delta y\).

More generally, the differential of \(f\) can be expressed as: \[ df = f(x + dx) - f(x) = f'(x)dx, \] where \(df\) represents the linearized change in \(f(x)\) due to an infinitesimally small change in \(x\).

Here, \(df\) is the change in the output, \(dx\) is the change in the input, and most importantly, \(f'(x)\) acts as a linear operator that maps \(dx\) to \(df\).

Derivative as an Operator

In foundational calculus, we often view the derivative \(f'(x)\) as a static value (the slope). However, in high-dimensional computer science, we must view it as a linear operator.

In the equation \(df = f'(x)dx\), the derivative acts as a "transformer" that maps a small displacement in the input space (\(dx\)) to a displacement in the output space (\(df\)). This perspective is vital when \(x\) is no longer a scalar, but a vector \(\vec{x} \in \mathbb{R}^n\) or a matrix \(X \in \mathbb{R}^{m \times n}\), leading directly to the concept of the Jacobian and gradient.

Vector Derivatives: Optimization Foundations

In machine learning, we rarely optimize single variables. We optimize weight vectors. Applying differential notation to vectors requires careful attention to dimensions and the Product Rule.

Theorem: Differential Product Rule If both \(g\) and \(h\) are differentiable mappings, the differential of their product \(f(x) = g(x)h(x)\) is: \[ df = (dg)h + g(dh). \tag{1} \]

Quick derivation:

\[ \begin{align*} df &= g(x+dx)h(x+dx) - g(x)h(x) \\\\ &= [g(x) +dg][h(x) + dh]-g(x)h(x) \\\\ &= g(x)h(x) + (dg)h + g(dh) + (dg)(dh) -g(x)h(x) \end{align*} \] Then \((dg)(dh)\) is negligible, and we get (1).

Example 1: Squared \(L_2\) Norm \(f(x) = x^Tx\), \(x \in \mathbb{R}^n\)

This function represents the squared distance from the origin — a core component of Mean Squared Error (MSE). In this case, the input is the vector \(x\), and the output is the scalar \(x^Tx\).

To compute the derivative of this function, we start with: \[ f(x) = x^Tx = \sum_{i=1}^n x_i^2 , \] where \(x_i\) is the \(i\)-th entry of the vector \(x\).

Then the gradient of \(f\), denoted by \(\nabla f\) is: \[ \nabla f = \begin{bmatrix} \frac{\partial f }{\partial x_1} \\ \frac{\partial f }{\partial x_2} \\ \vdots \\ \frac{\partial f }{\partial x_n} \end{bmatrix} = \begin{bmatrix} 2x_1 \\ 2x_2\\ \vdots \\ 2x_n \end{bmatrix} = 2x \]

Note: Input: vector & Output: scalar \(\Longrightarrow\) First derivative: column vector (gradient).

Now, let's derive the same result using differential notation. Note: \(dx \in \mathbb{R}^n\).

Example 1:

By the product rule, and the commutativity of the vector inner product: \[ \begin{align*} d(x^Tx) &= (dx^T)x + x^T(dx) \\\\ &= x^Tdx + x^Tdx \\\\ &= 2x^Tdx. \end{align*} \] Note: \(dx^T = (dx)^T\) because \[\begin{align*} d(x^T) &= (x + dx)^T - x^T \\\\ &= x^T + (dx)^T - x^T \\\\ &= (dx)^T \end{align*}. \] Thus the gradient is \[ \nabla f = (2x^T)^T = 2x. \] Note: \(2x^T\) is a "row" vector and to get a column vector \(\nabla f\), we need the transpose of \(2x^T\).

Quadratic Forms \(f(x) = x^TAx\)

Quadratic forms are essential for modeling local curvature — specifically, they form the basis of the Hessian matrix in optimization. Let \(x \in \mathbb{R}^n\) and \(A \in \mathbb{R}^{n \times n}\).

Example 2:

Using the differential representation to analyze the change: \[ \begin{align*} df &= f(x + dx) - f(x) \\\\ &= (x + dx)^T A (x + dx) - x^T A x \\\\ &= x^TAx + dx^T A x + x^TAdx + dx^T A dx - x^T A x \end{align*} \] Here, it is valid to ignore the higher-order term \(dx^T A dx\), which becomes negligible as \(dx \to 0\). Then \[ df = dx^TAx + x^TAdx. \] since \(dx^TAx\) is a scalar, \((dx^TAx)^T = x^TA^Tdx\). Then we get: \[ df = x^TA^Tdx + x^TAdx = x^T(A^T + A)dx \] Here, \[ (A^T + A)^T = A + A^T = (A^T + A) \]. So, \((A^T + A)^T\) is symmetric and thus: \[ \nabla f = (A+A^T)x. \]

Symmetry and Optimization Efficiency

In most machine learning contexts, such as the Second-Order Taylor Expansion of a loss function, the matrix \(A\) is the Hessian, which is symmetric (\(A = A^T\)). In this case, the gradient simplifies to: \[ \nabla f = (A+A^T)x = (A+A)x = 2Ax. \] This symmetry is not just a mathematical curiosity; it is exploited by solvers to reduce memory overhead and computational complexity by nearly half.

Moreover, now we can see Example 1 as a special case: \(A\) is an identity matrix.

The \(L_2\) Norm \(f(x) = \| x \|_2\)

The \(L_2\) norm, or Euclidean distance, is the most common regularization term used to prevent overfitting in AI models. Understanding its derivative is key to understanding how weight decay works.

Example 3:

Let \(r = \| x \|_2\). We know from Example 1 that \(r^2 = x^Tx\). Taking the differential of both sides: \[ \begin{align*} d(r^2) &= d(x^Tx) \\\\ 2rdr &= 2x^Tdx \quad \text{(Using the result from Example 1)} \\\\ dr &= \frac{x^T}{r}dx = \frac{x^T}{\| x \|_2}dx \end{align*} \] Therefore, the gradient is: \[ \nabla \| x \|_2 = \frac{x}{\| x \|_2} \]

Geometric Interpretation: The Unit Vector

The derivative of the \(L_2\) norm is simply the unit vector pointing in the direction of \(x\).

In gradient descent, this means that the force of regularization is independent of the magnitude of the weights in terms of direction; it always pulls the weights directly toward the origin with a constant pressure proportional to the learning rate. This is the mathematical reason why \(L_2\) regularization effectively "shrinks" weights but rarely sets them to exactly zero (unlike \(L_1\) regularization).

Linear Approximations

Loading...