Gradient Descent

Introduction to Optimization

In machine learning, parameter estimation — also known as model fitting — is formulated as an optimization problem: \[ \boldsymbol{\theta}^* \in \operatorname*{arg\,min}_{\boldsymbol{\theta} \in \Theta} \mathcal{L}(\boldsymbol{\theta}), \qquad \Theta \subseteq \mathbb{R}^d, \] where \(\mathcal{L}\) is the loss function (or objective function), \(\Theta\) is the parameter space, and \(d\) is the number of parameters. Since a global minimizer is in general NP-hard to certify, our practical target is a local minimum.

Definition: Local Minimum

A point \(\boldsymbol{\theta}^* \in \Theta\) is a local minimum of \(\mathcal{L}\) if \[ \exists\, \delta > 0 \text{ such that } \mathcal{L}(\boldsymbol{\theta}^*) \leq \mathcal{L}(\boldsymbol{\theta}) \quad \text{for all } \boldsymbol{\theta} \in \Theta \text{ with } \|\boldsymbol{\theta} - \boldsymbol{\theta}^*\| < \delta. \] If the inequality is strict for \(\boldsymbol{\theta} \neq \boldsymbol{\theta}^*\), we call \(\boldsymbol{\theta}^*\) a strict local minimum.

For a twice continuously differentiable objective, local minima can be characterized analytically. Write \(\mathbf{g}(\boldsymbol{\theta}) = \nabla \mathcal{L}(\boldsymbol{\theta})\) for the gradient and \(\mathbf{H}(\boldsymbol{\theta}) = \nabla^2 \mathcal{L}(\boldsymbol{\theta})\) for the Hessian.

Theorem: Second-Order Optimality Conditions

Let \(\mathcal{L} \in C^2\) and let \(\boldsymbol{\theta}^*\) lie in the interior of \(\Theta\).

(Necessary) If \(\boldsymbol{\theta}^*\) is a local minimum, then \(\mathbf{g}(\boldsymbol{\theta}^*) = \mathbf{0}\) and \(\mathbf{H}(\boldsymbol{\theta}^*) \succeq 0\) (positive semidefinite).
(Sufficient) If \(\mathbf{g}(\boldsymbol{\theta}^*) = \mathbf{0}\) and \(\mathbf{H}(\boldsymbol{\theta}^*) \succ 0\) (positive definite), then \(\boldsymbol{\theta}^*\) is a strict local minimum.

Proof sketch:

Both directions rest on the second-order Taylor expansion \[ \mathcal{L}(\boldsymbol{\theta}^* + \mathbf{h}) = \mathcal{L}(\boldsymbol{\theta}^*) + \mathbf{g}(\boldsymbol{\theta}^*)^\top \mathbf{h} + \tfrac{1}{2}\, \mathbf{h}^\top \mathbf{H}(\boldsymbol{\theta}^*) \mathbf{h} + o(\|\mathbf{h}\|^2) \qquad \text{as } \|\mathbf{h}\| \to 0. \]

(1) Necessity. Suppose \(\boldsymbol{\theta}^*\) is a local minimum. If \(\mathbf{g}(\boldsymbol{\theta}^*) \neq \mathbf{0}\), taking \(\mathbf{h} = -\varepsilon\, \mathbf{g}(\boldsymbol{\theta}^*)\) gives \(\mathcal{L}(\boldsymbol{\theta}^* + \mathbf{h}) - \mathcal{L}(\boldsymbol{\theta}^*) = -\varepsilon \|\mathbf{g}(\boldsymbol{\theta}^*)\|^2 + O(\varepsilon^2)\), which is negative for small \(\varepsilon > 0\) — contradicting the local-minimum property. Hence \(\mathbf{g}(\boldsymbol{\theta}^*) = \mathbf{0}\). With this, for any unit vector \(\mathbf{v}\) and small \(\varepsilon > 0\), \(\mathcal{L}(\boldsymbol{\theta}^* + \varepsilon \mathbf{v}) - \mathcal{L}(\boldsymbol{\theta}^*) = \tfrac{\varepsilon^2}{2}\, \mathbf{v}^\top \mathbf{H}(\boldsymbol{\theta}^*) \mathbf{v} + o(\varepsilon^2) \geq 0\) forces \(\mathbf{v}^\top \mathbf{H}(\boldsymbol{\theta}^*) \mathbf{v} \geq 0\) for all \(\mathbf{v}\), i.e., \(\mathbf{H}(\boldsymbol{\theta}^*) \succeq 0\).

(2) Sufficiency. Suppose \(\mathbf{g}(\boldsymbol{\theta}^*) = \mathbf{0}\) and \(\mathbf{H}(\boldsymbol{\theta}^*) \succ 0\). By positive-definiteness, \(\lambda_{\min}(\mathbf{H}(\boldsymbol{\theta}^*)) > 0\). Then for all sufficiently small \(\mathbf{h} \neq \mathbf{0}\), \(\tfrac{1}{2}\mathbf{h}^\top \mathbf{H}(\boldsymbol{\theta}^*) \mathbf{h} + o(\|\mathbf{h}\|^2) \geq \tfrac{1}{2} \lambda_{\min}\, \|\mathbf{h}\|^2 + o(\|\mathbf{h}\|^2) > 0\), so \(\boldsymbol{\theta}^*\) is a strict local minimum.

Convex vs. non-convex landscapes

The sufficient condition (\(\mathbf{H} \succ 0\)) certifies only a strict local minimum. A non-convex objective typically has many stationary points (points where \(\mathbf{g} = \mathbf{0}\)): local minima with \(\mathbf{H} \succeq 0\), local maxima with \(\mathbf{H} \preceq 0\), and saddle points where \(\mathbf{H}\) is indefinite. First-order methods converge to stationary points of any type and cannot distinguish them without further information. This is one reason the convexity structure studied in the next section is so valuable: under convexity, every stationary point is automatically a global minimum.

Convexity

We often design training objectives to be convex, because in a convex optimization problem every local minimum is automatically a global minimum — removing the principal obstacle of non-convex optimization. The underlying geometry is that of a convex set: a subset \(\mathcal{S} \subseteq \mathbb{R}^n\) that contains the line segment joining any two of its points.

Definition: Convex Function

Let \(\mathcal{S} \subseteq \mathbb{R}^n\) be a convex set. A function \(f: \mathcal{S} \to \mathbb{R}\) is a convex function if for every \(\mathbf{x}, \mathbf{y} \in \mathcal{S}\) and every \(\lambda \in [0,1]\), \[ f\!\left(\lambda \mathbf{x} + (1-\lambda)\mathbf{y}\right) \leq \lambda f(\mathbf{x}) + (1-\lambda) f(\mathbf{y}). \] It is strictly convex if the inequality is strict whenever \(\mathbf{x} \neq \mathbf{y}\) and \(\lambda \in (0,1)\).

In one dimension, convexity is captured entirely by the sign of the second derivative. The general multivariate version, via the Hessian, then follows by reduction to the one-dimensional case. We begin with a lemma that links convexity to the first-order Taylor inequality — it is the technical bridge used in both directions of the next theorem.

Lemma: First-Order Characterization of Convexity

Let \(f: I \to \mathbb{R}\) be differentiable on an open interval \(I \subseteq \mathbb{R}\). Then \(f\) is convex on \(I\) if and only if \[ f(y) \geq f(x) + f'(x)(y - x) \qquad \text{for all } x, y \in I. \tag{*} \]

Proof:

(\(\Rightarrow\)) Assume \(f\) is convex. Fix \(x, y \in I\) and \(\lambda \in (0,1)\), and let \(z_\lambda = \lambda y + (1-\lambda)x = x + \lambda(y-x)\). By convexity, \[ f(z_\lambda) \leq \lambda f(y) + (1-\lambda) f(x), \qquad \text{i.e., } \frac{f(x + \lambda(y-x)) - f(x)}{\lambda} \leq f(y) - f(x). \] Letting \(\lambda \downarrow 0\) and using differentiability gives \(f'(x)(y-x) \leq f(y) - f(x)\), which is (*).

(\(\Leftarrow\)) Assume (*) holds. Fix \(a, b \in I\), \(\lambda \in [0,1]\), and set \(z = \lambda a + (1-\lambda) b\). Applying (*) at the base point \(z\) to both \(a\) and \(b\): \[ f(a) \geq f(z) + f'(z)(a - z), \qquad f(b) \geq f(z) + f'(z)(b - z). \] Multiplying the first inequality by \(\lambda\), the second by \((1-\lambda)\), and adding: \[ \lambda f(a) + (1-\lambda) f(b) \;\geq\; f(z) + f'(z)\bigl[\lambda (a-z) + (1-\lambda)(b-z)\bigr]. \] The bracketed term equals \(\lambda a + (1-\lambda) b - z = 0\) by the definition of \(z\), so \(\lambda f(a) + (1-\lambda) f(b) \geq f(z)\), which is convexity.

Theorem: Second-Order Condition for Convexity (Univariate)

Let \(f: I \to \mathbb{R}\) be twice differentiable on an open interval \(I\). Then \(f\) is convex on \(I\) if and only if \(f''(x) \geq 0\) for all \(x \in I\).

Proof:

(\(\Leftarrow\)) Suppose \(f'' \geq 0\) on \(I\). By Taylor's theorem, for any \(x, y \in I\) there exists \(c\) between \(x\) and \(y\) such that \[ f(y) = f(x) + f'(x)(y - x) + \tfrac{1}{2} f''(c)(y - x)^2 \geq f(x) + f'(x)(y - x), \] since the quadratic remainder is non-negative. This is (*) of the preceding lemma, hence \(f\) is convex.

(\(\Rightarrow\)) Suppose \(f\) is convex. By the lemma, (*) holds. Fix \(a < b\) in \(I\) and apply (*) in both directions: \[ f(b) \geq f(a) + f'(a)(b - a) \;\Longrightarrow\; f'(a) \leq \frac{f(b) - f(a)}{b - a}, \] \[ f(a) \geq f(b) + f'(b)(a - b) \;\Longrightarrow\; f'(b) \geq \frac{f(b) - f(a)}{b - a}. \] Hence \(f'(a) \leq f'(b)\) whenever \(a < b\), so \(f'\) is monotonically non-decreasing. For a differentiable \(f'\), this is equivalent to \(f'' \geq 0\).

Convex functions appear throughout machine learning. For example, the cross-entropy loss for classification is convex in the predicted probability vector, and the ReLU activation \(\phi(x) = \max(0, x)\) used in neural networks is convex on \(\mathbb{R}\). (Note that cross-entropy composed with a deep network is convex in the probabilities but generally non-convex in the network parameters — this is the origin of the non-convex optimization landscape that dominates deep learning.)

For the multivariate case, we reduce to the univariate theorem via restriction to lines.

Theorem: Second-Order Condition for Convexity (Multivariate)

Let \(f: \mathcal{S} \to \mathbb{R}\) be \(C^2\) on an open convex set \(\mathcal{S} \subseteq \mathbb{R}^n\). Then:

\(f\) is convex on \(\mathcal{S}\) \(\iff\) \(\nabla^2 f(\mathbf{x}) \succeq 0\) for all \(\mathbf{x} \in \mathcal{S}\);
If \(\nabla^2 f(\mathbf{x}) \succ 0\) for all \(\mathbf{x} \in \mathcal{S}\), then \(f\) is strictly convex (the converse fails in general — e.g., \(f(x) = x^4\) is strictly convex but \(f''(0) = 0\)).

Proof:

For \(\mathbf{x}, \mathbf{y} \in \mathcal{S}\), define the one-variable restriction \(\varphi: [0,1] \to \mathbb{R}\) by \[ \varphi(t) = f\!\left(\mathbf{x} + t(\mathbf{y} - \mathbf{x})\right). \] Since \(f \in C^2\) and the map \(t \mapsto \mathbf{x} + t(\mathbf{y}-\mathbf{x})\) is smooth, \(\varphi\) is \(C^2\) on \([0,1]\) by the chain rule. A direct computation gives \[ \varphi''(t) = (\mathbf{y} - \mathbf{x})^\top\, \nabla^2 f\!\left(\mathbf{x} + t(\mathbf{y}-\mathbf{x})\right)\, (\mathbf{y} - \mathbf{x}). \]

Equivalence of convexity on restrictions. \(f\) is convex on \(\mathcal{S}\) if and only if every such \(\varphi\) is convex on \([0,1]\): the "only if" direction is immediate from the definition, and the "if" direction follows by taking \(t = \lambda\) in \(\varphi(\lambda) \leq \lambda \varphi(1) + (1-\lambda)\varphi(0)\).

(1) By the univariate theorem, each \(\varphi\) is convex \(\iff\) \(\varphi''(t) \geq 0\) on \([0,1]\). Varying \(\mathbf{x}, \mathbf{y}\), this translates to \(\mathbf{v}^\top \nabla^2 f(\mathbf{p}) \mathbf{v} \geq 0\) for all \(\mathbf{p} \in \mathcal{S}\) and all \(\mathbf{v} \in \mathbb{R}^n\), i.e., \(\nabla^2 f \succeq 0\) on \(\mathcal{S}\).

(2) If \(\nabla^2 f \succ 0\) on \(\mathcal{S}\), then \(\varphi''(t) > 0\) for \(\mathbf{x} \neq \mathbf{y}\), so each \(\varphi\) is strictly convex, which gives strict convexity of \(f\).

Why Hessian structure matters for ML

The Hessian governs more than convexity: its spectrum controls the condition number \(\kappa = \lambda_{\max} / \lambda_{\min}\), which determines how quickly gradient-based methods converge — a theme we develop formally in the convergence analysis later in this section. In deep learning, the Hessian's effective rank, spectral gap, and curvature distribution at initialization have become active research objects: they govern generalization, trainability, and the choice of preconditioner.

By the first-order Taylor expansion, \[ \mathcal{L}(\boldsymbol{\theta} + \mathbf{d}) \approx \mathcal{L}(\boldsymbol{\theta}) + \nabla \mathcal{L}(\boldsymbol{\theta})^\top \mathbf{d} \qquad (\|\mathbf{d}\| \text{ small}). \] To make \(\mathcal{L}\) decrease fastest, we want \(\mathbf{d}\) to minimize \(\nabla \mathcal{L}(\boldsymbol{\theta})^\top \mathbf{d}\) subject to \(\|\mathbf{d}\| = 1\). Assuming \(\nabla \mathcal{L}(\boldsymbol{\theta}) \neq \mathbf{0}\), the Cauchy–Schwarz inequality (with its equality case for parallel vectors) gives the minimum at \[ \mathbf{d}^\star = -\frac{\nabla \mathcal{L}(\boldsymbol{\theta})}{\|\nabla \mathcal{L}(\boldsymbol{\theta})\|}, \] the direction of steepest descent. (At a stationary point \(\nabla \mathcal{L}(\boldsymbol{\theta}) = \mathbf{0}\), the first-order model predicts no change in any direction, and the method halts.) Gradient descent iterates along the steepest-descent direction with a chosen step size \(\eta > 0\): \[ \boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} - \eta\, \nabla \mathcal{L}(\boldsymbol{\theta}^{(k)}). \] Under suitable regularity (e.g., \(L\)-smoothness with \(\eta \leq 1/L\); see the convergence analysis later in this section), the iterates converge to a stationary point.

Algorithm 1: GRADIENT_DESCENT Input: objective \(\mathcal{L}\), tolerance \(\epsilon\), learning rate \(\eta\);
Output: \(\boldsymbol{\theta}\) with \(\|\nabla \mathcal{L}(\boldsymbol{\theta})\| < \epsilon\);
begin
\(k \leftarrow 0\); choose an initial point \(\boldsymbol{\theta}^{(0)}\);
repeat
\(\mathbf{d}^{(k)} \leftarrow -\nabla \mathcal{L}(\boldsymbol{\theta}^{(k)})\);
\(\boldsymbol{\theta}^{(k+1)} \leftarrow \boldsymbol{\theta}^{(k)} + \eta\, \mathbf{d}^{(k)}\);
\(k \leftarrow k + 1\);
until \(\|\nabla \mathcal{L}(\boldsymbol{\theta}^{(k)})\| < \epsilon\);
Output \(\boldsymbol{\theta}^{(k)}\);
end

Gradient descent is called a first-order method because it uses only the gradient. Choosing \(\eta\) optimally, rather than by trial-and-error, leads to the line search machinery — the Armijo and Wolfe conditions — developed on the next page on Newton's method; analyzing how fast gradient descent converges leads to the contraction-mapping analysis at the end of this section.

Stochastic Gradient Descent

In stochastic optimization, the objective is an expectation: \[ \mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}_{q(z)}\bigl[\ell(\boldsymbol{\theta}, z)\bigr], \] where \(z\) is a random input (a training example, or a latent noise term) whose distribution \(q(z)\) does not depend on \(\boldsymbol{\theta}\). At each iteration, stochastic gradient descent (SGD) samples \(z^{(k)} \sim q\) and updates \[ \boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} - \eta^{(k)}\, \nabla_{\boldsymbol{\theta}}\, \ell(\boldsymbol{\theta}^{(k)}, z^{(k)}). \] By the Leibniz rule (interchange of expectation and gradient, valid under mild regularity since \(q(z)\) does not depend on \(\boldsymbol{\theta}\)), \(\mathbb{E}_{z^{(k)}}[\nabla_{\boldsymbol{\theta}}\, \ell(\boldsymbol{\theta}, z^{(k)})] = \nabla \mathcal{L}(\boldsymbol{\theta})\), so SGD follows the full gradient in expectation. Classical convergence theory (Robbins–Monro) requires a diminishing step-size schedule \(\sum_k \eta^{(k)} = \infty\) and \(\sum_k (\eta^{(k)})^2 < \infty\), under which the iterates converge (almost surely) to a stationary point, given suitable smoothness and boundedness assumptions on \(\ell\).

In practice, computing the full-batch gradient on every step is prohibitive when the dataset size \(N\) is large. The mini-batch approach is the standard compromise: at each step, draw a random subset \(B \subset \{1, \ldots, N\}\) of size \(|B| \in \{32, 64, 128, \ldots\}\) and use the batch-averaged gradient \[ \hat{\mathbf{g}}^{(k)} = \frac{1}{|B|} \sum_{i \in B} \nabla_{\boldsymbol{\theta}}\, \ell(\boldsymbol{\theta}^{(k)}, z_i). \] This is an unbiased but lower-variance estimator of \(\nabla \mathcal{L}\); mini-batching also maps cleanly onto GPU parallelism, making it the default in modern deep learning.

Algorithm 2: MINI_BATCH_SGD Input: dataset \(X\), objective \(\ell\), tolerance \(\epsilon\), batch size \(|B|\), learning rate \(\eta\), max_epoch;
Output: approximate stationary point \(\boldsymbol{\theta}^*\);
begin
\(k \leftarrow 0\); choose an initial point \(\boldsymbol{\theta}^{(0)}\);
repeat
Shuffle \(X\) randomly;
Partition \(X\) into mini-batches \(\{B_1, B_2, \ldots, B_m\}\) of size \(|B|\);
for each mini-batch \(B_i\):
\(\hat{\mathbf{g}}^{(k)} \leftarrow \frac{1}{|B|} \sum_{i \in B_i} \nabla_{\boldsymbol{\theta}}\, \ell(\boldsymbol{\theta}^{(k)}, z_i)\);
\(\boldsymbol{\theta}^{(k+1)} \leftarrow \boldsymbol{\theta}^{(k)} - \eta\, \hat{\mathbf{g}}^{(k)}\);
\(k \leftarrow k + 1\);
end for
until \(\|\hat{\mathbf{g}}^{(k)}\| < \epsilon\) or \(k \geq\) max_epoch;
Output \(\boldsymbol{\theta}^{(k)}\);
end

Note: One full pass over the shuffled dataset is called an epoch. Shuffling between epochs is important: without it, the gradient estimates across a single epoch are no longer i.i.d., and the noise structure that SGD relies on for escaping shallow basins is impaired.

A sample implementation of mini-batch SGD for linear regression follows.

                    import numpy as np

                    # Fixed parameters for mini-batch SGD
                    MAX_EPOCH = 10000
                    BATCH_SIZE = 64
                    TOLERANCE = 1e-3
                    LEARNING_RATE = 0.01

                    # Sample data dimensions
                    N_SAMPLES = 10000
                    N_FEATURES = 3

                    # Mini-batch Stochastic Gradient Descent
                    def mini_batch_sgd(X, y):
                        n_samples = X.shape[0]
                        theta = np.random.randn(X.shape[1]) * 0.01  # initial theta

                        for i in range(MAX_EPOCH):
                            # Shuffle data
                            indices = np.random.permutation(n_samples)
                            x_shuffled = X[indices]
                            y_shuffled = y[indices]

                            for j in range(0, n_samples, BATCH_SIZE):
                                end = j + BATCH_SIZE
                                x_batch = x_shuffled[j:end]
                                y_batch = y_shuffled[j:end]

                                # Gradient of MSE on the mini-batch
                                g = grad_f(theta, x_batch, y_batch)

                                # Update parameters
                                theta -= LEARNING_RATE * g

                            # Convergence check on the last mini-batch gradient
                            if np.linalg.norm(g) < TOLERANCE:
                                print(f"Converged in {i + 1} epochs.")
                                break
                        return theta

                    # Gradient of the MSE objective for linear regression
                    def grad_f(theta, x_batch, y_batch):
                        return (1 / x_batch.shape[0]) * x_batch.T @ ((x_batch @ theta) - y_batch)

                    # Main
                    if __name__ == "__main__":
                        # Generate sample data
                        X = np.random.rand(N_SAMPLES, N_FEATURES)
                        true_theta = np.array([5, -1, -9])
                        y = X @ true_theta + np.random.randn(X.shape[0]) * 0.1  # additive noise

                        # Run mini-batch SGD
                        estimated_theta = mini_batch_sgd(X, y)

                        # Relative error
                        relative_error = np.linalg.norm(estimated_theta - true_theta) / np.linalg.norm(true_theta)

                        print("True Parameters:     ", true_theta)
                        print("Estimated Parameters:", estimated_theta)
                        print(f"Relative Error:       {relative_error:.6f}")

Sub-gradient Descent

Many ML objectives — notably those using \(\ell_1\) regularization, ReLU, or hinge loss — are convex but not differentiable everywhere. Gradient descent is not directly applicable at the kinks. The remedy is to generalize the gradient to a set-valued notion.

Definition: Subgradient

Let \(f: \mathbb{R}^n \to \mathbb{R}\) be convex and let \(\mathbf{x} \in \operatorname{dom}(f)\). A vector \(\mathbf{g} \in \mathbb{R}^n\) is a subgradient of \(f\) at \(\mathbf{x}\) if \[ f(\mathbf{z}) \geq f(\mathbf{x}) + \mathbf{g}^\top (\mathbf{z} - \mathbf{x}) \qquad \text{for all } \mathbf{z} \in \operatorname{dom}(f). \tag{†} \]

Definition: Subdifferential

The subdifferential of \(f\) at \(\mathbf{x}\), denoted \(\partial f(\mathbf{x})\), is the set of all subgradients of \(f\) at \(\mathbf{x}\): \[ \partial f(\mathbf{x}) = \{\, \mathbf{g} \in \mathbb{R}^n : (\dagger) \text{ holds for all } \mathbf{z} \,\}. \] We say \(f\) is subdifferentiable at \(\mathbf{x}\) if \(\partial f(\mathbf{x}) \neq \emptyset\). For a convex function, \(\partial f(\mathbf{x})\) is a non-empty, closed, convex set at every interior point of \(\operatorname{dom}(f)\), and \(\partial f(\mathbf{x}) = \{\nabla f(\mathbf{x})\}\) precisely when \(f\) is differentiable at \(\mathbf{x}\).

Example: absolute value.

For \(f(x) = |x|\) on \(\mathbb{R}\), \[ \partial f(x) = \begin{cases} \{-1\} & \text{if } x < 0,\\ [-1, 1] & \text{if } x = 0,\\ \{+1\} & \text{if } x > 0. \end{cases} \] At \(x \neq 0\), \(f\) is differentiable and the subdifferential collapses to the ordinary derivative. At \(x = 0\), every slope \(g \in [-1, 1]\) satisfies \(|z| \geq 0 + g \cdot z\) for all \(z\), giving a full interval of subgradients.

Sub-gradient descent selects any \(\mathbf{g}^{(k)} \in \partial \mathcal{L}(\boldsymbol{\theta}^{(k)})\) and updates \[ \boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} - \eta^{(k)}\, \mathbf{g}^{(k)}. \] Because subgradients need not vanish at the optimum (they only contain \(\mathbf{0}\) there), the iterates do not decrease monotonically, and a diminishing step-size schedule is essential for convergence. The method converges on the order of \(O(1/\sqrt{k})\) for general convex functions — slower than gradient descent's \(O(1/k)\) under \(L\)-smoothness — which is the price of handling non-smoothness.

Why this matters for modern ML

The subdifferential underpins the practical use of non-smooth regularizers. Lasso's \(\ell_1\) penalty, the SVM hinge loss, and the ReLU activation are all analyzed through their subdifferentials; the proximal gradient method and ISTA/FISTA algorithms explicitly exploit this structure to achieve sparse solutions. Automatic differentiation frameworks such as PyTorch and JAX resolve the ambiguity at kinks by choosing a canonical element of the subdifferential (e.g., \(0\) for \(\mathrm{ReLU}'(0)\)) — a convention that is mathematically arbitrary but numerically stable.

Gradient Descent First-order Optimization Techniques

Loading...

Introduction to Optimization

Convex vs. non-convex landscapes

Convexity

Why Hessian structure matters for ML