In machine learning, parameter estimation — also known as model fitting — is formulated as an
optimization problem:
\[
\boldsymbol{\theta}^* \in \operatorname*{arg\,min}_{\boldsymbol{\theta} \in \Theta} \mathcal{L}(\boldsymbol{\theta}),
\qquad \Theta \subseteq \mathbb{R}^d,
\]
where \(\mathcal{L}\) is the loss function (or objective function), \(\Theta\) is the parameter space,
and \(d\) is the number of parameters. Since a global minimizer is in general NP-hard to certify, our practical
target is a local minimum.
Definition: Local Minimum
A point \(\boldsymbol{\theta}^* \in \Theta\) is a local minimum of \(\mathcal{L}\) if
\[
\exists\, \delta > 0 \text{ such that } \mathcal{L}(\boldsymbol{\theta}^*) \leq \mathcal{L}(\boldsymbol{\theta})
\quad \text{for all } \boldsymbol{\theta} \in \Theta \text{ with } \|\boldsymbol{\theta} - \boldsymbol{\theta}^*\| < \delta.
\]
If the inequality is strict for \(\boldsymbol{\theta} \neq \boldsymbol{\theta}^*\), we call \(\boldsymbol{\theta}^*\)
a strict local minimum.
For a twice continuously differentiable objective, local minima can be characterized analytically. Write
\(\mathbf{g}(\boldsymbol{\theta}) = \nabla \mathcal{L}(\boldsymbol{\theta})\) for the
gradient and
\(\mathbf{H}(\boldsymbol{\theta}) = \nabla^2 \mathcal{L}(\boldsymbol{\theta})\) for the Hessian.
Theorem: Second-Order Optimality Conditions
Let \(\mathcal{L} \in C^2\) and let \(\boldsymbol{\theta}^*\) lie in the interior of \(\Theta\).
(Necessary) If \(\boldsymbol{\theta}^*\) is a local minimum, then
\(\mathbf{g}(\boldsymbol{\theta}^*) = \mathbf{0}\) and \(\mathbf{H}(\boldsymbol{\theta}^*) \succeq 0\)
(positive semidefinite).
(Sufficient) If \(\mathbf{g}(\boldsymbol{\theta}^*) = \mathbf{0}\) and
\(\mathbf{H}(\boldsymbol{\theta}^*) \succ 0\) (positive definite),
then \(\boldsymbol{\theta}^*\) is a strict local minimum.
Proof sketch:
Both directions rest on the second-order
Taylor expansion
\[
\mathcal{L}(\boldsymbol{\theta}^* + \mathbf{h})
= \mathcal{L}(\boldsymbol{\theta}^*)
+ \mathbf{g}(\boldsymbol{\theta}^*)^\top \mathbf{h}
+ \tfrac{1}{2}\, \mathbf{h}^\top \mathbf{H}(\boldsymbol{\theta}^*) \mathbf{h}
+ o(\|\mathbf{h}\|^2)
\qquad \text{as } \|\mathbf{h}\| \to 0.
\]
(1) Necessity. Suppose \(\boldsymbol{\theta}^*\) is a local minimum.
If \(\mathbf{g}(\boldsymbol{\theta}^*) \neq \mathbf{0}\), taking
\(\mathbf{h} = -\varepsilon\, \mathbf{g}(\boldsymbol{\theta}^*)\) gives
\(\mathcal{L}(\boldsymbol{\theta}^* + \mathbf{h}) - \mathcal{L}(\boldsymbol{\theta}^*)
= -\varepsilon \|\mathbf{g}(\boldsymbol{\theta}^*)\|^2 + O(\varepsilon^2)\), which is negative
for small \(\varepsilon > 0\) — contradicting the local-minimum property. Hence
\(\mathbf{g}(\boldsymbol{\theta}^*) = \mathbf{0}\). With this, for any unit vector \(\mathbf{v}\)
and small \(\varepsilon > 0\),
\(\mathcal{L}(\boldsymbol{\theta}^* + \varepsilon \mathbf{v}) - \mathcal{L}(\boldsymbol{\theta}^*)
= \tfrac{\varepsilon^2}{2}\, \mathbf{v}^\top \mathbf{H}(\boldsymbol{\theta}^*) \mathbf{v} + o(\varepsilon^2) \geq 0\)
forces \(\mathbf{v}^\top \mathbf{H}(\boldsymbol{\theta}^*) \mathbf{v} \geq 0\) for all \(\mathbf{v}\),
i.e., \(\mathbf{H}(\boldsymbol{\theta}^*) \succeq 0\).
(2) Sufficiency. Suppose \(\mathbf{g}(\boldsymbol{\theta}^*) = \mathbf{0}\) and
\(\mathbf{H}(\boldsymbol{\theta}^*) \succ 0\). By positive-definiteness,
\(\lambda_{\min}(\mathbf{H}(\boldsymbol{\theta}^*)) > 0\). Then for all sufficiently small
\(\mathbf{h} \neq \mathbf{0}\),
\(\tfrac{1}{2}\mathbf{h}^\top \mathbf{H}(\boldsymbol{\theta}^*) \mathbf{h} + o(\|\mathbf{h}\|^2)
\geq \tfrac{1}{2} \lambda_{\min}\, \|\mathbf{h}\|^2 + o(\|\mathbf{h}\|^2) > 0\),
so \(\boldsymbol{\theta}^*\) is a strict local minimum.
Convex vs. non-convex landscapes
The sufficient condition (\(\mathbf{H} \succ 0\)) certifies only a strict local minimum.
A non-convex objective typically has many stationary points (points where \(\mathbf{g} = \mathbf{0}\)):
local minima with \(\mathbf{H} \succeq 0\), local maxima with \(\mathbf{H} \preceq 0\), and
saddle points where \(\mathbf{H}\) is indefinite. First-order methods converge to
stationary points of any type and cannot distinguish them without further information. This is
one reason the convexity structure studied in the next section is so valuable: under convexity,
every stationary point is automatically a global minimum.
Convexity
We often design training objectives to be convex, because in a convex optimization problem
every local minimum is automatically a global minimum — removing the principal obstacle of non-convex
optimization. The underlying geometry is that of a
convex set: a subset
\(\mathcal{S} \subseteq \mathbb{R}^n\) that contains the line segment joining any two of its points.
Definition: Convex Function
Let \(\mathcal{S} \subseteq \mathbb{R}^n\) be a convex set. A function \(f: \mathcal{S} \to \mathbb{R}\) is a
convex function if for every \(\mathbf{x}, \mathbf{y} \in \mathcal{S}\) and every
\(\lambda \in [0,1]\),
\[
f\!\left(\lambda \mathbf{x} + (1-\lambda)\mathbf{y}\right)
\leq \lambda f(\mathbf{x}) + (1-\lambda) f(\mathbf{y}).
\]
It is strictly convex if the inequality is strict whenever \(\mathbf{x} \neq \mathbf{y}\)
and \(\lambda \in (0,1)\).
In one dimension, convexity is captured entirely by the sign of the second derivative. The general
multivariate version, via the Hessian, then follows by reduction to the one-dimensional case.
We begin with a lemma that links convexity to the first-order Taylor inequality — it is the technical
bridge used in both directions of the next theorem.
Lemma: First-Order Characterization of Convexity
Let \(f: I \to \mathbb{R}\) be differentiable on an open interval \(I \subseteq \mathbb{R}\).
Then \(f\) is convex on \(I\) if and only if
\[
f(y) \geq f(x) + f'(x)(y - x) \qquad \text{for all } x, y \in I. \tag{*}
\]
Proof:
(\(\Rightarrow\)) Assume \(f\) is convex. Fix \(x, y \in I\) and \(\lambda \in (0,1)\), and let
\(z_\lambda = \lambda y + (1-\lambda)x = x + \lambda(y-x)\). By convexity,
\[
f(z_\lambda) \leq \lambda f(y) + (1-\lambda) f(x),
\qquad \text{i.e., } \frac{f(x + \lambda(y-x)) - f(x)}{\lambda} \leq f(y) - f(x).
\]
Letting \(\lambda \downarrow 0\) and using differentiability gives \(f'(x)(y-x) \leq f(y) - f(x)\),
which is (*).
(\(\Leftarrow\)) Assume (*) holds. Fix \(a, b \in I\), \(\lambda \in [0,1]\), and set
\(z = \lambda a + (1-\lambda) b\). Applying (*) at the base point \(z\) to both \(a\) and \(b\):
\[
f(a) \geq f(z) + f'(z)(a - z), \qquad f(b) \geq f(z) + f'(z)(b - z).
\]
Multiplying the first inequality by \(\lambda\), the second by \((1-\lambda)\), and adding:
\[
\lambda f(a) + (1-\lambda) f(b)
\;\geq\; f(z) + f'(z)\bigl[\lambda (a-z) + (1-\lambda)(b-z)\bigr].
\]
The bracketed term equals \(\lambda a + (1-\lambda) b - z = 0\) by the definition of \(z\),
so \(\lambda f(a) + (1-\lambda) f(b) \geq f(z)\), which is convexity.
Theorem: Second-Order Condition for Convexity (Univariate)
Let \(f: I \to \mathbb{R}\) be twice differentiable on an open interval \(I\). Then \(f\) is convex on
\(I\) if and only if \(f''(x) \geq 0\) for all \(x \in I\).
Proof:
(\(\Leftarrow\)) Suppose \(f'' \geq 0\) on \(I\). By
Taylor's theorem, for any
\(x, y \in I\) there exists \(c\) between \(x\) and \(y\) such that
\[
f(y) = f(x) + f'(x)(y - x) + \tfrac{1}{2} f''(c)(y - x)^2
\geq f(x) + f'(x)(y - x),
\]
since the quadratic remainder is non-negative. This is (*) of the preceding lemma, hence \(f\) is convex.
(\(\Rightarrow\)) Suppose \(f\) is convex. By the lemma, (*) holds. Fix \(a < b\) in \(I\)
and apply (*) in both directions:
\[
f(b) \geq f(a) + f'(a)(b - a) \;\Longrightarrow\; f'(a) \leq \frac{f(b) - f(a)}{b - a},
\]
\[
f(a) \geq f(b) + f'(b)(a - b) \;\Longrightarrow\; f'(b) \geq \frac{f(b) - f(a)}{b - a}.
\]
Hence \(f'(a) \leq f'(b)\) whenever \(a < b\), so \(f'\) is monotonically non-decreasing. For a
differentiable \(f'\), this is equivalent to \(f'' \geq 0\).
Convex functions appear throughout machine learning. For example, the
cross-entropy loss for
classification is convex
in the predicted probability vector, and the ReLU activation \(\phi(x) = \max(0, x)\) used in
neural networks is convex on \(\mathbb{R}\).
(Note that cross-entropy composed with a deep network is convex in the probabilities but generally
non-convex in the network parameters — this is the origin of the non-convex optimization landscape
that dominates deep learning.)
For the multivariate case, we reduce to the univariate theorem via restriction to lines.
Theorem: Second-Order Condition for Convexity (Multivariate)
Let \(f: \mathcal{S} \to \mathbb{R}\) be \(C^2\) on an open convex set \(\mathcal{S} \subseteq \mathbb{R}^n\).
Then:
\(f\) is convex on \(\mathcal{S}\) \(\iff\) \(\nabla^2 f(\mathbf{x}) \succeq 0\) for all \(\mathbf{x} \in \mathcal{S}\);
If \(\nabla^2 f(\mathbf{x}) \succ 0\) for all \(\mathbf{x} \in \mathcal{S}\), then \(f\) is strictly convex
(the converse fails in general — e.g., \(f(x) = x^4\) is strictly convex but \(f''(0) = 0\)).
Proof:
For \(\mathbf{x}, \mathbf{y} \in \mathcal{S}\), define the one-variable restriction
\(\varphi: [0,1] \to \mathbb{R}\) by
\[
\varphi(t) = f\!\left(\mathbf{x} + t(\mathbf{y} - \mathbf{x})\right).
\]
Since \(f \in C^2\) and the map \(t \mapsto \mathbf{x} + t(\mathbf{y}-\mathbf{x})\) is smooth,
\(\varphi\) is \(C^2\) on \([0,1]\) by the chain rule. A direct computation gives
\[
\varphi''(t) = (\mathbf{y} - \mathbf{x})^\top\, \nabla^2 f\!\left(\mathbf{x} + t(\mathbf{y}-\mathbf{x})\right)\, (\mathbf{y} - \mathbf{x}).
\]
Equivalence of convexity on restrictions. \(f\) is convex on \(\mathcal{S}\) if and only if
every such \(\varphi\) is convex on \([0,1]\): the "only if" direction is immediate from the definition,
and the "if" direction follows by taking \(t = \lambda\) in \(\varphi(\lambda) \leq \lambda \varphi(1) + (1-\lambda)\varphi(0)\).
(1) By the univariate theorem, each \(\varphi\) is convex \(\iff\) \(\varphi''(t) \geq 0\)
on \([0,1]\). Varying \(\mathbf{x}, \mathbf{y}\), this translates to
\(\mathbf{v}^\top \nabla^2 f(\mathbf{p}) \mathbf{v} \geq 0\) for all \(\mathbf{p} \in \mathcal{S}\) and all
\(\mathbf{v} \in \mathbb{R}^n\), i.e., \(\nabla^2 f \succeq 0\) on \(\mathcal{S}\).
(2) If \(\nabla^2 f \succ 0\) on \(\mathcal{S}\), then \(\varphi''(t) > 0\) for
\(\mathbf{x} \neq \mathbf{y}\), so each \(\varphi\) is strictly convex, which gives strict convexity of \(f\).
Why Hessian structure matters for ML
The Hessian governs more than convexity: its spectrum controls the condition number
\(\kappa = \lambda_{\max} / \lambda_{\min}\), which determines how quickly gradient-based methods converge
— a theme we develop formally in the convergence analysis
later in this section. In deep learning, the Hessian's effective rank, spectral gap, and curvature
distribution at initialization have become active research objects: they govern generalization,
trainability, and the choice of preconditioner.
Gradient Descent
By the first-order Taylor expansion,
\[
\mathcal{L}(\boldsymbol{\theta} + \mathbf{d})
\approx \mathcal{L}(\boldsymbol{\theta}) + \nabla \mathcal{L}(\boldsymbol{\theta})^\top \mathbf{d}
\qquad (\|\mathbf{d}\| \text{ small}).
\]
To make \(\mathcal{L}\) decrease fastest, we want \(\mathbf{d}\) to minimize
\(\nabla \mathcal{L}(\boldsymbol{\theta})^\top \mathbf{d}\) subject to \(\|\mathbf{d}\| = 1\).
Assuming \(\nabla \mathcal{L}(\boldsymbol{\theta}) \neq \mathbf{0}\), the Cauchy–Schwarz inequality
(with its equality case for parallel vectors) gives the minimum at
\[
\mathbf{d}^\star = -\frac{\nabla \mathcal{L}(\boldsymbol{\theta})}{\|\nabla \mathcal{L}(\boldsymbol{\theta})\|},
\]
the direction of steepest descent. (At a stationary point
\(\nabla \mathcal{L}(\boldsymbol{\theta}) = \mathbf{0}\), the first-order model predicts no change in
any direction, and the method halts.) Gradient descent iterates along the steepest-descent direction
with a chosen step size \(\eta > 0\):
\[
\boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} - \eta\, \nabla \mathcal{L}(\boldsymbol{\theta}^{(k)}).
\]
Under suitable regularity (e.g., \(L\)-smoothness with \(\eta \leq 1/L\); see the
convergence analysis later in this section), the iterates
converge to a stationary point.
Algorithm 1: GRADIENT_DESCENTInput: objective \(\mathcal{L}\), tolerance \(\epsilon\), learning rate \(\eta\); Output: \(\boldsymbol{\theta}\) with \(\|\nabla \mathcal{L}(\boldsymbol{\theta})\| < \epsilon\); begin
\(k \leftarrow 0\); choose an initial point \(\boldsymbol{\theta}^{(0)}\); repeat
\(\mathbf{d}^{(k)} \leftarrow -\nabla \mathcal{L}(\boldsymbol{\theta}^{(k)})\);
\(\boldsymbol{\theta}^{(k+1)} \leftarrow \boldsymbol{\theta}^{(k)} + \eta\, \mathbf{d}^{(k)}\);
\(k \leftarrow k + 1\); until \(\|\nabla \mathcal{L}(\boldsymbol{\theta}^{(k)})\| < \epsilon\);
Output \(\boldsymbol{\theta}^{(k)}\); end
Gradient descent is called a first-order method because it uses only the gradient.
Choosing \(\eta\) optimally, rather than by trial-and-error, leads to the line search
machinery — the
Armijo and
Wolfe conditions —
developed on the next page on Newton's method; analyzing
how fast gradient descent converges leads to the
contraction-mapping analysis at the end of this section.
Stochastic Gradient Descent
In stochastic optimization, the objective is an expectation:
\[
\mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}_{q(z)}\bigl[\ell(\boldsymbol{\theta}, z)\bigr],
\]
where \(z\) is a random input (a training example, or a latent noise term) whose distribution
\(q(z)\) does not depend on \(\boldsymbol{\theta}\). At each iteration, stochastic gradient
descent (SGD) samples \(z^{(k)} \sim q\) and updates
\[
\boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} - \eta^{(k)}\, \nabla_{\boldsymbol{\theta}}\,
\ell(\boldsymbol{\theta}^{(k)}, z^{(k)}).
\]
By the Leibniz rule (interchange of expectation and gradient, valid under mild regularity since
\(q(z)\) does not depend on \(\boldsymbol{\theta}\)),
\(\mathbb{E}_{z^{(k)}}[\nabla_{\boldsymbol{\theta}}\, \ell(\boldsymbol{\theta}, z^{(k)})]
= \nabla \mathcal{L}(\boldsymbol{\theta})\),
so SGD follows the full gradient in expectation. Classical convergence theory (Robbins–Monro) requires a
diminishing step-size schedule \(\sum_k \eta^{(k)} = \infty\) and \(\sum_k (\eta^{(k)})^2 < \infty\),
under which the iterates converge (almost surely) to a stationary point, given suitable smoothness
and boundedness assumptions on \(\ell\).
In practice, computing the full-batch gradient on every step is prohibitive when the dataset size \(N\)
is large. The mini-batch approach is the standard compromise: at each step, draw a
random subset \(B \subset \{1, \ldots, N\}\) of size \(|B| \in \{32, 64, 128, \ldots\}\) and use the
batch-averaged gradient
\[
\hat{\mathbf{g}}^{(k)} = \frac{1}{|B|} \sum_{i \in B} \nabla_{\boldsymbol{\theta}}\, \ell(\boldsymbol{\theta}^{(k)}, z_i).
\]
This is an unbiased but lower-variance estimator of \(\nabla \mathcal{L}\); mini-batching also maps
cleanly onto GPU parallelism, making it the default in modern deep learning.
Algorithm 2: MINI_BATCH_SGDInput: dataset \(X\), objective \(\ell\), tolerance \(\epsilon\), batch size \(|B|\), learning rate \(\eta\), max_epoch; Output: approximate stationary point \(\boldsymbol{\theta}^*\); begin
\(k \leftarrow 0\); choose an initial point \(\boldsymbol{\theta}^{(0)}\); repeat
Shuffle \(X\) randomly;
Partition \(X\) into mini-batches \(\{B_1, B_2, \ldots, B_m\}\) of size \(|B|\); for each mini-batch \(B_i\):
\(\hat{\mathbf{g}}^{(k)} \leftarrow \frac{1}{|B|} \sum_{i \in B_i} \nabla_{\boldsymbol{\theta}}\, \ell(\boldsymbol{\theta}^{(k)}, z_i)\);
\(\boldsymbol{\theta}^{(k+1)} \leftarrow \boldsymbol{\theta}^{(k)} - \eta\, \hat{\mathbf{g}}^{(k)}\);
\(k \leftarrow k + 1\); end for until \(\|\hat{\mathbf{g}}^{(k)}\| < \epsilon\) or \(k \geq\) max_epoch;
Output \(\boldsymbol{\theta}^{(k)}\); end
Note: One full pass over the shuffled dataset is called an epoch. Shuffling between
epochs is important: without it, the gradient estimates across a single epoch are no longer i.i.d.,
and the noise structure that SGD relies on for escaping shallow basins is impaired.
A sample implementation of mini-batch SGD for linear regression follows.
import numpy as np
# Fixed parameters for mini-batch SGD
MAX_EPOCH = 10000
BATCH_SIZE = 64
TOLERANCE = 1e-3
LEARNING_RATE = 0.01
# Sample data dimensions
N_SAMPLES = 10000
N_FEATURES = 3
# Mini-batch Stochastic Gradient Descent
def mini_batch_sgd(X, y):
n_samples = X.shape[0]
theta = np.random.randn(X.shape[1]) * 0.01 # initial theta
for i in range(MAX_EPOCH):
# Shuffle data
indices = np.random.permutation(n_samples)
x_shuffled = X[indices]
y_shuffled = y[indices]
for j in range(0, n_samples, BATCH_SIZE):
end = j + BATCH_SIZE
x_batch = x_shuffled[j:end]
y_batch = y_shuffled[j:end]
# Gradient of MSE on the mini-batch
g = grad_f(theta, x_batch, y_batch)
# Update parameters
theta -= LEARNING_RATE * g
# Convergence check on the last mini-batch gradient
if np.linalg.norm(g) < TOLERANCE:
print(f"Converged in {i + 1} epochs.")
break
return theta
# Gradient of the MSE objective for linear regression
def grad_f(theta, x_batch, y_batch):
return (1 / x_batch.shape[0]) * x_batch.T @ ((x_batch @ theta) - y_batch)
# Main
if __name__ == "__main__":
# Generate sample data
X = np.random.rand(N_SAMPLES, N_FEATURES)
true_theta = np.array([5, -1, -9])
y = X @ true_theta + np.random.randn(X.shape[0]) * 0.1 # additive noise
# Run mini-batch SGD
estimated_theta = mini_batch_sgd(X, y)
# Relative error
relative_error = np.linalg.norm(estimated_theta - true_theta) / np.linalg.norm(true_theta)
print("True Parameters: ", true_theta)
print("Estimated Parameters:", estimated_theta)
print(f"Relative Error: {relative_error:.6f}")
Sub-gradient Descent
Many ML objectives — notably those using \(\ell_1\) regularization, ReLU, or hinge loss — are convex
but not differentiable everywhere. Gradient descent is not directly applicable at the kinks.
The remedy is to generalize the gradient to a set-valued notion.
Definition: Subgradient
Let \(f: \mathbb{R}^n \to \mathbb{R}\) be convex and let \(\mathbf{x} \in \operatorname{dom}(f)\).
A vector \(\mathbf{g} \in \mathbb{R}^n\) is a subgradient of \(f\) at \(\mathbf{x}\) if
\[
f(\mathbf{z}) \geq f(\mathbf{x}) + \mathbf{g}^\top (\mathbf{z} - \mathbf{x})
\qquad \text{for all } \mathbf{z} \in \operatorname{dom}(f). \tag{†}
\]
Definition: Subdifferential
The subdifferential of \(f\) at \(\mathbf{x}\), denoted \(\partial f(\mathbf{x})\),
is the set of all subgradients of \(f\) at \(\mathbf{x}\):
\[
\partial f(\mathbf{x}) = \{\, \mathbf{g} \in \mathbb{R}^n : (\dagger) \text{ holds for all } \mathbf{z} \,\}.
\]
We say \(f\) is subdifferentiable at \(\mathbf{x}\) if
\(\partial f(\mathbf{x}) \neq \emptyset\). For a convex function, \(\partial f(\mathbf{x})\) is a
non-empty, closed, convex set at every interior point of \(\operatorname{dom}(f)\), and
\(\partial f(\mathbf{x}) = \{\nabla f(\mathbf{x})\}\) precisely when \(f\) is differentiable at \(\mathbf{x}\).
Example: absolute value.
For \(f(x) = |x|\) on \(\mathbb{R}\),
\[
\partial f(x) = \begin{cases}
\{-1\} & \text{if } x < 0,\\
[-1, 1] & \text{if } x = 0,\\
\{+1\} & \text{if } x > 0.
\end{cases}
\]
At \(x \neq 0\), \(f\) is differentiable and the subdifferential collapses to the ordinary derivative.
At \(x = 0\), every slope \(g \in [-1, 1]\) satisfies \(|z| \geq 0 + g \cdot z\) for all \(z\), giving
a full interval of subgradients.
Sub-gradient descent selects any \(\mathbf{g}^{(k)} \in \partial \mathcal{L}(\boldsymbol{\theta}^{(k)})\) and updates
\[
\boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} - \eta^{(k)}\, \mathbf{g}^{(k)}.
\]
Because subgradients need not vanish at the optimum (they only contain \(\mathbf{0}\) there), the iterates
do not decrease monotonically, and a diminishing step-size schedule is essential for convergence.
The method converges on the order of \(O(1/\sqrt{k})\) for general convex functions — slower than
gradient descent's \(O(1/k)\) under \(L\)-smoothness — which is the price of handling non-smoothness.
Why this matters for modern ML
The subdifferential underpins the practical use of non-smooth regularizers. Lasso's \(\ell_1\) penalty,
the SVM hinge loss, and the ReLU activation are all analyzed through their subdifferentials; the
proximal gradient method and ISTA/FISTA algorithms explicitly
exploit this structure to achieve sparse solutions. Automatic differentiation frameworks such as
PyTorch and JAX resolve the ambiguity at kinks by choosing a canonical element of the subdifferential
(e.g., \(0\) for \(\mathrm{ReLU}'(0)\)) — a convention that is mathematically arbitrary but
numerically stable.