In the previous page we used a fixed
learning rate (or step size) \(\eta\) chosen by experiment.
A principled alternative is the line search: given a current iterate
\(\boldsymbol{\theta}_t\) and a descent direction \(\mathbf{d}_t\) (i.e., one with
\(\mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t) < 0\)), select the step that minimizes
the objective along the ray \(\{\boldsymbol{\theta}_t + \eta\, \mathbf{d}_t : \eta > 0\}\):
\[
\eta_t \in \operatorname*{arg\,min}_{\eta > 0}\, \mathcal{L}(\boldsymbol{\theta}_t + \eta\, \mathbf{d}_t).
\]
Exact Line Search on a Quadratic
On a strictly convex quadratic
\[
\mathcal{L}(\boldsymbol{\theta}) = \tfrac{1}{2}\, \boldsymbol{\theta}^\top \mathbf{A} \boldsymbol{\theta}
+ \mathbf{b}^\top \boldsymbol{\theta} + c,
\qquad \mathbf{A} \succ 0,
\]
the one-dimensional problem admits a closed-form solution. Differentiating with respect to \(\eta\):
\[
\frac{d}{d\eta}\, \mathcal{L}(\boldsymbol{\theta} + \eta\, \mathbf{d})
= \mathbf{d}^\top (\mathbf{A} \boldsymbol{\theta} + \mathbf{b}) + \eta\, \mathbf{d}^\top \mathbf{A}\, \mathbf{d}.
\]
Setting this to zero and using \(\mathbf{A} \boldsymbol{\theta} + \mathbf{b} = \nabla \mathcal{L}(\boldsymbol{\theta})\):
\[
\eta^*_t = -\,\frac{\mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t)}{\mathbf{d}_t^\top \mathbf{A}\, \mathbf{d}_t}.
\]
Positivity of \(\mathbf{A}\) makes the denominator positive, and the descent-direction condition makes the
numerator negative, so \(\eta^*_t > 0\) as required.
Inexact Line Search
Solving the subproblem exactly at every outer iteration is wasteful. Inexact line searches instead seek
any step \(\eta\) that delivers "sufficient" progress along \(\mathbf{d}_t\). The most common such rule
is the Armijo condition.
Fix \(c_1 \in (0, 1)\). A step size \(\eta > 0\) satisfies the Armijo condition at
\(\boldsymbol{\theta}_t\) along a descent direction \(\mathbf{d}_t\) if
\[
\mathcal{L}(\boldsymbol{\theta}_t + \eta\, \mathbf{d}_t)
\leq \mathcal{L}(\boldsymbol{\theta}_t) + c_1\, \eta\, \mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t).
\tag{1}
\]
Because \(\mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t) < 0\), the right-hand side is strictly
smaller than \(\mathcal{L}(\boldsymbol{\theta}_t)\); thus (1) enforces a decrease proportional to the step
and the directional derivative. A typical choice is \(c_1 = 10^{-4}\).
The standard backtracking implementation starts from an initial trial step \(\eta\) (often
\(\eta = 1\) for Newton-type methods) and shrinks it by a factor \(\tau \in (0, 1)\) (commonly \(\tau = 0.5\))
until (1) holds. By continuity of \(\mathcal{L}\) and the negativity of
\(\mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t)\), (1) is eventually satisfied for
sufficiently small \(\eta\).
Newton's Method
Gradient descent uses a first-order linear model of \(\mathcal{L}\); Newton's method
uses the second-order quadratic model. Instead of following the negative gradient, at each step it
jumps to the minimizer of the local quadratic approximation, exploiting curvature information encoded
in the Hessian.
The second-order Taylor expansion around the current
iterate \(\boldsymbol{\theta}_t\) reads
\[
\mathcal{L}(\boldsymbol{\theta}_t + \Delta\boldsymbol{\theta})
\;\approx\; \mathcal{L}(\boldsymbol{\theta}_t)
+ \mathbf{g}_t^\top \Delta\boldsymbol{\theta}
+ \tfrac{1}{2}\, \Delta\boldsymbol{\theta}^\top \mathbf{H}_t\, \Delta\boldsymbol{\theta},
\]
where \(\mathbf{g}_t = \nabla \mathcal{L}(\boldsymbol{\theta}_t)\) and
\(\mathbf{H}_t = \nabla^2 \mathcal{L}(\boldsymbol{\theta}_t)\). Minimizing the right-hand side over
\(\Delta\boldsymbol{\theta}\) leads to the Newton step.
Proposition: Newton Direction
Let \(\mathbf{H}_t \succ 0\). The quadratic model
\(q(\Delta\boldsymbol{\theta}) = \mathbf{g}_t^\top \Delta\boldsymbol{\theta}
+ \tfrac{1}{2}\, \Delta\boldsymbol{\theta}^\top \mathbf{H}_t\, \Delta\boldsymbol{\theta}\)
has a unique minimizer
\[
\mathbf{d}_t \;=\; -\mathbf{H}_t^{-1}\, \mathbf{g}_t,
\]
called the Newton direction. It is a descent direction whenever \(\mathbf{g}_t \neq \mathbf{0}\):
\(\mathbf{d}_t^\top \mathbf{g}_t = -\mathbf{g}_t^\top \mathbf{H}_t^{-1}\, \mathbf{g}_t < 0\).
Proof:
\(q\) is strictly convex because \(\mathbf{H}_t \succ 0\), so its unique minimizer is the stationary
point. Setting \(\nabla q = \mathbf{g}_t + \mathbf{H}_t\, \Delta\boldsymbol{\theta} = \mathbf{0}\)
and using invertibility of \(\mathbf{H}_t\) gives \(\mathbf{d}_t = -\mathbf{H}_t^{-1} \mathbf{g}_t\).
Positive-definiteness of \(\mathbf{H}_t^{-1}\) (inherited from \(\mathbf{H}_t\)) yields
\(\mathbf{g}_t^\top \mathbf{H}_t^{-1} \mathbf{g}_t > 0\) whenever \(\mathbf{g}_t \neq \mathbf{0}\).
Pure Newton's method takes \(\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \mathbf{d}_t\)
(unit step). This scheme enjoys local quadratic convergence near a strict local minimum,
but can diverge when started far from the optimum or when the Hessian is not positive definite. To
restore global convergence, the Damped Newton's Method scales the Newton direction by
a line-search step size:
\[
\boldsymbol{\theta}_{t+1} \;=\; \boldsymbol{\theta}_t + \eta_t\, \mathbf{d}_t,
\qquad \eta_t \text{ chosen to satisfy the }
\href{#D-armijo_condition}{\textbf{Armijo condition}}.
\]
Algorithm 1: DAMPED_NEWTONInput: objective \(\mathcal{L}\), tolerance \(\epsilon\); Output: approximate stationary point \(\boldsymbol{\theta}_t\); begin
\(t \leftarrow 0\); choose initial point \(\boldsymbol{\theta}_0\); repeat
\(\mathbf{g}_t \leftarrow \nabla \mathcal{L}(\boldsymbol{\theta}_t)\);
\(\mathbf{H}_t \leftarrow \nabla^2 \mathcal{L}(\boldsymbol{\theta}_t)\);
Solve \(\mathbf{H}_t\, \mathbf{d}_t = -\mathbf{g}_t\) for \(\mathbf{d}_t\);
\(\eta_t \leftarrow \mathrm{LINE\_SEARCH}(\boldsymbol{\theta}_t, \mathbf{d}_t)\) satisfying the Armijo condition;
\(\boldsymbol{\theta}_{t+1} \leftarrow \boldsymbol{\theta}_t + \eta_t\, \mathbf{d}_t\);
\(t \leftarrow t + 1\); until \(\|\mathbf{g}_t\| < \epsilon\);
Output \(\boldsymbol{\theta}_t\); end
Computational Note: Solving vs. Inverting
Although we write \(\mathbf{d}_t = -\mathbf{H}_t^{-1}\, \mathbf{g}_t\), the inverse is never formed
explicitly: forming \(\mathbf{H}_t^{-1}\) costs \(O(n^3)\) and is numerically fragile. Instead we
solve the linear system \(\mathbf{H}_t\, \mathbf{d}_t = -\mathbf{g}_t\) directly — by
Cholesky factorization when \(\mathbf{H}_t \succ 0\), or by conjugate gradients for large sparse
problems. The cost is the same asymptotically but the constants and numerical stability are
dramatically better.
When Newton's Method Fails
Newton's direction is a descent direction only when \(\mathbf{H}_t \succ 0\). In non-convex
landscapes — ubiquitous in deep learning — the Hessian is often indefinite near saddle points, so
\(-\mathbf{H}_t^{-1} \mathbf{g}_t\) may point uphill. Practical remedies include
trust-region methods, Levenberg–Marquardt damping
\(\mathbf{H}_t \mapsto \mathbf{H}_t + \mu \mathbf{I}\), and saddle-free Newton
(Dauphin et al., 2014), which replaces \(\mathbf{H}_t\) by \(|\mathbf{H}_t|\) (constructed from the
absolute values of its eigenvalues). Another concern is cost: storing the full Hessian requires
\(O(n^2)\) memory, and solving the Newton system is \(O(n^3)\) — intractable for modern neural
networks with \(n \sim 10^9\). This motivates the quasi-Newton methods of the next
section.
BFGS Method
To avoid the \(O(n^3)\) cost of Hessian computation and inversion, quasi-Newton methods
maintain a symmetric positive-definite matrix \(\mathbf{B}_t \approx \mathbf{H}_t\) (or its inverse
\(\mathbf{C}_t \approx \mathbf{H}_t^{-1}\)) that is updated cheaply at each iteration using only
gradient information. The BFGS method of
Broyden–Fletcher–Goldfarb–Shanno is the quasi-Newton method most widely used in practice.
Secant Condition
Let \(\mathbf{s}_t = \boldsymbol{\theta}_{t+1} - \boldsymbol{\theta}_t\) denote the parameter step and
\(\mathbf{y}_t = \mathbf{g}_{t+1} - \mathbf{g}_t\) the gradient change. A first-order
Taylor expansion of \(\nabla \mathcal{L}\) around
\(\boldsymbol{\theta}_t\) gives
\[
\mathbf{g}_{t+1} \approx \mathbf{g}_t + \mathbf{H}_t\, \mathbf{s}_t
\quad\Longrightarrow\quad \mathbf{y}_t \approx \mathbf{H}_t\, \mathbf{s}_t.
\]
This motivates the quasi-Newton curvature-matching requirement on the updated approximation:
Definition: Secant Condition
A sequence of Hessian approximations \(\{\mathbf{B}_t\}\) satisfies the secant condition if
\[
\mathbf{B}_{t+1}\, \mathbf{s}_t \;=\; \mathbf{y}_t \qquad \text{for all } t.
\]
Equivalently, the inverse approximations \(\mathbf{C}_t = \mathbf{B}_t^{-1}\) satisfy
\(\mathbf{C}_{t+1}\, \mathbf{y}_t = \mathbf{s}_t\).
Derivation of the BFGS Update
The secant condition alone under-determines \(\mathbf{B}_{t+1}\): in \(n\) dimensions there are
\(n(n+1)/2\) free parameters in a symmetric matrix but only \(n\) linear constraints. BFGS makes the
update unique by requiring (i) the update be a symmetric rank-2 correction of
\(\mathbf{B}_t\), and (ii) \(\mathbf{B}_{t+1} \succ 0\) be preserved. Writing \(\mathbf{B}_{t+1} = \mathbf{B}_t + \mathbf{U}\),
the secant condition forces \(\mathbf{U}\, \mathbf{s}_t = \mathbf{y}_t - \mathbf{B}_t\, \mathbf{s}_t\),
and the rank-2 ansatz
\[
\mathbf{U} = \alpha\, \mathbf{y}_t \mathbf{y}_t^\top + \beta\, (\mathbf{B}_t \mathbf{s}_t)(\mathbf{B}_t \mathbf{s}_t)^\top
\]
is chosen so that the first term injects curvature along \(\mathbf{y}_t\) while the second removes the
now-stale curvature along \(\mathbf{B}_t \mathbf{s}_t\). Enforcing \(\mathbf{U}\, \mathbf{s}_t = \mathbf{y}_t - \mathbf{B}_t \mathbf{s}_t\)
pins the scalars to \(\alpha = 1/(\mathbf{y}_t^\top \mathbf{s}_t)\) and
\(\beta = -1/(\mathbf{s}_t^\top \mathbf{B}_t \mathbf{s}_t)\), yielding the BFGS update formula.
Proposition: BFGS Hessian Update
Given \(\mathbf{B}_t \succ 0\) and vectors \(\mathbf{s}_t, \mathbf{y}_t\) satisfying the
curvature condition \(\mathbf{y}_t^\top \mathbf{s}_t > 0\), define
\[
\mathbf{B}_{t+1} \;=\; \mathbf{B}_t
+ \frac{\mathbf{y}_t \mathbf{y}_t^\top}{\mathbf{y}_t^\top \mathbf{s}_t}
- \frac{(\mathbf{B}_t \mathbf{s}_t)(\mathbf{B}_t \mathbf{s}_t)^\top}{\mathbf{s}_t^\top \mathbf{B}_t \mathbf{s}_t}.
\]
Then \(\mathbf{B}_{t+1}\) is symmetric, positive definite, and satisfies the secant condition
\(\mathbf{B}_{t+1}\, \mathbf{s}_t = \mathbf{y}_t\).
Symmetry. Both rank-1 correction terms are outer products of a vector with itself,
hence symmetric. \(\mathbf{B}_t\) is symmetric by hypothesis, so \(\mathbf{B}_{t+1}\) is symmetric.
Positive-definiteness. Note first that \(\mathbf{s}_t \neq \mathbf{0}\): if
\(\mathbf{s}_t = \mathbf{0}\) then \(\mathbf{y}_t^\top \mathbf{s}_t = 0\), contradicting the
curvature condition. Hence \(\mathbf{s}_t^\top \mathbf{B}_t \mathbf{s}_t > 0\) by
\(\mathbf{B}_t \succ 0\), and the scalar below is well-defined. For any
\(\mathbf{v} \neq \mathbf{0}\), decompose by projecting onto \(\mathbf{s}_t\) (in the
\(\mathbf{B}_t\)-inner product): write \(\mathbf{w} = \mathbf{v} - \alpha \mathbf{s}_t\) with
\(\alpha = (\mathbf{s}_t^\top \mathbf{B}_t \mathbf{v})/(\mathbf{s}_t^\top \mathbf{B}_t \mathbf{s}_t)\),
so that \(\mathbf{s}_t^\top \mathbf{B}_t \mathbf{w} = 0\). A direct calculation then gives
\[
\mathbf{v}^\top \mathbf{B}_{t+1} \mathbf{v}
= \mathbf{w}^\top \mathbf{B}_t \mathbf{w}
+ \frac{(\mathbf{v}^\top \mathbf{y}_t)^2}{\mathbf{y}_t^\top \mathbf{s}_t}.
\]
Both terms are \(\geq 0\): the first by \(\mathbf{B}_t \succ 0\), the second because
\(\mathbf{y}_t^\top \mathbf{s}_t > 0\). To see that the sum is strictly positive when
\(\mathbf{v} \neq \mathbf{0}\), suppose both terms vanish. The first vanishes only when
\(\mathbf{w} = \mathbf{0}\), i.e., \(\mathbf{v} = \alpha \mathbf{s}_t\). Substituting into the
second term gives \(\mathbf{v}^\top \mathbf{y}_t = \alpha\, \mathbf{s}_t^\top \mathbf{y}_t\);
since \(\mathbf{s}_t^\top \mathbf{y}_t > 0\), this vanishes only when \(\alpha = 0\), forcing
\(\mathbf{v} = \mathbf{0}\) — a contradiction. Hence
\(\mathbf{v}^\top \mathbf{B}_{t+1} \mathbf{v} > 0\) for every \(\mathbf{v} \neq \mathbf{0}\).
The curvature condition \(\mathbf{y}_t^\top \mathbf{s}_t > 0\) is not automatic — it fails on non-convex
objectives unless the step size \(\eta_t\) is chosen carefully. It is guaranteed by the
Wolfe conditions introduced below, which is why Wolfe (not plain Armijo) is the
standard line search for BFGS.
Proposition: BFGS Inverse Update
Under the same hypotheses as the previous proposition, the inverse \(\mathbf{C}_{t+1} = \mathbf{B}_{t+1}^{-1}\)
admits the closed form
\[
\mathbf{C}_{t+1}
\;=\; \Bigl(\mathbf{I} - \frac{\mathbf{s}_t \mathbf{y}_t^\top}{\mathbf{y}_t^\top \mathbf{s}_t}\Bigr)
\mathbf{C}_t
\Bigl(\mathbf{I} - \frac{\mathbf{y}_t \mathbf{s}_t^\top}{\mathbf{y}_t^\top \mathbf{s}_t}\Bigr)
+ \frac{\mathbf{s}_t \mathbf{s}_t^\top}{\mathbf{y}_t^\top \mathbf{s}_t}.
\]
Proof (via two applications of Sherman–Morrison):
Apply the Sherman–Morrison formula
first to the rank-1 addition \(\mathbf{B}_t + \mathbf{y}_t \mathbf{y}_t^\top / (\mathbf{y}_t^\top \mathbf{s}_t)\):
with \(\mathbf{u} = \mathbf{y}_t\), \(\mathbf{v}^\top = \mathbf{y}_t^\top / (\mathbf{y}_t^\top \mathbf{s}_t)\),
\[
\Bigl(\mathbf{B}_t + \frac{\mathbf{y}_t \mathbf{y}_t^\top}{\mathbf{y}_t^\top \mathbf{s}_t}\Bigr)^{-1}
= \mathbf{C}_t - \frac{\mathbf{C}_t \mathbf{y}_t \mathbf{y}_t^\top \mathbf{C}_t}{\mathbf{y}_t^\top \mathbf{s}_t + \mathbf{y}_t^\top \mathbf{C}_t \mathbf{y}_t}.
\]
Next apply Sherman–Morrison to the rank-1 subtraction of \((\mathbf{B}_t \mathbf{s}_t)(\mathbf{B}_t \mathbf{s}_t)^\top / (\mathbf{s}_t^\top \mathbf{B}_t \mathbf{s}_t)\)
from the intermediate matrix. After collecting terms and repeatedly using
\(\mathbf{C}_t \mathbf{B}_t \mathbf{s}_t = \mathbf{s}_t\) (since \(\mathbf{C}_t = \mathbf{B}_t^{-1}\))
and the identity
\(\mathbf{s}_t^\top \mathbf{B}_t \mathbf{s}_t \cdot (\mathbf{y}_t^\top \mathbf{s}_t + \mathbf{y}_t^\top \mathbf{C}_t \mathbf{y}_t) - (\mathbf{s}_t^\top \mathbf{B}_t \mathbf{s}_t)(\mathbf{y}_t^\top \mathbf{C}_t \mathbf{y}_t)
= \mathbf{s}_t^\top \mathbf{B}_t \mathbf{s}_t \cdot \mathbf{y}_t^\top \mathbf{s}_t\),
the expression telescopes to the stated form. The final form can be verified
mechanically by direct multiplication \(\mathbf{C}_{t+1} \mathbf{B}_{t+1} = \mathbf{I}\).
Why the two-factor form matters
The compact form
\(\mathbf{C}_{t+1} = (\mathbf{I} - \rho_t \mathbf{s}_t \mathbf{y}_t^\top)\, \mathbf{C}_t\, (\mathbf{I} - \rho_t \mathbf{y}_t \mathbf{s}_t^\top) + \rho_t \mathbf{s}_t \mathbf{s}_t^\top\),
where \(\rho_t = 1/(\mathbf{y}_t^\top \mathbf{s}_t)\), exposes the action of the update:
\(\mathbf{C}_{t+1} \mathbf{v}\) can be computed using only inner products and outer-product sums
involving \(\mathbf{s}_t, \mathbf{y}_t\) and \(\mathbf{C}_t \mathbf{v}\) — no explicit matrix storage
of the correction is needed. This is the structural observation that enables L-BFGS
below: by unrolling this two-factor recurrence over the most recent \(m\) pairs, one obtains a
two-loop algorithm that never materializes \(\mathbf{C}_t\) at all.
Wolfe Conditions
BFGS's positive-definiteness preservation and fast convergence both depend on the step
\(\eta_t\) satisfying conditions stronger than Armijo alone. The Wolfe conditions
pair the sufficient-decrease condition with a curvature condition that prevents steps that are too short.
Definition: Wolfe Conditions
Fix \(0 < c_1 < c_2 < 1\). A step size \(\eta > 0\) satisfies the Wolfe conditions
at \(\boldsymbol{\theta}_t\) along a descent direction \(\mathbf{d}_t\) if both hold:
(W2) Curvature:
\[
\mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t + \eta\, \mathbf{d}_t)
\geq c_2\, \mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t).
\]
The strong Wolfe conditions replace (W2) with the tighter two-sided form
\(|\mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t + \eta \mathbf{d}_t)| \leq c_2\, |\mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t)|\).
Typical choices for quasi-Newton methods are \(c_1 = 10^{-4}\) and \(c_2 = 0.9\).
The curvature condition (W2) is precisely what guarantees the BFGS prerequisite
\(\mathbf{y}_t^\top \mathbf{s}_t > 0\). Indeed, for a descent direction
\(\mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t) < 0\), (W2) gives
\[
\mathbf{y}_t^\top \mathbf{s}_t
= \eta_t\, \mathbf{d}_t^\top \bigl(\nabla \mathcal{L}(\boldsymbol{\theta}_{t+1}) - \nabla \mathcal{L}(\boldsymbol{\theta}_t)\bigr)
\geq \eta_t\, (c_2 - 1)\, \mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t)
> 0,
\]
since \(\eta_t > 0\), \(c_2 < 1\), and
\(\mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t) < 0\). This is the reason BFGS line
searches are implemented with (strong) Wolfe rather than plain Armijo.
Limited-Memory BFGS (L-BFGS)
For large-scale problems, storing the \(n \times n\) matrix \(\mathbf{C}_t\) explicitly is infeasible —
\(n \sim 10^6\) already requires terabytes. L-BFGS sidesteps this by never forming
\(\mathbf{C}_t\): it retains only the most recent \(m\) pairs \(\{(\mathbf{s}_i, \mathbf{y}_i)\}\) (typically
\(m \in \{5, \ldots, 20\}\)) and computes \(\mathbf{C}_t\, \mathbf{g}_t\) implicitly via the
two-loop recursion below. A direct expansion of the two-factor form
\(\mathbf{C}_{t+1} = (\mathbf{I} - \rho_t \mathbf{s}_t \mathbf{y}_t^\top)\, \mathbf{C}_t\, (\mathbf{I} - \rho_t \mathbf{y}_t \mathbf{s}_t^\top) + \rho_t \mathbf{s}_t \mathbf{s}_t^\top\)
over the last \(m\) pairs, applied to a vector \(\mathbf{g}_t\) against an implicit initial
\(\mathbf{C}_t^{(0)}\) (often \(\mathbf{I}\)), produces exactly the two-loop recursion, and note that the recursion
is also verifiable mechanically against the explicit formula, as done in our verification. The memory cost
drops to \(O(mn)\) and each iteration costs \(O(mn)\).
Algorithm 2: L_BFGSInput: objective \(\mathcal{L}\), tolerance \(\epsilon\), memory size \(m\); Output: approximate stationary point \(\boldsymbol{\theta}^*\); begin
Choose initial point \(\boldsymbol{\theta}_0\); initialise \((\mathbf{s}, \mathbf{y})\) storage for up to \(m\) past updates;
\(t \leftarrow 0\); repeat
\(\mathbf{g}_t \leftarrow \nabla \mathcal{L}(\boldsymbol{\theta}_t)\); \(\mathbf{q} \leftarrow \mathbf{g}_t\); Backward loop (newest to oldest stored pair):
\(\alpha_i \leftarrow \rho_i\, \mathbf{s}_i^\top \mathbf{q}\); \(\;\mathbf{q} \leftarrow \mathbf{q} - \alpha_i\, \mathbf{y}_i\);
\(\mathbf{r} \leftarrow \mathbf{C}_t^{(0)}\, \mathbf{q}\) (typically \(\mathbf{C}_t^{(0)} = \mathbf{I}\) or a scaled identity); Forward loop (oldest to newest stored pair):
\(\beta_i \leftarrow \rho_i\, \mathbf{y}_i^\top \mathbf{r}\); \(\;\mathbf{r} \leftarrow \mathbf{r} + \mathbf{s}_i (\alpha_i - \beta_i)\);
\(\mathbf{p}_t \leftarrow -\mathbf{r}\);
\(\eta_t \leftarrow \mathrm{LINE\_SEARCH}(\boldsymbol{\theta}_t, \mathbf{p}_t)\) satisfying Wolfe conditions;
\(\boldsymbol{\theta}_{t+1} \leftarrow \boldsymbol{\theta}_t + \eta_t\, \mathbf{p}_t\);
\(\mathbf{g}_{t+1} \leftarrow \nabla \mathcal{L}(\boldsymbol{\theta}_{t+1})\);
Store \(\mathbf{s}_t \leftarrow \boldsymbol{\theta}_{t+1} - \boldsymbol{\theta}_t\), \(\mathbf{y}_t \leftarrow \mathbf{g}_{t+1} - \mathbf{g}_t\), \(\rho_t \leftarrow 1/(\mathbf{y}_t^\top \mathbf{s}_t)\);
If the number of stored pairs exceeds \(m\), discard the oldest pair;
\(t \leftarrow t + 1\); until \(\|\mathbf{g}_t\| < \epsilon\);
Output \(\boldsymbol{\theta}_t\); end
Note: For numerical efficiency, the scalar \(\rho_t = 1/(\mathbf{y}_t^\top \mathbf{s}_t)\) is computed once
and stored alongside each \((\mathbf{s}_t, \mathbf{y}_t)\) pair rather than being recomputed whenever it is
needed in the two loops.
Python Implementation
The sample implementation below tests L-BFGS on the Rosenbrock function
\[
f(\mathbf{x}) = \sum_{i=1}^{n-1} \bigl[a\, (x_{i+1} - x_i^2)^2 + (b - x_i)^2\bigr],
\qquad a = 100,\; b = 1,
\]
a standard optimization benchmark with global minimum \(f(\mathbf{1}) = 0\). Despite the simple
appearance, the Rosenbrock function is notoriously difficult for optimization algorithms because its
global minimum lies at the bottom of a narrow, curved, "banana-shaped" valley. The surrounding landscape
is ill-conditioned (strong anisotropy of the Hessian) and contains broad flat regions where gradients are
small, so naive gradient descent crawls while Newton-type methods must handle curvature carefully.
import numpy as np
# Line search with Wolfe conditions
def line_search(f, grad_f, theta, p, c1 = 1e-4, c2 = 0.9, max_iter = 100):
eta = 1.0
eta_low = 0.0
eta_high = None
phi_0 = f(theta)
grad_phi_0 = np.dot(grad_f(theta), p)
for _ in range(max_iter):
phi_eta = f(theta + eta * p)
# Check Armijo condition
if phi_eta > phi_0 + c1 * eta * grad_phi_0:
eta_high = eta
else:
# Check Curvature condition
grad_phi_eta = np.dot(grad_f(theta + eta * p), p)
if grad_phi_eta < c2 * grad_phi_0:
eta_low = eta
else:
return eta
# Update step size using bisection method
if eta_high is not None:
eta = (eta_low + eta_high) / 2.0
else:
eta *= 2.0
return eta
# Limited memory BFGS
def limited_bfgs(f, grad_f, theta0, m = 10, tol = 1e-6, max_iter = 2000):
theta = theta0.copy()
g = grad_f(theta)
s_list = []
y_list = []
rho_list = [] # We introduce rho = 1/s^\top y instead of directly using 1/s^\top y for efficiency & stability.
for _ in range(max_iter):
if np.linalg.norm(g) < tol:
break
q = g.copy()
alpha_list = [] # Need this for the second loop to avoid computing alpha again
# Loop backward through stored (s, y) pairs
for s, y, rho in reversed(list(zip(s_list, y_list, rho_list))):
alpha = rho * np.dot(s, q)
alpha_list.append(alpha)
q -= alpha * y
# Initial Hessian approximation is identity: H0 = I, so H0 * q = q
# Note: In the first iteration (when s_list is empty), the loops do nothing,
# so r = q = g, and thus p = -r = -g (steepest descent direction)
r = q
# Loop forward through stored (s, y) pairs
for (s, y, rho), alpha in zip(zip(s_list, y_list, rho_list), reversed(alpha_list)):
beta = rho * np.dot(y, r)
r += s * (alpha - beta)
# Search direction
p = -r
# Compute the step size by Line search satisfying Wolfe conditions
eta = line_search(f, grad_f, theta, p)
# Update parameters
theta += eta * p
grad_next = grad_f(theta)
# Update memory for (s, y) pairs
s = eta * p
y = grad_next - g
if np.dot(s, y) > 1e-10 * np.linalg.norm(s) * np.linalg.norm(y):
if len(s_list) == m:
s_list.pop(0)
y_list.pop(0)
rho_list.pop(0)
s_list.append(s)
y_list.append(y)
rho_list.append(1.0 / np.dot(y, s))
# Update gradient
g = grad_next
return theta
# Objective function and its gradient:
# The Rosenbrock function is commonly used for testing optimization algorithms.
def rosenbrock(x):
return np.sum(100 * (x[1:] - x[:-1]**2)**2 + (1 - x[:-1])**2)
def grad_rosenbrock(x):
grad = np.zeros_like(x)
grad[:-1] = -400 * x[:-1] * (x[1:] - x[:-1]**2) - 2 * (1 - x[:-1]) # # x_1 to x_n-1
grad[1:] += 200 * (x[1:] - x[:-1]**2) # x_2 to x_n
return grad
# Finite difference gradient to check grad_rosenbrock().
def finite_difference_gradient(f, x, epsilon=1e-6):
grad = np.zeros_like(x)
for i in range(len(x)):
x_forward = x.copy()
x_backward = x.copy()
x_forward[i] += epsilon
x_backward[i] -= epsilon
grad[i] = (f(x_forward) - f(x_backward)) / (2 * epsilon)
return grad
if __name__ == "__main__":
n = 50 # Dimensionality
# Randomly generate an initial point x0:
# You could try : x0 = np.ones(n) + 0.3 * np.random.randn(n), which represents adding small
# perturbation around the global minimum x* = [1, ... , 1]^\top
x0 = np.random.randn(n)
numeric_opt = limited_bfgs(rosenbrock, grad_rosenbrock, x0)
print("Initial point: \n", x0.tolist())
print("\n Numerical optimum: \n", numeric_opt.tolist())
'''
# If you are not sure about the gradient of the objective, always you can compare it with finite difference.
grad_analytic = grad_rosenbrock(x0)
grad_numeric = finite_difference_gradient(rosenbrock, x0)
relative_error_grad = np.linalg.norm(grad_analytic - grad_numeric) / np.linalg.norm(grad_analytic)
print("Relative error:", relative_error_grad)
'''
# Relative error between the numerical optimum and the global optimum
global_opt = np.ones(n) # The actual global optimum of Rosenbrock function is x* = [1, ... , 1]^\top
relative_error_optimal = np.linalg.norm(numeric_opt - global_opt) / np.linalg.norm(global_opt)
print(f"\n Relative error to the global minimum: {relative_error_optimal*100:.8f}%")
Why Rosenbrock is a hard benchmark
Even though \(\mathbf{x}^* = (1, 1, \ldots, 1)\) is easy to read off analytically, numerical optimizers
struggle with the Rosenbrock landscape for three compounding reasons:
Narrow, curved valley: the minimum lies at the bottom of a bent ravine. Steep
walls deflect iterates and trigger over- or undershooting unless curvature information is used.
Ill-conditioning: along the valley the Hessian has eigenvalues differing by
several orders of magnitude — the condition number is large, slowing first-order methods
dramatically (cf. the condition-number analysis
later in this section).
Broad flat regions: outside the valley, the gradient is small but the function is
still far from its minimum, so step sizes must be large without overshooting the valley walls.
These three properties — anisotropic curvature, ill-conditioning, and flat plateaus — are exactly
the pathologies that plague loss landscapes in deep learning. This is why Rosenbrock remains the
standard testbed for new optimization algorithms.