Newton's Method

Line Search

In the previous page we used a fixed learning rate (or step size) \(\eta\) chosen by experiment. A principled alternative is the line search: given a current iterate \(\boldsymbol{\theta}_t\) and a descent direction \(\mathbf{d}_t\) (i.e., one with \(\mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t) < 0\)), select the step that minimizes the objective along the ray \(\{\boldsymbol{\theta}_t + \eta\, \mathbf{d}_t : \eta > 0\}\): \[ \eta_t \in \operatorname*{arg\,min}_{\eta > 0}\, \mathcal{L}(\boldsymbol{\theta}_t + \eta\, \mathbf{d}_t). \]

Exact Line Search on a Quadratic

On a strictly convex quadratic \[ \mathcal{L}(\boldsymbol{\theta}) = \tfrac{1}{2}\, \boldsymbol{\theta}^\top \mathbf{A} \boldsymbol{\theta} + \mathbf{b}^\top \boldsymbol{\theta} + c, \qquad \mathbf{A} \succ 0, \] the one-dimensional problem admits a closed-form solution. Differentiating with respect to \(\eta\): \[ \frac{d}{d\eta}\, \mathcal{L}(\boldsymbol{\theta} + \eta\, \mathbf{d}) = \mathbf{d}^\top (\mathbf{A} \boldsymbol{\theta} + \mathbf{b}) + \eta\, \mathbf{d}^\top \mathbf{A}\, \mathbf{d}. \] Setting this to zero and using \(\mathbf{A} \boldsymbol{\theta} + \mathbf{b} = \nabla \mathcal{L}(\boldsymbol{\theta})\): \[ \eta^*_t = -\,\frac{\mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t)}{\mathbf{d}_t^\top \mathbf{A}\, \mathbf{d}_t}. \] Positivity of \(\mathbf{A}\) makes the denominator positive, and the descent-direction condition makes the numerator negative, so \(\eta^*_t > 0\) as required.

Inexact Line Search

Solving the subproblem exactly at every outer iteration is wasteful. Inexact line searches instead seek any step \(\eta\) that delivers "sufficient" progress along \(\mathbf{d}_t\). The most common such rule is the Armijo condition.

Definition: Armijo (Sufficient-Decrease) Condition

Fix \(c_1 \in (0, 1)\). A step size \(\eta > 0\) satisfies the Armijo condition at \(\boldsymbol{\theta}_t\) along a descent direction \(\mathbf{d}_t\) if \[ \mathcal{L}(\boldsymbol{\theta}_t + \eta\, \mathbf{d}_t) \leq \mathcal{L}(\boldsymbol{\theta}_t) + c_1\, \eta\, \mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t). \tag{1} \] Because \(\mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t) < 0\), the right-hand side is strictly smaller than \(\mathcal{L}(\boldsymbol{\theta}_t)\); thus (1) enforces a decrease proportional to the step and the directional derivative. A typical choice is \(c_1 = 10^{-4}\).

The standard backtracking implementation starts from an initial trial step \(\eta\) (often \(\eta = 1\) for Newton-type methods) and shrinks it by a factor \(\tau \in (0, 1)\) (commonly \(\tau = 0.5\)) until (1) holds. By continuity of \(\mathcal{L}\) and the negativity of \(\mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t)\), (1) is eventually satisfied for sufficiently small \(\eta\).

Gradient descent uses a first-order linear model of \(\mathcal{L}\); Newton's method uses the second-order quadratic model. Instead of following the negative gradient, at each step it jumps to the minimizer of the local quadratic approximation, exploiting curvature information encoded in the Hessian.

The second-order Taylor expansion around the current iterate \(\boldsymbol{\theta}_t\) reads \[ \mathcal{L}(\boldsymbol{\theta}_t + \Delta\boldsymbol{\theta}) \;\approx\; \mathcal{L}(\boldsymbol{\theta}_t) + \mathbf{g}_t^\top \Delta\boldsymbol{\theta} + \tfrac{1}{2}\, \Delta\boldsymbol{\theta}^\top \mathbf{H}_t\, \Delta\boldsymbol{\theta}, \] where \(\mathbf{g}_t = \nabla \mathcal{L}(\boldsymbol{\theta}_t)\) and \(\mathbf{H}_t = \nabla^2 \mathcal{L}(\boldsymbol{\theta}_t)\). Minimizing the right-hand side over \(\Delta\boldsymbol{\theta}\) leads to the Newton step.

Proposition: Newton Direction

Let \(\mathbf{H}_t \succ 0\). The quadratic model \(q(\Delta\boldsymbol{\theta}) = \mathbf{g}_t^\top \Delta\boldsymbol{\theta} + \tfrac{1}{2}\, \Delta\boldsymbol{\theta}^\top \mathbf{H}_t\, \Delta\boldsymbol{\theta}\) has a unique minimizer \[ \mathbf{d}_t \;=\; -\mathbf{H}_t^{-1}\, \mathbf{g}_t, \] called the Newton direction. It is a descent direction whenever \(\mathbf{g}_t \neq \mathbf{0}\): \(\mathbf{d}_t^\top \mathbf{g}_t = -\mathbf{g}_t^\top \mathbf{H}_t^{-1}\, \mathbf{g}_t < 0\).

Proof:

\(q\) is strictly convex because \(\mathbf{H}_t \succ 0\), so its unique minimizer is the stationary point. Setting \(\nabla q = \mathbf{g}_t + \mathbf{H}_t\, \Delta\boldsymbol{\theta} = \mathbf{0}\) and using invertibility of \(\mathbf{H}_t\) gives \(\mathbf{d}_t = -\mathbf{H}_t^{-1} \mathbf{g}_t\). Positive-definiteness of \(\mathbf{H}_t^{-1}\) (inherited from \(\mathbf{H}_t\)) yields \(\mathbf{g}_t^\top \mathbf{H}_t^{-1} \mathbf{g}_t > 0\) whenever \(\mathbf{g}_t \neq \mathbf{0}\).

Pure Newton's method takes \(\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \mathbf{d}_t\) (unit step). This scheme enjoys local quadratic convergence near a strict local minimum, but can diverge when started far from the optimum or when the Hessian is not positive definite. To restore global convergence, the Damped Newton's Method scales the Newton direction by a line-search step size: \[ \boldsymbol{\theta}_{t+1} \;=\; \boldsymbol{\theta}_t + \eta_t\, \mathbf{d}_t, \qquad \eta_t \text{ chosen to satisfy the } \href{#D-armijo_condition}{\textbf{Armijo condition}}. \]

Algorithm 1: DAMPED_NEWTON Input: objective \(\mathcal{L}\), tolerance \(\epsilon\);
Output: approximate stationary point \(\boldsymbol{\theta}_t\);
begin
\(t \leftarrow 0\); choose initial point \(\boldsymbol{\theta}_0\);
repeat
\(\mathbf{g}_t \leftarrow \nabla \mathcal{L}(\boldsymbol{\theta}_t)\);
\(\mathbf{H}_t \leftarrow \nabla^2 \mathcal{L}(\boldsymbol{\theta}_t)\);
Solve \(\mathbf{H}_t\, \mathbf{d}_t = -\mathbf{g}_t\) for \(\mathbf{d}_t\);
\(\eta_t \leftarrow \mathrm{LINE\_SEARCH}(\boldsymbol{\theta}_t, \mathbf{d}_t)\) satisfying the Armijo condition;
\(\boldsymbol{\theta}_{t+1} \leftarrow \boldsymbol{\theta}_t + \eta_t\, \mathbf{d}_t\);
\(t \leftarrow t + 1\);
until \(\|\mathbf{g}_t\| < \epsilon\);
Output \(\boldsymbol{\theta}_t\);
end

Computational Note: Solving vs. Inverting

Although we write \(\mathbf{d}_t = -\mathbf{H}_t^{-1}\, \mathbf{g}_t\), the inverse is never formed explicitly: forming \(\mathbf{H}_t^{-1}\) costs \(O(n^3)\) and is numerically fragile. Instead we solve the linear system \(\mathbf{H}_t\, \mathbf{d}_t = -\mathbf{g}_t\) directly — by Cholesky factorization when \(\mathbf{H}_t \succ 0\), or by conjugate gradients for large sparse problems. The cost is the same asymptotically but the constants and numerical stability are dramatically better.

When Newton's Method Fails

Newton's direction is a descent direction only when \(\mathbf{H}_t \succ 0\). In non-convex landscapes — ubiquitous in deep learning — the Hessian is often indefinite near saddle points, so \(-\mathbf{H}_t^{-1} \mathbf{g}_t\) may point uphill. Practical remedies include trust-region methods, Levenberg–Marquardt damping \(\mathbf{H}_t \mapsto \mathbf{H}_t + \mu \mathbf{I}\), and saddle-free Newton (Dauphin et al., 2014), which replaces \(\mathbf{H}_t\) by \(|\mathbf{H}_t|\) (constructed from the absolute values of its eigenvalues). Another concern is cost: storing the full Hessian requires \(O(n^2)\) memory, and solving the Newton system is \(O(n^3)\) — intractable for modern neural networks with \(n \sim 10^9\). This motivates the quasi-Newton methods of the next section.

BFGS Method

To avoid the \(O(n^3)\) cost of Hessian computation and inversion, quasi-Newton methods maintain a symmetric positive-definite matrix \(\mathbf{B}_t \approx \mathbf{H}_t\) (or its inverse \(\mathbf{C}_t \approx \mathbf{H}_t^{-1}\)) that is updated cheaply at each iteration using only gradient information. The BFGS method of Broyden–Fletcher–Goldfarb–Shanno is the quasi-Newton method most widely used in practice.

Secant Condition

Let \(\mathbf{s}_t = \boldsymbol{\theta}_{t+1} - \boldsymbol{\theta}_t\) denote the parameter step and \(\mathbf{y}_t = \mathbf{g}_{t+1} - \mathbf{g}_t\) the gradient change. A first-order Taylor expansion of \(\nabla \mathcal{L}\) around \(\boldsymbol{\theta}_t\) gives \[ \mathbf{g}_{t+1} \approx \mathbf{g}_t + \mathbf{H}_t\, \mathbf{s}_t \quad\Longrightarrow\quad \mathbf{y}_t \approx \mathbf{H}_t\, \mathbf{s}_t. \] This motivates the quasi-Newton curvature-matching requirement on the updated approximation:

Definition: Secant Condition

A sequence of Hessian approximations \(\{\mathbf{B}_t\}\) satisfies the secant condition if \[ \mathbf{B}_{t+1}\, \mathbf{s}_t \;=\; \mathbf{y}_t \qquad \text{for all } t. \] Equivalently, the inverse approximations \(\mathbf{C}_t = \mathbf{B}_t^{-1}\) satisfy \(\mathbf{C}_{t+1}\, \mathbf{y}_t = \mathbf{s}_t\).

Derivation of the BFGS Update

The secant condition alone under-determines \(\mathbf{B}_{t+1}\): in \(n\) dimensions there are \(n(n+1)/2\) free parameters in a symmetric matrix but only \(n\) linear constraints. BFGS makes the update unique by requiring (i) the update be a symmetric rank-2 correction of \(\mathbf{B}_t\), and (ii) \(\mathbf{B}_{t+1} \succ 0\) be preserved. Writing \(\mathbf{B}_{t+1} = \mathbf{B}_t + \mathbf{U}\), the secant condition forces \(\mathbf{U}\, \mathbf{s}_t = \mathbf{y}_t - \mathbf{B}_t\, \mathbf{s}_t\), and the rank-2 ansatz \[ \mathbf{U} = \alpha\, \mathbf{y}_t \mathbf{y}_t^\top + \beta\, (\mathbf{B}_t \mathbf{s}_t)(\mathbf{B}_t \mathbf{s}_t)^\top \] is chosen so that the first term injects curvature along \(\mathbf{y}_t\) while the second removes the now-stale curvature along \(\mathbf{B}_t \mathbf{s}_t\). Enforcing \(\mathbf{U}\, \mathbf{s}_t = \mathbf{y}_t - \mathbf{B}_t \mathbf{s}_t\) pins the scalars to \(\alpha = 1/(\mathbf{y}_t^\top \mathbf{s}_t)\) and \(\beta = -1/(\mathbf{s}_t^\top \mathbf{B}_t \mathbf{s}_t)\), yielding the BFGS update formula.

Proposition: BFGS Hessian Update

Given \(\mathbf{B}_t \succ 0\) and vectors \(\mathbf{s}_t, \mathbf{y}_t\) satisfying the curvature condition \(\mathbf{y}_t^\top \mathbf{s}_t > 0\), define \[ \mathbf{B}_{t+1} \;=\; \mathbf{B}_t + \frac{\mathbf{y}_t \mathbf{y}_t^\top}{\mathbf{y}_t^\top \mathbf{s}_t} - \frac{(\mathbf{B}_t \mathbf{s}_t)(\mathbf{B}_t \mathbf{s}_t)^\top}{\mathbf{s}_t^\top \mathbf{B}_t \mathbf{s}_t}. \] Then \(\mathbf{B}_{t+1}\) is symmetric, positive definite, and satisfies the secant condition \(\mathbf{B}_{t+1}\, \mathbf{s}_t = \mathbf{y}_t\).

Proof:

Secant condition. Compute directly: \[ \mathbf{B}_{t+1}\, \mathbf{s}_t = \mathbf{B}_t \mathbf{s}_t + \mathbf{y}_t\, \frac{\mathbf{y}_t^\top \mathbf{s}_t}{\mathbf{y}_t^\top \mathbf{s}_t} - (\mathbf{B}_t \mathbf{s}_t)\, \frac{\mathbf{s}_t^\top \mathbf{B}_t \mathbf{s}_t}{\mathbf{s}_t^\top \mathbf{B}_t \mathbf{s}_t} = \mathbf{B}_t \mathbf{s}_t + \mathbf{y}_t - \mathbf{B}_t \mathbf{s}_t = \mathbf{y}_t. \]

Symmetry. Both rank-1 correction terms are outer products of a vector with itself, hence symmetric. \(\mathbf{B}_t\) is symmetric by hypothesis, so \(\mathbf{B}_{t+1}\) is symmetric.

Positive-definiteness. Note first that \(\mathbf{s}_t \neq \mathbf{0}\): if \(\mathbf{s}_t = \mathbf{0}\) then \(\mathbf{y}_t^\top \mathbf{s}_t = 0\), contradicting the curvature condition. Hence \(\mathbf{s}_t^\top \mathbf{B}_t \mathbf{s}_t > 0\) by \(\mathbf{B}_t \succ 0\), and the scalar below is well-defined. For any \(\mathbf{v} \neq \mathbf{0}\), decompose by projecting onto \(\mathbf{s}_t\) (in the \(\mathbf{B}_t\)-inner product): write \(\mathbf{w} = \mathbf{v} - \alpha \mathbf{s}_t\) with \(\alpha = (\mathbf{s}_t^\top \mathbf{B}_t \mathbf{v})/(\mathbf{s}_t^\top \mathbf{B}_t \mathbf{s}_t)\), so that \(\mathbf{s}_t^\top \mathbf{B}_t \mathbf{w} = 0\). A direct calculation then gives \[ \mathbf{v}^\top \mathbf{B}_{t+1} \mathbf{v} = \mathbf{w}^\top \mathbf{B}_t \mathbf{w} + \frac{(\mathbf{v}^\top \mathbf{y}_t)^2}{\mathbf{y}_t^\top \mathbf{s}_t}. \] Both terms are \(\geq 0\): the first by \(\mathbf{B}_t \succ 0\), the second because \(\mathbf{y}_t^\top \mathbf{s}_t > 0\). To see that the sum is strictly positive when \(\mathbf{v} \neq \mathbf{0}\), suppose both terms vanish. The first vanishes only when \(\mathbf{w} = \mathbf{0}\), i.e., \(\mathbf{v} = \alpha \mathbf{s}_t\). Substituting into the second term gives \(\mathbf{v}^\top \mathbf{y}_t = \alpha\, \mathbf{s}_t^\top \mathbf{y}_t\); since \(\mathbf{s}_t^\top \mathbf{y}_t > 0\), this vanishes only when \(\alpha = 0\), forcing \(\mathbf{v} = \mathbf{0}\) — a contradiction. Hence \(\mathbf{v}^\top \mathbf{B}_{t+1} \mathbf{v} > 0\) for every \(\mathbf{v} \neq \mathbf{0}\).

The curvature condition \(\mathbf{y}_t^\top \mathbf{s}_t > 0\) is not automatic — it fails on non-convex objectives unless the step size \(\eta_t\) is chosen carefully. It is guaranteed by the Wolfe conditions introduced below, which is why Wolfe (not plain Armijo) is the standard line search for BFGS.

Proposition: BFGS Inverse Update

Under the same hypotheses as the previous proposition, the inverse \(\mathbf{C}_{t+1} = \mathbf{B}_{t+1}^{-1}\) admits the closed form \[ \mathbf{C}_{t+1} \;=\; \Bigl(\mathbf{I} - \frac{\mathbf{s}_t \mathbf{y}_t^\top}{\mathbf{y}_t^\top \mathbf{s}_t}\Bigr) \mathbf{C}_t \Bigl(\mathbf{I} - \frac{\mathbf{y}_t \mathbf{s}_t^\top}{\mathbf{y}_t^\top \mathbf{s}_t}\Bigr) + \frac{\mathbf{s}_t \mathbf{s}_t^\top}{\mathbf{y}_t^\top \mathbf{s}_t}. \]

Proof (via two applications of Sherman–Morrison):

Apply the Sherman–Morrison formula first to the rank-1 addition \(\mathbf{B}_t + \mathbf{y}_t \mathbf{y}_t^\top / (\mathbf{y}_t^\top \mathbf{s}_t)\): with \(\mathbf{u} = \mathbf{y}_t\), \(\mathbf{v}^\top = \mathbf{y}_t^\top / (\mathbf{y}_t^\top \mathbf{s}_t)\), \[ \Bigl(\mathbf{B}_t + \frac{\mathbf{y}_t \mathbf{y}_t^\top}{\mathbf{y}_t^\top \mathbf{s}_t}\Bigr)^{-1} = \mathbf{C}_t - \frac{\mathbf{C}_t \mathbf{y}_t \mathbf{y}_t^\top \mathbf{C}_t}{\mathbf{y}_t^\top \mathbf{s}_t + \mathbf{y}_t^\top \mathbf{C}_t \mathbf{y}_t}. \] Next apply Sherman–Morrison to the rank-1 subtraction of \((\mathbf{B}_t \mathbf{s}_t)(\mathbf{B}_t \mathbf{s}_t)^\top / (\mathbf{s}_t^\top \mathbf{B}_t \mathbf{s}_t)\) from the intermediate matrix. After collecting terms and repeatedly using \(\mathbf{C}_t \mathbf{B}_t \mathbf{s}_t = \mathbf{s}_t\) (since \(\mathbf{C}_t = \mathbf{B}_t^{-1}\)) and the identity \(\mathbf{s}_t^\top \mathbf{B}_t \mathbf{s}_t \cdot (\mathbf{y}_t^\top \mathbf{s}_t + \mathbf{y}_t^\top \mathbf{C}_t \mathbf{y}_t) - (\mathbf{s}_t^\top \mathbf{B}_t \mathbf{s}_t)(\mathbf{y}_t^\top \mathbf{C}_t \mathbf{y}_t) = \mathbf{s}_t^\top \mathbf{B}_t \mathbf{s}_t \cdot \mathbf{y}_t^\top \mathbf{s}_t\), the expression telescopes to the stated form. The final form can be verified mechanically by direct multiplication \(\mathbf{C}_{t+1} \mathbf{B}_{t+1} = \mathbf{I}\).

Why the two-factor form matters

The compact form \(\mathbf{C}_{t+1} = (\mathbf{I} - \rho_t \mathbf{s}_t \mathbf{y}_t^\top)\, \mathbf{C}_t\, (\mathbf{I} - \rho_t \mathbf{y}_t \mathbf{s}_t^\top) + \rho_t \mathbf{s}_t \mathbf{s}_t^\top\), where \(\rho_t = 1/(\mathbf{y}_t^\top \mathbf{s}_t)\), exposes the action of the update: \(\mathbf{C}_{t+1} \mathbf{v}\) can be computed using only inner products and outer-product sums involving \(\mathbf{s}_t, \mathbf{y}_t\) and \(\mathbf{C}_t \mathbf{v}\) — no explicit matrix storage of the correction is needed. This is the structural observation that enables L-BFGS below: by unrolling this two-factor recurrence over the most recent \(m\) pairs, one obtains a two-loop algorithm that never materializes \(\mathbf{C}_t\) at all.

Wolfe Conditions

BFGS's positive-definiteness preservation and fast convergence both depend on the step \(\eta_t\) satisfying conditions stronger than Armijo alone. The Wolfe conditions pair the sufficient-decrease condition with a curvature condition that prevents steps that are too short.

Definition: Wolfe Conditions

Fix \(0 < c_1 < c_2 < 1\). A step size \(\eta > 0\) satisfies the Wolfe conditions at \(\boldsymbol{\theta}_t\) along a descent direction \(\mathbf{d}_t\) if both hold:

(W1) Sufficient decrease (Armijo): \[ \mathcal{L}(\boldsymbol{\theta}_t + \eta\, \mathbf{d}_t) \leq \mathcal{L}(\boldsymbol{\theta}_t) + c_1\, \eta\, \mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t). \]

(W2) Curvature: \[ \mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t + \eta\, \mathbf{d}_t) \geq c_2\, \mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t). \] The strong Wolfe conditions replace (W2) with the tighter two-sided form \(|\mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t + \eta \mathbf{d}_t)| \leq c_2\, |\mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t)|\). Typical choices for quasi-Newton methods are \(c_1 = 10^{-4}\) and \(c_2 = 0.9\).

The curvature condition (W2) is precisely what guarantees the BFGS prerequisite \(\mathbf{y}_t^\top \mathbf{s}_t > 0\). Indeed, for a descent direction \(\mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t) < 0\), (W2) gives \[ \mathbf{y}_t^\top \mathbf{s}_t = \eta_t\, \mathbf{d}_t^\top \bigl(\nabla \mathcal{L}(\boldsymbol{\theta}_{t+1}) - \nabla \mathcal{L}(\boldsymbol{\theta}_t)\bigr) \geq \eta_t\, (c_2 - 1)\, \mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t) > 0, \] since \(\eta_t > 0\), \(c_2 < 1\), and \(\mathbf{d}_t^\top \nabla \mathcal{L}(\boldsymbol{\theta}_t) < 0\). This is the reason BFGS line searches are implemented with (strong) Wolfe rather than plain Armijo.

Limited-Memory BFGS (L-BFGS)

For large-scale problems, storing the \(n \times n\) matrix \(\mathbf{C}_t\) explicitly is infeasible — \(n \sim 10^6\) already requires terabytes. L-BFGS sidesteps this by never forming \(\mathbf{C}_t\): it retains only the most recent \(m\) pairs \(\{(\mathbf{s}_i, \mathbf{y}_i)\}\) (typically \(m \in \{5, \ldots, 20\}\)) and computes \(\mathbf{C}_t\, \mathbf{g}_t\) implicitly via the two-loop recursion below. A direct expansion of the two-factor form \(\mathbf{C}_{t+1} = (\mathbf{I} - \rho_t \mathbf{s}_t \mathbf{y}_t^\top)\, \mathbf{C}_t\, (\mathbf{I} - \rho_t \mathbf{y}_t \mathbf{s}_t^\top) + \rho_t \mathbf{s}_t \mathbf{s}_t^\top\) over the last \(m\) pairs, applied to a vector \(\mathbf{g}_t\) against an implicit initial \(\mathbf{C}_t^{(0)}\) (often \(\mathbf{I}\)), produces exactly the two-loop recursion, and note that the recursion is also verifiable mechanically against the explicit formula, as done in our verification. The memory cost drops to \(O(mn)\) and each iteration costs \(O(mn)\).

Algorithm 2: L_BFGS Input: objective \(\mathcal{L}\), tolerance \(\epsilon\), memory size \(m\);
Output: approximate stationary point \(\boldsymbol{\theta}^*\);
begin
Choose initial point \(\boldsymbol{\theta}_0\); initialise \((\mathbf{s}, \mathbf{y})\) storage for up to \(m\) past updates;
\(t \leftarrow 0\);
repeat
\(\mathbf{g}_t \leftarrow \nabla \mathcal{L}(\boldsymbol{\theta}_t)\); \(\mathbf{q} \leftarrow \mathbf{g}_t\);
Backward loop (newest to oldest stored pair):
\(\alpha_i \leftarrow \rho_i\, \mathbf{s}_i^\top \mathbf{q}\); \(\;\mathbf{q} \leftarrow \mathbf{q} - \alpha_i\, \mathbf{y}_i\);
\(\mathbf{r} \leftarrow \mathbf{C}_t^{(0)}\, \mathbf{q}\) (typically \(\mathbf{C}_t^{(0)} = \mathbf{I}\) or a scaled identity);
Forward loop (oldest to newest stored pair):
\(\beta_i \leftarrow \rho_i\, \mathbf{y}_i^\top \mathbf{r}\); \(\;\mathbf{r} \leftarrow \mathbf{r} + \mathbf{s}_i (\alpha_i - \beta_i)\);
\(\mathbf{p}_t \leftarrow -\mathbf{r}\);
\(\eta_t \leftarrow \mathrm{LINE\_SEARCH}(\boldsymbol{\theta}_t, \mathbf{p}_t)\) satisfying Wolfe conditions;
\(\boldsymbol{\theta}_{t+1} \leftarrow \boldsymbol{\theta}_t + \eta_t\, \mathbf{p}_t\);
\(\mathbf{g}_{t+1} \leftarrow \nabla \mathcal{L}(\boldsymbol{\theta}_{t+1})\);
Store \(\mathbf{s}_t \leftarrow \boldsymbol{\theta}_{t+1} - \boldsymbol{\theta}_t\), \(\mathbf{y}_t \leftarrow \mathbf{g}_{t+1} - \mathbf{g}_t\), \(\rho_t \leftarrow 1/(\mathbf{y}_t^\top \mathbf{s}_t)\);
If the number of stored pairs exceeds \(m\), discard the oldest pair;
\(t \leftarrow t + 1\);
until \(\|\mathbf{g}_t\| < \epsilon\);
Output \(\boldsymbol{\theta}_t\);
end

Note: For numerical efficiency, the scalar \(\rho_t = 1/(\mathbf{y}_t^\top \mathbf{s}_t)\) is computed once and stored alongside each \((\mathbf{s}_t, \mathbf{y}_t)\) pair rather than being recomputed whenever it is needed in the two loops.

Python Implementation

The sample implementation below tests L-BFGS on the Rosenbrock function \[ f(\mathbf{x}) = \sum_{i=1}^{n-1} \bigl[a\, (x_{i+1} - x_i^2)^2 + (b - x_i)^2\bigr], \qquad a = 100,\; b = 1, \] a standard optimization benchmark with global minimum \(f(\mathbf{1}) = 0\). Despite the simple appearance, the Rosenbrock function is notoriously difficult for optimization algorithms because its global minimum lies at the bottom of a narrow, curved, "banana-shaped" valley. The surrounding landscape is ill-conditioned (strong anisotropy of the Hessian) and contains broad flat regions where gradients are small, so naive gradient descent crawls while Newton-type methods must handle curvature carefully.

                            import numpy as np

                            # Line search with Wolfe conditions 
                            def line_search(f, grad_f, theta, p, c1 = 1e-4, c2 = 0.9, max_iter = 100):
                                eta = 1.0
                                eta_low = 0.0
                                eta_high = None
                            
                                phi_0 = f(theta)
                                grad_phi_0 = np.dot(grad_f(theta), p)
                            
                                for _ in range(max_iter):
                                    phi_eta = f(theta + eta * p)
                            
                                    # Check Armijo condition
                                    if phi_eta > phi_0 + c1 * eta * grad_phi_0:
                                        eta_high = eta
                                    else:
                                        # Check Curvature condition
                                        grad_phi_eta = np.dot(grad_f(theta + eta * p), p)
                                        if grad_phi_eta < c2 * grad_phi_0:
                                            eta_low = eta
                                        else:
                                            return eta
                            
                                    # Update step size using bisection method
                                    if eta_high is not None:
                                        eta = (eta_low + eta_high) / 2.0
                                    else:
                                        eta *= 2.0
                            
                                return eta
                            
                            # Limited memory BFGS 
                            def limited_bfgs(f, grad_f, theta0, m = 10, tol = 1e-6, max_iter = 2000):
                                
                                theta = theta0.copy()
                                g = grad_f(theta)
                                s_list = []
                                y_list = []
                                rho_list = [] # We introduce rho =  1/s^\top y instead of directly using 1/s^\top y for efficiency & stability. 
                            
                                for _ in range(max_iter):
                                    
                                    if np.linalg.norm(g) < tol:
                                        break
                            
                                    q = g.copy()
                                    alpha_list = []  # Need this for the second loop to avoid computing alpha again 
                            
                                    # Loop backward through stored (s, y) pairs
                                    for s, y, rho in reversed(list(zip(s_list, y_list, rho_list))):
                                        alpha = rho * np.dot(s, q)
                                        alpha_list.append(alpha)
                                        q -= alpha * y
                            
                                    # Initial Hessian approximation is identity: H0 = I, so H0 * q = q
                                    # Note: In the first iteration (when s_list is empty), the loops do nothing,
                                    # so r = q = g, and thus p = -r = -g (steepest descent direction)
                                    r = q
                            
                                    # Loop forward through stored (s, y) pairs
                                    for (s, y, rho), alpha in zip(zip(s_list, y_list, rho_list), reversed(alpha_list)):
                                        beta = rho * np.dot(y, r)
                                        r += s * (alpha - beta)
                            
                                    # Search direction
                                    p = -r
                            
                                    # Compute the step size by Line search satisfying Wolfe conditions
                                    eta = line_search(f, grad_f, theta, p)
                            
                                    # Update parameters
                                    theta += eta * p
                                    grad_next = grad_f(theta)
                            
                                    # Update memory for (s, y) pairs
                                    s = eta * p
                                    y = grad_next - g
                                    if np.dot(s, y) > 1e-10 * np.linalg.norm(s) * np.linalg.norm(y): 
                                        if len(s_list) == m:
                                            s_list.pop(0)
                                            y_list.pop(0)
                                            rho_list.pop(0)
                                        s_list.append(s)
                                        y_list.append(y)
                                        rho_list.append(1.0 / np.dot(y, s))
                                        
                                    # Update gradient
                                    g = grad_next
                            
                                return theta
                            
                            # Objective function and its gradient: 
                            # The Rosenbrock function is commonly used for testing optimization algorithms.
                            def rosenbrock(x):
                                return np.sum(100 * (x[1:] - x[:-1]**2)**2 + (1 - x[:-1])**2)
                            
                            def grad_rosenbrock(x):
                                grad = np.zeros_like(x)
                                grad[:-1] = -400 * x[:-1] * (x[1:] - x[:-1]**2) - 2 * (1 - x[:-1]) # # x_1 to x_n-1
                                grad[1:] += 200 * (x[1:] - x[:-1]**2) # x_2 to x_n
                                return grad
                            
                            # Finite difference gradient to check grad_rosenbrock().
                            def finite_difference_gradient(f, x, epsilon=1e-6):
                                grad = np.zeros_like(x)
                                for i in range(len(x)):
                                    x_forward = x.copy()
                                    x_backward = x.copy()
                                    x_forward[i] += epsilon
                                    x_backward[i] -= epsilon
                                    grad[i] = (f(x_forward) - f(x_backward)) / (2 * epsilon)
                                return grad
                            
                            if __name__ == "__main__":
                                
                                n = 50 # Dimensionality
                                
                                # Randomly generate an initial point x0:
                                # You could try : x0 = np.ones(n) + 0.3 * np.random.randn(n), which represents adding small 
                                # perturbation around the global minimum x* = [1, ... , 1]^\top 
                                x0 =  np.random.randn(n)  
                                numeric_opt = limited_bfgs(rosenbrock, grad_rosenbrock, x0)
                                print("Initial point: \n", x0.tolist())
                                print("\n Numerical optimum: \n", numeric_opt.tolist())
                            
                                '''
                                # If you are not sure about the gradient of the objective, always you can compare it with finite difference.
                                grad_analytic = grad_rosenbrock(x0)
                                grad_numeric = finite_difference_gradient(rosenbrock, x0)
                                relative_error_grad = np.linalg.norm(grad_analytic - grad_numeric) / np.linalg.norm(grad_analytic)
                                print("Relative error:", relative_error_grad) 
                                ''' 
                                # Relative error between the numerical optimum and the global optimum
                                global_opt = np.ones(n)  # The actual global optimum of Rosenbrock function is x* = [1, ... , 1]^\top
                                relative_error_optimal = np.linalg.norm(numeric_opt - global_opt) / np.linalg.norm(global_opt)
                                print(f"\n Relative error to the global minimum: {relative_error_optimal*100:.8f}%")

Why Rosenbrock is a hard benchmark

Even though \(\mathbf{x}^* = (1, 1, \ldots, 1)\) is easy to read off analytically, numerical optimizers struggle with the Rosenbrock landscape for three compounding reasons:

Narrow, curved valley: the minimum lies at the bottom of a bent ravine. Steep walls deflect iterates and trigger over- or undershooting unless curvature information is used.
Ill-conditioning: along the valley the Hessian has eigenvalues differing by several orders of magnitude — the condition number is large, slowing first-order methods dramatically (cf. the condition-number analysis later in this section).
Broad flat regions: outside the valley, the gradient is small but the function is still far from its minimum, so step sizes must be large without overshooting the valley walls.

These three properties — anisotropic curvature, ill-conditioning, and flat plateaus — are exactly the pathologies that plague loss landscapes in deep learning. This is why Rosenbrock remains the standard testbed for new optimization algorithms.

Newton's Method Second-order Optimization Techniques

Loading...