In the previous chapter, we used a "fixed" learning rate(or step size), which was chosen
based on experimental results. However, the "optimal" step size can be determined by line search:
\[
\eta_t = \arg \min _{\eta > 0} \mathcal{L} (\theta_t + \eta d_t) \in \mathbb{R}.
\]
Consider a quadratic loss:
\[
\mathcal{L}(\theta + \eta \, d)
= \frac{1}{2}(\theta+ \eta d)^\top A (\theta + \eta \, d) + b^\top (\theta + \eta d) + c.
\]
where \(\theta, d \in \mathbb{R}^n\), and \(\eta, c \in \mathbb{R}\).
Taking derivative of this function:
\[
\begin{align*}
\frac{d \mathcal{L}(\theta + \eta d)}{d\eta} &= \frac{1}{2} d^\top 2A(\theta + \eta d) + d^\top b \\\\
&= d^\top (A\theta + b) + \eta d^\top A d
\end{align*}
\]
Setting the derivative equal to zero, we have:
\[
\eta = - \frac{d^\top (A\theta + b)}{d^\top A d}.
\]
Note that we assunme the matrix \(A\) is symmetric positive definite (SPD), ensuring the quadratic
is strictly convex and has a unique minimum.
In practice, we don't need the "exact" line search because it can be expensive to solve this sub-optimization
problem "at each step."
For example, we can start with some initial step size, and then reduce it by a factor \(c \in (0, 1)\) at each
iteration until we satisfy the following condition:
\[
\mathcal{L}(\theta_t + \eta \, d_t) \leq \mathcal{L}(\theta_t) + c \, \eta \, d_t^\top \nabla \mathcal{L}(\theta_t) \tag{1}
\]
This is called Armijo condition(Sufficient Decrease condition) that ensures sufficient
decreasing of our objective function.
Note: Usually, we set \(c = 10^{-4}\).
Newton's Method
While gradient descent relies on a first-order linear approximation, Newton's Method utilizes a
second-order Taylor expansion to find the stationary point of a function. By incorporating the
curvature (Hessian matrix) of the loss function, it creates a quadratic approximation locally
and jumps to the minimum of that quadratic.
Consider the second-order approximation of \(\mathcal{L}\) around the current point \(\theta_t\):
\[
\mathcal{L}(\theta_t + \Delta \theta) \approx \mathcal{L}(\theta_t) + \nabla \mathcal{L}(\theta_t)^\top \Delta \theta + \frac{1}{2} \Delta \theta^\top \nabla^2 \mathcal{L}(\theta_t) \Delta \theta.
\]
To find the optimal change \(\Delta \theta\) that minimizes this approximation, we set the derivative with respect
to \(\Delta \theta\) to zero:
\[
\nabla \mathcal{L}(\theta_t) + \nabla^2 \mathcal{L}(\theta_t) \Delta \theta = 0.
\]
Let \(g_t = \nabla \mathcal{L}(\theta_t)\) be the gradient and \(H_t = \nabla^2 \mathcal{L}(\theta_t)\) be the Hessian.
Solving for \(\Delta \theta\) gives us the Newton Direction, denoted as \(d_t\):
\[
H_t d_t = -g_t \implies d_t = -H_t^{-1} g_t.
\]
Note that we assume \(H_t = \nabla^2 \mathcal{L}(\theta_t)\) is positive definite
(which holds when \(\mathcal{L}\) is locally strictly convex). Otherwise, we cannot solve for \(d_t\).
In the pure Newton's method, the update is simply \(\theta_{t+1} = \theta_t + d_t\). However, to ensure
convergence even when the function is not perfectly quadratic, we often employ Damped Newton's Method.
This involves performing a line search along the direction \(d_t\) to find a step size \(\eta_t\)
that satisfies sufficient decrease conditions (such as the Armijo rule).
Algorithm 1: NEWTONS_METHODInput: objective function \(\mathcal{L}\), tolerance \(\epsilon\), initial step size \(\eta_o\); Output: stationary point \(\theta_t\); begin
\(t \leftarrow 0\);
Choose an initial point \(\theta_o\); repeat:
Compute gradient: \(g_t = \nabla \mathcal{L}(\theta_t);\)
Compute Hessian: \(H_t = \nabla^2 \mathcal{L}(\theta_t);\)
Solve \(H_t d_t = -g_t\) for \(d_t\);
Do LINE_SEARCH to get step size \(\eta_t\) along \(d_t\);
Update parameters: \(\theta_{t+1} = \theta_t + \eta_t d_t;\)
\( t \leftarrow t + 1 ;\) until \(\| g_t \| < \epsilon\);
Output \(\theta_t\); end
Computational Note: Solving vs. Inverting
A critical issue of Newton's method is computing the inverse Hessian \(H_t^{-1}\) at
each \(t\) step, which is obviously expensive.
Mathematically, we write \(d_t = -H^{-1}g_t\). However, in algorithmic practice (as shown in the pseudocode),
we rarely compute the inverse explicitly due to numerical instability and high cost. Instead, we solve
the system of linear equations \(H_t d_t = -g_t\) using methods like Cholesky decomposition or Conjugate Gradient.
BFGS Method
To mitigate the computational burden of the Hessian, Quasi-Newton methods like
BFGS (Broyden-Fletcher-Goldfarb-Shanno) are used. These algorithms iteratively construct
an approximation of the Hessian (or its inverse) using only gradient information.
Approximation of Hessian in BFGS
\[
\begin{align*}
H_{t+1} &\approx B_{t+1} \\\\
&= B_{t} + \frac{y_t y_t^\top}{y_t^\top s_t} - \frac{(B_t s_t)(B_t s_t)^\top }{s_t^\top B_t s_t}
\end{align*}
\]
where \(s_t = \theta_{t+1} - \theta_t\) is the step in the parameter space and \(y_t = g_{t+1} - g_t\) is the change in the gradient
of objective \(\mathcal{L}\).
Alternatively, we can iteratively update an approximation to the inverse Hessian:
\[
\begin{align*}
H_{t+1}^{-1} &\approx C_{t+1} \\\\
&= \Big(I - \frac{s_t y_t^\top}{y_t^\top s_t} \Big) C_t \Big( I - \frac{y_t s_t^\top}{y_t^\top s_t}\Big) + \frac{s_t s_t^\top}{y_t^\top s_t}.
\end{align*}
\]
By the Taylor expansion of the gradient \(g(\theta) = \nabla \mathcal{L}(\theta)\) around \(\theta_t\), neglecting higher-order terms:
\[
\begin{align*}
&g(\theta_{t+1}) \approx g(\theta_t) + H(\theta_t) (\theta_{t+1}-\theta_t) \\\\
&\Longrightarrow y_t = H_t s_t,
\end{align*}
\]
where
\[
\begin{align*}
&y_t = g(\theta_{t+1}) - g(\theta_t) \\\\
&s_t = \theta_{t+1}-\theta_t
\end{align*}
\]
To efficiently approximate the Hessian \(H_t\), we compute the updated Hessian approximation \(B_{t+1}\) at the end of step \(t\).
It must satisfy the secant condition:
\[
B_{t+1} s_t = y_t.
\]
This condition ensures that the updated Hessian approximation \(B_{t+1} \) reflects the curvature
relationship between the step \(s_t\) and the gradient change \(y_t\).
(Note: The Wolfe curvature condition ensures \(y_t^\top s_t >0\), which is necessary for the BFGS update to
maintain positive definiteness.)
Assuming \(B_{t+1}\) is a positive definite rank-2 update of \(B_t\):
\[
B_{t+1} = B_t + U,
\]
the secant condition becomes:
\[
(B_t + U) s_t = y_t \Longrightarrow U s_t = y_t - B_t s_t.
\]
Here, the correction matrix \(U\) must account for the discrepancy \(y_t - B_t s_t\).
We assume \(U\) takes the form:
\[
U = \frac{y_t y_t^\top}{y_t^\top s_t} - \frac{(B_t s_t)(B_t s_t)^\top }{s_t^\top B_t s_t}.
\]
The first term adds curvature in the direction of \(y_t\), normalized by \(y_t^\top s_t\), ensuring positive definiteness.
The second term removes excess curvature in the direction of \(s_t\) ensuring the secant condition is satisfied without over-correction.
Verification:
\[
\begin{align*}
&B_{t+1} = B_t + \frac{y_t y_t^\top}{y_t^\top s_t} - \frac{(B_t s_t)(B_t s_t)^\top }{s_t^\top B_t s_t} \\\\
&\Longrightarrow B_{t+1} s_t = B_t s_t + \Big\{\frac{y_t y_t^\top}{y_t^\top s_t} - \frac{(B_t s_t)(B_t s_t)^\top }{s_t^\top B_t s_t}\Big\} s_t\\\\
&\Longrightarrow B_{t+1} s_t = B_t s_t + y_t \cdot \frac{y_t^\top s_t}{y_t^\top s_t} - B_t s_t \cdot \frac{(s_t^\top B_t s_t)}{s_t^\top B_t s_t}\\\\
&\Longrightarrow B_{t+1} s_t = B_t s_t + y_t - B_t s_t \\\\
&\Longrightarrow B_{t+1} s_t = y_t
\end{align*}
\]
Therefore, \(B_{t+1}\) satisfies the secant condition, maintaining positive definiteness while approximating
the curvature of the objective function.
If the initial \(B_0\) is positive definite (usually, \(B_0 = I\)), and the step size \(\eta\) is found via Condition (1) and the following
curvature condition:
\[
d_t^\top \nabla \mathcal{L}(\theta_t + \eta \, d_t) \geq c_2 \, d_t^\top \nabla \mathcal{L}(\theta_t) \tag{2}
\]
where \(c_2 \in (c_1, 1)\).
Then \(B_{t+1}\) remains positive definite. Conditions (1) and (2) together are called the Wolfe conditions.
\[d_t^\top \nabla \mathcal{L}(\theta_t + \eta \, d_t) \geq c_2 \, d_t^\top \nabla \mathcal{L}(\theta_t)\]
where \(0 < c_1 < c_2 < 1\). For example, \(c_1 = 10^{-4}, c_2 = 0.9\) for Quasi-Newton methods.
Note: In the strong Wolfe conditions, the curvature condition becomes
\[
| d_t^\top \nabla \mathcal{L}(\theta_t + \eta \, d_t) | \leq c_2 \, | d_t^\top \nabla \mathcal{L}(\theta_t) |
\]
In practice, we cannot store the whole Hessian approximation for large-scale problems. It costs \(O(n^2)\) memory space.
In Limited-memory BFGS (L-BFGS), we only use the \(m\) most recent \((s_t, y_t)\) pairs, and do not store the Hessian approximation
explicitly. Instead, we approximate \(H_t^{-1} g_t\) by computing a sequence of inner products of these vectors. The memory requirement
is \(O(mn)\), where \(m\) is typically 5 to 20.
Algorithm 2: L_BFGSInput: objective function \(\mathcal{L}\), tolerance \(\epsilon\), initial step size \(\eta_0\);
Output: stationary point \(\theta^*\);
begin
Choose an initial point \(\theta_0\);
Initialize \( (s, y) \) storage for \( m \) past updates;
\(t \leftarrow 0\);
repeat:
Compute gradient \( g_t = \nabla \mathcal{L}(\theta_t) \);
Set \( q \leftarrow g_t \);
Loop backward (from newest to oldest stored \( (s, y) \) pairs):
Compute \( \alpha_i = \rho_i \, s_i^\top q \);
Update \( q \leftarrow q - \alpha_i y_i \);
Set \( r \leftarrow H_t^0 q \);
Loop forward (from oldest to newest stored \( (s, y) \) pairs):
Compute \( \beta_i = \rho_i \, y_i^\top r \);
Update \( r \leftarrow r + s_i (\alpha_i - \beta_i) \);
Set search direction \( p_t \leftarrow -r \);
Do LINE_SEARCH: find step size \( \eta_t \) along \(p_t\) satisfying Wolfe conditions;
Update parameters: \( \theta_{t+1} \leftarrow \theta_t + \eta_t p_t \);
Compute gradient \( g_{t+1} \leftarrow \nabla \mathcal{L}(\theta_{t+1}) \);
Store \( s_t \leftarrow \theta_{t+1} - \theta_t \), \( y_t \leftarrow g_{t+1} - g_t \), \( \rho_t \leftarrow \frac{1}{y_t^\top s_t} \);
If the number of stored pairs exceeds \( m \), discard the oldest pair;
\( t \leftarrow t + 1 \);
until \( \|g_t\| < \epsilon \);
Output \(\theta_t\);
end
Note: For computational efficiency and numerical stability, instead of calculating \(\frac{1}{s^\top y}\) multiple times,
we introduce \(\rho = \frac{1}{s^\top y}\) and store it alongside each \((s, y)\) pair.
Python Implementation
A sample code for L-BFGS is as follows:
import numpy as np
# Line search with Wolfe conditions
def line_search(f, grad_f, theta, p, c1 = 1e-4, c2 = 0.9, max_iter = 100):
eta = 1.0
eta_low = 0.0
eta_high = None
phi_0 = f(theta)
grad_phi_0 = np.dot(grad_f(theta), p)
for _ in range(max_iter):
phi_eta = f(theta + eta * p)
# Check Armijo condition
if phi_eta > phi_0 + c1 * eta * grad_phi_0:
eta_high = eta
else:
# Check Curvature condition
grad_phi_eta = np.dot(grad_f(theta + eta * p), p)
if grad_phi_eta < c2 * grad_phi_0:
eta_low = eta
else:
return eta
# Update step size using bisection method
if eta_high is not None:
eta = (eta_low + eta_high) / 2.0
else:
eta *= 2.0
return eta
# Limited memory BFGS
def limited_bfgs(f, grad_f, theta0, m = 10, tol = 1e-6, max_iter = 2000):
theta = theta0.copy()
g = grad_f(theta)
s_list = []
y_list = []
rho_list = [] # We introduce rho = 1/s^\top y instead of directly using 1/s^\top y for efficiency & stability.
for _ in range(max_iter):
if np.linalg.norm(g) < tol:
break
q = g.copy()
alpha_list = [] # Need this for the second loop to avoid computing alpha again
# Loop backward through stored (s, y) pairs
for s, y, rho in reversed(list(zip(s_list, y_list, rho_list))):
alpha = rho * np.dot(s, q)
alpha_list.append(alpha)
q -= alpha * y
# Initial Hessian approximation is identity: H0 = I, so H0 * q = q
# Note: In the first iteration (when s_list is empty), the loops do nothing,
# so r = q = g, and thus p = -r = -g (steepest descent direction)
r = q
# Loop forward through stored (s, y) pairs
for (s, y, rho), alpha in zip(zip(s_list, y_list, rho_list), reversed(alpha_list)):
beta = rho * np.dot(y, r)
r += s * (alpha - beta)
# Search direction
p = -r
# Compute the step size by Line search satisfying Wolfe conditions
eta = line_search(f, grad_f, theta, p)
# Update parameters
theta += eta * p
grad_next = grad_f(theta)
# Update memory for (s, y) pairs
s = eta * p
y = grad_next - g
if np.dot(s, y) > 1e-10 * np.linalg.norm(s) * np.linalg.norm(y):
if len(s_list) == m:
s_list.pop(0)
y_list.pop(0)
rho_list.pop(0)
s_list.append(s)
y_list.append(y)
rho_list.append(1.0 / np.dot(y, s))
# Update gradient
g = grad_next
return theta
# Objective function and its gradient:
# The Rosenbrock function is commonly used for testing optimization algorithms.
def rosenbrock(x):
return np.sum(100 * (x[1:] - x[:-1]**2)**2 + (1 - x[:-1])**2)
def grad_rosenbrock(x):
grad = np.zeros_like(x)
grad[:-1] = -400 * x[:-1] * (x[1:] - x[:-1]**2) - 2 * (1 - x[:-1]) # # x_1 to x_n-1
grad[1:] += 200 * (x[1:] - x[:-1]**2) # x_2 to x_n
return grad
# Finite difference gradient to check grad_rosenbrock().
def finite_difference_gradient(f, x, epsilon=1e-6):
grad = np.zeros_like(x)
for i in range(len(x)):
x_forward = x.copy()
x_backward = x.copy()
x_forward[i] += epsilon
x_backward[i] -= epsilon
grad[i] = (f(x_forward) - f(x_backward)) / (2 * epsilon)
return grad
if __name__ == "__main__":
n = 50 # Dimensionality
# Randomly generate an initial point x0:
# You could try : x0 = np.ones(n) + 0.3 * np.random.randn(n), which represents adding small
# perturbation around the global minimum x* = [1, ... , 1]^\top
x0 = np.random.randn(n)
numeric_opt = limited_bfgs(rosenbrock, grad_rosenbrock, x0)
print("Initial point: \n", x0.tolist())
print("\n Numerical optimum: \n", numeric_opt.tolist())
'''
# If you are not sure about the gradient of the objective, always you can compare it with finite difference.
grad_analytic = grad_rosenbrock(x0)
grad_numeric = finite_difference_gradient(rosenbrock, x0)
relative_error_grad = np.linalg.norm(grad_analytic - grad_numeric) / np.linalg.norm(grad_analytic)
print("Relative error:", relative_error_grad)
'''
# Relative error between the numerical optimum and the global optimum
global_opt = np.ones(n) # The actual global optimum of Rosenbrock function is x* = [1, ... , 1]^\top
relative_error_optimal = np.linalg.norm(numeric_opt - global_opt) / np.linalg.norm(global_opt)
print(f"\n Relative error to the global minimum: {relative_error_optimal*100:.8f}%")
Note: On the above code, we used the Rosenbrock function, which is defined as
\[
f(x) = \sum_{i = 1}^{n-1} a (x_{i+1} - x_i^2)^2 + (b - x_i)^2
\]
where \(x \in \mathbb{R}^n\), and usually we set the coefficients \(a = 100, \, b = 1\).
The global minimum can be obtained simply at \(x = [1, 1, \cdots, 1]\), but in numerical optimization, converging to
the global minimum is difficult because of several reasons:
Broad "flat" regions & several local minima: Optimization algorithms can be trapped there.
Ill-conditioned: Small changes in one variable can result in large changes in another.
A narrow, curved valley containing the global minimum: It makes hard for optimization algorithms to navigate. The
steep sides of the valley can cause algorithms to overshoot or converge very slowly.
Rosenbrock function is a common benchmark for testing the robustness and efficiency of optimization algorithms.