Duality
Optimization problems are often easier to analyze from a different angle. Duality formalizes this
idea: every constrained optimization problem (the primal) has a companion problem (the dual)
that provides bounds on the optimal value, and sometimes solves the original problem exactly. This principle arises throughout
machine learning, convex analysis, and game theory, and it connects the
KKT conditions to
algorithmic efficiency.
Definition: Primal Problem
In constrained optimization, the primal problem is
\[
p^* \;=\; \inf_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to} \quad g_i(\mathbf{x}) \leq 0, \; h_j(\mathbf{x}) = 0,
\]
where \(f(\mathbf{x})\) is the objective function, \(g_i(\mathbf{x})\) are inequality constraints, and \(h_j(\mathbf{x})\) are equality
constraints. The optimal value \(p^*\) is the infimum of \(f\) over the feasible set; it may or may not be attained.
To analyze this constrained problem, we encode the constraints into the objective via the
Lagrangian
\(\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}) = f(\mathbf{x}) + \sum_i \lambda_i g_i(\mathbf{x}) + \sum_j \nu_j h_j(\mathbf{x})\),
where \(\lambda_i, \nu_j\) are the Lagrange multipliers. Taking the infimum of \(\mathcal{L}\) over \(\mathbf{x}\)
with the multipliers fixed produces a scalar-valued function of the multipliers alone:
Definition: Lagrange Dual Function
The Lagrange dual function associated with the primal problem is
\[
g(\boldsymbol{\lambda}, \boldsymbol{\nu}) \;=\; \inf_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}).
\]
This is defined for arbitrary \((\boldsymbol{\lambda}, \boldsymbol{\nu}) \in \mathbb{R}^m \times \mathbb{R}^p\) and takes values in
\([-\infty, +\infty)\). Whenever \(\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu})\) is unbounded below in \(\mathbf{x}\),
\(g(\boldsymbol{\lambda}, \boldsymbol{\nu}) = -\infty\); the effective domain of \(g\) is the set of multipliers for which
this infimum is finite.
Definition: Dual Problem
The dual problem is
\[
d^* \;=\; \sup_{\boldsymbol{\lambda} \geq \mathbf{0}, \, \boldsymbol{\nu}} g(\boldsymbol{\lambda}, \boldsymbol{\nu}).
\]
The restriction \(\boldsymbol{\lambda} \geq \mathbf{0}\) (with \(\boldsymbol{\nu}\) unrestricted) will be justified by the Weak Duality
theorem below, which shows that under this restriction each \(g(\boldsymbol{\lambda}, \boldsymbol{\nu})\) is a lower bound on \(p^*\);
the dual problem then selects the best such bound. Since \(g\) is the pointwise infimum of affine functions of
\((\boldsymbol{\lambda}, \boldsymbol{\nu})\), it is always concave — regardless of whether the primal is convex — so the dual
is a concave maximization, structurally a convex optimization problem over \((\boldsymbol{\lambda}, \boldsymbol{\nu})\).
This construction transforms our original problem into a framework where we can always find a guaranteed lower bound.
Regardless of the complexity or convexity of the primal problem, the dual problem provides a foundational limit on what
the optimal solution can be. This leads to one of the most fundamental results in optimization:
Theorem: Weak Duality
For any optimization problem (convex or not) with a valid Lagrangian formulation, the dual optimal value
\(d^*\) is a lower bound on the primal optimal value \(p^*\):
\[
d^* \;\leq\; p^*.
\]
Proof
Let \(\mathbf{x}\) be any primal-feasible point (so \(g_i(\mathbf{x}) \leq 0\) and \(h_j(\mathbf{x}) = 0\)) and let
\((\boldsymbol{\lambda}, \boldsymbol{\nu})\) be any dual-feasible pair (so \(\lambda_i \geq 0\)). Because \(\lambda_i \geq 0\) and
\(g_i(\mathbf{x}) \leq 0\), each \(\lambda_i g_i(\mathbf{x}) \leq 0\); because \(h_j(\mathbf{x}) = 0\), each \(\nu_j h_j(\mathbf{x}) = 0\).
Therefore
\[
\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu})
\;=\; f(\mathbf{x}) + \underbrace{\textstyle\sum_i \lambda_i g_i(\mathbf{x})}_{\leq \, 0}
+ \underbrace{\textstyle\sum_j \nu_j h_j(\mathbf{x})}_{= \, 0}
\;\leq\; f(\mathbf{x}).
\]
Since \(g(\boldsymbol{\lambda}, \boldsymbol{\nu}) = \inf_{\mathbf{x}'} \mathcal{L}(\mathbf{x}', \boldsymbol{\lambda}, \boldsymbol{\nu})\) takes the infimum over all \(\mathbf{x}'\)
(not only feasible ones), it is bounded above by the value at our particular feasible \(\mathbf{x}\):
\[
g(\boldsymbol{\lambda}, \boldsymbol{\nu}) \;\leq\; \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}) \;\leq\; f(\mathbf{x}).
\]
The right side \(f(\mathbf{x})\) does not depend on \((\boldsymbol{\lambda}, \boldsymbol{\nu})\), so taking the supremum of the left side
over dual-feasible \((\boldsymbol{\lambda}, \boldsymbol{\nu})\) preserves the inequality: \(d^* \leq f(\mathbf{x})\).
This now holds for every primal-feasible \(\mathbf{x}\), so \(d^*\) is a lower bound for the set
\(\{f(\mathbf{x}) : \mathbf{x} \text{ is primal-feasible}\}\). By definition of the
infimum
as the greatest lower bound, and since \(p^* = \inf\{f(\mathbf{x}) : \mathbf{x} \text{ is primal-feasible}\}\),
we conclude \(d^* \leq p^*\). The inequality continues to hold in the degenerate cases by convention:
if the primal is infeasible we set \(p^* = +\infty\), and if \(p^* = -\infty\) (primal unbounded below)
weak duality forces \(d^* = -\infty\) as well. \(\blacksquare\)
The non-negative quantity \(p^* - d^*\) is called the duality gap. The gap need not be zero in
general — non-convex problems frequently exhibit a strictly positive gap — but under favorable conditions it
vanishes, giving strong duality.
Theorem: Strong Duality (Slater's Condition)
Suppose the primal problem is convex — that is, \(f\) and \(g_i\) are convex and \(h_j\) are affine.
If Slater's condition holds, meaning there exists a strictly feasible point \(\mathbf{x}\) with
\[
g_i(\mathbf{x}) < 0 \text{ for all } i, \qquad h_j(\mathbf{x}) = 0 \text{ for all } j,
\]
then strong duality holds: \(d^* = p^*\), and the dual optimum is attained when \(p^* > -\infty\).
Proof sketch. Consider the set
\(\mathcal{A} = \{(\mathbf{u}, t) : \exists \mathbf{x} \text{ with } g_i(\mathbf{x}) \leq u_i,\ h_j(\mathbf{x}) = 0,\ f(\mathbf{x}) \leq t\}\).
Convexity of \(f\) and \(g_i\) plus affinity of \(h_j\) make \(\mathcal{A}\) convex, and \((\mathbf{0}, p^*)\) lies
on its boundary. The supporting-hyperplane theorem then provides
\((\boldsymbol{\lambda}, \mu_0) \in \mathbb{R}^m \times \mathbb{R}\), not both zero, such that
\[
\boldsymbol{\lambda}^\top \mathbf{u} + \mu_0 t \;\geq\; \mu_0 p^* \qquad \text{for all } (\mathbf{u}, t) \in \mathcal{A}.
\]
Since \(\mathcal{A}\) is unbounded in the positive \(\mathbf{u}\)- and \(t\)-directions (increasing any \(u_i\) or \(t\)
preserves membership), the coefficients must satisfy \(\boldsymbol{\lambda} \geq \mathbf{0}\) and \(\mu_0 \geq 0\); otherwise the
inequality could be driven to \(-\infty\).
The key step is ruling out the degenerate case \(\mu_0 = 0\). Suppose \(\mu_0 = 0\); then \(\boldsymbol{\lambda} \neq \mathbf{0}\),
\(\boldsymbol{\lambda} \geq \mathbf{0}\), and the supporting inequality reduces to \(\boldsymbol{\lambda}^\top \mathbf{u} \geq 0\) for all
\((\mathbf{u}, t) \in \mathcal{A}\). For every feasible \(\mathbf{x}\) (in particular every \(\mathbf{x}\) with \(h_j(\mathbf{x}) = 0\)), the point
\((\mathbf{g}(\mathbf{x}), f(\mathbf{x}))\) lies in \(\mathcal{A}\), so \(\boldsymbol{\lambda}^\top \mathbf{g}(\mathbf{x}) \geq 0\). Applying this at Slater's strictly
feasible point \(\bar{\mathbf{x}}\), for which \(g_i(\bar{\mathbf{x}}) < 0\) for all \(i\),
\[
\boldsymbol{\lambda}^\top \mathbf{g}(\bar{\mathbf{x}}) \;=\; \sum_i \lambda_i \, g_i(\bar{\mathbf{x}}) \;<\; 0,
\]
since \(\boldsymbol{\lambda} \geq \mathbf{0}\), \(\boldsymbol{\lambda} \neq \mathbf{0}\), and every \(g_i(\bar{\mathbf{x}}) < 0\). This strict inequality
contradicts \(\boldsymbol{\lambda}^\top \mathbf{g}(\bar{\mathbf{x}}) \geq 0\). Hence \(\mu_0 > 0\).
Normalizing \(\mu_0 = 1\), the supporting inequality yields \(g(\boldsymbol{\lambda}, \boldsymbol{\nu}) \geq p^*\) for an appropriate
\(\boldsymbol{\nu} \in \mathbb{R}^p\), where the \(\boldsymbol{\nu}\) component is produced by the affine span of the equality
constraints — this step requires the relative-interior refinement of Slater's condition, which treats
affine \(h_j\) separately from nonlinear \(g_i\). Combined with weak duality, this gives \(d^* \geq p^*\),
hence \(d^* = p^*\).
Slater's condition is one of several sufficient conditions guaranteeing strong duality in the convex case;
others include linearity of all constraints (which needs no strict feasibility) and various constraint
qualifications. Non-convex problems may also enjoy strong duality in special cases, but it is the exception
rather than the rule.
When strong duality holds, the dual optimal multipliers \((\boldsymbol{\lambda}^*, \boldsymbol{\nu}^*)\) coincide with the multipliers
from the KKT conditions,
and those conditions become both necessary and sufficient for primal-dual optimality.
Insight: Duality in Machine Learning
Duality is widely used in machine learning and statistical learning theory.
In kernel Support Vector Machines,
the dual formulation is essential because it enables the kernel trick,
allowing nonlinear classification in high-dimensional (or infinite-dimensional) feature spaces
without explicitly computing the mapping. In
regularization methods
such as Ridge
and Lasso,
dual formulations help derive generalization bounds and
computationally efficient optimization algorithms. More broadly, many adversarial and
minimax training objectives — including Wasserstein GANs, which compute the Kantorovich dual
of an optimal-transport problem via a 1-Lipschitz critic — can be interpreted through duality
principles.
Lipschitz Continuity
Duality characterizes the optimal value of a constrained problem — exactly, when strong duality holds —
but it does not tell us how fast an algorithm can reach it. Convergence rates depend on regularity
properties of the objective function, particularly how rapidly its gradient can change.
Lipschitz continuity provides the precise mathematical framework for bounding this rate of change,
and it is the key assumption behind most convergence guarantees in optimization.
Recall that a function \(f: X \to Y\) between metric spaces is
Lipschitz continuous
with constant \(L \geq 0\) if \(e(f(a), f(b)) \leq L \cdot d(a, b)\) for all \(a, b \in X\). This bounds the
variation of \(f\) uniformly in terms of the input distance: \(f\) cannot change faster than linearly in \(d(a, b)\).
(Note that Lipschitz functions need not be bounded themselves — \(f(t) = t\) on \(\mathbb{R}\) is Lipschitz with
\(L = 1\) yet unbounded.)
In optimization, Lipschitz-type regularity underpins step-size selection, stability of iterative methods, and
quantitative convergence rates. See Continuity for the purely-analytic
development.
When the Lipschitz condition is applied to the gradient rather than the function
itself, we obtain the important notion of smoothness.
Definition: \(L\)-Smoothness
A continuously differentiable function \(f: \mathbb{R}^n \to \mathbb{R}\) is
\(L\)-smooth if its gradient is Lipschitz continuous with constant \(L\):
\[
\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \leq L \|\mathbf{x} - \mathbf{y}\| \quad \text{for all } \mathbf{x}, \mathbf{y} \in \mathbb{R}^n.
\]
This ensures that \(f\) does not curve too rapidly, allowing gradient-based methods to
converge predictably.
In gradient-based optimization, \(L\)-smoothness yields the
descent lemma
\[
f(\mathbf{y}) \;\leq\; f(\mathbf{x}) + \nabla f(\mathbf{x})^\top (\mathbf{y} - \mathbf{x}) + \tfrac{L}{2} \|\mathbf{y} - \mathbf{x}\|^2,
\]
which follows from the fundamental theorem of calculus applied to \(\nabla f\) along the segment from
\(\mathbf{x}\) to \(\mathbf{y}\); no twice-differentiability is required. In the \(C^2\) case, \(L\)-smoothness is
equivalent to the two-sided spectral bound \(-LI \preceq \nabla^2 f(\mathbf{x}) \preceq LI\), and for convex \(L\)-smooth
\(f\) this sharpens to \(0 \preceq \nabla^2 f(\mathbf{x}) \preceq LI\). While Lipschitz continuity of \(f\) limits
how fast the function values can change, \(L\)-smoothness limits how fast the gradient can change.
The descent lemma motivates the step size \(\eta = 1/L\): minimizing the right-hand side in \(\mathbf{y}\) at
\(\mathbf{y} = \mathbf{x} - \eta \nabla f(\mathbf{x})\) gives the largest guaranteed per-step decrease from the quadratic upper model alone,
and this choice underlies the standard sublinear convergence analysis for convex \(L\)-smooth functions.
We emphasize, however, that \(1/L\) is optimal only with respect to the upper model — it is not
the step size minimizing the actual convergence rate when additional structure (such as strong convexity)
is available. For the quadratic case analyzed below, the step size achieving the sharpest contraction is
\(2/(\lambda_{\min} + \lambda_{\max}) = 2/(L + m)\), which is strictly larger than \(1/L\) when \(\kappa > 1\)
and yields a strictly better rate.
On its own, \(L\)-smoothness guarantees only a sublinear convergence rate
\(f(\mathbf{x}_t) - f(\mathbf{x}^*) = O(1/t)\) for gradient descent. To obtain a linear rate —
geometric decay of the error — we additionally need a lower curvature bound, captured by strong convexity:
a function is \(m\)-strongly convex if \(f(\mathbf{x}) - \tfrac{m}{2}\|\mathbf{x}\|^2\) is convex, equivalently
\(\nabla^2 f(\mathbf{x}) \succeq m I\) in the twice-differentiable case. When \(f\) is both \(L\)-smooth and \(m\)-strongly
convex, steepest descent admits the linear bound
\[
f(\mathbf{x}_{t+1}) - f(\mathbf{x}^*) \;\leq\; \mu \, \bigl(f(\mathbf{x}_t) - f(\mathbf{x}^*)\bigr), \qquad 0 < \mu < 1.
\]
The constant \(\mu\) is the convergence rate, and it depends on the chosen step size:
with \(\eta = 1/L\) the standard analysis gives \(\mu = 1 - m/L = 1 - 1/\kappa\) (Nesterov), while with
\(\eta = 2/(L + m)\) the iterate contraction factor \(k = (\kappa-1)/(\kappa+1)\) yields a bound of the form
\(\|\mathbf{x}_{t+1} - \mathbf{x}^*\| \leq k \|\mathbf{x}_t - \mathbf{x}^*\|\). On general \(L\)-smooth \(m\)-strongly convex \(f\), translating an
iterate bound into a function-value bound introduces an additional factor of \(\kappa\), so the sharpest
function-value rate one obtains in the general case is weaker than \(k^2\). To understand what determines \(\mu\)
in the cleanest setting, we analyze the canonical case of a quadratic objective — the simplest setting in which
\(L\) and \(m\) are both exact and straightforward to compute, and in which the iterate rate \(k\) and the
function-value rate \(k^2\) are both attained exactly (no \(\kappa\) loss in translation).
Example: Quadratic Objective
Consider the quadratic loss function
\[
f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^\top A \mathbf{x} + \mathbf{b}^\top \mathbf{x} + c,
\]
where \(A \in \mathbb{R}^{n \times n}\) is symmetric positive definite,
\(\mathbf{b} \in \mathbb{R}^n\), and \(c \in \mathbb{R}\). This function has a unique
minimizer \(\mathbf{x}^* = -A^{-1}\mathbf{b}\). The quadratic case is the canonical test bed for convergence rate
analysis because the iteration map for steepest descent becomes a linear operator, whose spectral properties
directly expose the convergence factor.
To analyze the convergence of steepest descent on this quadratic, we first introduce the
concept of a contraction mapping, which provides the framework for linear convergence analysis.
Definition: Contraction (Optimization Convention)
Let \((M, d)\) be a metric space. A map \(f: M \to M\) is a
contraction if there exists a constant \(0 \leq k < 1\) such that
\[
d(f(\mathbf{x}), f(\mathbf{y})) \leq k \, d(\mathbf{x}, \mathbf{y}) \quad \text{for all } \mathbf{x}, \mathbf{y} \in M.
\]
The constant \(k\) is called the contraction factor.
A contraction is automatically a Lipschitz function with Lipschitz constant less than 1,
hence uniformly continuous on its domain. If an iterative method satisfies this contraction property, the error
shrinks by factor \(k\) at every step, yielding geometric convergence.
Note on Terminology
We use contraction here in the sense standard to optimization and ML literature
(e.g., Murphy, Probabilistic Machine Learning, 2022) — namely \(k < 1\). Our pure-analysis
pages follow the stricter convention where the unqualified term "contraction" allows \(k \leq 1\),
and the strictly-less-than-one case is labelled
strong contraction.
The two usages agree on the Banach fixed-point result; only the naming differs.
We can now state and prove the convergence rate of steepest descent on the canonical quadratic,
using the contraction framework developed above.
Theorem: Convergence Rate of Steepest Descent (Quadratic Case)
For the quadratic objective \(f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^\top A \mathbf{x} + \mathbf{b}^\top \mathbf{x} + c\) with
\(A\) symmetric positive definite, steepest descent
with any fixed step size \(\alpha \in (0, 2/\lambda_{\max})\) converges linearly to the unique minimizer
\(\mathbf{x}^* = -A^{-1}\mathbf{b}\). The step size that minimizes the worst-case contraction factor of the iterate error is
\[
\alpha^* = \frac{2}{\lambda_{\min} + \lambda_{\max}},
\]
at which steepest descent satisfies the worst-case function-value bound
\[
f(\mathbf{x}_{k+1}) - f(\mathbf{x}^*) \;\leq\; \mu \, \bigl(f(\mathbf{x}_k) - f(\mathbf{x}^*)\bigr), \qquad
\mu \;=\; \left(\frac{\kappa - 1}{\kappa + 1}\right)^2,
\]
where \(\kappa = \lambda_{\max}/\lambda_{\min}\) is the
condition number of \(A\), and
\(\lambda_{\min}, \lambda_{\max}\) are the smallest and largest eigenvalues of \(A\).
Proof
The gradient of \(f\) is \(\nabla f(\mathbf{x}) = A\mathbf{x} + \mathbf{b} = A(\mathbf{x} - \mathbf{x}^*)\), using \(\mathbf{x}^* = -A^{-1}\mathbf{b}\).
Steepest descent with step size \(\alpha\) takes the form \(\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha \nabla f(\mathbf{x}_k)\), so
\[
\mathbf{x}_{k+1} - \mathbf{x}^* \;=\; (\mathbf{x}_k - \mathbf{x}^*) - \alpha A (\mathbf{x}_k - \mathbf{x}^*) \;=\; (I - \alpha A)(\mathbf{x}_k - \mathbf{x}^*).
\]
The iteration is therefore an affine map whose linear part is \(I - \alpha A\). Since \(A\) is symmetric,
\(I - \alpha A\) is symmetric too, and its operator (spectral) norm equals its largest-in-absolute-value
eigenvalue:
\[
\|I - \alpha A\|_2 \;=\; \max_i |1 - \alpha \lambda_i|.
\]
The iteration is a contraction in the Euclidean norm precisely when \(\|I - \alpha A\|_2 < 1\), i.e.
\[
|1 - \alpha \lambda_i| < 1 \quad \text{for all } i,
\]
which, since each \(\lambda_i > 0\), reduces to \(0 < \alpha < 2/\lambda_{\max}\).
We now minimize the contraction factor over \(\alpha \in (0, 2/\lambda_{\max})\). The contraction
factor as a function of \(\alpha\) is
\[
\max_i |1 - \alpha \lambda_i| \;=\; \max\bigl(1 - \alpha \lambda_{\min},\ \alpha \lambda_{\max} - 1\bigr),
\]
where we have used monotonicity in \(\lambda_i\) to reduce the max over all eigenvalues to the max over
the two extremes. The first expression is decreasing in \(\alpha\) and the second is increasing, so their
maximum is minimized where they meet:
\[
1 - \alpha^* \lambda_{\min} \;=\; \alpha^* \lambda_{\max} - 1,
\]
giving
\[
\alpha^* \;=\; \frac{2}{\lambda_{\min} + \lambda_{\max}}.
\]
Substituting back, the optimal contraction factor for the iterate error is
\[
k \;=\; 1 - \alpha^* \lambda_{\min} \;=\; \frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}}
\;=\; \frac{\kappa - 1}{\kappa + 1},
\]
so \(\|\mathbf{x}_{k+1} - \mathbf{x}^*\|_2 \leq k \, \|\mathbf{x}_k - \mathbf{x}^*\|_2\).
Finally we translate the iterate contraction into a function-value rate. A direct computation using
\(\mathbf{x}^* = -A^{-1} \mathbf{b}\) shows
\[
f(\mathbf{x}) - f(\mathbf{x}^*) \;=\; \tfrac{1}{2}(\mathbf{x} - \mathbf{x}^*)^\top A (\mathbf{x} - \mathbf{x}^*).
\]
Diagonalize \(A = Q \Lambda Q^\top\) with \(\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)\), so that
\(I - \alpha^* A = Q(I - \alpha^* \Lambda)Q^\top\). Let \(\mathbf{z}_k := Q^\top(\mathbf{x}_k - \mathbf{x}^*)\); then
\(\mathbf{z}_{k+1} = (I - \alpha^* \Lambda) \mathbf{z}_k\), i.e.\ \((\mathbf{z}_{k+1})_i = (1 - \alpha^* \lambda_i)(\mathbf{z}_k)_i\), and
\[
f(\mathbf{x}_k) - f(\mathbf{x}^*) \;=\; \tfrac{1}{2} \mathbf{z}_k^\top \Lambda \mathbf{z}_k \;=\; \tfrac{1}{2} \sum_i \lambda_i (\mathbf{z}_k)_i^2.
\]
Applying the iteration,
\[
f(\mathbf{x}_{k+1}) - f(\mathbf{x}^*) \;=\; \tfrac{1}{2} \sum_i \lambda_i (\mathbf{z}_{k+1})_i^2
\;=\; \tfrac{1}{2} \sum_i (1 - \alpha^* \lambda_i)^2 \, \lambda_i \, (\mathbf{z}_k)_i^2.
\]
Since \((1 - \alpha^* \lambda_i)^2 \leq k^2\) for every \(i\), we conclude
\[
f(\mathbf{x}_{k+1}) - f(\mathbf{x}^*) \;\leq\; k^2 \cdot \tfrac{1}{2} \sum_i \lambda_i (\mathbf{z}_k)_i^2
\;=\; k^2 \bigl(f(\mathbf{x}_k) - f(\mathbf{x}^*)\bigr).
\]
Hence \(\mu = k^2 = \left(\dfrac{\kappa - 1}{\kappa + 1}\right)^2\). The bound is tight in the
worst case: equality is approached when the initial error \(\mathbf{z}_0\) concentrates on an eigendirection
achieving \(|1 - \alpha^* \lambda_i| = k\) (i.e., the extreme eigenvalues \(\lambda_{\min}\) or
\(\lambda_{\max}\)), while a generic initial point gives a strictly better per-step ratio. \(\blacksquare\)
Problems with low condition numbers (\(\kappa\) close to 1) converge rapidly, while ill-conditioned
problems (\(\kappa \gg 1\)) converge slowly. This is why preconditioning - transforming
the problem to reduce \(\kappa\) - is a central technique in numerical optimization.
Insight: Condition Number in Deep Learning
The condition number governs training dynamics far beyond quadratic objectives. In deep learning, the
effective condition number of the loss landscape's Hessian determines how well gradient-based optimizers
perform. Adam and other adaptive methods implicitly approximate a preconditioner by
maintaining per-parameter learning rates, effectively reducing the condition number seen by each parameter.
Batch normalization empirically accelerates training; subsequent analysis attributes this to
a smoothing effect on the loss landscape — reducing the Lipschitz constants of both the loss and its gradient —
rather than to the originally proposed "internal covariate shift" mechanism. Understanding \(\kappa\) helps
explain why some architectures train easily while others require careful hyperparameter tuning.
Interactive Duality Visualization
The theoretical results above can be difficult to internalize without concrete examples. The following interactive visualization
demonstrates the primal-dual relationship in linear programming, where strong duality always holds. By adjusting
the parameters, you can observe how the primal and dual feasible regions change, how the duality gap closes at optimality, and
how complementary slackness operates.