Duality
Optimization problems are often easier to analyze from a different angle. Duality formalizes this
idea: every constrained optimization problem (the primal) has a companion problem (the dual)
that provides bounds on the optimal value, and sometimes solves the original problem exactly. This principle arises throughout
machine learning, convex analysis, and game theory, and it connects the KKT conditions to
algorithmic efficiency.
Definition: Primal to Dual Construction
In constrained optimization, we define the primal problem
\[
\min_x f(x) \, \text{ subject to } g_i (x) \leq 0, \, h_j (x) = 0
\]
where \(f(x)\) is the objective function, \(g_i(x)\) are inequality constraints, and \(h_j(x)\) are equality
constraints.
We define the Lagrangian as a combination of the objective function and the constraints:
\[
\mathcal{L}(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x)
\]
where \(\lambda_i\) and \(\nu_j\) are the Lagrange multipliers.
The Lagrange dual function is then defined as the infimum of the Lagrangian over \(x\):
\[
g(\lambda, \nu) = \inf_x \mathcal{L}(x, \lambda, \nu).
\]
Crucially, for the dual function to remain a valid lower bound on the primal optimal value \(p^*\)
in the presence of inequality constraints \(g_i(x) \leq 0\), we must require \(\lambda_i \geq 0\).
The dual problem seeks the best possible lower bound by maximizing this dual function:
\[
d^* = \max_{\lambda \geq 0, \nu} g(\lambda, \nu).
\]
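To make the construction concrete, here is a minimal Python sketch (the one-dimensional toy problem, and the use of NumPy and SciPy, are illustrative choices, not part of the development above). It forms the Lagrangian for \(\min_x x^2\) subject to \(x \geq 1\), evaluates the dual function on a grid of multipliers, and maximizes it:

import numpy as np
from scipy.optimize import minimize_scalar

# Toy primal: minimize f(x) = x^2 subject to g(x) = 1 - x <= 0 (i.e., x >= 1).
# The primal optimum is x* = 1 with p* = 1.

def lagrangian(x, lam):
    return x**2 + lam * (1.0 - x)

def dual(lam):
    # g(lam) = inf_x L(x, lam); here the infimum is attained at x = lam / 2,
    # giving the closed form g(lam) = lam - lam^2 / 4.
    return minimize_scalar(lambda x: lagrangian(x, lam)).fun

lams = np.linspace(0.0, 4.0, 401)        # grid of multipliers lambda >= 0
d_star = max(dual(lam) for lam in lams)  # best lower bound found on the grid
p_star = 1.0

print(f"d* = {d_star:.4f} <= p* = {p_star:.4f}")  # the gap is zero for this toy problem

For this problem the best lower bound is \(d^* = 1\), which coincides with \(p^* = 1\); the general guarantee \(d^* \leq p^*\) is the weak duality result stated below.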
This construction turns the original problem into one that always yields a guaranteed lower bound.
Regardless of the complexity or convexity of the primal problem, the dual provides a lower bound on the primal optimal value; moreover, since \(g(\lambda, \nu)\) is a pointwise infimum of affine functions of \((\lambda, \nu)\), it is always concave, so the dual problem is a convex problem in its own right. This leads to one of the most fundamental results in optimization:
Theorem: Weak Duality
The optimal value of the dual problem \(d^*\) provides a lower bound on
the optimal value of the primal problem \(p^*\):
\[
d^* \leq p^*.
\]
This holds for any optimization problem (convex or not) where a valid dual is formulated.
The difference \(p^* - d^* \geq 0\) is called the duality gap. When this gap
is zero, the dual problem provides the exact optimal value of the primal, a situation known as
strong duality.
Theorem: Strong Duality and Slater's Condition
If \(p^* = d^*\), we say strong duality holds. Strong duality does not
hold in general (particularly for non-convex problems). However, for convex
problems, Slater's condition is sufficient: if there exists a strictly
feasible point \(x\) satisfying the equality constraints \(h_j(x) = 0\) and
\[
g_i(x) < 0 \quad \text{for all } i,
\]
then strong duality holds.
When strong duality holds, the dual optimal solution corresponds to the KKT multipliers
(Lagrange multipliers) from the
KKT conditions, and
the KKT conditions become both necessary and sufficient for optimality.
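As a sketch of how dual variables surface in practice, the following example (assuming the cvxpy package is available; the specific projection problem is an illustrative choice) solves a small convex QP, reads off the optimal multiplier as the constraint's dual value, and checks the KKT stationarity condition:

import numpy as np
import cvxpy as cp

# Small convex QP: project the point c = (2, 1) onto the half-space x1 + x2 <= 1.
c = np.array([2.0, 1.0])
x = cp.Variable(2)
constraint = x[0] + x[1] <= 1
problem = cp.Problem(cp.Minimize(cp.sum_squares(x - c)), [constraint])

p_star = problem.solve()
lam = constraint.dual_value          # dual optimal multiplier for the inequality

# KKT stationarity: grad f(x*) + lam * grad g(x*) = 0, with grad g = (1, 1).
grad_f = 2 * (x.value - c)
print("p* =", p_star)                                        # 2.0
print("lambda* =", lam)                                      # 2.0
print("stationarity residual:", grad_f + lam * np.ones(2))   # approximately (0, 0)

Since a strictly feasible point exists here (for example \(x = 0\)), Slater's condition holds, and the reported dual value is exactly the KKT multiplier at the optimum.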
Insight: Duality in Machine Learning
Duality is widely used in machine learning and statistical learning theory.
In Support Vector Machines (SVMs),
the dual formulation is often preferred over the primal because it enables the
kernel trick, allowing nonlinear classification in high-dimensional
feature spaces without explicitly computing the mapping. In
regularization methods
(Lasso, Ridge, etc.), dual formulations help derive generalization bounds and
computationally efficient optimization algorithms. More broadly, many deep learning
training objectives, including adversarial training and Wasserstein GANs, can be
analyzed through duality principles; in the Wasserstein GAN, for instance, the critic approximately solves a dual formulation of the distance being minimized.
Lipschitz Continuity
Duality tells us what the optimal value is, but not how fast an algorithm can reach it. Convergence
rates depend on regularity properties of the objective function, particularly how rapidly its gradient can change.
Lipschitz continuity provides the precise mathematical framework for bounding this rate of change,
and it is the key assumption behind most convergence guarantees in optimization.
Definition: Lipschitz Continuity
Let \((X, d)\) and \((Y, e)\) be
metric spaces and
\(f: X \to Y\). The function \(f\) is Lipschitz continuous on \(X\)
with Lipschitz constant \(L \geq 0\) if
\[
e(f(a), f(b)) \leq L \, d(a, b) \quad \text{for all } a, b \in X.
\]
While Lipschitz continuity has broad applications in mathematical analysis
(see Continuity), its significance in
optimization lies in the bound it places on the rate of change:
it ensures that function values or gradients cannot change arbitrarily fast, which prevents numerical instability
and keeps optimization trajectories well-behaved.
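As a quick numerical sanity check (purely illustrative), a Lipschitz constant can be estimated empirically by sampling pairs of points and recording the largest observed slope; for \(f(x) = \sin(x)\) the true constant is \(L = 1\):

import numpy as np

# Empirically estimate a Lipschitz constant for f(x) = sin(x) on [-5, 5] by
# sampling random pairs (a, b) and recording the largest slope |f(a) - f(b)| / |a - b|.
rng = np.random.default_rng(0)
a = rng.uniform(-5, 5, size=100_000)
b = rng.uniform(-5, 5, size=100_000)
mask = np.abs(a - b) > 1e-9              # avoid division by (near-)zero gaps

slopes = np.abs(np.sin(a[mask]) - np.sin(b[mask])) / np.abs(a[mask] - b[mask])
print("largest observed slope:", slopes.max())   # close to, and never above, L = 1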
When the Lipschitz condition is applied to the gradient rather than the function
itself, we obtain the important notion of smoothness.
Definition: \(L\)-Smoothness
A continuously differentiable function \(f: \mathbb{R}^n \to \mathbb{R}\) is
\(L\)-smooth if its gradient is Lipschitz continuous with constant \(L\):
\[
\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^n.
\]
This ensures that \(f\) does not curve too rapidly, allowing gradient-based methods to
converge predictably.
In gradient-based optimization, \(L\)-smoothness is critical because it provides an upper bound
on the function's curvature, ensuring that a step size of \(\eta \leq 1/L\) will always decrease the function
value without overshooting. While Lipschitz continuity limits how much the function can change,
smoothness limits how fast the gradient can change.
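The following minimal sketch (the quadratic and its eigenvalues are illustrative choices) runs gradient descent with \(\eta = 1/L\) on an \(L\)-smooth function and checks that the function value never increases:

import numpy as np

# Gradient descent with step size eta = 1/L on the L-smooth quadratic
# f(x) = 0.5 * x^T A x, where L = lambda_max(A) is the gradient's Lipschitz constant.
A = np.diag([1.0, 10.0])                 # eigenvalues 1 and 10, so L = 10
L = np.linalg.eigvalsh(A).max()
eta = 1.0 / L

x = np.array([5.0, 5.0])
values = []
for _ in range(50):
    values.append(0.5 * x @ A @ x)
    x = x - eta * (A @ x)                # gradient step: grad f(x) = A x

# With eta <= 1/L the function value never increases from one iterate to the next.
assert all(v1 >= v2 for v1, v2 in zip(values, values[1:]))
print("f after 50 steps:", 0.5 * x @ A @ x)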
When \(f\) is \(L\)-smooth and also strongly convex (as in the quadratic example below), steepest descent converges at a
linear rate: there exists \(0 < \mu < 1\) such that
\[
|f(x_{t+1}) - f(x^*)| \leq \mu \, |f(x_t) - f(x^*)|.
\]
The constant \(\mu\) is the convergence rate. To understand what determines
\(\mu\), we analyze the canonical case of a quadratic objective.
Example: Quadratic Objective
Consider the quadratic loss function
\[
f(x) = \frac{1}{2}x^\top A x + b^\top x + c \tag{1}
\]
where \(A \in \mathbb{R}^{n \times n}\) is symmetric positive definite,
\(b \in \mathbb{R}^n\), and \(c \in \mathbb{R}\). This function has a unique
minimizer \(x^* = -A^{-1}b\).
To analyze the convergence of steepest descent on this quadratic, we first introduce the
concept of a contraction mapping, which provides the framework for linear convergence analysis.
Definition: Contraction Mapping
Let \((M, d)\) be a metric space. A map \(f: M \to M\) is a
contraction if there exists a constant \(0 \leq k < 1\) such that
\[
d(f(x), f(y)) \leq k \, d(x, y) \quad \text{for all } x, y \in M.
\]
The constant \(k\) is called the contraction factor.
A contraction is a Lipschitz function with Lipschitz constant less than 1 and is therefore
uniformly continuous on its domain. If an iterative method satisfies this contraction property,
the error shrinks geometrically at every step, which guarantees convergence.
Note on Terminology:
We use the term "Contraction" here to align with standard Optimization and ML
literature (e.g., Murphy, 2022). However, to distinguish this strictly contracting case from maps that merely do not increase
distances (\(k \leq 1\)), our analysis section refers to the \(k < 1\) case as a Strong Contraction.
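A tiny illustration of this behavior (the particular affine map is an illustrative choice): iterating a contraction with factor \(k = 0.5\) halves the distance to the fixed point at every step:

# Fixed-point iteration with the affine map f(x) = 0.5 * x + 1, a contraction with k = 0.5.
# Its unique fixed point is x* = 2, and the error |x_k - x*| halves at every step.
f = lambda x: 0.5 * x + 1.0
x_star = 2.0

x = 10.0
for k in range(8):
    print(f"step {k}: error = {abs(x - x_star):.6f}")
    x = f(x)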
Now, suppose we apply steepest descent with a fixed step size \(\alpha\).
To analyze the iterates relative to the optimum \(x^*\), we can rewrite Function (1), up to an additive constant that affects neither the gradient nor the iteration, as
\[
f(x) = \frac{1}{2}(x - x^*)^\top A (x - x^*)
\]
with gradient:
\[
\nabla f(x) = A(x-x^*).
\]
The steepest descent iteration
\[
x_{k+1} = x_k - \alpha \nabla f(x_k)
\]
can be rewritten as:
\[
x_{k+1} - x^* = (I - \alpha A)(x_k - x^*).
\]
For the iteration to be a contraction (i.e., to ensure the error shrinks), we require:
\[
| 1 - \alpha \lambda_i | < 1, \, \forall i \tag{2}
\]
where \(\lambda_i\) are eigenvalues of \(A\). This ensures that \(I - \alpha A\) has eigenvalues strictly inside
the unit circle.
Let \(\lambda_{\min}\) and \(\lambda_{\max}\) be the smallest and largest eigenvalues of \(A\), respectively.
From Inequality (2), we get
\[
0 < \alpha \lambda_i < 2 \quad \Longrightarrow \quad 0 < \alpha < \frac{2}{\lambda_{\max}}.
\]
Thus, for \(0 < \alpha < \frac{2}{\lambda_{\max}}\), the map \(x \mapsto (I - \alpha A)x\) is a contraction with respect to
the Euclidean norm. This condition is critical because it guarantees that the error diminishes at each iteration.
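A short numerical check of Condition (2) (the diagonal matrix is an illustrative choice): compute the spectral radius of \(I - \alpha A\) for step sizes on either side of \(2 / \lambda_{\max}\):

import numpy as np

# Check Condition (2): the iteration matrix I - alpha*A is a contraction
# exactly when its spectral radius max_i |1 - alpha*lambda_i| is below 1.
A = np.diag([1.0, 4.0, 10.0])             # lambda_min = 1, lambda_max = 10
eigs = np.linalg.eigvalsh(A)

for alpha in [0.05, 0.19, 0.21]:          # 2 / lambda_max = 0.2
    rho = np.max(np.abs(1.0 - alpha * eigs))
    status = "contracts" if rho < 1 else "diverges"
    print(f"alpha = {alpha:4.2f}: spectral radius = {rho:.3f} -> {status}")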
To achieve the fastest convergence rate (i.e., to minimize the worst-case contraction factor), we choose \(\alpha^*\)
such that the two extreme eigenvalues of \(I - \alpha A\) have the same absolute value:
\[
|1 - \alpha^* \lambda_{\min}| = |1 - \alpha^* \lambda_{\max}|.
\]
Assuming \(1 - \alpha^* \lambda_{\min} \geq 0\) and \(1 - \alpha^* \lambda_{\max} \leq 0\), we solve:
\[
1 - \alpha^* \lambda_{\min} = \alpha^* \lambda_{\max} - 1,
\]
which gives:
\[
\alpha^* = \frac{2}{\lambda_{\min} + \lambda_{\max}}.
\]
With this step size, the contraction factor is:
\[
k = 1 - \alpha^*\lambda_{\min} = \frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}} = \frac{\kappa-1}{\kappa+1},
\]
where \(\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}\) is the condition number.
Note that for the quadratic objective \(f(x) = \frac{1}{2}x^\top A x + b^\top x + c\), the error in the function value,
\(f(x) - f^*\), is proportional to the square of the distance to the optimum weighted by the Hessian matrix \(A\).
Specifically,
\[
f(x) - f^* = \frac{1}{2}(x - x^*)^\top A (x - x^*).
\]
This quadratic relationship is why the convergence rate for the function value
is the square of the rate for the parameters: \(\mu = k^2\).
Theorem: Convergence Rate of Steepest Descent (Quadratic Case)
For the quadratic objective \(f(x) = \frac{1}{2}x^\top A x + b^\top x + c\) with
\(A\) symmetric positive definite, steepest descent with optimal step size
\(\alpha^* = \frac{2}{\lambda_{\min} + \lambda_{\max}}\) achieves the function-value
convergence rate
\[
\mu = \left(\frac{\kappa - 1}{\kappa + 1}\right)^2
\]
where \(\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}\) is the
condition number of \(A\).
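The theorem can be checked empirically with a few lines of Python (the specific \(A\), \(b\), and starting point are illustrative choices): run steepest descent with \(\alpha^*\) on a small quadratic and compare the observed per-step reduction of \(f(x_t) - f^*\) to the predicted \(\mu\):

import numpy as np

# Steepest descent with the optimal fixed step on a quadratic, compared against
# the predicted function-value rate mu = ((kappa - 1) / (kappa + 1))^2.
A = np.diag([1.0, 9.0])                       # lambda_min = 1, lambda_max = 9, kappa = 9
b = np.array([1.0, -2.0])
x_star = np.linalg.solve(A, -b)               # unique minimizer x* = -A^{-1} b
f = lambda x: 0.5 * x @ A @ x + b @ x
f_star = f(x_star)

alpha = 2.0 / (1.0 + 9.0)                     # alpha* = 2 / (lambda_min + lambda_max)
mu_theory = ((9.0 - 1.0) / (9.0 + 1.0)) ** 2  # 0.64

x = np.array([10.0, 10.0])
prev_gap = f(x) - f_star
for _ in range(30):
    x = x - alpha * (A @ x + b)               # gradient step
    gap = f(x) - f_star
    ratio = gap / prev_gap
    prev_gap = gap
print(f"observed per-step ratio = {ratio:.3f}, predicted mu = {mu_theory:.2f}")

For this example \(\kappa = 9\), so the predicted rate is \(\mu = (8/10)^2 = 0.64\), and the observed per-step ratio matches it.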
Problems with low condition numbers (\(\kappa\) close to 1) converge rapidly, while ill-conditioned
problems (\(\kappa \gg 1\)) converge slowly. This is why preconditioning, transforming
the problem to reduce \(\kappa\), is a central technique in numerical optimization.
Insight: Condition Number in Deep Learning
The condition number governs training dynamics far beyond quadratic objectives. In deep learning, the
effective condition number of the loss landscape's Hessian determines how well gradient-based optimizers
perform. Adam and other adaptive methods implicitly approximate a preconditioner by
maintaining per-parameter learning rates, effectively reducing the condition number seen by each parameter.
Batch normalization also tends to improve the conditioning of the loss landscape, which
is one reason it accelerates training. Understanding \(\kappa\) helps explain why some architectures train easily
while others require careful hyperparameter tuning.
Interactive Duality Visualization
The theoretical results above can be difficult to internalize without concrete examples. The following interactive visualization
demonstrates the primal-dual relationship in linear programming, where strong duality always holds. By adjusting
the parameters, you can observe how the primal and dual feasible regions change, how the duality gap closes at optimality, and
how complementary slackness operates.