Duality
Optimization problems are often easier to analyze from a different angle. Duality formalizes this
idea: every constrained optimization problem (the primal) has a companion problem (the dual)
that provides bounds on the optimal value, and sometimes solves the original problem exactly. This principle arises throughout
machine learning, convex analysis, and game theory, and it connects the KKT conditions to
algorithmic efficiency.
Definition: Primal to Dual Construction
In constrained optimization, we define the primal problem
\[
\min_x f(x) \, \text{ subject to } g_i (x) \leq 0, \, h_j (x) = 0
\]
where \(f(x)\) is the objective function, \(g_i(x)\) are inequality constraints, and \(h_j(x)\) are equality
constraints.
We define the Lagrangian as a combination of the objective function and the constraints:
\[
\mathcal{L}(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x)
\]
where \(\lambda_i\) and \(\nu_j\) are the Lagrange multipliers.
The Lagrange dual function is then defined as the infimum of the Lagrangian over \(x\):
\[
g(\lambda, \nu) = \inf_x \mathcal{L}(x, \lambda, \nu).
\]
Crucially, for the dual function to remain a valid lower bound on the primal optimal value \(p^*\)
in the presence of inequality constraints \(g_i(x) \leq 0\), we must require \(\lambda_i \geq 0\).
The dual problem seeks the best possible lower bound by maximizing this dual function:
\[
d^* = \max_{\lambda \geq 0, \nu} g(\lambda, \nu).
\]
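To make the construction concrete, here is a minimal Python sketch (the one-dimensional toy problem, and the use of NumPy and SciPy, are illustrative choices, not part of the development above). It forms the Lagrangian for \(\min_x x^2\) subject to \(x \geq 1\), evaluates the dual function on a grid of multipliers, and maximizes it:

import numpy as np
from scipy.optimize import minimize_scalar

# Toy primal: minimize f(x) = x^2 subject to g(x) = 1 - x <= 0 (i.e., x >= 1).
# The primal optimum is x* = 1 with p* = 1.

def lagrangian(x, lam):
    return x**2 + lam * (1.0 - x)

def dual(lam):
    # g(lam) = inf_x L(x, lam); here the infimum is attained at x = lam / 2,
    # giving the closed form g(lam) = lam - lam^2 / 4.
    return minimize_scalar(lambda x: lagrangian(x, lam)).fun

lams = np.linspace(0.0, 4.0, 401)        # grid of multipliers lambda >= 0
d_star = max(dual(lam) for lam in lams)  # best lower bound found on the grid
p_star = 1.0

print(f"d* = {d_star:.4f} <= p* = {p_star:.4f}")  # the gap is zero for this toy problem

For this problem the best lower bound is \(d^* = 1\), which coincides with \(p^* = 1\); the general guarantee \(d^* \leq p^*\) is the weak duality result stated below.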
This construction turns the original problem into one that always yields a guaranteed lower bound.
Regardless of the complexity or convexity of the primal problem, the dual provides a lower bound on the primal optimal value; moreover, since \(g(\lambda, \nu)\) is a pointwise infimum of affine functions of \((\lambda, \nu)\), it is always concave, so the dual problem is a convex problem in its own right. This leads to one of the most fundamental results in optimization:
Theorem: Weak Duality
The optimal value of the dual problem \(d^*\) provides a lower bound on
the optimal value of the primal problem \(p^*\):
\[
d^* \leq p^*.
\]
This holds for any optimization problem (convex or not) where a valid dual is formulated.
The difference \(p^* - d^* \geq 0\) is called the duality gap. When this gap
is zero, the dual problem provides the exact optimal value of the primal, a situation known as
strong duality.
Theorem: Strong Duality and Slater's Condition
If \(p^* = d^*\), we say strong duality holds. Strong duality does not
hold in general (particularly for non-convex problems). However, for convex
problems, Slater's condition is sufficient: if there exists a strictly
feasible point \(x\) satisfying the equality constraints \(h_j(x) = 0\) and
\[
g_i(x) < 0 \quad \text{for all } i,
\]
then strong duality holds.
When strong duality holds, the dual optimal solution corresponds to the KKT multipliers
(Lagrange multipliers) from the
KKT conditions, and
the KKT conditions become both necessary and sufficient for optimality.
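As a sketch of how dual variables surface in practice, the following example (assuming the cvxpy package is available; the specific projection problem is an illustrative choice) solves a small convex QP, reads off the optimal multiplier as the constraint's dual value, and checks the KKT stationarity condition:

import numpy as np
import cvxpy as cp

# Small convex QP: project the point c = (2, 1) onto the half-space x1 + x2 <= 1.
c = np.array([2.0, 1.0])
x = cp.Variable(2)
constraint = x[0] + x[1] <= 1
problem = cp.Problem(cp.Minimize(cp.sum_squares(x - c)), [constraint])

p_star = problem.solve()
lam = constraint.dual_value          # dual optimal multiplier for the inequality

# KKT stationarity: grad f(x*) + lam * grad g(x*) = 0, with grad g = (1, 1).
grad_f = 2 * (x.value - c)
print("p* =", p_star)                                        # 2.0
print("lambda* =", lam)                                      # 2.0
print("stationarity residual:", grad_f + lam * np.ones(2))   # approximately (0, 0)

Since a strictly feasible point exists here (for example \(x = 0\)), Slater's condition holds, and the reported dual value is exactly the KKT multiplier at the optimum.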
Insight: Duality in Machine Learning
Duality is widely used in machine learning and statistical learning theory.
In Support Vector Machines (SVMs),
the dual formulation is often preferred over the primal because it enables the
kernel trick, allowing nonlinear classification in high-dimensional
feature spaces without explicitly computing the mapping. In
regularization methods
(Lasso, Ridge, etc.), dual formulations help derive generalization bounds and
computationally efficient optimization algorithms. More broadly, many deep learning
training objectives, including adversarial training and Wasserstein GANs, can be
analyzed through duality principles; in the Wasserstein GAN, for instance, the critic approximately solves a dual formulation of the distance being minimized.
Lipschitz Continuity
Duality tells us what the optimal value is, but not how fast an algorithm can reach it. Convergence
rates depend on regularity properties of the objective function, particularly how rapidly its gradient can change.
Lipschitz continuity provides the precise mathematical framework for bounding this rate of change,
and it is the key assumption behind most convergence guarantees in optimization.
Definition: Lipschitz Continuity
Let \((X, d)\) and \((Y, e)\) be
metric spaces and
\(f: X \to Y\). The function \(f\) is Lipschitz continuous on \(X\)
with Lipschitz constant \(L \geq 0\) if
\[
e(f(a), f(b)) \leq L \, d(a, b) \quad \text{for all } a, b \in X.
\]
While Lipschitz continuity has broad applications in mathematical analysis
(see Continuity), its significance in
optimization lies in the bound it places on the rate of change:
it ensures that function values or gradients cannot change arbitrarily fast, which prevents numerical instability
and keeps optimization trajectories well-behaved.
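As a quick numerical sanity check (purely illustrative), a Lipschitz constant can be estimated empirically by sampling pairs of points and recording the largest observed slope; for \(f(x) = \sin(x)\) the true constant is \(L = 1\):

import numpy as np

# Empirically estimate a Lipschitz constant for f(x) = sin(x) on [-5, 5] by
# sampling random pairs (a, b) and recording the largest slope |f(a) - f(b)| / |a - b|.
rng = np.random.default_rng(0)
a = rng.uniform(-5, 5, size=100_000)
b = rng.uniform(-5, 5, size=100_000)
mask = np.abs(a - b) > 1e-9              # avoid division by (near-)zero gaps

slopes = np.abs(np.sin(a[mask]) - np.sin(b[mask])) / np.abs(a[mask] - b[mask])
print("largest observed slope:", slopes.max())   # close to, and never above, L = 1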
When the Lipschitz condition is applied to the gradient rather than the function
itself, we obtain the important notion of smoothness.
Definition: \(L\)-Smoothness
A continuously differentiable function \(f: \mathbb{R}^n \to \mathbb{R}\) is
\(L\)-smooth if its gradient is Lipschitz continuous with constant \(L\):
\[
\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^n.
\]
This ensures that \(f\) does not curve too rapidly, allowing gradient-based methods to
converge predictably.
In gradient-based optimization, \(L\)-smoothness is critical because it provides an upper bound
on the function's curvature, ensuring that a step size of \(\eta \leq 1/L\) will always decrease the function
value without overshooting. While Lipschitz continuity limits how much the function can change,
smoothness limits how fast the gradient can change.
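The following minimal sketch (the quadratic and its eigenvalues are illustrative choices) runs gradient descent with \(\eta = 1/L\) on an \(L\)-smooth function and checks that the function value never increases:

import numpy as np

# Gradient descent with step size eta = 1/L on the L-smooth quadratic
# f(x) = 0.5 * x^T A x, where L = lambda_max(A) is the gradient's Lipschitz constant.
A = np.diag([1.0, 10.0])                 # eigenvalues 1 and 10, so L = 10
L = np.linalg.eigvalsh(A).max()
eta = 1.0 / L

x = np.array([5.0, 5.0])
values = []
for _ in range(50):
    values.append(0.5 * x @ A @ x)
    x = x - eta * (A @ x)                # gradient step: grad f(x) = A x

# With eta <= 1/L the function value never increases from one iterate to the next.
assert all(v1 >= v2 for v1, v2 in zip(values, values[1:]))
print("f after 50 steps:", 0.5 * x @ A @ x)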
When \(f\) is \(L\)-smooth and also strongly convex (as in the quadratic example below), steepest descent converges at a
linear rate: there exists \(0 < \mu < 1\) such that
\[
|f(x_{t+1}) - f(x^*)| \leq \mu \, |f(x_t) - f(x^*)|.
\]
The constant \(\mu\) is the convergence rate. To understand what determines
\(\mu\), we analyze the canonical case of a quadratic objective.
Example: Quadratic Objective
Consider the quadratic loss function
\[
f(x) = \frac{1}{2}x^\top A x + b^\top x + c \tag{1}
\]
where \(A \in \mathbb{R}^{n \times n}\) is symmetric positive definite,
\(b \in \mathbb{R}^n\), and \(c \in \mathbb{R}\). This function has a unique
minimizer \(x^* = -A^{-1}b\).
To analyze the convergence of steepest descent on this quadratic, we first introduce the
concept of a contraction mapping, which provides the framework for linear convergence analysis.
Definition: Contraction Mapping
Let \((M, d)\) be a metric space. A map \(f: M \to M\) is a
contraction if there exists a constant \(0 \leq k < 1\) such that
\[
d(f(x), f(y)) \leq k \, d(x, y) \quad \text{for all } x, y \in M.
\]
The constant \(k\) is called the contraction factor.
A contraction is a Lipschitz function with Lipschitz constant less than 1 and is therefore
uniformly continuous on its domain. If an iterative method satisfies this contraction property,
the error shrinks geometrically at every step, which guarantees convergence.
Note on Terminology:
We use the term "Contraction" here to align with standard Optimization and ML
literature (e.g., Murphy, 2022). However, to distinguish this strictly contracting case from maps that merely do not increase
distances (\(k \leq 1\)), our analysis section refers to the \(k < 1\) case as a Strong Contraction.
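A tiny illustration of this behavior (the particular affine map is an illustrative choice): iterating a contraction with factor \(k = 0.5\) halves the distance to the fixed point at every step:

# Fixed-point iteration with the affine map f(x) = 0.5 * x + 1, a contraction with k = 0.5.
# Its unique fixed point is x* = 2, and the error |x_k - x*| halves at every step.
f = lambda x: 0.5 * x + 1.0
x_star = 2.0

x = 10.0
for k in range(8):
    print(f"step {k}: error = {abs(x - x_star):.6f}")
    x = f(x)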
Now, suppose we apply steepest descent with a fixed step size \(\alpha\).
To analyze the iterates relative to the optimum \(x^*\), we can rewrite Function (1), up to an additive constant that affects neither the gradient nor the iteration, as
\[
f(x) = \frac{1}{2}(x - x^*)^\top A (x - x^*)
\]
with gradient:
\[
\nabla f(x) = A(x-x^*).
\]
The steepest descent iteration
\[
x_{k+1} = x_k - \alpha \nabla f(x_k)
\]
can be rewritten as:
\[
x_{k+1} - x^* = (I - \alpha A)(x_k - x^*).
\]
For the iteration to be a contraction (i.e., to ensure the error shrinks), we require:
\[
| 1 - \alpha \lambda_i | < 1, \, \forall i \tag{2}
\]
where \(\lambda_i\) are eigenvalues of \(A\). This ensures that \(I - \alpha A\) has eigenvalues strictly inside
the unit circle.
Let \(\lambda_{\min}\) and \(\lambda_{\max}\) be the smallest and largest eigenvalues of \(A\), respectively.
From Inequality (2), we get
\[
0 < \alpha \lambda_i < 2 \quad \Longrightarrow \quad 0 < \alpha < \frac{2}{\lambda_{\max}}.
\]
Thus, for \(0 < \alpha < \frac{2}{\lambda_{\max}}\), the map \(x \mapsto (I - \alpha A)x\) is a contraction with respect to
the Euclidean norm. This condition is critical because it guarantees that the error diminishes at each iteration.
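A short numerical check of Condition (2) (the diagonal matrix is an illustrative choice): compute the spectral radius of \(I - \alpha A\) for step sizes on either side of \(2 / \lambda_{\max}\):

import numpy as np

# Check Condition (2): the iteration matrix I - alpha*A is a contraction
# exactly when its spectral radius max_i |1 - alpha*lambda_i| is below 1.
A = np.diag([1.0, 4.0, 10.0])             # lambda_min = 1, lambda_max = 10
eigs = np.linalg.eigvalsh(A)

for alpha in [0.05, 0.19, 0.21]:          # 2 / lambda_max = 0.2
    rho = np.max(np.abs(1.0 - alpha * eigs))
    status = "contracts" if rho < 1 else "diverges"
    print(f"alpha = {alpha:4.2f}: spectral radius = {rho:.3f} -> {status}")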
To achieve the fastest convergence rate (i.e., to minimize the worst-case contraction factor), we choose \(\alpha^*\)
such that the two extreme eigenvalues of \(I - \alpha A\) have the same absolute value:
\[
|1 - \alpha^* \lambda_{\min}| = |1 - \alpha^* \lambda_{\max}|.
\]
Assuming \(1 - \alpha^* \lambda_{\min} \geq 0\) and \(1 - \alpha^* \lambda_{\max} \leq 0\), we solve:
\[
1 - \alpha^* \lambda_{\min} = \alpha^* \lambda_{\max} - 1,
\]
which gives:
\[
\alpha^* = \frac{2}{\lambda_{\min} + \lambda_{\max}}.
\]
With this step size, the contraction factor is:
\[
k = 1 - \alpha^*\lambda_{\min} = \frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}} = \frac{\kappa-1}{\kappa+1},
\]
where \(\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}\) is the condition number.
Note that for the quadratic objective \(f(x) = \frac{1}{2}x^\top A x + b^\top x + c\), the error in the function value,
\(f(x) - f^*\), is proportional to the square of the distance to the optimum weighted by the Hessian matrix \(A\).
Specifically,
\[
f(x) - f^* = \frac{1}{2}(x - x^*)^\top A (x - x^*).
\]
This quadratic relationship is why the convergence rate for the function value
is the square of the rate for the parameters: \(\mu = k^2\).
Theorem: Convergence Rate of Steepest Descent (Quadratic Case)
For the quadratic objective \(f(x) = \frac{1}{2}x^\top A x + b^\top x + c\) with
\(A\) symmetric positive definite, steepest descent with optimal step size
\(\alpha^* = \frac{2}{\lambda_{\min} + \lambda_{\max}}\) achieves the function-value
convergence rate
\[
\mu = \left(\frac{\kappa - 1}{\kappa + 1}\right)^2
\]
where \(\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}\) is the
condition number of \(A\).
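The theorem can be checked empirically with a few lines of Python (the specific \(A\), \(b\), and starting point are illustrative choices): run steepest descent with \(\alpha^*\) on a small quadratic and compare the observed per-step reduction of \(f(x_t) - f^*\) to the predicted \(\mu\):

import numpy as np

# Steepest descent with the optimal fixed step on a quadratic, compared against
# the predicted function-value rate mu = ((kappa - 1) / (kappa + 1))^2.
A = np.diag([1.0, 9.0])                       # lambda_min = 1, lambda_max = 9, kappa = 9
b = np.array([1.0, -2.0])
x_star = np.linalg.solve(A, -b)               # unique minimizer x* = -A^{-1} b
f = lambda x: 0.5 * x @ A @ x + b @ x
f_star = f(x_star)

alpha = 2.0 / (1.0 + 9.0)                     # alpha* = 2 / (lambda_min + lambda_max)
mu_theory = ((9.0 - 1.0) / (9.0 + 1.0)) ** 2  # 0.64

x = np.array([10.0, 10.0])
prev_gap = f(x) - f_star
for _ in range(30):
    x = x - alpha * (A @ x + b)               # gradient step
    gap = f(x) - f_star
    ratio = gap / prev_gap
    prev_gap = gap
print(f"observed per-step ratio = {ratio:.3f}, predicted mu = {mu_theory:.2f}")

For this example \(\kappa = 9\), so the predicted rate is \(\mu = (8/10)^2 = 0.64\), and the observed per-step ratio matches it.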
Problems with low condition numbers (\(\kappa\) close to 1) converge rapidly, while ill-conditioned
problems (\(\kappa \gg 1\)) converge slowly. This is why preconditioning, transforming
the problem to reduce \(\kappa\), is a central technique in numerical optimization.
Insight: Condition Number in Deep Learning
The condition number governs training dynamics far beyond quadratic objectives. In deep learning, the
effective condition number of the loss landscape's Hessian determines how well gradient-based optimizers
perform. Adam and other adaptive methods implicitly approximate a preconditioner by
maintaining per-parameter learning rates, effectively reducing the condition number seen by each parameter.
Batch normalization also tends to improve the conditioning of the loss landscape, which
is one reason it accelerates training. Understanding \(\kappa\) helps explain why some architectures train easily
while others require careful hyperparameter tuning.
Interactive Duality Visualization
The theoretical results above can be difficult to internalize without concrete examples. The following interactive visualization
demonstrates the primal-dual relationship in linear programming, where strong duality always holds. By adjusting
the parameters, you can observe how the primal and dual feasible regions change, how the duality gap closes at optimality, and
how complementary slackness operates.