So far we have studied unconstrained minimization
of \(\mathcal{L}: \mathbb{R}^d \to \mathbb{R}\). In most applications, however, the parameters
must additionally satisfy a collection of equality and inequality constraints — a feasibility
region carved out of \(\mathbb{R}^d\) by the constraints.
Definition: Constrained Optimization Problem
Let \(\mathcal{L}: \mathbb{R}^d \to \mathbb{R}\) be the objective,
\(g_i: \mathbb{R}^d \to \mathbb{R}\) for \(i \in \mathcal{I}\) inequality constraints, and
\(h_j: \mathbb{R}^d \to \mathbb{R}\) for \(j \in \mathcal{E}\) equality constraints. The
constrained optimization problem is
\[
\boldsymbol{\theta}^* \in \operatorname*{arg\,min}_{\boldsymbol{\theta} \in \mathcal{C}}\, \mathcal{L}(\boldsymbol{\theta}),
\]
where the feasible set is
\[
\mathcal{C} = \{\, \boldsymbol{\theta} \in \mathbb{R}^d :
g_i(\boldsymbol{\theta}) \leq 0 \text{ for all } i \in \mathcal{I},\;
h_j(\boldsymbol{\theta}) = 0 \text{ for all } j \in \mathcal{E} \,\}.
\]
Penalty Method
A natural first attempt to reduce the constrained problem to something we already know how to solve
is the penalty method: replace the constraints by soft penalties in the objective,
\[
\mathcal{L}_\rho(\boldsymbol{\theta}) = \mathcal{L}(\boldsymbol{\theta})
+ \rho \sum_{i \in \mathcal{I}} \max\!\bigl(0, g_i(\boldsymbol{\theta})\bigr)^{\!2}
+ \rho \sum_{j \in \mathcal{E}} h_j(\boldsymbol{\theta})^2,
\]
where \(\rho > 0\) controls the penalty strength. The result is unconstrained and can be minimized
by gradient descent. The drawback is that for finite \(\rho\) the constraints are only approximately
satisfied, and driving \(\rho \to \infty\) causes the unconstrained problem to become
ill-conditioned.
When constraints must be strictly satisfied — or high-precision solutions are required — the
Lagrangian machinery introduced below provides a principled alternative that
reformulates constrained optimality as a system of equations (the KKT conditions).
In large-scale problems, hybrid approaches such as Augmented Lagrangian methods
combine the numerical robustness of penalties with the exactness of Lagrangian multipliers.
Lagrange Multipliers
We build up the general theory in two stages. First, consider the pure equality-constrained problem
with a single smooth constraint \(h: \mathbb{R}^d \to \mathbb{R}\),
\[
\operatorname*{min}_{\boldsymbol{\theta}}\, \mathcal{L}(\boldsymbol{\theta})
\quad \text{s.t.} \quad h(\boldsymbol{\theta}) = 0,
\qquad \mathcal{C} = \{\boldsymbol{\theta} \in \mathbb{R}^d : h(\boldsymbol{\theta}) = 0\}.
\]
Normal Direction at a Feasible Point
Fix \(\boldsymbol{\theta} \in \mathcal{C}\) and consider a small displacement
\(\boldsymbol{\epsilon} \in \mathbb{R}^d\) keeping \(\boldsymbol{\theta} + \boldsymbol{\epsilon} \in \mathcal{C}\).
By the first-order Taylor expansion,
\[
h(\boldsymbol{\theta} + \boldsymbol{\epsilon})
\approx h(\boldsymbol{\theta}) + \boldsymbol{\epsilon}^\top \nabla h(\boldsymbol{\theta})
= \boldsymbol{\epsilon}^\top \nabla h(\boldsymbol{\theta}),
\]
using \(h(\boldsymbol{\theta}) = 0\). Since \(h(\boldsymbol{\theta} + \boldsymbol{\epsilon}) = 0\) as well,
\(\boldsymbol{\epsilon}^\top \nabla h(\boldsymbol{\theta}) \approx 0\). Displacements that keep us on
\(\mathcal{C}\) are therefore (to first order) orthogonal to \(\nabla h(\boldsymbol{\theta})\) — equivalently,
\(\nabla h(\boldsymbol{\theta})\) points in the normal direction to the constraint surface.
Definition: Tangent Space (constraint surface)
The tangent space of the constraint surface \(\mathcal{C} = \{h = 0\}\) at
\(\boldsymbol{\theta}\) is
\[
T_{\boldsymbol{\theta}} \mathcal{C}
= \{\boldsymbol{\epsilon} \in \mathbb{R}^d : \nabla h(\boldsymbol{\theta})^\top \boldsymbol{\epsilon} = 0\}
= \operatorname{ker}\bigl(\nabla h(\boldsymbol{\theta})^\top\bigr).
\]
This is the first-order approximation of the set of feasible displacements from
\(\boldsymbol{\theta}\). The general notion of tangent space — independent of any specific
defining function — will be developed in the upcoming smooth-manifold series.
The Lagrange Multiplier Rule
Suppose \(\boldsymbol{\theta}^* \in \mathcal{C}\) is a local minimum. If
\(\nabla \mathcal{L}(\boldsymbol{\theta}^*)\) had any tangent component — i.e., were not orthogonal to
\(T_{\boldsymbol{\theta}^*} \mathcal{C}\) — moving a small step along that component inside
\(\mathcal{C}\) would strictly decrease \(\mathcal{L}\), contradicting local optimality. Therefore
\(\nabla \mathcal{L}(\boldsymbol{\theta}^*)\) is also normal to \(\mathcal{C}\).
A caveat about rigor. The step "move along a tangent direction inside \(\mathcal{C}\)"
is geometrically intuitive but not self-justifying: the linearized tangent space
\(T_{\boldsymbol{\theta}^*}\mathcal{C}\) is defined by a first-order condition, and it is a
non-trivial fact that every tangent vector is actually realized as the velocity of a smooth curve
inside \(\mathcal{C}\). This realization is the content of the implicit function theorem,
and it is precisely what the regularity condition \(\nabla h(\boldsymbol{\theta}^*) \neq \mathbf{0}\)
buys. The intrinsic language for making this argument airtight — tangent spaces defined
independently of any specific defining function, and regular level sets as smooth manifolds —
belongs to the smooth-manifold framework developed in the upcoming manifold series. Here we accept
this geometric realization on faith and proceed with the formal Lagrange rule.
Assuming \(\nabla h(\boldsymbol{\theta}^*) \neq \mathbf{0}\) (the single-constraint analogue of the
LICQ regularity condition introduced below), the normal space is one-dimensional, spanned by
\(\nabla h(\boldsymbol{\theta}^*)\), so \(\nabla \mathcal{L}(\boldsymbol{\theta}^*)\) is a scalar
multiple of \(\nabla h(\boldsymbol{\theta}^*)\): there exists \(\lambda \in \mathbb{R}\), the
Lagrange multiplier, with
\[
\nabla \mathcal{L}(\boldsymbol{\theta}^*) + \lambda\, \nabla h(\boldsymbol{\theta}^*) = \mathbf{0}. \tag{1}
\]
Definition: Lagrangian (Equality Constraints)
For equality constraints \(h_j(\boldsymbol{\theta}) = 0\), \(j = 1, \ldots, p\), the
Lagrangian is
\[
\mathcal{L}\!\left(\boldsymbol{\theta}, \boldsymbol{\lambda}\right)
= \mathcal{L}(\boldsymbol{\theta})
+ \sum_{j=1}^{p} \lambda_j\, h_j(\boldsymbol{\theta})
= \mathcal{L}(\boldsymbol{\theta}) + \boldsymbol{\lambda}^\top \mathbf{h}(\boldsymbol{\theta}),
\]
where \(\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_p)^\top \in \mathbb{R}^p\) is the vector
of Lagrange multipliers and \(\mathbf{h} = (h_1, \ldots, h_p)^\top\). (We overload \(\mathcal{L}\) to
refer to both the objective and the Lagrangian; the number of arguments disambiguates.)
At a stationary point of the Lagrangian, \(\nabla_{\boldsymbol{\theta}, \boldsymbol{\lambda}}\, \mathcal{L} = \mathbf{0}\),
which unpacks to
\[
\nabla_{\boldsymbol{\theta}}\, \mathcal{L}(\boldsymbol{\theta}) + \sum_{j=1}^p \lambda_j\, \nabla h_j(\boldsymbol{\theta}) = \mathbf{0},
\qquad h_j(\boldsymbol{\theta}) = 0 \text{ for all } j.
\]
Such a point is called a critical point; it simultaneously satisfies primal feasibility
and the stationarity condition (1).
Geometric Intuition
At a constrained optimum \(\boldsymbol{\theta}^*\), you cannot improve the objective by moving along
the constraint surface. This means the gradient of the objective \(\nabla \mathcal{L}(\boldsymbol{\theta}^*)\)
has no component tangent to \(\mathcal{C}\) — it must point purely in the normal direction, which is
spanned by \(\{\nabla h_j(\boldsymbol{\theta}^*)\}\). The Lagrange multiplier
\(\lambda_j\) measures "how much" of the objective gradient is absorbed by constraint \(h_j\), and its
sign records whether the constraint pulls the objective up or down as it is relaxed.
The KKT Conditions
We now add inequality constraints \(g_i(\boldsymbol{\theta}) \leq 0\). A single inequality is handled
by extending the Lagrangian with a non-negative multiplier \(\mu \geq 0\):
\[
\mathcal{L}(\boldsymbol{\theta}, \mu) = \mathcal{L}(\boldsymbol{\theta}) + \mu\, g(\boldsymbol{\theta}).
\]
The sign restriction \(\mu \geq 0\) is essential — it is what distinguishes inequalities from
equalities, and encodes that the constraint can only push back against infeasibility, never
pull into it. For multiple inequalities, the extension is additive:
\(\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\mu}) = \mathcal{L}(\boldsymbol{\theta}) + \sum_{i=1}^{m} \mu_i\, g_i(\boldsymbol{\theta})\).
Definition: Active Set
At a feasible point \(\boldsymbol{\theta} \in \mathcal{C}\), an inequality constraint
\(g_i(\boldsymbol{\theta}) \leq 0\) is said to be active if
\(g_i(\boldsymbol{\theta}) = 0\). The active set at \(\boldsymbol{\theta}\) is
\[
\mathcal{A}(\boldsymbol{\theta}) = \{\, i \in \{1, \ldots, m\} : g_i(\boldsymbol{\theta}) = 0 \,\}.
\]
Only active inequalities constrain the local geometry at \(\boldsymbol{\theta}\); inactive ones
(\(g_i(\boldsymbol{\theta}) < 0\)) have slack and play no local role.
In the presence of both \(m\) inequality constraints and \(p\) equality constraints the general
Lagrangian is
\[
\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\mu}, \boldsymbol{\lambda})
= \mathcal{L}(\boldsymbol{\theta})
+ \sum_{i=1}^{m} \mu_i\, g_i(\boldsymbol{\theta})
+ \sum_{j=1}^{p} \lambda_j\, h_j(\boldsymbol{\theta}),
\]
and the primal problem admits the min–max reformulation
\[
\min_{\boldsymbol{\theta}} \max_{\boldsymbol{\mu} \geq 0,\, \boldsymbol{\lambda}}\,
\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\mu}, \boldsymbol{\lambda}). \tag{P}
\]
The inner maximization over \(\boldsymbol{\mu} \geq 0\) returns \(+\infty\) whenever any
\(g_i(\boldsymbol{\theta}) > 0\) (infeasibility), and the maximization over unrestricted
\(\boldsymbol{\lambda}\) returns \(+\infty\) whenever any \(h_j(\boldsymbol{\theta}) \neq 0\). Feasibility
is therefore enforced implicitly through the min–max structure. This formulation anchors the transition
to duality, where we will instead study the dual
\(\max_{\boldsymbol{\mu} \geq 0, \boldsymbol{\lambda}} \min_{\boldsymbol{\theta}} \mathcal{L}\).
Definition: Linear Independence Constraint Qualification (LICQ)
The linear independence constraint qualification holds at a feasible point
\(\boldsymbol{\theta}^*\) if the gradients
\[
\{\nabla g_i(\boldsymbol{\theta}^*) : i \in \mathcal{A}(\boldsymbol{\theta}^*)\}
\;\cup\;
\{\nabla h_j(\boldsymbol{\theta}^*) : j = 1, \ldots, p\}
\]
are linearly independent in \(\mathbb{R}^d\). LICQ is one of several possible
constraint qualifications ensuring that the linearized constraint geometry at
\(\boldsymbol{\theta}^*\) faithfully captures the nonlinear feasible set.
Theorem: Karush–Kuhn–Tucker (KKT) Conditions
Let \(\boldsymbol{\theta}^*\) be a feasible point. Say that
\((\boldsymbol{\theta}^*, \boldsymbol{\mu}^*, \boldsymbol{\lambda}^*)\) satisfies the KKT
conditions if all four of the following hold:
Primal feasibility:
\[
g_i(\boldsymbol{\theta}^*) \leq 0 \text{ for all } i, \qquad h_j(\boldsymbol{\theta}^*) = 0 \text{ for all } j.
\]
Dual feasibility:
\[
\mu_i^* \geq 0 \quad \text{for all } i = 1, \ldots, m.
\]
Complementary slackness:
\[
\mu_i^*\, g_i(\boldsymbol{\theta}^*) = 0 \quad \text{for all } i = 1, \ldots, m,
\]
equivalently \(\boldsymbol{\mu}^* \odot \mathbf{g}(\boldsymbol{\theta}^*) = \mathbf{0}\). For each
\(i\), either \(\mu_i^* = 0\) (constraint inactive) or \(g_i(\boldsymbol{\theta}^*) = 0\)
(constraint active); only constraints in the active set \(\mathcal{A}(\boldsymbol{\theta}^*)\)
contribute non-trivially.
(i) Necessity.
If \(\boldsymbol{\theta}^*\) is a local minimum of the constrained problem and
LICQ holds at \(\boldsymbol{\theta}^*\), then there exist
multipliers \((\boldsymbol{\mu}^*, \boldsymbol{\lambda}^*)\) such that the KKT conditions hold. (No
convexity of \(\mathcal{L}\) or \(g_i\) is required for necessity.)
(ii) Sufficiency under convexity.
If \(\mathcal{L}\) and all \(g_i\) are convex,
all \(h_j\) are affine, and \((\boldsymbol{\theta}^*, \boldsymbol{\mu}^*, \boldsymbol{\lambda}^*)\)
satisfies the KKT conditions, then \(\boldsymbol{\theta}^*\) is a global minimum of the
constrained problem.
Reading KKT: sign conventions and the role of each condition
The dual-feasibility sign \(\mu_i^* \geq 0\) records that inequality constraints can only push
back from infeasibility. If a small relaxation of \(g_i\) (moving from \(g_i \leq 0\) to
\(g_i \leq \varepsilon\), \(\varepsilon > 0\)) would let us decrease \(\mathcal{L}\), then \(\mu_i^*\)
measures the rate — this is the sensitivity interpretation of Lagrange multipliers.
Equality multipliers \(\lambda_j^*\) can have either sign, reflecting that an equality constraint
can be relaxed in either direction.
KKT is critical in many machine-learning applications such as the
support vector machine (SVM); here we
demonstrate the KKT machinery on a simple problem that can be solved by hand.
Worked Example
We minimize
\[
f(\mathbf{x}) = (x_1 - 2)^2 + x_2^2 + x_3^2
\]
subject to
\[
\begin{aligned}
&g_1(\mathbf{x}) = x_1 + x_2 - 1 \leq 0, \\\\
&g_2(\mathbf{x}) = x_1 + x_3 - 1 \leq 0, \\\\
&h_1(\mathbf{x}) = x_1 - x_2 = 0.
\end{aligned}
\]
The objective is strictly convex, the inequality constraints are affine (hence convex), and the
equality constraint is affine; moreover the gradients \(\nabla g_1 = (1,1,0)^\top\),
\(\nabla g_2 = (1,0,1)^\top\), \(\nabla h_1 = (1,-1,0)^\top\) are linearly independent at every
feasible point, so LICQ holds. Both the necessity and the sufficiency parts of the KKT theorem apply:
a KKT point will be the global minimum.
Case (iv): \(\mu_1, \mu_2 > 0\) (both inequalities active).
\(x_1 + x_2 = 1\) and \(x_1 + x_3 = 1\), combined with \(h_1: x_1 = x_2\), give
\(x_1 = x_2 = x_3 = \tfrac{1}{2}\). From (3): \(\mu_2 = -2 x_3 = -1 < 0\), violating dual feasibility.
Rejected.
Only case (iii) yields a KKT point, so by sufficiency the constrained global minimum is
\[
\mathbf{x}^* = \bigl(\tfrac{1}{2},\, \tfrac{1}{2},\, 0\bigr)^\top,
\qquad
f(\mathbf{x}^*) = \bigl(\tfrac{1}{2} - 2\bigr)^2 + \bigl(\tfrac{1}{2}\bigr)^2 + 0 = \tfrac{5}{2}.
\]
A sample code for optimization problem with KKT conditions and a simple penalty method is as follows:
Implementation note: slack variables vs. active-set tracking
Numerical KKT solvers cannot directly handle the inequality \(g_i(\mathbf{x}) \leq 0\) nor the
kink in the complementary slackness condition \(\mu_i\, g_i(\mathbf{x}) = 0\). Two strategies
resolve this:
(1) Slack variables. Introduce \(s_i \geq 0\) and convert each inequality to an
equality:
\[
g_i(\mathbf{x}) \leq 0 \iff g_i(\mathbf{x}) + s_i = 0, \; s_i \geq 0.
\]
Complementary slackness then becomes \(\mu_i\, s_i = 0\), which is a smooth bilinear equation —
amenable to Newton-type solvers (as in the code above).
(2) Active-set methods. Guess which constraints are active, solve the resulting
equality-constrained problem, then check whether the guess is consistent with dual feasibility
and primal feasibility; if not, update the active set and repeat. Faster when the correct active
set can be identified cheaply, but requires more logic for dynamically updating the active set.
Modern interior-point methods use a smoothed version of slack-variable tracking with barrier
functions, and are the workhorse for large-scale convex programs — the subject of the next page
on duality, which reformulates the primal problem in
a way that reveals structure often invisible in the primal.