Constrained Optimization

Constrained Optimization Problem

So far we have studied unconstrained minimization of \(\mathcal{L}: \mathbb{R}^d \to \mathbb{R}\). In most applications, however, the parameters must additionally satisfy a collection of equality and inequality constraints — a feasibility region carved out of \(\mathbb{R}^d\) by the constraints.

Definition: Constrained Optimization Problem

Let \(\mathcal{L}: \mathbb{R}^d \to \mathbb{R}\) be the objective, \(g_i: \mathbb{R}^d \to \mathbb{R}\) for \(i \in \mathcal{I}\) inequality constraints, and \(h_j: \mathbb{R}^d \to \mathbb{R}\) for \(j \in \mathcal{E}\) equality constraints. The constrained optimization problem is \[ \boldsymbol{\theta}^* \in \operatorname*{arg\,min}_{\boldsymbol{\theta} \in \mathcal{C}}\, \mathcal{L}(\boldsymbol{\theta}), \] where the feasible set is \[ \mathcal{C} = \{\, \boldsymbol{\theta} \in \mathbb{R}^d : g_i(\boldsymbol{\theta}) \leq 0 \text{ for all } i \in \mathcal{I},\; h_j(\boldsymbol{\theta}) = 0 \text{ for all } j \in \mathcal{E} \,\}. \]

Penalty Method

A natural first attempt to reduce the constrained problem to something we already know how to solve is the penalty method: replace the constraints by soft penalties in the objective, \[ \mathcal{L}_\rho(\boldsymbol{\theta}) = \mathcal{L}(\boldsymbol{\theta}) + \rho \sum_{i \in \mathcal{I}} \max\!\bigl(0, g_i(\boldsymbol{\theta})\bigr)^{\!2} + \rho \sum_{j \in \mathcal{E}} h_j(\boldsymbol{\theta})^2, \] where \(\rho > 0\) controls the penalty strength. The result is unconstrained and can be minimized by gradient descent. The drawback is that for finite \(\rho\) the constraints are only approximately satisfied, and driving \(\rho \to \infty\) causes the unconstrained problem to become ill-conditioned.

When constraints must be strictly satisfied — or high-precision solutions are required — the Lagrangian machinery introduced below provides a principled alternative that reformulates constrained optimality as a system of equations (the KKT conditions). In large-scale problems, hybrid approaches such as Augmented Lagrangian methods combine the numerical robustness of penalties with the exactness of Lagrangian multipliers.

Lagrange Multipliers

We build up the general theory in two stages. First, consider the pure equality-constrained problem with a single smooth constraint \(h: \mathbb{R}^d \to \mathbb{R}\), \[ \operatorname*{min}_{\boldsymbol{\theta}}\, \mathcal{L}(\boldsymbol{\theta}) \quad \text{s.t.} \quad h(\boldsymbol{\theta}) = 0, \qquad \mathcal{C} = \{\boldsymbol{\theta} \in \mathbb{R}^d : h(\boldsymbol{\theta}) = 0\}. \]

Normal Direction at a Feasible Point

Fix \(\boldsymbol{\theta} \in \mathcal{C}\) and consider a small displacement \(\boldsymbol{\epsilon} \in \mathbb{R}^d\) keeping \(\boldsymbol{\theta} + \boldsymbol{\epsilon} \in \mathcal{C}\). By the first-order Taylor expansion, \[ h(\boldsymbol{\theta} + \boldsymbol{\epsilon}) \approx h(\boldsymbol{\theta}) + \boldsymbol{\epsilon}^\top \nabla h(\boldsymbol{\theta}) = \boldsymbol{\epsilon}^\top \nabla h(\boldsymbol{\theta}), \] using \(h(\boldsymbol{\theta}) = 0\). Since \(h(\boldsymbol{\theta} + \boldsymbol{\epsilon}) = 0\) as well, \(\boldsymbol{\epsilon}^\top \nabla h(\boldsymbol{\theta}) \approx 0\). Displacements that keep us on \(\mathcal{C}\) are therefore (to first order) orthogonal to \(\nabla h(\boldsymbol{\theta})\) — equivalently, \(\nabla h(\boldsymbol{\theta})\) points in the normal direction to the constraint surface.

Definition: Tangent Space (constraint surface)

The tangent space of the constraint surface \(\mathcal{C} = \{h = 0\}\) at \(\boldsymbol{\theta}\) is \[ T_{\boldsymbol{\theta}} \mathcal{C} = \{\boldsymbol{\epsilon} \in \mathbb{R}^d : \nabla h(\boldsymbol{\theta})^\top \boldsymbol{\epsilon} = 0\} = \operatorname{ker}\bigl(\nabla h(\boldsymbol{\theta})^\top\bigr). \] This is the first-order approximation of the set of feasible displacements from \(\boldsymbol{\theta}\). The general notion of tangent space — independent of any specific defining function — will be developed in the upcoming smooth-manifold series.

The Lagrange Multiplier Rule

Suppose \(\boldsymbol{\theta}^* \in \mathcal{C}\) is a local minimum. If \(\nabla \mathcal{L}(\boldsymbol{\theta}^*)\) had any tangent component — i.e., were not orthogonal to \(T_{\boldsymbol{\theta}^*} \mathcal{C}\) — moving a small step along that component inside \(\mathcal{C}\) would strictly decrease \(\mathcal{L}\), contradicting local optimality. Therefore \(\nabla \mathcal{L}(\boldsymbol{\theta}^*)\) is also normal to \(\mathcal{C}\).

A caveat about rigor. The step "move along a tangent direction inside \(\mathcal{C}\)" is geometrically intuitive but not self-justifying: the linearized tangent space \(T_{\boldsymbol{\theta}^*}\mathcal{C}\) is defined by a first-order condition, and it is a non-trivial fact that every tangent vector is actually realized as the velocity of a smooth curve inside \(\mathcal{C}\). This realization is the content of the implicit function theorem, and it is precisely what the regularity condition \(\nabla h(\boldsymbol{\theta}^*) \neq \mathbf{0}\) buys. The intrinsic language for making this argument airtight — tangent spaces defined independently of any specific defining function, and regular level sets as smooth manifolds — belongs to the smooth-manifold framework developed in the upcoming manifold series. Here we accept this geometric realization on faith and proceed with the formal Lagrange rule.

Assuming \(\nabla h(\boldsymbol{\theta}^*) \neq \mathbf{0}\) (the single-constraint analogue of the LICQ regularity condition introduced below), the normal space is one-dimensional, spanned by \(\nabla h(\boldsymbol{\theta}^*)\), so \(\nabla \mathcal{L}(\boldsymbol{\theta}^*)\) is a scalar multiple of \(\nabla h(\boldsymbol{\theta}^*)\): there exists \(\lambda \in \mathbb{R}\), the Lagrange multiplier, with \[ \nabla \mathcal{L}(\boldsymbol{\theta}^*) + \lambda\, \nabla h(\boldsymbol{\theta}^*) = \mathbf{0}. \tag{1} \]

Definition: Lagrangian (Equality Constraints)

For equality constraints \(h_j(\boldsymbol{\theta}) = 0\), \(j = 1, \ldots, p\), the Lagrangian is \[ \mathcal{L}\!\left(\boldsymbol{\theta}, \boldsymbol{\lambda}\right) = \mathcal{L}(\boldsymbol{\theta}) + \sum_{j=1}^{p} \lambda_j\, h_j(\boldsymbol{\theta}) = \mathcal{L}(\boldsymbol{\theta}) + \boldsymbol{\lambda}^\top \mathbf{h}(\boldsymbol{\theta}), \] where \(\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_p)^\top \in \mathbb{R}^p\) is the vector of Lagrange multipliers and \(\mathbf{h} = (h_1, \ldots, h_p)^\top\). (We overload \(\mathcal{L}\) to refer to both the objective and the Lagrangian; the number of arguments disambiguates.)

At a stationary point of the Lagrangian, \(\nabla_{\boldsymbol{\theta}, \boldsymbol{\lambda}}\, \mathcal{L} = \mathbf{0}\), which unpacks to \[ \nabla_{\boldsymbol{\theta}}\, \mathcal{L}(\boldsymbol{\theta}) + \sum_{j=1}^p \lambda_j\, \nabla h_j(\boldsymbol{\theta}) = \mathbf{0}, \qquad h_j(\boldsymbol{\theta}) = 0 \text{ for all } j. \] Such a point is called a critical point; it simultaneously satisfies primal feasibility and the stationarity condition (1).

Geometric Intuition

At a constrained optimum \(\boldsymbol{\theta}^*\), you cannot improve the objective by moving along the constraint surface. This means the gradient of the objective \(\nabla \mathcal{L}(\boldsymbol{\theta}^*)\) has no component tangent to \(\mathcal{C}\) — it must point purely in the normal direction, which is spanned by \(\{\nabla h_j(\boldsymbol{\theta}^*)\}\). The Lagrange multiplier \(\lambda_j\) measures "how much" of the objective gradient is absorbed by constraint \(h_j\), and its sign records whether the constraint pulls the objective up or down as it is relaxed.

The KKT Conditions

We now add inequality constraints \(g_i(\boldsymbol{\theta}) \leq 0\). A single inequality is handled by extending the Lagrangian with a non-negative multiplier \(\mu \geq 0\): \[ \mathcal{L}(\boldsymbol{\theta}, \mu) = \mathcal{L}(\boldsymbol{\theta}) + \mu\, g(\boldsymbol{\theta}). \] The sign restriction \(\mu \geq 0\) is essential — it is what distinguishes inequalities from equalities, and encodes that the constraint can only push back against infeasibility, never pull into it. For multiple inequalities, the extension is additive: \(\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\mu}) = \mathcal{L}(\boldsymbol{\theta}) + \sum_{i=1}^{m} \mu_i\, g_i(\boldsymbol{\theta})\).

Definition: Active Set

At a feasible point \(\boldsymbol{\theta} \in \mathcal{C}\), an inequality constraint \(g_i(\boldsymbol{\theta}) \leq 0\) is said to be active if \(g_i(\boldsymbol{\theta}) = 0\). The active set at \(\boldsymbol{\theta}\) is \[ \mathcal{A}(\boldsymbol{\theta}) = \{\, i \in \{1, \ldots, m\} : g_i(\boldsymbol{\theta}) = 0 \,\}. \] Only active inequalities constrain the local geometry at \(\boldsymbol{\theta}\); inactive ones (\(g_i(\boldsymbol{\theta}) < 0\)) have slack and play no local role.

In the presence of both \(m\) inequality constraints and \(p\) equality constraints the general Lagrangian is \[ \mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\mu}, \boldsymbol{\lambda}) = \mathcal{L}(\boldsymbol{\theta}) + \sum_{i=1}^{m} \mu_i\, g_i(\boldsymbol{\theta}) + \sum_{j=1}^{p} \lambda_j\, h_j(\boldsymbol{\theta}), \] and the primal problem admits the min–max reformulation \[ \min_{\boldsymbol{\theta}} \max_{\boldsymbol{\mu} \geq 0,\, \boldsymbol{\lambda}}\, \mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\mu}, \boldsymbol{\lambda}). \tag{P} \] The inner maximization over \(\boldsymbol{\mu} \geq 0\) returns \(+\infty\) whenever any \(g_i(\boldsymbol{\theta}) > 0\) (infeasibility), and the maximization over unrestricted \(\boldsymbol{\lambda}\) returns \(+\infty\) whenever any \(h_j(\boldsymbol{\theta}) \neq 0\). Feasibility is therefore enforced implicitly through the min–max structure. This formulation anchors the transition to duality, where we will instead study the dual \(\max_{\boldsymbol{\mu} \geq 0, \boldsymbol{\lambda}} \min_{\boldsymbol{\theta}} \mathcal{L}\).

Definition: Linear Independence Constraint Qualification (LICQ)

The linear independence constraint qualification holds at a feasible point \(\boldsymbol{\theta}^*\) if the gradients \[ \{\nabla g_i(\boldsymbol{\theta}^*) : i \in \mathcal{A}(\boldsymbol{\theta}^*)\} \;\cup\; \{\nabla h_j(\boldsymbol{\theta}^*) : j = 1, \ldots, p\} \] are linearly independent in \(\mathbb{R}^d\). LICQ is one of several possible constraint qualifications ensuring that the linearized constraint geometry at \(\boldsymbol{\theta}^*\) faithfully captures the nonlinear feasible set.

Theorem: Karush–Kuhn–Tucker (KKT) Conditions

Let \(\boldsymbol{\theta}^*\) be a feasible point. Say that \((\boldsymbol{\theta}^*, \boldsymbol{\mu}^*, \boldsymbol{\lambda}^*)\) satisfies the KKT conditions if all four of the following hold:

Primal feasibility: \[ g_i(\boldsymbol{\theta}^*) \leq 0 \text{ for all } i, \qquad h_j(\boldsymbol{\theta}^*) = 0 \text{ for all } j. \]
Stationarity: \[ \nabla \mathcal{L}(\boldsymbol{\theta}^*) + \sum_{i=1}^{m} \mu_i^*\, \nabla g_i(\boldsymbol{\theta}^*) + \sum_{j=1}^{p} \lambda_j^*\, \nabla h_j(\boldsymbol{\theta}^*) = \mathbf{0}. \]
Dual feasibility: \[ \mu_i^* \geq 0 \quad \text{for all } i = 1, \ldots, m. \]
Complementary slackness: \[ \mu_i^*\, g_i(\boldsymbol{\theta}^*) = 0 \quad \text{for all } i = 1, \ldots, m, \] equivalently \(\boldsymbol{\mu}^* \odot \mathbf{g}(\boldsymbol{\theta}^*) = \mathbf{0}\). For each \(i\), either \(\mu_i^* = 0\) (constraint inactive) or \(g_i(\boldsymbol{\theta}^*) = 0\) (constraint active); only constraints in the active set \(\mathcal{A}(\boldsymbol{\theta}^*)\) contribute non-trivially.

(i) Necessity. If \(\boldsymbol{\theta}^*\) is a local minimum of the constrained problem and LICQ holds at \(\boldsymbol{\theta}^*\), then there exist multipliers \((\boldsymbol{\mu}^*, \boldsymbol{\lambda}^*)\) such that the KKT conditions hold. (No convexity of \(\mathcal{L}\) or \(g_i\) is required for necessity.)

(ii) Sufficiency under convexity. If \(\mathcal{L}\) and all \(g_i\) are convex, all \(h_j\) are affine, and \((\boldsymbol{\theta}^*, \boldsymbol{\mu}^*, \boldsymbol{\lambda}^*)\) satisfies the KKT conditions, then \(\boldsymbol{\theta}^*\) is a global minimum of the constrained problem.

Reading KKT: sign conventions and the role of each condition

The dual-feasibility sign \(\mu_i^* \geq 0\) records that inequality constraints can only push back from infeasibility. If a small relaxation of \(g_i\) (moving from \(g_i \leq 0\) to \(g_i \leq \varepsilon\), \(\varepsilon > 0\)) would let us decrease \(\mathcal{L}\), then \(\mu_i^*\) measures the rate — this is the sensitivity interpretation of Lagrange multipliers. Equality multipliers \(\lambda_j^*\) can have either sign, reflecting that an equality constraint can be relaxed in either direction.

KKT is critical in many machine-learning applications such as the support vector machine (SVM); here we demonstrate the KKT machinery on a simple problem that can be solved by hand.

Worked Example

We minimize \[ f(\mathbf{x}) = (x_1 - 2)^2 + x_2^2 + x_3^2 \] subject to \[ \begin{aligned} &g_1(\mathbf{x}) = x_1 + x_2 - 1 \leq 0, \\\\ &g_2(\mathbf{x}) = x_1 + x_3 - 1 \leq 0, \\\\ &h_1(\mathbf{x}) = x_1 - x_2 = 0. \end{aligned} \] The objective is strictly convex, the inequality constraints are affine (hence convex), and the equality constraint is affine; moreover the gradients \(\nabla g_1 = (1,1,0)^\top\), \(\nabla g_2 = (1,0,1)^\top\), \(\nabla h_1 = (1,-1,0)^\top\) are linearly independent at every feasible point, so LICQ holds. Both the necessity and the sufficiency parts of the KKT theorem apply: a KKT point will be the global minimum.

The Lagrangian is \[ \mathcal{L}(\mathbf{x}, \boldsymbol{\mu}, \lambda) = (x_1 - 2)^2 + x_2^2 + x_3^2 + \mu_1 (x_1 + x_2 - 1) + \mu_2 (x_1 + x_3 - 1) + \lambda (x_1 - x_2). \] Stationarity gives \[ \begin{aligned} &\tfrac{\partial \mathcal{L}}{\partial x_1} = 2 x_1 - 4 + \mu_1 + \mu_2 + \lambda = 0, \tag{1}\\\\ &\tfrac{\partial \mathcal{L}}{\partial x_2} = 2 x_2 + \mu_1 - \lambda = 0, \tag{2}\\\\ &\tfrac{\partial \mathcal{L}}{\partial x_3} = 2 x_3 + \mu_2 = 0. \tag{3} \end{aligned} \] Dual feasibility requires \(\mu_1, \mu_2 \geq 0\), and complementary slackness gives \[ \mu_1 (x_1 + x_2 - 1) = 0, \qquad \mu_2 (x_1 + x_3 - 1) = 0. \tag{4,5} \]

Case analysis. Complementary slackness splits the problem into four cases determined by which inequality constraints are active.

Case (i): \(\mu_1 = \mu_2 = 0\) (both inequalities inactive). From (3), \(x_3 = 0\). From (2), \(\lambda = 2 x_2\); from \(h_1\), \(x_1 = x_2\); from (1), \(2 x_1 - 4 + 2 x_1 = 0 \Rightarrow x_1 = 1\). Hence \(\mathbf{x} = (1, 1, 0)^\top\), which gives \(g_1(\mathbf{x}) = 1 > 0\) — infeasible. Rejected.
Case (ii): \(\mu_1 = 0, \mu_2 > 0\) (only \(g_2\) active). Complementary slackness: \(x_1 + x_3 = 1\). From (3): \(\mu_2 = -2 x_3\), so \(\mu_2 > 0\) requires \(x_3 < 0\), hence \(x_1 > 1\). From \(h_1\): \(x_2 = x_1 > 1\). From (2): \(\lambda = 2 x_2\); substituting into (1): \(2 x_1 - 4 + \mu_2 + 2 x_2 = 0\), which with \(\mu_2 = -2 x_3\) and \(x_3 = 1 - x_1\) gives \(-6 x_3 = 0 \Rightarrow x_3 = 0\), contradicting \(x_3 < 0\). Rejected.
Case (iii): \(\mu_1 > 0, \mu_2 = 0\) (only \(g_1\) active). Complementary slackness: \(x_1 + x_2 = 1\). From (3): \(x_3 = 0\). From \(h_1\): \(x_1 = x_2\), so \(x_1 = x_2 = \tfrac{1}{2}\). From (2): \(\lambda = 2 x_2 + \mu_1 = 1 + \mu_1\). From (1): \(1 - 4 + \mu_1 + \lambda = 0 \Rightarrow \lambda = 3 - \mu_1\). Solving \(1 + \mu_1 = 3 - \mu_1\) gives \(\mu_1 = 1\), \(\lambda = 2\). All conditions check: \(\mu_1 = 1 > 0\), \(\mu_2 = 0 \geq 0\), \(g_1 = 0\), \(g_2 = -\tfrac{1}{2} \leq 0\), \(h_1 = 0\). Accepted.
Case (iv): \(\mu_1, \mu_2 > 0\) (both inequalities active). \(x_1 + x_2 = 1\) and \(x_1 + x_3 = 1\), combined with \(h_1: x_1 = x_2\), give \(x_1 = x_2 = x_3 = \tfrac{1}{2}\). From (3): \(\mu_2 = -2 x_3 = -1 < 0\), violating dual feasibility. Rejected.

Only case (iii) yields a KKT point, so by sufficiency the constrained global minimum is \[ \mathbf{x}^* = \bigl(\tfrac{1}{2},\, \tfrac{1}{2},\, 0\bigr)^\top, \qquad f(\mathbf{x}^*) = \bigl(\tfrac{1}{2} - 2\bigr)^2 + \bigl(\tfrac{1}{2}\bigr)^2 + 0 = \tfrac{5}{2}. \]

A sample code for optimization problem with KKT conditions and a simple penalty method is as follows:

                            import numpy as np

                            # Objective function and its gradient
                            def objective(x):
                                return (x[0] - 2) ** 2 + x[1] ** 2 + x[2] ** 2

                            def grad_objective(x):
                                return np.array([2 * (x[0] - 2), 2 * x[1], 2 * x[2]])

                            # Constraints and their gradients
                            def inequality_constraint_1(x):
                                return x[0] + x[1] - 1  # g1

                            def inequality_constraint_2(x):
                                return x[0] + x[2] - 1  # g2

                            def equality_constraint(x):
                                return x[0] - x[1]  # h

                            def grad_inequality_constraint_1():
                                return np.array([1, 1, 0])  # Dg1

                            def grad_inequality_constraint_2():
                                return np.array([1, 0, 1])  # Dg2

                            def grad_equality_constraint():
                                return np.array([1, -1, 0])  # Dh

                            # Simple Penalty Method
                            def penalty_method(rho=0.01, lr=0.05, tol=1e-6, max_iter=2000, rho_growth=1.2, clip_grad=1.0):
                                
                                x = np.array([0.5, 0.5, 0.5])  # Initial point
                                
                                for i in range(max_iter):
                                    # Evaluate constraints with clipping to avoid invalid values
                                    g1 = max(0, inequality_constraint_1(x))  # g1(x)
                                    g2 = max(0, inequality_constraint_2(x))  # g2(x)
                                    h = equality_constraint(x)              # h(x)

                                    # Compute gradients of penalty terms
                                    grad_penalty = (
                                        rho * g1 * grad_inequality_constraint_1() +
                                        rho * g2 * grad_inequality_constraint_2() +
                                        rho * h * grad_equality_constraint()
                                    )

                                    # Gradient of the penalized objective function
                                    grad = grad_objective(x) + grad_penalty

                                    # Gradient clipping to prevent instability
                                    grad = np.clip(grad, -clip_grad, clip_grad)

                                    # Gradient descent step: Learning rate diminishes over iterations
                                    lr_t = lr / (1 + i / 10)  
                                    x -= lr_t * grad
                                
                                    # Check convergence
                                    constraint_violation = max(abs(g1), abs(g2), abs(h))
                                    if np.linalg.norm(grad) < tol and constraint_violation < tol:
                                        print(f"Converged after {i + 1} iterations.")
                                        break

                                    # Increase penalty parameter
                                    if abs(h) > tol or g1 > tol or g2 > tol:
                                        rho = min(rho * rho_growth, 1e6)  # Cap rho at a reasonable level

                                    # Debugging output for monitoring progress
                                    #print(f"Iteration {iteration + 1}: x = {x}, f(x) = {objective(x):.6f}, rho = {rho}")

                                return x, objective(x)

                            # KKT method (with Newton's method)
                            def kkt_method(tol = 1e-8, max_iter = 10):
                                # Initial values of parameters: x1, x2, x3, mu1, mu2, s1, s2, lambda
                                # Note: s1 and s2 are slack variables.
                                parameters = np.array([2.0, 0.0, 0.0, 0.5, 0.5, 0.5, 0.5, 0.5]) 

                                for i in range(max_iter):
                                    # Scaling factors (currently identity; adjust if Jacobian becomes ill-conditioned)
                                    # Example: scale = np.array([10.0, 1.0, 1.0]) if x1 has different magnitude
                                    x = parameters[:3] / np.array([1.0, 1.0, 1.0]) 

                                    mu1, mu2, s1, s2, lambd = parameters[3:] 

                                    # Gradients
                                    grad_f = grad_objective(x)
                                    grad_g1 = grad_inequality_constraint_1()
                                    grad_g2 = grad_inequality_constraint_2()
                                    grad_h = grad_equality_constraint()

                                    # Residuals (KKT conditions)
                                    r_stationarity = grad_f + mu1 * grad_g1 + mu2 * grad_g2 + lambd * grad_h
                                    r_primal_feasibility = np.array([
                                            inequality_constraint_1(x) + s1,  # g1(x) + s1 == 0
                                            inequality_constraint_2(x) + s2,  # g2(x) + s2 == 0
                                            equality_constraint(x)            # h(x) == 0
                                    ])
                                    r_complementarity = np.array([
                                            mu1 * s1,  # mu1*s1 == 0
                                            mu2 * s2   # mu2*s2 == 0
                                    ])

                                    # Residual vector
                                    residuals = np.concatenate([r_stationarity, r_primal_feasibility, r_complementarity])

                                    # Jacobian matrix is 8x8 (8 equations and 8 parameters)
                                    # The initial Jacobian matrix: 
                                    #  x1   x2    x3   mu1  mu2  s1   s2  lambda
                                    # [ 2.   0.   0.   1.   1.   0.   0.   1. ] Stationarity 1
                                    # [ 0.   2.   0.   1.   0.   0.   0.  -1. ] Stationarity 2
                                    # [ 0.   0.   2.   0.   1.   0.   0.   0. ] Stationarity 3
                                    # [ 1.   1.   0.   0.   0.   1.   0.   0. ] Primal feasibility g1 + s1
                                    # [ 1.   0.   1.   0.   0.   0.   1.   0. ] Primal feasibility g2 + s2
                                    # [ 1.  -1.   0.   0.   0.   0.   0.   0. ] Primal feasibility h
                                    # [ 0.   0.   0.   0.5  0.   0.5  0.   0. ] Complementarity 1 mu1 & s1
                                    # [ 0.   0.   0.   0.   0.5  0.   0.5  0. ] Complementarity 2 mu2 & s2
                                    jacobian = np.zeros((8, 8))

                                    # Stationarity (Df + μ1*Dg1 + μ2*Dg2 + λ*Dh)
                                    jacobian[:3, :3] = 2 * np.eye(3)  # Df(x)
                                    jacobian[:3, 3] = grad_g1  # Dg1(x) 
                                    jacobian[:3, 4] = grad_g2  # Dg2(x) 
                                    jacobian[:3, 7] = grad_h   # Dh(x) 

                                    # Primal feasibility (g1(x) + s1, g2(x) + s2, h(x))
                                    jacobian[3, :3] = grad_g1  # Dg1(x) 
                                    jacobian[4, :3] = grad_g2  # Dg2(x) 
                                    jacobian[5, :3] = grad_h   # Dh(x) 
                                    jacobian[3, 5] = 1         # Slack variable s1
                                    jacobian[4, 6] = 1         # Slack variable s2

                                    # Complementarity  (mu1*s1, mu2*s2)
                                    jacobian[6, 3] = s1     # mu1*s1 
                                    jacobian[6, 5] = mu1    # mu1 
                                    jacobian[7, 4] = s2     # mu2*s2 
                                    jacobian[7, 6] = mu2    # mu2 

                                    print(f"Iteration {i+1}:")
                                    print(f" x = {parameters[:3]}, f(x) = {objective(x):.6f}")
                                    print(f"Multipliers: mu1 = {parameters[3]}, mu2 = {parameters[4]}, lambda = {parameters[7]}\n")
                                    
                                    # Solve the system
                                    try:
                                        delta = np.linalg.solve(jacobian, -residuals)
                                    except np.linalg.LinAlgError:
                                        print("Jacobian is singular OR ill-conditioned.")
                                        break

                                    # Update parameters
                                    parameters += delta

                                    # Enforce non-negativity of inequality multipliers and slack variables
                                    parameters[3] = max(0, parameters[3])   # mu1 >= 0
                                    parameters[4] = max(0, parameters[4])   # mu2 >= 0
                                    parameters[5] = max(0, parameters[5])   # s1 >= 0
                                    parameters[6] = max(0, parameters[6])   # s2 >= 0

                                    # Check Convergence
                                    if np.linalg.norm(residuals) < tol:
                                        break

                                x = parameters[:3]
                                return x, parameters[3], parameters[4], parameters[7], objective(x)

                            if __name__ == "__main__":
                                x_opt, mu1, mu2, lambd, f_opt = kkt_method()
                                print(f"Optimal by KKT multipliers: x = {x_opt}, f(x) = {f_opt}")
                                print(f"Multipliers: mu1 = {mu1}, mu2 = {mu2}, lambda = {lambd}\n")
                                
                                x_penalty, f_penalty = penalty_method()
                                x_relative_error  = np.linalg.norm(np.array(x_penalty) - np.array(x_opt)) / np.linalg.norm(np.array(x_opt))
                                f_relative_error  = np.linalg.norm(np.array(f_penalty) - np.array(f_opt)) / np.linalg.norm(np.array(f_opt))
                                print(f"Optimal by Penalty Method: x = {x_penalty}, f(x) = {f_penalty}")
                                print(f"x_Relative_Error: {x_relative_error :.6f}, f_Relative_Error: {f_relative_error :.6f}")

Implementation note: slack variables vs. active-set tracking

Numerical KKT solvers cannot directly handle the inequality \(g_i(\mathbf{x}) \leq 0\) nor the kink in the complementary slackness condition \(\mu_i\, g_i(\mathbf{x}) = 0\). Two strategies resolve this:

(1) Slack variables. Introduce \(s_i \geq 0\) and convert each inequality to an equality: \[ g_i(\mathbf{x}) \leq 0 \iff g_i(\mathbf{x}) + s_i = 0, \; s_i \geq 0. \] Complementary slackness then becomes \(\mu_i\, s_i = 0\), which is a smooth bilinear equation — amenable to Newton-type solvers (as in the code above).

(2) Active-set methods. Guess which constraints are active, solve the resulting equality-constrained problem, then check whether the guess is consistent with dual feasibility and primal feasibility; if not, update the active set and repeat. Faster when the correct active set can be identified cheaply, but requires more logic for dynamically updating the active set.

Modern interior-point methods use a smoothed version of slack-variable tracking with barrier functions, and are the workhorse for large-scale convex programs — the subject of the next page on duality, which reformulates the primal problem in a way that reveals structure often invisible in the primal.

Loading...