Support Vector Machine (SVM)


Support Vector Machine

So far in both regression and classification, we have used probabilistic predictors that model \(p(y \mid x)\) and optimize via maximum likelihood. Here, we take a fundamentally different, geometric approach: rather than modeling probabilities, we seek the hyperplane that separates the classes with the largest possible margin. This formulation leads naturally to a constrained optimization problem whose dual formulation enables the kernel trick.

This margin-based approach often leads to strong generalization, especially in high dimensional spaces with limited samples. By employing the kernel trick (e.g., RBF kernel), SVMs can implicitly map inputs into higher dimensional feature spaces, enabling the classification of nonlinearly separable data without explicitly computing those transformations. Despite the rise of deep learning, SVMs remain valuable for small- to medium-sized datasets, offering robustness and interpretability through a sparse set of support vectors.

In binary classification, the predictor is \(h(x) = \text{sign}(f(x))\), where the decision boundary (hyperplane) is given by: \[ f(x) = w^\top x + w_0 \tag{1} \] To obtain a robust solution, we would like to maximize the margin between the data points and the decision boundary.

SVM margin and decision boundary

To express the distance from a point \(x\) to the decision boundary, we introduce its orthogonal projection onto the boundary \(f(x) = 0\): \[ x = \text{proj}_{f(x)}x + r \frac{w}{\| w \|} \tag{2} \] where \(r\) is the signed distance from \(x\) to the decision boundary, whose normal vector is \(w\).

Substituting Expression (2) into the decision function (1): \[ \begin{align*} f(x) &= w^\top \left(\text{proj}_{f(x)}x + r \frac{w}{\| w \|}\right) + w_0 \\\\ &= w^\top \text{proj}_{f(x)}x + w_0 + r \| w \|. \end{align*} \] Since \(f(\text{proj}_{f(x)}x ) = 0\), i.e. \(w^\top \text{proj}_{f(x)}x + w_0 = 0\), we get \[ r = \frac{f(x)}{\| w \|} \] and each point must be on the correct side of the decision boundary: \[ f(x_n)\tilde{y}_n > 0. \] Therefore, our objective is: \[ \max_{w, w_0} \frac{1}{\| w \|} \min_{n = 1, \ldots, N} \left[\tilde{y}_n \left(w^\top x_n + w_0 \right)\right]. \] Since both \(w\) and \(w_0\) can be freely rescaled without changing the decision boundary, we may normalize so that the closest point satisfies \(\tilde{y}_n f(x_n) = 1\) exactly. Under this normalization, the closest point lies at distance \(\frac{1}{\|w\|}\) from the boundary, so the margin between the two classes equals \(\frac{2}{\|w\|}\).
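As a quick numerical check of Expression (2), the following sketch computes \(f(x)\), the signed distance \(r = \frac{f(x)}{\|w\|}\), and the projection onto the hyperplane for a made-up \(w\), \(w_0\), and \(x\) (all values are illustrative, not taken from the text):

```python
import numpy as np

# Illustrative hyperplane f(x) = w^T x + w0 (made-up values)
w = np.array([3.0, 4.0])        # normal vector, ||w|| = 5
w0 = -2.0
x = np.array([2.0, 1.0])

f_x = w @ x + w0                       # f(x) = 3*2 + 4*1 - 2 = 8
r = f_x / np.linalg.norm(w)            # signed distance r = f(x) / ||w|| = 1.6

# Rearranging Expression (2): proj = x - r * w / ||w||
proj = x - r * w / np.linalg.norm(w)
assert np.isclose(w @ proj + w0, 0.0)  # the projection lies on the hyperplane
print(r, proj)                         # 1.6 [ 1.04 -0.28]
```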

Note that \(\max \frac{1}{\|w\|}\) is equivalent to \(\min \|w\|^2\). Thus our objective becomes:

Hard-Margin SVM (Primal):

\[ \min_{w,\, w_0} \;\frac{1}{2}\|w\|^2 \] subject to \[ \tilde{y}_n(w^\top x_n + w_0) \geq 1, \quad n = 1 : N. \] The factor of \(\frac{1}{2}\) is included for convenience when differentiating.
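The hard-margin primal is a small quadratic program, so it can be handed directly to a generic convex solver. The sketch below is a minimal illustration assuming the cvxpy package and a made-up, linearly separable toy set; it is not part of the formulation above.

```python
import numpy as np
import cvxpy as cp

# Made-up, linearly separable toy data with labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
w0 = cp.Variable()

# Hard-margin primal: minimize (1/2)||w||^2  s.t.  y_n (w^T x_n + w0) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + w0) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " w0 =", w0.value)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w.value))
```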

This constrained optimization problem, known as the primal problem, is difficult to solve directly. However, since it is a convex optimization problem, we can derive its dual problem using Lagrange multipliers, which often has computational advantages and enables the kernel trick.

Let \(\alpha \in \mathbb{R}^N\) be the dual variables corresponding to Lagrange multipliers: \[ \mathcal{L}(w, w_0, \alpha) = \frac{1}{2}w^\top w - \sum_{n =1}^N \alpha_n (\tilde{y}_n (w^\top x_n + w_0) -1) \tag{3} \] To optimize this, we need a stationary point that satisfies: \[ (\hat{w}, \hat{w}_0, \hat{\alpha}) = \min_{w, w_0} \max_{\alpha} \mathcal{L}(w, w_0, \alpha). \] Taking derivatives with respect to \(w\) and \(w_0\): \[ \begin{align*} \nabla_w \mathcal{L}(w, w_0, \alpha) &= w - \sum_{n =1}^N \alpha_n \tilde{y}_n x_n \\\\ \frac{\partial \mathcal{L}(w, w_0, \alpha)}{\partial w_0} &= - \sum_{n =1}^N \alpha_n \tilde{y}_n. \end{align*} \] Setting both to zero gives: \[ \begin{align*} \hat{w} &= \sum_{n =1}^N \hat{\alpha}_n \tilde{y}_n x_n \\\\ 0 &= \sum_{n =1}^N \hat{\alpha}_n \tilde{y}_n. \end{align*} \] Substituting these into the Lagrangian (3), we obtain \[ \begin{align*} \mathcal{L}(\hat{w}, \hat{w}_0, \alpha) &= \frac{1}{2}\hat{w}^\top \hat{w} - \sum_{n =1}^N \alpha_n \tilde{y}_n \hat{w}^\top x_n - \sum_{n =1}^N \alpha_n \tilde{y}_n \hat{w}_0 + \sum_{n =1}^N \alpha_n \\\\ &= \frac{1}{2}\hat{w}^\top \hat{w} - \hat{w}^\top \hat{w} - 0 + \sum_{n =1}^N \alpha_n \\\\ &= - \frac{1}{2} \sum_{i =1}^N \sum_{j = 1}^N \alpha_i \alpha_j \tilde{y}_i \tilde{y}_j x_i^\top x_j + \sum_{n =1}^N \alpha_n. \end{align*} \]

Hard-Margin SVM (Dual):

\[ \max_{\alpha} \mathcal{L}(\hat{w}, \hat{w}_0, \alpha) \] subject to \[ \sum_{n =1}^N \alpha_n \tilde{y}_n = 0 \text{ and } \alpha_n \geq 0, \quad n = 1 : N. \] The solution must satisfy the KKT conditions: \[ \begin{align*} \alpha_n &\geq 0 \text{ (Dual feasibility) }\\\\ \tilde{y}_n f(x_n) -1 &\geq 0 \text{ (Primal feasibility) }\\\\ \alpha_n (\tilde{y}_n f(x_n) -1 ) &= 0 \text{ (Complementary slackness) } \end{align*} \]

Thus, for each data point, either \[ \alpha_n = 0 \, \text{ or } \, \tilde{y}_n (\hat{w}^\top x_n + \hat{w}_0) = 1 \] must hold. The points satisfying the second condition are called support vectors.
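To make the dual concrete, the following sketch (again assuming cvxpy and made-up toy data) maximizes the dual objective, recovers \(\hat{w}\) from the stationarity condition, and reads off the support vectors as the points with \(\alpha_n > 0\):

```python
import numpy as np
import cvxpy as cp

# Made-up, linearly separable toy data with labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

alpha = cp.Variable(N)
M = (y[:, None] * X).T                # M @ alpha = sum_n alpha_n y_n x_n
# Dual: maximize sum(alpha) - (1/2) ||sum_n alpha_n y_n x_n||^2
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(M @ alpha))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
support = a > 1e-6                    # complementary slackness: alpha_n > 0 only on the margin
w_hat = (a * y) @ X                   # stationarity: w = sum_n alpha_n y_n x_n
w0_hat = y[support][0] - X[support][0] @ w_hat  # from y_n (w^T x_n + w0) = 1 on a support vector
print("support vectors:\n", X[support])
print("w_hat =", w_hat, " w0_hat =", w0_hat)
```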

Let the set of support vectors be \(\mathcal{S}\). To perform prediction, we expand \(\hat{w} = \sum_{n \in \mathcal{S}} \alpha_n \tilde{y}_n x_n\): \[ \begin{align*} f(x;\,\hat{w}, \hat{w}_0) &= \hat{w}^\top x + \hat{w}_0 \\\\ &= \sum_{n \in \mathcal{S}} \alpha_n \tilde{y}_n\, x_n^\top x + \hat{w}_0. \end{align*} \] By the kernel trick, replacing inner products with kernel evaluations:

SVM Prediction (Kernel Form):

\[ f(x) = \sum_{n \in \mathcal{S}} \alpha_n \tilde{y}_n \mathcal{K}(x_n, x) + \hat{w}_0 \] where the bias is computed from the support vectors: \[ \begin{align*} \hat{w}_0 &= \frac{1}{|\mathcal{S}|} \sum_{n \in \mathcal{S}} \left(\tilde{y}_n - \hat{w}^\top x_n \right) \\\\ &= \frac{1}{|\mathcal{S}|} \sum_{n \in \mathcal{S}} \left(\tilde{y}_n - \sum_{m \in \mathcal{S}} \alpha_m \tilde{y}_m x_m^\top x_n \right). \end{align*} \] Applying the kernel trick to the bias: \[ \hat{w}_0 = \frac{1}{|\mathcal{S}|} \sum_{n \in \mathcal{S}} \left(\tilde{y}_n - \sum_{m \in \mathcal{S}} \hat{\alpha}_m \tilde{y}_m \mathcal{K}(x_m, x_n) \right). \]
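Given a dual solution \(\hat{\alpha}\) (for instance from a solver like the sketch above), these formulas translate directly into code. The sketch below assumes an RBF kernel \(\mathcal{K}(x, x') = \exp(-\gamma \|x - x'\|^2)\) and hypothetical arrays `X_sv`, `y_sv`, `alpha_sv` holding the support vectors, their labels, and their multipliers:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def svm_bias(X_sv, y_sv, alpha_sv, gamma=0.5):
    # w0 = mean over support vectors n of (y_n - sum_m alpha_m y_m K(x_m, x_n))
    K = rbf_kernel(X_sv, X_sv, gamma)
    return np.mean(y_sv - K.T @ (alpha_sv * y_sv))

def svm_predict(X_new, X_sv, y_sv, alpha_sv, w0, gamma=0.5):
    # f(x) = sum_n alpha_n y_n K(x_n, x) + w0, classified by its sign
    K = rbf_kernel(X_sv, X_new, gamma)
    return np.sign(K.T @ (alpha_sv * y_sv) + w0)
```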

Soft Margin Constraints

Consider the case where the data is not linearly separable. In this case there is no feasible solution in which \(\tilde{y}_n f_n \geq 1\) for all \(n\), where \(f_n = f(x_n)\). By introducing slack variables \(\xi_n \geq 0\), we relax the hard constraints \(\tilde{y}_n f_n \geq 1\) to soft margin constraints: \[ \tilde{y}_n f_n \geq 1 - \xi_n. \] Thus, our objective becomes:

Soft-Margin SVM (Primal):

\[ \min_{w,\, w_0,\, \xi} \;\frac{1}{2}\|w\|^2 + C\sum_{n=1}^N \xi_n \] subject to \[ \xi_n \geq 0, \quad \tilde{y}_n(x_n^\top w + w_0) \geq 1 - \xi_n \] where \(C > 0\) is a hyperparameter that controls the trade-off between maximizing the margin and penalizing margin violations (larger \(C\) allows fewer points to violate the margin).
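Because at the optimum each slack variable takes its smallest feasible value, \(\xi_n = \max(0,\, 1 - \tilde{y}_n(w^\top x_n + w_0))\), the soft-margin primal can equivalently be written as an unconstrained hinge-loss problem (the same loss that reappears in the demo below): \[ \min_{w,\, w_0} \;\frac{1}{2}\|w\|^2 + C\sum_{n=1}^N \max\left(0,\; 1 - \tilde{y}_n(w^\top x_n + w_0)\right). \] The constrained form above, however, is the one we dualize next.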

The Lagrangian becomes: \[ \begin{align*} &\mathcal{L}(w, w_0, \alpha, \xi, \mu) \\\\ &= \frac{1}{2} w^\top w + C \sum_{n =1}^N \xi_n - \sum_{n = 1}^N \alpha_n (\tilde{y}_n(w^\top x_n + w_0) -1 + \xi_n) - \sum_{n = 1}^N \mu_n \xi_n \end{align*} \] where \(\alpha_n \geq 0\) and \(\mu_n \geq 0\) are Lagrange multipliers.

Setting the derivatives of \(\mathcal{L}\) to zero with respect to \(w\), \(w_0\), and \(\xi_n\): \[ \begin{align*} \frac{\partial \mathcal{L}}{\partial w} = 0 &\implies w = \sum_{n=1}^N \alpha_n \tilde{y}_n x_n, \\\\ \frac{\partial \mathcal{L}}{\partial w_0} = 0 &\implies \sum_{n=1}^N \alpha_n \tilde{y}_n = 0, \\\\ \frac{\partial \mathcal{L}}{\partial \xi_n} = 0 &\implies C - \alpha_n - \mu_n = 0. \end{align*} \] The third condition, combined with \(\mu_n \geq 0\), gives \(\alpha_n \leq C\). Since we also require \(\alpha_n \geq 0\), we obtain the box constraint \(0 \leq \alpha_n \leq C\).

Substituting the first two conditions back into the Lagrangian eliminates the primal variables, yielding the same dual objective as the hard-margin case: \[ \mathcal{L}(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j \tilde{y}_i \tilde{y}_j \, x_i^\top x_j, \] but now subject to the constraints: \[ 0 \leq \alpha_n \leq C, \quad \sum_{n=1}^N \alpha_n \tilde{y}_n = 0. \] These constraints partition the training points into three cases:

  1. \(\alpha_n = 0\): the point satisfies \(\tilde{y}_n f(x_n) \geq 1\) and does not contribute to the solution.
  2. \(0 < \alpha_n < C\): the point lies exactly on the margin (\(\tilde{y}_n f(x_n) = 1\), \(\xi_n = 0\)); these are the free support vectors.
  3. \(\alpha_n = C\): the point lies inside the margin or is misclassified (\(\tilde{y}_n f(x_n) \leq 1\), \(\xi_n \geq 0\)); these are the bounded support vectors.
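A quick way to see these three cases in practice is to fit scikit-learn's SVC (an assumed dependency, not something prescribed by the text) and inspect its dual coefficients, which store the signed values \(\alpha_n \tilde{y}_n\) for the support vectors:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Made-up overlapping two-class data
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_).ravel()     # dual_coef_ stores alpha_n * y_n for the support vectors
n_bounded = np.sum(np.isclose(alpha, C))   # alpha_n = C: inside the margin or misclassified
n_free = np.sum(alpha < C - 1e-8)          # 0 < alpha_n < C: exactly on the margin
print(f"{alpha.size} support vectors: {n_free} free, {n_bounded} at the C bound")
```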

SVM Demo

This interactive demo visualizes the SVM algorithm in action. You can observe how the algorithm finds the optimal decision boundary that maximizes the margin between two classes while respecting the soft margin constraints.

Understanding the Visualization

The visualization displays the key components of the SVM, such as the decision boundary, the margin region around it, and the support vectors highlighted among the data points.

Interactive Controls

Experiment with different parameters, such as the kernel type (linear or RBF), the regularization constant \(C\), and the RBF width \(\gamma\), to understand their effects.

Training Process

The demo uses Stochastic Gradient Descent (SGD) to optimize the SVM objective. For the RBF kernel, it employs Random Fourier Features to approximate the kernel function, making the non-linear SVM computationally efficient. Click "Train SVM" to watch the algorithm iteratively (a minimal sketch of this kind of update loop is given after the list below):

  1. Update weights to minimize the hinge loss: \(\max(0, 1 - \tilde{y}_n f(x_n))\)
  2. Apply regularization to control model complexity
  3. Identify support vectors that satisfy the KKT conditions
  4. Converge to the optimal decision boundary
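The sketch below is a minimal, linear-only version of this kind of update loop: per-sample subgradient descent on the regularized hinge loss. The hyperparameters and structure are illustrative assumptions, not the demo's actual implementation (which additionally uses Random Fourier Features for the RBF kernel).

```python
import numpy as np

def train_linear_svm_sgd(X, y, C=1.0, lr=0.01, epochs=200, seed=0):
    """Per-sample subgradient steps on (1/2)||w||^2 + C * max(0, 1 - y_n (w^T x_n + w0))."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w, w0 = np.zeros(D), 0.0
    for _ in range(epochs):
        for n in rng.permutation(N):
            margin = y[n] * (X[n] @ w + w0)
            if margin < 1:                       # hinge loss active: the margin is violated
                w -= lr * (w - C * y[n] * X[n])
                w0 -= lr * (-C * y[n])
            else:                                # only the regularizer contributes
                w -= lr * w
    return w, w0
```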

Observing KKT Conditions

After training, the demo verifies the Karush-Kuhn-Tucker (KKT) conditions for each point: \(\alpha_n = 0\) should correspond to \(\tilde{y}_n f(x_n) \geq 1\), \(0 < \alpha_n < C\) to \(\tilde{y}_n f(x_n) = 1\), and \(\alpha_n = C\) to \(\tilde{y}_n f(x_n) \leq 1\).

The percentage of points satisfying these conditions indicates how well the optimization has converged.

Experiment Suggestions

Try these experiments to deepen your understanding (a scripted version of the first two is sketched after the list):

  1. Generate linearly separable data and compare linear vs RBF kernels
  2. Create circular data patterns to see where linear kernels fail
  3. Adjust C to observe the trade-off between margin width and training accuracy
  4. For RBF kernel, vary \(\gamma\) to see how decision boundaries become more complex
  5. Watch how the number of support vectors changes with different parameters

SVMs demonstrate how geometric intuition - maximizing the margin - leads to strong generalization guarantees, particularly in high-dimensional spaces with limited data. In Part 7: PCA & Autoencoders, we shift from supervised to unsupervised learning, asking how to discover low-dimensional structure in data without any labels - connecting eigenvalue decomposition to neural network-based representation learning.