Support Vector Machine (SVM)

Support Vector Machine

So far in both regression and classification, we have used probabilistic predictors that model \(p(y \mid \mathbf{x})\) and optimize via maximum likelihood. Here, we take a fundamentally different, geometric approach: rather than modeling probabilities, we seek the hyperplane that separates the classes with the largest possible margin. This formulation leads naturally to a constrained optimization problem whose dual formulation enables the kernel trick.

This margin-based approach often leads to strong generalization, especially in high dimensional spaces with limited samples. By employing the kernel trick (e.g., RBF kernel), SVMs can implicitly map inputs into higher dimensional feature spaces, enabling the classification of nonlinearly separable data without explicitly computing those transformations. Despite the rise of deep learning, SVMs remain valuable for small- to medium-sized datasets, offering robustness and interpretability through a sparse set of support vectors.

In binary classification with labels \(\tilde{y} \in \{-1, +1\}\), the prediction is \(h(\mathbf{x}) = \operatorname{sign}(f(\mathbf{x}))\), and the decision boundary (hyperplane) is defined by \[ f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b. \tag{1} \] To obtain a robust solution, we would like to maximize the margin between data points and the decision boundary.

To express the distance of the closest point to the decision boundary, we decompose any point \(\mathbf{x}\) as its orthogonal projection onto the hyperplane \(H = \{\mathbf{x} : f(\mathbf{x}) = 0\}\) plus a signed normal component: \[ \mathbf{x} = \operatorname{proj}_H \mathbf{x} + r \frac{\mathbf{w}}{\|\mathbf{w}\|}, \tag{2} \] where \(r\) is the signed distance from \(\mathbf{x}\) to \(H\), and \(\mathbf{w}/\|\mathbf{w}\|\) is the unit normal vector to \(H\).

Substituting Expression (2) into the boundary function (1): \[ \begin{align*} f(\mathbf{x}) &= \mathbf{w}^\top \!\left(\operatorname{proj}_H \mathbf{x} + r \frac{\mathbf{w}}{\|\mathbf{w}\|}\right) + b \\\\ &= \underbrace{\mathbf{w}^\top \operatorname{proj}_H \mathbf{x} + b}_{= \, f(\operatorname{proj}_H \mathbf{x}) \, = \, 0} + r \|\mathbf{w}\| \\\\ &= r \|\mathbf{w}\|. \end{align*} \] Hence \[ r = \frac{f(\mathbf{x})}{\|\mathbf{w}\|}, \] and each correctly classified point satisfies \[ \tilde{y}_n f(\mathbf{x}_n) > 0. \] Therefore, our objective is \[ \max_{\mathbf{w},\, b} \frac{1}{\|\mathbf{w}\|} \min_{1 \leq n \leq N} \tilde{y}_n \!\left(\mathbf{w}^\top \mathbf{x}_n + b\right). \] Since both \(\mathbf{w}\) and \(b\) can be freely rescaled without changing the decision boundary, we may normalize so that the closest point satisfies \(\tilde{y}_n f(\mathbf{x}_n) = 1\) exactly. Under this normalization, the margin (distance between the two parallel supporting hyperplanes) equals \(\frac{2}{\|\mathbf{w}\|}\).

Note that \(\max \frac{1}{\|\mathbf{w}\|}\) is equivalent to \(\min \|\mathbf{w}\|^2\). Thus our objective becomes:

Hard-Margin SVM (Primal)

\[ \min_{\mathbf{w},\, b} \;\frac{1}{2}\|\mathbf{w}\|^2 \] subject to \[ \tilde{y}_n(\mathbf{w}^\top \mathbf{x}_n + b) \geq 1, \quad n = 1, \ldots, N. \] The factor of \(\frac{1}{2}\) is included for convenience when differentiating.

This constrained optimization problem, known as the primal problem, is difficult to solve directly. Since the objective is convex and the inequality constraints are linear, Slater's condition is automatically satisfied, so strong duality holds and the problem can equivalently be solved via its dual, which often has computational advantages and enables the kernel trick.

Let \(\boldsymbol{\alpha} \in \mathbb{R}^N\) be the vector of Lagrange multipliers associated with the inequality constraints. The Lagrangian is \[ \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^\top \mathbf{w} - \sum_{n=1}^N \alpha_n \!\left(\tilde{y}_n (\mathbf{w}^\top \mathbf{x}_n + b) - 1\right). \tag{3} \] A primal-dual optimum \((\hat{\mathbf{w}}, \hat{b}, \hat{\boldsymbol{\alpha}})\) satisfies \[ (\hat{\mathbf{w}}, \hat{b}, \hat{\boldsymbol{\alpha}}) = \min_{\mathbf{w}, b} \max_{\boldsymbol{\alpha} \geq \mathbf{0}} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}). \] Taking derivatives with respect to \(\mathbf{w}\) and \(b\) and setting them to zero (the stationarity condition): \[ \begin{align*} \nabla_{\mathbf{w}} \mathcal{L} &= \mathbf{w} - \sum_{n=1}^N \alpha_n \tilde{y}_n \mathbf{x}_n = \mathbf{0}, \\\\ \frac{\partial \mathcal{L}}{\partial b} &= -\sum_{n=1}^N \alpha_n \tilde{y}_n = 0, \end{align*} \] which gives \[ \begin{align*} \hat{\mathbf{w}} &= \sum_{n=1}^N \hat{\alpha}_n \tilde{y}_n \mathbf{x}_n, \\\\ 0 &= \sum_{n=1}^N \hat{\alpha}_n \tilde{y}_n. \end{align*} \] Substituting these into the Lagrangian (3): \[ \begin{align*} \mathcal{L}(\hat{\mathbf{w}}, \hat{b}, \boldsymbol{\alpha}) &= \frac{1}{2}\hat{\mathbf{w}}^\top \hat{\mathbf{w}} - \sum_{n=1}^N \alpha_n \tilde{y}_n \hat{\mathbf{w}}^\top \mathbf{x}_n - \hat{b} \sum_{n=1}^N \alpha_n \tilde{y}_n + \sum_{n=1}^N \alpha_n \\\\ &= \frac{1}{2}\hat{\mathbf{w}}^\top \hat{\mathbf{w}} - \hat{\mathbf{w}}^\top \hat{\mathbf{w}} - 0 + \sum_{n=1}^N \alpha_n \\\\ &= \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j \tilde{y}_i \tilde{y}_j \, \mathbf{x}_i^\top \mathbf{x}_j. \end{align*} \]

Hard-Margin SVM (Dual)

\[ \max_{\boldsymbol{\alpha}} \;\sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j \tilde{y}_i \tilde{y}_j \, \mathbf{x}_i^\top \mathbf{x}_j \] subject to \[ \sum_{n=1}^N \alpha_n \tilde{y}_n = 0, \qquad \alpha_n \geq 0, \quad n = 1, \ldots, N. \] The primal-dual solution must satisfy the KKT conditions: \[ \begin{align*} \mathbf{w} - \sum_{n=1}^N \alpha_n \tilde{y}_n \mathbf{x}_n &= \mathbf{0}, \quad \sum_{n=1}^N \alpha_n \tilde{y}_n = 0 &&\text{(stationarity)} \\\\ \tilde{y}_n f(\mathbf{x}_n) - 1 &\geq 0 &&\text{(primal feasibility)} \\\\ \alpha_n &\geq 0 &&\text{(dual feasibility)} \\\\ \alpha_n (\tilde{y}_n f(\mathbf{x}_n) - 1) &= 0 &&\text{(complementary slackness)} \end{align*} \]

Complementary slackness implies that, for each data point, either \[ \alpha_n = 0 \quad \text{or} \quad \tilde{y}_n (\hat{\mathbf{w}}^\top \mathbf{x}_n + \hat{b}) = 1. \] The points satisfying the second condition are called support vectors.

Let the set of support vectors be \(\mathcal{S}\). To perform prediction, we expand \(\hat{\mathbf{w}} = \sum_{n \in \mathcal{S}} \hat{\alpha}_n \tilde{y}_n \mathbf{x}_n\): \[ \begin{align*} f(\mathbf{x};\, \hat{\mathbf{w}}, \hat{b}) &= \hat{\mathbf{w}}^\top \mathbf{x} + \hat{b} \\\\ &= \sum_{n \in \mathcal{S}} \hat{\alpha}_n \tilde{y}_n\, \mathbf{x}_n^\top \mathbf{x} + \hat{b}. \end{align*} \] By the kernel trick, the inner product can be replaced with a positive definite kernel \(\mathcal{K}(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle\), implicitly mapping inputs to a higher- (or infinite-) dimensional feature space without ever computing \(\phi\) explicitly.

SVM Prediction (Kernel Form)

\[ f(\mathbf{x}) = \sum_{n \in \mathcal{S}} \hat{\alpha}_n \tilde{y}_n \mathcal{K}(\mathbf{x}_n, \mathbf{x}) + \hat{b}, \] where the bias is computed by averaging over the support vectors: \[ \begin{align*} \hat{b} &= \frac{1}{|\mathcal{S}|} \sum_{n \in \mathcal{S}} \!\left(\tilde{y}_n - \hat{\mathbf{w}}^\top \mathbf{x}_n \right) \\\\ &= \frac{1}{|\mathcal{S}|} \sum_{n \in \mathcal{S}} \!\left(\tilde{y}_n - \sum_{m \in \mathcal{S}} \hat{\alpha}_m \tilde{y}_m \mathbf{x}_m^\top \mathbf{x}_n \right). \end{align*} \] Applying the kernel trick to the bias as well: \[ \hat{b} = \frac{1}{|\mathcal{S}|} \sum_{n \in \mathcal{S}} \!\left(\tilde{y}_n - \sum_{m \in \mathcal{S}} \hat{\alpha}_m \tilde{y}_m \mathcal{K}(\mathbf{x}_m, \mathbf{x}_n) \right). \]

Soft Margin Constraints

Consider the case where the data is NOT linearly separable. In this case there is no feasible solution in which \(\tilde{y}_n f(\mathbf{x}_n) \geq 1\) for all \(n\). By introducing slack variables \(\xi_n \geq 0\), we relax the constraint \(\tilde{y}_n f(\mathbf{x}_n) \geq 1\) to the soft margin constraint: \[ \tilde{y}_n f(\mathbf{x}_n) \geq 1 - \xi_n. \] Thus, our objective becomes:

Soft-Margin SVM (Primal)

\[ \min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{n=1}^N \xi_n \] subject to \[ \xi_n \geq 0, \quad \tilde{y}_n(\mathbf{w}^\top \mathbf{x}_n + b) \geq 1 - \xi_n, \] where \(C \geq 0\) is a hyperparameter that controls how many data points are allowed to violate the margin.

The Lagrangian becomes \[ \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}, \boldsymbol{\xi}, \boldsymbol{\mu}) = \frac{1}{2} \mathbf{w}^\top \mathbf{w} + C \sum_{n=1}^N \xi_n - \sum_{n=1}^N \alpha_n \!\left(\tilde{y}_n(\mathbf{w}^\top \mathbf{x}_n + b) - 1 + \xi_n\right) - \sum_{n=1}^N \mu_n \xi_n, \] where \(\alpha_n \geq 0\) and \(\mu_n \geq 0\) are Lagrange multipliers.

Setting the derivatives of \(\mathcal{L}\) to zero with respect to \(\mathbf{w}\), \(b\), and \(\xi_n\): \[ \begin{align*} \frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \mathbf{0} &\implies \mathbf{w} = \sum_{n=1}^N \alpha_n \tilde{y}_n \mathbf{x}_n, \\\\ \frac{\partial \mathcal{L}}{\partial b} = 0 &\implies \sum_{n=1}^N \alpha_n \tilde{y}_n = 0, \\\\ \frac{\partial \mathcal{L}}{\partial \xi_n} = 0 &\implies C - \alpha_n - \mu_n = 0. \end{align*} \] The third condition, combined with \(\mu_n \geq 0\), gives \(\alpha_n \leq C\). Since we also require \(\alpha_n \geq 0\), we obtain the box constraint \(0 \leq \alpha_n \leq C\).

Substituting the first two conditions back into the Lagrangian eliminates the primal variables, yielding the same dual objective as the hard-margin case: \[ \mathcal{L}(\boldsymbol{\alpha}) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j \tilde{y}_i \tilde{y}_j \, \mathbf{x}_i^\top \mathbf{x}_j, \] but now subject to the constraints \[ 0 \leq \alpha_n \leq C, \qquad \sum_{n=1}^N \alpha_n \tilde{y}_n = 0. \] These constraints partition the training points into three cases:

If \(\alpha_n = 0\): the point lies outside the margin and is correctly classified (ignored by the model).
If \(0 < \alpha_n < C\): the point is exactly on the margin boundary (\(\xi_n = 0\)). These are the support vectors.
If \(\alpha_n = C\): the point is inside or beyond the margin.
- If \(\xi_n \leq 1\): correctly classified but inside the margin.
- If \(\xi_n > 1\): misclassified.

SVM Demo

This interactive demo visualizes the SVM algorithm in action. You can observe how the algorithm finds the optimal decision boundary that maximizes the margin between two classes while respecting the soft margin constraints.

Understanding the Visualization

The visualization displays several key components of the SVM:

Training Data Points: Solid circles represent training samples with labels \(\tilde{y} \in \{-1, +1\}\)
Test Data Points: Semi-transparent squares show test samples for evaluating generalization
Decision Boundary: The dark line (linear kernel) or contour (RBF kernel) where \(f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b = 0\)
Margin Boundaries: Dashed lines showing where \(\tilde{y}_n f(\mathbf{x}_n) = 1\)
Support Vectors: Points with golden borders that satisfy \(\alpha_n > 0\) and define the decision boundary
Slack Variables: Dotted circles around support vectors visualize \(\xi_n\) for points violating the margin

Interactive Controls

Experiment with different parameters to understand their effects:

Kernel Type:
- Linear: Uses \(\mathcal{K}(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^\top \mathbf{x}_j\) for linearly separable data
- RBF (Gaussian): Uses \(\mathcal{K}(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)\) for non-linear boundaries
Regularization Parameter C: Controls the trade-off between maximizing the margin and minimizing classification errors. As shown in the soft margin formulation: \(\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{n=1}^N \xi_n\)
- Low C: Prioritizes wider margins, allows more margin violations
- High C: Prioritizes correct classification, may lead to overfitting
RBF Gamma (\(\gamma\)): Controls the influence radius of support vectors in RBF kernel
- Low \(\gamma\): Smooth decision boundaries, each point influences a wider region
- High \(\gamma\): Complex boundaries, localized influence around each support vector
Learning Rate & Iterations: SGD optimization parameters for training convergence

Training Process

The demo uses Stochastic Gradient Descent (SGD) to optimize the SVM objective. For the RBF kernel, it employs Random Fourier Features to approximate the kernel function, making the non-linear SVM computationally efficient. Click "Train SVM" to watch the algorithm iteratively:

Update weights to minimize the hinge loss: \(\max(0, 1 - \tilde{y}_n f(\mathbf{x}_n))\)
Apply regularization to control model complexity
Identify support vectors that satisfy the KKT conditions
Converge to the optimal decision boundary

Observing KKT Conditions

After training, the demo verifies the Karush-Kuhn-Tucker (KKT) conditions for each point:

For non-support vectors (\(\alpha_n = 0\)): Must have \(\tilde{y}_n f(\mathbf{x}_n) \geq 1\)
For support vectors on the margin (\(0 < \alpha_n < C\)): Must have \(\tilde{y}_n f(\mathbf{x}_n) = 1\)
For support vectors inside/beyond margin (\(\alpha_n = C\)): May have \(\tilde{y}_n f(\mathbf{x}_n) < 1\)

The percentage of points satisfying these conditions indicates how well the optimization has converged.

Experiment Suggestions

Try these experiments to deepen your understanding:

Generate linearly separable data and compare linear vs RBF kernels
Create circular data patterns to see where linear kernels fail
Adjust C to observe the trade-off between margin width and training accuracy
For RBF kernel, vary \(\gamma\) to see how decision boundaries become more complex
Watch how the number of support vectors changes with different parameters

SVMs demonstrate how geometric intuition - maximizing the margin - leads to strong generalization guarantees, particularly in high-dimensional spaces with limited data. In the PCA & Autoencoders page that follows, we shift from supervised to unsupervised learning, asking how to discover low-dimensional structure in data without any labels - connecting eigenvalue decomposition to neural network-based representation learning.

Support Vector Machine (SVM)

Loading...