Ridge Regression
In Part 11 of Section III,
we established the linear model: given observed data \(\{(x_i, y_i)\}_{i=1}^n\) with
\(x_i \in \mathbb{R}^d\) and \(y_i \in \mathbb{R}\), we write
\[
y = Xw + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I),
\]
where \(X \in \mathbb{R}^{n \times d}\) is the design matrix and \(w \in \mathbb{R}^d\) is the weight vector.
Under Gaussian noise, the maximum likelihood estimator
coincides with the least-squares solution:
\[
\begin{align*}
\widehat{w}_{LS} &= \arg \min_w \sum_{i=1}^n (y_i - {x_i}^\top w)^2 \\\\
&= \arg \min_w \|y - Xw\|_2^2 \\\\
&= \arg \min_w (y - Xw)^\top (y - Xw) \\\\
&= \arg \min_w \; y^\top y - y^\top Xw - w^\top X^\top y + w^\top X^\top X w \\\\
&= \arg \min_w \; J(w).
\end{align*}
\]
Differentiating with respect to \(w\) and setting the gradient to zero:
\[
\begin{align*}
\nabla J(w) &= -X^\top y - X^\top y + 2X^\top Xw \\\\
&= -2X^\top y + 2 X^\top X w = 0.
\end{align*}
\]
Thus, if \((X^\top X)^{-1}\) exists,
\[
\widehat{w}_{LS} = (X^\top X)^{-1}X^\top y = \widehat{w}_{MLE}.
\]
While this estimator is optimal when \(X^\top X\) is well-conditioned, high-dimensional settings where \(d \gg n\) lead to
ill-conditioning or singularity of \(X^\top X\). The objective surface becomes flat in some directions, making the solution
non-unique and highly sensitive to noise, a phenomenon known as overfitting. Regularization addresses
this by trading a small amount of bias for a substantial reduction in variance, a principle we will formalize in the
bias-variance decomposition below.
To mitigate this issue, we add a penalty term to the objective. This technique is called regularization,
and the modified version of linear regression is known as ridge regression:
\[
\begin{align*}
\widehat{w}_{ridge} &= \arg \min_w \sum_{i=1}^n (y_i - {x_i}^\top w)^2 +\underbrace{\lambda \|w\|_2^2}_{\text{Squared \(\ell_2\)-norm regularizer}} \\\\
&= \arg \min_w \|y - Xw\|_2^2 + \lambda w^\top w \\\\
&= \arg \min_w (Xw -y)^\top (Xw-y)+\lambda w^\top w \\\\
&= \arg \min_w \; y^\top y - 2w^\top X^\top y + w^\top X^\top Xw + w^\top (\lambda I)w \\\\
&= \arg \min_w \; y^\top y - 2w^\top X^\top y + w^\top (X^\top X + \lambda I)w \\\\
&= \arg \min_w \; J(w).
\]
Then
\[
\nabla_w J(w) = -2X^\top y + 2(X^\top X + \lambda I)w = 0.
\]
Solving for \(w\), we obtain
\[
\widehat{w}_{ridge} = (X^\top X + \lambda I)^{-1}X^\top y,
\]
where \(\lambda > 0\) is the regularization parameter that controls the strength of the penalty.
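A minimal NumPy implementation of this closed form might look as follows (the helper name `ridge_fit` and the simulated data are illustrative assumptions, not fixed by the text):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
w_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = X @ w_true + 0.1 * rng.normal(size=40)

w_ridge = ridge_fit(X, y, lam=1.0)
# The penalty shrinks the estimate: its norm is smaller than the LS fit's
w_ls = ridge_fit(X, y, lam=0.0)
assert np.linalg.norm(w_ridge) < np.linalg.norm(w_ls)
```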
Ridge Regression:
Ridge regression augments the least-squares objective with a squared \(\ell_2\)-norm penalty
on the weight vector:
\[
\widehat{w}_{\text{ridge}} = \arg \min_w \|y - Xw\|_2^2 + \lambda \|w\|_2^2
= (X^\top X + \lambda I)^{-1} X^\top y,
\]
where \(\lambda > 0\) is the regularization parameter. The addition of \(\lambda I\) guarantees that
\((X^\top X + \lambda I)\) is positive definite and hence invertible, even when
\(X^\top X\) is singular.
The hyperparameter \(\lambda\) controls the strength of the regularization:
- As \(\lambda \to 0\), the solution approaches the least-squares solution: \(\widehat{w}_{ridge} \to \widehat{w}_{LS}\).
- As \(\lambda \to \infty\), the solution is pushed toward zero: \(\widehat{w}_{ridge} \to 0\), leading to high bias but low variance.
This illustrates the classic bias-variance tradeoff in statistical learning.
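Both the invertibility claim and the limiting behavior can be checked numerically in a \(d \gg n\) setting (a sketch with simulated data; `ridge_fit` implements the closed form displayed above):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Implements (X^T X + lam * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(2)
n, d = 10, 50                         # d >> n: the 50x50 matrix X^T X has rank <= 10
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# X^T X alone is singular, but X^T X + lam * I is positive definite for lam > 0
assert np.linalg.matrix_rank(X.T @ X) <= n

w_small = ridge_fit(X, y, lam=1e-6)   # near the small-lambda limit
w_large = ridge_fit(X, y, lam=1e6)    # pushed toward zero
assert np.linalg.norm(w_large) < np.linalg.norm(w_small)
```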
Demo: Ridge Regression
Note: A tiny regularization term (\(\lambda = 0.0001\)) is added to the linear regression fit for numerical stability.
After running the demo above, we observe that linear regression tends to achieve lower training error but at the cost of higher test error and wildly large
coefficients, especially with high polynomial degrees. This means the model memorizes the training data too closely and fails to generalize
well to new, unseen data.
In contrast, ridge regression, by constraining the weights, may have slightly higher training error but typically generalizes
better to new data. This tradeoff between fitting the training data and maintaining model simplicity is known as the bias-variance tradeoff.
Increasing the model complexity (e.g., by raising the polynomial degree) monotonically reduces training error, but after a certain point, the test error begins to increase.
This turning point highlights the danger of overfitting.
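The qualitative behavior described above can be reproduced in a few lines of NumPy (a sketch on simulated data; the sine target, degree, and \(\lambda\) values are illustrative choices, not the demo's actual setup):

```python
import numpy as np

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(3)
x_pts = np.sort(rng.uniform(-1, 1, size=15))
y_pts = np.sin(3 * x_pts) + 0.2 * rng.normal(size=15)   # noisy nonlinear target

X_poly = np.vander(x_pts, N=10)          # degree-9 polynomial features

w_lin = ridge_fit(X_poly, y_pts, 1e-4)   # near-unregularized fit
w_rdg = ridge_fit(X_poly, y_pts, 1.0)    # ridge fit

train_mse = lambda w: np.mean((y_pts - X_poly @ w) ** 2)
# The near-unregularized fit achieves lower training error,
# but its coefficients are much larger in magnitude.
assert train_mse(w_lin) <= train_mse(w_rdg)
assert np.linalg.norm(w_lin) > np.linalg.norm(w_rdg)
```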
Cross Validation
To build a model that generalizes well and to select an appropriate regularization strength (\(\lambda\)),
we employ K-fold cross-validation (CV) on the training set. The training data is partitioned into \(K\)
equal-sized folds. For each fold \(k \in \{1, \cdots, K\}\), the model is trained on the \(K - 1\) other folds and validated on the \(k\)th fold.
This process is repeated \(K\) times, ensuring that each data point in the training set is used once for validation. The average validation
error is used to select the optimal hyperparameter \(\lambda\). Finally, the model is evaluated on the held-out test set.
Here, we demonstrate how K-fold CV is used to tune the regularization parameter \(\lambda > 0\) in ridge regression.
A special case of K-fold CV is Leave-One-Out Cross-Validation (LOOCV), where the number of folds \(K\) equals the
number of data points in the training set. That is, the model is trained on all training data points except one, which is used for validation,
and this process is repeated for each data point. Because each training set is nearly the full dataset, LOOCV tends to produce a low-bias estimate of the validation error. However,
LOOCV is computationally expensive for large datasets. In contrast, smaller values of \(K\) (such as 5-10) offer a good tradeoff between bias,
variance, and computational cost, making them a popular choice in practice.
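The procedure above can be sketched directly (simulated data; the helper names, candidate grid, and constants are illustrative choices):

```python
import numpy as np

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def kfold_cv_error(X, y, lam, K=5, seed=0):
    """Average validation MSE of ridge regression over K folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        val = folds[k]                     # k-th fold held out for validation
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        w = ridge_fit(X[tr], y[tr], lam)
        errs.append(np.mean((y[val] - X[val] @ w) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 8))
y = X @ rng.normal(size=8) + 0.5 * rng.normal(size=60)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(lambdas, key=lambda lam: kfold_cv_error(X, y, lam))
```

Setting `K = len(y)` in `kfold_cv_error` recovers LOOCV as the special case described above.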
Bias-Variance Tradeoff in Ridge Regression
Let's now formally derive the expected prediction error in the context of ridge regression. Recall that we
obtained:
\[
\widehat{w}_{ridge} = (X^\top X + \lambda I)^{-1}X^\top y.
\]
We assume a probabilistic generative model:
\[
x_i \sim P_X, \quad y = Xw + \epsilon, \text{ where } \epsilon \sim \mathcal{N}(0, \sigma^2 I).
\]
The expected prediction error at a sample with features \(x\) is given by:
\[
\begin{align*}
&\mathbb{E}_{y, D \mid x} \left[ \left(y - x^\top \widehat{w}_{\text{ridge}} \right)^2 \mid x \right] \\\\
&= \mathbb{E}_{y \mid x} \left[ \left(y - \mathbb{E}[y \mid x] \right)^2 \mid x \right]
+ \mathbb{E}_D \left[ \left( \mathbb{E}[y \mid x] -x^\top \widehat{w}_{ridge} \right)^2 \mid x \right] \\\\
&= \underbrace{ \mathbb{E}_{y \mid x} \left[ \left(y - x^\top w \right)^2 \mid x \right] }_{\text{Irreducible error (Noise)}}
+ \mathbb{E}_D \left[ \left(x^\top w - x^\top \widehat{w}_{ridge} \right)^2 \mid x \right] \\\\
&= \underbrace{\sigma^2}_{\text{Noise}}
+ \underbrace{\left(x^\top w - \mathbb{E}_D \left[x^\top \widehat{w}_{ridge} \mid x \right] \right)^2 }_{(\text{Bias})^2}
+ \underbrace{ \mathbb{E}_D \left[ \left( x^\top \widehat{w}_{ridge} - \mathbb{E}_D \left[ x^\top \widehat{w}_{ridge} \mid x\right] \right)^2 \mid x \right] }_{\text{Variance}} \tag{1}\\\\
\end{align*}
\]
where each cross term vanishes: the test noise \(y - \mathbb{E}[y \mid x]\) is independent of the training set \(D\), and the deviation \(x^\top \widehat{w}_{ridge} - \mathbb{E}_D[x^\top \widehat{w}_{ridge} \mid x]\) has zero mean.
Now we assume \(X^\top X = n I\) for simplicity. This corresponds to the case where the predictors are orthogonal (uncorrelated)
and standardized, a common simplifying assumption in ridge regression analysis. For orthogonal covariates, the ridge estimator
takes the simple form \(\widehat{w}_{ridge} = \frac{n}{n+\lambda} \widehat{w}_{LS}\), showing pure shrinkage behavior.
While this assumption doesn't hold in general, it allows a clear illustration of the bias-variance tradeoff without the complexity of
correlated predictors. Then
\[
\begin{align*}
\widehat{w}_{ridge} &= (X^\top X + \lambda I)^{-1} X^\top (Xw + \epsilon) \\\\
&= (nI + \lambda I)^{-1}(nIw + X^\top \epsilon) \\\\
&= (n + \lambda)^{-1} I (nw + X^\top \epsilon) \\\\
&= \frac{n}{n + \lambda}w + \frac{1}{n + \lambda} X^\top \epsilon.
\end{align*}
\]
Thus,
\[
x^\top \widehat{w}_{ridge} = \underbrace{\frac{n}{n + \lambda}x^\top w}_{\text{Constant}} + \underbrace{\frac{1}{n + \lambda} (Xx)^\top \epsilon}_{\text{Random}}.
\]
Since \(\mathbb{E}[\epsilon] = 0\), the bias term of Expression 1 is written as:
\[
\begin{align*}
& \left( x^\top w -\mathbb{E}_D \left[x^\top \widehat{w}_{ridge}\right] \right)^2 \\\\
&= \left( x^\top w -\mathbb{E}_D \left[\frac{n}{n + \lambda}x^\top w + \frac{1}{n + \lambda} (Xx)^\top \epsilon \right] \right)^2 \\\\
&= \left( x^\top w - \frac{n}{n + \lambda}x^\top w \right)^2 \\\\
&= \frac{\lambda^2}{(n + \lambda)^2}(x^\top w)^2.
\end{align*}
\]
Since the first term of \(x^\top \widehat{w}_{ridge}\) is constant and contributes no variance, the variance term of Expression 1 is written as:
\[
\begin{align*}
&\mathbb{E}_D \left[ \left( x^\top \widehat{w}_{ridge} - \mathbb{E}_D \left[ x^\top \widehat{w}_{ridge} \mid x\right] \right)^2 \mid x \right] \\\\
&= \text{Var}\left[ \frac{1}{n + \lambda} (Xx)^\top \epsilon \right] \\\\
&= \frac{1}{(n + \lambda)^2}\text{Var}\left[(Xx)^\top \epsilon \right] \\\\
&= \frac{1}{(n + \lambda)^2}(Xx)^\top \text{Var}[\epsilon] (Xx) \\\\
&= \frac{1}{(n + \lambda)^2} x^\top X^\top (\sigma^2 I) X x \\\\
&= \frac{\sigma^2}{(n + \lambda)^2} x^\top (X^\top X) x \\\\
&= \frac{\sigma^2}{(n + \lambda)^2} x^\top (n I) x \\\\
&= \frac{\sigma^2 n}{(n + \lambda)^2} \| x \|_2^2. \\\\
\end{align*}
\]
Therefore, putting it all together:
\[
\begin{align*}
&\mathbb{E}_{y, D \mid x} \left[ \left(y - x^\top \widehat{w}_{\text{ridge}} \right)^2 \mid x \right] \\\\
&= \sigma^2 + \frac{\lambda^2}{(n + \lambda)^2}(w^\top x)^2 + \frac{\sigma^2 n}{(n + \lambda)^2} \| x \|_2^2.
\end{align*}
\]
This result makes the tradeoff explicit: as \(\lambda\) increases, the bias term \(\frac{\lambda^2}{(n+\lambda)^2}(w^\top x)^2\) grows
while the variance term \(\frac{\sigma^2 n}{(n+\lambda)^2}\|x\|_2^2\) shrinks. The optimal \(\lambda\) balances these competing effects,
and in practice is selected via cross-validation. An alternative approach to regularization replaces
the squared \(\ell_2\) penalty with an \(\ell_1\) penalty, yielding a qualitatively different behavior.
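Before moving on, the decomposition can be verified empirically: fix a design with \(X^\top X = nI\), redraw the noise many times, and compare the Monte Carlo bias and variance against the formulas above (a sketch; the constants and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, lam, sigma = 100, 5, 10.0, 1.0

# Build a design with X^T X = n I: orthonormal columns scaled by sqrt(n)
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))
X = np.sqrt(n) * Q
w = rng.normal(size=d)                 # true weights
x = rng.normal(size=d)                 # fixed test point

# The ridge prediction is linear in the noise: x^T (X^T X + lam I)^{-1} X^T (Xw + eps)
a = x @ np.linalg.inv(X.T @ X + lam * np.eye(d)) @ X.T   # shape (n,)

T = 20_000                             # simulated training sets (noise redraws)
eps = sigma * rng.normal(size=(T, n))
preds = (X @ w + eps) @ a              # x^T w_ridge for each draw

bias2_mc = (x @ w - preds.mean()) ** 2
var_mc = preds.var()

bias2_th = lam**2 / (n + lam) ** 2 * (x @ w) ** 2
var_th = sigma**2 * n / (n + lam) ** 2 * (x @ x)
# bias2_mc matches bias2_th and var_mc matches var_th up to Monte Carlo error
```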
Lasso Regression
Ridge regression shrinks all coefficients smoothly toward zero but never sets any exactly to zero.
In many applications, particularly when most features are irrelevant, we want the model to automatically
identify and discard uninformative variables. This motivates replacing the squared \(\ell_2\) penalty with an \(\ell_1\) penalty:
Lasso Regression:
The Least Absolute Shrinkage and Selection Operator (Lasso)
solves
\[
\widehat{w}_{\text{lasso}} = \arg \min_w \|y - Xw\|_2^2 + \lambda \|w\|_1,
\]
where \(\|w\|_1 = \sum_{j=1}^d |w_j|\) is the \(\ell_1\)-norm and
\(\lambda > 0\) controls the regularization strength.
The key property of Lasso is that it performs automatic feature selection.
The \(\ell_1\) penalty drives many coefficients \(w_j\) to exactly zero, effectively dropping those features
from the model. Geometrically, this occurs because the \(\ell_1\) ball has corners (unlike the smooth \(\ell_2\) ball),
so the level sets of the loss function are more likely to intersect the constraint region at a vertex where
one or more coordinates vanish.
Unlike ridge regression, the Lasso objective is not differentiable at \(w_j = 0\) due to the absolute value, so
there is no closed-form solution. In practice, Lasso is solved by coordinate descent or proximal gradient methods.
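A bare-bones coordinate descent solver illustrates the idea (a sketch, not a production implementation; the soft-threshold level \(\lambda/2\) follows from minimizing the objective above one coordinate at a time, and the data are simulated):

```python
import numpy as np

def soft_threshold(z, t):
    """S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for min_w ||y - Xw||_2^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(d):
            r_j = y - X @ w + X[:, j] * w[j]   # residual excluding feature j
            w[j] = soft_threshold(X[:, j] @ r_j, lam / 2) / col_sq[j]
    return w

rng = np.random.default_rng(6)
X = rng.normal(size=(80, 10))
w_true = np.zeros(10)
w_true[:3] = [3.0, -2.0, 1.5]                  # only 3 informative features
y = X @ w_true + 0.1 * rng.normal(size=80)

w_hat = lasso_cd(X, y, lam=20.0)
# The coefficients of the irrelevant features are driven exactly to zero
```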
After obtaining the Lasso solution, a common practice is to re-fit ordinary least-squares on the selected features.
Specifically, define the active set
\[
S = \{j : \widehat{w}_{\text{lasso}, j} \neq 0\},
\]
and then solve the unregularized problem restricted to these features:
\[
\tilde{w}_S = \arg \min_{w_S} \sum_{i=1}^n (y_i - x_{i,S}^\top w_S)^2,
\]
setting \(\tilde{w}_j = 0\) for \(j \notin S\). This two-stage procedure removes the shrinkage bias that Lasso
introduces on the selected coefficients.
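This two-stage refit is straightforward to sketch (the helper `relaxed_refit`, the hard-coded lasso output, and the simulated data are all hypothetical):

```python
import numpy as np

def relaxed_refit(X, y, w_lasso, tol=1e-8):
    """OLS re-fit on the active set S = {j : w_lasso_j != 0}; zeros elsewhere."""
    S = np.flatnonzero(np.abs(w_lasso) > tol)
    w = np.zeros_like(w_lasso)
    if S.size > 0:
        sol, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
        w[S] = sol
    return w

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 6))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=50)       # only feature 0 matters

w_lasso = np.array([1.6, 0.0, 0.0, 0.0, 0.0, 0.0])  # hypothetical lasso output
w_refit = relaxed_refit(X, y, w_lasso)
# The refit coefficient on feature 0 moves back toward 2.0, undoing the shrinkage
```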
It is worth noting that Lasso's feature selection is fundamentally different from dimensionality reduction via PCA.
Lasso selects a subset of the original variables, preserving interpretability but potentially missing signals that are spread across many
weakly predictive features. PCA, by contrast, constructs new variables (linear combinations of the originals) that capture directions of maximal
variance, improving numerical stability for correlated features at the cost of direct interpretability. In practice, one can combine both approaches
(reducing to a moderate number of principal components and then applying Lasso) when both compression and sparsity are needed.
Ridge and Lasso regularization illustrate a core theme in machine learning: controlling model complexity to achieve good generalization. In
Part 3: Intro to Classification, we move from predicting continuous values to
predicting discrete categories, introducing logistic regression and kernel methods that extend the linear framework into nonlinear decision boundaries.