Ridge Regression
Throughout this page we assume the linear regression model
\(\boldsymbol{y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon}\) with i.i.d. Gaussian noise
\(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 I)\), where \(X \in \mathbb{R}^{n \times d}\) is the design matrix and
\(\boldsymbol{\beta} \in \mathbb{R}^d\) is the parameter vector. As established in our
linear regression page, the
maximum likelihood estimator
under this model coincides with the
least-squares solution
given by the normal equations:
\[
\widehat{\boldsymbol{\beta}}_{\text{LS}} = (X^\top X)^{-1} X^\top \boldsymbol{y} = \widehat{\boldsymbol{\beta}}_{\text{MLE}},
\]
valid whenever \(X^\top X\) is invertible.
While this estimator is optimal when \(X^\top X\) is well-conditioned, high-dimensional settings where \(d \gg n\) lead to
ill-conditioning or singularity of \(X^\top X\). The objective surface becomes flat in some directions, making the solution
non-unique and highly sensitive to noise - a phenomenon known as overfitting. Regularization addresses
this by trading a small amount of bias for a substantial reduction in variance, a principle we will formalize in the
bias-variance decomposition below.
To mitigate this issue, we add a penalty term to the objective. This technique is called regularization,
and the modified version of linear regression is known as ridge regression:
\[
\begin{align*}
\widehat{\boldsymbol{\beta}}_{\text{ridge}}
&= \arg \min_{\boldsymbol{\beta}} \|\boldsymbol{y} - X\boldsymbol{\beta}\|_2^2 + \underbrace{\lambda \|\boldsymbol{\beta}\|_2^2}_{\text{squared } \ell_2 \text{-norm regularizer}} \\\\
&= \arg \min_{\boldsymbol{\beta}} \boldsymbol{y}^\top \boldsymbol{y} - 2\boldsymbol{\beta}^\top X^\top \boldsymbol{y} + \boldsymbol{\beta}^\top (X^\top X + \lambda I) \boldsymbol{\beta} \\\\
&=: \arg \min_{\boldsymbol{\beta}} J(\boldsymbol{\beta}).
\end{align*}
\]
Setting the gradient to zero,
\[
\nabla_{\boldsymbol{\beta}} J(\boldsymbol{\beta}) = -2 X^\top \boldsymbol{y} + 2(X^\top X + \lambda I) \boldsymbol{\beta} = \mathbf{0},
\]
yields the closed-form solution
\[
\widehat{\boldsymbol{\beta}}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top \boldsymbol{y},
\]
where \(\lambda > 0\) is the regularization parameter controlling the strength of the penalty.
Theorem: Ridge Regression Solution
Ridge regression augments the least-squares objective with a squared \(\ell_2\)-norm penalty
on the parameter vector:
\[
\widehat{\boldsymbol{\beta}}_{\text{ridge}} = \arg \min_{\boldsymbol{\beta}} \|\boldsymbol{y} - X\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2
= (X^\top X + \lambda I)^{-1} X^\top \boldsymbol{y},
\]
where \(\lambda > 0\) is the regularization parameter. The addition of \(\lambda I\) guarantees that
\(X^\top X + \lambda I\) is positive definite and hence invertible, even when
\(X^\top X\) is singular.
The hyperparameter \(\lambda \geq 0\) controls the strength of the regularization:
- As \(\lambda \to 0\), the solution approaches the least-squares solution: \(\widehat{\boldsymbol{\beta}}_{\text{ridge}} \to \widehat{\boldsymbol{\beta}}_{\text{LS}}\).
- As \(\lambda \to \infty\), the solution is pushed toward zero: \(\widehat{\boldsymbol{\beta}}_{\text{ridge}} \to \mathbf{0}\), leading to high bias but low variance.
This illustrates the classic bias-variance tradeoff in statistical learning.
Demo: Ridge Regression
Note: Tiny regularization \(\lambda = 0.0001\) has been added on linear regression for numerical stability.
After running the demo above, we observe that linear regression tends to achieve lower training error but at the cost of higher test error and wildly large
coefficients, especially with high polynomial degrees. This means the model memorizes the training data too closely and fails to generalize
well to new, unseen data.
In contrast, ridge regression, by constraining the weights, may have slightly higher training error but typically generalizes
better to new data. This trade-off between fitting the training data and maintaining model simplicity is known as the bias-variance tradeoff.
Increasing the model complexity (e.g., by raising the polynomial degree) monotonically reduces training error, but after a certain point, the test error begins to increase.
This turning point highlights the danger of overfitting.
Cross Validation
To build a generalized model and select an appropriate regularization strength (\(\lambda\)),
we employ K-fold cross-validation (CV) on the training set. The training data is partitioned into \(K\)
equal-sized folds. For each fold \(k \in \{1, \cdots, K\}\), the model is trained on the \(K - 1\) other folds and validated on the \(k\)th fold.
This process is repeated \(K\) times, ensuring that each data point in the training set is used once for validation. The average validation
error is used to select the optimal hyperparameter \(\lambda\). Finally, the model is evaluated on the held-out test set.
Here, we demonstrate how k-fold CV is used to tune the regularization parameter \(\lambda > 0\) in Ridge Regression.
A special case of K-fold CV is Leave-One-Out Cross-Validation (LOOCV), where the number of folds \(K\) equals the
number of data points in the training set. That is, the model is trained on all training data points except one, which is used for validation,
and this process is repeated for each data point. Thus, LOOCV tends to produce a low-bias estimate of the validation error. However,
LOOCV is computationally expensive for large datasets. In contrast, smaller values of \(K\) (such as 5-10) offer a good tradeoff between bias,
variance, and computational cost, making them a popular choice in practice.
Bias-Variance Tradeoff in Ridge Regression
Whereas mean squared error
measures the discrepancy between an estimator and the parameter it estimates, here we examine the closely related
discrepancy between a model's predictions and the true response. Both decompose into noise, bias, and variance.
We assume the probabilistic generative model:
\[
\boldsymbol{x}_i \sim P_X, \quad \boldsymbol{y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 I),
\]
and write \(\mathcal{D} = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^n\) for the training sample, with \(\mathbb{E}_{\mathcal{D}}\) denoting expectation
over the random training set. For a fixed query point \(\boldsymbol{x} \in \mathbb{R}^d\), the prediction error decomposes as follows.
Theorem: Bias-Variance Decomposition for Ridge Prediction Error
Under the linear-Gaussian model above, the expected squared prediction error of the ridge estimator
\(\widehat{\boldsymbol{\beta}}_{\text{ridge}}\) at a query point \(\boldsymbol{x}\) decomposes as
\[
\begin{align*}
&\mathbb{E}_{y, \mathcal{D} \mid \boldsymbol{x}} \!\left[ \left(y - \boldsymbol{x}^\top \widehat{\boldsymbol{\beta}}_{\text{ridge}} \right)^2 \mid \boldsymbol{x} \right] \\\\
&= \underbrace{\sigma^2}_{\text{Noise}}
+ \underbrace{\Big(\boldsymbol{x}^\top \boldsymbol{\beta} - \mathbb{E}_{\mathcal{D}}\!\left[\boldsymbol{x}^\top \widehat{\boldsymbol{\beta}}_{\text{ridge}} \mid \boldsymbol{x}\right] \Big)^2}_{(\text{Bias})^2}
+ \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[\Big(\boldsymbol{x}^\top \widehat{\boldsymbol{\beta}}_{\text{ridge}} - \mathbb{E}_{\mathcal{D}}\!\left[\boldsymbol{x}^\top \widehat{\boldsymbol{\beta}}_{\text{ridge}} \mid \boldsymbol{x}\right] \Big)^2 \mid \boldsymbol{x}\right]}_{\text{Variance}}. \tag{1}
\end{align*}
\]
Derivation:
Adding and subtracting \(\mathbb{E}[y \mid \boldsymbol{x}] = \boldsymbol{x}^\top \boldsymbol{\beta}\) inside the squared error and expanding yields
\[
\begin{align*}
&\mathbb{E}_{y, \mathcal{D} \mid \boldsymbol{x}} \!\left[ \left(y - \boldsymbol{x}^\top \widehat{\boldsymbol{\beta}}_{\text{ridge}} \right)^2 \mid \boldsymbol{x} \right] \\\\
&= \mathbb{E}_{y \mid \boldsymbol{x}}\!\left[\left(y - \mathbb{E}[y \mid \boldsymbol{x}]\right)^2 \mid \boldsymbol{x}\right]
+ \mathbb{E}_{\mathcal{D}}\!\left[\left(\mathbb{E}[y \mid \boldsymbol{x}] - \boldsymbol{x}^\top \widehat{\boldsymbol{\beta}}_{\text{ridge}}\right)^2 \mid \boldsymbol{x}\right]
\end{align*}
\]
(the cross term vanishes since \(y - \mathbb{E}[y\mid \boldsymbol{x}]\) has mean zero and is independent of \(\widehat{\boldsymbol{\beta}}_{\text{ridge}}\) given \(\boldsymbol{x}\)).
The first term equals \(\sigma^2\). The second admits a further bias-variance split via the standard identity
\(\mathbb{E}[(Z - c)^2] = (\mathbb{E}[Z] - c)^2 + \operatorname{Var}(Z)\), applied with
\(Z = \boldsymbol{x}^\top \widehat{\boldsymbol{\beta}}_{\text{ridge}}\) and \(c = \boldsymbol{x}^\top \boldsymbol{\beta}\), giving (1).
Closed Form Under Orthogonal Design
To make the tradeoff concrete, assume \(X^\top X = n I\). This corresponds to standardized, orthogonal predictors - a
common simplifying assumption that isolates the shrinkage effect of \(\lambda\) without the confounding influence of
correlated features. Substituting \(\boldsymbol{y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon}\) into the ridge solution:
\[
\begin{align*}
\widehat{\boldsymbol{\beta}}_{\text{ridge}} &= (X^\top X + \lambda I)^{-1} X^\top (X\boldsymbol{\beta} + \boldsymbol{\epsilon}) \\\\
&= (n+\lambda)^{-1} I \cdot (n\boldsymbol{\beta} + X^\top \boldsymbol{\epsilon}) \\\\
&= \frac{n}{n+\lambda}\boldsymbol{\beta} + \frac{1}{n+\lambda} X^\top \boldsymbol{\epsilon},
\end{align*}
\]
which exhibits the pure shrinkage form \(\widehat{\boldsymbol{\beta}}_{\text{ridge}} = \frac{n}{n+\lambda} \widehat{\boldsymbol{\beta}}_{\text{LS}}\). Therefore,
\[
\boldsymbol{x}^\top \widehat{\boldsymbol{\beta}}_{\text{ridge}} = \underbrace{\frac{n}{n+\lambda}\boldsymbol{x}^\top \boldsymbol{\beta}}_{\text{deterministic}} + \underbrace{\frac{1}{n+\lambda}(X\boldsymbol{x})^\top \boldsymbol{\epsilon}}_{\text{random}}.
\]
Since \(\mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}\), the squared bias term in (1) is
\[
\left(\boldsymbol{x}^\top \boldsymbol{\beta} - \frac{n}{n+\lambda}\boldsymbol{x}^\top \boldsymbol{\beta}\right)^2
= \frac{\lambda^2}{(n+\lambda)^2}(\boldsymbol{x}^\top \boldsymbol{\beta})^2.
\]
The variance term is the variance of the random part:
\[
\begin{align*}
\operatorname{Var}\!\left[\frac{1}{n+\lambda}(X\boldsymbol{x})^\top \boldsymbol{\epsilon}\right]
&= \frac{1}{(n+\lambda)^2}(X\boldsymbol{x})^\top (\sigma^2 I)(X\boldsymbol{x}) \\\\
&= \frac{\sigma^2}{(n+\lambda)^2}\boldsymbol{x}^\top (X^\top X) \boldsymbol{x}
= \frac{\sigma^2 n}{(n+\lambda)^2}\|\boldsymbol{x}\|_2^2.
\end{align*}
\]
Combining all three terms:
\[
\mathbb{E}_{y, \mathcal{D} \mid \boldsymbol{x}}\!\left[\left(y - \boldsymbol{x}^\top \widehat{\boldsymbol{\beta}}_{\text{ridge}}\right)^2 \mid \boldsymbol{x}\right]
= \sigma^2 + \frac{\lambda^2}{(n+\lambda)^2}(\boldsymbol{x}^\top \boldsymbol{\beta})^2 + \frac{\sigma^2 n}{(n+\lambda)^2}\|\boldsymbol{x}\|_2^2.
\]
This result makes the tradeoff explicit: as \(\lambda\) increases, the bias term \(\frac{\lambda^2}{(n+\lambda)^2}(\boldsymbol{x}^\top \boldsymbol{\beta})^2\) grows
while the variance term \(\frac{\sigma^2 n}{(n+\lambda)^2}\|\boldsymbol{x}\|_2^2\) shrinks. The optimal \(\lambda\) balances these competing effects,
and in practice is selected via cross-validation. An alternative approach to regularization replaces
the squared \(\ell_2\) penalty with an \(\ell_1\) penalty, yielding a qualitatively different behavior.
Insight: The Necessity of the Gap
The bias-variance decomposition sets up a tempting question: why not send \(\lambda \to 0\)
and recover the unbiased least-squares estimator? The decomposition tells us we trade bias
for variance, but pushes the question one step further: what would happen if we could drive
the training error to exactly zero? In an overparameterized regime where \(d \gg n\), this
is achievable: the model interpolates \(\mathcal{D}\), including the noise term
\(\boldsymbol{\epsilon}\), rather than distilling the underlying signal \(X\boldsymbol{\beta}\).
Regularization is, structurally, the imposition of a deliberate gap between the model's fit
and the training data. The penalty \(\lambda \|\boldsymbol{\beta}\|_2^2\) prevents the optimizer
from chasing every fluctuation in the noise, trading a controlled increase in bias for a
reduction in variance. The bias term in the decomposition above is not a defect to be
eliminated - it is the operational signature of a model that has generalized rather than
memorized.
Stated more sharply: under the classical bias-variance picture, a learning algorithm that
closes the gap between its predictions and the noisy training labels completely is no longer
learning the signal; it is reconstructing the noise. (Modern overparameterized models can
sometimes generalize despite interpolation, a phenomenon studied under the names
benign overfitting and double descent; the classical principle remains the
baseline.) The same need for a productive gap recurs throughout machine learning, under names
such as early stopping, dropout, weight decay, and the implicit regularization of stochastic
gradient descent. Each is a mechanism for engineering a gap that lets the model see past the
noise - because generalization requires the model to be slightly wrong about the training
data in order to be right about the underlying distribution.
Lasso Regression
Ridge regression shrinks all coefficients smoothly toward zero but never sets any exactly to zero.
In many applications - particularly when most features are irrelevant - we want the model to automatically
identify and discard uninformative variables. This motivates replacing the squared \(\ell_2\) penalty with an \(\ell_1\) penalty:
Definition: Lasso Regression
The Least Absolute Shrinkage and Selection Operator (Lasso)
solves
\[
\widehat{\boldsymbol{\beta}}_{\text{lasso}} = \arg \min_{\boldsymbol{\beta}} \|\boldsymbol{y} - X\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1,
\]
where \(\|\boldsymbol{\beta}\|_1 = \sum_{j=1}^d |\beta_j|\) is the \(\ell_1\)-norm and
\(\lambda > 0\) controls the regularization strength.
The key property of Lasso is that it performs automatic feature selection.
The \(\ell_1\) penalty drives many coefficients \(\beta_j\) to exactly zero, effectively dropping those features
from the model. Geometrically, this occurs because the \(\ell_1\) ball has corners (unlike the smooth \(\ell_2\) ball),
so the level sets of the loss function are more likely to intersect the constraint region at a vertex where
one or more coordinates vanish.
Unlike ridge regression, the Lasso objective is not differentiable at \(\beta_j = 0\) due to the absolute value, so
there is no closed-form solution. In practice, Lasso is solved by coordinate descent or proximal gradient methods that
handle the non-smoothness via the
subdifferential of the \(\ell_1\)-norm.
After obtaining the Lasso solution, a common practice is to re-fit ordinary least-squares on the selected features.
Specifically, define the active set
\[
S = \{j : \widehat{\beta}_{\text{lasso}, j} \neq 0\},
\]
and then solve the unregularized problem restricted to these features:
\[
\tilde{\boldsymbol{\beta}}_S = \arg \min_{\boldsymbol{\beta}_S} \sum_{i=1}^n (y_i - \boldsymbol{x}_{i,S}^\top \boldsymbol{\beta}_S)^2,
\]
setting \(\tilde{\beta}_j = 0\) for \(j \notin S\). This two-stage procedure removes the shrinkage bias that Lasso
introduces on the selected coefficients.
It is worth noting that Lasso's feature selection is fundamentally different from dimensionality reduction via PCA.
Lasso selects a subset of the original variables, preserving interpretability but potentially missing signals that are spread across many
weakly predictive features. PCA, by contrast, constructs new variables (linear combinations of the originals) that capture directions of maximal
variance, improving numerical stability for correlated features at the cost of direct interpretability. In practice, one can combine both approaches -
reducing to a moderate number of principal components and then applying Lasso - when both compression and sparsity are needed.
Ridge and Lasso regularization illustrate a core theme in machine learning: controlling model complexity to achieve good generalization. In
the classification page that follows, we move from predicting continuous values to
predicting discrete categories, introducing logistic regression and kernel methods that extend the linear framework into nonlinear decision boundaries.