Ridge Regression
In Part 11 of Section III,
we established the linear model: given observed data \(\{(x_i, y_i)\}_{i=1}^n\) with
\(x_i \in \mathbb{R}^d\) and \(y_i \in \mathbb{R}\), we write
\[
y = Xw + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I),
\]
where \(X \in \mathbb{R}^{n \times d}\) is the design matrix and \(w \in \mathbb{R}^d\) is the weight vector.
Under Gaussian noise, the maximum likelihood estimator
coincides with the least-squares solution:
\[
\begin{align*}
\widehat{w}_{LS} &= \arg \min_w \sum_{i=1}^n (y_i - {x_i}^\top w)^2 \\\\
&= \arg \min_w \|y - Xw\|_2^2 \\\\
&= \arg \min_w (y - Xw)^\top (y - Xw) \\\\
&= \arg \min_w \; y^\top y - y^\top Xw - w^\top X^\top y + w^\top X^\top X w \\\\
&= \arg \min_w \; J(w).
\end{align*}
\]
Differentiating with respect to \(w\) and setting the gradient to zero:
\[
\begin{align*}
\nabla J(w) &= -X^\top y - X^\top y + 2X^\top Xw \\\\
&= -2X^\top y + 2 X^\top X w = 0.
\end{align*}
\]
Thus, if \((X^\top X)^{-1}\) exists,
\[
\widehat{w}_{LS} = (X^\top X)^{-1}X^\top y = \widehat{w}_{MLE}.
\]
While this estimator is optimal when \(X^\top X\) is well-conditioned, high-dimensional settings where \(d \gg n\) lead to
ill-conditioning or singularity of \(X^\top X\). The objective surface becomes flat in some directions, making the solution
non-unique and highly sensitive to noise, a phenomenon known as overfitting. Regularization addresses
this by trading a small amount of bias for a substantial reduction in variance, a principle we will formalize in the
bias-variance decomposition below.
To mitigate this issue, we add a penalty term to the objective. This technique is called regularization,
and the modified version of linear regression is known as ridge regression:
\[
\begin{align*}
\widehat{w}_{ridge} &= \arg \min_w \sum_{i=1}^n (y_i - {x_i}^\top w)^2 +\underbrace{\lambda \|w\|_2^2}_{\text{Squared \(\ell_2\)-norm regularizer}} \\\\
&= \arg \min_w \|y - Xw\|_2^2 + \lambda w^\top w \\\\
&= \arg \min_w (Xw -y)^\top (Xw-y)+\lambda w^\top w \\\\
&= \arg \min_w \; y^\top y - 2w^\top X^\top y + w^\top X^\top Xw + w^\top (\lambda I)w \\\\
&= \arg \min_w \; y^\top y - 2w^\top X^\top y + w^\top (X^\top X + \lambda I)w \\\\
&= \arg \min_w \; J(w).
\]
Then
\[
\nabla_w J(w) = -2X^\top y + 2(X^\top X + \lambda I)w = 0.
\]
Solving for \(w\), we obtain
\[
\widehat{w}_{ridge} = (X^\top X + \lambda I)^{-1}X^\top y,
\]
where \(\lambda > 0\) is the regularization parameter that controls the strength of the penalty.
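A minimal NumPy implementation of this closed form might look as follows (the helper name `ridge_fit` and the simulated data are illustrative assumptions, not fixed by the text):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
w_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = X @ w_true + 0.1 * rng.normal(size=40)

w_ridge = ridge_fit(X, y, lam=1.0)
# The penalty shrinks the estimate: its norm is smaller than the LS fit's
w_ls = ridge_fit(X, y, lam=0.0)
assert np.linalg.norm(w_ridge) < np.linalg.norm(w_ls)
```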
Ridge Regression:
Ridge regression augments the least-squares objective with a squared \(\ell_2\)-norm penalty
on the weight vector:
\[
\widehat{w}_{\text{ridge}} = \arg \min_w \|y - Xw\|_2^2 + \lambda \|w\|_2^2
= (X^\top X + \lambda I)^{-1} X^\top y,
\]
where \(\lambda > 0\) is the regularization parameter. The addition of \(\lambda I\) guarantees that
\((X^\top X + \lambda I)\) is positive definite and hence invertible, even when
\(X^\top X\) is singular.
The hyperparameter \(\lambda\) controls the strength of the regularization:
- As \(\lambda \to 0\), the solution approaches the least-squares solution: \(\widehat{w}_{ridge} \to \widehat{w}_{LS}\).
- As \(\lambda \to \infty\), the solution is pushed toward zero: \(\widehat{w}_{ridge} \to 0\), leading to high bias but low variance.
This illustrates the classic bias-variance tradeoff in statistical learning.
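Both the invertibility claim and the limiting behavior can be checked numerically in a \(d \gg n\) setting (a sketch with simulated data; `ridge_fit` implements the closed form displayed above):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Implements (X^T X + lam * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(2)
n, d = 10, 50                         # d >> n: the 50x50 matrix X^T X has rank <= 10
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# X^T X alone is singular, but X^T X + lam * I is positive definite for lam > 0
assert np.linalg.matrix_rank(X.T @ X) <= n

w_small = ridge_fit(X, y, lam=1e-6)   # near the small-lambda limit
w_large = ridge_fit(X, y, lam=1e6)    # pushed toward zero
assert np.linalg.norm(w_large) < np.linalg.norm(w_small)
```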
Demo: Ridge Regression
Note: A tiny regularization term (\(\lambda = 0.0001\)) is added to the linear regression fit for numerical stability.
After running the demo above, we observe that linear regression tends to achieve lower training error but at the cost of higher test error and wildly large
coefficients, especially with high polynomial degrees. This means the model memorizes the training data too closely and fails to generalize
well to new, unseen data.
In contrast, ridge regression, by constraining the weights, may have slightly higher training error but typically generalizes
better to new data. This tradeoff between fitting the training data and maintaining model simplicity is known as the bias-variance tradeoff.
Increasing the model complexity (e.g., by raising the polynomial degree) monotonically reduces training error, but after a certain point, the test error begins to increase.
This turning point highlights the danger of overfitting.
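The qualitative behavior described above can be reproduced in a few lines of NumPy (a sketch on simulated data; the sine target, degree, and \(\lambda\) values are illustrative choices, not the demo's actual setup):

```python
import numpy as np

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(3)
x_pts = np.sort(rng.uniform(-1, 1, size=15))
y_pts = np.sin(3 * x_pts) + 0.2 * rng.normal(size=15)   # noisy nonlinear target

X_poly = np.vander(x_pts, N=10)          # degree-9 polynomial features

w_lin = ridge_fit(X_poly, y_pts, 1e-4)   # near-unregularized fit
w_rdg = ridge_fit(X_poly, y_pts, 1.0)    # ridge fit

train_mse = lambda w: np.mean((y_pts - X_poly @ w) ** 2)
# The near-unregularized fit achieves lower training error,
# but its coefficients are much larger in magnitude.
assert train_mse(w_lin) <= train_mse(w_rdg)
assert np.linalg.norm(w_lin) > np.linalg.norm(w_rdg)
```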
Cross Validation
To build a model that generalizes well and to select an appropriate regularization strength (\(\lambda\)),
we employ K-fold cross-validation (CV) on the training set. The training data is partitioned into \(K\)
equal-sized folds. For each fold \(k \in \{1, \cdots, K\}\), the model is trained on the \(K - 1\) other folds and validated on the \(k\)th fold.
This process is repeated \(K\) times, ensuring that each data point in the training set is used once for validation. The average validation
error is used to select the optimal hyperparameter \(\lambda\). Finally, the model is evaluated on the held-out test set.
Here, we demonstrate how K-fold CV is used to tune the regularization parameter \(\lambda > 0\) in ridge regression.
A special case of K-fold CV is Leave-One-Out Cross-Validation (LOOCV), where the number of folds \(K\) equals the
number of data points in the training set. That is, the model is trained on all training data points except one, which is used for validation,
and this process is repeated for each data point. Because each training set is nearly the full dataset, LOOCV tends to produce a low-bias estimate of the validation error. However,
LOOCV is computationally expensive for large datasets. In contrast, smaller values of \(K\) (such as 5-10) offer a good tradeoff between bias,
variance, and computational cost, making them a popular choice in practice.
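The procedure above can be sketched directly (simulated data; the helper names, candidate grid, and constants are illustrative choices):

```python
import numpy as np

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def kfold_cv_error(X, y, lam, K=5, seed=0):
    """Average validation MSE of ridge regression over K folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        val = folds[k]                     # k-th fold held out for validation
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        w = ridge_fit(X[tr], y[tr], lam)
        errs.append(np.mean((y[val] - X[val] @ w) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 8))
y = X @ rng.normal(size=8) + 0.5 * rng.normal(size=60)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(lambdas, key=lambda lam: kfold_cv_error(X, y, lam))
```

Setting `K = len(y)` in `kfold_cv_error` recovers LOOCV as the special case described above.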
Bias-Variance Tradeoff in Ridge Regression
Let's now formally derive the expected prediction error in the context of ridge regression. Recall that we
obtained:
\[
\widehat{w}_{ridge} = (X^\top X + \lambda I)^{-1}X^\top y.
\]
We assume a probabilistic generative model:
\[
x_i \sim P_X, \quad y = Xw + \epsilon, \text{ where } \epsilon \sim \mathcal{N}(0, \sigma^2 I).
\]
The expected prediction error at a sample with features \(x\) is given by:
\[
\begin{align*}
&\mathbb{E}_{y, D \mid x} \left[ \left(y - x^\top \widehat{w}_{\text{ridge}} \right)^2 \mid x \right] \\\\
&= \mathbb{E}_{y \mid x} \left[ \left(y - \mathbb{E}[y \mid x] \right)^2 \mid x \right]
+ \mathbb{E}_D \left[ \left( \mathbb{E}[y \mid x] -x^\top \widehat{w}_{ridge} \right)^2 \mid x \right] \\\\
&= \underbrace{ \mathbb{E}_{y \mid x} \left[ \left(y - x^\top w \right)^2 \mid x \right] }_{\text{Irreducible error (Noise)}}
+ \mathbb{E}_D \left[ \left(x^\top w - x^\top \widehat{w}_{ridge} \right)^2 \mid x \right] \\\\
&= \underbrace{\sigma^2}_{\text{Noise}}
+ \underbrace{\left(x^\top w - \mathbb{E}_D \left[x^\top \widehat{w}_{ridge} \mid x \right] \right)^2 }_{(\text{Bias})^2}
+ \underbrace{ \mathbb{E}_D \left[ \left( x^\top \widehat{w}_{ridge} - \mathbb{E}_D \left[ x^\top \widehat{w}_{ridge} \mid x\right] \right)^2 \mid x \right] }_{\text{Variance}} \tag{1}\\\\
\end{align*}
\]
where each cross term vanishes: the test noise \(y - \mathbb{E}[y \mid x]\) is independent of the training set \(D\), and the deviation \(x^\top \widehat{w}_{ridge} - \mathbb{E}_D[x^\top \widehat{w}_{ridge} \mid x]\) has zero mean.
Now we assume \(X^\top X = n I\) for simplicity. This corresponds to the case where the predictors are orthogonal (uncorrelated)
and standardized, a common simplifying assumption in ridge regression analysis. For orthogonal covariates, the ridge estimator
takes the simple form \(\widehat{w}_{ridge} = \frac{n}{n+\lambda} \widehat{w}_{LS}\), showing pure shrinkage behavior.
While this assumption doesn't hold in general, it allows a clear illustration of the bias-variance tradeoff without the complexity of
correlated predictors. Then
\[
\begin{align*}
\widehat{w}_{ridge} &= (X^\top X + \lambda I)^{-1} X^\top (Xw + \epsilon) \\\\
&= (nI + \lambda I)^{-1}(nIw + X^\top \epsilon) \\\\
&= (n + \lambda)^{-1} I (nw + X^\top \epsilon) \\\\
&= \frac{n}{n + \lambda}w + \frac{1}{n + \lambda} X^\top \epsilon.
\end{align*}
\]
Thus,
\[
x^\top \widehat{w}_{ridge} = \underbrace{\frac{n}{n + \lambda}x^\top w}_{\text{Constant}} + \underbrace{\frac{1}{n + \lambda} (Xx)^\top \epsilon}_{\text{Random}}.
\]
Since \(\mathbb{E}[\epsilon] = 0\), the bias term of Expression 1 is written as:
\[
\begin{align*}
& \left( x^\top w -\mathbb{E}_D \left[x^\top \widehat{w}_{ridge}\right] \right)^2 \\\\
&= \left( x^\top w -\mathbb{E}_D \left[\frac{n}{n + \lambda}x^\top w + \frac{1}{n + \lambda} (Xx)^\top \epsilon \right] \right)^2 \\\\
&= \left( x^\top w - \frac{n}{n + \lambda}x^\top w \right)^2 \\\\
&= \frac{\lambda^2}{(n + \lambda)^2}(x^\top w)^2.
\end{align*}
\]
Since the first term of \(x^\top \widehat{w}_{ridge}\) is constant and contributes no variance, the variance term of Expression 1 is written as:
\[
\begin{align*}
&\mathbb{E}_D \left[ \left( x^\top \widehat{w}_{ridge} - \mathbb{E}_D \left[ x^\top \widehat{w}_{ridge} \mid x\right] \right)^2 \mid x \right] \\\\
&= \text{Var}\left[ \frac{1}{n + \lambda} (Xx)^\top \epsilon \right] \\\\
&= \frac{1}{(n + \lambda)^2}\text{Var}\left[(Xx)^\top \epsilon \right] \\\\
&= \frac{1}{(n + \lambda)^2}(Xx)^\top \text{Var}[\epsilon] (Xx) \\\\
&= \frac{1}{(n + \lambda)^2} x^\top X^\top (\sigma^2 I) X x \\\\
&= \frac{\sigma^2}{(n + \lambda)^2} x^\top (X^\top X) x \\\\
&= \frac{\sigma^2}{(n + \lambda)^2} x^\top (n I) x \\\\
&= \frac{\sigma^2 n}{(n + \lambda)^2} \| x \|_2^2. \\\\
\end{align*}
\]
Therefore, putting it all together:
\[
\begin{align*}
&\mathbb{E}_{y, D \mid x} \left[ \left(y - x^\top \widehat{w}_{\text{ridge}} \right)^2 \mid x \right] \\\\
&= \sigma^2 + \frac{\lambda^2}{(n + \lambda)^2}(w^\top x)^2 + \frac{\sigma^2 n}{(n + \lambda)^2} \| x \|_2^2.
\end{align*}
\]
This result makes the tradeoff explicit: as \(\lambda\) increases, the bias term \(\frac{\lambda^2}{(n+\lambda)^2}(w^\top x)^2\) grows
while the variance term \(\frac{\sigma^2 n}{(n+\lambda)^2}\|x\|_2^2\) shrinks. The optimal \(\lambda\) balances these competing effects,
and in practice is selected via cross-validation. An alternative approach to regularization replaces
the squared \(\ell_2\) penalty with an \(\ell_1\) penalty, yielding a qualitatively different behavior.
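Before moving on, the decomposition can be verified empirically: fix a design with \(X^\top X = nI\), redraw the noise many times, and compare the Monte Carlo bias and variance against the formulas above (a sketch; the constants and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, lam, sigma = 100, 5, 10.0, 1.0

# Build a design with X^T X = n I: orthonormal columns scaled by sqrt(n)
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))
X = np.sqrt(n) * Q
w = rng.normal(size=d)                 # true weights
x = rng.normal(size=d)                 # fixed test point

# The ridge prediction is linear in the noise: x^T (X^T X + lam I)^{-1} X^T (Xw + eps)
a = x @ np.linalg.inv(X.T @ X + lam * np.eye(d)) @ X.T   # shape (n,)

T = 20_000                             # simulated training sets (noise redraws)
eps = sigma * rng.normal(size=(T, n))
preds = (X @ w + eps) @ a              # x^T w_ridge for each draw

bias2_mc = (x @ w - preds.mean()) ** 2
var_mc = preds.var()

bias2_th = lam**2 / (n + lam) ** 2 * (x @ w) ** 2
var_th = sigma**2 * n / (n + lam) ** 2 * (x @ x)
# bias2_mc matches bias2_th and var_mc matches var_th up to Monte Carlo error
```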
Lasso Regression
Ridge regression shrinks all coefficients smoothly toward zero but never sets any exactly to zero.
In many applications, particularly when most features are irrelevant, we want the model to automatically
identify and discard uninformative variables. This motivates replacing the squared \(\ell_2\) penalty with an \(\ell_1\) penalty:
Lasso Regression:
The Least Absolute Shrinkage and Selection Operator (Lasso)
solves
\[
\widehat{w}_{\text{lasso}} = \arg \min_w \|y - Xw\|_2^2 + \lambda \|w\|_1,
\]
where \(\|w\|_1 = \sum_{j=1}^d |w_j|\) is the \(\ell_1\)-norm and
\(\lambda > 0\) controls the regularization strength.
The key property of Lasso is that it performs automatic feature selection.
The \(\ell_1\) penalty drives many coefficients \(w_j\) to exactly zero, effectively dropping those features
from the model. Geometrically, this occurs because the \(\ell_1\) ball has corners (unlike the smooth \(\ell_2\) ball),
so the level sets of the loss function are more likely to intersect the constraint region at a vertex where
one or more coordinates vanish.
Unlike ridge regression, the Lasso objective is not differentiable at \(w_j = 0\) due to the absolute value, so
there is no closed-form solution. In practice, Lasso is solved by coordinate descent or proximal gradient methods.
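A bare-bones coordinate descent solver illustrates the idea (a sketch, not a production implementation; the soft-threshold level \(\lambda/2\) follows from minimizing the objective above one coordinate at a time, and the data are simulated):

```python
import numpy as np

def soft_threshold(z, t):
    """S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for min_w ||y - Xw||_2^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(d):
            r_j = y - X @ w + X[:, j] * w[j]   # residual excluding feature j
            w[j] = soft_threshold(X[:, j] @ r_j, lam / 2) / col_sq[j]
    return w

rng = np.random.default_rng(6)
X = rng.normal(size=(80, 10))
w_true = np.zeros(10)
w_true[:3] = [3.0, -2.0, 1.5]                  # only 3 informative features
y = X @ w_true + 0.1 * rng.normal(size=80)

w_hat = lasso_cd(X, y, lam=20.0)
# The coefficients of the irrelevant features are driven exactly to zero
```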
After obtaining the Lasso solution, a common practice is to re-fit ordinary least-squares on the selected features.
Specifically, define the active set
\[
S = \{j : \widehat{w}_{\text{lasso}, j} \neq 0\},
\]
and then solve the unregularized problem restricted to these features:
\[
\tilde{w}_S = \arg \min_{w_S} \sum_{i=1}^n (y_i - x_{i,S}^\top w_S)^2,
\]
setting \(\tilde{w}_j = 0\) for \(j \notin S\). This two-stage procedure removes the shrinkage bias that Lasso
introduces on the selected coefficients.
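This two-stage refit is straightforward to sketch (the helper `relaxed_refit`, the hard-coded lasso output, and the simulated data are all hypothetical):

```python
import numpy as np

def relaxed_refit(X, y, w_lasso, tol=1e-8):
    """OLS re-fit on the active set S = {j : w_lasso_j != 0}; zeros elsewhere."""
    S = np.flatnonzero(np.abs(w_lasso) > tol)
    w = np.zeros_like(w_lasso)
    if S.size > 0:
        sol, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
        w[S] = sol
    return w

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 6))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=50)       # only feature 0 matters

w_lasso = np.array([1.6, 0.0, 0.0, 0.0, 0.0, 0.0])  # hypothetical lasso output
w_refit = relaxed_refit(X, y, w_lasso)
# The refit coefficient on feature 0 moves back toward 2.0, undoing the shrinkage
```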
It is worth noting that Lasso's feature selection is fundamentally different from dimensionality reduction via PCA.
Lasso selects a subset of the original variables, preserving interpretability but potentially missing signals that are spread across many
weakly predictive features. PCA, by contrast, constructs new variables (linear combinations of the originals) that capture directions of maximal
variance, improving numerical stability for correlated features at the cost of direct interpretability. In practice, one can combine both approaches
(reducing to a moderate number of principal components and then applying Lasso) when both compression and sparsity are needed.
Ridge and Lasso regularization illustrate a core theme in machine learning: controlling model complexity to achieve good generalization. In
Part 3: Intro to Classification, we move from predicting continuous values to
predicting discrete categories, introducing logistic regression and kernel methods that extend the linear framework into nonlinear decision boundaries.