Linear Regression

Recap from Linear Algebra

With maximum likelihood estimation and hypothesis testing in hand, we are ready to apply these tools to the most fundamental predictive model in statistics. Linear regression sits at the intersection of linear algebra and probability: the algebraic formulation leads to least-squares, while the probabilistic formulation leads to MLE - and the two turn out to be equivalent under Gaussian noise.

Given observed data points \(\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{n}\) where \(\boldsymbol{x}_i \in \mathbb{R}^d\) and \(y_i \in \mathbb{R}\), we assume the linear model: \[ \boldsymbol{y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon} \tag{1} \] where \(X \in \mathbb{R}^{n \times d}\) is the design matrix, \(\boldsymbol{\beta} \in \mathbb{R}^d\) is the parameter (weight) vector, \(\boldsymbol{y} \in \mathbb{R}^n\) is the observation vector, and \(\boldsymbol{\epsilon} = \boldsymbol{y} - X\boldsymbol{\beta} \in \mathbb{R}^n\) is the error (noise) vector.

Here \(d\) is the number of features and \(n\) is the number of data points. The least-squares solution \(\hat{\boldsymbol{\beta}}\) must satisfy the normal equations: \[ X^\top X\hat{\boldsymbol{\beta}} = X^\top \boldsymbol{y}. \tag{2} \] These were derived geometrically in Least-Squares Problems via orthogonal projection of \(\boldsymbol{y}\) onto \(\text{Col}(X)\).
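As a concrete check, here is a minimal NumPy sketch with synthetic data (the values and variable names such as `X`, `y`, `beta_hat` are purely illustrative) that builds a design matrix, solves the normal equations (2), and compares the result with `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n = 50 points, d = 3 features (values are illustrative).
n, d = 50, 3
X = rng.normal(size=(n, d))                          # design matrix, one row per data point
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)    # y = X beta + noise

# Solve the normal equations X^T X beta = X^T y.
# np.linalg.solve is preferable to forming the explicit inverse of X^T X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least-squares problem (via the SVD).
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))             # expected: True
```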

An important point: the model is "linear" in the parameters \(\boldsymbol{\beta}\), not in the features \(\boldsymbol{x}\). We can apply any nonlinear transformation to the features - for example, in a scalar case, \(y = \beta_0 + \beta_1 x^2 + \beta_2 \sin(2\pi x)\) is still a linear model because \(y\) is a linear combination of the transformed features with respect to \(\boldsymbol{\beta}\). We now show that the same normal equations arise naturally from a probabilistic argument.
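To make the "linear in the parameters" point concrete, the sketch below (NumPy assumed, synthetic data) fits the scalar example above, \(y = \beta_0 + \beta_1 x^2 + \beta_2 \sin(2\pi x)\), by ordinary least squares on the transformed features:

```python
import numpy as np

rng = np.random.default_rng(1)

# Scalar input x, but nonlinear transformed features: [1, x^2, sin(2*pi*x)].
x = rng.uniform(0.0, 1.0, size=100)
beta_true = np.array([0.5, 2.0, -1.0])               # [beta_0, beta_1, beta_2]

# Design matrix built from the transformed features.
X = np.column_stack([np.ones_like(x), x**2, np.sin(2 * np.pi * x)])
y = X @ beta_true + rng.normal(scale=0.1, size=x.size)

# The model is still linear in beta, so ordinary least squares applies directly.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                                      # close to beta_true
```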

Linear Regression: A Probabilistic Perspective

We now reinterpret the linear model (1) probabilistically. Assume that each observation \(y_i\) is the realized value of a random variable \(Y_i\) that depends on the predictor \(\boldsymbol{x}_i\), corrupted by i.i.d. Gaussian noise: \[ Y_i = \boldsymbol{x}_i^\top \boldsymbol{\beta} + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2). \] Since \(\mathbb{E}[\epsilon_i] = 0\), the conditional mean of \(Y_i\) is \[ \mathbb{E}[Y_i \mid \boldsymbol{x}_i] = \mu_i = \boldsymbol{x}_i^\top \boldsymbol{\beta} \] where \(\boldsymbol{\beta} \in \mathbb{R}^d\) contains the unknown regression parameters. This means each response is an independent Gaussian random variable \(Y_i \sim \mathcal{N}(\boldsymbol{x}_i^\top \boldsymbol{\beta},\; \sigma^2)\).

The conditional p.d.f. of a single observation \(y_i\) is therefore \[ p(y_i \mid \boldsymbol{x}_i, \boldsymbol{\beta}, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left\{-\frac{1}{2\sigma^2}(y_i - \boldsymbol{x}_i^\top \boldsymbol{\beta})^2\right\} \] and, by independence, the likelihood function for \(\boldsymbol{\beta}\) (with \(\sigma^2\) fixed) is \[ \begin{align*} L(\boldsymbol{\beta}) &= \prod_{i=1}^n p(y_i \mid \boldsymbol{x}_i, \boldsymbol{\beta}, \sigma^2) \\\\ &= \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{\!n} \exp\!\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \boldsymbol{x}_i^\top \boldsymbol{\beta})^2\right\}. \end{align*} \]

The log-likelihood function is \[ \begin{align*} \ln L(\boldsymbol{\beta}) &= n \ln \left(\frac{1}{\sigma \sqrt{2\pi}} \right) - \frac{1}{2\sigma^2} \sum_{i = 1}^n (y_i - \boldsymbol{x}_i^\top \boldsymbol{\beta})^2 \\\\ &= -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \boldsymbol{x}_i^\top \boldsymbol{\beta})^2. \end{align*} \]
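As a sanity check on this closed form, the sketch below (NumPy and SciPy assumed; data and parameter values are made up) evaluates the log-likelihood expression above and compares it with the sum of per-observation Gaussian log-densities:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Illustrative data with a fixed, known noise level sigma.
n, d, sigma = 40, 2, 0.5
X = rng.normal(size=(n, d))
beta = np.array([1.0, -0.7])
y = X @ beta + rng.normal(scale=sigma, size=n)

def log_likelihood(beta, X, y, sigma):
    """Gaussian log-likelihood, using the closed form derived above."""
    resid = y - X @ beta
    return -0.5 * y.size * np.log(2 * np.pi * sigma**2) - resid @ resid / (2 * sigma**2)

# Cross-check: summing per-observation Normal log-densities gives the same value.
direct = stats.norm.logpdf(y, loc=X @ beta, scale=sigma).sum()
print(np.allclose(log_likelihood(beta, X, y, sigma), direct))   # expected: True
```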

We set the gradient of the log-likelihood with respect to \(\boldsymbol{\beta}\) equal to zero. By the chain rule, the derivative of \((y_i - \boldsymbol{x}_i^\top \boldsymbol{\beta})^2\) with respect to \(\boldsymbol{\beta}\) is \(-2(y_i - \boldsymbol{x}_i^\top \boldsymbol{\beta})\boldsymbol{x}_i\): \[ \begin{align*} \nabla_{\boldsymbol{\beta}} \ln L(\boldsymbol{\beta}) &= \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \boldsymbol{x}_i^\top \boldsymbol{\beta}) \boldsymbol{x}_i = \mathbf{0} \\\\ &\Longrightarrow \sum_{i=1}^n (y_i \boldsymbol{x}_i - (\boldsymbol{x}_i^\top \boldsymbol{\beta})\boldsymbol{x}_i) = \mathbf{0} \\\\ &\Longrightarrow \sum_{i=1}^n y_i \boldsymbol{x}_i = \sum_{i=1}^n (\boldsymbol{x}_i \boldsymbol{x}_i^\top) \boldsymbol{\beta} \\\\ &\Longrightarrow X^\top \boldsymbol{y} = X^\top X \boldsymbol{\beta} \end{align*} \]

This is exactly equivalent to the normal equations (2). Thus, the MLE solution for linear regression under Gaussian noise coincides with the least-squares solution: \[ \begin{align*} \hat{\boldsymbol{\beta}}_{\text{MLE}} &= \Big(\sum_{i=1}^n \boldsymbol{x}_i\boldsymbol{x}_i^\top \Big)^{-1}\Big(\sum_{i=1}^n \boldsymbol{x}_i y_i \Big) \\\\ &= (X^\top X)^{-1}X^\top \boldsymbol{y} \\\\ &= \hat{\boldsymbol{\beta}}_{\text{LS}}. \end{align*} \]
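The equivalence can also be checked numerically. The sketch below (NumPy/SciPy assumed, synthetic data) minimizes the negative log-likelihood in \(\boldsymbol{\beta}\) with a generic optimizer and compares the result with the closed-form least-squares solution:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Synthetic data (values are illustrative).
n, d, sigma = 60, 3, 0.4
X = rng.normal(size=(n, d))
beta_true = np.array([2.0, 0.0, -1.5])
y = X @ beta_true + rng.normal(scale=sigma, size=n)

# Closed-form least-squares / MLE solution from the normal equations.
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Numerical MLE: minimize the negative log-likelihood in beta (sigma held fixed).
def neg_log_likelihood(beta):
    resid = y - X @ beta
    return 0.5 * y.size * np.log(2 * np.pi * sigma**2) + resid @ resid / (2 * sigma**2)

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(d)).x

print(np.allclose(beta_ls, beta_mle, atol=1e-4))     # expected: True
```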

Verification via Direct Minimization

Consider the linear model (1) where \(X \in \mathbb{R}^{n \times d}\), \(\boldsymbol{y} \in \mathbb{R}^n\), and \(\boldsymbol{\beta} \in \mathbb{R}^d\).

To obtain the least-squares solution \(\hat{\boldsymbol{\beta}}_{\text{LS}}\), we minimize the least-squares error objective function \(f(\boldsymbol{\beta})\): \[ \begin{align*} \hat{\boldsymbol{\beta}}_{\text{LS}} &= \arg \min_{\boldsymbol{\beta}} \| \boldsymbol{y} - X \boldsymbol{\beta} \|_{2}^2 \\\\ &= \arg \min_{\boldsymbol{\beta}} (\boldsymbol{y} - X \boldsymbol{\beta})^\top (\boldsymbol{y} - X \boldsymbol{\beta}) \\\\ &= \arg \min_{\boldsymbol{\beta}} \left( \boldsymbol{y}^\top \boldsymbol{y} - \boldsymbol{y}^\top X \boldsymbol{\beta} - \boldsymbol{\beta}^\top X^\top \boldsymbol{y} + \boldsymbol{\beta}^\top X^\top X \boldsymbol{\beta} \right) \\\\ &= \arg \min_{\boldsymbol{\beta}} f(\boldsymbol{\beta}). \end{align*} \]

Taking the gradient of \(f(\boldsymbol{\beta})\) with respect to \(\boldsymbol{\beta}\) and setting it equal to zero: \[ \begin{align*} \nabla_{\boldsymbol{\beta}} f(\boldsymbol{\beta}) &= - X^\top \boldsymbol{y} - X^\top \boldsymbol{y} + 2X^\top X\boldsymbol{\beta} = \mathbf{0} \\\\ &\Longrightarrow -2 X^\top \boldsymbol{y} + 2X^\top X\boldsymbol{\beta} = \mathbf{0} \\\\ &\Longrightarrow X^\top X\boldsymbol{\beta} = X^\top \boldsymbol{y} \\\\ &\Longrightarrow \hat{\boldsymbol{\beta}}_{\text{LS}} = (X^\top X)^{-1}X^\top \boldsymbol{y}. \end{align*} \] Note that \(X^\top X\) is symmetric and positive semi-definite. It is invertible if and only if \(\text{rank}(X) = d\), i.e., the columns of \(X\) are linearly independent (which requires \(n \geq d\)).
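The rank condition can be illustrated numerically. In the sketch below (NumPy assumed, made-up data), adding a collinear column makes \(X^\top X\) singular, while `np.linalg.lstsq` still returns a minimum-norm solution via the SVD:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 30
x1 = rng.normal(size=n)
X_full  = np.column_stack([np.ones(n), x1])           # rank 2: columns independent
X_defic = np.column_stack([np.ones(n), x1, 2 * x1])   # rank 2 < d = 3: collinear columns

print(np.linalg.matrix_rank(X_full), np.linalg.matrix_rank(X_defic))   # 2 2

# X^T X is invertible only in the full-rank case.
print(np.linalg.cond(X_full.T @ X_full))     # moderate condition number
print(np.linalg.cond(X_defic.T @ X_defic))   # huge: singular up to floating point

# lstsq still returns a (minimum-norm) solution even when X^T X is singular.
y = X_defic @ np.array([1.0, 2.0, 0.0]) + rng.normal(scale=0.1, size=n)
beta_hat, _, rank, _ = np.linalg.lstsq(X_defic, y, rcond=None)
print(rank, beta_hat)
```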

Connections to Machine Learning

The equivalence \(\hat{\boldsymbol{\beta}}_{\text{MLE}} = \hat{\boldsymbol{\beta}}_{\text{LS}}\) reveals a fundamental principle: minimizing squared error is equivalent to maximum likelihood under Gaussian noise. This connection extends broadly. Adding an \(\ell^2\) penalty (ridge regression) corresponds to placing a Gaussian prior on \(\boldsymbol{\beta}\) and computing the MAP estimate, while an \(\ell^1\) penalty (LASSO) corresponds to a Laplace prior. The bias-variance tradeoff in regularized regression directly reflects the MSE decomposition from Part 9. When the Gaussian noise assumption fails, one may use generalized linear models or robust regression methods that replace the squared loss with alternatives derived from other likelihood models.
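As a rough illustration of the ridge connection, the sketch below (NumPy assumed, synthetic data; `lam` is an illustrative penalty value) computes the ridge estimate \((X^\top X + \lambda I)^{-1} X^\top \boldsymbol{y}\), which reduces to the least-squares/MLE solution as \(\lambda \to 0\):

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic data (values are illustrative).
n, d = 50, 4
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Ridge regression: minimize ||y - X beta||^2 + lam * ||beta||^2.
# This is the MAP estimate under a zero-mean isotropic Gaussian prior on beta.
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# lam -> 0 recovers the ordinary least-squares / MLE solution.
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ridge)
print(beta_ls)
```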

Interactive Statistical Regression Tool

This interactive tool allows you to explore linear regression from a statistical perspective. You can upload your own data or use the provided sample datasets to analyze regression models, check statistical significance, and perform diagnostics.