Least-Squares

  • Least-Squares Problems
  • Moore-Penrose Pseudo-inverse
  • Linear Regression
  • Interactive Linear Regression Demo

Least-Squares Problems

Many practical problems lead to systems of equations \(Ax = b\) that have no exact solution, typically because the system is overdetermined (more equations than unknowns) and inconsistent. The least-squares approach finds the vector \(\hat{x}\) that makes \(A\hat{x}\) as close as possible to \(b\), providing the best approximate solution in the sense of minimizing the squared error.

Definition: Least-Squares Solution

If \(A \in \mathbb{R}^{m \times n}\) and \(b \in \mathbb{R}^m\), a least-squares solution of \(Ax = b\) is an \(\hat{x} \in \mathbb{R}^n\) such that \[ \| b - A\hat{x} \| \leq \| b - Ax \| \quad \forall x \in \mathbb{R}^n. \] The norm \(\| b - A\hat{x} \|\) is called the least-squares error of the approximation.

Derivation:

Let \[ \hat{b} = \text{proj}_{\text{Col }A} \, b. \] Since \(\hat{b} \in \text{Col }A\), the equation \(Ax = \hat{b}\) is consistent, and \(\hat{b}\) is the closest point to \(b\) in \(\text{Col }A\). Hence there exists a least-squares solution \(\hat{x} \in \mathbb{R}^n\) such that \(A\hat{x} = \hat{b}\).

Since \(\hat{b}\) is the orthogonal projection of \(b\) onto \(\text{Col }A\), we can say that \(b - \hat{b} = b - A\hat{x}\) is orthogonal to \(\text{Col }A \). (Note: \(b = A\hat{x} + (b - A\hat{x})\).)

Since \(b - A\hat{x}\) is orthogonal to \(\text{Col }A\), it is orthogonal to every column \(a_j\) of \(A\): \[ \begin{align*} a_j \cdot (b-A\hat{x}) = 0 &\Longrightarrow a_j^T(b-A\hat{x}) = 0 \quad \text{for } j = 1, \dots, n \\\\ &\Longrightarrow A^T (b-A\hat{x}) = 0, \end{align*} \] and we obtain \[ A^TA\hat{x} = A^Tb. \]

Finally, when the columns of \(A\) are linearly independent (so that \(A^TA\) is invertible), we also obtain an explicit formula for the orthogonal projection of \(b\) onto \(\text{Col }A\): \[ \begin{align*} \hat{b} &= \text{proj}_{\text{Col }A} \, b \\\\ &= A\hat{x} = A(A^TA)^{-1}A^Tb. \end{align*} \]

Theorem 1: Normal Equations

The set of least-squares solutions of \(Ax = b\) consists of all vectors \(\hat{x}\) that satisfy the normal equations: \[ A^TAx = A^Tb. \] If the columns of \(A\) are linearly independent, then \(A^TA\) is invertible and the unique least-squares solution is: \[ \hat{x} = (A^TA)^{-1}A^Tb. \]
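
As a quick numerical check (a minimal sketch using NumPy; the matrix \(A\) and vector \(b\) below are arbitrary examples), the normal-equations solution agrees with NumPy's built-in least-squares routine, and the residual is orthogonal to the columns of \(A\):

```python
import numpy as np

# Arbitrary overdetermined example with linearly independent columns.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Solve the normal equations A^T A x = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# NumPy's least-squares solver should return the same vector.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x_hat, x_lstsq)            # both ~ [ 5., -3.]

# The residual b - A x_hat is orthogonal to Col A (A^T times it is ~ 0).
print(A.T @ (b - A @ x_hat))     # ~ [0., 0.]
```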

QR Factorization Method:

In practice, forming \(A^TA\) explicitly can introduce relatively large numerical errors in \(\hat{x}\); the QR factorization gives a more reliable result. Let \(A = QR\), where \(Q\) has orthonormal columns (\(Q^TQ = I\)) and \(R\) is upper triangular and invertible. Substituting into the normal-equations formula, \[ \begin{align*} \hat{x} &= (R^TQ^TQR)^{-1}(R^TQ^T)b \\\\ &= (R^TR)^{-1}(R^TQ^T)b. \tag{1} \end{align*} \] On the other hand, starting from \(A\hat{x} = \hat{b}\) and using \(Q^T\hat{b} = Q^Tb\) (the component of \(b\) orthogonal to \(\text{Col }A = \text{Col }Q\) is annihilated by \(Q^T\)), \[ \begin{align*} QR\hat{x} = \hat{b} &\Longrightarrow Q^TQR\hat{x} = Q^Tb \\\\ &\Longrightarrow R\hat{x} = Q^Tb \\\\ &\Longrightarrow \hat{x} = R^{-1}Q^Tb. \end{align*} \] Comparing with (1), \[ (R^TR)^{-1}R^T = R^{-1}. \]

Thus, the unique least-squares solution is \[ \hat{x} = R^{-1}Q^Tb. \] Note:
In practice, we solve \(R\hat{x} = Q^Tb\) by back-substitution rather than computing \(R^{-1}\) explicitly, which is much faster and more stable.

Computational Insight:

The QR factorization method avoids computing \(A^TA\), which squares the condition number of \(A\). This makes QR-based least-squares significantly more numerically stable, especially for ill-conditioned problems. The back-substitution step is also computationally efficient since \(R\) is upper triangular.
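
A minimal sketch of this route (assuming SciPy is available for the triangular solve; the example matrix is arbitrary):

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Reduced QR factorization: Q is m x n with orthonormal columns,
# R is n x n and upper triangular.
Q, R = np.linalg.qr(A)

# Solve R x = Q^T b by back-substitution instead of forming A^T A or R^{-1}.
x_hat = solve_triangular(R, Q.T @ b, lower=False)
print(x_hat)   # ~ [ 5., -3.], matching the normal-equations solution
```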

Moore-Penrose Pseudo-inverse (\(A^\dagger\))

While the normal equations \(A^TA\hat{x} = A^Tb\) provide a theoretical solution to least-squares problems, they have serious practical and theoretical limitations, described below. The Moore-Penrose pseudo-inverse provides a unified framework that handles all of these cases, including singular, underdetermined, and ill-conditioned systems, with numerical stability and a guaranteed unique solution.

  1. Theoretical Failure:
    If the columns of \(A\) are linearly dependent (e.g., in an underdetermined system where \(n > m\), or due to collinear features), \(A^TA\) is singular and \((A^TA)^{-1}\) does not exist.
  2. Practical Failure:
    Even if \(A^TA\) is invertible, it can be ill-conditioned. As defined in Part 9, the condition number \(\kappa(A)\) measures a matrix's numerical sensitivity. The problem is that \(\kappa(A^TA) = \kappa(A)^2\). This squaring of the condition number is catastrophic: a moderately ill-conditioned problem (e.g., \(\kappa(A) = 10^5\)) becomes a hopelessly unstable one (\(\kappa(A^TA) = 10^{10}\)), so solving for \(\hat{x}\) by explicitly computing \((A^TA)^{-1}\) is prone to large numerical errors.
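
A short numerical check of this squaring effect (a sketch; the matrix below is constructed arbitrarily to have a prescribed spread of singular values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build A = U S V^T with singular values spanning five orders of magnitude,
# so that kappa(A) is about 1e5.
U, _ = np.linalg.qr(rng.standard_normal((100, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
S = np.diag([1.0, 1e-2, 1e-4, 1e-5])
A = U @ S @ V.T

print(np.linalg.cond(A))        # ~ 1e5
print(np.linalg.cond(A.T @ A))  # ~ 1e10 = kappa(A)^2
```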

Definition: Moore-Penrose Pseudo-inverse

For any matrix \(A \in \mathbb{R}^{m \times n}\), the Moore-Penrose pseudo-inverse \(A^\dagger \in \mathbb{R}^{n \times m}\) is the unique matrix satisfying:

  1. \(A A^\dagger A = A\)    (reconstruction property)
  2. \(A^\dagger A A^\dagger = A^\dagger\)    (weak inverse property)
  3. \((A A^\dagger)^T = A A^\dagger\)    (\(AA^\dagger\) is the orthogonal projection onto \(\text{Col }A\))
  4. \((A^\dagger A)^T = A^\dagger A\)    (\(A^\dagger A\) is the orthogonal projection onto \(\text{Row }A\))
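
These four conditions are easy to verify numerically (a sketch using NumPy's np.linalg.pinv on an arbitrary rank-deficient matrix, for which \((A^TA)^{-1}\) does not exist):

```python
import numpy as np

# A 3 x 3 matrix of rank 2 (the third column is the sum of the first two),
# so A^T A is singular and the normal equations alone break down.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 9.0],
              [7.0, 8.0, 15.0]])
A_dag = np.linalg.pinv(A)

print(np.allclose(A @ A_dag @ A, A))          # 1. A A^+ A = A
print(np.allclose(A_dag @ A @ A_dag, A_dag))  # 2. A^+ A A^+ = A^+
print(np.allclose((A @ A_dag).T, A @ A_dag))  # 3. A A^+ is symmetric
print(np.allclose((A_dag @ A).T, A_dag @ A))  # 4. A^+ A is symmetric
```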

What makes this unique matrix so powerful is that the vector \(\hat{x} = A^\dagger b\) is always the unique minimum-norm least-squares solution to \(Ax=b\); that is, \(\hat{x}\) simultaneously satisfies the two conditions below:

Theorem 2: Minimum-Norm Least-Squares Solution

For any \(A \in \mathbb{R}^{m \times n}\) and \(b \in \mathbb{R}^m\), the vector \(\hat{x} = A^\dagger b\) is the unique minimum-norm least-squares solution to \(Ax = b\), meaning:

  • Least-Squares Property:
    \(\hat{x}\) minimizes \(\|Ax - b\|^2\).
  • Minimum-Norm Property:
    Among all vectors that minimize \(\|Ax - b\|^2\), \(\hat{x}\) has the smallest norm \(\|\hat{x}\|\).

This property is particularly valuable in machine learning. For an underdetermined system (e.g., more features than data points), the minimum-norm solution is the "simplest" one, which is a form of implicit regularization.
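
Here is a small illustration of both properties (a sketch; the underdetermined system below is randomly generated, so it has infinitely many exact solutions):

```python
import numpy as np

# Underdetermined: 2 equations, 4 unknowns, so Ax = b has infinitely many
# exact solutions, all with least-squares error 0.
rng = np.random.default_rng(1)
A = rng.standard_normal((2, 4))
b = rng.standard_normal(2)

x_min = np.linalg.pinv(A) @ b            # minimum-norm solution

# Adding any null-space vector of A gives another exact solution.
null_basis = np.linalg.svd(A)[2][2:].T   # right singular vectors for sigma = 0
x_other = x_min + null_basis @ np.array([1.0, -2.0])

print(np.allclose(A @ x_min, b), np.allclose(A @ x_other, b))  # True True
print(np.linalg.norm(x_min) < np.linalg.norm(x_other))         # True
```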

The most stable way to compute \(A^\dagger\) is via the Singular Value Decomposition (SVD), which we will define in Part 9.
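
As a preview of that construction (a sketch assuming the SVD facts from Part 9), \(A^\dagger\) is assembled by inverting only the nonzero singular values; the result matches NumPy's np.linalg.pinv:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 9.0],
              [7.0, 8.0, 15.0]])   # the same rank-2 matrix as above

U, s, Vt = np.linalg.svd(A)

# Invert singular values above a small tolerance; treat the rest as zero.
tol = max(A.shape) * np.finfo(float).eps * s.max()
s_inv = np.where(s > tol, 1.0 / s, 0.0)
A_dag = Vt.T @ np.diag(s_inv) @ U.T

print(np.allclose(A_dag, np.linalg.pinv(A)))   # True
```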

Linear Regression

Least-squares theory has its most widespread application in statistical modeling through linear regression. Given data points, we seek the best linear relationship between predictor variables and an observed response. This is a direct application of the least-squares framework we've developed.

Definition: Linear Regression Model

Given observed data points \(\{(x_i, y_i)\}_{i=1}^n\) where \(x_i \in \mathbb{R}^d\) and \(y_i \in \mathbb{R}\), the linear regression model is: \[ Y = X\beta + \epsilon \] where:

  • \(X \in \mathbb{R}^{n \times d}\) is the design matrix
  • \(\beta \in \mathbb{R}^d\) is the parameter (i.e., weight) vector
  • \(Y \in \mathbb{R}^n\) is the observation vector
  • \(\epsilon = Y - X\beta\) is the residual vector

The dimension \(d\) is the number of features, and \(n\) is the number of data points. The residual vector \(\epsilon\) collects the differences between each observed value \(y_i\) and the corresponding predicted value. The least-squares hyperplane is the set of predicted \(y\) values determined by the estimated parameters (i.e., weights) \(\hat{\beta}\), which must satisfy the normal equations: \[ X^TX\hat{\beta} = X^TY. \]

Note:
The linear model is linear in the parameters \(\beta\), not in \(X\). We can apply any non-linear transformation to each \(x_i\). For example, \(y = \beta_0 + \beta_1 x^2 + \beta_2 \sin(2\pi x)\) is still a linear model. Thus, \(Y\) is modeled as a linear combination of features (i.e., predictors) \(X\) with respect to the coefficients (i.e., weights) \(\beta\).
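
To make this concrete, here is a sketch fitting exactly that model by least squares (the data and coefficients are synthetic, chosen only for illustration); the design matrix has one column per transformed feature, so the problem remains linear in \(\beta\):

```python
import numpy as np

# Synthetic data from y = 1 + 2 x^2 - 0.5 sin(2 pi x) plus noise.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * x**2 - 0.5 * np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)

# Design matrix: constant term, x^2, and sin(2 pi x) as features.
X = np.column_stack([np.ones_like(x), x**2, np.sin(2 * np.pi * x)])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # ~ [ 1.0,  2.0, -0.5]
```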

There are many more details to cover on this topic; we leave them for Section III - Probability & Statistics: Linear Regression. Finally, note that \(X^TX\) is a symmetric matrix. We will learn about this special kind of matrix in the next part.

Interactive Linear Regression Demo

Linear regression finds the best-fitting line or curve for a set of data points by minimizing the sum of squared differences between observed and predicted values. This interactive demo lets you explore how polynomial regression works with different datasets.

Try adding your own data points by clicking on the plot, or use the example datasets to see how different polynomial degrees fit various patterns. The demo visually demonstrates key concepts from least-squares theory and the normal equations.
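
Outside the interactive demo, the same experiment can be sketched in a few lines (synthetic data; np.polyfit is NumPy's polynomial least-squares routine):

```python
import numpy as np

# Synthetic data with a cubic trend plus noise.
rng = np.random.default_rng(3)
x = np.linspace(-1.0, 1.0, 30)
y = x**3 - x + 0.1 * rng.standard_normal(30)

# Fit polynomials of increasing degree and compare the sum of squared errors.
for deg in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg)
    sse = np.sum((np.polyval(coeffs, x) - y) ** 2)
    print(deg, sse)   # higher degree never increases the training error
```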