Least-Squares Problems
Many practical problems require solving systems of equations \(A\mathbf{x} = \mathbf{b}\) that have no exact solution
because the system is overdetermined (more equations than unknowns). The least-squares approach finds
the vector \(\hat{\mathbf{x}}\) that makes \(A\mathbf{x}\) as close as possible to \(\mathbf{b}\), providing the best approximate
solution in the sense of minimizing the squared error.
Definition: Least-Squares Solution
If \(A \in \mathbb{R}^{m \times n}\) and \(\mathbf{b} \in \mathbb{R}^m\), a least-squares solution of
\(A\mathbf{x} = \mathbf{b}\) is an \(\hat{\mathbf{x}} \in \mathbb{R}^n\) such that
\[
\| \mathbf{b} - A\hat{\mathbf{x}} \| \leq \| \mathbf{b} - A\mathbf{x} \| \quad \forall \mathbf{x} \in \mathbb{R}^n.
\]
The norm \(\| \mathbf{b} - A\hat{\mathbf{x}} \|\) is called the least-squares error of the approximation.
Derivation:
By definition, \(\hat{\mathbf{x}}\) is a least-squares solution if and only if \(A\hat{\mathbf{x}}\) is a closest point in \(\operatorname{Col} A\) to \(\mathbf{b}\) (minimizing \(\|\mathbf{b} - A\mathbf{x}\|\) over \(\mathbf{x} \in \mathbb{R}^n\) is equivalent to minimizing over \(A\mathbf{x} \in \operatorname{Col} A\)). By the Best Approximation Theorem, the unique closest point is the orthogonal projection:
\[
A\hat{\mathbf{x}} = \operatorname{proj}_{\operatorname{Col} A} \mathbf{b} =: \hat{\mathbf{b}}.
\]
By the Orthogonal Decomposition Theorem, \(\mathbf{b} - \hat{\mathbf{b}} = \mathbf{b} - A\hat{\mathbf{x}}\) is orthogonal to \(\operatorname{Col} A\). Let \(\mathbf{a}_j\) denote the \(j\)-th column of \(A\). Orthogonality to every column means
\[
\mathbf{a}_j \cdot (\mathbf{b} - A\hat{\mathbf{x}}) = \mathbf{a}_j^\top(\mathbf{b} - A\hat{\mathbf{x}}) = 0 \qquad (j = 1, \ldots, n).
\]
Stacking these \(n\) scalar equations into a single vector equation gives
\[
A^\top(\mathbf{b} - A\hat{\mathbf{x}}) = \mathbf{0},
\]
which rearranges to the normal equations
\[
A^\top A \hat{\mathbf{x}} = A^\top \mathbf{b}.
\]
When \(A^\topA\) is invertible (which holds precisely when the columns of \(A\) are linearly independent; see the remark following the theorem), we can solve explicitly:
\[
\hat{\mathbf{x}} = (A^\topA)^{-1}A^\top\mathbf{b},
\qquad
\hat{\mathbf{b}} = A\hat{\mathbf{x}} = A(A^\topA)^{-1}A^\top\mathbf{b}.
\]
Theorem: Normal Equations
The set of least-squares solutions of \(A\mathbf{x} = \mathbf{b}\) consists of all vectors \(\hat{\mathbf{x}}\) that satisfy
the normal equations:
\[
A^\topA\mathbf{x} = A^\top\mathbf{b}.
\]
If the columns of \(A\) are linearly independent, then \(A^\topA\) is invertible and the unique
least-squares solution is:
\[
\hat{\mathbf{x}} = (A^\topA)^{-1}A^\top\mathbf{b}.
\]
Remark: \(A^\top A\) is invertible if and only if the columns of \(A\) are linearly independent.
(\(\Leftarrow\)) Suppose the columns of \(A\) are linearly independent, and \(A^\topA\mathbf{x} = \mathbf{0}\). Multiplying on the left by \(\mathbf{x}^\top\) gives
\[
\mathbf{x}^\top A^\top A \mathbf{x} = (A\mathbf{x})^\top (A\mathbf{x}) = \|A\mathbf{x}\|^2 = 0,
\]
so \(A\mathbf{x} = \mathbf{0}\). By linear independence of the columns, the only solution of \(A\mathbf{x} = \mathbf{0}\) is \(\mathbf{x} = \mathbf{0}\). Hence the null space \(\operatorname{Nul}(A^\topA) = \{\mathbf{0}\}\), so \(A^\topA\) is invertible by the Invertible Matrix Theorem.
(\(\Rightarrow\)) Conversely, suppose \(A^\topA\) is invertible. If \(A\mathbf{x} = \mathbf{0}\), then \(A^\topA\mathbf{x} = A^\top \mathbf{0} = \mathbf{0}\), which forces \(\mathbf{x} = \mathbf{0}\) by invertibility of \(A^\topA\). Hence \(\operatorname{Nul} A = \{\mathbf{0}\}\), i.e., the columns of \(A\) are linearly independent.
QR Factorization Method:
In practice, forming \(A^\topA\) amplifies numerical errors (see the Computational Insight below). The QR factorization gives a more stable route. Assuming the columns of \(A\) are linearly independent, write \(A = QR\) where \(Q\) has orthonormal columns (\(Q^\topQ = I\)) and \(R\) is upper triangular and invertible with positive diagonal.
Substituting into the normal equations \(A^\topA\hat{\mathbf{x}} = A^\top\mathbf{b}\),
\[
(QR)^\top(QR)\hat{\mathbf{x}} = (QR)^\top\mathbf{b}
\;\Longrightarrow\;
R^\top \underbrace{Q^\topQ}_{=\,I} R \hat{\mathbf{x}} = R^\top Q^\top \mathbf{b}
\;\Longrightarrow\;
R^\top R \hat{\mathbf{x}} = R^\top Q^\top \mathbf{b}.
\]
Since \(R\) is invertible, so is \(R^\top\); multiplying both sides by \((R^\top)^{-1}\) yields
\[
R \hat{\mathbf{x}} = Q^\top \mathbf{b},
\]
and hence
\[
\hat{\mathbf{x}} = R^{-1} Q^\top \mathbf{b}.
\]
Note:
In practice, one solves \(R\hat{\mathbf{x}} = Q^\top\mathbf{b}\) by back-substitution rather than computing \(R^{-1}\) explicitly,
which is much faster and more numerically stable.
Computational Insight:
The QR factorization method avoids computing \(A^\topA\), which squares the condition number
of \(A\). This makes QR-based least-squares significantly more numerically stable, especially
for ill-conditioned problems. The back-substitution step is also computationally efficient
since \(R\) is upper triangular.
Moore-Penrose Pseudo-inverse (\(A^\dagger\))
While the normal equations \(A^\topA\hat{\mathbf{x}} = A^\top\mathbf{b}\) provide a theoretical solution to least-squares
problems, they have serious practical and theoretical limitations. The Moore-Penrose pseudo-inverse
provides a unified framework that handles all cases—including singular, underdetermined, and
ill-conditioned systems with numerical stability and a unique solution guarantee.
- Theoretical Failure:
If the columns of \(A\) are linearly dependent (e.g., in an underdetermined system where \(n > m\), or
due to collinear features), \(A^\topA\) is singular and \((A^\topA)^{-1}\) does not exist.
- Practical Failure:
Even if \(A^\topA\) is invertible, it can be ill-conditioned. As defined in our treatment of the singular value decomposition,
the condition number \(\kappa(A)\) measures a matrix's numerical sensitivity.
The problem is that \(\kappa(A^\topA) = \kappa(A)^2\). This squaring of the condition number is catastrophic. A moderately
ill-conditioned problem (e.g., \(\kappa(A) = 10^5\)) becomes a hopelessly unstable one (\(\kappa(A^\topA) = 10^{10}\)).
This guarantees that solving for \(\hat{\mathbf{x}}\) by computing \((A^\topA)^{-1}\) will have a large numerical error.
Definition: Moore-Penrose Pseudo-inverse
For any matrix \(A \in \mathbb{R}^{m \times n}\), the Moore-Penrose pseudo-inverse
\(A^\dagger \in \mathbb{R}^{n \times m}\) is the unique matrix satisfying:
- \(A A^\dagger A = A\) (reconstruction property)
- \(A^\dagger A A^\dagger = A^\dagger\) (weak inverse property)
- \((A A^\dagger)^\top = A A^\dagger\) (\(AA^\dagger\) is the symmetric projection onto \(\operatorname{Col} A\))
- \((A^\dagger A)^\top = A^\dagger A\) (\(A^\dagger A\) is the symmetric projection onto \(\operatorname{Row} A\))
The reason why this unique matrix is so powerful is that the vector \(\hat{\mathbf{x}} = A^\dagger \mathbf{b}\) is always the
unique minimum-norm least-squares solution to \(A\mathbf{x} = \mathbf{b}\). This means \(\hat{\mathbf{x}}\) simultaneously
satisfies two conditions:
Theorem: Minimum-Norm Least-Squares Solution
For any \(A \in \mathbb{R}^{m \times n}\) and \(\mathbf{b} \in \mathbb{R}^m\), the vector
\(\hat{\mathbf{x}} = A^\dagger \mathbf{b}\) is the unique minimum-norm least-squares solution
to \(A\mathbf{x} = \mathbf{b}\), meaning:
- Least-Squares Property:
\(\hat{\mathbf{x}}\) minimizes \(\|A\mathbf{x} - \mathbf{b}\|^2\).
- Minimum-Norm Property:
Among all vectors that minimize \(\|A\mathbf{x} - \mathbf{b}\|^2\), \(\hat{\mathbf{x}}\) has the smallest norm
\(\|\hat{\mathbf{x}}\|\).
This property is particularly valuable in machine learning. For an underdetermined system (e.g., more features than data points),
the minimum-norm solution is the "simplest" one, which is a form of implicit regularization.
The most stable way to compute \(A^\dagger\) is via the Singular Value Decomposition (SVD), which
we will define in our treatment of symmetric matrices and the SVD.
Linear Regression
Least-squares theory has its most widespread application in statistical modeling through linear
regression. Given data points, we seek the best linear relationship between predictor variables
and an observed response. This is a direct application of the least-squares framework we've developed.
Definition: Linear Regression Model
Given observed data points \(\{(\mathbf{x}_i, y_i)\}_{i=1}^n\) where \(\mathbf{x}_i \in \mathbb{R}^d\) and
\(y_i \in \mathbb{R}\), the linear regression model is:
\[
\mathbf{Y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon}
\]
where:
- \(X \in \mathbb{R}^{n \times d}\) is the design matrix
- \(\boldsymbol{\beta} \in \mathbb{R}^d\) is the parameter (i.e., weight) vector
- \(\mathbf{Y} \in \mathbb{R}^n\) is the observation vector
- \(\boldsymbol{\epsilon} = \mathbf{Y} - X\boldsymbol{\beta}\) is the residual vector
The dimension \(d\) is the number of features, and \(n\) is the number of data points. The residual \(\boldsymbol{\epsilon}\) is the
"difference" between each observed value \(y_i\) and its corresponding predicted \(y\) value. The least-squares hyperplane
represents the set of predicted \(y\) values based on estimated parameters (i.e., weights) \(\hat{\boldsymbol{\beta}}\), and it must satisfy the normal
equations:
\[
X^\topX\hat{\boldsymbol{\beta}} = X^\top\mathbf{Y}.
\]
Note:
The linear model is linear in terms of parameters \(\boldsymbol{\beta}\), not \(X\). We can choose any non-linear transformation
for each \(\mathbf{x}_i\). For example, \(y = \beta_0 + \beta_1 x^2 + \beta_2 \sin(2\pi x)\) is still a linear model. Thus, \(\mathbf{Y}\) is modeled as a
linear combination of features (i.e., predictors) \(X\) with respect to the coefficients (i.e, weights) \(\boldsymbol{\beta}\).
There are lots of details we should cover in this topic. We leave it for
Section III - Probability & Statistics: Linear Regression.
Finally, \(X^\topX\) is a symmetric matrix. We will learn about this special matrix
in the next part.
Interactive Linear Regression Demo
Linear regression finds the best-fitting line or curve for a set of data points by minimizing the sum of squared differences between observed and predicted values. This interactive demo lets you explore how polynomial regression works with different datasets.
Try adding your own data points by clicking on the plot, or use the example datasets to see how different polynomial degrees fit various patterns. The demo visually demonstrates key concepts from least-squares theory and the normal equations.