Inner Product & Orthogonality in \(\mathbb{R}^n\)
Definition: Inner Product
The inner product of two vectors \(\mathbf{u}, \mathbf{v} \in \mathbb{R}^n \) is a scalar defined as:
\[
\mathbf{u} \cdot \mathbf{v} = \mathbf{u}^\top \mathbf{v} = u_1 v_1 + \cdots + u_n v_n.
\]
To extend the notion of "length" to higher dimensions, we define the norm of vectors:
Definition: Norm and Unit Vector
The norm (or length) of a vector \(\mathbf{v} \in \mathbb{R}^n\) is the
nonnegative scalar:
\[
\|\mathbf{v}\| = \sqrt{\mathbf{v} \cdot \mathbf{v}} = \sqrt{v_1^2 + \cdots + v_n^2}.
\]
This is also called the Euclidean norm or \(L_2\) norm.
A unit vector is a vector with norm 1. Any nonzero vector \(\mathbf{v}\) can be
normalized to produce a unit vector:
\[
\mathbf{u} = \frac{\mathbf{v}}{\|\mathbf{v}\|}.
\]
The unit vector maintains the "direction" of the original vector. This process is known as normalization, which
is widely used in applications like machine learning.
Now we can define the distance between two vectors, which measures
the difference or similarity between data points.
Definition: Euclidean Distance
The distance between two vectors \(\mathbf{u}, \mathbf{v} \in \mathbb{R}^n\) is defined as:
\[
\operatorname{dist}(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|.
\]
This is called the Euclidean distance.
To explore orthogonality geometrically, consider the squared distances:
\[\begin{align*}
\|\mathbf{u} - \mathbf{v}\|^2 &= (\mathbf{u}-\mathbf{v}) \cdot (\mathbf{u}-\mathbf{v}) \\\\
&= \mathbf{u} \cdot (\mathbf{u}-\mathbf{v}) + (-\mathbf{v}) \cdot (\mathbf{u} - \mathbf{v}) \\\\
&= \|\mathbf{u}\|^2 - 2 \mathbf{u} \cdot \mathbf{v} + \|\mathbf{v}\|^2
\end{align*}
\]
\[
\begin{align*}
\|\mathbf{u} - (-\mathbf{v})\|^2 &= (\mathbf{u}+\mathbf{v}) \cdot (\mathbf{u}+\mathbf{v}) \\\\
&= \mathbf{u} \cdot (\mathbf{u}+\mathbf{v}) + \mathbf{v} \cdot (\mathbf{u} + \mathbf{v}) \\\\
&= \|\mathbf{u}\|^2 + 2 \mathbf{u} \cdot \mathbf{v} + \|\mathbf{v}\|^2
\end{align*}
\]
(Note: The vectors \(\mathbf{u}-\mathbf{v}\) and \(\mathbf{u}+\mathbf{v}\) are perpendicular if and only if
\(\|\mathbf{u}-\mathbf{v}\| = \|\mathbf{u}+\mathbf{v}\|\), which occurs when \(\mathbf{u} \cdot \mathbf{v} = 0\).)
Thus, if \(\mathbf{u} \cdot \mathbf{v} = 0\), the terms resemble the Pythagorean theorem:
\[
\|\mathbf{u} + \mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2.
\]
From the above result, we use the inner product to describe orthogonality for any dimension.
Definition: Orthogonality
Two vectors \(\mathbf{u}, \mathbf{v} \in \mathbb{R}^n \) are orthogonal to each other if their inner product satisfies
\[
\mathbf{u} \cdot \mathbf{v} = 0.
\]
Definition: Orthogonal Complement
Let \(W \subseteq \mathbb{R}^n\). A vector \(\mathbf{v} \in \mathbb{R}^n\) is orthogonal to \(W\)
if it is orthogonal to every vector in \(W\). The set of all such vectors is called the orthogonal complement of \(W\),
denoted
\[
W^{\perp} = \{\mathbf{v} \in \mathbb{R}^n : \mathbf{v} \cdot \mathbf{w} = 0 \text{ for all } \mathbf{w} \in W\}.
\]
This orthogonality provides a useful relationship between the column space and null space of a matrix:
Theorem: Orthogonal Complements of Row and Column Spaces
Let \(A \in \mathbb{R}^{m \times n}\). Then,
\[
(\operatorname{Row} A)^{\perp} = \operatorname{Nul} A
\]
and
\[
(\operatorname{Col} A)^{\perp} = \operatorname{Nul} A^\top.
\]
Proof:
If \(\mathbf{x} \in\) \(\operatorname{Nul} A\),
then \(A\mathbf{x} = \mathbf{0}\), which means that \(\mathbf{x}\) is orthogonal to each row of \(A\)
because the inner product of each row of \(A\) and \(\mathbf{x}\) are zero.
Then, since the rows of \(A\) span the row space of \(A\),
\(\mathbf{x}\) is orthogonal to \(\operatorname{Row} A\). Also, if \(\mathbf{x}\) is orthogonal to \(\operatorname{Row} A\),
then \(\mathbf{x}\) is orthogonal to every row of \(A\)
and thus \(A\mathbf{x} = \mathbf{0}\), or \(\mathbf{x} \in \operatorname{Nul} A\).
Therefore, \((\operatorname{Row} A)^{\perp} = \operatorname{Nul} A\).
Since the first identity holds for any matrix, we apply it to
\(A^\top\): \((\operatorname{Row} A^\top)^{\perp} = \operatorname{Nul} A^\top\). The rows of \(A^\top\) are the columns of \(A\),
so \(\operatorname{Row} A^\top = \) \(\operatorname{Col} A\).
Therefore \((\operatorname{Col} A)^{\perp} = \operatorname{Nul} A^\top\).
The Orthogonal Set
Having established the concept of orthogonality between individual vectors, we now extend this to
sets of vectors. Orthogonal sets have special properties that simplify many linear algebra computations,
particularly when finding coordinates relative to a basis.
Definition: Orthogonal Set & Basis
A set of vectors \(\{\mathbf{v}_1, \mathbf{v}_2, \cdots, \mathbf{v}_p\}\) in \(\mathbb{R}^n\) is called an orthogonal set
if \(\mathbf{v}_i \cdot \mathbf{v}_j = 0\), \(i \neq j\). Additionally, an orthogonal basis for \(W \subseteq \mathbb{R}^n\) is
a basis for \(W\) that forms an orthogonal set.
Theorem: Coordinates in an Orthogonal Basis
Let \(\{\mathbf{v}_1, \mathbf{v}_2, \cdots, \mathbf{v}_p\}\) be an orthogonal basis for \(W \subseteq \mathbb{R}^n\). For each
\(\mathbf{y} \in W\), the weights in the linear combination \(\mathbf{y} = c_1 \mathbf{v}_1 + \cdots + c_p \mathbf{v}_p\) are given by
\[
c_j = \frac{\mathbf{y} \cdot \mathbf{v}_j}{\mathbf{v}_j \cdot \mathbf{v}_j} \qquad (j = 1, \cdots, p).
\]
Proof:
Since \(\mathbf{v}_1\) is orthogonal to all other vectors in the set,
\[
\begin{align*}
\mathbf{y} \cdot \mathbf{v}_1 &= (c_1 \mathbf{v}_1 + \cdots + c_p \mathbf{v}_p) \cdot \mathbf{v}_1 \\
&= c_1(\mathbf{v}_1 \cdot \mathbf{v}_1) + \cdots + c_p(\mathbf{v}_p \cdot \mathbf{v}_1) \\
&= c_1(\mathbf{v}_1 \cdot \mathbf{v}_1).
\end{align*}
\]
By definition of the basis, \(\mathbf{v}_1\) is nonzero, \(\mathbf{v}_1 \cdot \mathbf{v}_1\) must be nonzero, and
then we can solve this equation for \(c_1\). Similarly, we can compute \(\mathbf{y} \cdot \mathbf{v}_j\)
and solve for \(c_j \quad (j = 2, \cdots, p)\).
Definition: Orthonormal Set & Basis
A set \(\{\mathbf{u}_1, \cdots, \mathbf{u}_p\}\) is said to be an orthonormal set if it is an orthogonal
set of unit vectors. If \(W\) is the subspace spanned by such a set, then \(\{\mathbf{u}_1, \cdots, \mathbf{u}_p\}\) is
an orthonormal basis for \(W\).
Matrices with orthonormal columns are central to many practical applications due to their unique properties.
Theorem: Orthonormal Columns Characterization
A matrix \(U \in \mathbb{R}^{m \times n} \) has orthonormal columns if and only if \(U^\topU = I\).
Proof:
\[
\begin{align*}
U^\topU &= \begin{bmatrix} \mathbf{u}_1^\top \\ \mathbf{u}_2^\top \\ \vdots \\ \mathbf{u}_n^\top \end{bmatrix}
\begin{bmatrix} \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_n \end{bmatrix} \\\\
&= \begin{bmatrix} \mathbf{u}_1^\top \mathbf{u}_1 & \mathbf{u}_1^\top \mathbf{u}_2 & \cdots & \mathbf{u}_1^\top \mathbf{u}_n \\
\mathbf{u}_2^\top \mathbf{u}_1 & \mathbf{u}_2^\top \mathbf{u}_2 & \cdots & \mathbf{u}_2^\top \mathbf{u}_n \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{u}_n^\top \mathbf{u}_1 & \mathbf{u}_n^\top \mathbf{u}_2 & \cdots & \mathbf{u}_n^\top \mathbf{u}_n
\end{bmatrix}
\end{align*}
\]
Thus, the \((i,j)\)-entry of \(U^\topU\) is \(\mathbf{u}_i^\top \mathbf{u}_j\). This matrix equals \(I\)
if and only if each entry matches the corresponding entry of the identity, i.e.,
\[
\mathbf{u}_i^\top \mathbf{u}_j = \delta_{ij} := \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}
\qquad (i, j = 1, \ldots, n).
\]
The off-diagonal conditions \(\mathbf{u}_i^\top \mathbf{u}_j = 0\) for \(i \neq j\) say that distinct columns are orthogonal;
the diagonal conditions \(\mathbf{u}_i^\top \mathbf{u}_i = 1\) say that each column has unit length. Combined, \(U^\topU = I\)
if and only if the columns form an orthonormal set.
Theorem: Norm- and Inner-Product Preservation
Let \(U \in \mathbb{R}^{m \times n} \) with orthonormal columns, and let \(\mathbf{x}, \mathbf{y} \in \mathbb{R}^n\). Then
- \(\| U\mathbf{x} \| = \| \mathbf{x} \|\)
- \((U\mathbf{x}) \cdot (U\mathbf{y}) = \mathbf{x} \cdot \mathbf{y}\)
- \((U\mathbf{x}) \cdot (U\mathbf{y}) = 0\) if and only if \(\mathbf{x} \cdot \mathbf{y} = 0\)
Proof:
For Property 1,
\[
\begin{align*}
\| U\mathbf{x} \|^2 &= (U\mathbf{x})^\top(U\mathbf{x}) \\\\
&= \mathbf{x}^\top U^\top U \mathbf{x} \\\\
&= \mathbf{x}^\top I \mathbf{x} \\\\
&= \mathbf{x}^\top \mathbf{x}.
\end{align*}
\]
Therefore,
\[
\| U\mathbf{x} \| = \| \mathbf{x} \|.
\]
For Property 2,
\[
\begin{align*}
(U\mathbf{x}) \cdot (U\mathbf{y}) &= (U\mathbf{x})^\top(U\mathbf{y}) \\\\
&= \mathbf{x}^\top U^\top U \mathbf{y} \\\\
&= \mathbf{x}^\top I \mathbf{y} = \mathbf{x}^\top \mathbf{y} \\\\
&= \mathbf{x} \cdot \mathbf{y}.
\end{align*}
\]
Property 3 follows immediately from Property 2: since \((U\mathbf{x}) \cdot (U\mathbf{y}) = \mathbf{x} \cdot \mathbf{y}\),
the left side vanishes if and only if the right side does.
Property 1 can also be recovered from Property 2 by setting \(\mathbf{y} = \mathbf{x}\).
These results are particularly significant for square matrices.
Definition: Orthogonal Matrix
A square matrix \(U \in \mathbb{R}^{n \times n}\) is called an orthogonal matrix if
\[
U^{-1} = U^\top,
\]
or equivalently, \(U^\topU = UU^\top = I\).
By the Orthonormal Columns Characterization,
this matrix \(U\) has orthonormal columns (the name "orthogonal matrix" is
somewhat misleading since it refers to "orthonormal" columns). Notably, the rows of \(U\) also form an orthonormal
basis of \(\mathbb{R}^n\): since \(U\) is square, \(U^{-1} = U^\top\) gives \(UU^\top = I\), and applying the same characterization to \(U^\top\)
(whose columns are the rows of \(U\)) shows those rows are orthonormal.
Example: The 2D Rotation Matrix Part 1
\[
R_{\theta} = \begin{bmatrix} \cos\theta & - \sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
\]
Verify:
\[
\begin{align*}
R_{\theta}^\topR_{\theta} &= R_{\theta}^{-1}R_{\theta}\\\\
&= \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} \cos\theta & - \sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}\\\\
&= \begin{bmatrix} \cos^2 \theta + \sin^2 \theta & 0 \\ 0 & \sin^2 \theta + \cos^2 \theta \end{bmatrix}\\\\
&= I
\end{align*}
\]
Also, \[
\det R_{\theta} = \cos^2 \theta + \sin^2 \theta = 1.
\]
This property generalizes to any orthogonal matrix \(U\):
by the multiplicative
and transpose properties of determinants,
\[
\begin{align*}
1 = \det I &= \det ( U^\topU ) \\\\
&= \det U^\top \det U \\\\
&= \det U \det U \\\\
&= (\det U)^2
\end{align*}
\]
Thus, for any orthogonal matrix \(U\),
\[
\det U = \pm 1.
\]
Rotations vs. Reflections: The Two Worlds of \(\det U = \pm 1\)
The condition \(\det U = +1\) characterizes proper rotations —
transformations that can be reached by continuously rotating from the identity.
The rotation matrix \(R_\theta\) above satisfies this: as \(\theta\) varies from
\(0\) to any angle, \(R_\theta\) sweeps out a smooth path of orthogonal matrices
with determinant 1. In contrast, \(\det U = -1\) corresponds to
improper rotations (rotations composed with a reflection), such as
\(\begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}\). No continuous path connects the
identity to such a matrix without leaving the set of orthogonal matrices.
This split is not a coincidence — it reflects a fundamental topological fact: the set
of orthogonal matrices has two disconnected components. The component
with \(\det = +1\) forms the special orthogonal group
\(SO(n)\) — the group of rotations. The full set (both components together) is the
orthogonal group \(O(n)\).
Example: The 2D Rotation Matrix Part 2
Lastly, consider the eigenvalues of \(R_{\theta}\).
\[
\begin{align*}
\det (R_{\theta} - \lambda I ) &= (\cos\theta - \lambda)^2 + \sin^2 \theta \\\\
&= \lambda^2 -2 \lambda \cos\theta +1
\end{align*}
\]
Solve the characteristic equation:
\[
\lambda^2 -2 \lambda\cos\theta +1 = 0,
\]
We obtain:
\[
\begin{align*}
\lambda &=\frac{2\cos\theta \pm \sqrt{4\cos^2\theta - 4}}{2} \\\\
&= \frac{2\cos\theta \pm \sqrt{-4\sin^2\theta}}{2} \\\\
&= \cos\theta \pm i \sin\theta.
\end{align*}
\]
Then,
\[
|\lambda | = 1
\]
We can also see \(|\lambda| = 1\) directly for real eigenvalues.
Suppose \(\lambda \in \mathbb{R}\) and \(\mathbf{x} \in \mathbb{R}^n\) is a nonzero eigenvector
of an orthogonal \(U\). Then
\[
\begin{align*}
U\mathbf{x} = \lambda \mathbf{x}
&\Longrightarrow \| U\mathbf{x} \| = |\lambda| \| \mathbf{x} \| \\\\
&\Longrightarrow \| \mathbf{x} \| = |\lambda| \| \mathbf{x} \| \qquad \text{(by Property 1 above)}
\end{align*}
\]
Since \(\| \mathbf{x} \| \neq 0\), we conclude \(|\lambda| = 1\). (The rotation example on this page has complex eigenvalues \(\lambda = \cos\theta \pm i\sin\theta\), for which \(|\lambda|=1\) was verified directly. The same conclusion holds for any complex eigenvalue by the unitary extension stated in the insight box below.)
Why "U"? — From Orthogonal to Unitary
We have been using the letter \(U\) for orthogonal matrices throughout this page. This
is not arbitrary — it reflects a deeper structure. The norm-preserving property
\(\|U\mathbf{x}\| = \|\mathbf{x}\|\) holds not only for real vectors, but also for
complex vectors, provided we replace the transpose \(U^\top\) with the
conjugate transpose \(U^* = \overline{U}^\top\). A matrix satisfying
\(U^* U = I\) is called a unitary matrix, and the proof that it
preserves norms is identical to the Norm- and Inner-Product Preservation theorem above.
An orthogonal matrix is simply a unitary matrix that happens to have real entries —
when all entries are real, \(U^* = U^\top\) and the unitary condition \(U^* U = I\)
reduces to the orthogonal condition \(U^\top U = I\). The collection of all
\(n \times n\) unitary matrices forms the
unitary group
\(U(n)\), and the orthogonal matrices form the
orthogonal group
\(O(n) \subset U(n)\). These are examples of Lie groups — continuous
symmetry groups that play a central role in physics, robotics, and modern deep learning.
The Orthogonal Projection
Orthogonal matrices preserve "structure under linear transformations," but we often need to decompose
arbitrary vectors relative to a subspace. The orthogonal projection provides the closest approximation
to a vector by elements of a subspace, a fundamental operation in applications from least squares
regression to signal processing.
Given a nonzero vector \(\mathbf{u} \in \mathbb{R}^n\). We often need to decompose a vector \(\mathbf{y} \in \mathbb{R}^n\)
into two vectors:
\[
\mathbf{y} = \hat{\mathbf{y}} + \mathbf{z} \tag{2}
\]
where \(\hat{\mathbf{y}} = \alpha \mathbf{u}\) is a multiple of \(\mathbf{u}\) for some scalar \(\alpha\) and
the vector \(\mathbf{z}\) is orthogonal to \(\mathbf{u}\).
For any scalar \(\alpha\), let \(\mathbf{z} = \mathbf{y} - \alpha \mathbf{u}\) so that (2) holds. Then \(\mathbf{y} - \hat{\mathbf{y}}\) is
orthogonal to \(\mathbf{u}\) if and only if
\[
(\mathbf{y} - \alpha \mathbf{u}) \cdot \mathbf{u} = \mathbf{y} \cdot \mathbf{u} - \alpha (\mathbf{u} \cdot \mathbf{u}) = 0
\]
\[
\Longrightarrow \alpha = \frac{\mathbf{y} \cdot \mathbf{u}}{\mathbf{u} \cdot \mathbf{u}}
\]
Thus, the orthogonal projection of \(\mathbf{y}\) onto \(\mathbf{u}\) is
\[
\hat{\mathbf{y}} = \frac{\mathbf{y} \cdot \mathbf{u}}{\mathbf{u} \cdot \mathbf{u}} \mathbf{u}
\]
Theorem: Orthogonal Decomposition
Let \(W\) be a subspace of \(\mathbb{R}^n\) with a finite orthogonal basis \(\{\mathbf{u}_1, \cdots, \mathbf{u}_p\}\).
Then every \(\mathbf{y} \in \mathbb{R}^n\) can be written uniquely in the form
\[\mathbf{y} = \hat{\mathbf{y}} + \mathbf{z},\]
where \(\hat{\mathbf{y}} \in W\) and \(\mathbf{z} \in W^{\perp}\), and these components are given by
\[
\hat{\mathbf{y}} = \frac{\mathbf{y} \cdot \mathbf{u}_1}{\mathbf{u}_1 \cdot \mathbf{u}_1} \mathbf{u}_1 + \cdots + \frac{\mathbf{y} \cdot \mathbf{u}_p}{\mathbf{u}_p \cdot \mathbf{u}_p} \mathbf{u}_p \tag{3}
\]
and
\[
\mathbf{z} = \mathbf{y} - \hat{\mathbf{y}}.
\]
Note that (3) is called the orthogonal projection of \(\mathbf{y}\) onto \(W\) denoted by \(\operatorname{proj}_W \mathbf{y}\).
Proof:
Let \(\{\mathbf{u}_1, \cdots, \mathbf{u}_p\}\) is any orthogonal basis of \(W\) and
\(\hat{\mathbf{y}} = \frac{\mathbf{y} \cdot \mathbf{u}_1}{\mathbf{u}_1 \cdot \mathbf{u}_1} \mathbf{u}_1 + \cdots + \frac{\mathbf{y} \cdot \mathbf{u}_p}{\mathbf{u}_p \cdot \mathbf{u}_p} \mathbf{u}_p\).
Since \(\hat{\mathbf{y}}\) is a linear combination of \(\mathbf{u}_1, \cdots, \mathbf{u}_p\), \(\quad \hat{\mathbf{y}} \in W\).
Let \(\mathbf{z} = \mathbf{y} - \hat{\mathbf{y}}\). Because \(\mathbf{u}_i\) is orthogonal to each \(\mathbf{u}_j\) in the basis for \(W\) with \(i \neq j\),
\[
\begin{align*}
\mathbf{z} \cdot \mathbf{u}_i &= (\mathbf{y} - \hat{\mathbf{y}}) \cdot \mathbf{u}_i \\\\
&= \mathbf{y} \cdot \mathbf{u}_i - \left(\frac{\mathbf{y} \cdot \mathbf{u}_i}{\mathbf{u}_i \cdot \mathbf{u}_i}\right) \mathbf{u}_i \cdot \mathbf{u}_i - 0 - \cdots - 0 \\\\
&= \mathbf{y} \cdot \mathbf{u}_i - \mathbf{y} \cdot \mathbf{u}_i = 0
\end{align*}
\]
Hence \(\mathbf{z}\) is orthogonal to each \(\mathbf{u}_i\) in the basis for \(W\). Therefore, \(\mathbf{z}\) is orthogonal to every vector in \(W\).
In other words, \(\mathbf{z} \in W^{\perp}\).
Next, suppose \(\mathbf{y}\) can also be written as \(\mathbf{y} = \hat{\mathbf{y}}_1 + \mathbf{z}_1\) where \(\hat{\mathbf{y}}_1 \in W\) and \(\mathbf{z}_1 \in W^{\perp}\).
Then
\[
\hat{\mathbf{y}} + \mathbf{z} = \hat{\mathbf{y}}_1 + \mathbf{z}_1 \Longrightarrow \hat{\mathbf{y}} - \hat{\mathbf{y}}_1 = \mathbf{z}_1 - \mathbf{z}.
\]
The left side lies in \(W\) (since \(W\) is a subspace).
The right side lies in \(W^{\perp}\), because \(W^{\perp}\) is itself a subspace: for any \(\mathbf{a}, \mathbf{b} \in W^{\perp}\)
and any \(\mathbf{w} \in W\),
\[
(\mathbf{a} - \mathbf{b}) \cdot \mathbf{w} = \mathbf{a} \cdot \mathbf{w} - \mathbf{b} \cdot \mathbf{w} = 0 - 0 = 0,
\]
so \(\mathbf{a} - \mathbf{b} \in W^{\perp}\). Hence
\[
(\hat{\mathbf{y}} - \hat{\mathbf{y}}_1) \in W \cap W^{\perp}.
\]
Any vector \(\mathbf{w} \in W \cap W^{\perp}\) satisfies \(\mathbf{w} \cdot \mathbf{w} = 0\) (since \(\mathbf{w}\) is orthogonal to itself),
forcing \(\mathbf{w} = \mathbf{0}\).
Therefore
\[
\hat{\mathbf{y}} - \hat{\mathbf{y}}_1 = \mathbf{0},
\]
giving \(\hat{\mathbf{y}} = \hat{\mathbf{y}}_1\) and \(\mathbf{z} = \mathbf{z}_1\).
The orthogonal projection of \(y\) onto \(W\) is the best approximation to \(y\) by elements of W.
This is significant in practical applications.
Theorem: Best Approximation
Let \(W \subseteq \mathbb{R}^n\) and \(\mathbf{y} \in \mathbb{R}^n\). Also let \(\hat{\mathbf{y}}\) be
the orthogonal projection of \(\mathbf{y}\) onto \(W\). Then \(\hat{\mathbf{y}}\) is the closest point in \(W\) to \(\mathbf{y}\)
such that
\[
\| \mathbf{y} - \hat{\mathbf{y}} \| < \| \mathbf{y} - \mathbf{v} \| \tag{4}
\]
for all \(\mathbf{v} \in W\) distinct from \(\hat{\mathbf{y}}\).
Proof:
Choose \(\mathbf{v} \in W\) distinct from \(\hat{\mathbf{y}} \in W\). Then \((\hat{\mathbf{y}} - \mathbf{v}) \in W\).
By the Orthogonal Decomposition Theorem,
\((\mathbf{y} - \hat{\mathbf{y}})\) is orthogonal to \(W\); in particular,
\((\mathbf{y} - \hat{\mathbf{y}})\) is orthogonal to \((\hat{\mathbf{y}} - \mathbf{v}) \in W\).
Consider
\[
\mathbf{y} - \mathbf{v} = (\mathbf{y} - \hat{\mathbf{y}}) + (\hat{\mathbf{y}} - \mathbf{v}).
\]
By the Pythagorean Theorem,
\[
\|\mathbf{y} - \mathbf{v}\|^2 = \|\mathbf{y} - \hat{\mathbf{y}}\|^2 + \|\hat{\mathbf{y}} - \mathbf{v}\|^2.
\]
Since \(\mathbf{v} \neq \hat{\mathbf{y}}\), we have \(\hat{\mathbf{y}} - \mathbf{v} \neq \mathbf{0}\),
so \(\|\hat{\mathbf{y}} - \mathbf{v}\|^2 > 0\).
Therefore
\[
\|\mathbf{y} - \mathbf{v}\|^2 > \|\mathbf{y} - \hat{\mathbf{y}}\|^2,
\]
which gives
\[
\|\mathbf{y} - \hat{\mathbf{y}}\| < \|\mathbf{y} - \mathbf{v}\|.
\]
The formula (3) becomes much simpler when the basis for \(W\) is an orthonormal set as follows:
If \(\{\mathbf{u}_1, \cdots, \mathbf{u}_p\}\) is an orthonormal basis for \(W \subseteq \mathbb{R}^n\), then
\[
\operatorname{proj}_W \mathbf{y} = (\mathbf{y} \cdot \mathbf{u}_1) \mathbf{u}_1 + \cdots + (\mathbf{y} \cdot \mathbf{u}_p) \mathbf{u}_p
\]
If \(U = [\mathbf{u}_1, \cdots, \mathbf{u}_p]\), then
\[
\forall \mathbf{y} \in \mathbb{R}^n, \quad \operatorname{proj}_W \mathbf{y} = UU^\top \mathbf{y}
\]
because each weight can be written as \(\mathbf{u}_1^\top \mathbf{y}, \cdots, \mathbf{u}_p^\top \mathbf{y}\), which are the entries of \(U^\top \mathbf{y}\).
Besides having orthonormal columns, if \(U\) is an \(n \times n\) square matrix
(\(U\) is the orthogonal matrix),
\[
\forall \mathbf{y} \in \mathbb{R}^n, \quad \operatorname{proj}_W \mathbf{y} = UU^\top \mathbf{y} = I \mathbf{y} = \mathbf{y}.
\]
This implies the column space \(W\) is equal to \(\mathbb{R}^n\).
Gram-Schmidt Algorithm
While orthogonal and orthonormal bases simplify computations significantly, an arbitrary basis for a
subspace will typically not be orthogonal. The Gram-Schmidt Algorithm provides a
systematic method to transform any basis into an orthogonal (or orthonormal) basis while preserving the span.
Theorem: Gram-Schmidt Process
Given a basis \(\{\mathbf{x}_1, \cdots, \mathbf{x}_p\}\) for a nonzero subspace \(W\) of \(\mathbb{R}^n\),
the Gram-Schmidt algorithm constructs an orthogonal basis \(\{\mathbf{v}_1, \cdots, \mathbf{v}_p\}\)
for \(W\) as follows:
\[
\begin{align*}
\mathbf{v}_1 &= \mathbf{x}_1 \\\\
\mathbf{v}_2 &= \mathbf{x}_2 - \frac{\mathbf{x}_2 \cdot \mathbf{v}_1}{\mathbf{v}_1 \cdot \mathbf{v}_1} \mathbf{v}_1 \\\\
\mathbf{v}_3 &= \mathbf{x}_3 - \frac{\mathbf{x}_3 \cdot \mathbf{v}_1}{\mathbf{v}_1 \cdot \mathbf{v}_1} \mathbf{v}_1 - \frac{\mathbf{x}_3 \cdot \mathbf{v}_2}{\mathbf{v}_2 \cdot \mathbf{v}_2} \mathbf{v}_2 \\\\
&\vdots \\\\
\mathbf{v}_k &= \mathbf{x}_k - \sum_{j=1}^{k-1} \frac{\mathbf{x}_k \cdot \mathbf{v}_j}{\mathbf{v}_j \cdot \mathbf{v}_j} \mathbf{v}_j
\end{align*}
\]
An orthonormal basis is obtained by normalizing: \(\mathbf{u}_k = \frac{\mathbf{v}_k}{\|\mathbf{v}_k\|}\) for \(k = 1, \cdots, p\).
A proof by induction on \(k\) is straightforward: at each step, \(\mathbf{v}_k = \mathbf{x}_k - \operatorname{proj}_{W_{k-1}} \mathbf{x}_k\) where \(W_{k-1} = \operatorname{Span}\{\mathbf{v}_1, \ldots, \mathbf{v}_{k-1}\}\), so the Orthogonal Decomposition Theorem places \(\mathbf{v}_k\) in \(W_{k-1}^{\perp}\), which establishes orthogonality with every earlier \(\mathbf{v}_j\). Moreover \(\mathbf{v}_k \neq \mathbf{0}\): otherwise \(\mathbf{x}_k \in W_{k-1} = \operatorname{Span}\{\mathbf{x}_1, \ldots, \mathbf{x}_{k-1}\}\), contradicting the linear independence of \(\{\mathbf{x}_1, \ldots, \mathbf{x}_p\}\). Finally, the span is preserved: \(\mathbf{v}_k\) is \(\mathbf{x}_k\) plus a linear combination of the earlier \(\mathbf{v}_j\)'s (hence in \(\operatorname{Span}\{\mathbf{x}_1, \ldots, \mathbf{x}_k\}\)), and conversely \(\mathbf{x}_k = \mathbf{v}_k + \operatorname{proj}_{W_{k-1}} \mathbf{x}_k \in \operatorname{Span}\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}\), so \(\operatorname{Span}\{\mathbf{v}_1, \ldots, \mathbf{v}_k\} = \operatorname{Span}\{\mathbf{x}_1, \ldots, \mathbf{x}_k\}\). Normalization \(\mathbf{u}_k = \mathbf{v}_k / \|\mathbf{v}_k\|\) is well-defined because \(\mathbf{v}_k \neq \mathbf{0}\).
Example:
Consider the basis \(\{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3\}\) for a nonzero subspace \(W\) of \(\mathbb{R}^3\), where:
\(\mathbf{x}_1 = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad
\mathbf{x}_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \text{ and }
\mathbf{x}_3 = \begin{bmatrix} 0 \\ 0 \\ 1\end{bmatrix}\)
Let \(\mathbf{v}_1 = \mathbf{x}_1\) and \(W_1 = \operatorname{Span} \{\mathbf{v}_1\}\). Then
\[
\begin{align*}
\mathbf{v}_2 &= \mathbf{x}_2 - \operatorname{proj}_{W_1} \mathbf{x}_2 \\\\
&= \mathbf{x}_2 - \frac{\mathbf{x}_2 \cdot \mathbf{v}_1}{\mathbf{v}_1 \cdot \mathbf{v}_1} \mathbf{v}_1 \\\\
&= \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} - \frac{2}{14}\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \\\\
&= \begin{bmatrix} \frac{-1}{7} \\ \frac{5}{7} \\ \frac{-3}{7} \end{bmatrix}
\end{align*}
\]
Next, \(W_2 = \operatorname{Span} \{\mathbf{v}_1, \mathbf{v}_2\}\) and then
\[
\begin{align*}
\mathbf{v}_3 &= \mathbf{x}_3 - \operatorname{proj}_{W_2} \mathbf{x}_3 \\\\
&= \mathbf{x}_3 - \frac{\mathbf{x}_3 \cdot \mathbf{v}_1}{\mathbf{v}_1 \cdot \mathbf{v}_1} \mathbf{v}_1 - \frac{\mathbf{x}_3 \cdot \mathbf{v}_2}{\mathbf{v}_2 \cdot \mathbf{v}_2} \mathbf{v}_2 \\\\
&= \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} - \frac{3}{14}\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}
- \frac{\frac{-3}{7}}{\frac{5}{7}}\begin{bmatrix} \frac{-1}{7} \\ \frac{5}{7} \\ \frac{-3}{7} \end{bmatrix} \\\\
&= \begin{bmatrix} \frac{-3}{10} \\ 0 \\ \frac{1}{10} \end{bmatrix}
\end{align*}
\]
Thus, the orthogonal basis for \(W\) is \(W_3 = \operatorname{Span}\{\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3\}\).
Finally, we construct an orthonormal basis for \(W\) by normalizing each \(\mathbf{v}_k\):
\[
\{\mathbf{u}_1 = \begin{bmatrix} \frac{1}{\sqrt{14}} \\ \frac{2}{\sqrt{14}} \\ \frac{3}{\sqrt{14}} \end{bmatrix}, \,
\mathbf{u}_2 = \begin{bmatrix} \frac{-1}{\sqrt{35}} \\ \frac{5}{\sqrt{35}} \\ \frac{-3}{\sqrt{35}} \end{bmatrix}, \,
\mathbf{u}_3 = \begin{bmatrix} \frac{-3}{\sqrt{10}} \\ 0 \\ \frac{1}{\sqrt{10}} \end{bmatrix} \}.
\]
The Gram-Schmidt algorithm at the \(k\)th step can be written as:
\[
\mathbf{v}_k = \mathbf{x}_k - \sum_{j=1}^{k-1} (\mathbf{x}_k \cdot \mathbf{u}_j) \mathbf{u}_j ,\quad \mathbf{u}_k = \frac{\mathbf{v}_k}{\|\mathbf{v}_k\|}.
\]
The resulting vectors look different from the original ones. This is because the algorithm systematically redefines each
vector to be orthogonal to the previous ones while maintaining the span of the original vectors. Essentially,
it transforms the basis into an orthonormal one without changing the space that the vectors describe.
QR Factorization
The Gram-Schmidt process has a matrix interpretation: it factors any matrix with linearly independent
columns into the product of an orthogonal matrix \(Q\) and an upper triangular matrix \(R\). This
QR factorization is fundamental in numerical linear algebra, particularly for
approximating eigenvalues and solving least squares problems.
(Note: Finding exact roots of high-degree characteristic polynomials is often computationally infeasible,
making iterative methods based on QR factorization essential for eigenvalue computation.)
Theorem: QR Factorization
Let \(A\) be an \(m \times n\) matrix with linearly independent columns. Then \(A\) can be
factorized as
\[
A = QR,
\]
where \(Q\) is an \(m \times n\) matrix whose columns form an orthonormal basis for \(\operatorname{Col} A\),
and \(R\) is an \(n \times n\) upper triangular invertible matrix with positive diagonal entries.
Proof:
Let \(\mathbf{a}_1, \ldots, \mathbf{a}_n\) be the columns of \(A\).
Applying the Gram-Schmidt process to this basis of \(\operatorname{Col} A\),
followed by normalization, produces an orthonormal basis \(\mathbf{q}_1, \ldots, \mathbf{q}_n\) of \(\operatorname{Col} A\).
The construction ensures that for each \(k\),
\[
\operatorname{Span}\{\mathbf{q}_1, \ldots, \mathbf{q}_k\} = \operatorname{Span}\{\mathbf{a}_1, \ldots, \mathbf{a}_k\},
\]
so \(\mathbf{a}_k\) lies in this span and can be expanded in the orthonormal basis as
\[
\mathbf{a}_k = (\mathbf{q}_1^\top \mathbf{a}_k) \mathbf{q}_1 + \cdots + (\mathbf{q}_k^\top \mathbf{a}_k) \mathbf{q}_k.
\]
Setting \(r_{jk} = \mathbf{q}_j^\top \mathbf{a}_k\) for \(j \leq k\) and \(r_{jk} = 0\) for \(j > k\), and \(Q = [\mathbf{q}_1 \,\cdots\, \mathbf{q}_n]\),
this reads \(\mathbf{a}_k = Q \mathbf{r}_k\) where \(\mathbf{r}_k\) is the \(k\)-th column of the upper triangular matrix \(R = (r_{jk})\).
Collecting over \(k\) gives \(A = QR\).
For the diagonal entries: by the Gram-Schmidt construction,
\(\mathbf{q}_k = \mathbf{v}_k / \|\mathbf{v}_k\|\) with \(\mathbf{v}_k = \mathbf{a}_k - \operatorname{proj}_{\operatorname{Span}\{\mathbf{q}_1, \ldots, \mathbf{q}_{k-1}\}} \mathbf{a}_k\), so
\[
r_{kk} = \mathbf{q}_k^\top \mathbf{a}_k = \frac{1}{\|\mathbf{v}_k\|} \mathbf{v}_k^\top \mathbf{a}_k = \frac{\mathbf{v}_k^\top \mathbf{v}_k}{\|\mathbf{v}_k\|} = \|\mathbf{v}_k\| > 0
\]
(using \(\mathbf{v}_k^\top \mathbf{a}_k = \mathbf{v}_k^\top \mathbf{v}_k\), since \(\mathbf{v}_k\) is orthogonal to the projection subtracted from \(\mathbf{a}_k\)).
Thus \(R\) has positive diagonal, and in particular \(\det R = \prod_k r_{kk} \neq 0\), so \(R\) is invertible.
Example:
Consider again
\[
A = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & 0 & 1 \end{bmatrix}.
\]
By the GS algorithm, we have already obtained \(Q\) from the previous example:
\[
Q = \begin{bmatrix} \frac{1}{\sqrt{14}} & \frac{-1}{\sqrt{35}} & \frac{-3}{\sqrt{10}} \\
\frac{2}{\sqrt{14}} & \frac{5}{\sqrt{35}} & 0 \\
\frac{3}{\sqrt{14}} & \frac{-3}{\sqrt{35}} & \frac{1}{\sqrt{10}}
\end{bmatrix}.
\]
Since the columns of \(Q\) are orthonormal, \(Q^\topQ=I\). Then, \(Q^\topA = Q^\topQR = IR = R\).
\[
\begin{align*}
R = Q^\topA &= \begin{bmatrix}
\frac{1}{\sqrt{14}} & \frac{2}{\sqrt{14}} & \frac{3}{\sqrt{14}} \\
\frac{-1}{\sqrt{35}} & \frac{5}{\sqrt{35}} & \frac{-3}{\sqrt{35}} \\
\frac{-3}{\sqrt{10}} & \ 0 & \frac{1}{\sqrt{10}}
\end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & 0 & 1 \end{bmatrix} \\\\
&= \begin{bmatrix}
\frac{14}{\sqrt{14}} & \frac{2}{\sqrt{14}} & \frac{3}{\sqrt{14}} \\
0 & \frac{5}{\sqrt{35}} & \frac{-3}{\sqrt{35}} \\
0 & 0 & \frac{1}{\sqrt{10}}
\end{bmatrix}
\end{align*}
\]
Interactive Visualizer
This interactive demonstrates fundamental concepts like inner products, orthogonal decomposition, and orthogonalization
that are essential for understanding many applications in machine learning.
In machine learning, orthogonality plays a critical role in feature engineering, dimensionality reduction,
gradient-based optimization, and neural network initialization. Try changing the demo type in the dropdown menu
above to explore different aspects of orthogonality!