Orthogonality

Inner Product & Orthogonality in \(\mathbb{R}^n\) The Orthogonal Set The Orthogonal Projection Gram-Schmidt Algorithm QR Factorization Interactive Visualizer

Inner Product & Orthogonality in \(\mathbb{R}^n\)

Definition: Inner Product

The inner product of two vectors \(\mathbf{u}, \mathbf{v} \in \mathbb{R}^n \) is a scalar defined as: \[ \mathbf{u} \cdot \mathbf{v} = \mathbf{u}^\top \mathbf{v} = u_1 v_1 + \cdots + u_n v_n. \]

To extend the notion of "length" to higher dimensions, we define the norm of vectors:

Definition: Norm and Unit Vector

The norm (or length) of a vector \(\mathbf{v} \in \mathbb{R}^n\) is the nonnegative scalar: \[ \|\mathbf{v}\| = \sqrt{\mathbf{v} \cdot \mathbf{v}} = \sqrt{v_1^2 + \cdots + v_n^2}. \] This is also called the Euclidean norm or \(L_2\) norm.

A unit vector is a vector with norm 1. Any nonzero vector \(\mathbf{v}\) can be normalized to produce a unit vector: \[ \mathbf{u} = \frac{\mathbf{v}}{\|\mathbf{v}\|}. \]

The unit vector maintains the "direction" of the original vector. This process is known as normalization, which is widely used in applications like machine learning.

Now we can define the distance between two vectors, which measures the difference or similarity between data points.

Definition: Euclidean Distance

The distance between two vectors \(\mathbf{u}, \mathbf{v} \in \mathbb{R}^n\) is defined as: \[ \operatorname{dist}(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|. \] This is called the Euclidean distance.

To explore orthogonality geometrically, consider the squared distances: \[\begin{align*} \|\mathbf{u} - \mathbf{v}\|^2 &= (\mathbf{u}-\mathbf{v}) \cdot (\mathbf{u}-\mathbf{v}) \\\\ &= \mathbf{u} \cdot (\mathbf{u}-\mathbf{v}) + (-\mathbf{v}) \cdot (\mathbf{u} - \mathbf{v}) \\\\ &= \|\mathbf{u}\|^2 - 2 \mathbf{u} \cdot \mathbf{v} + \|\mathbf{v}\|^2 \end{align*} \] \[ \begin{align*} \|\mathbf{u} - (-\mathbf{v})\|^2 &= (\mathbf{u}+\mathbf{v}) \cdot (\mathbf{u}+\mathbf{v}) \\\\ &= \mathbf{u} \cdot (\mathbf{u}+\mathbf{v}) + \mathbf{v} \cdot (\mathbf{u} + \mathbf{v}) \\\\ &= \|\mathbf{u}\|^2 + 2 \mathbf{u} \cdot \mathbf{v} + \|\mathbf{v}\|^2 \end{align*} \] (Note: The vectors \(\mathbf{u}-\mathbf{v}\) and \(\mathbf{u}+\mathbf{v}\) are perpendicular if and only if \(\|\mathbf{u}-\mathbf{v}\| = \|\mathbf{u}+\mathbf{v}\|\), which occurs when \(\mathbf{u} \cdot \mathbf{v} = 0\).) Thus, if \(\mathbf{u} \cdot \mathbf{v} = 0\), the terms resemble the Pythagorean theorem: \[ \|\mathbf{u} + \mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2. \]

From the above result, we use the inner product to describe orthogonality for any dimension.

Definition: Orthogonality

Two vectors \(\mathbf{u}, \mathbf{v} \in \mathbb{R}^n \) are orthogonal to each other if their inner product satisfies \[ \mathbf{u} \cdot \mathbf{v} = 0. \]

Definition: Orthogonal Complement

Let \(W \subseteq \mathbb{R}^n\). A vector \(\mathbf{v} \in \mathbb{R}^n\) is orthogonal to \(W\) if it is orthogonal to every vector in \(W\). The set of all such vectors is called the orthogonal complement of \(W\), denoted \[ W^{\perp} = \{\mathbf{v} \in \mathbb{R}^n : \mathbf{v} \cdot \mathbf{w} = 0 \text{ for all } \mathbf{w} \in W\}. \]

This orthogonality provides a useful relationship between the column space and null space of a matrix:

Theorem: Orthogonal Complements of Row and Column Spaces

Let \(A \in \mathbb{R}^{m \times n}\). Then, \[ (\operatorname{Row} A)^{\perp} = \operatorname{Nul} A \] and \[ (\operatorname{Col} A)^{\perp} = \operatorname{Nul} A^\top. \]

Proof:

If \(\mathbf{x} \in\) \(\operatorname{Nul} A\), then \(A\mathbf{x} = \mathbf{0}\), which means that \(\mathbf{x}\) is orthogonal to each row of \(A\) because the inner product of each row of \(A\) and \(\mathbf{x}\) are zero. Then, since the rows of \(A\) span the row space of \(A\), \(\mathbf{x}\) is orthogonal to \(\operatorname{Row} A\). Also, if \(\mathbf{x}\) is orthogonal to \(\operatorname{Row} A\), then \(\mathbf{x}\) is orthogonal to every row of \(A\) and thus \(A\mathbf{x} = \mathbf{0}\), or \(\mathbf{x} \in \operatorname{Nul} A\). Therefore, \((\operatorname{Row} A)^{\perp} = \operatorname{Nul} A\).

Since the first identity holds for any matrix, we apply it to \(A^\top\): \((\operatorname{Row} A^\top)^{\perp} = \operatorname{Nul} A^\top\). The rows of \(A^\top\) are the columns of \(A\), so \(\operatorname{Row} A^\top = \) \(\operatorname{Col} A\). Therefore \((\operatorname{Col} A)^{\perp} = \operatorname{Nul} A^\top\).

The Orthogonal Set

Having established the concept of orthogonality between individual vectors, we now extend this to sets of vectors. Orthogonal sets have special properties that simplify many linear algebra computations, particularly when finding coordinates relative to a basis.

Definition: Orthogonal Set & Basis

A set of vectors \(\{\mathbf{v}_1, \mathbf{v}_2, \cdots, \mathbf{v}_p\}\) in \(\mathbb{R}^n\) is called an orthogonal set if \(\mathbf{v}_i \cdot \mathbf{v}_j = 0\), \(i \neq j\). Additionally, an orthogonal basis for \(W \subseteq \mathbb{R}^n\) is a basis for \(W\) that forms an orthogonal set.

Theorem: Coordinates in an Orthogonal Basis

Let \(\{\mathbf{v}_1, \mathbf{v}_2, \cdots, \mathbf{v}_p\}\) be an orthogonal basis for \(W \subseteq \mathbb{R}^n\). For each \(\mathbf{y} \in W\), the weights in the linear combination \(\mathbf{y} = c_1 \mathbf{v}_1 + \cdots + c_p \mathbf{v}_p\) are given by \[ c_j = \frac{\mathbf{y} \cdot \mathbf{v}_j}{\mathbf{v}_j \cdot \mathbf{v}_j} \qquad (j = 1, \cdots, p). \]

Proof:

Since \(\mathbf{v}_1\) is orthogonal to all other vectors in the set, \[ \begin{align*} \mathbf{y} \cdot \mathbf{v}_1 &= (c_1 \mathbf{v}_1 + \cdots + c_p \mathbf{v}_p) \cdot \mathbf{v}_1 \\ &= c_1(\mathbf{v}_1 \cdot \mathbf{v}_1) + \cdots + c_p(\mathbf{v}_p \cdot \mathbf{v}_1) \\ &= c_1(\mathbf{v}_1 \cdot \mathbf{v}_1). \end{align*} \]

By definition of the basis, \(\mathbf{v}_1\) is nonzero, \(\mathbf{v}_1 \cdot \mathbf{v}_1\) must be nonzero, and then we can solve this equation for \(c_1\). Similarly, we can compute \(\mathbf{y} \cdot \mathbf{v}_j\) and solve for \(c_j \quad (j = 2, \cdots, p)\).

Definition: Orthonormal Set & Basis

A set \(\{\mathbf{u}_1, \cdots, \mathbf{u}_p\}\) is said to be an orthonormal set if it is an orthogonal set of unit vectors. If \(W\) is the subspace spanned by such a set, then \(\{\mathbf{u}_1, \cdots, \mathbf{u}_p\}\) is an orthonormal basis for \(W\).

Matrices with orthonormal columns are central to many practical applications due to their unique properties.

Theorem: Orthonormal Columns Characterization

A matrix \(U \in \mathbb{R}^{m \times n} \) has orthonormal columns if and only if \(U^\topU = I\).

Proof:

\[ \begin{align*} U^\topU &= \begin{bmatrix} \mathbf{u}_1^\top \\ \mathbf{u}_2^\top \\ \vdots \\ \mathbf{u}_n^\top \end{bmatrix} \begin{bmatrix} \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_n \end{bmatrix} \\\\ &= \begin{bmatrix} \mathbf{u}_1^\top \mathbf{u}_1 & \mathbf{u}_1^\top \mathbf{u}_2 & \cdots & \mathbf{u}_1^\top \mathbf{u}_n \\ \mathbf{u}_2^\top \mathbf{u}_1 & \mathbf{u}_2^\top \mathbf{u}_2 & \cdots & \mathbf{u}_2^\top \mathbf{u}_n \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{u}_n^\top \mathbf{u}_1 & \mathbf{u}_n^\top \mathbf{u}_2 & \cdots & \mathbf{u}_n^\top \mathbf{u}_n \end{bmatrix} \end{align*} \] Thus, the \((i,j)\)-entry of \(U^\topU\) is \(\mathbf{u}_i^\top \mathbf{u}_j\). This matrix equals \(I\) if and only if each entry matches the corresponding entry of the identity, i.e., \[ \mathbf{u}_i^\top \mathbf{u}_j = \delta_{ij} := \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases} \qquad (i, j = 1, \ldots, n). \] The off-diagonal conditions \(\mathbf{u}_i^\top \mathbf{u}_j = 0\) for \(i \neq j\) say that distinct columns are orthogonal; the diagonal conditions \(\mathbf{u}_i^\top \mathbf{u}_i = 1\) say that each column has unit length. Combined, \(U^\topU = I\) if and only if the columns form an orthonormal set.

Theorem: Norm- and Inner-Product Preservation

Let \(U \in \mathbb{R}^{m \times n} \) with orthonormal columns, and let \(\mathbf{x}, \mathbf{y} \in \mathbb{R}^n\). Then

  1. \(\| U\mathbf{x} \| = \| \mathbf{x} \|\)
  2. \((U\mathbf{x}) \cdot (U\mathbf{y}) = \mathbf{x} \cdot \mathbf{y}\)
  3. \((U\mathbf{x}) \cdot (U\mathbf{y}) = 0\) if and only if \(\mathbf{x} \cdot \mathbf{y} = 0\)
Proof:

For Property 1, \[ \begin{align*} \| U\mathbf{x} \|^2 &= (U\mathbf{x})^\top(U\mathbf{x}) \\\\ &= \mathbf{x}^\top U^\top U \mathbf{x} \\\\ &= \mathbf{x}^\top I \mathbf{x} \\\\ &= \mathbf{x}^\top \mathbf{x}. \end{align*} \] Therefore, \[ \| U\mathbf{x} \| = \| \mathbf{x} \|. \]

For Property 2, \[ \begin{align*} (U\mathbf{x}) \cdot (U\mathbf{y}) &= (U\mathbf{x})^\top(U\mathbf{y}) \\\\ &= \mathbf{x}^\top U^\top U \mathbf{y} \\\\ &= \mathbf{x}^\top I \mathbf{y} = \mathbf{x}^\top \mathbf{y} \\\\ &= \mathbf{x} \cdot \mathbf{y}. \end{align*} \]

Property 3 follows immediately from Property 2: since \((U\mathbf{x}) \cdot (U\mathbf{y}) = \mathbf{x} \cdot \mathbf{y}\), the left side vanishes if and only if the right side does. Property 1 can also be recovered from Property 2 by setting \(\mathbf{y} = \mathbf{x}\).

These results are particularly significant for square matrices.

Definition: Orthogonal Matrix

A square matrix \(U \in \mathbb{R}^{n \times n}\) is called an orthogonal matrix if \[ U^{-1} = U^\top, \] or equivalently, \(U^\topU = UU^\top = I\).

By the Orthonormal Columns Characterization, this matrix \(U\) has orthonormal columns (the name "orthogonal matrix" is somewhat misleading since it refers to "orthonormal" columns). Notably, the rows of \(U\) also form an orthonormal basis of \(\mathbb{R}^n\): since \(U\) is square, \(U^{-1} = U^\top\) gives \(UU^\top = I\), and applying the same characterization to \(U^\top\) (whose columns are the rows of \(U\)) shows those rows are orthonormal.

Example: The 2D Rotation Matrix Part 1

\[ R_{\theta} = \begin{bmatrix} \cos\theta & - \sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \] Verify: \[ \begin{align*} R_{\theta}^\topR_{\theta} &= R_{\theta}^{-1}R_{\theta}\\\\ &= \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} \cos\theta & - \sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}\\\\ &= \begin{bmatrix} \cos^2 \theta + \sin^2 \theta & 0 \\ 0 & \sin^2 \theta + \cos^2 \theta \end{bmatrix}\\\\ &= I \end{align*} \] Also, \[ \det R_{\theta} = \cos^2 \theta + \sin^2 \theta = 1. \]

This property generalizes to any orthogonal matrix \(U\): by the multiplicative and transpose properties of determinants, \[ \begin{align*} 1 = \det I &= \det ( U^\topU ) \\\\ &= \det U^\top \det U \\\\ &= \det U \det U \\\\ &= (\det U)^2 \end{align*} \]

Thus, for any orthogonal matrix \(U\), \[ \det U = \pm 1. \]

Rotations vs. Reflections: The Two Worlds of \(\det U = \pm 1\)

The condition \(\det U = +1\) characterizes proper rotations — transformations that can be reached by continuously rotating from the identity. The rotation matrix \(R_\theta\) above satisfies this: as \(\theta\) varies from \(0\) to any angle, \(R_\theta\) sweeps out a smooth path of orthogonal matrices with determinant 1. In contrast, \(\det U = -1\) corresponds to improper rotations (rotations composed with a reflection), such as \(\begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}\). No continuous path connects the identity to such a matrix without leaving the set of orthogonal matrices.

This split is not a coincidence — it reflects a fundamental topological fact: the set of orthogonal matrices has two disconnected components. The component with \(\det = +1\) forms the special orthogonal group \(SO(n)\) — the group of rotations. The full set (both components together) is the orthogonal group \(O(n)\).

Example: The 2D Rotation Matrix Part 2

Lastly, consider the eigenvalues of \(R_{\theta}\). \[ \begin{align*} \det (R_{\theta} - \lambda I ) &= (\cos\theta - \lambda)^2 + \sin^2 \theta \\\\ &= \lambda^2 -2 \lambda \cos\theta +1 \end{align*} \] Solve the characteristic equation: \[ \lambda^2 -2 \lambda\cos\theta +1 = 0, \] We obtain: \[ \begin{align*} \lambda &=\frac{2\cos\theta \pm \sqrt{4\cos^2\theta - 4}}{2} \\\\ &= \frac{2\cos\theta \pm \sqrt{-4\sin^2\theta}}{2} \\\\ &= \cos\theta \pm i \sin\theta. \end{align*} \] Then, \[ |\lambda | = 1 \]

We can also see \(|\lambda| = 1\) directly for real eigenvalues. Suppose \(\lambda \in \mathbb{R}\) and \(\mathbf{x} \in \mathbb{R}^n\) is a nonzero eigenvector of an orthogonal \(U\). Then \[ \begin{align*} U\mathbf{x} = \lambda \mathbf{x} &\Longrightarrow \| U\mathbf{x} \| = |\lambda| \| \mathbf{x} \| \\\\ &\Longrightarrow \| \mathbf{x} \| = |\lambda| \| \mathbf{x} \| \qquad \text{(by Property 1 above)} \end{align*} \] Since \(\| \mathbf{x} \| \neq 0\), we conclude \(|\lambda| = 1\). (The rotation example on this page has complex eigenvalues \(\lambda = \cos\theta \pm i\sin\theta\), for which \(|\lambda|=1\) was verified directly. The same conclusion holds for any complex eigenvalue by the unitary extension stated in the insight box below.)

Why "U"? — From Orthogonal to Unitary

We have been using the letter \(U\) for orthogonal matrices throughout this page. This is not arbitrary — it reflects a deeper structure. The norm-preserving property \(\|U\mathbf{x}\| = \|\mathbf{x}\|\) holds not only for real vectors, but also for complex vectors, provided we replace the transpose \(U^\top\) with the conjugate transpose \(U^* = \overline{U}^\top\). A matrix satisfying \(U^* U = I\) is called a unitary matrix, and the proof that it preserves norms is identical to the Norm- and Inner-Product Preservation theorem above.

An orthogonal matrix is simply a unitary matrix that happens to have real entries — when all entries are real, \(U^* = U^\top\) and the unitary condition \(U^* U = I\) reduces to the orthogonal condition \(U^\top U = I\). The collection of all \(n \times n\) unitary matrices forms the unitary group \(U(n)\), and the orthogonal matrices form the orthogonal group \(O(n) \subset U(n)\). These are examples of Lie groups — continuous symmetry groups that play a central role in physics, robotics, and modern deep learning.

The Orthogonal Projection

Orthogonal matrices preserve "structure under linear transformations," but we often need to decompose arbitrary vectors relative to a subspace. The orthogonal projection provides the closest approximation to a vector by elements of a subspace, a fundamental operation in applications from least squares regression to signal processing.

Given a nonzero vector \(\mathbf{u} \in \mathbb{R}^n\). We often need to decompose a vector \(\mathbf{y} \in \mathbb{R}^n\) into two vectors: \[ \mathbf{y} = \hat{\mathbf{y}} + \mathbf{z} \tag{2} \] where \(\hat{\mathbf{y}} = \alpha \mathbf{u}\) is a multiple of \(\mathbf{u}\) for some scalar \(\alpha\) and the vector \(\mathbf{z}\) is orthogonal to \(\mathbf{u}\).

For any scalar \(\alpha\), let \(\mathbf{z} = \mathbf{y} - \alpha \mathbf{u}\) so that (2) holds. Then \(\mathbf{y} - \hat{\mathbf{y}}\) is orthogonal to \(\mathbf{u}\) if and only if \[ (\mathbf{y} - \alpha \mathbf{u}) \cdot \mathbf{u} = \mathbf{y} \cdot \mathbf{u} - \alpha (\mathbf{u} \cdot \mathbf{u}) = 0 \] \[ \Longrightarrow \alpha = \frac{\mathbf{y} \cdot \mathbf{u}}{\mathbf{u} \cdot \mathbf{u}} \] Thus, the orthogonal projection of \(\mathbf{y}\) onto \(\mathbf{u}\) is \[ \hat{\mathbf{y}} = \frac{\mathbf{y} \cdot \mathbf{u}}{\mathbf{u} \cdot \mathbf{u}} \mathbf{u} \]

Theorem: Orthogonal Decomposition

Let \(W\) be a subspace of \(\mathbb{R}^n\) with a finite orthogonal basis \(\{\mathbf{u}_1, \cdots, \mathbf{u}_p\}\). Then every \(\mathbf{y} \in \mathbb{R}^n\) can be written uniquely in the form \[\mathbf{y} = \hat{\mathbf{y}} + \mathbf{z},\] where \(\hat{\mathbf{y}} \in W\) and \(\mathbf{z} \in W^{\perp}\), and these components are given by \[ \hat{\mathbf{y}} = \frac{\mathbf{y} \cdot \mathbf{u}_1}{\mathbf{u}_1 \cdot \mathbf{u}_1} \mathbf{u}_1 + \cdots + \frac{\mathbf{y} \cdot \mathbf{u}_p}{\mathbf{u}_p \cdot \mathbf{u}_p} \mathbf{u}_p \tag{3} \] and \[ \mathbf{z} = \mathbf{y} - \hat{\mathbf{y}}. \]

Note that (3) is called the orthogonal projection of \(\mathbf{y}\) onto \(W\) denoted by \(\operatorname{proj}_W \mathbf{y}\).

Proof:

Let \(\{\mathbf{u}_1, \cdots, \mathbf{u}_p\}\) is any orthogonal basis of \(W\) and \(\hat{\mathbf{y}} = \frac{\mathbf{y} \cdot \mathbf{u}_1}{\mathbf{u}_1 \cdot \mathbf{u}_1} \mathbf{u}_1 + \cdots + \frac{\mathbf{y} \cdot \mathbf{u}_p}{\mathbf{u}_p \cdot \mathbf{u}_p} \mathbf{u}_p\). Since \(\hat{\mathbf{y}}\) is a linear combination of \(\mathbf{u}_1, \cdots, \mathbf{u}_p\), \(\quad \hat{\mathbf{y}} \in W\).

Let \(\mathbf{z} = \mathbf{y} - \hat{\mathbf{y}}\). Because \(\mathbf{u}_i\) is orthogonal to each \(\mathbf{u}_j\) in the basis for \(W\) with \(i \neq j\), \[ \begin{align*} \mathbf{z} \cdot \mathbf{u}_i &= (\mathbf{y} - \hat{\mathbf{y}}) \cdot \mathbf{u}_i \\\\ &= \mathbf{y} \cdot \mathbf{u}_i - \left(\frac{\mathbf{y} \cdot \mathbf{u}_i}{\mathbf{u}_i \cdot \mathbf{u}_i}\right) \mathbf{u}_i \cdot \mathbf{u}_i - 0 - \cdots - 0 \\\\ &= \mathbf{y} \cdot \mathbf{u}_i - \mathbf{y} \cdot \mathbf{u}_i = 0 \end{align*} \] Hence \(\mathbf{z}\) is orthogonal to each \(\mathbf{u}_i\) in the basis for \(W\). Therefore, \(\mathbf{z}\) is orthogonal to every vector in \(W\). In other words, \(\mathbf{z} \in W^{\perp}\).

Next, suppose \(\mathbf{y}\) can also be written as \(\mathbf{y} = \hat{\mathbf{y}}_1 + \mathbf{z}_1\) where \(\hat{\mathbf{y}}_1 \in W\) and \(\mathbf{z}_1 \in W^{\perp}\). Then \[ \hat{\mathbf{y}} + \mathbf{z} = \hat{\mathbf{y}}_1 + \mathbf{z}_1 \Longrightarrow \hat{\mathbf{y}} - \hat{\mathbf{y}}_1 = \mathbf{z}_1 - \mathbf{z}. \] The left side lies in \(W\) (since \(W\) is a subspace). The right side lies in \(W^{\perp}\), because \(W^{\perp}\) is itself a subspace: for any \(\mathbf{a}, \mathbf{b} \in W^{\perp}\) and any \(\mathbf{w} \in W\), \[ (\mathbf{a} - \mathbf{b}) \cdot \mathbf{w} = \mathbf{a} \cdot \mathbf{w} - \mathbf{b} \cdot \mathbf{w} = 0 - 0 = 0, \] so \(\mathbf{a} - \mathbf{b} \in W^{\perp}\). Hence \[ (\hat{\mathbf{y}} - \hat{\mathbf{y}}_1) \in W \cap W^{\perp}. \] Any vector \(\mathbf{w} \in W \cap W^{\perp}\) satisfies \(\mathbf{w} \cdot \mathbf{w} = 0\) (since \(\mathbf{w}\) is orthogonal to itself), forcing \(\mathbf{w} = \mathbf{0}\). Therefore \[ \hat{\mathbf{y}} - \hat{\mathbf{y}}_1 = \mathbf{0}, \] giving \(\hat{\mathbf{y}} = \hat{\mathbf{y}}_1\) and \(\mathbf{z} = \mathbf{z}_1\).

The orthogonal projection of \(y\) onto \(W\) is the best approximation to \(y\) by elements of W. This is significant in practical applications.

Theorem: Best Approximation

Let \(W \subseteq \mathbb{R}^n\) and \(\mathbf{y} \in \mathbb{R}^n\). Also let \(\hat{\mathbf{y}}\) be the orthogonal projection of \(\mathbf{y}\) onto \(W\). Then \(\hat{\mathbf{y}}\) is the closest point in \(W\) to \(\mathbf{y}\) such that \[ \| \mathbf{y} - \hat{\mathbf{y}} \| < \| \mathbf{y} - \mathbf{v} \| \tag{4} \] for all \(\mathbf{v} \in W\) distinct from \(\hat{\mathbf{y}}\).

Proof:

Choose \(\mathbf{v} \in W\) distinct from \(\hat{\mathbf{y}} \in W\). Then \((\hat{\mathbf{y}} - \mathbf{v}) \in W\). By the Orthogonal Decomposition Theorem, \((\mathbf{y} - \hat{\mathbf{y}})\) is orthogonal to \(W\); in particular, \((\mathbf{y} - \hat{\mathbf{y}})\) is orthogonal to \((\hat{\mathbf{y}} - \mathbf{v}) \in W\). Consider \[ \mathbf{y} - \mathbf{v} = (\mathbf{y} - \hat{\mathbf{y}}) + (\hat{\mathbf{y}} - \mathbf{v}). \] By the Pythagorean Theorem, \[ \|\mathbf{y} - \mathbf{v}\|^2 = \|\mathbf{y} - \hat{\mathbf{y}}\|^2 + \|\hat{\mathbf{y}} - \mathbf{v}\|^2. \] Since \(\mathbf{v} \neq \hat{\mathbf{y}}\), we have \(\hat{\mathbf{y}} - \mathbf{v} \neq \mathbf{0}\), so \(\|\hat{\mathbf{y}} - \mathbf{v}\|^2 > 0\). Therefore \[ \|\mathbf{y} - \mathbf{v}\|^2 > \|\mathbf{y} - \hat{\mathbf{y}}\|^2, \] which gives \[ \|\mathbf{y} - \hat{\mathbf{y}}\| < \|\mathbf{y} - \mathbf{v}\|. \]

The formula (3) becomes much simpler when the basis for \(W\) is an orthonormal set as follows:

If \(\{\mathbf{u}_1, \cdots, \mathbf{u}_p\}\) is an orthonormal basis for \(W \subseteq \mathbb{R}^n\), then \[ \operatorname{proj}_W \mathbf{y} = (\mathbf{y} \cdot \mathbf{u}_1) \mathbf{u}_1 + \cdots + (\mathbf{y} \cdot \mathbf{u}_p) \mathbf{u}_p \] If \(U = [\mathbf{u}_1, \cdots, \mathbf{u}_p]\), then \[ \forall \mathbf{y} \in \mathbb{R}^n, \quad \operatorname{proj}_W \mathbf{y} = UU^\top \mathbf{y} \] because each weight can be written as \(\mathbf{u}_1^\top \mathbf{y}, \cdots, \mathbf{u}_p^\top \mathbf{y}\), which are the entries of \(U^\top \mathbf{y}\).

Besides having orthonormal columns, if \(U\) is an \(n \times n\) square matrix (\(U\) is the orthogonal matrix), \[ \forall \mathbf{y} \in \mathbb{R}^n, \quad \operatorname{proj}_W \mathbf{y} = UU^\top \mathbf{y} = I \mathbf{y} = \mathbf{y}. \] This implies the column space \(W\) is equal to \(\mathbb{R}^n\).

Gram-Schmidt Algorithm

While orthogonal and orthonormal bases simplify computations significantly, an arbitrary basis for a subspace will typically not be orthogonal. The Gram-Schmidt Algorithm provides a systematic method to transform any basis into an orthogonal (or orthonormal) basis while preserving the span.

Theorem: Gram-Schmidt Process

Given a basis \(\{\mathbf{x}_1, \cdots, \mathbf{x}_p\}\) for a nonzero subspace \(W\) of \(\mathbb{R}^n\), the Gram-Schmidt algorithm constructs an orthogonal basis \(\{\mathbf{v}_1, \cdots, \mathbf{v}_p\}\) for \(W\) as follows: \[ \begin{align*} \mathbf{v}_1 &= \mathbf{x}_1 \\\\ \mathbf{v}_2 &= \mathbf{x}_2 - \frac{\mathbf{x}_2 \cdot \mathbf{v}_1}{\mathbf{v}_1 \cdot \mathbf{v}_1} \mathbf{v}_1 \\\\ \mathbf{v}_3 &= \mathbf{x}_3 - \frac{\mathbf{x}_3 \cdot \mathbf{v}_1}{\mathbf{v}_1 \cdot \mathbf{v}_1} \mathbf{v}_1 - \frac{\mathbf{x}_3 \cdot \mathbf{v}_2}{\mathbf{v}_2 \cdot \mathbf{v}_2} \mathbf{v}_2 \\\\ &\vdots \\\\ \mathbf{v}_k &= \mathbf{x}_k - \sum_{j=1}^{k-1} \frac{\mathbf{x}_k \cdot \mathbf{v}_j}{\mathbf{v}_j \cdot \mathbf{v}_j} \mathbf{v}_j \end{align*} \] An orthonormal basis is obtained by normalizing: \(\mathbf{u}_k = \frac{\mathbf{v}_k}{\|\mathbf{v}_k\|}\) for \(k = 1, \cdots, p\).

A proof by induction on \(k\) is straightforward: at each step, \(\mathbf{v}_k = \mathbf{x}_k - \operatorname{proj}_{W_{k-1}} \mathbf{x}_k\) where \(W_{k-1} = \operatorname{Span}\{\mathbf{v}_1, \ldots, \mathbf{v}_{k-1}\}\), so the Orthogonal Decomposition Theorem places \(\mathbf{v}_k\) in \(W_{k-1}^{\perp}\), which establishes orthogonality with every earlier \(\mathbf{v}_j\). Moreover \(\mathbf{v}_k \neq \mathbf{0}\): otherwise \(\mathbf{x}_k \in W_{k-1} = \operatorname{Span}\{\mathbf{x}_1, \ldots, \mathbf{x}_{k-1}\}\), contradicting the linear independence of \(\{\mathbf{x}_1, \ldots, \mathbf{x}_p\}\). Finally, the span is preserved: \(\mathbf{v}_k\) is \(\mathbf{x}_k\) plus a linear combination of the earlier \(\mathbf{v}_j\)'s (hence in \(\operatorname{Span}\{\mathbf{x}_1, \ldots, \mathbf{x}_k\}\)), and conversely \(\mathbf{x}_k = \mathbf{v}_k + \operatorname{proj}_{W_{k-1}} \mathbf{x}_k \in \operatorname{Span}\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}\), so \(\operatorname{Span}\{\mathbf{v}_1, \ldots, \mathbf{v}_k\} = \operatorname{Span}\{\mathbf{x}_1, \ldots, \mathbf{x}_k\}\). Normalization \(\mathbf{u}_k = \mathbf{v}_k / \|\mathbf{v}_k\|\) is well-defined because \(\mathbf{v}_k \neq \mathbf{0}\).

Example:

Consider the basis \(\{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3\}\) for a nonzero subspace \(W\) of \(\mathbb{R}^3\), where: \(\mathbf{x}_1 = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad \mathbf{x}_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \text{ and } \mathbf{x}_3 = \begin{bmatrix} 0 \\ 0 \\ 1\end{bmatrix}\)

Let \(\mathbf{v}_1 = \mathbf{x}_1\) and \(W_1 = \operatorname{Span} \{\mathbf{v}_1\}\). Then \[ \begin{align*} \mathbf{v}_2 &= \mathbf{x}_2 - \operatorname{proj}_{W_1} \mathbf{x}_2 \\\\ &= \mathbf{x}_2 - \frac{\mathbf{x}_2 \cdot \mathbf{v}_1}{\mathbf{v}_1 \cdot \mathbf{v}_1} \mathbf{v}_1 \\\\ &= \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} - \frac{2}{14}\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \\\\ &= \begin{bmatrix} \frac{-1}{7} \\ \frac{5}{7} \\ \frac{-3}{7} \end{bmatrix} \end{align*} \]

Next, \(W_2 = \operatorname{Span} \{\mathbf{v}_1, \mathbf{v}_2\}\) and then \[ \begin{align*} \mathbf{v}_3 &= \mathbf{x}_3 - \operatorname{proj}_{W_2} \mathbf{x}_3 \\\\ &= \mathbf{x}_3 - \frac{\mathbf{x}_3 \cdot \mathbf{v}_1}{\mathbf{v}_1 \cdot \mathbf{v}_1} \mathbf{v}_1 - \frac{\mathbf{x}_3 \cdot \mathbf{v}_2}{\mathbf{v}_2 \cdot \mathbf{v}_2} \mathbf{v}_2 \\\\ &= \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} - \frac{3}{14}\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} - \frac{\frac{-3}{7}}{\frac{5}{7}}\begin{bmatrix} \frac{-1}{7} \\ \frac{5}{7} \\ \frac{-3}{7} \end{bmatrix} \\\\ &= \begin{bmatrix} \frac{-3}{10} \\ 0 \\ \frac{1}{10} \end{bmatrix} \end{align*} \] Thus, the orthogonal basis for \(W\) is \(W_3 = \operatorname{Span}\{\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3\}\).

Finally, we construct an orthonormal basis for \(W\) by normalizing each \(\mathbf{v}_k\): \[ \{\mathbf{u}_1 = \begin{bmatrix} \frac{1}{\sqrt{14}} \\ \frac{2}{\sqrt{14}} \\ \frac{3}{\sqrt{14}} \end{bmatrix}, \, \mathbf{u}_2 = \begin{bmatrix} \frac{-1}{\sqrt{35}} \\ \frac{5}{\sqrt{35}} \\ \frac{-3}{\sqrt{35}} \end{bmatrix}, \, \mathbf{u}_3 = \begin{bmatrix} \frac{-3}{\sqrt{10}} \\ 0 \\ \frac{1}{\sqrt{10}} \end{bmatrix} \}. \]

The Gram-Schmidt algorithm at the \(k\)th step can be written as: \[ \mathbf{v}_k = \mathbf{x}_k - \sum_{j=1}^{k-1} (\mathbf{x}_k \cdot \mathbf{u}_j) \mathbf{u}_j ,\quad \mathbf{u}_k = \frac{\mathbf{v}_k}{\|\mathbf{v}_k\|}. \]

The resulting vectors look different from the original ones. This is because the algorithm systematically redefines each vector to be orthogonal to the previous ones while maintaining the span of the original vectors. Essentially, it transforms the basis into an orthonormal one without changing the space that the vectors describe.

QR Factorization

The Gram-Schmidt process has a matrix interpretation: it factors any matrix with linearly independent columns into the product of an orthogonal matrix \(Q\) and an upper triangular matrix \(R\). This QR factorization is fundamental in numerical linear algebra, particularly for approximating eigenvalues and solving least squares problems.
(Note: Finding exact roots of high-degree characteristic polynomials is often computationally infeasible, making iterative methods based on QR factorization essential for eigenvalue computation.)

Theorem: QR Factorization

Let \(A\) be an \(m \times n\) matrix with linearly independent columns. Then \(A\) can be factorized as \[ A = QR, \] where \(Q\) is an \(m \times n\) matrix whose columns form an orthonormal basis for \(\operatorname{Col} A\), and \(R\) is an \(n \times n\) upper triangular invertible matrix with positive diagonal entries.

Proof:

Let \(\mathbf{a}_1, \ldots, \mathbf{a}_n\) be the columns of \(A\). Applying the Gram-Schmidt process to this basis of \(\operatorname{Col} A\), followed by normalization, produces an orthonormal basis \(\mathbf{q}_1, \ldots, \mathbf{q}_n\) of \(\operatorname{Col} A\). The construction ensures that for each \(k\), \[ \operatorname{Span}\{\mathbf{q}_1, \ldots, \mathbf{q}_k\} = \operatorname{Span}\{\mathbf{a}_1, \ldots, \mathbf{a}_k\}, \] so \(\mathbf{a}_k\) lies in this span and can be expanded in the orthonormal basis as \[ \mathbf{a}_k = (\mathbf{q}_1^\top \mathbf{a}_k) \mathbf{q}_1 + \cdots + (\mathbf{q}_k^\top \mathbf{a}_k) \mathbf{q}_k. \] Setting \(r_{jk} = \mathbf{q}_j^\top \mathbf{a}_k\) for \(j \leq k\) and \(r_{jk} = 0\) for \(j > k\), and \(Q = [\mathbf{q}_1 \,\cdots\, \mathbf{q}_n]\), this reads \(\mathbf{a}_k = Q \mathbf{r}_k\) where \(\mathbf{r}_k\) is the \(k\)-th column of the upper triangular matrix \(R = (r_{jk})\). Collecting over \(k\) gives \(A = QR\).

For the diagonal entries: by the Gram-Schmidt construction, \(\mathbf{q}_k = \mathbf{v}_k / \|\mathbf{v}_k\|\) with \(\mathbf{v}_k = \mathbf{a}_k - \operatorname{proj}_{\operatorname{Span}\{\mathbf{q}_1, \ldots, \mathbf{q}_{k-1}\}} \mathbf{a}_k\), so \[ r_{kk} = \mathbf{q}_k^\top \mathbf{a}_k = \frac{1}{\|\mathbf{v}_k\|} \mathbf{v}_k^\top \mathbf{a}_k = \frac{\mathbf{v}_k^\top \mathbf{v}_k}{\|\mathbf{v}_k\|} = \|\mathbf{v}_k\| > 0 \] (using \(\mathbf{v}_k^\top \mathbf{a}_k = \mathbf{v}_k^\top \mathbf{v}_k\), since \(\mathbf{v}_k\) is orthogonal to the projection subtracted from \(\mathbf{a}_k\)). Thus \(R\) has positive diagonal, and in particular \(\det R = \prod_k r_{kk} \neq 0\), so \(R\) is invertible.

Example:

Consider again \[ A = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & 0 & 1 \end{bmatrix}. \]

By the GS algorithm, we have already obtained \(Q\) from the previous example: \[ Q = \begin{bmatrix} \frac{1}{\sqrt{14}} & \frac{-1}{\sqrt{35}} & \frac{-3}{\sqrt{10}} \\ \frac{2}{\sqrt{14}} & \frac{5}{\sqrt{35}} & 0 \\ \frac{3}{\sqrt{14}} & \frac{-3}{\sqrt{35}} & \frac{1}{\sqrt{10}} \end{bmatrix}. \]

Since the columns of \(Q\) are orthonormal, \(Q^\topQ=I\). Then, \(Q^\topA = Q^\topQR = IR = R\). \[ \begin{align*} R = Q^\topA &= \begin{bmatrix} \frac{1}{\sqrt{14}} & \frac{2}{\sqrt{14}} & \frac{3}{\sqrt{14}} \\ \frac{-1}{\sqrt{35}} & \frac{5}{\sqrt{35}} & \frac{-3}{\sqrt{35}} \\ \frac{-3}{\sqrt{10}} & \ 0 & \frac{1}{\sqrt{10}} \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & 0 & 1 \end{bmatrix} \\\\ &= \begin{bmatrix} \frac{14}{\sqrt{14}} & \frac{2}{\sqrt{14}} & \frac{3}{\sqrt{14}} \\ 0 & \frac{5}{\sqrt{35}} & \frac{-3}{\sqrt{35}} \\ 0 & 0 & \frac{1}{\sqrt{10}} \end{bmatrix} \end{align*} \]

Interactive Visualizer

This interactive demonstrates fundamental concepts like inner products, orthogonal decomposition, and orthogonalization that are essential for understanding many applications in machine learning.

In machine learning, orthogonality plays a critical role in feature engineering, dimensionality reduction, gradient-based optimization, and neural network initialization. Try changing the demo type in the dropdown menu above to explore different aspects of orthogonality!