Orthogonality

Contents: Inner Product & Orthogonality in \(\mathbb{R}^n\) · The Orthogonal Set · The Orthogonal Projection · Gram-Schmidt Algorithm · QR Factorization · Interactive Visualizer

Inner Product & Orthogonality in \(\mathbb{R}^n\)

Definition: Inner Product

The inner product of two vectors \(u, v \in \mathbb{R}^n \) is a scalar defined as: \[ u \cdot v = u^Tv = u_1v_1 + \cdots + u_nv_n. \]

To extend the notion of "length" to higher dimensions, we define the norm of vectors:

Definition: Norm and Unit Vector

The norm (or length) of a vector \(v \in \mathbb{R}^n\) is the nonnegative scalar: \[ \|v\| = \sqrt{v \cdot v} = \sqrt{v_1^2 + \cdots + v_n^2}. \] This is also called the Euclidean norm or \(L_2\) norm.

A unit vector is a vector with norm 1. Any nonzero vector \(v\) can be normalized to produce a unit vector: \[ u = \frac{v}{\|v\|}. \]

The unit vector maintains the "direction" of the original vector. This process is known as normalization, which is widely used in applications like machine learning.
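A minimal numerical sketch of these definitions, assuming NumPy is available; the vector below is an arbitrary example:

```python
import numpy as np

# An arbitrary example vector (not from the text).
v = np.array([3.0, 4.0])

inner = v @ v                 # v . v = 3*3 + 4*4 = 25
norm = np.sqrt(inner)         # ||v|| = 5, same as np.linalg.norm(v)
u = v / norm                  # normalized (unit) vector in the direction of v

print(norm, np.linalg.norm(v))    # 5.0 5.0
print(u, np.linalg.norm(u))       # [0.6 0.8] 1.0
```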

Now we can define the distance between two vectors, which measures the difference or similarity between data points.

Definition: Euclidean Distance

The distance between two vectors \(u, v \in \mathbb{R}^n\) is defined as: \[ \text{dist}(u, v) = \|u - v\|. \] This is called the Euclidean distance.

To connect orthogonality with geometry, expand the squared distances: \[\begin{align*} \|u - v\|^2 &= (u-v) \cdot (u-v) \\\\ &= u \cdot (u-v) + (-v) \cdot (u - v) \\\\ &= \|u\|^2 - 2u \cdot v + \|v\|^2 \end{align*} \] \[ \begin{align*} \|u - (-v)\|^2 &= (u+v) \cdot (u+v) \\\\ &= u \cdot (u+v) + v \cdot (u + v) \\\\ &= \|u\|^2 + 2u \cdot v + \|v\|^2 \end{align*} \] (Note: \(u\) and \(v\) are perpendicular if and only if \(\|u-v\| = \|u+v\|\), which by the expansions above occurs exactly when \(u \cdot v = 0\).) Thus, when \(u \cdot v = 0\), the expansion reduces to the Pythagorean theorem: \[ \|u + v \|^2 = \|u\|^2 + \|v\|^2. \]
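A quick numerical check of this identity, assuming NumPy; the two orthogonal vectors are arbitrary examples:

```python
import numpy as np

# Two orthogonal vectors in R^3 (u . v = 0), chosen only for illustration.
u = np.array([1.0, 2.0, 0.0])
v = np.array([-2.0, 1.0, 5.0])

print(u @ v)                                   # 0.0, so u and v are orthogonal
lhs = np.linalg.norm(u + v) ** 2               # ||u + v||^2
rhs = np.linalg.norm(u) ** 2 + np.linalg.norm(v) ** 2
print(np.isclose(lhs, rhs))                    # True: Pythagorean theorem holds
# The two "diagonals" have equal length exactly because u . v = 0.
print(np.isclose(np.linalg.norm(u - v), np.linalg.norm(u + v)))  # True
```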

Motivated by this result, we use the inner product to define orthogonality in any dimension.

Definition: Orthogonality

Two vectors \(u, v \in \mathbb{R}^n \) are orthogonal to each other if their inner product satisfies \[ u \cdot v =0. \] If a vector \(v\) is orthogonal to every vector in \(W \subseteq \mathbb{R}^n\), then the vector \(v\) is said to be orthogonal to \(W\). Also, the set of all vectors that are orthogonal to \(W\) is called the orthogonal complement of \(W\), denoted by \(W^{\perp}\).

This notion of orthogonality yields a useful relationship among the row space, column space, and null spaces of a matrix:

Theorem 1:

Let \(A \in \mathbb{R}^{m \times n}\). Then, \[ (\text{Row } A)^{\perp} = \text{Nul } A \] and \[ (\text{Col } A)^{\perp} = \text{Nul } A^T. \]

Proof:

If \(x \in \text{Nul } A\), then \(Ax = 0\), which means that \(x\) is orthogonal to each row of \(A\), because the inner product of each row of \(A\) with \(x\) is zero. Since the rows of \(A\) span the row space of \(A\), \(x\) is orthogonal to \(\text{Row } A\). Conversely, if \(x\) is orthogonal to \(\text{Row } A\), then \(x\) is orthogonal to every row of \(A\), and thus \(Ax = 0\), i.e. \(x \in \text{Nul } A\). Therefore, \((\text{Row } A)^{\perp} = \text{Nul } A \).

Next, since the first equation holds for any matrix, we apply it to \(A^T\). Since \(\text{Row } A^T = \text{Col } A\), we conclude that \((\text{Col } A)^{\perp} = \text{Nul } A^T \).
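A small numerical check of Theorem 1, assuming NumPy and SciPy are available (`scipy.linalg.null_space` returns an orthonormal basis of the null space); the matrix `A` below is an arbitrary rank-deficient example:

```python
import numpy as np
from scipy.linalg import null_space

# An arbitrary rank-deficient matrix, chosen only for illustration.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0]])

N = null_space(A)            # columns form an orthonormal basis for Nul A

# Every row of A is orthogonal to every null-space basis vector,
# which is the containment (Row A)^perp ⊇ Nul A from the proof.
print(np.allclose(A @ N, 0))         # True

# Likewise, Nul A^T is orthogonal to the columns of A: (Col A)^perp = Nul A^T.
M = null_space(A.T)
print(np.allclose(A.T @ M, 0))       # True
```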

The Orthogonal Set

Having established the concept of orthogonality between individual vectors, we now extend this to sets of vectors. Orthogonal sets have special properties that simplify many linear algebra computations, particularly when finding coordinates relative to a basis.

Definition: Orthogonal Set & Basis

A set of vectors \(\{v_1, v_2, \cdots, v_p\}\) in \(\mathbb{R}^n\) is called an orthogonal set if \(v_i \cdot v_j = 0\) whenever \(i \neq j\). Additionally, an orthogonal basis for a subspace \(W \subseteq \mathbb{R}^n\) is a basis for \(W\) that is also an orthogonal set.

Theorem 2:

Let \(\{v_1, v_2, \cdots, v_p\}\) be an orthogonal basis for \(W \subseteq \mathbb{R}^n\). For each \(y \in W\), the weights in the linear combination \(y = c_1v_1 + \cdots + c_pv_p\) are given by \[ c_j = \frac{y \cdot v_j}{v_j \cdot v_j} \qquad (j = 1, \cdots, p). \]

Proof:

Since \(v_1\) is orthogonal to all other vectors in the set, \[ \begin{align*} y \cdot v_1 &= (c_1v_1 + \cdots + c_pv_p) \cdot v_1 \\\\ &= c_1(v_1 \cdot v_1) + \cdots + c_p(v_p \cdot v_1) \\\\ &= c_1(v_1 \cdot v_1). \end{align*} \]

Since a basis vector is nonzero, \(v_1 \cdot v_1 \neq 0\), so we can solve this equation for \(c_1\). Similarly, computing \(y \cdot v_j\) and solving gives \(c_j\) for \(j = 2, \cdots, p\).
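A short sketch of Theorem 2 in code, assuming NumPy; the orthogonal basis and the vector `y` are arbitrary examples:

```python
import numpy as np

# An orthogonal (not orthonormal) basis of R^2, chosen for illustration.
v1 = np.array([2.0, 1.0])
v2 = np.array([-1.0, 2.0])
y = np.array([3.0, 7.0])

# Weights from Theorem 2: c_j = (y . v_j) / (v_j . v_j)
c1 = (y @ v1) / (v1 @ v1)
c2 = (y @ v2) / (v2 @ v2)

print(np.allclose(c1 * v1 + c2 * v2, y))   # True: y = c1 v1 + c2 v2
```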

Definition: Orthonormal Set & Basis

A set \(\{u_1, \cdots, u_p\}\) is said to be an orthonormal set if it is an orthogonal set of unit vectors. If \(W\) is the subspace spanned by such a set, then \(\{u_1, \cdots, u_p\}\) is an orthonormal basis for \(W\).

Matrices with orthonormal columns are central to many practical applications due to their unique properties.

Theorem 3:

A matrix \(U \in \mathbb{R}^{m \times n} \) has orthonormal columns iff \(U^TU = I\).

Proof:

\[ \begin{align*} U^TU &= \begin{bmatrix} u_1^T \\ u_2^T \\ \vdots \\ u_n^T \end{bmatrix} \begin{bmatrix} u_1 & u_2 & \cdots & u_n \end{bmatrix} \\\\ &= \begin{bmatrix} u_1^Tu_1 & u_1^Tu_2 & \cdots & u_1^Tu_n \\ u_2^Tu_1 & u_2^Tu_2 & \cdots & u_2^Tu_n \\ \vdots & \vdots & \ddots & \vdots \\ u_n^Tu_1 & u_n^Tu_2 & \cdots & u_n^Tu_n \end{bmatrix} \end{align*} \] The columns of \(U\) are orthogonal iff \[ u_i^Tu_j = 0, \quad i \neq j \quad (i, j = 1, \cdots, n), \] and the columns of \(U\) have unit length iff \[ u_i^Tu_i = 1 \quad (i = 1, \cdots, n). \] These two conditions hold exactly when the matrix above equals \(I\); hence \(U^TU = I\) iff the columns of \(U\) are orthonormal.

Theorem 4:

Let \(U \in \mathbb{R}^{m \times n} \) with orthonormal columns, and let \(x, y \in \mathbb{R}^n\). Then

  1. \(\| Ux \| = \| x \| \)
  2. \((Ux) \cdot (Uy) = x \cdot y \)
  3. \((Ux) \cdot (Uy) = 0\) iff \(x \cdot y = 0\)
Proof:

For Property 1, \[ \begin{align*} \| Ux \|^2 &= (Ux)^T(Ux) \\\\ &= x^TU^TUx \\\\ &= x^TIx \\\\ &= x^Tx. \end{align*} \] Therefore, \[ \| Ux \| = \| x \|. \]

For Property 2, \[ \begin{align*} (Ux) \cdot (Uy) &= (Ux)^T(Uy) \\\\ &= x^TU^TUy \\\\ &= x^TIy = x^Ty \\\\ &= x \cdot y. \end{align*} \] Note: Properties 1 and 3 also follow directly from Property 2.
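A numerical illustration of Theorems 3 and 4, assuming NumPy; the matrix with orthonormal columns is built by QR-factoring a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a 5x3 matrix with orthonormal columns by QR-factoring a random matrix.
U, _ = np.linalg.qr(rng.standard_normal((5, 3)))

x = rng.standard_normal(3)
y = rng.standard_normal(3)

print(np.allclose(U.T @ U, np.eye(3)))                       # Theorem 3: U^T U = I
print(np.isclose(np.linalg.norm(U @ x), np.linalg.norm(x)))  # Property 1
print(np.isclose((U @ x) @ (U @ y), x @ y))                  # Property 2
```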

These results are particularly significant for square matrices.

Definition: Orthogonal Matrix

A square matrix \(U \in \mathbb{R}^{n \times n}\) is called an orthogonal matrix if \[ U^{-1} = U^T, \] or equivalently, \(U^TU = UU^T = I\).

By Theorem 3, this matrix \(U\) has orthonormal columns (the name "orthogonal matrix" is somewhat misleading since it refers to "orthonormal" columns). Notably, the rows of \(U\) form an orthonormal basis of \(\mathbb{R}^n\).

Example: The 2D Rotation Matrix Part 1

\[ R_{\theta} = \begin{bmatrix} \cos\theta & - \sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \] Verify: \[ \begin{align*} R_{\theta}^TR_{\theta} &= \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} \cos\theta & - \sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}\\\\ &= \begin{bmatrix} \cos^2 \theta + \sin^2 \theta & 0 \\ 0 & \sin^2 \theta + \cos^2 \theta \end{bmatrix}\\\\ &= I, \end{align*} \] so \(R_{\theta}^{-1} = R_{\theta}^T\) and \(R_{\theta}\) is an orthogonal matrix. Also, \[ \det R_{\theta} = \cos^2 \theta + \sin^2 \theta = 1. \]

This property generalizes to any orthogonal matrix \(U\): \[ \begin{align*} 1 = \det I &= \det ( U^TU ) \\\\ &= \det U^T \det U \\\\ &= \det U \det U \\\\ &= (\det U)^2 \end{align*} \] (Note: Since \(U\) is a square matrix, \(\det U^T = \det U\).)

Thus, for any orthogonal matrix \(U\), \[ \det U = \pm 1. \]

Example: The 2D Rotation Matrix Part 2

Lastly, consider the eigenvalues of \(R_{\theta}\). \[ \begin{align*} \det (R_{\theta} - \lambda I ) &= (\cos\theta - \lambda)^2 + \sin^2 \theta \\\\ &= \lambda^2 -2 \lambda \cos\theta +1 \end{align*} \] Solving the characteristic equation \[ \lambda^2 -2 \lambda\cos\theta +1 = 0, \] we obtain \[ \begin{align*} \lambda &=\frac{2\cos\theta \pm 2\sqrt{1 - \cos^2\theta}\, i}{2} \\\\ &= \cos\theta \pm i \sin\theta. \end{align*} \] Then, \[ |\lambda | = 1. \]

Let \(x\) be a nonzero eigenvector corresponding to an eigenvalue \(\lambda\) of the orthogonal matrix \(U\). Then \[ \begin{align*} Ux = \lambda x &\Longrightarrow \| Ux \| = |\lambda | \| x \| \\\\ &\Longrightarrow \| x \| = |\lambda | \| x \|. \end{align*} \] Since \(\| x \| \neq 0 \), we conclude that \(|\lambda | = 1\).

Note:
\(\| Ux \| = \| x \|\) also holds when \(x\) is a complex vector. For complex matrices, the orthogonal matrix generalizes to a unitary matrix, satisfying \(U^*U = I\) where \(U^*\) is the conjugate transpose of \(U\). The proof is the same as in Theorem 4, so an orthogonal matrix is simply a unitary matrix with real entries.
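A quick check of the rotation-matrix example, assuming NumPy and an arbitrary angle:

```python
import numpy as np

theta = 0.7                      # an arbitrary angle, for illustration
c, s = np.cos(theta), np.sin(theta)
R = np.array([[c, -s],
              [s,  c]])

print(np.allclose(R.T @ R, np.eye(2)))      # R is orthogonal: R^T R = I
print(np.isclose(np.linalg.det(R), 1.0))    # det R = 1

# Eigenvalues are cos(theta) +/- i sin(theta), so they lie on the unit circle.
eigvals = np.linalg.eigvals(R)
print(np.allclose(np.abs(eigvals), 1.0))    # |lambda| = 1
```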

The Orthogonal Projection

Orthogonal matrices preserve lengths and inner products under linear transformations, but we often need to decompose arbitrary vectors relative to a subspace. The orthogonal projection provides the closest approximation to a vector by elements of a subspace, a fundamental operation in applications from least squares regression to signal processing.

Given a nonzero vector \(u \in \mathbb{R}^n\), we often need to decompose a vector \(y\in \mathbb{R}^n\) into two vectors: \[ y= \hat{y} + z \tag{2} \] where \(\hat{y} = \alpha u\) is a multiple of \(u\) for some scalar \(\alpha\) and the vector \(z\) is orthogonal to \(u\).

For any scalar \(\alpha\), let \(z = y - \alpha u\) so that (2) holds. Then \(y - \hat{y}\) is orthogonal to \(u\) iff \[ (y - \alpha u) \cdot u = y \cdot u - \alpha (u \cdot u) = 0 \] \[ \Longrightarrow \alpha = \frac{y \cdot u}{u \cdot u} \] Thus, the orthogonal projection of \(y\) onto \(u\) is \[ \hat{y} = \frac{y \cdot u}{u \cdot u} u \]
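A minimal sketch of this projection formula, assuming NumPy; `u` and `y` are arbitrary example vectors:

```python
import numpy as np

# Orthogonal projection of y onto the line spanned by u.
u = np.array([3.0, 1.0])
y = np.array([1.0, 4.0])

y_hat = ((y @ u) / (u @ u)) * u      # projection of y onto u
z = y - y_hat                        # component of y orthogonal to u

print(y_hat)                         # [2.1 0.7]
print(np.isclose(z @ u, 0))          # True: z is orthogonal to u
print(np.allclose(y_hat + z, y))     # True: y = y_hat + z
```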

Theorem 5: Orthogonal Decomposition

Let \(W\) be a subspace of \(\mathbb{R}^n\). Then every \(y \in \mathbb{R}^n\) can be written uniquely in the form \[y= \hat{y} + z\] where \(\hat{y} \in W\) and \(z \in W^{\perp}\). If \(\{u_1, \cdots, u_p\}\) is any orthogonal basis of \(W\), then \[ \hat{y} = \frac{y \cdot u_1}{u_1 \cdot u_1} u_1 + \cdots + \frac{y \cdot u_p}{u_p \cdot u_p} u_p \tag{3} \] and \[ z = y - \hat{y}. \]

Note that (3) is called the orthogonal projection of \(y\) onto \(W\) denoted by \(\text{proj}_W y\).

Proof:

Let \(\{u_1, \cdots, u_p\}\) be any orthogonal basis of \(W\) and let \(\hat{y} = \frac{y \cdot u_1}{u_1 \cdot u_1} u_1 + \cdots + \frac{y \cdot u_p}{u_p \cdot u_p} u_p\). Since \(\hat{y}\) is a linear combination of \(u_1, \cdots, u_p\), we have \(\hat{y} \in W\).

Let \(z = y - \hat{y}\). Because \(u_i\) is orthogonal to each \(u_j\) in the basis for \(W\) with \(i \neq j\), \[ \begin{align*} z \cdot u_i &= (y -\hat{y}) \cdot u_i \\\\ &= y \cdot u_i - (\frac{y \cdot u_i}{u_i \cdot u_i})u_i \cdot u_i - 0 - \cdots - 0 \\\\ &= y \cdot u_i - y \cdot u_i = 0 \end{align*} \] Hence \(z\) is orthogonal to each \(u_i\) in the basis for \(W\). Therefore, \(z\) is orthogonal to every vector in \(W\). In other words, \(z \in W^{\perp}\).

Next, suppose \(y\) can also be written as \(y = \hat{y}_1 + z_1\) where \(\hat{y}_1 \in W \text{ and } z_1 \in W^{\perp}\). Then \[ \hat{y} +z = \hat{y}_1 + z_1 \Longrightarrow \hat{y} - \hat{y}_1 = z_1 - z. \] This equation implies that the vector \(\hat{y} - \hat{y}_1\) lies in both \(W\) and \(W^{\perp}\). Since the only vector in both \(W\) and \(W^{\perp}\) is the zero vector, \(\hat{y} - \hat{y}_1 = 0\). Therefore, \(\hat{y} = \hat{y}_1\) and \(z = z_1\).

The orthogonal projection of \(y\) onto \(W\) is the best approximation to \(y\) by elements of \(W\), a fact that matters in many practical applications.

Theorem 6: Best Approximation

Let \(W \subseteq \mathbb{R}^n\) and \(y \in \mathbb{R}^n\). Also let \(\hat{y}\) be the orthogonal projection of \(y\) onto \(W\). Then \(\hat{y}\) is the closest point in \(W\) to \(y\) such that \[ \| y - \hat{y} \| < \| y - v \| \tag{4} \] for all \(v \in W\) distinct from \(\hat{y}\).

Proof:

Choose \(v \in W\) distinct from \(\hat{y} \in W\). Then \(\hat{y} - v \in W\). By the Orthogonal Decomposition Theorem, \(y - \hat{y}\) is orthogonal to \(W\); in particular, \(y - \hat{y}\) is orthogonal to \(\hat{y} - v \in W\). Consider \[ y - v = (y - \hat{y}) + (\hat{y} -v). \] By the Pythagorean Theorem, \[ \|y - v\|^2 = \|y - \hat{y}\|^2 + \|\hat{y} -v \|^2. \] Since \(v \neq \hat{y}\), we have \(\|\hat{y} - v\|^2 > 0\), so \(\| y - \hat{y} \| < \| y - v \|\).
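A small numerical illustration of the Best Approximation Theorem, assuming NumPy; here \(W\) is the xy-plane in \(\mathbb{R}^3\) and the competing points `v` are sampled at random:

```python
import numpy as np

# W = span of two orthonormal vectors in R^3 (the xy-plane), for illustration.
u1 = np.array([1.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 0.0])
y = np.array([2.0, 3.0, 4.0])

y_hat = (y @ u1) * u1 + (y @ u2) * u2      # proj_W y = [2, 3, 0]

# Any other point of W is at least as far from y as y_hat is.
rng = np.random.default_rng(1)
for _ in range(5):
    v = rng.standard_normal() * u1 + rng.standard_normal() * u2  # random v in W
    assert np.linalg.norm(y - y_hat) <= np.linalg.norm(y - v)
print("proj_W y is the closest point among the sampled v")
```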

The formula (3) becomes much simpler when the basis for \(W\) is an orthonormal set as follows:

If \(\{u_1, \cdots, u_p\}\) is an orthonormal basis for \(W \subseteq \mathbb{R}^n\), then \[ \text{proj}_W y = (y \cdot u_1)u_1 + \cdots + (y \cdot u_p)u_p \] If \(U = [u_1, \cdots, u_p]\), then \[ \forall y \in \mathbb{R}^n, \quad \text{proj}_W y = UU^Ty \] because each weight can be written as \(u_1^Ty, \cdots, u_p^Ty\), which are the entries of \(U^Ty\).

If, in addition to having orthonormal columns, \(U\) is an \(n \times n\) square matrix (that is, an orthogonal matrix), then \[ \forall y \in \mathbb{R}^n, \quad \text{proj}_W y = UU^Ty = Iy = y. \] This means the column space \(W\) equals \(\mathbb{R}^n\).
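A sketch of the projection matrix \(UU^T\) in code, assuming NumPy; the subspace is spanned by the orthonormal columns of a randomly generated `U`:

```python
import numpy as np

rng = np.random.default_rng(2)

# Orthonormal basis (columns of U) for a random 2-D subspace W of R^4.
U, _ = np.linalg.qr(rng.standard_normal((4, 2)))
y = rng.standard_normal(4)

P = U @ U.T                 # projection matrix onto W = Col U
y_hat = P @ y               # proj_W y

print(np.allclose(U.T @ (y - y_hat), 0))   # y - y_hat is orthogonal to W
print(np.allclose(P @ P, P))               # projecting twice changes nothing
```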

Gram-Schmidt Algorithm

While orthogonal and orthonormal bases simplify computations significantly, an arbitrary basis for a subspace will typically not be orthogonal. The Gram-Schmidt Algorithm provides a systematic method to transform any basis into an orthogonal (or orthonormal) basis while preserving the span.

Theorem 7: Gram-Schmidt Algorithm (Process)

Given a basis \(\{x_1, \cdots, x_p\}\) for a nonzero subspace \(W\) of \(\mathbb{R}^n\), the Gram-Schmidt algorithm constructs an orthogonal basis \(\{v_1, \cdots, v_p\}\) for \(W\) as follows: \[ \begin{align*} v_1 &= x_1 \\ v_2 &= x_2 - \frac{x_2 \cdot v_1}{v_1 \cdot v_1}v_1 \\ v_3 &= x_3 - \frac{x_3 \cdot v_1}{v_1 \cdot v_1}v_1 - \frac{x_3 \cdot v_2}{v_2 \cdot v_2}v_2 \\ &\vdots \\ v_k &= x_k - \sum_{j=1}^{k-1} \frac{x_k \cdot v_j}{v_j \cdot v_j}v_j \end{align*} \] An orthonormal basis is obtained by normalizing: \(u_k = \frac{v_k}{\|v_k\|}\) for \(k = 1, \cdots, p\).

Example:

Consider the basis \(\{x_1, x_2, x_3\}\) for a nonzero subspace \(W\) of \(\mathbb{R}^3\), where: \(x_1 = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad x_2 =\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \text{ and } x_3 = \begin{bmatrix} 0 \\ 0 \\ 1\end{bmatrix} \)

Let \(v_1 = x_1\) and \(W_1 = \text{Span } \{v_1\}\). Then \[ \begin{align*} v_2 &= x_2 - \text{proj}_{W_1} x_2 \\\\ &= x_2 - \frac{x_2 \cdot v_1} {v_1 \cdot v_1}v_1 \\\\ &= \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} - \frac{2}{14}\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \\\\ &= \begin{bmatrix} \frac{-1}{7} \\ \frac{5}{7} \\ \frac{-3}{7} \end{bmatrix} \end{align*} \]

Next, \(W_2 = \text{Span } \{v_1, v_2\}\) and then \[ \begin{align*} v_3 &= x_3 - \text{proj}_{W_2} x_3 \\\\ &= x_3 - \frac{x_3 \cdot v_1} {v_1 \cdot v_1}v_1 - \frac{x_3 \cdot v_2} {v_2 \cdot v_2}v_2 \\\\ &= \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} - \frac{3}{14}\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} - \frac{\frac{-3}{7}}{\frac{5}{7}}\begin{bmatrix} \frac{-1}{7} \\ \frac{5}{7} \\ \frac{-3}{7} \end{bmatrix} \\\\ &= \begin{bmatrix} \frac{-3}{10} \\ 0 \\ \frac{1}{10} \end{bmatrix} \end{align*} \] Thus \(\{v_1, v_2, v_3\}\) is an orthogonal basis for \(W = W_3 = \text{Span}\{v_1, v_2, v_3\}\).

Finally, we construct an orthonormal basis for \(W\) by normalizing each \(v_k\): \[ \{u_1 = \begin{bmatrix} \frac{1}{\sqrt{14}} \\ \frac{2}{\sqrt{14}} \\ \frac{3}{\sqrt{14}} \end{bmatrix}, \, u_2 = \begin{bmatrix} \frac{-1}{\sqrt{35}} \\ \frac{5}{\sqrt{35}} \\ \frac{-3}{\sqrt{35}} \end{bmatrix}, \, u_3 = \begin{bmatrix} \frac{-3}{\sqrt{10}} \\ 0 \\ \frac{1}{\sqrt{10}} \end{bmatrix} \}. \]

The Gram-Schmidt algorithm at the \(k\)th step can be written as: \[ v_k = x_k - \sum_{j=1}^{k-1} (x_k \cdot u_j)u_j ,\quad u_k = \frac{v_k}{\| v_k \|}. \]
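A minimal implementation sketch of this orthonormal Gram-Schmidt update, assuming NumPy, applied to the example vectors above:

```python
import numpy as np

def gram_schmidt(X):
    """Orthonormalize the columns of X (assumed linearly independent)."""
    Q = np.zeros_like(X, dtype=float)
    for k in range(X.shape[1]):
        v = X[:, k].copy()
        for j in range(k):
            v -= (X[:, k] @ Q[:, j]) * Q[:, j]   # subtract projections onto earlier u_j
        Q[:, k] = v / np.linalg.norm(v)          # normalize to obtain u_k
    return Q

# Columns are x1, x2, x3 from the example above.
X = np.array([[1.0, 0.0, 0.0],
              [2.0, 1.0, 0.0],
              [3.0, 0.0, 1.0]])
Q = gram_schmidt(X)
print(Q)                                   # matches u1, u2, u3 up to rounding
print(np.allclose(Q.T @ Q, np.eye(3)))     # columns are orthonormal
```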

The resulting vectors look different from the original ones. This is because the algorithm systematically redefines each vector to be orthogonal to the previous ones while maintaining the span of the original vectors. Essentially, it transforms the basis into an orthonormal one without changing the space that the vectors describe.

QR Factorization

The Gram-Schmidt process has a matrix interpretation: it factors any matrix with linearly independent columns into the product of an orthogonal matrix \(Q\) and an upper triangular matrix \(R\). This QR factorization is fundamental in numerical linear algebra, particularly for approximating eigenvalues and solving least squares problems.
(Note: Finding exact roots of high-degree characteristic polynomials is often computationally infeasible, making iterative methods based on QR factorization essential for eigenvalue computation.)

Theorem 8: QR Factorization

Let \(A\) be an \(m \times n\) matrix with linearly independent columns. Then \(A\) can be factorized as \[ A = QR, \] where \(Q\) is an \(m \times n\) matrix whose columns form an orthonormal basis for \(\text{Col }A\), and \(R\) is an \(n \times n\) upper triangular invertible matrix with positive diagonal entries.

Example:

Consider again \[ A = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & 0 & 1 \end{bmatrix}. \]

By the GS algorithm, we have already obtained \(Q\) from the previous example: \[ Q = \begin{bmatrix} \frac{1}{\sqrt{14}} & \frac{-1}{\sqrt{35}} & \frac{-3}{\sqrt{10}} \\ \frac{2}{\sqrt{14}} & \frac{5}{\sqrt{35}} & 0 \\ \frac{3}{\sqrt{14}} & \frac{-3}{\sqrt{35}} & \frac{1}{\sqrt{10}} \end{bmatrix}. \]

Since the columns of \(Q\) are orthonormal, \(Q^TQ=I\). Then, \(Q^TA = Q^TQR = IR = R\). \[ \begin{align*} R = Q^TA &= \begin{bmatrix} \frac{1}{\sqrt{14}} & \frac{2}{\sqrt{14}} & \frac{3}{\sqrt{14}} \\ \frac{-1}{\sqrt{35}} & \frac{5}{\sqrt{35}} & \frac{-3}{\sqrt{35}} \\ \frac{-3}{\sqrt{10}} & 0 & \frac{1}{\sqrt{10}} \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & 0 & 1 \end{bmatrix} \\\\ &= \begin{bmatrix} \frac{14}{\sqrt{14}} & \frac{2}{\sqrt{14}} & \frac{3}{\sqrt{14}} \\ 0 & \frac{5}{\sqrt{35}} & \frac{-3}{\sqrt{35}} \\ 0 & 0 & \frac{1}{\sqrt{10}} \end{bmatrix} \end{align*} \]
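For comparison, NumPy's built-in QR routine reproduces this factorization (possibly up to sign flips in the columns of \(Q\) and the corresponding rows of \(R\)); a minimal check:

```python
import numpy as np

A = np.array([[1.0, 0.0, 0.0],
              [2.0, 1.0, 0.0],
              [3.0, 0.0, 1.0]])

Q, R = np.linalg.qr(A)

print(np.allclose(Q @ R, A))               # A = QR
print(np.allclose(Q.T @ Q, np.eye(3)))     # Q has orthonormal columns
print(np.allclose(R, np.triu(R)))          # R is upper triangular
# Note: np.linalg.qr may flip the signs of some columns of Q (and rows of R)
# relative to the hand computation; the product QR is the same either way.
```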

Interactive Visualizer

This interactive visualizer demonstrates fundamental concepts like inner products, orthogonal decomposition, and orthogonalization that are essential for understanding many applications in machine learning.

In machine learning, orthogonality plays a critical role in feature engineering, dimensionality reduction, gradient-based optimization, and neural network initialization. Try changing the demo type in the dropdown menu above to explore different aspects of orthogonality!