Spectral Theory of Compact Operators

Contents:
  1. From Eigenvalues to the Spectrum
  2. The Adjoint & Self-Adjoint Operators
  3. Compact Operators
  4. The Spectral Theorem
  5. Functional PCA & The Path to Kernels

From Eigenvalues to the Spectrum

In Eigenvalues & Eigenvectors, we learned that every square matrix \(A \in \mathbb{R}^{n \times n}\) has eigenvalues — scalars \(\lambda\) for which \(A\mathbf{v} = \lambda\mathbf{v}\) has a nonzero solution. Equivalently, \(\lambda\) is an eigenvalue if and only if the operator \(A - \lambda I\) is not invertible, i.e., \(\det(A - \lambda I) = 0\). In finite dimensions, every linear operator has at least one eigenvalue (over \(\mathbb{C}\)), and the Spectral Theorem for Symmetric Matrices guarantees that symmetric matrices possess a complete orthonormal eigenbasis.

In infinite dimensions, the situation is far more subtle. The characteristic polynomial \(\det(T - \lambda I) = 0\) is no longer available — determinants are not defined for operators on infinite-dimensional spaces. We must replace this algebraic criterion with a topological one: invertibility of the operator \(T - \lambda I\).

The Resolvent and Spectrum

Definition: Resolvent Set and Spectrum

Let \(\mathcal{X}\) be a Banach space and \(T \in \mathcal{B}(\mathcal{X})\) a bounded linear operator. The resolvent set of \(T\) is \[ \rho(T) = \bigl\{\lambda \in \mathbb{C} : (T - \lambda I) \text{ has a bounded inverse in } \mathcal{B}(\mathcal{X})\bigr\}. \] The spectrum of \(T\) is its complement: \[ \sigma(T) = \mathbb{C} \setminus \rho(T). \]

In finite dimensions, the spectrum equals the set of eigenvalues. In infinite dimensions, the operator \(T - \lambda I\) can fail to be invertible in fundamentally different ways, giving rise to three distinct parts of the spectrum.

Definition: Decomposition of the Spectrum

The spectrum \(\sigma(T)\) decomposes into three disjoint parts:

(a) Point Spectrum \(\sigma_p(T)\): The set of \(\lambda\) for which \(T - \lambda I\) is not injective. These are the true eigenvalues — there exists a nonzero \(x\) with \(Tx = \lambda x\).

(b) Continuous Spectrum \(\sigma_c(T)\): The set of \(\lambda\) for which \(T - \lambda I\) is injective, has dense range, but is not surjective. The "inverse" exists on a dense subspace but is unbounded.

(c) Residual Spectrum \(\sigma_r(T)\): The set of \(\lambda\) for which \(T - \lambda I\) is injective but does not have dense range.

The continuous spectrum is the genuinely new phenomenon of infinite dimensions. Let us see it in action.

Example: The Right Shift Operator on \(\ell^2\)

Define the right shift \(S : \ell^2 \to \ell^2\) by \(S(x_1, x_2, x_3, \ldots) = (0, x_1, x_2, x_3, \ldots)\).

This operator has no eigenvalues: if \(S\mathbf{x} = \lambda\mathbf{x}\), then comparing components gives \(0 = \lambda x_1\), \(x_1 = \lambda x_2\), \(x_2 = \lambda x_3\), etc. For \(\lambda = 0\), these equations force \(\mathbf{x} = 0\) directly. For \(\lambda \neq 0\), the first equation gives \(x_1 = 0\), and the recursion \(x_{n+1} = x_n / \lambda\) then gives \(x_2 = x_3 = \cdots = 0\) as well. So \(\sigma_p(S) = \emptyset\).

Yet the spectrum is nonempty: one can show that \(\sigma(S) = \{\lambda \in \mathbb{C} : |\lambda| \leq 1\}\), the entire closed unit disk. The open disk \(\{|\lambda| < 1\}\) is the residual spectrum (the range of \(S - \lambda I\) is not dense, because \(\overline{\lambda}\) is an eigenvalue of the adjoint \(S^*\)), while the unit circle \(\{|\lambda| = 1\}\) is the continuous spectrum — a phenomenon with no finite-dimensional counterpart.
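A numerical aside (a NumPy sketch, not part of the formal development): truncating \(S\) to an \(n \times n\) matrix hides the spectrum completely — the truncation is nilpotent, so every eigenvalue is 0 — while the adjoint (the left shift) reveals the disk through its genuine eigenvectors.

```python
import numpy as np

# n x n truncation of the right shift: S e_k = e_{k+1}.
n = 50
S = np.diag(np.ones(n - 1), k=-1)          # ones on the subdiagonal

# The truncation is nilpotent (S^n = 0), so ALL its eigenvalues are 0:
# finite sections see nothing of the unit-disk spectrum.
print(np.max(np.abs(np.linalg.matrix_power(S, n))))   # exactly 0

# The disk appears through the adjoint: for |lam| < 1, the geometric
# vector (1, lam, lam^2, ...) is an eigenvector of the left shift S^T = S*.
lam = 0.5
v = lam ** np.arange(n)
v /= np.linalg.norm(v)
print(np.linalg.norm(S.T @ v - lam * v))   # ~1e-16: lam is in sigma(S*)
```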

This example reveals why infinite-dimensional spectral theory requires new tools. A general bounded operator on a Hilbert space may have spectrum that is a continuous "smear" rather than a discrete set of points. To recover the clean, discrete eigenvalue structure of finite-dimensional linear algebra, we need to impose additional structure on our operators.

Connection to the Graph Laplacian

In Graph Laplacians & Spectral Methods and Spectral Clustering, we exploited the eigenvalues and eigenvectors of the (finite) graph Laplacian \(\mathbf{L} = \mathbf{D} - \mathbf{W}\) to partition data. The Spectral Theorem for symmetric matrices guaranteed that \(\mathbf{L}\) is orthogonally diagonalizable. This chapter generalizes that machinery to continuous operators on infinite-dimensional spaces — replacing the discrete graph Laplacian with integral operators, and finite eigenvectors with eigenfunctions.

The Adjoint & Self-Adjoint Operators

In finite-dimensional linear algebra, the transpose (or conjugate transpose) of a matrix plays a central role — symmetric matrices (\(A = A^T\)) have real eigenvalues and orthogonal eigenvectors. We now generalize this to Hilbert spaces using the inner product.

The Hilbert Space Adjoint

Theorem: Existence of the Adjoint

Let \(\mathcal{H}\) be a Hilbert space and \(T \in \mathcal{B}(\mathcal{H})\). There exists a unique operator \(T^* \in \mathcal{B}(\mathcal{H})\), called the adjoint of \(T\), satisfying \[ \langle Tx, y \rangle = \langle x, T^* y \rangle \quad \text{for all } x, y \in \mathcal{H}. \] Moreover, \(\|T^*\| = \|T\|\).

The existence of \(T^*\) follows from the Riesz Representation Theorem: for each fixed \(y\), the map \(x \mapsto \langle Tx, y \rangle\) is a continuous linear functional on \(\mathcal{H}\). By Riesz, there exists a unique vector — which we call \(T^* y\) — such that \(\langle Tx, y \rangle = \langle x, T^* y \rangle\). The map \(y \mapsto T^* y\) is linear, and the bound \(\|T^*\| = \|T\|\) follows from the isometry of the Riesz map.

Example: Adjoints of Familiar Operators

Matrices: For \(A \in \mathbb{C}^{m \times n}\) acting on \(\mathbb{C}^n\), the adjoint is the conjugate transpose: \(A^* = \overline{A}^T\). Over \(\mathbb{R}\), this is simply \(A^T\).

Integral operators: If \(T : L^2([a,b]) \to L^2([a,b])\) is defined by \[ (Tf)(x) = \int_a^b K(x, t)\, f(t)\, dt, \] then the adjoint is \[ (T^* g)(t) = \int_a^b \overline{K(x, t)}\, g(x)\, dx. \] That is, the kernel is "transposed and conjugated": \(K^*(t, x) = \overline{K(x, t)}\).
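The defining identity \(\langle Tf, g \rangle = \langle f, T^* g \rangle\) can be checked numerically by discretizing the integral operator on a grid (a minimal sketch; the kernel below is an arbitrary nonsymmetric choice):

```python
import numpy as np

# Discretize (Tf)(x) = ∫_0^1 K(x,t) f(t) dt by the midpoint rule:
# T becomes the matrix h*K, and its adjoint has the transposed kernel.
m = 200
h = 1.0 / m
x = (np.arange(m) + 0.5) * h                       # quadrature nodes
K = np.exp(-np.abs(x[:, None] - x[None, :])) * np.exp(x[:, None])  # nonsymmetric

T = h * K                                          # discretized operator
T_star = h * K.T                                   # adjoint: K*(t,x) = K(x,t)

f = np.sin(2 * np.pi * x)
g = x ** 2
lhs = h * np.dot(T @ f, g)                         # <Tf, g> in L^2
rhs = h * np.dot(f, T_star @ g)                    # <f, T*g>
print(abs(lhs - rhs))                              # ~0: the pairings agree
```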

Self-Adjoint Operators

Definition: Self-Adjoint (Hermitian) Operator

A bounded operator \(T \in \mathcal{B}(\mathcal{H})\) is self-adjoint if \(T = T^*\), i.e., \[ \langle Tx, y \rangle = \langle x, Ty \rangle \quad \text{for all } x, y \in \mathcal{H}. \]

Self-adjoint operators are the infinite-dimensional generalization of real symmetric (or Hermitian) matrices. They inherit the crucial spectral properties:

Theorem: Spectral Properties of Self-Adjoint Operators

Let \(T = T^* \in \mathcal{B}(\mathcal{H})\). Then:

(a) Every eigenvalue of \(T\) is real.

(b) Eigenvectors corresponding to distinct eigenvalues are orthogonal.

(c) The entire spectrum \(\sigma(T) \subseteq \mathbb{R}\).

(d) \(\|T\| = \sup_{\|x\| = 1} |\langle Tx, x \rangle|\), and both \(\inf_{\|x\|=1} \langle Tx, x \rangle\) and \(\sup_{\|x\|=1} \langle Tx, x \rangle\) belong to \(\sigma(T)\).

Proof of (a) and (b):

(a) If \(Tx = \lambda x\) with \(x \neq 0\), then \(\lambda \langle x, x \rangle = \langle Tx, x \rangle = \langle x, Tx \rangle = \overline{\lambda} \langle x, x \rangle\). Since \(\langle x, x \rangle \neq 0\), we get \(\lambda = \overline{\lambda}\), so \(\lambda \in \mathbb{R}\).

(b) If \(Tx = \lambda x\) and \(Ty = \mu y\) with \(\lambda \neq \mu\), then \(\lambda \langle x, y \rangle = \langle Tx, y \rangle = \langle x, Ty \rangle = \mu \langle x, y \rangle\). Since \(\lambda \neq \mu\), we conclude \(\langle x, y \rangle = 0\). \(\square\)

Property (d) is particularly important — it says the operator norm of a self-adjoint operator is determined by its quadratic form \(\langle Tx, x \rangle\), the infinite-dimensional analogue of the Rayleigh quotient from quadratic forms.
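In the finite-dimensional case these properties can be verified directly — `numpy.linalg.eigh` is built on exactly this self-adjoint structure (a quick sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
A = (A + A.T) / 2                          # symmetric, hence self-adjoint

w, V = np.linalg.eigh(A)                   # eigh assumes self-adjointness
print(w)                                   # (a) all eigenvalues real
print(np.max(np.abs(V.T @ V - np.eye(6)))) # (b) ~0: orthonormal eigenvectors

# (d) ||A|| equals the extreme value of the quadratic form <Ax, x>,
# attained at the eigenvector of the largest-modulus eigenvalue.
op_norm = np.linalg.norm(A, ord=2)
v = V[:, np.argmax(np.abs(w))]
print(op_norm, abs(v @ A @ v))             # the two values agree
```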

Connection to Quantum Mechanics

In quantum mechanics, observables (position, momentum, energy) are modeled as self-adjoint operators on a Hilbert space. The requirement that eigenvalues be real (property (a)) ensures that measurement outcomes are real numbers. The orthogonality of eigenvectors (property (b)) ensures that distinct measurement outcomes correspond to mutually exclusive states — the mathematical basis of the Bra-Ket formalism.

Compact Operators: "Almost Finite-Dimensional"

The right shift example in Section 1 showed that a general bounded operator can have a continuous spectrum with no eigenvalues at all. To recover the discrete, diagonalizable structure of finite-dimensional linear algebra, we restrict our attention to a special class of operators that are, in a precise sense, "close to finite-dimensional."

Definition: Compact Operator

A bounded linear operator \(T : \mathcal{X} \to \mathcal{Y}\) between normed spaces is called compact if it maps every bounded set in \(\mathcal{X}\) to a relatively compact (precompact) set in \(\mathcal{Y}\). Equivalently, for every bounded sequence \(\{x_n\}\) in \(\mathcal{X}\), the image sequence \(\{Tx_n\}\) has a convergent subsequence in \(\mathcal{Y}\).

Recall from Intro to Functional Analysis that in infinite dimensions, bounded sequences need not have convergent subsequences — the closed unit ball is not compact. A compact operator "repairs" this: it compresses the infinite-dimensional unit ball into a set that is (relatively) compact. In other words, it "squashes away" the infinite-dimensional directions, leaving behind something that behaves like a finite-dimensional object.

Finite-Rank Operators

The simplest compact operators are those with finite-dimensional range.

Definition: Finite-Rank Operator

An operator \(T \in \mathcal{B}(\mathcal{X}, \mathcal{Y})\) is finite-rank if \(\dim(\operatorname{Range}(T)) < \infty\). Every finite-rank operator is compact (its image lies in a finite-dimensional subspace, where Heine-Borel applies).

The deep connection between compact and finite-rank operators is captured by the following approximation property:

Theorem: Compact Operators as Limits of Finite-Rank Operators

Let \(\mathcal{H}\) be a Hilbert space. An operator \(T \in \mathcal{B}(\mathcal{H})\) is compact if and only if there exists a sequence of finite-rank operators \(\{T_n\}\) such that \(\|T - T_n\| \to 0\).

This is the functional-analytic analogue of the truncated SVD: a compact operator can be approximated arbitrarily well by operators with finite-dimensional range, just as any matrix can be approximated by low-rank matrices obtained by discarding small singular values.

The Canonical Example: Integral Operators

Example: Hilbert-Schmidt Integral Operators

Let \(K : [a,b] \times [a,b] \to \mathbb{R}\) be a square-integrable kernel, meaning \[ \int_a^b \int_a^b |K(x,t)|^2 \, dx\, dt < \infty. \] The integral operator defined by \[ (Tf)(x) = \int_a^b K(x, t)\, f(t)\, dt \] is a compact operator on \(L^2([a,b])\). Such operators are called Hilbert-Schmidt operators, and they form a two-sided ideal in \(\mathcal{B}(L^2)\).

The Hilbert-Schmidt norm is \[ \|T\|_{\text{HS}} = \left(\int_a^b \int_a^b |K(x,t)|^2 \, dx\, dt\right)^{1/2}, \] and it satisfies \(\|T\| \leq \|T\|_{\text{HS}}\) (the operator norm is bounded by the Hilbert-Schmidt norm).

These integral operators are the "continuous matrices" of functional analysis. The kernel function \(K(x,t)\) plays the role of the matrix entries \(a_{ij}\), and the integral replaces the finite sum. When the kernel is symmetric (\(K(x,t) = K(t,x)\)), the integral operator is self-adjoint — the continuous analogue of a symmetric matrix.
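Both norms are easy to compute for a discretized kernel. The sketch below uses the Brownian-motion covariance \(K(x,t) = \min(x,t)\), for which \(\|T\|_{\text{HS}} = \sqrt{1/6}\) and \(\|T\| = 4/\pi^2\) are known in closed form:

```python
import numpy as np

# Midpoint-rule discretization of the Hilbert-Schmidt operator with
# kernel K(x,t) = min(x,t) on L^2([0,1]).
m = 300
h = 1.0 / m
x = (np.arange(m) + 0.5) * h
K = np.minimum(x[:, None], x[None, :])

T = h * K                                     # the "continuous matrix"
hs_norm = np.sqrt(h * h * np.sum(K ** 2))     # (∫∫ K^2 dx dt)^{1/2} ≈ sqrt(1/6)
op_norm = np.linalg.norm(T, ord=2)            # largest singular value ≈ 4/pi^2

print(op_norm, hs_norm)                       # ||T|| <= ||T||_HS holds
```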

Key Properties of Compact Operators

Theorem: Properties of Compact Operators

(a) The set of compact operators \(\mathcal{K}(\mathcal{X}, \mathcal{Y})\) is a closed subspace of \(\mathcal{B}(\mathcal{X}, \mathcal{Y})\).

(b) If \(T\) is compact and \(S\) is bounded, then both \(ST\) and \(TS\) are compact (compact operators form a two-sided ideal in \(\mathcal{B}(\mathcal{H})\)).

(c) If \(T\) is compact, then \(T^*\) is compact (Schauder's theorem).

(d) The identity operator \(I\) on an infinite-dimensional space is never compact (since the closed unit ball is not compact).

Property (d) highlights the key distinction: compact operators are the "well-behaved" operators that do not include the identity — they necessarily "lose dimensions" in the sense that their image cannot contain an open set.

Connection to Machine Learning: Kernel Matrices as Finite Approximations

In Kernel PCA, we compute the Gram matrix \(K_{ij} = k(x_i, x_j)\) for \(N\) data points and extract its eigenvectors. This \(N \times N\) matrix is a finite-rank stand-in for the underlying integral operator \((Tf)(x) = \int k(x, t) f(t)\, d\mu(t)\) on the function space. Under mild integrability assumptions, the spectrum of the normalized Gram matrix \(\frac{1}{N}\mathbf{K}\) converges as \(N \to \infty\) to the spectrum of this compact operator — justifying why kernel methods computed from finite data approximate a well-defined infinite-dimensional object.
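For a kernel whose integral operator has known spectrum, this convergence can be watched directly. A sketch using the Brownian-motion kernel \(\min(s,t)\), whose operator eigenvalues are \(\lambda_n = ((n - \tfrac{1}{2})\pi)^{-2}\):

```python
import numpy as np

# k(s,t) = min(s,t): the integral operator on L^2([0,1]) has known
# eigenvalues lambda_n = ((n - 1/2) pi)^{-2}.
N = 1000
x = (np.arange(N) + 0.5) / N
K = np.minimum(x[:, None], x[None, :])        # Gram matrix of the sample

# Scaling the Gram matrix by 1/N approximates the integral operator.
gram_eigs = np.sort(np.linalg.eigvalsh(K / N))[::-1]
true_eigs = 1.0 / ((np.arange(1, 6) - 0.5) * np.pi) ** 2

print(gram_eigs[:5])
print(true_eigs)      # the two lists agree closely already at N = 1000
```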

The Spectral Theorem for Compact Self-Adjoint Operators

We now arrive at the central result of this chapter — the infinite-dimensional generalization of the Spectral Theorem for Symmetric Matrices.

Theorem: Spectral Theorem (Compact Self-Adjoint Operators)

Let \(\mathcal{H}\) be a separable Hilbert space and \(T = T^* \in \mathcal{B}(\mathcal{H})\) a compact, self-adjoint operator. Then:

(a) There exists a finite or countably infinite orthonormal sequence of eigenvectors \(\{e_n\}_{n=1}^N\) (\(N \leq \infty\)) with corresponding real eigenvalues \(\{\lambda_n\}\), such that \[ Tx = \sum_{n=1}^{N} \lambda_n \langle x, e_n \rangle \, e_n \quad \text{for all } x \in \mathcal{H}. \]

(b) If \(N = \infty\), then \(\lambda_n \to 0\) as \(n \to \infty\).

(c) Each nonzero eigenvalue has finite multiplicity (its eigenspace is finite-dimensional).

(d) The only possible accumulation point of \(\{\lambda_n\}\) is \(0\).

(e) The eigenvectors \(\{e_n\}\) together with an orthonormal basis for \(\ker(T)\) form a complete orthonormal basis for \(\mathcal{H}\).

The Spectral Theorem resurrects the full power of finite-dimensional diagonalization: a compact self-adjoint operator is completely determined by its eigenvalues and eigenvectors, just as a symmetric matrix is determined by \(A = Q\Lambda Q^T\). The only new feature is that the eigenvalues must decay to zero — the "price" of being compact in infinite dimensions.

Comparison: Finite vs. Infinite Dimensions

| Property | Symmetric matrix \(A \in \mathbb{R}^{n \times n}\) | Compact self-adjoint \(T \in \mathcal{B}(\mathcal{H})\) |
| --- | --- | --- |
| Eigenvalues | Finitely many (\(n\)), all real | Countably many, all real, \(\lambda_n \to 0\) |
| Eigenvectors | Orthonormal basis for \(\mathbb{R}^n\) | Orthonormal basis for \(\overline{\operatorname{Range}(T)}\) |
| Decomposition | \(A = Q\Lambda Q^T = \sum_{i=1}^{n} \lambda_i \mathbf{v}_i \mathbf{v}_i^T\) | \(Tx = \sum_{n=1}^{\infty} \lambda_n \langle x, e_n \rangle \, e_n\) |
| Truncation | Low-rank approximation (truncated SVD) | Finite-rank approximation (\(\|T - T_k\| \leq |\lambda_{k+1}|\)) |
| Zero eigenvalue | Indicates rank deficiency | Always in \(\sigma(T)\) if \(\dim(\mathcal{H}) = \infty\) |
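In the matrix column, the decomposition and the truncation bound can be checked line by line (a NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 30))
A = (A + A.T) / 2
w, V = np.linalg.eigh(A)

# The spectral sum A x = sum_n lambda_n <x, e_n> e_n, term by term.
x = rng.standard_normal(30)
Ax = sum(w[n] * (x @ V[:, n]) * V[:, n] for n in range(30))
print(np.max(np.abs(Ax - A @ x)))        # ~0

# Keeping only the k largest-|lambda| terms gives a finite-rank
# approximation with operator-norm error exactly |lambda_{k+1}|.
order = np.argsort(-np.abs(w))
k = 4
A_k = sum(w[i] * np.outer(V[:, i], V[:, i]) for i in order[:k])
err = np.linalg.norm(A - A_k, ord=2)
print(err, abs(w[order[k]]))             # equal
```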

Proof Sketch

The proof proceeds by an iterative maximization argument that parallels the Rayleigh quotient construction from finite-dimensional linear algebra.

Proof Sketch:

Step 1: Find the first eigenvector. Since \(T\) is self-adjoint, \(\|T\| = \sup_{\|x\|=1} |\langle Tx, x \rangle|\). By weak compactness of the unit ball (the space is reflexive) and compactness of \(T\), there exists \(e_1\) with \(\|e_1\| = 1\) achieving \(|\langle Te_1, e_1 \rangle| = \|T\|\). Set \(\lambda_1 = \langle Te_1, e_1 \rangle\). A variational argument shows \(Te_1 = \lambda_1 e_1\).

Step 2: Restrict and repeat. The subspace \(\{e_1\}^\perp\) is invariant under \(T\) (because \(T\) is self-adjoint). The restriction \(T|_{\{e_1\}^\perp}\) is still compact and self-adjoint, so we repeat Step 1 to find \(e_2\) with \(|\lambda_2| = \|T|_{\{e_1\}^\perp}\| \leq |\lambda_1|\).

Step 3: Convergence to zero. The sequence \(\{|\lambda_n|\}\) is nonincreasing and bounded below by 0, so it converges. If it converged to some \(\epsilon > 0\), the orthonormal sequence \(\{e_n\}\) would satisfy \(\|Te_m - Te_n\|^2 = \lambda_m^2 + \lambda_n^2 \geq 2\epsilon^2\), contradicting the compactness of \(T\) (which requires \(\{Te_n\}\) to have a convergent subsequence). Hence \(\lambda_n \to 0\). \(\square\)

Notice that Step 1 uses the Rayleigh quotient characterization, Step 2 uses the orthogonal complement (Hilbert space structure), and Step 3 uses compactness to force the eigenvalues to decay. Each ingredient is essential.
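The three steps can be mirrored computationally: power iteration maximizes the Rayleigh quotient (Step 1), and deflation restricts to the orthogonal complement (Step 2). A finite-dimensional sketch on a matrix constructed to have spectrum \(\{3, -2, 1, 0.5\}\):

```python
import numpy as np

# A matrix with known spectrum {3, -2, 1, 0.5} (constructed test data).
rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = Q @ np.diag([3.0, -2.0, 1.0, 0.5]) @ Q.T

def top_eigenpair(T, iters=500):
    """Step 1: find the eigenvalue of largest modulus (power iteration)."""
    v = rng.standard_normal(T.shape[0])
    for _ in range(iters):
        v = T @ v
        v /= np.linalg.norm(v)
    return v @ T @ v, v                   # Rayleigh quotient and eigenvector

# Steps 2-3: deflate to {e_1}^perp and repeat; |lambda_n| is nonincreasing.
found, T = [], A.copy()
for _ in range(3):
    lam, v = top_eigenpair(T)
    found.append(lam)
    T = T - lam * np.outer(v, v)

print(found)                               # ≈ [3.0, -2.0, 1.0]
```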

The Singular Value Decomposition in Infinite Dimensions

Just as the finite-dimensional SVD extends the Spectral Theorem from symmetric to general matrices, there is an infinite-dimensional SVD for compact operators.

Theorem: Singular Value Decomposition (Compact Operators)

Let \(T \in \mathcal{B}(\mathcal{H})\) be a compact operator (not necessarily self-adjoint). Then there exist orthonormal sequences \(\{e_n\}\) and \(\{f_n\}\) and a nonincreasing sequence of nonnegative reals \(\sigma_1 \geq \sigma_2 \geq \cdots \geq 0\) (the singular values) such that \[ Tx = \sum_{n=1}^{\infty} \sigma_n \langle x, e_n \rangle \, f_n. \] The singular values are the eigenvalues of \((T^*T)^{1/2}\), and \(\sigma_n \to 0\).

This is the precise infinite-dimensional analogue of \(A = U\Sigma V^T\). The \(\{e_n\}\) are the "right singular vectors," the \(\{f_n\}\) are the "left singular vectors," and the \(\sigma_n\) decay to zero because \(T\) is compact.

Functional PCA & The Path to Kernels

The Spectral Theorem is not merely an abstract result — it is the mathematical engine behind several of the most important algorithms in machine learning and statistics. In this final section, we connect the theory to three concrete applications.

Application 1: The Karhunen-Loève Expansion (Functional PCA)

In PCA, we performed eigendecomposition on the covariance matrix \(\Sigma \in \mathbb{R}^{n \times n}\) of a finite-dimensional random vector. But what if our data consists of random functions — e.g., time series, curves, or images viewed as functions on a domain?

Let \(X(t)\) be a centered stochastic process on \([a,b]\) with covariance function \(C(s, t) = \mathbb{E}[X(s) X(t)]\). The covariance operator \(\mathcal{C} : L^2([a,b]) \to L^2([a,b])\) is defined by \[ (\mathcal{C}f)(s) = \int_a^b C(s, t)\, f(t)\, dt. \] This is a Hilbert-Schmidt integral operator with a symmetric, positive semi-definite kernel, so it is compact and self-adjoint. The Spectral Theorem applies directly.

Theorem: Karhunen-Loève Expansion

Let \(X(t)\) be a centered, square-integrable stochastic process on \([a,b]\) with continuous covariance function \(C(s,t)\). Let \(\{\lambda_n, e_n\}\) be the eigenvalues and eigenfunctions of the covariance operator \(\mathcal{C}\). Then \[ X(t) = \sum_{n=1}^{\infty} Z_n \, e_n(t), \] where \(Z_n = \langle X, e_n \rangle = \int_a^b X(t) e_n(t)\, dt\) are uncorrelated random variables with \(\mathbb{E}[Z_n] = 0\) and \(\text{Var}(Z_n) = \lambda_n\). The convergence is in \(L^2\).

This is literally infinite-dimensional PCA. The eigenfunctions \(\{e_n\}\) are the principal components (the directions of maximum variance), the eigenvalues \(\lambda_n\) give the variance in each direction, and the random coefficients \(Z_n\) are the "scores." The decay \(\lambda_n \to 0\) guarantees that truncating after \(k\) terms captures most of the variance — the continuous analogue of the scree plot.
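A simulation sketch for Brownian motion, whose KL eigenpairs are known in closed form (\(e_n(t) = \sqrt{2}\sin((n - \tfrac{1}{2})\pi t)\), \(\lambda_n = ((n - \tfrac{1}{2})\pi)^{-2}\)): the empirical score variances match \(\lambda_n\) and the scores are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(4)
m, M = 256, 20000
h = 1.0 / m
t = (np.arange(m) + 0.5) * h

# Sample Brownian paths as cumulative sums of independent increments.
paths = np.cumsum(rng.standard_normal((M, m)) * np.sqrt(h), axis=1)

# Project onto the first three KL eigenfunctions e_n(t) = sqrt(2) sin((n-1/2) pi t).
n = np.arange(1, 4)
E = np.sqrt(2.0) * np.sin(np.outer(t, (n - 0.5) * np.pi))
Z = h * (paths @ E)                       # scores Z_n = <X, e_n>

lam = 1.0 / ((n - 0.5) * np.pi) ** 2
print(Z.var(axis=0))                      # ≈ lambda_n = (0.405, 0.045, 0.016)
print(np.corrcoef(Z.T)[0, 1])             # ≈ 0: scores are uncorrelated
```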

Application 2: Spectral Clustering via the Continuous Laplacian

In Spectral Clustering, we used eigenvectors of the discrete graph Laplacian \(\mathbf{L} = \mathbf{D} - \mathbf{W}\) to embed data into a low-dimensional space and then applied K-means. The continuous analogue replaces the graph with a smooth domain and the discrete Laplacian with the Laplace-Beltrami operator \(\Delta\) on a manifold.

The Laplace-Beltrami operator (with appropriate boundary conditions) is self-adjoint and has compact resolvent, so its eigenfunctions \(\{e_n\}\) form an orthonormal basis for \(L^2\) on the manifold, with eigenvalues \(0 = \lambda_0 \leq \lambda_1 \leq \cdots \to \infty\). The first few eigenfunctions — the "low-frequency modes" — capture the large-scale geometry of the manifold, precisely as the Fiedler vector captures the main partition of a graph in the discrete case.

From Graphs to Manifolds: The Convergence

As the number of data points \(N \to \infty\) and the graph becomes denser, the discrete graph Laplacian converges (in a suitable sense) to the Laplace-Beltrami operator on the underlying data manifold. The eigenvectors of the graph Laplacian converge to the eigenfunctions of the continuous operator. This is the rigorous justification for spectral clustering: the discrete algorithm is a finite-sample approximation of a well-defined continuous operation, and the Spectral Theorem guarantees that the continuous object has the orthogonal eigenfunction structure we need.
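The convergence is visible already in the simplest case. A sketch, assuming a path graph with unit edge weights on \([0,1]\): its Laplacian eigenvalues, rescaled by \(h^2 = 1/n^2\), approach the Neumann eigenvalues \((k\pi)^2\) of \(-d^2/dx^2\).

```python
import numpy as np

# Path graph on n vertices: L = D - W with unit weights.
n = 400
L = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L[0, 0] = L[-1, -1] = 1.0                 # endpoint vertices have degree 1

w, V = np.linalg.eigh(L)
h = 1.0 / n
print(w[:4] / h**2)                        # ≈ 0, pi^2, (2 pi)^2, (3 pi)^2
print((np.arange(4) * np.pi) ** 2)         # Neumann eigenvalues of -d^2/dx^2
# The Fiedler vector V[:, 1] approximates cos(pi x): one sign change,
# splitting [0,1] into two halves, the continuous analogue of a 2-cut.
```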

Application 3: The Bridge to Mercer's Theorem and RKHS

Consider a continuous, symmetric, positive semi-definite kernel \(K : [a,b] \times [a,b] \to \mathbb{R}\). The integral operator \[ (T_K f)(x) = \int_a^b K(x, t)\, f(t)\, dt \] is compact, self-adjoint, and positive (all eigenvalues \(\lambda_n \geq 0\)). The Spectral Theorem yields the expansion \[ K(x, t) = \sum_{n=1}^{\infty} \lambda_n \, e_n(x) \, e_n(t), \] convergent in \(L^2\); Mercer's Theorem strengthens this to absolute and uniform convergence.

This eigenfunction expansion has a profound consequence: the kernel \(K\) defines an implicit feature map \[ \Phi(x) = \bigl(\sqrt{\lambda_1}\, e_1(x),\; \sqrt{\lambda_2}\, e_2(x),\; \ldots\bigr) \in \ell^2, \] such that \(K(x, t) = \langle \Phi(x), \Phi(t) \rangle_{\ell^2}\). The kernel "trick" — computing inner products in a high-dimensional feature space without ever forming \(\Phi\) explicitly — is mathematically grounded in this spectral decomposition.
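The identity \(K(x,t) = \langle \Phi(x), \Phi(t) \rangle\) can be verified on a grid by building the truncated feature map from a discretized kernel. A sketch — the Gaussian kernel and the truncation rank below are arbitrary illustrative choices:

```python
import numpy as np

# Discretize a Gaussian kernel on [0,1] (arbitrary illustrative choice).
m = 400
h = 1.0 / m
x = (np.arange(m) + 0.5) * h
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)

# Eigen-decompose the discretized integral operator h*K; normalize the
# eigenvectors so the eigenfunctions have unit L^2 norm.
w, V = np.linalg.eigh(h * K)
w, V = w[::-1], V[:, ::-1]                   # largest eigenvalues first
E = V / np.sqrt(h)                           # e_n sampled on the grid

# Truncated feature map Phi(x) = (sqrt(lambda_n) e_n(x))_{n<=r}.
r = 50                                       # eigenvalues decay very fast
Phi = E[:, :r] * np.sqrt(np.maximum(w[:r], 0.0))
K_rec = Phi @ Phi.T                          # <Phi(x_i), Phi(x_j)>

print(np.max(np.abs(K - K_rec)))             # tiny: K(x,t) = <Phi(x), Phi(t)>
```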

This is the content of Mercer's Theorem, which we will state and prove rigorously in the next chapter (RKHS & Kernel Methods). The Spectral Theorem of the present chapter provides the engine; Mercer's Theorem adds the fuel — positivity of the kernel — to build the Reproducing Kernel Hilbert Space.

Summary: The Spectral Thread

Let us trace the spectral idea through our entire curriculum:

  1. Finite-dimensional (Symmetry & SVD): \(A = Q\Lambda Q^T\). Discrete eigenvalues, finite eigenvectors. Powers PCA, SVD.
  2. Discrete infinite (Graph Laplacians): \(\mathbf{L} = \mathbf{D} - \mathbf{W}\). Discrete eigenvalues of growing matrices. Powers Spectral Clustering, Graph Fourier Transform.
  3. Continuous infinite (this chapter): \(Tx = \sum \lambda_n \langle x, e_n \rangle e_n\). Countable eigenvalues decaying to 0. Powers Functional PCA, continuous Laplacians.
  4. Next: Mercer's Theorem converts the spectral decomposition of a positive kernel into an explicit feature space — the mathematical foundation of SVMs, Gaussian Processes, and Kernel PCA.