From Eigenvalues to the Spectrum
In Eigenvalues & Eigenvectors,
we learned that every square matrix \(A \in \mathbb{C}^{n \times n}\) has eigenvalues —
scalars \(\lambda\) for which \(A\mathbf{v} = \lambda\mathbf{v}\) has a nonzero solution.
Equivalently, \(\lambda\) is an eigenvalue if and only if the operator \(A - \lambda I\)
is not invertible, i.e., \(\det(A - \lambda I) = 0\).
In finite dimensions, every linear operator has at least one eigenvalue (over \(\mathbb{C}\)),
and the Spectral Theorem for Symmetric Matrices
guarantees that symmetric matrices possess a complete orthonormal eigenbasis.
In infinite dimensions, the situation is far more subtle. The characteristic polynomial
\(\det(T - \lambda I) = 0\) is no longer available — determinants are not defined for
operators on infinite-dimensional spaces. We must replace this algebraic criterion with
a topological one: invertibility of the operator \(T - \lambda I\).
Convention. Throughout this chapter we take \(\mathcal{X}\) and \(\mathcal{H}\)
to be complex Banach and Hilbert spaces (i.e., \(\mathbb{F} = \mathbb{C}\)).
Spectral theory over \(\mathbb{R}\) is strictly poorer: a rotation of \(\mathbb{R}^2\) by \(90^\circ\) has
empty spectrum when viewed as a real operator, since its characteristic polynomial has no
real roots. The passage to \(\mathbb{C}\) restores the fundamental theorem of algebra and,
with it, the guarantee that \(\sigma(T) \neq \emptyset\) for any \(T \in \mathcal{B}(\mathcal{X})\)
on a nonzero space.
The Resolvent and Spectrum
Definition: Resolvent Set and Spectrum
Let \(\mathcal{X}\) be a Banach space and \(T \in \mathcal{B}(\mathcal{X})\) a bounded
linear operator. The resolvent set of \(T\) is
\[
\rho(T) = \bigl\{\lambda \in \mathbb{C} : T - \lambda I \text{ is bijective with inverse in } \mathcal{B}(\mathcal{X})\bigr\}.
\]
The spectrum of \(T\) is its complement:
\[
\sigma(T) = \mathbb{C} \setminus \rho(T).
\]
In finite dimensions, the spectrum equals the set of eigenvalues. In infinite dimensions,
the operator \(T - \lambda I\) can fail to be invertible in fundamentally different ways,
giving rise to three distinct parts of the spectrum.
Definition: Decomposition of the Spectrum
The spectrum \(\sigma(T)\) decomposes into three disjoint parts:
(a) Point Spectrum \(\sigma_p(T)\): The set of \(\lambda\) for which
\(T - \lambda I\) is not injective. These are the true eigenvalues —
there exists a nonzero \(x\) with \(Tx = \lambda x\).
(b) Continuous Spectrum \(\sigma_c(T)\): The set of \(\lambda\) for which
\(T - \lambda I\) is injective, has dense range, but is not surjective.
(c) Residual Spectrum \(\sigma_r(T)\): The set of \(\lambda\) for which
\(T - \lambda I\) is injective but does not have dense range.
These three cases are disjoint and, via the open mapping theorem, exhaust
\(\sigma(T)\): any \(\lambda\) at which \(T - \lambda I\) is injective with dense range
and a bounded inverse on its range would extend to a bounded inverse on all of
\(\mathcal{X}\), placing \(\lambda\) in \(\rho(T)\). The "inverse" in case (b) is
therefore necessarily unbounded. Among the three, the continuous spectrum is the
genuinely new phenomenon of infinite dimensions. Let us see it in action.
Example: The Right Shift Operator on \(\ell^2\)
Define the right shift \(S : \ell^2 \to \ell^2\) by
\(S(x_1, x_2, x_3, \ldots) = (0, x_1, x_2, x_3, \ldots)\).
This operator has no eigenvalues: if \(S\mathbf{x} = \lambda\mathbf{x}\),
then \(0 = \lambda x_1\), \(x_1 = \lambda x_2\), \(x_2 = \lambda x_3\), etc.
For \(\lambda = 0\), we get \(\mathbf{x} = 0\). For \(\lambda \neq 0\),
we get \(x_n = x_1 / \lambda^{n-1}\), which is not square-summable unless \(x_1 = 0\).
So \(\sigma_p(S) = \emptyset\).
Yet the spectrum is nonempty: one can show that \(\sigma(S) = \{\lambda \in \mathbb{C} : |\lambda| \leq 1\}\),
the entire closed unit disk. Specifically, the open disk \(\{|\lambda| < 1\}\) forms the
residual spectrum, and the boundary \(\{|\lambda| = 1\}\) forms the continuous spectrum —
a phenomenon with no finite-dimensional counterpart.
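A small numerical sketch makes this vivid (the truncation size \(N = 400\) and the test point \(\lambda = 0.5\) are arbitrary illustrative choices). The \(N \times N\) truncation of \(S\) is nilpotent, so all of its eigenvalues are exactly \(0\); yet the geometric vector \((1, \lambda, \lambda^2, \ldots)\) is a genuine eigenvector of the adjoint left shift, which is precisely what places \(\lambda\) in the residual spectrum of \(S\):

```python
import numpy as np

N = 400  # truncation dimension (an arbitrary illustrative choice)

# Right shift on the first N coordinates: S x = (0, x_1, ..., x_{N-1})
S = np.diag(np.ones(N - 1), k=-1)

# The finite truncation is nilpotent (S^N = 0), so every eigenvalue is 0:
# finite sections completely miss the unit-disk spectrum of the infinite operator.
assert not np.any(np.linalg.matrix_power(S, N))

# For |lam| < 1, the adjoint (the left shift S^T) has the genuine eigenvector
# (1, lam, lam^2, ...); this is what puts lam in the residual spectrum of S.
lam = 0.5
v = lam ** np.arange(N)
v /= np.linalg.norm(v)
residual = np.linalg.norm(S.T @ v - lam * v)
assert residual < 1e-10  # only the truncation tail contributes
```

The contrast between the truncation (spectrum \(\{0\}\)) and the operator itself (spectrum the whole closed disk) is a first warning that finite sections can badly misrepresent infinite-dimensional spectra.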
This example reveals why infinite-dimensional spectral theory requires new tools.
A general bounded operator on a Hilbert space may have spectrum that is a continuous
"smear" rather than a discrete set of points. To recover the clean, discrete
eigenvalue structure of finite-dimensional linear algebra, we need to impose
additional structure on our operators.
Connection to the Graph Laplacian
In Graph Laplacians & Spectral Methods
and Spectral Clustering,
we exploited the eigenvalues and eigenvectors of the (finite) graph Laplacian \(\mathbf{L} = \mathbf{D} - \mathbf{W}\)
to partition data. The Spectral Theorem for symmetric matrices guaranteed that \(\mathbf{L}\) is
orthogonally diagonalizable. This chapter generalizes that machinery to
continuous operators on infinite-dimensional spaces — replacing the
discrete graph Laplacian with integral operators, and finite eigenvectors with
eigenfunctions.
The Adjoint & Self-Adjoint Operators
In finite-dimensional linear algebra, the transpose (or conjugate transpose) of a matrix
plays a central role — symmetric matrices (\(A = A^T\)) have real eigenvalues and
orthogonal eigenvectors. We now generalize this to Hilbert spaces using the inner product.
The Hilbert Space Adjoint
Theorem: Existence of the Adjoint
Let \(\mathcal{H}\) be a Hilbert space and \(T \in \mathcal{B}(\mathcal{H})\).
There exists a unique operator \(T^* \in \mathcal{B}(\mathcal{H})\),
called the adjoint of \(T\), satisfying
\[
\langle Tx, y \rangle = \langle x, T^* y \rangle \quad \text{for all } x, y \in \mathcal{H}.
\]
Moreover, \(\|T^*\| = \|T\|\).
Proof:
The existence of \(T^*\) follows from the
Riesz Representation Theorem:
for each fixed \(y\), the map \(x \mapsto \langle Tx, y \rangle\) is a continuous linear functional on \(\mathcal{H}\).
By Riesz, there exists a unique vector — which we call \(T^* y\) — such that \(\langle Tx, y \rangle = \langle x, T^* y \rangle\).
The map \(y \mapsto T^* y\) is linear (it is the composition of two conjugate-linear maps: \(y \mapsto \varphi_y\)
where \(\varphi_y(x) = \langle Tx, y\rangle\), and the
Riesz sharp map \(\varphi \mapsto \varphi^\sharp\), which
is conjugate-linear).
For the norm equality: \(\|T^*y\| = \|\varphi_y\|_{\mathcal{H}^*} \leq \|T\|\|y\|\) gives \(\|T^*\| \leq \|T\|\).
The reverse inequality follows from \((T^*)^* = T\)
(which holds because \(\langle T^*y, x\rangle = \overline{\langle x, T^*y\rangle} = \overline{\langle Tx, y\rangle} = \langle y, Tx\rangle\),
so \(T\) satisfies the same defining relation that characterizes the adjoint of \(T^*\); by the uniqueness asserted in the theorem
(applied to \(T^*\) in place of \(T\)), we conclude \((T^*)^* = T\)) combined with the same inequality applied to \(T^*\).
Example: Adjoints of Familiar Operators
Matrices: For \(A \in \mathbb{C}^{n \times n}\) acting on
\(\mathbb{C}^n\), the adjoint is the conjugate transpose: \(A^* = \overline{A}^T\).
Over \(\mathbb{R}\), this is simply \(A^T\).
Integral operators: If \(T : L^2([a,b]) \to L^2([a,b])\) is defined by
\[
(Tf)(x) = \int_a^b K(x, t)\, f(t)\, dt,
\]
then the adjoint is
\[
(T^* g)(x) = \int_a^b \overline{K(t, x)}\, g(t)\, dt.
\]
That is, the kernel is "transposed and conjugated": \(K^*(x, t) = \overline{K(t, x)}\).
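On a grid, this relation can be checked directly: a Riemann-sum discretization turns \(T\) into a matrix, and the adjoint becomes the conjugate transpose of the kernel matrix. The complex kernel below is a hypothetical illustrative choice (not one used elsewhere in the text), and the grid size is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.linspace(0.0, 1.0, n)
h = 1.0 / (n - 1)

# Hypothetical complex kernel on [0,1]^2, chosen only for illustration
K = np.exp(1j * np.outer(x, x)) / (1.0 + np.add.outer(x, x))

def apply_op(M, f):
    # Riemann-sum discretization of (Tf)(x) = \int K(x,t) f(t) dt
    return M @ f * h

def inner(f, g):
    # discrete L^2 inner product <f, g> = \int f(t) conj(g(t)) dt
    return np.sum(f * np.conj(g)) * h

f = rng.standard_normal(n) + 1j * rng.standard_normal(n)
g = rng.standard_normal(n) + 1j * rng.standard_normal(n)

K_star = np.conj(K.T)                # the kernel, "transposed and conjugated"
lhs = inner(apply_op(K, f), g)       # <Tf, g>
rhs = inner(f, apply_op(K_star, g))  # <f, T*g>
assert abs(lhs - rhs) < 1e-10
```

The equality \(\langle Tf, g\rangle = \langle f, T^*g\rangle\) holds exactly in the discretization, because at the matrix level it is just \(\langle Af, g\rangle = \langle f, A^*g\rangle\) with \(A^* = \overline{A}^T\).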
Self-Adjoint Operators
Definition: Self-Adjoint (Hermitian) Operator
A bounded operator \(T \in \mathcal{B}(\mathcal{H})\) is self-adjoint
if \(T = T^*\), i.e.,
\[
\langle Tx, y \rangle = \langle x, Ty \rangle \quad \text{for all } x, y \in \mathcal{H}.
\]
Self-adjoint operators are the infinite-dimensional generalization of real symmetric
(or Hermitian) matrices. They inherit the crucial spectral properties:
Theorem: Spectral Properties of Self-Adjoint Operators
Let \(T = T^* \in \mathcal{B}(\mathcal{H})\). Then:
(a) Every eigenvalue of \(T\) is real.
(b) Eigenvectors corresponding to distinct eigenvalues are orthogonal.
(c) The entire spectrum \(\sigma(T) \subseteq \mathbb{R}\).
(d) \(\|T\| = \sup_{\|x\| = 1} |\langle Tx, x \rangle|\).
Proof:
(a) If \(Tx = \lambda x\) with \(x \neq 0\), then
\(\lambda \langle x, x \rangle = \langle Tx, x \rangle = \langle x, Tx \rangle
= \overline{\lambda} \langle x, x \rangle\). Since \(\langle x, x \rangle \neq 0\),
we get \(\lambda = \overline{\lambda}\), so \(\lambda \in \mathbb{R}\).
(b) If \(Tx = \lambda x\) and \(Ty = \mu y\) with \(\lambda \neq \mu\),
then
\[
\lambda \langle x, y \rangle = \langle Tx, y \rangle = \langle x, Ty \rangle = \overline{\mu} \langle x, y \rangle = \mu \langle x, y \rangle,
\]
where the last step uses \(\mu \in \mathbb{R}\) from part (a).
Since \(\lambda \neq \mu\), we conclude \(\langle x, y \rangle = 0\).
(c) (sketch) For \(\lambda = \alpha + i\beta \in \mathbb{C}\) with \(\beta \neq 0\), self-adjointness gives the identity
\[
\|(T - \lambda I)x\|^2 \;=\; \|(T - \alpha I)x\|^2 + \beta^2 \|x\|^2 \;\geq\; \beta^2 \|x\|^2,
\]
so \(T - \lambda I\) is bounded below, hence injective with closed range. Applying the same estimate to \((T - \overline{\lambda}I)\) shows
it is injective, so \(\ker(T - \overline{\lambda}I) = \{0\}\).
Combined with the standard Hilbert-space identity \(\overline{\operatorname{Range}(S)} = \ker(S^*)^\perp\)
(applied to \(S = T - \lambda I\), so \(S^* = T - \overline{\lambda}I\)) and the closedness of the range established above, we get
\[
\operatorname{Range}(T - \lambda I) = \{0\}^\perp = \mathcal{H}.
\]
Thus \(T - \lambda I\) is bijective; the bound \(\|(T-\lambda I)x\| \geq |\beta|\|x\|\) from the initial inequality forces
\(\|(T-\lambda I)^{-1}\| \leq 1/|\beta|\), so \(\lambda \notin \sigma(T)\), giving \(\sigma(T) \subseteq \mathbb{R}\).
(d) Let \(M = \sup_{\|x\|=1} |\langle Tx, x\rangle|\). The bound \(M \leq \|T\|\) follows from Cauchy-Schwarz.
For the reverse, take unit vectors \(h, g \in \mathcal{H}\) and expand
\(\langle T(h \pm g), h \pm g\rangle\) using self-adjointness: the cross terms satisfy
\[
\begin{align*}
\langle Th, g\rangle + \langle Tg, h\rangle
&= \langle Th, g\rangle + \overline{\langle Th, g\rangle} \\\\
&= 2\operatorname{Re}\langle Th, g\rangle
\end{align*}
\]
(using self-adjointness \(\langle Tg, h\rangle = \langle g, Th\rangle\) followed by the Hermitian symmetry of the inner product), so
\[
\langle T(h \pm g), h \pm g\rangle \;=\; \langle Th, h\rangle \pm 2\operatorname{Re}\langle Th, g\rangle + \langle Tg, g\rangle.
\]
Subtracting the two equations gives
\[
4\operatorname{Re}\langle Th, g\rangle = \langle T(h+g), h+g\rangle - \langle T(h-g), h-g\rangle.
\]
Applying \(|\langle Tf, f\rangle| \leq M\|f\|^2\) and the parallelogram law,
\[
4\operatorname{Re}\langle Th, g\rangle \;\leq\; M\bigl(\|h+g\|^2 + \|h-g\|^2\bigr) \;=\; 2M\bigl(\|h\|^2 + \|g\|^2\bigr) \;=\; 4M.
\]
Replacing \(h\) with \(e^{-i\theta}h\) for \(\theta = \arg\langle Th, g\rangle\) converts the real part into the modulus:
\(|\langle Th, g\rangle| \leq M\) for all unit \(h, g\). Taking the supremum over unit \(g\) yields \(\|Th\| \leq M\)
(using the standard Hilbert-space identity \(\|y\| = \sup_{\|g\|=1}|\langle y, g\rangle|\)); taking it over unit \(h\) yields \(\|T\| \leq M\).
Property (d) is particularly important — it says the operator norm of a self-adjoint
operator is determined by its quadratic form \(\langle Tx, x \rangle\),
the infinite-dimensional analogue of the Rayleigh quotient from
quadratic forms.
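Both property (d) and the key inequality behind (c) can be sanity-checked on a random Hermitian matrix, a finite-dimensional stand-in for a self-adjoint operator (the matrix size and the test value \(\lambda = 0.7 + 0.3i\) are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
T = (B + B.conj().T) / 2          # random Hermitian matrix: T = T*
I = np.eye(n)

# Property (d): ||T|| = sup |<Tx, x>|, attained at an extremal eigenvector
w = np.linalg.eigvalsh(T)
quad_sup = np.max(np.abs(w))
assert abs(np.linalg.norm(T, 2) - quad_sup) < 1e-10

# random unit vectors never exceed the quadratic-form bound
for _ in range(200):
    x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    x /= np.linalg.norm(x)
    assert abs(np.vdot(x, T @ x)) <= quad_sup + 1e-10

# The identity from the proof of (c), for lam = alpha + i*beta with beta != 0:
# ||(T - lam I)x||^2 = ||(T - alpha I)x||^2 + beta^2 ||x||^2
alpha, beta = 0.7, 0.3
lam = alpha + 1j * beta
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
lhs = np.linalg.norm((T - lam * I) @ x) ** 2
rhs = np.linalg.norm((T - alpha * I) @ x) ** 2 + beta**2 * np.linalg.norm(x) ** 2
assert abs(lhs - rhs) < 1e-6
```

The cross term vanishes in the identity exactly because \(\langle (T - \alpha I)x, x\rangle\) is real for self-adjoint \(T\) and real \(\alpha\), which is what forces non-real \(\lambda\) out of the spectrum.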
Connection to Quantum Mechanics
In quantum mechanics, observables (position, momentum, energy) are modeled as
self-adjoint operators on a Hilbert space. The requirement that eigenvalues be real
(property (a)) ensures that measurement outcomes are real numbers.
The orthogonality of eigenvectors (property (b)) ensures that distinct measurement
outcomes correspond to mutually exclusive states — the mathematical basis of the
Bra-Ket formalism.
Compact Operators: "Almost Finite-Dimensional"
The right shift example in From Eigenvalues to the Spectrum showed that a general
bounded operator can have a continuous spectrum with no eigenvalues at all. To recover the discrete,
diagonalizable structure of finite-dimensional linear algebra, we restrict our attention to a special
class of operators that are, in a precise sense, "close to finite-dimensional."
Definition: Compact Operator
A bounded linear operator \(T : \mathcal{X} \to \mathcal{Y}\) between normed spaces
is called compact if it maps every bounded set in \(\mathcal{X}\)
to a relatively compact (precompact) set in \(\mathcal{Y}\).
Since \(\mathcal{Y}\) is a metric space, this is
equivalent to:
for every bounded sequence \(\{x_n\}\) in \(\mathcal{X}\),
the image sequence \(\{Tx_n\}\) has a convergent subsequence in \(\mathcal{Y}\).
Recall from Intro to Functional Analysis
that in infinite dimensions, bounded sequences need not have convergent subsequences —
the closed unit ball is not compact. A compact operator "repairs" this: it
compresses the infinite-dimensional unit ball into a set that is
(relatively) compact. In other words, it "squashes away" the infinite-dimensional
directions, leaving behind something that behaves like a finite-dimensional object.
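A minimal sketch of this "squashing," using the first \(n\) standard basis vectors of \(\ell^2\) and a hypothetical diagonal compact operator \(Te_k = e_k/k\) (both choices purely illustrative):

```python
import numpy as np

n = 500
E = np.eye(n)   # columns: the first n standard basis vectors of l^2

# ||e_j - e_k|| = sqrt(2) for j != k: no subsequence of {e_n} converges,
# which is exactly why the closed unit ball of l^2 is not compact.
assert abs(np.linalg.norm(E[:, 0] - E[:, 1]) - np.sqrt(2)) < 1e-12

# A diagonal compact operator T e_k = e_k / k compresses the sequence:
# ||T e_k|| = 1/k -> 0, so the images {T e_k} converge (to 0).
weights = 1.0 / np.arange(1, n + 1)
image_norms = np.linalg.norm(np.diag(weights) @ E, axis=0)
assert np.allclose(image_norms, weights)
assert image_norms[-1] < 3e-3
```

The orthonormal directions stay a fixed distance \(\sqrt{2}\) apart, but their images collapse toward \(0\): the hallmark of compactness.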
Finite-Rank Operators
The simplest compact operators are those with finite-dimensional range.
Definition: Finite-Rank Operator
An operator \(T \in \mathcal{B}(\mathcal{X}, \mathcal{Y})\) is finite-rank
if \(\dim(\operatorname{Range}(T)) < \infty\). Every finite-rank operator is compact
(its image lies in a finite-dimensional subspace, where Heine-Borel applies).
The deep connection between compact and finite-rank operators is captured by the following
approximation property:
Theorem: Compact Operators as Limits of Finite-Rank Operators
Let \(\mathcal{H}\) be a Hilbert space. An operator \(T \in \mathcal{B}(\mathcal{H})\)
is compact if and only if there exists a sequence of finite-rank operators \(\{T_n\}\)
such that \(\|T - T_n\| \to 0\).
Proof sketch (Hilbert space case):
(\(\Leftarrow\)) Suppose \(T_n \to T\) in operator norm with each \(T_n\) finite-rank (hence compact).
Given a bounded sequence \(\{x_k\}\) with \(\|x_k\| \leq B\), iteratively extract subsequences: \(\{T_1 x_{k}\}\) has
a convergent subsequence indexed by \(k_1(j)\); within that, \(\{T_2 x_{k_1(j)}\}\) has a convergent subsequence indexed by \(k_2(j)\);
and so on. The diagonal \(y_j := x_{k_j(j)}\) satisfies: for each fixed \(n\), \(\{T_n y_j\}_{j \geq n}\) converges.
For any \(\varepsilon > 0\), choose \(n\) with \(\|T - T_n\| < \varepsilon/(3B)\); then for \(j, j'\) large enough that
\(\|T_n y_j - T_n y_{j'}\| < \varepsilon/3\),
\[
\|T y_j - T y_{j'}\| \;\leq\; \|(T - T_n)y_j\| + \|T_n y_j - T_n y_{j'}\| + \|(T_n - T)y_{j'}\| \;<\; \tfrac{\varepsilon}{3} + \tfrac{\varepsilon}{3} + \tfrac{\varepsilon}{3} = \varepsilon.
\]
Hence \(\{Ty_j\}\) is Cauchy in \(\mathcal{H}\), and converges by completeness. Thus \(T\) is compact.
(\(\Rightarrow\)) Separability of \(\overline{\operatorname{Range}(T)}\)
(which follows from compactness, since \(\overline{\operatorname{Range}(T)} = \overline{\bigcup_n T(\overline{B}_n)}\)
is the closure of a countable union of compact — hence separable — sets) lets us pick an orthonormal basis \(\{e_n\}\) for it.
Let \(P_n\) be orthogonal projection onto \(\operatorname{span}\{e_1, \ldots, e_n\}\), and set \(T_n = P_n T\).
Each \(T_n\) is finite-rank. Since \(T(\overline{B}_\mathcal{H})\) is relatively compact, for any \(\varepsilon > 0\)
it is covered by finitely many \(\varepsilon/2\)-balls centered at points \(y_1, \ldots, y_k\).
Choose \(N\) so that \(\|y_j - P_n y_j\| < \varepsilon/2\) for all \(j\) and \(n \geq N\)
(possible because \(P_n y_j \to y_j\) for each fixed \(y_j\) in the closed span of \(\{e_n\}\)).
For \(x \in \overline{B}_\mathcal{H}\), pick \(j\) with \(\|Tx - y_j\| < \varepsilon/2\); then
\[
\|(T - T_n) x\| \;=\; \|(I - P_n) Tx\| \;\leq\; \|(I - P_n)(Tx - y_j)\| + \|(I - P_n) y_j\| \;\leq\; \|Tx - y_j\| + \|y_j - P_n y_j\| \;<\; \varepsilon,
\]
using \(\|I - P_n\| \leq 1\). Taking the supremum over \(x \in \overline{B}_\mathcal{H}\) gives \(\|T - T_n\| \leq \varepsilon\) for \(n \geq N\).
Remark. This characterization relies on the approximation property, which every Hilbert space enjoys (the proof above uses only the orthogonal projections \(P_n\)).
In general Banach spaces the analogous statement — the approximation property — can fail (Enflo, 1973).
This is the functional-analytic analogue of the truncated SVD:
a compact operator can be approximated arbitrarily well by operators with
finite-dimensional range, just as any matrix can be approximated by low-rank matrices
obtained by discarding small singular values.
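The matrix half of this analogy is easy to verify numerically: truncating an SVD at rank \(k\) leaves spectral-norm error exactly \(\sigma_{k+1}\) (the Eckart-Young theorem). A sketch with a random \(80 \times 80\) matrix as an arbitrary example:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((80, 80))     # arbitrary example matrix
U, s, Vt = np.linalg.svd(A)           # s is sorted in decreasing order

# Rank-k truncation: keep the top k singular triples, discard the rest
for k in [1, 5, 20]:
    A_k = (U[:, :k] * s[:k]) @ Vt[:k]
    err = np.linalg.norm(A - A_k, 2)
    # the spectral-norm error is exactly the first discarded singular value
    assert abs(err - s[k]) < 1e-9
```

For a compact operator the same statement holds with \(\sigma_{k+1} \to 0\), which is why the finite-rank approximants \(T_k\) converge in operator norm.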
The Canonical Example: Integral Operators
Definition: Hilbert-Schmidt Integral Operators
Let \(K : [a,b] \times [a,b] \to \mathbb{C}\) be a square-integrable kernel, meaning
\[
\int_a^b \int_a^b |K(x,t)|^2 \, dx\, dt < \infty.
\]
The integral operator defined by
\[
(Tf)(x) = \int_a^b K(x, t)\, f(t)\, dt
\]
is a compact operator on \(L^2([a,b])\).
Such operators are called Hilbert-Schmidt operators, and they form
a two-sided ideal in \(\mathcal{B}(L^2)\).
The Hilbert-Schmidt norm is
\[
\|T\|_{\text{HS}} = \left(\int_a^b \int_a^b |K(x,t)|^2 \, dx\, dt\right)^{1/2},
\]
and it satisfies \(\|T\| \leq \|T\|_{\text{HS}}\)
(the operator norm is bounded by the Hilbert-Schmidt norm).
Proof sketch (compactness and norm bound):
Norm bound. For \(f \in L^2([a,b])\), Cauchy-Schwarz applied pointwise in \(x\) gives
\[
|(Tf)(x)|^2 \;=\; \left|\int_a^b K(x,t) f(t)\, dt\right|^2 \;\leq\; \left(\int_a^b |K(x,t)|^2\, dt\right) \|f\|_2^2.
\]
Integrating in \(x\) yields \(\|Tf\|_2^2 \leq \|T\|_{\text{HS}}^2 \|f\|_2^2\), i.e., \(\|T\| \leq \|T\|_{\text{HS}}\).
Compactness. Fix an orthonormal basis \(\{\varphi_i\}\) of \(L^2([a,b])\); then \(\{\varphi_i(x)\varphi_j(t)\}\)
is an orthonormal basis of \(L^2([a,b]^2)\).
Expand \(K = \sum_{i,j} c_{ij}\, \varphi_i(x)\varphi_j(t)\) and let \(K_N\) be the truncation to \(i, j \leq N\).
The corresponding operator \(T_N\) is finite-rank, and \(\|T - T_N\| \leq \|T - T_N\|_{\text{HS}} \to 0\) by Parseval.
By the finite-rank approximation theorem above, \(T\) is compact.
These integral operators are the "continuous matrices" of functional analysis.
The kernel function \(K(x,t)\) plays the role of the matrix entries \(a_{ij}\),
and the integral replaces the finite sum. When the kernel is Hermitian
(\(K(x,t) = \overline{K(t,x)}\); for a real kernel, simply \(K(x,t) = K(t,x)\)), the integral operator is self-adjoint — the
continuous analogue of a Hermitian (symmetric) matrix.
Key Properties of Compact Operators
Theorem: Properties of Compact Operators
(a) The set of compact operators \(\mathcal{K}(\mathcal{X}, \mathcal{Y})\)
is a closed subspace of \(\mathcal{B}(\mathcal{X}, \mathcal{Y})\).
(b) If \(T\) is compact and \(S\) is bounded, then both \(ST\) and \(TS\)
are compact (compact operators form a two-sided ideal in \(\mathcal{B}(\mathcal{H})\)).
(c) If \(T\) is compact, then \(T^*\) is compact (Schauder's theorem).
(d) The identity operator \(I\) on an infinite-dimensional space is
never compact (since the closed unit ball is not compact).
Proof:
(a) Linearity (closure under addition and scalar multiplication) is straightforward from the subsequence-extraction characterization.
Closedness in operator norm was proved as the (\(\Leftarrow\)) direction of
Compact Operators as Limits of Finite-Rank Operators for the Hilbert case;
the same diagonal argument (no use of finite-rank approximability) works for general normed spaces, replacing "finite-rank" by "compact" throughout:
a sequence \(\{T_n\}\) of compact operators with \(T_n \to T\) in norm and a bounded \(\{x_k\}\) admit, by iterated subsequence extraction,
a diagonal \(\{y_j\}\) on which every \(T_n\) converges; the \(3\varepsilon\) argument then gives \(\{Ty_j\}\) Cauchy, hence convergent.
(b) Let \(\{x_n\}\) be a bounded sequence in \(\mathcal{X}\).
For \(TS\): \(\{Sx_n\}\) is bounded (\(\|Sx_n\| \leq \|S\|\|x_n\|\)), so by compactness of \(T\), \((TS)x_n = T(Sx_n)\) has a convergent subsequence.
For \(ST\): by compactness of \(T\), \(\{Tx_n\}\) has a convergent subsequence \(Tx_{n_k} \to y\); by continuity of \(S\), \(S(Tx_{n_k}) \to Sy\),
so \((ST)x_n\) has a convergent subsequence. Hence both \(ST\) and \(TS\) are compact.
(c) The compactness of \(T^*\) given that of \(T\) — Schauder's theorem
— is a deeper result requiring the Arzelà-Ascoli theorem applied to the family \(\{x \mapsto \langle Tx, y\rangle\}_{\|y\|\leq 1}\)
of equicontinuous functions on a totally bounded set.
(d) If \(I\) were compact, the image of the unit ball under \(I\) — which is the unit ball itself — would be relatively compact.
But in infinite dimensions, the closed unit ball is not compact: this is exactly
the Riesz characterization of finite-dimensionality.
Property (d) highlights the key distinction: compact operators are the "well-behaved"
operators that do not include the identity — they necessarily "lose dimensions"
in the sense that their image cannot contain an open set.
Connection to Machine Learning: Kernel Matrices as Finite Approximations
In Kernel PCA,
we compute the Gram matrix \(K_{ij} = k(x_i, x_j)\) for \(N\) data points
and extract its eigenvectors. This \(N \times N\) matrix is a
finite-dimensional discretization of the underlying integral operator
\((Tf)(x) = \int k(x, t) f(t)\, d\mu(t)\) on the function space, obtained by replacing the measure \(\mu\)
with the empirical measure \(\mu_N = \tfrac{1}{N}\sum_i \delta_{x_i}\).
Mercer-type results guarantee that as \(N \to \infty\) and \(\mu_N \to \mu\)
(e.g., for i.i.d. samples by
the Glivenko-Cantelli theorem),
the eigenvalues and eigenfunctions of the
finite Gram matrix converge to those of the underlying compact integral operator — justifying why kernel methods
extract meaningful spectral structure from finite data.
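A sketch of this convergence, replacing i.i.d. samples with a quasi-uniform midpoint grid for reproducibility. The kernel \(k(s,t) = \min(s,t)\) is an illustrative choice whose integral operator has known top eigenvalue \(4/\pi^2\):

```python
import numpy as np

def top_gram_eigenvalue(N):
    # quasi-uniform "sample" on [0,1] (midpoint grid, an idealization of
    # i.i.d. uniform draws, chosen so the experiment is deterministic)
    x = (np.arange(N) + 0.5) / N
    K = np.minimum.outer(x, x)          # Gram matrix K_ij = k(x_i, x_j)
    # eigenvalues of (1/N) * Gram matrix approximate those of the integral operator
    return np.linalg.eigvalsh(K / N)[-1]

lam1_exact = 4 / np.pi**2   # top eigenvalue of (Tf)(x) = \int_0^1 min(x,t) f(t) dt
err_small = abs(top_gram_eigenvalue(100) - lam1_exact)
err_large = abs(top_gram_eigenvalue(800) - lam1_exact)
assert err_large < err_small   # refining the sample improves the approximation
assert err_large < 1e-2
```

The \(1/N\) scaling of the Gram matrix is the empirical-measure normalization \(\mu_N = \tfrac{1}{N}\sum_i \delta_{x_i}\) described above; without it, the eigenvalues grow with \(N\) instead of converging.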
The Spectral Theorem for Compact Self-Adjoint Operators
We now arrive at the central result of this chapter — the infinite-dimensional
generalization of the
Spectral Theorem for Symmetric Matrices.
Theorem: Spectral Theorem (Compact Self-Adjoint Operators)
Let \(\mathcal{H}\) be a separable Hilbert space and
\(T = T^* \in \mathcal{B}(\mathcal{H})\) a compact, self-adjoint operator. Then:
(a) There exists a finite or countably infinite orthonormal sequence
of eigenvectors \(\{e_n\}_{n=1}^N\) (\(N \leq \infty\)) with corresponding
real eigenvalues \(\{\lambda_n\}\), such that
\[
Tx = \sum_{n=1}^{N} \lambda_n \langle x, e_n \rangle \, e_n \quad \text{for all } x \in \mathcal{H}.
\]
(b) If \(N = \infty\), then \(\lambda_n \to 0\) as \(n \to \infty\).
(c) Each nonzero eigenvalue has finite multiplicity
(its eigenspace is finite-dimensional).
(d) The only possible accumulation point of \(\{\lambda_n\}\) is \(0\).
(e) The eigenvectors \(\{e_n\}\) together with an orthonormal basis for
\(\ker(T)\) form a complete orthonormal basis for \(\mathcal{H}\).
The Spectral Theorem resurrects the full power of finite-dimensional diagonalization:
a compact self-adjoint operator is completely determined by its eigenvalues and eigenvectors,
just as a symmetric matrix is determined by \(A = Q\Lambda Q^T\).
The only new feature is that the eigenvalues must decay to zero — the
"price" of being compact in infinite dimensions.
Comparison: Finite vs. Infinite Dimensions
| Property | Symmetric Matrix \(A \in \mathbb{R}^{n \times n}\) | Compact Self-Adjoint \(T \in \mathcal{B}(\mathcal{H})\) |
| --- | --- | --- |
| Eigenvalues | Finitely many (\(n\)), all real | Countably many, all real, \(\lambda_n \to 0\) |
| Eigenvectors | Orthonormal basis for \(\mathbb{R}^n\) | Span \(\overline{\operatorname{Range}(T)}\); together with a basis for \(\ker(T)\), form an orthonormal basis for \(\mathcal{H}\) |
| Decomposition | \(A = Q\Lambda Q^T = \sum_{i=1}^{n} \lambda_i \mathbf{v}_i \mathbf{v}_i^T\) | \(Tx = \sum_{n=1}^{N} \lambda_n \langle x, e_n \rangle \, e_n\) (\(N \leq \infty\)) |
| Truncation | Low-rank approximation (truncated SVD) | Finite-rank approximation (\(\lVert T - T_k\rVert \leq \lvert\lambda_{k+1}\rvert\)) |
| Zero eigenvalue | Indicates rank deficiency | Always in \(\sigma(T)\) if \(\dim(\mathcal{H}) = \infty\) |
Proof Sketch
The proof proceeds by an iterative maximization argument that parallels the
Rayleigh quotient construction from finite-dimensional linear algebra.
Proof Sketch:
Step 1: Find the first eigenvector.
Since \(T\) is self-adjoint, \(\|T\| = \sup_{\|x\|=1} |\langle Tx, x \rangle|\). By
weak compactness of the unit ball
(every Hilbert space is reflexive) and compactness of \(T\), there exists \(x_* \in \overline{B}_\mathcal{H}\) achieving
\(|\langle Tx_*, x_* \rangle| = \|T\|\).
(Since \(T\) is compact, \(x_n \rightharpoonup x\) implies \(Tx_n \to Tx\) strongly;
combined with weak convergence of \(x_n\), this gives \(\langle Tx_n, x_n \rangle \to \langle Tx, x \rangle\), so
the map \(x \mapsto \langle Tx, x \rangle\) is weakly continuous and attains its extrema on the weakly
compact unit ball.) Since \(|\langle Tx, x\rangle|\) scales quadratically with \(\|x\|\), the supremum over \(\overline{B}\)
equals the supremum over the unit sphere (assuming \(T \neq 0\)), so
we may assume \(\|x_*\| = 1\); set \(e_1 = x_*\) and \(\lambda_1 = \langle Te_1, e_1 \rangle\).
Variational argument for \(Te_1 = \lambda_1 e_1\).
For any \(v \perp e_1\) and \(t \in \mathbb{R}\), consider
\(\phi(t) = \langle T(e_1 + tv), e_1 + tv \rangle / \|e_1 + tv\|^2\).
Since \(t = 0\) is an extremum of \(|\phi|\) and \(T\) is self-adjoint, differentiating at \(t = 0\) gives
\(\phi'(0) = 2\operatorname{Re} \langle Te_1 - \lambda_1 e_1, v \rangle = 0\);
replacing \(v\) with \(iv\) over \(\mathbb{C}\) gives the imaginary part, so
\(\langle Te_1 - \lambda_1 e_1, v \rangle = 0\) for all \(v \perp e_1\).
Thus \(Te_1 - \lambda_1 e_1 \in \{e_1\}^{\perp\perp} = \operatorname{span}\{e_1\}\)
(since \(\operatorname{span}\{e_1\}\) is one-dimensional, hence closed,
and the double-orthogonal-complement of a closed subspace equals itself by the Projection Theorem); combined with
\(\langle Te_1 - \lambda_1 e_1, e_1\rangle = 0\) by construction, we conclude \(Te_1 = \lambda_1 e_1\).
Step 2: Restrict and repeat.
The subspace \(\{e_1\}^\perp\) is invariant under \(T\) (because \(T\) is self-adjoint).
The restriction \(T|_{\{e_1\}^\perp}\) is still compact and self-adjoint,
so we repeat Step 1 to find \(e_2\) with \(|\lambda_2| = \|T|_{\{e_1\}^\perp}\| \leq |\lambda_1|\).
Step 3: Convergence to zero.
The sequence \(\{|\lambda_n|\}\) is nonincreasing and bounded below by 0, so it converges.
If it converged to some \(\epsilon > 0\), the orthonormal sequence \(\{e_n\}\) would satisfy
\(\|Te_m - Te_n\|^2 = \lambda_m^2 + \lambda_n^2 \geq 2\epsilon^2\), contradicting
the compactness of \(T\) (which requires \(\{Te_n\}\) to have a convergent subsequence).
Hence \(\lambda_n \to 0\).
Closing the remaining claims.
Steps 1–3 yield (b) directly. Statement (d) follows because the construction enumerates the eigenvalues in nonincreasing order of \(|\lambda_n|\), so
any accumulation point must equal \(\lim |\lambda_n| = 0\).
For (c), if some nonzero eigenvalue \(\mu\) had infinite multiplicity, an orthonormal basis \(\{f_k\}\) of its eigenspace would give
\(\|Tf_k - Tf_l\|^2 = 2\mu^2\) for \(k \neq l\), again contradicting compactness.
For (e), the constructed \(\{e_n\}\) span \(\overline{\operatorname{Range}(T)}\): any vector \(v\) orthogonal
to all \(e_n\) satisfies
\(\|Tv\| \leq \|T|_{\{e_1,\ldots,e_n\}^\perp}\| \cdot \|v\| = |\lambda_{n+1}| \|v\| \to 0\), forcing \(Tv = 0\),
hence \(v \in \ker(T)\). Since \(\mathcal{H}\) is separable, the closed subspace \(\ker(T)\) is also separable and admits a countable orthonormal basis.
Concatenating \(\{e_n\}\) with such a basis of \(\ker(T)\) gives a complete orthonormal basis
of \(\mathcal{H} = \overline{\operatorname{Range}(T)} \oplus \ker(T)\).
The expansion in (a) then follows: writing \(x = x_R + x_K\) with \(x_R \in \overline{\operatorname{Range}(T)}\)
and \(x_K \in \ker(T)\), Parseval gives \(x_R = \sum_n \langle x, e_n\rangle e_n\),
so \(Tx = Tx_R = \sum_n \lambda_n \langle x, e_n\rangle e_n\) by linearity and continuity.
Notice that Step 1 uses the Rayleigh quotient characterization, Step 2 uses the
orthogonal complement (Hilbert space structure), and Step 3 uses compactness
to force the eigenvalues to decay. Each ingredient is essential.
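The three steps can be replayed in finite dimensions, where a random symmetric matrix stands in for \(T\) and `numpy.linalg.eigh` replaces the variational argument (an illustrative sketch, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                  # random symmetric matrix, stand-in for T

w, V = np.linalg.eigh(A)
order = np.argsort(-np.abs(w))     # enumerate eigenvalues by decreasing |lambda|,
lam, E = w[order], V[:, order]     # as the iterative construction does

# Step 1: ||A|| is attained by the Rayleigh quotient at the top eigenvector
assert abs(abs(lam[0]) - np.linalg.norm(A, 2)) < 1e-10
assert abs(abs(E[:, 0] @ A @ E[:, 0]) - np.linalg.norm(A, 2)) < 1e-10

# Step 2: {e_1}^perp is invariant under A, and the restriction has norm |lambda_2|
P = np.eye(n) - np.outer(E[:, 0], E[:, 0])   # projection onto {e_1}^perp
assert np.linalg.norm((np.eye(n) - P) @ A @ P, 2) < 1e-10  # A maps e_1^perp into itself
assert abs(np.linalg.norm(P @ A @ P, 2) - abs(lam[1])) < 1e-10

# Step 3 analogue: the deflated norms |lambda_1| >= |lambda_2| >= ... are nonincreasing
assert np.all(np.diff(np.abs(lam)) <= 1e-12)
```

In finite dimensions the sequence \(|\lambda_n|\) simply terminates; only in infinite dimensions does Step 3 need compactness to force decay to zero.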
The Singular Value Decomposition in Infinite Dimensions
Just as the finite-dimensional SVD extends the Spectral Theorem from symmetric to general matrices,
there is an infinite-dimensional SVD for compact operators.
Theorem: Singular Value Decomposition (Compact Operators)
Let \(T \in \mathcal{B}(\mathcal{H})\) be a compact operator (not necessarily self-adjoint).
Then there exist orthonormal sequences \(\{e_n\}_{n=1}^N\) and \(\{f_n\}_{n=1}^N\) (\(N \leq \infty\)) and a nonincreasing
sequence of strictly positive reals \(\sigma_1 \geq \sigma_2 \geq \cdots > 0\)
(the singular values) such that
\[
Tx = \sum_{n=1}^{N} \sigma_n \langle x, e_n \rangle \, f_n.
\]
The singular values are the positive square roots of the nonzero eigenvalues of \(T^*T\), and \(\sigma_n \to 0\) when \(N = \infty\).
Proof sketch:
By Schauder's theorem,
\(T^*\) is compact; hence \(T^*T\) is compact, self-adjoint, and positive (\(\langle T^*Tx, x\rangle = \|Tx\|^2 \geq 0\)).
The Spectral Theorem above applied to \(T^*T\) yields its nonzero eigenvalues \(\{\sigma_n^2\}_{n=1}^N\) (\(\sigma_n^2 > 0\),
nonincreasing, \(\sigma_n^2 \to 0\) when \(N = \infty\)) with corresponding orthonormal eigenvectors \(\{e_n\}\)
spanning \(\overline{\operatorname{Range}(T^*T)}\).
Define \(\sigma_n = \sqrt{\sigma_n^2} > 0\) and \(f_n = Te_n / \sigma_n\). Direct computation gives
\[
\begin{align*}
\langle f_n, f_m\rangle &= \langle Te_n, Te_m\rangle/(\sigma_n \sigma_m) \\\\
&= \langle T^*Te_n, e_m\rangle/(\sigma_n\sigma_m) \\\\
&= \sigma_n^2\delta_{nm}/(\sigma_n\sigma_m) \\\\
&= \delta_{nm},
\end{align*}
\]
so \(\{f_n\}\) is orthonormal.
For the expansion: note that \(\ker(T^*T) = \ker(T)\), since \(T^*Tx = 0\) implies
\(\|Tx\|^2 = \langle T^*Tx, x\rangle = 0\), hence \(Tx = 0\), and conversely.
Decompose any \(x \in \mathcal{H}\) as \(x = x_R + x_K\) with
\(x_R \in \overline{\operatorname{Range}(T^*T)} = \ker(T^*T)^\perp = \ker(T)^\perp\) and \(x_K \in \ker(T)\).
By Parseval on the ON basis \(\{e_n\}\) of \(\overline{\operatorname{Range}(T^*T)}\), \(x_R = \sum_n \langle x, e_n\rangle e_n\).
Since \(Tx_K = 0\) and \(T\) is continuous,
\[
Tx \;=\; Tx_R \;=\; \sum_n \langle x, e_n\rangle Te_n \;=\; \sum_n \sigma_n \langle x, e_n\rangle f_n.
\]
This is the precise infinite-dimensional analogue of \(A = U\Sigma V^T\).
The \(\{e_n\}\) are the "right singular vectors," the \(\{f_n\}\) are the
"left singular vectors," and the \(\sigma_n\) decay to zero because \(T\) is compact.
Functional PCA & The Path to Kernels
The Spectral Theorem is not merely an abstract result — it is the mathematical engine
behind several of the most important algorithms in machine learning and statistics.
In this final section, we connect the theory to three concrete applications.
Application 1: The Karhunen-Loève Expansion (Functional PCA)
In PCA, we performed
eigendecomposition on the covariance matrix \(\Sigma \in \mathbb{R}^{n \times n}\)
of a finite-dimensional random vector. But what if our data consists of
random functions — e.g., time series, curves, or images viewed as functions
on a domain?
Let \(X(t)\) be a centered, square-integrable stochastic process on \([a,b]\), with covariance function
\[
C(s, t) = \mathbb{E}[X(s)\, \overline{X(t)}].
\]
The kernel \(C\) is Hermitian symmetric (\(C(s,t) = \overline{C(t,s)}\)) and positive semi-definite.
The most common ML setting takes \(X\) real-valued, in which case the conjugate is a no-op and \(C\) reduces to
a real symmetric kernel \(C(s,t) = \mathbb{E}[X(s) X(t)]\);
the formulation here covers both this classical case and complex-valued processes
(e.g., complex baseband signals in communications, quantum stochastic processes).
The covariance operator
\(\mathcal{C} : L^2([a,b]) \to L^2([a,b])\) is defined by
\[
(\mathcal{C}f)(s) = \int_a^b C(s, t)\, f(t)\, dt.
\]
This is a Hilbert-Schmidt integral operator with a Hermitian, positive semi-definite kernel,
so it is compact and self-adjoint. The Spectral Theorem applies directly.
Theorem: Karhunen-Loève Expansion
Let \(X(t)\) be a centered, square-integrable stochastic process on \([a,b]\) with
continuous covariance function \(C(s,t)\). Let \(\{\lambda_n, e_n\}\) be the eigenvalues
and eigenfunctions of the covariance operator \(\mathcal{C}\). Then
\[
X(t) = \sum_{n=1}^{\infty} Z_n \, e_n(t),
\]
where \(Z_n = \langle X, e_n \rangle = \int_a^b X(t)\, \overline{e_n(t)}\, dt\) are uncorrelated
random variables with \(\mathbb{E}[Z_n] = 0\) and \(\text{Var}(Z_n) = \lambda_n\).
The series converges in mean square (in \(L^2\) over the underlying probability space), uniformly in \(t \in [a,b]\).
Proof sketch:
Apply the Spectral Theorem to the compact self-adjoint covariance operator \(\mathcal{C}\) to obtain
orthonormal eigenfunctions \(\{e_n\}\) with eigenvalues \(\lambda_n \geq 0\)
(positivity from \(\langle \mathcal{C}f, f\rangle = \mathbb{E}[|\langle X, f\rangle|^2] \geq 0\),
real-valuedness of \(\lambda_n\) from self-adjointness).
Define \(Z_n = \langle X, e_n\rangle\). Centering of \(X\) gives \(\mathbb{E}[Z_n] = 0\).
By Fubini's theorem and the eigenvalue equation \(\mathcal{C}e_m = \lambda_m e_m\),
\[
\mathbb{E}[Z_n\, \overline{Z_m}] \;=\; \int_a^b \!\!\int_a^b C(s,t)\, \overline{e_n(s)}\, e_m(t)\, ds\, dt \;=\; \langle \mathcal{C}e_m, e_n\rangle \;=\; \lambda_m \langle e_m, e_n\rangle \;=\; \lambda_m\, \delta_{nm},
\]
giving uncorrelatedness and \(\text{Var}(Z_n) = \mathbb{E}[|Z_n|^2] = \lambda_n\).
For the convergence, a direct computation using \(\mathbb{E}[Z_n\, \overline{X(t)}] = \lambda_n\, \overline{e_n(t)}\) gives
\[
\mathbb{E}\Bigl[\Bigl|X(t) - \sum_{n=1}^{N} Z_n\, e_n(t)\Bigr|^2\Bigr] \;=\; C(t,t) \;-\; \sum_{n=1}^{N} \lambda_n\, |e_n(t)|^2,
\]
which tends to zero as \(N \to \infty\) because the continuous kernel \(C\) admits the uniformly convergent expansion \(C(t,t) = \sum_n \lambda_n |e_n(t)|^2\) (Mercer's expansion, previewed in Application 3 below).
For the full argument, see Hsing & Eubank, Theoretical Foundations of Functional Data Analysis, Ch. 2.
This is literally infinite-dimensional PCA. The eigenfunctions
\(\{e_n\}\) are the principal components (the directions of maximum variance),
the eigenvalues \(\lambda_n\) give the variance in each direction,
and the random coefficients \(Z_n\) are the "scores."
The decay \(\lambda_n \to 0\) guarantees that truncating after \(k\) terms
captures most of the variance — the continuous analogue of the scree plot.
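As a hedged numerical sketch (an illustrative example, not part of the text's development): standard Brownian motion on \([0,1]\) has covariance \(C(s,t) = \min(s,t)\), with known Karhunen-Loève eigenvalues \(\lambda_n = \bigl((n - \tfrac12)\pi\bigr)^{-2}\). Discretizing the covariance operator on a grid recovers them:

```python
import numpy as np

# Discretized KL for Brownian motion on [0, 1]: C(s, t) = min(s, t).
# Grid size N is an arbitrary choice; accuracy improves as N grows.
N = 500
t = (np.arange(N) + 0.5) / N                 # midpoint grid
dt = 1.0 / N
C = np.minimum.outer(t, t)                   # covariance matrix C(s_i, t_j)

# Discretized covariance operator: (Cf)(s) ~ sum_j C(s, t_j) f(t_j) dt.
evals = np.linalg.eigh(C * dt)[0][::-1]      # eigenvalues, descending

# Compare the top 5 to the analytic lambda_n = 1 / ((n - 1/2) pi)^2.
analytic = 1.0 / (((np.arange(1, 6) - 0.5) * np.pi) ** 2)
assert np.allclose(evals[:5], analytic, rtol=1e-2)
```

The corresponding eigenvectors approximate the eigenfunctions \(\sqrt{2}\,\sin\bigl((n - \tfrac12)\pi t\bigr)\); truncating the expansion after a few modes is exactly the functional-PCA compression described above.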
Application 2: Spectral Clustering via the Continuous Laplacian
In Spectral Clustering,
we used eigenvectors of the discrete graph Laplacian \(\mathbf{L} = \mathbf{D} - \mathbf{W}\)
to embed data into a low-dimensional space and then applied K-means. The continuous
analogue replaces the graph with a smooth domain and the discrete Laplacian
with the Laplace-Beltrami operator on a manifold.
On a compact Riemannian manifold (or on a bounded domain with appropriate boundary conditions),
the negative Laplace-Beltrami operator \(-\Delta\)
is a positive self-adjoint unbounded operator with compact resolvent \((-\Delta + I)^{-1}\).
Although \(-\Delta\) itself is unbounded — outside the scope of the present chapter's spectral theorem — applying the Spectral Theorem
to its compact self-adjoint resolvent yields an orthonormal basis of eigenfunctions \(\{e_n\}\) for \(L^2\) on the manifold; these are
simultaneously eigenfunctions of \(-\Delta\), with eigenvalues
\(0 = \lambda_0 \leq \lambda_1 \leq \cdots \to \infty\).
The first few eigenfunctions — the "low-frequency modes" — capture the large-scale
geometry of the manifold, precisely as the Fiedler vector captures the main partition
of a graph in the discrete case.
From Graphs to Manifolds: The Convergence
As the number of data points \(N \to \infty\) and the graph becomes denser,
the discrete graph Laplacian converges (in a suitable sense) to the
Laplace-Beltrami operator on the underlying data manifold.
The eigenvectors of the graph Laplacian converge to the eigenfunctions of
the continuous operator. This is the rigorous justification for spectral clustering:
the discrete algorithm is a finite-sample approximation of a well-defined
continuous operation, and the Spectral Theorem guarantees that the continuous
object has the orthogonal eigenfunction structure we need.
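This convergence can be glimpsed in the simplest possible example: the Laplacian of a path graph on \(N\) nodes is a discretization of \(-d^2/dx^2\) on \([0,1]\) with Neumann boundary conditions, and its eigenvectors are exactly sampled cosine modes. A minimal NumPy sketch (grid size and number of modes checked are arbitrary choices):

```python
import numpy as np

# Path graph on N nodes: adjacency W has ones on the off-diagonals,
# and L = D - W discretizes -d^2/dx^2 with Neumann boundary conditions.
N = 200
W = np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)
L = np.diag(W.sum(axis=1)) - W

evals, evecs = np.linalg.eigh(L)

# Exact eigenvalues: 4 sin^2(k pi / (2N)) ~ (k pi / N)^2 for small k,
# mirroring the continuous spectrum lambda_k = (k pi)^2 after rescaling.
k = np.arange(4)
assert np.allclose(evals[:4], 4 * np.sin(k * np.pi / (2 * N)) ** 2)

# The k-th eigenvector is (up to sign) the sampled cosine mode cos(k pi x).
x = (np.arange(N) + 0.5) / N
mode1 = np.cos(np.pi * x)
mode1 /= np.linalg.norm(mode1)
assert min(np.linalg.norm(evecs[:, 1] - mode1),
           np.linalg.norm(evecs[:, 1] + mode1)) < 1e-6
```

The second eigenvector (the Fiedler vector) is the sampled first cosine mode, which changes sign once across the domain: the one-dimensional shadow of the partition structure used in spectral clustering.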
Application 3: The Bridge to Mercer's Theorem and RKHS
Consider a continuous, Hermitian (\(K(x,t) = \overline{K(t,x)}\)), positive semi-definite kernel
\(K : [a,b] \times [a,b] \to \mathbb{C}\). The integral operator
\[
(T_K f)(x) = \int_a^b K(x, t)\, f(t)\, dt
\]
is compact, self-adjoint, and positive (all eigenvalues \(\lambda_n \geq 0\)).
The Spectral Theorem yields the eigenfunction expansion
\[
K(x, t) = \sum_{n=1}^{\infty} \lambda_n \, e_n(x)\, \overline{e_n(t)},
\]
convergent in \(L^2\); upgrading this to absolute and uniform convergence requires the positivity of the kernel and is exactly what Mercer's Theorem provides. The most common ML setting uses real symmetric kernels
(Gaussian, polynomial, Laplacian, Matérn, etc.);
in that case the conjugate is a no-op and the formula reduces to \(K(x,t) = \sum_n \lambda_n e_n(x) e_n(t)\).
This eigenfunction expansion has a profound consequence: the kernel \(K\) defines
an implicit feature map
\[
\Phi(x) = \bigl(\sqrt{\lambda_1}\, e_1(x),\; \sqrt{\lambda_2}\, e_2(x),\; \ldots\bigr) \in \ell^2,
\]
such that \(K(x, t) = \langle \Phi(x), \Phi(t) \rangle_{\ell^2}\). (With the inner product linear in the first argument and conjugate-linear in the second, the Hermitian conjugate \(\overline{e_n(t)}\) appears automatically; in the real case both sides reduce to \(K(x,t) = \sum_n \lambda_n e_n(x) e_n(t)\).)
The kernel "trick" — computing inner products in a high-dimensional feature space
without ever forming \(\Phi\) explicitly — is mathematically grounded in this
spectral decomposition.
This is the content of Mercer's Theorem, which we will state and prove
rigorously in the next chapter (RKHS & Kernel Methods).
The Spectral Theorem of the present chapter provides the engine; Mercer's Theorem adds the fuel — positivity of the kernel —
to build the Reproducing Kernel Hilbert Space.
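As a finite-sample sanity check (illustrative only; the Gaussian kernel width is an arbitrary choice), the eigendecomposition of a kernel Gram matrix furnishes an explicit feature map whose inner products reproduce the kernel, a discrete mirror of \(\Phi\) above:

```python
import numpy as np

# Gram matrix of a real symmetric Gaussian kernel on a grid of points.
x = np.linspace(0.0, 1.0, 50)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)

# Eigendecompose: K = sum_n lambda_n v_n v_n^T, all lambda_n >= 0 (PSD).
evals, evecs = np.linalg.eigh(K)
evals = np.clip(evals, 0.0, None)            # clip tiny negative round-off

# Feature map: row i of Phi is Phi(x_i) = (sqrt(lambda_n) v_n[i])_n.
Phi = evecs * np.sqrt(evals)

# Inner products in feature space reproduce the kernel exactly.
assert np.allclose(Phi @ Phi.T, K, atol=1e-8)
```

The eigenvalues of such a Gram matrix decay rapidly, so a few leading features already reconstruct \(K\) accurately; this is the finite-sample face of the \(\lambda_n \to 0\) decay that makes the kernel trick computationally viable.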
Summary: The Spectral Thread
Let us trace the spectral idea through our entire curriculum:
- Finite-dimensional
(Symmetry & SVD):
\(A = Q\Lambda Q^T\). Finitely many eigenvalues, an orthonormal eigenbasis of \(\mathbb{R}^n\). Powers PCA, SVD.
- Discrete infinite
(Graph Laplacians):
\(\mathbf{L} = \mathbf{D} - \mathbf{W}\). Discrete eigenvalues of growing matrices.
Powers Spectral Clustering, Graph Fourier Transform.
- Continuous infinite (this chapter):
\(Tx = \sum \lambda_n \langle x, e_n \rangle e_n\). Countable eigenvalues decaying to 0.
Powers Functional PCA, continuous Laplacians.
- Next:
Mercer's Theorem converts the spectral decomposition of a positive kernel
into an explicit feature space — the mathematical foundation of
SVMs, Gaussian Processes, and Kernel PCA.