Point Evaluation & The Need for RKHS
Throughout the Functional Analysis series — from
Banach & Hilbert Spaces through
Spectral Theory — we have
treated functions as "points" in abstract infinite-dimensional spaces. We measured
their "length" via norms, their "angle" via inner products, and their "spectrum"
via eigenvalue decomposition. But there is one seemingly innocent operation we have
not yet examined: evaluating a function at a point.
Convention for this chapter. Our focus is the standard machine-learning
setting, in which kernels and the corresponding Hilbert space \(\mathcal{H}_K\) are
real-valued (RBF, polynomial, Laplacian, and Matérn kernels are all real).
The inner product is therefore symmetric,
\(\langle u, v\rangle_{\mathcal{H}_K} = \langle v, u\rangle_{\mathcal{H}_K}\), so the
first-slot-linear convention adopted in
Dual Spaces
trivially restricts here. The complex Hermitian theory developed in
Spectral Theory specializes cleanly to
this real-symmetric case — the conjugation in the Hermitian inner product becomes a no-op
when all quantities are real-valued. We will state this reduction explicitly when it
matters (notably in the Mercer expansion).
Given a function \(f\) in some Hilbert space \(\mathcal{H}\), define the
point evaluation functional at \(x \in \mathcal{X}\):
\[
\delta_x : \mathcal{H} \to \mathbb{R}, \qquad \delta_x(f) = f(x).
\]
This is clearly linear: \(\delta_x(\alpha f + \beta g) = \alpha f(x) + \beta g(x)\).
The critical question is: is it continuous?
In \(L^2([a,b])\), the answer is no. Functions in \(L^2\) are equivalence
classes defined up to sets of measure zero — changing a function at a single point
does not change its \(L^2\) norm. Consequently, point evaluation is not a
bounded linear functional
on \(L^2\), and \(\delta_x \notin (L^2)^*\). We cannot meaningfully "evaluate" an
\(L^2\) function at a point.
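To make the failure concrete, consider the spike \(f_n = \sqrt{n}\,\mathbf{1}_{[0,1/n)}\) on \([0,1]\): its \(L^2\) norm is \(1\) for every \(n\), yet \(f_n(0) = \sqrt{n} \to \infty\), so \(\delta_0\) cannot be bounded. A minimal numerical sketch (the spike family is an illustrative choice, not from the text above; assumes numpy):

```python
import numpy as np

# Spike f_n = sqrt(n) on [0, 1/n), 0 elsewhere. Its squared L^2 norm is
# n * (1/n) = 1 for every n, but its value at 0 is sqrt(n) -> infinity,
# so the evaluation functional delta_0 is unbounded on L^2([0, 1]).
def spike_stats(n):
    value_at_zero = np.sqrt(n)
    l2_norm_squared = n * (1.0 / n)   # exact integral of f_n^2 over [0, 1]
    return value_at_zero, l2_norm_squared

for n in [10, 10_000, 10_000_000]:
    v0, norm_sq = spike_stats(n)
    assert np.isclose(norm_sq, 1.0)            # the L^2 norm never grows...
assert spike_stats(10_000_000)[0] > 3000       # ...but the point value diverges
```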
This is a serious problem for machine learning. In supervised learning, we must
evaluate our model \(f\) at data points \(x_1, \ldots, x_N\) to compute the loss:
\[
\mathcal{L}(f) = \sum_{i=1}^{N} \ell\bigl(y_i,\, f(x_i)\bigr) + \lambda \|f\|_{\mathcal{H}}^2.
\]
If point evaluation is not continuous, this expression is not well-defined as a
functional on \(\mathcal{H}\). We need a Hilbert space where point evaluation is
always bounded.
A Reproducing Kernel Hilbert Space (RKHS) is precisely a Hilbert space of functions
in which point evaluation is a continuous linear functional for every point in the domain.
This single requirement — that \(\delta_x \in \mathcal{H}^*\) for all \(x\) — has
extraordinary consequences, connecting the
Riesz Representation Theorem,
Spectral Theory,
and regularized optimization into a single coherent framework.
Reproducing Kernel Hilbert Spaces
By the standing assumption on \(\mathcal{H}\), the evaluation functional
\(\delta_x(f) = f(x)\) belongs to \(\mathcal{H}^*\) for every \(x \in \mathcal{X}\). The
Riesz Representation Theorem
then gives a unique element \(K_x \in \mathcal{H}\) — the Riesz representative
of \(\delta_x\) — such that
\[
f(x) = \delta_x(f) = \langle f,\, K_x \rangle_{\mathcal{H}} \quad \text{for all } f \in \mathcal{H}.
\]
Definition: Reproducing Kernel
Let \(\mathcal{H}\) be a Hilbert space of functions on a set \(\mathcal{X}\)
in which all point evaluations are continuous. The reproducing kernel is
the function \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) defined by
\[
K(x, x') = \langle K_x,\, K_{x'} \rangle_{\mathcal{H}} = K_{x'}(x).
\]
The pair \((\mathcal{H}, K)\) is called a
Reproducing Kernel Hilbert Space (RKHS).
Notation. The Riesz representative \(K_x \in \mathcal{H}\) and the
partially-evaluated kernel \(K(\cdot, x) : \mathcal{X} \to \mathbb{R}\) denote the
same function: \(K_x(\cdot) = K(\cdot, x)\). We use both notations interchangeably,
favoring \(K(\cdot, x)\) when emphasizing the kernel structure and \(K_x\) when
emphasizing the Riesz duality.
Two properties follow immediately from this definition and are so fundamental that
they deserve explicit naming.
Theorem: The Reproducing Property
For every \(f \in \mathcal{H}\) and every \(x' \in \mathcal{X}\):
\[
f(x') = \langle f,\, K(\cdot, x') \rangle_{\mathcal{H}}.
\]
In particular, setting \(f = K(\cdot, x)\):
\[
K(x, x') = \langle K(\cdot, x),\, K(\cdot, x') \rangle_{\mathcal{H}}.
\]
That is, the kernel reproduces itself under the inner product —
hence the name.
The reproducing property says: to evaluate a function \(f\) at a point \(x'\),
take the inner product of \(f\) with the kernel "centered at \(x'\)."
The kernel function \(K(\cdot, x')\) acts as a "probe" that extracts the value of
\(f\) at the point \(x'\).
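On a finite domain the "probe" picture becomes a matrix identity that can be checked numerically. A sketch (assuming numpy, and using the standard fact, not stated above, that on \(\mathcal{X} = \{x_1, \ldots, x_N\}\) with invertible Gram matrix \(\mathbf{K}\), the RKHS inner product of functions \(f, g \in \mathbb{R}^N\) is \(f^T \mathbf{K}^{-1} g\)):

```python
import numpy as np

# On a finite domain X = {x_1, ..., x_N}, a function is just a vector in R^N.
# If the Gram matrix K is invertible, the RKHS inner product is
# <f, g>_{H_K} = f^T K^{-1} g, and K(., x_j) is the j-th column of K, so the
# reproducing property <f, K(., x_j)> = f(x_j) becomes a matrix identity.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-1, 1, size=6))
K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * 0.5 ** 2))  # RBF Gram matrix

f = rng.normal(size=6)            # an arbitrary function on X
K_inv = np.linalg.inv(K)

for j in range(6):
    probe = K[:, j]               # the "probe" K(., x_j)
    inner = f @ K_inv @ probe     # <f, K(., x_j)>_{H_K}
    assert np.isclose(inner, f[j])   # ...extracts exactly f(x_j)
```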
Positive Definiteness: The Characterization
Not every symmetric function \(K(x, x')\) is a reproducing kernel. The kernel must be
positive definite — the same condition that appeared in
Gaussian Processes
as the requirement for a valid covariance function.
Definition: Positive Definite Kernel (Mercer Kernel)
A symmetric function \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is
positive definite if for all \(N \in \mathbb{N}\), all
\(x_1, \ldots, x_N \in \mathcal{X}\), and all \(c_1, \ldots, c_N \in \mathbb{R}\):
\[
\sum_{i=1}^{N} \sum_{j=1}^{N} c_i c_j K(x_i, x_j) \geq 0.
\]
Equivalently, the Gram matrix \(\mathbf{K}_{ij} = K(x_i, x_j)\)
is positive semi-definite for every finite collection of points.
Terminological note. The condition \(\geq 0\) makes the Gram matrix
positive semi-definite in the sense of
Quadratic Forms.
The kernel literature (Schölkopf-Smola, Berg-Christensen-Ressel) nevertheless calls this
"positive definite kernel" by convention, reserving "strictly positive definite" for
kernels whose sum \(\sum_{i,j} c_i c_j K(x_i, x_j)\) vanishes only when all \(c_i = 0\)
(for distinct points \(x_i\)). We follow this convention throughout.
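The Gram-matrix characterization is directly checkable. A quick numerical sanity check (not a proof; assumes numpy) for the RBF kernel:

```python
import numpy as np

# Sanity check: the RBF kernel's Gram matrix on arbitrary points should be
# positive semi-definite, i.e. all eigenvalues >= 0 up to round-off.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))                      # 50 points in R^3
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * 1.0 ** 2))            # Gram matrix K_ij = K(x_i, x_j)

eigvals = np.linalg.eigvalsh(K)                   # real, since K is symmetric
assert eigvals.min() > -1e-10                     # PSD up to floating point

# Equivalently, every quadratic form sum_ij c_i c_j K(x_i, x_j) is nonnegative:
c = rng.normal(size=50)
assert c @ K @ c >= -1e-10
```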
The deep connection between positive definite kernels and RKHS is captured by the
following theorem, which establishes a perfect one-to-one correspondence.
Theorem: Moore-Aronszajn
For every positive definite kernel \(K\) on a set \(\mathcal{X}\), there exists a
unique Hilbert space \(\mathcal{H}_K\) of functions on \(\mathcal{X}\)
for which \(K\) is the reproducing kernel. Conversely, every RKHS has a unique
reproducing kernel.
Proof sketch:
We construct \(\mathcal{H}_K\) explicitly in four steps.
Step 1 (pre-Hilbert space).
Let \(\mathcal{H}_0 = \operatorname{span}\{K(\cdot, x) : x \in \mathcal{X}\}\) and
define a bilinear form on the generators by
\(\langle K(\cdot, x), K(\cdot, x') \rangle := K(x, x')\). Symmetry of \(K\) makes
this form symmetric. For any finite linear combination
\(f = \sum_i a_i K(\cdot, x_i)\), bilinearity forces
\(\langle f, f\rangle = \sum_{i,j} a_i a_j K(x_i, x_j) \geq 0\) by the Gram-matrix
condition, so the form is positive semi-definite. Well-definedness of the bilinear
extension to all of \(\mathcal{H}_0\) — that \(\langle f, g\rangle\) is independent
of the chosen finite-sum representations of \(f\) and \(g\) — is established next.
Step 2 (reproducing property on \(\mathcal{H}_0\); well-definedness;
positive definiteness).
For \(f = \sum_i a_i K(\cdot, x_i)\), direct computation gives
\(\langle f, K(\cdot, x)\rangle = \sum_i a_i K(x_i, x) = f(x)\).
Consequently, for any \(g = \sum_j b_j K(\cdot, y_j) \in \mathcal{H}_0\),
bilinearity yields \(\langle f, g\rangle = \sum_j b_j f(y_j)\) — a value
determined entirely by the function \(f\) (not its representation) and the
data \((b_j, y_j)\); a symmetric argument in the first slot completes the
verification that \(\langle\cdot,\cdot\rangle\) is well-defined on \(\mathcal{H}_0\).
The Cauchy-Schwarz inequality then gives
\(|f(x)|^2 = |\langle f, K(\cdot, x)\rangle|^2 \leq \langle f, f\rangle \cdot K(x, x)\),
so \(\langle f, f\rangle = 0\) forces \(f \equiv 0\) — the form is a genuine
inner product.
Step 3 (completion as a function space).
This is the heart of the theorem. Take the abstract Hilbert-space completion of
\(\mathcal{H}_0\) — the equivalence classes of Cauchy sequences
\(\{f_n\} \subset \mathcal{H}_0\). For each \(x \in \mathcal{X}\), the
Cauchy-Schwarz bound from Step 2 gives
\(|f_n(x) - f_m(x)| \leq \|f_n - f_m\| \cdot \sqrt{K(x, x)}\), so
\(\{f_n(x)\}\) is Cauchy in \(\mathbb{R}\) and converges to some \(f(x)\).
The same bound shows that Cauchy-equivalent sequences (\(\|f_n - g_n\| \to 0\))
yield identical pointwise limits, so \(f : \mathcal{X} \to \mathbb{R}\) depends
only on the equivalence class.
Conversely, suppose the pointwise limit vanishes at every \(x\). The reproducing
property extends continuously to the completion, so the abstract limit \(f^*\)
satisfies \(\langle f^*, K(\cdot, x)\rangle = \lim_n f_n(x) = 0\) for every \(x\);
hence \(f^*\) is orthogonal to every \(K(\cdot, x)\), and since
\(\mathcal{H}_0 = \operatorname{span}\{K(\cdot, x)\}\) is dense in the completion,
this forces \(\|f^*\| = 0\). Thus the
assignment "equivalence class \(\mapsto\) pointwise limit" is injective. The
completion is thus realized as a Hilbert space
\(\mathcal{H}_K\) of bona fide functions on \(\mathcal{X}\), not merely as
abstract equivalence classes.
Step 4 (reproducing property on \(\mathcal{H}_K\); uniqueness).
Point evaluation \(\delta_x = \langle \cdot, K(\cdot, x)\rangle\) is bounded on
\(\mathcal{H}_0\) with operator norm \(\sqrt{K(x, x)}\); since \(\mathcal{H}_0\)
is dense in \(\mathcal{H}_K\) by construction, \(\delta_x\) extends by continuity to
all of \(\mathcal{H}_K\), preserving the reproducing property
\(f(x) = \langle f, K(\cdot, x)\rangle\) throughout. Uniqueness of the reproducing
kernel follows from the
Riesz Representation Theorem:
each \(\delta_x\) determines its representative \(K_x\) uniquely. Uniqueness of
\(\mathcal{H}_K\) itself follows because any RKHS reproducing the same \(K\) must
contain \(\{K(\cdot, x)\}\), inherit the same inner product on this set
(forced by the reproducing property), and be complete — hence coincides with
\(\mathcal{H}_K\) as a function space.
For the complete argument with all measurability and density verifications, see
Aronszajn's original 1950 paper or Steinwart & Christmann,
Support Vector Machines, Ch. 4 ("Kernels and Reproducing Kernel Hilbert Spaces").
Connection to the Kernel Trick
The kernel trick
in SVMs replaces inner
products \(\langle \phi(x), \phi(x') \rangle\) with kernel evaluations \(K(x, x')\).
The Moore-Aronszajn theorem tells us exactly what is happening:
\(K(x, x') = \langle K(\cdot, x), K(\cdot, x') \rangle_{\mathcal{H}_K}\),
so the "feature map" is \(\phi(x) = K(\cdot, x) \in \mathcal{H}_K\).
The kernel trick is not merely a computational shortcut: it is the reproducing
property of an RKHS in action.
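Feature maps realizing a given kernel are not unique: besides the canonical \(\phi(x) = K(\cdot, x)\), some kernels admit explicit finite-dimensional maps. For the degree-2 polynomial kernel \(K(x, x') = (x^\top x' + 1)^2\) on \(\mathbb{R}^2\), the identity \(K(x, x') = \langle \phi(x), \phi(x')\rangle\) can be checked by hand (a sketch, assuming numpy):

```python
import numpy as np

# Explicit feature map for the degree-2 polynomial kernel on R^2:
# (x . x' + 1)^2 = <phi(x), phi(x')> with
# phi(x) = (1, sqrt(2) x1, sqrt(2) x2, x1^2, x2^2, sqrt(2) x1 x2).
def phi(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x, xp = np.array([0.5, -1.0]), np.array([2.0, 0.25])
kernel_value = (x @ xp + 1) ** 2          # kernel trick: never builds phi
assert np.isclose(kernel_value, phi(x) @ phi(xp))
```

The kernel evaluation costs \(O(d)\) regardless of the feature dimension; for the RBF kernel the analogous feature space is infinite-dimensional, which is exactly why the trick matters.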
Mercer's Theorem
In Spectral Theory,
we saw that a compact, self-adjoint, positive operator on a Hilbert space admits a
spectral decomposition into eigenfunctions. Mercer's Theorem applies this machinery
to the integral operator defined by a positive definite kernel, providing an explicit
eigenfunction expansion of the kernel itself.
Theorem: Mercer's Theorem
Let \(\mathcal{X}\) be a compact subset of \(\mathbb{R}^d\) equipped with a finite strictly positive
Borel measure \(\mu\) (i.e., every non-empty open set has positive measure), and let
\(K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) be a continuous, symmetric, positive definite kernel.
Define the integral operator \(T_K : L^2(\mathcal{X}, \mu) \to L^2(\mathcal{X}, \mu)\) by
\[
(T_K f)(x) = \int_{\mathcal{X}} K(x, x')\, f(x')\, d\mu(x').
\]
Then \(T_K\) is compact, self-adjoint, and positive, and there exist
orthonormal eigenfunctions \(\{e_n\}_{n=1}^\infty\) in \(L^2(\mathcal{X}, \mu)\)
with eigenvalues \(\lambda_1 \geq \lambda_2 \geq \cdots \geq 0\), \(\lambda_n \to 0\)
(with the convention that the list enumerates only the nonzero eigenvalues, of which there are countably many by compactness of \(T_K\)),
such that
\[
K(x, x') = \sum_{n=1}^{\infty} \lambda_n \, e_n(x) \, e_n(x').
\]
The convergence is absolute and uniform on
\(\mathcal{X} \times \mathcal{X}\).
Proof sketch:
Continuity of \(K\) on the compact set \(\mathcal{X}\times\mathcal{X}\) and finiteness
of \(\mu\) give \(K \in L^2(\mathcal{X}\times\mathcal{X}, \mu\otimes\mu)\), so
\(T_K\) is a
Hilbert-Schmidt operator
and hence compact. Symmetry of \(K\) gives self-adjointness.
Positive definiteness of \(K\) — defined for finite point sets — must be
promoted to the integral statement \(\langle T_K f, f\rangle \geq 0\). For
continuous \(f \in C(\mathcal{X})\) and any partition
\(\{A_i\}\) of \(\mathcal{X}\) with sample points \(x_i \in A_i\), the discrete
PD condition gives
\[
\sum_{i,j} K(x_i, x_j) f(x_i) f(x_j) \, \mu(A_i) \mu(A_j) \geq 0.
\]
As the partition is refined, the left-hand side converges to
\(\iint K(x, x') f(x) f(x')\, d\mu(x) d\mu(x') = \langle T_K f, f\rangle\)
(uniform continuity of \(K\) and \(f\) on the compact \(\mathcal{X}\); finite \(\mu\)),
so \(\langle T_K f, f\rangle \geq 0\) for all \(f \in C(\mathcal{X})\).
Density of \(C(\mathcal{X})\) in \(L^2(\mathcal{X}, \mu)\) (standard for finite
Borel measures on compact spaces) and continuity of
\(f \mapsto \langle T_K f, f\rangle\) on \(L^2\) extend the inequality to all of
\(L^2\). All eigenvalues of \(T_K\) are therefore nonnegative.
The
Spectral Theorem
for compact self-adjoint operators gives
\(T_K f = \sum_n \lambda_n \langle f, e_n\rangle_{L^2} e_n\) with orthonormal
eigenfunctions \(\{e_n\}\). Since the kernel \(K\) determines \(T_K\) uniquely as
an integral operator (distinct kernels modulo a.e. equivalence define distinct
integral operators), this lifts to the kernel decomposition
\(K(x, x') = \sum_n \lambda_n e_n(x) e_n(x')\) with convergence in
\(L^2(\mathcal{X}\times\mathcal{X}, \mu\otimes\mu)\).
The upgrade to absolute and uniform convergence on
\(\mathcal{X}\times\mathcal{X}\) is the deep part. The right-hand side of
\(T_K e_n = \lambda_n e_n\), namely
\(x \mapsto \int K(x, x') e_n(x')\, d\mu(x')\), is continuous in \(x\) by
continuity of \(K\) and dominated convergence. Replacing \(e_n\) by
\(\lambda_n^{-1} T_K e_n\) (valid for \(\lambda_n > 0\)) within its a.e.
equivalence class therefore selects a continuous representative; strict
positivity of \(\mu\) ensures this representative is uniquely determined
(any two continuous functions agreeing a.e. with respect to a strictly
positive measure must agree pointwise), making the pointwise expansion
well-defined.
The partial sums \(S_N(x, x') = \sum_{n=1}^N \lambda_n e_n(x) e_n(x')\) are
continuous, and on the diagonal the nonnegative terms
\(\lambda_n e_n(x)^2 \geq 0\) make \(S_N(x, x) \nearrow K(x, x)\) monotonically.
Dini's theorem (continuous monotone sequence with continuous
limit on a compact set converges uniformly) upgrades this to uniform
convergence on the diagonal. For the off-diagonal terms, Cauchy-Schwarz applied
to partial-sum tails gives, for \(M > N\),
\[
|S_M(x, x') - S_N(x, x')|^2 = \left|\sum_{n=N+1}^M \lambda_n e_n(x) e_n(x')\right|^2
\leq R_N(x) \cdot R_N(x'),
\]
where \(R_N(x) = K(x, x) - S_N(x, x) \geq \sum_{n=N+1}^M \lambda_n e_n(x)^2\).
The diagonal uniform convergence \(R_N \to 0\) gives a uniform Cauchy criterion,
yielding both absolute and uniform convergence
on all of \(\mathcal{X}\times\mathcal{X}\).
For the full argument, see Steinwart & Christmann,
Support Vector Machines, Ch. 4 ("Kernels and Reproducing Kernel Hilbert Spaces").
Let us unpack why each ingredient is necessary:
- Continuity + symmetry of \(K\): Guarantees that \(T_K\) is a
Hilbert-Schmidt operator,
hence compact.
- Positive definiteness: Ensures all eigenvalues are nonnegative
(so \(T_K\) is a positive operator).
- The Spectral Theorem:
Provides the eigenfunction decomposition \(T_K f = \sum \lambda_n \langle f, e_n \rangle e_n\).
- Mercer's contribution: The eigenfunction expansion of the
operator lifts to a pointwise expansion of the kernel function,
with uniform convergence.
- Strictly positive measure: Ensures that \(L^2\)-eigenfunctions
have unique continuous representatives, so the pointwise equality
\(K(x,x') = \sum \lambda_n e_n(x) e_n(x')\) is well-defined.
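A discrete analogue of the Mercer expansion can be verified numerically by taking \(\mu\) to be the empirical measure \(\mu_N = \frac{1}{N}\sum_i \delta_{x_i}\) (a sketch, assuming numpy; the normalization \(e_n(x_i) = \sqrt{N}\,u_n[i]\) is chosen here to match the \(L^2(\mu_N)\) inner product):

```python
import numpy as np

# For the empirical measure mu_N, the integral operator T_K acts on function
# values as the matrix K/N, so eig(K/N) gives discrete Mercer pairs
# (lambda_n, e_n) with e_n(x_i) = sqrt(N) * u_n[i].
rng = np.random.default_rng(0)
N = 200
X = np.sort(rng.uniform(0, 1, size=N))
K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * 0.3 ** 2))

lam, U = np.linalg.eigh(K / N)       # eigh returns ascending eigenvalues
lam, U = lam[::-1], U[:, ::-1]       # sort descending
E = np.sqrt(N) * U                   # E[i, n] = e_n(x_i)

# Orthonormality in L^2(mu_N): (1/N) sum_i e_m(x_i) e_n(x_i) = delta_mn
assert np.allclose(E.T @ E / N, np.eye(N), atol=1e-8)

# The full expansion reproduces K exactly, and a short truncation is
# already uniformly close -- the discrete shadow of Mercer's theorem.
K_full = (E * lam) @ E.T
K_20 = (E[:, :20] * lam[:20]) @ E[:, :20].T
assert np.allclose(K_full, K, atol=1e-8)
assert np.max(np.abs(K - K_20)) < 1e-3
```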
The Feature Map from Mercer's Theorem
The Mercer expansion immediately yields an explicit feature map into \(\ell^2\):
\[
\Phi(x) = \bigl(\sqrt{\lambda_1}\, e_1(x),\; \sqrt{\lambda_2}\, e_2(x),\; \ldots\bigr) \in \ell^2,
\]
and the kernel is recovered as the inner product:
\[
K(x, x') = \langle \Phi(x),\, \Phi(x') \rangle_{\ell^2} = \sum_{n=1}^{\infty} \lambda_n \, e_n(x) \, e_n(x').
\]
This is the rigorous justification for the
kernel trick:
the kernel evaluates an inner product in an infinite-dimensional feature space
without ever computing the feature map \(\Phi\) explicitly.
To see that \(\Phi(x)\) indeed lands in \(\ell^2\), set \(x = x'\) in the Mercer expansion:
\(\|\Phi(x)\|_{\ell^2}^2 = \sum_{n=1}^{\infty} \lambda_n\, e_n(x)^2 = K(x, x)\),
which is finite since \(K\) is continuous on the compact set \(\mathcal{X}\).
The rate of decay of \(\lambda_n\) controls the "effective dimensionality" of the feature space.
Example: The RBF (Gaussian) Kernel
The radial basis function kernel
\(K(x, x') = \exp\bigl(-\|x - x'\|^2 / (2\sigma^2)\bigr)\)
is continuous, symmetric, and positive definite. Its Mercer eigenfunctions are
related to Hermite functions, and the eigenvalues decay
exponentially: \(\lambda_n \sim C \cdot r^n\) for some
\(0 < r < 1\). This rapid decay explains why low-rank approximations such as the
Random Fourier Feature
method work well in practice: the kernel is effectively low-rank, and only the first few eigenfunctions carry significant weight.
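The decay is easy to observe empirically on a Gram matrix (a sketch, assuming numpy; the thresholds below are illustrative, not sharp bounds):

```python
import numpy as np

# Spectral decay of an RBF Gram matrix: the eigenvalues fall off so fast
# that a handful of modes carry essentially all of the spectral mass.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=300)
K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * 0.5 ** 2))

lam = np.sort(np.linalg.eigvalsh(K))[::-1]   # descending eigenvalues
weights = lam / lam.sum()

assert weights[:5].sum() > 0.99              # top 5 modes dominate
assert lam[10] < 1e-4 * lam[0]               # deep modes are negligible
```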
Connection to Gaussian Processes
In a Gaussian Process
with covariance function \(K\), the Mercer expansion gives the
Karhunen-Loève expansion:
\[
f(x) = \sum_{n=1}^{\infty} Z_n \sqrt{\lambda_n}\, e_n(x), \qquad Z_n \sim \mathcal{N}(0, 1) \text{ i.i.d.}
\]
Each draw from the GP is a random function whose "complexity" is governed by the
eigenvalue decay. Smooth kernels (fast decay) produce smooth sample paths;
rough kernels (slow decay) produce rough paths. The RKHS norm
(introduced in the next section) measures the complexity of the mean function
of the posterior GP.
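The Karhunen-Loève expansion can be used directly to sample a GP on a grid, substituting Gram-matrix eigenpairs for the true Mercer eigenpairs (a discretized sketch, assuming numpy):

```python
import numpy as np

# Discretized Karhunen-Loeve sampling: f(x_i) = sum_n z_n sqrt(lambda_n) e_n(x_i)
# with z_n ~ N(0, 1) i.i.d. The empirical covariance of many draws should
# recover the kernel matrix K.
rng = np.random.default_rng(0)
N = 100
X = np.linspace(0, 1, N)
K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * 0.2 ** 2))

lam, U = np.linalg.eigh(K)            # K = U diag(lam) U^T, lam >= 0
lam = np.clip(lam, 0, None)           # guard against tiny negative round-off

n_draws = 20000
Z = rng.normal(size=(n_draws, N))     # rows of i.i.d. standard normals
F = Z @ (U * np.sqrt(lam)).T          # each row: one sample path on the grid

emp_cov = F.T @ F / n_draws
assert np.abs(emp_cov - K).max() < 0.1   # Monte Carlo recovery of K
```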
RKHS Norm & Function Complexity
A central utility of the RKHS framework for machine learning is that the RKHS norm
provides a mathematically precise notion of function "smoothness" or "complexity."
This connects directly to regularization.
The RKHS Inner Product via Mercer Eigenfunctions
Suppose the kernel \(K\) has the Mercer expansion
\(K(x, x') = \sum_{n=1}^\infty \lambda_n e_n(x) e_n(x')\).
Any function \(f \in \mathcal{H}_K\) can be expanded as
\(f(x) = \sum_{n=1}^\infty f_n \, e_n(x)\), where \(f_n = \langle f, e_n \rangle_{L^2}\)
are the generalized Fourier coefficients.
The RKHS inner product is not the \(L^2\) inner product. Instead, it
reweights each eigencomponent by the inverse of the eigenvalue:
Lemma: Aronszajn Factorization
Under the hypotheses of
Mercer's Theorem
with \(\lambda_n > 0\) for all \(n\), let
\(T_K^{1/2}\) be the positive square root of \(T_K\), defined via spectral
calculus by
\(T_K^{1/2} g = \sum_n \sqrt{\lambda_n} \langle g, e_n\rangle_{L^2}\, e_n\).
Then \(T_K^{1/2} : L^2(\mathcal{X}, \mu) \to \mathcal{H}_K\) is an isometric
isomorphism:
\[
\mathcal{H}_K = T_K^{1/2}\bigl(L^2(\mathcal{X}, \mu)\bigr), \qquad
\|T_K^{1/2} g\|_{\mathcal{H}_K} = \|g\|_{L^2(\mathcal{X}, \mu)}.
\]
In particular, \(\{\phi_n := \sqrt{\lambda_n}\, e_n\}_{n \geq 1}\) is an
orthonormal basis of \(\mathcal{H}_K\).
Proof sketch:
By Mercer's Theorem,
\(K(x, y) = \sum_n \lambda_n e_n(x) e_n(y) = \sum_n \phi_n(x) \phi_n(y)\)
with \(\phi_n := \sqrt{\lambda_n}\, e_n\). The Hilbert space of formal series
\(\mathcal{H}' := \{\sum_n c_n \phi_n : \sum_n c_n^2 < \infty\}\), equipped
with the inner product
\(\langle \sum_n c_n \phi_n, \sum_n d_n \phi_n\rangle := \sum_n c_n d_n\),
has \(\{\phi_n\}\) as an orthonormal basis and reproducing kernel
\(\sum_n \phi_n(x) \phi_n(y) = K(x, y)\). By the uniqueness half of the
Moore-Aronszajn Theorem,
\(\mathcal{H}' = \mathcal{H}_K\) as Hilbert spaces of functions.
The spectral-calculus definition of \(T_K^{1/2}\) sends
\(g = \sum_n g_n e_n \in L^2\) to
\(T_K^{1/2} g = \sum_n \sqrt{\lambda_n}\, g_n\, e_n = \sum_n g_n \phi_n\),
whose \(\mathcal{H}_K\)-norm is
\((\sum_n g_n^2)^{1/2} = \|g\|_{L^2}\). Thus
\(T_K^{1/2} : L^2(\mathcal{X}, \mu) \to \mathcal{H}_K\) is norm-preserving,
with image all of \(\mathcal{H}_K\) (every \(f = \sum_n c_n \phi_n \in \mathcal{H}_K\)
is the image of \(g = \sum_n c_n e_n \in L^2\)). For full details, see
Aronszajn's original 1950 paper, or Steinwart & Christmann,
Support Vector Machines, Ch. 4 ("Kernels and Reproducing Kernel Hilbert Spaces").
Proposition: RKHS Norm via Mercer Coefficients
Assume \(\lambda_n > 0\) for all \(n\) (a standing assumption ensured, for instance, when \(K\) is strictly positive definite under the Mercer hypotheses; otherwise restrict the sums to \(\{n : \lambda_n > 0\}\) and identify \(\mathcal{H}_K\) with the corresponding subspace via the isometry below).
For \(f = \sum_n f_n \, e_n\) and \(g = \sum_n g_n \, e_n\) in \(\mathcal{H}_K\):
\[
\langle f, g \rangle_{\mathcal{H}_K} = \sum_{n=1}^{\infty} \frac{f_n \, g_n}{\lambda_n}.
\]
The RKHS norm is therefore
\[
\|f\|_{\mathcal{H}_K}^2 = \sum_{n=1}^{\infty} \frac{f_n^2}{\lambda_n}.
\]
A function \(f \in L^2(\mathcal{X}, \mu)\) with Fourier expansion
\(f = \sum_n f_n e_n\) belongs to \(\mathcal{H}_K\) if and only if
\(\sum_n f_n^2 / \lambda_n < \infty\).
Proof sketch:
By the
Aronszajn Factorization Lemma,
every \(f \in \mathcal{H}_K\) can be written uniquely as \(f = T_K^{1/2} h\)
for some \(h \in L^2(\mathcal{X}, \mu)\), and this correspondence is an
isometry: \(\|f\|_{\mathcal{H}_K} = \|h\|_{L^2}\). Expanding \(h\) in the
\(L^2\)-orthonormal basis \(\{e_n\}\) as \(h = \sum_n h_n e_n\), we have
\[
f = T_K^{1/2} h = \sum_n h_n\, T_K^{1/2} e_n = \sum_n h_n \sqrt{\lambda_n}\, e_n.
\]
Comparing with the \(L^2\)-Fourier expansion \(f = \sum_n f_n e_n\) gives
\(f_n = h_n \sqrt{\lambda_n}\), i.e. \(h_n = f_n / \sqrt{\lambda_n}\). The
isometry \(\|f\|_{\mathcal{H}_K}^2 = \|h\|_{L^2}^2\) then reads
\[
\|f\|_{\mathcal{H}_K}^2 = \sum_n h_n^2 = \sum_n \frac{f_n^2}{\lambda_n}.
\]
The polarization identity yields the inner-product formula. The membership
characterization \(f \in \mathcal{H}_K \iff \sum_n f_n^2/\lambda_n < \infty\)
follows from
\(f \in \mathcal{H}_K \iff h \in L^2 \iff \sum_n h_n^2 < \infty\).
The division by \(\lambda_n\) is the key insight. Since \(\lambda_n \to 0\),
the higher-order eigenfunctions (those corresponding to small eigenvalues)
are penalized more heavily in the RKHS norm. A function that relies
heavily on high-frequency components will have a large RKHS norm, while a function
concentrated on the first few eigenmodes will have a small norm.
Connection to Regularization: Why \(\|f\|_{\mathcal{H}_K}^2\) Controls Complexity
When we add \(\lambda \|f\|_{\mathcal{H}_K}^2\) to our loss function, we are
penalizing functions that use high-frequency eigenfunctions. This is the
infinite-dimensional analogue of
Ridge regression:
in finite dimensions, \(\|w\|^2 = \sum w_i^2\) penalizes large weights equally;
in the RKHS, \(\|f\|_{\mathcal{H}_K}^2 = \sum f_n^2 / \lambda_n\) penalizes
components in "hard" directions (small \(\lambda_n\)) more than "easy" ones
(large \(\lambda_n\)).
In fact this is exactly the quadratic form \(\mathbf{f}^T \mathbf{K}^{-1} \mathbf{f}\)
that appears in GP regression: for \(f = \sum_i \alpha_i K(\cdot, x_i)\) with value
vector \(\mathbf{f} = \mathbf{K}\boldsymbol{\alpha}\), the RKHS norm is
\(\|f\|_{\mathcal{H}_K}^2 = \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha} = \mathbf{f}^T \mathbf{K}^{-1} \mathbf{f}\)
(for invertible \(\mathbf{K}\)). The inverse kernel matrix penalizes predictions
that deviate from the prior's "smooth" subspace.
RKHS Functions are "Smoother" than \(L^2\) Functions
The membership condition \(\sum f_n^2 / \lambda_n < \infty\) is
strictly stronger than the \(L^2\) condition \(\sum f_n^2 < \infty\), since
\(\lambda_n \to 0\). For a concrete witness, take \(f_n = \sqrt{\lambda_n}\). Then
\(\sum f_n^2 = \sum_n \lambda_n < \infty\): integrating the Mercer expansion on the
diagonal gives \(\sum_n \lambda_n = \int_{\mathcal{X}} K(x, x)\, d\mu(x)\), which is
finite by continuity of \(K\) on the compact \(\mathcal{X}\) and finiteness of \(\mu\).
But \(\sum f_n^2 / \lambda_n = \sum_n 1 = \infty\), so
\(f \in L^2 \setminus \mathcal{H}_K\).
Therefore:
\[
\mathcal{H}_K \subsetneq L^2(\mathcal{X}, \mu).
\]
The RKHS consists of "smoother" functions — those whose high-frequency components
decay fast enough to compensate for the shrinking eigenvalues.
The faster the eigenvalues \(\lambda_n\) decay, the smaller (and smoother)
the RKHS becomes.
This also explains why point evaluation is continuous in an RKHS but not in \(L^2\).
The extra smoothness required for RKHS membership ensures that functions are
"nice enough" to be evaluated pointwise — unlike generic \(L^2\) functions,
which are only defined up to measure-zero modifications.
The Representer Theorem & Grand Synthesis
We now arrive at the theorem that ties the entire Functional Analysis series together:
the Representer Theorem. It tells us that even though the RKHS
\(\mathcal{H}_K\) is infinite-dimensional, the solution to any regularized
empirical risk minimization problem lives in a finite-dimensional subspace
spanned by the kernel functions at the data points.
The Representer Theorem is a structural result: it describes
the form of any minimizer that exists, without itself asserting existence.
Existence depends on the loss function and is not automatic. In the kernel
methods that use this theorem in practice, existence is nevertheless clear:
Kernel Ridge Regression reduces to a finite-dimensional linear system with a
unique closed-form solution, and the kernel SVM to a finite-dimensional
convex quadratic program. The theorem below tells us the structure
of any such minimizer:
Theorem: The Representer Theorem
Let \(\mathcal{H}_K\) be an RKHS on \(\mathcal{X}\) with kernel \(K\),
let \(\{(x_i, y_i)\}_{i=1}^N\) be training data, and consider the regularized problem
\[
\min_{f \in \mathcal{H}_K} \left[\sum_{i=1}^{N} \ell\bigl(y_i,\, f(x_i)\bigr) + \lambda \|f\|_{\mathcal{H}_K}^2\right],
\]
where \(\ell\) is any loss function and \(\lambda > 0\). Then any minimizer
\(f^*\) of this objective has the form
\[
f^*(x) = \sum_{i=1}^{N} \alpha_i \, K(x, x_i)
\]
for some coefficients \(\alpha_1, \ldots, \alpha_N \in \mathbb{R}\).
This is a remarkable dimension reduction: the optimization was posed over the
entire infinite-dimensional space \(\mathcal{H}_K\), yet the answer lies in the
\(N\)-dimensional subspace \(\operatorname{span}\{K(\cdot, x_1), \ldots, K(\cdot, x_N)\}\).
Proof Sketch:
Decompose any \(f \in \mathcal{H}_K\) as
\(f = f_\parallel + f_\perp\), where
\(f_\parallel \in \operatorname{span}\{K(\cdot, x_i)\}_{i=1}^N\) and
\(f_\perp\) is orthogonal to this subspace.
(The span is finite-dimensional, hence closed in \(\mathcal{H}_K\), so this orthogonal decomposition is guaranteed by
the projection theorem.)
By the reproducing property, \(f(x_i) = \langle f, K(\cdot, x_i) \rangle_{\mathcal{H}_K}\).
Since \(f_\perp\) is orthogonal to every \(K(\cdot, x_i)\):
\[
f(x_i) = f_\parallel(x_i) + \underbrace{\langle f_\perp, K(\cdot, x_i) \rangle}_{= 0} = f_\parallel(x_i).
\]
So \(f_\perp\) does not affect the data-fit term. However, by Pythagoras:
\[
\|f\|_{\mathcal{H}_K}^2 = \|f_\parallel\|_{\mathcal{H}_K}^2 + \|f_\perp\|_{\mathcal{H}_K}^2 \geq \|f_\parallel\|_{\mathcal{H}_K}^2.
\]
Adding any nonzero \(f_\perp\) strictly increases the regularization penalty
without improving the data fit. Hence any minimizer satisfies
\(f_\perp = 0\).
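The two halves of the argument can be observed numerically: a component orthogonal to \(\operatorname{span}\{K(\cdot, x_i)\}\) vanishes at every training point yet carries strictly positive norm (a sketch, assuming numpy; \(x_0\) is an arbitrary new point used to build \(f_\perp\)):

```python
import numpy as np

# Build f_perp = K(., x0) - (its projection onto span{K(., x_i)}) and check:
# it vanishes on the training set but has strictly positive RKHS norm.
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=8)
x0 = 0.123                                     # a generic new point

def k(a, b):
    return np.exp(-np.subtract.outer(a, b) ** 2 / (2 * 0.5 ** 2))

K = k(X_train, X_train)
k0 = k(X_train, x0)                            # K(x_i, x0)

beta = np.linalg.solve(K, k0)                  # projection coefficients

# f_perp vanishes at every training point, so it cannot change the data fit...
f_perp_at_train = k0 - K @ beta
assert np.allclose(f_perp_at_train, 0, atol=1e-6)

# ...but ||f_perp||^2 = K(x0, x0) - k0^T K^{-1} k0 > 0, so it strictly
# inflates the regularization penalty.
norm_sq = k(x0, x0) - k0 @ beta
assert norm_sq > 1e-8
```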
From the Representer Theorem to Kernel Algorithms
With \(f^*(x) = \sum_{i=1}^N \alpha_i K(x, x_i)\), the infinite-dimensional problem
reduces to a finite-dimensional optimization over \(\boldsymbol{\alpha} \in \mathbb{R}^N\).
The reproducing property gives
\[
f^*(x_j) = \sum_{i=1}^N \alpha_i K(x_j, x_i) = (\mathbf{K}\boldsymbol{\alpha})_j
\]
where \(\mathbf{K}_{ij} = K(x_i, x_j)\) is the Gram matrix, and
\[
\|f^*\|_{\mathcal{H}_K}^2 = \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha}.
\]
The regularized problem becomes
\[
\min_{\boldsymbol{\alpha} \in \mathbb{R}^N} \left[\sum_{i=1}^N \ell\bigl(y_i,\, (\mathbf{K}\boldsymbol{\alpha})_i\bigr) + \lambda \, \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha}\right].
\]
For the squared loss \(\ell(y, \hat{y}) = (y - \hat{y})^2\), this is
Kernel Ridge Regression, with closed-form solution
\(\boldsymbol{\alpha} = (\mathbf{K} + \lambda I)^{-1} \mathbf{y}\).
For the hinge loss, it is the kernel SVM.
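The squared-loss case fits in a few lines (a minimal sketch, assuming numpy; the data, kernel width, and tolerances are illustrative choices):

```python
import numpy as np

# Kernel Ridge Regression via the Representer Theorem:
# alpha = (K + lambda I)^{-1} y,   f(x) = sum_i alpha_i K(x, x_i).
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, size=40))
y = np.sin(X) + 0.05 * rng.normal(size=40)     # noisy samples of sin

def rbf(A, B, sigma=0.7):
    return np.exp(-np.subtract.outer(A, B) ** 2 / (2 * sigma ** 2))

lam = 0.1
K = rbf(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # closed-form solution

def predict(x_new):
    return rbf(x_new, X) @ alpha               # f lives in span{K(., x_i)}

# Good fit on the training data and on a fresh interior grid.
assert np.mean((predict(X) - y) ** 2) < 0.05
grid = np.linspace(0.7, 2 * np.pi - 0.7, 50)
assert np.mean(np.abs(predict(grid) - np.sin(grid))) < 0.1
```

Note that the infinite-dimensional minimization never appears: everything reduces to one \(N \times N\) linear solve, exactly as the theorem promises.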
The Grand Synthesis
Every chapter in this Functional Analysis series has contributed a piece of
the RKHS picture.
Banach & Hilbert Spaces
gave us the ambient structure — completeness, inner products, and \(L^2\) as
the home of Mercer's integral operator.
Bounded Operators
gave us the equivalence of boundedness and continuity, which is exactly what
point evaluation \(\delta_x\) must satisfy for the whole theory to begin.
Dual Spaces and the Riesz
Representation Theorem converted "\(\delta_x\) is bounded" into the concrete
statement "\(\delta_x\) has a Riesz representative \(K_x\)" — the reproducing
kernel itself.
Spectral Theory lifted
Mercer's integral operator into an eigenfunction expansion, exposing the
explicit feature map into \(\ell^2\) and the smoothness-controlling role of
eigenvalue decay. And the Representer Theorem, proved here, collapses
infinite-dimensional optimization onto a finite-dimensional kernel algebra.
The Functional Analysis series reaches its culmination here: each abstract tool introduced
above — Hilbert structure, boundedness, Riesz duality, spectral decomposition —
now anchors a concrete machine-learning method.
Looking Ahead: From Function Spaces to Manifolds
Next, we move from function spaces to geometric spaces.
The Musical Isomorphisms (\(\sharp\) and \(\flat\)) will generalize Riesz duality
from a single inner product to a Riemannian metric that varies from point to point.
Spectral methods will be powered by the eigenvalues and eigenfunctions of the
Laplace-Beltrami operator on smooth manifolds. And the reproducing property
of RKHS will find its echo in the equivariance constraints of Geometric Deep Learning: just as the
kernel encodes the "right" notion of similarity in function space, the group action encodes the "right" notion of symmetry
on geometric domains.
The Functional Analysis series has given us the language of
infinite-dimensional mathematics. The geometry series will give us its shape.