Point Evaluation & The Need for RKHS
Throughout the Functional Analysis series — from
Banach & Hilbert Spaces through
Spectral Theory — we have
treated functions as "points" in abstract infinite-dimensional spaces. We measured
their "length" via norms, their "angle" via inner products, and their "spectrum"
via eigenvalue decomposition. But there is one seemingly innocent operation we have
not yet examined: evaluating a function at a point.
Given a function \(f\) in some Hilbert space \(\mathcal{H}\), define the
point evaluation functional at \(x \in \mathcal{X}\):
\[
\delta_x : \mathcal{H} \to \mathbb{R}, \qquad \delta_x(f) = f(x).
\]
This is clearly linear: \(\delta_x(\alpha f + \beta g) = \alpha f(x) + \beta g(x)\).
The critical question is: is it continuous?
In \(L^2([a,b])\), the answer is no. Functions in \(L^2\) are equivalence
classes defined up to sets of measure zero — changing a function at a single point
does not change its \(L^2\) norm. Consequently, point evaluation is not a
bounded linear functional
on \(L^2\), and \(\delta_x \notin (L^2)^*\). We cannot meaningfully "evaluate" an
\(L^2\) function at a point.
This is a serious problem for machine learning. In supervised learning, we must
evaluate our model \(f\) at data points \(x_1, \ldots, x_N\) to compute the loss:
\[
\mathcal{L}(f) = \sum_{i=1}^{N} \ell\bigl(y_i,\, f(x_i)\bigr) + \lambda \|f\|_{\mathcal{H}}^2.
\]
If point evaluation is not continuous, this expression is not well-defined as a
functional on \(\mathcal{H}\). We need a Hilbert space where point evaluation is
always bounded.
A Reproducing Kernel Hilbert Space (RKHS) is precisely a Hilbert space of functions
in which point evaluation is a continuous linear functional for every point in the domain.
This single requirement — that \(\delta_x \in \mathcal{H}^*\) for all \(x\) — has
extraordinary consequences, connecting the
Riesz Representation Theorem,
Spectral Theory,
and regularized optimization into a single coherent framework.
Reproducing Kernel Hilbert Spaces
Suppose \(\mathcal{H}\) is a Hilbert space of functions \(f : \mathcal{X} \to \mathbb{R}\)
in which point evaluation is continuous: for every \(x \in \mathcal{X}\), the functional
\(\delta_x(f) = f(x)\) satisfies \(\delta_x \in \mathcal{H}^*\). By the
Riesz Representation Theorem,
there exists a unique element \(K_x \in \mathcal{H}\) such that
\[
f(x) = \delta_x(f) = \langle f,\, K_x \rangle_{\mathcal{H}} \quad \text{for all } f \in \mathcal{H}.
\]
The function \(K_x\) is the Riesz representative of the evaluation
functional at \(x\).
Definition: Reproducing Kernel
Let \(\mathcal{H}\) be a Hilbert space of functions on a set \(\mathcal{X}\)
in which all point evaluations are continuous. The reproducing kernel is
the function \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) defined by
\[
K(x, x') = \langle K_x,\, K_{x'} \rangle_{\mathcal{H}} = K_{x'}(x).
\]
The pair \((\mathcal{H}, K)\) is called a
Reproducing Kernel Hilbert Space (RKHS).
Two properties follow immediately from this definition and are so fundamental that
they deserve explicit naming.
Theorem: The Reproducing Property
For every \(f \in \mathcal{H}\) and every \(x' \in \mathcal{X}\):
\[
f(x') = \langle f,\, K(\cdot, x') \rangle_{\mathcal{H}}.
\]
In particular, setting \(f = K(\cdot, x)\):
\[
K(x, x') = \langle K(\cdot, x),\, K(\cdot, x') \rangle_{\mathcal{H}}.
\]
That is, the kernel reproduces itself under the inner product —
hence the name.
The reproducing property says: to evaluate a function \(f\) at a point \(x'\),
take the inner product of \(f\) with the kernel "centered at \(x'\)."
The kernel function \(K(\cdot, x')\) acts as a "probe" that extracts the value of
\(f\) at the point \(x'\).
Positive Definiteness: The Characterization
Not every symmetric function \(K(x, x')\) is a reproducing kernel. The kernel must be
positive definite — the same condition that appeared in
Gaussian Processes
as the requirement for a valid covariance function.
Definition: Positive Definite Kernel (Mercer Kernel)
A symmetric function \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is
positive definite if for all \(N \in \mathbb{N}\), all
\(x_1, \ldots, x_N \in \mathcal{X}\), and all \(c_1, \ldots, c_N \in \mathbb{R}\):
\[
\sum_{i=1}^{N} \sum_{j=1}^{N} c_i c_j K(x_i, x_j) \geq 0.
\]
Equivalently, the Gram matrix \(\mathbf{K}_{ij} = K(x_i, x_j)\)
is positive semi-definite for every finite collection of points.
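The Gram-matrix condition is easy to check numerically. Below is a minimal sketch (the `rbf_kernel` helper and the choice of 50 random points in \(\mathbb{R}^3\) are illustrative, not part of the text above): we build the Gram matrix of the RBF kernel and confirm that its eigenvalues are nonnegative up to floating-point round-off.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))      # 50 arbitrary points in R^3
K = rbf_kernel(X, X)

# Positive definiteness <=> every Gram matrix is PSD:
# all eigenvalues of the symmetric matrix K are >= 0 (up to round-off).
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)    # True
```

A failed check on even one finite point set would disqualify a candidate kernel, which is what makes this criterion practically useful.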
The deep connection between positive definite kernels and RKHS is captured by the
following theorem, which establishes a perfect one-to-one correspondence.
Theorem: Moore-Aronszajn
For every positive definite kernel \(K\) on a set \(\mathcal{X}\), there exists a
unique Hilbert space \(\mathcal{H}_K\) of functions on \(\mathcal{X}\)
for which \(K\) is the reproducing kernel. Conversely, every RKHS has a unique
reproducing kernel.
The proof constructs \(\mathcal{H}_K\) explicitly: start with the span of all
functions of the form \(K(\cdot, x)\), define the inner product via
\(\langle K(\cdot, x), K(\cdot, x') \rangle = K(x, x')\),
and take the completion. The reproducing property is then built in by construction.
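The pre-completion stage of this construction can be sketched in a few lines. For \(f = \sum_i a_i K(\cdot, x_i)\), the defined inner product gives \(\langle f, K(\cdot, x_j) \rangle = \sum_i a_i K(x_i, x_j)\), which must equal \(f(x_j)\). The snippet below (centers, coefficients, and the `rbf` helper are illustrative assumptions) verifies this identity:

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y)**2) / (2 * sigma**2))

rng = np.random.default_rng(1)
xs = rng.normal(size=(5, 2))     # centers x_1, ..., x_5
a = rng.normal(size=5)           # coefficients of f = sum_i a_i K(., x_i)

K = np.array([[rbf(xi, xj) for xj in xs] for xi in xs])

def f(x):
    # evaluate f = sum_i a_i K(., x_i) pointwise
    return sum(ai * rbf(xi, x) for ai, xi in zip(a, xs))

# By construction, <f, K(., x_j)> = sum_i a_i K(x_i, x_j) = (K a)_j,
# and this equals the pointwise value f(x_j): the reproducing property.
print(np.allclose(K @ a, [f(xi) for xi in xs]))   # True
```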
Connection to the Kernel Trick
The kernel trick
in SVMs replaces inner
products \(\langle \phi(x), \phi(x') \rangle\) with kernel evaluations \(K(x, x')\).
The Moore-Aronszajn theorem tells us exactly what is happening:
\(K(x, x') = \langle K(\cdot, x), K(\cdot, x') \rangle_{\mathcal{H}_K}\),
so the "feature map" is \(\phi(x) = K(\cdot, x) \in \mathcal{H}_K\).
The kernel trick is therefore not merely a computational shortcut: it is the reproducing property of an RKHS in action.
Mercer's Theorem
In Spectral Theory,
we saw that a compact, self-adjoint, positive operator on a Hilbert space admits a
spectral decomposition into eigenfunctions. Mercer's Theorem applies this machinery
to the integral operator defined by a positive definite kernel, providing an explicit
eigenfunction expansion of the kernel itself.
Theorem: Mercer's Theorem
Let \(\mathcal{X}\) be a compact subset of \(\mathbb{R}^d\) equipped with a
finite Borel measure \(\mu\), and let \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\)
be a continuous, symmetric, positive definite kernel.
Define the integral operator \(T_K : L^2(\mathcal{X}, \mu) \to L^2(\mathcal{X}, \mu)\) by
\[
(T_K f)(x) = \int_{\mathcal{X}} K(x, x')\, f(x')\, d\mu(x').
\]
Then \(T_K\) is compact, self-adjoint, and positive, and there exist
orthonormal eigenfunctions \(\{e_n\}_{n=1}^\infty\) in \(L^2(\mathcal{X}, \mu)\)
with eigenvalues \(\lambda_1 \geq \lambda_2 \geq \cdots > 0\), \(\lambda_n \to 0\),
such that
\[
K(x, x') = \sum_{n=1}^{\infty} \lambda_n \, e_n(x) \, e_n(x').
\]
The convergence is absolute and uniform on
\(\mathcal{X} \times \mathcal{X}\).
Let us unpack why each ingredient is necessary:
- Continuity + symmetry of \(K\): Guarantees that \(T_K\) is a
Hilbert-Schmidt operator,
hence compact.
- Positive definiteness: Ensures all eigenvalues are nonnegative
(so \(T_K\) is a positive operator).
- The Spectral Theorem
(calc-27):
Provides the eigenfunction decomposition
\(T_K f = \sum \lambda_n \langle f, e_n \rangle e_n\).
- Mercer's contribution: The eigenfunction expansion of the
operator lifts to a pointwise expansion of the kernel function,
with uniform convergence.
The Feature Map from Mercer's Theorem
The Mercer expansion immediately yields an explicit feature map into \(\ell^2\):
\[
\Phi(x) = \bigl(\sqrt{\lambda_1}\, e_1(x),\; \sqrt{\lambda_2}\, e_2(x),\; \ldots\bigr) \in \ell^2,
\]
and the kernel is recovered as the inner product:
\[
K(x, x') = \langle \Phi(x),\, \Phi(x') \rangle_{\ell^2} = \sum_{n=1}^{\infty} \lambda_n \, e_n(x) \, e_n(x').
\]
This is the rigorous justification for the
kernel trick:
the kernel evaluates an inner product in an infinite-dimensional feature space
without ever computing the feature map \(\Phi\) explicitly.
The eigenvalues \(\lambda_n \to 0\) ensure that the feature map lands in \(\ell^2\)
(the squared norms are summable), and the rate of decay of \(\lambda_n\) controls the
"effective dimensionality" of the feature space.
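A finite-sample analogue of this construction is easy to demonstrate: eigendecomposing the Gram matrix (a discretization of \(T_K\)) and keeping the top \(m\) eigenpairs yields a truncated feature map \(\Phi(x_i) = (\sqrt{\hat\lambda_1} v_1(i), \ldots, \sqrt{\hat\lambda_m} v_m(i))\) whose inner products reconstruct the kernel. This sketch assumes an RBF kernel on 200 sample points and \(m = 20\); both choices are illustrative.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(200, 1))
K = rbf_kernel(X)

# Empirical analogue of the Mercer expansion: eigendecompose the Gram matrix.
lam, V = np.linalg.eigh(K)
lam, V = lam[::-1], V[:, ::-1]             # sort eigenvalues in decreasing order

# Truncated feature map Phi(x_i) = (sqrt(lam_n) v_n(i))_{n=1..m}
m = 20
Phi = V[:, :m] * np.sqrt(np.clip(lam[:m], 0, None))

# Phi @ Phi.T is the rank-m approximation of K; the error is bounded by
# the largest discarded eigenvalue, which decays extremely fast here.
err = np.abs(Phi @ Phi.T - K).max()
print(err)
```

That a 200-dimensional Gram matrix is reproduced almost exactly by 20 features is the "effective dimensionality" phenomenon in miniature.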
Example: The RBF (Gaussian) Kernel
The radial basis function kernel
\(K(x, x') = \exp\bigl(-\|x - x'\|^2 / (2\sigma^2)\bigr)\)
is continuous, symmetric, and positive definite. Its Mercer eigenfunctions are
related to Hermite functions, and the eigenvalues decay
exponentially: \(\lambda_n \sim C \cdot r^n\) for some
\(0 < r < 1\). This rapid decay explains why the
Random Fourier Feature
approximation works well in practice — only the first few eigenfunctions carry
significant weight.
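The Random Fourier Feature idea mentioned above can be sketched as follows (this is the Rahimi-Recht construction; the dimensions and seed are illustrative): sample frequencies from the Gaussian spectral density of the RBF kernel, build cosine features \(z(x)\), and the inner product \(z(x)^\top z(x')\) approximates \(K(x, x')\) with Monte Carlo error \(O(1/\sqrt{D})\).

```python
import numpy as np

rng = np.random.default_rng(3)
d, D, sigma = 2, 2000, 1.0

# Sample frequencies from the spectral density of the RBF kernel
W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)

def rff(x):
    # z(x) = sqrt(2/D) * cos(Wx + b);  E[z(x) . z(x')] = K(x, x')
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-np.sum((x - y)**2) / (2 * sigma**2))
approx = rff(x) @ rff(y)
print(abs(exact - approx))    # Monte Carlo error, roughly O(1/sqrt(D))
```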
Connection to Gaussian Processes
In a Gaussian Process
with covariance function \(K\), the Mercer expansion gives the
Karhunen-Loève expansion:
\[
f(x) = \sum_{n=1}^{\infty} Z_n \sqrt{\lambda_n}\, e_n(x), \qquad Z_n \sim \mathcal{N}(0, 1) \text{ i.i.d.}
\]
Each draw from the GP is a random function whose "complexity" is governed by the
eigenvalue decay. Smooth kernels (fast decay) produce smooth sample paths;
rough kernels (slow decay) produce rough paths. The RKHS norm
(introduced in the next section) measures the complexity of the mean function
of the posterior GP.
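A discretized Karhunen-Loève sampler makes this concrete. On a grid, eigendecompose the covariance matrix, draw i.i.d. standard normals \(Z_n\), and form \(f = \sum_n Z_n \sqrt{\lambda_n} e_n\); the empirical covariance of many such draws should recover \(K\). The grid, length-scale, and number of draws below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 100)
K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * 0.3**2))   # RBF, sigma = 0.3

# Discrete Karhunen-Loeve: eigendecompose K, then f = sum_n Z_n sqrt(lam_n) e_n
lam, E = np.linalg.eigh(K)
lam = np.clip(lam, 0, None)            # clip tiny negative round-off eigenvalues

Z = rng.normal(size=(len(x), 5000))    # 5000 i.i.d. N(0,1) coefficients per mode
F = E @ (np.sqrt(lam)[:, None] * Z)    # each column is one GP sample path

# The empirical covariance of the sample paths approximates K
emp_cov = F @ F.T / Z.shape[1]
print(np.abs(emp_cov - K).max())       # shrinks as the number of draws grows
```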
RKHS Norm & Function Complexity
A central utility of the RKHS framework for machine learning is that the RKHS norm
provides a mathematically precise notion of function "smoothness" or "complexity."
This connects directly to
regularization.
The RKHS Inner Product via Mercer Eigenfunctions
Suppose the kernel \(K\) has the Mercer expansion
\(K(x, x') = \sum_{n=1}^\infty \lambda_n e_n(x) e_n(x')\).
Any function \(f \in \mathcal{H}_K\) can be expanded as
\(f(x) = \sum_{n=1}^\infty f_n \, e_n(x)\), where \(f_n = \langle f, e_n \rangle_{L^2}\)
are the generalized Fourier coefficients.
The RKHS inner product is not the \(L^2\) inner product. Instead, it
reweights each eigencomponent by the inverse of the eigenvalue:
Definition: RKHS Inner Product and Norm
For \(f = \sum_n f_n \, e_n\) and \(g = \sum_n g_n \, e_n\) in \(\mathcal{H}_K\):
\[
\langle f, g \rangle_{\mathcal{H}_K} = \sum_{n=1}^{\infty} \frac{f_n \, g_n}{\lambda_n}.
\]
The RKHS norm is therefore
\[
\|f\|_{\mathcal{H}_K}^2 = \sum_{n=1}^{\infty} \frac{f_n^2}{\lambda_n}.
\]
A function \(f\) belongs to \(\mathcal{H}_K\) if and only if
\(\sum_n f_n^2 / \lambda_n < \infty\).
The division by \(\lambda_n\) is the key insight. Since \(\lambda_n \to 0\),
the higher-order eigenfunctions (those corresponding to small eigenvalues)
are penalized more heavily in the RKHS norm. A function that relies
heavily on high-frequency components will have a large RKHS norm, while a function
concentrated on the first few eigenmodes will have a small norm.
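For functions in the span of kernel sections, the eigenvalue-weighted norm agrees with the kernel quadratic form: if \(f = \sum_i \alpha_i K(\cdot, x_i)\), then \(\|f\|_{\mathcal{H}_K}^2 = \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha}\), and expanding \(f\) in the eigenbasis of \(\mathbf{K}\) recovers \(\sum_n f_n^2 / \lambda_n\). A finite sketch of this consistency check (points, coefficients, and the eigenvalue cutoff are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=30)
K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * 0.5**2))

alpha = rng.normal(size=30)
f_vals = K @ alpha               # f(x_i) for f = sum_i alpha_i K(., x_i)

# Way 1: the kernel quadratic form alpha^T K alpha
norm_sq_1 = alpha @ K @ alpha

# Way 2: sum of f_n^2 / lam_n in the eigenbasis of K
lam, E = np.linalg.eigh(K)
keep = lam > 1e-10               # discard numerically zero eigenvalues
f_n = E.T @ f_vals               # coefficients of f in the eigenbasis
norm_sq_2 = np.sum(f_n[keep]**2 / lam[keep])

print(np.isclose(norm_sq_1, norm_sq_2))   # True
```

The division by \(\lambda_n\) in Way 2 is exactly the reweighting described above: coefficients along small-eigenvalue directions are amplified in the norm.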
Connection to Regularization: Why \(\|f\|_{\mathcal{H}_K}^2\) Controls Complexity
When we add \(\lambda \|f\|_{\mathcal{H}_K}^2\) to our loss function, we are
penalizing functions that use high-frequency eigenfunctions. This is the
infinite-dimensional analogue of
Ridge regression:
in finite dimensions, \(\|w\|^2 = \sum w_i^2\) penalizes large weights equally;
in the RKHS, \(\|f\|_{\mathcal{H}_K}^2 = \sum f_n^2 / \lambda_n\) penalizes
components in "hard" directions (small \(\lambda_n\)) more than "easy" ones
(large \(\lambda_n\)).
This is analogous to the quadratic form \(\mathbf{f}^T \mathbf{K}^{-1} \mathbf{f}\)
that appears in GP regression — the inverse kernel matrix penalizes predictions
that deviate from the prior's "smooth" subspace.
RKHS Functions are "Smoother" than \(L^2\) Functions
The membership condition \(\sum f_n^2 / \lambda_n < \infty\) is
strictly stronger than the \(L^2\) condition \(\sum f_n^2 < \infty\)
(since \(\lambda_n \to 0\)). Therefore:
\[
\mathcal{H}_K \subsetneq L^2(\mathcal{X}, \mu).
\]
The RKHS consists of "smoother" functions — those whose high-frequency components
decay fast enough to compensate for the shrinking eigenvalues.
The faster the eigenvalues \(\lambda_n\) decay, the smaller (and smoother)
the RKHS becomes.
This also explains why point evaluation is continuous in an RKHS but not in \(L^2\).
The extra smoothness required for RKHS membership ensures that functions are
"nice enough" to be evaluated pointwise — unlike generic \(L^2\) functions,
which are only defined up to measure-zero modifications.
The Representer Theorem & Grand Synthesis
We now arrive at the theorem that ties the entire Functional Analysis series together:
the Representer Theorem. It tells us that even though the RKHS
\(\mathcal{H}_K\) is infinite-dimensional, the solution to any regularized
empirical risk minimization problem lives in a finite-dimensional subspace
spanned by the kernel functions at the data points.
Theorem: The Representer Theorem
Let \(\mathcal{H}_K\) be an RKHS on \(\mathcal{X}\) with kernel \(K\),
let \(\{(x_i, y_i)\}_{i=1}^N\) be training data, and consider the regularized problem
\[
\min_{f \in \mathcal{H}_K} \left[\sum_{i=1}^{N} \ell\bigl(y_i,\, f(x_i)\bigr) + \lambda \|f\|_{\mathcal{H}_K}^2\right],
\]
where \(\ell\) is any loss function and \(\lambda > 0\). Then any minimizer
\(f^*\) has the form
\[
f^*(x) = \sum_{i=1}^{N} \alpha_i \, K(x, x_i)
\]
for some coefficients \(\alpha_1, \ldots, \alpha_N \in \mathbb{R}\).
This is a remarkable dimension reduction: the optimization was posed over the
entire infinite-dimensional space \(\mathcal{H}_K\), yet the answer lies in the
\(N\)-dimensional subspace \(\operatorname{span}\{K(\cdot, x_1), \ldots, K(\cdot, x_N)\}\).
Proof Sketch:
Decompose any \(f \in \mathcal{H}_K\) as
\(f = f_\parallel + f_\perp\), where
\(f_\parallel \in \operatorname{span}\{K(\cdot, x_i)\}_{i=1}^N\) and
\(f_\perp\) is orthogonal to this subspace.
By the reproducing property, \(f(x_i) = \langle f, K(\cdot, x_i) \rangle_{\mathcal{H}_K}\).
Since \(f_\perp\) is orthogonal to every \(K(\cdot, x_i)\):
\[
f(x_i) = f_\parallel(x_i) + \underbrace{\langle f_\perp, K(\cdot, x_i) \rangle}_{= 0} = f_\parallel(x_i).
\]
So \(f_\perp\) does not affect the data-fit term. However, by Pythagoras:
\[
\|f\|_{\mathcal{H}_K}^2 = \|f_\parallel\|_{\mathcal{H}_K}^2 + \|f_\perp\|_{\mathcal{H}_K}^2 \geq \|f_\parallel\|_{\mathcal{H}_K}^2.
\]
Adding any nonzero \(f_\perp\) strictly increases the regularization penalty
without improving the data fit. Hence the minimizer satisfies
\(f_\perp = 0\). \(\square\)
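The proof's two steps can be checked numerically in a discretized RKHS, where \(\langle u, v \rangle_{\mathcal{H}} \approx \mathbf{u}^T \mathbf{K}_{\text{grid}}^{-1} \mathbf{v}\) for function values on a grid. Any function vanishing at the data points is \(\mathcal{H}\)-orthogonal to every \(K(\cdot, x_i)\) (its inner product with \(K(\cdot, x_i)\) is its value at \(x_i\)), so adding it leaves the data fit unchanged while Pythagoras inflates the norm. This sketch assumes an RBF kernel with a small jitter term for numerical stability; all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
grid = np.linspace(-1, 1, 80)
Kg = (np.exp(-(grid[:, None] - grid[None, :])**2 / (2 * 0.4**2))
      + 1e-6 * np.eye(80))            # jitter for numerical stability

idx = [5, 20, 40, 60, 75]             # grid indices of the "data points"
a = rng.normal(size=len(idx))
f_par = Kg[:, idx] @ a                # f_parallel in span{K(., x_i)}

# Any function vanishing at the data points is H-orthogonal to every
# K(., x_i), since <g, K(., x_i)>_H = g(x_i) by the reproducing property.
f_perp = rng.normal(size=80)
f_perp[idx] = 0.0

def h_norm_sq(v):
    # discretized RKHS norm: <v, v>_H = v^T Kg^{-1} v
    return v @ np.linalg.solve(Kg, v)

f = f_par + f_perp
print(np.array_equal(f[idx], f_par[idx]))   # True: identical data fit
print(h_norm_sq(f) > h_norm_sq(f_par))      # True: strictly larger penalty
```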
From the Representer Theorem to Kernel Algorithms
With \(f^*(x) = \sum_{i=1}^N \alpha_i K(x, x_i)\), the infinite-dimensional problem
reduces to a finite-dimensional optimization over \(\boldsymbol{\alpha} \in \mathbb{R}^N\).
The reproducing property gives
\[
f^*(x_j) = \sum_{i=1}^N \alpha_i K(x_j, x_i) = (\mathbf{K}\boldsymbol{\alpha})_j
\]
where \(\mathbf{K}_{ij} = K(x_i, x_j)\) is the Gram matrix, and
\[
\|f^*\|_{\mathcal{H}_K}^2 = \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha}.
\]
The regularized problem becomes
\[
\min_{\boldsymbol{\alpha} \in \mathbb{R}^N} \left[\sum_{i=1}^N \ell\bigl(y_i,\, (\mathbf{K}\boldsymbol{\alpha})_i\bigr) + \lambda \, \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha}\right].
\]
For the squared loss \(\ell(y, \hat{y}) = (y - \hat{y})^2\), this is
Kernel Ridge Regression, with closed-form solution
\(\boldsymbol{\alpha} = (\mathbf{K} + \lambda I)^{-1} \mathbf{y}\).
For the hinge loss, it is the kernel
SVM.
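The entire pipeline above fits in a few lines. This is a minimal Kernel Ridge Regression sketch (the RBF kernel, the \(\sin\) target, and all hyperparameters are illustrative assumptions): solve \((\mathbf{K} + \lambda I)\boldsymbol{\alpha} = \mathbf{y}\), then predict via the representer-theorem form \(f^*(x) = \sum_i \alpha_i K(x, x_i)\).

```python
import numpy as np

def rbf_kernel(X, Y, sigma=0.5):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(6)
X = rng.uniform(-1, 1, size=(40, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=40)

lam = 1e-2
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # (K + lam I)^{-1} y

def predict(X_new):
    # f*(x) = sum_i alpha_i K(x, x_i): the representer-theorem form
    return rbf_kernel(X_new, X) @ alpha

X_test = np.linspace(-1, 1, 9)[:, None]
print(np.abs(predict(X_test) - np.sin(3 * X_test[:, 0])).max())  # small
```

Swapping `np.linalg.solve` for a hinge-loss solver over the same \(\boldsymbol{\alpha}\) parameterization yields the kernel SVM; only the loss changes, never the representer form.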
The Grand Synthesis: calc-23 to calc-28
Let us now trace how every chapter in the Functional Analysis series contributed
to this final result.
| Chapter | Key Contribution | Role in RKHS Theory |
| --- | --- | --- |
| calc-23 | Hilbert spaces, completeness, \(L^p\) spaces | RKHS is a Hilbert space; \(L^2\) is the ambient space for Mercer's integral operator |
| calc-24 | Bounded operators, continuity = boundedness | Point evaluation must be bounded (continuous); integral operators are bounded |
| calc-25 | Dual spaces, Riesz Representation | \(\delta_x \in \mathcal{H}^*\) implies existence of \(K_x = \delta_x^\sharp\) via Riesz — the reproducing kernel |
| calc-26 | Weak compactness, existence of minimizers | Guarantees that the regularized problem \(\min(\text{loss} + \lambda\|f\|^2)\) has a solution in \(\mathcal{H}_K\) |
| calc-27 | Spectral Theorem, eigenfunction expansion | Provides Mercer's eigenfunction decomposition of \(K\) and the explicit feature map \(\Phi\) |
| calc-28 (this chapter) | RKHS, Mercer, Representer Theorem | Unifies everything: infinite-dimensional optimization reduces to finite-dimensional kernel algebra |
Looking Ahead: From Function Spaces to Manifolds
This chapter concludes the Functional Analysis block of MATH-CS COMPASS.
We have traveled from the abstract definition of normed spaces (calc-23) to a
complete, rigorous understanding of why kernel methods work (calc-28).
The tools we have built — Hilbert space structure, duality, weak topologies,
spectral decomposition, and now RKHS — form the analytical foundation for
everything that follows.
The next stage of the curriculum moves from function spaces to
geometric spaces. The Musical Isomorphisms
(\(\sharp\) and \(\flat\)) from
calc-25 will generalize
from a single inner product to a Riemannian metric tensor that varies
from point to point. The spectral methods of calc-27 will become the eigenvalues of
the Laplace-Beltrami operator on smooth manifolds.
And the reproducing property of RKHS will find its echo in the
equivariance constraints of Geometric Deep Learning:
just as the kernel encodes the "right" notion of similarity in function space,
the group action encodes the "right" notion of symmetry on geometric domains.
The Functional Analysis series has given us the language of
infinite-dimensional mathematics. The geometry series will give us its
shape.