RKHS & Kernel Methods


Point Evaluation & The Need for RKHS

Throughout the Functional Analysis series — from Banach & Hilbert Spaces through Spectral Theory — we have treated functions as "points" in abstract infinite-dimensional spaces. We measured their "length" via norms, their "angle" via inner products, and their "spectrum" via eigenvalue decomposition. But there is one seemingly innocent operation we have not yet examined: evaluating a function at a point.

Given a function \(f\) in some Hilbert space \(\mathcal{H}\), define the point evaluation functional at \(x \in \mathcal{X}\): \[ \delta_x : \mathcal{H} \to \mathbb{R}, \qquad \delta_x(f) = f(x). \] This is clearly linear: \(\delta_x(\alpha f + \beta g) = \alpha f(x) + \beta g(x)\). The critical question is: is it continuous?

In \(L^2([a,b])\), the answer is no. Functions in \(L^2\) are equivalence classes defined up to sets of measure zero: changing a representative at a single point changes neither the equivalence class nor its \(L^2\) norm, so the value \(f(x)\) at a fixed point is not even well-defined. Point evaluation is therefore not a bounded linear functional on \(L^2\), and \(\delta_x \notin (L^2)^*\). We cannot meaningfully "evaluate" an \(L^2\) function at a point.

This is a serious problem for machine learning. In supervised learning, we must evaluate our model \(f\) at data points \(x_1, \ldots, x_N\) to compute the loss: \[ \mathcal{L}(f) = \sum_{i=1}^{N} \ell\bigl(y_i,\, f(x_i)\bigr) + \lambda \|f\|_{\mathcal{H}}^2. \] If point evaluation is not continuous, this expression is not well-defined as a functional on \(\mathcal{H}\). We need a Hilbert space where point evaluation is always bounded.

A Reproducing Kernel Hilbert Space (RKHS) is precisely a Hilbert space of functions in which point evaluation is a continuous linear functional for every point in the domain. This single requirement — that \(\delta_x \in \mathcal{H}^*\) for all \(x\) — has extraordinary consequences, connecting the Riesz Representation Theorem, Spectral Theory, and regularized optimization into a single coherent framework.

Reproducing Kernel Hilbert Spaces

Suppose \(\mathcal{H}\) is a Hilbert space of functions \(f : \mathcal{X} \to \mathbb{R}\) in which point evaluation is continuous: for every \(x \in \mathcal{X}\), the functional \(\delta_x(f) = f(x)\) satisfies \(\delta_x \in \mathcal{H}^*\). By the Riesz Representation Theorem, there exists a unique element \(K_x \in \mathcal{H}\) such that \[ f(x) = \delta_x(f) = \langle f,\, K_x \rangle_{\mathcal{H}} \quad \text{for all } f \in \mathcal{H}. \] The function \(K_x\) is the Riesz representative of the evaluation functional at \(x\).

Definition: Reproducing Kernel

Let \(\mathcal{H}\) be a Hilbert space of functions on a set \(\mathcal{X}\) in which all point evaluations are continuous. The reproducing kernel is the function \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) defined by \[ K(x, x') = \langle K_x,\, K_{x'} \rangle_{\mathcal{H}} = K_{x'}(x). \] The pair \((\mathcal{H}, K)\) is called a Reproducing Kernel Hilbert Space (RKHS).

Two properties follow immediately from this definition and are so fundamental that they deserve explicit naming.

Theorem: The Reproducing Property

For every \(f \in \mathcal{H}\) and every \(x' \in \mathcal{X}\): \[ f(x') = \langle f,\, K(\cdot, x') \rangle_{\mathcal{H}}. \] In particular, setting \(f = K(\cdot, x)\): \[ K(x, x') = \langle K(\cdot, x),\, K(\cdot, x') \rangle_{\mathcal{H}}. \] That is, the kernel reproduces itself under the inner product — hence the name.

The reproducing property says: to evaluate a function \(f\) at a point \(x'\), take the inner product of \(f\) with the kernel "centered at \(x'\)." The kernel function \(K(\cdot, x')\) acts as a "probe" that extracts the value of \(f\) at the point \(x'\).

Positive Definiteness: The Characterization

Not every symmetric function \(K(x, x')\) is a reproducing kernel. The kernel must be positive definite — the same condition that appeared in Gaussian Processes as the requirement for a valid covariance function.

Definition: Positive Definite Kernel (Mercer Kernel)

A symmetric function \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is positive definite if for all \(N \in \mathbb{N}\), all \(x_1, \ldots, x_N \in \mathcal{X}\), and all \(c_1, \ldots, c_N \in \mathbb{R}\): \[ \sum_{i=1}^{N} \sum_{j=1}^{N} c_i c_j K(x_i, x_j) \geq 0. \] Equivalently, the Gram matrix \(\mathbf{K}_{ij} = K(x_i, x_j)\) is positive semi-definite for every finite collection of points.
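This condition is easy to check numerically. The sketch below (a NumPy illustration; the `rbf_kernel` helper and the random sample points are ours, not part of the theory) builds a Gram matrix for the Gaussian kernel and verifies that it is symmetric and positive semi-definite:

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian (RBF) kernel matrix: K[i, j] = exp(-||X_i - Y_j||^2 / (2 sigma^2))."""
    d2 = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))        # 50 arbitrary points in R^3
K = rbf_kernel(X, X)

# Positive definiteness of the kernel means the Gram matrix is positive
# semi-definite: all eigenvalues >= 0, up to floating-point round-off.
min_eig = np.linalg.eigvalsh(K).min()
```

A single non-PSD Gram matrix on any finite point set would disqualify \(K\) as a reproducing kernel; passing this check on many point sets is evidence (though of course not proof) of positive definiteness.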

The deep connection between positive definite kernels and RKHS is captured by the following theorem, which establishes a perfect one-to-one correspondence.

Theorem: Moore-Aronszajn

For every positive definite kernel \(K\) on a set \(\mathcal{X}\), there exists a unique Hilbert space \(\mathcal{H}_K\) of functions on \(\mathcal{X}\) for which \(K\) is the reproducing kernel. Conversely, every RKHS has a unique reproducing kernel.

The proof constructs \(\mathcal{H}_K\) explicitly: start with the span of all functions of the form \(K(\cdot, x)\), define the inner product via \(\langle K(\cdot, x), K(\cdot, x') \rangle = K(x, x')\), and take the completion. The reproducing property is then built in by construction.
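A finite-dimensional slice of this construction can be carried out directly. The sketch below (NumPy; the kernel, centers, and coefficients are illustrative choices) builds functions \(f = \sum_i a_i K(\cdot, x_i)\), computes inner products through the Gram matrix as the construction prescribes, and checks the reproducing property at a center:

```python
import numpy as np

def rbf(x, y, sigma=0.5):
    return np.exp(-(x - y)**2 / (2 * sigma**2))

# Pre-Hilbert space: finite linear combinations f = sum_i a_i K(., x_i).
centers = np.array([0.0, 0.3, 0.7, 1.0])
G = rbf(centers[:, None], centers[None, :])    # Gram matrix of the centers

a = np.array([1.0, -2.0, 0.5, 0.25])           # coefficients of f
b = np.array([0.4, 0.0, -1.0, 2.0])            # coefficients of g

# Inner product induced by <K(., x_i), K(., x_j)> = K(x_i, x_j):
inner_fg = a @ G @ b

# Reproducing property at x_k: f(x_k) = <f, K(., x_k)>.
k = 2
f_at_xk = np.sum(a * rbf(centers, centers[k]))  # direct evaluation of f
inner_with_probe = a @ G[:, k]                  # <f, K(., x_k)> via the Gram matrix
```

The completion step of Moore-Aronszajn has no finite analogue, but on the span itself the reproducing property holds by construction, as the two computed values agree.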

Connection to the Kernel Trick

The kernel trick in SVMs replaces inner products \(\langle \phi(x), \phi(x') \rangle\) with kernel evaluations \(K(x, x')\). The Moore-Aronszajn theorem tells us exactly what is happening: \(K(x, x') = \langle K(\cdot, x), K(\cdot, x') \rangle_{\mathcal{H}_K}\), so the "feature map" is \(\phi(x) = K(\cdot, x) \in \mathcal{H}_K\). The kernel trick is not merely a computational shortcut; it is the reproducing property of an RKHS in action.

Mercer's Theorem

In Spectral Theory, we saw that a compact, self-adjoint, positive operator on a Hilbert space admits a spectral decomposition into eigenfunctions. Mercer's Theorem applies this machinery to the integral operator defined by a positive definite kernel, providing an explicit eigenfunction expansion of the kernel itself.

Theorem: Mercer's Theorem

Let \(\mathcal{X}\) be a compact subset of \(\mathbb{R}^d\) equipped with a finite Borel measure \(\mu\), and let \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) be a continuous, symmetric, positive definite kernel. Define the integral operator \(T_K : L^2(\mathcal{X}, \mu) \to L^2(\mathcal{X}, \mu)\) by \[ (T_K f)(x) = \int_{\mathcal{X}} K(x, x')\, f(x')\, d\mu(x'). \] Then \(T_K\) is compact, self-adjoint, and positive, and there exist orthonormal eigenfunctions \(\{e_n\}_{n=1}^\infty\) in \(L^2(\mathcal{X}, \mu)\) with eigenvalues \(\lambda_1 \geq \lambda_2 \geq \cdots > 0\), \(\lambda_n \to 0\), such that \[ K(x, x') = \sum_{n=1}^{\infty} \lambda_n \, e_n(x) \, e_n(x'). \] The convergence is absolute and uniform on \(\mathcal{X} \times \mathcal{X}\).

Let us unpack why each ingredient is necessary:

  1. Continuity + symmetry of \(K\): Guarantees that \(T_K\) is a Hilbert-Schmidt operator, hence compact.
  2. Positive definiteness: Ensures all eigenvalues are nonnegative (so \(T_K\) is a positive operator).
  3. The Spectral Theorem (calc-27): Provides the eigenfunction decomposition \(T_K f = \sum \lambda_n \langle f, e_n \rangle e_n\).
  4. Mercer's contribution: The eigenfunction expansion of the operator lifts to a pointwise expansion of the kernel function, with uniform convergence.
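A finite-sample analogue of the Mercer expansion is the eigendecomposition of the Gram matrix on a grid of quadrature points (the Nyström idea). The sketch below (NumPy; the 1-D grid, bandwidth, and truncation level are illustrative) reconstructs an RBF kernel matrix from its leading eigenpairs:

```python
import numpy as np

def rbf(x, y, sigma=0.3):
    return np.exp(-(x[:, None] - y[None, :])**2 / (2 * sigma**2))

x = np.linspace(0.0, 1.0, 60)        # points on a compact domain
K = rbf(x, x)

# Discrete analogue of T_K: eigendecompose the Gram matrix and
# reconstruct K from the top eigenpairs, mirroring the Mercer series.
lam, V = np.linalg.eigh(K)           # eigh returns ascending eigenvalues
lam, V = lam[::-1], V[:, ::-1]       # sort descending, as in the theorem

k = 10                               # keep only the 10 leading eigenpairs
K_trunc = (V[:, :k] * lam[:k]) @ V[:, :k].T   # sum_n lam_n e_n(x) e_n(x')

rel_err = np.linalg.norm(K - K_trunc) / np.linalg.norm(K)
```

For a smooth kernel the eigenvalues decay fast, so even this aggressive truncation reproduces \(K\) almost exactly; that rapid decay is the subject of the next subsection.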

The Feature Map from Mercer's Theorem

The Mercer expansion immediately yields an explicit feature map into \(\ell^2\): \[ \Phi(x) = \bigl(\sqrt{\lambda_1}\, e_1(x),\; \sqrt{\lambda_2}\, e_2(x),\; \ldots\bigr) \in \ell^2, \] and the kernel is recovered as the inner product: \[ K(x, x') = \langle \Phi(x),\, \Phi(x') \rangle_{\ell^2} = \sum_{n=1}^{\infty} \lambda_n \, e_n(x) \, e_n(x'). \]

This is the rigorous justification for the kernel trick: the kernel evaluates an inner product in an infinite-dimensional feature space without ever computing the feature map \(\Phi\) explicitly. The eigenvalues \(\lambda_n \to 0\) ensure that the feature map lands in \(\ell^2\) (the squared norms are summable), and the rate of decay of \(\lambda_n\) controls the "effective dimensionality" of the feature space.

Example: The RBF (Gaussian) Kernel

The radial basis function kernel \(K(x, x') = \exp\bigl(-\|x - x'\|^2 / (2\sigma^2)\bigr)\) is continuous, symmetric, and positive definite. Its Mercer eigenfunctions are related to Hermite functions, and the eigenvalues decay exponentially: \(\lambda_n \sim C \cdot r^n\) for some \(0 < r < 1\). This rapid decay explains why the Random Fourier Feature approximation works well in practice — only the first few eigenfunctions carry significant weight.
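The decay is easy to observe empirically. The sketch below (NumPy; the grid and bandwidth are illustrative, and Gram-matrix eigenvalues only approximate the true Mercer eigenvalues) measures how far the spectrum of an RBF Gram matrix has fallen by the tenth eigenvalue:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100)
K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * 0.2**2))   # RBF, sigma = 0.2

lam = np.linalg.eigvalsh(K)[::-1]   # eigenvalues, largest first
decay = lam[9] / lam[0]             # spectral mass remaining at n = 10
```

The ratio is many orders of magnitude below 1: nearly all of the kernel's "energy" sits in a handful of leading eigenfunctions, which is what low-rank approximations such as Random Fourier Features exploit.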

Connection to Gaussian Processes

In a Gaussian Process with covariance function \(K\), the Mercer expansion gives the Karhunen-Loève expansion: \[ f(x) = \sum_{n=1}^{\infty} Z_n \sqrt{\lambda_n}\, e_n(x), \qquad Z_n \sim \mathcal{N}(0, 1) \text{ i.i.d.} \] Each draw from the GP is a random function whose "complexity" is governed by the eigenvalue decay. Smooth kernels (fast decay) produce smooth sample paths; rough kernels (slow decay) produce rough paths. The RKHS norm (introduced in the next section) measures the complexity of the mean function of the posterior GP.
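The discrete Karhunen-Loève expansion can be simulated directly: eigendecompose the covariance matrix, scale i.i.d. standard normals by \(\sqrt{\lambda_n}\), and rotate back. The sketch below (NumPy; grid, bandwidth, and sample count are illustrative) checks that the empirical covariance of such draws recovers \(K\):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 25)
K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * 0.3**2))

# Discrete Karhunen-Loeve: f = sum_n Z_n sqrt(lam_n) e_n, Z_n ~ N(0, 1) i.i.d.
lam, V = np.linalg.eigh(K)
lam = np.clip(lam, 0.0, None)              # guard tiny negative round-off
Z = rng.standard_normal((len(x), 20000))   # 20000 i.i.d. coefficient vectors
samples = V @ (np.sqrt(lam)[:, None] * Z)  # each column is one GP draw

# The empirical covariance of the draws should approach K.
emp_cov = samples @ samples.T / samples.shape[1]
max_dev = np.abs(emp_cov - K).max()
```

Truncating the sum to the leading eigenmodes changes the draws only slightly for a smooth kernel, which is the KL-expansion view of why GP samples from fast-decaying kernels look smooth.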

RKHS Norm & Function Complexity

For machine learning, a central virtue of the RKHS framework is that the RKHS norm gives a mathematically precise notion of function "smoothness" or "complexity." This connects directly to regularization.

The RKHS Inner Product via Mercer Eigenfunctions

Suppose the kernel \(K\) has the Mercer expansion \(K(x, x') = \sum_{n=1}^\infty \lambda_n e_n(x) e_n(x')\). Any function \(f \in \mathcal{H}_K\) can be expanded as \(f(x) = \sum_{n=1}^\infty f_n \, e_n(x)\), where \(f_n = \langle f, e_n \rangle_{L^2}\) are the generalized Fourier coefficients.

The RKHS inner product is not the \(L^2\) inner product. Instead, it reweights each eigencomponent by the inverse of the eigenvalue:

Definition: RKHS Inner Product and Norm

For \(f = \sum_n f_n \, e_n\) and \(g = \sum_n g_n \, e_n\) in \(\mathcal{H}_K\): \[ \langle f, g \rangle_{\mathcal{H}_K} = \sum_{n=1}^{\infty} \frac{f_n \, g_n}{\lambda_n}. \] The RKHS norm is therefore \[ \|f\|_{\mathcal{H}_K}^2 = \sum_{n=1}^{\infty} \frac{f_n^2}{\lambda_n}. \] A function \(f\) belongs to \(\mathcal{H}_K\) if and only if \(\sum_n f_n^2 / \lambda_n < \infty\).

The division by \(\lambda_n\) is the key insight. Since \(\lambda_n \to 0\), the higher-order eigenfunctions (those corresponding to small eigenvalues) are penalized more heavily in the RKHS norm. A function that relies heavily on high-frequency components will have a large RKHS norm, while a function concentrated on the first few eigenmodes will have a small norm.

Connection to Regularization: Why \(\|f\|_{\mathcal{H}_K}^2\) Controls Complexity

When we add \(\lambda \|f\|_{\mathcal{H}_K}^2\) to our loss function, we are penalizing functions that use high-frequency eigenfunctions. This is the infinite-dimensional analogue of Ridge regression: in finite dimensions, \(\|w\|^2 = \sum w_i^2\) penalizes large weights equally; in the RKHS, \(\|f\|_{\mathcal{H}_K}^2 = \sum f_n^2 / \lambda_n\) penalizes components in "hard" directions (small \(\lambda_n\)) more than "easy" ones (large \(\lambda_n\)).

This is analogous to the quadratic form \(\mathbf{f}^T \mathbf{K}^{-1} \mathbf{f}\) that appears in GP regression — the inverse kernel matrix penalizes predictions that deviate from the prior's "smooth" subspace.
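On a finite sample these three views of the norm coincide exactly. The sketch below (NumPy; the points, bandwidth, jitter, and coefficients are illustrative) computes \(\|f\|^2\) as \(\boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha}\), as \(\mathbf{f}^T \mathbf{K}^{-1} \mathbf{f}\), and as the eigenvalue-reweighted sum \(\sum_n f_n^2 / \lambda_n\):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 15)
K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * 0.25**2)) + 1e-8 * np.eye(15)

alpha = rng.normal(size=15)
f = K @ alpha                    # f(x_j) = sum_i alpha_i K(x_j, x_i)

# Three equivalent expressions for ||f||_H^2 on the sample:
norm_primal = alpha @ K @ alpha              # alpha^T K alpha
norm_gp = f @ np.linalg.solve(K, f)          # f^T K^{-1} f, as in GP regression

lam, V = np.linalg.eigh(K)                   # finite Mercer analogue
coeffs = V.T @ f                             # components of f in the eigenbasis
norm_spectral = np.sum(coeffs**2 / lam)      # sum_n f_n^2 / lam_n
```

The small diagonal jitter is a purely numerical safeguard against the near-singular Gram matrices that smooth kernels produce; it does not change the mathematics.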

RKHS Functions are "Smoother" than \(L^2\) Functions

The membership condition \(\sum f_n^2 / \lambda_n < \infty\) is strictly stronger than the \(L^2\) condition \(\sum f_n^2 < \infty\) (since \(\lambda_n \to 0\)). Therefore: \[ \mathcal{H}_K \subsetneq L^2(\mathcal{X}, \mu). \] The RKHS consists of "smoother" functions — those whose high-frequency components decay fast enough to compensate for the shrinking eigenvalues. The faster the eigenvalues \(\lambda_n\) decay, the smaller (and smoother) the RKHS becomes.
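The strict inclusion is visible in a toy coefficient sequence. Suppose (purely as an assumed example) the eigenvalues decay as \(\lambda_n = n^{-2}\) and a candidate function has coefficients \(f_n = 1/n\). Then \(\sum f_n^2\) converges while \(\sum f_n^2 / \lambda_n = \sum 1\) diverges, so \(f \in L^2\) but \(f \notin \mathcal{H}_K\):

```python
import numpy as np

n = np.arange(1, 100001, dtype=float)
lam = n**-2.0          # hypothetical Mercer eigenvalues, lam_n = 1/n^2
f_n = 1.0 / n          # coefficients of a candidate function f

l2_norm_sq = np.sum(f_n**2)             # converges (to pi^2/6 ~ 1.645)
rkhs_partial = np.cumsum(f_n**2 / lam)  # partial sums of sum f_n^2 / lam_n: each term is 1
```

The \(L^2\) partial sums stabilize almost immediately, while the RKHS partial sums grow without bound: the high-frequency tail of \(f\) decays fast enough for \(L^2\) membership but not fast enough to beat the shrinking eigenvalues.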

This also explains why point evaluation is continuous in an RKHS but not in \(L^2\). The extra smoothness required for RKHS membership ensures that functions are "nice enough" to be evaluated pointwise — unlike generic \(L^2\) functions, which are only defined up to measure-zero modifications.

The Representer Theorem & Grand Synthesis

We now arrive at the theorem that ties the entire Functional Analysis series together: the Representer Theorem. It tells us that even though the RKHS \(\mathcal{H}_K\) is infinite-dimensional, the solution to any regularized empirical risk minimization problem lives in a finite-dimensional subspace spanned by the kernel functions at the data points.

Theorem: The Representer Theorem

Let \(\mathcal{H}_K\) be an RKHS on \(\mathcal{X}\) with kernel \(K\), let \(\{(x_i, y_i)\}_{i=1}^N\) be training data, and consider the regularized problem \[ \min_{f \in \mathcal{H}_K} \left[\sum_{i=1}^{N} \ell\bigl(y_i,\, f(x_i)\bigr) + \lambda \|f\|_{\mathcal{H}_K}^2\right], \] where \(\ell\) is any loss function and \(\lambda > 0\). Then any minimizer \(f^*\) admits a representation of the form \[ f^*(x) = \sum_{i=1}^{N} \alpha_i \, K(x, x_i) \] for some coefficients \(\alpha_1, \ldots, \alpha_N \in \mathbb{R}\).

This is a remarkable dimension reduction: the optimization was posed over the entire infinite-dimensional space \(\mathcal{H}_K\), yet the answer lies in the \(N\)-dimensional subspace \(\operatorname{span}\{K(\cdot, x_1), \ldots, K(\cdot, x_N)\}\).

Proof Sketch:

Decompose any \(f \in \mathcal{H}_K\) as \(f = f_\parallel + f_\perp\), where \(f_\parallel \in \operatorname{span}\{K(\cdot, x_i)\}_{i=1}^N\) and \(f_\perp\) is orthogonal to this subspace.

By the reproducing property, \(f(x_i) = \langle f, K(\cdot, x_i) \rangle_{\mathcal{H}_K}\). Since \(f_\perp\) is orthogonal to every \(K(\cdot, x_i)\): \[ f(x_i) = f_\parallel(x_i) + \underbrace{\langle f_\perp, K(\cdot, x_i) \rangle}_{= 0} = f_\parallel(x_i). \] So \(f_\perp\) does not affect the data-fit term. However, by Pythagoras: \[ \|f\|_{\mathcal{H}_K}^2 = \|f_\parallel\|_{\mathcal{H}_K}^2 + \|f_\perp\|_{\mathcal{H}_K}^2 \geq \|f_\parallel\|_{\mathcal{H}_K}^2. \] Adding any nonzero \(f_\perp\) strictly increases the regularization penalty without improving the data fit. Hence the minimizer satisfies \(f_\perp = 0\). \(\square\)

From the Representer Theorem to Kernel Algorithms

With \(f^*(x) = \sum_{i=1}^N \alpha_i K(x, x_i)\), the infinite-dimensional problem reduces to a finite-dimensional optimization over \(\boldsymbol{\alpha} \in \mathbb{R}^N\). The reproducing property gives \[ f^*(x_j) = \sum_{i=1}^N \alpha_i K(x_j, x_i) = (\mathbf{K}\boldsymbol{\alpha})_j \] where \(\mathbf{K}_{ij} = K(x_i, x_j)\) is the Gram matrix, and \[ \|f^*\|_{\mathcal{H}_K}^2 = \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha}. \] The regularized problem becomes \[ \min_{\boldsymbol{\alpha} \in \mathbb{R}^N} \left[\sum_{i=1}^N \ell\bigl(y_i,\, (\mathbf{K}\boldsymbol{\alpha})_i\bigr) + \lambda \, \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha}\right]. \] For the squared loss \(\ell(y, \hat{y}) = (y - \hat{y})^2\), this is Kernel Ridge Regression, with closed-form solution \(\boldsymbol{\alpha} = (\mathbf{K} + \lambda I)^{-1} \mathbf{y}\). For the hinge loss, it is the kernel SVM.
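The Kernel Ridge Regression case fits in a few lines. The sketch below (NumPy; the data-generating function, noise level, bandwidth, and \(\lambda\) are illustrative choices) implements the closed form \(\boldsymbol{\alpha} = (\mathbf{K} + \lambda I)^{-1} \mathbf{y}\) and evaluates \(f^*\) via the representer expansion:

```python
import numpy as np

def rbf(x, y, sigma=0.3):
    return np.exp(-(x[:, None] - y[None, :])**2 / (2 * sigma**2))

rng = np.random.default_rng(3)
x_train = np.sort(rng.uniform(0.0, 1.0, 40))
y_train = np.sin(2 * np.pi * x_train) + 0.05 * rng.standard_normal(40)

lam = 1e-3
K = rbf(x_train, x_train)
# Closed-form KRR solution: alpha = (K + lam I)^{-1} y.
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

def predict(x_new):
    # Representer theorem: f*(x) = sum_i alpha_i K(x, x_i).
    return rbf(x_new, x_train) @ alpha

train_mse = np.mean((predict(x_train) - y_train)**2)
```

Note that `predict` touches only the \(N\) training points through the kernel: the infinite-dimensional optimization has collapsed to one linear solve, exactly as the theorem promises.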

The Grand Synthesis: calc-23 to calc-28

Let us now trace how every chapter in the Functional Analysis series contributed to this final result.

| Chapter | Key Contribution | Role in RKHS Theory |
| --- | --- | --- |
| calc-23 | Hilbert spaces, completeness, \(L^p\) spaces | RKHS is a Hilbert space; \(L^2\) is the ambient space for Mercer's integral operator |
| calc-24 | Bounded operators, continuity = boundedness | Point evaluation must be bounded (continuous); integral operators are bounded |
| calc-25 | Dual spaces, Riesz Representation | \(\delta_x \in \mathcal{H}^*\) implies existence of \(K_x = \delta_x^\sharp\) via Riesz: the reproducing kernel |
| calc-26 | Weak compactness, existence of minimizers | Guarantees that the regularized problem \(\min(\text{loss} + \lambda\|f\|^2)\) has a solution in \(\mathcal{H}_K\) |
| calc-27 | Spectral Theorem, eigenfunction expansion | Provides Mercer's eigenfunction decomposition of \(K\) and the explicit feature map \(\Phi\) |
| calc-28 (this chapter) | RKHS, Mercer, Representer Theorem | Unifies everything: infinite-dimensional optimization reduces to finite-dimensional kernel algebra |

Looking Ahead: From Function Spaces to Manifolds

This chapter concludes the Functional Analysis block of MATH-CS COMPASS. We have traveled from the abstract definition of normed spaces (calc-23) to a complete, rigorous understanding of why kernel methods work (calc-28). The tools we have built — Hilbert space structure, duality, weak topologies, spectral decomposition, and now RKHS — form the analytical foundation for everything that follows.

The next stage of the curriculum moves from function spaces to geometric spaces. The Musical Isomorphisms (\(\sharp\) and \(\flat\)) from calc-25 will generalize from a single inner product to a Riemannian metric tensor that varies from point to point. The spectral methods of calc-27 will become the eigenvalues of the Laplace-Beltrami operator on smooth manifolds. And the reproducing property of RKHS will find its echo in the equivariance constraints of Geometric Deep Learning: just as the kernel encodes the "right" notion of similarity in function space, the group action encodes the "right" notion of symmetry on geometric domains.

The Functional Analysis series has given us the language of infinite-dimensional mathematics. The geometry series will give us its shape.