RKHS & Kernel Methods


Point Evaluation & The Need for RKHS

Throughout the Functional Analysis series — from Banach & Hilbert Spaces through Spectral Theory — we have treated functions as "points" in abstract infinite-dimensional spaces. We measured their "length" via norms, their "angle" via inner products, and their "spectrum" via eigenvalue decomposition. But there is one seemingly innocent operation we have not yet examined: evaluating a function at a point.

Convention for this chapter. Our focus is the standard machine-learning setting, in which kernels and the corresponding Hilbert space \(\mathcal{H}_K\) are real-valued (RBF, polynomial, Laplacian, and Matérn kernels are all real). The inner product is therefore symmetric, \(\langle u, v\rangle_{\mathcal{H}_K} = \langle v, u\rangle_{\mathcal{H}_K}\), so the first-slot-linear convention adopted in Dual Spaces trivially restricts here. The complex Hermitian theory developed in Spectral Theory specializes cleanly to this real-symmetric case — the conjugation in the Hermitian inner product becomes a no-op when all quantities are real-valued. We will state this reduction explicitly when it matters (notably in the Mercer expansion).

Given a function \(f\) in some Hilbert space \(\mathcal{H}\), define the point evaluation functional at \(x \in \mathcal{X}\): \[ \delta_x : \mathcal{H} \to \mathbb{R}, \qquad \delta_x(f) = f(x). \] This is clearly linear: \(\delta_x(\alpha f + \beta g) = \alpha f(x) + \beta g(x)\). The critical question is: is it continuous?

In \(L^2([a,b])\), the answer is no. Functions in \(L^2\) are equivalence classes defined up to sets of measure zero — changing a function at a single point does not change its \(L^2\) norm. Consequently, point evaluation is not a bounded linear functional on \(L^2\), and \(\delta_x \notin (L^2)^*\). We cannot meaningfully "evaluate" an \(L^2\) function at a point.

This is a serious problem for machine learning. In supervised learning, we must evaluate our model \(f\) at data points \(x_1, \ldots, x_N\) to compute the loss: \[ \mathcal{L}(f) = \sum_{i=1}^{N} \ell\bigl(y_i,\, f(x_i)\bigr) + \lambda \|f\|_{\mathcal{H}}^2. \] If point evaluation is not continuous, this expression is not well-defined as a functional on \(\mathcal{H}\). We need a Hilbert space where point evaluation is always bounded.

A Reproducing Kernel Hilbert Space (RKHS) is precisely a Hilbert space of functions in which point evaluation is a continuous linear functional for every point in the domain. This single requirement — that \(\delta_x \in \mathcal{H}^*\) for all \(x\) — has extraordinary consequences, connecting the Riesz Representation Theorem, Spectral Theory, and regularized optimization into a single coherent framework.

Reproducing Kernel Hilbert Spaces

Suppose, then, that \(\mathcal{H}\) is a Hilbert space of functions on \(\mathcal{X}\) in which the evaluation functional \(\delta_x(f) = f(x)\) belongs to \(\mathcal{H}^*\) for every \(x \in \mathcal{X}\). The Riesz Representation Theorem then gives a unique element \(K_x \in \mathcal{H}\) — the Riesz representative of \(\delta_x\) — such that \[ f(x) = \delta_x(f) = \langle f,\, K_x \rangle_{\mathcal{H}} \quad \text{for all } f \in \mathcal{H}. \]

Definition: Reproducing Kernel

Let \(\mathcal{H}\) be a Hilbert space of functions on a set \(\mathcal{X}\) in which all point evaluations are continuous. The reproducing kernel is the function \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) defined by \[ K(x, x') = \langle K_x,\, K_{x'} \rangle_{\mathcal{H}} = K_{x'}(x). \] The pair \((\mathcal{H}, K)\) is called a Reproducing Kernel Hilbert Space (RKHS).

Notation. The Riesz representative \(K_x \in \mathcal{H}\) and the partially-evaluated kernel \(K(\cdot, x) : \mathcal{X} \to \mathbb{R}\) denote the same function: \(K_x(\cdot) = K(\cdot, x)\). We use both notations interchangeably, favoring \(K(\cdot, x)\) when emphasizing the kernel structure and \(K_x\) when emphasizing the Riesz duality.

Two properties follow immediately from this definition and are so fundamental that they deserve explicit naming.

Theorem: The Reproducing Property

For every \(f \in \mathcal{H}\) and every \(x' \in \mathcal{X}\): \[ f(x') = \langle f,\, K(\cdot, x') \rangle_{\mathcal{H}}. \] In particular, setting \(f = K(\cdot, x)\): \[ K(x, x') = \langle K(\cdot, x),\, K(\cdot, x') \rangle_{\mathcal{H}}. \] That is, the kernel reproduces itself under the inner product — hence the name.

The reproducing property says: to evaluate a function \(f\) at a point \(x'\), take the inner product of \(f\) with the kernel "centered at \(x'\)." The kernel function \(K(\cdot, x')\) acts as a "probe" that extracts the value of \(f\) at the point \(x'\).

Positive Definiteness: The Characterization

Not every symmetric function \(K(x, x')\) is a reproducing kernel. The kernel must be positive definite — the same condition that appeared in Gaussian Processes as the requirement for a valid covariance function.

Definition: Positive Definite Kernel (Mercer Kernel)

A symmetric function \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is positive definite if for all \(N \in \mathbb{N}\), all \(x_1, \ldots, x_N \in \mathcal{X}\), and all \(c_1, \ldots, c_N \in \mathbb{R}\): \[ \sum_{i=1}^{N} \sum_{j=1}^{N} c_i c_j K(x_i, x_j) \geq 0. \] Equivalently, the Gram matrix \(\mathbf{K}_{ij} = K(x_i, x_j)\) is positive semi-definite for every finite collection of points.

Terminological note. The condition \(\geq 0\) makes the Gram matrix positive semi-definite in the sense of Quadratic Forms. The kernel literature (Schölkopf-Smola, Berg-Christensen-Ressel) nevertheless calls this "positive definite kernel" by convention, reserving "strictly positive definite" for the case where equality holds only when all \(c_i = 0\). We follow this convention throughout.
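
To make the Gram-matrix condition concrete, here is a minimal numerical check (a sketch in NumPy; the RBF kernel, bandwidth, and sample points are illustrative choices, not fixed by the text): build the Gram matrix at a finite point set and confirm that its eigenvalues, and the quadratic form \(c^T \mathbf{K} c\), are nonnegative up to floating-point tolerance.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))         # 50 arbitrary points in R^3
K = rbf_kernel(X, X)

# Positive semi-definiteness: every eigenvalue of the Gram matrix is >= 0
# (up to floating-point round-off).
print(np.linalg.eigvalsh(K).min())   # ~ -1e-15 or larger

# Equivalently, c^T K c >= 0 for any coefficient vector c.
c = rng.normal(size=50)
print(c @ K @ c >= -1e-10)           # True
```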

The deep connection between positive definite kernels and RKHS is captured by the following theorem, which establishes a perfect one-to-one correspondence.

Theorem: Moore-Aronszajn

For every positive definite kernel \(K\) on a set \(\mathcal{X}\), there exists a unique Hilbert space \(\mathcal{H}_K\) of functions on \(\mathcal{X}\) for which \(K\) is the reproducing kernel. Conversely, every RKHS has a unique reproducing kernel.

Proof sketch:

We construct \(\mathcal{H}_K\) explicitly in four steps.

Step 1 (pre-Hilbert space). Let \(\mathcal{H}_0 = \operatorname{span}\{K(\cdot, x) : x \in \mathcal{X}\}\) and define a bilinear form on the generators by \(\langle K(\cdot, x), K(\cdot, x') \rangle := K(x, x')\). Symmetry of \(K\) makes this form symmetric. For any finite linear combination \(f = \sum_i a_i K(\cdot, x_i)\), bilinearity forces \(\langle f, f\rangle = \sum_{i,j} a_i a_j K(x_i, x_j) \geq 0\) by the Gram-matrix condition, so the form is positive semi-definite. Well-definedness of the bilinear extension to all of \(\mathcal{H}_0\) — that \(\langle f, g\rangle\) is independent of the chosen finite-sum representations of \(f\) and \(g\) — is established next.

Step 2 (reproducing property on \(\mathcal{H}_0\); well-definedness; positive definiteness). For \(f = \sum_i a_i K(\cdot, x_i)\), direct computation gives \(\langle f, K(\cdot, x)\rangle = \sum_i a_i K(x_i, x) = f(x)\). Consequently, for any \(g = \sum_j b_j K(\cdot, y_j) \in \mathcal{H}_0\), bilinearity yields \(\langle f, g\rangle = \sum_j b_j f(y_j)\) — a value determined entirely by the function \(f\) (not its representation) and the data \((b_j, y_j)\); a symmetric argument in the first slot completes the verification that \(\langle\cdot,\cdot\rangle\) is well-defined on \(\mathcal{H}_0\). The Cauchy-Schwarz inequality then gives \(|f(x)|^2 = |\langle f, K(\cdot, x)\rangle|^2 \leq \langle f, f\rangle \cdot K(x, x)\), so \(\langle f, f\rangle = 0\) forces \(f \equiv 0\) — the form is a genuine inner product.

Step 3 (completion as a function space). This is the heart of the theorem. Take the abstract Hilbert-space completion of \(\mathcal{H}_0\) — the equivalence classes of Cauchy sequences \(\{f_n\} \subset \mathcal{H}_0\). For each \(x \in \mathcal{X}\), the Cauchy-Schwarz bound from Step 2 gives \(|f_n(x) - f_m(x)| \leq \|f_n - f_m\| \cdot \sqrt{K(x, x)}\), so \(\{f_n(x)\}\) is Cauchy in \(\mathbb{R}\) and converges to some \(f(x)\). The same bound shows that Cauchy-equivalent sequences (\(\|f_n - g_n\| \to 0\)) yield identical pointwise limits, so \(f : \mathcal{X} \to \mathbb{R}\) depends only on the equivalence class. Conversely, if the pointwise limit vanishes at every \(x\), then the completion limit \(f^*\) also vanishes pointwise (the reproducing property extends continuously to the completion, giving \(f^*(x) = \langle f^*, K(\cdot, x)\rangle = \lim_n f_n(x) = 0\)); since \(\mathcal{H}_0\) is dense in the completion and \(f^*\) is orthogonal to every \(K(\cdot, x) \in \mathcal{H}_0\), this forces \(\|f^*\| = 0\). The assignment "equivalence class \(\mapsto\) pointwise limit" is therefore injective, and the completion is realized as a Hilbert space \(\mathcal{H}_K\) of bona fide functions on \(\mathcal{X}\), not merely of abstract equivalence classes.

Step 4 (reproducing property on \(\mathcal{H}_K\); uniqueness). Point evaluation \(\delta_x = \langle \cdot, K(\cdot, x)\rangle\) is bounded on \(\mathcal{H}_0\) with operator norm \(\sqrt{K(x, x)}\); since \(\mathcal{H}_0\) is dense in \(\mathcal{H}_K\) by construction, \(\delta_x\) extends by continuity to all of \(\mathcal{H}_K\), preserving the reproducing property \(f(x) = \langle f, K(\cdot, x)\rangle\) throughout. Uniqueness of the reproducing kernel follows from the Riesz Representation Theorem: each \(\delta_x\) determines its representative \(K_x\) uniquely. Uniqueness of \(\mathcal{H}_K\) itself follows because any RKHS reproducing the same \(K\) must contain \(\{K(\cdot, x)\}\), inherit the same inner product on this set (forced by the reproducing property), and be complete — hence coincides with \(\mathcal{H}_K\) as a function space.

For the complete argument with all measurability and density verifications, see Aronszajn's original 1950 paper or Steinwart & Christmann, Support Vector Machines, Ch. 4 ("Kernels and Reproducing Kernel Hilbert Spaces").
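
The finite-span construction of Steps 1-2 can be simulated directly (a sketch; the RBF kernel, the centers, and the coefficients below are illustrative). A function \(f = \sum_i a_i K(\cdot, x_i)\) has squared norm \(a^T \mathbf{K} a\), and the Cauchy-Schwarz bound of Step 2, \(|f(x)| \leq \|f\| \sqrt{K(x,x)}\), can be checked at arbitrary test points.

```python
import numpy as np

def rbf(X, Y, s=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * s * s))

rng = np.random.default_rng(1)
centers = rng.normal(size=(20, 2))   # generators x_1, ..., x_20
a = rng.normal(size=20)              # coefficients a_i
K = rbf(centers, centers)

# f = sum_i a_i K(., x_i), evaluated at arbitrary test points.
x_test = rng.normal(size=(100, 2))
f_vals = rbf(x_test, centers) @ a    # f(x) = sum_i a_i K(x, x_i)

# Step-1 inner product: <f, f> = sum_{i,j} a_i a_j K(x_i, x_j) = a^T K a.
f_norm = np.sqrt(a @ K @ a)

# Step-2 Cauchy-Schwarz bound: |f(x)| <= ||f|| * sqrt(K(x, x)) (= ||f|| for RBF).
bound = f_norm * np.sqrt(np.diag(rbf(x_test, x_test)))
print(np.all(np.abs(f_vals) <= bound + 1e-12))   # True
```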

Connection to the Kernel Trick

The kernel trick in SVMs replaces inner products \(\langle \phi(x), \phi(x') \rangle\) with kernel evaluations \(K(x, x')\). The Moore-Aronszajn theorem tells us exactly what is happening: \(K(x, x') = \langle K(\cdot, x), K(\cdot, x') \rangle_{\mathcal{H}_K}\), so the "feature map" is \(\phi(x) = K(\cdot, x) \in \mathcal{H}_K\). The kernel trick is not a computational shortcut — it is the reproducing property of an RKHS.
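
For a kernel whose feature space is finite-dimensional, the identity \(K(x, x') = \langle \phi(x), \phi(x')\rangle\) can be written out by hand. A minimal sketch (the homogeneous quadratic kernel on \(\mathbb{R}^2\) is an illustrative choice): \(K(x, x') = (x \cdot x')^2\) has the explicit feature map \(\phi(x) = (x_1^2,\, x_2^2,\, \sqrt{2}\, x_1 x_2)\).

```python
import numpy as np

def poly2_kernel(x, y):
    """K(x, x') = (x . x')^2, a positive definite kernel on R^2."""
    return (x @ y) ** 2

def poly2_features(x):
    """Explicit feature map with <phi(x), phi(x')> = K(x, x')."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.default_rng(2)
x, y = rng.normal(size=2), rng.normal(size=2)

print(poly2_kernel(x, y))                     # kernel evaluation
print(poly2_features(x) @ poly2_features(y))  # the same number, via phi
```

For the RBF kernel the analogous \(\phi\) is infinite-dimensional (it is \(x \mapsto K(\cdot, x)\) itself), which is exactly why the kernel-side computation matters.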

Mercer's Theorem

In Spectral Theory, we saw that a compact, self-adjoint, positive operator on a Hilbert space admits a spectral decomposition into eigenfunctions. Mercer's Theorem applies this machinery to the integral operator defined by a positive definite kernel, providing an explicit eigenfunction expansion of the kernel itself.

Theorem: Mercer's Theorem

Let \(\mathcal{X}\) be a compact subset of \(\mathbb{R}^d\) equipped with a finite strictly positive Borel measure \(\mu\) (i.e., every non-empty open set has positive measure), and let \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) be a continuous, symmetric, positive definite kernel. Define the integral operator \(T_K : L^2(\mathcal{X}, \mu) \to L^2(\mathcal{X}, \mu)\) by \[ (T_K f)(x) = \int_{\mathcal{X}} K(x, x')\, f(x')\, d\mu(x'). \] Then \(T_K\) is compact, self-adjoint, and positive, and there exist orthonormal eigenfunctions \(\{e_n\}_{n=1}^\infty\) in \(L^2(\mathcal{X}, \mu)\) with eigenvalues \(\lambda_1 \geq \lambda_2 \geq \cdots > 0\), \(\lambda_n \to 0\) (with the convention that the list enumerates only the nonzero eigenvalues, of which there are countably many by compactness of \(T_K\)), such that \[ K(x, x') = \sum_{n=1}^{\infty} \lambda_n \, e_n(x) \, e_n(x'). \] The convergence is absolute and uniform on \(\mathcal{X} \times \mathcal{X}\).

Proof sketch:

Continuity of \(K\) on the compact set \(\mathcal{X}\times\mathcal{X}\) and finiteness of \(\mu\) give \(K \in L^2(\mathcal{X}\times\mathcal{X}, \mu\otimes\mu)\), so \(T_K\) is a Hilbert-Schmidt operator and hence compact. Symmetry of \(K\) gives self-adjointness.

Positive definiteness of \(K\) — defined for finite point sets — must be promoted to the integral statement \(\langle T_K f, f\rangle \geq 0\). For continuous \(f \in C(\mathcal{X})\) and any partition \(\{A_i\}\) of \(\mathcal{X}\) with sample points \(x_i \in A_i\), the discrete PD condition gives \[ \sum_{i,j} K(x_i, x_j) f(x_i) f(x_j) \, \mu(A_i) \mu(A_j) \geq 0. \] As the partition is refined, the left-hand side converges to \(\iint K(x, x') f(x) f(x')\, d\mu(x) d\mu(x') = \langle T_K f, f\rangle\) (uniform continuity of \(K\) and \(f\) on the compact \(\mathcal{X}\); finite \(\mu\)), so \(\langle T_K f, f\rangle \geq 0\) for all \(f \in C(\mathcal{X})\). Density of \(C(\mathcal{X})\) in \(L^2(\mathcal{X}, \mu)\) (standard for finite Borel measures on compact spaces) and continuity of \(f \mapsto \langle T_K f, f\rangle\) on \(L^2\) extend the inequality to all of \(L^2\). All eigenvalues of \(T_K\) are therefore nonnegative.
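
This Riemann-sum promotion is easy to watch numerically (a sketch; the Laplacian kernel, the test function, and the uniform grid on \([0,1]\) are illustrative): the discrete sum is a Gram quadratic form with coefficients \(c_i = f(x_i)\,\mu(A_i)\), hence nonnegative, and it approximates \(\langle T_K f, f\rangle\) as the grid refines.

```python
import numpy as np

# Uniform grid on [0, 1]: cells A_i of measure mu(A_i) = 1/N, midpoints x_i.
N = 1000
x = (np.arange(N) + 0.5) / N
K = np.exp(-np.abs(x[:, None] - x[None, :]))   # Laplacian kernel, positive definite
f = np.cos(7 * np.pi * x)                      # any continuous test function

# Riemann sum for <T_K f, f>:
# sum_{i,j} K(x_i, x_j) f(x_i) f(x_j) mu(A_i) mu(A_j) = c^T K c with c_i = f(x_i)/N.
c = f / N
print(c @ K @ c)   # nonnegative: the discrete PD condition survives the limit
```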

The Spectral Theorem for compact self-adjoint operators gives \(T_K f = \sum_n \lambda_n \langle f, e_n\rangle_{L^2} e_n\) with orthonormal eigenfunctions \(\{e_n\}\). Since the kernel \(K\) determines \(T_K\) uniquely as an integral operator (distinct kernels modulo a.e. equivalence define distinct integral operators), this lifts to the kernel decomposition \(K(x, x') = \sum_n \lambda_n e_n(x) e_n(x')\) with convergence in \(L^2(\mathcal{X}\times\mathcal{X}, \mu\otimes\mu)\).

The upgrade to absolute and uniform convergence on \(\mathcal{X}\times\mathcal{X}\) is the deep part. The right-hand side of \(T_K e_n = \lambda_n e_n\), namely \(x \mapsto \int K(x, x') e_n(x')\, d\mu(x')\), is continuous in \(x\) by continuity of \(K\) and dominated convergence. Replacing \(e_n\) by \(\lambda_n^{-1} T_K e_n\) (valid for \(\lambda_n > 0\)) within its a.e. equivalence class therefore selects a continuous representative; strict positivity of \(\mu\) ensures this representative is uniquely determined (any two continuous functions agreeing a.e. with respect to a strictly positive measure must agree pointwise), making the pointwise expansion well-defined.

The partial sums \(S_N(x, x') = \sum_{n=1}^N \lambda_n e_n(x) e_n(x')\) are continuous, and on the diagonal the nonnegative terms \(\lambda_n e_n(x)^2 \geq 0\) make \(S_N(x, x) \nearrow K(x, x)\) monotonically. Dini's theorem (continuous monotone sequence with continuous limit on a compact set converges uniformly) upgrades this to uniform convergence on the diagonal. For the off-diagonal terms, Cauchy-Schwarz applied to partial-sum tails gives, for \(M > N\), \[ |S_M(x, x') - S_N(x, x')|^2 = \left|\sum_{n=N+1}^M \lambda_n e_n(x) e_n(x')\right|^2 \leq R_N(x) \cdot R_N(x'), \] where \(R_N(x) = K(x, x) - S_N(x, x) \geq \sum_{n=N+1}^M \lambda_n e_n(x)^2\). The diagonal uniform convergence \(R_N \to 0\) gives a uniform Cauchy criterion, yielding both absolute and uniform convergence on all of \(\mathcal{X}\times\mathcal{X}\). For the full argument, see Steinwart & Christmann, Support Vector Machines, Ch. 4 ("Kernels and Reproducing Kernel Hilbert Spaces").

Let us unpack why each ingredient is necessary:

  1. Continuity + symmetry of \(K\): Guarantees that \(T_K\) is a Hilbert-Schmidt operator, hence compact.
  2. Positive definiteness: Ensures all eigenvalues are nonnegative (so \(T_K\) is a positive operator).
  3. The Spectral Theorem: Provides the eigenfunction decomposition \(T_K f = \sum \lambda_n \langle f, e_n \rangle e_n\).
  4. Mercer's contribution: The eigenfunction expansion of the operator lifts to a pointwise expansion of the kernel function, with uniform convergence.
  5. Strictly positive measure: Ensures that \(L^2\)-eigenfunctions have unique continuous representatives, so the pointwise equality \(K(x,x') = \sum \lambda_n e_n(x) e_n(x')\) is well-defined.

The Feature Map from Mercer's Theorem

The Mercer expansion immediately yields an explicit feature map into \(\ell^2\): \[ \Phi(x) = \bigl(\sqrt{\lambda_1}\, e_1(x),\; \sqrt{\lambda_2}\, e_2(x),\; \ldots\bigr) \in \ell^2, \] and the kernel is recovered as the inner product: \[ K(x, x') = \langle \Phi(x),\, \Phi(x') \rangle_{\ell^2} = \sum_{n=1}^{\infty} \lambda_n \, e_n(x) \, e_n(x'). \]

This is the rigorous justification for the kernel trick: the kernel evaluates an inner product in an infinite-dimensional feature space without ever computing the feature map \(\Phi\) explicitly. To see that \(\Phi(x)\) indeed lands in \(\ell^2\), set \(x = x'\) in the Mercer expansion: \(\|\Phi(x)\|_{\ell^2}^2 = \sum_{n=1}^{\infty} \lambda_n\, e_n(x)^2 = K(x, x) < \infty\), which is finite since \(K\) is continuous on a compact set. The rate of decay of \(\lambda_n\) controls the "effective dimensionality" of the feature space.
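
Closed-form Mercer eigenpairs exist only for special kernel-measure pairs, but they can always be approximated by discretizing the integral operator on a sample (the Nyström method). The sketch below makes illustrative choices throughout (RBF kernel, uniform sample on \([-1,1]\), truncation level \(m\)): eigendecompose the scaled Gram matrix, form the truncated feature map \(\Phi\), and check that \(\langle \Phi(x), \Phi(x')\rangle\) reproduces the kernel.

```python
import numpy as np

def rbf(X, Y, s=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * s * s))

rng = np.random.default_rng(3)
N = 400
X = rng.uniform(-1, 1, size=(N, 1))   # sample from mu (uniform on [-1, 1])
K = rbf(X, X)

# Discretized operator: (T_K f)(x_i) ~ (1/N) sum_j K(x_i, x_j) f(x_j).
# Its eigenpairs approximate the Mercer eigenpairs.
lam, U = np.linalg.eigh(K / N)
lam, U = lam[::-1], U[:, ::-1]        # decreasing eigenvalue order

# Truncated feature map Phi(x_i) = (sqrt(lam_n) e_n(x_i))_{n <= m},
# with e_n(x_i) ~ sqrt(N) U[i, n] (so that (1/N) sum_i e_n(x_i)^2 = 1).
m = 20
Phi = np.sqrt(N) * U[:, :m] * np.sqrt(np.maximum(lam[:m], 0.0))

# <Phi(x), Phi(x')> should reproduce K(x, x') up to truncation error.
print(np.max(np.abs(Phi @ Phi.T - K)))   # small, and shrinking as m grows
```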

Example: The RBF (Gaussian) Kernel

The radial basis function kernel \(K(x, x') = \exp\bigl(-\|x - x'\|^2 / (2\sigma^2)\bigr)\) is continuous, symmetric, and positive definite. Its Mercer eigenfunctions are related to Hermite functions, and the eigenvalues decay exponentially: \(\lambda_n \sim C \cdot r^n\) for some \(0 < r < 1\). This rapid decay explains why the Random Fourier Feature approximation works well in practice — only the first few eigenfunctions carry significant weight.
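
The decay is visible numerically (a sketch reusing the Nyström discretization; the sample size, domain, and bandwidth are illustrative): successive eigenvalue ratios of the scaled RBF Gram matrix settle near a constant \(r < 1\).

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(500, 1))
K = np.exp(-((X - X.T) ** 2) / 2.0)          # RBF kernel in 1-d, sigma = 1

lam = np.linalg.eigvalsh(K / len(X))[::-1]   # Nystrom eigenvalue estimates
print(lam[1:8] / lam[:7])   # ratios roughly constant: near-geometric decay
```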

Connection to Gaussian Processes

In a Gaussian Process with covariance function \(K\), the Mercer expansion gives the Karhunen-Loève expansion: \[ f(x) = \sum_{n=1}^{\infty} Z_n \sqrt{\lambda_n}\, e_n(x), \qquad Z_n \sim \mathcal{N}(0, 1) \text{ i.i.d.} \] Each draw from the GP is a random function whose "complexity" is governed by the eigenvalue decay. Smooth kernels (fast decay) produce smooth sample paths; rough kernels (slow decay) produce rough paths. The RKHS norm (introduced in the next section) measures the complexity of the mean function of the posterior GP.
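
A truncated Karhunen-Loève draw can be simulated on a grid (a sketch; the kernel, bandwidth, grid, and truncation level are all illustrative): eigendecompose the discretized covariance, draw i.i.d. standard normals \(Z_n\), and sum. The eigenvalue decay is exactly what makes a modest truncation faithful.

```python
import numpy as np

x = np.linspace(-1, 1, 300)[:, None]
K = np.exp(-((x - x.T) ** 2) / (2 * 0.2**2))   # RBF covariance, sigma = 0.2

N = len(x)
lam, U = np.linalg.eigh(K / N)
lam, U = np.maximum(lam[::-1], 0.0), U[:, ::-1]
e = np.sqrt(N) * U                             # e_n(x_i), orthonormal in L^2(mu)

# Karhunen-Loeve draw: f(x) = sum_{n <= m} Z_n sqrt(lam_n) e_n(x), Z_n ~ N(0, 1).
rng = np.random.default_rng(5)
m = 30
Z = rng.normal(size=m)
f = e[:, :m] @ (np.sqrt(lam[:m]) * Z)          # one approximate GP sample path

# Sanity check: the truncated expansion already reconstructs the covariance.
K_m = (e[:, :m] * lam[:m]) @ e[:, :m].T
print(np.max(np.abs(K_m - K)))                 # small for this smooth kernel
```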

RKHS Norm & Function Complexity

A central utility of the RKHS framework for machine learning is that the RKHS norm provides a mathematically precise notion of function "smoothness" or "complexity." This connects directly to regularization.

The RKHS Inner Product via Mercer Eigenfunctions

Suppose the kernel \(K\) has the Mercer expansion \(K(x, x') = \sum_{n=1}^\infty \lambda_n e_n(x) e_n(x')\). Any function \(f \in \mathcal{H}_K\) can be expanded as \(f(x) = \sum_{n=1}^\infty f_n \, e_n(x)\), where \(f_n = \langle f, e_n \rangle_{L^2}\) are the generalized Fourier coefficients.

The RKHS inner product is not the \(L^2\) inner product. Instead, it reweights each eigencomponent by the inverse of the eigenvalue:

Lemma: Aronszajn Factorization

Under the hypotheses of Mercer's Theorem with \(\lambda_n > 0\) for all \(n\), let \(T_K^{1/2}\) be the positive square root of \(T_K\), defined via spectral calculus by \(T_K^{1/2} g = \sum_n \sqrt{\lambda_n} \langle g, e_n\rangle_{L^2}\, e_n\). Then \(T_K^{1/2} : L^2(\mathcal{X}, \mu) \to \mathcal{H}_K\) is an isometric isomorphism: \[ \mathcal{H}_K = T_K^{1/2}\bigl(L^2(\mathcal{X}, \mu)\bigr), \qquad \|T_K^{1/2} g\|_{\mathcal{H}_K} = \|g\|_{L^2(\mathcal{X}, \mu)}. \] In particular, \(\{\phi_n := \sqrt{\lambda_n}\, e_n\}_{n \geq 1}\) is an orthonormal basis of \(\mathcal{H}_K\).

Proof sketch:

By Mercer's Theorem, \(K(x, y) = \sum_n \lambda_n e_n(x) e_n(y) = \sum_n \phi_n(x) \phi_n(y)\) with \(\phi_n := \sqrt{\lambda_n}\, e_n\). The Hilbert space of formal series \(\mathcal{H}' := \{\sum_n c_n \phi_n : \sum_n c_n^2 < \infty\}\), equipped with the inner product \(\langle \sum_n c_n \phi_n, \sum_n d_n \phi_n\rangle := \sum_n c_n d_n\), has \(\{\phi_n\}\) as an orthonormal basis and reproducing kernel \(\sum_n \phi_n(x) \phi_n(y) = K(x, y)\). By the uniqueness half of the Moore-Aronszajn Theorem, \(\mathcal{H}' = \mathcal{H}_K\) as Hilbert spaces of functions.

The spectral-calculus definition of \(T_K^{1/2}\) sends \(g = \sum_n g_n e_n \in L^2\) to \(T_K^{1/2} g = \sum_n \sqrt{\lambda_n}\, g_n\, e_n = \sum_n g_n \phi_n\), whose \(\mathcal{H}_K\)-norm is \((\sum_n g_n^2)^{1/2} = \|g\|_{L^2}\). Thus \(T_K^{1/2} : L^2(\mathcal{X}, \mu) \to \mathcal{H}_K\) is norm-preserving, with image all of \(\mathcal{H}_K\) (every \(f = \sum_n c_n \phi_n \in \mathcal{H}_K\) is the image of \(g = \sum_n c_n e_n \in L^2\)). For full details, see Aronszajn's original 1950 paper, or Steinwart & Christmann, Support Vector Machines, Ch. 4 ("Kernels and Reproducing Kernel Hilbert Spaces").

Proposition: RKHS Norm via Mercer Coefficients

Assume \(\lambda_n > 0\) for all \(n\) (a standing assumption ensured, for instance, when \(K\) is strictly positive definite under the Mercer hypotheses; otherwise restrict the sums to \(\{n : \lambda_n > 0\}\) and identify \(\mathcal{H}_K\) with the corresponding subspace via the isometry below). For \(f = \sum_n f_n \, e_n\) and \(g = \sum_n g_n \, e_n\) in \(\mathcal{H}_K\): \[ \langle f, g \rangle_{\mathcal{H}_K} = \sum_{n=1}^{\infty} \frac{f_n \, g_n}{\lambda_n}. \] The RKHS norm is therefore \[ \|f\|_{\mathcal{H}_K}^2 = \sum_{n=1}^{\infty} \frac{f_n^2}{\lambda_n}. \] A function \(f \in L^2(\mathcal{X}, \mu)\) with Fourier expansion \(f = \sum_n f_n e_n\) belongs to \(\mathcal{H}_K\) if and only if \(\sum_n f_n^2 / \lambda_n < \infty\).

Proof sketch:

By the Aronszajn Factorization Lemma, every \(f \in \mathcal{H}_K\) can be written uniquely as \(f = T_K^{1/2} h\) for some \(h \in L^2(\mathcal{X}, \mu)\), and this correspondence is an isometry: \(\|f\|_{\mathcal{H}_K} = \|h\|_{L^2}\). Expanding \(h\) in the \(L^2\)-orthonormal basis \(\{e_n\}\) as \(h = \sum_n h_n e_n\), we have \[ f = T_K^{1/2} h = \sum_n h_n\, T_K^{1/2} e_n = \sum_n h_n \sqrt{\lambda_n}\, e_n. \] Comparing with the \(L^2\)-Fourier expansion \(f = \sum_n f_n e_n\) gives \(f_n = h_n \sqrt{\lambda_n}\), i.e. \(h_n = f_n / \sqrt{\lambda_n}\). The isometry \(\|f\|_{\mathcal{H}_K}^2 = \|h\|_{L^2}^2\) then reads \[ \|f\|_{\mathcal{H}_K}^2 = \sum_n h_n^2 = \sum_n \frac{f_n^2}{\lambda_n}. \] The polarization identity yields the inner-product formula. The membership characterization \(f \in \mathcal{H}_K \iff \sum_n f_n^2/\lambda_n < \infty\) follows from \(f \in \mathcal{H}_K \iff h \in L^2 \iff \sum_n h_n^2 < \infty\).

The division by \(\lambda_n\) is the key insight. Since \(\lambda_n \to 0\), the higher-order eigenfunctions (those corresponding to small eigenvalues) are penalized more heavily in the RKHS norm. A function that relies heavily on high-frequency components will have a large RKHS norm, while a function concentrated on the first few eigenmodes will have a small norm.
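
The proposition can be checked numerically in the discretized Mercer setting (a sketch with illustrative choices: uniform measure on a random sample, RBF kernel, and \(f\) built as a finite kernel expansion). For \(f = \sum_i \alpha_i K(\cdot, x_i)\), expanding the inner product and applying the reproducing property gives \(\|f\|_{\mathcal{H}_K}^2 = \alpha^T \mathbf{K} \alpha\), and this matches \(\sum_n f_n^2/\lambda_n\) computed from the discretized eigenpairs.

```python
import numpy as np

def rbf(X, Y, s=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * s * s))

rng = np.random.default_rng(6)
N = 300
X = rng.uniform(-1, 1, size=(N, 1))   # discretize mu as uniform on N samples
K = rbf(X, X)

lam, U = np.linalg.eigh(K / N)        # eigenpairs of the discretized T_K
e = np.sqrt(N) * U                    # e_n(x_i), orthonormal in empirical L^2

# f = sum_i alpha_i K(., x_i); its RKHS norm is alpha^T K alpha.
alpha = rng.normal(size=N)
f = K @ alpha                         # the values f(x_i)
rkhs_norm_sq = alpha @ K @ alpha

# Mercer-coefficient formula: f_n = (1/N) sum_i f(x_i) e_n(x_i),
# then ||f||^2 = sum_n f_n^2 / lam_n (over numerically nonzero eigenvalues).
f_n = e.T @ f / N
mask = lam > 1e-12
print(np.sum(f_n[mask] ** 2 / lam[mask]), rkhs_norm_sq)   # the two agree
```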

Connection to Regularization: Why \(\|f\|_{\mathcal{H}_K}^2\) Controls Complexity

When we add \(\lambda \|f\|_{\mathcal{H}_K}^2\) to our loss function, we are penalizing functions that use high-frequency eigenfunctions. This is the infinite-dimensional analogue of Ridge regression: in finite dimensions, \(\|w\|^2 = \sum w_i^2\) penalizes large weights equally; in the RKHS, \(\|f\|_{\mathcal{H}_K}^2 = \sum f_n^2 / \lambda_n\) penalizes components in "hard" directions (small \(\lambda_n\)) more than "easy" ones (large \(\lambda_n\)).

This is analogous to the quadratic form \(\mathbf{f}^T \mathbf{K}^{-1} \mathbf{f}\) that appears in GP regression — the inverse kernel matrix penalizes predictions that deviate from the prior's "smooth" subspace.

RKHS Functions are "Smoother" than \(L^2\) Functions

The membership condition \(\sum_n f_n^2 / \lambda_n < \infty\) is strictly stronger than the \(L^2\) condition \(\sum_n f_n^2 < \infty\), since \(\lambda_n \to 0\). A concrete witness: take \(f_n = \sqrt{\lambda_n}\). Then \(\sum_n f_n^2 = \sum_n \lambda_n < \infty\) (integrating the Mercer expansion on the diagonal gives \(\sum_n \lambda_n = \int_{\mathcal{X}} K(x, x)\, d\mu(x)\), finite by continuity of \(K\) on the compact \(\mathcal{X}\) and finiteness of \(\mu\)), yet \(\sum_n f_n^2 / \lambda_n = \sum_n 1 = \infty\), so \(f \in L^2 \setminus \mathcal{H}_K\). Therefore: \[ \mathcal{H}_K \subsetneq L^2(\mathcal{X}, \mu). \] The RKHS consists of "smoother" functions — those whose high-frequency components decay fast enough to compensate for the shrinking eigenvalues. The faster the eigenvalues \(\lambda_n\) decay, the smaller (and smoother) the RKHS becomes.

This also explains why point evaluation is continuous in an RKHS but not in \(L^2\). The extra smoothness required for RKHS membership ensures that functions are "nice enough" to be evaluated pointwise — unlike generic \(L^2\) functions, which are only defined up to measure-zero modifications.

The Representer Theorem & Grand Synthesis

We now arrive at the theorem that ties the entire Functional Analysis series together: the Representer Theorem. It tells us that even though the RKHS \(\mathcal{H}_K\) is infinite-dimensional, the solution to any regularized empirical risk minimization problem lives in a finite-dimensional subspace spanned by the kernel functions at the data points.

The Representer Theorem is a structural result: it describes the form of any minimizer that exists, without itself asserting existence. Existence depends on the loss function and is not automatic. In the kernel methods that use this theorem in practice, existence is nevertheless clear: Kernel Ridge Regression reduces to a finite-dimensional linear system with a unique closed-form solution, and the kernel SVM to a finite-dimensional convex quadratic program. The theorem below tells us the structure of any such minimizer:

Theorem: The Representer Theorem

Let \(\mathcal{H}_K\) be an RKHS on \(\mathcal{X}\) with kernel \(K\), let \(\{(x_i, y_i)\}_{i=1}^N\) be training data, and consider the regularized problem \[ \min_{f \in \mathcal{H}_K} \left[\sum_{i=1}^{N} \ell\bigl(y_i,\, f(x_i)\bigr) + \lambda \|f\|_{\mathcal{H}_K}^2\right], \] where \(\ell\) is any loss function and \(\lambda > 0\). Then any minimizer \(f^*\) of this objective has the form \[ f^*(x) = \sum_{i=1}^{N} \alpha_i \, K(x, x_i) \] for some coefficients \(\alpha_1, \ldots, \alpha_N \in \mathbb{R}\).

This is a remarkable dimension reduction: the optimization was posed over the entire infinite-dimensional space \(\mathcal{H}_K\), yet the answer lies in the \(N\)-dimensional subspace \(\operatorname{span}\{K(\cdot, x_1), \ldots, K(\cdot, x_N)\}\).

Proof Sketch:

Decompose any \(f \in \mathcal{H}_K\) as \(f = f_\parallel + f_\perp\), where \(f_\parallel \in \operatorname{span}\{K(\cdot, x_i)\}_{i=1}^N\) and \(f_\perp\) is orthogonal to this subspace. (The span is finite-dimensional, hence closed in \(\mathcal{H}_K\), so this orthogonal decomposition is guaranteed by the projection theorem.)

By the reproducing property, \(f(x_i) = \langle f, K(\cdot, x_i) \rangle_{\mathcal{H}_K}\). Since \(f_\perp\) is orthogonal to every \(K(\cdot, x_i)\): \[ f(x_i) = f_\parallel(x_i) + \underbrace{\langle f_\perp, K(\cdot, x_i) \rangle}_{= 0} = f_\parallel(x_i). \] So \(f_\perp\) does not affect the data-fit term. However, by Pythagoras: \[ \|f\|_{\mathcal{H}_K}^2 = \|f_\parallel\|_{\mathcal{H}_K}^2 + \|f_\perp\|_{\mathcal{H}_K}^2 \geq \|f_\parallel\|_{\mathcal{H}_K}^2. \] Adding any nonzero \(f_\perp\) strictly increases the regularization penalty without improving the data fit. Hence any minimizer satisfies \(f_\perp = 0\).
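
The decomposition argument can be watched in action (a sketch; the kernel, data points, and extra centers are illustrative). Take an \(f\) with a component outside \(\operatorname{span}\{K(\cdot, x_i)\}\): its orthogonal projection onto the span (which, by the reproducing property, is the kernel expansion matching \(f\)'s values at the data) preserves every \(f(x_i)\) but has a strictly smaller norm.

```python
import numpy as np

def rbf(X, Y, s=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * s * s))

rng = np.random.default_rng(7)
Xd = rng.normal(size=(15, 2))    # data points x_1, ..., x_N
Xe = rng.normal(size=(10, 2))    # extra centers outside the data
C = np.vstack([Xd, Xe])          # f uses all 25 centers
beta = rng.normal(size=25)

f_at_data = rbf(Xd, C) @ beta    # f(x_i) for f = sum_k beta_k K(., c_k)
f_norm_sq = beta @ rbf(C, C) @ beta

# Projection onto span{K(., x_i)}: the expansion matching f's values at the data
# (f - f_par then vanishes at every x_i, i.e. is orthogonal to every K(., x_i)).
K_dd = rbf(Xd, Xd)
gamma = np.linalg.solve(K_dd, f_at_data)
fpar_norm_sq = gamma @ K_dd @ gamma

print(np.allclose(K_dd @ gamma, f_at_data))   # identical data fit
print(fpar_norm_sq <= f_norm_sq)              # Pythagoras: the norm only shrinks
```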

From the Representer Theorem to Kernel Algorithms

With \(f^*(x) = \sum_{i=1}^N \alpha_i K(x, x_i)\), the infinite-dimensional problem reduces to a finite-dimensional optimization over \(\boldsymbol{\alpha} \in \mathbb{R}^N\). The reproducing property gives \[ f^*(x_j) = \sum_{i=1}^N \alpha_i K(x_j, x_i) = (\mathbf{K}\boldsymbol{\alpha})_j \] where \(\mathbf{K}_{ij} = K(x_i, x_j)\) is the Gram matrix, and \[ \|f^*\|_{\mathcal{H}_K}^2 = \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha}. \] The regularized problem becomes \[ \min_{\boldsymbol{\alpha} \in \mathbb{R}^N} \left[\sum_{i=1}^N \ell\bigl(y_i,\, (\mathbf{K}\boldsymbol{\alpha})_i\bigr) + \lambda \, \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha}\right]. \] For the squared loss \(\ell(y, \hat{y}) = (y - \hat{y})^2\), this is Kernel Ridge Regression, with closed-form solution \(\boldsymbol{\alpha} = (\mathbf{K} + \lambda I)^{-1} \mathbf{y}\). For the hinge loss, it is the kernel SVM.
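
Put together, Kernel Ridge Regression is a few lines (a sketch; the kernel, bandwidth, data, and \(\lambda\) are illustrative): form the Gram matrix, solve \((\mathbf{K} + \lambda I)\boldsymbol{\alpha} = \mathbf{y}\), and predict with \(f^*(x) = \sum_i \alpha_i K(x, x_i)\).

```python
import numpy as np

def rbf(X, Y, s=0.5):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * s * s))

rng = np.random.default_rng(8)
X = np.sort(rng.uniform(-3, 3, size=(40, 1)), axis=0)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)   # noisy samples of sin

lam = 0.1
alpha = np.linalg.solve(rbf(X, X) + lam * np.eye(len(X)), y)   # representer coeffs

X_new = np.linspace(-3, 3, 200)[:, None]
f_star = rbf(X_new, X) @ alpha    # f*(x) = sum_i alpha_i K(x, x_i)

rmse = np.sqrt(np.mean((f_star - np.sin(X_new[:, 0])) ** 2))
print(rmse)                       # small: f* tracks the underlying function
```

Note that \(\lambda\) enters only through an \(N \times N\) linear system; the infinite-dimensional search space never appears in the computation.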

The Grand Synthesis

Every chapter in this Functional Analysis series has contributed a piece of the RKHS picture. Banach & Hilbert Spaces gave us the ambient structure — completeness, inner products, and \(L^2\) as the home of Mercer's integral operator. Bounded Operators gave us the equivalence of boundedness and continuity, which is exactly what point evaluation \(\delta_x\) must satisfy for the whole theory to begin. Dual Spaces and the Riesz Representation Theorem converted "\(\delta_x\) is bounded" into the concrete statement "\(\delta_x\) has a Riesz representative \(K_x\)" — the reproducing kernel itself. Spectral Theory lifted Mercer's integral operator into an eigenfunction expansion, exposing the explicit feature map into \(\ell^2\) and the smoothness-controlling role of eigenvalue decay. And the Representer Theorem, proved here, collapses infinite-dimensional optimization onto a finite-dimensional kernel algebra. The Functional Analysis series reaches its culmination here: each abstract tool introduced above — Hilbert structure, boundedness, Riesz duality, spectral decomposition — now anchors a concrete machine-learning method.

Looking Ahead: From Function Spaces to Manifolds

We now move from function spaces to geometric spaces. The Musical Isomorphisms (\(\sharp\) and \(\flat\)) will generalize from a single inner product to a Riemannian metric tensor that varies from point to point. Spectral methods will be powered by the eigenvalues and eigenfunctions of the Laplace-Beltrami operator on smooth manifolds. And the reproducing property of RKHS will find its echo in the equivariance constraints of Geometric Deep Learning: just as the kernel encodes the "right" notion of similarity in function space, the group action encodes the "right" notion of symmetry on geometric domains.

The Functional Analysis series has given us the language of infinite-dimensional mathematics. The geometry series will give us its shape.