Introduction
Throughout our study of statistics, we have worked with parametric models: models with a fixed number of parameters
\(\boldsymbol{\theta}\) that does not grow with the training set size \(N\). For example,
deep neural networks
learn a function approximation \(f(\boldsymbol{x}; \boldsymbol{\theta})\) where
\(\dim(\boldsymbol{\theta})\) is determined by the architecture. Such models can overfit when \(N\) is small or underfit
when the model capacity is insufficient. We now consider a fundamentally different approach:
nonparametric models, whose effective complexity adapts to the amount of available data.
The Gaussian Process (GP) is a fundamental example of a nonparametric Bayesian model.
Rather than placing a prior distribution over a finite-dimensional parameter vector,
we place a prior directly over functions \(p(f)\), and update it after observing data to obtain the posterior \(p(f \mid \mathcal{D})\).
This function-space view builds on two key prerequisites:
the multivariate normal distribution, which provides the conditioning
formulas we will need, and positive definite matrices, which
ensure that our covariance structures are well-defined.
The key insight is to think of a function as an infinite-dimensional vector.
Consider a Gaussian random vector \(\boldsymbol{f} = [f_1, f_2, \ldots, f_N]^\top\), characterized by its
mean \(\boldsymbol{\mu} = \mathbb{E}[\boldsymbol{f}]\) and
covariance \(\boldsymbol{\Sigma} = \mathrm{Cov}[\boldsymbol{f}]\).
Now, let \(f : \mathcal{X} \rightarrow \mathbb{R}\) be a function evaluated at a finite set of input points
\[
\boldsymbol{X} = \{\boldsymbol{x}_n \in \mathcal{X}\}_{n=1}^N,
\]
and define
\[
\boldsymbol{f}_X = [f(\boldsymbol{x}_1), f(\boldsymbol{x}_2), \ldots, f(\boldsymbol{x}_N)]^\top.
\]
If \(\boldsymbol{f}_X\) follows a joint multivariate Gaussian distribution for any finite set
of input points, then \(f\) is said to follow a Gaussian process.
Definition: Gaussian Process (GP)
A Gaussian process is a collection of random variables, any finite subset of which has a joint
Gaussian distribution. A GP is fully specified by its mean function and covariance function:
\[
f(\boldsymbol{x}) \sim \mathcal{GP}\left(m(\boldsymbol{x}), \, \mathcal{K}(\boldsymbol{x}, \boldsymbol{x}')\right),
\]
where
\[
m(\boldsymbol{x}) = \mathbb{E}[f(\boldsymbol{x})]
\]
and
\[
\mathcal{K}(\boldsymbol{x}, \boldsymbol{x}') = \mathrm{Cov}[f(\boldsymbol{x}), f(\boldsymbol{x}')].
\]
The covariance function \(\mathcal{K}\) must be a positive definite Mercer kernel to ensure that
the Gram matrix
\[
\boldsymbol{K}_{ij} = \mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j)
\]
is positive semi-definite for any set of inputs.
In practice, the mean function is often set to zero, \(m(\boldsymbol{x}) = 0\), since the data can be centered and
the GP's expressiveness comes primarily from the covariance function. The choice of \(\mathcal{K}\) then determines
all structural properties of the functions: their smoothness, periodicity, amplitude, and length-scale.
For any finite collection of input points
\(\boldsymbol{X} = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N\}\),
the corresponding function values follow a multivariate normal distribution:
\[
p(\boldsymbol{f}_X \mid \boldsymbol{X})
= \mathcal{N}(\boldsymbol{f}_X \mid \boldsymbol{\mu}_X, \boldsymbol{K}_{X,X}),
\]
where \(\boldsymbol{\mu}_X = [m(\boldsymbol{x}_1), m(\boldsymbol{x}_2), \ldots, m(\boldsymbol{x}_N)]^\top\)
and \(\boldsymbol{K}_{X,X}(i, j) = \mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j)\).
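To make this finite-dimensional view concrete, here is a minimal NumPy sketch that evaluates a zero-mean GP prior on a grid and draws a few samples of \(\boldsymbol{f}_X\) from \(\mathcal{N}(\boldsymbol{\mu}_X, \boldsymbol{K}_{X,X})\). It uses the RBF covariance defined in the next section; the grid, length-scale, and jitter value are illustrative choices, not prescribed by the text.
```python
import numpy as np

# Evaluate the GP prior at a finite grid of inputs and draw samples of
# f_X ~ N(mu_X, K_XX), with a zero mean function and an RBF covariance
# (the kernel itself is defined formally in the next section).
X = np.linspace(-5.0, 5.0, 100)
ell = 1.0                                                   # length-scale (illustrative)
K_XX = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / ell**2)
mu_X = np.zeros_like(X)

rng = np.random.default_rng(0)
jitter = 1e-9 * np.eye(len(X))                              # numerical stabilization
f_samples = rng.multivariate_normal(mu_X, K_XX + jitter, size=3)
print(f_samples.shape)                                      # (3, 100): three sampled functions
```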
Mercer Kernels
The kernel function defines the notion of similarity between function values
at different input points. In a Gaussian process (GP), the choice of kernel is crucial since it
determines the smoothness, periodicity, and other structural properties of the functions drawn
from the GP prior.
While the kernel function intuitively measures similarity between inputs, not every function can serve
as a valid covariance function in a Gaussian process.
To ensure that the resulting covariance matrix is positive semidefinite for any finite set of inputs -
a requirement for the GP to define a valid multivariate Gaussian distribution -
we restrict our attention to Mercer kernels.
Definition: Mercer Kernel
A Mercer kernel (often referred to as a positive definite kernel)
is a symmetric function:
\[
\mathcal{K}: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}
\]
such that for all \(N \in \mathbb{N}\) and all \(\boldsymbol{x}_i \in \mathcal{X}\)
\[
\sum_{i=1}^N \sum_{j=1}^N c_i c_j \mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j) \geq 0
\]
for any choice of numbers \(c_i \in \mathbb{R}\).
Equality always holds when \(c_i = 0\) for all \(i\); the kernel is called strictly positive definite
if, for distinct inputs, this is the only case in which equality occurs.
This means that the corresponding Gram matrix
\( \boldsymbol{K} = [\mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j)]_{i,j=1}^N \)
is positive semidefinite. Note that this condition does NOT require each individual kernel
value \(\mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j)\) to be nonnegative.
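As a quick numerical illustration of this point, the sketch below builds the Gram matrix of a linear kernel \(\mathcal{K}(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{x}^\top \boldsymbol{x}'\) (a standard Mercer kernel, used here purely as an example) and checks that its eigenvalues are nonnegative even though individual entries can be negative.
```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))            # 50 random inputs in R^3

# Linear kernel K(x, x') = x^T x': a Mercer kernel whose entries can be negative.
K = X @ X.T

eigvals = np.linalg.eigvalsh(K)
print(K.min() < 0)                      # some kernel values are negative...
print(np.all(eigvals >= -1e-10))        # ...but the Gram matrix is positive semidefinite
```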
We often focus on stationary kernels, which depend only on the difference between inputs,
\(\boldsymbol{r} = \boldsymbol{x} - \boldsymbol{x}'\), and not on their absolute locations.
In many cases, only the Euclidean distance between inputs matters:
\[
r = \|\boldsymbol{r}\|_2 = \|\boldsymbol{x} - \boldsymbol{x}'\|_2.
\]
One of the most widely used stationary kernels in practice is the radial basis function (RBF) kernel
(also introduced in the context of
classification):
\[
\begin{align*}
\mathcal{K}_\text{RBF}(r ; \ell) &= \exp \left( -\frac{r^2}{2\ell^2}\right) \\\\
&= \exp \left( -\frac{\| \boldsymbol{x} - \boldsymbol{x}'\|^2}{2\ell^2}\right)
\end{align*}
\]
where \(\ell > 0\) is the length-scale parameter controlling how quickly the correlation
between points decays with distance. This kernel is also known as the Gaussian kernel.
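A possible vectorized implementation of the RBF kernel for \(D\)-dimensional inputs is sketched below; the helper name `rbf_kernel` and the test points are illustrative.
```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(X1, X2, ell=1.0):
    """RBF kernel exp(-||x - x'||^2 / (2 ell^2)) for inputs of shape (n, D) and (m, D)."""
    sqdist = cdist(X1, X2, metric="sqeuclidean")
    return np.exp(-0.5 * sqdist / ell**2)

X = np.array([[0.0], [0.5], [2.0]])
print(np.round(rbf_kernel(X, X, ell=1.0), 3))
# Nearby inputs (0.0 and 0.5) are strongly correlated; distant ones (0.0 and 2.0) are not.
```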
Furthermore, by replacing Euclidean distance with Mahalanobis distance, we obtain the generalized RBF kernel:
\[
\mathcal{K}(\boldsymbol{r}; \boldsymbol{\Sigma}, \sigma^2)
= \sigma^2 \exp \left( - \frac{1}{2} \boldsymbol{r}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{r}\right).
\]
When \(\boldsymbol{\Sigma}\) is diagonal, i.e. \(\boldsymbol{\Sigma} = \mathrm{diag}(\ell_1^2, \ldots, \ell_D^2)\),
we obtain the Automatic Relevance Determination (ARD) kernel:
\[
\mathcal{K}_\text{ARD}(\boldsymbol{r}; \boldsymbol{\Sigma}, \sigma^2)
= \sigma^2 \exp \left( - \frac{1}{2} \sum_{d=1}^D \frac{1}{\ell_d^2 }r_d^2 \right)
\]
where \(\sigma^2\) is the overall variance and \(\ell_d >0\) is the characteristic length scale of dimension \(d\),
controlling the sensitivity of the function to that input. If a dimension \(d\) is irrelevant, we can
let \(\ell_d \to \infty\), so that the kernel effectively ignores that input.
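A sketch of the ARD kernel, assuming inputs stored row-wise in arrays of shape \((n, D)\); the helper `ard_kernel` and the chosen length-scales are illustrative. Making one length-scale very large shows how the corresponding dimension drops out of the similarity.
```python
import numpy as np

def ard_kernel(X1, X2, lengthscales, sigma2=1.0):
    """ARD kernel: sigma^2 * exp(-0.5 * sum_d r_d^2 / ell_d^2) for inputs of shape (n, D)."""
    ell = np.asarray(lengthscales)
    diff = X1[:, None, :] - X2[None, :, :]        # pairwise differences r, shape (n, m, D)
    return sigma2 * np.exp(-0.5 * np.sum((diff / ell) ** 2, axis=-1))

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 2))
# A huge length-scale on the second dimension makes the kernel effectively ignore it.
K = ard_kernel(X, X, lengthscales=[1.0, 1e6])
K_first_dim_only = ard_kernel(X[:, :1], X[:, :1], lengthscales=[1.0])
print(np.allclose(K, K_first_dim_only, atol=1e-6))   # True: dimension 2 is ignored
```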
While RBF and ARD kernels are infinitely smooth and often used as default choices, there are
situations where we may want more flexible control over the smoothness of the functions.
For example, in Bayesian optimisation or when modelling rougher processes, it is useful to have kernels
whose sample paths are only once or twice differentiable.
This motivates the use of Matérn kernels, which form a family of stationary kernels
that include the RBF kernel as a limiting case (\(\nu \to \infty\)), while introducing an explicit
smoothness parameter \(\nu > 0\) to control the differentiability of the functions.
By adjusting \(\nu\), we can interpolate between very rough kernels (like the exponential kernel) and
infinitely smooth kernels (like the RBF), giving us fine-grained control over function smoothness.
For parameters \(\nu > 0\), \(\ell > 0\), the Matérn kernel is
\[
\mathcal{K}_\text{Matérn}(r; \nu, \ell)
= \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\,r}{\ell}\right)^\nu K_\nu\!\left(\frac{\sqrt{2\nu}\,r}{\ell}\right),
\]
where \(K_\nu\) denotes the modified Bessel function of the second kind.
The parameter \(\nu\) controls the smoothness of the resulting functions. Specifically, the Gaussian process
is \(k\)-times mean-square differentiable if and only if \(\nu > k\). For the half-integer values of \(\nu\) used
in practice, the sample paths are \(\lfloor \nu - \tfrac{1}{2}\rfloor\)-times continuously differentiable. The most
common choices (up to the variance factor \(\sigma^2\)) admit closed forms:
- Exponential Kernel (Non-Differentiable):
\[
\mathcal{K}_\text{Matérn}(r; \frac{1}{2}, \ell ) = \exp \left(- \frac{r}{\ell}\right)
\]
- Once Differentiable:
\[
\mathcal{K}_\text{Matérn}(r; \frac{3}{2}, \ell ) = \left(1 + \frac{\sqrt{3}r}{\ell}\right)\exp \left(- \frac{\sqrt{3}r}{\ell}\right)
\]
- Twice Differentiable:
\[
\mathcal{K}_\text{Matérn}(r; \frac{5}{2}, \ell ) = \left(1 + \frac{\sqrt{5}r}{\ell} + \frac{5r^2}{3\ell^2}\right)\exp \left(- \frac{\sqrt{5}r}{\ell}\right)
\]
Note: In Gaussian process regression, the kernel can be multiplied by a positive amplitude
parameter \(\sigma^2 > 0\) to scale the variance of the function values:
\[
\mathcal{K}(r;\nu,\ell) \;\longrightarrow\; \sigma^2 \mathcal{K}(r;\nu,\ell).
\]
This scaling changes the magnitude of function fluctuations but does not affect the smoothness or
correlation structure of the GP.
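The closed forms above translate directly into code. The sketch below uses a hypothetical helper `matern_kernel`, restricted to the three half-integer cases listed, and includes the optional \(\sigma^2\) amplitude factor from the note.
```python
import numpy as np

def matern_kernel(r, nu, ell, sigma2=1.0):
    """Closed-form Matérn kernels for nu in {1/2, 3/2, 5/2}, scaled by sigma^2."""
    s = np.abs(r) / ell
    if nu == 0.5:
        k = np.exp(-s)
    elif nu == 1.5:
        k = (1.0 + np.sqrt(3.0) * s) * np.exp(-np.sqrt(3.0) * s)
    elif nu == 2.5:
        k = (1.0 + np.sqrt(5.0) * s + 5.0 * s**2 / 3.0) * np.exp(-np.sqrt(5.0) * s)
    else:
        raise ValueError("general nu requires the modified Bessel function (scipy.special.kv)")
    return sigma2 * k

r = np.linspace(0.0, 3.0, 7)
for nu in (0.5, 1.5, 2.5):
    print(nu, np.round(matern_kernel(r, nu, ell=1.0), 3))
# Larger nu gives a kernel that is flatter near r = 0, i.e. smoother sample paths.
```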
Some functions exhibit repeating or periodic behavior that is not captured by standard RBF or Matérn kernels.
To model such patterns, we can use a periodic kernel, which explicitly encodes periodicity into
the covariance function.
A common choice for a one-dimensional periodic kernel is:
\[
\mathcal{K}_\text{per}(r; \ell, p) = \exp \Bigg( - \frac{2 \, \sin^2\big(\pi r / p\big)}{\ell^2} \Bigg),
\]
where \(p > 0\) is the period of the function and \(\ell > 0\) is the length-scale controlling how quickly
correlations decay away from exact multiples of the period.
Periodic kernels are particularly useful in Gaussian process regression for modeling phenomena that repeat regularly over time or space,
such as seasonal time series, cyclic patterns, or physical systems with inherent periodicity.
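A minimal sketch of the periodic kernel for one-dimensional distances \(r = |x - x'|\); the helper name and parameter values are illustrative.
```python
import numpy as np

def periodic_kernel(r, ell, p):
    """Periodic kernel exp(-2 sin^2(pi r / p) / ell^2) for 1-D distances r."""
    return np.exp(-2.0 * np.sin(np.pi * r / p) ** 2 / ell**2)

r = np.array([0.0, 0.25, 0.5, 1.0, 2.0])
print(np.round(periodic_kernel(r, ell=1.0, p=1.0), 3))
# Distances that are exact multiples of the period p give correlation 1.
```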
GP Regression
Suppose we observe a training data set
\(\mathcal{D} = \{(\boldsymbol{x}_n, y_n)\}_{n=1}^N\),
where each observation follows
\[
y_n = f(\boldsymbol{x}_n) + \epsilon_n, \quad
\epsilon_n \sim \mathcal{N}(0, \sigma_y^2).
\]
Under this model, the covariance of the noisy observations is
\[
\begin{align*}
\mathrm{Cov}[y_i, y_j]
&= \mathrm{Cov}[f_i, f_j] + \mathrm{Cov}[\epsilon_i, \epsilon_j] \\\\
&= \mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j) + \sigma_y^2 \, \mathbb{I}(i = j),
\end{align*}
\]
so that
\[
\mathrm{Cov}[\boldsymbol{y} \mid \boldsymbol{X}]
= \boldsymbol{K}_{X,X} + \sigma_y^2 \boldsymbol{I}_N.
\]
Because a Gaussian process defines a joint Gaussian distribution
over any collection of function values, we can express both
the noisy training outputs and the noise-free test values together as
\[
\begin{bmatrix}
\boldsymbol{y} \\\\
\boldsymbol{f}_*
\end{bmatrix}
\sim
\mathcal{N} \Bigg(
\begin{bmatrix}
\boldsymbol{\mu}_X \\
\boldsymbol{\mu}_*
\end{bmatrix},
\begin{bmatrix}
\boldsymbol{K}_{X,X} + \sigma_y^2 \boldsymbol{I}_N & \boldsymbol{K}_{X,*} \\\\
\boldsymbol{K}_{X,*}^\top & \boldsymbol{K}_{*,*}
\end{bmatrix}
\Bigg).
\]
The posterior predictive distribution at the test points
\(\boldsymbol{X}_*\) is obtained by conditioning on the observed data:
\[
p(\boldsymbol{f}_* \mid \mathcal{D}, \boldsymbol{X}_*)
= \mathcal{N}\big(
\boldsymbol{f}_* \mid
\boldsymbol{\mu}_{* \mid X},
\boldsymbol{\Sigma}_{* \mid X}
\big),
\]
where
\[
\boldsymbol{\mu}_{* \mid X}
= \boldsymbol{\mu}_*
+ \boldsymbol{K}_{X,*}^\top
\big(\boldsymbol{K}_{X,X} + \sigma_y^2 \boldsymbol{I}_N \big)^{-1}
(\boldsymbol{y} - \boldsymbol{\mu}_X),
\]
\[
\boldsymbol{\Sigma}_{* \mid X}
= \boldsymbol{K}_{*,*}
- \boldsymbol{K}_{X,*}^\top
\big(\boldsymbol{K}_{X,X} + \sigma_y^2 \boldsymbol{I}_N \big)^{-1}
\boldsymbol{K}_{X,*}.
\]
These expressions follow directly from the conditioning formulas for the multivariate normal distribution.
The posterior mean \(\boldsymbol{\mu}_{*|X}\) is a linear combination of the observed outputs \(\boldsymbol{y}\), weighted
by the kernel similarities between test and training points. The posterior covariance \(\boldsymbol{\Sigma}_{*|X}\) starts
from the prior covariance \(\boldsymbol{K}_{*,*}\) and subtracts the information gained from the observations - the more data
we observe near a test point, the smaller the posterior uncertainty.
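The two formulas above can be implemented in a few lines. The sketch below assumes a zero mean function and the RBF kernel, and uses a Cholesky factorization rather than forming an explicit inverse; the helpers `rbf` and `gp_posterior`, the hyperparameter values, and the toy data are illustrative, not a reference implementation.
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf(X1, X2, ell=1.0):
    """RBF kernel for 1-D inputs: exp(-(x - x')^2 / (2 ell^2))."""
    return np.exp(-0.5 * ((X1[:, None] - X2[None, :]) / ell) ** 2)

def gp_posterior(X, y, X_star, ell=1.0, sigma_y=0.1):
    """Posterior mean and covariance at X_star for a zero-mean GP with RBF kernel."""
    K_y = rbf(X, X, ell) + sigma_y**2 * np.eye(len(X))    # K_{X,X} + sigma_y^2 I
    K_Xs = rbf(X, X_star, ell)                            # K_{X,*}
    K_ss = rbf(X_star, X_star, ell)                       # K_{*,*}

    # Cholesky factorization of K_y instead of an explicit matrix inverse.
    L = cho_factor(K_y, lower=True)
    alpha = cho_solve(L, y)                               # K_y^{-1} y
    V = cho_solve(L, K_Xs)                                # K_y^{-1} K_{X,*}

    mu_star = K_Xs.T @ alpha                              # posterior mean
    Sigma_star = K_ss - K_Xs.T @ V                        # posterior covariance
    return mu_star, Sigma_star

# Toy 1-D example: noisy observations of sin(x).
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3.0, 3.0, size=8))
y = np.sin(X) + 0.1 * rng.normal(size=8)
X_star = np.linspace(-3.0, 3.0, 50)
mu_star, Sigma_star = gp_posterior(X, y, X_star)
std_star = np.sqrt(np.diag(Sigma_star))                   # pointwise predictive std
```
The Cholesky factorization is the standard numerically stable way to apply \((\boldsymbol{K}_{X,X} + \sigma_y^2 \boldsymbol{I}_N)^{-1}\), and it is also where the cubic cost discussed next arises.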
The computational bottleneck is the inversion of the \(N \times N\) matrix \((\boldsymbol{K}_{X,X} + \sigma_y^2 \boldsymbol{I}_N)\), which
costs \(O(N^3)\) in time and \(O(N^2)\) in storage. This cubic scaling limits standard GP regression to moderate dataset sizes
(typically \(N \lesssim 10^4\)). For larger datasets, sparse approximations (inducing point methods),
random Fourier features,
and structured kernel interpolation provide scalable alternatives.
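As a flavour of one such alternative, the sketch below approximates the RBF kernel with random Fourier features, so that the \(N \times N\) Gram matrix can be replaced by an \(N \times D\) feature matrix; the helper name, the feature count, and the toy data are illustrative choices.
```python
import numpy as np

def random_fourier_features(X, num_features, ell=1.0, seed=0):
    """Map X of shape (n, d) to Z of shape (n, D) with Z @ Z.T approximating the RBF Gram matrix."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / ell, size=(d, num_features))   # spectral frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)      # random phases
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
Z = random_fourier_features(X, num_features=2000)

# Exact RBF Gram matrix (ell = 1) for comparison on this small toy problem.
sqdist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-0.5 * sqdist)
print(np.max(np.abs(Z @ Z.T - K_exact)))   # approximation error shrinks as num_features grows
```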
Gaussian Processes in Machine Learning
GPs occupy a unique position in the ML landscape. They provide closed-form uncertainty quantification
directly from the posterior, with no need for ensembles, dropout, or approximate inference. This makes them invaluable for
Bayesian optimization (e.g., hyperparameter tuning via acquisition functions like Expected Improvement),
active learning (selecting the most informative data points to label), and safety-critical applications
where knowing what the model doesn't know is as important as its predictions. The kernel framework also connects GPs to
neural networks: infinite-width results such as the Neural Tangent Kernel (NTK) show that infinitely wide neural networks behave as GPs with particular kernels,
providing a theoretical bridge between deep learning and nonparametric Bayesian modeling.