Introduction
Throughout our study of statistics, we have worked with parametric models: models with a fixed number of parameters
\(\boldsymbol{\theta}\) that does not grow with the training set size \(N\). For example,
deep neural networks
learn a function approximation \(f(\boldsymbol{x}; \boldsymbol{\theta})\) where
\(\dim(\boldsymbol{\theta})\) is determined by the architecture. Such models can overfit when \(N\) is small or underfit
when the model capacity is insufficient. We now consider a fundamentally different approach:
nonparametric models, whose effective complexity adapts to the amount of available data.
The Gaussian Process (GP) is a fundamental example of a nonparametric Bayesian model.
Rather than placing a prior distribution over a finite-dimensional parameter vector,
we place a prior directly over functions \(p(f)\), and update it after observing data to obtain the posterior \(p(f \mid \mathcal{D})\).
This function-space view builds on two key prerequisites:
the multivariate normal distribution,
whose conditioning identities we will use to derive the predictive posterior,
and positive definite matrices, which
ensure that our covariance structures are well-defined.
The key insight is to think of a function as an infinite-dimensional vector.
Consider a Gaussian random vector \(\boldsymbol{f} = [f_1, f_2, \ldots, f_N]^\top\), characterized by its
mean \(\boldsymbol{\mu} = \mathbb{E}[\boldsymbol{f}]\) and
covariance \(\boldsymbol{\Sigma} = \operatorname{Cov}[\boldsymbol{f}]\).
Now, let \(f : \mathcal{X} \rightarrow \mathbb{R}\) be a function evaluated at a finite set of input points
\[
\boldsymbol{X} = \{\boldsymbol{x}_n \in \mathcal{X}\}_{n=1}^N,
\]
and define
\[
\boldsymbol{f}_X = [f(\boldsymbol{x}_1), f(\boldsymbol{x}_2), \ldots, f(\boldsymbol{x}_N)]^\top.
\]
If \(\boldsymbol{f}_X\) follows a joint multivariate Gaussian distribution for any finite set
of input points, then \(f\) is said to follow a Gaussian process.
Definition: Gaussian Process (GP)
A Gaussian process is a collection of random variables, any finite subset of which has a joint
Gaussian distribution. A GP is fully specified by its mean function and covariance function:
\[
f(\boldsymbol{x}) \sim \mathcal{GP}\left(m(\boldsymbol{x}), \, \mathcal{K}(\boldsymbol{x}, \boldsymbol{x}')\right),
\]
where
\[
m(\boldsymbol{x}) = \mathbb{E}[f(\boldsymbol{x})]
\]
and
\[
\mathcal{K}(\boldsymbol{x}, \boldsymbol{x}') = \operatorname{Cov}[f(\boldsymbol{x}), f(\boldsymbol{x}')].
\]
The covariance function \(\mathcal{K}\) is required to be a positive definite Mercer kernel:
equivalently, the Gram matrix
\[
\boldsymbol{K}_{ij} = \mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j)
\]
is positive semi-definite for every finite set of inputs. This is precisely the condition that the resulting
finite-dimensional joint distribution is a valid multivariate Gaussian.
In practice, the mean function is often set to zero, \(m(\boldsymbol{x}) = 0\), since the data can be centered and
the GP's expressiveness comes primarily from the covariance function. The choice of \(\mathcal{K}\) then determines
all structural properties of the functions: their smoothness, periodicity, amplitude, and length-scale.
For any finite collection of input points
\(\boldsymbol{X} = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N\}\),
the corresponding function values follow a multivariate normal distribution:
\[
p(\boldsymbol{f}_X \mid \boldsymbol{X})
= \mathcal{N}(\boldsymbol{f}_X \mid \boldsymbol{\mu}_X, \boldsymbol{K}_{X,X}),
\]
where \(\boldsymbol{\mu}_X = [m(\boldsymbol{x}_1), m(\boldsymbol{x}_2), \ldots, m(\boldsymbol{x}_N)]^\top\)
and \(\boldsymbol{K}_{X,X}(i, j) = \mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j)\).
Mercer Kernels
The kernel function defines the notion of similarity between function values
at different input points. In a Gaussian process (GP), the choice of kernel is crucial since it
determines the smoothness, periodicity, and other structural properties of the functions drawn
from the GP prior.
While the kernel function intuitively measures similarity between inputs, not every function can serve
as a valid covariance function in a Gaussian process.
To ensure that the resulting covariance matrix is positive semi-definite for every finite set of inputs —
a requirement for the GP to define a valid multivariate Gaussian distribution —
we restrict our attention to Mercer kernels.
Definition: Mercer Kernel
A Mercer kernel (also called a positive definite kernel)
is a symmetric function
\[
\mathcal{K}: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}
\]
such that for every \(N \in \mathbb{N}\), every choice of distinct points
\(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N \in \mathcal{X}\), and every choice of scalars
\(c_1, \ldots, c_N \in \mathbb{R}\),
\[
\sum_{i=1}^N \sum_{j=1}^N c_i c_j \mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j) \geq 0.
\]
Equivalently, the Gram matrix
\(\boldsymbol{K} = [\mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j)]_{i,j=1}^N\)
is positive semi-definite for every finite collection of distinct points.
A Mercer kernel is called strictly positive definite if equality holds only when
\(c_1 = \cdots = c_N = 0\) — equivalently, every such Gram matrix is strictly positive definite.
Note that this condition does NOT require each individual kernel value
\(\mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j)\) to be nonnegative; only the quadratic form
built from the Gram matrix \(\boldsymbol{K}\) must be.
The definition above captures everything required to use Mercer kernels in Gaussian processes:
a finite-sample positive semi-definiteness condition that ensures every Gram matrix
\(\boldsymbol{K}\) defines a valid covariance for a multivariate Gaussian. A deeper
treatment — including the connection to reproducing kernel Hilbert spaces
and Mercer's integral-operator decomposition
\(\mathcal{K}(\boldsymbol{x}, \boldsymbol{x}') = \sum_n \lambda_n e_n(\boldsymbol{x}) e_n(\boldsymbol{x}')\) —
is developed later in the curriculum, where the same notion appears under the name
positive definite kernel in the more general functional-analytic setting.
We often focus on stationary kernels, which depend only on the difference between inputs,
\(\boldsymbol{r} = \boldsymbol{x} - \boldsymbol{x}'\), and not on their absolute locations.
In many cases, only the Euclidean distance between inputs matters:
\[
r = \|\boldsymbol{r}\|_2 = \|\boldsymbol{x} - \boldsymbol{x}'\|.
\]
One of the most widely used stationary kernels in practice is the radial basis function (RBF) kernel
(also used in
classification,
where its random Fourier-feature approximation enables linear-time training):
\[
\begin{align*}
\mathcal{K}_\text{RBF}(r ; \ell) &= \exp \left( -\frac{r^2}{2\ell^2}\right) \\\\
&= \exp \left( -\frac{\| \boldsymbol{x} - \boldsymbol{x}'\|^2}{2\ell^2}\right)
\end{align*}
\]
where \(\ell > 0\) is the length-scale parameter controlling how quickly the correlation
between points decays with distance. This kernel is also known as the Gaussian kernel.
Furthermore, by replacing Euclidean distance with Mahalanobis distance, we obtain the generalized RBF kernel:
\[
\mathcal{K}(\boldsymbol{r}; \boldsymbol{\Sigma}, \sigma^2)
= \sigma^2 \exp \left( - \frac{1}{2} \boldsymbol{r}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{r}\right).
\]
In the case \(\boldsymbol{\Sigma}\) is diagonal, i.e. \(\boldsymbol{\Sigma} = \operatorname{diag}(\ell_1^2, \ldots, \ell_D^2)\),
we obtain the Automatic Relevance Determination (ARD) kernel:
\[
\mathcal{K}_\text{ARD}(\boldsymbol{r}; \boldsymbol{\Sigma}, \sigma^2)
= \sigma^2 \exp \left( - \frac{1}{2} \sum_{d=1}^D \frac{1}{\ell_d^2 }r_d^2 \right)
\]
where \(\sigma^2\) is the overall variance and \(\ell_d >0\) is the characteristic length scale of dimension \(d\),
controlling the sensitivity of the function to that input. If a dimension \(d\) is irrelevant, we can
set \(\ell_d = \infty\), effectively ignoring that input.
While RBF and ARD kernels are infinitely smooth and often used as default choices, there are
situations where we may want more flexible control over the smoothness of the functions.
For example, in Bayesian optimisation or when modelling rougher processes, it is useful to have kernels
whose sample paths are only once or twice differentiable.
This motivates the use of Matérn kernels, which form a family of stationary kernels
that include the RBF kernel as a limiting case (\(\nu \to \infty\)), while introducing an explicit
smoothness parameter \(\nu > 0\) to control the differentiability of the functions.
By adjusting \(\nu\), we can interpolate between very rough kernels (like the exponential kernel) and
infinitely smooth kernels (like the RBF), giving us fine-grained control over function smoothness.
For parameters \(\nu > 0\), \(\ell > 0\), the Matérn kernel is
\[
\mathcal{K}_\text{Matérn}(r; \nu, \ell)
= \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\,r}{\ell}\right)^\nu K_\nu\!\left(\frac{\sqrt{2\nu}\,r}{\ell}\right),
\]
where \(K_\nu\) denotes the modified Bessel function of the second kind.
The parameter \(\nu\) controls the smoothness of the resulting functions. Specifically, the Gaussian process
is \(k\)-times mean-square differentiable if and only if \(\nu > k\). Consequently, the actual sample paths
of the GP are \(\lfloor \nu - \tfrac{1}{2}\rfloor\)-times continuously differentiable. Common choices
(up to the variance factor \(\sigma^2\)) admit closed forms:
- Exponential Kernel (Non-Differentiable):
\[
\mathcal{K}_\text{Matérn}(r; \frac{1}{2}, \ell ) = \exp \left(- \frac{r}{\ell}\right)
\]
- Once Differentiable:
\[
\mathcal{K}_\text{Matérn}(r; \frac{3}{2}, \ell ) = \left(1 + \frac{\sqrt{3}r}{\ell}\right)\exp \left(- \frac{\sqrt{3}r}{\ell}\right)
\]
- Twice Differentiable:
\[
\mathcal{K}_\text{Matérn}(r; \frac{5}{2}, \ell ) = \left(1 + \frac{\sqrt{5}r}{\ell} + \frac{5r^2}{3\ell^2}\right)\exp \left(- \frac{\sqrt{5}r}{\ell}\right)
\]
Note: In Gaussian process regression, the kernel can be multiplied by a positive amplitude
parameter \(\sigma^2 > 0\) to scale the variance of the function values:
\[
\mathcal{K}(r;\nu,\ell) \;\longrightarrow\; \sigma^2 \mathcal{K}(r;\nu,\ell).
\]
This scaling changes the magnitude of function fluctuations but does not affect the smoothness or
correlation structure of the GP.
Some functions exhibit repeating or periodic behavior that is not captured by standard RBF or Matérn kernels.
To model such patterns, we can use a periodic kernel, which explicitly encodes periodicity into
the covariance function.
A common choice for a one-dimensional periodic kernel is:
\[
\mathcal{K}_\text{per}(r; \ell, p) = \exp \Bigg( - \frac{2 \, \sin^2\big(\pi r / p\big)}{\ell^2} \Bigg),
\]
where \(p > 0\) is the period of the function and \(\ell > 0\) is the length-scale controlling how quickly
correlations decay away from exact multiples of the period.
Periodic kernels are particularly useful in Gaussian process regression for modeling phenomena that repeat regularly over time or space,
such as seasonal time series, cyclic patterns, or physical systems with inherent periodicity.
GP Regression
Suppose we observe a training data set
\(\mathcal{D} = \{(\boldsymbol{x}_n, y_n)\}_{n=1}^N\),
where each observation follows
\[
y_n = f(\boldsymbol{x}_n) + \epsilon_n, \quad
\epsilon_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_y^2),
\]
and the noise \(\{\epsilon_n\}\) is independent of the latent function \(f\).
Under this model, the covariance of the noisy observations is
\[
\begin{align*}
\operatorname{Cov}[y_i, y_j]
&= \operatorname{Cov}[f_i, f_j] + \operatorname{Cov}[\epsilon_i, \epsilon_j] \\\\
&= \mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j) + \sigma_y^2 \, \mathbb{1}\{i = j\},
\end{align*}
\]
so that
\[
\operatorname{Cov}[\boldsymbol{y} \mid \boldsymbol{X}]
= \boldsymbol{K}_{X,X} + \sigma_y^2 \boldsymbol{I}_N.
\]
Because a Gaussian process defines a joint Gaussian distribution
over any finite collection of function values, we can express both
the noisy training outputs and the noise-free test values together.
For test inputs \(\boldsymbol{X}_* = \{\boldsymbol{x}_{*,m}\}_{m=1}^M\), let
\(\boldsymbol{f}_* = [f(\boldsymbol{x}_{*,1}), \ldots, f(\boldsymbol{x}_{*,M})]^\top\) and
\(\boldsymbol{\mu}_* = [m(\boldsymbol{x}_{*,1}), \ldots, m(\boldsymbol{x}_{*,M})]^\top\).
Define the cross-covariance and test-test covariance blocks by
\((\boldsymbol{K}_{X,*})_{ij} = \mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_{*,j})\) and
\((\boldsymbol{K}_{*,*})_{ij} = \mathcal{K}(\boldsymbol{x}_{*,i}, \boldsymbol{x}_{*,j})\).
Then
\[
\begin{bmatrix}
\boldsymbol{y} \\
\boldsymbol{f}_*
\end{bmatrix}
\sim
\mathcal{N} \Bigg(
\begin{bmatrix}
\boldsymbol{\mu}_X \\
\boldsymbol{\mu}_*
\end{bmatrix},
\begin{bmatrix}
\boldsymbol{K}_{X,X} + \sigma_y^2 \boldsymbol{I}_N & \boldsymbol{K}_{X,*} \\\\
\boldsymbol{K}_{X,*}^\top & \boldsymbol{K}_{*,*}
\end{bmatrix}
\Bigg).
\]
The posterior predictive distribution at the test points
\(\boldsymbol{X}_*\) is obtained by conditioning on the observed data:
\[
p(\boldsymbol{f}_* \mid \mathcal{D}, \boldsymbol{X}_*)
= \mathcal{N}\big(
\boldsymbol{f}_* \mid
\boldsymbol{\mu}_{* \mid X},
\boldsymbol{\Sigma}_{* \mid X}
\big),
\]
where
\[
\boldsymbol{\mu}_{* \mid X}
= \boldsymbol{\mu}_*
+ \boldsymbol{K}_{X,*}^\top
\big(\boldsymbol{K}_{X,X} + \sigma_y^2 \boldsymbol{I}_N \big)^{-1}
(\boldsymbol{y} - \boldsymbol{\mu}_X),
\]
\[
\boldsymbol{\Sigma}_{* \mid X}
= \boldsymbol{K}_{*,*}
- \boldsymbol{K}_{X,*}^\top
\big(\boldsymbol{K}_{X,X} + \sigma_y^2 \boldsymbol{I}_N \big)^{-1}
\boldsymbol{K}_{X,*}.
\]
These formulas follow from the standard conditioning identities for the
multivariate normal distribution
applied to the joint above.
The posterior mean \(\boldsymbol{\mu}_{*|X}\) is a linear combination of the observed outputs \(\boldsymbol{y}\), weighted
by the kernel similarities between test and training points. The posterior covariance \(\boldsymbol{\Sigma}_{*|X}\) starts
from the prior covariance \(\boldsymbol{K}_{*,*}\) and subtracts the information gained from the observations — the more data
we observe near a test point, the smaller the posterior uncertainty.
The computational bottleneck is the inversion of the \(N \times N\) matrix \((\boldsymbol{K}_{X,X} + \sigma_y^2 \boldsymbol{I}_N)\), which
costs \(O(N^3)\) in time and \(O(N^2)\) in storage. This cubic scaling limits standard GP regression to moderate dataset sizes
(typically \(N \lesssim 10^4\)). For larger datasets, sparse approximations (inducing point methods),
random Fourier features,
and structured kernel interpolation provide scalable alternatives.
Gaussian Processes in Machine Learning
GPs occupy a unique position in the ML landscape. They provide closed-form uncertainty quantification that
is automatic and calibrated, requiring no ensembles, dropout, or approximate inference. This makes them invaluable for
Bayesian optimization (e.g., hyperparameter tuning via acquisition functions like Expected Improvement),
active learning (selecting the most informative data points to label), and safety-critical applications
where knowing what the model doesn't know is as important as its predictions. The kernel framework also connects GPs to
neural networks: the Neural Tangent Kernel (NTK) shows that infinitely wide neural networks converge to GPs,
providing a theoretical bridge between deep learning and nonparametric Bayesian modeling.