Gaussian Processes

Introduction

Throughout our study of statistics, we have worked with parametric models: models with a fixed number of parameters \(\boldsymbol{\theta}\) that does not grow with the training set size \(N\). For example, deep neural networks learn a function approximation \(f(\boldsymbol{x}; \boldsymbol{\theta})\) where \(\dim(\boldsymbol{\theta})\) is determined by the architecture. Such models can overfit when \(N\) is small or underfit when the model capacity is insufficient. We now consider a fundamentally different approach: nonparametric models, whose effective complexity adapts to the amount of available data.

The Gaussian Process (GP) is a fundamental example of a nonparametric Bayesian model. Rather than placing a prior distribution over a finite-dimensional parameter vector, we place a prior directly over functions \(p(f)\), and update it after observing data to obtain the posterior \(p(f \mid \mathcal{D})\). This function-space view builds on two key prerequisites: the multivariate normal distribution, which provides the conditioning formulas we will need, and positive definite matrices, which ensure that our covariance structures are well-defined.

The key insight is to think of a function as an infinite-dimensional vector. Consider a Gaussian random vector \(\boldsymbol{f} = [f_1, f_2, \ldots, f_N]^\top\), characterized by its mean \(\boldsymbol{\mu} = \mathbb{E}[\boldsymbol{f}]\) and covariance \(\boldsymbol{\Sigma} = \mathrm{Cov}[\boldsymbol{f}]\).

Now, let \(f : \mathcal{X} \rightarrow \mathbb{R}\) be a function evaluated at a finite set of input points \[ \boldsymbol{X} = \{\boldsymbol{x}_n \in \mathcal{X}\}_{n=1}^N, \] and define \[ \boldsymbol{f}_X = [f(\boldsymbol{x}_1), f(\boldsymbol{x}_2), \ldots, f(\boldsymbol{x}_N)]^\top. \] If \(\boldsymbol{f}_X\) follows a joint multivariate Gaussian distribution for any finite set of input points, then \(f\) is said to follow a Gaussian process.

Definition: Gaussian Process (GP)

A Gaussian process is a collection of random variables, any finite subset of which has a joint Gaussian distribution. A GP is fully specified by its mean function and covariance function: \[ f(\boldsymbol{x}) \sim \mathcal{GP}\left(m(\boldsymbol{x}), \, \mathcal{K}(\boldsymbol{x}, \boldsymbol{x}')\right), \] where \[ m(\boldsymbol{x}) = \mathbb{E}[f(\boldsymbol{x})] \] and \[ \mathcal{K}(\boldsymbol{x}, \boldsymbol{x}') = \mathrm{Cov}[f(\boldsymbol{x}), f(\boldsymbol{x}')]. \] The covariance function \(\mathcal{K}\) must be a positive definite Mercer kernel to ensure that the Gram matrix \[ \boldsymbol{K}_{ij} = \mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j) \] is positive semi-definite for any set of inputs.

In practice, the mean function is often set to zero, \(m(\boldsymbol{x}) = 0\), since the data can be centered and the GP's expressiveness comes primarily from the covariance function. The choice of \(\mathcal{K}\) then determines all structural properties of the functions: their smoothness, periodicity, amplitude, and length-scale.

For any finite collection of input points \(\boldsymbol{X} = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N\}\), the corresponding function values follow a multivariate normal distribution: \[ p(\boldsymbol{f}_X \mid \boldsymbol{X}) = \mathcal{N}(\boldsymbol{f}_X \mid \boldsymbol{\mu}_X, \boldsymbol{K}_{X,X}), \] where \(\boldsymbol{\mu}_X = [m(\boldsymbol{x}_1), m(\boldsymbol{x}_2), \ldots, m(\boldsymbol{x}_N)]^\top\) and \(\boldsymbol{K}_{X,X}(i, j) = \mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j)\).

Mercer Kernels

The kernel function defines the notion of similarity between function values at different input points. In a Gaussian process (GP), the choice of kernel is crucial since it determines the smoothness, periodicity, and other structural properties of the functions drawn from the GP prior.

While the kernel function intuitively measures similarity between inputs, not every function can serve as a valid covariance function in a Gaussian process. To ensure that the resulting covariance matrix is positive definite for any finite set of inputs - a requirement for the GP to define a proper multivariate Gaussian distribution - we restrict our attention to Mercer kernels.

Definition: Mercer Kernel

A Mercer kernel (often referred to as a positive definite kernel) is a symmetric function: \[ \mathcal{K}: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R} \] such that for all \(N \in \mathbb{N}\) and all \(\boldsymbol{x}_i \in \mathcal{X}\) \[ \sum_{i=1}^N \sum_{j=1}^N c_i c_j \mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j) \geq 0 \] for any choice of numbers \(c_i \in \mathbb{R}\). The equality allowed when \(\forall i, c_i =0\). For a strictly positive definite kernel, equality holds only when \(c_i = 0\) for all \(i\).

This means that the corresponding Gram matrix \( \boldsymbol{K} = [\mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j)]_{i,j=1}^N \) is positive semidefinite. Note that this condition does NOT require each individual kernel value \(\mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j)\) to be nonnegative.

We often focus on stationary kernels, which depend only on the difference between inputs, \(\boldsymbol{r} = \boldsymbol{x} - \boldsymbol{x}'\), and not on their absolute locations. In many cases, only the Euclidean distance between inputs matters: \[ r = \|\boldsymbol{r}\|_2 = \|\boldsymbol{x} - \boldsymbol{x}'\|. \]

One of the most widely used stationary kernels in practice is the radial basis function (RBF) kernel (also introduced in the context of classification): \[ \begin{align*} \mathcal{K}_\text{RBF}(r ; \ell) &= \exp \left( -\frac{r^2}{2\ell^2}\right) \\\\ &= \exp \left( -\frac{\| \boldsymbol{x} - \boldsymbol{x}'\|^2}{2\ell^2}\right) \end{align*} \] where \(\ell > 0\) is the length-scale parameter controlling how quickly the correlation between points decays with distance. This kernel is also known as the Gaussian kernel.

Furthermore, by replacing Euclidean distance with Mahalanobis distance, we obtain the generalized RBF kernel: \[ \mathcal{K}(\boldsymbol{r}; \boldsymbol{\Sigma}, \sigma^2) = \sigma^2 \exp \left( - \frac{1}{2} \boldsymbol{r}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{r}\right). \] In the case \(\boldsymbol{\Sigma}\) is diagonal, i.e. \(\boldsymbol{\Sigma} = \mathrm{diag}(\ell_1^2, \ldots, \ell_D^2)\), we obtain the Automatic Relevance Determination (ARD) kernel: \[ \mathcal{K}_\text{ARD}(\boldsymbol{r}; \boldsymbol{\Sigma}, \sigma^2) = \sigma^2 \exp \left( - \frac{1}{2} \sum_{d=1}^D \frac{1}{\ell_d^2 }r_d^2 \right) \] where \(\sigma^2\) is the overall variance and \(\ell_d >0\) is the characteristic length scale of dimension \(d\), controlling the sensitivity of the function to that input. If a dimension \(d\) is irrelevant, we can set \(\ell_d = \infty\), effectively ignoring that input.

While RBF and ARD kernels are infinitely smooth and often used as default choices, there are situations where we may want more flexible control over the smoothness of the functions. For example, in Bayesian optimisation or when modelling rougher processes, it is useful to have kernels whose sample paths are only once or twice differentiable.

This motivates the use of Matérn kernels, which form a family of stationary kernels that include the RBF kernel as a limiting case (\(\nu \to \infty\)), while introducing an explicit smoothness parameter \(\nu > 0\) to control the differentiability of the functions. By adjusting \(\nu\), we can interpolate between very rough kernels (like the exponential kernel) and infinitely smooth kernels (like the RBF), giving us fine-grained control over function smoothness.

For parameters \(\nu > 0\), \(\ell > 0\), the Matérn kernel is \[ \mathcal{K}_\text{Matérn}(r; \nu, \ell) = \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\,r}{\ell}\right)^\nu K_\nu\!\left(\frac{\sqrt{2\nu}\,r}{\ell}\right), \] where \(K_\nu\) denotes the modified Bessel function of the second kind.

The parameter \(\nu\) controls the smoothness of the resulting functions. Specifically, the Gaussian process is \(k\)-times mean-square differentiable if and only if \(\nu > k\). Consequently, the actual sample paths of the GP are \(\lfloor \nu - \tfrac{1}{2}\rfloor\)-times continuously differentiable. Common choices (up to the variance factor \(\sigma^2\)) admit closed forms:

Exponential Kernel (Non-Differentiable):
Once Differentiable:
Twice Differentiable:

Note: In Gaussian process regression, the kernel can be multiplied by a positive amplitude parameter \(\sigma^2 > 0\) to scale the variance of the function values: \[ \mathcal{K}(r;\nu,\ell) \;\longrightarrow\; \sigma^2 \mathcal{K}(r;\nu,\ell). \] This scaling changes the magnitude of function fluctuations but does not affect the smoothness or correlation structure of the GP.

Some functions exhibit repeating or periodic behavior that is not captured by standard RBF or Matérn kernels. To model such patterns, we can use a periodic kernel, which explicitly encodes periodicity into the covariance function.

A common choice for a one-dimensional periodic kernel is:

\[ \mathcal{K}_\text{per}(r; \ell, p) = \exp \Bigg( - \frac{2 \, \sin^2\big(\pi r / p\big)}{\ell^2} \Bigg), \]

where \(p > 0\) is the period of the function and \(\ell > 0\) is the length-scale controlling how quickly correlations decay away from exact multiples of the period.

Periodic kernels are particularly useful in Gaussian process regression for modeling phenomena that repeat regularly over time or space, such as seasonal time series, cyclic patterns, or physical systems with inherent periodicity.

GP Regression

Suppose we observe a training data set \(\mathcal{D} = \{(\boldsymbol{x}_n, y_n)\}_{n=1}^N\), where each observation follows \[ y_n = f(\boldsymbol{x}_n) + \epsilon_n, \quad \epsilon_n \sim \mathcal{N}(0, \sigma_y^2). \] Under this model, the covariance of the noisy observations is \[ \begin{align*} \mathrm{Cov}[y_i, y_j] &= \mathrm{Cov}[f_i, f_j] + \mathrm{Cov}[\epsilon_i, \epsilon_j] \\\\ &= \mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j) + \sigma_y^2 \, \mathbb{I}(i = j), \end{align*} \] so that \[ \mathrm{Cov}[\boldsymbol{y} \mid \boldsymbol{X}] = \boldsymbol{K}_{X,X} + \sigma_y^2 \boldsymbol{I}_N. \] Because a Gaussian process defines a joint Gaussian distribution over any collection of function values, we can express both the noisy training outputs and the noise-free test values together as \[ \begin{bmatrix} \boldsymbol{y} \\ \boldsymbol{f}_* \end{bmatrix} \sim \mathcal{N} \Bigg( \begin{bmatrix} \boldsymbol{\mu}_X \\ \boldsymbol{\mu}_* \end{bmatrix}, \begin{bmatrix} \boldsymbol{K}_{X,X} + \sigma_y^2 \boldsymbol{I}_N & \boldsymbol{K}_{X,*} \\\\ \boldsymbol{K}_{X,*}^\top & \boldsymbol{K}_{*,*} \end{bmatrix} \Bigg). \] The posterior predictive distribution at the test points \(\boldsymbol{X}_*\) is obtained by conditioning on the observed data: \[ p(\boldsymbol{f}_* \mid \mathcal{D}, \boldsymbol{X}_*) = \mathcal{N}\big( \boldsymbol{f}_* \mid \boldsymbol{\mu}_{* \mid X}, \boldsymbol{\Sigma}_{* \mid X} \big), \] where \[ \boldsymbol{\mu}_{* \mid X} = \boldsymbol{\mu}_* + \boldsymbol{K}_{X,*}^\top \big(\boldsymbol{K}_{X,X} + \sigma_y^2 \boldsymbol{I}_N \big)^{-1} (\boldsymbol{y} - \boldsymbol{\mu}_X), \] \[ \boldsymbol{\Sigma}_{* \mid X} = \boldsymbol{K}_{*,*} - \boldsymbol{K}_{X,*}^\top \big(\boldsymbol{K}_{X,X} + \sigma_y^2 \boldsymbol{I}_N \big)^{-1} \boldsymbol{K}_{X,*}. \]

These formulas are a direct application of the multivariate normal conditioning. The posterior mean \(\boldsymbol{\mu}_{*|X}\) is a linear combination of the observed outputs \(\boldsymbol{y}\), weighted by the kernel similarities between test and training points. The posterior covariance \(\boldsymbol{\Sigma}_{*|X}\) starts from the prior covariance \(\boldsymbol{K}_{*,*}\) and subtracts the information gained from the observations - the more data we observe near a test point, the smaller the posterior uncertainty.

The computational bottleneck is the inversion of the \(N \times N\) matrix \((\boldsymbol{K}_{X,X} + \sigma_y^2 \boldsymbol{I}_N)\), which costs \(O(N^3)\) in time and \(O(N^2)\) in storage. This cubic scaling limits standard GP regression to moderate dataset sizes (typically \(N \lesssim 10^4\)). For larger datasets, sparse approximations (inducing point methods), random Fourier features, and structured kernel interpolation provide scalable alternatives.

Gaussian Processes in Machine Learning

GPs occupy a unique position in the ML landscape. They provide closed-form uncertainty quantification that is automatic and calibrated, requiring no ensembles, dropout, or approximate inference. This makes them invaluable for Bayesian optimization (e.g., hyperparameter tuning via acquisition functions like Expected Improvement), active learning (selecting the most informative data points to label), and safety-critical applications where knowing what the model doesn't know is as important as its predictions. The kernel framework also connects GPs to neural networks: the Neural Tangent Kernel (NTK) shows that infinitely wide neural networks converge to GPs, providing a theoretical bridge between deep learning and nonparametric Bayesian modeling.

Loading...

Introduction

Mercer Kernels

GP Regression

Gaussian Processes in Machine Learning