Monte Carlo Methods

Credible Intervals

In our discussion of Bayesian inference, we introduced posterior distributions for unknown parameters. In our discussion of Markov chains, we developed models for sequential data, noting that Markov Chain Monte Carlo (MCMC) methods use the Markov property to sample from complex posterior distributions.

We begin with the problem of summarizing a posterior distribution. Since a full distribution is often complex (especially in high-dimensional parameter spaces) and difficult to interpret or communicate, we frequently summarize the uncertainty about a parameter using a credible interval. It is important to note that a credible interval is fundamentally different from a confidence interval in frequentist statistics: a credible interval makes a direct probability statement about the parameter given the data, whereas a confidence interval is a statement about the long-run frequency of the procedure. Nevertheless, they may sometimes numerically coincide.

Recall the credible interval: for a level \(\alpha \in (0,1)\), it is an interval \(C_\alpha(\mathcal{D}) = [l, u]\) with \(P(\theta \in [l, u] \mid \mathcal{D}) = 1 - \alpha\) under the posterior distribution \(p(\theta \mid \mathcal{D})\). For continuous posteriors, this condition does not uniquely determine the interval, so a common choice is the central credible interval, which places \(\alpha/2\) of the posterior probability in each tail.

If the cumulative distribution function (cdf) \(F\) of the posterior distribution is known, the central credible interval can be computed as: \[ \begin{align*} l &= F^{-1}\left( \frac{\alpha}{2} \right), \\\\ u &= F^{-1}\left( 1 - \frac{\alpha}{2} \right), \end{align*} \] where \(F^{-1}\) denotes the quantile function (the inverse of the cdf).

As an example, if the posterior distribution is approximately normal with posterior mean \(\mu\) and posterior standard deviation \(\sigma\), a rough \(95\%\) credible interval can be set as: \[ [l, u] = [\mu - 2\sigma, \mu + 2\sigma], \] because for a standard normal distribution \(\mathcal{N}(0,1)\), approximately \(95\%\) (more precisely \(95.45\%\)) of the probability mass lies within two standard deviations from the mean.

More precisely, when \(p(\theta \mid \mathcal{D}) = \mathcal{N}(\mu, \sigma^2)\) and \(\alpha = 0.05\), the central credible interval endpoints are: \[ \begin{align*} l &= \mu + \sigma \Phi^{-1}\left( \frac{\alpha}{2} \right), \\\\ u &= \mu + \sigma \Phi^{-1}\left( 1 - \frac{\alpha}{2} \right), \end{align*} \] where \(\Phi\) is the cdf of the standard normal distribution.

In particular, if the posterior distribution is the standard normal \(p(\theta \mid \mathcal{D}) = \mathcal{N}(0,1)\), then: \[ \begin{align*} l &= \Phi^{-1}(0.025) \approx -1.96, \\\\ u &= \Phi^{-1}(0.975) \approx 1.96, \end{align*} \] so the \(95\%\) credible interval is approximately \([-1.96, 1.96]\).

The computation above relies on knowing the quantile function \(F^{-1}\) in closed form. For many posterior distributions — especially those arising from non-conjugate models or high-dimensional problems — no such closed form is available. This motivates the use of Monte Carlo methods.

Monte Carlo Approximation

The Monte Carlo method is a general computational technique that uses random sampling to approximate quantities that are difficult or impossible to compute analytically. The fundamental idea is simple: to estimate an expectation, draw independent samples and compute their average.

Definition: Monte Carlo Estimator

Let \(\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(S)} \overset{\text{iid}}{\sim} p(\theta \mid \mathcal{D})\) and let \(f\) be any function for which \(\mathbb{E}_{p(\theta \mid \mathcal{D})}[|f(\theta)|]\) is finite. The Monte Carlo estimator of \(\mathbb{E}_{p(\theta \mid \mathcal{D})}[f(\theta)]\) based on \(S\) samples is \[ \hat{\mu}_S := \frac{1}{S} \sum_{s=1}^{S} f(\theta^{(s)}). \]

Proposition: Monte Carlo Consistency

Under the assumptions above, the Monte Carlo estimator converges to the true expectation almost surely: \[ \hat{\mu}_S \xrightarrow{\text{a.s.}} \mathbb{E}_{p(\theta \mid \mathcal{D})}[f(\theta)] \quad \text{as } S \to \infty. \]

Proof.

The samples \(f(\theta^{(1)}), f(\theta^{(2)}), \ldots\) are iid as functions of iid random variables, with finite mean by assumption. Applying the strong law of large numbers to this sequence yields the desired almost-sure convergence.

By choosing different functions \(f\), this estimator approximates many functionals of the posterior. Setting \(f(\theta) = \theta\) gives the posterior mean. For the posterior variance, the standard plug-in estimator is the sample variance \(\hat{\sigma}^2_S = \frac{1}{S} \sum_{s=1}^{S} (\theta^{(s)} - \hat{\theta}_S)^2\) where \(\hat{\theta}_S = \frac{1}{S} \sum_s \theta^{(s)}\) is the sample mean — strictly speaking this is a plug-in estimator rather than a direct Monte Carlo estimator (since \(f\) depends on the sample), but it inherits consistency from the joint convergence of \(\hat{\theta}_S\) and the second-moment estimator. Predictive distributions, posterior probabilities of events, and other integrable functionals follow the same template.

When the inverse cdf of the posterior distribution is not available in closed form, a practical alternative is to estimate the quantiles from samples. Suppose we have \(S\) independent samples \(\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(S)}\) from the posterior \(p(\theta \mid \mathcal{D})\). To approximate a \(100(1-\alpha)\%\) central credible interval:

Sort the samples in increasing order: \(\theta^{(1)} \leq \theta^{(2)} \leq \cdots \leq \theta^{(S)}\).
Set the lower endpoint \(l\) to be the \(\left(\frac{\alpha}{2}\right)\)-quantile, and the upper endpoint \(u\) to be the \(\left(1 - \frac{\alpha}{2}\right)\)-quantile.

More precisely, one common choice (corresponding to the lower-bound ceiling convention) is: \[ \begin{align*} l &= \theta^{\left(\left\lceil S \times \frac{\alpha}{2} \right\rceil\right)}, \\\\ u &= \theta^{\left(\left\lceil S \times \left(1 - \frac{\alpha}{2}\right) \right\rceil\right)}, \end{align*} \] where \(\lceil \cdot \rceil\) denotes the ceiling function. Other conventions exist — for example, numerical libraries (such as numpy's quantile) often use linear interpolation between adjacent order statistics. All such conventions agree in the limit \(S \to \infty\).

Theorem: Glivenko-Cantelli

Let \(X_1, X_2, \ldots\) be iid real-valued random variables with cumulative distribution function \(F\), and let \(\hat{F}_S\) denote the empirical cdf based on the first \(S\) samples: \[ \hat{F}_S(x) := \frac{1}{S} \sum_{s=1}^{S} \mathbb{1}\{X_s \leq x\}. \] Then \(\hat{F}_S\) converges uniformly to \(F\) almost surely: \[ \sup_{x \in \mathbb{R}} \left| \hat{F}_S(x) - F(x) \right| \xrightarrow{\text{a.s.}} 0 \quad \text{as } S \to \infty. \]

The argument proceeds in two steps. First, for each fixed \(x\), the indicator variables \(\mathbb{1}\{X_s \leq x\}\) are iid with mean \(F(x)\), so the strong law of large numbers gives pointwise convergence \(\hat{F}_S(x) \to F(x)\) almost surely. Second, since both \(\hat{F}_S\) and \(F\) are monotone non-decreasing functions taking values in \([0, 1]\), a partitioning-by-rationals argument upgrades the countable pointwise convergence to uniform convergence.

Remark. The partitioning argument above goes through directly when \(F\) is continuous; the general case (where \(F\) may have atoms) requires an additional argument handling the jump points separately.

Returning to the credible interval problem: instead of computing the inverse cdf analytically, we use the empirical cumulative distribution function (ECDF) formed by the sorted samples to estimate the desired quantiles. By the Glivenko-Cantelli theorem above, the ECDF converges uniformly to the true cdf, and under standard regularity (continuity of \(F\)), this implies convergence of the empirical quantiles to the true quantiles.

Demo: Monte Carlo for Credible Intervals

Note: this demo does not perform full Bayesian inference from data. It draws seeded, reproducible samples from a chosen "posterior" — for the gamma and beta options, sampling is literally the inverse transform \(\theta = F^{-1}(U)\) with \(U \sim U[0,1]\), so the quantile function of this page is also the sampler. This isolates the object of study: how well do \(S\) samples recover the interval that the true distribution defines?

The central credible interval is computed exactly as derived above: sort the samples and read off the order statistics \(\theta^{(\lceil S \alpha/2 \rceil)}\) and \(\theta^{(\lceil S(1 - \alpha/2) \rceil)}\) — the readout displays which order statistics were used, and the dashed lines mark the true quantiles \(F^{-1}(\alpha/2)\) and \(F^{-1}(1-\alpha/2)\) for comparison. Both convergence theorems of this page run live in the lower plot. In the ECDF view, the empirical cdf \(\hat{F}_S\) is drawn over the true \(F\) with the exact sup-distance \(\sup_x |\hat{F}_S(x) - F(x)|\) reported and marked at the point where it is attained — increase \(S\) and watch the Glivenko-Cantelli theorem squeeze it toward zero. Switching to the convergence trace plots the running mean \(\hat{\mu}_S\) and the interval endpoints against \(S\), showing the strong law of large numbers pull them onto the true values.

However, quantile-based central intervals ignore the shape of the distribution. Try the skewed gamma option, or the bimodal mixture: the central interval must place \(\alpha/2\) mass in each tail regardless of where the density actually concentrates, so for the mixture it happily covers the low-density valley between the peaks. Selecting HPD switches to the highest-density construction developed in the next section, estimated from the samples by a kernel density estimate with a density-quantile threshold — and, alongside it, the true HPD region computed from the exact density by threshold bisection, with the threshold \(p^*\) reported. For the well-separated bimodal case the region correctly splits into a disjoint union of two intervals, one around each mode; for the skewed gamma it is visibly shorter than the central interval, as the next section proves it must be.

Highest Posterior Density (HPD)

As discussed above, central credible intervals can be suboptimal for asymmetric or multimodal posteriors because they allocate equal probability mass to each tail regardless of where the density is concentrated. The highest posterior density (HPD) region provides a more principled alternative by selecting the smallest region that contains the desired probability mass.

Definition: HPD Region

Assume the posterior has a continuous density \(p(\theta \mid \mathcal{D})\). For a level \(\alpha \in (0,1)\), the \(100(1-\alpha)\%\) highest posterior density (HPD) region is \[ C_{\alpha}(\mathcal{D}) = \{\theta : p(\theta \mid \mathcal{D}) \geq p^*\}, \] where the threshold \(p^*\) is the largest constant satisfying \[ P(\theta \in C_{\alpha}(\mathcal{D}) \mid \mathcal{D}) = \int_{\{\theta : p(\theta \mid \mathcal{D}) \geq p^*\}} p(\theta \mid \mathcal{D}) \, d\theta = 1 - \alpha. \]

In one dimension, the HPD region corresponds to the interval (or union of intervals) containing the values of highest posterior density, with total mass equal to the target credibility level (e.g., 95%). In this case, HPD regions are sometimes called highest density intervals (HDIs).

Unlike central credible intervals, HPD regions are not required to be symmetric or centered around the mean. This makes them especially useful when the posterior distribution is skewed or multimodal. In such cases, HPD regions better capture the actual regions of highest posterior density. For unimodal distributions, the HPD region is in fact a single interval — the shortest interval containing \(1-\alpha\) probability mass.

In the demo above, you can observe that the HPD interval is often narrower than the central credible interval and may exclude low-density regions that the central interval mistakenly includes. Both the central credible interval and the HPD region require samples from the posterior distribution. In the examples above, we assumed that such samples are readily available. In practice, however, directly sampling from the posterior is often impossible. The next section introduces the algorithmic framework that makes posterior sampling feasible.

Markov Chain Monte Carlo (MCMC)

The Monte Carlo methods described above assume that we can draw independent samples directly from the posterior distribution \(p(\theta \mid \mathcal{D})\). In practice, however, the full posterior is rarely available in closed form. By Bayes' theorem, we have \[ p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta) \, p(\theta)}{p(\mathcal{D})}, \] where computing the normalizing constant (the marginal likelihood or evidence) \[ p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta) \, p(\theta) \, d\theta \] requires integrating over the entire parameter space, which is often intractable in high dimensions. We can, however, easily evaluate the unnormalized posterior \[ \tilde{p}(\theta) := p(\mathcal{D} \mid \theta) \cdot p(\theta) \propto p(\theta \mid \mathcal{D}), \] since this only requires evaluating the likelihood and prior at a given point \(\theta\).

This is where Markov Chain Monte Carlo (MCMC) methods become essential. The key insight is that we can construct a Markov chain whose stationary distribution equals the target posterior \(p(\theta \mid \mathcal{D})\), using only the unnormalized density \(\tilde{p}(\theta)\). Although the samples produced by such a chain are not independent, under standard regularity conditions on the chain (irreducibility, aperiodicity), time averages along the chain converge to expectations under the stationary distribution — enabling Monte Carlo estimation without computing \(p(\mathcal{D})\).

Three classic MCMC algorithms are foundational to Bayesian computation:

Metropolis-Hastings:
A general-purpose algorithm that proposes new samples from a proposal distribution and accepts or rejects them based on a ratio of unnormalized densities.
Gibbs sampling:
A special case of Metropolis-Hastings that is efficient when the full conditional distributions are available in closed form. It is commonly used in probabilistic graphical models.
Hamiltonian Monte Carlo (HMC):
A gradient-based method that uses the geometry of the posterior to propose distant yet high-probability samples, making it well-suited for high-dimensional problems. It is widely used in probabilistic programming frameworks such as Stan and PyMC.

It is worth noting the role of MCMC in the broader landscape of machine learning. In large-scale models such as transformers and large language models (LLMs), MCMC is not used for training due to its computational cost and poor scalability in very high-dimensional parameter spaces. These models are instead trained using gradient-based optimization (e.g., stochastic gradient descent or Adam) with maximum likelihood or related objectives. However, MCMC remains widely used for Bayesian modeling in scientific applications, uncertainty quantification, simulation-based inference, and decision-making under uncertainty — settings where principled uncertainty estimates are essential.

Connections to Machine Learning

Monte Carlo methods form the computational backbone of modern Bayesian machine learning. Credible intervals and HPD regions provide a principled framework for uncertainty quantification in model predictions — a necessity in safety-critical domains such as medical diagnosis and autonomous systems.

While MCMC sampling serves as the gold standard for training Bayesian neural networks (where weights are treated as distributions), the Monte Carlo principle also powers stochastic variational inference In this context, the reparameterization trick enables gradient-based optimization of stochastic objectives (like the ELBO), and Monte Carlo dropout offers a practical, scalable approximation to Bayesian inference in deep architectures.

The Monte Carlo framework developed here assumes that we can sample from the target distribution directly. However, what if the posterior is difficult to sample from, or we want to focus on estimating the probability of a rare event in the distribution's tail? In the next page on importance sampling, we relax the requirement of direct sampling. Instead of sampling from the target distribution, we use a more convenient proposal distribution and correct for the mismatch using importance weights, providing a powerful tool for both efficiency and variance reduction.