Monte Carlo Methods


Credible Intervals

In Part 14, we introduced Bayesian inference and derived posterior distributions for unknown parameters. In Part 18, we developed Markov chains for modeling sequential data, noting that Markov Chain Monte Carlo (MCMC) methods use the Markov property to sample from complex posterior distributions.

We begin with the problem of summarizing a posterior distribution. Since a full distribution is often a high-dimensional object that is difficult to interpret or communicate, we frequently summarize the uncertainty about a parameter using a credible interval. It is important to note that a credible interval is fundamentally different from a confidence interval in frequentist statistics: a credible interval makes a direct probability statement about the parameter given the data, whereas a confidence interval is a statement about the long-run frequency of the procedure. Nevertheless, they may sometimes numerically coincide.

Definition: Credible Interval

For a given level \(\alpha \in (0,1)\), a \(100(1-\alpha)\%\) credible interval is a set \(C_{\alpha}(\mathcal{D}) = (l, u)\) such that \[ P(l \leq \theta \leq u \mid \mathcal{D}) = 1 - \alpha, \] where the probability is computed under the posterior distribution \(p(\theta \mid \mathcal{D})\).

This interval contains \(1-\alpha\) of the posterior probability mass. Since the definition does not uniquely determine the interval (infinitely many intervals satisfy this condition), a common choice is the central credible interval, which places \(\frac{\alpha}{2}\) of the posterior probability in each tail.

If the cumulative distribution function (cdf) \(F\) of the posterior distribution is known, the central credible interval can be computed as: \[ \begin{align*} l &= F^{-1}\left( \frac{\alpha}{2} \right), \\\\ u &= F^{-1}\left( 1 - \frac{\alpha}{2} \right), \end{align*} \] where \(F^{-1}\) denotes the quantile function (the inverse of the cdf).

As an example, if the posterior distribution is approximately normal with posterior mean \(\mu\) and posterior standard deviation \(\sigma\), a rough \(95\%\) credible interval can be set as: \[ (l, u) = (\mu - 2\sigma, \mu + 2\sigma), \] because for a standard normal distribution \(\mathcal{N}(0,1)\), approximately \(95\%\) of the probability mass lies within two standard deviations from the mean.

More precisely, when \(p(\theta \mid \mathcal{D}) = \mathcal{N}(\mu, \sigma^2)\) and \(\alpha = 0.05\), the central credible interval endpoints are: \[ \begin{align*} l &= \mu + \sigma \Phi^{-1}\left( \frac{\alpha}{2} \right), \\\\ u &= \mu + \sigma \Phi^{-1}\left( 1 - \frac{\alpha}{2} \right), \end{align*} \] where \(\Phi\) is the cumulative distribution function (cdf) of the standard normal distribution.

In particular, if the posterior distribution is the standard normal \(p(\theta \mid \mathcal{D}) = \mathcal{N}(0,1)\), then: \[ \begin{align*} l &= \Phi^{-1}(0.025) \approx -1.96, \\\\ u &= \Phi^{-1}(0.975) \approx 1.96, \end{align*} \] so the \(95\%\) credible interval is approximately \((-1.96, 1.96)\).
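The closed-form computation above can be sketched in a few lines of Python using only the standard library; `NormalDist.inv_cdf` plays the role of the quantile function \(\Phi^{-1}\). The function name here is illustrative, not from any particular library.

```python
from statistics import NormalDist

def central_interval_normal(mu, sigma, alpha=0.05):
    """Central 100(1 - alpha)% credible interval for a N(mu, sigma^2) posterior."""
    z = NormalDist()  # standard normal; inv_cdf is the quantile function Phi^{-1}
    l = mu + sigma * z.inv_cdf(alpha / 2)
    u = mu + sigma * z.inv_cdf(1 - alpha / 2)
    return l, u

# Standard normal posterior: recovers the interval (-1.96, 1.96)
l, u = central_interval_normal(0.0, 1.0)
print(round(l, 2), round(u, 2))  # -1.96 1.96
```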

The computation above relies on knowing the quantile function \(F^{-1}\) in closed form. For many posterior distributions - especially those arising from non-conjugate models or high-dimensional problems - no such closed form is available. This motivates the use of Monte Carlo methods.

Monte Carlo Approximation

The Monte Carlo method is a general computational technique that uses random sampling to approximate quantities that are difficult or impossible to compute analytically. The fundamental idea is simple: to estimate an expectation, draw independent samples and compute their average.

Theorem: Monte Carlo Approximation

Let \(\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(S)} \overset{\text{iid}}{\sim} p(\theta \mid \mathcal{D})\). Then for any integrable function \(f\), \[ \mathbb{E}_{p(\theta \mid \mathcal{D})}[f(\theta)] \approx \frac{1}{S} \sum_{s=1}^{S} f(\theta^{(s)}). \] By the law of large numbers, this approximation converges to the true expectation as \(S \to \infty\).

This principle is broadly applicable: by choosing appropriate functions \(f\), we can approximate posterior means (\(f(\theta) = \theta\)), posterior variances (\(f(\theta) = (\theta - \bar{\theta})^2\)), predictive distributions, and many other quantities of interest. We now apply this idea to the specific problem of approximating credible intervals.
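As a small illustration of the theorem, the snippet below approximates the posterior mean and variance from samples. Since we do not have a real posterior here, a standard normal serves as a stand-in for \(p(\theta \mid \mathcal{D})\).

```python
import random

random.seed(0)

# Stand-in posterior: N(0, 1); in practice these would be draws from p(theta | D).
S = 100_000
samples = [random.gauss(0.0, 1.0) for _ in range(S)]

# Posterior mean: f(theta) = theta
post_mean = sum(samples) / S

# Posterior variance: f(theta) = (theta - mean)^2
post_var = sum((t - post_mean) ** 2 for t in samples) / S

print(post_mean, post_var)  # close to the true values 0 and 1
```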

When the inverse cdf of the posterior distribution is not available in closed form, a practical alternative is to estimate the quantiles from samples. Suppose we have \(S\) independent samples \(\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(S)}\) from the posterior \(p(\theta \mid \mathcal{D})\). To approximate a \(100(1-\alpha)\%\) central credible interval, sort the samples in ascending order and read off the empirical quantiles.

More precisely, writing \(\theta^{(s)}\) for the \(s\)-th smallest sample (the \(s\)-th order statistic), approximate: \[ \begin{align*} l &\approx \theta^{\left(\left\lceil S \times \frac{\alpha}{2} \right\rceil\right)}, \\\\ u &\approx \theta^{\left(\left\lceil S \times \left(1 - \frac{\alpha}{2}\right) \right\rceil\right)}, \end{align*} \] where \(\lceil \cdot \rceil\) denotes the ceiling function.

Intuitively, instead of computing the inverse cdf analytically, we use the empirical cumulative distribution function (ECDF) formed by the sorted samples to estimate the desired quantiles. As \(S \to \infty\), the empirical quantiles converge to the true quantiles of the posterior by the Glivenko-Cantelli theorem, which guarantees that the ECDF converges uniformly to the true cdf.
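The order-statistic recipe above translates directly into code. Here is a minimal sketch that recovers the familiar \((-1.96, 1.96)\) interval from samples of a stand-in standard normal posterior (the `- 1` adjusts the one-based formula for Python's zero-based indexing):

```python
import math
import random

random.seed(1)

def central_interval_mc(samples, alpha=0.05):
    """Central credible interval estimated from posterior samples."""
    s = sorted(samples)
    S = len(s)
    l = s[math.ceil(S * alpha / 2) - 1]        # -1: Python lists are 0-indexed
    u = s[math.ceil(S * (1 - alpha / 2)) - 1]
    return l, u

# Stand-in posterior: N(0, 1); the true 95% central interval is (-1.96, 1.96).
draws = [random.gauss(0.0, 1.0) for _ in range(200_000)]
l, u = central_interval_mc(draws)
print(l, u)  # approximately -1.96 and 1.96
```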

Demo: Monte Carlo for Credible Intervals

Note: This demo does not perform full Bayesian inference from data. Instead, it uses predefined probability distributions to simulate "posterior-like" behavior. This allows us to illustrate how Monte Carlo methods are used to approximate credible intervals - particularly central and HPD intervals - without needing a prior or likelihood.

Monte Carlo methods are widely used in Bayesian statistics to approximate posterior distributions, especially when closed-form solutions are unavailable. One common application is estimating credible intervals using random samples drawn from the posterior.

To construct a central credible interval via Monte Carlo sampling:

  1. Draw many random samples from the posterior distribution
  2. Sort the samples in ascending order
  3. Select the appropriate quantiles (e.g., the 2.5% and 97.5% quantiles for a 95% interval)

As the number of samples increases, the approximation of the posterior improves. However, central credible intervals based on quantiles may not accurately reflect the distribution's shape - especially in skewed or multimodal cases.

For example, in a bimodal distribution (with two distinct peaks), the interval may:

  - include the low-density valley between the two modes, assigning credibility to values that are rarely observed
  - exclude high-density regions near one or both peaks
In such cases, highest posterior density (HPD) regions are preferred. They represent the region(s) of highest probability mass and always include the most probable values, even if the result is disjoint.
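The bimodal pathology is easy to demonstrate numerically. In this sketch the "posterior" is an equal mixture of two well-separated normals; the central 95% interval spans the near-empty valley between the modes, even though almost no probability mass lives there:

```python
import math
import random

random.seed(2)

# Bimodal "posterior": equal mixture of N(-3, 0.5^2) and N(3, 0.5^2).
S = 100_000
draws = [random.gauss(-3.0 if random.random() < 0.5 else 3.0, 0.5)
         for _ in range(S)]

draws.sort()
l = draws[math.ceil(S * 0.025) - 1]
u = draws[math.ceil(S * 0.975) - 1]

# The central interval contains the valley around theta = 0, yet almost
# no samples actually fall there.
valley_mass = sum(1 for t in draws if -1.0 < t < 1.0) / S
print((l, u), valley_mass)
```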

Highest Posterior Density (HPD)

As discussed above, central credible intervals can be suboptimal for asymmetric or multimodal posteriors because they allocate equal probability mass to each tail regardless of where the density is concentrated. The highest posterior density (HPD) region provides a more principled alternative by selecting the smallest region that contains the desired probability mass.

Definition (HPD Region)

For a given level \(\alpha \in (0,1)\), the \(100(1-\alpha)\%\) highest posterior density (HPD) region is defined as \[ C_{\alpha}(\mathcal{D}) = \{\theta : p(\theta \mid \mathcal{D}) \geq p^*\}, \] where the threshold \(p^*\) is the largest constant such that \[ P(\theta \in C_{\alpha}(\mathcal{D}) \mid \mathcal{D}) = \int_{\theta:\, p(\theta \mid \mathcal{D}) \geq p^*} p(\theta \mid \mathcal{D}) \, d\theta = 1 - \alpha. \]

In one dimension, the HPD region corresponds to the interval (or union of intervals) containing the values of highest posterior density, with total mass equal to the target credibility level (e.g., 95%). In this case, HPD regions are sometimes called highest density intervals (HDIs).

Unlike central credible intervals, HPD regions are not required to be symmetric or centered around the mean. This makes them especially useful when the posterior distribution is skewed or multimodal. In such cases, HPD regions better capture the actual regions of highest posterior density, and for unimodal distributions, the HPD interval is guaranteed to be the shortest interval containing \(1-\alpha\) probability mass.
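For a unimodal posterior, the shortest-interval property suggests a simple sample-based estimator: sort the samples and slide a window containing \(\lceil (1-\alpha) S \rceil\) of them, keeping the narrowest window. This is a sketch, not an exact method, and it does not handle multimodal (disjoint) HPD regions:

```python
import math
import random

random.seed(3)

def hpd_interval(samples, alpha=0.05):
    """Approximate HPD interval: shortest window holding 1 - alpha of the
    samples (valid for unimodal posteriors)."""
    s = sorted(samples)
    S = len(s)
    m = math.ceil((1 - alpha) * S)                      # window size
    widths = [s[j + m - 1] - s[j] for j in range(S - m + 1)]
    j = min(range(len(widths)), key=widths.__getitem__)  # narrowest window
    return s[j], s[j + m - 1]

# Skewed "posterior": Exponential(1). Its 95% HPD is roughly [0, 3.0],
# while the central interval is roughly [0.025, 3.69] - wider, and it
# excludes the highest-density region near 0.
draws = [random.expovariate(1.0) for _ in range(100_000)]
l_hpd, u_hpd = hpd_interval(draws)
print(l_hpd, u_hpd)
```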

In the demo above, you can observe that the HPD interval is often narrower than the central credible interval and may exclude low-density regions that the central interval mistakenly includes.

Both the central credible interval and the HPD region require samples from the posterior distribution. In the examples above, we assumed that such samples are readily available. In practice, however, directly sampling from the posterior is often impossible. The next section introduces the algorithmic framework that makes posterior sampling feasible.

Markov Chain Monte Carlo (MCMC)

The Monte Carlo methods described above assume that we can draw independent samples directly from the posterior distribution \(p(\theta \mid \mathcal{D})\). In practice, however, the full posterior is rarely available in closed form. By Bayes' theorem, we have \[ p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta) \, p(\theta)}{p(\mathcal{D})}, \] where computing the normalizing constant (the marginal likelihood or evidence) \[ p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta) \, p(\theta) \, d\theta \] requires integrating over the entire parameter space, which is often intractable in high dimensions. We can, however, easily evaluate the unnormalized posterior \[ \tilde{p}(\theta) \;\triangleq\; p(\mathcal{D} \mid \theta) \cdot p(\theta) \;\propto\; p(\theta \mid \mathcal{D}), \] since this only requires evaluating the likelihood and prior at a given point \(\theta\).

This is where Markov Chain Monte Carlo (MCMC) methods become essential. The key insight is that we can construct a Markov chain whose stationary distribution equals the target posterior \(p(\theta \mid \mathcal{D})\), using only the unnormalized density \(\tilde{p}(\theta)\). By running this chain for sufficiently many steps, the samples it produces approximate draws from the posterior, enabling Monte Carlo estimation without computing \(p(\mathcal{D})\).
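To make this concrete, here is a minimal random-walk Metropolis sampler, a sketch rather than a production implementation. Note that it touches only the unnormalized log-density; for illustration the target is an unnormalized standard normal, standing in for \(\log \tilde{p}(\theta)\):

```python
import math
import random

random.seed(4)

def log_unnorm_posterior(theta):
    """log p~(theta): unnormalized log-density of N(0, 1), constant dropped.
    In a real model this would be log-likelihood + log-prior."""
    return -0.5 * theta * theta

def metropolis(log_p, n_steps=50_000, step=1.0, theta0=0.0):
    """Random-walk Metropolis: requires only the unnormalized density."""
    theta, chain = theta0, []
    for _ in range(n_steps):
        prop = theta + random.gauss(0.0, step)        # symmetric proposal
        # Accept with probability min(1, p~(prop) / p~(theta))
        if math.log(random.random()) < log_p(prop) - log_p(theta):
            theta = prop
        chain.append(theta)                            # else keep current state
    return chain

chain = metropolis(log_unnorm_posterior)
burned = chain[5_000:]                                 # discard burn-in
mean = sum(burned) / len(burned)
print(mean)  # near 0, the true posterior mean
```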

Classic MCMC algorithms - most notably Metropolis-Hastings and Gibbs sampling - are foundational to Bayesian computation, and each requires only the ability to evaluate the unnormalized posterior \(\tilde{p}(\theta)\).

It is worth noting the role of MCMC in the broader landscape of machine learning. In large-scale models such as transformers and large language models (LLMs), MCMC is not used for training due to its computational cost and poor scalability in very high-dimensional parameter spaces. These models are instead trained using gradient-based optimization (e.g., stochastic gradient descent or Adam) with maximum likelihood or related objectives. However, MCMC remains indispensable for Bayesian modeling in scientific applications, uncertainty quantification, simulation-based inference, and decision-making under uncertainty - settings where principled uncertainty estimates are essential.

Connections to Machine Learning

Monte Carlo methods form the computational backbone of modern Bayesian machine learning. Credible intervals and HPD regions provide a principled framework for uncertainty quantification in model predictions - a necessity in safety-critical domains such as medical diagnosis and autonomous systems.

While MCMC sampling serves as the gold standard for posterior inference in Bayesian neural networks (where weights are treated as random variables with posterior distributions), the Monte Carlo principle also powers stochastic variational inference. In this context, the reparameterization trick enables gradient-based optimization of stochastic objectives (like the ELBO), and Monte Carlo dropout offers a practical, scalable approximation to Bayesian inference in deep architectures.

The Monte Carlo framework developed here assumes that we can sample from the target distribution directly. However, what if the posterior is difficult to sample from, or we want to focus on estimating the probability of a rare event in the distribution's tail? In Part 20: Importance Sampling, we relax the requirement of direct sampling. Instead of sampling from the target distribution, we use a more convenient proposal distribution and correct for the mismatch using importance weights, providing a powerful tool for both efficiency and variance reduction.