Credible Intervals
In our discussion of Bayesian inference, we introduced posterior distributions for
unknown parameters. In our discussion of Markov chains, we developed models for sequential data,
noting that Markov Chain Monte Carlo (MCMC) methods use the Markov property to sample from complex posterior distributions.
We begin with the problem of summarizing a posterior distribution. Since a full distribution is often complex
(especially in high-dimensional parameter spaces) and difficult to interpret or communicate, we frequently
summarize the uncertainty about a parameter using a
credible interval. It is important to note that a credible interval is fundamentally different from a
confidence interval in frequentist statistics: a credible interval
makes a direct probability statement about the parameter given the data, whereas a confidence interval is a statement about
the long-run frequency of the procedure. Nevertheless, they may sometimes numerically coincide.
Recall the credible interval:
for a level \(\alpha \in (0,1)\), it is an interval \(C_\alpha(\mathcal{D}) = [l, u]\) with
\(P(\theta \in [l, u] \mid \mathcal{D}) = 1 - \alpha\) under the posterior distribution \(p(\theta \mid \mathcal{D})\).
For continuous posteriors, this condition does not uniquely determine the interval, so a common choice is the
central credible interval, which places \(\alpha/2\) of the posterior probability in each tail.
If the cumulative distribution function (cdf) \(F\) of the posterior distribution is known, the central
credible interval can be computed as:
\[
\begin{align*}
l &= F^{-1}\left( \frac{\alpha}{2} \right), \\\\
u &= F^{-1}\left( 1 - \frac{\alpha}{2} \right),
\end{align*}
\]
where \(F^{-1}\) denotes the quantile function (the inverse of the cdf).
As an example, if the posterior distribution is approximately normal with posterior mean \(\mu\) and
posterior standard deviation \(\sigma\), a rough \(95\%\) credible interval can be set as:
\[
[l, u] = [\mu - 2\sigma, \mu + 2\sigma],
\]
because for a standard normal distribution \(\mathcal{N}(0,1)\), approximately \(95\%\) (more precisely \(95.45\%\)) of the probability
mass lies within two standard deviations from the mean.
More precisely, when \(p(\theta \mid \mathcal{D}) = \mathcal{N}(\mu, \sigma^2)\) and \(\alpha = 0.05\), the
central credible interval endpoints are:
\[
\begin{align*}
l &= \mu + \sigma \Phi^{-1}\left( \frac{\alpha}{2} \right), \\\\
u &= \mu + \sigma \Phi^{-1}\left( 1 - \frac{\alpha}{2} \right),
\end{align*}
\]
where \(\Phi\) is the cdf of the standard normal distribution.
In particular, if the posterior distribution is the standard normal \(p(\theta \mid \mathcal{D}) = \mathcal{N}(0,1)\), then:
\[
\begin{align*}
l &= \Phi^{-1}(0.025) \approx -1.96, \\\\
u &= \Phi^{-1}(0.975) \approx 1.96,
\end{align*}
\]
so the \(95\%\) credible interval is approximately \([-1.96, 1.96]\).
The computation above relies on knowing the quantile function \(F^{-1}\) in closed form. For many posterior distributions — especially
those arising from non-conjugate models or high-dimensional problems — no such closed form is available. This motivates the use
of Monte Carlo methods.
Monte Carlo Approximation
The Monte Carlo method is a general computational technique that uses random sampling to approximate quantities
that are difficult or impossible to compute analytically. The fundamental idea is simple: to estimate an expectation, draw independent
samples and compute their average.
Definition: Monte Carlo Estimator
Let \(\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(S)} \overset{\text{iid}}{\sim} p(\theta \mid \mathcal{D})\)
and let \(f\) be any function for which \(\mathbb{E}_{p(\theta \mid \mathcal{D})}[|f(\theta)|]\) is finite.
The Monte Carlo estimator of \(\mathbb{E}_{p(\theta \mid \mathcal{D})}[f(\theta)]\) based on \(S\) samples is
\[
\hat{\mu}_S \;:=\; \frac{1}{S} \sum_{s=1}^{S} f(\theta^{(s)}).
\]
Proposition: Monte Carlo Consistency
Under the assumptions above, the Monte Carlo estimator converges to the true expectation almost surely:
\[
\hat{\mu}_S \;\xrightarrow{\text{a.s.}}\; \mathbb{E}_{p(\theta \mid \mathcal{D})}[f(\theta)]
\quad \text{as } S \to \infty.
\]
Proof.
The samples \(f(\theta^{(1)}), f(\theta^{(2)}), \ldots\) are iid as functions of iid random variables,
with finite mean by assumption. Applying the
strong law of large numbers
to this sequence yields the desired almost-sure convergence.
By choosing different functions \(f\), this estimator approximates many functionals of the posterior.
Setting \(f(\theta) = \theta\) gives the posterior mean. For the posterior variance, the standard plug-in
estimator is the sample variance \(\hat{\sigma}^2_S = \frac{1}{S} \sum_{s=1}^{S} (\theta^{(s)} - \hat{\theta}_S)^2\)
where \(\hat{\theta}_S = \frac{1}{S} \sum_s \theta^{(s)}\) is the sample mean — strictly speaking this is a
plug-in estimator rather than a direct Monte Carlo estimator (since \(f\) depends on the sample), but it
inherits consistency from the joint convergence of \(\hat{\theta}_S\) and the second-moment estimator.
Predictive distributions, posterior probabilities of events, and other integrable functionals follow the same template.
When the inverse cdf of the posterior distribution is not available in closed form, a practical alternative is to
estimate the quantiles from samples. Suppose we have \(S\) independent samples \(\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(S)}\)
from the posterior \(p(\theta \mid \mathcal{D})\). To approximate a \(100(1-\alpha)\%\) central credible interval:
- Sort the samples in increasing order: \(\theta^{(1)} \leq \theta^{(2)} \leq \cdots \leq \theta^{(S)}\).
- Set the lower endpoint \(l\) to be the \(\left(\frac{\alpha}{2}\right)\)-quantile, and the upper endpoint \(u\) to be the \(\left(1 - \frac{\alpha}{2}\right)\)-quantile.
More precisely, one common choice (corresponding to the lower-bound ceiling convention) is:
\[
\begin{align*}
l &= \theta^{\left(\left\lceil S \times \frac{\alpha}{2} \right\rceil\right)}, \\\\
u &= \theta^{\left(\left\lceil S \times \left(1 - \frac{\alpha}{2}\right) \right\rceil\right)},
\end{align*}
\]
where \(\lceil \cdot \rceil\) denotes the ceiling function. Other conventions exist — for example,
numerical libraries (such as numpy's quantile) often use linear interpolation between adjacent order statistics.
All such conventions agree in the limit \(S \to \infty\).
Theorem: Glivenko-Cantelli
Let \(X_1, X_2, \ldots\) be iid real-valued random variables with cumulative distribution function \(F\),
and let \(\hat{F}_S\) denote the empirical cdf based on the first \(S\) samples:
\[
\hat{F}_S(x) \;:=\; \frac{1}{S} \sum_{s=1}^{S} \mathbb{1}\{X_s \leq x\}.
\]
Then \(\hat{F}_S\) converges uniformly to \(F\) almost surely:
\[
\sup_{x \in \mathbb{R}} \left| \hat{F}_S(x) - F(x) \right| \;\xrightarrow{\text{a.s.}}\; 0 \quad \text{as } S \to \infty.
\]
The argument proceeds in two steps. First, for each fixed \(x\), the indicator variables
\(\mathbb{1}\{X_s \leq x\}\) are iid with mean \(F(x)\), so the
strong law of large numbers
gives pointwise convergence \(\hat{F}_S(x) \to F(x)\) almost surely. Second, since both \(\hat{F}_S\)
and \(F\) are monotone non-decreasing functions taking values in \([0, 1]\), a partitioning-by-rationals
argument upgrades the countable pointwise convergence to uniform convergence.
Remark. The partitioning argument above goes through directly when \(F\) is continuous;
the general case (where \(F\) may have atoms) requires an additional argument handling the jump points
separately.
Returning to the credible interval problem: instead of computing the inverse cdf analytically, we use the
empirical cumulative distribution function (ECDF) formed by the sorted samples to estimate
the desired quantiles. By the Glivenko-Cantelli theorem above,
the ECDF converges uniformly to the true cdf, and under standard regularity (continuity of \(F\)),
this implies convergence of the empirical quantiles to the true quantiles.
Demo: Monte Carlo for Credible Intervals
Note: This demo does not perform full Bayesian inference from data. Instead, it uses predefined probability distributions
to simulate "posterior-like" behavior. This allows us to illustrate how Monte Carlo methods are used to approximate credible
intervals — particularly central and HPD intervals — without needing a prior or likelihood.
Monte Carlo methods are widely used in Bayesian statistics to approximate posterior distributions, especially
when closed-form solutions are unavailable. One common application is estimating credible intervals using random
samples drawn from the posterior.
To construct a central credible interval via Monte Carlo sampling:
- Draw many random samples from the posterior distribution
- Sort the samples in ascending order
- Select the appropriate quantiles (e.g., the 2.5% and 97.5% quantiles for a 95% interval)
As the number of samples increases, the approximation of the posterior improves.
However, central credible intervals based on quantiles may not accurately reflect the distribution's
shape — especially in skewed or multimodal cases.
For example, in a bimodal distribution (with two distinct peaks), the interval may:
- Fall in a low-probability region between the modes
- Omit one of the peaks entirely
In such cases, highest posterior density (HPD) regions are preferred.
They represent the region(s) of highest probability mass and always include the modes,
even if the result is disjoint.
Highest Posterior Density (HPD)
As discussed above, central credible intervals can be suboptimal for asymmetric or multimodal posteriors because they allocate equal probability
mass to each tail regardless of where the density is concentrated. The highest posterior density (HPD) region provides a
more principled alternative by selecting the smallest region that contains the desired probability mass.
Definition: HPD Region
Assume the posterior has a continuous density \(p(\theta \mid \mathcal{D})\).
For a level \(\alpha \in (0,1)\), the \(100(1-\alpha)\%\) highest posterior
density (HPD) region is
\[
C_{\alpha}(\mathcal{D}) \;=\; \{\theta : p(\theta \mid \mathcal{D}) \geq p^*\},
\]
where the threshold \(p^*\) is the largest constant satisfying
\[
P(\theta \in C_{\alpha}(\mathcal{D}) \mid \mathcal{D}) \;=\; \int_{\{\theta : p(\theta \mid \mathcal{D}) \geq p^*\}} p(\theta \mid \mathcal{D}) \, d\theta \;=\; 1 - \alpha.
\]
In one dimension, the HPD region corresponds to the interval (or union of intervals) containing the values of highest
posterior density, with total mass equal to the target credibility level (e.g., 95%). In this case, HPD regions are
sometimes called highest density intervals (HDIs).
Unlike central credible intervals, HPD regions are not required to be symmetric or centered around the mean.
This makes them especially useful when the posterior distribution is skewed or multimodal. In such cases,
HPD regions better capture the actual regions of highest posterior density. For unimodal distributions, the
HPD region is in fact a single interval — the shortest interval containing \(1-\alpha\) probability mass.
In the demo above, you can observe that the HPD interval is often narrower than the central credible interval
and may exclude low-density regions that the central interval mistakenly includes. Both the central credible interval
and the HPD region require samples from the posterior distribution. In the examples above, we assumed that such samples
are readily available. In practice, however, directly sampling from the posterior is often impossible. The next
section introduces the algorithmic framework that makes posterior sampling feasible.
Markov Chain Monte Carlo (MCMC)
The Monte Carlo methods described above assume that we can draw independent samples directly from the posterior distribution
\(p(\theta \mid \mathcal{D})\). In practice, however, the full posterior is rarely available in closed form.
By Bayes' theorem, we have
\[
p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta) \, p(\theta)}{p(\mathcal{D})},
\]
where computing the normalizing constant (the marginal likelihood or evidence)
\[
p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta) \, p(\theta) \, d\theta
\]
requires integrating over the entire parameter space, which is often intractable in high dimensions.
We can, however, easily evaluate the unnormalized posterior
\[
\tilde{p}(\theta) \;:=\; p(\mathcal{D} \mid \theta) \cdot p(\theta) \;\propto\; p(\theta \mid \mathcal{D}),
\]
since this only requires evaluating the likelihood and prior at a given point \(\theta\).
This is where Markov Chain Monte Carlo (MCMC) methods become essential. The key insight is that we
can construct a Markov chain whose stationary distribution
equals the target posterior \(p(\theta \mid \mathcal{D})\), using only the unnormalized density \(\tilde{p}(\theta)\).
Although the samples produced by such a chain are not independent, under standard regularity conditions on the chain
(irreducibility, aperiodicity), time averages along the chain converge to expectations under the stationary distribution —
enabling Monte Carlo estimation without computing \(p(\mathcal{D})\).
Three classic MCMC algorithms are foundational to Bayesian computation:
- Metropolis-Hastings:
A general-purpose algorithm that proposes new samples from a proposal distribution and accepts or rejects
them based on a ratio of unnormalized densities.
- Gibbs sampling:
A special case of Metropolis-Hastings that is efficient when the full conditional distributions are available
in closed form. It is commonly used in probabilistic graphical models.
- Hamiltonian Monte Carlo (HMC):
A gradient-based method that uses the geometry of the posterior to propose distant yet high-probability samples,
making it well-suited for high-dimensional problems. It is widely used in probabilistic programming frameworks
such as Stan and PyMC.
It is worth noting the role of MCMC in the broader landscape of machine learning. In large-scale models such as transformers
and large language models (LLMs), MCMC is not used for training due to its computational cost and poor scalability
in very high-dimensional parameter spaces. These models are instead trained using
gradient-based optimization
(e.g., stochastic gradient descent or Adam) with maximum likelihood or related objectives.
However, MCMC remains widely used for Bayesian modeling in scientific applications, uncertainty quantification, simulation-based inference,
and decision-making under uncertainty — settings where principled uncertainty estimates are essential.
Connections to Machine Learning
Monte Carlo methods form the computational backbone of modern Bayesian machine learning.
Credible intervals and HPD regions provide a principled framework for
uncertainty quantification in model predictions — a necessity in safety-critical domains such as
medical diagnosis and autonomous systems.
While MCMC sampling serves as the gold standard for training Bayesian neural networks
(where weights are treated as distributions), the Monte Carlo principle also powers
stochastic variational inference
In this context, the reparameterization trick enables gradient-based optimization of
stochastic objectives (like the ELBO),
and Monte Carlo dropout offers a practical, scalable approximation to Bayesian inference
in deep architectures.
The Monte Carlo framework developed here assumes that we can sample from the target distribution directly. However, what if
the posterior is difficult to sample from, or we want to focus on estimating the probability of a rare event in the distribution's tail?
In the next page on importance sampling, we relax the requirement of direct sampling.
Instead of sampling from the target distribution, we use a more convenient proposal distribution and correct for the mismatch
using importance weights, providing a powerful tool for both efficiency and variance reduction.