Introduction
In the previous page, we developed Monte Carlo methods, for approximating posterior summaries
such as credible intervals and HPD regions, and introduced MCMC as a way to sample from posteriors that cannot be evaluated in closed form.
These methods assume that we can draw samples from the target distribution. But what if sampling from the target is prohibitively expensive,
or we want to estimate the probability of a rare event in the distribution's tail? Importance sampling addresses precisely these
challenges by sampling from a different, more convenient distribution and correcting for the mismatch.
Suppose we want to compute the expectation of a target function
\(\varphi(\mathbf{x})\) under a target distribution \(\pi(\mathbf{x})\):
\[
I = \mathbb{E}_{\pi}[\varphi(\mathbf{x})] = \int \varphi(\mathbf{x}) \, \pi(\mathbf{x}) \, d\mathbf{x}.
\]
Convention. Throughout this page, \(\pi\) denotes the target distribution and
\(q\) the proposal distribution. The policy notation \(\pi_\theta(a \mid s)\) appearing in the
RLHF example below is a separate, conventional reuse of the symbol \(\pi\) in reinforcement learning.
Standard Monte Carlo estimation would draw independent samples
\(\mathbf{x}_1, \ldots, \mathbf{x}_{N_s} \sim \pi(\mathbf{x})\) and
approximate:
\[
I \approx \hat{I}_{\text{MC}} = \frac{1}{N_s} \sum_{n=1}^{N_s} \varphi(\mathbf{x}_n).
\]
However, this approach fails when sampling from \(\pi(\mathbf{x})\) is computationally expensive or infeasible, when
the normalizing constant of \(\pi(\mathbf{x})\) is unknown
(as is common in Bayesian inference), or when \(\varphi(\mathbf{x})\)
has large values only in regions where \(\pi(\mathbf{x})\) assigns low probability (rare events).
Importance sampling solves these problems by introducing a proposal distribution
(or importance distribution) \(q(\mathbf{x})\) from which we can easily sample, then reweighting samples
to account for the distribution mismatch. This idea—reusing samples collected under one distribution to estimate
expectations under another—is one of the most broadly applicable principles in computational statistics.
Importance sampling has become essential in modern ML, particularly in the following domains:
Modern large language models (LLMs) like ChatGPT are fine-tuned using RLHF, which fundamentally relies on
importance sampling. The Proximal Policy Optimization (PPO) algorithm uses importance sampling to
enable multiple gradient updates from the same batch of collected data:
\[
L(\theta) = \mathbb{E}_{(s,a) \sim \mu}\left[\min\left(\frac{\pi_\theta(a \mid s)}{\mu(a \mid s)} A(s,a), \text{clip}\left(\frac{\pi_\theta(a \mid s)}{\mu(a \mid s)}, 1-\epsilon, 1+\epsilon\right) A(s,a)\right)\right]
\]
where \(\mu\) is the behavior policy (data collection policy) and \(\pi_\theta\) is the current policy being optimized.
The importance ratio \(\pi_\theta/\mu\) allows us to reuse data collected under \(\mu\) for multiple updates
to \(\pi_\theta\). PPO's clipping mechanism (typically with \(\epsilon = 0.2\)) prevents this ratio from becoming
too large, ensuring stable training. Without importance sampling, we would need fresh on-policy data after
every gradient step, making RLHF computationally prohibitive.
2. Off-Policy Reinforcement Learning
In robotics, autonomous systems, and game AI, collecting on-policy data can be dangerous or expensive. Importance sampling
enables learning from historical data, human demonstrations, or safer exploration policies. This is crucial for algorithms
like Soft Actor-Critic (SAC), Conservative Q-Learning (CQL), and offline RL methods.
3. Variational Inference and Deep Generative Models
Importance-weighted autoencoders (IWAE) use importance sampling to obtain tighter lower bounds on the
log-evidence than the standard evidence lower bound (ELBO),
leading to better generative models and more accurate uncertainty estimates in Bayesian deep learning.
4. Imbalanced Data and Rare Events
In applications like fraud detection, medical diagnosis of rare diseases, and AI safety, the events we care most about
are extremely rare. Importance sampling allows us to oversample these critical cases and reweight appropriately, ensuring
models do not ignore rare but important scenarios.
5. Computing Normalizing Constants
Many probabilistic models in AI (Boltzmann machines, energy-based models, certain Bayesian posteriors) have intractable
normalizing constants. Importance sampling, particularly annealed importance sampling, provides practical methods to
estimate these constants, which is essential for model comparison and learning.
The critical insight is that importance sampling enables sample efficiency: in an era where data collection
and computation are expensive, the ability to reuse and reweight existing data rather than collecting new samples is invaluable.
This makes importance sampling one of the foundational techniques enabling practical, scalable AI systems.
Direct Importance Sampling
The key mathematical insight behind importance sampling is the change of measure.
Suppose \(q(\mathbf{x})\) is a proposal distribution satisfying the support condition:
\(q(\mathbf{x}) > 0\) whenever \(\pi(\mathbf{x}) > 0\). Then we can rewrite the expectation under
\(\pi(\mathbf{x})\) as an expectation under \(q(\mathbf{x})\):
\[
\begin{align*}
I &= \int \varphi(\mathbf{x}) \, \pi(\mathbf{x}) \, d\mathbf{x} \\\\
&= \int_{\{\pi > 0\}} \varphi(\mathbf{x}) \, \frac{\pi(\mathbf{x})}{q(\mathbf{x})} \, q(\mathbf{x}) \, d\mathbf{x} \\\\
&= \mathbb{E}_{q}\!\left[\varphi(\mathbf{x}) \, \frac{\pi(\mathbf{x})}{q(\mathbf{x})}\right].
\end{align*}
\]
The second equality uses the support condition, which guarantees that the ratio \(\pi/q\) is well-defined on the integration domain.
The third equality extends the integration to all of \(\{q > 0\}\) by setting the integrand to zero where \(\pi = 0\); this redefinition
does not change the integral, since \(\varphi \cdot \pi = 0\) on that set.
Definition: Importance Weights
Given a target distribution \(\pi(\mathbf{x})\) and a proposal distribution \(q(\mathbf{x})\) with \(q(\mathbf{x}) > 0\)
whenever \(\pi(\mathbf{x}) > 0\), the importance weight function is defined as
\[
w(\mathbf{x}) = \frac{\pi(\mathbf{x})}{q(\mathbf{x})}.
\]
For a sample \(\mathbf{x}_n \sim q\), the weight
\[
\tilde{w}_n := w(\mathbf{x}_n) = \pi(\mathbf{x}_n)/q(\mathbf{x}_n)
\]
measures how much more (or less) likely \(\mathbf{x}_n\) is under the target compared to the proposal.
Note that while \(\pi\) and \(q\) are normalized distributions, the weights \(\tilde{w}_n\) do not sum to any particular
value.
Assume that the normalized \(\pi(\mathbf{x})\) can be evaluated pointwise, but that we cannot sample from it.
If we draw \(N_s\) samples \(\mathbf{x}_1, \ldots, \mathbf{x}_{N_s} \sim q(\mathbf{x})\),
the direct importance sampling estimator is:
\[
\hat{I}_{\text{IS}} := \frac{1}{N_s} \sum_{n=1}^{N_s} \tilde{w}_n \, \varphi(\mathbf{x}_n).
\]
This estimator is unbiased. Since \(\mathbf{x}_1, \ldots, \mathbf{x}_{N_s}\) are i.i.d.\ samples from \(q\),
linearity of expectation gives
\[
\begin{align*}
\mathbb{E}_q[\hat{I}_{\text{IS}}] &= \frac{1}{N_s}\sum_{n=1}^{N_s} \mathbb{E}_q\!\left[w(\mathbf{x}_n)\,\varphi(\mathbf{x}_n)\right] \\\\
&= \mathbb{E}_q\!\left[w(\mathbf{X})\,\varphi(\mathbf{X})\right] \\\\
&= I,
\end{align*}
\]
where the second equality uses that all terms have the same distribution (we write \(\mathbf{X} \sim q\) for a single representative),
and the third invokes the change-of-measure identity above.
Theorem: Variance of the Direct IS Estimator
Under the support condition, with \(\mathbf{x}_1, \ldots, \mathbf{x}_{N_s} \overset{\text{i.i.d.}}{\sim} q\),
let \(\mathbf{X} \sim q\) denote a single representative sample. If
\(\mathbb{E}_q[w(\mathbf{X})^2 \varphi(\mathbf{X})^2] < \infty\), then
\[
\operatorname{Var}_q[\hat{I}_{\text{IS}}] = \frac{1}{N_s}\!\left(\mathbb{E}_q[w(\mathbf{X})^2 \varphi(\mathbf{X})^2] - I^2\right).
\]
Proof.
Let \(Y_n := w(\mathbf{x}_n)\varphi(\mathbf{x}_n)\). Since the \(\mathbf{x}_n\) are i.i.d.\ under \(q\), so are the \(Y_n\). Hence
\[
\operatorname{Var}_q[\hat{I}_{\text{IS}}] = \operatorname{Var}_q\!\left[\frac{1}{N_s} \sum_{n=1}^{N_s} Y_n\right]
= \frac{1}{N_s^2} \sum_{n=1}^{N_s} \operatorname{Var}_q[Y_n]
= \frac{1}{N_s} \operatorname{Var}_q[Y_1],
\]
where the second equality uses independence of the \(Y_n\), and the third their identical distribution.
The single-sample variance follows from \(\operatorname{Var}[Y] = \mathbb{E}[Y^2] - (\mathbb{E}[Y])^2\) and unbiasedness:
\[
\operatorname{Var}_q[Y_1] = \mathbb{E}_q[w(\mathbf{X})^2 \varphi(\mathbf{X})^2] - \left(\mathbb{E}_q[w(\mathbf{X})\varphi(\mathbf{X})]\right)^2
= \mathbb{E}_q[w(\mathbf{X})^2 \varphi(\mathbf{X})^2] - I^2.
\]
If the importance weights vary dramatically, a few samples will dominate the estimate, leading to high variance.
This is quantified by the effective sample size (ESS), which measures the number of independent
samples that would provide the same statistical efficiency as the current weighted sample. The ESS can be computed
in two equivalent ways:
Definition: Effective Sample Size
The effective sample size (ESS) of a weighted sample
\(\{(\mathbf{x}_n, \tilde{w}_n)\}_{n=1}^{N_s}\) is defined as
\[
\operatorname{ESS} := \frac{\left(\sum_{n=1}^{N_s} \tilde{w}_n\right)^2}{\sum_{n=1}^{N_s} \tilde{w}_n^2}
= \frac{1}{\sum_{n=1}^{N_s} W_n^2},
\]
where \(W_n := \tilde{w}_n / \sum_{n'=1}^{N_s} \tilde{w}_{n'}\) are the normalized weights.
The second equality follows by dividing numerator and denominator of the first expression by
\(\left(\sum_n \tilde{w}_n\right)^2\).
The ESS ranges from 1 (maximum degeneracy, where all weight concentrates on a single sample) to \(N_s\)
(no degeneracy, where all weights are equal). A good rule of thumb is that if \(\operatorname{ESS} < N_s/2\), the
proposal distribution \(q(\mathbf{x})\) should be reconsidered, as this indicates that many samples have
negligible contribution to the estimate.
Requirements for Direct Importance Sampling
For direct importance sampling to be valid and practical:
- Support condition: \(q(\mathbf{x}) > 0\) whenever \(\pi(\mathbf{x}) > 0\)
- Normalization: Both \(\pi(\mathbf{x})\) and \(q(\mathbf{x})\) must be properly normalized probability distributions
- Evaluability: We must be able to evaluate \(\pi(\mathbf{x})\) and \(q(\mathbf{x})\) for any \(\mathbf{x}\)
- Sampling: We must be able to draw samples from \(q(\mathbf{x})\) efficiently
Self-Normalized Importance Sampling
Direct importance sampling, as developed above, requires evaluating the normalized target density \(\pi(\mathbf{x})\)
pointwise. In many real-world settings—especially in Bayesian inference and energy-based modeling—this is precisely
what we cannot do. The next method removes this requirement by working with unnormalized densities throughout.
In many practical applications, we can only evaluate the unnormalized target distribution:
\[
\tilde{\gamma}(\mathbf{x}) = Z \pi(\mathbf{x})
\]
where the normalization constant
\[
Z = \int \tilde{\gamma}(\mathbf{x})d\mathbf{x}
\]
is unknown or intractable. This situation is ubiquitous in Bayesian inference (where \(Z\) is the marginal
likelihood), energy-based models (where \(Z\) is the partition function), and many other machine learning applications.
Self-normalized importance sampling (SNIS) addresses this by working with unnormalized weights
and normalizing them within the estimator itself. The key identity rewrites the expectation as a ratio whose
numerator and denominator can both be approximated by Monte Carlo:
\[
\begin{align*}
\mathbb{E}_\pi[\varphi(\mathbf{x})] &= \int \varphi(\mathbf{x}) \, \pi(\mathbf{x}) \, d\mathbf{x} \\\\
&= \frac{\int \varphi(\mathbf{x}) \, \tilde{\gamma}(\mathbf{x}) \, d\mathbf{x}}
{\int \tilde{\gamma}(\mathbf{x}) \, d\mathbf{x}} \\\\
&= \frac{\mathbb{E}_q\!\left[\varphi(\mathbf{x}) \, \tilde{\gamma}(\mathbf{x})/q(\mathbf{x})\right]}
{\mathbb{E}_q\!\left[\tilde{\gamma}(\mathbf{x})/q(\mathbf{x})\right]}.
\end{align*}
\]
The second equality uses \(\tilde{\gamma} = Z\pi\): the factor \(Z\) appears in both the numerator and the
denominator (since \(\int \tilde{\gamma} \, d\mathbf{x} = Z\)) and therefore cancels. The third equality applies the
change-of-measure identity to both integrals, requiring the support condition
\(q(\mathbf{x}) > 0\) whenever \(\tilde{\gamma}(\mathbf{x}) > 0\).
Sampling \(\mathbf{x}_n \sim q(\mathbf{x})\) for \(n = 1, \ldots, N_s\), we redefine the importance weights using the unnormalized target:
\[
\tilde{w}_n := \frac{\tilde{\gamma}(\mathbf{x}_n)}{q(\mathbf{x}_n)}.
\]
These weights differ from the importance weights defined earlier by a factor of \(Z\);
the SNIS estimator below is invariant under this rescaling because the factor cancels between numerator and denominator.
Definition: Self-Normalized Importance Sampling
The self-normalized importance sampling (SNIS) estimator is
\[
\hat{I}_{\text{SNIS}}
:= \frac{\sum_{n=1}^{N_s} \tilde{w}_n \, \varphi(\mathbf{x}_n)}{\sum_{n=1}^{N_s} \tilde{w}_n}
= \sum_{n=1}^{N_s} W_n \, \varphi(\mathbf{x}_n),
\]
where \(\tilde{w}_n = \tilde{\gamma}(\mathbf{x}_n)/q(\mathbf{x}_n)\) are the unnormalized importance weights and
\(W_n := \tilde{w}_n / \sum_{n'=1}^{N_s} \tilde{w}_{n'}\) are the normalized weights.
This estimator is biased for finite \(N_s\) (the expectation of a ratio is not generally the
ratio of expectations), but it is consistent:
Theorem: Consistency of the SNIS Estimator
Assume the support condition \(q(\mathbf{x}) > 0\) whenever \(\tilde{\gamma}(\mathbf{x}) > 0\),
\(Z := \int \tilde{\gamma}(\mathbf{x}) \, d\mathbf{x} < \infty\), and
\(\mathbb{E}_q[|w(\mathbf{X}) \varphi(\mathbf{X})|] < \infty\), where \(w(\mathbf{x}) := \tilde{\gamma}(\mathbf{x})/q(\mathbf{x})\). Then
\[
\hat{I}_{\text{SNIS}} \xrightarrow{a.s.} \mathbb{E}_\pi[\varphi(\mathbf{X})] = I \quad \text{as } N_s \to \infty.
\]
Proof.
Write \(\hat{I}_{\text{SNIS}} = A_{N_s} / B_{N_s}\), where
\[
A_{N_s} := \frac{1}{N_s}\sum_{n=1}^{N_s} \tilde{w}_n \, \varphi(\mathbf{x}_n), \qquad
B_{N_s} := \frac{1}{N_s}\sum_{n=1}^{N_s} \tilde{w}_n.
\]
(Multiplying numerator and denominator of \(\hat{I}_{\text{SNIS}}\) by \(1/N_s\) preserves the ratio.)
Since the \(\mathbf{x}_n\) are i.i.d.\ under \(q\), the
Strong Law of Large Numbers gives
\[
A_{N_s} \xrightarrow{a.s.} \mathbb{E}_q\!\left[w(\mathbf{X}) \, \varphi(\mathbf{X})\right], \qquad
B_{N_s} \xrightarrow{a.s.} \mathbb{E}_q\!\left[w(\mathbf{X})\right].
\]
Both expectations evaluate explicitly via change of measure:
\[
\mathbb{E}_q[w(\mathbf{X})] = \int \tilde{\gamma}(\mathbf{x}) \, d\mathbf{x} = Z, \qquad
\mathbb{E}_q[w(\mathbf{X}) \, \varphi(\mathbf{X})] = \int \varphi(\mathbf{x}) \, \tilde{\gamma}(\mathbf{x}) \, d\mathbf{x} = Z \cdot I,
\]
where the second uses \(\tilde{\gamma} = Z\pi\) and \(I = \int \varphi \, \pi \, d\mathbf{x}\). Since both convergences hold a.s.\ on the same
probability space, the joint convergence \((A_{N_s}, B_{N_s}) \xrightarrow{a.s.} (Z I, Z)\) follows automatically. With \(Z > 0\), the
function \((a, b) \mapsto a/b\) is continuous at \((Z I, Z)\). Applying the
continuous mapping theorem,
\[
\hat{I}_{\text{SNIS}} = \frac{A_{N_s}}{B_{N_s}} \xrightarrow{a.s.} \frac{Z \cdot I}{Z} = I.
\]
The estimator converges almost surely
to the true expectation as \(N_s \to \infty\); the bias, finite at every fixed \(N_s\), vanishes in the limit.
Despite the bias, SNIS is the standard approach in practice for several important reasons.
Most importantly, it only requires evaluating unnormalized densities \(\tilde{\gamma}(\mathbf{x})\)
and \(q(\mathbf{x})\), making it applicable when the normalization constant \(Z\) is intractable—a
common situation in Bayesian inference, energy-based models, and many machine learning applications.
While SNIS can have a positive effect on the dispersion (variance) of the estimator in certain cases,
it does not universally reduce variance compared to direct importance sampling. The key advantage is
practical: SNIS enables estimation when direct importance sampling is impossible due to unknown
normalization constants.
Annealed Importance Sampling (AIS)
Self-normalized importance sampling resolves the difficulty of unknown normalization, but it does not address
a different failure mode: when the target distribution and proposal distribution have substantial distributional
mismatch, importance sampling—direct or self-normalized—fails due to extremely high variance in the importance
weights. Most samples from the proposal will have near-zero weight under the target, while a few rare samples
will have enormous weights—a phenomenon known as weight degeneracy. This leads to a low effective
sample size, meaning that despite drawing many samples, few actually contribute meaningfully to the estimate,
making the results unreliable.
Annealed importance sampling (AIS) addresses this fundamental limitation by introducing a sequence
of intermediate distributions that smoothly bridge the gap between the proposal and target distributions,
dramatically reducing variance through carefully constructed importance weights.
Suppose we wish to sample from a complicated target distribution:
\[
p_0(\mathbf{x}) \propto f_0(\mathbf{x}),
\]
where sampling is difficult because \(p_0\) may be high-dimensional and/or multimodal.
Here \(f_0\) is the unnormalized target (the same role played by \(\tilde{\gamma}\) in the previous discussion);
we adopt the AIS-specific notation \(f_0, f_n\) to emphasize the indexed family of intermediate distributions introduced below.
Suppose also that we have a proposal distribution from which we can easily sample (e.g., the prior in a Bayesian setting):
\[
p_n(\mathbf{x}) \propto f_n(\mathbf{x}).
\]
The key idea of AIS is to bridge the gap between proposal and target by constructing a sequence of \(n+1\) intermediate
distributions \(p_j \propto f_j\), \(j = 0, 1, \ldots, n\), each closer to its neighbors than \(p_0\) is to \(p_n\):
\[
f_j(\mathbf{x}) = f_0(\mathbf{x})^{\beta_j} f_n(\mathbf{x})^{1 - \beta_j},
\]
where \(1 = \beta_0 > \beta_1 > \cdots > \beta_n = 0\) defines an annealing schedule.
At the endpoints \(\beta_0 = 1\) gives \(f_0\) (the target) and \(\beta_n = 0\) gives \(f_n\) (the proposal);
intermediate values produce geometric mixtures.
Each \(\beta_j\) acts as an inverse temperature in the statistical-physics sense: small \(\beta_j\) corresponds to
"high temperature" (closer to the easy proposal), large \(\beta_j\) to "low temperature" (closer to the complicated target).
The AIS sampling procedure traverses these distributions in reverse, starting from \(p_n\) and ending at \(p_0\).
Typical schedules include:
- Linear: \(\beta_j = \frac{n - j}{n}\)
- Geometric: \(\beta_j = \left( \frac{n - j}{n} \right)^{\alpha}\), for some \(\alpha > 0\)
- Adaptive: chosen to maintain approximately constant effective sample size (ESS) between steps
To sample from each intermediate distribution \(p_j\), we use a Markov chain transition kernel:
\[
T_j(\mathbf{x}, \mathbf{x}') = p_j(\mathbf{x}' \mid \mathbf{x})
\]
which leaves \(p_j\) invariant, i.e.,
\[
\int p_j(\mathbf{x})T_j(\mathbf{x}, \mathbf{x}')d\mathbf{x} = p_j(\mathbf{x}').
\]
Common choices for \(T_j\) include Metropolis-Hastings or Hamiltonian Monte Carlo,
applied for a few iterations to adapt samples toward \(p_j\).
The AIS procedure proceeds as follows:
- Sample \(\mathbf{v}_n \sim p_n\) from the proposal distribution.
- Apply successive Markov transitions to obtain \(\mathbf{v}_0\):
\[
\mathbf{v}_{n-1} \sim T_{n-1}(\mathbf{v}_n, \cdot), \quad
\mathbf{v}_{n-2} \sim T_{n-2}(\mathbf{v}_{n-1}, \cdot), \; \ldots, \;
\mathbf{v}_0 \sim T_0(\mathbf{v}_1, \cdot).
\]
Each AIS run yields a trajectory
\((\mathbf{v}_n, \mathbf{v}_{n-1}, \ldots, \mathbf{v}_0)\)
with an associated importance weight:
\[
w =
\frac{f_{n-1}(\mathbf{v}_{n-1})}{f_n(\mathbf{v}_{n-1})}
\frac{f_{n-2}(\mathbf{v}_{n-2})}{f_{n-1}(\mathbf{v}_{n-2})}
\cdots
\frac{f_1(\mathbf{v}_1)}{f_2(\mathbf{v}_1)}
\frac{f_0(\mathbf{v}_0)}{f_1(\mathbf{v}_0)}. \tag{1}
\]
Each ratio is typically close to 1, which keeps the weights stable and prevents variance explosion.
Proof:
The goal is to show that the AIS weight in Equation (1) is a valid importance weight for some target density on the
extended state space \((\mathbf{v}_0, \mathbf{v}_1, \ldots, \mathbf{v}_n)\). The key technical device is to
define an extended target distribution on this product space whose marginal on \(\mathbf{v}_0\) is exactly
\(p_0\), and a corresponding extended proposal whose ratio against the target reproduces the AIS weight.
Consider the following unnormalized density on the extended state space:
\[
\begin{align*}
p(\mathbf{v}) \propto \tilde{f}(\mathbf{v})
&= f_0(\mathbf{v}_0) \, \tilde{T}_0(\mathbf{v}_0, \mathbf{v}_1) \, \tilde{T}_1(\mathbf{v}_1, \mathbf{v}_2) \cdots \tilde{T}_{n-1}(\mathbf{v}_{n-1}, \mathbf{v}_n) \\\\
&\propto p(\mathbf{v}_0) \, p(\mathbf{v}_1 \mid \mathbf{v}_0) \cdots p(\mathbf{v}_n \mid \mathbf{v}_{n-1}),
\end{align*}
\]
where \(\tilde{T}_j\) is the reversal of \(T_j\) with respect to \(p_j\), defined by
\[
\begin{align*}
\tilde{T}_j(\mathbf{v}, \mathbf{v}') &= T_j(\mathbf{v}', \mathbf{v}) \, p_j(\mathbf{v}') \, / \, p_j(\mathbf{v}) \\\\
&= T_j(\mathbf{v}', \mathbf{v}) \, f_j(\mathbf{v}') \, / \, f_j(\mathbf{v}).
\end{align*}
\]
The second equality holds because the normalization constant \(Z_j\) of \(p_j = f_j/Z_j\) cancels in the ratio.
Note that we use a new symbol \(\tilde{f}(\mathbf{v})\) for the unnormalized extended density to avoid clashing with the target
function \(\varphi(\mathbf{x})\) introduced in the previous sections.
The invariance of \(p_j\) with respect to \(T_j\) ensures that each \(\tilde{T}_j\) is a valid Markov transition kernel.
Explicitly,
\[
\int \tilde{T}_j(\mathbf{v}, \mathbf{v}') \, d\mathbf{v}'
= \frac{1}{f_j(\mathbf{v})} \int T_j(\mathbf{v}', \mathbf{v}) \, f_j(\mathbf{v}') \, d\mathbf{v}'
= \frac{f_j(\mathbf{v})}{f_j(\mathbf{v})} = 1,
\]
where the second equality uses the invariance condition
\(\int T_j(\mathbf{v}', \mathbf{v}) \, p_j(\mathbf{v}') \, d\mathbf{v}' = p_j(\mathbf{v})\),
rescaled by \(Z_j\) to remove normalization. To see that the \(\mathbf{v}_0\)-marginal of \(p(\mathbf{v})\) is \(p_0\),
we integrate out \(\mathbf{v}_1, \ldots, \mathbf{v}_n\):
\[
\int \tilde{f}(\mathbf{v}) \, d\mathbf{v}_1 \cdots d\mathbf{v}_n = f_0(\mathbf{v}_0),
\]
where successive applications of \(\int \tilde{T}_j(\cdot, \mathbf{v}_{j+1}) \, d\mathbf{v}_{j+1} = 1\) collapse the chain from the right.
Hence sampling from \(p(\mathbf{v})\) yields, as a byproduct, samples from \(p_0(\mathbf{x})\) via \(\mathbf{x} := \mathbf{v}_0\).
The AIS procedure samples on this extended state space using a forward chain starting from \(p_n\). This corresponds to
the proposal:
\[
\begin{align*}
q(\mathbf{v}) \propto g(\mathbf{v}) &= f_n(\mathbf{v}_n) \, T_{n-1}(\mathbf{v}_n, \mathbf{v}_{n-1}) \cdots T_1(\mathbf{v}_2, \mathbf{v}_1) \, T_0(\mathbf{v}_1, \mathbf{v}_0) \\\\
&\propto p(\mathbf{v}_n) \, p(\mathbf{v}_{n-1} \mid \mathbf{v}_n) \cdots p(\mathbf{v}_0 \mid \mathbf{v}_1).
\end{align*}
\]
The importance weight on the extended space is then
\[
w = \frac{\tilde{f}(\mathbf{v}_0, \ldots, \mathbf{v}_n)}{g(\mathbf{v}_0, \ldots, \mathbf{v}_n)}.
\]
Substituting the definition of \(\tilde{T}_j\) into \(\tilde{f}\) and dividing by \(g\) produces a telescoping cancellation
of the forward kernels \(T_j\), leaving exactly the product of ratios \(f_{j-1}(\mathbf{v}_{j-1}) / f_j(\mathbf{v}_{j-1})\)
appearing in Equation (1). The detailed cancellation is carried out in Neal (2001), Section 3.
Since the marginal of the extended target on \(\mathbf{v}_0\) is \(p_0\), expectations under \(p_0\) can be approximated
by self-normalized importance sampling on the extended space using these weights.
A key application of AIS is to estimate the ratio of partition functions. Since the marginal claim above gives
\(\int \tilde{f}(\mathbf{v}) \, d\mathbf{v}_1 \cdots d\mathbf{v}_n = f_0(\mathbf{v}_0)\), integrating once more yields
\[
Z_0 = \int f_0(\mathbf{x}) \, d\mathbf{x} = \int \tilde{f}(\mathbf{v}) \, d\mathbf{v},
\]
and an analogous identity for the proposal side gives
\[
Z_n = \int f_n(\mathbf{x}) \, d\mathbf{x} = \int g(\mathbf{v}) \, d\mathbf{v}.
\]
Therefore:
\[
\begin{align*}
\frac{Z_0}{Z_n} &= \frac{\int \tilde{f}(\mathbf{v}) \, d\mathbf{v}}{\int g(\mathbf{v}) \, d\mathbf{v}} \\\\
&= \frac{\int \frac{\tilde{f}(\mathbf{v})}{g(\mathbf{v})} g(\mathbf{v}) \, d\mathbf{v}}{\int g(\mathbf{v}) \, d\mathbf{v}} \\\\
&= \mathbb{E}_g\!\left[\frac{\tilde{f}(\mathbf{v})}{g(\mathbf{v})}\right] \\\\
&\approx \frac{1}{S}\sum_{s=1}^S w_s,
\end{align*}
\]
where \(w_s = \tilde{f}(\mathbf{v}_s) / g(\mathbf{v}_s)\) is the AIS weight from Equation (1) for the \(s\)-th run,
and \(S\) is the number of independent AIS runs. By abuse of notation, \(\mathbb{E}_g\) here denotes
the expectation under the normalized distribution \(q \propto g\); the normalization constant \(Z_n\) cancels between
numerator and denominator in the ratio above, so the formula is unaffected by which version of \(g\) is used.
A common application is Bayesian model evidence estimation. In this use case the AIS roles are
arranged so that \(f_0(\boldsymbol{\theta}) = p(\boldsymbol{\theta})\) is the prior
(with known normalization \(Z_0 = 1\)) and \(f_n(\boldsymbol{\theta}) = p(\boldsymbol{\theta}) \, p(\mathcal{D} \mid \boldsymbol{\theta})\)
is the unnormalized posterior. The above ratio formula then yields
\[
Z_n = p(\mathcal{D}) = \frac{Z_n}{Z_0} \approx \frac{1}{S}\sum_{s=1}^S w_s,
\]
so AIS provides a direct estimator of the marginal likelihood. Note that the assignment of which distribution is
\(f_0\) and which is \(f_n\) depends on the goal: for posterior sampling the easier prior plays the role of the proposal \(p_n\);
for evidence estimation, the prior is taken as \(p_0\) instead so that \(Z_n\) becomes the quantity of interest.
The variance of AIS can be reduced by increasing the number of intermediate distributions \(n\). In practice,
\(n = 100\) to \(n = 10000\) steps is common. The optimal choice balances computational cost (proportional to \(n\))
against variance reduction. For example, TensorFlow Probability uses 1000 steps as default.
When implementing AIS in practice, several considerations can dramatically impact performance:
1. Choosing the Annealing Schedule
The choice of \(\{\beta_j\}\) is crucial. Recent research suggests:
- Moment matching: Choose \(\beta_j\) such that the second moments of successive distributions are approximately equal
- Constant ESS: Adapt \(\beta_j\) to maintain ESS \(\approx 0.8 \times N_s\) between steps
- Sigmoid schedules: \(\beta_j = \sigma(a(2j/n - 1))\) where \(\sigma\) is the sigmoid function and \(a\) controls steepness
2. Transition Kernels
The choice and tuning of \(T_j\) significantly affects efficiency:
- Hamiltonian Monte Carlo (HMC): Often most effective for continuous distributions, especially in high dimensions
- Number of MCMC steps: Usually 1-10 steps per temperature is sufficient; more steps increase cost with diminishing returns
- Step size adaptation: Crucial for HMC; can be adapted based on acceptance rate at each temperature
3. Modern Applications in Deep Learning
AIS has become particularly important in deep learning for:
- Evaluating generative models: Estimating log-likelihoods of VAEs, normalizing flows, and energy-based models
- Bayesian deep learning: Computing model evidence for neural network architectures
- Differentiable AIS: Recent work enables gradient-based optimization through AIS estimates using techniques like the reparameterization trick
Diagnostics and Debugging
Importance sampling can fail silently, producing estimates with high bias or variance.
Here are essential diagnostics to monitor:
1. Weight Diagnostics
The following metrics monitor the health of the importance weights at a glance. The Effective Sample Size
quantifies overall weight degeneracy, while the maximum-weight ratio, coefficient of variation, and entropy
capture complementary aspects of how concentrated or dispersed the weight distribution is.
Key Diagnostic Metrics
- Effective Sample Size (ESS): Should be \(> N_s/2\) for reliable estimates (rule of thumb)
- Maximum weight ratio: \(\max_i W_i\) should be small (typically \(< 0.1\) is recommended) to avoid single-sample domination
- Coefficient of variation: \(\operatorname{CV} = \frac{\operatorname{std}(\tilde{w})}{\operatorname{mean}(\tilde{w})}\) should be \(< 1\) for stable estimation
- Weight entropy: \(H(W) = -\sum_i W_i \log W_i\) measures weight uniformity (higher is better; here \(\log\) is the natural logarithm, so \(H\) is measured in nats)
2. Common Failure Modes and Solutions
| Symptom |
Likely Cause |
Solution |
| \(\operatorname{ESS} \ll N_s\) |
Poor proposal choice |
Use heavier-tailed proposal or AIS |
| High variance across runs |
Weight degeneracy |
Increase samples or improve proposal |
| Biased estimates |
Violated support condition |
Ensure \(q(\mathbf{x}) > 0\) wherever \(\pi(\mathbf{x}) > 0\) |
| Numerical instability |
Extreme weight ratios |
Work in log-space; use log-sum-exp trick |
3. Validation Techniques
To verify your importance sampling implementation:
- Known ground truth: Test on problems with analytical solutions
- Convergence check: Estimate should stabilize as \(N_s\) increases
- Multiple proposals: Different valid proposals should give consistent estimates
- Bootstrap confidence intervals: Resample weights to assess uncertainty
Importance sampling, SNIS, and AIS complete our treatment of Monte Carlo methods for
Bayesian computation. Throughout the Monte Carlo and importance sampling material, we have worked with parametric models
where uncertainty is over a finite-dimensional parameter \(\theta\). In
Gaussian Processes, we take
a fundamentally different perspective: instead of placing priors over parameters, we place
priors directly over functions, leading to nonparametric Bayesian models with
built-in uncertainty quantification.