Why Variational Inference?
Bayesian inference, in its idealised form, is a single-step procedure: write down a prior \(p(\boldsymbol{z})\)
over latent variables and a likelihood \(p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z})\) generating observations,
so that the joint factorises as
\[
p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}) = p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z}) p(\boldsymbol{z})
\]
(we take the prior to be independent of \(\boldsymbol{\theta}\), aligning the notation with the variational inference literature), and
read off the posterior
\[
p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x})
\;=\; \frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})}
{p_{\boldsymbol{\theta}}(\boldsymbol{x})},
\qquad
p_{\boldsymbol{\theta}}(\boldsymbol{x}) \;=\; \int p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}) \, d\boldsymbol{z},
\]
by Bayes' rule. We use \(\int \cdot \, d\boldsymbol{z}\) throughout to denote integration against the appropriate reference measure on the
latent space (Lebesgue measure for continuous \(\boldsymbol{z}\), counting measure for discrete \(\boldsymbol{z}\), product for mixed cases), with
all densities below understood as densities with respect to this reference measure. The difficulty hidden in the formula
is the evidence \(p_{\boldsymbol{\theta}}(\boldsymbol{x})\): for conjugate exponential-family models the integral admits a closed form,
but the moment one steps outside this regime — neural-network likelihoods, non-conjugate priors, discrete-continuous interactions — there is no analytic solution,
and numerical quadrature scales prohibitively in the dimension of \(\boldsymbol{z}\). The posterior itself is well-defined as a
conditional expectation;
what variational inference addresses is the intractability of computing with it.
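The boundary between the tractable and intractable regimes can be made concrete with a sketch. The model below is a hypothetical one-dimensional conjugate example, \(z \sim \mathcal{N}(0,1)\), \(x \mid z \sim \mathcal{N}(z,1)\), chosen purely for illustration because its evidence \(p(x) = \mathcal{N}(x \mid 0, 2)\) is available in closed form; quadrature succeeds here, and the closing comment records why the same grid is hopeless once \(\boldsymbol{z}\) is high-dimensional.

```python
import numpy as np

# Hypothetical 1-D conjugate model (illustration only):
#   z ~ N(0, 1),   x | z ~ N(z, 1),   so p(x) = N(x | 0, 2) in closed form.
def log_joint(x, z):
    log_prior = -0.5 * (np.log(2 * np.pi) + z**2)
    log_lik = -0.5 * (np.log(2 * np.pi) + (x - z)**2)
    return log_prior + log_lik

x = 1.3
grid = np.linspace(-10.0, 10.0, 2001)                  # 2001 nodes, spacing 0.01
dz = grid[1] - grid[0]
evidence = np.sum(np.exp(log_joint(x, grid))) * dz     # Riemann-sum quadrature
exact = np.exp(-0.5 * (np.log(4 * np.pi) + x**2 / 2))  # N(x | 0, 2)

assert abs(evidence - exact) < 1e-10
# The same resolution in d dimensions needs 2001**d nodes: quadrature is
# exponential in dim(z), which is the intractability described above.
```

In one dimension the Gaussian tails decay fast enough that a modest grid already matches the closed form to high precision; the obstruction is purely the exponential growth of the grid with the latent dimension.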
Variational inference (VI) replaces the intractable computation with a tractable one of a different shape.
We restrict attention to a family \(\mathcal{Q}\) of distributions over \(\boldsymbol{z}\) that we have decided in advance
we can compute with — typically a parametric family \(\{q_{\boldsymbol{\psi}}(\boldsymbol{z}) : \boldsymbol{\psi} \in \Psi\}\)
indexed by variational parameters \(\boldsymbol{\psi}\) — and we look inside this family for the member closest
to the true posterior in the
Kullback-Leibler divergence:
\[
\boldsymbol{\psi}^* \;=\; \arg\min_{\boldsymbol{\psi} \in \Psi} \;
D_{\mathrm{KL}}\!\left( q_{\boldsymbol{\psi}}(\boldsymbol{z}) \,\|\, p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x}) \right).
\]
The geometric picture is clean: the family \(\mathcal{Q}\) is a set of distributions, the true posterior is a single distribution generally
lying outside \(\mathcal{Q}\), and \(\boldsymbol{\psi}^*\) selects the member of \(\mathcal{Q}\) closest to the posterior in the sense of KL
— bearing in mind that the KL divergence is not symmetric, so this notion of "closest" depends on the order of the arguments, and the choice
\(D_{\mathrm{KL}}(q_{\boldsymbol{\psi}} \,\|\, p_{\boldsymbol{\theta}}(\cdot \mid \boldsymbol{x}))\) above (rather than the reverse) is itself
a modelling decision with consequences we return to in the closing section.
The substitution carries an immediate complication: the KL divergence
\[
D_{\mathrm{KL}}(q_{\boldsymbol{\psi}} \,\|\, p_{\boldsymbol{\theta}}(\cdot \mid \boldsymbol{x}))
\]
is itself defined in terms of the intractable posterior, so it cannot be evaluated directly.
The headline payoff of the next section is that an algebraic rearrangement converts KL minimisation
into an equivalent objective — the evidence lower bound (ELBO) — that depends only on the joint
\(p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})\) and on \(q_{\boldsymbol{\psi}}(\boldsymbol{z})\), both of
which are tractable by construction. Maximising the ELBO is exactly the same problem as minimising the KL, but stated in a
form one can actually compute. We adopt throughout the absolute-continuity hypothesis
\(q_{\boldsymbol{\psi}} \ll p_{\boldsymbol{\theta}}(\cdot \mid \boldsymbol{x})\), under which the KL is well-defined
as an element of \([0, \infty]\), together with the integrability hypothesis that this divergence is finite, both holding
for the standard variational families used on this page.
With the ELBO substitution in hand, what was an integration problem has become an optimisation problem over \(\boldsymbol{\psi}\).
This trade is favourable because optimisation in high dimensions has a far more developed machinery than integration:
stochastic gradient methods, automatic differentiation, and the entire toolkit of modern deep learning are available to attack it.
Variational inference trades the asymptotic exactness of MCMC-style sampling for tractable, fast, gradient-based optimisation in a
chosen surrogate family, at the price that the result is only as good as the family allows.
The reader familiar with the variational autoencoder literature will recognise this framework as the foundation of the VAE: a VAE
is the amortised specialisation of the variational programme below, in which the variational parameters
\(\boldsymbol{\psi}_n = f_{\boldsymbol{\phi}}(\boldsymbol{x}_n)\) are produced by a neural network (the inference network) reading
\(\boldsymbol{x}_n\). We treat the non-amortised case throughout; the amortised reduction is developed in
Variational Autoencoders
and is a specialisation, not a parallel theory.
The notation throughout is consistent with the earlier chapters on measure-theoretic probability and with the advanced
machine learning literature. The probability space is \((\Omega, \mathcal{F}, \mathbb{P})\), with \(\boldsymbol{x}\) and \(\boldsymbol{z}\)
measurable functions on \(\Omega\) taking values in the observation and latent spaces respectively. Lowercase letters
\(p_{\boldsymbol{\theta}}, q_{\boldsymbol{\psi}}\) denote densities with respect to the appropriate reference measure on the latent and
observation spaces, in line with the convention adopted at the start of this section; we sometimes refer to these informally as
"distributions" when the underlying reference measure is clear from context, following the standard convention of the variational-inference literature.
The joint density \(p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})\) is parameterised by the model parameters
\(\boldsymbol{\theta}\), regarded as fixed throughout the inference problem (their estimation is a separate task, briefly discussed in the
closing section). The variational density \(q_{\boldsymbol{\psi}}(\boldsymbol{z})\) is parameterised by the variational
parameters \(\boldsymbol{\psi}\), and the variational family \(\mathcal{Q} = \{q_{\boldsymbol{\psi}} : \boldsymbol{\psi} \in \Psi\}\)
is the set of all such densities as \(\boldsymbol{\psi}\) ranges over its parameter space \(\Psi\).
The ELBO is denoted \(\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\psi} \mid \boldsymbol{x})\) throughout, with the convention that
\(\mathcal{L}\) refers to the ELBO itself (which is maximised), not its negation (which is minimised); this matches the \(L\)-flavoured convention
used when the ELBO is the object of interest, and inverts the loss-flavoured convention used when the variational free energy is the object of
interest. The two conventions differ only by a sign and yield identical optimisation problems; we use the maximisation convention throughout. Where
convenient we also write \(q_{\boldsymbol{\psi}} \ll p_{\boldsymbol{\theta}}(\cdot \mid \boldsymbol{x})\)
and \(D_{\mathrm{KL}}(q_{\boldsymbol{\psi}} \,\|\, p_{\boldsymbol{\theta}}(\cdot \mid \boldsymbol{x}))\)
with the densities themselves as arguments, in line with the standard abuse of notation of the variational-inference literature; the reader should
read these as statements about the underlying induced probability measures \(Q_{\boldsymbol{\psi}}\) and \(P_{\boldsymbol{\theta}}(\cdot \mid \boldsymbol{x})\)
on the latent space.
The ELBO and the KL Decomposition
We define the evidence lower bound, prove that it is indeed a lower bound on the log-evidence by a direct application of
Jensen's inequality, and then prove the central exact identity
\[
\log p_{\boldsymbol{\theta}}(\boldsymbol{x}) = \mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\psi} \mid \boldsymbol{x})
+ D_{\mathrm{KL}}(q_{\boldsymbol{\psi}} \,\|\, p_{\boldsymbol{\theta}}(\cdot \mid \boldsymbol{x}))
\]
that licenses the entire variational programme by certifying that maximising the ELBO with respect to \(\boldsymbol{\psi}\)
is exactly the same as minimising the KL divergence between \(q_{\boldsymbol{\psi}}\) and the true posterior. The Jensen bound
and the exact decomposition are two views of the same fact — one inequality, one equality — and we present them in this order
because the inequality is the standard entry point, while the equality, which absorbs the inequality as a corollary,
is the working tool in everything that follows.
The ELBO Definition
Definition: Evidence Lower Bound (ELBO)
Let \(p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})\) be a generative model with observations
\(\boldsymbol{x}\) and latent variables \(\boldsymbol{z}\), and let \(q_{\boldsymbol{\psi}}(\boldsymbol{z})\)
be a variational distribution from a chosen family \(\mathcal{Q}\). The evidence lower bound
(ELBO) of \(q_{\boldsymbol{\psi}}\) at observation \(\boldsymbol{x}\) is
\[
\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\psi} \mid \boldsymbol{x})
\;\;\triangleq\;\;
\mathbb{E}_{q_{\boldsymbol{\psi}}(\boldsymbol{z})}\!\left[
\log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})
\;-\; \log q_{\boldsymbol{\psi}}(\boldsymbol{z})
\right].
\]
Equivalently, separating the joint into likelihood and prior gives the
decomposition
\[
\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\psi} \mid \boldsymbol{x})
\;=\;
\underbrace{\mathbb{E}_{q_{\boldsymbol{\psi}}(\boldsymbol{z})}\!\left[
\log p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z})
\right]}_{\text{expected log-likelihood}}
\;-\;
\underbrace{D_{\mathrm{KL}}\!\left(
q_{\boldsymbol{\psi}}(\boldsymbol{z}) \,\big\|\, p(\boldsymbol{z})
\right)}_{\text{KL to prior}},
\]
which exhibits the ELBO as a balance between data fit and regularisation against the prior.
The two forms agree as elements of \([-\infty, +\infty]\) for any \(q_{\boldsymbol{\psi}}\);
the second form is well-defined precisely when \(q_{\boldsymbol{\psi}} \ll p\)
(the variational density is absolutely continuous with respect to the prior, so that the prior-KL term is a well-defined element of \([0, \infty]\)),
and we adopt this hypothesis, together with finiteness of that term, whenever the prior-KL form is invoked.
The two forms of the ELBO are equivalent and connected by the identity
\[
\log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}) = \log p_{\boldsymbol{\theta}}(\boldsymbol{x} \mid \boldsymbol{z}) + \log p(\boldsymbol{z})
\]
together with the definition of the KL divergence. The first form is what we use to prove the lower-bound property and the exact decomposition;
the second form is what drives variational autoencoders, in which the expected log-likelihood is the "reconstruction" term and the KL to the prior
is the "regularisation" term. We work with the first form throughout this section and return to the second when the
autoencoder bridge becomes relevant in the section on gradient computation.
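As a concrete check of the joint-form definition, the sketch below estimates the ELBO by Monte Carlo in a hypothetical conjugate toy model (\(z \sim \mathcal{N}(0,1)\), \(x \mid z \sim \mathcal{N}(z,1)\), Gaussian \(q\); all names are illustrative, not from the text). When \(q\) equals the exact posterior \(\mathcal{N}(x/2, 1/2)\), the integrand \(\log p(x,z) - \log q(z)\) is the constant \(\log p(x)\), so the estimator recovers the log-evidence with zero variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_joint(x, z):                                   # z ~ N(0,1), x|z ~ N(z,1)
    return (-0.5 * (np.log(2 * np.pi) + z**2)
            - 0.5 * (np.log(2 * np.pi) + (x - z)**2))

def log_q(z, mu, sigma):                               # q = N(mu, sigma^2)
    return -0.5 * (np.log(2 * np.pi * sigma**2) + (z - mu)**2 / sigma**2)

def elbo_mc(x, mu, sigma, n=200_000):
    z = rng.normal(mu, sigma, size=n)                  # expectation against q
    return np.mean(log_joint(x, z) - log_q(z, mu, sigma))

x = 1.3
log_evidence = -0.5 * (np.log(4 * np.pi) + x**2 / 2)   # log N(x | 0, 2)

# q = exact posterior N(x/2, 1/2): the integrand is the constant log p(x),
# so the Monte Carlo estimate is exact with zero variance.
assert abs(elbo_mc(x, x / 2, np.sqrt(0.5)) - log_evidence) < 1e-9
# Any other member of the family gives a strictly smaller ELBO.
assert elbo_mc(x, 0.0, 1.0) < log_evidence - 0.3
```

The zero-variance behaviour at the optimum is special to \(q\) equalling the posterior; away from it the estimator is noisy, which is why the gradient machinery of the later sections matters.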
Three remarks before proceeding. First, the ELBO depends on \(\boldsymbol{x}\) through the joint and on \(\boldsymbol{\psi}\)
through the variational distribution; it is a function of \(\boldsymbol{\theta}\) only insofar as the joint depends on the model
parameters. Throughout this section \(\boldsymbol{\theta}\) is fixed, and we occasionally drop it from the notation when the dependence
is not in play. Second, the expectation \(\mathbb{E}_{q_{\boldsymbol{\psi}}(\boldsymbol{z})}[\,\cdot\,]\) is against the variational distribution,
not against the intractable posterior — this is the entire computational point of the construction, since \(q_{\boldsymbol{\psi}}\)
has been chosen precisely so that such expectations can be computed (analytically for Gaussian \(q_{\boldsymbol{\psi}}\) with simple likelihoods,
by Monte Carlo otherwise).
Third, the integrand
\[
\log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}) - \log q_{\boldsymbol{\psi}}(\boldsymbol{z})
\]
is the logarithm of a likelihood ratio
\[
p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}) / q_{\boldsymbol{\psi}}(\boldsymbol{z}),
\]
which one should think of as the unnormalised "importance weight" of sampling
\(\boldsymbol{z}\) from \(q_{\boldsymbol{\psi}}\) when the target is the joint
\(p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})\); the connection to importance
sampling is more than notational, and tighter bounds based on it (such as IWAE)
sharpen the ELBO by averaging multiple importance weights before taking the logarithm.
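The importance-sampling reading can be made concrete. Averaging \(K\) weights inside the logarithm gives the IWAE-style bound, which sits between the ELBO (the \(K = 1\) case) and the log-evidence; the sketch below checks the ordering numerically in a hypothetical conjugate toy model (\(z \sim \mathcal{N}(0,1)\), \(x \mid z \sim \mathcal{N}(z,1)\)) where the log-evidence is known.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_joint(x, z):                                   # z ~ N(0,1), x|z ~ N(z,1)
    return (-0.5 * (np.log(2 * np.pi) + z**2)
            - 0.5 * (np.log(2 * np.pi) + (x - z)**2))

def log_q(z, mu, sigma):
    return -0.5 * (np.log(2 * np.pi * sigma**2) + (z - mu)**2 / sigma**2)

def iwae_bound(x, mu, sigma, K, n_batches=20_000):
    z = rng.normal(mu, sigma, size=(n_batches, K))
    log_w = log_joint(x, z) - log_q(z, mu, sigma)      # K importance weights
    # log of (1/K) * sum_k w_k, averaged over batches; K = 1 is the ELBO.
    return np.mean(np.logaddexp.reduce(log_w, axis=1) - np.log(K))

x = 1.3
log_evidence = -0.5 * (np.log(4 * np.pi) + x**2 / 2)   # log N(x | 0, 2)
b1 = iwae_bound(x, 0.0, 1.0, K=1)
b16 = iwae_bound(x, 0.0, 1.0, K=16)
assert b1 < b16 < log_evidence                         # tighter as K grows
```

The `logaddexp.reduce` call computes \(\log \sum_k e^{\log w_k}\) stably; averaging weights before the logarithm is exactly where the extra tightness over the ELBO comes from.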
The Jensen Lower Bound
The ELBO is not just any quantity associated with \(q_{\boldsymbol{\psi}}\): it is a lower bound on the log-evidence
\(\log p_{\boldsymbol{\theta}}(\boldsymbol{x})\), and this inequality holds for every \(q_{\boldsymbol{\psi}}\) in the
variational family with no further hypotheses beyond integrability of the integrand. The proof is a one-step application of
Jensen's inequality.
Theorem: ELBO is a Lower Bound on the Log-Evidence
Assume the integrability hypothesis that
\(\mathbb{E}_{q_{\boldsymbol{\psi}}}[|\log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})|]\)
and
\(\mathbb{E}_{q_{\boldsymbol{\psi}}}[|\log q_{\boldsymbol{\psi}}(\boldsymbol{z})|]\)
are finite, so that \(\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\psi} \mid \boldsymbol{x})\)
is well-defined. Then for every \(\boldsymbol{\psi} \in \Psi\) and every
\(\boldsymbol{x}\),
\[
\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\psi} \mid \boldsymbol{x})
\;\leq\;
\log p_{\boldsymbol{\theta}}(\boldsymbol{x}).
\]
No absolute-continuity hypothesis between
\(q_{\boldsymbol{\psi}}\) and the posterior is required for the inequality itself;
the Jensen-based proof works in full generality.
Proof.
Starting from the ELBO and writing the integrand as a logarithm of a ratio,
\[
\begin{align*}
\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\psi} \mid \boldsymbol{x})
&\;=\; \mathbb{E}_{q_{\boldsymbol{\psi}}(\boldsymbol{z})}\!\left[
\log \frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})}
{q_{\boldsymbol{\psi}}(\boldsymbol{z})}
\right] \\\\
&\;\leq\; \log \mathbb{E}_{q_{\boldsymbol{\psi}}(\boldsymbol{z})}\!\left[
\frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})}
{q_{\boldsymbol{\psi}}(\boldsymbol{z})}
\right] \\\\
&\;=\; \log \int q_{\boldsymbol{\psi}}(\boldsymbol{z})
\frac{p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})}
{q_{\boldsymbol{\psi}}(\boldsymbol{z})} \, d\boldsymbol{z} \\\\
&\;=\; \log \int p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}) \, d\boldsymbol{z}
\;=\; \log p_{\boldsymbol{\theta}}(\boldsymbol{x}).
\end{align*}
\]
The inequality is
Jensen's inequality applied
to the concave function \(\log\), with the expectation taken against \(q_{\boldsymbol{\psi}}\).
The cancellation in the third line uses that
\[
q_{\boldsymbol{\psi}}(\boldsymbol{z}) \cdot p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}) / q_{\boldsymbol{\psi}}(\boldsymbol{z})
= p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})
\]
wherever \(q_{\boldsymbol{\psi}}(\boldsymbol{z}) > 0\); the integral on the third line is, strictly speaking, taken over
\(\{q_{\boldsymbol{\psi}} > 0\}\), and its relation to \(\int p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}) \, d\boldsymbol{z}\)
is one of \(\leq\): if \(q_{\boldsymbol{\psi}}\) covers the support of the joint
\(p_{\boldsymbol{\theta}}(\boldsymbol{x}, \cdot)\), the two integrals coincide and the
last equality holds; if not, the integral over \(\{q_{\boldsymbol{\psi}} > 0\}\) is strictly smaller, and the inequality
\(\mathcal{L} \leq \log p_{\boldsymbol{\theta}}(\boldsymbol{x})\) is preserved (in fact strengthened).
Either way, the conclusion follows. The variational families considered on this page satisfy the common-support condition in their
standard configurations, so equality in the cancellation step is the typical case in practice.
The Jensen step deserves a moment of reflection. The inequality \(\mathbb{E}[\log Y] \leq \log \mathbb{E}[Y]\) is
a one-direction relation, and Jensen tells us nothing in general about how tight it is. The bound
\(\mathcal{L} \leq \log p_{\boldsymbol{\theta}}(\boldsymbol{x})\) it produces is sharp in a precise sense: equality
holds in Jensen if and only if the ratio
\[
p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}) / q_{\boldsymbol{\psi}}(\boldsymbol{z})
\]
is \(q_{\boldsymbol{\psi}}\)-almost-everywhere constant in \(\boldsymbol{z}\), which
occurs precisely when
\[
q_{\boldsymbol{\psi}}(\boldsymbol{z}) = p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x})
\]
(in which case the ratio equals \(p_{\boldsymbol{\theta}}(\boldsymbol{x})\) for every \(\boldsymbol{z}\) in the common support).
The Jensen bound therefore recovers the log-evidence exactly when the variational distribution equals the posterior — a
condition the variational family \(\mathcal{Q}\) may or may not be rich enough to satisfy. Quantifying the slack when
\(q_{\boldsymbol{\psi}} \neq p_{\boldsymbol{\theta}}(\cdot \mid \boldsymbol{x})\) — saying exactly how much the bound loses —
is the content of the next theorem.
The Exact KL Decomposition
The Jensen bound is suggestive but, on its own, gives no information about the gap between the ELBO and the log-evidence.
The exact decomposition fills this in: the gap is precisely the KL divergence from the variational distribution to the true posterior.
This is the central identity of variational inference.
Theorem: ELBO–KL Decomposition
Assume the absolute-continuity hypothesis
\(q_{\boldsymbol{\psi}} \ll p_{\boldsymbol{\theta}}(\cdot \mid \boldsymbol{x})\),
together with the integrability hypothesis that
\(\mathbb{E}_{q_{\boldsymbol{\psi}}}[|\log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})|]\)
and
\(\mathbb{E}_{q_{\boldsymbol{\psi}}}[|\log q_{\boldsymbol{\psi}}(\boldsymbol{z})|]\)
are finite (whence the KL term on the right-hand side is finite). Then for every
\(\boldsymbol{\psi} \in \Psi\) and every \(\boldsymbol{x}\),
\[
\log p_{\boldsymbol{\theta}}(\boldsymbol{x})
\;=\;
\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\psi} \mid \boldsymbol{x})
\;+\;
D_{\mathrm{KL}}\!\left(
q_{\boldsymbol{\psi}}(\boldsymbol{z}) \,\big\|\, p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x})
\right).
\]
Proof.
Start from the definition of the KL divergence, expand the posterior using Bayes' rule
\[
p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x})
= p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}) / p_{\boldsymbol{\theta}}(\boldsymbol{x}),
\]
and separate the resulting expectation:
\[
\begin{align*}
D_{\mathrm{KL}}\!\left(
q_{\boldsymbol{\psi}} \,\big\|\, p_{\boldsymbol{\theta}}(\cdot \mid \boldsymbol{x})
\right)
&\;=\; \mathbb{E}_{q_{\boldsymbol{\psi}}(\boldsymbol{z})}\!\left[
\log \frac{q_{\boldsymbol{\psi}}(\boldsymbol{z})}
{p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x})}
\right] \\\\
&\;=\; \mathbb{E}_{q_{\boldsymbol{\psi}}(\boldsymbol{z})}\!\left[
\log q_{\boldsymbol{\psi}}(\boldsymbol{z})
\,-\, \log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})
\,+\, \log p_{\boldsymbol{\theta}}(\boldsymbol{x})
\right] \\\\
&\;=\; -\,\mathbb{E}_{q_{\boldsymbol{\psi}}(\boldsymbol{z})}\!\left[
\log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})
\,-\, \log q_{\boldsymbol{\psi}}(\boldsymbol{z})
\right]
\,+\, \log p_{\boldsymbol{\theta}}(\boldsymbol{x}) \\\\
&\;=\; -\,\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\psi} \mid \boldsymbol{x})
\,+\, \log p_{\boldsymbol{\theta}}(\boldsymbol{x}).
\end{align*}
\]
The third line uses the linearity of the expectation and the fact that \(\log p_{\boldsymbol{\theta}}(\boldsymbol{x})\)
does not depend on \(\boldsymbol{z}\), so its expectation against \(q_{\boldsymbol{\psi}}\) equals itself.
Rearranging gives the claimed identity.
A few words on the rigorous backing of this manipulation. The ratio
\(q_{\boldsymbol{\psi}}(\boldsymbol{z}) / p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x})\)
is, at the level of measures, the
Radon-Nikodym derivative
\(d q_{\boldsymbol{\psi}} / d p_{\boldsymbol{\theta}}(\cdot \mid \boldsymbol{x})\); the
absolute-continuity
hypothesis adopted at the start of this section guarantees its existence, and the integrability hypothesis
guarantees finiteness of the expectations involved. The remaining steps — splitting the logarithm by linearity
of expectation, pulling out the constant \(\log p_{\boldsymbol{\theta}}(\boldsymbol{x})\) — are standard
manipulations of Lebesgue integrals, justified term by term as soon as each integrand is \(q_{\boldsymbol{\psi}}\)-integrable.
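The identity can also be verified in closed form, with no Monte Carlo noise at all. The sketch below uses a hypothetical conjugate model (\(z \sim \mathcal{N}(0,1)\), \(x \mid z \sim \mathcal{N}(z,1)\), posterior \(\mathcal{N}(x/2, 1/2)\)) with Gaussian \(q = \mathcal{N}(\mu, s^2)\), where every term of the decomposition is a standard Gaussian expectation or Gaussian-to-Gaussian KL.

```python
import numpy as np

# Closed-form check of  log p(x) = ELBO + KL(q || posterior)  in a
# hypothetical conjugate model: z ~ N(0,1), x|z ~ N(z,1), q = N(mu, s2),
# posterior N(x/2, 1/2). Every term is a standard Gaussian formula.
def decomposition_terms(x, mu, s2):
    # ELBO, second form: E_q[log p(x|z)] - KL(q || prior)
    expected_loglik = -0.5 * (np.log(2 * np.pi) + (x - mu)**2 + s2)
    kl_prior = 0.5 * (mu**2 + s2 - 1.0 - np.log(s2))
    elbo = expected_loglik - kl_prior
    # KL between the Gaussians q = N(mu, s2) and posterior N(x/2, 1/2)
    kl_post = 0.5 * (np.log(0.5 / s2) + (s2 + (mu - x / 2)**2) / 0.5 - 1.0)
    log_px = -0.5 * (np.log(4 * np.pi) + x**2 / 2)     # log N(x | 0, 2)
    return elbo, kl_post, log_px

for mu, s2 in [(0.0, 1.0), (0.9, 0.3), (-2.0, 2.5)]:
    elbo, kl, log_px = decomposition_terms(1.3, mu, s2)
    assert kl >= 0.0 and abs(elbo + kl - log_px) < 1e-12
```

The identity holds to machine precision for every choice of \((\mu, s^2)\), exactly as the theorem asserts: the gap between the ELBO and the log-evidence is the posterior KL, no more and no less.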
Two Consequences
The decomposition has two consequences that drive everything that follows in the chapter.
The Jensen bound is recovered as a corollary, under the decomposition's hypotheses.
Under the absolute-continuity hypothesis \(q_{\boldsymbol{\psi}} \ll p_{\boldsymbol{\theta}}(\cdot \mid \boldsymbol{x})\) of the
decomposition theorem, the KL term is non-negative — a standard fact, proved in the entropy chapter as
Gibbs' inequality
— and the decomposition immediately gives
\[
\log p_{\boldsymbol{\theta}}(\boldsymbol{x}) \geq \mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\psi} \mid \boldsymbol{x}),
\]
with equality if and only if \(q_{\boldsymbol{\psi}} = p_{\boldsymbol{\theta}}(\cdot \mid \boldsymbol{x})\) almost everywhere under
\(q_{\boldsymbol{\psi}}\). This recovers the Jensen bound under stronger hypotheses than the Jensen-based proof itself required
(which used only integrability), but with the additional benefit of giving the exact gap and explaining when it is tight
(when the variational distribution equals the posterior). The Jensen-based proof gives the inequality in full generality;
the decomposition gives the inequality plus its equality condition under the more restrictive setting where KL is well-defined.
ELBO maximisation is KL minimisation.
Fix \(\boldsymbol{x}\) and \(\boldsymbol{\theta}\); then \(\log p_{\boldsymbol{\theta}}(\boldsymbol{x})\) is a
constant in \(\boldsymbol{\psi}\), and the decomposition becomes
\[
\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\psi} \mid \boldsymbol{x})
= \log p_{\boldsymbol{\theta}}(\boldsymbol{x})
- D_{\mathrm{KL}}(q_{\boldsymbol{\psi}} \,\|\, p_{\boldsymbol{\theta}}(\cdot \mid \boldsymbol{x})).
\]
Maximising \(\mathcal{L}\) over \(\boldsymbol{\psi}\) is therefore exactly the same optimisation problem
as minimising the KL divergence over \(\boldsymbol{\psi}\). This is the licence for the entire variational programme:
the original problem of finding
\[
\boldsymbol{\psi}^*
= \arg\min_{\boldsymbol{\psi}} D_{\mathrm{KL}}(q_{\boldsymbol{\psi}} \,\|\, p_{\boldsymbol{\theta}}(\cdot \mid \boldsymbol{x})),
\]
which involves the intractable posterior, is converted into the equivalent problem of maximising the ELBO, which involves only the joint
\(p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})\) and
the variational distribution \(q_{\boldsymbol{\psi}}(\boldsymbol{z})\) — both of which we can compute with by construction.
The second consequence is what one sells to a deep-learning practitioner: an inference problem stated in terms of the intractable posterior
has been converted, with no approximation, into an optimisation problem stated entirely in terms of computable quantities.
The remaining work — gradient computation, choice of variational family, minibatch scaling — is the subject of the rest of the chapter.
Choice of Variational Family
The variational family \(\mathcal{Q}\) is a modelling choice that the practitioner makes before running any algorithm, and
it determines both the quality of the approximation and the cost of finding it. Two families are sufficiently common that
they organise much of the variational literature, and we introduce them here informally; each receives a fuller treatment
in the algorithmic sections that follow.
The first is the fixed-form family, in which one chooses a parametric distribution — most commonly a multivariate Gaussian
\[
q_{\boldsymbol{\psi}}(\boldsymbol{z}) = \mathcal{N}(\boldsymbol{z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
\]
with \(\boldsymbol{\psi} = (\boldsymbol{\mu}, \boldsymbol{\Sigma})\) — and optimises its parameters by gradient methods on the ELBO.
The covariance structure of \(\boldsymbol{\Sigma}\) (full, diagonal, or low-rank-plus-diagonal) trades expressivity
against the cost of storage and inversion; the diagonal case in particular is important enough to reappear below as the Gaussian instance of the mean-field family.
The fixed-form approach is the path taken by variational autoencoders and is
the natural target of the gradient-based machinery developed in the next section.
The second is the mean-field family, in which the variational distribution factorises across coordinate groups,
\[
q_{\boldsymbol{\psi}}(\boldsymbol{z}) \;=\; \prod_{j=1}^{J} q_j(\boldsymbol{z}_j),
\]
with \(\boldsymbol{z}\) partitioned into blocks \(\boldsymbol{z}_1, \ldots, \boldsymbol{z}_J\) and each \(q_j\) a distribution on the
\(j\)-th block. The factors \(q_j\) are not constrained to lie in any particular parametric family: their functional forms are derived
from the model and the factorisation by the coordinate-ascent procedure of a later section, which alternately maximises
the ELBO with respect to one factor at a time while holding the others fixed. The mean-field family is historically the original variational
construction and remains the workhorse of classical Bayesian VI, particularly for models with conjugate updates between blocks.
Its principal weakness — captured already in two dimensions, where a mean-field Gaussian cannot represent any posterior correlation between
the two coordinates — is that it systematically underestimates posterior variance when the true posterior has inter-block dependencies.
The two families are not mutually exclusive: a mean-field factorisation can have Gaussian factors (mean-field Gaussian VI is a standard baseline),
and a fixed-form parametric family can be made richer by adopting structured rather than diagonal covariances. Richer alternatives — normalising flows,
structured mean-field with learned dependency graphs, implicit posteriors specified by sampling rather than density — generalise both directions and
are an active research frontier; we mention them in the closing section but do not develop them here.
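The variance-underestimation point admits a short numerical check. For a Gaussian target, the optimal mean-field factors are themselves Gaussian with precisions given by the diagonal of the target's precision matrix (the standard coordinate-ascent fixed point); the sketch below, with illustrative numbers, compares the resulting variances to the true marginals for a correlated two-dimensional posterior.

```python
import numpy as np

# Mean-field fit to a correlated 2-D Gaussian "posterior" N(0, Sigma).
# For a Gaussian target the optimal factors are Gaussian with precisions
# Lambda_jj, the diagonal of the precision matrix (CAVI fixed point), so
# the mean-field variances are 1/Lambda_jj, not the true marginals.
rho = 0.9
Sigma = np.array([[1.0, rho], [rho, 1.0]])
Lambda = np.linalg.inv(Sigma)

mean_field_var = 1.0 / np.diag(Lambda)       # = 1 - rho**2 per coordinate
true_marginal_var = np.diag(Sigma)           # = 1 per coordinate

assert np.allclose(mean_field_var, 1.0 - rho**2)
assert np.all(mean_field_var < true_marginal_var)   # systematic underestimate
```

At \(\rho = 0.9\) each mean-field variance is \(1 - \rho^2 = 0.19\) against a true marginal variance of \(1\): the stronger the inter-block correlation, the more severe the shrinkage.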
Computing the ELBO Gradient
The decomposition theorem of the previous section reduces variational inference to the optimisation problem
\[
\arg\max_{\boldsymbol{\psi}} \mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\psi} \mid \boldsymbol{x}).
\]
For the parametric variational families used in practice — Gaussians, mean-field exponential families,
neural-network-parameterised distributions — this is a continuous optimisation in \(\boldsymbol{\psi}\), and
the natural attack is gradient ascent on the ELBO. The technical obstacle is that the ELBO is an expectation
against a distribution whose parameters are themselves what we are differentiating with respect to: naive
interchange of the gradient and the expectation is illegitimate, and recovering a valid gradient estimator is the work of this section.
Two standard estimators address the obstacle, and they sit at opposite ends of a bias-variance trade-off.
The reparameterisation gradient applies when \(q_{\boldsymbol{\psi}}\) admits a sampling representation
\(\boldsymbol{z} = g_{\boldsymbol{\psi}}(\boldsymbol{\epsilon})\) with \(\boldsymbol{\epsilon}\)
drawn from a fixed base distribution; it pushes the \(\boldsymbol{\psi}\)-dependence inside the expectation, where ordinary differentiation
under the integral sign applies, and yields a low-variance estimator at the cost of requiring \(g_{\boldsymbol{\psi}}\) to be differentiable.
The score-function estimator, also called REINFORCE, applies in full generality — including when \(\boldsymbol{z}\) is
discrete or \(g_{\boldsymbol{\psi}}\) is non-differentiable — by exploiting an algebraic identity that converts the gradient of the expectation
into a different expectation that can be Monte-Carlo-estimated; the price is high variance and a corresponding need for variance-reduction techniques.
We develop both estimators in turn, with the rigorous justification (in each case, an application of dominated convergence) explicit.
The Gradient Interchange Problem
Write the ELBO as an explicit integral:
\[
\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\psi} \mid \boldsymbol{x})
\;=\; \int q_{\boldsymbol{\psi}}(\boldsymbol{z}) \,
f_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}; \boldsymbol{\psi}) \, d\boldsymbol{z},
\qquad
f_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}; \boldsymbol{\psi})
\;=\; \log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})
- \log q_{\boldsymbol{\psi}}(\boldsymbol{z}).
\]
The \(\boldsymbol{\psi}\)-dependence enters in two places: the integrand
\(f_{\boldsymbol{\theta}}\) contains \(\log q_{\boldsymbol{\psi}}\), and the integrating
measure is itself \(q_{\boldsymbol{\psi}}(\boldsymbol{z}) \, d\boldsymbol{z}\). The
naive identity
\[
\nabla_{\boldsymbol{\psi}} \mathbb{E}_{q_{\boldsymbol{\psi}}(\boldsymbol{z})}\!\left[
h(\boldsymbol{z})
\right]
\;\stackrel{?}{=}\;
\mathbb{E}_{q_{\boldsymbol{\psi}}(\boldsymbol{z})}\!\left[
\nabla_{\boldsymbol{\psi}} h(\boldsymbol{z})
\right]
\]
fails in general because the right-hand side ignores the contribution of the gradient acting on the measure \(q_{\boldsymbol{\psi}}\) itself;
for an integrand that does not depend on \(\boldsymbol{\psi}\) (so that \(\nabla_{\boldsymbol{\psi}} h = 0\)),
the right-hand side would be identically zero, while the left-hand side is generally non-zero
whenever \(q_{\boldsymbol{\psi}}\) actually depends on \(\boldsymbol{\psi}\).
The legitimate operation underlying both estimators we develop below is differentiation under the integral sign, which the
dominated convergence theorem
licenses provided one can find a single integrable function dominating the \(\boldsymbol{\psi}\)-derivatives uniformly
in a neighbourhood of the parameter value.
The two estimators differ in where they apply this principle: the reparameterisation gradient first rewrites the expectation
against a fixed \(\boldsymbol{\psi}\)-independent measure (eliminating the second source of \(\boldsymbol{\psi}\)-dependence)
and then differentiates under the integral; the score-function estimator differentiates under the integral directly, paying for the
\(\boldsymbol{\psi}\)-dependent measure with an algebraic identity that produces an additional term.
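A small numerical check makes the failure of the naive identity vivid. With the illustrative choices \(q = \mathcal{N}(\mu, 1)\) and \(h(z) = z^2\) (neither from the text), the exact gradient of \(\mathbb{E}_q[h]\) in \(\mu\) is \(2\mu\), the naive interchange gives \(0\), and the score-function identity \(\mathbb{E}_q[h(z)\, \nabla_\mu \log q(z)]\) mentioned above recovers the right answer.

```python
import numpy as np

rng = np.random.default_rng(2)

# E_{z ~ N(mu, 1)}[z**2] = mu**2 + 1, so the exact gradient in mu is 2*mu.
# Naive interchange gives E[grad_mu z**2] = 0, since z**2 has no explicit mu.
mu = 1.5
z = rng.normal(mu, 1.0, size=2_000_000)
score = z - mu                          # grad_mu log N(z | mu, 1)
score_grad = np.mean(z**2 * score)      # score-function (REINFORCE) estimator

assert abs(score_grad - 2 * mu) < 0.05  # matches the exact gradient, 2*mu = 3
```

Even in this one-dimensional example the estimator needs millions of samples for two-digit accuracy, previewing the variance problem that motivates the control-variate techniques discussed later.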
The Reparameterisation Gradient
Suppose \(q_{\boldsymbol{\psi}}\) admits a reparameterisation: a fixed base distribution \(p_0\)
on some space \(\mathcal{E}\) — typically the standard multivariate Gaussian \(\mathcal{N}(\mathbf{0}, \mathbf{I})\)
— and a deterministic map \(g_{\boldsymbol{\psi}} : \mathcal{E} \to \mathcal{Z}\) such that the law of
\(g_{\boldsymbol{\psi}}(\boldsymbol{\epsilon})\) under
\(\boldsymbol{\epsilon} \sim p_0\) coincides with \(q_{\boldsymbol{\psi}}\).
The canonical example is the location-scale family for a Gaussian variational distribution:
if \(q_{\boldsymbol{\psi}}(\boldsymbol{z}) = \mathcal{N}(\boldsymbol{z} \mid \boldsymbol{\mu}, \mathrm{diag}(\boldsymbol{\sigma}^2))\)
with \(\boldsymbol{\psi} = (\boldsymbol{\mu}, \boldsymbol{\sigma})\), one takes
\(p_0 = \mathcal{N}(\mathbf{0}, \mathbf{I})\) and
\[
g_{\boldsymbol{\psi}}(\boldsymbol{\epsilon}) = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon},
\]
where \(\odot\) denotes element-wise multiplication. This is the construction
introduced in the
variational autoencoder
chapter; the present treatment isolates the gradient identity from any specific parametric form.
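A minimal sketch of the location-scale construction in use, assuming the hypothetical conjugate model \(z \sim \mathcal{N}(0,1)\), \(x \mid z \sim \mathcal{N}(z,1)\) so that the exact posterior \(\mathcal{N}(x/2, 1/2)\) is available as a check: plain stochastic gradient ascent on a reparameterised Monte Carlo ELBO, with the entropy of \(q\) handled analytically. All names and step sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy conjugate model: z ~ N(0,1), x|z ~ N(z,1); exact posterior N(x/2, 1/2).
# q = N(mu, sigma^2) via z = g_psi(eps) = mu + sigma*eps, eps ~ N(0, 1).
# ELBO = E_eps[ log p(x, mu + sigma*eps) ] + 0.5*log(2*pi*e*sigma^2),
# with the entropy of q differentiated analytically.
def elbo_grads(x, mu, log_sigma, n=4096):
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=n)
    z = mu + sigma * eps                     # reparameterised sample
    dlogp_dz = -z + (x - z)                  # d/dz [log p(z) + log p(x|z)]
    g_mu = np.mean(dlogp_dz)                 # chain rule: dz/dmu = 1
    g_log_sigma = np.mean(dlogp_dz * sigma * eps) + 1.0  # dz/dlog_sigma = sigma*eps; +1 from entropy
    return g_mu, g_log_sigma

x, mu, log_sigma = 1.3, 0.0, 0.0
for _ in range(2000):                        # plain stochastic gradient ascent
    g_mu, g_ls = elbo_grads(x, mu, log_sigma)
    mu += 0.05 * g_mu
    log_sigma += 0.05 * g_ls

assert abs(mu - x / 2) < 0.05                       # mu*    -> 0.65
assert abs(np.exp(log_sigma) - np.sqrt(0.5)) < 0.05  # sigma* -> 0.7071
```

The iterates converge to the exact posterior parameters because this family contains the posterior; optimising \(\log \sigma\) rather than \(\sigma\) keeps the scale positive without constraints, a standard trick in this setting.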
Theorem: Reparameterisation Gradient
Let \(p_0\) be a probability distribution on \(\mathcal{E}\) and let
\(g_{\boldsymbol{\psi}} : \mathcal{E} \to \mathcal{Z}\) be a family of measurable maps,
indexed by \(\boldsymbol{\psi}\) in an open set \(\Psi \subseteq \mathbb{R}^d\),
such that \(g_{\boldsymbol{\psi}}(\boldsymbol{\epsilon}) \sim q_{\boldsymbol{\psi}}\)
when \(\boldsymbol{\epsilon} \sim p_0\). Let \(h_{\boldsymbol{\psi}} : \mathcal{Z} \to \mathbb{R}\) be a family of integrable
functions, possibly depending on \(\boldsymbol{\psi}\). Assume that
\((\boldsymbol{\psi}, \boldsymbol{\epsilon}) \mapsto h_{\boldsymbol{\psi}}(g_{\boldsymbol{\psi}}(\boldsymbol{\epsilon}))\)
is differentiable in \(\boldsymbol{\psi}\) for \(p_0\)-almost every
\(\boldsymbol{\epsilon}\), and that there exists a \(p_0\)-integrable dominating
function \(M(\boldsymbol{\epsilon})\) with
\(\|\nabla_{\boldsymbol{\psi}} h_{\boldsymbol{\psi}}(g_{\boldsymbol{\psi}}(\boldsymbol{\epsilon}))\| \leq M(\boldsymbol{\epsilon})\)
for all \(\boldsymbol{\psi}\) in a neighbourhood of the parameter value of interest.
Then
\[
\nabla_{\boldsymbol{\psi}} \, \mathbb{E}_{q_{\boldsymbol{\psi}}(\boldsymbol{z})}\!\left[
h_{\boldsymbol{\psi}}(\boldsymbol{z})
\right]
\;=\;
\mathbb{E}_{p_0(\boldsymbol{\epsilon})}\!\left[
\nabla_{\boldsymbol{\psi}} \, h_{\boldsymbol{\psi}}(g_{\boldsymbol{\psi}}(\boldsymbol{\epsilon}))
\right].
\]
Proof.
The change-of-variables identity
\[
\mathbb{E}_{q_{\boldsymbol{\psi}}}[h_{\boldsymbol{\psi}}(\boldsymbol{z})]
= \mathbb{E}_{p_0}[h_{\boldsymbol{\psi}}(g_{\boldsymbol{\psi}}(\boldsymbol{\epsilon}))]
\]
holds because \(g_{\boldsymbol{\psi}}\) pushes \(p_0\) forward to \(q_{\boldsymbol{\psi}}\).
The right-hand side is an expectation against the fixed base measure \(p_0\), which does not depend on \(\boldsymbol{\psi}\).
The differentiability and domination hypotheses imply that the family of difference quotients
\[
\big[h_{\boldsymbol{\psi}+t e_j}(g_{\boldsymbol{\psi}+t e_j}(\boldsymbol{\epsilon}))
- h_{\boldsymbol{\psi}}(g_{\boldsymbol{\psi}}(\boldsymbol{\epsilon}))\big] / t
\]
converges \(p_0\)-a.s. to
\[
\partial_{\psi_j} h_{\boldsymbol{\psi}}(g_{\boldsymbol{\psi}}(\boldsymbol{\epsilon}))
\]
as \(t \to 0\), and is dominated in absolute value by \(M(\boldsymbol{\epsilon})\)
(by the mean-value theorem and the domination hypothesis). The
dominated convergence theorem
therefore licenses the interchange of the limit \(t \to 0\) and the expectation, which is exactly the differentiation
under the integral. Reading the coordinate-wise statement as a vector identity gives the claim.
Applied to the ELBO with
\[
h_{\boldsymbol{\psi}}(\boldsymbol{z})
= \log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}) - \log q_{\boldsymbol{\psi}}(\boldsymbol{z}),
\]
the theorem gives
\[
\nabla_{\boldsymbol{\psi}} \mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\psi} \mid \boldsymbol{x})
\;=\;
\mathbb{E}_{p_0(\boldsymbol{\epsilon})}\!\left[
\nabla_{\boldsymbol{\psi}} \!\left(
\log p_{\boldsymbol{\theta}}(\boldsymbol{x}, g_{\boldsymbol{\psi}}(\boldsymbol{\epsilon}))
- \log q_{\boldsymbol{\psi}}(g_{\boldsymbol{\psi}}(\boldsymbol{\epsilon}))
\right)
\right],
\]
and the right-hand side is amenable to Monte Carlo estimation by drawing samples
\(\boldsymbol{\epsilon}^{(1)}, \ldots, \boldsymbol{\epsilon}^{(K)} \sim p_0\) and
averaging the bracketed gradient.
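The Monte Carlo estimator can be exhibited on a toy conjugate model where the exact gradient is available for comparison: prior \(p(z) = \mathcal{N}(0, 1)\), likelihood \(p(x \mid z) = \mathcal{N}(x; z, 1)\), and Gaussian \(q_{\boldsymbol{\psi}}(z) = \mathcal{N}(\mu, \sigma^2)\), for which the posterior is \(\mathcal{N}(x/2, 1/2)\) and the exact ELBO gradients are \(x - 2\mu\) and \(1/\sigma - 2\sigma\). The sketch below writes the \(\boldsymbol{\psi}\)-derivatives of the integrand out by hand (in practice automatic differentiation does this); the observation and parameter values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x, mu, sigma = 0.5, 0.2, 0.8     # one observation; current variational parameters

K = 100_000
eps = rng.standard_normal(K)
z = mu + sigma * eps             # reparameterised samples z ~ q_psi

# h_psi(z) = log p(x, z) - log q_psi(z) with p(z) = N(0,1), p(x|z) = N(x; z, 1).
# As a function of eps, log q_psi(mu + sigma*eps) = -log sigma - eps^2/2 + const,
# so the pathwise derivatives are
#   d/dmu    h = d/dz log p(x, z)           = (x - z) - z
#   d/dsigma h = [(x - z) - z] * eps + 1/sigma
grad_mu = np.mean((x - z) - z)
grad_sigma = np.mean(((x - z) - z) * eps + 1.0 / sigma)

print(grad_mu, x - 2 * mu)                  # estimate vs exact gradient in mu
print(grad_sigma, 1 / sigma - 2 * sigma)    # estimate vs exact gradient in sigma
```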
Evaluating the integrand at a sample \(\boldsymbol{\epsilon}\) requires computing
\(\log q_{\boldsymbol{\psi}}(g_{\boldsymbol{\psi}}(\boldsymbol{\epsilon}))\).
When the variational family has a known closed-form density — Gaussian, Gamma, Beta, exponential family — one simply substitutes
\(\boldsymbol{z} = g_{\boldsymbol{\psi}}(\boldsymbol{\epsilon})\)
into that closed form. When the family is defined only through the transform \(g_{\boldsymbol{\psi}}\)
— as in normalising flows, where the variational density is induced by composing Jacobian-tractable diffeomorphisms
with the base distribution — the closed form is unavailable and the change-of-variables formula is the only route:
if \(g_{\boldsymbol{\psi}} : \mathcal{E} \to \mathcal{Z}\) is a diffeomorphism onto its image with Jacobian
\(\partial g_{\boldsymbol{\psi}} / \partial \boldsymbol{\epsilon}\), the densities are related by
\[
\log q_{\boldsymbol{\psi}}(\boldsymbol{z})
\;=\;
\log p_0(\boldsymbol{\epsilon})
\;-\;
\log \!\left| \det \frac{\partial g_{\boldsymbol{\psi}}}{\partial \boldsymbol{\epsilon}}(\boldsymbol{\epsilon}) \right|,
\qquad
\boldsymbol{z} = g_{\boldsymbol{\psi}}(\boldsymbol{\epsilon}).
\]
The two routes agree where they overlap: for the location-scale Gaussian
\[
g_{\boldsymbol{\psi}}(\boldsymbol{\epsilon}) = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon},
\]
the Jacobian is \(\mathrm{diag}(\boldsymbol{\sigma})\) and the log-determinant is \(\sum_k \log \sigma_k\), and
the change-of-variables expression reduces to the closed-form Gaussian log-density that one would have written down directly.
The practical relevance of the formula is therefore that it lifts the reparameterisation trick to families without closed-form
densities. Designing \(g_{\boldsymbol{\psi}}\) so that its Jacobian determinant is tractable is the central constraint on the
variational family in such cases; full-covariance Gaussians use a Cholesky factor \(\mathbf{L}\) with diagonal log-determinant
\(\sum_k \log L_{kk}\), and richer families such as normalising flows compose Jacobian-tractable maps to extend expressiveness
while preserving the closed-form log-determinant. The closed-form Gaussian case underlying the
variational autoencoder
is detailed in the corresponding chapter, where the entropy term appears as the regulariser of the encoder.
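The agreement of the two routes is easy to confirm numerically — a minimal sketch with arbitrary illustrative parameter values, using SciPy for the Gaussian log-density:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu = np.array([0.3, -1.2])       # illustrative location parameters
sigma = np.array([0.7, 2.0])     # illustrative scale parameters

eps = rng.standard_normal(2)
z = mu + sigma * eps             # z = g_psi(eps)

# Route 1: change of variables through g_psi.  The Jacobian is diag(sigma),
# so log q(z) = log p0(eps) - sum_k log sigma_k.
log_q_cov = norm.logpdf(eps).sum() - np.log(sigma).sum()

# Route 2: the closed-form Gaussian log-density evaluated directly at z.
log_q_closed = norm.logpdf(z, loc=mu, scale=sigma).sum()

print(log_q_cov, log_q_closed)   # agree to floating-point precision
```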
A complementary technique, automatic differentiation variational inference (ADVI), addresses the case
where the latent space is a constrained subset of \(\mathbb{R}^D\) (positivity, simplex, bounded interval) by composing
a fixed bijection \(T : \Theta \to \mathbb{R}^D\) with a Gaussian variational distribution on the unconstrained space;
the density on \(\Theta\) is recovered by the same change-of-variables formula, and the resulting ELBO is differentiable
in the variational parameters by automatic differentiation through \(T^{-1}\) and the model. ADVI is the construction
underlying probabilistic programming systems such as Stan and PyMC, and is a special case of the reparameterisation gradient
with the bijection chosen to handle parameter constraints rather than to expand expressiveness.
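The ADVI construction can be sketched for the simplest constraint, positivity, handled by the bijection \(T(z) = \log z\): a Gaussian on the unconstrained space induces, via the change-of-variables formula, a log-normal density on \(z > 0\). The parameter values below are illustrative; SciPy's `lognorm` provides the closed form to check against.

```python
import numpy as np
from scipy.stats import norm, lognorm

# Gaussian variational distribution on the unconstrained u = T(z) = log z,
# so z = T^{-1}(u) = exp(u) > 0.  Hyperparameters are illustrative.
m, s = 0.4, 0.9
u = 1.3
z = np.exp(u)

# Change of variables: q(z) = q_u(log z) * |du/dz| = q_u(log z) / z.
log_q_z = norm.logpdf(np.log(z), loc=m, scale=s) - np.log(z)

# This is exactly the log-normal density with shape s and scale exp(m).
print(log_q_z, lognorm.logpdf(z, s=s, scale=np.exp(m)))
```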
The crucial feature of all of the above is that the entire chain
\[
\boldsymbol{\epsilon}
\mapsto g_{\boldsymbol{\psi}}(\boldsymbol{\epsilon})
\mapsto \log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \cdot) - \log q_{\boldsymbol{\psi}}(\cdot)
\]
is differentiable in \(\boldsymbol{\psi}\), so the gradient can be computed by ordinary automatic differentiation
through the model and the variational density. This is what makes the reparameterisation estimator the workhorse
of variational autoencoders and, more broadly, of every modern deep generative model whose latent variables admit a
location-scale or related differentiable parameterisation. In the amortized setting of the variational autoencoder
the variational parameters
\(\boldsymbol{\psi}_n = f_{\boldsymbol{\phi}}(\boldsymbol{x}_n)\)
themselves depend on the data through an inference network, and the reparameterisation map becomes
\(g_{\boldsymbol{\phi}}(\boldsymbol{x}, \boldsymbol{\epsilon})\) — a special case of the present theorem
in which the gradient with respect to the encoder parameters \(\boldsymbol{\phi}\) flows
through both the data-dependence of \(\boldsymbol{\psi}_n\) and the deterministic transform \(g\).
The Score-Function (REINFORCE) Estimator
When \(q_{\boldsymbol{\psi}}\) does not admit a differentiable reparameterisation — discrete latents, implicit distributions specified
only by their sampler, mixture distributions whose component assignment is itself a random variable — the reparameterisation theorem
is unavailable, and one needs an estimator that depends only on the ability to evaluate \(\log q_{\boldsymbol{\psi}}(\boldsymbol{z})\)
and its \(\boldsymbol{\psi}\)-gradient. The score-function estimator delivers this by exchanging the gradient of an expectation for an
expectation of a different gradient.
Theorem: Score-Function Estimator
Let \(q_{\boldsymbol{\psi}}\) be a family of densities on \(\mathcal{Z}\) such that
\((\boldsymbol{\psi}, \boldsymbol{z}) \mapsto q_{\boldsymbol{\psi}}(\boldsymbol{z})\)
is differentiable in \(\boldsymbol{\psi}\) for almost every \(\boldsymbol{z}\), and
let \(h_{\boldsymbol{\psi}} : \mathcal{Z} \to \mathbb{R}\) be a family of integrable
functions, possibly depending on \(\boldsymbol{\psi}\). Assume the
common-support hypothesis that the support
\(\{\boldsymbol{z} : q_{\boldsymbol{\psi}}(\boldsymbol{z}) > 0\}\) does not depend
on \(\boldsymbol{\psi}\) in a neighbourhood of the parameter value of interest,
and that
\(\boldsymbol{z} \mapsto \nabla_{\boldsymbol{\psi}}[h_{\boldsymbol{\psi}}(\boldsymbol{z}) q_{\boldsymbol{\psi}}(\boldsymbol{z})]\)
admits an integrable dominating function in that neighbourhood. Then
\[
\nabla_{\boldsymbol{\psi}} \, \mathbb{E}_{q_{\boldsymbol{\psi}}(\boldsymbol{z})}\!\left[
h_{\boldsymbol{\psi}}(\boldsymbol{z})
\right]
\;=\;
\mathbb{E}_{q_{\boldsymbol{\psi}}(\boldsymbol{z})}\!\left[
\nabla_{\boldsymbol{\psi}} h_{\boldsymbol{\psi}}(\boldsymbol{z})
\;+\;
h_{\boldsymbol{\psi}}(\boldsymbol{z}) \, \nabla_{\boldsymbol{\psi}} \log q_{\boldsymbol{\psi}}(\boldsymbol{z})
\right].
\]
In particular, when \(h\) is independent of \(\boldsymbol{\psi}\) the first term vanishes and the identity reduces to
\(\nabla_{\boldsymbol{\psi}} \mathbb{E}_{q_{\boldsymbol{\psi}}}[h(\boldsymbol{z})] = \mathbb{E}_{q_{\boldsymbol{\psi}}}[h(\boldsymbol{z}) \nabla_{\boldsymbol{\psi}} \log q_{\boldsymbol{\psi}}(\boldsymbol{z})]\).
The common-support hypothesis is essential: without it, differentiating the integral produces additional Leibniz boundary terms
— the classical pitfall illustrated by \(z \sim \mathrm{Uniform}(0, \psi)\) with scalar \(\psi > 0\), where the
\(\psi\)-dependent endpoint generates a contribution that the score identity does not capture.
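The Uniform pitfall is easy to exhibit numerically. For \(h(z) = z\), the true gradient of \(\mathbb{E}[z] = \psi/2\) is \(1/2\); the score is \(\partial_\psi \log(1/\psi) = -1/\psi\) on the support, so the naive score formula converges to \(-1/2\) instead, the discrepancy being the Leibniz boundary term. A minimal sketch with an arbitrary \(\psi\):

```python
import numpy as np

rng = np.random.default_rng(3)
psi = 2.0
z = rng.uniform(0.0, psi, size=200_000)   # z ~ Uniform(0, psi)

# h(z) = z, so E[h] = psi/2 and the true gradient d/dpsi E[h] = 1/2.
# Naive score formula: score = d/dpsi log(1/psi) = -1/psi on (0, psi).
naive = np.mean(z * (-1.0 / psi))

print(naive)   # ~ -0.5, not the true value +0.5
# The gap (here exactly 1) is the boundary term h(psi) * q(psi) = psi * (1/psi).
```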
Proof.
Writing the expectation as an integral and differentiating under the integral sign (justified by
dominated convergence
applied to the family
\(\boldsymbol{\psi} \mapsto h_{\boldsymbol{\psi}}(\boldsymbol{z}) q_{\boldsymbol{\psi}}(\boldsymbol{z})\)
with the dominating hypothesis assumed),
\[
\nabla_{\boldsymbol{\psi}} \int h_{\boldsymbol{\psi}}(\boldsymbol{z}) q_{\boldsymbol{\psi}}(\boldsymbol{z}) \, d\boldsymbol{z}
\;=\; \int \nabla_{\boldsymbol{\psi}}\!\left[ h_{\boldsymbol{\psi}}(\boldsymbol{z}) q_{\boldsymbol{\psi}}(\boldsymbol{z}) \right] d\boldsymbol{z}.
\]
Apply the product rule and the elementary log-derivative identity
\[
\nabla_{\boldsymbol{\psi}} q_{\boldsymbol{\psi}}(\boldsymbol{z})
= q_{\boldsymbol{\psi}}(\boldsymbol{z}) \, \nabla_{\boldsymbol{\psi}} \log q_{\boldsymbol{\psi}}(\boldsymbol{z}),
\]
valid wherever \(q_{\boldsymbol{\psi}}(\boldsymbol{z}) > 0\):
\[
\nabla_{\boldsymbol{\psi}}[h_{\boldsymbol{\psi}}(\boldsymbol{z}) q_{\boldsymbol{\psi}}(\boldsymbol{z})]
\;=\;
\nabla_{\boldsymbol{\psi}} h_{\boldsymbol{\psi}}(\boldsymbol{z}) \cdot q_{\boldsymbol{\psi}}(\boldsymbol{z})
\;+\;
h_{\boldsymbol{\psi}}(\boldsymbol{z}) \cdot q_{\boldsymbol{\psi}}(\boldsymbol{z}) \, \nabla_{\boldsymbol{\psi}} \log q_{\boldsymbol{\psi}}(\boldsymbol{z}).
\]
Integrating against \(d\boldsymbol{z}\) and recognising both terms as expectations
under \(q_{\boldsymbol{\psi}}\) gives the claim.
The quantity \(\nabla_{\boldsymbol{\psi}} \log q_{\boldsymbol{\psi}}(\boldsymbol{z})\) is the score function
of the variational distribution at \(\boldsymbol{z}\); the estimator that results from Monte Carlo sampling of the
right-hand side — drawing \(\boldsymbol{z}^{(k)} \sim q_{\boldsymbol{\psi}}\) and averaging the bracketed integrand — is
the score-function or REINFORCE estimator, the same construction that appears in policy-gradient reinforcement learning under
the policy \(q_{\boldsymbol{\psi}}\) and reward \(h\).
Applied to the ELBO with
\[
h_{\boldsymbol{\psi}}(\boldsymbol{z})
= \log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}) - \log q_{\boldsymbol{\psi}}(\boldsymbol{z}),
\]
the theorem gives
\[
\nabla_{\boldsymbol{\psi}} \mathcal{L}
\;=\;
\mathbb{E}_{q_{\boldsymbol{\psi}}(\boldsymbol{z})}\!\left[
\nabla_{\boldsymbol{\psi}} h_{\boldsymbol{\psi}}(\boldsymbol{z})
\;+\;
h_{\boldsymbol{\psi}}(\boldsymbol{z}) \, \nabla_{\boldsymbol{\psi}} \log q_{\boldsymbol{\psi}}(\boldsymbol{z})
\right].
\]
The first term simplifies: since
\[
\nabla_{\boldsymbol{\psi}} h_{\boldsymbol{\psi}}(\boldsymbol{z})
= -\nabla_{\boldsymbol{\psi}} \log q_{\boldsymbol{\psi}}(\boldsymbol{z})
\]
(only the \(\log q_{\boldsymbol{\psi}}\) term in \(h_{\boldsymbol{\psi}}\) carries
\(\boldsymbol{\psi}\)-dependence), and the score has zero mean under its own distribution,
\[
\mathbb{E}_{q_{\boldsymbol{\psi}}}[\nabla_{\boldsymbol{\psi}} \log q_{\boldsymbol{\psi}}] = 0
\]
(a consequence of differentiating \(\int q_{\boldsymbol{\psi}} d\boldsymbol{z} = 1\)),
the first term contributes zero. The ELBO gradient is therefore
\[
\nabla_{\boldsymbol{\psi}} \mathcal{L}
\;=\;
\mathbb{E}_{q_{\boldsymbol{\psi}}(\boldsymbol{z})}\!\left[
\big( \log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}) - \log q_{\boldsymbol{\psi}}(\boldsymbol{z}) \big)
\nabla_{\boldsymbol{\psi}} \log q_{\boldsymbol{\psi}}(\boldsymbol{z})
\right],
\]
and the score-function estimator of the ELBO gradient is the Monte Carlo average of this expression over samples
drawn from \(q_{\boldsymbol{\psi}}\). Crucially, this formula requires only that we can sample from
\(q_{\boldsymbol{\psi}}\) and evaluate its log-density and score; neither \(q_{\boldsymbol{\psi}}\) nor the integrand needs
to be differentiable in \(\boldsymbol{z}\), which is exactly the regime where reparameterisation fails.
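The regime where reparameterisation fails can be illustrated with a two-state discrete latent: a Bernoulli variational distribution with logit parameter \(\psi\) and arbitrary illustrative log-joint values, small enough that the exact ELBO gradient is available by enumerating the two states. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy discrete model: z in {0, 1}, with arbitrary illustrative log-joint values.
log_joint = np.array([-1.0, -2.0])   # log p(x, z=0), log p(x, z=1)

psi = 0.3                            # logit of q(z=1)
p = 1.0 / (1.0 + np.exp(-psi))

def elbo(p):
    q = np.array([1.0 - p, p])
    return np.sum(q * (log_joint - np.log(q)))

# Exact gradient w.r.t. psi by central finite differences through the sigmoid.
d = 1e-6
p_plus = 1.0 / (1.0 + np.exp(-(psi + d)))
p_minus = 1.0 / (1.0 + np.exp(-(psi - d)))
exact = (elbo(p_plus) - elbo(p_minus)) / (2 * d)

# Score-function estimator: h(z) * d/dpsi log q(z), where for a Bernoulli
# with logit psi the score is z - p.
z = (rng.random(500_000) < p).astype(float)
q_z = np.where(z == 1.0, p, 1.0 - p)
h = log_joint[z.astype(int)] - np.log(q_z)
estimate = np.mean(h * (z - p))

print(estimate, exact)   # agree up to Monte Carlo error
```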
Bias, Variance, and the Choice of Estimator
Both estimators are unbiased: their expectation is the exact ELBO gradient.
They differ sharply in variance, and this difference dictates the choice in practice.
The reparameterisation estimator achieves low variance because the \(\boldsymbol{\psi}\)-dependence has been pushed
into a single deterministic transform \(g_{\boldsymbol{\psi}}\), and the gradient is taken pathwise: each Monte Carlo
sample evaluates the derivative of a smooth composition, with no random score factors
\(\nabla_{\boldsymbol{\psi}} \log q_{\boldsymbol{\psi}}\) multiplying the integrand.
For Gaussian \(q_{\boldsymbol{\psi}}\) and smooth log-joint \(\log p_{\boldsymbol{\theta}}\), one or
two samples per gradient step are typically sufficient to drive stochastic gradient ascent to convergence — a fact
exploited by every variational autoencoder trained in practice.
The score-function estimator, by contrast, suffers from variance that scales with the magnitude of
\(h(\boldsymbol{z}) - \mathbb{E}_{q_{\boldsymbol{\psi}}}[h]\) and with
the variance of the score function itself. In high-dimensional latent spaces or when
the integrand has wide range — both common in modern probabilistic models — single-sample
estimates can have variance orders of magnitude larger than the reparameterisation counterpart.
The standard remedy is variance reduction by control variates: replace each component
\[
\tilde{g}_i(\boldsymbol{z}) = h(\boldsymbol{z}) \, \partial_{\psi_i} \log q_{\boldsymbol{\psi}}(\boldsymbol{z})
\]
of the naive estimator by
\[
\tilde{g}_i^{\mathrm{cv}}(\boldsymbol{z})
\;=\;
\tilde{g}_i(\boldsymbol{z}) \;-\; c_i \, b_i(\boldsymbol{z}),
\]
where \(b_i(\boldsymbol{z})\) is a baseline with
\[
\mathbb{E}_{q_{\boldsymbol{\psi}}}[b_i(\boldsymbol{z})] = 0
\]
and \(c_i \in \mathbb{R}\) is a coefficient to be chosen.
The zero-mean condition guarantees that the modified estimator remains unbiased;
a natural baseline is
\[
b_i(\boldsymbol{z}) = \partial_{\psi_i} \log q_{\boldsymbol{\psi}}(\boldsymbol{z})
\]
itself, which is zero-mean by the same identity that simplified the ELBO gradient derivation.
The coefficient \(c_i\) is then chosen to minimise the variance of \(\tilde{g}_i^{\mathrm{cv}}\);
a standard calculation yields the optimum
\[
c_i^*
\;=\;
\frac{\mathrm{Cov}(\tilde{g}_i(\boldsymbol{z}), \, b_i(\boldsymbol{z}))}{\mathrm{Var}(b_i(\boldsymbol{z}))},
\]
which can be estimated online from the same Monte Carlo samples used for the gradient.
Choosing \(b_i\) and \(c_i^*\) in this way — and the further refinement of using a learned function of \(\boldsymbol{z}\)
as baseline, common in policy-gradient reinforcement learning — can reduce variance by orders of magnitude while preserving
unbiasedness, and the development of effective control variates is a substantial body of work that we do not develop further here.
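The variance reduction can be demonstrated on a deliberately simple example — \(h(z) = z^2\) with \(q_{\psi} = \mathcal{N}(\psi, 1)\), so the true gradient of \(\mathbb{E}[h] = \psi^2 + 1\) is \(2\psi\) — using the score itself as the zero-mean baseline and estimating \(c^*\) from the same samples. The parameter value is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(5)

# q_psi = N(psi, 1), h(z) = z^2 (psi-independent): true gradient is 2 * psi.
psi = 1.5
z = rng.normal(psi, 1.0, size=100_000)

score = z - psi                      # d/dpsi log q_psi(z) for a unit-variance Gaussian
g_naive = (z ** 2) * score           # naive score-function estimator

# Control variate: baseline b(z) = score (zero mean under q_psi), with the
# variance-optimal coefficient c* = Cov(g, b) / Var(b) estimated from samples.
c_star = np.cov(g_naive, score)[0, 1] / np.var(score)
g_cv = g_naive - c_star * score

print(g_naive.mean(), g_cv.mean(), 2 * psi)   # both unbiased for 2 * psi
print(g_naive.var(), g_cv.var())              # variance drops substantially
```

For this example the optimal coefficient is \(c^* = \psi^2 + 3\) in closed form, and the control-variated estimator roughly halves the variance while leaving the mean untouched.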
The choice of estimator is therefore largely dictated by the structure of \(q_{\boldsymbol{\psi}}\).
When a reparameterisation is available, it is essentially always preferred; the literature on "black-box" variational inference,
in which the score-function estimator is paired with control variates and applied to models for which no reparameterisation is known,
exists precisely to extend the variational programme to the discrete and combinatorial models that the reparameterisation cannot
reach. The two estimators are not in competition: they cover complementary regimes, and a practitioner whose model contains both
reparameterisable and non-reparameterisable latents can apply each estimator to the appropriate component.
Coordinate Ascent Variational Inference
The gradient-based estimators of the previous section operate within a fixed-form variational family:
the practitioner chooses a parametric distribution \(q_{\boldsymbol{\psi}}\) — Gaussian, Gaussian mixture,
normalising-flow-induced — and optimises its parameters by stochastic gradient ascent on Monte Carlo estimates
of \(\nabla_{\boldsymbol{\psi}} \mathcal{L}\). The present section develops a complementary approach in which the
variational family is the fully-factorised mean-field family \(q(\boldsymbol{z}) = \prod_{j=1}^{J} q_j(z_j)\) introduced in
the section on the variational family, and the functional form of each factor \(q_j\) is not
chosen in advance but emerges from the optimisation itself. This is sometimes called free-form variational
inference, in contrast to the fixed-form setting of the previous section.
The free-form approach exchanges flexibility in the variational family for flexibility in the functional form of each factor.
Under the mean-field assumption, the ELBO admits an explicit decomposition that makes coordinate-wise optimisation
tractable: with each \(q_j\) treated as the variational degree of freedom
(rather than a finite-dimensional parameter \(\boldsymbol{\psi}_j\)),
the optimal form of \(q_j^*\) holding the other factors fixed is determined by the model's log-joint via
a closed-form expression. Cycling through the factors and updating each in turn yields
the coordinate ascent variational inference (CAVI) algorithm,
the classical workhorse of variational Bayes — the construction that, since the 1990s, has driven
mean-field inference in conjugate-exponential-family models.
The Mean-Field ELBO and the Coordinate Update
Substituting the mean-field factorisation \(q(\boldsymbol{z}) = \prod_j q_j(z_j)\) into the ELBO and using the additivity
of entropy under product distributions, the ELBO decomposes as
\[
\mathcal{L}(q) \;=\; \int q(\boldsymbol{z}) \log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z}) \, d\boldsymbol{z} \;+\; \sum_{j=1}^{J} \mathbb{H}(q_j),
\]
where \(\mathbb{H}(q_j) = -\int q_j(z_j) \log q_j(z_j) \, dz_j\) is the (differential) entropy of the \(j\)-th factor.
The first term is the expected log-joint under the factorised \(q\); the second is a sum of entropies that decouples completely across factors.
This decomposition is what makes coordinate-wise optimisation feasible: holding all factors except \(q_j\) fixed, the dependence of the ELBO on \(q_j\) is
isolated to the \(j\)-th term of the entropy sum and to the contribution of \(q_j\) to the expectation in the first term.
The optimal coordinate update is given by the following theorem, the central result of mean-field variational inference.
Theorem: Optimal Mean-Field Update (CAVI)
Let \(q(\boldsymbol{z}) = \prod_{j=1}^{J} q_j(z_j)\) be a mean-field variational distribution, and
fix all factors \(\{q_i\}_{i \neq j}\). Assume that the expected log-joint
\[
\mathbb{E}_{q_{-j}}[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})]
\]
is a measurable function of \(z_j\) taking values in \([-\infty, \infty)\) — where
\(q_{-j}(\boldsymbol{z}_{-j}) = \prod_{i \neq j} q_i(z_i)\), and the value
\(-\infty\) is permitted on the set where the prior or likelihood vanishes — and that the normalising integral
\[
\int \exp\!\big( \mathbb{E}_{q_{-j}}[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})] \big) \, dz_j
\]
is finite, with the convention \(\exp(-\infty) = 0\). Then the ELBO, viewed as a functional of \(q_j\) with the
other factors held fixed, is uniquely maximised by
\[
q_j^*(z_j) \;\propto\; \exp\!\Big( \mathbb{E}_{q_{-j}}\!\big[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})\big] \Big).
\]
Equivalently,
\[
\log q_j^*(z_j) = \mathbb{E}_{q_{-j}}[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})] + \mathrm{const}
\]
on the set where the right-hand side is finite, and \(q_j^*(z_j) = 0\) elsewhere;
the constant ensures normalisation. The flexibility to take the value \(-\infty\) on a set of positive measure
is essential for models with bounded-support priors — Beta, Dirichlet, truncated families —
where the log-joint is genuinely \(-\infty\) outside the prior support and \(q_j^*\) inherits that support automatically.
Proof.
Define
\[
g_j(z_j) := \mathbb{E}_{q_{-j}}[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})]
\]
and let \(\tilde{f}_j(z_j) := \exp(g_j(z_j))\); the integrability hypothesis ensures that the normalised density
\(f_j(z_j) := \tilde{f}_j(z_j) / \int \tilde{f}_j(z_j') \, dz_j'\) is well-defined.
Isolating the dependence of the ELBO on \(q_j\),
\[
\mathcal{L}(q) \;=\; \int q_j(z_j) \, g_j(z_j) \, dz_j \;-\; \int q_j(z_j) \log q_j(z_j) \, dz_j \;+\; C,
\]
where \(C\) collects all terms independent of \(q_j\) (the entropies of the other factors and the integral
against \(q_{-j}\) of any \(q_{-j}\)-only contributions). Substitute \(g_j(z_j) = \log f_j(z_j) + \log Z_j\), where
\(Z_j = \int \tilde{f}_j(z_j') \, dz_j'\) is the (constant) normaliser; the \(\log Z_j\) term contributes another
constant absorbed into \(C\), and the remaining \(q_j\)-dependent terms reorganise as
\[
\mathcal{L}(q) \;=\; \int q_j(z_j) \big[ \log f_j(z_j) - \log q_j(z_j) \big] \, dz_j \;+\; C'
\;=\; -D_{\mathrm{KL}}(q_j \,\|\, f_j) \;+\; C'.
\]
By
Gibbs' inequality,
\(D_{\mathrm{KL}}(q_j \,\|\, f_j) \geq 0\) with equality if and only if \(q_j = f_j\) almost everywhere.
Therefore the ELBO is maximised over choices of \(q_j\) precisely at \(q_j^* = f_j\), which is the claimed expression.
Two structural features of this update deserve emphasis. First, the optimal \(q_j^*\) depends on the other factors \(q_{-j}\)
only through the expected log-joint; in graphical-model language, the dependence is mediated entirely by the
Markov blanket of node \(j\), since variables conditionally independent of \(z_j\) given its neighbours contribute
additive constants to \(\log p\) that drop out of the proportionality. Second, the functional form of \(q_j^*\) is dictated by the
model's log-joint: when \(p_{\boldsymbol{\theta}}\) is a member of the exponential family with conjugate prior structure, the expectation
\(\mathbb{E}_{q_{-j}}[\log p]\) is a linear function of the sufficient statistics of \(z_j\), and \(q_j^*\) inherits
the same exponential-family form as the corresponding conditional. This is the source of the closed-form update equations that make CAVI
practical — the variational family is, in effect, dictated by the model rather than chosen by the practitioner.
The CAVI Algorithm
The optimal-update theorem yields an iterative algorithm: initialise the factors \(\{q_j\}\) arbitrarily, then
cycle through \(j = 1, \ldots, J\), updating each \(q_j\) to its optimal form given the current values of the other factors,
and repeat until convergence. Each individual update increases the ELBO (or leaves it unchanged at a fixed point), so
the sequence of ELBO values is monotonically non-decreasing and, since the ELBO is bounded above by the log-evidence, the
sequence of ELBO values converges to a finite limit. We emphasise that this is convergence of the objective values, not of
the iterates themselves: monotone boundedness of \(\mathcal{L}(q^{(t)})\) does not by itself guarantee that the
factors \((q_1^{(t)}, \ldots, q_J^{(t)})\) converge in the space of distributions.
Convergence of the iterates to a stationary point requires additional regularity conditions
— compactness of the variational parameter space, continuity of the update map, or the standard hypotheses
for block coordinate ascent in the literature — that hold for the conjugate-exponential models on which CAVI is typically run but
are not automatic. In practice the iterates settle at a local maximum that depends on the initialisation,
since the ELBO is concave with respect to each \(q_j\) individually but generally non-concave jointly.
In pseudocode, a run of \(T\) sweeps reads:
Algorithm: COORDINATE_ASCENT_VI (CAVI)
Input: joint density \(p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})\), number of factors \(J\), number of sweeps \(T\);
Output: mean-field factors \(\{q_j\}_{j=1}^{J}\) with \(q(\boldsymbol{z}) = \prod_j q_j(z_j)\);
begin
Initialise \(q_1, \ldots, q_J\) (arbitrary distributions on the respective coordinates);
for \(t = 1, \ldots, T\) do
for \(j = 1, \ldots, J\) do
\(g_j(z_j) \leftarrow \mathbb{E}_{q_{-j}}\!\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x}, \boldsymbol{z})\right]\);
\(q_j(z_j) \leftarrow \exp(g_j(z_j)) \,/\, \int \exp(g_j(z_j')) \, dz_j'\);
end
end
Output \(\{q_1, \ldots, q_J\}\);
end
The expectation in the inner loop is, in conjugate-exponential models, available in closed form as a function
of the parameters of the other factors \(q_{-j}\); the update therefore reduces to a finite-dimensional parameter update
on each iteration, and the apparent functional optimisation collapses to ordinary numerical computation.
In non-conjugate models the expectation is intractable and CAVI is replaced by gradient-based methods of the previous section,
or by hybrid schemes that Monte-Carlo-estimate the inner expectation while preserving the coordinate ascent structure of the outer loop.
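This collapse to finite-dimensional parameter updates can be sketched on the textbook conjugate example: a univariate Gaussian with unknown mean \(\mu\) and precision \(\tau\) under a Normal-Gamma prior, with mean-field factorisation \(q(\mu)q(\tau)\). The updates below are the standard ones — \(q(\mu)\) is Gaussian, \(q(\tau)\) is Gamma, and each update consumes expectations under the other factor; the data and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic data: true mean 1.0, true precision 0.25 (sd = 2).
x = rng.normal(1.0, 2.0, size=200)
N, xbar = x.size, x.mean()

# Normal-Gamma prior hyperparameters (illustrative values):
# mu | tau ~ N(mu0, (lam0 * tau)^-1),  tau ~ Gamma(a0, b0).
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_tau = 1.0                     # initialise E_q[tau] arbitrarily
for _ in range(50):             # CAVI sweeps
    # Update q(mu) = N(m, 1/lam): E_q[tau] enters through the precision lam.
    m = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam = (lam0 + N) * E_tau
    E_mu, E_mu2 = m, m ** 2 + 1.0 / lam

    # Update q(tau) = Gamma(a, b): E_q[mu], E_q[mu^2] enter through the rate b.
    a = a0 + (N + 1) / 2.0
    b = b0 + 0.5 * (lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0 ** 2)
                    + np.sum(x ** 2) - 2 * E_mu * np.sum(x) + N * E_mu2)
    E_tau = a / b

print(m, E_tau)   # variational posterior mean of mu and of the precision tau
```

Note that \(m\) is exact after the first sweep (it does not depend on \(\mathbb{E}[\tau]\)); the precision factor converges within a few iterations, and \(\mathbb{E}_q[\tau]\) lands near the true precision \(0.25\).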
Bayesian Gaussian Mixture Models and Automatic Sparsity
The canonical application of CAVI is Bayesian inference in conjugate latent-variable models, of which
the Bayesian Gaussian mixture model is the most pedagogically important example.
With a Dirichlet prior on the mixing weights, conjugate Normal-Wishart priors on the component parameters, and
the standard latent cluster-assignment variables, the mean-field factorisation
\[
q(\boldsymbol{z}, \boldsymbol{\theta}) = q(\boldsymbol{\theta}) \prod_n q_n(z_n)
\]
yields CAVI updates in which every factor's optimal form is itself a tractable member of the exponential family
— Categorical for the cluster-assignment factors \(q_n(z_n)\),
Dirichlet for the mixing-weight factor \(q(\boldsymbol{\pi})\), and Normal-Wishart for
the component-parameter factors \(q(\boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k)\).
The update equations involve expected sufficient statistics under the other factors and are derived in detail
in any standard reference on variational Bayes; we do not reproduce the derivation here, but highlight a single
emergent phenomenon that is characteristic of the Bayesian setting and absent from gradient-based VI.
Insight: Automatic Model Selection via Posterior Pruning
When the Dirichlet prior on the mixing weights is sufficiently sparse — concretely,
when the concentration parameter \(\alpha_0\) is small — the CAVI updates for the
Bayesian Gaussian mixture exhibit an "automatic sparsity" effect: clusters
with few or no assigned data points have their effective Dirichlet count driven
toward zero, and the corresponding \(q_n(z_n)\) factors place vanishing probability
on those clusters at every subsequent iteration. The model "kills off" unneeded
components autonomously, and the number of effectively active clusters at
convergence is determined by the data rather than fixed in advance.
Mathematically, the effect arises from the digamma function — denoted
\(\psi_{\mathrm{dig}}\) here to avoid clash with the variational parameters
\(\boldsymbol{\psi}\) — that appears in the variational expectation of
\(\log \pi_k\) under the Dirichlet factor: for small \(N_k\),
\(\exp(\psi_{\mathrm{dig}}(N_k)) < N_k\) and approaches zero as
\(N_k \to 0\), producing a soft-thresholding behaviour reminiscent of
\(\ell_1\)-regularisation. The same mechanism — under the name "rich-get-richer"
— drives the asymptotic concentration phenomena of Dirichlet process mixtures.
The same effect has a less benign manifestation in modern deep generative models,
where it is known as posterior collapse: a sufficiently expressive
decoder can render the latent variable uninformative for reconstruction, at which
point the variational posterior collapses to the prior and the latent dimension
is effectively pruned. Both phenomena are instances of the same Bayesian Occam's
razor encoded in the mean-field ELBO.
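The soft-thresholding behaviour of the digamma function is visible directly: \(\exp(\psi_{\mathrm{dig}}(x)) \approx x - \tfrac{1}{2}\) for large \(x\), but plunges towards zero much faster than \(x\) itself as \(x \to 0\). A two-line check with SciPy:

```python
import numpy as np
from scipy.special import digamma

# exp(digamma(x)) behaves like x - 0.5 for large x but collapses towards
# zero as x -> 0: small effective counts are suppressed far below their value.
for n_k in [20.0, 5.0, 1.0, 0.1, 0.01]:
    print(n_k, np.exp(digamma(n_k)))
```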
Limitations of Mean-Field Inference
The mean-field assumption is structurally restrictive in a way that cannot be
remedied by clever choice of factor functional form: the family
\(q(\boldsymbol{z}) = \prod_j q_j(z_j)\) is, by construction, incapable of
representing posterior dependence between latent variables. When the true posterior
\(p_{\boldsymbol{\theta}}(\boldsymbol{z} \mid \boldsymbol{x})\) exhibits strong
correlation between the \(z_j\) — which it generically does, since the data couples
the latents — mean-field VI returns an approximation whose marginal variances
systematically underestimate the true posterior uncertainty. The asymmetry
of the KL divergence is responsible: minimising
\(D_{\mathrm{KL}}(q \,\|\, p(\cdot \mid \boldsymbol{x}))\) penalises configurations
where \(q\) places mass in regions of low posterior probability much more strongly
than configurations where \(q\) misses regions of high posterior probability,
producing the well-documented mode-seeking, variance-shrinking behaviour of
mean-field approximations.
These limitations motivate the directions that have dominated variational inference
since the early 2010s. Structured mean-field approximations retain
dependencies between blocks of latent variables — for instance, treating a hidden
Markov sequence as a single block while assuming independence between sequences —
and recover much of the correlation structure that fully-factorised mean-field
discards. Richer fixed-form families for use with the gradient-based
methods of the previous section include Gaussians with full or low-rank-plus-diagonal
covariance, Gaussian mixtures, and the family of normalising flows,
in which a sequence of Jacobian-tractable diffeomorphisms transforms a simple base
distribution into an arbitrarily expressive variational posterior. Tighter
bounds than the ELBO are available when one is willing to draw multiple
samples per gradient step: the importance-weighted autoencoder bound (IWAE) and its
descendants exploit \(K\)-sample importance averaging to reduce the gap to the
log-evidence. Expectation propagation reverses the direction of the
KL divergence, minimising
\(D_{\mathrm{KL}}(p(\cdot \mid \boldsymbol{x}) \,\|\, q)\) instead of
\(D_{\mathrm{KL}}(q \,\|\, p(\cdot \mid \boldsymbol{x}))\), exchanging the
mode-seeking behaviour of standard VI for a moment-matching, variance-preserving
behaviour at the cost of a more delicate algorithmic implementation. We do not
develop these directions here; each represents a substantial body of work and is
addressed in the specialised literature.
The broader role of variational inference in the contemporary toolkit is best
understood by contrast with the alternatives. Markov chain Monte Carlo, treated
elsewhere on the site, produces samples whose distribution converges to the true
posterior in the limit of infinite computation; it sacrifices exactness in finite
time and pays in autocorrelated samples and slow mixing on multimodal posteriors.
Variational inference, conversely, returns a fast and deterministic approximation
whose quality is bounded by the expressiveness of the variational family rather
than by sampling diagnostics; it sacrifices asymptotic exactness for tractability
on problems where MCMC is prohibitive. The two are complementary tools, not
competitors, and a working Bayesian practitioner draws on both. Among gradient-based
methods, the
variational autoencoder
chapter develops the amortised, deep-network specialisation that drives modern deep
generative modelling, and the
natural gradient
chapter develops the geometry of parameter space that underlies many advanced
variational optimisers. The variational foundation laid in this chapter — the ELBO,
its KL decomposition, and the gradient and coordinate-ascent algorithms that operate
on it — remains the structural backbone underneath these specialisations.