One of the main challenges in Bayesian inference is choosing an appropriate prior. In general, finding the
exact normalizing constant of the posterior requires evaluating the integral in the marginal likelihood, which
is often intractable. What if there are families of priors for which the posterior has a known closed form
without this difficulty?
Conjugacy is defined relative to a parametric family \(\mathcal{F}\) of distributions and a likelihood function: a prior \(p(\theta) \in \mathcal{F}\) is conjugate to the likelihood \(p(\mathcal{D} \mid \theta)\) if the posterior \(p(\theta \mid \mathcal{D})\) also belongs to \(\mathcal{F}\).
Familiar conjugate pairs, such as Beta–Binomial, Normal–Normal (with known variance), and Inverse-Gamma–Normal
(with known mean), will be derived in the examples below; the abstract definition above is best understood through these concrete instances.
The following example introduces the basic steps of Bayesian inference together with a first conjugate prior.
Example 1: Beta-Binomial Model
Consider tossing a coin \(N\) times. Let \(\theta \in [0, 1]\) be the probability of getting heads. We record the outcomes as
\(\mathcal{D} = \{y_n \in \{0, 1\} : n = 1, \ldots, N\}\). We assume the data are iid.
Treating the data as an ordered sequence of coin tosses, the likelihood is a product of Bernoulli terms:
\[
\begin{align*}
p(\mathcal{D} \mid \theta) &= \prod_{n = 1}^N \theta^{y_n}(1 - \theta)^{1-y_n} \\\\
&= \theta^{N_1}(1 - \theta)^{N_0}
\end{align*}
\]
where \(N_1\) and \(N_0\) are the numbers of heads and tails, respectively, so that \(N_1 + N_0 = N\).
Equivalently, since the two models differ only by a combinatorial factor that does not depend on \(\theta\), we can summarize the data
via the head count \(y = N_1\) and use the Binomial likelihood:
\[
\begin{align*}
p(y \mid \theta) &= \operatorname{Bin}(y \mid N, \theta) \\\\
&= \binom{N}{y} \theta^y (1 - \theta)^{N - y} \\\\
&\propto \theta^y (1 - \theta)^{N - y}.
\end{align*}
\]
The combinatorial factor \(\binom{N}{y}\) does not depend on \(\theta\), so it will not affect posterior inference up to proportionality.
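As a quick numerical check of this claim, the following sketch evaluates both likelihoods at a few values of \(\theta\). The toss sequence and the helper names `bernoulli_likelihood` and `binomial_likelihood` are made up for illustration; only the Python standard library is assumed.

```python
import math

def bernoulli_likelihood(data, theta):
    """Likelihood of an ordered 0/1 sequence: theta^N1 * (1 - theta)^N0."""
    n1 = sum(data)             # number of heads
    n0 = len(data) - n1        # number of tails
    return theta**n1 * (1.0 - theta)**n0

def binomial_likelihood(y, n, theta):
    """Likelihood of the head count y out of n tosses."""
    return math.comb(n, y) * theta**y * (1.0 - theta)**(n - y)

# Hypothetical data: 7 heads and 3 tails.
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
y, n = sum(data), len(data)

for theta in (0.3, 0.5, 0.7):
    ratio = binomial_likelihood(y, n, theta) / bernoulli_likelihood(data, theta)
    # The ratio does not change with theta; it equals math.comb(n, y).
    print(theta, ratio)
```

Because the ratio is constant in \(\theta\), both likelihoods lead to the same posterior once it is normalized.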
Next, we have to specify a prior. If we know nothing about the parameter, an
uninformative prior can be used:
\[
p(\theta) = \operatorname{Unif}(\theta \mid 0, 1).
\]
In this example, however, we use the more flexible Beta distribution as the prior:
\[
\begin{align*}
p(\theta) = \operatorname{Beta}(\theta \mid a, b) &= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1} \tag{1} \\\\
&\propto \theta^{a-1}(1-\theta)^{b-1}
\end{align*}
\]
where \(a, b > 0\) are usually called hyper-parameters. (Our main parameter is \(\theta\).)
Note. If \(a = b = 1\), we recover the uniform prior above, since \(\operatorname{Beta}(\theta \mid 1, 1) = \operatorname{Unif}(\theta \mid 0, 1)\).
Applying Bayes' theorem and absorbing the marginal likelihood \(p(\mathcal{D})\) into the normalization (since it does not depend on \(\theta\)),
the posterior is proportional to the product of the likelihood and the prior:
\[
\begin{align*}
p(\theta \mid \mathcal{D}) &\propto [\theta^{y}(1 - \theta)^{N-y}] \cdot [\theta^{a-1}(1-\theta)^{b-1}] \\\\
&= \theta^{(a+y)-1}(1-\theta)^{(b+N-y)-1} \\\\
&\propto \operatorname{Beta}(\theta \mid a+y, \, b+N-y) \\\\
&= \frac{\Gamma(a+b+N)}{\Gamma(a+y)\Gamma(b+N-y)}\theta^{a+y-1}(1-\theta)^{b+N-y-1}. \tag{2}
\end{align*}
\]
Here, the posterior has the same functional form as the prior. Thus, the beta distribution is the
conjugate prior for the binomial distribution.
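In code, the conjugate update reduces to adding counts to the hyper-parameters. The sketch below is a minimal illustration; the function name `beta_binomial_update`, the prior values, and the toss data are assumptions made for the example. It also checks that updating one toss at a time gives the same posterior as a single batch update.

```python
def beta_binomial_update(a, b, y, n):
    """Return the posterior Beta parameters (a + y, b + n - y)."""
    if a <= 0 or b <= 0:
        raise ValueError("hyper-parameters a and b must be positive")
    if not 0 <= y <= n:
        raise ValueError("head count y must satisfy 0 <= y <= n")
    return a + y, b + n - y

# Made-up data: 7 heads in 10 tosses, with an illustrative Beta(2, 2) prior.
tosses = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
a0, b0 = 2.0, 2.0

# Batch update using the sufficient statistic y = sum of the tosses.
batch = beta_binomial_update(a0, b0, sum(tosses), len(tosses))

# Sequential update: yesterday's posterior is today's prior.
a, b = a0, b0
for toss in tosses:
    a, b = beta_binomial_update(a, b, toss, 1)
sequential = (a, b)

print(batch, sequential)  # both are (9.0, 5.0)
```

The agreement between the two results follows directly from the posterior depending on the data only through \(y\) and \(N\).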
Once we have the posterior distribution, we can use, for example, the posterior mean \(\bar{\theta}\) as a point estimate of \(\theta\).
Since the posterior is \(\operatorname{Beta}(\theta \mid a+y, \, b+N-y)\), its mean is given by the standard formula for the
Beta distribution:
\[
\bar{\theta} = \mathbb{E}[\theta \mid \mathcal{D}] = \frac{a+y}{a+b+N}.
\]
Note. By adjusting the hyper-parameters \(a\) and \(b\), we can control the influence of the prior on the posterior.
If \(a\) and \(b\) are small relative to \(N\), the posterior mean will closely reflect the data:
\[
\bar{\theta} \approx \frac{y}{N} = \hat{\theta}_{\text{MLE}}
\]
while if \(a\) and \(b\) are large, the posterior mean will be more influenced by the prior.
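One way to make this trade-off explicit is to write the posterior mean as a weighted average of the prior mean \(a/(a+b)\) and the MLE \(y/N\). This is a standard rearrangement rather than part of the derivation above, and the weight \(\lambda\) is introduced here purely as shorthand:
\[
\begin{align*}
\bar{\theta} = \frac{a+y}{a+b+N} &= \frac{a+b}{a+b+N} \cdot \frac{a}{a+b} + \frac{N}{a+b+N} \cdot \frac{y}{N} \\\\
&= \lambda \, \frac{a}{a+b} + (1 - \lambda) \, \hat{\theta}_{\text{MLE}}, \qquad \lambda = \frac{a+b}{a+b+N}.
\end{align*}
\]
Read this way, \(a + b\) behaves like a prior sample size: as \(N\) grows with \(a + b\) fixed, \(\lambda \to 0\) and the posterior mean approaches the MLE.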
To quantify the uncertainty in this estimate, we can report the posterior standard deviation, which plays the role of a standard error. Again invoking the
Beta variance formula, we have:
\[
\begin{align*}
\operatorname{SE}(\theta) &= \sqrt{\operatorname{Var}[\theta \mid \mathcal{D}]} \\\\
&= \sqrt{\frac{(a+y)(b+N-y)}{(a+b+N)^2(a+b+N+1)}}.
\end{align*}
\]
Here, in the data-rich regime where \(N \gg a, b\) and the empirical fraction \(\hat{\theta} = y/N\) is bounded away from 0 and 1, we can simplify the posterior variance as follows:
\[
\begin{align*}
\operatorname{Var}[\theta \mid \mathcal{D}] &\approx \frac{y(N-y)}{N^3} \\\\
&= \frac{y}{N^2} - \frac{y^2}{N^3} \\\\
&= \frac{\hat{\theta}(1 - \hat{\theta})}{N}
\end{align*}
\]
where \(\hat{\theta} = \frac{y}{N}\) is the MLE.
Thus, the standard error is given by
\[
\operatorname{SE}(\theta) \approx \sqrt{\frac{\hat{\theta}(1 - \hat{\theta})}{N}}.
\]
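To get a feel for how good this approximation is, the sketch below compares the exact posterior standard deviation with the large-\(N\) formula. The function names and the example counts are illustrative assumptions; a uniform prior \(a = b = 1\) is used throughout.

```python
import math

def posterior_sd(a, b, y, n):
    """Exact standard deviation of Beta(a + y, b + n - y)."""
    alpha, beta = a + y, b + n - y
    total = alpha + beta
    return math.sqrt(alpha * beta / (total**2 * (total + 1)))

def approx_se(y, n):
    """Large-N approximation sqrt(theta_hat * (1 - theta_hat) / n)."""
    theta_hat = y / n
    return math.sqrt(theta_hat * (1.0 - theta_hat) / n)

# Same empirical fraction (0.7) at increasing sample sizes.
for y, n in [(7, 10), (70, 100), (700, 1000)]:
    exact = posterior_sd(1.0, 1.0, y, n)
    approx = approx_se(y, n)
    print(n, exact, approx, abs(exact - approx) / exact)
```

The relative gap shrinks as \(N\) grows, which is consistent with the assumption \(N \gg a, b\) used in the simplification.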
From (1) and (2), the marginal likelihood for the Bernoulli sequence \(\mathcal{D} = \{y_1, \ldots, y_N\}\) is the ratio of the normalization constants (beta functions, \(B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)\))
of the posterior and the prior:
\[
p(\mathcal{D}) = \frac{B(a+y,\, b+N-y)}{B(a, b)}.
\]
(For the Binomial summary \(y = \sum_n y_n\), the marginal likelihood acquires the additional factor \(\binom{N}{y}\).)
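Numerically, it is usually safer to evaluate this ratio on the log scale, since beta functions become extremely small for large \(N\). The sketch below uses `math.lgamma` from the Python standard library; the helper names `log_beta` and `log_marginal_likelihood` are assumptions made for the example.

```python
import math

def log_beta(a, b):
    """log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal_likelihood(a, b, y, n):
    """log p(D) for an ordered Bernoulli sequence with y heads in n tosses."""
    return log_beta(a + y, b + n - y) - log_beta(a, b)

# Illustrative values: uniform prior, 7 heads in 10 tosses.
a, b, y, n = 1.0, 1.0, 7, 10
log_p_seq = log_marginal_likelihood(a, b, y, n)

# For the Binomial summary, add the log of the combinatorial factor.
log_p_count = log_p_seq + math.log(math.comb(n, y))

print(math.exp(log_p_seq), math.exp(log_p_count))
```

With a uniform prior, the marginal probability of the head count works out to \(1/(N+1)\) for every \(y\), which gives a simple sanity check on the implementation.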
Note. In general, the marginal likelihood is an integral over \(\theta\) that is expensive or impossible to compute exactly, but a conjugate prior gives it
in closed form. Without conjugacy, we have to resort to approximation methods.
Finally, to make predictions for new observations, we use the posterior predictive distribution:
\[
p(y_{new} \mid \mathcal{D}) = \int p(y_{new} \mid \theta) \, p(\theta \mid \mathcal{D}) \, d\theta.
\]
As with the marginal likelihood, the posterior predictive distribution is difficult to compute in general, but here
the conjugate prior gives it in closed form.
For example, the probability of observing a head in the next coin toss is given by:
\[
\begin{align*}
p(y_{new}=1 \mid \mathcal{D}) &= \int_0^1 p(y_{new}=1 \mid \theta) \, p(\theta \mid \mathcal{D}) \, d\theta \\\\
&= \int_0^1 \theta \cdot \operatorname{Beta}(\theta \mid a+y, \, b+N-y) \, d\theta \quad (\because p(y_{new}=1 \mid \theta) = \theta) \\\\
&= \mathbb{E}[\theta \mid \mathcal{D}] \\\\
&= \frac{a+y}{a+b+N}.
\end{align*}
\]
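The integral above can also be read as a two-step simulation: draw \(\theta\) from the posterior, then toss a coin with that bias. The sketch below compares the closed form with such a Monte Carlo estimate, using `random.betavariate` from the Python standard library. The function names, sample size, seed, and example counts are illustrative assumptions.

```python
import random

def predictive_prob_heads(a, b, y, n):
    """Closed form p(y_new = 1 | D) = (a + y) / (a + b + n)."""
    return (a + y) / (a + b + n)

def predictive_prob_heads_mc(a, b, y, n, num_samples=100_000, seed=0):
    """Monte Carlo estimate: sample theta from the posterior, then toss once."""
    rng = random.Random(seed)
    heads = 0
    for _ in range(num_samples):
        theta = rng.betavariate(a + y, b + n - y)  # theta ~ p(theta | D)
        heads += rng.random() < theta              # y_new ~ Bernoulli(theta)
    return heads / num_samples

# Illustrative values: uniform prior, 7 heads in 10 tosses.
exact = predictive_prob_heads(1.0, 1.0, 7, 10)
estimate = predictive_prob_heads_mc(1.0, 1.0, 7, 10)
print(exact, estimate)
```

With \(a = b = 1\) the closed form becomes \((y+1)/(N+2)\), known as Laplace's rule of succession. The simulation is only needed when no closed form is available; here it serves as a check.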
Note. The hyper-parameters \(a\) and \(b\) enter every step of the inference: the posterior, the point estimate, the marginal likelihood, and the predictive distribution.
In practice, choosing them is often one of the most challenging parts of the modeling process.