One of the main challenges in Bayesian inference is choosing an appropriate prior. In general, finding the
exact normalizing constant of the posterior requires evaluating the integral in the marginal likelihood, which
is often intractable. What if there were families of priors for which the posterior had a known closed form, avoiding this difficulty altogether?
The following example introduces the basic ideas of Bayesian inference and illustrates a conjugate prior.
Example 1: Beta-Binomial Model
Consider tossing a coin \(N\) times. Let \(\theta \in [0, 1]\) be the probability of getting heads. We record the outcomes as
\(\mathcal{D} = \{y_n \in \{0, 1\} : n = 1 : N\}\). We assume the data are iid.
If we consider a sequence of coin tosses, the likelihood can be written as a Bernoulli likelihood model:
\[
\begin{align*}
p(\mathcal{D} \mid \theta) &= \prod_{n = 1}^N \theta^{y_n}(1 - \theta)^{1-y_n} \\\\
&= \theta^{N_1}(1 - \theta)^{N_0}
\end{align*}
\]
where \(N_1\) and \(N_0\) are the number of heads and tails respectively. (Sample size: \(N_1 + N_0 = N\))
Alternatively, we can consider the binomial likelihood model, which has the form:
\[
\begin{align*}
p(\mathcal{D} \mid \theta) &= \text{Bin } (y \mid N, \theta) \\\\
&= \binom{N}{y} \theta^y (1 - \theta)^{N - y} \\\\
&\propto \theta^y (1 - \theta)^{N - y}
\end{align*}
\]
where \(y\) is the number of heads.
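As a quick sanity check, here is a minimal Python sketch (assuming NumPy and SciPy are available, and using a made-up sequence of 10 tosses) showing that the Bernoulli product and the binomial PMF agree up to the binomial coefficient:
```python
import numpy as np
from scipy.stats import binom

# Assumed toy data: a sequence of 10 coin tosses (1 = heads, 0 = tails).
D = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
N, y = len(D), int(D.sum())   # N = 10 tosses, y = 6 heads
theta = 0.5                   # a candidate parameter value

# Bernoulli likelihood: product over individual tosses.
bernoulli_lik = np.prod(theta**D * (1 - theta)**(1 - D))

# Binomial likelihood: depends on the data only through y.
binomial_lik = binom.pmf(y, N, theta)

print(bernoulli_lik)                 # theta^y (1 - theta)^(N - y)
print(binomial_lik)                  # C(N, y) * theta^y (1 - theta)^(N - y)
print(binomial_lik / bernoulli_lik)  # ≈ C(10, 6) = 210
```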
Next, we have to specify a prior. If we know nothing about the parameter, an
uninformative prior can be used:
\[
p(\theta) = \text{Unif }(\theta \mid 0, 1).
\]
However, "in this example", using Beta distribution,
we can represent the prior as follows:
\[
\begin{align*}
p(\theta) = \text{Beta }(\theta \mid a, b) &= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1} \tag{1} \\\\
&\propto \theta^{a-1}(1-\theta)^{b-1}
\end{align*}
\]
where \(a, b > 0\) are usually called hyper-parameters (the main parameter of interest is still \(\theta\)).
Note: If \(a = b = 1\), the Beta prior reduces to the uniform (uninformative) prior above.
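A small sketch of this note, assuming SciPy is available and evaluating the density on an arbitrary grid: Beta(1, 1) is flat on \([0, 1]\), while larger hyper-parameters concentrate the prior.
```python
import numpy as np
from scipy.stats import beta

grid = np.linspace(0.01, 0.99, 5)

# Beta(1, 1) density: Gamma(2)/(Gamma(1)Gamma(1)) * theta^0 (1 - theta)^0 = 1.
print(beta.pdf(grid, 1, 1))   # [1. 1. 1. 1. 1.] -> the uniform prior on [0, 1]

# A more informative choice, e.g. Beta(5, 5), concentrates mass around 0.5.
print(beta.pdf(grid, 5, 5))
```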
Using Bayes' rule, the posterior is proportional to the product of the likelihood and the prior:
\[
\begin{align*}
p(\theta \mid \mathcal{D}) &\propto [\theta^{y}(1 - \theta)^{N-y}] \cdot [\theta^{a-1}(1-\theta)^{b-1}] \\\\
&= \theta^{a+y-1}(1-\theta)^{b+N-y-1} \\\\
&\propto \text{Beta }(\theta \mid a+y, \, b+N-y) \\\\
&= \frac{\Gamma(a+b+N)}{\Gamma(a+y)\Gamma(b+N-y)}\theta^{a+y-1}(1-\theta)^{b+N-y-1}. \tag{2}
\end{align*}
\]
Here, the posterior has the same functional form as the prior. Thus, the beta distribution is the
conjugate prior for the binomial distribution.
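To illustrate the conjugate update numerically, here is a minimal sketch with assumed hyper-parameters \(a = b = 2\) and assumed counts \(y = 6\), \(N = 10\); it verifies that likelihood times prior is proportional to a Beta density with the updated parameters:
```python
import numpy as np
from scipy.stats import beta

a, b = 2.0, 2.0   # assumed prior hyper-parameters
N, y = 10, 6      # assumed counts: 6 heads in 10 tosses

# Conjugate update: Beta(a, b) prior + binomial data -> Beta(a + y, b + N - y).
a_post, b_post = a + y, b + N - y

# Check: likelihood * prior is proportional to the Beta(a_post, b_post) density,
# i.e. the ratio is constant across a grid of theta values.
grid = np.linspace(0.05, 0.95, 7)
unnormalized = grid**y * (1 - grid)**(N - y) * beta.pdf(grid, a, b)
ratio = unnormalized / beta.pdf(grid, a_post, b_post)
print(a_post, b_post)                # 8.0 6.0
print(np.allclose(ratio, ratio[0]))  # True: same functional form as the prior family
```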
Once we have the posterior distribution, we can, for example, use the posterior mean \(\bar{\theta}\) as a point estimate of \(\theta\):
\[
\begin{align*}
\bar{\theta} = \mathbb{E }[\theta \mid \mathcal{D}] &= \frac{a+y}{(a+y) + (b+N-y)} \\\\
&= \frac{a+y}{a+b+N}.
\end{align*}
\]
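Continuing with the same assumed toy values (\(a = b = 2\), \(y = 6\), \(N = 10\)), a quick comparison of the posterior mean against the MLE:
```python
a, b = 2.0, 2.0   # assumed prior hyper-parameters
N, y = 10, 6      # assumed counts: 6 heads in 10 tosses

post_mean = (a + y) / (a + b + N)   # posterior mean
mle = y / N                         # maximum likelihood estimate

print(post_mean)   # 8/14 ≈ 0.571, shrunk from the MLE toward the prior mean 0.5
print(mle)         # 0.6
```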
Note:
By adjusting hyper-parameters \(a\) and \(b\), we can control the influence of the prior on the posterior.
If \(a\) and \(b\) are small relative to the observed counts, the posterior mean closely reflects the data:
\[
\bar{\theta} \approx \frac{y}{N} = \hat{\theta}_{MLE}
\]
while if \(a\) and \(b\) are large, the posterior mean is pulled toward the prior mean \(\frac{a}{a+b}\).
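A quick numerical illustration of this note, with assumed counts \(y = 6\), \(N = 10\), comparing a weak Beta(1, 1) prior against a strong Beta(50, 50) prior:
```python
N, y = 10, 6   # assumed counts: 6 heads in 10 tosses, so the MLE is 0.6

for a, b in [(1, 1), (50, 50)]:
    post_mean = (a + y) / (a + b + N)
    print(f"Beta({a}, {b}) prior -> posterior mean {post_mean:.3f}")

# Beta(1, 1)   -> 7/12   ≈ 0.583, close to the MLE 0.6
# Beta(50, 50) -> 56/110 ≈ 0.509, pulled strongly toward the prior mean 0.5
```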
Often we need to check the standard error of our estimate, which is the posterior standard deviation:
\[
\begin{align*}
\text{SE }(\theta) &= \sqrt{\text{Var }[\theta \mid \mathcal{D}]} \\\\
&= \sqrt{\frac{(a+y)(b+N-y)}{(a+b+N)^2(a+b+N+1)}}
\end{align*}
\]
Here, if \(N \gg a, b\), we can simplify the posterior variance as follows:
\[
\begin{align*}
\text{Var }[\theta \mid \mathcal{D}] &\approx \frac{y(N-y)}{N^2 \cdot N} \\\\
&= \frac{y}{N^2} - \frac{y^2}{N^3} \\\\
&= \frac{\hat{\theta}(1 - \hat{\theta})}{N}
\end{align*}
\]
where \(\hat{\theta} = \frac{y}{N}\) is the MLE.
Thus, the standard error is given by
\[
\text{SE }(\theta) \approx \sqrt{\frac{\hat{\theta}(1 - \hat{\theta})}{N}}.
\]
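The sketch below (with assumed values \(a = b = 2\), \(y = 60\), \(N = 100\), chosen so that \(N \gg a, b\)) compares the exact posterior standard deviation with this approximation:
```python
import math

a, b = 2.0, 2.0   # assumed prior hyper-parameters
N, y = 100, 60    # assumed counts with N much larger than a and b

# Exact posterior standard deviation of Beta(a + y, b + N - y).
alpha, beta_ = a + y, b + N - y
var_exact = alpha * beta_ / ((alpha + beta_)**2 * (alpha + beta_ + 1))

# Large-N approximation based on the MLE.
theta_hat = y / N
var_approx = theta_hat * (1 - theta_hat) / N

print(math.sqrt(var_exact))    # ≈ 0.048
print(math.sqrt(var_approx))   # ≈ 0.049
```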
From (1) and (2), the marginal likelihood is given by the ratio of the normalization constants (beta functions) of the prior and the posterior:
\[
p(\mathcal{D}) = \frac{B(a+y,\, b+N-y)}{B(a, b)}.
\]
Note: In general, computing the marginal likelihood is expensive or intractable, but a conjugate prior lets us obtain it exactly and easily; otherwise, we have to resort to approximation methods.
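To make this concrete, the following sketch computes the closed-form marginal likelihood in log space (using scipy.special.betaln, with the same assumed toy values as above) and cross-checks it against a numerical integration of likelihood times prior:
```python
import numpy as np
from scipy.special import betaln
from scipy.stats import beta
from scipy.integrate import quad

a, b = 2.0, 2.0   # assumed prior hyper-parameters
N, y = 10, 6      # assumed counts

# p(D) = B(a + y, b + N - y) / B(a, b), computed in log space for stability.
log_marginal = betaln(a + y, b + N - y) - betaln(a, b)
print(np.exp(log_marginal))

# Cross-check: numerically integrate likelihood * prior over [0, 1].
integrand = lambda t: t**y * (1 - t)**(N - y) * beta.pdf(t, a, b)
print(quad(integrand, 0, 1)[0])   # matches exp(log_marginal)
```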
Finally, to make predictions for new observations, we use the posterior predictive distribution:
\[
p(y_{new} \mid \mathcal{D}) = \int p(y_{new} \mid \theta) p(\theta \mid \mathcal{D}) d\theta.
\]
As with the marginal likelihood, the posterior predictive distribution is generally difficult to compute, but here the conjugate prior again gives it in closed form.
For example, the probability of observing a head in the next coin toss is given by:
\[
\begin{align*}
p(y_{new}=1 \mid \mathcal{D}) &= \int_0 ^1 p(y_{new}=1 \mid \theta) p(\theta \mid \mathcal{D}) d\theta \\\\
&= \int_0 ^1 \theta \text{Beta }(\theta \mid a+y, \, b+N-y) d\theta \\\\
&= \mathbb{E }[\theta \mid \mathcal{D}] \\\\
&= \frac{a+y}{a+b+N}.
\end{align*}
\]
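A final sketch, again with the assumed toy values \(a = b = 2\), \(y = 6\), \(N = 10\), confirming that the closed-form predictive probability matches a direct numerical integration of the posterior:
```python
from scipy.stats import beta
from scipy.integrate import quad

a, b = 2.0, 2.0   # assumed prior hyper-parameters
N, y = 10, 6      # assumed counts

# Closed form: P(y_new = 1 | D) = posterior mean = (a + y) / (a + b + N).
p_heads = (a + y) / (a + b + N)
print(p_heads)   # 8/14 ≈ 0.571

# Cross-check: integrate theta against the posterior density.
predictive = quad(lambda t: t * beta.pdf(t, a + y, b + N - y), 0, 1)[0]
print(predictive)   # same value numerically
```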
Note: As you can see, the hyper-parameters \(a\) and \(b\) influence every stage of the inference. In practice, choosing them is one of the most challenging parts of a project.