Intro to Bayesian Statistics

Bayesian Inference Conjugate Prior Univariate Normal Distribution Model

Bayesian Inference

Our discussion of maximum likelihood estimation developed a frequentist procedure for fitting parameters to data, and our treatment of hypothesis testing introduced both frequentist confidence intervals and Bayesian credible intervals as ways to quantify uncertainty. The credible interval example there gave a first glimpse of the Bayesian approach — treating the parameter as a random variable and computing a posterior via a conjugate prior. We now develop the Bayesian framework systematically.

Recall the core idea: the unknown parameter \(\theta\) is treated as a random variable with a probability distribution, and the observed data \(\mathcal{D}\) are considered fixed. Prior knowledge or uncertainty about \(\theta\) is encoded in a prior distribution \(p(\theta)\), which is updated after observing data to produce a posterior distribution \(p(\theta \mid \mathcal{D})\) via Bayes' theorem: \[ p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\,p(\theta)}{p(\mathcal{D})}. \] Let us now examine each component in detail:

  1. Prior distribution \(p(\theta)\):

    Encodes our belief about the parameters \(\theta\) before observing any data. It captures prior knowledge, domain expertise, or initial uncertainty through a specific probability distribution.

  2. Likelihood \(p(\mathcal{D} \mid \theta)\):

    Quantifies how well a specific value of \(\theta\) explains the observed data \(\mathcal{D}\) — formally, the conditional distribution of the data given the parameter. As discussed in our treatment of maximum likelihood estimation, once \(\mathcal{D}\) is fixed by observation, this quantity is reinterpreted as a function of \(\theta\).

  3. Marginal likelihood (or evidence) \(p(\mathcal{D})\):

    Serves as a normalization constant that ensures the posterior integrates to one. During parameter estimation it can often be ignored, since the posterior is determined up to proportionality by the prior–likelihood product. However, it plays a central role in model selection via the Bayes factor, and explicit closed-form expressions become available for conjugate models, as we will see below.

    For continuous parameters, marginal likelihood is computed by integrating over all possible values of \(\theta\): \[ p(\mathcal{D}) = \int p(\theta')p(\mathcal{D} \mid \theta') \, d\theta'. \] In the integral for the marginal likelihood, we use \(\theta'\) instead of \(\theta\) to clarify that the integration is over the entire parameter space, not a specific parameter.

    For discrete parameters, marginal likelihood is given by: \[ p(\mathcal{D}) = \sum_{\theta'} p(\theta')p(\mathcal{D} \mid \theta'). \]

Conjugate Prior

One of the main challenges in Bayesian inference is choosing an appropriate prior. In general, finding the exact normalizing constant of the posterior requires evaluating the integral in the marginal likelihood, which is often intractable. What if there are families of priors for which the posterior has a known closed form without this difficulty?

Definition: Conjugate Prior

A prior \(p(\theta) \in \mathcal{F}\) is said to be a conjugate prior for a likelihood function \(p(\mathcal{D} \mid \theta)\) if the posterior is in the same parameterized family as the prior: \[ p(\theta \mid \mathcal{D}) \in \mathcal{F}. \]

The notion of conjugacy is defined relative to a particular parametric family \(\mathcal{F}\) and a likelihood function. Familiar conjugate pairs — such as Beta–Binomial, Normal–Normal (with known variance), and Inverse-Gamma–Normal (with known mean) — will be derived in the examples below; the abstract definition above is best understood through these concrete instances.

Through the following example, we introduce basic Bayesian inference ideas and an example of a conjugate prior.

Example 1: Beta-Binomial Model

Consider tossing a coin \(N\) times. Let \(\theta \in [0, 1]\) be a chance of getting head. We record the outcomes as \(\mathcal{D} = \{y_n \in \{0, 1\} : n = 1, \ldots, N\}\). We assume the data are iid.

If we consider a sequence of coin tosses, the likelihood can be written as a Bernoulli likelihood model: \[ \begin{align*} p(\mathcal{D} \mid \theta) &= \prod_{n = 1}^N \theta^{y_n}(1 - \theta)^{1-y_n} \\\\ &= \theta^{N_1}(1 - \theta)^{N_0} \end{align*} \] where \(N_1\) and \(N_0\) are the number of heads and tails respectively. (Sample size: \(N_1 + N_0 = N\))

Equivalently, since the two models differ only by a combinatorial factor that does not depend on \(\theta\), we can summarize the data via the head count \(y = N_1\) and use the Binomial likelihood: \[ \begin{align*} p(\mathcal{D} \mid \theta) &= \operatorname{Bin}(y \mid N, \theta) \\\\ &= \binom{N}{y} \theta^y (1 - \theta)^{N - y} \\\\ &\propto \theta^y (1 - \theta)^{N - y}. \end{align*} \] The combinatorial factor \(\binom{N}{y}\) does not depend on \(\theta\), so it will not affect posterior inference up to proportionality.

Next, we have to specify a prior. If we know nothing about the parameter, an uninformative prior can be used: \[ p(\theta) = \operatorname{Unif}(\theta \mid 0, 1). \] However, in this example, using the Beta distribution, we can represent the prior as follows: \[ \begin{align*} p(\theta) = \operatorname{Beta}(\theta \mid a, b) &= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1} \tag{1} \\\\ &\propto \theta^{a-1}(1-\theta)^{b-1} \end{align*} \] where \(a, b > 0\) are usually called hyper-parameters.(Our main parameter is \(\theta\).)

Note. If \(a = b = 1\), we get the uninformative prior.

Applying Bayes' theorem and absorbing the marginal likelihood \(p(\mathcal{D})\) into the normalization (since it does not depend on \(\theta\)), the posterior is proportional to the product of the likelihood and the prior: \[ \begin{align*} p(\theta \mid \mathcal{D}) &\propto [\theta^{y}(1 - \theta)^{N-y}] \cdot [\theta^{a-1}(1-\theta)^{b-1}] \\\\ &= \theta^{(a+y)-1}(1-\theta)^{(b+N-y)-1} \\\\ &\propto \operatorname{Beta}(\theta \mid a+y, \, b+N-y) \\\\ &= \frac{\Gamma(a+b+N)}{\Gamma(a+y)\Gamma(b+N-y)}\theta^{a+y-1}(1-\theta)^{b+N-y-1}. \tag{2} \end{align*} \] Here, the posterior has the same functional form as the prior. Thus, the beta distribution is the conjugate prior for the binomial distribution.

Once we got the posterior distribution, for example, we can use the posterior mean \(\bar{\theta}\) as a point estimate of \(\theta\). Since the posterior is \(\operatorname{Beta}(\theta \mid a+y, \, b+N-y)\), its mean is given by the standard formula for the Beta distribution: \[ \bar{\theta} = \mathbb{E}[\theta \mid \mathcal{D}] = \frac{a+y}{a+b+N}. \] Note. By adjusting hyper-parameters \(a\) and \(b\), we can control the influence of the prior on the posterior.

If \(a\) and \(b\) are small, the posterior mean will closely reflect the data: \[ \bar{\theta} \approx \frac{y}{N} = \hat{\theta}_{\text{MLE}} \] while if \(a\) and \(b\) are large, the posterior mean will be more influenced by the prior.

Often we need to check the standard error of our estimate, which is the posterior standard deviation. Again invoking the Beta variance formula, we have: \[ \begin{align*} \operatorname{SE}(\theta) &= \sqrt{\operatorname{Var}[\theta \mid \mathcal{D}]} \\\\ &= \sqrt{\frac{(a+y)(b+N-y)}{(a+b+N)^2(a+b+N+1)}}. \end{align*} \] Here, in the data-rich regime where \(N \gg a, b\) and the empirical fraction \(\hat{\theta} = y/N\) is bounded away from 0 and 1, we can simplify the posterior variance as follows: \[ \begin{align*} \operatorname{Var}[\theta \mid \mathcal{D}] &\approx \frac{y(N-y)}{N^3} \\\\ &= \frac{y}{N^2} - \frac{y^2}{N^3} \\\\ &= \frac{\hat{\theta}(1 - \hat{\theta})}{N} \end{align*} \] where \(\hat{\theta} = \frac{y}{N}\) is the MLE.

Thus, the standard error is given by \[ \operatorname{SE}(\theta) \approx \sqrt{\frac{\hat{\theta}(1 - \hat{\theta})}{N}}. \]

From (1) and (2), the marginal likelihood for the Bernoulli sequence \(\mathcal{D} = \{y_1, \ldots, y_N\}\) is given by the ratio of normalization constants(beta functions) for the prior and posterior: \[ p(\mathcal{D}) = \frac{B(a+y,\, b+N-y)}{B(a, b)}. \] (For the Binomial summary \(y = \sum_n y_n\), the marginal likelihood acquires the additional factor \(\binom{N}{y}\).) Note. In general, computing the marginal likelihood is too expensive or impossible, but the conjugate prior allows us to get the exact marginal likelihood easily. Otherwise, we have to introduce some approximation methods.

Finally, to make predictions for new observations, we use posterior predictive distribution: \[ p(x_{new} \mid \mathcal{D}) = \int p(x_{new} \mid \theta) p(\theta \mid \mathcal{D}) d\theta. \] Again, like computing the marginal likelihood, it is difficult to compute posterior predictive distribution, but in this case, we can get it easily due to the conjugate prior.

For example, the probability of observing a head in the next coin toss is given by: \[ \begin{align*} p(y_{new}=1 \mid \mathcal{D}) &= \int_0^1 p(y_{new}=1 \mid \theta) \, p(\theta \mid \mathcal{D}) \, d\theta \\\\ &= \int_0^1 \theta \cdot \operatorname{Beta}(\theta \mid a+y, \, b+N-y) \, d\theta \quad (\because p(y_{new}=1 \mid \theta) = \theta) \\\\ &= \mathbb{E}[\theta \mid \mathcal{D}] \\\\ &= \frac{a+y}{a+b+N}. \end{align*} \] Note. As you can see, the hyper-parameters \(a\) and \(b\) are critical in the whole process of our inference. In practice, setting up hyper-parameters is one of the most challenging aspects of the project.

Hyperparameters as Pseudo-Counts

In machine learning, the hyper-parameters \(a\) and \(b\) of the Beta prior can be highly intuitive: they act as pseudo-observations or effective prior counts. Before flipping the coin even once, an AI agent with a prior of \(\operatorname{Beta}(a, b)\) behaves exactly as if it has already observed \(a\) heads and \(b\) tails. This structural equivalence is a common trick in natural language processing (like Laplace smoothing in Naive Bayes), preventing models from assigning zero probability to unseen events.

Univariate Normal Distribution Model

We now apply the same Bayesian framework to the normal distribution. Given the univariate normal distribution \(\mathcal{N}(\mu, \sigma^2)\), there are three cases for our target posterior:

  1. \(p(\mu \mid \mathcal{D}, \sigma^2)\): Variance \(\sigma^2\) is known.
  2. \(p(\sigma^2 \mid \mathcal{D}, \mu)\): Mean \(\mu\) is known.
  3. \(p(\mu, \sigma^2 \mid \mathcal{D})\): Both are unknown.

Here, we discuss cases 1 and 2. (Case 3 requires the normal-inverse-gamma conjugate prior and is treated in more advanced references.)

Example 2: Normal distribution model with known variance \(\sigma^2\)

In the case where \(\sigma^2\) is known constant, the likelihood for \(\mu\) is given by \[ \begin{align*} p(\mathcal{D} \mid \mu) &= \prod_{n=1}^N \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left(-\frac{(y_n - \mu)^2}{2\sigma^2}\right)\\\\ &= \left(\frac{1}{\sqrt{2\pi \sigma^2}}\right)^N \exp \left(- \sum_{n=1}^N \frac{(y_n - \mu)^2}{2\sigma^2}\right) \\\\ &\propto \exp \left(-\frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mu)^2 \right). \end{align*} \]

Following the Beta–Binomial pattern, we conjecture that another normal distribution might serve as a conjugate prior: \[ \begin{align*} p(\mu) &= \mathcal{N}(\mu \mid \, \mu_0, \sigma_0^2) \\\\ &\propto \exp \left(-\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right). \end{align*} \]

Applying Bayes' theorem and absorbing \(p(\mathcal{D})\) into the normalization, we compute the posterior as follows: \[ \begin{align*} p(\mu \mid \, \mathcal{D}, \sigma^2) &\propto \exp \left(-\frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mu)^2 \right) \cdot \exp \left(-\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right) \\\\ &= \exp \left\{-\frac{1}{2}\left(\frac{N}{\sigma^2}+\frac{1}{\sigma_0^2} \right)\mu^2 +\left(\frac{\sum y_n}{\sigma^2} +\frac{\mu_0}{\sigma_0^2} \right)\mu + \left(-\frac{1}{2\sigma^2} \sum_{n=1}^N y_n^2 - \frac{\mu_0^2}{2\sigma_0^2} \right) \right\} \end{align*} \] Since \(-\frac{1}{2\sigma^2} \sum_{n=1}^N y_n^2 - \frac{\mu_0^2}{2\sigma_0^2}\) is constant with respect to \(\mu\), we ignore it.

Let \(A = \left(\frac{N}{\sigma^2}+\frac{1}{\sigma_0^2} \right)\) and \(B = \left(\frac{\sum y_n}{\sigma^2} +\frac{\mu_0}{\sigma_0^2} \right)\), and completing the square, we get: \[ \begin{align*} p(\mu \mid \, \mathcal{D}, \sigma^2) &\propto \exp \left\{-\frac{1}{2}A\mu^2 + B\mu \right\} \\\\ &= \exp \left\{-\frac{1}{2}A \left(\mu - \frac{B}{A} \right)^2 + \frac{B^2}{2A}\right\}. \end{align*} \]

Since the term \(\frac{B^2}{2A}\) does not depend on \(\mu\), and \(-\frac{1}{2}A \left(\mu - \frac{B}{A} \right)^2\) is a quadratic form of an univariate normal distribution, we conclude that \[ \begin{align*} p(\mu \mid \, \mathcal{D}, \sigma^2) = \mathcal{N}(\mu \mid \, \mu_N, \sigma_N^2 ) \end{align*} \] where the posterior variance is given by: \[ \begin{align*} \sigma_N^2 &= \frac{1}{A} \\\\ &= \frac{\sigma^2 \sigma_0^2}{N\sigma_0^2 + \sigma^2} \end{align*} \] and the posterior mean is given by: \[ \begin{align*} \mu_N &= \frac{B}{A} \\\\ &= \sigma_N^2 \left(\frac{\mu_0}{\sigma_0^2} + \frac{N \bar{y}}{\sigma^2} \right) \\\\ &= \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\bar{y} \\\\ \end{align*} \] where \(\bar{y} = \frac{1}{N}\sum_{n=1}^N y_n\) is the empirical mean. The posterior is again normal, confirming that the normal family is conjugate to the normal likelihood with known variance.

Equivalently, this is a precision-weighted average: the posterior mean balances the prior mean \(\mu_0\) and the empirical mean \(\bar{y}\), weighted by their respective precisions \(1/\sigma_0^2\) and \(N/\sigma^2\). The posterior precision \(1/\sigma_N^2 = 1/\sigma_0^2 + N/\sigma^2\) is simply the sum of the prior and data precisions.

Note. In this case, \(\mu_0\) and \(\sigma_0^2\) are hyper-parameters of the prior distribution. For example, as you can see the form of the posterior mean \(\mu_N\), a small \(\sigma_0^2\) gives more weight to the prior mean \(\mu_0\), while a large \(\sigma_0^2\) reduces the influence of the prior, making the posterior more rely on the data.

Online Learning via Sequential Updates

The true power of conjugate priors in computer science is memory efficiency. In online learning, we do not need to store massive historical datasets in memory. Because the posterior and prior share the same functional form, today's posterior simply becomes tomorrow's prior. For sequential data points \(y_1, y_2\), the update rule is structurally elegant: \[ p(\mu \mid y_1, y_2) \propto p(y_2 \mid \mu) p(\mu \mid y_1) \] A ML system only needs to maintain the current sufficient statistics (like \(\mu_N\) and \(\sigma_N^2\)) and update them as new data streams in, demonstrating a perfect synergy between mathematical structure and computational limits.

Example 3: Normal distribution model with known mean \(\mu\)

In the case where \(\mu\) is known constant, the likelihood for \(\sigma^2\) is given by \[ \begin{align*} p(\mathcal{D} \mid \sigma^2) &= \frac{1}{(2 \pi \sigma^2)^{\frac{N}{2}}} \exp \left(-\frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mu)^2 \right) \\\\ &\propto (\sigma^2)^{-\frac{N}{2}} \exp \left(-\frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mu)^2 \right), \end{align*} \] where the \(\sigma^2\)-independent factor \((2\pi)^{-N/2}\) has been dropped. Following the same pattern as before, we conjecture that the inverse gamma distribution might serve as a conjugate prior: \[ \begin{align*} p(\sigma^2) &= \operatorname{InvGamma}(\sigma^2 \mid \, a, b) \\\\ &= \frac{b^a}{\Gamma(a)}(\sigma^2)^{-(a+1)} \exp \left(- \frac{b}{\sigma^2}\right) \\\\ &\propto (\sigma^2)^{-(a+1)} \exp \left(- \frac{b}{\sigma^2}\right) \end{align*} \] where \(a > 0\) is the shape parameter and \(b > 0\) is the scale parameter.

The posterior is given by: \[ \begin{align*} p(\sigma^2 \mid \, \mathcal{D}, \mu) &\propto \left\{ (\sigma^2)^{-\frac{N}{2}} \exp \left(-\frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mu)^2 \right) \right\} \cdot \left\{ (\sigma^2)^{-(a+1)} \exp \left(- \frac{b}{\sigma^2}\right) \right\} \\\\ &= (\sigma^2)^{-(a+\frac{N}{2}+1)} \exp \left\{ - \frac{1}{\sigma^2}\left(b + \frac{1}{2} \sum_{n=1}^N (y_n - \mu)^2 \right) \right\}. \end{align*} \] Letting \(\hat{a} = a + \frac{N}{2}\) and \(\hat{b} = b + \frac{1}{2} \sum_{n=1}^N (y_n - \mu)^2\), the right-hand side is the kernel of an inverse-gamma density. Since the posterior is determined by its proportionality class up to normalization, we conclude \[ p(\sigma^2 \mid \, \mathcal{D}, \mu) = \operatorname{InvGamma}(\sigma^2 \mid \, \hat{a}, \hat{b}), \] confirming that the inverse gamma family is conjugate to the normal likelihood with known mean.

Equivalently, the inverse gamma family can be reparametrized as the scaled inverse chi-squared distribution: \[ \begin{align*} p(\sigma^2) &= \text{Scaled-Inv-}\chi^2(\sigma^2 \mid \, \nu_0, \sigma_0^2) \\\\ &= \operatorname{InvGamma}\!\left(\sigma^2 \mid \, \frac{\nu_0}{2}, \frac{\nu_0 \sigma_0^2}{2}\right) \\\\ &\propto (\sigma^2)^{-\frac{\nu_0}{2}-1} \exp \left(- \frac{\nu_0 \sigma_0^2}{2\sigma^2} \right) \end{align*} \] where \(\nu_0\) is the prior degrees of freedom. Since this is the same family in a different parametrization, the conjugacy property carries over directly.

Thus, the posterior in this parametrization is given by: \[ p(\sigma^2 \mid \, \mathcal{D}, \mu) = \text{Scaled-Inv-}\chi^2\!\left(\sigma^2 \mid \, \nu_0 + N, \, \frac{\nu_0 \sigma_0^2 + \sum_{n=1}^N (y_n - \mu)^2}{\nu_0 + N}\right). \] A benefit of using this form is that the hyperparameters \((\nu_0, \sigma_0^2)\) admit direct interpretation: \(\nu_0\) plays the role of a prior sample size (or prior degrees of freedom), and \(\sigma_0^2\) is a prior point estimate of the variance. The \(\operatorname{InvGamma}(a, b)\) form requires the indirect identification \(a = \nu_0/2\) and \(b = \nu_0 \sigma_0^2 / 2\).

Anomaly Detection and Uncertainty

This model is a cornerstone for quality control and process monitoring. In industrial manufacturing, the "target" mean of a process is often known from historical data, while the variance represents the stability of the process.

In machine learning, this is the fundamental principle of anomaly detection. Unlike simple frequentist methods that use a point estimate for variance, the Bayesian approach evaluates new data points against the posterior predictive distribution. By integrating over the inverse-gamma posterior of \(\sigma^2\), the resulting predictive distribution — a Student's t-distribution — naturally accounts for our uncertainty about the variance itself. This makes the system far more robust: when we have very little data, the model is "cautious" and requires stronger evidence to flag a point as an anomaly.

The Beta prior was conjugate for the Binomial likelihood, the Normal prior was conjugate for the Normal likelihood with known variance, and the Inverse-Gamma prior was conjugate for the Normal likelihood with known mean. Is there a unifying structure that explains why these conjugacies exist? In the upcoming chapter on the Exponential Family, we show that all these distributions belong to a single parametric family — the exponential family — whose canonical form guarantees the existence of conjugate priors and reduces MLE to moment matching.