Intro to Bayesian Statistics

Bayesian Inference

In Part 9, we developed maximum likelihood estimation for fitting parameters to data, and in Part 10, we introduced both frequentist confidence intervals and Bayesian credible intervals as ways to quantify uncertainty. The credible interval example there gave a first glimpse of the Bayesian approach - treating the parameter as a random variable and computing a posterior via a conjugate prior. In the previous part, we proved the convergence theorems that guarantee these estimation procedures are well-founded. We now develop the Bayesian framework systematically.

Recall the core idea: the unknown parameter \(\theta\) is treated as a random variable with a probability distribution, and the observed data \(\mathcal{D}\) are considered fixed. Prior knowledge or uncertainty about \(\theta\) is encoded in a prior distribution \(p(\theta)\), which is updated after observing data to produce a posterior distribution \(p(\theta \mid \mathcal{D})\) via Bayes' theorem: \[ p(\theta \mid \mathcal{D}) = \frac{p(\theta)\,p(\mathcal{D} \mid \theta)}{p(\mathcal{D})}. \] Let us now examine each component in detail:

Prior distribution \(p(\theta)\):

Encodes our belief about the parameters \(\theta\) before observing any data. It captures prior knowledge, domain expertise, or initial uncertainty through a specific probability distribution.
Likelihood \(p(\mathcal{D} \mid \theta)\):

Quantifies how well a specific value of \(\theta\) explains the observed data \(\mathcal{D}\). This term represents the underlying model of the data-generating process.
Marginal likelihood (or evidence) \(p(\mathcal{D})\):

Serves as a normalization constant that ensures the posterior integrates to one. While often ignored during parameter estimation, it is indispensable for model selection, as it allows for the objective comparison of different probabilistic models.

For continuous parameters, marginal likelihood is computed by integrating over all possible values of \(\theta\): \[ p(\mathcal{D}) = \int p(\theta')p(\mathcal{D} \mid \theta') \, d\theta'. \] In the integral for the marginal likelihood, we use \(\theta'\) instead of \(\theta\) to clarify that the integration is over the entire parameter space, not a specific parameter.

For discrete parameters, marginal likelihood is given by: \[ p(\mathcal{D}) = \sum_{\theta'} p(\theta')p(\mathcal{D} \mid \theta'). \]

Conjugate Prior

One of the main challenges in Bayesian inference is choosing an appropriate prior. In general, finding the exact normalizing constant of the posterior requires evaluating the integral in the marginal likelihood, which is often intractable. What if there are families of priors for which the posterior has a known closed form without this difficulty?

Definition: Conjugate Prior

A prior \(p(\theta) \in \mathcal{F}\) is said to be a conjugate prior for a likelihood function \(p(\mathcal{D} \mid \theta)\) if the posterior is in the same parameterized family as the prior: \[ p(\theta \mid \mathcal{D}) \in \mathcal{F}. \]

Through the following example, we introduce basic Bayesian inference ideas and an example of a conjugate prior.

Example 1: Beta-Binomial Model

Consider tossing a coin \(N\) times. Let \(\theta \in [0, 1]\) be a chance of getting head. We record the outcomes as \(\mathcal{D} = \{y_n \in \{0, 1\} : n = 1 : N\}\). We assume the data are iid.

If we consider a sequence of coin tosses, the likelihood can be written as a Bernoulli likelihood model: \[ \begin{align*} p(\mathcal{D} \mid \theta) &= \prod_{n = 1}^N \theta^{y_n}(1 - \theta)^{1-y_n} \\\\ &= \theta^{N_1}(1 - \theta)^{N_0} \end{align*} \] where \(N_1\) and \(N_0\) are the number of heads and tails respectively. (Sample size: \(N_1 + N_0 = N\))

Alternatively, we can consider the Binomial likelihood model: The likelihood has the following form: \[ \begin{align*} p(\mathcal{D} \mid \theta) &= \text{Bin } (y \mid N, \theta) \\\\ &= \begin{pmatrix} N \\ y \end{pmatrix} \theta^y (1 - \theta)^{N - y} \\\\ &\propto \theta^y (1 - \theta)^{N - y} \end{align*} \] where \(y\) is the number of heads.

Next, we have to specify a prior. If we know nothing about the parameter, an uninformative prior can be used: \[ p(\theta) = \text{Unif }(\theta \mid 0, 1). \] However, "in this example", using Beta distribution, we can represent the prior as follows: \[ \begin{align*} p(\theta) = \text{Beta }(\theta \mid a, b) &= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1} \tag{1} \\\\ &\propto \theta^{a-1}(1-\theta)^{b-1} \end{align*} \] where \(a, b > 0\) are usually called hyper-parameters.(Our main parameter is \(\theta\).)

Note: If \(a = b = 1\), we get the uninformative prior.

Using Bayes' rule, the posterior is proportional to the product of the likelihood and the prior: \[ \begin{align*} p(\theta \mid \mathcal{D}) &\propto [\theta^{y}(1 - \theta)^{N-y}] \cdot [\theta^{a-1}(1-\theta)^{b-1}] \\\\ &\propto \text{Beta }(\theta \mid a+y, \, b+N-y) \\\\ &= \frac{\Gamma(a+b+N)}{\Gamma(a+y)\Gamma(b+N-y)}\theta^{a+y-1}(1-\theta)^{b+N-y-1}. \tag{2} \end{align*} \] Here, the posterior has the same functional form as the prior. Thus, the beta distribution is the conjugate prior for the binomial distribution.

Once we got the posterior distribution, for example, we can use posterior mean \(\bar{\theta}\) as a point estimate of \(\theta\): \[ \begin{align*} \bar{\theta} = \mathbb{E }[\theta \mid \mathcal{D}] &= \frac{a+y}{(a+y) + (b+N-y)} \\\\ &= \frac{a+y}{a+b+N}. \end{align*} \] Note: By adjusting hyper-parameters \(a\) and \(b\), we can control the influence of the prior on the posterior.

If \(a\) and \(b\) are small, the posterior mean will closely reflect the data: \[ \bar{\theta} \approx \frac{y}{N} = \hat{\theta}_{MLE} \] while if \(a\) and \(b\) are large, the posterior mean will be more influenced by the prior.

Often we need to check the standard error of our estimate, which is the posterior standard deviation: \[ \begin{align*} \text{SE }(\theta) &= \sqrt{\text{Var }[\theta \mid \mathcal{D}]} \\\\ &= \sqrt{\frac{(a+y)(b+N-y)}{(a+b+N)^2(a+b+N+1)}} \end{align*} \] Here, if \(N \gg a, b\), we can simplify the posterior variance as follows: \[ \begin{align*} \text{Var }[\theta \mid \mathcal{D}] &\approx \frac{y(N-y)}{(N)^2 N} \\\\ &= \frac{y}{N^2} - \frac{y^2}{N^3} \\\\ &= \frac{\hat{\theta}(1 - \hat{\theta})}{N} \end{align*} \] where \(\hat{\theta} = \frac{y}{N}\) is the MLE.

Thus, the standard error is given by \[ \text{SE }(\theta) \approx \sqrt{\frac{\hat{\theta}(1 - \hat{\theta})}{N}}. \]

From (1) and (2), the marginal likelihood is given by the ratio of normalization constants(beta functions) for the prior and posterior: \[ p(\mathcal{D}) = \frac{B(a+y,\, b+N-y)}{B(a, b)}. \] Note: In general, computing the marginal likelihood is too expensive or impossible, but the conjugate prior allows us to get the exact marginal likelihood easily. Otherwise, we have to introduce some approximation methods.

Finally, to make predictions for new observations, we use posterior predictive distribution: \[ p(x_{new} \mid \mathcal{D}) = \int p(x_{new} \mid \theta) p(\theta \mid \mathcal{D}) d\theta. \] Again, like computing the marginal likelihood, it is difficult to compute posterior predictive distribution, but in this case, we can get it easily due to the conjugate prior.

For example, the probability of observing a head in the next coin toss is given by: \[ \begin{align*} p(y_{new}=1 \mid \mathcal{D}) &= \int_0 ^1 p(y_{new}=1 \mid \theta) p(\theta \mid \mathcal{D}) d\theta \\\\ &= \int_0 ^1 \theta \text{Beta }(\theta \mid a+y, \, b+N-y) d\theta \\\\ &= \mathbb{E }[\theta \mid \mathcal{D}] \\\\ &= \frac{a+y}{a+b+N}. \end{align*} \] Note: As you can see, the hyper-parameters \(a\) and \(b\) are critical in the whole process of our inference. In practice, setting up hyper-parameters is one of the most challenging aspects of the project.

Hyperparameters as Pseudo-Counts

In machine learning, the hyper-parameters \(a\) and \(b\) of the Beta prior can be highly intuitive: they act as pseudo-observations or effective prior counts. Before flipping the coin even once, an AI agent with a prior of \(\text{Beta}(a, b)\) behaves exactly as if it has already observed \(a\) heads and \(b\) tails. This structural equivalence is a common trick in natural language processing (like Laplace smoothing in Naive Bayes), preventing models from assigning zero probability to unseen events.

Univariate Normal Distribution Model

We now apply the same Bayesian framework to the normal distribution. Given the univariate normal distribution \(N(\mu, \sigma^2)\), there are three cases for our target posterior:

\(p(\mu \mid \mathcal{D}, \sigma^2)\): Variance \(\sigma^2\) is known.
\(p(\sigma^2 \mid \mathcal{D}, \mu)\): Mean \(\mu\) is known.
\(p(\mu, \sigma^2 \mid \mathcal{D})\): Both are unknown.

Here, we discuss cases 1 and 2. (Case 3 requires the normal-inverse-gamma conjugate prior and is treated in more advanced references.)

Example 2: Normal distribution model with known variance \(\sigma^2\)

In the case where \(\sigma^2\) is known constant, the likelihood for \(\mu\) is given by \[ \begin{align*} p(\mathcal{D} \mid \mu) &= \prod_{n=1}^N \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left(-\frac{(y_n - \mu)^2}{2\sigma^2}\right)\\\\ &= \left(\frac{1}{\sqrt{2\pi \sigma^2}}\right)^N \exp \left(- \sum_{n=1}^N \frac{(y_n - \mu)^2}{2\sigma^2}\right) \\\\ &\propto \exp \left(-\frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mu)^2 \right). \end{align*} \]

The conjugate prior is another normal distribution: \[ \begin{align*} p(\mu) &= N(\mu \mid \, \mu_0, \sigma_0^2) \\\\ &\propto \exp \left(-\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right). \end{align*} \]

Now, we can compute the posterior as follows: \[ \begin{align*} p(\mu \mid \, \mathcal{D}, \sigma^2) &\propto \exp \left(-\frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mu)^2 \right) \cdot \exp \left(-\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right) \\\\ &= \exp \left\{-\frac{1}{2}\left(\frac{N}{\sigma^2}+\frac{1}{\sigma_0^2} \right)\mu^2 +\left(\frac{\sum y_n}{\sigma^2} +\frac{\mu_0}{\sigma_0^2} \right)\mu + \left(-\frac{1}{2\sigma^2} \sum_{n=1}^N y_n^2 - \frac{\mu_0^2}{2\sigma_0^2} \right) \right\} \end{align*} \] Since \(-\frac{1}{2\sigma^2} \sum_{n=1}^N y_n^2 - \frac{\mu_0^2}{2\sigma_0^2}\) is constant with respect to \(\mu\), we ignore it.

Let \(A = \left(\frac{N}{\sigma^2}+\frac{1}{\sigma_0^2} \right)\) and \(B = \left(\frac{\sum y_n}{\sigma^2} +\frac{\mu_0}{\sigma_0^2} \right)\), and completing the square, we get: \[ \begin{align*} p(\mu \mid \, \mathcal{D}, \sigma^2) &\propto \exp \left\{-\frac{1}{2}A\mu^2 + B\mu \right\} \\\\ &= \exp \left\{-\frac{1}{2}A \left(\mu - \frac{B}{A} \right)^2 + \frac{B^2}{2A}\right\}. \end{align*} \]

Since the term \(\frac{B^2}{2A}\) does not depend on \(\mu\), and \(-\frac{1}{2}A \left(\mu - \frac{B}{A} \right)^2\) is a quadratic form of an univariate normal distribution, we conclude that \[ \begin{align*} p(\mu \mid \, \mathcal{D}, \sigma^2) = N(\mu \mid \, \mu_N, \sigma_N^2 ) \end{align*} \] where the posterior variance is given by: \[ \begin{align*} \sigma_N^2 &= \frac{1}{A} \\\\ &= \frac{\sigma^2 \sigma_0^2}{N\sigma_0^2 + \sigma^2} \end{align*} \] and the posterior mean is given by: \[ \begin{align*} \mu_N &= \frac{B}{A} \\\\ &= \sigma_N^2 \left(\frac{\mu_0}{\sigma_0^2} + \frac{N \bar{y}}{\sigma^2} \right) \\\\ &= \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\bar{y} \\\\ \end{align*} \] where \(\bar{y} = \frac{1}{N}\sum_{n=1}^N y_n\) is the empirical mean.

Note: In this case, \(\mu_0\) and \(\sigma_0^2\) are hyper-parameters of the prior distribution. For example, as you can see the form of the posterior mean \(\mu_N\), a small \(\sigma_0^2\) gives more weight to the prior mean \(\mu_0\), while a large \(\sigma_0^2\) reduces the influence of the prior, making the posterior more rely on the data.

Condition Number in Deep Learning

The true power of conjugate priors in computer science is memory efficiency. In online learning, we do not need to store massive historical datasets in memory. Because the posterior and prior share the same functional form, today's posterior simply becomes tomorrow's prior. For sequential data points \(y_1, y_2\), the update rule is structurally elegant: \[ p(\mu \mid y_1, y_2) \propto p(y_2 \mid \mu) p(\mu \mid y_1) \] An AI system only needs to maintain the current sufficient statistics (like \(\mu_N\) and \(\sigma_N^2\)) and update them as new data streams in, demonstrating a perfect synergy between mathematical structure and computational limits.

Example 3: Normal distribution model with known mean \(\mu\)

In the case where \(\mu\) is known constant, the likelihood for \(\sigma^2\) is given by \[ \begin{align*} p(\mathcal{D} \mid \sigma^2) &= \frac{1}{(2 \pi \sigma^2)^{\frac{N}{2}}} \exp \left(-\frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mu)^2 \right) \\\\ &\propto (\sigma^2)^{-\frac{N}{2}} \exp \left(-\frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mu)^2 \right). \end{align*} \] The "standard" conjugate prior is the inverse gamma distribution: \[ \begin{align*} p(\sigma^2) &= \text{InvGamma }(\sigma^2 \mid \, a, b) \\\\ &= \frac{b^a}{\Gamma(a)}(\sigma^2)^{-(a+1)} \exp \left(- \frac{b}{\sigma^2}\right) \end{align*} \] where \(a > 0\) is the shape parameter and \(b > 0\) is the scale parameter.

The posterior is given by: \[ \begin{align*} p(\sigma^2 \mid \, \mathcal{D}, \mu) &\propto \left\{ (\sigma^2)^{-\frac{N}{2}} \exp \left(-\frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mu)^2 \right) \right\} \cdot \left\{\frac{b^a}{\Gamma(a)}(\sigma^2)^{-(a+1)} \exp \left(- \frac{b}{\sigma^2}\right) \right\} \\\\ &= (\sigma^2)^{-(a+\frac{N}{2}+1)} \exp \left\{ - \frac{1}{\sigma^2}\left(b + \frac{1}{2} \sum_{n=1}^N (y_n - \mu)^2 \right) \right\}. \end{align*} \] Here, let \(\hat{a} = a + \frac{N}{2}\), and \(\hat{b} = b + \frac{1}{2} \sum_{n=1}^N (y_n - \mu)^2\). Thus, \[ p(\sigma^2 \mid \, \mathcal{D}, \mu) = \text{InvGamma }(\sigma^2 \mid \, \hat{a}, \hat{b}). \]

Alternatively, we can choose scaled inverse chi-squared distribution as the conjugate prior: \[ \begin{align*} p(\sigma^2) &= \text{ScaledInv- }\chi^2(\sigma^2 \mid \, \nu_0, \sigma_0^2) \\\\ &= \text{InvGamma }\left(\sigma^2 \mid \, \frac{\nu_0}{2}, \frac{\nu_0 \sigma_0^2}{2}\right) \\\\ &\propto (\sigma^2)^{-\frac{\nu_0}{2}-1} \exp \left(- \frac{\nu_0 \sigma_0^2}{2\sigma^2} \right) \end{align*} \] where \(\nu\) is the degrees of freedom.

Thus, the posterior is given by: \[ p(\sigma^2 \mid \, \mathcal{D}, \mu) = \text{ScaledInv- }\chi^2 \left(\sigma^2 \mid \, \nu_0 + N, \, \frac{\nu_0 \sigma_0^2 + \sum_{n=1}^N (y_n - \mu)^2}{\nu_0 + N}\right). \] A benefit of using this form is that we can control the strength of the prior by the single hyper-parameter \(\nu_0\).

Anomaly Detection and Uncertainty

This model is a cornerstone for quality control and process monitoring. In industrial manufacturing, the "target" mean of a process is often known from historical data, while the variance represents the stability of the process.

In machine learning, this is the fundamental principle of anomaly detection. Unlike simple frequentist methods that use a point estimate for variance, the Bayesian approach evaluates new data points against the posterior predictive distribution. By integrating over the inverse-gamma posterior of \(\sigma^2\), the resulting predictive distribution (a Student's t-distribution) naturally accounts for our uncertainty about the variance itself. This makes the system far more robust: when we have very little data, the model is "cautious" and requires stronger evidence to flag a point as an anomaly.

The Beta prior was conjugate for the Binomial likelihood, and the Normal prior was conjugate for the Normal likelihood with known variance. Is there a unifying structure that explains why these conjugacies exist? In Part 15: The Exponential Family, we show that all these distributions belong to a single parametric family - the exponential family - whose canonical form guarantees the existence of conjugate priors and reduces MLE to moment matching.