Maximum Likelihood Estimation


Point Estimators

In the previous parts, we built up a vocabulary of probability distributions. Each of them is parameterized by unknown quantities (means, variances, covariance matrices) that must be determined from observed data. The fundamental question of statistical inference is: given a sample \(\mathcal{D} = \{x_1, \ldots, x_n\}\), how do we estimate the true parameter \(\theta\) of the underlying population?

Definition: Point Estimator

A point estimator \(\hat{\theta}\) is a function of sample random variables \(X_1, X_2, \ldots, X_n\) that produces a single value as an estimate of the unknown population parameter \(\theta\). For example, the sample mean \[ \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i \] is a point estimator of the population mean \(\mu\).

Since \(\hat{\theta}\) is a function of random variables, it is itself a random variable with its own distribution, called the sampling distribution. A natural question is: how close is \(\hat{\theta}\) to the true parameter \(\theta\)? Two fundamental properties characterize the quality of an estimator: the bias, \(\text{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta\), which measures the systematic deviation of the estimator from the true parameter, and the variance, \(\text{Var}(\hat{\theta}) = \mathbb{E}\!\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\right]\), which measures the spread of the estimator around its own expected value.

An ideal estimator has both low bias and low variance. However, these two goals often conflict: reducing one may increase the other. The mean squared error provides a single criterion that balances both considerations.

Definition: Mean Squared Error (MSE)

The mean squared error of an estimator \(\hat{\theta}\) is \[ \begin{align*} \text{MSE}(\hat{\theta}) &= \mathbb{E}\!\left[(\hat{\theta} - \theta)^2\right] \\\\ &= \text{Var}(\hat{\theta}) + \left[\text{Bias}(\hat{\theta})\right]^2. \end{align*} \]

Derivation:

\[ \begin{align*} \text{MSE}(\hat{\theta}) &= \mathbb{E}\left[(\hat{\theta} - \theta)^2\right] \\\\ &= \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}] + \mathbb{E}[\hat{\theta}] - \theta)^2\right] \\\\ &= \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2 + 2(\hat{\theta} - \mathbb{E}[\hat{\theta}])(\mathbb{E}[\hat{\theta}] - \theta) + (\mathbb{E}[\hat{\theta}] - \theta)^2\right]. \end{align*} \] Using the linearity of expectation, we distribute \(\mathbb{E}\). Note that \(\theta\) is a fixed population parameter, and \(\mathbb{E}[\hat{\theta}]\) is a constant. Thus, \((\mathbb{E}[\hat{\theta}] - \theta)\) is treated as a constant: \[ \begin{align*} \text{MSE}(\hat{\theta}) &= \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\right] + 2(\mathbb{E}[\hat{\theta}] - \theta)\mathbb{E}\left[\hat{\theta} - \mathbb{E}[\hat{\theta}]\right] + \mathbb{E}\left[(\mathbb{E}[\hat{\theta}] - \theta)^2\right] \\\\ &= \text{Var}(\hat{\theta}) + 2(\mathbb{E}[\hat{\theta}] - \theta)(0) + [\text{Bias}(\hat{\theta})]^2 \\\\ &= \text{Var}(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2. \end{align*} \]

Note: The cross term vanishes because the expected deviation from the mean is zero: \(\mathbb{E}\left[\hat{\theta} - \mathbb{E}[\hat{\theta}]\right] = \mathbb{E}[\hat{\theta}] - \mathbb{E}[\hat{\theta}] = 0\).

The MSE serves as a criterion for comparing estimators: among competing estimators, we prefer the one with the smallest MSE. For an unbiased estimator, the MSE reduces to the variance alone.
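The decomposition is easy to verify numerically. The sketch below is illustrative only: the shrinkage estimator \(\hat{\theta} = \sum_i X_i / (n+1)\) is a deliberately biased estimator of a normal mean, and the constants (true mean, sample size, number of replications) are arbitrary choices. It estimates the MSE, variance, and bias by Monte Carlo and checks that \(\text{MSE} = \text{Var} + \text{Bias}^2\).

```python
# Minimal Monte Carlo check of MSE = Var + Bias^2 for a deliberately
# biased estimator of the mean (sum(x) / (n + 1), an illustrative choice).
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 2.0, 20, 200_000            # true mean, sample size, replications

samples = rng.normal(loc=theta, scale=1.0, size=(trials, n))
theta_hat = samples.sum(axis=1) / (n + 1)      # biased "shrinkage" estimator

mse  = np.mean((theta_hat - theta) ** 2)       # E[(theta_hat - theta)^2]
var  = np.var(theta_hat)                       # Var(theta_hat)
bias = np.mean(theta_hat) - theta              # E[theta_hat] - theta

print(f"MSE          = {mse:.5f}")
print(f"Var + Bias^2 = {var + bias**2:.5f}")   # agrees up to Monte Carlo noise
```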

Once an estimator is selected, we quantify its precision using the standard error (SE), which is the standard deviation of the estimator's sampling distribution. For the sample mean, \[ \text{SE}(\bar{X}) = \sqrt{\text{Var}(\bar{X})} = \frac{\sigma}{\sqrt{n}}. \] Notice that the standard error decreases as \(n\) grows, confirming the intuition that more data yields more precise estimates. With the concept of an estimator and its quality in hand, we now ask: what principle should guide our choice of estimator in the first place?
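A small simulation makes the \(\sigma/\sqrt{n}\) scaling concrete. The snippet below is a rough sketch under assumed values (a normal population with \(\sigma = 3\), 10,000 replications): it compares the empirical standard deviation of \(\bar{X}\) with the theoretical standard error for several sample sizes.

```python
# Sketch: the standard error of the sample mean shrinks like sigma / sqrt(n).
import numpy as np

rng = np.random.default_rng(1)
sigma, reps = 3.0, 10_000                       # assumed population sd, replications

for n in (10, 100, 1000):
    # draw `reps` independent samples of size n and compute their means
    means = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)
    print(f"n={n:5d}  empirical SE={means.std():.4f}  sigma/sqrt(n)={sigma/np.sqrt(n):.4f}")
```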

Likelihood Functions

Before we can choose an optimal estimator, we need a way to measure how well a candidate parameter value \(\theta\) explains the observed data. The key insight is a change of perspective: the same mathematical expression that gives the probability of data given a parameter can also be viewed as a function of the parameter given fixed data. This reversal of roles leads to the likelihood function.

Definition: Likelihood Function

Suppose observations \(X_1, X_2, \ldots, X_n\) are i.i.d. random variables. The "observed" values of these random variables are denoted by \(x_1, x_2, \cdots, x_n\) respectively. Then the joint p.d.f. (or p.m.f.) of \(X_1, X_2, \cdots, X_n\) is given by \[ f(x_1, x_2, \cdots, x_n \mid \theta) = \prod_{i = 1}^n f(x_i \mid \theta) \] where \(\theta\) is some unknown parameter.

This is called the likelihood function of \(\theta\) for the observed values \(x_1, x_2, \cdots, x_n\) and is denoted by \(L(\theta \mid x_1, x_2, \cdots, x_n)\), or simply \(L(\theta)\).

The crucial distinction is one of interpretation: the expression \(\prod_{i=1}^n f(x_i \mid \theta)\) is the same mathematical formula whether we view it as a probability (function of \(x\) with \(\theta\) fixed) or as a likelihood (function of \(\theta\) with \(x\) fixed). This duality is often expressed as \[ \underbrace{L(\theta \mid x_1, \ldots, x_n)}_{\text{After sampling: function of } \theta} = \underbrace{\prod_{i=1}^n f(x_i \mid \theta)}_{\text{Before sampling: function of } x}. \] With the likelihood function defined, we can now state the most widely used principle for parameter estimation.
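To make the "function of \(\theta\) with \(x\) fixed" viewpoint concrete, the sketch below evaluates the likelihood of a normal mean \(\mu\) on a grid for one fixed sample, with \(\sigma = 1\) assumed known for simplicity. The grid maximizer lands essentially at the sample mean, anticipating Example 2 below.

```python
# Sketch: the likelihood as a function of the parameter, with the data held fixed.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=1.0, size=30)        # fixed observed sample

mus = np.linspace(3.0, 7.0, 401)                   # candidate values of mu
# L(mu) = prod_i f(x_i | mu); computed in log space for numerical stability
log_L = np.array([norm.logpdf(x, loc=mu, scale=1.0).sum() for mu in mus])

print("grid argmax of L(mu):", mus[np.argmax(log_L)])
print("sample mean x_bar   :", x.mean())           # the two nearly coincide
```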

Maximum Likelihood Estimation

In machine learning, model fitting (or training) is the process of estimating unknown parameters \(\boldsymbol{\theta} = (\theta_1, \ldots, \theta_k)\) from sample data \(\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}\). This can be framed as an optimization problem: \[ \hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}) \] where \(\mathcal{L}(\boldsymbol{\theta})\) is a loss function (or objective function). The most natural and widely used choice is to select the parameter that makes the observed data most probable.

Definition: Maximum Likelihood Estimator

The maximum likelihood estimator (MLE) is defined as \[ \hat{\boldsymbol{\theta}}_{\text{MLE}} = \arg\max_{\boldsymbol{\theta}}\, L(\boldsymbol{\theta}) \] where \(L(\boldsymbol{\theta})\) is the likelihood function for the sample data \(\mathcal{D}\).

Since the logarithm is a strictly increasing function, maximizing \(L\) is equivalent to maximizing \(\ln L\). Working with the log-likelihood is preferred in practice for two reasons: it converts products into sums (improving numerical stability) and simplifies differentiation.

If \(L(\boldsymbol{\theta})\) is differentiable, the MLE can be found by solving the score equation: \[ \begin{align*} &\nabla_{\boldsymbol{\theta}} \ln L(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \ln \prod_{i = 1}^n f(\mathbf{x}_i \mid \boldsymbol{\theta}) = \mathbf{0}\\\\ &\Longrightarrow \nabla_{\boldsymbol{\theta}} \ln L(\boldsymbol{\theta}) = \sum_{i=1}^n \nabla_{\boldsymbol{\theta}} \ln f(\mathbf{x}_i \mid \boldsymbol{\theta}) = \mathbf{0}. \end{align*} \] This gradient of the log-likelihood is called the score function, which plays a central role in Fisher information theory.
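When the score equation has no convenient closed form, the MLE is found numerically by minimizing the negative log-likelihood. The sketch below assumes an exponential model with rate \(\lambda\), chosen because its closed-form MLE \(\hat{\lambda} = 1/\bar{x}\) makes the numerical answer easy to check; the true rate and sample size are arbitrary.

```python
# Sketch: numerical MLE by minimizing the negative log-likelihood.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
x = rng.exponential(scale=1 / 2.5, size=500)       # data with true rate lambda = 2.5

def neg_log_likelihood(lam):
    # -ln L(lambda) = -(n ln(lambda) - lambda * sum_i x_i)
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50.0), method="bounded")
print("numerical MLE:", res.x)
print("closed form  :", 1 / x.mean())              # lambda_hat = 1 / x_bar
```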

Example 1: Binomial Distribution \(X \sim b(n, p) \)

We begin with a discrete example. Consider flipping a coin \(n\) times and observing \(k\) heads.

Derivation:

Let \(X \sim b(n, \theta)\), where \(\theta \in [0, 1]\) is the probability of success. If we observe \(X = k\) successes in \(n\) trials, the likelihood function is the probability mass function: \[ L(\theta) = P(X = k \mid \theta) = \binom{n}{k} \theta^k (1-\theta)^{n-k}. \] Taking the natural logarithm, we get: \[ \ln L(\theta) = \ln \binom{n}{k} + k \ln(\theta) + (n-k) \ln(1-\theta). \] To find \(\hat{\theta}_{\text{MLE}}\), we take the derivative with respect to \(\theta\) and set it to zero. Note that the combinatorial term \(\ln \binom{n}{k}\) is a constant with respect to \(\theta\) and vanishes: \[ \begin{align*} \frac{d}{d\theta} \ln L(\theta) &= \frac{k}{\theta} - \frac{n-k}{1-\theta} = 0 \\\\ &\Longrightarrow k(1 - \theta) - (n-k)\theta = 0 \\\\ &\Longrightarrow k - k\theta - n\theta + k\theta = 0 \\\\ &\Longrightarrow k = n\theta. \end{align*} \] Therefore, \[ \hat{\theta}_{\text{MLE}} = \frac{k}{n}. \]

This is exactly the sample proportion \(\hat{p} = \frac{X}{n}\), confirming that the intuitively natural estimator for the population proportion coincides with the MLE. We now turn to a continuous example where the MLE must be found for two parameters simultaneously.

Note: The sample proportion \(\hat{p}\) is an unbiased estimator of \(p\), and its variance shrinks as \(n\) grows: \[ \begin{align*} &\mathbb{E}[\hat{p}] = \frac{1}{n}\mathbb{E}[X] = \frac{1}{n}np = p \\\\ &\text{Var }(\hat{p}) = \frac{1}{n^2}\text{Var }[X] = \frac{1}{n^2}np(1-p) = \frac{p(1-p)}{n}. \end{align*} \]
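A short simulation can confirm both the MLE \(\hat{p} = X/n\) and the moments in the note above. The values of \(n\), \(p\), and the number of replications below are arbitrary choices for illustration.

```python
# Simulation sketch: p_hat = X/n is unbiased with variance p(1 - p)/n.
import numpy as np

rng = np.random.default_rng(4)
n, p, trials = 40, 0.3, 200_000

X = rng.binomial(n, p, size=trials)    # number of successes in each replication
p_hat = X / n

print("mean of p_hat    :", p_hat.mean(), " (theory:", p, ")")
print("variance of p_hat:", p_hat.var(),  " (theory:", p * (1 - p) / n, ")")
```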

Example 2: Normal Distribution

Given a random sample \(X_1, X_2, \ldots, X_n\) from the normal distribution \(\mathcal{N}(\mu, \sigma^2)\), we seek the MLEs for both parameters.

Derivation:

Suppose \(\mathcal{D} = \{x_1, x_2, \cdots, x_n \}\) is a sample from a normal distribution with p.d.f. \[ f(x \mid \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}}\exp \left\{- \frac{(x - \mu)^2}{2\sigma^2}\right\}. \] Its likelihood function is given by \[ \begin{align*} L(\mu, \sigma^2) &= \prod_{i=1}^n f(x_i \mid \mu, \sigma^2) \\\\ &= \left (\frac{1}{\sigma \sqrt{2\pi}}\right)^n \exp \left\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right\}. \end{align*} \] The log-likelihood function is given by \[ \begin{align*} \ln L(\mu, \sigma^2) &= n \ln \left(\frac{1}{\sigma \sqrt{2\pi}}\right) -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \\\\ &= -n \ln (\sigma) - n \ln (\sqrt{2\pi}) -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \\\\ &= -\frac{n}{2} \ln (\sigma^2) - \frac{n}{2}\ln (2\pi) -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2. \tag{1} \end{align*} \]

Setting the partial derivative of (1) with respect to \(\mu\) equal to zero: \[ \begin{align*} &\frac{\partial \ln L(\mu, \sigma^2) }{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) = 0 \\\\ &\Longrightarrow \sum_{i=1}^n x_i - n\mu = 0. \end{align*} \] Thus, \[ \hat{\mu}_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x}. \tag{2} \]

Similarly, taking the partial derivative of (1) with respect to the variance \(\sigma^2\) (treating \(\sigma^2\) as a single variable) and setting it to zero: \[ \begin{align*} &\frac{\partial \ln L(\mu, \sigma^2) }{\partial (\sigma^2)} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i - \mu)^2 = 0 \\\\ &\Longrightarrow -n \sigma^2 + \sum_{i=1}^n (x_i - \mu)^2 = 0. \end{align*} \] To find the maximum, we substitute \(\mu\) with our MLE estimate \(\hat{\mu}_{\text{MLE}} = \bar{x}\) and solve for \(\sigma^2\): \[ \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2. \]

Note that the MLE for the variance divides by \(n\), not \(n - 1\). This means \(\hat{\sigma}^2_{\text{MLE}}\) is a biased estimator of \(\sigma^2\), with \(\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \frac{n-1}{n}\sigma^2\). The unbiased sample variance \(s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2\) corrects for this bias; dividing by \(n - 1\) is known as Bessel's correction. This is one instance of a general phenomenon: the MLE is consistent (converges to the true value as \(n \to \infty\)) but not always unbiased for finite samples.
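The bias factor \(\frac{n-1}{n}\) is easy to see numerically. In the sketch below, NumPy's default `ddof=0` corresponds to the MLE (divide by \(n\)) and `ddof=1` to the unbiased \(s^2\); the population parameters, sample size, and replication count are arbitrary.

```python
# Sketch: the divide-by-n variance estimator is biased by a factor (n - 1)/n.
import numpy as np

rng = np.random.default_rng(5)
mu, sigma2, n, trials = 0.0, 4.0, 10, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
var_mle      = x.var(axis=1, ddof=0)   # (1/n)     * sum (x_i - x_bar)^2  -> the MLE
var_unbiased = x.var(axis=1, ddof=1)   # (1/(n-1)) * sum (x_i - x_bar)^2  -> s^2

print("E[var_mle]      :", var_mle.mean(),      " (theory:", (n - 1) / n * sigma2, ")")
print("E[var_unbiased] :", var_unbiased.mean(), " (theory:", sigma2, ")")
```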

Connections to Machine Learning

MLE is the default parameter estimation method across machine learning. Training a logistic regression model is equivalent to minimizing the negative log-likelihood of a Bernoulli model, which is precisely the cross-entropy loss. Training a neural network with mean squared error loss is equivalent to MLE under a Gaussian noise assumption. The exponential family unifies such examples: for exponential-family models, the MLE matches the model's expected sufficient statistics to their empirical averages (moment matching). In Bayesian inference, the MLE coincides with the maximum a posteriori (MAP) estimate under a uniform (flat) prior, and adding regularization corresponds to choosing a non-uniform prior.
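As a small illustration of one of these equivalences, the sketch below fits a line to synthetic data twice: once by ordinary least squares (minimizing squared error) and once by numerically maximizing the Gaussian likelihood of the residuals. The data, model, and noise level are all assumptions for illustration; the two fits agree up to optimizer tolerance.

```python
# Sketch: least squares on a line fit coincides with Gaussian MLE.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
x = np.linspace(0, 1, 200)
y = 2.0 * x - 1.0 + rng.normal(0.0, 0.3, size=x.shape)   # y = 2x - 1 + Gaussian noise

def gaussian_nll(params, noise_var=0.3 ** 2):
    a, b = params
    resid = y - (a * x + b)
    # NLL = n/2 * ln(2*pi*sigma^2) + (1 / (2*sigma^2)) * sum(resid^2)
    return 0.5 * len(x) * np.log(2 * np.pi * noise_var) + resid @ resid / (2 * noise_var)

mle_fit = minimize(gaussian_nll, x0=[0.0, 0.0]).x        # maximize likelihood numerically
lsq_fit = np.polyfit(x, y, deg=1)                        # minimize squared error
print("Gaussian MLE  (a, b):", mle_fit)
print("least squares (a, b):", lsq_fit)                  # essentially identical
```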

MLE provides point estimates - single "best guess" values for the parameters. But how confident should we be in these estimates? In the next part, we introduce hypothesis testing and confidence intervals, which provide principled frameworks for quantifying the uncertainty inherent in any statistical estimate.