Maximum Likelihood Estimation

Point Estimators Likelihood Functions Maximum Likelihood Estimation Example 1: Binomial Distribution \(X \sim b(n, p) \) Example 2: Normal Distribution

Point Estimators

In the preceding chapters on probability distributions, we built up a vocabulary parameterized by unknown quantities (means, variances, covariance matrices) that must be determined from observed data. The fundamental question of statistical inference is: given a sample \(\mathcal{D} = \{x_1, \ldots, x_n\}\), how do we estimate the true parameter \(\theta\) of the underlying population?

Definition: Point Estimator

Let the population parameter \(\theta\) take values in a parameter space \(\Theta \subseteq \mathbb{R}^k\) (with \(k = 1\) for scalar parameters). A point estimator of \(\theta\) is a function \[ \hat{\theta} : \mathbb{R}^n \to \Theta, \qquad (X_1, \ldots, X_n) \mapsto \hat{\theta}(X_1, \ldots, X_n) \] of the sample random variables that does not itself depend on \(\theta\) — that is, a statistic. The numerical value \(\hat{\theta}(x_1, \ldots, x_n)\) computed from observed data is the corresponding point estimate.

For example, the sample mean \[ \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i \] is a point estimator of the population mean \(\mu\).

Since \(\hat{\theta}\) is a function of random variables, it is itself a random variable with its own distribution, called the sampling distribution. A natural question is: how close is \(\hat{\theta}\) to the true parameter \(\theta\)? Two fundamental properties characterize the quality of an estimator: its bias measures systematic error, and its variance measures precision across samples.

Definition: Bias

The bias of an estimator \(\hat{\theta}\) of a parameter \(\theta\) is \[ \operatorname{Bias}(\hat{\theta}) := \mathbb{E}[\hat{\theta}] - \theta. \] An estimator with \(\operatorname{Bias}(\hat{\theta}) = 0\) for every \(\theta \in \Theta\) is called unbiased.

The variance of an estimator is the standard variance of \(\hat{\theta}\) viewed as a random variable: \[ \operatorname{Var}(\hat{\theta}) = \mathbb{E}\!\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\right]. \]

An ideal estimator has both low bias and low variance. However, these two goals often conflict - reducing one may increase the other. The mean squared error provides a single criterion that balances both considerations.

Definition: Mean Squared Error (MSE)

The mean squared error of an estimator \(\hat{\theta}\) of a parameter \(\theta\) is \[ \operatorname{MSE}(\hat{\theta}) := \mathbb{E}\!\left[(\hat{\theta} - \theta)^2\right]. \]

A short calculation shows that the MSE decomposes into the variance and the squared bias of the estimator.

Bias-variance decomposition:

\[ \begin{align*} \operatorname{MSE}(\hat{\theta}) &= \mathbb{E}\left[(\hat{\theta} - \theta)^2\right] \\\\ &= \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}] + \mathbb{E}[\hat{\theta}] - \theta)^2\right] \\\\ &= \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2 + 2(\hat{\theta} - \mathbb{E}[\hat{\theta}])(\mathbb{E}[\hat{\theta}] - \theta) + (\mathbb{E}[\hat{\theta}] - \theta)^2\right]. \end{align*} \] Using the linearity of expectation, we distribute \(\mathbb{E}\). Note that \(\theta\) is a fixed population parameter, and \(\mathbb{E}[\hat{\theta}]\) is a constant. Thus, \((\mathbb{E}[\hat{\theta}] - \theta)\) is treated as a constant: \[ \begin{align*} \operatorname{MSE}(\hat{\theta}) &= \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\right] + 2(\mathbb{E}[\hat{\theta}] - \theta)\mathbb{E}\left[\hat{\theta} - \mathbb{E}[\hat{\theta}]\right] + \mathbb{E}\left[(\mathbb{E}[\hat{\theta}] - \theta)^2\right] \\\\ &= \operatorname{Var}(\hat{\theta}) + 2(\mathbb{E}[\hat{\theta}] - \theta)(0) + [\operatorname{Bias}(\hat{\theta})]^2 \\\\ &= \operatorname{Var}(\hat{\theta}) + [\operatorname{Bias}(\hat{\theta})]^2. \end{align*} \]

Note: The cross term vanishes because the expected deviation from the mean is zero: \(\mathbb{E}\left[\hat{\theta} - \mathbb{E}[\hat{\theta}]\right] = \mathbb{E}[\hat{\theta}] - \mathbb{E}[\hat{\theta}] = 0\).

The MSE serves as a criterion for comparing estimators: among competing estimators, we prefer the one with the smallest MSE. For an unbiased estimator, the MSE reduces to the variance alone.

Once an estimator is selected, we quantify its precision using the standard error (SE), which is the standard deviation of the estimator's sampling distribution. For the sample mean, \[ \operatorname{SE}(\bar{X}) = \sqrt{\operatorname{Var}(\bar{X})} = \frac{\sigma}{\sqrt{n}}. \] Notice that the standard error decreases as \(n\) grows, confirming the intuition that more data yields more precise estimates. With the concept of an estimator and its quality in hand, we now ask: what principle should guide our choice of estimator in the first place?

Likelihood Functions

Before we can choose an optimal estimator, we need a way to measure how well a candidate parameter value \(\theta\) explains the observed data. The key insight is a change of perspective: the same mathematical expression that gives the probability of data given a parameter can also be viewed as a function of the parameter given fixed data. This reversal of roles leads to the likelihood function.

Definition: Likelihood Function

Let \(f(\,\cdot \mid \theta)\) denote the joint p.d.f. (or p.m.f.) of a sample \((X_1, \ldots, X_n)\), parameterized by an unknown parameter \(\theta \in \Theta\). Once the data \((x_1, \ldots, x_n)\) have been observed, they are no longer free variables — only \(\theta\) remains unknown. The likelihood function of \(\theta\) given the observed data is the function \[ L(\theta \mid x_1, \ldots, x_n) \;:=\; f(x_1, \ldots, x_n \mid \theta), \] now regarded as a function of \(\theta\) with the observed values held fixed. We often write \(L(\theta)\) when the data are clear from context.

For an i.i.d. sample, the joint density factorizes and \[ L(\theta) \;=\; \prod_{i=1}^n f(x_i \mid \theta). \]

The crucial distinction is one of interpretation: the expression \(\prod_{i=1}^n f(x_i \mid \theta)\) is the same mathematical formula whether we view it as a probability (function of \(x\) with \(\theta\) fixed) or as a likelihood (function of \(\theta\) with \(x\) fixed). This duality is often expressed as \[ \underbrace{L(\theta \mid x_1, \ldots, x_n)}_{\text{After sampling: function of } \theta} = \underbrace{\prod_{i=1}^n f(x_i \mid \theta)}_{\text{Before sampling: function of } x}. \] With the likelihood function defined, we can now state the most widely used principle for parameter estimation.

Maximum Likelihood Estimation

In machine learning, model fitting (or training) is the process of estimating unknown parameters \(\boldsymbol{\theta} = (\theta_1, \ldots, \theta_k)\) from sample data \(\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}\). This is typically framed as the optimization of an objective (or loss) function over \(\boldsymbol{\theta}\). A widely used choice in frequentist inference is to select the parameter that makes the observed data most probable.

Convention. In what follows we allow \(\mathbf{x}_i \in \mathbb{R}^d\) in general; the univariate case \(x_i \in \mathbb{R}\) used in the examples below is the special case \(d = 1\). Boldface marks vector-valued quantities, lightface marks scalar ones.

Definition: Maximum Likelihood Estimator

The maximum likelihood estimator (MLE) is defined as \[ \hat{\boldsymbol{\theta}}_{\text{MLE}} = \arg\max_{\boldsymbol{\theta}}\, L(\boldsymbol{\theta}) \] where \(L(\boldsymbol{\theta})\) is the likelihood function for the sample data \(\mathcal{D}\).

The argmax need not exist or be unique in general. Standard regularity conditions guarantee existence (e.g. compactness of \(\Theta\) with continuous \(L\)) and local uniqueness near the truth (e.g. strict concavity of \(\ln L\)); for the examples below, both conditions hold.

Since the logarithm is a strictly increasing function, maximizing \(L\) is equivalent to maximizing \(\ln L\). Working with the log-likelihood is preferred in practice for two reasons: it converts products into sums (improving numerical stability) and simplifies differentiation.

If \(L(\boldsymbol{\theta})\) is differentiable in \(\boldsymbol{\theta}\) and the maximum is attained at an interior point of the parameter space, the MLE satisfies the score equation. For an i.i.d. sample, \[ \nabla_{\boldsymbol{\theta}} \ln L(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \sum_{i = 1}^n \ln f(\mathbf{x}_i \mid \boldsymbol{\theta}) = \sum_{i=1}^n \nabla_{\boldsymbol{\theta}} \ln f(\mathbf{x}_i \mid \boldsymbol{\theta}) = \mathbf{0}. \] The score equation provides only a necessary condition: critical points must be checked against the second-order condition (negative-definite Hessian of \(\ln L\)) to confirm a local maximum, and boundary maxima or non-differentiable cases must be handled separately. This gradient of the log-likelihood is called the score function, which plays a central role in the Fisher information theory.

Example 1: Binomial Distribution \(X \sim b(n, p) \)

We begin with a discrete example. Consider flipping a coin \(n\) times and observing \(k\) heads.

Each flip is a Bernoulli trial; the \(n\) flips together follow \(\text{Binomial}(n, \theta)\). We can equivalently view the data as a single observation \(X \sim b(n, \theta)\) with likelihood \(L(\theta) = P(X = k \mid \theta)\), or as \(n\) i.i.d. Bernoulli trials with likelihood \(\prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i}\) where \(\sum x_i = k\); both yield the same log-likelihood up to a \(\theta\)-independent constant. We adopt the single-observation view below, so the symbol \(n\) here denotes the number of trials in the binomial experiment, not the sample size of the previous section (which is one in this view).

Example:

Let \(X \sim b(n, \theta)\), where \(\theta \in [0, 1]\) is the probability of success. If we observe \(X = k\) successes in \(n\) trials, the likelihood function is the probability mass function: \[ L(\theta) = P(X = k \mid \theta) = \binom{n}{k} \theta^k (1-\theta)^{n-k}. \] Taking the natural logarithm, we get: \[ \ln L(\theta) = \ln \binom{n}{k} + k \ln(\theta) + (n-k) \ln(1-\theta). \] To find \(\hat{\theta}_{\text{MLE}}\), we take the derivative with respect to \(\theta\) and set it to zero. Note that the combinatorial term \(\ln \binom{n}{k}\) is a constant with respect to \(\theta\) and vanishes: \[ \begin{align*} \frac{d}{d\theta} \ln L(\theta) &= \frac{k}{\theta} - \frac{n-k}{1-\theta} \\\\ &= \frac{k(1-\theta) - (n-k)\theta}{\theta(1-\theta)} = \frac{k - n\theta}{\theta(1-\theta)} \;=\; 0, \end{align*} \] which holds (for \(\theta \in (0,1)\)) iff \(k = n\theta\). Therefore, \[ \hat{\theta}_{\text{MLE}} = \frac{k}{n}. \] The second derivative \(\frac{d^2}{d\theta^2}\ln L(\theta) = -k/\theta^2 - (n-k)/(1-\theta)^2 < 0\) on \((0,1)\), so this critical point is the unique interior maximum.

This is exactly the sample proportion \(\hat{p} = X/n\): the intuitively natural estimator coincides with the MLE. Its sampling distribution properties follow from \(X \sim b(n, p)\): \[ \mathbb{E}[\hat{p}] = \frac{1}{n}\mathbb{E}[X] = p, \qquad \operatorname{Var}(\hat{p}) = \frac{1}{n^2}\operatorname{Var}[X] = \frac{p(1-p)}{n}, \] so \(\hat{p}\) is unbiased, with variance shrinking as \(1/n\). We now turn to a continuous example where the MLE must be found for two parameters simultaneously.

Example 2: Normal Distribution

Given an i.i.d. sample \(X_1, X_2, \ldots, X_n\) from the normal distribution \(\mathcal{N}(\mu, \sigma^2)\), we seek the MLEs for both parameters.

Example:

With observed values \(\mathcal{D} = \{x_1, x_2, \ldots, x_n\}\) and per-observation density \[ f(x \mid \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}}\exp \left\{- \frac{(x - \mu)^2}{2\sigma^2}\right\}, \] the likelihood function is \[ \begin{align*} L(\mu, \sigma^2) &= \prod_{i=1}^n f(x_i \mid \mu, \sigma^2) \\\\ &= \left (\frac{1}{\sigma \sqrt{2\pi}}\right)^n \exp \left\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right\}. \end{align*} \] The log-likelihood function is given by \[ \begin{align*} \ln L(\mu, \sigma^2) &= n \ln \left(\frac{1}{\sigma \sqrt{2\pi}}\right) -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \\\\ &= -n \ln (\sigma) - n \ln (\sqrt{2\pi}) -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \\\\ &= -\frac{n}{2} \ln (\sigma^2) - \frac{n}{2}\ln (2\pi) -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2. \tag{1} \end{align*} \]

At any interior critical point of \(\ln L\), both partial derivatives must vanish simultaneously. Setting the partial derivative of (1) with respect to \(\mu\) equal to zero: \[ \frac{\partial \ln L(\mu, \sigma^2)}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) = \frac{1}{\sigma^2}\!\left(\sum_{i=1}^n x_i - n\mu\right) = 0, \] which gives \[ \hat{\mu}_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x}. \tag{2} \]

Similarly, taking the partial derivative of (1) with respect to the variance \(\sigma^2\) (treating \(\sigma^2\) as a single variable) and setting it to zero: \[ \frac{\partial \ln L(\mu, \sigma^2) }{\partial (\sigma^2)} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i - \mu)^2 = 0 \;\;\iff\;\; \sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2. \] Substituting the value \(\hat{\mu}_{\text{MLE}} = \bar{x}\) obtained above into this equation gives the joint critical point \[ \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2. \] As \(\sigma^2 \to 0^+\) or \(\sigma^2 \to \infty\) the likelihood vanishes, and \((\hat{\mu}_{\text{MLE}}, \hat{\sigma}^2_{\text{MLE}})\) is the unique interior critical point, so it is the global maximum.

Note that the MLE for the variance divides by \(n\), not \(n - 1\). This means \(\hat{\sigma}^2_{\text{MLE}}\) is a biased estimator of \(\sigma^2\), with \(\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \frac{n-1}{n}\sigma^2\). Replacing \(n\) with \(n-1\) — Bessel's correction — gives the unbiased sample variance \(s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2\).

Verification of the bias:

For an i.i.d. sample with finite variance, \(\bar{X}\) has \(\operatorname{Var}(\bar{X}) = \sigma^2/n\) (a standard property of sums of independent variables). Expand using \(\sum_{i=1}^n (X_i - \bar{X})^2 = \sum_{i=1}^n X_i^2 - n\bar{X}^2\) and take expectations. Using \(\mathbb{E}[X_i^2] = \sigma^2 + \mu^2\) and \(\mathbb{E}[\bar{X}^2] = \operatorname{Var}(\bar{X}) + (\mathbb{E}[\bar{X}])^2 = \sigma^2/n + \mu^2\): \[ \begin{align*} \mathbb{E}\!\left[\sum_{i=1}^n (X_i - \bar{X})^2\right] &= \sum_{i=1}^n \mathbb{E}[X_i^2] - n\,\mathbb{E}[\bar{X}^2] \\\\ &= n(\sigma^2 + \mu^2) - n\!\left(\frac{\sigma^2}{n} + \mu^2\right) \\\\ &= (n - 1)\sigma^2. \end{align*} \] Dividing by \(n\) gives \(\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \frac{n-1}{n}\sigma^2\), confirming the bias.

The finite-sample bias of \(\hat{\sigma}^2_{\text{MLE}}\) is a general feature of MLE: it can be biased for finite \(n\), yet under standard regularity conditions (identifiability, compactness of the parameter space, continuity and domination of the log-likelihood) it is consistent, meaning \(\hat{\boldsymbol{\theta}}_{\text{MLE}} \xrightarrow{P} \boldsymbol{\theta}_{\text{true}}\) as \(n \to \infty\) in the sense of convergence in probability. A classical consistency theorem along these lines is due to Wald (1949).

Connections to Machine Learning

MLE is the default parameter estimation method across machine learning. Training a logistic regression model is equivalent to minimizing the negative log-likelihood (cross-entropy loss). Training a neural network with mean squared error loss is equivalent to MLE under a Gaussian noise assumption. The exponential family unifies these examples, showing that, within the exponential family, MLE reduces to moment matching. In Bayesian inference, MLE corresponds to the special case of a uniform (flat) prior, and incorporating regularization is equivalent to choosing a non-uniform prior.

MLE provides point estimates - single "best guess" values for the parameters. But how confident should we be in these estimates? The next page introduces hypothesis testing and confidence intervals, which provide principled frameworks for quantifying the uncertainty inherent in any statistical estimate.