Point Estimators
In the preceding chapters on probability distributions, we built up a vocabulary parameterized by unknown quantities
(means, variances, covariance matrices) that must be determined from observed data. The fundamental question of
statistical inference is: given a sample \(\mathcal{D} = \{x_1, \ldots, x_n\}\), how do we estimate
the true parameter \(\theta\) of the underlying population?
Definition: Point Estimator
Let the population parameter \(\theta\) take values in a parameter space
\(\Theta \subseteq \mathbb{R}^k\) (with \(k = 1\) for scalar parameters).
A point estimator of \(\theta\) is a function
\[
\hat{\theta} : \mathbb{R}^n \to \Theta, \qquad (X_1, \ldots, X_n) \mapsto \hat{\theta}(X_1, \ldots, X_n)
\]
of the sample random variables that does not itself depend on \(\theta\) — that is, a
statistic. The numerical value
\(\hat{\theta}(x_1, \ldots, x_n)\) computed from observed data is the corresponding
point estimate.
For example, the sample mean
\[
\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i
\]
is a point estimator of the population mean \(\mu\).
Since \(\hat{\theta}\) is a function of random variables, it is itself a random
variable with its own distribution, called the sampling distribution.
A natural question is: how close is \(\hat{\theta}\) to the true parameter \(\theta\)?
Two fundamental properties characterize the quality of an estimator: its bias
measures systematic error, and its variance measures precision across samples.
Definition: Bias
The bias of an estimator \(\hat{\theta}\) of a parameter \(\theta\) is
\[
\operatorname{Bias}(\hat{\theta}) := \mathbb{E}[\hat{\theta}] - \theta.
\]
An estimator with \(\operatorname{Bias}(\hat{\theta}) = 0\) for every \(\theta \in \Theta\) is called
unbiased.
The variance of an estimator is the standard
variance
of \(\hat{\theta}\) viewed as a random variable:
\[
\operatorname{Var}(\hat{\theta}) = \mathbb{E}\!\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\right].
\]
An ideal estimator has both low bias and low variance. However, these two goals often
conflict - reducing one may increase the other. The mean squared error
provides a single criterion that balances both considerations.
Definition: Mean Squared Error (MSE)
The mean squared error of an estimator \(\hat{\theta}\) of a parameter \(\theta\) is
\[
\operatorname{MSE}(\hat{\theta}) := \mathbb{E}\!\left[(\hat{\theta} - \theta)^2\right].
\]
A short calculation shows that the MSE decomposes into the variance and the squared bias of the estimator.
Bias-variance decomposition:
\[
\begin{align*}
\operatorname{MSE}(\hat{\theta}) &= \mathbb{E}\left[(\hat{\theta} - \theta)^2\right] \\\\
&= \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}] + \mathbb{E}[\hat{\theta}] - \theta)^2\right] \\\\
&= \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2
+ 2(\hat{\theta} - \mathbb{E}[\hat{\theta}])(\mathbb{E}[\hat{\theta}] - \theta)
+ (\mathbb{E}[\hat{\theta}] - \theta)^2\right].
\end{align*}
\]
Using the linearity of expectation, we distribute \(\mathbb{E}\). Note that \(\theta\) is a fixed population parameter,
and \(\mathbb{E}[\hat{\theta}]\) is a constant. Thus, \((\mathbb{E}[\hat{\theta}] - \theta)\) is treated as a constant:
\[
\begin{align*}
\operatorname{MSE}(\hat{\theta}) &= \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\right]
+ 2(\mathbb{E}[\hat{\theta}] - \theta)\mathbb{E}\left[\hat{\theta} - \mathbb{E}[\hat{\theta}]\right]
+ \mathbb{E}\left[(\mathbb{E}[\hat{\theta}] - \theta)^2\right] \\\\
&= \operatorname{Var}(\hat{\theta}) + 2(\mathbb{E}[\hat{\theta}] - \theta)(0) + [\operatorname{Bias}(\hat{\theta})]^2 \\\\
&= \operatorname{Var}(\hat{\theta}) + [\operatorname{Bias}(\hat{\theta})]^2.
\end{align*}
\]
Note: The cross term vanishes because the expected deviation from the mean is zero:
\(\mathbb{E}\left[\hat{\theta} - \mathbb{E}[\hat{\theta}]\right] = \mathbb{E}[\hat{\theta}] - \mathbb{E}[\hat{\theta}] = 0\).
The MSE serves as a criterion for comparing estimators: among competing estimators, we prefer the one with
the smallest MSE. For an unbiased estimator, the MSE reduces to the variance alone.
Once an estimator is selected, we quantify its precision using the standard error (SE), which
is the standard deviation of the estimator's sampling distribution. For the sample mean,
\[
\operatorname{SE}(\bar{X}) = \sqrt{\operatorname{Var}(\bar{X})} = \frac{\sigma}{\sqrt{n}}.
\]
Notice that the standard error decreases as \(n\) grows, confirming the intuition that more data yields more precise
estimates. With the concept of an estimator and its quality in hand, we now ask: what principle should guide
our choice of estimator in the first place?
Likelihood Functions
Before we can choose an optimal estimator, we need a way to measure how well a candidate parameter value
\(\theta\) explains the observed data. The key insight is a change of perspective:
the same mathematical expression that gives the probability of data given a parameter can also be viewed
as a function of the parameter given fixed data. This reversal of roles leads to the likelihood function.
Definition: Likelihood Function
Let \(f(\,\cdot \mid \theta)\) denote the joint p.d.f. (or p.m.f.) of a sample
\((X_1, \ldots, X_n)\), parameterized by an unknown parameter \(\theta \in \Theta\).
Once the data \((x_1, \ldots, x_n)\) have been observed, they are no longer free
variables — only \(\theta\) remains unknown. The likelihood function
of \(\theta\) given the observed data is the function
\[
L(\theta \mid x_1, \ldots, x_n) \;:=\; f(x_1, \ldots, x_n \mid \theta),
\]
now regarded as a function of \(\theta\) with the observed values held fixed.
We often write \(L(\theta)\) when the data are clear from context.
For an i.i.d. sample, the joint density factorizes and
\[
L(\theta) \;=\; \prod_{i=1}^n f(x_i \mid \theta).
\]
The crucial distinction is one of interpretation: the expression \(\prod_{i=1}^n f(x_i \mid \theta)\) is the same mathematical
formula whether we view it as a probability (function of \(x\) with \(\theta\) fixed) or as a
likelihood (function of \(\theta\) with \(x\) fixed). This duality is often expressed as
\[
\underbrace{L(\theta \mid x_1, \ldots, x_n)}_{\text{After sampling: function of } \theta}
= \underbrace{\prod_{i=1}^n f(x_i \mid \theta)}_{\text{Before sampling: function of } x}.
\]
With the likelihood function defined, we can now state the most widely used
principle for parameter estimation.
Maximum Likelihood Estimation
In machine learning, model fitting (or training) is the process of estimating unknown parameters
\(\boldsymbol{\theta} = (\theta_1, \ldots, \theta_k)\) from sample data
\(\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}\). This is typically framed as the
optimization of an objective (or loss) function over \(\boldsymbol{\theta}\).
A widely used choice in frequentist inference is to select the parameter that makes the
observed data most probable.
Convention. In what follows we allow \(\mathbf{x}_i \in \mathbb{R}^d\) in general; the univariate case
\(x_i \in \mathbb{R}\) used in the examples below is the special case \(d = 1\). Boldface marks vector-valued quantities,
lightface marks scalar ones.
Definition: Maximum Likelihood Estimator
The maximum likelihood estimator (MLE) is defined as
\[
\hat{\boldsymbol{\theta}}_{\text{MLE}} = \arg\max_{\boldsymbol{\theta}}\, L(\boldsymbol{\theta})
\]
where \(L(\boldsymbol{\theta})\) is the likelihood function for the sample data \(\mathcal{D}\).
The argmax need not exist or be unique in general. Standard regularity conditions guarantee
existence (e.g. compactness of \(\Theta\) with continuous \(L\)) and local uniqueness near the
truth (e.g. strict concavity of \(\ln L\)); for the examples below, both conditions hold.
Since the logarithm is a strictly increasing function, maximizing \(L\) is equivalent to maximizing \(\ln L\).
Working with the log-likelihood is preferred in practice for two reasons: it converts products
into sums (improving numerical stability) and simplifies differentiation.
If \(L(\boldsymbol{\theta})\) is differentiable in \(\boldsymbol{\theta}\) and the maximum is attained
at an interior point of the parameter space, the MLE satisfies the score equation.
For an i.i.d. sample,
\[
\nabla_{\boldsymbol{\theta}} \ln L(\boldsymbol{\theta})
= \nabla_{\boldsymbol{\theta}} \sum_{i = 1}^n \ln f(\mathbf{x}_i \mid \boldsymbol{\theta})
= \sum_{i=1}^n \nabla_{\boldsymbol{\theta}} \ln f(\mathbf{x}_i \mid \boldsymbol{\theta})
= \mathbf{0}.
\]
The score equation provides only a necessary condition: critical points must be checked
against the second-order condition (negative-definite Hessian of \(\ln L\)) to confirm a local maximum,
and boundary maxima or non-differentiable cases must be handled separately. This gradient of the
log-likelihood is called the score function, which plays a central role in the
Fisher information theory.
Example 1: Binomial Distribution \(X \sim b(n, p) \)
We begin with a discrete example. Consider flipping a coin \(n\) times and observing \(k\) heads.
Each flip is a Bernoulli trial; the \(n\) flips together follow \(\text{Binomial}(n, \theta)\).
We can equivalently view the data as a single observation \(X \sim b(n, \theta)\) with
likelihood \(L(\theta) = P(X = k \mid \theta)\), or as \(n\) i.i.d. Bernoulli trials with
likelihood \(\prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i}\) where \(\sum x_i = k\); both yield
the same log-likelihood up to a \(\theta\)-independent constant. We adopt the single-observation
view below, so the symbol \(n\) here denotes the number of trials in the binomial
experiment, not the sample size of the previous section (which is one in this view).
Example:
Let \(X \sim b(n, \theta)\), where \(\theta \in [0, 1]\) is the probability of success.
If we observe \(X = k\) successes in \(n\) trials, the likelihood function is the probability mass function:
\[
L(\theta) = P(X = k \mid \theta) = \binom{n}{k} \theta^k (1-\theta)^{n-k}.
\]
Taking the natural logarithm, we get:
\[
\ln L(\theta) = \ln \binom{n}{k} + k \ln(\theta) + (n-k) \ln(1-\theta).
\]
To find \(\hat{\theta}_{\text{MLE}}\), we take the derivative with respect to \(\theta\) and set it to zero.
Note that the combinatorial term \(\ln \binom{n}{k}\) is a constant with respect to \(\theta\) and vanishes:
\[
\begin{align*}
\frac{d}{d\theta} \ln L(\theta)
&= \frac{k}{\theta} - \frac{n-k}{1-\theta} \\\\
&= \frac{k(1-\theta) - (n-k)\theta}{\theta(1-\theta)}
= \frac{k - n\theta}{\theta(1-\theta)} \;=\; 0,
\end{align*}
\]
which holds (for \(\theta \in (0,1)\)) iff \(k = n\theta\).
Therefore,
\[
\hat{\theta}_{\text{MLE}} = \frac{k}{n}.
\]
The second derivative
\(\frac{d^2}{d\theta^2}\ln L(\theta) = -k/\theta^2 - (n-k)/(1-\theta)^2 < 0\)
on \((0,1)\), so this critical point is the unique interior maximum.
This is exactly the sample proportion \(\hat{p} = X/n\): the intuitively
natural estimator coincides with the MLE. Its sampling distribution properties follow from
\(X \sim b(n, p)\):
\[
\mathbb{E}[\hat{p}] = \frac{1}{n}\mathbb{E}[X] = p,
\qquad
\operatorname{Var}(\hat{p}) = \frac{1}{n^2}\operatorname{Var}[X] = \frac{p(1-p)}{n},
\]
so \(\hat{p}\) is unbiased, with variance shrinking as \(1/n\). We now turn to a continuous
example where the MLE must be found for two parameters simultaneously.
Example 2: Normal Distribution
Given an i.i.d. sample \(X_1, X_2, \ldots, X_n\) from the
normal distribution
\(\mathcal{N}(\mu, \sigma^2)\), we seek the MLEs for both parameters.
Example:
With observed values \(\mathcal{D} = \{x_1, x_2, \ldots, x_n\}\) and per-observation density
\[
f(x \mid \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}}\exp \left\{- \frac{(x - \mu)^2}{2\sigma^2}\right\},
\]
the likelihood function is
\[
\begin{align*}
L(\mu, \sigma^2) &= \prod_{i=1}^n f(x_i \mid \mu, \sigma^2) \\\\
&= \left (\frac{1}{\sigma \sqrt{2\pi}}\right)^n \exp \left\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right\}.
\end{align*}
\]
The log-likelihood function is given by
\[
\begin{align*}
\ln L(\mu, \sigma^2) &= n \ln \left(\frac{1}{\sigma \sqrt{2\pi}}\right) -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \\\\
&= -n \ln (\sigma) - n \ln (\sqrt{2\pi}) -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \\\\
&= -\frac{n}{2} \ln (\sigma^2) - \frac{n}{2}\ln (2\pi) -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2. \tag{1}
\end{align*}
\]
At any interior critical point of \(\ln L\), both partial derivatives must vanish
simultaneously. Setting the partial derivative of (1) with respect to \(\mu\) equal to zero:
\[
\frac{\partial \ln L(\mu, \sigma^2)}{\partial \mu}
= \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu)
= \frac{1}{\sigma^2}\!\left(\sum_{i=1}^n x_i - n\mu\right) = 0,
\]
which gives
\[
\hat{\mu}_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x}. \tag{2}
\]
Similarly, taking the partial derivative of (1) with respect to the variance \(\sigma^2\)
(treating \(\sigma^2\) as a single variable) and setting it to zero:
\[
\frac{\partial \ln L(\mu, \sigma^2) }{\partial (\sigma^2)}
= -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i - \mu)^2 = 0
\;\;\iff\;\;
\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2.
\]
Substituting the value \(\hat{\mu}_{\text{MLE}} = \bar{x}\) obtained above into this
equation gives the joint critical point
\[
\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2.
\]
As \(\sigma^2 \to 0^+\) or \(\sigma^2 \to \infty\) the likelihood vanishes, and
\((\hat{\mu}_{\text{MLE}}, \hat{\sigma}^2_{\text{MLE}})\) is the unique interior critical
point, so it is the global maximum.
Note that the MLE for the variance divides by \(n\), not \(n - 1\). This means \(\hat{\sigma}^2_{\text{MLE}}\)
is a biased estimator of \(\sigma^2\), with \(\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \frac{n-1}{n}\sigma^2\).
Replacing \(n\) with \(n-1\) — Bessel's correction — gives the unbiased sample variance
\(s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2\).
Verification of the bias:
For an i.i.d. sample with finite variance, \(\bar{X}\) has \(\operatorname{Var}(\bar{X}) = \sigma^2/n\)
(a standard property of sums of independent variables).
Expand using \(\sum_{i=1}^n (X_i - \bar{X})^2 = \sum_{i=1}^n X_i^2 - n\bar{X}^2\) and take expectations.
Using \(\mathbb{E}[X_i^2] = \sigma^2 + \mu^2\) and \(\mathbb{E}[\bar{X}^2] = \operatorname{Var}(\bar{X}) + (\mathbb{E}[\bar{X}])^2 = \sigma^2/n + \mu^2\):
\[
\begin{align*}
\mathbb{E}\!\left[\sum_{i=1}^n (X_i - \bar{X})^2\right]
&= \sum_{i=1}^n \mathbb{E}[X_i^2] - n\,\mathbb{E}[\bar{X}^2] \\\\
&= n(\sigma^2 + \mu^2) - n\!\left(\frac{\sigma^2}{n} + \mu^2\right) \\\\
&= (n - 1)\sigma^2.
\end{align*}
\]
Dividing by \(n\) gives \(\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \frac{n-1}{n}\sigma^2\), confirming the bias.
The finite-sample bias of \(\hat{\sigma}^2_{\text{MLE}}\) is a general feature of MLE: it can
be biased for finite \(n\), yet under standard regularity conditions
(identifiability, compactness of the parameter space, continuity and domination of the
log-likelihood) it is consistent, meaning
\(\hat{\boldsymbol{\theta}}_{\text{MLE}} \xrightarrow{P} \boldsymbol{\theta}_{\text{true}}\) as
\(n \to \infty\) in the sense of
convergence in probability.
A classical consistency theorem along these lines is due to Wald (1949).
Connections to Machine Learning
MLE is the default parameter estimation method across machine learning.
Training a logistic regression
model is equivalent to minimizing the negative log-likelihood (cross-entropy loss).
Training a neural network
with mean squared error loss is equivalent to MLE under a Gaussian noise assumption.
The exponential family unifies these
examples, showing that, within the exponential family, MLE reduces to
moment matching.
In Bayesian inference, MLE corresponds
to the special case of a uniform (flat) prior, and incorporating
regularization is equivalent to choosing a non-uniform prior.
MLE provides point estimates - single "best guess" values for the parameters. But how confident should we be in these estimates?
The next page introduces hypothesis testing and
confidence intervals, which provide principled frameworks for quantifying the uncertainty inherent
in any statistical estimate.