Point Estimators
In the previous parts, we built up a vocabulary of probability distributions. Each of them is
parameterized by unknown quantities (means, variances, covariance matrices) that must
be determined from observed data. The fundamental question of statistical inference
is: given a sample \(\mathcal{D} = \{x_1, \ldots, x_n\}\), how do we estimate the true parameter \(\theta\)
of the underlying population?
Definition: Point Estimator
A point estimator \(\hat{\theta}\) is a function of sample
random variables \(X_1, X_2, \ldots, X_n\) that produces a single value as an
estimate of the unknown population parameter \(\theta\). For example, the
sample mean
\[
\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i
\]
is a point estimator of the population mean \(\mu\).
Since \(\hat{\theta}\) is a function of random variables, it is itself a random
variable with its own distribution, called the sampling distribution.
A natural question is: how close is \(\hat{\theta}\) to the true parameter \(\theta\)?
Two fundamental properties characterize the quality of an estimator:
- Bias:
\[
\text{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta.
\]
This measures the systematic error - on average, how far off is the estimator from the truth?
An estimator with zero bias is called unbiased.
- Variance:
\[
\text{Var}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2]
\]
This measures the precision - how much does the estimator vary across different samples?
An ideal estimator has both low bias and low variance. However, these two goals often
conflict - reducing one may increase the other. The mean squared error
provides a single criterion that balances both considerations.
Definition: Mean Squared Error (MSE)
The mean squared error of an estimator \(\hat{\theta}\) is
\[
\begin{align*}
\text{MSE}(\hat{\theta}) &= \mathbb{E}\!\left[(\hat{\theta} - \theta)^2\right] \\\\
&= \text{Var}(\hat{\theta}) + \left[\text{Bias}(\hat{\theta})\right]^2.
\end{align*}
\]
Derivation:
\[
\begin{align*}
\text{MSE}(\hat{\theta}) &= \mathbb{E}\left[(\hat{\theta} - \theta)^2\right] \\\\
&= \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}] + \mathbb{E}[\hat{\theta}] - \theta)^2\right] \\\\
&= \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2
+ 2(\hat{\theta} - \mathbb{E}[\hat{\theta}])(\mathbb{E}[\hat{\theta}] - \theta)
+ (\mathbb{E}[\hat{\theta}] - \theta)^2\right].
\end{align*}
\]
Using the linearity of expectation, we distribute \(\mathbb{E}\). Note that \(\theta\) is a fixed population parameter,
and \(\mathbb{E}[\hat{\theta}]\) is a constant. Thus, \((\mathbb{E}[\hat{\theta}] - \theta)\) is treated as a constant:
\[
\begin{align*}
\text{MSE}(\hat{\theta}) &= \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\right]
+ 2(\mathbb{E}[\hat{\theta}] - \theta)\mathbb{E}\left[\hat{\theta} - \mathbb{E}[\hat{\theta}]\right]
+ \mathbb{E}\left[(\mathbb{E}[\hat{\theta}] - \theta)^2\right] \\\\
&= \text{Var}(\hat{\theta}) + 2(\mathbb{E}[\hat{\theta}] - \theta)(0) + [\text{Bias}(\hat{\theta})]^2 \\\\
&= \text{Var}(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2.
\end{align*}
\]
Note: The cross term vanishes because the expected deviation from the mean is zero:
\(\mathbb{E}\left[\hat{\theta} - \mathbb{E}[\hat{\theta}]\right] = \mathbb{E}[\hat{\theta}] - \mathbb{E}[\hat{\theta}] = 0\).
The MSE serves as a criterion for comparing estimators: among competing estimators, we prefer the one with
the smallest MSE. For an unbiased estimator, the MSE reduces to the variance alone.
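To make the decomposition concrete, here is a minimal Monte Carlo sketch (assuming NumPy; the sample size and true variance are arbitrary choices) that estimates the bias, variance, and MSE of two variance estimators, one dividing by \(n\) and one by \(n - 1\):
```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, sigma2 = 10, 100_000, 4.0  # sample size, repetitions, true variance

# Draw many independent samples and apply both estimators to each.
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, n))
estimators = {
    "divide by n":     samples.var(axis=1, ddof=0),  # biased (the MLE, as we will see)
    "divide by n - 1": samples.var(axis=1, ddof=1),  # unbiased
}

for name, est in estimators.items():
    bias = est.mean() - sigma2
    var = est.var()
    mse = np.mean((est - sigma2) ** 2)
    # The identity MSE = Var + Bias^2 holds up to Monte Carlo error.
    print(f"{name}: bias={bias:+.4f}  var={var:.4f}  "
          f"mse={mse:.4f}  var+bias^2={var + bias**2:.4f}")
```
Notably, for Gaussian data the biased \(n\)-divisor estimator typically attains the smaller MSE, a reminder that unbiasedness alone does not determine estimator quality.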
Once an estimator is selected, we quantify its precision using the standard error (SE), which
is the standard deviation of the estimator's sampling distribution. For the sample mean of an i.i.d. sample with population standard deviation \(\sigma\),
\[
\text{SE}(\bar{X}) = \sqrt{\text{Var}(\bar{X})} = \frac{\sigma}{\sqrt{n}}.
\]
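As a quick numerical check (again a sketch assuming NumPy, with arbitrary constants), the spread of \(\bar{X}\) across repeated samples tracks \(\sigma/\sqrt{n}\):
```python
import numpy as np

rng = np.random.default_rng(1)
sigma, trials = 2.0, 50_000
for n in (10, 100, 1000):
    # Sampling distribution of the mean: one mean per simulated sample.
    means = rng.normal(0.0, sigma, size=(trials, n)).mean(axis=1)
    print(f"n={n:4d}: empirical SE={means.std():.4f}  "
          f"sigma/sqrt(n)={sigma / np.sqrt(n):.4f}")
```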
Notice that the standard error decreases as \(n\) grows, confirming the intuition that more data yields more precise
estimates. With the concept of an estimator and its quality in hand, we now ask: what principle should guide
our choice of estimator in the first place?
Likelihood Functions
Before we can choose an optimal estimator, we need a way to measure how well a candidate parameter value
\(\theta\) explains the observed data. The key insight is a change of perspective:
the same mathematical expression that gives the probability of data given a parameter can also be viewed
as a function of the parameter given fixed data. This reversal of roles leads to the likelihood function.
Definition: Likelihood Function
Suppose \(X_1, X_2, \ldots, X_n\) are i.i.d. random variables with common p.d.f. (or p.m.f.)
\(f(\cdot \mid \theta)\), where \(\theta\) is some unknown parameter, and let
\(x_1, x_2, \ldots, x_n\) denote their observed values. Then the joint p.d.f. (or p.m.f.) of \(X_1, X_2, \ldots, X_n\) is
\[
f(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i = 1}^n f(x_i \mid \theta).
\]
Viewed as a function of \(\theta\) for the fixed observed values \(x_1, x_2, \ldots, x_n\),
this product is called the likelihood function and is denoted by \(L(\theta \mid x_1, x_2, \ldots, x_n)\), or simply \(L(\theta)\).
The crucial distinction is one of interpretation: the expression \(\prod_{i=1}^n f(x_i \mid \theta)\) is the same mathematical
formula whether we view it as a probability (function of \(x\) with \(\theta\) fixed) or as a
likelihood (function of \(\theta\) with \(x\) fixed). This duality is often expressed as
\[
\underbrace{L(\theta \mid x_1, \ldots, x_n)}_{\text{After sampling: function of } \theta}
= \underbrace{\prod_{i=1}^n f(x_i \mid \theta)}_{\text{Before sampling: function of } x}.
\]
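To see the change of perspective concretely, the toy sketch below (assuming NumPy) fixes a small Bernoulli sample and evaluates \(L(\theta) = \prod_i \theta^{x_i}(1-\theta)^{1-x_i}\) over a grid of \(\theta\) values, the same formula now scanned over the parameter instead of the data:
```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # fixed observed coin flips
thetas = np.linspace(0.01, 0.99, 99)    # candidate parameter values

# Likelihood of each candidate theta for the *same* fixed data.
L = np.array([np.prod(t**x * (1 - t)**(1 - x)) for t in thetas])

print(f"L peaks at theta={thetas[np.argmax(L)]:.2f}; sample mean={x.mean():.2f}")
```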
With the likelihood function defined, we can now state the most widely used
principle for parameter estimation.
Maximum Likelihood Estimation
In machine learning, model fitting (or training) is the process of estimating unknown parameters
\(\boldsymbol{\theta} = (\theta_1, \ldots, \theta_k)\) from sample data
\(\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}\). This can be framed as an
optimization problem:
\[
\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})
\]
where \(\mathcal{L}(\boldsymbol{\theta})\) is a loss function (or objective function).
The most natural and widely used choice is to select the parameter that makes the
observed data most probable.
Definition: Maximum Likelihood Estimator
The maximum likelihood estimator (MLE) is defined as
\[
\hat{\boldsymbol{\theta}}_{\text{MLE}} = \arg\max_{\boldsymbol{\theta}}\, L(\boldsymbol{\theta})
\]
where \(L(\boldsymbol{\theta})\) is the likelihood function for the sample data \(\mathcal{D}\).
Since the logarithm is a strictly increasing function, maximizing \(L\) is equivalent to maximizing \(\ln L\).
Working with the log-likelihood is preferred in practice for two reasons: it converts products
into sums (improving numerical stability) and simplifies differentiation.
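The stability point is easy to demonstrate: a product of thousands of density values underflows double precision to zero, while the sum of their logarithms stays finite (a minimal sketch assuming NumPy):
```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=2000)  # 2000 standard-normal observations

pdf = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # N(0,1) density at each point
print(np.prod(pdf))         # 0.0 -- the raw likelihood underflows
print(np.sum(np.log(pdf)))  # the log-likelihood remains well behaved
```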
If \(L(\boldsymbol{\theta})\) is differentiable, the MLE can be found by solving the score equation:
\[
\begin{align*}
&\nabla_{\boldsymbol{\theta}} \ln L(\boldsymbol{\theta})
= \nabla_{\boldsymbol{\theta}} \ln \prod_{i = 1}^n f(\mathbf{x}_i \mid \boldsymbol{\theta}) = \mathbf{0}\\\\
&\Longrightarrow
\nabla_{\boldsymbol{\theta}} \ln L(\boldsymbol{\theta})
= \sum_{i=1}^ n \nabla_{\boldsymbol{\theta}} \ln f(\mathbf{x}_i \mid \boldsymbol{\theta}) = \mathbf{0}.
\end{align*}
\]
This gradient of the log-likelihood is called the score function, which plays a central role in the
theory of Fisher information.
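A quick numeric illustration of a key property of the score (a sketch assuming NumPy): at the true parameter, the score has expectation zero, i.e. \(\mathbb{E}[\nabla_{\boldsymbol{\theta}} \ln f(X \mid \boldsymbol{\theta})] = \mathbf{0}\). For Bernoulli data:
```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, trials = 0.3, 50, 100_000

x = rng.binomial(1, theta, size=(trials, n))
heads = x.sum(axis=1)
# Score of the Bernoulli log-likelihood, evaluated at the true theta.
score = heads / theta - (n - heads) / (1 - theta)
print(f"average score at the true theta: {score.mean():.3f}")  # ~ 0.0
```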
Example 1: Binomial Distribution \(X \sim b(n, \theta)\)
We begin with a discrete example. Consider flipping a coin \(n\) times and observing \(k\) heads.
Derivation:
Let \(X \sim b(n, \theta)\), where \(\theta \in [0, 1]\) is the probability of success.
If we observe \(X = k\) successes in \(n\) trials, the likelihood function is the probability mass function:
\[
L(\theta) = P(X = k \mid \theta) = \binom{n}{k} \theta^k (1-\theta)^{n-k}.
\]
Taking the natural logarithm, we get:
\[
\ln L(\theta) = \ln \binom{n}{k} + k \ln(\theta) + (n-k) \ln(1-\theta).
\]
To find \(\hat{\theta}_{\text{MLE}}\), we take the derivative with respect to \(\theta\) and set it to zero.
Note that the combinatorial term \(\ln \binom{n}{k}\) is a constant with respect to \(\theta\) and vanishes:
\[
\begin{align*}
\frac{d}{d\theta} \ln L(\theta) &= \frac{k}{\theta} - \frac{n-k}{1-\theta} = 0 \\\\
&\Longrightarrow k(1 - \theta) - (n-k)\theta = 0 \\\\
&\Longrightarrow k - k\theta - n\theta + k\theta = 0 \\\\
&\Longrightarrow k = n\theta.
\end{align*}
\]
Therefore,
\[
\hat{\theta}_{\text{MLE}} = \frac{k}{n}.
\]
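As a sanity check, a simple grid search over the log-likelihood (a sketch assuming NumPy; the trial counts and grid resolution are arbitrary) recovers the same answer:
```python
import numpy as np
from math import comb, log

n, k = 20, 7  # 7 successes in 20 trials
thetas = np.linspace(0.001, 0.999, 999)
loglik = log(comb(n, k)) + k * np.log(thetas) + (n - k) * np.log(1 - thetas)
print(f"numeric argmax = {thetas[np.argmax(loglik)]:.3f}, k/n = {k / n:.3f}")
```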
This is exactly the sample proportion \(\hat{p} = \frac{X}{n}\), confirming that the
intuitively natural estimator for the population proportion coincides with the MLE. We now
turn to a continuous example where the MLE must be found for two parameters simultaneously.
Note: the sample proportion \(\hat{p}\) is unbiased, and its variance shrinks at rate \(1/n\):
\[
\begin{align*}
&\mathbb{E}[\hat{p}] = \frac{1}{n}\mathbb{E}[X] = \frac{1}{n}np = p \\\\
&\text{Var}(\hat{p}) = \frac{1}{n^2}\text{Var}(X) = \frac{1}{n^2}np(1-p) = \frac{p(1-p)}{n}.
\end{align*}
\]
Example 2: Normal Distribution
Given a random sample \(X_1, X_2, \ldots, X_n\) from the normal distribution
\(\mathcal{N}(\mu, \sigma^2)\), we seek the MLEs for both parameters.
Derivation:
Suppose \(\mathcal{D} = \{x_1, x_2, \ldots, x_n\}\) is an observed sample from a normal distribution with p.d.f.
\[
f(x \mid \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}}\exp \left\{- \frac{(x - \mu)^2}{2\sigma^2}\right\}.
\]
The likelihood function is then
\[
\begin{align*}
L(\mu, \sigma^2) &= \prod_{i=1}^n f(x_i \mid \mu, \sigma^2) \\\\
&= \left(\frac{1}{\sigma \sqrt{2\pi}}\right)^n \exp \left\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right\}.
\end{align*}
\]
The log-likelihood function is given by
\[
\begin{align*}
\ln L(\mu, \sigma^2) &= n \ln \left(\frac{1}{\sigma \sqrt{2\pi}}\right) -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \\\\
&= -n \ln (\sigma) - n \ln (\sqrt{2\pi}) -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \\\\
&= -\frac{n}{2} \ln (\sigma^2) - \frac{n}{2}\ln (2\pi) -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2. \tag{1}
\end{align*}
\]
Setting the partial derivative of (1) with respect to \(\mu\) equal to zero:
\[
\begin{align*}
&\frac{\partial \ln L(\mu, \sigma^2) }{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) = 0 \\\\
&\Longrightarrow \sum_{i=1}^n x_i - n\mu = 0.
\end{align*}
\]
Thus,
\[
\hat{\mu}_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x}. \tag{2}
\]
Similarly, taking the partial derivative of (1) with respect to the variance \(\sigma^2\) (treating \(\sigma^2\) as a single variable)
and setting it to zero:
\[
\begin{align*}
&\frac{\partial \ln L(\mu, \sigma^2) }{\partial (\sigma^2)} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i - \mu)^2 = 0 \\\\
&\Longrightarrow -n \sigma^2 + \sum_{i=1}^n (x_i - \mu)^2 = 0.
\end{align*}
\]
To find the maximum, we substitute \(\mu\) with our MLE estimate \(\hat{\mu}_{\text{MLE}} = \bar{x}\) and solve for \(\sigma^2\):
\[
\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2.
\]
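As a cross-check, a generic numerical optimizer applied to the negative log-likelihood recovers the same closed-form answers (a sketch assuming NumPy and SciPy; the synthetic data parameters are arbitrary):
```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=500)  # synthetic sample

def neg_log_lik(params):
    mu, log_sigma = params              # optimize log(sigma) so sigma > 0
    sigma2 = np.exp(2.0 * log_sigma)
    return 0.5 * len(x) * np.log(2.0 * np.pi * sigma2) \
         + np.sum((x - mu) ** 2) / (2.0 * sigma2)

res = minimize(neg_log_lik, x0=np.array([0.0, 0.0]))
mu_hat, sigma2_hat = res.x[0], np.exp(2.0 * res.x[1])
print(f"optimizer:   mu={mu_hat:.4f}  sigma^2={sigma2_hat:.4f}")
print(f"closed form: mu={x.mean():.4f}  sigma^2={x.var(ddof=0):.4f}")
```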
Note that the MLE for the variance divides by \(n\), not \(n - 1\). This means \(\hat{\sigma}^2_{\text{MLE}}\)
is a biased estimator of \(\sigma^2\), with \(\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \frac{n-1}{n}\sigma^2\).
The unbiased sample variance \(s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2\) removes this bias via Bessel's correction.
This is one instance of a general phenomenon: the MLE is consistent (it converges to the true value as
\(n \to \infty\)) but not always unbiased for finite samples.
Connections to Machine Learning
MLE is the default parameter estimation method across machine learning.
Training a logistic regression model is equivalent to minimizing the negative log-likelihood,
which is exactly the cross-entropy loss.
Training a neural network with mean squared error loss is equivalent to MLE under a Gaussian
noise assumption, as the short derivation below shows.
For models in the exponential family, the MLE reduces to moment matching, which unifies these examples.
In Bayesian inference, MLE corresponds to the special case of a uniform (flat) prior, and adding
a regularization term is equivalent to MAP estimation under a non-uniform prior.
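To make the Gaussian-noise connection explicit: if \(y_i = f_{\boldsymbol{\theta}}(\mathbf{x}_i) + \varepsilon_i\) with \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\) i.i.d., the negative log-likelihood is
\[
-\ln L(\boldsymbol{\theta}) = \frac{n}{2}\ln(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i - f_{\boldsymbol{\theta}}(\mathbf{x}_i)\right)^2,
\]
so for fixed \(\sigma^2\), minimizing it over \(\boldsymbol{\theta}\) is precisely minimizing the sum of squared errors.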
MLE provides point estimates - single "best guess" values for the parameters. But how confident should we be in these estimates?
In the next part, we introduce hypothesis testing and
confidence intervals, which provide principled frameworks for quantifying the uncertainty inherent
in any statistical estimate.