Student's \(t\)-Distribution
In Part 4, we saw that the chi-squared distribution governs
sums of squared standard normals, and that the sample variance \(s^2\) satisfies \((n-1)s^2/\sigma^2 \sim \chi^2_{n-1}\).
A natural question follows: what happens to our inference when we replace the unknown population standard deviation
\(\sigma\) with its estimate \(s\)? The answer is the Student's \(t\)-distribution, which arises as the
ratio of a standard normal to the square root of an independent chi-squared variable divided by its degrees of freedom:
\[
T = \frac{Z}{\sqrt{V/\nu}} \sim t_\nu, \qquad Z \sim \mathcal{N}(0, 1), \quad V \sim \chi^2_\nu \text{ independent of } Z.
\]
Definition: Student's \(t\)-Distribution
A random variable \(Y\) has a Student's \(t\)-distribution with location
\(\mu\), scale \(\sigma > 0\) (not the standard deviation), and \(\nu > 0\) degrees
of freedom if its p.d.f. is:
\[
f(y \mid \mu, \sigma^2, \nu) \propto \left[1 + \frac{1}{\nu}\left(\frac{y - \mu}{\sigma}\right)^2\right]^{-\frac{\nu+1}{2}}.
\]
The moments are:
\[
\text{mean} = \mu \;(\text{exists if } \nu > 1), \qquad
\text{mode} = \mu, \qquad
\text{Var}(Y) = \frac{\nu\sigma^2}{\nu - 2} \;(\text{exists if } \nu > 2).
\]
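As a quick sanity check, we can simulate the defining ratio and compare it with SciPy's implementation. This is a minimal sketch assuming NumPy and SciPy; the sample size and the choice \(\nu = 5\) are arbitrary:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
nu, n = 5, 500_000

# Defining ratio: T = Z / sqrt(V / nu), with Z ~ N(0,1) and V ~ chi^2_nu independent.
z = rng.standard_normal(n)
v = rng.chisquare(nu, size=n)
t_samples = z / np.sqrt(v / nu)

# Empirical variance should approach nu / (nu - 2) = 5/3 for nu = 5.
print(t_samples.var())             # ~1.67
print(stats.t.var(df=nu))          # exact: 1.666...

# Distributional check against scipy.stats.t.
print(stats.kstest(t_samples, stats.t(df=nu).cdf).pvalue)  # large p-value expected
```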
The key feature of the \(t\)-distribution is its heavy tails: compared to a normal distribution
with the same location and scale, the \(t\)-distribution assigns substantially more probability mass to extreme values.
This makes parameter estimates based on the \(t\)-distribution more robust to outliers.
The parameter \(\nu\) controls the tail heaviness. As \(\nu \to \infty\), the \(t\)-distribution converges to the
normal distribution \(\mathcal{N}(\mu, \sigma^2)\). In practice, once \(\nu\) is well above 5 (by \(\nu \approx 30\)), the
\(t\)-distribution is nearly indistinguishable from the normal and loses its robustness advantage.
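To make the heavy tails concrete, we can compare the two-sided tail mass \(P(|Y| > 4)\) under the standard normal and under \(t\)-distributions of increasing \(\nu\). A SciPy sketch; the threshold 4 is an arbitrary illustration:

```python
from scipy import stats

# Two-sided tail mass P(|Y| > 4) for the standard normal vs. Student's t.
print(f"normal:    {2 * stats.norm.sf(4):.2e}")    # ~6.3e-05
for nu in (1, 2, 5, 30, 100):
    print(f"t(nu={nu:>3}): {2 * stats.t.sf(4, df=nu):.2e}")
# The t tail mass is orders of magnitude larger for small nu,
# and approaches the normal value as nu grows.
```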
Cauchy Distribution
An important special case of the \(t\)-distribution arises at the extreme end of heavy-tailed behavior,
when the degrees of freedom parameter takes its smallest meaningful value.
Definition: Cauchy Distribution
When \(\nu = 1\), the Student's \(t\)-distribution reduces to the Cauchy distribution
with location \(\mu\) and scale \(\gamma > 0\), whose p.d.f. is:
\[
f(x \mid \mu, \gamma) = \frac{1}{\gamma\pi}\left[1 + \left(\frac{x - \mu}{\gamma}\right)^2\right]^{-1}.
\]
Consider the standard Cauchy distribution (\(\mu = 0, \gamma = 1\)). When we attempt to calculate its
expected value:
\[
\mathbb{E}[X] = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{x}{1 + x^2} dx
\]
we find that the integral does not converge. The density's tails decay only at rate \(\frac{1}{x^2}\),
so the integrand \(\frac{x}{1+x^2}\) behaves like \(\frac{1}{x}\) for large \(x\), and \(\int \frac{1}{x} \, dx\)
diverges logarithmically. (See Improper Riemann integrals.)
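We can watch the logarithmic divergence numerically. The sketch below (assuming SciPy) computes the truncated integral \(\frac{1}{\pi}\int_0^A \frac{x}{1+x^2}\,dx = \frac{\log(1+A^2)}{2\pi}\), which grows without bound as \(A\) increases:

```python
import numpy as np
from scipy import integrate

# Truncated positive-half integral of the standard Cauchy mean integrand.
for A in (10, 100, 1_000):
    val, _ = integrate.quad(lambda x: x / (np.pi * (1 + x**2)), 0, A)
    closed = np.log(1 + A**2) / (2 * np.pi)
    print(f"A={A:>5}  quad={val:.3f}  log(1+A^2)/(2*pi)={closed:.3f}")
# Each tenfold increase in A adds ~log(100)/(2*pi) ~= 0.733: logarithmic divergence.
```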
Why this matters: Since the mean and variance are undefined, the Law of Large Numbers
fails. If you average \(n\) independent Cauchy variables, the sample mean \(\bar{X}_n\) does not settle down;
it follows the exact same Cauchy distribution as the individual observations. We will explore
this behavior rigorously in Part 13: Convergence.
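The failure of the Law of Large Numbers is easy to witness in simulation. The sketch below (assuming NumPy) tracks the running mean of standard Cauchy draws; unlike the normal case, it never settles, no matter how many samples accumulate:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

cauchy = rng.standard_cauchy(n)
normal = rng.standard_normal(n)

# Running means after each sample: cumulative sum divided by sample count.
idx = np.arange(1, n + 1)
run_cauchy = np.cumsum(cauchy) / idx
run_normal = np.cumsum(normal) / idx

for k in (100, 1_000, 10_000, 100_000):
    print(f"n={k:>6}  normal mean={run_normal[k-1]:+.4f}  "
          f"cauchy mean={run_cauchy[k-1]:+.4f}")
# The normal running mean shrinks toward 0; the Cauchy one keeps jumping,
# because the sample mean of n i.i.d. Cauchy draws is itself standard Cauchy.
```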
Despite these theoretical challenges, the Cauchy distribution is highly useful. In Bayesian modeling,
when we need a heavy-tailed prior over \(\mathbb{R}^+\) that allows for large values while maintaining
finite density at the origin, we use the Half-Cauchy distribution (\(x \geq 0\)):
\[
f(x \mid \gamma) = \frac{2}{\pi \gamma} \left[ 1 + \left(\frac{x}{\gamma}\right)^2\right]^{-1}.
\]
It is often the default choice of prior for scale parameters in hierarchical models because, unlike the
Inverse-Gamma distribution, it places substantial density near zero and so does not force the posterior
away from small scale values.
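A brief numeric comparison (a sketch with SciPy; the Inverse-Gamma shape and scale are arbitrary illustrative values) shows the difference at the origin that motivates this default:

```python
from scipy import stats

half_cauchy = stats.halfcauchy(scale=1.0)        # gamma = 1
inv_gamma   = stats.invgamma(a=1.0, scale=1.0)   # illustrative shape/scale

# Density near the origin: the Half-Cauchy stays finite and nonzero,
# while the Inverse-Gamma density vanishes, pushing posterior mass
# away from zero for group-level scale parameters.
for x in (1e-3, 1e-1, 1.0):
    print(f"x={x:<6} half-cauchy={half_cauchy.pdf(x):.4f}  "
          f"inv-gamma={inv_gamma.pdf(x):.4g}")

# The Half-Cauchy still leaves ample mass for large scales:
print(f"P(X > 10) under Half-Cauchy: {half_cauchy.sf(10.0):.4f}")  # ~0.063
```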
Laplace Distribution
The Cauchy distribution demonstrates that heavy tails can be extreme enough to prevent even the mean
from existing. A more moderate heavy-tailed alternative, which retains finite moments of all orders
while still placing more mass in the tails than the normal distribution, is the Laplace distribution.
Definition: Laplace Distribution
The Laplace distribution (also called the double exponential distribution)
with location \(\mu\) and scale \(b > 0\) has p.d.f.:
\[
f(y \mid \mu, b) = \frac{1}{2b}\exp\!\left(-\frac{|y - \mu|}{b}\right).
\]
Its moments are:
\[
\text{mean} = \text{mode} = \mu, \qquad \text{Var}(Y) = 2b^2.
\]
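As a quick check (a sketch assuming NumPy and SciPy; \(b = 1.5\) and the tail thresholds are arbitrary), we can confirm the variance formula and see the heavier tails against a normal with the *same* variance:

```python
import numpy as np
from scipy import stats

mu, b = 0.0, 1.5
lap = stats.laplace(loc=mu, scale=b)

# Variance matches the closed form 2 * b^2.
print(lap.var(), 2 * b**2)          # 4.5  4.5

# Tails: compare against a normal matched to the same variance.
norm = stats.norm(loc=mu, scale=np.sqrt(2) * b)
for x in (5, 8, 12):
    print(f"P(Y > {x}):  laplace={lap.sf(x):.2e}  normal={norm.sf(x):.2e}")
# Far enough out, the Laplace holds orders of magnitude more tail mass,
# yet all of its moments remain finite.
```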
Insight: Heavy Tails and Regularization
In machine learning, the choice of distribution often corresponds to a choice of regularization:
MAP estimation adds the negative log-prior \(-\log p(w)\) to the loss. A Gaussian prior on the weights
yields \(L_2\) regularization (Ridge), while a Laplace prior yields \(L_1\) regularization (Lasso),
promoting sparsity. Likewise, modeling the noise in a regression with a Student's \(t\)-distribution
makes the fit more robust to outliers than standard least-squares (Gaussian) regression.
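A minimal demonstration with scikit-learn (synthetic data; the coefficients, noise level, and regularization strengths are arbitrary choices) shows the sparsity that the Laplace prior / \(L_1\) penalty induces:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(7)
n, p = 200, 10

# Sparse ground truth: only 3 of 10 coefficients are nonzero.
X = rng.standard_normal((n, p))
true_w = np.zeros(p)
true_w[:3] = [3.0, -2.0, 1.5]
y = X @ true_w + 0.5 * rng.standard_normal(n)

ridge = Ridge(alpha=1.0).fit(X, y)   # Gaussian prior -> L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)   # Laplace prior  -> L1 penalty

print("ridge:", np.round(ridge.coef_, 2))  # all coefficients nonzero (shrunk)
print("lasso:", np.round(lasso.coef_, 2))  # most irrelevant coefficients exactly 0
```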
The Student's \(t\), Cauchy, and Laplace distributions complete our toolkit of univariate distributions for robust modeling.
In the next part, we move from single random variables to pairs, introducing
covariance as the fundamental measure of linear dependence between them.