Student's t-Distribution - MATH-CS COMPASS

Student's \(t\)-Distribution

In Part 4, we saw that the chi-squared distribution governs sums of squared standard normals, and that the sample variance \(s^2\) of a normal sample satisfies \((n-1)s^2/\sigma^2 \sim \chi^2_{n-1}\). A natural question follows: what happens to our inference when we replace the unknown population standard deviation \(\sigma\) with its estimate \(s\)? The answer is Student's \(t\)-distribution, which arises as the ratio of a standard normal variable to the square root of an independent chi-squared variable divided by its degrees of freedom. The standardized sample mean \((\bar{X} - \mu)/(s/\sqrt{n})\) has exactly this form and follows a \(t\)-distribution with \(n-1\) degrees of freedom.
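The ratio construction can be checked by simulation. The following is a minimal sketch using only the Python standard library; the function names, the seed, and the choice \(\nu = 10\) are illustrative assumptions, and the chi-squared variable is built as a sum of squared normals, so \(\nu\) must be a positive integer here.

```python
import math
import random


def sample_t(nu, rng):
    """Draw one standard t variate as Z / sqrt(V / nu)."""
    z = rng.gauss(0.0, 1.0)
    # V ~ chi-squared(nu), built as a sum of nu squared standard normals
    # (this construction assumes nu is a positive integer).
    v = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))
    return z / math.sqrt(v / nu)


def sample_variance(xs):
    """Unbiased sample variance s^2."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
nu = 10
draws = [sample_t(nu, rng) for _ in range(20_000)]

# The standard t with nu > 2 has variance nu / (nu - 2); the two printed
# values should be close, though not identical, because of sampling noise.
print(sample_variance(draws), nu / (nu - 2))
```

The empirical variance exceeds 1, the variance of the standard normal numerator, which reflects the extra spread introduced by dividing by a random denominator.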

Definition: Student's \(t\)-Distribution

A random variable \(Y\) has a Student's \(t\)-distribution with location \(\mu\), scale \(\sigma > 0\) (which is not the standard deviation), and \(\nu > 0\) degrees of freedom if its p.d.f. satisfies: \[ f(y \mid \mu, \sigma^2, \nu) \propto \left[1 + \frac{1}{\nu}\left(\frac{y - \mu}{\sigma}\right)^2\right]^{-\frac{\nu+1}{2}}. \] The mode is always \(\mu\), and the moments are: \[ \text{mean} = \mu \;(\text{exists if } \nu > 1), \qquad \text{Var}(Y) = \frac{\nu\sigma^2}{\nu - 2} \;(\text{exists if } \nu > 2). \]
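The definition gives the density only up to proportionality. The sketch below fills in the standard normalizing constant \(\Gamma\!\left(\tfrac{\nu+1}{2}\right) / \left(\Gamma\!\left(\tfrac{\nu}{2}\right)\sqrt{\nu\pi}\,\sigma\right)\), which is not stated on this page, and checks numerically that the density integrates to 1 and reproduces the variance formula. The function names, the parameter values, and the truncated integration range are assumptions made for illustration.

```python
import math


def t_pdf(y, mu=0.0, sigma=1.0, nu=3.0):
    """Location-scale Student's t density, including the normalizing constant."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi) - math.log(sigma))
    z = (y - mu) / sigma
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(z * z / nu))


def trapezoid(f, a, b, n=200_000):
    """Composite trapezoid rule on [a, b] with n subintervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


mu, sigma, nu = 1.0, 2.0, 5.0
lo, hi = mu - 2000.0, mu + 2000.0  # wide finite range standing in for the real line

total_mass = trapezoid(lambda y: t_pdf(y, mu, sigma, nu), lo, hi)
variance = trapezoid(lambda y: (y - mu) ** 2 * t_pdf(y, mu, sigma, nu), lo, hi)

print(total_mass)                            # should be very close to 1
print(variance, nu * sigma**2 / (nu - 2))    # numeric vs. closed form
```

Note that the variance \(\nu\sigma^2/(\nu-2)\) is larger than \(\sigma^2\), which is why \(\sigma\) is called a scale rather than a standard deviation.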

The key feature of the \(t\)-distribution is its heavy tails: compared to a normal distribution with the same location and scale, the \(t\)-distribution assigns substantially more probability mass to extreme values. This makes parameter estimates based on the \(t\)-distribution more robust to outliers.

The parameter \(\nu\) controls the tail heaviness. As \(\nu \to \infty\), the \(t\)-distribution converges to the normal distribution \(\mathcal{N}(\mu, \sigma^2)\). In practice, for \(\nu \gg 5\), the \(t\)-distribution is nearly indistinguishable from the normal and loses its robustness advantage.
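To see how \(\nu\) controls the tails, we can compare the probability of landing more than three scale units from the center under the standard \(t\) and under the standard normal. This is a rough numerical sketch: the function names, the cutoff of 3, and the grid of \(\nu\) values are arbitrary choices, and the \(t\) tail is obtained by simple trapezoid integration rather than a library routine.

```python
import math


def t_pdf_std(x, nu):
    """Standard Student's t density (location 0, scale 1)."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(x * x / nu))


def t_two_sided_tail(c, nu, n=20_000):
    """P(|T| > c), computed as 1 minus the trapezoid integral over [-c, c]."""
    h = 2 * c / n
    total = 0.5 * (t_pdf_std(-c, nu) + t_pdf_std(c, nu))
    total += sum(t_pdf_std(-c + i * h, nu) for i in range(1, n))
    return 1.0 - total * h


def normal_two_sided_tail(c):
    """P(|Z| > c) for a standard normal, via the complementary error function."""
    return math.erfc(c / math.sqrt(2))


for nu in (1, 2, 5, 30, 1000):
    print(f"nu = {nu:5d}: P(|T| > 3) = {t_two_sided_tail(3.0, nu):.5f}")
print(f"normal    : P(|Z| > 3) = {normal_two_sided_tail(3.0):.5f}")
```

The tail probability shrinks steadily as \(\nu\) grows and approaches the normal value, consistent with the convergence described above.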

Cauchy Distribution

An important special case of the \(t\)-distribution arises at the extreme end of heavy-tailed behavior, when the degrees of freedom parameter takes its smallest meaningful value.

Definition: Cauchy Distribution

When \(\nu = 1\), the Student's \(t\)-distribution reduces to the Cauchy distribution with location \(\mu\) and scale \(\gamma > 0\) that has p.d.f.: \[ f(x \mid \mu, \gamma) = \frac{1}{\gamma\pi}\left[1 + \left(\frac{x - \mu}{\gamma}\right)^2\right]^{-1}. \]
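To verify the reduction, set \(\nu = 1\) in the \(t\) density and write \(\gamma\) for the scale \(\sigma\). The normalizing constant hidden by the proportionality sign in the definition of the \(t\)-distribution is \(\Gamma\!\left(\tfrac{\nu+1}{2}\right) / \left(\Gamma\!\left(\tfrac{\nu}{2}\right)\sqrt{\nu\pi}\,\sigma\right)\); we quote it here as a standard fact. Using \(\Gamma(1) = 1\) and \(\Gamma(1/2) = \sqrt{\pi}\): \[ \frac{\Gamma(1)}{\Gamma(1/2)\sqrt{\pi}\,\gamma}\left[1 + \left(\frac{x - \mu}{\gamma}\right)^2\right]^{-1} = \frac{1}{\gamma\pi}\left[1 + \left(\frac{x - \mu}{\gamma}\right)^2\right]^{-1}. \]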

Consider the standard Cauchy distribution (\(\mu = 0, \gamma = 1\)). When we attempt to calculate its expected value: \[ \mathbb{E}[X] = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{x}{1 + x^2} \, dx \] we find that the integral does not converge. Because the density's tails decay only like \(\frac{1}{x^2}\), the integrand \(\frac{x}{1+x^2}\) behaves like \(\frac{1}{x}\) for large \(|x|\), and the integral \(\int \frac{1}{x} \, dx\) diverges logarithmically (see Improper Riemann integrals).
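The divergence can be made explicit by integrating over the positive half-line up to a cutoff \(T\) (a symbol introduced here for illustration): \[ \int_0^T \frac{x}{1 + x^2} \, dx = \frac{1}{2}\ln(1 + T^2) \to \infty \quad \text{as } T \to \infty. \] By symmetry, the integral over the negative half-line diverges to \(-\infty\), so the two-sided improper integral has the indeterminate form \(\infty - \infty\). The symmetric limit \(\int_{-T}^{T} \frac{x}{1+x^2} \, dx = 0\) is only a principal value and does not define an expectation.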

Why this matters: Since the mean is undefined, the Law of Large Numbers does not apply. If you average \(n\) independent, identically distributed Cauchy variables, the sample mean \(\bar{X}_n\) does not settle down; it follows exactly the same Cauchy distribution as each individual observation, no matter how large \(n\) is. We will explore this behavior rigorously in Part 13: Convergence.
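A small simulation illustrates this. The sketch below measures the spread of the sample mean by its interquartile range (IQR) across many replications, since the variance is unavailable for the Cauchy. The helper names, seed, and replication counts are assumptions for illustration, and the quantiles are read off by simple index lookup rather than interpolation.

```python
import math
import random


def cauchy_draw(rng):
    # Inverse-CDF sampling: tan(pi * (U - 1/2)) is standard Cauchy
    # when U is uniform on (0, 1).
    return math.tan(math.pi * (rng.random() - 0.5))


def normal_draw(rng):
    return rng.gauss(0.0, 1.0)


def iqr(values):
    """Crude interquartile range from sorted order statistics."""
    s = sorted(values)
    n = len(s)
    return s[(3 * n) // 4] - s[n // 4]


def iqr_of_sample_means(draw, n, reps, rng):
    """IQR of the sample mean of n draws, estimated from reps replications."""
    means = [sum(draw(rng) for _ in range(n)) / n for _ in range(reps)]
    return iqr(means)


rng = random.Random(1)  # fixed seed, chosen arbitrarily
for n in (1, 10, 100):
    c = iqr_of_sample_means(cauchy_draw, n, 2000, rng)
    g = iqr_of_sample_means(normal_draw, n, 2000, rng)
    print(f"n = {n:3d}: Cauchy IQR = {c:.2f}, normal IQR = {g:.2f}")
```

The standard Cauchy has quartiles at \(\pm 1\), so its IQR is 2. The Cauchy column should stay near 2 for every \(n\), while the normal column shrinks roughly like \(1/\sqrt{n}\).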

Despite these theoretical challenges, the Cauchy distribution is highly useful. In Bayesian modeling, when we need a heavy-tailed prior over \(\mathbb{R}^+\) that allows for large values while maintaining finite density at the origin, we use the Half-Cauchy distribution (\(x \geq 0\)): \[ f(x \mid \gamma) = \frac{2}{\pi \gamma} \left[ 1 + \left(\frac{x}{\gamma}\right)^2\right]^{-1}. \] This is a common default prior for scale parameters in hierarchical models, where it is generally regarded as more robust than the Inverse-Gamma distribution.
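The Half-Cauchy is simply the absolute value of a Cauchy variable centered at zero, which gives an easy way to evaluate and sample it. The following sketch assumes that construction; the function names, the seed, and the value \(\gamma = 2\) are illustrative, and the c.d.f. \(\frac{2}{\pi}\arctan(x/\gamma)\) is obtained by integrating the density above.

```python
import math
import random


def half_cauchy_pdf(x, gamma):
    """Half-Cauchy density on x >= 0 with scale gamma."""
    if x < 0:
        return 0.0
    return 2.0 / (math.pi * gamma * (1.0 + (x / gamma) ** 2))


def half_cauchy_cdf(x, gamma):
    """Half-Cauchy c.d.f., the integral of the density from 0 to x."""
    if x < 0:
        return 0.0
    return (2.0 / math.pi) * math.atan(x / gamma)


def half_cauchy_draw(gamma, rng):
    # Absolute value of a Cauchy(0, gamma) draw obtained by inverse-CDF sampling.
    return abs(gamma * math.tan(math.pi * (rng.random() - 0.5)))


gamma = 2.0
print(half_cauchy_pdf(0.0, gamma))             # finite density at the origin: 2 / (pi * gamma)
print(half_cauchy_cdf(gamma, gamma))           # the median equals the scale gamma
print(1.0 - half_cauchy_cdf(10 * gamma, gamma))  # mass beyond ten scale units

rng = random.Random(3)  # fixed seed, chosen arbitrarily
samples = sorted(half_cauchy_draw(gamma, rng) for _ in range(20_000))
print(samples[len(samples) // 2])              # empirical median, should be near gamma
```

The density at zero is finite and the tail beyond ten scale units still carries noticeable mass, which are the two properties the text asks of a prior for scale parameters.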

Laplace Distribution

The Cauchy distribution demonstrates that heavy tails can be extreme enough to prevent even the mean from existing. A more moderate heavy-tailed alternative, which retains finite moments of all orders while still placing more mass in the tails than the normal distribution, is the Laplace distribution.

Definition: Laplace Distribution

The Laplace distribution (also called the double-sided exponential distribution) with location \(\mu\) and scale \(b > 0\) has p.d.f.: \[ f(y \mid \mu, b) = \frac{1}{2b}\exp\!\left(-\frac{|y - \mu|}{b}\right). \] Its moments are: \[ \text{mean} = \text{mode} = \mu, \qquad \text{Var}(Y) = 2b^2. \]
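As a quick check on these formulas, the sketch below samples from the Laplace distribution and compares its tail with a normal tail measured in standard deviations. It relies on the fact that the difference of two independent unit exponentials is a standard Laplace variable, and on the closed-form tail \(P(|Y - \mu| > t) = e^{-t/b}\). The function names, seed, and parameter values are assumptions for illustration.

```python
import math
import random


def laplace_pdf(y, mu, b):
    """Laplace density with location mu and scale b."""
    return math.exp(-abs(y - mu) / b) / (2.0 * b)


def laplace_draw(mu, b, rng):
    # The difference of two independent Exp(1) variables is Laplace(0, 1);
    # scaling by b and shifting by mu gives Laplace(mu, b).
    return mu + b * (rng.expovariate(1.0) - rng.expovariate(1.0))


def laplace_tail_sd(k):
    """P(|Y - mu| > k standard deviations); the sd is b * sqrt(2), so b cancels."""
    return math.exp(-k * math.sqrt(2))


def normal_tail_sd(k):
    """P(|Z| > k) for a standard normal."""
    return math.erfc(k / math.sqrt(2))


mu, b = 0.5, 1.5
rng = random.Random(4)  # fixed seed, chosen arbitrarily
draws = [laplace_draw(mu, b, rng) for _ in range(50_000)]
mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / (len(draws) - 1)

print(var, 2 * b**2)                            # empirical vs. theoretical variance
print(laplace_tail_sd(3.0), normal_tail_sd(3.0))  # Laplace tail is heavier
```

At three standard deviations the Laplace tail probability is several times the normal one, yet it still decays exponentially, which is why all moments remain finite.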

Insight: Heavy Tails and Regularization

In machine learning, the choice of distribution often corresponds to the choice of regularization. Under maximum a posteriori (MAP) estimation, a Gaussian prior on the weights leads to \(L_2\) regularization (Ridge), while a Laplace prior leads to \(L_1\) regularization (Lasso), which promotes sparsity. Additionally, using the Student's \(t\)-distribution as the noise model in regression makes the fit more robust to outliers than standard least-squares (Gaussian) regression.
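To make the correspondence concrete, here is a short derivation under maximum a posteriori (MAP) estimation. The weight vector \(w\), the data \(\mathcal{D}\), and the prior scales \(\tau\) and \(b\) are notation introduced for this sketch, and we assume independent zero-mean priors on each weight. The MAP estimate minimizes the negative log-posterior: \[ \hat{w} = \arg\min_w \left[ -\log p(\mathcal{D} \mid w) - \log p(w) \right]. \] For \(w_i \sim \mathcal{N}(0, \tau^2)\) and \(w_i \sim \text{Laplace}(0, b)\) respectively, the prior term becomes \[ -\log p(w) = \frac{1}{2\tau^2}\sum_i w_i^2 + \text{const}, \qquad -\log p(w) = \frac{1}{b}\sum_i |w_i| + \text{const}, \] which are the \(L_2\) and \(L_1\) penalties with strengths \(1/(2\tau^2)\) and \(1/b\).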

The Student's \(t\), Cauchy, and Laplace distributions complete our toolkit of univariate distributions for robust modeling. In the next part, we move from single random variables to pairs of random variables, introducing covariance as the fundamental measure of linear dependence between variables.
