Student's \(t\)-Distribution
On the normal distribution page, we saw that the chi-squared distribution governs
sums of squared standard normals, and that the sample variance \(s^2\) satisfies \((n-1)s^2/\sigma^2 \sim \chi^2_{n-1}\).
A natural question follows: what happens to our inference when we replace the unknown population standard deviation
\(\sigma\) with its estimate \(s\)? The answer is the Student's \(t\)-distribution, which arises as the
ratio of a standard normal to the square root of an independent chi-squared variable divided by its degrees of freedom.
Definition: Standard Student's \(t\)-Distribution (ratio form)
Let \(Z \sim \mathcal{N}(0, 1)\) and \(Q \sim \chi^2_\nu\) with \(Z\) and \(Q\) independent.
The random variable
\[
T \;=\; \frac{Z}{\sqrt{Q/\nu}}
\]
is said to follow the standard Student's \(t\)-distribution with \(\nu > 0\)
degrees of freedom, written \(T \sim t_\nu\).
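The ratio construction can be checked by simulation. The following is a minimal sketch using only the Python standard library; the helper name `sample_t`, the seed, and the sample size are illustrative choices, not part of the definition. It relies on the standard fact that \(\chi^2_\nu\) is a Gamma distribution with shape \(\nu/2\) and scale \(2\), and compares the sample variance with the theoretical value \(\nu/(\nu-2)\) for \(\nu > 2\).

```python
import math
import random
import statistics

def sample_t(nu, rng):
    """Draw one T = Z / sqrt(Q / nu), with Z ~ N(0, 1) and Q ~ chi^2_nu independent."""
    z = rng.gauss(0.0, 1.0)
    # chi^2_nu is Gamma(shape = nu / 2, scale = 2); gammavariate takes (shape, scale).
    q = rng.gammavariate(nu / 2.0, 2.0)
    return z / math.sqrt(q / nu)

rng = random.Random(0)   # fixed seed so the run is reproducible
nu = 10
draws = [sample_t(nu, rng) for _ in range(20000)]

sample_mean = statistics.fmean(draws)
sample_var = statistics.variance(draws)
print("sample mean:", sample_mean)
print("sample variance:", sample_var, "theory:", nu / (nu - 2))
```

With a sample of this size the empirical mean should sit close to zero and the empirical variance close to \(1.25\), up to Monte Carlo error.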
Proposition: Standard \(t\) p.d.f.
The p.d.f. of \(T \sim t_\nu\) is
\[
f_T(t) \;=\; \frac{\Gamma\!\bigl((\nu+1)/2\bigr)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}\,
\left(1 + \frac{t^2}{\nu}\right)^{-(\nu+1)/2}, \qquad t \in \mathbb{R}.
\]
Sketch. The density of \(T\) is obtained by writing the joint density of
\((Z, Q)\) (which factors as \(\varphi(z)\,f_{\chi^2_\nu}(q)\) by independence), changing variables
to \((T, Q)\) with \(Z = T\sqrt{Q/\nu}\), and integrating out \(Q\). The integrand reduces to a
gamma integral that produces the factor \(\Gamma((\nu+1)/2)\); the remaining constants combine into
the prefactor above. The joint-distribution machinery is developed in
Limit Theorems & Product Measures;
the bivariate change-of-variables and integration step is standard but not carried out in detail on this site.
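As a numerical sanity check on the density formula (not a substitute for the change-of-variables proof), the sketch below evaluates \(f_T\) through log-gamma functions and confirms that it integrates to approximately one. The function names `t_pdf` and `integrate_midpoint`, the integration range, and the grid size are assumptions made for illustration.

```python
import math

def t_pdf(t, nu):
    """Standard Student's t density, computed in log space for numerical stability."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(t * t / nu))

def integrate_midpoint(f, a, b, n):
    """Composite midpoint rule for the integral of f over [a, b] with n cells."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

nu = 5
total_mass = integrate_midpoint(lambda t: t_pdf(t, nu), -200.0, 200.0, 200000)
peak_nu1 = t_pdf(0.0, 1)   # at nu = 1 the prefactor should equal 1 / pi

print("total mass for nu = 5:", total_mass)
print("f_T(0) for nu = 1:", peak_nu1, "vs 1/pi:", 1 / math.pi)
```

Working with `math.lgamma` rather than `math.gamma` avoids overflow when \(\nu\) is large.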
Definition: Student's \(t\)-Distribution (location–scale form)
For location \(\mu \in \mathbb{R}\) and scale \(\sigma > 0\) (a scale parameter, not the standard deviation of \(Y\)),
a random variable \(Y\) follows the Student's \(t\)-distribution with parameters
\((\mu, \sigma, \nu)\), written \(Y \sim t_\nu(\mu, \sigma^2)\), if
\(Y \;=\; \mu + \sigma T\) for some \(T \sim t_\nu\). The corresponding p.d.f. is
\[
f(y \mid \mu, \sigma, \nu) \;=\;
\frac{1}{\sigma}\,f_T\!\!\left(\frac{y - \mu}{\sigma}\right)
\;=\; \frac{\Gamma\!\bigl((\nu+1)/2\bigr)}{\sigma\sqrt{\nu\pi}\,\Gamma(\nu/2)}
\left[1 + \frac{1}{\nu}\!\left(\frac{y - \mu}{\sigma}\right)^{\!2}\right]^{-(\nu+1)/2}.
\]
The mode, mean, and variance of \(Y \sim t_\nu(\mu, \sigma^2)\) are
\[
\operatorname{mode}(Y) = \mu \;\;(\text{any } \nu > 0), \qquad
\mathbb{E}[Y] = \mu \;\;(\text{if } \nu > 1), \qquad
\operatorname{Var}(Y) = \frac{\nu\,\sigma^2}{\nu - 2} \;\;(\text{if } \nu > 2).
\]
The mean fails to exist for \(\nu \leq 1\) (the integral is not absolutely convergent; see the
Cauchy case below), and the variance is not finite for \(\nu \leq 2\): the density decays only
like \(|y|^{-(\nu+1)}\) at infinity, so the variance integrand \(y^2 f(y)\) decays like
\(|y|^{1-\nu}\), which is integrable only when \(\nu > 2\). (For \(1 < \nu \leq 2\) the mean exists
but the variance is infinite.) The mode, by contrast, is
always well defined: the density is symmetric about \(\mu\) and unimodal there for every \(\nu > 0\).
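The tail-decay argument can be illustrated numerically. The sketch below (standard library only; the function names, cutoffs \(R\), and step size are illustrative assumptions) computes the truncated second moment \(\int_{-R}^{R} y^2 f_T(y)\,dy\) of the standard \(t\)-density for \(\nu = 2\) and \(\nu = 5\). For \(\nu = 5\) the values should stabilize near \(\nu/(\nu-2) = 5/3\); for \(\nu = 2\) they should keep growing, roughly by \(2\ln 10\) for each tenfold increase in \(R\).

```python
import math

def t_pdf(t, nu):
    """Standard Student's t density, computed in log space."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(t * t / nu))

def truncated_second_moment(nu, R, step=0.01):
    """Midpoint-rule approximation of the integral of y^2 f_T(y) over [-R, R]."""
    n = int(round(R / step))
    h = R / n
    half = sum(((i + 0.5) * h) ** 2 * t_pdf((i + 0.5) * h, nu) for i in range(n))
    return 2.0 * h * half   # the integrand is even, so double the half-line integral

moments = {(nu, R): truncated_second_moment(nu, R)
           for nu in (2, 5) for R in (10, 100, 1000)}

for (nu, R), value in sorted(moments.items()):
    print(f"nu = {nu}, R = {R}: {value:.4f}")
```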
The key feature of the \(t\)-distribution is its heavy tails: compared to a normal distribution
with the same location and scale, the \(t\)-distribution assigns substantially more probability mass to extreme values.
This makes parameter estimates based on the \(t\)-distribution more robust to outliers.
The parameter \(\nu\) controls the tail heaviness. As \(\nu \to \infty\), the \(t\)-distribution converges to the
normal distribution \(\mathcal{N}(\mu, \sigma^2)\). In practice, for \(\nu \gg 5\), the \(t\)-distribution is nearly
indistinguishable from the normal and loses its robustness advantage.
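Both effects, the heavier tails and the convergence to the normal, can be seen by comparing densities directly. This is a minimal sketch under illustrative choices: the grid on \([-8, 8]\), the evaluation point \(t = 4\), and the set of \(\nu\) values are assumptions made here, not recommendations.

```python
import math

def t_pdf(t, nu):
    """Standard Student's t density, computed in log space."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(t * t / nu))

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

grid = [i / 100 for i in range(-800, 801)]
nus = (1, 5, 30, 1000)

# Largest pointwise gap between the t density and the normal density on the grid.
max_gap = {nu: max(abs(t_pdf(x, nu) - normal_pdf(x)) for x in grid) for nu in nus}
# How much more density the t places at t = 4 than the normal does.
tail_ratio = {nu: t_pdf(4.0, nu) / normal_pdf(4.0) for nu in nus}

for nu in nus:
    print(f"nu = {nu}: max gap = {max_gap[nu]:.5f}, ratio at 4 = {tail_ratio[nu]:.2f}")
```

The gap shrinks as \(\nu\) grows, while for small \(\nu\) the density four scale units from the center exceeds the normal density by more than an order of magnitude.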
Cauchy Distribution
An important special case of the \(t\)-distribution arises at \(\nu = 1\),
where the tails are so heavy that even the mean fails to exist.
Definition: Cauchy Distribution
When \(\nu = 1\), the location–scale Student's \(t\)-distribution reduces to the Cauchy distribution
with location \(\mu\) and scale \(\gamma > 0\) (the scale \(\sigma\), renamed), which has p.d.f.
\[
f(x \mid \mu, \gamma) = \frac{1}{\gamma\pi}\left[1 + \left(\frac{x - \mu}{\gamma}\right)^2\right]^{-1}.
\]
We write \(X \sim \text{Cauchy}(\mu, \gamma)\).
(Specializing the location–scale \(t\)-density at \(\nu = 1\) with scale \(\sigma = \gamma\)
gives the prefactor \(\Gamma(1)/[\gamma\sqrt{\pi}\,\Gamma(1/2)] = 1/(\gamma\pi)\),
using \(\Gamma(1) = 1\) and \(\Gamma(1/2) = \sqrt{\pi}\).)
Consider the standard Cauchy distribution (\(\mu = 0, \gamma = 1\)). When we attempt to calculate its
expected value:
\[
\mathbb{E}[X] \;=\; \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{x}{1 + x^2}\, dx
\]
we find that the integral is not absolutely convergent: the integrand
\(x/(1 + x^2)\) behaves like \(1/x\) for large \(|x|\), and \(\int |x|/(1+x^2)\,dx\) diverges
logarithmically (see Improper Riemann Integrals).
Since the Lebesgue definition of expectation requires absolute convergence (equivalently, the
finiteness of \(\mathbb{E}[|X|]\)), the mean of a Cauchy random variable is undefined, even
though the symmetric Cauchy principal value
\(\lim_{R \to \infty} \int_{-R}^{R} x/(1+x^2)\,dx\) equals zero. The second moment diverges as well
(\(x^2/(1+x^2) \to 1\) as \(|x| \to \infty\)), and with no mean to center on, the variance
is also undefined.
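To make the logarithmic divergence concrete, here is a short worked computation of the truncated absolute first moment for the standard Cauchy, with \(R\) denoting the truncation level:
\[
\frac{1}{\pi}\int_{-R}^{R} \frac{|x|}{1+x^2}\,dx
\;=\; \frac{2}{\pi}\int_{0}^{R} \frac{x}{1+x^2}\,dx
\;=\; \frac{1}{\pi}\,\ln\!\bigl(1+R^2\bigr)
\;\longrightarrow\; \infty \qquad (R \to \infty).
\]
The growth is slow but unbounded: \(R = 10^3\) gives about \(4.40\), and \(R = 10^6\) gives about \(8.80\).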
Why this matters: Because the mean is undefined, the Law of Large Numbers does not apply, and its
conclusion in fact fails. If you average \(n\) independent standard Cauchy variables, the sample mean \(\bar{X}_n\) does not settle down;
it follows exactly the same Cauchy distribution as the individual observations. (Sketch: the
standard Cauchy characteristic function is \(\varphi_X(t) = e^{-|t|}\), and by independence
\(\varphi_{\bar{X}_n}(t) = \prod_{i=1}^{n} \varphi_X(t/n) = \bigl(e^{-|t|/n}\bigr)^{n} = e^{-|t|}\), the same as \(\varphi_X\) itself,
so \(\bar{X}_n\) has the same distribution as each \(X_i\).) We will explore
this behavior rigorously on the Convergence page.
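A small simulation makes the failure visible. The sketch below (standard library only; the helper names, seed, and replication counts are illustrative assumptions) draws standard Cauchy variables by inverse-CDF sampling and measures the spread of the sample mean by its interquartile range, since the variance is unavailable. The quartiles of the standard Cauchy are \(\pm 1\), so the interquartile range is \(2\), and it should stay near \(2\) however many observations are averaged.

```python
import math
import random
import statistics

def cauchy_draw(rng):
    """Standard Cauchy draw via the inverse CDF: tan(pi * (U - 1/2)), U ~ Uniform(0, 1)."""
    return math.tan(math.pi * (rng.random() - 0.5))

def iqr(values):
    """Interquartile range, a spread measure that exists even without a variance."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

rng = random.Random(1)   # fixed seed so the run is reproducible
reps = 4000              # number of independent sample means per setting

iqr_by_n = {}
for n in (1, 10, 100):
    means = [sum(cauchy_draw(rng) for _ in range(n)) / n for _ in range(reps)]
    iqr_by_n[n] = iqr(means)

for n, spread in iqr_by_n.items():
    print(f"n = {n}: IQR of sample mean = {spread:.3f}")
```

For comparison, the interquartile range of the mean of \(n\) standard normal draws would shrink like \(1/\sqrt{n}\).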
Despite these theoretical challenges, the Cauchy distribution is highly useful. In Bayesian modeling,
when we need a heavy-tailed prior over \(\mathbb{R}^+\) that allows for large values while maintaining
finite density at the origin, we use the Half-Cauchy distribution, defined next.
Definition: Half-Cauchy Distribution
The Half-Cauchy distribution with scale \(\gamma > 0\) is the zero-centered Cauchy distribution
\(\text{Cauchy}(0, \gamma)\) restricted to \(x \geq 0\) and renormalized so its total mass is one.
Its p.d.f. is
\[
f(x \mid \gamma) = \frac{2}{\pi \gamma}\left[1 + \left(\frac{x}{\gamma}\right)^2\right]^{-1}, \qquad x \geq 0.
\]
(The factor \(2\) compensates for the loss of the negative half: by the symmetry of the
\(\text{Cauchy}(0, \gamma)\) density about the origin, exactly half of its mass lies on \(x \geq 0\).)
The Half-Cauchy is a common choice for scale parameters in
hierarchical Bayesian priors: its heavy tail allows large values to remain probable while keeping
a finite density at the origin, and the resulting posterior tends to be less sensitive to the
choice of hyperparameter \(\gamma\) than the Inverse-Gamma alternative is to its own hyperparameters.
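The Half-Cauchy density and a sampler for it fit in a few lines. This sketch uses the fact that the Half-Cauchy CDF is \(F(x) = (2/\pi)\arctan(x/\gamma)\), so the median equals \(\gamma\); the function names, the value \(\gamma = 2\), and the seed are illustrative assumptions.

```python
import math
import random
import statistics

def half_cauchy_pdf(x, gamma):
    """Half-Cauchy density with scale gamma; zero on the negative half-line."""
    if x < 0:
        return 0.0
    return 2.0 / (math.pi * gamma * (1.0 + (x / gamma) ** 2))

def half_cauchy_draw(gamma, rng):
    """Inverse-CDF sampling: F(x) = (2/pi) * atan(x / gamma), so x = gamma * tan(pi * u / 2)."""
    return gamma * math.tan(math.pi * rng.random() / 2.0)

gamma = 2.0
rng = random.Random(2)   # fixed seed so the run is reproducible
draws = [half_cauchy_draw(gamma, rng) for _ in range(20000)]

sample_median = statistics.median(draws)
density_at_zero = half_cauchy_pdf(0.0, gamma)

print("sample median:", sample_median, "theory:", gamma)
print("density at the origin:", density_at_zero, "theory:", 2 / (math.pi * gamma))
print("largest draw:", max(draws))
```

The median is used instead of the mean because, like the Cauchy, the Half-Cauchy has no finite mean; the largest draw is typically orders of magnitude above the median.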
Laplace Distribution
The Cauchy distribution demonstrates that heavy tails can be extreme enough to prevent even the mean
from existing. A more moderate heavy-tailed alternative, which retains finite moments of all orders
while still placing more mass in the tails than the normal distribution, is the Laplace distribution.
Definition: Laplace Distribution
The Laplace distribution (also called the double exponential distribution)
with location \(\mu\) and scale \(b > 0\) has p.d.f.:
\[
f(y \mid \mu, b) = \frac{1}{2b}\exp\!\left(-\frac{|y - \mu|}{b}\right).
\]
Its moments are:
\[
\operatorname{mode}(Y) = \mathbb{E}[Y] = \mu, \qquad \operatorname{Var}(Y) = 2b^2.
\]
Proof of the Laplace moments:
Substitute \(u = (y - \mu)/b\), so \(y = \mu + bu\) and \(dy = b\,du\). The density transforms
to \(f(y)\,dy = \tfrac{1}{2}e^{-|u|}\,du\), and \((y - \mu)^k = b^k u^k\). For the mean,
\[
\mathbb{E}[Y] - \mu = \mathbb{E}[Y - \mu] = b\int_{-\infty}^{\infty} u \cdot \tfrac{1}{2}e^{-|u|}\,du = 0,
\]
since \(u\,e^{-|u|}\) is odd and the integral is absolutely convergent (\(|u|\,e^{-|u|}\) integrates
to a finite value). For the variance, \(u^2 e^{-|u|}\) is even, so
\[
\operatorname{Var}(Y) = \mathbb{E}[(Y - \mu)^2]
= b^2 \int_{-\infty}^{\infty} u^2 \cdot \tfrac{1}{2} e^{-|u|}\,du
= b^2 \int_{0}^{\infty} u^2 e^{-u}\,du
= b^2 \,\Gamma(3) = 2b^2,
\]
using \(\Gamma(n) = (n-1)!\) for positive integers \(n\). The mode is \(\mu\) because
\(e^{-|y - \mu|/b}\) attains its maximum exactly at \(y = \mu\).
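The moment formulas can be checked by simulation. The sketch below relies on a standard fact not derived above: the difference of two independent \(\text{Exp}(1)\) variables has density \(\tfrac{1}{2}e^{-|u|}\), which is the standardized Laplace density appearing in the proof. The helper name `laplace_draw`, the parameter values, and the seed are illustrative assumptions.

```python
import random
import statistics

def laplace_draw(mu, b, rng):
    """Laplace(mu, b) draw as mu + b * (E1 - E2), with E1, E2 independent Exp(1)."""
    return mu + b * (rng.expovariate(1.0) - rng.expovariate(1.0))

mu, b = 3.0, 1.5
rng = random.Random(3)   # fixed seed so the run is reproducible
draws = [laplace_draw(mu, b, rng) for _ in range(50000)]

sample_mean = statistics.fmean(draws)
sample_var = statistics.variance(draws)

print("sample mean:", sample_mean, "theory:", mu)
print("sample variance:", sample_var, "theory:", 2 * b ** 2)
```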
Insight: Heavy Tails and Regularization
In machine learning, the choice of distribution often corresponds to the choice of regularization.
Under maximum a posteriori (MAP) estimation, a Gaussian prior on the weights leads to \(L_2\) regularization (Ridge),
while a Laplace prior leads to \(L_1\) regularization (Lasso), which promotes sparsity.
Additionally, using the Student's \(t\)-distribution as the noise model in regression makes the fit more
robust to outliers than standard least-squares (Gaussian) regression.
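The link between priors and penalties can be made explicit with a short worked computation. As an illustrative assumption (not stated above), take independent zero-centered priors on the coordinates \(w_j\) of a weight vector \(w\), with hypothetical prior scales \(\tau\) for the Gaussian and \(b\) for the Laplace. The negative log-prior is then
\[
w_j \sim \mathcal{N}(0, \tau^2): \;\; -\log p(w) = \frac{1}{2\tau^2}\sum_j w_j^2 + \text{const}, \qquad
w_j \sim \text{Laplace}(0, b): \;\; -\log p(w) = \frac{1}{b}\sum_j |w_j| + \text{const}.
\]
Maximizing the posterior is the same as minimizing the negative log-likelihood plus this term, so the Gaussian prior contributes a squared \(L_2\) penalty with weight \(1/(2\tau^2)\) and the Laplace prior an \(L_1\) penalty with weight \(1/b\).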
The Student's \(t\), Cauchy, and Laplace distributions complete our toolkit of univariate distributions for robust modeling.
In the next part, we move from single random variables to pairs of random variables, introducing
covariance as the fundamental measure of
linear dependence between variables.