Student's \(t\)-Distribution

Student's \(t\)-Distribution

On the normal distribution page, we saw that the chi-squared distribution governs sums of squared standard normals, and that the sample variance \(s^2\) satisfies \((n-1)s^2/\sigma^2 \sim \chi^2_{n-1}\). A natural question follows: what happens to our inference when we replace the unknown population standard deviation \(\sigma\) with its estimate \(s\)? The answer is the Student's \(t\)-distribution, which arises as the ratio of a standard normal to the square root of an independent chi-squared variable divided by its degrees of freedom.

Definition: Standard Student's \(t\)-Distribution (ratio form)

Let \(Z \sim \mathcal{N}(0, 1)\) and \(Q \sim \chi^2_\nu\) with \(Z\) and \(Q\) independent. The random variable \[ T \;=\; \frac{Z}{\sqrt{Q/\nu}} \] is said to follow the standard Student's \(t\)-distribution with \(\nu > 0\) degrees of freedom, written \(T \sim t_\nu\).
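The ratio form can be checked by simulation. The following is a minimal sketch, not a reference implementation: it assumes integer \(\nu\) so that \(Q\) can be built as a sum of \(\nu\) squared standard normals, and the helper name `sample_t`, the seed, and the sample size are illustrative choices. It compares the empirical two-sided tail frequency with the tabulated 95% critical value of \(t_5\) (about 2.5706).

```python
import math
import random

def sample_t(nu, rng):
    """Draw one value of T = Z / sqrt(Q / nu) for integer nu >= 1."""
    z = rng.gauss(0.0, 1.0)
    # Q ~ chi-squared(nu), built as a sum of nu squared independent standard normals
    q = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))
    return z / math.sqrt(q / nu)

rng = random.Random(0)  # fixed seed so the run is reproducible
nu = 5
draws = [sample_t(nu, rng) for _ in range(20000)]

# 2.5706 is the tabulated two-sided 95% critical value of t_5,
# so roughly 5% of the draws should exceed it in absolute value.
tail_fraction = sum(abs(t) > 2.5706 for t in draws) / len(draws)
print(tail_fraction)
```

The same draws from a standard normal would exceed 2.5706 in absolute value only about 1% of the time, which is the extra tail mass introduced by dividing by \(\sqrt{Q/\nu}\).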

Proposition: Standard \(t\) p.d.f.

The p.d.f. of \(T \sim t_\nu\) is \[ f_T(t) \;=\; \frac{\Gamma\!\bigl((\nu+1)/2\bigr)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}\, \left(1 + \frac{t^2}{\nu}\right)^{-(\nu+1)/2}, \qquad t \in \mathbb{R}. \]
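As a quick sanity check on the formula, the density can be evaluated with log-gamma functions and integrated numerically. This is a sketch under stated assumptions: the names `t_pdf` and `integrate`, the choice \(\nu = 3\), and the truncation of the real line to \([-200, 200]\) are illustrative, not part of the text above.

```python
import math

def t_pdf(t, nu):
    """Standard Student's t density, computed on the log scale for stability."""
    log_const = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                 - 0.5 * math.log(nu * math.pi))
    return math.exp(log_const - (nu + 1) / 2 * math.log1p(t * t / nu))

def integrate(f, a, b, n):
    """Composite midpoint rule on [a, b] with n equal panels."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

nu = 3.0
total = integrate(lambda t: t_pdf(t, nu), -200.0, 200.0, 40000)
print(total)  # close to 1; the small shortfall is the tail mass beyond |t| = 200
```

Working with `math.lgamma` rather than `math.gamma` avoids overflow when \(\nu\) is large.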

Sketch. The density of \(T\) is obtained by writing the joint density of \((Z, Q)\) (which factors as \(\varphi(z)\,f_{\chi^2_\nu}(q)\) by independence), changing variables to \((T, Q)\) with \(Z = T\sqrt{Q/\nu}\), and integrating out \(Q\). The integrand reduces to a gamma integral that produces the factor \(\Gamma((\nu+1)/2)\); the remaining constants combine into the prefactor above. The joint-distribution machinery is developed in Limit Theorems & Product Measures; the bivariate change-of-variables and integration step is standard but not carried out in detail on this site.

Definition: Student's \(t\)-Distribution (location–scale form)

For location \(\mu \in \mathbb{R}\) and scale \(\sigma > 0\), a random variable \(Y\) follows the Student's \(t\)-distribution with parameters \((\mu, \sigma, \nu)\), written \(Y \sim t_\nu(\mu, \sigma^2)\), if \(Y \;=\; \mu + \sigma T\) for some \(T \sim t_\nu\). Here \(\sigma\) is a scale parameter, not the standard deviation of \(Y\). The corresponding p.d.f. is \[ f(y \mid \mu, \sigma, \nu) \;=\; \frac{1}{\sigma}\,f_T\!\!\left(\frac{y - \mu}{\sigma}\right) \;=\; \frac{\Gamma\!\bigl((\nu+1)/2\bigr)}{\sigma\sqrt{\nu\pi}\,\Gamma(\nu/2)} \left[1 + \frac{1}{\nu}\!\left(\frac{y - \mu}{\sigma}\right)^{\!2}\right]^{-(\nu+1)/2}. \]

The mode, mean, and variance of \(Y \sim t_\nu(\mu, \sigma^2)\) are \[ \operatorname{mode}(Y) = \mu \;\;(\text{any } \nu > 0), \qquad \mathbb{E}[Y] = \mu \;\;(\text{if } \nu > 1), \qquad \operatorname{Var}(Y) = \frac{\nu\,\sigma^2}{\nu - 2} \;\;(\text{if } \nu > 2). \] The mean fails to exist for \(\nu \leq 1\) (the integral is not absolutely convergent; see the Cauchy case below), and the variance fails to exist for \(\nu \leq 2\): the density decays only like \(|y|^{-(\nu+1)}\) at infinity, so the variance integrand \(y^2 f(y)\) decays like \(|y|^{1-\nu}\), which is integrable only when \(\nu > 2\). The mode, by contrast, is always well defined: for every \(\nu > 0\) the density is symmetric about \(\mu\) and has its single peak there.
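The variance formula shows concretely why \(\sigma\) is not the standard deviation. A small sketch, with the helper name `t_variance` and the value \(\sigma = 2\) chosen purely for illustration:

```python
def t_variance(sigma, nu):
    """Variance of t_nu(mu, sigma^2); None when it is not finite (nu <= 2)."""
    if nu <= 2:
        return None
    return nu * sigma ** 2 / (nu - 2)

sigma = 2.0
for nu in (1, 2, 3, 5, 30, 1000):
    # The inflation factor nu / (nu - 2) shrinks toward 1 as nu grows,
    # so the variance approaches sigma**2 = 4.0 from above.
    print(nu, t_variance(sigma, nu))
```

For \(\nu = 3\) the variance is three times \(\sigma^2\), while for \(\nu = 30\) it exceeds \(\sigma^2\) by only about 7%.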

The key feature of the \(t\)-distribution is its heavy tails: compared to a normal distribution with the same location and scale, the \(t\)-distribution assigns substantially more probability mass to extreme values. A model with \(t\)-distributed noise therefore finds an extreme observation less surprising than a Gaussian model would, so its parameter estimates are pulled less strongly by outliers.

The parameter \(\nu\) controls the tail heaviness. As \(\nu \to \infty\), the \(t\)-distribution converges to the normal distribution \(\mathcal{N}(\mu, \sigma^2)\). In practice, for \(\nu \gg 5\), the \(t\)-distribution is nearly indistinguishable from the normal and loses its robustness advantage.
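One way to see both the heavy tails and the normal limit is to compare the two densities at a point far from the center. The sketch below evaluates the ratio of the standard \(t\) density to the standard normal density at four scale units; the evaluation point and the grid of \(\nu\) values are arbitrary illustrative choices.

```python
import math

def t_pdf(t, nu):
    """Standard Student's t density, computed on the log scale."""
    log_const = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                 - 0.5 * math.log(nu * math.pi))
    return math.exp(log_const - (nu + 1) / 2 * math.log1p(t * t / nu))

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

t = 4.0  # four scale units from the center
ratios = {nu: t_pdf(t, nu) / normal_pdf(t) for nu in (1, 5, 30, 1000)}
for nu, ratio in ratios.items():
    # The ratio is large for small nu and tends to 1 as nu grows.
    print(nu, ratio)
```

At \(\nu = 1\) the \(t\) density at this point is more than a hundred times the normal density, while at \(\nu = 1000\) the two are within a few percent of each other.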

Cauchy Distribution

An important special case of the \(t\)-distribution arises at the heavy-tailed end of the family, when the degrees-of-freedom parameter equals one, its smallest positive integer value.

Definition: Cauchy Distribution

When \(\nu = 1\), the Student's \(t\)-distribution reduces to the Cauchy distribution with location \(\mu\) and scale \(\gamma > 0\), whose p.d.f. is \[ f(x \mid \mu, \gamma) = \frac{1}{\gamma\pi}\left[1 + \left(\frac{x - \mu}{\gamma}\right)^2\right]^{-1}. \] We write \(X \sim \text{Cauchy}(\mu, \gamma)\). (Specializing the location–scale \(t\)-density to \(\nu = 1\) and \(\sigma = \gamma\) gives the exponent \(-(\nu+1)/2 = -1\) and the prefactor \(\Gamma(1)/[\gamma\sqrt{\pi}\,\Gamma(1/2)] = 1/(\gamma\pi)\), using \(\Gamma(1) = 1\) and \(\Gamma(1/2) = \sqrt{\pi}\).)
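The reduction can also be confirmed numerically. In this sketch the function names and the test values of \(\mu\), \(\gamma\), and \(x\) are illustrative assumptions; the two formulas are the location–scale \(t\) density and the Cauchy density stated above.

```python
import math

def t_pdf_loc_scale(y, mu, sigma, nu):
    """Location-scale Student's t density."""
    z = (y - mu) / sigma
    log_const = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                 - 0.5 * math.log(nu * math.pi) - math.log(sigma))
    return math.exp(log_const - (nu + 1) / 2 * math.log1p(z * z / nu))

def cauchy_pdf(x, mu, gamma):
    """Cauchy density with location mu and scale gamma."""
    z = (x - mu) / gamma
    return 1.0 / (gamma * math.pi * (1.0 + z * z))

mu, gamma = 1.5, 0.7
points = (-10.0, -1.0, 0.0, 1.5, 3.0, 50.0)
# Largest absolute difference between the nu = 1 t density and the Cauchy density
max_gap = max(abs(t_pdf_loc_scale(x, mu, gamma, 1.0) - cauchy_pdf(x, mu, gamma))
              for x in points)
print(max_gap)  # differs from zero only by floating-point rounding
```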

Consider the standard Cauchy distribution (\(\mu = 0, \gamma = 1\)). When we attempt to calculate its expected value, \[ \mathbb{E}[X] \;=\; \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{x}{1 + x^2}\, dx, \] we find that the integral is not absolutely convergent: the integrand \(x/(1 + x^2)\) behaves like \(1/x\) for large \(|x|\), and \(\int |x|/(1+x^2)\,dx\) diverges logarithmically (see Improper Riemann Integrals). The Lebesgue definition of expectation requires absolute convergence, that is, finiteness of \(\mathbb{E}[|X|]\), so the mean of a Cauchy random variable is undefined, even though the symmetric Cauchy principal value \(\lim_{R \to \infty} \int_{-R}^{R} x/(1+x^2)\,dx\) equals zero. The variance is likewise undefined: the second moment \(\mathbb{E}[X^2]\) is infinite, and there is no mean about which to center it.
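The logarithmic divergence is easy to see numerically. Truncating the absolute-moment integral to \([-R, R]\) gives \(\frac{1}{\pi}\int_{-R}^{R} \frac{|x|}{1+x^2}\,dx = \frac{1}{\pi}\ln(1 + R^2)\), which grows without bound as \(R\) increases. The sketch below (function names and the chosen values of \(R\) are illustrative) evaluates the closed form and cross-checks it against a midpoint-rule approximation.

```python
import math

def truncated_abs_moment(R):
    """(1/pi) * integral of |x| / (1 + x^2) over [-R, R], in closed form."""
    return math.log1p(R * R) / math.pi

def truncated_abs_moment_numeric(R, n=200000):
    """Midpoint-rule approximation of the same integral, using symmetry about 0."""
    h = R / n
    s = sum((i + 0.5) * h / (1.0 + ((i + 0.5) * h) ** 2) for i in range(n))
    return 2.0 * h * s / math.pi

for R in (1e1, 1e3, 1e6):
    # Each thousandfold increase in R adds roughly 2 * ln(1000) / pi to the value.
    print(R, truncated_abs_moment(R))

check = abs(truncated_abs_moment_numeric(10.0) - truncated_abs_moment(10.0))
```

The value never levels off: no choice of cutoff \(R\) captures "almost all" of \(\mathbb{E}[|X|]\), which is what it means for the integral to diverge.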

Why this matters: Because \(\mathbb{E}[|X|]\) is infinite, the hypotheses of the Law of Large Numbers are not met, and its conclusion fails as well. If you average \(n\) independent standard Cauchy variables, the sample mean \(\bar{X}_n\) does not settle down; it has exactly the same Cauchy distribution as a single observation. (Sketch: the standard Cauchy characteristic function is \(\varphi_X(t) = e^{-|t|}\), and by independence \(\varphi_{\bar{X}_n}(t) = \prod_{i=1}^{n} \varphi_X(t/n) = \bigl(e^{-|t|/n}\bigr)^n = e^{-|t|}\), which is \(\varphi_X\) itself, so \(\bar{X}_n\) has the same distribution as each \(X_i\).) We will explore this behavior rigorously on the Convergence page.
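A short simulation makes the failure of averaging visible. This is a sketch under stated assumptions: standard Cauchy draws are generated by the inverse-CDF formula \(\tan(\pi(U - 1/2))\) with \(U\) uniform, and the spread of the sample mean is summarized by the median of \(|\bar{X}_n|\), which equals 1 for a standard Cauchy variable. The seed, sample sizes, and number of trials are arbitrary.

```python
import math
import random
import statistics

def standard_cauchy(rng):
    """Inverse-CDF sampler: tan(pi * (U - 1/2)) with U uniform on [0, 1)."""
    return math.tan(math.pi * (rng.random() - 0.5))

def median_abs_sample_mean(n, trials, rng):
    """Median of |sample mean| over repeated samples of size n."""
    means = [sum(standard_cauchy(rng) for _ in range(n)) / n
             for _ in range(trials)]
    return statistics.median(abs(m) for m in means)

rng = random.Random(0)  # fixed seed so the run is reproducible
results = {n: median_abs_sample_mean(n, 2000, rng) for n in (1, 10, 100)}
for n, spread in results.items():
    # For finite-variance data this spread would shrink like 1 / sqrt(n);
    # for Cauchy data it stays near 1 regardless of n.
    print(n, spread)
```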

Despite these theoretical challenges, the Cauchy distribution is highly useful. In Bayesian modeling, when we need a heavy-tailed prior over \(\mathbb{R}^+\) that allows for large values while maintaining finite density at the origin, we use the Half-Cauchy distribution, defined next.

Definition: Half-Cauchy Distribution

The Half-Cauchy distribution with scale \(\gamma > 0\) is the zero-centered Cauchy distribution \(\text{Cauchy}(0, \gamma)\) restricted to \(x \geq 0\) and renormalized so its total mass is one. Its p.d.f. is \[ f(x \mid \gamma) = \frac{2}{\pi \gamma}\left[1 + \left(\frac{x}{\gamma}\right)^2\right]^{-1}, \qquad x \geq 0. \] (The factor \(2\) compensates for the loss of the negative half: because the \(\text{Cauchy}(0, \gamma)\) density is symmetric about the origin, exactly half of its mass lies on \(x \geq 0\).) The Half-Cauchy is a common choice for scale parameters in hierarchical Bayesian priors: its heavy tail allows large values to remain probable while keeping a finite density at the origin, and the resulting posterior tends to be less sensitive to the choice of hyperparameter \(\gamma\) than the Inverse-Gamma alternative is to its own hyperparameters.
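The following sketch evaluates the Half-Cauchy density and its c.d.f. The c.d.f. \(F(x \mid \gamma) = \frac{2}{\pi}\arctan(x/\gamma)\) is not stated above; it follows by integrating the density from \(0\) to \(x\). Function names and the value of \(\gamma\) are illustrative.

```python
import math

def half_cauchy_pdf(x, gamma):
    """Half-Cauchy density with scale gamma; zero for negative x."""
    if x < 0:
        return 0.0
    return 2.0 / (math.pi * gamma * (1.0 + (x / gamma) ** 2))

def half_cauchy_cdf(x, gamma):
    """Integral of the density from 0 to x: (2 / pi) * arctan(x / gamma)."""
    if x < 0:
        return 0.0
    return 2.0 / math.pi * math.atan(x / gamma)

gamma = 2.5
print(half_cauchy_pdf(0.0, gamma))          # finite density at the origin: 2 / (pi * gamma)
print(half_cauchy_cdf(gamma, gamma))        # the median equals the scale gamma
print(1.0 - half_cauchy_cdf(100.0, gamma))  # mass still left beyond x = 100
```

The last line illustrates the heavy tail: even at forty times the scale, a noticeable fraction of the prior mass remains.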

Laplace Distribution

The Cauchy distribution demonstrates that heavy tails can be extreme enough to prevent even the mean from existing. A more moderate heavy-tailed alternative, which retains finite moments of all orders while still placing more mass in the tails than the normal distribution, is the Laplace distribution.

Definition: Laplace Distribution

The Laplace distribution (also called the double exponential distribution) with location \(\mu\) and scale \(b > 0\) has p.d.f. \[ f(y \mid \mu, b) = \frac{1}{2b}\exp\!\left(-\frac{|y - \mu|}{b}\right). \] Its mode, mean, and variance are \[ \operatorname{mode}(Y) = \mathbb{E}[Y] = \mu, \qquad \operatorname{Var}(Y) = 2b^2. \]

Proof of the Laplace moments:

Substitute \(u = (y - \mu)/b\), so \(y = \mu + bu\) and \(dy = b\,du\). The density transforms to \(f(y)\,dy = \tfrac{1}{2}e^{-|u|}\,du\), and \((y - \mu)^k = b^k u^k\). For the mean, \[ \mathbb{E}[Y] - \mu = \mathbb{E}[Y - \mu] = b\int_{-\infty}^{\infty} u \cdot \tfrac{1}{2}e^{-|u|}\,du = 0, \] since \(u\,e^{-|u|}\) is odd and the integral is absolutely convergent (\(|u|\,e^{-|u|}\) integrates to a finite value). For the variance, \(u^2 e^{-|u|}\) is even, so \[ \operatorname{Var}(Y) = \mathbb{E}[(Y - \mu)^2] = b^2 \int_{-\infty}^{\infty} u^2 \cdot \tfrac{1}{2} e^{-|u|}\,du = b^2 \int_{0}^{\infty} u^2 e^{-u}\,du = b^2 \,\Gamma(3) = 2b^2, \] using \(\Gamma(n) = (n-1)!\) for positive integers \(n\). The mode is \(\mu\) because \(e^{-|y - \mu|/b}\) attains its maximum exactly at \(y = \mu\).
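The moments can be cross-checked by Monte Carlo. This sketch relies on a standard fact not used in the proof above: if \(E_1, E_2\) are independent \(\text{Exp}(1)\) variables, then \(\mu + b(E_1 - E_2)\) follows the Laplace distribution with location \(\mu\) and scale \(b\). The helper name, seed, parameter values, and sample size are illustrative.

```python
import random

def sample_laplace(mu, b, rng):
    """Draw mu + b * (E1 - E2) with E1, E2 independent Exp(1) variables."""
    return mu + b * (rng.expovariate(1.0) - rng.expovariate(1.0))

rng = random.Random(0)  # fixed seed so the run is reproducible
mu, b = 3.0, 1.5
draws = [sample_laplace(mu, b, rng) for _ in range(200000)]

mean = sum(draws) / len(draws)
var = sum((y - mean) ** 2 for y in draws) / len(draws)
print(mean)  # should be near mu = 3.0
print(var)   # should be near 2 * b**2 = 4.5
```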

Insight: Heavy Tails and Regularization

In machine learning, the choice of distribution often corresponds to the choice of regularization. Under maximum a posteriori (MAP) estimation, a zero-mean Gaussian prior on the weights leads to \(L_2\) regularization (Ridge), while a zero-mean Laplace prior leads to \(L_1\) regularization (Lasso), which promotes sparsity. Similarly, using the Student's \(t\)-distribution as the noise model in regression makes the fit more robust to outliers than standard least-squares (Gaussian) regression.
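The correspondence between priors and penalties comes from taking the negative logarithm of the prior density. The sketch below assumes independent zero-mean priors on each weight, with hypothetical scale names `tau` (Gaussian standard deviation) and `b` (Laplace scale); up to an additive constant that does not depend on the weights, the negative log-priors are a squared penalty and an absolute-value penalty respectively.

```python
import math

def neg_log_gaussian_prior(weights, tau):
    """Negative log-density of independent N(0, tau^2) priors on the weights."""
    return sum(0.5 * (w / tau) ** 2 + 0.5 * math.log(2.0 * math.pi * tau ** 2)
               for w in weights)

def neg_log_laplace_prior(weights, b):
    """Negative log-density of independent Laplace(0, b) priors on the weights."""
    return sum(abs(w) / b + math.log(2.0 * b) for w in weights)

weights = [0.0, -2.0, 0.5]
zeros = [0.0] * len(weights)
tau, b = 1.0, 1.0

# Subtracting the value at w = 0 removes the constant and leaves only the penalty.
l2_part = neg_log_gaussian_prior(weights, tau) - neg_log_gaussian_prior(zeros, tau)
l1_part = neg_log_laplace_prior(weights, b) - neg_log_laplace_prior(zeros, b)
print(l2_part)  # equals sum(w**2) / (2 * tau**2)
print(l1_part)  # equals sum(|w|) / b
```

The Laplace penalty grows linearly in \(|w|\) all the way down to zero, which is the feature usually credited with driving small weights exactly to zero in Lasso.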

The Student's \(t\), Cauchy, and Laplace distributions complete our toolkit of univariate distributions for robust modeling. In the next part, we move from single random variables to pairs of random variables, introducing covariance as the fundamental measure of linear dependence between variables.