Gaussian Function
The normal (Gaussian) distribution is arguably the most important distribution
in all of statistics and machine learning. Before defining it, we study the underlying
Gaussian function and its integral properties. The key challenge is that the
Gaussian function has no elementary antiderivative, yet its improper integral over all of
\(\mathbb{R}\) can be evaluated exactly, which makes the normal distribution analytically tractable.
A Gaussian function is defined as:
\[
f(x) = e^{-x^2}
\]
and often is parametrized as
\[
f(x) = a e^{-\frac{(x-b)^2}{2c^2}} \tag{1}
\]
where \(a, b , c \in \mathbb{R}, \text{ and } c \neq 0\).
The family of Gaussian functions does not have elementary antiderivatives. To represent the integral of Gaussian functions,
we use the special function known as the error function:
\[
\text{erf }(z) = \frac{2}{\sqrt{\pi}} \int_{0}^z e^{-t^2}dt. \qquad \text{erf}: \mathbb{C} \to \mathbb{C}
\]
Then,
\[
\int e^{-x^2} dx = \frac{\sqrt{\pi}}{2}\text{erf }(x) + C.
\]
On the other hand, their improper integrals over \(\mathbb{R}\) can be evaluated exactly using
Gaussian integral:
Theorem: Gaussian Integral
\[
\int_{-\infty} ^\infty e^{-x^2} dx = \sqrt{\pi}.
\]
Note: The generalized Gaussian integral is given by
\[
\int_{-\infty} ^\infty e^{-ax^2} dx = \sqrt{\frac{\pi}{a}} \qquad a > 0 \tag{2}
\]
Proof:
Let \(I = \int_{-\infty} ^\infty e^{-x^2} dx\). Then
\[
\begin{align*}
I^2 &= \int_{-\infty} ^\infty \int_{-\infty} ^\infty e^{-u^2} e^{-v^2} du dv \\\\
&= \int_{-\infty} ^\infty \int_{-\infty} ^\infty e^{-(u^2 + v^2)} dudv.
\end{align*}
\]
Using the polar coordinates, let \(u = r\cos \theta\), and \(v = r\sin \theta\) and then
\(u^2 + v^2 = r^2\) and \(dudv = r dr d\theta\).
Thus, by Fubini (the integrand is non-negative, so Tonelli applies unconditionally),
\[
\begin{align*}
I^2 = \int_{0}^{2\pi} \int_{0}^{\infty} e^{-r^2}\,r\,dr\,d\theta \\\\
= \left(\int_{0}^{2\pi} d\theta\right)\left(\int_{0}^{\infty} e^{-r^2}\,r\,dr\right).
\end{align*}
\]
The angular integral is \(2\pi\). For the radial integral, the antiderivative of \(e^{-r^2} r\)
with respect to \(r\) is \(-\tfrac{1}{2}e^{-r^2}\), so
\[
\int_{0}^{\infty} e^{-r^2}\,r\,dr = \left[-\tfrac{1}{2}e^{-r^2}\right]_{0}^{\infty} = 0 - \left(-\tfrac{1}{2}\right) = \tfrac{1}{2}.
\]
Therefore \(I^2 = 2\pi \cdot \tfrac{1}{2} = \pi\), and since \(I > 0\),
\[
I = \sqrt{\pi}.
\]
This is the case \(a = 1\) in the generalized Gaussian integral. We extend this result to the
generalized case with a scaling factor \(a > 0\).
Let \(u = \sqrt{a}\,x\) and then \(x = \frac{u}{\sqrt{a}} \) and \(dx = \frac{1}{\sqrt{a}}du\).
Substituting these into the integral (2)
\[
\begin{align*}
& \int_{-\infty} ^\infty e^{-a(\frac{u}{\sqrt{a}})^2} \frac{1}{\sqrt{a}}du \\\\
&= \frac{1}{\sqrt{a}} \int_{-\infty} ^\infty e^{-u^2}du \\\\
&= \frac{1}{\sqrt{a}} (\sqrt{\pi}) \\\\
&= \sqrt{\frac{\pi}{a}}.
\end{align*}
\]
Here, we use the parametrized Gaussian function (1).
Let \(u = \frac{x-b}{c}\), which implies \(x = cu + b \) and \(dx = cdu\). Then
\[
\begin{align*}
\int_{-\infty} ^\infty a e^{-\frac{(x-b)^2}{2c^2}} dx
&= \int_{-\infty} ^\infty a e^{-\frac{(cu)^2}{2c^2}}cdu \\\\
&= ac \int_{-\infty} ^\infty e^{-\frac{u^2}{2}}du
\end{align*}
\]
By (2) applied with parameter \(1/2\) in place of \(a\), we have
\(\int_{-\infty}^\infty e^{-u^2/2}\,du = \sqrt{2\pi}\), so
\[
\begin{align*}
\int_{-\infty} ^\infty a e^{-\frac{(x-b)^2}{2c^2}} dx
&= ac \int_{-\infty} ^\infty e^{-\frac{u^2}{2}}du \\\\
&= ac\sqrt{2\pi}. \tag{3}
\end{align*}
\]
Normal(Gaussian) Distribution
With the Gaussian integral established, we can now construct a proper probability density
function from the Gaussian function by choosing the parameters so that the total area under
the curve equals one.
Definition: Normal (Gaussian) Distribution
A random variable \(X\) has a normal (Gaussian) distribution with mean \(\mu \in \mathbb{R}\)
and variance \(\sigma^2 > 0\) if its p.d.f. is:
\[
f(x) = \frac{1}{\sigma \sqrt{2\pi}}\exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \quad x \in \mathbb{R}.
\]
We write \(X \sim \mathcal{N}(\mu, \sigma^2)\).
That \(f\) integrates to \(1\) follows from the generalized Gaussian integral (3)
above: the density above is the parametrized Gaussian (1) with amplitude
\(a = \frac{1}{\sigma\sqrt{2\pi}}\), location \(b = \mu\), and scale \(c = \sigma\),
so its integral over \(\mathbb{R}\) equals \(ac\sqrt{2\pi} = 1\).
Note that the c.d.f. of the normal distribution does not have a closed form:
\[
F(x) = \int_{-\infty}^{x} \frac{1}{\sigma \sqrt{2\pi}}\exp\!\left(-\frac{(t - \mu)^2}{2\sigma^2}\right) dt.
\]
The above figure shows the normal p.d.f. curve. \(\mu\) is the center, and \(\sigma\) is the distance from
the center to the inflection point of the curve. The one reason the normal distribution is widely used in statistics
and machine learning as well, the parameters capture its mean and variance, which are essential properties of the distribution.
\[
\mathbb{E}[X] = \mu \qquad \operatorname{Var}(X) = \sigma^2.
\]
So, in the normal distribution, measures of central tendency(mean, median, and mode) are the same.
Proof of \(\mathbb{E}[X] = \mu\) and \(\operatorname{Var}(X) = \sigma^2\):
Apply the substitution \(u = (x - \mu)/\sigma\), so \(x = \sigma u + \mu\) and \(dx = \sigma\,du\).
For the mean,
\[
\begin{align*}
\mathbb{E}[X] &= \int_{-\infty}^{\infty} x \cdot \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) dx \\\\
&= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} (\sigma u + \mu)\,e^{-u^2/2}\,du \\\\
&= \frac{\sigma}{\sqrt{2\pi}}\underbrace{\int_{-\infty}^{\infty} u\,e^{-u^2/2}\,du}_{= 0 \text{ (odd integrand)}}
+ \frac{\mu}{\sqrt{2\pi}}\underbrace{\int_{-\infty}^{\infty} e^{-u^2/2}\,du}_{= \sqrt{2\pi}} \\\\
&= 0 + \mu = \mu.
\end{align*}
\]
For the variance, the same substitution gives \((x - \mu)^2 = \sigma^2 u^2\), so
\[
\begin{align*}
\operatorname{Var}(X) &= \mathbb{E}\bigl[(X - \mu)^2\bigr] \\\\
&= \int_{-\infty}^{\infty} (x - \mu)^2 \cdot \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}\,dx \\\\
&= \frac{\sigma^2}{\sqrt{2\pi}}\int_{-\infty}^{\infty} u^2\,e^{-u^2/2}\,du.
\end{align*}
\]
The remaining integral is evaluated by integration by parts with \(v = u\) and
\(dw = u\,e^{-u^2/2}\,du\) (so \(w = -e^{-u^2/2}\)):
\[
\int_{-\infty}^{\infty} u^2\,e^{-u^2/2}\,du
= \bigl[-u\,e^{-u^2/2}\bigr]_{-\infty}^{\infty} + \int_{-\infty}^{\infty} e^{-u^2/2}\,du
= 0 + \sqrt{2\pi},
\]
the boundary term vanishing by exponential decay. Therefore
\(\operatorname{Var}(X) = (\sigma^2/\sqrt{2\pi})\cdot\sqrt{2\pi} = \sigma^2\).
The simplest and useful normal distribution is the one with zero mean and unit variance. We call it
the standard normal distribution denoted by
\[
Z \sim \mathcal{N}(0, 1).
\]
Any normally distributed random variable can be "transformed" to a standard normal random variable. If
\(X \sim \mathcal{N}(\mu, \sigma^2)\), then
\[
Z = \frac{X- \mu}{\sigma}\sim \mathcal{N}(0, 1).
\]
This process is called standardization. This is just a special case of the linear transformation:
\[
Y = aX + b \sim \mathcal{N}(a\mu+b, a^2\sigma^2) \quad (a \neq 0).
\]
Standardization corresponds to \(a = 1/\sigma\) and \(b = -\mu/\sigma\), giving
\(\mathbb{E}[Z] = (\mu - \mu)/\sigma = 0\) and
\(\operatorname{Var}(Z) = \sigma^2/\sigma^2 = 1\) by linearity of expectation and the variance scaling rule.
Remark on notation. Up to this point we have written the density and c.d.f. of a single random variable
plainly as \(f(x)\) and \(F(x)\), since the variable in question was unambiguous. In the proof below we work with
two related variables \(X\) and \(Y = aX + b\) simultaneously, so we must distinguish their densities and c.d.f.s.
We adopt the standard subscript convention \(f_X, F_X\) (resp. \(f_Y, F_Y\)) whenever multiple random variables
are present and confusion would otherwise be possible; we revert to plain \(f, F\) once the context fixes a single variable.
Proof of \(Y = aX + b \sim \mathcal{N}(a\mu + b, a^2\sigma^2)\):
Assume \(a \neq 0\). We derive the density of \(Y\) from its c.d.f. Consider first \(a > 0\):
\[
F_Y(y) = P(aX + b \leq y) = P\!\left(X \leq \frac{y - b}{a}\right) = F_X\!\left(\frac{y - b}{a}\right).
\]
Differentiating with respect to \(y\) and using the chain rule,
\(f_Y(y) = f_X\!\bigl((y - b)/a\bigr)\cdot (1/a)\).
For \(a < 0\), the inequality flips:
\(F_Y(y) = P(X \geq (y-b)/a) = 1 - F_X\!\bigl((y-b)/a\bigr)\), and differentiating gives
\(f_Y(y) = -f_X\!\bigl((y-b)/a\bigr)\cdot(1/a)\). Since \(a < 0\) implies \(-1/a = 1/|a|\),
both cases combine to
\[
f_Y(y) = \frac{1}{|a|}\,f_X\!\left(\frac{y - b}{a}\right).
\]
Substituting the normal density,
\[
f_Y(y) = \frac{1}{|a|\,\sigma\sqrt{2\pi}}\exp\!\left(-\frac{\bigl((y - b)/a - \mu\bigr)^2}{2\sigma^2}\right).
\]
Rewriting the exponent,
\[
\frac{y - b}{a} - \mu = \frac{y - b - a\mu}{a} = \frac{y - (a\mu + b)}{a},
\]
so its square is \((y - (a\mu+b))^2 / a^2\), and
\[
f_Y(y) = \frac{1}{|a|\,\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(y - (a\mu + b))^2}{2\,a^2\sigma^2}\right).
\]
This matches the \(\mathcal{N}(a\mu + b,\;a^2\sigma^2)\) density (with standard deviation
\(|a|\sigma\) and variance \(a^2\sigma^2\)). Therefore \(Y \sim \mathcal{N}(a\mu + b, a^2\sigma^2)\).
The p.d.f. of \(Z\) is denoted by
\[
\phi(z) = \frac{1}{\sqrt{2\pi}}e^{-z^2/2}, \qquad z \in \mathbb{R},
\]
and c.d.f. of \(Z\) is denoted by
\[
\Phi(z) = P(Z \leq z) = \int_{-\infty}^z \phi(u)\,du.
\]
Chi-Squared Distribution
If a single standard normal variable \(Z \sim \mathcal{N}(0,1)\) captures the behavior of
one random measurement, what distribution governs the sum of squared measurements?
This question arises naturally in statistics: whenever we compute a sample variance, we are
summing squared deviations. The answer is the chi-squared distribution, which
connects the normal distribution from this section to the
gamma distribution from the previous one.
Definition: Chi-Squared Distribution
Let \(Z_1, Z_2, \ldots, Z_\nu\) be independent standard normal random variables,
\(Z_i \sim \mathcal{N}(0, 1)\). The random variable
\[
Q = Z_1^2 + Z_2^2 + \cdots + Z_\nu^2 = \sum_{i=1}^{\nu} Z_i^2
\]
is said to have a chi-squared distribution with \(\nu\) degrees of freedom,
written \(Q \sim \chi^2_\nu\).
Proposition: Chi-Squared p.d.f.
The p.d.f. of \(Q \sim \chi^2_\nu\) is
\[
f(x) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, x^{\nu/2 - 1}\, e^{-x/2}, \quad x \geq 0.
\]
Equivalently,
\(\chi^2_\nu = \text{Gamma}\!\left(\alpha = \nu/2,\; \beta = 1/2\right)\).
Sketch. The two ingredients are: (i) for a single standard normal \(Z\), the change of
variables \(Y = Z^2\) (using the symmetry \(P(Z^2 \leq y) = 2\Phi(\sqrt{y}) - 1\) and differentiating)
yields the density \(f_Y(y) = (1/\sqrt{2\pi})\,y^{-1/2}e^{-y/2}\) for \(y > 0\), which matches
\(\text{Gamma}(1/2, 1/2)\); (ii) the sum of \(\nu\) independent \(\text{Gamma}(1/2, 1/2)\) random
variables is \(\text{Gamma}(\nu/2, 1/2)\) (additivity of independent gamma variables with a common
rate parameter, proved via convolution of densities or moment generating functions).
Ingredient (i) is a one-variable computation; ingredient (ii) requires the machinery of independence
and joint distributions, which we develop in
Limit Theorems & Product Measures.
The mean and variance follow immediately from the gamma parameters:
\[
\mathbb{E}[Q] = \frac{\alpha}{\beta} = \nu, \qquad
\operatorname{Var}(Q) = \frac{\alpha}{\beta^2} = 2\nu.
\]
The chi-squared distribution plays a fundamental role in statistical inference.
If \(X_1, \ldots, X_n\) are i.i.d. \(\mathcal{N}(\mu, \sigma^2)\) and we define the
sample variance \(s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2\), then:
\[
\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}.
\]
We state this without proof: the result follows from Cochran's theorem,
which decomposes the quadratic form \(\sum_i (X_i - \bar{X})^2 / \sigma^2\) along orthogonal
subspaces of \(\mathbb{R}^n\) and shows that the residual component is a sum of \(n-1\)
independent squared standard normals — exactly one degree of freedom is “spent” on
estimating \(\bar{X}\). The proof requires multivariate normal theory not yet developed in this
section. This result is what allows us to construct confidence intervals for variances and to
define the Student's \(t\)-distribution,
which arises when we replace the unknown \(\sigma\) with the sample estimate \(s\).
Insight: The Gamma → Chi-Squared → Student's \(t\) Chain
The relationship between these distributions forms a coherent chain. The
gamma distribution is the general family for non-negative continuous
variables. Setting \(\alpha = \nu/2\) and \(\beta = 1/2\) specializes it to the
chi-squared distribution, which governs sums of squared normals.
The Student's \(t\)-distribution then arises as the ratio:
\[
T = \frac{Z}{\sqrt{Q/\nu}}, \quad Z \sim \mathcal{N}(0,1),\; Q \sim \chi^2_\nu,\; Z \perp Q
\]
where \(T \sim t_\nu\). In practice, \(Z\) represents the standardized signal
and \(Q/\nu\) represents the estimated noise level. When we don't know the true
variance \(\sigma^2\) and must estimate it from data, replacing \(\sigma\) with \(s\)
introduces extra uncertainty that thickens the tails - precisely captured by the
\(t\)-distribution's degrees of freedom parameter \(\nu = n - 1\).
Central Limit Theorem
The previous section defined the normal distribution and its properties for a single random variable.
In practice, however, we almost always work with collections of observations. A fundamental
question arises: if we average many independent measurements, what can we say about the distribution
of that average? The answer - provided by the Central Limit Theorem - is one of the
most remarkable results in all of mathematics: the average tends toward a normal distribution,
regardless of the underlying distribution of the individual measurements.
Now, we consider a set of random variables \(\{X_1, X_2, \cdots, X_n\}\). In random sampling, we assume that each
\(X_i\) has the same distribution and is independent of each other. We call it independent and identically distributed(i.i.d.).
In this case, a single sample mean \(\bar{X} = \frac{1}{n}\sum_{i=1} ^n X_i\) is not exactly equal to the population mean \(\mu\)
because there are many different sampling possibilities. However, the average of sample means is equal to \(\mu\):
\[
\begin{align*}
\mathbb{E}[\bar{X}] &= \mathbb{E}\!\left[\tfrac{1}{n}(X_1 + X_2 + \cdots +X_n)\right] \\\\
&= \tfrac{1}{n}\bigl[\mathbb{E}[X_1] + \mathbb{E}[X_2] + \cdots + \mathbb{E}[X_n]\bigr] \\\\
&= \tfrac{n\mu}{n} = \mu.
\end{align*}
\]
Thus the sample mean is an unbiased estimator of \(\mu\). Its variance, by independence of the \(X_i\), is
\[
\begin{align*}
\operatorname{Var}(\bar{X}) &= \operatorname{Var}\!\left[\tfrac{1}{n}(X_1 + X_2 + \cdots +X_n)\right] \\\\
&= \tfrac{1}{n^2}\bigl[\operatorname{Var}(X_1) + \operatorname{Var}(X_2) + \cdots + \operatorname{Var}(X_n)\bigr] \\\\
&= \tfrac{n \sigma^2}{n^2} = \tfrac{\sigma^2}{n},
\end{align*}
\]
so \(\bar{X}\) concentrates around \(\mu\) with standard deviation \(\sigma/\sqrt{n}\) — the foundation of the law of large numbers.
The naive sample-variance estimator \(\widetilde{s}^{\,2} = \tfrac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2\), however,
is a biased estimator of \(\sigma^2\): one can show that
\(\mathbb{E}[\widetilde{s}^{\,2}] = \tfrac{n-1}{n}\sigma^2\), which underestimates \(\sigma^2\) because the same data is used both to
center the deviations and to compute their average. To remove the bias, we adjust the denominator:
\[
s^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2,
\]
so that \(\mathbb{E}[s^2] = \sigma^2\). The full computation of the bias factor is a standard exercise in joint
distribution theory.
At this point, you might wonder how we can make inferences about the unknown distribution of random variables in practice.
The Central Limit Theorem (CLT) provides a result: regardless of the underlying distribution of a population,
the distribution of the sample mean approaches a normal distribution as the sample size \(n\) becomes large enough. This remarkable
property is one of the reasons the normal distribution is widely used in statistics and forms the foundation of many statistical methods
and theories.
Theorem: Central Limit Theorem
Let \(X_1, X_2, \cdots, X_n\) be i.i.d. random variables with mean \(\mu\) and variance \(\sigma^2\).
Then the distribution of
\[
Z = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}}}
\]
converges to the standard normal distribution as the sample size \(n \to \infty\).
Note: The SD of \(\bar{X}\) is \(\sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}\).
If the sample size \(n\) is large enough,
\[
\bar{X} \approx \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right)
\]
and also,
\[
\sum_{i =1} ^n X_i \approx \mathcal{N}(n\mu , n\sigma^2).
\]
Note: The rate at which this approximation becomes accurate depends on the underlying distribution.
A common rule of thumb is that mildly skewed distributions are well approximated by
\(n \approx 30\), while highly skewed distributions such as the exponential typically require
\(n \gtrsim 100\) before the normal approximation is accurate in practice.
The Central Limit Theorem (CLT) explains the ubiquity of the normal distribution, showing that any quantity arising from
the aggregate of many small, independent contributions will be approximately normally distributed. For a rigorous treatment
and a proof using moment generating functions, see Convergence.
Beyond the CLT, the Gaussian distribution is fundamental in information theory as the distribution that
maximizes entropy for a fixed mean and variance - making it the "most random"
choice under these constraints.
Next, we introduce the Student's \(t\)-distribution,
which accounts for the additional uncertainty that arises when the population variance \(\sigma^2\)
is unknown and must be estimated from the sample.