Normal (Gaussian) Distribution


Gaussian Function

The normal (Gaussian) distribution is arguably the most important distribution in all of statistics and machine learning. Before defining it, we study the underlying Gaussian function and its integral properties. The key challenge is that the Gaussian function has no elementary antiderivative, yet its improper integral over all of \(\mathbb{R}\) can be evaluated exactly, which makes the normal distribution analytically tractable.

A Gaussian function is defined as: \[ f(x) = e^{-x^2} \] and is often parametrized as \[ f(x) = a e^{-\frac{(x-b)^2}{2c^2}} \tag{1} \] where \(a, b, c \in \mathbb{R} \text{ and } c \neq 0\).

The family of Gaussian functions does not have elementary antiderivatives. To represent the integral of Gaussian functions, we use the special function known as the error function: \[ \text{erf }(z) = \frac{2}{\sqrt{\pi}} \int_{0}^z e^{-t^2}dt. \qquad \text{erf}: \mathbb{C} \to \mathbb{C} \] Then, \[ \int e^{-x^2} dx = \frac{\sqrt{\pi}}{2}\text{erf }(x) + C. \]
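As a quick numerical sanity check (a minimal sketch assuming NumPy and SciPy are available; it is not part of the derivation), we can compare a numerical evaluation of \(\int_0^z e^{-t^2}dt\) with the erf-based expression:

```python
# Check that the integral of e^{-t^2} on [0, z] equals (sqrt(pi)/2) * erf(z).
import numpy as np
from scipy.integrate import quad
from scipy.special import erf

for z in [0.5, 1.0, 2.0]:
    numeric, _ = quad(lambda t: np.exp(-t**2), 0, z)   # numerical integration
    closed_form = 0.5 * np.sqrt(np.pi) * erf(z)        # erf-based antiderivative
    print(f"z={z}: quad={numeric:.10f}, erf form={closed_form:.10f}")
```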

On the other hand, their improper integrals over \(\mathbb{R}\) can be evaluated exactly using the Gaussian integral:

Theorem 1: Gaussian Integral

\[ \int_{-\infty} ^\infty e^{-x^2} dx = \sqrt{\pi}. \] Note: The generalized Gaussian integral is given by \[ \int_{-\infty} ^\infty e^{-ax^2} dx = \sqrt{\frac{\pi}{a}} \qquad a > 0 \tag{2} \]

Proof:

Let \(I = \int_{-\infty} ^\infty e^{-x^2} dx\). Then \[ \begin{align*} I^2 &= \int_{-\infty} ^\infty \int_{-\infty} ^\infty e^{-u^2} e^{-v^2} du\, dv \\\\ &= \int_{-\infty} ^\infty \int_{-\infty} ^\infty e^{-(u^2+v^2)} du\, dv. \end{align*} \] Switching to polar coordinates, let \(u = r\cos \theta\) and \(v = r\sin \theta\); then \(u^2 + v^2 = r^2\) and \(du\, dv = r\, dr\, d\theta\).

Thus, \[ \begin{align*} I^2 &= \int_{0} ^{2\pi} \int_{0} ^\infty e^{-r^2} r\, dr\, d\theta \\\\ &= \int_{0} ^{2\pi} 1\, d\theta \int_{0} ^\infty e^{-r^2} r\, dr. \end{align*} \] Here, let \(w = -r^2\) and \(dw = -2r\,dr\). Then \[ \begin{align*} I^2 &= (2\pi) \left(-\frac{1}{2}\int_{0} ^{-\infty} e^{w} dw\right) \\\\ &= (2\pi) \left(\frac{1}{2}\int_{-\infty} ^0 e^{w} dw\right) \\\\ &= (2\pi)\left(\frac{1}{2}\cdot 1\right) \\\\ &= \pi. \end{align*} \] Therefore, since \(I > 0\), \[ I = \sqrt{\pi}. \] This is the case where \(a = 1\) in the generalized Gaussian integral. We extend this result to the generalized case with a scaling factor \(a>0\).

Let \(u = \sqrt{a}x\), so that \(x = \frac{u}{\sqrt{a}} \) and \(dx = \frac{1}{\sqrt{a}}du\). Substituting these into the integral (2), \[ \begin{align*} & \int_{-\infty} ^\infty e^{-a(\frac{u}{\sqrt{a}})^2} \frac{1}{\sqrt{a}}du \\\\ &= \frac{1}{\sqrt{a}} \int_{-\infty} ^\infty e^{-u^2}du \\\\ &= \frac{1}{\sqrt{a}} (\sqrt{\pi}) \\\\ &= \sqrt{\frac{\pi}{a}}. \end{align*} \]

Here, we use the parametrized Gaussian function (1).

Let \(u = \frac{x-b}{c}\), which implies \(x = cu + b \) and \(dx = c\,du\). Then \[ \begin{align*} \int_{-\infty} ^\infty a e^{-\frac{(x-b)^2}{2c^2}} dx &= \int_{-\infty} ^\infty a e^{-\frac{(cu)^2}{2c^2}}c\,du \\\\ &= ac \int_{-\infty} ^\infty e^{-\frac{u^2}{2}}du \end{align*} \] Using the generalized Gaussian integral (2) with scaling factor \(\frac{1}{2}\), \[ \begin{align*} \int_{-\infty} ^\infty a e^{-\frac{(x-b)^2}{2c^2}} dx &= ac \int_{-\infty} ^\infty e^{-\frac{u^2}{2}}du \\\\ &= ac\sqrt{2\pi}. \tag{3} \end{align*} \]
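The integrals (2) and (3) are easy to verify numerically. The following sketch (assuming NumPy and SciPy; the parameter values are arbitrary choices for illustration) compares both formulas against direct numerical integration:

```python
# Numerical check of the generalized Gaussian integral (2) and the
# parametrized Gaussian integral (3).
import numpy as np
from scipy.integrate import quad

# (2): integral of e^{-a x^2} over R equals sqrt(pi / a) for a > 0.
a = 2.5
val, _ = quad(lambda x: np.exp(-a * x**2), -np.inf, np.inf)
print(val, np.sqrt(np.pi / a))                 # both ~ 1.1210

# (3): integral of a e^{-(x - b)^2 / (2 c^2)} over R equals a * c * sqrt(2 pi).
a2, b, c = 3.0, -1.0, 0.7
val, _ = quad(lambda x: a2 * np.exp(-(x - b)**2 / (2 * c**2)), -np.inf, np.inf)
print(val, a2 * c * np.sqrt(2 * np.pi))        # both ~ 5.2639
```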

Normal (Gaussian) Distribution

With the Gaussian integral established, we can now construct a proper probability density function from the Gaussian function by choosing the parameters so that the total area under the curve equals one.

Definition: Normal (Gaussian) Distribution

A random variable \(X\) has a normal (Gaussian) distribution with mean \(\mu \in \mathbb{R}\) and variance \(\sigma^2 > 0\) if its p.d.f. is: \[ f(x) = \frac{1}{\sigma \sqrt{2\pi}}\exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \quad x \in \mathbb{R}. \] We write \(X \sim \mathcal{N}(\mu, \sigma^2)\).

Note that the c.d.f. of the normal distribution has no closed form in elementary functions: \[ F(x) = \int_{-\infty}^{x} \frac{1}{\sigma \sqrt{2\pi}}\exp\!\left(-\frac{(t - \mu)^2}{2\sigma^2}\right) dt. \] However, comparing the p.d.f. with expression (3), it is exactly the parametrized Gaussian function with \(a = \frac{1}{\sigma\sqrt{2\pi}}, \, b = \mu, \text{ and } c = \sigma\), so the total integral is \(ac\sqrt{2\pi} = 1\), as a p.d.f. requires.
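In practice, the c.d.f. is computed through the error function introduced earlier, since \(F(x) = \frac{1}{2}\left(1 + \text{erf}\!\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right)\). A short numerical sketch (assuming NumPy and SciPy; \(\mu\) and \(\sigma\) are arbitrary values for illustration) confirms this against direct integration of the p.d.f.:

```python
# Compare numerical integration of the normal p.d.f. with the erf expression
# for the c.d.f.: F(x) = 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2)))).
import numpy as np
from scipy.integrate import quad
from scipy.special import erf

mu, sigma = 1.0, 2.0
pdf = lambda t: np.exp(-(t - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

for x in [-1.0, 1.0, 3.0]:
    numeric, _ = quad(pdf, -np.inf, x)
    via_erf = 0.5 * (1 + erf((x - mu) / (sigma * np.sqrt(2))))
    print(f"x={x}: quad={numeric:.8f}, erf form={via_erf:.8f}")
```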

Figure: The normal distribution p.d.f.

The above figure shows the normal p.d.f. curve: \(\mu\) is the center, and \(\sigma\) is the distance from the center to the inflection point of the curve. One reason the normal distribution is so widely used in statistics and machine learning is that its two parameters directly capture the mean and variance, the essential properties of the distribution: \[ \mathbb{E}[X] = \mu \qquad \text{Var}(X) = \sigma^2. \] Moreover, because the density is symmetric and unimodal about \(\mu\), the measures of central tendency (mean, median, and mode) all coincide at \(\mu\).

The simplest and most useful normal distribution is the one with zero mean and unit variance. We call it the standard normal distribution, denoted by \[ Z \sim \mathcal{N}(0, 1). \] Any normally distributed random variable can be "transformed" into a standard normal random variable. If \(X \sim \mathcal{N}(\mu, \sigma^2)\), then \[ Z = \frac{X- \mu}{\sigma}\sim \mathcal{N}(0, 1). \] This process is called standardization. It is just a special case of the linear transformation \[ Y = aX + b \sim \mathcal{N}(a\mu+b,\, a^2\sigma^2). \] Here, \(a = \frac{1}{\sigma} \) and \(b = -\frac{\mu}{\sigma}\), which gives \(\mathbb{E}\!\left[\frac{1}{\sigma}X -\frac{\mu}{\sigma}\right] = 0 \) and \(\text{Var}\!\left(\frac{1}{\sigma}X -\frac{\mu}{\sigma}\right) = 1\).

The p.d.f. of \(Z\) is denoted by \[ \phi (z) = \frac{1}{\sqrt{2\pi}}e^{-z^2/2}, \qquad z \in \mathbb{R}, \] and the c.d.f. of \(Z\) is denoted by \[ \Phi (z) = P(Z \leq z) = \int_{-\infty} ^z \phi (u)\,du. \]
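The following sketch (assuming NumPy and SciPy; the values of \(\mu\), \(\sigma\), and \(x\) are arbitrary) illustrates standardization on simulated data: standardized samples have mean \(\approx 0\) and standard deviation \(\approx 1\), and \(P(X \leq x) = \Phi\!\left(\frac{x-\mu}{\sigma}\right)\):

```python
# Standardization: Z = (X - mu) / sigma turns N(mu, sigma^2) samples into
# (approximately) standard normal samples, and P(X <= x) = Phi((x - mu)/sigma).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma, n = 5.0, 3.0, 1_000_000
x_samples = rng.normal(mu, sigma, size=n)
z_samples = (x_samples - mu) / sigma

print(z_samples.mean(), z_samples.std())   # ~ 0 and ~ 1

x = 7.0
print((x_samples <= x).mean())             # empirical P(X <= x)
print(norm.cdf((x - mu) / sigma))          # Phi((x - mu)/sigma) ~ 0.7475
```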

Chi-Squared Distribution

If a single standard normal variable \(Z \sim \mathcal{N}(0,1)\) captures the behavior of one random measurement, what distribution governs the sum of squared measurements? This question arises naturally in statistics: whenever we compute a sample variance, we are summing squared deviations. The answer is the chi-squared distribution, which connects the normal distribution from this section to the gamma distribution from the previous one.

Definition: Chi-Squared Distribution

Let \(Z_1, Z_2, \ldots, Z_\nu\) be independent standard normal random variables, \(Z_i \sim \mathcal{N}(0, 1)\). The random variable \[ Q = Z_1^2 + Z_2^2 + \cdots + Z_\nu^2 = \sum_{i=1}^{\nu} Z_i^2 \] has a chi-squared distribution with \(\nu\) degrees of freedom, written \(Q \sim \chi^2_\nu\). Its p.d.f. is: \[ f(x) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, x^{\nu/2 - 1}\, e^{-x/2}, \quad x \geq 0. \]

Comparing this p.d.f. with the gamma distribution \[ f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x} \], we see that the chi-squared distribution is a special case: \[ \chi^2_\nu = \text{Gamma}\!\left(\alpha = \frac{\nu}{2},\; \beta = \frac{1}{2}\right). \] This identification is not merely notational — it means that every property of the gamma distribution (moment formulas, moment generating function, additivity) transfers directly to the chi-squared distribution.

The mean and variance follow immediately from the gamma parameters: \[ \mathbb{E}[Q] = \frac{\alpha}{\beta} = \nu, \qquad \text{Var}(Q) = \frac{\alpha}{\beta^2} = 2\nu. \]
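A simulation sketch (assuming NumPy and SciPy; \(\nu = 4\) is an arbitrary choice) makes the definition and the gamma connection concrete: sums of \(\nu\) squared standard normals have mean \(\approx \nu\) and variance \(\approx 2\nu\), and the \(\chi^2_\nu\) density agrees with the \(\text{Gamma}(\nu/2, 1/2)\) density:

```python
# Simulate Q = Z_1^2 + ... + Z_nu^2 and compare with chi-squared / gamma facts.
import numpy as np
from scipy.stats import chi2, gamma

rng = np.random.default_rng(1)
nu, n_sim = 4, 500_000
q = (rng.standard_normal((n_sim, nu)) ** 2).sum(axis=1)

print(q.mean(), q.var())    # ~ nu = 4 and ~ 2 * nu = 8

# chi2(nu) and Gamma(alpha = nu/2, beta = 1/2) have the same p.d.f.
# (SciPy's gamma is parametrized by shape a and scale = 1 / beta.)
x = 3.0
print(chi2.pdf(x, df=nu), gamma.pdf(x, a=nu / 2, scale=2))   # both ~ 0.1673
```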

The chi-squared distribution plays a fundamental role in statistical inference. If \(X_1, \ldots, X_n\) are i.i.d. \(\mathcal{N}(\mu, \sigma^2)\) and we define the sample variance \(s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2\), then: \[ \frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}. \] This result is what allows us to construct confidence intervals for variances and to define the Student's \(t\)-distribution, which arises when we replace the unknown \(\sigma\) with the sample estimate \(s\).
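This sampling-distribution fact can also be checked by simulation. The sketch below (assuming NumPy; \(\mu\), \(\sigma\), and \(n\) are arbitrary) computes \((n-1)s^2/\sigma^2\) over many samples and compares its empirical mean and variance with the \(\chi^2_{n-1}\) values \(n-1\) and \(2(n-1)\):

```python
# Monte Carlo check that (n - 1) s^2 / sigma^2 behaves like chi-squared, n - 1 df.
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, n_sim = 10.0, 2.0, 8, 200_000

samples = rng.normal(mu, sigma, size=(n_sim, n))
s2 = samples.var(axis=1, ddof=1)        # sample variance with 1/(n - 1)
stat = (n - 1) * s2 / sigma**2

# chi-squared with n - 1 = 7 degrees of freedom has mean 7 and variance 14.
print(stat.mean(), stat.var())
```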

Insight: The Gamma → Chi-Squared → Student's \(t\) Chain

The relationship between these distributions forms a coherent chain. The gamma distribution is the general family for non-negative continuous variables. Setting \(\alpha = \nu/2\) and \(\beta = 1/2\) specializes it to the chi-squared distribution, which governs sums of squared normals. The Student's \(t\)-distribution then arises as the ratio: \[ T = \frac{Z}{\sqrt{Q/\nu}}, \quad Z \sim \mathcal{N}(0,1),\; Q \sim \chi^2_\nu,\; Z \perp Q \] where \(T \sim t_\nu\). In practice, \(Z\) represents the standardized signal and \(Q/\nu\) represents the estimated noise level. When we don't know the true variance \(\sigma^2\) and must estimate it from data, replacing \(\sigma\) with \(s\) introduces extra uncertainty that thickens the tails - precisely captured by the \(t\)-distribution's degrees of freedom parameter \(\nu = n - 1\).
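The ratio construction can be checked directly by simulation. The sketch below (assuming NumPy and SciPy; \(\nu = 5\) is an arbitrary choice) builds \(T = Z/\sqrt{Q/\nu}\) from independent ingredients and compares its empirical quantiles with those of \(t_\nu\):

```python
# Build T = Z / sqrt(Q / nu) from its ingredients and compare quantiles with t_nu.
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(3)
nu, n_sim = 5, 500_000

z = rng.standard_normal(n_sim)
q = (rng.standard_normal((n_sim, nu)) ** 2).sum(axis=1)   # chi-squared(nu) draws
t_samples = z / np.sqrt(q / nu)

for p in [0.90, 0.95, 0.99]:
    print(p, np.quantile(t_samples, p), t_dist.ppf(p, df=nu))
```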

Central Limit Theorem

The previous section defined the normal distribution and its properties for a single random variable. In practice, however, we almost always work with collections of observations. A fundamental question arises: if we average many independent measurements, what can we say about the distribution of that average? The answer - provided by the Central Limit Theorem - is one of the most remarkable results in all of mathematics: the average tends toward a normal distribution, regardless of the underlying distribution of the individual measurements.

Now, we consider a set of random variables \(\{X_1, X_2, \cdots, X_n\}\). In random sampling, we assume that each \(X_i\) has the same distribution and is independent of the others; we call such variables independent and identically distributed (i.i.d.). In this case, a single sample mean \(\bar{X} = \frac{1}{n}\sum_{i=1} ^n X_i\) does not generally equal the population mean \(\mu\), because it varies from sample to sample. However, the expected value of the sample mean is equal to \(\mu\): \[ \begin{align*} \mathbb{E}[\bar{X}] &= \mathbb{E}[\frac{1}{n}(X_1 + X_2 + \cdots +X_n)] \\\\ &= \frac{1}{n}[\mathbb{E}[X_1] + \mathbb{E}[X_2] + \cdots + \mathbb{E}[X_n]] \\\\ &= \frac{n\mu}{n} \\\\ &= \mu \end{align*} \]

Thus, the sample mean is an unbiased estimator of the population mean \(\mu\), and its variance shrinks as the sample size grows: \[ \begin{align*} \text{Var}(\bar{X}) &= \text{Var}[\frac{1}{n}(X_1 + X_2 + \cdots +X_n)] \\\\ &= \frac{1}{n^2}[\text{Var}(X_1) + \text{Var}(X_2) + \cdots + \text{Var}(X_n)] \\\\ &= \frac{n \sigma^2}{n^2} \\\\ &= \frac{\sigma^2}{n}. \end{align*} \] On the other hand, the naive sample variance \(\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2\) is a biased estimator of the population variance \(\sigma^2\). To make the estimator unbiased, we adjust the denominator: \[ s^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2. \]
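These three facts (unbiasedness of \(\bar{X}\), \(\text{Var}(\bar{X}) = \sigma^2/n\), and the bias of the \(1/n\) variance estimator) are easy to see in simulation. A minimal sketch, assuming NumPy and arbitrary parameter values:

```python
# Sampling-distribution check: E[X_bar] = mu, Var(X_bar) = sigma^2 / n, and
# the effect of dividing by n versus n - 1 in the sample variance.
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, n_sim = 3.0, 2.0, 10, 200_000

samples = rng.normal(mu, sigma, size=(n_sim, n))
x_bar = samples.mean(axis=1)

print(x_bar.mean())                        # ~ mu = 3.0 (unbiased)
print(x_bar.var(), sigma**2 / n)           # both ~ 0.4

print(samples.var(axis=1, ddof=0).mean())  # 1/n version: ~ 3.6 (biased low)
print(samples.var(axis=1, ddof=1).mean())  # 1/(n - 1) version: ~ sigma^2 = 4.0
```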

At this point, you might wonder how we can make inferences about the unknown distribution of random variables in practice. The Central Limit Theorem (CLT) provides a result: regardless of the underlying distribution of a population, the distribution of the sample mean approaches a normal distribution as the sample size \(n\) becomes large enough. This remarkable property is one of the reasons the normal distribution is widely used in statistics and forms the foundation of many statistical methods and theories.

Theorem 2: Central Limit Theorem

Let \(X_1, X_2, \cdots, X_n\) be i.i.d. random variables with mean \(\mu\) and variance \(\sigma^2\). Then the distribution of \[ Z = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}}} \] converges to the standard normal distribution as the sample size \(n \to \infty\).

Note: The SD of \(\bar{X}\) is \(\sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}\).

If the sample size \(n\) is large enough, \[ \bar{X} \approx \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right) \] and also \[ \sum_{i =1} ^n X_i \approx \mathcal{N}(n\mu , n\sigma^2). \] Note: How large \(n\) must be depends on the shape of the population; highly skewed distributions such as the exponential distribution typically require around \(n \geq 100\) before the CLT approximation is accurate.
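A simulation sketch (assuming NumPy and SciPy; the exponential population and \(n = 100\) follow the note above) shows the CLT in action: standardized sample means of a heavily skewed population track the standard normal c.d.f. closely:

```python
# CLT check with a skewed population: standardized means of exponential samples
# should be approximately standard normal for large n.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
n, n_sim = 100, 200_000
mu, sigma = 1.0, 1.0                        # exponential(rate=1): mean 1, sd 1

samples = rng.exponential(scale=1.0, size=(n_sim, n))
z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

for c in [-1.96, 0.0, 1.96]:
    print(c, (z <= c).mean(), norm.cdf(c))  # empirical vs Phi(c)
```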

The Central Limit Theorem (CLT) explains the ubiquity of the normal distribution, showing that any quantity arising from the aggregate of many small, independent contributions will be approximately normally distributed. For a rigorous treatment and a proof using moment generating functions, see Part 16: Convergence. Beyond the CLT, the Gaussian distribution is fundamental in information theory as the distribution that maximizes entropy for a fixed mean and variance - making it the "most random" choice under these constraints.

Next, we introduce the Student's \(t\)-distribution, which accounts for the additional uncertainty that arises when the population variance \(\sigma^2\) is unknown and must be estimated from the sample.