Convergence

Modes of Convergence & The Law of Large Numbers

We usually accept, on the strength of experiments or large-scale simulations, that observed relative frequencies settle down to the underlying probabilities. In maximum likelihood estimation and hypothesis testing, we relied on the idea that sample statistics approximate population parameters as the sample size grows. But what exactly does "approximate" mean for random variables? Unlike deterministic sequences, a sequence of random variables can approach a limit in several distinct senses, each with different mathematical strength and practical implications.

We introduce three fundamental modes of convergence, ordered from strongest to weakest, and connect each to a classical result in probability theory.

Definition: Almost Sure Convergence (\(\xrightarrow{a.s.}\))

Let \(\{X_n\}\) and \(X\) be random variables on a common probability space \((\Omega, \mathcal{F}, P)\). We say \(X_n\) converges to \(X\) almost surely, written \(X_n \xrightarrow{a.s.} X\), if \[ P\!\left(\left\{\omega \in \Omega : \lim_{n \to \infty} X_n(\omega) = X(\omega)\right\}\right) = 1. \] The convergence event is measurable because it admits the representation \[ \left\{\lim_{n} X_n = X\right\} \;=\; \bigcap_{k \geq 1} \bigcup_{N \geq 1} \bigcap_{n \geq N} \left\{|X_n - X| < \tfrac{1}{k}\right\}, \] a countable combination of measurable sets.

This is the strongest of the three modes. It is the probabilistic analogue of pointwise convergence "almost everywhere" (a.e.) in measure theory: the set of outcomes \(\omega\) for which \(X_n(\omega) \not\to X(\omega)\) may be nonempty, but it has probability zero. Almost sure convergence underpins the following fundamental result.

Theorem: Strong Law of Large Numbers (SLLN)

Let \(X_1, X_2, \ldots\) be i.i.d. random variables with \(\mathbb{E}[|X_1|] < \infty\) and \(\mathbb{E}[X_1] = \mu\). Then \[ \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{a.s.} \mu. \]

A rigorous proof lies beyond the scope of this introductory development; we accept the result here and use it freely throughout the rest of this section.
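Although we do not prove the SLLN, its content is easy to observe numerically. The following is a minimal sketch, assuming i.i.d. Uniform(0, 1) draws (so \(\mu = 0.5\)); the distribution, the seed, and the helper name `running_means` are illustrative choices of ours, not part of the theorem. It follows a single sample path and records how far the running mean is from \(\mu\).

```python
import random

def running_means(n, seed=0):
    """Running sample means of n i.i.d. Uniform(0, 1) draws (assumed example, mu = 0.5)."""
    rng = random.Random(seed)
    total = 0.0
    means = []
    for i in range(1, n + 1):
        total += rng.random()
        means.append(total / i)
    return means

# One sample path: the distance to mu = 0.5 shrinks as n grows.
means = running_means(100_000)
for n in (10, 1_000, 100_000):
    print(n, abs(means[n - 1] - 0.5))
```

A single simulated path cannot verify an almost sure statement, but it shows the behaviour the theorem asserts for all paths outside a null set.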

The SLLN guarantees that individual sample paths of \(\bar{X}_n\) converge to \(\mu\), except possibly on a set of probability zero. A weaker notion relaxes this path-level guarantee.

The second mode, convergence in probability, demands less: rather than asking individual sample paths to settle down on a probability-1 set, it asks only that the probability of any fixed-size deviation from \(X\) vanish as \(n \to \infty\). The formal definition appears below; this mode corresponds to the Weak Law.

Theorem: Weak Law of Large Numbers (WLLN)

Let \(X_1, X_2, \ldots\) be i.i.d. random variables with \(\mathbb{E}[X_1] = \mu\) and \(\operatorname{Var}(X_1) = \sigma^2 < \infty\). Then \[ \bar{X}_n \xrightarrow{P} \mu. \]

Proof (via Chebyshev's inequality):

First, the variance of the sample mean. By the scaling rule for variance applied to \(\bar{X}_n = \tfrac{1}{n}\sum_{i=1}^n X_i\), \[ \operatorname{Var}(\bar{X}_n) \;=\; \frac{1}{n^2}\,\operatorname{Var}\!\left(\sum_{i=1}^n X_i\right). \] Expanding the variance of the sum, \[ \operatorname{Var}\!\left(\sum_{i=1}^n X_i\right) \;=\; \mathbb{E}\!\left[\left(\sum_{i=1}^n (X_i - \mu)\right)^{\!2}\right] \;=\; \sum_{i=1}^n \operatorname{Var}(X_i) \;+\; \sum_{i \neq j} \mathbb{E}[(X_i - \mu)(X_j - \mu)]. \] Since the \(X_i\) are independent, independence factors expectations, so for \(i \neq j\), \[ \mathbb{E}[(X_i - \mu)(X_j - \mu)] \;=\; \mathbb{E}[X_i - \mu] \cdot \mathbb{E}[X_j - \mu] \;=\; 0. \] The cross terms vanish, leaving \(\operatorname{Var}(\sum_i X_i) = n\sigma^2\), and hence \(\operatorname{Var}(\bar{X}_n) = \sigma^2/n\).

Next, the deviation bound. Set \(Z = (\bar{X}_n - \mu)^2 \geq 0\). For any \(\epsilon > 0\), \[ \mathbb{E}[Z] \;\geq\; \mathbb{E}\!\left[Z \cdot \mathbb{1}\{Z \geq \epsilon^2\}\right] \;\geq\; \epsilon^2 \cdot P(Z \geq \epsilon^2), \] the first inequality because \(Z \cdot \mathbb{1}\{Z < \epsilon^2\} \geq 0\) and the second because the indicator restricts to the region \(Z \geq \epsilon^2\). Rearranging and using \(\{Z \geq \epsilon^2\} = \{|\bar{X}_n - \mu| \geq \epsilon\}\), \[ P(|\bar{X}_n - \mu| \geq \epsilon) \;\leq\; \frac{\mathbb{E}[(\bar{X}_n - \mu)^2]}{\epsilon^2} \;=\; \frac{\operatorname{Var}(\bar{X}_n)}{\epsilon^2} \;=\; \frac{\sigma^2}{n\epsilon^2} \;\longrightarrow\; 0. \] This is Chebyshev's inequality applied to \(\bar{X}_n - \mu\).
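The bound \(\sigma^2/(n\epsilon^2)\) can be compared with simulated deviation frequencies. The sketch below again assumes Uniform(0, 1) draws (\(\mu = 1/2\), \(\sigma^2 = 1/12\)); the function name `deviation_frequency`, the tolerance \(\epsilon = 0.05\), and the number of trials are hypothetical choices made only for illustration.

```python
import random

def deviation_frequency(n, eps, trials=2_000, seed=1):
    """Estimate P(|mean_n - mu| >= eps) by simulation for Uniform(0, 1) draws."""
    rng = random.Random(seed)
    mu = 0.5
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - mu) >= eps:
            hits += 1
    return hits / trials

sigma2 = 1.0 / 12.0  # variance of Uniform(0, 1)
eps = 0.05
for n in (50, 200, 800):
    bound = sigma2 / (n * eps**2)  # Chebyshev bound from the proof
    print(n, deviation_frequency(n, eps), min(bound, 1.0))
```

In runs of this kind the simulated frequency typically sits well below the bound: Chebyshev's inequality is valid for every finite-variance distribution, and that generality costs sharpness.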

Remark. The finite-variance assumption above is convenient but not necessary. The SLLN guarantees \(\bar{X}_n \xrightarrow{a.s.} \mu\) under the weaker condition \(\mathbb{E}[|X_1|] < \infty\) alone, and almost sure convergence implies convergence in probability (established below). So the WLLN in fact holds whenever the SLLN does — finite variance is needed only for the simple self-contained argument given here.

The third and weakest mode, convergence in distribution, abandons the random variables themselves and tracks only their cumulative distribution functions: \(F_{X_n}(x) \to F_X(x)\) at every continuity point of the limit \(F_X\). The formal definition is given further below; this is the setting of the Central Limit Theorem (CLT), which describes how the standardized fluctuations of \(\bar{X}_n\) around \(\mu\) converge to a normal distribution.

Theorem: Hierarchy of Convergence Modes

The three modes satisfy the following implications: \[ \xrightarrow{a.s.} \;\Longrightarrow\; \xrightarrow{P} \;\Longrightarrow\; \xrightarrow{D}. \] The reverse implications fail in general; however, if the limit \(X\) is a constant (as in the LLN), then \(\xrightarrow{D} \Longrightarrow \xrightarrow{P}\) as well.

Proof:

(\(\xrightarrow{a.s.}\) implies \(\xrightarrow{P}\)). Suppose \(X_n \xrightarrow{a.s.} X\) and fix \(\epsilon > 0\). Set \(Y_n = \mathbb{1}\{|X_n - X| < \epsilon\}\), so \(Y_n \in [0,1]\) for all \(n\). On the convergence set (which has probability \(1\)), each \(\omega\) eventually satisfies \(|X_n(\omega) - X(\omega)| < \epsilon\), so \(Y_n(\omega) \to 1\). Hence \(\liminf_n Y_n = 1\) almost surely, and \(\mathbb{E}[\liminf_n Y_n] = 1\). Applying Fatou's Lemma to the nonnegative sequence \(\{Y_n\}\), \[ 1 \;=\; \mathbb{E}\!\left[\liminf_{n \to \infty} Y_n\right] \;\leq\; \liminf_{n \to \infty} \mathbb{E}[Y_n] \;=\; \liminf_{n \to \infty} P(|X_n - X| < \epsilon). \] Since \(P(|X_n - X| < \epsilon) \leq 1\) for all \(n\), this forces \(P(|X_n - X| < \epsilon) \to 1\), equivalently \(P(|X_n - X| \geq \epsilon) \to 0\).

(\(\xrightarrow{P}\) implies \(\xrightarrow{D}\)). Proved as Convergence in Probability Implies Convergence in Distribution in the section that follows.

(Partial converse: \(\xrightarrow{D} \Longrightarrow \xrightarrow{P}\) when the limit is a constant \(c\)). Suppose \(X_n \xrightarrow{D} c\). The CDF of the constant \(c\) is the step function \(F(x) = \mathbb{1}\{x \geq c\}\), whose only discontinuity is at \(x = c\); every other point of \(\mathbb{R}\) is a continuity point. Fix \(\epsilon > 0\). The points \(c - \epsilon\) and \(c + \epsilon/2\) are both continuity points of \(F\), and \[ \{|X_n - c| \geq \epsilon\} \;=\; \{X_n \leq c - \epsilon\} \cup \{X_n \geq c + \epsilon\} \;\subseteq\; \{X_n \leq c - \epsilon\} \cup \{X_n > c + \epsilon/2\}, \] so \[ P(|X_n - c| \geq \epsilon) \;\leq\; F_{X_n}(c - \epsilon) + \bigl(1 - F_{X_n}(c + \epsilon/2)\bigr) \;\longrightarrow\; F(c - \epsilon) + \bigl(1 - F(c + \epsilon/2)\bigr) \;=\; 0 + (1 - 1) \;=\; 0, \] which is convergence in probability.

Deep Dive: Why so many modes?

Which mode a limit theorem delivers depends on how "well-behaved" the moments of the underlying distribution are. To see why the moment conditions matter, we first look at a case where they fail: the Cauchy distribution.

  • The Pathological Case (Cauchy):
    The Cauchy distribution has such "heavy tails" that its mean is undefined (\(\mathbb{E}[|X|] = \infty\)). Because it lacks a finite mean, the Law of Large Numbers fails completely. The sample average \(\bar{X}_n\) of i.i.d. Cauchy variables does not settle down: it follows exactly the same Cauchy distribution as a single observation, no matter how large \(n\) is.
  • Finite Mean (\(\mathbb{E}[|X|] < \infty\)):
    This is the bare minimum for the distribution to have a "balance point." Under this condition, the Strong Law of Large Numbers (SLLN) guarantees the strongest mode: \(\xrightarrow{a.s.}\).
  • Finite Variance (\(\sigma^2 < \infty\)):
    This ensures the "spread" doesn't explode. While the Weak Law (WLLN) only needs a finite mean to reach \(\xrightarrow{P}\), having a finite variance allows us to use Chebyshev's Inequality for a simple proof and, more importantly, enables the Central Limit Theorem (CLT).

We use \(\xrightarrow{P}\) (via the LLN) to confirm that our estimates concentrate around the true value.
We use \(\xrightarrow{D}\) (via the CLT) to understand the "shape" of the uncertainty that remains.
If you are dealing with Cauchy-like heavy tails (common in finance or network theory), these standard convergence tools may break down.
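The contrast between the Cauchy case and the finite-variance case can be seen in a small simulation. This is a sketch under our own assumptions: standard Cauchy draws are generated by inverting the CDF, standard normal draws serve as the well-behaved comparison, and the interquartile range (IQR) of the sample mean is used as a spread measure because it exists even when moments do not. The helper names are hypothetical.

```python
import math
import random

def cauchy(rng):
    """One standard Cauchy draw via the inverse CDF, tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))

def iqr_of_sample_means(draw, n, trials=1_000, seed=2):
    """Interquartile range of the sample mean of n draws, over repeated experiments."""
    rng = random.Random(seed)
    means = sorted(sum(draw(rng) for _ in range(n)) / n for _ in range(trials))
    return means[3 * trials // 4] - means[trials // 4]

for n in (10, 100, 1_000):
    normal_iqr = iqr_of_sample_means(lambda r: r.gauss(0.0, 1.0), n)
    cauchy_iqr = iqr_of_sample_means(cauchy, n)
    print(n, round(normal_iqr, 3), round(cauchy_iqr, 3))
```

For the normal draws the IQR of the mean should shrink roughly like \(1/\sqrt{n}\). For the Cauchy draws it should stay near \(2\), the IQR of a single standard Cauchy observation, for every \(n\).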

In the following sections, we develop each of the two weaker modes in detail - first convergence in probability with the Continuous Mapping Theorem, then convergence in distribution with a proof that the former implies the latter. Finally, we introduce the moment generating function and use it to give a rigorous proof of the Central Limit Theorem.

Convergence in Probability

We now develop convergence in probability in more detail. Recall that this mode captures the idea that \(X_n\) becomes increasingly unlikely to deviate from \(X\) by any fixed amount. Unlike almost sure convergence, it does not require individual sample paths to converge pointwise - only that "large deviations" become rare events.

Definition: Convergence in Probability

Let \(\{X_n\}\) and \(X\) be random variables on a common probability space \((\Omega, \mathcal{F}, P)\). We say \(X_n\) converges in probability to \(X\), written \(X_n \xrightarrow{P} X\), if for every \(\epsilon > 0\), \[ \lim_{n \to \infty} P(|X_n - X| \geq \epsilon) = 0, \] or equivalently, \[ \lim_{n \to \infty} P(|X_n - X| < \epsilon) = 1. \]

A natural question arises: if \(X_n \xrightarrow{P} X\), what happens when we apply a continuous function to \(X_n\)? The following theorem answers this question and is indispensable for deriving the asymptotic behavior of estimators.

Theorem: Continuous Mapping Theorem

Suppose \(X_n \xrightarrow{P} a\) where \(a\) is a constant, and the function \(f\) is continuous at \(a\). Then \[ f(X_n) \xrightarrow{P} f(a). \]

Proof:

Let \(\epsilon > 0\). Since \(f\) is continuous at \(a\), \(\, \exists \delta > 0 \) such that \[ |x - a| < \delta \Longrightarrow |f(x) - f(a)| < \epsilon. \] Taking the contrapositive, \[ |f(x) - f(a)| \geq \epsilon \Longrightarrow |x - a| \geq \delta. \] Substituting \(X_n\) for \(x\), we obtain \[ P(|f(X_n) - f(a)| \geq \epsilon) \leq P(|X_n - a| \geq \delta). \] As \(n \to \infty\), the right-hand side vanishes by assumption, hence \(f(X_n) \xrightarrow{P} f(a)\).

Why this matters: The Continuous Mapping Theorem allows us to transfer convergence results through transformations. For instance, if sample means converge in probability to \(\mu\), then \(\bar{X}_n^2 \xrightarrow{P} \mu^2\) and \(e^{\bar{X}_n} \xrightarrow{P} e^\mu\). This is essential for deriving the asymptotic behavior of estimators and test statistics.
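The claim \(e^{\bar{X}_n} \xrightarrow{P} e^\mu\) can be illustrated the same way. The following sketch assumes Uniform(0, 1) draws (\(\mu = 0.5\)) and a tolerance \(\epsilon = 0.05\); both, like the function name `exceed_frequency`, are illustrative assumptions rather than anything fixed by the theorem.

```python
import math
import random

def exceed_frequency(n, eps, trials=1_000, seed=3):
    """Estimate P(|exp(mean_n) - exp(mu)| >= eps) for Uniform(0, 1) draws (mu = 0.5)."""
    rng = random.Random(seed)
    target = math.exp(0.5)  # f(mu) with f = exp
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(math.exp(mean) - target) >= eps:
            hits += 1
    return hits / trials

for n in (10, 100, 1_000):
    print(n, exceed_frequency(n, eps=0.05))
```

The printed frequencies should decrease toward zero as \(n\) grows, as the theorem predicts for any \(f\) continuous at \(\mu\).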

Convergence in Distribution

We now turn to the weakest mode of convergence. Unlike convergence in probability, which tracks how the random variables themselves behave, convergence in distribution focuses solely on the cumulative distribution functions (CDFs). Two sequences of random variables can converge to the same limiting distribution even if they are defined on entirely different probability spaces.

Definition: Convergence in Distribution

Let \(\{X_n\}\) be a sequence of random variables with CDFs \(F_{X_n}\), and let \(X\) be a random variable with CDF \(F_X\). We say \(X_n\) converges in distribution to \(X\), written \(X_n \xrightarrow{D} X\), if at every continuity point \(x\) of \(F_X\), \[ \lim_{n \to \infty} F_{X_n}(x) = F_X(x). \] The distribution of \(X\) is often called the asymptotic (limiting) distribution of the sequence \(\{X_n\}\).
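To see why the definition restricts attention to continuity points, a small worked example may help (the particular sequence is our own illustrative choice). Take the deterministic sequence \(X_n = 1/n\) and the constant limit \(X = 0\). Then \[ F_{X_n}(x) = \mathbb{1}\{x \geq 1/n\}, \qquad F_X(x) = \mathbb{1}\{x \geq 0\}. \] For \(x < 0\) both CDFs equal \(0\); for \(x > 0\) we have \(F_{X_n}(x) = 1\) as soon as \(n \geq 1/x\), so \(F_{X_n}(x) \to 1 = F_X(x)\). At the single discontinuity \(x = 0\), however, \[ F_{X_n}(0) = 0 \ \text{ for every } n, \qquad F_X(0) = 1, \] so the CDFs do not converge there. Because \(x = 0\) is not a continuity point of \(F_X\), the definition ignores it, and \(X_n \xrightarrow{D} 0\), as one would hope for a sequence that plainly tends to \(0\).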

Key distinction: Convergence in distribution does not imply convergence in probability. For example, let \(X \sim \mathcal{N}(0,1)\) and define \(X_n = -X\) for all \(n\). Then \(X_n \xrightarrow{D} X\) (since both have the same standard normal distribution), yet \(|X_n - X| = 2|X|\) does not converge to zero in probability.
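This counterexample is easy to check numerically. The sketch below (sample size, seed, and helper name `ecdf` are our own assumptions) draws a standard normal sample, forms \(X_n = -X\), and compares the two empirical CDFs with the frequency of the event \(|X_n - X| \geq 1\).

```python
import random

rng = random.Random(4)
xs = [rng.gauss(0.0, 1.0) for _ in range(100_000)]  # draws of X
neg = [-x for x in xs]                              # X_n = -X for every n

def ecdf(sample, x):
    """Empirical CDF: fraction of the sample that is <= x."""
    return sum(1 for s in sample if s <= x) / len(sample)

# The two empirical CDFs agree up to sampling noise.
for x in (-1.0, 0.0, 1.0):
    print(x, round(ecdf(xs, x), 3), round(ecdf(neg, x), 3))

# P(|X_n - X| >= 1) = P(2|X| >= 1) does not depend on n and stays far from 0.
gap_freq = sum(1 for a, b in zip(xs, neg) if abs(a - b) >= 1.0) / len(xs)
print(gap_freq)
```

The matching CDF columns reflect \(X_n \xrightarrow{D} X\), while `gap_freq` estimates \(P(|X| \geq 1/2) \approx 0.617\), which is the same for every \(n\) and so cannot tend to zero.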

However, the reverse implication does hold:

Theorem: Convergence in Probability Implies Convergence in Distribution

If \(X_n\) converges to \(X\) in probability, then \(X_n\) converges to \(X\) in distribution.

Intuition: If \(X_n\) is increasingly likely to be close to \(X\) (convergence in probability), then the probability mass of \(X_n\) must accumulate where \(X\) places its mass, forcing the CDFs to align.

Proof:

Suppose \(X_n \xrightarrow{P} X\), and let \(x\) be a point of continuity of \(F_X\).

Upper bound.
For any \(\epsilon > 0\), partition the event \(\{X_n \leq x\}\) according to whether \(|X_n - X| < \epsilon\): \[ \begin{align*} F_{X_n}(x) &= P(X_n \leq x) \\\\ &= P(\{X_n \leq x\} \cap \{|X_n - X| < \epsilon\}) + P(\{X_n \leq x\} \cap \{|X_n - X| \geq \epsilon\}) \\\\ &\leq P(X \leq x + \epsilon) + P(|X_n - X| \geq \epsilon) \\\\ &= F_X(x + \epsilon) + P(|X_n - X| \geq \epsilon). \end{align*} \] The first inequality holds because \(\{X_n \leq x\} \cap \{|X_n - X| < \epsilon\}\) implies \(X < x + \epsilon\). Since \(X_n \xrightarrow{P} X\), the second term vanishes, giving \[ \limsup_{n \to \infty} F_{X_n}(x) \leq F_X(x + \epsilon). \tag{1} \]

Lower bound.
Similarly, partition the complementary event \(\{X_n > x\}\): \[ \begin{align*} 1 - F_{X_n}(x) &= P(X_n > x) \\\\ &= P(\{X_n > x\} \cap \{|X_n - X| < \epsilon\}) + P(\{X_n > x\} \cap \{|X_n - X| \geq \epsilon\}) \\\\ &\leq P(X > x - \epsilon) + P(|X_n - X| \geq \epsilon) \\\\ &= 1 - F_X(x - \epsilon) + P(|X_n - X| \geq \epsilon). \end{align*} \] The first inequality holds because \(\{X_n > x\} \cap \{|X_n - X| < \epsilon\}\) implies \(X > X_n - \epsilon > x - \epsilon\). Rearranging, \[ F_{X_n}(x) \geq F_X(x - \epsilon) - P(|X_n - X| \geq \epsilon), \] and since \(X_n \xrightarrow{P} X\) the second term vanishes, giving \[ \liminf_{n \to \infty} F_{X_n}(x) \geq F_X(x - \epsilon). \tag{2} \]

Combining (1) and (2): \[ F_X(x - \epsilon) \leq \liminf_{n \to \infty} F_{X_n}(x) \leq \limsup_{n \to \infty} F_{X_n}(x) \leq F_X(x + \epsilon). \] Since \(x\) is a point of continuity of \(F_X\), letting \(\epsilon \to 0\) gives \(F_X(x - \epsilon) \to F_X(x)\) and \(F_X(x + \epsilon) \to F_X(x)\), so \[ F_X(x) \leq \liminf_{n \to \infty} F_{X_n}(x) \leq \limsup_{n \to \infty} F_{X_n}(x) \leq F_X(x), \] forcing \(\liminf = \limsup = F_X(x)\), hence \(\lim_{n \to \infty} F_{X_n}(x) = F_X(x)\). Since this holds for every continuity point of \(F_X\), we conclude \(X_n \xrightarrow{D} X\).

Bridge: The Analytical Hierarchy in \(L^p\) Spaces

The hierarchy \(\xrightarrow{a.s.} \Longrightarrow \xrightarrow{P} \Longrightarrow \xrightarrow{D}\) has a parallel in the setting of general measure spaces. For measurable functions on \((\Omega, \mathcal{F}, \mu)\), the Relations Between Modes of Convergence catalog the implications — and non-implications — among \(L^p\) convergence, pointwise a.e. convergence, convergence in measure, and uniform a.e. convergence.

Specialization to \(\mu = \mathbb{P}\): On a probability space, a.e. becomes a.s., and convergence in measure becomes convergence in probability (\(\mu(\{|f_n - f| \geq \epsilon\}) \to 0\) is literally the definition of \(\xrightarrow{P}\)). The analytical statement that \(L^p\) convergence implies convergence in measure therefore specializes to: if \(\mathbb{E}[|X_n - X|^p] \to 0\) for some \(p \geq 1\), then \(X_n \xrightarrow{P} X\). Conversely, convergence in probability does not imply \(L^p\) convergence without an integrable domination hypothesis — this is why the CLT, stated as \(\xrightarrow{D}\), is a genuinely weaker statement than any finite-moment version.
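The implication from \(L^p\) convergence to convergence in probability can be checked in one line, reusing the indicator argument from the Chebyshev proof with \(|X_n - X|^p\) in place of \((\bar{X}_n - \mu)^2\) (a sketch, assuming \(\mathbb{E}[|X_n - X|^p]\) is finite): for any \(\epsilon > 0\), \[ P(|X_n - X| \geq \epsilon) \;=\; P(|X_n - X|^p \geq \epsilon^p) \;\leq\; \frac{\mathbb{E}[|X_n - X|^p]}{\epsilon^p} \;\longrightarrow\; 0. \]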

Convergence in distribution \(\xrightarrow{D}\), by contrast, has no direct counterpart in the general \(L^p\) hierarchy: it depends only on the distributions of the \(X_n\), not on the underlying probability space. It is a weaker, distribution-level mode tailored to probability theory.

We can now see that the Central Limit Theorem (CLT) — first stated without proof when we introduced the Gaussian distribution — is precisely a statement about convergence in distribution. To prove the CLT rigorously, we need a tool that characterizes distributions and behaves well under limits: the moment generating function.

Moment Generating Function (MGF)

Verifying convergence in distribution directly - by showing pointwise convergence of CDFs at every continuity point - is often impractical. The moment generating function (MGF) provides a powerful alternative: it transforms the problem of distributional convergence into the simpler problem of pointwise convergence of real-valued functions, via Lévy's continuity theorem (stated below).

Definition: Moment Generating Function (MGF)

Let \(X\) be a random variable such that for some \(h > 0\), the expectation \(\mathbb{E}[e^{tX}]\) is finite for all \(t \in (-h, h)\). The moment generating function of \(X\) is defined by \[ M(t) = \mathbb{E}[e^{tX}], \qquad t \in (-h, h). \]

Why "moment generating"?: Expanding \(e^{tX}\) as a power series and taking expectations term-by-term: \[ M(t) = \mathbb{E}\!\left[\sum_{k=0}^{\infty} \frac{(tX)^k}{k!}\right] = \sum_{k=0}^{\infty} \frac{t^k}{k!} \mathbb{E}[X^k]. \] The interchange of expectation and infinite sum is justified within the radius of convergence by dominated convergence applied to the partial sums: the dominating variable is \(\sum_{k=0}^\infty |tX|^k/k! = e^{|tX|}\), whose expectation is finite when \(M\) exists in a neighborhood of \(0\) (since \(e^{|tX|} \leq e^{tX} + e^{-tX}\)). Thus, the \(k\)-th derivative at zero gives the \(k\)-th moment: \[ M^{(k)}(0) = \mathbb{E}[X^k] \] (by termwise differentiation of the power series, valid within its radius of convergence).
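As a concrete check of the moment-extraction formula, consider the Exponential(1) distribution, whose MGF is \(M(t) = 1/(1 - t)\) for \(t < 1\) and whose moments are \(\mathbb{E}[X^k] = k!\). The choice of distribution, the step size, and the function names below are illustrative assumptions; the sketch approximates \(M'(0)\) and \(M''(0)\) by central differences.

```python
def mgf_exponential(t):
    """MGF of the Exponential(1) distribution (assumed example), finite for t < 1."""
    return 1.0 / (1.0 - t)

def first_derivative_at_zero(M, h=1e-4):
    """Central-difference estimate of M'(0)."""
    return (M(h) - M(-h)) / (2.0 * h)

def second_derivative_at_zero(M, h=1e-4):
    """Central-difference estimate of M''(0)."""
    return (M(h) - 2.0 * M(0.0) + M(-h)) / (h * h)

print(first_derivative_at_zero(mgf_exponential))   # close to E[X]   = 1
print(second_derivative_at_zero(mgf_exponential))  # close to E[X^2] = 2
```

Both estimates need \(M\) at points on either side of \(0\), which is one practical reason the definition asks for an open interval around zero.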

Why require an open interval around 0? The MGF must exist in a neighborhood of zero (not just at \(t = 0\)) for two reasons. First, the moment-extraction formula \(M^{(k)}(0) = \mathbb{E}[X^k]\) requires \(M\) to be differentiable at \(0\), which forces existence on more than a single point. Second, our convergence strategy below — pointwise convergence of MGFs on an interval implies convergence in distribution — needs the MGFs defined on a common interval to compare them at all. If the MGF exists only at \(t = 0\), then \(M(0) = 1\) trivially and no useful information is conveyed.

The key theorem connecting MGFs to convergence in distribution is:

Theorem: Lévy's Continuity Theorem - MGF Version

Let \(\{X_n\}\) be a sequence of random variables, each with an MGF \(M_n\) defined on a common open interval \((-h, h)\) for some \(h > 0\). Suppose there exist \(\delta \in (0, h]\) and a random variable \(X\) with MGF \(M\) defined on \((-\delta, \delta)\), such that \[ M_n(t) \to M(t) \quad \text{for every } t \in (-\delta, \delta). \] Then \[ X_n \xrightarrow{D} X. \]

A rigorous proof lies beyond the scope of this introductory development; we accept the result here and use it as the bridge from MGF convergence to distributional convergence in the proof of the Central Limit Theorem below.

This theorem transforms the problem: instead of comparing CDFs pointwise, we show MGFs converge. We now apply this strategy to prove the Central Limit Theorem.

Theorem: Central Limit Theorem (CLT)

Let \(X_1, X_2, \ldots\) be i.i.d. random variables with mean \(\mathbb{E}[X_1] = \mu\) and variance \(\operatorname{Var}(X_1) = \sigma^2 \in (0, \infty)\). Assume the common MGF \(M(t) = \mathbb{E}[e^{tX_1}]\) exists on an open interval \((-h, h)\) for some \(h > 0\). Then the standardized sum \[ Y_n = \frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma \sqrt{n}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \] converges in distribution to a standard normal: \[ Y_n \xrightarrow{D} \mathcal{N}(0, 1). \]

Proof:

Step 1: MGF of centered variable.
Throughout this proof, write \(X\) for a representative copy of the common distribution (so \(\mathbb{E}[X] = \mu\), \(\operatorname{Var}(X) = \sigma^2\), and \(M(t) = \mathbb{E}[e^{tX}]\) is the common MGF). Define \(W = X - \mu\). The MGF of \(W\) is: \[ m(t) = \mathbb{E}[e^{t(X-\mu)}] = e^{-\mu t} M(t), \qquad t \in (-h, h). \] Applying the moment-extraction property \(m^{(k)}(0) = \mathbb{E}[W^k]\) (established above for MGFs in their domain of analyticity), together with the linearity of expectation and the definition of variance: \[ m(0) = 1, \quad m'(0) = \mathbb{E}[X - \mu] = \mathbb{E}[X] - \mu = 0, \quad m''(0) = \mathbb{E}[(X-\mu)^2] = \operatorname{Var}(X) = \sigma^2. \]

Step 2: Taylor expansion.
By Taylor's theorem with Lagrange remainder (a standard result of one-variable calculus, applicable here because the MGF \(m\) is infinitely differentiable on \((-h, h)\) — its power series \(\sum_{k=0}^\infty t^k \mathbb{E}[W^k]/k!\) converges on this interval), for any \(t \in (-h, h)\) there exists \(\xi\) between \(0\) and \(t\) such that: \[ m(t) = m(0) + m'(0)t + \frac{m''(\xi)}{2}t^2 = 1 + \frac{m''(\xi)}{2}t^2. \] We rewrite this as: \[ m(t) = 1 + \frac{\sigma^2 t^2}{2} + \frac{[m''(\xi) - \sigma^2]t^2}{2}. \tag{3} \]

Step 3: MGF of the standardized sum.
Writing the exponential of a sum as a product of exponentials, \[ M_n(t) = \mathbb{E}\!\left[\exp\!\left(t \cdot \frac{\sum_{i=1}^{n}(X_i - \mu)}{\sigma\sqrt{n}}\right)\right] = \mathbb{E}\!\left[\prod_{i=1}^n \exp\!\left(\frac{t(X_i - \mu)}{\sigma\sqrt{n}}\right)\right]. \] Since the \(X_i\) are independent, so are the functions \(\exp(t(X_i - \mu)/(\sigma\sqrt{n}))\) of them. By the product rule for expectations of independent variables, the expectation factors: \[ M_n(t) = \prod_{i=1}^n \mathbb{E}\!\left[\exp\!\left(\frac{t(X_i - \mu)}{\sigma\sqrt{n}}\right)\right]. \] Each factor equals \(m(t/(\sigma\sqrt{n}))\) by the i.i.d. assumption (the \(X_i\) all share the common centered MGF \(m\)), so \[ M_n(t) = \left[m\!\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n. \]

Step 4: Substitute and take limit.
Replacing \(t\) with \(\frac{t}{\sigma\sqrt{n}}\) in equation (3): \[ m\left(\frac{t}{\sigma\sqrt{n}}\right) = 1 + \frac{t^2}{2n} + \frac{[m''(\xi_n) - \sigma^2]t^2}{2n\sigma^2} \] where \(\xi_n\) lies between \(0\) and \(t/(\sigma\sqrt{n})\) (the precise sign depending on that of \(t\)). Thus: \[ M_n(t) = \left\{1 + \frac{t^2}{2n} + \frac{[m''(\xi_n) - \sigma^2]t^2}{2n\sigma^2}\right\}^n. \] As \(n \to \infty\), we have \(|\xi_n| \leq |t|/(\sigma\sqrt{n}) \to 0\), so \(\xi_n \to 0\). By continuity of \(m''\) at \(0\) (which follows from the smoothness of \(m\) on \((-h, h)\) established in Step 2): \[ m''(\xi_n) \to m''(0) = \sigma^2 \Longrightarrow m''(\xi_n) - \sigma^2 \to 0. \] To close the limit, we use the following generalization of the classical exponential limit \((1 + a_n/n)^n \to e^a\) (valid for any sequence \(a_n \to a\), a standard result of one-variable calculus): if \(c_n \to c\) and \(\epsilon_n \to 0\), then \(\left(1 + \frac{c_n}{n} + \frac{\epsilon_n}{n}\right)^n \to e^c\). (Proof: take logarithms and use the Taylor expansion \(\ln(1 + x) = x + O(x^2)\) for small \(x\), giving \(n \ln(1 + (c_n + \epsilon_n)/n) = (c_n + \epsilon_n) + O((c_n + \epsilon_n)^2/n) \to c\); then exponentiate.) Applying this with \(c_n \equiv \frac{t^2}{2}\) and \(\epsilon_n = \frac{[m''(\xi_n) - \sigma^2]t^2}{2\sigma^2} \to 0\): \[ \lim_{n \to \infty} M_n(t) = e^{t^2/2}. \]

Step 5: Identify the limit.
The function \(e^{t^2/2}\) is the MGF of \(\mathcal{N}(0,1)\) (verified below). To apply Lévy's Continuity Theorem (MGF Version), note that each \(M_n\) is defined on the common open interval \((-h\sigma, h\sigma)\) (since \(M_n(t) = [m(t/(\sigma\sqrt{n}))]^n\) requires \(t/(\sigma\sqrt{n}) \in (-h, h)\), and \(\sqrt{n} \geq 1\)), the limit MGF \(M(t) = e^{t^2/2}\) is defined on all of \(\mathbb{R}\), and Step 4 established \(M_n(t) \to e^{t^2/2}\) for every \(t\) in this interval (and indeed for every \(t \in \mathbb{R}\), once \(n\) is large enough that \(|t|/(\sigma\sqrt{n}) < h\)). Taking \(\delta = h\sigma\) in the theorem yields: \[ Y_n = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{D} \mathcal{N}(0,1). \]
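The limit in Step 4 can be watched numerically for a concrete distribution. As an assumed example, take \(X \sim\) Exponential(1), for which \(\mu = \sigma = 1\), \(M(t) = 1/(1-t)\) on \((-1, 1)\), and hence \(m(s) = e^{-s}/(1 - s)\). The function names and the evaluation point \(t = 0.5\) are hypothetical choices for this sketch.

```python
import math

def centered_mgf(s):
    """m(s) = E[exp(s (X - mu))] for X ~ Exponential(1), mu = sigma = 1; needs s < 1."""
    return math.exp(-s) / (1.0 - s)

def standardized_sum_mgf(t, n):
    """M_n(t) = [m(t / (sigma * sqrt(n)))]^n with sigma = 1."""
    return centered_mgf(t / math.sqrt(n)) ** n

t = 0.5
limit = math.exp(t * t / 2.0)  # MGF of N(0, 1) at t
for n in (1, 10, 100, 10_000):
    print(n, standardized_sum_mgf(t, n), limit)
```

The middle column should approach the last as \(n\) grows; for this example the gap shrinks at rate roughly \(1/\sqrt{n}\), reflecting the skewness of the exponential distribution.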

Verification: MGF of \(\mathcal{N}(0,1)\)

We confirm that \(e^{t^2/2}\) is the MGF of the standard normal distribution \(\mathcal{N}(0,1)\), whose density is \(\frac{1}{\sqrt{2\pi}} e^{-x^2/2}\) (see the definition of the normal distribution): \[ \begin{align*} M(t) &= \int_{-\infty}^{\infty} e^{tx} \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx \\\\ &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(x^2 - 2tx)/2} \, dx \\\\ &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-[(x-t)^2 - t^2]/2} \, dx \\\\ &= e^{t^2/2} \underbrace{\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-(x-t)^2/2} \, dx}_{= 1} \\\\ &= e^{t^2/2}. \end{align*} \] The key step is completing the square in the exponent; the remaining integral equals \(1\) because its integrand is the density of \(\mathcal{N}(t, 1)\), which (by the same definition) integrates to \(1\) over \(\mathbb{R}\).
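The closed form can also be sanity-checked by numerical integration. The sketch below uses a plain midpoint rule on the truncated range \([-12, 12]\); the truncation, the step count, and the function name `normal_mgf_numeric` are our own assumptions, chosen so that the neglected tails are negligible for the values of \(t\) shown.

```python
import math

def normal_mgf_numeric(t, lo=-12.0, hi=12.0, steps=24_000):
    """Midpoint-rule approximation of the integral of exp(t x) * phi(x) over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += math.exp(t * x - 0.5 * x * x)  # exp(tx) * exp(-x^2 / 2)
    return total * dx / math.sqrt(2.0 * math.pi)

for t in (0.0, 0.5, 1.0, 2.0):
    print(t, normal_mgf_numeric(t), math.exp(t * t / 2.0))
```

The two printed columns should agree to many decimal places for each \(t\).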

Without standardization, the CLT can equivalently be written as \[ \sqrt{n}(\bar{X}_n - \mu) \xrightarrow{D} \mathcal{N}(0, \sigma^2), \] (by the scaling property of \(\xrightarrow{D}\) under multiplication by the constant \(\sigma\)), which makes the role of \(\sigma^2\) explicit: the fluctuations of \(\bar{X}_n\) around \(\mu\) have order \(\sigma / \sqrt{n}\), and once rescaled by \(\sqrt{n}\), they converge to a Gaussian with variance \(\sigma^2\).
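A simulation makes the theorem tangible. The following sketch assumes Exponential(1) summands (\(\mu = \sigma = 1\), a deliberately skewed choice), \(n = 200\), and \(5{,}000\) repetitions; these settings and the helper names are illustrative. It compares the empirical CDF of \(Y_n\) with the standard normal CDF \(\Phi\).

```python
import math
import random

def standard_normal_cdf(x):
    """Phi(x), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def standardized_mean(n, rng):
    """Y_n = sqrt(n) * (mean_n - mu) / sigma for Exponential(1) draws (mu = sigma = 1)."""
    mean = sum(rng.expovariate(1.0) for _ in range(n)) / n
    return math.sqrt(n) * (mean - 1.0)

rng = random.Random(5)
n, trials = 200, 5_000
ys = [standardized_mean(n, rng) for _ in range(trials)]
for x in (-1.0, 0.0, 1.0):
    empirical = sum(1 for y in ys if y <= x) / trials
    print(x, round(empirical, 3), round(standard_normal_cdf(x), 3))
```

The empirical and normal values should be close but not identical: part of the gap is sampling noise, and part is the residual skewness of the exponential summands at finite \(n\).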

Remark on hypotheses. The assumption that \(M(t)\) exists in a neighborhood of zero is convenient for our proof, but is not necessary for the conclusion. A more general CLT requires only finite variance; this stronger version will be developed in a forthcoming dedicated treatment. In the present section we traded generality for a transparent derivation.

Connections to Machine Learning

The convergence results developed in this section are not merely theoretical. The Law of Large Numbers justifies estimating expectations by sample averages — the operating principle behind training neural networks on finite datasets. (Whether the minimizer of the sample-average loss converges to the minimizer of the true expected loss is a more subtle question requiring uniform laws of large numbers, which lie outside the present scope.) The CLT explains why confidence intervals and hypothesis tests (from our development of statistical inference and hypothesis testing) work for large samples: under finite variance, test statistics built from sample averages are asymptotically normal, even when the underlying distribution is non-Gaussian. In deep learning, the Continuous Mapping Theorem ensures that if parameter estimates converge in probability, then for any fixed input the model's prediction (as a continuous function of parameters) converges in probability as well.

We now have the tools to reason rigorously about parameter estimation and inference. In our development of Bayesian statistics, we adopt a fundamentally different perspective on inference - treating parameters themselves as random variables with prior distributions - where the convergence results developed here form the foundation of the large-sample analyses that connect the two perspectives.