Convergence

Contents:

  • Modes of Convergence & The Law of Large Numbers
  • Convergence in Probability
  • Convergence in Distribution
  • Moment Generating Function (MGF)

Modes of Convergence & The Law of Large Numbers

In practice, we often accept the convergence of probabilities on the strength of experimental results or large-scale simulations. In Part 9 and Part 10, we relied on the idea that sample statistics approximate population parameters as the sample size grows. But what exactly does "approximate" mean for random variables? Unlike deterministic sequences, a sequence of random variables can approach a limit in several distinct senses, each with a different mathematical strength and different practical implications.

We introduce three fundamental modes of convergence, ordered from strongest to weakest, and connect each to a classical result in probability theory.

1. Almost Sure Convergence (\(\xrightarrow{a.s.}\))

A sequence \(\{X_n\}\) converges to \(X\) almost surely if the event where they do not converge has probability zero: \[ P\left( \lim_{n \to \infty} X_n = X \right) = 1. \]

This is the strongest form of convergence. It is the probabilistic analogue of pointwise convergence "almost everywhere" (a.e.) in measure theory: the set of outcomes \(\omega\) for which \(X_n(\omega) \not\to X(\omega)\) exists as a subset of the sample space, but has probability zero. Almost sure convergence underpins the following fundamental result.

Theorem: Strong Law of Large Numbers (SLLN)

Let \(X_1, X_2, \ldots\) be i.i.d. random variables with \(\mathbb{E}[|X_1|] < \infty\) and \(\mathbb{E}[X_1] = \mu\). Then \[ \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{a.s.} \mu. \]

The SLLN guarantees that individual sample paths of \(\bar{X}_n\) converge to \(\mu\), except possibly on a set of probability zero. A weaker notion relaxes this path-level guarantee.
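
A quick simulation illustrates the SLLN along a single sample path: the running mean of i.i.d. coin flips settles toward \(\mu = 0.5\). This is an illustrative sketch using only Python's standard library; the seed and sample sizes are arbitrary choices.

```python
import random

random.seed(42)

# One sample path of i.i.d. Bernoulli(0.5) coin flips.
flips = [random.random() < 0.5 for _ in range(100_000)]

# Report the running sample mean at a few checkpoints.
checkpoints = {100, 1_000, 10_000, 100_000}
total = 0
for n, flip in enumerate(flips, start=1):
    total += flip
    if n in checkpoints:
        print(f"n = {n:>7,}: sample mean = {total / n:.4f}")
```

The printed means drift toward 0.5 as \(n\) grows, which is exactly the path-level behavior the SLLN guarantees (up to a probability-zero exception set).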

2. Convergence in Probability (\( \xrightarrow{P} \))

We say \(X_n\) converges to \(X\) in probability if, for every \(\epsilon > 0\), the probability of the difference exceeding \(\epsilon\) goes to zero: \[ \lim_{n \to \infty} P(|X_n - X| \geq \epsilon) = 0. \]

Convergence in probability says that large deviations from \(X\) become increasingly unlikely, but it does not demand that each individual sample path settles down. This mode of convergence corresponds to the Weak Law.

Theorem: Weak Law of Large Numbers (WLLN)

Let \(X_1, X_2, \ldots\) be i.i.d. random variables with \(\mathbb{E}[X_1] = \mu\) and \(\text{Var}(X_1) = \sigma^2 < \infty\). Then \[ \bar{X}_n \xrightarrow{P} \mu. \]

The WLLN is often proved using Chebyshev's inequality, which requires a finite variance (\(\sigma^2 < \infty\)). Since \(\text{Var}(\bar{X}_n) = \sigma^2/n\), we have \[ P(|\bar{X}_n - \mu| \geq \epsilon) \leq \frac{\sigma^2}{n\epsilon^2} \to 0. \] Note: while the Chebyshev-based proof needs finite variance, Khinchin's theorem shows that the WLLN actually holds under the same condition as the SLLN - only a finite mean (\(\mathbb{E}[|X_1|] < \infty\)) is required.
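
The Chebyshev bound can be checked numerically. The sketch below (standard library only; the Uniform(0, 1) choice, \(\epsilon\), trial count, and seed are all illustrative assumptions) compares the empirical frequency of \(|\bar{X}_n - \mu| \geq \epsilon\) against \(\sigma^2 / (n\epsilon^2)\).

```python
import random

random.seed(0)

n, eps, trials = 100, 0.05, 20_000
mu, var = 0.5, 1 / 12  # mean and variance of Uniform(0, 1)

# Empirical frequency of a deviation of at least eps.
exceed = 0
for _ in range(trials):
    xbar = sum(random.random() for _ in range(n)) / n
    if abs(xbar - mu) >= eps:
        exceed += 1

empirical = exceed / trials
chebyshev = var / (n * eps**2)
print(f"empirical P(|mean - mu| >= eps) ~ {empirical:.4f}")
print(f"Chebyshev bound                 = {chebyshev:.4f}")
```

The bound holds but is quite loose: the empirical frequency is far below \(\sigma^2/(n\epsilon^2)\), which is typical of Chebyshev-type inequalities.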

The weakest mode concerns only the distribution functions, not the random variables themselves.

3. Convergence in Distribution (\( \xrightarrow{D} \))

This is the weakest form, focusing only on the cumulative distribution functions (CDFs). We say \(X_n\) converges to \(X\) in distribution if: \[ \lim_{n \to \infty} F_n(x) = F(x) \] for all points \(x\) where \(F\) is continuous.

Convergence in distribution does not require the random variables to be defined on the same probability space. It is purely a statement about CDFs. This is the setting of the Central Limit Theorem (CLT), which describes how the standardized fluctuations of \(\bar{X}_n\) around \(\mu\) converge to a normal distribution.

Theorem: Hierarchy of Convergence Modes

The three modes satisfy the following implications: \[ \xrightarrow{a.s.} \;\Longrightarrow\; \xrightarrow{P} \;\Longrightarrow\; \xrightarrow{D}. \] The reverse implications do not hold in general. However, if the limit \(X\) is a constant (as in the LLN), then convergence in distribution implies convergence in probability.

Deep Dive: Why so many modes?

The distinction between these modes reflects how "well-behaved" a distribution's moments are. To see why these conditions matter, we first look at the Cauchy distribution.

  • The Pathological Case (Cauchy):
    The Cauchy distribution has such "heavy tails" that its mean is undefined (\(\mathbb{E}[|X|] = \infty\)). Because it lacks a finite mean, the Law of Large Numbers fails completely. The sample average \(\bar{X}_n\) of Cauchy variables does not settle down, and it follows the same Cauchy distribution as a single observation, no matter how large \(n\) is.
  • Finite Mean (\(\mathbb{E}[|X|] < \infty\)):
    This is the bare minimum for the distribution to have a "balance point." Under this condition, the Strong Law of Large Numbers (SLLN) guarantees the strongest mode: \(\xrightarrow{a.s.}\).
  • Finite Variance (\(\sigma^2 < \infty\)):
    This ensures the "spread" doesn't explode. While the Weak Law (WLLN) only needs a finite mean to reach \(\xrightarrow{P}\), having a finite variance allows us to use Chebyshev's Inequality for a simple proof and, more importantly, enables the Central Limit Theorem (CLT).
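
The Cauchy pathology is easy to see in simulation. The minimal standard-library sketch below draws standard Cauchy variates via the inverse-CDF method, \(\tan(\pi(U - \tfrac{1}{2}))\) for \(U \sim \text{Uniform}(0,1)\); the seed and checkpoints are arbitrary choices.

```python
import math
import random

random.seed(7)

def std_cauchy():
    # Standard Cauchy draw via the inverse-CDF method: tan(pi * (U - 1/2)).
    return math.tan(math.pi * (random.random() - 0.5))

total = 0.0
for n in range(1, 100_001):
    total += std_cauchy()
    if n in (10, 1_000, 100_000):
        print(f"n = {n:>7,}: running mean = {total / n:+.3f}")
# Unlike the finite-mean case, the running mean never settles down:
# for every n, the sample mean of standard Cauchy draws is itself standard Cauchy.
```

Compare this with the Bernoulli and Uniform simulations above: there the running mean visibly stabilizes, while here it keeps jumping no matter how large \(n\) gets.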

We use \(\xrightarrow{P}\) for the LLN to confirm our estimates hit the bullseye.
We use \(\xrightarrow{D}\) for the CLT to understand the "shape" of our uncertainty.
If you are dealing with Cauchy-like heavy tails (common in finance or network theory), these standard convergence tools may break down.

In the following sections, we develop each of the two weaker modes in detail - first convergence in probability with the Continuous Mapping Theorem, then convergence in distribution with a proof that the former implies the latter. Finally, we introduce the moment generating function and use it to give a rigorous proof of the Central Limit Theorem.

Convergence in Probability

We now develop convergence in probability in more detail. Recall that this mode captures the idea that \(X_n\) becomes increasingly unlikely to deviate from \(X\) by any fixed amount. Unlike almost sure convergence, it does not require individual sample paths to converge pointwise - only that "large deviations" become rare events.

Definition: Convergence in Probability

Let \(\{X_n\}\) be a sequence of random variables and \(X\) be a random variable on a common probability space. We say \(X_n\) converges in probability to \(X\), written \(X_n \xrightarrow{P} X\), if for every \(\epsilon > 0\), \[ \lim_{n \to \infty} P[|X_n - X| \geq \epsilon] = 0, \] or equivalently, \[ \lim_{n \to \infty} P[|X_n - X| < \epsilon] = 1. \]

A natural question arises: if \(X_n \xrightarrow{P} X\), what happens when we apply a continuous function to \(X_n\)? The following theorem answers this question and is indispensable for deriving the asymptotic behavior of estimators.

Theorem: Continuous Mapping Theorem

Suppose \(X_n \xrightarrow{P} a\) where \(a\) is a constant, and the function \(f\) is continuous at \(a\). Then \[ f(X_n) \xrightarrow{P} f(a). \]

Proof:

Let \(\epsilon > 0\). Since \(f\) is continuous at \(a\), \(\, \exists \delta > 0 \) such that \[ |x - a| < \delta \Longrightarrow |f(x) - f(a)| < \epsilon. \] Taking the contrapositive, \[ |f(x) - f(a)| \geq \epsilon \Longrightarrow |x - a| \geq \delta. \] Substituting \(X_n\) for \(x\), we obtain \[ P[|f(X_n) - f(a)| \geq \epsilon] \leq P [|X_n - a| \geq \delta ]. \] As \(n \to \infty\), the right-hand side vanishes by assumption, hence \(f(X_n) \xrightarrow{P} f(a)\).

More generally, if \(X_n \xrightarrow{P} X\) (where \(X\) is a random variable, not necessarily constant) and \(f\) is continuous, then \[ f(X_n) \xrightarrow{P} f(X). \] The proof follows similar logic but requires a more careful measure-theoretic argument.

Why this matters: The Continuous Mapping Theorem allows us to transfer convergence results through transformations. For instance, if sample means converge in probability to \(\mu\), then \(\bar{X}_n^2 \xrightarrow{P} \mu^2\) and \(e^{\bar{X}_n} \xrightarrow{P} e^\mu\). This is essential for deriving the asymptotic behavior of estimators and test statistics.
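
The sketch below (standard library; the Uniform(0, 1) choice, sample sizes, and seed are illustrative) makes the theorem concrete: since the sample mean of Uniform(0, 1) draws converges in probability to \(\mu = 0.5\), the continuous maps \(x \mapsto x^2\) and \(x \mapsto e^x\) push it toward \(0.25\) and \(e^{0.5}\).

```python
import math
import random

random.seed(1)

mu = 0.5  # Uniform(0, 1) mean, so xbar -> 0.5 in probability
for n in (100, 10_000, 1_000_000):
    xbar = sum(random.random() for _ in range(n)) / n
    print(f"n = {n:>9,}: xbar^2 = {xbar ** 2:.4f} (target {mu ** 2:.4f}), "
          f"exp(xbar) = {math.exp(xbar):.4f} (target {math.exp(mu):.4f})")
```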

Convergence in Distribution

We now turn to the weakest mode of convergence. Unlike convergence in probability, which tracks how the random variables themselves behave, convergence in distribution focuses solely on the cumulative distribution functions (CDFs). Two sequences of random variables can converge to the same limiting distribution even if they are defined on entirely different probability spaces.

Definition: Convergence in Distribution

Let \(\{X_n\}\) be a sequence of random variables with CDFs \(F_{X_n}\), and let \(X\) be a random variable with CDF \(F_X\). We say \(X_n\) converges in distribution to \(X\), written \(X_n \xrightarrow{D} X\), if for every point \(x \in C(F_X)\) (where \(C(F_X)\) is the set of all points where \(F_X\) is continuous), \[ \lim_{n \to \infty} F_{X_n}(x) = F_X(x). \] The distribution of \(X\) is often called the asymptotic (limiting) distribution of the sequence \(\{X_n\}\).

Key distinction: Convergence in distribution does not imply convergence in probability. For example, let \(X \sim N(0,1)\) and define \(X_n = -X\) for all \(n\). Then \(X_n \xrightarrow{D} X\) (since both have the same standard normal distribution), yet \(|X_n - X| = 2|X|\) does not converge to zero in probability.

However, the reverse implication does hold:

Theorem:

If \(X_n\) converges to \(X\) in probability, then \(X_n\) converges to \(X\) in distribution.

Intuition: If \(X_n\) is increasingly likely to be close to \(X\) (convergence in probability), then the probability mass of \(X_n\) must accumulate where \(X\) places its mass, forcing the CDFs to align.

Proof:

Suppose \(X_n \xrightarrow{P} X\), and let \(x\) be a point of continuity of \(F_X\).

Upper bound.
For any \(\epsilon > 0\), partition the event \(\{X_n \leq x\}\) according to whether \(|X_n - X| < \epsilon\): \[ \begin{align*} F_{X_n}(x) &= P[X_n \leq x] \\ &= P[\{X_n \leq x\} \cap \{|X_n - X| < \epsilon\}] + P[\{X_n \leq x\} \cap \{|X_n - X| \geq \epsilon\}] \\ &\leq P[X \leq x + \epsilon] + P[|X_n - X| \geq \epsilon] \\ &= F_X(x + \epsilon) + P[|X_n - X| \geq \epsilon]. \end{align*} \] The first inequality holds because \(\{X_n \leq x\} \cap \{|X_n - X| < \epsilon\}\) implies \(X < x + \epsilon\). Since \(X_n \xrightarrow{P} X\), the second term vanishes, giving \[ \limsup_{n \to \infty} F_{X_n}(x) \leq F_X(x + \epsilon). \tag{1} \]

Lower bound:
Similarly, consider the complementary event: \[ \begin{align*} 1 - F_{X_n}(x) &= P[X_n > x] \\ &\leq P[X > x - \epsilon] + P[|X_n - X| \geq \epsilon] \\ &= 1 - F_X(x - \epsilon) + P[|X_n - X| \geq \epsilon]. \end{align*} \] Rearranging and taking the limit inferior: \[ \liminf_{n \to \infty} F_{X_n}(x) \geq F_X(x - \epsilon). \tag{2} \]

Combining (1) and (2): \[ F_X(x - \epsilon) \leq \liminf_{n \to \infty} F_{X_n}(x) \leq \limsup_{n \to \infty} F_{X_n}(x) \leq F_X(x + \epsilon). \] Since \(x\) is a point of continuity of \(F_X\), letting \(\epsilon \to 0\) gives \(F_X(x - \epsilon) \to F_X(x)\) and \(F_X(x + \epsilon) \to F_X(x)\), so \[ \lim_{n \to \infty} F_{X_n}(x) = F_X(x). \] Since this holds for every continuity point of \(F_X\), we conclude \(X_n \xrightarrow{D} X\).

We can now see that the Central Limit Theorem (CLT) - first stated without proof in Part 4: Gaussian Distribution - is precisely a statement about convergence in distribution. To prove the CLT rigorously, we need a tool that characterizes distributions and behaves well under limits: the moment generating function.

Moment Generating Function (MGF)

Verifying convergence in distribution directly - by showing pointwise convergence of CDFs at every continuity point - is often impractical. The moment generating function (MGF) provides a powerful alternative: it uniquely determines a distribution (when it exists) and converts the problem of distributional convergence into the simpler problem of pointwise convergence of real-valued functions.

Definition: Moment Generating Function (MGF)

Let \(X\) be a random variable such that for some \(h > 0\), the expectation of \(e^{tX}\) exists for \(t \in (-h, h)\). The moment generating function of \(X\) is defined by \[ M(t) = \mathbb{E}(e^{tX}), \qquad t \in (-h, h). \]

Why "moment generating"?: Expanding \(e^{tX}\) as a power series and taking expectations term-by-term: \[ \begin{align*} M(t) = \mathbb{E}\left[\sum_{k=0}^{\infty} \frac{(tX)^k}{k!}\right] = \sum_{k=0}^{\infty} \frac{t^k}{k!} \mathbb{E}(X^k). \end{align*} \] Thus, the \(k\)-th derivative at zero gives the \(k\)-th moment: \[ M^{(k)}(0) = \mathbb{E}(X^k). \]
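
As a sanity check, we can recover moments numerically from a closed-form MGF. For an Exponential(1) random variable, \(M(t) = 1/(1-t)\) for \(t < 1\), and \(M^{(k)}(0) = \mathbb{E}(X^k) = k!\). The finite-difference sketch below (the step size \(h\) is an arbitrary choice) confirms the first two moments.

```python
def mgf_exp1(t):
    # MGF of Exponential(1), valid for t < 1.
    return 1.0 / (1.0 - t)

h = 1e-5
# Central differences approximate M'(0) and M''(0).
m1 = (mgf_exp1(h) - mgf_exp1(-h)) / (2 * h)
m2 = (mgf_exp1(h) - 2 * mgf_exp1(0) + mgf_exp1(-h)) / h ** 2

print(f"M'(0)  ~ {m1:.6f}  (E[X]   = 1)")
print(f"M''(0) ~ {m2:.6f}  (E[X^2] = 2)")
```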

Why require an open interval around 0? The MGF must exist in a neighborhood of zero (not just at \(t = 0\)) to guarantee that it uniquely determines the distribution and that we can differentiate to extract moments. If the MGF exists only at \(t = 0\), it provides no useful information.

The key theorem connecting MGFs to convergence in distribution is:

Theorem: Lévy's Continuity Theorem - MGF Version

Let \(\{X_n\}\) be a sequence of random variables with MGFs \(M_n(t)\) that exist for \(t \in (-h, h)\). If \(M_n(t) \to M(t)\) for all \(t\) in some open interval containing 0, and \(M(t)\) is the MGF of a random variable \(X\), then \[ X_n \xrightarrow{D} X. \]

This theorem transforms the problem: instead of comparing CDFs pointwise, we show MGFs converge. We now apply this strategy to prove the Central Limit Theorem.

Theorem: Central Limit Theorem (CLT)

Let \(X_1, X_2, \ldots, X_n\) be i.i.d. random variables with mean \(\mu\) and variance \(\sigma^2 > 0\). Assume the MGF \(M(t) = \mathbb{E}(e^{tX})\) exists in a neighborhood of zero. Then the standardized sum \[ Y_n = \frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma \sqrt{n}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \] converges in distribution to a standard normal: \[ Y_n \xrightarrow{D} N(0, 1). \]

Proof:

Step 1: MGF of centered variable.
Define \(W = X - \mu\). The MGF of \(W\) is: \[ m(t) = \mathbb{E}[e^{t(X-\mu)}] = e^{-\mu t} M(t), \qquad t \in (-h, h). \] Note the key properties: \[ m(0) = 1, \quad m'(0) = \mathbb{E}(X - \mu) = 0, \quad m''(0) = \mathbb{E}[(X-\mu)^2] = \sigma^2. \]

Step 2: Taylor expansion.
By Taylor's theorem with Lagrange remainder, for any \(t\) in the domain, there exists \(\xi\) between \(0\) and \(t\) such that: \[ m(t) = m(0) + m'(0)t + \frac{m''(\xi)}{2}t^2 = 1 + \frac{m''(\xi)}{2}t^2. \] We rewrite this as: \[ m(t) = 1 + \frac{\sigma^2 t^2}{2} + \frac{[m''(\xi) - \sigma^2]t^2}{2}. \tag{3} \]

Step 3: MGF of the standardized sum.
By independence: \[ M_n(t) = \mathbb{E}\left[\exp\left(t \cdot \frac{\sum_{i=1}^{n}(X_i - \mu)}{\sigma\sqrt{n}}\right)\right] = \left[m\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n. \]

Step 4: Substitute and take limit.
Replacing \(t\) with \(\frac{t}{\sigma\sqrt{n}}\) in equation (3): \[ m\left(\frac{t}{\sigma\sqrt{n}}\right) = 1 + \frac{t^2}{2n} + \frac{[m''(\xi_n) - \sigma^2]t^2}{2n\sigma^2} \] where \(\xi_n\) lies between \(0\) and \(\frac{t}{\sigma\sqrt{n}}\). Thus: \[ M_n(t) = \left\{1 + \frac{t^2}{2n} + \frac{[m''(\xi_n) - \sigma^2]t^2}{2n\sigma^2}\right\}^n. \] As \(n \to \infty\), we have \(\xi_n \to 0\), and by continuity of \(m''(t)\) at \(t = 0\): \[ m''(\xi_n) \to m''(0) = \sigma^2 \implies m''(\xi_n) - \sigma^2 \to 0. \] Using the fundamental limit \(\lim_{n \to \infty}\left(1 + \frac{x}{n}\right)^n = e^x\) with \(x = \frac{t^2}{2}\): \[ \lim_{n \to \infty} M_n(t) = e^{t^2/2}. \]

Step 5: Identify the limit.
The function \(e^{t^2/2}\) is the MGF of \(N(0,1)\). By Lévy's Continuity Theorem: \[ Y_n = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{D} N(0,1). \]

Verification: MGF of \(N(0,1)\)

We confirm that \(e^{t^2/2}\) is the MGF of the standard normal distribution: \[ \begin{align*} M(t) &= \int_{-\infty}^{\infty} e^{tx} \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx \\ &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(x^2 - 2tx)/2} \, dx \\ &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-[(x-t)^2 - t^2]/2} \, dx \\ &= e^{t^2/2} \underbrace{\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-(x-t)^2/2} \, dx}_{= 1} \\ &= e^{t^2/2}. \end{align*} \] The key step is completing the square in the exponent and recognizing the remaining integral as the total probability of \(N(t, 1)\).

Without standardization, the CLT can equivalently be written as \[ \sqrt{n}(\bar{X}_n - \mu) \xrightarrow{D} N(0, \sigma^2), \] which makes the role of \(\sigma^2\) explicit: the fluctuations of \(\bar{X}_n\) around \(\mu\) have order \(\sigma / \sqrt{n}\), and once rescaled by \(\sqrt{n}\), they converge to a Gaussian with variance \(\sigma^2\).
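
To see the CLT in action, the sketch below standardizes sums of i.i.d. Uniform(0, 1) draws (\(\mu = 1/2\), \(\sigma^2 = 1/12\)) and checks that roughly 68% of the standardized sums land within one standard deviation, as \(N(0,1)\) predicts (standard library only; \(n\), the trial count, and the seed are illustrative choices).

```python
import math
import random

random.seed(5)

n, trials = 50, 20_000
mu, sigma = 0.5, math.sqrt(1 / 12)  # Uniform(0, 1) parameters

within_1sd = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    y = (s - n * mu) / (sigma * math.sqrt(n))  # the standardized sum Y_n
    if abs(y) <= 1:
        within_1sd += 1

print(f"P(|Y_n| <= 1) ~ {within_1sd / trials:.3f}  (N(0,1) gives ~0.683)")
```

Note that the summands are uniform, not normal; the near-68% figure is produced entirely by the standardization and the limit theorem.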

Connections to Machine Learning

The convergence results developed in this section are not merely theoretical. The Law of Large Numbers justifies replacing expectations with sample averages, which is the foundation of empirical risk minimization: minimizing the training loss converges to minimizing the true expected loss as the dataset grows. The CLT explains why confidence intervals and hypothesis tests (from Part 10) work: test statistics are asymptotically normal regardless of the underlying distribution. In deep learning, the Continuous Mapping Theorem ensures that if parameter estimates converge, so do the predictions of any continuous model.

We now have the tools to reason rigorously about parameter estimation and inference. In Part 14: Introduction to Bayesian Statistics, we adopt a fundamentally different perspective on inference - treating parameters themselves as random variables with prior distributions - where the convergence results developed here ensure that Bayesian and frequentist approaches agree in the large-sample limit.