Limit Theorems & Product Measures


From Translation to Tools

In Measure-Theoretic Probability, we translated the basic vocabulary of probability into the language of measure theory: random variables became measurable functions, distributions became pushforward measures, and expectations became Lebesgue integrals. That translation unified discrete and continuous probability under a single integral \(\int g \, dP_X\).

This chapter completes the measure-theoretic foundation by putting the translated machinery to work. First: when can we interchange limits and expectations? The Monotone Convergence Theorem, Fatou's Lemma, and the Dominated Convergence Theorem — proved for abstract measure spaces in \(L^p\) Spaces — now receive their probabilistic names and applications. Second: what does independence really mean? The factorization condition \(P(A \cap B) = P(A)P(B)\) from Basic Probability is elevated to independence of \(\sigma\)-algebras, formalized via product measures and Fubini's theorem — the construction that gives "i.i.d. sampling" its mathematical content.

Convergence Theorems for Probability

A recurring challenge in probability and machine learning is the interchange of limits and expectations: \[ \lim_{n \to \infty} \mathbb{E}[X_n] \;\stackrel{?}{=}\; \mathbb{E}\!\left[\lim_{n \to \infty} X_n\right]. \] When does this hold? The answer is not always — and the conditions under which it does are precisely the subject of the convergence theorems from Lebesgue integration.

In \(L^p\) Spaces, we stated and used the Monotone Convergence Theorem (MCT), Fatou's Lemma, and the Dominated Convergence Theorem (DCT) to prove the completeness of \(L^p\). There, the context was abstract measure spaces and the Riesz-Fischer theorem. Here, we give these same theorems their probabilistic names and interpretations: the measure becomes \(\mathbb{P}\), the measurable functions become random variables, the integrals become expectations, and "a.e." becomes "a.s."

The Translation Table

Before stating the theorems, we make the correspondence explicit. Every statement in the left column is literally the same mathematical assertion as the corresponding statement in the right column — only the vocabulary changes.

Measure Theory ↔ Probability Dictionary

Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space.

Measure Theory (Section II) ↔ Probability (Section III)

Measurable function \(f\) ↔ Random variable \(X\)
\(\int_\Omega f \, d\mu\) ↔ \(\mathbb{E}[X]\)
Almost everywhere (a.e.) ↔ Almost surely (a.s.)
Convergence in measure ↔ Convergence in probability
\(f \in L^p(\mu)\) ↔ \(X \in L^p(\mathbb{P})\)
Dominating function \(h \in L^1(\mu)\) ↔ Integrable bound: \(\mathbb{E}[|Y|] < \infty\) with \(|X_n| \leq Y\) a.s.

In Convergence in \(L^p\), we developed the hierarchy of convergence modes for measurable functions and noted that on probability spaces, a.e. convergence becomes a.s. convergence and convergence in measure becomes convergence in probability. In Convergence, we introduced these probabilistic modes — a.s., in probability, in distribution — and established their hierarchy. We now complete the picture by showing how the three great limit theorems of Lebesgue integration operate within this probabilistic framework.

The Three Limit Theorems in Probability

We state each theorem in probabilistic language, with explicit reference to its analytical counterpart. The mathematical content is identical — only the framing changes.

Theorem: Monotone Convergence Theorem (Probabilistic Form)

Let \((X_n)\) be a sequence of random variables satisfying \(0 \leq X_1 \leq X_2 \leq \cdots\) a.s., and let \(X = \lim_{n \to \infty} X_n\) (which exists in \([0, \infty]\) a.s. by monotonicity). Then \[ \lim_{n \to \infty} \mathbb{E}[X_n] \;=\; \mathbb{E}[X]. \]

Analytical form: MCT in \(L^p\) Spaces, with \(\mu = \mathbb{P}\).

The MCT requires two conditions: nonnegativity and monotone increase. Under these conditions, expectations and limits commute without any integrability assumption — the limit \(\mathbb{E}[X]\) may be \(+\infty\), and the theorem still holds as an equality.

Example: Computing an infinite-series expectation.

Let \(Y_1, Y_2, \ldots\) be nonnegative random variables. Define \(X_n = \sum_{k=1}^n Y_k\). Then \((X_n)\) is nonnegative and increasing, so by the MCT: \[ \mathbb{E}\!\left[\sum_{k=1}^\infty Y_k\right] \;=\; \lim_{n \to \infty} \mathbb{E}\!\left[\sum_{k=1}^n Y_k\right] \;=\; \sum_{k=1}^\infty \mathbb{E}[Y_k]. \] This justifies interchanging summation and expectation for nonnegative terms — a step used constantly in probability (e.g., computing expectations of counting random variables).
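The interchange can be checked numerically. The following minimal Monte Carlo sketch uses one concrete (and entirely illustrative) choice, \(Y_k = U_k / 2^k\) with \(U_k \sim \text{Uniform}(0,1)\) i.i.d., so that \(\sum_k \mathbb{E}[Y_k] = \tfrac12\); the seed, sample size, and truncation level are assumptions of the sketch, not part of the theorem.

```python
import random

random.seed(0)
N = 100_000   # Monte Carlo sample size (illustrative)
K = 20        # series truncation; the tail beyond k = 20 contributes < 2**-20

# Hypothetical concrete case: Y_k = U_k / 2**k, so E[Y_k] = (1/2) * 2**(-k).
def sample_series():
    """One draw of the (truncated) random series sum_k Y_k."""
    return sum(random.random() / 2**k for k in range(1, K + 1))

lhs = sum(sample_series() for _ in range(N)) / N   # E[ sum_k Y_k ], estimated
rhs = sum(0.5 / 2**k for k in range(1, K + 1))     # sum_k E[Y_k], exact (= 1/2 up to truncation)

print(lhs, rhs)   # both close to 0.5
```

The two sides agree within Monte Carlo error, as the MCT guarantees they must exactly.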

Theorem: Fatou's Lemma (Probabilistic Form)

Let \((X_n)\) be a sequence of random variables satisfying \(X_n \geq 0\) a.s. Then \[ \mathbb{E}\!\left[\liminf_{n \to \infty} X_n\right] \;\leq\; \liminf_{n \to \infty} \mathbb{E}[X_n]. \]

Analytical form: Fatou's Lemma in \(L^p\) Spaces, with \(\mu = \mathbb{P}\).

Fatou's lemma is the safety net: when we cannot verify monotonicity (needed for MCT) or find a dominating variable (needed for DCT), Fatou still gives an inequality. The inequality can be strict — probability mass can "escape to infinity" in the limit.

Example: Mass escape.

Let \(X_n = n \cdot \mathbf{1}_{\{U \leq 1/n\}}\), where \(U \sim \text{Uniform}(0,1)\). Then \(\mathbb{E}[X_n] = n \cdot (1/n) = 1\) for all \(n\). But \(X_n \to 0\) a.s. (for any fixed \(\omega\) with \(U(\omega) > 0\), eventually \(1/n < U(\omega)\) and \(X_n(\omega) = 0\)). Thus \(\mathbb{E}[\lim X_n] = 0 < 1 = \lim \mathbb{E}[X_n]\). Fatou's inequality is strict here: the mass of \(X_n\) concentrates on an ever-shrinking set, and is lost in the limit.
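The mass-escape mechanism can be observed directly. This sketch evaluates \(\mathbb{E}[X_n]\) exactly and traces one sample path; the seed and the particular values of \(n\) are arbitrary choices.

```python
import random

random.seed(1)

def X(n, u):
    """X_n(omega) = n * 1{U(omega) <= 1/n}, evaluated at a point with U(omega) = u."""
    return n if u <= 1 / n else 0.0

# Expectations are exact: E[X_n] = n * P(U <= 1/n) = n * (1/n) = 1 for every n.
expectations = [n * (1 / n) for n in (1, 10, 100, 10_000)]

# Pointwise limit: for any fixed u > 0, X_n(u) = 0 as soon as n > 1/u.
u = random.random()   # one sample point omega; U(omega) > 0 almost surely
tail = [X(n, u) for n in range(int(1 / u) + 1, int(1 / u) + 6)]

print(expectations)   # all equal to 1
print(tail)           # all zero: the sample path has already converged to 0
```

Every \(\mathbb{E}[X_n]\) equals \(1\) while every sample path is eventually \(0\): the expectations do not follow the a.s. limit.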

Theorem: Dominated Convergence Theorem (Probabilistic Form)

Let \((X_n)\) be a sequence of random variables converging almost surely to a random variable \(X\). Suppose there exists an integrable dominating variable \(Y\) with \(\mathbb{E}[|Y|] < \infty\) and \(|X_n| \leq Y\) a.s. for all \(n\). Then \(\mathbb{E}[|X|] < \infty\) and \[ \lim_{n \to \infty} \mathbb{E}[X_n] \;=\; \mathbb{E}[X]. \]

Analytical form: DCT in \(L^p\) Spaces, with \(\mu = \mathbb{P}\).

The DCT is the most commonly applied of the three in practice. It guarantees the full equality \(\lim \mathbb{E} = \mathbb{E}[\lim]\) at the cost of requiring an integrable dominating variable. The domination condition \(|X_n| \leq Y\) with \(\mathbb{E}[|Y|] < \infty\) prevents the mass-escape phenomenon seen in the Fatou example above.

A Decision Guide: Which Theorem When?

When faced with the need to interchange \(\lim\) and \(\mathbb{E}\), the choice of theorem depends on what we know about the sequence:

Guide: Interchanging Limits and Expectations

Given a sequence \((X_n)\) with \(X_n \to X\) a.s.:

  1. If \(X_n \geq 0\) and \(X_n \uparrow X\) (nonnegative and increasing): use MCT. No integrability condition needed; the limit may be infinite.
  2. If \(X_n \geq 0\) (nonnegative but not necessarily monotone): use Fatou. Yields only \(\mathbb{E}[X] \leq \liminf \mathbb{E}[X_n]\), not equality.
  3. If \(|X_n| \leq Y\) a.s. with \(\mathbb{E}[|Y|] < \infty\) (dominated by an integrable variable): use DCT. Full equality; the most versatile tool.

The three theorems above provide sufficient conditions for interchanging limits and expectations. A deeper theory — uniform integrability and the associated Vitali convergence theorem — characterizes the interchange as a necessary and sufficient condition: \(X_n \to X\) in probability and \((X_n)\) is uniformly integrable if and only if \(X_n \to X\) in \(L^1\). Domination by an integrable variable (the DCT hypothesis) is one way to guarantee uniform integrability, but not the only way. This more refined criterion is essential for martingale theory and will arise naturally if that path is pursued.

Connecting to the Convergence Hierarchy

In Convergence, we established the hierarchy of probabilistic convergence modes: \[ X_n \xrightarrow{\text{a.s.}} X \;\Longrightarrow\; X_n \xrightarrow{P} X \;\Longrightarrow\; X_n \xrightarrow{D} X. \] In Convergence in \(L^p\), we proved the analytical counterpart: \(L^p\) convergence implies convergence in measure (= convergence in probability), and a.e. convergence with domination implies \(L^p\) convergence (the \(L^p\)-DCT).

The three limit theorems above add a new dimension to this picture: they tell us when convergence of random variables (any of the modes) can be upgraded to convergence of expectations. The key insight is:

Principle: Convergence of Variables vs. Convergence of Expectations

Convergence of random variables (a.s., in probability, in distribution) does not automatically imply convergence of their expectations. Additional conditions — monotonicity (MCT), nonnegativity (Fatou), or domination (DCT) — are required to pass the limit inside \(\mathbb{E}\). Without such conditions, expectations can diverge even when the variables themselves converge.

The Fatou example above illustrates this: \(X_n \to 0\) a.s. (the strongest mode of convergence), yet \(\mathbb{E}[X_n] = 1 \not\to 0 = \mathbb{E}[X]\). The limit theorems are precisely the tools that bridge this gap.

Connection to Machine Learning: Gradient-Expectation Interchange

In stochastic gradient descent (SGD), we compute the gradient of an expected loss: \[ \nabla_{\boldsymbol{\theta}} \, \mathbb{E}_{X \sim P}\!\bigl[\ell(X, \boldsymbol{\theta})\bigr] \;\stackrel{?}{=}\; \mathbb{E}_{X \sim P}\!\bigl[\nabla_{\boldsymbol{\theta}} \, \ell(X, \boldsymbol{\theta})\bigr]. \] The left side is the true gradient of the risk; the right side is the expectation of sample gradients — the quantity SGD approximates via mini-batches. The interchange is justified by the DCT: if we can find an integrable function \(Y(X)\) such that \(|\partial \ell / \partial \theta_j| \leq Y(X)\) for all \(\boldsymbol{\theta}\) in a neighborhood, then the dominated convergence theorem (applied to difference quotients) permits differentiation under the integral sign.
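A minimal numerical sketch of this interchange, for a hypothetical squared loss \(\ell(x, \theta) = (\theta - x)^2\) with Gaussian inputs (the distribution, seed, and sample size are illustrative assumptions): the sample gradient \(2(\theta - x)\) is dominated near any fixed \(\theta\) by the integrable variable \(2|\theta| + 2|X| + c\), so the DCT applies, and the expectation of sample gradients should match the gradient of the risk.

```python
import random

random.seed(0)
N = 200_000
theta = 0.7
mu = 0.2   # X ~ N(mu, 1), so the risk is E[(theta - X)^2] = (theta - mu)^2 + 1

# Gradient of the risk (differentiating the closed form): 2 * (theta - mu).
true_grad = 2 * (theta - mu)

# Expectation of the sample gradients 2 * (theta - x), estimated by Monte Carlo.
xs = [random.gauss(mu, 1.0) for _ in range(N)]
mc_grad = sum(2 * (theta - x) for x in xs) / N

print(true_grad, mc_grad)   # both close to 1.0
```

The mini-batch gradient in SGD is exactly such a Monte Carlo estimate of the right-hand side, with a much smaller \(N\).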

This is not a technicality — it is the theoretical foundation of SGD. When the domination condition fails (e.g., with heavy-tailed gradients in large language models), the interchange can break down, leading to the gradient explosion phenomena that motivate gradient clipping.

Independence and Product Measures

In Basic Probability, we defined two events \(A\) and \(B\) to be independent if \(\mathbb{P}(A \cap B) = \mathbb{P}(A)\,\mathbb{P}(B)\). This condition was stated for individual events, without further explanation of why it generalizes correctly to random variables, \(\sigma\)-algebras, or infinite collections. The measure-theoretic framework provides the definitive formulation.

Independence of \(\sigma\)-Algebras

The fundamental definition is not independence of events or of random variables, but independence of \(\sigma\)-algebras. All other notions are derived from it.

Definition: Independence of \(\sigma\)-Algebras

Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space. Sub-\(\sigma\)-algebras \(\mathcal{G}_1, \ldots, \mathcal{G}_n \subseteq \mathcal{F}\) are independent if \[ \mathbb{P}(A_1 \cap \cdots \cap A_n) \;=\; \mathbb{P}(A_1) \cdots \mathbb{P}(A_n) \] for every choice of \(A_i \in \mathcal{G}_i\), \(i = 1, \ldots, n\).

An arbitrary (possibly infinite) family \(\{\mathcal{G}_\alpha\}_{\alpha \in I}\) is independent if for every finite collection of distinct indices \(\alpha_1, \ldots, \alpha_n \in I\), the sub-\(\sigma\)-algebras \(\mathcal{G}_{\alpha_1}, \ldots, \mathcal{G}_{\alpha_n}\) are independent.

Independence of events is the special case where each \(\mathcal{G}_i = \{\emptyset, A_i, A_i^c, \Omega\}\) — the smallest \(\sigma\)-algebra containing \(A_i\). Independence of random variables \(X_1, \ldots, X_n\) means that their generated \(\sigma\)-algebras \(\sigma(X_i) = \{X_i^{-1}(B) : B \in \mathcal{B}(\mathbb{R})\}\) are independent:

Definition: Independence of Random Variables

Random variables \(X_1, \ldots, X_n\) are independent if for all Borel sets \(B_1, \ldots, B_n \in \mathcal{B}(\mathbb{R})\), \[ \mathbb{P}(X_1 \in B_1, \ldots, X_n \in B_n) \;=\; \prod_{i=1}^n \mathbb{P}(X_i \in B_i). \] Equivalently, the joint distribution factors as a product of marginals: \[ P_{(X_1, \ldots, X_n)} \;=\; P_{X_1} \otimes \cdots \otimes P_{X_n}. \]

The second formulation — factoring the joint distribution into a product of marginals — is the form most directly connected to the product measure construction that follows.

A key consequence of independence concerns expectations of products:

Proposition: Expectation of Products

If \(X_1, \ldots, X_n\) are independent random variables with \(\mathbb{E}[|X_i|] < \infty\) for each \(i\), then \[ \mathbb{E}[X_1 \cdots X_n] \;=\; \mathbb{E}[X_1] \cdots \mathbb{E}[X_n]. \]

This was used implicitly throughout Variance and Convergence (for instance, in \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\) for independent \(X, Y\)). The measure-theoretic proof reduces to Fubini's theorem on the product space, which we develop next.
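A Monte Carlo sketch of the product rule, with a dependent case for contrast; the specific distributions (Exponential and Uniform), the seed, and the sample size are arbitrary illustrative choices.

```python
import random

random.seed(2)
N = 200_000

# Independent draws: X ~ Exp(1) with E[X] = 1, Y ~ Uniform(0, 2) with E[Y] = 1.
xs = [random.expovariate(1.0) for _ in range(N)]
ys = [random.uniform(0, 2) for _ in range(N)]

e_xy = sum(x * y for x, y in zip(xs, ys)) / N
e_x = sum(xs) / N
e_y = sum(ys) / N
print(e_xy, e_x * e_y)   # both close to 1: E[XY] = E[X] * E[Y] under independence

# Dependence breaks the factorization: take Y = X, so E[XY] = E[X^2] = 2 for Exp(1).
e_xx = sum(x * x for x in xs) / N
print(e_xx, e_x * e_x)   # ~2 versus ~1
```

The gap in the second comparison is exactly \(\text{Var}(X)\), foreshadowing the variance computation below.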

Product Measures

When we say "\(X_1, \ldots, X_n\) are independent," we are asserting that the joint probability space factors into a product. To make this precise, we need the construction of product measures.

Definition: Product \(\sigma\)-Algebra and Product Measure

Let \((\Omega_1, \mathcal{F}_1, \mu_1)\) and \((\Omega_2, \mathcal{F}_2, \mu_2)\) be \(\sigma\)-finite measure spaces.

The product \(\sigma\)-algebra \(\mathcal{F}_1 \otimes \mathcal{F}_2\) is the \(\sigma\)-algebra on \(\Omega_1 \times \Omega_2\) generated by all measurable rectangles \(A_1 \times A_2\) with \(A_1 \in \mathcal{F}_1\), \(A_2 \in \mathcal{F}_2\).

There exists a unique measure \(\mu_1 \otimes \mu_2\) on \((\Omega_1 \times \Omega_2, \, \mathcal{F}_1 \otimes \mathcal{F}_2)\) satisfying \[ (\mu_1 \otimes \mu_2)(A_1 \times A_2) \;=\; \mu_1(A_1) \cdot \mu_2(A_2) \] for all measurable rectangles. This is the product measure.

The existence and uniqueness of the product measure follow from Carathéodory's extension theorem: the formula on rectangles defines a premeasure on the algebra generated by rectangles, and the extension theorem promotes it to a measure on the full product \(\sigma\)-algebra. The \(\sigma\)-finiteness assumption ensures uniqueness.

For probability spaces, \(\sigma\)-finiteness is automatic (\(\mathbb{P}(\Omega) = 1 < \infty\)), so the product always exists. The construction extends to finite products \(\mu_1 \otimes \cdots \otimes \mu_n\) by induction. For countably infinite products of probability measures, the product construction can be extended by verifying \(\sigma\)-additivity on the algebra of cylinder sets and applying Carathéodory's extension theorem.
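On finite sample spaces the construction is concrete: the product measure is just the outer product of the weight assignments. The following toy sketch (the point masses are made-up values) checks the rectangle formula and the total mass.

```python
from itertools import product

# Two finite probability measures, given as point-mass dictionaries (illustrative values).
mu1 = {"a": 0.3, "b": 0.7}
mu2 = {0: 0.5, 1: 0.25, 2: 0.25}

# The product measure on Omega1 x Omega2 assigns each pair the product of weights.
prod_measure = {(w1, w2): mu1[w1] * mu2[w2] for w1, w2 in product(mu1, mu2)}

def measure(m, A):
    """Measure of a set A, summing the point masses it contains."""
    return sum(p for w, p in m.items() if w in A)

# Rectangle formula: (mu1 x mu2)(A1 x A2) = mu1(A1) * mu2(A2).
A1, A2 = {"a"}, {0, 2}
rect = {(w1, w2) for w1 in A1 for w2 in A2}
lhs = measure(prod_measure, rect)
rhs = measure(mu1, A1) * measure(mu2, A2)
total = measure(prod_measure, set(prod_measure))

print(lhs, rhs, total)   # 0.225, 0.225, 1.0 (up to float rounding)
```

Carathéodory's theorem is what extends this elementary picture from rectangles to the full product \(\sigma\)-algebra in the general \(\sigma\)-finite case.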

Fubini's Theorem and Tonelli's Theorem

The central computational tool for product measures is the ability to evaluate a double integral as an iterated integral — to "integrate out" one variable at a time. This is the content of Fubini's and Tonelli's theorems.

Theorem: Tonelli's Theorem

Let \((\Omega_1, \mathcal{F}_1, \mu_1)\) and \((\Omega_2, \mathcal{F}_2, \mu_2)\) be \(\sigma\)-finite measure spaces, and let \(f : \Omega_1 \times \Omega_2 \to [0, \infty]\) be measurable with respect to \(\mathcal{F}_1 \otimes \mathcal{F}_2\). Then \[ \int_{\Omega_1 \times \Omega_2} f \, d(\mu_1 \otimes \mu_2) \;=\; \int_{\Omega_1} \!\left(\int_{\Omega_2} f(\omega_1, \omega_2) \, d\mu_2(\omega_2)\right) d\mu_1(\omega_1) \;=\; \int_{\Omega_2} \!\left(\int_{\Omega_1} f(\omega_1, \omega_2) \, d\mu_1(\omega_1)\right) d\mu_2(\omega_2). \] In particular, the order of integration may be exchanged freely.

Tonelli's theorem applies to nonnegative measurable functions without any integrability condition — the integrals may be infinite. When the function is not necessarily nonnegative, we need an additional assumption:

Theorem: Fubini's Theorem

Under the same setup as Tonelli's theorem, suppose \(f : \Omega_1 \times \Omega_2 \to \mathbb{R}\) is \((\mathcal{F}_1 \otimes \mathcal{F}_2)\)-measurable and integrable with respect to the product measure: \[ \int_{\Omega_1 \times \Omega_2} |f| \, d(\mu_1 \otimes \mu_2) \;<\; \infty. \] Then the iterated integrals exist and are equal: \[ \int_{\Omega_1 \times \Omega_2} f \, d(\mu_1 \otimes \mu_2) \;=\; \int_{\Omega_1} \!\left(\int_{\Omega_2} f \, d\mu_2 \right) d\mu_1 \;=\; \int_{\Omega_2} \!\left(\int_{\Omega_1} f \, d\mu_1 \right) d\mu_2. \]

The relationship between the two theorems mirrors the MCT-DCT relationship: Tonelli handles nonnegative functions unconditionally; Fubini handles general (sign-changing) functions at the cost of an integrability condition. A standard workflow is: apply Tonelli to \(|f|\) first to verify the integrability condition, then invoke Fubini on \(f\) itself.
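The Tonelli-then-Fubini workflow can be mimicked numerically with midpoint Riemann sums on \([0,1]^2\) under Lebesgue measure; the sign-changing integrand \(f(x,y) = e^x\,(y - \tfrac14)\) and the grid size are illustrative choices, not anything canonical.

```python
import math

n = 400
pts = [(i + 0.5) / n for i in range(n)]   # midpoint grid on [0, 1]

def f(x, y):
    return math.exp(x) * (y - 0.25)   # changes sign in y; bounded, hence integrable

# Step 1 (Tonelli applied to |f|): the double "integral" of |f| is finite,
# so Fubini's hypothesis holds. Exact value: (e - 1) * 0.3125.
int_abs = sum(abs(f(x, y)) for x in pts for y in pts) / n**2

# Step 2 (Fubini): the two iterated integrals agree. Exact value: (e - 1) / 4.
dx_first = sum(sum(f(x, y) for x in pts) / n for y in pts) / n
dy_first = sum(sum(f(x, y) for y in pts) / n for x in pts) / n

print(int_abs, dx_first, dy_first)
```

Both orders of integration return the same number, matching \((e-1)/4\) to discretization accuracy.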

The Mathematical Justification of i.i.d. Sampling

We can now answer a question that has been implicit since Convergence: what does it mean, precisely, for \(X_1, X_2, \ldots, X_n\) to be "i.i.d. with distribution \(P\)"?

Definition: i.i.d. Random Variables

Random variables \(X_1, \ldots, X_n\) on a probability space \((\Omega, \mathcal{F}, \mathbb{P})\) are independent and identically distributed (i.i.d.) with common distribution \(P\) if:

  1. \(X_1, \ldots, X_n\) are independent (in the sense of \(\sigma\)-algebra independence), and
  2. \(P_{X_i} = P\) for each \(i = 1, \ldots, n\) (each has the same pushforward measure).

Condition (2) alone says the variables have the same distribution. Condition (1) says that their joint distribution factors: \[ P_{(X_1, \ldots, X_n)} \;=\; P \otimes \cdots \otimes P \;=\; P^{\otimes n}. \] The product measure \(P^{\otimes n}\) on \((\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))\) is the mathematical model for "drawing \(n\) independent samples from \(P\)." The coordinate projections \(\pi_i : \mathbb{R}^n \to \mathbb{R}\), \(\pi_i(x_1, \ldots, x_n) = x_i\), are the random variables \(X_i\) in this model.

With this formalization, the laws of large numbers gain their full meaning. The Strong Law (stated in Convergence) says: \[ \frac{1}{n} \sum_{i=1}^n \pi_i \;\xrightarrow{a.s.}\; \mathbb{E}_{P}[X] \quad \text{under } P^{\otimes \infty}, \] where \(P^{\otimes \infty}\) is the infinite product measure on \((\mathbb{R}^\infty, \mathcal{B}(\mathbb{R})^{\otimes \infty})\). The existence of this infinite product measure — constructed by verifying \(\sigma\)-additivity on cylinder sets and extending via Carathéodory — is the foundational result that makes the study of infinite sequences of random variables rigorous.
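One sample path under this infinite product model can be sketched by drawing coordinates one at a time; here \(P = \text{Exponential}(1)\), so the running mean should approach \(1\). The seed and checkpoint values of \(n\) are arbitrary.

```python
import random

random.seed(3)

# Simulate one point omega of the product space coordinate by coordinate:
# each draw is one coordinate projection pi_i(omega) under P = Exp(1), E_P = 1.
running = []
total = 0.0
for n in range(1, 100_001):
    total += random.expovariate(1.0)
    if n in (10, 1_000, 100_000):
        running.append(total / n)

print(running)   # running means drifting toward 1
```

The Strong Law asserts that this drift toward \(1\) happens for \(P^{\otimes \infty}\)-almost every such path, not just the one simulated here.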

Example: Fubini justifies the variance formula for sums.

Let \(X, Y\) be independent with \(\mathbb{E}[X^2], \mathbb{E}[Y^2] < \infty\). The claim \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\) requires \(\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]\). Under independence, the joint distribution is \(P_X \otimes P_Y\), so by Fubini: \[ \mathbb{E}[XY] \;=\; \int_{\mathbb{R}^2} x \, y \; d(P_X \otimes P_Y)(x, y) \;=\; \int_{\mathbb{R}} x \, dP_X(x) \;\cdot\; \int_{\mathbb{R}} y \, dP_Y(y) \;=\; \mathbb{E}[X] \cdot \mathbb{E}[Y]. \] Tonelli's theorem ensures the application is valid: applying Tonelli to \(|xy|\) under the product measure gives \(\mathbb{E}[|XY|] = \mathbb{E}[|X|] \cdot \mathbb{E}[|Y|]\). Since \(\mathbb{E}[X^2] < \infty\) implies \(\mathbb{E}[|X|] < \infty\) (and likewise for \(Y\)), the integrability condition for Fubini is confirmed.
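The identity can be checked numerically as well. In this sketch the particular distributions (standard normal and Exponential with rate \(1/2\), so the true variances are \(1\) and \(4\)), the seed, and the sample size are illustrative assumptions.

```python
import random

random.seed(4)
N = 200_000

def pvar(v):
    """Population variance, computed directly."""
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / len(v)

# Independent X ~ N(0, 1) and Y ~ Exp(1/2): Var(X) = 1, Var(Y) = 4.
xs = [random.gauss(0, 1) for _ in range(N)]
ys = [random.expovariate(0.5) for _ in range(N)]

var_sum = pvar([x + y for x, y in zip(xs, ys)])
var_x, var_y = pvar(xs), pvar(ys)

print(var_sum, var_x + var_y)   # both close to 5
```

The agreement reflects exactly the vanishing of \(\mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]\) established by the Fubini argument above.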

Connection to Machine Learning: Why i.i.d. Matters

Nearly every convergence guarantee in machine learning — consistency of MLE, convergence of empirical risk to population risk, validity of cross-validation — assumes that the training data \((x_1, y_1), \ldots, (x_n, y_n)\) are i.i.d. draws from some distribution \(P\).

The product measure \(P^{\otimes n}\) is what makes these guarantees mathematically precise: it defines the probability space on which the sample lives, and Fubini's theorem is the tool that lets us decompose expectations over the full sample into expectations over individual observations. Without the product measure construction, the phrase "assume the data are i.i.d." would be a slogan without mathematical content.

When the i.i.d. assumption fails — as it does for time-series data, reinforcement learning trajectories, or data collected with adaptive sampling — the product measure model breaks down, and the theorems of this section no longer apply directly. Understanding what the assumption buys us is essential to recognizing when we can and cannot rely on it.

Looking Ahead

With this chapter, the measure-theoretic translation of probability is complete. The \(\sigma\)-algebras and Lebesgue integrals of Measure Theory and Lebesgue Integration are now visible as the precise machinery behind the random variables, expectations, and convergence results that have powered this section from the beginning. The product measure construction gives "i.i.d. sampling" its mathematical content, and the three limit theorems tell us exactly when limits and expectations commute.

Several threads lead forward from here, among them the uniform integrability criterion and the martingale theory it supports, both noted earlier in this chapter.

The measure-theoretic viewpoint developed across these two chapters is not a detour from applied probability — it is its foundation. Every time we write \(\mathbb{E}[\,\cdot\,]\), invoke the law of large numbers, or assume data are i.i.d., we are implicitly invoking these constructions. Making them explicit is what turns intuition into proof.