Convergence Theorems for Probability
A recurring challenge in probability and machine learning is the
interchange of limits and expectations:
\[
\lim_{n \to \infty} \mathbb{E}[X_n] \;\stackrel{?}{=}\;
\mathbb{E}\!\left[\lim_{n \to \infty} X_n\right].
\]
When does this hold? The answer is: not always. The conditions under which the
interchange is valid are precisely the subject of the convergence theorems
from Lebesgue integration.
We stated and used the Monotone Convergence Theorem (MCT),
Fatou's Lemma, and
the Dominated Convergence Theorem (DCT) to prove the
completeness of \(L^p\). In that context, we were dealing with abstract measure spaces and
the Riesz-Fischer Theorem.
Here, we give these same theorems their probabilistic names and interpretations: the measure becomes \(\mathbb{P}\),
the measurable functions become random variables, the integrals become expectations, and "a.e." becomes "a.s."
The Translation Table
Before stating the theorems, we make the correspondence explicit. Every
statement in the left column is literally the same mathematical assertion
as the corresponding statement in the right column — only the vocabulary changes.
Measure Theory ↔ Probability Dictionary
Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space.
| Measure Theory (Section II) | Probability (Section III) |
| --- | --- |
| Measurable function \(f\) | Random variable \(X\) |
| \(\int_\Omega f \, d\mu\) | \(\mathbb{E}[X]\) |
| Almost everywhere (a.e.) | Almost surely (a.s.) |
| Convergence in measure | Convergence in probability |
| \(f \in L^p(\mu)\) | \(X \in L^p(\mathbb{P})\) |
| Dominating function \(h \in L^1(\mu)\) | Integrable bound: \(\mathbb{E}[|Y|] < \infty\) with \(|X_n| \leq Y\) a.s. |
In Convergence in \(L^p\),
we developed the hierarchy of convergence modes for measurable functions and noted that on probability
spaces, a.e. convergence becomes a.s. convergence and convergence in measure becomes convergence in probability.
In Modes of Convergence, we introduced these probabilistic
modes — a.s., in probability, in distribution — and established their
hierarchy.
We now complete the picture by showing how the three great limit theorems of Lebesgue integration operate within
this probabilistic framework.
The Three Limit Theorems in Probability
We state each theorem in probabilistic language, with explicit reference to
its analytical counterpart. The mathematical content is identical — only the
framing changes.
Theorem: Monotone Convergence Theorem (Probabilistic Form)
Let \((X_n)\) be a sequence of random variables satisfying
\(0 \leq X_1 \leq X_2 \leq \cdots\) a.s., and let
\(X = \lim_{n \to \infty} X_n\) (which exists in \([0, \infty]\) a.s.
by monotonicity). Then
\[
\lim_{n \to \infty} \mathbb{E}[X_n] \;=\; \mathbb{E}[X].
\]
Analytical form:
MCT in \(L^p\) Spaces
with \(\mu = \mathbb{P}\). Under this specialization, a.e. becomes a.s.,
\(\int_\Omega g \, d\mu\) becomes \(\mathbb{E}[X]\), and the two statements
are literally identical.
The MCT requires two conditions: nonnegativity and monotone increase.
Under these conditions, expectations and limits commute without any
integrability assumption — the limit \(\mathbb{E}[X]\) may be \(+\infty\),
and the theorem still holds as an equality.
Example: Computing an infinite-series expectation.
Let \(Y_1, Y_2, \ldots \geq 0\) be nonnegative random variables. Define
\(X_n = \sum_{k=1}^n Y_k\). Then \((X_n)\) is nonnegative and increasing, so
by the MCT:
\[
\mathbb{E}\!\left[\sum_{k=1}^\infty Y_k\right]
\;=\; \lim_{n \to \infty} \mathbb{E}\!\left[\sum_{k=1}^n Y_k\right]
\;=\; \sum_{k=1}^\infty \mathbb{E}[Y_k].
\]
This justifies interchanging summation and expectation for nonnegative
terms — a step used constantly in probability (e.g., computing
expectations of counting random variables).
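As a numerical sanity check of this interchange, here is a minimal Monte Carlo sketch under an assumed setup of our own choosing: \(Y_k \sim \text{Exponential}(\text{mean } 2^{-k})\), so that \(\sum_k \mathbb{E}[Y_k] = 1 - 2^{-K}\).

```python
import numpy as np

# Hypothetical setup (not from the text): Y_k ~ Exponential(mean 2^-k),
# so sum_k E[Y_k] = 1 - 2^-K.  The MCT justifies exchanging the (truncated)
# sum and the expectation for these nonnegative terms.
rng = np.random.default_rng(0)
n_samples, K = 200_000, 30

means = 2.0 ** -np.arange(1, K + 1)                    # E[Y_k] = 2^-k
Y = rng.exponential(scale=means, size=(n_samples, K))  # each row: one realization
lhs = Y.sum(axis=1).mean()                             # Monte Carlo E[sum_k Y_k]
rhs = means.sum()                                      # sum_k E[Y_k]

print(lhs, rhs)  # agree up to Monte Carlo error
```

The two printed values coincide up to sampling noise of order \(1/\sqrt{n_{\text{samples}}}\), as the MCT predicts.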
Theorem: Fatou's Lemma (Probabilistic Form)
Let \((X_n)\) be a sequence of random variables satisfying \(X_n \geq 0\) a.s. Then
\[
\mathbb{E}\!\left[\liminf_{n \to \infty} X_n\right]
\;\leq\; \liminf_{n \to \infty} \mathbb{E}[X_n].
\]
Analytical form:
Fatou's Lemma in \(L^p\) Spaces
with \(\mu = \mathbb{P}\) — a literal specialization, with a.e.
replaced by a.s. and integrals replaced by expectations.
Fatou's lemma is the safety net: when we cannot verify monotonicity (needed for MCT)
or find a dominating variable (needed for DCT), Fatou still gives an inequality.
The inequality can be strict — probability mass can "escape to infinity" in the limit.
Example: Mass escape.
Let \(X_n = n \cdot \mathbf{1}_{\{U \leq 1/n\}}\), where \(U \sim \text{Uniform}(0,1)\).
Then \(\mathbb{E}[X_n] = n \cdot (1/n) = 1\) for all \(n\).
But \(X_n \to 0\) a.s. (for any fixed \(\omega\) with \(U(\omega) > 0\),
eventually \(1/n < U(\omega)\) and \(X_n(\omega) = 0\)).
Thus \(\mathbb{E}[\lim X_n] = 0 < 1 = \lim \mathbb{E}[X_n]\).
Fatou's inequality is sharp here. Geometrically, the probability mass of \(X_n\)
forms a spike of height \(n\) on the interval \((0, 1/n]\) — the area (expectation) is
always \(1\), but the spike narrows and grows without bound.
For each fixed outcome, the spike eventually passes beyond it, leaving zero behind.
The mass is not destroyed — it escapes to infinity, beyond the reach of the pointwise limit.
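The mass-escape example above is easy to watch numerically; the following sketch simulates \(X_n = n \cdot \mathbf{1}_{\{U \leq 1/n\}}\) directly.

```python
import numpy as np

# Simulation of the mass-escape example: X_n = n * 1{U <= 1/n}, U ~ Uniform(0,1).
# Every X_n has expectation exactly 1, yet X_n(omega) -> 0 for each omega with
# U(omega) > 0.
rng = np.random.default_rng(1)
U = rng.uniform(size=500_000)

means = {n: (n * (U <= 1 / n)).mean() for n in (10, 100, 1000)}
print(means)          # each empirical E[X_n] stays near 1

# For a very large n, almost every sampled outcome has already been "passed":
frac_zero = ((10**6 * (U <= 1e-6)) == 0).mean()
print(frac_zero)      # close to 1: the pointwise limit is 0
```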
Theorem: Dominated Convergence Theorem (Probabilistic Form)
Let \((X_n)\) be a sequence of random variables converging almost surely to a random variable \(X\).
Suppose there exists an integrable dominating variable \(Y\) with \(\mathbb{E}[|Y|] < \infty\)
and \(|X_n| \leq Y\) a.s. for all \(n\). Then \(\mathbb{E}[|X|] < \infty\) and
\[
\lim_{n \to \infty} \mathbb{E}[X_n] \;=\; \mathbb{E}[X].
\]
Analytical form:
DCT in \(L^p\) Spaces
with \(\mu = \mathbb{P}\); the "integrable dominating function \(h\)"
becomes an "integrable dominating variable \(Y\)," and the statement is
otherwise identical.
The DCT is the most commonly applied of the three in practice. It guarantees the full equality
\(\lim \mathbb{E} = \mathbb{E}[\lim]\) at the cost of requiring an integrable dominating variable.
The domination condition \(|X_n| \leq Y\) with \(\mathbb{E}[|Y|] < \infty\) prevents the mass-escape
phenomenon seen in the Fatou example above.
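A hedged illustration of the DCT in action, using a truncation example of our own (not from the text): \(X_n = \operatorname{clip}(X, -n, n)\) converges a.s. to \(X\) and is dominated by \(Y = |X|\), which is integrable when \(X\) is Gaussian.

```python
import numpy as np

# Assumed example: X ~ N(0.5, 1), X_n = clip(X, -n, n).  Then X_n -> X a.s.
# and |X_n| <= |X| with E[|X|] < infinity, so the DCT gives E[X_n] -> E[X].
rng = np.random.default_rng(2)
X = rng.standard_normal(300_000) + 0.5     # E[X] = 0.5, E[|X|] finite

truncated_means = {n: np.clip(X, -n, n).mean() for n in (1, 2, 5)}
print(truncated_means)   # approaches the sample mean of X as n grows
print(X.mean())
```

For small \(n\) the truncation visibly biases the mean; by \(n = 5\) the dominated limit has effectively been reached.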
A Decision Guide: Which Theorem When?
When faced with the need to interchange \(\lim\) and \(\mathbb{E}\),
the choice of theorem depends on what we know about the sequence:
Guide: Interchanging Limits and Expectations
Given a sequence \((X_n)\) with \(X_n \to X\) a.s.:
- If \(X_n \geq 0\) and \(X_n \uparrow X\) (nonnegative and increasing): use MCT.
  No integrability condition needed; the limit may be infinite.
- If \(X_n \geq 0\) (nonnegative but not necessarily monotone): use Fatou.
  Yields only \(\mathbb{E}[X] \leq \liminf \mathbb{E}[X_n]\), not equality.
- If \(|X_n| \leq Y\) a.s. with \(\mathbb{E}[|Y|] < \infty\) (dominated by an
  integrable variable): use DCT. Full equality; the most versatile tool.
The three theorems above provide sufficient conditions for interchanging limits and expectations.
A deeper theory — uniform integrability and the associated Vitali convergence theorem,
treated in Durrett's probability text — characterizes the interchange as a
necessary and sufficient condition: \(X_n \to X\) in \(L^1\) if and only if both \(X_n \to X\)
in probability and \((X_n)\) is uniformly integrable. Domination by an integrable variable (the DCT hypothesis) is one way
to guarantee uniform integrability, but not the only way. This more refined criterion is essential for martingale theory and
will arise naturally if that path is pursued.
Connecting to the Convergence Hierarchy
We established the hierarchy of probabilistic convergence modes:
\[
X_n \xrightarrow{a.s.} X \;\Longrightarrow\; X_n \xrightarrow{P} X \;\Longrightarrow\; X_n \xrightarrow{D} X.
\]
Also, we proved the analytical counterpart: \(L^p\) convergence implies convergence
in measure (= convergence in probability), and a.e. convergence with domination
implies \(L^p\) convergence (the \(L^p\)-DCT).
The three limit theorems above add a new perspective to this picture: they tell us when
convergence of random variables (any of the modes) can be upgraded to
convergence of expectations. The key insight is:
Principle: Convergence of Variables vs. Convergence of Expectations
Convergence of random variables (a.s., in probability, in distribution)
does not automatically imply convergence of their expectations.
Additional conditions — monotonicity (MCT), nonnegativity (Fatou), or
domination (DCT) — are required to pass the limit inside \(\mathbb{E}\).
Without such conditions, expectations can diverge even when the variables
themselves converge.
The Fatou example above illustrates this: \(X_n \to 0\) a.s. (the strongest
mode of convergence), yet \(\mathbb{E}[X_n] = 1 \not\to 0 = \mathbb{E}[X]\).
The limit theorems are precisely the tools that bridge this gap.
Connection to Machine Learning: Gradient-Expectation Interchange
In stochastic gradient descent (SGD), we compute the gradient of an expected loss:
\[
\nabla_{\boldsymbol{\theta}} \, \mathbb{E}_{X \sim P}\!\bigl[\ell(X, \boldsymbol{\theta})\bigr]
\;\stackrel{?}{=}\;
\mathbb{E}_{X \sim P}\!\bigl[\nabla_{\boldsymbol{\theta}} \, \ell(X, \boldsymbol{\theta})\bigr].
\]
The left side is the true gradient of the risk; the right side is the
expectation of sample gradients — the quantity SGD approximates via
mini-batches. The interchange is justified by differentiation under the
integral sign — a standard consequence of DCT applied to the difference
quotients
\(\frac{\ell(X, \boldsymbol{\theta} + h \mathbf{e}_j) - \ell(X, \boldsymbol{\theta})}{h}\)
as \(h \to 0\). The DCT hypothesis is satisfied when an integrable
dominating function \(Y(X)\) exists with
\(|\partial \ell / \partial \theta_j| \leq Y(X)\) for all \(\boldsymbol{\theta}\)
in a neighborhood.
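A minimal Monte Carlo sketch of this interchange, with a hypothetical quadratic loss \(\ell(x, \theta) = (\theta - x)^2\) chosen for illustration: its \(\theta\)-derivative is dominated by \(2(|\theta| + |X|)\), which is integrable for Gaussian \(X\), so the DCT applies.

```python
import numpy as np

# Hypothetical loss l(x, theta) = (theta - x)**2 with X ~ N(1, 1).
# Risk gradient: d/dtheta E[(theta - X)^2] = 2*(theta - E[X]).
# Sample gradient: 2*(theta - x), dominated by 2*(|theta| + |X|), integrable,
# so DCT justifies the gradient-expectation interchange.
rng = np.random.default_rng(3)
X = rng.normal(loc=1.0, scale=1.0, size=200_000)
theta = 3.0

true_grad = 2 * (theta - 1.0)                # gradient of the risk (E[X] = 1)
avg_sample_grad = np.mean(2 * (theta - X))   # mean of per-sample gradients

print(true_grad, avg_sample_grad)  # agree up to Monte Carlo error
```

The agreement of the two values is exactly the unbiasedness of the sample gradient under the domination hypothesis.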
This is one rigorous ingredient — among others — in the analysis of SGD: when the
domination condition holds, the interchange is exact, and the sample-average gradient
is an unbiased estimator of the true gradient of the risk. When the domination
condition fails (e.g., with heavy-tailed gradient distributions reported in some
large-language-model training runs), the interchange need not hold, and the
sample-average gradient may be a biased or high-variance estimate of the true
gradient. The empirical practice of gradient clipping, introduced
independently as a remedy for exploding gradients in deep recurrent networks
(Pascanu, Mikolov & Bengio, 2013), can be read in this light as restoring an
effective domination bound — though the historical motivation was empirical rather
than measure-theoretic, and the connection to DCT is interpretive.
Independence and Product Measures
We defined two events \(A\) and \(B\) to be independent if
\(\mathbb{P}(A \cap B) = \mathbb{P}(A)\,\mathbb{P}(B)\). This condition
was stated for individual events, without further explanation of why it
generalizes correctly to random variables, \(\sigma\)-algebras, or infinite
collections. The measure-theoretic framework provides the definitive
formulation.
Independence of \(\sigma\)-Algebras
The fundamental definition is not independence of events or of random variables,
but independence of \(\sigma\)-algebras. All other notions
are derived from it.
Definition: Independence of \(\sigma\)-Algebras
Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space.
Sub-\(\sigma\)-algebras \(\mathcal{G}_1, \ldots, \mathcal{G}_n \subseteq \mathcal{F}\)
are independent if
\[
\mathbb{P}(A_1 \cap \cdots \cap A_n)
\;=\; \mathbb{P}(A_1) \cdots \mathbb{P}(A_n)
\]
for every choice of \(A_i \in \mathcal{G}_i\), \(i = 1, \ldots, n\).
An arbitrary (possibly infinite) family \(\{\mathcal{G}_\alpha\}_{\alpha \in I}\) is independent if for
every finite collection of distinct indices \(\alpha_1, \ldots, \alpha_n \in I\), the sub-\(\sigma\)-algebras
\(\mathcal{G}_{\alpha_1}, \ldots, \mathcal{G}_{\alpha_n}\) are independent.
Why define independence at the level of \(\sigma\)-algebras rather than individual events or random variables?
Because this formulation is stable under measurable transformations: if \(X\) and \(Y\)
are independent, then so are \(f(X)\) and \(g(Y)\) for any measurable functions \(f\) and \(g\).
This follows immediately from the fact that \(\sigma(f(X)) \subseteq \sigma(X)\) — any information extractable
from \(f(X)\) is already contained in the information generated by \(X\). Defining independence via \(\sigma\)-algebras
captures this closure property automatically.
Independence of events is the special case where each
\(\mathcal{G}_i = \sigma(A_i) = \{\emptyset, A_i, A_i^c, \Omega\}\) — the smallest
\(\sigma\)-algebra containing \(A_i\). (The elementary condition
\(\mathbb{P}(A_1 \cap \cdots \cap A_n) = \prod_i \mathbb{P}(A_i)\) automatically
propagates to all \(2^n\) choices of \(A_i\) or \(A_i^c\), yielding the full
\(\sigma\)-algebra factorization — a short inclusion-exclusion argument.)
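The two-event case of that propagation is a one-line computation. If
\(\mathbb{P}(A \cap B) = \mathbb{P}(A)\,\mathbb{P}(B)\), then
\[
\mathbb{P}(A \cap B^c)
\;=\; \mathbb{P}(A) - \mathbb{P}(A \cap B)
\;=\; \mathbb{P}(A)\bigl(1 - \mathbb{P}(B)\bigr)
\;=\; \mathbb{P}(A)\,\mathbb{P}(B^c),
\]
and iterating this argument coordinate by coordinate handles all \(2^n\) sign
patterns in the general case.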
Independence of random variables
\(X_1, \ldots, X_n\) means that their generated \(\sigma\)-algebras
\(\sigma(X_i) = \{X_i^{-1}(B) : B \in \mathcal{B}(\mathbb{R})\}\)
are independent:
Definition: Independence of Random Variables
Random variables \(X_1, \ldots, X_n\) are independent if
for all Borel sets \(B_1, \ldots, B_n \in \mathcal{B}(\mathbb{R})\),
\[
\mathbb{P}(X_1 \in B_1, \ldots, X_n \in B_n)
\;=\; \prod_{i=1}^n \mathbb{P}(X_i \in B_i).
\]
Equivalently, the joint distribution factors as a product of marginals:
\[
P_{(X_1, \ldots, X_n)}
\;=\; P_{X_1} \otimes \cdots \otimes P_{X_n}.
\]
The second formulation — factoring the joint distribution into a product of
marginals — is the form most directly connected to the product measure
construction that follows.
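The factorization \(P_{(X,Y)} = P_X \otimes P_Y\) can be checked empirically; here is a sketch with a hypothetical pair of independent fair-die rolls.

```python
import numpy as np

# Empirical check that the joint distribution of two independent discrete
# variables factors into the product of marginals (assumed example: two
# independent fair dice).
rng = np.random.default_rng(4)
n = 1_000_000
X = rng.integers(1, 7, size=n)
Y = rng.integers(1, 7, size=n)

joint = np.zeros((6, 6))
np.add.at(joint, (X - 1, Y - 1), 1)   # empirical joint distribution
joint /= n

marg_X = np.bincount(X, minlength=7)[1:] / n
marg_Y = np.bincount(Y, minlength=7)[1:] / n
product = np.outer(marg_X, marg_Y)    # product of empirical marginals

print(np.abs(joint - product).max())  # small: joint ~ product of marginals
```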
A key consequence of independence concerns expectations of products:
Proposition: Expectation of Products
If \(X_1, \ldots, X_n\) are independent random variables with
\(\mathbb{E}[|X_i|] < \infty\) for each \(i\), then the product
\(X_1 \cdots X_n\) is integrable and
\[
\mathbb{E}[X_1 \cdots X_n] \;=\; \mathbb{E}[X_1] \cdots \mathbb{E}[X_n].
\]
This was used implicitly throughout
Variance and
Modes of Convergence
(for instance, in \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\) for
independent \(X, Y\)). The measure-theoretic proof reduces to Fubini's theorem
on the product space: under independence the joint distribution factors as
\(P_{(X_1, \ldots, X_n)} = P_{X_1} \otimes \cdots \otimes P_{X_n}\), and Fubini
converts the integral of the product against the product measure into a product
of integrals. The
worked example at the end of this section
performs this derivation in the case \(n = 2\); the general case follows by
induction.
Product Measures
When we say "\(X_1, \ldots, X_n\) are independent," we are asserting that the
joint probability space factors into a product. To make this precise, we need
the construction of product measures.
Definition: Product \(\sigma\)-Algebra and Product Measure
Let \((\Omega_1, \mathcal{F}_1, \mu_1)\) and \((\Omega_2, \mathcal{F}_2, \mu_2)\)
be \(\sigma\)-finite measure spaces.
The product \(\sigma\)-algebra
\(\mathcal{F}_1 \otimes \mathcal{F}_2\) is the \(\sigma\)-algebra
on \(\Omega_1 \times \Omega_2\) generated by all
measurable rectangles
\(A_1 \times A_2\) with \(A_1 \in \mathcal{F}_1\),
\(A_2 \in \mathcal{F}_2\).
There exists a unique measure
\(\mu_1 \otimes \mu_2\) on
\((\Omega_1 \times \Omega_2, \, \mathcal{F}_1 \otimes \mathcal{F}_2)\) satisfying
\[
(\mu_1 \otimes \mu_2)(A_1 \times A_2)
\;=\; \mu_1(A_1) \cdot \mu_2(A_2)
\]
for all measurable rectangles. This is the product measure.
The existence and uniqueness of the product measure follow from
Carathéodory's extension theorem.
The formula \(\mu_1(A_1)\,\mu_2(A_2)\) extends additively to the algebra of finite disjoint unions of
rectangles, and its \(\sigma\)-additivity on this algebra is verified by applying the MCT to
indicator functions of countable disjoint rectangle decompositions. The extension theorem then
promotes this premeasure to a measure on the full product \(\sigma\)-algebra; the
\(\sigma\)-finiteness assumption ensures uniqueness.
For probability spaces, \(\sigma\)-finiteness is automatic
(\(\mathbb{P}(\Omega) = 1 < \infty\)), so the product always exists.
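For finite discrete measures, the product measure is literally an outer product of probability vectors; the following sketch, with two hypothetical measures of our own, makes the rectangle formula concrete.

```python
import numpy as np

# Assumed discrete measures: mu1 on {0,1,2}, mu2 on {0,1}.  The product
# measure assigns (mu1 ⊗ mu2)({i} x {j}) = mu1({i}) * mu2({j}).
mu1 = np.array([0.2, 0.5, 0.3])
mu2 = np.array([0.6, 0.4])

product = np.outer(mu1, mu2)          # 3 x 2 table of rectangle masses

# Measure of the rectangle {0, 1} x {1}:
rect = product[[0, 1], 1].sum()
print(rect, (mu1[0] + mu1[1]) * mu2[1])   # both ~ 0.28

print(product.sum())   # total mass 1: a product of probability measures
```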
The construction extends to finite products \(\mu_1 \otimes \cdots \otimes \mu_n\) by induction. For countably
infinite products of probability measures, the product construction can be extended by verifying \(\sigma\)-additivity
on the algebra of cylinder sets and applying Carathéodory's extension theorem — the full statement and
proof go under the name Kolmogorov's extension theorem, developed in standard graduate
probability and real analysis references.
Fubini's Theorem and Tonelli's Theorem
The central computational tool for product measures is the ability to evaluate
a double integral as an iterated integral — to "integrate out" one variable at a time.
This is the content of Fubini's and Tonelli's theorems.
Theorem: Tonelli's Theorem
Let \((\Omega_1, \mathcal{F}_1, \mu_1)\) and
\((\Omega_2, \mathcal{F}_2, \mu_2)\) be \(\sigma\)-finite measure spaces,
and let \(f : \Omega_1 \times \Omega_2 \to [0, \infty]\) be measurable
with respect to \(\mathcal{F}_1 \otimes \mathcal{F}_2\). Then
\[
\int_{\Omega_1 \times \Omega_2} f \, d(\mu_1 \otimes \mu_2)
\;=\; \int_{\Omega_1} \!\left(\int_{\Omega_2} f(\omega_1, \omega_2) \, d\mu_2(\omega_2)\right) d\mu_1(\omega_1)
\;=\; \int_{\Omega_2} \!\left(\int_{\Omega_1} f(\omega_1, \omega_2) \, d\mu_1(\omega_1)\right) d\mu_2(\omega_2).
\]
In particular, the order of integration may be exchanged freely.
Tonelli's theorem applies to nonnegative measurable functions
without any integrability condition — the integrals may be infinite.
When the function is not necessarily nonnegative, we need an additional
assumption:
Theorem: Fubini's Theorem
Under the same setup as Tonelli's theorem, suppose
\(f : \Omega_1 \times \Omega_2 \to \mathbb{R}\) is
\((\mathcal{F}_1 \otimes \mathcal{F}_2)\)-measurable and
integrable with respect to the product measure:
\[
\int_{\Omega_1 \times \Omega_2} |f| \, d(\mu_1 \otimes \mu_2) \;<\; \infty.
\]
Then for \(\mu_1\)-a.e. \(\omega_1\), the function
\(\omega_2 \mapsto f(\omega_1, \omega_2)\) is \(\mu_2\)-integrable;
the function \(\omega_1 \mapsto \int_{\Omega_2} f(\omega_1, \omega_2) \, d\mu_2(\omega_2)\)
(defined \(\mu_1\)-a.e.) is \(\mu_1\)-integrable; and
\[
\int_{\Omega_1 \times \Omega_2} f \, d(\mu_1 \otimes \mu_2)
\;=\; \int_{\Omega_1} \!\left(\int_{\Omega_2} f(\omega_1, \omega_2) \, d\mu_2(\omega_2)\right) d\mu_1(\omega_1)
\;=\; \int_{\Omega_2} \!\left(\int_{\Omega_1} f(\omega_1, \omega_2) \, d\mu_1(\omega_1)\right) d\mu_2(\omega_2).
\]
The symmetric statement with the roles of \(\omega_1, \omega_2\) exchanged holds as well.
The relationship between the two theorems mirrors the MCT-DCT relationship:
Tonelli handles nonnegative functions unconditionally; Fubini handles general
(sign-changing) functions at the cost of an integrability condition.
A standard workflow is: apply Tonelli to \(|f|\) first to verify the integrability condition,
then invoke Fubini on \(f\) itself. This order is logically sound because Tonelli's theorem requires no
integrability prerequisite — it remains valid even when the iterated integrals evaluate to \(+\infty\). If Tonelli
applied to \(|f|\) yields a finite value, the integrability hypothesis for Fubini is rigorously confirmed.
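On a finite grid with counting-measure weights, "integrals" are double sums and the workflow can be acted out directly; this sketch uses a sign-changing integrand of our own choosing.

```python
import numpy as np

# Discrete sketch of the Tonelli-then-Fubini workflow with the assumed
# integrand f(i, j) = (-1)^(i+j) / (i^2 j^2) on a finite grid.
i = np.arange(1, 50)[:, None]
j = np.arange(1, 50)[None, :]
f = (-1.0) ** (i + j) / (i**2 * j**2)   # sign-changing integrand

# Step 1 (Tonelli on |f|): needs no hypotheses; finiteness of this sum
# verifies the integrability condition required by Fubini.
total_abs = np.abs(f).sum()

# Step 2 (Fubini on f): the two iterated sums agree.
sum_ij = f.sum(axis=1).sum()   # sum over j first, then i
sum_ji = f.sum(axis=0).sum()   # sum over i first, then j
print(total_abs, sum_ij, sum_ji)
```

On a finite grid the order exchange is trivially valid; the point of the sketch is the two-step structure, which is exactly the one used for genuine integrals.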
We do not prove Tonelli's and Fubini's theorems here. The proofs proceed by
the standard three-step measure-theoretic induction (indicators, then simple
functions, then general measurable functions via MCT), combined with the
monotone class theorem to promote rectangle-level identities to the full
product \(\sigma\)-algebra. Complete proofs are developed in standard graduate
probability and real analysis texts.
The Mathematical Justification of i.i.d. Sampling
We can now answer a question that has been implicit:
what does it mean, precisely, for \(X_1, X_2, \ldots, X_n\) to be "i.i.d. with distribution \(P\)"?
Definition: i.i.d. Random Variables
Random variables \(X_1, \ldots, X_n\) on a probability space
\((\Omega, \mathcal{F}, \mathbb{P})\) are
independent and identically distributed (i.i.d.) with
common distribution \(P\) if:
1. \(X_1, \ldots, X_n\) are independent (in the sense of
\(\sigma\)-algebra independence), and
2. \(P_{X_i} = P\) for each \(i = 1, \ldots, n\) (each has the same
pushforward measure).
Condition (2) alone says the variables have the same distribution.
Condition (1) says that their joint distribution factors:
\[
P_{(X_1, \ldots, X_n)} \;=\; P \otimes \cdots \otimes P \;=\; P^{\otimes n}.
\]
The product measure \(P^{\otimes n}\) on
\((\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))\) is the mathematical model for
"drawing \(n\) independent samples from \(P\)." The coordinate projections
\(\pi_i : \mathbb{R}^n \to \mathbb{R}\), \(\pi_i(x_1, \ldots, x_n) = x_i\),
are the random variables \(X_i\) in this model.
For an infinite sequence \((X_i)_{i=1}^\infty\), the i.i.d. property is imposed
on every finite sub-collection. Equivalently, the joint distribution on
\(\bigl(\mathbb{R}^{\mathbb{N}}, \, \bigotimes_{i=1}^\infty \mathcal{B}(\mathbb{R})\bigr)\)
is the infinite product measure \(P^{\otimes \infty}\), constructed via Kolmogorov's
extension theorem, and the coordinate projections serve as the random variables.
With this formalization, the laws of large numbers gain their full meaning.
The Strong Law says:
if \(\mathbb{E}_P[|X|] < \infty\) (where the expectation is taken against a
single copy of \(P\)), then
\[
\frac{1}{n} \sum_{i=1}^n \pi_i \;\xrightarrow{a.s.}\; \int_{\mathbb{R}} x \, dP(x)
\quad \text{under } P^{\otimes \infty},
\]
where \(P^{\otimes \infty}\) is the infinite product measure on
\(\bigl(\mathbb{R}^{\mathbb{N}}, \, \bigotimes_{i=1}^\infty \mathcal{B}(\mathbb{R})\bigr)\)
whose existence is guaranteed by Kolmogorov's extension theorem (referenced earlier in this chapter). The infinite product measure is the
foundational construction that makes the study of infinite sequences of random variables rigorous. Without it, the
entire infinite sequence \((X_1, X_2, \ldots)\) cannot be defined on a single probability space, and asymptotic
statements like the Strong Law ("as \(n \to \infty\), the sample mean converges a.s.") would lack a well-defined
probability space on which to measure convergence.
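The product-measure model of i.i.d. sampling can be simulated directly: a single call to a random generator produces one point of the (truncated) product space, and the running mean of the coordinate projections plays out the Strong Law. A sketch with an assumed choice \(P = \text{Exponential}(\text{mean } 2)\):

```python
import numpy as np

# One draw from P^{⊗n} with P = Exponential(mean 2): an n-vector whose
# coordinates are the projections pi_1, ..., pi_n.  The Strong Law says the
# running mean converges a.s. to E_P[X] = 2.
rng = np.random.default_rng(5)
n = 100_000
sample = rng.exponential(scale=2.0, size=n)

running_mean = np.cumsum(sample) / np.arange(1, n + 1)
for k in (100, 10_000, 100_000):
    print(k, running_mean[k - 1])   # settles near 2 as k grows
```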
Example: Fubini justifies the variance formula for sums.
Let \(X, Y\) be independent with \(\mathbb{E}[X^2], \mathbb{E}[Y^2] < \infty\).
The claim \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\) requires
\(\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]\). Under independence,
the joint distribution is \(P_X \otimes P_Y\), so by Fubini:
\[
\mathbb{E}[XY]
\;=\; \int_{\mathbb{R}^2} x \, y \; d(P_X \otimes P_Y)(x, y)
\;=\; \int_{\mathbb{R}} x \, dP_X(x) \;\cdot\; \int_{\mathbb{R}} y \, dP_Y(y)
\;=\; \mathbb{E}[X] \cdot \mathbb{E}[Y].
\]
Tonelli's theorem ensures the application is valid: applying Tonelli to \(|xy|\) under the product measure,
combined with the
change-of-variable formula,
gives \(\mathbb{E}[|XY|] = \mathbb{E}[|X|] \cdot \mathbb{E}[|Y|]\).
Since \(\mathbb{E}[X^2] < \infty\) implies \(\mathbb{E}[|X|] < \infty\) (and likewise for \(Y\)), the integrability
condition for Fubini is confirmed.
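The product rule \(\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]\) derived above is easy to verify by simulation; here is a sketch with a hypothetical pair \(X \sim \mathcal{N}(2, 1)\), \(Y \sim \text{Uniform}(0, 1)\).

```python
import numpy as np

# Monte Carlo check of E[XY] = E[X] E[Y] for independent X, Y with finite
# second moments (assumed distributions: X ~ N(2, 1), Y ~ Uniform(0, 1)).
rng = np.random.default_rng(6)
n = 500_000
X = rng.normal(2.0, 1.0, size=n)
Y = rng.uniform(0.0, 1.0, size=n)

cross = np.mean(X * Y)        # Monte Carlo E[XY]
prod = X.mean() * Y.mean()    # product of Monte Carlo marginal means

print(cross, prod)            # both ~ 2 * 0.5 = 1
```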
Connection to Machine Learning: Why i.i.d. Matters
Nearly every convergence guarantee in machine learning —
consistency of MLE,
convergence of empirical risk to population risk, validity of cross-validation —
assumes that the training data \((x_1, y_1), \ldots, (x_n, y_n)\) are i.i.d. draws from some
distribution \(P\).
The product measure \(P^{\otimes n}\) is what makes these guarantees mathematically precise:
it defines the probability space on which the sample lives, and Fubini's theorem is the tool
that lets us decompose expectations over the full sample into expectations over individual
observations. Without the product measure construction, the phrase "assume the data are i.i.d."
would be a slogan without mathematical content.
When the i.i.d. assumption fails — as it does for time-series data, reinforcement learning trajectories,
or data collected with adaptive sampling — the product measure model breaks down, and the theorems
of this section no longer apply directly. Understanding what the assumption buys us is essential
to recognizing when we can and cannot rely on it.