Convergence Theorems for Probability
A recurring challenge in probability and machine learning is the
interchange of limits and expectations:
\[
\lim_{n \to \infty} \mathbb{E}[X_n] \;\stackrel{?}{=}\;
\mathbb{E}\!\left[\lim_{n \to \infty} X_n\right].
\]
When does this hold? The answer is: not always. The conditions under which the
interchange is valid are precisely the subject of the convergence theorems
from Lebesgue integration.
We stated and used the Monotone Convergence Theorem (MCT),
Fatou's Lemma, and
the Dominated Convergence Theorem (DCT) to prove the
completeness of \(L^p\). In that context, we were dealing with abstract measure spaces and
the Riesz-Fischer Theorem.
Here, we give these same theorems their probabilistic names and interpretations: the measure becomes \(\mathbb{P}\),
the measurable functions become random variables, the integrals become expectations, and "a.e." becomes "a.s."
The Translation Table
Before stating the theorems, we make the correspondence explicit. Every
statement in the left column is literally the same mathematical assertion
as the corresponding statement in the right column — only the vocabulary changes.
Measure Theory ↔ Probability Dictionary
Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space.
| Measure Theory (Section II) | Probability (Section III) |
| --- | --- |
| Measurable function \(f\) | Random variable \(X\) |
| \(\int_\Omega f \, d\mu\) | \(\mathbb{E}[X]\) |
| Almost everywhere (a.e.) | Almost surely (a.s.) |
| Convergence in measure | Convergence in probability |
| \(f \in L^p(\mu)\) | \(X \in L^p(\mathbb{P})\) |
| Dominating function \(h \in L^1(\mu)\) | Integrable bound: \(\mathbb{E}[|Y|] < \infty\) with \(|X_n| \leq Y\) a.s. |
In Convergence in \(L^p\),
we developed the hierarchy of convergence modes for measurable functions and noted that on probability
spaces, a.e. convergence becomes a.s. convergence and convergence in measure becomes convergence in probability.
In Modes of Convergence, we introduced these probabilistic
modes — a.s., in probability, in distribution — and established their
hierarchy.
We now complete the picture by showing how the three great limit theorems of Lebesgue integration operate within
this probabilistic framework.
The Three Limit Theorems in Probability
We state each theorem in probabilistic language, with explicit reference to
its analytical counterpart. The mathematical content is identical — only the
framing changes.
Theorem: Monotone Convergence Theorem (Probabilistic Form)
Let \((X_n)\) be a sequence of random variables satisfying
\(0 \leq X_1 \leq X_2 \leq \cdots\) a.s., and let
\(X = \lim_{n \to \infty} X_n\) (which exists in \([0, \infty]\) a.s.
by monotonicity). Then
\[
\lim_{n \to \infty} \mathbb{E}[X_n] \;=\; \mathbb{E}[X].
\]
Analytical form:
MCT in \(L^p\) Spaces
with \(\mu = \mathbb{P}\). Under this specialization, a.e. becomes a.s.,
\(\int_\Omega g \, d\mu\) becomes \(\mathbb{E}[X]\), and the two statements
are literally identical.
The MCT requires two conditions: nonnegativity and monotone increase.
Under these conditions, expectations and limits commute without any
integrability assumption — the limit \(\mathbb{E}[X]\) may be \(+\infty\),
and the theorem still holds as an equality.
Example: Computing an infinite-series expectation.
Let \(Y_1, Y_2, \ldots \geq 0\) be nonnegative random variables. Define
\(X_n = \sum_{k=1}^n Y_k\). Then \((X_n)\) is nonnegative and increasing, so
by the MCT:
\[
\mathbb{E}\!\left[\sum_{k=1}^\infty Y_k\right]
\;=\; \lim_{n \to \infty} \mathbb{E}\!\left[\sum_{k=1}^n Y_k\right]
\;=\; \sum_{k=1}^\infty \mathbb{E}[Y_k].
\]
This justifies interchanging summation and expectation for nonnegative
terms — a step used constantly in probability (e.g., computing
expectations of counting random variables).
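As a numerical sanity check of this interchange, here is a minimal Monte Carlo sketch under an assumed setup of our own choosing: \(Y_k \sim \text{Exponential}(\text{mean } 2^{-k})\), so that \(\sum_k \mathbb{E}[Y_k] = 1 - 2^{-K}\).

```python
import numpy as np

# Hypothetical setup (not from the text): Y_k ~ Exponential(mean 2^-k),
# so sum_k E[Y_k] = 1 - 2^-K.  The MCT justifies exchanging the (truncated)
# sum and the expectation for these nonnegative terms.
rng = np.random.default_rng(0)
n_samples, K = 200_000, 30

means = 2.0 ** -np.arange(1, K + 1)                    # E[Y_k] = 2^-k
Y = rng.exponential(scale=means, size=(n_samples, K))  # each row: one realization
lhs = Y.sum(axis=1).mean()                             # Monte Carlo E[sum_k Y_k]
rhs = means.sum()                                      # sum_k E[Y_k]

print(lhs, rhs)  # agree up to Monte Carlo error
```

The two printed values coincide up to sampling noise of order \(1/\sqrt{n_{\text{samples}}}\), as the MCT predicts.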
Theorem: Fatou's Lemma (Probabilistic Form)
Let \((X_n)\) be a sequence of random variables satisfying \(X_n \geq 0\) a.s. Then
\[
\mathbb{E}\!\left[\liminf_{n \to \infty} X_n\right]
\;\leq\; \liminf_{n \to \infty} \mathbb{E}[X_n].
\]
Analytical form:
Fatou's Lemma in \(L^p\) Spaces
with \(\mu = \mathbb{P}\) — a literal specialization, with a.e.
replaced by a.s. and integrals replaced by expectations.
Fatou's lemma is the safety net: when we cannot verify monotonicity (needed for MCT)
or find a dominating variable (needed for DCT), Fatou still gives an inequality.
The inequality can be strict — probability mass can "escape to infinity" in the limit.
Example: Mass escape.
Let \(X_n = n \cdot \mathbf{1}_{\{U \leq 1/n\}}\), where \(U \sim \text{Uniform}(0,1)\).
Then \(\mathbb{E}[X_n] = n \cdot (1/n) = 1\) for all \(n\).
But \(X_n \to 0\) a.s. (for any fixed \(\omega\) with \(U(\omega) > 0\),
eventually \(1/n < U(\omega)\) and \(X_n(\omega) = 0\)).
Thus \(\mathbb{E}[\lim X_n] = 0 < 1 = \lim \mathbb{E}[X_n]\).
Fatou's inequality is sharp here. Geometrically, the probability mass of \(X_n\)
forms a spike of height \(n\) on the interval \((0, 1/n]\) — the area (expectation) is
always \(1\), but the spike narrows and grows without bound.
For each fixed outcome, the spike eventually passes beyond it, leaving zero behind.
The mass is not destroyed — it escapes to infinity, beyond the reach of the pointwise limit.
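The mass-escape example above is easy to watch numerically; the following sketch simulates \(X_n = n \cdot \mathbf{1}_{\{U \leq 1/n\}}\) directly.

```python
import numpy as np

# Simulation of the mass-escape example: X_n = n * 1{U <= 1/n}, U ~ Uniform(0,1).
# Every X_n has expectation exactly 1, yet X_n(omega) -> 0 for each omega with
# U(omega) > 0.
rng = np.random.default_rng(1)
U = rng.uniform(size=500_000)

means = {n: (n * (U <= 1 / n)).mean() for n in (10, 100, 1000)}
print(means)          # each empirical E[X_n] stays near 1

# For a very large n, almost every sampled outcome has already been "passed":
frac_zero = ((10**6 * (U <= 1e-6)) == 0).mean()
print(frac_zero)      # close to 1: the pointwise limit is 0
```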
Theorem: Dominated Convergence Theorem (Probabilistic Form)
Let \((X_n)\) be a sequence of random variables converging almost surely to a random variable \(X\).
Suppose there exists an integrable dominating variable \(Y\) with \(\mathbb{E}[|Y|] < \infty\)
and \(|X_n| \leq Y\) a.s. for all \(n\). Then \(\mathbb{E}[|X|] < \infty\) and
\[
\lim_{n \to \infty} \mathbb{E}[X_n] \;=\; \mathbb{E}[X].
\]
Analytical form:
DCT in \(L^p\) Spaces
with \(\mu = \mathbb{P}\); the "integrable dominating function \(h\)"
becomes an "integrable dominating variable \(Y\)," and the statement is
otherwise identical.
The DCT is the most commonly applied of the three in practice. It guarantees the full equality
\(\lim \mathbb{E} = \mathbb{E}[\lim]\) at the cost of requiring an integrable dominating variable.
The domination condition \(|X_n| \leq Y\) with \(\mathbb{E}[|Y|] < \infty\) prevents the mass-escape
phenomenon seen in the Fatou example above.
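A hedged illustration of the DCT in action, using a truncation example of our own (not from the text): \(X_n = \operatorname{clip}(X, -n, n)\) converges a.s. to \(X\) and is dominated by \(Y = |X|\), which is integrable when \(X\) is Gaussian.

```python
import numpy as np

# Assumed example: X ~ N(0.5, 1), X_n = clip(X, -n, n).  Then X_n -> X a.s.
# and |X_n| <= |X| with E[|X|] < infinity, so the DCT gives E[X_n] -> E[X].
rng = np.random.default_rng(2)
X = rng.standard_normal(300_000) + 0.5     # E[X] = 0.5, E[|X|] finite

truncated_means = {n: np.clip(X, -n, n).mean() for n in (1, 2, 5)}
print(truncated_means)   # approaches the sample mean of X as n grows
print(X.mean())
```

For small \(n\) the truncation visibly biases the mean; by \(n = 5\) the dominated limit has effectively been reached.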
A Decision Guide: Which Theorem When?
When faced with the need to interchange \(\lim\) and \(\mathbb{E}\),
the choice of theorem depends on what we know about the sequence:
Guide: Interchanging Limits and Expectations
Given a sequence \((X_n)\) with \(X_n \to X\) a.s.:
- If \(X_n \geq 0\) and \(X_n \uparrow X\) (nonnegative and increasing): use MCT.
  No integrability condition needed; the limit may be infinite.
- If \(X_n \geq 0\) (nonnegative but not necessarily monotone): use Fatou.
  Yields only \(\mathbb{E}[X] \leq \liminf \mathbb{E}[X_n]\), not equality.
- If \(|X_n| \leq Y\) a.s. with \(\mathbb{E}[|Y|] < \infty\) (dominated by an
  integrable variable): use DCT. Full equality; the most versatile tool.
The three theorems above provide sufficient conditions for interchanging limits and expectations.
A deeper theory — uniform integrability and the associated Vitali convergence theorem,
treated in Durrett's probability text — characterizes the interchange as a
necessary and sufficient condition: \(X_n \to X\) in \(L^1\) if and only if both \(X_n \to X\)
in probability and \((X_n)\) is uniformly integrable. Domination by an integrable variable (the DCT hypothesis) is one way
to guarantee uniform integrability, but not the only way. This more refined criterion is essential for martingale theory and
will arise naturally if that path is pursued.
Connecting to the Convergence Hierarchy
We established the hierarchy of probabilistic convergence modes:
\[
X_n \xrightarrow{a.s.} X \;\Longrightarrow\; X_n \xrightarrow{P} X \;\Longrightarrow\; X_n \xrightarrow{D} X.
\]
Also, we proved the analytical counterpart: \(L^p\) convergence implies convergence
in measure (= convergence in probability), and a.e. convergence with domination
implies \(L^p\) convergence (the \(L^p\)-DCT).
The three limit theorems above add a new perspective to this picture: they tell us when
convergence of random variables (any of the modes) can be upgraded to
convergence of expectations. The key insight is:
Principle: Convergence of Variables vs. Convergence of Expectations
Convergence of random variables (a.s., in probability, in distribution)
does not automatically imply convergence of their expectations.
Additional conditions — monotonicity (MCT), nonnegativity (Fatou), or
domination (DCT) — are required to pass the limit inside \(\mathbb{E}\).
Without such conditions, expectations can diverge even when the variables
themselves converge.
The Fatou example above illustrates this: \(X_n \to 0\) a.s. (the strongest
mode of convergence), yet \(\mathbb{E}[X_n] = 1 \not\to 0 = \mathbb{E}[X]\).
The limit theorems are precisely the tools that bridge this gap.
Connection to Machine Learning: Gradient-Expectation Interchange
In stochastic gradient descent (SGD), we compute the gradient of an expected loss:
\[
\nabla_{\boldsymbol{\theta}} \, \mathbb{E}_{X \sim P}\!\bigl[\ell(X, \boldsymbol{\theta})\bigr]
\;\stackrel{?}{=}\;
\mathbb{E}_{X \sim P}\!\bigl[\nabla_{\boldsymbol{\theta}} \, \ell(X, \boldsymbol{\theta})\bigr].
\]
The left side is the true gradient of the risk; the right side is the
expectation of sample gradients — the quantity SGD approximates via
mini-batches. The interchange is justified by differentiation under the
integral sign — a standard consequence of DCT applied to the difference
quotients
\(\frac{\ell(X, \boldsymbol{\theta} + h \mathbf{e}_j) - \ell(X, \boldsymbol{\theta})}{h}\)
as \(h \to 0\). The DCT hypothesis is satisfied when an integrable
dominating function \(Y(X)\) exists with
\(|\partial \ell / \partial \theta_j| \leq Y(X)\) for all \(\boldsymbol{\theta}\)
in a neighborhood.
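A minimal Monte Carlo sketch of this interchange, with a hypothetical quadratic loss \(\ell(x, \theta) = (\theta - x)^2\) chosen for illustration: its \(\theta\)-derivative is dominated by \(2(|\theta| + |X|)\), which is integrable for Gaussian \(X\), so the DCT applies.

```python
import numpy as np

# Hypothetical loss l(x, theta) = (theta - x)**2 with X ~ N(1, 1).
# Risk gradient: d/dtheta E[(theta - X)^2] = 2*(theta - E[X]).
# Sample gradient: 2*(theta - x), dominated by 2*(|theta| + |X|), integrable,
# so DCT justifies the gradient-expectation interchange.
rng = np.random.default_rng(3)
X = rng.normal(loc=1.0, scale=1.0, size=200_000)
theta = 3.0

true_grad = 2 * (theta - 1.0)                # gradient of the risk (E[X] = 1)
avg_sample_grad = np.mean(2 * (theta - X))   # mean of per-sample gradients

print(true_grad, avg_sample_grad)  # agree up to Monte Carlo error
```

The agreement of the two values is exactly the unbiasedness of the sample gradient under the domination hypothesis.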
This is one rigorous ingredient — among others — in the analysis of SGD: when the
domination condition holds, the interchange is exact, and the sample-average gradient
is an unbiased estimator of the true gradient of the risk. When the domination
condition fails (e.g., with heavy-tailed gradient distributions reported in some
large-language-model training runs), the interchange need not hold, and the
sample-average gradient may be a biased or high-variance estimate of the true
gradient. The empirical practice of gradient clipping, introduced
independently as a remedy for exploding gradients in deep recurrent networks
(Pascanu, Mikolov & Bengio, 2013), can be read in this light as restoring an
effective domination bound — though the historical motivation was empirical rather
than measure-theoretic, and the connection to DCT is interpretive.
Independence and Product Measures
We defined two events \(A\) and \(B\) to be independent if
\(\mathbb{P}(A \cap B) = \mathbb{P}(A)\,\mathbb{P}(B)\). This condition
was stated for individual events, without further explanation of why it
generalizes correctly to random variables, \(\sigma\)-algebras, or infinite
collections. The measure-theoretic framework provides the definitive
formulation.
Independence of \(\sigma\)-Algebras
The fundamental definition is not independence of events or of random variables,
but independence of \(\sigma\)-algebras. All other notions
are derived from it.
Definition: Independence of \(\sigma\)-Algebras
Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space.
Sub-\(\sigma\)-algebras \(\mathcal{G}_1, \ldots, \mathcal{G}_n \subseteq \mathcal{F}\)
are independent if
\[
\mathbb{P}(A_1 \cap \cdots \cap A_n)
\;=\; \mathbb{P}(A_1) \cdots \mathbb{P}(A_n)
\]
for every choice of \(A_i \in \mathcal{G}_i\), \(i = 1, \ldots, n\).
An arbitrary (possibly infinite) family \(\{\mathcal{G}_\alpha\}_{\alpha \in I}\) is independent if for
every finite collection of distinct indices \(\alpha_1, \ldots, \alpha_n \in I\), the sub-\(\sigma\)-algebras
\(\mathcal{G}_{\alpha_1}, \ldots, \mathcal{G}_{\alpha_n}\) are independent.
Why define independence at the level of \(\sigma\)-algebras rather than individual events or random variables?
Because this formulation is stable under measurable transformations: if \(X\) and \(Y\)
are independent, then so are \(f(X)\) and \(g(Y)\) for any measurable functions \(f\) and \(g\).
This follows immediately from the fact that \(\sigma(f(X)) \subseteq \sigma(X)\) — any information extractable
from \(f(X)\) is already contained in the information generated by \(X\). Defining independence via \(\sigma\)-algebras
captures this closure property automatically.
Independence of events is the special case where each
\(\mathcal{G}_i = \sigma(A_i) = \{\emptyset, A_i, A_i^c, \Omega\}\) — the smallest
\(\sigma\)-algebra containing \(A_i\). (The elementary condition
\(\mathbb{P}(A_1 \cap \cdots \cap A_n) = \prod_i \mathbb{P}(A_i)\) automatically
propagates to all \(2^n\) choices of \(A_i\) or \(A_i^c\), yielding the full
\(\sigma\)-algebra factorization — a short inclusion-exclusion argument.)
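The two-event case of that propagation is a one-line computation. If
\(\mathbb{P}(A \cap B) = \mathbb{P}(A)\,\mathbb{P}(B)\), then
\[
\mathbb{P}(A \cap B^c)
\;=\; \mathbb{P}(A) - \mathbb{P}(A \cap B)
\;=\; \mathbb{P}(A)\bigl(1 - \mathbb{P}(B)\bigr)
\;=\; \mathbb{P}(A)\,\mathbb{P}(B^c),
\]
and iterating this argument coordinate by coordinate handles all \(2^n\) sign
patterns in the general case.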
Independence of random variables
\(X_1, \ldots, X_n\) means that their generated \(\sigma\)-algebras
\(\sigma(X_i) = \{X_i^{-1}(B) : B \in \mathcal{B}(\mathbb{R})\}\)
are independent:
Definition: Independence of Random Variables
Random variables \(X_1, \ldots, X_n\) are independent if
for all Borel sets \(B_1, \ldots, B_n \in \mathcal{B}(\mathbb{R})\),
\[
\mathbb{P}(X_1 \in B_1, \ldots, X_n \in B_n)
\;=\; \prod_{i=1}^n \mathbb{P}(X_i \in B_i).
\]
Equivalently, the joint distribution factors as a product of marginals:
\[
P_{(X_1, \ldots, X_n)}
\;=\; P_{X_1} \otimes \cdots \otimes P_{X_n}.
\]
The second formulation — factoring the joint distribution into a product of
marginals — is the form most directly connected to the product measure
construction that follows.
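The factorization \(P_{(X,Y)} = P_X \otimes P_Y\) can be checked empirically; here is a sketch with a hypothetical pair of independent fair-die rolls.

```python
import numpy as np

# Empirical check that the joint distribution of two independent discrete
# variables factors into the product of marginals (assumed example: two
# independent fair dice).
rng = np.random.default_rng(4)
n = 1_000_000
X = rng.integers(1, 7, size=n)
Y = rng.integers(1, 7, size=n)

joint = np.zeros((6, 6))
np.add.at(joint, (X - 1, Y - 1), 1)   # empirical joint distribution
joint /= n

marg_X = np.bincount(X, minlength=7)[1:] / n
marg_Y = np.bincount(Y, minlength=7)[1:] / n
product = np.outer(marg_X, marg_Y)    # product of empirical marginals

print(np.abs(joint - product).max())  # small: joint ~ product of marginals
```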
A key consequence of independence concerns expectations of products:
Proposition: Expectation of Products
If \(X_1, \ldots, X_n\) are independent random variables with
\(\mathbb{E}[|X_i|] < \infty\) for each \(i\), then the product
\(X_1 \cdots X_n\) is integrable and
\[
\mathbb{E}[X_1 \cdots X_n] \;=\; \mathbb{E}[X_1] \cdots \mathbb{E}[X_n].
\]
This was used implicitly throughout
Variance and
Modes of Convergence
(for instance, in \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\) for
independent \(X, Y\)). The measure-theoretic proof reduces to Fubini's theorem
on the product space: under independence the joint distribution factors as
\(P_{(X_1, \ldots, X_n)} = P_{X_1} \otimes \cdots \otimes P_{X_n}\), and Fubini
converts the integral of the product against the product measure into a product
of integrals. The
worked example at the end of this section
performs this derivation in the case \(n = 2\); the general case follows by
induction.
Product Measures
When we say "\(X_1, \ldots, X_n\) are independent," we are asserting that the
joint probability space factors into a product. To make this precise, we need
the construction of product measures.
Definition: Product \(\sigma\)-Algebra and Product Measure
Let \((\Omega_1, \mathcal{F}_1, \mu_1)\) and \((\Omega_2, \mathcal{F}_2, \mu_2)\)
be \(\sigma\)-finite measure spaces.
The product \(\sigma\)-algebra
\(\mathcal{F}_1 \otimes \mathcal{F}_2\) is the \(\sigma\)-algebra
on \(\Omega_1 \times \Omega_2\) generated by all
measurable rectangles
\(A_1 \times A_2\) with \(A_1 \in \mathcal{F}_1\),
\(A_2 \in \mathcal{F}_2\).
There exists a unique measure
\(\mu_1 \otimes \mu_2\) on
\((\Omega_1 \times \Omega_2, \, \mathcal{F}_1 \otimes \mathcal{F}_2)\) satisfying
\[
(\mu_1 \otimes \mu_2)(A_1 \times A_2)
\;=\; \mu_1(A_1) \cdot \mu_2(A_2)
\]
for all measurable rectangles. This is the product measure.
The existence and uniqueness of the product measure follow from
Carathéodory's extension theorem.
The formula \(\mu_1(A_1)\,\mu_2(A_2)\) extends additively to the algebra of finite disjoint unions of
rectangles, and its \(\sigma\)-additivity on this algebra is verified by applying the MCT to
indicator functions of countable disjoint rectangle decompositions. The extension theorem then
promotes this premeasure to a measure on the full product \(\sigma\)-algebra; the
\(\sigma\)-finiteness assumption ensures uniqueness.
For probability spaces, \(\sigma\)-finiteness is automatic
(\(\mathbb{P}(\Omega) = 1 < \infty\)), so the product always exists.
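For finite discrete measures, the product measure is literally an outer product of probability vectors; the following sketch, with two hypothetical measures of our own, makes the rectangle formula concrete.

```python
import numpy as np

# Assumed discrete measures: mu1 on {0,1,2}, mu2 on {0,1}.  The product
# measure assigns (mu1 ⊗ mu2)({i} x {j}) = mu1({i}) * mu2({j}).
mu1 = np.array([0.2, 0.5, 0.3])
mu2 = np.array([0.6, 0.4])

product = np.outer(mu1, mu2)          # 3 x 2 table of rectangle masses

# Measure of the rectangle {0, 1} x {1}:
rect = product[[0, 1], 1].sum()
print(rect, (mu1[0] + mu1[1]) * mu2[1])   # both ~ 0.28

print(product.sum())   # total mass 1: a product of probability measures
```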
The construction extends to finite products \(\mu_1 \otimes \cdots \otimes \mu_n\) by induction. For countably
infinite products of probability measures, the product construction can be extended by verifying \(\sigma\)-additivity
on the algebra of cylinder sets and applying Carathéodory's extension theorem — the full statement and
proof go under the name Kolmogorov's extension theorem, developed in standard graduate
probability and real analysis references.
Fubini's Theorem and Tonelli's Theorem
The central computational tool for product measures is the ability to evaluate
a double integral as an iterated integral — to "integrate out" one variable at a time.
This is the content of Fubini's and Tonelli's theorems.
Theorem: Tonelli's Theorem
Let \((\Omega_1, \mathcal{F}_1, \mu_1)\) and
\((\Omega_2, \mathcal{F}_2, \mu_2)\) be \(\sigma\)-finite measure spaces,
and let \(f : \Omega_1 \times \Omega_2 \to [0, \infty]\) be measurable
with respect to \(\mathcal{F}_1 \otimes \mathcal{F}_2\). Then
\[
\int_{\Omega_1 \times \Omega_2} f \, d(\mu_1 \otimes \mu_2)
\;=\; \int_{\Omega_1} \!\left(\int_{\Omega_2} f(\omega_1, \omega_2) \, d\mu_2(\omega_2)\right) d\mu_1(\omega_1)
\;=\; \int_{\Omega_2} \!\left(\int_{\Omega_1} f(\omega_1, \omega_2) \, d\mu_1(\omega_1)\right) d\mu_2(\omega_2).
\]
In particular, the order of integration may be exchanged freely.
Tonelli's theorem applies to nonnegative measurable functions
without any integrability condition — the integrals may be infinite.
When the function is not necessarily nonnegative, we need an additional
assumption:
Theorem: Fubini's Theorem
Under the same setup as Tonelli's theorem, suppose
\(f : \Omega_1 \times \Omega_2 \to \mathbb{R}\) is
\((\mathcal{F}_1 \otimes \mathcal{F}_2)\)-measurable and
integrable with respect to the product measure:
\[
\int_{\Omega_1 \times \Omega_2} |f| \, d(\mu_1 \otimes \mu_2) \;<\; \infty.
\]
Then for \(\mu_1\)-a.e. \(\omega_1\), the function
\(\omega_2 \mapsto f(\omega_1, \omega_2)\) is \(\mu_2\)-integrable;
the function \(\omega_1 \mapsto \int_{\Omega_2} f(\omega_1, \omega_2) \, d\mu_2(\omega_2)\)
(defined \(\mu_1\)-a.e.) is \(\mu_1\)-integrable; and
\[
\int_{\Omega_1 \times \Omega_2} f \, d(\mu_1 \otimes \mu_2)
\;=\; \int_{\Omega_1} \!\left(\int_{\Omega_2} f(\omega_1, \omega_2) \, d\mu_2(\omega_2)\right) d\mu_1(\omega_1)
\;=\; \int_{\Omega_2} \!\left(\int_{\Omega_1} f(\omega_1, \omega_2) \, d\mu_1(\omega_1)\right) d\mu_2(\omega_2).
\]
The symmetric statement with the roles of \(\omega_1, \omega_2\) exchanged holds as well.
The relationship between the two theorems mirrors the MCT-DCT relationship:
Tonelli handles nonnegative functions unconditionally; Fubini handles general
(sign-changing) functions at the cost of an integrability condition.
A standard workflow is: apply Tonelli to \(|f|\) first to verify the integrability condition,
then invoke Fubini on \(f\) itself. This order is logically sound because Tonelli's theorem requires no
integrability prerequisite — it remains valid even when the iterated integrals evaluate to \(+\infty\). If Tonelli
applied to \(|f|\) yields a finite value, the integrability hypothesis for Fubini is rigorously confirmed.
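On a finite grid with counting-measure weights, "integrals" are double sums and the workflow can be acted out directly; this sketch uses a sign-changing integrand of our own choosing.

```python
import numpy as np

# Discrete sketch of the Tonelli-then-Fubini workflow with the assumed
# integrand f(i, j) = (-1)^(i+j) / (i^2 j^2) on a finite grid.
i = np.arange(1, 50)[:, None]
j = np.arange(1, 50)[None, :]
f = (-1.0) ** (i + j) / (i**2 * j**2)   # sign-changing integrand

# Step 1 (Tonelli on |f|): needs no hypotheses; finiteness of this sum
# verifies the integrability condition required by Fubini.
total_abs = np.abs(f).sum()

# Step 2 (Fubini on f): the two iterated sums agree.
sum_ij = f.sum(axis=1).sum()   # sum over j first, then i
sum_ji = f.sum(axis=0).sum()   # sum over i first, then j
print(total_abs, sum_ij, sum_ji)
```

On a finite grid the order exchange is trivially valid; the point of the sketch is the two-step structure, which is exactly the one used for genuine integrals.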
We do not prove Tonelli's and Fubini's theorems here. The proofs proceed by
the standard three-step measure-theoretic induction (indicators, then simple
functions, then general measurable functions via MCT), combined with the
monotone class theorem to promote rectangle-level identities to the full
product \(\sigma\)-algebra. Complete proofs are developed in standard graduate
probability and real analysis texts.
The Mathematical Justification of i.i.d. Sampling
We can now answer a question that has been implicit:
what does it mean, precisely, for \(X_1, X_2, \ldots, X_n\) to be "i.i.d. with distribution \(P\)"?
Definition: i.i.d. Random Variables
Random variables \(X_1, \ldots, X_n\) on a probability space
\((\Omega, \mathcal{F}, \mathbb{P})\) are
independent and identically distributed (i.i.d.) with
common distribution \(P\) if:
1. \(X_1, \ldots, X_n\) are independent (in the sense of
\(\sigma\)-algebra independence), and
2. \(P_{X_i} = P\) for each \(i = 1, \ldots, n\) (each has the same
pushforward measure).
Condition (2) alone says the variables have the same distribution.
Condition (1) says that their joint distribution factors:
\[
P_{(X_1, \ldots, X_n)} \;=\; P \otimes \cdots \otimes P \;=\; P^{\otimes n}.
\]
The product measure \(P^{\otimes n}\) on
\((\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))\) is the mathematical model for
"drawing \(n\) independent samples from \(P\)." The coordinate projections
\(\pi_i : \mathbb{R}^n \to \mathbb{R}\), \(\pi_i(x_1, \ldots, x_n) = x_i\),
are the random variables \(X_i\) in this model.
For an infinite sequence \((X_i)_{i=1}^\infty\), the i.i.d. property is imposed
on every finite sub-collection. Equivalently, the joint distribution on
\(\bigl(\mathbb{R}^{\mathbb{N}}, \, \bigotimes_{i=1}^\infty \mathcal{B}(\mathbb{R})\bigr)\)
is the infinite product measure \(P^{\otimes \infty}\), constructed via Kolmogorov's
extension theorem, and the coordinate projections serve as the random variables.
With this formalization, the laws of large numbers gain their full meaning.
The Strong Law says:
if \(\mathbb{E}_P[|X|] < \infty\) (where the expectation is taken against a
single copy of \(P\)), then
\[
\frac{1}{n} \sum_{i=1}^n \pi_i \;\xrightarrow{a.s.}\; \int_{\mathbb{R}} x \, dP(x)
\quad \text{under } P^{\otimes \infty},
\]
where \(P^{\otimes \infty}\) is the infinite product measure on
\(\bigl(\mathbb{R}^{\mathbb{N}}, \, \bigotimes_{i=1}^\infty \mathcal{B}(\mathbb{R})\bigr)\)
whose existence is guaranteed by Kolmogorov's extension theorem (referenced earlier in this chapter). The infinite product measure is the
foundational construction that makes the study of infinite sequences of random variables rigorous. Without it, the
entire infinite sequence \((X_1, X_2, \ldots)\) cannot be defined on a single probability space, and asymptotic
statements like the Strong Law ("as \(n \to \infty\), the sample mean converges a.s.") would lack a well-defined
probability space on which to measure convergence.
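The product-measure model of i.i.d. sampling can be simulated directly: a single call to a random generator produces one point of the (truncated) product space, and the running mean of the coordinate projections plays out the Strong Law. A sketch with an assumed choice \(P = \text{Exponential}(\text{mean } 2)\):

```python
import numpy as np

# One draw from P^{⊗n} with P = Exponential(mean 2): an n-vector whose
# coordinates are the projections pi_1, ..., pi_n.  The Strong Law says the
# running mean converges a.s. to E_P[X] = 2.
rng = np.random.default_rng(5)
n = 100_000
sample = rng.exponential(scale=2.0, size=n)

running_mean = np.cumsum(sample) / np.arange(1, n + 1)
for k in (100, 10_000, 100_000):
    print(k, running_mean[k - 1])   # settles near 2 as k grows
```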
Example: Fubini justifies the variance formula for sums.
Let \(X, Y\) be independent with \(\mathbb{E}[X^2], \mathbb{E}[Y^2] < \infty\).
The claim \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\) requires
\(\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]\). Under independence,
the joint distribution is \(P_X \otimes P_Y\), so by Fubini:
\[
\mathbb{E}[XY]
\;=\; \int_{\mathbb{R}^2} x \, y \; d(P_X \otimes P_Y)(x, y)
\;=\; \int_{\mathbb{R}} x \, dP_X(x) \;\cdot\; \int_{\mathbb{R}} y \, dP_Y(y)
\;=\; \mathbb{E}[X] \cdot \mathbb{E}[Y].
\]
Tonelli's theorem ensures the application is valid: applying Tonelli to \(|xy|\) under the product measure,
combined with the
change-of-variable formula,
gives \(\mathbb{E}[|XY|] = \mathbb{E}[|X|] \cdot \mathbb{E}[|Y|]\).
Since \(\mathbb{E}[X^2] < \infty\) implies \(\mathbb{E}[|X|] < \infty\) (and likewise for \(Y\)), the integrability
condition for Fubini is confirmed.
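The product rule \(\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]\) derived above is easy to verify by simulation; here is a sketch with a hypothetical pair \(X \sim \mathcal{N}(2, 1)\), \(Y \sim \text{Uniform}(0, 1)\).

```python
import numpy as np

# Monte Carlo check of E[XY] = E[X] E[Y] for independent X, Y with finite
# second moments (assumed distributions: X ~ N(2, 1), Y ~ Uniform(0, 1)).
rng = np.random.default_rng(6)
n = 500_000
X = rng.normal(2.0, 1.0, size=n)
Y = rng.uniform(0.0, 1.0, size=n)

cross = np.mean(X * Y)        # Monte Carlo E[XY]
prod = X.mean() * Y.mean()    # product of Monte Carlo marginal means

print(cross, prod)            # both ~ 2 * 0.5 = 1
```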
Connection to Machine Learning: Why i.i.d. Matters
Nearly every convergence guarantee in machine learning —
consistency of MLE,
convergence of empirical risk to population risk, validity of cross-validation —
assumes that the training data \((x_1, y_1), \ldots, (x_n, y_n)\) are i.i.d. draws from some
distribution \(P\).
The product measure \(P^{\otimes n}\) is what makes these guarantees mathematically precise:
it defines the probability space on which the sample lives, and Fubini's theorem is the tool
that lets us decompose expectations over the full sample into expectations over individual
observations. Without the product measure construction, the phrase "assume the data are i.i.d."
would be a slogan without mathematical content.
When the i.i.d. assumption fails — as it does for time-series data, reinforcement learning trajectories,
or data collected with adaptive sampling — the product measure model breaks down, and the theorems
of this section no longer apply directly. Understanding what the assumption buys us is essential
to recognizing when we can and cannot rely on it.