From Translation to Tools
In Measure-Theoretic Probability,
we translated the basic vocabulary of probability into the language of measure theory:
random variables became measurable functions, distributions became pushforward measures,
and expectations became Lebesgue integrals. That translation unified discrete and continuous
probability under a single integral \(\int g \, dP_X\).
This chapter completes the measure-theoretic foundation by putting the translated
machinery to work. First: when can we interchange limits
and expectations? The Monotone Convergence Theorem, Fatou's Lemma, and the
Dominated Convergence Theorem — proved for abstract measure spaces in
\(L^p\) Spaces —
now receive their probabilistic names and applications.
Second: what does independence really mean? The factorization condition
\(P(A \cap B) = P(A)P(B)\) from
Basic Probability
is elevated to independence of \(\sigma\)-algebras, formalized via product measures
and Fubini's theorem — the construction that gives "i.i.d. sampling" its
mathematical content.
Convergence Theorems for Probability
A recurring challenge in probability and machine learning is the
interchange of limits and expectations:
\[
\lim_{n \to \infty} \mathbb{E}[X_n] \;\stackrel{?}{=}\;
\mathbb{E}\!\left[\lim_{n \to \infty} X_n\right].
\]
When does this hold? Not always — and the conditions under which it does
hold are precisely the subject of the convergence theorems
from Lebesgue integration.
In \(L^p\) Spaces,
we stated and used the Monotone Convergence Theorem (MCT), Fatou's Lemma, and
the Dominated Convergence Theorem (DCT) to prove the completeness of \(L^p\).
There, the context was abstract measure spaces and the Riesz-Fischer theorem.
Here, we give these same theorems their probabilistic names and
interpretations: the measure becomes \(\mathbb{P}\),
the measurable functions become random variables, the integrals become expectations,
and "a.e." becomes "a.s."
The Translation Table
Before stating the theorems, we make the correspondence explicit. Every
statement in the left column is literally the same mathematical assertion
as the corresponding statement in the right column — only the vocabulary changes.
Measure Theory ↔ Probability Dictionary
Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space.
| Measure Theory (Section II) | Probability (Section III) |
|---|---|
| Measurable function \(f\) | Random variable \(X\) |
| \(\int_\Omega f \, d\mu\) | \(\mathbb{E}[X]\) |
| Almost everywhere (a.e.) | Almost surely (a.s.) |
| Convergence in measure | Convergence in probability |
| \(f \in L^p(\mu)\) | \(X \in L^p(\mathbb{P})\) |
| Dominating function \(h \in L^1(\mu)\) | Integrable bound: \(\mathbb{E}[\lvert Y \rvert] < \infty\) with \(\lvert X_n \rvert \leq Y\) a.s. |
In Convergence in \(L^p\),
we developed the hierarchy of convergence modes for measurable functions and
noted that on probability spaces, a.e. convergence becomes a.s. convergence
and convergence in measure becomes convergence in probability.
In Convergence,
we introduced these probabilistic modes — a.s., in probability, in distribution —
and established their
hierarchy.
We now complete the picture by showing how the three great limit theorems of
Lebesgue integration operate within this probabilistic framework.
The Three Limit Theorems in Probability
We state each theorem in probabilistic language, with explicit reference to
its analytical counterpart. The mathematical content is identical — only the
framing changes.
Theorem: Monotone Convergence Theorem (Probabilistic Form)
Let \((X_n)\) be a sequence of random variables satisfying
\(0 \leq X_1 \leq X_2 \leq \cdots\) a.s., and let
\(X = \lim_{n \to \infty} X_n\) (which exists in \([0, \infty]\) a.s.
by monotonicity). Then
\[
\lim_{n \to \infty} \mathbb{E}[X_n] \;=\; \mathbb{E}[X].
\]
Analytical form:
MCT in \(L^p\) Spaces,
with \(\mu = \mathbb{P}\).
The MCT requires two conditions: nonnegativity and monotone increase.
Under these conditions, expectations and limits commute without any
integrability assumption — the limit \(\mathbb{E}[X]\) may be \(+\infty\),
and the theorem still holds as an equality.
Example: Computing an infinite-series expectation.
Let \(Y_1, Y_2, \ldots \geq 0\) be nonnegative random variables. Define
\(X_n = \sum_{k=1}^n Y_k\). Then \((X_n)\) is nonnegative and increasing, so
by the MCT:
\[
\mathbb{E}\!\left[\sum_{k=1}^\infty Y_k\right]
\;=\; \lim_{n \to \infty} \mathbb{E}\!\left[\sum_{k=1}^n Y_k\right]
\;=\; \sum_{k=1}^\infty \mathbb{E}[Y_k].
\]
This justifies interchanging summation and expectation for nonnegative
terms — a step used constantly in probability (e.g., computing
expectations of counting random variables).
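The interchange can be checked numerically. The following sketch (assuming NumPy; the choice \(Y_k = U_k/2^k\) with \(U_k \sim \text{Uniform}(0,1)\) i.i.d. is an illustrative one, not from the text) compares a Monte Carlo estimate of \(\mathbb{E}[\sum_k Y_k]\) against the series of expectations:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, K = 100_000, 30

# Nonnegative terms Y_k = U_k / 2^k with U_k ~ Uniform(0,1) i.i.d.
# The partial sums X_n = sum_{k<=n} Y_k are nonnegative and increasing,
# so the MCT justifies swapping expectation and summation.
U = rng.uniform(size=(n_samples, K))
Y = U * 0.5 ** np.arange(1, K + 1)   # broadcasting: column k gets weight 2^{-k}

lhs = float(Y.sum(axis=1).mean())    # Monte Carlo estimate of E[ sum_k Y_k ]
rhs = float(Y.mean(axis=0).sum())    # sum_k of Monte Carlo estimates of E[Y_k]
exact = 0.5 * (1 - 0.5 ** K)         # sum_{k<=K} E[U]/2^k = 0.5 (1 - 2^{-K})
```

Because the terms are nonnegative, the same computation would remain valid (as an equality in \([0,\infty]\)) even if the series of expectations diverged; no integrability check is needed up front.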
Theorem: Fatou's Lemma (Probabilistic Form)
Let \((X_n)\) be a sequence of random variables satisfying \(X_n \geq 0\) a.s. Then
\[
\mathbb{E}\!\left[\liminf_{n \to \infty} X_n\right]
\;\leq\; \liminf_{n \to \infty} \mathbb{E}[X_n].
\]
Analytical form:
Fatou's Lemma in \(L^p\) Spaces,
with \(\mu = \mathbb{P}\).
Fatou's lemma is the safety net: when we cannot verify monotonicity (needed for MCT)
or find a dominating variable (needed for DCT), Fatou still gives an inequality.
The inequality can be strict — probability mass can "escape to infinity" in the limit.
Example: Mass escape.
Let \(X_n = n \cdot \mathbf{1}_{\{U \leq 1/n\}}\), where \(U \sim \text{Uniform}(0,1)\).
Then \(\mathbb{E}[X_n] = n \cdot (1/n) = 1\) for all \(n\).
But \(X_n \to 0\) a.s. (for any fixed \(\omega\) with \(U(\omega) > 0\),
eventually \(1/n < U(\omega)\) and \(X_n(\omega) = 0\)).
Thus \(\mathbb{E}[\lim X_n] = 0 < 1 = \lim \mathbb{E}[X_n]\).
Fatou's inequality is sharp here: the mass of \(X_n\) concentrates on an ever-shrinking
set, and is lost in the limit.
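A short simulation of this example (a sketch assuming NumPy; the sample size is arbitrary) makes the escaping mass concrete:

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.uniform(size=200_000)        # one draw of U per sample path

def X_n(n, u):
    # X_n = n * 1{U <= 1/n}: a spike of height n on a set of probability 1/n
    return n * (u <= 1.0 / n)

# Monte Carlo estimates of E[X_n]: each stays near 1...
means = [float(X_n(n, U).mean()) for n in (1, 10, 100, 1000)]
# ...yet pointwise, for any u > 0, X_n(u) = 0 once 1/n < u, so X_n -> 0 a.s.
```

Each \(\mathbb{E}[X_n]\) estimate stays near 1 even as the event \(\{U \leq 1/n\}\) carrying all the mass shrinks toward probability zero, which is exactly the strict inequality in Fatou's lemma.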
Theorem: Dominated Convergence Theorem (Probabilistic Form)
Let \((X_n)\) be a sequence of random variables converging almost surely to a random variable \(X\).
Suppose there exists an integrable dominating variable \(Y\) with \(\mathbb{E}[|Y|] < \infty\)
and \(|X_n| \leq Y\) a.s. for all \(n\). Then \(\mathbb{E}[|X|] < \infty\) and
\[
\lim_{n \to \infty} \mathbb{E}[X_n] \;=\; \mathbb{E}[X].
\]
Analytical form:
DCT in \(L^p\) Spaces,
with \(\mu = \mathbb{P}\).
The DCT is the most commonly applied of the three in practice. It guarantees the full equality
\(\lim \mathbb{E} = \mathbb{E}[\lim]\) at the cost of requiring an integrable dominating variable.
The domination condition \(|X_n| \leq Y\) with \(\mathbb{E}[|Y|] < \infty\) prevents the mass-escape
phenomenon seen in the Fatou example above.
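A concrete instance (a sketch assuming NumPy; the sequence is an illustrative choice): \(X_n = U^n\) with \(U \sim \text{Uniform}(0,1)\) is dominated by the constant \(Y \equiv 1\), so the DCT applies, and indeed \(\mathbb{E}[U^n] = 1/(n+1) \to 0 = \mathbb{E}[\lim_n U^n]\):

```python
import numpy as np

rng = np.random.default_rng(2)
U = rng.uniform(size=100_000)

ns = (1, 2, 5, 10, 50)
# |X_n| = U^n <= 1 a.s. (integrable dominating variable), and U^n -> 0 a.s.
# since U < 1 with probability one, so the DCT gives E[X_n] -> 0.
mc = [float(np.mean(U ** n)) for n in ns]       # Monte Carlo E[U^n]
exact = [1.0 / (n + 1) for n in ns]             # closed form E[U^n]
```

Contrast this with the Fatou example above: there no integrable dominating variable exists, and the conclusion of the DCT fails.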
A Decision Guide: Which Theorem When?
When faced with the need to interchange \(\lim\) and \(\mathbb{E}\),
the choice of theorem depends on what we know about the sequence:
Guide: Interchanging Limits and Expectations
Given a sequence \((X_n)\) with \(X_n \to X\) a.s.:
- If \(X_n \geq 0\) and \(X_n \uparrow X\) (nonnegative and increasing): use the MCT. No integrability condition needed; the limit may be infinite.
- If \(X_n \geq 0\) (nonnegative but not necessarily monotone): use Fatou. Yields only \(\mathbb{E}[X] \leq \liminf \mathbb{E}[X_n]\), not equality.
- If \(|X_n| \leq Y\) a.s. with \(\mathbb{E}[|Y|] < \infty\) (dominated by an integrable variable): use the DCT. Full equality; the most versatile tool.
The three theorems above give sufficient conditions for interchanging limits and expectations.
A deeper theory — uniform integrability and the associated Vitali convergence theorem — supplies
a condition that is both necessary and sufficient: \(X_n \to X\) in probability and \((X_n)\) is uniformly
integrable if and only if \(X_n \to X\) in \(L^1\). Domination by an integrable variable (the DCT hypothesis) is one way
to guarantee uniform integrability, but not the only way. This more refined criterion is essential for martingale theory and
will arise naturally if that path is pursued.
Connecting to the Convergence Hierarchy
In Convergence,
we established the hierarchy of probabilistic convergence modes:
\[
\xrightarrow{a.s.} \;\Longrightarrow\; \xrightarrow{P} \;\Longrightarrow\; \xrightarrow{D}.
\]
In Convergence in \(L^p\),
we proved the analytical counterpart: \(L^p\) convergence implies convergence
in measure (= convergence in probability), and a.e. convergence with domination
implies \(L^p\) convergence (the \(L^p\)-DCT).
The three limit theorems above add a new dimension to this picture: they
tell us when convergence of random variables (any of the
modes) can be upgraded to convergence of expectations.
The key insight is:
Principle: Convergence of Variables vs. Convergence of Expectations
Convergence of random variables (a.s., in probability, in distribution)
does not automatically imply convergence of their expectations.
Additional conditions — monotonicity (MCT), nonnegativity (Fatou), or
domination (DCT) — are required to pass the limit inside \(\mathbb{E}\).
Without such conditions, expectations can diverge even when the variables
themselves converge.
The Fatou example above illustrates this: \(X_n \to 0\) a.s. (the strongest
mode of convergence), yet \(\mathbb{E}[X_n] = 1 \not\to 0 = \mathbb{E}[X]\).
The limit theorems are precisely the tools that bridge this gap.
Connection to Machine Learning: Gradient-Expectation Interchange
In stochastic gradient descent (SGD), we compute the gradient of an expected loss:
\[
\nabla_{\boldsymbol{\theta}} \, \mathbb{E}_{X \sim P}\!\bigl[\ell(X, \boldsymbol{\theta})\bigr]
\;\stackrel{?}{=}\;
\mathbb{E}_{X \sim P}\!\bigl[\nabla_{\boldsymbol{\theta}} \, \ell(X, \boldsymbol{\theta})\bigr].
\]
The left side is the true gradient of the risk; the right side is the
expectation of sample gradients — the quantity SGD approximates via
mini-batches. The interchange is justified by the DCT: if we can find an
integrable function \(Y(X)\) such that
\(|\partial \ell / \partial \theta_j| \leq Y(X)\) for all \(\boldsymbol{\theta}\)
in a neighborhood, then the dominated convergence theorem (applied to
difference quotients) permits differentiation under the integral sign.
This is not a technicality — it is the theoretical foundation of SGD.
When the domination condition fails (e.g., with heavy-tailed gradients in
large language models), the interchange can break down, leading to the
gradient explosion phenomena that motivate gradient clipping.
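As a toy numerical check of the interchange (a sketch assuming NumPy; the squared loss and Gaussian data are illustrative choices, not tied to any particular model):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy risk: l(x, theta) = (x - theta)^2 with X ~ N(mu, 1).
# True gradient of the risk: d/dtheta E[(X - theta)^2] = 2 (theta - mu).
# Per-sample gradient: d l/d theta = 2 (theta - x); on any bounded
# neighborhood of theta it satisfies |2(theta - x)| <= 2(C + |x|), an
# integrable bound, so the DCT (applied to difference quotients) licenses
# differentiating under the expectation.
mu, theta = 1.5, 0.3
x = rng.normal(mu, 1.0, size=200_000)

true_grad = 2.0 * (theta - mu)                         # gradient of the risk
mean_sample_grad = float(np.mean(2.0 * (theta - x)))   # E of sample gradients
```

The Monte Carlo mean of sample gradients lands on the true risk gradient, which is the unbiasedness property that mini-batch SGD relies on.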
Independence and Product Measures
In Basic Probability,
we defined two events \(A\) and \(B\) to be independent if
\(\mathbb{P}(A \cap B) = \mathbb{P}(A)\,\mathbb{P}(B)\). This condition
was stated for individual events, without further explanation of why it
generalizes correctly to random variables, \(\sigma\)-algebras, or infinite
collections. The measure-theoretic framework provides the definitive
formulation.
Independence of \(\sigma\)-Algebras
The fundamental definition is not independence of events or of random variables,
but independence of \(\sigma\)-algebras. All other notions
are derived from it.
Definition: Independence of \(\sigma\)-Algebras
Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space.
Sub-\(\sigma\)-algebras \(\mathcal{G}_1, \ldots, \mathcal{G}_n \subseteq \mathcal{F}\)
are independent if
\[
\mathbb{P}(A_1 \cap \cdots \cap A_n)
\;=\; \mathbb{P}(A_1) \cdots \mathbb{P}(A_n)
\]
for every choice of \(A_i \in \mathcal{G}_i\), \(i = 1, \ldots, n\).
An arbitrary (possibly infinite) family \(\{\mathcal{G}_\alpha\}_{\alpha \in I}\) is independent if for
every finite collection of distinct indices \(\alpha_1, \ldots, \alpha_n \in I\), the sub-\(\sigma\)-algebras
\(\mathcal{G}_{\alpha_1}, \ldots, \mathcal{G}_{\alpha_n}\) are independent.
Independence of events is the special case where each
\(\mathcal{G}_i = \{\emptyset, A_i, A_i^c, \Omega\}\) — the smallest
\(\sigma\)-algebra containing \(A_i\).
Independence of random variables
\(X_1, \ldots, X_n\) means that their generated \(\sigma\)-algebras
\(\sigma(X_i) = \{X_i^{-1}(B) : B \in \mathcal{B}(\mathbb{R})\}\)
are independent:
Definition: Independence of Random Variables
Random variables \(X_1, \ldots, X_n\) are independent if
for all Borel sets \(B_1, \ldots, B_n \in \mathcal{B}(\mathbb{R})\),
\[
\mathbb{P}(X_1 \in B_1, \ldots, X_n \in B_n)
\;=\; \prod_{i=1}^n \mathbb{P}(X_i \in B_i).
\]
Equivalently, the joint distribution factors as a product of marginals:
\[
P_{(X_1, \ldots, X_n)}
\;=\; P_{X_1} \otimes \cdots \otimes P_{X_n}.
\]
The second formulation — factoring the joint distribution into a product of
marginals — is the form most directly connected to the product measure
construction that follows.
A key consequence of independence concerns expectations of products:
Proposition: Expectation of Products
If \(X_1, \ldots, X_n\) are independent random variables with
\(\mathbb{E}[|X_i|] < \infty\) for each \(i\), then
\[
\mathbb{E}[X_1 \cdots X_n] \;=\; \mathbb{E}[X_1] \cdots \mathbb{E}[X_n].
\]
This was used implicitly throughout
Variance and
Convergence
(for instance, in \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\) for
independent \(X, Y\)). The measure-theoretic proof reduces to Fubini's theorem
on the product space, which we develop next.
Product Measures
When we say "\(X_1, \ldots, X_n\) are independent," we are asserting that the
joint probability space factors into a product. To make this precise, we need
the construction of product measures.
Definition: Product \(\sigma\)-Algebra and Product Measure
Let \((\Omega_1, \mathcal{F}_1, \mu_1)\) and \((\Omega_2, \mathcal{F}_2, \mu_2)\)
be \(\sigma\)-finite measure spaces.
The product \(\sigma\)-algebra
\(\mathcal{F}_1 \otimes \mathcal{F}_2\) is the \(\sigma\)-algebra
on \(\Omega_1 \times \Omega_2\) generated by all
measurable rectangles
\(A_1 \times A_2\) with \(A_1 \in \mathcal{F}_1\),
\(A_2 \in \mathcal{F}_2\).
There exists a unique measure
\(\mu_1 \otimes \mu_2\) on
\((\Omega_1 \times \Omega_2, \, \mathcal{F}_1 \otimes \mathcal{F}_2)\) satisfying
\[
(\mu_1 \otimes \mu_2)(A_1 \times A_2)
\;=\; \mu_1(A_1) \cdot \mu_2(A_2)
\]
for all measurable rectangles. This is the product measure.
The existence and uniqueness of the product measure follow from
Carathéodory's extension theorem:
the formula on rectangles defines a premeasure on the algebra generated by
rectangles, and the extension theorem promotes it to a measure on the full
product \(\sigma\)-algebra. The \(\sigma\)-finiteness assumption ensures
uniqueness.
For probability spaces, \(\sigma\)-finiteness is automatic
(\(\mathbb{P}(\Omega) = 1 < \infty\)), so the product always exists.
The construction extends to finite products \(\mu_1 \otimes \cdots \otimes \mu_n\) by induction. For countably
infinite products of probability measures, the product construction can be extended by verifying \(\sigma\)-additivity
on the algebra of cylinder sets and applying Carathéodory's extension theorem.
Fubini's Theorem and Tonelli's Theorem
The central computational tool for product measures is the ability to evaluate
a double integral as an iterated integral — to "integrate out" one variable at a time.
This is the content of Fubini's and Tonelli's theorems.
Theorem: Tonelli's Theorem
Let \((\Omega_1, \mathcal{F}_1, \mu_1)\) and
\((\Omega_2, \mathcal{F}_2, \mu_2)\) be \(\sigma\)-finite measure spaces,
and let \(f : \Omega_1 \times \Omega_2 \to [0, \infty]\) be measurable
with respect to \(\mathcal{F}_1 \otimes \mathcal{F}_2\). Then
\[
\int_{\Omega_1 \times \Omega_2} f \, d(\mu_1 \otimes \mu_2)
\;=\; \int_{\Omega_1} \!\left(\int_{\Omega_2} f(\omega_1, \omega_2) \, d\mu_2(\omega_2)\right) d\mu_1(\omega_1)
\;=\; \int_{\Omega_2} \!\left(\int_{\Omega_1} f(\omega_1, \omega_2) \, d\mu_1(\omega_1)\right) d\mu_2(\omega_2).
\]
In particular, the order of integration may be exchanged freely.
Tonelli's theorem applies to nonnegative measurable functions
without any integrability condition — the integrals may be infinite.
When the function is not necessarily nonnegative, we need an additional
assumption:
Theorem: Fubini's Theorem
Under the same setup as Tonelli's theorem, suppose
\(f : \Omega_1 \times \Omega_2 \to \mathbb{R}\) is
\((\mathcal{F}_1 \otimes \mathcal{F}_2)\)-measurable and
integrable with respect to the product measure:
\[
\int_{\Omega_1 \times \Omega_2} |f| \, d(\mu_1 \otimes \mu_2) \;<\; \infty.
\]
Then the iterated integrals exist and are equal:
\[
\int_{\Omega_1 \times \Omega_2} f \, d(\mu_1 \otimes \mu_2)
\;=\; \int_{\Omega_1} \!\left(\int_{\Omega_2} f \, d\mu_2 \right) d\mu_1
\;=\; \int_{\Omega_2} \!\left(\int_{\Omega_1} f \, d\mu_1 \right) d\mu_2.
\]
The relationship between the two theorems mirrors the MCT-DCT relationship:
Tonelli handles nonnegative functions unconditionally; Fubini handles general
(sign-changing) functions at the cost of an integrability condition.
A standard workflow is: apply Tonelli to \(|f|\) first to
verify the integrability condition, then invoke Fubini on \(f\) itself.
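The workflow can be checked numerically on a simple nonnegative integrand (a sketch assuming NumPy; the function \(f(x,y) = x e^{-xy}\) and the grids are illustrative):

```python
import numpy as np

# f(x, y) = x exp(-x y) >= 0 on [0,1] x [0,2]: Tonelli guarantees that
# either iterated order yields the same value. Midpoint Riemann sums:
nx, ny = 2000, 2000
dx, dy = 1.0 / nx, 2.0 / ny
x = (np.arange(nx) + 0.5) * dx                 # midpoints in [0, 1]
y = (np.arange(ny) + 0.5) * dy                 # midpoints in [0, 2]
F = x[:, None] * np.exp(-x[:, None] * y[None, :])

order_xy = float((F.sum(axis=1) * dy).sum() * dx)   # inner over y, outer over x
order_yx = float((F.sum(axis=0) * dx).sum() * dy)   # inner over x, outer over y
exact = 0.5 + np.exp(-2.0) / 2.0   # = integral of (1 - e^{-2x}) over [0, 1]
```

The two orders agree to floating-point precision, as Tonelli predicts for a nonnegative integrand; for a sign-changing \(f\), one would first run the same computation on \(|f|\) to certify the Fubini hypothesis.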
The Mathematical Justification of i.i.d. Sampling
We can now answer a question that has been implicit since
Convergence:
what does it mean, precisely, for
\(X_1, X_2, \ldots, X_n\) to be "i.i.d. with distribution \(P\)"?
Definition: i.i.d. Random Variables
Random variables \(X_1, \ldots, X_n\) on a probability space
\((\Omega, \mathcal{F}, \mathbb{P})\) are
independent and identically distributed (i.i.d.) with
common distribution \(P\) if:
1. \(X_1, \ldots, X_n\) are independent (in the sense of
\(\sigma\)-algebra independence), and
2. \(P_{X_i} = P\) for each \(i = 1, \ldots, n\) (each has the same
pushforward measure).
Condition (2) alone says the variables have the same distribution.
Condition (1) says that their joint distribution factors:
\[
P_{(X_1, \ldots, X_n)} \;=\; P \otimes \cdots \otimes P \;=\; P^{\otimes n}.
\]
The product measure \(P^{\otimes n}\) on
\((\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))\) is the mathematical model for
"drawing \(n\) independent samples from \(P\)." The coordinate projections
\(\pi_i : \mathbb{R}^n \to \mathbb{R}\), \(\pi_i(x_1, \ldots, x_n) = x_i\),
are the random variables \(X_i\) in this model.
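In code, "drawing \(n\) independent samples from \(P\)" is exactly a draw from \(P^{\otimes n}\). The following sketch (assuming NumPy, with \(P = \text{Exp}(1)\) as an illustrative choice) checks the defining factorization of the joint distribution:

```python
import numpy as np

rng = np.random.default_rng(4)

# Each row is one draw from the product measure P (x) P on R^2, P = Exp(1);
# the columns play the role of the coordinate projections pi_1, pi_2.
samples = rng.exponential(scale=1.0, size=(500_000, 2))
X1, X2 = samples[:, 0], samples[:, 1]

# Independence: P(X1 <= a, X2 <= b) = P(X1 <= a) P(X2 <= b).
a, b = 1.0, 0.5
joint = float(np.mean((X1 <= a) & (X2 <= b)))
product = float(np.mean(X1 <= a) * np.mean(X2 <= b))
exact = float((1 - np.exp(-a)) * (1 - np.exp(-b)))
```

Empirically the joint probability matches the product of marginals (up to Monte Carlo error), which is the rectangle condition that generates the full product \(\sigma\)-algebra.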
With this formalization, the laws of large numbers gain their full meaning.
The Strong Law
(stated in Convergence) says:
\[
\frac{1}{n} \sum_{i=1}^n \pi_i \;\xrightarrow{a.s.}\; \int_{\mathbb{R}} x \, dP(x)
\quad \text{under } P^{\otimes \infty},
\]
where \(P^{\otimes \infty}\) is the infinite product measure on \((\mathbb{R}^\infty, \mathcal{B}(\mathbb{R})^{\otimes \infty})\).
The existence of this infinite product measure — constructed by verifying \(\sigma\)-additivity on cylinder sets and extending via
Carathéodory — is the foundational result that makes the study of infinite sequences of random variables rigorous.
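The statement can be watched numerically: a finite stretch of one sample path under \(P^{\otimes\infty}\) has running means that settle at the mean of \(P\). A sketch (assuming NumPy; \(P\) exponential with mean \(1/2\) is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(5)

# X_1, X_2, ... i.i.d. exponential with mean 0.5: a finite stretch of one
# path under the infinite product measure. SLLN: running mean -> 0.5 a.s.
xs = rng.exponential(scale=0.5, size=1_000_000)
running_mean = np.cumsum(xs) / np.arange(1, xs.size + 1)
final = float(running_mean[-1])
```

The a.s. statement is about individual paths like this one; changing the seed changes the path but not the limit.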
Example: Fubini justifies the variance formula for sums.
Let \(X, Y\) be independent with \(\mathbb{E}[X^2], \mathbb{E}[Y^2] < \infty\).
The claim \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\) requires
\(\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]\). Under independence,
the joint distribution is \(P_X \otimes P_Y\), so by Fubini:
\[
\mathbb{E}[XY]
\;=\; \int_{\mathbb{R}^2} x \, y \; d(P_X \otimes P_Y)(x, y)
\;=\; \int_{\mathbb{R}} x \, dP_X(x) \;\cdot\; \int_{\mathbb{R}} y \, dP_Y(y)
\;=\; \mathbb{E}[X] \cdot \mathbb{E}[Y].
\]
Tonelli's theorem ensures the application is valid: applying Tonelli to \(|xy|\) under the product measure gives
\(\mathbb{E}[|XY|] = \mathbb{E}[|X|] \cdot \mathbb{E}[|Y|]\).
Since \(\mathbb{E}[X^2] < \infty\) implies \(\mathbb{E}[|X|] < \infty\) (and likewise for \(Y\)), the integrability
condition for Fubini is confirmed.
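The chain \(\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y] \Rightarrow \text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y)\) can also be verified by simulation (a sketch assuming NumPy; the particular distributions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500_000
X = rng.normal(2.0, 1.0, size=n)      # Var(X) = 1
Y = rng.uniform(0.0, 1.0, size=n)     # Var(Y) = 1/12, drawn independently

# Independence + Fubini: E[XY] = E[X] E[Y], hence Cov(X, Y) = 0 and
# Var(X + Y) = Var(X) + Var(Y).
exy = float(np.mean(X * Y))
ex_ey = float(X.mean() * Y.mean())
var_sum = float(np.var(X + Y))
var_parts = float(np.var(X) + np.var(Y))
```

Replacing `Y` by a function of `X` (breaking independence) would make `exy` and `ex_ey` drift apart, and the variance identity would fail by exactly \(2\,\text{Cov}(X,Y)\).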
Connection to Machine Learning: Why i.i.d. Matters
Nearly every convergence guarantee in machine learning —
consistency of
MLE,
convergence of empirical risk to population risk, validity of
cross-validation — assumes that the training data
\((x_1, y_1), \ldots, (x_n, y_n)\) are i.i.d. draws from some
distribution \(P\).
The product measure \(P^{\otimes n}\) is what makes these guarantees
mathematically precise: it defines the probability space on which
the sample lives, and Fubini's theorem is the tool that lets us
decompose expectations over the full sample into expectations over
individual observations. Without the product measure construction,
the phrase "assume the data are i.i.d." would be a slogan without
mathematical content.
When the i.i.d. assumption fails — as it does for time-series data,
reinforcement learning trajectories, or data collected with adaptive
sampling — the product measure model breaks down, and the theorems
of this section no longer apply directly. Understanding what
the assumption buys us is essential to recognizing
when we can and cannot rely on it.
Looking Ahead
With this chapter, the measure-theoretic translation of probability is complete.
The \(\sigma\)-algebras and Lebesgue integrals of
Measure Theory
and
Lebesgue Integration
are now visible as the precise machinery behind the random variables,
expectations, and convergence results that have powered this section
from the beginning. The product measure construction gives
"i.i.d. sampling" its mathematical content, and the three limit theorems
tell us exactly when limits and expectations commute.
Several threads lead forward from here:
- The Radon-Nikodym theorem — invoked in
Measure-Theoretic Probability
to define the PDF as \(dP_X/d\lambda\) — has a far-reaching consequence:
conditional expectation as a measurable function.
For an integrable random variable \(X \in L^1(\mathbb{P})\), the
measure-theoretic conditional expectation \(\mathbb{E}[X \mid \mathcal{G}]\) is defined as
the Radon-Nikodym derivative of the signed measure \(A \mapsto \int_A X \, d\mathbb{P}\) with
respect to \(\mathbb{P}|_{\mathcal{G}}\). This concept is the foundation of martingale theory
and the starting point for stochastic calculus.
- Uniform integrability — mentioned in this chapter as a
refinement of the DCT — provides a necessary and sufficient condition for
\(L^1\) convergence: \(X_n \to X\) in probability and \((X_n)\) is uniformly
integrable if and only if \(X_n \to X\) in \(L^1\). This criterion is
essential for martingale convergence theorems and will arise naturally
if that path is pursued.
- On the analysis side,
Fourier analysis in Hilbert spaces
will take the \(L^2\) completeness proven in
\(L^p\) Spaces
and develop the Fourier transform as a unitary operator — the mathematical
engine behind both the Nyquist-Shannon sampling theorem and
Heisenberg's uncertainty principle.
The measure-theoretic viewpoint developed across these two chapters is not a detour from
applied probability — it is its foundation. Every time we write
\(\mathbb{E}[\,\cdot\,]\), invoke the law of large numbers, or assume data
are i.i.d., we are implicitly invoking these constructions.
Making them explicit is what turns intuition into proof.