From Two Integrals to One
In our early study of probability, we were forced to treat discrete and continuous
random variables as two separate worlds — one ruled by summation \(\sum\),
the other by integration \(\int\). In
Random Variables,
we defined a function \(X: S \to \mathbb{R}\) (where \(S\) was the sample space,
which we will now rewrite as \(\Omega\) to conform with measure-theoretic notation)
that assigns a numerical value to each outcome, and we computed expectations by
summation for discrete variables and by integration against a density for continuous ones.
But nature does not divide data into such neat categories. A mixed distribution — partly
discrete, partly continuous — cannot be handled cleanly by either formula alone.
The discrete-continuous dichotomy is not a feature of probability — it is an artifact
of having two separate integration theories. We need a single framework in which
all random variables and all expectations are treated uniformly.
The tools to build that framework already exist. In Measure Theory,
we constructed the \(\sigma\)-algebra and probability measure as the mathematical substrate of randomness. In
Lebesgue Integration, we built an integral powerful enough to handle
highly irregular functions on arbitrary measure spaces. In \(L^p\) Spaces,
we proved that the resulting function spaces are complete.
This chapter brings these tools to bear on probability. We do not repeat the definitions of \(\sigma\)-algebras, measures, or the Lebesgue
integral — those foundations are in place. Instead, we translate: each concept from classical probability receives its
measure-theoretic formulation, and each measure-theoretic theorem receives its probabilistic name. The result is a unified
framework in which discrete sums, continuous integrals, and mixed distributions are all special cases of a single operation: integration
against a measure.
Random Variables as Measurable Functions
We defined a random variable as
a function that assigns a numerical value to each outcome in the sample space. This definition is correct
in spirit but incomplete in one critical respect: it imposes no condition on which functions qualify.
In the measure-theoretic framework, not every function from \(\Omega\) to \(\mathbb{R}\) deserves to be called a
random variable — only those compatible with the \(\sigma\)-algebra structure that determines which events can
be assigned probabilities.
The Measurability Condition
Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a
probability space,
and let \((\mathbb{R}, \mathcal{B}(\mathbb{R}))\) be the real line equipped with its
Borel \(\sigma\)-algebra
— the \(\sigma\)-algebra generated by all open intervals.
Definition: Random Variable (Measure-Theoretic)
A random variable is a measurable function
\(X : (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))\).
That is, \(X\) satisfies the measurability condition:
\[
X^{-1}(B) \;=\; \{\omega \in \Omega : X(\omega) \in B\} \;\in\; \mathcal{F}
\quad \text{for every } B \in \mathcal{B}(\mathbb{R}).
\]
The condition asks: for every "reasonable" subset \(B\) of the real line (every Borel set),
the set of outcomes \(\omega\) for which \(X(\omega)\) lands in \(B\) must be an event —
that is, a member of \(\mathcal{F}\), to which \(\mathbb{P}\) can assign a probability.
Without this condition, the expression \(\mathbb{P}(X \in B)\) might be undefined:
we would be asking for the probability of a set that our \(\sigma\)-algebra does not recognize.
In practice, measurability is verified via a useful shortcut. Since
\(\mathcal{B}(\mathbb{R})\) is generated by half-lines \((-\infty, a]\), it suffices
to check preimages of these generators:
Proposition: Measurability via Half-Lines
A function \(X : \Omega \to \mathbb{R}\) is measurable with respect to
\((\mathcal{F}, \mathcal{B}(\mathbb{R}))\) if and only if
\[
\{\omega \in \Omega : X(\omega) \leq a\} \in \mathcal{F}
\quad \text{for every } a \in \mathbb{R}.
\]
Proof:
(\(\Rightarrow\)) Half-lines \((-\infty, a]\) are closed, hence Borel
(as complements of open sets), so their preimages under a measurable \(X\) lie in
\(\mathcal{F}\).
(\(\Leftarrow\)) The collection
\(\mathcal{E} = \{(-\infty, a] : a \in \mathbb{R}\}\) generates
\(\mathcal{B}(\mathbb{R})\), since every open interval, and hence every
open set, can be built from countable operations on half-lines. Consider the
collection of "good" sets
\(\mathcal{G} = \{B \subseteq \mathbb{R} : X^{-1}(B) \in \mathcal{F}\}\).
Because the preimage map commutes with complements, countable unions, and countable
intersections, \(\mathcal{G}\) is a \(\sigma\)-algebra. By hypothesis
\(\mathcal{E} \subseteq \mathcal{G}\), so
\(\mathcal{B}(\mathbb{R}) = \sigma(\mathcal{E}) \subseteq \mathcal{G}\); that is,
\(X^{-1}(B) \in \mathcal{F}\) for every Borel set \(B\).
This connects directly to the familiar CDF: the measurability condition is precisely
the requirement that the cumulative distribution function
\(F_X(a) = \mathbb{P}(X \leq a)\) is well-defined for every \(a \in \mathbb{R}\).
Why Not Every Function Qualifies
On an abstract probability space, the \(\sigma\)-algebra \(\mathcal{F}\) can be
much smaller than the power set \(2^\Omega\). Consider
\(\Omega = \{H, T\}\) with \(\mathcal{F} = \{\emptyset, \Omega\}\) — the
trivial \(\sigma\)-algebra. Define \(X(H) = 1\) and \(X(T) = 0\). Then
\(X^{-1}((-\infty, 0.5]) = \{T\}\), but \(\{T\} \notin \mathcal{F}\).
So \(\mathbb{P}(X \leq 0.5)\) is undefined — this \(X\) is not a random variable
with respect to this \(\sigma\)-algebra.
The issue disappears if we enlarge to \(\mathcal{F} = \{\emptyset, \{H\}, \{T\}, \Omega\}\),
because every preimage of a Borel set now belongs to \(\mathcal{F}\).
The lesson is that "being a random variable" is not an intrinsic property of a function —
it depends on the relationship between the function and the \(\sigma\)-algebra.
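The coin example can be checked mechanically. The following is a minimal sketch, assuming a finite sample space so that a \(\sigma\)-algebra can be stored as a set of frozensets; the helper names `preimage` and `is_measurable` are illustrative and not part of the text. It applies the half-line criterion from the proposition above.

```python
def preimage(X, omega, a):
    """Return {w in omega : X(w) <= a} as a frozenset."""
    return frozenset(w for w in omega if X[w] <= a)

def is_measurable(X, omega, sigma_algebra):
    """Check the half-line criterion on a finite sample space.

    X takes finitely many values, so {X <= a} only changes when a crosses
    an attained value. Testing each attained value, plus one threshold
    below all of them, therefore covers every real a.
    """
    values = sorted(set(X[w] for w in omega))
    thresholds = [values[0] - 1] + values
    return all(preimage(X, omega, a) in sigma_algebra for a in thresholds)

omega = {"H", "T"}
X = {"H": 1, "T": 0}

# The two sigma-algebras discussed in the text
trivial = {frozenset(), frozenset(omega)}
full = {frozenset(), frozenset({"H"}), frozenset({"T"}), frozenset(omega)}

print(is_measurable(X, omega, trivial))  # {T} is not an event here
print(is_measurable(X, omega, full))     # every preimage is an event
```

The same function \(X\) fails the test against the trivial \(\sigma\)-algebra and passes against the full one, which is the point of the example: measurability is a property of the pair, not of the function alone.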
The Distribution of a Random Variable
Given a random variable \(X\), we constantly ask questions of the form
"what is \(\mathbb{P}(X \in B)\)?" for various Borel sets \(B\). This collection of
probabilities is itself a measure on \((\mathbb{R}, \mathcal{B}(\mathbb{R}))\),
and it encodes everything we need to know about \(X\) from a probabilistic standpoint.
Definition: Distribution (Pushforward Measure)
Let \(X : (\Omega, \mathcal{F}, \mathbb{P}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))\)
be a random variable. The distribution (or law)
of \(X\) is the probability measure \(P_X\) on \((\mathbb{R}, \mathcal{B}(\mathbb{R}))\)
defined by
\[
P_X(B) \;=\; \mathbb{P}(X \in B) \;=\; \mathbb{P}\bigl(X^{-1}(B)\bigr)
\quad \text{for every } B \in \mathcal{B}(\mathbb{R}).
\]
This measure is also called the pushforward of \(\mathbb{P}\)
by \(X\), and is written \(P_X = \mathbb{P} \circ X^{-1}\).
The intuition is geometric: the function \(X\) transports the probability mass from the abstract space \(\Omega\)
onto the real line \(\mathbb{R}\). The mass that \(P_X\) assigns to a Borel set \(B\) is exactly the mass that
\(\mathbb{P}\) assigns to \(X^{-1}(B)\), the set of outcomes that \(X\) sends into \(B\).
The distribution \(P_X\) is simply the result of "weighing" the real line according to how \(X\) redistributes
the original probability mass.
That \(P_X\) is indeed a probability measure follows from the properties
of \(\mathbb{P}\). First, \(P_X(\mathbb{R}) = \mathbb{P}(X^{-1}(\mathbb{R})) = \mathbb{P}(\Omega) = 1\).
For countable additivity, let \(B_1, B_2, \ldots \in \mathcal{B}(\mathbb{R})\) be disjoint.
Since preimages commute with countable unions and \(X^{-1}(B_j) \cap X^{-1}(B_k) = X^{-1}(B_j \cap B_k) = \emptyset\)
for \(j \neq k\), the preimages \(\{X^{-1}(B_k)\}\) are disjoint measurable sets, and
\[
P_X\Bigl(\bigsqcup_k B_k\Bigr)
\;=\; \mathbb{P}\Bigl(\bigsqcup_k X^{-1}(B_k)\Bigr)
\;=\; \sum_k \mathbb{P}(X^{-1}(B_k))
\;=\; \sum_k P_X(B_k),
\]
using countable additivity of \(\mathbb{P}\).
The pushforward perspective reveals that all the information about \(X\) that matters
probabilistically — its CDF, its moments, its tail behavior — is encoded in the single
measure \(P_X\) on \(\mathbb{R}\). Two random variables defined on entirely different
probability spaces can have the same distribution, and from the perspective of
any question answerable by \(P_X\) alone, they are indistinguishable.
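To make the last point concrete, here is a small sketch on finite spaces, where the pushforward can be computed by summing masses over outcomes. The two spaces (a fair coin and a fair die) and the function name `pushforward` are illustrative choices, not taken from the text.

```python
from collections import defaultdict
from fractions import Fraction

def pushforward(P, X):
    """Distribution of X on a finite space: P_X({x}) = P({w : X(w) = x})."""
    P_X = defaultdict(Fraction)  # Fraction() is exact zero
    for w, mass in P.items():
        P_X[X(w)] += mass
    return dict(P_X)

# Space 1: a fair coin, X = 1 on heads
P_coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
X = lambda w: 1 if w == "H" else 0

# Space 2: a fair die, Y = 1 on even faces
P_die = {k: Fraction(1, 6) for k in range(1, 7)}
Y = lambda k: 1 if k % 2 == 0 else 0

law_X = pushforward(P_coin, X)
law_Y = pushforward(P_die, Y)
print(law_X)
print(law_Y)
print(law_X == law_Y)
```

Although \(X\) and \(Y\) live on different sample spaces, their laws coincide, so any quantity computed from the distribution alone (mean, variance, CDF) agrees for both.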
Recovering PMFs and PDFs
The pushforward framework unifies the discrete and continuous cases that were treated
separately in Random Variables.
Discrete case. If \(X\) takes values in a countable set
\(\{x_1, x_2, \ldots\}\), then \(P_X\) is concentrated on these points:
\[
P_X(B) \;=\; \sum_{k:\, x_k \in B} \mathbb{P}(X = x_k)
\;=\; \sum_{k:\, x_k \in B} p(x_k).
\]
Here \(p(x_k) = \mathbb{P}(X = x_k)\) is the probability mass function.
Formally, \(P_X\) is the weighted sum of point masses:
\(P_X = \sum_k p(x_k) \, \delta_{x_k}\), where \(\delta_{x_k}(B) = 1\) if
\(x_k \in B\) and \(0\) otherwise. The measure \(P_X\) is
singular with respect to the
Lebesgue measure \(\lambda\):
it is supported on a countable set, which has \(\lambda\)-measure zero.
Continuous case. If \(P_X\) is
absolutely continuous with respect to
\(\lambda\) — meaning \(\lambda(B) = 0\) implies \(P_X(B) = 0\) — then
there exists a nonnegative measurable function \(f_X\) such that
\[
P_X(B) \;=\; \int_B f_X(x) \, d\lambda(x)
\quad \text{for every } B \in \mathcal{B}(\mathbb{R}).
\]
The function \(f_X\) is the probability density function (PDF),
and the relationship is written compactly as
\[
f_X \;=\; \frac{dP_X}{d\lambda},
\]
the Radon-Nikodym derivative of \(P_X\) with respect to \(\lambda\).
The fractional notation is not merely symbolic. On a small interval \([x, x + dx]\), it encodes the approximation
\(P_X([x, x+dx]) \approx f_X(x) \cdot \lambda([x, x+dx])\) — that is, "probability \(\approx\) density \(\times\) length."
The density \(f_X(x)\) measures the local rate at which \(P_X\) accumulates probability relative to the rate at which Lebesgue
measure \(\lambda\) accumulates length.
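The "density times length" approximation can be observed numerically. The sketch below assumes a standard normal distribution purely for illustration (the text does not fix a distribution here) and uses the closed-form CDF via `math.erf`; the helper `ratio` is a hypothetical name.

```python
import math

def normal_cdf(x):
    """CDF of the standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(x):
    """Density of the standard normal with respect to Lebesgue measure."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def ratio(x, dx):
    """P_X([x, x+dx]) divided by f_X(x) * lambda([x, x+dx])."""
    prob = normal_cdf(x + dx) - normal_cdf(x)
    return prob / (normal_pdf(x) * dx)

for dx in (0.1, 0.01, 0.001):
    print(dx, ratio(1.0, dx))
```

As the interval shrinks, the ratio approaches \(1\), which is the sense in which \(f_X\) is a local rate of \(P_X\) relative to \(\lambda\).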
The Radon-Nikodym theorem
guarantees the existence of \(f_X\) whenever \(P_X \ll \lambda\) (absolute continuity) and the
dominating measure \(\lambda\) is \(\sigma\)-finite — a condition satisfied here, since \(\lambda\) is \(\sigma\)-finite
on \(\mathbb{R}\). The proof is developed in a
later chapter; for now we use the result freely.
This perspective elevates the PDF from "the derivative of the CDF" to a precise statement about how one measure relates
to another.
The End of the Discrete-Continuous Dichotomy
The pushforward measure \(P_X\) treats all random variables uniformly.
A discrete variable has \(P_X\) concentrated on isolated points.
A continuous variable has \(P_X\) spread smoothly via a density.
A mixed distribution — for example, a variable that
equals zero with probability \(1/2\) and is otherwise uniformly
distributed on \([0,1]\) — is simply a measure that has both a point mass
and an absolutely continuous component:
\[
P_X \;=\; \tfrac{1}{2}\,\delta_0 \;+\; \tfrac{1}{2}\,\lambda\big|_{[0,1]}.
\]
No special formulas are needed. The Lebesgue integral against \(P_X\) handles
all three cases — discrete, continuous, and mixed — in a single expression.
This unification is not merely an aesthetic convenience; it matters in practice
in modern machine learning, where models such as
variational autoencoders
with discrete latent variables (e.g., Gumbel-Softmax, VQ-VAE) and hybrid
models with mixed discrete-continuous outputs call for a probability framework
that is not tied to counting measure alone or to Lebesgue measure alone.
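For the mixed distribution \(P_X = \tfrac{1}{2}\delta_0 + \tfrac{1}{2}\lambda|_{[0,1]}\) above, integrating a function \(g\) against \(P_X\) splits into an atom term and a density term. The following sketch evaluates it numerically; the function name `expect_mixed` and the midpoint rule for the Lebesgue part are illustrative choices, not a prescribed method.

```python
def expect_mixed(g, n=1000):
    """Approximate the integral of g against 1/2 * delta_0 + 1/2 * Lebesgue on [0, 1].

    The point mass contributes 1/2 * g(0) exactly. The absolutely continuous
    part contributes 1/2 * (integral of g over [0, 1]), approximated here
    by a midpoint rule with n subintervals.
    """
    atom_part = 0.5 * g(0.0)
    h = 1.0 / n
    integral = sum(g((i + 0.5) * h) for i in range(n)) * h
    return atom_part + 0.5 * integral

print(expect_mixed(lambda x: x))                          # mean of X
print(expect_mixed(lambda x: x * x))                      # second moment
print(expect_mixed(lambda x: 1.0 if x <= 0.5 else 0.0))   # P(X <= 0.5)
```

By hand, the mean is \(\tfrac{1}{2}\cdot 0 + \tfrac{1}{2}\cdot\tfrac{1}{2} = \tfrac{1}{4}\) and \(\mathbb{P}(X \leq 0.5) = \tfrac{1}{2} + \tfrac{1}{2}\cdot\tfrac{1}{2} = \tfrac{3}{4}\); neither a pure sum nor a pure density integral produces these values on its own.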
Random Vectors and Measurability
Everything above extends to vector-valued random variables. A
random vector
\(\mathbf{X} = (X_1, \ldots, X_d) : \Omega \to \mathbb{R}^d\) is measurable
with respect to \((\mathcal{F}, \mathcal{B}(\mathbb{R}^d))\), where
\(\mathcal{B}(\mathbb{R}^d)\) is the Borel \(\sigma\)-algebra on \(\mathbb{R}^d\).
A convenient fact simplifies verification:
Proposition: Component-wise Measurability
A function \(\mathbf{X} = (X_1, \ldots, X_d) : \Omega \to \mathbb{R}^d\) is
\((\mathcal{F}, \mathcal{B}(\mathbb{R}^d))\)-measurable if and only if
each component \(X_i : \Omega \to \mathbb{R}\) is
\((\mathcal{F}, \mathcal{B}(\mathbb{R}))\)-measurable.
The proof relies on the identity
\(\mathcal{B}(\mathbb{R}^d) = \mathcal{B}(\mathbb{R}) \otimes \cdots \otimes \mathcal{B}(\mathbb{R})\)
— the Borel \(\sigma\)-algebra of the product space coincides with the product
of the Borel \(\sigma\)-algebras. The inclusion \(\supseteq\) is immediate: each
measurable rectangle \(B_1 \times \cdots \times B_d\) with \(B_i \in \mathcal{B}(\mathbb{R})\)
is a Borel set in \(\mathbb{R}^d\). The reverse inclusion uses the second-countability
of \(\mathbb{R}^d\) (a countable base of open sets given by rectangles with rational
corners), so every open set is a countable union of measurable rectangles and hence
lies in \(\mathcal{B}(\mathbb{R}) \otimes \cdots \otimes \mathcal{B}(\mathbb{R})\).
The product \(\sigma\)-algebra construction itself is developed in the
next chapter.
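The product \(\sigma\)-algebra is generated by measurable rectangles
\(B_1 \times \cdots \times B_d\) with each \(B_i \in \mathcal{B}(\mathbb{R})\).
Since \(\mathbf{X}^{-1}(B_1 \times \cdots \times B_d) = \bigcap_{i=1}^d X_i^{-1}(B_i)\),
component-wise measurability ensures that all preimages of generators belong to \(\mathcal{F}\),
which suffices by the same generator argument used for half-lines.
Conversely, if \(\mathbf{X}\) is measurable, then each component
\(X_i = \pi_i \circ \mathbf{X}\) is measurable, because the coordinate projection
\(\pi_i : \mathbb{R}^d \to \mathbb{R}\) is continuous and hence Borel-measurable.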
The distribution of a random vector is the pushforward \(P_{\mathbf{X}} = \mathbb{P} \circ \mathbf{X}^{-1}\) on
\((\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))\). When \(P_{\mathbf{X}}\) has a density \(f_{\mathbf{X}}\) with
respect to the Lebesgue measure on \(\mathbb{R}^d\), we recover exactly the joint density studied in
Multivariate Distributions: for instance, the multivariate normal density
with mean \(\boldsymbol{\mu}\) and covariance \(\boldsymbol{\Sigma}\) is the Radon-Nikodym derivative
\(f_{\mathbf{X}} = dP_{\mathbf{X}}/d\lambda^d\).
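The rectangle identity \(\mathbf{X}^{-1}(B_1 \times B_2) = X_1^{-1}(B_1) \cap X_2^{-1}(B_2)\) used in the component-wise argument can be checked directly on a finite space. The sketch below assumes two dice and a hypothetical random vector (first die, sum of both dice); these choices are for illustration only.

```python
from itertools import product

# Sample space: ordered pairs of die faces
omega = list(product(range(1, 7), repeat=2))

X1 = lambda w: w[0]          # first die
X2 = lambda w: w[0] + w[1]   # sum of both dice
X = lambda w: (X1(w), X2(w)) # the random vector (X1, X2)

def preimage(f, B):
    """Return {w in omega : f(w) in B}."""
    return frozenset(w for w in omega if f(w) in B)

B1 = {1, 2}
B2 = {7, 8}
rectangle = set(product(B1, B2))  # B1 x B2 as a set of pairs

lhs = preimage(X, rectangle)
rhs = preimage(X1, B1) & preimage(X2, B2)
print(lhs == rhs)
print(sorted(lhs))
```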
Expectation as Lebesgue Integral
In Random Variables, we
defined the
expected value
of \(X\) using two separate formulas: a sum \(\sum x \, p(x)\) for discrete variables and an integral
\(\int x \, f(x) \, dx\) for continuous ones. This bifurcation is awkward — every theorem about expectations
had to be stated (or proved) twice, once for each case. The Lebesgue integral eliminates this entirely.
The Unified Definition
Recall that when the underlying measure space is a probability space \((\Omega, \mathcal{F}, \mathbb{P})\), the
Lebesgue integral of a random variable \(X\) is called its expectation and is written
\[
\mathbb{E}[X] \;=\; \int_\Omega X(\omega) \, d\mathbb{P}(\omega).
\]
This single expression replaces both the discrete sum and the continuous
integral. The construction of the Lebesgue integral — from
indicator functions
through
simple functions
to
general measurable functions —
is already in place. What we do here is not repeat that construction but
show how it absorbs the two classical formulas as special cases.
Recovering the Classical Formulas
The bridge between integration over the abstract space \(\Omega\) and
integration over \(\mathbb{R}\) (where we actually compute) is the
change-of-variable formula for pushforward measures.
Theorem: Change of Variables (Law of the Unconscious Statistician)
Let \(X : (\Omega, \mathcal{F}, \mathbb{P}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))\)
be a random variable with distribution \(P_X = \mathbb{P} \circ X^{-1}\),
and let \(g : \mathbb{R} \to \mathbb{R}\) be a Borel-measurable function.
Then \(g(X) = g \circ X\) is measurable (as the composition of a Borel-measurable
function with a random variable), and
\[
\mathbb{E}[g(X)]
\;=\; \int_\Omega g\bigl(X(\omega)\bigr) \, d\mathbb{P}(\omega)
\;=\; \int_{\mathbb{R}} g(x) \, dP_X(x),
\]
provided either side is well-defined (i.e., the integral of \(|g(X)|\) is finite,
or \(g \geq 0\)).
Proof:
We retrace the three-step construction of the Lebesgue integral. For an
indicator function \(g = \mathbf{1}_B\), the left side is
\(\mathbb{P}(X \in B)\) and the right side is \(P_X(B)\) — these are equal
by the definition of the pushforward. By linearity, the identity extends
to simple functions (finite linear combinations of indicators). For
general \(g \geq 0\), approximate from below by a monotone sequence of
simple functions \(g_n \uparrow g\); the
Monotone Convergence Theorem
then yields the identity in the limit. For arbitrary measurable \(g\),
split into positive and negative parts \(g = g^+ - g^-\) and apply the
nonnegative case to each; the assumption \(\mathbb{E}[|g(X)|] \lt \infty\)
ensures \(\mathbb{E}[g^+(X)]\) and \(\mathbb{E}[g^-(X)]\) are separately
finite (avoiding \(\infty - \infty\)), and subtraction gives the identity
for \(g\).
The first equality is the definition of \(\mathbb{E}[g(X)]\); the second is the
content of the theorem. It says: to compute the expectation of \(g(X)\), we do
not need access to the abstract space \(\Omega\) — we can integrate \(g\) directly
against the distribution \(P_X\) on \(\mathbb{R}\). This is precisely what we
always did in practice: no one computes expectations by integrating over "all
outcomes of the experiment." We integrate over the range of the random variable,
weighted by its distribution.
From here, the two classical formulas are immediate:
Discrete case. If \(P_X = \sum_k p(x_k) \, \delta_{x_k}\)
(a sum of point masses), then integration against \(P_X\) reduces to summation:
\[
\mathbb{E}[g(X)] \;=\; \int_{\mathbb{R}} g(x) \, dP_X(x)
\;=\; \sum_k g(x_k) \, p(x_k).
\]
Setting \(g(x) = x\) recovers the formula \(\mathbb{E}[X] = \sum_k x_k \, p(x_k)\).
Continuous case. If \(P_X\) has density
\(f_X = dP_X/d\lambda\), then the defining property \(P_X(B) = \int_B f_X \, d\lambda\), extended from indicators to general \(g\) by the same three-step argument, gives:
\[
\mathbb{E}[g(X)] \;=\; \int_{\mathbb{R}} g(x) \, dP_X(x)
\;=\; \int_{\mathbb{R}} g(x) \, f_X(x) \, dx.
\]
Setting \(g(x) = x\) recovers the formula \(\mathbb{E}[X] = \int x \, f_X(x) \, dx\).
The "Law of the Unconscious Statistician" — traditionally presented as a
computational trick — is revealed as a theorem about the relationship between
a measure and its pushforward. What seemed like two separate definitions
were always a single integral \(\int g \, dP_X\),
specialized to different types of measures.
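On a finite probability space both sides of the change-of-variables identity are finite sums, so the theorem can be verified exactly. The sketch below assumes two fair dice with \(X\) their sum and \(g(x) = x^2\); these are illustrative choices, and exact rational arithmetic avoids rounding.

```python
from fractions import Fraction
from itertools import product

# Abstract space: 36 equally likely outcomes
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

X = lambda w: w[0] + w[1]  # random variable: sum of the dice
g = lambda x: x * x        # Borel function applied to X

# Left side: integrate g(X(w)) over Omega against P
lhs = sum(g(X(w)) * P[w] for w in omega)

# Right side: build the pushforward P_X, then integrate g against it
P_X = {}
for w, mass in P.items():
    P_X[X(w)] = P_X.get(X(w), Fraction(0)) + mass
rhs = sum(g(x) * mass for x, mass in P_X.items())

print(lhs, rhs, lhs == rhs)
```

The left sum ranges over 36 outcomes while the right sum ranges over only the 11 attained values; the second computation never touches \(\Omega\) once \(P_X\) is known.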
Properties of Expectation: One Proof for All
The Lebesgue integral inherits linearity, monotonicity, and compatibility
with limits from its construction. These properties now hold for
all random variables simultaneously, with no case splitting:
Theorem: Properties of Expectation
Let \(X, Y\) be random variables on
\((\Omega, \mathcal{F}, \mathbb{P})\) and let \(a, b \in \mathbb{R}\). Then:
- Linearity:
\(\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]\),
provided \(\mathbb{E}[|X|] \lt \infty\) and \(\mathbb{E}[|Y|] \lt \infty\).
- Monotonicity:
If \(X \leq Y\) a.s., then \(\mathbb{E}[X] \leq \mathbb{E}[Y]\).
- Triangle inequality:
\(|\mathbb{E}[X]| \leq \mathbb{E}[|X|]\).
Proof:
Each property is a direct specialization of the corresponding property
of the Lebesgue integral (established in
Lebesgue Integration)
to the case where the measure is a probability measure. Linearity of
expectation is the linearity of the Lebesgue integral; monotonicity of
expectation is its monotonicity, extended to "a.s." inequalities by noting
that the Lebesgue integral is unchanged if the integrand is modified on a
null set; the triangle inequality
\(|\mathbb{E}[X]| \leq \mathbb{E}[|X|]\) is the standard bound
\(\bigl|\int X \, d\mathbb{P}\bigr| \leq \int |X| \, d\mathbb{P}\). No new
proofs are needed — the discrete / continuous / mixed case distinction of
classical probability has been absorbed into a single measure-theoretic
fact.
Recall that in the classical treatment, linearity of expectation was proved separately for the continuous case (by splitting the Riemann
integral). Here, it is a single application of the linearity of the Lebesgue integral, valid for discrete,
continuous, and mixed distributions alike.
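The three properties can be sanity-checked on a small example. The sketch below assumes a uniform four-point space and two hypothetical random variables chosen so that \(X \leq Y\) pointwise; it is a numerical illustration, not a proof.

```python
from fractions import Fraction

# Uniform probability on {0, 1, 2, 3}
P = {w: Fraction(1, 4) for w in range(4)}

def E(Z):
    """Expectation of Z as a finite sum against P."""
    return sum(Z(w) * P[w] for w in P)

X = lambda w: w - 2   # values -2, -1, 0, 1
Y = lambda w: w * w   # values 0, 1, 4, 9, so X <= Y everywhere
a, b = 3, -2

# Linearity: E[aX + bY] versus a E[X] + b E[Y]
linear_lhs = E(lambda w: a * X(w) + b * Y(w))
linear_rhs = a * E(X) + b * E(Y)
print(linear_lhs, linear_rhs)

# Monotonicity and the triangle inequality
print(E(X) <= E(Y))
print(abs(E(X)) <= E(lambda w: abs(X(w))))
```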
Moments and \(L^p\) Spaces
The measure-theoretic framework reveals a direct link between moments of a random variable and the
\(L^p\) spaces studied in analysis.
Observation: Moments as \(L^p\) Membership
Let \(X\) be a random variable on \((\Omega, \mathcal{F}, \mathbb{P})\)
and let \(1 \leq p \lt \infty\). Then
\[
X \in L^p(\Omega, \mathcal{F}, \mathbb{P})
\quad \Longleftrightarrow \quad
\mathbb{E}\bigl[|X|^p\bigr] \lt \infty.
\]
That is, \(X\) belongs to \(L^p\) if and only if its \(p\)-th absolute moment
is finite.
This is not a new theorem — it is a direct restatement of the definition of \(L^p\) in the special case
where the measure is a probability measure. But the translation is illuminating:
- \(X \in L^1(\mathbb{P})\): the mean \(\mathbb{E}[X]\) exists and is finite.
- \(X \in L^2(\mathbb{P})\): the variance
\(\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2\) is finite.
The space \(L^2(\Omega, \mathbb{P})\) is a
Hilbert space
with inner product \(\langle X, Y \rangle = \mathbb{E}[XY]\).
- \(X \in L^4(\mathbb{P})\): the kurtosis (fourth standardized moment) is finite.
The \(L^p\) inclusion hierarchy \(L^q \subseteq L^p\) for \(q \geq p\)
(valid on probability spaces, where \(\mathbb{P}(\Omega) = 1\)) translates to:
finiteness of a higher moment implies finiteness of all lower moments.
If \(X \in L^4(\mathbb{P})\), then automatically
\(X \in L^2(\mathbb{P}) \subseteq L^1(\mathbb{P})\) — the mean and variance
are guaranteed to be finite.
Proof of \(L^q \subseteq L^p\) on probability spaces (\(q \geq p \geq 1\)):
The case \(q = p\) is trivial, so assume \(q \gt p\). Apply Hölder's inequality
with exponents \(q/p\) and \(q/(q-p)\) to the functions \(|X|^p\) and \(1\).
These are indeed Hölder conjugates:
\(\frac{1}{q/p} + \frac{1}{q/(q-p)} = \frac{p}{q} + \frac{q-p}{q} = 1\).
Hence
\[
\mathbb{E}\bigl[|X|^p\bigr]
\;=\; \int_\Omega |X|^p \cdot 1 \, d\mathbb{P}
\;\leq\; \left(\int_\Omega |X|^q \, d\mathbb{P}\right)^{p/q}
\left(\int_\Omega 1 \, d\mathbb{P}\right)^{(q-p)/q}
\;=\; \bigl(\mathbb{E}[|X|^q]\bigr)^{p/q}.
\]
The final factor equals \(1\) because \(\mathbb{P}(\Omega) = 1\) —
this is the crucial point where the result depends on working with a
probability measure rather than an arbitrary measure.
Taking \(p\)-th roots: \(\|X\|_p \leq \|X\|_q\).
This inclusion is a privilege of finite measure spaces. On \((\mathbb{R}, \mathcal{B}(\mathbb{R}), \lambda)\),
where \(\lambda(\mathbb{R}) = \infty\), it fails in both directions: a function can belong to \(L^p\) without belonging to
\(L^q\) for \(q \gt p\), and vice versa. For example, \(x^{-1/2}\,\mathbf{1}_{(0,1)}(x)\) lies in \(L^1(\lambda)\) but not in
\(L^2(\lambda)\), while \(x^{-1}\,\mathbf{1}_{[1,\infty)}(x)\) lies in \(L^2(\lambda)\) but not in \(L^1(\lambda)\).
The proof above breaks down precisely because the factor
\(\lambda(\mathbb{R})^{(q-p)/q}\) is infinite instead of \(1\). On a probability space, a "closed world" of total
mass \(1\), finiteness of a higher moment automatically guarantees finiteness of all lower moments.
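The role of total mass \(1\) in the norm inequality \(\|X\|_p \leq \|X\|_q\) can be seen on a finite space. The sketch below assumes a four-point space with hypothetical weights and values; `lp_norm` is an illustrative helper that works for any finite weighted measure, not only probability measures.

```python
def lp_norm(weights, values, p):
    """(sum of w_i * |x_i|^p)^(1/p) for a finite weighted measure."""
    return sum(w * abs(v) ** p for w, v in zip(weights, values)) ** (1.0 / p)

# A probability measure: the weights sum to 1
prob = [0.1, 0.2, 0.3, 0.4]
X = [-3.0, 0.5, 2.0, 1.0]
norms = [lp_norm(prob, X, p) for p in (1, 2, 4)]
print(norms)  # non-decreasing in p

# A measure of total mass 16 with the constant function 1:
# its p-norm is 16^(1/p), which decreases as p grows
heavy = [4.0, 4.0, 4.0, 4.0]
ones = [1.0, 1.0, 1.0, 1.0]
print(lp_norm(heavy, ones, 1), lp_norm(heavy, ones, 2))
```

With total mass larger than \(1\), the constant function already violates \(\|X\|_1 \leq \|X\|_2\), which matches the factor \(\mu(\Omega)^{(q-p)/q}\) appearing in the Hölder step.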
Hölder's Inequality as a Moment Condition
In the probability setting, Hölder's inequality reads: if \(X \in L^p(\mathbb{P})\) and
\(Y \in L^q(\mathbb{P})\) with \(p, q \gt 1\) and \(1/p + 1/q = 1\), then
\[
\mathbb{E}[|XY|] \;\leq\;
\bigl(\mathbb{E}[|X|^p]\bigr)^{1/p} \, \bigl(\mathbb{E}[|Y|^q]\bigr)^{1/q}.
\]
This bounds the expectation of a product purely in terms of individual moment conditions. For the important
special case \(p = q = 2\), Hölder yields
\[
\mathbb{E}[|XY|] \;\leq\; \sqrt{\mathbb{E}[X^2] \, \mathbb{E}[Y^2]},
\]
and squaring (using \(|\mathbb{E}[XY]| \leq \mathbb{E}[|XY|]\)) gives the
Cauchy-Schwarz inequality:
\[
\bigl(\mathbb{E}[XY]\bigr)^2 \;\leq\; \mathbb{E}[X^2] \, \mathbb{E}[Y^2],
\]
which is among the most widely used inequalities in probability and statistics: it bounds correlations
(\(|\mathrm{Corr}(X, Y)| \leq 1\)) and underlies the \(L^2\) geometry in which conditional expectations
are constructed as orthogonal projections.
The fact that this is a single inequality, valid for all random variables regardless of their distribution
type, is a direct consequence of the measure-theoretic unification.
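Because the empirical measure of a sample is itself a probability measure, Hölder and Cauchy-Schwarz hold exactly for sample averages, not just approximately. The sketch below assumes a seeded pseudo-random sample with one Gaussian and one exponential coordinate; the distributions, the sample size, and the helper `holder_sides` are arbitrary illustrative choices.

```python
import random

rng = random.Random(0)  # fixed seed for reproducibility
n = 1000
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [rng.expovariate(1.0) for _ in range(n)]

def mean(values):
    """Expectation under the empirical (uniform-weight) measure."""
    return sum(values) / len(values)

def holder_sides(xs, ys, p):
    """Return (E|XY|, ||X||_p * ||Y||_q) with q the conjugate exponent of p."""
    q = p / (p - 1.0)
    lhs = mean([abs(x * y) for x, y in zip(xs, ys)])
    norm_x = mean([abs(x) ** p for x in xs]) ** (1.0 / p)
    norm_y = mean([abs(y) ** q for y in ys]) ** (1.0 / q)
    return lhs, norm_x * norm_y

for p in (1.5, 2.0, 3.0):
    lhs, rhs = holder_sides(xs, ys, p)
    print(p, lhs <= rhs)

# Cauchy-Schwarz in squared form
exy = mean([x * y for x, y in zip(xs, ys)])
print(exy ** 2 <= mean([x * x for x in xs]) * mean([y * y for y in ys]))
```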
"Almost Surely" Revisited
In Lebesgue integration, we introduced the phrase "almost everywhere" (a.e.) to mean
"for all \(\omega\) outside a set of measure zero." On a probability space, the same concept is called
"almost surely" (a.s.): an event holds a.s. if the set of outcomes where it fails has
probability zero.
The equivalence is literal:
\[
X = Y \;\text{a.s.}
\quad \Longleftrightarrow \quad
\mathbb{P}\bigl(\{\omega : X(\omega) \neq Y(\omega)\}\bigr) = 0
\quad \Longleftrightarrow \quad
X = Y \;\text{a.e. with respect to } \mathbb{P}.
\]
This is why \(L^p(\Omega, \mathbb{P})\) consists of equivalence classes of
random variables that agree a.s. — two random variables that differ only on a
null event are probabilistically identical, and we should not (and do not)
distinguish them.
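A three-point example shows what "differing only on a null event" looks like. The sketch below assumes a hypothetical space in which the outcome `"c"` carries probability zero; the values of \(X\) and \(Y\) are arbitrary.

```python
from fractions import Fraction

# Outcome "c" is possible in the sample space but has probability zero
P = {"a": Fraction(1, 2), "b": Fraction(1, 2), "c": Fraction(0)}

X = {"a": 1, "b": 2, "c": 3}
Y = {"a": 1, "b": 2, "c": -100}  # differs from X only at "c"

disagreement = {w for w in P if X[w] != Y[w]}
prob_differ = sum(P[w] for w in disagreement)

E_X = sum(X[w] * P[w] for w in P)
E_Y = sum(Y[w] * P[w] for w in P)

print(disagreement, prob_differ)
print(E_X, E_Y)
```

As functions, \(X\) and \(Y\) are different, yet \(X = Y\) a.s. and every expectation computed from them agrees, so they represent the same element of \(L^p(\Omega, \mathbb{P})\).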
The "a.e. vs. a.s." translation may seem trivial, but it matters when reading
across disciplines. An analyst writes "\(f_n \to f\) a.e."; a probabilist writes
"\(X_n \to X\) a.s." They mean the same thing. In
Limit Theorems and Product Measures,
we exploit this translation systematically — giving the great limit theorems
of Lebesgue integration their probabilistic names, and formalizing the product
measure construction that makes "i.i.d. sampling" mathematically precise.