Measure-Theoretic Probability

Random Variables as Measurable Functions
Expectation as Lebesgue Integral
Convergence Theorems for Probability
Independence and Product Measures

In our early study of probability, we were forced to treat discrete and continuous random variables as two separate worlds — one ruled by summation \(\sum\), the other by integration \(\int\). In Random Variables, we defined a random variable simply as a function \(X : S \to \mathbb{R}\) that assigns a numerical value to each outcome, and we computed expectations by summation for discrete variables and by integration against a density for continuous ones.

But nature does not divide data into such neat categories. A mixed distribution — partly discrete, partly continuous — cannot be handled cleanly by either formula alone. And fundamental questions remain unanswered: when can we interchange limits and expectations? Why does the independence assumption \(P(A \cap B) = P(A)P(B)\) justify treating samples as separate coordinates? These questions require a language more precise than the one we have built so far.

The tools to answer them already exist. In Measure Theory, we constructed the \(\sigma\)-algebra and probability measure as the mathematical substrate of randomness. In Lebesgue Integration, we built an integral powerful enough to handle highly irregular functions on arbitrary measure spaces. In \(L^p\) Spaces, we proved that the resulting function spaces are complete.

This chapter brings these tools to bear on probability. We do not repeat the definitions of \(\sigma\)-algebras, measures, or the Lebesgue integral — those foundations are in place. Instead, we translate: each concept from classical probability receives its measure-theoretic formulation, and each measure-theoretic theorem receives its probabilistic name. The result is a unified framework in which discrete sums, continuous integrals, and mixed distributions are all special cases of a single operation: integration against a measure.

Random Variables as Measurable Functions

In Random Variables, we defined a random variable as "a function \(X : S \to \mathbb{R}\) that assigns a numerical value to each outcome in the sample space." This definition is correct in spirit but incomplete in one critical respect: it imposes no condition on which functions qualify. In the measure-theoretic framework, not every function from \(\Omega\) to \(\mathbb{R}\) deserves to be called a random variable — only those compatible with the \(\sigma\)-algebra structure that determines which events can be assigned probabilities.

The Measurability Condition

Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space, and let \((\mathbb{R}, \mathcal{B}(\mathbb{R}))\) be the real line equipped with its Borel \(\sigma\)-algebra — the \(\sigma\)-algebra generated by all open intervals.

Definition: Random Variable (Measure-Theoretic)

A random variable is a measurable function \(X : (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))\). That is, \(X\) satisfies the measurability condition: \[ X^{-1}(B) \;=\; \{\omega \in \Omega : X(\omega) \in B\} \;\in\; \mathcal{F} \quad \text{for every } B \in \mathcal{B}(\mathbb{R}). \]

The condition asks: for every "reasonable" subset \(B\) of the real line (every Borel set), the set of outcomes \(\omega\) for which \(X(\omega)\) lands in \(B\) must be an event — that is, a member of \(\mathcal{F}\), to which \(\mathbb{P}\) can assign a probability. Without this condition, the expression \(\mathbb{P}(X \in B)\) might be undefined: we would be asking for the probability of a set that our \(\sigma\)-algebra does not recognize.

In practice, measurability is verified via a useful shortcut. Since \(\mathcal{B}(\mathbb{R})\) is generated by half-lines \((-\infty, a]\), it suffices to check preimages of these generators:

Proposition: Measurability via Half-Lines

A function \(X : \Omega \to \mathbb{R}\) is measurable with respect to \((\mathcal{F}, \mathcal{B}(\mathbb{R}))\) if and only if \[ \{\omega \in \Omega : X(\omega) \leq a\} \in \mathcal{F} \quad \text{for every } a \in \mathbb{R}. \]

This connects directly to the familiar CDF: the measurability condition is precisely the requirement that the cumulative distribution function \(F_X(a) = \mathbb{P}(X \leq a)\) is well-defined for every \(a \in \mathbb{R}\).

Why Not Every Function Qualifies

On an abstract probability space, the \(\sigma\)-algebra \(\mathcal{F}\) can be much smaller than the power set \(2^\Omega\). Consider \(\Omega = \{H, T\}\) with \(\mathcal{F} = \{\emptyset, \Omega\}\) — the trivial \(\sigma\)-algebra. Define \(X(H) = 1\) and \(X(T) = 0\). Then \(X^{-1}((-\infty, 0.5]) = \{T\}\), but \(\{T\} \notin \mathcal{F}\). So \(\mathbb{P}(X \leq 0.5)\) is undefined — this \(X\) is not a random variable with respect to this \(\sigma\)-algebra.

The issue disappears if we enlarge to \(\mathcal{F} = \{\emptyset, \{H\}, \{T\}, \Omega\}\), because every preimage of a Borel set now belongs to \(\mathcal{F}\). The lesson is that "being a random variable" is not an intrinsic property of a function — it depends on the relationship between the function and the \(\sigma\)-algebra.
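The coin example can be checked mechanically. The sketch below (plain Python; the helper name `preimages_in_F` is ours, not standard) tests the half-line criterion of the proposition above on a finite space, where only thresholds at the attained values of \(X\) matter:

```python
def preimages_in_F(omega, F, X):
    """Half-line criterion on a finite space: X is measurable iff
    {w : X(w) <= a} lies in F for every threshold a. On a finite
    space only thresholds at the attained values of X matter."""
    events = {frozenset(A) for A in F}
    return all(
        frozenset(w for w in omega if X[w] <= a) in events
        for a in set(X.values())
    )

omega = {"H", "T"}
X = {"H": 1, "T": 0}

trivial = [set(), {"H", "T"}]                 # F = {emptyset, Omega}
enlarged = [set(), {"H"}, {"T"}, {"H", "T"}]  # the power set of Omega

print(preimages_in_F(omega, trivial, X))   # False: {T} is not an event
print(preimages_in_F(omega, enlarged, X))  # True: every preimage is an event
```

The same function, measurable against one \(\sigma\)-algebra and not the other, makes the "relationship, not intrinsic property" point concrete.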

The Distribution of a Random Variable

Given a random variable \(X\), we constantly ask questions of the form "\(\mathbb{P}(X \in B)\) for various Borel sets \(B\)." This collection of probabilities is itself a measure on \((\mathbb{R}, \mathcal{B}(\mathbb{R}))\), and it encodes everything we need to know about \(X\) from a probabilistic standpoint.

Definition: Distribution (Pushforward Measure)

Let \(X : (\Omega, \mathcal{F}, \mathbb{P}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))\) be a random variable. The distribution (or law) of \(X\) is the probability measure \(P_X\) on \((\mathbb{R}, \mathcal{B}(\mathbb{R}))\) defined by \[ P_X(B) \;=\; \mathbb{P}(X \in B) \;=\; \mathbb{P}\bigl(X^{-1}(B)\bigr) \quad \text{for every } B \in \mathcal{B}(\mathbb{R}). \] This measure is also called the pushforward of \(\mathbb{P}\) by \(X\), and is written \(P_X = \mathbb{P} \circ X^{-1}\).

That \(P_X\) is indeed a probability measure follows immediately from the properties of \(\mathbb{P}\): we have \(P_X(\mathbb{R}) = \mathbb{P}(X^{-1}(\mathbb{R})) = \mathbb{P}(\Omega) = 1\), and countable additivity of \(P_X\) is inherited from countable additivity of \(\mathbb{P}\) because preimages preserve disjoint unions.
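On a finite sample space the pushforward is just a reindexed sum, which makes the definition easy to experiment with. A minimal sketch, with the helper name `pushforward` and the predicate-as-Borel-set convention as our own illustrative choices:

```python
from fractions import Fraction

# A fair six-sided die as a finite probability space (P given on singletons).
P = {w: Fraction(1, 6) for w in range(1, 7)}
X = lambda w: w % 3   # an illustrative random variable on this space

def pushforward(P, X, B):
    """P_X(B) = P(X^{-1}(B)): total mass of the outcomes X maps into B.
    Here B is a predicate on the reals, standing in for a Borel set."""
    return sum(p for w, p in P.items() if B(X(w)))

print(pushforward(P, X, lambda x: True))     # P_X(R) = 1, as it must be
print(pushforward(P, X, lambda x: x == 0))   # P(X = 0) = 1/3
print(pushforward(P, X, lambda x: x <= 1))   # the CDF value F_X(1) = 2/3
```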

The pushforward perspective reveals that all the information about \(X\) that matters probabilistically — its CDF, its moments, its tail behavior — is encoded in the single measure \(P_X\) on \(\mathbb{R}\). Two random variables defined on entirely different probability spaces can have the same distribution, and from the perspective of any question answerable by \(P_X\) alone, they are indistinguishable.

Recovering PMFs and PDFs

The pushforward framework unifies the discrete and continuous cases that were treated separately in Random Variables.

Discrete case. If \(X\) takes values in a countable set \(\{x_1, x_2, \ldots\}\), then \(P_X\) is concentrated on these points: \[ P_X(B) \;=\; \sum_{k:\, x_k \in B} \mathbb{P}(X = x_k) \;=\; \sum_{k:\, x_k \in B} p(x_k). \] Here \(p(x_k) = \mathbb{P}(X = x_k)\) is the probability mass function. Formally, \(P_X\) is the weighted sum of point masses: \(P_X = \sum_k p(x_k) \, \delta_{x_k}\), where \(\delta_{x_k}(B) = 1\) if \(x_k \in B\) and \(0\) otherwise. The measure \(P_X\) is singular with respect to the Lebesgue measure \(\lambda\): it is supported on a countable set, which has \(\lambda\)-measure zero.
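The weighted sum of point masses can be written out directly. In this sketch (illustrative helper names; an arbitrary three-point PMF stands in for a real distribution), \(P_X = \sum_k p(x_k)\,\delta_{x_k}\) is built from Dirac masses and evaluated on predicates standing in for Borel sets:

```python
from fractions import Fraction

def delta(x0):
    """Dirac point mass: delta_{x0}(B) = 1 if x0 in B, else 0."""
    return lambda B: Fraction(1) if B(x0) else Fraction(0)

def point_mass_mixture(pmf):
    """P_X = sum_k p(x_k) delta_{x_k}, evaluated on a predicate B."""
    masses = [(p, delta(x)) for x, p in pmf.items()]
    return lambda B: sum(p * d(B) for p, d in masses)

# An illustrative three-point PMF: p(0) = 1/4, p(1) = 1/2, p(2) = 1/4.
P_X = point_mass_mixture({0: Fraction(1, 4),
                          1: Fraction(1, 2),
                          2: Fraction(1, 4)})

print(P_X(lambda x: x <= 1))   # F_X(1) = 3/4
print(P_X(lambda x: True))     # total mass 1
```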

Continuous case. If \(P_X\) is absolutely continuous with respect to \(\lambda\) — meaning \(\lambda(B) = 0\) implies \(P_X(B) = 0\) — then there exists a nonnegative measurable function \(f_X\) such that \[ P_X(B) \;=\; \int_B f_X(x) \, d\lambda(x) \quad \text{for every } B \in \mathcal{B}(\mathbb{R}). \] The function \(f_X\) is the probability density function (PDF), and the relationship is written compactly as \[ f_X \;=\; \frac{dP_X}{d\lambda}, \] the Radon-Nikodym derivative of \(P_X\) with respect to \(\lambda\).

The Radon-Nikodym theorem guarantees the existence of \(f_X\) whenever \(P_X \ll \lambda\) (absolute continuity). We do not prove this theorem here — the full proof requires additional machinery — but we use the result freely. This perspective elevates the PDF from "the derivative of the CDF" to a precise statement about how one measure relates to another: the density \(f_X(x)\) measures the local rate at which \(P_X\) accumulates probability relative to the rate at which \(\lambda\) accumulates length.
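To see the density at work numerically, the sketch below approximates \(P_X([a,b]) = \int_a^b f_X \, d\lambda\) with a simple midpoint rule, using the Exp(1) density as a concrete \(f_X\) (the helper name and the choice of density are ours):

```python
import math

def prob_from_density(f, a, b, n=100_000):
    """Approximate P_X([a, b]) = integral of f over [a, b] with a
    midpoint rule: integration against the density recovers the
    probability of the Borel set [a, b]."""
    h = (b - a) / n
    return sum(f(a + (k + 0.5) * h) for k in range(n)) * h

f = lambda x: math.exp(-x)           # Exp(1) density, a concrete f_X
approx = prob_from_density(f, 0.5, 2.0)
exact = math.exp(-0.5) - math.exp(-2.0)   # closed form for Exp(1)
print(abs(approx - exact) < 1e-6)    # the quadrature matches the closed form
```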

The End of the Discrete-Continuous Dichotomy

The pushforward measure \(P_X\) treats all random variables uniformly. A discrete variable has \(P_X\) concentrated on isolated points. A continuous variable has \(P_X\) spread smoothly via a density. A mixed distribution — for example, a variable that equals zero with probability \(1/2\) and is otherwise uniformly distributed on \([0,1]\) — is simply a measure that has both a point mass and an absolutely continuous component: \[ P_X \;=\; \tfrac{1}{2}\,\delta_0 \;+\; \tfrac{1}{2}\,\lambda\big|_{[0,1]}. \] No special formulas are needed. The Lebesgue integral against \(P_X\) handles all three cases — discrete, continuous, and mixed — in a single expression. This unification is not aesthetic convenience; it is essential for modern machine learning, where models like variational autoencoders routinely operate on distributions that mix discrete and continuous components.
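The mixed law above needs no special-case formulas even in code. A sketch (the name `P_mixed` is ours) that evaluates \(P_X([a,b])\) for \(P_X = \tfrac{1}{2}\,\delta_0 + \tfrac{1}{2}\,\lambda|_{[0,1]}\) by adding the atom's contribution to the continuous part's:

```python
def P_mixed(a, b):
    """P_X([a, b]) for P_X = (1/2) delta_0 + (1/2) lambda|_[0,1]:
    the atom at 0 plus the absolutely continuous part, in one formula."""
    atom = 0.5 if a <= 0 <= b else 0.0     # contribution of (1/2) delta_0
    lo, hi = max(a, 0.0), min(b, 1.0)      # overlap of [a, b] with [0, 1]
    cont = 0.5 * max(hi - lo, 0.0)         # (1/2) * Lebesgue length
    return atom + cont

print(P_mixed(0.0, 0.0))    # 0.5: the point mass alone
print(P_mixed(0.0, 0.5))    # 0.75: atom plus half of the continuous mass
print(P_mixed(-1.0, 2.0))   # 1.0: total mass
```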

Random Vectors and Measurability

Everything above extends to vector-valued random variables. A random vector \(\mathbf{X} = (X_1, \ldots, X_d) : \Omega \to \mathbb{R}^d\) is measurable with respect to \((\mathcal{F}, \mathcal{B}(\mathbb{R}^d))\), where \(\mathcal{B}(\mathbb{R}^d)\) is the Borel \(\sigma\)-algebra on \(\mathbb{R}^d\). A convenient fact simplifies verification:

Proposition: Component-wise Measurability

A function \(\mathbf{X} = (X_1, \ldots, X_d) : \Omega \to \mathbb{R}^d\) is \((\mathcal{F}, \mathcal{B}(\mathbb{R}^d))\)-measurable if and only if each component \(X_i : \Omega \to \mathbb{R}\) is \((\mathcal{F}, \mathcal{B}(\mathbb{R}))\)-measurable.

This follows from the fact that \(\mathcal{B}(\mathbb{R}^d) = \mathcal{B}(\mathbb{R}) \otimes \cdots \otimes \mathcal{B}(\mathbb{R})\) (the product \(\sigma\)-algebra), which is generated by sets of the form \(B_1 \times \cdots \times B_d\) with each \(B_i \in \mathcal{B}(\mathbb{R})\). Since \(\mathbf{X}^{-1}(B_1 \times \cdots \times B_d) = \bigcap_{i=1}^d X_i^{-1}(B_i)\), component-wise measurability ensures that all preimages of generators belong to \(\mathcal{F}\), which suffices.

The distribution of a random vector is the pushforward \(P_{\mathbf{X}} = \mathbb{P} \circ \mathbf{X}^{-1}\) on \((\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))\). When \(P_{\mathbf{X}}\) has a density \(f_{\mathbf{X}}\) with respect to the Lebesgue measure on \(\mathbb{R}^d\), we recover exactly the joint density studied in Multivariate Distributions: for instance, the multivariate normal density with mean \(\boldsymbol{\mu}\) and covariance \(\boldsymbol{\Sigma}\) is the Radon-Nikodym derivative \(f_{\mathbf{X}} = dP_{\mathbf{X}}/d\lambda^d\).
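As a concrete instance of a density on \(\mathbb{R}^d\), the sketch below evaluates the multivariate normal density, i.e. the Radon-Nikodym derivative \(dP_{\mathbf{X}}/d\lambda^d\) for \(\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\). It is NumPy-based and assumes \(\boldsymbol{\Sigma}\) is positive definite; the helper name is ours:

```python
import math
import numpy as np

def mvn_density(x, mu, Sigma):
    """Multivariate normal density at x: the Radon-Nikodym derivative
    dP_X / dlambda^d for X ~ N(mu, Sigma), Sigma positive definite."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    norm_const = math.sqrt((2 * math.pi) ** d * np.linalg.det(Sigma))
    return math.exp(-0.5 * quad) / norm_const

# Sanity checks against closed forms at the mean:
# d = 1: density at 0 is 1/sqrt(2 pi); d = 2 (identity covariance): 1/(2 pi).
print(abs(mvn_density(np.zeros(1), np.zeros(1), np.eye(1))
          - 1 / math.sqrt(2 * math.pi)) < 1e-12)
print(abs(mvn_density(np.zeros(2), np.zeros(2), np.eye(2))
          - 1 / (2 * math.pi)) < 1e-12)
```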