Introduction
In the previous section, we observed that certain functions are integrable neither in the Riemann sense nor in the improper
Riemann sense. To handle such cases, we need to introduce measure theory and Lebesgue integration.
Since our focus is on applied mathematics, particularly in the context of statistics and machine learning, we will concentrate
on probability-related measure theory. While we'll avoid covering every foundational topic and formal mathematical
detail, our goal is to build a solid understanding that leads to the introduction of Lebesgue integration.
Note that measure theory fundamentally defines the "volume" and structure of a space.
This perspective is crucial when transitioning from discrete symmetries to continuous ones, such as \(SE(3)\),
where the measure remains invariant under rigid motions (rotations and translations).
(See: geometry of symmetry)
Probability Space
While the intuitive notion of "volume" or "probability" works well for simple shapes and finite sets, it
becomes surprisingly fragile when applied to the continuum of real numbers or complex manifolds.
To handle these cases without leading to logical paradoxes, we must move beyond mere intuition and adopt
a rigorous axiomatic framework.
In this context, we don't just "measure" things; we define a structured environment where every operation is
logically consistent. This is why we treat a probabilistic model not as a single value, but as a triple of
interconnected components.
Now, we need a formal definition of a probabilistic model:
Definition: Probability Space
A probability space is a triple \((\Omega, \mathcal{F}, \mathbb{P})\) where
- \(\Omega\) is the sample space:
The set of possible outcomes of an experiment.
- \(\mathcal{F}\) is a \(\sigma\)-algebra:
A collection of subsets of the sample space \(\Omega\).
- \(\mathbb{P}\) is a probability measure:
A function \(\mathbb{P}: \mathcal{F} \to [0, 1]\) satisfying \(\mathbb{P}(\Omega) = 1\) together with countable additivity (defined formally below).
Sample Space
The sample space \(\Omega\) can be finite, countable, or uncountable. An element of \(\Omega\) is denoted by
\(\omega\), and is called an elementary outcome.
For example, if our experiment consists of an infinite number of consecutive rolls of a die, the sample space is the set:
\[
\Omega = \{1, 2, 3, 4, 5, 6\}^{\infty}
\]
and an elementary outcome is an infinite sequence such as:
\[
\omega = (1, 1, 4, 3, 1, 5, \cdots ).
\]
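An infinite sequence cannot be stored in full, but it can be represented lazily. The following is a minimal sketch, assuming we identify one outcome \(\omega\) with a pseudo-random stream fixed by a seed; the names `sample_outcome` and `prefix` are hypothetical helpers introduced only for illustration.

```python
import itertools
import random


def sample_outcome(seed):
    """Lazily yield the coordinates of one outcome omega in {1,...,6}^infinity."""
    rng = random.Random(seed)  # fixed seed: the same omega is reproduced each time
    while True:
        yield rng.randint(1, 6)


def prefix(seed, n):
    """Return the first n rolls of the outcome determined by `seed`."""
    return list(itertools.islice(sample_outcome(seed), n))


first_rolls = prefix(seed=0, n=10)
print(first_rolls)
```

In practice we only ever observe such finite prefixes of \(\omega\), which is one reason events about the infinite sequence have to be built up from events about finitely many rolls.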
A simpler case is a discrete probability space, in which the sample space
is finite or countably infinite:
\[
\Omega = \{\omega_1, \omega_2, \cdots \}.
\]
Here the \(\sigma\)-algebra is the set of all subsets of \(\Omega\) (its power set). The probability measure then assigns a number in the interval \([0, 1]\) to
every subset of \(\Omega\). It is defined in terms of the probabilities \(\mathbb{P}(\{\omega\})\) of the elementary outcomes and satisfies
\[
\forall A \subset \Omega, \quad \mathbb{P}(A) = \sum_{\omega \in A}\mathbb{P}(\{\omega\})
\]
and
\[
\sum_{\omega \in \Omega}\mathbb{P}(\{\omega\}) = 1.
\]
Note: We will use \(\mathbb{P}(\omega)\) instead of \(\mathbb{P}(\{\omega\})\) and \(\mathbb{P}(\omega_i)\) will be denoted by \(p_i\).
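As a small illustration of these two formulas, here is a sketch of a discrete probability space in Python. It assumes a single roll of a fair die with \(p_i = 1/6\); the helper names `prob` and `power_set` are ours and are not part of any library.

```python
from fractions import Fraction
from itertools import chain, combinations

# Sample space of one fair die roll, with point masses p_i = 1/6 (assumed).
omega = [1, 2, 3, 4, 5, 6]
point_mass = {w: Fraction(1, 6) for w in omega}


def prob(event):
    """P(A) = sum of P({w}) over w in A, for any subset A of omega."""
    return sum((point_mass[w] for w in event), Fraction(0))


def power_set(items):
    """All subsets of a finite set: the sigma-algebra of a discrete space."""
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))]


events = power_set(omega)
even = frozenset({2, 4, 6})
print(len(events), prob(even), prob(frozenset(omega)))
```

Exact fractions are used instead of floats so that the normalization \(\sum_{\omega} \mathbb{P}(\{\omega\}) = 1\) holds without rounding error.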
\(\sigma\)-algebra (\(\sigma\)-field)
Ideally, we would like to specify the probability \(\mathbb{P} (A)\) of every subset of \(\Omega\). However, when \(\Omega\)
is uncountable this is in general impossible: for the uniform distribution on \([0, 1]\), for instance, there are subsets
(such as Vitali sets, assuming the axiom of choice) to which no probability can be consistently assigned. So, we assign probabilities to only
a restricted collection of subsets of \(\Omega\). The sets in this collection are to be thought of as the "nice"
and "interesting" subsets of \(\Omega\). Formally, we define the collection as follows:
Definition: \(\sigma\)-algebra
A \(\sigma\)-algebra \(\mathcal{F}\) is a collection of subsets of \(\Omega\) with the
following properties:
- \(\emptyset \in \mathcal{F}\).
- If \(A \in \mathcal{F}\), then \(A^c \in \mathcal{F}\).
- If \(\{A_i\}_{i=1}^{\infty} \subset \mathcal{F}\), then \(\bigcup_{i =1}^{\infty} A_i \in \mathcal{F}\).
A set \(A \in \mathcal{F}\) is called an event, an \(\mathcal{F}\)-measurable set, or simply a measurable set.
The pair \((\Omega, \mathcal{F})\) is called a measurable space.
Given a collection \(\mathcal{C}\) of subsets of \(\Omega\), we usually need only the smallest \(\sigma\)-algebra containing
\(\mathcal{C}\). We define it as the intersection of all \(\sigma\)-algebras that contain \(\mathcal{C}\); this intersection is itself
a \(\sigma\)-algebra, and it is nonempty because the power set of \(\Omega\) is one such \(\sigma\)-algebra.
The resulting \(\mathcal{F}\) is said to be the \(\sigma\)-algebra generated by \(\mathcal{C}\), and is denoted
by \(\sigma(\mathcal{C})\).
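On a finite sample space the generated \(\sigma\)-algebra can be computed explicitly, because countable unions reduce to finite ones. The sketch below takes that shortcut: it closes a collection under complements and pairwise unions until nothing new appears. The function name `generate_sigma_algebra` is hypothetical, and the approach is only a brute-force illustration for small finite \(\Omega\), not a general construction.

```python
from itertools import combinations


def generate_sigma_algebra(omega, collection):
    """Smallest family containing `collection`, closed under complement and union.

    On a finite omega, countable unions reduce to finite ones, so closing
    under pairwise unions and complements is enough.
    """
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(c) for c in collection}
    changed = True
    while changed:
        changed = False
        current = list(family)
        for a in current:
            if omega - a not in family:
                family.add(omega - a)
                changed = True
        for a, b in combinations(current, 2):
            if a | b not in family:
                family.add(a | b)
                changed = True
    return family


sigma_c = generate_sigma_algebra({1, 2, 3, 4}, [{1}, {2}])
print(len(sigma_c))
```

For \(\mathcal{C} = \{\{1\}, \{2\}\}\) on \(\Omega = \{1, 2, 3, 4\}\), the result contains \(\{3, 4\}\) but not \(\{3\}\): the generated \(\sigma\)-algebra cannot distinguish outcomes that \(\mathcal{C}\) never separates.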
Probability Measure
A collection of sets \(A_{\alpha} \subset \Omega\), where \(\alpha\) ranges over some index set, is called
mutually exclusive (equivalently, the sets are said to be disjoint) if \(A_{\alpha} \cap A_{\alpha'} = \emptyset \)
whenever \(\alpha \neq \alpha'\). The sets \(A_{\alpha} \subset \Omega\) are called collectively exhaustive
if \(\bigcup_{\alpha} A_{\alpha} = \Omega\).
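For finite collections, both properties can be checked mechanically. The following is a small sketch using Python sets; the function names are hypothetical and chosen to mirror the two definitions.

```python
from itertools import combinations


def mutually_exclusive(sets):
    """True if every pair of distinct sets in the collection is disjoint."""
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))


def collectively_exhaustive(sets, omega):
    """True if the union of the collection is the whole sample space."""
    return set().union(*sets) == set(omega)


omega = {1, 2, 3, 4, 5, 6}
odd_even = [{1, 3, 5}, {2, 4, 6}]
overlapping = [{1, 2, 3, 4}, {4, 5, 6}]

print(mutually_exclusive(odd_even), collectively_exhaustive(odd_even, omega))
print(mutually_exclusive(overlapping), collectively_exhaustive(overlapping, omega))
```

A collection with both properties, such as the odd/even split of a die roll, is a partition of \(\Omega\).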
Definition: Measure
A measure is a function
\[
\mu: \mathcal{F} \to [0, \infty]
\]
which assigns a nonnegative extended real number \(\mu(A)\) to every set \(A \in \mathcal{F}\), and which satisfies the
following two conditions:
- \(\mu(\emptyset) = 0\)
- Countable additivity (\(\sigma\)-additivity):
If \(\{A_i\}_{i=1}^{\infty}\) is a sequence of pairwise disjoint sets that belong to \(\mathcal{F}\), then
\[
\mu\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i =1}^{\infty} \mu(A_i).
\]
This general definition covers measures that may be infinite on the whole space. A measure with \(\mu(\Omega) < \infty\) is called a finite measure; the special case \(\mu(\Omega) = 1\) is a probability measure, defined next.
Definition: Probability Measure
A probability measure is a measure \(\mathbb{P}\) with the additional property \(\mathbb{P}(\Omega) = 1\).
A crucial consequence of countable additivity is that the probability measure is continuous.
This continuity property of measure ensures that probabilities behave well under limits,
which is essential for defining concepts like convergence of random variables and the law of large numbers.
Theorem: Continuity of Measure
Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space and \(\{A_n\} \subset \mathcal{F}\).
- Continuity from below: If \(A_n \uparrow A\) (i.e., \(A_1 \subset A_2 \subset \dots\) and \(\bigcup_n A_n = A\)), then
\[
\lim_{n \to \infty} \mathbb{P}(A_n) = \mathbb{P}(A).
\]
- Continuity from above: If \(A_n \downarrow A\) (i.e., \(A_1 \supset A_2 \supset \dots\) and \(\bigcap_n A_n = A\)), then
\[
\lim_{n \to \infty} \mathbb{P}(A_n) = \mathbb{P}(A).
\]
For a general measure \(\mu\), continuity from below holds unconditionally, but continuity from above requires the additional hypothesis \(\mu(A_1) < \infty\) (one can construct counterexamples on \(\mathbb{R}\) with Lebesgue measure otherwise). Probability measures automatically satisfy this since \(\mathbb{P}(\Omega) = 1\).
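To see why the finiteness hypothesis cannot be dropped, here is one standard counterexample, sketched with the Lebesgue measure \(\mu\) on \(\mathbb{R}\) (constructed later on this page). Take \(A_n = [n, \infty)\). Then \(A_1 \supset A_2 \supset \dots\) and \(\bigcap_{n} A_n = \emptyset\), yet
\[
\mu(A_n) = \infty \ \text{ for every } n, \qquad \text{while} \qquad \mu\left(\bigcap_{n=1}^{\infty} A_n\right) = \mu(\emptyset) = 0.
\]
So \(\lim_{n \to \infty} \mu(A_n) = \infty \neq 0\), and continuity from above fails precisely because no \(A_n\) has finite measure.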
Proof (sketch):
From below. Set \(B_1 = A_1\) and \(B_n = A_n \setminus A_{n-1}\) for \(n \geq 2\). The \(B_n\) are pairwise disjoint, \(\bigcup_{k=1}^n B_k = A_n\), and \(\bigcup_{k=1}^\infty B_k = A\). Countable additivity gives
\[
\mathbb{P}(A) = \sum_{k=1}^\infty \mathbb{P}(B_k) = \lim_{n \to \infty} \sum_{k=1}^n \mathbb{P}(B_k) = \lim_{n \to \infty} \mathbb{P}(A_n).
\]
From above. Since \(A_n \downarrow A\), by De Morgan \(A_n^c \uparrow A^c\). Applying continuity from below to the complements yields \(\lim_n \mathbb{P}(A_n^c) = \mathbb{P}(A^c)\). Using \(\mathbb{P}(A_n^c) = 1 - \mathbb{P}(A_n)\) and \(\mathbb{P}(A^c) = 1 - \mathbb{P}(A)\) (both valid because \(\mathbb{P}\) is a finite measure with \(\mathbb{P}(\Omega) = 1\)), we obtain \(\lim_n (1 - \mathbb{P}(A_n)) = 1 - \mathbb{P}(A)\), hence \(\lim_n \mathbb{P}(A_n) = \mathbb{P}(A)\).
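The disjointification step in the proof can be checked numerically. The sketch below assumes the uniform measure on \((0, 1]\) (introduced later on this page), under which an interval \((a, b]\) has probability \(b - a\), and takes \(A_n = (0, 1 - \tfrac{1}{n+1}]\), so that \(A_n \uparrow (0, 1)\) and \(B_n = A_n \setminus A_{n-1}\) has length \(\tfrac{1}{n(n+1)}\). The function names are hypothetical.

```python
from fractions import Fraction


def p_A(n):
    """P(A_n) for A_n = (0, 1 - 1/(n+1)] under the uniform measure on (0, 1]."""
    return 1 - Fraction(1, n + 1)


def p_B(n):
    """P(B_n) for the disjoint pieces B_1 = A_1 and B_n = A_n minus A_{n-1}."""
    return Fraction(1, n * (n + 1))


def partial_sum(n):
    """Sum of P(B_k) for k = 1..n; equals P(A_n) by finite additivity."""
    return sum((p_B(k) for k in range(1, n + 1)), Fraction(0))


for n in (1, 10, 100, 1000):
    print(n, partial_sum(n), float(p_A(n)))
```

The partial sums agree with \(\mathbb{P}(A_n)\) exactly and increase toward \(\mathbb{P}((0, 1)) = 1\), which is the limit the theorem predicts.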
Countable additivity implies that probabilities (and, more generally, measures) behave like volume:
the volume of a countable union of disjoint sets is the sum of their individual "volumes."
Indeed, a measure is a generalized notion of volume that quantifies the "size" of sets within a
space such as a manifold.
In physical settings such as robotics, where poses live in \(SE(3)\), a measure that is invariant under rigid-body motion
assigns the same size to a set before and after it is moved.
\(\sigma\)-Finite Measures
The continuity-from-above argument above required the finiteness hypothesis
\(\mu(A_1) < \infty\), which is automatic for probability measures but fails
for general measures such as Lebesgue measure on \(\mathbb{R}\). For many results
in measure theory and its applications, we do not need full finiteness — it
suffices that the space can be exhausted by countably many finite-measure pieces.
Definition: \(\sigma\)-Finite Measure
A measure \(\mu\) on \((\Omega, \mathcal{F})\) is \(\sigma\)-finite
if there exists a countable collection
\(\{\Omega_n\}_{n \geq 1} \subseteq \mathcal{F}\) (not required to be disjoint)
with \(\Omega = \bigcup_{n=1}^\infty \Omega_n\) and \(\mu(\Omega_n) < \infty\)
for every \(n\).
Every finite measure is \(\sigma\)-finite (take \(\Omega_1 = \Omega\)), and
in particular every probability measure is \(\sigma\)-finite. The Lebesgue measure
on \(\mathbb{R}\) is \(\sigma\)-finite but not finite, via the exhaustion
\(\Omega_n = [-n, n]\). By contrast, the counting measure on an uncountable set
\(\Omega\) is not \(\sigma\)-finite: any \(\Omega_n\) with finite counting
measure is a finite set, and a countable union of finite sets is countable,
which cannot cover an uncountable \(\Omega\).
The \(\sigma\)-finite hypothesis is a standard regularity condition
appearing throughout measure theory: results such as the Radon–Nikodym theorem
and the Fubini–Tonelli theorems are usually stated under it, and we will invoke it in later pages whenever
our setting goes beyond the finite-measure case.
Finite Additivity
At this point, the \(\sigma\)-algebra and probability measure definitions look quite
abstract. Do we really need all of this machinery just to define probabilities?
The answer lies in the concept of countable additivity (\(\sigma\)-additivity), which requires
handling infinite sequences of disjoint events. While this property is essential for dealing with limits
and continuous spaces (like the real line), it can be technically overwhelming when we are just getting started.
Fortunately, there exists a simpler framework that only requires finite unions and sums.
This "weaker" structure serves as a stepping stone: we can first define probabilities on this simpler collection,
and then—thanks to Carathéodory's Extension Theorem—extend it to the full \(\sigma\)-algebra.
This approach is analogous to defining a function on a dense subset and then extending it by continuity.
We start with what is manageable (finite operations) and let the mathematics handle the rest (infinite operations).
Definition: Algebra (Field)
An algebra (or, a field) is a collection \(\mathcal{F}_0\) of subsets of \(\Omega\) with the following properties:
- \(\emptyset \in \mathcal{F}_0\).
- If \(A \in \mathcal{F}_0\), then \(A^c \in \mathcal{F}_0\).
- If \(A, B \in \mathcal{F}_0\), then \(A \cup B \in \mathcal{F}_0\).
Notice that this is identical to the \(\sigma\)-algebra definition, except that the third property only requires closure
under finite unions rather than countable unions. This makes it far easier to verify in practice.
Definition: Finite Additivity
A function \(\mathbb{P}: \mathcal{F}_0 \to [0, 1]\) is said to be finitely additive if
\[
\begin{align*}
&A, B \in \mathcal{F}_0 , \quad A \cap B = \emptyset \\\\
&\Longrightarrow \mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B).
\end{align*}
\]
Every \(\sigma\)-additive function is automatically finitely additive, but the converse does not hold in general: finite additivity alone is not strong enough to force countable additivity. This gap is precisely what Carathéodory's Extension Theorem (below) bridges, by showing that a \(\sigma\)-additive function defined on a small algebra can be uniquely extended to a full measure on the generated \(\sigma\)-algebra.
We examine Carathéodory's Extension Theorem next.
Carathéodory's Extension Theorem
Defining a probability measure directly on a full \(\sigma\)-algebra is often technically overwhelming because
\(\sigma\)-algebras contain incredibly complex sets (formed by countable unions and limits).
However, it is usually straightforward to define probabilities on a smaller, simpler structure called an
algebra (e.g., just finite unions of intervals).
The following theorem acts as a powerful bridge. It guarantees that if we can essentially
"get the definitions right" on the simple building blocks (the algebra), the mathematics automatically and
uniquely extends that definition to the entire complex \(\sigma\)-algebra.
Theorem: Carathéodory's Extension Theorem
Let \(\mathcal{F}_0\) be an algebra of subsets of a sample space \(\Omega\), and
let \(\mathcal{F} = \sigma(\mathcal{F}_0)\) be the \(\sigma\)-algebra that it generates.
Suppose that \(\mathbb{P}_0 : \mathcal{F}_0 \to [0, 1]\) satisfies \(\mathbb{P}_0(\Omega) = 1\)
and is \(\sigma\)-additive on \(\mathcal{F}_0\): whenever \(\{A_i\}_{i=1}^{\infty} \subset \mathcal{F}_0\)
are pairwise disjoint with \(\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}_0\), we have
\(\mathbb{P}_0\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mathbb{P}_0(A_i)\).
Then, \(\mathbb{P}_0\) can be extended uniquely to a probability measure on \((\Omega, \mathcal{F})\).
That is, there exists a unique probability measure \(\mathbb{P}\) on \((\Omega, \mathcal{F})\) such that
\[
\forall A \in \mathcal{F}_0, \quad \mathbb{P}(A) = \mathbb{P}_0(A).
\]
The proof — which constructs an outer measure from \(\mathbb{P}_0\), identifies the family of Carathéodory-measurable sets, and establishes uniqueness via a monotone-class argument — is substantial and belongs to the real-analysis track. We defer it until that later development.
This theorem provides the fundamental existence and uniqueness guarantee for measures.
Without it, we would have to manually prove the existence of a measure for every complicated Borel set, which
is practically impossible. It assures us that our intuitive notion of "length" on intervals can be rigorously
extended to a full-fledged measure without logical contradictions. In the next section, we will directly apply
this theorem to construct the Lebesgue measure from simple interval lengths.
Lebesgue measure
The uniform distribution on the interval \([0, 1]\) assigns probability \(b - a\) to every
interval \([a, b] \subset [0, 1]\). We want to define the appropriate \(\sigma\)-algebra and the probability
measure on the sample space \(\Omega = [0, 1]\), but first, we consider the sample space:
\[
\Omega' = (0, 1].
\]
Definition: Borel \(\sigma\)-algebra
Consider the collection \(\mathcal{C}\) of all intervals \((a, b]\) contained in \((0, 1]\) and let \(\mathcal{F}\) be
the \(\sigma\)-algebra generated by \(\mathcal{C}\). This is called the Borel \(\sigma\)-algebra and is
denoted by \(\mathcal{B}\). Every set that belongs to this \(\sigma\)-algebra is called a Borel (measurable) set.
Any set that can be formed from intervals \((a, b]\) by countably many unions, intersections, and complements is a Borel set. For example,
the set of rational numbers in \((0, 1]\) and its complement, the set of irrational numbers in \((0, 1]\), are Borel sets.
Since defining a probability measure directly on all Borel sets is too complicated, we start with a "smaller" collection
\(\mathcal{F}_0\) of subsets of \((0, 1]\). We let \(\mathcal{F}_0\) consist of the empty set and all sets that are finite
unions of disjoint intervals of the form \((a, b]\). That is, every nonempty \(A \in \mathcal{F}_0\) can be written as
\[
A = (a_1, b_1] \cup (a_2, b_2] \cup \cdots \cup (a_n, b_n],
\]
where \(0 \leq a_1 < b_1 \leq a_2 < b_2 \leq \cdots \leq a_n < b_n \leq 1, \quad n \in \mathbb{N}\).
Also, we define:
\[
\mathbb{P}_0(A) = (b_1 - a_1) + (b_2 - a_2) + \cdots + (b_n - a_n)
\]
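This \(\mathbb{P}_0\) satisfies \(\mathbb{P}_0((0, 1]) = 1\) and is \(\sigma\)-additive on the algebra \(\mathcal{F}_0\); the latter is the only nontrivial point, and the standard proof relies on the compactness of closed bounded intervals.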
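The algebra \(\mathcal{F}_0\) and the set function \(\mathbb{P}_0\) are concrete enough to compute with. The following is a minimal sketch, assuming a set in \(\mathcal{F}_0\) is represented as a list of pairs `(a, b)` standing for half-open intervals \((a, b]\); the helpers `normalize`, `p0`, and `complement` are hypothetical names. It only checks finite additivity and closure under complements on an example, and does not verify \(\sigma\)-additivity.

```python
from fractions import Fraction


def normalize(intervals):
    """Sort half-open intervals (a, b] and merge any that overlap or touch."""
    merged = []
    for a, b in sorted((a, b) for a, b in intervals if a < b):
        if merged and a <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return merged


def p0(intervals):
    """P_0(A) = sum of (b_i - a_i) over the disjoint pieces of A."""
    return sum((b - a for a, b in normalize(intervals)), Fraction(0))


def complement(intervals):
    """Complement within (0, 1]; again a finite union of intervals (a, b]."""
    gaps, left = [], Fraction(0)
    for a, b in normalize(intervals):
        if left < a:
            gaps.append((left, a))
        left = b
    if left < 1:
        gaps.append((left, Fraction(1)))
    return gaps


F = Fraction
A = [(F(0), F(1, 4)), (F(1, 2), F(3, 4))]   # (0, 1/4] ∪ (1/2, 3/4]
B = [(F(1, 4), F(1, 2))]                     # (1/4, 1/2], disjoint from A
print(p0(A), p0(B), p0(A + B), p0(complement(A)))
```

The half-open convention is what makes this work cleanly: \((a, b] \cup (b, c] = (a, c]\) with no overlap, and the complement of a finite union of such intervals in \((0, 1]\) is again of the same form.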
We can now apply Carathéodory's Extension Theorem and conclude that there exists a unique probability measure \(\mathbb{P}\), defined on the
entire Borel \(\sigma\)-algebra \(\mathcal{B}\), that agrees with \(\mathbb{P}_0\) on \(\mathcal{F}_0\). We call this measure
the Lebesgue or uniform measure.
In particular,
\[
\forall \, (a, b] \subset (0, 1], \quad \mathbb{P}((a, b]) = b - a.
\]
Definition: Lebesgue Measure
The probability measure \(\mathbb{P}\) on \(((0, 1], \mathcal{B})\) constructed above via Carathéodory's Extension Theorem is called the Lebesgue measure (also uniform measure) on \((0, 1]\). It is the unique measure satisfying
\[
\mathbb{P}((a, b]) = b - a \quad \text{for all } (a, b] \subset (0, 1].
\]
The construction extends to \(\mathbb{R}\) by the countable-additivity gluing described later in this section, yielding a (non-probability) measure \(\mu\) on \((\mathbb{R}, \mathcal{B})\) with \(\mu((a, b]) = b - a\) for every bounded interval.
Mathematical Detail: Borel vs Lebesgue Measure
Strictly speaking, the measure constructed above on the Borel \(\sigma\)-algebra \(\mathcal{B}\) is called the Borel measure. The Lebesgue measure, in its strictest sense, is defined on a slightly larger \(\sigma\)-algebra called the Lebesgue \(\sigma\)-algebra, which is the completion of \(\mathcal{B}\) with respect to the Borel measure. This completion ensures that every subset of a set with measure zero is itself measurable (and has measure zero).
For most practical applications in CS and statistics, this distinction is subtle — the two measures agree on every Borel set, and integrals of measurable functions are unchanged. However, the completed version is the natural setting for the "almost everywhere" identifications underlying \(L^p\) spaces and the Radon–Nikodym theorem.
Finally, by adding \(\{0\}\) to the sample space \(\Omega'\), and assigning zero probability to it, we obtain the uniform distribution
model with sample space \(\Omega = [0, 1]\). (We only need to check the measurability of \(\{0\}\), and it is also a Borel set.)
Revisit our Problem:
Remember, in the previous section, we encountered the following problem:
Consider the Dirichlet function:
\[
f(x)=
\begin{cases}
1 &\text{if \(x \in \mathbb{Q}\)} \\
0 &\text{if \(x \in \mathbb{R} \setminus \mathbb{Q}\)}
\end{cases}
\]
compute:
\[
\int_0^1 f(x)dx
\]
Now, we can see the interval \([0, 1]\) as the sample space \(\Omega\). The smallest \(\sigma\)-algebra that contains
every interval \([a, b] \subset [0, 1]\) is the Borel \(\sigma\)-algebra. We claim the Dirichlet function is measurable with respect to it, and check three steps:
- Every singleton \(\{q\}\) is Borel, since \(\{q\} = \bigcap_{n=1}^\infty (q - 1/n, q + 1/n)\) is a countable intersection of Borel sets.
- \(\mathbb{Q} \cap [0, 1]\) is Borel, as a countable union \(\bigcup_{q \in \mathbb{Q} \cap [0,1]} \{q\}\) of Borel singletons.
- The Dirichlet function is the indicator of this Borel set. The indicator of a Borel set is automatically Borel-measurable — a fact pinned down in the next page on Lebesgue integration.
Hence the Dirichlet function is Borel measurable, and therefore also Lebesgue measurable.
We are getting closer to Lebesgue integration, but first we would like to learn more about the Lebesgue measure.
Construction: Lebesgue Measure on \(\mathbb{R}\)
Consider the sample space \(\Omega = \mathbb{R}\). As usual, we define a \(\sigma\)-algebra of subsets of \(\mathbb{R}\). Let
\(\mathcal{C}\) be the collection of all intervals of the form \([a, b]\) and let \(\mathcal{B} = \sigma(\mathcal{C})\) be the
\(\sigma\)-algebra that it generates.
Let \(\mathbb{P}_n\) be the uniform measure on \((n, n+1]\). Given a set \(A \in \mathcal{B}\), we decompose it into countably
many pieces, each piece contained in some interval \((n, n+1]\), and define its "length" \(\mu(A)\) using countable additivity as follows:
\[
\mu(A) = \sum_{n = -\infty}^{\infty} \mathbb{P}_n \left(A \cap (n, n+1] \right).
\]
Since \(A \cap (n, n+1]\) is a measurable subset of \((n, n+1]\), \(\mathbb{P}_n (A \cap (n, n+1]) \geq 0\).
Thus, nonnegativity holds:
\[
\mu(A) \in [0, \infty].
\]
Also, \(\emptyset \cap (n, n+1] = \emptyset\) and \(\mathbb{P}_n ( \emptyset) = 0\), thus
\[
\mu(\emptyset) = 0.
\]
Now, we need to check countable additivity. Let \(\{A_i\}_{i=1}^{\infty}\) be a countable collection of
pairwise disjoint sets in \(\mathcal{B}\). For each fixed \(n\), the sets \(\left\{A_i \cap (n, n+1] \right\}_{i=1}^{\infty}\) are
also pairwise disjoint. Since \(\mathbb{P}_n\) is the uniform measure on \((n, n+1]\), it satisfies countable additivity. That is,
\[
\begin{align*}
&\mathbb{P}_n \left(\bigcup_{i=1}^{\infty} (A_i \cap (n, n+1] )\right) = \sum_{i=1}^{\infty} \mathbb{P}_n \left(A_i \cap (n, n+1] \right) \\\\
&\Longrightarrow \mathbb{P}_n \left( \left( \bigcup_{i=1}^{\infty} A_i \right) \cap (n, n+1] \right) = \sum_{i=1}^{\infty} \mathbb{P}_n \left(A_i \cap (n, n+1]\right).
\end{align*}
\]
Then we have:
\[
\begin{align*}
\mu\left( \bigcup_{i=1}^{\infty} A_i \right) &= \sum_{n=-\infty}^{\infty} \mathbb{P}_n \left(\left( \bigcup_{i=1}^{\infty} A_i \right) \cap (n, n+1] \right) \\\\
&= \sum_{n= -\infty}^{\infty} \sum_{i = 1}^{\infty} \mathbb{P}_n (A_i \cap (n, n+1] ) \\\\
&= \sum_{i = 1}^{\infty} \sum_{n= -\infty}^{\infty} \mathbb{P}_n (A_i \cap (n, n+1] ) \\\\
&= \sum_{i = 1}^{\infty} \mu(A_i).
\end{align*}
\]
(We can exchange the order of summation because the terms in the sums are nonnegative — this is Tonelli's theorem for counting measures, and in elementary terms a direct consequence of nonnegative-term double series convergence.)
Therefore, \(\mu\) is a measure on \((\mathbb{R}, \mathcal{B})\), which is the Lebesgue measure.
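The defining sum can be evaluated by hand for a single interval. The sketch below, with the hypothetical helper `mu_by_cells`, computes \(\mu((a, b])\) as the sum over \(n\) of the length of \((a, b] \cap (n, n+1]\) and compares it with \(b - a\). It assumes \(A\) is one bounded interval; general Borel sets are out of reach of such a direct computation.

```python
import math
from fractions import Fraction


def mu_by_cells(a, b):
    """mu((a, b]) as the sum over n of P_n((a, b] ∩ (n, n+1]).

    P_n is the uniform measure on (n, n+1], so each term is the length of
    the overlap between (a, b] and the cell (n, n+1].
    """
    total = Fraction(0)
    # Only cells with n in [floor(a), ceil(b) - 1] can meet (a, b].
    for n in range(math.floor(a), math.ceil(b)):
        lo, hi = max(a, n), min(b, n + 1)
        if lo < hi:
            total += hi - lo
    return total


a, b = Fraction(-3, 2), Fraction(9, 4)
print(mu_by_cells(a, b), b - a)
```

For \((-3/2, 9/4]\) the five nonempty pieces have lengths \(1/2, 1, 1, 1, 1/4\), which add up to \(15/4 = b - a\), as expected of a measure that extends interval length.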
In discrete probability, an elementary outcome \(\omega\) has positive mass.
However, in the Lebesgue measure on \(\mathbb{R}\), every singleton set \(\{x\}\) has measure zero.
To see this, we use continuity of measure (from above). Consider a sequence of intervals decreasing to \(\{x\}\):
\[
A_n = \left(x - \frac{1}{n}, x\right].
\]
Clearly, \(\bigcap_{n=1}^{\infty} A_n = \{x\}\). Since \(\mu(A_1) = 1 < \infty\), continuity from above applies, and because the Lebesgue measure of an interval is its length, we get
\[
\mu(\{x\}) = \mu\left(\bigcap_{n=1}^{\infty} A_n\right) = \lim_{n \to \infty} \mu\left(\left(x - \frac{1}{n}, x\right]\right) = \lim_{n \to \infty} \frac{1}{n} = 0.
\]
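This implies that any countable set \(\{x_1, x_2, \dots\}\) (like \(\mathbb{Q}\), the rational numbers) also has measure zero,
because countable additivity gives \(\mu\left(\bigcup_{i} \{x_i\}\right) = \sum_{i} \mu(\{x_i\}) = 0\).
In particular, \(\mathbb{Q} \cap [0, 1]\), the set on which the Dirichlet function equals 1, has measure zero.
This is essentially why, under the uniform distribution on \([0, 1]\), the probability of drawing any prescribed number, or even any rational number, is zero.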
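A complementary way to build intuition for "countable implies measure zero" is the covering argument: cover the \(k\)-th point of the set by an interval of length \(\varepsilon / 2^{k+1}\), so that the total length of the cover stays below \(\varepsilon\). This is a different route from the additivity argument above, and the sketch below only illustrates it for finitely many rationals in \([0, 1]\); the enumeration order and the helper names are our own choices.

```python
from fractions import Fraction
from math import gcd


def rationals_in_unit_interval():
    """Enumerate the rationals in [0, 1] without repetition: 0, 1, 1/2, 1/3, 2/3, ..."""
    yield Fraction(0)
    yield Fraction(1)
    q = 2
    while True:
        for p in range(1, q):
            if gcd(p, q) == 1:
                yield Fraction(p, q)
        q += 1


def cover(eps, count):
    """Pair the k-th rational with a covering interval of length eps / 2**(k+1)."""
    points = rationals_in_unit_interval()
    return [(next(points), eps / 2 ** (k + 1)) for k in range(count)]


eps = Fraction(1, 100)
intervals = cover(eps, 50)
total_length = sum((length for _, length in intervals), Fraction(0))
print(total_length < eps)
```

However many rationals we cover, the total length is \(\varepsilon (1 - 2^{-N}) < \varepsilon\). Since \(\varepsilon\) is arbitrary, the measure of the covered set can only be zero.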
Note: Since \(\mu(\mathbb{R}) = \infty\), this is NOT a probability measure.