Measure Theory with Probability

Introduction

In the previous section, we observed that certain functions are not integrable in either the Riemann or the improper Riemann sense. To handle such cases, we need measure theory and Lebesgue integration. Since our focus is on applied mathematics, particularly in the context of statistics and machine learning, we will concentrate on probability-related measure theory. While we will not cover every foundational topic or formal detail, our goal is to build a solid understanding that leads to the introduction of the Lebesgue integral.

Note that measure theory fundamentally defines the "volume" and structure of a space. This perspective is crucial when transitioning from discrete symmetries to continuous ones, such as \(SE(3)\), where the measure remains invariant under rotation.
(See: Section I - Part 24 Geometry of Symmetry)

Probability Space

While the intuitive notion of "volume" or "probability" works well for simple shapes and finite sets, it becomes surprisingly fragile when applied to the continuum of real numbers or complex manifolds. To handle these cases without leading to logical paradoxes, we must move beyond mere intuition and adopt a rigorous axiomatic framework.

In this context, we don't just "measure" things; we define a structured environment where every operation is logically consistent. This is why we treat a probabilistic model not as a single value, but as a triple of interconnected components.

Now, we need a formal definition of a probabilistic model:

Definition: Probability Space

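A probability space is a triple \((\Omega, \mathcal{F}, \mathbb{P})\) where

\(\Omega\) is the sample space: the set of all possible outcomes of the experiment.
\(\mathcal{F}\) is a \(\sigma\)-algebra: a collection of subsets of \(\Omega\), called events, to which probabilities are assigned.
\(\mathbb{P}\) is a probability measure: a function that assigns a number in \([0, 1]\) to each event in \(\mathcal{F}\).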

Sample Space

The sample space \(\Omega\) can be finite, countable, or uncountable. An element of \(\Omega\) is denoted by \(\omega\), and is called an elementary outcome.

For example, if our experiment consists of an infinite number of consecutive rolls of a die, the sample space is the set: \[ \Omega = \{1, 2, 3, 4, 5, 6\}^{\infty} \] and an elementary outcome is an infinite sequence such as: \[ \omega = (1, 1, 4, 3, 1, 5, \cdots ). \]

A simpler case is a discrete probability space. In this case, the sample space is finite or countable: \[ \Omega = \{\omega_1, \omega_2, \cdots \}, \] and the \(\sigma\)-algebra is the set of all subsets of \(\Omega\). The probability measure then assigns a number in \([0, 1]\) to every subset of \(\Omega\). It is defined in terms of the probabilities \(\mathbb{P}(\{\omega\})\) of the elementary outcomes and satisfies \[ \forall A \subset \Omega, \quad \mathbb{P}(A) = \sum_{\omega \in A}\mathbb{P}(\{\omega\}) \] and \[ \sum_{\omega \in \Omega}\mathbb{P}(\{\omega\}) = 1. \] Note: We will write \(\mathbb{P}(\omega)\) instead of \(\mathbb{P}(\{\omega\})\), and \(\mathbb{P}(\omega_i)\) will be denoted by \(p_i\).
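The discrete case can be made concrete with a few lines of Python. This is a minimal sketch, assuming a single roll of a fair die as the experiment; the names `omega`, `p`, and `prob` are illustrative and not part of the text above.

```python
from fractions import Fraction

# Sample space for a single roll of a fair die (an illustrative assumption).
omega = [1, 2, 3, 4, 5, 6]

# Probabilities of the elementary outcomes, p_i = P({omega_i}).
p = {w: Fraction(1, 6) for w in omega}

def prob(event):
    """P(A) = sum of P({w}) over w in A, for any subset A of omega."""
    return sum((p[w] for w in event), Fraction(0))

evens = {2, 4, 6}
print(prob(evens))       # 1/2
print(prob(set(omega)))  # 1
print(prob(set()))       # 0
```

Exact fractions are used instead of floats so that the normalization \(\sum_{\omega} \mathbb{P}(\{\omega\}) = 1\) holds exactly rather than up to rounding.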

\(\sigma\)-algebra (\(\sigma\)-field)

Ideally, we would like to specify the probability \(\mathbb{P} (A)\) of every subset of \(\Omega\). However, this is too complicated and, when \(\Omega\) is uncountable, can even be impossible to do consistently. So, we assign probabilities only to a partial collection of subsets of \(\Omega\). The sets in this collection are to be thought of as the "nice" and "interesting" subsets of \(\Omega\). Formally, we define the collection as follows:

Definition: \(\sigma\)-algebra

A \(\sigma\)-algebra \(\mathcal{F}\) is a collection of subsets of \(\Omega\) with the following properties:

\(\emptyset \in \mathcal{F}\).
If \(A \in \mathcal{F}\), then \(A^c \in \mathcal{F}\).
If \(\{A_i\}_{i=1}^{\infty} \subset \mathcal{F}\), then \(\bigcup_{i =1}^{\infty} A_i \in \mathcal{F}\).

A set \(A \in \mathcal{F}\) is called an event, an \(\mathcal{F}\)-measurable set, or simply a measurable set. The pair \((\Omega, \mathcal{F})\) is called a measurable space.
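For a finite sample space, the three properties can be checked mechanically. The sketch below is only an illustration under that finiteness assumption; the function name `is_sigma_algebra` is hypothetical.

```python
from itertools import combinations

def is_sigma_algebra(omega, family):
    """Check the three axioms for a family of subsets of a finite omega.

    With finitely many sets in the family, a countable union of members is
    really a finite union, so checking pairwise unions is enough.
    """
    omega = frozenset(omega)
    family = {frozenset(a) for a in family}
    if frozenset() not in family:                      # empty set belongs
        return False
    if any(omega - a not in family for a in family):   # closed under complement
        return False
    return all(a | b in family for a, b in combinations(family, 2))  # unions

omega = {1, 2, 3, 4}
good = [set(), {1, 2}, {3, 4}, omega]
bad = [set(), {1}, omega]  # missing the complement {2, 3, 4}

print(is_sigma_algebra(omega, good))  # True
print(is_sigma_algebra(omega, bad))   # False
```

On an infinite sample space no such exhaustive check is possible, which is one reason the axioms are stated abstractly.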

Given a collection \(\mathcal{C}\) of subsets of \(\Omega\), we usually want the smallest \(\sigma\)-algebra containing \(\mathcal{C}\). We define it as the intersection of all \(\sigma\)-algebras that contain \(\mathcal{C}\); this intersection is itself a \(\sigma\)-algebra. The resulting \(\mathcal{F}\) is said to be the \(\sigma\)-algebra generated by \(\mathcal{C}\), and is denoted by \(\sigma(\mathcal{C})\).
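The intersection definition is not constructive, but on a finite sample space the same object can be reached by repeatedly closing \(\mathcal{C}\) under complements and unions until nothing new appears. The following is a sketch under that finiteness assumption; `generate_sigma_algebra` is a hypothetical helper name.

```python
def generate_sigma_algebra(omega, collection):
    """Return sigma(C) for a finite omega by closing C under complement and union."""
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(c) for c in collection}
    while True:
        new = {omega - a for a in family}                # complements
        new |= {a | b for a in family for b in family}   # pairwise unions
        if new <= family:                                # nothing new: closed
            return family
        family |= new

omega = {1, 2, 3, 4}
sigma = generate_sigma_algebra(omega, [{1}, {2}])
print(len(sigma))  # 8
print(sorted(sorted(a) for a in sigma))
```

Here the generated \(\sigma\)-algebra has the three "atoms" \(\{1\}\), \(\{2\}\), \(\{3, 4\}\), hence \(2^3 = 8\) members; the set \(\{3\}\) is not measurable because \(\mathcal{C}\) never separates 3 from 4.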

Probability Measure

A collection of sets \(A_{\alpha} \subset \Omega\), where \(\alpha\) ranges over some index set, is called mutually exclusive (equivalently, the sets are called disjoint) if \(A_{\alpha} \cap A_{\alpha'} = \emptyset \) whenever \(\alpha \neq \alpha'\). Also, the sets \(A_{\alpha} \subset \Omega\) are called collectively exhaustive if \(\bigcup_{\alpha} A_{\alpha} = \Omega\).

Definition: Measure

A measure is a function \[ \mu: \mathcal{F} \to [0, \infty] \] which assigns a nonnegative extended real number \(\mu(A)\) to every set \(A \in \mathcal{F}\), and which satisfies the following two conditions:

\(\mu(\emptyset) = 0\)
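Countable additivity (\(\sigma\)-additivity): if \(\{A_i\}_{i=1}^{\infty} \subset \mathcal{F}\) are pairwise disjoint, then \[ \mu\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mu(A_i). \]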

Definition: Probability Measure

A probability measure is a measure \(\mathbb{P}\) with the additional property \(\mathbb{P}(\Omega) = 1\). In this case, the triple \((\Omega, \mathcal{F}, \mathbb{P})\) is called a probability space.

A crucial consequence of countable additivity is that the probability measure is continuous. This continuity property of measure ensures that probabilities behave well under limits, which is essential for defining concepts like convergence of random variables and the law of large numbers.

If a sequence of events \(A_n\) is monotonic (nested), we can swap the limit and the probability function:

Continuity from below: If \(A_n \uparrow A\) (i.e., \(A_1 \subset A_2 \subset \dots\) and \(\cup A_n = A\)), then \[ \lim_{n \to \infty} \mathbb{P}(A_n) = \mathbb{P}(A). \]
Continuity from above: If \(A_n \downarrow A\) (i.e., \(A_1 \supset A_2 \supset \dots\) and \(\cap A_n = A\)), then \[ \lim_{n \to \infty} \mathbb{P}(A_n) = \mathbb{P}(A). \]
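Both continuity properties can be observed numerically on a discrete probability space. The sketch below assumes \(\Omega = \{1, 2, 3, \dots\}\) with \(\mathbb{P}(\{k\}) = 2^{-k}\), an example chosen here for illustration; it uses the increasing sets \(A_n = \{1, \dots, n\} \uparrow \Omega\) and the decreasing sets \(B_n = \{n, n+1, \dots\} \downarrow \emptyset\).

```python
from fractions import Fraction

def p(k):
    """Elementary probability P({k}) = 2^(-k) on omega = {1, 2, 3, ...}."""
    return Fraction(1, 2**k)

def prob_up(n):
    """P(A_n) for A_n = {1, ..., n}; the A_n increase to omega."""
    return sum((p(k) for k in range(1, n + 1)), Fraction(0))

def prob_down(n):
    """P(B_n) for B_n = {n, n+1, ...}, the complement of A_(n-1)."""
    return 1 - prob_up(n - 1)

for n in (1, 5, 10, 20):
    # First column tends to P(omega) = 1, second to P(empty set) = 0.
    print(n, float(prob_up(n)), float(prob_down(n)))
```

In this example \(\mathbb{P}(A_n) = 1 - 2^{-n}\) and \(\mathbb{P}(B_n) = 2^{-(n-1)}\), so the limits agree with \(\mathbb{P}(\Omega) = 1\) and \(\mathbb{P}(\emptyset) = 0\) as the two statements predict.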

Countable additivity implies that probabilities (and, more generally, measures) behave like the notion of volume: the volume of a countable union of disjoint sets is the sum of their individual "volumes." Indeed, a measure is a generalized notion of volume that characterizes the "size" of sets within a space such as a manifold. In physical settings such as rigid body motion in robotics, described by \(SE(3)\), invariance of the measure ensures that the volume of an object is preserved under any rigid body motion.

Finite Additivity

At this point, the \(\sigma\)-algebra and probability measure definitions look quite abstract. Do we really need all of this machinery just to define probabilities?

The answer lies in the concept of countable additivity (\(\sigma\)-additivity), which requires handling infinite sequences of disjoint events. While this property is essential for dealing with limits and continuous spaces (like the real line), it can be technically overwhelming when we are just getting started.

Fortunately, there exists a simpler framework whose sets only need to be closed under finite unions. This "weaker" structure serves as a stepping stone: we can first define probabilities on this simpler collection and then, thanks to Carathéodory's Extension Theorem, extend them to the full \(\sigma\)-algebra, provided the definition is countably additive wherever that condition makes sense on the simpler collection.

This approach is analogous to defining a function on a dense subset and then extending it by continuity. We start with what is manageable (finite operations) and let the mathematics handle the rest (infinite operations).

Definition: Algebra (Field)

An algebra (or, a field) is a collection \(\mathcal{F}_0\) of subsets of \(\Omega\) with the following properties:

\(\emptyset \in \mathcal{F}_0\).
If \(A \in \mathcal{F}_0\), then \(A^c \in \mathcal{F}_0\).
If \(A, B \in \mathcal{F}_0\), then \(A \cup B \in \mathcal{F}_0\).

Notice that this is identical to the \(\sigma\)-algebra definition, except that the third property only requires closure under finite unions rather than countable unions. This makes it far easier to verify in practice.

Definition: Finite Additivity

A function \(\mathbb{P}: \mathcal{F}_0 \to [0, 1]\) is said to be finitely additive if \[ \begin{align*} &A, B \in \mathcal{F}_0 , \quad A \cap B = \emptyset \\ &\Longrightarrow \mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B). \end{align*} \]

Finite additivity is strictly weaker than countable additivity (\(\sigma\)-additivity). Every \(\sigma\)-additive function is finitely additive, but the converse is not true.
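A standard textbook counterexample (not taken from the text above) shows why the converse fails. Take \(\Omega = \{1, 2, 3, \dots\}\) and let \(\mathcal{F}_0\) be the algebra of sets that are finite or cofinite (have finite complement), with \(\mathbb{P}_0(A) = 0\) for finite \(A\) and \(\mathbb{P}_0(A) = 1\) for cofinite \(A\). The sketch below represents such sets with a hypothetical `FinCofin` class.

```python
class FinCofin:
    """A subset of N = {1, 2, ...} that is either finite or cofinite.

    `points` stores the set itself when finite, or its complement when cofinite.
    """
    def __init__(self, points, cofinite=False):
        self.points = frozenset(points)
        self.cofinite = cofinite

    def union(self, other):
        if self.cofinite and other.cofinite:
            # Complement of the union is the intersection of the complements.
            return FinCofin(self.points & other.points, True)
        if self.cofinite:
            return FinCofin(self.points - other.points, True)
        if other.cofinite:
            return FinCofin(other.points - self.points, True)
        return FinCofin(self.points | other.points, False)

def p0(a):
    """P_0(A) = 1 if A is cofinite, 0 if A is finite."""
    return 1 if a.cofinite else 0

# Finite additivity on one disjoint pair: {1, 2} and N minus {1, 2, 3}.
a = FinCofin({1, 2})
b = FinCofin({1, 2, 3}, cofinite=True)
u = a.union(b)
print(p0(u), p0(a) + p0(b))  # 1 1

# Countable additivity fails: N is the disjoint union of the singletons {k},
# each of probability 0, yet P_0(N) = 1.
whole = FinCofin(set(), cofinite=True)
partial_sums = [sum(p0(FinCofin({k})) for k in range(1, n + 1)) for n in (1, 10, 100)]
print(partial_sums, p0(whole))  # [0, 0, 0] 1
```

Two disjoint sets in this algebra cannot both be cofinite, which is why finite additivity holds in general and not only for the pair checked here.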

This raises an important question: If we define probabilities on an algebra, when can we extend them to a full probability measure on a \(\sigma\)-algebra? Finite additivity alone is not enough. The answer is provided by Carathéodory's Extension Theorem, which asks for \(\sigma\)-additivity on the algebra and which we examine next.

Carathéodory's Extension Theorem

Defining a probability measure directly on a full \(\sigma\)-algebra is often technically overwhelming because \(\sigma\)-algebras contain incredibly complex sets (formed by countable unions and limits). However, it is usually straightforward to define probabilities on a smaller, simpler structure called an algebra (e.g., just finite unions of intervals).

The following theorem acts as a powerful bridge. It guarantees that if we can essentially "get the definitions right" on the simple building blocks (the algebra), the mathematics automatically and uniquely extends that definition to the entire complex \(\sigma\)-algebra.

Theorem 1: Carathéodory's Extension Theorem

Let \(\mathcal{F}_0\) be an algebra of subsets of a sample space \(\Omega\), and let \(\mathcal{F} = \sigma(\mathcal{F}_0)\) be the \(\sigma\)-algebra that it generates.
Suppose that \(\mathbb{P}_0 : \mathcal{F}_0 \to [0, 1]\) satisfies \(\mathbb{P}_0(\Omega) = 1\) and is \(\sigma\)-additive on \(\mathcal{F}_0\): whenever \(\{A_i\}_{i=1}^{\infty} \subset \mathcal{F}_0\) are pairwise disjoint with \(\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}_0\), we have \(\mathbb{P}_0\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mathbb{P}_0(A_i)\).
Then, \(\mathbb{P}_0\) can be extended uniquely to a probability measure on \((\Omega, \mathcal{F})\). That is, there exists a unique probability measure \(\mathbb{P}\) on \((\Omega, \mathcal{F})\) such that \[ \forall A \in \mathcal{F}_0, \quad \mathbb{P}(A) = \mathbb{P}_0(A). \]

This theorem provides the fundamental existence and uniqueness guarantee for measures. Without it, we would have to manually prove the existence of a measure for every complicated Borel set, which is practically impossible. It assures us that our intuitive notion of "length" on intervals can be rigorously extended to a full-fledged measure without logical contradictions. In the next section, we will directly apply this theorem to construct the Lebesgue measure from simple interval lengths.

Lebesgue Measure

The uniform distribution on the interval \([0, 1]\) assigns probability \(b - a\) to every interval \([a, b] \subset [0, 1]\). We want to define the appropriate \(\sigma\)-algebra and the probability measure on the sample space \(\Omega = [0, 1]\), but first, we consider the sample space: \[ \Omega' = (0, 1]. \]

Definition: Borel \(\sigma\)-algebra

Consider the collection \(\mathcal{C}\) of all intervals \((a, b]\) contained in \((0, 1]\) and let \(\mathcal{F}\) be the \(\sigma\)-algebra generated by \(\mathcal{C}\). This is called the Borel \(\sigma\)-algebra and is denoted by \(\mathcal{B}\). Every set that belongs to this \(\sigma\)-algebra is called a Borel (measurable) set.

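Any set that can be formed from intervals \((a, b]\) by countably many unions, intersections, and complements is a Borel set. For example, each singleton \(\{x\} = \bigcap_{n > 1/x} (x - \frac{1}{n}, x]\) with \(x \in (0, 1]\) is a Borel set. Hence the set of rational numbers in \((0, 1]\), being a countable union of singletons, and its complement, the set of irrational numbers in \((0, 1]\), are Borel sets.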

Since defining a probability measure directly on all Borel sets is too complicated, we start with a "smaller" collection \(\mathcal{F}_0\) of subsets of \((0, 1]\). We let \(\mathcal{F}_0\) consist of the empty set and all sets that are finite unions of disjoint intervals of the form \((a, b]\). That is, every nonempty \(A \in \mathcal{F}_0\) can be written as \[ A = (a_1, b_1] \cup (a_2, b_2] \cup \cdots \cup (a_n, b_n], \] where \(0 \leq a_1 < b_1 \leq a_2 < b_2 \leq \cdots \leq a_n < b_n \leq 1, \quad n \in \mathbb{N}\). This collection is an algebra. We set \(\mathbb{P}_0(\emptyset) = 0\) and define \[ \mathbb{P}_0(A) = (b_1 - a_1) + (b_2 - a_2) + \cdots + (b_n - a_n), \] which can be shown to be \(\sigma\)-additive on \(\mathcal{F}_0\) (we omit the proof).
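The pre-measure \(\mathbb{P}_0\) is simple enough to compute directly. The following sketch represents a set in \(\mathcal{F}_0\) as a list of endpoint pairs \((a_i, b_i)\) standing for the intervals \((a_i, b_i]\); this representation and the function name `p0` are assumptions made for illustration.

```python
from fractions import Fraction

def p0(intervals):
    """P_0 of a finite union of disjoint intervals (a, b] inside (0, 1].

    `intervals` is a list of (a, b) pairs; the empty list is the empty set.
    """
    prev_b = Fraction(0)
    total = Fraction(0)
    for a, b in sorted(intervals):
        a, b = Fraction(a), Fraction(b)
        # Enforce 0 <= a_1 < b_1 <= a_2 < b_2 <= ... <= b_n <= 1.
        if not (prev_b <= a < b <= 1):
            raise ValueError("intervals must be nonempty, disjoint, and inside (0, 1]")
        total += b - a
        prev_b = b
    return total

A = [(Fraction(0), Fraction(1, 4)), (Fraction(1, 2), Fraction(3, 4))]
B = [(Fraction(1, 4), Fraction(1, 2))]

print(p0(A))       # 1/2
print(p0(A + B))   # 3/4, equal to p0(A) + p0(B) since A and B are disjoint
print(p0([]))      # 0
```

The code only checks finite additivity on examples; the \(\sigma\)-additivity of \(\mathbb{P}_0\) on \(\mathcal{F}_0\) is a genuine theorem and cannot be verified by finitely many evaluations.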

We can now apply Carathéodory's Extension Theorem and conclude that there exists a unique probability measure \(\mathbb{P}\), defined on the entire Borel \(\sigma\)-algebra \(\mathcal{B}\), that agrees with \(\mathbb{P}_0\) on \(\mathcal{F}_0\). We call this measure the Lebesgue or uniform measure. In particular, \[ \forall \, (a, b] \subset (0, 1], \quad \mathbb{P}((a, b]) = b - a. \]

Mathematical Detail: Strictly speaking, the measure constructed on the Borel \(\sigma\)-algebra \(\mathcal{B}\) is called the Borel measure. The Lebesgue measure is typically defined on a slightly larger \(\sigma\)-algebra called the Lebesgue \(\sigma\)-algebra, which is the "completion" of \(\mathcal{B}\). This completion ensures that any subset of a set with measure zero is also measurable (and has measure zero). For most practical applications in CS and statistics, this distinction is subtle, but it is fundamental in analysis.

Finally, by adding \(\{0\}\) to the sample space \(\Omega'\), and assigning zero probability to it, we obtain the uniform distribution model with sample space \(\Omega = [0, 1]\). (We only need to check the measurability of \(\{0\}\), and it is also a Borel set.)

Revisiting Our Problem

Remember, in the previous section, we encountered the following problem:

Consider the Dirichlet function: \[ f(x)= \begin{cases} 1 &\text{if \(x \in \mathbb{Q}\)} \\ 0 &\text{if \(x \in \mathbb{R} \setminus \mathbb{Q}\)} \end{cases} \] compute: \[ \int_0^1 f(x)dx \]

Now, we can view the interval \([0, 1]\) as the sample space \(\Omega\). The smallest \(\sigma\)-algebra that contains every interval \([a, b] \subset [0, 1]\) is the Borel \(\sigma\)-algebra. Moreover, the Dirichlet function is measurable with respect to the Borel \(\sigma\)-algebra: it takes only the values 0 and 1, so the preimage of any set is one of \(\emptyset\), \(\mathbb{Q} \cap [0, 1]\), \([0, 1] \setminus \mathbb{Q}\), or \([0, 1]\), and all of these are Borel sets. Since every Borel set is also Lebesgue measurable, the Dirichlet function is Lebesgue measurable.

We are getting closer to Lebesgue integration, but first we would like to learn more about the Lebesgue measure.

Consider the sample space \(\Omega = \mathbb{R}\). As usual, we define a \(\sigma\)-algebra of subsets of \(\mathbb{R}\). Let \(\mathcal{C}\) be the collection of all intervals of the form \([a, b]\) and let \(\mathcal{B} = \sigma(\mathcal{C})\) be the \(\sigma\)-algebra that it generates.
Let \(\mathbb{P}_n\) be the uniform measure on \((n, n+1]\), constructed in the same way as the uniform measure on \((0, 1]\). Given a set \(A \in \mathcal{B}\), we decompose it into countably many pieces, each piece contained in some interval \((n, n+1]\), and define its "length" \(\mu(A)\) using countable additivity as follows: \[ \mu(A) = \sum_{n = -\infty}^{\infty} \mathbb{P}_n \left(A \cap (n, n+1] \right). \]
Since \(A \cap (n, n+1]\) is a measurable subset of \((n, n+1]\), \(\mathbb{P}_n (A \cap (n, n+1]) \geq 0\). Thus, nonnegativity holds: \[ \mu(A) \in [0, \infty]. \] Also, \(\emptyset \cap (n, n+1] = \emptyset\) and \(\mathbb{P}_n ( \emptyset) = 0\), thus \[\mu(\emptyset) = 0. \]
Now, we need to check countable additivity. Let \(\{A_i\}_{i=1}^{\infty}\) be a countable collection of pairwise disjoint sets in \(\mathcal{B}\). For each fixed \(n\), the sets \(\left\{A_i \cap (n, n+1] \right\}_{i=1}^{\infty}\) are also pairwise disjoint. Since \(\mathbb{P}_n\) is the uniform measure on \((n, n+1]\), it satisfies countable additivity. That is, \[ \begin{align*} &\mathbb{P}_n \left(\bigcup_{i=1}^{\infty} (A_i \cap (n, n+1] )\right) = \sum_{i=1}^{\infty} \mathbb{P}_n \left(A_i \cap (n, n+1] \right) \\\\ &\Longrightarrow \mathbb{P}_n \left( \left( \bigcup_{i=1}^{\infty} A_i \right) \cap (n, n+1] \right) = \sum_{i=1}^{\infty} \mathbb{P}_n \left(A_i \cap (n, n+1]\right) \end{align*} \]
Then we have: \[ \begin{align*} \mu\left( \bigcup_{i=1}^{\infty} A_i \right) &= \sum_{n=-\infty}^{\infty} \mathbb{P}_n \left(\left( \bigcup_{i=1}^{\infty} A_i \right) \cap (n, n+1] \right) \\\\ &= \sum_{n= -\infty}^{\infty} \sum_{i = 1}^{\infty} \mathbb{P}_n (A_i \cap (n, n+1] ) \\\\ &= \sum_{i = 1}^{\infty} \sum_{n= -\infty}^{\infty} \mathbb{P}_n (A_i \cap (n, n+1] ) \\\\ &= \sum_{i = 1}^{\infty} \mu(A_i). \end{align*} \] (We can exchange the order of summation because the terms in the sums are nonnegative.)
Therefore, \(\mu\) is a measure on \((\mathbb{R}, \mathcal{B})\), which is the Lebesgue measure.
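For an interval \(A = (a, b]\), the defining sum can be evaluated piece by piece and compared with the expected length \(b - a\). This is a minimal sketch restricted to intervals, with a hypothetical function name `mu_interval`; general Borel sets are not representable this simply.

```python
import math
from fractions import Fraction

def mu_interval(a, b):
    """mu((a, b]) as the sum over n of P_n((a, b] intersected with (n, n+1])."""
    a, b = Fraction(a), Fraction(b)
    total = Fraction(0)
    # Only the unit cells (n, n+1] with floor(a) <= n < ceil(b) can meet (a, b].
    for n in range(math.floor(a), math.ceil(b)):
        lo, hi = max(a, n), min(b, n + 1)
        if lo < hi:
            total += hi - lo  # P_n of the piece (lo, hi] is its length
    return total

print(mu_interval(Fraction(-3, 2), Fraction(9, 4)))  # 15/4
print(mu_interval(0, 1))                             # 1
```

In the first call the interval \((-\tfrac{3}{2}, \tfrac{9}{4}]\) is split across five unit cells with contributions \(\tfrac12, 1, 1, 1, \tfrac14\), which sum to \(\tfrac{15}{4} = b - a\).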

In discrete probability, an elementary outcome \(\omega\) can have positive mass. However, under the Lebesgue measure on \(\mathbb{R}\), every singleton set \(\{x\}\) has measure zero.

Using the continuity of measure from above (which also holds for a general measure when \(\mu(A_1) < \infty\)), consider a sequence of intervals decreasing to \(\{x\}\): \[ A_n = (x - \frac{1}{n}, x] \] Clearly, \(\bigcap_{n=1}^{\infty} A_n = \{x\}\), and \(\mu(A_1) = 1 < \infty\). Since the Lebesgue measure of an interval is its length: \[ \mu(\{x\}) = \mu(\bigcap_{n=1}^{\infty} A_n) = \lim_{n \to \infty} \mu((x - \frac{1}{n}, x]) = \lim_{n \to \infty} \frac{1}{n} = 0. \] This implies that any countable set \(\{x_1, x_2, \dots\}\) (like \(\mathbb{Q}\), the rational numbers) also has measure zero, because by countable additivity \(\mu(\bigcup_i \{x_i\}) = \sum_i \mu(\{x_i\}) = 0\). This is essentially why the "probability" of picking an exact integer from the real line is zero.
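The shrinking intervals \(A_n = (x - \frac{1}{n}, x]\) can be inspected directly. The sketch below assumes the sample point \(x = \frac{1}{3}\) and a nearby point \(x - \frac{1}{1000}\), both chosen only for illustration; it checks that \(x\) lies in every \(A_n\) tested, that the nearby point is eventually excluded, and that the lengths are \(\frac{1}{n}\).

```python
from fractions import Fraction

def in_A(n, x, y):
    """Membership test: is y in A_n = (x - 1/n, x]?"""
    return x - Fraction(1, n) < y <= x

def length_A(n):
    """mu(A_n) = x - (x - 1/n) = 1/n, independent of x."""
    return Fraction(1, n)

x = Fraction(1, 3)
near = x - Fraction(1, 1000)  # a point just to the left of x

# x itself belongs to every A_n that we test.
print(all(in_A(n, x, x) for n in range(1, 5001)))               # True
# The nearby point drops out as soon as 1/n <= 1/1000.
print(next(n for n in range(1, 5001) if not in_A(n, x, near)))  # 1000
# The lengths 1/n shrink toward 0.
print([float(length_A(n)) for n in (1, 10, 100, 1000)])
```

Any other point to the left of \(x\) is excluded in the same way for large enough \(n\), which is the content of \(\bigcap_{n} A_n = \{x\}\).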

Note: Since \(\mu(\mathbb{R}) = \infty\), this is NOT a probability measure.