Probability
At the heart of machine learning lies a simple but profound question: how do we reason about uncertainty?
Every prediction a model makes, every classification it outputs, rests on a mathematical framework for
quantifying the likelihood of events. Probability theory provides this framework. Before
we can discuss random variables, distributions, or Bayesian inference, we must establish the foundational
language of sample spaces, events, and the axioms that govern how probabilities behave.
Definition: Sample Space and Events
The set of all possible outcomes of an experiment is called the sample space,
denoted \(S\). It can be discrete (finite or countably infinite) or continuous (uncountably infinite).
An event \(A\) is a subset of the sample space, \(A \subseteq S\). An event
is said to occur if the outcome of the experiment belongs to \(A\).
With the sample space and events defined, we need a consistent way to assign numerical
measures of likelihood to events. The following axioms provide the minimal requirements that any
probability assignment must satisfy.
Definition: Probability Axioms
A probability \(P\) on a sample space \(S\) is a function \(P: \mathcal{F} \to [0,1]\)
(where \(\mathcal{F}\) is a collection of events) satisfying:
- Non-negativity: \(0 \leq P(A) \leq 1\) for every event \(A\).
- Normalization: \(P(S) = 1\).
- Additivity: If events \(A\) and \(B\) are mutually exclusive
(i.e., \(A \cap B = \emptyset\)), then
\[
P(A \cup B) = P(A) + P(B).
\]
These axioms immediately imply \(P(\emptyset) = 0\), since \(S\) and \(\emptyset\) are mutually exclusive
and \(S = S \cup \emptyset\), giving \(P(S) = P(S) + P(\emptyset)\).
When the sample space is finite and all outcomes are equally likely, the probability of an event
reduces to a counting problem:
\[
P(A) = \frac{|A|}{|S|} = \frac{\text{number of outcomes in } A}{\text{number of outcomes in } S}.
\]
This classical formula motivates the counting techniques below.
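For instance, with two fair dice the formula reduces to brute-force enumeration. The short Python sketch below (an illustration, using the two-dice space as a running example) lists all 36 outcomes and counts those summing to 7.

```python
from fractions import Fraction
from itertools import product

# Sample space S: all ordered outcomes of rolling two fair dice.
S = list(product(range(1, 7), repeat=2))

# Event A: the two faces sum to 7.
A = [outcome for outcome in S if sum(outcome) == 7]

# Classical probability: P(A) = |A| / |S|.
print(Fraction(len(A), len(S)))  # 1/6
```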
From the axioms, the following fundamental properties can be derived using basic set algebra (a numerical check follows the list):
- Complement rule:
\[
P(\bar{A}) = 1 - P(A)
\]
since \(P(S) = P(A \cup \bar{A}) = P(A) + P(\bar{A})\), where \(A\) and \(\bar{A}\) are mutually exclusive.
- Addition rule (inclusion-exclusion):
\[
P(A \cup B) = P(A) + P(B) - P(A \cap B)
\]
Note: When \(A\) and \(B\) are mutually exclusive, \(P(A \cap B) = 0\), recovering Axiom 3.
- Monotonicity:
If \(B \subseteq A\), then \(A = B \cup (A \cap \bar{B})\) is a disjoint union, so
\[
P(A) - P(B) = P(A \cap \bar{B}) \geq 0.
\]
- De Morgan's laws:
\[
\begin{align*}
P(\overline{A \cup B}) &= P(\bar{A} \cap \bar{B}) \\\\
P(\overline{A \cap B}) &= P(\bar{A} \cup \bar{B})
\end{align*}
\]
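All four properties can be verified mechanically on a finite sample space. The sketch below (the two-dice space again, with two arbitrarily chosen events for illustration) checks them with Python's set operations.

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))   # two fair dice

def P(E):
    # Equally likely outcomes: P(E) = |E| / |S|.
    return Fraction(len(E), len(S))

A = {o for o in S if o[0] % 2 == 0}       # first die is even
B = {o for o in S if sum(o) >= 10}        # sum is at least 10

# Complement rule: P(A-complement) = 1 - P(A).
assert P(S - A) == 1 - P(A)
# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B).
assert P(A | B) == P(A) + P(B) - P(A & B)
# De Morgan: the complement of the union is the intersection of the complements.
assert S - (A | B) == (S - A) & (S - B)
```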
The formula \(P(A) = |A|/|S|\) requires us to count the elements in \(A\) and \(S\). For complex
experiments, direct enumeration quickly becomes impractical. The following counting principles
provide systematic tools for this task.
- Multiplication principle:
If an experiment consists of a sequence of operations \(O_1, O_2, \ldots, O_r\)
with \(n_1, n_2, \ldots, n_r\) outcomes respectively, then the total number of possible outcomes
is the product \(n_1 n_2 \cdots n_r\).
- Permutation:
An ordered selection of \(r\) objects from \(n\) distinct objects (sampling without replacement
where order matters):
\[
P(n, r) = n(n-1)(n-2)\cdots(n-r+1) = \frac{n!}{(n-r)!}.
\]
- Combination:
An unordered selection of \(r\) objects from \(n\) distinct objects (sampling without replacement
where order does not matter):
\[
\binom{n}{r} = \frac{n!}{r!(n-r)!} = \frac{P(n, r)}{r!}.
\]
This is the binomial coefficient. More generally, partitioning \(n\) objects
into \(k\) groups of sizes \(r_1, r_2, \ldots, r_k\) with \(r_1 + r_2 + \cdots + r_k = n\)
yields the multinomial coefficient:
\[
\frac{n!}{r_1!\, r_2!\, \cdots\, r_k!}.
\]
For example, the number of ways to deal 52 cards evenly among 4 players is
\(\frac{52!}{(13!)^4}\).
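These counts are available directly in Python's standard library. The sketch below (illustrative) checks the permutation and combination formulas for \(n = 10\), \(r = 3\), and the card-dealing multinomial above.

```python
import math

n, r = 10, 3
# Permutation: ordered selection, n! / (n - r)! = 720.
assert math.perm(n, r) == math.factorial(n) // math.factorial(n - r)
# Combination: unordered selection, P(n, r) / r! = 120.
assert math.comb(n, r) == math.perm(n, r) // math.factorial(r)

# Multinomial coefficient: deal 52 cards into 4 hands of 13 each.
deals = math.factorial(52) // math.factorial(13) ** 4
# Equivalent sequential count: choose each hand in turn.
assert deals == (math.comb(52, 13) * math.comb(39, 13)
                 * math.comb(26, 13) * math.comb(13, 13))
print(deals)  # about 5.36e28 distinct deals
```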
With these tools in hand, we can compute probabilities for a wide range of discrete experiments.
However, many real-world situations require reasoning about how the probability of one event changes
when we learn that another event has occurred. This leads us naturally to the concept of
conditional probability.
From Axioms to Measure Theory
The axioms presented here work well for finite and countable sample spaces. However, when the
sample space is uncountable (e.g., \(S = \mathbb{R}\)), Axiom 3 must be strengthened to
countable additivity, and we cannot assign probabilities to every subset
of \(S\) without encountering paradoxes. This necessitates the formal machinery of
\(\sigma\)-algebras and probability measures, which we develop rigorously in
Section II - Part 11: Measure Theory with Probability.
The intuitive treatment here provides the working foundation needed for the sections that follow.
Conditional Probability
In practice, we rarely evaluate probabilities in a vacuum. When a medical test returns positive,
we want the probability of disease given the test result. When a spam filter processes
an email, it needs the probability that the message is spam given the words it contains.
Conditional probability formalizes this idea of updating beliefs in light of new information.
Definition: Conditional Probability
The conditional probability of an event \(A\) given that event \(B\) has occurred
(with \(P(B) > 0\)) is defined as:
\[
P(A \mid B) = \frac{P(A \cap B)}{P(B)}. \tag{1}
\]
Intuitively, once we know that \(B\) has occurred, the effective sample space shrinks from \(S\) to \(B\),
and we ask what fraction of \(B\) also belongs to \(A\). Rearranging Equation (1) gives the useful
multiplication rule:
\[
P(A \cap B) = P(A \mid B)\, P(B). \tag{2}
\]
Since \(P(A \cap B) = P(B \cap A)\), we can also write:
\[
P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{P(A \mid B)\, P(B)}{P(A)}. \tag{3}
\]
Equation (3) is the simplest form of Bayes' theorem, which we will state more
generally after introducing the Law of Total Probability.
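To see Definition (1) in action, return to the two-dice space (a purely illustrative choice): conditioning on \(B\) amounts to counting within \(B\).

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))   # two fair dice

def P(E):
    return Fraction(len(E), len(S))

A = {o for o in S if sum(o) == 8}         # the faces sum to 8
B = {o for o in S if o[0] == 3}           # the first die shows 3

# Definition (1): P(A | B) = P(A and B) / P(B).
p_A_given_B = P(A & B) / P(B)
print(p_A_given_B)                        # 1/6: within B, only (3, 5) lies in A
# The effective sample space shrinks to B: counting inside B gives the same answer.
assert p_A_given_B == Fraction(len(A & B), len(B))
```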
A natural question arises: when does learning about \(B\) provide no information about \(A\)?
This motivates the concept of independence.
Definition: Independence
Two events \(A\) and \(B\) are independent if
\[
P(A \cap B) = P(A)\,P(B).
\]
Equivalently (when \(P(B) > 0\)): \(P(A \mid B) = P(A)\), meaning that observing \(B\) does not
change the probability of \(A\).
Independence propagates to complements. If \(A\) and \(B\) are independent, then:
\[
\begin{align*}
P(A \cap \bar{B}) &= P(A) - P(A \cap B) \\\\
&= P(A) - P(A)\,P(B) \\\\
&= P(A)\bigl[1 - P(B)\bigr] \\\\
&= P(A)\,P(\bar{B}),
\end{align*}
\]
where the first equality uses the disjoint decomposition \(A = (A \cap B) \cup (A \cap \bar{B})\)
and the second uses independence.
Similarly, \(P(\bar{A} \cap B) = P(\bar{A})\,P(B)\) and
\(P(\bar{A} \cap \bar{B}) = P(\bar{A})\,P(\bar{B})\). That is, independence of \(A\) and \(B\)
implies independence of any combination involving their complements.
Caution: Independence and mutual exclusivity are fundamentally different concepts.
If \(A\) and \(B\) are mutually exclusive with \(P(A) > 0\) and \(P(B) > 0\), then
\(P(A \cap B) = 0 \neq P(A)P(B)\), so they are not independent. In fact, learning that
one occurred tells us the other certainly did not — the events carry maximal information about each other.
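The contrast is easy to demonstrate numerically. In the sketch below (events chosen purely for illustration), \(A\) and \(B\) are independent, while \(C\) and \(D\) are mutually exclusive and therefore dependent.

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))   # two fair dice

def P(E):
    return Fraction(len(E), len(S))

A = {o for o in S if o[0] % 2 == 0}       # first die even: P(A) = 1/2
B = {o for o in S if o[1] == 6}           # second die shows 6: P(B) = 1/6

# Independent: the two dice do not influence each other.
assert P(A & B) == P(A) * P(B)
# Independence propagates to complements, as derived above.
assert P(A & (S - B)) == P(A) * P(S - B)

# Mutually exclusive but NOT independent:
C = {o for o in S if sum(o) == 2}         # snake eyes
D = {o for o in S if sum(o) == 12}        # boxcars
assert C & D == set()
assert P(C & D) != P(C) * P(D)
```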
Conditional probability allows us to decompose complex probability calculations by conditioning on
different scenarios. When those scenarios partition the entire sample space, we obtain a powerful
computational tool: the Law of Total Probability.
Law of Total Probability
Often, the direct computation of \(P(A)\) is difficult, but the conditional probabilities
\(P(A \mid B_i)\) under different scenarios \(B_i\) are known or easier to estimate.
The Law of Total Probability provides a systematic way to compute \(P(A)\) by
summing over all possible scenarios.
Theorem 1: Law of Total Probability
Let \(B_1, B_2, \ldots, B_k\) be a partition of the sample space \(S\):
the events are mutually exclusive (\(B_i \cap B_j = \emptyset\) for \(i \neq j\)) and
exhaustive (\(B_1 \cup B_2 \cup \cdots \cup B_k = S\)), with \(P(B_i) > 0\) for each \(i\).
Then for any event \(A\),
\[
\begin{align*}
P(A) &= \sum_{i=1}^{k} P(A \mid B_i)\,P(B_i) \\\\
&= P(A \mid B_1)\,P(B_1) + P(A \mid B_2)\,P(B_2) + \cdots + P(A \mid B_k)\,P(B_k).
\end{align*}
\]
Proof:
Since \(\{B_i\}\) partitions \(S\), we can decompose \(A\) as:
\[
A = (A \cap B_1) \cup (A \cap B_2) \cup \cdots \cup (A \cap B_k).
\]
The sets \(A \cap B_i\) are mutually exclusive (because the \(B_i\) are), so by Axiom 3 applied
repeatedly:
\[
\begin{align*}
P(A) &= P[(A \cap B_1) \cup (A \cap B_2) \cup \cdots \cup (A \cap B_k)] \\\\
&= P(A \cap B_1) + P(A \cap B_2) + \cdots + P(A \cap B_k) \\\\
&= \sum_{i=1}^{k} P(A \cap B_i) \\\\
&= \sum_{i=1}^{k} P(A \mid B_i)\,P(B_i)
\end{align*}
\]
where the last equality follows from the multiplication rule (Equation 2).
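As a worked example with hypothetical numbers: suppose three machines partition a factory's output, each with a known share of production and a known defect rate. The overall defect rate then follows directly from the theorem.

```python
from fractions import Fraction

# Hypothetical factory: the machines B_i partition production.
# prior[m] = P(B_i) (share of output); defect[m] = P(A | B_i) (defect rate).
prior = {"M1": Fraction(1, 2), "M2": Fraction(3, 10), "M3": Fraction(1, 5)}
defect = {"M1": Fraction(1, 100), "M2": Fraction(2, 100), "M3": Fraction(5, 100)}

# Law of Total Probability: P(A) = sum over i of P(A | B_i) P(B_i).
p_defect = sum(defect[m] * prior[m] for m in prior)
print(p_defect)   # 21/1000, i.e. a 2.1% overall defect rate
```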
The Law of Total Probability is the key ingredient needed to state Bayes' theorem in its
most useful form: it provides the denominator \(P(A)\) that normalizes the posterior probability.
Bayes' Theorem
We saw in Equation (3) that \(P(B \mid A) = P(A \mid B)\,P(B) / P(A)\). The quantity \(P(B)\) is called the
prior probability of \(B\) - our belief about \(B\) before observing any data - and \(P(B \mid A)\)
is the posterior probability - our updated belief after observing that \(A\) has occurred. This
inversion of conditioning, from \(P(A \mid B)\) to \(P(B \mid A)\), is the foundation of Bayesian statistics
and lies at the heart of many machine learning algorithms.
By applying the Law of Total Probability to the denominator, we obtain the general form of Bayes' theorem.
Theorem 2: Bayes' Theorem
Let \(B_1, B_2, \ldots, B_k\) be a partition of the sample space \(S\) with \(P(B_i) > 0\)
for each \(i\). Then for any event \(A\) with \(P(A) > 0\) and for \(j = 1, 2, \ldots, k\),
\[
\begin{align*}
P(B_j \mid A) &= \frac{P(A \mid B_j)\,P(B_j)}{\sum_{i=1}^{k} P(A \mid B_i)\,P(B_i)} \\\\
&= \frac{P(B_j \cap A)}{P(A)}.
\end{align*}
\]
The numerator \(P(A \mid B_j)\,P(B_j) = P(A \cap B_j)\) captures how likely the observed data \(A\)
is under hypothesis \(B_j\), weighted by our prior belief. The denominator is the total probability
\(P(A)\), which serves as a normalizing constant ensuring the posterior probabilities sum to one.
Insight: Bayes' Theorem in Machine Learning
Bayes' theorem is the foundation of probabilistic machine learning. In a classification
setting, let \(B_1, B_2, \ldots, B_k\) represent possible classes and let \(A\) represent observed
features (e.g., pixel values, word frequencies, or sensor readings). Then \(P(B_j \mid A)\) is the
posterior probability that the input belongs to class \(B_j\). This reasoning underlies
Naive Bayes classifiers, Bayesian neural networks,
and the entire framework of Bayesian inference.
In medical diagnosis, \(B_1, \ldots, B_k\) represent possible diseases, \(A\) represents observed symptoms,
and the posterior \(P(B_j \mid A)\) ranks the most likely diagnoses.
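As a concrete numerical sketch (all figures hypothetical): take a single disease with 1% prevalence, a test with 95% sensitivity, and a 10% false-positive rate.

```python
from fractions import Fraction

# Hypothetical diagnostic test; all numbers are assumed for illustration.
p_disease = Fraction(1, 100)               # prior P(B1): prevalence
p_healthy = 1 - p_disease                  # prior P(B2)
p_pos_given_disease = Fraction(95, 100)    # sensitivity, P(A | B1)
p_pos_given_healthy = Fraction(10, 100)    # false-positive rate, P(A | B2)

# Denominator via the Law of Total Probability.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * p_healthy

# Bayes' theorem: posterior probability of disease given a positive test.
posterior = p_pos_given_disease * p_disease / p_pos
print(posterior, float(posterior))         # 19/217, roughly 0.088
```

Despite the accurate test, the posterior stays below 9%: the low prevalence (the prior) dominates. This base-rate effect is precisely what the normalizing denominator encodes.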
In practice, especially in Bayesian inference, calculating the denominator \(P(A)\) is often the most
computationally challenging part of applying Bayes' theorem, as it may involve high-dimensional
integrals that are analytically intractable.
Developing methods to approximate this term is a major focus of modern probabilistic modeling,
which we will explore in later sections.
With the foundations of probability, conditional reasoning, and Bayes' theorem established, we are
ready to move beyond events and introduce random variables - numerical functions on the
sample space that enable the full power of calculus and algebra to be brought to bear on
probabilistic problems. This is the subject of
Part 2: Random Variables.