Probability
At the heart of machine learning lies a simple but profound question: how do we reason about uncertainty?
Every prediction a model makes, every classification it outputs, rests on a mathematical framework for
quantifying the likelihood of events. Probability theory provides this framework. Before
we can discuss random variables, distributions, or Bayesian inference, we must establish the foundational
language of sample spaces, events, and the axioms that govern how probabilities behave.
Definition: Sample Space and Events
The set of all possible outcomes of an experiment is called the sample space,
denoted \(S\). It can be discrete (finite or countably infinite) or continuous (uncountably infinite).
An event \(A\) is a subset of the sample space, \(A \subseteq S\). An event
is said to occur if the outcome of the experiment belongs to \(A\).
With the sample space and events defined, we need a consistent way to assign numerical
measures of likelihood to events. The following axioms provide the minimal requirements that any
probability assignment must satisfy.
Definition: Probability Axioms
A probability \(P\) on a sample space \(S\) is a function \(P: \mathcal{F} \to [0,1]\)
(where \(\mathcal{F}\) is a collection of events) satisfying:
- Non-negativity: \(0 \leq P(A) \leq 1\) for every event \(A\).
- Normalization: \(P(S) = 1\).
- Additivity: If events \(A\) and \(B\) are mutually exclusive
(i.e., \(A \cap B = \emptyset\)), then
\[
P(A \cup B) = P(A) + P(B).
\]
These axioms immediately imply \(P(\emptyset) = 0\), since \(S\) and \(\emptyset\) are mutually exclusive
and \(S = S \cup \emptyset\), giving \(P(S) = P(S) + P(\emptyset)\).
When the sample space is finite and all outcomes are equally likely, the probability of an event
reduces to a counting problem:
\[
P(A) = \frac{|A|}{|S|} = \frac{\text{number of outcomes in } A}{\text{number of outcomes in } S}.
\]
This classical formula motivates the counting techniques below.
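For instance, with two fair dice the formula reduces to brute-force enumeration. The short Python sketch below (an illustration, using the two-dice space as a running example) lists all 36 outcomes and counts those summing to 7.

```python
from fractions import Fraction
from itertools import product

# Sample space S: all ordered outcomes of rolling two fair dice.
S = list(product(range(1, 7), repeat=2))

# Event A: the two faces sum to 7.
A = [outcome for outcome in S if sum(outcome) == 7]

# Classical probability: P(A) = |A| / |S|.
print(Fraction(len(A), len(S)))  # 1/6
```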
From the axioms, the following fundamental properties can be derived using basic set algebra (a numerical check follows the list):
- Complement rule:
\[
P(\bar{A}) = 1 - P(A)
\]
since \(P(S) = P(A \cup \bar{A}) = P(A) + P(\bar{A})\), where \(A\) and \(\bar{A}\) are mutually exclusive.
- Addition rule (inclusion-exclusion):
\[
P(A \cup B) = P(A) + P(B) - P(A \cap B)
\]
Note: When \(A\) and \(B\) are mutually exclusive, \(P(A \cap B) = 0\), recovering Axiom 3.
- Monotonicity:
If \(B \subseteq A\), then \(A = B \cup (A \cap \bar{B})\) is a disjoint union, so
\[
P(A) - P(B) = P(A \cap \bar{B}) \geq 0.
\]
- De Morgan's laws:
\[
\begin{align*}
P(\overline{A \cup B}) &= P(\bar{A} \cap \bar{B}) \\\\
P(\overline{A \cap B}) &= P(\bar{A} \cup \bar{B})
\end{align*}
\]
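All four properties can be verified mechanically on a finite sample space. The sketch below (the two-dice space again, with two arbitrarily chosen events for illustration) checks them with Python's set operations.

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))   # two fair dice

def P(E):
    # Equally likely outcomes: P(E) = |E| / |S|.
    return Fraction(len(E), len(S))

A = {o for o in S if o[0] % 2 == 0}       # first die is even
B = {o for o in S if sum(o) >= 10}        # sum is at least 10

# Complement rule: P(A-complement) = 1 - P(A).
assert P(S - A) == 1 - P(A)
# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B).
assert P(A | B) == P(A) + P(B) - P(A & B)
# De Morgan: the complement of the union is the intersection of the complements.
assert S - (A | B) == (S - A) & (S - B)
```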
The formula \(P(A) = |A|/|S|\) requires us to count the elements in \(A\) and \(S\). For complex
experiments, direct enumeration quickly becomes impractical. The following counting principles
provide systematic tools for this task.
- Multiplication principle:
If an experiment consists of a sequence of operations \(O_1, O_2, \ldots, O_r\)
with \(n_1, n_2, \ldots, n_r\) outcomes respectively, then the total number of possible outcomes
is the product \(n_1 n_2 \cdots n_r\).
- Permutation:
An ordered selection of \(r\) objects from \(n\) distinct objects (sampling without replacement
where order matters):
\[
P(n, r) = n(n-1)(n-2)\cdots(n-r+1) = \frac{n!}{(n-r)!}.
\]
- Combination:
An unordered selection of \(r\) objects from \(n\) distinct objects (sampling without replacement
where order does not matter):
\[
\binom{n}{r} = \frac{n!}{r!(n-r)!} = \frac{P(n, r)}{r!}.
\]
This is the binomial coefficient. More generally, partitioning \(n\) objects
into \(k\) groups of sizes \(r_1, r_2, \ldots, r_k\) with \(r_1 + r_2 + \cdots + r_k = n\)
yields the multinomial coefficient:
\[
\frac{n!}{r_1!\, r_2!\, \cdots\, r_k!}.
\]
For example, the number of ways to deal 52 cards evenly among 4 players is
\(\frac{52!}{(13!)^4}\).
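These counts are available directly in Python's standard library. The sketch below (illustrative) checks the permutation and combination formulas for \(n = 10\), \(r = 3\), and the card-dealing multinomial above.

```python
import math

n, r = 10, 3
# Permutation: ordered selection, n! / (n - r)! = 720.
assert math.perm(n, r) == math.factorial(n) // math.factorial(n - r)
# Combination: unordered selection, P(n, r) / r! = 120.
assert math.comb(n, r) == math.perm(n, r) // math.factorial(r)

# Multinomial coefficient: deal 52 cards into 4 hands of 13 each.
deals = math.factorial(52) // math.factorial(13) ** 4
# Equivalent sequential count: choose each hand in turn.
assert deals == (math.comb(52, 13) * math.comb(39, 13)
                 * math.comb(26, 13) * math.comb(13, 13))
print(deals)  # about 5.36e28 distinct deals
```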
With these tools in hand, we can compute probabilities for a wide range of discrete experiments.
However, many real-world situations require reasoning about how the probability of one event changes
when we learn that another event has occurred. This leads us naturally to the concept of
conditional probability.
From Axioms to Measure Theory
The axioms presented here work well for finite and countable sample spaces. However, when the
sample space is uncountable (e.g., \(S = \mathbb{R}\)), Axiom 3 must be strengthened to
countable additivity, and we cannot assign probabilities to every subset
of \(S\) without encountering paradoxes. This necessitates the formal machinery of
\(\sigma\)-algebras and probability measures, which we develop rigorously in
Section II - Part 11: Measure Theory with Probability.
The intuitive treatment here provides the working foundation needed for the sections that follow.
Conditional Probability
In practice, we rarely evaluate probabilities in a vacuum. When a medical test returns positive,
we want the probability of disease given the test result. When a spam filter processes
an email, it needs the probability that the message is spam given the words it contains.
Conditional probability formalizes this idea of updating beliefs in light of new information.
Definition: Conditional Probability
The conditional probability of an event \(A\) given that event \(B\) has occurred
(with \(P(B) > 0\)) is defined as:
\[
P(A \mid B) = \frac{P(A \cap B)}{P(B)}. \tag{1}
\]
Intuitively, once we know that \(B\) has occurred, the effective sample space shrinks from \(S\) to \(B\),
and we ask what fraction of \(B\) also belongs to \(A\). Rearranging Equation (1) gives the useful
multiplication rule:
\[
P(A \cap B) = P(A \mid B)\, P(B). \tag{2}
\]
Since \(P(A \cap B) = P(B \cap A)\), we can also write:
\[
P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{P(A \mid B)\, P(B)}{P(A)}. \tag{3}
\]
Equation (3) is the simplest form of Bayes' theorem, which we will state more
generally after introducing the Law of Total Probability.
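To see Definition (1) in action, return to the two-dice space (a purely illustrative choice): conditioning on \(B\) amounts to counting within \(B\).

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))   # two fair dice

def P(E):
    return Fraction(len(E), len(S))

A = {o for o in S if sum(o) == 8}         # the faces sum to 8
B = {o for o in S if o[0] == 3}           # the first die shows 3

# Definition (1): P(A | B) = P(A and B) / P(B).
p_A_given_B = P(A & B) / P(B)
print(p_A_given_B)                        # 1/6: within B, only (3, 5) lies in A
# The effective sample space shrinks to B: counting inside B gives the same answer.
assert p_A_given_B == Fraction(len(A & B), len(B))
```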
A natural question arises: when does learning about \(B\) provide no information about \(A\)?
This motivates the concept of independence.
Definition: Independence
Two events \(A\) and \(B\) are independent if
\[
P(A \cap B) = P(A)\,P(B).
\]
Equivalently (when \(P(B) > 0\)): \(P(A \mid B) = P(A)\), meaning that observing \(B\) does not
change the probability of \(A\).
Independence propagates to complements. If \(A\) and \(B\) are independent, then:
\[
\begin{align*}
P(A \cap \bar{B}) &= P(A) - P(A \cap B) \\\\
&= P(A) - P(A)\,P(B) \\\\
&= P(A)\bigl[1 - P(B)\bigr] \\\\
&= P(A)\,P(\bar{B}),
\end{align*}
\]
where the first equality uses the disjoint decomposition \(A = (A \cap B) \cup (A \cap \bar{B})\)
and the second uses independence.
Similarly, \(P(\bar{A} \cap B) = P(\bar{A})\,P(B)\) and
\(P(\bar{A} \cap \bar{B}) = P(\bar{A})\,P(\bar{B})\). That is, independence of \(A\) and \(B\)
implies independence of any combination involving their complements.
Caution: Independence and mutual exclusivity are fundamentally different concepts.
If \(A\) and \(B\) are mutually exclusive with \(P(A) > 0\) and \(P(B) > 0\), then
\(P(A \cap B) = 0 \neq P(A)P(B)\), so they are not independent. In fact, learning that
one occurred tells us the other certainly did not — the events carry maximal information about each other.
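The contrast is easy to demonstrate numerically. In the sketch below (events chosen purely for illustration), \(A\) and \(B\) are independent, while \(C\) and \(D\) are mutually exclusive and therefore dependent.

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))   # two fair dice

def P(E):
    return Fraction(len(E), len(S))

A = {o for o in S if o[0] % 2 == 0}       # first die even: P(A) = 1/2
B = {o for o in S if o[1] == 6}           # second die shows 6: P(B) = 1/6

# Independent: the two dice do not influence each other.
assert P(A & B) == P(A) * P(B)
# Independence propagates to complements, as derived above.
assert P(A & (S - B)) == P(A) * P(S - B)

# Mutually exclusive but NOT independent:
C = {o for o in S if sum(o) == 2}         # snake eyes
D = {o for o in S if sum(o) == 12}        # boxcars
assert C & D == set()
assert P(C & D) != P(C) * P(D)
```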
Conditional probability allows us to decompose complex probability calculations by conditioning on
different scenarios. When those scenarios partition the entire sample space, we obtain a powerful
computational tool: the Law of Total Probability.
Law of Total Probability
Often, the direct computation of \(P(A)\) is difficult, but the conditional probabilities
\(P(A \mid B_i)\) under different scenarios \(B_i\) are known or easier to estimate.
The Law of Total Probability provides a systematic way to compute \(P(A)\) by
summing over all possible scenarios.
Theorem 1: Law of Total Probability
Let \(B_1, B_2, \ldots, B_k\) be a partition of the sample space \(S\):
the events are mutually exclusive (\(B_i \cap B_j = \emptyset\) for \(i \neq j\)) and
exhaustive (\(B_1 \cup B_2 \cup \cdots \cup B_k = S\)), with \(P(B_i) > 0\) for each \(i\).
Then for any event \(A\),
\[
\begin{align*}
P(A) &= \sum_{i=1}^{k} P(A \mid B_i)\,P(B_i) \\\\
&= P(A \mid B_1)\,P(B_1) + P(A \mid B_2)\,P(B_2) + \cdots + P(A \mid B_k)\,P(B_k).
\end{align*}
\]
Proof:
Since \(\{B_i\}\) partitions \(S\), we can decompose \(A\) as:
\[
A = (A \cap B_1) \cup (A \cap B_2) \cup \cdots \cup (A \cap B_k).
\]
The sets \(A \cap B_i\) are mutually exclusive (because the \(B_i\) are), so by Axiom 3 applied
repeatedly:
\[
\begin{align*}
P(A) &= P[(A \cap B_1) \cup (A \cap B_2) \cup \cdots \cup (A \cap B_k)] \\\\
&= P(A \cap B_1) + P(A \cap B_2) + \cdots + P(A \cap B_k) \\\\
&= \sum_{i=1}^{k} P(A \cap B_i) \\\\
&= \sum_{i=1}^{k} P(A \mid B_i)\,P(B_i)
\end{align*}
\]
where the last equality follows from the multiplication rule (Equation 2).
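As a worked example with hypothetical numbers: suppose three machines partition a factory's output, each with a known share of production and a known defect rate. The overall defect rate then follows directly from the theorem.

```python
from fractions import Fraction

# Hypothetical factory: the machines B_i partition production.
# prior[m] = P(B_i) (share of output); defect[m] = P(A | B_i) (defect rate).
prior = {"M1": Fraction(1, 2), "M2": Fraction(3, 10), "M3": Fraction(1, 5)}
defect = {"M1": Fraction(1, 100), "M2": Fraction(2, 100), "M3": Fraction(5, 100)}

# Law of Total Probability: P(A) = sum over i of P(A | B_i) P(B_i).
p_defect = sum(defect[m] * prior[m] for m in prior)
print(p_defect)   # 21/1000, i.e. a 2.1% overall defect rate
```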
The Law of Total Probability is the key ingredient needed to state Bayes' theorem in its
most useful form: it provides the denominator \(P(A)\) that normalizes the posterior probability.
Bayes' Theorem
We saw in Equation (3) that \(P(B \mid A) = P(A \mid B)\,P(B) / P(A)\). The quantity \(P(B)\) is called the
prior probability of \(B\) - our belief about \(B\) before observing any data - and \(P(B \mid A)\)
is the posterior probability - our updated belief after observing that \(A\) has occurred. This
inversion of conditioning, from \(P(A \mid B)\) to \(P(B \mid A)\), is the foundation of Bayesian statistics
and lies at the heart of many machine learning algorithms.
By applying the Law of Total Probability to the denominator, we obtain the general form of Bayes' theorem.
Theorem 2: Bayes' Theorem
Let \(B_1, B_2, \ldots, B_k\) be a partition of the sample space \(S\) with \(P(B_i) > 0\)
for each \(i\). Then for any event \(A\) with \(P(A) > 0\) and for \(j = 1, 2, \ldots, k\),
\[
\begin{align*}
P(B_j \mid A) &= \frac{P(A \mid B_j)\,P(B_j)}{\sum_{i=1}^{k} P(A \mid B_i)\,P(B_i)} \\\\
&= \frac{P(B_j \cap A)}{P(A)}.
\end{align*}
\]
The numerator \(P(A \mid B_j)\,P(B_j) = P(A \cap B_j)\) captures how likely the observed data \(A\)
is under hypothesis \(B_j\), weighted by our prior belief. The denominator is the total probability
\(P(A)\), which serves as a normalizing constant ensuring the posterior probabilities sum to one.
Insight: Bayes' Theorem in Machine Learning
Bayes' theorem is the foundation of probabilistic machine learning. In a classification
setting, let \(B_1, B_2, \ldots, B_k\) represent possible classes and let \(A\) represent observed
features (e.g., pixel values, word frequencies, or sensor readings). Then \(P(B_j \mid A)\) is the
posterior probability that the input belongs to class \(B_j\). This reasoning underlies
Naive Bayes classifiers, Bayesian neural networks,
and the entire framework of Bayesian inference.
In medical diagnosis, \(B_1, \ldots, B_k\) represent possible diseases, \(A\) represents observed symptoms,
and the posterior \(P(B_j \mid A)\) ranks the most likely diagnoses.
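As a concrete numerical sketch (all figures hypothetical): take a single disease with 1% prevalence, a test with 95% sensitivity, and a 10% false-positive rate.

```python
from fractions import Fraction

# Hypothetical diagnostic test; all numbers are assumed for illustration.
p_disease = Fraction(1, 100)               # prior P(B1): prevalence
p_healthy = 1 - p_disease                  # prior P(B2)
p_pos_given_disease = Fraction(95, 100)    # sensitivity, P(A | B1)
p_pos_given_healthy = Fraction(10, 100)    # false-positive rate, P(A | B2)

# Denominator via the Law of Total Probability.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * p_healthy

# Bayes' theorem: posterior probability of disease given a positive test.
posterior = p_pos_given_disease * p_disease / p_pos
print(posterior, float(posterior))         # 19/217, roughly 0.088
```

Despite the accurate test, the posterior stays below 9%: the low prevalence (the prior) dominates. This base-rate effect is precisely what the normalizing denominator encodes.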
In practice, especially in Bayesian inference, calculating the denominator \(P(A)\) is often the most
computationally challenging part of applying Bayes' theorem, as it may involve high-dimensional
integrals that are analytically intractable.
Developing methods to approximate this term is a major focus of modern probabilistic modeling,
which we will explore in later sections.
With the foundations of probability, conditional reasoning, and Bayes' theorem established, we are
ready to move beyond events and introduce random variables - numerical functions on the
sample space that enable the full power of calculus and algebra to be brought to bear on
probabilistic problems. This is the subject of
Part 2: Random Variables.