Introduction
In Part 14, we introduced Bayesian inference and derived
posterior distributions for unknown parameters. In Part 18,
we developed Markov chains for modeling sequential data, noting that Markov Chain Monte Carlo (MCMC)
methods use the Markov property to sample from complex posterior distributions.
So far, our development of Bayesian statistics has focused on inference: given observed data, how should we update
our beliefs about unknown quantities? But inference alone does not tell us what to do. In many applications - medical diagnosis,
spam filtering, autonomous driving - we must ultimately choose an action based on uncertain information. Bayesian decision theory
provides a principled framework for making optimal decisions under uncertainty.
The setup is as follows. An agent must choose an action \(a\) from a set
of possible actions \(\mathcal{A}\). The consequence of this action depends on the unknown
state of nature \(h \in \mathcal{H}\), which the agent cannot observe directly.
To quantify the cost of choosing an action \(a\) when the true state is \(h\), we introduce
a loss function \(l(h, a)\).
The key idea of Bayesian decision theory is to combine the loss function with the posterior distribution
\(p(h \mid \boldsymbol{x})\) obtained from observed evidence \(\boldsymbol{x}\) (or a dataset \(\mathcal{D}\)).
For any action \(a\), the posterior expected loss (or posterior risk) is defined as follows.
Definition: Posterior Expected Loss
Given evidence \(\boldsymbol{x}\) and a loss function \(l(h, a)\), the posterior expected loss of action
\(a \in \mathcal{A}\) is
\[
\rho(a \mid \boldsymbol{x}) = \mathbb{E}_{p(h \mid \boldsymbol{x})} [l(h, a)] = \sum_{h \in \mathcal{H}} l(h, a) \, p(h \mid \boldsymbol{x}).
\]
A rational agent should select the action that minimizes this expected loss. This leads to the central concept
of the section.
Definition: Bayes Estimator
The Bayes estimator (or Bayes decision rule / optimal policy)
is the action that minimizes the posterior expected loss:
\[
\pi^*(\boldsymbol{x}) = \arg \min_{a \in \mathcal{A}} \, \mathbb{E}_{p(h \mid \boldsymbol{x})}[l(h, a)].
\]
Equivalently, by defining the utility function \(U(h, a) = -l(h, a)\), which measures the
desirability of each action in each state, the Bayes estimator can be written as
\[
\pi^*(\boldsymbol{x}) = \arg \max_{a \in \mathcal{A}} \, \mathbb{E}_{p(h \mid \boldsymbol{x})}[U(h, a)].
\]
This utility-maximization formulation is natural in economics, game theory, and
reinforcement learning, where the focus is on
maximizing expected rewards rather than minimizing losses.
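To make these definitions concrete, here is a minimal sketch in Python that computes the posterior expected loss of each action and selects the Bayes-optimal one. The posterior vector and loss matrix are made-up numbers, used only for illustration.

```python
import numpy as np

# Hypothetical example: 3 states of nature, 2 actions.
posterior = np.array([0.2, 0.5, 0.3])        # p(h | x), sums to 1
loss = np.array([[0.0, 10.0],                # l(h, a): rows index states h,
                 [5.0,  0.0],                #          columns index actions a
                 [5.0,  1.0]])

expected_loss = posterior @ loss             # rho(a | x) for each action a
bayes_action = int(np.argmin(expected_loss)) # pi*(x): action with minimal posterior risk
print(expected_loss, bayes_action)           # -> [4.  2.3] 1
```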
The power of this framework lies in its generality: by choosing different loss functions, we recover many
familiar estimators and decision rules as special cases. We begin with the most common setting in machine
learning - classification under zero-one loss.
Classification (Zero-One Loss)
Perhaps the most common application of Bayesian decision theory in machine learning is
classification: given an input \(\boldsymbol{x} \in \mathcal{X}\), we wish to assign
it the optimal class label. To apply the general framework, we specify the states, actions, and loss function
for this particular setting.
Suppose that the states of nature correspond to class labels
\[
\mathcal{H} = \mathcal{Y} = \{1, \ldots, C\},
\]
and that the possible actions are also the class labels: \(\mathcal{A} = \mathcal{Y}\).
A natural loss function in this context is the zero-one loss, which penalizes
misclassification equally regardless of which classes are confused.
Definition: Zero-One Loss
The zero-one loss is defined as
\[
l_{01}(y^*, \hat{y}) = \mathbb{I}(y^* \neq \hat{y}),
\]
where \(y^*\) is the true label and \(\hat{y}\) is the predicted label.
Under the zero-one loss, the posterior expected loss for choosing label \(\hat{y}\) becomes
\[
\rho(\hat{y} \mid \boldsymbol{x}) = p(\hat{y} \neq y^* \mid \boldsymbol{x}) = 1 - p(y^* = \hat{y} \mid \boldsymbol{x}).
\]
Thus, minimizing the expected loss is equivalent to maximizing the posterior probability:
\[
\pi^*(\boldsymbol{x}) = \arg \max_{y \in \mathcal{Y}} \, p(y \mid \boldsymbol{x}).
\]
In other words, the optimal decision under zero-one loss is to select the mode of the posterior distribution,
which is the maximum a posteriori (MAP) estimate. This provides a decision-theoretic justification for
the MAP estimator that we encountered in Part 14.
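As a quick sanity check, the toy snippet below (with made-up posterior values) confirms that minimizing the expected zero-one loss picks the same label as maximizing the posterior.

```python
import numpy as np

posterior = np.array([0.1, 0.6, 0.3])       # p(y | x) over C = 3 classes (made up)
loss_01 = 1.0 - np.eye(3)                   # zero-one loss matrix l(y*, y_hat)
expected_loss = posterior @ loss_01         # equals 1 - p(y | x) for each candidate label
assert np.argmin(expected_loss) == np.argmax(posterior)  # the MAP label wins
```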
The Reject Option
In some scenarios - particularly safety-critical applications such as medical diagnosis or autonomous
driving - the cost of an incorrect classification may be so high that it is preferable for the system
to abstain from making a decision when it is uncertain. This is formalized through the reject option.
Under this approach, the set of available actions is expanded to include a reject action:
\[
\mathcal{A} = \mathcal{Y} \cup \{0\},
\]
where action \(0\) represents the reject option (i.e., saying "I'm not sure"). The loss function
is then defined as
\[
l(y^*, a) =
\begin{cases}
0 & \text{if } y^* = a \text{ and } a \in \{1, \ldots, C\} \\
\lambda_r & \text{if } a = 0 \\
\lambda_e & \text{otherwise}
\end{cases}
\]
where \(\lambda_r\) is the cost of the reject action and \(\lambda_e\) is the cost of a
classification error. For the reject option to be meaningful, we require \(0 < \lambda_r < \lambda_e\); otherwise,
rejecting is either free (and we would always reject) or more expensive than guessing (and we would never reject).
Under this framework, instead of always choosing the label with the highest posterior probability, the optimal policy
chooses a label only when the classifier is sufficiently confident:
\[
a^* =
\begin{cases}
\hat{y} & \text{if } p^* > \lambda^* \\
\text{reject} & \text{otherwise}
\end{cases}
\]
where \(\hat{y} = \arg \max_{y \in \{1, \ldots, C\}} p(y \mid \boldsymbol{x})\) is the most probable label,
\[
p^* = p(\hat{y} \mid \boldsymbol{x}) = \max_{y \in \{1, \ldots, C\}} p(y \mid \boldsymbol{x}), \qquad
\lambda^* = 1 - \frac{\lambda_r}{\lambda_e}.
\]
Proof that \(\lambda^* = 1 - \frac{\lambda_r}{\lambda_e}\):
The optimal decision is to choose the class label \(y\) if and only if its expected loss
is lower than that of rejecting. The expected loss of choosing \(y\) is
\[
R(y) = \lambda_e \sum_{y^* \neq y} p(y^* \mid \boldsymbol{x}) = \lambda_e [1 - p(y \mid \boldsymbol{x})],
\]
while the expected loss of rejecting is \(R(\text{reject}) = \lambda_r\) (since the cost \(\lambda_r\)
is incurred regardless of the true state). We choose \(y\) if
\[
\begin{align*}
R(y) &< R(\text{reject}) \\\\
\lambda_e [1 - p(y \mid \boldsymbol{x})] &< \lambda_r \\\\
p(y \mid \boldsymbol{x}) &> 1 - \frac{\lambda_r}{\lambda_e} = \lambda^*.
\end{align*}
\]
Thus, if the maximum posterior probability \(p^*\) exceeds the threshold \(\lambda^*\),
the classifier should choose the corresponding label. Otherwise, it should reject.
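The resulting rule is straightforward to implement. Below is a small sketch; the function name and the cost values in the usage example are ours, chosen for illustration.

```python
import numpy as np

def classify_with_reject(posterior, lambda_r, lambda_e):
    """Return the MAP label if its posterior probability exceeds
    1 - lambda_r / lambda_e, otherwise the reject action."""
    assert 0 < lambda_r < lambda_e, "rejecting must be cheaper than an error"
    p_star = posterior.max()                  # confidence in the best label
    threshold = 1.0 - lambda_r / lambda_e     # lambda*
    return int(posterior.argmax()) if p_star > threshold else "reject"

# Hypothetical costs: rejecting costs 1, an error costs 10, so lambda* = 0.9.
print(classify_with_reject(np.array([0.55, 0.30, 0.15]), lambda_r=1.0, lambda_e=10.0))
# -> 'reject', because the top posterior probability (0.55) is below 0.9
```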
Confusion Matrix
The Bayes estimator tells us what the optimal decision rule is, but in practice, we need to measure how well a
classifier actually performs on real data. The standard tool for this is the (class) confusion matrix, which
provides a complete summary of the outcomes of classification decisions. We recall the four possible outcomes from
Part 10: Statistical Inference & Hypothesis Testing, where we first
introduced Type I and Type II errors.
For binary classification (\(y \in \{0, 1\}\)), each prediction falls into one of four categories:
- True Positives (TP): instances correctly classified as positive.
- True Negatives (TN): instances correctly classified as negative.
- False Positives (FP): instances incorrectly classified as positive (Type I error).
- False Negatives (FN): instances incorrectly classified as negative (Type II error).
Understanding the distinction between FP and FN is critical because the costs associated with each error type may
differ significantly depending on the application. In safety-critical systems, a false negative (missing a dangerous
condition) is typically far more costly than a false positive (raising an unnecessary alarm). For example, in medical
screening, a false positive means unnecessarily alarming a healthy patient, whereas a false negative means failing to
detect a disease in someone who needs treatment.
Table 1: Confusion Matrix for Binary Classification
\[
\begin{array}{|c|c|c|}
\hline
& \textbf{Predicted Positive} \; (\hat{P} = \text{TP} + \text{FP})
& \textbf{Predicted Negative} \; (\hat{N} = \text{FN} + \text{TN}) \\
\hline
\textbf{Actual Positive} \; (P = \text{TP} + \text{FN}) & \textbf{TP} & \textbf{FN} \\
\hline
\textbf{Actual Negative} \; (N = \text{FP} + \text{TN}) & \textbf{FP} & \textbf{TN} \\
\hline
\end{array}
\]
In the context of Bayesian decision theory, the confusion matrix quantifies the empirical performance
of the Bayes estimator (or any other decision rule) by counting how often the predicted labels match or
mismatch the true labels.
In practice, a binary classifier typically outputs a probability \(p(y = 1 \mid \boldsymbol{x})\),
and the final prediction depends on a decision threshold \(\tau \in [0, 1]\).
For any fixed threshold \(\tau\), the decision rule is
\[
\hat{y}_{\tau}(\boldsymbol{x}) = \mathbb{I}\left(p(y = 1 \mid \boldsymbol{x}) \geq \tau\right).
\]
Given a set of \(N\) labeled examples, we can compute the empirical counts for each cell
of the confusion matrix. For example,
\[
\begin{align*}
\text{FP}_{\tau} &= \sum_{n=1}^N \mathbb{I}(\hat{y}_{\tau}(\boldsymbol{x}_n) = 1, \, y_n = 0),\\\\
\text{FN}_{\tau} &= \sum_{n=1}^N \mathbb{I}(\hat{y}_{\tau}(\boldsymbol{x}_n) = 0, \, y_n = 1).
\end{align*}
\]
Table 2: Threshold-Dependent Confusion Matrix
\[
\begin{array}{|c|c|c|}
\hline
& \hat{y}_{\tau}(\boldsymbol{x}_n) = 1
& \hat{y}_{\tau}(\boldsymbol{x}_n) = 0 \\
\hline
y_n = 1 & \text{TP}_{\tau} & \text{FN}_{\tau} \\
\hline
y_n = 0 & \text{FP}_{\tau} & \text{TN}_{\tau} \\
\hline
\end{array}
\]
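As a minimal sketch, these counts can be tallied directly from predicted probabilities; the helper name `confusion_counts` is ours, not a library function.

```python
import numpy as np

def confusion_counts(y_true, probs, tau):
    """Empirical confusion-matrix counts at threshold tau (cf. Table 2)."""
    y = np.asarray(y_true)
    y_hat = (np.asarray(probs) >= tau).astype(int)   # thresholded predictions
    return {
        "TP": int(np.sum((y_hat == 1) & (y == 1))),
        "FP": int(np.sum((y_hat == 1) & (y == 0))),
        "FN": int(np.sum((y_hat == 0) & (y == 1))),
        "TN": int(np.sum((y_hat == 0) & (y == 0))),
    }
```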
Since the confusion matrix depends on the choice of \(\tau\), different thresholds lead to different trade-offs
between error types. How should we choose \(\tau\), and how can we evaluate a classifier's performance across all
possible thresholds? This motivates the ROC and PR curves developed in the following sections.
Receiver Operating Characteristic Curves
Rather than evaluating a classifier at a single threshold, we can characterize its performance across all thresholds
simultaneously. The receiver operating characteristic (ROC) curve achieves this by plotting two rates
derived from the confusion matrix. To obtain these rates, we normalize the confusion matrix per row,
yielding the conditional distribution \(p(\hat{y} \mid y)\). Since each row sums to 1, these rates describe the classifier's
behavior separately within the positive and negative populations.
Table 3: Confusion Matrix Normalized per Row
\[
\begin{array}{|c|c|c|}
\hline
& \hat{y} = 1 & \hat{y} = 0 \\
\hline
y = 1 & \text{TP}_{\tau} / P = \text{TPR}_{\tau} & \text{FN}_{\tau} / P = \text{FNR}_{\tau} \\
\hline
y = 0 & \text{FP}_{\tau} / N = \text{FPR}_{\tau} & \text{TN}_{\tau} / N = \text{TNR}_{\tau} \\
\hline
\end{array}
\]
- True positive rate (TPR) (or Sensitivity / Recall): \[
\text{TPR}_{\tau} = p(\hat{y} = 1 \mid y = 1, \tau) = \frac{\text{TP}_{\tau}}{\text{TP}_{\tau} + \text{FN}_{\tau}}.
\]
- False positive rate (FPR) (or Type I error rate / Fallout): \[
\text{FPR}_{\tau} = p(\hat{y} = 1 \mid y = 0, \tau) = \frac{\text{FP}_{\tau}}{\text{FP}_{\tau} + \text{TN}_{\tau}}.
\]
- False negative rate (FNR) (or Type II error rate / Miss rate):\[
\text{FNR}_{\tau} = p(\hat{y} = 0 \mid y = 1, \tau) = \frac{\text{FN}_{\tau}}{\text{TP}_{\tau} + \text{FN}_{\tau}}.
\]
- True negative rate (TNR) (or Specificity):\[
\text{TNR}_{\tau} = p(\hat{y} = 0 \mid y = 0, \tau) = \frac{\text{TN}_{\tau}}{\text{FP}_{\tau} + \text{TN}_{\tau}}.
\]
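Continuing the hypothetical `confusion_counts` helper from the earlier sketch, the row-normalized rates follow directly (assuming both classes are present, so the denominators are nonzero):

```python
def confusion_rates(counts):
    """Row-normalized rates from confusion-matrix counts (a sketch)."""
    p = counts["TP"] + counts["FN"]      # number of actual positives
    n = counts["FP"] + counts["TN"]      # number of actual negatives
    return {
        "TPR": counts["TP"] / p,         # sensitivity / recall
        "FNR": counts["FN"] / p,         # miss rate
        "FPR": counts["FP"] / n,         # fall-out
        "TNR": counts["TN"] / n,         # specificity
    }
```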
By plotting TPR against FPR as an implicit function of \(\tau\), we obtain the ROC curve.
The overall quality of a classifier is often summarized using the AUC (Area Under the Curve).
A higher AUC indicates better discriminative ability across all threshold values, with a maximum of 1.0 for a
perfect classifier.
The figure below compares two classifiers — one trained using logistic regression and the other using a random forest. Both classifiers
provide predicted probabilities for the positive class, allowing us to vary \(\tau\) and compute the corresponding TPR and FPR. A diagonal
line is also drawn, representing the performance of a random classifier — i.e., one that assigns labels purely by chance. On this line,
the TPR equals the FPR at every threshold. If a classifier's ROC curve lies on this diagonal, it means the classifier is performing no
better than random guessing. In contrast, any performance above the diagonal indicates that the classifier is capturing some signal,
while performance below the diagonal (rare in practice) would indicate worse-than-random behavior.
In this demonstration, the logistic regression model has been deliberately weakened, yielding an AUC of 0.78, while the random forest performs markedly better with an AUC of 0.94. In other words, the random forest is much better at distinguishing the positive class from the negative class than the weakened logistic regression model.
(Data: 10,000 samples, 20 total features, 5 features are informative, 2 clusters per class, 5% label noise.)
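As a rough sketch, the comparison could be reproduced along the following lines, assuming scikit-learn. The dataset settings mirror the data note above, but the model hyperparameters are placeholders, so the resulting AUCs will not match the figure exactly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# 10,000 samples, 20 features (5 informative), 2 clusters per class, 5% label noise.
X, y = make_classification(n_samples=10_000, n_features=20, n_informative=5,
                           n_clusters_per_class=2, flip_y=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    probs = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_te, probs)   # points of the ROC curve
    print(f"{name}: AUC = {roc_auc_score(y_te, probs):.3f}")
```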
Equal Error Rate (EER)
The ROC curve provides a comprehensive view of the trade-off between TPR and FPR, but it can be useful to summarize
this trade-off with a single number. The equal error rate (EER) is the operating point where
\(\text{FPR} = \text{FNR}\), that is, where false alarms and misses occur at the same rate. It is a natural choice of operating threshold when neither type of error is considered more serious than the other. This metric is particularly
important in applications such as biometric authentication (fingerprint or face recognition), where false acceptance and false
rejection carry comparable costs.
The figure below shows the EER point for our two models, marking where the FPR and FNR
curves intersect:
A perfect classifier would achieve an EER of 0, corresponding to the top-left corner of the
ROC curve. In practice, one may tune the decision threshold to operate at or near the EER,
depending on the application's requirements.
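One simple way to estimate the EER from finite data, assuming scikit-learn's `roc_curve`, is to take the point on the ROC curve where FPR and FNR are closest:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(y_true, scores):
    """Approximate the EER and the threshold at which it occurs (a sketch)."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    fnr = 1.0 - tpr                              # miss rate at each threshold
    idx = np.argmin(np.abs(fpr - fnr))           # where the two error rates cross
    return (fpr[idx] + fnr[idx]) / 2.0, thresholds[idx]
```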
The ROC curve and EER evaluate a classifier by conditioning on the true class (i.e., how the classifier
behaves within each population). However, in many practical settings we care about a different question:
given that the classifier predicted positive, how likely is it to be correct? This leads us to
precision-recall analysis.
Precision-Recall (PR) Curves
Here, we normalize the confusion matrix per column to obtain \(p(y \mid \hat{y})\), which conditions
on the predicted label. The column-normalized confusion matrix answers a complementary question:
among the instances predicted as positive (or negative), what fraction are truly positive (or negative)?
This perspective is especially important when precision is critical, such as in medical diagnosis or fraud detection.
Table 4: Confusion Matrix Normalized per Column
\[
\begin{array}{|c|c|c|}
\hline
& \hat{y} = 1 & \hat{y} = 0 \\
\hline
y = 1 & \text{TP}_{\tau} / \hat{P} = \text{PPV}_{\tau} & \text{FN}_{\tau} / \hat{N} = \text{FOR}_{\tau} \\
\hline
y = 0 & \text{FP}_{\tau} / \hat{P} = \text{FDR}_{\tau} & \text{TN}_{\tau} / \hat{N} = \text{NPV}_{\tau} \\
\hline
\end{array}
\]
- Positive predictive value (PPV) (or Precision):\[
\text{PPV}_{\tau} = p(y = 1 \mid \hat{y} = 1, \tau) = \frac{\text{TP}_{\tau}}{\text{TP}_{\tau} + \text{FP}_{\tau}}.
\]
- False discovery rate (FDR):\[
\text{FDR}_{\tau} = p(y = 0 \mid \hat{y} = 1, \tau) = \frac{\text{FP}_{\tau}}{\text{TP}_{\tau} + \text{FP}_{\tau}}.
\]
- False omission rate (FOR):\[
\text{FOR}_{\tau} = p(y = 1 \mid \hat{y} = 0, \tau) = \frac{\text{FN}_{\tau}}{\text{FN}_{\tau} + \text{TN}_{\tau}}.
\]
- Negative predictive value (NPV):\[
\text{NPV}_{\tau} = p(y = 0 \mid \hat{y} = 0, \tau) = \frac{\text{TN}_{\tau}}{\text{FN}_{\tau} + \text{TN}_{\tau}}.
\]
Note that within each column, the rates sum to 1: \(\text{PPV} + \text{FDR} = 1\)
and \(\text{FOR} + \text{NPV} = 1\).
To summarize a system's performance - especially when classes are imbalanced (i.e., when the positive class is rare)
or when false positives and false negatives have different costs - we often use a precision-recall (PR) curve. This
curve plots precision against recall as the decision threshold \(\tau\) varies.
Imbalanced datasets appear frequently in real-world machine learning applications where one class is naturally much rarer than
the other. For example, in financial transactions, fraudulent activities are rare compared to legitimate ones. The classifier must
detect the very few fraud cases (positive class) among millions of normal transactions (negative class).
Let precision be \(\mathcal{P}(\tau)\) and recall be \(\mathcal{R}(\tau)\). If \(\hat{y}_n \in \{0, 1\}\) is the predicted
label and \(y_n \in \{0, 1\}\) is the true label, then at threshold \(\tau\), precision and recall can be estimated by:
\[
\mathcal{P}(\tau) = \frac{\sum_n y_n \hat{y}_n}{\sum_n \hat{y}_n},
\quad
\mathcal{R}(\tau) = \frac{\sum_n y_n \hat{y}_n}{\sum_n y_n}.
\]
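A direct implementation of these two estimates might look like the sketch below (the helper name is hypothetical):

```python
import numpy as np

def precision_recall_at_threshold(y_true, probs, tau):
    """Empirical precision and recall at a fixed threshold tau (a sketch)."""
    y = np.asarray(y_true)
    y_hat = (np.asarray(probs) >= tau).astype(int)
    tp = np.sum(y * y_hat)                       # true positives
    precision = tp / max(y_hat.sum(), 1)         # guard: no positive predictions
    recall = tp / max(y.sum(), 1)                # guard: no positive labels
    return precision, recall
```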
By plotting precision against recall for varying values of the threshold \(\tau\), we obtain the PR curve.
This curve visually represents the trade-off between precision and recall. It is particularly valuable in situations where one
class is much rarer than the other or when false alarms carry a significant cost.
A crucial difference between the ROC and PR curves is the baseline for a random classifier. While a random
classifier always yields an AUC of 0.5 on an ROC curve, the baseline for a PR curve is the fraction of
positive samples in the dataset:
\[
\text{precision}_{\text{baseline}} = \frac{P}{P + N}.
\]
This makes the PR curve a much more rigorous evaluation tool for highly imbalanced datasets, as the "no-knowledge" score
can be near zero while the "perfect" score remains 1.0.
However, raw precision values can be noisy as the threshold varies. To stabilize this, interpolated precision
is often computed. For a given recall level \(r\), it is defined as the maximum precision observed for any recall level greater
than or equal to \(r\):
\[
\mathcal{P}_{\text{interp}}(r) = \max_{r' \geq r} \mathcal{P}(r')
\]
The average precision (AP) is the area under this interpolated PR curve. It provides a single-number summary
that reflects the classifier's ability to maintain high precision as the decision threshold is lowered and recall increases.
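The following sketch computes interpolated precision and the corresponding AP directly from scores and labels. Note that scikit-learn's `average_precision_score` uses a non-interpolated definition, so its values can differ slightly from this one.

```python
import numpy as np

def interpolated_average_precision(y_true, scores):
    """AP as the area under the interpolated PR curve (a sketch; assumes
    at least one positive label)."""
    order = np.argsort(-np.asarray(scores))      # sort by decreasing score
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                            # true positives at each cut-off
    fp = np.cumsum(1 - y)                        # false positives at each cut-off
    precision = tp / (tp + fp)
    recall = tp / y.sum()

    # Interpolation: running maximum of precision from high recall down to low.
    p_interp = np.maximum.accumulate(precision[::-1])[::-1]

    # Step-wise area under the interpolated curve over recall.
    recall = np.concatenate(([0.0], recall))
    return float(np.sum(np.diff(recall) * p_interp))
```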
In our case, the logistic regression produced an AP of 0.73, while the random forest achieved a much stronger AP of 0.93. This
indicates that the random forest is significantly more robust at identifying positive cases without sacrificing precision.
Note: In settings where multiple PR curves are generated (for example, one for each query in information retrieval or one per class in
multi-class classification), the mean average precision (mAP) is computed as the mean of the AP scores over all
curves. mAP offers an overall performance measure across multiple queries or classes.
Class Imbalance
In many real-world applications, the class distribution is far from uniform. A dataset is considered imbalanced
when one class has significantly fewer examples than the other - for instance, 5% positive samples and 95% negative samples.
In such cases, naive metrics like accuracy can be highly misleading: a model that always predicts the majority
class achieves high accuracy without correctly identifying any minority class instances.
How do ROC-AUC and PR-AP behave differently under class imbalance?
The ROC-AUC metric is often insensitive to class imbalance because both TPR and FPR are
computed within their respective populations - TPR is a ratio within the positive samples and FPR is a ratio within the
negative samples. Consequently, the class proportions do not directly affect the ROC curve.
On the other hand, the PR-AP metric is more sensitive to class imbalance.
This is because precision depends on both populations:
\[
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}.
\]
When the negative class vastly outnumbers the positive class, even a small false positive
rate can produce a large absolute number of false positives, significantly reducing precision.
To demonstrate this effect, we create a dataset where 90% of the samples belong to the negative
class and only 10% belong to the positive class, then train both the logistic regression and
random forest models again:
ROC Curve (Imbalanced)
Precision-Recall Curve (Imbalanced)
The ROC curves remain relatively smooth and the AUC does not drop drastically, even though the dataset is highly
imbalanced. The PR curves, however, reveal a distinct difference, with noticeably lower AP scores. This highlights
how class imbalance makes it harder to achieve high precision and recall simultaneously. Even though the random
forest outperforms logistic regression, it still struggles to detect the rare positive cases effectively.
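For reference, the imbalanced comparison described above might be set up roughly as follows, again assuming scikit-learn. The `weights=[0.9, 0.1]` argument yields about 90% negatives and 10% positives; the remaining settings simply mirror the earlier sketch and are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=20, n_informative=5,
                           n_clusters_per_class=2, weights=[0.9, 0.1],
                           flip_y=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(n_estimators=200, random_state=0))]:
    probs = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(f"{name}: ROC-AUC = {roc_auc_score(y_te, probs):.3f}, "
          f"AP = {average_precision_score(y_te, probs):.3f}")
```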
In summary, PR curves focus on precision and recall, which directly reflect how well a model identifies the minority class.
Precision, in particular, is sensitive to even a small number of false positives, providing a more realistic picture of
performance when the positive class is scarce. Thus, PR-AP is generally the preferred metric for imbalanced classification problems
because it directly measures performance on the minority class.
Connections to Machine Learning
The evaluation framework developed in this section is essential throughout applied machine learning.
The zero-one loss underlies the training objectives of many classifiers: the
cross-entropy loss used in logistic regression and neural networks can be understood as a smooth,
differentiable surrogate for it. ROC-AUC is the standard metric for balanced
classification tasks and is widely used in model selection. PR-AP (and its multi-class
extension, mAP) is the primary evaluation metric in object detection
and information retrieval, where positive instances are inherently rare. The
reject option introduced earlier is implemented in practice through confidence
thresholding and is increasingly important in safety-critical AI systems where abstaining from
a prediction is preferable to making an error.