Entropy

Entropy Joint Entropy Conditional Entropy Cross Entropy KL Divergence (Relative Entropy, Information Gain) Mutual Information (MI)

Entropy

So far in our Section III journey, we have studied distributions, estimation, and inference. A complementary question is: how much information does a random variable carry? Information Theory, founded by Claude Shannon, provides a mathematical framework for quantifying uncertainty - and it turns out to be deeply connected to machine learning, where the cross-entropy loss is the default objective for classification.

Conventions. Throughout this page, \(\log\) denotes the natural logarithm (base \(e\)), so information is measured in nats; the alternative base 2, giving bits, is standard in coding theory but less convenient for calculus-based optimization. We further adopt the limiting conventions \[ 0 \log 0 := 0, \qquad 0 \log \tfrac{0}{0} := 0, \qquad 0 \log \tfrac{0}{q} := 0 \;\; (q > 0), \qquad p \log \tfrac{p}{0} := +\infty \;\; (p > 0), \] so that terms involving zero probabilities cause no ambiguity in the entropy and KL divergence formulas introduced below. The first three are justified either by the limit \(\lim_{x \to 0^+} x \log x = 0\) or by treating the indeterminate form \(\tfrac{0}{0}\) as 0 within probability-theoretic contexts; the fourth reflects that assigning zero probability to an event that actually occurs incurs infinite description cost.

Consider a number guessing game: you must guess a natural number between 1 and 100. If I tell you that the number is not 1, you still have 99 possibilities, which is not very helpful. But if I tell you that the number is divisible by 11, the possibilities shrink to 9, which is far more informative. This intuition leads to a precise definition.

Definition: Information Content (Self-Information)

The information content of an event \(E\) with probability \(p(E)\) is \[ I(E) = -\log p(E) \geq 0. \] Low-probability events carry high information; certain events carry none.

The use of the logarithm is not arbitrary - it ensures additivity for independent events. If \(A\) and \(B\) are independent, then \(p(A \cap B) = p(A) \, p(B)\), so \[ I(A \cap B) = -\log p(A \cap B) = -\log\bigl(p(A) \, p(B)\bigr) = I(A) + I(B). \] This converts the multiplication of probabilities into the addition of information, a property essential for building a consistent theory.

We now extend from a single event to the average information content across all possible outcomes of a discrete random variable \(X\).

Definition: Shannon Entropy

The entropy of a discrete random variable \(X\) with distribution \(p\) over \(n\) states is \[ \begin{align*} \mathbb{H}(X) &= -\sum_{k=1}^n p(X = k)\,\log p(X = k) \\\\ &= \mathbb{E}_X[-\log p(X)]. \end{align*} \]

Joint Entropy

Entropy quantifies the uncertainty of a single random variable. When we observe two (or more) variables together, we need a measure of their combined uncertainty.

Definition: Joint Entropy

The joint entropy of two discrete random variables \(X\) and \(Y\) is \[ \mathbb{H}(X, Y) = -\sum_{x,\, y} p(x, y)\,\log p(x, y). \]

Theorem: Bounds on Joint Entropy

For discrete random variables \(X\) and \(Y\), \[ \mathbb{H}(X) + \mathbb{H}(Y) \;\geq\; \mathbb{H}(X, Y) \;\geq\; \max\{\mathbb{H}(X),\, \mathbb{H}(Y)\} \;\geq\; 0, \] with the following equality conditions:

  • \(\mathbb{H}(X) + \mathbb{H}(Y) = \mathbb{H}(X, Y)\) if and only if \(X\) and \(Y\) are independent.
  • \(\mathbb{H}(X, Y) = \mathbb{H}(X)\) if and only if \(Y\) is a deterministic function of \(X\); symmetrically, \(\mathbb{H}(X, Y) = \mathbb{H}(Y)\) if and only if \(X\) is a deterministic function of \(Y\).
  • \(\mathbb{H}(X, Y) = 0\) if and only if both \(X\) and \(Y\) are constants — that is, the joint distribution is concentrated on a single outcome.

Each of these bounds has a clear intuition. The left inequality says that joint uncertainty is largest when neither variable carries information about the other; any statistical dependence reduces \(\mathbb{H}(X, Y)\) below the sum. The right inequality says that the joint uncertainty of \((X, Y)\) cannot be smaller than the uncertainty of either variable alone — adding a second variable to the picture can only preserve or expand the total uncertainty, and equality is reached precisely when one variable is fully determined by the other. Non-negativity follows term-by-term, since \(p(x,y) \in [0,1]\) gives \(-p(x,y) \log p(x,y) \geq 0\) (using the convention \(0 \log 0 := 0\) established above).

A complete proof is deferred to the tools developed below: the right inequality follows from the chain rule for entropy, and the left inequality is equivalent to the non-negativity of mutual information, which is in turn a consequence of Gibbs' inequality.

These bounds extend to more than two variables: \[ \mathbb{H}(X_1, \ldots, X_n) \leq \sum_{i=1}^n \mathbb{H}(X_i), \] with equality if and only if \(X_1, \ldots, X_n\) are mutually independent.

Conditional Entropy

Joint entropy measures total uncertainty, but it does not tell us how much one variable reveals about another. Conditional entropy answers: on average, how much uncertainty remains in \(Y\) after we observe \(X\)?

Definition: Conditional Entropy

For each \(x\) in the support of \(X\), the entropy of the conditional distribution \(p(Y \mid X = x)\) is \[ \mathbb{H}(Y \mid X = x) := -\sum_{y} p(y \mid x) \log p(y \mid x). \] The conditional entropy of \(Y\) given \(X\) is the average of these quantities with respect to the marginal of \(X\): \[ \mathbb{H}(Y \mid X) := \sum_{x} p(x) \, \mathbb{H}(Y \mid X = x). \]

Theorem: Joint–Conditional Decomposition

For discrete random variables \(X\) and \(Y\), \[ \mathbb{H}(X, Y) = \mathbb{H}(X) + \mathbb{H}(Y \mid X), \] or equivalently \(\mathbb{H}(Y \mid X) = \mathbb{H}(X, Y) - \mathbb{H}(X)\).

Read in the first form, the theorem decomposes the joint uncertainty of \((X, Y)\) into the uncertainty of \(X\) plus the residual uncertainty of \(Y\) once \(X\) is known. The proof below establishes the second form by direct computation; the first form is the algebraic rearrangement.

Proof: \[ \begin{align*} \mathbb{H}(Y \mid X) &= \sum_{x} p(x) \, \mathbb{H}(Y \mid X = x) \\\\ &= - \sum_{x} p(x) \sum_{y} p(y \mid x) \log p(y \mid x) \\\\ &= - \sum_{x, y} p(x, y) \log p(y \mid x) \\\\ &= - \sum_{x, y} p(x, y) \log \frac{p(x,y)}{p(x)}\\\\ &= - \sum_{x, y} p(x, y) \log p(x,y) + \sum_{x, y} p(x, y) \log p(x) \\\\ &= - \sum_{x, y} p(x, y) \log p(x,y) + \sum_{x} p(x) \log p(x) \\\\ &= \mathbb{H}(X, Y) - \mathbb{H}(X) \tag{1} \end{align*} \]

The penultimate step uses the marginal identity \(\sum_{y} p(x, y) = p(x)\) to collapse the second sum from \((x, y)\) to \(x\) alone.

Theorem: Extremes of Conditional Entropy

For discrete random variables \(X\) and \(Y\),

  • \(\mathbb{H}(Y \mid X) = \mathbb{H}(Y)\) if and only if \(X\) and \(Y\) are independent.
  • \(\mathbb{H}(Y \mid X) = 0\) if and only if \(Y\) is a deterministic function of \(X\).

Together, the decomposition and the extremes above complete the proof of the right inequality of the joint entropy bounds deferred in the previous section. The inequality itself follows from non-negativity: since \(\mathbb{H}(Y \mid X) = \sum_{x} p(x) \, \mathbb{H}(Y \mid X = x) \geq 0\) (each term is the entropy of a probability distribution), the decomposition gives \(\mathbb{H}(X, Y) = \mathbb{H}(X) + \mathbb{H}(Y \mid X) \geq \mathbb{H}(X)\), and symmetrically \(\mathbb{H}(X, Y) \geq \mathbb{H}(Y)\). The equality \(\mathbb{H}(X, Y) = \mathbb{H}(X)\) holds precisely when \(\mathbb{H}(Y \mid X) = 0\), which by the extremes above means \(Y\) is a deterministic function of \(X\); symmetrically for \(\mathbb{H}(X, Y) = \mathbb{H}(Y)\). The left inequality \(\mathbb{H}(X) + \mathbb{H}(Y) \geq \mathbb{H}(X, Y)\) and its equality condition (independence) remain pending until mutual information is introduced.

Theorem: Chain Rule of Entropy

For discrete random variables \(X_1, \ldots, X_n\), \[ \mathbb{H}(X_1, X_2, \ldots, X_n) = \mathbb{H}(X_1) + \sum_{i=2}^n \mathbb{H}(X_i \mid X_1, \ldots, X_{i-1}). \]

Proof sketch (induction on \(n\)):

The case \(n = 2\) is the Joint–Conditional Decomposition above. For the inductive step, treat \((X_1, \ldots, X_{n-1})\) as a single random variable and apply the two-variable decomposition: \[ \mathbb{H}(X_1, \ldots, X_n) = \mathbb{H}(X_1, \ldots, X_{n-1}) + \mathbb{H}(X_n \mid X_1, \ldots, X_{n-1}), \] then expand the first term by the inductive hypothesis.

Cross Entropy

Conditional entropy measures how much observing one variable reduces uncertainty about another. A different but equally important question is: what happens when we use an incorrect distribution \(q\) to encode data that actually follows distribution \(p\)? The answer is the cross entropy.

Definition: Cross Entropy

For probability distributions \(p\) and \(q\) on the same finite sample space \(\{1, \ldots, n\}\), the cross entropy of \(q\) relative to \(p\) is \[ \mathbb{H}(p, q) = -\sum_{k=1}^n p_k \log q_k. \] Note that the order of arguments matters: \(\mathbb{H}(p, q) \neq \mathbb{H}(q, p)\) in general.

In machine learning — especially for classification with neural networks — cross entropy is the standard loss function. It measures the discrepancy between the true label distribution and the model's predicted distribution: smaller cross entropy means better predictions.

For example, consider binary classification (\(n = 2\)) with true label \(y \in \{0, 1\}\) and predicted probability \(\hat{y} \in (0, 1)\) of class 1. Specializing cross entropy to this case gives the binary cross-entropy loss (the subscript "ce" denotes the binary specialization of the general cross entropy \(\mathbb{H}(p, q)\) above) \[ \mathbb{H}_{ce}(y, \hat{y}) = -\bigl[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\bigr]. \] For instance, when the true label is \(y = 1\), the loss reduces to \(-\log \hat{y}\).

If \(\hat{y}\) is close to 1 (the model is confident and correct), the penalty is small. If \(\hat{y}\) is close to 0 (the model is confident and wrong), the penalty is large. The higher penalty signals that the model needs to be improved, and forces the optimization algorithm to adjust parameters to reduce future error.

KL Divergence (Relative Entropy, Information Gain)

Cross entropy measures the cost of encoding data from \(p\) using the model \(q\). A natural follow-up: how much extra cost does the wrong model incur beyond the irreducible entropy \(\mathbb{H}(p)\)? This "excess cost" is precisely the Kullback-Leibler divergence.

Definition: KL Divergence

The Kullback-Leibler divergence between two distributions \(p\) and \(q\) is \[ \begin{align*} D_{\mathbb{KL}}(p \| q) &= \sum_{k=1}^n p_k \log \frac{p_k}{q_k} \\\\ &= \sum_{k=1}^n p_k \log p_k - \sum_{k=1}^n p_k \log q_k \\\\ &= - \mathbb{H}(p) + \mathbb{H}(p, q) \tag{2} \end{align*} \] where \(p\) and \(q\) are defined on the same sample space \(\mathcal{X}\). Following the conventions established at the beginning of the page, the sum is finite when \(q_k = 0 \Rightarrow p_k = 0\) for all \(k\); otherwise (some \(p_k > 0\) with \(q_k = 0\)), \(D_{\mathbb{KL}}(p \| q) = +\infty\).

For example, \(D_{\mathbb{KL}}(p \| q)\) can measure how much an estimated distribution \(q\) is different from a true distribution \(p\). If the estimation of \(p\) by \(q\) is "good," the KL divergence will be close to zero.

By the expression (2), the cross entropy can be written as: \[ \mathbb{H}(p, q) = \mathbb{H}(p) + D_{\mathbb{KL}}(p \| q). \] So, if \(p(x)\) is fixed, minimizing the cross entropy is equivalent to minimizing KL divergence.

Theorem: Jensen's Inequality

Let \(\phi: I \to \mathbb{R}\) be a convex function on an interval \(I \subseteq \mathbb{R}\), and let \(X\) be a random variable taking values in \(I\) with finite expectation. Then \[ \phi(\mathbb{E}[X]) \;\leq\; \mathbb{E}[\phi(X)]. \] For finitely many points \(x_1, \ldots, x_n \in I\) and weights \(\lambda_1, \ldots, \lambda_n \geq 0\) with \(\sum_k \lambda_k = 1\), this specializes to the finite form \[ \phi\Big(\sum_{k=1}^n \lambda_k x_k\Big) \;\leq\; \sum_{k=1}^n \lambda_k \, \phi(x_k). \]

Proof sketch (finite form, induction on \(n\)):

For \(n = 2\), the inequality \(\phi(\lambda_1 x_1 + \lambda_2 x_2) \leq \lambda_1 \phi(x_1) + \lambda_2 \phi(x_2)\) with \(\lambda_1 + \lambda_2 = 1\) is the very definition of convexity. For \(n \geq 3\), set \(\Lambda = \sum_{k=1}^{n-1} \lambda_k\) (so \(\Lambda + \lambda_n = 1\); assume \(\Lambda > 0\), else the claim is trivial) and write \[ \sum_{k=1}^n \lambda_k x_k = \Lambda \, \bar{x} + \lambda_n x_n, \qquad \bar{x} := \sum_{k=1}^{n-1} \tfrac{\lambda_k}{\Lambda} x_k. \] Since \(I\) is an interval (hence convex), \(\bar{x} \in I\). Applying the \(n = 2\) case to \((\bar{x}, x_n)\) and the inductive hypothesis to \(\bar{x}\), \[ \phi\Big(\sum_{k=1}^n \lambda_k x_k\Big) \leq \Lambda \, \phi(\bar{x}) + \lambda_n \phi(x_n) \leq \Lambda \sum_{k=1}^{n-1} \tfrac{\lambda_k}{\Lambda} \phi(x_k) + \lambda_n \phi(x_n) = \sum_{k=1}^n \lambda_k \phi(x_k). \] The same statement holds for general random variables (possibly continuous) by the supporting line argument: setting \(\mu = \mathbb{E}[X]\), convexity of \(\phi\) guarantees the existence of a constant \(c \in \mathbb{R}\) (a subgradient at \(\mu\)) such that \(\phi(x) \geq \phi(\mu) + c\,(x - \mu)\) for all \(x \in I\). Taking expectations of both sides and using linearity of \(\mathbb{E}\), \[ \mathbb{E}[\phi(X)] \;\geq\; \phi(\mu) + c \, (\mathbb{E}[X] - \mu) \;=\; \phi(\mu) \;=\; \phi(\mathbb{E}[X]). \]

Theorem: Log Sum Inequality

For nonnegative reals \(a_1, \ldots, a_n\) and \(b_1, \ldots, b_n\), \[ \sum_{k=1}^n a_k \log \frac{a_k}{b_k} \;\geq\; \Big(\sum_{k=1}^n a_k\Big) \log \frac{\sum_{k=1}^n a_k}{\sum_{k=1}^n b_k}, \] with equality if and only if \(a_k / b_k\) is the same for all \(k\) (the conventions established at the beginning of the page apply, with \(p, q\) replaced by general nonnegative reals \(a_k, b_k\)).

Proof:

Zero terms are absorbed by the conventions above (terms with \(a_k = 0\) vanish; terms with \(b_k = 0, a_k > 0\) make the left side \(+\infty\), in which case the inequality is trivial). Assume henceforth that all \(a_k, b_k > 0\). The function \(f(x) = x \log x\) is convex on \((0, \infty)\), since \(f''(x) = 1/x > 0\). Set \(\lambda_k = b_k / \sum_j b_j\), so \(\lambda_k \geq 0\) and \(\sum_k \lambda_k = 1\). Then \[ \begin{align*} \sum_{k=1}^n a_k \log \frac{a_k}{b_k} &= \sum_{k=1}^n b_k \cdot \tfrac{a_k}{b_k} \log \tfrac{a_k}{b_k} = \Big(\sum_j b_j\Big) \sum_{k=1}^n \lambda_k \, f\Big(\tfrac{a_k}{b_k}\Big) \\\\ &\geq \Big(\sum_j b_j\Big) \, f\Big(\sum_{k=1}^n \lambda_k \tfrac{a_k}{b_k}\Big) \quad \text{(Jensen's inequality)} \\\\ &= \Big(\sum_j b_j\Big) \, f\Big(\tfrac{\sum_k a_k}{\sum_j b_j}\Big) = \Big(\sum_k a_k\Big) \log \tfrac{\sum_k a_k}{\sum_j b_j}. \end{align*} \] Equality in Jensen holds if and only if all \(a_k / b_k\) are equal.

Theorem: Non-negativity of KL Divergence (Gibbs' Inequality)

For probability distributions \(p\) and \(q\) on the same sample space, \[ D_{\mathbb{KL}}(p \| q) \;\geq\; 0, \] with equality if and only if \(p = q\). Equivalently, the entropy of \(p\) is bounded above by its cross entropy with any other distribution \(q\): \[ \mathbb{H}(p) \;\leq\; \mathbb{H}(p, q). \] The latter form is known as Gibbs' inequality.

Proof:

If \(D_{\mathbb{KL}}(p \| q) = +\infty\) (i.e., some \(q_k = 0\) with \(p_k > 0\)), the inequality is trivial. Otherwise, apply the log sum inequality with \(a_k = p_k\) and \(b_k = q_k\): \[ D_{\mathbb{KL}}(p \| q) = \sum_{k=1}^n p_k \log \frac{p_k}{q_k} \;\geq\; \Big(\sum_{k=1}^n p_k\Big) \log \frac{\sum_{k=1}^n p_k}{\sum_{k=1}^n q_k}. \] Since \(p\) and \(q\) are probability distributions, \(\sum_k p_k = \sum_k q_k = 1\), so the right-hand side becomes \(1 \cdot \log 1 = 0\). Therefore \(D_{\mathbb{KL}}(p \| q) \geq 0\), with equality if and only if \(p_k / q_k\) is constant in \(k\); since both sum to 1, this constant must be 1, i.e., \(p = q\). The equivalence with Gibbs' inequality follows immediately from the identity \(D_{\mathbb{KL}}(p \| q) = \mathbb{H}(p, q) - \mathbb{H}(p)\) established above.

Corollary: Uniform Distribution Maximizes Entropy

For a discrete random variable \(X\) taking values in a finite set \(\mathcal{X}\), \[ \mathbb{H}(X) \;\leq\; \log |\mathcal{X}|, \] with equality if and only if \(X\) is uniformly distributed on \(\mathcal{X}\).

Proof:

Let \(u\) denote the uniform distribution on \(\mathcal{X}\), with \(u_k = 1/|\mathcal{X}|\) for each \(k\). Apply the non-negativity of KL divergence to \(p\) (the distribution of \(X\)) and \(u\): \[ 0 \;\leq\; D_{\mathbb{KL}}(p \| u) = \sum_k p_k \log \frac{p_k}{1/|\mathcal{X}|} = \sum_k p_k \log p_k + \log |\mathcal{X}| \sum_k p_k = -\mathbb{H}(X) + \log |\mathcal{X}|. \] Rearranging gives \(\mathbb{H}(X) \leq \log |\mathcal{X}|\), with equality if and only if \(p = u\), i.e., \(X\) is uniform.

Despite measuring discrepancy between distributions, KL divergence is not a metric: it is not symmetric (\(D_{\mathbb{KL}}(p \| q) \neq D_{\mathbb{KL}}(q \| p)\) in general) and does not satisfy the triangle inequality.

Mutual Information (MI)

KL divergence measures the cost of confusing one distribution for another. A particularly important special case arises when we ask: how much does the joint distribution \(p(x,y)\) differ from the product of marginals \(p(x)p(y)\)? If \(X\) and \(Y\) are independent, these are identical; otherwise, their divergence quantifies the shared information between the two variables.

Mutual information measures the reduction in uncertainty about one random variable given knowledge of another. High mutual information means that knowing \(X\) tells you a lot about \(Y\) (and vice versa), while zero mutual information means they are completely independent.

Definition: Mutual Information

The mutual information between two discrete random variables \(X\) and \(Y\) is \[ \mathbb{I}(X ; Y) := D_{\mathbb{KL}}(p(x,y) \,\|\, p(x)\,p(y)) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}. \]

Since mutual information is a KL divergence, the non-negativity of KL divergence immediately gives \(\mathbb{I}(X; Y) \geq 0\), with equality if and only if \(p(x, y) = p(x)\, p(y)\) — that is, if and only if \(X\) and \(Y\) are independent.

Theorem: Entropy Expressions for Mutual Information

For discrete random variables \(X\) and \(Y\), \[ \begin{align*} \mathbb{I}(X; Y) &= \mathbb{H}(X) + \mathbb{H}(Y) - \mathbb{H}(X, Y) \\\\ &= \mathbb{H}(X) - \mathbb{H}(X \mid Y) \\\\ &= \mathbb{H}(Y) - \mathbb{H}(Y \mid X) \\\\ &= \mathbb{H}(X, Y) - \mathbb{H}(X \mid Y) - \mathbb{H}(Y \mid X). \end{align*} \]

The four forms admit complementary readings: union of marginals minus joint; reduction in the uncertainty of \(X\) after observing \(Y\); reduction in the uncertainty of \(Y\) after observing \(X\); and the joint uncertainty minus both residual conditional entropies, \(\mathbb{H}(X \mid Y)\) and \(\mathbb{H}(Y \mid X)\).

Proof:

For the first form, expand the logarithm in the Definition: \[ \begin{align*} \mathbb{I}(X; Y) &= \sum_{x, y} p(x, y) \log p(x, y) - \sum_{x, y} p(x, y) \log p(x) - \sum_{x, y} p(x, y) \log p(y) \\\\ &= -\mathbb{H}(X, Y) + \mathbb{H}(X) + \mathbb{H}(Y), \end{align*} \] where the marginal identities \(\sum_y p(x, y) = p(x)\) and \(\sum_x p(x, y) = p(y)\) collapse the last two double sums to the single-variable entropies of \(X\) and \(Y\). The reduction forms follow by substituting the joint–conditional decomposition \(\mathbb{H}(X, Y) = \mathbb{H}(X) + \mathbb{H}(Y \mid X) = \mathbb{H}(Y) + \mathbb{H}(X \mid Y)\) into the first form. The fourth form follows by adding the two chain-rule decompositions to obtain \(2\,\mathbb{H}(X, Y) = \mathbb{H}(X) + \mathbb{H}(Y) + \mathbb{H}(X \mid Y) + \mathbb{H}(Y \mid X)\), and rearranging gives \(\mathbb{H}(X, Y) - \mathbb{H}(X \mid Y) - \mathbb{H}(Y \mid X) = \mathbb{H}(X) + \mathbb{H}(Y) - \mathbb{H}(X, Y) = \mathbb{I}(X; Y)\) by the first form.

These identities deliver two important consequences. First, combining the union form \(\mathbb{I}(X; Y) = \mathbb{H}(X) + \mathbb{H}(Y) - \mathbb{H}(X, Y)\) with non-negativity completes the proof of the left inequality of the joint entropy bounds previously deferred: \[ \mathbb{H}(X) + \mathbb{H}(Y) \;\geq\; \mathbb{H}(X, Y), \] with equality if and only if \(X\) and \(Y\) are independent. Second, the reduction form \(\mathbb{I}(X; Y) = \mathbb{H}(X) - \mathbb{H}(X \mid Y) \geq 0\) implies \[ \mathbb{H}(X \mid Y) \;\leq\; \mathbb{H}(X) \qquad \text{(and symmetrically } \mathbb{H}(Y \mid X) \leq \mathbb{H}(Y) \text{)}, \] that is, conditioning reduces entropy on average. Observing \(Y\) can never increase the average uncertainty of \(X\); the residual uncertainty after observing \(Y\) is at most the prior uncertainty, with equality precisely when \(X\) and \(Y\) are independent.

Connections to Machine Learning

Information-theoretic quantities are ubiquitous in machine learning. Cross-entropy loss is the standard training objective for classification; minimizing it is equivalent to minimizing KL divergence from the true label distribution to the model's predictions (since \(\mathbb{H}(p)\) is constant with respect to model parameters). Mutual information appears in feature selection (maximizing \(\mathbb{I}(\text{features}; \text{target})\)), in the information bottleneck method for deep learning, and in variational autoencoders where the ELBO involves a KL divergence term. The Fisher information matrix connects to KL divergence through the local approximation \(D_{\mathbb{KL}}(p_\theta \| p_{\theta+\delta}) \approx \frac{1}{2}\delta^\top F(\theta)\delta\), linking information theory to the geometry of statistical models.

Entropy and KL divergence provide the language for comparing distributions. In the next page on convergence, we study probabilistic convergence - the mathematical conditions under which estimators and sample distributions approach their population counterparts as sample size grows.