Entropy
So far in our Section III journey, we have studied distributions, estimation, and inference. A complementary question is:
how much information does a random variable carry? Information Theory, founded by Claude Shannon,
provides a mathematical framework for quantifying uncertainty - and it turns out to be deeply connected to machine learning,
where the cross-entropy loss is the default objective for classification.
Conventions. Throughout this page, \(\log\) denotes the natural logarithm (base \(e\)), so
information is measured in nats; the alternative base 2, giving bits, is standard in coding
theory but less convenient for calculus-based optimization. We further adopt the limiting conventions
\[
0 \log 0 := 0, \qquad 0 \log \tfrac{0}{0} := 0, \qquad 0 \log \tfrac{0}{q} := 0 \;\; (q > 0), \qquad p \log \tfrac{p}{0} := +\infty \;\; (p > 0),
\]
so that terms involving zero probabilities cause no ambiguity in the entropy and KL divergence formulas
introduced below. The first three are justified either by the limit \(\lim_{x \to 0^+} x \log x = 0\)
or by treating the indeterminate form \(\tfrac{0}{0}\) as 0 within probability-theoretic contexts;
the fourth reflects that assigning zero probability to an event that actually occurs incurs infinite description cost.
Consider a number guessing game: you must guess a natural number between 1 and 100. If I tell you that the number is not 1,
you still have 99 possibilities, which is not very helpful. But if I tell you that the number is divisible by 11, the possibilities
shrink to 9, which is far more informative. This intuition leads to a precise definition.
Definition: Information Content (Self-Information)
The information content of an event \(E\) with probability \(p(E)\) is
\[
I(E) = -\log p(E) \geq 0.
\]
Low-probability events carry high information; certain events carry none.
The use of the logarithm is not arbitrary - it ensures additivity for independent events.
If \(A\) and \(B\) are independent, then \(p(A \cap B) = p(A) \, p(B)\), so
\[
I(A \cap B) = -\log p(A \cap B) = -\log\bigl(p(A) \, p(B)\bigr) = I(A) + I(B).
\]
This converts the multiplication of probabilities into the addition of information, a property essential
for building a consistent theory.
We now extend from a single event to the average information content
across all possible outcomes of a discrete random variable \(X\).
Definition: Shannon Entropy
The entropy of a discrete random variable \(X\) with distribution \(p\) over \(n\) states is
\[
\begin{align*}
\mathbb{H}(X) &= -\sum_{k=1}^n p(X = k)\,\log p(X = k) \\\\
&= \mathbb{E}_X[-\log p(X)].
\end{align*}
\]
Joint Entropy
Entropy quantifies the uncertainty of a single random variable. When we observe two (or more) variables
together, we need a measure of their combined uncertainty.
Definition: Joint Entropy
The joint entropy of two discrete random variables \(X\) and \(Y\) is
\[
\mathbb{H}(X, Y) = -\sum_{x,\, y} p(x, y)\,\log p(x, y).
\]
Theorem: Bounds on Joint Entropy
For discrete random variables \(X\) and \(Y\),
\[
\mathbb{H}(X) + \mathbb{H}(Y) \;\geq\; \mathbb{H}(X, Y) \;\geq\; \max\{\mathbb{H}(X),\, \mathbb{H}(Y)\} \;\geq\; 0,
\]
with the following equality conditions:
-
\(\mathbb{H}(X) + \mathbb{H}(Y) = \mathbb{H}(X, Y)\) if and only if \(X\) and \(Y\) are independent.
-
\(\mathbb{H}(X, Y) = \mathbb{H}(X)\) if and only if \(Y\) is a deterministic function of \(X\);
symmetrically, \(\mathbb{H}(X, Y) = \mathbb{H}(Y)\) if and only if \(X\) is a deterministic function of \(Y\).
-
\(\mathbb{H}(X, Y) = 0\) if and only if both \(X\) and \(Y\) are constants — that is, the joint
distribution is concentrated on a single outcome.
Each of these bounds has a clear intuition. The left inequality says that joint uncertainty is largest
when neither variable carries information about the other; any statistical dependence reduces
\(\mathbb{H}(X, Y)\) below the sum. The right inequality says that the joint uncertainty of \((X, Y)\)
cannot be smaller than the uncertainty of either variable alone — adding a second variable to the picture
can only preserve or expand the total uncertainty, and equality is reached precisely when one variable is
fully determined by the other. Non-negativity follows term-by-term, since \(p(x,y) \in [0,1]\) gives
\(-p(x,y) \log p(x,y) \geq 0\) (using the convention \(0 \log 0 := 0\) established above).
A complete proof is deferred to the tools developed below: the right inequality follows from the
chain rule for entropy, and the left inequality is equivalent
to the non-negativity of mutual information, which is in turn a
consequence of Gibbs' inequality.
These bounds extend to more than two variables:
\[
\mathbb{H}(X_1, \ldots, X_n) \leq \sum_{i=1}^n \mathbb{H}(X_i),
\]
with equality if and only if \(X_1, \ldots, X_n\) are mutually independent.
Conditional Entropy
Joint entropy measures total uncertainty, but it does not tell us how much one variable reveals
about another. Conditional entropy answers: on average, how much uncertainty remains in \(Y\) after we observe \(X\)?
Definition: Conditional Entropy
For each \(x\) in the support of \(X\), the entropy of the conditional distribution \(p(Y \mid X = x)\) is
\[
\mathbb{H}(Y \mid X = x) := -\sum_{y} p(y \mid x) \log p(y \mid x).
\]
The conditional entropy of \(Y\) given \(X\) is the average of these quantities
with respect to the marginal of \(X\):
\[
\mathbb{H}(Y \mid X) := \sum_{x} p(x) \, \mathbb{H}(Y \mid X = x).
\]
Theorem: Joint–Conditional Decomposition
For discrete random variables \(X\) and \(Y\),
\[
\mathbb{H}(X, Y) = \mathbb{H}(X) + \mathbb{H}(Y \mid X),
\]
or equivalently \(\mathbb{H}(Y \mid X) = \mathbb{H}(X, Y) - \mathbb{H}(X)\).
Read in the first form, the theorem decomposes the joint uncertainty of \((X, Y)\) into the uncertainty
of \(X\) plus the residual uncertainty of \(Y\) once \(X\) is known. The proof below establishes the
second form by direct computation; the first form is the algebraic rearrangement.
Proof:
\[
\begin{align*}
\mathbb{H}(Y \mid X) &= \sum_{x} p(x) \, \mathbb{H}(Y \mid X = x) \\\\
&= - \sum_{x} p(x) \sum_{y} p(y \mid x) \log p(y \mid x) \\\\
&= - \sum_{x, y} p(x, y) \log p(y \mid x) \\\\
&= - \sum_{x, y} p(x, y) \log \frac{p(x,y)}{p(x)}\\\\
&= - \sum_{x, y} p(x, y) \log p(x,y) + \sum_{x, y} p(x, y) \log p(x) \\\\
&= - \sum_{x, y} p(x, y) \log p(x,y) + \sum_{x} p(x) \log p(x) \\\\
&= \mathbb{H}(X, Y) - \mathbb{H}(X) \tag{1}
\end{align*}
\]
The penultimate step uses the marginal identity \(\sum_{y} p(x, y) = p(x)\) to collapse
the second sum from \((x, y)\) to \(x\) alone.
Theorem: Extremes of Conditional Entropy
For discrete random variables \(X\) and \(Y\),
-
\(\mathbb{H}(Y \mid X) = \mathbb{H}(Y)\) if and only if \(X\) and \(Y\) are independent.
-
\(\mathbb{H}(Y \mid X) = 0\) if and only if \(Y\) is a deterministic function of \(X\).
Together, the decomposition and the extremes above complete the proof of the right inequality of
the joint entropy bounds deferred in the previous
section. The inequality itself follows from non-negativity: since
\(\mathbb{H}(Y \mid X) = \sum_{x} p(x) \, \mathbb{H}(Y \mid X = x) \geq 0\) (each term is the
entropy of a probability distribution), the decomposition gives
\(\mathbb{H}(X, Y) = \mathbb{H}(X) + \mathbb{H}(Y \mid X) \geq \mathbb{H}(X)\),
and symmetrically \(\mathbb{H}(X, Y) \geq \mathbb{H}(Y)\). The equality \(\mathbb{H}(X, Y) = \mathbb{H}(X)\)
holds precisely when \(\mathbb{H}(Y \mid X) = 0\), which by the extremes above means \(Y\) is a
deterministic function of \(X\); symmetrically for \(\mathbb{H}(X, Y) = \mathbb{H}(Y)\). The left inequality
\(\mathbb{H}(X) + \mathbb{H}(Y) \geq \mathbb{H}(X, Y)\) and its equality condition (independence) remain
pending until mutual information is introduced.
Theorem: Chain Rule of Entropy
For discrete random variables \(X_1, \ldots, X_n\),
\[
\mathbb{H}(X_1, X_2, \ldots, X_n) = \mathbb{H}(X_1) + \sum_{i=2}^n \mathbb{H}(X_i \mid X_1, \ldots, X_{i-1}).
\]
Proof sketch (induction on \(n\)):
The case \(n = 2\) is the Joint–Conditional Decomposition above. For the inductive step, treat
\((X_1, \ldots, X_{n-1})\) as a single random variable and apply the two-variable decomposition:
\[
\mathbb{H}(X_1, \ldots, X_n) = \mathbb{H}(X_1, \ldots, X_{n-1}) + \mathbb{H}(X_n \mid X_1, \ldots, X_{n-1}),
\]
then expand the first term by the inductive hypothesis.
Cross Entropy
Conditional entropy measures how much observing one variable reduces uncertainty about another.
A different but equally important question is: what happens when we use an incorrect distribution \(q\)
to encode data that actually follows distribution \(p\)? The answer is the cross entropy.
Definition: Cross Entropy
For probability distributions \(p\) and \(q\) on the same finite sample space \(\{1, \ldots, n\}\),
the cross entropy of \(q\) relative to \(p\) is
\[
\mathbb{H}(p, q) = -\sum_{k=1}^n p_k \log q_k.
\]
Note that the order of arguments matters: \(\mathbb{H}(p, q) \neq \mathbb{H}(q, p)\) in general.
In machine learning — especially for classification
with neural networks — cross entropy is the standard
loss function. It measures the discrepancy between the true label distribution and the model's predicted
distribution: smaller cross entropy means better predictions.
For example, consider binary classification (\(n = 2\)) with true label \(y \in \{0, 1\}\)
and predicted probability \(\hat{y} \in (0, 1)\) of class 1. Specializing cross entropy to this case gives the
binary cross-entropy loss (the subscript "ce" denotes the binary specialization of the general cross entropy \(\mathbb{H}(p, q)\) above)
\[
\mathbb{H}_{ce}(y, \hat{y}) = -\bigl[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\bigr].
\]
For instance, when the true label is \(y = 1\), the loss reduces to \(-\log \hat{y}\).
If \(\hat{y}\) is close to 1 (the model is confident and correct), the penalty is small. If \(\hat{y}\) is close to 0
(the model is confident and wrong), the penalty is large. The higher penalty signals that the model needs to be
improved, and forces the optimization algorithm to adjust parameters to reduce future error.
KL Divergence (Relative Entropy, Information Gain)
Cross entropy measures the cost of encoding data from \(p\) using the model \(q\). A natural follow-up:
how much extra cost does the wrong model incur beyond the irreducible entropy \(\mathbb{H}(p)\)?
This "excess cost" is precisely the Kullback-Leibler divergence.
Definition: KL Divergence
The Kullback-Leibler divergence between two
distributions \(p\) and \(q\) is
\[
\begin{align*}
D_{\mathbb{KL}}(p \| q) &= \sum_{k=1}^n p_k \log \frac{p_k}{q_k} \\\\
&= \sum_{k=1}^n p_k \log p_k - \sum_{k=1}^n p_k \log q_k \\\\
&= - \mathbb{H}(p) + \mathbb{H}(p, q) \tag{2}
\end{align*}
\]
where \(p\) and \(q\) are defined on the same sample space \(\mathcal{X}\). Following the conventions
established at the beginning of the page, the sum is finite when \(q_k = 0 \Rightarrow p_k = 0\) for all \(k\);
otherwise (some \(p_k > 0\) with \(q_k = 0\)), \(D_{\mathbb{KL}}(p \| q) = +\infty\).
For example, \(D_{\mathbb{KL}}(p \| q)\) can measure how much an estimated distribution \(q\) is different from
a true distribution \(p\). If the estimation of \(p\) by \(q\) is "good," the KL divergence will be close to zero.
By the expression (2), the cross entropy can be written as:
\[
\mathbb{H}(p, q) = \mathbb{H}(p) + D_{\mathbb{KL}}(p \| q).
\]
So, if \(p(x)\) is fixed, minimizing the cross entropy is equivalent to minimizing KL divergence.
Theorem: Jensen's Inequality
Let \(\phi: I \to \mathbb{R}\) be a convex function on an interval \(I \subseteq \mathbb{R}\),
and let \(X\) be a random variable taking values in \(I\) with finite expectation. Then
\[
\phi(\mathbb{E}[X]) \;\leq\; \mathbb{E}[\phi(X)].
\]
For finitely many points \(x_1, \ldots, x_n \in I\) and weights \(\lambda_1, \ldots, \lambda_n \geq 0\)
with \(\sum_k \lambda_k = 1\), this specializes to the finite form
\[
\phi\Big(\sum_{k=1}^n \lambda_k x_k\Big) \;\leq\; \sum_{k=1}^n \lambda_k \, \phi(x_k).
\]
Proof sketch (finite form, induction on \(n\)):
For \(n = 2\), the inequality \(\phi(\lambda_1 x_1 + \lambda_2 x_2) \leq \lambda_1 \phi(x_1) + \lambda_2 \phi(x_2)\)
with \(\lambda_1 + \lambda_2 = 1\) is the very definition of convexity. For \(n \geq 3\), set
\(\Lambda = \sum_{k=1}^{n-1} \lambda_k\) (so \(\Lambda + \lambda_n = 1\); assume \(\Lambda > 0\), else the claim is trivial) and write
\[
\sum_{k=1}^n \lambda_k x_k = \Lambda \, \bar{x} + \lambda_n x_n, \qquad \bar{x} := \sum_{k=1}^{n-1} \tfrac{\lambda_k}{\Lambda} x_k.
\]
Since \(I\) is an interval (hence convex), \(\bar{x} \in I\). Applying the \(n = 2\) case to \((\bar{x}, x_n)\) and the inductive hypothesis to \(\bar{x}\),
\[
\phi\Big(\sum_{k=1}^n \lambda_k x_k\Big) \leq \Lambda \, \phi(\bar{x}) + \lambda_n \phi(x_n) \leq \Lambda \sum_{k=1}^{n-1} \tfrac{\lambda_k}{\Lambda} \phi(x_k) + \lambda_n \phi(x_n) = \sum_{k=1}^n \lambda_k \phi(x_k).
\]
The same statement holds for general random variables (possibly continuous) by the
supporting line argument: setting \(\mu = \mathbb{E}[X]\), convexity of \(\phi\)
guarantees the existence of a constant \(c \in \mathbb{R}\) (a subgradient at \(\mu\)) such that
\(\phi(x) \geq \phi(\mu) + c\,(x - \mu)\) for all \(x \in I\). Taking expectations of both sides
and using linearity of \(\mathbb{E}\),
\[
\mathbb{E}[\phi(X)] \;\geq\; \phi(\mu) + c \, (\mathbb{E}[X] - \mu) \;=\; \phi(\mu) \;=\; \phi(\mathbb{E}[X]).
\]
Theorem: Log Sum Inequality
For nonnegative reals \(a_1, \ldots, a_n\) and \(b_1, \ldots, b_n\),
\[
\sum_{k=1}^n a_k \log \frac{a_k}{b_k} \;\geq\; \Big(\sum_{k=1}^n a_k\Big) \log \frac{\sum_{k=1}^n a_k}{\sum_{k=1}^n b_k},
\]
with equality if and only if \(a_k / b_k\) is the same for all \(k\)
(the conventions established at the beginning of the page apply, with \(p, q\) replaced by general nonnegative reals \(a_k, b_k\)).
Proof:
Zero terms are absorbed by the conventions above (terms with \(a_k = 0\) vanish; terms with
\(b_k = 0, a_k > 0\) make the left side \(+\infty\), in which case the inequality is trivial).
Assume henceforth that all \(a_k, b_k > 0\). The function \(f(x) = x \log x\) is convex on \((0, \infty)\),
since \(f''(x) = 1/x > 0\). Set \(\lambda_k = b_k / \sum_j b_j\), so \(\lambda_k \geq 0\) and \(\sum_k \lambda_k = 1\). Then
\[
\begin{align*}
\sum_{k=1}^n a_k \log \frac{a_k}{b_k}
&= \sum_{k=1}^n b_k \cdot \tfrac{a_k}{b_k} \log \tfrac{a_k}{b_k}
= \Big(\sum_j b_j\Big) \sum_{k=1}^n \lambda_k \, f\Big(\tfrac{a_k}{b_k}\Big) \\\\
&\geq \Big(\sum_j b_j\Big) \, f\Big(\sum_{k=1}^n \lambda_k \tfrac{a_k}{b_k}\Big) \quad \text{(Jensen's inequality)} \\\\
&= \Big(\sum_j b_j\Big) \, f\Big(\tfrac{\sum_k a_k}{\sum_j b_j}\Big)
= \Big(\sum_k a_k\Big) \log \tfrac{\sum_k a_k}{\sum_j b_j}.
\end{align*}
\]
Equality in Jensen holds if and only if all \(a_k / b_k\) are equal.
Theorem: Non-negativity of KL Divergence (Gibbs' Inequality)
For probability distributions \(p\) and \(q\) on the same sample space,
\[
D_{\mathbb{KL}}(p \| q) \;\geq\; 0,
\]
with equality if and only if \(p = q\). Equivalently, the entropy of \(p\) is bounded above by its
cross entropy with any other distribution \(q\):
\[
\mathbb{H}(p) \;\leq\; \mathbb{H}(p, q).
\]
The latter form is known as Gibbs' inequality.
Proof:
If \(D_{\mathbb{KL}}(p \| q) = +\infty\) (i.e., some \(q_k = 0\) with \(p_k > 0\)), the inequality is trivial.
Otherwise, apply the log sum inequality with \(a_k = p_k\) and \(b_k = q_k\):
\[
D_{\mathbb{KL}}(p \| q) = \sum_{k=1}^n p_k \log \frac{p_k}{q_k} \;\geq\; \Big(\sum_{k=1}^n p_k\Big) \log \frac{\sum_{k=1}^n p_k}{\sum_{k=1}^n q_k}.
\]
Since \(p\) and \(q\) are probability distributions, \(\sum_k p_k = \sum_k q_k = 1\), so the right-hand
side becomes \(1 \cdot \log 1 = 0\). Therefore \(D_{\mathbb{KL}}(p \| q) \geq 0\), with equality if and
only if \(p_k / q_k\) is constant in \(k\); since both sum to 1, this constant must be 1, i.e., \(p = q\).
The equivalence with Gibbs' inequality follows immediately from the identity
\(D_{\mathbb{KL}}(p \| q) = \mathbb{H}(p, q) - \mathbb{H}(p)\) established above.
Proof:
Let \(u\) denote the uniform distribution on \(\mathcal{X}\), with \(u_k = 1/|\mathcal{X}|\) for each \(k\).
Apply the non-negativity of KL divergence to \(p\) (the distribution of \(X\)) and \(u\):
\[
0 \;\leq\; D_{\mathbb{KL}}(p \| u) = \sum_k p_k \log \frac{p_k}{1/|\mathcal{X}|} = \sum_k p_k \log p_k + \log |\mathcal{X}| \sum_k p_k = -\mathbb{H}(X) + \log |\mathcal{X}|.
\]
Rearranging gives \(\mathbb{H}(X) \leq \log |\mathcal{X}|\), with equality if and only if \(p = u\),
i.e., \(X\) is uniform.
Despite measuring discrepancy between distributions, KL divergence is not a metric:
it is not symmetric (\(D_{\mathbb{KL}}(p \| q) \neq D_{\mathbb{KL}}(q \| p)\) in general) and does
not satisfy the triangle inequality.
Mutual Information (MI)
KL divergence measures the cost of confusing one distribution for another. A particularly important special case
arises when we ask: how much does the joint distribution \(p(x,y)\) differ from the product of marginals \(p(x)p(y)\)?
If \(X\) and \(Y\) are independent, these are identical; otherwise, their divergence quantifies the shared information
between the two variables.
Mutual information measures the reduction in uncertainty
about one random variable given knowledge of another. High mutual information means that
knowing \(X\) tells you a lot about \(Y\) (and vice versa), while zero mutual information
means they are completely independent.
Since mutual information is a KL divergence, the non-negativity
of KL divergence immediately gives \(\mathbb{I}(X; Y) \geq 0\), with equality if and only if
\(p(x, y) = p(x)\, p(y)\) — that is, if and only if \(X\) and \(Y\) are independent.
The four forms admit complementary readings: union of marginals minus joint; reduction in the
uncertainty of \(X\) after observing \(Y\); reduction in the uncertainty of \(Y\) after observing
\(X\); and the joint uncertainty minus both residual conditional entropies, \(\mathbb{H}(X \mid Y)\) and \(\mathbb{H}(Y \mid X)\).
Proof:
For the first form, expand the logarithm in the Definition:
\[
\begin{align*}
\mathbb{I}(X; Y) &= \sum_{x, y} p(x, y) \log p(x, y) - \sum_{x, y} p(x, y) \log p(x) - \sum_{x, y} p(x, y) \log p(y) \\\\
&= -\mathbb{H}(X, Y) + \mathbb{H}(X) + \mathbb{H}(Y),
\end{align*}
\]
where the marginal identities \(\sum_y p(x, y) = p(x)\) and \(\sum_x p(x, y) = p(y)\) collapse
the last two double sums to the single-variable entropies of \(X\) and \(Y\). The reduction
forms follow by substituting the joint–conditional
decomposition \(\mathbb{H}(X, Y) = \mathbb{H}(X) + \mathbb{H}(Y \mid X) = \mathbb{H}(Y) + \mathbb{H}(X \mid Y)\)
into the first form. The fourth form follows by adding the two chain-rule decompositions to obtain
\(2\,\mathbb{H}(X, Y) = \mathbb{H}(X) + \mathbb{H}(Y) + \mathbb{H}(X \mid Y) + \mathbb{H}(Y \mid X)\),
and rearranging gives \(\mathbb{H}(X, Y) - \mathbb{H}(X \mid Y) - \mathbb{H}(Y \mid X) = \mathbb{H}(X) + \mathbb{H}(Y) - \mathbb{H}(X, Y) = \mathbb{I}(X; Y)\) by the first form.
These identities deliver two important consequences. First, combining the union form
\(\mathbb{I}(X; Y) = \mathbb{H}(X) + \mathbb{H}(Y) - \mathbb{H}(X, Y)\) with non-negativity
completes the proof of the left inequality of the joint
entropy bounds previously deferred:
\[
\mathbb{H}(X) + \mathbb{H}(Y) \;\geq\; \mathbb{H}(X, Y),
\]
with equality if and only if \(X\) and \(Y\) are independent. Second, the reduction form
\(\mathbb{I}(X; Y) = \mathbb{H}(X) - \mathbb{H}(X \mid Y) \geq 0\) implies
\[
\mathbb{H}(X \mid Y) \;\leq\; \mathbb{H}(X)
\qquad \text{(and symmetrically } \mathbb{H}(Y \mid X) \leq \mathbb{H}(Y) \text{)},
\]
that is, conditioning reduces entropy on average. Observing \(Y\) can never increase
the average uncertainty of \(X\); the residual uncertainty after observing \(Y\) is at most the
prior uncertainty, with equality precisely when \(X\) and \(Y\) are independent.
Connections to Machine Learning
Information-theoretic quantities are ubiquitous in machine learning. Cross-entropy loss is
the standard training objective for classification;
minimizing it is equivalent to minimizing KL divergence from the true label distribution to the model's predictions
(since \(\mathbb{H}(p)\) is constant with respect to model parameters). Mutual information appears in feature
selection (maximizing \(\mathbb{I}(\text{features}; \text{target})\)), in the information bottleneck method for deep learning,
and in variational autoencoders where the ELBO involves a KL divergence term. The
Fisher information matrix connects to KL divergence through the local
approximation \(D_{\mathbb{KL}}(p_\theta \| p_{\theta+\delta}) \approx \frac{1}{2}\delta^\top F(\theta)\delta\),
linking information theory to the geometry of statistical models.
Entropy and KL divergence provide the language for comparing distributions. In the next page on convergence,
we study probabilistic convergence - the mathematical conditions under which estimators and sample distributions approach
their population counterparts as sample size grows.