Entropy
So far in our Section III journey, we have studied distributions, estimation, and inference. A complementary question is:
how much information does a random variable carry? Information Theory, founded by Claude Shannon,
provides a mathematical framework for quantifying uncertainty - and it turns out to be deeply connected to machine learning,
where the cross-entropy loss is the default objective for classification.
Consider a number guessing game: you must guess a natural number between 1 and 100. If I tell you that the number is not 1,
you still have 99 possibilities, which is not very helpful. But if I tell you that the number is divisible by 11, the possibilities
shrink to 9, which is far more informative. This intuition leads to a precise definition.
Definition: Information Content (Self-Information)
The information content of an event \(E\) with probability \(p(E)\) is
\[
I(E) = -\log p(E) \geq 0.
\]
Low-probability events carry high information; certain events carry none.
The use of the logarithm is not arbitrary - it ensures additivity for independent events:
\[
I(A, B) = -\log(p(A) \cdot p(B)) = I(A) + I(B).
\]
This converts the multiplication of probabilities into the addition of information, a property essential
for building a consistent theory.
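As a quick numeric check (a sketch with made-up probabilities; the `information` helper is not part of the text), two independent events combine additively in information content:
```python
import math

def information(p, base=2):
    """Self-information I = -log p, in bits by default."""
    return -math.log(p, base)

# hypothetical independent events A and B
p_A, p_B = 0.5, 0.25
I_A = information(p_A)          # 1.0 bit
I_B = information(p_B)          # 2.0 bits
I_AB = information(p_A * p_B)   # 3.0 bits = I_A + I_B
print(I_A, I_B, I_AB)
```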
We now extend from a single event to the average information content
across all possible outcomes of a discrete random variable \(X\).
Definition: Shannon Entropy
The entropy of a discrete random variable \(X\) with distribution \(p\) over \(n\) states is
\[
\begin{align*}
\mathbb{H}(X) &= -\sum_{k=1}^n p(X = k)\,\log p(X = k) \\\\
&= \mathbb{E}_X[-\log p(X)].
\end{align*}
\]
The choice of logarithmic base determines the units: base 2 gives bits, base \(e\) gives nats.
In machine learning, natural logarithms (nats) are standard because they integrate seamlessly with calculus-based optimization.
The remaining sections of this page use \(\log\) to denote the natural logarithm unless otherwise stated.
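As a small illustration (a minimal sketch; the `entropy` helper and the coin distributions are chosen only for this example), a fair coin attains the maximum entropy for two outcomes, while a biased coin is less uncertain:
```python
import math

def entropy(probs, base=math.e):
    """Shannon entropy -sum p log p; base e gives nats, base 2 gives bits."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair_coin   = [0.5, 0.5]
biased_coin = [0.9, 0.1]

print(entropy(fair_coin, base=2))    # 1.0 bit (maximal for two outcomes)
print(entropy(biased_coin, base=2))  # ~0.469 bits (less uncertain)
print(entropy(fair_coin))            # ~0.693 nats = log 2
```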
Joint Entropy
Entropy quantifies the uncertainty of a single random variable. When we observe two (or more) variables
together, we need a measure of their combined uncertainty.
Definition: Joint Entropy
The joint entropy of two discrete random variables \(X\) and \(Y\) is
\[
\mathbb{H}(X, Y) = -\sum_{x,\, y} p(x, y)\,\log p(x, y).
\]
Joint entropy satisfies the following bounds, the first of which is known as subadditivity:
\[
\mathbb{H}(X) + \mathbb{H}(Y) \geq \mathbb{H}(X, Y)
\geq \max\{\mathbb{H}(X),\, \mathbb{H}(Y)\}
\geq 0.
\]
Equality in the left inequality holds if and only if \(X\) and \(Y\) are independent, in which case
\[
\mathbb{H}(X,Y) = \mathbb{H}(X) + \mathbb{H}(Y).
\]
Conversely, if \(Y\) is a deterministic function of \(X\), then knowing \(X\) completely determines
\(Y\), so
\[
\mathbb{H}(X,Y) = \mathbb{H}(X).
\]
These bounds generalize to more than two random variables.
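The bounds can be checked numerically on a small example (a sketch; the 2×2 joint table is hypothetical and the `H` helper is defined only for this illustration):
```python
import math

def H(probs):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * math.log(p, 2) for p in probs if p > 0)

# joint[x][y] = p(X = x, Y = y); X and Y are correlated here
joint = [[0.4, 0.1],
         [0.1, 0.4]]

p_x = [sum(row) for row in joint]            # marginal of X
p_y = [sum(col) for col in zip(*joint)]      # marginal of Y
H_xy = H([p for row in joint for p in row])  # joint entropy

# 2.0 >= ~1.72 >= 1.0, as the inequalities require
print(H(p_x) + H(p_y), H_xy, max(H(p_x), H(p_y)))
```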
Conditional Entropy
Joint entropy measures total uncertainty, but it does not tell us how much one variable reveals
about another. Conditional entropy answers: on average, how much uncertainty remains in \(Y\) after we observe \(X\)?
Definition: Conditional Entropy
The conditional entropy of \(Y\) given \(X\) is
\[
\mathbb{H}(Y \mid X) = \mathbb{E}_{p(X)}\!\left[\mathbb{H}(p(Y \mid X))\right].
\]
Expanding the definition shows that \(\mathbb{H}(Y \mid X) = \mathbb{H}(X, Y) - \mathbb{H}(X)\):
\[
\begin{align*}
\mathbb{H}(Y \mid X) &= \mathbb{E}_{p(X)}[\mathbb{H}(p(Y \mid X))] \\\\
&= \sum_{x} p(x)\, \mathbb{H}(p(Y \mid X = x)) \\\\
&= - \sum_{x} p(x) \sum_{y} p(y \mid x) \log p(y \mid x) \\\\
&= - \sum_{x, y} p(x, y) \log p(y \mid x) \\\\
&= - \sum_{x, y} p(x, y) \log \frac{p(x,y)}{p(x)}\\\\
&= - \sum_{x, y} p(x, y) \log p(x,y) + \sum_{x} p(x) \log p(x) \\\\
&= \mathbb{H}(X, Y) - \mathbb{H}(X) \tag{1}
\end{align*}
\]
Note: If \(Y\) is completely determined by \(X\), then \(\mathbb{H}(Y \mid X) = 0\).
Also, if \(X\) and \(Y\) are independent of each other, then \(\mathbb{H}(Y \mid X) = \mathbb{H}(Y)\).
By (1), \(\mathbb{H}(X, Y) = \mathbb{H}(X) + \mathbb{H}(Y \mid X)\). Applying this identity repeatedly gives the chain rule of entropy:
\[
\mathbb{H}(X_1, X_2, \cdots, X_n) = \sum_{i =1}^n \mathbb{H}(X_i \mid X_1, \cdots, X_{i-1}).
\]
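Identity (1) and the chain rule can be verified on the same kind of small joint table (again a sketch with hypothetical numbers):
```python
import math

def H(probs):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * math.log(p, 2) for p in probs if p > 0)

joint = [[0.4, 0.1],
         [0.1, 0.4]]
p_x = [sum(row) for row in joint]

# definition: H(Y | X) = sum_x p(x) H(p(Y | X = x))
H_y_given_x = sum(
    p_x[i] * H([joint[i][j] / p_x[i] for j in range(2)]) for i in range(2)
)

H_xy = H([p for row in joint for p in row])
print(H_y_given_x, H_xy - H(p_x))   # both ~0.722 bits, matching (1)
print(H_xy, H(p_x) + H_y_given_x)   # chain rule: both ~1.722 bits
```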
Cross Entropy
Conditional entropy measures how much observing one variable reduces uncertainty about another.
A different but equally important question is: what happens when we use an incorrect distribution \(q\)
to encode data that actually follows distribution \(p\)? The answer is the cross entropy.
Definition: Cross Entropy
The cross entropy of a distribution \(q\) relative to a
distribution \(p\) is
\[
\mathbb{H}(p, q) = -\sum_{k=1}^n p_k \log q_k.
\]
In machine learning, especially for classification problems and neural networks, cross entropy is commonly
used as a loss function. It measures the performance of a model whose output is a probability distribution,
and it trains the model by penalizing the discrepancy between the true labels and the predicted probabilities.
Generally, the smaller the cross-entropy loss, the better the predictions.
For example, consider binary classification (\(n = 2\)). The true label \(y \in \{0, 1\}\) is represented by
the distribution \(p = [y, 1-y]\), and the predicted probability \(\hat{y} \in (0, 1)\) is represented by the Bernoulli
distribution \(q = [\hat{y}, 1 - \hat{y}]\). Then we obtain the cross-entropy loss function:
\[
\begin{align*}
\mathcal{L}(y, \hat{y}) &= - [y \log (\hat{y}) + (1 - y) \log (1 - \hat{y})] \\\\
&= - y \log (\hat{y}) - (1 - y) \log (1 - \hat{y}) \\
\end{align*}
\]
For instance, if the true label is 1, the loss reduces to \(- \log (\hat{y})\).
If \(\hat{y}\) is close to 1 (the model is confident and correct), the penalty is small. On the other hand,
if \(\hat{y}\) is close to 0 (the model is confident and wrong), the penalty is large. A high penalty (loss)
indicates that the model needs to be improved and forces the optimization algorithm to adjust its parameters to reduce future errors.
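A minimal sketch of this loss (the `bce` helper and its clipping constant are assumptions for illustration; library implementations add further numerical safeguards) makes the asymmetry of the penalty concrete:
```python
import math

def bce(y, y_hat, eps=1e-12):
    """Cross entropy between the label distribution [y, 1-y] and [y_hat, 1-y_hat]."""
    y_hat = min(max(y_hat, eps), 1 - eps)   # clip to avoid log(0)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(bce(1, 0.99))  # confident and correct: ~0.01 (small penalty)
print(bce(1, 0.01))  # confident and wrong:   ~4.61 (large penalty)
```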
KL Divergence (Relative Entropy, Information Gain)
Cross entropy measures the cost of encoding data from \(p\) using the model \(q\). A natural follow-up:
how much extra cost does the wrong model incur beyond the irreducible entropy \(\mathbb{H}(p)\)?
This "excess cost" is precisely the Kullback-Leibler divergence.
While we might wish for a true distance metric between distributions, KL divergence is only a divergence:
it satisfies \(D_{\mathbb{KL}}(p \| q) \geq 0\) with equality if and only if \(p = q\), but it is neither
symmetric nor does it satisfy the triangle inequality.
Definition: KL Divergence
The Kullback-Leibler divergence between two
distributions \(p\) and \(q\) is
\[
\begin{align*}
D_{\mathbb{KL}}(p \| q) &= \sum_{k=1}^n p_k \log \frac{p_k}{q_k} \\\\
&= \sum_{k=1}^n p_k \log p_k - \sum_{k=1}^n p_k \log q_k \\\\
&= - \mathbb{H}(p) + \mathbb{H}(p, q) \tag{2}
\end{align*}
\]
where \(p\) and \(q\) are defined on the same sample space \(\mathcal{X}\).
For example, \(D_{\mathbb{KL}}(p \| q)\) measures how much an estimated distribution \(q\) differs from
a true distribution \(p\). If \(q\) approximates \(p\) well, the KL divergence will be close to zero.
By the expression (2), the cross entropy can be written as:
\[
\mathbb{H}(p, q) = \mathbb{H}(p) + D_{\mathbb{KL}}(p \| q).
\]
So, when \(p\) is fixed, minimizing the cross entropy over \(q\) is equivalent to minimizing the KL divergence.
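The decomposition can be confirmed numerically (a sketch; the two distributions below are arbitrary examples on three states, using nats):
```python
import math

p = [0.7, 0.2, 0.1]   # "true" distribution
q = [0.5, 0.3, 0.2]   # model distribution

H_p   = -sum(pi * math.log(pi) for pi in p)                    # entropy of p
H_pq  = -sum(pi * math.log(qi) for pi, qi in zip(p, q))        # cross entropy
kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))    # KL divergence

print(H_pq, H_p + kl_pq)   # equal up to floating point
print(kl_pq >= 0)          # True, as Theorem 1 below guarantees
```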
Theorem 1: Non-negativity of KL Divergence
\[
D_{\mathbb{KL}}(p \| q) \geq 0
\]
where \(p\) and \(q\) are discrete distributions defined on the same sample space \(\mathcal{X}\), with
equality if and only if \(p = q\).
Equivalently, the entropy of \(p\) is less than or equal to its cross entropy with any other distribution \(q\):
\[
\begin{align*}
&\sum_{k=1}^n p_k \log p_k - \sum_{k=1}^n p_k \log q_k \geq 0 \\\\
&\Longrightarrow - \sum_{k=1}^n p_k \log p_k \leq - \sum_{k=1}^n p_k \log q_k.
\end{align*}
\]
This is called Gibbs' inequality.
There are many ways to prove this theorem; here we introduce the log sum inequality and use it to prove Theorem 1.
Theorem 2: Log Sum Inequality
For nonnegative numbers \(a_1, a_2, \cdots, a_n\) and \(b_1, b_2, \cdots, b_n\),
\[
\sum_{k=1}^n a_k \log \frac{a_k}{b_k} \geq \Big(\sum_{k=1}^n a_k \Big) \log \frac{\sum_{k=1}^n a_k}{\sum_{k=1}^n b_k} \tag{3}
\]
with equality if and only if \(\frac{a_k}{b_k}\) is constant.
Proof of Theorem 2:
Assume without loss of generality that \(a_k > 0\) and \(b_k > 0\). Consider the function \(f(x) = x \log x\). Since
\(f'(x) = \log x + 1\) and \(f''(x) = \frac{1}{x} > 0\) for all \(x > 0\), the function is convex on \((0, \infty)\).
Let \(\lambda_k = \frac{b_k}{\sum_{j=1}^n b_j}\), so that the left side of inequality (3) becomes
\[
\begin{align*}
\sum_{k=1}^n a_k \log \frac{a_k}{b_k}
&= \sum_{k=1}^n b_k \frac{a_k}{b_k} \log \frac{a_k}{b_k} \\\\
&= \sum_{k=1}^n b_k\, f\Big(\frac{a_k}{b_k}\Big) \\\\
&= \Big(\sum_{k=1}^n b_k \Big) \sum_{k=1}^n \lambda_k f\Big(\frac{a_k}{b_k}\Big).
\end{align*}
\]
Here, by Jensen's inequality,
\[
\begin{align*}
\Big(\sum_{k=1}^n b_k \Big) \sum_{k=1}^n \lambda_k f(\frac{a_k}{b_k})
&\geq \Big(\sum_{k=1}^n b_k \Big) f\Big(\sum_{k=1}^n \lambda_k \frac{a_k}{b_k} \Big) \\\\
&= \Big(\sum_{k=1}^n b_k \Big) f\Big( \frac{\sum_{k=1}^n a_k}{\sum_{k=1}^n b_k} \Big) \\\\
&= \Big(\sum_{k=1}^n a_k \Big) \log \frac{\sum_{k=1}^n a_k}{\sum_{k=1}^n b_k}
\end{align*}
\]
Therefore,
\[
\sum_{k=1}^n a_k \log \frac{a_k}{b_k} \geq \Big(\sum_{k=1}^n a_k \Big) \log \frac{\sum_{k=1}^n a_k}{\sum_{k=1}^n b_k}.
\]
Theorem 3: Jensen's inequality
If \(\phi\) is convex on an open interval \(I\), and \(X\) is a random variable whose support is
contained in \(I\) and whose expected value is finite, then
\[
\phi (\mathbb{E}[X]) \leq \mathbb{E}[\phi (X)].
\]
Note: The support of a discrete random variable \(X\) is the set of points in the space of \(X\) that have
positive probability.
Note: In the proof above, we used the following finite form of Jensen's inequality:
\[
f\Big(\sum_{k=1}^n \lambda_k x_k \Big) \leq \sum_{k=1}^n \lambda_k f(x_k)
\]
where \(\lambda_k \geq 0\) and \(\sum_{k=1}^n \lambda_k = 1\).
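A quick numeric sanity check of this finite form (with arbitrary weights and points chosen only for illustration), using the convex function \(f(x) = x \log x\) from the proof of Theorem 2:
```python
import math

def f(x):
    """The convex function x * log(x) used in the proof of Theorem 2."""
    return x * math.log(x)

lam = [0.2, 0.3, 0.5]          # nonnegative weights summing to 1
xs  = [0.5, 2.0, 4.0]          # points in (0, infinity)

lhs = f(sum(l * x for l, x in zip(lam, xs)))    # f(sum lambda_k x_k)
rhs = sum(l * f(x) for l, x in zip(lam, xs))    # sum lambda_k f(x_k)
print(lhs, rhs, lhs <= rhs)                     # ~2.68  ~3.12  True
```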
Finally, we can prove the non-negativity of KL divergence using the log sum inequality.
Proof of Theorem 1:
Suppose \(\sum_{k=1}^n p_k = \sum_{k=1}^n q_k = 1\) and \(p_k, \, q_k \geq 0\).
By the definition of KL divergence,
\[
D_{\mathbb{KL}}(p \| q) = \sum_{k=1}^n p_k \log \frac{p_k}{q_k}.
\]
Using the log sum inequality with \(a_k = p_k\) and \(b_k = q_k\),
\[
\sum_{k=1}^n p_k \log \frac{p_k}{q_k} \geq \Big(\sum_{k=1}^n p_k \Big) \log \frac{\sum_{k=1}^n p_k}{\sum_{k=1}^n q_k}.
\]
Since \(\sum_{k=1}^n p_k = \sum_{k=1}^n q_k = 1\),
\[
\sum_{k=1}^n p_k \log \frac{p_k}{q_k} \geq \Big(\sum_{k=1}^n p_k \Big) \log 1 = 0
\]
with equality if and only if \(\forall k, \, p_k = q_k\).
Again, KL divergence is NOT a metric: it is not symmetric and does not satisfy the triangle inequality.
Mutual Information (MI)
KL divergence measures the cost of confusing one distribution for another. A particularly important special case
arises when we ask: how much does the joint distribution \(p(x,y)\) differ from the product of marginals \(p(x)p(y)\)?
If \(X\) and \(Y\) are independent, these are identical; otherwise, their divergence quantifies the shared information
between the two variables.
Mutual information measures the reduction in uncertainty
about one random variable given knowledge of another. High mutual information means that
knowing \(X\) tells you a lot about \(Y\) (and vice versa), while zero mutual information
means they are completely independent.
Definition: Mutual Information
The mutual information between two random variables \(X\) and \(Y\) is
\[
\begin{align*}
\mathbb{I}(X ; Y) &= D_{\mathbb{KL}}(p(x,y) \| p(x)p(y)) \\\\
&= \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x,y)}{p(x)p(y)} \geq 0.
\end{align*}
\]
Or, using entropies:
\[
\begin{align*}
\mathbb{I}(X ; Y) &= \sum_{y \in Y} \sum_{x \in X} p(x, y) \log p(x,y)
- \sum_{y \in Y} \sum_{x \in X} p(x, y) \log p(x) - \sum_{y \in Y} \sum_{x \in X} p(x, y) \log p(y) \\\\
&= \mathbb{H}(X) + \mathbb{H}(Y) - \mathbb{H}(X, Y).
\end{align*}
\]
Using the chain rule for entropy: \(\mathbb{H}(X, Y) = \mathbb{H}(Y \mid X) + \mathbb{H}(X) \,\),
we obtain the following expressions:
\[
\begin{align*}
\mathbb{I}(X ; Y) &= \mathbb{H}(X) + \mathbb{H}(Y) - \mathbb{H}(X, Y) \\\\
&= \mathbb{H}(Y) - \mathbb{H}(Y \mid X) \\\\
&= \mathbb{H}(X) - \mathbb{H}(X \mid Y).
\end{align*}
\]
\[
\begin{align*}
\mathbb{I}(X ; Y) &= \mathbb{H}(X) - \mathbb{H}(X \mid Y) \\\\
&= \mathbb{H}(X, Y) - \mathbb{H}(X \mid Y) - \mathbb{H}(Y \mid X).
\end{align*}
\]
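These identities can be checked numerically on the hypothetical 2×2 joint table used earlier (a sketch; the `H` helper is defined only for this illustration, and all quantities are in bits):
```python
import math

def H(probs):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * math.log(p, 2) for p in probs if p > 0)

joint = [[0.4, 0.1],
         [0.1, 0.4]]
p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

# definition: KL divergence between p(x, y) and p(x)p(y)
mi_kl = sum(
    joint[i][j] * math.log(joint[i][j] / (p_x[i] * p_y[j]), 2)
    for i in range(2) for j in range(2)
)

H_xy = H([p for row in joint for p in row])
mi_entropies   = H(p_x) + H(p_y) - H_xy      # H(X) + H(Y) - H(X, Y)
mi_conditional = H(p_y) - (H_xy - H(p_x))    # H(Y) - H(Y | X)

print(mi_kl, mi_entropies, mi_conditional)   # all ~0.278 bits
```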
Connections to Machine Learning
Information-theoretic quantities are ubiquitous in machine learning. Cross-entropy loss is
the standard training objective for classification;
minimizing it is equivalent to minimizing KL divergence from the true label distribution to the model's predictions
(since \(\mathbb{H}(p)\) is constant with respect to model parameters). Mutual information appears in feature
selection (maximizing \(\mathbb{I}(\text{features}; \text{target})\)), in the information bottleneck method for deep learning,
and in variational autoencoders where the ELBO involves a KL divergence term. The
Fisher information matrix connects to KL divergence through the local
approximation \(D_{\text{KL}}(p_\theta \| p_{\theta+\delta}) \approx \frac{1}{2}\delta^\top F(\theta)\delta\),
linking information theory to the geometry of statistical models.
Entropy and KL divergence provide the language for comparing distributions. In Part 13,
we study probabilistic convergence - the mathematical conditions under which estimators and sample distributions approach
their population counterparts as sample size grows.