Signed Measures, Hahn & Jordan Decompositions
A measure, as introduced in
Measure Theory, assigns a non-negative value to
each measurable set and is countably additive on disjoint unions. Many natural constructions, however,
produce set functions that take both signs. The simplest example: given a measure \(\mu\) and an integrable
function \(f \in L^1(\mu)\) that takes both positive and negative values, the assignment
\[
\nu_f(A) \;=\; \int_A f \, d\mu
\]
is countably additive and finite, but \(\nu_f(A)\) can be negative. Differences of two measures,
\(\nu = \mu_1 - \mu_2\), behave the same way. To work with such objects on equal footing with measures,
we relax the non-negativity axiom while keeping countable additivity intact.
Definition
Definition: Signed Measure
Let \((\Omega, \mathcal{F})\) be a measurable space. A signed measure on
\((\Omega, \mathcal{F})\) is a function \(\nu : \mathcal{F} \to [-\infty, \infty]\) satisfying:
1. \(\nu(\emptyset) = 0\).
2. \(\nu\) takes at most one of the values \(+\infty\) and \(-\infty\) (so the countable sum
in (3) is always well-defined, with no \(\infty - \infty\) ambiguity).
3. Countable additivity: for any sequence \((A_n)_{n \geq 1}\) of pairwise
disjoint sets in \(\mathcal{F}\),
\[
\nu\!\left(\bigsqcup_{n=1}^\infty A_n\right) \;=\; \sum_{n=1}^\infty \nu(A_n),
\]
where the series \(\sum_n \nu(A_n)\) is required to converge absolutely whenever
\(\nu(\bigsqcup_n A_n)\) is finite. (When the left-hand side equals \(+\infty\), condition (2)
forces the negative terms of the series to have a finite sum, while the positive terms sum to
\(+\infty\), so the partial sums diverge to \(+\infty\) in every ordering; the case \(-\infty\)
is symmetric.)
The absolute-convergence requirement in (3) deserves a brief comment. When \(\nu(\bigsqcup_n A_n)\) is
finite, the left-hand side does not depend on the order in which the disjoint sets \(A_n\) are listed,
so the series on the right must converge to the same value under every rearrangement. By Riemann's
rearrangement theorem on real series, this is equivalent to absolute convergence. The condition is
automatic when \(\nu\) is non-negative — it is only the possibility of cancellation between positive
and negative terms that makes it substantive in the signed case.
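The role of absolute convergence can be seen numerically. The following is a minimal sketch in Python (standard library only; the function names are illustrative, not part of the text) that sums the conditionally convergent alternating harmonic series in two different orders. In the natural order the partial sums approach \(\ln 2\); taking two positive terms for each negative term, the classical limit is \(\tfrac{3}{2}\ln 2\). A set function whose values on disjoint pieces behaved like this series could not satisfy an order-free additivity identity.

```python
import math

def alternating_harmonic(n_terms: int) -> float:
    """Partial sum of 1 - 1/2 + 1/3 - 1/4 + ... in the natural order."""
    return sum((-1) ** (k + 1) / k for k in range(1, n_terms + 1))

def rearranged(n_blocks: int) -> float:
    """Same terms, reordered: two positive terms, then one negative term."""
    total = 0.0
    odd, even = 1, 2  # next unused odd and even denominators
    for _ in range(n_blocks):
        total += 1.0 / odd + 1.0 / (odd + 2)  # two positive terms
        odd += 4
        total -= 1.0 / even                   # one negative term
        even += 2
    return total

natural = alternating_harmonic(300_000)   # close to ln 2
shuffled = rearranged(100_000)            # close to (3/2) ln 2
```

Both functions use exactly the same terms \(\pm 1/k\); only the order differs, and the limits disagree.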
The example \(\nu_f(A) = \int_A f \, d\mu\) is the prototype. Splitting \(f = f^+ - f^-\) into its positive
and negative parts gives
\[
\nu_f(A) \;=\; \int_A f^+ \, d\mu \;-\; \int_A f^- \, d\mu \;=\; \nu_{f^+}(A) - \nu_{f^-}(A),
\]
which writes \(\nu_f\) as the difference of two non-negative measures, supported respectively on
\(\{f \geq 0\}\) and \(\{f < 0\}\). The Jordan decomposition asserts that every signed
measure admits such a structural splitting canonically, independently of any representing function \(f\).
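As a concrete instance of the prototype, here is a small sketch on a four-point space, assuming hypothetical weights `mu` and a hypothetical density `f` chosen only for illustration. Integrals reduce to finite weighted sums, so additivity and the splitting \(\nu_f = \nu_{f^+} - \nu_{f^-}\) can be checked exactly with rational arithmetic.

```python
from fractions import Fraction as F
from itertools import chain, combinations

# Hypothetical finite example: Omega = {0, 1, 2, 3}.
mu = {0: F(1, 4), 1: F(1, 4), 2: F(1, 4), 3: F(1, 4)}   # a measure (point weights)
f = {0: F(2), 1: F(-1), 2: F(3), 3: F(-4)}              # a density taking both signs

def integrate(g, A):
    """Integral of g over A with respect to mu (a finite weighted sum)."""
    return sum((g[w] * mu[w] for w in A), F(0))

f_plus = {w: max(v, F(0)) for w, v in f.items()}    # positive part of f
f_minus = {w: max(-v, F(0)) for w, v in f.items()}  # negative part of f

def nu(A):
    """The signed measure nu_f(A) = integral of f over A."""
    return integrate(f, A)

def all_subsets(S):
    S = list(S)
    return chain.from_iterable(combinations(S, r) for r in range(len(S) + 1))

A, B = {0, 1}, {2, 3}                      # two disjoint sets
nu_A, nu_B, nu_union = nu(A), nu(B), nu(A | B)
```

Here `nu_A` is positive and `nu_B` is negative, while their sum equals `nu_union`, as additivity requires.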
The Hahn and Jordan Decompositions
The geometric picture is direct: a signed measure \(\nu\) on \(\Omega\) carves the space into a part
\(P\) on which \(\nu\) is non-negative and a complementary part \(N\) on which \(\nu\) is non-positive.
Once such a decomposition of \(\Omega\) is found, the splitting of \(\nu\) into a non-negative and a
non-positive piece comes for free.
A measurable set \(P \in \mathcal{F}\) is called positive for \(\nu\) if \(\nu(A) \geq 0\)
for every measurable \(A \subseteq P\); similarly, \(N\) is negative if \(\nu(A) \leq 0\)
for every measurable \(A \subseteq N\). It is not enough merely that \(\nu(P) \geq 0\): every measurable
subset of \(P\) must inherit non-negativity.
Theorem: Hahn Decomposition
Let \(\nu\) be a signed measure on \((\Omega, \mathcal{F})\). Then there exist disjoint measurable
sets \(P, N \in \mathcal{F}\) with \(\Omega = P \sqcup N\), where \(P\) is positive and \(N\) is
negative for \(\nu\). The decomposition is unique up to \(\nu\)-null sets: if
\(\Omega = P' \sqcup N'\) is another such decomposition, then the symmetric differences
\(P \triangle P'\) and \(N \triangle N'\) are \(\nu\)-null in the strong sense (every measurable
subset has \(\nu\)-measure zero).
Proof:
We assume without loss of generality that \(\nu\) does not take the value \(+\infty\); the other
case is symmetric. The strategy is to extract a positive set of maximal measure and verify that its
complement is negative.
Step 1 (subset extraction). We first show: if \(E \in \mathcal{F}\) satisfies
\(0 < \nu(E) < \infty\), then \(E\) contains a positive set \(A \subseteq E\) with \(\nu(A) > 0\).
We build \(A\) by repeatedly removing subsets of negative measure. Note that every measurable subset
of \(E\) has finite \(\nu\)-value, because \(\nu(E)\) is finite. Set \(E_0 = E\). If some measurable
subset of \(E_0\) has strictly negative \(\nu\)-value, let \(n_1\) be the smallest positive integer
such that there exists a measurable \(B_1 \subseteq E_0\) with \(\nu(B_1) < -1/n_1\). Set
\(E_1 = E_0 \setminus B_1\). Inductively, given \(E_{k-1}\), let \(n_k\) be the smallest positive
integer such that there exists \(B_k \subseteq E_{k-1}\) with \(\nu(B_k) < -1/n_k\), and set
\(E_k = E_{k-1} \setminus B_k\). Continue while such an \(n_k\) exists.
If at some stage no such \(n_k\) exists, then no measurable subset of \(E_{k-1}\) has strictly
negative \(\nu\)-measure, so \(E_{k-1}\) is itself a positive set; moreover
\[
\nu(E_{k-1}) \;=\; \nu(E) - \sum_{j < k} \nu(B_j) \;=\; \nu(E) + \sum_{j < k} |\nu(B_j)| \;\geq\; \nu(E) \;>\; 0
\]
(each \(\nu(B_j) < 0\), so \(-\nu(B_j) = |\nu(B_j)| \geq 0\)), and the claim holds with
\(A = E_{k-1}\). Otherwise, the construction continues for all \(k \geq 1\), and we proceed as
follows.
Define \(E_\infty = E \setminus \bigsqcup_{k \geq 1} B_k\), so that
\(E = E_\infty \sqcup \bigsqcup_{k \geq 1} B_k\). By countable additivity,
\[
\nu(E) \;=\; \nu(E_\infty) + \sum_{k \geq 1} \nu(B_k).
\]
Since \(\nu(E)\) is finite and \(\nu\) does not take the value \(+\infty\) (by the WLOG
assumption), the equation forces both \(\nu(E_\infty)\) and \(\sum_k \nu(B_k)\) to be finite —
for if \(\nu(E_\infty) = -\infty\), the right-hand side would be \(-\infty \neq \nu(E)\). With
the left-hand side finite, condition (3) of the signed-measure definition gives absolute
convergence of \(\sum_k \nu(B_k)\). Each \(\nu(B_k) < -1/n_k < 0\), so
\(\sum_k 1/n_k < \infty\), forcing \(n_k \to \infty\). Moreover,
\(\nu(E_\infty) = \nu(E) - \sum_k \nu(B_k) \geq \nu(E) > 0\), so in particular
\(\nu(E_\infty) > 0\).
We claim \(E_\infty\) is positive. If not, there exists \(C \subseteq E_\infty\) with \(\nu(C) < 0\);
choose \(m \in \mathbb{N}\) with \(-1/m > \nu(C)\) (possible since \(\nu(C) < 0\)). For all sufficiently
large \(k\), \(n_k > m\), so by minimality of \(n_k\), no measurable subset of \(E_{k-1}\) has
\(\nu\)-measure \(< -1/m\). But \(C \subseteq E_\infty \subseteq E_{k-1}\) and \(\nu(C) < -1/m\),
a contradiction. Hence \(E_\infty\) is positive with \(\nu(E_\infty) > 0\), proving the claim with
\(A := E_\infty\).
Step 2 (maximization). Let
\[
s \;=\; \sup\bigl\{\, \nu(P) \,:\, P \in \mathcal{F} \text{ is positive for } \nu \,\bigr\} \;\in\; [0, +\infty].
\]
The supremum is over a non-empty family (the empty set is positive with measure \(0\)). The
finiteness \(s < \infty\) is not yet established; it will follow at the end of this step from
\(\nu(P) = s\) and the WLOG assumption \(\nu < +\infty\). Choose positive sets \(P_n\) with
\(\nu(P_n) \to s\), and set \(P = \bigcup_n P_n\).
Each finite union \(P_1 \cup \cdots \cup P_n\) is positive, since a measurable subset of a finite
union of positive sets can be partitioned into measurable pieces, each contained in some \(P_i\),
and a sum of non-negative numbers is non-negative. Apply continuity from below to the increasing
sequence of positive sets \(Q_n := P_1 \cup \cdots \cup P_n \nearrow P\) (this is a special case of
countable additivity applied to the disjoint sequence
\(P_1, P_2 \setminus P_1, P_3 \setminus (P_1 \cup P_2), \ldots\); the \(\nu\)-values \(\nu(Q_n)\)
lie in \([0, s]\), so all quantities are non-negative reals or \(+\infty\) and no
\(\infty - \infty\) ambiguity arises). Then every measurable subset \(A \subseteq P\) satisfies
\(\nu(A) = \lim_n \nu(A \cap Q_n) \geq 0\) (each term is the \(\nu\)-measure of a measurable
subset of the positive set \(Q_n\)), so \(P\) is positive. Moreover, since \(P\) is positive and
\(P_n \subseteq P\), the set \(P \setminus P_n\) is a measurable subset of \(P\), hence has
non-negative \(\nu\)-measure, giving \(\nu(P) \geq \nu(P_n)\) for each \(n\); letting
\(n \to \infty\) yields \(\nu(P) \geq s\), and since \(\nu(P)\) is a candidate in the supremum,
\(\nu(P) = s\). Finally, the WLOG assumption \(\nu < +\infty\) gives \(\nu(P) < +\infty\),
confirming \(s < \infty\).
Step 3 (complement is negative). Set \(N = \Omega \setminus P\). Suppose for
contradiction \(N\) is not negative: there exists \(E \subseteq N\) with \(\nu(E) > 0\). Since
\(\nu(E)\) is finite (as \(\nu < +\infty\)), Step 1 produces a positive set \(A \subseteq E\) with
\(\nu(A) > 0\). Since \(A \subseteq N\) is disjoint from \(P\), the union \(P \cup A\) is a positive set with
\(\nu(P \cup A) = \nu(P) + \nu(A) = s + \nu(A) > s\), contradicting the definition of \(s\).
Hence \(N\) is negative.
Uniqueness. Let \(\Omega = P' \sqcup N'\) be another Hahn decomposition. The set
\(P \setminus P' = P \cap N'\) is a subset of the positive set \(P\) and of the negative set \(N'\),
so every measurable \(B \subseteq P \setminus P'\) satisfies both \(\nu(B) \geq 0\) and
\(\nu(B) \leq 0\), forcing \(\nu(B) = 0\). Thus \(P \setminus P'\) is \(\nu\)-null in the strong
sense (every measurable subset has \(\nu\)-measure zero); symmetrically \(P' \setminus P\) is
\(\nu\)-null. Hence \(P \triangle P'\) is \(\nu\)-null, and likewise \(N \triangle N'\).
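On a finite space the theorem can be checked by brute force. The sketch below assumes hypothetical point masses `mass` (including one point of mass zero) and verifies three things: a set of positive measure need not be a positive set, the sign of the point masses yields a Hahn decomposition, and two Hahn decompositions differ only on a \(\nu\)-null set.

```python
from fractions import Fraction as F
from itertools import chain, combinations

# Hypothetical point masses nu({w}); note the zero mass at "c".
mass = {"a": F(3, 2), "b": F(-1), "c": F(0), "d": F(1, 2), "e": F(-1, 4)}

def nu(A):
    return sum((mass[w] for w in A), F(0))

def subsets(S):
    S = list(S)
    return chain.from_iterable(combinations(S, r) for r in range(len(S) + 1))

def is_positive(S):
    """Every subset of S has non-negative measure."""
    return all(nu(A) >= 0 for A in subsets(S))

def is_negative(S):
    """Every subset of S has non-positive measure."""
    return all(nu(A) <= 0 for A in subsets(S))

omega = set(mass)
P = {w for w in omega if mass[w] >= 0}      # zero-mass point placed in P
N = omega - P
P_alt = {w for w in omega if mass[w] > 0}   # zero-mass point placed in N instead
N_alt = omega - P_alt
sym_diff = P ^ P_alt                        # where the two decompositions differ
```

The set `{"a", "b"}` has measure \(1/2 > 0\) but contains `{"b"}` of measure \(-1\), so it is not positive; the two decompositions differ only at the zero-mass point.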
The Hahn decomposition produces a partition of the underlying space; the Jordan decomposition repackages
this as an intrinsic splitting of the measure itself.
Theorem: Jordan Decomposition
Every signed measure \(\nu\) on \((\Omega, \mathcal{F})\) decomposes uniquely as
\[
\nu \;=\; \nu^+ - \nu^-,
\]
where \(\nu^+\) and \(\nu^-\) are non-negative measures and \(\nu^+ \perp \nu^-\) — that is,
\(\nu^+\) and \(\nu^-\) are concentrated on disjoint measurable sets.
Proof:
Let \(\Omega = P \sqcup N\) be a Hahn decomposition for \(\nu\). Define
\[
\nu^+(A) \;=\; \nu(A \cap P), \qquad \nu^-(A) \;=\; -\,\nu(A \cap N), \qquad A \in \mathcal{F}.
\]
Since \(P\) is positive and \(N\) is negative, both \(\nu^+\) and \(\nu^-\) are non-negative, and
countable additivity of \(\nu\) transfers immediately. For every \(A \in \mathcal{F}\),
\[
\nu^+(A) - \nu^-(A) \;=\; \nu(A \cap P) + \nu(A \cap N) \;=\; \nu(A),
\]
proving the existence of the decomposition. By construction \(\nu^+(N) = 0\) and \(\nu^-(P) = 0\), so
\(\nu^+\) and \(\nu^-\) are concentrated on disjoint sets and hence mutually singular.
For uniqueness, suppose \(\nu = \mu_1 - \mu_2\) with \(\mu_1, \mu_2\) non-negative measures
concentrated on disjoint measurable sets \(P', N' \in \mathcal{F}\) with
\(\Omega = P' \sqcup N'\) (so \(\mu_1(N') = 0\) and \(\mu_2(P') = 0\)).
Step (a): \(\Omega = P' \sqcup N'\) is a Hahn decomposition for \(\nu\). For
\(A \subseteq P'\), \(\mu_2(A) \leq \mu_2(P') = 0\) (since \(\mu_2\) is concentrated on \(N'\),
disjoint from \(P'\)), so \(\nu(A) = \mu_1(A) \geq 0\); thus \(P'\) is positive. Symmetrically,
\(N'\) is negative. By Hahn uniqueness, the symmetric difference
\(P \triangle P' = (P \setminus P') \sqcup (P' \setminus P)\) is \(\nu\)-null in the strong sense —
every measurable subset has \(\nu\)-measure zero.
Step (b): \(\mu_1(P \triangle P') = 0\) and \(\mu_2(P \triangle P') = 0\). For
\(P' \setminus P \subseteq P'\), the concentration of \(\mu_2\) on \(N'\) gives
\(\mu_2(P' \setminus P) = 0\), so \(\nu(P' \setminus P) = \mu_1(P' \setminus P)\); combined with
the strong \(\nu\)-nullity from Step (a), \(\mu_1(P' \setminus P) = 0\). For
\(P \setminus P' \subseteq N'\) (since \(P \setminus P' \subseteq \Omega \setminus P' = N'\)), the
concentration of \(\mu_1\) on \(P'\) gives \(\mu_1(P \setminus P') = 0\) directly; then
\(\nu(P \setminus P') = -\mu_2(P \setminus P')\), and the strong \(\nu\)-nullity forces
\(\mu_2(P \setminus P') = 0\). Combining, \(\mu_1(P \triangle P') = 0\) and
\(\mu_2(P \triangle P') = 0\).
Step (c): \(\mu_1 = \nu^+\) and \(\mu_2 = \nu^-\). Fix \(A \in \mathcal{F}\). Since
\(\mu_1\) is concentrated on \(P' = (P \cap P') \sqcup (P' \setminus P)\),
\[
\mu_1(A) \;=\; \mu_1(A \cap P') \;=\; \mu_1(A \cap P \cap P') + \mu_1(A \cap (P' \setminus P))
\;=\; \mu_1(A \cap P \cap P'),
\]
where the last equality uses \(\mu_1(P' \setminus P) = 0\) from Step (b). Similarly, splitting
\(P = (P \cap P') \sqcup (P \setminus P')\),
\[
\mu_1(A \cap P) \;=\; \mu_1(A \cap P \cap P') + \mu_1(A \cap (P \setminus P')) \;=\; \mu_1(A \cap P \cap P'),
\]
using \(\mu_1(P \setminus P') = 0\). Hence \(\mu_1(A) = \mu_1(A \cap P)\).
On the other hand, from \(\nu = \mu_1 - \mu_2\),
\[
\nu^+(A) \;=\; \nu(A \cap P) \;=\; \mu_1(A \cap P) - \mu_2(A \cap P).
\]
We claim \(\mu_2(A \cap P) = 0\): split
\(A \cap P = (A \cap P \cap P') \sqcup (A \cap (P \setminus P'))\); the first piece satisfies
\(\mu_2(A \cap P \cap P') \leq \mu_2(P') = 0\), and the second satisfies
\(\mu_2(A \cap (P \setminus P')) \leq \mu_2(P \setminus P') = 0\) by Step (b). Thus
\(\nu^+(A) = \mu_1(A \cap P) = \mu_1(A)\). The identity \(\mu_2 = \nu^-\) follows symmetrically.
The Hahn decomposition is unique only up to \(\nu\)-null sets, but the Jordan decomposition itself is
fully unique — the ambiguity in choosing \(P\) versus \(P'\) is invisible from the perspective of
\(\nu^+\) and \(\nu^-\), which only see how \(\nu\) acts on sets, not which Hahn-partition was used
to construct them.
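The same finite setting illustrates both halves of this remark. In the sketch below (hypothetical point masses again, with measures stored as tables of point masses), two different Hahn sets produce identical \(\nu^+\) and \(\nu^-\). A competing decomposition \(\nu = \mu_1 - \mu_2\), obtained by adding the same extra mass `tau` to both parts, still recovers \(\nu\) but fails mutual singularity, which is why the singularity requirement is needed for uniqueness.

```python
from fractions import Fraction as F

mass = {"a": F(3, 2), "b": F(-1), "c": F(0), "d": F(1, 2), "e": F(-1, 4)}
omega = set(mass)

def jordan_parts(P):
    """Point masses of nu^+ and nu^- built from the Hahn set P and N = omega - P."""
    N = omega - P
    nu_plus = {w: (mass[w] if w in P else F(0)) for w in omega}
    nu_minus = {w: (-mass[w] if w in N else F(0)) for w in omega}
    return nu_plus, nu_minus

plus1, minus1 = jordan_parts({"a", "c", "d"})   # zero-mass point "c" assigned to P
plus2, minus2 = jordan_parts({"a", "d"})        # zero-mass point "c" assigned to N

# A decomposition nu = mu1 - mu2 that is NOT mutually singular:
tau = {"a": F(1), "b": F(1, 3)}                 # hypothetical common extra mass
mu1 = {w: plus1[w] + tau.get(w, F(0)) for w in omega}
mu2 = {w: minus1[w] + tau.get(w, F(0)) for w in omega}

def overlap(m1, m2):
    """Points charged by both measures (empty when they are mutually singular)."""
    return {w for w in omega if m1[w] > 0 and m2[w] > 0}
```

The Jordan parts have empty overlap, whereas `mu1` and `mu2` both charge the points `"a"` and `"b"`.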
Definition: Total Variation
The total variation of a signed measure \(\nu\) is the non-negative measure
\[
|\nu| \;=\; \nu^+ + \nu^-.
\]
We say \(\nu\) is a finite signed measure if \(|\nu|(\Omega) < \infty\), and
\(\sigma\)-finite if \(\Omega\) is a countable union of sets of finite \(|\nu|\)-measure.
The total variation \(|\nu|\) is the natural "size" of a signed measure: \(|\nu|(A)\) measures the total
mass of \(\nu\) on \(A\) without cancellation between the positive and negative parts. For
\(\nu_f(A) = \int_A f \, d\mu\) with \(f \in L^1(\mu)\), the Hahn decomposition is given explicitly by
\(P = \{f \geq 0\}\) and \(N = \{f < 0\}\), so \(\nu_f^+(A) = \int_A f^+ \, d\mu\) and
\(\nu_f^-(A) = \int_A f^- \, d\mu\), giving \(|\nu_f|(A) = \int_A |f| \, d\mu\). In particular
\(|\nu_f|(\Omega) = \|f\|_{L^1(\mu)}\). The map \(f \mapsto \nu_f\) is therefore a norm-preserving linear
embedding of \(L^1(\mu)\) into the space of finite signed measures absolutely continuous with respect to
\(\mu\) (in the sense introduced in the next section). The Radon-Nikodym
theorem proved at the end of this chapter will upgrade this embedding to an isometric isomorphism by
establishing surjectivity — every such \(\mu\)-AC signed measure is of the form \(\nu_f\) for some
\(f \in L^1(\mu)\).
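A quick finite check of the identity \(|\nu_f|(\Omega) = \|f\|_{L^1(\mu)}\), as a sketch with hypothetical weights `mu` and density `f`. The total variation is computed from the Hahn sets \(\{f \geq 0\}\) and \(\{f < 0\}\), and the \(L^1\) norm is computed independently as a weighted sum of \(|f|\).

```python
from fractions import Fraction as F

mu = {0: F(1, 2), 1: F(1, 3), 2: F(1, 6)}   # hypothetical weights
f = {0: F(4), 1: F(-3), 2: F(0)}            # hypothetical density

def nu(A):
    return sum((f[w] * mu[w] for w in A), F(0))

omega = set(mu)
P = {w for w in omega if f[w] >= 0}   # Hahn set {f >= 0}
N = omega - P                         # Hahn set {f < 0}

def total_variation(A):
    """|nu|(A) = nu^+(A) + nu^-(A) = nu(A & P) - nu(A & N)."""
    A = set(A)
    return nu(A & P) - nu(A & N)

l1_norm = sum((abs(f[w]) * mu[w] for w in omega), F(0))
```

Here \(\nu_f(\Omega) = 1\) because the positive and negative contributions partly cancel, while \(|\nu_f|(\Omega) = 3\) counts both without cancellation.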
Signed Measures in CS and ML
Signed measures appear wherever a system carries net-flow or signed-mass quantities rather than
mass alone.
Optimal transport via Kantorovich-Rubinstein duality. In the Earth Mover's
formulation of optimal transport, the difference \(P - Q\) of two probability distributions is a
signed measure of total mass zero. The Kantorovich-Rubinstein duality writes the Wasserstein-1
distance as a supremum over Lipschitz functions of integrals against this signed measure:
\(W_1(P, Q) = \sup_{f \in \mathrm{Lip}_1} \int f \, d(P - Q)\). The Jordan decomposition
\(P - Q = (P - Q)^+ - (P - Q)^-\) identifies the regions of mass surplus (sources) and deficit
(sinks) between the two distributions — the regions from which mass must flow and to which it
must arrive in any optimal transport plan.
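For distributions on the real line this can be made fully explicit. The sketch below uses two hypothetical discrete distributions on the points \(0, 1, 2, 3\) and relies on the standard one-dimensional formula \(W_1(P, Q) = \int |F_P - F_Q| \, dx\) (an assumption imported from outside this text). It also builds a piecewise-linear 1-Lipschitz test function with slope \(-\operatorname{sign}(F_P - F_Q)\) and confirms that it attains the same value in the dual formula.

```python
from fractions import Fraction as F

xs = [0, 1, 2, 3]                                   # sorted support points
P = [F(1, 2), F(1, 2), F(0), F(0)]                  # hypothetical distribution P
Q = [F(0), F(1, 4), F(1, 4), F(1, 2)]               # hypothetical distribution Q
diff = [p - q for p, q in zip(P, Q)]                # signed measure P - Q

surplus = {x: d for x, d in zip(xs, diff) if d > 0}    # point masses of (P - Q)^+
deficit = {x: -d for x, d in zip(xs, diff) if d < 0}   # point masses of (P - Q)^-

w1 = F(0)
cdf_gap = F(0)          # running value of F_P - F_Q
witness = [F(0)]        # values f(x_i) of a 1-Lipschitz test function
for i in range(len(xs) - 1):
    cdf_gap += diff[i]
    step = xs[i + 1] - xs[i]
    w1 += abs(cdf_gap) * step                 # accumulate |F_P - F_Q| over the gap
    slope = -1 if cdf_gap > 0 else 1          # slope of f on this gap
    witness.append(witness[-1] + slope * step)

dual_value = sum(fx * d for fx, d in zip(witness, diff))   # integral of f d(P - Q)
```

The surplus sits on \(\{0, 1\}\) and the deficit on \(\{2, 3\}\), each of total mass \(3/4\), and both the CDF formula and the dual value give \(7/4\).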
Log-likelihood ratios. In hypothesis testing and classification, the log-likelihood ratio
\(\ell(x) = \log\!\bigl(p_1(x)/p_0(x)\bigr)\) defines a finite signed measure
\(\nu_\ell(A) = \int_A \ell \, dP_0\) whenever \(\ell \in L^1(P_0)\), i.e.,
\(\mathbb{E}_{P_0}[|\log(p_1/p_0)|] < \infty\). Its Jordan decomposition isolates the regions where
evidence favors hypothesis \(H_1\) over \(H_0\) and vice versa, and the total variation
\(|\nu_\ell|(\Omega) = \mathbb{E}_{P_0}[|\ell|]\) measures the typical magnitude of the
log-evidence under \(H_0\). This quantity is related to, but distinct from, the standard
divergence-based measures of test difficulty: the KL divergence
\(D_{\mathrm{KL}}(P_0 \| P_1) = -\mathbb{E}_{P_0}[\ell]\) and the total-variation distance
\(\mathrm{TV}(P_0, P_1) = \tfrac{1}{2}\|p_0 - p_1\|_{L^1}\), which determines the Bayes risk of
binary hypothesis testing under equal priors via Le Cam's identity
\(\mathcal{R}_{\min} = \tfrac{1}{2}(1 - \mathrm{TV}(P_0, P_1))\).
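These quantities are easy to compare on a toy example. The following sketch assumes two hypothetical densities on three outcomes and computes the Jordan parts of \(\nu_\ell\), the KL divergence, and the total-variation distance. It then finds the smallest equal-prior risk (average of the two error probabilities) by enumerating every deterministic test, and this brute-force value agrees with \(\tfrac{1}{2}(1 - \mathrm{TV})\).

```python
import math
from fractions import Fraction as F
from itertools import product

p0 = [F(1, 2), F(1, 4), F(1, 4)]    # hypothetical density under H0
p1 = [F(1, 4), F(1, 4), F(1, 2)]    # hypothetical density under H1

ell = [math.log(float(b / a)) for a, b in zip(p0, p1)]   # log(p1 / p0)

nu_plus = sum(float(a) * l for a, l in zip(p0, ell) if l > 0)    # evidence for H1
nu_minus = sum(-float(a) * l for a, l in zip(p0, ell) if l < 0)  # evidence for H0
tv_of_nu = nu_plus + nu_minus                       # |nu_ell|(Omega) = E_{P0}|ell|
kl = -sum(float(a) * l for a, l in zip(p0, ell))    # D_KL(P0 || P1)

tv = sum(abs(a - b) for a, b in zip(p0, p1)) / 2    # TV(P0, P1), exact

def risk(decide):
    """Equal-prior risk of a test; decide[i] == 1 means 'choose H1' on outcome i."""
    type1 = sum(a for a, d in zip(p0, decide) if d == 1)   # reject H0 when true
    type2 = sum(b for b, d in zip(p1, decide) if d == 0)   # accept H0 when false
    return (type1 + type2) / 2

best_risk = min(risk(d) for d in product([0, 1], repeat=len(p0)))
```

In this example \(\mathbb{E}_{P_0}[|\ell|] = \tfrac{3}{4}\ln 2\) while \(D_{\mathrm{KL}}(P_0 \| P_1) = \tfrac{1}{4}\ln 2\), so the two notions of size differ.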
Network flows on graphs. For a flow defined on the edges of a finite graph (with
sources and sinks), its divergence — the net flow at each vertex (incoming minus
outgoing) — is naturally a signed measure on the vertex set, with \(\nu(\{v\})\) the net flow at
vertex \(v\); the Hahn decomposition partitions the vertices into net-source and net-sink subsets.
The same algebraic structure underlies divergence operators on simplicial complexes — a connection
that resurfaces in the simplicial and homological structures developed in Section IV and ahead
toward Geometric Deep Learning.
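A minimal sketch of the graph case, assuming a hypothetical flow on a small directed graph with one source `s` and one sink `t`. The divergence follows the convention stated above (incoming minus outgoing), so net sinks land in the positive set and net sources in the negative set.

```python
# Hypothetical directed flow: (tail, head) -> amount carried by that edge.
flow = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1, ("a", "t"): 2, ("b", "t"): 3}
vertices = {v for edge in flow for v in edge}

def divergence(v):
    """Net flow at v: incoming minus outgoing."""
    incoming = sum(x for (tail, head), x in flow.items() if head == v)
    outgoing = sum(x for (tail, head), x in flow.items() if tail == v)
    return incoming - outgoing

nu = {v: divergence(v) for v in vertices}      # signed measure on the vertex set
P = {v for v in vertices if nu[v] >= 0}        # net sinks and balanced vertices
N = vertices - P                               # net sources
total_variation = sum(abs(x) for x in nu.values())
```

Every edge contributes once as incoming and once as outgoing, so the signed measure has total mass zero; its total variation is twice the amount shipped from `s` to `t`.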
Absolute Continuity & Singularity
With the structural theory of signed measures in hand, we turn to the relation between two measures
on the same space. The Radon-Nikodym theorem will assert that one measure can be expressed as an
integral against another precisely when the two satisfy the relation defined here: absolute
continuity. The opposite extreme — mutual singularity — describes measures
that are concentrated on disjoint sets and have nothing to integrate against one another. These
two relations are the extreme cases, and the Lebesgue decomposition (mentioned in the Looking Ahead
section) shows that every \(\sigma\)-finite measure splits into an absolutely continuous part and a
singular part.
Definitions
Definition: Absolute Continuity
Let \(\mu\) be a non-negative measure and \(\nu\) a signed measure on \((\Omega, \mathcal{F})\).
We say \(\nu\) is absolutely continuous with respect to \(\mu\), written
\(\nu \ll \mu\), if for every \(A \in \mathcal{F}\),
\[
\mu(A) = 0 \;\Longrightarrow\; \nu(A) = 0.
\]
The condition \(\nu \ll \mu\) is exactly what is needed for "\(\mu\)-null sets are also \(\nu\)-null
sets" — \(\nu\) inherits whatever \(\mu\) declares to be negligible. For signed \(\nu\), the definition
is equivalent to \(|\nu| \ll \mu\), and also to the conjunction \(\nu^+ \ll \mu\) and \(\nu^- \ll \mu\).
To see (\(\Rightarrow\)): if \(\mu(A) = 0\), then \(\mu(B) = 0\) for every measurable \(B \subseteq A\)
(by monotonicity, since \(\mu \geq 0\)); taking \(B = A \cap P\) and \(B = A \cap N\) from a Hahn
decomposition of \(\nu\), the hypothesis \(\nu \ll \mu\) gives \(\nu^+(A) = \nu(A \cap P) = 0\) and
\(\nu^-(A) = -\nu(A \cap N) = 0\), hence \(|\nu|(A) = \nu^+(A) + \nu^-(A) = 0\). For (\(\Leftarrow\)):
\(\mu(A) = 0 \Rightarrow |\nu|(A) = 0 \Rightarrow |\nu(A)| \leq |\nu|(A) = 0 \Rightarrow \nu(A) = 0\).
The Radon-Nikodym theorem will be stated for non-negative \(\nu\); the signed case follows by applying
the result to \(\nu^+\) and \(\nu^-\) separately and subtracting.
Definition: Mutual Singularity
Two signed measures \(\mu\) and \(\nu\) on \((\Omega, \mathcal{F})\) are
mutually singular, written \(\mu \perp \nu\), if there exists a measurable
\(E \in \mathcal{F}\) such that
\[
|\mu|(E^c) \;=\; 0 \quad \text{and} \quad |\nu|(E) \;=\; 0.
\]
Equivalently, \(\mu\) is concentrated on \(E\) and \(\nu\) is concentrated on \(E^c\), so the
two measures live on disjoint measurable carriers. (Unlike absolute continuity, mutual singularity
is a symmetric relation between two signed measures.)
The two relations are extremes. If \(\nu \ll \mu\), then \(\nu\) is "dominated" by \(\mu\): every
\(\mu\)-negligible set is also \(\nu\)-negligible. If \(\nu \perp \mu\), then \(\nu\) and \(\mu\)
have no overlap whatsoever. The only signed measure that is simultaneously \(\nu \ll \mu\) and
\(\nu \perp \mu\) is the zero measure: from \(\nu \perp \mu\), choose \(E\) with
\(|\nu|(E) = 0\) and \(\mu(E^c) = 0\); then \(\nu \ll \mu\) forces \(|\nu|(E^c) = 0\), and so
\(|\nu|(\Omega) = 0\), giving \(\nu = 0\).
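On a finite space, where measures are tables of point masses, both relations reduce to conditions on supports. The sketch below uses hypothetical helper names and assumes \(\mu\) is non-negative: \(\nu \ll \mu\) means \(\nu\) vanishes at every \(\mu\)-null point, and \(\nu \perp \mu\) means \(\nu\) vanishes at every point that \(\mu\) charges (take \(E = \{\mu > 0\}\) in the definition). The four test measures show the possible combinations.

```python
def is_abs_continuous(nu, mu):
    """nu << mu on a finite space: every mu-null point is nu-null."""
    return all(nu[w] == 0 for w in mu if mu[w] == 0)

def is_mutually_singular(nu, mu):
    """nu perp mu on a finite space: nu vanishes wherever mu has mass."""
    return all(nu[w] == 0 for w in mu if mu[w] > 0)

mu = {"a": 1, "b": 2, "c": 0}          # hypothetical non-negative measure

dominated = {"a": -1, "b": 3, "c": 0}  # signed, lives where mu has mass
singular = {"a": 0, "b": 0, "c": 5}    # lives entirely on the mu-null point
mixed = {"a": 1, "b": 0, "c": 1}       # has a part of each kind
zero = {"a": 0, "b": 0, "c": 0}

both = [name for name, nu in
        [("dominated", dominated), ("singular", singular), ("mixed", mixed), ("zero", zero)]
        if is_abs_continuous(nu, mu) and is_mutually_singular(nu, mu)]
```

Only the zero measure satisfies both relations, and `mixed` satisfies neither.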
The ε-δ Characterization
The implication \(\mu(A) = 0 \Rightarrow \nu(A) = 0\) in the
definition of absolute continuity is a
qualitative, "all-or-nothing" condition: it says that \(\nu\) collapses on \(\mu\)-null sets, but it
says nothing about how \(\nu\) behaves on sets of small but positive \(\mu\)-measure.
When \(\nu\) is finite, however, the condition tightens: \(|\nu(A)|\) must in fact be uniformly small
whenever \(\mu(A)\) is small. This is the analytic counterpart of the qualitative definition,
and it is the form in which absolute continuity most often appears in proofs.
Theorem: ε-δ Characterization of Absolute Continuity
Let \(\mu\) be a non-negative measure and \(\nu\) a finite signed measure on \((\Omega, \mathcal{F})\).
Then \(\nu \ll \mu\) if and only if for every \(\varepsilon > 0\) there exists \(\delta > 0\) such
that for every \(A \in \mathcal{F}\),
\[
\mu(A) < \delta \;\Longrightarrow\; |\nu(A)| < \varepsilon.
\]
Proof:
(\(\Leftarrow\)) The ε-δ condition implies \(\nu \ll \mu\) directly: if
\(\mu(A) = 0\), then \(\mu(A) < \delta\) for every \(\delta > 0\), so \(|\nu(A)| < \varepsilon\)
for every \(\varepsilon > 0\), forcing \(\nu(A) = 0\).
(\(\Rightarrow\)) Suppose \(\nu \ll \mu\) but the ε-δ condition fails. Then there
exists \(\varepsilon_0 > 0\) such that for every \(n \geq 1\), some \(A_n \in \mathcal{F}\) satisfies
\[
\mu(A_n) < 2^{-n} \quad \text{and} \quad |\nu(A_n)| \geq \varepsilon_0.
\]
Set \(B_n = \bigcup_{k \geq n} A_k\) and \(B = \bigcap_{n \geq 1} B_n = \limsup_n A_n\). The sets
\(B_n\) are decreasing, with \(B_1\) of finite \(|\nu|\)-measure (since \(\nu\) is finite). By
\(\sigma\)-subadditivity,
\[
\mu(B) \leq \mu(B_n) \leq \sum_{k \geq n} \mu(A_k) < \sum_{k \geq n} 2^{-k} = 2^{-(n-1)},
\]
so \(\mu(B) = 0\) on letting \(n \to \infty\). By absolute continuity, \(|\nu|(B) = 0\).
On the other hand, by
continuity from above
for the finite measure \(|\nu|\),
\[
|\nu|(B) \;=\; \lim_{n \to \infty} |\nu|(B_n) \;\geq\; \limsup_{n \to \infty} |\nu|(A_n) \;\geq\; \limsup_{n \to \infty} |\nu(A_n)| \;\geq\; \varepsilon_0,
\]
where the second inequality uses \(A_n \subseteq B_n\) (so \(|\nu|(A_n) \leq |\nu|(B_n)\) for each
\(n\), giving \(\limsup_n |\nu|(A_n) \leq \limsup_n |\nu|(B_n) = \lim_n |\nu|(B_n)\); the last
equality holds because \(B_n\) is decreasing) and the third uses \(|\nu(A_n)| \leq |\nu|(A_n)\).
This contradicts \(|\nu|(B) = 0\).
The finiteness hypothesis on \(\nu\) cannot be dropped. Take \(\mu = \lambda\) (Lebesgue measure on
\(\mathbb{R}\)) and \(\nu(A) = \int_A x^2 \, d\lambda(x)\), so that \(\nu \ll \lambda\) and \(\nu\) is
\(\sigma\)-finite. The sets \(A_n = [n, n + 1/n^2]\) satisfy \(\lambda(A_n) = 1/n^2 \to 0\), but
\[
\nu(A_n) \;=\; \int_n^{n + 1/n^2} x^2 \, dx \;\geq\; n^2 \cdot \frac{1}{n^2} \;=\; 1
\]
for every \(n\). So \(\nu(A_n)\) does not vanish even as \(\lambda(A_n) \to 0\), and the ε-δ form
fails despite \(\nu \ll \lambda\) holding qualitatively.
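The counterexample can be tabulated exactly. In this sketch (illustrative function names), \(\nu(A_n)\) is evaluated in closed form as \(\bigl((n + 1/n^2)^3 - n^3\bigr)/3\) using rational arithmetic, next to \(\lambda(A_n) = 1/n^2\).

```python
from fractions import Fraction as F

def lebesgue_measure(n):
    """Length of A_n = [n, n + 1/n^2]."""
    return F(1, n * n)

def nu_of_A(n):
    """Exact integral of x^2 over [n, n + 1/n^2]."""
    a = F(n)
    b = F(n) + F(1, n * n)
    return (b ** 3 - a ** 3) / 3

# Rows of (n, lambda(A_n), nu(A_n)) for a few sample values of n.
rows = [(n, lebesgue_measure(n), nu_of_A(n)) for n in (1, 2, 5, 10, 100)]
```

The middle column shrinks to zero while the last column stays at or above \(1\), so no single \(\delta\) can work for \(\varepsilon = 1\).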
The Probabilistic Picture
For a probability distribution \(P_X\) on \((\mathbb{R}, \mathcal{B}(\mathbb{R}))\), the relation
to Lebesgue measure \(\lambda\) determines the qualitative type of the random variable. If
\(P_X \ll \lambda\), the distribution is absolutely continuous in the
measure-theoretic sense: it has a Lebesgue density (the Radon-Nikodym derivative \(dP_X/d\lambda\), whose existence we will prove below). If
\(P_X \perp \lambda\) — for instance, when \(P_X\) is concentrated on a countable set, or on a
fractal of zero Lebesgue measure such as the Cantor set — there is no density relative to length.
A mixed distribution, such as \(\tfrac{1}{2}\delta_0 + \tfrac{1}{2}\,\lambda\big|_{[0,1]}\), is neither:
it has a singular part (the Dirac mass at \(0\)) and an absolutely continuous part. The Lebesgue
decomposition mentioned below asserts that this is the universal pattern: every \(\sigma\)-finite
measure splits canonically into an absolutely continuous and a mutually singular piece.
The Radon-Nikodym Theorem
We arrive at the central result. Given two \(\sigma\)-finite measures with \(\nu \ll \mu\), the
Radon-Nikodym theorem asserts that \(\nu\) is the integral of some non-negative measurable function
against \(\mu\). The function — unique up to \(\mu\)-null sets — is the abstract derivative
\(d\nu/d\mu\), and it is the working object underneath every "density" in probability and statistics.
Statement
Theorem: Radon-Nikodym
Let \((\Omega, \mathcal{F})\) be a measurable space, and let \(\mu, \nu\) be
\(\sigma\)-finite
non-negative measures on \((\Omega, \mathcal{F})\) with \(\nu \ll \mu\). Then there exists a
non-negative measurable function \(f : \Omega \to [0, \infty]\), unique up to \(\mu\)-a.e. equality,
such that
\[
\nu(A) \;=\; \int_A f \, d\mu \quad \text{for every } A \in \mathcal{F}.
\]
The function \(f\) is called the Radon-Nikodym derivative of \(\nu\) with respect
to \(\mu\), denoted
\[
f \;=\; \frac{d\nu}{d\mu}.
\]
For signed \(\nu\), the result extends by Jordan decomposition: writing \(\nu = \nu^+ - \nu^-\) with
\(\nu^\pm \ll \mu\) (which follows from \(\nu \ll \mu\), as observed in the
previous section), we obtain
\(d\nu/d\mu = d\nu^+/d\mu - d\nu^-/d\mu\), an extended-real measurable function, finite
\(\mu\)-a.e. when \(\nu\) is finite.
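In the finite case the theorem is elementary, which makes it a useful sanity check. The sketch below assumes hypothetical point masses with \(\nu \ll \mu\); the derivative is the ratio of point masses wherever \(\mu\) is positive, and its value on \(\mu\)-null points is arbitrary (set to \(0\) here), reflecting uniqueness only up to \(\mu\)-a.e. equality.

```python
from fractions import Fraction as F
from itertools import chain, combinations

mu = {"a": F(1, 2), "b": F(1, 4), "c": F(1, 4), "d": F(0)}
nu = {"a": F(1, 8), "b": F(3, 8), "c": F(0), "d": F(0)}   # nu vanishes where mu does

def rn_derivative(nu, mu):
    """dnu/dmu on a finite space; the value at mu-null points is arbitrary (0 here)."""
    return {w: (nu[w] / mu[w] if mu[w] != 0 else F(0)) for w in mu}

f = rn_derivative(nu, mu)

def measure(weights, A):
    return sum((weights[w] for w in A), F(0))

def integral(g, weights, A):
    """Integral of g over A with respect to the measure given by weights."""
    return sum((g[w] * weights[w] for w in A), F(0))

def all_subsets(S):
    S = list(S)
    return chain.from_iterable(combinations(S, r) for r in range(len(S) + 1))

# A second version of the derivative that differs only on the mu-null point "d".
f_alt = dict(f, d=F(42))
```

Both `f` and `f_alt` reproduce \(\nu\) on every subset, since they differ only where \(\mu\) has no mass.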
Proof via von Neumann's \(L^2\) Method:
The proof we present is due to von Neumann and is among the most striking applications of the
Hilbert-space machinery developed in Section II. The key idea is to construct \(f\) as the solution
of a single linear-algebraic problem in \(L^2\) and then transport it back to a relation between
measures. The integral against \(\nu\) defines a bounded linear functional on a suitable \(L^2\)
space; the
Riesz Representation Theorem
delivers a representing function; algebraic manipulation extracts \(f\) from that representative.
Reduction to finite measures.
Since both \(\mu\) and \(\nu\) are \(\sigma\)-finite, we can write
\(\Omega = \bigsqcup_{n \geq 1} \Omega_n\) with \(\Omega_n\) pairwise disjoint,
measurable, and satisfying \(\mu(\Omega_n), \nu(\Omega_n) < \infty\). (Take a \(\mu\)-exhaustion
and a \(\nu\)-exhaustion separately, intersect, and disjointify.) If we can produce a Radon-Nikodym
derivative \(f_n\) of \(\nu|_{\Omega_n}\) with respect to \(\mu|_{\Omega_n}\) on each \(\Omega_n\),
extend each \(f_n\) by zero outside \(\Omega_n\); the function \(f = \sum_n f_n \mathbf{1}_{\Omega_n}\) is
measurable and non-negative, and the disjointness of \(\{\Omega_n\}\) together with MCT gives
\(\nu(A) = \sum_n \nu(A \cap \Omega_n) = \sum_n \int_{A \cap \Omega_n} f_n \, d\mu = \int_A f \, d\mu\)
for every \(A \in \mathcal{F}\). Thus it suffices to prove the theorem when both \(\mu\) and \(\nu\)
are finite.
Step 1: The auxiliary measure \(\rho = \mu + \nu\).
Assume henceforth that \(\mu\) and \(\nu\) are finite. Define a new measure
\[
\rho \;=\; \mu + \nu, \qquad \rho(A) = \mu(A) + \nu(A).
\]
Then \(\rho\) is finite, and \(\mu \ll \rho\) and \(\nu \ll \rho\) hold trivially (any \(\rho\)-null
set has both \(\mu\)- and \(\nu\)-measure zero, since \(\mu, \nu \geq 0\)).
Step 2: A bounded linear functional on \(L^2(\rho)\).
Define
\[
T : L^2(\rho) \to \mathbb{R}, \qquad T(g) \;=\; \int_\Omega g \, d\nu.
\]
The functional \(T\) is linear by linearity of the integral. To verify boundedness, apply the
Cauchy-Schwarz inequality on \(L^2(\rho)\) using the constant function \(\mathbf{1} \in L^2(\rho)\)
(which lies in \(L^2(\rho)\) because \(\rho(\Omega) < \infty\)):
\[
|T(g)| \;\leq\; \int |g| \, d\nu \;\leq\; \int |g| \, d\rho \;=\; \int |g| \cdot 1 \, d\rho
\;\leq\; \|g\|_{L^2(\rho)} \cdot \rho(\Omega)^{1/2}.
\]
Here the second inequality uses \(\nu \leq \rho\) (as measures), and the final inequality is
Cauchy-Schwarz. Thus \(T\) is a continuous linear functional on the Hilbert space \(L^2(\rho)\),
with operator norm at most \(\rho(\Omega)^{1/2}\).
Step 3: Riesz representation.
By the
Riesz Representation Theorem,
there exists a unique \(h \in L^2(\rho)\) such that
\[
T(g) \;=\; \langle g, h \rangle_{L^2(\rho)} \;=\; \int_\Omega g \, h \, d\rho \quad \text{for every } g \in L^2(\rho).
\]
Unwinding the definition of \(T\), this reads
\[
\int_\Omega g \, d\nu \;=\; \int_\Omega g \, h \, d\rho \quad \text{for every } g \in L^2(\rho). \tag{$\star$}
\]
The function \(h\) is so far an abstract \(L^2(\rho)\)-element; the next two steps pin down its
geometry.
Step 4: \(0 \leq h \leq 1\) holds \(\rho\)-a.e.
We prove the two bounds separately by testing (\(\star\)) against indicator functions.
For the lower bound, let \(E_- = \{h < 0\}\), and substitute \(g = \mathbf{1}_{E_-} \in L^2(\rho)\)
(this is in \(L^2(\rho)\) because \(\rho\) is finite). Then
\[
\nu(E_-) \;=\; \int_{E_-} 1 \, d\nu \;=\; \int_{E_-} h \, d\rho \;\leq\; 0,
\]
while \(\nu(E_-) \geq 0\) since \(\nu\) is non-negative. Hence \(\nu(E_-) = 0\) and
\(\int_{E_-} h \, d\rho = 0\). But \(h < 0\) strictly on \(E_-\), so \(\int_{E_-} h \, d\rho < 0\)
unless \(\rho(E_-) = 0\). Combined with \(\int_{E_-} h \, d\rho = 0\), this forces
\(\rho(E_-) = 0\), i.e., \(h \geq 0\) holds \(\rho\)-a.e.
For the upper bound, let \(E_+ = \{h > 1\}\) and substitute \(g = \mathbf{1}_{E_+}\). Then
\[
\nu(E_+) \;=\; \int_{E_+} h \, d\rho \;\geq\; \int_{E_+} 1 \, d\rho \;=\; \rho(E_+) \;=\; \mu(E_+) + \nu(E_+).
\]
Since \(\nu(E_+) \leq \nu(\Omega) < \infty\) (we are in the finite case), the term \(\nu(E_+)\) can be
subtracted from both sides, forcing \(\mu(E_+) \leq 0\), hence \(\mu(E_+) = 0\). By absolute continuity
\(\nu \ll \mu\), this gives \(\nu(E_+) = 0\), and therefore \(\rho(E_+) = 0\). Hence \(h \leq 1\) holds
\(\rho\)-a.e.
We may modify \(h\) on a \(\rho\)-null set without disturbing equation (\(\star\)), so we redefine
\(h\) to take values in \([0, 1]\) everywhere.
Step 5: Rewriting (\(\star\)) and isolating \(d\nu/d\mu\).
Substitute the definition \(\rho = \mu + \nu\) into (\(\star\)) and use linearity of the integral with respect to
the sum measure (\(\int \phi \, d(\mu + \nu) = \int \phi \, d\mu + \int \phi \, d\nu\) for non-negative or
\(\rho\)-integrable \(\phi\)):
\[
\int_\Omega g \, d\nu \;=\; \int_\Omega g \, h \, d\mu + \int_\Omega g \, h \, d\nu,
\]
which rearranges to
\[
\int_\Omega g \, (1 - h) \, d\nu \;=\; \int_\Omega g \, h \, d\mu \quad \text{for every } g \in L^2(\rho). \tag{$\star\star$}
\]
We will use (\(\star\star\)) on indicator functions to identify \(d\nu/d\mu\).
Let \(E_1 = \{h = 1\}\). Substituting \(g = \mathbf{1}_{E_1}\) in (\(\star\star\)) gives
\(0 = \int_{E_1} h \, d\mu = \mu(E_1)\), so \(\mu(E_1) = 0\). By absolute continuity
\(\nu \ll \mu\), also \(\nu(E_1) = 0\). Hence \(h < 1\) holds \(\mu\)-a.e. and \(\nu\)-a.e., and
redefining \(h\) to take a value in \([0, 1)\) on the \(\rho\)-null set \(E_1\) leaves (\(\star\star\))
intact.
Set
\[
f \;=\; \frac{h}{1 - h},
\]
a non-negative measurable function with values in \([0, \infty)\), defined \(\mu\)-a.e. (we set
\(f = 0\) on the \(\mu\)-null set \(E_1\) where \(h = 1\) to make \(f\) defined everywhere; the
choice is irrelevant). The identity \(h = (1-h) f\) holds \(\mu\)-a.e., and will be used in Step 6
to extract \(f\) as the desired Radon-Nikodym derivative.
Step 6: From (\(\star\star\)) to the Radon-Nikodym identity.
We first extend (\(\star\star\)) from \(L^2(\rho)\) to all non-negative measurable \(g\). Both sides of (\(\star\star\))
are integrals of non-negative functions against non-negative measures (since
\(0 \leq h \leq 1\) \(\rho\)-a.e., so \(g(1-h) \geq 0\) and \(gh \geq 0\) for \(g \geq 0\)). For
indicator functions \(g = \mathbf{1}_A\), (\(\star\star\)) holds since \(\mathbf{1}_A \in L^2(\rho)\);
by linearity it holds for non-negative simple \(g\); and by the standard simple-function approximation
\(g_n \uparrow g\) and MCT applied to each side, it extends to all non-negative measurable \(g\)
(whether or not \(g \in L^2(\rho)\)). Call this extended identity (\(\star\star'\)).
Now apply (\(\star\star'\)) with \(g = \mathbf{1}_A / (1 - h)\) — formally, via the truncations
\(g_N := \mathbf{1}_A \cdot \min(N, 1/(1-h))\), each bounded and measurable (hence in \(L^2(\rho)\)
since \(\rho\) is finite, and certainly admissible for the extended (\(\star\star'\))). The function
\(1/(1-h)\) is undefined on the set \(E_1 = \{h = 1\}\), but \(E_1\) is both \(\mu\)-null and
\(\nu\)-null (Step 5), hence \(\rho\)-null, so the value assigned on \(E_1\) (say 0) is irrelevant
for both integrals. Substituting \(g_N\) into (\(\star\star'\)):
\[
\int_A \min(N(1-h), 1) \, d\nu \;=\; \int_A \min(N, 1/(1-h)) \cdot h \, d\mu.
\]
On \(\{h < 1\}\), \(\min(N(1-h), 1) \uparrow 1\) and \(\min(N, 1/(1-h)) \cdot h \uparrow h/(1-h) = f\)
as \(N \to \infty\). Applying MCT to each side,
\[
\nu(A) \;=\; \int_A 1 \, d\nu \;=\; \int_A f \, d\mu \quad \text{for every } A \in \mathcal{F},
\]
completing the proof for finite \(\mu, \nu\). The reduction at the start of the proof extends the
result to the \(\sigma\)-finite case.
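On a finite set every step of the construction can be carried out by hand, which makes a convenient sanity check. The sketch below is an illustration only, under the assumption that \(\mu\) and \(\nu\) are given as lists of point masses with \(\nu \ll \mu\); the function name `von_neumann_density` and the example masses are hypothetical. In this setting the Riesz representer is simply \(h = \nu/(\mu + \nu)\) pointwise, and \(f = h/(1-h)\) reduces to the ratio \(\nu/\mu\).

```python
from fractions import Fraction


def von_neumann_density(mu, nu):
    """Return (h, f) for finite measures on a finite set.

    mu, nu: lists of Fraction point masses, assumed to satisfy nu << mu.
    h plays the role of d(nu)/d(rho) with rho = mu + nu, and f = h / (1 - h).
    """
    h, f = [], []
    for m, n in zip(mu, nu):
        if m == 0 and n != 0:
            raise ValueError("nu is not absolutely continuous w.r.t. mu")
        rho = m + n
        # On rho-null points the value of h is arbitrary; we pick 0.
        h_i = n / rho if rho != 0 else Fraction(0)
        h.append(h_i)
        # h_i < 1 here because rho > 0 forces m > 0 under nu << mu.
        f.append(h_i / (1 - h_i))
    return h, f


# Hypothetical example masses on a four-point space.
mu = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6), Fraction(0)]
nu = [Fraction(1, 4), Fraction(0), Fraction(3, 4), Fraction(0)]

h, f = von_neumann_density(mu, nu)
# nu({i}) is recovered as f(i) * mu({i}) at every point.
recovered = [f_i * m_i for f_i, m_i in zip(f, mu)]
```

Exact rational arithmetic is used so that the identity \(\nu(A) = \int_A f \, d\mu\) can be checked with equality rather than a floating-point tolerance.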
What Hilbert Space Bought Us
The proof never constructs \(f\) directly. The function \(f = h / (1 - h)\) appears only in the
last step; until then, all the work happens with the auxiliary function \(h \in L^2(\rho)\)
produced by Riesz representation. This is the strategic content of von Neumann's argument: a
problem about existence of a density is converted into a problem about existence
of a representing vector for a continuous linear functional on a Hilbert space — a problem
already solved in
Dual Spaces & Riesz Representation.
The functional-analytic apparatus developed in Section II (completeness of \(L^2\) by
Riesz-Fischer, orthogonal projection, and the duality \((L^2)^* \cong L^2\)) is cashed in here,
in a single proof, for a result about densities of measures.
Uniqueness
Uniqueness of the Radon-Nikodym derivative is a short, standalone argument that uses only the
integrability of \(f\) and the zero-integral lemma from Section II.
Suppose \(f_1, f_2\) are two non-negative measurable functions, both satisfying
\(\nu(A) = \int_A f_i \, d\mu\) for every \(A \in \mathcal{F}\). We first prove uniqueness when
\(\nu\) is finite; the \(\sigma\)-finite case follows by applying the finite argument to each
\(\Omega_n\) of a \(\sigma\)-finite exhaustion. Assume henceforth \(\nu(\Omega) < \infty\). Then
\(\int_\Omega f_i \, d\mu = \nu(\Omega) < \infty\), so each \(f_i\) is \(\mu\)-integrable and
therefore finite \(\mu\)-a.e. Setting \(g = f_1 - f_2\) (well-defined \(\mu\)-a.e. and assigned the
value 0 on the \(\mu\)-null exceptional set where either \(f_i\) is infinite; the choice is
irrelevant), we have
\[
\int_A g \, d\mu \;=\; \nu(A) - \nu(A) \;=\; 0 \quad \text{for every } A \in \mathcal{F}.
\]
Apply this to \(A = \{g > 0\}\) and to \(A = \{g < 0\}\) separately. On the first set,
\(g^+ = g\) and \(g^- = 0\), so \(\int g^+ \, d\mu = \int_{\{g > 0\}} g \, d\mu = 0\). Since
\(g^+\) is non-negative, the
Zero Integral Lemma
forces \(g^+ = 0\) \(\mu\)-a.e., that is, \(\mu(\{g > 0\}) = 0\). The argument on \(\{g < 0\}\)
is symmetric and gives \(\mu(\{g < 0\}) = 0\). Hence \(g = 0\) \(\mu\)-a.e., that is, \(f_1 = f_2\)
\(\mu\)-a.e.
The Radon-Nikodym derivative is therefore well-defined as an element of the equivalence class of
measurable functions modulo \(\mu\)-null modifications. In particular, equations of the form
"\(d\nu/d\mu = f\)" are always understood up to \(\mu\)-a.e. equality.
Chain Rule and Change of Variables
The Radon-Nikodym derivative is genuinely a derivative: it satisfies a chain rule under composition
of absolute-continuity relations, and it intertwines with integration in exactly the way the
Leibniz notation \(d\nu/d\mu\) suggests. These two properties — the chain rule and the change-of-variables
formula — are the computational engines that make the derivative usable in practice. Every Bayesian
update, every importance-sampling estimator, every KL divergence computation rests on the second
of these.
Proposition: Chain Rule and Change of Variables
Let \(\mu, \nu, \lambda\) be \(\sigma\)-finite non-negative measures on \((\Omega, \mathcal{F})\).
- (i) Chain rule. If \(\nu \ll \mu \ll \lambda\), then \(\nu \ll \lambda\) and
\[
\frac{d\nu}{d\lambda} \;=\; \frac{d\nu}{d\mu} \cdot \frac{d\mu}{d\lambda} \quad \lambda\text{-a.e.}
\]
- (ii) Change of variables. If \(\nu \ll \mu\), then for every measurable
\(g : \Omega \to [0, \infty]\),
\[
\int_\Omega g \, d\nu \;=\; \int_\Omega g \cdot \frac{d\nu}{d\mu} \, d\mu.
\]
The same identity holds for measurable \(g : \Omega \to \mathbb{R}\) provided either side
is finite when \(g\) is replaced by \(|g|\) (i.e., \(g \in L^1(\nu)\) iff
\(g \cdot d\nu/d\mu \in L^1(\mu)\), and the integrals agree).
Proof:
We prove (ii) first; (i) then follows by applying (ii) twice.
Proof of (ii).
Set \(f = d\nu/d\mu\), so that \(\nu(A) = \int_A f \, d\mu\) for every \(A \in \mathcal{F}\).
We extend this from indicator-set integrals to general non-negative \(g\) by the standard
three-step procedure.
Step 1: Indicator functions.
For \(g = \mathbf{1}_A\), the identity
\[
\int_\Omega \mathbf{1}_A \, d\nu \;=\; \nu(A) \;=\; \int_A f \, d\mu \;=\; \int_\Omega \mathbf{1}_A \cdot f \, d\mu
\]
is the defining property of \(f = d\nu/d\mu\).
Step 2: Non-negative simple functions.
Let \(g = \sum_{i=1}^n c_i \mathbf{1}_{A_i}\)
with \(c_i \geq 0\) and \(A_i \in \mathcal{F}\) pairwise disjoint. By linearity of the integral
on both sides and Step 1,
\[
\int_\Omega g \, d\nu \;=\; \sum_{i=1}^n c_i \nu(A_i) \;=\; \sum_{i=1}^n c_i \int_\Omega \mathbf{1}_{A_i} \cdot f \, d\mu \;=\; \int_\Omega g \cdot f \, d\mu.
\]
Step 3: Non-negative measurable functions.
Let \(g : \Omega \to [0, \infty]\) be measurable. By the standard simple-function approximation,
there exists an increasing sequence
\(g_n \uparrow g\) of non-negative simple functions. Since \(f \geq 0\), the sequence
\(g_n f \uparrow g f\) is also increasing pointwise. Apply
MCT
on both sides:
\[
\int_\Omega g \, d\nu \;=\; \lim_{n \to \infty} \int_\Omega g_n \, d\nu
\;=\; \lim_{n \to \infty} \int_\Omega g_n \, f \, d\mu
\;=\; \int_\Omega g \, f \, d\mu,
\]
where the middle equality is Step 2 applied to each \(g_n\). This proves (ii) for non-negative \(g\).
Step 4: Real-valued case.
For measurable \(g : \Omega \to \mathbb{R}\), apply Step 3 separately to \(g^+\) and \(g^-\) and subtract.
Both \(\int g^+ \, d\nu = \int g^+ f \, d\mu\)
and \(\int g^- \, d\nu = \int g^- f \, d\mu\) hold, and the integrability hypothesis
\(g \in L^1(\nu)\) (i.e., \(\int |g| d\nu < \infty\)) makes both finite by the non-negative
identity applied to \(|g|\). The subtraction then yields
\[
\int_\Omega g \, d\nu \;=\; \int_\Omega g \, f \, d\mu,
\]
with both sides finite. The integrability equivalence
\(g \in L^1(\nu) \iff g f \in L^1(\mu)\) follows by applying the non-negative identity to \(|g|\).
Proof of (i).
Set \(f = d\nu/d\mu\) and \(k = d\mu/d\lambda\). For any \(A \in \mathcal{F}\), apply (ii) with \(\nu \ll \mu\)
and the test function \(g = \mathbf{1}_A\), then apply (ii) again with \(\mu \ll \lambda\)
and the test function \(g' = \mathbf{1}_A f\) (non-negative and measurable):
\[
\nu(A) \;\stackrel{\text{(ii)}_{\nu \ll \mu}}{=}\; \int_\Omega \mathbf{1}_A f \, d\mu
\;\stackrel{\text{(ii)}_{\mu \ll \lambda}}{=}\; \int_\Omega \mathbf{1}_A f \, k \, d\lambda
\;=\; \int_A f k \, d\lambda.
\]
Hence \(\nu(A) = \int_A (f k) \, d\lambda\) for every \(A \in \mathcal{F}\). The product \(fk\) is
measurable and non-negative. The value of \(f\) on \(\mu\)-null sets is undetermined, and
such sets need not be \(\lambda\)-null, but the product \(fk\) is unaffected by this
ambiguity up to \(\lambda\)-null sets. Indeed, if \(f, f'\) are two representatives of \(d\nu/d\mu\),
then \(N = \{f \neq f'\}\) is \(\mu\)-null, so (ii) with \(\mu \ll \lambda\) and \(g = \mathbf{1}_N\)
gives \(\int_N k \, d\lambda = \mu(N) = 0\). Since \(k \geq 0\), the Zero Integral Lemma yields
\(k = 0\) \(\lambda\)-a.e. on \(N\), and therefore \(fk = f'k\) \(\lambda\)-a.e. (with the convention
\(0 \cdot \infty = 0\)). Thus \(fk\) is a density of \(\nu\) with respect to \(\lambda\) regardless of
the choice of representatives. By
uniqueness of the Radon-Nikodym derivative with respect to \(\lambda\),
\(d\nu/d\lambda = fk = (d\nu/d\mu)(d\mu/d\lambda)\) holds \(\lambda\)-a.e.
The hypothesis \(\nu \ll \lambda\) needed to apply uniqueness here is itself a consequence of
\(\nu \ll \mu \ll \lambda\): if \(\lambda(A) = 0\), then \(\mu(A) = 0\), and then \(\nu(A) = 0\).
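Both parts of the proposition can be checked exactly on a finite space, where a density is just a pointwise ratio of masses. The following is a minimal sketch under that assumption; the helper names `density` and `integrate` and the example measures are hypothetical, chosen only so that \(\nu \ll \mu \ll \lambda\) holds.

```python
from fractions import Fraction as F


def density(nu, mu):
    """Pointwise ratio d(nu)/d(mu) for point-mass lists, assuming nu << mu.

    The value on mu-null points is arbitrary; we pick 0.
    """
    out = []
    for n, m in zip(nu, mu):
        if m == 0:
            if n != 0:
                raise ValueError("nu is not absolutely continuous w.r.t. mu")
            out.append(F(0))
        else:
            out.append(n / m)
    return out


def integrate(g, mu):
    """Integral of g against the measure with point masses mu."""
    return sum(g_i * m_i for g_i, m_i in zip(g, mu))


# Hypothetical measures on a four-point space with nu << mu << lam.
lam = [F(1, 4), F(1, 4), F(1, 4), F(1, 4)]
mu = [F(1, 2), F(1, 3), F(1, 6), F(0)]
nu = [F(1, 4), F(3, 4), F(0), F(0)]

f = density(nu, mu)    # d(nu)/d(mu)
k = density(mu, lam)   # d(mu)/d(lam)
chain = [f_i * k_i for f_i, k_i in zip(f, k)]

# Change of variables (ii) for a test function g.
g = [F(3), F(1), F(2), F(5)]
lhs = integrate(g, nu)
rhs = integrate([g_i * f_i for g_i, f_i in zip(g, f)], mu)
```

Here `chain` coincides with `density(nu, lam)` at every point, and `lhs` equals `rhs`. The fourth point is \(\mu\)-null but not \(\lambda\)-null, so the value of \(f\) there is arbitrary, yet the product \(fk\) vanishes there because \(k\) does.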
Two notational conveniences are worth noting. First, the change-of-variables formula (ii) is often
written in the suggestive Leibniz form
\[
\int g \, d\nu \;=\; \int g \, \frac{d\nu}{d\mu} \, d\mu,
\]
in which "\(d\nu = (d\nu/d\mu) \, d\mu\)" reads as a formal cancellation of differentials. This is
the source of calculations of the type
\(\mathbb{E}_P[g(X)] = \mathbb{E}_Q[g(X) \cdot dP/dQ(X)]\) (valid when \(P \ll Q\)) used in importance sampling and Monte Carlo
methods, where an expectation under the target distribution \(P\) is estimated by drawing samples from a different
distribution \(Q\) and reweighting by the Radon-Nikodym derivative \(dP/dQ\). Second, the chain rule (i) gives the
Radon-Nikodym derivative the algebraic structure of a classical derivative under composition:
transitivity of \(\ll\) at the level of measures lifts to multiplicativity of \(d/d\) at the level
of densities.
A subtlety distinguishes this change-of-variables formula from the pushforward change-of-variables
proved in
Pushforward Measure.
The pushforward identity \(\int g \, d(X_*\mathbb{P}) = \int (g \circ X) \, d\mathbb{P}\) transports
integrals across a measurable map between two different measurable spaces; the function \(g\) and
its lift \(g \circ X\) live on different spaces, and the derivative \(dX_*\mathbb{P}/d\mathbb{P}\)
is not even defined (the measures live on different \(\sigma\)-algebras). The Radon-Nikodym
change-of-variables, by contrast, transports integrals between two measures on the same
space, paid for by multiplication by the density. Both are change-of-variables formulas, but they
operate in orthogonal directions: pushforward changes the carrier space, Radon-Nikodym changes the
weighting.
Connection to Machine Learning
Many probabilistic quantities in modern machine learning are, at the foundational
level, Radon-Nikodym derivatives, and many algorithms involving densities can be
read as instances of the change-of-variables formula (ii). The qualification "at
the foundational level" matters: the rigorous Radon-Nikodym object licenses the
construction, but the implementation typically works with explicit densities under
a chosen reference measure, with neural-network surrogates, or with sample-average
approximations rather than computing the abstract derivative directly. We collect
four examples that span supervised, generative, and reinforcement learning, noting
for each how the implementation relates to the underlying Radon-Nikodym structure.
Bayesian posterior density.
Given a prior \(\Pi\) on the parameter space \(\Theta\) and a likelihood \(p(x | \theta)\),
viewed as a density of the conditional data distribution with respect to a reference measure
on the data space, the Bayesian posterior \(\Pi_{\theta | x}\) is defined as a measure on
\(\Theta\), not a priori as a density.
When \(\Pi \ll \lambda_\Theta\) (with \(\lambda_\Theta\) a reference measure on \(\Theta\),
often Lebesgue measure), Bayes' rule produces the posterior density
\(d\Pi_{\theta|x}/d\lambda_\Theta\) by the formula
\[
\frac{d\Pi_{\theta|x}}{d\lambda_\Theta}(\theta)
\;=\; \frac{p(x | \theta) \cdot (d\Pi/d\lambda_\Theta)(\theta)}{\int_\Theta p(x|\theta') \, (d\Pi/d\lambda_\Theta)(\theta') \, d\lambda_\Theta(\theta')}.
\]
If the prior is supported on a discrete set, \(\Pi \perp \lambda_\Theta\) and no Lebesgue density
exists; the posterior is then a discrete measure whose density relative to counting measure on
the support plays the analogous role. The framework is unified by the choice of dominating measure.
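For the discrete case just described, the formula can be evaluated exactly, with counting measure on the support playing the role of \(\lambda_\Theta\). The sketch below assumes a hypothetical three-point prior over coin biases and a single observed head; the function name `posterior` and the numbers are illustrative only.

```python
from fractions import Fraction as F


def posterior(prior, likelihood):
    """Posterior point masses from prior masses and likelihood values.

    Implements (likelihood * prior density) / evidence, where the densities
    are taken with respect to counting measure on the support.
    """
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalized)  # the normalizing integral in the denominator
    if evidence == 0:
        raise ValueError("observed data has zero probability under the prior")
    return [u / evidence for u in unnormalized]


# Hypothetical setup: three candidate coin biases with a uniform prior.
thetas = [F(1, 4), F(1, 2), F(3, 4)]
prior = [F(1, 3), F(1, 3), F(1, 3)]

# Observing one head: p(x = heads | theta) = theta.
likelihood = thetas
post = posterior(prior, likelihood)
```

With these inputs the posterior masses are \(1/6, 1/3, 1/2\). Replacing the sum by a numerical integral against Lebesgue measure would give the continuous analogue, so the choice of dominating measure is the only thing that changes.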
Score function in diffusion models.
Score-based generative models — including denoising diffusion probabilistic models and score-based stochastic differential equations —
train a neural network \(s_\phi\) to estimate the gradient of the log-density, called the
Stein score (in the diffusion-modeling literature, often simply "score"):
\[
s_\phi(x, t) \;\approx\; \nabla_x \log p_t(x) \;=\; \nabla_x \log \frac{dP_t}{d\lambda}(x),
\]
where \(P_t\) is the distribution of the noised data at time \(t\) and \(\lambda\) is Lebesgue
measure on \(\mathbb{R}^d\). The score is well-defined as a measurable function on \(\mathbb{R}^d\)
when \(P_t \ll \lambda\) for \(t > 0\) (and, in the diffusion-model setting, smooth thanks to the
Gaussian noise injection along the forward diffusion). This noise injection ensures the absolute
continuity even when the original data distribution \(P_0\) lies on a low-dimensional manifold and
is singular to \(\lambda\). The reverse-time generative SDE is then driven by \(s_\phi\), which
stands in for the gradient of the log-density \(\nabla_x \log (dP_t/d\lambda)\), not for the
Radon-Nikodym derivative \(dP_t/d\lambda\) itself; the implementation never computes the density
directly but learns the gradient of its logarithm from data via score matching.
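A one-dimensional toy case makes the role of the noise concrete. Assume, purely for illustration, that \(P_0\) is a point mass at \(x_0\) (singular to \(\lambda\)) and that the forward process adds Gaussian noise of standard deviation \(\sigma_t\), so that \(P_t = \mathcal{N}(x_0, \sigma_t^2)\) has a Lebesgue density and the score is \(-(x - x_0)/\sigma_t^2\). The sketch below compares this closed form with a finite-difference derivative of the log-density; all function names are hypothetical.

```python
import math


def log_density(x, mean, std):
    """Log of the N(mean, std^2) density with respect to Lebesgue measure."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))


def score(x, mean, std):
    """Closed-form score: derivative in x of the Gaussian log-density."""
    return -(x - mean) / std ** 2


def finite_difference_score(x, mean, std, eps=1e-5):
    """Central-difference approximation to the derivative of log_density."""
    upper = log_density(x + eps, mean, std)
    lower = log_density(x - eps, mean, std)
    return (upper - lower) / (2.0 * eps)


# Assumed toy setting: data concentrated at x0, noise level sigma_t.
x0, sigma_t = 1.5, 0.3
points = [-1.0, 0.0, 1.5, 2.0]
exact = [score(x, x0, sigma_t) for x in points]
approx = [finite_difference_score(x, x0, sigma_t) for x in points]
```

The score points back toward \(x_0\) and grows like \(1/\sigma_t^2\) as the noise shrinks, which reflects the loss of absolute continuity in the limit \(t \to 0\).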
Importance sampling.
The most direct application of the change-of-variables formula is importance sampling: to estimate
\(\mathbb{E}_P[g(X)]\) using samples drawn from a different distribution \(Q\), one rewrites
\[
\mathbb{E}_P[g(X)] \;=\; \int g \, dP \;=\; \int g \cdot \frac{dP}{dQ} \, dQ \;=\; \mathbb{E}_Q\!\left[g(X) \cdot \frac{dP}{dQ}(X)\right]
\]
and replaces the expectation under \(P\) by the empirical mean under \(Q\) of the reweighted
integrand \(g(X) \cdot dP/dQ(X)\). This is precisely the
change-of-variables proposition with \(\nu = P\)
and \(\mu = Q\). The estimator is unbiased
whenever \(P \ll Q\) (and \(g \in L^1(P)\), so that both expectations are finite). When
\(P \not\ll Q\), some \(Q\)-null set carries positive \(P\)-measure: samples from \(Q\) almost
surely never land there, no density ratio is defined on it, and the estimator misses the
contribution of that region. This is the practical
manifestation of an absolute-continuity violation.
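As a minimal sketch of the estimator, take the hypothetical choices \(P = \mathcal{N}(0, 1)\), \(Q = \mathcal{N}(0, 2^2)\), and \(g(x) = x^2\), so that \(P \ll Q\) and the true value is \(\mathbb{E}_P[X^2] = 1\). The weight \(dP/dQ\) is then the ratio of the two Lebesgue densities, by the chain rule. The function names below are illustrative, not taken from any library.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) with respect to Lebesgue measure."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_estimate(g, n_samples, seed=0):
    """Estimate E_P[g(X)] for P = N(0, 1) using samples from Q = N(0, 4)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, 2.0)  # draw from Q (standard deviation 2)
        # dP/dQ at x: ratio of the Lebesgue densities of P and Q.
        weight = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)
        total += g(x) * weight
    return total / n_samples


estimate = importance_estimate(lambda x: x * x, 100_000)
```

Choosing \(Q\) with heavier tails than \(P\) keeps the weights bounded. Reversing the roles (sampling from the narrower distribution) would still satisfy absolute continuity here, but the weights would be unbounded and the estimator's variance could become large or infinite.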
KL divergence and RLHF regularization.
The Kullback-Leibler divergence between two probability measures \(P\) and \(Q\) is defined by
\[
D_{\mathrm{KL}}(P \| Q) \;=\;
\begin{cases}
\displaystyle \int_\Omega \log \frac{dP}{dQ} \, dP & \text{if } P \ll Q, \\
+\infty & \text{otherwise}.
\end{cases}
\]
The divergence integrates the logarithm of the Radon-Nikodym derivative \(dP/dQ\); without absolute
continuity, \(dP/dQ\) does not exist as a Radon-Nikodym derivative on the portion of the space
where \(Q\) places no mass but \(P\) does, and \(D_{\mathrm{KL}}(P \| Q)\) is set to \(+\infty\)
by convention. In reinforcement learning from human feedback (RLHF), the regularizer
\(D_{\mathrm{KL}}(\pi_\theta \| \pi_{\mathrm{ref}})\), which penalizes the trained policy
\(\pi_\theta\) for drifting from a reference policy \(\pi_{\mathrm{ref}}\), can be finite
only when \(\pi_\theta \ll \pi_{\mathrm{ref}}\) (absolute continuity is necessary but, in general,
not sufficient, since the integral may still diverge). In practice the divergence is estimated
by the sample average of \(\log(\pi_\theta(y \mid x) / \pi_{\mathrm{ref}}(y \mid x))\) over
sampled completions, and the absolute-continuity hypothesis is a structural property of
the model parameterization (e.g., a softmax over a shared vocabulary makes both policies
positive on the same token set) rather than an explicit runtime check.
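On a finite outcome space the definition reduces to a sum, and the two branches of the case distinction become an explicit support check. The sketch below assumes distributions given as lists of probabilities over the same outcomes; the function name `kl_divergence` is hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions on a common finite outcome set.

    Returns +inf when P charges an outcome that Q does not, that is,
    when P is not absolutely continuous with respect to Q.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # 0 * log(0 / q_i) is taken to be 0 by convention
        if q_i == 0.0:
            return math.inf  # a Q-null outcome with positive P-mass
        total += p_i * math.log(p_i / q_i)  # log(dP/dQ) integrated against P
    return total


same = kl_divergence([0.5, 0.5], [0.5, 0.5])
one_sided = kl_divergence([1.0, 0.0], [0.5, 0.5])
violated = kl_divergence([0.5, 0.5], [1.0, 0.0])
```

Here `same` is zero, `one_sided` equals \(\log 2\), and `violated` is infinite. The asymmetry between the last two calls shows that only \(P \ll Q\) matters: \(P\) may vanish where \(Q\) is positive, but not the reverse.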