Why Conditional Expectation?
Elementary probability offers two distinct constructions that both go by the name conditional expectation.
Given an event \(A\) with \(\mathbb{P}(A) > 0\), one writes
\[
\mathbb{E}[X \mid A] \;=\; \frac{1}{\mathbb{P}(A)} \int_A X \, d\mathbb{P},
\]
a single number — the average of \(X\) restricted to the event \(A\). Given a continuous
random variable \(Y\) with joint density \(p(x, y)\), one writes
\[
\mathbb{E}[X \mid Y = y] \;=\; \int x \, p(x \mid y) \, dx,
\]
a function of \(y\). These two formulas address visibly different situations: the first averages over a positive-probability event,
the second averages along a measure-zero fibre using a conditional density. Neither formula reduces to the other, and the second
one is not even well-defined when \(p(x \mid y)\) fails to exist as a function
(e.g., when \(Y\) is mixed discrete-continuous, or supported on a fractal).
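Both formulas can be exercised side by side on a model where each makes sense. The following is a minimal Monte Carlo sketch (all names are illustrative), assuming a standard bivariate normal pair with correlation \(\rho\), for which the conditional density exists and \(\mathbb{E}[X \mid Y = y] = \rho y\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
rho = 0.8
# Sample (X, Y) from a standard bivariate normal with correlation rho.
y = rng.standard_normal(n)
x = rho * y + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# First formula: E[X | A] for a positive-probability event A = {Y > 1}.
A = y > 1.0
e_x_given_A = x[A].mean()      # (1 / P(A)) * integral_A X dP

# Second formula: E[X | Y = y0] along the measure-zero fibre {Y = y0},
# approximated by averaging over a thin slab around y0; for this model
# the conditional density exists and E[X | Y = y0] = rho * y0.
y0 = 1.0
slab = np.abs(y - y0) < 0.05
e_x_given_y0 = x[slab].mean()

print(e_x_given_A, e_x_given_y0)
```

The slab average is exactly the limiting procedure that the measure-zero fibre makes delicate; the abstract definition below removes the need for it.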
A unifying definition exists, and its existence is the headline payoff of the
Radon-Nikodym theorem.
Given an integrable random variable \(X\) and a sub-\(\sigma\)-algebra
\(\mathcal{G} \subseteq \mathcal{F}\) — which we read as "the information available to an
observer" — there is a single \(\mathcal{G}\)-measurable random variable, denoted
\(\mathbb{E}[X \mid \mathcal{G}]\), that simultaneously specialises to both classical
formulas and continues to make sense in every intermediate situation. The unifying object
is not a number; it is a function on \(\Omega\), and its values reflect the best forecast
of \(X\) that an observer with information \(\mathcal{G}\) can make.
This page builds that object. The construction is short — it occupies one paragraph
of the next section — because all the analytic machinery has
already been set up. In
Limit Theorems & Product Measures
we previewed conditional expectation as a Radon-Nikodym derivative, and in
Signed Measures & the Radon-Nikodym Theorem
we proved the existence theorem on which the construction rests. This page collects the
payoff: a definition, a verification that the discrete classical case is recovered
verbatim, an explanation of why the continuous classical case requires one further
layer of machinery (developed on the next page), and a complete algebraic toolbox of
properties — linearity, monotonicity, the take-out rule, the tower property, Jensen's
inequality for conditional expectation, the \(L^p\) contraction, and the
\(L^2\) projection characterisation.
The conditional expectation also underlies, in idealised form, several core constructions of modern machine learning — though,
as the closing section of this chapter discusses in detail, the rigorous object built here is rarely instantiated
directly in code. The E-step of the expectation-maximisation algorithm is, in its exact form, a conditional expectation;
in practice it is computed by a discrete sum when the latent variable is finite, and by Monte Carlo sampling otherwise. The
Bellman equation of reinforcement learning is the tower property in disguise; in practice it is propagated by single-trajectory
stochastic approximation rather than by exact integration. The Bayesian posterior predictive distribution
\[
p(x_{\text{new}} \mid D) = \mathbb{E}[p(x_{\text{new}} \mid \theta) \mid D]
\]
is, in its exact form, a conditional expectation under the posterior measure; in practice it is approximated by Markov chain Monte Carlo
or by variational surrogates. The evidence lower bound (ELBO) of variational inference
decomposes, exactly, into terms that are conditional expectations against a variational measure; the decomposition is the algorithmic foundation
of the entire variational programme. The present chapter installs the rigorous definition against which all of these
constructions — and the approximations actually used to compute them — can be calibrated. That the rigorous object is not currently instantiated in code
is a fact about today's implementations, not a verdict on the necessity of the underlying mathematics: the existence of \(\mathbb{E}[X \mid \mathcal{G}]\)
as a well-defined object is what makes any of these approximations approximations of something, and the history of mathematical foundations
entering practice with a lag — Hilbert spaces preceding quantum mechanics, measure theory preceding modern probability — suggests that future, more rigorously
grounded ML systems may invoke the construction of this page far more directly than current ones do.
Before turning to the construction, we fix the notation used throughout. The underlying probability space is \((\Omega, \mathcal{F}, \mathbb{P})\),
and \(\mathcal{G}\) always denotes a sub-\(\sigma\)-algebra of \(\mathcal{F}\) — a smaller collection of measurable sets representing partial information.
The restriction of \(\mathbb{P}\) to \(\mathcal{G}\) is written \(\mathbb{P}|_{\mathcal{G}}\); it is the same probability measure
regarded as defined only on the smaller \(\sigma\)-algebra. The object we are going to construct is denoted \(\mathbb{E}[X \mid \mathcal{G}]\),
and when \(\mathcal{G} = \sigma(Y)\) we abbreviate it as \(\mathbb{E}[X \mid Y]\). Throughout, "\(\mathbb{P}\)-a.s." means almost surely
with respect to \(\mathbb{P}\); we use this qualifier in place of "\(\mathbb{P}\)-a.e." in the probabilistic context, consistent with
earlier chapters of this section.
Definition via Radon-Nikodym
The construction proceeds in three steps. We first associate to each integrable random variable \(X\) and each
sub-\(\sigma\)-algebra \(\mathcal{G}\) a finite signed measure \(\nu_X\) on \(\mathcal{G}\). We then verify that
\(\nu_X\) is absolutely continuous with respect to \(\mathbb{P}|_{\mathcal{G}}\), so that the Radon-Nikodym theorem applies.
The resulting Radon-Nikodym derivative is, by definition, the conditional expectation. The discrete case is recovered immediately,
and the continuous case is identified as requiring one further layer of machinery — a topic in its own right within
measure-theoretic probability, namely the framework of regular conditional distributions and the disintegration theorem.
The Signed Measure Associated with \(X\)
Definition: The Signed Measure \(\nu_X\)
Let \(X \in L^1(\Omega, \mathcal{F}, \mathbb{P})\) and let \(\mathcal{G} \subseteq \mathcal{F}\)
be a sub-\(\sigma\)-algebra. Define
\[
\nu_X : \mathcal{G} \to \mathbb{R}, \qquad
\nu_X(A) \;=\; \int_A X \, d\mathbb{P}, \quad A \in \mathcal{G}.
\]
Three properties of \(\nu_X\) must be checked before the Radon-Nikodym theorem can be invoked:
that \(\nu_X\) is a signed measure on \((\Omega, \mathcal{G})\), that it is finite, and
that it is absolutely continuous with respect to \(\mathbb{P}|_{\mathcal{G}}\).
Verification (signed measure).
We check the conditions in the definition of a
signed measure.
Clearly \(\nu_X(\emptyset) = \int_\emptyset X \, d\mathbb{P} = 0\). For countable additivity,
let \((A_n)_{n \geq 1}\) be a sequence of pairwise disjoint sets in
\(\mathcal{G}\) and write \(A = \bigsqcup_n A_n\). The sequence
\(S_N = \sum_{n=1}^N X \mathbf{1}_{A_n}\) converges \(\mathbb{P}\)-a.s. to
\(X \mathbf{1}_A\), and is dominated in absolute value by \(|X| \in L^1(\mathbb{P})\).
The
dominated convergence theorem
gives
\[
\nu_X(A) \;=\; \int_A X \, d\mathbb{P} \;=\; \int_\Omega X \mathbf{1}_A \, d\mathbb{P}
\;=\; \lim_{N \to \infty} \sum_{n=1}^N \int_{A_n} X \, d\mathbb{P}
\;=\; \sum_{n=1}^\infty \nu_X(A_n),
\]
and the series converges absolutely because
\[
\sum_n |\nu_X(A_n)| \leq \sum_n \int_{A_n} |X| \, d\mathbb{P}
= \int_A |X| \, d\mathbb{P} \leq \mathbb{E}[|X|] < \infty.
\]
Since \(X \in L^1\), \(\nu_X\) takes values in \(\mathbb{R}\)
(never \(\pm \infty\)), so the sign-restriction condition is vacuously satisfied.
Verification (finiteness).
The
Jordan decomposition
gives \(\nu_X = \nu_X^+ - \nu_X^-\) where both parts are non-negative measures on
\(\mathcal{G}\). Let \(\Omega = P \sqcup N\) be a Hahn decomposition for \(\nu_X\), with
\(P, N \in \mathcal{G}\), so that \(\nu_X^+(A) = \nu_X(A \cap P)\) and
\(\nu_X^-(A) = -\nu_X(A \cap N)\). The
total variation
\(|\nu_X| = \nu_X^+ + \nu_X^-\) then satisfies
\[
|\nu_X|(\Omega) \;=\; \int_P X \, d\mathbb{P} \;-\; \int_N X \, d\mathbb{P}
\;\leq\; \int_P |X| \, d\mathbb{P} \;+\; \int_N |X| \, d\mathbb{P}
\;=\; \mathbb{E}[|X|] \;<\; \infty.
\]
(The inequality can be strict: \(P\) and \(N\) are \(\mathcal{G}\)-sets, while \(X\) is
only \(\mathcal{F}\)-measurable, so \(X\) need not keep a single sign on either set;
consider \(\mathcal{G} = \{\emptyset, \Omega\}\) with \(\mathbb{E}[X] = 0\) but
\(X\) not a.s. zero, for which \(\nu_X \equiv 0\). Finiteness, not equality, is what
the Radon-Nikodym theorem requires.)
Hence \(\nu_X\) is a finite signed measure.
Verification (absolute continuity).
Let \(A \in \mathcal{G}\) with \(\mathbb{P}|_{\mathcal{G}}(A) = 0\), i.e.,
\(\mathbb{P}(A) = 0\) (the restricted measure agrees with \(\mathbb{P}\) on
\(\mathcal{G}\)-sets by definition). Then \(X \mathbf{1}_A = 0\) \(\mathbb{P}\)-a.s.,
whence \(\nu_X(A) = \int_A X \, d\mathbb{P} = 0\). By the definition of
absolute continuity,
\(\nu_X \ll \mathbb{P}|_{\mathcal{G}}\).
All three hypotheses of the Radon-Nikodym theorem are now in place: \(\mathbb{P}|_{\mathcal{G}}\)
is a finite (hence \(\sigma\)-finite) non-negative measure on \((\Omega, \mathcal{G})\), and
\(\nu_X\) is a finite signed measure absolutely continuous with respect to it. The
Radon-Nikodym theorem (proved for non-negative \(\nu\) in the previous chapter, and extended
to finite signed \(\nu\) by applying the theorem separately to the Jordan parts \(\nu_X^+\)
and \(\nu_X^-\)) produces a \(\mathcal{G}\)-measurable function, unique up to
\(\mathbb{P}\)-a.s. equality, that represents \(\nu_X\) as an integral against
\(\mathbb{P}|_{\mathcal{G}}\). This function is, by definition, the conditional expectation.
The Conditional Expectation
Theorem & Definition: Conditional Expectation
Let \(X \in L^1(\Omega, \mathcal{F}, \mathbb{P})\) and \(\mathcal{G} \subseteq \mathcal{F}\)
a sub-\(\sigma\)-algebra. There exists a \(\mathcal{G}\)-measurable function
\(Y \in L^1(\Omega, \mathcal{G}, \mathbb{P}|_{\mathcal{G}})\), unique up to
\(\mathbb{P}\)-a.s. equality, satisfying the averaging identity
\[
\int_A Y \, d\mathbb{P} \;=\; \int_A X \, d\mathbb{P} \quad
\text{for every } A \in \mathcal{G}. \tag{$\ast$}
\]
Any such \(Y\) is called a version of the conditional expectation of
\(X\) given \(\mathcal{G}\), and we write
\[
\mathbb{E}[X \mid \mathcal{G}] \;=\; Y, \qquad \text{equivalently,} \qquad
\mathbb{E}[X \mid \mathcal{G}] \;=\; \frac{d\nu_X}{d \mathbb{P}|_{\mathcal{G}}}.
\]
Proof.
By the verifications above, \(\nu_X\) is a finite signed measure on \((\Omega, \mathcal{G})\)
with \(\nu_X \ll \mathbb{P}|_{\mathcal{G}}\). Apply the
Radon-Nikodym theorem
separately to the Jordan parts \(\nu_X^+, \nu_X^-\) of \(\nu_X\); the resulting non-negative
densities \(f^+, f^-\) belong to \(L^1(\mathbb{P}|_{\mathcal{G}})\) because
\(\int f^\pm \, d\mathbb{P}|_{\mathcal{G}} = \nu_X^\pm(\Omega) < \infty\). Set
\(Y = f^+ - f^-\). Then \(Y\) is \(\mathcal{G}\)-measurable, integrable, and
\[
\int_A Y \, d\mathbb{P} \;=\; \int_A (f^+ - f^-) \, d\mathbb{P}|_{\mathcal{G}}
\;=\; \nu_X^+(A) - \nu_X^-(A) \;=\; \nu_X(A) \;=\; \int_A X \, d\mathbb{P}
\]
for every \(A \in \mathcal{G}\), establishing (\(\ast\)). For uniqueness, suppose
\(Y'\) is another version. Then \(\int_A (Y - Y') \, d\mathbb{P} = 0\) for every
\(A \in \mathcal{G}\); taking \(A = \{Y - Y' \geq \varepsilon\} \in \mathcal{G}\)
forces \(\mathbb{P}(Y - Y' \geq \varepsilon) = 0\) for every \(\varepsilon > 0\), and
the symmetric choice handles \(Y' - Y\). Letting \(\varepsilon \downarrow 0\) along a
countable sequence gives \(Y = Y'\) \(\mathbb{P}\)-a.s.
Two remarks on this definition are essential, and both will be invoked repeatedly.
The averaging identity is the working characterisation. Although the
construction goes through Radon-Nikodym, every subsequent proof on this page uses the
averaging identity (\(\ast\)) directly. The pattern is invariably the same: to verify
that some candidate \(\mathcal{G}\)-measurable function \(Y\) is a version of
\(\mathbb{E}[X \mid \mathcal{G}]\), one shows that \(Y\) is \(\mathcal{G}\)-measurable and
that \(\int_A Y \, d\mathbb{P} = \int_A X \, d\mathbb{P}\) for all \(A \in \mathcal{G}\);
the a.s.-uniqueness clause then identifies \(Y\) as the conditional expectation. Radon-Nikodym
is the existence engine; the averaging identity is the daily tool.
The values of \(\mathbb{E}[X \mid \mathcal{G}](\omega)\) are defined only up to
a \(\mathbb{P}\)-null set. Two versions of \(\mathbb{E}[X \mid \mathcal{G}]\)
can disagree on any set \(N\) with \(\mathbb{P}(N) = 0\), and they will both be valid
representatives. Statements such as
"\(\mathbb{E}[X \mid \mathcal{G}](\omega_0) = c\)" for a particular \(\omega_0 \in \Omega\)
are therefore not meaningful in isolation; they are meaningful only as statements about
a \(\mathbb{P}\)-positive set, or after a specific version has been fixed. This subtlety
is the seed of regular conditional distributions: a "regular version" — a coherent
choice of representative that behaves well as a function of \(\omega\), defining an
honest probability measure on \(\mathcal{F}\) for each \(\omega\) — is the central
object of the regular-conditional-distribution framework in measure-theoretic
probability.
Recovery of the Discrete Case
Let \(\{A_i\}_{i \geq 1}\) be a countable measurable partition of \(\Omega\) and let
\(\mathcal{G} = \sigma(\{A_i\}_{i \geq 1})\) be the sub-\(\sigma\)-algebra it generates.
The elements of \(\mathcal{G}\) are precisely the countable unions of the partition
blocks. A \(\mathcal{G}\)-measurable function is constant on each \(A_i\), so any
candidate version of \(\mathbb{E}[X \mid \mathcal{G}]\) is determined by its constant
value on each block.
Proposition: Discrete Case
With \(\mathcal{G} = \sigma(\{A_i\}_{i \geq 1})\) for a countable measurable partition
\(\{A_i\}\), and for any \(X \in L^1(\mathbb{P})\), one has
\[
\mathbb{E}[X \mid \mathcal{G}](\omega) \;=\;
\frac{1}{\mathbb{P}(A_i)} \int_{A_i} X \, d\mathbb{P}
\;=\; \mathbb{E}[X \mid A_i]
\quad \text{for } \omega \in A_i,
\]
on every block \(A_i\) with \(\mathbb{P}(A_i) > 0\). On blocks with
\(\mathbb{P}(A_i) = 0\), the value is arbitrary (consistent with a.s.-uniqueness).
Proof.
Define \(Y(\omega) = \mathbb{E}[X \mid A_i]\) for \(\omega \in A_i\) when
\(\mathbb{P}(A_i) > 0\), and \(Y(\omega) = 0\) otherwise. Then \(Y\) is constant on
each \(A_i\) and is therefore \(\mathcal{G}\)-measurable. To verify the averaging
identity, let \(A \in \mathcal{G}\). Then \(A = \bigsqcup_{i \in I} A_i\) for some
index set \(I\), and
\[
\int_A Y \, d\mathbb{P}
\;=\; \sum_{i \in I, \, \mathbb{P}(A_i) > 0} \mathbb{E}[X \mid A_i] \cdot \mathbb{P}(A_i)
\;=\; \sum_{i \in I, \, \mathbb{P}(A_i) > 0} \int_{A_i} X \, d\mathbb{P}
\;=\; \int_A X \, d\mathbb{P},
\]
where the last equality discards blocks of probability zero, which contribute nothing
to the integral of \(X\) either. By the a.s.-uniqueness clause of the conditional
expectation, \(Y\) is a version of \(\mathbb{E}[X \mid \mathcal{G}]\).
This recovers the elementary "conditional expectation given an event" used informally in
earlier probability chapters of this section: when \(\mathcal{G}\) is generated by a
countable partition, the abstract definition reduces, block by block, to the formula
one already knows. The novelty of the abstract definition lies in the cases that the
elementary formula does not cover.
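The block-averaging formula, and the averaging identity it satisfies, can be checked mechanically on a toy example. A sketch with a hypothetical six-point sample space and a three-block partition:

```python
import numpy as np

# A hypothetical six-point sample space Omega = {0, ..., 5}: probabilities P
# and a random variable X, listed pointwise.
P = np.array([0.10, 0.20, 0.15, 0.25, 0.20, 0.10])
X = np.array([3.0, -1.0, 4.0, 0.5, 2.0, -2.0])

# A measurable partition of Omega; G = sigma(blocks).
blocks = [np.array([0, 1]), np.array([2, 3, 4]), np.array([5])]

# E[X | G] is constant on each block, equal to the block average
# (1 / P(A_i)) * integral_{A_i} X dP.
cond_exp = np.empty(6)
for B in blocks:
    cond_exp[B] = (X[B] * P[B]).sum() / P[B].sum()

# Averaging identity (*): integral_A E[X|G] dP = integral_A X dP for every
# A in G; every such A is a union of blocks, so checking blocks suffices.
for B in blocks:
    assert np.isclose((cond_exp[B] * P[B]).sum(), (X[B] * P[B]).sum())

print(cond_exp)
```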
The Continuous Case Requires One More Layer
Let \(Y\) be a real-valued random variable and consider \(\mathcal{G} = \sigma(Y)\), the
sub-\(\sigma\)-algebra generated by \(Y\). The conditional expectation
\(\mathbb{E}[X \mid \sigma(Y)]\), abbreviated \(\mathbb{E}[X \mid Y]\), is a
\(\sigma(Y)\)-measurable random variable on \(\Omega\). By a measurability argument
(the Doob-Dynkin lemma), \(\mathbb{E}[X \mid Y]\) takes the form
\(g(Y)\) for some Borel-measurable function \(g : \mathbb{R} \to \mathbb{R}\), determined
\(P_Y\)-a.s. on \(\mathbb{R}\) (equivalently, on the support of \(P_Y\), since
\(P_Y\) places no mass outside \(Y(\Omega)\)). The function \(g\) is what one would like to call
"\(y \mapsto \mathbb{E}[X \mid Y = y]\)" — a deterministic forecast of \(X\) for each
observed value of \(Y\).
The notation \(\mathbb{E}[X \mid Y = y]\) thus has a meaning, but with a caveat: \(g\)
is determined only up to \(P_Y\)-null sets. For continuous \(Y\), every singleton
\(\{y_0\}\) has \(P_Y\)-measure zero, so the value \(g(y_0)\) at any specific point is
not determined by the abstract definition. Two versions of \(g\) can disagree on a
\(P_Y\)-null set and both remain valid; pointwise statements are again meaningful only
up to such null sets.
For most ML purposes — Bayesian inference over continuous parameters, regression as
forecast of \(X\) given \(Y = y\), the Bellman equation evaluated at a particular
state — one wants more: a coherent, simultaneously chosen function
\(y \mapsto g(y)\) that defines an honest probability measure
\(\mathbb{P}(\cdot \mid Y = y)\) on \((\Omega, \mathcal{F})\) for every \(y\), so that
\(\mathbb{E}[X \mid Y = y]\) can be computed as an integral
\(\int X \, d\mathbb{P}(\cdot \mid Y = y)\) in the elementary sense. Such a coherent
choice is called a regular conditional distribution. Its existence is
not automatic; it requires a topological hypothesis, typically that the relevant
spaces are standard Borel (Polish spaces with their Borel \(\sigma\)-algebras), the
setting of the disintegration theorem. The construction of regular conditional
distributions is the central topic
of a separate strand of measure-theoretic probability, and we do not develop it on
this page. The deferral is honest: the abstract \(\sigma(Y)\)-measurable function
\(\mathbb{E}[X \mid Y]\) exists and is unique a.s. by the construction above; what
requires additional machinery is its pointwise-coherent reading as a function of \(y\),
together with the conditional measures \(\mathbb{P}(\cdot \mid Y = y)\) that allow
integrals over the fibre to be computed directly.
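Numerically, the function \(g\) is what regression estimates. A sketch under an assumed model \(X = \sin(Y) + \text{noise}\) with the noise independent of \(Y\), so that \(g(y) = \sin(y)\) \(P_Y\)-a.s.; the local averaging below is a crude pointwise estimator of the regression function, not a construction of a regular version:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
# Assumed model: X = sin(Y) + noise, noise independent of Y with mean zero,
# so the Doob-Dynkin representative of E[X | Y] is g(y) = sin(y).
y = rng.uniform(-3.0, 3.0, n)
x = np.sin(y) + 0.3 * rng.standard_normal(n)

# Estimate g on a grid by averaging over thin slabs |Y - y0| < 0.1, the
# empirical analogue of conditioning near the measure-zero fibre {Y = y0}.
grid = np.linspace(-2.5, 2.5, 11)
g_hat = np.array([x[np.abs(y - y0) < 0.1].mean() for y0 in grid])

print(np.max(np.abs(g_hat - np.sin(grid))))    # small estimation error
```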
Properties of Conditional Expectation
The defining averaging identity (\(\ast\)) — coupled with the a.s.-uniqueness clause —
is the working tool of every proof in this section. The pattern is uniform: to identify
a candidate \(\mathcal{G}\)-measurable function as a version of
\(\mathbb{E}[X \mid \mathcal{G}]\), verify that it integrates to the same value as \(X\)
on every set of \(\mathcal{G}\). We collect the algebraic properties first, then
inequalities, then the geometric \(L^2\) characterisation, and finally the tower
property — the structural identity that drives martingale theory and dynamic programming.
A note on scope. Throughout this page, \(X \in L^1(\mathbb{P})\) is integrable, and every
instance of \(\mathbb{E}[X \mid \mathcal{G}]\) is consequently a finite-valued random
variable. Some standard treatments — Williams's Probability with Martingales,
Durrett's Probability: Theory and Examples — first define
\(\mathbb{E}[X \mid \mathcal{G}]\) for non-negative \(X\) (allowing the value \(+\infty\)
via the \(\sigma\)-finite Radon-Nikodym theorem) and then extend to \(L^1\) via the
decomposition \(X = X^+ - X^-\). On \(L^1\), the resulting object coincides with ours;
the \(L^1\)-first restriction adopted here keeps every quantity on the page finite by
construction. When intermediate steps below pass through non-negative \(X\) (notably the
Take-out rule's three-step extension), the implicit framing is that integrability is
recovered at the end via the assumed \(L^1\) bound on the original \(X\).
Algebraic Properties
Theorem: Linearity
Let \(X, Y \in L^1(\mathbb{P})\) and \(a, b \in \mathbb{R}\). Then
\[
\mathbb{E}[aX + bY \mid \mathcal{G}] \;=\; a\, \mathbb{E}[X \mid \mathcal{G}]
\;+\; b\, \mathbb{E}[Y \mid \mathcal{G}] \quad \mathbb{P}\text{-a.s.}
\]
Proof.
The function \(Z = a\, \mathbb{E}[X \mid \mathcal{G}] + b\, \mathbb{E}[Y \mid \mathcal{G}]\)
is \(\mathcal{G}\)-measurable (linear combination of \(\mathcal{G}\)-measurable functions)
and integrable. For \(A \in \mathcal{G}\),
\[
\int_A Z \, d\mathbb{P}
\;=\; a \int_A \mathbb{E}[X \mid \mathcal{G}] \, d\mathbb{P}
\;+\; b \int_A \mathbb{E}[Y \mid \mathcal{G}] \, d\mathbb{P}
\;=\; a \int_A X \, d\mathbb{P} \;+\; b \int_A Y \, d\mathbb{P}
\;=\; \int_A (aX + bY) \, d\mathbb{P},
\]
using linearity of the Lebesgue integral and the averaging identity for each summand.
The a.s.-uniqueness clause identifies \(Z\) as a version of
\(\mathbb{E}[aX + bY \mid \mathcal{G}]\).
Theorem: Monotonicity
If \(X, Y \in L^1(\mathbb{P})\) and \(X \leq Y\) \(\mathbb{P}\)-a.s., then
\[
\mathbb{E}[X \mid \mathcal{G}] \;\leq\; \mathbb{E}[Y \mid \mathcal{G}]
\quad \mathbb{P}\text{-a.s.}
\]
Proof.
Set \(D = \mathbb{E}[X \mid \mathcal{G}] - \mathbb{E}[Y \mid \mathcal{G}]\), a
\(\mathcal{G}\)-measurable function. We want to show \(D \leq 0\) \(\mathbb{P}\)-a.s.
Let \(A = \{D > 0\} \in \mathcal{G}\). By linearity and the averaging identity,
\[
\int_A D \, d\mathbb{P} \;=\; \int_A \mathbb{E}[X \mid \mathcal{G}] \, d\mathbb{P}
\;-\; \int_A \mathbb{E}[Y \mid \mathcal{G}] \, d\mathbb{P}
\;=\; \int_A X \, d\mathbb{P} \;-\; \int_A Y \, d\mathbb{P}
\;=\; \int_A (X - Y) \, d\mathbb{P} \;\leq\; 0,
\]
since \(X - Y \leq 0\) \(\mathbb{P}\)-a.s. But \(D > 0\) on \(A\), so
\(\int_A D \, d\mathbb{P} \geq 0\), with strict inequality unless \(\mathbb{P}(A) = 0\).
Hence \(\mathbb{P}(A) = 0\), i.e., \(D \leq 0\) \(\mathbb{P}\)-a.s.
Theorem: Take-out (Pull-out) of \(\mathcal{G}\)-Measurable Factors
Let \(X \in L^1(\mathbb{P})\) and let \(Z\) be a \(\mathcal{G}\)-measurable random
variable such that \(ZX \in L^1(\mathbb{P})\). Then
\[
\mathbb{E}[ZX \mid \mathcal{G}] \;=\; Z \cdot \mathbb{E}[X \mid \mathcal{G}]
\quad \mathbb{P}\text{-a.s.}
\]
Proof (standard three-step extension).
We verify the averaging identity for \(Z \cdot \mathbb{E}[X \mid \mathcal{G}]\) in
three stages: indicator, simple, then general \(\mathcal{G}\)-measurable.
Step 1 (indicator). Let \(Z = \mathbf{1}_B\) for \(B \in \mathcal{G}\). For any
\(A \in \mathcal{G}\), \(A \cap B \in \mathcal{G}\), so
\[
\int_A \mathbf{1}_B \cdot \mathbb{E}[X \mid \mathcal{G}] \, d\mathbb{P}
\;=\; \int_{A \cap B} \mathbb{E}[X \mid \mathcal{G}] \, d\mathbb{P}
\;=\; \int_{A \cap B} X \, d\mathbb{P}
\;=\; \int_A \mathbf{1}_B X \, d\mathbb{P},
\]
and \(\mathbf{1}_B \cdot \mathbb{E}[X \mid \mathcal{G}]\) is \(\mathcal{G}\)-measurable
as a product of \(\mathcal{G}\)-measurable functions. The averaging identity holds.
Step 2 (simple non-negative). By linearity (already proved), the identity
extends to non-negative simple \(Z = \sum_{k=1}^n c_k \mathbf{1}_{B_k}\) with \(B_k \in \mathcal{G}\)
and \(c_k \geq 0\).
Step 3 (general). First take \(X \geq 0\). Let \(Z \geq 0\) be
\(\mathcal{G}\)-measurable, and choose simple \(\mathcal{G}\)-measurable \(Z_n \uparrow Z\).
Then \(Z_n X \uparrow ZX\) \(\mathbb{P}\)-a.s., and
\(Z_n \cdot \mathbb{E}[X \mid \mathcal{G}] \uparrow Z \cdot \mathbb{E}[X \mid \mathcal{G}]\)
\(\mathbb{P}\)-a.s. (using \(\mathbb{E}[X \mid \mathcal{G}] \geq 0\) by monotonicity, since
\(X \geq 0\)). The
monotone convergence theorem
applied to both sides of the Step-2 identity, integrated over an arbitrary
\(A \in \mathcal{G}\), gives
\[
\int_A Z \cdot \mathbb{E}[X \mid \mathcal{G}] \, d\mathbb{P}
\;=\; \int_A ZX \, d\mathbb{P}.
\]
For general \(X \in L^1\), decompose \(X = X^+ - X^-\) and \(Z = Z^+ - Z^-\) and
apply the non-negative case to each of the four products, using the integrability
hypothesis \(ZX \in L^1\) to ensure that each piece is integrable. Linearity
(already proved for conditional expectation) reassembles the four pieces. The
a.s.-uniqueness clause identifies \(Z \cdot \mathbb{E}[X \mid \mathcal{G}]\) as a
version of \(\mathbb{E}[ZX \mid \mathcal{G}]\).
Theorem: Independence Collapse
If \(X \in L^1(\mathbb{P})\) and \(\sigma(X)\) is independent of \(\mathcal{G}\), then
\[
\mathbb{E}[X \mid \mathcal{G}] \;=\; \mathbb{E}[X] \quad \mathbb{P}\text{-a.s.}
\]
Proof.
The constant function \(\mathbb{E}[X]\) is \(\mathcal{G}\)-measurable. For
\(A \in \mathcal{G}\), the independence of \(\sigma(X)\) and \(\mathcal{G}\) gives
\(\mathbb{E}[X \mathbf{1}_A] = \mathbb{E}[X] \mathbb{E}[\mathbf{1}_A]
= \mathbb{E}[X] \cdot \mathbb{P}(A)\), hence
\[
\int_A \mathbb{E}[X] \, d\mathbb{P} \;=\; \mathbb{E}[X] \cdot \mathbb{P}(A)
\;=\; \mathbb{E}[X \mathbf{1}_A] \;=\; \int_A X \, d\mathbb{P}.
\]
The a.s.-uniqueness clause identifies the constant \(\mathbb{E}[X]\) as a version of
\(\mathbb{E}[X \mid \mathcal{G}]\).
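The take-out rule and the independence collapse can both be observed empirically when \(\mathcal{G} = \sigma(Y)\) for a discrete \(Y\), where conditional expectations reduce to group means. A sketch (the model and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
y = rng.integers(0, 3, n)              # G = sigma(Y), Y taking 3 values
noise = rng.standard_normal(n)         # independent of Y
x = y + noise
z = np.array([2.0, -1.0, 0.5])[y]      # Z = f(Y) is G-measurable

def cond_exp(w, y, k=3):
    """Empirical E[w | sigma(Y)] for discrete Y: group means per sample."""
    means = np.array([w[y == j].mean() for j in range(k)])
    return means[y]

# Take-out rule: E[ZX | G] = Z * E[X | G], since Z is G-measurable.
lhs = cond_exp(z * x, y)
rhs = z * cond_exp(x, y)
print(np.max(np.abs(lhs - rhs)))       # zero up to floating-point error

# Independence collapse: the noise is independent of G, so E[noise | G]
# collapses to the constant E[noise] (up to sampling error).
print(np.max(np.abs(cond_exp(noise, y) - noise.mean())))
```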
Inequalities
Theorem: Jensen's Inequality for Conditional Expectation
Let \(\varphi : \mathbb{R} \to \mathbb{R}\) be convex, and let \(X \in L^1(\mathbb{P})\)
with \(\varphi(X) \in L^1(\mathbb{P})\). Then
\[
\varphi\big(\mathbb{E}[X \mid \mathcal{G}]\big) \;\leq\;
\mathbb{E}[\varphi(X) \mid \mathcal{G}]
\quad \mathbb{P}\text{-a.s.}
\]
Proof (supporting-line argument).
For a convex function \(\varphi : \mathbb{R} \to \mathbb{R}\), at every point
\(x_0 \in \mathbb{R}\) there is a supporting affine function: there exist
\(a, b \in \mathbb{R}\) (depending on \(x_0\)) with \(\varphi(x_0) = a x_0 + b\) and
\(\varphi(x) \geq a x + b\) for all \(x \in \mathbb{R}\). Moreover, since \(\varphi\)
is convex on all of \(\mathbb{R}\), it is the pointwise supremum of a countable family
of affine functions: there exist sequences \((a_n), (b_n) \subset \mathbb{R}\) with
\[
\varphi(x) \;=\; \sup_{n \in \mathbb{N}} (a_n x + b_n) \quad \text{for all } x \in \mathbb{R}.
\]
(One construction: for each rational \(x_0\), take an affine function supporting
\(\varphi\) at \(x_0\); it exists because a convex \(\varphi\) has finite one-sided
derivatives at every point. The supremum over this countable family agrees with
\(\varphi\) at every rational, and continuity of \(\varphi\), automatic for a convex
function on \(\mathbb{R}\), extends the equality to all of \(\mathbb{R}\).)
For each \(n\), apply linearity and monotonicity of conditional expectation to the
affine inequality \(\varphi(X) \geq a_n X + b_n\):
\[
\mathbb{E}[\varphi(X) \mid \mathcal{G}]
\;\geq\; \mathbb{E}[a_n X + b_n \mid \mathcal{G}]
\;=\; a_n\, \mathbb{E}[X \mid \mathcal{G}] + b_n
\quad \mathbb{P}\text{-a.s.}
\]
The exceptional null set may depend on \(n\), but the union over the countable index
set is still null. Outside this single null set,
\[
\mathbb{E}[\varphi(X) \mid \mathcal{G}]
\;\geq\; \sup_n \big( a_n \mathbb{E}[X \mid \mathcal{G}] + b_n \big)
\;=\; \varphi\big(\mathbb{E}[X \mid \mathcal{G}]\big),
\]
which is the asserted inequality.
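For \(\varphi(t) = t^2\), conditional Jensen says \(\mathbb{E}[X \mid \mathcal{G}]^2 \leq \mathbb{E}[X^2 \mid \mathcal{G}]\) a.s., with gap equal to the conditional variance. A sketch on a discrete-\(\sigma(Y)\) example (illustrative model, built so the conditional variance is \(1\) on every block):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000
y = rng.integers(0, 4, n)              # G = sigma(Y)
x = y + rng.standard_normal(n)         # conditional variance is 1 per block

def cond_exp(w, y, k=4):
    """Empirical E[w | sigma(Y)] for discrete Y: group means per sample."""
    means = np.array([w[y == j].mean() for j in range(k)])
    return means[y]

# Conditional Jensen with phi(t) = t**2:
#   phi(E[X | G]) <= E[phi(X) | G]   pointwise a.s.
# The gap E[X^2 | G] - E[X | G]^2 is the conditional variance, here ~1.
gap = cond_exp(x**2, y) - cond_exp(x, y) ** 2
print(gap.min(), gap.max())            # both close to 1, in particular positive
```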
Theorem: \(L^p\) Contraction
For \(1 \leq p < \infty\) and \(X \in L^p(\Omega, \mathcal{F}, \mathbb{P})\),
\[
\big\| \mathbb{E}[X \mid \mathcal{G}] \big\|_p \;\leq\; \|X\|_p.
\]
In particular, conditional expectation is a contraction on \(L^p(\mathbb{P})\).
Proof.
The function \(\varphi(t) = |t|^p\) is convex on \(\mathbb{R}\) for \(p \geq 1\), and
\(\varphi(X) = |X|^p \in L^1(\mathbb{P})\) by the assumption \(X \in L^p(\mathbb{P})\).
Apply Jensen's inequality for conditional expectation:
\[
\big| \mathbb{E}[X \mid \mathcal{G}] \big|^p \;\leq\;
\mathbb{E}[|X|^p \mid \mathcal{G}] \quad \mathbb{P}\text{-a.s.}
\]
Take expectations of both sides; on the right, the
tower-with-trivial-\(\sigma\)-algebra identity
\(\mathbb{E}[\mathbb{E}[Z \mid \mathcal{G}]] = \mathbb{E}[Z]\) (the averaging identity
applied to \(A = \Omega \in \mathcal{G}\)) gives
\(\mathbb{E}[\mathbb{E}[|X|^p \mid \mathcal{G}]] = \mathbb{E}[|X|^p]\). Hence
\[
\big\| \mathbb{E}[X \mid \mathcal{G}] \big\|_p^p
\;=\; \mathbb{E}\big[ \big| \mathbb{E}[X \mid \mathcal{G}] \big|^p \big]
\;\leq\; \mathbb{E}[|X|^p]
\;=\; \|X\|_p^p,
\]
and taking \(p\)-th roots gives the claim.
The \(L^2\) Projection Characterisation
The \(L^p\) contraction is sharpest at \(p = 2\), where it acquires geometric content.
The space \(L^2(\Omega, \mathcal{G}, \mathbb{P}|_{\mathcal{G}})\) — square-integrable
\(\mathcal{G}\)-measurable functions — sits inside \(L^2(\Omega, \mathcal{F}, \mathbb{P})\)
as a closed linear subspace (closure under \(L^2\) limits is immediate from the fact that a
pointwise a.s.-limit of \(\mathcal{G}\)-measurable functions is \(\mathcal{G}\)-measurable,
and \(L^2\) convergence implies a.s.-convergence along a subsequence). Conditional
expectation, restricted to \(L^2\), is precisely the orthogonal projection onto this
subspace.
Theorem: Conditional Expectation as \(L^2\) Projection
Let \(X \in L^2(\Omega, \mathcal{F}, \mathbb{P})\). Then \(\mathbb{E}[X \mid \mathcal{G}]\)
is the orthogonal projection of \(X\) onto the closed subspace
\(L^2(\Omega, \mathcal{G}, \mathbb{P}|_{\mathcal{G}}) \subseteq L^2(\Omega, \mathcal{F}, \mathbb{P})\).
Equivalently, \(\mathbb{E}[X \mid \mathcal{G}]\) is the unique
\(\mathcal{G}\)-measurable square-integrable function minimising
\[
\mathbb{E}\big[ (X - Y)^2 \big] \quad \text{over } Y \in L^2(\Omega, \mathcal{G}, \mathbb{P}|_{\mathcal{G}}).
\]
Proof.
The
Hilbert projection theorem
applied to the closed subspace
\(M = L^2(\Omega, \mathcal{G}, \mathbb{P}|_{\mathcal{G}})\) of the Hilbert space
\(\mathcal{H} = L^2(\Omega, \mathcal{F}, \mathbb{P})\) produces a unique element
\(P_M(X) \in M\) such that \(X - P_M(X) \perp M\), and \(P_M(X)\) is the unique
minimiser of \(\|X - Y\|_{L^2}\) over \(Y \in M\). Orthogonality means
\(\langle X - P_M(X), Z \rangle_{L^2} = 0\) for every \(Z \in M\), i.e.,
\[
\mathbb{E}\big[ (X - P_M(X)) \cdot Z \big] \;=\; 0 \quad \text{for all }
Z \in L^2(\Omega, \mathcal{G}, \mathbb{P}|_{\mathcal{G}}). \tag{$\dagger$}
\]
Specialising \(Z = \mathbf{1}_A\) for \(A \in \mathcal{G}\) (which is bounded, hence in
\(L^2\), and \(\mathcal{G}\)-measurable), (\(\dagger\)) reduces to
\[
\int_A (X - P_M(X)) \, d\mathbb{P} \;=\; 0,
\quad \text{i.e.,} \quad
\int_A P_M(X) \, d\mathbb{P} \;=\; \int_A X \, d\mathbb{P}.
\]
Thus \(P_M(X)\) is \(\mathcal{G}\)-measurable, integrable (since
\(L^2 \subseteq L^1\) on a finite measure space), and satisfies the averaging identity
on every \(A \in \mathcal{G}\). The a.s.-uniqueness clause of the conditional
expectation identifies \(P_M(X)\) as a version of \(\mathbb{E}[X \mid \mathcal{G}]\).
The \(L^2\) projection identification is the geometric face of conditional expectation.
It explains in one stroke why \(\mathbb{E}[X \mid \mathcal{G}]\) is the
minimum-mean-square forecast of \(X\) based on the information \(\mathcal{G}\): orthogonal
projection minimises distance, and squared \(L^2\)-distance is mean-square error. Every
"best linear predictor" theorem in classical statistics (linear regression, the Wiener
filter, Kalman update equations) is a finite-dimensional or Gaussian instance of this
identification. The same geometric picture also makes the next property — the tower
property — visually obvious: projecting twice, first onto a larger subspace and then
onto a smaller one nested inside it, equals projecting once directly onto the smaller.
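The projection reading can be tested directly: with \(\mathcal{G} = \sigma(Y)\) for discrete \(Y\), the empirical conditional expectation is the vector of group means, it minimises mean-square error among \(\sigma(Y)\)-measurable candidates, and its residual is orthogonal to the block indicators. A sketch (illustrative model):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
y = rng.integers(0, 3, n)
x = 2.0 * y + rng.standard_normal(n)

group_means = np.array([x[y == k].mean() for k in range(3)])
proj = group_means[y]                  # empirical E[X | sigma(Y)]

def mse(candidate):
    """Empirical squared L2 distance E[(X - candidate)^2]."""
    return ((x - candidate) ** 2).mean()

# Minimisation: any other sigma(Y)-measurable candidate, i.e. any other
# choice of per-block constants, has larger mean-square error.
base = mse(proj)
for _ in range(100):
    other = (group_means + 0.1 * rng.standard_normal(3))[y]
    assert mse(other) >= base

# Orthogonality: the residual X - proj is orthogonal to the block
# indicators 1_{Y=k}, which span the subspace L^2(sigma(Y)).
print([float(((x - proj) * (y == k)).mean()) for k in range(3)])   # each ~ 0
```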
The Tower Property
Theorem: Tower Property
Let \(\mathcal{H} \subseteq \mathcal{G} \subseteq \mathcal{F}\) be sub-\(\sigma\)-algebras
and \(X \in L^1(\mathbb{P})\). Then
\[
\mathbb{E}\big[\, \mathbb{E}[X \mid \mathcal{G}] \,\big|\, \mathcal{H} \,\big]
\;=\; \mathbb{E}[X \mid \mathcal{H}] \quad \mathbb{P}\text{-a.s.}
\]
In particular, \(\mathbb{E}\big[\mathbb{E}[X \mid \mathcal{G}]\big] = \mathbb{E}[X]\)
(taking \(\mathcal{H} = \{\emptyset, \Omega\}\)).
Proof.
Write \(W = \mathbb{E}[X \mid \mathcal{G}]\) and let
\(V = \mathbb{E}[W \mid \mathcal{H}]\), which is \(\mathcal{H}\)-measurable and
integrable by construction. We verify that \(V\) satisfies the averaging identity
for \(\mathbb{E}[X \mid \mathcal{H}]\): for every \(A \in \mathcal{H}\),
\[
\int_A V \, d\mathbb{P}
\;\stackrel{(\mathrm{i})}{=}\; \int_A W \, d\mathbb{P}
\;\stackrel{(\mathrm{ii})}{=}\; \int_A X \, d\mathbb{P},
\]
where (i) is the averaging identity for \(V = \mathbb{E}[W \mid \mathcal{H}]\) on
\(A \in \mathcal{H}\), and (ii) is the averaging identity for
\(W = \mathbb{E}[X \mid \mathcal{G}]\) on \(A\), valid because
\(A \in \mathcal{H} \subseteq \mathcal{G}\). The a.s.-uniqueness clause identifies
\(V\) as a version of \(\mathbb{E}[X \mid \mathcal{H}]\).
The tower property is the structural identity that drives iterated conditioning. Read in
the projection picture, it says: projecting onto \(L^2(\mathcal{G})\) and then onto the
smaller subspace \(L^2(\mathcal{H})\) produces the same vector as projecting directly
onto \(L^2(\mathcal{H})\). Read in dynamic-programming terms, it says that the value of
\(X\) under coarse information \(\mathcal{H}\) can be computed by first computing the
value under finer information \(\mathcal{G}\) and then averaging that over \(\mathcal{H}\)
— which is the idealised recursion structure of the Bellman equation, taken up next.
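The nested-projection reading admits a direct numerical check on a finite space. In the sketch below (uniform weights and nested partitions, chosen for illustration), \(\mathcal{G}\) is the partition into four pairs and \(\mathcal{H}\) the coarser partition into two quadruples; projecting twice agrees with projecting once:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=8)

# H ⊆ G: G partitions 8 outcomes into 4 pairs, H into 2 quadruples.
G_blocks = [np.arange(i, i + 2) for i in range(0, 8, 2)]
H_blocks = [np.arange(0, 4), np.arange(4, 8)]

def project(v, blocks):
    """Conditional expectation w.r.t. a partition: block-wise averages."""
    out = np.empty_like(v)
    for b in blocks:
        out[b] = v[b].mean()
    return out

two_step = project(project(X, G_blocks), H_blocks)  # E[ E[X|G] | H ]
one_step = project(X, H_blocks)                     # E[ X | H ]
```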
Conditional Expectation in Practice
Each of the four ML scenarios below corresponds, in idealised form, to a conditional
expectation. The pattern is: name the underlying probability space, identify the
sub-\(\sigma\)-algebra, observe that the quantity of interest is a conditional
expectation, and read off which property of the previous section is being invoked.
Throughout, keep in view that the rigorous object so identified is rarely
instantiated directly in code: implementations typically rely on a density
assumption together with a sampling- or surrogate-based approximation.
Expectation-Maximisation
The expectation-maximisation (EM) algorithm fits a parametric model \(p(x, z \mid \theta)\)
with observed data \(X\) and latent variable \(Z\) by alternating between two steps. Given
a current parameter estimate \(\theta^{(t)}\), the E-step computes
\[
Q(\theta \mid \theta^{(t)}) \;=\; \mathbb{E}\big[ \log p(X, Z \mid \theta) \,\big|\, X, \theta^{(t)} \big],
\]
where the conditional expectation is taken with respect to the conditional distribution
of \(Z\) given the observed \(X\) under parameter \(\theta^{(t)}\). The
M-step sets \(\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})\).
The E-step is, in its idealised (density-based) form, a conditional expectation;
when the conditional density \(p(z \mid x, \theta^{(t)})\) exists, the expectation
reduces to the explicit integral
\(\int \log p(x, z \mid \theta) \, p(z \mid x, \theta^{(t)}) \, dz\)
that is implemented in code (a discrete sum when \(z\) takes finitely many values, a
Monte Carlo estimate otherwise). The M-step is a finite-dimensional optimisation.
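For a finite latent space the E-step is the discrete sum just described. The sketch below (a two-component Gaussian mixture with known unit variances, synthetic data, and illustrative initial values, none of which come from the text) alternates the closed-form E- and M-steps and records the marginal log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy data: two Gaussian clusters with known variance 1.
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])

mu = np.array([-1.0, 1.0])   # initial component means (illustrative)
pi = np.array([0.5, 0.5])    # initial mixing weights

def log_lik(mu, pi):
    comp = np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    return np.log(comp @ pi).sum()

liks = [log_lik(mu, pi)]
for _ in range(20):
    # E-step: responsibilities r[i, k] = p(z = k | x_i, theta_t),
    # the discrete form of the conditional expectation.
    comp = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    r = comp / comp.sum(axis=1, keepdims=True)
    # M-step: maximise Q over (pi, mu) in closed form.
    pi = r.mean(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    liks.append(log_lik(mu, pi))
```

The recorded sequence `liks` is non-decreasing, which is the monotonicity guarantee proved next.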
The monotonic improvement of the marginal log-likelihood \(\ell(\theta) = \log p(X \mid \theta)\)
under EM iterations follows from Jensen's inequality for conditional expectation. Write
\(p(z \mid x, \theta) = p(x, z \mid \theta) / p(x \mid \theta)\), so that
\(\log p(x \mid \theta) = \log p(x, z \mid \theta) - \log p(z \mid x, \theta)\) for every \(z\).
Taking conditional expectations of both sides given \(X\) under
\(p(\cdot \mid X, \theta^{(t)})\) — the left side is a constant in \(z\), so it is
unchanged — yields
\[
\log p(X \mid \theta)
\;=\; \underbrace{\mathbb{E}\big[ \log p(X, Z \mid \theta) \,\big|\, X, \theta^{(t)} \big]}_{Q(\theta \mid \theta^{(t)})}
\;-\; \underbrace{\mathbb{E}\big[ \log p(Z \mid X, \theta) \,\big|\, X, \theta^{(t)} \big]}_{H(\theta \mid \theta^{(t)})}.
\]
Subtracting the same identity at \(\theta = \theta^{(t)}\) gives
\[
\ell(\theta) - \ell(\theta^{(t)})
\;=\; \big[ Q(\theta \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) \big]
\;+\; \big[ H(\theta^{(t)} \mid \theta^{(t)}) - H(\theta \mid \theta^{(t)}) \big].
\]
The first bracket is non-negative for \(\theta = \theta^{(t+1)}\) by definition of the
M-step. The second bracket is non-negative by Jensen's inequality applied to the convex
function \(\varphi(u) = -\log u\): writing
\(H(\theta^{(t)} \mid \theta^{(t)}) - H(\theta \mid \theta^{(t)})
= \mathbb{E}\big[ -\log( p(Z \mid X, \theta) / p(Z \mid X, \theta^{(t)}) ) \,\big|\, X, \theta^{(t)} \big]\),
Jensen's inequality for conditional expectation gives
\[
\mathbb{E}\big[ -\log( p(Z \mid X, \theta) / p(Z \mid X, \theta^{(t)}) ) \,\big|\, X, \theta^{(t)} \big]
\;\geq\; -\log \mathbb{E}\big[ p(Z \mid X, \theta) / p(Z \mid X, \theta^{(t)}) \,\big|\, X, \theta^{(t)} \big]
\;=\; -\log 1 \;=\; 0,
\]
where the inner expectation evaluates to \(1\) by the explicit calculation
\[
\mathbb{E}\big[ p(Z \mid X, \theta) / p(Z \mid X, \theta^{(t)}) \,\big|\, X, \theta^{(t)} \big]
\;=\; \int \frac{p(z \mid x, \theta)}{p(z \mid x, \theta^{(t)})} \, p(z \mid x, \theta^{(t)}) \, dz
\;=\; \int p(z \mid x, \theta) \, dz \;=\; 1,
\]
in which the conditioning density cancels and the remaining integrand is a probability
density that integrates to \(1\) — the same algebraic structure as the importance-sampling
identity \(\mathbb{E}_q[f(Z) \, p(Z)/q(Z)] = \mathbb{E}_p[f(Z)]\). Both brackets in the
earlier decomposition are non-negative, so
\(\ell(\theta^{(t+1)}) \geq \ell(\theta^{(t)})\): EM never decreases the marginal
log-likelihood.
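The two facts doing the work in the second bracket, that the Jensen gap equals a Kullback-Leibler divergence and that the inner expectation of the density ratio is exactly \(1\), can be checked on a three-point latent space (the probability vectors below are hypothetical, chosen only for illustration):

```python
import numpy as np

# Discrete latent z in {0, 1, 2}: two conditional distributions of Z given X.
p_t = np.array([0.5, 0.3, 0.2])    # p(z | x, theta_t)   (hypothetical)
p_new = np.array([0.2, 0.5, 0.3])  # p(z | x, theta)     (hypothetical)

# H(theta_t | theta_t) - H(theta | theta_t) as in the decomposition above.
H_tt = (p_t * np.log(p_t)).sum()
H_t_new = (p_t * np.log(p_new)).sum()
gap = H_tt - H_t_new               # equals KL(p_t || p_new) >= 0

# The inner expectation of the density ratio: the conditioning density
# cancels and the remaining integrand sums to 1.
ratio_mean = (p_t * (p_new / p_t)).sum()
```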
Reinforcement Learning: Value Functions and the Bellman Equation
In a Markov decision process with policy \(\pi\), the state-value function
\[
V^\pi(s) \;=\; \mathbb{E}\Big[\, \sum_{t=0}^\infty \gamma^t R_{t+1} \,\Big|\, S_0 = s \,\Big]
\]
is a conditional expectation of the discounted return given the initial state. The
Bellman equation
\[
V^\pi(s) \;=\; \mathbb{E}\big[ R_1 + \gamma V^\pi(S_1) \,\big|\, S_0 = s \big]
\]
is the tower property in disguise: conditioning on \(S_0 = s\) factors as
"condition on the first transition, then re-condition on \(S_0\)". The
\(\sigma\)-algebra structure is \(\sigma(S_0) \subseteq \sigma(S_0, S_1)\), and the
Bellman recursion is the identity
\(\mathbb{E}[X \mid \sigma(S_0)] = \mathbb{E}\big[ \mathbb{E}[X \mid \sigma(S_0, S_1)] \,\big|\, \sigma(S_0) \big]\)
applied to the discounted return \(X = \sum_t \gamma^t R_{t+1}\).
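On a finite state space the Bellman identity becomes a linear system, and the tower-property reading can be checked against sampled trajectories. The sketch below uses a hypothetical two-state Markov reward process (the policy is fixed, so transition matrix and expected rewards suffice): it solves \(V = R + \gamma P V\) exactly, confirms the Bellman identity, and compares against a Monte Carlo estimate of the discounted return:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical two-state Markov reward process under a fixed policy.
P = np.array([[0.9, 0.1], [0.2, 0.8]])  # transition matrix
R = np.array([1.0, -1.0])               # expected reward leaving each state
gamma = 0.9

# Exact value function: solve (I - gamma P) V = R.
V = np.linalg.solve(np.eye(2) - gamma * P, R)

# Bellman identity V = R + gamma P V: the tower property in matrix form.
bellman_rhs = R + gamma * P @ V

# Monte Carlo check: truncated sampled discounted returns from state 0.
def sample_return(s, horizon=200):
    g, disc = 0.0, 1.0
    for _ in range(horizon):
        g += disc * R[s]
        disc *= gamma
        s = rng.choice(2, p=P[s])
    return g

mc = np.mean([sample_return(0) for _ in range(2000)])
```

The Monte Carlo estimate is exactly the "single sampled trajectory replaces the Bellman expectation" approximation noted in the closing section, averaged over many trajectories.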
Bayesian Posterior Predictive
Given a Bayesian model with parameter \(\theta\), prior \(\pi(\theta)\), and observed
data \(D\), the posterior predictive distribution for a new
observation \(X_{\text{new}}\) is
\[
p(x_{\text{new}} \mid D) \;=\; \mathbb{E}\big[ p(x_{\text{new}} \mid \theta) \,\big|\, D \big],
\]
a conditional expectation under the posterior measure \(p(\theta \mid D)\). The integral
"marginalises over uncertainty in \(\theta\)", and that operation is, in idealised
form, the conditional expectation construction of this page applied with
\(\mathcal{G} = \sigma(D)\). When \(\theta\) is a continuous parameter,
the pointwise reading \(p(x_{\text{new}} \mid D)\) requires the regular-conditional-distribution
machinery, which we do not develop on this page; in practice the integral is approximated
by Markov chain Monte Carlo or by variational surrogates. The take-out property licenses
pulling deterministic functions of the data outside the conditional expectation; the
tower property licenses hierarchical decompositions (e.g., predicting via an
intermediate latent layer).
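In a conjugate model the posterior predictive has a closed form, so the Monte Carlo approximation mentioned above can be checked against the exact answer. The sketch below uses a Beta-Bernoulli coin model with illustrative prior and data (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
# Beta(2, 2) prior on a coin bias theta; data D: 7 heads in 10 flips.
a, b = 2.0, 2.0
heads, flips = 7, 10
a_post, b_post = a + heads, b + flips - heads   # conjugate Beta posterior

# Exact posterior predictive P(next flip = heads | D) = E[theta | D].
exact = a_post / (a_post + b_post)

# Monte Carlo form of E[ p(x_new | theta) | D ]: average the likelihood
# of "heads" over posterior draws of theta.
theta_draws = rng.beta(a_post, b_post, size=100_000)
mc = theta_draws.mean()
```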
Variational Inference and the ELBO
Variational inference approximates an intractable posterior \(p(z \mid x)\) by a
distribution \(q(z \mid x)\) chosen from a tractable family. The evidence lower
bound
\[
\mathrm{ELBO}(q) \;=\; \mathbb{E}_{q(z \mid x)} \big[ \log p(x, z) - \log q(z \mid x) \big]
\]
is an expectation against \(q(z \mid x)\), not against the intractable posterior
\(p(z \mid x)\) — and this is precisely the point: variational inference replaces the
inaccessible conditional measure with a tractable one. Both \(q(\cdot \mid x)\) and
\(p(\cdot \mid x)\) are conditional distributions on the same \(z\)-space; when \(q \ll p\),
the Radon-Nikodym derivative \(dq/dp\) measures the gap between them, and the exact
decomposition
\(\log p(x) = \mathrm{ELBO}(q) + D_{\mathrm{KL}}(q \,\|\, p(\cdot \mid x))\),
which certifies that ELBO maximisation is equivalent to KL minimisation against the
posterior, is a manipulation of this Radon-Nikodym derivative within the
averaging-identity framework of this page. Variational inference is, in this sense, a
programme of exploiting the conditional expectation identities of this page in the
case where the underlying posterior conditional measure is intractable but admits a
tractable surrogate.
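When the latent space is finite the decomposition \(\log p(x) = \mathrm{ELBO}(q) + D_{\mathrm{KL}}(q \,\|\, p(\cdot \mid x))\) can be verified by direct summation. The sketch below uses a three-point latent space with hypothetical joint probabilities and an arbitrary surrogate \(q\):

```python
import numpy as np

# Discrete toy model: z in {0, 1, 2}, one observed x.
p_joint = np.array([0.10, 0.25, 0.15])  # p(x, z) at the observed x (hypothetical)
p_x = p_joint.sum()                     # evidence p(x)
p_post = p_joint / p_x                  # exact posterior p(z | x)

q = np.array([0.5, 0.3, 0.2])           # tractable surrogate q(z | x)

# ELBO(q) = E_q[ log p(x, z) - log q(z | x) ].
elbo = (q * (np.log(p_joint) - np.log(q))).sum()

# KL(q || p(. | x)): the gap between the surrogate and the posterior.
kl = (q * np.log(q / p_post)).sum()
```

Adding the two quantities, the \(\log q\) terms cancel against themselves and the posterior normalisation leaves exactly \(\log p(x)\), so maximising the ELBO over \(q\) is the same as minimising the KL term.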
Looking Ahead
Two threads from this page extend immediately. First, the continuous case deferred in
the definition section — the existence of a coherent, pointwise-defined function
\(y \mapsto \mathbb{E}[X \mid Y = y]\) and an associated regular conditional measure
\(\mathbb{P}(\cdot \mid Y = y)\) — is a topic in measure-theoretic probability that
requires the Polish-space hypothesis and the disintegration theorem; it is the subject
of the regular-conditional-distribution framework, which we have not developed here.
With that machinery in place, Bayesian inference over continuous parameters becomes
rigorously licensed at the pointwise level required by the practitioner. Second,
conditional expectation is the structural primitive of martingale theory and
stochastic calculus: a martingale is a sequence \((M_n)\) satisfying
\(\mathbb{E}[M_{n+1} \mid \mathcal{F}_n] = M_n\), the tower property is its
characteristic identity, and continuous-time analogues — Brownian motion, the Itô
integral, stochastic differential equations underlying diffusion-model generative
AI — are built on filtrations of sub-\(\sigma\)-algebras with conditional expectation
as the propagator. Variational inference, foreshadowed in the ELBO discussion above,
rests directly on the conditional-expectation manipulations developed on this page.
A broader observation is worth recording, even at the cost of leaving rigorously verified
ground. The applications surveyed above all share a common pattern: machine learning
implements a finite-sample, density-based approximation of an object whose rigorous
existence is licensed by the conditional expectation construction of this page, but the
rigorous construction itself is rarely instantiated in code. Monte Carlo replaces the
integral; a single sampled trajectory replaces the Bellman expectation; a variational
surrogate replaces the intractable posterior. These approximations have carried machine
learning through a successful empirical era. It is at least worth asking whether some
of the unstable behaviours observed in contemporary large language models — the
inconsistency of long chains of probabilistic reasoning, the brittleness under
distribution shift, the difficulty of calibrating uncertainty — are connected to the
absence of a rigorous measure-theoretic substrate underneath the approximations. The
connection is hypothesised, not proved. But the question of how rigorous mathematical
structure should enter the foundations of AI systems — whether through measure theory,
topology, differential geometry, or other frameworks — is an active research
direction; the curriculum of this site is built on the working assumption that the
rigorous mathematical layer will become increasingly relevant as the field matures,
and that a reader who has internalised the construction on this page is better
positioned to follow that line of development as it unfolds.