Random Variables
In our development of Basic Probability Ideas, we worked
with probability in terms of events — subsets of a sample space. While this framework is logically
complete, it is insufficient for the quantitative demands of statistics and machine learning.
We need to associate numerical values with outcomes so that we can compute averages,
measure spread, and apply the tools of calculus. A random variable is precisely
this bridge: it is a function that maps each outcome in the sample space to a real number.
Definition: Random Variable
A random variable is a function \(X: S \to \mathbb{R}\) that assigns a numerical
value to each outcome in the sample space \(S\). We denote random variables by capital letters
(\(X, Y, Z\)) and specific values they take by the corresponding lowercase letters (\(x, y, z\)).
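To make the definition concrete, here is a minimal sketch that treats two tosses of a coin as the sample space and takes \(X\) to be the number of heads. The experiment and the names `sample_space` and `number_of_heads` are assumptions chosen for illustration; they do not come from the text above.

```python
from itertools import product

# Sample space S for two coin tosses (an illustrative choice):
# every ordered pair of H/T.
sample_space = list(product("HT", repeat=2))

def number_of_heads(outcome):
    """A random variable X: S -> R, mapping one outcome to a real number."""
    return outcome.count("H")

# Evaluate X on every outcome in the sample space.
values = {outcome: number_of_heads(outcome) for outcome in sample_space}
for outcome, x in values.items():
    print("".join(outcome), "->", x)
```

The point of the sketch is that \(X\) is an ordinary deterministic function; the randomness lives entirely in which outcome of \(S\) occurs.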
A Note on Measurability
Strictly speaking, the function \(X\) must satisfy a technical measurability
condition so that \(\{X \in B\}\) is a valid event for every Borel set \(B \subseteq \mathbb{R}\).
This condition is what makes probabilities like \(P(X \leq x)\) and \(P(a \leq X \leq b)\)
well-defined. For the discrete and continuous random variables studied here, the condition
is automatically satisfied for any reasonable function, so we take it for granted. The full
formulation is developed in
Measure-Theoretic Probability,
where a random variable is recast as a measurable function between measurable spaces.
Random variables come in two fundamental types, depending on the nature of the values they can take.
This distinction determines the mathematical tools — summation or integration — used to analyze them.
Discrete Random Variables
A random variable is discrete if the set of values it can take is finite or
countably infinite (e.g., \(\{0, 1, 2, \ldots\}\)). The probability structure of a discrete
random variable is completely characterized by its probability mass function.
Definition: Probability Mass Function (p.m.f.)
The probability mass function of a discrete random variable \(X\) is the function
\[
f(x) = P(X = x)
\]
satisfying:
- \(f(x) \geq 0\) for all \(x\).
- \(\sum_{x} f(x) = 1\), where the sum is over all possible values of \(X\).
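As a hedged illustration of these two properties, the following sketch builds the p.m.f. of the sum of two fair dice by counting outcomes. The dice experiment and the variable names are assumptions made for this example only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Illustrative experiment: roll two fair six-sided dice, X = sum of the faces.
outcomes = list(product(range(1, 7), repeat=2))
counts = Counter(a + b for a, b in outcomes)

# p.m.f. f(x) = P(X = x), stored exactly as fractions.
pmf = {x: Fraction(n, len(outcomes)) for x, n in sorted(counts.items())}

# Check the two defining properties: non-negativity and total mass 1.
assert all(p >= 0 for p in pmf.values())
assert sum(pmf.values()) == 1

print(pmf[7])  # 1/6
```

Exact fractions are used instead of floats so that the total mass equals 1 exactly, with no rounding error.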
To answer questions of the form "what is the probability that \(X\) is at most \(x\)?",
we accumulate the mass function into a running total. This cumulative perspective is
especially useful for computing probabilities over intervals.
Definition: Cumulative Distribution Function (Discrete Case)
The cumulative distribution function (c.d.f.) of a discrete random variable \(X\) is
\[
F(x) = P(X \leq x) = \sum_{k \leq x} f(k).
\]
For an integer-valued \(X\) and integers \(a \leq b\), this gives
\(P(a \leq X \leq b) = F(b) - F(a-1)\).
More generally, \(P(a \leq X \leq b) = F(b) - P(X < a)\), where \(P(X < a)\) excludes the
mass at \(x = a\) itself.
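The interval formula can be checked on a small case. The sketch below assumes a fair six-sided die (an illustrative choice) and compares \(F(b) - F(a-1)\) with the direct sum of the mass function.

```python
from fractions import Fraction

# Illustrative p.m.f.: a fair six-sided die, f(k) = 1/6 for k = 1..6.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def cdf(x):
    """F(x) = P(X <= x): sum the mass at every value k <= x."""
    return sum((p for k, p in pmf.items() if k <= x), Fraction(0))

a, b = 2, 4
interval_from_cdf = cdf(b) - cdf(a - 1)                  # F(b) - F(a - 1)
interval_direct = sum(pmf[k] for k in range(a, b + 1))   # f(2) + f(3) + f(4)

assert interval_from_cdf == interval_direct == Fraction(1, 2)

# Using F(b) - F(a) instead would drop the mass at x = a.
assert cdf(b) - cdf(a) == Fraction(1, 3)
```

The second assertion shows why the subtraction uses \(a - 1\) rather than \(a\) for integer-valued \(X\): the mass at the left endpoint must stay in the interval.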
Continuous Random Variables
A random variable is continuous if it can take any value from one or more intervals
of real numbers. Since the set of possible values is uncountably infinite, the probability that \(X\)
takes any single specific value is zero: \(P(X = x) = 0\). Instead of assigning mass to individual points,
we describe probabilities through a density function whose integral over an interval gives the
probability that \(X\) falls in that interval.
Definition: Probability Density Function (p.d.f.)
The probability density function of a continuous random variable \(X\) is a
function \(f(x)\) satisfying:
- \(f(x) \geq 0\) for all \(x\).
- \(\int_{-\infty}^{\infty} f(x)\,dx = 1\).
- \(P(a \leq X \leq b) = \int_{a}^{b} f(x)\,dx\) for any \(a \leq b\).
Note that \(f(x)\) itself is not a probability — it is a density. In particular,
\(f(x)\) can exceed 1 (e.g., the uniform distribution on \([0, 0.5]\) has density \(f(x) = 2\)).
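The uniform example can be verified numerically. This is a rough sketch that uses a hand-written midpoint rule; the helper names `uniform_pdf` and `integrate` are hypothetical, and the interval \([0.1, 0.3]\) is an arbitrary choice for illustration.

```python
def uniform_pdf(x, lo=0.0, hi=0.5):
    """Density of the uniform distribution on [lo, hi]; zero outside."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

density_value = uniform_pdf(0.25)            # a density value, larger than 1
total = integrate(uniform_pdf, 0.0, 0.5)     # total area under the density
prob = integrate(uniform_pdf, 0.1, 0.3)      # P(0.1 <= X <= 0.3)

print(density_value, round(total, 6), round(prob, 6))
```

Although the density equals 2 on its support, the area under it is 1, and the probability of any interval is an area, never a height.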
Definition: Cumulative Distribution Function (Continuous Case)
The cumulative distribution function of a continuous random variable \(X\) is
\[
F(x) = P(X \leq x) = \int_{-\infty}^{x} f(u)\,du.
\]
By the Fundamental Theorem of Calculus, the density is recovered as:
\[
f(x) = \frac{dF(x)}{dx}
\]
at every point of continuity of \(f\). Furthermore,
\(P(a \leq X \leq b) = F(b) - F(a)\).
Note that for a continuous random variable, \(P(X = a) = \int_a^a f(x)\,dx = 0\), so
\(P(a \leq X \leq b) = P(a < X < b)\). The distinction between strict and non-strict
inequalities only matters in the discrete case.
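The relationship between \(f\) and \(F\) can be sanity-checked numerically. The sketch below assumes an exponential distribution with rate 2, which is not discussed in the text above and is used here only because its density and c.d.f. have simple closed forms.

```python
import math

RATE = 2.0  # illustrative rate parameter (an assumption for this example)

def pdf(x):
    """Density of an exponential distribution with rate RATE."""
    return RATE * math.exp(-RATE * x) if x >= 0 else 0.0

def cdf(x):
    """F(x): integral of the density from -infinity to x, in closed form."""
    return 1.0 - math.exp(-RATE * x) if x >= 0 else 0.0

# The density is the derivative of the c.d.f. (central finite difference).
x, h = 0.7, 1e-6
derivative = (cdf(x + h) - cdf(x - h)) / (2 * h)
assert abs(derivative - pdf(x)) < 1e-6

# P(a <= X <= b) = F(b) - F(a), compared with a midpoint-rule integral of f.
a, b, n = 0.5, 1.5, 100_000
step = (b - a) / n
integral = step * sum(pdf(a + (i + 0.5) * step) for i in range(n))
assert abs((cdf(b) - cdf(a)) - integral) < 1e-8
```

Both checks are approximate by construction, so they compare against small tolerances rather than testing exact equality.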
With the language of random variables and their distributions established, we can now ask
the two most fundamental questions about any distribution: where is its center, and
how spread out is it? These are captured by the expected value and
variance, respectively.
Expected Value
The expected value (or mean) of a random variable provides a single
number that summarizes the "center" of its distribution. It is a weighted average of all possible values,
where each value is weighted by its probability. This concept is indispensable in machine learning:
training objectives are expected losses, the optimal prediction under squared error is a conditional
expectation, and training algorithms minimize expected risk.
Definition: Expected Value
The expected value of a random variable \(X\) is defined as:
\[
\mathbb{E}[X] = \mu =
\begin{cases}
\displaystyle\sum_{x} x\, f(x) & \text{if } X \text{ is discrete} \\
\displaystyle\int_{-\infty}^{\infty} x\, f(x)\,dx & \text{if } X \text{ is continuous}
\end{cases}
\]
provided the sum or integral converges absolutely.
The expected value can be interpreted as the center of gravity of the distribution.
If the distribution is symmetric about some point \(c\) and the expectation exists, then
\(\mathbb{E}[X] = c\); in that case the mean coincides with the median.
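As a worked instance of the discrete formula, the sketch below computes the mean of a fair six-sided die. The die and the helper name `expectation` are assumptions made for illustration.

```python
from fractions import Fraction

# Illustrative p.m.f.: a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(pmf):
    """E[X] = sum of x * f(x) over the support of a discrete random variable."""
    return sum(x * p for x, p in pmf.items())

mean = expectation(pmf)
print(mean)  # 7/2

# The p.m.f. is symmetric about c = 7/2: the values x and 7 - x have equal mass.
assert all(pmf[x] == pmf[7 - x] for x in pmf)
```

The mean \(7/2\) is not a value the die can show, which is a reminder that the expected value is a center of mass and not a "most likely" outcome.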
One of the most useful properties of expectation is its linearity, which holds
regardless of whether the random variables involved are independent.
Theorem: Linearity of Expectation
Let \(X\) and \(Y\) be random variables on the same sample space, and let
\(a, b, c \in \mathbb{R}\) be constants. Then
\[
\mathbb{E}[aX + bY + c] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y] + c,
\]
whenever the expectations on the right are well-defined. In particular,
\(\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]\) and
\(\mathbb{E}[aX + c] = a\,\mathbb{E}[X] + c\).
Remark. The additivity \(\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]\)
holds without any assumption of independence between \(X\) and \(Y\) — a fact that
will be central in the analyses of estimators, gradient noise, and bias-variance decompositions later.
Proof (scalar case \(\mathbb{E}[aX + c] = a\,\mathbb{E}[X] + c\)):
For continuous \(X\) with density \(f\),
\[
\begin{align*}
\mathbb{E}[aX + c] &= \int_{-\infty}^{\infty} (ax + c)\, f(x)\,dx \\
&= a\int_{-\infty}^{\infty} x\, f(x)\,dx + c\int_{-\infty}^{\infty} f(x)\,dx \\
&= a\,\mathbb{E}[X] + c,
\end{align*}
\]
using \(\int f(x)\,dx = 1\) in the last step. For discrete \(X\) with p.m.f. \(f\), the same
argument with sums gives
\[
\mathbb{E}[aX + c] = \sum_{x}(ax + c) f(x) = a\sum_x x\,f(x) + c\sum_x f(x) = a\,\mathbb{E}[X] + c.
\]
The bivariate identity \(\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]\) requires the
joint distribution of \((X, Y)\), which we postpone until joint and multivariate distributions
are introduced. The argument runs in parallel — replacing the single sum or integral by a
double sum or integral over the joint distribution — and crucially does not require \(X\) and
\(Y\) to be independent.
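Although the general proof is postponed, linearity without independence can be checked on a small example. The sketch below assumes a single fair die roll \(s\) with \(X(s) = s\) and \(Y(s) = s^2\), so \(Y\) is completely determined by \(X\); the setup and the helper `expect` are hypothetical choices for illustration.

```python
from fractions import Fraction

# Illustrative setup: one fair die roll s, with X(s) = s and Y(s) = s**2.
# Y is a deterministic function of X, so the two are strongly dependent.
prob = {s: Fraction(1, 6) for s in range(1, 7)}

def expect(g):
    """E[g] for a function g on the sample space, weighted by P({s})."""
    return sum(g(s) * p for s, p in prob.items())

def X(s):
    return s

def Y(s):
    return s ** 2

a, b, c = 3, -2, 5  # arbitrary constants

lhs = expect(lambda s: a * X(s) + b * Y(s) + c)   # E[aX + bY + c]
rhs = a * expect(X) + b * expect(Y) + c           # a E[X] + b E[Y] + c
assert lhs == rhs
```

Both sides are computed with exact fractions, so the equality holds exactly and is not an artifact of rounding.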
Knowing the center of a distribution is valuable, but it tells us nothing about how concentrated
or dispersed the values are around that center. Two distributions can share the same mean yet
differ dramatically in spread. To quantify this spread, we introduce the variance.
Variance
The variance measures the expected squared deviation of a random variable from its
mean. A small variance indicates that the values of \(X\) tend to cluster tightly around \(\mu\),
while a large variance indicates wide dispersion. In machine learning, variance appears
everywhere: in the bias-variance tradeoff, in gradient noise during stochastic optimization,
and in the uncertainty quantification of Bayesian predictions.
Definition: Variance and Standard Deviation
The variance of a random variable \(X\) with mean \(\mu = \mathbb{E}[X]\) is:
\[
\operatorname{Var}(X) = \sigma^2 = \mathbb{E}\bigl[(X - \mu)^2\bigr] \geq 0.
\]
The standard deviation is \(\sigma = \sqrt{\operatorname{Var}(X)}\), which has the same
units as \(X\).
The definition requires centering \(X\) at \(\mu = \mathbb{E}[X]\) before squaring, which can be awkward to compute directly.
Expanding the square yields a computationally convenient alternative:
\[
\begin{align*}
\operatorname{Var}(X) = \mathbb{E}\bigl[(X - \mu)^2\bigr]
&= \mathbb{E}[X^2 - 2\mu X + \mu^2] \\
&= \mathbb{E}[X^2] - 2\mu\,\mathbb{E}[X] + \mu^2 \\
&= \mathbb{E}[X^2] - \mu^2.
\end{align*}
\]
That is, the variance equals the mean of the square minus the square of the mean.
Note that \(\operatorname{Var}(c) = 0\) for any constant \(c\), since a constant has no spread.
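The two expressions for the variance can be compared on a concrete distribution. This sketch again assumes a fair six-sided die, chosen only for illustration.

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}  # illustrative: fair die

mean = sum(x * p for x, p in pmf.items())                           # E[X]
var_definition = sum((x - mean) ** 2 * p for x, p in pmf.items())   # E[(X - mu)^2]
second_moment = sum(x ** 2 * p for x, p in pmf.items())             # E[X^2]
var_shortcut = second_moment - mean ** 2                            # E[X^2] - mu^2

assert var_definition == var_shortcut == Fraction(35, 12)
```

With floating-point data the shortcut formula can lose precision when \(\mathbb{E}[X^2]\) and \(\mu^2\) are both large and nearly equal; exact fractions sidestep that issue here.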
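The following result describes how variance transforms under linear operations. Unlike
expectation, variance is unaffected by translation, and scaling by \(a\) multiplies it by \(a^2\).
Theorem: Variance of a Linear Transformation
Let \(X\) be a random variable with finite variance, and let \(a, b \in \mathbb{R}\) be constants. Then
\[
\operatorname{Var}(aX + b) = a^2\,\operatorname{Var}(X).
\]
Proof: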
Using the alternative formula and linearity of expectation:
\[
\begin{align*}
\operatorname{Var}(aX + b)
&= \mathbb{E}\bigl[(aX+b)^2\bigr] - \bigl(\mathbb{E}[aX+b]\bigr)^2 \\
&= \bigl(a^2\mathbb{E}[X^2] + 2ab\,\mathbb{E}[X] + b^2\bigr) - \bigl(a\,\mathbb{E}[X] + b\bigr)^2 \\
&= a^2\mathbb{E}[X^2] + 2ab\,\mathbb{E}[X] + b^2 - a^2(\mathbb{E}[X])^2 - 2ab\,\mathbb{E}[X] - b^2 \\
&= a^2\bigl(\mathbb{E}[X^2] - (\mathbb{E}[X])^2\bigr) \\
&= a^2\,\operatorname{Var}(X).
\end{align*}
\]
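The identity \(\operatorname{Var}(aX + b) = a^2\,\operatorname{Var}(X)\) derived above can be checked exactly on a small example. The sketch assumes a fair six-sided die and arbitrary constants \(a = -3\), \(b = 10\); the helper `variance` is a hypothetical name introduced here.

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}  # illustrative: fair die

def variance(transform):
    """Var(g(X)) = E[g(X)^2] - (E[g(X)])^2 for X with the p.m.f. above."""
    first = sum(transform(x) * p for x, p in pmf.items())
    second = sum(transform(x) ** 2 * p for x, p in pmf.items())
    return second - first ** 2

a, b = -3, 10  # arbitrary scale and shift
var_x = variance(lambda x: x)
var_linear = variance(lambda x: a * x + b)

assert var_linear == a ** 2 * var_x           # scaling multiplies by a^2
assert variance(lambda x: x + b) == var_x     # the shift b has no effect
```

A negative \(a\) is used deliberately: flipping the sign of \(X\) leaves the spread unchanged, which is why the factor is \(a^2\) and not \(a\).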
Insight: Random Variables in Machine Learning
The framework of random variables, expectations, and variances is the language in which
virtually all of machine learning is written. A model's training objective is an
expectation: we minimize the expected loss \(\mathbb{E}[\ell(Y, \hat{Y})]\) over the data distribution.
The bias-variance decomposition shows that, under squared-error loss, a model's expected prediction error
decomposes as \(\text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}\), directly using
the concepts defined here. In the next parts, we will study specific families of distributions
(Gamma and Beta, Gaussian, and others) that serve as building
blocks for probabilistic models throughout statistics and machine learning.