Random Variables
In Part 1, we developed probability in terms of events —
subsets of a sample space. While this framework is logically complete, it is insufficient
for the quantitative demands of statistics and machine learning. We need to associate
numerical values with outcomes so that we can compute averages, measure spread,
and apply the tools of calculus. A random variable is precisely this bridge:
it is a function that maps each outcome in the sample space to a real number.
Definition: Random Variable
A random variable is a function \(X: S \to \mathbb{R}\) that assigns a numerical
value to each outcome in the sample space \(S\). We denote random variables by capital letters
(\(X, Y, Z\)) and specific values they take by the corresponding lowercase letters (\(x, y, z\)).
Random variables come in two fundamental types, depending on the nature of the values they can take.
This distinction determines the mathematical tools — summation or integration — used to analyze them.
Discrete Random Variables
A random variable is discrete if the set of values it can take is finite or
countably infinite (e.g., \(\{0, 1, 2, \ldots\}\)). The probability structure of a discrete
random variable is completely characterized by its probability mass function.
Definition: Probability Mass Function (p.m.f.)
The probability mass function of a discrete random variable \(X\) is the function
\[
f(x) = P(X = x)
\]
satisfying:
- \(f(x) \geq 0\) for all \(x\).
- \(\sum_{x} f(x) = 1\), where the sum is over all possible values of \(X\).
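As a concrete illustration (a fair six-sided die, an example of our choosing rather than one from the text), both p.m.f. conditions can be checked directly:

```python
# Sketch: the p.m.f. of a fair six-sided die, stored as a dict
# mapping each value x to P(X = x) = 1/6.
pmf = {x: 1/6 for x in range(1, 7)}

# Verify the two conditions from the definition.
assert all(p >= 0 for p in pmf.values())      # f(x) >= 0 for all x
assert abs(sum(pmf.values()) - 1.0) < 1e-12   # the masses sum to 1

# Evaluating the p.m.f. at a point is a simple lookup: P(X = 3) = 1/6.
print(pmf[3])
```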
To answer questions of the form "what is the probability that \(X\) is at most \(x\)?",
we accumulate the mass function into a running total. This cumulative perspective is
especially useful for computing probabilities over intervals.
Definition: Cumulative Distribution Function (Discrete Case)
The cumulative distribution function (c.d.f.) of a discrete random variable \(X\) is
\[
F(x) = P(X \leq x) = \sum_{k \leq x} f(k).
\]
It follows that \(P(a \leq X \leq b) = F(b) - F(a-1)\) for integer-valued random variables.
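Continuing the hypothetical fair-die example, the running-total definition of the c.d.f. and the interval formula \(F(b) - F(a-1)\) can be sketched as:

```python
# Sketch: c.d.f. of a fair die as a running sum of the p.m.f.
pmf = {x: 1/6 for x in range(1, 7)}

def F(x):
    """Cumulative distribution function: P(X <= x)."""
    return sum(p for k, p in pmf.items() if k <= x)

# P(2 <= X <= 4) = F(4) - F(2 - 1) = 4/6 - 1/6 = 1/2
assert abs((F(4) - F(1)) - 0.5) < 1e-12
# The c.d.f. reaches 1 at the largest possible value.
assert abs(F(6) - 1.0) < 1e-12
```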
Continuous Random Variables
A random variable is continuous if it can take any value from one or more intervals
of real numbers. Since the set of possible values is uncountably infinite, the probability that \(X\)
takes any single specific value is zero: \(P(X = x) = 0\). Instead of assigning mass to individual points,
we describe probabilities through a density function whose integral over an interval gives the
probability that \(X\) falls in that interval.
Definition: Probability Density Function (p.d.f.)
The probability density function of a continuous random variable \(X\) is a
function \(f(x)\) satisfying:
- \(f(x) \geq 0\) for all \(x\).
- \(\int_{-\infty}^{\infty} f(x)\,dx = 1\).
- \(P(a \leq X \leq b) = \int_{a}^{b} f(x)\,dx\) for any \(a \leq b\).
Note that \(f(x)\) itself is not a probability — it is a density. In particular,
\(f(x)\) can exceed 1 (e.g., the uniform distribution on \([0, 0.5]\) has density \(f(x) = 2\)).
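The uniform example above can be verified numerically; this is a minimal sketch that approximates the integral with a midpoint Riemann sum rather than exact integration:

```python
# Sketch: the uniform density on [0, 0.5] takes the value 2 > 1,
# yet its total integral is still 1.
def f(x):
    return 2.0 if 0.0 <= x <= 0.5 else 0.0

# Midpoint Riemann sum of f over [0, 1].
n = 100_000
dx = 1.0 / n
total = sum(f((i + 0.5) * dx) * dx for i in range(n))
assert abs(total - 1.0) < 1e-6  # the density integrates to 1
```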
Definition: Cumulative Distribution Function (Continuous Case)
The cumulative distribution function of a continuous random variable \(X\) is
\[
F(x) = P(X \leq x) = \int_{-\infty}^{x} f(u)\,du.
\]
By the Fundamental Theorem of Calculus, the density is recovered as:
\[
f(x) = \frac{dF(x)}{dx}
\]
at every point where \(F\) is differentiable. Furthermore,
\(P(a \leq X \leq b) = F(b) - F(a)\).
Note that for a continuous random variable, \(P(X = a) = \int_a^a f(x)\,dx = 0\), so
\(P(a \leq X \leq b) = P(a < X < b)\). The distinction between strict and non-strict
inequalities only matters in the discrete case.
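As an illustration (using the standard exponential distribution with density \(f(x) = e^{-x}\) for \(x \geq 0\), an example of our choosing), the relation \(P(a \leq X \leq b) = F(b) - F(a)\) can be cross-checked against a numerical integral of the density:

```python
import math

# Sketch: for the standard exponential distribution, F(x) = 1 - e^{-x}
# (x >= 0) is the antiderivative of the density f(x) = e^{-x}.
def F(x):
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

# P(1 <= X <= 2) via the c.d.f. difference.
p = F(2) - F(1)

# Cross-check: midpoint Riemann sum of the density over [1, 2].
n = 100_000
dx = 1.0 / n
approx = sum(math.exp(-(1 + (i + 0.5) * dx)) * dx for i in range(n))
assert abs(p - approx) < 1e-6
```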
With the language of random variables and their distributions established, we can now ask
the two most fundamental questions about any distribution: where is its center, and
how spread out is it? These are captured by the expected value and
variance, respectively.
Expected Value
The expected value (or mean) of a random variable provides a single
number that summarizes the "center" of its distribution. It is a weighted average of all possible values,
where each value is weighted by its probability. This concept is indispensable in machine learning:
loss functions are expectations, model predictions are conditional expectations, and training
algorithms minimize expected risk.
Definition: Expected Value
The expected value of a random variable \(X\) is defined as:
\[
\mathbb{E}[X] = \mu =
\begin{cases}
\displaystyle\sum_{x} x\, f(x) & \text{if } X \text{ is discrete} \\[10pt]
\displaystyle\int_{-\infty}^{\infty} x\, f(x)\,dx & \text{if } X \text{ is continuous}
\end{cases}
\]
provided the sum or integral converges absolutely.
The expected value can be interpreted as the center of gravity of the distribution.
If the distribution is symmetric about some point \(c\) and the expectation exists, then \(\mathbb{E}[X] = c\); in that case
the mean coincides with the median.
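A quick sketch with the fair-die example (our own illustration): the discrete formula gives \(\mathbb{E}[X] = 3.5\), the point of symmetry of the distribution:

```python
# Sketch: expected value of a fair die as the probability-weighted sum
# from the discrete case of the definition.
pmf = {x: 1/6 for x in range(1, 7)}
mean = sum(x * p for x, p in pmf.items())

# The p.m.f. is symmetric about 3.5, so E[X] = 3.5.
assert abs(mean - 3.5) < 1e-12
```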
One of the most useful properties of expectation is its linearity, which holds
regardless of whether the random variables involved are independent.
Theorem 1: Linearity of Expectation
For any constants \(a, b \in \mathbb{R}\),
\[
\mathbb{E}[aX + b] = a\,\mathbb{E}[X] + b.
\]
Proof (continuous case):
\[
\begin{align*}
\mathbb{E}[aX + b] &= \int_{-\infty}^{\infty} (ax + b)\, f(x)\,dx \\
&= a\int_{-\infty}^{\infty} x\, f(x)\,dx + b\int_{-\infty}^{\infty} f(x)\,dx \\
&= a\,\mathbb{E}[X] + b.
\end{align*}
\]
The discrete case follows identically by replacing integrals with sums.
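The theorem can also be checked numerically; this sketch uses the fair-die example with arbitrary constants \(a = 2\), \(b = 1\) chosen for illustration:

```python
# Sketch: verify E[aX + b] = a E[X] + b on a fair die.
pmf = {x: 1/6 for x in range(1, 7)}
a, b = 2.0, 1.0

lhs = sum((a * x + b) * p for x, p in pmf.items())  # E[aX + b], computed directly
rhs = a * sum(x * p for x, p in pmf.items()) + b    # a E[X] + b
assert abs(lhs - rhs) < 1e-12
```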
Knowing the center of a distribution is valuable, but it tells us nothing about how concentrated
or dispersed the values are around that center. Two distributions can share the same mean yet
differ dramatically in spread. To quantify this spread, we introduce the variance.
Variance
The variance measures the expected squared deviation of a random variable from its
mean. A small variance indicates that the values of \(X\) tend to cluster tightly around \(\mu\),
while a large variance indicates wide dispersion. In machine learning, variance appears
everywhere: in the bias-variance tradeoff, in gradient noise during stochastic optimization,
and in the uncertainty quantification of Bayesian predictions.
Definition: Variance and Standard Deviation
The variance of a random variable \(X\) with mean \(\mu = \mathbb{E}[X]\) is:
\[
\text{Var}(X) = \sigma^2 = \mathbb{E}\bigl[(X - \mu)^2\bigr] \geq 0.
\]
The standard deviation is \(\sigma = \sqrt{\text{Var}(X)}\), which has the same
units as \(X\).
The definition requires first computing \(\mu = \mathbb{E}[X]\) and then evaluating a second
expectation of the squared deviation. Expanding the square yields a computationally convenient alternative:
\[
\begin{align*}
\text{Var}(X) = \mathbb{E}\bigl[(X - \mu)^2\bigr]
&= \mathbb{E}[X^2 - 2\mu X + \mu^2] \\
&= \mathbb{E}[X^2] - 2\mu\,\mathbb{E}[X] + \mu^2 \\
&= \mathbb{E}[X^2] - \mu^2.
\end{align*}
\]
That is, the variance equals the mean of the square minus the square of the mean.
Note that \(\text{Var}(c) = 0\) for any constant \(c\), since a constant has no spread.
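Both formulas can be computed side by side; here is a minimal sketch on the fair-die example, for which \(\text{Var}(X) = 35/12\):

```python
# Sketch: Var(X) for a fair die, computed both from the definition
# E[(X - mu)^2] and from the shortcut E[X^2] - mu^2.
pmf = {x: 1/6 for x in range(1, 7)}
mu = sum(x * p for x, p in pmf.items())

var_def = sum((x - mu) ** 2 * p for x, p in pmf.items())     # definition
var_short = sum(x**2 * p for x, p in pmf.items()) - mu**2    # shortcut
assert abs(var_def - var_short) < 1e-9
assert abs(var_def - 35/12) < 1e-9  # E[X^2] - mu^2 = 91/6 - 49/4 = 35/12
```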
The following result describes how variance transforms under linear operations. Unlike
expectation, variance is affected by scaling but not by translation.
Theorem 2: Variance of a Linear Transformation
For any constants \(a, b \in \mathbb{R}\),
\[
\text{Var}(aX + b) = a^2\,\text{Var}(X).
\]
The additive constant \(b\) shifts the distribution without changing its spread.
Proof:
Using the alternative formula and linearity of expectation:
\[
\begin{align*}
\text{Var}(aX + b)
&= \mathbb{E}\bigl[(aX+b)^2\bigr] - \bigl(\mathbb{E}[aX+b]\bigr)^2 \\
&= \bigl(a^2\mathbb{E}[X^2] + 2ab\,\mathbb{E}[X] + b^2\bigr) - \bigl(a\,\mathbb{E}[X] + b\bigr)^2 \\
&= a^2\mathbb{E}[X^2] + 2ab\,\mathbb{E}[X] + b^2 - a^2(\mathbb{E}[X])^2 - 2ab\,\mathbb{E}[X] - b^2 \\
&= a^2\bigl(\mathbb{E}[X^2] - (\mathbb{E}[X])^2\bigr) \\\\
&= a^2\,\text{Var}(X).
\end{align*}
\]
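Theorem 2 can likewise be verified numerically; this sketch uses the fair die with illustrative constants \(a = 3\), \(b = -2\):

```python
# Sketch: verify Var(aX + b) = a^2 Var(X) on a fair die.
pmf = {x: 1/6 for x in range(1, 7)}
a, b = 3.0, -2.0

def var(values_probs):
    """Variance of a discrete distribution given as (value, prob) pairs."""
    mu = sum(x * p for x, p in values_probs)
    return sum((x - mu) ** 2 * p for x, p in values_probs)

var_x = var(list(pmf.items()))
var_ax_b = var([(a * x + b, p) for x, p in pmf.items()])  # transformed variable
assert abs(var_ax_b - a**2 * var_x) < 1e-9  # the shift b drops out
```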
Insight: Random Variables in Machine Learning
The framework of random variables, expectations, and variances is the language in which
virtually all of machine learning is written. A model's loss function is an
expectation: we minimize \(\mathbb{E}[\ell(Y, \hat{Y})]\) over the data distribution.
The bias-variance decomposition shows that a model's expected prediction error
decomposes as \(\text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}\), directly using
the concepts defined here. In the next parts, we will study specific families of distributions,
including the Gamma, Beta, and Gaussian, that serve as building blocks for probabilistic models
throughout statistics and machine learning.