Random Variables

Random Variables

In our development of Basic Probability Ideas, we worked with probability in terms of events — subsets of a sample space. While this framework is logically complete, it is insufficient for the quantitative demands of statistics and machine learning. We need to associate numerical values with outcomes so that we can compute averages, measure spread, and apply the tools of calculus. A random variable is precisely this bridge: it is a function that maps each outcome in the sample space to a real number.

Definition: Random Variable

A random variable is a function \(X: S \to \mathbb{R}\) that assigns a numerical value to each outcome in the sample space \(S\). We denote random variables by capital letters (\(X, Y, Z\)) and specific values they take by the corresponding lowercase letters (\(x, y, z\)).
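As a minimal sketch of this definition, assume the experiment is a roll of two fair six-sided dice (an illustrative choice, not an example taken from the text). The sample space \(S\) is the set of ordered pairs, and the hypothetical function `total` plays the role of \(X\).

```python
from itertools import product

# Sample space S: all ordered outcomes of rolling two six-sided dice (assumed example).
sample_space = list(product(range(1, 7), repeat=2))

def total(outcome):
    """Random variable X: maps an outcome (d1, d2) to the real number d1 + d2."""
    d1, d2 = outcome
    return d1 + d2

# The set of values x that X can take.
values = sorted({total(s) for s in sample_space})

# The event {X = 4} is a subset of the sample space.
event_x_equals_4 = [s for s in sample_space if total(s) == 4]

print(values)
print(event_x_equals_4)
```

The point of the sketch is that \(X\) itself is deterministic: all randomness lives in which outcome of \(S\) occurs, and events such as \(\{X = 4\}\) are ordinary subsets of \(S\).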

A Note on Measurability

Strictly speaking, the function \(X\) must satisfy a technical measurability condition so that \(\{X \in B\}\) is a valid event for every Borel set \(B \subseteq \mathbb{R}\). This condition is what makes probabilities like \(P(X \leq x)\) and \(P(a \leq X \leq b)\) well-defined. For the discrete and continuous random variables studied here, the condition is automatically satisfied for any reasonable function, so we take it for granted. The full formulation is developed in Measure-Theoretic Probability, where a random variable is recast as a measurable function from a probability space into \(\mathbb{R}\) equipped with its Borel sets.

Random variables come in two fundamental types, depending on the nature of the values they can take. This distinction determines the mathematical tools — summation or integration — used to analyze them.

Discrete Random Variables

A random variable is discrete if the set of values it can take is finite or countably infinite (e.g., \(\{0, 1, 2, \ldots\}\)). The probability structure of a discrete random variable is completely characterized by its probability mass function.

Definition: Probability Mass Function (p.m.f.)

The probability mass function of a discrete random variable \(X\) is the function \[ f(x) = P(X = x) \] satisfying:

  1. \(f(x) \geq 0\) for all \(x\).
  2. \(\sum_{x} f(x) = 1\), where the sum is over all possible values of \(X\).
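The two conditions can be checked directly on a small example. The sketch below assumes \(X\) is the sum of two fair dice (an assumed illustration) and uses exact rational arithmetic so that the total mass is exactly 1 rather than approximately 1.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Assumed example: X = sum of two fair dice, each of the 36 outcomes equally likely.
counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))
pmf = {x: Fraction(n, 36) for x, n in sorted(counts.items())}

# Condition 1: every probability is non-negative.
all_nonnegative = all(p >= 0 for p in pmf.values())

# Condition 2: the probabilities sum to exactly 1.
total_mass = sum(pmf.values())

print(pmf[7])        # P(X = 7)
print(all_nonnegative, total_mass)
```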

To answer questions of the form "what is the probability that \(X\) is at most \(x\)?", we accumulate the mass function into a running total. This cumulative perspective is especially useful for computing probabilities over intervals.

Definition: Cumulative Distribution Function (Discrete Case)

The cumulative distribution function (c.d.f.) of a discrete random variable \(X\) is \[ F(x) = P(X \leq x) = \sum_{k \leq x} f(k). \] For an integer-valued \(X\) and integers \(a \leq b\), this gives \(P(a \leq X \leq b) = F(b) - F(a-1)\). More generally, \(P(a \leq X \leq b) = F(b) - P(X < a)\), where \(P(X < a)\) excludes the mass at \(x = a\) itself.
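The role of \(F(a-1)\) in the interval formula is easy to get wrong, so here is a small check, assuming \(X\) is the face of one fair six-sided die (an assumed example). The names `direct`, `via_cdf`, and `off_by_one` are illustrative only.

```python
from fractions import Fraction

# Assumed example: X is the face shown by one fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(x):
    """F(x) = P(X <= x): sum the p.m.f. over all values k <= x."""
    return sum((p for k, p in pmf.items() if k <= x), Fraction(0))

a, b = 2, 4

# P(a <= X <= b) by summing the p.m.f. directly.
direct = sum(pmf[k] for k in range(a, b + 1))

# The same probability from the c.d.f.; F(a - 1) keeps the mass at x = a.
via_cdf = cdf(b) - cdf(a - 1)

# Subtracting F(a) instead drops P(X = a) and gives P(a < X <= b).
off_by_one = cdf(b) - cdf(a)

print(direct, via_cdf, off_by_one)
```

In this example the first two quantities agree, while the third is smaller by exactly \(P(X = 2)\).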

Continuous Random Variables

A random variable is continuous if it can take any value from one or more intervals of real numbers. Because the distribution of such a variable is spread over a continuum by a density rather than concentrated on individual points, the probability that \(X\) takes any single specific value is zero: \(P(X = x) = 0\). Instead of assigning mass to individual points, we describe probabilities through a density function whose integral over an interval gives the probability that \(X\) falls in that interval.

Definition: Probability Density Function (p.d.f.)

The probability density function of a continuous random variable \(X\) is a function \(f(x)\) satisfying:

  1. \(f(x) \geq 0\) for all \(x\).
  2. \(\int_{-\infty}^{\infty} f(x)\,dx = 1\).
  3. \(P(a \leq X \leq b) = \int_{a}^{b} f(x)\,dx\) for any \(a \leq b\).

Note that \(f(x)\) itself is not a probability — it is a density. In particular, \(f(x)\) can exceed 1 (e.g., the uniform distribution on \([0, 0.5]\) has density \(f(x) = 2\)).
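The uniform example can be checked numerically. The sketch below uses a plain midpoint Riemann sum (a hypothetical helper named `integrate`, written here only for illustration) and integrates over \([0, 0.5]\) because the density is zero elsewhere.

```python
def uniform_density(x, lo=0.0, hi=0.5):
    """Density of the uniform distribution on [lo, hi]: 1 / (hi - lo) inside, 0 outside."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def integrate(f, a, b, n=10_000):
    """Midpoint Riemann sum approximating the integral of f over [a, b]."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

height = uniform_density(0.25)                      # density value, larger than 1
total_area = integrate(uniform_density, 0.0, 0.5)   # approximately 1
prob = integrate(uniform_density, 0.1, 0.3)         # approximately P(0.1 <= X <= 0.3)

print(height, total_area, prob)
```

The density value is 2 everywhere on the interval, yet the area under it is 1 and the probability of \([0.1, 0.3]\) is the interval length 0.2 times the height 2.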

Definition: Cumulative Distribution Function (Continuous Case)

The cumulative distribution function of a continuous random variable \(X\) is \[ F(x) = P(X \leq x) = \int_{-\infty}^{x} f(u)\,du. \] By the Fundamental Theorem of Calculus, the density is recovered as: \[ f(x) = \frac{dF(x)}{dx} \] at every point of continuity of \(f\). Furthermore, \(P(a \leq X \leq b) = F(b) - F(a)\).

Note that for a continuous random variable, \(P(X = a) = \int_a^a f(x)\,dx = 0\), so \(P(a \leq X \leq b) = P(a < X < b)\). The distinction between strict and non-strict inequalities only matters in the discrete case.
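To see the relationship between \(f\) and \(F\) numerically, the sketch below assumes an exponential density \(f(x) = \lambda e^{-\lambda x}\) on \(x \geq 0\) with \(\lambda = 2\) (a distribution chosen only for illustration; it has not been introduced in the text). It compares the integral of the density with \(F(b) - F(a)\) and approximates \(dF/dx\) by a central difference.

```python
import math

RATE = 2.0  # assumed rate parameter for the illustration

def density(x):
    """Exponential density f(x) = RATE * exp(-RATE * x) for x >= 0, else 0."""
    return RATE * math.exp(-RATE * x) if x >= 0 else 0.0

def cdf(x):
    """Closed-form c.d.f. F(x) = 1 - exp(-RATE * x) for x >= 0, else 0."""
    return 1.0 - math.exp(-RATE * x) if x >= 0 else 0.0

def integrate(f, a, b, n=100_000):
    """Midpoint Riemann sum approximating the integral of f over [a, b]."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

a, b = 0.5, 1.5
by_integral = integrate(density, a, b)   # integral of f over [a, b]
by_cdf = cdf(b) - cdf(a)                 # F(b) - F(a)

# Central difference approximates dF/dx at x0, which should match f(x0).
h, x0 = 1e-6, 1.0
slope = (cdf(x0 + h) - cdf(x0 - h)) / (2 * h)

print(by_integral, by_cdf)
print(slope, density(x0))
```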

With the language of random variables and their distributions established, we can now ask the two most fundamental questions about any distribution: where is its center, and how spread out is it? These are captured by the expected value and variance, respectively.

Expected Value

The expected value (or mean) of a random variable provides a single number that summarizes the "center" of its distribution. It is a weighted average of all possible values, where each value is weighted by its probability. This concept is indispensable in machine learning: training objectives are expected losses, the optimal prediction under squared error is a conditional expectation, and training algorithms minimize an estimate of the expected risk.

Definition: Expected Value

The expected value of a random variable \(X\) is defined as: \[ \mathbb{E}[X] = \mu = \begin{cases} \displaystyle\sum_{x} x\, f(x) & \text{if } X \text{ is discrete} \\\\ \displaystyle\int_{-\infty}^{\infty} x\, f(x)\,dx & \text{if } X \text{ is continuous} \end{cases} \] provided the sum or integral converges absolutely.

The expected value can be interpreted as the center of gravity of the distribution. If the distribution is symmetric about some point \(c\) and the expectation exists, then \(\mathbb{E}[X] = c\). Since \(c\) is then also a median, the mean and the median coincide.
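Both branches of the definition can be evaluated on small examples. The sketch below assumes a fair six-sided die for the discrete case and reuses the uniform density \(f(x) = 2\) on \([0, 0.5]\) for the continuous case; the helper `integrate` is a hypothetical midpoint-rule routine written for this illustration.

```python
from fractions import Fraction

# Discrete case (assumed example): one fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
mean_die = sum(x * p for x, p in pmf.items())   # sum of x * f(x)

def integrate(f, a, b, n=10_000):
    """Midpoint Riemann sum approximating the integral of f over [a, b]."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

# Continuous case: uniform density f(x) = 2 on [0, 0.5], zero elsewhere.
mean_uniform = integrate(lambda x: x * 2.0, 0.0, 0.5)  # integral of x * f(x)

print(mean_die)      # 7/2
print(mean_uniform)  # close to 0.25
```

Both distributions are symmetric, about 3.5 and 0.25 respectively, and the computed means land on those points of symmetry.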

One of the most useful properties of expectation is its linearity, which holds regardless of whether the random variables involved are independent.

Theorem: Linearity of Expectation

Let \(X\) and \(Y\) be random variables on the same sample space, and let \(a, b, c \in \mathbb{R}\) be constants. Then \[ \mathbb{E}[aX + bY + c] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y] + c, \] whenever the expectations on the right are well-defined. In particular, \(\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]\) and \(\mathbb{E}[aX + c] = a\,\mathbb{E}[X] + c\).

Remark. The additivity \(\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]\) holds without any assumption of independence between \(X\) and \(Y\) — a fact that will be central in the analyses of estimators, gradient noise, and bias-variance decompositions later.
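A quick check that independence plays no role: the sketch below assumes a fair die and takes \(Y = X^2\), which is completely determined by \(X\). The constants \(a = 3\), \(b = 2\), \(c = 1\) are arbitrary illustrative choices.

```python
from fractions import Fraction

# Assumed example: sample space of one fair six-sided die, each outcome with probability 1/6.
outcomes = range(1, 7)
prob = Fraction(1, 6)

def expect(g):
    """E[g] for a function g defined on the outcomes of the fair die."""
    return sum(g(s) * prob for s in outcomes)

def X(s):
    return s

def Y(s):
    return s * s   # a function of X, so X and Y are strongly dependent

a, b, c = 3, 2, 1
lhs = expect(lambda s: a * X(s) + b * Y(s) + c)   # E[aX + bY + c]
rhs = a * expect(X) + b * expect(Y) + c           # a E[X] + b E[Y] + c

# Dependence shows up in products, not in sums.
mean_product = expect(lambda s: X(s) * Y(s))      # E[XY]
product_of_means = expect(X) * expect(Y)          # E[X] E[Y]

print(lhs == rhs)
print(mean_product == product_of_means)
```

The first comparison holds exactly, while the second fails, which is the expected behavior for dependent variables.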

Proof (scalar case \(\mathbb{E}[aX + c] = a\,\mathbb{E}[X] + c\)):

For continuous \(X\) with density \(f\), \[ \begin{align*} \mathbb{E}[aX + c] &= \int_{-\infty}^{\infty} (ax + c)\, f(x)\,dx \\\\ &= a\int_{-\infty}^{\infty} x\, f(x)\,dx + c\int_{-\infty}^{\infty} f(x)\,dx \\\\ &= a\,\mathbb{E}[X] + c, \end{align*} \] using \(\int f(x)\,dx = 1\) in the last step. For discrete \(X\) with p.m.f. \(f\), the same argument with sums gives \[ \mathbb{E}[aX + c] = \sum_{x}(ax + c) f(x) = a\sum_x x\,f(x) + c\sum_x f(x) = a\,\mathbb{E}[X] + c. \] The bivariate identity \(\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]\) requires the joint distribution of \((X, Y)\), which we postpone until joint and multivariate distributions are introduced. The argument runs in parallel — replacing the single sum or integral by a double sum or integral over the joint distribution — and crucially does not require \(X\) and \(Y\) to be independent.

Knowing the center of a distribution is valuable, but it tells us nothing about how concentrated or dispersed the values are around that center. Two distributions can share the same mean yet differ dramatically in spread. To quantify this spread, we introduce the variance.

Variance

The variance measures the expected squared deviation of a random variable from its mean. A small variance indicates that the values of \(X\) tend to cluster tightly around \(\mu\), while a large variance indicates wide dispersion. In machine learning, variance appears everywhere: in the bias-variance tradeoff, in gradient noise during stochastic optimization, and in the uncertainty quantification of Bayesian predictions.

Definition: Variance and Standard Deviation

The variance of a random variable \(X\) with mean \(\mu = \mathbb{E}[X]\) is: \[ \operatorname{Var}(X) = \sigma^2 = \mathbb{E}\bigl[(X - \mu)^2\bigr] \geq 0. \] The standard deviation is \(\sigma = \sqrt{\operatorname{Var}(X)}\), which has the same units as \(X\).

The definition requires the mean \(\mu = \mathbb{E}[X]\) to be computed first and then substituted inside a second expectation. Expanding the square and applying linearity of expectation, with \(\mu\) treated as a constant and \(\mathbb{E}[X] = \mu\), yields a computationally convenient alternative: \[ \begin{align*} \operatorname{Var}(X) = \mathbb{E}\bigl[(X - \mu)^2\bigr] &= \mathbb{E}[X^2 - 2\mu X + \mu^2] \\ &= \mathbb{E}[X^2] - 2\mu\,\mathbb{E}[X] + \mu^2 \\ &= \mathbb{E}[X^2] - 2\mu^2 + \mu^2 \\ &= \mathbb{E}[X^2] - \mu^2. \end{align*} \] That is, the variance equals the mean of the square minus the square of the mean. Note that \(\operatorname{Var}(c) = 0\) for any constant \(c\), since a constant has no spread.
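The two expressions for the variance can be compared on a concrete case. The sketch below assumes a fair six-sided die and uses exact fractions, so any disagreement between the formulas would be visible rather than hidden by rounding.

```python
from fractions import Fraction

# Assumed example: X is the face shown by one fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expect(g):
    """E[g(X)] = sum over x of g(x) * f(x)."""
    return sum(g(x) * p for x, p in pmf.items())

mu = expect(lambda x: x)

# Definition: expected squared deviation from the mean.
var_definition = expect(lambda x: (x - mu) ** 2)

# Alternative: mean of the square minus the square of the mean.
var_shortcut = expect(lambda x: x * x) - mu ** 2

print(mu, var_definition, var_shortcut)  # 7/2 35/12 35/12
```

In floating-point arithmetic the alternative formula can lose precision when \(\mathbb{E}[X^2]\) and \(\mu^2\) are large and nearly equal, which is one reason exact fractions are used in this sketch.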

The following result describes how variance transforms under linear operations. Expectation responds to both scaling and translation, whereas variance scales with the square of the multiplicative constant and is unaffected by translation.

Theorem: Variance of a Linear Transformation

For any constants \(a, b \in \mathbb{R}\), \[ \operatorname{Var}(aX + b) = a^2\,\operatorname{Var}(X). \] The additive constant \(b\) shifts the distribution without changing its spread.

Proof:

Using the alternative formula and linearity of expectation: \[ \begin{align*} \operatorname{Var}(aX + b) &= \mathbb{E}\bigl[(aX+b)^2\bigr] - \bigl(\mathbb{E}[aX+b]\bigr)^2 \\\\ &= \bigl(a^2\mathbb{E}[X^2] + 2ab\,\mathbb{E}[X] + b^2\bigr) - \bigl(a\,\mathbb{E}[X] + b\bigr)^2 \\\\ &= a^2\mathbb{E}[X^2] + 2ab\,\mathbb{E}[X] + b^2 - a^2(\mathbb{E}[X])^2 - 2ab\,\mathbb{E}[X] - b^2 \\\\ &= a^2\bigl(\mathbb{E}[X^2] - (\mathbb{E}[X])^2\bigr) \\\\ &= a^2\,\operatorname{Var}(X). \end{align*} \]
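The theorem can be verified on a small case. The sketch below assumes a fair six-sided die and the arbitrary constants \(a = -3\), \(b = 10\); a negative \(a\) is chosen to show that only \(a^2\) matters.

```python
from fractions import Fraction

# Assumed example: X is the face shown by one fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def variance(g):
    """Var(g(X)) = E[g(X)^2] - (E[g(X)])^2 under the p.m.f. above."""
    mean = sum(g(x) * p for x, p in pmf.items())
    mean_sq = sum(g(x) ** 2 * p for x, p in pmf.items())
    return mean_sq - mean ** 2

a, b = -3, 10
var_x = variance(lambda x: x)                # Var(X)
var_linear = variance(lambda x: a * x + b)   # Var(aX + b)
var_shifted = variance(lambda x: x + b)      # Var(X + b)

print(var_linear == a ** 2 * var_x)   # scaling enters as a squared
print(var_shifted == var_x)           # translation leaves the variance unchanged
```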

Insight: Random Variables in Machine Learning

The framework of random variables, expectations, and variances is the language in which virtually all of machine learning is written. A model's training objective is an expectation: we minimize the expected loss \(\mathbb{E}[\ell(Y, \hat{Y})]\) over the data distribution. The bias-variance decomposition shows that, under squared-error loss, a model's expected prediction error decomposes as \(\text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}\), directly using the concepts defined here. In the next parts, we will study specific families of distributions (Gamma and Beta, Gaussian, and others) that serve as building blocks for probabilistic models throughout statistics and machine learning.