Random Variables


In Part 1, we developed probability in terms of events — subsets of a sample space. While this framework is logically complete, it is insufficient for the quantitative demands of statistics and machine learning. We need to associate numerical values with outcomes so that we can compute averages, measure spread, and apply the tools of calculus. A random variable is precisely this bridge: it is a function that maps each outcome in the sample space to a real number.

Definition: Random Variable

A random variable is a function \(X: S \to \mathbb{R}\) that assigns a numerical value to each outcome in the sample space \(S\). We denote random variables by capital letters (\(X, Y, Z\)) and specific values they take by the corresponding lowercase letters (\(x, y, z\)).

Random variables come in two fundamental types, depending on the nature of the values they can take. This distinction determines the mathematical tools — summation or integration — used to analyze them.

Discrete Random Variables

A random variable is discrete if the set of values it can take is finite or countably infinite (e.g., \(\{0, 1, 2, \ldots\}\)). The probability structure of a discrete random variable is completely characterized by its probability mass function.

Definition: Probability Mass Function (p.m.f.)

The probability mass function of a discrete random variable \(X\) is the function \[ f(x) = P(X = x) \] satisfying:

  1. \(f(x) \geq 0\) for all \(x\).
  2. \(\sum_{x} f(x) = 1\), where the sum is over all possible values of \(X\).
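As a quick illustrative check (using a fair six-sided die, an example not taken from the text), both defining properties of a p.m.f. can be verified directly; exact arithmetic with `fractions.Fraction` avoids floating-point rounding:

```python
from fractions import Fraction

# Hypothetical example: p.m.f. of a fair six-sided die,
# f(x) = 1/6 for x in {1, ..., 6}.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# Property 1: every mass is non-negative.
assert all(p >= 0 for p in pmf.values())
# Property 2: the masses sum to 1.
assert sum(pmf.values()) == 1
```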

To answer questions of the form "what is the probability that \(X\) is at most \(x\)?", we accumulate the mass function into a running total. This cumulative perspective is especially useful for computing probabilities over intervals.

Definition: Cumulative Distribution Function (Discrete Case)

The cumulative distribution function (c.d.f.) of a discrete random variable \(X\) is \[ F(x) = P(X \leq x) = \sum_{k \leq x} f(k). \] It follows that \(P(a \leq X \leq b) = F(b) - F(a-1)\) for integer-valued random variables.
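A minimal sketch of this accumulation, again using the fair-die example (an assumption for illustration, not from the text): the c.d.f. is a running sum of the p.m.f., and the identity \(P(a \leq X \leq b) = F(b) - F(a-1)\) can be checked directly.

```python
from fractions import Fraction

# Hypothetical example: fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(x):
    """F(x) = P(X <= x), accumulated from the p.m.f."""
    return sum(p for k, p in pmf.items() if k <= x)

# P(2 <= X <= 4) = F(4) - F(2 - 1) for an integer-valued X.
prob = cdf(4) - cdf(2 - 1)
# Three of six equally likely faces lie in {2, 3, 4}, so prob = 1/2.
```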

Continuous Random Variables

A random variable is continuous if it can take any value from one or more intervals of real numbers. Since the set of possible values is uncountably infinite, the probability that \(X\) takes any single specific value is zero: \(P(X = x) = 0\). Instead of assigning mass to individual points, we describe probabilities through a density function whose integral over an interval gives the probability that \(X\) falls in that interval.

Definition: Probability Density Function (p.d.f.)

The probability density function of a continuous random variable \(X\) is a function \(f(x)\) satisfying:

  1. \(f(x) \geq 0\) for all \(x\).
  2. \(\int_{-\infty}^{\infty} f(x)\,dx = 1\).
  3. \(P(a \leq X \leq b) = \int_{a}^{b} f(x)\,dx\) for any \(a \leq b\).

Note that \(f(x)\) itself is not a probability — it is a density. In particular, \(f(x)\) can exceed 1 (e.g., the uniform distribution on \([0, 0.5]\) has density \(f(x) = 2\)).
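The uniform example above can be checked numerically: the density sits at 2 on its support, yet its integral is still 1. The sketch below uses a simple midpoint Riemann sum (an illustrative numerical method, not part of the text):

```python
# Uniform density on [0, 0.5]: f(x) = 2 on that interval, 0 elsewhere.
def f(x):
    return 2.0 if 0.0 <= x <= 0.5 else 0.0

# Midpoint Riemann-sum approximation of the integral of f over [-1, 1].
n = 100_000
a, b = -1.0, 1.0
dx = (b - a) / n
total = sum(f(a + (i + 0.5) * dx) * dx for i in range(n))
# total is approximately 1 even though f(x) = 2 > 1 on the support.
```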

Definition: Cumulative Distribution Function (Continuous Case)

The cumulative distribution function of a continuous random variable \(X\) is \[ F(x) = P(X \leq x) = \int_{-\infty}^{x} f(u)\,du. \] By the Fundamental Theorem of Calculus, the density is recovered as: \[ f(x) = \frac{dF(x)}{dx} \] at every point where \(F\) is differentiable. Furthermore, \(P(a \leq X \leq b) = F(b) - F(a)\).

Note that for a continuous random variable, \(P(X = a) = \int_a^a f(x)\,dx = 0\), so \(P(a \leq X \leq b) = P(a < X < b)\). The distinction between strict and non-strict inequalities only matters in the discrete case.
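The relationship between a density and its c.d.f. can also be verified numerically. As an illustrative example not drawn from the text, take the standard exponential density \(f(x) = e^{-x}\) for \(x \geq 0\), whose c.d.f. is \(F(x) = 1 - e^{-x}\); then \(F(b) - F(a)\) should agree with a numerical integral of \(f\) over \([a, b]\):

```python
import math

# Standard exponential: f(x) = exp(-x) for x >= 0, F(x) = 1 - exp(-x).
def f(x):
    return math.exp(-x) if x >= 0 else 0.0

def F(x):
    return 1 - math.exp(-x) if x >= 0 else 0.0

# P(1 <= X <= 2) two ways: via F(b) - F(a), and by integrating f numerically.
a, b = 1.0, 2.0
exact = F(b) - F(a)
n = 10_000
dx = (b - a) / n
approx = sum(f(a + (i + 0.5) * dx) * dx for i in range(n))
# The two values agree to within the discretization error.
```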

With the language of random variables and their distributions established, we can now ask the two most fundamental questions about any distribution: where is its center, and how spread out is it? These are captured by the expected value and variance, respectively.

Expected Value

The expected value (or mean) of a random variable provides a single number that summarizes the "center" of its distribution. It is a weighted average of all possible values, where each value is weighted by its probability. This concept is indispensable in machine learning: loss functions are expectations, model predictions are conditional expectations, and training algorithms minimize expected risk.

Definition: Expected Value

The expected value of a random variable \(X\) is defined as: \[ \mathbb{E}[X] = \mu = \begin{cases} \displaystyle\sum_{x} x\, f(x) & \text{if } X \text{ is discrete} \\[10pt] \displaystyle\int_{-\infty}^{\infty} x\, f(x)\,dx & \text{if } X \text{ is continuous} \end{cases} \] provided the sum or integral converges absolutely.
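In the discrete case the definition is a direct weighted sum. For the fair-die example (an illustration, not from the text), each face is weighted by \(1/6\):

```python
from fractions import Fraction

# E[X] = sum_x x f(x) for a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
mean = sum(x * p for x, p in pmf.items())
# mean = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 7/2 = 3.5
```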

The expected value can be interpreted as the center of gravity of the distribution. If the distribution is symmetric about some point \(c\) and the mean exists, then \(\mathbb{E}[X] = c\), meaning the mean coincides with the median.

One of the most useful properties of expectation is its linearity, which holds regardless of whether the random variables involved are independent.

Theorem 1: Linearity of Expectation

For any constants \(a, b \in \mathbb{R}\), \[ \mathbb{E}[aX + b] = a\,\mathbb{E}[X] + b. \]

Proof (continuous case):

\[ \begin{align*} \mathbb{E}[aX + b] &= \int_{-\infty}^{\infty} (ax + b)\, f(x)\,dx \\\\ &= a\int_{-\infty}^{\infty} x\, f(x)\,dx + b\int_{-\infty}^{\infty} f(x)\,dx \\\\ &= a\,\mathbb{E}[X] + b. \end{align*} \] The discrete case follows identically by replacing integrals with sums.
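The theorem can also be observed empirically. The sketch below (an illustrative Monte Carlo check, not from the text) estimates \(\mathbb{E}[aX + b]\) from samples of \(X \sim \text{Uniform}(0, 1)\), whose mean is \(1/2\), and compares against the prediction \(a \cdot \tfrac{1}{2} + b\):

```python
import random

random.seed(0)
n = 100_000
samples = [random.random() for _ in range(n)]  # X ~ Uniform(0, 1), E[X] = 1/2

a, b = 3.0, -2.0
est = sum(a * x + b for x in samples) / n  # Monte Carlo estimate of E[aX + b]
# Linearity predicts E[aX + b] = a * 0.5 + b = -0.5; est is close to this.
```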

Knowing the center of a distribution is valuable, but it tells us nothing about how concentrated or dispersed the values are around that center. Two distributions can share the same mean yet differ dramatically in spread. To quantify this spread, we introduce the variance.

Variance

The variance measures the expected squared deviation of a random variable from its mean. A small variance indicates that the values of \(X\) tend to cluster tightly around \(\mu\), while a large variance indicates wide dispersion. In machine learning, variance appears everywhere: in the bias-variance tradeoff, in gradient noise during stochastic optimization, and in the uncertainty quantification of Bayesian predictions.

Definition: Variance and Standard Deviation

The variance of a random variable \(X\) with mean \(\mu = \mathbb{E}[X]\) is: \[ \text{Var}(X) = \sigma^2 = \mathbb{E}\bigl[(X - \mu)^2\bigr] \geq 0. \] The standard deviation is \(\sigma = \sqrt{\text{Var}(X)}\), which has the same units as \(X\).

The definition involves the unknown quantity \(\mathbb{E}[X]\) inside the expectation. Expanding the square yields a computationally convenient alternative: \[ \begin{align*} \text{Var}(X) = \mathbb{E}\bigl[(X - \mu)^2\bigr] &= \mathbb{E}[X^2 - 2\mu X + \mu^2] \\\\ &= \mathbb{E}[X^2] - 2\mu\,\mathbb{E}[X] + \mu^2 \\\\ &= \mathbb{E}[X^2] - \mu^2. \end{align*} \] That is, the variance equals the mean of the square minus the square of the mean. Note that \(\text{Var}(c) = 0\) for any constant \(c\), since a constant has no spread.
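A concrete instance of the shortcut formula, using the fair-die example (an illustration, not from the text): \(\mathbb{E}[X^2] = 91/6\) and \(\mu^2 = 49/4\), so \(\text{Var}(X) = 35/12\).

```python
from fractions import Fraction

# Var(X) = E[X^2] - mu^2 for a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
mu = sum(x * p for x, p in pmf.items())        # 7/2
ex2 = sum(x**2 * p for x, p in pmf.items())    # 91/6
var = ex2 - mu**2                              # 91/6 - 49/4 = 35/12
```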

The following result describes how variance transforms under linear operations. Unlike expectation, variance is affected by scaling but not by translation.

Theorem 2: Variance of a Linear Transformation

For any constants \(a, b \in \mathbb{R}\), \[ \text{Var}(aX + b) = a^2\,\text{Var}(X). \] The additive constant \(b\) shifts the distribution without changing its spread.

Proof:

Using the alternative formula and linearity of expectation: \[ \begin{align*} \text{Var}(aX + b) &= \mathbb{E}\bigl[(aX+b)^2\bigr] - \bigl(\mathbb{E}[aX+b]\bigr)^2 \\\\ &= \bigl(a^2\mathbb{E}[X^2] + 2ab\,\mathbb{E}[X] + b^2\bigr) - \bigl(a\,\mathbb{E}[X] + b\bigr)^2 \\\\ &= a^2\mathbb{E}[X^2] + 2ab\,\mathbb{E}[X] + b^2 - a^2(\mathbb{E}[X])^2 - 2ab\,\mathbb{E}[X] - b^2 \\\\ &= a^2\bigl(\mathbb{E}[X^2] - (\mathbb{E}[X])^2\bigr) \\\\ &= a^2\,\text{Var}(X). \end{align*} \]
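Theorem 2 can be verified exactly on a discrete example (the fair die again, an illustration not from the text): scaling by \(a\) multiplies the variance by \(a^2\), while the shift \(b\) drops out.

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def var(g):
    """Variance of g(X), computed exactly from the die's p.m.f."""
    mu = sum(g(x) * p for x, p in pmf.items())
    return sum((g(x) - mu) ** 2 * p for x, p in pmf.items())

a, b = 3, 5
# Var(aX + b) = a^2 Var(X); the shift b contributes nothing.
assert var(lambda x: a * x + b) == a**2 * var(lambda x: x)
```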

Insight: Random Variables in Machine Learning

The framework of random variables, expectations, and variances is the language in which virtually all of machine learning is written. A model's loss function is an expectation: we minimize \(\mathbb{E}[\ell(Y, \hat{Y})]\) over the data distribution. The bias-variance decomposition shows that a model's expected prediction error decomposes as \(\text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}\), directly using the concepts defined here. In the next parts, we will study specific families of distributions (Gamma and Beta, Gaussian, and others) that serve as building blocks for probabilistic models throughout statistics and machine learning.