III - Probability & Statistics

The Mathematics of Uncertainty and Inference

Probability and statistics are essential to understanding machine learning, because machine learning is a combination of statistics and algorithms. In Linear Algebra, we intentionally avoided applications related to statistics, focusing instead on foundational concepts. Now it is time to build the probabilistic basis of machine learning. This section introduces the essential concepts of probability and statistics (random variables and distributions, statistical inference, information-theoretic quantities, Markov chains, Monte Carlo methods, Gaussian processes) and continues into the measure-theoretic backbone (Lebesgue integration, the Radon-Nikodym theorem, conditional expectation) that supports modern probabilistic machine learning.

At its core, statistics is about inferring unknown parameters from observed outcomes, which is essentially the inverse of probability theory: probability reasons from known parameters to the distribution of outcomes, while statistics reasons from outcomes back to the parameters. Two main approaches dominate statistical inference: frequentist statistics, which treats parameters as fixed and data as random, and Bayesian statistics, which treats the observed data as fixed and parameters as random. Bayesian statistics, in particular, forms the foundation of many machine learning algorithms.
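To make the frequentist/Bayesian contrast concrete, here is a minimal sketch for a coin with unknown heads probability. The function names, the data (7 heads in 10 flips), and the uniform Beta(1, 1) prior are illustrative assumptions, not part of the section itself. The frequentist answer is a single point estimate of a fixed parameter; the Bayesian answer is a full posterior distribution over the parameter, summarized here by its mean.

```python
from fractions import Fraction


def mle_bernoulli(heads, flips):
    """Frequentist view: the parameter is fixed, the data are random.
    The maximum-likelihood estimate is the observed frequency of heads."""
    return Fraction(heads, flips)


def beta_posterior(heads, flips, alpha=1, beta=1):
    """Bayesian view: the data are fixed, the parameter is random.
    With a Beta(alpha, beta) prior (assumed here; Beta(1, 1) is uniform),
    the posterior after observing the flips is again a Beta distribution."""
    return alpha + heads, beta + (flips - heads)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return Fraction(alpha, alpha + beta)


# Hypothetical data: 7 heads in 10 flips.
heads, flips = 7, 10

theta_hat = mle_bernoulli(heads, flips)        # point estimate
a_post, b_post = beta_posterior(heads, flips)  # posterior parameters
posterior_mean = beta_mean(a_post, b_post)     # one summary of the posterior

print("MLE:", theta_hat)
print("Posterior: Beta(%d, %d), mean %s" % (a_post, b_post, posterior_mean))
```

With these numbers the maximum-likelihood estimate is 7/10, while the posterior is Beta(8, 4) with mean 2/3: the prior pulls the estimate slightly toward 1/2, and that pull shrinks as more data arrive.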

Probability serves as the critical bridge between the exactness of algebraic structure and the unpredictability of the real world. In the Compass ecosystem, this section plays the role of The Bridge — the vital junction between The Discrete World (Section IV) and The Continuous World (Section II). Discrete combinatorics from Section IV feeds the counting arguments behind discrete distributions, hypothesis tests, and information entropy; the measure theory and Lebesgue integration developed in Section II underwrite continuous distributions, density functions, and the rigorous treatment of expectation and conditional probability. The algebraic vocabulary of Section I appears throughout — covariance matrices and their spectral decompositions, the exponential-family algebra underlying conjugate priors, and the Fisher information matrix as a Riemannian metric on parameter space — and the entire section eventually feeds the high-dimensional inference engines of Section V.

Section III's most active recent expansion has been a measure-theoretic deepening aimed at bringing the section to the same graduate-level depth already achieved in Sections I (Lie theory) and II (functional analysis). Recent additions develop measure-theoretic probability, product measures and limit theorems, the Radon-Nikodym theorem, and conditional expectation as a Radon-Nikodym derivative — and a forthcoming page on variational inference will build the rigorous ELBO framework on top of these tools. Together, these provide the mathematical foundation behind techniques used throughout modern machine learning: variational autoencoders, KL-regularized policies in reinforcement learning from human feedback (RLHF), score-based diffusion models, and Bayesian neural networks. Looking further ahead, a stochastic calculus track (Brownian motion, the Itô integral, SDEs and Fokker-Planck) is planned to follow once a downstream ML page demands the continuous-time foundation that powers diffusion generation, score matching, and Neural SDE / Neural ODE models.