Intro to Machine Learning

The Law of Intelligence

Artificial intelligence (AI) has become one of the most influential technologies of our time, powering applications from search engines to self-driving cars. Before diving into its technical details, it is worth stepping back and asking: what is intelligence itself, and what does it mean to replicate it artificially?

Consider an analogy from physics. The laws of motion and aerodynamics govern both natural and human-made flight. We accept without hesitation that birds can fly, and we trust airplanes to carry us safely across continents because we understand the shared physical principles. Similarly, if we could uncover the fundamental laws of intelligence, we might someday build machines that "think" with the same confidence we have in machines that fly.

The catch is that we have no such laws yet. Absent them, "artificial intelligence" describes our aspirations as much as our achievements — the term covers everything from genuinely powerful pattern-matching systems to speculative claims that outrun the current science. What we can speak about precisely is a more modest and more rigorous subject: machine learning (ML), the mathematical and algorithmic study of how systems improve at specific tasks by processing data. ML draws on well-developed frameworks such as Bayesian decision theory, information theory, and statistical learning theory, and it is the focus of this section.

A widely cited formal definition comes from computer scientist Tom M. Mitchell:

A computer program is said to learn from experience \(E\), with respect to some class of tasks \(T\) and performance measure \(P\), if its performance at tasks in \(T\), as measured by \(P\), improves with experience \(E\). (T. Mitchell, Machine Learning, McGraw Hill, 1997)

Mathematically, this definition encapsulates the core loop of empirical risk minimization:

Definition: Learning

Given a dataset \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}\), a hypothesis space \(\mathcal{H}\) of candidate models parameterized by \(\theta \in \Theta\), and an objective function \(J(\theta; \mathcal{D})\) measuring prediction quality on \(\mathcal{D}\) (typically based on a loss function, with lower values indicating better predictions), the learner seeks parameters \(\theta^*\) that minimize \(J\): \[ \theta^* = \arg\min_{\theta \in \Theta} \; J(\theta; \mathcal{D}). \]

Every component of this formulation draws on earlier sections: the parameter space \(\Theta\) lives in a vector space, the optimization is driven by gradient-based methods, the objective often involves a likelihood or posterior, and the algorithmic procedure has a well-defined computational complexity.
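As a minimal sketch of this definition, the snippet below minimizes a mean-squared-error objective for a linear model by plain gradient descent. The synthetic data, the step size, and the iteration count are illustrative assumptions, not values taken from this section.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset D = {(x_i, y_i)}: y = 2x + 1 plus small noise (illustrative values).
N = 200
x = rng.uniform(-1.0, 1.0, size=N)
y = 2.0 * x + 1.0 + 0.05 * rng.normal(size=N)

# Design matrix with a bias column, so theta = (slope, intercept).
X = np.column_stack([x, np.ones(N)])

def J(theta):
    """Objective J(theta; D): mean squared error of the linear model on D."""
    residual = X @ theta - y
    return np.mean(residual ** 2)

def grad_J(theta):
    """Gradient of J with respect to theta."""
    return (2.0 / N) * X.T @ (X @ theta - y)

theta = np.zeros(2)   # initial guess
step_size = 0.5       # assumed learning rate, small enough for this problem
for _ in range(500):
    theta = theta - step_size * grad_J(theta)

print("theta* is approximately", theta, "with J =", J(theta))
```

Here \(\Theta = \mathbb{R}^2\), the hypothesis space is the set of lines, and the loop is one simple way to approximate \(\theta^*\); for this particular objective a closed-form least-squares solution also exists.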

One branch of machine learning, called deep learning, uses large neural networks to handle complex tasks in application areas such as:

Autonomous vehicles:
Self-driving cars and drones leverage deep learning to interpret sensor data and assist in navigation and real-time decision-making.
Medical diagnostics:
Deep learning models analyze medical images and patient data to detect diseases such as cancer and heart conditions, enabling early diagnosis and personalized treatment.
Scientific discovery:
AI accelerates research in fields like structural biology and genomics by predicting protein structures (e.g., AlphaFold) and genetic regulatory patterns, contributing to drug discovery pipelines.

One of the most impactful applications of deep learning today is the development of Large Language Models (LLMs). These models are built using deep neural networks - specifically the transformer architecture - and are trained on massive text datasets. LLMs have demonstrated strong performance on language understanding, text generation, translation, and tasks that require multi-step inference.

Beyond digital text processing, modern ML is increasingly extending into Physical AI and autonomous systems. Real-world interaction is inherently stochastic — sensor noise, unobserved physical properties, and environmental variability mean that even classical robotic systems incorporate probabilistic methods (e.g., Kalman filtering). Modern Physical AI extends this further by treating system states as full probability distributions, enabling a system to systematically quantify its own epistemic uncertainty and establishing a mathematical foundation for real-time safety and risk management.

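With this broad motivation in hand, we now turn to the distinction that organizes the field: whether a model learns from labeled data (supervised learning), from unlabeled data (unsupervised learning), or from interaction with an environment (reinforcement learning).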

Learning Paradigms

Machine learning encompasses various approaches, primarily distinguished by the presence or absence of labeled data. This distinction determines the mathematical formulation of the learning problem.

Supervised Learning:

The model is trained on a labeled dataset \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}\), where each input \(x_i \in \mathbb{R}^D\) is paired with an output label \(y_i\). The goal is to learn a mapping \(f: \mathbb{R}^D \to \mathcal{Y}\) that generalizes well to unseen data. From a probabilistic perspective, we model the conditional distribution \(p(y \mid x; \theta)\) and optimize parameters via maximum likelihood estimation (MLE) or Bayesian inference. Common applications include image classification, spam detection, and medical diagnosis.
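To make the maximum-likelihood view concrete, here is a short derivation under an added assumption: a regression model with Gaussian noise, \(p(y \mid x; \theta) = \mathcal{N}(y \mid f_\theta(x), \sigma^2)\), with fixed variance \(\sigma^2\). The symbols \(f_\theta\) and \(\sigma^2\) are introduced here for illustration only. Assuming the examples are independent, MLE maximizes the log-likelihood of the dataset:

\[ \theta_{\mathrm{MLE}} = \arg\max_{\theta \in \Theta} \sum_{i=1}^{N} \log p(y_i \mid x_i; \theta) = \arg\min_{\theta \in \Theta} \left[ \frac{1}{2\sigma^2} \sum_{i=1}^{N} \bigl(y_i - f_\theta(x_i)\bigr)^2 + \frac{N}{2} \log\bigl(2\pi\sigma^2\bigr) \right]. \]

The second term does not depend on \(\theta\), so under this assumption maximum likelihood coincides with minimizing the squared error, which is one instance of the objective \(J(\theta; \mathcal{D})\).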

Unsupervised Learning:

Here the dataset consists only of inputs \(\mathcal{D} = \{x_i\}_{i=1}^{N}\) with no target labels. The model attempts to uncover intrinsic structures or underlying patterns in two primary ways:

Structural Analysis: Identifying discrete groupings (clustering) or finding low-dimensional latent representations (dimensionality reduction) that capture the data's dominant variance.
Generative Modeling: Explicitly modeling the data-generating distribution \(p(x; \theta)\). This allows the system to not only understand existing data but also to synthesize new samples and quantify epistemic uncertainty—a critical capability for safety in Physical AI.

Applications include customer segmentation, anomaly detection, and synthetic data generation.

While Supervised and Unsupervised learning form the mathematical foundation of data modeling, a third major paradigm exists where data is acquired through action:

Reinforcement Learning (RL):

Unlike learning from a static dataset, an agent interacts with a dynamic environment to maximize a scalar reward signal. This is a fundamentally different formulation where the model must balance exploration of the unknown with exploitation of known rewards. In Physical AI, RL increasingly incorporates uncertainty-aware constraints, allowing robots to recognize "out-of-distribution" states and trigger safety aborts to prevent hardware damage. (See: Reinforcement Learning)
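The exploration-exploitation balance can be illustrated with an epsilon-greedy strategy on a multi-armed bandit, which is a deliberately simplified, stateless special case of RL. The reward probabilities, the value of epsilon, and the number of steps below are hypothetical choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

true_means = np.array([0.2, 0.5, 0.8])   # hypothetical expected reward of each action
n_actions = len(true_means)
epsilon = 0.1                            # probability of taking a random (exploratory) action

value_estimates = np.zeros(n_actions)    # running mean reward observed per action
counts = np.zeros(n_actions, dtype=int)  # how often each action was taken

for _ in range(5000):
    if rng.random() < epsilon:
        action = int(rng.integers(n_actions))      # explore: pick any action
    else:
        action = int(np.argmax(value_estimates))   # exploit: pick the best-looking action
    reward = float(rng.random() < true_means[action])  # Bernoulli reward from the environment
    counts[action] += 1
    # Incremental update of the running mean for the chosen action.
    value_estimates[action] += (reward - value_estimates[action]) / counts[action]

print("estimated values:", np.round(value_estimates, 2))
print("times each action was taken:", counts)
```

With epsilon set to zero the agent can lock onto the first action that happens to pay off; the occasional random action is what lets it discover better ones. Full RL adds states and delayed rewards on top of this idea.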

Modern machine learning often bridges these three major paradigms (Supervised, Unsupervised, and RL) through hybrid approaches:

Semi-Supervised Learning (Supervised + Unsupervised):
Combines small labeled sets with vast unlabeled data. The unlabeled data exposes the structure of the input distribution, helping refine decision boundaries when manual labeling is prohibitively expensive.
Self-Supervised Learning (Unsupervised as Supervised):
The system generates its own supervisory signals from the raw input (e.g., predicting the next token, or masked words). Large language model pretraining is the most prominent application, allowing models to learn broadly useful representations before any task-specific fine-tuning.
RLHF (Reinforcement Learning from Human Feedback):
A hybrid of Supervised and RL used to align models (like LLMs) with human intent. A reward model is first trained on human-ranked data (Supervised), which then guides the RL process to fine-tune the agent's behavior.
World Models (Generative + RL):
In Physical AI, an agent uses Generative Modeling to learn a "mental model" of the environment's physics from unlabeled sensory data. The agent then performs Reinforcement Learning within this simulated internal world, improving sample efficiency and safety.
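The self-supervised idea of deriving labels from raw input can be sketched in a few lines. The example below builds next-token and masked-word training pairs from a single sentence; whitespace tokenization and the `[MASK]` placeholder are simplifying assumptions for illustration, not the procedure of any particular model.

```python
text = "the cat sat on the mat"
tokens = text.split()  # whitespace tokenization, a simplifying assumption

# Next-token prediction: the input is a prefix, the label is the token that follows it.
next_token_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Masked-word prediction: hide one position and use the hidden word as the label.
MASK = "[MASK]"
masked_pairs = [
    (tokens[:i] + [MASK] + tokens[i + 1:], tokens[i])
    for i in range(len(tokens))
]

for context, label in next_token_pairs[:3]:
    print(context, "->", label)
```

No human annotation is involved: every (input, label) pair comes from the text itself, which is why this setup scales to very large unlabeled corpora.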

With these core paradigms and hybrid strategies established, we can now classify the concrete tasks that machine learning algorithms are designed to solve.

Basic Task Categories

The learning paradigms above (supervised, unsupervised, reinforcement) describe how a model learns. Orthogonal to this is the question of what the model predicts. Machine learning tasks are broadly categorized based on the nature of the output space:

Regression:
Predicts a continuous numerical value \(y \in \mathbb{R}\). The model learns a function \(f: \mathbb{R}^D \to \mathbb{R}\) that minimizes a loss such as the squared error. This is where least-squares theory and gradient descent find their most direct application.

Linear regression, polynomial regression, Ridge regression
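As a hedged illustration of these three methods together, the sketch below fits a polynomial model by ordinary least squares and by ridge regression using the closed-form solution \((X^\top X + \lambda I)^{-1} X^\top y\). The synthetic data, the polynomial degree, and the value of \(\lambda\) are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples of a smooth function (illustrative data).
x = np.linspace(-1.0, 1.0, 30)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=x.size)

def polynomial_features(x, degree):
    """Design matrix with columns x^0, x^1, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit_ridge(X, y, lam):
    """Closed-form ridge solution; lam = 0 gives ordinary least squares.

    For simplicity the intercept column is penalized as well.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X = polynomial_features(x, degree=7)
theta_ols = fit_ridge(X, y, lam=0.0)
theta_ridge = fit_ridge(X, y, lam=1e-2)

print("OLS coefficient norm:  ", np.linalg.norm(theta_ols))
print("ridge coefficient norm:", np.linalg.norm(theta_ridge))
```

The model is linear in its parameters even though it is polynomial in \(x\), which is why the same least-squares machinery applies. The ridge penalty shrinks the coefficient vector; in practice `np.linalg.lstsq` is numerically safer than solving the normal equations directly.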

Classification:
Assigns inputs to predefined discrete categories \(y \in \{1, \ldots, K\}\). Models like logistic regression and neural networks typically output a probability distribution over classes via the softmax function (which reduces to the logistic sigmoid when \(K = 2\)), connecting directly to maximum likelihood estimation. Other geometric approaches, such as support vector machines (SVM), optimize a decision boundary by maximizing the margin between classes.

Logistic regression, support vector machines, neural networks
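The link between softmax outputs and maximum likelihood can be sketched as follows. The logits and labels are made-up values, and classes are indexed from 0 in the code rather than from 1 as in the text.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow without changing the result."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true classes (labels are 0-indexed)."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

# Two examples, three classes (illustrative scores from some model).
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 2])

probs = softmax(logits)
loss = cross_entropy(probs, labels)
print("class probabilities:\n", probs)
print("cross-entropy loss:", loss)
```

Minimizing this cross-entropy over the model parameters is the same as maximizing the likelihood of the observed labels under the predicted class distribution.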

Clustering:
Groups similar data points without predefined labels, effectively partitioning the input space into coherent regions. The notion of "similarity" relies on distance metrics and the covariance structure of the data.

K-means clustering
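A minimal sketch of K-means (Lloyd's algorithm) with Euclidean distance is shown below. The two synthetic blobs, the random initialization, and the fixed iteration count are assumptions made for the example; production implementations add smarter initialization and a convergence check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated synthetic blobs in 2D (illustrative data).
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[3.0, 3.0], scale=0.3, size=(50, 2))
X = np.vstack([blob_a, blob_b])

def kmeans(X, k, n_iters=20, seed=0):
    """Lloyd's algorithm: alternate between assigning points and updating centroids."""
    init_rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct data points.
    centroids = X[init_rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (a centroid with no points is left where it is).
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

centroids, labels = kmeans(X, k=2)
print("centroids:\n", np.round(centroids, 2))
```

The choice of distance is what encodes "similarity" here: swapping the Euclidean norm for another metric changes which partitions the algorithm considers coherent.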

Dimensionality Reduction:
Reduces the number of input variables while preserving essential information. Mathematically, this seeks a low-dimensional subspace (or manifold) that captures the dominant variation in the data. Linear methods such as PCA accomplish this through eigenvalue decomposition of the data covariance; kernel PCA applies the same decomposition in an implicit non-linear feature space; autoencoders instead learn a non-linear mapping by training neural networks.

Principal Component Analysis (PCA), Kernel PCA, Autoencoders.
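The eigenvalue view of PCA can be sketched directly with NumPy. The synthetic 3D data and the choice of a single retained component are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3D data that varies mostly along one direction (illustrative).
latent = rng.normal(size=(500, 1))
direction = np.array([[2.0, 1.0, 0.5]])
X = latent @ direction + 0.1 * rng.normal(size=(500, 3))

# Center the data and form the sample covariance matrix.
X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / (len(X) - 1)

# eigh handles symmetric matrices and returns eigenvalues in ascending order,
# so reorder to put the largest-variance directions first.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 1                                      # number of components to keep
Z = X_centered @ eigenvectors[:, :k]       # low-dimensional representation
explained = eigenvalues[:k].sum() / eigenvalues.sum()

print("fraction of variance explained by", k, "component(s):", round(float(explained), 4))
```

The leading eigenvector recovers the dominant direction of variation (up to sign), and the eigenvalues measure how much variance each direction carries.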

Generative Modeling:
Learns the underlying data distribution to generate new, synthetic samples that share the same statistical properties as the training set. Moving beyond deterministic dimensionality reduction, approaches like the Variational Autoencoder (VAE) map data to a continuous latent distribution characterized by a mean and variance. This balances data reconstruction fidelity with probabilistic regularization via Kullback-Leibler (KL) divergence, forcing the learned latent space to approximate a structured prior.

VAEs
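For the common case of a diagonal Gaussian latent distribution and a standard normal prior (an assumption of this sketch, not the only possible choice), the KL regularizer has the closed form \(\tfrac{1}{2} \sum_j (\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1)\). The helper names below are hypothetical, and the encoder and decoder networks are omitted.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0, axis=-1)

def sample_latent(mu, log_var, rng):
    """Reparameterized sample: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.0])        # mean produced by a hypothetical encoder
log_var = np.array([0.0, 0.0])   # log-variance produced by the same encoder

z = sample_latent(mu, log_var, rng)
print("latent sample:", z)
print("KL penalty:", kl_to_standard_normal(mu, log_var))
```

The KL term is zero exactly when the latent distribution equals the prior, so it pulls the encoder's outputs toward the structured prior while the reconstruction term pulls them toward informative codes.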

Policy Optimization / Control:
Outputs an action \(a \in \mathcal{A}\) given a state \(s \in \mathcal{S}\). This is the task executed within the Reinforcement Learning paradigm, vital for robotics and autonomous navigation.

Methods

From a unified probabilistic perspective, these categories can be viewed as different ways of modeling the data-generating process. Whether we are predicting a continuous value (Regression), assigning a label (Classification), or discovering hidden manifolds (Dimensionality Reduction), we are essentially seeking the underlying mathematical structure that governs the observed data.

It is crucial to recognize that these categories are not mutually exclusive; in practice, they frequently overlap. For instance, a Variational Autoencoder (VAE) simultaneously performs non-linear dimensionality reduction and generative modeling. Modern architectures often integrate multiple paradigms to handle complex, high-dimensional data.

Each of these categories is explored in dedicated pages within this section. Regardless of the specific task, all machine learning methods share a common workflow, which we outline next.

Standard Process of ML

Regardless of whether we are performing regression, classification, or clustering, every machine learning project follows a common pipeline. Understanding this pipeline matters because each step introduces its own mathematical and engineering considerations — from the statistical properties of the data to the convergence guarantees of the optimizer to the deployment realities of distribution shift.

The Machine Learning Pipeline

Problem Definition.
Articulate the problem precisely and decide whether machine learning is the right tool. This means specifying the input space \(\mathcal{X}\), the output space \(\mathcal{Y}\), and the performance criterion. ML failures often trace back to vague problem definitions: a model can only optimize what we measure, so the choice of metric quietly determines what behavior the system will learn — a manifestation of Goodhart's law: when a measure becomes a target, it ceases to be a good measure.
Data Collection and Curation.
Gather data that is relevant, diverse, and representative of the deployment distribution. For most of the 2010s the dominant strategy was simply to collect more data — bigger was better. Since around 2023, this trend has partially reversed: data quality has come to rival data quantity as the binding constraint on model performance. Carefully curated, deduplicated, and filtered datasets routinely outperform much larger noisy ones, and licensing, provenance, and contamination have become first-class engineering concerns. The data must be sufficiently rich to capture the underlying distribution \(p(x, y)\) we wish to model — but no richer than what we can verify and trust.
Data Preprocessing.
Clean and prepare the data: handle missing values, encode categorical variables, normalize features. Feature scaling, for instance, ensures that gradient descent converges efficiently by improving the condition number of the optimization landscape. For text and image data, preprocessing also includes tokenization, augmentation, and standardization choices that often matter as much as the model architecture.
Data Splitting and Contamination Control.
Divide the dataset into training, validation, and test sets. This separation is essential for estimating generalization performance and is formalized through cross-validation techniques. In the foundation-model era, an additional concern has emerged: contamination — the test set may overlap with the corpus on which a pretrained model was already trained, inflating evaluation scores without genuine generalization. Modern benchmarking practice requires explicit contamination checks and, increasingly, the use of held-out or freshly constructed evaluation sets.
Model Selection.
Choose a hypothesis class \(\mathcal{H}\) appropriate to the problem. In classical ML this means picking a model family — linear, tree-based, kernel-based, or a neural-network architecture — and confronting the fundamental bias-variance tradeoff: more expressive models fit training data better (lower bias) but may generalize poorly (higher variance). In modern practice, model selection often reduces to choosing among pretrained foundation models (a vision transformer, a language model, a diffusion backbone) and a transfer strategy: full fine-tuning, parameter-efficient methods like LoRA or adapters, or zero-shot prompting. The mathematical principles — capacity, regularization, generalization — apply equally to both regimes.
Training or Fine-Tuning.
Optimize the model parameters \(\theta\) by minimizing an empirical loss on the training set. For neural networks, this involves backpropagation — an efficient application of the chain rule — combined with stochastic gradient descent or its variants (AdamW — derived from the earlier Adam algorithm — and more recent alternatives such as Lion). When starting from a pretrained model, training is usually called fine-tuning and uses smaller learning rates, fewer steps, and often a fraction of the original parameters. Specialized training paradigms — supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), and knowledge distillation — extend this basic gradient-descent loop in domain-specific ways.
Evaluation.
Assess the model's performance on the validation set using metrics appropriate to the task: accuracy, precision, recall, and F1 for classification; mean squared error for regression; BLEU, ROUGE, and pairwise human preference for generative tasks. Beyond task accuracy, modern evaluation increasingly includes behavioral and safety assessments: robustness to adversarial inputs, calibration of uncertainty estimates, fairness across subgroups, and red-team probing for harmful outputs. A model that scores high on task accuracy but fails safety evaluation is not yet ready for deployment.
Hyperparameter Tuning.
Optimize hyperparameters that govern the learning process — regularization strength \(\lambda\), learning rate, batch size, network depth, dropout rate — using grid search, random search, or Bayesian optimization. Hyperparameter tuning sits in an outer loop around training and is computationally expensive; principled methods (e.g., successive halving, population-based training) can reduce the cost substantially.
Testing.
Evaluate the final model on the held-out test set to obtain an unbiased estimate of generalization performance. The test set must remain untouched throughout hyperparameter tuning and model selection — every peek at the test set during development is a small leak that biases the final estimate. Statistical rigor at this stage is what separates a trustworthy result from an over-tuned one.
Deployment and Monitoring.
Integrate the model into a production environment and monitor its behavior continuously. In practice, the data distribution drifts over time (distribution shift), requiring periodic retraining. For safety-critical systems, this phase relies on out-of-distribution (OOD) detection: if a model's estimated epistemic uncertainty exceeds a learned threshold, the system recognizes it is operating outside its confident regime and can trigger fallback policies or immediate aborts.
Inference-Time Compute.
A relatively new addition to the pipeline, popularized by reasoning-focused models from 2024 onward (OpenAI's o1, DeepSeek's R1, and successors). Rather than fixing the amount of compute at training time, the system spends additional compute at inference — generating multiple candidate solutions, performing internal search or reflection, or extending its chain of reasoning (chain-of-thought) — and selects the best output. This recasts model performance as a function of three axes (training data, model parameters, and inference compute) rather than two, and is reshaping how the community thinks about the cost-quality frontier.
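Several of the stages above (splitting, preprocessing, training, hyperparameter tuning, and testing) can be sketched end to end on a toy problem. The synthetic data, the 60/20/20 split, the ridge model, and the grid of \(\lambda\) values are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data whose features live on very different scales.
N, D = 300, 5
feature_scales = np.array([1.0, 10.0, 0.1, 5.0, 1.0])
X = rng.normal(size=(N, D)) * feature_scales
true_theta = np.array([1.0, 0.2, -3.0, 0.0, 0.5])
y = X @ true_theta + 0.5 * rng.normal(size=N)

# Data splitting: 60% train, 20% validation, 20% test.
perm = rng.permutation(N)
train_idx, val_idx, test_idx = perm[:180], perm[180:240], perm[240:]

# Preprocessing: scaling statistics come from the training split only,
# so no information leaks from the validation or test data.
x_mean, x_std = X[train_idx].mean(axis=0), X[train_idx].std(axis=0)
y_mean = y[train_idx].mean()

def prepare(idx):
    return (X[idx] - x_mean) / x_std, y[idx] - y_mean

(X_train, y_train), (X_val, y_val), (X_test, y_test) = map(
    prepare, (train_idx, val_idx, test_idx)
)

def fit_ridge(X, y, lam):
    """Training: closed-form ridge regression."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, theta):
    """Evaluation metric: mean squared error."""
    return float(np.mean((X @ theta - y) ** 2))

# Hyperparameter tuning: grid search over lambda, scored on the validation set.
lambdas = [1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0]
val_errors = {lam: mse(X_val, y_val, fit_ridge(X_train, y_train, lam)) for lam in lambdas}
best_lam = min(val_errors, key=val_errors.get)

# Testing: a single evaluation on the held-out test set with the chosen lambda.
theta_final = fit_ridge(X_train, y_train, best_lam)
test_mse = mse(X_test, y_test, theta_final)
print("best lambda:", best_lam, "test MSE:", round(test_mse, 4))
```

The test indices are touched exactly once, after every modeling decision has been made, which is what keeps the final number an honest estimate.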

This pipeline reflects the workflow of a classical ML project built from scratch. In foundation-model-era practice, several stages (large-scale data collection, expensive pretraining) are typically replaced by transfer from a pretrained model, and inference-time compute has emerged as a third axis of optimization alongside training data and model parameters. The mathematical principles examined throughout this section (convergence, generalization, regularization, optimization geometry) apply across both regimes.
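The simplest form of inference-time compute, best-of-N selection, can be illustrated with a toy stand-in. Both `generate_candidate` and `verifier_score` below are hypothetical placeholders for a stochastic model and an output checker; the task (finding \(z\) with \(z^2 = 2\)) is chosen only because its answers are easy to verify.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_candidate(rng):
    """Hypothetical stand-in for a stochastic model: a noisy guess at the answer."""
    return 1.0 + rng.normal(scale=0.5)

def verifier_score(z):
    """Hypothetical verifier: higher is better; it checks z*z = 2 without knowing the answer."""
    return -abs(z * z - 2.0)

# Draw one pool of candidates so that larger budgets extend smaller ones.
candidates = [generate_candidate(rng) for _ in range(64)]

def best_of_n(n):
    """Spend more inference compute by scoring the first n candidates and keeping the best."""
    return max(candidates[:n], key=verifier_score)

def residual(z):
    return abs(z * z - 2.0)

for n in (1, 4, 16, 64):
    answer = best_of_n(n)
    print(f"N={n:2d}  answer={answer:.4f}  residual={residual(answer):.4f}")
```

The model itself never changes; only the number of samples scored at inference does. Because each budget reuses the candidates of the smaller one, the residual can only stay the same or shrink as N grows, which is the cost-quality trade-off in miniature.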

The ML pipeline is a concrete synthesis of every section on this website. Data representation relies on Linear Algebra to Algebraic Foundations (Section I) - each data point is a vector, each dataset is a matrix, and transformations like PCA are eigenvalue problems. Optimization is governed by Calculus to Optimization & Analysis (Section II) - gradient descent, convexity, and convergence rates determine whether training succeeds. Generalization is a question of Probability & Statistics (Section III) - from the bias-variance trade-off to the law of large numbers justifying empirical risk minimization. Finally, the computational feasibility of each algorithm depends on Discrete Mathematics & Algorithms (Section IV). In the pages that follow, we explore each of these connections in depth.

Looking further ahead, these foundations converge in a recurring viewpoint that is becoming increasingly central to modern ML: Geometric Deep Learning (GDL). GDL is the framework that unifies Convolutional Neural Networks (translation symmetry), Graph Neural Networks (permutation symmetry on nodes), and equivariant networks for 3D data (rotation/translation symmetry under Lie groups such as \(SO(3)\) and \(SE(3)\)) under a single principle: the architecture of a neural network should respect the symmetries of the data it operates on. The mathematical machinery this requires — Lie groups, smooth manifolds, group representations, and the graph Laplacian — is developed across Sections I, II, and IV in preparation for the GDL viewpoint pages that follow.
