Intro to Machine Learning

  • The Law of Intelligence
  • Learning Paradigms
  • Basic Task Categories
  • Standard Process of ML

The Law of Intelligence

Artificial intelligence (AI) has become one of the most influential technologies of our time, powering applications from search engines to self-driving cars. Before diving into its technical details, it is worth stepping back and asking: what is intelligence itself, and what does it mean to replicate it artificially?

Consider an analogy from physics. The laws of motion and aerodynamics govern both natural and human-made flight. We accept without hesitation that birds can fly, and we trust airplanes to carry us safely across continents because we understand the shared physical principles. Similarly, if we could uncover the fundamental laws of intelligence, we might someday build machines that "think" with the same confidence we have in machines that fly.

Creating a truly intelligent system - one that rivals the flexibility and generality of the human mind - remains an open scientific challenge. Nevertheless, we do have guiding principles. Modern approaches to AI are built on frameworks like Bayesian decision theory and information processing. These form the theoretical foundation for machine learning (ML), a subfield of artificial intelligence that focuses on developing algorithms enabling computers to learn from data and improve their performance on specific tasks over time.

A widely cited formal definition comes from computer scientist Tom M. Mitchell:

A computer program is said to learn from experience \(E\), with respect to some class of tasks \(T\) and performance measure \(P\), if its performance at tasks in \(T\), as measured by \(P\), improves with experience \(E\). (T. Mitchell, Machine Learning, McGraw Hill, 1997)

Mathematically, this definition encapsulates the core loop of empirical risk minimization:

Definition: Learning

Given a hypothesis space \(\mathcal{H}\) of candidate models parameterized by \(\theta \in \Theta\), an objective function \(J(\theta)\) measuring prediction quality (typically based on a loss function), and a dataset \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}\), the learner seeks optimal parameters \(\theta^*\) that minimize \(J\): \[ \theta^* = \arg\min_{\theta \in \Theta} \; J(\theta; \mathcal{D}). \]

Every component of this formulation draws on earlier sections: the parameter space \(\Theta\) lives in a vector space, the optimization is driven by gradient-based methods, the objective often involves a likelihood or posterior, and the algorithmic procedure has a well-defined computational complexity.
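To make the definition concrete, here is a minimal sketch of empirical risk minimization for a one-parameter model, assuming a squared-error loss and plain gradient descent (the dataset, learning rate, and iteration count are all illustrative):

```python
import numpy as np

# Toy dataset D = {(x_i, y_i)}: y is approximately 2*x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + rng.normal(0, 0.1, size=100)

def J(theta, x, y):
    """Objective J(theta; D): mean squared error over the dataset."""
    return np.mean((theta * x - y) ** 2)

# Gradient descent: theta <- theta - lr * dJ/dtheta
theta, lr = 0.0, 0.1
for _ in range(200):
    grad = np.mean(2 * (theta * x - y) * x)  # dJ/dtheta
    theta -= lr * grad

print(theta)  # should converge near the true slope 2.0
```

The hypothesis space here is the one-dimensional family \(\{f_\theta(x) = \theta x\}\); richer models change \(\Theta\) and \(J\), but the loop is the same.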

One branch of machine learning, called deep learning, uses large neural networks to perform complex tasks such as image recognition, speech processing, and natural language understanding.

One of the most impactful applications of deep learning today is the development of Large Language Models (LLMs). These models are built using deep neural networks - specifically the transformer architecture - and are trained on massive text datasets. LLMs have demonstrated remarkable capabilities in language understanding, text generation, translation, and even reasoning.

Beyond digital text processing, modern AI is increasingly extending into Physical AI and autonomous systems. While classical robotics often relies on deterministic models, real-world interaction is inherently stochastic. Modern frameworks must go beyond computing point estimates; they must quantify epistemic uncertainty. By treating system states as probability distributions, an AI can rigorously measure its own uncertainty, establishing a mathematical foundation for real-time safety and risk management.

With this broad motivation in hand, we now turn to the fundamental distinction that organizes the field: the difference between supervised and unsupervised learning.

Learning Paradigms

Machine learning encompasses various approaches, primarily distinguished by the presence or absence of labeled data. This distinction determines the mathematical formulation of the learning problem.

Supervised Learning:

The model is trained on a labeled dataset \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}\), where each input \(x_i \in \mathbb{R}^D\) is paired with an output label \(y_i\). The goal is to learn a mapping \(f: \mathbb{R}^D \to \mathcal{Y}\) that generalizes well to unseen data. From a probabilistic perspective, we model the conditional distribution \(p(y \mid x; \theta)\) and optimize parameters via maximum likelihood estimation (MLE) or Bayesian inference. Common applications include image classification, spam detection, and medical diagnosis.
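As a small illustration of the supervised setting, the sketch below fits a linear model by maximum likelihood on synthetic data; under a Gaussian noise assumption, MLE for \(p(y \mid x; \theta)\) reduces to ordinary least squares (the weights and sample sizes are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 200, 3
X = rng.normal(size=(N, D))                   # inputs x_i in R^D
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(0, 0.1, size=N)   # labels with Gaussian noise

# Under y = x^T w + eps, eps ~ N(0, sigma^2), maximizing the likelihood
# p(y | x; w) is equivalent to minimizing squared error; the minimizer
# solves the normal equations (X^T X) w = X^T y.
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

print(w_mle)  # close to true_w
```

The same recipe generalizes: for classification, modeling \(p(y \mid x; \theta)\) with a Bernoulli likelihood yields logistic regression instead.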

Unsupervised Learning:

Here the dataset consists only of inputs \(\mathcal{D} = \{x_i\}_{i=1}^{N}\) with no target labels. The model attempts to uncover intrinsic structures or underlying patterns in two primary ways:

  • Structural Analysis: Identifying discrete groupings (clustering) or finding low-dimensional latent representations (dimensionality reduction) that capture the data's dominant variance.
  • Generative Modeling: Explicitly modeling the data-generating distribution \(p(x; \theta)\). This allows the system to not only understand existing data but also to synthesize new samples and quantify epistemic uncertainty—a critical capability for safety in Physical AI.
Applications include customer segmentation, anomaly detection, and synthetic data generation.
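As one concrete example of structural analysis, here is a minimal k-means clustering sketch on synthetic two-blob data; the cluster locations, spread, and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
# Unlabeled data: two Gaussian blobs, no targets y_i given
X = np.vstack([rng.normal(loc=[0, 0], scale=0.3, size=(50, 2)),
               rng.normal(loc=[3, 3], scale=0.3, size=(50, 2))])

# k-means with k=2: alternate assignment and centroid update (Lloyd's algorithm)
centers = X[rng.choice(len(X), size=2, replace=False)]
for _ in range(20):
    # Assign each point to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute each centroid as the mean of its assigned points
    centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])

print(centers)  # approximately the two blob means
```

No labels were used anywhere: the grouping emerges purely from the geometry of the inputs.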

While Supervised and Unsupervised learning form the mathematical foundation of data modeling, a third major paradigm exists where data is acquired through action:

Reinforcement Learning (RL):

Unlike learning from a static dataset, an agent interacts with a dynamic environment to maximize a scalar reward signal. This is a fundamentally different formulation where the model must balance exploration of the unknown with exploitation of known rewards. In Physical AI, RL increasingly incorporates uncertainty-aware constraints, allowing robots to recognize "out-of-distribution" states and trigger safety aborts to prevent hardware damage. (See: Reinforcement Learning)
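The exploration-exploitation balance can be seen in the simplest RL setting, a multi-armed bandit. This sketch uses an epsilon-greedy agent on made-up arm rewards (the reward means, noise level, and epsilon are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
true_means = np.array([0.2, 0.5, 0.8])  # hidden expected reward per action
Q = np.zeros(3)       # estimated value of each action
counts = np.zeros(3)
epsilon = 0.1         # probability of exploring a random action

for t in range(2000):
    if rng.random() < epsilon:
        a = rng.integers(3)                 # explore: try a random action
    else:
        a = int(np.argmax(Q))               # exploit: best current estimate
    r = rng.normal(true_means[a], 0.1)      # scalar reward from the environment
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]          # incremental mean update

print(np.argmax(Q))  # the agent should identify arm 2 as best
```

Without the random exploration step the agent could latch onto the first action that yields positive reward and never discover the better arm.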

Modern machine learning often bridges these three major paradigms (Supervised, Unsupervised, and RL) through hybrid approaches such as semi-supervised learning, where a small labeled set is augmented with abundant unlabeled data, and self-supervised learning, where training labels are constructed automatically from the data itself.

With these core paradigms and hybrid strategies established, we can now classify the concrete tasks that machine learning algorithms are designed to solve.

Basic Task Categories

The learning paradigms above (supervised, unsupervised, reinforcement) describe how a model learns. Orthogonal to this is the question of what the model predicts. Machine learning tasks are broadly categorized by the nature of the output space: regression (continuous outputs), classification (discrete labels), clustering, dimensionality reduction, and generative modeling.

From a unified probabilistic perspective, these categories can be viewed as different ways of modeling the data-generating process. Whether we are predicting a continuous value (Regression), assigning a label (Classification), or discovering hidden manifolds (Dimensionality Reduction), we are essentially seeking the underlying mathematical structure that governs the observed data.
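For instance, dimensionality reduction via PCA reduces to an eigenvalue problem on the data covariance. A minimal sketch on synthetic 2-D data whose variance is dominated by one direction (the data-generating coefficients are made up):

```python
import numpy as np

rng = np.random.default_rng(4)
# Two correlated features: most variance lies along a single direction
z = rng.normal(size=200)
X = np.column_stack([z, 0.5 * z + 0.1 * rng.normal(size=200)])

Xc = X - X.mean(axis=0)                  # center the data
cov = Xc.T @ Xc / (len(Xc) - 1)          # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

pc1 = eigvecs[:, -1]                     # first principal component
explained = eigvals[-1] / eigvals.sum()  # fraction of variance along pc1
print(explained)
```

The top eigenvector is exactly the "underlying mathematical structure" in this case: a one-dimensional subspace that captures nearly all of the observed variance.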

It is crucial to recognize that these categories are not mutually exclusive; in practice, they frequently overlap. For instance, a Variational Autoencoder (VAE) simultaneously performs non-linear dimensionality reduction and generative modeling. Modern architectures often integrate multiple paradigms to handle complex, high-dimensional data.

Each of these categories is explored in dedicated pages within this section. Regardless of the specific task, all machine learning methods share a common workflow, which we outline next.

Standard Process of ML

Regardless of whether we are performing regression, classification, or clustering, every machine learning project follows a common pipeline. Understanding this pipeline is important because each step introduces its own mathematical and practical considerations - from the statistical properties of the data to the convergence guarantees of the optimizer.

The Machine Learning Pipeline:
  1. Problem Definition:
    Clearly articulate the problem and determine whether machine learning is an appropriate solution. This includes specifying the input space \(\mathcal{X}\), the output space \(\mathcal{Y}\), and the performance criterion.
  2. Data Collection:
    Gather relevant data from various sources, ensuring quality and representativeness. The data must be sufficiently rich to capture the underlying distribution \(p(x, y)\) that we wish to model.
  3. Data Preprocessing:
    Clean and prepare the data by handling missing values, encoding categorical variables, and normalizing features. Feature scaling, for example, ensures that gradient descent converges efficiently by improving the condition number of the optimization landscape.
  4. Data Splitting:
    Divide the dataset into training, validation, and test sets. This separation is essential for estimating generalization performance and is formalized through cross-validation techniques.
  5. Model and Optimization Procedure Selection:
    Choose a hypothesis class \(\mathcal{H}\) and an optimization algorithm based on the problem type and data characteristics. This step involves the fundamental bias-variance tradeoff: a more expressive model can fit the training data better (lower bias) but may generalize poorly (higher variance).
  6. Training:
    Optimize the model parameters \(\theta\) by minimizing the empirical loss on the training set. For neural networks, this involves backpropagation - an efficient application of the chain rule - combined with stochastic gradient descent or its variants.
  7. Evaluation:
    Assess the model's performance using the validation set and appropriate metrics (e.g., accuracy, precision, recall for classification). For regression, a common choice is the mean squared error (MSE).
  8. Hyperparameter Tuning:
    Optimize hyperparameters (e.g., regularization strength \(\lambda\), learning rate, network depth) using grid search, random search, or Bayesian optimization.
  9. Testing:
    Evaluate the final model on the held-out test set to obtain an unbiased estimate of generalization performance. This test set must remain untouched until this stage to preserve statistical validity.
  10. Deployment and Monitoring:
    Integrate the model into a production environment. In practice, the data distribution may shift over time (distribution shift), requiring continuous monitoring and periodic retraining. For safety-critical systems, this phase relies heavily on out-of-distribution detection: if a model's estimated epistemic uncertainty exceeds a learned threshold, the system recognizes it is operating outside its confident regime and can trigger immediate safety aborts.
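Steps 3 through 9 above can be sketched end to end with NumPy alone. This toy pipeline standardizes features, splits the data, fits ridge regression for two candidate regularization strengths, selects one on the validation set, and reports test error (all sizes, weights, and \(\lambda\) values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 300
X = rng.normal(size=(N, 4))
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(0, 0.2, size=N)

# 3. Preprocessing: standardize features (zero mean, unit variance)
X = (X - X.mean(axis=0)) / X.std(axis=0)

# 4. Splitting: 60% train / 20% validation / 20% test
idx = rng.permutation(N)
tr, va, te = idx[:180], idx[180:240], idx[240:]

def fit_ridge(X, y, lam):
    """5-6. Model + training: minimize ||Xw - y||^2 + lam * ||w||^2 in closed form."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(w, X, y):
    """7. Evaluation metric: mean squared error."""
    return np.mean((X @ w - y) ** 2)

# 8. Hyperparameter tuning: choose lambda by validation MSE
best_lam = min([0.1, 10.0],
               key=lambda lam: mse(fit_ridge(X[tr], y[tr], lam), X[va], y[va]))

# 9. Testing: final estimate of generalization error on the untouched test set
w = fit_ridge(X[tr], y[tr], best_lam)
print(best_lam, mse(w, X[te], y[te]))
```

Note the discipline the pipeline enforces: the validation set drives the hyperparameter choice, while the test set is consulted exactly once, at the end.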

The ML pipeline is a concrete synthesis of every section in this website. Data representation relies on the linear algebra of Algebraic Foundations (Section I) - each data point is a vector, each dataset is a matrix, and transformations like PCA are eigenvalue problems. Optimization is governed by the calculus of Optimization & Analysis (Section II) - gradient descent, convexity, and convergence rates determine whether training succeeds. Generalization is a question of Probability & Statistics (Section III) - from the bias-variance trade-off to the law of large numbers justifying empirical risk minimization. Finally, the computational feasibility of each algorithm depends on Discrete Mathematics & Algorithms (Section IV). In the pages that follow, we explore each of these connections in depth.