Formal Methods: Machine-Verified Proof

Contents: Why Formalize? · The Curry-Howard Correspondence · Proof Assistants: A Landscape · The 2026 Inflection: Machine Learning × Formal Methods

Why Formalize?

Readers of the discrete mathematics section have already met an unfinished story. In planar graphs, we proved the Five Color Theorem in roughly a page using Kempe chains, and then stated the Four Color Theorem with the brief note that its proof requires computer verification of a finite case analysis. We did not say what that means. We did not say whether such a proof can be trusted in the same sense as a hand-written one. We left the question open.

This page closes it. The answer is a thirty-year episode, beginning in 1976 and resolving in 2005, about what counts as a proof when computers are involved — and the structural reorganization of mathematical trust that made the resolution possible.

From Hilbert to Gödel: certainty reorients

The goal of formalizing all of mathematics is older than computers. In the first decades of the twentieth century, David Hilbert articulated a program: codify the axioms and inference rules of mathematics so completely that every true mathematical statement could, in principle, be derived from them, and prove the entire system's consistency by purely finite means. This was the most ambitious form of the program — certainty established by construction.

In 1931, Kurt Gödel proved that the goal, taken literally, is impossible. Any consistent formal system strong enough to express elementary arithmetic contains true statements that cannot be derived within it, and cannot prove its own consistency by its own rules. The project was not abandoned, however; it was reoriented.

The new goal is more modest and, in retrospect, more useful: not absolute certainty, but a structured allocation of trust. Make explicit which parts of mathematical reasoning can be mechanically verified, which require human judgment, and where the boundary between them lies. Place the unavoidable acts of trust precisely, rather than scattering them invisibly throughout proofs.

The Four Color crisis (1976-2005)

In 1976, Kenneth Appel and Wolfgang Haken announced a proof of the Four Color Theorem. The strategy was to reduce the question to a finite list of unavoidable configurations and verify the reducibility of each. The verification consumed roughly 1,200 hours of computer time on the machines of the era.

The mathematical community split. One camp accepted the proof: the logical reduction was rigorous, the verification mechanical and reproducible, and machines were arguably more reliable than human referees at checking thousands of cases without fatigue. Another camp rejected it. The philosopher Thomas Tymoczko, in a 1979 paper, argued that a proof no human can read in its entirety is not a proof at all; Paul Halmos echoed the sentiment more bluntly. The dispute was not about whether the theorem was true. It was about whether the means of establishing it counted as a proof.

The structural reason the dispute could not be settled is worth stating directly. The verification programs were written in conventional languages (a mix of C and assembler in the original; later reimplementations in other languages). To trust the proof, one had to trust those programs. To trust the programs, one had to inspect their source code by hand — exactly the activity the computer was supposed to replace. The chain of justification bottomed out in unverified imperative code, with no independent mechanism to break the regress. Disagreement could not be resolved because the parties were not pointing to a common, inspectable foundation.

The next two decades sustained the doubt. A 1981 independent verification effort discovered, and corrected, small errors in the original — not catastrophic, but enough to keep the question of program correctness open. In 1996, a simpler proof by Robertson, Sanders, Seymour, and Thomas reduced the configuration count and tightened the argument, but the proof remained computer-assisted and the philosophical objection remained intact.

The resolution arrived in 2005. Georges Gonthier, working with the Coq proof assistant, reformulated the entire Four Color Theorem — logical core and configuration analysis alike — inside a single formal system whose verification procedure is a small, fixed, inspectable kernel program. Coq's kernel did the checking. The chain of justification became explicit: trust the Coq kernel, and the proof follows. Anyone who disagrees can now point to a specific object of disagreement — the kernel — and audit it. The thirty-year dispute did not end by consensus. It ended structurally, by clarification of where the residual trust resides.

The pattern repeats

Once stated, the pattern recurs across modern mathematics. Thomas Hales announced a proof of the Kepler conjecture in 1998. The Annals of Mathematics referees, after years of effort, were unable to declare full confidence in the proof's correctness — a public admission with no major precedent. Hales responded by launching the Flyspeck Project in 2003 to formalize the proof entirely in HOL Light and Isabelle. Eleven years later, in 2014, the formalization was complete and the Kepler conjecture passed from "almost certainly proven" to proven in the same sense the Four Color Theorem now is.

The 2020s have accelerated the trend. In late 2020, Peter Scholze publicly asked the Lean community to verify a key result in condensed mathematics — the so-called Liquid Tensor Experiment — because he was not personally certain his own proof was correct. It was the first time a frontier mathematician requested formalization for his own epistemic comfort. The community verified the crucial step within six months and completed the full formalization in 2022. In 2023, Terence Tao led a Lean formalization of the proof of the Polynomial Freiman-Ruzsa conjecture within weeks of the paper's publication; formal methods entered the active research workflow.

The principle

Formal methods do not deliver absolute certainty. The reorientation after Gödel was permanent, not a temporary retreat. What formal methods deliver is something more useful: a structured allocation of trust. Without that structure, debates about the correctness of a proof — especially a long, computer-assisted, or machine-generated one — reduce to disputes about preferences. With it, disagreements become specific. "Do you trust the kernel?" is a question that admits an answer; "is this really a proof?" is not.

The remainder of this page explains how the structure works. The conceptual heart is the discovery, made twice independently in the 1930s and unified in the 1960s, that propositions and types — proofs and programs — are not two analogous things but a single mathematical structure seen from two angles. This is the Curry-Howard correspondence, and it is what makes a proof assistant possible.

The Curry-Howard Correspondence

The Curry-Howard correspondence is not the name of a tool, a model, or a particular formal system. It is the name of a mathematical discovery: that logic and type theory — propositions and types, proofs and programs — are two surface presentations of the same underlying structure. The discovery is what licenses the architecture of Lean, Coq, Agda, and every other modern proof assistant. Understanding it is the entry point to understanding why such systems work.

The name

The correspondence carries two names because two logicians, working independently, found pieces of it at different times. Haskell Curry — for whom the programming language Haskell and the technique of currying are both named — observed in 1934, and elaborated in 1958, that the type \(A \to B\) of functions from \(A\) to \(B\) obeys the same inference rules as the logical proposition "\(A\) implies \(B\)". The implication arrow and the function arrow are not analogous; they are the same arrow.

William Alvin Howard, in a 1969 manuscript circulated privately and published in 1980, extended Curry's observation into a systematic correspondence covering every logical connective. Conjunction, disjunction, negation, universal and existential quantification each acquired a type-theoretic counterpart, and the operations on proofs (modus ponens, case analysis, generalization) became the operations on programs (function application, pattern matching, binding).

In the 1970s and 1980s, Joachim Lambek added a third side. The same structure that appears in logic and in type theory also appears in category theory, where the relevant categorical structure (cartesian closed categories, in the simplest case) is a third equivalent presentation. The full three-way correspondence — Curry-Howard-Lambek — places logic, programming, and category theory as three faces of one mathematical object. We will encounter the categorical side later in the curriculum; the next section will plant a deliberate pointer.

Three layers

The correspondence has substance at three depths.

The static layer identifies propositions with types. Each proposition corresponds to a type; each type can be read as a proposition. The proposition "\(P\) and \(Q\)" is the product type \(P \times Q\); the proposition "\(P\) implies \(Q\)" is the function type \(P \to Q\). This is the layer at which the correspondence is most easily stated.

The dynamic layer identifies proofs with programs. A specific proof of \(P \land Q\) is a specific pair \(\langle p, q \rangle\) consisting of a proof \(p\) of \(P\) and a proof \(q\) of \(Q\). A specific proof of \(P \to Q\) is a specific function that, given any proof of \(P\), produces a proof of \(Q\). Proofs are not described by programs; they are programs, in the same way that the number 3 is not described by the successor of 2 but is the successor of 2.

The operational layer identifies proof normalization with computation. Proofs can be simplified — a proof that constructs a pair and immediately projects out the first component can be reduced to a proof that just uses the first component directly. Programs can be executed — the expression that constructs a pair and projects its first component evaluates to the first component. These are the same operation: \(\beta\)-reduction in logic and \(\beta\)-reduction in the lambda calculus are not analogous procedures, they are the same procedure viewed through two vocabularies. When a proof assistant checks a proof, it is, mathematically, executing a program.
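A worked instance makes the identification concrete (a sketch, writing \(\pi_1\) for the first projection and \(\langle a, b \rangle\) for pairing):

\[
(\lambda p.\ \pi_1\, p)\ \langle a, b \rangle \;\longrightarrow\; \pi_1\, \langle a, b \rangle \;\longrightarrow\; a.
\]

Read as a program, the left-hand side builds a pair and immediately extracts its first component, and evaluation eliminates the detour. Read as a proof, a derivation of \(A\) that introduces the conjunction \(A \land B\) only to eliminate it again normalizes to the direct derivation \(a\). The two readings name the same reduction steps.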

The correspondence in full

The complete table of correspondences for the standard logical connectives:

  • Proposition \(P\): the type \(P\)
  • Conjunction \(P \land Q\): the product type \(P \times Q\)
  • Disjunction \(P \lor Q\): the sum type \(P \oplus Q\)
  • Implication \(P \to Q\): the function type \(P \to Q\)
  • Negation \(\neg P\): the function type \(P \to \bot\)
  • Universal quantification \(\forall x : X,\; P(x)\): the dependent function type \((x : X) \to P(x)\)
  • Existential quantification \(\exists x : X,\; P(x)\): the dependent pair type \((x : X) \times P(x)\)
  • Truth \(\top\): the unit type
  • Falsity \(\bot\): the empty type

The rows for the universal and existential quantifiers require a feature called dependent types — types whose form depends on a value. To express "for every natural number \(n\), \(n + 0 = n\)", we need a type whose elements are functions taking an \(n\) and producing a proof of \(n + 0 = n\), where the proposition being proved changes with \(n\). Conventional programming languages cannot express such types; the dependent-type machinery is what gives proof assistants their full expressive power. We will encounter it more concretely on the following page.
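Anticipating the Lean notation introduced below, a minimal sketch of that dependent type; in this particular case the proof happens to be definitional, so a single reflexivity term inhabits it:

    -- The proposition "for every natural number n, n + 0 = n" is a dependent
    -- function type: the codomain n + 0 = n varies with the argument n.
    example : ∀ n : Nat, n + 0 = n :=
      fun n => rfl   -- n + 0 reduces to n by the definition of addition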

Reframing the Boolean connectives, retroactively

A reader who has worked through Boolean Logic has already worked with every connective in the table above, in a different vocabulary. The truth table of \(P \land Q\) records its value: given the values of \(P\) and \(Q\), it returns \(1\) if both are \(1\), and \(0\) otherwise. The Curry-Howard reading shifts focus from the value to the structure of the evidence: a proof of \(P \land Q\) consists of a proof of \(P\) together with a proof of \(Q\) — a pair. The two readings are not the same statement; they live at different layers — one external (evaluation in a two-element set) and one internal (construction of evidence). What carries across is the syntax and the operational intuition, not the underlying logical system. The connectives \(\land, \lor, \to, \neg\) appearing in propositional logic and in the type theory of a proof assistant are formally distinct objects that share a notation.

Concretely: to construct a proof of \(P \land Q\), build a pair from a proof of \(P\) and a proof of \(Q\). To use a proof of \(P \land Q\), project out either component — the left projection gives a proof of \(P\), the right projection gives a proof of \(Q\). The introduction and elimination rules of conjunction in natural deduction are, exactly, the constructor and projections of the product type. The same pattern holds for the other connectives: to construct a proof of \(P \lor Q\), inject a proof of either disjunct into the sum type, and to use one, perform case analysis; to construct a proof of \(P \to Q\), write a function (a lambda abstraction), and to use one, apply it to an argument, which is exactly modus ponens.

A first look at the syntax

Lean is the proof assistant we will use to make these abstractions concrete. A first taste: the proof that \(P \land Q\) implies \(P\) can be written in two equivalent ways.


    -- Term mode: the proof is written directly as an expression
    theorem and_left (P Q : Prop) (h : P ∧ Q) : P := h.left

    -- Tactic mode: the proof is constructed step by step
    theorem and_left' (P Q : Prop) (h : P ∧ Q) : P := by
      exact h.left

In both, the hypothesis h has type P ∧ Q — it is a pair of a proof of P and a proof of Q. The expression h.left projects out the first component, yielding a proof of P. Term mode states this projection directly; tactic mode wraps it in a by ... block that, in larger proofs, allows constructing the proof interactively, step by step. The two forms produce the same final proof object internally; the choice between them is a matter of authoring style. The next section returns to this distinction.
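One connective these examples do not touch is disjunction. Its Curry-Howard reading (injections to introduce, case analysis to eliminate) looks like this in the same Lean 4 notation, as a sketch:

    -- Introduction for ∨: inject a proof of either disjunct into the sum.
    example (P Q : Prop) (hp : P) : P ∨ Q := Or.inl hp

    -- Elimination for ∨: case analysis on which disjunct holds.
    example (P Q : Prop) (h : P ∨ Q) : Q ∨ P := by
      cases h with
      | inl hp => exact Or.inr hp
      | inr hq => exact Or.inl hq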

Proof Assistants: A Landscape

The Curry-Howard correspondence is a mathematical observation. A proof assistant is a software system that implements it: a programming language whose type system is rich enough to encode mathematical propositions, together with the infrastructure to construct, check, and organize proofs as programs in that language. Several such systems exist; this section surveys the major ones and the architectural feature they share.

The major systems

Four proof assistants account for most contemporary work in formal mathematics: Lean, Coq, Isabelle, and Agda.

The choice of system in any given project depends on the existing libraries, the surrounding community, and the foundational preferences of the authors. The mathematics being formalized is essentially the same in each. We will use Lean on the next page for its current research momentum, its large mathematical library, and the clean separation between its two proof-authoring modes.

Tactic mode and term mode, in practice

The two writing styles introduced at the end of the previous section deserve a closer look, because the distinction governs how mathematicians actually use these systems.

Term mode is the direct presentation: the proof is written as a single expression of the appropriate type. It is concise, fits well in displayed equations, and matches the Curry-Howard slogan most cleanly — a proof of a proposition is a term of the corresponding type.

Tactic mode is the interactive presentation: the proof is constructed step by step inside a by ... block, with the system displaying, at each step, the current state of the proof — what hypotheses are available, what remains to be shown. The state is presented in a standardized format, with the available hypotheses listed first and the current goal marked by a turnstile (⊢) below them. A small example: suppose we wish to prove that if \(P \land Q\) holds, then \(Q \land P\) holds.


    example (P Q : Prop) (h : P ∧ Q) : Q ∧ P := by
      constructor
      · exact h.right
      · exact h.left

The constructor tactic decomposes a goal of the form \(Q \land P\) into two subgoals — one for \(Q\), one for \(P\) — reflecting the Curry-Howard reading that to construct a pair, one must construct each component. After constructor runs, the proof state shows:


    case left
    P Q : Prop
    h : P ∧ Q
    ⊢ Q

    case right
    P Q : Prop
    h : P ∧ Q
    ⊢ P

The turnstile separates the hypotheses (above) from the goal (below). The two cases are dispatched by the two lines beginning with ·, each producing the required proof from components of h. The experience is close to writing a proof on a blackboard, with the system tracking what has been established and what remains.

The two modes are extensionally equivalent: internally, Lean executes tactics and produces a term-mode proof, which is then checked by the kernel. The kernel never sees tactics. This separation is the structural foundation of the trust architecture, which is the next subject.
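This can be observed directly: asking Lean to print the tactic-mode theorem from the previous section displays the term the tactics produced (a sketch; the exact pretty-printed form varies across Lean versions):

    #print and_left'
    -- theorem and_left' : ∀ (P Q : Prop), P ∧ Q → P :=
    -- fun P Q h => h.left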

The kernel and self-hosting trust

Every proof assistant has a kernel — a small program whose sole job is to check whether a given proof term is valid. The kernel implements the rules of the underlying logic and nothing else: no automation, no tactics, no high-level convenience features. In Lean, the kernel is written in C++ and consists of a few thousand lines of code; Coq's kernel is similarly small. The smallness is deliberate. The kernel is the only part of the system whose correctness must be trusted on faith; everything else is checked by the kernel. This design principle — that a proof assistant should produce proof terms verifiable by a small, independent kernel — is known as the de Bruijn criterion, after N.G. de Bruijn, whose 1968 Automath system was the first working proof checker built around it. The kernel-based architectures of Coq, Lean, and Isabelle all descend, conceptually if not always genealogically, from this design.

A notable feature of Lean's design is that almost everything above the kernel — the elaborator that converts surface syntax into proof terms, the tactic framework, mathlib itself — is written in Lean itself. The system is largely self-hosting. This creates a trust loop with a useful property: the high-level machinery is itself a Lean program, and its correctness is, in principle, the sort of thing the kernel can verify. In practice, the high-level machinery has not been exhaustively verified to date, but the architecture makes such verification possible in principle, with the kernel as the fixed point of the trust chain.

This is exactly the structural property that was missing from the Appel-Haken proof in 1976. The verification programs were ordinary C and assembler code; to trust the proof, one had to trust those programs by inspection, and the inspection could not be mechanized. Gonthier's 2005 Coq formalization replaced the unverified imperative code with a Coq proof checked by Coq's kernel. The physical fact of the proof did not change; what changed was the structure of the trust chain. A present-day reader can locate exactly where the residual trust lives — in the few thousand lines of the Coq kernel — and audit it.

The Universal Pattern

The technique of attaching additional structured information to each primitive operation, and letting composition rules carry that information through automatically, recurs across mathematics and computer science:

  • Automatic differentiation attaches derivatives to functions; the chain rule is the composition law.
  • Formal methods attach proofs of correctness to programs; logical inference is the composition law.
  • Probabilistic programming attaches probability distributions; conditional probability and marginalization are the composition laws.
  • Quantum computation attaches unitary evolution; sequential and parallel composition of unitaries are the composition laws.

These are not analogies, but instances of a single mathematical structure. The language that names this structure is category theory, and we will return to it as a viewpoint of its own later in the curriculum. For now, the recognition that the same pattern recurs is the substantive observation.
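To make the first bullet concrete, here is a hypothetical sketch in Lean of forward-mode automatic differentiation with dual numbers; the product rule inside Dual.mul is the composition law that bullet refers to, and the names (Dual, val, deriv) are illustrative rather than drawn from any particular library:

    -- Attach a derivative to every value; the composition rules carry it through.
    structure Dual where
      val   : Float
      deriv : Float

    def Dual.add (a b : Dual) : Dual :=
      ⟨a.val + b.val, a.deriv + b.deriv⟩                    -- sum rule

    def Dual.mul (a b : Dual) : Dual :=
      ⟨a.val * b.val, a.deriv * b.val + a.val * b.deriv⟩    -- product rule

    -- Derivative of x * x at x = 3: seed the input with derivative 1.
    #eval (Dual.mul ⟨3, 1⟩ ⟨3, 1⟩).deriv   -- 6.000000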

Lean's boundaries

A clear-eyed account of any tool must say where the tool ends. Two boundaries circumscribe what a proof assistant does.

Verification is not discovery. Lean's primary function is to check proofs that are provided to it. Finding a proof in the first place is a separate problem, and the degree of automation depends on the domain. Some tactics close goals entirely on their own: rfl settles equalities that hold by definition; omega solves linear arithmetic over the integers; ring handles polynomial identities in commutative rings; simp simplifies using a registered database of lemmas. Other tactics search heuristically: aesop attempts general goal-directed search. And learning-augmented tactics increasingly suggest or generate proofs in open-ended cases.
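A minimal illustration of the fully automatic end of this spectrum, as a sketch (the ring example assumes a Mathlib import; rfl and omega are available in core Lean):

    import Mathlib

    example : 2 + 2 = 4 := by rfl                                    -- definitional equality
    example (n : Nat) : n < n + 1 := by omega                        -- linear arithmetic
    example (x y : Int) : (x + y)^2 = x^2 + 2*x*y + y^2 := by ring   -- polynomial identity
    example (P : Prop) (h : P) : P ∧ True := by simp [h]             -- simplification by lemma database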

Yet two different ceilings limit proof discovery, and the popular discourse often conflates them. In a consistent formal system strong enough to express arithmetic, there exist true propositions that cannot be proved within the system at all — this is Gödel's first incompleteness theorem (1931). It is a statement about the gap between truth and provability, not about search; no algorithm — not in 1931, not now, not ever — can derive a proof that the system itself cannot produce.

The second ceiling concerns propositions that are provable: even there, finding a proof by mechanical search runs into combinatorial explosion. First-order logic's provability problem is semi-decidable: if a proposition is provable, an exhaustive search procedure will eventually find a proof; if it is not, the search may never terminate. In practice, the search space grows fast enough that exhaustive methods are useless beyond toy cases, and the practical limits of automated discovery are determined by computational complexity — the same kind of considerations that underlie P vs NP — rather than by Gödel's theorem. Within the domains where decision procedures exist and run quickly (linear arithmetic, polynomial identities, propositional logic), automation is essentially complete; outside them, human strategy or learned heuristics drive the search.

The system verifies what is written, not what was meant. Lean checks that a given proof establishes the proposition stated. Whether that proposition is the one the author intended to prove is a separate question — one the system cannot answer. A trivial example illustrates the point: the statement theorem length_nonneg (L : List α) : 0 ≥ 0 := by rfl is true and admits a one-line proof, but if the author meant to assert that the length of a list is non-negative, the actual proposition being proved has nothing to do with lists. The error is invisible to Lean. This is the specification problem: the gap between intent and the formal statement of intent.
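For contrast, a sketch of the statement presumably intended; the proof is again one line, but the proposition now actually mentions the list (length_nonneg' is an illustrative name, not drawn from any library):

    -- The intended specification: the length of any list is non-negative.
    theorem length_nonneg' {α : Type} (L : List α) : 0 ≤ L.length :=
      Nat.zero_le L.length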

The trust structure of a formal development can be drawn as a pyramid. At the base sits the kernel — small, inspectable, the only piece that must be trusted by direct audit. Above it sits the proof term, which the kernel checks mechanically. Above the proof sits the specification — the proposition being proved — which a human writes and a human must review. At the top sits the author's intent, which lives in cognition and cannot be formalized at all. Formal methods do not eliminate the top of the pyramid; they locate it precisely. Each layer's responsibility is explicit.

In practice, these gaps are mitigated rather than closed. Specifications can be written as computable predicates and tested against examples, surfacing many drafting errors. Multiple independent specifications of the same intent can be cross-checked. Specification design typically receives the same kind of scrutiny as proof design — in the CompCert and seL4 projects, the documented experience was that getting the specification right demanded a substantial share of the total effort. None of these techniques eliminate the specification problem, but they make it tractable.

These boundaries are not flaws to be lamented; they are the price of explicitness. Every formal system has them. What distinguishes formal methods is that the boundaries are visible: each act of trust is located, each unverified assumption is named. A system that promises "absolute correctness" hides its trust chain; a system that promises "correctness relative to a stated specification" exposes it. The second promise is the one worth making.

The 2026 Inflection: Machine Learning × Formal Methods

The decade beginning around 2020 has been a period of rapid acceleration for formal methods, driven by two converging forces: the maturation of large mathematical libraries (chiefly mathlib) and the arrival of machine learning systems capable of generating proofs in the form a proof assistant can check. Together, these have produced a qualitative shift in what formal verification can accomplish at scale.

The acceleration

The most visible developments since 2024:

The cumulative effect is that the practice of formalizing mathematics, and of generating proofs by machine, has shifted from a specialist research activity to standard infrastructure for both mathematics and computer science. The pace of mathlib's growth has steepened sharply since 2024.

Why the trust structure makes this possible

The architecture of the trust pyramid, developed in the previous section, is exactly what licenses the pairing of machine learning with formal verification. When a machine learning system generates a proof:

  • the candidate is a proof term (or a tactic script that elaborates to one), exactly the kind of object the kernel is built to check;
  • the kernel checks it by the same small, fixed procedure it applies to human-written proofs, with no knowledge of, or interest in, how the candidate was produced;
  • if the check succeeds, the theorem is established with the same guarantee as any other kernel-verified result; if it fails, the candidate is rejected outright.

A consequence sometimes summarized in the field is that when a learning system produces a kernel-verified proof, a human reviewer adds no additional reliability. The kernel has already absorbed whatever errors the learning system might have introduced. This is not a claim that machine-generated proofs are intrinsically trustworthy; it is a precise statement about where reliability lives in the system-plus-kernel architecture. The reliability lives in the kernel. The learning system is, structurally, an untrusted but useful proof-search procedure.

This division of labor matters because it dissolves an apparent contradiction. Machine learning systems are known to be unreliable at many tasks; formal verification demands absolute reliability of the verified output. The two coexist because the verification step sits between them, mechanically converting "unreliable candidate proof" into either "verified theorem" or "rejected attempt". No middle ground exists. The learning system's fallibility is bounded by the kernel.

What this implies

Several trends follow from the structural compatibility of machine-generated proofs with kernel verification. Formalization, historically a slow and labor-intensive process, is becoming faster as machine learning systems handle routine proof construction. Mathematical theorems that previously could not be formalized for lack of human time enter mathlib at growing rates. Engineering verification — the formal proof of correctness of software systems, as in CompCert and seL4 — is becoming accessible to projects below the scale of flagship academic efforts.

The longer-term implication for the curriculum is the basis on which formal methods are introduced here. Formal verification is on a trajectory to become standard infrastructure for both pure mathematics and software engineering — comparable in significance to the way numerical linear algebra became standard infrastructure for scientific computing in the late twentieth century. A mathematics education that does not name this development risks producing readers who encounter it later without a framework to understand what they are seeing.

From description to practice

This page has discussed formal methods entirely from the outside: their history, their conceptual core, their architecture, their limits, and their current moment. To understand them from the inside requires using one. The next page is a guided walk through Lean itself, beginning with a theorem the reader already knows — the triangle inequality \(|x + y| \leq |x| + |y|\) — and showing how a mathematician's natural strategic decomposition translates, step by step, into a Lean proof. The translation, as we will see, is more direct than it may at first appear.