Flow Matching

The Trajectory Problem

A common shorthand divides modern generative AI into two camps: language is the domain of large autoregressive transformers, and images and video are the domain of diffusion. The second half of that sentence describes a landscape that has shifted. Many of the most capable recent image and video generators — the FLUX family, among others — are trained not as diffusion models but by a closely related principle called flow matching. The shift is one of engineering emphasis, not of mathematical obsolescence: the machinery of diffusion remains exact and illuminating, and we will see that it sits inside the flow-matching framework as a special case rather than standing opposed to it. What changed is which trajectory practitioners choose to integrate, and why.

The point of departure is our treatment of diffusion models. There, generation was framed as the reversal of a noising process: a sample is produced by starting from Gaussian noise and following a trajectory back toward the data distribution, one denoising step at a time. That page closed with a scope note we now make good on. The discrete denoising chain admits a continuous-time limit; in that limit the deterministic sampler traces out a smooth curve governed by an ordinary differential equation (the probability-flow ODE) rather than a noisy, stochastic walk. Flow matching takes this deterministic, continuous-time picture as primary, and in doing so dispenses with the forward noising process as a defining ingredient: noise survives only as the usual choice of starting distribution.

The reframing is worth stating plainly. A diffusion sampler integrates a stochastic trajectory whose randomness, inherited from its Brownian-motion origins, makes the path between noise and data wander. A wandering path is expensive: faithfully integrating it demands many small steps, which is why classical diffusion sampling can require hundreds or thousands of network evaluations to produce a single image. Flow matching asks a more direct question. Rather than reversing a stochastic corruption, it seeks a deterministic map from a source distribution \(p_{\text{init}}\) — typically, but not necessarily, Gaussian noise — to the data distribution \(p_{\text{data}}\), realized as the time-1 flow of an ODE. The object to be learned is the velocity field driving that ODE, and the central advantage, developed over the next two sections, is that the trajectories can be made nearly straight — and a straight path is cheap to integrate.

Two consequences of the deterministic ODE viewpoint deserve early mention, because they explain why flow matching is not merely a recasting of diffusion but a genuine generalization. First, the source distribution need not be Gaussian: any distribution one can sample from may serve as \(p_{\text{init}}\), a freedom that diffusion's noise-centric construction does not naturally permit. Second, diffusion itself reappears inside this framework as one particular choice of trajectory, rather than as a separate theory. We develop the construction first, then return to both points once the machinery is in place.

Learning the Velocity Field

Fix a source distribution \(p_{\text{init}}\) and the data distribution \(p_{\text{data}}\), both on \(\mathbb{R}^d\). Flow matching seeks a time-dependent velocity field \(\mathbf{u}_t(\mathbf{x})\), defined for \(t \in [0, 1]\), with the following property: a point started at a noise sample and carried along the field arrives, at time \(1\), at a data sample. Concretely, if \(\mathbf{x}_t\) solves the ordinary differential equation

\[ \frac{\mathrm{d}}{\mathrm{d}t}\,\mathbf{x}_t = \mathbf{u}_t(\mathbf{x}_t), \qquad \mathbf{x}_0 \sim p_{\text{init}}, \]

then its distribution at each intermediate time should trace out a prescribed probability path \((p_t)_{0 \le t \le 1}\) — a curve in the space of distributions — running from \(p_0 = p_{\text{init}}\) to \(p_1 = p_{\text{data}}\). The flow of this ODE, the map carrying each starting point to its time-\(1\) endpoint, is the generator we want; sampling means drawing \(\mathbf{x}_0\) from noise and integrating forward to \(t = 1\). That distinct trajectories never cross — so that the flow is a well-defined map — follows from the existence-and-uniqueness theory for ODEs (the Picard–Lindelöf theorem), valid when \(\mathbf{u}_t\) satisfies a Lipschitz condition. We take this regularity for granted and turn to the real problem: finding the field.
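Sampling as just described can be sketched with a fixed-step Euler integrator. This is a minimal sketch rather than a prescribed solver: the function name `euler_sample`, the uniform time grid, and the toy field `toward_target` (which pulls every point toward one fixed target) are assumptions introduced for illustration. In practice the field would be a trained network, and a higher-order solver may be preferable.

```python
import numpy as np

def euler_sample(velocity, x0, n_steps):
    """Integrate dx/dt = velocity(t, x) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x0, dtype=float)
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = k * h                      # left endpoint of the current step
        x = x + h * velocity(t, x)
    return x

# Hypothetical toy field: every point moves toward a single fixed target z.
z = np.array([2.0, -1.0])

def toward_target(t, x):
    # Only evaluated at t < 1 by the integrator above, so no division by zero.
    return (z - x) / (1.0 - t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 2))       # four starting points drawn from noise
x1 = euler_sample(toward_target, x0, n_steps=8)
```

Each step costs one evaluation of the field, so `n_steps` is the number of network evaluations a real sampler would spend per sample.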

A path of distributions, built one data point at a time

We are free to choose the probability path, and the choice is made by first prescribing it conditionally. For a fixed data point \(\mathbf{z} \sim p_{\text{data}}\), a conditional probability path \(p_t(\cdot \mid \mathbf{z})\) is a family of distributions interpolating from the source noise at \(t=0\) to a point mass at \(\mathbf{z}\) at \(t=1\). The dominant choice in practice is the Gaussian conditional path

\[ p_t(\cdot \mid \mathbf{z}) = \mathcal{N}\!\big(\alpha_t \mathbf{z},\, \beta_t^2 I\big), \]

where the scalar schedules \(\alpha_t, \beta_t\) run monotonically from \(\alpha_0 = 0, \beta_0 = 1\) (pure noise) to \(\alpha_1 = 1, \beta_1 = 0\) (the variance collapses and the Gaussian concentrates on \(\mathbf{z}\)). Sampling from it is immediate: draw \(\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, I)\) and set \(\mathbf{x}_t = \alpha_t \mathbf{z} + \beta_t \boldsymbol{\varepsilon}\). The simplest schedule of all, \(\alpha_t = t\) and \(\beta_t = 1 - t\), gives \(\mathbf{x}_t = t\,\mathbf{z} + (1 - t)\,\boldsymbol{\varepsilon}\): for a fixed noise draw, the sample travels in a straight line at constant velocity from \(\boldsymbol{\varepsilon}\) to \(\mathbf{z}\). We will lean on this fact heavily in the next section.
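The sampling rule for the Gaussian conditional path translates directly into code. The sketch below assumes NumPy and uses illustrative names (`linear_schedule`, `sample_conditional_path`) that are not part of any library; it returns the noise draw alongside the sample so the endpoints can be checked.

```python
import numpy as np

def linear_schedule(t):
    """The simplest schedule: alpha_t = t, beta_t = 1 - t."""
    return t, 1.0 - t

def sample_conditional_path(z, t, rng, schedule=linear_schedule):
    """Draw x_t ~ N(alpha_t z, beta_t^2 I) as x_t = alpha_t z + beta_t eps."""
    alpha, beta = schedule(t)
    eps = rng.standard_normal(np.shape(z))
    return alpha * z + beta * eps, eps

rng = np.random.default_rng(0)
z = np.array([3.0, -2.0])                             # one data point
x_start, eps0 = sample_conditional_path(z, 0.0, rng)  # t = 0: pure noise
x_end, _ = sample_conditional_path(z, 1.0, rng)       # t = 1: the data point
```

Any other schedule with the same boundary values can be passed in place of `linear_schedule` without changing the rest.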

A word on generality. With \(\alpha_0 = 0\) and \(\beta_0 = 1\) the conditional path begins at \(p_0(\cdot \mid \mathbf{z}) = \mathcal{N}(\mathbf{0}, I)\) independently of \(\mathbf{z}\), so mixing over data points leaves the source pinned to \(p_{\text{init}} = \mathcal{N}(\mathbf{0}, I)\): this standard construction recovers a Gaussian source, the same starting point as diffusion. The freedom to take an arbitrary \(p_{\text{init}}\), promised earlier, is realized by a more general construction that conditions the path on a source–target pair \((\mathbf{x}_0, \mathbf{z})\) rather than on \(\mathbf{z}\) alone; the marginalization argument below goes through unchanged for that case. We develop the Gaussian-source version, by far the most common in practice, and note where the generalization enters.

The path we actually care about is the marginal probability path, obtained by mixing the conditional paths over all data points,

\[ p_t(\mathbf{x}) = \int p_t(\mathbf{x} \mid \mathbf{z})\, p_{\text{data}}(\mathbf{z})\, \mathrm{d}\mathbf{z}. \]

One can sample from it — draw a data point, then draw from its conditional path — but its density is an intractable integral. This tractable-to-sample, intractable-to-evaluate split is the crux of everything that follows.

The continuity equation, and the trick it powers

What links a probability path to a velocity field is a conservation law. A field \(\mathbf{u}_t\) transports the distribution \(p_t\) correctly — meaning a particle obeying the ODE stays distributed according to \(p_t\) — if and only if the pair satisfies the continuity equation

\[ \partial_t\, p_t(\mathbf{x}) = -\,\operatorname{div}\!\big(p_t\, \mathbf{u}_t\big)(\mathbf{x}). \]

The reading is physical: the left side is the rate at which probability mass at \(\mathbf{x}\) changes in time; the divergence on the right measures net outflow of mass carried by the field, so its negative is net inflow. Mass is neither created nor destroyed (probability always integrates to one), so the two sides must agree. This is the same equation that expresses conservation of mass in a fluid or conservation of charge in electromagnetism.
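The continuity equation can be checked numerically on the simplest case: a one-dimensional Gaussian conditional path with \(\alpha_t = t\), \(\beta_t = 1 - t\). The sketch below assumes the velocity obtained by differentiating \(x_t = t z + (1 - t)\varepsilon\) with \(\varepsilon\) held fixed, namely \((z - x)/(1 - t)\); the grid, step sizes, and function names are illustrative choices. It compares \(\partial_t p_t\) with \(-\partial_x (p_t u_t)\) by central finite differences.

```python
import numpy as np

z = 1.5  # a fixed data point (one-dimensional for simplicity)

def density(t, x):
    """Density of N(t z, (1 - t)^2), the conditional path at time t."""
    mean, std = t * z, 1.0 - t
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def velocity(t, x):
    """Assumed conditional velocity for this path: (z - x) / (1 - t)."""
    return (z - x) / (1.0 - t)

def flux(t, x):
    return density(t, x) * velocity(t, x)

t, dt, dx = 0.4, 1e-4, 1e-4
x = np.linspace(-4.0, 6.0, 2001)

# Central differences for the time derivative and the (1-D) divergence.
dp_dt = (density(t + dt, x) - density(t - dt, x)) / (2.0 * dt)
div_flux = (flux(t, x + dx) - flux(t, x - dx)) / (2.0 * dx)

residual = dp_dt + div_flux  # zero if the continuity equation holds
rel_err = np.max(np.abs(residual)) / np.max(np.abs(dp_dt))
```

A relative residual near zero, up to finite-difference error, indicates that this field does transport this path.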

The continuity equation is what makes the conditional construction pay off. The marginal velocity field that transports \(p_t\) is recovered from the conditional fields by a weighted average — each data point contributes the field aimed at it, weighted by how plausibly the current location \(\mathbf{x}\) arose from that point:

\[ \mathbf{u}_t(\mathbf{x}) = \int \mathbf{u}_t(\mathbf{x} \mid \mathbf{z})\, \frac{p_t(\mathbf{x} \mid \mathbf{z})\, p_{\text{data}}(\mathbf{z})}{p_t(\mathbf{x})}\, \mathrm{d}\mathbf{z}. \]

This marginalization trick — verified by substituting the average into the continuity equation and checking it holds — is the theoretical heart of flow matching. The marginal field is intractable, just like the marginal density. But the conditional field \(\mathbf{u}_t(\mathbf{x} \mid \mathbf{z})\) is explicit: for the Gaussian path it is a short closed-form expression in \(\mathbf{x}, \mathbf{z}\), and \(\alpha_t, \beta_t\). The remaining question is whether we can train against the easy conditional field and still recover the hard marginal one.
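For concreteness, the closed form can be derived in two lines, assuming the schedules are differentiable and \(\beta_t > 0\) for \(t < 1\) (the dot for a time derivative is notation introduced here). The conditional flow carries a noise draw \(\boldsymbol{\varepsilon}\) to \(\mathbf{x}_t = \alpha_t \mathbf{z} + \beta_t \boldsymbol{\varepsilon}\). Differentiating in time with \(\boldsymbol{\varepsilon}\) fixed, then eliminating \(\boldsymbol{\varepsilon} = (\mathbf{x}_t - \alpha_t \mathbf{z})/\beta_t\), gives

\[ \mathbf{u}_t(\mathbf{x} \mid \mathbf{z}) = \dot{\alpha}_t\, \mathbf{z} + \dot{\beta}_t\, \boldsymbol{\varepsilon} = \Big(\dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t}\,\alpha_t\Big)\mathbf{z} + \frac{\dot{\beta}_t}{\beta_t}\,\mathbf{x}. \]

With \(\alpha_t = t\) and \(\beta_t = 1 - t\) this reduces to \(\mathbf{u}_t(\mathbf{x} \mid \mathbf{z}) = (\mathbf{z} - \mathbf{x})/(1 - t)\), which along the path equals the constant \(\mathbf{z} - \boldsymbol{\varepsilon}\).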

Conditional flow matching

We can, and the reason is a clean identity between two losses. The loss we would like to minimize regresses a network \(\mathbf{u}_t^{\boldsymbol{\theta}}\) against the marginal field; the loss we can minimize, the conditional flow matching objective, regresses it against the explicit conditional field,

\[ \mathcal{L}_{\text{CFM}}(\boldsymbol{\theta}) = \mathbb{E}\Big[\big\|\,\mathbf{u}_t^{\boldsymbol{\theta}}(\mathbf{x}_t) - \mathbf{u}_t(\mathbf{x}_t \mid \mathbf{z})\,\big\|^2\Big], \]

with the expectation over a random time \(t \sim \mathrm{Unif}[0,1]\), a data point \(\mathbf{z} \sim p_{\text{data}}\), and a point \(\mathbf{x}_t \sim p_t(\cdot \mid \mathbf{z})\) on the conditional path. The two losses differ only by an additive constant independent of \(\boldsymbol{\theta}\), so they have identical gradients, and minimizing the conditional loss is exactly minimizing the marginal one. The network is regressed onto conditional targets, but because it sees only \(\mathbf{x}_t\) and \(t\), never \(\mathbf{z}\), the best it can do is predict the average of those targets over all data points consistent with \(\mathbf{x}_t\). That average is the marginal field: the network learns, without ever being shown it, the field that transports noise into data.
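The identity can be verified by expanding both squares. The following is a sketch of the standard argument; the name \(\mathcal{L}_{\text{FM}}\) for the marginal loss is introduced here for convenience. Write

\[ \mathcal{L}_{\text{FM}}(\boldsymbol{\theta}) = \mathbb{E}_{t,\, \mathbf{x} \sim p_t}\Big[\big\|\,\mathbf{u}_t^{\boldsymbol{\theta}}(\mathbf{x}) - \mathbf{u}_t(\mathbf{x})\,\big\|^2\Big]. \]

Both losses contain the same term \(\mathbb{E}\,\|\mathbf{u}_t^{\boldsymbol{\theta}}(\mathbf{x}_t)\|^2\), because a point drawn as \(\mathbf{z} \sim p_{\text{data}}\), \(\mathbf{x}_t \sim p_t(\cdot \mid \mathbf{z})\) has marginal law \(p_t\). The cross terms agree as well, by the weighted-average formula for the marginal field: at each fixed \(t\),

\[ \mathbb{E}_{\mathbf{z},\, \mathbf{x}_t}\big[\langle \mathbf{u}_t^{\boldsymbol{\theta}}(\mathbf{x}_t),\, \mathbf{u}_t(\mathbf{x}_t \mid \mathbf{z}) \rangle\big] = \int \Big\langle \mathbf{u}_t^{\boldsymbol{\theta}}(\mathbf{x}),\, \int \mathbf{u}_t(\mathbf{x} \mid \mathbf{z})\, p_t(\mathbf{x} \mid \mathbf{z})\, p_{\text{data}}(\mathbf{z})\, \mathrm{d}\mathbf{z} \Big\rangle\, \mathrm{d}\mathbf{x} = \int \langle \mathbf{u}_t^{\boldsymbol{\theta}}(\mathbf{x}),\, \mathbf{u}_t(\mathbf{x}) \rangle\, p_t(\mathbf{x})\, \mathrm{d}\mathbf{x}. \]

What remains is

\[ \mathcal{L}_{\text{CFM}}(\boldsymbol{\theta}) - \mathcal{L}_{\text{FM}}(\boldsymbol{\theta}) = \mathbb{E}\,\big\|\mathbf{u}_t(\mathbf{x}_t \mid \mathbf{z})\big\|^2 - \mathbb{E}\,\big\|\mathbf{u}_t(\mathbf{x}_t)\big\|^2, \]

which does not involve \(\boldsymbol{\theta}\).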

This is what simulation-free means, and why it matters. No ODE is solved during training. Each gradient step needs only a noise sample, a data sample, a random time, and the conditional velocity between them: a closed-form target, evaluated without any integration through the field being learned. Contrast this with the alternatives the trick avoids. Regressing directly on the marginal field would require evaluating an intractable integral over the data distribution at every step, and likelihood-based training of an ODE model, as in continuous normalizing flows, would require simulating the very flow one is trying to fit. The marginalization trick replaces both with ordinary regression.
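The ingredients of one training step can be sketched in a few lines. This is a minimal NumPy sketch under stated assumptions: the schedule is \(\alpha_t = t\), \(\beta_t = 1 - t\), for which the conditional velocity along the path is \(\mathbf{z} - \boldsymbol{\varepsilon}\); the names `cfm_batch` and `cfm_loss` and the placeholder `zero_model` are hypothetical; and a real implementation would use a neural network with automatic differentiation in place of the placeholder.

```python
import numpy as np

def cfm_batch(z, rng):
    """Build one batch of (t, x_t, target) for alpha_t = t, beta_t = 1 - t."""
    n = z.shape[0]
    t = rng.uniform(size=(n, 1))          # random times in [0, 1)
    eps = rng.standard_normal(z.shape)    # noise samples
    x_t = t * z + (1.0 - t) * eps         # points on the conditional paths
    target = z - eps                      # conditional velocity at x_t
    return t, x_t, target

def cfm_loss(model, z, rng):
    """Monte Carlo estimate of the conditional flow matching loss."""
    t, x_t, target = cfm_batch(z, rng)
    residual = model(t, x_t) - target
    return np.mean(np.sum(residual ** 2, axis=-1))

rng = np.random.default_rng(0)
z = rng.standard_normal((256, 2)) + np.array([4.0, 0.0])  # stand-in "data"

def zero_model(t, x):
    """Placeholder for a network: predicts zero velocity everywhere."""
    return np.zeros_like(x)

loss = cfm_loss(zero_model, z, rng)
```

Nothing in the batch construction calls an ODE solver, which is the simulation-free property in concrete form.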

Diffusion as a Special Case

We can now make good on a claim from the opening: diffusion is not a rival to flow matching but an instance of it. The flow-matching construction transports a distribution along a probability path using a deterministic ODE driven by the velocity field \(\mathbf{u}_t\). Diffusion models transport the same distribution along the same path, but using a stochastic differential equation. The two are connected by a single knob.

One knob: from deterministic flow to stochastic diffusion

Given a velocity field \(\mathbf{u}_t\) that transports the path \(p_t\), one may add a noise term and obtain a stochastic differential equation that transports the very same path. For any diffusion coefficient \(\sigma_t \ge 0\),

\[ \mathrm{d}\mathbf{x}_t = \Big[\mathbf{u}_t(\mathbf{x}_t) + \tfrac{\sigma_t^2}{2}\, \mathbf{s}_t(\mathbf{x}_t)\Big]\mathrm{d}t + \sigma_t\, \mathrm{d}\mathbf{w}_t, \]

where \(\mathbf{s}_t = \nabla \log p_t\) is the score function of the path and \(\mathbf{w}_t\) is Brownian motion, produces trajectories distributed according to \(p_t\) at every time, for any choice of \(\sigma_t\). The deterministic flow is the endpoint \(\sigma_t = 0\): the noise vanishes, the score term vanishes with it, and the ODE of the previous section is recovered. Diffusion models live at \(\sigma_t > 0\): the trajectories acquire the wandering, stochastic character of Brownian motion, while the distribution they carry is unchanged. The schedule that reproduces the variance-preserving noising of classical denoising diffusion is one particular setting of \(\sigma_t\) and the path; flow matching is the same family read at \(\sigma_t = 0\).
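The claim that every \(\sigma_t\) carries the same path can be checked on a toy case where the velocity and score are known in closed form. The sketch below assumes a one-dimensional Gaussian data distribution \(\mathcal{N}(M, S^2)\), the schedule \(\alpha_t = t\), \(\beta_t = 1 - t\), a constant \(\sigma\), and an Euler–Maruyama discretization; all names and parameter values are illustrative. Under these assumptions the marginal is \(\mathcal{N}(tM,\, t^2 S^2 + (1-t)^2)\), from which both fields follow.

```python
import numpy as np

M, S = 2.0, 0.5  # toy data distribution N(M, S^2), chosen for illustration

def mean_var(t):
    """Mean and variance of the marginal p_t for alpha_t = t, beta_t = 1 - t."""
    return t * M, (t * S) ** 2 + (1.0 - t) ** 2

def velocity(t, x):
    """Marginal velocity field of the Gaussian toy path (closed form)."""
    mu, var = mean_var(t)
    dvar_dt = 2.0 * t * S ** 2 - 2.0 * (1.0 - t)
    return M + 0.5 * dvar_dt / var * (x - mu)

def score(t, x):
    """Score of the Gaussian marginal: grad log p_t(x)."""
    mu, var = mean_var(t)
    return -(x - mu) / var

def sde_sample(sigma, n_particles=20000, n_steps=200, seed=0):
    """Euler-Maruyama for dx = [u + sigma^2/2 * s] dt + sigma dw."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_particles)  # x_0 ~ N(0, 1)
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = k * h
        drift = velocity(t, x) + 0.5 * sigma ** 2 * score(t, x)
        x = x + h * drift + sigma * np.sqrt(h) * rng.standard_normal(n_particles)
    return x

x_ode = sde_sample(sigma=0.0)  # deterministic flow
x_sde = sde_sample(sigma=1.0)  # stochastic trajectories, same path
```

Up to sampling and discretization error, both runs should end with mean close to `M` and standard deviation close to `S`, even though the individual trajectories differ.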

The bookkeeping that guarantees the path is preserved is, once again, a conservation law. A stochastic differential equation with drift \(\boldsymbol{\mu}_t\) and diffusion coefficient \(\sigma_t\) evolves its density according to the Fokker–Planck equation

\[ \partial_t\, p_t = -\operatorname{div}(p_t\, \boldsymbol{\mu}_t) + \tfrac{\sigma_t^2}{2}\,\Delta p_t. \]

The first term is the familiar transport of the continuity equation; the Laplacian \(\Delta p_t\) is the diffusion term that also appears in the heat equation, the mathematical signature of mass spreading stochastically. Substituting the drift of our equation, \(\boldsymbol{\mu}_t = \mathbf{u}_t + \tfrac{\sigma_t^2}{2}\,\mathbf{s}_t\), is what makes the two parts conspire. The score contributes a term \(-\tfrac{\sigma_t^2}{2}\operatorname{div}(p_t \nabla \log p_t)\); since \(p_t \nabla \log p_t = \nabla p_t\), this is exactly \(-\tfrac{\sigma_t^2}{2}\Delta p_t\), which cancels the diffusion term identically. What remains is

\[ \partial_t\, p_t = -\operatorname{div}(p_t\, \mathbf{u}_t), \]

the continuity equation of the previous section, for every \(\sigma_t\). This is the precise reason the whole family carries the same path: the score term is calibrated to absorb the diffusion the Brownian motion injects. At \(\sigma_t = 0\) both the stochastic spreading and the score correction vanish together, and the stochastic differential equation collapses to the deterministic ODE outright.

The score and the velocity field are the same object

The appearance of the score function above is not a coincidence of notation. For a Gaussian path, the velocity field and the score are related by an affine formula that, for the usual schedules, is invertible at interior times: each is recoverable from the other through the schedule \(\alpha_t, \beta_t\) and the current position \(\mathbf{x}\). What the diffusion literature calls learning the score function, and what flow matching calls learning the velocity field, are two parametrizations of one underlying quantity. The diffusion development built that quantity through Brownian motion and time-reversal; the flow-matching development built it through deterministic transport and the marginalization trick. They arrive at the same place.
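The conversion can be written out explicitly. The following is a sketch for the Gaussian path, assuming differentiable schedules and \(\alpha_t > 0\); dots denote time derivatives, a notation introduced here. The conditional score is \(\nabla \log p_t(\mathbf{x} \mid \mathbf{z}) = -(\mathbf{x} - \alpha_t \mathbf{z})/\beta_t^2\), so \(\mathbf{z} = (\mathbf{x} + \beta_t^2\, \nabla \log p_t(\mathbf{x} \mid \mathbf{z}))/\alpha_t\). Substituting this into the conditional velocity \(\dot{\alpha}_t \mathbf{z} + (\dot{\beta}_t/\beta_t)(\mathbf{x} - \alpha_t \mathbf{z})\), and noting that the marginal score and the marginal velocity are the same posterior-weighted average of their conditional counterparts, gives

\[ \mathbf{u}_t(\mathbf{x}) = \Big(\beta_t^2\, \frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t\, \beta_t\Big)\, \mathbf{s}_t(\mathbf{x}) + \frac{\dot{\alpha}_t}{\alpha_t}\, \mathbf{x}. \]

For \(\alpha_t = t\), \(\beta_t = 1 - t\) this reads \(\mathbf{u}_t(\mathbf{x}) = \tfrac{1 - t}{t}\, \mathbf{s}_t(\mathbf{x}) + \tfrac{1}{t}\, \mathbf{x}\), which can be solved for \(\mathbf{s}_t\) at any \(0 < t < 1\).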

Why the deterministic end is attractive

If every \(\sigma_t\) carries the same distribution, why prefer the deterministic end? Because of sampling cost. A stochastic trajectory wanders and must be integrated in many small steps; a deterministic trajectory, especially one whose path has been chosen to be nearly straight, can be integrated in few (in the ideal limit, a single step). Each step is one evaluation of the network, so straighter and deterministic means faster generation. The schedule whose conditional trajectories are straight segments from noise to data is the basis of rectified flow; pushing toward genuine straightness connects to optimal transport, which characterizes the straightest coupling between two distributions. The large-scale image generators that displaced classical diffusion samplers in practice, Stable Diffusion 3 among them, are rectified-flow models: they trade the curved stochastic trajectories of classical diffusion for straighter deterministic ones, and gain the large reduction in sampling steps that straightness buys.
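The cost argument can be illustrated on conditional trajectories aimed at a single target. The sketch below compares Euler integration along the straight schedule \(\alpha_t = t\), \(\beta_t = 1 - t\) with a curved alternative, \(\alpha_t = \sin(\pi t/2)\), \(\beta_t = \cos(\pi t/2)\), chosen here purely as an example of a non-straight path. It is a toy comparison under these assumptions, not a benchmark of trained models, whose marginal trajectories are only approximately straight; the function names are illustrative.

```python
import numpy as np

def straight_field(t, x, z):
    """Conditional velocity for alpha_t = t, beta_t = 1 - t."""
    return (z - x) / (1.0 - t)

def curved_field(t, x, z):
    """Conditional velocity for alpha_t = sin(pi t/2), beta_t = cos(pi t/2)."""
    a, b = np.sin(0.5 * np.pi * t), np.cos(0.5 * np.pi * t)
    da, db = 0.5 * np.pi * b, -0.5 * np.pi * a   # time derivatives of a, b
    return (da - db / b * a) * z + db / b * x

def euler_error(field, x0, z, n_steps):
    """Distance from the target after n_steps Euler steps from t=0 to t=1."""
    x, h = np.array(x0, dtype=float), 1.0 / n_steps
    for k in range(n_steps):
        x = x + h * field(k * h, x, z)
    return np.linalg.norm(x - z)

x0 = np.array([0.5, -1.0])  # a fixed "noise" starting point
z = np.array([2.0, 1.0])    # the target data point

# Map: number of steps -> (error on straight path, error on curved path).
errors = {n: (euler_error(straight_field, x0, z, n),
              euler_error(curved_field, x0, z, n))
          for n in (1, 2, 8, 200)}
```

On the straight path the velocity is constant along the trajectory, so even a single Euler step lands on the target up to rounding; on the curved path the error shrinks only as the step count grows.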

This is the sense in which flow matching generalizes diffusion. It keeps what made diffusion work — a tractable training signal, a smooth path from noise to data — exposes the stochasticity as a tunable coefficient rather than a built-in commitment, and frees the path itself to be chosen. Diffusion is the well-studied region of that design space at \(\sigma_t > 0\); the deterministic flows that now drive large-scale generation are the same construction at \(\sigma_t = 0\) with a straightened path; and the framework leaves room for choices not yet made.

Interactive Demo

Both panels start from the same noise sample and aim at the same five target modes (rings); only the path differs — straight on the left, curved on the right. Blue points reached a mode, orange missed. Drag N to change the number of Euler steps.
