The Trajectory Problem
A common shorthand divides modern generative AI into two camps: language is the domain of large
autoregressive transformers, and images and video are the domain of diffusion.
The second half of that sentence describes a landscape that has shifted. Many of the most capable
recent image and video generators — the FLUX family, among others — are trained not as diffusion
models but by a closely related principle called flow matching. The shift is one of
engineering emphasis, not of mathematical obsolescence: the machinery of diffusion remains exact and
illuminating, and we will see that it sits inside the flow-matching framework as a special
case rather than standing opposed to it. What changed is which trajectory practitioners choose to
integrate, and why.
The point of departure is our treatment of diffusion models.
There, generation was framed as the reversal of a noising process: a sample is produced by starting
from Gaussian noise and following a trajectory back toward the data distribution, one denoising
step at a time. That page closed with a scope note we now make good on. The discrete denoising chain
admits a continuous-time limit; in that limit the deterministic sampler traces out a smooth curve
governed by an ordinary differential equation — a probability-flow ODE — rather than a noisy,
stochastic walk. Flow matching takes this deterministic, continuous-time picture as primary, and in
doing so no longer needs a noising process to define the model.
The reframing is worth stating plainly. A diffusion sampler integrates a stochastic trajectory whose
randomness, inherited from its Brownian-motion origins, makes the path between noise and data wander.
A wandering path is expensive: faithfully integrating it demands many small steps, which is why classical
diffusion sampling can require hundreds or thousands of network evaluations to produce a single image.
Flow matching asks a more direct question. Rather than reversing a stochastic corruption, it seeks a
deterministic map from a source distribution \(p_{\text{init}}\) — typically, but not necessarily,
Gaussian noise — to the data distribution \(p_{\text{data}}\), realized as the time-1 flow of an ODE.
The object to be learned is the velocity field driving that ODE, and the central
advantage, developed over the next two sections, is that the trajectories can be made nearly
straight — and a straight path is cheap to integrate.
Two consequences of the deterministic ODE viewpoint deserve early mention, because they explain why
flow matching is not merely a recasting of diffusion but a genuine generalization. First, the source
distribution need not be Gaussian: any distribution one can sample from may serve as \(p_{\text{init}}\),
a freedom that diffusion's noise-centric construction does not naturally permit. Second, diffusion itself
reappears inside this framework as one particular choice of trajectory, rather than as a separate theory.
We develop the construction first, then return to both points once the machinery is in place.
Learning the Velocity Field
Fix a source distribution \(p_{\text{init}}\) and the data distribution \(p_{\text{data}}\), both on
\(\mathbb{R}^d\). Flow matching seeks a time-dependent velocity field
\(\mathbf{u}_t(\mathbf{x})\), defined for \(t \in [0, 1]\), with the following property: a point started
at a noise sample and carried along the field arrives, at time \(1\), at a data sample. Concretely,
if \(\mathbf{x}_t\) solves the ordinary differential equation
\[
\frac{\mathrm{d}}{\mathrm{d}t}\,\mathbf{x}_t = \mathbf{u}_t(\mathbf{x}_t),
\qquad \mathbf{x}_0 \sim p_{\text{init}},
\]
then its distribution at each intermediate time should trace out a prescribed probability
path \((p_t)_{0 \le t \le 1}\) — a curve in the space of distributions — running from
\(p_0 = p_{\text{init}}\) to \(p_1 = p_{\text{data}}\). The flow of this ODE, the map carrying each
starting point to its time-\(1\) endpoint, is the generator we want; sampling means drawing
\(\mathbf{x}_0\) from noise and integrating forward to \(t = 1\). That distinct trajectories never
cross — so that the flow is a well-defined map — follows from the existence-and-uniqueness theory for
ODEs (the Picard–Lindelöf theorem), valid when \(\mathbf{u}_t\) is Lipschitz continuous in \(\mathbf{x}\).
We take this regularity for granted and turn to the real problem: finding the field.
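Sampling by integrating the ODE can be sketched with the simplest numerical scheme, forward Euler. The snippet below is a minimal illustration rather than a production sampler: the name `euler_flow` and the toy field \(\mathbf{u}_t(\mathbf{x}) = \mathbf{x}\) are assumptions made here, the latter chosen only because its exact time-\(1\) flow (multiplication by \(e\)) is known.

```python
import numpy as np

def euler_flow(velocity, x0, n_steps):
    """Approximate the time-1 flow of dx/dt = velocity(t, x) with forward Euler."""
    x = np.asarray(x0, dtype=float).copy()
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = k * h                     # left endpoint of the current step
        x = x + h * velocity(t, x)    # one Euler step costs one field evaluation
    return x

# Toy field (an assumption for illustration): u_t(x) = x, whose exact flow is x0 * e.
x0 = np.array([1.0, -2.0])
x1 = euler_flow(lambda t, x: x, x0, n_steps=1000)
exact = x0 * np.e
```

In a real generator `velocity` would be the trained network, and each loop iteration would be one network evaluation.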
A path of distributions, built one data point at a time
We are free to choose the probability path, and the choice is made by first prescribing it
conditionally. For a fixed data point \(\mathbf{z} \sim p_{\text{data}}\), a
conditional probability path \(p_t(\cdot \mid \mathbf{z})\) is a family of
distributions interpolating from the source noise at \(t=0\) to a point mass at
\(\mathbf{z}\) at \(t=1\). The dominant choice in practice is the Gaussian conditional
path
\[
p_t(\cdot \mid \mathbf{z}) = \mathcal{N}\!\big(\alpha_t \mathbf{z},\, \beta_t^2 I\big),
\]
where the scalar schedules \(\alpha_t, \beta_t\) run monotonically from \(\alpha_0 = 0, \beta_0 = 1\)
(pure noise) to \(\alpha_1 = 1, \beta_1 = 0\) (the variance collapses and the Gaussian concentrates on
\(\mathbf{z}\)). Sampling from it is immediate: draw \(\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, I)\)
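and set \(\mathbf{x}_t = \alpha_t \mathbf{z} + \beta_t \boldsymbol{\varepsilon}\). The simplest schedule
of all, \(\alpha_t = t\) and \(\beta_t = 1 - t\), makes each sample
\(\mathbf{x}_t = t\,\mathbf{z} + (1 - t)\,\boldsymbol{\varepsilon}\) travel in a straight line from the noise draw
\(\boldsymbol{\varepsilon}\) to the data point \(\mathbf{z}\), a fact we will lean on heavily in the next section.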
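The sampling recipe for the Gaussian conditional path translates directly into code. The sketch below assumes the linear schedule as its default; the function name `sample_conditional_path` and the toy data point are illustrative choices, not part of any library.

```python
import numpy as np

def sample_conditional_path(z, t, rng, alpha=lambda t: t, beta=lambda t: 1.0 - t):
    """Draw x_t ~ N(alpha_t z, beta_t^2 I) as x_t = alpha_t z + beta_t eps."""
    eps = rng.standard_normal(np.shape(z))          # eps ~ N(0, I)
    return alpha(t) * np.asarray(z, dtype=float) + beta(t) * eps

rng = np.random.default_rng(0)
z = np.array([2.0, -1.0])                           # toy data point
# Many independent draws at t = 0.5: mean should sit near 0.5 * z, std near 0.5.
draws = np.stack([sample_conditional_path(z, 0.5, rng) for _ in range(20000)])
```

Passing other callables for `alpha` and `beta` gives other Gaussian paths, as long as they satisfy the boundary conditions stated above.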
A word on generality. With \(\alpha_0 = 0\) and \(\beta_0 = 1\) the conditional path begins at
\(p_0(\cdot \mid \mathbf{z}) = \mathcal{N}(\mathbf{0}, I)\) independently of \(\mathbf{z}\), so mixing
over data points leaves the source pinned to \(p_{\text{init}} = \mathcal{N}(\mathbf{0}, I)\): this
standard construction recovers a Gaussian source, the same starting point as diffusion. The freedom to
take an arbitrary \(p_{\text{init}}\), promised earlier, is realized by a more general construction that
conditions the path on a source–target pair \((\mathbf{x}_0, \mathbf{z})\) rather than on
\(\mathbf{z}\) alone; the marginalization argument below goes through unchanged for that case. We develop
the Gaussian-source version, the one in universal practical use, and note where the generalization
enters.
The path we actually care about is the marginal probability path, obtained by mixing
the conditional paths over all data points,
\[
p_t(\mathbf{x}) = \int p_t(\mathbf{x} \mid \mathbf{z})\, p_{\text{data}}(\mathbf{z})\, \mathrm{d}\mathbf{z}.
\]
One can sample from it — draw a data point, then draw from its conditional path — but its density is an
intractable integral. This tractable-to-sample, intractable-to-evaluate split is the crux of everything
that follows.
The continuity equation, and the trick it powers
What links a probability path to a velocity field is a conservation law. A field \(\mathbf{u}_t\)
transports the distribution \(p_t\) correctly — meaning a particle obeying the ODE stays distributed
according to \(p_t\) — if and only if the pair satisfies the continuity equation
\[
\partial_t\, p_t(\mathbf{x}) = -\,\operatorname{div}\!\big(p_t\, \mathbf{u}_t\big)(\mathbf{x}).
\]
The reading is physical: the left side is the rate at which probability mass at \(\mathbf{x}\) changes
in time; the divergence on the right measures net outflow of mass carried by the field, so its negative
is net inflow. Mass is neither created nor destroyed, since probability always integrates to one, so the two
sides must agree. This is the same equation that expresses conservation of mass for a compressible fluid or the
conservation of charge.
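The continuity equation can be checked numerically on a case where both sides are known. The sketch below assumes a one-dimensional Gaussian path \(\mathcal{N}(t z, (1-t)^2)\) toward a single toy point \(z\), transported by the affine flow \(x_t = t z + (1-t)\varepsilon\), whose velocity written as a function of position is \((z - x)/(1 - t)\). The grid, step size, and function names are choices made for this illustration.

```python
import numpy as np

Z = 1.5  # a fixed 1-D data point (toy value)

def density(t, x):
    """p_t(x) = N(t Z, (1 - t)^2), the linear-schedule Gaussian path toward Z."""
    mean, std = t * Z, 1.0 - t
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def velocity(t, x):
    """Field of the affine flow x_t = t Z + (1 - t) eps, as a function of x."""
    return (Z - x) / (1.0 - t)

def flux(t, x):
    return density(t, x) * velocity(t, x)

t, h = 0.4, 1e-4
x = np.linspace(-2.0, 4.0, 601)
dp_dt = (density(t + h, x) - density(t - h, x)) / (2 * h)   # central difference in t
div_flux = (flux(t, x + h) - flux(t, x - h)) / (2 * h)      # central difference in x
residual = np.abs(dp_dt + div_flux)                         # zero up to discretization error
```

Each of the two terms is of order one on this grid, while their sum is small by comparison, which is the conservation law at work.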
The continuity equation is what makes the conditional construction pay off. The marginal velocity field
that transports \(p_t\) is recovered from the conditional fields by a weighted average — each data point
contributes the field aimed at it, weighted by how plausibly the current location \(\mathbf{x}\) arose
from that point:
\[
\mathbf{u}_t(\mathbf{x}) = \int \mathbf{u}_t(\mathbf{x} \mid \mathbf{z})\,
\frac{p_t(\mathbf{x} \mid \mathbf{z})\, p_{\text{data}}(\mathbf{z})}{p_t(\mathbf{x})}\, \mathrm{d}\mathbf{z}.
\]
This marginalization trick — verified by substituting the average into the continuity
equation and checking it holds — is the theoretical heart of flow matching. The marginal field is
intractable, just like the marginal density. But the conditional field \(\mathbf{u}_t(\mathbf{x} \mid \mathbf{z})\)
is explicit: for the Gaussian path it is a short closed-form expression in \(\mathbf{x}\), \(\mathbf{z}\),
the schedules \(\alpha_t, \beta_t\), and their time derivatives. The remaining question is whether we can train against the easy conditional
field and still recover the hard marginal one.
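To make the weighted average concrete: differentiating \(\mathbf{x}_t = \alpha_t \mathbf{z} + \beta_t \boldsymbol{\varepsilon}\) in time and eliminating \(\boldsymbol{\varepsilon}\) gives the conditional field \(\mathbf{u}_t(\mathbf{x} \mid \mathbf{z}) = \dot{\alpha}_t \mathbf{z} + (\dot{\beta}_t / \beta_t)(\mathbf{x} - \alpha_t \mathbf{z})\), which for \(\alpha_t = t\), \(\beta_t = 1 - t\) reduces to \((\mathbf{z} - \mathbf{x})/(1 - t)\). The sketch below evaluates the marginal field under the assumption that \(p_{\text{data}}\) is a finite, uniformly weighted dataset, so the integral becomes a sum. Function names and data are illustrative.

```python
import numpy as np

def conditional_velocity(x, z, t):
    """u_t(x | z) for alpha_t = t, beta_t = 1 - t: the straight-line field."""
    return (z - x) / (1.0 - t)

def marginal_velocity(x, data, t):
    """Posterior-weighted average of conditional fields over a finite dataset."""
    # log p_t(x | z_i), up to a constant shared by every data point
    log_w = -np.sum((x - t * data) ** 2, axis=1) / (2.0 * (1.0 - t) ** 2)
    w = np.exp(log_w - log_w.max())      # subtract the max for numerical stability
    w /= w.sum()                         # posterior weights over data points
    return w @ conditional_velocity(x, data, t)

data = np.array([[2.0, 0.0], [-2.0, 0.0]])                       # two toy data points
u_mid = marginal_velocity(np.zeros(2), data, t=0.5)              # equidistant location
u_near = marginal_velocity(np.array([1.0, 0.0]), data, t=0.9)    # close to one target
```

At the equidistant location the two pulls cancel; late in time and near one target, the posterior weight concentrates and the marginal field is essentially that target's conditional field. The sum over the whole dataset at every evaluation is what makes this quantity impractical as a training target.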
Conditional flow matching
We can, and the reason is a clean identity between two losses. The loss we would like to
minimize regresses a network \(\mathbf{u}_t^{\boldsymbol{\theta}}\) against the marginal field; the loss
we can minimize, the conditional flow matching objective, regresses it against
the explicit conditional field,
\[
\mathcal{L}_{\text{CFM}}(\boldsymbol{\theta})
= \mathbb{E}\Big[\big\|\,\mathbf{u}_t^{\boldsymbol{\theta}}(\mathbf{x}_t)
- \mathbf{u}_t(\mathbf{x}_t \mid \mathbf{z})\,\big\|^2\Big],
\]
with the expectation over a random time \(t \sim \mathrm{Unif}[0,1]\), a data point
\(\mathbf{z} \sim p_{\text{data}}\), and a point \(\mathbf{x}_t \sim p_t(\cdot \mid \mathbf{z})\) on the
conditional path. The two losses differ only by an additive constant independent of
\(\boldsymbol{\theta}\), so they have identical gradients — minimizing the conditional loss is exactly
minimizing the marginal one. The network learns the conditional field it is shown, and thereby, without
ever being shown it, the marginal field that transports noise into data.
This is what simulation-free means, and why it matters. No ODE is solved during
training. Each gradient step needs only a noise sample, a data sample, a random time, and the
conditional velocity between them — a closed-form target, evaluated without any integration through the
field being learned. Contrast this with the alternatives the trick avoids: regressing directly against the
marginal field would require evaluating an intractable average over the data at every step, and likelihood-based
training of a continuous flow would require simulating the very flow one is trying to fit. The
marginalization trick converts an intractable problem into ordinary regression.
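A training loop for the conditional objective can be sketched in a few lines. The version below makes several simplifying assumptions that should be read as such: the linear schedule \(\alpha_t = t\), \(\beta_t = 1 - t\), for which the conditional target along the path is \(\mathbf{z} - \boldsymbol{\varepsilon}\); a toy one-dimensional dataset; and a hypothetical linear-in-features model standing in for the neural network, trained by plain gradient descent.

```python
import numpy as np

def cfm_batch(data, batch_size, rng):
    """Draw one conditional-flow-matching batch for alpha_t = t, beta_t = 1 - t."""
    z = data[rng.integers(0, len(data), size=batch_size)]    # z ~ p_data
    eps = rng.standard_normal(z.shape)                        # eps ~ N(0, I)
    t = rng.uniform(0.0, 1.0, size=(batch_size, 1))           # t ~ Unif[0, 1]
    x_t = t * z + (1.0 - t) * eps                             # x_t ~ p_t(. | z)
    return x_t, t, z - eps                                    # target u_t(x_t | z)

def features(x_t, t):
    """Hypothetical stand-in for a network: fixed features of (x, t)."""
    return np.concatenate([x_t, t, x_t * t, np.ones_like(t)], axis=1)

def cfm_loss(weights, x_t, t, target):
    pred = features(x_t, t) @ weights
    return np.mean(np.sum((pred - target) ** 2, axis=1))

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=0.5, size=(2000, 1))         # toy 1-D dataset
x_eval, t_eval, y_eval = cfm_batch(data, 4096, rng)            # held-out batch
weights = np.zeros((4, 1))
initial_loss = cfm_loss(weights, x_eval, t_eval, y_eval)

for step in range(2000):                                       # no ODE solve anywhere
    x_t, t, target = cfm_batch(data, 256, rng)
    phi = features(x_t, t)
    grad = 2.0 * phi.T @ (phi @ weights - target) / len(phi)   # gradient of the batch loss
    weights -= 0.02 * grad

final_loss = cfm_loss(weights, x_eval, t_eval, y_eval)
```

The loss does not go to zero even for a perfect model: what remains is the constant by which the conditional and marginal objectives differ.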
Diffusion as a Special Case
We can now make good on a claim from the opening: diffusion is not a rival to flow matching but an
instance of it. The flow-matching construction transports a distribution along a probability path using
a deterministic ODE driven by the velocity field \(\mathbf{u}_t\). Diffusion models transport the same
distribution along the same path, but using a stochastic differential equation. The two are
connected by a single knob.
One knob: from deterministic flow to stochastic diffusion
Given a velocity field \(\mathbf{u}_t\) that transports the path \(p_t\), one may add a noise term and
obtain a stochastic differential equation that transports the very same path. For any
diffusion coefficient \(\sigma_t \ge 0\),
\[
\mathrm{d}\mathbf{x}_t = \Big[\mathbf{u}_t(\mathbf{x}_t)
+ \tfrac{\sigma_t^2}{2}\, \mathbf{s}_t(\mathbf{x}_t)\Big]\mathrm{d}t
+ \sigma_t\, \mathrm{d}\mathbf{w}_t,
\]
where \(\mathbf{s}_t = \nabla \log p_t\) is the score function of the path and
\(\mathbf{w}_t\) is Brownian motion, produces trajectories distributed according to \(p_t\) at every
time, for any choice of \(\sigma_t\). The deterministic flow is the endpoint \(\sigma_t = 0\):
the noise vanishes, the score term vanishes with it, and the ODE of the previous section is recovered.
Diffusion models live at \(\sigma_t > 0\): the trajectories acquire the wandering, stochastic character
of Brownian motion, while the distribution they carry is unchanged. The schedule that reproduces the
variance-preserving noising of classical denoising diffusion is one particular setting of
\(\sigma_t\) and the path; flow matching is the same family read at \(\sigma_t = 0\).
The bookkeeping that guarantees the path is preserved is, once again, a conservation law. A stochastic
differential equation with drift \(\boldsymbol{\mu}_t\) and diffusion coefficient \(\sigma_t\) evolves
its density according to the Fokker–Planck equation
\[
\partial_t\, p_t = -\operatorname{div}(p_t\, \boldsymbol{\mu}_t) + \tfrac{\sigma_t^2}{2}\,\Delta p_t.
\]
The first term is the familiar transport of the continuity equation; the Laplacian
\(\Delta p_t\) is the diffusion term that also appears in the heat equation, the mathematical
signature of mass spreading stochastically. Substituting the drift of our equation,
\(\boldsymbol{\mu}_t = \mathbf{u}_t + \tfrac{\sigma_t^2}{2}\,\mathbf{s}_t\), is what makes the two
parts conspire. The score contributes a term
\(-\tfrac{\sigma_t^2}{2}\operatorname{div}(p_t \nabla \log p_t)\); since
\(p_t \nabla \log p_t = \nabla p_t\), this is exactly \(-\tfrac{\sigma_t^2}{2}\Delta p_t\), which
cancels the diffusion term identically. What remains is
\[
\partial_t\, p_t = -\operatorname{div}(p_t\, \mathbf{u}_t),
\]
the continuity equation of the previous section, for every \(\sigma_t\). This is the precise
reason the whole family carries the same path: the score term is calibrated to absorb the diffusion
the Brownian motion injects. At \(\sigma_t = 0\) both the stochastic spreading and the score correction
vanish together, and the stochastic differential equation collapses to the deterministic ODE outright.
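The claim that every \(\sigma_t\) carries the same path can be checked by simulation in a case where velocity and score are both available in closed form. The sketch below assumes one-dimensional Gaussian data \(\mathcal{N}(M, S^2)\), a standard normal source, and the linear schedule, so that \(p_t = \mathcal{N}(tM,\, t^2 S^2 + (1-t)^2)\). It uses a constant \(\sigma\) and an Euler–Maruyama discretization; all names and constants are illustrative.

```python
import numpy as np

M, S = 2.0, 0.5   # toy data distribution N(M, S^2); the source is N(0, 1)

def path_mean_var(t):
    """Marginal path p_t = N(t M, t^2 S^2 + (1 - t)^2) under the linear schedule."""
    return t * M, (t * S) ** 2 + (1.0 - t) ** 2

def velocity(t, x):
    mean, var = path_mean_var(t)
    dvar_dt = 2.0 * t * S ** 2 - 2.0 * (1.0 - t)
    return M + 0.5 * dvar_dt / var * (x - mean)    # marginal field u_t(x)

def score(t, x):
    mean, var = path_mean_var(t)
    return -(x - mean) / var                       # grad log p_t(x)

def simulate(sigma, n_particles=20000, n_steps=500, seed=0):
    """Euler-Maruyama for dx = [u + sigma^2 / 2 * s] dt + sigma dW."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_particles)           # x_0 ~ p_init
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = k * h
        drift = velocity(t, x) + 0.5 * sigma ** 2 * score(t, x)
        x = x + h * drift + sigma * np.sqrt(h) * rng.standard_normal(n_particles)
    return x

x_ode = simulate(sigma=0.0)   # deterministic flow
x_sde = simulate(sigma=1.0)   # stochastic trajectories, same path of distributions
```

Both runs should end with sample mean close to `M` and sample standard deviation close to `S`, up to discretization and Monte Carlo error, even though individual trajectories in the two runs look nothing alike.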
The score and the velocity field are the same object
The appearance of the score function above is not a coincidence of notation. For a Gaussian path, the
velocity field and the score are related by an invertible affine formula: at a given location
\(\mathbf{x}\) and time \(0 < t < 1\), each is recoverable from the other through the schedule
\(\alpha_t, \beta_t\) and its time derivatives. What the diffusion literature calls learning the score function,
and what flow matching calls learning the velocity field, are two parametrizations of one underlying
quantity. The diffusion development built that quantity through Brownian motion and time-reversal; the
flow-matching development built it through deterministic transport and the marginalization trick. They
arrive at the same place.
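For reference, the conversion can be written out. The following is a sketch of the standard identity for the Gaussian path \(\mathcal{N}(\alpha_t \mathbf{z}, \beta_t^2 I)\), with dots denoting time derivatives (a notation introduced here for compactness):
\[
\mathbf{u}_t(\mathbf{x})
= \Big(\beta_t^2\,\frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t\,\beta_t\Big)\,\mathbf{s}_t(\mathbf{x})
+ \frac{\dot{\alpha}_t}{\alpha_t}\,\mathbf{x}.
\]
It follows from three facts: the marginal field is the conditional expectation \(\dot{\alpha}_t\,\mathbb{E}[\mathbf{z} \mid \mathbf{x}] + \dot{\beta}_t\,\mathbb{E}[\boldsymbol{\varepsilon} \mid \mathbf{x}]\); the score of the path is \(\mathbf{s}_t(\mathbf{x}) = -\mathbb{E}[\boldsymbol{\varepsilon} \mid \mathbf{x}]/\beta_t\); and \(\mathbf{x} = \alpha_t\,\mathbb{E}[\mathbf{z} \mid \mathbf{x}] + \beta_t\,\mathbb{E}[\boldsymbol{\varepsilon} \mid \mathbf{x}]\). For the linear schedule \(\alpha_t = t\), \(\beta_t = 1 - t\) the formula reduces to
\[
\mathbf{u}_t(\mathbf{x}) = \frac{1 - t}{t}\,\mathbf{s}_t(\mathbf{x}) + \frac{1}{t}\,\mathbf{x},
\]
whose score coefficient is strictly positive for \(0 < t < 1\), so the relation can be solved for \(\mathbf{s}_t\) as well.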
Why the deterministic end is attractive
If every \(\sigma_t\) carries the same distribution, why prefer the deterministic end? Because of
sampling cost. A stochastic trajectory wanders and must be integrated in many small steps; a
deterministic trajectory, especially one whose path has been chosen to be nearly straight, can be
integrated in few — in the ideal limit, a single step. Each step is one evaluation of the network, so
straighter and deterministic means faster generation. The schedule whose conditional trajectories are
straight segments from noise to data is the basis of rectified flow; pushing toward
genuine straightness connects to optimal transport, which characterizes the
straightest coupling between two distributions. The large-scale image generators that displaced
diffusion samplers in practice, among them Stable Diffusion 3, are rectified-flow models: they trade
the curved stochastic trajectories of classical diffusion for straight deterministic ones, and with them gain
the order-of-magnitude reduction in sampling steps that straightness buys.
This is the sense in which flow matching generalizes diffusion. It keeps what made diffusion
work — a tractable training signal, a smooth path from noise to data — exposes the stochasticity as a
tunable coefficient rather than a built-in commitment, and frees the path itself to be chosen.
Diffusion is the well-studied region of that design space at \(\sigma_t > 0\); the deterministic flows
that now drive large-scale generation are the same construction at \(\sigma_t = 0\) with a straightened
path; and the framework leaves room for choices not yet made.
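The step-count argument can be checked on a toy case. The sketch below assumes the data distribution is a single point \(\mathbf{z}\), so that the conditional field \(\dot{\alpha}_t \mathbf{z} + (\dot{\beta}_t/\beta_t)(\mathbf{x} - \alpha_t \mathbf{z})\) is also the marginal field, and compares Euler integration under the straight schedule \(\alpha_t = t\), \(\beta_t = 1 - t\) with a curved trigonometric schedule chosen here purely for contrast. Names and values are illustrative.

```python
import numpy as np

def euler_endpoint(alpha, beta, dalpha, dbeta, x0, z, n_steps):
    """Euler-integrate the conditional field of N(alpha_t z, beta_t^2 I) from t=0 to t=1."""
    x, h = np.array(x0, dtype=float), 1.0 / n_steps
    for k in range(n_steps):
        t = k * h   # the field is only evaluated at t < 1, where beta_t > 0
        x = x + h * (dalpha(t) * z + dbeta(t) / beta(t) * (x - alpha(t) * z))
    return x

straight = dict(alpha=lambda t: t, beta=lambda t: 1.0 - t,
                dalpha=lambda t: 1.0, dbeta=lambda t: -1.0)
half_pi = np.pi / 2
curved = dict(alpha=lambda t: np.sin(half_pi * t), beta=lambda t: np.cos(half_pi * t),
              dalpha=lambda t: half_pi * np.cos(half_pi * t),
              dbeta=lambda t: -half_pi * np.sin(half_pi * t))

x0, z = np.array([0.5, -1.0]), np.array([2.0, 1.0])   # toy noise sample and target

def miss(schedule, n_steps):
    """Distance between the Euler endpoint and the target."""
    return np.linalg.norm(euler_endpoint(x0=x0, z=z, n_steps=n_steps, **schedule) - z)

# Maps N to (straight-path error, curved-path error).
errors = {n: (miss(straight, n), miss(curved, n)) for n in (1, 4, 64)}
```

On the straight path a single Euler step already lands on the target, up to floating-point rounding, because the velocity along the trajectory is constant. On the curved path the error shrinks only as the number of steps grows. With more than one data point the marginal trajectories are no longer exactly straight, which is where rectification and optimal-transport couplings come in.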
Interactive Demo
Both panels start from the same noise sample and aim at the same five target modes (rings); only the
path differs: straight on the left, curved on the right. Blue points reached a mode; orange points
missed. The control N sets the number of Euler steps.