Continuous Fourier Transform
Fourier series decompose periodic functions into discrete frequency components.
However, most signals in practice - a spoken sentence, a transient pulse, a neural network's activation - are not periodic. The
Fourier transform generalizes the Fourier series to arbitrary functions on \(\mathbb{R}\), replacing the discrete
spectrum with a continuous one. This generalization is the mathematical backbone of signal processing, spectral analysis, and
frequency-domain methods throughout machine learning.
To see how this arises, consider a function \(f_L\) periodic with period \(2L\). Its complex Fourier series is:
\[
f_L(x) = \sum_{n=-\infty}^{\infty} c_n e^{-i\frac{n\pi x}{L}}, \quad c_n = \frac{1}{2L}\int_{-L}^{L} f_L(x)e^{i\frac{n\pi x}{L}} \, dx.
\]
As \(L \to \infty\), the discrete frequencies \(\omega_n = \frac{n\pi}{L}\) become densely packed with spacing
\(\Delta\omega = \frac{\pi}{L} \to 0\), and the sum approaches an integral. This limiting process yields the
Fourier transform.
Note on Conventions:
As discussed in the Fourier series conventions, we use the mathematical/PDE convention
with positive sign in the forward transform and negative in the inverse. Engineering and physics often use the
opposite convention: \(\hat{f}(\omega) = \int_{-\infty}^{\infty} f(t)e^{-i\omega t} \, dt\).
Both are mathematically equivalent, just be consistent within your work.
The Fourier transform is well-defined for different function classes:
- For \(f \in L^1(\mathbb{R})\) (absolutely integrable):
The integral defining \(\hat{f}\) converges absolutely, and \(\hat{f}\) is bounded and continuous
with \(|\hat{f}(\xi)| \leq \|f\|_{L^1}\).
- For \(f \in L^2(\mathbb{R})\) (square-integrable):
On the intersection \(L^1 \cap L^2\), the integral definition above already applies, and Plancherel's theorem
(proved in the Properties section) holds. For \(f \in L^2 \setminus L^1\), the integral does not converge absolutely,
and the Fourier transform must be redefined as a bounded linear operator on the Hilbert space \(L^2(\mathbb{R})\) — see the
Fourier analysis in Hilbert spaces
page.
- For tempered distributions:
The theory extends to generalized functions, allowing transforms of \(\delta\)-functions, polynomials,
and other non-integrable functions important in applications.
Example: Gaussian Function
Consider the Gaussian function:
\[
f(x) = e^{-\frac{x^2}{2\sigma^2}}.
\]
Its Fourier transform (using the mathematical convention) is:
\[
\begin{align*}
\hat{f}(\xi) &= \int_{-\infty}^{\infty} e^{-\frac{x^2}{2\sigma^2}}e^{i\xi x} \, dx \\\\
&= \int_{-\infty}^{\infty} e^{-\frac{x^2}{2\sigma^2} + i\xi x} \, dx.
\end{align*}
\]
Complete the square in the exponent:
\[
\begin{align*}
-\frac{x^2}{2\sigma^2} + i\xi x
&= -\frac{1}{2\sigma^2}(x^2 - 2i\sigma^2\xi x) \\\\
&= -\frac{1}{2\sigma^2}((x - i\sigma^2\xi)^2 + \sigma^4\xi^2).
\end{align*}
\]
Therefore:
\[
\hat{f}(\xi) = e^{-\frac{\sigma^2\xi^2}{2}} \int_{-\infty}^{\infty} e^{-\frac{(x-i\sigma^2\xi)^2}{2\sigma^2}} \, dx.
\]
Since the integrand is entire and decays exponentially as \(|\mathrm{Re}\, z| \to \infty\) uniformly on horizontal strips of
bounded imaginary part, Cauchy's theorem applied to a rectangular contour — together with the vanishing of the vertical-side
contributions in the limit — justifies shifting the contour from \(\mathbb{R}\) to \(\mathbb{R} + i\sigma^2\xi\) without changing the value:
\[
\begin{align*}
\int_{-\infty}^{\infty} e^{-\frac{(x-i\sigma^2\xi)^2}{2\sigma^2}} \, dx
&= \int_{-\infty}^{\infty} e^{-\frac{u^2}{2\sigma^2}} \, du \\\\
&= \sigma\sqrt{2\pi}.
\end{align*}
\]
Thus:
\[
\hat{f}(\xi) = \sigma\sqrt{2\pi} \cdot e^{-\frac{\sigma^2\xi^2}{2}}.
\]
This shows that the Gaussian is essentially an eigenfunction of the Fourier transform: its transform is also
Gaussian, with inverse width (narrow in space \(\leftrightarrow\) wide in frequency). This property makes Gaussians
fundamental in uncertainty principles, quantum mechanics, and signal processing.
Properties of Fourier Transform
The Fourier transform is far more than a definition - its power lies in a collection of algebraic properties that
convert difficult operations (differentiation, convolution) into simple ones (multiplication, pointwise products).
These properties, stated below using the mathematical convention \(\hat{f}(\xi) = \int_{-\infty}^{\infty} f(x)e^{ix\xi} \, dx\),
form the foundation of spectral methods in both analysis and computation.
Property: Linearity
\[
\mathcal{F}\{\alpha f + \beta g\} = \alpha \mathcal{F}\{f\} + \beta \mathcal{F}\{g\}
\]
for constants \(\alpha, \beta \in \mathbb{C}\).
Property: Translation (Shift)
\[
\mathcal{F}\{f(x - a)\}(\xi) = e^{ia\xi}\hat{f}(\xi)
\]
A shift in space corresponds to a phase shift in frequency.
Property: Scaling
\[
\mathcal{F}\{f(ax)\}(\xi) = \frac{1}{|a|}\hat{f}\left(\frac{\xi}{a}\right), \quad a \neq 0
\]
This embodies the uncertainty principle: compressing a signal in time stretches its frequency
spectrum, and vice versa. One cannot localize a signal arbitrarily well in both time and frequency simultaneously.
Property: Differentiation
\[
\mathcal{F}\{f'(x)\}(\xi) = -i\xi \hat{f}(\xi)
\]
More generally, for the \(n\)-th derivative:
\[
\mathcal{F}\{f^{(n)}(x)\}(\xi) = (-i\xi)^n \hat{f}(\xi)
\]
This transforms differentiation into algebraic multiplication, which is why Fourier methods are powerful for
solving differential equations. Note the sign difference from the engineering convention.
Theorem: Fourier Inversion
For suitable functions (e.g., \(f \in L^1(\mathbb{R})\) with
\(\hat{f} \in L^1(\mathbb{R})\)), the original function can be recovered:
\[
f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{f}(\xi)e^{-ix\xi} \, d\xi.
\]
A corollary: \(\mathcal{F}\{\mathcal{F}\{f\}\}(x) = 2\pi f(-x)\), demonstrating
a deep symmetry between the spatial and frequency domains.
Proof (Schwartz class):
The theorem statement gives the inversion formula for the class \(\{f \in L^1(\mathbb{R}) : \hat f \in L^1(\mathbb{R})\}\), but
the proof we present here is restricted to the smaller Schwartz class \(\mathcal{S}(\mathbb{R}) \subset \{f \in L^1 : \hat f \in L^1\}\).
The extension from \(\mathcal{S}(\mathbb{R})\) to the full \(L^1\) class indicated in the statement requires additional machinery
and is discussed after the proof. The Schwartz-class technique is Gaussian regularization: insert a Gaussian damping factor
that makes Fubini applicable, then remove it in the limit.
Step 1: regularized inverse.
For \(\epsilon > 0\), define
\[
I_\epsilon(x) :=\; \frac{1}{2\pi} \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{-\epsilon \xi^2/2}\, e^{-ix\xi}\, d\xi.
\]
Since \(\hat f \in \mathcal{S}(\mathbb{R})\) (the Fourier transform preserves Schwartz), the integrand is absolutely integrable
for every \(\epsilon \geq 0\); in particular the integral converges and \(I_\epsilon(x)\) is well-defined.
Step 2: Fubini interchange.
Substituting \(\hat f(\xi) = \int f(y) e^{iy\xi}\, dy\) and observing that the resulting double integrand satisfies
\[
\bigl|f(y)\, e^{iy\xi}\, e^{-\epsilon\xi^2/2}\, e^{-ix\xi}\bigr| = |f(y)|\, e^{-\epsilon\xi^2/2} \in L^1(\mathbb{R} \times \mathbb{R})
\]
(the product of an \(L^1\) function in \(y\) with an \(L^1\) function in \(\xi\)),
Fubini's theorem
permits the order swap:
\[
\begin{align*}
I_\epsilon(x) \;&=\; \int_{-\infty}^{\infty} f(y)\, K_\epsilon(y - x)\, dy, \\\\
K_\epsilon(z) \;&:=\; \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{iz\xi}\, e^{-\epsilon\xi^2/2}\, d\xi.
\end{align*}
\]
The kernel \(K_\epsilon\) is the forward Fourier transform of the Gaussian \(\xi \mapsto e^{-\epsilon\xi^2/2}\),
evaluated at \(z\), divided by \(2\pi\). The Gaussian-transform calculation on this page
(with the roles of space and frequency variables exchanged: apply the formula
\[
\widehat{e^{-x^2/(2\sigma^2)}}(\xi) = \sigma\sqrt{2\pi}\, e^{-\sigma^2 \xi^2/2}
\]
to the variable \(\xi\) playing the role of \(x\), with \(\sigma^2 = 1/\epsilon\)) gives
\[
\int e^{iz\xi}\, e^{-\epsilon\xi^2/2}\, d\xi = \sqrt{2\pi/\epsilon}\, e^{-z^2/(2\epsilon)}.
\]
Thus
\[
K_\epsilon(z) \;=\; \frac{1}{\sqrt{2\pi\epsilon}}\, e^{-z^2/(2\epsilon)},
\]
a Gaussian of unit mass: \(\int K_\epsilon(z)\, dz = 1\).
Step 3: convolution limit.
By the substitution \(z = y - x\),
\(I_\epsilon(x) = \int f(x + z)\, K_\epsilon(z)\, dz\). Using \(\int K_\epsilon = 1\) to write
\(f(x) = \int f(x)\, K_\epsilon(z)\, dz\),
\[
I_\epsilon(x) - f(x) \;=\; \int_{-\infty}^{\infty} \bigl[ f(x + z) - f(x) \bigr]\, K_\epsilon(z)\, dz.
\]
Fix \(\eta > 0\). Schwartz functions are uniformly continuous on \(\mathbb{R}\)
(a consequence of bounded derivative and rapid decay), so there exists \(\delta > 0\) such that \(|f(x + z) - f(x)| < \eta\)
whenever \(|z| \leq \delta\), uniformly in \(x\). Split the integral accordingly.
On the inner region \(|z| \leq \delta\), positivity of \(K_\epsilon\) and unit mass give
\[
\left| \int_{|z| \leq \delta} [f(x+z) - f(x)]\, K_\epsilon(z)\, dz \right| \;\leq\; \eta.
\]
On the outer region \(|z| > \delta\), the substitution \(u = z/\sqrt{\epsilon}\) shows
\[
\int_{|z| > \delta} K_\epsilon(z)\, dz
\;=\; \int_{|u| > \delta/\sqrt{\epsilon}} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, du \;\longrightarrow\; 0 \quad \text{as } \epsilon \to 0,
\]
since the lower bound \(\delta/\sqrt{\epsilon} \to \infty\) and the integrand is the standard-normal density.
Combined with the trivial bound \(|f(x+z) - f(x)| \leq 2 \sup_{\mathbb{R}} |f|\) on the outer region,
\[
\limsup_{\epsilon \to 0}\, |I_\epsilon(x) - f(x)| \;\leq\; \eta.
\]
Since \(\eta > 0\) was arbitrary, \(\lim_{\epsilon \to 0} I_\epsilon(x) = f(x)\).
Step 4: spectral-side limit.
Returning to the definition of \(I_\epsilon\), the integrand
\[
\hat f(\xi)\, e^{-\epsilon \xi^2/2}\, e^{-ix\xi}
\]
is dominated by \(|\hat f(\xi)| \in L^1(\mathbb{R})\) uniformly
in \(\epsilon\) and converges pointwise to \(\hat f(\xi)\, e^{-ix\xi}\) as \(\epsilon \to 0\). By the
dominated convergence theorem,
\[
\lim_{\epsilon \to 0} I_\epsilon(x) \;=\; \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat f(\xi)\, e^{-ix\xi}\, d\xi.
\]
Equating the two limits established in Steps 3 and 4 yields the inversion formula for every \(f \in \mathcal{S}(\mathbb{R})\).
On the \(L^1\) extension.
The Schwartz proof above does not directly extend to the larger class
\[
\{f \in L^1(\mathbb{R}) : \hat f \in L^1(\mathbb{R})\}
\]
by a one-line density argument: although Schwartz functions are dense in \(L^1\), the resulting approximation \(\hat{f_n} \to \hat f\)
is only uniform, not \(L^1\), so the right-hand side of the inversion formula does not pass to the limit for free.
A rigorous \(L^1\) treatment proceeds by integrating both sides of the Schwartz identity against test functions and invoking the theory
of tempered distributions — material that lies outside the present page.
Inversion shows that the Fourier transform is invertible on the Schwartz class
(the formula \(f = \mathcal{F}^{-1}\mathcal{F} f\) holds for every \(f \in \mathcal{S}(\mathbb{R})\), and the corollary \(\mathcal{F}\mathcal{F} f(x) = 2\pi f(-x)\)
ensures invertibility from the other side as well). The next theorem extracts from this the deeper structural fact: on \(L^1 \cap L^2(\mathbb{R})\),
the Fourier transform also preserves the \(L^2\) inner product, up to the normalization constant \(\tfrac{1}{2\pi}\).
This is Plancherel's identity — the continuous-frequency analogue of the
Parseval identity for Fourier series.
Its proof uses inversion (Stage 1 below), then extends from \(\mathcal{S}(\mathbb{R})\) to \(L^1 \cap L^2(\mathbb{R})\) by a density argument (Stage 2).
Theorem: Plancherel's Theorem (on \(L^1 \cap L^2\))
For \(f, g \in L^1 \cap L^2(\mathbb{R})\), the classical Fourier transform defined by the absolutely convergent integral
\(\hat f(\xi) = \int f(x) e^{ix\xi}\, dx\) satisfies
\[
\int_{-\infty}^{\infty} f(x)\overline{g(x)} \, dx = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{f}(\xi)\overline{\hat{g}(\xi)} \, d\xi.
\]
In particular, setting \(f = g\):
\[
\|f\|_2^2 = \int_{-\infty}^{\infty} |f(x)|^2 \, dx = \frac{1}{2\pi}\int_{-\infty}^{\infty} |\hat{f}(\xi)|^2 \, d\xi.
\]
Proof:
We prove the identity in two stages: first on the Schwartz class \(\mathcal{S}(\mathbb{R})\),
then by density on all of \(L^1 \cap L^2(\mathbb{R})\). It suffices to prove the polarized form;
the special case \(f = g\) follows by setting them equal.
Stage 1: identity on the Schwartz class.
Fix \(f, g \in \mathcal{S}(\mathbb{R})\). Since \(\mathcal{S}(\mathbb{R})\) is closed under the Fourier transform,
\(\hat f, \hat g \in \mathcal{S}(\mathbb{R})\) as well. Apply the Fourier inversion theorem (proved above) to \(g\):
\[
g(x) \;=\; \frac{1}{2\pi}\int \hat g(\xi)\, e^{-ix\xi}\, d\xi.
\]
Taking the complex conjugate of both sides, and using \(\overline{e^{-ix\xi}} = e^{+ix\xi}\),
\[
\overline{g(x)} \;=\; \frac{1}{2\pi}\int \overline{\hat g(\xi)}\, e^{ix\xi}\, d\xi.
\]
Multiplying by \(f(x)\) and integrating in \(x\),
\[
\int f(x)\overline{g(x)}\, dx \;=\; \frac{1}{2\pi}\iint f(x)\, e^{ix\xi}\, \overline{\hat g(\xi)}\, dx\, d\xi.
\]
To justify the order swap in the double integral, we check the integrability hypothesis of Fubini's theorem.
The absolute value of the integrand is \(|f(x)| \cdot |\hat g(\xi)|\), and Tonelli's theorem applied to
this non-negative measurable function gives
\[
\iint |f(x)| \cdot |\hat g(\xi)|\, dx\, d\xi \;=\; \|f\|_1 \cdot \|\hat g\|_1 \;<\; \infty,
\]
since both \(f\) and \(\hat g\) are Schwartz, hence in \(L^1(\mathbb{R})\).
Therefore the integrand is in \(L^1(\mathbb{R} \times \mathbb{R})\), and
Fubini's theorem
permits the order swap:
\[
\frac{1}{2\pi}\iint f(x)\, e^{ix\xi}\, \overline{\hat g(\xi)}\, dx\, d\xi
\;=\; \frac{1}{2\pi}\int \overline{\hat g(\xi)} \left( \int f(x)\, e^{ix\xi}\, dx \right) d\xi
\;=\; \frac{1}{2\pi}\int \hat f(\xi)\, \overline{\hat g(\xi)}\, d\xi,
\]
where the inner integral is \(\hat f(\xi)\) by definition. This establishes the polarized identity on \(\mathcal{S}(\mathbb{R})\).
Stage 2: extension to \(L^1 \cap L^2(\mathbb{R})\).
Equip \(L^1 \cap L^2(\mathbb{R})\) with the \(L^2\) norm. Then \(\mathcal{S}(\mathbb{R})\) is a dense subspace: indeed,
\(C_c^\infty(\mathbb{R}) \subset \mathcal{S}(\mathbb{R})\), and \(C_c^\infty(\mathbb{R})\) is dense in \(L^1 \cap L^2\)
by the standard truncation-and-mollification argument (cut off outside a large ball, then convolve with a smooth bump).
The same construction, applied to a fixed \(f \in L^1 \cap L^2\), produces an approximating sequence \(\{f_n\} \subset \mathcal{S}(\mathbb{R})\)
converging to \(f\) simultaneously in both \(L^1\) and \(L^2\)
(by the dominated convergence theorem applied to the cutoffs, combined with the standard \(L^p\)-continuity of mollification, \(1 \leq p < \infty\)).
We fix such a sequence for the remainder of the proof.
Stage 1 specialized to \(f = g\) gives
\[
\|f\|_2^2 = \frac{1}{2\pi}\|\hat f\|_2^2
\]
for every \(f \in \mathcal{S}(\mathbb{R})\); equivalently, the linear map
\(T : \mathcal{S}(\mathbb{R}) \to L^2(\mathbb{R})\), \(Tf := \hat f / \sqrt{2\pi}\), is an isometry from
\((\mathcal{S}(\mathbb{R}), \|\cdot\|_2)\) into \(L^2(\mathbb{R})\).
From this, two observations about our sequence \(\{f_n\}\) follow:
-
The sequence \(\hat{f_n}\) is Cauchy in \(L^2\).
By Stage 1 applied to the differences
\(f_n - f_m \in \mathcal{S}(\mathbb{R})\), \(\|\hat{f_n} - \hat{f_m}\|_2 = \sqrt{2\pi}\, \|f_n - f_m\|_2 \to 0\). Hence by
completeness of \(L^2\),
\(\hat{f_n}\) converges to some \(F \in L^2(\mathbb{R})\).
-
The limit \(F\) coincides with the classical Fourier transform \(\hat f\).
Since \(f \in L^1\), the integral \(\hat f(\xi) = \int f(x)\, e^{ix\xi}\, dx\) is absolutely convergent and defines
\(\hat f \in L^\infty(\mathbb{R})\).
The \(L^1\)-convergence \(f_n \to f\) (which our sequence provides) gives \(\hat{f_n} \to \hat f\) uniformly
on \(\mathbb{R}\), by the elementary estimate
\[
\|\hat{f_n} - \hat f\|_\infty \;=\; \sup_\xi \left|\int (f_n - f)(x)\, e^{ix\xi}\, dx\right| \;\leq\; \|f_n - f\|_1 \;\to\; 0.
\]
Combined with \(\hat{f_n} \to F\) in \(L^2\), passing to a subsequence (using
subsequential a.e. convergence
for \(L^2\) convergence) yields pointwise a.e. convergence to both \(F\) and \(\hat f\), so \(F = \hat f\) a.e.
Taking the limit \(n \to \infty\) in the Stage 1 identity \(\|f_n\|_2^2 = \tfrac{1}{2\pi}\|\hat{f_n}\|_2^2\)
— using continuity of the \(L^2\) norm — yields
\[
\|f\|_2^2 = \tfrac{1}{2\pi}\|\hat f\|_2^2.
\]
The polarized form follows by the same argument applied to a second sequence \(\{g_n\} \subset \mathcal{S}(\mathbb{R})\)
approximating \(g \in L^1 \cap L^2\) simultaneously in \(L^1\) and \(L^2\): the bilinear identity
\[
\int f_n \overline{g_n}\, dx = \tfrac{1}{2\pi}\int \hat{f_n}\, \overline{\hat{g_n}}\, d\xi
\]
passes to the limit on both sides by continuity of the \(L^2\) inner product,
since \(f_n \to f\), \(g_n \to g\) in \(L^2\) and (by the same argument as above for
\(\hat f\)) \(\hat{f_n} \to \hat f\), \(\hat{g_n} \to \hat g\) in \(L^2\).
Looking ahead: both theorems on \(L^2(\mathbb{R})\).
The two main results of this section — Fourier inversion and Plancherel's identity — are stated above on classes that
require \(f \in L^1\): inversion on \(\{f \in L^1 : \hat f \in L^1\}\)
(proved here for the smaller Schwartz class), and Plancherel on \(L^1 \cap L^2\).
For functions \(f \in L^2 \setminus L^1\) (such as \((1+|x|)^{-1}\)), the defining integral
\(\hat f(\xi) = \int f(x)\, e^{ix\xi}\, dx\) fails to converge absolutely, so neither result applies directly.
The classes \(L^1\) and \(L^2\) are mutually non-contained on \(\mathbb{R}\)
(for instance \(|x|^{-1/2}\mathbf{1}_{[0,1]} \in L^1 \setminus L^2\)), so neither setting subsumes the other.
The remedy is to redefine the Fourier transform on \(L^2(\mathbb{R})\) as a bounded linear operator,
obtained as the unique continuous extension of its restriction to the dense subspace \(L^1 \cap L^2\).
Thanks to the completeness of \(L^2\) as a Hilbert space, this approach is structurally cleaner than the \(L^1\) inversion theory:
the normalized map \(f \mapsto \hat f / \sqrt{2\pi}\) becomes a unitary operator on \(L^2(\mathbb{R})\), whose
inverse coincides with its Hilbert-space adjoint, simultaneously delivering an \(L^2\) inversion formula and a Plancherel identity
on all of \(L^2\). This construction is carried out on
Fourier analysis in Hilbert spaces.
Convolution Theorem
The differentiation property above shows that the Fourier transform simplifies differentiation. The convolution theorem reveals an
even more powerful algebraic simplification: it converts the integral operation of convolution - which is \(O(n^2)\)
to compute directly - into pointwise multiplication in the frequency domain. This single result is the reason FFT-based methods dominate
signal processing, image filtering, and large-kernel neural network architectures.
The convolution of two functions \(f, g: \mathbb{R} \to \mathbb{C}\) is defined as:
\[
(f * g)(x) = \int_{-\infty}^{\infty} f(y)g(x-y) \, dy = \int_{-\infty}^{\infty} f(x-y)g(y) \, dy
\]
(The second equality shows convolution is commutative.)
Theorem: Convolution Theorem
The Fourier transform converts convolution into pointwise multiplication:
\[
\mathcal{F}\{f * g\}(\xi) = \hat{f}(\xi) \cdot \hat{g}(\xi)
\]
and conversely, the Fourier transform of a product is a (scaled) convolution:
\[
\mathcal{F}\{f \cdot g\}(\xi) = \frac{1}{2\pi}(\hat{f} * \hat{g})(\xi).
\]
This is one of the most important results in applied mathematics. It allows us to:
- Replace expensive convolution operations (\(O(n^2)\) for discrete signals) with multiplication after
FFT (\(O(n \log n)\))
- Understand filtering as multiplication in frequency domain
- Analyze linear time-invariant systems through their frequency response
Insight: Convolution Theorem in CS and ML
- Signal filtering:
Low-pass, high-pass, and band-pass filters are convolutions in time domain but simple multiplications in frequency domain
- Image processing: Gaussian blur, edge detection (Sobel, Canny), and sharpening are
convolution operations efficiently computed via FFT for large kernels
- Deep learning: While CNNs typically use small kernels \((3 \times 3, 5 \times 5)\) where direct convolution
is efficient, FFT-based convolution is useful for:
- Large kernels in certain architectures
- Global convolutions in some vision transformers
- Efficient implementation of depthwise separable convolutions
- Probability: The PDF of \(X + Y\) for independent random variables is
\(f_{X+Y} = f_X * f_Y\), computed efficiently via FFT
Proof (convolution to product):
Using the mathematical convention with \(\hat{f}(\xi) = \int f(x)e^{ix\xi} \, dx\):
\[
\begin{align*}
\mathcal{F}\{f * g\}(\xi) &= \int_{-\infty}^{\infty} (f * g)(x)e^{ix\xi} \, dx \\\\
&= \int_{-\infty}^{\infty} \left(\int_{-\infty}^{\infty} f(y)g(x-y) \, dy\right) e^{ix\xi} \, dx \\\\
&= \int_{-\infty}^{\infty} f(y) \left(\int_{-\infty}^{\infty} g(x-y)e^{ix\xi} \, dx\right) dy.
\end{align*}
\]
Substituting \(u = x - y\) (so \(x = u + y\) and \(dx = du\)):
\[
\begin{align*}
&= \int_{-\infty}^{\infty} f(y) \left(\int_{-\infty}^{\infty} g(u)e^{i(u+y)\xi} \, du\right) dy \\\\
&= \int_{-\infty}^{\infty} f(y)e^{iy\xi} \left(\int_{-\infty}^{\infty} g(u)e^{iu\xi} \, du\right) dy \\\\
&= \left(\int_{-\infty}^{\infty} f(y)e^{iy\xi} \, dy\right) \cdot \left(\int_{-\infty}^{\infty} g(u)e^{iu\xi} \, du\right) \\\\
&= \hat{f}(\xi) \cdot \hat{g}(\xi).
\end{align*}
\]
The interchange of integration order is justified by Fubini's theorem when \(f, g \in L^1(\mathbb{R})\).
Proof (product to convolution):
Under sufficient regularity (for instance, \(f, g \in L^1\) with \(\hat f, \hat g \in L^1\)), the inversion formula gives
\[
f(x) = \frac{1}{2\pi}\int \hat f(\eta) e^{-ix\eta}\, d\eta.
\]
Substituting into the Fourier transform of \(fg\) and applying Fubini:
\[
\begin{align*}
\mathcal{F}\{f \cdot g\}(\xi)
&= \int_{-\infty}^{\infty} \left[\frac{1}{2\pi}\int_{-\infty}^{\infty} \hat f(\eta)\, e^{-ix\eta}\, d\eta\right] g(x)\, e^{ix\xi}\, dx \\\\
&= \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat f(\eta)\left[\int_{-\infty}^{\infty} g(x)\, e^{ix(\xi - \eta)}\, dx\right] d\eta \\\\
&= \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat f(\eta)\, \hat g(\xi - \eta)\, d\eta
= \frac{1}{2\pi}(\hat f * \hat g)(\xi).
\end{align*}
\]
Discrete Fourier Transform (DFT)
The continuous Fourier transform operates on functions defined on all of \(\mathbb{R}\), but computers work with finite,
sampled data. The Discrete Fourier Transform (DFT) bridges this gap: it is the exact discrete analogue
of the continuous transform, operating on sequences of \(N\) complex numbers and producing \(N\) frequency coefficients.
The DFT inherits all the key properties of its continuous counterpart - linearity, shift, convolution - in discrete form.
Note on Sign Convention for DFT
Attention: In the continuous section above, we used the Mathematical/PDE convention (\(e^{+ix\xi}\))
to align with spectral theory. However, in the discrete domain (DFT/FFT), the Engineering/Data Science convention
(negative sign in the forward transform) is universally adopted in software libraries like NumPy, PyTorch, and MATLAB.
To ensure that the formulas below match the code you will write in Python, we switch to the standard
computational convention for the remainder of this page:
\[
\text{DFT: } e^{-i \dots}, \quad \text{IDFT: } e^{+i \dots}.
\]
Definition: DFT & IDFT
For a sequence \(\{x_0, x_1, \ldots, x_{N-1}\}\) of \(N\) complex numbers, the DFT is:
\[
X_k = \sum_{n=0}^{N-1} x_n e^{-\frac{2\pi i}{N}kn}, \quad k = 0, 1, \ldots, N-1.
\]
The inverse DFT (IDFT) is:
\[
x_n = \frac{1}{N}\sum_{k=0}^{N-1} X_k e^{\frac{2\pi i}{N}kn}, \quad n = 0, 1, \ldots, N-1.
\]
Connection to Continuous Transform: The DFT can be viewed as:
- Sampling a periodic function at \(N\) equally spaced points
- Computing Fourier series coefficients for the periodized version
- Approximating the continuous Fourier transform for band-limited signals (via Shannon-Nyquist sampling theorem)
Matrix Formulation:
Define \(\omega = e^{-\frac{2\pi i}{N}}\), a primitive \(N\)-th root of unity
(so \(\omega^N = 1\)). The DFT matrix is:
\[
W = \begin{bmatrix}
1 & 1 & 1 & \cdots & 1 \\
1 & \omega & \omega^2 & \cdots & \omega^{N-1} \\
1 & \omega^2 & \omega^4 & \cdots & \omega^{2(N-1)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & \omega^{N-1} & \omega^{2(N-1)} & \cdots & \omega^{(N-1)^2}
\end{bmatrix}
\]
where \(W_{kn} = \omega^{kn}\). Then:
\[
\mathbf{X} = W\mathbf{x}, \quad \mathbf{x} = \frac{1}{N}W^*\mathbf{X}
\]
where \(W^*\) is the conjugate transpose.
Properties of the DFT Matrix:
- Orthogonality: \(W^*W = NI\), so \(W^{-1} = \frac{1}{N}W^*\)
- Symmetry: \(W\) is symmetric: \(W^T = W\)
- Fourth-power identity and eigenvalues: Writing out \((W^2)_{ij} = \sum_k \omega^{k(i+j)}\)
shows \(W^2 = N \cdot P\), where \(P\) is the permutation matrix implementing \(n \mapsto -n \bmod N\).
Since \(P^2 = I\), we get \(W^4 = N^2 I\); equivalently, the normalized DFT matrix
\(F := \tfrac{1}{\sqrt{N}}\, W\) satisfies \(F^4 = I\). The eigenvalues of \(F\) therefore lie in
\(\{+1, -1, +i, -i\}\), while the eigenvalues of the unnormalized \(W\) are the scaled quartet
\(\{\pm\sqrt{N}, \pm i\sqrt{N}\}\).
- Circulant diagonalization: The DFT matrix diagonalizes all circulant matrices
Normalization conventions across domains
The choice between the unnormalized \(W\) and the normalized \(F = \tfrac{1}{\sqrt{N}} W\) is
not a mere cosmetic rescaling — it reflects which property of the transform the user cares
about most. Three regimes coexist in practice:
- Signal processing / scientific computing (unnormalized \(W\)):
NumPy, SciPy, MATLAB, FFTW, PyTorch (torch.fft.fft), and JAX all default to
forward DFT without normalization, placing the full \(1/N\) factor in the inverse.
This is the long-standing numerical-computation convention (reinforced by Cooley–Tukey
1965 and its descendants) and has two practical advantages: the convolution theorem
\(\mathcal{F}(f * g) = \hat f \cdot \hat g\) holds with no stray constants, and only the
inverse transform pays the \(1/N\) cost. The price is that the forward transform is
not an isometry — norms are scaled by \(\sqrt{N}\).
- Harmonic analysis / functional analysis (normalized \(F\)):
The discrete Plancherel identity \(\|f\|_2 = \|Ff\|_2\) requires the normalized version. The map
\(F : \mathbb{C}^N \to \mathbb{C}^N\) is then a unitary operator
(\(F^*F = I\)), and all of \(L^2\)-based Fourier theory transfers to the discrete setting
without rescaling. The clean eigenvalue structure \(\{\pm 1, \pm i\}\) derived above is a
symptom of the fourth-power identity \(F^4 = I\), with unitarity forcing the roots onto
the unit circle.
- Quantum computation (normalized \(F\) is mandatory):
The Quantum Fourier Transform, the key subroutine behind Shor's
algorithm and phase estimation, is the same matrix \(F\) — and it must be unitary
because every gate operation on qubits is, by the postulates of quantum mechanics, a
unitary transformation. There is no choice: the normalized version is the only one that
corresponds to a physically realizable quantum gate.
Most numerical libraries expose both options via a norm= argument — NumPy's
np.fft.fft(x, norm='ortho') returns the normalized transform, and
norm='backward' (the default) returns the unnormalized one. When porting a
formula from a textbook to code (or vice versa), the first question to ask is always:
which convention is this written in?
Computational Complexity:
- Direct computation: \(O(N^2)\) complex multiplications and additions
- Fast Fourier Transform: \(O(N \log N)\) operations (see next section)
Practical Considerations for CS/ML:
- Zero-padding: Padding sequences to powers of 2 enables efficient radix-2 FFT algorithms
- Windowing: Applying window functions (Hamming, Hann, Blackman) reduces spectral leakage from finite-length effects
- Real-valued signals: For real inputs, \(X_{N-k} = \overline{X_k}\) (Hermitian symmetry),
allowing optimization to compute only half the spectrum using rfft
- Normalization: Different software uses different conventions; NumPy uses \(\frac{1}{N}\) in IFFT only
Fast Fourier Transform (FFT)
The DFT requires \(O(N^2)\) operations when computed directly from its definition - prohibitive for large signals.
The Fast Fourier Transform (FFT), is not a different transform but an efficient algorithm that reduces
the complexity to \(O(N \log N)\). This reduction - from hours to seconds for large \(N\) - is what makes real-time audio
processing, medical imaging, and spectral methods in ML computationally feasible.
Cooley-Tukey Algorithm (Radix-2 FFT):
The key insight is to exploit the structure of the DFT matrix using divide-and-conquer. For \(N = 2^m\),
split the input into even and odd indices:
\[
\begin{align*}
X_k &= \sum_{n=0}^{N-1} x_n \omega^{kn} \\\\
&= \sum_{j=0}^{N/2-1} x_{2j} \omega^{k(2j)} + \sum_{j=0}^{N/2-1} x_{2j+1} \omega^{k(2j+1)} \\\\
&= \sum_{j=0}^{N/2-1} x_{2j} (\omega^2)^{kj} + \omega^k \sum_{j=0}^{N/2-1} x_{2j+1} (\omega^2)^{kj}.
\end{align*}
\]
Note that \(\omega^2 = e^{-\frac{4\pi i}{N}} = e^{-\frac{2\pi i}{N/2}}\) is a primitive \((N/2)\)-th root of unity.
Let \(E_k\) be the \((N/2)\)-point DFT of the even-indexed subsequence and \(O_k\) the \((N/2)\)-point
DFT of odd-indexed subsequence. Since \(E_k\) and \(O_k\) are themselves \((N/2)\)-periodic in \(k\)
(\(E_{k+N/2} = E_k\), \(O_{k+N/2} = O_k\)), and since \(\omega^{k+N/2} = -\omega^k\) (because
\(\omega^{N/2} = e^{-i\pi} = -1\)), we obtain, for \(k = 0, 1, \ldots, N/2-1\):
\[
X_k = E_k + \omega^k O_k,
\]
\[
X_{k+N/2} = E_k - \omega^k O_k.
\]
This gives the recurrence \(T(N) = 2T(N/2) + O(N)\), yielding \(T(N) = O(N \log N)\) by the Master theorem.
Other FFT Algorithms:
- Radix-4, Radix-8: Reduce the number of complex multiplications by processing 4 or 8 points at once
- Split-radix FFT: Combines radix-2 and radix-4 for a low operation count
- Prime-factor algorithm (Good-Thomas): For \(N = N_1 N_2\) with \(\gcd(N_1, N_2) = 1\)
- Chirp Z-transform (Bluestein): Computes DFT of any size via convolution
Implementation Considerations:
- Bit-reversal permutation: Required for in-place Cooley-Tukey algorithm
- Twiddle factors: Precomputing \(\omega^k = e^{-2\pi ik/N}\) reduces trigonometric computations
- Cache optimization: FFTW uses self-tuning to optimize for specific hardware
- SIMD vectorization: Modern CPUs can compute multiple FFT butterflies in parallel
Applications in Machine Learning
Fourier methods are fundamental to modern machine learning, particularly in areas requiring efficient computation,
signal processing, and function approximation. Here we highlight three significant applications.
1. Audio and Speech Processing:
Speech recognition systems such as OpenAI's Whisper (2022) and its open-source successors
convert raw audio into log-Mel spectrograms via the Short-Time Fourier Transform (STFT):
the audio is windowed into overlapping segments, each transformed via FFT to obtain frequency content, then
mapped to the Mel scale (which approximates human auditory perception). This frequency-domain representation
serves as input to the neural network — typically a Transformer encoder — enabling robust transcription
across languages and noisy environments.
Reference: Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision" (2022)
2. Fourier Neural Operators (FNO) for Scientific Computing:
Introduced by Li et al. (ICLR 2021), Fourier Neural Operators (FNOs) learn solution operators of parametric PDEs
directly in Fourier space, achieving up to three orders of magnitude speedup over traditional PDE solvers on benchmark
problems such as Burgers and Navier-Stokes equations. The key idea: parameterize the kernel \(\hat{k}(\xi)\) in
frequency domain, then multiply with the Fourier transform of the input. The approach has since inspired a rich family
of spectral neural operators.
FNO Architecture Sketch
Each FNO layer performs:
- Lift input to higher-dimensional channel space
- Apply spectral convolution: \(\mathcal{F}^{-1}(R \cdot \mathcal{F}(v))\) where \(R\) is a learnable weight matrix
- Add local residual connection \(Wv\)
- Apply non-linearity (activation function) e.g., GELU
- Project to output
The spectral convolution step is where the convolution theorem provides the key efficiency gain.
Reference: Li et al., "Fourier Neural Operator for Parametric PDEs"
3. Fourier Features as Positional Encoding:
Neural networks exhibit a spectral bias—they tend to learn low-frequency functions more easily than
high-frequency ones. Fourier feature mappings overcome this limitation by lifting low-dimensional inputs
into higher-dimensional spaces via sinusoidal functions. The Transformer architecture (Vaswani et al., 2017) introduced
sinusoidal positional encoding for sequence positions; this idea was later adapted to continuous spatial coordinates,
with Neural Radiance Fields (NeRF, 2020) mapping a scalar coordinate \(p\) to:
\[
\gamma(p) = \left[\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\right]
\]
enabling neural networks to capture fine geometric details. Tancik et al. (2020) gave a unified Neural Tangent Kernel
analysis explaining why such encodings tame spectral bias.
The same Fourier-feature principle reappears in modern Large Language Models as Rotary Position Embedding
(RoPE), which encodes token positions through rotations in 2D subspaces of the query/key vectors and has become
a de facto standard across recent LLM families.
References:
Mildenhall et al., "NeRF: Representing Scenes as Neural Radiance Fields" (2020);
Tancik et al., "Fourier Features Let Networks Learn High Frequency Functions" (NeurIPS 2020);
Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021)