Fourier Transform

Continuous Fourier Transform Properties of Fourier Transform Convolution Theorem Discrete Fourier Transform Fast Fourier Transform Applications in Machine Learning Interactive Demos

Continuous Fourier Transform

Fourier series decompose periodic functions into discrete frequency components. However, most signals in practice - a spoken sentence, a transient pulse, a neural network's activation - are not periodic. The Fourier transform generalizes the Fourier series to arbitrary functions on \(\mathbb{R}\), replacing the discrete spectrum with a continuous one. This generalization is the mathematical backbone of signal processing, spectral analysis, and frequency-domain methods throughout machine learning.

To see how this arises, consider a function \(f_L\) periodic with period \(2L\). Its complex Fourier series is: \[ f_L(x) = \sum_{n=-\infty}^{\infty} c_n e^{-i\frac{n\pi x}{L}}, \quad c_n = \frac{1}{2L}\int_{-L}^{L} f_L(x)e^{i\frac{n\pi x}{L}} \, dx. \] As \(L \to \infty\), the discrete frequencies \(\omega_n = \frac{n\pi}{L}\) become densely packed with spacing \(\Delta\omega = \frac{\pi}{L} \to 0\), and the sum approaches an integral. This limiting process yields the Fourier transform.

Definition: Fourier Transform

For a function \(f: \mathbb{R} \to \mathbb{C}\), we define: \[ \hat{f}(\xi) = \mathcal{F}\{f\}(\xi) = \int_{-\infty}^{\infty} f(x)e^{i x\xi} \, dx \] where \(\xi \in \mathbb{R}\) is the frequency variable.

The inverse Fourier transform recovers \(f\) from \(\hat{f}\): \[ f(x) = \mathcal{F}^{-1}\{\hat{f}\}(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{f}(\xi)e^{-i x\xi} \, d\xi. \]

Note on Conventions:
As discussed in the Fourier series conventions, we use the mathematical/PDE convention with positive sign in the forward transform and negative in the inverse. Engineering and physics often use the opposite convention: \(\hat{f}(\omega) = \int_{-\infty}^{\infty} f(t)e^{-i\omega t} \, dt\). Both are mathematically equivalent, just be consistent within your work.

The Fourier transform is well-defined for different function classes:

Example: Gaussian Function

Consider the Gaussian function: \[ f(x) = e^{-\frac{x^2}{2\sigma^2}}. \] Its Fourier transform (using the mathematical convention) is: \[ \begin{align*} \hat{f}(\xi) &= \int_{-\infty}^{\infty} e^{-\frac{x^2}{2\sigma^2}}e^{i\xi x} \, dx \\\\ &= \int_{-\infty}^{\infty} e^{-\frac{x^2}{2\sigma^2} + i\xi x} \, dx. \end{align*} \] Complete the square in the exponent: \[ \begin{align*} -\frac{x^2}{2\sigma^2} + i\xi x &= -\frac{1}{2\sigma^2}(x^2 - 2i\sigma^2\xi x) \\\\ &= -\frac{1}{2\sigma^2}((x - i\sigma^2\xi)^2 + \sigma^4\xi^2). \end{align*} \] Therefore: \[ \hat{f}(\xi) = e^{-\frac{\sigma^2\xi^2}{2}} \int_{-\infty}^{\infty} e^{-\frac{(x-i\sigma^2\xi)^2}{2\sigma^2}} \, dx. \] Since the integrand is entire and decays exponentially as \(|\mathrm{Re}\, z| \to \infty\) uniformly on horizontal strips of bounded imaginary part, Cauchy's theorem applied to a rectangular contour — together with the vanishing of the vertical-side contributions in the limit — justifies shifting the contour from \(\mathbb{R}\) to \(\mathbb{R} + i\sigma^2\xi\) without changing the value: \[ \begin{align*} \int_{-\infty}^{\infty} e^{-\frac{(x-i\sigma^2\xi)^2}{2\sigma^2}} \, dx &= \int_{-\infty}^{\infty} e^{-\frac{u^2}{2\sigma^2}} \, du \\\\ &= \sigma\sqrt{2\pi}. \end{align*} \] Thus: \[ \hat{f}(\xi) = \sigma\sqrt{2\pi} \cdot e^{-\frac{\sigma^2\xi^2}{2}}. \]

This shows that the Gaussian is essentially an eigenfunction of the Fourier transform: its transform is also Gaussian, with inverse width (narrow in space \(\leftrightarrow\) wide in frequency). This property makes Gaussians fundamental in uncertainty principles, quantum mechanics, and signal processing.

Properties of Fourier Transform

The Fourier transform is far more than a definition - its power lies in a collection of algebraic properties that convert difficult operations (differentiation, convolution) into simple ones (multiplication, pointwise products). These properties, stated below using the mathematical convention \(\hat{f}(\xi) = \int_{-\infty}^{\infty} f(x)e^{ix\xi} \, dx\), form the foundation of spectral methods in both analysis and computation.

Property: Linearity

\[ \mathcal{F}\{\alpha f + \beta g\} = \alpha \mathcal{F}\{f\} + \beta \mathcal{F}\{g\} \] for constants \(\alpha, \beta \in \mathbb{C}\).

Property: Translation (Shift)

\[ \mathcal{F}\{f(x - a)\}(\xi) = e^{ia\xi}\hat{f}(\xi) \] A shift in space corresponds to a phase shift in frequency.

Property: Scaling

\[ \mathcal{F}\{f(ax)\}(\xi) = \frac{1}{|a|}\hat{f}\left(\frac{\xi}{a}\right), \quad a \neq 0 \] This embodies the uncertainty principle: compressing a signal in time stretches its frequency spectrum, and vice versa. One cannot localize a signal arbitrarily well in both time and frequency simultaneously.

Property: Differentiation

\[ \mathcal{F}\{f'(x)\}(\xi) = -i\xi \hat{f}(\xi) \] More generally, for the \(n\)-th derivative: \[ \mathcal{F}\{f^{(n)}(x)\}(\xi) = (-i\xi)^n \hat{f}(\xi) \] This transforms differentiation into algebraic multiplication, which is why Fourier methods are powerful for solving differential equations. Note the sign difference from the engineering convention.

Theorem: Fourier Inversion

For suitable functions (e.g., \(f \in L^1(\mathbb{R})\) with \(\hat{f} \in L^1(\mathbb{R})\)), the original function can be recovered: \[ f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{f}(\xi)e^{-ix\xi} \, d\xi. \] A corollary: \(\mathcal{F}\{\mathcal{F}\{f\}\}(x) = 2\pi f(-x)\), demonstrating a deep symmetry between the spatial and frequency domains.

Proof (Schwartz class):

The theorem statement gives the inversion formula for the class \(\{f \in L^1(\mathbb{R}) : \hat f \in L^1(\mathbb{R})\}\), but the proof we present here is restricted to the smaller Schwartz class \(\mathcal{S}(\mathbb{R}) \subset \{f \in L^1 : \hat f \in L^1\}\). The extension from \(\mathcal{S}(\mathbb{R})\) to the full \(L^1\) class indicated in the statement requires additional machinery and is discussed after the proof. The Schwartz-class technique is Gaussian regularization: insert a Gaussian damping factor that makes Fubini applicable, then remove it in the limit.

Step 1: regularized inverse.
For \(\epsilon > 0\), define \[ I_\epsilon(x) :=\; \frac{1}{2\pi} \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{-\epsilon \xi^2/2}\, e^{-ix\xi}\, d\xi. \] Since \(\hat f \in \mathcal{S}(\mathbb{R})\) (the Fourier transform preserves Schwartz), the integrand is absolutely integrable for every \(\epsilon \geq 0\); in particular the integral converges and \(I_\epsilon(x)\) is well-defined.

Step 2: Fubini interchange.
Substituting \(\hat f(\xi) = \int f(y) e^{iy\xi}\, dy\) and observing that the resulting double integrand satisfies \[ \bigl|f(y)\, e^{iy\xi}\, e^{-\epsilon\xi^2/2}\, e^{-ix\xi}\bigr| = |f(y)|\, e^{-\epsilon\xi^2/2} \in L^1(\mathbb{R} \times \mathbb{R}) \] (the product of an \(L^1\) function in \(y\) with an \(L^1\) function in \(\xi\)), Fubini's theorem permits the order swap: \[ \begin{align*} I_\epsilon(x) \;&=\; \int_{-\infty}^{\infty} f(y)\, K_\epsilon(y - x)\, dy, \\\\ K_\epsilon(z) \;&:=\; \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{iz\xi}\, e^{-\epsilon\xi^2/2}\, d\xi. \end{align*} \] The kernel \(K_\epsilon\) is the forward Fourier transform of the Gaussian \(\xi \mapsto e^{-\epsilon\xi^2/2}\), evaluated at \(z\), divided by \(2\pi\). The Gaussian-transform calculation on this page (with the roles of space and frequency variables exchanged: apply the formula \[ \widehat{e^{-x^2/(2\sigma^2)}}(\xi) = \sigma\sqrt{2\pi}\, e^{-\sigma^2 \xi^2/2} \] to the variable \(\xi\) playing the role of \(x\), with \(\sigma^2 = 1/\epsilon\)) gives \[ \int e^{iz\xi}\, e^{-\epsilon\xi^2/2}\, d\xi = \sqrt{2\pi/\epsilon}\, e^{-z^2/(2\epsilon)}. \] Thus \[ K_\epsilon(z) \;=\; \frac{1}{\sqrt{2\pi\epsilon}}\, e^{-z^2/(2\epsilon)}, \] a Gaussian of unit mass: \(\int K_\epsilon(z)\, dz = 1\).

Step 3: convolution limit.
By the substitution \(z = y - x\), \(I_\epsilon(x) = \int f(x + z)\, K_\epsilon(z)\, dz\). Using \(\int K_\epsilon = 1\) to write \(f(x) = \int f(x)\, K_\epsilon(z)\, dz\), \[ I_\epsilon(x) - f(x) \;=\; \int_{-\infty}^{\infty} \bigl[ f(x + z) - f(x) \bigr]\, K_\epsilon(z)\, dz. \] Fix \(\eta > 0\). Schwartz functions are uniformly continuous on \(\mathbb{R}\) (a consequence of bounded derivative and rapid decay), so there exists \(\delta > 0\) such that \(|f(x + z) - f(x)| < \eta\) whenever \(|z| \leq \delta\), uniformly in \(x\). Split the integral accordingly. On the inner region \(|z| \leq \delta\), positivity of \(K_\epsilon\) and unit mass give \[ \left| \int_{|z| \leq \delta} [f(x+z) - f(x)]\, K_\epsilon(z)\, dz \right| \;\leq\; \eta. \] On the outer region \(|z| > \delta\), the substitution \(u = z/\sqrt{\epsilon}\) shows \[ \int_{|z| > \delta} K_\epsilon(z)\, dz \;=\; \int_{|u| > \delta/\sqrt{\epsilon}} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, du \;\longrightarrow\; 0 \quad \text{as } \epsilon \to 0, \] since the lower bound \(\delta/\sqrt{\epsilon} \to \infty\) and the integrand is the standard-normal density. Combined with the trivial bound \(|f(x+z) - f(x)| \leq 2 \sup_{\mathbb{R}} |f|\) on the outer region, \[ \limsup_{\epsilon \to 0}\, |I_\epsilon(x) - f(x)| \;\leq\; \eta. \] Since \(\eta > 0\) was arbitrary, \(\lim_{\epsilon \to 0} I_\epsilon(x) = f(x)\).

Step 4: spectral-side limit.
Returning to the definition of \(I_\epsilon\), the integrand \[ \hat f(\xi)\, e^{-\epsilon \xi^2/2}\, e^{-ix\xi} \] is dominated by \(|\hat f(\xi)| \in L^1(\mathbb{R})\) uniformly in \(\epsilon\) and converges pointwise to \(\hat f(\xi)\, e^{-ix\xi}\) as \(\epsilon \to 0\). By the dominated convergence theorem, \[ \lim_{\epsilon \to 0} I_\epsilon(x) \;=\; \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat f(\xi)\, e^{-ix\xi}\, d\xi. \] Equating the two limits established in Steps 3 and 4 yields the inversion formula for every \(f \in \mathcal{S}(\mathbb{R})\).

On the \(L^1\) extension.
The Schwartz proof above does not directly extend to the larger class \[ \{f \in L^1(\mathbb{R}) : \hat f \in L^1(\mathbb{R})\} \] by a one-line density argument: although Schwartz functions are dense in \(L^1\), the resulting approximation \(\hat{f_n} \to \hat f\) is only uniform, not \(L^1\), so the right-hand side of the inversion formula does not pass to the limit for free. A rigorous \(L^1\) treatment proceeds by integrating both sides of the Schwartz identity against test functions and invoking the theory of tempered distributions — material that lies outside the present page.

Inversion shows that the Fourier transform is invertible on the Schwartz class (the formula \(f = \mathcal{F}^{-1}\mathcal{F} f\) holds for every \(f \in \mathcal{S}(\mathbb{R})\), and the corollary \(\mathcal{F}\mathcal{F} f(x) = 2\pi f(-x)\) ensures invertibility from the other side as well). The next theorem extracts from this the deeper structural fact: on \(L^1 \cap L^2(\mathbb{R})\), the Fourier transform also preserves the \(L^2\) inner product, up to the normalization constant \(\tfrac{1}{2\pi}\). This is Plancherel's identity — the continuous-frequency analogue of the Parseval identity for Fourier series. Its proof uses inversion (Stage 1 below), then extends from \(\mathcal{S}(\mathbb{R})\) to \(L^1 \cap L^2(\mathbb{R})\) by a density argument (Stage 2).

Theorem: Plancherel's Theorem (on \(L^1 \cap L^2\))

For \(f, g \in L^1 \cap L^2(\mathbb{R})\), the classical Fourier transform defined by the absolutely convergent integral \(\hat f(\xi) = \int f(x) e^{ix\xi}\, dx\) satisfies \[ \int_{-\infty}^{\infty} f(x)\overline{g(x)} \, dx = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{f}(\xi)\overline{\hat{g}(\xi)} \, d\xi. \] In particular, setting \(f = g\): \[ \|f\|_2^2 = \int_{-\infty}^{\infty} |f(x)|^2 \, dx = \frac{1}{2\pi}\int_{-\infty}^{\infty} |\hat{f}(\xi)|^2 \, d\xi. \]

Proof:

We prove the identity in two stages: first on the Schwartz class \(\mathcal{S}(\mathbb{R})\), then by density on all of \(L^1 \cap L^2(\mathbb{R})\). It suffices to prove the polarized form; the special case \(f = g\) follows by setting them equal.

Stage 1: identity on the Schwartz class.
Fix \(f, g \in \mathcal{S}(\mathbb{R})\). Since \(\mathcal{S}(\mathbb{R})\) is closed under the Fourier transform, \(\hat f, \hat g \in \mathcal{S}(\mathbb{R})\) as well. Apply the Fourier inversion theorem (proved above) to \(g\): \[ g(x) \;=\; \frac{1}{2\pi}\int \hat g(\xi)\, e^{-ix\xi}\, d\xi. \] Taking the complex conjugate of both sides, and using \(\overline{e^{-ix\xi}} = e^{+ix\xi}\), \[ \overline{g(x)} \;=\; \frac{1}{2\pi}\int \overline{\hat g(\xi)}\, e^{ix\xi}\, d\xi. \] Multiplying by \(f(x)\) and integrating in \(x\), \[ \int f(x)\overline{g(x)}\, dx \;=\; \frac{1}{2\pi}\iint f(x)\, e^{ix\xi}\, \overline{\hat g(\xi)}\, dx\, d\xi. \] To justify the order swap in the double integral, we check the integrability hypothesis of Fubini's theorem. The absolute value of the integrand is \(|f(x)| \cdot |\hat g(\xi)|\), and Tonelli's theorem applied to this non-negative measurable function gives \[ \iint |f(x)| \cdot |\hat g(\xi)|\, dx\, d\xi \;=\; \|f\|_1 \cdot \|\hat g\|_1 \;<\; \infty, \] since both \(f\) and \(\hat g\) are Schwartz, hence in \(L^1(\mathbb{R})\). Therefore the integrand is in \(L^1(\mathbb{R} \times \mathbb{R})\), and Fubini's theorem permits the order swap: \[ \frac{1}{2\pi}\iint f(x)\, e^{ix\xi}\, \overline{\hat g(\xi)}\, dx\, d\xi \;=\; \frac{1}{2\pi}\int \overline{\hat g(\xi)} \left( \int f(x)\, e^{ix\xi}\, dx \right) d\xi \;=\; \frac{1}{2\pi}\int \hat f(\xi)\, \overline{\hat g(\xi)}\, d\xi, \] where the inner integral is \(\hat f(\xi)\) by definition. This establishes the polarized identity on \(\mathcal{S}(\mathbb{R})\).

Stage 2: extension to \(L^1 \cap L^2(\mathbb{R})\).
Equip \(L^1 \cap L^2(\mathbb{R})\) with the \(L^2\) norm. Then \(\mathcal{S}(\mathbb{R})\) is a dense subspace: indeed, \(C_c^\infty(\mathbb{R}) \subset \mathcal{S}(\mathbb{R})\), and \(C_c^\infty(\mathbb{R})\) is dense in \(L^1 \cap L^2\) by the standard truncation-and-mollification argument (cut off outside a large ball, then convolve with a smooth bump). The same construction, applied to a fixed \(f \in L^1 \cap L^2\), produces an approximating sequence \(\{f_n\} \subset \mathcal{S}(\mathbb{R})\) converging to \(f\) simultaneously in both \(L^1\) and \(L^2\) (by the dominated convergence theorem applied to the cutoffs, combined with the standard \(L^p\)-continuity of mollification, \(1 \leq p < \infty\)). We fix such a sequence for the remainder of the proof.

Stage 1 specialized to \(f = g\) gives \[ \|f\|_2^2 = \frac{1}{2\pi}\|\hat f\|_2^2 \] for every \(f \in \mathcal{S}(\mathbb{R})\); equivalently, the linear map \(T : \mathcal{S}(\mathbb{R}) \to L^2(\mathbb{R})\), \(Tf := \hat f / \sqrt{2\pi}\), is an isometry from \((\mathcal{S}(\mathbb{R}), \|\cdot\|_2)\) into \(L^2(\mathbb{R})\). From this, two observations about our sequence \(\{f_n\}\) follow:

  • The sequence \(\hat{f_n}\) is Cauchy in \(L^2\).
    By Stage 1 applied to the differences \(f_n - f_m \in \mathcal{S}(\mathbb{R})\), \(\|\hat{f_n} - \hat{f_m}\|_2 = \sqrt{2\pi}\, \|f_n - f_m\|_2 \to 0\). Hence by completeness of \(L^2\), \(\hat{f_n}\) converges to some \(F \in L^2(\mathbb{R})\).
  • The limit \(F\) coincides with the classical Fourier transform \(\hat f\).
    Since \(f \in L^1\), the integral \(\hat f(\xi) = \int f(x)\, e^{ix\xi}\, dx\) is absolutely convergent and defines \(\hat f \in L^\infty(\mathbb{R})\). The \(L^1\)-convergence \(f_n \to f\) (which our sequence provides) gives \(\hat{f_n} \to \hat f\) uniformly on \(\mathbb{R}\), by the elementary estimate \[ \|\hat{f_n} - \hat f\|_\infty \;=\; \sup_\xi \left|\int (f_n - f)(x)\, e^{ix\xi}\, dx\right| \;\leq\; \|f_n - f\|_1 \;\to\; 0. \] Combined with \(\hat{f_n} \to F\) in \(L^2\), passing to a subsequence (using subsequential a.e. convergence for \(L^2\) convergence) yields pointwise a.e. convergence to both \(F\) and \(\hat f\), so \(F = \hat f\) a.e.

Taking the limit \(n \to \infty\) in the Stage 1 identity \(\|f_n\|_2^2 = \tfrac{1}{2\pi}\|\hat{f_n}\|_2^2\) — using continuity of the \(L^2\) norm — yields \[ \|f\|_2^2 = \tfrac{1}{2\pi}\|\hat f\|_2^2. \] The polarized form follows by the same argument applied to a second sequence \(\{g_n\} \subset \mathcal{S}(\mathbb{R})\) approximating \(g \in L^1 \cap L^2\) simultaneously in \(L^1\) and \(L^2\): the bilinear identity \[ \int f_n \overline{g_n}\, dx = \tfrac{1}{2\pi}\int \hat{f_n}\, \overline{\hat{g_n}}\, d\xi \] passes to the limit on both sides by continuity of the \(L^2\) inner product, since \(f_n \to f\), \(g_n \to g\) in \(L^2\) and (by the same argument as above for \(\hat f\)) \(\hat{f_n} \to \hat f\), \(\hat{g_n} \to \hat g\) in \(L^2\).

Looking ahead: both theorems on \(L^2(\mathbb{R})\).
The two main results of this section — Fourier inversion and Plancherel's identity — are stated above on classes that require \(f \in L^1\): inversion on \(\{f \in L^1 : \hat f \in L^1\}\) (proved here for the smaller Schwartz class), and Plancherel on \(L^1 \cap L^2\). For functions \(f \in L^2 \setminus L^1\) (such as \((1+|x|)^{-1}\)), the defining integral \(\hat f(\xi) = \int f(x)\, e^{ix\xi}\, dx\) fails to converge absolutely, so neither result applies directly. The classes \(L^1\) and \(L^2\) are mutually non-contained on \(\mathbb{R}\) (for instance \(|x|^{-1/2}\mathbf{1}_{[0,1]} \in L^1 \setminus L^2\)), so neither setting subsumes the other.

The remedy is to redefine the Fourier transform on \(L^2(\mathbb{R})\) as a bounded linear operator, obtained as the unique continuous extension of its restriction to the dense subspace \(L^1 \cap L^2\). Thanks to the completeness of \(L^2\) as a Hilbert space, this approach is structurally cleaner than the \(L^1\) inversion theory: the normalized map \(f \mapsto \hat f / \sqrt{2\pi}\) becomes a unitary operator on \(L^2(\mathbb{R})\), whose inverse coincides with its Hilbert-space adjoint, simultaneously delivering an \(L^2\) inversion formula and a Plancherel identity on all of \(L^2\). This construction is carried out on Fourier analysis in Hilbert spaces.

Convolution Theorem

The differentiation property above shows that the Fourier transform simplifies differentiation. The convolution theorem reveals an even more powerful algebraic simplification: it converts the integral operation of convolution - which is \(O(n^2)\) to compute directly - into pointwise multiplication in the frequency domain. This single result is the reason FFT-based methods dominate signal processing, image filtering, and large-kernel neural network architectures.

The convolution of two functions \(f, g: \mathbb{R} \to \mathbb{C}\) is defined as: \[ (f * g)(x) = \int_{-\infty}^{\infty} f(y)g(x-y) \, dy = \int_{-\infty}^{\infty} f(x-y)g(y) \, dy \] (The second equality shows convolution is commutative.)

Theorem: Convolution Theorem

The Fourier transform converts convolution into pointwise multiplication: \[ \mathcal{F}\{f * g\}(\xi) = \hat{f}(\xi) \cdot \hat{g}(\xi) \] and conversely, the Fourier transform of a product is a (scaled) convolution: \[ \mathcal{F}\{f \cdot g\}(\xi) = \frac{1}{2\pi}(\hat{f} * \hat{g})(\xi). \]

This is one of the most important results in applied mathematics. It allows us to:

Insight: Convolution Theorem in CS and ML

  • Signal filtering: Low-pass, high-pass, and band-pass filters are convolutions in time domain but simple multiplications in frequency domain
  • Image processing: Gaussian blur, edge detection (Sobel, Canny), and sharpening are convolution operations efficiently computed via FFT for large kernels
  • Deep learning: While CNNs typically use small kernels \((3 \times 3, 5 \times 5)\) where direct convolution is efficient, FFT-based convolution is useful for:
    • Large kernels in certain architectures
    • Global convolutions in some vision transformers
    • Efficient implementation of depthwise separable convolutions
  • Probability: The PDF of \(X + Y\) for independent random variables is \(f_{X+Y} = f_X * f_Y\), computed efficiently via FFT
Proof (convolution to product):

Using the mathematical convention with \(\hat{f}(\xi) = \int f(x)e^{ix\xi} \, dx\): \[ \begin{align*} \mathcal{F}\{f * g\}(\xi) &= \int_{-\infty}^{\infty} (f * g)(x)e^{ix\xi} \, dx \\\\ &= \int_{-\infty}^{\infty} \left(\int_{-\infty}^{\infty} f(y)g(x-y) \, dy\right) e^{ix\xi} \, dx \\\\ &= \int_{-\infty}^{\infty} f(y) \left(\int_{-\infty}^{\infty} g(x-y)e^{ix\xi} \, dx\right) dy. \end{align*} \] Substituting \(u = x - y\) (so \(x = u + y\) and \(dx = du\)): \[ \begin{align*} &= \int_{-\infty}^{\infty} f(y) \left(\int_{-\infty}^{\infty} g(u)e^{i(u+y)\xi} \, du\right) dy \\\\ &= \int_{-\infty}^{\infty} f(y)e^{iy\xi} \left(\int_{-\infty}^{\infty} g(u)e^{iu\xi} \, du\right) dy \\\\ &= \left(\int_{-\infty}^{\infty} f(y)e^{iy\xi} \, dy\right) \cdot \left(\int_{-\infty}^{\infty} g(u)e^{iu\xi} \, du\right) \\\\ &= \hat{f}(\xi) \cdot \hat{g}(\xi). \end{align*} \] The interchange of integration order is justified by Fubini's theorem when \(f, g \in L^1(\mathbb{R})\).

Proof (product to convolution):

Under sufficient regularity (for instance, \(f, g \in L^1\) with \(\hat f, \hat g \in L^1\)), the inversion formula gives \[ f(x) = \frac{1}{2\pi}\int \hat f(\eta) e^{-ix\eta}\, d\eta. \] Substituting into the Fourier transform of \(fg\) and applying Fubini: \[ \begin{align*} \mathcal{F}\{f \cdot g\}(\xi) &= \int_{-\infty}^{\infty} \left[\frac{1}{2\pi}\int_{-\infty}^{\infty} \hat f(\eta)\, e^{-ix\eta}\, d\eta\right] g(x)\, e^{ix\xi}\, dx \\\\ &= \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat f(\eta)\left[\int_{-\infty}^{\infty} g(x)\, e^{ix(\xi - \eta)}\, dx\right] d\eta \\\\ &= \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat f(\eta)\, \hat g(\xi - \eta)\, d\eta = \frac{1}{2\pi}(\hat f * \hat g)(\xi). \end{align*} \]

Discrete Fourier Transform (DFT)

The continuous Fourier transform operates on functions defined on all of \(\mathbb{R}\), but computers work with finite, sampled data. The Discrete Fourier Transform (DFT) bridges this gap: it is the exact discrete analogue of the continuous transform, operating on sequences of \(N\) complex numbers and producing \(N\) frequency coefficients. The DFT inherits all the key properties of its continuous counterpart - linearity, shift, convolution - in discrete form.

Note on Sign Convention for DFT

Attention: In the continuous section above, we used the Mathematical/PDE convention (\(e^{+ix\xi}\)) to align with spectral theory. However, in the discrete domain (DFT/FFT), the Engineering/Data Science convention (negative sign in the forward transform) is universally adopted in software libraries like NumPy, PyTorch, and MATLAB.

To ensure that the formulas below match the code you will write in Python, we switch to the standard computational convention for the remainder of this page: \[ \text{DFT: } e^{-i \dots}, \quad \text{IDFT: } e^{+i \dots}. \]

Definition: DFT & IDFT

For a sequence \(\{x_0, x_1, \ldots, x_{N-1}\}\) of \(N\) complex numbers, the DFT is: \[ X_k = \sum_{n=0}^{N-1} x_n e^{-\frac{2\pi i}{N}kn}, \quad k = 0, 1, \ldots, N-1. \] The inverse DFT (IDFT) is: \[ x_n = \frac{1}{N}\sum_{k=0}^{N-1} X_k e^{\frac{2\pi i}{N}kn}, \quad n = 0, 1, \ldots, N-1. \]

Connection to Continuous Transform: The DFT can be viewed as:

Matrix Formulation:
Define \(\omega = e^{-\frac{2\pi i}{N}}\), a primitive \(N\)-th root of unity (so \(\omega^N = 1\)). The DFT matrix is: \[ W = \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & \omega & \omega^2 & \cdots & \omega^{N-1} \\ 1 & \omega^2 & \omega^4 & \cdots & \omega^{2(N-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \omega^{N-1} & \omega^{2(N-1)} & \cdots & \omega^{(N-1)^2} \end{bmatrix} \] where \(W_{kn} = \omega^{kn}\). Then: \[ \mathbf{X} = W\mathbf{x}, \quad \mathbf{x} = \frac{1}{N}W^*\mathbf{X} \] where \(W^*\) is the conjugate transpose.

Properties of the DFT Matrix:

Normalization conventions across domains

The choice between the unnormalized \(W\) and the normalized \(F = \tfrac{1}{\sqrt{N}} W\) is not a mere cosmetic rescaling — it reflects which property of the transform the user cares about most. Three regimes coexist in practice:

  • Signal processing / scientific computing (unnormalized \(W\)):
    NumPy, SciPy, MATLAB, FFTW, PyTorch (torch.fft.fft), and JAX all default to forward DFT without normalization, placing the full \(1/N\) factor in the inverse. This is the long-standing numerical-computation convention (reinforced by Cooley–Tukey 1965 and its descendants) and has two practical advantages: the convolution theorem \(\mathcal{F}(f * g) = \hat f \cdot \hat g\) holds with no stray constants, and only the inverse transform pays the \(1/N\) cost. The price is that the forward transform is not an isometry — norms are scaled by \(\sqrt{N}\).
  • Harmonic analysis / functional analysis (normalized \(F\)):
    The discrete Plancherel identity \(\|f\|_2 = \|Ff\|_2\) requires the normalized version. The map \(F : \mathbb{C}^N \to \mathbb{C}^N\) is then a unitary operator (\(F^*F = I\)), and all of \(L^2\)-based Fourier theory transfers to the discrete setting without rescaling. The clean eigenvalue structure \(\{\pm 1, \pm i\}\) derived above is a symptom of the fourth-power identity \(F^4 = I\), with unitarity forcing the roots onto the unit circle.
  • Quantum computation (normalized \(F\) is mandatory):
    The Quantum Fourier Transform, the key subroutine behind Shor's algorithm and phase estimation, is the same matrix \(F\) — and it must be unitary because every gate operation on qubits is, by the postulates of quantum mechanics, a unitary transformation. There is no choice: the normalized version is the only one that corresponds to a physically realizable quantum gate.

Most numerical libraries expose both options via a norm= argument — NumPy's np.fft.fft(x, norm='ortho') returns the normalized transform, and norm='backward' (the default) returns the unnormalized one. When porting a formula from a textbook to code (or vice versa), the first question to ask is always: which convention is this written in?

Computational Complexity:

Practical Considerations for CS/ML:

Fast Fourier Transform (FFT)

The DFT requires \(O(N^2)\) operations when computed directly from its definition - prohibitive for large signals. The Fast Fourier Transform (FFT), is not a different transform but an efficient algorithm that reduces the complexity to \(O(N \log N)\). This reduction - from hours to seconds for large \(N\) - is what makes real-time audio processing, medical imaging, and spectral methods in ML computationally feasible.

Cooley-Tukey Algorithm (Radix-2 FFT):
The key insight is to exploit the structure of the DFT matrix using divide-and-conquer. For \(N = 2^m\), split the input into even and odd indices: \[ \begin{align*} X_k &= \sum_{n=0}^{N-1} x_n \omega^{kn} \\\\ &= \sum_{j=0}^{N/2-1} x_{2j} \omega^{k(2j)} + \sum_{j=0}^{N/2-1} x_{2j+1} \omega^{k(2j+1)} \\\\ &= \sum_{j=0}^{N/2-1} x_{2j} (\omega^2)^{kj} + \omega^k \sum_{j=0}^{N/2-1} x_{2j+1} (\omega^2)^{kj}. \end{align*} \] Note that \(\omega^2 = e^{-\frac{4\pi i}{N}} = e^{-\frac{2\pi i}{N/2}}\) is a primitive \((N/2)\)-th root of unity.

Let \(E_k\) be the \((N/2)\)-point DFT of the even-indexed subsequence and \(O_k\) the \((N/2)\)-point DFT of odd-indexed subsequence. Since \(E_k\) and \(O_k\) are themselves \((N/2)\)-periodic in \(k\) (\(E_{k+N/2} = E_k\), \(O_{k+N/2} = O_k\)), and since \(\omega^{k+N/2} = -\omega^k\) (because \(\omega^{N/2} = e^{-i\pi} = -1\)), we obtain, for \(k = 0, 1, \ldots, N/2-1\): \[ X_k = E_k + \omega^k O_k, \] \[ X_{k+N/2} = E_k - \omega^k O_k. \]

This gives the recurrence \(T(N) = 2T(N/2) + O(N)\), yielding \(T(N) = O(N \log N)\) by the Master theorem.

Other FFT Algorithms:

Implementation Considerations:

Applications in Machine Learning

Fourier methods are fundamental to modern machine learning, particularly in areas requiring efficient computation, signal processing, and function approximation. Here we highlight three significant applications.

1. Audio and Speech Processing:

Speech recognition systems such as OpenAI's Whisper (2022) and its open-source successors convert raw audio into log-Mel spectrograms via the Short-Time Fourier Transform (STFT): the audio is windowed into overlapping segments, each transformed via FFT to obtain frequency content, then mapped to the Mel scale (which approximates human auditory perception). This frequency-domain representation serves as input to the neural network — typically a Transformer encoder — enabling robust transcription across languages and noisy environments.

Reference: Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision" (2022)

2. Fourier Neural Operators (FNO) for Scientific Computing:

Introduced by Li et al. (ICLR 2021), Fourier Neural Operators (FNOs) learn solution operators of parametric PDEs directly in Fourier space, achieving up to three orders of magnitude speedup over traditional PDE solvers on benchmark problems such as Burgers and Navier-Stokes equations. The key idea: parameterize the kernel \(\hat{k}(\xi)\) in frequency domain, then multiply with the Fourier transform of the input. The approach has since inspired a rich family of spectral neural operators.

FNO Architecture Sketch

Each FNO layer performs:

  1. Lift input to higher-dimensional channel space
  2. Apply spectral convolution: \(\mathcal{F}^{-1}(R \cdot \mathcal{F}(v))\) where \(R\) is a learnable weight matrix
  3. Add local residual connection \(Wv\)
  4. Apply non-linearity (activation function) e.g., GELU
  5. Project to output

The spectral convolution step is where the convolution theorem provides the key efficiency gain.

Reference: Li et al., "Fourier Neural Operator for Parametric PDEs"

3. Fourier Features as Positional Encoding:

Neural networks exhibit a spectral bias—they tend to learn low-frequency functions more easily than high-frequency ones. Fourier feature mappings overcome this limitation by lifting low-dimensional inputs into higher-dimensional spaces via sinusoidal functions. The Transformer architecture (Vaswani et al., 2017) introduced sinusoidal positional encoding for sequence positions; this idea was later adapted to continuous spatial coordinates, with Neural Radiance Fields (NeRF, 2020) mapping a scalar coordinate \(p\) to: \[ \gamma(p) = \left[\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\right] \] enabling neural networks to capture fine geometric details. Tancik et al. (2020) gave a unified Neural Tangent Kernel analysis explaining why such encodings tame spectral bias.

The same Fourier-feature principle reappears in modern Large Language Models as Rotary Position Embedding (RoPE), which encodes token positions through rotations in 2D subspaces of the query/key vectors and has become a de facto standard across recent LLM families.

References: Mildenhall et al., "NeRF: Representing Scenes as Neural Radiance Fields" (2020); Tancik et al., "Fourier Features Let Networks Learn High Frequency Functions" (NeurIPS 2020); Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021)

Interactive Demos

Now that we've covered the theory, let's see the Fourier Transform in action. We will start with a simple 1D signal to understand the core principles, then move to a 2D image to see a practical application (and a great example of the Convolution Theorem).

1D Signal & Frequency Spectrum

This demo shows how two sine waves (plus Gaussian noise) are combined in the Time Domain to form a composite signal, and simultaneously computes the FFT to display which frequencies dominate the Frequency Domain. Move the sliders to see how the two domains are linked in real time. Notice how the spectral peaks remain clearly visible above the noise floor — a basic illustration of how frequency-domain analysis separates deterministic structure from broadband noise.

2D Image & Frequency Domain Filtering

An image is a 2D signal. This demo computes the 2D FFT to reveal its magnitude spectrum: the center represents low frequencies (broad regions, gradients), while points farther from the center represent high frequencies (edges, fine textures). As established in the Convolution Theorem section, multiplication in the frequency domain corresponds to convolution in the spatial domain, so reshaping the spectrum reshapes the image.

The demo exposes several frequency-domain operations:

  • Low-pass / High-pass / Band-pass filters attenuate frequencies outside a chosen range — blurring, edge extraction, and texture isolation respectively.
  • Notch filter (Auto-Remove Periodic Noise) detects bright spikes in the spectrum and suppresses them with Gaussian notches, removing periodic interference patterns that appear as discrete points away from the DC center.
  • Compression retains only the top-\(k\%\) of Fourier coefficients (by magnitude) and zeroes the rest, illustrating the principle behind JPEG-style lossy compression — most image energy is concentrated in relatively few coefficients.
  • Sharpening amplifies high frequencies via a high-frequency-emphasis filter \(1 + \alpha\, H_{\text{high}}(\xi)\), enhancing edges and fine detail.
  • Phase spectrum: toggle to see how phase, not magnitude, carries much of the perceptual structure — a classic counterintuitive fact about Fourier representations of images.
  • Custom Mask: draw on the magnitude spectrum to suppress chosen frequencies and observe the result.

Note on filter shape: the low/high/band-pass filters use a smooth Butterworth-style roll-off rather than an ideal sharp cutoff. Ideal cutoffs introduce ringing artifacts (Gibbs phenomenon) in the spatial domain; smooth filters trade a slight frequency-domain transition band for cleaner spatial output.