Fourier Transform

Continuous Fourier Transform

Fourier series decompose periodic functions into discrete frequency components. However, most signals in practice - a spoken sentence, a transient pulse, a neural network's activation - are not periodic. The Fourier transform generalizes the Fourier series to arbitrary functions on \(\mathbb{R}\), replacing the discrete spectrum with a continuous one. This generalization is the mathematical backbone of signal processing, spectral analysis, and frequency-domain methods throughout machine learning.

To see how this arises, consider a function \(f_L\) periodic with period \(2L\). Its complex Fourier series is: \[ f_L(x) = \sum_{n=-\infty}^{\infty} c_n e^{-i\frac{n\pi x}{L}}, \quad c_n = \frac{1}{2L}\int_{-L}^{L} f_L(x)e^{i\frac{n\pi x}{L}} \, dx. \] As \(L \to \infty\), the discrete frequencies \(\omega_n = \frac{n\pi}{L}\) become densely packed with spacing \(\Delta\omega = \frac{\pi}{L} \to 0\), and the sum approaches an integral. This limiting process yields the Fourier transform.

Definition: Fourier Transform

For a function \(f: \mathbb{R} \to \mathbb{C}\), we define: \[ \hat{f}(\xi) = \mathcal{F}\{f\}(\xi) = \int_{-\infty}^{\infty} f(x)e^{i x\xi} \, dx \] where \(\xi \in \mathbb{R}\) is the frequency variable.

The inverse Fourier transform recovers \(f\) from \(\hat{f}\): \[ f(x) = \mathcal{F}^{-1}\{\hat{f}\}(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{f}(\xi)e^{-i x\xi} \, d\xi. \]

Note on Conventions:
As discussed in Part 14, we use the mathematical/PDE convention with positive sign in the forward transform and negative in the inverse. Engineering and physics often use the opposite convention: \(\hat{f}(\omega) = \int_{-\infty}^{\infty} f(t)e^{-i\omega t} \, dt\). Both are mathematically equivalent, just be consistent within your work.

The Fourier transform is well-defined for different function classes:

For \(f \in L^1(\mathbb{R})\) (absolutely integrable):
The integral defining \(\hat{f}\) converges absolutely, and \(\hat{f}\) is bounded and continuous with \(|\hat{f}(\xi)| \leq \|f\|_{L^1}\).
For \(f \in L^2(\mathbb{R})\) (square-integrable):
The transform is defined via limiting procedures (e.g., truncating \(f\) to compact support, then taking limits). The result satisfies Plancherel's theorem (see Properties section).
For tempered distributions:
The theory extends to generalized functions, allowing transforms of \(\delta\)-functions, polynomials, and other non-integrable functions important in applications.

Example: Gaussian Function

Consider the Gaussian function: \[ f(x) = e^{-\frac{x^2}{2\sigma^2}}. \] Its Fourier transform (using the mathematical convention) is: \[ \begin{align*} \hat{f}(\xi) &= \int_{-\infty}^{\infty} e^{-\frac{x^2}{2\sigma^2}}e^{i\xi x} \, dx \\\\ &= \int_{-\infty}^{\infty} e^{-\frac{x^2}{2\sigma^2} + i\xi x} \, dx. \end{align*} \] Complete the square in the exponent: \[ -\frac{x^2}{2\sigma^2} + i\xi x = -\frac{1}{2\sigma^2}(x^2 - 2i\sigma^2\xi x) = -\frac{1}{2\sigma^2}((x - i\sigma^2\xi)^2 + \sigma^4\xi^2). \] Therefore: \[ \hat{f}(\xi) = e^{-\frac{\sigma^2\xi^2}{2}} \int_{-\infty}^{\infty} e^{-\frac{(x-i\sigma^2\xi)^2}{2\sigma^2}} \, dx. \] By contour integration (the integrand is analytic and decays exponentially), shifting the path doesn't change the value: \[ \int_{-\infty}^{\infty} e^{-\frac{(x-i\sigma^2\xi)^2}{2\sigma^2}} \, dx = \int_{-\infty}^{\infty} e^{-\frac{u^2}{2\sigma^2}} \, du = \sigma\sqrt{2\pi}. \] Thus: \[ \boxed{\hat{f}(\xi) = \sigma\sqrt{2\pi} \cdot e^{-\frac{\sigma^2\xi^2}{2}}}. \]

This shows that the Gaussian is essentially an eigenfunction of the Fourier transform: its transform is also Gaussian, with inverse width (narrow in space \(\leftrightarrow\) wide in frequency). This property makes Gaussians fundamental in uncertainty principles, quantum mechanics, and signal processing.

Properties of Fourier Transform

The Fourier transform is far more than a definition - its power lies in a collection of algebraic properties that convert difficult operations (differentiation, convolution) into simple ones (multiplication, pointwise products). These properties, stated below using the mathematical convention \(\hat{f}(\xi) = \int_{-\infty}^{\infty} f(x)e^{ix\xi} \, dx\), form the foundation of spectral methods in both analysis and computation.

Theorem: Properties of the Fourier Transform

1. Linearity:
\[ \mathcal{F}\{\alpha f + \beta g\} = \alpha \mathcal{F}\{f\} + \beta \mathcal{F}\{g\} \] for constants \(\alpha, \beta \in \mathbb{C}\).

2. Translation (Shift) Property:
\[ \mathcal{F}\{f(x - a)\}(\xi) = e^{ia\xi}\hat{f}(\xi) \] A shift in space corresponds to a phase shift in frequency.

3. Scaling Property:
\[ \mathcal{F}\{f(ax)\}(\xi) = \frac{1}{|a|}\hat{f}\left(\frac{\xi}{a}\right), \quad a \neq 0 \] This embodies the uncertainty principle: compressing a signal in time stretches its frequency spectrum, and vice versa. One cannot localize a signal arbitrarily well in both time and frequency simultaneously.

4. Differentiation Property:
\[ \mathcal{F}\{f'(x)\}(\xi) = -i\xi \hat{f}(\xi) \] More generally, for the \(n\)-th derivative: \[ \mathcal{F}\{f^{(n)}(x)\}(\xi) = (-i\xi)^n \hat{f}(\xi) \] This transforms differentiation into algebraic multiplication, which is why Fourier methods are powerful for solving differential equations. Note the sign difference from the engineering convention.

Theorem: Plancherel's Theorem

For \(f, g \in L^2(\mathbb{R})\): \[ \int_{-\infty}^{\infty} f(x)\overline{g(x)} \, dx = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{f}(\xi)\overline{\hat{g}(\xi)} \, d\xi. \] In particular, setting \(f = g\): \[ \|f\|_{L^2}^2 = \int_{-\infty}^{\infty} |f(x)|^2 \, dx = \frac{1}{2\pi}\int_{-\infty}^{\infty} |\hat{f}(\xi)|^2 \, d\xi. \]

This generalizes Parseval's identity to non-periodic functions and shows that the Fourier transform preserves the \(L^2\) norm up to a constant factor. Equivalently, the normalized map \(f \mapsto \frac{1}{\sqrt{2\pi}}\hat{f}\) is a unitary operator on \(L^2(\mathbb{R})\).

Theorem: Fourier Inversion

For suitable functions (e.g., \(f \in L^1(\mathbb{R})\) with \(\hat{f} \in L^1(\mathbb{R})\)), the original function can be recovered: \[ f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{f}(\xi)e^{-ix\xi} \, d\xi. \] A corollary: \(\mathcal{F}\{\mathcal{F}\{f\}\}(x) = 2\pi f(-x)\), demonstrating a deep symmetry between the spatial and frequency domains.

Convolution Theorem

The properties above show that the Fourier transform simplifies differentiation. The convolution theorem reveals an even more powerful algebraic simplification: it converts the integral operation of convolution - which is \(O(n^2)\) to compute directly - into pointwise multiplication in the frequency domain. This single result is the reason FFT-based methods dominate signal processing, image filtering, and large-kernel neural network architectures.

The convolution of two functions \(f, g: \mathbb{R} \to \mathbb{C}\) is defined as: \[ (f * g)(x) = \int_{-\infty}^{\infty} f(y)g(x-y) \, dy = \int_{-\infty}^{\infty} f(x-y)g(y) \, dy \] (The second equality shows convolution is commutative.)

Theorem: Convolution Theorem

The Fourier transform converts convolution into pointwise multiplication: \[ \mathcal{F}\{f * g\}(\xi) = \hat{f}(\xi) \cdot \hat{g}(\xi) \] and conversely, the Fourier transform of a product is a (scaled) convolution: \[ \mathcal{F}\{f \cdot g\}(\xi) = \frac{1}{2\pi}(\hat{f} * \hat{g})(\xi). \]

This is one of the most important results in applied mathematics. It allows us to:

Replace expensive convolution operations (\(O(n^2)\) for discrete signals) with multiplication after FFT (\(O(n \log n)\))
Understand filtering as multiplication in frequency domain
Analyze linear time-invariant systems through their frequency response

Insight: Convolution Theorem in CS and ML

Signal filtering: Low-pass, high-pass, and band-pass filters are convolutions in time domain but simple multiplications in frequency domain
Image processing: Gaussian blur, edge detection (Sobel, Canny), and sharpening are convolution operations efficiently computed via FFT for large kernels
Deep learning: While CNNs typically use small kernels \((3 \times 3, 5 \times 5)\) where direct convolution is efficient, FFT-based convolution is useful for:
- Large kernels in certain architectures
- Global convolutions in some vision transformers
- Efficient implementation of depthwise separable convolutions
Probability: The PDF of \(X + Y\) for independent random variables is \(f_{X+Y} = f_X * f_Y\), computed efficiently via FFT

Proof of Convolution Theorem:

Using the mathematical convention with \(\hat{f}(\xi) = \int f(x)e^{ix\xi} \, dx\): \[ \begin{align*} \mathcal{F}\{f * g\}(\xi) &= \int_{-\infty}^{\infty} (f * g)(x)e^{ix\xi} \, dx \\\\ &= \int_{-\infty}^{\infty} \left(\int_{-\infty}^{\infty} f(y)g(x-y) \, dy\right) e^{ix\xi} \, dx \\\\ &= \int_{-\infty}^{\infty} f(y) \left(\int_{-\infty}^{\infty} g(x-y)e^{ix\xi} \, dx\right) dy. \end{align*} \] Substituting \(u = x - y\) (so \(x = u + y\) and \(dx = du\)): \[ \begin{align*} &= \int_{-\infty}^{\infty} f(y) \left(\int_{-\infty}^{\infty} g(u)e^{i(u+y)\xi} \, du\right) dy \\\\ &= \int_{-\infty}^{\infty} f(y)e^{iy\xi} \left(\int_{-\infty}^{\infty} g(u)e^{iu\xi} \, du\right) dy \\\\ &= \left(\int_{-\infty}^{\infty} f(y)e^{iy\xi} \, dy\right) \cdot \left(\int_{-\infty}^{\infty} g(u)e^{iu\xi} \, du\right) \\\\ &= \hat{f}(\xi) \cdot \hat{g}(\xi). \end{align*} \] The interchange of integration order is justified by Fubini's theorem when \(f, g \in L^1(\mathbb{R})\).

Discrete Fourier Transform (DFT)

The continuous Fourier transform operates on functions defined on all of \(\mathbb{R}\), but computers work with finite, sampled data. The Discrete Fourier Transform (DFT) bridges this gap: it is the exact discrete analogue of the continuous transform, operating on sequences of \(N\) complex numbers and producing \(N\) frequency coefficients. The DFT inherits all the key properties of its continuous counterpart - linearity, shift, convolution - in discrete form.

Note on Sign Convention for DFT

Attention: In the continuous section above, we used the Mathematical/PDE convention (\(e^{+ix\xi}\)) to align with spectral theory. However, in the discrete domain (DFT/FFT), the Engineering/Data Science convention (negative sign in the forward transform) is universally adopted in software libraries like NumPy, PyTorch, and MATLAB.

To ensure that the formulas below match the code you will write in Python, we switch to the standard computational convention for the remainder of this page: \[ \text{DFT: } e^{-i \dots}, \quad \text{IDFT: } e^{+i \dots}. \]

Definition: DFT & IDFT

For a sequence \(\{x_0, x_1, \ldots, x_{N-1}\}\) of \(N\) complex numbers, the DFT is: \[ X_k = \sum_{n=0}^{N-1} x_n e^{-\frac{2\pi i}{N}kn}, \quad k = 0, 1, \ldots, N-1. \] The inverse DFT (IDFT) is: \[ x_n = \frac{1}{N}\sum_{k=0}^{N-1} X_k e^{\frac{2\pi i}{N}kn}, \quad n = 0, 1, \ldots, N-1. \]

Connection to Continuous Transform: The DFT can be viewed as:

Sampling a periodic function at \(N\) equally spaced points
Computing Fourier series coefficients for the periodized version
Approximating the continuous Fourier transform for band-limited signals (via Shannon-Nyquist sampling theorem)

Matrix Formulation:
Define \(\omega = e^{-\frac{2\pi i}{N}}\), a primitive \(N\)-th root of unity (so \(\omega^N = 1\)). The DFT matrix is: \[ W = \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & \omega & \omega^2 & \cdots & \omega^{N-1} \\ 1 & \omega^2 & \omega^4 & \cdots & \omega^{2(N-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \omega^{N-1} & \omega^{2(N-1)} & \cdots & \omega^{(N-1)^2} \end{bmatrix} \] where \(W_{kn} = \omega^{kn}\). Then: \[ \mathbf{X} = W\mathbf{x}, \quad \mathbf{x} = \frac{1}{N}W^*\mathbf{X} \] where \(W^*\) is the conjugate transpose.

Properties of the DFT Matrix:

Orthogonality: \(W^*W = NI\), so \(W^{-1} = \frac{1}{N}W^*\)
Symmetry: \(W\) is symmetric: \(W^T = W\)
Eigenvalues: The eigenvalues of the DFT matrix are \(\pm 1, \pm i\) (for \(N = 4k\))
Circulant diagonalization: The DFT matrix diagonalizes all circulant matrices

Computational Complexity:

Direct computation: \(O(N^2)\) complex multiplications and additions
Fast Fourier Transform: \(O(N \log N)\) operations (see next section)

Practical Considerations for CS/ML:

Zero-padding: Padding sequences to powers of 2 enables efficient radix-2 FFT algorithms
Windowing: Applying window functions (Hamming, Hann, Blackman) reduces spectral leakage from finite-length effects
Real-valued signals: For real inputs, \(X_{N-k} = \overline{X_k}\) (Hermitian symmetry), allowing optimization to compute only half the spectrum using rfft
Normalization: Different software uses different conventions; NumPy uses \(\frac{1}{N}\) in IFFT only

Fast Fourier Transform (FFT)

The DFT requires \(O(N^2)\) operations when computed directly from its definition - prohibitive for large signals. The Fast Fourier Transform (FFT), is not a different transform but an efficient algorithm that reduces the complexity to \(O(N \log N)\). This reduction - from hours to seconds for large \(N\) - is what makes real-time audio processing, medical imaging, and spectral methods in ML computationally feasible.

Cooley-Tukey Algorithm (Radix-2 FFT):
The key insight is to exploit the structure of the DFT matrix using divide-and-conquer. For \(N = 2^m\), split the input into even and odd indices: \[ \begin{align*} X_k &= \sum_{n=0}^{N-1} x_n \omega^{kn} \\\\ &= \sum_{j=0}^{N/2-1} x_{2j} \omega^{k(2j)} + \sum_{j=0}^{N/2-1} x_{2j+1} \omega^{k(2j+1)} \\\\ &= \sum_{j=0}^{N/2-1} x_{2j} (\omega^2)^{kj} + \omega^k \sum_{j=0}^{N/2-1} x_{2j+1} (\omega^2)^{kj}. \end{align*} \] Note that \(\omega^2 = e^{-\frac{4\pi i}{N}} = e^{-\frac{2\pi i}{N/2}}\) is a primitive \((N/2)\)-th root of unity.

Let \(E_k\) be the DFT of the even-indexed subsequence and \(O_k\) the DFT of odd-indexed subsequence. Using the periodicity of the DFT, for \(k = 0, 1, \ldots, N/2-1\): \[ X_k = E_k + \omega^k O_k \] \[ X_{k+N/2} = E_k - \omega^k O_k \] (using the fact that \(\omega^{k+N/2} = -\omega^k\) since \(\omega^{N/2} = e^{-i\pi} = -1\)).

This gives the recurrence \(T(N) = 2T(N/2) + O(N)\), yielding \(T(N) = O(N \log N)\) by the Master theorem.

Other FFT Algorithms:

Radix-4, Radix-8: Reduce the number of complex multiplications by processing 4 or 8 points at once
Split-radix FFT: Combines radix-2 and radix-4 for optimal operation count
Prime-factor algorithm (Good-Thomas): For \(N = N_1 N_2\) with \(\gcd(N_1, N_2) = 1\)
Chirp Z-transform (Bluestein): Computes DFT of any size via convolution

Implementation Considerations:

Bit-reversal permutation: Required for in-place Cooley-Tukey algorithm
Twiddle factors: Precomputing \(\omega^k = e^{-2\pi ik/N}\) reduces trigonometric computations
Cache optimization: FFTW uses self-tuning to optimize for specific hardware
SIMD vectorization: Modern CPUs can compute multiple FFT butterflies in parallel
Numerical stability: Use of split-radix reduces roundoff error accumulation

Python Example:

                                import numpy as np
                                import matplotlib.pyplot as plt

                                # Generate a signal with multiple frequency components
                                fs = 1000  # Sampling frequency (Hz)
                                t = np.linspace(0, 1, fs, endpoint=False)
                                signal = (np.sin(2*np.pi*50*t) +           # 50 Hz component
                                        0.5*np.sin(2*np.pi*120*t) +      # 120 Hz component
                                        0.3*np.cos(2*np.pi*200*t))       # 200 Hz component

                                # Add some Gaussian noise
                                np.random.seed(42)  # For reproducibility
                                signal += 0.3 * np.random.randn(len(t))

                                # Compute FFT
                                fft_vals = np.fft.fft(signal)
                                fft_freq = np.fft.fftfreq(len(signal), d=1/fs)

                                # Only plot positive frequencies (due to Hermitian symmetry for real signals)
                                pos_mask = fft_freq >= 0
                                plt.figure(figsize=(12, 4))

                                plt.subplot(1, 2, 1)
                                plt.plot(t[:100], signal[:100])
                                plt.xlabel('Time (s)')
                                plt.ylabel('Amplitude')
                                plt.title('Signal (first 100 samples)')
                                plt.grid(True, alpha=0.3)

                                plt.subplot(1, 2, 2)
                                plt.plot(fft_freq[pos_mask], np.abs(fft_vals[pos_mask])/len(signal)*2)
                                plt.xlabel('Frequency (Hz)')
                                plt.ylabel('Magnitude')
                                plt.title('Frequency Spectrum (Single-sided)')
                                plt.xlim(0, 300)
                                plt.grid(True, alpha=0.3)
                                plt.tight_layout()
                                plt.show()

                                # Verify Parseval's theorem
                                energy_time = np.sum(np.abs(signal)**2)
                                energy_freq = np.sum(np.abs(fft_vals)**2) / len(signal)
                                print(f"Energy in time domain: {energy_time:.2f}")
                                print(f"Energy in frequency domain: {energy_freq:.2f}")
                                print(f"Relative error: {abs(energy_time - energy_freq)/energy_time:.2e}")

Applications in Machine Learning

Fourier methods are fundamental to modern machine learning, particularly in areas requiring efficient computation, signal processing, and function approximation. Here we highlight three significant applications.

1. Audio and Speech Processing:

Modern speech recognition models like Whisper (OpenAI, 2022) fundamentally rely on Fourier transforms for feature extraction. Raw audio waveforms are converted into log-Mel spectrograms using the Short-Time Fourier Transform (STFT): the audio is windowed into overlapping segments, each transformed via FFT to obtain frequency content, then mapped to the Mel scale (which approximates human auditory perception). This frequency-domain representation serves as input to the neural network, enabling robust transcription across languages and noisy environments.

Reference: Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision" (2022)

2. Fourier Neural Operators (FNO) for Scientific Computing:

Introduced by Li et al. (NeurIPS 2020), FNOs learn operators directly in Fourier space, achieving 1000x speedup over traditional PDE solvers. The key idea: parameterize the kernel \(\hat{k}(\xi)\) in frequency domain, then multiply with the Fourier transform of the input.

FNO Architecture Sketch

Each FNO layer performs:

Lift input to higher-dimensional channel space
Apply spectral convolution: \(\mathcal{F}^{-1}(R \cdot \mathcal{F}(v))\) where \(R\) is a learnable weight matrix
Add local residual connection \(Wv\)
Apply non-linearity (activation function) e.g., GELU
Project to output

The spectral convolution step is where the convolution theorem provides the key efficiency gain.

Reference: Li et al., "Fourier Neural Operator for Parametric PDEs"

3. Positional Encoding in Neural Networks:

Neural networks have a spectral bias—they tend to learn low-frequency functions more easily than high-frequency ones. Fourier feature mappings overcome this limitation by lifting low-dimensional inputs into higher-dimensional spaces via sinusoidal functions. For instance, the positional encoding used in Neural Radiance Fields (NeRF) maps a scalar coordinate \(p\) to: \[ \gamma(p) = \left[\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\right] \] This enables the network to capture fine geometric details essential for photorealistic 3D scene reconstruction. Similar Fourier-based encodings appear in Transformers (sinusoidal positional encoding) and modern LLMs (Rotary Position Embedding/RoPE).

References: Mildenhall et al., "NeRF: Representing Scenes as Neural Radiance Fields" (ECCV 2020); Tancik et al., "Fourier Features Let Networks Learn High Frequency Functions" (NeurIPS 2020)

Interactive Demos

Now that we've covered the theory, let's see the Fourier Transform in action. We will start with a simple 1D signal to understand the core principles, then move to a 2D image to see a practical application (and a great example of the Convolution Theorem).

1D Signal & Frequency Spectrum

This demo shows how several sine waves (+ noise) are combined in the Time Domain to create a complex signal. Simultaneously, it computes the Fourier Transform (FFT) of that combined signal to show which frequencies are dominant in the Frequency Domain (the spectrum). Move the sliders to see how the two domains are linked in real-time. This is the fundamental concept of Fourier Analysis.

2D Image & Frequency Domain Filtering

An image is just a 2D signal. This demo computes the 2D FFT of an image to show its Frequency Spectrum. The center of the spectrum represents low frequencies (blurry parts of the image), while the edges represent high frequencies (edges and details). As you learned in the "Convolution Theorem" section, multiplying in the frequency domain is equivalent to convolution in the spatial domain. Try applying a Low-pass filter (which only keeps the center) to see the image blur, or a High-pass filter (which only keeps the edges) to extract details.

Loading...