Fourier series decompose periodic functions into discrete frequency components.
However, most signals in practice - a spoken sentence, a transient pulse, a neural network's activation - are not periodic. The
Fourier transform generalizes the Fourier series to arbitrary functions on \(\mathbb{R}\), replacing the discrete
spectrum with a continuous one. This generalization is the mathematical backbone of signal processing, spectral analysis, and
frequency-domain methods throughout machine learning.
To see how this arises, consider a function \(f_L\) periodic with period \(2L\). Its complex Fourier series is:
\[
f_L(x) = \sum_{n=-\infty}^{\infty} c_n e^{-i\frac{n\pi x}{L}}, \quad c_n = \frac{1}{2L}\int_{-L}^{L} f_L(x)e^{i\frac{n\pi x}{L}} \, dx.
\]
As \(L \to \infty\), the discrete frequencies \(\omega_n = \frac{n\pi}{L}\) become densely packed with spacing
\(\Delta\omega = \frac{\pi}{L} \to 0\), and the sum approaches an integral. This limiting process yields the
Fourier transform.
Definition: Fourier Transform
For a function \(f: \mathbb{R} \to \mathbb{C}\), we define:
\[
\hat{f}(\xi) = \mathcal{F}\{f\}(\xi) = \int_{-\infty}^{\infty} f(x)e^{i x\xi} \, dx
\]
where \(\xi \in \mathbb{R}\) is the frequency variable.
The inverse Fourier transform recovers \(f\) from \(\hat{f}\):
\[
f(x) = \mathcal{F}^{-1}\{\hat{f}\}(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{f}(\xi)e^{-i x\xi} \, d\xi.
\]
Note on Conventions:
As discussed in Part 14, we use the mathematical/PDE convention
with positive sign in the forward transform and negative in the inverse. Engineering and physics often use the
opposite convention: \(\hat{f}(\omega) = \int_{-\infty}^{\infty} f(t)e^{-i\omega t} \, dt\).
Both are mathematically equivalent; just be consistent within your work.
The Fourier transform is well-defined for different function classes:
For \(f \in L^1(\mathbb{R})\) (absolutely integrable):
The integral defining \(\hat{f}\) converges absolutely, and \(\hat{f}\) is bounded and continuous
with \(|\hat{f}(\xi)| \leq \|f\|_{L^1}\).
For \(f \in L^2(\mathbb{R})\) (square-integrable):
The transform is defined via limiting procedures (e.g., truncating \(f\) to compact support, then taking limits).
The result satisfies Plancherel's theorem (see Properties section).
For tempered distributions:
The theory extends to generalized functions, allowing transforms of \(\delta\)-functions, polynomials,
and other non-integrable functions important in applications.
Example: Gaussian Function
Consider the Gaussian function:
\[
f(x) = e^{-\frac{x^2}{2\sigma^2}}.
\]
Its Fourier transform (using the mathematical convention) is:
\[
\begin{align*}
\hat{f}(\xi) &= \int_{-\infty}^{\infty} e^{-\frac{x^2}{2\sigma^2}}e^{i\xi x} \, dx \\\\
&= \int_{-\infty}^{\infty} e^{-\frac{x^2}{2\sigma^2} + i\xi x} \, dx.
\end{align*}
\]
Complete the square in the exponent:
\[
-\frac{x^2}{2\sigma^2} + i\xi x = -\frac{1}{2\sigma^2}(x^2 - 2i\sigma^2\xi x) = -\frac{1}{2\sigma^2}((x - i\sigma^2\xi)^2 + \sigma^4\xi^2).
\]
Therefore:
\[
\hat{f}(\xi) = e^{-\frac{\sigma^2\xi^2}{2}} \int_{-\infty}^{\infty} e^{-\frac{(x-i\sigma^2\xi)^2}{2\sigma^2}} \, dx.
\]
By contour integration (the integrand is analytic and decays exponentially), shifting the path doesn't change the value:
\[
\int_{-\infty}^{\infty} e^{-\frac{(x-i\sigma^2\xi)^2}{2\sigma^2}} \, dx = \int_{-\infty}^{\infty} e^{-\frac{u^2}{2\sigma^2}} \, du = \sigma\sqrt{2\pi}.
\]
Thus:
\[
\boxed{\hat{f}(\xi) = \sigma\sqrt{2\pi} \cdot e^{-\frac{\sigma^2\xi^2}{2}}}.
\]
This shows that the Gaussian is essentially an eigenfunction of the Fourier transform: its transform is again a
Gaussian, with reciprocal width (narrow in space \(\leftrightarrow\) wide in frequency). This property makes Gaussians
fundamental in uncertainty principles, quantum mechanics, and signal processing.
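As a quick numerical sanity check, the boxed closed form can be compared against direct quadrature of the defining integral. This is a small NumPy sketch; the grid width and spacing are arbitrary choices, wide and fine enough that the Riemann sum is accurate:

```python
import numpy as np

sigma = 1.5
x = np.linspace(-30, 30, 60001)          # wide enough that f is ~0 at the ends
dx = x[1] - x[0]
f = np.exp(-x**2 / (2 * sigma**2))

# Approximate f_hat(xi) = integral of f(x) e^{i x xi} dx by a Riemann sum
for xi in [0.0, 0.5, 1.0, 2.0]:
    numeric = np.sum(f * np.exp(1j * x * xi)) * dx
    closed_form = sigma * np.sqrt(2 * np.pi) * np.exp(-sigma**2 * xi**2 / 2)
    assert abs(numeric - closed_form) < 1e-8
```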
Properties of Fourier Transform
The Fourier transform is far more than a definition - its power lies in a collection of algebraic properties that
convert difficult operations (differentiation, convolution) into simple ones (multiplication, pointwise products).
These properties, stated below using the mathematical convention \(\hat{f}(\xi) = \int_{-\infty}^{\infty} f(x)e^{ix\xi} \, dx\),
form the foundation of spectral methods in both analysis and computation.
1. Linearity:
\[
\mathcal{F}\{af + bg\}(\xi) = a\hat{f}(\xi) + b\hat{g}(\xi)
\]
The transform of a linear combination is the linear combination of the transforms.
2. Translation (Shift) Property:
\[
\mathcal{F}\{f(x - a)\}(\xi) = e^{ia\xi}\hat{f}(\xi)
\]
A shift in space corresponds to a phase shift in frequency.
3. Scaling Property:
\[
\mathcal{F}\{f(ax)\}(\xi) = \frac{1}{|a|}\hat{f}\left(\frac{\xi}{a}\right), \quad a \neq 0
\]
This embodies the uncertainty principle: compressing a signal in time stretches its frequency
spectrum, and vice versa. One cannot localize a signal arbitrarily well in both time and frequency simultaneously.
4. Differentiation Property:
\[
\mathcal{F}\{f'(x)\}(\xi) = -i\xi \hat{f}(\xi)
\]
More generally, for the \(n\)-th derivative:
\[
\mathcal{F}\{f^{(n)}(x)\}(\xi) = (-i\xi)^n \hat{f}(\xi)
\]
This transforms differentiation into algebraic multiplication, which is why Fourier methods are powerful for
solving differential equations. Note the sign difference from the engineering convention.
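The differentiation property is the basis of spectral differentiation. A minimal NumPy sketch follows; note that np.fft uses the engineering convention, so the derivative multiplier there is \(+i\omega\) rather than \(-i\xi\):

```python
import numpy as np

# Periodic grid on [0, 2*pi)
N = 256
x = np.linspace(0, 2 * np.pi, N, endpoint=False)
f = np.sin(3 * x)

# NumPy's FFT uses the engineering sign convention (e^{-i omega x} forward),
# so differentiation corresponds to multiplication by +i*omega in frequency space.
omega = np.fft.fftfreq(N, d=(x[1] - x[0])) * 2 * np.pi   # angular frequencies
df = np.fft.ifft(1j * omega * np.fft.fft(f)).real

exact = 3 * np.cos(3 * x)
assert np.max(np.abs(df - exact)) < 1e-10
```

For band-limited periodic signals like this one, the spectral derivative is exact to machine precision, which is exactly why Fourier methods excel at solving differential equations.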
5. Plancherel's Theorem:
\[
\int_{-\infty}^{\infty} |f(x)|^2 \, dx = \frac{1}{2\pi}\int_{-\infty}^{\infty} |\hat{f}(\xi)|^2 \, d\xi
\]
This generalizes Parseval's identity
to non-periodic functions and shows that the Fourier transform preserves the \(L^2\) norm
up to a constant factor. Equivalently, the normalized map
\(f \mapsto \frac{1}{\sqrt{2\pi}}\hat{f}\) is a unitary operator on \(L^2(\mathbb{R})\).
Theorem: Fourier Inversion
For suitable functions (e.g., \(f \in L^1(\mathbb{R})\) with
\(\hat{f} \in L^1(\mathbb{R})\)), the original function can be recovered:
\[
f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{f}(\xi)e^{-ix\xi} \, d\xi.
\]
A corollary: \(\mathcal{F}\{\mathcal{F}\{f\}\}(x) = 2\pi f(-x)\), demonstrating
a deep symmetry between the spatial and frequency domains.
Convolution Theorem
The properties above show that the Fourier transform simplifies differentiation. The convolution theorem reveals an
even more powerful algebraic simplification: it converts the integral operation of convolution - which is \(O(n^2)\)
to compute directly - into pointwise multiplication in the frequency domain. This single result is the reason FFT-based methods dominate
signal processing, image filtering, and large-kernel neural network architectures.
The convolution of two functions \(f, g: \mathbb{R} \to \mathbb{C}\) is defined as:
\[
(f * g)(x) = \int_{-\infty}^{\infty} f(y)g(x-y) \, dy = \int_{-\infty}^{\infty} f(x-y)g(y) \, dy
\]
(The second equality shows convolution is commutative.)
Theorem: Convolution Theorem
The Fourier transform converts convolution into pointwise multiplication:
\[
\mathcal{F}\{f * g\}(\xi) = \hat{f}(\xi) \cdot \hat{g}(\xi)
\]
and conversely, the Fourier transform of a product is a (scaled) convolution:
\[
\mathcal{F}\{f \cdot g\}(\xi) = \frac{1}{2\pi}(\hat{f} * \hat{g})(\xi).
\]
This is one of the most important results in applied mathematics. It allows us to:
Replace expensive convolution operations (\(O(n^2)\) for discrete signals) with multiplication after
FFT (\(O(n \log n)\))
Understand filtering as multiplication in frequency domain
Analyze linear time-invariant systems through their frequency response
Insight: Convolution Theorem in CS and ML
Signal filtering:
Low-pass, high-pass, and band-pass filters are convolutions in time domain but simple multiplications in frequency domain
Image processing: Gaussian blur, edge detection (Sobel, Canny), and sharpening are
convolution operations efficiently computed via FFT for large kernels
Deep learning: While CNNs typically use small kernels \((3 \times 3, 5 \times 5)\) where direct convolution
is efficient, FFT-based convolution is useful for:
Large kernels in certain architectures
Global convolutions in some vision transformers
Efficient implementation of depthwise separable convolutions
Probability: The PDF of \(X + Y\) for independent random variables is
\(f_{X+Y} = f_X * f_Y\), computed efficiently via FFT
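To make the speedup concrete, here is a short sketch comparing direct linear convolution with the FFT route. The array sizes are arbitrary; zero-padding both inputs to length \(n_1 + n_2 - 1\) turns the DFT's circular convolution into the desired linear convolution:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(400)
g = rng.standard_normal(150)

# Direct (O(n^2)) linear convolution
direct = np.convolve(f, g)

# FFT-based (O(n log n)): zero-pad to the full output length to avoid
# the wrap-around of circular convolution, multiply spectra, invert
n = len(f) + len(g) - 1
fft_based = np.fft.ifft(np.fft.fft(f, n) * np.fft.fft(g, n)).real

assert np.allclose(direct, fft_based)
```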
Proof of Convolution Theorem:
Using the mathematical convention with \(\hat{f}(\xi) = \int f(x)e^{ix\xi} \, dx\):
\[
\begin{align*}
\mathcal{F}\{f * g\}(\xi) &= \int_{-\infty}^{\infty} (f * g)(x)e^{ix\xi} \, dx \\\\
&= \int_{-\infty}^{\infty} \left(\int_{-\infty}^{\infty} f(y)g(x-y) \, dy\right) e^{ix\xi} \, dx \\\\
&= \int_{-\infty}^{\infty} f(y) \left(\int_{-\infty}^{\infty} g(x-y)e^{ix\xi} \, dx\right) dy.
\end{align*}
\]
Substituting \(u = x - y\) (so \(x = u + y\) and \(dx = du\)):
\[
\begin{align*}
&= \int_{-\infty}^{\infty} f(y) \left(\int_{-\infty}^{\infty} g(u)e^{i(u+y)\xi} \, du\right) dy \\\\
&= \int_{-\infty}^{\infty} f(y)e^{iy\xi} \left(\int_{-\infty}^{\infty} g(u)e^{iu\xi} \, du\right) dy \\\\
&= \left(\int_{-\infty}^{\infty} f(y)e^{iy\xi} \, dy\right) \cdot \left(\int_{-\infty}^{\infty} g(u)e^{iu\xi} \, du\right) \\\\
&= \hat{f}(\xi) \cdot \hat{g}(\xi).
\end{align*}
\]
The interchange of integration order is justified by Fubini's theorem when \(f, g \in L^1(\mathbb{R})\).
Discrete Fourier Transform (DFT)
The continuous Fourier transform operates on functions defined on all of \(\mathbb{R}\), but computers work with finite,
sampled data. The Discrete Fourier Transform (DFT) bridges this gap: it is the exact discrete analogue
of the continuous transform, operating on sequences of \(N\) complex numbers and producing \(N\) frequency coefficients.
The DFT inherits all the key properties of its continuous counterpart - linearity, shift, convolution - in discrete form.
Note on Sign Convention for DFT
Attention: In the continuous section above, we used the Mathematical/PDE convention (\(e^{+ix\xi}\))
to align with spectral theory. However, in the discrete domain (DFT/FFT), the Engineering/Data Science convention
(negative sign in the forward transform) is universally adopted in software libraries like NumPy, PyTorch, and MATLAB.
To ensure that the formulas below match the code you will write in Python, we switch to the standard
computational convention for the remainder of this page:
\[
\text{DFT: } e^{-i \dots}, \quad \text{IDFT: } e^{+i \dots}.
\]
Definition: DFT & IDFT
For a sequence \(\{x_0, x_1, \ldots, x_{N-1}\}\) of \(N\) complex numbers, the DFT is:
\[
X_k = \sum_{n=0}^{N-1} x_n e^{-\frac{2\pi i}{N}kn}, \quad k = 0, 1, \ldots, N-1.
\]
The inverse DFT (IDFT) is:
\[
x_n = \frac{1}{N}\sum_{k=0}^{N-1} X_k e^{\frac{2\pi i}{N}kn}, \quad n = 0, 1, \ldots, N-1.
\]
Connection to Continuous Transform: The DFT can be viewed as:
Sampling a periodic function at \(N\) equally spaced points
Computing Fourier series coefficients for the periodized version
Approximating the continuous Fourier transform for band-limited signals (via Shannon-Nyquist sampling theorem)
Real-valued signals: For real inputs, \(X_{N-k} = \overline{X_k}\) (Hermitian symmetry),
so implementations need only compute the non-redundant half of the spectrum (e.g., NumPy's rfft)
Normalization: Different software uses different conventions; NumPy uses \(\frac{1}{N}\) in IFFT only
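The definition translates directly into code. Below is a sketch of the \(O(N^2)\) DFT, checked against NumPy's FFT and against the Hermitian-symmetry property for real inputs (the length 64 is an arbitrary choice):

```python
import numpy as np

def dft(x):
    """O(N^2) DFT straight from the definition (engineering sign convention)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    # W[k, n] = e^{-2*pi*i*k*n/N}
    W = np.exp(-2j * np.pi * np.outer(n, n) / N)
    return W @ x

rng = np.random.default_rng(1)
x = rng.standard_normal(64)                      # real input
X = dft(x)

assert np.allclose(X, np.fft.fft(x))             # matches the library FFT
assert np.allclose(X[1:][::-1], np.conj(X[1:]))  # Hermitian symmetry X_{N-k} = conj(X_k)
assert np.allclose(X[:33], np.fft.rfft(x))       # rfft returns the non-redundant half
```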
Fast Fourier Transform (FFT)
The DFT requires \(O(N^2)\) operations when computed directly from its definition - prohibitive for large signals.
The Fast Fourier Transform (FFT) is not a different transform but an efficient algorithm that reduces
the complexity to \(O(N \log N)\). This reduction - from hours to seconds for large \(N\) - is what makes real-time audio
processing, medical imaging, and spectral methods in ML computationally feasible.
Cooley-Tukey Algorithm (Radix-2 FFT):
The key insight is to exploit the structure of the DFT matrix using divide-and-conquer. Write
\(\omega = e^{-\frac{2\pi i}{N}}\) for the primitive \(N\)-th root of unity. For \(N = 2^m\),
split the input into even and odd indices:
\[
\begin{align*}
X_k &= \sum_{n=0}^{N-1} x_n \omega^{kn} \\\\
&= \sum_{j=0}^{N/2-1} x_{2j} \omega^{k(2j)} + \sum_{j=0}^{N/2-1} x_{2j+1} \omega^{k(2j+1)} \\\\
&= \sum_{j=0}^{N/2-1} x_{2j} (\omega^2)^{kj} + \omega^k \sum_{j=0}^{N/2-1} x_{2j+1} (\omega^2)^{kj}.
\end{align*}
\]
Note that \(\omega^2 = e^{-\frac{4\pi i}{N}} = e^{-\frac{2\pi i}{N/2}}\) is a primitive \((N/2)\)-th root of unity.
Let \(E_k\) be the DFT of the even-indexed subsequence and \(O_k\) the DFT of odd-indexed subsequence.
Using the periodicity of the DFT, for \(k = 0, 1, \ldots, N/2-1\):
\[
X_k = E_k + \omega^k O_k
\]
\[
X_{k+N/2} = E_k - \omega^k O_k
\]
(using the fact that \(\omega^{k+N/2} = -\omega^k\) since \(\omega^{N/2} = e^{-i\pi} = -1\)).
This gives the recurrence \(T(N) = 2T(N/2) + O(N)\), yielding \(T(N) = O(N \log N)\) by the Master theorem.
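The two butterfly equations translate into a compact recursive implementation. This is an illustrative sketch, not an optimized in-place version:

```python
import numpy as np

def fft_radix2(x):
    """Recursive Cooley-Tukey FFT; len(x) must be a power of two."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x
    E = fft_radix2(x[0::2])            # DFT of even-indexed samples
    O = fft_radix2(x[1::2])            # DFT of odd-indexed samples
    w = np.exp(-2j * np.pi * np.arange(N // 2) / N)   # twiddle factors omega^k
    # Butterfly: X_k = E_k + w^k O_k,  X_{k+N/2} = E_k - w^k O_k
    return np.concatenate([E + w * O, E - w * O])

rng = np.random.default_rng(2)
x = rng.standard_normal(1024)
assert np.allclose(fft_radix2(x), np.fft.fft(x))
```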
Other FFT Algorithms:
Radix-4, Radix-8: Reduce the number of complex multiplications by processing 4 or 8 points at once
Split-radix FFT: Combines radix-2 and radix-4 for optimal operation count
Prime-factor algorithm (Good-Thomas): For \(N = N_1 N_2\) with \(\gcd(N_1, N_2) = 1\)
Chirp Z-transform (Bluestein): Computes DFT of any size via convolution
Implementation Considerations:
Bit-reversal permutation: Required for in-place Cooley-Tukey algorithm
Cache optimization: FFTW uses self-tuning to optimize for specific hardware
SIMD vectorization: Modern CPUs can compute multiple FFT butterflies in parallel
Numerical stability: Use of split-radix reduces roundoff error accumulation
Python Example:
import numpy as np
import matplotlib.pyplot as plt
# Generate a signal with multiple frequency components
fs = 1000 # Sampling frequency (Hz)
t = np.linspace(0, 1, fs, endpoint=False)
signal = (np.sin(2*np.pi*50*t) +        # 50 Hz component
          0.5*np.sin(2*np.pi*120*t) +   # 120 Hz component
          0.3*np.cos(2*np.pi*200*t))    # 200 Hz component
# Add some Gaussian noise
np.random.seed(42) # For reproducibility
signal += 0.3 * np.random.randn(len(t))
# Compute FFT
fft_vals = np.fft.fft(signal)
fft_freq = np.fft.fftfreq(len(signal), d=1/fs)
# Only plot positive frequencies (due to Hermitian symmetry for real signals)
pos_mask = fft_freq >= 0
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(t[:100], signal[:100])
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.title('Signal (first 100 samples)')
plt.grid(True, alpha=0.3)
plt.subplot(1, 2, 2)
plt.plot(fft_freq[pos_mask], np.abs(fft_vals[pos_mask])/len(signal)*2)
plt.xlabel('Frequency (Hz)')
plt.ylabel('Magnitude')
plt.title('Frequency Spectrum (Single-sided)')
plt.xlim(0, 300)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Verify Parseval's theorem
energy_time = np.sum(np.abs(signal)**2)
energy_freq = np.sum(np.abs(fft_vals)**2) / len(signal)
print(f"Energy in time domain: {energy_time:.2f}")
print(f"Energy in frequency domain: {energy_freq:.2f}")
print(f"Relative error: {abs(energy_time - energy_freq)/energy_time:.2e}")
Applications in Machine Learning
Fourier methods are fundamental to modern machine learning, particularly in areas requiring efficient computation,
signal processing, and function approximation. Here we highlight three significant applications.
1. Audio and Speech Processing:
Modern speech recognition models like Whisper (OpenAI, 2022) fundamentally rely on Fourier transforms
for feature extraction. Raw audio waveforms are converted into log-Mel spectrograms using the Short-Time
Fourier Transform (STFT): the audio is windowed into overlapping segments, each transformed via FFT to obtain frequency
content, then mapped to the Mel scale (which approximates human auditory perception). This frequency-domain representation
serves as input to the neural network, enabling robust transcription across languages and noisy environments.
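To illustrate the front end, here is a minimal STFT sketch in NumPy. The frame length 400 and hop 160 are chosen to mimic 25 ms / 10 ms windows at 16 kHz; a real pipeline such as Whisper's additionally applies padding, a Mel filter bank, and log scaling:

```python
import numpy as np

def stft(signal, n_fft=400, hop=160):
    """Minimal STFT: Hann-windowed overlapping frames -> per-frame rFFT."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)    # shape: (n_frames, n_fft//2 + 1)

fs = 16000
t = np.arange(fs) / fs                    # 1 second of "audio"
audio = np.sin(2 * np.pi * 440 * t)       # a 440 Hz tone
spec = np.abs(stft(audio))

# The 440 Hz tone peaks in the frequency bin nearest 440 Hz
peak_bin = spec.mean(axis=0).argmax()
assert abs(peak_bin * fs / 400 - 440) < fs / 400
```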
2. Fourier Neural Operators (FNO) for Scientific Computing:
Introduced by Li et al. (NeurIPS 2020), FNOs learn operators directly in Fourier space, achieving
1000x speedup over traditional PDE solvers. The key idea: parameterize the kernel \(\hat{k}(\xi)\)
in frequency domain, then multiply with the Fourier transform of the input.
FNO Architecture Sketch
Each FNO layer performs:
Lift input to higher-dimensional channel space
Apply spectral convolution: \(\mathcal{F}^{-1}(R \cdot \mathcal{F}(v))\) where \(R\) is a learnable weight matrix acting on a truncated set of low-frequency modes
Add a pointwise (local) linear transform and apply a nonlinearity
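A single spectral convolution can be sketched in NumPy as follows. This is illustrative only: real FNOs use trained complex-valued weights in a deep-learning framework, whereas here R is random and the mode count n_modes is an arbitrary truncation:

```python
import numpy as np

def spectral_conv_1d(v, R, n_modes):
    """FNO-style spectral convolution: keep only the lowest n_modes
    frequencies, mix channels per mode with weights R, transform back."""
    V = np.fft.rfft(v, axis=0)                 # (n_points//2 + 1, channels)
    out = np.zeros_like(V)
    # Per-mode channel mixing: out_k = R_k @ V_k for the retained low modes
    out[:n_modes] = np.einsum('kio,ki->ko', R, V[:n_modes])
    return np.fft.irfft(out, n=v.shape[0], axis=0)

n_points, channels, n_modes = 128, 4, 16
rng = np.random.default_rng(3)
v = rng.standard_normal((n_points, channels))
R = rng.standard_normal((n_modes, channels, channels)).astype(complex)
u = spectral_conv_1d(v, R, n_modes)
assert u.shape == (n_points, channels)
```

Truncating to the low modes is what makes the layer resolution-independent: the same R can be applied to inputs sampled on finer grids.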
3. Fourier Features and Positional Encodings:
Neural networks have a spectral bias: they tend to learn low-frequency functions more easily than
high-frequency ones. Fourier feature mappings overcome this limitation by lifting low-dimensional inputs into
higher-dimensional spaces via sinusoidal functions. For instance, the positional encoding used in
Neural Radiance Fields (NeRF) maps a scalar coordinate \(p\) to:
\[
\gamma(p) = \left[\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\right]
\]
This enables the network to capture fine geometric details essential for photorealistic 3D scene reconstruction.
Similar Fourier-based encodings appear in Transformers (sinusoidal positional encoding) and modern LLMs
(Rotary Position Embedding/RoPE).
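The encoding \(\gamma\) is a few lines of NumPy. This sketch concatenates all sines and then all cosines, an ordering choice that is equivalent to the interleaved formula up to a permutation of the output:

```python
import numpy as np

def gamma(p, L=10):
    """NeRF-style positional encoding of a coordinate array p."""
    freqs = 2.0 ** np.arange(L) * np.pi          # 2^0 pi, 2^1 pi, ..., 2^{L-1} pi
    angles = np.outer(np.atleast_1d(p), freqs)   # (len(p), L)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

enc = gamma(np.array([0.0, 0.25, 0.5]), L=4)
assert enc.shape == (3, 8)      # L sines + L cosines per coordinate
```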
Now that we've covered the theory, let's see the Fourier Transform in action.
We will start with a simple 1D signal to understand the core principles, then move to a 2D image
to see a practical application (and a great example of the Convolution Theorem).
1D Signal & Frequency Spectrum
This demo shows how several sine waves (+ noise) are combined in the Time Domain to create a complex signal.
Simultaneously, it computes the Fourier Transform (FFT) of that combined signal to show which frequencies are dominant in the Frequency Domain (the spectrum).
Move the sliders to see how the two domains are linked in real-time. This is the fundamental concept of Fourier Analysis.
2D Image & Frequency Domain Filtering
An image is just a 2D signal. This demo computes the 2D FFT of an image to show its Frequency Spectrum.
The center of the spectrum represents low frequencies (blurry parts of the image), while the edges represent high frequencies (edges and details).
As you learned in the "Convolution Theorem" section, multiplying in the frequency domain is equivalent to convolution in the spatial domain.
Try applying a Low-pass filter (which only keeps the center) to see the image blur, or a High-pass filter (which only keeps the edges) to extract details.
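The low-pass experiment can be reproduced offline with a few lines of NumPy. This is an illustrative sketch using a synthetic image and an ideal disc filter; the radius r is an arbitrary cutoff:

```python
import numpy as np

# A synthetic "image": smooth gradient plus high-frequency checkerboard detail
n = 128
yy, xx = np.mgrid[0:n, 0:n]
image = xx / n + 0.3 * ((xx + yy) % 2)

# 2D FFT, with the zero-frequency component shifted to the center
spectrum = np.fft.fftshift(np.fft.fft2(image))

# Ideal low-pass filter: keep only a disc of radius r around the center
r = 15
dist = np.hypot(yy - n // 2, xx - n // 2)
lowpass = spectrum * (dist <= r)

blurred = np.fft.ifft2(np.fft.ifftshift(lowpass)).real

# The checkerboard (highest-frequency) component is removed: the blurred
# image varies far less between neighboring pixels than the original
assert np.abs(np.diff(blurred, axis=1)).mean() < np.abs(np.diff(image, axis=1)).mean()
```

Replacing the `dist <= r` mask with `dist > r` gives the corresponding high-pass filter, which keeps only the edges and fine detail.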