Principal Component Analysis (PCA) & Autoencoders

Recap & Introduction

The methods covered so far - regression, classification, and SVMs - are all supervised: they learn from labeled examples. We now turn to unsupervised learning, where the goal is to discover structure in data without labels. The central technique is dimensionality reduction: finding low-dimensional representations that capture the essential variation in high-dimensional data. This section builds on the eigenvalue decomposition and SVD from Section I, and extends PCA to nonlinear settings using both kernels and neural networks.

Principal Component Analysis (PCA) is one of the most fundamental techniques in machine learning and statistics for dimensionality reduction. It provides a method to reduce the number of variables in high-dimensional datasets while retaining the most meaningful structure and variation present in the original data.

PCA begins by analyzing the covariance structure of the data. Given a dataset, we compute the covariance matrix to understand how different features co-vary. Since the covariance matrix is always symmetric and positive semi-definite, it can be orthogonally diagonalized. This is the finite-dimensional shadow of a more general result — the spectral theorem for compact self-adjoint operators on Hilbert spaces — which underlies the kernel PCA construction below. The resulting eigenvectors, called principal components (PCs), form an orthonormal basis that captures the directions of maximum variance. The corresponding eigenvalues indicate how much of the total variance is captured by each component.

By projecting the data onto the subspace spanned by the top \(k\) principal components (those associated with the largest eigenvalues), PCA identifies a lower-dimensional representation that retains as much variance as possible. This allows us to reduce dimensionality by discarding less informative directions (i.e., those with small variance), thereby simplifying the dataset while minimizing information loss.

For example, if a dataset in \(\mathbb{R}^{10}\) has 3 principal components capturing 95% of the total variance, we can project the original data onto a 3D subspace. This projection preserves the dominant patterns and relationships in the data and filters out noise and redundancy. The transformed vectors in the low-dimensional space are known as latent representations.

While PCA is a powerful tool, it identifies directions of maximum variance using linear combinations of the original features. As a result, it may fail to uncover complex, nonlinear structures in the data.

Kernel PCA

Given an \(N \times D\) data matrix, PCA requires the eigenvectors of the \(D \times D\) covariance matrix \(X^\top X\). If \(D \gg N\), working with the \(N \times N\) Gram matrix \(K = \Phi \Phi^\top\) is much more efficient - its entries are inner products \(\langle \phi(x_i), \phi(x_j) \rangle\). In the linear case, \(\phi\) is the identity and \(K_{ij} = x_i^\top x_j\).

Kernel PCA (KPCA) is a nonlinear generalization of classical PCA that uses the kernel trick, which allows us to replace inner products \(x_i^\top x_j\) with a kernel function \(K_{ij} = \mathcal{K}(x_i, x_j)\). The spectral structure that makes this work is formalized by Mercer's theorem (for continuous positive-definite kernels on a compact domain) and, more generally, by the reproducing kernel Hilbert space (RKHS) construction developed in our functional-analysis section.

The kPCA implicitly replaces \(x_i\) with \(\phi(x_i) = \phi_i\). Let \(\Phi\) be the corresponding design matrix. Assuming the features are centered, the covariance matrix in feature space is represented by: \[ S_{\phi} = \frac{1}{N} \sum_i \phi_i \phi_i^T. \] The normalized eigenvectors of \(S\) are given by: \[ V_{kPCA} = \Phi^\top U \Lambda^{-\frac{1}{2}} \] where \(U\) is an orthogonal matrix containing the eigenvectors of the kernel (Gram) matrix \(K = \Phi \Phi^\top\) with corresponding eigenvalues in \(\Lambda\).

Since \(\phi_i\) can be infinite dimensional, we cannot compute \(V_{kPCA}\) directly. Instead, we express the projection of a test vector \(x_*\) entirely in terms of kernel evaluations:

Kernel PCA Projection:

\[ \phi_*^\top V_{kPCA} = \phi_*^\top \Phi^\top U \Lambda^{-\frac{1}{2}} = k_*^\top U \Lambda^{-\frac{1}{2}} \] where \(k_* = \left[\mathcal{K}(x_*, x_1), \cdots, \mathcal{K}(x_*, x_N)\right]\) is the vector of kernel evaluations between the test point and all training points.

Note that using \(K = \Phi \Phi^\top\) is valid only if \(\mathbb{E}[\phi_i] = 0\). However, the feature space can be infinite dimensional, so we cannot subtract off the mean. Here, we introduce the double centering trick. Let the centered feature vector be \[ \tilde{\phi}_i = \phi(x_i) - \frac{1}{N} \sum_{j =1}^N \phi(x_j). \] Its Gram matrix is given by \[ \tilde{K}_{ij} = \tilde{\phi}_i^\top \tilde{\phi}_j. \]

Double Centering Trick:

The centered Gram matrix is computed as: \[ \begin{align*} \tilde{K} &= C_N K C_N \\\\ &= K - \frac{1}{N}JK - \frac{1}{N}KJ + \frac{1}{N^2}1^\top K 1 \end{align*} \] where \[ C_N = I_N - \frac{1}{N}1_N 1_N^\top \] is the centering matrix and \(J = 1_N 1_N^\top\) is the \(N \times N\) all-ones matrix.

Note: To verify the double centering trick, consider the scalar form: \[ \begin{align*} \tilde{K}_{ij} &= \tilde{x}_i^\top \tilde{x}_j \\\\ &= \left(x_i - \frac{1}{N} \sum_{k=1}^N x_k \right)\left(x_j - \frac{1}{N} \sum_{l=1}^N x_l \right) \\\\ &= x_i^\top x_j - \frac{1}{N} \sum_{l=1}^N x_i^\top x_l - \frac{1}{N} \sum_{k=1}^N x_k^\top x_j + \frac{1}{N^2} \sum_{k=1}^N \sum_{l=1}^N x_k^\top x_l. \end{align*} \]

Autoencoders

Data reconstruction serves as the primary quality control mechanism for dimensionality reduction. When we compress data from the original high-dimensional feature space \(\mathbb{R}^D\) to low-dimensional space \(\mathbb{R}^L\) (where \(L < D\)), we need to ensure that the essential structure and information of the original data is preserved. The reconstruction error provides a quantitative measure of information loss. If reconstruction is poor, the learned representation is inadequate for the task at hand.

Reconstruction ensures that the learned latent representation \(z = f_e(x)\) captures the most relevant and meaningful features of the data. The function \(f_e : x \to z\) is called the encoder. If the decoder \(f_d : z \to x\) can successfully reconstruct the original input \(x\) from \(z\), it demonstrates that \(z\) contains sufficient information about the underlying data structure.

We can consider PCA as the process of learning linear mappings such as \(f_e\) and \(f_d\). Then the reconstruction function can be represented as \(r(x) = f_d \left(f_e(x) \right)\), which is trained to minimize \(\mathcal{L}(\theta) = -\log p(x | r(x))\).

We can implement \(f_e\) and \(f_d\) by neural networks. This is called an autoencoder. In particular, a linear autoencoder is equivalent to PCA:

Linear Autoencoder:

Input: \(x \in \mathbb{R}^D\)
Hidden units: \(z = W_1 x, \quad W_1 \in \mathbb{R}^{L \times D}, \quad L < D\)
Output: \(\hat{x} = W_2 z = W_2 W_1 x = Wx, \quad W_2 \in \mathbb{R}^{D \times L}\)

Trained by minimizing the squared reconstruction error: \[ \mathcal{L}(W) = \sum_{n=1}^N \|x_n - Wx_n\|_2^2 \] The optimal \(\hat{W}\) acts as an orthogonal projection onto the subspace spanned by the first \(L\) eigenvectors of the empirical covariance matrix of the data.

In general, it is known that if we introduce nonlinearities into the autoencoder, the model becomes strictly more powerful than PCA. Moreover, autoencoders designed with deep learning architectures handle large datasets and complex structures much better than Kernel PCA, so autoencoders have generally become more popular than Kernel PCA for nonlinear mapping in practical applications.

Demo: Kernel PCA and Autoencoders

This interactive demo allows you to explore and compare different dimensionality reduction techniques:

PCA Comparison Tab

Compare how standard PCA and Kernel PCA handle non-linear datasets. Standard PCA finds linear projections, while Kernel PCA can "unfold" complex structures like circles and spirals using the kernel trick. Try different kernels (RBF, polynomial) and adjust parameters to see their effects.

Autoencoder Tab

Explore how a neural network autoencoder with a 1D bottleneck learns to compress and reconstruct 2D data. Unlike PCA which finds linear subspaces, the autoencoder learns non-linear mappings through its hidden layers. The 1D latent space visualization shows how the network arranges data points along a single dimension, effectively finding the principal curve through the data.

💡Tip:
Start with the "circles" dataset to see how each method handles non-linear structure. PCA will struggle, Kernel PCA (with RBF kernel) will separate the circles, and the autoencoder will "unfold" them into a line.

Denoising Autoencoders

Denoising autoencoders (DAE) are a more regularized variant of standard autoencoders that add noise to the input during training, then learn to reconstruct the original, uncorrupted data. The theoretical consequence — established below — is that the optimal denoiser implicitly learns the score function of the data distribution.

The training process involves corrupting the input \(x\) to produce \(\tilde{x}\), typically using Gaussian noise: \[ p_c (\tilde{x} | x) = \mathcal{N}(\tilde{x} | x, \sigma^2 I) \] The model then minimizes the reconstruction error between its output \(r(\tilde{x})\) and the clean input \(x\): \[ \ell (x, r(\tilde{x})) = \| e \|_2^2 \] where \(e(x) = r(\tilde{x}) - x \) is the residual error for a sample \(x\).

Denoising and Score Estimation:

As \(\sigma \to 0\), the optimal denoising function satisfies: \[ e(x) = r(\tilde{x}) - x \approx \sigma^2 \nabla_x \log p(x) + O(\sigma^4). \] The denoising autoencoder implicitly learns the score function \(\nabla_x \log p(x)\) (gradient of log-density). This identity is due to Vincent (2011) and Alain & Bengio (2014); the statement is taken on faith here as the underlying score-matching theory sits beyond the scope of this page.

This score function forms a vector field over the entire feature space. At each point, this vector field indicates the direction and magnitude to move toward regions of higher data density. The reconstruction process follows these vectors, effectively "flowing" corrupted points back to the data manifold along the steepest ascent of the probability landscape.

Data Manifolds in Our Demo:

The autoencoder tab in our demo visualizes data manifolds. When the 2D data is compressed through a 1D bottleneck, the network must learn the underlying 1-D manifold (curve) that best represents the data. The "Reconstructed" visualization shows points projected onto this learned manifold This is the autoencoder discovering and representing the intrinsic lower-dimensional structure of the data.

Note: Lipschitz continuity is a common regularity condition: a function \(f: (X, d_X) \to (Y, d_Y)\) is Lipschitz with constant \(L \geq 0\) if \(d_Y(f(x_1), f(x_2)) \leq L \, d_X(x_1, x_2)\) for all \(x_1, x_2 \in X\). The full development — including the hierarchy Lipschitz \(\Rightarrow\) uniformly continuous \(\Rightarrow\) continuous — appears in our continuity page in the analysis section.

This means the function's output changes at most linearly with respect to the input. In autoencoders, Lipschitz continuity in the reconstruction map \(r(x)\) ensures stability — small changes in input lead to small changes in reconstruction.

In the context of denoising autoencoders, enforcing or assuming Lipschitz continuity makes the learned vector field well-behaved. It guarantees smooth flows along the data manifold and avoids sharp or unstable reconstructions, which is essential when approximating the gradient \(\nabla_x \log p(x)\).

Manifolds

Intuitively, a manifold is a topological space that locally resembles Euclidean space near each point. Imagine a curved surface like a sphere—zooming in on any small patch makes it look flat, like \(\mathbb{R}^2\). In machine learning, we often assume data lies on a low-dimensional manifold embedded in high-dimensional space.

Definition: Manifold (Intuitive Form)

An \(n\)-dimensional manifold \(\mathcal{M}\) is a topological space where every point \(p \in \mathcal{M}\) has an open neighborhood \(U\) that is homeomorphic to \(\mathbb{R}^n\). That is, there exists a continuous bijection \(\phi : U \to \mathbb{R}^n\) with continuous inverse.

This is the intuitive form sufficient for the manifold-hypothesis discussion that follows. The full differential structure (charts, atlases, smoothness) is developed rigorously in our smooth-manifold series.

The Manifold Hypothesis: Real-world high-dimensional data (e.g., images, speech) tends to lie on or near a low-dimensional manifold embedded in the ambient space. For example, each \(64 \times 64\) grayscale face image can be represented as a point in \(\mathbb{R}^{4096}\), but the set of "realistic" face images occupies only a small, structured region of this space - likely a nonlinear manifold of much lower dimension, governed by factors such as pose, lighting, expression, and identity.

Dimensionality reduction reveals low-dimensional structure in data, but does not assign data points to groups. The clustering page that follows addresses this complementary unsupervised task: partitioning data into meaningful groups without labels. We will see how K-means minimizes distortion in the original feature space, while spectral clustering leverages the graph Laplacian to discover clusters that respect the geometry of the data manifold.

Looking further ahead, the manifold hypothesis introduced here is the entry point to a much broader program. The rigorous theory of smooth manifolds — tangent spaces, charts, Riemannian metrics — is developed in our manifold series, generalizing the intuitive picture above. On the discrete side, sampled data on graph-structured domains motivate graph neural networks, which use the spectral structure of the graph Laplacian as a learnable signal-processing primitive. Both threads converge in Geometric Deep Learning (GDL), the unifying viewpoint we have referenced from the introduction onward, where neural-network architectures are designed to the symmetries and geometric structure of the data they operate on.