Clustering

Introduction

Clustering is one of the core tasks in unsupervised learning. Unlike supervised learning, where models learn from labeled data, unsupervised learning aims to uncover hidden patterns or structures in data without any predefined labels. Among these tasks, clustering is particularly important - it groups data points into clusters based on their similarity, revealing the underlying structure of the data.

In previous sections, we explored techniques like Principal Component Analysis (PCA) and autoencoders, which reduce the dimensionality of high-dimensional data while preserving meaningful information. These dimensionality reduction methods helped us uncover compact representations of data, often referred to as latent features.

Now that we have the tools to represent complex data in a lower-dimensional space, we turn to the next natural question: Can we identify meaningful groupings within this reduced space? This is the goal of clustering, which allows us to segment data into coherent groups - for example, grouping similar images, organizing unlabeled documents, or detecting communities in social networks.

K-means Clustering

We begin with K-means clustering, one of the most widely used and intuitive algorithms.

We assume that there are \(K\) cluster centers \(\mu_k \in \mathbb{R}^D\), for \(k = 1, \cdots, K\). Each data point \(x_n \in \mathbb{R}^D\) is assigned to its nearest center: \[ z_n^* = \arg \min_k \| x_n - \mu_k \|_2^2. \]

Given the assignments, the cluster centers are updated as the mean of points in each cluster: \[ \mu_k = \frac{1}{N_k} \sum_{n : z_n =k} x_n, \,\text{where } N_k = \sum_{n=1}^N \mathbb{I}[z_n = k]. \]

These two steps - assignment and update - are repeated until convergence. This process can be viewed as minimizing a cost function known as the distortion, which is equivalent to the squared Frobenius norm of the reconstruction error: \[ \begin{align*} J(M, Z) &= \sum_{n = 1}^N \| x_n - \mu_{z_n} \|_2^2 \\\\ &= \| X - ZM^\top \|_F^2 \end{align*} \] where \(X \in \mathbb{R}^{N \times D}\), \(Z \in \{0, 1\} ^{N \times K}\), and \(M \in \mathbb{R}^{D \times K}\) contains the cluster centers \(\mu_k\) in its columns.
Note: \(Z\) is the one-hot encoding(or dummy encoding) matrix whose entries are \[ Z_{nk} = \begin{cases} 1, &\text{ if \(x_n\) is assigned to cluster \(k\) } \\ 0, &\text{ otherwise}. \end{cases} \]

Vector Quantization (VQ)

A key idea of vector quantization is to compress data by replacing each high-dimensional vector \(x_n \in \mathbb{R}^D\) with a discrete symbol \(z_n \in \{1, \cdots, K\}\), which is an index into a codebook of \(K\) prototype vectors, \(\mu_k \in \mathbb{R}^D\).

Each data point is encoded by finding the index of the closest prototype using Euclidean distance: \[ z_n := f_e (x_n) = \arg \min_k \| x_n - \mu_k \|^2. \]

The corresponding decoder simply maps the index back to the prototype: \[ f_d(k) = \mu_k. \] The reconstruction of each data point is therefore \(\hat{x}_n = f_d(z_n) = \mu_{z_n}\).

The quality of a codebook is measured using the reconstruction error (also called distortion): \[ \begin{align*} J &= \frac{1}{N} \sum_{n =1}^N \| x_n - f_d (f_e (x_n))\|^2 \\\\ &= \frac{1}{N} \sum_{n=1}^N \|x_n - \mu_{z_n} \|^2. \end{align*} \] Indeed, this is equivalent to the cost function minimized by the K-means algorithm, which can be interpreted as a special case of vector quantization where the codebook is learned by minimizing distortion via iterative updates. VQ is more focused on signal compression and encoding, whereas K-means is usually employed for data analysis and clustering tasks.

Spectral Clustering

Traditional clustering methods like K-means assume clusters are linearly separable and spherical in shape. However, many real-world datasets exhibit complex, non-convex structures that are not well-captured by distance alone. Spectral clustering addresses this limitation by transforming the data into a new space that reflects the connectivity structure of the data, often represented as a graph.

The idea is to build a similarity graph where each data point is a node, and edges encode pairwise similarities. Let \(\mathbf{W} \in \mathbb{R}^{N \times N}\) be a symmetric weight matrix for a graph such that \(w_{ij} = w_{ji} \geq 0\) measures the similarity between points \(i\) and \(j\). The degree of node \(i\) is defined as: \[ d_i = \sum_{j=1}^N w_{ij}, \] and we define the degree matrix \(\mathbf{D} \in \mathbb{R}^{N \times N}\) as a diagonal matrix with entries \(D_{ii} = d_i\).

The fundamental object in spectral clustering is the graph Laplacian, \(\mathbf{L} = \mathbf{D} - \mathbf{W}\), where \(\mathbf{D}\) is the diagonal degree matrix and \(\mathbf{W}\) is the weight (similarity) matrix. The entries of \(\mathbf{L}\) are \[ L_{ij} = \begin{cases} d_i & \text{if } i = j \\ -w_{ij} & \text{if } i \neq j \end{cases} \] The Laplacian is real symmetric and positive semi-definite, with associated quadratic form \[ \boldsymbol{f}^\top \mathbf{L} \boldsymbol{f} = \frac{1}{2} \sum_{i,j} w_{ij}(f_i - f_j)^2, \] known as the Dirichlet energy. It measures the smoothness of \(\boldsymbol{f}\) on the graph: small when adjacent nodes (those with \(w_{ij} > 0\)) have similar values.

The single result that powers spectral clustering is the connection between the Laplacian's nullspace and the graph's component structure:

Spectral Connectivity (Result Used Below)

For a graph \(G\) with \(K\) connected components \(S_1, \ldots, S_K\), \[ \ker \mathbf{L} = \operatorname{span}\{\mathbf{1}_{S_1}, \ldots, \mathbf{1}_{S_K}\}, \] so the multiplicity of the eigenvalue \(0\) equals the number of connected components. The full statement and proof — following the standard Dirichlet-energy argument — are developed in our graph Laplacian page in the linear algebra section.

The intuition for spectral clustering is now immediate: if a graph has \(K\) well-separated clusters joined by weak inter-cluster edges, it is approximately a union of \(K\) connected components, and the bottom \(K\) eigenvectors of \(\mathbf{L}\) approximate the indicator vectors of those clusters. Spectral clustering exploits this by embedding nodes via these eigenvectors and applying K-means in the embedded space.

Since \(\mathbf{L}\) is symmetric and positive semi-definite, it admits a spectral decomposition: \[ \mathbf{L} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\top, \quad 0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_N, \] where \(\boldsymbol{\Lambda}\) contains the real, non-negative eigenvalues, and the columns of \(\mathbf{V}\) form an orthonormal eigenbasis. The smallest eigenvalues correspond to the smoothest variations of a function on the graph — and, in the cluster setting, to the indicator structure described above.

In practice, graphs often exhibit irregular structure: nodes may have highly varying degrees, and clusters are not perfectly block-separated. The raw Laplacian \(\mathbf{L} = \mathbf{D} - \mathbf{W}\) can then be dominated by high-degree nodes, leading to unbalanced or misleading spectral embeddings. To address this, we use the symmetric normalized Laplacian: \[ \begin{align*} \mathbf{L}_{\text{sym}} &= \mathbf{D}^{-\frac{1}{2}} \mathbf{L} \mathbf{D}^{-\frac{1}{2}} \\\\ &= \mathbf{I} - \mathbf{D}^{-\frac{1}{2}} \mathbf{W} \mathbf{D}^{-\frac{1}{2}} \end{align*} \] For \(\mathbf{L}_{\text{sym}}\) the analogous nullspace characterization is that the eigenspace of zero eigenvalue is spanned by \(\mathbf{D}^{1/2} \mathbf{1}_{S_k}\) for \(k = 1, \ldots, K\), where \(\mathbf{1}_{S_k}\) are again the indicator vectors of the connected components.

The spectral clustering algorithm proceeds as follows:

Algorithm: Spectral Clustering

Compute the symmetric normalized Laplacian \(\mathbf{L}_{\text{sym}} = \mathbf{I} - \mathbf{D}^{-\frac{1}{2}} \mathbf{W} \mathbf{D}^{-\frac{1}{2}}\).
Find the smallest \(K\) eigenvectors of \(\mathbf{L}_{\text{sym}}\) and stack them column-wise into a matrix \(\mathbf{U} \in \mathbb{R}^{N \times K}\).
Normalize each row of \(\mathbf{U}\) to have unit norm, forming a matrix \(\mathbf{T} \in \mathbb{R}^{N \times K}\): \[ t_{ij} = \frac{u_{ij}}{\sqrt{\sum_{l=1}^K u_{il}^2}}. \]
Apply K-means clustering to the rows of \(\mathbf{T}\).
Assign the original data point \(x_i\) to cluster \(k\) if row \(i\) of \(\mathbf{T}\) is assigned to cluster \(k\).

Looking Ahead — From Spectral Clustering to Geometric Deep Learning

Spectral clustering uses a small number of low-frequency Laplacian eigenvectors to partition a graph. Two natural generalizations of this idea anchor much of modern graph-based learning:

The quality of a spectral cut is quantified by the Cheeger inequality, which links the second-smallest Laplacian eigenvalue (\(\lambda_2\) in our indexing here, often denoted \(\lambda_1\) when eigenvalues are 0-indexed) — the spectral gap — to the graph's edge expansion. This is the quantitative bridge between "small spectral gap" and "well-defined cluster structure."
The Laplacian eigenbasis defines a graph Fourier transform, extending classical Fourier analysis to functions on graphs. This is the foundational primitive for graph neural networks (GNNs), where Laplacian-based filters become learnable signal-processing operators rather than fixed spectral cuts.

Both threads converge in Geometric Deep Learning, the unifying framework — referenced from our introduction onward — in which neural-network architectures are designed to respect the symmetries and geometric structure of the data they operate on. Spectral clustering is the discrete, classical predecessor; GNNs and equivariant architectures are its modern, learnable descendants.

Demo

Note: In random initialization, we simply pick \(K\) data points uniformly at random to serve as the initial cluster centers. While this method is fast, it can be highly sensitive to outliers or imbalanced data distributions, often leading to poor cluster quality or slow convergence. To address this, the K-means++ initialization strategy is widely used in practice. It selects initial centers more carefully by favoring points that are far apart from already chosen centers, significantly improving the stability and performance of K-means clustering and often leading to better local optima.

Algorithm: K-means++ Initialization

Choose the first center \(\mu_1\) uniformly at random from the dataset \(\{x_1, \ldots, x_N\}\).
For each remaining point \(x_i\), compute its squared distance to the nearest selected center: \[ D(x_i) = \min_{1 \leq j \leq m} \|x_i - \mu_j\|_2^2 \] where \(\mu_j\) is one of the centers already chosen.
Choose the next center \(\mu_{m+1}\) from the data points with probability proportional to \(D(x_i)\): \[ \Pr(x_i \text{ is chosen}) = \frac{D(x_i)}{\sum_j D(x_j)} \]
Repeat steps 2-3 until \(K\) centers have been selected.
Proceed with the standard K-means algorithm using these \(K\) initial centers.

Clustering completes our tour of the core ML toolkit: from supervised methods (regression, classification, SVMs) through unsupervised techniques (PCA, autoencoders, and clustering). The deep neural networks page that follows scales up to modern architectures — convolutional networks, residual connections, attention mechanisms, and the Transformer — that power today's state-of-the-art systems.

Loading...