Introduction
On the deep neural networks page, we surveyed
the major architectural innovations of modern deep learning — convolutional networks,
residual connections, attention, and the Transformer — and closed with an observation
that runs through all of them: each architecture is built around a
symmetry of its input domain. Convolutional networks commute with translations of
the input image; self-attention commutes with permutations of the input tokens. The shared
principle is that the architecture encodes the symmetry, and architectures
that respect their domain's symmetry generalize from far less data than architectures that do not.
That observation is the starting point of Geometric Deep Learning (GDL):
a unifying framework which views CNNs, Transformers, graph neural networks, and equivariant
architectures for 3D data as instances of a single design principle — equivariance under
the symmetry group of the input domain. The word "geometric" here is meant in its
deep mathematical sense: a geometry is most naturally understood as the study of what
remains invariant under a chosen group of transformations. Different choices of
group give different geometries — a grid under translation, a set under permutation, a
graph under node relabelling, a smooth manifold under rotation. Each is a kind of geometry,
and an architecture is "geometric" when it respects the symmetries of its domain.
The present page introduces this framework, then
develops its most directly accessible instance, the graph neural network (GNN),
which generalizes message passing from regular grids and complete graphs to arbitrary
graphs. We close with a survey of how the framework extends beyond permutation symmetry
to the continuous symmetries of three-dimensional space — the territory of equivariant
neural networks for molecular structure, point clouds, and rigid-body manipulation.
The mathematical machinery for the continuous-symmetry case is substantial: it draws on the
Lie group theory already
developed in Section I, on representation theory of compact groups, and on the differential
geometry of smooth manifolds. The forthcoming pages on these topics provide the rigorous
foundations; the present page presents the GDL viewpoint at the level of design principle,
showing why those foundations are needed and what kind of architecture they support.
Geometric Deep Learning: A Unifying Framework
The GDL framework, articulated by Bronstein, Bruna, Cohen, and Veličković, organizes
deep architectures by the symmetry group of the input domain and the way each layer is
required to commute with that group's action. Concretely, given a domain \(\mathcal{X}\)
and a codomain \(\mathcal{Y}\), each carrying an action of a symmetry group \(G\), a layer
\(f : \mathcal{X} \to \mathcal{Y}\) is said to be \(G\)-equivariant if
\[
f(g \cdot x) = g \cdot f(x) \qquad \text{for every } g \in G,\ x \in \mathcal{X},
\]
and \(G\)-invariant if the right-hand side is replaced by \(f(x)\) itself.
Equivariance preserves symmetry information through the network; invariance discards it.
In practice, deep architectures stack equivariant layers and finish with an invariant
pooling step, so that the output is invariant to the chosen symmetry while intermediate
representations remain sensitive to it.
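A minimal numerical sketch of this pattern (the layer form, a DeepSets-style update, is an illustrative choice rather than one of the architectures discussed below): two permutation-equivariant layers are stacked and followed by a sum over set elements, and the pooled output is unchanged when the rows of the input are permuted.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4                          # 5 set elements, 4 features each

# A DeepSets-style permutation-equivariant layer: each row is updated from
# itself and from the sum over all rows, so relabelling rows relabels outputs.
def equivariant_layer(X, Lam, Gam):
    return np.tanh(X @ Lam + X.sum(axis=0, keepdims=True) @ Gam)

def network(X, params):
    H = X
    for Lam, Gam in params:          # stack of equivariant layers
        H = equivariant_layer(H, Lam, Gam)
    return H.sum(axis=0)             # invariant readout: sum-pool over set elements

params = [(rng.normal(size=(d, d)), rng.normal(size=(d, d))) for _ in range(2)]
X = rng.normal(size=(n, d))
P = np.eye(n)[rng.permutation(n)]    # random permutation matrix

print(np.allclose(network(P @ X, params), network(X, params)))  # True: output is invariant
```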
The major architectures we have already surveyed all fit this pattern. The table below
organizes them by the domain they act on and the symmetry group encoded in the architecture.
Architectures Organized by Symmetry
| Architecture | Domain | Symmetry group | Equivariance type |
|---|---|---|---|
| MLP | \(\mathbb{R}^d\) | trivial | none |
| CNN | grid (image / volume) | translation \(\mathbb{Z}^d\) | translation-equivariant |
| Transformer | sequence / set | symmetric group \(S_n\) | permutation-equivariant |
| GNN | graph (features \(\mathbf{X}\) + adjacency \(\mathbf{A}\)) | symmetric group \(S_n\) | joint permutation of features and topology: \((\mathbf{X}, \mathbf{A}) \mapsto (P\mathbf{X}, P\mathbf{A}P^\top)\) |
| Steerable / SE(3)-equivariant NN | 3D point cloud, molecular structure | \(SO(3)\), \(SE(3)\) | continuous group equivariance |
The table makes the unification explicit. The MLP, alone among the architectures listed,
encodes no symmetry; it is the baseline against which the others are measured, and every
other row is an instance of the GDL template.
CNNs, Transformers, GNNs, and SE(3)-equivariant networks all share the same template:
identify the symmetry group of the data domain, and constrain each layer to commute
with its action. The differences between them are which group, and how
the group acts. The discrete symmetry of node relabelling that defines GNNs, the
continuous symmetry of 3D rotation that defines SE(3)-equivariant networks, and the
translational symmetry that defines CNNs are all instances of the same architectural
principle, applied to different domains.
The Transformer's place in this table deserves a closer look, because it is in some
respects the cleanest example of GDL design philosophy already at industrial scale.
Self-attention
computes
\[
\mathrm{Attn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(
\tfrac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V},
\]
and one can verify directly that permuting the rows of input tokens permutes the rows
of the output identically: the layer commutes with the
symmetric group
\(S_n\) acting on the token axis. A Transformer without
positional encoding
is therefore a permutation-equivariant set processor; positional encoding
deliberately breaks this symmetry to inject sequence order back into the architecture.
From the GDL viewpoint, the modern large language model is built on a fundamentally
GDL-shaped backbone — equivariance was an architectural choice, not an afterthought,
and the choice is part of why the architecture scales.
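This is easy to verify numerically. The sketch below implements single-head self-attention in NumPy (the weights and dimensions are arbitrary) and checks that permuting the input tokens and then applying the layer gives the same result as applying the layer and then permuting its output.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, dk = 6, 8, 8

def softmax_rows(M):
    M = M - M.max(axis=-1, keepdims=True)      # numerical stability
    E = np.exp(M)
    return E / E.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax_rows(Q @ K.T / np.sqrt(K.shape[-1])) @ V

Wq, Wk, Wv = (rng.normal(size=(d, dk)) for _ in range(3))
X = rng.normal(size=(n, d))                    # n tokens, no positional encoding
P = np.eye(n)[rng.permutation(n)]              # random permutation of the tokens

lhs = self_attention(P @ X, Wq, Wk, Wv)        # permute first, then attend
rhs = P @ self_attention(X, Wq, Wk, Wv)        # attend first, then permute
print(np.allclose(lhs, rhs))                   # True: the layer is S_n-equivariant
```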
Graph neural networks generalize this picture in a direction the Transformer does not
cover. A Transformer treats every pair of tokens as connected; the connectivity is
fixed by the architecture, not given by the data. Most real data carries structure:
in a molecule only adjacent atoms interact directly; in a citation network only cited
pairs are linked; in a road network connectivity is constrained by physical layout.
A GNN takes the connectivity as part of its input — a pair \((\mathbf{X}, \mathbf{A})\)
of node features and adjacency matrix — and is designed to be equivariant under the
joint action of \(S_n\) on both:
\[
f(P\mathbf{X},\, P\mathbf{A}P^\top) \;=\; P\,f(\mathbf{X}, \mathbf{A})
\qquad \text{for every } P \in S_n.
\]
Both the Transformer (without positional encoding) and the GNN are \(S_n\)-equivariant;
the difference is what \(P\) acts on. The Transformer permutes a feature matrix; the
GNN permutes a feature-adjacency pair, with the adjacency matrix conjugated so that
the graph structure is permuted consistently with the node relabelling. If we restrict to a
single fixed graph \(G\), the only permutations that leave \(\mathbf{A}\) unchanged are those in the
automorphism group
\(\mathrm{Aut}(G)\), so only they constrain the map on features alone; the architecture itself,
however, is built to handle any adjacency, which is why a single trained GNN generalizes
to graphs unseen at training time.
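The condition can be checked concretely. The sketch below uses a single GCN-style propagation layer with symmetric degree normalization as an illustrative choice of \(f\) and verifies the joint equivariance numerically.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 4

# One GCN-style propagation layer: f(X, A) = relu(D^{-1/2}(A + I)D^{-1/2} X W).
def gcn_layer(X, A, W):
    A_hat = A + np.eye(len(A))                     # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0.0)

A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1); A = A + A.T                     # random undirected adjacency
X = rng.normal(size=(n, d))
W = rng.normal(size=(d, d))
P = np.eye(n)[rng.permutation(n)]

lhs = gcn_layer(P @ X, P @ A @ P.T, W)             # relabel nodes, then apply the layer
rhs = P @ gcn_layer(X, A, W)                       # apply the layer, then relabel
print(np.allclose(lhs, rhs))                       # True: joint S_n-equivariance
```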
Where the GDL Framework Points Next
Three threads in the table above run forward into the rest of this curriculum.
The continuous symmetry groups \(SO(3)\) and \(SE(3)\) are the
matrix Lie groups
we have already begun developing in Section I; equivariant architectures based
on them require the representation theory of Lie groups, which
is the subject of forthcoming pages. The natural setting for "data on a curved
domain" — point clouds, surfaces, shapes — is a smooth manifold,
and the analogue of the graph Laplacian on such domains is the Laplace-Beltrami
operator, both treated in the forthcoming manifold series. The GDL
framework is the application-side viewpoint that motivates these mathematical
developments; the resulting equivariant neural networks for continuous
symmetry groups are the natural application of those foundations once they are
in place.
Graph Neural Networks: From Spectral to Spatial
Two essentially different approaches to constructing graph-based architectures emerged
historically. The spectral approach — chronologically first — defines
convolution on a graph through the
Graph Fourier Transform,
parameterizes filters in the spectral domain, and recovers tractable architectures
by polynomial approximation. The spatial approach — now dominant in
practice — defines a layer directly as local message passing between adjacent vertices.
The two viewpoints are connected: the most widely used spatial architecture, the GCN,
is mathematically a first-order Chebyshev approximation of a spectral filter. We treat
them in turn.
The Spectral Side: Filters in the Frequency Domain
The
graph Laplacian
\(\boldsymbol{L} = \boldsymbol{D} - \boldsymbol{A}\) is real symmetric and
positive semi-definite,
so it admits an
eigendecomposition
\(\boldsymbol{L} = \boldsymbol{V}\boldsymbol{\Lambda}\boldsymbol{V}^\top\)
with real non-negative eigenvalues. The eigenvectors play the role of Fourier modes on the
graph, their eigenvalues the role of "graph frequencies," and the
Graph Fourier Transform
\(\hat{\boldsymbol{f}} = \boldsymbol{V}^\top \boldsymbol{f}\) decomposes a graph
signal into its frequency components. A spectral graph convolution applies a filter
\(g(\boldsymbol{\Lambda})\) to the spectrum:
\[
\boldsymbol{f}_{\mathrm{out}}
= \boldsymbol{V}\, g(\boldsymbol{\Lambda})\, \boldsymbol{V}^\top \boldsymbol{f}_{\mathrm{in}}.
\]
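A direct implementation of this formula is sketched below; the heat-kernel filter \(g(\lambda) = e^{-\lambda}\) is chosen purely for illustration as a low-pass filter.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8

# Random undirected graph and its combinatorial Laplacian L = D - A.
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1); A = A + A.T
L = np.diag(A.sum(axis=1)) - A

# Graph Fourier basis: eigenvectors of the symmetric, PSD Laplacian.
lam, V = np.linalg.eigh(L)

f_in = rng.normal(size=n)                 # a scalar signal on the vertices
g = np.exp(-lam)                          # example low-pass filter g(lambda): heat kernel

f_hat = V.T @ f_in                        # Graph Fourier Transform
f_out = V @ (g * f_hat)                   # filter in the spectral domain, transform back

# Equivalent operator form: f_out = V g(Lambda) V^T f_in.
print(np.allclose(f_out, V @ np.diag(g) @ V.T @ f_in))   # True
```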
Computing this directly requires the full eigendecomposition of \(\boldsymbol{L}\),
which is prohibitive for large graphs. The standard remedy approximates the filter
by a low-degree polynomial in \(\boldsymbol{L}\) — typically Chebyshev polynomials
of the rescaled Laplacian — yielding architectures that are efficient and
localized (a degree-\(K\) polynomial filter respects \(K\)-hop
neighborhoods). The first-order case of this construction is the
Graph Convolutional Network (GCN), which has become the standard
baseline architecture for graph learning. The full development — Chebyshev approximation, the
rescaling \(\tilde{\boldsymbol{L}} = (2/\lambda_{\max})\mathcal{L}_{\mathrm{sym}} - \boldsymbol{I}\)
(where \(\mathcal{L}_{\mathrm{sym}} = \boldsymbol{D}^{-1/2}\boldsymbol{L}\boldsymbol{D}^{-1/2}\)
is the symmetric normalized Laplacian), and the GCN derivation — is given in our
graph Laplacian page;
we recall here only what we need to connect spectral and spatial perspectives.
The Spatial Side: Message Passing
The spatial perspective begins from a different place. Rather than diagonalizing the
Laplacian, we ask directly: what local computation respects graph structure?
The answer takes the form of a message passing layer. At each layer
\(k\), every vertex \(v\) updates its hidden state \(\mathbf{h}_v^{(k)}\) by aggregating
messages from its neighbors and combining them with its own current state:
Definition: Message Passing Layer
A message passing layer updates vertex features by
\[
\mathbf{h}_v^{(k+1)}
= \mathrm{UPDATE}\!\left(\mathbf{h}_v^{(k)},\ \mathrm{AGGREGATE}\big(\{\!\!\{ \mathbf{h}_u^{(k)} : u \in \mathcal{N}(v) \}\!\!\}\big)\right),
\]
where \(\mathcal{N}(v)\) is the
neighborhood
of \(v\), \(\{\!\!\{\cdot\}\!\!\}\) denotes a multiset, \(\mathrm{AGGREGATE}\) is a
permutation-invariant function over the neighborhood (sum, mean, max, attention-weighted
sum), and \(\mathrm{UPDATE}\) is a learnable transformation (typically a small MLP).
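As a sketch of how the definition translates into code, the layer below uses sum aggregation and a small two-layer MLP as \(\mathrm{UPDATE}\); both are illustrative choices among the options listed next.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 6, 4

def message_passing_layer(H, A, W1, W2):
    """One round of message passing: sum aggregation, then an MLP update."""
    messages = A @ H                                    # row v receives sum_{u in N(v)} h_u
    combined = np.concatenate([H, messages], axis=1)    # UPDATE sees (own state, aggregate)
    return np.maximum(combined @ W1, 0.0) @ W2          # small two-layer MLP

A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1); A = A + A.T              # undirected graph, no self-loops
H = rng.normal(size=(n, d))
W1 = rng.normal(size=(2 * d, 16))
W2 = rng.normal(size=(16, d))

H_next = message_passing_layer(H, A, W1, W2)
print(H_next.shape)                         # (6, 4): one updated state per vertex
```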
The choice of \(\mathrm{AGGREGATE}\) and \(\mathrm{UPDATE}\) determines the architecture.
The GCN uses a normalized-degree weighted sum derived from
\(\mathcal{L}_{\mathrm{sym}}\); GraphSAGE uses sampled neighborhoods with a learnable
pooling function, addressing scalability on very large graphs; the Graph Attention
Network (GAT) uses an attention mechanism to compute neighborhood weights, recovering
much of the flexibility of self-attention while restricting interactions to graph
edges. The Graph Isomorphism Network (GIN) replaces the aggregation with a sum
followed by an MLP, a choice motivated by its connection to the
1-Weisfeiler-Lehman (1-WL) graph isomorphism test — GIN is
provably as expressive as 1-WL and strictly more expressive than mean- or
max-pooling alternatives.
The crucial structural property of every message passing layer is automatic
permutation equivariance: relabelling the vertices of the graph
produces a correspondingly relabelled output, because \(\mathrm{AGGREGATE}\) operates
on a multiset of messages and is by definition invariant under their order. This is
what places GNNs in the GDL framework — the architecture is constrained, by its very
construction, to satisfy \(f(P\mathbf{X}, P\mathbf{A}P^\top) = Pf(\mathbf{X}, \mathbf{A})\)
for every \(P \in S_n\).
Comparison with CNNs and Transformers
The three architectures — CNN, Transformer, GNN — admit a unified description
as message passing on graphs, distinguished only by which graph and which aggregation:
- A CNN performs message passing on a regular grid graph with fixed local neighborhoods and a learned weight per edge offset; the same weights are applied at every position, which is what makes the layer translation-equivariant.
- A Transformer performs message passing on the complete graph over the input tokens, with attention-derived edge weights; it is permutation-equivariant in the strong sense that all permutations of \(S_n\) commute with the layer.
- A GNN performs message passing on an input graph supplied as part of the data rather than fixed by the architecture. Like the Transformer it is \(S_n\)-equivariant, but where the Transformer permutes only token features, the GNN permutes a feature-adjacency pair \((\mathbf{X}, \mathbf{A})\) jointly.
The continuum from CNN through Transformer to GNN is therefore one of increasing
data-dependence in the connectivity structure: regular grid → complete graph
→ arbitrary graph. The Transformer's success at scale demonstrates that the complete
graph is often a useful default when no informative connectivity is given; the GNN is
the natural choice when connectivity carries information that should be respected.
This equivariance with respect to graph structure is exactly the property that has
made GNNs the architecture of choice in domains where data is intrinsically
relational: molecular property prediction (where atoms and bonds form a graph),
recommendation systems and social network analysis (where the user-item or user-user
graph carries the predictive signal), knowledge graph completion, and traffic and
network science. Specific benchmark architectures evolve quickly; the underlying
design principle — encode the graph's symmetry into the layer — has remained stable
across a decade of empirical development.
Beyond Permutation: Continuous Group Equivariance
The GDL principle generalizes naturally beyond the discrete symmetries we have
discussed so far. Consider a molecule represented not as an abstract graph but as a
set of atoms with positions in three-dimensional space. Rotating or translating the
entire molecule produces the same molecule — its energy, its stability, its chemical
identity are unchanged. A learned function predicting any of these properties should
therefore be invariant under the group of
rigid motions
\(SE(3)\), combining
rotations \(SO(3)\) with three-dimensional translations,
and a learned function predicting a vector-valued geometric quantity (a force, an
orientation) should be equivariant under \(SE(3)\): rotating the input
rotates the output by the same rotation.
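The distinction can be made concrete with a toy example. In the sketch below, a scalar built only from pairwise distances (a hypothetical energy) is unchanged by a random rigid motion, while a vector quantity built from relative positions (a hypothetical force) rotates with the input; both functional forms are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 7
X = rng.normal(size=(n, 3))                      # n "atom" positions in 3D

def pairwise_dists(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def energy(X):
    """Scalar built only from pairwise distances (toy potential)."""
    D = pairwise_dists(X)
    return np.sum(np.exp(-D[np.triu_indices(len(X), 1)]))

def forces(X):
    """One vector per atom, built from relative positions weighted by distance."""
    diff = X[:, None, :] - X[None, :, :]
    w = np.exp(-pairwise_dists(X))[..., None]
    return (w * diff).sum(axis=1)

# Random rigid motion: rotation R (QR factor, det corrected to +1) plus translation t.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q * np.sign(np.linalg.det(Q))
t = rng.normal(size=3)
X_moved = X @ R.T + t

print(np.isclose(energy(X_moved), energy(X)))           # True: SE(3)-invariant scalar
print(np.allclose(forces(X_moved), forces(X) @ R.T))    # True: rotation-equivariant vectors
```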
Implicit in this distinction is a subtlety we have so far suppressed in our notation.
The equivariance condition \(f(g \cdot x) = g \cdot f(x)\) writes the action of \(g\)
the same way on both sides, but the actual action depends on the type
of the quantity being acted on. A rotation \(R \in SO(3)\) acts trivially on a scalar
(energy, charge density), as the standard \(3 \times 3\) rotation matrix on a vector
(force, dipole moment), and via more elaborate formulas on higher-order tensors. The
systematic study of how a single group acts in different ways on different feature
types is the representation theory of the group, and equivariant
networks for 3D data are organized precisely by which representation each layer-feature
transforms under. The same shorthand \(g \cdot x\) can therefore stand for genuinely
distinct linear maps, and a careful theory must distinguish them.
Both \(SO(3)\) and \(SE(3)\) are continuous groups, and concretely they are the
matrix Lie groups
we developed in Section I. Building neural network layers that commute with their
action requires more than the multiset symmetrization that powers GNNs. The standard
approach decomposes vertex features into irreducible representations
of the rotation group (scalars, vectors, higher-order tensors corresponding to the
spherical harmonics), constrains messages to transform appropriately under rotation,
and combines them via Clebsch-Gordan products — the tensor product decompositions
that respect the irreducible representation structure.
The result is a graph-like architecture whose messages are themselves geometric
objects rather than scalar features — an architecture that knows, by its
construction, how to handle the rotation of its input.
The mathematical machinery this requires — irreducible representations of compact
Lie groups, spherical harmonics, the Peter-Weyl theorem, equivariant tensor product
decompositions — is the subject of the forthcoming pages on representation theory
and on Fourier analysis on Hilbert spaces. The systematic construction of equivariant
neural networks built on this machinery belongs to a later stage of the curriculum;
here we record only the application-side outcome:
encoding continuous symmetry into the architecture has produced concrete advances
in structural biology, where AlphaFold's prediction of
three-dimensional protein structure from amino-acid sequence — recognized by the 2024
Nobel Prize in Chemistry — relies on attention mechanisms designed to respect the
rigid-motion symmetries of residues in 3D space. Equivariant architectures are also
the dominant family in machine-learning interatomic potentials for molecular dynamics,
and an active research direction in robotic manipulation, where the geometry of
physical configuration spaces naturally calls for \(SE(3)\)-equivariant policies.
Why the Mathematics Comes Next
Three threads converge in equivariant neural networks for 3D data:
Lie groups provide the symmetry groups themselves and their algebraic
structure (developed in the Lie theory series of Section I); representation theory
decomposes feature spaces into irreducible building blocks compatible with the
symmetry; differential geometry of smooth manifolds supplies the
natural setting for "data on a curved domain" and the analogue of the graph
Laplacian (the Laplace-Beltrami operator) on such domains. The forthcoming
manifold series and representation theory pages develop these threads; the
Peter-Weyl theorem connects them back to the harmonic analysis on Hilbert spaces
of Section II. The GDL viewpoint is what makes the destination concrete: each
of these mathematical developments will pay off in a specific class of
equivariant architecture.
Interactive Demo
The demo above visualizes the central claim of this page in concrete numerical form.
A small graph carries a scalar feature on each node, displayed as a diverging color
(blue for negative, red for positive). The Layer slider scrubs through
a forward pass: layer 0 is the input, and successive layers show the features after
each round of message passing or matrix multiplication. The Architecture
selector switches between three regimes — MLP, which performs dense
fully-connected updates with no notion of graph structure; Transformer, which
operates on the complete graph (every node interacts with every other); and
GNN, which uses the displayed graph as its connectivity.
The Shuffle Node Labels button is where the GDL claim becomes testable.
Shuffling applies a random permutation \(P\) jointly to the node features and the
adjacency matrix, then re-runs the same forward pass with the same fixed weights.
For the GNN and Transformer, the equivariance condition
\(f(P\mathbf{X}, P\mathbf{A}P^\top) = Pf(\mathbf{X}, \mathbf{A})\) holds exactly, and
the comparison table on the right shows that every output slot matches the
permutation-predicted value to numerical precision: the equivariance meter sits at
floating-point zero. For the MLP, the same shuffle produces a quantitatively
different output — the dense weight matrix mixes node coordinates in a way that
depends on their indexing, and the meter flags the discrepancy in red. The contrast
is not a matter of better or worse training; it is built into the architectures by
construction, before any training has occurred.
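A small-scale version of the shuffle test can be reproduced in a few lines; the layer definitions below are illustrative stand-ins, not the demo's actual implementation. The GNN output matches the permutation-predicted value to machine precision, while the dense map over the flattened node vector does not.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 6, 1                                     # one scalar feature per node, as in the demo

A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1); A = A + A.T
X = rng.normal(size=(n, d))
P = np.eye(n)[np.roll(np.arange(n), 1)]         # a fixed non-identity node relabelling

# GNN layer: normalized neighborhood averaging followed by a shared weight.
def gnn(X, A, W):
    A_hat = A + np.eye(n)
    A_norm = A_hat / A_hat.sum(axis=1, keepdims=True)
    return np.tanh(A_norm @ X @ W)

# "MLP" in the demo's sense: a dense map over the flattened node vector,
# so its weights are tied to the arbitrary node indexing.
def mlp(X, W_dense):
    return np.tanh(X.reshape(-1) @ W_dense).reshape(n, d)

W = rng.normal(size=(d, d))
W_dense = rng.normal(size=(n * d, n * d))

print(np.abs(gnn(P @ X, P @ A @ P.T, W) - P @ gnn(X, A, W)).max())    # ~0: equivariant
print(np.abs(mlp(P @ X, W_dense)        - P @ mlp(X, W_dense)).max()) # large: not equivariant
```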
The same construction, extended from discrete graphs to data with continuous
geometric symmetry, motivates equivariant neural networks built on the Lie-theoretic
foundations developed in Section I and the representation theory developed in the
forthcoming pages.