Introduction
On the deep neural networks page, we surveyed
the major architectural innovations of modern deep learning — convolutional networks,
residual connections, attention, and the Transformer — and closed with an observation
that runs through all of them: each architecture is built around a
symmetry of its input domain. Convolutional networks commute with translations of
the input image; self-attention commutes with permutations of the input tokens. The shared
principle is that the architecture encodes the symmetry, and architectures
that respect their domain's symmetry generalize from far less data than architectures that do not.
That observation is the starting point of Geometric Deep Learning (GDL):
a unifying framework which views CNNs, Transformers, graph neural networks, and equivariant
architectures for 3D data as instances of a single design principle — equivariance under
the symmetry group of the input domain. The word "geometric" here is meant in its
deep mathematical sense: a geometry is most naturally understood as the study of what
remains invariant under a chosen group of transformations. Different choices of
group give different geometries — a grid under translation, a set under permutation, a
graph under node relabelling, a smooth manifold under rotation. Each is a kind of geometry,
and an architecture is "geometric" when it respects the symmetries of its domain.
The present page introduces this framework, then
develops its most directly accessible instance, the graph neural network (GNN),
which generalizes message passing from regular grids and complete graphs to arbitrary
graphs. We close with a survey of how the framework extends beyond permutation symmetry
to the continuous symmetries of three-dimensional space — the territory of equivariant
neural networks for molecular structure, point clouds, and rigid-body manipulation.
The mathematical machinery for the continuous-symmetry case is substantial: it draws on the
Lie group theory already
developed in Section I, on representation theory of compact groups, and on the differential
geometry of smooth manifolds. The forthcoming pages on these topics provide the rigorous
foundations; the present page presents the GDL viewpoint at the level of design principle,
showing why those foundations are needed and what kind of architecture they support.
Geometric Deep Learning: A Unifying Framework
The GDL framework, articulated by Bronstein, Bruna, Cohen, and Veličković, organizes
deep architectures by the symmetry group of the input domain and the way each layer is
required to commute with that group's action. Concretely, given a domain \(\mathcal{X}\)
and a codomain \(\mathcal{Y}\), each carrying an action of a symmetry group \(G\), a layer
\(f : \mathcal{X} \to \mathcal{Y}\) is said to be \(G\)-equivariant if
\[
f(g \cdot x) = g \cdot f(x) \qquad \text{for every } g \in G,\ x \in \mathcal{X},
\]
and \(G\)-invariant if the right-hand side is replaced by \(f(x)\) itself.
Equivariance preserves symmetry information through the network; invariance discards it.
In practice, deep architectures stack equivariant layers and finish with an invariant
pooling step, so that the output is invariant to the chosen symmetry while intermediate
representations remain sensitive to it.
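A minimal numerical sketch of this pattern (the layer form, a DeepSets-style update, is an illustrative choice rather than one of the architectures discussed below): two permutation-equivariant layers are stacked and followed by a sum over set elements, and the pooled output is unchanged when the rows of the input are permuted.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4                          # 5 set elements, 4 features each

# A DeepSets-style permutation-equivariant layer: each row is updated from
# itself and from the sum over all rows, so relabelling rows relabels outputs.
def equivariant_layer(X, Lam, Gam):
    return np.tanh(X @ Lam + X.sum(axis=0, keepdims=True) @ Gam)

def network(X, params):
    H = X
    for Lam, Gam in params:          # stack of equivariant layers
        H = equivariant_layer(H, Lam, Gam)
    return H.sum(axis=0)             # invariant readout: sum-pool over set elements

params = [(rng.normal(size=(d, d)), rng.normal(size=(d, d))) for _ in range(2)]
X = rng.normal(size=(n, d))
P = np.eye(n)[rng.permutation(n)]    # random permutation matrix

print(np.allclose(network(P @ X, params), network(X, params)))  # True: output is invariant
```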
The major architectures we have already surveyed all fit this pattern. The table below
organizes them by the domain they act on and the symmetry group encoded in the architecture.
Architectures Organized by Symmetry
| Architecture | Domain | Symmetry group | Equivariance type |
|---|---|---|---|
| MLP | \(\mathbb{R}^d\) | trivial | none |
| CNN | grid (image / volume) | translation \(\mathbb{Z}^d\) | translation-equivariant |
| Transformer | sequence / set | symmetric group \(S_n\) | permutation-equivariant |
| GNN | graph (features \(\mathbf{X}\) + adjacency \(\mathbf{A}\)) | symmetric group \(S_n\) | joint permutation of features and topology: \((\mathbf{X}, \mathbf{A}) \mapsto (P\mathbf{X}, P\mathbf{A}P^\top)\) |
| Steerable / SE(3)-equivariant NN | 3D point cloud, molecular structure | \(SO(3)\), \(SE(3)\) | continuous group equivariance |
The table makes the unification explicit. The MLP, alone among the architectures listed,
encodes no symmetry; it is the baseline against which the others are measured, and every
other row is an instance of the GDL template.
CNNs, Transformers, GNNs, and SE(3)-equivariant networks all share the same template:
identify the symmetry group of the data domain, and constrain each layer to commute
with its action. The differences between them are which group, and how
the group acts. The discrete symmetry of node relabelling that defines GNNs, the
continuous symmetry of 3D rotation that defines SE(3)-equivariant networks, and the
translational symmetry that defines CNNs are all instances of the same architectural
principle, applied to different domains.
The Transformer's place in this table deserves a closer look, because it is in some
respects the cleanest example of GDL design philosophy already at industrial scale.
Self-attention
computes
\[
\mathrm{Attn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(
\tfrac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V},
\]
and one can verify directly that permuting the rows of input tokens permutes the rows
of the output identically: the layer commutes with the
symmetric group
\(S_n\) acting on the token axis. A Transformer without
positional encoding
is therefore a permutation-equivariant set processor; positional encoding
deliberately breaks this symmetry to inject sequence order back into the architecture.
From the GDL viewpoint, the modern large language model is built on a fundamentally
GDL-shaped backbone — equivariance was an architectural choice, not an afterthought,
and the choice is part of why the architecture scales.
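This is easy to verify numerically. The sketch below implements single-head self-attention in NumPy (the weights and dimensions are arbitrary) and checks that permuting the input tokens and then applying the layer gives the same result as applying the layer and then permuting its output.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, dk = 6, 8, 8

def softmax_rows(M):
    M = M - M.max(axis=-1, keepdims=True)      # numerical stability
    E = np.exp(M)
    return E / E.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax_rows(Q @ K.T / np.sqrt(K.shape[-1])) @ V

Wq, Wk, Wv = (rng.normal(size=(d, dk)) for _ in range(3))
X = rng.normal(size=(n, d))                    # n tokens, no positional encoding
P = np.eye(n)[rng.permutation(n)]              # random permutation of the tokens

lhs = self_attention(P @ X, Wq, Wk, Wv)        # permute first, then attend
rhs = P @ self_attention(X, Wq, Wk, Wv)        # attend first, then permute
print(np.allclose(lhs, rhs))                   # True: the layer is S_n-equivariant
```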
Graph neural networks generalize this picture in a direction the Transformer does not
cover. A Transformer treats every pair of tokens as connected; the connectivity is
fixed by the architecture, not given by the data. Most real data carries structure:
in a molecule only adjacent atoms interact directly; in a citation network only cited
pairs are linked; in a road network connectivity is constrained by physical layout.
A GNN takes the connectivity as part of its input — a pair \((\mathbf{X}, \mathbf{A})\)
of node features and adjacency matrix — and is designed to be equivariant under the
joint action of \(S_n\) on both:
\[
f(P\mathbf{X},\, P\mathbf{A}P^\top) \;=\; P\,f(\mathbf{X}, \mathbf{A})
\qquad \text{for every } P \in S_n.
\]
Both the Transformer (without positional encoding) and the GNN are \(S_n\)-equivariant;
the difference is what \(P\) acts on. The Transformer permutes a feature matrix; the
GNN permutes a feature-adjacency pair, with the adjacency matrix conjugated so that
the graph structure is permuted consistently with the node relabelling. If we restrict to a
single fixed graph \(G\), the only permutations that leave \(\mathbf{A}\) unchanged are those in the
automorphism group
\(\mathrm{Aut}(G)\), so only they constrain the map on features alone; the architecture itself,
however, is built to handle any adjacency, which is why a single trained GNN generalizes
to graphs unseen at training time.
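The condition can be checked concretely. The sketch below uses a single GCN-style propagation layer with symmetric degree normalization as an illustrative choice of \(f\) and verifies the joint equivariance numerically.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 4

# One GCN-style propagation layer: f(X, A) = relu(D^{-1/2}(A + I)D^{-1/2} X W).
def gcn_layer(X, A, W):
    A_hat = A + np.eye(len(A))                     # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0.0)

A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1); A = A + A.T                     # random undirected adjacency
X = rng.normal(size=(n, d))
W = rng.normal(size=(d, d))
P = np.eye(n)[rng.permutation(n)]

lhs = gcn_layer(P @ X, P @ A @ P.T, W)             # relabel nodes, then apply the layer
rhs = P @ gcn_layer(X, A, W)                       # apply the layer, then relabel
print(np.allclose(lhs, rhs))                       # True: joint S_n-equivariance
```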
Where the GDL Framework Points Next
Three threads in the table above run forward into the rest of this curriculum.
The continuous symmetry groups \(SO(3)\) and \(SE(3)\) are the
matrix Lie groups
we have already begun developing in Section I; equivariant architectures based
on them require the representation theory of Lie groups, which
is the subject of forthcoming pages. The natural setting for "data on a curved
domain" — point clouds, surfaces, shapes — is a smooth manifold,
and the analogue of the graph Laplacian on such domains is the Laplace-Beltrami
operator, both treated in the forthcoming manifold series. The GDL
framework is the application-side viewpoint that motivates these mathematical
developments; the resulting equivariant neural networks for continuous
symmetry groups are the natural application of those foundations once they are
in place.
Graph Neural Networks: From Spectral to Spatial
Two essentially different approaches to constructing graph-based architectures emerged
historically. The spectral approach — chronologically first — defines
convolution on a graph through the
Graph Fourier Transform,
parameterizes filters in the spectral domain, and recovers tractable architectures
by polynomial approximation. The spatial approach — now dominant in
practice — defines a layer directly as local message passing between adjacent vertices.
The two viewpoints are connected: the most widely used spatial architecture, the GCN,
is mathematically a first-order Chebyshev approximation of a spectral filter. We treat
them in turn.
The Spectral Side: Filters in the Frequency Domain
The
graph Laplacian
\(\boldsymbol{L} = \boldsymbol{D} - \boldsymbol{A}\) is real symmetric and
positive semi-definite,
so it admits an
eigendecomposition
\(\boldsymbol{L} = \boldsymbol{V}\boldsymbol{\Lambda}\boldsymbol{V}^\top\)
with real non-negative eigenvalues. The eigenvectors play the role of Fourier modes on the
graph, their eigenvalues the role of "graph frequencies," and the
Graph Fourier Transform
\(\hat{\boldsymbol{f}} = \boldsymbol{V}^\top \boldsymbol{f}\) decomposes a graph
signal into its frequency components. A spectral graph convolution applies a filter
\(g(\boldsymbol{\Lambda})\) to the spectrum:
\[
\boldsymbol{f}_{\mathrm{out}}
= \boldsymbol{V}\, g(\boldsymbol{\Lambda})\, \boldsymbol{V}^\top \boldsymbol{f}_{\mathrm{in}}.
\]
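A direct implementation of this formula is sketched below; the heat-kernel filter \(g(\lambda) = e^{-\lambda}\) is chosen purely for illustration as a low-pass filter.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8

# Random undirected graph and its combinatorial Laplacian L = D - A.
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1); A = A + A.T
L = np.diag(A.sum(axis=1)) - A

# Graph Fourier basis: eigenvectors of the symmetric, PSD Laplacian.
lam, V = np.linalg.eigh(L)

f_in = rng.normal(size=n)                 # a scalar signal on the vertices
g = np.exp(-lam)                          # example low-pass filter g(lambda): heat kernel

f_hat = V.T @ f_in                        # Graph Fourier Transform
f_out = V @ (g * f_hat)                   # filter in the spectral domain, transform back

# Equivalent operator form: f_out = V g(Lambda) V^T f_in.
print(np.allclose(f_out, V @ np.diag(g) @ V.T @ f_in))   # True
```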
Computing this directly requires the full eigendecomposition of \(\boldsymbol{L}\),
which is prohibitive for large graphs. The standard remedy approximates the filter
by a low-degree polynomial in \(\boldsymbol{L}\) — typically Chebyshev polynomials
of the rescaled Laplacian — yielding architectures that are efficient and
localized (a degree-\(K\) polynomial filter respects \(K\)-hop
neighborhoods). The first-order case of this construction is the
Graph Convolutional Network (GCN), which has become the standard
baseline architecture for graph learning. The full development — Chebyshev approximation, the
rescaling \(\tilde{\boldsymbol{L}} = (2/\lambda_{\max})\mathcal{L}_{\mathrm{sym}} - \boldsymbol{I}\)
(where \(\mathcal{L}_{\mathrm{sym}} = \boldsymbol{D}^{-1/2}\boldsymbol{L}\boldsymbol{D}^{-1/2}\)
is the symmetric normalized Laplacian), and the GCN derivation — is given in our
graph Laplacian page;
we recall here only what we need to connect spectral and spatial perspectives.
The Spatial Side: Message Passing
The spatial perspective begins from a different place. Rather than diagonalizing the
Laplacian, we ask directly: what local computation respects graph structure?
The answer takes the form of a message passing layer. At each layer
\(k\), every vertex \(v\) updates its hidden state \(\mathbf{h}_v^{(k)}\) by aggregating
messages from its neighbors and combining them with its own current state:
Definition: Message Passing Layer
A message passing layer updates vertex features by
\[
\mathbf{h}_v^{(k+1)}
= \mathrm{UPDATE}\!\left(\mathbf{h}_v^{(k)},\ \mathrm{AGGREGATE}\big(\{\!\!\{ \mathbf{h}_u^{(k)} : u \in \mathcal{N}(v) \}\!\!\}\big)\right),
\]
where \(\mathcal{N}(v)\) is the
neighborhood
of \(v\), \(\{\!\!\{\cdot\}\!\!\}\) denotes a multiset, \(\mathrm{AGGREGATE}\) is a
permutation-invariant function over the neighborhood (sum, mean, max, attention-weighted
sum), and \(\mathrm{UPDATE}\) is a learnable transformation (typically a small MLP).
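As a sketch of how the definition translates into code, the layer below uses sum aggregation and a small two-layer MLP as \(\mathrm{UPDATE}\); both are illustrative choices among the options listed next.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 6, 4

def message_passing_layer(H, A, W1, W2):
    """One round of message passing: sum aggregation, then an MLP update."""
    messages = A @ H                                    # row v receives sum_{u in N(v)} h_u
    combined = np.concatenate([H, messages], axis=1)    # UPDATE sees (own state, aggregate)
    return np.maximum(combined @ W1, 0.0) @ W2          # small two-layer MLP

A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1); A = A + A.T              # undirected graph, no self-loops
H = rng.normal(size=(n, d))
W1 = rng.normal(size=(2 * d, 16))
W2 = rng.normal(size=(16, d))

H_next = message_passing_layer(H, A, W1, W2)
print(H_next.shape)                         # (6, 4): one updated state per vertex
```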
The choice of \(\mathrm{AGGREGATE}\) and \(\mathrm{UPDATE}\) determines the architecture.
The GCN uses a normalized-degree weighted sum derived from
\(\mathcal{L}_{\mathrm{sym}}\); GraphSAGE uses sampled neighborhoods with a learnable
pooling function, addressing scalability on very large graphs; the Graph Attention
Network (GAT) uses an attention mechanism to compute neighborhood weights, recovering
much of the flexibility of self-attention while restricting interactions to graph
edges. The Graph Isomorphism Network (GIN) replaces the aggregation with a sum
followed by an MLP, a choice motivated by its connection to the
1-Weisfeiler-Lehman (1-WL) graph isomorphism test — GIN is
provably as expressive as 1-WL and strictly more expressive than mean- or
max-pooling alternatives.
The crucial structural property of every message passing layer is automatic
permutation equivariance: relabelling the vertices of the graph
produces a correspondingly relabelled output, because \(\mathrm{AGGREGATE}\) operates
on a multiset of messages and is by definition invariant under their order. This is
what places GNNs in the GDL framework — the architecture is constrained, by its very
construction, to satisfy \(f(P\mathbf{X}, P\mathbf{A}P^\top) = Pf(\mathbf{X}, \mathbf{A})\)
for every \(P \in S_n\).
Comparison with CNNs and Transformers
The three architectures — CNN, Transformer, GNN — admit a unified description
as message passing on graphs, distinguished only by which graph and which aggregation:
- A CNN performs message passing on a regular grid graph with fixed local neighborhoods and a learned weight per edge offset; the same weights are applied at every position, which is what makes the layer translation-equivariant.
- A Transformer performs message passing on the complete graph over the input tokens, with attention-derived edge weights; it is permutation-equivariant in the strong sense that all permutations of \(S_n\) commute with the layer.
- A GNN performs message passing on an input graph supplied as part of the data rather than fixed by the architecture. Like the Transformer it is \(S_n\)-equivariant, but where the Transformer permutes only token features, the GNN permutes a feature-adjacency pair \((\mathbf{X}, \mathbf{A})\) jointly.
The continuum from CNN through Transformer to GNN is therefore one of increasing
data-dependence in the connectivity structure: regular grid → complete graph
→ arbitrary graph. The Transformer's success at scale demonstrates that the complete
graph is often a useful default when no informative connectivity is given; the GNN is
the natural choice when connectivity carries information that should be respected.
This equivariance with respect to graph structure is exactly the property that has
made GNNs the architecture of choice in domains where data is intrinsically
relational: molecular property prediction (where atoms and bonds form a graph),
recommendation systems and social network analysis (where the user-item or user-user
graph carries the predictive signal), knowledge graph completion, and traffic and
network science. Specific benchmark architectures evolve quickly; the underlying
design principle — encode the graph's symmetry into the layer — has remained stable
across a decade of empirical development.
Beyond Permutation: Continuous Group Equivariance
The GDL principle generalizes naturally beyond the discrete symmetries we have
discussed so far. Consider a molecule represented not as an abstract graph but as a
set of atoms with positions in three-dimensional space. Rotating or translating the
entire molecule produces the same molecule — its energy, its stability, its chemical
identity are unchanged. A learned function predicting any of these properties should
therefore be invariant under the group of
rigid motions
\(SE(3)\), combining
rotations \(SO(3)\) with three-dimensional translations,
and a learned function predicting a vector-valued geometric quantity (a force, an
orientation) should be equivariant under \(SE(3)\): rotating the input
rotates the output by the same rotation.
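The distinction can be made concrete with a toy example. In the sketch below, a scalar built only from pairwise distances (a hypothetical energy) is unchanged by a random rigid motion, while a vector quantity built from relative positions (a hypothetical force) rotates with the input; both functional forms are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 7
X = rng.normal(size=(n, 3))                      # n "atom" positions in 3D

def pairwise_dists(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def energy(X):
    """Scalar built only from pairwise distances (toy potential)."""
    D = pairwise_dists(X)
    return np.sum(np.exp(-D[np.triu_indices(len(X), 1)]))

def forces(X):
    """One vector per atom, built from relative positions weighted by distance."""
    diff = X[:, None, :] - X[None, :, :]
    w = np.exp(-pairwise_dists(X))[..., None]
    return (w * diff).sum(axis=1)

# Random rigid motion: rotation R (QR factor, det corrected to +1) plus translation t.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q * np.sign(np.linalg.det(Q))
t = rng.normal(size=3)
X_moved = X @ R.T + t

print(np.isclose(energy(X_moved), energy(X)))           # True: SE(3)-invariant scalar
print(np.allclose(forces(X_moved), forces(X) @ R.T))    # True: rotation-equivariant vectors
```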
Implicit in this distinction is a subtlety we have so far suppressed in our notation.
The equivariance condition \(f(g \cdot x) = g \cdot f(x)\) writes the action of \(g\)
the same way on both sides, but the actual action depends on the type
of the quantity being acted on. A rotation \(R \in SO(3)\) acts trivially on a scalar
(energy, charge density), as the standard \(3 \times 3\) rotation matrix on a vector
(force, dipole moment), and via more elaborate formulas on higher-order tensors. The
systematic study of how a single group acts in different ways on different feature
types is the representation theory of the group, and equivariant
networks for 3D data are organized precisely by which representation each layer-feature
transforms under. The same shorthand \(g \cdot x\) can therefore stand for genuinely
distinct linear maps, and a careful theory must distinguish them.
Both \(SO(3)\) and \(SE(3)\) are continuous groups, and concretely they are the
matrix Lie groups
we developed in Section I. Building neural network layers that commute with their
action requires more than the multiset symmetrization that powers GNNs. The standard
approach decomposes vertex features into irreducible representations
of the rotation group (scalars, vectors, higher-order tensors corresponding to the
spherical harmonics), constrains messages to transform appropriately under rotation,
and combines them via Clebsch-Gordan products — the tensor product decompositions
that respect the irreducible representation structure.
The result is a graph-like architecture whose messages are themselves geometric
objects rather than scalar features — an architecture that knows, by its
construction, how to handle the rotation of its input.
The mathematical machinery this requires — irreducible representations of compact
Lie groups, spherical harmonics, the Peter-Weyl theorem, equivariant tensor product
decompositions — is the subject of the forthcoming pages on representation theory
and on Fourier analysis on Hilbert spaces. The systematic construction of equivariant
neural networks built on this machinery belongs to a later stage of the curriculum;
here we record only the application-side outcome:
encoding continuous symmetry into the architecture has produced concrete advances
in structural biology, where AlphaFold's prediction of
three-dimensional protein structure from amino-acid sequence — recognized by the 2024
Nobel Prize in Chemistry — relies on attention mechanisms designed to respect the
rigid-motion symmetries of residues in 3D space. Equivariant architectures are also
the dominant family in machine-learning interatomic potentials for molecular dynamics,
and an active research direction in robotic manipulation, where the geometry of
physical configuration spaces naturally calls for \(SE(3)\)-equivariant policies.
Why the Mathematics Comes Next
Three threads converge in equivariant neural networks for 3D data:
Lie groups provide the symmetry groups themselves and their algebraic
structure (developed in the Lie theory series of Section I); representation theory
decomposes feature spaces into irreducible building blocks compatible with the
symmetry; differential geometry of smooth manifolds supplies the
natural setting for "data on a curved domain" and the analogue of the graph
Laplacian (the Laplace-Beltrami operator) on such domains. The forthcoming
manifold series and representation theory pages develop these threads; the
Peter-Weyl theorem connects them back to the harmonic analysis on Hilbert spaces
of Section II. The GDL viewpoint is what makes the destination concrete: each
of these mathematical developments will pay off in a specific class of
equivariant architecture.
Interactive Demo
The demo above visualizes the central claim of this page in concrete numerical form.
A small graph carries a scalar feature on each node, displayed as a diverging color
(blue for negative, red for positive). The Layer slider scrubs through
a forward pass: layer 0 is the input, and successive layers show the features after
each round of message passing or matrix multiplication. The Architecture
selector switches between three regimes — MLP, which performs dense
fully-connected updates with no notion of graph structure; Transformer, which
operates on the complete graph (every node interacts with every other); and
GNN, which uses the displayed graph as its connectivity.
The Shuffle Node Labels button is where the GDL claim becomes testable.
Shuffling applies a random permutation \(P\) jointly to the node features and the
adjacency matrix, then re-runs the same forward pass with the same fixed weights.
For the GNN and Transformer, the equivariance condition
\(f(P\mathbf{X}, P\mathbf{A}P^\top) = Pf(\mathbf{X}, \mathbf{A})\) holds exactly, and
the comparison table on the right shows that every output slot matches the
permutation-predicted value to numerical precision: the equivariance meter sits at
floating-point zero. For the MLP, the same shuffle produces a quantitatively
different output — the dense weight matrix mixes node coordinates in a way that
depends on their indexing, and the meter flags the discrepancy in red. The contrast
is not a matter of better or worse training; it is built into the architectures by
construction, before any training has occurred.
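A small-scale version of the shuffle test can be reproduced in a few lines; the layer definitions below are illustrative stand-ins, not the demo's actual implementation. The GNN output matches the permutation-predicted value to machine precision, while the dense map over the flattened node vector does not.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 6, 1                                     # one scalar feature per node, as in the demo

A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1); A = A + A.T
X = rng.normal(size=(n, d))
P = np.eye(n)[np.roll(np.arange(n), 1)]         # a fixed non-identity node relabelling

# GNN layer: normalized neighborhood averaging followed by a shared weight.
def gnn(X, A, W):
    A_hat = A + np.eye(n)
    A_norm = A_hat / A_hat.sum(axis=1, keepdims=True)
    return np.tanh(A_norm @ X @ W)

# "MLP" in the demo's sense: a dense map over the flattened node vector,
# so its weights are tied to the arbitrary node indexing.
def mlp(X, W_dense):
    return np.tanh(X.reshape(-1) @ W_dense).reshape(n, d)

W = rng.normal(size=(d, d))
W_dense = rng.normal(size=(n * d, n * d))

print(np.abs(gnn(P @ X, P @ A @ P.T, W) - P @ gnn(X, A, W)).max())    # ~0: equivariant
print(np.abs(mlp(P @ X, W_dense)        - P @ mlp(X, W_dense)).max()) # large: not equivariant
```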
The same construction, extended from discrete graphs to data with continuous
geometric symmetry, motivates equivariant neural networks built on the Lie-theoretic
foundations developed in Section I and the representation theory developed in the
forthcoming pages.