Intro to Deep Neural Networks

Convolutional Neural Networks (CNNs)

On the previous pages on neural networks and automatic differentiation, we introduced the multilayer perceptron, backpropagation, and the generalization of gradient computation to arbitrary computational graphs. With these foundations in place, we now survey the major architectural innovations that have shaped modern deep learning — from convolutional networks for spatial data to the Transformer architecture that underpins today's large language models.

Deep neural networks (DNNs) have evolved over the past few decades from early image classifiers to today's cutting-edge language and multimodal models, the field has progressed through a series of foundational innovations. Broadly speaking, modern DNNs are typically feedforward networks, where information flows in a single direction — from input to output — without any loops or recurrence. This design simplifies computations, making them faster and more stable.

In the 1990s, Convolutional Neural Networks (CNNs) emerged as an architecture for processing grid-like data such as images. A CNN uses convolutional layers that apply learnable filters across local spatial regions. Mathematically, a 2D convolution operation is given by: \[ (f * x)(i, j) = \sum_m \sum_n f(m, n) \cdot x(i - m, j - n) \] where \(f\) is the filter (or kernel) and \(x\) is the input image. This operation captures local spatial patterns while sharing parameters across the image.

Deep CNNs trained on GPUs and applied to large labeled datasets transformed computer vision, demonstrating that learned convolutional features could vastly outperform traditional vision pipelines. A subsequent architectural milestone was the introduction of residual connections, which enabled training networks of unprecedented depth and set the stage for modern vision architectures.

ResNet: Residual Connections

ResNet is a feedforward architecture built from residual blocks:

Definition: Residual Block

\[ \mathbf{z}_{\ell+1} = \mathcal{F}_\ell(\mathbf{z}_\ell) + \mathbf{z}_\ell \] where \(\mathcal{F}_\ell\) is a nonlinear transformation at layer \(\ell\), and the input \(\mathbf{z}_\ell\) is added back directly via a skip connection.

The activations at the final layer \(L\) can be expressed recursively in terms of an earlier layer \(\ell\): \[ \mathbf{z}_L = \mathbf{z}_{\ell} + \sum_{i = \ell}^{L - 1} \mathcal{F}_i (\mathbf{z}_i; \boldsymbol{\theta}_i), \] where \(\boldsymbol{\theta}_i\) denotes the parameters of the transformation at layer \(i\).

This formulation mitigates the vanishing gradient problem in deep neural networks, enabling the successful training of very deep models.

Applying the chain rule, the gradient of the loss \(\mathcal{L}\) with respect to the parameters at layer \(\ell\) becomes (we adopt the denominator-layout convention, in which gradients of scalars are column vectors and Jacobians are written so that products read right-to-left along the computation): \[ \begin{align*} \frac{\partial \mathcal{L}}{\partial \boldsymbol{\theta}_{\ell}} &= \frac{\partial \mathbf{z}_{\ell}}{\partial \boldsymbol{\theta}_{\ell}} \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{\ell}} \\\\ &= \frac{\partial \mathbf{z}_{\ell}}{\partial \boldsymbol{\theta}_{\ell}} \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{L}} \frac{\partial \mathbf{z}_L}{\partial \mathbf{z}_{\ell}} \\\\ &= \frac{\partial \mathbf{z}_{\ell}}{\partial \boldsymbol{\theta}_{\ell}} \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{L}} \left(\mathbf{I} + \sum_{i = \ell}^{L-1} \frac{\partial \mathcal{F}_i (\mathbf{z}_i ; \boldsymbol{\theta}_i)}{\partial \mathbf{z}_{\ell}}\right) \\\\ &= \frac{\partial \mathbf{z}_{\ell}}{\partial \boldsymbol{\theta}_{\ell}} \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{L}} + \text{other terms} \end{align*} \] The "other terms" in the gradient capture indirect paths through deeper residual blocks and reflect how updates propagate backward through the network. These terms may diminish in magnitude in very deep networks, but they are generally not zero and play a critical role in learning refined representations. Residual networks avoid the vanishing gradient problem not by eliminating these terms, but by ensuring they are not the only path for gradient flow.

Layer Normalization

Let \(z_i\) be the \(i\)'th element of a tensor. For example, in 2d images, the index \(i\) has four components, indicating batch, height, width, and channel: \[ i = (i_N, i_H, i_W, i_C). \] For each index \(z_i\), the mean and variance are computed by pooling statistics across other dimensions of the tensor but not across examples in the batch: \[ \begin{align*} &\mu_i = \frac{1}{|\mathcal{S}_i|} \sum_{k \in \mathcal{S}_i} z_k \\\\ &\sigma_i = \sqrt{\frac{1}{|\mathcal{S}_i|}\sum_{k \in \mathcal{S}_i} (z_k - \mu_i)^2 + \epsilon} \end{align*} \] where \(\mathcal{S}_i\) is the set of elements over which statistics are computed for index \(i\), and \(\epsilon\) is a small constant for numerical stability.

The normalized and scaled output is given by: \[ \begin{align*} &\hat{z}_i = \frac{(z_i - \mu_i)}{\sigma_i} \\\\ &\tilde{z}_i = \gamma_c \hat{z}_i + \beta_c \end{align*} \] where \(c\) is the channel index corresponding to \(i\), and \(\gamma_c, \beta_c\) are learnable parameters for scaling and shifting.

Unlike batch normalization, which pools across the batch dimension, layer normalization pools only over feature dimensions for each input example. This makes it especially effective in settings where batch-level statistics are unstable or unavailable - such as in autoregressive models, small batch sizes, or variable-length sequences.

Attention Mechanisms

Recurrent architectures process sequences step-by-step, which limits parallelism and makes modeling long-range dependencies challenging. For example, imagine reading a sentence and trying to understand what each word refers to. When processing the word "it," you automatically look back at previous words to determine its meaning. Attention mechanisms formalize this intuitive process by computing a weighted sum of all input feature vectors, enabling the model to dynamically focus on the most relevant parts of the sequence regardless of distance.

In traditional feedforward neural networks, each layer computes hidden activations as a linear transformation of input activations followed by a nonlinearity: \[ \mathbf{Z} = \varphi (\mathbf{X}\mathbf{W}) \in \mathbb{R}^{m \times v'} \] where \(\mathbf{X} \in \mathbb{R}^{m \times v}\) is a matrix of input (or hidden) feature vectors, and \(\mathbf{W} \in \mathbb{R}^{v \times v'}\) is a fixed weight matrix learned during training.

This formulation uses the same set of weights \(\mathbf{W}\) for every input position. To allow more flexibility, we can instead make the weights input-dependent. That is, we can allow: \[ \mathbf{Z} = \varphi (\mathbf{X}\mathbf{W}(\mathbf{X})), \] where the transformation matrix \(\mathbf{W}(\mathbf{X})\) varies depending on the input itself. This allows the model to dynamically adjust its computations based on the input context. Such input-dependent multiplicative interactions are central to attention mechanisms.

In more general settings, the input-dependent transformation matrix can be computed from separate sets of learned representations called queries, keys, and values. This leads to a broader formulation of attention: \[ \mathbf{Z} = \varphi (\mathbf{V}\mathbf{W}(\mathbf{Q}, \mathbf{K})) \] where:

\(\mathbf{Q} \in \mathbb{R}^{m \times q}\) is a set of queries
\(\mathbf{K} \in \mathbb{R}^{m \times q}\) is a set of keys
\(\mathbf{V} \in \mathbb{R}^{m \times v}\) is a set of values

The transformation matrix \(\mathbf{W}(\mathbf{Q}, \mathbf{K})\) represents a learned or computed relationship between queries and keys. Note: In practice, these matrices are often obtained by applying learned linear projections to the input sequence.

In attention, the core idea is to compute each output vector \(\mathbf{z}_j\) as a weighted sum over all value vectors \(\mathbf{v}_i\), where the weights \(\alpha_{ij}\) are determined by a similarity between the query \(\mathbf{q}_j\) and each key \(\mathbf{k}_i\): \[ \mathbf{z}_j = \sum_{i}\alpha_{ij}\mathbf{v}_i \] where \(0 \leq \alpha_{ij} \leq 1\) and \(\sum_i \alpha_{ij} = 1\). This allows the model to selectively attend to relevant parts of the input.

A common similarity function is the inner product between query and key vectors. If \(\mathbf{q}, \mathbf{k} \in \mathbb{R}^d\) are independent random variables with components that have zero mean and unit variance, then their inner product \(\mathbf{q}^\top \mathbf{k}\) has zero mean and variance \(d\). To prevent large variance from pushing the softmax into saturation regions where gradients vanish, we normalize it by \(\sqrt{d}\): \[ a(\mathbf{q}, \mathbf{k}) = \frac{\mathbf{q}^\top \mathbf{k}}{\sqrt{d}}. \] This is known as scaled dot-product attention score, where the scale factor \(\frac{1}{\sqrt{d}}\) normalizes the variance so that it remains stable (i.e., \(= 1\)) as the dimension \(d\) increases. This is used to compute attention weights \(\alpha_{ij}\) after applying the softmax function.

In practice, we compute attention over minibatches of \(n\) query vectors.

Definition: Scaled Dot-Product Attention

Given queries \(\mathbf{Q} \in \mathbb{R}^{n \times d}\), keys \(\mathbf{K} \in \mathbb{R}^{m \times d}\), and values \(\mathbf{V} \in \mathbb{R}^{m \times v}\): \[ \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\right)\mathbf{V} \in \mathbb{R}^{n \times v} \] where the softmax is applied row-wise, normalizing each row of the \(n \times m\) similarity matrix.

Self-Attention

Self-attention is a mechanism that allows each position \(i\) in a sequence to attend to all other positions in the same sequence. Given a sequence of input tokens \(\mathbf{x}_1, \cdots, \mathbf{x}_n \in \mathbb{R}^d\), a sequence of outputs is given by: \[ \mathbf{y}_i = \text{Attention}(\mathbf{q}_i, (\mathbf{x}_1, \mathbf{x}_1), \cdots, (\mathbf{x}_n, \mathbf{x}_n )) \in \mathbb{R}^d \] where the query is \(q_i\) and the keys and values are \(\mathbf{x}_1, \cdots \mathbf{x}_n\).

This mechanism captures contextual relationships between tokens by allowing information to flow from any position to any other in a single layer. During training, all the outputs are known. Thus the model can evaluate \(\mathbf{y}_i\) in parallel, which overcomes the sequential bottleneck of past DNNs.

Multi-Head Attention

While a single self-attention mechanism can capture dependencies between tokens, using only one set of projections may limit the model's expressiveness. Multi-head attention (MHA) addresses this by computing multiple self-attention operations in parallel using different learned projections, called "heads."

Definition: Multi-Head Attention

Given input \(\mathbf{X} \in \mathbb{R}^{n \times d}\), each head \(i \in \{1,\ldots,h\}\) computes: \[ \text{head}_i = \text{Attention}\!\left( \mathbf{X}\mathbf{W}_i^{(q)},\; \mathbf{X}\mathbf{W}_i^{(k)},\; \mathbf{X}\mathbf{W}_i^{(v)} \right) \in \mathbb{R}^{n \times d_v} \] where \(\mathbf{W}_i^{(q)} \in \mathbb{R}^{d \times d_k}\), \(\mathbf{W}_i^{(k)} \in \mathbb{R}^{d \times d_k}\), \(\mathbf{W}_i^{(v)} \in \mathbb{R}^{d \times d_v}\). The outputs are concatenated and projected: \[ \text{MHA}(\mathbf{X}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,\mathbf{W}_o \] where \(\mathbf{W}_o \in \mathbb{R}^{hd_v \times d}\). A common choice is \(d_k = d_v = d/h\).

The query and key projections share dimension \(d_k\) so that their inner product is well-defined; the value projection dimension \(d_v\) may differ. With \(h\) heads and \(d_k = d_v = d/h\), the total computational cost is comparable to single-head attention.

By using multiple heads, the model can attend to information from different representation subspaces and capture a richer set of relationships in the data. Each head may focus on different positions or interaction patterns, improving both performance and interpretability.

Positional Encoding

The standard self-attention is permutation invariant, which means the model ignores the order of tokens in an input sequence. To address this, positional encoding is added to the input embeddings to inject information about the position of each token. A widely used method is to define fixed, deterministic positional encodings using sine and cosine functions of varying frequencies.

Definition: Sinusoidal Positional Encoding

The positional encoding vector for position \(i\) and dimension \(j\) is defined as: \[ p_{i,\,2j} = \sin\!\left(\frac{i}{C^{2j/d}}\right) \] \[ p_{i,\,2j+1} = \cos\!\left(\frac{i}{C^{2j/d}}\right) \] where:

\(i\) is the position index in the sequence, starting from 0,
\(j\) is the dimension index in the positional encoding vector,
\(d\) is the total dimension of the model (e.g., 512),
\(C\) is a constant (usually set to 10,000) to control the frequency scale.

This formulation ensures that each dimension of the positional encoding corresponds to a sinusoid of a different frequency. As a result, any two positions will have a unique positional encoding vector, and the relative position between tokens can be represented as a linear function of the encodings.

Example: Suppose the model dimension is \( d = 4 \), and we compute the positional encoding for position \( i = 2 \). Then we compute each component of the encoding as: \[ \begin{align*} &p_{2,0} = \sin \left( \frac{2}{10,000^{0/4}} \right) = \sin(2) \\\\ &p_{2,1} = \cos \left( \frac{2}{10,000^{0/4}} \right) = \cos(2) \\\\ &p_{2,2} = \sin \left( \frac{2}{10,000^{2/4}} \right) = \sin \left( \frac{2}{100} \right) \\\\ &p_{2,3} = \cos \left( \frac{2}{10,000^{2/4}} \right) = \cos \left( \frac{2}{100} \right) \end{align*} \]

This results in the positional encoding vector for position 2 as: \[ \mathbf{p}_2 = \left[ \sin(2),\ \cos(2),\ \sin(0.02),\ \cos(0.02) \right]. \]

These encodings are added element-wise to the token embeddings at each position, allowing the model to incorporate position information without relying on recurrence or convolution. Moreover, the use of periodic functions enables the model to potentially generalize to sequences longer than those seen during training.

Transformers

The Transformer architecture replaces recurrence and convolution with a mechanism based entirely on self-attention. In both the encoder and decoder, the core components are multi-head self-attention layers, feedforward networks, residual connections, and layer normalization. The design enables full parallelization across sequence positions during training, and allows the model to capture long-range dependencies through direct pairwise interactions between tokens. By using learned attention weights that vary depending on the input, the model dynamically determines which parts of the sequence to emphasize at each layer.

Transformer-based models form the basis of many state-of-the-art systems across domains such as text, vision, and multimodal learning. In particular, modern large language models (LLMs) are built on stacked Transformer decoder blocks trained on massive text corpora, leveraging the scalability and expressiveness of the Transformer to generate coherent and context-aware output across a wide range of tasks.

Transformer Encoder & Decoder EncoderBlock(X) Z = LayerNorm(MHA(Q =X, K = X, V = X) + X) E = LayerNorm(FeedForward(Z) + Z) return E
Encoder(X, N) E = POS(Embed(X)) for \(n\) in range (N) E = EncoderBlock(E) return E
DecoderBlock(Y, E) Z = LayerNorm(MHA(Q =Y, K = Y, V = Y) + Y) // Causal mask Z' = LayerNorm(MHA(Q = Z, K = E, V = E) + Z) D = LayerNorm(FeedForward(Z') + Z') return D
Decoder(Y, E, N) D = POS(Embed(Y)) for \(n\) in range (N) D = DecoderBlock(D, E) return D

Note:

MHA(...) : the Multi-head self-attention. Each token attends to every other token.
+ (...) : residual connection. The input is added back to the output of attention.
LayerNorm(...) : Layer normalization, which is applied after the residual connection to stabilize training.
FeedForward(...) : A position-wise fully connected feedforward network is applied to each token vector independently.
Embed(...) : Each token is mapped to a high-dimensional embedding vector.
POS(...) : Positional encoding is added to embeddings to inject token order information.
N: the number of copies of the block

Symmetry as Architectural Principle

From Discrete Permutations to Continuous Lie Groups

A pattern recurs across the architectures we have surveyed: each is built around a symmetry of the input domain. Convolutional networks are translation-equivariant — their convolutional layers commute with shifts of the input image. Self-attention is permutation-equivariant — outputs are unchanged when input tokens are reordered — which is precisely why positional encoding must be added back to inject ordering information.

This perspective generalizes far beyond CNNs and Transformers. Architectures designed for graphs respect permutation symmetry of the node ordering; networks processing 3D data such as molecules or point clouds are designed to be equivariant under the rotation group \(SO(3)\) or the rigid-body group \(SE(3)\). The mathematical machinery for constructing such equivariant layers is the Lie group / Lie algebra correspondence — the continuous limit of the discrete symmetries we first encountered in dihedral groups. We will develop these threads in our discussion of geometric deep learning.

Demo

The Transformer architecture provides the computational backbone for modern ML systems, from language models to vision transformers. On the upcoming page on reinforcement learning, we explore a fundamentally different learning paradigm — where an agent learns through interaction with an environment, guided by reward signals rather than labeled examples.