Convolutional Neural Networks (CNNs)
In Part 4, we introduced the multilayer perceptron and backpropagation; in
Part 5, we generalized gradient computation to arbitrary computational graphs.
With these foundations in place, we now survey the major architectural innovations that have shaped modern deep learning - from
convolutional networks for spatial data to the Transformer architecture that underpins today's large language models.
Deep neural networks (DNNs) have undergone remarkable evolution over the past few decades. From early image
classifiers to today's cutting-edge language and multimodal models, the field has progressed through a series of foundational
innovations. Broadly speaking, modern DNNs are typically feedforward networks, where information flows in a
single direction - from input to output - without any loops or recurrence. This design simplifies computations, making them
faster and more stable.
In the 1990s, Convolutional Neural Networks (CNNs) emerged as an architecture for processing grid-like data
such as images. A CNN uses convolutional layers that apply learnable filters across local spatial regions.
Mathematically, a 2D convolution operation is given by:
\[
(f * x)(i, j) = \sum_m \sum_n f(m, n) \cdot x(i - m, j - n)
\]
where \(f\) is the filter (or kernel) and \(x\) is the input image. This operation captures local spatial patterns while
sharing parameters across the image.
The breakthrough came in 2012, when AlexNet won the ImageNet competition. It stacked convolution and
pooling layers, used the ReLU nonlinearity \( \text{ReLU}(x) = \max(0, x) \), and applied dropout to reduce overfitting.
AlexNet demonstrated that deep CNNs trained on GPUs could vastly outperform traditional vision pipelines.
This success led to the deeper architecture, ResNet (2015), which introduced residual connections.
Such a model advanced image recognition and set the stage for modern vision architectures.
ResNet: Residual Connections
ResNet is a feedforward architecture built from residual blocks:
Definition: Residual Block:
\[
\mathbf{z}_{\ell+1} = \mathcal{F}_\ell(\mathbf{z}_\ell) + \mathbf{z}_\ell
\]
where \(\mathcal{F}_\ell\) is a nonlinear transformation at layer \(\ell\), and the input \(\mathbf{z}_\ell\) is
added back directly via a skip connection.
The activations at the final layer \(L\) can be expressed recursively in terms of an earlier layer \(\ell\):
\[
\mathbf{z}_L = \mathbf{z}_{\ell} + \sum_{i = \ell}^{L - 1} \mathcal{F}_i (\mathbf{z}_i; \mathbf{\theta}_i),
\]
where \(\mathbf{\theta}_i\) denotes the parameters of the transformation at layer \(i\).
This formulation mitigates the vanishing gradient problem in deep neural networks, enabling the
successful training of very deep models.
Applying the chain rule, the gradient of the loss \(\mathcal{L}\) with respect to the parameters at layer \(\ell\) becomes:
\[
\begin{align*}
\frac{\partial \mathcal{L}}{\partial \mathbf{\theta}_{\ell}}
&= \frac{\partial \mathbf{z}_{\ell}}{\partial \mathbf{\theta}_{\ell}} \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{\ell}} \\\\
&= \frac{\partial \mathbf{z}_{\ell}}{\partial \mathbf{\theta}_{\ell}} \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{L}} \frac{\partial \mathbf{z}_L}{\partial \mathbf{z}_{\ell}} \\\\
&= \frac{\partial \mathbf{z}_{\ell}}{\partial \mathbf{\theta}_{\ell}} \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{L}}
\left(1 + \sum_{i = \ell}^{L-1} \frac{\partial \mathcal{F}_i (\mathbf{z}_i ; \mathbf{\theta}_i)}{\partial \mathbf{z}_{\ell}}\right) \\\\
&= \frac{\partial \mathbf{z}_{\ell}}{\partial \mathbf{\theta}_{\ell}} \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{L}} + \text{other terms}
\end{align*}
\]
The “other terms” in the gradient capture indirect paths through deeper residual blocks and reflect how updates propagate backward
through the network. These terms may diminish in magnitude in very deep networks, but they are generally not zero and play a critical role in
learning refined representations. Residual networks avoid the vanishing gradient problem not by eliminating these terms, but by ensuring they
are not the only path for gradient flow.
Layer Normalization
Let \(z_i\) be the \(i\)'th element of a tensor. For example, in 2d images, the index \(i\) has four components,
indicating batch, height, width, and channel:
\[
i = (i_N, i_H, i_W, i_C).
\]
For each index \(z_i\), the mean and variance are computed by pooling statistics across other dimensions of the tensor
but not across examples in the batch:
\[
\begin{align*}
&\mu_i = \frac{1}{|\mathcal{S}_i|} \sum_{k \in \mathcal{S}_i} z_k \\\\
&\sigma_i = \sqrt{\frac{1}{|\mathcal{S}_i|}\sum_{k \in \mathcal{S}_i} (z_k - \mu_i)^2 + \epsilon}
\end{align*}
\]
where \(\mathcal{S}_i\) is the set of elements over which statistics are computed for index \(i\),
and \(\epsilon\) is a small constant for numerical stability.
The normalized and scaled output is given by:
\[
\begin{align*}
&\hat{z}_i = \frac{(z_i - \mu_i)}{\sigma_i} \\\\
&\tilde{z}_i = \gamma_c \hat{z}_i + \beta_c
\end{align*}
\]
where \(c\) is the channel index corresponding to \(i\), and \(\gamma_c, \beta_c\) are learnable
parameters for scaling and shifting.
Unlike batch normalization, which pools across the batch dimension, layer normalization pools only
over feature dimensions for each input example. This makes it especially effective in settings where batch-level statistics are
unstable or unavailable - such as in autoregressive models, small batch sizes, or variable-length sequences.
Attention Mechanisms
Recurrent architectures process sequences step-by-step, which limits parallelism and makes modeling
long-range dependencies challenging. For example, imagine reading a sentence and trying to understand
what each word refers to. When processing the word "it," you automatically look back at previous words
to determine its meaning. Attention mechanisms formalize this intuitive process by computing
a weighted sum of all input feature vectors, enabling the model to dynamically focus on the most relevant
parts of the sequence regardless of distance.
In traditional feedforward neural networks, each layer computes hidden activations as a linear transformation of input activations followed by a nonlinearity:
\[
\mathbf{Z} = \varphi (\mathbf{X}\mathbf{W}) \in \mathbb{R}^{m \times v'}
\]
where \(\mathbf{X} \in \mathbb{R}^{m \times v}\) is a matrix of input (or hidden) feature vectors, and
\(\mathbf{W} \in \mathbb{R}^{v \times v'}\) is a fixed weight matrix learned during training.
This formulation uses the same set of weights \(\mathbf{W}\) for every input position. To allow more flexibility,
we can instead make the weights input-dependent. That is, we can allow:
\[
\mathbf{Z} = \varphi (\mathbf{X}\mathbf{W}(\mathbf{X})),
\]
where the transformation matrix \(\mathbf{W}(\mathbf{X})\) varies depending on the input itself.
This allows the model to dynamically adjust its computations based on the input context.
Such input-dependent multiplicative interactions are central to attention mechanisms.
In more general settings, the input-dependent transformation matrix can be computed from separate sets of learned representations called
queries, keys, and values. This leads to a broader formulation of attention:
\[
\mathbf{Z} = \varphi (\mathbf{V}\mathbf{W}(\mathbf{Q}, \mathbf{K}))
\]
where:
- \(\mathbf{Q} \in \mathbb{R}^{m \times q}\) is a set of queries
- \(\mathbf{K} \in \mathbb{R}^{m \times q}\) is a set of keys
- \(\mathbf{V} \in \mathbb{R}^{m \times v}\) is a set of values
The transformation matrix \(\mathbf{W}(\mathbf{Q}, \mathbf{K})\) represents a learned or computed relationship between queries and keys.
Note: In practice, these matrices are often obtained by applying learned linear projections to the input sequence.
In attention, the core idea is to compute each output vector \(\mathbf{z}_j\) as a weighted sum over all value vectors \(\mathbf{v}_i\),
where the weights \(\alpha_{ij}\) are determined by a similarity between the query \(\mathbf{q}_j\) and each key \(\mathbf{k}_i\):
\[
\mathbf{z}_j = \sum_{i}\alpha_{ij}\mathbf{v}_i
\]
where \(0 \leq \alpha_{ij} \leq 1\) and \(\sum_i \alpha_{ij} = 1\). This allows the model to selectively attend to relevant parts of the input.
A common similarity function is the inner product between query and key vectors. If
\(\mathbf{q}, \mathbf{k} \in \mathbb{R}^d\) are independent random variables with components that have zero mean and unit variance,
then their inner product \(\mathbf{q}^\top \mathbf{k}\) has zero mean and variance \(d\).
To prevent large variance from pushing the softmax into saturation regions where gradients vanish, we normalize it by \(\sqrt{d}\):
\[
a(\mathbf{q}, \mathbf{k}) = \frac{\mathbf{q}^\top \mathbf{k}}{\sqrt{d}}.
\]
This is known as scaled dot-product attention score, where the scale factor
\(\frac{1}{\sqrt{d}}\) normalizes the variance so that it remains stable (i.e., \(= 1\)) as the dimension \(d\) increases.
This is used to compute attention weights \(\alpha_{ij}\) after applying the softmax function.
In practice, we compute attention over minibatches of \(n\) query vectors.
Definition: Scaled Dot-Product Attention:
Given queries \(\mathbf{Q} \in \mathbb{R}^{n \times d}\), keys \(\mathbf{K} \in \mathbb{R}^{m \times d}\), and
values \(\mathbf{V} \in \mathbb{R}^{m \times v}\):
\[
\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V})
= \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\right)\mathbf{V} \in \mathbb{R}^{n \times v}
\]
where the softmax is applied row-wise, normalizing each row of the \(n \times m\) similarity matrix.
Self-Attention
Self-attention is a mechanism that allows each position \(i\) in a sequence to attend to all other positions
in the same sequence. Given a sequence of input tokens \(\mathbf{x}_1, \cdots, \mathbf{x}_n \in \mathbb{R}^d\),
a sequence of outputs is given by:
\[
\mathbf{y}_i = \text{Attention}(\mathbf{q}_i, (\mathbf{x}_1, \mathbf{x}_1), \cdots, (\mathbf{x}_n, \mathbf{x}_n )) \in \mathbb{R}^d
\]
where the query is \(q_i\) and the keys and values are \(\mathbf{x}_1, \cdots \mathbf{x}_n\).
This mechanism captures contextual relationships between tokens by allowing information to flow from any position to any other in a single layer.
During training, all the outputs are known. Thus the model can evaluate \(\mathbf{y}_i\) in parallel, which
overcomes the sequential bottleneck of past DNNs.
Multi-Head Attention
While a single self-attention mechanism can capture dependencies between tokens, using only one set of projections
may limit the model's expressiveness. Multi-head attention (MHA) addresses this by computing
multiple self-attention operations in parallel using different learned projections, called "heads."
Definition: Multi-Head Attention
Given input \(\mathbf{X} \in \mathbb{R}^{n \times d}\), each head \(i \in \{1,\ldots,h\}\) computes:
\[
\text{head}_i = \text{Attention}\!\left(
\mathbf{X}\mathbf{W}_i^{(q)},\;
\mathbf{X}\mathbf{W}_i^{(k)},\;
\mathbf{X}\mathbf{W}_i^{(v)}
\right) \in \mathbb{R}^{n \times d_v}
\]
where
\(\mathbf{W}_i^{(q)} \in \mathbb{R}^{d \times d_k}\),
\(\mathbf{W}_i^{(k)} \in \mathbb{R}^{d \times d_k}\),
\(\mathbf{W}_i^{(v)} \in \mathbb{R}^{d \times d_v}\).
The outputs are concatenated and projected:
\[
\text{MHA}(\mathbf{X}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,\mathbf{W}_o
\]
where \(\mathbf{W}_o \in \mathbb{R}^{hd_v \times d}\). A common choice is \(d_k = d_v = d/h\).
The query and key projections share dimension \(d_k\) so that their inner product is well-defined; the value
projection dimension \(d_v\) may differ. With \(h\) heads and \(d_k = d_v = d/h\), the total computational cost is
comparable to single-head attention.
By using multiple heads, the model can attend to information from different representation subspaces and capture a richer set of relationships in the data.
Each head may focus on different positions or interaction patterns, improving both performance and interpretability.
Positional Encoding
The standard self-attention is permutation invariant, which means the model ignores the order of tokens in an
input sequence. To address this, positional encoding is added to the input embeddings to inject information about
the position of each token. A widely used method is to define fixed, deterministic positional encodings using
sine and cosine functions of varying frequencies.
Definition: Sinusoidal Positional Encoding
The positional encoding vector for position \(i\) and dimension \(j\) is defined as:
\[
p_{i,\,2j} = \sin\!\left(\frac{i}{C^{2j/d}}\right)
\]
\[
p_{i,\,2j+1} = \cos\!\left(\frac{i}{C^{2j/d}}\right)
\]
where:
- \(i\) is the position index in the sequence, starting from 0,
- \(j\) is the dimension index in the positional encoding vector,
- \(d\) is the total dimension of the model (e.g., 512),
- \(C\) is a constant (usually set to 10,000) to control the frequency scale.
This formulation ensures that each dimension of the positional encoding corresponds to a sinusoid of a
different frequency. As a result, any two positions will have a unique positional encoding vector, and the
relative position between tokens can be represented as a linear function of the encodings.
Example: Suppose the model dimension is \( d = 4 \), and we compute the positional encoding for
position \( i = 2 \). Then we compute each component of the encoding as:
\[
\begin{align*}
&p_{2,0} = \sin \left( \frac{2}{10,000^{0/4}} \right) = \sin(2) \\\\
&p_{2,1} = \cos \left( \frac{2}{10,000^{0/4}} \right) = \cos(2) \\\\
&p_{2,2} = \sin \left( \frac{2}{10,000^{2/4}} \right) = \sin \left( \frac{2}{100} \right) \\\\
&p_{2,3} = \cos \left( \frac{2}{10,000^{2/4}} \right) = \cos \left( \frac{2}{100} \right)
\end{align*}
\]
This results in the positional encoding vector for position 2 as:
\[
\mathbf{p}_2 = \left[ \sin(2),\ \cos(2),\ \sin(0.02),\ \cos(0.02) \right].
\]
These encodings are added element-wise to the token embeddings at each position, allowing the model to
incorporate position information without relying on recurrence or convolution.
Moreover, the use of periodic functions enables the model to potentially generalize to sequences longer
than those seen during training.
Transformers
The Transformer architecture, introduced by Vaswani et al. (2017), replaces recurrence and convolution with
a mechanism based entirely on self-attention. In both the encoder and decoder, the core components are multi-head self-attention
layers, feedforward networks, residual connections, and layer normalization. The design enables full parallelization across
sequence positions during training, and allows the model to capture long-range dependencies through direct pairwise interactions
between tokens. By using learned attention weights that vary depending on the input, the model dynamically determines which parts
of the sequence to emphasize at each layer.
Transformer-based models form the basis of many state-of-the-art systems across domains such as text, vision, and multimodal learning.
In particular, large language models (LLMs) such as the GPT family, developed by OpenAI, are built on
stacked Transformer decoder blocks trained on massive text corpora. These models leverage the scalability and expressiveness of the
Transformer to generate coherent and context-aware output across a wide range of tasks.
Transformer Encoder & Decoder
EncoderBlock(X)
Z = LayerNorm(MHA(Q =X, K = X, V = X) + X)
E = LayerNorm(FeedForward(Z) + Z)
return E
Encoder(X, N)
E = POS(Embed(X))
for \(n\) in range (N)
E = EncoderBlock(E)
return E
DecoderBlock(Y, E)
Z = LayerNorm(MHA(Q =Y, K = Y, V = Y) + Y) // Causal mask
Z' = LayerNorm(MHA(Q = Z, K = E, V = E) + Z)
D = LayerNorm(FeedForward(Z') + Z')
return D
Decoder(Y, E, N)
D = POS(Embed(Y))
for \(n\) in range (N)
D = DecoderBlock(D, E)
return D
Note:
- MHA(...) : the Multi-head self-attention. Each token attends to every other token.
- + (...) : residual connection. The input is added back to the output of attention.
- LayerNorm(...) : Layer normalization, which is applied after the residual connection to stabilize training.
- FeedForward(...) : A position-wise fully connected feedforward network is applied to each token vector independently.
- Embed(...) : Each token is mapped to a high-dimensional embedding vector.
- POS(...) : Positional encoding is added to embeddings to inject token order information.
- N: the number of copies of the block
Demo
The Transformer architecture provides the computational backbone for modern AI systems, from language models to vision transformers.
In Part 10: Intro to Reinforcement Learning, we explore a fundamentally
different learning paradigm - where an agent learns through interaction with an environment, guided by reward signals rather
than labeled examples.