Neural Networks Basics

Multilayer Perceptron (MLP) Activation Functions Learning in Neural Networks Neural Networks Demo Development of Deep Learning

Multilayer Perceptron (MLP)

In Part 3, we saw that logistic regression defines a linear decision boundary and that the kernel trick can extend this to nonlinear settings, but requires choosing a fixed feature map. Neural networks take a fundamentally different approach: they learn the feature representation jointly with the classifier by composing many simple, differentiable transformations. The resulting gradients are computed via the Chain Rule, which we formalize on this page as the backpropagation algorithm.

The key idea of deep neural networks (DNNs) is "composing" a vast number of simple functions to make a huge complex function. We focus on a specific type of DNN known as the multilayer perceptron (MLP), also referred to as a feedforward neural network (FFNN).

An MLP defines a composite function of the form: \[ f(x ; \theta) = f_L (f_{L-1}(\cdots(f_1(x))\cdots)) \] where each component function \( f_\ell(x) = f(x; \theta_\ell) \) represents the transformation at layer \( \ell \,\), \( x \in \mathbb{R}^D \) is an input vector with \( D \) features, and \(\theta\) is a collection of parameters (weights and biases): \[ \theta = \{ \theta_\ell \}_{\ell=1}^L \text{, where } \theta_\ell = \{ W^{(\ell)}, b^{(\ell)} \}. \]

Each layer is assumed to be differentiable and consists of two operations: an affine transformation followed by a non-linear differentiable activation function \( g_\ell : \mathbb{R} \to \mathbb{R}\). An MLP consists of an input layer, one or more hidden layers, and an output layer.

We define the hidden units \(z^{(\ell)}\) at each layer \(\ell\) by applying the activation elementwise to an affine transformation of the previous layer's units: \[ z^{(\ell)} = g_{\ell}(b^{(\ell)} + W^{(\ell)}z^{(\ell -1)}) = g_{\ell}(a^{(\ell)}) \] where \(a^{(\ell)}\) are called the pre-activations, and the output of the network is denoted by \(\hat{y} = h_\theta(x) = g_L(a^{(L)})\).

Note: Input data is typically stored as an \( N \times D \) design matrix, where each row corresponds to a data point and each column to a feature. This is referred to as structured data or tabular data. In contrast, for unstructured data such as images or text, different architectures are used, such as convolutional neural networks (CNNs) for images and recurrent neural networks (RNNs) or Transformers for sequential data like text.

Definition: Multilayer Perceptron

A multilayer perceptron (MLP) with \(L\) layers defines a parameterized function \[ f(x; \theta) = f_L \circ f_{L-1} \circ \cdots \circ f_1(x), \] where each layer computes \[ z^{(\ell)} = g_\ell\!\left(W^{(\ell)} z^{(\ell-1)} + b^{(\ell)}\right), \] with \(z^{(0)} = x\), weight matrices \(W^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell-1}}\), bias vectors \(b^{(\ell)} \in \mathbb{R}^{d_\ell}\), and nonlinear activation functions \(g_\ell\). The output layer uses an activation appropriate to the task (sigmoid for binary classification, softmax for multiclass, identity for regression).
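To make the layer recurrence concrete, here is a minimal NumPy sketch of the forward pass for a small 2-3-1 network. The layer sizes, the random weights, and the helper names (`mlp_forward`, `relu`) are illustrative assumptions, not part of the definition above:

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, params):
    """Compute z^(L) via z^(l) = g_l(W^(l) z^(l-1) + b^(l)), with z^(0) = x."""
    z = x
    for W, b, g in params:
        a = W @ z + b   # pre-activation a^(l)
        z = g(a)        # elementwise activation
    return z

rng = np.random.default_rng(0)
# A 2-3-1 network: one ReLU hidden layer, sigmoid output for binary classification
params = [
    (rng.normal(size=(3, 2)), np.zeros(3), relu),
    (rng.normal(size=(1, 3)), np.zeros(1), sigmoid),
]
x = np.array([0.5, -1.2])
y_hat = mlp_forward(x, params)   # a probability in (0, 1)
```

Note how the shapes chain together: \(W^{(1)} \in \mathbb{R}^{3 \times 2}\) maps the 2-dimensional input to 3 hidden units, and \(W^{(2)} \in \mathbb{R}^{1 \times 3}\) maps those to a single output.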

In particular, modern Large Language Models (LLMs) are built upon the Transformer architecture. The Transformer's Self-Attention mechanism has largely replaced RNNs in natural language processing because it allows for global dependencies to be modeled in parallel, bypassing the sequential bottleneck and vanishing gradient issues inherent in recurrent designs.

Are RNNs Outdated?

Although Transformers dominate high-compute tasks, "Recurrent" logic remains essential. Modern State Space Models (SSMs), such as Mamba, have revitalized this field by leveraging hardware-aware parallel algorithms for training while maintaining the efficient inference characteristic of RNNs.

  • Inference Efficiency:
Standard Transformers suffer from computational complexity that is quadratic in the sequence length \(n\), i.e. \(O(n^2)\). In contrast, SSMs maintain linear complexity \(O(n)\) and constant \(O(1)\) memory overhead per step during inference. This makes them ideal for embedded devices and real-time control systems where memory bandwidth is a bottleneck.
  • Infinite Horizons and State Compression:
    By updating a fixed-size latent state, recurrent systems can theoretically process continuous data streams indefinitely without the quadratic memory growth. However, it is important to note that while the physical context window is effectively infinite, the information density is constrained by the fixed state size, meaning the model must selectively forget or compress older information as the sequence progresses.

References:
[1] Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.
[2] Ravi, D., et al. (2021). Recurrent Neural Network for Human Activity Recognition in Embedded Systems. Electronics, 10(12), 1434.

Activation Functions

Without a non-linear activation function, a neural network composed of multiple layers would reduce to a single affine transformation. For example, a two-layer linear network computes: \[ f(x ; \theta) = W^{(2)}(W^{(1)} x + b^{(1)}) + b^{(2)} = (W^{(2)}W^{(1)})x + (W^{(2)}b^{(1)} + b^{(2)}) = \tilde{W}x + \tilde{b}. \] This composition remains an affine transformation of \( x \), and therefore is incapable of representing non-linear decision boundaries. Non-linear activation functions are necessary to break this linearity.
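This collapse is easy to verify numerically. The sketch below (with arbitrary random shapes) checks that two stacked linear layers compute exactly the same function as the single affine map \(\tilde{W}x + \tilde{b}\):

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked linear layers (no activation in between)...
two_layer = W2 @ (W1 @ x + b1) + b2

# ...equal a single affine map with W~ = W2 W1 and b~ = W2 b1 + b2
W_tilde = W2 @ W1
b_tilde = W2 @ b1 + b2
one_layer = W_tilde @ x + b_tilde

print(np.allclose(two_layer, one_layer))  # True
```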

Historically, a common choice was the sigmoid (logistic) activation function: \[ \sigma(a) = \frac{1}{1+e^{-a}}. \] However, sigmoid functions saturate for large positive or negative inputs: \( \sigma(a) \to 1 \) as \( a \to +\infty \), and \( \sigma(a) \to 0 \) as \( a \to -\infty \). In these regions, the gradient \(\sigma'(a) = \sigma(a)(1-\sigma(a))\) becomes very small (approaching zero), leading to the vanishing gradient problem. During backpropagation, gradients are multiplied layer by layer, so very small gradients in deeper layers become exponentially smaller as they propagate backward, making learning slow or unstable in deep networks.

To address this, modern networks often use the Rectified Linear Unit (ReLU): \[ g(a) = \max(0, a) = a \mathbb{I}(a>0) \] ReLU's derivative is 1 for positive inputs and 0 for negative inputs, avoiding the exponential decay of gradients that occurs with sigmoid functions. ReLU introduces non-linearity while preserving gradient magnitude for positive inputs. It is computationally simple and helps maintain gradient flow during training, which is why it is now a standard choice in modern neural network architectures.
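The contrast between the two gradients can be checked directly. The following sketch (helper names are illustrative) evaluates \(\sigma'(a)\) and the ReLU derivative at a few points:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_grad(a):
    s = sigmoid(a)
    return s * (1.0 - s)   # peaks at 0.25 when a = 0, vanishes as |a| grows

def relu_grad(a):
    return (np.asarray(a) > 0).astype(float)   # 1 for a > 0, else 0

for a in (0.0, 5.0, 10.0):
    print(a, sigmoid_grad(a), relu_grad(a))
# sigmoid'(10) is about 4.5e-5 while relu'(10) = 1, so a stack of saturated
# sigmoid layers multiplies many tiny factors together during backpropagation
```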

Note: Strictly speaking, the ReLU function is not differentiable at \( a = 0 \). In practice, optimization algorithms rely on the concept of subgradients, explicitly assigning a derivative of \( 0 \) (or sometimes \( 1 \)) at exactly \( a = 0 \).

Note: While ReLU solves the vanishing gradient problem, it can suffer from the "dying ReLU" problem where neurons become inactive (always output zero) and stop learning. This occurs when neurons consistently receive negative inputs, causing their gradients to be permanently zero.

Learning in Neural Networks

Training the network means finding parameters \( \theta = \{ \theta_\ell \}_{\ell=1}^L \), where \( \theta_\ell = \{ W^{(\ell)}, b^{(\ell)} \} \), that minimize the empirical risk (the average loss over the training data): \[ J(\theta) = \frac{1}{N} \sum_i \mathcal{L}(y_i, \hat{y}_i) \] where \(\hat{y}_i = h_{\theta}(x_i)\) is the network's prediction.

For binary classification, a common choice is the binary cross-entropy loss: \[ \mathcal{L}(y, \hat{y}) = -y \log(\hat{y}) - (1-y) \log(1-\hat{y}) \]
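As a quick numerical check of this loss (the helper `bce` is an illustrative sketch; the clipping is added to avoid \(\log 0\) for saturated predictions):

```python
import numpy as np

def bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy for a single prediction."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # guard against log(0)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

print(bce(1, 0.9))   # ~0.105: a confident, correct prediction is cheap
print(bce(1, 0.1))   # ~2.303: a confident, wrong prediction is expensive
```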

Optimization is performed with a gradient-based method that iteratively updates the parameters in the direction of the negative gradient: \[ \theta \leftarrow \theta - \alpha \nabla_{\theta} J(\theta) \] where \(\alpha\) is the learning rate.

Our demo employs mini-batch gradient descent, which computes gradients on a small random subset of the data at each iteration. This provides a good balance between computational efficiency and gradient quality, often leading to faster convergence and better generalization compared to using the entire dataset at once.
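A mini-batch loop typically reshuffles the data once per epoch and then steps through fixed-size slices. A minimal sketch (the batch size and the helper name `minibatch_indices` are illustrative):

```python
import numpy as np

def minibatch_indices(n, batch_size, rng):
    """Yield index batches covering all n points in a random order."""
    perm = rng.permutation(n)
    for start in range(0, n, batch_size):
        yield perm[start:start + batch_size]

rng = np.random.default_rng(0)
N, D = 1000, 5
X = rng.normal(size=(N, D))

batches = list(minibatch_indices(N, 128, rng))
for idx in batches:
    X_batch = X[idx]
    # here one would compute the gradient of J on X_batch only
    # and apply: theta <- theta - alpha * grad
```

With \(N = 1000\) and batch size 128, each epoch consists of 7 full batches plus one smaller final batch, and every data point is visited exactly once.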

The gradients are computed efficiently by the backpropagation algorithm. Backpropagation is an efficient application of the chain rule, starting from the gradient of the loss with respect to the output and working backwards. The algorithm computes all gradients in just two passes through the network:

Algorithm: BACKPROPAGATION
Consider an MLP with \(L\) layers and a loss function \(\mathcal{L}\).
Input: Data point \((x, y)\)

// Forward Pass
\(z^{(0)} = x\)
for \(\ell = 1 : L\) do
    \(a^{(\ell)} = W^{(\ell)} z^{(\ell-1)} + b^{(\ell)}\)
    \(z^{(\ell)} = g_\ell(a^{(\ell)})\)
\(\hat{y} = z^{(L)}\)
Compute loss \(\mathcal{L}(y, \hat{y})\)

// Backward Pass
\(u^{(L)} = \nabla_{\hat{y}} \mathcal{L}\)  // gradient of the loss w.r.t. the network output
for \(\ell = L : 1\) do
    \(\delta^{(\ell)} = u^{(\ell)} \odot g_\ell'(a^{(\ell)})\)  // elementwise multiplication
    \(\nabla_{W^{(\ell)}} \mathcal{L} = \delta^{(\ell)} (z^{(\ell-1)})^\top\)
    \(\nabla_{b^{(\ell)}} \mathcal{L} = \delta^{(\ell)}\)
    \(u^{(\ell-1)} = (W^{(\ell)})^\top \delta^{(\ell)}\)  // gradient w.r.t. the input of the current layer

Output: \(\{\nabla_{W^{(\ell)}} \mathcal{L},\; \nabla_{b^{(\ell)}} \mathcal{L}\}_{\ell=1}^L\)
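The algorithm translates almost line for line into NumPy. The sketch below assumes sigmoid activations at every layer and the binary cross-entropy loss; the function name `forward_backward` and the layer sizes are illustrative. Comparing the returned gradients against finite differences is a standard sanity check for backprop code:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_backward(x, y, Ws, bs):
    """One backprop pass for a sigmoid-activated MLP with BCE loss."""
    # Forward pass: cache pre-activations a^(l) and activations z^(l)
    zs, As = [x], []
    z = x
    for W, b in zip(Ws, bs):
        a = W @ z + b
        z = sigmoid(a)
        As.append(a)
        zs.append(z)
    y_hat = np.clip(zs[-1], 1e-12, 1.0 - 1e-12)   # guard against log(0)
    loss = float(np.sum(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)))

    # Backward pass, following the algorithm above
    u = (y_hat - y) / (y_hat * (1 - y_hat))   # u^(L) = dL/dy_hat for BCE
    dWs, dbs = [None] * len(Ws), [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        s = sigmoid(As[l])
        delta = u * s * (1.0 - s)             # delta^(l) = u^(l) * g'(a^(l))
        dWs[l] = np.outer(delta, zs[l])       # delta^(l) (z^(l-1))^T
        dbs[l] = delta
        u = Ws[l].T @ delta                   # gradient w.r.t. the layer input
    return loss, dWs, dbs
```

Perturbing a single weight by \(\pm\varepsilon\) and checking that \((\mathcal{L}(w+\varepsilon) - \mathcal{L}(w-\varepsilon))/2\varepsilon\) matches the corresponding entry of the returned gradient validates the implementation.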

Neural Networks Demo

This interactive demo showcases how a simple neural network can learn to classify non-linear patterns. You can generate datasets, tweak model parameters, and visualize the training process in real time.


Development of Deep Learning

The modern revolution in deep learning has been driven not only by algorithmic advances, but also by dramatic improvements in hardware—especially the rise of graphics processing units (GPUs). Originally designed to accelerate matrix-vector computations for real-time rendering in video games, GPUs turned out to be ideally suited for the linear algebra operations at the heart of neural networks.

In the early 2010s, researchers discovered that GPUs could speed up deep learning training by orders of magnitude compared to traditional CPUs. This enabled the training of large neural networks on large labeled datasets, like ImageNet, which led to breakthroughs in computer vision, speech recognition (converting spoken language to text), and broader natural language processing (NLP) tasks such as translation, summarization, and question answering. Today, GPUs are a core component in AI research and development, alongside other fields such as scientific computing, complex simulations, and even cryptocurrency mining.

Zooming out further, GPUs themselves rely on foundational advances in semiconductor technology. Semiconductors are materials whose conductivity can be precisely controlled, making them the backbone of all modern electronics — from GPUs and CPUs to memory chips and mobile devices. By using advanced fabrication techniques and nanometer-scale engineering, manufacturers can pack billions of transistors onto a single chip. This density of computation enables the incredible power of today's hardware and fuels the era of foundation models, including large language models (LLMs).

With the architecture and training procedure of neural networks established, a natural question arises: how exactly are the gradients \(\nabla_\theta J(\theta)\) computed for arbitrary computational graphs? In Part 5: Automatic Differentiation, we generalize backpropagation beyond sequential layers to the full framework of reverse-mode automatic differentiation on directed acyclic graphs.