Neural Networks Basics

Multilayer Perceptron (MLP) Activation Functions Learning in Neural Networks Neural Networks Demo Development of Deep Learning

Multilayer Perceptron (MLP)

In Part 3, we saw that logistic regression defines a linear decision boundary and that the kernel trick can extend this to nonlinear settings, but requires choosing a fixed feature map. Neural networks take a fundamentally different approach: they learn the feature representation jointly with the classifier by composing many simple, differentiable transformations. The resulting gradients are computed via the Chain Rule, which we formalize on this page as the backpropagation algorithm.

The key idea of deep neural networks (DNNs) is "composing" a vast number of simple functions to make a huge complex function. We focus on a specific type of DNN known as the multilayer perceptron (MLP), also referred to as a feedforward neural network (FFNN).

An MLP defines a composite function of the form: \[ f(x ; \theta) = f_L (f_{L-1}(\cdots(f_1(x))\cdots)) \] where each component function \( f_\ell(x) = f(x; \theta_\ell) \) represents the transformation at layer \( \ell \,\), \( x \in \mathbb{R}^D \) is an input vector with \( D \) features, and \(\theta\) is a collection of parameters (weights and biases): \[ \theta = \{ \theta_\ell \}_{\ell=1}^L \text{, where } \theta_\ell = \{ W^{(\ell)}, b^{(\ell)} \}. \]

Each layer is assumed to be differentiable and consists of two operations: an affine transformation followed by a non-linear differentiable activation function \( g_\ell : \mathbb{R} \to \mathbb{R}\). An MLP consists of an input layer, one or more hidden layers, and an output layer.

We define the hidden units \(z^{(\ell)}\) at each layer \(\ell\) by applying the activation elementwise to an affine transformation of the previous layer's units: \[ z^{(\ell)} = g_{\ell}(b^{(\ell)} + W^{(\ell)}z^{(\ell -1)}) = g_{\ell}(a^{(\ell)}) \] where \(a^{(\ell)}\) are called the pre-activations, and the output of the network is denoted by \(\hat{y} = h_\theta(x) = g_L(a^{(L)})\).

Note: Input data is typically stored as an \( N \times D \) design matrix, where each row corresponds to a data point and each column to a feature. This is referred to as structured data or tabular data. In contrast, for unstructured data such as images or text, different architectures are used, such as convolutional neural networks (CNNs) for images and recurrent neural networks (RNNs) or Transformers for sequential data like text.

Definition: Multilayer Perceptron

A multilayer perceptron (MLP) with \(L\) layers defines a parameterized function \[ f(x; \theta) = f_L \circ f_{L-1} \circ \cdots \circ f_1(x), \] where each layer computes \[ z^{(\ell)} = g_\ell\!\left(W^{(\ell)} z^{(\ell-1)} + b^{(\ell)}\right), \] with \(z^{(0)} = x\), weight matrices \(W^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell-1}}\), bias vectors \(b^{(\ell)} \in \mathbb{R}^{d_\ell}\), and nonlinear activation functions \(g_\ell\). The output layer uses an activation appropriate to the task (sigmoid for binary classification, softmax for multiclass, identity for regression).
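To make the layer recurrence concrete, here is a minimal NumPy sketch of the forward pass for a small 2-3-1 network. The layer sizes, the random weights, and the helper names (`mlp_forward`, `relu`) are illustrative assumptions, not part of the definition above:

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, params):
    """Compute z^(L) via z^(l) = g_l(W^(l) z^(l-1) + b^(l)), with z^(0) = x."""
    z = x
    for W, b, g in params:
        a = W @ z + b   # pre-activation a^(l)
        z = g(a)        # elementwise activation
    return z

rng = np.random.default_rng(0)
# A 2-3-1 network: one ReLU hidden layer, sigmoid output for binary classification
params = [
    (rng.normal(size=(3, 2)), np.zeros(3), relu),
    (rng.normal(size=(1, 3)), np.zeros(1), sigmoid),
]
x = np.array([0.5, -1.2])
y_hat = mlp_forward(x, params)   # a probability in (0, 1)
```

Note how the shapes chain together: \(W^{(1)} \in \mathbb{R}^{3 \times 2}\) maps the 2-dimensional input to 3 hidden units, and \(W^{(2)} \in \mathbb{R}^{1 \times 3}\) maps those to a single output.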

In particular, modern Large Language Models (LLMs) are built upon the Transformer architecture. The Transformer's Self-Attention mechanism has largely replaced RNNs in natural language processing because it allows for global dependencies to be modeled in parallel, bypassing the sequential bottleneck and vanishing gradient issues inherent in recurrent designs.

Are RNNs Outdated?

Although Transformers dominate high-compute tasks, "Recurrent" logic remains essential. Modern State Space Models (SSMs), such as Mamba, have revitalized this field by leveraging hardware-aware parallel algorithms for training while maintaining the efficient inference characteristic of RNNs.

  • Inference Efficiency:
Standard Transformers suffer from computational complexity that is quadratic in the sequence length \(n\), i.e. \(O(n^2)\). In contrast, SSMs maintain linear complexity \(O(n)\) and constant \(O(1)\) memory overhead per step during inference. This makes them ideal for embedded devices and real-time control systems where memory bandwidth is a bottleneck.
  • Infinite Horizons and State Compression:
    By updating a fixed-size latent state, recurrent systems can theoretically process continuous data streams indefinitely without the quadratic memory growth. However, it is important to note that while the physical context window is effectively infinite, the information density is constrained by the fixed state size, meaning the model must selectively forget or compress older information as the sequence progresses.

References:
[1] Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.
[2] Ravi, D., et al. (2021). Recurrent Neural Network for Human Activity Recognition in Embedded Systems. Electronics, 10(12), 1434.

Activation Functions

Without a non-linear activation function, a neural network composed of multiple layers would reduce to a single affine transformation. For example, a two-layer linear network computes: \[ f(x ; \theta) = W^{(2)}(W^{(1)} x + b^{(1)}) + b^{(2)} = (W^{(2)}W^{(1)})x + (W^{(2)}b^{(1)} + b^{(2)}) = \tilde{W}x + \tilde{b}. \] This composition remains an affine transformation of \( x \), and therefore is incapable of representing non-linear decision boundaries. Non-linear activation functions are necessary to break this linearity.
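This collapse is easy to verify numerically. The sketch below (with arbitrary random shapes) checks that two stacked linear layers compute exactly the same function as the single affine map \(\tilde{W}x + \tilde{b}\):

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked linear layers (no activation in between)...
two_layer = W2 @ (W1 @ x + b1) + b2

# ...equal a single affine map with W~ = W2 W1 and b~ = W2 b1 + b2
W_tilde = W2 @ W1
b_tilde = W2 @ b1 + b2
one_layer = W_tilde @ x + b_tilde

print(np.allclose(two_layer, one_layer))  # True
```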

Historically, a common choice was the sigmoid (logistic) activation function: \[ \sigma(a) = \frac{1}{1+e^{-a}}. \] However, sigmoid functions saturate for large positive or negative inputs: \( \sigma(a) \to 1 \) as \( a \to +\infty \), and \( \sigma(a) \to 0 \) as \( a \to -\infty \). In these regions, the gradient \(\sigma'(a) = \sigma(a)(1-\sigma(a))\) becomes very small (approaching zero), leading to the vanishing gradient problem. During backpropagation, gradients are multiplied layer by layer, so very small gradients in deeper layers become exponentially smaller as they propagate backward, making learning slow or unstable in deep networks.

To address this, modern networks often use the Rectified Linear Unit (ReLU): \[ g(a) = \max(0, a) = a \mathbb{I}(a>0) \] ReLU's derivative is 1 for positive inputs and 0 for negative inputs, avoiding the exponential decay of gradients that occurs with sigmoid functions. ReLU introduces non-linearity while preserving gradient magnitude for positive inputs. It is computationally simple and helps maintain gradient flow during training, which is why it is now a standard choice in modern neural network architectures.
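The contrast between the two gradients can be checked directly. The following sketch (helper names are illustrative) evaluates \(\sigma'(a)\) and the ReLU derivative at a few points:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_grad(a):
    s = sigmoid(a)
    return s * (1.0 - s)   # peaks at 0.25 when a = 0, vanishes as |a| grows

def relu_grad(a):
    return (np.asarray(a) > 0).astype(float)   # 1 for a > 0, else 0

for a in (0.0, 5.0, 10.0):
    print(a, sigmoid_grad(a), relu_grad(a))
# sigmoid'(10) is about 4.5e-5 while relu'(10) = 1, so a stack of saturated
# sigmoid layers multiplies many tiny factors together during backpropagation
```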

Note: Strictly speaking, the ReLU function is not differentiable at \( a = 0 \). In practice, optimization algorithms rely on the concept of subgradients, explicitly assigning a derivative of \( 0 \) (or sometimes \( 1 \)) at exactly \( a = 0 \).

Note: While ReLU solves the vanishing gradient problem, it can suffer from the "dying ReLU" problem where neurons become inactive (always output zero) and stop learning. This occurs when neurons consistently receive negative inputs, causing their gradients to be permanently zero.

Learning in Neural Networks

Training the network means finding parameters \( \theta = \{ \theta_\ell \}_{\ell=1}^L \), where \( \theta_\ell = \{ W^{(\ell)}, b^{(\ell)} \} \), that minimize the empirical risk (the average loss over the training data): \[ J(\theta) = \frac{1}{N} \sum_i \mathcal{L}(y_i, \hat{y}_i) \] where \(\hat{y}_i = h_{\theta}(x_i)\) is the network's prediction.

For binary classification, a common choice is the binary cross-entropy loss: \[ \mathcal{L}(y, \hat{y}) = -y \log(\hat{y}) - (1-y) \log(1-\hat{y}) \]
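As a quick numerical check of this loss (the helper `bce` is an illustrative sketch; the clipping is added to avoid \(\log 0\) for saturated predictions):

```python
import numpy as np

def bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy for a single prediction."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # guard against log(0)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

print(bce(1, 0.9))   # ~0.105: a confident, correct prediction is cheap
print(bce(1, 0.1))   # ~2.303: a confident, wrong prediction is expensive
```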

Optimization is performed with a gradient-based method that iteratively updates the parameters in the direction of the negative gradient: \[ \theta \leftarrow \theta - \alpha \nabla_{\theta} J(\theta) \] where \(\alpha\) is the learning rate.

Our demo employs mini-batch gradient descent, which computes gradients on a small random subset of the data at each iteration. This provides a good balance between computational efficiency and gradient quality, often leading to faster convergence and better generalization compared to using the entire dataset at once.
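A mini-batch loop typically reshuffles the data once per epoch and then steps through fixed-size slices. A minimal sketch (the batch size and the helper name `minibatch_indices` are illustrative):

```python
import numpy as np

def minibatch_indices(n, batch_size, rng):
    """Yield index batches covering all n points in a random order."""
    perm = rng.permutation(n)
    for start in range(0, n, batch_size):
        yield perm[start:start + batch_size]

rng = np.random.default_rng(0)
N, D = 1000, 5
X = rng.normal(size=(N, D))

batches = list(minibatch_indices(N, 128, rng))
for idx in batches:
    X_batch = X[idx]
    # here one would compute the gradient of J on X_batch only
    # and apply: theta <- theta - alpha * grad
```

With \(N = 1000\) and batch size 128, each epoch consists of 7 full batches plus one smaller final batch, and every data point is visited exactly once.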

The gradients are computed efficiently by the backpropagation algorithm. Backpropagation is an efficient application of the chain rule, starting from the gradient of the loss with respect to the output and working backwards. The algorithm computes all gradients in just two passes through the network:

Algorithm: BACKPROPAGATION
Consider an MLP with \(L\) layers and a loss function \(\mathcal{L}\).
Input: Data point \((x, y)\)

// Forward Pass
\(z^{(0)} = x\)
for \(\ell = 1 : L\) do
    \(a^{(\ell)} = W^{(\ell)} z^{(\ell-1)} + b^{(\ell)}\)
    \(z^{(\ell)} = g_\ell(a^{(\ell)})\)
\(\hat{y} = z^{(L)}\)
Compute loss \(\mathcal{L}(y, \hat{y})\)

// Backward Pass
\(u^{(L)} = \nabla_{\hat{y}} \mathcal{L}\)  // gradient of the loss w.r.t. the network output
for \(\ell = L : 1\) do
    \(\delta^{(\ell)} = u^{(\ell)} \odot g_\ell'(a^{(\ell)})\)  // elementwise multiplication
    \(\nabla_{W^{(\ell)}} \mathcal{L} = \delta^{(\ell)} (z^{(\ell-1)})^\top\)
    \(\nabla_{b^{(\ell)}} \mathcal{L} = \delta^{(\ell)}\)
    \(u^{(\ell-1)} = (W^{(\ell)})^\top \delta^{(\ell)}\)  // gradient w.r.t. the input of the current layer

Output: \(\{\nabla_{W^{(\ell)}} \mathcal{L},\; \nabla_{b^{(\ell)}} \mathcal{L}\}_{\ell=1}^L\)
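The algorithm translates almost line for line into NumPy. The sketch below assumes sigmoid activations at every layer and the binary cross-entropy loss; the function name `forward_backward` and the layer sizes are illustrative. Comparing the returned gradients against finite differences is a standard sanity check for backprop code:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_backward(x, y, Ws, bs):
    """One backprop pass for a sigmoid-activated MLP with BCE loss."""
    # Forward pass: cache pre-activations a^(l) and activations z^(l)
    zs, As = [x], []
    z = x
    for W, b in zip(Ws, bs):
        a = W @ z + b
        z = sigmoid(a)
        As.append(a)
        zs.append(z)
    y_hat = np.clip(zs[-1], 1e-12, 1.0 - 1e-12)   # guard against log(0)
    loss = float(np.sum(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)))

    # Backward pass, following the algorithm above
    u = (y_hat - y) / (y_hat * (1 - y_hat))   # u^(L) = dL/dy_hat for BCE
    dWs, dbs = [None] * len(Ws), [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        s = sigmoid(As[l])
        delta = u * s * (1.0 - s)             # delta^(l) = u^(l) * g'(a^(l))
        dWs[l] = np.outer(delta, zs[l])       # delta^(l) (z^(l-1))^T
        dbs[l] = delta
        u = Ws[l].T @ delta                   # gradient w.r.t. the layer input
    return loss, dWs, dbs
```

Perturbing a single weight by \(\pm\varepsilon\) and checking that \((\mathcal{L}(w+\varepsilon) - \mathcal{L}(w-\varepsilon))/2\varepsilon\) matches the corresponding entry of the returned gradient validates the implementation.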

Neural Networks Demo

This interactive demo showcases how a simple neural network can learn to classify non-linear patterns. You can generate datasets, tweak model parameters, and visualize the training process in real time.


Development of Deep Learning

The modern revolution in deep learning has been driven not only by algorithmic advances, but also by dramatic improvements in hardware—especially the rise of graphics processing units (GPUs). Originally designed to accelerate matrix-vector computations for real-time rendering in video games, GPUs turned out to be ideally suited for the linear algebra operations at the heart of neural networks.

In the early 2010s, researchers discovered that GPUs could speed up deep learning training by orders of magnitude compared to traditional CPUs. This enabled the training of large neural networks on large labeled datasets, like ImageNet, which led to breakthroughs in computer vision, speech recognition (converting spoken language to text), and broader natural language processing (NLP) tasks such as translation, summarization, and question answering. Today, GPUs are a core component in AI research and development, alongside other fields such as scientific computing, complex simulations, and even cryptocurrency mining.

Zooming out further, GPUs themselves rely on foundational advances in semiconductor technology. Semiconductors are materials whose conductivity can be precisely controlled, making them the backbone of all modern electronics — from GPUs and CPUs to memory chips and mobile devices. By using advanced fabrication techniques and nanometer-scale engineering, manufacturers can pack billions of transistors onto a single chip. This density of computation enables the incredible power of today's hardware and fuels the era of foundation models, including large language models (LLMs).

With the architecture and training procedure of neural networks established, a natural question arises: how exactly are the gradients \(\nabla_\theta J(\theta)\) computed for arbitrary computational graphs? In Part 5: Automatic Differentiation, we generalize backpropagation beyond sequential layers to the full framework of reverse-mode automatic differentiation on directed acyclic graphs.