Intro to Classification

Binary Logistic Regression Binary Logistic Regression Demo Kernel Methods RBF Kernel & Random Fourier Features Multinomial logistic regression

Binary Logistic Regression

In Part 2, we modeled continuous-valued targets \(y \in \mathbb{R}\) using regularized linear regression. We now turn to the classification setting, where the output is a discrete label. The key shift is that instead of predicting a value directly, we model the probability that an input belongs to each class, connecting maximum likelihood estimation with the cross-entropy loss.

Consider Consider a discriminative classification model: \[ p(y \mid x, \theta) \] where \(x \in \mathbb{R}^D\) is an input vector, \(y \in \{1, \cdots, C\}\) is the class label, and \(\theta\) are the parameters.

Let \(y \in \{0, 1\}\). Then the model is known as binary logistic regression, which can be represented by: \[ p(y \mid x, \theta) = \text{Ber }(y \mid \sigma(w^\top x + b)) \] where \(\sigma\) is the sigmoid (logistic) function, \(w \in \mathbb{R}^D\) is the weight vector, \(b \in \mathbb{R}\) is a bias, and \(\theta = (w, b)\) are the model parameters. The notation \(\text{Ber}(y \mid p)\) denotes the Bernoulli distribution, defined as: \[ \text{Ber}(y \mid p) = p^y (1 - p)^{1 - y}, \quad y \in \{0, 1\},\; p \in [0, 1]. \]

In other words, \[ p(y = 1 \mid x, \theta) = \sigma(a) = \frac{1}{1 + e^{-a}} \] where \(a\) is called the logit(or the pre-activation): \[ a = w^\top x + b = \log \left(\frac{p}{1-p}\right), \quad \text{with } p = p(y = 1 \mid x, \theta). \]

The sigmoid function outputs the probability that the class label is \(y = 1\). Therefore, we define a decision rule such that we predict \(y = 1\) if and only if class 1 is more likely than class 0. That is, \[ \begin{align*} f(x) &= \mathbb{I}(p(y=1 \mid x) > p(y=0 \mid x)) \\\\ &= \mathbb{I} \left( \log \frac{p(y = 1 \mid x)}{p(y =0 \mid x)} > 0 \right) \\\\ &= \mathbb{I}(a > 0). \end{align*} \] Here, \(\mathbb{I}(\cdot)\) denotes the indicator function, which returns 1 if the condition inside is true, and 0 otherwise: \[ \mathbb{I}(\text{condition}) = \begin{cases} 1 & \text{if condition is true,} \\ 0 & \text{otherwise.} \end{cases} \]

Therefore, the prediction function can be written as: \[ f(x ; \theta) = w^\top x + b. \] Since \(w^\top x\) is the inner product between the weight vector \(w\) and the feature vector \(x\), this function defines a linear hyperplane with normal vector \(w\) and an offset \(b\) from the origin. In other words, the weight vector \(w\) determines the "orientation" of the decision boundary, and the bias term \(b\) shifts the boundary. Moreover, the norm \(\|w\|\) controls the "steepness" of the sigmoid function - a larger norm results in a sharper transition between predicted classes, which reflects greater confidence in predictions.

To estimate the parameters \(w\), we maximize the likelihood of the observed dataset \(\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N\). Equivalently, we minimize the negative log-likelihood (NLL): \[ \begin{align*} \mathrm{NLL }(w) &= - \frac{1}{N} \log p(\mathcal{D} \mid w) \\\\ &= - \frac{1}{N} \log \prod_{n =1}^N \text{Ber }(y_n \mid \mu_n) \\\\ &= - \frac{1}{N} \sum_{n=1}^N \log \left[ \mu_n^{y_n} (1-\mu_n)^{1-y_n} \right] \\\\ &= - \frac{1}{N} \sum_{n=1}^N \left[ y_n \log \mu_n + (1-y_n) \log (1-\mu_n) \right] \\\\ &= - \frac{1}{N} \sum_{n=1}^N \mathbb{H}_{ce} (y_n , \mu_n) \end{align*} \] where \(\mu_n = \sigma(a_n) = \sigma(w^\top x_n)\) is the model's predicted probability of class 1, and \(\mathbb{H}_{ce}(y_n, \mu_n)\) is the binary cross entropy between \(y_n\) and \(\mu_n\).

We then find the maximum likelihood estimate (MLE) by solving: \[ \nabla_w \text{NLL }(w) = 0 \] via gradient-based optimization (in practice, using automatic differentiation).

The gradient of the NLL with respect to \(w\) is \[ \begin{align*} \nabla_w \text{NLL }(w) &= - \frac{1}{N} \sum_{n=1}^N \left[ y_n (1-\mu_n) x_n - (1-y_n) \mu_n x_n \right] \\\\ &= \frac{1}{N} \sum_{n=1}^N (\mu_n - y_n) x_n, \\\\ \end{align*} \] Equivalently, in matrix form: \[ \nabla_w \text{NLL }(w) = \frac{1}{N} \left( \mathbf{1}_N^\top (\text{diag }(\mu - y)X)\right)^\top. \] where \(X \in \mathbb{R}^{N \times D}\) is the design matrix whose \(n\)th row is \(x_n^\top\), and \(\mu, y\in\mathbb{R}^N\) are the vectors of \(\mu_n\) and \(y_n\), respectively. Now, it is clear that each \(x_n\) is being weighted by its residual \(\mu_n - y_n\).

In practice, we never build an \(N \times N\) diagonal matrix. Instead, we simply computes \[ \nabla_w \text{NLL }(w) = \frac{1}{N} X^\top (\mu - y) \] which is both more memory- and compute-efficient.

Derivation of the gradient of \(\text{NLL }(w)\):

Recall the sigmoid function: \[ \sigma(a) = \frac{1}{1 + e^{-a}}. \] We compute its derivative with respect to \(a\): \[ \begin{aligned} \frac{d\sigma}{da} &= \frac{d}{da}\,(1 + e^{-a})^{-1} \\\\\ &= -1 \cdot (1 + e^{-a})^{-2} \;\bigl(-e^{-a}\bigr) \\\\ &= \frac{e^{-a}}{(1 + e^{-a})^2} \\\\ &= \left( \frac{1}{1 + e^{-a}}\right) \left(\frac{e^{-a}}{1 + e^{-a}}\right) \\\\ &= \sigma(a)\,\bigl(1 - \sigma(a)\bigr). \end{aligned} \]

Since \(\mu_n = \sigma(a_n)\), we conclude: \[ \frac{\partial \mu_n}{\partial a_n} = \mu_n\,(1 - \mu_n). \]

For a single data point \((x_n,y_n)\), let \[ \ell_n = -\bigl[y_n\log\mu_n + (1-y_n)\log(1-\mu_n)\bigr]. \] We compute: \[ \frac{\partial \ell_n}{\partial \mu_n} = -\Bigl(\frac{y_n}{\mu_n} - \frac{1 - y_n}{1 - \mu_n}\Bigr) = \frac{\mu_n - y_n}{\mu_n (1 -\mu_n)}. \] By chain rule, \[ \begin{aligned} \frac{\partial \ell_n}{\partial a_n} &= \frac{\partial \ell_n}{\partial \mu_n} \cdot \frac{\partial \mu_n}{\partial a_n} \\\\ &= \frac{\mu_n - y_n}{\mu_n (1 -\mu_n)} \cdot \mu_n\,(1 - \mu_n) \\\\ &= \mu_n - y_n. \end{aligned} \] Also, since \(a_n = w^\top x_n + b\), \[ \frac{\partial a_n}{\partial w} = x_n. \] Now combine all of them: \[ \frac{\partial \ell_n}{\partial w} = \frac{\partial \ell_n}{\partial a_n} \cdot \frac{\partial a_n}{\partial w} = (\mu_n - y_n)x_n. \] Finally, summing over all \(N\) examples and dividing by \(N\) gives \[ \nabla_w \mathrm{NLL}(w) = \frac{1}{N}\sum_{n=1}^N (\mu_n - y_n)\,x_n. \]

To verify convexity, we examine the Hessian: \[ \begin{align*} H(w) &=\nabla_w^2 \text{NLL }(w) \\\\ &= \frac{1}{N} \sum_{n=1}^N (\mu_n( 1 - \mu_n)x_n) x_n^\top \\\\ &= \frac{1}{N} X^\top S X \end{align*} \] where \(S = \text{diag }(\mu_1(1-\mu_1), \cdots, \mu_N(1-\mu_N))\).

For any \(v \in \mathbb{R}^D\), \[ \begin{align*} v^\top H(w) v &= \frac{1}{N} v^\top X^\top S X v \\ &= \frac{1}{N} (v^\top X^\top S^{\frac{1}{2}})(S^{\frac{1}{2}} X v) \\ &= \frac{1}{N} \| S^{\frac{1}{2}} X v \|_2^2 \geq 0. \end{align*} \] Thus, the Hessian \(H(w)\) is positive semi-definite, confirming that the NLL is convex. This ensures that any stationary point found is a global optimum.

Note that \(H(w)\) is strictly positive definite only if the design matrix \(X\) has full column rank. In practice, when \(X\) is not full rank, or when \(\mu_n\) is very close to 0 or 1 (near-perfect separation), the corresponding diagonal entries of \(S\) become near-zero, making \(H(w)\) ill-conditioned or singular. Adding an \(\ell_2\) regularization term \(\frac{\lambda}{2}\|w\|_2^2\) shifts the Hessian to a strictly positive definite \(H(w) + \lambda I\), restoring numerical stability and ensuring a unique solution - exactly as ridge regression does for the linear case.

Binary Logistic Regression Demo

Given a linear scoring function: \[ z = w_0 + w_1 x_1 + w_2 x_2 \] and the predicted probability: \[ \sigma(z) = \frac{1}{1 + e^{-z}}, \] we minimize the average binary cross-entropy plus an \(\ell_2\) penalty on the weights (excluding the bias): \[ \mathrm{NLL }(w) = - \frac{1}{N} \sum_{n=1}^N \left[ y_n \log \sigma(z_n) + (1-y_n) \log (1-\sigma(z_n)) \right] + \frac{\lambda}{2} \sum_{j=1}^2 w_j^2. \] Note: The learning rate controls the step size in each gradient descent update.

Kernel Methods

In many problems the original feature vectors \(x\in\mathbb{R}^D\) are not linearly separable. One fix is to apply a nonlinear feature mapping: \[ \phi(x)\colon\mathbb{R}^D\to\mathbb{R}^P,\quad P\gg D, \] hoping that \(\{\phi(x_n)\}\) become separable. But explicit mappings can blow up model complexity and lead to overfitting, even with regularization.

Alternatively, if a learning algorithm depends only on inner products \(\langle \phi(x),\phi(x')\rangle\), we can invoke the kernel trick: replace each dot-product with a kernel function \(K(x,x')\) that computes \(\langle\phi(x),\phi(x')\rangle\) implicitly—no \(\phi\) ever constructed.

Definition: Kernel Function A function \[ K : \mathbb{R}^D \times \mathbb{R}^D \;\to\; \mathbb{R} \] is a kernel for some feature map \(\phi\) if \[ \forall\,x,x',\quad K(x,x') = \langle \phi(x),\,\phi(x')\rangle. \] In practice \(K\) must also be symmetric and positive semi-definite (Mercer's condition).

Concretely, we build the kernel matrix \[ K_{ij} = K(x_i,x_j),\quad i,j=1,\dots,N, \] and run our algorithm using \(\{K_{ij}\}\) instead of \(\{\phi(x_i)\}\).

For large \(N\) (e.g. \(>10^5\)), storing \(N\times N\) kernels is infeasible. Two common remedies are:

RBF Kernel & Random Fourier Features

Among the many valid kernel functions, one of the most widely used is the RBF (Gaussian) kernel: \[ K_{\mathrm{RBF}}(x,x') = \exp\Bigl(-\tfrac{\|x-x'\|^2}{2\sigma^2}\Bigr), \] which corresponds to an infinite-dimensional feature map.
Note: RBF stands for "Radial Basis Function."
(See: Gaussian processes for more general parameterization.)

By Bochner's theorem, any shift-invariant kernel admits the representation \[ K_{\mathrm{RBF}}(x,x') = \mathbb{E}_{\omega\sim\mathcal{N}(0,\sigma^{-2}I)}\bigl[\cos(\omega^\top(x-x'))\bigr]. \]

To approximate, sample \(D'\) frequencies \(\{\omega_k\}\) and phases \(\{b_k\}\sim\mathrm{Uniform}[0,2\pi]\), then define \[ \phi_{\mathrm{RFF}}(x) = \sqrt{\frac{2}{D'}}\, \bigl[\cos(\omega_1^\top x + b_1),\,\dots,\,\cos(\omega_{D'}^\top x + b_{D'})\bigr]\in\mathbb{R}^{D'}. \]

One shows \[ \phi_{\mathrm{RFF}}(x)^\top \phi_{\mathrm{RFF}}(x') \;\approx\; K_{\mathrm{RBF}}(x,x'), \] so we can train a regularized linear model on these \(D'\)-dim features in \(O(ND')\) time, avoiding the \(O(N^2)\) kernel matrix.

Multinomial Logistic Regression

When there are more than two classes (\(C > 2\)), we generalize the sigmoid to the softmax function.

Definition: Softmax Function

For inputs \(x \in \mathbb{R}^D\) and class-specific weights \(\{w_c, b_c\}_{c=1}^C\), the softmax function defines a probability distribution over \(C\) classes: \[ p(y = c \mid x, \theta) = \frac{\exp(w_c^\top x + b_c)} {\sum_{c'=1}^C \exp(w_{c'}^\top x + b_{c'})}, \quad c = 1, \ldots, C. \] When \(C = 2\), this reduces to the sigmoid function \(\sigma(a)\) from binary logistic regression.

We fit \(\theta=\{w_c,b_c\}\) by minimizing the average cross-entropy loss \[ \mathrm{NLL}(\theta) = -\frac{1}{N}\sum_{n=1}^N \sum_{c=1}^C \mathbb{I}\{y_n=c\}\,\log p(y_n=c\,|\,x_n,\theta). \]

The gradient has a similarly clean form: \[ \frac{\partial\mathrm{NLL}}{\partial w_c} = \frac{1}{N}\sum_{n=1}^N \Bigl[p(y_n=c\,|\,x_n)-\mathbb{I}\{y_n=c\}\Bigr]\,x_n, \] and you can optimize it via batch or (mini-batch) gradient descent.

The Hessian is block-structured but remains positive semi-definite, ensuring convexity of the multiclass NLL.

Logistic regression - both binary and multinomial - is a linear classifier: the decision boundaries are hyperplanes in the original feature space. The kernel trick extends this to nonlinear boundaries without explicitly constructing high-dimensional features. In Part 4: Neural Networks Basics, we take a different approach entirely: instead of choosing a fixed feature map \(\phi(x)\), we learn the representation jointly with the classifier, stacking layers of linear transformations and nonlinear activations to build expressive models from simple building blocks.