Intro to Classification

Binary Logistic Regression

In the regularized regression page, we modeled continuous-valued targets \(y \in \mathbb{R}\) using regularized linear regression. We now turn to the classification setting, where the output is a discrete label. The key shift is that instead of predicting a value directly, we model the probability that an input belongs to each class, connecting maximum likelihood estimation with the cross-entropy loss.

Consider a discriminative classification model \[ p(y \mid \mathbf{x}, \boldsymbol{\theta}), \] where \(\mathbf{x} \in \mathbb{R}^D\) is an input vector, \(y \in \{1, \ldots, C\}\) is the class label, and \(\boldsymbol{\theta}\) collects the model parameters.

Definition: Sigmoid Function and Binary Logistic Regression

The sigmoid (logistic) function \(\sigma : \mathbb{R} \to (0, 1)\) is defined by \[ \sigma(a) = \frac{1}{1 + e^{-a}}. \] For binary labels \(y \in \{0, 1\}\), binary logistic regression is the model \[ p(y \mid \mathbf{x}, \boldsymbol{\theta}) = \mathrm{Ber}\!\bigl(y \mid \sigma(\mathbf{w}^\top \mathbf{x} + b)\bigr), \] with \(\mathbf{w} \in \mathbb{R}^D\) the weight vector, \(b \in \mathbb{R}\) the bias, and \(\boldsymbol{\theta} = (\mathbf{w}, b)\) the parameters. The Bernoulli likelihood is \(\mathrm{Ber}(y \mid p) = p^y(1-p)^{1-y}\) for \(y \in \{0, 1\}\) and \(p \in [0, 1]\).

Equivalently, \[ p(y = 1 \mid \mathbf{x}, \boldsymbol{\theta}) = \sigma(a), \] where \(a\) is called the logit (or the pre-activation): \[ a = \mathbf{w}^\top \mathbf{x} + b = \log\!\left(\frac{p}{1-p}\right), \quad p = p(y = 1 \mid \mathbf{x}, \boldsymbol{\theta}). \]

The sigmoid function outputs the probability that the class label is \(y = 1\). Therefore, we define a decision rule that predicts \(y = 1\) if and only if class 1 is more likely than class 0: \[ \begin{align*} f(\mathbf{x}) &= \mathbb{1}\{p(y=1 \mid \mathbf{x}) > p(y=0 \mid \mathbf{x})\} \\ &= \mathbb{1}\!\left\{ \log \frac{p(y = 1 \mid \mathbf{x})}{p(y = 0 \mid \mathbf{x})} > 0 \right\} \\ &= \mathbb{1}\{a > 0\}. \end{align*} \] Here, \(\mathbb{1}\{\cdot\}\) denotes the indicator function, which returns 1 if the condition inside is true, and 0 otherwise: \[ \mathbb{1}\{\text{condition}\} = \begin{cases} 1 & \text{if condition is true,} \\ 0 & \text{otherwise.} \end{cases} \]

Therefore, the underlying score function can be written as \[ s(\mathbf{x}; \boldsymbol{\theta}) = \mathbf{w}^\top \mathbf{x} + b, \] so that \(a = s(\mathbf{x}; \boldsymbol{\theta})\) and \(f(\mathbf{x}) = \mathbb{1}\{s(\mathbf{x}; \boldsymbol{\theta}) > 0\}\). Since \(\mathbf{w}^\top \mathbf{x}\) is the inner product between the weight vector \(\mathbf{w}\) and the feature vector \(\mathbf{x}\), \(s\) defines a linear hyperplane with normal vector \(\mathbf{w}\) and offset \(b\) from the origin. The weight vector \(\mathbf{w}\) determines the orientation of the decision boundary, and the bias \(b\) shifts the boundary. Moreover, the norm \(\|\mathbf{w}\|\) controls the steepness of the sigmoid - a larger norm yields a sharper transition between predicted classes, reflecting greater confidence in predictions.

To estimate the parameters \(\mathbf{w}\), we maximize the likelihood of the observed dataset \(\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^N\). Equivalently, we minimize the negative log-likelihood (NLL): \[ \begin{align*} \mathrm{NLL}(\mathbf{w}) &= - \frac{1}{N} \log p(\mathcal{D} \mid \mathbf{w}) \\ &= - \frac{1}{N} \log \prod_{n=1}^N \mathrm{Ber}(y_n \mid \mu_n) \\ &= - \frac{1}{N} \sum_{n=1}^N \log\!\left[ \mu_n^{y_n} (1-\mu_n)^{1-y_n} \right] \\ &= - \frac{1}{N} \sum_{n=1}^N \left[ y_n \log \mu_n + (1-y_n) \log(1-\mu_n) \right] \\ &= \frac{1}{N} \sum_{n=1}^N \mathbb{H}_{ce}(y_n, \mu_n), \end{align*} \] where \(\mu_n = \sigma(a_n) = \sigma(\mathbf{w}^\top \mathbf{x}_n)\) is the model's predicted probability of class 1, and \(\mathbb{H}_{ce}(y_n, \mu_n)\) is the binary specialization of cross entropy between \(y_n\) and \(\mu_n\).

We then find the maximum likelihood estimate (MLE) by solving \[ \nabla_{\mathbf{w}} \mathrm{NLL}(\mathbf{w}) = \mathbf{0} \] via gradient-based optimization (in practice, using automatic differentiation).

The gradient of the NLL with respect to \(\mathbf{w}\) is \[ \begin{align*} \nabla_{\mathbf{w}} \mathrm{NLL}(\mathbf{w}) &= - \frac{1}{N} \sum_{n=1}^N \left[ y_n (1-\mu_n) \mathbf{x}_n - (1-y_n) \mu_n \mathbf{x}_n \right] \\ &= \frac{1}{N} \sum_{n=1}^N (\mu_n - y_n) \mathbf{x}_n. \end{align*} \] Equivalently, in matrix form, \[ \nabla_{\mathbf{w}} \mathrm{NLL}(\mathbf{w}) = \frac{1}{N} X^\top (\boldsymbol{\mu} - \mathbf{y}), \] where \(X \in \mathbb{R}^{N \times D}\) is the design matrix whose \(n\)th row is \(\mathbf{x}_n^\top\), and \(\boldsymbol{\mu}, \mathbf{y} \in \mathbb{R}^N\) are the vectors of \(\mu_n\) and \(y_n\), respectively. Each \(\mathbf{x}_n\) is weighted by its residual \(\mu_n - y_n\) - this efficient form avoids constructing any \(N \times N\) diagonal matrix.

Derivation of the gradient of \(\mathrm{NLL}(\mathbf{w})\):

Recall the sigmoid function: \[ \sigma(a) = \frac{1}{1 + e^{-a}}. \] We compute its derivative with respect to \(a\): \[ \begin{aligned} \frac{d\sigma}{da} &= \frac{d}{da}\,(1 + e^{-a})^{-1} \\ &= -1 \cdot (1 + e^{-a})^{-2} \;\bigl(-e^{-a}\bigr) \\ &= \frac{e^{-a}}{(1 + e^{-a})^2} \\ &= \left( \frac{1}{1 + e^{-a}}\right) \left(\frac{e^{-a}}{1 + e^{-a}}\right) \\ &= \sigma(a)\,\bigl(1 - \sigma(a)\bigr). \end{aligned} \]

Since \(\mu_n = \sigma(a_n)\), we conclude: \[ \frac{\partial \mu_n}{\partial a_n} = \mu_n\,(1 - \mu_n). \]

For a single data point \((\mathbf{x}_n, y_n)\), let \[ \ell_n = -\bigl[y_n\log\mu_n + (1-y_n)\log(1-\mu_n)\bigr]. \] We compute: \[ \frac{\partial \ell_n}{\partial \mu_n} = -\Bigl(\frac{y_n}{\mu_n} - \frac{1 - y_n}{1 - \mu_n}\Bigr) = \frac{\mu_n - y_n}{\mu_n (1 -\mu_n)}. \] By the chain rule, \[ \begin{aligned} \frac{\partial \ell_n}{\partial a_n} &= \frac{\partial \ell_n}{\partial \mu_n} \cdot \frac{\partial \mu_n}{\partial a_n} \\ &= \frac{\mu_n - y_n}{\mu_n (1 -\mu_n)} \cdot \mu_n\,(1 - \mu_n) \\ &= \mu_n - y_n. \end{aligned} \] Also, since \(a_n = \mathbf{w}^\top \mathbf{x}_n + b\), \[ \frac{\partial a_n}{\partial \mathbf{w}} = \mathbf{x}_n. \] Combining: \[ \frac{\partial \ell_n}{\partial \mathbf{w}} = \frac{\partial \ell_n}{\partial a_n} \cdot \frac{\partial a_n}{\partial \mathbf{w}} = (\mu_n - y_n)\mathbf{x}_n. \] Summing over all \(N\) examples and dividing by \(N\) gives \[ \nabla_{\mathbf{w}} \mathrm{NLL}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^N (\mu_n - y_n)\,\mathbf{x}_n. \]

To verify convexity, we examine the Hessian: \[ \begin{align*} H(\mathbf{w}) &= \nabla_{\mathbf{w}}^2 \mathrm{NLL}(\mathbf{w}) \\ &= \frac{1}{N} \sum_{n=1}^N \mu_n(1-\mu_n)\,\mathbf{x}_n \mathbf{x}_n^\top \\ &= \frac{1}{N} X^\top S X, \end{align*} \] where \(S = \mathrm{diag}(\mu_1(1-\mu_1), \ldots, \mu_N(1-\mu_N))\).

For any \(\mathbf{v} \in \mathbb{R}^D\), \[ \begin{align*} \mathbf{v}^\top H(\mathbf{w}) \mathbf{v} &= \frac{1}{N} \mathbf{v}^\top X^\top S X \mathbf{v} \\ &= \frac{1}{N} (\mathbf{v}^\top X^\top S^{1/2})(S^{1/2} X \mathbf{v}) \\ &= \frac{1}{N} \| S^{1/2} X \mathbf{v} \|_2^2 \geq 0. \end{align*} \] Thus, the Hessian \(H(\mathbf{w})\) is positive semi-definite, confirming that the NLL is convex. This ensures that any stationary point found is a global optimum.

Note that \(H(\mathbf{w})\) is strictly positive definite only if the design matrix \(X\) has full column rank. In practice, when \(X\) is not full rank, or when \(\mu_n\) is very close to 0 or 1 (near-perfect separation), the corresponding diagonal entries of \(S\) become near-zero, making \(H(\mathbf{w})\) ill-conditioned or singular. Adding an \(\ell_2\) regularization term \(\frac{\lambda}{2}\|\mathbf{w}\|_2^2\) shifts the Hessian to a strictly positive definite \(H(\mathbf{w}) + \lambda I\), restoring numerical stability and ensuring a unique solution - exactly as ridge regression does for the linear case.

Binary Logistic Regression Demo

Given a linear scoring function \[ z = w_0 + w_1 x_1 + w_2 x_2, \] and the predicted probability \[ \sigma(z) = \frac{1}{1 + e^{-z}}, \] we minimize the average binary cross-entropy plus an \(\ell_2\) penalty on the weights (excluding the bias): \[ \mathrm{NLL}(\mathbf{w}) = - \frac{1}{N} \sum_{n=1}^N \left[ y_n \log \sigma(z_n) + (1-y_n) \log(1-\sigma(z_n)) \right] + \frac{\lambda}{2} \sum_{j=1}^2 w_j^2. \] Note: The learning rate controls the step size in each gradient descent update.

Kernel Methods

In many problems the original feature vectors \(\mathbf{x} \in \mathbb{R}^D\) are not linearly separable. One fix is to apply a nonlinear feature mapping \[ \phi(\mathbf{x}) : \mathbb{R}^D \to \mathbb{R}^P, \quad P \gg D, \] hoping that \(\{\phi(\mathbf{x}_n)\}\) become separable. But explicit mappings can blow up model complexity and lead to overfitting, even when augmented with \(\ell_2\)-regularization.

Alternatively, if a learning algorithm depends only on inner products \(\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle\), we can invoke the kernel trick: replace each dot-product with a positive definite kernel \(K(\mathbf{x}, \mathbf{x}')\) that computes \(\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle\) implicitly - no \(\phi\) is ever constructed. The full theory of when such a representation exists, and the canonical Hilbert space it lives in, is developed in our RKHS & Kernel Methods page via the Moore-Aronszajn theorem.

Concretely, we build the kernel (Gram) matrix \[ \mathbf{K}_{ij} = K(\mathbf{x}_i, \mathbf{x}_j), \quad i,j = 1, \ldots, N, \] and run our algorithm using \(\{\mathbf{K}_{ij}\}\) instead of \(\{\phi(\mathbf{x}_i)\}\).

For large \(N\) (e.g., \(> 10^5\)), storing \(N \times N\) kernel matrices is infeasible. Two common remedies are:

Random Fourier Features (next section): approximate RBF kernels with \(O(ND')\) cost.
Representation learning: learn \(\phi(\mathbf{x}; \boldsymbol{\theta})\) via neural networks and optimize by mini-batch SGD.

RBF Kernel & Random Fourier Features

Among the many valid kernels, one of the most widely used is the RBF (Gaussian) kernel: \[ K_{\mathrm{RBF}}(\mathbf{x}, \mathbf{x}') = \exp\!\Bigl(-\tfrac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\Bigr), \] which corresponds to an infinite-dimensional feature map. (RBF stands for "radial basis function." See Gaussian processes for more general parameterizations.)

Every shift-invariant continuous positive definite kernel on \(\mathbb{R}^D\) admits an integral representation as the Fourier transform of a finite non-negative measure - a foundational result of harmonic analysis developed in our forthcoming Fourier-analysis-on-Hilbert-spaces page. For the RBF kernel, the relevant measure is precisely a (rescaled) Gaussian, yielding \[ K_{\mathrm{RBF}}(\mathbf{x}, \mathbf{x}') = \mathbb{E}_{\boldsymbol{\omega} \sim \mathcal{N}(\mathbf{0}, \sigma^{-2} I)}\!\bigl[\cos(\boldsymbol{\omega}^\top (\mathbf{x} - \mathbf{x}'))\bigr]. \] We take this representation as the starting point for an algorithmic approximation.

To approximate, sample \(D'\) frequencies \(\{\boldsymbol{\omega}_k\}\) and phases \(\{b_k\} \sim \mathrm{Uniform}[0, 2\pi]\), then define \[ \phi_{\mathrm{RFF}}(\mathbf{x}) = \sqrt{\frac{2}{D'}}\, \bigl[\cos(\boldsymbol{\omega}_1^\top \mathbf{x} + b_1), \ldots, \cos(\boldsymbol{\omega}_{D'}^\top \mathbf{x} + b_{D'})\bigr] \in \mathbb{R}^{D'}. \]

One shows \[ \phi_{\mathrm{RFF}}(\mathbf{x})^\top \phi_{\mathrm{RFF}}(\mathbf{x}') \;\approx\; K_{\mathrm{RBF}}(\mathbf{x}, \mathbf{x}'), \] so we can train a regularized linear model on these \(D'\)-dim features in \(O(ND')\) time, avoiding the \(O(N^2)\) kernel matrix.

Multinomial Logistic Regression

When there are more than two classes (\(C > 2\)), we generalize the sigmoid to the softmax function.

Definition: Softmax Function and Multinomial Logistic Regression

For inputs \(\mathbf{x} \in \mathbb{R}^D\) and class-specific weights \(\{\mathbf{w}_c, b_c\}_{c=1}^C\), the softmax function defines a probability distribution over \(C\) classes: \[ p(y = c \mid \mathbf{x}, \boldsymbol{\theta}) = \frac{\exp(\mathbf{w}_c^\top \mathbf{x} + b_c)} {\sum_{c'=1}^C \exp(\mathbf{w}_{c'}^\top \mathbf{x} + b_{c'})}, \quad c = 1, \ldots, C. \] When \(C = 2\), this reduces to the sigmoid function \(\sigma(a)\) of binary logistic regression.

We fit \(\boldsymbol{\theta} = \{\mathbf{w}_c, b_c\}\) by minimizing the average cross-entropy loss \[ \mathrm{NLL}(\boldsymbol{\theta}) = -\frac{1}{N}\sum_{n=1}^N \sum_{c=1}^C \mathbb{1}\{y_n = c\}\,\log p(y_n = c \mid \mathbf{x}_n, \boldsymbol{\theta}). \]

The gradient has a similarly clean form: \[ \frac{\partial \mathrm{NLL}}{\partial \mathbf{w}_c} = \frac{1}{N}\sum_{n=1}^N \Bigl[p(y_n = c \mid \mathbf{x}_n) - \mathbb{1}\{y_n = c\}\Bigr]\,\mathbf{x}_n, \] which can be optimized via batch or mini-batch gradient descent.

The Hessian is block-structured but remains positive semi-definite, ensuring convexity of the multiclass NLL.

Logistic regression - both binary and multinomial - is a linear classifier: the decision boundaries are hyperplanes in the original feature space. The kernel trick extends this to nonlinear boundaries without explicitly constructing high-dimensional features. In the neural networks page that follows, we take a different approach entirely: instead of choosing a fixed feature map \(\phi(\mathbf{x})\), we learn the representation jointly with the classifier, stacking layers of linear transformations and nonlinear activations to build expressive models from simple building blocks.

Loading...

Binary Logistic Regression

Binary Logistic Regression Demo

Kernel Methods

RBF Kernel & Random Fourier Features

Multinomial Logistic Regression