Intro to Classification

Binary Logistic Regression Binary Logistic Regression Demo Kernel Methods RBF Kernel & Random Fourier Features Multinomial logistic regression

Binary Logistic Regression

In the regularized regression page, we modeled continuous-valued targets \(y \in \mathbb{R}\) using regularized linear regression. We now turn to the classification setting, where the output is a discrete label. The key shift is that instead of predicting a value directly, we model the probability that an input belongs to each class, connecting maximum likelihood estimation with the cross-entropy loss.

Consider a discriminative classification model \[ p(y \mid \mathbf{x}, \boldsymbol{\theta}), \] where \(\mathbf{x} \in \mathbb{R}^D\) is an input vector, \(y \in \{1, \ldots, C\}\) is the class label, and \(\boldsymbol{\theta}\) collects the model parameters.

Definition: Sigmoid Function and Binary Logistic Regression

The sigmoid (logistic) function \(\sigma : \mathbb{R} \to (0, 1)\) is defined by \[ \sigma(a) = \frac{1}{1 + e^{-a}}. \] For binary labels \(y \in \{0, 1\}\), binary logistic regression is the model \[ p(y \mid \mathbf{x}, \boldsymbol{\theta}) = \mathrm{Ber}\!\bigl(y \mid \sigma(\mathbf{w}^\top \mathbf{x} + b)\bigr), \] with \(\mathbf{w} \in \mathbb{R}^D\) the weight vector, \(b \in \mathbb{R}\) the bias, and \(\boldsymbol{\theta} = (\mathbf{w}, b)\) the parameters. The Bernoulli likelihood is \(\mathrm{Ber}(y \mid p) = p^y(1-p)^{1-y}\) for \(y \in \{0, 1\}\) and \(p \in [0, 1]\).

Equivalently, \[ p(y = 1 \mid \mathbf{x}, \boldsymbol{\theta}) = \sigma(a), \] where \(a\) is called the logit (or the pre-activation): \[ a = \mathbf{w}^\top \mathbf{x} + b = \log\!\left(\frac{p}{1-p}\right), \quad p = p(y = 1 \mid \mathbf{x}, \boldsymbol{\theta}). \]

The sigmoid function outputs the probability that the class label is \(y = 1\). Therefore, we define a decision rule that predicts \(y = 1\) if and only if class 1 is more likely than class 0: \[ \begin{align*} f(\mathbf{x}) &= \mathbb{1}\{p(y=1 \mid \mathbf{x}) > p(y=0 \mid \mathbf{x})\} \\ &= \mathbb{1}\!\left\{ \log \frac{p(y = 1 \mid \mathbf{x})}{p(y = 0 \mid \mathbf{x})} > 0 \right\} \\ &= \mathbb{1}\{a > 0\}. \end{align*} \] Here, \(\mathbb{1}\{\cdot\}\) denotes the indicator function, which returns 1 if the condition inside is true, and 0 otherwise: \[ \mathbb{1}\{\text{condition}\} = \begin{cases} 1 & \text{if condition is true,} \\ 0 & \text{otherwise.} \end{cases} \]

Therefore, the underlying score function can be written as \[ s(\mathbf{x}; \boldsymbol{\theta}) = \mathbf{w}^\top \mathbf{x} + b, \] so that \(a = s(\mathbf{x}; \boldsymbol{\theta})\) and \(f(\mathbf{x}) = \mathbb{1}\{s(\mathbf{x}; \boldsymbol{\theta}) > 0\}\). Since \(\mathbf{w}^\top \mathbf{x}\) is the inner product between the weight vector \(\mathbf{w}\) and the feature vector \(\mathbf{x}\), \(s\) defines a linear hyperplane with normal vector \(\mathbf{w}\) and offset \(b\) from the origin. The weight vector \(\mathbf{w}\) determines the orientation of the decision boundary, and the bias \(b\) shifts the boundary. Moreover, the norm \(\|\mathbf{w}\|\) controls the steepness of the sigmoid - a larger norm yields a sharper transition between predicted classes, reflecting greater confidence in predictions.

To estimate the parameters \(\mathbf{w}\), we maximize the likelihood of the observed dataset \(\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^N\). Equivalently, we minimize the negative log-likelihood (NLL): \[ \begin{align*} \mathrm{NLL}(\mathbf{w}) &= - \frac{1}{N} \log p(\mathcal{D} \mid \mathbf{w}) \\ &= - \frac{1}{N} \log \prod_{n=1}^N \mathrm{Ber}(y_n \mid \mu_n) \\ &= - \frac{1}{N} \sum_{n=1}^N \log\!\left[ \mu_n^{y_n} (1-\mu_n)^{1-y_n} \right] \\ &= - \frac{1}{N} \sum_{n=1}^N \left[ y_n \log \mu_n + (1-y_n) \log(1-\mu_n) \right] \\ &= \frac{1}{N} \sum_{n=1}^N \mathbb{H}_{ce}(y_n, \mu_n), \end{align*} \] where \(\mu_n = \sigma(a_n) = \sigma(\mathbf{w}^\top \mathbf{x}_n)\) is the model's predicted probability of class 1, and \(\mathbb{H}_{ce}(y_n, \mu_n)\) is the binary specialization of cross entropy between \(y_n\) and \(\mu_n\).

We then find the maximum likelihood estimate (MLE) by solving \[ \nabla_{\mathbf{w}} \mathrm{NLL}(\mathbf{w}) = \mathbf{0} \] via gradient-based optimization (in practice, using automatic differentiation).

The gradient of the NLL with respect to \(\mathbf{w}\) is \[ \begin{align*} \nabla_{\mathbf{w}} \mathrm{NLL}(\mathbf{w}) &= - \frac{1}{N} \sum_{n=1}^N \left[ y_n (1-\mu_n) \mathbf{x}_n - (1-y_n) \mu_n \mathbf{x}_n \right] \\ &= \frac{1}{N} \sum_{n=1}^N (\mu_n - y_n) \mathbf{x}_n. \end{align*} \] Equivalently, in matrix form, \[ \nabla_{\mathbf{w}} \mathrm{NLL}(\mathbf{w}) = \frac{1}{N} X^\top (\boldsymbol{\mu} - \mathbf{y}), \] where \(X \in \mathbb{R}^{N \times D}\) is the design matrix whose \(n\)th row is \(\mathbf{x}_n^\top\), and \(\boldsymbol{\mu}, \mathbf{y} \in \mathbb{R}^N\) are the vectors of \(\mu_n\) and \(y_n\), respectively. Each \(\mathbf{x}_n\) is weighted by its residual \(\mu_n - y_n\) - this efficient form avoids constructing any \(N \times N\) diagonal matrix.

Derivation of the gradient of \(\mathrm{NLL}(\mathbf{w})\):

Recall the sigmoid function: \[ \sigma(a) = \frac{1}{1 + e^{-a}}. \] We compute its derivative with respect to \(a\): \[ \begin{aligned} \frac{d\sigma}{da} &= \frac{d}{da}\,(1 + e^{-a})^{-1} \\ &= -1 \cdot (1 + e^{-a})^{-2} \;\bigl(-e^{-a}\bigr) \\ &= \frac{e^{-a}}{(1 + e^{-a})^2} \\ &= \left( \frac{1}{1 + e^{-a}}\right) \left(\frac{e^{-a}}{1 + e^{-a}}\right) \\ &= \sigma(a)\,\bigl(1 - \sigma(a)\bigr). \end{aligned} \]

Since \(\mu_n = \sigma(a_n)\), we conclude: \[ \frac{\partial \mu_n}{\partial a_n} = \mu_n\,(1 - \mu_n). \]

For a single data point \((\mathbf{x}_n, y_n)\), let \[ \ell_n = -\bigl[y_n\log\mu_n + (1-y_n)\log(1-\mu_n)\bigr]. \] We compute: \[ \frac{\partial \ell_n}{\partial \mu_n} = -\Bigl(\frac{y_n}{\mu_n} - \frac{1 - y_n}{1 - \mu_n}\Bigr) = \frac{\mu_n - y_n}{\mu_n (1 -\mu_n)}. \] By the chain rule, \[ \begin{aligned} \frac{\partial \ell_n}{\partial a_n} &= \frac{\partial \ell_n}{\partial \mu_n} \cdot \frac{\partial \mu_n}{\partial a_n} \\ &= \frac{\mu_n - y_n}{\mu_n (1 -\mu_n)} \cdot \mu_n\,(1 - \mu_n) \\ &= \mu_n - y_n. \end{aligned} \] Also, since \(a_n = \mathbf{w}^\top \mathbf{x}_n + b\), \[ \frac{\partial a_n}{\partial \mathbf{w}} = \mathbf{x}_n. \] Combining: \[ \frac{\partial \ell_n}{\partial \mathbf{w}} = \frac{\partial \ell_n}{\partial a_n} \cdot \frac{\partial a_n}{\partial \mathbf{w}} = (\mu_n - y_n)\mathbf{x}_n. \] Summing over all \(N\) examples and dividing by \(N\) gives \[ \nabla_{\mathbf{w}} \mathrm{NLL}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^N (\mu_n - y_n)\,\mathbf{x}_n. \]

To verify convexity, we examine the Hessian: \[ \begin{align*} H(\mathbf{w}) &= \nabla_{\mathbf{w}}^2 \mathrm{NLL}(\mathbf{w}) \\ &= \frac{1}{N} \sum_{n=1}^N \mu_n(1-\mu_n)\,\mathbf{x}_n \mathbf{x}_n^\top \\ &= \frac{1}{N} X^\top S X, \end{align*} \] where \(S = \mathrm{diag}(\mu_1(1-\mu_1), \ldots, \mu_N(1-\mu_N))\).

For any \(\mathbf{v} \in \mathbb{R}^D\), \[ \begin{align*} \mathbf{v}^\top H(\mathbf{w}) \mathbf{v} &= \frac{1}{N} \mathbf{v}^\top X^\top S X \mathbf{v} \\ &= \frac{1}{N} (\mathbf{v}^\top X^\top S^{1/2})(S^{1/2} X \mathbf{v}) \\ &= \frac{1}{N} \| S^{1/2} X \mathbf{v} \|_2^2 \geq 0. \end{align*} \] Thus, the Hessian \(H(\mathbf{w})\) is positive semi-definite, confirming that the NLL is convex. This ensures that any stationary point found is a global optimum.

Note that \(H(\mathbf{w})\) is strictly positive definite only if the design matrix \(X\) has full column rank. In practice, when \(X\) is not full rank, or when \(\mu_n\) is very close to 0 or 1 (near-perfect separation), the corresponding diagonal entries of \(S\) become near-zero, making \(H(\mathbf{w})\) ill-conditioned or singular. Adding an \(\ell_2\) regularization term \(\frac{\lambda}{2}\|\mathbf{w}\|_2^2\) shifts the Hessian to a strictly positive definite \(H(\mathbf{w}) + \lambda I\), restoring numerical stability and ensuring a unique solution - exactly as ridge regression does for the linear case.

Binary Logistic Regression Demo

Given a linear scoring function \[ z = w_0 + w_1 x_1 + w_2 x_2, \] and the predicted probability \[ \sigma(z) = \frac{1}{1 + e^{-z}}, \] we minimize the average binary cross-entropy plus an \(\ell_2\) penalty on the weights (excluding the bias): \[ \mathrm{NLL}(\mathbf{w}) = - \frac{1}{N} \sum_{n=1}^N \left[ y_n \log \sigma(z_n) + (1-y_n) \log(1-\sigma(z_n)) \right] + \frac{\lambda}{2} \sum_{j=1}^2 w_j^2. \] Note: The learning rate controls the step size in each gradient descent update.

Kernel Methods

In many problems the original feature vectors \(\mathbf{x} \in \mathbb{R}^D\) are not linearly separable. One fix is to apply a nonlinear feature mapping \[ \phi(\mathbf{x}) : \mathbb{R}^D \to \mathbb{R}^P, \quad P \gg D, \] hoping that \(\{\phi(\mathbf{x}_n)\}\) become separable. But explicit mappings can blow up model complexity and lead to overfitting, even when augmented with \(\ell_2\)-regularization.

Alternatively, if a learning algorithm depends only on inner products \(\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle\), we can invoke the kernel trick: replace each dot-product with a positive definite kernel \(K(\mathbf{x}, \mathbf{x}')\) that computes \(\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle\) implicitly - no \(\phi\) is ever constructed. The full theory of when such a representation exists, and the canonical Hilbert space it lives in, is developed in our RKHS & Kernel Methods page via the Moore-Aronszajn theorem.

Concretely, we build the kernel (Gram) matrix \[ \mathbf{K}_{ij} = K(\mathbf{x}_i, \mathbf{x}_j), \quad i,j = 1, \ldots, N, \] and run our algorithm using \(\{\mathbf{K}_{ij}\}\) instead of \(\{\phi(\mathbf{x}_i)\}\).

For large \(N\) (e.g., \(> 10^5\)), storing \(N \times N\) kernel matrices is infeasible. Two common remedies are:

RBF Kernel & Random Fourier Features

Among the many valid kernels, one of the most widely used is the RBF (Gaussian) kernel: \[ K_{\mathrm{RBF}}(\mathbf{x}, \mathbf{x}') = \exp\!\Bigl(-\tfrac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\Bigr), \] which corresponds to an infinite-dimensional feature map. (RBF stands for "radial basis function." See Gaussian processes for more general parameterizations.)

Every shift-invariant continuous positive definite kernel on \(\mathbb{R}^D\) admits an integral representation as the Fourier transform of a finite non-negative measure - a foundational result of harmonic analysis developed in our forthcoming Fourier-analysis-on-Hilbert-spaces page. For the RBF kernel, the relevant measure is precisely a (rescaled) Gaussian, yielding \[ K_{\mathrm{RBF}}(\mathbf{x}, \mathbf{x}') = \mathbb{E}_{\boldsymbol{\omega} \sim \mathcal{N}(\mathbf{0}, \sigma^{-2} I)}\!\bigl[\cos(\boldsymbol{\omega}^\top (\mathbf{x} - \mathbf{x}'))\bigr]. \] We take this representation as the starting point for an algorithmic approximation.

To approximate, sample \(D'\) frequencies \(\{\boldsymbol{\omega}_k\}\) and phases \(\{b_k\} \sim \mathrm{Uniform}[0, 2\pi]\), then define \[ \phi_{\mathrm{RFF}}(\mathbf{x}) = \sqrt{\frac{2}{D'}}\, \bigl[\cos(\boldsymbol{\omega}_1^\top \mathbf{x} + b_1), \ldots, \cos(\boldsymbol{\omega}_{D'}^\top \mathbf{x} + b_{D'})\bigr] \in \mathbb{R}^{D'}. \]

One shows \[ \phi_{\mathrm{RFF}}(\mathbf{x})^\top \phi_{\mathrm{RFF}}(\mathbf{x}') \;\approx\; K_{\mathrm{RBF}}(\mathbf{x}, \mathbf{x}'), \] so we can train a regularized linear model on these \(D'\)-dim features in \(O(ND')\) time, avoiding the \(O(N^2)\) kernel matrix.

Multinomial Logistic Regression

When there are more than two classes (\(C > 2\)), we generalize the sigmoid to the softmax function.

Definition: Softmax Function and Multinomial Logistic Regression

For inputs \(\mathbf{x} \in \mathbb{R}^D\) and class-specific weights \(\{\mathbf{w}_c, b_c\}_{c=1}^C\), the softmax function defines a probability distribution over \(C\) classes: \[ p(y = c \mid \mathbf{x}, \boldsymbol{\theta}) = \frac{\exp(\mathbf{w}_c^\top \mathbf{x} + b_c)} {\sum_{c'=1}^C \exp(\mathbf{w}_{c'}^\top \mathbf{x} + b_{c'})}, \quad c = 1, \ldots, C. \] When \(C = 2\), this reduces to the sigmoid function \(\sigma(a)\) of binary logistic regression.

We fit \(\boldsymbol{\theta} = \{\mathbf{w}_c, b_c\}\) by minimizing the average cross-entropy loss \[ \mathrm{NLL}(\boldsymbol{\theta}) = -\frac{1}{N}\sum_{n=1}^N \sum_{c=1}^C \mathbb{1}\{y_n = c\}\,\log p(y_n = c \mid \mathbf{x}_n, \boldsymbol{\theta}). \]

The gradient has a similarly clean form: \[ \frac{\partial \mathrm{NLL}}{\partial \mathbf{w}_c} = \frac{1}{N}\sum_{n=1}^N \Bigl[p(y_n = c \mid \mathbf{x}_n) - \mathbb{1}\{y_n = c\}\Bigr]\,\mathbf{x}_n, \] which can be optimized via batch or mini-batch gradient descent.

The Hessian is block-structured but remains positive semi-definite, ensuring convexity of the multiclass NLL.

Logistic regression - both binary and multinomial - is a linear classifier: the decision boundaries are hyperplanes in the original feature space. The kernel trick extends this to nonlinear boundaries without explicitly constructing high-dimensional features. In the neural networks page that follows, we take a different approach entirely: instead of choosing a fixed feature map \(\phi(\mathbf{x})\), we learn the representation jointly with the classifier, stacking layers of linear transformations and nonlinear activations to build expressive models from simple building blocks.