Binary Logistic Regression
In the regularized regression page, we modeled continuous-valued targets \(y \in \mathbb{R}\) using regularized linear regression. We now turn to the classification setting, where the output is a discrete label. The key shift is that instead of predicting a value directly, we model the probability that an input belongs to each class, connecting maximum likelihood estimation with the cross-entropy loss.
Consider a discriminative classification model \[ p(y \mid \mathbf{x}, \boldsymbol{\theta}), \] where \(\mathbf{x} \in \mathbb{R}^D\) is an input vector, \(y \in \{1, \ldots, C\}\) is the class label, and \(\boldsymbol{\theta}\) collects the model parameters.
The sigmoid (logistic) function \(\sigma : \mathbb{R} \to (0, 1)\) is defined by \[ \sigma(a) = \frac{1}{1 + e^{-a}}. \] For binary labels \(y \in \{0, 1\}\), binary logistic regression is the model \[ p(y \mid \mathbf{x}, \boldsymbol{\theta}) = \mathrm{Ber}\!\bigl(y \mid \sigma(\mathbf{w}^\top \mathbf{x} + b)\bigr), \] with \(\mathbf{w} \in \mathbb{R}^D\) the weight vector, \(b \in \mathbb{R}\) the bias, and \(\boldsymbol{\theta} = (\mathbf{w}, b)\) the parameters. The Bernoulli likelihood is \(\mathrm{Ber}(y \mid p) = p^y(1-p)^{1-y}\) for \(y \in \{0, 1\}\) and \(p \in [0, 1]\).
Equivalently, \[ p(y = 1 \mid \mathbf{x}, \boldsymbol{\theta}) = \sigma(a), \] where \(a\) is called the logit (or the pre-activation): \[ a = \mathbf{w}^\top \mathbf{x} + b = \log\!\left(\frac{p}{1-p}\right), \quad p = p(y = 1 \mid \mathbf{x}, \boldsymbol{\theta}). \]
The sigmoid function outputs the probability that the class label is \(y = 1\). Therefore, we define a decision rule that predicts \(y = 1\) if and only if class 1 is more likely than class 0: \[ \begin{align*} f(\mathbf{x}) &= \mathbb{1}\{p(y=1 \mid \mathbf{x}) > p(y=0 \mid \mathbf{x})\} \\ &= \mathbb{1}\!\left\{ \log \frac{p(y = 1 \mid \mathbf{x})}{p(y = 0 \mid \mathbf{x})} > 0 \right\} \\ &= \mathbb{1}\{a > 0\}. \end{align*} \] Here, \(\mathbb{1}\{\cdot\}\) denotes the indicator function, which returns 1 if the condition inside is true, and 0 otherwise: \[ \mathbb{1}\{\text{condition}\} = \begin{cases} 1 & \text{if condition is true,} \\ 0 & \text{otherwise.} \end{cases} \]
Therefore, the underlying score function can be written as \[ s(\mathbf{x}; \boldsymbol{\theta}) = \mathbf{w}^\top \mathbf{x} + b, \] so that \(a = s(\mathbf{x}; \boldsymbol{\theta})\) and \(f(\mathbf{x}) = \mathbb{1}\{s(\mathbf{x}; \boldsymbol{\theta}) > 0\}\). Since \(\mathbf{w}^\top \mathbf{x}\) is the inner product between the weight vector \(\mathbf{w}\) and the feature vector \(\mathbf{x}\), \(s\) defines a linear hyperplane with normal vector \(\mathbf{w}\) and offset \(b\) from the origin. The weight vector \(\mathbf{w}\) determines the orientation of the decision boundary, and the bias \(b\) shifts the boundary. Moreover, the norm \(\|\mathbf{w}\|\) controls the steepness of the sigmoid - a larger norm yields a sharper transition between predicted classes, reflecting greater confidence in predictions.
To estimate the parameters \(\mathbf{w}\), we maximize the likelihood of the observed dataset \(\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^N\). Equivalently, we minimize the negative log-likelihood (NLL): \[ \begin{align*} \mathrm{NLL}(\mathbf{w}) &= - \frac{1}{N} \log p(\mathcal{D} \mid \mathbf{w}) \\ &= - \frac{1}{N} \log \prod_{n=1}^N \mathrm{Ber}(y_n \mid \mu_n) \\ &= - \frac{1}{N} \sum_{n=1}^N \log\!\left[ \mu_n^{y_n} (1-\mu_n)^{1-y_n} \right] \\ &= - \frac{1}{N} \sum_{n=1}^N \left[ y_n \log \mu_n + (1-y_n) \log(1-\mu_n) \right] \\ &= \frac{1}{N} \sum_{n=1}^N \mathbb{H}_{ce}(y_n, \mu_n), \end{align*} \] where \(\mu_n = \sigma(a_n) = \sigma(\mathbf{w}^\top \mathbf{x}_n)\) is the model's predicted probability of class 1, and \(\mathbb{H}_{ce}(y_n, \mu_n)\) is the binary specialization of cross entropy between \(y_n\) and \(\mu_n\).
We then find the maximum likelihood estimate (MLE) by solving \[ \nabla_{\mathbf{w}} \mathrm{NLL}(\mathbf{w}) = \mathbf{0} \] via gradient-based optimization (in practice, using automatic differentiation).
The gradient of the NLL with respect to \(\mathbf{w}\) is \[ \begin{align*} \nabla_{\mathbf{w}} \mathrm{NLL}(\mathbf{w}) &= - \frac{1}{N} \sum_{n=1}^N \left[ y_n (1-\mu_n) \mathbf{x}_n - (1-y_n) \mu_n \mathbf{x}_n \right] \\ &= \frac{1}{N} \sum_{n=1}^N (\mu_n - y_n) \mathbf{x}_n. \end{align*} \] Equivalently, in matrix form, \[ \nabla_{\mathbf{w}} \mathrm{NLL}(\mathbf{w}) = \frac{1}{N} X^\top (\boldsymbol{\mu} - \mathbf{y}), \] where \(X \in \mathbb{R}^{N \times D}\) is the design matrix whose \(n\)th row is \(\mathbf{x}_n^\top\), and \(\boldsymbol{\mu}, \mathbf{y} \in \mathbb{R}^N\) are the vectors of \(\mu_n\) and \(y_n\), respectively. Each \(\mathbf{x}_n\) is weighted by its residual \(\mu_n - y_n\) - this efficient form avoids constructing any \(N \times N\) diagonal matrix.
Recall the sigmoid function: \[ \sigma(a) = \frac{1}{1 + e^{-a}}. \] We compute its derivative with respect to \(a\): \[ \begin{aligned} \frac{d\sigma}{da} &= \frac{d}{da}\,(1 + e^{-a})^{-1} \\ &= -1 \cdot (1 + e^{-a})^{-2} \;\bigl(-e^{-a}\bigr) \\ &= \frac{e^{-a}}{(1 + e^{-a})^2} \\ &= \left( \frac{1}{1 + e^{-a}}\right) \left(\frac{e^{-a}}{1 + e^{-a}}\right) \\ &= \sigma(a)\,\bigl(1 - \sigma(a)\bigr). \end{aligned} \]
Since \(\mu_n = \sigma(a_n)\), we conclude: \[ \frac{\partial \mu_n}{\partial a_n} = \mu_n\,(1 - \mu_n). \]
For a single data point \((\mathbf{x}_n, y_n)\), let \[ \ell_n = -\bigl[y_n\log\mu_n + (1-y_n)\log(1-\mu_n)\bigr]. \] We compute: \[ \frac{\partial \ell_n}{\partial \mu_n} = -\Bigl(\frac{y_n}{\mu_n} - \frac{1 - y_n}{1 - \mu_n}\Bigr) = \frac{\mu_n - y_n}{\mu_n (1 -\mu_n)}. \] By the chain rule, \[ \begin{aligned} \frac{\partial \ell_n}{\partial a_n} &= \frac{\partial \ell_n}{\partial \mu_n} \cdot \frac{\partial \mu_n}{\partial a_n} \\ &= \frac{\mu_n - y_n}{\mu_n (1 -\mu_n)} \cdot \mu_n\,(1 - \mu_n) \\ &= \mu_n - y_n. \end{aligned} \] Also, since \(a_n = \mathbf{w}^\top \mathbf{x}_n + b\), \[ \frac{\partial a_n}{\partial \mathbf{w}} = \mathbf{x}_n. \] Combining: \[ \frac{\partial \ell_n}{\partial \mathbf{w}} = \frac{\partial \ell_n}{\partial a_n} \cdot \frac{\partial a_n}{\partial \mathbf{w}} = (\mu_n - y_n)\mathbf{x}_n. \] Summing over all \(N\) examples and dividing by \(N\) gives \[ \nabla_{\mathbf{w}} \mathrm{NLL}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^N (\mu_n - y_n)\,\mathbf{x}_n. \]
To verify convexity, we examine the Hessian: \[ \begin{align*} H(\mathbf{w}) &= \nabla_{\mathbf{w}}^2 \mathrm{NLL}(\mathbf{w}) \\ &= \frac{1}{N} \sum_{n=1}^N \mu_n(1-\mu_n)\,\mathbf{x}_n \mathbf{x}_n^\top \\ &= \frac{1}{N} X^\top S X, \end{align*} \] where \(S = \mathrm{diag}(\mu_1(1-\mu_1), \ldots, \mu_N(1-\mu_N))\).
For any \(\mathbf{v} \in \mathbb{R}^D\), \[ \begin{align*} \mathbf{v}^\top H(\mathbf{w}) \mathbf{v} &= \frac{1}{N} \mathbf{v}^\top X^\top S X \mathbf{v} \\ &= \frac{1}{N} (\mathbf{v}^\top X^\top S^{1/2})(S^{1/2} X \mathbf{v}) \\ &= \frac{1}{N} \| S^{1/2} X \mathbf{v} \|_2^2 \geq 0. \end{align*} \] Thus, the Hessian \(H(\mathbf{w})\) is positive semi-definite, confirming that the NLL is convex. This ensures that any stationary point found is a global optimum.
Note that \(H(\mathbf{w})\) is strictly positive definite only if the design matrix \(X\) has full column rank. In practice, when \(X\) is not full rank, or when \(\mu_n\) is very close to 0 or 1 (near-perfect separation), the corresponding diagonal entries of \(S\) become near-zero, making \(H(\mathbf{w})\) ill-conditioned or singular. Adding an \(\ell_2\) regularization term \(\frac{\lambda}{2}\|\mathbf{w}\|_2^2\) shifts the Hessian to a strictly positive definite \(H(\mathbf{w}) + \lambda I\), restoring numerical stability and ensuring a unique solution - exactly as ridge regression does for the linear case.