Logistic Regression and Binary Classification

All previously discussed regression methods can be considered as supervised binary classifiers, when the regression function $f({\bf x},{\bf\theta})$ is thresholded by some constant $C$. Without loss of generality, we will always assume $C=0$ in the following. Once the model parameter ${\bf\theta}$ is obtained based on the training set ${\cal D}=\{ {\bf X},{\bf y}\}$, every point ${\bf x}$ in the d-dimensional feature space can be classified into either of two classes $C_-$ and $C_+$:

if$\displaystyle \;\;f({\bf x})
={\bf w}^T{\bf x}\left\{\begin{array}{l}<C\\ >C\end{array}\right.
\;\;\;\;\;\;\;$then$\displaystyle \;\;\;\;
{\bf x}\in \left\{\begin{array}{c} C_- \\ C_+\end{array}\right.$ (207)

When the regression function is thresholded by $C=0$, it becomes an equation $f({\bf x})=0$ representing a hypersurface in the d-dimensional space (a point if $d=1$, a curve if $d=2$, a surface or hypersurfaceif $d\ge 3$), by which the space is partitioned into two regions. In this case, $f({\bf x})$ is called the decision function and the surface decision boundary. Now any point in the space is classified into either of the two classes, depending on whether it is on the positive side of the decision boundary ( $f({\bf x})>0$) or the negative side ( $f({\bf x})<0$).

This simple binary classifier suffers from the drawback that the distance between a point ${\bf x}$ and the decision boundary $f({\bf x})=0$ is not taken into consideration, as all points on the same side of the boundary, regardless how far from the boundary, are equally classified into the same class. On the other hand, one would naturally feel more confident about such a classification for a point far away from the boundary, than a point very close to it.

This problem can be addressed by the method of logistic regression, which is similarly trained based on the training set with each sample ${\bf x}_n$ labeled by a binary value $y_n\in \{0,\;1\}$ for either of the two classes $C_-$ and $C_+$. While the logistic regression is based on the linear regression function $f({\bf x})={\bf w}^T{\bf x}$, which is not hard-thresholded by the unit step function $u(x)$ to become either 0 or 1; instead, it is soft-thresholded by a sigmoid function $\sigma(f({\bf x}))=\sigma({\bf w}^T{\bf x})$ to become a real value between 0 and 1, interpreted as the probability $p({\bf x}\in C_+)$ for ${\bf x}$ to belong to $C_+$, and $p({\bf x}\in C_-)=1-p({\bf x}\in C_+)$ is the probability for ${\bf x}$ to belong to $C_-$.

UnitStepSigmoid.png

Similar to linear and nonlinear regression methods considered before, the goal of logistic regression is also to find the optimal model parameter ${\bf\theta}$ for the regression function $f({\bf x},{\bf\theta})$ to fit the training set optimally. But different from linear and nonlinear regression based on the least squares (LS) method that minimizes the squared error in Eq. (189), logistic regression is based on maximum likelihood (ML) estimation, which maximizes the likelihood of the model parameter, here ${\bf\theta}={\bf w}$, as discussed below.

The sigmoid function can be either the error function (cumulative Gaussian, erf):

$\displaystyle \phi(x)=\int_{-\infty}^x {\cal N}(u\vert,1)\,du
=\frac{1}{\sqrt{2\pi}} \int_{-\infty}^x \exp\left(-\frac{1}{2N} u^2\right)\,du$ (208)

or the logistic function:

$\displaystyle \sigma(x)=\frac{1}{1+e^{-x}}=\frac{e^x}{1+e^x}
=\left\{\begin{arr...
... & x\rightarrow\infty\\
0.5 & x=0\\ 0 & x\rightarrow -\infty\end{array}\right.$ (209)

In the following, we will only consider the logistic function for consistency and simplicity. Here is a set of properties of the logistic function:
    $\displaystyle \sigma(-x)=\frac{1}{1+e^x}=1-\frac{1}{1+e^{-x}}=1-\sigma(x)$  
    $\displaystyle \frac{d}{dx}\sigma(x)=\frac{e^{-x}}{(1+e^{-x})^2}
=\sigma(x)\sigma(-x)$  
    $\displaystyle x=\ln\left(\frac{\sigma(x)}{1-\sigma(x)}\right),\;\;\;\;\;\;
e^x=\frac{\sigma(x)}{1-\sigma(x)}$ (210)

Moreover, a parameter $w$ can be included in the logistic function to control its softness when used to threshold the variable $x$ as shown below:

$\displaystyle \sigma(wx)=\left\{\begin{array}{ll}1/2 & w=0\\
1/(1+e^{-wx}) & w>0 \\ u(x) & w\rightarrow \infty\end{array}\right.$ (211)

We see that the argument $x$ is thresholded by the logistic function with variable softness controlled by parameter $w$, between the two extreme cases of $\sigma(w\,x)=1/2$ (no thresholding) when $w=0$, and $\sigma(w\,x)=u(x)$ when $w\rightarrow\infty$ (hard thresholding).

sigmoidPlots.png

Now the function value $f({\bf x})\in (-\infty,\;\infty)$ is mapped to $\sigma(f({\bf x}))\in (0,\;1)$, treated as the probability $p({\bf x}\in C_+)$ for ${\bf x}\in C_+$. Specifically, consider the following cases:

SigmoidSigma.png

The value of the logistic function is interpreted as the conditional probability of output $\hat{y}=1$ given data point ${\bf x}$ and model parameter ${\bf w}$:

$\displaystyle p(\hat{y}=1\vert{\bf x},{\bf w}) =\sigma({\bf w}^T{\bf x})
=\frac{1}{1+e^{-{\bf w}^T{\bf x}}}$ (212)

and the conditional probability of $\hat{y}=0$ is:

$\displaystyle p(\hat{y}=0\vert{\bf x},{\bf w})=1-p(y=1\vert{\bf x},{\bf w})
=1-...
...a({\bf w}^T{\bf x})
=\sigma(-{\bf w}^T{\bf x})=\frac{1}{1+e^{{\bf w}^T{\bf x}}}$ (213)

Combining these two cases, we get the conditional probability of $\hat{y}=y_n$ (either 1 or 0), as the labeling of a sample ${\bf x}_n$ in the training set:
$\displaystyle p(\hat{y}=y_n\vert{\bf x}_n,{\bf w})$ $\displaystyle =$ $\displaystyle p(\hat{y}=1\vert{\bf x}_n,{\bf w})^{y_n} p(\hat{y}=0\vert{\bf x}_...
...=1\\
p(\hat{y}=0\vert{\bf x}_n,{\bf w}) & \mbox{if}\;\;y_n=0\end{array}\right.$ (214)

We further get the likelihood function of ${\bf w}$ given all $N$ samples in ${\cal D}=\{ {\bf X},{\bf y}\}$ (assumed i.i.d.):
$\displaystyle L({\bf w}\vert{\cal D})$ $\displaystyle =$ $\displaystyle p({\cal D}\vert{\bf w})
=p({\bf y}\vert{\bf X},{\bf w})
=\prod_{n=1}^N p(\hat{y}=y_n\vert{\bf x}_n,{\bf w})$  
  $\displaystyle =$ $\displaystyle \prod_{n=1}^N p(\hat{y}=1\vert{\bf x}_n)^{y_n} p(\hat{y}=0\vert{\...
...{n=1}^N \sigma({\bf w}^T{\bf x}_n)^{y_n}
(1-\sigma({\bf w}^T{\bf x}_n))^{1-y_n}$ (215)

Now the optimal model parameter ${\bf w}$ can be defined as the one that maximizes the likelihood function $L({\bf w}\vert{\cal D})$, or, equivalently, that minimizes the negative log likelihood as the objective function:

$\displaystyle J({\bf w})=-l({\bf w}\vert{\cal D})=-\log L({\bf w}\vert{\cal D})...
...g \sigma({\bf w}^T{\bf x}_n)+(1-y_n)\log
(1-\sigma({\bf w}^T{\bf x}_n)) \right]$ (216)

This is the method of maximum likelihood estimation (MLE), and it produces the most probable estimation of certan parameter of a model based on the given data.

The optimal paramter ${\bf w}$ can also be found by the method of maximum a posteriori (MAP), which maximizes the posterior probability:

$\displaystyle p({\bf w}\vert{\cal D})=p({\bf w}\vert{\bf X},{\bf y})
=\frac{p({...
...w})}{p({\bf y}\vert{\bf X})}
\propto p({\bf y}\vert{\bf X},{\bf w})\;p({\bf w})$ (217)

where $p({\bf y}\vert{\bf X},{\bf w})=L({\bf w}\vert{\cal D})$ is the likelihood found above, $p({\bf w})$ is the prior probability of ${\bf w}$ without observing the training data ${\cal D}$, and the denominator independent of the variable ${\bf w}$ is dropped as a constant independent of ${\bf w}$.

Based on the assumption that all components of ${\bf w}$ are independent of each other and they have zero mean and the same variance, we let the prior probability be:

$\displaystyle p({\bf w})={\cal N}({\bf w},{\bf0},\sigma^2{\bf I})
=\frac{1}{(2\...
...ight)
\propto\exp\left(-\frac{\vert\vert{\bf w}\vert\vert^2}{2\sigma^2} \right)$ (218)

where the normalizing factor $1/(2\pi\sigma^2)^{N/2}$ is independent of ${\bf w}$ and is therefore dropped. The assumed zero mean is due to the desire that the values of all weights in ${\bf w}$ should be around zero instead of large values on either positive or negative side, to avoid overfitting, as discussed below.

Now the posterior can be written as

$\displaystyle p({\bf w}\vert{\cal D})$ $\displaystyle \;\propto\;$ $\displaystyle p({\bf y}\vert{\bf X},{\bf w})\;p({\bf w}\vert{\bf X})
=\prod_{n=1}^Np(\hat{y}_n=y_n\vert{\bf x}_n,{\bf w})\;
{\cal N}({\bf0},{\bf\Sigma}_w)$  
  $\displaystyle \propto$ $\displaystyle \prod_{n=1}^N \sigma({\bf w}^T{\bf x}_n)^{y_n}
(1-\sigma({\bf w}^...
...))^{1-y_n}\;
\exp\left(-\frac{\vert\vert{\bf w}\vert\vert^2}{2\sigma^2} \right)$ (219)

We can alternatively find the optimal model parameter ${\bf w}$ that maximizes this posterior $p({\bf w}\vert{\cal D})$ by the mathod of maximum a posteriori (MAP) estimation. To do so, we minimize the negative log posterior as the objective function:
$\displaystyle J({\bf w})$ $\displaystyle =$ $\displaystyle -\log p({\bf w}\vert{\cal D})
=-\log p({\bf y}\vert{\bf X},{\bf w})
-\left(-\frac{\vert\vert{\bf w}\vert\vert^2}{2\sigma^2}\right)$  
  $\displaystyle =$ $\displaystyle -\sum_{n=1}^N y_n\log\sigma({\bf w}^T{\bf x})
+(1-y_n)\log(1-\sigma({\bf w}^T{\bf x}))
+\frac{\vert\vert{\bf w}\vert\vert^2}{2\sigma^2}$ (220)

We can define $\lambda=1/\sigma^2$ as a hyperparameter in the second term $\lambda \vert\vert{\bf w}\vert\vert^2$ treated as a penalty term to discourage large values of ${\bf w}$, similar to Eq. (142) for ridge regression. A larger value of $\lambda$ will result in smaller values for ${\bf w}$, and the logistic function will be more smooth for softer thresholding, and the classification is less affected by possible noisy outliers close to the decision boundary. In other words, by adjusting $\lambda$, we can make a proper tradeoff between overfitting and underfitting.

In particular, if $\sigma\rightarrow\infty$ and $\lambda\rightarrow 0$, then $\exp(-\lambda\vert\vert{\bf w}\vert\vert^2/2)\Rightarrow 1$, i.e., the prior $p({\bf w})$ approaches a uniform distribution (all values of ${\bf w}$ are equaly likely), then the penalty term can be removed, and Eq. (220) becomes the same as Eq. (216), i.e., the MAP solution that maximizes the posterior $p({\bf w}\vert{\cal D})$ becomes the same as the MLE solution that maximizes the likelihood $L({\bf w}\vert{\cal D})$.

Now we can find the optimal ${\bf w}$ that minimizes the objective function $J({\bf w})$ in either Eq. (216) for the MLE method, or Eq. (220) for the MAP method. Here we consider the latter case which is more general (with an extra term $\vert\vert{\bf w}\vert\vert^2//2\sigma^2$). We first find the gradient of objective function:

$\displaystyle {\bf g}_J ({\bf w})$ $\displaystyle =$ $\displaystyle \frac{d}{d{\bf w}} J({\bf w})
=\frac{d}{d{\bf w}}\left(-\log p({\...
...}\vert{\bf X},{\bf w})
+\frac{\vert\vert{\bf w}\vert\vert^2}{2\sigma^2} \right)$  
  $\displaystyle =$ $\displaystyle -\sum_{n=1}^N \frac{d}{d{\bf w}}
\left[y_n\log \sigma({\bf w}^T{\bf x}_n)+(1-y_n)\log
(1-\sigma({\bf w}^T{\bf x}_n)) \right]+\lambda{\bf w}$  
  $\displaystyle =$ $\displaystyle -\sum_{n=1}^N \left[ \frac{y_n}{\sigma({\bf w}^T{\bf x}_n)}
-\fra...
...ma({\bf w}^T{\bf x}_n)(1-\sigma({\bf w}^T{\bf x}_n))\;{\bf x}_n
+\lambda{\bf w}$  
  $\displaystyle =$ $\displaystyle \sum_{n=1}^N \left[ \sigma({\bf w}^T{\bf x}_n)-y_n\right]{\bf x}_...
..._1\\ \vdots\\
\sigma({\bf w}^T{\bf x}_N)-y_N\end{array}\right]
+\lambda{\bf w}$  
  $\displaystyle =$ $\displaystyle {\bf Xr}+\lambda{\bf w}$ (221)

where we have defined a residual vector

$\displaystyle {\bf r}={\bf r}({\bf w})=\left[\begin{array}{c}r_1({\bf w})\\
\v...
...w}^T{\bf x}_1)-y_1\\ \vdots\\
\sigma({\bf w}^T{\bf x}_N)-y_N\end{array}\right]$ (222)

containing all $N$ residuals $r_n=\sigma({\bf w}^T{\bf x}_n)-y_n
\;\;(n=1,\cdots,N)$, the difference between the model prediction $0<\sigma({\bf w}^T{\bf x}_n)<1$ and the ground truth labeling $y_n\in\{0,\,1\}$. Based on ${\bf g}_J ({\bf w})$ we can find the optimal ${\bf w}$ that minimizes this objective function iteratively by the gradient descent method. We note that the gradient in Eq. (221) is based on all $N$ training samples in the training set. If $N$ is large, then the stochastic gradient descent method can be considered based on only one of the $N$ samples in each iteration.

Below is a segment of Matlab code for estimating ${\bf w}$ by the gradient descent method. Here X contains the $N$ samples ${\bf x}_1,\cdots,{\bf x}_N$ and y contains the corresponding a binary labelings.

The following code segment estimates weight vector ${\bf w}$ by the gradient descent method:

    X=[ones(1,n); X];          % augmented X with x_0=1 
    w=ones(m+1,1);             % initialize weight vector
    g=X*(sgm(w,X)-y)';         % gradient of log posterior
    tol=10^(-6);               % tolerance
    delta=0.1;                 % step size
    while norm(g)>tol          % terminate when gradient is small enough
        w=w-delta*g;           % update weight vector by gradient descent
        g=X*(sgm(w,X)-y)';     % update gradient 
    end
where
    function s=sgm(w,X)
        s=1./(1+exp(-w'*X));
    end

Alternatively, this minimization problem for finding optimal ${\bf w}$ can also be solved by the Newton's method based on the second order Hessian matrix ${\bf H}_J({\bf w})$ as well as the first order gradient ${\bf g}_J ({\bf w})$:

$\displaystyle {\bf H}_J({\bf w})$ $\displaystyle =$ $\displaystyle \frac{d}{d{\bf w}} {\bf g}_J({\bf w})
=\frac{d}{d{\bf w}}({\bf Xr...
...\vert\vert{\bf w}\vert\vert)
={\bf X}\frac{d}{d{\bf w}}{\bf r}({\bf w})+\lambda$  
  $\displaystyle =$ $\displaystyle {\bf X}\frac{d}{d{\bf w}}\left[\begin{array}{c}
r_1({\bf w})\\ \vdots\\ r_N({\bf w})\end{array}\right]
+\lambda
={\bf X}{\bf J}_r({\bf w})+\lambda$ (223)

where ${\bf J}_r({\bf w})=d{\bf r}({\bf w})/d{\bf w}$ is the $N\times(d+1)$ Jacobian matrix of the vector function ${\bf r}({\bf w})$, whose nth row is

$\displaystyle \frac{dr_n({\bf w})}{d{\bf w}} =\frac{d}{d{\bf w}}
\left[\sigma({...
...\sigma({\bf w}^T{\bf x}_n)[1-\sigma({\bf w}^T{\bf x}_n)]{\bf x}_n
=z_n{\bf x}_n$ (224)

where we have defined $z_n=\sigma({\bf w}^T{\bf x}_n)(1-\sigma({\bf w}^T{\bf x}_n))$, which is always greater than zero as $0<\sigma({\bf w}^T{\bf x}_n)<1$. Now the Jacobian can be written as

$\displaystyle {\bf J}_r({\bf w})=\left[\begin{array}{c}dr_1({\bf w})/d{\bf w}\\...
...array}{c}{\bf x}_1^T\\ \vdots\\ {\bf x}_N^T\end{array}\right]
={\bf Z}{\bf X}^T$ (225)

where ${\bf Z}=diag(z_1,\cdots,z_N)$ is a positive definite diagonal matrix (as $z_n>0$). Substituting ${\bf J}_r$ back into the expression for ${\bf H}_J({\bf w})$ above, we get

$\displaystyle {\bf H}_J ={\bf X}{\bf J}_r+\lambda={\bf X}{\bf ZX}^T+\lambda$ (226)

Since ${\bf Z}$ is positive definite, the following quadratic function is also positive, i.e., for any ${\bf y}$, we have

$\displaystyle {\bf y}^T{\bf H}_J {\bf y}={\bf y}^T{\bf X}{\bf ZX}^T{\bf y}
+\la...
...X}^T{\bf y})^T{\bf Z} ({\bf X}^T{\bf y}) +\lambda\vert\vert{\bf y}\vert\vert> 0$ (227)

indicating ${\bf H}_J$ is also positive definite, and the objective function $J({\bf w})$ when approximated by the first three terms of its Taylor series is has a global minimum.

Having found ${\bf H}_J({\bf w})$ as well as ${\bf g}_l=J({\bf w})$ of the objective function $J({\bf w})$, we can solve the minimization problem also by the Newton method:

$\displaystyle {\bf w}_{n+1}={\bf w}_n-{\bf H}_J ({\bf w}_n)^{-1}{\bf g}_J({\bf w}_n)
={\bf w}_n-( {\bf X}{\bf ZX}^T )^{-1} {\bf Xr}$ (228)

The following code segment estimates weight vector ${\bf w}$ by Newton method:
    X=[ones(1,n); X];          % augmented X with x_0=1 
    w=ones(m+1,1);             % initialize weight vector
    g=X*(sgm(w,X)-y)';         % gradient of log posterior
    H=X*diag(s.*(1-s))*X';     % Hesian of log posterior
    tol=10^(-6)
    while norm(g)>tol          % terminate when gradient is small enough
        w=w-inv(H)*g;          % update weight vector by Newton-Raphson
        s=sgm(w,X);            % update sigma 
        g=X*(s-y)';            % update gradient 
        H=X*diag(s.*(1-s))*X'; % update Hessian 
    end

Once the model parameter ${\bf w}$ is available, a given test sample ${\bf x}_*$ of unknown class can be classified into either of the two classes $C_1$ and $C_0$, based on $p(y_*=1\vert{\bf x}_*)$:

if $\displaystyle \;\;\;p(y_*=1\vert{\bf x}_*)=\sigma({\bf w}^T{\bf x}_*)
\left\{\begin{array}{l} > 0.5 \\ < 0.5 \end{array}\right.,
\;\;\;\;\;$then $\displaystyle \;\;\;{\bf x}_*\in
\left\{\begin{array}{l} C_1 \\ C_0\end{array}\right.$ (229)

In summary, we make a comparison between the linear and logistic regression methods both can be used for binary classification. In linear regression, the function value $f({\bf x})={\bf w}^T{\bf x}$ is hard-thresholded by the unit step function, so that any data point ${\bf x}$ is classified to class $C_+$ if ${\bf w}^T{\bf x}>0$, or class $C_-$ if ${\bf w}^T{\bf x}<0$ in a deterministic fashion (with probability $p({\bf x}\in C_+)\in \{0,\,1\}$. On the other hand, in logistic regression, the linear function $f({\bf x})={\bf w}^T{\bf x}$ is soft-thresholded by the logistic function to become $p({\bf x}\in C_+)=\sigma(f({\bf x}))\in(0,\;1)$, treated as the probability for ${\bf x}$ to belong to class $C_+$. Although the same linear model $f({\bf x})={\bf w}^T{\bf x}$ is used in both cases, the two methods obtain the model parameter ${\bf w}$ by minimizing very different objective functions (Eqs. (114) and (220)).