All previously discussed regression methods can be considered as supervised binary classifiers, when the regression function $f(\mathbf{x})=\mathbf{w}^T\mathbf{x}$ is thresholded by some constant $c$. Without loss of generality, we will always assume $c=0$ in the following. Once the model parameter $\mathbf{w}$ is obtained based on the training set $\mathcal{D}=\{(\mathbf{x}_n,\,y_n),\;n=1,\dots,N\}$, every point $\mathbf{x}$ in the d-dimensional feature space can be classified into either of two classes $C_1$ and $C_0$:
$$\begin{cases}\text{if }f(\mathbf{x})=\mathbf{w}^T\mathbf{x}>0, & \text{then }\mathbf{x}\in C_1\\ \text{if }f(\mathbf{x})=\mathbf{w}^T\mathbf{x}<0, & \text{then }\mathbf{x}\in C_0\end{cases} \tag{207}$$
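As a minimal sketch of this hard-threshold rule (assuming the samples are stored as columns of a matrix X already augmented with $x_0=1$, and w is a weight vector obtained by one of the regression methods; these names are only for illustration):

f=w'*X;              % linear regression function evaluated at every sample (column of X)
labels=(f>0);        % 1 for class C_1 (f>0), 0 for class C_0 (f<0)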
This simple binary classifier suffers from the drawback that the distance between a point and the decision boundary is not taken into consideration: all points on the same side of the boundary, regardless of how far they are from it, are classified into the same class. On the other hand, one would naturally feel more confident about such a classification for a point far away from the boundary than for a point very close to it.
This problem can be addressed by the method of logistic regression, which is similarly trained based on the training set $\mathcal{D}=\{(\mathbf{x}_n,\,y_n)\}$, with each sample $\mathbf{x}_n$ labeled by a binary value $y_n\in\{0,\,1\}$ for either of the two classes $C_1$ and $C_0$. Logistic regression is still based on the linear regression function $f(\mathbf{x})=\mathbf{w}^T\mathbf{x}$, but this function is not hard-thresholded by the unit step function to become either 0 or 1; instead, it is soft-thresholded by a sigmoid function to become a real value $\sigma(\mathbf{w}^T\mathbf{x})$ between 0 and 1, interpreted as the probability for $\mathbf{x}$ to belong to $C_1$, while $1-\sigma(\mathbf{w}^T\mathbf{x})$ is the probability for $\mathbf{x}$ to belong to $C_0$.
Similar to the linear and nonlinear regression methods considered before, the goal of logistic regression is also to find the optimal model parameter $\mathbf{w}$ for the regression function $f(\mathbf{x})=\mathbf{w}^T\mathbf{x}$ to fit the training set optimally. But different from linear and nonlinear regression based on the least squares (LS) method that minimizes the squared error in Eq. (189), logistic regression is based on maximum likelihood (ML) estimation, which maximizes the likelihood of the model parameter, here $L(\mathbf{w})$, as discussed below.
The sigmoid function $\sigma(x)$ can be either the error function (cumulative Gaussian):

$$\sigma(x)=\int_{-\infty}^{x}\frac{1}{\sqrt{2\pi}}\,e^{-t^2/2}\,dt \tag{208}$$

or, as used in the following, the logistic function:

$$\sigma(x)=\frac{1}{1+e^{-x}}=\frac{e^{x}}{1+e^{x}} \tag{209}$$

which satisfies

$$1-\sigma(x)=\frac{e^{-x}}{1+e^{-x}}=\frac{1}{1+e^{x}}=\sigma(-x) \tag{210}$$

and

$$\frac{d}{dx}\,\sigma(x)=\frac{e^{-x}}{(1+e^{-x})^2}=\sigma(x)\,\bigl(1-\sigma(x)\bigr) \tag{211}$$
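The two identities in Eqs. (210) and (211) can be verified numerically; the following is a minimal sketch (the handle name sig and the test points are arbitrary choices for illustration):

sig=@(x) 1./(1+exp(-x));            % logistic function, Eq. (209)
x=linspace(-5,5,11);                % a few test points
h=1e-6;                             % finite-difference step
dnum=(sig(x+h)-sig(x-h))/(2*h);     % numerical derivative
dana=sig(x).*(1-sig(x));            % analytic derivative, Eq. (211)
max(abs(dnum-dana))                 % should be close to zero
max(abs(sig(-x)-(1-sig(x))))        % checks Eq. (210), should be zero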
Now the function value $f(\mathbf{x})=\mathbf{w}^T\mathbf{x}\in(-\infty,\,\infty)$ is mapped to $\sigma(\mathbf{w}^T\mathbf{x})\in(0,\,1)$, treated as the probability for $\mathbf{x}\in C_1$. Specifically, consider the following cases:

- if $\mathbf{w}^T\mathbf{x}=0$, i.e., $\mathbf{x}$ is on the decision boundary, then $\sigma(\mathbf{w}^T\mathbf{x})=0.5$, and $\mathbf{x}$ belongs to either class with equal probability;
- if $\mathbf{w}^T\mathbf{x}\to\infty$, i.e., $\mathbf{x}$ is far from the boundary on its positive side, then $\sigma(\mathbf{w}^T\mathbf{x})\to 1$, and $\mathbf{x}$ belongs to $C_1$ with high probability;
- if $\mathbf{w}^T\mathbf{x}\to-\infty$, i.e., $\mathbf{x}$ is far from the boundary on its negative side, then $\sigma(\mathbf{w}^T\mathbf{x})\to 0$, and $\mathbf{x}$ belongs to $C_0$ with high probability.
The value of the logistic function is interpreted as the conditional probability of the output $y=1$ given the data point $\mathbf{x}$ and the model parameter $\mathbf{w}$:

$$p(y=1\,|\,\mathbf{x},\,\mathbf{w})=\sigma(\mathbf{w}^T\mathbf{x}) \tag{212}$$

and the conditional probability of $y=0$ is:

$$p(y=0\,|\,\mathbf{x},\,\mathbf{w})=1-\sigma(\mathbf{w}^T\mathbf{x})=\sigma(-\mathbf{w}^T\mathbf{x}) \tag{213}$$
These two expressions can be combined into a single one:

$$p(y\,|\,\mathbf{x},\,\mathbf{w})=\bigl[\sigma(\mathbf{w}^T\mathbf{x})\bigr]^{y}\,\bigl[1-\sigma(\mathbf{w}^T\mathbf{x})\bigr]^{1-y},\qquad y\in\{0,\,1\} \tag{214}$$

Assuming the $N$ samples in the training set are independent, the likelihood of the model parameter $\mathbf{w}$ given the training set is:

$$L(\mathbf{w})=p(\mathbf{y}\,|\,\mathbf{X},\,\mathbf{w})=\prod_{n=1}^{N}p(y_n\,|\,\mathbf{x}_n,\,\mathbf{w})=\prod_{n=1}^{N}\sigma_n^{\,y_n}\,(1-\sigma_n)^{1-y_n} \tag{215}$$

where $\sigma_n=\sigma(\mathbf{w}^T\mathbf{x}_n)$, $\mathbf{X}=[\mathbf{x}_1,\dots,\mathbf{x}_N]$, and $\mathbf{y}=[y_1,\dots,y_N]^T$. Maximizing this likelihood is equivalent to minimizing its negative logarithm, so the MLE solution is

$$\mathbf{w}_{ML}=\arg\max_{\mathbf{w}}\,L(\mathbf{w})=\arg\min_{\mathbf{w}}\left\{-\sum_{n=1}^{N}\bigl[y_n\ln\sigma_n+(1-y_n)\ln(1-\sigma_n)\bigr]\right\} \tag{216}$$
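The negative log likelihood in Eq. (216) is straightforward to evaluate; the following sketch is one possible implementation (the function name negLogLik and the data layout, with augmented samples as columns of X and 0/1 labels in a row vector y, are assumptions for illustration):

function e=negLogLik(w,X,y)
    % Negative log likelihood of Eq. (216); X is (d+1)-by-N with augmented
    % samples as columns, y is a 1-by-N row vector of 0/1 labels.
    s=1./(1+exp(-w'*X));                   % sigma_n for all samples
    e=-sum(y.*log(s)+(1-y).*log(1-s));     % negative log likelihood
end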
The optimal parameter $\mathbf{w}$ can also be found by the method of maximum a posteriori (MAP) estimation, which maximizes the posterior probability:
$$p(\mathbf{w}\,|\,\mathcal{D})=\frac{p(\mathcal{D}\,|\,\mathbf{w})\,p(\mathbf{w})}{p(\mathcal{D})}\propto p(\mathcal{D}\,|\,\mathbf{w})\,p(\mathbf{w})=L(\mathbf{w})\,p(\mathbf{w}) \tag{217}$$
Based on the assumption that all components of $\mathbf{w}$ are independent of each other, with zero mean and the same variance $\sigma_w^2$, we let the prior probability be a zero-mean Gaussian:
$$p(\mathbf{w})=\mathcal{N}(\mathbf{w};\,\mathbf{0},\,\sigma_w^2\mathbf{I})\propto\exp\left(-\frac{\mathbf{w}^T\mathbf{w}}{2\sigma_w^2}\right)=\exp\left(-\frac{\|\mathbf{w}\|^2}{2\sigma_w^2}\right) \tag{218}$$
Now the posterior can be written as

$$p(\mathbf{w}\,|\,\mathcal{D})\propto L(\mathbf{w})\,p(\mathbf{w})=\left[\prod_{n=1}^{N}\sigma_n^{\,y_n}(1-\sigma_n)^{1-y_n}\right]\exp\left(-\frac{\|\mathbf{w}\|^2}{2\sigma_w^2}\right) \tag{219}$$

and maximizing it is equivalent to minimizing its negative logarithm, so the MAP solution is

$$\mathbf{w}_{MAP}=\arg\min_{\mathbf{w}}\left\{-\sum_{n=1}^{N}\bigl[y_n\ln\sigma_n+(1-y_n)\ln(1-\sigma_n)\bigr]+\frac{\|\mathbf{w}\|^2}{2\sigma_w^2}\right\} \tag{220}$$

where the second term penalizes solutions with large $\|\mathbf{w}\|$.
In particular, if $\sigma_w^2\to\infty$ and therefore $1/(2\sigma_w^2)\to 0$, then $p(\mathbf{w})\to\text{constant}$, i.e., the prior approaches a uniform distribution (all values of $\mathbf{w}$ are equally likely); the penalty term can then be removed, and Eq. (220) becomes the same as Eq. (216), i.e., the MAP solution that maximizes the posterior $p(\mathbf{w}\,|\,\mathcal{D})$ becomes the same as the MLE solution that maximizes the likelihood $L(\mathbf{w})$.
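Correspondingly, the MAP objective in Eq. (220) differs from Eq. (216) only by the penalty term; a possible sketch of its evaluation (reusing the hypothetical negLogLik above, with s2w denoting the prior variance $\sigma_w^2$) is:

% Negative log posterior of Eq. (220), up to an additive constant;
% letting s2w -> infinity recovers the MLE objective of Eq. (216).
negLogPost=@(w,X,y,s2w) negLogLik(w,X,y)+(w'*w)/(2*s2w);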
Now we can find the optimal $\mathbf{w}$ that minimizes the objective function in either Eq. (216) for the MLE method, or Eq. (220) for the MAP method. Here we consider the latter case, which is more general (with an extra penalty term $\|\mathbf{w}\|^2/2\sigma_w^2$). We first find the gradient of the objective function:
$$\mathbf{g}(\mathbf{w})=\frac{\partial}{\partial\mathbf{w}}\left\{-\sum_{n=1}^{N}\bigl[y_n\ln\sigma_n+(1-y_n)\ln(1-\sigma_n)\bigr]+\frac{\|\mathbf{w}\|^2}{2\sigma_w^2}\right\}=\sum_{n=1}^{N}(\sigma_n-y_n)\,\mathbf{x}_n+\frac{\mathbf{w}}{\sigma_w^2}=\mathbf{X}\,(\boldsymbol{\sigma}-\mathbf{y})+\frac{\mathbf{w}}{\sigma_w^2} \tag{222}$$

where Eq. (211) was used to obtain $\partial\sigma_n/\partial\mathbf{w}=\sigma_n(1-\sigma_n)\,\mathbf{x}_n$, and $\boldsymbol{\sigma}=[\sigma_1,\dots,\sigma_N]^T$.
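The analytic gradient in Eq. (222) can be checked against finite differences of the objective; the sketch below does this for the MLE case (penalty term dropped), assuming X, y, w and the hypothetical negLogLik above are available:

g=X*(1./(1+exp(-w'*X))-y)';      % analytic gradient, Eq. (222) with sigma_w^2 -> infinity
gnum=zeros(size(w));             % finite-difference gradient
h=1e-6;
for i=1:length(w)
    e=zeros(size(w)); e(i)=h;
    gnum(i)=(negLogLik(w+e,X,y)-negLogLik(w-e,X,y))/(2*h);
end
max(abs(g-gnum))                 % should be close to zero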
Below is a segment of Matlab code for estimating the weight vector $\mathbf{w}$ by the gradient descent method. Here X contains the $N$ training samples $\mathbf{x}_1,\dots,\mathbf{x}_N$ as its columns, and y contains the corresponding binary labels $y_1,\dots,y_N$:
X=[ones(1,n); X];          % augment X with x_0=1 for each sample
w=ones(m+1,1);             % initialize weight vector
g=X*(sgm(w,X)-y)';         % gradient of negative log likelihood (penalty term of Eq. (220) omitted)
tol=10^(-6);               % tolerance
delta=0.1;                 % step size
while norm(g)>tol          % terminate when gradient is small enough
    w=w-delta*g;           % update weight vector by gradient descent
    g=X*(sgm(w,X)-y)';     % update gradient
end

where the sigmoid function sgm() is defined as:
function s=sgm(w,X)        % logistic function applied to w'x for every column of X
    s=1./(1+exp(-w'*X));
end
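A possible way to exercise this code (entirely illustrative: the synthetic two-class Gaussian data, the sample sizes, and the variable names are assumptions) is:

m=2; n=200;                              % feature dimension and number of samples
X=[randn(m,n/2)-1, randn(m,n/2)+1];      % class C_0 centered at (-1,-1), class C_1 at (1,1)
y=[zeros(1,n/2), ones(1,n/2)];           % binary labels
% ... run the gradient descent segment above; it augments X and produces w ...
p=sgm(w,X);                              % estimated probabilities of class C_1 (X now augmented)
err=mean((p>0.5)~=y)                     % training error rate, expected to be small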
Alternatively, this minimization problem for the optimal $\mathbf{w}$ can also be solved by Newton's method, based on the second-order Hessian matrix $\mathbf{H}$ as well as the first-order gradient $\mathbf{g}$:
$$\mathbf{H}(\mathbf{w})=\frac{\partial}{\partial\mathbf{w}}\,\mathbf{g}(\mathbf{w})^T=\frac{\partial}{\partial\mathbf{w}}\left[\sum_{n=1}^{N}(\sigma_n-y_n)\,\mathbf{x}_n^T+\frac{\mathbf{w}^T}{\sigma_w^2}\right] \tag{223}$$

Using Eq. (211) again,

$$\frac{\partial\sigma_n}{\partial\mathbf{w}}=\sigma_n(1-\sigma_n)\,\mathbf{x}_n \tag{224}$$

so that

$$\mathbf{H}(\mathbf{w})=\sum_{n=1}^{N}\sigma_n(1-\sigma_n)\,\mathbf{x}_n\mathbf{x}_n^T+\frac{\mathbf{I}}{\sigma_w^2}=\mathbf{X}\,\mathbf{S}\,\mathbf{X}^T+\frac{\mathbf{I}}{\sigma_w^2} \tag{225}$$

where

$$\mathbf{S}=\mathrm{diag}\bigl[\sigma_1(1-\sigma_1),\dots,\sigma_N(1-\sigma_N)\bigr] \tag{226}$$
Since $\mathbf{S}$ is positive definite (all of its diagonal entries $\sigma_n(1-\sigma_n)$ are positive), the following quadratic function is also positive, i.e., for any $\mathbf{u}\ne\mathbf{0}$, we have

$$\mathbf{u}^T\mathbf{H}\,\mathbf{u}=\mathbf{u}^T\mathbf{X}\,\mathbf{S}\,\mathbf{X}^T\mathbf{u}+\frac{\|\mathbf{u}\|^2}{\sigma_w^2}=(\mathbf{X}^T\mathbf{u})^T\,\mathbf{S}\,(\mathbf{X}^T\mathbf{u})+\frac{\|\mathbf{u}\|^2}{\sigma_w^2}>0 \tag{227}$$

i.e., $\mathbf{H}$ is positive definite and the objective function is convex, with a unique minimum.
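This can also be checked numerically; a sketch (assuming X, w, and the prior variance s2w are defined as before, and dropping the prior term for the MLE case) is:

s=1./(1+exp(-w'*X));                          % sigma_n for all samples
H=X*diag(s.*(1-s))*X'+eye(size(X,1))/s2w;     % Hessian of Eq. (225)
min(eig((H+H')/2))                            % smallest eigenvalue; should be positive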
Having found the gradient $\mathbf{g}(\mathbf{w})$ as well as the Hessian $\mathbf{H}(\mathbf{w})$ of the objective function, we can also solve the minimization problem by Newton's method:
$$\mathbf{w}_{k+1}=\mathbf{w}_k-\mathbf{H}^{-1}(\mathbf{w}_k)\,\mathbf{g}(\mathbf{w}_k) \tag{228}$$
X=[ones(1,n); X];            % augment X with x_0=1 for each sample
w=ones(m+1,1);               % initialize weight vector
s=sgm(w,X);                  % sigmoid values for all samples
g=X*(s-y)';                  % gradient of negative log likelihood (penalty term omitted)
H=X*diag(s.*(1-s))*X';       % Hessian (prior term omitted)
tol=10^(-6);                 % tolerance
while norm(g)>tol            % terminate when gradient is small enough
    w=w-inv(H)*g;            % update weight vector by Newton-Raphson
    s=sgm(w,X);              % update sigmoid values
    g=X*(s-y)';              % update gradient
    H=X*diag(s.*(1-s))*X';   % update Hessian
end
Once the model parameter $\mathbf{w}$ is available, a given test sample $\mathbf{x}$ of unknown class can be classified into either of the two classes $C_1$ and $C_0$, based on $\sigma(\mathbf{w}^T\mathbf{x})$:

$$\begin{cases}\text{if }\sigma(\mathbf{w}^T\mathbf{x})>0.5, & \text{then }\mathbf{x}\in C_1\\ \text{if }\sigma(\mathbf{w}^T\mathbf{x})<0.5, & \text{then }\mathbf{x}\in C_0\end{cases}$$
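In Matlab this classification step is a short sketch (Xtest is an assumed matrix of test samples stored as columns, not yet augmented):

Xtest=[ones(1,size(Xtest,2)); Xtest];    % augment test samples with x_0=1
p=sgm(w,Xtest);                          % probability of class C_1 for each test sample
labels=(p>0.5);                          % 1 for C_1, 0 for C_0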
In summary, we make a comparison between the linear and logistic regression methods, both of which can be used for binary classification. In linear regression, the function value $f(\mathbf{x})=\mathbf{w}^T\mathbf{x}$ is hard-thresholded by the unit step function, so that any data point $\mathbf{x}$ is classified to class $C_1$ if $\mathbf{w}^T\mathbf{x}>0$, or class $C_0$ if $\mathbf{w}^T\mathbf{x}<0$, in a deterministic fashion (with probability 1). On the other hand, in logistic regression, the linear function $\mathbf{w}^T\mathbf{x}$ is soft-thresholded by the logistic function to become $\sigma(\mathbf{w}^T\mathbf{x})\in(0,1)$, treated as the probability for $\mathbf{x}$ to belong to class $C_1$. Although the same linear model is used in both cases, the two methods obtain the model parameter $\mathbf{w}$ by minimizing very different objective functions (Eqs. (114) and (220)).