All previously discussed regression methods can be considered as supervised binary classifiers, when the regression function $f(\mathbf{x},\mathbf{w})$ is thresholded by some constant $c$. Without loss of generality, we will always assume $c=0$ in the following. Once the model parameter $\mathbf{w}$ is obtained based on the training set $\{(\mathbf{x}_n,\,y_n),\;n=1,\dots,N\}$, every point $\mathbf{x}$ in the $d$-dimensional feature space can be classified into either of two classes $C_1$ and $C_0$:

$$\mathbf{x}\in C_1\;\;\text{if}\;\;f(\mathbf{x},\mathbf{w})>0,\qquad \mathbf{x}\in C_0\;\;\text{if}\;\;f(\mathbf{x},\mathbf{w})<0 \tag{207}$$
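For the linear model $f(\mathbf{x},\mathbf{w})=\mathbf{w}^T\mathbf{x}$ used throughout this section, the rule in Eq. (207) amounts to a few lines of Matlab. This is only a sketch; it assumes a matrix X holding the samples as columns and an already-trained weight vector w (with the bias as its first component) exist in the workspace:

n = size(X,2);               % number of points to classify (columns of X)
Xa = [ones(1,n); X];         % augment with x_0 = 1, as in the code segments below
labels = double(w'*Xa > 0);  % 1 for class C_1, 0 for class C_0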
This simple binary classifier suffers from the drawback that the distance between a point and the decision boundary $f(\mathbf{x},\mathbf{w})=0$ is not taken into consideration, as all points on the same side of the boundary, regardless of how far from the boundary, are equally classified into the same class. On the other hand, one would naturally feel more confident about such a classification for a point far away from the boundary than for a point very close to it.
This problem can be addressed by the method of logistic regression, which is similarly trained based on the training set with each sample $\mathbf{x}_n$ labeled by a binary value $y_n\in\{0,\,1\}$ for either of the two classes $C_1$ and $C_0$. Logistic regression is also based on the linear regression function $f(\mathbf{x},\mathbf{w})=\mathbf{w}^T\mathbf{x}$, but this function is not hard-thresholded by the unit step function $u(\cdot)$ to become either 0 or 1; instead, it is soft-thresholded by a sigmoid function $\sigma(\cdot)$ to become a real value between 0 and 1, interpreted as the probability $p=\sigma(\mathbf{w}^T\mathbf{x})$ for $\mathbf{x}$ to belong to $C_1$, while $1-p$ is the probability for $\mathbf{x}$ to belong to $C_0$.
Similar to the linear and nonlinear regression methods considered before, the goal of logistic regression is also to find the optimal model parameter $\mathbf{w}$ for the regression function $f(\mathbf{x},\mathbf{w})$ to fit the training set optimally. But different from linear and nonlinear regression based on the least squares (LS) method that minimizes the squared error in Eq. (189), logistic regression is based on maximum likelihood (ML) estimation, which maximizes the likelihood of the model parameter, here $\mathbf{w}$, as discussed below.
The sigmoid function $\sigma(x)$ can be either the error function (cumulative Gaussian, erf):

$$\sigma(x)=\Phi(x)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}e^{-t^2/2}\,dt \tag{208}$$

or, more commonly, the logistic function:

$$\sigma(x)=\frac{1}{1+e^{-x}} \tag{210}$$

whose derivative takes the convenient form

$$\frac{d\,\sigma(x)}{dx}=\frac{e^{-x}}{(1+e^{-x})^2}=\sigma(x)\,\bigl(1-\sigma(x)\bigr) \tag{211}$$
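As a quick numeric sanity check (not part of the original text), the logistic function of Eq. (210) and the derivative identity of Eq. (211) can be verified in Matlab as follows:

sgm = @(x) 1./(1+exp(-x));             % logistic function, Eq. (210)
x = -5:0.01:5;
s = sgm(x);
ds = (sgm(x+1e-6)-sgm(x-1e-6))/2e-6;   % central-difference numerical derivative
max(abs(ds - s.*(1-s)))                % close to 0, confirming Eq. (211)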
Now the function value $f(\mathbf{x},\mathbf{w})=\mathbf{w}^T\mathbf{x}\in(-\infty,\,\infty)$ is mapped to $\sigma(\mathbf{w}^T\mathbf{x})\in(0,\,1)$, treated as the probability $p$ for $\mathbf{x}\in C_1$. Specifically, consider the following cases:

$$\sigma(\mathbf{w}^T\mathbf{x})\;\begin{cases}\to 1 & \text{if }\;\mathbf{w}^T\mathbf{x}\to\infty\\ =1/2 & \text{if }\;\mathbf{w}^T\mathbf{x}=0\\ \to 0 & \text{if }\;\mathbf{w}^T\mathbf{x}\to-\infty\end{cases}$$
The value of the logistic function is interpreted as the conditional probability of the output $y$ given the data point $\mathbf{x}$ and the model parameter $\mathbf{w}$:

$$p(y=1\,|\,\mathbf{x},\mathbf{w})=\sigma(\mathbf{w}^T\mathbf{x}),\qquad p(y=0\,|\,\mathbf{x},\mathbf{w})=1-\sigma(\mathbf{w}^T\mathbf{x}) \tag{213}$$

or, combining the two cases into a single expression,

$$p(y\,|\,\mathbf{x},\mathbf{w})=\sigma(\mathbf{w}^T\mathbf{x})^{y}\,\bigl(1-\sigma(\mathbf{w}^T\mathbf{x})\bigr)^{1-y} \tag{214}$$

Assuming the $N$ training samples are independent, the likelihood of $\mathbf{w}$ given the whole training set is

$$L(\mathbf{w})=\prod_{n=1}^{N}\sigma_n^{\,y_n}\,(1-\sigma_n)^{1-y_n},\qquad\text{where }\;\sigma_n=\sigma(\mathbf{w}^T\mathbf{x}_n) \tag{215}$$

Maximizing this likelihood is equivalent to minimizing the negative log likelihood, which serves as the objective function of the MLE method:

$$-\log L(\mathbf{w})=-\sum_{n=1}^{N}\bigl[\,y_n\log\sigma_n+(1-y_n)\log(1-\sigma_n)\,\bigr] \tag{216}$$
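A direct Matlab transcription of the negative log likelihood in Eq. (216) might look as follows. This is only a sketch; the function name negloglik is hypothetical, X is the augmented data matrix with samples as columns, and y is a row vector of 0/1 labels:

function E = negloglik(w, X, y)
    % negative log likelihood (cross-entropy) of Eq. (216)
    s = 1./(1+exp(-w'*X));                  % sigma_n for all training samples
    E = -sum(y.*log(s) + (1-y).*log(1-s));  % sum over the training set
end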
The optimal parameter can also be found by the method of maximum a posteriori (MAP) estimation, which maximizes the posterior probability:

$$\mathbf{w}_{MAP}=\arg\max_{\mathbf{w}}\;p(\mathbf{w}\,|\,\mathcal{D})=\arg\max_{\mathbf{w}}\;\frac{p(\mathcal{D}\,|\,\mathbf{w})\,p(\mathbf{w})}{p(\mathcal{D})}=\arg\max_{\mathbf{w}}\;L(\mathbf{w})\,p(\mathbf{w}) \tag{217}$$

where $\mathcal{D}=\{(\mathbf{x}_n,\,y_n),\;n=1,\dots,N\}$ denotes the training set.
Based on the assumption that all components of $\mathbf{w}$ are independent of each other and they have zero mean and the same variance $\sigma_w^2$, we let the prior probability be:

$$p(\mathbf{w})=\mathcal{N}(\mathbf{w}\,|\,\mathbf{0},\,\sigma_w^2\mathbf{I})\propto\exp\left(-\frac{\|\mathbf{w}\|^2}{2\sigma_w^2}\right) \tag{218}$$
Now the posterior can be written as

$$p(\mathbf{w}\,|\,\mathcal{D})\propto L(\mathbf{w})\,p(\mathbf{w})=\prod_{n=1}^{N}\sigma_n^{\,y_n}(1-\sigma_n)^{1-y_n}\,\exp\left(-\frac{\|\mathbf{w}\|^2}{2\sigma_w^2}\right) \tag{219}$$

and maximizing it is equivalent to minimizing its negative logarithm, which serves as the objective function of the MAP method:

$$-\log p(\mathbf{w}\,|\,\mathcal{D})=-\sum_{n=1}^{N}\bigl[\,y_n\log\sigma_n+(1-y_n)\log(1-\sigma_n)\,\bigr]+\frac{\lambda}{2}\|\mathbf{w}\|^2+\text{const},\qquad\lambda=\frac{1}{\sigma_w^2} \tag{220}$$

In particular, if $\sigma_w^2\to\infty$ and therefore $\lambda\to 0$, the prior approaches a uniform distribution (all values of $\mathbf{w}$ are equally likely), so the penalty term $\frac{\lambda}{2}\|\mathbf{w}\|^2$ can be removed, and Eq. (220) becomes the same as Eq. (216), i.e., the MAP solution $\mathbf{w}_{MAP}$ that maximizes the posterior becomes the same as the MLE solution $\mathbf{w}_{ML}$ that maximizes the likelihood $L(\mathbf{w})$.
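The MAP objective of Eq. (220) simply adds the penalty term to the cross-entropy sketched above. Again only a sketch, with the hypothetical function name neglogpost and regularization parameter lambda standing for $\lambda=1/\sigma_w^2$:

function E = neglogpost(w, X, y, lambda)
    % negative log posterior of Eq. (220): cross-entropy plus L2 penalty
    s = 1./(1+exp(-w'*X));                                    % sigma_n for all samples
    E = -sum(y.*log(s) + (1-y).*log(1-s)) + lambda/2*(w'*w);  % Eq. (220) up to a constant
end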
Now we can find the optimal $\mathbf{w}$ that minimizes the objective function in either Eq. (216) for the MLE method, or Eq. (220) for the MAP method. Here we consider the latter case, which is more general (with an extra term $\frac{\lambda}{2}\|\mathbf{w}\|^2$). We first find the gradient of the objective function:

$$\mathbf{g}(\mathbf{w})=\frac{\partial}{\partial\mathbf{w}}\bigl[-\log p(\mathbf{w}\,|\,\mathcal{D})\bigr]=\sum_{n=1}^{N}\bigl(\sigma(\mathbf{w}^T\mathbf{x}_n)-y_n\bigr)\,\mathbf{x}_n+\lambda\mathbf{w}=\mathbf{X}\,(\boldsymbol{\sigma}-\mathbf{y})^T+\lambda\mathbf{w} \tag{222}$$

where $\mathbf{X}=[\mathbf{x}_1,\dots,\mathbf{x}_N]$ contains the augmented training samples as columns, and $\boldsymbol{\sigma}=[\sigma_1,\dots,\sigma_N]$ and $\mathbf{y}=[y_1,\dots,y_N]$ are row vectors of the sigmoid outputs and the labels, respectively.
Below is a segment of Matlab code that estimates the weight vector $\mathbf{w}$ by the gradient descent method. Here X contains the training samples $\mathbf{x}_n$ as columns and y contains the corresponding binary labels $y_n$:
X=[ones(1,n); X];        % augment X with x_0=1
w=ones(m+1,1);           % initialize weight vector
g=X*(sgm(w,X)-y)';       % gradient of the objective function (Eq. (222) with lambda=0)
tol=10^(-6);             % tolerance
delta=0.1;               % step size
while norm(g)>tol        % terminate when gradient is small enough
    w=w-delta*g;         % update weight vector by gradient descent
    g=X*(sgm(w,X)-y)';   % update gradient
end

where the sigmoid function sgm, implementing the logistic function of Eq. (210), is defined as

function s=sgm(w,X)
    s=1./(1+exp(-w'*X)); % logistic function applied to w'x_n for all samples
end
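To verify the analytical gradient of Eq. (222) against this code, one can compare it with a finite-difference approximation of the objective on a small random problem. This is a sketch with $\lambda=0$ (the case implemented above); the problem size, the random data, and the helper negloglik sketched earlier are all assumptions:

m = 3; n = 20;
X = [ones(1,n); randn(m,n)];     % random augmented data, samples as columns
y = double(rand(1,n) > 0.5);     % random 0/1 labels
w = randn(m+1,1);                % arbitrary weight vector
g = X*(sgm(w,X)-y)';             % analytical gradient, Eq. (222) with lambda=0
gfd = zeros(m+1,1);              % finite-difference gradient
for i = 1:m+1
    e = zeros(m+1,1); e(i) = 1e-6;
    gfd(i) = (negloglik(w+e,X,y) - negloglik(w-e,X,y))/2e-6;
end
max(abs(g - gfd))                % close to 0 if the two gradients agree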
Alternatively, this minimization problem for finding the optimal $\mathbf{w}$ can also be solved by Newton's method, based on the second order Hessian matrix $\mathbf{H}$ as well as the first order gradient $\mathbf{g}$. Differentiating the gradient in Eq. (222) once more, and using the identity $d\sigma(x)/dx=\sigma(x)(1-\sigma(x))$ of Eq. (211), we get

$$\frac{\partial}{\partial\mathbf{w}}\,\sigma(\mathbf{w}^T\mathbf{x}_n)=\sigma_n(1-\sigma_n)\,\mathbf{x}_n \tag{223}$$

$$\mathbf{H}(\mathbf{w})=\frac{\partial\,\mathbf{g}(\mathbf{w})}{\partial\mathbf{w}}=\sum_{n=1}^{N}\sigma_n(1-\sigma_n)\,\mathbf{x}_n\mathbf{x}_n^T+\lambda\mathbf{I} \tag{224}$$

$$=\mathbf{X}\,\mathbf{S}\,\mathbf{X}^T+\lambda\mathbf{I} \tag{225}$$

where

$$\mathbf{S}=\mathrm{diag}\bigl(\sigma_1(1-\sigma_1),\,\dots,\,\sigma_N(1-\sigma_N)\bigr) \tag{226}$$
Since the diagonal entries $\sigma_n(1-\sigma_n)$ of $\mathbf{S}$ are all positive, the quadratic form of $\mathbf{H}$ is nonnegative, i.e., for any $\mathbf{v}\ne\mathbf{0}$, we have

$$\mathbf{v}^T\mathbf{H}\,\mathbf{v}=\bigl(\mathbf{X}^T\mathbf{v}\bigr)^T\mathbf{S}\,\bigl(\mathbf{X}^T\mathbf{v}\bigr)+\lambda\|\mathbf{v}\|^2\ge 0 \tag{227}$$

i.e., $\mathbf{H}$ is positive (semi)definite and the objective function is convex.
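The positive semidefiniteness stated in Eq. (227) can also be checked numerically on random data. A sketch with $\lambda=0$, so the Hessian reduces to the expression used in the code below; the problem size and random inputs are arbitrary:

m = 3; n = 20;
X = [ones(1,n); randn(m,n)];   % random augmented data
w = randn(m+1,1);              % arbitrary weight vector
s = 1./(1+exp(-w'*X));         % sigmoid outputs
H = X*diag(s.*(1-s))*X';       % Hessian of Eq. (225) with lambda=0
min(eig(H))                    % nonnegative (up to round-off): H is positive semidefinite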
Having found the gradient $\mathbf{g}(\mathbf{w})$ as well as the Hessian $\mathbf{H}(\mathbf{w})$ of the objective function, we can solve the minimization problem also by the Newton (Newton-Raphson) method, which iteratively updates the weight vector by

$$\mathbf{w}_{k+1}=\mathbf{w}_k-\mathbf{H}^{-1}(\mathbf{w}_k)\,\mathbf{g}(\mathbf{w}_k) \tag{228}$$
X=[ones(1,n); X];            % augment X with x_0=1
w=ones(m+1,1);               % initialize weight vector
s=sgm(w,X);                  % sigmoid outputs for all samples
g=X*(s-y)';                  % gradient of the objective function (lambda=0 assumed)
H=X*diag(s.*(1-s))*X';       % Hessian of the objective function
tol=10^(-6);                 % tolerance
while norm(g)>tol            % terminate when gradient is small enough
    w=w-inv(H)*g;            % update weight vector by Newton-Raphson
    s=sgm(w,X);              % update sigmoid outputs
    g=X*(s-y)';              % update gradient
    H=X*diag(s.*(1-s))*X';   % update Hessian
end
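As a usage sketch (not part of the original code), either of the two training segments above can be tried on a small synthetic two-class problem; the feature dimension, sample size, and the separation between the two Gaussian clouds below are arbitrary choices:

m = 2; n = 100;                          % 2-D features, 100 samples
X = [randn(m,n/2)-1, randn(m,n/2)+1];    % two Gaussian clouds, samples as columns
y = [zeros(1,n/2), ones(1,n/2)];         % labels: 0 for C_0, 1 for C_1
% ... run either training segment above; it augments X and produces w ...
p = sgm(w, X);                           % predicted probabilities (X is now augmented)
accuracy = mean((p > 0.5) == y)          % fraction of correctly classified training samples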
Once the model parameter $\mathbf{w}$ is available, a given test sample $\mathbf{x}$ of unknown class can be classified into either of the two classes $C_1$ and $C_0$, based on $p=\sigma(\mathbf{w}^T\mathbf{x})$:

$$\mathbf{x}\in C_1\;\;\text{if}\;\;\sigma(\mathbf{w}^T\mathbf{x})>0.5,\qquad \mathbf{x}\in C_0\;\;\text{if}\;\;\sigma(\mathbf{w}^T\mathbf{x})<0.5$$
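In code this final decision amounts to a single comparison. A sketch, where x is a new, non-augmented test sample (hypothetical variable):

p = sgm(w, [1; x]);   % probability that x belongs to C_1
if p > 0.5
    class = 1;        % assign x to C_1
else
    class = 0;        % assign x to C_0
end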
In summary, we make a comparison between the linear and logistic regression methods, both of which can be used for binary classification. In linear regression, the function value $f(\mathbf{x},\mathbf{w})=\mathbf{w}^T\mathbf{x}$ is hard-thresholded by the unit step function, so that any data point $\mathbf{x}$ is classified to class $C_1$ if $\mathbf{w}^T\mathbf{x}>0$, or to class $C_0$ if $\mathbf{w}^T\mathbf{x}<0$, in a deterministic fashion (with probability 1). On the other hand, in logistic regression, the linear function $\mathbf{w}^T\mathbf{x}$ is soft-thresholded by the logistic function to become $p=\sigma(\mathbf{w}^T\mathbf{x})\in(0,\,1)$, treated as the probability for $\mathbf{x}$ to belong to class $C_1$. Although the same linear model $f(\mathbf{x},\mathbf{w})=\mathbf{w}^T\mathbf{x}$ is used in both cases, the two methods obtain the model parameter $\mathbf{w}$ by minimizing very different objective functions (Eqs. (114) and (220)).