We first consider binary classification based on the same linear model used in the linear regression considered before. Any test sample $\mathbf{x}$ is classified into one of the two classes depending on whether $f(\mathbf{x})=\mathbf{w}^T\mathbf{x}$ is greater or smaller than zero:

$$\text{if }\;\mathbf{w}^T\mathbf{x}>0\;\text{ then }\;y=1,\qquad\text{otherwise}\;\;y=-1 \tag{275}$$
While in the previously considered least squares classification method we find the optimal $\mathbf{w}$ that minimizes the squared error $\varepsilon(\mathbf{w})=\sum_{n=1}^N(y_n-\mathbf{w}^T\mathbf{x}_n)^2$, here we find the optimal $\mathbf{w}$ based on a probabilistic model. Specifically, we now convert the linear function $f(\mathbf{x})=\mathbf{w}^T\mathbf{x}$ into the probability for $\mathbf{x}$ to belong to either class:

$$P(y=1\mid\mathbf{x},\mathbf{w})=\sigma(\mathbf{w}^T\mathbf{x})=\frac{1}{1+e^{-\mathbf{w}^T\mathbf{x}}} \tag{276}$$

and

$$P(y=-1\mid\mathbf{x},\mathbf{w})=1-\sigma(\mathbf{w}^T\mathbf{x})=\sigma(-\mathbf{w}^T\mathbf{x}) \tag{278}$$

The two cases can be combined into

$$P(y\mid\mathbf{x},\mathbf{w})=\sigma(y\,\mathbf{w}^T\mathbf{x}),\qquad y\in\{-1,\,1\} \tag{279}$$
The binary classification problem can now be treated as a regression problem of finding the model parameter $\mathbf{w}$ that best fits the data in the training set $\mathcal{D}=\{(\mathbf{x}_n,\,y_n),\;n=1,\dots,N\}$. Such a regression problem is called logistic regression if the logistic sigmoid $\sigma(z)=1/(1+e^{-z})$ is used, or probit regression if the cumulative Gaussian $\Phi(z)=\int_{-\infty}^z\mathcal{N}(u\mid 0,1)\,du$ is used.
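As a small illustration (not part of the original derivation), the Python sketch below evaluates both link functions on a few linear scores; the function names are ours, and the probit link is written via the standard error function:

```python
import math

def logistic(z):
    """Logistic sigmoid: sigma(z) = 1 / (1 + e^(-z)), as in Eq. (276)."""
    return 1.0 / (1.0 + math.exp(-z))

def probit(z):
    """Standard Gaussian CDF Phi(z), expressed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Both links map a linear score w'x from (-inf, inf) to a probability in (0, 1);
# they agree at z = 0 (both give 0.5) and differ mainly in their tail behavior.
for z in (-2.0, 0.0, 2.0):
    print(f"z = {z:+.1f}:  logistic = {logistic(z):.4f},  probit = {probit(z):.4f}")
```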
As in the case of Bayesian regression, we assume the prior distribution of $\mathbf{w}$ to be a zero-mean Gaussian $p(\mathbf{w})=\mathcal{N}(\mathbf{w}\mid\mathbf{0},\,\boldsymbol{\Sigma}_0)$, and for simplicity we further assume $\boldsymbol{\Sigma}_0=\mathbf{I}$. We then find the likelihood of $\mathbf{w}$ based on the linear model applied to the observed data set $\mathcal{D}$:

$$p(\mathbf{y}\mid\mathbf{X},\mathbf{w})=\prod_{n=1}^N P(y_n\mid\mathbf{x}_n,\mathbf{w})=\prod_{n=1}^N\sigma(y_n\,\mathbf{w}^T\mathbf{x}_n) \tag{281}$$
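Under these assumptions (labels $y_n\in\{-1,1\}$ and prior covariance $\boldsymbol{\Sigma}_0=\mathbf{I}$), the unnormalized log posterior is straightforward to evaluate numerically. The following is a minimal sketch with illustrative names, where the rows of `X` are the training samples:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_posterior(w, X, y):
    """Unnormalized log posterior log p(w|X,y) = log p(w) + log p(y|X,w),
    assuming the N(0, I) prior and the likelihood of Eq. (281).
    X: (N, d) matrix of samples as rows; y: (N,) labels in {-1, +1}."""
    log_prior = -0.5 * w @ w                     # zero-mean, unit-covariance Gaussian (up to a constant)
    margins = y * (X @ w)                        # y_n * w'x_n for all n
    log_lik = np.sum(np.log(sigmoid(margins)))   # sum_n log sigma(y_n w'x_n)
    return log_prior + log_lik
```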
The posterior of $\mathbf{w}$ can now be expressed in terms of the prior $p(\mathbf{w})$ and the likelihood $p(\mathbf{y}\mid\mathbf{X},\mathbf{w})$:

$$p(\mathbf{w}\mid\mathbf{X},\mathbf{y})\propto p(\mathbf{y}\mid\mathbf{X},\mathbf{w})\,p(\mathbf{w}) \tag{282}$$

whose logarithm, with $\boldsymbol{\Sigma}_0=\mathbf{I}$, is

$$\log p(\mathbf{w}\mid\mathbf{X},\mathbf{y})=-\frac{1}{2}\mathbf{w}^T\mathbf{w}+\sum_{n=1}^N\log\sigma(y_n\,\mathbf{w}^T\mathbf{x}_n)+\text{constant} \tag{283}$$

The optimal $\mathbf{w}^*$ that best fits the training set can now be found as the one that maximizes this posterior, or, equivalently, the log posterior above, by setting its derivative with respect to $\mathbf{w}$ to zero and solving the resulting equation below by Newton's method or the conjugate gradient ascent method:

$$\frac{d}{d\mathbf{w}}\log p(\mathbf{w}\mid\mathbf{X},\mathbf{y})=-\mathbf{w}+\sum_{n=1}^N\bigl(1-\sigma(y_n\,\mathbf{w}^T\mathbf{x}_n)\bigr)\,y_n\,\mathbf{x}_n=\mathbf{0} \tag{284}$$
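A minimal sketch of the Newton iteration for Eq. (284) is given below, under the same assumptions as above (rows of `X` are samples, labels in $\{-1,+1\}$, $\boldsymbol{\Sigma}_0=\mathbf{I}$); the gradient and Hessian follow from Eq. (283), and the function name and fixed iteration count are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_map_logistic(X, y, n_iter=20):
    """MAP estimate of w by Newton's method on the log posterior,
    solving Eq. (284): -w + sum_n (1 - sigma(y_n w'x_n)) y_n x_n = 0.
    X: (N, d), y: (N,) with labels in {-1, +1}."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        s = sigmoid(y * (X @ w))                    # sigma(y_n w'x_n) for all n
        grad = -w + X.T @ ((1.0 - s) * y)           # gradient of the log posterior, Eq. (284)
        R = s * (1.0 - s)                           # per-sample curvature weights
        hess = -np.eye(d) - X.T @ (R[:, None] * X)  # Hessian of the log posterior
        w = w - np.linalg.solve(hess, grad)         # Newton update
    return w
```

Since the log posterior is concave (negative definite Hessian), the Newton iteration converges to the unique maximizer $\mathbf{w}^*$.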
Having found the optimal $\mathbf{w}^*$, we can classify any test pattern $\mathbf{x}$ in terms of the posterior of its corresponding label $y$:

$$P(y\mid\mathbf{x},\mathbf{w}^*)=\sigma(y\,{\mathbf{w}^*}^T\mathbf{x}) \tag{285}$$
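For example, a hypothetical end-to-end run on synthetic data, reusing `fit_map_logistic` from the sketch above, might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: two Gaussian clouds labeled +1 / -1.
X = np.vstack([rng.normal(+1.0, 1.0, (50, 2)),
               rng.normal(-1.0, 1.0, (50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])

w_star = fit_map_logistic(X, y)   # MAP estimate from the sketch above

x_test = np.array([0.5, 0.8])
p_pos = 1.0 / (1.0 + np.exp(-(w_star @ x_test)))   # Eq. (285) with y = +1
label = 1 if p_pos > 0.5 else -1
print(f"P(y=+1|x) = {p_pos:.3f}  ->  classify as {label:+d}")
```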
In the multi-class case with $K>2$ classes, we can still use a vector $\mathbf{w}_k$ to represent each class $C_k$ as the direction of the class with respect to the origin of the feature space, and the inner product $\mathbf{w}_k^T\mathbf{x}$, proportional to the projection of $\mathbf{x}$ onto the vector $\mathbf{w}_k$, measures the extent to which $\mathbf{x}$ belongs to $C_k$. Similar to the logistic function used in the two-class case, here the softmax function defined below is used to convert $\mathbf{w}_k^T\mathbf{x}$ into the probability that $\mathbf{x}$ belongs to $C_k$:

$$P(y=k\mid\mathbf{x})=\frac{\exp(\mathbf{w}_k^T\mathbf{x})}{\sum_{l=1}^K\exp(\mathbf{w}_l^T\mathbf{x})},\qquad k=1,\dots,K \tag{286}$$
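A short, numerically stable sketch of this conversion, with hypothetical weights stacked as the rows of a matrix `W`, could be:

```python
import numpy as np

def softmax_probs(W, x):
    """Convert the K linear scores w_k'x into class probabilities, Eq. (286).
    W: (K, d) matrix whose k-th row is the weight vector w_k."""
    scores = W @ x
    scores -= scores.max()   # subtracting the max leaves Eq. (286) unchanged but avoids overflow
    e = np.exp(scores)
    return e / e.sum()

# Hypothetical 3-class example in a 2-d feature space:
W = np.array([[ 2.0,  0.0],
              [-1.0,  1.5],
              [-1.0, -1.5]])
print(softmax_probs(W, np.array([0.3, -0.2])))   # K probabilities summing to 1
```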