We first consider binary classification based on the same linear model $y(\mathbf{x})=\mathbf{w}^T\mathbf{x}$ used in the linear regression considered before. Any test sample $\mathbf{x}$ is classified into one of the two classes $C_+$ and $C_-$ depending on whether $y(\mathbf{x})=\mathbf{w}^T\mathbf{x}$ is greater or smaller than zero:

$$\mathbf{x}\in\begin{cases} C_+ & \text{if } \mathbf{w}^T\mathbf{x}>0\\ C_- & \text{if } \mathbf{w}^T\mathbf{x}<0 \end{cases} \qquad (275)$$
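As a quick illustration, here is a minimal sketch of this decision rule in Python (NumPy), assuming each row of $\mathbf{X}$ is a sample whose bias term, if any, has been absorbed into $\mathbf{w}$:

```python
import numpy as np

def classify(w, X):
    """Assign each row of X to class +1 (C+) or -1 (C-) by the sign of w^T x."""
    return np.where(X @ w > 0, 1, -1)
```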
While in the previously considered least squares classification method we find the optimal $\mathbf{w}$ that minimizes the squared error $\|\mathbf{y}-\mathbf{X}^T\mathbf{w}\|^2$, here we find the optimal $\mathbf{w}$ based on a probabilistic model. Specifically, we now convert the linear function $\mathbf{w}^T\mathbf{x}$ into the probability for $\mathbf{x}$ to belong to either class:

$$P(C_+|\mathbf{x})=f(\mathbf{w}^T\mathbf{x}),\qquad P(C_-|\mathbf{x})=1-f(\mathbf{w}^T\mathbf{x}) \qquad (276)$$

where $f(z)$ is either the logistic (sigmoid) function

$$\sigma(z)=\frac{1}{1+e^{-z}} \qquad (278)$$

or the cumulative distribution function of the standard normal distribution

$$\Phi(z)=\int_{-\infty}^{z}\mathcal{N}(t\,|\,0,1)\,dt \qquad (279)$$
The binary classification problem can now be treated as a regression problem to find the model parameter $\mathbf{w}$ that best fits the data in the training set $\mathcal{D}=\{(\mathbf{x}_n,\,y_n),\;n=1,\dots,N\}$. Such a regression problem is called logistic regression if $\sigma(z)$ is used, or probit regression if $\Phi(z)$ is used.
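A minimal sketch of the two link functions follows, using SciPy's norm.cdf for the cumulative Gaussian of Eq. (279):

```python
import numpy as np
from scipy.stats import norm

def sigmoid(z):
    """Logistic function of Eq. (278)."""
    return 1.0 / (1.0 + np.exp(-z))

def probit(z):
    """Cumulative standard normal of Eq. (279)."""
    return norm.cdf(z)
```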
Same as in the case of Bayesian regression, we assume the prior distribution of $\mathbf{w}$ to be a zero-mean Gaussian $p(\mathbf{w})=\mathcal{N}(\mathbf{w}\,|\,\mathbf{0},\,\mathbf{\Sigma}_w)$, and for simplicity we further assume $\mathbf{\Sigma}_w=\sigma_w^2\mathbf{I}$ and $f(z)=\sigma(z)$, and find the likelihood of $\mathbf{w}$ based on the linear model applied to the observed data set $\mathcal{D}$. Coding the labels as $y_n\in\{-1,+1\}$ and using the symmetry $\sigma(-z)=1-\sigma(z)$, both cases of Eq. (276) can be written as $P(y_n|\mathbf{x}_n,\mathbf{w})=\sigma(y_n\mathbf{w}^T\mathbf{x}_n)$, so that

$$p(\mathbf{y}\,|\,\mathbf{X},\mathbf{w})=\prod_{n=1}^N P(y_n|\mathbf{x}_n,\mathbf{w})=\prod_{n=1}^N \sigma(y_n\mathbf{w}^T\mathbf{x}_n) \qquad (281)$$
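The log of the likelihood in Eq. (281) can be evaluated as in the sketch below, again assuming labels coded as $\pm 1$; np.logaddexp is used for numerical stability, since $\ln\sigma(z)=-\ln(1+e^{-z})$:

```python
import numpy as np

def log_likelihood(w, X, y):
    """ln p(y|X,w) = sum_n ln sigma(y_n w^T x_n), with y_n in {-1,+1}."""
    z = y * (X @ w)                       # margins y_n w^T x_n
    return -np.logaddexp(0.0, -z).sum()   # ln sigma(z) = -ln(1 + e^{-z})
```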
The posterior of $\mathbf{w}$ can now be expressed in terms of the prior $p(\mathbf{w})$ and the likelihood $p(\mathbf{y}|\mathbf{X},\mathbf{w})$:

$$p(\mathbf{w}\,|\,\mathbf{X},\mathbf{y})=\frac{p(\mathbf{y}|\mathbf{X},\mathbf{w})\,p(\mathbf{w})}{p(\mathbf{y}|\mathbf{X})} \propto p(\mathbf{w})\prod_{n=1}^N \sigma(y_n\mathbf{w}^T\mathbf{x}_n)$$
The optimal $\mathbf{w}^*$ that best fits the training set $\mathcal{D}$ can now be found as the one that maximizes this posterior $p(\mathbf{w}|\mathbf{X},\mathbf{y})$, or, equivalently, the log posterior

$$L(\mathbf{w})=\ln p(\mathbf{w}\,|\,\mathbf{X},\mathbf{y})=\sum_{n=1}^N \ln\sigma(y_n\mathbf{w}^T\mathbf{x}_n)-\frac{1}{2\sigma_w^2}\mathbf{w}^T\mathbf{w}+\text{const},$$

by setting the derivative of $L(\mathbf{w})$ to zero and solving the resulting equation below by Newton's method or the conjugate gradient ascent method:

$$\frac{d}{d\mathbf{w}}L(\mathbf{w})=\sum_{n=1}^N\left[1-\sigma(y_n\mathbf{w}^T\mathbf{x}_n)\right]y_n\mathbf{x}_n-\frac{\mathbf{w}}{\sigma_w^2}=\mathbf{0} \qquad (284)$$
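Equation (284) is nonlinear in $\mathbf{w}$ and has no closed-form solution. Below is a minimal Newton's method sketch for the MAP estimate, under the same assumptions ($\pm 1$ labels, prior variance sigma_w2); the Hessian of $L$ is $-\sum_n \sigma_n(1-\sigma_n)\mathbf{x}_n\mathbf{x}_n^T-\mathbf{I}/\sigma_w^2$, which is negative definite, so the iteration converges to the unique maximum:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_logistic(X, y, sigma_w2=10.0, n_iter=20, tol=1e-8):
    """MAP estimate of w by Newton's method, solving Eq. (284); y has entries in {-1,+1}."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        s = sigmoid(y * (X @ w))                      # sigma(y_n w^T x_n)
        grad = X.T @ ((1.0 - s) * y) - w / sigma_w2   # left-hand side of Eq. (284)
        r = s * (1.0 - s)                             # sigma'(y_n w^T x_n)
        H = -(X.T * r) @ X - np.eye(d) / sigma_w2     # Hessian of L(w)
        step = np.linalg.solve(H, grad)               # Newton step H^{-1} grad
        w = w - step
        if np.linalg.norm(step) < tol:
            break
    return w
```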
Having found the optimal $\mathbf{w}^*$, we can classify any test pattern $\mathbf{x}$ in terms of the posterior of its corresponding labeling $y$:

$$P(y\,|\,\mathbf{x})=\sigma\left(y\,{\mathbf{w}^*}^T\mathbf{x}\right) \qquad (285)$$
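For example, with the map_logistic and sigmoid functions sketched above and hypothetical arrays X_train, y_train, and X_test, test points could be classified by thresholding Eq. (285) at $1/2$:

```python
w_star = map_logistic(X_train, y_train)   # hypothetical training data
p_plus = sigmoid(X_test @ w_star)         # P(y=+1|x) for each test row
y_hat = np.where(p_plus > 0.5, 1, -1)     # same decision as Eq. (275)
```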
In a multi-class case with $K>2$ classes, we can still use a vector $\mathbf{w}_k$ to represent each class $C_k$ ($k=1,\dots,K$), as the direction of the class with respect to the origin in the feature space, and the inner product $\mathbf{w}_k^T\mathbf{x}$, proportional to the projection of $\mathbf{x}$ onto vector $\mathbf{w}_k$, measures the extent to which $\mathbf{x}$ belongs to $C_k$. Similar to the logistic function used in the two-class case, here the softmax function defined below is used to convert $\mathbf{w}_k^T\mathbf{x}$ into the probability that $\mathbf{x}$ belongs to $C_k$:

$$P(C_k\,|\,\mathbf{x})=\frac{\exp(\mathbf{w}_k^T\mathbf{x})}{\sum_{j=1}^K \exp(\mathbf{w}_j^T\mathbf{x})} \qquad (286)$$
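A minimal sketch of Eq. (286) follows, where W is a hypothetical $K\times d$ matrix whose $k$-th row is $\mathbf{w}_k$; the maximum is subtracted from the inputs before exponentiation for numerical stability, which leaves the result unchanged because the softmax is invariant to adding a constant to all of its inputs:

```python
import numpy as np

def softmax_probs(W, x):
    """P(C_k|x) of Eq. (286) for k = 1..K, given the K x d weight matrix W."""
    z = W @ x
    z = z - z.max()   # stability shift; cancels in the ratio
    e = np.exp(z)
    return e / e.sum()
```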