In both logistic regression and softmax regression considered previously, we convert the linear regression function $f(\mathbf{x})=\mathbf{w}^T\mathbf{x}$ into the probability for a sample $\mathbf{x}$ to belong to a class, by either the logistic function $\sigma(f)=1/(1+e^{-f})$ for binary classification, or the softmax function for multiclass classification. Also, in Gaussian process regression (GPR), we treat the regression function $f(\mathbf{x})$ as a Gaussian process. Now we consider Gaussian process classification (GPC), based on the combination of logistic/softmax regression and Gaussian process regression. We will consider binary GPC based on the logistic function in this section, and then multiclass GPC based on the softmax function in the following section.
Here in binary GPC, we assume the training samples are labeled by 1 or -1 (instead of 0) for mathematical convenience, i.e., $y_n=1$ if $\mathbf{x}_n\in C_+$ or $y_n=-1$ if $\mathbf{x}_n\in C_-$.
Similar to logistic regression, here we also convert the regression function $f(\mathbf{x})$, now a Gaussian process, into the probability for $y=1$ by the logistic function $\sigma(f(\mathbf{x}))=1/(1+e^{-f(\mathbf{x})})$. However, as a function of this random argument, $\sigma(f(\mathbf{x}))$ is also random. We therefore define $p(y=1|\mathbf{x})$ as the expectation of $\sigma(f(\mathbf{x}))$ with respect to $f$:

$$p(y=1|\mathbf{x})=\mathrm{E}_f[\sigma(f(\mathbf{x}))]=\int\sigma(f)\,p(f|\mathbf{x})\,df \tag{265}$$
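As a quick numerical illustration of Eq. (265) (not part of the original derivation), the expectation can be estimated by Monte Carlo sampling for a scalar Gaussian $f\sim\mathcal{N}(\mu,s_f^2)$ with assumed values of $\mu$ and $s_f$; note that in general $\mathrm{E}[\sigma(f)]\neq\sigma(\mathrm{E}[f])$:

Mu=1; Sf=2;                      % assumed mean and standard deviation of f(x)
f=Mu+Sf*randn(1e6,1);            % samples of the latent function value
pMC=mean(1./(1+exp(-f)));        % Monte Carlo estimate of E[sigma(f)] = p(y=1|x)
pMean=1/(1+exp(-Mu));            % logistic function of the mean alone, for comparison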
For the training set $\mathcal{D}=\{(\mathbf{x}_n,\,y_n),\;n=1,\cdots,N\}=\{\mathbf{X},\mathbf{y}\}$, we define $\mathbf{f}=[f(\mathbf{x}_1),\cdots,f(\mathbf{x}_N)]^T=[f_1,\cdots,f_N]^T$, and express the probability above in vector form (with $\sigma(\cdot)$ applied elementwise):

$$p(\mathbf{y}=\mathbf{1}|\mathbf{X})=\mathrm{E}_{\mathbf{f}}[\sigma(\mathbf{f})]=\int\sigma(\mathbf{f})\,p(\mathbf{f}|\mathbf{X})\,d\mathbf{f} \tag{266}$$

To evaluate such probabilities for test points, we need the posterior of the latent function values given the training data, obtained by Bayes' theorem:

$$p(\mathbf{f}|\mathcal{D})=p(\mathbf{f}|\mathbf{X},\mathbf{y})=\frac{p(\mathbf{y}|\mathbf{f})\,p(\mathbf{f}|\mathbf{X})}{p(\mathbf{y}|\mathbf{X})}\propto p(\mathbf{y}|\mathbf{f})\,p(\mathbf{f}|\mathbf{X}) \tag{267}$$
We first find the likelihood $p(\mathbf{y}|\mathbf{f})$. The Gaussian process $f(\mathbf{x})$ is mapped by the logistic function into the probability for $y=1$ if $\mathbf{x}\in C_+$, or $y=-1$ if $\mathbf{x}\in C_-$:

$$p(y_n=1|f_n)=\sigma(f_n)=\frac{1}{1+e^{-f_n}},\qquad p(y_n=-1|f_n)=1-\sigma(f_n)=\sigma(-f_n)=\frac{1}{1+e^{f_n}} \tag{268}$$

As $y_n\in\{1,-1\}$, the two cases can be written as the single expression $p(y_n|f_n)=\sigma(y_nf_n)$, and, assuming the labels are conditionally independent given $\mathbf{f}$, the likelihood of the whole training set is

$$p(\mathbf{y}|\mathbf{f})=\prod_{n=1}^N p(y_n|f_n)=\prod_{n=1}^N\sigma(y_nf_n)=\prod_{n=1}^N\frac{1}{1+e^{-y_nf_n}} \tag{270}$$
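For reference, the log of the likelihood in Eq. (270) takes a simple form, which will be convenient when its gradient and Hessian are derived below. A minimal numerical check, assuming f and y are N-by-1 column vectors with example (assumed) values and labels in {+1,-1}:

f=[0.5; -1.2; 2.0];              % example latent function values (assumed)
y=[1; -1; 1];                    % example labels in {+1,-1}
logLik=-sum(log(1+exp(-y.*f)));  % log p(y|f) = sum_n log sigma(y_n*f_n)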
We then consider the prior $p(\mathbf{f}|\mathbf{X})$, which is assumed to be a zero-mean Gaussian $\mathcal{N}(\mathbf{0},\mathbf{K})$. Here the covariance matrix $\mathbf{K}=\mathrm{cov}(\mathbf{f},\mathbf{f})$ can be constructed based on the training set $\mathbf{X}$, the same as in GPR. Specifically, the covariance between $f(\mathbf{x}_m)$ and $f(\mathbf{x}_n)$ in the mth row and nth column of $\mathbf{K}$ is modeled by the squared exponential (SE) kernel with width parameter $s$:

$$K_{mn}=\mathrm{cov}(f(\mathbf{x}_m),f(\mathbf{x}_n))=k(\mathbf{x}_m,\mathbf{x}_n)=\exp\left(-\frac{\|\mathbf{x}_m-\mathbf{x}_n\|^2}{2s^2}\right) \tag{271}$$
Also, as discussed in GPR, here the parameter $s$ in the SE kernel controls the smoothness of the function. If $\|\mathbf{x}_m-\mathbf{x}_n\|\ll s$, the value of the SE kernel approaches 1, then $f(\mathbf{x}_m)$ and $f(\mathbf{x}_n)$ are highly correlated and $f(\mathbf{x})$ is very smooth; but if $\|\mathbf{x}_m-\mathbf{x}_n\|\gg s$, the SE kernel approaches 0, then $f(\mathbf{x}_m)$ and $f(\mathbf{x}_n)$ are not correlated and $f(\mathbf{x})$ is no longer smooth. By adjusting $s$, a proper tradeoff can be made between overfitting and underfitting.
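The covariance matrices in the Matlab code later in this section are generated by a function Kernel, stated there to be the same as the one used for Gaussian process regression; its listing is not repeated here, but a minimal sketch consistent with Eq. (271), assuming the data points are stored as columns and the kernel width s is hard-coded, might look as follows:

function K=Kernel(X1,X2)
    % squared exponential kernel of Eq. (271) between the columns of X1 and X2
    % X1: d-by-m array of m points;  X2: d-by-n array of n points
    s=1;                             % kernel width (assumed value)
    m=size(X1,2); n=size(X2,2);
    K=zeros(m,n);
    for i=1:m
        for j=1:n
            d=X1(:,i)-X2(:,j);       % difference between the two points
            K(i,j)=exp(-(d'*d)/(2*s^2));
        end
    end
end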
Having found both the likelihood $p(\mathbf{y}|\mathbf{f})$ and the prior $p(\mathbf{f}|\mathbf{X})=\mathcal{N}(\mathbf{0},\mathbf{K})$, we can get the posterior in Eq. (267):

$$p(\mathbf{f}|\mathcal{D})\propto p(\mathbf{y}|\mathbf{f})\,p(\mathbf{f}|\mathbf{X})=\left[\prod_{n=1}^N\frac{1}{1+e^{-y_nf_n}}\right]\frac{1}{(2\pi)^{N/2}|\mathbf{K}|^{1/2}}\exp\left(-\frac{1}{2}\mathbf{f}^T\mathbf{K}^{-1}\mathbf{f}\right) \tag{273}$$

Due to the logistic factors in the likelihood, this posterior is not Gaussian, but it can be approximated by a Gaussian $p(\mathbf{f}|\mathcal{D})\approx\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$ (the Laplace approximation), whose mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$ are to be determined.
Now that the posterior $p(\mathbf{f}|\mathcal{D})$ is approximated as a Gaussian $\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$, we can find the gradient vector $\mathbf{g}(\mathbf{f})$ and Hessian matrix $\mathbf{H}(\mathbf{f})$ of the log posterior (see Appendix):

$$\mathbf{g}(\mathbf{f})=\frac{d}{d\mathbf{f}}\log p(\mathbf{f}|\mathcal{D})=-\boldsymbol{\Sigma}^{-1}(\mathbf{f}-\boldsymbol{\mu}) \tag{276}$$

$$\mathbf{H}(\mathbf{f})=\frac{d^2}{d\mathbf{f}\,d\mathbf{f}^T}\log p(\mathbf{f}|\mathcal{D})=-\boldsymbol{\Sigma}^{-1} \tag{278}$$
On the other hand, we can also find $\mathbf{g}(\mathbf{f})$ and $\mathbf{H}(\mathbf{f})$ by directly taking the first and second order derivatives of the log posterior $\log p(\mathbf{f}|\mathcal{D})=\log p(\mathbf{y}|\mathbf{f})-\frac{1}{2}\mathbf{f}^T\mathbf{K}^{-1}\mathbf{f}+\text{constant}$ in Eq. (273):

$$\mathbf{g}(\mathbf{f})=\frac{d}{d\mathbf{f}}\log p(\mathbf{y}|\mathbf{f})-\mathbf{K}^{-1}\mathbf{f}=\mathbf{w}-\mathbf{K}^{-1}\mathbf{f} \tag{280}$$

$$\mathbf{H}(\mathbf{f})=\frac{d^2}{d\mathbf{f}\,d\mathbf{f}^T}\log p(\mathbf{y}|\mathbf{f})-\mathbf{K}^{-1}=\mathbf{W}-\mathbf{K}^{-1} \tag{281}$$

where, using the likelihood in Eq. (270), the gradient vector $\mathbf{w}$ and the diagonal Hessian matrix $\mathbf{W}$ of $\log p(\mathbf{y}|\mathbf{f})$ have components

$$w_n=\frac{\partial}{\partial f_n}\log\sigma(y_nf_n)=\frac{y_n\,e^{-y_nf_n}}{1+e^{-y_nf_n}},\qquad W_{nn}=\frac{\partial^2}{\partial f_n^2}\log\sigma(y_nf_n)=\frac{-e^{-y_nf_n}}{(1+e^{-y_nf_n})^2}$$
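These expressions for $\mathbf{w}$ and $\mathbf{W}$ are exactly what the function findPosteriorMean listed further below computes. If desired, the gradient in Eq. (280) can be verified numerically by central differences; this is an optional sanity check, assuming f, y, and K are already defined:

e=exp(-y.*f);
w=y.*e./(1+e);                          % gradient of log p(y|f), componentwise
g=w-K\f;                                % Eq. (280): gradient of log posterior
logPost=@(f) -sum(log(1+exp(-y.*f)))-f'*(K\f)/2;   % log p(f|D) up to a constant
gNum=zeros(size(f));
h=1e-6;                                 % step size of the finite difference
for n=1:length(f)
    d=zeros(size(f)); d(n)=h;
    gNum(n)=(logPost(f+d)-logPost(f-d))/(2*h);     % central difference
end
norm(g-gNum)                            % should be close to zero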
Equating the two expressions for $\mathbf{g}(\mathbf{f})$ and $\mathbf{H}(\mathbf{f})$ in Eqs. (276) and (278) to those in Eqs. (280) and (281), we get the covariance and mean of the Gaussian approximating the posterior:

$$\boldsymbol{\Sigma}=-\mathbf{H}^{-1}=(\mathbf{K}^{-1}-\mathbf{W})^{-1} \tag{285}$$

$$\boldsymbol{\mu}=\hat{\mathbf{f}}\quad\text{satisfying}\quad\mathbf{g}(\hat{\mathbf{f}})=\mathbf{w}(\hat{\mathbf{f}})-\mathbf{K}^{-1}\hat{\mathbf{f}}=\mathbf{0},\quad\text{i.e.}\quad\hat{\mathbf{f}}=\mathbf{K}\,\mathbf{w}(\hat{\mathbf{f}}) \tag{286}$$

As $\mathbf{w}$ depends on $\hat{\mathbf{f}}$, Eq. (286) cannot be solved in closed form and must be solved iteratively.
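One way to do so, and the one used by the function findPosteriorMean listed further below, is Newton's method. Substituting the gradient and Hessian of Eqs. (280) and (281) into the standard Newton update gives the iteration used in that code:

$$\mathbf{f}^{(k+1)}=\mathbf{f}^{(k)}-\mathbf{H}^{-1}(\mathbf{f}^{(k)})\,\mathbf{g}(\mathbf{f}^{(k)})=\mathbf{f}^{(k)}-(\mathbf{W}-\mathbf{K}^{-1})^{-1}\left(\mathbf{w}-\mathbf{K}^{-1}\mathbf{f}^{(k)}\right)=(\mathbf{K}^{-1}-\mathbf{W})^{-1}\left(\mathbf{w}-\mathbf{W}\,\mathbf{f}^{(k)}\right)$$

with $\mathbf{w}$ and $\mathbf{W}$ evaluated at $\mathbf{f}^{(k)}$; the iteration is repeated until $\|\mathbf{f}^{(k+1)}-\mathbf{f}^{(k)}\|$ is negligible, and the limit is taken as $\hat{\mathbf{f}}$.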
Having found $\hat{\mathbf{f}}$, and thereby the Gaussian approximation $p(\mathbf{f}|\mathcal{D})\approx\mathcal{N}(\hat{\mathbf{f}},\,(\mathbf{K}^{-1}-\mathbf{W})^{-1})$, we can proceed to carry out classification of any test set $\mathbf{X}_*$ based on the probability (similar to Eq. (266)):

$$p(\mathbf{y}_*=\mathbf{1}|\mathbf{X}_*,\mathcal{D})=\int\sigma(\mathbf{f}_*)\,p(\mathbf{f}_*|\mathbf{X}_*,\mathcal{D})\,d\mathbf{f}_* \tag{289}$$

where the distribution of the latent function values $\mathbf{f}_*$ at the test points is obtained by marginalizing over the training values $\mathbf{f}$:

$$p(\mathbf{f}_*|\mathbf{X}_*,\mathcal{D})=\int p(\mathbf{f}_*|\mathbf{f},\mathbf{X},\mathbf{X}_*)\,p(\mathbf{f}|\mathcal{D})\,d\mathbf{f} \tag{290}$$

As both factors in the integrand are Gaussian (the second by the approximation above), $p(\mathbf{f}_*|\mathbf{X}_*,\mathcal{D})$ is also Gaussian. In terms of $\mathbf{K}_*=\mathrm{cov}(\mathbf{f},\mathbf{f}_*)$ and $\mathbf{K}_{**}=\mathrm{cov}(\mathbf{f}_*,\mathbf{f}_*)$, and in the same way as in GPR, its mean is

$$\boldsymbol{\mu}_{*|\mathcal{D}}=\mathbf{K}_*^T\mathbf{K}^{-1}\hat{\mathbf{f}} \tag{292}$$

$$=\mathbf{K}_*^T\,\mathbf{w} \tag{293}$$

where the second step uses $\hat{\mathbf{f}}=\mathbf{K}\,\mathbf{w}(\hat{\mathbf{f}})$ from Eq. (286), and its covariance is

$$\boldsymbol{\Sigma}_{*|\mathcal{D}}=\mathbf{K}_{**}-\mathbf{K}_*^T(\mathbf{K}-\mathbf{W}^{-1})^{-1}\mathbf{K}_* \tag{294}$$
Now that the posterior $p(\mathbf{f}_*|\mathbf{X}_*,\mathcal{D})$ is approximated as a Gaussian with mean and covariance given in Eqs. (292) and (294), we can finally carry out Eq. (266) to find the probability for the test points in $\mathbf{X}_*$ to belong to class $C_+$. In the implementation below the expectation is approximated by evaluating the logistic function at the mean of $\mathbf{f}_*$ (elementwise):

$$p(\mathbf{y}_*=\mathbf{1}|\mathbf{X}_*,\mathcal{D})\approx\sigma(\boldsymbol{\mu}_{*|\mathcal{D}})=\frac{1}{1+e^{-\boldsymbol{\mu}_{*|\mathcal{D}}}} \tag{295}$$
The Matlab code for the essential parts of this algorithm is listed below. Here X and y are for the training data $\mathcal{D}=\{\mathbf{X},\mathbf{y}\}$, and Xs is an array composed of the test vectors in $\mathbf{X}_*$. First, the essential segment of the main program listed below takes in the training and test data and generates the covariance matrices $\mathbf{K}$, $\mathbf{K}_*$, and $\mathbf{K}_{**}$, represented by K, Ks, and Kss, respectively. The function Kernel is exactly the same as the one used for Gaussian process regression. This code segment then further calls a function findPosteriorMean, which finds the mean and covariance of the posterior $p(\mathbf{f}|\mathcal{D})$ based on the covariance K of the training data and the labels y, and computes the mean $\boldsymbol{\mu}_{*|\mathcal{D}}$ and covariance $\boldsymbol{\Sigma}_{*|\mathcal{D}}$ of $p(\mathbf{f}_*|\mathbf{X}_*,\mathcal{D})$ based on Eqs. (292) and (294), respectively. The sign function of $\boldsymbol{\mu}_{*|\mathcal{D}}$ indicates the classification of the test data points in $\mathbf{X}_*$.
K=Kernel(X,X);                              % cov(f,f), covariance of prior p(f|X)
Ks=Kernel(X,Xs);                            % cov(f,f_*)
Kss=Kernel(Xs,Xs);                          % cov(f_*,f_*)
[meanf Sigmaf W w]=findPosteriorMean(K,y);  % get mean/covariance of p(f|D), W, w
meanfD=Ks'*w;                               % mean of p(f_*|X_*,D), Eq. (293)
SigmafD=Kss-Ks'*inv(K-inv(W))*Ks;           % covariance of p(f_*|X_*,D), Eq. (294)
ys=sign(meanfD);                            % binary classification of test data
p=1./(1+exp(-meanfD));                      % p(y_*=1|X_*,D) as logistic function, Eq. (295)
The function findPosteriorMean listed below uses Newton's method to find the mean and covariance of the posterior $p(\mathbf{f}|\mathcal{D})$ of the latent function $\mathbf{f}$ based on the training data, and returns them in meanf and covf, respectively, together with w and W for the gradient and Hessian of the log likelihood $\log p(\mathbf{y}|\mathbf{f})$, to be used for computing $\boldsymbol{\mu}_{*|\mathcal{D}}$ and $\boldsymbol{\Sigma}_{*|\mathcal{D}}$.
function [meanf covf W w]=findPosteriorMean(K,y)
    % K:     covariance of prior p(f|X)
    % y:     labeling of training data X
    % w:     gradient vector of log p(y|f)
    % W:     Hessian matrix of log p(y|f)
    % meanf: mean of p(f|X,y)
    % covf:  covariance of p(f|X,y)
    n=length(y);                    % number of training samples
    f0=zeros(n,1);                  % initial value of latent function
    f=f0;
    er=1;
    while er > 10^(-9)              % Newton's method to get f that maximizes p(f|X,y)
        e=exp(-y.*f);
        w=y.*e./(1+e);              % update w
        W=diag(-e./(1+e).^2);       % update W
        f=inv(inv(K)-W)*(w-W*f0);   % iteration to get f from previous f0
        er=norm(f-f0);              % difference between two consecutive f's
        f0=f;                       % update f
    end
    meanf=f;                        % mean of p(f|X,y)
    covf=inv(inv(K)-W);             % covariance of p(f|X,y)
end
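As a usage sketch (not part of the original listing), a driver script along the following lines could generate synthetic two-class training data in the spirit of Example 1 below, together with a grid of test points, before running the two code segments above; the cluster centers, spreads, and grid range here are assumed values, not those of Eq. (296):

rng(0);                                  % fix the random seed for repeatability
Xp=0.8*randn(2,100)+[2;2]*ones(1,100);   % 100 points of class C+ around (2,2) (assumed)
Xm=0.8*randn(2,80)+[-2;-2]*ones(1,80);   % 80 points of class C- around (-2,-2) (assumed)
X=[Xp Xm];                               % training inputs, one column per sample
y=[ones(100,1); -ones(80,1)];            % labels in {+1,-1}
[u,v]=meshgrid(-5:0.25:5,-5:0.25:5);     % grid covering the 2-D region of interest
Xs=[u(:)'; v(:)'];                       % test inputs, one column per grid point
% ... then run the main code segment above to obtain ys (class labels) and p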
Example 1 The GPC method is trained by the two classes shown in the figure below, represented by 100 red points and 80 blue points drawn from two Gaussian distributions $\mathcal{N}(\boldsymbol{\mu}_+,\boldsymbol{\Sigma}_+)$ and $\mathcal{N}(\boldsymbol{\mu}_-,\boldsymbol{\Sigma}_-)$, whose means and covariance matrices are given in Eq. (296).
Three classification results are shown in the figure below, corresponding to different values of the parameter $s$ used in the kernel function. On the left, the 2-D space is partitioned into red and blue regions corresponding to the two classes, based on the sign of $\boldsymbol{\mu}_{*|\mathcal{D}}$; on the right, the 3-D plots of $\boldsymbol{\mu}_{*|\mathcal{D}}$, representing the estimated (unnormalized) probability for each point to belong to either class, are shown, to be compared with the original Gaussian distributions from which the training samples were drawn.
We see that wherever there is evidence represented by the training samples of either class in red or blue, there are high probabilities for the neighboring points to belong to $C_+$ or $C_-$, represented by the positive or negative peaks in the 3-D plots. Data points far away from any evidence have low probability of belonging to either class.
We make the following observations for three different values of the parameter $s$ in the SE kernel: