In both logistic regression and softmax regression considered previously, we convert the output of the linear regression function into the probability of the class label $y$, by either the logistic function for binary classification or the softmax function for multiclass classification. Also, in Gaussian process regression (GPR), we treat the regression function $f(\mathbf{x})$ as a Gaussian process. Now we consider Gaussian process classification (GPC), which combines logistic/softmax regression with Gaussian process regression. We will consider binary GPC based on the logistic function in this section, and then multiclass GPC based on the softmax function in the following section.
Here in binary GPC, we assume the training samples are labeled by $y_n=1$ or $y_n=-1$ (instead of 1 or 0) for mathematical convenience, i.e., $y_n=1$ if $\mathbf{x}_n$ belongs to one of the two classes and $y_n=-1$ if it belongs to the other; with this labeling the probabilities of both labels can later be written as a single logistic expression. Similar to logistic regression, here we also convert the regression function $f(\mathbf{x})$, now a Gaussian process, into the probability for $y=1$ by the logistic function $\sigma(f)=1/(1+e^{-f})$. However, as a function of the random argument $f(\mathbf{x})$, $\sigma(f(\mathbf{x}))$ is also random. We therefore define $p(y=1|\mathbf{x},\mathcal{D})$ as the expectation of $\sigma(f(\mathbf{x}))$ with respect to $f$:

\[ p(y=1\mid\mathbf{x},\mathcal{D}) = E_f\big[\sigma(f(\mathbf{x}))\big] = \int \sigma(f)\,p(f\mid\mathbf{x},\mathcal{D})\,df \tag{265} \]
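As the expectation in Eq. (265) is over the randomness of $f$, it is in general not the same as simply plugging the mean of $f$ into the logistic function. A minimal Matlab sketch (not part of the original text; the mean and standard deviation below are made-up values) illustrating the difference for a scalar Gaussian $f$:

mu=1; s=2;                        % assumed mean and standard deviation of f at some input x
fs=mu+s*randn(100000,1);          % samples of the Gaussian latent value f
pExpect=mean(1./(1+exp(-fs)));    % Monte Carlo estimate of E[sigma(f)] as in Eq. (265), about 0.65
pPlugin=1/(1+exp(-mu));           % sigma(E[f]) for comparison, about 0.73

The larger the variance of $f$, the more the expectation is pulled toward $0.5$ relative to the plug-in value.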
For the training set $\mathcal{D}=\{\mathbf{X},\mathbf{y}\}$, we define $\mathbf{f}=[f(\mathbf{x}_1),\dots,f(\mathbf{x}_N)]^T$, and express the probability above in vector form:

\[ p(\mathbf{y}=\mathbf{1}\mid\mathbf{X},\mathcal{D}) = \int \sigma(\mathbf{f})\,p(\mathbf{f}\mid\mathcal{D})\,d\mathbf{f} \tag{266} \]

where $\sigma(\cdot)$ is applied componentwise and $p(\mathbf{f}|\mathcal{D})$ is the posterior, which can be found in terms of the likelihood $p(\mathbf{y}|\mathbf{f})$ and the prior $p(\mathbf{f}|\mathbf{X})$ based on Bayes' theorem:

\[ p(\mathbf{f}\mid\mathcal{D}) = \frac{p(\mathbf{y}\mid\mathbf{f})\,p(\mathbf{f}\mid\mathbf{X})}{p(\mathbf{y}\mid\mathbf{X})} \propto p(\mathbf{y}\mid\mathbf{f})\,p(\mathbf{f}\mid\mathbf{X}) \tag{267} \]

As always, the denominator $p(\mathbf{y}|\mathbf{X})$, independent of $\mathbf{f}$, is dropped.
We first find the likelihood $p(\mathbf{y}|\mathbf{f})$. The Gaussian process value $f_n=f(\mathbf{x}_n)$ is mapped by the logistic function into the probability for $y_n=1$ or for $y_n=-1$:

\[ p(y_n=1\mid f_n)=\sigma(f_n)=\frac{1}{1+e^{-f_n}},\qquad p(y_n=-1\mid f_n)=1-\sigma(f_n)=\sigma(-f_n) \tag{268} \]

Because of the symmetry $1-\sigma(f)=\sigma(-f)$, the two cases can be written as a single expression $p(y_n|f_n)=\sigma(y_n f_n)$, which is the reason for the $\pm 1$ labeling. Assuming the training samples are independent given $\mathbf{f}$, the likelihood of the whole training set is

\[ p(\mathbf{y}\mid\mathbf{f})=\prod_{n=1}^N p(y_n\mid f_n)=\prod_{n=1}^N \sigma(y_n f_n)=\prod_{n=1}^N \frac{1}{1+e^{-y_n f_n}} \tag{270} \]
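In code, the log of the likelihood in Eq. (270) takes a particularly simple form. A small sketch (the variable names are ours, not from the program listed later), assuming y and f are $N\times 1$ vectors of labels and latent values:

pn=1./(1+exp(-y.*f));       % per-sample probabilities sigma(y_n f_n), Eq. (268)
logLik=sum(log(pn));        % log p(y|f) = sum_n log sigma(y_n f_n), the log of Eq. (270)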
We then consider the prior $p(\mathbf{f}|\mathbf{X})$, which is assumed to be a zero-mean Gaussian $\mathcal{N}(\mathbf{0},\mathbf{K})$. Here the covariance matrix $\mathbf{K}$ can be constructed based on the training set $\mathbf{X}=[\mathbf{x}_1,\dots,\mathbf{x}_N]$, the same as in GPR. Specifically, the covariance between $f(\mathbf{x}_m)$ and $f(\mathbf{x}_n)$, in the $m$th row and $n$th column of $\mathbf{K}$, is modeled by the squared exponential (SE) kernel:

\[ K_{mn}=k(\mathbf{x}_m,\mathbf{x}_n)=\exp\!\left(-\frac{\|\mathbf{x}_m-\mathbf{x}_n\|^2}{2\sigma^2}\right) \tag{271} \]

Also, as discussed in GPR, here the parameter $\sigma$ in the SE kernel controls the smoothness of the function. If $\sigma$ is large compared to the distance $\|\mathbf{x}_m-\mathbf{x}_n\|$, the value of the SE kernel approaches 1, then $f(\mathbf{x}_m)$ and $f(\mathbf{x}_n)$ are highly correlated and $f$ is very smooth; but if $\sigma$ is small, the SE kernel approaches 0, then $f(\mathbf{x}_m)$ and $f(\mathbf{x}_n)$ are not correlated and $f$ is no longer smooth. By adjusting $\sigma$, a proper tradeoff can be made between overfitting and underfitting.
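The Kernel function itself is said to be the same as in the GPR section and is not listed below; the following is a minimal sketch consistent with the SE kernel in Eq. (271), assuming each column of the input arrays is one sample vector and an arbitrary value for the width parameter:

function K=Kernel(X1,X2)
% squared-exponential kernel of Eq. (271); columns of X1 and X2 are sample vectors
sigma=1;                                   % kernel width (assumed value; to be tuned)
n1=size(X1,2); n2=size(X2,2);
K=zeros(n1,n2);
for m=1:n1
    for n=1:n2
        d=X1(:,m)-X2(:,n);                 % difference between the two samples
        K(m,n)=exp(-(d'*d)/(2*sigma^2));   % exp(-||x_m-x_n||^2/(2*sigma^2))
    end
end
end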
Having found both the likelihood $p(\mathbf{y}|\mathbf{f})$ and the prior $p(\mathbf{f}|\mathbf{X})=\mathcal{N}(\mathbf{0},\mathbf{K})$, we can get the posterior in Eq. (267):

\[ p(\mathbf{f}\mid\mathcal{D}) \propto p(\mathbf{y}\mid\mathbf{f})\,p(\mathbf{f}\mid\mathbf{X}) = \left[\prod_{n=1}^N \frac{1}{1+e^{-y_n f_n}}\right]\mathcal{N}(\mathbf{f}\mid\mathbf{0},\mathbf{K}) \tag{273} \]
Due to the logistic factors in the likelihood, this posterior is not Gaussian, but it can be approximated by a Gaussian $q(\mathbf{f})=\mathcal{N}(\mathbf{f}\mid\hat{\mathbf{f}},\boldsymbol{\Sigma}_f)$ with mean $\hat{\mathbf{f}}$ and covariance $\boldsymbol{\Sigma}_f$ to be determined. Now that the posterior $p(\mathbf{f}|\mathcal{D})$ is approximated as a Gaussian, we can find the gradient vector and Hessian matrix of its log (see Appendix):

\[ \frac{\partial}{\partial\mathbf{f}}\log q(\mathbf{f}) = -\boldsymbol{\Sigma}_f^{-1}(\mathbf{f}-\hat{\mathbf{f}}) \tag{275} \]
\[ \frac{\partial^2}{\partial\mathbf{f}\,\partial\mathbf{f}^T}\log q(\mathbf{f}) = -\boldsymbol{\Sigma}_f^{-1} \tag{276} \]

On the other hand, we can also find the gradient and Hessian by directly taking the first and second order derivatives of the log posterior in Eq. (273),

\[ \log p(\mathbf{f}\mid\mathcal{D}) = \sum_{n=1}^N \log\sigma(y_n f_n) - \frac{1}{2}\mathbf{f}^T\mathbf{K}^{-1}\mathbf{f} + \text{const} \]

to get

\[ \frac{\partial}{\partial\mathbf{f}}\log p(\mathbf{f}\mid\mathcal{D}) = \mathbf{w}-\mathbf{K}^{-1}\mathbf{f} \tag{277} \]
\[ \frac{\partial^2}{\partial\mathbf{f}\,\partial\mathbf{f}^T}\log p(\mathbf{f}\mid\mathcal{D}) = \mathbf{W}-\mathbf{K}^{-1} \tag{278} \]

where we have defined $\mathbf{w}$ and $\mathbf{W}$ as, respectively, the gradient vector and Hessian matrix of the log likelihood $\log p(\mathbf{y}|\mathbf{f})$:

\[ \mathbf{w}=\frac{\partial}{\partial\mathbf{f}}\log p(\mathbf{y}\mid\mathbf{f}),\qquad w_n=\frac{\partial}{\partial f_n}\log\sigma(y_n f_n)=\frac{y_n e^{-y_n f_n}}{1+e^{-y_n f_n}} \tag{280} \]
\[ \mathbf{W}=\frac{\partial^2}{\partial\mathbf{f}\,\partial\mathbf{f}^T}\log p(\mathbf{y}\mid\mathbf{f})=\mathrm{diag}(W_{11},\dots,W_{NN}),\qquad W_{nn}=\frac{\partial^2}{\partial f_n^2}\log\sigma(y_n f_n)=\frac{-e^{-y_n f_n}}{(1+e^{-y_n f_n})^2} \tag{281} \]

Equating the two expressions for the Hessian in Eqs. (276) and (278) we get the covariance of the approximating Gaussian:

\[ \boldsymbol{\Sigma}_f = (\mathbf{K}^{-1}-\mathbf{W})^{-1} \]

Equating the two expressions for the gradient in Eqs. (275) and (277) we get $-\boldsymbol{\Sigma}_f^{-1}(\mathbf{f}-\hat{\mathbf{f}})=\mathbf{w}-\mathbf{K}^{-1}\mathbf{f}$, i.e.,

\[ \hat{\mathbf{f}} = \mathbf{f}+\boldsymbol{\Sigma}_f(\mathbf{w}-\mathbf{K}^{-1}\mathbf{f}) \tag{285} \]
\[ \hat{\mathbf{f}} = (\mathbf{K}^{-1}-\mathbf{W})^{-1}(\mathbf{w}-\mathbf{W}\mathbf{f}) \tag{286} \]

As $\mathbf{w}$ and $\mathbf{W}$ themselves depend on $\mathbf{f}$, Eq. (286) has to be solved iteratively: the right-hand side is evaluated at the current estimate of $\mathbf{f}$ to produce the next one, until convergence to the mean $\hat{\mathbf{f}}$.
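The iteration in Eq. (286) can also be recognized as a standard Newton–Raphson step for maximizing the log posterior, which is how the Matlab function below describes it. A short verification (this algebra is our addition), using the gradient and Hessian of Eqs. (277) and (278):

\begin{align*}
\mathbf{f}^{(k+1)} &= \mathbf{f}^{(k)} - \left[\frac{\partial^2 \log p(\mathbf{f}\mid\mathcal{D})}{\partial\mathbf{f}\,\partial\mathbf{f}^T}\right]^{-1}\frac{\partial \log p(\mathbf{f}\mid\mathcal{D})}{\partial\mathbf{f}}
= \mathbf{f}^{(k)} - (\mathbf{W}-\mathbf{K}^{-1})^{-1}\big(\mathbf{w}-\mathbf{K}^{-1}\mathbf{f}^{(k)}\big)\\
&= (\mathbf{K}^{-1}-\mathbf{W})^{-1}\big[(\mathbf{K}^{-1}-\mathbf{W})\mathbf{f}^{(k)}+\mathbf{w}-\mathbf{K}^{-1}\mathbf{f}^{(k)}\big]
= (\mathbf{K}^{-1}-\mathbf{W})^{-1}\big(\mathbf{w}-\mathbf{W}\mathbf{f}^{(k)}\big)
\end{align*}

which matches the update f=inv(inv(K)-W)*(w-W*f0) in the function findPosteriorMean listed further below.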
Having found the Gaussian approximation of the posterior $p(\mathbf{f}|\mathcal{D})$, we can proceed to carry out classification of any test set $\mathbf{X}_*$ based on the probability (similar to Eq. (266)):

\[ p(\mathbf{y}_*=\mathbf{1}\mid\mathbf{X}_*,\mathcal{D}) = \int \sigma(\mathbf{f}_*)\,p(\mathbf{f}_*\mid\mathbf{X}_*,\mathcal{D})\,d\mathbf{f}_* \tag{289} \]

where the distribution of the latent values $\mathbf{f}_*$ at the test points is obtained by marginalizing over the latent values $\mathbf{f}$ at the training points:

\[ p(\mathbf{f}_*\mid\mathbf{X}_*,\mathcal{D}) = \int p(\mathbf{f}_*\mid\mathbf{X}_*,\mathbf{X},\mathbf{f})\,p(\mathbf{f}\mid\mathcal{D})\,d\mathbf{f} \tag{290} \]

Here $p(\mathbf{f}_*|\mathbf{X}_*,\mathbf{X},\mathbf{f})$ is the same Gaussian conditional used in GPR, based on the joint zero-mean Gaussian prior of $\mathbf{f}$ and $\mathbf{f}_*$ with covariance blocks $\mathbf{K}=\mathrm{cov}(\mathbf{f},\mathbf{f})$, $\mathbf{K}_*=\mathrm{cov}(\mathbf{f},\mathbf{f}_*)$, and $\mathbf{K}_{**}=\mathrm{cov}(\mathbf{f}_*,\mathbf{f}_*)$. Carrying out the integral with the Gaussian approximation $\mathcal{N}(\hat{\mathbf{f}},\boldsymbol{\Sigma}_f)$ of $p(\mathbf{f}|\mathcal{D})$, the result is again a Gaussian, with mean

\[ \boldsymbol{\mu}_* = \mathbf{K}_*^T\mathbf{K}^{-1}\hat{\mathbf{f}} = \mathbf{K}_*^T\mathbf{w} \tag{292} \]

where the second equality follows from $\hat{\mathbf{f}}=\mathbf{K}\mathbf{w}$ (the gradient in Eq. (277) vanishes at the mean), and covariance

\[ \boldsymbol{\Sigma}_* = \mathbf{K}_{**}-\mathbf{K}_*^T(\mathbf{K}-\mathbf{W}^{-1})^{-1}\mathbf{K}_* \tag{294} \]

Now that the posterior $p(\mathbf{f}_*|\mathbf{X}_*,\mathcal{D})$ is approximated as a Gaussian with mean and covariance given in Eqs. (292) and (294), we can finally carry out Eq. (266) to find the probability for the test points in $\mathbf{X}_*$ to belong to class $y_*=1$:

\[ p(\mathbf{y}_*=\mathbf{1}\mid\mathbf{X}_*,\mathcal{D}) = \int \sigma(\mathbf{f}_*)\,\mathcal{N}(\mathbf{f}_*\mid\boldsymbol{\mu}_*,\boldsymbol{\Sigma}_*)\,d\mathbf{f}_* \tag{295} \]
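The integral in Eq. (295) has no closed form. One commonly used approximation (not part of the derivation above) replaces the logistic function by a scaled probit function, which for an individual test point with predictive mean $\mu_*$ and variance $s_*^2$ (a diagonal element of $\boldsymbol{\Sigma}_*$) gives

\[ \int \sigma(f_*)\,\mathcal{N}(f_*\mid\mu_*,s_*^2)\,df_* \;\approx\; \sigma\!\left(\frac{\mu_*}{\sqrt{1+\pi s_*^2/8}}\right) \]

The Matlab code listed below simply uses the plug-in value $\sigma(\mu_*)$ instead; this gives the same sign, and hence the same hard classification, but a probability further from $0.5$ than the averaged value.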
The Matlab code for the essential parts of this algorithm is listed below. Here X and y hold the training data $\mathcal{D}=\{\mathbf{X},\mathbf{y}\}$, and Xs is an array composed of the test vectors in $\mathbf{X}_*$. First, the essential segment of the main program listed below takes in the training and test data and generates the covariance matrices $\mathbf{K}$, $\mathbf{K}_*$, and $\mathbf{K}_{**}$, represented by K, Ks, and Kss, respectively. The function Kernel is exactly the same as the one used for Gaussian process regression. This code segment then further calls a function findPosteriorMean, which finds the mean and covariance of the posterior $p(\mathbf{f}|\mathcal{D})$ based on the covariance $\mathbf{K}$ of the training data and the labels $\mathbf{y}$; the segment then computes the mean $\boldsymbol{\mu}_*$ and covariance $\boldsymbol{\Sigma}_*$ of $p(\mathbf{f}_*|\mathbf{X}_*,\mathcal{D})$ based on Eqs. (292) and (294), respectively. The sign function of the mean $\boldsymbol{\mu}_*$ indicates the classification of the test data points in $\mathbf{X}_*$.
K=Kernel(X,X);        % cov(f,f), covariance of the prior p(f|X)
Ks=Kernel(X,Xs);      % cov(f,f_*), covariance between training and test points
Kss=Kernel(Xs,Xs);    % cov(f_*,f_*), covariance of the test points
[meanf covf W w]=findPosteriorMean(K,y);   % mean/covariance of p(f|D), Hessian W and gradient w of log p(y|f)
meanfD=Ks'*w;                              % mean of p(f_*|X_*,D), Eq. (292)
SigmafD=Kss-Ks'*inv(K-inv(W))*Ks;          % covariance of p(f_*|X_*,D), Eq. (294)
ys=sign(meanfD);                           % binary classification of the test data
p=1./(1+exp(-meanfD));                     % p(y_*=1|X_*,D) approximated by the logistic function of the mean
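The last line above approximates Eq. (295) by applying the logistic function to the predictive mean only. If the full average over the predictive distribution is wanted, it can be estimated by sampling; a sketch (not part of the original program; the sample count and jitter value are arbitrary choices):

nmc=2000;                                           % number of Monte Carlo samples
L=chol(SigmafD+1e-8*eye(size(SigmafD,1)),'lower');  % factor the covariance; small jitter for numerical stability
F=repmat(meanfD,1,nmc)+L*randn(length(meanfD),nmc); % samples of f_* from N(meanfD,SigmafD)
pMC=mean(1./(1+exp(-F)),2);                         % Monte Carlo estimate of Eq. (295) for each test point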
The function findPosteriorMean listed below uses Newton's method to find the mean and covariance of the posterior $p(\mathbf{f}|\mathcal{D})$ of the latent function based on the training data, and returns them in meanf and covf, respectively, together with w and W for the gradient and Hessian of the log likelihood $\log p(\mathbf{y}|\mathbf{f})$, to be used for computing the mean and covariance of $p(\mathbf{f}_*|\mathbf{X}_*,\mathcal{D})$.
function [meanf covf W w]=findPosteriorMean(K,y)
% K: covariance of the prior p(f|X)
% y: labeling of the training data X
% w: gradient vector of log p(y|f)
% W: Hessian matrix of log p(y|f)
% meanf: mean of p(f|X,y)
% covf: covariance of p(f|X,y)
n=length(y);                   % number of training samples
f0=zeros(n,1);                 % initial value of the latent function
f=f0;
er=1;
while er > 10^(-9)             % Newton's method to get f that maximizes p(f|X,y)
    e=exp(-y.*f);
    w=y.*e./(1+e);             % update w, Eq. (280)
    W=diag(-e./(1+e).^2);      % update W, Eq. (281)
    f=inv(inv(K)-W)*(w-W*f0);  % iteration of Eq. (286) to get f from previous f0
    er=norm(f-f0);             % difference between two consecutive f's
    f0=f;                      % update f
end
meanf=f;                       % mean of p(f|X,y)
covf=inv(inv(K)-W);            % covariance of p(f|X,y)
end
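Since the gradient in Eq. (277) vanishes at the converged mean, the returned meanf should satisfy $\hat{\mathbf{f}}=\mathbf{K}\mathbf{w}$, which is exactly what the main program exploits when computing meanfD=Ks'*w. A quick numerical check after calling the function (not part of the original code):

[meanf covf W w]=findPosteriorMean(K,y);                    % K and y as in the main program above
fprintf('residual ||meanf-K*w|| = %g\n', norm(meanf-K*w));  % should be close to zero at convergence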
Example 1  The GPC method is trained by the two classes shown in the figure below, represented by 100 red points and 80 blue points drawn from two Gaussian distributions $\mathcal{N}(\boldsymbol{\mu}_1,\boldsymbol{\Sigma}_1)$ and $\mathcal{N}(\boldsymbol{\mu}_2,\boldsymbol{\Sigma}_2)$, where
(296): the means $\boldsymbol{\mu}_1,\boldsymbol{\mu}_2$ and covariance matrices $\boldsymbol{\Sigma}_1,\boldsymbol{\Sigma}_2$ of the two classes.
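The specific values in Eq. (296) belong to the original example. A hypothetical data-generation sketch in the same spirit (all numerical values below are made up, and samples are assumed to be stored as columns, as in the kernel sketch above) that produces X, y, and Xs for the code segments listed earlier:

mu1=[1;1]; mu2=[-1;-1];                          % hypothetical class means (not those of Eq. (296))
S1=[1 0.3;0.3 1]; S2=[1 -0.2;-0.2 0.8];          % hypothetical class covariances
X=[repmat(mu1,1,100)+chol(S1,'lower')*randn(2,100), ...
   repmat(mu2,1,80)+chol(S2,'lower')*randn(2,80)];   % 100+80 training samples as columns
y=[ones(100,1); -ones(80,1)];                    % labels: +1 for the first class, -1 for the second
[g1,g2]=meshgrid(-4:0.2:4,-4:0.2:4);             % grid of test points covering the data
Xs=[g1(:)'; g2(:)'];                             % test samples as columns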
Three classification results are shown in the figure below, corresponding to different values of $\sigma$ used in the kernel function. On the left, the 2-D space is partitioned into red and blue regions corresponding to the two classes based on the sign of the predictive mean $\boldsymbol{\mu}_*$; on the right, the 3-D plots show the estimated class membership of the test points (not normalized), to be compared with the original Gaussian distributions from which the training samples were drawn.

We see that wherever there is evidence represented by the training samples of either class in red or blue, there are high probabilities for the neighboring points to belong to the corresponding class, represented by the positive or negative peaks in the 3-D plots. Data points far away from any evidence have low probability of belonging to either class.
We make the following observations for three different values of the parameter $\sigma$ in the SE kernel: