The binary GPC considered previously can be generalized to multi-class GPC based on the softmax function, similar to how binary classification based on the logistic function is generalized to multi-class classification. First of all, we define the following variables for each class $C_k$ of the $C$ classes ($k=1,\ldots,C$):
Binary labeling $\mathbf{y}_k=[y_{1k},\ldots,y_{Nk}]^T$ indicating whether each of the $N$ samples in the training set belongs to $C_k$:

$$y_{nk}=\begin{cases}1 & \text{if }\mathbf{x}_n\in C_k\\ 0 & \text{otherwise}\end{cases},\qquad n=1,\ldots,N \qquad (297)$$
Latent function $\mathbf{f}_k=[f_{1k},\ldots,f_{Nk}]^T$, of which the nth component $f_{nk}=f_k(\mathbf{x}_n)$ is the kth Gaussian process, associated with $C_k$, evaluated at the nth training sample $\mathbf{x}_n$ in the training set $\mathbf{X}=[\mathbf{x}_1,\ldots,\mathbf{x}_N]$.
Probability $\mathbf{p}_k=[p_{1k},\ldots,p_{Nk}]^T$, of which the nth component $p_{nk}$ is the probability for $\mathbf{x}_n$ to belong to $C_k$, modeled by the softmax function based on $\mathbf{f}_n=[f_{n1},\ldots,f_{nC}]^T$ (same as in Eq. (237) in softmax regression):

$$p_{nk}=p(y_{nk}=1\,|\,\mathbf{f}_n)=\frac{\exp(f_{nk})}{\sum_{l=1}^C \exp(f_{nl})} \qquad (298)$$
Note that if $y_{nk}=1$, then necessarily $y_{nl}=0$ for all $l\ne k$, i.e., the variables $y_{n1},\ldots,y_{nC}$ are not independent. Given $y_{nk}=1$, we do not need to consider $y_{nl}$ for any $l\ne k$.
The probability for $\mathbf{x}_n$ to be correctly classified into class $C_k$ can be written as the following product of $C$ factors (of which $C-1$ are equal to 1, same as Eq. (238) in softmax regression):

$$p(\mathbf{y}_n\,|\,\mathbf{f}_n)=\prod_{k=1}^C p_{nk}^{\,y_{nk}} \qquad (299)$$
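As a minimal numerical check of Eqs. (298) and (299) (the latent values and label below are made-up, not taken from the text), the product of factors indeed picks out the softmax probability of the labeled class:

fn = [1.2, -0.3, 0.5];            % made-up latent values f_n for one sample, C = 3 classes
pn = exp(fn) / sum(exp(fn));      % softmax probabilities of Eq. (298), summing to 1
yn = [0, 0, 1];                   % one-hot label: the sample belongs to class 3
prob = prod(pn .^ yn);            % product of Eq. (299), equal to pn(3)
fprintf('p_n = [%.3f %.3f %.3f], product = %.3f\n', pn, prob);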
Based on $\mathbf{y}_k$, $\mathbf{f}_k$, and $\mathbf{p}_k$ for all $C$ classes, we further define the following $CN$-dimensional vectors:

$$\mathbf{y}=\begin{bmatrix}\mathbf{y}_1\\ \vdots\\ \mathbf{y}_C\end{bmatrix},\qquad
\mathbf{f}=\begin{bmatrix}\mathbf{f}_1\\ \vdots\\ \mathbf{f}_C\end{bmatrix},\qquad
\mathbf{p}=\begin{bmatrix}\mathbf{p}_1\\ \vdots\\ \mathbf{p}_C\end{bmatrix} \qquad (300)$$
The posterior of $\mathbf{f}$ given the training set $\mathcal{D}=\{\mathbf{X},\mathbf{y}\}$ can be found based on Bayes' theorem as (same as in Eq. (267) in the binary case):

$$p(\mathbf{f}\,|\,\mathcal{D})=p(\mathbf{f}\,|\,\mathbf{X},\mathbf{y})
=\frac{p(\mathbf{y}\,|\,\mathbf{f})\,p(\mathbf{f}\,|\,\mathbf{X})}{p(\mathbf{y}\,|\,\mathbf{X})}
\propto p(\mathbf{y}\,|\,\mathbf{f})\,p(\mathbf{f}\,|\,\mathbf{X}) \qquad (301)$$
We first find the likelihood based on all $N$ training samples, assumed to be independent (same as Eq. (239) in softmax regression):

$$p(\mathbf{y}\,|\,\mathbf{f})=\prod_{n=1}^N p(\mathbf{y}_n\,|\,\mathbf{f}_n)
=\prod_{n=1}^N\prod_{k=1}^C p_{nk}^{\,y_{nk}} \qquad (302)$$
We next assume the prior probability of the latent function $\mathbf{f}_k$ for each class to be a zero-mean Gaussian process $p(\mathbf{f}_k|\mathbf{X})=\mathcal{N}(\mathbf{f}_k|\mathbf{0},\mathbf{K}_0)$, where the covariance matrix $\mathbf{K}_0$ is constructed based on the squared exponential (SE) kernel for its mn-th component:

$$K_0[m,n]=\exp\left(-\frac{\|\mathbf{x}_m-\mathbf{x}_n\|^2}{2\sigma^2}\right),\qquad m,n=1,\ldots,N \qquad (303)$$

As the $C$ Gaussian processes are assumed to be independent of each other, the prior of the stacked latent vector $\mathbf{f}$ is also a zero-mean Gaussian, with a $CN\times CN$ block-diagonal covariance $\mathbf{K}$ containing $C$ copies of $\mathbf{K}_0$ along its diagonal:

$$p(\mathbf{f}\,|\,\mathbf{X})=\mathcal{N}(\mathbf{f}\,|\,\mathbf{0},\mathbf{K}),\qquad
\mathbf{K}=\begin{bmatrix}\mathbf{K}_0 & & \\ & \ddots & \\ & & \mathbf{K}_0\end{bmatrix}
=\mathbf{I}_C\otimes\mathbf{K}_0 \qquad (304)$$
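The code listed later in this section calls a function Kernel(X1,X2) that is not included in the listings. A minimal sketch of such a function under the SE kernel of Eq. (303) could look as follows; the function name, the assumption that each column of X1 and X2 is one sample, and the fixed value of sigma inside the function are assumptions, not taken from the text:

function K = Kernel(X1, X2)
% per-class SE covariance matrix K_0 of Eq. (303); columns of X1, X2 are d-dimensional samples
sigma = 1;                               % assumed kernel width, to be tuned
n1 = size(X1, 2);  n2 = size(X2, 2);
K = zeros(n1, n2);
for i = 1:n1
    for j = 1:n2
        d2 = sum((X1(:,i) - X2(:,j)).^2);    % squared Euclidean distance
        K(i, j) = exp(-d2 / (2*sigma^2));    % SE covariance value
    end
end
end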
Having found both the likelihood $p(\mathbf{y}|\mathbf{f})$ and the prior $p(\mathbf{f}|\mathbf{X})$, we can write the posterior as

$$p(\mathbf{f}\,|\,\mathcal{D})\propto p(\mathbf{y}\,|\,\mathbf{f})\,p(\mathbf{f}\,|\,\mathbf{X})
=\left[\prod_{n=1}^N\prod_{k=1}^C p_{nk}^{\,y_{nk}}\right]\mathcal{N}(\mathbf{f}\,|\,\mathbf{0},\mathbf{K}) \qquad (305)$$
Taking the log of the posterior, we get

$$L(\mathbf{f})=\log p(\mathbf{f}\,|\,\mathcal{D})=\log p(\mathbf{y}\,|\,\mathbf{f})+\log p(\mathbf{f}\,|\,\mathbf{X})+\text{constant} \qquad (306)$$

$$=\mathbf{y}^T\mathbf{f}-\sum_{n=1}^N\log\left(\sum_{k=1}^C e^{f_{nk}}\right)
-\frac{1}{2}\mathbf{f}^T\mathbf{K}^{-1}\mathbf{f}-\frac{1}{2}\log|\mathbf{K}|-\frac{CN}{2}\log 2\pi+\text{constant} \qquad (307)$$
As the posterior is approximated as a Gaussian, which reaches its maximum at its mean $\hat{\mathbf{f}}$, so does the log posterior $L(\mathbf{f})$, and the gradient of $L(\mathbf{f})$ is zero at $\hat{\mathbf{f}}$:

$$\nabla L(\mathbf{f})=\nabla\log p(\mathbf{y}\,|\,\mathbf{f})+\nabla\log p(\mathbf{f}\,|\,\mathbf{X})
=\mathbf{y}-\mathbf{p}-\mathbf{K}^{-1}\mathbf{f} \qquad (308)$$

$$\nabla L(\hat{\mathbf{f}})=\mathbf{y}-\hat{\mathbf{p}}-\mathbf{K}^{-1}\hat{\mathbf{f}}=\mathbf{0} \qquad (309)$$

which gives

$$\hat{\mathbf{f}}=\mathbf{K}(\mathbf{y}-\hat{\mathbf{p}}) \qquad (310)$$
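For reference, the per-component gradient of the log likelihood used in Eq. (308) follows directly from the corresponding terms of Eq. (307):

$$\frac{\partial}{\partial f_{nk}}\log p(\mathbf{y}\,|\,\mathbf{f})
=\frac{\partial}{\partial f_{nk}}\left[\sum_{l=1}^C y_{nl}f_{nl}-\log\sum_{l=1}^C e^{f_{nl}}\right]
=y_{nk}-\frac{e^{f_{nk}}}{\sum_{l=1}^C e^{f_{nl}}}=y_{nk}-p_{nk}$$

while the log prior contributes the term $-\mathbf{K}^{-1}\mathbf{f}$.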
We further get the Hessian matrix of $L(\mathbf{f})$. Differentiating the gradient in Eq. (308), the second derivative of the log likelihood with respect to the components of $\mathbf{f}$ is

$$\frac{\partial^2}{\partial f_{nk}\,\partial f_{n'k'}}\log p(\mathbf{y}\,|\,\mathbf{f})
=-\delta_{nn'}\left(\delta_{kk'}\,p_{nk}-p_{nk}\,p_{nk'}\right) \qquad (311)$$

which can be written in matrix form as $\nabla\nabla\log p(\mathbf{y}|\mathbf{f})=-\mathbf{W}$, where

$$\mathbf{W}=\mathrm{diag}(\mathbf{p})-\mathbf{P}\mathbf{P}^T \qquad (312)$$

and $\mathbf{P}$ is the $CN\times N$ matrix obtained by stacking the $C$ diagonal matrices $\mathrm{diag}(\mathbf{p}_k)$:

$$\mathbf{P}=\begin{bmatrix}\mathrm{diag}(\mathbf{p}_1)\\ \vdots\\ \mathrm{diag}(\mathbf{p}_C)\end{bmatrix} \qquad (313)$$

The Hessian of the log prior is

$$\nabla\nabla\log p(\mathbf{f}\,|\,\mathbf{X})=-\mathbf{K}^{-1} \qquad (314)$$

so that the Hessian of the log posterior is

$$\nabla\nabla L(\mathbf{f})=\nabla\nabla\log p(\mathbf{y}\,|\,\mathbf{f})+\nabla\nabla\log p(\mathbf{f}\,|\,\mathbf{X})
=-\left(\mathbf{W}+\mathbf{K}^{-1}\right) \qquad (315)$$

The covariance of the Gaussian approximation of the posterior is the negative inverse of this Hessian evaluated at $\hat{\mathbf{f}}$:

$$\boldsymbol{\Sigma}_{\mathbf{f}}=\left[-\nabla\nabla L(\hat{\mathbf{f}})\right]^{-1}
=\left(\mathbf{K}^{-1}+\mathbf{W}\right)^{-1} \qquad (316)$$

i.e., the posterior is approximated as

$$p(\mathbf{f}\,|\,\mathcal{D})\approx\mathcal{N}\!\left(\mathbf{f}\,\big|\,\hat{\mathbf{f}},\ (\mathbf{K}^{-1}+\mathbf{W})^{-1}\right) \qquad (317)$$
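Equation (311) can be verified by differentiating $y_{nk}-p_{nk}$ once more, using the derivative of the softmax function:

$$\frac{\partial p_{nk}}{\partial f_{nk'}}=p_{nk}\left(\delta_{kk'}-p_{nk'}\right),\qquad
\frac{\partial}{\partial f_{n'k'}}\left(y_{nk}-p_{nk}\right)=-\delta_{nn'}\,p_{nk}\left(\delta_{kk'}-p_{nk'}\right)$$

which, collected over all $n$ and $k$, gives $-\mathbf{W}$ with $\mathbf{W}=\mathrm{diag}(\mathbf{p})-\mathbf{P}\mathbf{P}^T$ as in Eq. (312).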
Having found both the gradient $\nabla L(\mathbf{f})$ and the Hessian $\nabla\nabla L(\mathbf{f})$, we can further find the mean $\hat{\mathbf{f}}$ at which $L(\mathbf{f})$ achieves its maximum by the following iteration of Newton's method:

$$\mathbf{f}^{(t+1)}=\mathbf{f}^{(t)}-\left[\nabla\nabla L(\mathbf{f}^{(t)})\right]^{-1}\nabla L(\mathbf{f}^{(t)})
=\left(\mathbf{K}^{-1}+\mathbf{W}\right)^{-1}\left(\mathbf{W}\mathbf{f}^{(t)}+\mathbf{y}-\mathbf{p}\right) \qquad (318)$$
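For completeness, the second equality in Eq. (318) can be checked by substituting the gradient of Eq. (308) and the Hessian of Eq. (315) into the generic Newton step:

$$\mathbf{f}^{(t)}+(\mathbf{K}^{-1}+\mathbf{W})^{-1}\left(\mathbf{y}-\mathbf{p}-\mathbf{K}^{-1}\mathbf{f}^{(t)}\right)
=(\mathbf{K}^{-1}+\mathbf{W})^{-1}\left[(\mathbf{K}^{-1}+\mathbf{W})\mathbf{f}^{(t)}+\mathbf{y}-\mathbf{p}-\mathbf{K}^{-1}\mathbf{f}^{(t)}\right]
=(\mathbf{K}^{-1}+\mathbf{W})^{-1}\left(\mathbf{W}\mathbf{f}^{(t)}+\mathbf{y}-\mathbf{p}\right)$$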
Now that we have got both the mean $\hat{\mathbf{f}}$ and the covariance $\boldsymbol{\Sigma}_{\mathbf{f}}$ of $p(\mathbf{f}|\mathcal{D})$, we can further get the mean $\boldsymbol{\mu}_*$ and covariance $\boldsymbol{\Sigma}_*$ of $p(\mathbf{f}_*|\mathbf{X}_*,\mathcal{D})$, the predictive distribution of the latent values at a set of test samples $\mathbf{X}_*$:
Similar to Eq. (292) in the binary case, we have the following based on $\hat{\mathbf{f}}=\mathbf{K}(\mathbf{y}-\hat{\mathbf{p}})$ in Eq. (310):

$$\boldsymbol{\mu}_*=E\left[\mathbf{f}_*\,|\,\mathbf{X}_*,\mathcal{D}\right]=\mathbf{K}_*^T\,\mathbf{K}^{-1}\hat{\mathbf{f}} \qquad (319)$$

$$=\mathbf{K}_*^T\left(\mathbf{y}-\hat{\mathbf{p}}\right) \qquad (320)$$

where $\mathbf{K}_*$ is the block-diagonal cross covariance between the latent values of the training samples $\mathbf{X}$ and those of the test samples $\mathbf{X}_*$, built from the same SE kernel.
Similar to Eq. (294) in the binary case, we have the following based on $\mathbf{W}$ in Eq. (312), where $\mathbf{K}_{**}$ is the block-diagonal covariance among the test samples:

$$\boldsymbol{\Sigma}_*=\mathrm{Cov}\left[\mathbf{f}_*\,|\,\mathbf{X}_*,\mathcal{D}\right]
=\mathbf{K}_{**}-\mathbf{K}_*^T\mathbf{K}^{-1}\mathbf{K}_*+\mathbf{K}_*^T\mathbf{K}^{-1}\boldsymbol{\Sigma}_{\mathbf{f}}\,\mathbf{K}^{-1}\mathbf{K}_* \qquad (321)$$

$$=\mathbf{K}_{**}-\mathbf{K}_*^T\left(\mathbf{K}+\mathbf{W}^{-1}\right)^{-1}\mathbf{K}_* \qquad (322)$$
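The equivalence of Eq. (321) and Eq. (322) can be checked with the matrix inversion lemma:

$$\left(\mathbf{K}+\mathbf{W}^{-1}\right)^{-1}=\mathbf{K}^{-1}-\mathbf{K}^{-1}\left(\mathbf{K}^{-1}+\mathbf{W}\right)^{-1}\mathbf{K}^{-1}
=\mathbf{K}^{-1}-\mathbf{K}^{-1}\boldsymbol{\Sigma}_{\mathbf{f}}\,\mathbf{K}^{-1}$$

so that substituting it into Eq. (322) recovers Eq. (321). Note that Eq. (321) avoids inverting $\mathbf{W}$, which in the multi-class case is singular (its rows sum to zero across the $C$ class blocks); the code below therefore uses this form.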
Now we can further get the probability for any test sample $\mathbf{x}_*$ to belong to class $C_k$ based on the softmax function in Eq. (298), applied to the components of the predictive mean $\boldsymbol{\mu}_*$:

$$p(y_{*k}=1\,|\,\mathbf{x}_*,\mathcal{D})=\frac{\exp(\mu_{*k})}{\sum_{l=1}^C\exp(\mu_{*l})},\qquad k=1,\ldots,C \qquad (323)$$

and $\mathbf{x}_*$ is classified into the class with the maximum probability.
Moreover, the certainty or confidence of this classification result can be found from the maximum softmax probability itself.
The Matlab code for the essential parts of the algorithm is listed below. First, the following code segment carries out the classification of $n$ given test samples in $\mathbf{X}_*$ based on the training set $\mathcal{D}=\{\mathbf{X},\mathbf{y}\}$.
K = Kernel(X, X);                        % N x N per-class prior covariance K_0 of Eq. (303)
Ks = Kernel(X, Xs);                      % N x n cross covariance between training and test samples
Kss = Kernel(Xs, Xs);                    % n x n covariance among test samples
Kc = kron(eye(C), K);                    % CN x CN block-diagonal prior covariance of Eq. (304)
Kcs = kron(eye(C), Ks);                  % block-diagonal cross covariance for the test samples
[meanf Sigmaf p W] = findPosteriorMeanMC(Kc, y, C);   % mean and covariance of p(f|D), plus p and W
Sigmafy = kron(eye(C), Kss) - Kcs'*(inv(Kc) - inv(Kc)*Sigmaf*inv(Kc))*Kcs;  % covariance of p(f_*|X_*,D), Eq. (321)
p = reshape(p, N, C);                    % kth column is p_k
y = reshape(y, N, C);                    % kth column is y_k
for k = 1:C
    meanfD(:,k) = Ks'*(y(:,k) - p(:,k)); % mean of p(f_*|X_*,D) for kth class, Eq. (320)
end
for i = 1:n                              % for each of n test samples
    d = sum(exp(meanfD(i,:)));           % denominator of softmax function
    [pmax k] = max(exp(meanfD(i,:))/d);  % find class with max probability
    ys(i) = k;                           % label ith sample as member of kth class
    pr(i,k) = pmax;                      % probability of ith sample belonging to kth class
end
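As a usage sketch, the inputs of the code segment above could be prepared as follows. The data are synthetic and the sizes arbitrary; only the variable names X, y, Xs, N, n, and C are chosen to match those assumed by the listing:

rng(0);                                  % reproducible synthetic data
C = 2;  N = 40;                          % two classes, 40 training samples
X = [randn(2,N/2)-1, randn(2,N/2)+1];    % columns are 2-D training samples of the two classes
labels = [ones(1,N/2), 2*ones(1,N/2)];   % true class of each training sample
Y = zeros(N, C);                         % binary label vectors y_k of Eq. (297)
for i = 1:N
    Y(i, labels(i)) = 1;
end
y = reshape(Y, N*C, 1);                  % stacked CN x 1 label vector of Eq. (300)
[u, v] = meshgrid(-3:0.5:3, -3:0.5:3);   % grid of test points covering the feature space
Xs = [u(:)'; v(:)'];                     % columns are test samples
n = size(Xs, 2);                         % number of test samples
% ...then run the classification segment listed above to obtain ys and pr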
The classification code segment calls the following function, which computes the mean and covariance of $p(\mathbf{f}|\mathcal{D})$ by Newton's method in Eq. (318):
function [meanf Sigmaf p W] = findPosteriorMeanMC(K, y, C)
    % find mean and covariance of p(f|X,y) by Newton's method, based on D={X,y};
    % K is the CN x CN block-diagonal prior covariance, y the stacked CN x 1 label vector
    n = length(y);                       % n = N*C, length of the stacked vectors
    N = n/C;                             % number of training samples
    f0 = zeros(n, 1);                    % initial value of latent function
    er = 1;
    while er > 10^(-9)                   % find f that maximizes p(f|X,y)
        [p W] = findp(f0, N, C);         % call function findp to find vector p and matrix W
        f = inv(inv(K)+W)*(W*f0+y-p);    % iteratively update f by the Newton step of Eq. (318)
        er = norm(f-f0);                 % difference between consecutive iterations
        f0 = f;                          % update f
    end
    meanf = f;                           % mean of p(f|D)
    Sigmaf = inv(inv(K)+W);              % covariance of p(f|D), Eq. (316)
end
The following function, called by findPosteriorMeanMC, finds the vector $\mathbf{p}$ in Eq. (300) and the matrix $\mathbf{W}=\mathrm{diag}(\mathbf{p})-\mathbf{P}\mathbf{P}^T$ in Eq. (312):
function [p W] = findp(f, N, C)
    % find vector p and matrix W = diag(p) - P*P' of Eq. (312)
    F = reshape(f, N, C)';               % kth row contains the N latent values of class k
    p = zeros(C, N);                     % initialize p
    for n = 1:N                          % for each of N training samples
        d = sum(exp(F(:,n)));            % sum of all C terms in the softmax denominator
        for k = 1:C                      % for all C classes
            p(k,n) = exp(F(k,n))/d;      % softmax probability of Eq. (298)
        end
    end
    P = [];
    for k = 1:C                          % generate P of Eq. (313)
        P = [P; diag(p(k,:))];           % stack C diagonal matrices
    end
    p = reshape(p', N*C, 1);             % convert p into a stacked column vector
    W = diag(p) - P*P';                  % generate W matrix
end
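As a quick numerical check of findp (with made-up latent values), the probabilities of each sample should sum to one across the $C$ classes, and $\mathbf{W}$ should be symmetric with rows summing to zero across the class blocks, as noted after Eq. (322):

N = 4;  C = 3;                           % made-up problem size for the check
f = randn(N*C, 1);                       % random latent values stacked by class
[p, W] = findp(f, N, C);                 % vector p and matrix W
disp(sum(reshape(p, N, C), 2)');         % per-sample probabilities, each sum should be 1
disp(norm(W - W', 'fro'));               % W should be symmetric, so this is ~0
disp(norm(W*ones(N*C, 1)));              % rows of W sum to ~0, confirming W is singular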
Example 1:
This example shows the classification result of the same dataset of three classes used before. The top two panels show the distributions of the three classes in the training data set. The bottom two panels show the classification results in terms of the partitioning of the feature space (bottom left) and the posterior distribution (bottom right), which can be compared with the distribution of the training set (top right). The confusion matrix of the classification result is shown below, with the error rate :
(324)
Example 2:
This example shows the classification of the XOR data set. The confusion matrix of the classification result is shown below, with the error rate :
(325)
We see that in both examples the error rates of the GPC method are lower than those of the naive Bayesian method. However, the naive Bayesian method does not have the overfitting problem, while in the GPC method we may need to carefully adjust the parameter $\sigma$ of the squared exponential kernel to make a proper tradeoff between overfitting and error rate.
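One simple way to set $\sigma$ (not covered in the text; a hedged sketch assuming a hypothetical kernel function KernelSig(X1,X2,sigma) that exposes the width as an argument, a labeled validation set Xv, yv, and the training variables X, y, N, C prepared as in the earlier sketch) is to scan a few candidate values and keep the one with the lowest validation error rate:

sigmas = [0.2 0.5 1 2 5];                    % candidate SE kernel widths
err = zeros(size(sigmas));
for s = 1:length(sigmas)
    K  = KernelSig(X,  X,  sigmas(s));       % training covariance for this width
    Kv = KernelSig(X,  Xv, sigmas(s));       % training/validation cross covariance
    [meanf, Sigmaf, p, W] = findPosteriorMeanMC(kron(eye(C), K), y, C);
    pk = reshape(p, N, C);  Y = reshape(y, N, C);
    mv = zeros(size(Xv, 2), C);
    for k = 1:C
        mv(:,k) = Kv'*(Y(:,k) - pk(:,k));    % predictive latent means on validation samples
    end
    [~, yhat] = max(mv, [], 2);              % largest mean gives the largest softmax probability
    err(s) = mean(yhat(:) ~= yv(:));         % validation error rate for this width
end
[~, best] = min(err);
fprintf('selected sigma = %g\n', sigmas(best));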