The binary GPC considered previously can be generalized to multi-class GPC based on the softmax function, in the same way that binary classification based on the logistic function is generalized to multi-class classification. First, we define the following variables for each class $k$ ($k=1,\dots,C$) of the $C$ classes:

Binary labeling $\mathbf{y}^k=[y_1^k,\dots,y_N^k]^T$ indicating whether each of the $N$ samples in the training set $\mathcal{D}=\{\mathbf{X},\,\mathbf{y}\}$ belongs to class $k$:

$$y_n^k=\begin{cases}1&\text{if }\mathbf{x}_n\text{ belongs to class }k\\0&\text{otherwise}\end{cases},\qquad n=1,\dots,N\qquad\qquad(297)$$
Latent function $\mathbf{f}^k=[f_1^k,\dots,f_N^k]^T$, of which the $n$th component $f_n^k=f^k(\mathbf{x}_n)$ is the $k$th Gaussian process, associated with class $k$, evaluated at the $n$th training sample $\mathbf{x}_n$ in the training set.
Probability $\mathbf{p}^k=[p_1^k,\dots,p_N^k]^T$, of which the $n$th component $p_n^k$ is the probability for $\mathbf{x}_n$ to belong to class $k$, modeled by the softmax function based on $\mathbf{f}_n=[f_n^1,\dots,f_n^C]^T$ (same as in Eq. (237) in softmax regression):

$$p_n^k=p(y_n^k=1\,|\,\mathbf{f}_n)=\frac{\exp(f_n^k)}{\sum_{l=1}^C\exp(f_n^l)}\qquad\qquad(298)$$
The probability for $\mathbf{x}_n$ to be correctly classified into class $k$ can be written as the following product of $C$ factors (of which $C-1$ are equal to 1, same as Eq. (238) in softmax regression):

$$p(\mathbf{y}_n\,|\,\mathbf{f}_n)=\prod_{k=1}^C(p_n^k)^{y_n^k}\qquad\qquad(299)$$
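For example, if $C=3$ and $\mathbf{x}_n$ belongs to class 2, then $\mathbf{y}_n=[0,\;1,\;0]^T$ and

$$p(\mathbf{y}_n\,|\,\mathbf{f}_n)=(p_n^1)^0\,(p_n^2)^1\,(p_n^3)^0=p_n^2$$

i.e., only the factor corresponding to the correct class survives.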
Based on $\mathbf{y}^k$, $\mathbf{f}^k$, and $\mathbf{p}^k$ for all $C$ classes, we further define the following $CN$-dimensional stacked vectors:

$$\mathbf{y}=\begin{bmatrix}\mathbf{y}^1\\\vdots\\\mathbf{y}^C\end{bmatrix},\qquad\mathbf{f}=\begin{bmatrix}\mathbf{f}^1\\\vdots\\\mathbf{f}^C\end{bmatrix},\qquad\mathbf{p}=\begin{bmatrix}\mathbf{p}^1\\\vdots\\\mathbf{p}^C\end{bmatrix}\qquad\qquad(300)$$
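In the Matlab code below, this stacking is carried out by reshape. A minimal sketch, assuming Y is an N x C indicator matrix whose kth column is the label vector of the kth class:

y=reshape(Y,N*C,1);   % stack the C columns into the CN x 1 vector y=[y^1;...;y^C]
Y=reshape(y,N,C);     % recover the N x C matrix, column k is y^k

The same class-major ordering is used for the stacked vectors f and p throughout the code.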
The posterior of $\mathbf{f}$ given the training set $\mathcal{D}=\{\mathbf{X},\,\mathbf{y}\}$ can be found based on the Bayesian theorem as (same as in Eq. (267) in the binary case):

$$p(\mathbf{f}\,|\,\mathbf{X},\mathbf{y})=\frac{p(\mathbf{y}\,|\,\mathbf{f})\,p(\mathbf{f}\,|\,\mathbf{X})}{p(\mathbf{y}\,|\,\mathbf{X})}\propto p(\mathbf{y}\,|\,\mathbf{f})\,p(\mathbf{f}\,|\,\mathbf{X})\qquad\qquad(301)$$
We first find the likelihood based on all $N$ training samples (same as Eq. (239) in softmax regression):

$$p(\mathbf{y}\,|\,\mathbf{f})=\prod_{n=1}^N p(\mathbf{y}_n\,|\,\mathbf{f}_n)=\prod_{n=1}^N\prod_{k=1}^C(p_n^k)^{y_n^k}\qquad\qquad(302)$$
We next assume the prior probability of the latent function $\mathbf{f}^k$ for each class $k$ to be a zero-mean Gaussian process $\mathbf{f}^k\sim\mathcal{N}(\mathbf{0},\,\mathbf{K}_x)$, where the $N\times N$ covariance matrix $\mathbf{K}_x$ is constructed based on the squared exponential (SE) kernel for its $mn$-th component:

$$K_{mn}=k(\mathbf{x}_m,\mathbf{x}_n)=\sigma^2\exp\left(-\frac{||\mathbf{x}_m-\mathbf{x}_n||^2}{2l^2}\right)\qquad\qquad(303)$$

As the $C$ latent functions are assumed to be independent of each other, the prior of the stacked vector $\mathbf{f}$ is also a zero-mean Gaussian with a $CN\times CN$ block-diagonal covariance matrix $\mathbf{K}$:

$$p(\mathbf{f}\,|\,\mathbf{X})=\prod_{k=1}^C\mathcal{N}(\mathbf{f}^k\,|\,\mathbf{0},\,\mathbf{K}_x)=\mathcal{N}(\mathbf{f}\,|\,\mathbf{0},\,\mathbf{K}),\qquad\mathbf{K}=\begin{bmatrix}\mathbf{K}_x&&\\&\ddots&\\&&\mathbf{K}_x\end{bmatrix}\qquad\qquad(304)$$
Having found both the likelihood $p(\mathbf{y}\,|\,\mathbf{f})$ and the prior $p(\mathbf{f}\,|\,\mathbf{X})$, we can write the posterior as

$$p(\mathbf{f}\,|\,\mathbf{X},\mathbf{y})\propto p(\mathbf{y}\,|\,\mathbf{f})\,p(\mathbf{f}\,|\,\mathbf{X})=\left[\prod_{n=1}^N\prod_{k=1}^C(p_n^k)^{y_n^k}\right]\mathcal{N}(\mathbf{f}\,|\,\mathbf{0},\,\mathbf{K})\qquad\qquad(305)$$

Taking the log of the posterior, we get

$$L(\mathbf{f})=\log p(\mathbf{f}\,|\,\mathbf{X},\mathbf{y})=\log p(\mathbf{y}\,|\,\mathbf{f})+\log p(\mathbf{f}\,|\,\mathbf{X})+\text{const}\qquad\qquad(306)$$

$$=\mathbf{y}^T\mathbf{f}-\sum_{n=1}^N\log\left(\sum_{k=1}^C e^{f_n^k}\right)-\frac{1}{2}\mathbf{f}^T\mathbf{K}^{-1}\mathbf{f}-\frac{1}{2}\log|\mathbf{K}|+\text{const}\qquad\qquad(307)$$
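The first two terms in Eq. (307) follow from the likelihood in Eq. (302), the softmax in Eq. (298), and the fact that $\sum_{k=1}^C y_n^k=1$ for every sample:

$$\log p(\mathbf{y}\,|\,\mathbf{f})=\sum_{n=1}^N\sum_{k=1}^C y_n^k\log p_n^k=\sum_{n=1}^N\sum_{k=1}^C y_n^k\left(f_n^k-\log\sum_{l=1}^C e^{f_n^l}\right)=\mathbf{y}^T\mathbf{f}-\sum_{n=1}^N\log\sum_{l=1}^C e^{f_n^l}$$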
As the posterior $p(\mathbf{f}\,|\,\mathbf{X},\mathbf{y})$ is approximated as a Gaussian, it reaches its maximum at its mean $\hat{\mathbf{f}}$, and so does the log posterior $L(\mathbf{f})$, i.e., the gradient of $L(\mathbf{f})$ is zero at $\mathbf{f}=\hat{\mathbf{f}}$:

$$\mathbf{g}(\mathbf{f})=\nabla L(\mathbf{f})=\nabla\log p(\mathbf{y}\,|\,\mathbf{f})+\nabla\log p(\mathbf{f}\,|\,\mathbf{X})=\mathbf{y}-\mathbf{p}-\mathbf{K}^{-1}\mathbf{f}\qquad\qquad(308)$$

$$\mathbf{g}(\hat{\mathbf{f}})=\mathbf{y}-\hat{\mathbf{p}}-\mathbf{K}^{-1}\hat{\mathbf{f}}=\mathbf{0}\qquad\qquad(309)$$

which gives

$$\hat{\mathbf{f}}=\mathbf{K}(\mathbf{y}-\hat{\mathbf{p}})\qquad\qquad(310)$$

where $\hat{\mathbf{p}}$ is the probability vector $\mathbf{p}$ evaluated at $\hat{\mathbf{f}}$.
We further get the Hessian matrix of $L(\mathbf{f})$:

$$\mathbf{H}(\mathbf{f})=\nabla\nabla L(\mathbf{f})=\nabla\nabla\log p(\mathbf{y}\,|\,\mathbf{f})+\nabla\nabla\log p(\mathbf{f}\,|\,\mathbf{X})\qquad\qquad(311)$$

The second term is simply $-\mathbf{K}^{-1}$, while the negative of the first term is denoted by $\mathbf{W}$:

$$\mathbf{W}=-\nabla\nabla\log p(\mathbf{y}\,|\,\mathbf{f})=\mathrm{diag}(\mathbf{p})-\mathbf{\Pi}\mathbf{\Pi}^T\qquad\qquad(312)$$

where $\mathbf{\Pi}$ is the $CN\times N$ matrix obtained by stacking the $C$ diagonal matrices $\mathrm{diag}(\mathbf{p}^k)$:

$$\mathbf{\Pi}=\begin{bmatrix}\mathrm{diag}(\mathbf{p}^1)\\\vdots\\\mathrm{diag}(\mathbf{p}^C)\end{bmatrix}\qquad\qquad(313)$$

This form of $\mathbf{W}$ follows from the second order derivatives of the softmax log likelihood,

$$\frac{\partial^2}{\partial f_n^k\,\partial f_n^l}\log p(\mathbf{y}_n\,|\,\mathbf{f}_n)=\frac{\partial}{\partial f_n^l}\left(y_n^k-p_n^k\right)=-p_n^k\,\delta_{kl}+p_n^k\,p_n^l\qquad\qquad(314)$$

while all second derivatives with respect to components of different samples are zero. The Hessian is therefore

$$\mathbf{H}(\mathbf{f})=-\left(\mathbf{K}^{-1}+\mathbf{W}\right)\qquad\qquad(315)$$

and the Laplace approximation of the posterior is a Gaussian with mean $\hat{\mathbf{f}}$ and covariance equal to the negative inverse of the Hessian evaluated at $\hat{\mathbf{f}}$:

$$p(\mathbf{f}\,|\,\mathbf{X},\mathbf{y})\approx\mathcal{N}(\mathbf{f}\,|\,\hat{\mathbf{f}},\,\hat{\boldsymbol{\Sigma}})\qquad\qquad(316)$$

$$\hat{\boldsymbol{\Sigma}}=-\mathbf{H}(\hat{\mathbf{f}})^{-1}=\left(\mathbf{K}^{-1}+\mathbf{W}\right)^{-1}\qquad\qquad(317)$$
Having found both the gradient $\mathbf{g}(\mathbf{f})$ and the Hessian $\mathbf{H}(\mathbf{f})$, we can further find the mean $\hat{\mathbf{f}}$ at which $L(\mathbf{f})$ achieves its maximum by the following iteration of Newton's method:

$$\mathbf{f}\leftarrow\mathbf{f}-\mathbf{H}^{-1}(\mathbf{f})\,\mathbf{g}(\mathbf{f})=\left(\mathbf{K}^{-1}+\mathbf{W}\right)^{-1}\left(\mathbf{W}\mathbf{f}+\mathbf{y}-\mathbf{p}\right)\qquad\qquad(318)$$
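The second equality in Eq. (318) can be verified by substituting the gradient in Eq. (308) and the Hessian in Eq. (315):

$$\mathbf{f}+\left(\mathbf{K}^{-1}+\mathbf{W}\right)^{-1}\left(\mathbf{y}-\mathbf{p}-\mathbf{K}^{-1}\mathbf{f}\right)=\left(\mathbf{K}^{-1}+\mathbf{W}\right)^{-1}\left[\left(\mathbf{K}^{-1}+\mathbf{W}\right)\mathbf{f}+\mathbf{y}-\mathbf{p}-\mathbf{K}^{-1}\mathbf{f}\right]=\left(\mathbf{K}^{-1}+\mathbf{W}\right)^{-1}\left(\mathbf{W}\mathbf{f}+\mathbf{y}-\mathbf{p}\right)$$

This is exactly the update carried out in the function findPosteriorMeanMC listed below.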
Now that we have got both the mean $\hat{\mathbf{f}}$ and the covariance $\hat{\boldsymbol{\Sigma}}$ of the posterior $p(\mathbf{f}\,|\,\mathbf{X},\mathbf{y})$, we can further get the mean and covariance of $p(\mathbf{f}_*\,|\,\mathbf{X}_*,\mathcal{D})$, the predictive distribution of the latent function at the test samples $\mathbf{X}_*$:
Similar to Eq. (292) in the binary case, we have the following based on $\hat{\mathbf{f}}=\mathbf{K}(\mathbf{y}-\hat{\mathbf{p}})$ in Eq. (310):

$$E[\mathbf{f}_*\,|\,\mathbf{X}_*,\mathcal{D}]=\mathbf{K}_*^T\mathbf{K}^{-1}\hat{\mathbf{f}}=\mathbf{K}_*^T(\mathbf{y}-\hat{\mathbf{p}})\qquad\qquad(319)$$

or, written for the $k$th class individually,

$$E[\mathbf{f}_*^k\,|\,\mathbf{X}_*,\mathcal{D}]=\mathbf{K}_{x*}^T(\mathbf{y}^k-\hat{\mathbf{p}}^k)\qquad\qquad(320)$$

where $\mathbf{K}_{x*}$ is the $N\times n$ cross covariance between the $N$ training and $n$ test samples, and $\mathbf{K}_*$ is the $CN\times Cn$ block-diagonal matrix with $\mathbf{K}_{x*}$ as each of its $C$ diagonal blocks.
Similar to Eq. (294) in the binary case, we have the following based on $\mathbf{W}$ in Eq. (312):

$$\mathrm{Cov}[\mathbf{f}_*\,|\,\mathbf{X}_*,\mathcal{D}]=\mathbf{K}_{**}-\mathbf{K}_*^T\mathbf{K}^{-1}\mathbf{K}_*+\mathbf{K}_*^T\mathbf{K}^{-1}\hat{\boldsymbol{\Sigma}}\,\mathbf{K}^{-1}\mathbf{K}_*\qquad\qquad(321)$$

$$=\mathbf{K}_{**}-\mathbf{K}_*^T\left(\mathbf{K}+\mathbf{W}^{-1}\right)^{-1}\mathbf{K}_*\qquad\qquad(322)$$

where $\mathbf{K}_{**}$ is the covariance matrix of the test samples.
Now we can further get the probability for each test sample $\mathbf{x}_*$ to belong to class $k$ based on the softmax function in Eq. (298):

$$p(y_*^k=1\,|\,\mathbf{X}_*,\mathcal{D})=\frac{\exp\left(E[f_*^k]\right)}{\sum_{l=1}^C\exp\left(E[f_*^l]\right)}\qquad\qquad(323)$$

Moreover, the certainty or confidence of this classification result can be found from the covariance $\mathrm{Cov}[\mathbf{f}_*\,|\,\mathbf{X}_*,\mathcal{D}]$ in Eq. (322).
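For instance, the variance of $f_*^k$ at each individual test sample is the corresponding diagonal element of the covariance in Eq. (322). A minimal sketch, assuming the variable Sigmafy computed in the code segment below and the same class-major stacking used there:

varfD=reshape(diag(Sigmafy),n,C);  % column k holds the variance of f_*^k at each of the n test samples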
The Matlab code for the essential parts of the algorithm is listed below. First, the following code segment carries out the classification of the given test samples $\mathbf{X}_*$ (Xs in the code) based on the training set $\mathcal{D}=\{\mathbf{X},\,\mathbf{y}\}$.
Kx=Kernel(X,X);                    % N x N covariance of the training samples (SE kernel, Eq. (303))
K=kron(eye(C),Kx);                 % CN x CN block diagonal covariance of prior p(f|X), Eq. (304)
Ksx=Kernel(X,Xs);                  % N x n cross covariance between training and test samples
Ks=kron(eye(C),Ksx);               % CN x Cn version stacked for all C classes
Kss=kron(eye(C),Kernel(Xs,Xs));    % Cn x Cn covariance of the test samples
[meanf Sigmaf p W]=findPosteriorMeanMC(K,y,C);  % find mean and covariance of p(f|D), and p, W
Sigmafy=Kss-Ks'*inv(K+inv(W))*Ks;  % covariance of p(f_*|X_*,D), Eq. (322)
p=reshape(p,N,C);                  % column k holds p^k
y=reshape(y,N,C);                  % column k holds y^k
for k=1:C
    meanfD(:,k)=Ksx'*(y(:,k)-p(:,k));   % mean of p(f_*|X_*,D) for kth class, Eq. (320)
end
for i=1:n                          % for each of n test samples
    d=sum(exp(meanfD(i,:)));       % denominator of softmax function
    [pmax k]=max(exp(meanfD(i,:))/d);   % find class with maximum probability, Eq. (323)
    ys(i)=k;                       % label ith sample as member of kth class
    pr(i,k)=pmax;                  % probability of ith sample belonging to kth class
end
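The function Kernel called above is not listed here; a minimal sketch consistent with the SE kernel in Eq. (303), assuming the samples are stored as columns of its two arguments and that the variance sigma2 and length scale l are hyperparameters to be tuned, could look as follows:

function K=Kernel(X1,X2)
% SE kernel matrix between the columns of X1 and X2 (a sketch; sigma2 and l are assumed hyperparameters)
    sigma2=1;  l=1;
    m=size(X1,2);  n=size(X2,2);
    K=zeros(m,n);
    for i=1:m
        for j=1:n
            K(i,j)=sigma2*exp(-norm(X1(:,i)-X2(:,j))^2/(2*l^2));  % Eq. (303)
        end
    end
end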
The code segment above calls the following function, which computes the mean and covariance of the posterior $p(\mathbf{f}\,|\,\mathbf{X},\mathbf{y})$ by Newton's method:
function [meanf Sigmaf p W]=findPosteriorMeanMC(K,y,C)
% find mean and covariance of p(f|X,y) by Newton's method, based on D={X,y}
    n=length(y);                 % length CN of the stacked label vector y
    N=n/C;                       % number of training samples
    f0=zeros(n,1);               % initial value of latent function
    f=zeros(n,1);
    er=1;  k=0;
    while er > 10^(-9)           % find f that maximizes p(f|X,y)
        k=k+1;
        [p W]=findp(f,N,C);      % call function findp to find vector p and matrix W
        f=inv(inv(K)+W)*(W*f0+y-p);  % iteratively update value of f, Eq. (318)
        er=norm(f-f0);           % difference between consecutive iterations
        f0=f;                    % update f
    end
    meanf=f;                     % mean of p(f|X,y)
    Sigmaf=inv(inv(K)+W);        % covariance of p(f|X,y), Eq. (317)
end
The following function, called by the previous one, finds the vector $\mathbf{p}$ and the matrix $\mathbf{W}$:
function [p W]=findp(f,N,C)
% find vector p and matrix W=diag(p)-P*P'
    F=reshape(f,N,C)';           % kth row contains the N components of f^k for class k
    p=zeros(C,N);                % initialize p
    for n=1:N                    % for each of N training samples
        d=sum(exp(F(:,n)));      % sum of all C terms in the softmax denominator
        for k=1:C                % for all C classes
            p(k,n)=exp(F(k,n))/d;    % softmax probability, Eq. (298)
        end
    end
    P=[];
    for k=1:C                    % generate P, Eq. (313)
        P=[P; diag(p(k,:))];     % stack C diagonal matrices into a CN x N matrix
    end
    p=reshape(p',N*C,1);         % convert p into a CN x 1 column vector [p^1;...;p^C]
    W=diag(p)-P*P';              % generate W matrix, Eq. (312)
end
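As a quick sanity check of findp (not part of the original algorithm; the values of N and C below are arbitrary assumptions), the C class probabilities of every sample should sum to 1 and W should be symmetric:

N=4; C=3; f=randn(N*C,1);      % arbitrary test values
[p W]=findp(f,N,C);
disp(sum(reshape(p,N,C),2)')   % each entry should be 1
disp(norm(W-W','fro'))         % should be (numerically) zero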
Example 1:
This example shows the classification result of the same dataset of
three classes used before. The top two panels show the distributions
of the three classes in the training data set. The bottom two panels
show the classification results in terms of the partitioning of the
feature space (bottom left) and the posterior distribution
(bottom right), which can be
compared with the distribution of the training set (top right). The
confusion matrix of the classification result and the corresponding error rate are shown below:

(324)
Example 2:
This example shows the classification of the XOR data set. The confusion
matrix of the classification result and the corresponding error rate are shown below:

(325)
We see that in both examples, the error rates of the GPC method are lower than those of the naive Bayesian method. However, the naive Bayesian method does not suffer from the overfitting problem, while in the GPC method we may need to carefully adjust the parameters of the squared exponential (SE) kernel function to make a proper tradeoff between overfitting and error rate.