The naive Bayes (maximum likelihood) classifier is based on a quadratic decision function and is therefore unable to classify data that are not quadratically separable. However, as shown below, the Bayes method can be reformulated so that all data points appear only in the form of inner products, and the kernel method can then be used to map the original space into a higher dimensional space in which the groups can be separated even though they are not quadratically separable in the original space.
Consider a binary Bayes classifier by which any sample $\mathbf{x}$ is classified into either of the two classes $C_+$ and $C_-$, depending on whether $\mathbf{x}$ is on the positive or negative side of the quadratic decision surface (Eq. (22)):
$$d(\mathbf{x})=\frac{1}{2}(\mathbf{x}-\mathbf{m}_1)^T\Sigma_1^{-1}(\mathbf{x}-\mathbf{m}_1)-\frac{1}{2}(\mathbf{x}-\mathbf{m}_0)^T\Sigma_0^{-1}(\mathbf{x}-\mathbf{m}_0)\tag{172}$$

$$=\mathbf{x}^T\mathbf{W}\mathbf{x}+\mathbf{w}^T\mathbf{x}+b\tag{173}$$

where

$$\mathbf{W}=-\frac{1}{2}\left(\Sigma_0^{-1}-\Sigma_1^{-1}\right)\tag{174}$$

$$\mathbf{w}=\Sigma_0^{-1}\mathbf{m}_0-\Sigma_1^{-1}\mathbf{m}_1\tag{175}$$

Here $\mathbf{m}_0,\Sigma_0$ and $\mathbf{m}_1,\Sigma_1$ are the mean vectors and covariance matrices of the training samples in $C_+$ and $C_-$, respectively, and all constant terms are collected into the bias $b$. A sample $\mathbf{x}$ is classified into $C_+$ if $d(\mathbf{x})>0$ and into $C_-$ if $d(\mathbf{x})<0$.
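The decision function above can also be evaluated directly (without any kernel) from the sample means and covariances. The following is a minimal Matlab sketch of this baseline; the variable names m0, C0, m1, C1 match the training code listed further below, and the bias b is assumed to be determined separately:

% Direct (non-kernel) evaluation of the quadratic decision function d(x)=x'*W*x+w'*x+b
W=-(inv(C0)-inv(C1))/2;           % second order term, Eq. (174)
w=inv(C0)*m0-inv(C1)*m1;          % first order term, Eq. (175)
d=x'*W*x+w'*x+b;                  % classify x into C+ if d>0, into C- if d<0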
As the matrices $\mathbf{W}$, $\Sigma_0^{-1}$, and $\Sigma_1^{-1}$ are all symmetric, they can be written in the following eigendecomposition form:
$$\mathbf{W}=\mathbf{V}\mathbf{D}\mathbf{V}^T=\left(\mathbf{V}\mathbf{D}^{1/2}\right)\left(\mathbf{V}\mathbf{D}^{1/2}\right)^T=\mathbf{U}^T\mathbf{U},\qquad\mathbf{U}=\left(\mathbf{V}\mathbf{D}^{1/2}\right)^T\tag{176}$$

$$\Sigma_0^{-1}=\mathbf{V}_0\mathbf{D}_0\mathbf{V}_0^T=\mathbf{U}_0^T\mathbf{U}_0,\qquad\mathbf{U}_0=\left(\mathbf{V}_0\mathbf{D}_0^{1/2}\right)^T\tag{177}$$

$$\Sigma_1^{-1}=\mathbf{V}_1\mathbf{D}_1\mathbf{V}_1^T=\mathbf{U}_1^T\mathbf{U}_1,\qquad\mathbf{U}_1=\left(\mathbf{V}_1\mathbf{D}_1^{1/2}\right)^T\tag{178}$$

With these factorizations every term of the decision function becomes an inner product of mapped data points:

$$d(\mathbf{x})=\langle\mathbf{U}\mathbf{x},\mathbf{U}\mathbf{x}\rangle+\langle\mathbf{U}_0\mathbf{m}_0,\mathbf{U}_0\mathbf{x}\rangle-\langle\mathbf{U}_1\mathbf{m}_1,\mathbf{U}_1\mathbf{x}\rangle+b\tag{179}$$
Replacing each class mean by the average of the training samples in that class, and each inner product by a kernel function $K(\cdot,\cdot)$, we obtain the general (type 3) kernel Bayes decision function:

$$d(\mathbf{x})=K(\mathbf{U}\mathbf{x},\mathbf{U}\mathbf{x})+\frac{1}{n_0}\sum_{\mathbf{x}_i\in C_+}K(\mathbf{U}_0\mathbf{x}_i,\mathbf{U}_0\mathbf{x})-\frac{1}{n_1}\sum_{\mathbf{x}_i\in C_-}K(\mathbf{U}_1\mathbf{x}_i,\mathbf{U}_1\mathbf{x})+b\tag{182}$$

where $n_0$ and $n_1$ are the numbers of training samples in $C_+$ and $C_-$.
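The factorizations above are exactly what the training code below computes with eig. As a quick sanity check (with a made-up symmetric matrix A standing in for $\mathbf{W}$, $\Sigma_0^{-1}$ or $\Sigma_1^{-1}$), $\mathbf{U}^T\mathbf{U}$ reproduces the original matrix even when some eigenvalues are negative:

A=[2 -1; -1 -3];        % a made-up symmetric (indefinite) matrix standing in for W
[V D]=eig(A);           % A = V*D*V'
U=(V*D^(1/2)).';        % unconjugated transpose, as D^(1/2) is complex for negative eigenvalues
err=norm(U.'*U-A)       % numerically zero: U.'*U recovers A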
In general, when all data points are kernel mapped to a higher dimensional space, the two classes can be more easily separated, even by a hyperplane based on the linear part of the decision function without the second order term. This allows the assumption that the two classes have the same covariance matrix, so that $\Sigma_0=\Sigma_1=\Sigma$, $\mathbf{W}=\mathbf{0}$, and the second order term is dropped. This is the justification for the following two special cases:
$$d(\mathbf{x})=\frac{1}{n_0}\sum_{\mathbf{x}_i\in C_+}K(\mathbf{x}_i,\mathbf{x})-\frac{1}{n_1}\sum_{\mathbf{x}_i\in C_-}K(\mathbf{x}_i,\mathbf{x})+b\qquad(\text{type 1: }\Sigma=\mathbf{I})$$

$$d(\mathbf{x})=\frac{1}{n_0}\sum_{\mathbf{x}_i\in C_+}K(\mathbf{U}\mathbf{x}_i,\mathbf{U}\mathbf{x})-\frac{1}{n_1}\sum_{\mathbf{x}_i\in C_-}K(\mathbf{U}\mathbf{x}_i,\mathbf{U}\mathbf{x})+b\qquad(\text{type 2: }\Sigma^{-1}=(\Sigma_0+\Sigma_1)^{-1}=\mathbf{U}^T\mathbf{U})\tag{184}$$
Note that if the kernel method is used to replace an inner product $\langle\mathbf{x},\mathbf{y}\rangle$ by a kernel function $K(\mathbf{x},\mathbf{y})$, we need to map all data points $\mathbf{x}_i$ to $\phi(\mathbf{x}_i)$ in the higher dimensional space, instead of only mapping their means $\mathbf{m}_0$ and $\mathbf{m}_1$, because the mapping of a sum is not equal to the sum of the mapped points if the kernel is not linear:
$$\phi(\mathbf{m}_0)=\phi\!\left(\frac{1}{n_0}\sum_{\mathbf{x}_i\in C_+}\mathbf{x}_i\right)\ne\frac{1}{n_0}\sum_{\mathbf{x}_i\in C_+}\phi(\mathbf{x}_i),\qquad
\phi(\mathbf{m}_1)=\phi\!\left(\frac{1}{n_1}\sum_{\mathbf{x}_i\in C_-}\mathbf{x}_i\right)\ne\frac{1}{n_1}\sum_{\mathbf{x}_i\in C_-}\phi(\mathbf{x}_i)\tag{187}$$
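This is easy to confirm numerically. The following minimal sketch, using made-up data and an RBF kernel with the same width $\sigma=5$ as in the examples below, shows that the kernel evaluated at the class mean differs from the average of the kernels of the class members:

sigma=5;                                        % RBF kernel width, as in the examples below
Krbf=@(u,v) exp(-norm(u-v)^2/(2*sigma^2));      % RBF kernel of two vectors
X0=[0 4 8; 0 0 6]; x=[2; 1];                    % three made-up points of C+ and a test point
m0=mean(X0,2);                                  % class mean in the original space
kernelOfMean=Krbf(m0,x)                         % kernel of the mean
meanOfKernels=mean([Krbf(X0(:,1),x) Krbf(X0(:,2),x) Krbf(X0(:,3),x)])   % mean of the kernels
% The two values differ, so the sums over all training samples in Eq. (182) cannot be
% replaced by single kernel evaluations at the class means.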
The Matlab code for the essential part of the algorithm is listed below. Given X and y for the data array composed of the $n$ training vectors $\mathbf{x}_1,\dots,\mathbf{x}_n$ (as columns) and their corresponding labelings $y_1,\dots,y_n\in\{+1,-1\}$, the code carries out the training and then classifies any unlabeled data point $\mathbf{x}$ into either of the two classes. Parameter type selects any one of the three different versions of the algorithm, and the function K(X,x) returns a 1-D array containing all kernel functions $K(\mathbf{x}_i,\mathbf{x})$ of the column vectors $\mathbf{x}_i$ in X and the vector $\mathbf{x}$.
X=getData; [m n]=size(X);                 % training data: n column vectors of dimension m
X0=X(:,find(y>0)); n0=size(X0,2);         % separate C+ and C-
X1=X(:,find(y<0)); n1=size(X1,2);
m0=mean(X0,2); C0=cov(X0');               % calculate mean and covariance of each class
m1=mean(X1,2); C1=cov(X1');
if type==1                                % type 1: identity covariance, kernels of raw data
    for i=1:n
        x(i)=sum(K(X0,X(:,i)))/n0-sum(K(X1,X(:,i)))/n1;
    end
elseif type==2                            % type 2: shared covariance C0+C1, kernels of U*x
    C=inv(C0+C1); [V D]=eig(C); U=(V*D^(1/2))';
    Z=U*X; Z0=U*X0; Z1=U*X1;
    for i=1:n
        x(i)=sum(K(Z0,Z(:,i)))/n0-sum(K(Z1,Z(:,i)))/n1;
    end
elseif type==3                            % type 3: full quadratic decision function
    C0=inv(C0); C1=inv(C1); W=-(C0-C1)/2;
    [V D]=eig(W); U=(V*D^(1/2)).'; Z=U*X; % .' since D^(1/2) may be complex
    [V0 D0]=eig(C0); U0=(V0*D0^(1/2))'; Z0=U0*X; Z00=U0*X0;
    [V1 D1]=eig(C1); U1=(V1*D1^(1/2))'; Z1=U1*X; Z11=U1*X1;
    for i=1:n
        x(i)=K(Z(:,i),Z(:,i))+sum(K(Z00,Z0(:,i)))/n0-sum(K(Z11,Z1(:,i)))/n1;
    end
end
[x I]=sort(x); y=y(I);                    % sort 1-D data together with their labelings
smax=0;
for i=1:n-1                               % find optimal threshold value b
    s=abs(sum(y(1:i)))+abs(sum(y(i+1:n)));
    if s>smax
        smax=s; b=-(x(i)+x(i+1))/2;
    end
end
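The kernel function K(X,x) is assumed to be supplied separately; it is not part of the listing above. A minimal sketch of one possible implementation of the RBF version is given below (the hard-coded width sigma=5 and the exact body are assumptions, not taken from the original):

function k=K(X,x)
% Returns a 1-D array with the kernel value between every column of X and the vector x.
sigma=5;                             % RBF kernel width used in the examples
d=X-repmat(x,1,size(X,2));           % differences x_i-x for all columns x_i of X
k=exp(-sum(d.^2,1)/(2*sigma^2));     % RBF kernel exp(-||x_i-x||^2/(2*sigma^2))
% For the linear kernel, the body reduces to k=x.'*X; the unconjugated transpose .'
% keeps the type 3 case correct when the mapped vectors are complex.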
Note that $\mathbf{W}$ may be either positive or negative definite (or indefinite), its eigenvalue matrix $\mathbf{D}$ may contain negative values, and $\mathbf{D}^{1/2}$ (and therefore $\mathbf{U}$ and the mapped data) may contain complex values; this is why the unconjugated transpose .' is used in the type 3 branch above. Given any unlabeled data point x, the code below is carried out
if type==1
    y(i)=sum(K(X0,x))/n0-sum(K(X1,x))/n1+b;
elseif type==2
    z=U*x;
    y(i)=sum(K(Z0,z))/n0-sum(K(Z1,z))/n1+b;
elseif type==3
    z=U*x; z0=U0*x; z1=U1*x;
    y(i)=K(z,z)+sum(K(Z00,z0))/n0-sum(K(Z11,z1))/n1+b;
end

to classify $\mathbf{x}$ into either $C_+$ if $y(i)>0$, or $C_-$ if $y(i)<0$.
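For example, the correct rate on a labeled test set could be estimated as follows; the test-set names Xt and yt are assumptions introduced here for illustration, not names from the original text:

nt=size(Xt,2);                              % hypothetical test set: columns of Xt, labels in yt
for i=1:nt
    x=Xt(:,i);
    d(i)=sum(K(X0,x))/n0-sum(K(X1,x))/n1+b; % type 1 decision value of the i-th test point
end
rate=mean(sign(d)==sign(yt(:)'))            % fraction of test points classified correctly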
In comparison with the SVM method, which requires solving a QP problem by an iterative algorithm (either the interior point method or the SMO method), the kernel based Bayes method is in closed form and therefore computationally very efficient. Moreover, as shown in the examples below, this method is also highly effective, as its classification results are comparable to, and often more accurate than, those of the SVM method.
We now show a few examples to test all three types of the kernel based Bayes method on a set of simulated 2-D data. Both the linear kernel $K(\mathbf{x},\mathbf{y})=\mathbf{x}^T\mathbf{y}$ and the RBF kernel $K(\mathbf{x},\mathbf{y})=\exp(-\|\mathbf{x}-\mathbf{y}\|^2/2\sigma^2)$ are used. The value of the parameter $\sigma$ used in the examples is 5, but it can be fine-tuned over a wide range to make a proper trade-off between accuracy and avoiding overfitting. The performance of these methods is also compared with that of the SVM method implemented by the Matlab function:
fitcsvm(X',y,'Standardize',true,'KernelFunction','linear','KernelScale','auto')
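For reference, one way the SVM comparison might be carried out is sketched below, under the assumption of the same training data X, y and a test set Xt, yt as above (predict is the standard companion of fitcsvm):

mdl=fitcsvm(X',y,'Standardize',true,'KernelFunction','linear','KernelScale','auto');
yhat=predict(mdl,Xt');                  % predicted labels for the test points (rows of Xt')
rate=mean(yhat==yt(:))                  % correct rate of the SVM on the test set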
Example 1: Based on some 2-D training datasets, four different binary classifiers are tested. The correct rate of each of the four methods is listed below, together with the corresponding partitioning of the 2-D space shown in the figures.
Example 2: The three datasets are generated using the Matlab code on this site. A single parameter value $\sigma$ is used for all three versions of the kernel Bayes method.
As the data are not linearly separable, all methods performed poorly except the kernelized Bayes type 3 classifier, which includes the second order term in the decision function.
All four methods performed very well, but all three variations of the kernelized Bayes method achieved higher correct rates than the SVM.
Example 3: The classification results of Fisher's iris data by the SVM method (Matlab function fitcsvm) and the kernel Bayes methods are shown below. This is an example used to illustrate the SVM method in the documentation of fitcsvm. In this example only two (the 3rd and 4th) of the four features are used, with half of the samples used for training and the other half for testing. Note that the second class (setosa, in green) and the third class (versicolor, in blue), which are not linearly separable, can be better separated by the kernel Bayes method.