The method of naive Bayes (NB) classification is a classical supervised classification algorithm, which is first trained by a training set of samples \{\mathbf{x}_n\} and their corresponding labelings \{y_n\}, and then used to classify any unlabeled sample \mathbf{x} into the class with the maximum posterior probability. As indicated by the name, naive Bayes classification is based on Bayes' theorem:

P(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)\,P(C_k)}{p(\mathbf{x})}

where P(C_k|\mathbf{x}) is the posterior probability of class C_k given the observed sample \mathbf{x}, p(\mathbf{x}|C_k) is the likelihood of class C_k, P(C_k) is the prior probability of class C_k, and p(\mathbf{x}) = \sum_{k=1}^K p(\mathbf{x}|C_k)\,P(C_k) is the evidence, the total probability of \mathbf{x} over all K classes.
Based on Bayes' theorem discussed above, the naive Bayes method classifies an unlabeled data sample \mathbf{x} to the class C_{\hat{k}} with the maximum posterior probability (maximum a posteriori (MAP)):

\hat{k} = \arg\max_k P(C_k|\mathbf{x}) = \arg\max_k \frac{p(\mathbf{x}|C_k)\,P(C_k)}{p(\mathbf{x})}     (12)

As p(\mathbf{x}) is common to all classes (independent of k), it plays no role in the relative comparison among the classes, and can therefore be dropped, i.e., \mathbf{x} is classified to the class C_{\hat{k}} with maximal p(\mathbf{x}|C_{\hat{k}})\,P(C_{\hat{k}}).
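As a quick illustration of the MAP rule in Eq. (12), the sketch below (all likelihood and prior values are made up for three hypothetical classes, not taken from the text) picks the class that maximizes p(\mathbf{x}|C_k)P(C_k); the evidence p(\mathbf{x}) only normalizes the posteriors and does not affect which class wins:

likelihood=[0.05 0.20 0.10];            % assumed p(x|C_k) evaluated at one sample x
prior=[0.5 0.3 0.2];                    % assumed prior probabilities P(C_k)
evidence=sum(likelihood.*prior);        % p(x), common to all classes
posterior=likelihood.*prior/evidence;   % P(C_k|x) by Bayes' theorem
[~,khat]=max(likelihood.*prior);        % MAP class; the evidence can be dropped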
The naive Bayes classifier is an optimal classifier in the sense that the classification error is minimum. To illustrate this, consider an arbitrary boundary x_0 between two classes C_1 and C_2 that partitions the 1-D feature space into two regions R_1 and R_2, as shown below. The probability of a misclassification is the joint probability p(x \in R_2,\, C_1) for a sample belonging to class C_1 but falling in R_2, and similarly p(x \in R_1,\, C_2) for a sample belonging to C_2 but falling in R_1. The total probability of error due to misclassification can be expressed as:

P(error) = p(x \in R_2,\, C_1) + p(x \in R_1,\, C_2) = \int_{R_2} p(x|C_1)\,P(C_1)\,dx + \int_{R_1} p(x|C_2)\,P(C_2)\,dx     (13)
It is obvious that the Bayes classifier is indeed optimal: its boundary between classes C_1 and C_2 (the bottom plot), placed where the two weighted likelihoods p(x|C_1)P(C_1) and p(x|C_2)P(C_2) intersect, guarantees that the classification error (shaded area) is minimum.
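The optimality can also be checked numerically. The sketch below (using two arbitrary 1-D Gaussian classes as an assumption, not the data of the figure) evaluates the error of Eq. (13) for a range of candidate boundaries x_0 and confirms that the minimum is attained at the Bayes boundary, where the two weighted likelihoods cross:

m=[0 2]; s=[1 1]; P=[0.5 0.5];          % assumed means, std's and priors of C1, C2
g=@(x,mu,sg) exp(-(x-mu).^2/(2*sg^2))/(sg*sqrt(2*pi));   % Gaussian density
x0=linspace(-2,4,601);                  % candidate boundaries between C1 and C2
x=linspace(-10,12,20001);               % integration grid
perr=zeros(size(x0));
for i=1:length(x0)
    % Eq.(13): C1 samples falling in R2 (x>x0) plus C2 samples falling in R1 (x<x0)
    perr(i)=P(1)*trapz(x(x>x0(i)),g(x(x>x0(i)),m(1),s(1))) ...
           +P(2)*trapz(x(x<x0(i)),g(x(x<x0(i)),m(2),s(2)));
end
[pmin,imin]=min(perr);                  % minimum error and the boundary achieving it
fprintf('minimum error %.4f at x0=%.2f (Bayes boundary at %.2f)\n', ...
        pmin, x0(imin), (m(1)+m(2))/2); % equal priors and variances: the midpoint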
To find the likelihood function p(\mathbf{x}|C_k), assumed here to be a d-dimensional Gaussian density N(\mathbf{x};\,\mathbf{m}_k,\,\Sigma_k), we need to find the mean vector \mathbf{m}_k and covariance matrix \Sigma_k based on the training set \{(\mathbf{x}_n, y_n)\}, specifically the N_k training samples in class C_k, by the method of maximum likelihood estimation:

\mathbf{m}_k = \frac{1}{N_k}\sum_{\mathbf{x}_n \in C_k} \mathbf{x}_n, \qquad
\Sigma_k = \frac{1}{N_k}\sum_{\mathbf{x}_n \in C_k} (\mathbf{x}_n - \mathbf{m}_k)(\mathbf{x}_n - \mathbf{m}_k)^T     (14)

The prior probability of each class can be estimated as P(C_k) = N_k/N, the fraction of all N training samples that belong to C_k.
Once both \mathbf{m}_k and \Sigma_k are available, the classifier is trained, and the posterior P(C_k|\mathbf{x}) can be calculated for the classification in Eq. (12).
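As a minimal sketch of the estimation in Eq. (14) for a single class (the toy samples and the total sample count below are assumptions for illustration only), the class mean, covariance and prior are simply the sample mean, the sample covariance and the relative frequency of the class; note that the NBtraining function listed later uses Matlab's cov, which divides by N_k - 1 (the unbiased variant) instead of N_k:

Xk=[1 2 3 2.5; 0 1 1.5 0.5];            % assumed toy samples of one class, d=2, Nk=4
N=10;                                   % assumed total number of training samples
Nk=size(Xk,2);                          % number of samples in this class
mk=mean(Xk,2);                          % mean vector m_k, Eq. (14)
Xc=Xk-repmat(mk,1,Nk);                  % centered samples
Sk=Xc*Xc'/Nk;                           % ML covariance estimate Sigma_k, Eq. (14)
Pk=Nk/N;                                % prior probability estimate P(C_k)=Nk/N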
As the classification is based on the relative comparison of the posteriors P(C_k|\mathbf{x}), any monotonic mapping of the posterior can be equivalently used to simplify the computation, such as the logarithmic function:

d_k(\mathbf{x}) = \ln\left[ p(\mathbf{x}|C_k)\,P(C_k) \right] = \ln p(\mathbf{x}|C_k) + \ln P(C_k)     (15)

For the Gaussian likelihood, after dropping the constant term -\frac{d}{2}\ln(2\pi) common to all classes, this discriminant function becomes

d_k(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\mathbf{m}_k)^T \Sigma_k^{-1} (\mathbf{x}-\mathbf{m}_k) - \frac{1}{2}\ln|\Sigma_k| + \ln P(C_k)     (16)

The classification can now be stated in terms of these discriminant functions:

if d_k(\mathbf{x}) > d_l(\mathbf{x}) for all l \ne k, then \mathbf{x} \in C_k     (17)

or, equivalently,

\hat{k} = \arg\max_k d_k(\mathbf{x}), \qquad \mathbf{x} \in C_{\hat{k}}     (18)
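As a minimal sketch (the class parameters below are assumed for illustration), the discriminant in Eq. (16) can be evaluated directly for a given sample, which is exactly what the testing code later in this section does for every class and every sample:

mk=[1;2]; Sk=[2 0.5;0.5 1]; Pk=0.4;     % assumed mean, covariance and prior of one class
x=[1.5;1.0];                            % sample to be classified
xm=x-mk;                                % deviation from the class mean
dk=-(xm'*inv(Sk)*xm)/2-log(det(Sk))/2+log(Pk);   % discriminant d_k(x) of Eq. (16)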
Geometrically, the feature space is partitioned into regions corresponding to the K classes by the decision boundaries between every pair of classes C_k and C_l, represented by the equation d_k(\mathbf{x}) = d_l(\mathbf{x}), i.e.,

-\frac{1}{2}(\mathbf{x}-\mathbf{m}_k)^T \Sigma_k^{-1}(\mathbf{x}-\mathbf{m}_k) - \frac{1}{2}\ln|\Sigma_k| + \ln P(C_k) = -\frac{1}{2}(\mathbf{x}-\mathbf{m}_l)^T \Sigma_l^{-1}(\mathbf{x}-\mathbf{m}_l) - \frac{1}{2}\ln|\Sigma_l| + \ln P(C_l)     (19)

or, after expansion,

\mathbf{x}^T\left(\Sigma_k^{-1}-\Sigma_l^{-1}\right)\mathbf{x} - 2\left(\mathbf{m}_k^T\Sigma_k^{-1} - \mathbf{m}_l^T\Sigma_l^{-1}\right)\mathbf{x} + \mathbf{m}_k^T\Sigma_k^{-1}\mathbf{m}_k - \mathbf{m}_l^T\Sigma_l^{-1}\mathbf{m}_l + \ln\frac{|\Sigma_k|}{|\Sigma_l|} - 2\ln\frac{P(C_k)}{P(C_l)} = 0     (20)

which is in general a quadratic equation in \mathbf{x}, i.e., the decision boundaries are quadratic hypersurfaces.
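The quadratic nature of such a boundary can be visualized by evaluating the difference d_k(\mathbf{x}) - d_l(\mathbf{x}) on a grid and plotting its zero contour. The sketch below uses two assumed 2-D Gaussian classes (all parameters are made up for illustration):

m1=[0;0]; S1=[1 0;0 1];       P1=0.5;   % assumed parameters of class 1
m2=[3;0]; S2=[2 -0.8;-0.8 1]; P2=0.5;   % assumed parameters of class 2
d=@(x,m,S,P) -((x-m)'*inv(S)*(x-m))/2-log(det(S))/2+log(P);   % discriminant, Eq. (16)
[u,v]=meshgrid(linspace(-4,7,200),linspace(-5,5,200));        % grid over feature space
g=zeros(size(u));
for i=1:numel(u)
    x=[u(i);v(i)];
    g(i)=d(x,m1,S1,P1)-d(x,m2,S2,P2);   % difference of the two discriminants
end
contour(u,v,g,[0 0],'k');               % zero contour = decision boundary, Eq. (19)
hold on; plot(m1(1),m1(2),'b+',m2(1),m2(2),'r+'); hold off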
We further consider several special cases:

If all classes share the same covariance matrix, \Sigma_k = \Sigma, the term -\frac{1}{2}\ln|\Sigma| is common to all classes and can be dropped, and the discriminant function in Eq. (16) becomes

d_k(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\mathbf{m}_k)^T \Sigma^{-1}(\mathbf{x}-\mathbf{m}_k) + \ln P(C_k)     (23)

As the quadratic term \mathbf{x}^T\Sigma^{-1}\mathbf{x} is now also common to all classes, it can be dropped as well, and the discriminant function becomes linear in \mathbf{x}:

d_k(\mathbf{x}) = \mathbf{m}_k^T\Sigma^{-1}\mathbf{x} - \frac{1}{2}\mathbf{m}_k^T\Sigma^{-1}\mathbf{m}_k + \ln P(C_k)     (24)

We can reconsider Example 2 in the previous section but now based on the negative of the discriminant function above treated as a distance D_k(\mathbf{x}): when the priors P(C_k) are equal, maximizing d_k(\mathbf{x}) in Eq. (23) is equivalent to minimizing the Mahalanobis distance between \mathbf{x} and the class mean \mathbf{m}_k,

D_k(\mathbf{x}) = \sqrt{ (\mathbf{x}-\mathbf{m}_k)^T \Sigma^{-1}(\mathbf{x}-\mathbf{m}_k) }     (25)

and we get the minimum-distance classification rule

if D_k(\mathbf{x}) < D_l(\mathbf{x}) for all l \ne k, then \mathbf{x} \in C_k     (26)

The boundary equation between C_k and C_l becomes linear in \mathbf{x}:

(\mathbf{m}_k - \mathbf{m}_l)^T \Sigma^{-1} \mathbf{x} = \frac{1}{2}\left( \mathbf{m}_k^T\Sigma^{-1}\mathbf{m}_k - \mathbf{m}_l^T\Sigma^{-1}\mathbf{m}_l \right) - \ln\frac{P(C_k)}{P(C_l)}     (27)

If, furthermore, the shared covariance matrix is \Sigma = \sigma^2\mathbf{I}, the discriminant function reduces to

d_k(\mathbf{x}) = -\frac{1}{2\sigma^2}\lVert \mathbf{x}-\mathbf{m}_k \rVert^2 + \ln P(C_k)     (29)

or, dropping the common term -\lVert\mathbf{x}\rVert^2/(2\sigma^2),

d_k(\mathbf{x}) = \frac{1}{\sigma^2}\left( \mathbf{m}_k^T\mathbf{x} - \frac{1}{2}\lVert\mathbf{m}_k\rVert^2 \right) + \ln P(C_k)     (30)

With equal priors the Mahalanobis distance reduces to the Euclidean distance

D_k(\mathbf{x}) = \lVert \mathbf{x}-\mathbf{m}_k \rVert     (31)

and the classifier becomes a minimum (Euclidean) distance classifier:

if \lVert \mathbf{x}-\mathbf{m}_k \rVert < \lVert \mathbf{x}-\mathbf{m}_l \rVert for all l \ne k, then \mathbf{x} \in C_k     (32)

The boundary equation becomes:

(\mathbf{m}_k - \mathbf{m}_l)^T \mathbf{x} = \frac{1}{2}\left( \lVert\mathbf{m}_k\rVert^2 - \lVert\mathbf{m}_l\rVert^2 \right) - \sigma^2 \ln\frac{P(C_k)}{P(C_l)}     (33)

which, for equal priors, can be written as

(\mathbf{m}_k - \mathbf{m}_l)^T \left( \mathbf{x} - \frac{\mathbf{m}_k + \mathbf{m}_l}{2} \right) = 0     (34)

i.e., the hyperplane through the midpoint of \mathbf{m}_k and \mathbf{m}_l and perpendicular to the line segment connecting them.
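A minimal sketch of these special cases (the class means, the shared covariance matrix and the test sample below are assumptions for illustration): with a shared covariance the sample is assigned to the nearest mean in the Mahalanobis distance, and with \Sigma = \sigma^2\mathbf{I} to the nearest mean in the Euclidean distance:

M=[0 3 1; 0 0 3];                       % assumed d x K matrix of K=3 class means
S=[2 0.6; 0.6 1];                       % assumed shared covariance, Sigma_k = Sigma
x=[1.8; 0.9];                           % sample to be classified
K=size(M,2);
Dmah=zeros(1,K); Deuc=zeros(1,K);
for k=1:K
    xm=x-M(:,k);
    Dmah(k)=sqrt(xm'*inv(S)*xm);        % Mahalanobis distance, Eq. (25)
    Deuc(k)=norm(xm);                   % Euclidean distance, Eq. (31)
end
[~,kmah]=min(Dmah);                     % nearest mean by Mahalanobis distance, Eq. (26)
[~,keuc]=min(Deuc);                     % nearest mean by Euclidean distance, Eq. (32)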
The Matlab functions for training and testing are listed below:
function [M S P]=NBtraining(X,y)        % naive Bayes training
    [d N]=size(X);                      % dimensions of dataset
    K=length(unique(y));                % number of classes
    M=zeros(d,K);                       % mean vectors
    S=zeros(d,d,K);                     % covariance matrices
    P=zeros(1,K);                       % prior probabilities
    for k=1:K
        idx=find(y==k);                 % indices of samples in kth class
        Xk=X(:,idx);                    % all samples in kth class
        P(k)=length(idx)/N;             % prior probability
        M(:,k)=mean(Xk');               % mean vector of kth class
        S(:,:,k)=cov(Xk');              % covariance matrix of kth class
    end
end

function yhat=NBtesting(X,M,S,P)        % naive Bayes testing
    [d N]=size(X);                      % dimensions of dataset
    K=length(P);                        % number of classes
    for k=1:K
        InvS(:,:,k)=inv(S(:,:,k));      % inverse of covariance matrix
        Det(k)=det(S(:,:,k));           % determinant of covariance matrix
    end
    for n=1:N                           % for each of N samples
        x=X(:,n);
        dmax=-inf;
        for k=1:K
            xm=x-M(:,k);
            d=-(xm'*InvS(:,:,k)*xm)/2;
            d=d-log(Det(k))/2+log(P(k));   % discriminant function
            if d>dmax
                yhat(n)=k;              % assign nth sample to kth class
                dmax=d;
            end
        end
    end
end
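The two functions can be exercised on synthetic data, for example as in the sketch below (the two Gaussian classes generated here are assumptions, not the datasets used in the examples that follow):

N1=100; N2=100;                         % assumed numbers of samples per class
X1=repmat([0;0],1,N1)+chol([1 0.3;0.3 1])'*randn(2,N1);     % class 1 samples
X2=repmat([3;1],1,N2)+chol([2 -0.5;-0.5 1])'*randn(2,N2);   % class 2 samples
X=[X1 X2]; y=[ones(1,N1) 2*ones(1,N2)]; % data matrix and ground truth labels
[M S P]=NBtraining(X,y);                % train: means, covariances, priors
yhat=NBtesting(X,M,S,P);                % classify the training samples
trainErr=mean(yhat~=y);                 % resubstitution error rate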
The classification result, in terms of the estimated class identity of the training samples compared with the ground truth class labeling, can be represented in the form of a confusion matrix, of which the element in the ith row and jth column is the number of training samples labeled as members of the ith class but actually classified into the jth class. For an ideal classifier that classifies all data samples correctly, this confusion matrix should be a diagonal matrix. For a non-ideal classifier, the error rate can be found as the percentage of misclassified samples (the sum of all off-diagonal elements) out of the total number of samples. The code below shows the function that generates such a confusion matrix based on the estimated class identity yhat and the ground truth labeling y:
function [Cm er]=ConfusionMatrix(yhat,ytrain)
    N=length(ytrain);                   % number of test samples
    K=length(unique(ytrain));           % number of classes
    Cm=zeros(K);                        % the confusion matrix
    for n=1:N
        i=ytrain(n);                    % true class of nth sample
        j=yhat(n);                      % assigned class of nth sample
        Cm(i,j)=Cm(i,j)+1;
    end
    er=1-sum(diag(Cm))/N;               % error rate
end
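Continuing the synthetic sketch above (still under the same assumed data and the variables yhat and y defined there), the confusion matrix and error rate of the classification result can then be obtained as:

[Cm er]=ConfusionMatrix(yhat,y);        % rows: true class, columns: assigned class
disp(Cm)                                % ideally a diagonal matrix
fprintf('error rate: %.1f%%\n',100*er);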
Example 1:
This example shows the classification of two concentric classes, with one surrounding the other. They are separated by an ellipse, with 9 out of 200 samples misclassified, i.e., an error rate of 9/200 = 4.5%. The confusion matrix is shown below:

(37)
Example 2:
This example shows the classification of a 2-class exclusive-or (XOR) data set with significant overlap, in terms of the confusion matrix and the resulting error rate. Note that the decision boundaries are a pair of hyperbolas. The confusion matrix is

(38)
Example 3:
The first two panels in the figure below show three classes in the 2-D space, while the third one shows the partitioning of the space corresponding to the three classes. Note that the boundaries between the classes are all quadratic. The confusion matrix of the classification result and the corresponding error rate are shown below:

(39)
Example 4:
The figure below shows some sub-samples of ten sets of digits from 0 to 9, each handwritten 224 times by different students. Each handwritten digit is represented as a 16 by 16 image containing 256 pixels, treated as a 256-dimensional vector.
The dataset can be visualized by applying the Karhunen-Loeve transform (KLT) to map all data points in the 256-D space into a 3-D space spanned by the three eigenvectors corresponding to the three greatest eigenvalues of the covariance matrix of the dataset, as shown below:
The naive Bayes classifier is applied to this dataset of handwritten digits from 0 to 9, in which each data sample is a 256-dimensional vector composed of the pixels of the image of a handwritten digit. The dataset contains 224 samples per class, i.e., 2,240 samples in total for all ten classes.
We note that the 256-dimensional covariance matrix for each of the classes is estimated from only the 224 samples in that class; as 224 < 256, the estimated covariance matrix does not have full rank and is not invertible. This issue can be resolved by applying the KLT to the dataset to reduce its dimensionality from 256 to some value smaller than 224, such as 100.
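A minimal sketch of this dimensionality reduction by the KLT, assuming the 256 x 2240 data matrix is stored column-wise in a variable X (the variable names here are assumptions): the data are projected onto the eigenvectors of the overall covariance matrix associated with the largest eigenvalues. The same code with p = 3 produces the 3-D visualization above.

p=100;                                  % target dimensionality (p < 224)
N=size(X,2);                            % number of samples (2240)
m=mean(X,2);                            % overall mean vector
Xc=X-repmat(m,1,N);                     % centered data
[V,D]=eig(Xc*Xc'/N);                    % eigendecomposition of the covariance matrix
[~,idx]=sort(diag(D),'descend');        % order eigenvalues in decreasing order
V=V(:,idx(1:p));                        % keep the p leading eigenvectors
Y=V'*Xc;                                % p x N data matrix after the KLT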
The classifier is trained on half of the dataset, 1120 randomly selected samples, and then tested on the remaining 1120 samples. The classification results are given below in terms of the confusion matrices of the training samples (left) and the testing samples (right). All 1120 training samples are correctly classified, while out of the 1120 test samples, 198 are misclassified (an error rate of 198/1120, about 17.7%).
(40)
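The random half-and-half split and the evaluation of both confusion matrices can be sketched as below, assuming the reduced p x 2240 data matrix Y from the KLT sketch above and a label vector y taking values 1 through 10 for the digits 0 through 9 (both names are assumptions):

N=size(Y,2);                            % 2240 samples in total
idx=randperm(N);                        % random permutation of the sample indices
itr=idx(1:N/2); ite=idx(N/2+1:end);     % 1120 training and 1120 testing indices
[M S P]=NBtraining(Y(:,itr),y(itr));    % train on half of the dataset
[CmTrain erTrain]=ConfusionMatrix(NBtesting(Y(:,itr),M,S,P),y(itr));   % training set
[CmTest erTest]=ConfusionMatrix(NBtesting(Y(:,ite),M,S,P),y(ite));     % testing set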