The method of naive Bayes (NB) classification is a classical supervised classification algorithm. It is first trained by a training set of samples $\{{\bf x}_1,\cdots,{\bf x}_N\}$ and their corresponding labelings $\{y_1,\cdots,y_N\}$, and then used to classify any unlabeled sample ${\bf x}$ into the class $C_k$ with the maximum posterior probability. As indicated by the name, naive Bayes classification is based on Bayes' theorem:

$$P(C_k|{\bf x})=\frac{p({\bf x}|C_k)\,P(C_k)}{p({\bf x})}
=\frac{p({\bf x}|C_k)\,P(C_k)}{\sum_{l=1}^K p({\bf x}|C_l)\,P(C_l)}$$

where $P(C_k)$ is the prior probability of class $C_k$, $p({\bf x}|C_k)$ is the likelihood of ${\bf x}$ given class $C_k$, and $P(C_k|{\bf x})$ is the posterior probability of class $C_k$ given ${\bf x}$.
Based on Bayes' theorem discussed above, the naive Bayes method classifies an unlabeled data sample ${\bf x}$ into the class $C_{\hat k}$ with the maximum posterior probability (maximum a posteriori (MAP)):

$$\hat{k}=\arg\max_k P(C_k|{\bf x})
=\arg\max_k \frac{p({\bf x}|C_k)\,P(C_k)}{p({\bf x})}
=\arg\max_k p({\bf x}|C_k)\,P(C_k) \qquad(12)$$

where the evidence $p({\bf x})$ in the denominator is dropped as it is common to all classes.
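As a minimal sketch of the MAP rule in Eq. (12), the Matlab fragment below classifies a single 1-D sample, assuming two hypothetical Gaussian classes with made-up means, standard deviations and priors (normpdf is from the Statistics Toolbox):

x = 1.2;                       % an unlabeled 1-D sample (hypothetical)
P = [0.6 0.4];                 % assumed prior probabilities P(C_k)
m = [0 2];                     % assumed class means
s = [1 1];                     % assumed class standard deviations
like = normpdf(x, m, s);       % likelihoods p(x|C_k), k=1,2
[~, khat] = max(like.*P);      % MAP: class of maximum p(x|C_k)P(C_k)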
The naive Bayes classifier is optimal in the sense that its classification error is minimum. To illustrate this, consider an arbitrary boundary between two classes $C_1$ and $C_2$ that partitions the 1-D feature space into two regions $R_1$ and $R_2$, as shown below. A misclassification occurs with the joint probability $P({\bf x}\in R_j,\,C_i)$ for a sample belonging to class $C_i$ but falling in region $R_j$ ($j\ne i$). The total probability of error due to misclassification can be expressed as:

$$P(error)=P({\bf x}\in R_2,\,C_1)+P({\bf x}\in R_1,\,C_2)
=\int_{R_2}p({\bf x}|C_1)P(C_1)\,d{\bf x}+\int_{R_1}p({\bf x}|C_2)P(C_2)\,d{\bf x} \qquad(13)$$
The Bayes classifier is indeed optimal, because its boundary $x^*$ between classes $C_1$ and $C_2$, at which $p(x|C_1)P(C_1)=p(x|C_2)P(C_2)$ (the bottom plot), assigns every point to the class of the greater joint probability and therefore guarantees the classification error (shaded area) to be minimum.
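This can also be checked numerically. The sketch below, assuming two equiprobable 1-D Gaussian classes with made-up parameters, evaluates the error of Eq. (13) as a function of the boundary position $t$ and confirms that it is minimized at the Bayes boundary (using normcdf from the Statistics Toolbox):

P1 = 0.5;  P2 = 0.5;           % assumed prior probabilities
m1 = 0;  m2 = 2;  s = 1;       % assumed class means and common std
% error of Eq. (13) with R1=(-inf,t) for C_1 and R2=(t,inf) for C_2
err = @(t) P1*(1-normcdf(t,m1,s)) + P2*normcdf(t,m2,s);
tb  = fminbnd(err, m1, m2);    % boundary that minimizes the error
[err(0.5) err(tb)]             % any other boundary gives a larger error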
To find the likelihood function $p({\bf x}|C_k)$, here assumed to be a Gaussian ${\cal N}({\bf x};\,{\bf m}_k,{\bf\Sigma}_k)$, we need to find the mean vector ${\bf m}_k$ and the covariance matrix ${\bf\Sigma}_k$ based on the training set, specifically the $N_k$ training samples ${\bf x}_n$ in class $C_k$, by the method of maximum likelihood estimation:

$$\hat{\bf m}_k=\frac{1}{N_k}\sum_{{\bf x}_n\in C_k}{\bf x}_n,\qquad
\hat{\bf\Sigma}_k=\frac{1}{N_k}\sum_{{\bf x}_n\in C_k}
({\bf x}_n-\hat{\bf m}_k)({\bf x}_n-\hat{\bf m}_k)^T \qquad(14)$$
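A minimal sketch of Eq. (14) for one class, with made-up 2-D samples (note that the ML estimate of the covariance divides by $N_k$, while Matlab's cov used in the training function below divides by $N_k-1$; the difference is negligible for large $N_k$):

Xk = randn(2,50) + repmat([1;2],1,50);   % hypothetical samples of one class (d=2, Nk=50)
Nk = size(Xk,2);                         % number of samples in the class
mk = mean(Xk,2);                         % ML estimate of the mean vector
Xm = Xk - repmat(mk,1,Nk);               % zero-mean samples
Sk = Xm*Xm'/Nk;                          % ML estimate of the covariance matrix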
Once both $\hat{\bf m}_k$ and $\hat{\bf\Sigma}_k$ are available, together with the prior $P(C_k)$ estimated by the relative frequency $N_k/N$, the classifier is trained, and the posterior $P(C_k|{\bf x})$ can be calculated for the classification in Eq. (12).
As the classification is based only on the relative comparison of the posteriors $P(C_k|{\bf x})$, any monotonic mapping of the posterior can be equivalently used to simplify the computation, such as the logarithmic function, which gives the discriminant function

$$d_k({\bf x})=\ln\left[p({\bf x}|C_k)P(C_k)\right]
=\ln p({\bf x}|C_k)+\ln P(C_k) \qquad(15)$$

For the Gaussian likelihood $p({\bf x}|C_k)={\cal N}({\bf x};\,{\bf m}_k,{\bf\Sigma}_k)$, after dropping the constant term $-d\ln(2\pi)/2$ common to all classes, this becomes

$$d_k({\bf x})=-\frac{1}{2}({\bf x}-{\bf m}_k)^T{\bf\Sigma}_k^{-1}({\bf x}-{\bf m}_k)
-\frac{1}{2}\ln|{\bf\Sigma}_k|+\ln P(C_k) \qquad(16)$$

A sample ${\bf x}$ is classified into class $C_k$ if

$$d_k({\bf x})>d_l({\bf x}),\qquad\mbox{for all } l\ne k \qquad(17)$$

i.e.,

$$\hat{k}=\arg\max_k\,d_k({\bf x}) \qquad(18)$$
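The discriminant function in Eq. (16) can be coded directly. The anonymous function below is a sketch, assuming the trained mean mk, covariance Sk, and prior Pk of class k are given (the backslash operator is used instead of explicitly inverting Sk):

dk = @(x,mk,Sk,Pk) -0.5*(x-mk)'*(Sk\(x-mk)) - 0.5*log(det(Sk)) + log(Pk);
d1 = dk([1;0], [0;0], eye(2), 0.5);    % e.g., with made-up parameters
d2 = dk([1;0], [2;2], eye(2), 0.5);    % classify into C_1 if d1 > d2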
Geometrically, the feature space is partitioned into regions corresponding to the classes by the decision boundaries between every pair of classes $C_k$ and $C_l$, represented by the equation $d_k({\bf x})=d_l({\bf x})$, i.e.,

$$d_k({\bf x})-d_l({\bf x})=0 \qquad(19)$$

or

$$-\frac{1}{2}({\bf x}-{\bf m}_k)^T{\bf\Sigma}_k^{-1}({\bf x}-{\bf m}_k)
+\frac{1}{2}({\bf x}-{\bf m}_l)^T{\bf\Sigma}_l^{-1}({\bf x}-{\bf m}_l)
-\frac{1}{2}\ln\frac{|{\bf\Sigma}_k|}{|{\bf\Sigma}_l|}
+\ln\frac{P(C_k)}{P(C_l)}=0 \qquad(20)$$

which is in general a quadratic equation in ${\bf x}$.
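To visualize such a boundary, the sketch below (with made-up parameters for two 2-D Gaussian classes) plots the zero-level contour of $d_1({\bf x})-d_2({\bf x})$ in Eq. (19), in general a quadratic curve:

m1 = [0;0];  S1 = eye(2);        P1 = 0.5;   % assumed parameters of C_1
m2 = [2;1];  S2 = [2 .5; .5 1];  P2 = 0.5;   % assumed parameters of C_2
dk = @(x,m,S,P) -0.5*(x-m)'*(S\(x-m)) - 0.5*log(det(S)) + log(P);
[x1,x2] = meshgrid(linspace(-4,6,200));
D = arrayfun(@(a,b) dk([a;b],m1,S1,P1)-dk([a;b],m2,S2,P2), x1, x2);
contour(x1, x2, D, [0 0], 'k');              % decision boundary d_1(x)=d_2(x)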
We further consider several special cases. If all classes share the same covariance matrix ${\bf\Sigma}_k={\bf\Sigma}$ and the same prior, the terms $-\ln|{\bf\Sigma}_k|/2$ and $\ln P(C_k)$ are common to all classes and can be dropped, and the discriminant function reduces to

$$d_k({\bf x})=-\frac{1}{2}({\bf x}-{\bf m}_k)^T{\bf\Sigma}^{-1}({\bf x}-{\bf m}_k) \qquad(23)$$

the negative of which is half the squared Mahalanobis distance from ${\bf x}$ to the class mean ${\bf m}_k$, so that the boundary between any two classes becomes linear. If further ${\bf\Sigma}=\sigma^2{\bf I}$, the discriminant function reduces to

$$d_k({\bf x})=-\frac{1}{2\sigma^2}\Vert{\bf x}-{\bf m}_k\Vert^2 \qquad(24)$$

and the classifier becomes a minimum (Euclidean) distance classifier that assigns ${\bf x}$ to the class of the nearest mean.
We can reconsider Example 2 in the previous section, but now based on the negative of the discriminant function above treated as a distance from ${\bf x}$ to class $C_k$,

$$D_k({\bf x})=-d_k({\bf x})=\frac{1}{2}({\bf x}-{\bf m}_k)^T{\bf\Sigma}_k^{-1}({\bf x}-{\bf m}_k)
+\frac{1}{2}\ln|{\bf\Sigma}_k|-\ln P(C_k) \qquad(25)$$

and get the classification rule

$$\hat{k}=\arg\min_k\,D_k({\bf x}) \qquad(26)$$
The boundary equation $d_1({\bf x})=d_2({\bf x})$ between each pair of classes, such as $C_1$ and $C_2$, then becomes a specific quadratic equation in the components of ${\bf x}$, obtained by substituting the estimated means, covariance matrices, and priors of the two classes into Eq. (20).
The Matlab functions for training and testing are listed below:
function [M S P]=NBtraining(X,y)       % naive Bayes training
    [d N]=size(X);                     % dimensions of dataset
    K=length(unique(y));               % number of classes
    M=zeros(d,K);                      % mean vectors
    S=zeros(d,d,K);                    % covariance matrices
    P=zeros(1,K);                      % prior probabilities
    for k=1:K
        idx=find(y==k);                % indices of samples in kth class
        Xk=X(:,idx);                   % all samples in kth class
        P(k)=length(idx)/N;            % prior probability
        M(:,k)=mean(Xk');              % mean vector of kth class
        S(:,:,k)=cov(Xk');             % covariance matrix of kth class
    end
end

function yhat=NBtesting(X,M,S,P)       % naive Bayes testing
    [d N]=size(X);                     % dimensions of dataset
    K=length(P);                       % number of classes
    for k=1:K
        InvS(:,:,k)=inv(S(:,:,k));     % inverse of covariance matrix
        Det(k)=det(S(:,:,k));          % determinant of covariance matrix
    end
    for n=1:N                          % for each of N samples
        x=X(:,n);
        dmax=-inf;
        for k=1:K
            xm=x-M(:,k);
            d=-(xm'*InvS(:,:,k)*xm)/2;
            d=d-log(Det(k))/2+log(P(k));   % discriminant function
            if d>dmax
                yhat(n)=k;             % assign nth sample to kth class
                dmax=d;
            end
        end
    end
end
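A hypothetical usage of these two functions on a synthetic two-class dataset (the data below are made up; X holds one sample per column and y the corresponding labels, as assumed by the functions above):

N = 100;                                         % samples per class
X = [randn(2,N), randn(2,N)+repmat([3;3],1,N)];  % two 2-D Gaussian classes
y = [ones(1,N), 2*ones(1,N)];                    % ground truth labels
[M,S,P] = NBtraining(X, y);                      % estimate means, covariances, priors
yhat = NBtesting(X, M, S, P);                    % classify the training samples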
The classification result in terms of the estimated class identity $\hat{y}_n$ of the training samples compared with the ground truth class labeling $y_n$ can be represented in the form of a $K\times K$ confusion matrix, of which the element in the ith row and jth column is the number of training samples labeled as a member of the ith class but actually classified into the jth class. For an ideal classifier that classifies all data samples correctly, this confusion matrix should be a diagonal matrix. But for a nonideal classifier, its error rate can be found as the percentage of misclassified samples (the sum of all off-diagonal elements) out of the total number of samples. The code below shows the function that generates such a confusion matrix based on the estimated class identity $\hat{y}_n$ and the ground truth labeling $y_n$:
function [Cm er]=ConfusionMatrix(yhat,ytrain)
    N=length(ytrain);                  % number of test samples
    K=length(unique(ytrain));          % number of classes
    Cm=zeros(K);                       % the confusion matrix
    for n=1:N
        i=ytrain(n);
        j=yhat(n);
        Cm(i,j)=Cm(i,j)+1;
    end
    er=1-sum(diag(Cm))/N;              % error rate
end
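A small hand-made check of this function (the labels below are made up so that 8 of the 10 samples are classified correctly):

ytrue = [1 1 1 1 1 2 2 2 2 2];            % hypothetical ground truth labels
yhat  = [1 1 2 1 1 2 2 1 2 2];            % hypothetical classifier output
[Cm, er] = ConfusionMatrix(yhat, ytrue)   % Cm = [4 1; 1 4], er = 0.2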
Example 1:
This example shows the classification of two concentric classes with one surrounding the other. They are separated by an ellipse with 9 out of 200 samples misclassified, i.e., an error rate of $9/200=4.5\%$. The confusion matrix is shown below:
(37)
Example 2:
This example shows the classification of a two-class exclusive-or (XOR) data set with significant overlap, in terms of the confusion matrix and the resulting error rate. Note that the decision boundaries are a pair of hyperbolas. The confusion matrix is
(38)
Example 3:
The first two panels in the figure below show three classes in the 2-D space, while the third one shows the partitioning of the space corresponding to the three classes. Note that the boundaries between the classes are all quadratic. The confusion matrix of the classification result is shown below, together with the error rate:
(39)
Example 4:
The figure below shows some samples of the ten digits from 0 to 9, each handwritten 224 times by different students. Each handwritten digit is represented as a 16 by 16 image containing 256 pixels, treated as a 256-dimensional vector.
The dataset can be visualized by applying the KLT to map all data points in the 256-D space into a 3-D space spanned by the three eigenvectors corresponding to the three greatest eigenvalues of the covariance matrix of the dataset, as shown below:
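A sketch of this visualization, assuming the dataset is stored as a 256 by 2240 matrix X (one handwritten digit per column) with class labels y:

Xc = X - repmat(mean(X,2), 1, size(X,2));  % center the data
[V,D] = eig(cov(Xc'));                     % eigenvectors of the covariance matrix
[~,idx] = sort(diag(D), 'descend');        % sort by decreasing eigenvalue
Y = V(:,idx(1:3))'*Xc;                     % project onto the top 3 eigenvectors
scatter3(Y(1,:), Y(2,:), Y(3,:), 10, y, 'filled');  % color the points by class label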
The naive Bayes classifier is applied to this dataset of handwritten digits from 0 to 9, of which each data sample is a 256-dimensional vector composed of the $16\times 16=256$ pixels of the image of a handwritten digit. The dataset contains 224 samples per class and $224\times 10=2240$ samples in total for all $K=10$ classes.
We note that the $256\times 256$ covariance matrix ${\bf\Sigma}_k$ for each of the $K=10$ classes is estimated from only the 224 samples in the class, fewer than the dimensionality 256, i.e., the estimated covariance matrix does not have full rank and is not invertible. This issue can be resolved by applying the KLT to the dataset to reduce its dimensionality from 256 to some value smaller than 224, such as 100.
The classifier is trained on half of the dataset, 1120 randomly selected samples, and then tested on the remaining 1120 samples. The classification results are given below in terms of the confusion matrices of the training samples (left) and the testing samples (right). All 1120 training samples are correctly classified, while out of the remaining 1120 test samples, 198 are misclassified (an error rate of $198/1120\approx 17.7\%$).
(40)
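The whole experiment can be sketched as follows, again assuming the digits are stored in a 256 by 2240 matrix X with class labels y taking values 1 through 10; the dataset is first reduced to 100 dimensions by the KLT and then split randomly into training and test halves:

Xc = X - repmat(mean(X,2), 1, size(X,2));  % center the data
[V,D] = eig(cov(Xc'));                     % eigenvectors of the covariance matrix
[~,idx] = sort(diag(D), 'descend');        % sort by decreasing eigenvalue
Xr = V(:,idx(1:100))'*Xc;                  % KLT: reduce from 256 to 100 dimensions
n  = randperm(size(Xr,2));                 % random permutation of the 2240 samples
tr = n(1:1120);  te = n(1121:end);         % training / test split
[M,S,P] = NBtraining(Xr(:,tr), y(tr));     % train on 1120 samples
yhat = NBtesting(Xr(:,te), M, S, P);       % classify the other 1120 samples
[Cm,er] = ConfusionMatrix(yhat, y(te))     % confusion matrix and error rate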