Adaptive boosting (AdaBoost) is a supervised binary classification algorithm based on a training set $\{(\mathbf{x}_n,\,y_n),\;n=1,\cdots,N\}$, where each sample $\mathbf{x}_n\in\mathbb{R}^D$ is labeled by $y_n\in\{-1,\,+1\}$, indicating to which of the two classes $C_-$ and $C_+$ it belongs.
AdaBoost is an iterative algorithm. In the t-th iteration, each of the $N$ training samples is classified into one of the two classes by a weak classifier $h_t(\mathbf{x})\in\{-1,\,+1\}$, which can be considered as a hypothesis. The classifier is weak in the sense that its performance only needs to be better than chance, i.e., the error rate is less than $1/2$. If a sample $\mathbf{x}_n$ is correctly classified, $h_t(\mathbf{x}_n)=y_n$, i.e., $y_n\,h_t(\mathbf{x}_n)=1$; if it is misclassified, $h_t(\mathbf{x}_n)\ne y_n$, i.e., $y_n\,h_t(\mathbf{x}_n)=-1$. We need to find the best weak classifier $h_t$ that minimizes the weighted error rate defined as:

$$\varepsilon_t=\frac{1}{W_t}\sum_{h_t(\mathbf{x}_n)\ne y_n} w_{t,n} \tag{41}$$
$$=\frac{1}{W_t}\sum_{n=1}^N w_{t,n}\,\frac{1-y_n\,h_t(\mathbf{x}_n)}{2} \tag{42}$$
$$\text{where}\qquad W_t=\sum_{n=1}^N w_{t,n} \tag{43}$$

and $w_{t,n}$ is the weight of sample $\mathbf{x}_n$ in the t-th iteration, defined in Eq. (46) below.
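As a concrete illustration (a minimal sketch with assumed toy values, not part of the original code), the weighted error of a given set of weak-classifier outputs can be computed directly from this definition:

y =[ 1  1 -1 -1  1];          % assumed labels of five training samples
ht=[ 1 -1 -1 -1  1];          % assumed outputs of a weak classifier (sample 2 is wrong)
w =[ 1  2  1  1  1];          % assumed weights in the current iteration
eps_t=sum(w(ht~=y))/sum(w);   % weighted error of Eq. (41): here 2/6 = 1/3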
At each iteration, a strong or boosted classifier $F_t(\mathbf{x})$ is also constructed based on the linear combination (boosting) of all previous weak classifiers $h_\tau(\mathbf{x})$ for all $\tau\le t$:

$$F_t(\mathbf{x})=F_{t-1}(\mathbf{x})+\alpha_t\,h_t(\mathbf{x}) \tag{44}$$
$$=\sum_{\tau=1}^t \alpha_\tau\,h_\tau(\mathbf{x}),\qquad F_0(\mathbf{x})=0 \tag{45}$$

and the class label assigned to a sample $\mathbf{x}$ is $\mathrm{sign}(F_t(\mathbf{x}))$. The weight of $\mathbf{x}_n$ in the t-th iteration $w_{t,n}$ is defined as

$$w_{t,n}=e^{-y_n F_{t-1}(\mathbf{x}_n)},\qquad w_{1,n}=1 \tag{46}$$
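It follows from Eqs. (44) and (46) that the weights can be updated recursively from one iteration to the next:

$$w_{t+1,n}=e^{-y_n F_t(\mathbf{x}_n)}
=e^{-y_n\left[F_{t-1}(\mathbf{x}_n)+\alpha_t h_t(\mathbf{x}_n)\right]}
=w_{t,n}\,e^{-\alpha_t\,y_n\,h_t(\mathbf{x}_n)}
=\begin{cases} w_{t,n}\,e^{-\alpha_t}, & h_t(\mathbf{x}_n)=y_n\\ w_{t,n}\,e^{\alpha_t}, & h_t(\mathbf{x}_n)\ne y_n\end{cases}$$

i.e., a correctly classified sample is down-weighted while a misclassified one is up-weighted, so that subsequent weak classifiers concentrate on the samples that are still misclassified. This is exactly the update carried out in the main loop listed below, where the factor $c=\sqrt{\varepsilon_t/(1-\varepsilon_t)}=e^{-\alpha_t}$.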
The performance of the strong classifier can be measured by the exponential loss, defined as the sum of all weights:

$$L_t=\sum_{n=1}^N e^{-y_n F_t(\mathbf{x}_n)}
=\sum_{n=1}^N w_{t,n}\,e^{-\alpha_t\,y_n\,h_t(\mathbf{x}_n)} \tag{47}$$
$$=\sum_{n=1}^N w_{t+1,n} \tag{48}$$

The coefficient $\alpha_t$ in Eq. (44) can now be found as the one that minimizes the exponential loss $L_t$. To do so, we first separate the terms in the summation for $L_t$ into two parts, corresponding to the samples that are classified by $h_t$ correctly and incorrectly:

$$L_t=e^{-\alpha_t}\sum_{y_n=h_t(\mathbf{x}_n)} w_{t,n}
+e^{\alpha_t}\sum_{y_n\ne h_t(\mathbf{x}_n)} w_{t,n} \tag{49}$$
$$=\left[(1-\varepsilon_t)\,e^{-\alpha_t}+\varepsilon_t\,e^{\alpha_t}\right] W_t \tag{50}$$

Setting the derivative with respect to $\alpha_t$ to zero,

$$\frac{dL_t}{d\alpha_t}
=\left[-(1-\varepsilon_t)\,e^{-\alpha_t}+\varepsilon_t\,e^{\alpha_t}\right] W_t=0 \tag{51}$$

we obtain the optimal coefficient

$$\alpha_t=\frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t} \tag{52}$$
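As a quick numerical sanity check (a minimal sketch, not part of the original listing; the weighted error ep is an assumed value), the closed-form coefficient of Eq. (52) can be compared against a brute-force minimization of the loss in Eq. (50):

ep=0.3; W=1;                           % assumed weighted error and total weight
L=@(a) ((1-ep)*exp(-a)+ep*exp(a))*W;   % exponential loss of Eq. (50) as a function of alpha
a_opt=0.5*log((1-ep)/ep);              % closed-form minimizer, Eq. (52)
a_grid=linspace(0,3,3001);             % brute-force search over a grid of alpha values
[~,i]=min(L(a_grid));
fprintf('closed form: %f, grid search: %f\n',a_opt,a_grid(i));
fprintf('minimal loss: %f = 2*sqrt(ep*(1-ep)) = %f\n',L(a_opt),2*sqrt(ep*(1-ep)));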
We now see that if a weak classifier $h_t$ has a small error $\varepsilon_t$, it will be weighted by a large $\alpha_t$ and therefore contributes more to the strong classifier $F_t(\mathbf{x})$; on the other hand, if $h_t$ has a large error $\varepsilon_t$, it will be weighted by a small $\alpha_t$ and contributes less to $F_t(\mathbf{x})$. As the strong classifier $F_t(\mathbf{x})$ takes advantage of all previous weak classifiers $h_1,\cdots,h_t$, each of which may be more effective in a certain region of the D-dimensional feature space than others, $F_t(\mathbf{x})$ can be expected to be a much more accurate classifier than any of the weak classifiers.
Replacing $\alpha_t$ in Eq. (47) by its optimal value given in Eq. (52), we now get

$$L_t=\sqrt{\frac{\varepsilon_t}{1-\varepsilon_t}}\sum_{y_n=h_t(\mathbf{x}_n)} w_{t,n}
+\sqrt{\frac{1-\varepsilon_t}{\varepsilon_t}}\sum_{y_n\ne h_t(\mathbf{x}_n)} w_{t,n}
=2\sqrt{\varepsilon_t(1-\varepsilon_t)}\;W_t \tag{53}$$

We further consider the ratio of the exponential losses of two consecutive iterations. As $L_{t-1}=\sum_n w_{t,n}=W_t$ by Eq. (48), we have

$$\frac{L_t}{L_{t-1}}=\frac{2\sqrt{\varepsilon_t(1-\varepsilon_t)}\,W_t}{W_t}
=2\sqrt{\varepsilon_t(1-\varepsilon_t)}<1 \tag{54}$$

because $\varepsilon_t<1/2$ for a weak classifier. Consequently, starting from $L_0=\sum_n w_{1,n}=N$,

$$L_T=N\prod_{t=1}^T 2\sqrt{\varepsilon_t(1-\varepsilon_t)} \tag{55}$$

decreases monotonically and approaches zero as the iteration progresses.
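As an illustration with assumed numbers (not taken from the examples below), suppose every weak classifier achieves $\varepsilon_t\le 0.4$, so that each iteration shrinks the exponential loss by a factor of at most $2\sqrt{0.4\times 0.6}\approx 0.98$. For $N=400$ training samples,

$$L_T\le 400\,(0.98)^T,\qquad\text{e.g.}\quad L_{350}\le 400\,(0.98)^{350}\approx 0.34<1$$

Since every misclassified sample satisfies $y_nF_T(\mathbf{x}_n)<0$ and therefore contributes $e^{-y_nF_T(\mathbf{x}_n)}>1$ to the loss, $L_T<1$ implies that no training sample is misclassified. This is why the training error is guaranteed to reach zero eventually, as observed in Example 0 below.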
The weak classifier used in each iteration is typically implemented as a decision stump, a binary classifier that partitions the D-dimensional feature space into two regions. Specifically, as a simple example, a coordinate descent method can be used by which all training samples are projected onto the i-th dimension of the feature space:

$$x_n^{(i)}=\mathbf{e}_i^T\mathbf{x}_n,\qquad n=1,\cdots,N \tag{56}$$

where $\mathbf{e}_i$ is the i-th standard basis vector. The stump then classifies a sample by comparing its projection to a threshold $\theta$ with a polarity $p\in\{+1,-1\}$:

$$h(\mathbf{x}_n)=p\;\mathrm{sign}\!\left(\mathbf{e}_i^T\mathbf{x}_n-\theta\right) \tag{57}$$

and the dimension $i$, the threshold $\theta$ and the polarity $p$ are chosen to minimize the weighted error:

$$(i,\,\theta,\,p)=\arg\min_{i,\theta,p}\;\frac{1}{W_t}\sum_{h(\mathbf{x}_n)\ne y_n} w_{t,n} \tag{58}$$

Alternatively, the weak classifier can also be obtained based on principal component analysis (PCA), by partitioning the D-dimensional feature space along the directions of the eigenvectors of the between-class scatter matrix. Let $\mathbf{m}_\pm$ be the means of the $N_\pm$ samples in the two classes $C_\pm$, and $\mathbf{m}$ the overall mean:

$$\mathbf{m}_\pm=\frac{1}{N_\pm}\sum_{\mathbf{x}_n\in C_\pm}\mathbf{x}_n,
\qquad \mathbf{m}=\frac{N_+\mathbf{m}_+ + N_-\mathbf{m}_-}{N} \tag{59}$$

The between-class scatter matrix is

$$\mathbf{S}_b=\frac{1}{N}\left[N_+(\mathbf{m}_+-\mathbf{m})(\mathbf{m}_+-\mathbf{m})^T
+N_-(\mathbf{m}_--\mathbf{m})(\mathbf{m}_--\mathbf{m})^T\right] \tag{60}$$

and the training samples are projected onto each of its eigenvectors $\mathbf{v}_i$ (satisfying $\mathbf{S}_b\mathbf{v}_i=\lambda_i\mathbf{v}_i$),

$$x_n^{(i)}=\mathbf{v}_i^T\mathbf{x}_n,\qquad n=1,\cdots,N \tag{61}$$

before the same 1-D threshold search is carried out.
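The following minimal sketch (using a hypothetical 2-D toy set; the values of tr, th and p are assumptions, not values found by the algorithm) shows how a single decision stump of Eq. (57) labels samples after they are projected onto a chosen direction, which may be a standard basis vector or an eigenvector of $\mathbf{S}_b$:

X=[0 1 2 3; 0 1 0 1];   % 2x4 toy data set, one column per sample (assumed values)
tr=[1 0];               % projection direction: the first standard basis vector e_1
th=1.5;                 % assumed threshold
p=1;                    % assumed polarity (+1: samples above the threshold are labeled +1)
x=tr*X;                 % project all samples onto the chosen direction, Eq. (56)/(61)
h=p*sign(x-th);         % stump outputs of Eq. (57): labels in {-1,+1}
disp(h)                 % prints -1 -1 1 1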
The Matlab code of the main iteration loop of the algorithm is listed below, followed by the essential functions called by the main loop. Here T is the maximum number of iterations, and N is the total number of training samples. The algorithm can be carried out in the feature space based on either the PCA basis (the columns of the eigenvector matrix of $\mathbf{S}_b$), or the original standard basis (the columns of the identity matrix $\mathbf{I}$).

The decision stump is implemented by a binary classifier, which partitions all training samples projected onto a 1-D space into two groups by a threshold value, which needs to be optimal in terms of the weighted error of the misclassified samples. A parameter plt is used to indicate the polarity of the binary classification, i.e., which of the two classes $C_+$ and $C_-$ is on the lower (or higher) side of the threshold.
Plt=[];                           % polarity of binary classification
Tr=[];                            % transformation of basis vectors
Th=[];                            % threshold of binary classification
Alpha=[];                         % alpha values
h=zeros(T,N);                     % h functions
F=zeros(T,N);                     % F functions
w=ones(T,N);                      % weights for N training samples
Er=N;                             % error initialized to N
t=0;                              % iteration index
while t<T & Er>0                  % the main iteration
    t=t+1;
    [tr th plt er]=WeakClassifier(X,y,w(t,:),PCA);   % weak classifier
    alpha=log(sqrt((1-er)/er));   % update alpha
    Alpha=cat(1,Alpha,alpha);     % record alpha
    Tr=cat(2,Tr,tr');             % record transform vector
    Th=cat(1,Th,th);              % record threshold
    Plt=cat(1,Plt,plt);           % record polarity
    x=tr*X;                       % carry out transform
    c=sqrt(er/(1-er));
    for n=1:N                     % update weights
        h(t,n)=h_function(x(n),th,plt);   % find h function
        if h(t,n)*y(n)<0
            w(t+1,n)=w(t,n)/c;    % update weights
        else
            w(t+1,n)=w(t,n)*c;
        end
    end
    F(t,:)=Alpha(t)*h(t,:);       % get F functions
    if t>1
        F(t,:)=F(t,:)+F(t-1,:);
    end
    Er=sum(sign(F(t,:).*y)==-1);  % error of strong classifier, number of misclassifications
    fprintf('%d: %d/%d=%f\n',t,Er,N,Er/N)
end
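For reference, a minimal driver sketch is given below; the variable names X, y, T and PCA are those expected by the loop above, but the data set (two Gaussian clusters) and the parameter values are assumptions for illustration only:

N=200; D=2;                          % assumed numbers of samples and dimensions
X=[randn(D,N/2)-1, randn(D,N/2)+1];  % two Gaussian clusters, one column per sample
y=[-ones(1,N/2), ones(1,N/2)];       % labels in {-1,+1} for classes C- and C+
T=50;                                % maximum number of iterations
PCA=1;                               % 1: project onto eigenvectors of Sb, 0: standard basis
% ... now run the main iteration loop above to obtain Alpha, Tr, Th and Plt ...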
Here are the functions called by the main iteration loop above:
function [Tr,Th,Plt,Er]=WeakClassifier(X,y,w,pca)
% X:   N columns, each for one of the N training samples
% y:   labeling of X
% w:   weights for N training samples
% pca: use PCA dimensions if pca~=0
% Er:  minimum error among all D dimensions
% Tr:  transform vector (standard or PCA basis)
% Th:  threshold value
% Plt: polarity
    [D N]=size(X);
    n0=sum(y>0);                  % number of samples in class C+
    n1=sum(y<0);                  % number of samples in class C-
    if pca                        % find PCA basis
        for i=1:D
            Y(i,:)=w.*X(i,:);     % weight the samples
        end
        X0=Y(:,find(y>0));        % all samples in class C+
        X1=Y(:,find(y<0));        % all samples in class C-
        m0=mean(X0')';            % mean of C+
        m1=mean(X1')';            % mean of C-
        mu=(n0*m0+n1*m1)/N;       % overall mean
        Sb=(n0*(m0-mu)*(m0-mu)'+n1*(m1-mu)*(m1-mu)')/N;  % between-class scatter matrix
        [v d]=eig(Sb);            % eigenvector and eigenvalue matrices of Sb
    else
        v=eye(D);                 % standard basis
    end
    Er=9e9;
    for i=1:D                     % for all D dimensions
        tr=v(:,i)';               % get transform vector from identity or PCA matrix
        x=tr*X;                   % project all samples onto the vector
        [th plt er]=BinaryClassifier(x,y,w);  % binary classify N samples in 1-D
        er=0;
        for n=1:N
            h(n)=h_function(x(n),th,plt);     % h-function of nth sample
            if h(n)*y(n)<0                    % if misclassified
                er=er+w(n);                   % add error
            end
        end
        er=er/sum(w);             % weighted error of the i-th dimension
        if Er>er                  % record info corresponding to min error
            Er=er;                % min error
            Plt=plt;              % polarity
            Th=th;                % threshold
            Tr=tr;                % transform vector
        end
    end
end

function h=h_function(x,th,plt)
    if xor(x>th,plt)
        h=1;
    else
        h=-1;
    end
end

function [Th,Plt,Er]=BinaryClassifier(x,y,w)
    N=length(x);
    [x1 i]=sort(x);               % sort 1-D data x
    y1=y(i);                      % reorder the targets
    w1=w(i);                      % reorder the weights
    Er=9e9;
    for n=1:N-1                   % for N-1 ways of binary classification
        e0=sum(w1(find(y1(1:n)==1)))+sum(w1(n+find(y1(n+1:N)==-1)));
        e1=sum(w1(find(y1(1:n)~=1)))+sum(w1(n+find(y1(n+1:N)~=-1)));
        if e1 > e0                % polarity: left -1, right +1
            plt=0;
            er=e0;
        else                      % polarity: left +1, right -1
            plt=1;
            er=e1;
        end
        if Er > er                % update minimum error
            Er=er;                % minimum error
            Plt=plt;              % polarity
            Th=(x1(n)+x1(n+1))/2; % threshold halfway between adjacent samples
        end
    end
end
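Once trained, the recorded parameters can be used to label a new sample. The sketch below shows one way the stored Alpha, Tr, Th and Plt could be combined to evaluate the strong classifier of Eq. (44) for a test sample z (a D-by-1 column vector); this helper is an illustration and not part of the original listing (it relies on h_function defined above):

function label=StrongClassifier(z,Alpha,Tr,Th,Plt)
% z: a single D-by-1 test sample; the remaining arguments are produced by the main loop
    F=0;
    for t=1:length(Alpha)                         % accumulate all weak classifiers, Eq. (44)
        x=Tr(:,t)'*z;                             % project the sample onto the t-th direction
        F=F+Alpha(t)*h_function(x,Th(t),Plt(t));  % weighted vote of the t-th stump
    end
    label=sign(F);                                % final class label in {-1,+1}
end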
Example 0  This example shows the classification of an XOR data set (used previously to test the naive Bayes method), containing two classes of samples in 2-D space, represented by the red and blue dots in the figure below. The intermediate results after 4, 8, 16 and 32 iterations are shown to illustrate the progress of the iteration, as the error rate continuously reduces from 134/400 to 85/400, 81/400, and 58/400.
After over 350 iterations the error rate eventually reduces to zero (guaranteed by the AdaBoost method). Although all training samples are correctly classified eventually, the result suffers from the problem of overfitting.
Example 1  This example shows the classification of two classes of samples in 2-D space, which is partitioned along both the standard basis vectors (left) and the PCA directions (right). The PCA method performs significantly better in this example, as its error reduces faster and converges to zero after 61 iterations, while the error of the method based on the standard axes (the vertical and horizontal directions) does not go down to zero even after 120 iterations. This faster error reduction of the PCA method can be easily understood, as many different directions are used to partition the space into two regions, while in the coordinate descent method the partitioning of the space is limited to either of the two axis directions.
How the AdaBoost is iteratively trained can be seen in the figure below. First, the error rate, i.e., the ratio between the number of misclassified samples and the total number of samples, is plotted for both the training and testing samples as the iteration progresses (top). Also, the weighted error $\varepsilon_t$, the exponential cost $L_t$, and the ratio $L_t/L_{t-1}$ are all plotted (bottom). We see that $\varepsilon_t<1/2$ and $L_t/L_{t-1}<1$ for all steps, and the exponential cost $L_t$ attenuates exponentially towards zero.
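A sketch of how such diagnostic curves could be produced is given below; the vectors Eps, Cost and ErrRate are hypothetical per-iteration records (the main loop above would need to store er, sum(w(t+1,:)) and Er/N at each iteration t for this to run):

subplot(2,1,1)
plot(ErrRate)                        % error rate of the strong classifier vs. iteration
xlabel('iteration'), ylabel('error rate')
subplot(2,1,2)
plot(Eps), hold on                   % weighted error of each weak classifier
plot(Cost/Cost(1))                   % exponential cost L_t, normalized by its initial value
plot(Cost(2:end)./Cost(1:end-1))     % ratio L_t/L_{t-1} of consecutive iterations
hold off
legend('\epsilon_t','L_t (normalized)','L_t/L_{t-1}')
xlabel('iteration')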
Example 2  In this dataset, 100 training samples of two classes (50 each) form four clusters arranged in the 2-D space as an XOR pattern, as shown in the figure (top-left), and 100 testing samples drawn from the same distribution are classified by the AdaBoost algorithm trained by both the coordinate descent and PCA methods.
We see that the coordinate descent method performs very poorly, both in its slow convergence (more than 400 iterations) during training and in its high classification error rate (more than 1/3) during testing. The resulting partitioning of the 2-D space is an obvious overfitting of the training set instead of a reflection of the actual distribution of the two classes (four clusters). On the other hand, the PCA method converges quickly (10 iterations) and classifies the testing samples with a very low error rate; the space is clearly partitioned into two regions corresponding to the two classes.