
The AdaBoost Algorithm

Adaptive boosting (AdaBoost) is a supervised binary classification algorithm based on a training set $\{ ({\bf x}_k,y_k),\;(k=1,\cdots,K)\}$, where each sample ${\bf x}_k$ is labeled by $y_k\in\{-1, +1\}$, indicating to which of the two classes it belongs.

AdaBoost is an iterative algorithm. In the t-th iteration, a weak classifier, considered as a hypothesis and denoted by $h_t({\bf x}_k)\in\{-1, +1\}$, is used to classify each of the $K$ training samples into one of the two classes. If a sample ${\bf x}_k$ is correctly classified, $h_t({\bf x}_k)=y_k=\pm 1$, i.e., $y_k h_t({\bf x}_k)=1$; if it is misclassified, $h_t({\bf x}_k)=-y_k=\pm 1$, i.e., $y_k h_t({\bf x}_k)=-1$. We need to find the best weak classifier, the one that minimizes the weighted error rate defined as:

\begin{displaymath}
\varepsilon_t=\frac{\sum_{y_k h_t({\bf x}_k)=-1} w_t(k)}{\sum_{k=1}^K w_t(k)}
\end{displaymath}

where $w_t(k)\;(k=1,\cdots,K)$ are the weights for the $K$ training samples. We can also get the correct rate:
\begin{eqnarray*}
1-\varepsilon_t &=& 1-\frac{\sum_{y_k h_t({\bf x}_k)=-1} w_t(k)}{\sum_{k=1}^K w_t(k)}
=\frac{\sum_{k=1}^K w_t(k)-\sum_{y_k h_t({\bf x}_k)=-1} w_t(k)}{\sum_{k=1}^K w_t(k)} \\
&=& \frac{\sum_{y_k h_t({\bf x}_k)=1} w_t(k)}{\sum_{k=1}^K w_t(k)}
\end{eqnarray*}

At the initial iteration $(t=0)$, all $K$ samples are equally weighted by $w_0(1)=\cdots=w_0(K)=1$, and the weighted error

\begin{displaymath}
\varepsilon_0=\frac{1}{K} \sum_{y_k h_0({\bf x}_k)=-1} w_0(k)
=\frac{\mbox{number of misclassified samples}}{\mbox{total number of samples}}
\end{displaymath}

is actually the probability for any sample ${\bf x}$ in the training set to be misclassified by $h_0$. As we will see below, in the subsequent iterations, the weight $w_t(k)$ will be modified based on whether ${\bf x}_k$ is correctly classified or misclassified.

The classifier is weak in the sense that its performance only needs to be better than chance, i.e., the error rate defined above is less than 50 percent, $\varepsilon_t<1/2$.
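To make this definition concrete, the weighted error rate can be computed directly from the weak classifier's outputs on the training set. Below is a minimal NumPy sketch; the function and variable names (weighted_error, y, h, w) are illustrative and not part of the text:

\begin{verbatim}
import numpy as np

def weighted_error(y, h, w):
    # y: true labels in {-1,+1}; h: weak-classifier outputs in {-1,+1};
    # w: (unnormalized) sample weights w_t(k)
    misclassified = (y * h == -1)            # y_k h_t(x_k) = -1 marks an error
    return w[misclassified].sum() / w.sum()  # weighted error rate epsilon_t
\end{verbatim}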

At each iteration, a strong or boosted classifier $H_t$ can also be constructed based on the linear combination of all previous weak classifiers $h_1,\cdots,h_t$:

\begin{displaymath}
F_t({\bf x})=\alpha_t h_t({\bf x})+F_{t-1}({\bf x})
=\alpha_t h_t({\bf x})+\alpha_{t-1}h_{t-1}({\bf x})+F_{t-2}({\bf x})
=\cdots=\sum_{i=1}^t\alpha_i h_i({\bf x})
\end{displaymath}

where $\alpha_i$ is the coefficient for the i-th weak classifier $h_i$. The strong classifier is the sign function of $F_t({\bf x})$:

\begin{displaymath}
H_t({\bf x})={\rm sign}[F_t({\bf x})]=\left\{\begin{array}{ll}
+1 & F_t({\bf x})>0\\
-1 & F_t({\bf x})<0\end{array}\right.
\end{displaymath}

that classifies any ${\bf x}$ into one of the two classes; the magnitude $\vert F_t({\bf x})\vert$ represents the confidence of the result.
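A minimal sketch of evaluating the boosted classifier, assuming the weak classifiers are available as functions returning $\pm 1$ (the names below are illustrative):

\begin{verbatim}
def strong_classify(x, weak_classifiers, alphas):
    # weak_classifiers: list of callables h_i, each mapping a sample x to +1 or -1
    # alphas: matching list of coefficients alpha_i
    F = sum(a * h(x) for a, h in zip(alphas, weak_classifiers))  # F_t(x)
    label = 1 if F > 0 else -1     # H_t(x) = sign(F_t(x))
    return label, abs(F)           # |F_t(x)| serves as a confidence value
\end{verbatim}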

The performance of the strong classifier can be measured by the exponential loss defined as

\begin{eqnarray*}
E_{t+1} &=& \sum_{k=1}^K e^{-y_k F_t({\bf x}_k)}
=\sum_{k=1}^K e^{-y_k [\alpha_t h_t({\bf x}_k)+F_{t-1}({\bf x}_k)]} \\
&=& \sum_{k=1}^K e^{-\alpha_t y_k h_t({\bf x}_k)}\; e^{-y_k F_{t-1}({\bf x}_k)}
=\sum_{k=1}^K e^{-\alpha_t y_k h_t({\bf x}_k)}\; w_t(k),
\;\;\;\;\;\;(t>1)
\end{eqnarray*}

where $w_t(k)$ is the weight of ${\bf x}_k$ in the t-th iteration $(t>0)$, defined as
\begin{eqnarray*}
w_t(k) &=& e^{-y_k F_{t-1}({\bf x}_k)}
=e^{-y_k [\alpha_{t-1}h_{t-1}({\bf x}_k)+F_{t-2}({\bf x}_k)]} \\
&=& e^{-\alpha_{t-1} y_k h_{t-1}({\bf x}_k)}\;e^{-y_k F_{t-2}({\bf x}_k)}
=e^{-\alpha_{t-1} y_k h_{t-1}({\bf x}_k)}\;w_{t-1}(k),
\;\;\;\;\;\;(t>1)
\end{eqnarray*}

We see that $w_t(k)$ of the t-th iteration can be obtained recursively from $w_{t-1}(k)$ of the previous iteration, and this recursion can be carried out all the way back to the first step $(t=1)$ with $w_1(k)=e^{-\alpha_0  y_kh_0({\bf x}_k)}w_0(k)$.

Following this recursive definition of the weights, the exponential loss defined above can be written as the sum of all $K$ weights:

\begin{displaymath}
E_{t+1}=\sum_{k=1}^K e^{-y_k F_t({\bf x}_k)}=\sum_{k=1}^K w_{t+1}(k)
\end{displaymath}

Consider the k-th term in the summation:

\begin{displaymath}
w_{t+1}(k)=e^{-y_k F_t({\bf x}_k)}\left\{\begin{array}{ll}
>1 & \mbox{if } y_k F_t({\bf x}_k)<0\\
<1 & \mbox{if } y_k F_t({\bf x}_k)>0\end{array}\right.
\end{displaymath}

It is greater than one if ${\bf x}_k$ is misclassified, but smaller than one otherwise; the exponential loss $E_{t+1}$ is therefore indeed a measure of the performance of the strong classifier $H_t$.

We need to further determine the optimal coefficient $\alpha_t$ in the function $F_t({\bf x})$ that minimizes the exponential loss $E_{t+1}$. To do so, we first separate the summation for $E_{t+1}$ into two parts, for the samples that are classified by $h_t$ correctly and those classified incorrectly:

\begin{displaymath}
E_{t+1}=\sum_{k=1}^K w_t(k)e^{-y_k \alpha_t h_t({\bf x}_k)}
=\sum_{y_k h_t({\bf x}_k)=-1} w_t(k)e^{\alpha_t}
+\sum_{y_k h_t({\bf x}_k)=1} w_t(k)e^{-\alpha_t}
\end{displaymath}

and then let

\begin{displaymath}
\frac{dE_{t+1}}{d\alpha_t}
=\sum_{y_k h_t({\bf x}_k)=-1} w_t(k) e^{\alpha_t}
-\sum_{y_k h_t({\bf x}_k)=1} w_t(k) e^{-\alpha_t}=0
\end{displaymath}

Solving this for $\alpha_t$ (the equation gives $e^{2\alpha_t}=\sum_{y_k h_t({\bf x}_k)=1} w_t(k)\,/\sum_{y_k h_t({\bf x}_k)=-1} w_t(k)$), we get the optimal coefficient for $h_t$:

\begin{displaymath}
\alpha_t=\frac{1}{2} \ln \left(
\frac{\sum_{y_k h_t({\bf x}_k)=1} w_t(k)}{\sum_{y_k h_t({\bf x}_k)=-1} w_t(k)}\right)
=\ln \sqrt{\frac{1-\varepsilon_t}{\varepsilon_t}}
\end{displaymath}

As $\varepsilon_t<1/2$ and $1-\varepsilon_t>1/2$, we have $\sqrt{(1-\varepsilon_t)/\varepsilon_t} > 1$ and $\alpha_t>0$.

As the strong classifier $H_t$ takes advantage of all previous weak classifiers $h_1,\cdots,h_t$, each of which may be more effective in a certain region of the N-D feature space than the others, $H_t$ can be expected to be a much more accurate classifier than any of the weak classifiers. Also note that an $h_t$ with a smaller error $\varepsilon_t$ will be weighted by a greater $\alpha_t$ and therefore contribute more to the strong classifier $H_t$, while an $h_t$ with a larger $\varepsilon_t$ will be weighted by a smaller $\alpha_t$ and contribute less to $H_t$.

Having found the optimal coefficient $\alpha_t=\ln\sqrt{(1-\varepsilon_t)/\varepsilon_t}$, we can further get

\begin{displaymath}
e^{\alpha_t}=\sqrt{\frac{1-\varepsilon_t}{\varepsilon_t}},\;\;\;\;
e^{-\alpha_t}=\sqrt{\frac{\varepsilon_t}{1-\varepsilon_t}}
\end{displaymath}
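For example, a weak classifier with $\varepsilon_t=0.3$ gives $e^{\alpha_t}=\sqrt{0.7/0.3}\approx 1.53$ and $e^{-\alpha_t}=\sqrt{0.3/0.7}\approx 0.65$.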

Now the weights are updated by

\begin{displaymath}
w_{t+1}(k)=w_t(k)\,e^{-\alpha_t y_k h_t({\bf x}_k)}
=w_t(k)\left\{\begin{array}{ll}
\sqrt{\frac{\varepsilon_t}{1-\varepsilon_t}} & \mbox{if } y_k h_t({\bf x}_k)=1\\
\sqrt{\frac{1-\varepsilon_t}{\varepsilon_t}} & \mbox{if } y_k h_t({\bf x}_k)=-1
\end{array}\right.
\end{displaymath}

As $\varepsilon_t<1/2$, $w_{t+1}(k)$ is reduced from $w_t(k)$ by a factor of $\sqrt{\varepsilon_t/(1-\varepsilon_t)}<1$ if ${\bf x}_k$ is classified correctly $(y_kh_t({\bf x}_k)=1)$, but increased by a factor of $\sqrt{(1-\varepsilon_t)/\varepsilon_t} > 1$ if it is misclassified $(y_kh_t({\bf x}_k)=-1)$. Due to this heavier weight, a sample ${\bf x}_k$ misclassified by $h_t$ will be emphasized and thereby have a better chance to be classified correctly by $h_{t+1}$ in the next iteration. This is an essential feature of the AdaBoost algorithm.
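Putting the update rule together with the formulas for $\varepsilon_t$ and $\alpha_t$, one boosting iteration can be sketched as follows in NumPy (the names are illustrative; the weak classifier's outputs on all training samples are assumed to be given as the array h):

\begin{verbatim}
import numpy as np

def adaboost_step(y, h, w):
    # y, h: arrays of +/-1 labels and weak-classifier outputs; w: current weights w_t(k)
    # (assumes 0 < eps < 1/2, i.e., the weak classifier is better than chance)
    eps = w[y * h == -1].sum() / w.sum()        # weighted error rate epsilon_t
    alpha = 0.5 * np.log((1.0 - eps) / eps)     # alpha_t = ln sqrt((1-eps_t)/eps_t)
    w_new = w * np.exp(-alpha * y * h)          # w_{t+1}(k) = w_t(k) e^{-alpha_t y_k h_t(x_k)}
    return alpha, w_new
\end{verbatim}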

We further consider the ratio of the exponential losses of two consecutive iterations:

\begin{eqnarray*}
\frac{E_{t+1}}{E_t} &=& \frac{\sum_{k=1}^K w_{t+1}(k)}{\sum_{k=1}^K w_t(k)}
=\frac{\sum_{k=1}^K w_t(k)e^{-\alpha_t y_k h_t({\bf x}_k)}}{\sum_{k=1}^K w_t(k)} \\
&=& \frac{\sum_{y_k h_t({\bf x}_k)=-1} w_t(k)}{\sum_{k=1}^K w_t(k)}e^{\alpha_t}
+\frac{\sum_{y_k h_t({\bf x}_k)=1} w_t(k)}{\sum_{k=1}^K w_t(k)}e^{-\alpha_t} \\
&=& \varepsilon_t\;\sqrt{\frac{1-\varepsilon_t}{\varepsilon_t}}
+(1-\varepsilon_t)\sqrt{\frac{\varepsilon_t}{1-\varepsilon_t}}
=2\sqrt{\varepsilon_t(1-\varepsilon_t)}\le 1
\end{eqnarray*}

This ratio is twice the geometric mean of the weighted error rate $\varepsilon_t$ and the correct rate $1-\varepsilon_t$, which reaches its maximum of 1 when $\varepsilon_t=1-\varepsilon_t=1/2$. As $\varepsilon_t<1/2$, the ratio is always smaller than 1, so the exponential cost can be approximated as an exponentially decaying function from its initial value $E_0=\sum_{k=1}^K w_0(k)=K$:

\begin{displaymath}
\lim_{t\rightarrow\infty}E_t\approx
\lim_{t\rightarrow\infty}\left(2\sqrt{\varepsilon(1-\varepsilon)}\right)^t E_0=0
\end{displaymath}

and conclude that the exponential loss will always decrease, i.e., the AdaBoost method will always converge.
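For instance, if every weak classifier achieved $\varepsilon_t=0.3$, the ratio would be $E_{t+1}/E_t=2\sqrt{0.3\times 0.7}\approx 0.92$, so the exponential loss would shrink by roughly $8\%$ at every iteration.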

The weak classifier used in each iteration is typically implemented as a decision stump, a binary classifier that partitions the N-D feature space into two regions. Specifically, a coordinate descent method is used by which all samples in the training set are projected onto ${\bf d}_n$, one of the $N$ dimensions of the feature space:

\begin{displaymath}
x_k=P_{{\bf d}_n}({\bf x}_k)=\frac{{\bf x}_k^T{\bf d}_n}{\vert\vert{\bf d}_n\vert\vert},
\;\;\;\;\;\;(k=1,\cdots,K)
\end{displaymath}

and classified into two classes by a threshold $th$ along the direction of ${\bf d}_n$:

\begin{displaymath}
h(x_k)=\left\{\begin{array}{ll}+1 & x_k<th\\ -1 & x_k>th\end{array}\right.
\end{displaymath}

The optimal decision stump is obtained by minimizing the following weighted error with respect to both the dimension ${\bf d}_n$ and the threshold along that direction:

\begin{displaymath}
\varepsilon_t=\sum_{k=1}^K w_t(k)\; I[h_t(x_k)\ne y_k]
\end{displaymath}
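A minimal NumPy sketch of this exhaustive stump search is given below; the polarity variable s, which allows either side of the threshold to be labeled $+1$, is a common practical extension not spelled out in the text, and the names are illustrative:

\begin{verbatim}
import numpy as np

def best_stump(X, y, w):
    # X: (K, N) training samples; y: labels in {-1,+1}; w: weights w_t(k)
    K, N = X.shape
    best, best_err = None, np.inf
    for n in range(N):                              # each coordinate direction d_n
        for th in np.unique(X[:, n]):               # candidate thresholds taken from the data
            for s in (+1, -1):                      # polarity of the stump
                pred = np.where(X[:, n] < th, s, -s)
                err = w[pred != y].sum() / w.sum()  # weighted error of this stump
                if err < best_err:
                    best_err, best = err, (n, th, s)
    return best, best_err
\end{verbatim}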

Alternatively, the weak classifier can also be obtained based on principal component analysis (PCA), by partitioning the N-D feature space along the directions of the eigenvectors of the between-class scatter matrix

\begin{displaymath}
{\bf S}_b=\frac{1}{K}\left[K_-({\bf m}_{-1}-{\bf m})({\bf m}_{-1}-{\bf m})^T
+K_+({\bf m}_{+1}-{\bf m})({\bf m}_{+1}-{\bf m})^T\right]
\end{displaymath}

where $K_-$ and $K_+$ are the number of samples in the two classes $(K_-+K_+=K)$, and

\begin{displaymath}
{\bf m}_{\pm}=\frac{1}{K_{\pm}}\sum_{y_k=\pm 1} w(k) {\bf x}_k,
\;\;\;\;\; {\bf m}=\frac{1}{K}\sum_{k=1}^K w(k) {\bf x}_k
\end{displaymath}

are the weighted mean vectors of the two classes and the total weighted mean vector of all $K$ samples in the training set. As these directions are those of the principal components (PCs) of ${\bf S}_b$, which measures the separability of the two classes, the two classes are likely to be better separated along these directions. As ${\bf S}_b$ is symmetric, its eigenvectors are orthogonal and span the N-D feature space, just as the standard basis does. The same method considered above for the weak classifiers can therefore be readily applied.
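The following NumPy sketch shows how these projection directions could be computed from the weighted means and ${\bf S}_b$ defined above (function and variable names are illustrative):

\begin{verbatim}
import numpy as np

def scatter_directions(X, y, w):
    # X: (K, N) samples; y: labels in {-1,+1}; w: sample weights w(k)
    K = len(y)
    K_neg, K_pos = np.sum(y == -1), np.sum(y == +1)
    m_neg = (w[y == -1][:, None] * X[y == -1]).sum(axis=0) / K_neg  # weighted mean of class -1
    m_pos = (w[y == +1][:, None] * X[y == +1]).sum(axis=0) / K_pos  # weighted mean of class +1
    m = (w[:, None] * X).sum(axis=0) / K                            # total weighted mean
    S_b = (K_neg * np.outer(m_neg - m, m_neg - m)
           + K_pos * np.outer(m_pos - m, m_pos - m)) / K
    vals, vecs = np.linalg.eigh(S_b)   # S_b symmetric: orthonormal eigenvector columns
    return vecs                        # last column corresponds to the largest eigenvalue
\end{verbatim}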

Example 1 Consider the classification of two classes of $K=500$ samples in 2-D space, shown as red and blue dots in the figure below, which are not linearly separable. As shown in the figure, the 2-D space is partitioned along both the standard basis vectors (left) and the PCA directions (right). The PCA method performs significantly better in this example: its error reduces faster and converges to zero after 61 iterations, while the error of the method based on the standard axes (in the vertical and horizontal directions) does not go down to zero even after 120 iterations. This faster reduction of error by the PCA method can be easily understood, as many different directions are used to partition the space into two regions, while the coordinate method is restricted to partitions along the horizontal and vertical axes.

AdaBoostCompare1.png

How AdaBoost is iteratively trained can be seen in the figure below. First, the error rate, i.e., the ratio between the number of misclassified samples and the total number of samples, is plotted for both the training and testing samples as the iteration progresses (top). Also, the weighted error $\varepsilon_t$, the exponential cost $E_t$, and the ratio $E_{t+1}/E_t=2\sqrt{\varepsilon_t(1-\varepsilon_t)}$ are all plotted (bottom). We see that $\varepsilon_t<0.5$ and $E_{t+1}/E_t<1$ at all steps, and the exponential cost $E_t$ attenuates exponentially towards zero.

AdaBoostErrorRate1.png

Example 2 In this dataset, 100 training samples of two classes (50 each) form four clusters arranged in the 2-D space as an XOR pattern, as shown in the figure (top-left), and 100 testing samples from the same distribution are classified by the AdaBoost algorithm trained with both the coordinate descent and PCA methods.

We see that the coordinate descent method performs very poorly, with both slow convergence (more than 400 iterations) during training and a high classification error rate (more than 1/3) during testing. The resulting partitioning of the 2-D space clearly overfits the training set instead of reflecting the actual distribution of the two classes (four clusters). On the other hand, the PCA method converges quickly (only 10 iterations) and classifies the testing samples with a very low error rate. The space is clearly partitioned into two regions corresponding to the two classes.

AdaBoostEx2b.png AdaBoostEx2a.png


Ruye Wang 2016-11-30