The SVM method is inherently a binary classifier, but it can be adapted to classification problems of more than two classes (Crammer & Singer; Weston & Watkins; Wang & Xue; Bredensteiner). In the following, we consider a general $K$-class classifier based on a training set $\{(\mathbf{x}_n,\,y_n),\;n=1,\dots,N\}$, of which each sample $\mathbf{x}_n$ is labeled by $y_n\in\{1,\dots,K\}$ to indicate $\mathbf{x}_n\in C_{y_n}$.
We first consider two straightforward and empirical methods for multiclass classification based directly on the binary SVM.
In the first method, often called one-versus-one, any unlabeled sample $\mathbf{x}$ is classified to the class that receives the maximum number of votes out of the $K(K-1)/2$ binary classifications between every pair of the $K$ classes.
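As a concrete illustration of this voting scheme, here is a minimal sketch in Python. It assumes scikit-learn's `SVC` as the underlying binary SVM and numpy arrays `X_train`, `y_train`, `X_test` with integer labels $0,\dots,K-1$; none of these names come from the text above.

```python
import itertools
import numpy as np
from sklearn.svm import SVC   # binary SVM used as the building block (an assumption)

def one_vs_one_predict(X_train, y_train, X_test, K):
    """Train K(K-1)/2 pairwise binary SVMs and classify by majority vote."""
    votes = np.zeros((len(X_test), K), dtype=int)
    for i, j in itertools.combinations(range(K), 2):
        mask = (y_train == i) | (y_train == j)            # keep only classes i and j
        clf = SVC(kernel="linear").fit(X_train[mask], y_train[mask])
        pred = clf.predict(X_test)                        # each prediction votes for i or j
        votes[:, i] += (pred == i)
        votes[:, j] += (pred == j)
    return np.argmax(votes, axis=1)                       # class with the maximum votes
```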
The second method, often called one-versus-rest, converts a $K$-class problem ($K>2$) into $K$ binary problems. First, we regroup the $K$ classes $\{C_1,\dots,C_K\}$ into two classes, $C_k$ and $\bar{C}_k=\bigcup_{j\ne k}C_j$, and find the corresponding decision function $f_k(\mathbf{x})$, representing quantitatively to what extent a given $\mathbf{x}$ belongs to class $C_k$ (if $f_k(\mathbf{x})>0$), instead of to $\bar{C}_k$ containing all $K-1$ remaining classes (if $f_k(\mathbf{x})<0$). After this process has been carried out for all $k=1,\dots,K$, any unlabeled $\mathbf{x}$ can be classified as below:

$$\mathbf{x}\in C_k\quad\text{if}\quad f_k(\mathbf{x})\ge f_j(\mathbf{x})\;\;\forall\,j\ne k,\qquad\text{i.e.,}\quad k=\arg\max_{j=1,\dots,K}f_j(\mathbf{x})\tag{159}$$
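A minimal sketch of this one-versus-rest scheme, under the same illustrative assumptions as the previous snippet (scikit-learn's `SVC` as the binary classifier, numpy arrays, integer labels $0,\dots,K-1$):

```python
import numpy as np
from sklearn.svm import SVC

def one_vs_rest_predict(X_train, y_train, X_test, K):
    """Train K binary SVMs (class k vs. the rest) and pick the class whose
    decision value is largest, as in Eq. (159)."""
    scores = np.zeros((len(X_test), K))
    for k in range(K):
        clf = SVC(kernel="linear").fit(X_train, (y_train == k).astype(int))
        scores[:, k] = clf.decision_function(X_test)   # signed score playing the role of f_k(x)
    return np.argmax(scores, axis=1)
```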
We further consider another method for multiclass classification, obtained by directly generalizing the binary SVM so that the decision margin between any two of the classes is maximized. First, we define a linear function for each of the $K$ classes:

$$f_k(\mathbf{x})=\mathbf{w}_k^T\mathbf{x}+b_k,\qquad k=1,\dots,K\tag{160}$$

and classify any $\mathbf{x}$ to the class whose linear function is the greatest:

$$\mathbf{x}\in C_k\quad\text{if}\quad f_k(\mathbf{x})\ge f_j(\mathbf{x})\;\;\forall\,j\ne k\tag{161}$$
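For example, the classification rule of Eqs. (160) and (161) amounts to a single matrix product followed by an argmax; the weight matrix and bias values below are arbitrary placeholders for a $K=3$, 2-D case:

```python
import numpy as np

# Hypothetical parameters: one row of W and one entry of b per class (Eq. 160)
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])      # w_1, w_2, w_3 for K = 3 classes in 2-D
b = np.array([0.0, 0.2, -0.1])

x = np.array([0.8, 0.3])
scores = W @ x + b                # f_k(x) = w_k^T x + b_k for all k
k = np.argmax(scores)             # classification rule of Eq. (161)
print(scores, "->", k)
```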
According to the classification rule in Eq. (161), the 2-D feature space is partitioned into three regions, one for each of the $K=3$ classes, by the decision boundaries $H_{kl}$, of which the boundary $H_{kl}$ between classes $C_k$ and $C_l$ is composed of all points $\mathbf{x}$ satisfying $f_k(\mathbf{x})=f_l(\mathbf{x})$, i.e., $(\mathbf{w}_k-\mathbf{w}_l)^T\mathbf{x}+(b_k-b_l)=0$. Same as in the binary SVM, we further define for each pair of the $K$ classes two additional straight lines (planes or hyperplanes if the feature space has more than two dimensions) parallel to $H_{kl}$, denoted by $H_{kl}^+$ and $H_{kl}^-$ and represented by the equations $(\mathbf{w}_k-\mathbf{w}_l)^T\mathbf{x}+(b_k-b_l)=\pm 1$. Their distances to $H_{kl}$ can be found as (same as in Eq. (70)):

$$d_{kl}=\frac{1}{\|\mathbf{w}_k-\mathbf{w}_l\|}\tag{162}$$
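The numerical sketch below checks Eq. (162) for one pair of classes with arbitrary placeholder parameters: a point on $H_{kl}$ shifted by $d_{kl}$ along the unit normal of the boundary lands exactly on $H_{kl}^+$:

```python
import numpy as np

# Hypothetical 2-D parameters for classes k and l
w_k, b_k = np.array([2.0, 1.0]), 0.5
w_l, b_l = np.array([-1.0, 0.5]), -0.2

dw, db = w_k - w_l, b_k - b_l                 # boundary H_kl: dw^T x + db = 0
d = 1.0 / np.linalg.norm(dw)                  # distance from H_kl to H_kl^± (Eq. 162)

x0 = -db * dw / dw.dot(dw)                    # the point of H_kl closest to the origin
x_plus = x0 + d * dw / np.linalg.norm(dw)     # move distance d along the unit normal
print(d, dw.dot(x_plus) + db)                 # second value is 1, i.e., x_plus lies on H_kl^+
```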
Now the multiclass classification problem can be formulated as finding the parameters $\mathbf{w}_k$ and $b_k$ for all $k=1,\dots,K$ so that $\sum_{k=1}^K\|\mathbf{w}_k\|^2/2$ is minimized, and thereby the decision margins for all boundaries $H_{kl}$ between $C_k$ and $C_l$ are maximized, under the constraint that every training sample $\mathbf{x}_n$ labeled by $y_n$ is correctly classified with a decision margin of 1 (the distance between the support vectors on the two sides of $H_{kl}$ is $2/\|\mathbf{w}_k-\mathbf{w}_l\|$):

$$\begin{aligned}
\text{minimize:}\quad & \frac{1}{2}\sum_{k=1}^K\|\mathbf{w}_k\|^2\\
\text{subject to:}\quad & (\mathbf{w}_{y_n}-\mathbf{w}_k)^T\mathbf{x}_n+(b_{y_n}-b_k)\ge 1,\qquad \forall\,k\ne y_n,\;\;n=1,\dots,N
\end{aligned}\tag{163}$$
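The problem in Eq. (163) is a quadratic program and can be handed to a generic QP solver. The sketch below sets it up with cvxpy on a small synthetic, linearly separable data set; cvxpy, the toy data, and all variable names are assumptions made for illustration, not part of the text.

```python
import numpy as np
import cvxpy as cp

# Toy data: N samples in d dimensions, labels 0..K-1, well-separated clusters
rng = np.random.default_rng(0)
K, d, N = 3, 2, 30
X = np.vstack([rng.normal(loc=4.0 * k, scale=0.5, size=(N // K, d)) for k in range(K)])
y = np.repeat(np.arange(K), N // K)

W = cp.Variable((K, d))     # one weight vector w_k per class
b = cp.Variable(K)          # one bias b_k per class

constraints = []
for n in range(N):
    yn = int(y[n])
    for k in range(K):
        if k != yn:
            # (w_{y_n} - w_k)^T x_n + (b_{y_n} - b_k) >= 1   (constraint of Eq. 163)
            constraints.append(X[n] @ (W[yn] - W[k]) + b[yn] - b[k] >= 1)

prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(W)), constraints)
prob.solve()

pred = np.argmax(X @ W.value.T + b.value, axis=1)   # classification rule of Eq. (161)
print("training accuracy:", np.mean(pred == y))
```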
Similar to Eq. (115) in the soft margin SVM, the constraints of the optimization problem above can be relaxed by including in each constraint an extra error term $\xi_n^k\ge 0$, which is to be minimized as well. Now the problem can be reformulated as

$$\begin{aligned}
\text{minimize:}\quad & \frac{1}{2}\sum_{k=1}^K\|\mathbf{w}_k\|^2+C\sum_{n=1}^N\sum_{k\ne y_n}\xi_n^k\\
\text{subject to:}\quad & (\mathbf{w}_{y_n}-\mathbf{w}_k)^T\mathbf{x}_n+(b_{y_n}-b_k)\ge 1-\xi_n^k,\\
& \xi_n^k\ge 0,\qquad \forall\,k\ne y_n,\;\;n=1,\dots,N
\end{aligned}\tag{164}$$
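Under the same illustrative assumptions, and reusing `X`, `y`, `W`, `b`, `N`, `K` from the sketch above, the soft-margin problem of Eq. (164) only requires adding nonnegative slack variables and their penalty (entries $\xi_n^{y_n}$ are never used by a constraint and are driven to zero by the penalty):

```python
C_penalty = 1.0                            # an assumed value of the penalty weight C
xi = cp.Variable((N, K), nonneg=True)      # slack terms xi_n^k

soft_constraints = []
for n in range(N):
    yn = int(y[n])
    for k in range(K):
        if k != yn:
            # relaxed constraint of Eq. (164)
            soft_constraints.append(
                X[n] @ (W[yn] - W[k]) + b[yn] - b[k] >= 1 - xi[n, k]
            )

soft_prob = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(W) + C_penalty * cp.sum(xi)),
    soft_constraints,
)
soft_prob.solve()
```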
The Lagrangian function of this constrained optimization problem is
$$\begin{aligned}
L=\;&\frac{1}{2}\sum_{k=1}^K\|\mathbf{w}_k\|^2+C\sum_{n=1}^N\sum_{k\ne y_n}\xi_n^k\\
&-\sum_{n=1}^N\sum_{k\ne y_n}\alpha_n^k\left[(\mathbf{w}_{y_n}-\mathbf{w}_k)^T\mathbf{x}_n+(b_{y_n}-b_k)-1+\xi_n^k\right]
-\sum_{n=1}^N\sum_{k\ne y_n}\mu_n^k\,\xi_n^k
\end{aligned}\tag{165}$$

where $\alpha_n^k\ge 0$ and $\mu_n^k\ge 0$ are the Lagrange multipliers of the two sets of constraints (with the convention $\alpha_n^{y_n}=0$, and $A_n=\sum_{k=1}^K\alpha_n^k$ below). Setting to zero the derivatives of $L$ with respect to $\mathbf{w}_k$, $b_k$ and $\xi_n^k$, we get

$$\frac{\partial L}{\partial\mathbf{w}_k}=\mathbf{0}\;\;\Longrightarrow\;\;\mathbf{w}_k=\sum_{n=1}^N\left(\delta_{y_n,k}\,A_n-\alpha_n^k\right)\mathbf{x}_n\tag{166}$$

$$\frac{\partial L}{\partial b_k}=0\;\;\Longrightarrow\;\;\sum_{n=1}^N\left(\delta_{y_n,k}\,A_n-\alpha_n^k\right)=0\tag{167}$$

$$\frac{\partial L}{\partial\xi_n^k}=0\;\;\Longrightarrow\;\;C-\alpha_n^k-\mu_n^k=0\tag{168}$$

As $\mu_n^k\ge 0$, Eq. (168) implies

$$0\le\alpha_n^k\le C\tag{169}$$

Substituting Eqs. (166) and (167) back into Eq. (165), all terms containing $b_k$ and $\xi_n^k$ cancel and the dual problem becomes

$$\max_{\{\alpha_n^k\}}\;\;\sum_{n=1}^N\sum_{k\ne y_n}\alpha_n^k-\frac{1}{2}\sum_{k=1}^K\|\mathbf{w}_k\|^2,
\qquad\text{subject to Eqs. (167) and (169)}\tag{170}$$

where each $\mathbf{w}_k$ stands for the expression in Eq. (166), so that the objective depends on the training samples only through the inner products $\mathbf{x}_n^T\mathbf{x}_m$. Once the optimal $\alpha_n^k$ are found, the decision functions are

$$f_k(\mathbf{x})=\mathbf{w}_k^T\mathbf{x}+b_k=\sum_{n=1}^N\left(\delta_{y_n,k}\,A_n-\alpha_n^k\right)\mathbf{x}_n^T\mathbf{x}+b_k\tag{171}$$
The training of the classifier is to find all weight vectors in $\mathbf{W}=[\mathbf{w}_1,\dots,\mathbf{w}_K]$ based on the training dataset. To do so, we first define a cost function