Multi-Class Classification


The SVM method is inherently a binary classifier, but it can be adapted to classification problems with more than two classes. In the following, we consider a general $K$-class classifier based on a training set ${\bf X}=[{\bf x}_1,\cdots,{\bf x}_N]$, in which each sample ${\bf x}_n$ is labeled by $y_n\in\{1,\cdots,K\}$ to indicate ${\bf x}_n \in C_{y_n}$.

We first consider two straightforward and empirical methods for multiclass classification based directly on the binary SVM.

We further consider another method for multiclass classification that directly generalizes the binary SVM so that the decision margin between any two of the $K$ classes is maximized. First, we define a linear function for each of the $K$ classes:

$\displaystyle f_k({\bf x})={\bf w}_k^T{\bf x}+b_k\;\;\;\;\;(k=1,\cdots,K)$ (160)

Once the parameters ${\bf w}_k$ and $b_k$ are determined from the training set, any unlabeled point ${\bf x}$ can be classified as follows:

if$\displaystyle \;\;f_l({\bf x})=\max_{k=1,\cdots,K}\,f_k({\bf x}),\;\;\;$then$\displaystyle \;\;{\bf x}\in C_l$ (161)
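As an illustration, a minimal Python sketch of this decision rule is given below; the array names `W`, `b`, `x` and the example numbers are assumptions made here for illustration only (the rows of `W` are the weight vectors ${\bf w}_k$):

```python
import numpy as np

def classify(x, W, b):
    """Assign x to the class whose linear function f_k(x) = w_k^T x + b_k is largest.

    W : (K, d) array whose k-th row is w_k
    b : (K,)  array of the biases b_k
    x : (d,)  feature vector
    Returns the 1-based class index l, as in Eq. (161).
    """
    f = W @ x + b                 # f[k-1] = f_k(x) for k = 1, ..., K
    return int(np.argmax(f)) + 1

# Example with K = 3 classes in d = 2 dimensions (arbitrary illustrative numbers)
W = np.array([[ 1.0,  0.0],
              [-1.0,  1.0],
              [ 0.0, -1.0]])
b = np.array([0.5, -0.2, 0.1])
print(classify(np.array([2.0, 1.0]), W, b))    # prints 1, i.e. x is assigned to C_1
```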

The figure below illustrates the classification of $K=3$ classes in a $d=2$ dimensional feature space. Here each straight line $H_k$ (a plane or hyperplane if $d\ge 3$) is determined by the equation $f_k({\bf x})={\bf w}_k^T{\bf x}+b_k=0,\;(k=1,\cdots,K)$, with normal direction ${\bf w}_k$ and distance $d(H_k,{\bf0})=\vert b_k\vert/\vert\vert{\bf w}_k\vert\vert$ to the origin (same as in Eq. (65)).

According to the classification rule in Eq. (161), the 2-D feature space is partitioned into three regions, one for each of the $K=3$ classes, by the decision boundaries $B_{12},\;B_{23},\;B_{31}$, where the boundary $B_{ij}$ between classes $C_i$ and $C_j$ consists of all points ${\bf x}\in B_{ij}$ satisfying $f_i({\bf x})=f_j({\bf x})$. As in the binary SVM, we further define for each of the $K$ classes two additional straight lines (planes or hyperplanes if $d\ge 3$) parallel to $H_k$, denoted by $H_{k\pm}$ and represented by the equations $f_k({\bf x})={\bf w}_k^T{\bf x}+b_k=\pm c_k$. Their distances to $H_k$ can be found as (same as in Eq. (70)):

$\displaystyle d(H_{k\pm},H_k)=\frac{\vert c_k\vert}{\vert\vert{\bf w}_k\vert\vert}$ (162)
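As a quick check of Eq. (162): if ${\bf x}_0$ lies on $H_k$, so that $f_k({\bf x}_0)=0$, then the point ${\bf x}_0\pm c_k{\bf w}_k/\vert\vert{\bf w}_k\vert\vert^2$ satisfies $f_k({\bf x})=\pm c_k$ and therefore lies on $H_{k\pm}$, and its distance to ${\bf x}_0$ along the normal direction ${\bf w}_k$ is

$\displaystyle \left\vert\left\vert \frac{c_k{\bf w}_k}{\vert\vert{\bf w}_k\vert\vert^2}\right\vert\right\vert
=\frac{\vert c_k\vert}{\vert\vert{\bf w}_k\vert\vert}$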

Also as shown in the figure, the intersections of these parallel lines for classes $C_1$ and $C_2$ further determine two straight lines parallel to the decision boundary $B_{12}$, defining the decision margin between the support vectors of $C_1$ and $C_2$. It can be seen that this decision margin is monotonically related to both $\vert c_1\vert/\vert\vert{\bf w}_1\vert\vert$ and $\vert c_2\vert/\vert\vert{\bf w}_2\vert\vert$, and it can therefore be maximized by minimizing both $\vert\vert{\bf w}_1\vert\vert$ and $\vert\vert{\bf w}_2\vert\vert$.
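For reference, the boundary $B_{ij}$ between $C_i$ and $C_j$, defined by $f_i({\bf x})=f_j({\bf x})$, is itself a straight line (or hyperplane) with normal direction ${\bf w}_i-{\bf w}_j$:

$\displaystyle f_i({\bf x})=f_j({\bf x})
\;\;\Longleftrightarrow\;\;
({\bf w}_i-{\bf w}_j)^T{\bf x}+(b_i-b_j)=0$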

[Figure (multiclassSVM.png): partition of the 2-D feature space into $K=3$ classes, showing the lines $H_k$ and $H_{k\pm}$, the decision boundaries $B_{12},\;B_{23},\;B_{31}$, and the margins between support vectors.]

Now the multiclass classification problem can be formulated as finding the parameters ${\bf w}_k$ and $b_k$ for all $k=1,\cdots,K$ so that $\sum_{k=1}^K \vert\vert{\bf w}_k\vert\vert^2$ is minimized, and thereby the decision margins of all boundaries $B_{ij}$ between $C_i$ and $C_j$ are maximized, under the constraint that every training sample ${\bf x}_n$ labeled by $y_n$ is correctly classified with a decision margin of 1 (the distance between the support vectors on the two sides of $B_{ij}$ is 2):

minimize:$\displaystyle \;\;\frac{1}{2}\sum_{k=1}^K\vert\vert{\bf w}_k\vert\vert^2$  
subject to:$\displaystyle \;\;{\bf w}_{y_n}^T{\bf x}_n+b_{y_n}\ge {\bf w}_k^T{\bf x}_n+b_k+2
\;\;\;\;\;(n=1,\cdots,N,\;\;k=1,\cdots,K,\;k\ne y_n)$ (163)
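As a small numerical sketch (not part of the original text), the constraints of Eq. (163) can be checked for a candidate solution; the names `W`, `b`, `X`, `y` below are assumed conventions, with the rows of `W` being the ${\bf w}_k$ and the labels in `y` being 1-based integers:

```python
import numpy as np

def hard_margin_constraints_hold(W, b, X, y):
    """Check the constraints of Eq. (163):
    w_{y_n}^T x_n + b_{y_n} >= w_k^T x_n + b_k + 2  for all n and all k != y_n.

    W : (K, d) weight vectors, b : (K,) biases
    X : (N, d) training samples, y : (N,) integer labels in {1, ..., K}
    """
    F = X @ W.T + b                              # F[n, k-1] = f_k(x_n)
    rows = np.arange(len(y))
    f_true = F[rows, y - 1]                      # f_{y_n}(x_n)
    F_other = F.copy()
    F_other[rows, y - 1] = -np.inf               # exclude the true class k = y_n
    return bool(np.all(f_true >= F_other.max(axis=1) + 2))
```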

Similar to Eq. (115) in the soft-margin SVM, the constraints of the optimization problem above can be relaxed by including in each constraint an extra error term $\xi_{nk}\ge 0$, which is also to be minimized. Now the problem can be reformulated as

minimize:$\displaystyle \;\;\frac{1}{2}\sum_{k=1}^K\vert\vert{\bf w}_k\vert\vert^2+C\sum_{k=1}^K\sum_{n=1}^N \xi_{nk}$  
subject to:$\displaystyle \;\;{\bf w}_{y_n}^T{\bf x}_n+b_{y_n}\ge {\bf w}_k^T{\bf x}_n+b_k+2-\xi_{nk},\;\;\;\;\xi_{nk}\ge 0
\;\;\;\;\;(n=1,\cdots,N,\;\;k=1,\cdots,K,\;k\ne y_n)$ (164)
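The quadratic program in Eq. (164) can also be handed directly to a generic convex solver. Below is a minimal sketch using the cvxpy package; the use of cvxpy and all variable names here are assumptions for illustration, not the solution method of the original text:

```python
import numpy as np
import cvxpy as cp

def train_multiclass_svm(X, y, K, C=1.0):
    """Solve the soft-margin multiclass SVM of Eq. (164) in its primal form.

    X : (N, d) training samples, y : (N,) integer labels in {1, ..., K}
    Returns (W, b) with W of shape (K, d) (rows are w_k) and b of shape (K,).
    """
    N, d = X.shape
    W = cp.Variable((K, d))
    b = cp.Variable(K)
    xi = cp.Variable((N, K), nonneg=True)        # slack variables xi_{nk} >= 0

    constraints = []
    for n in range(N):
        yn = y[n] - 1                            # 0-based index of the true class
        for k in range(K):
            if k == yn:
                continue                         # constraints are only for k != y_n
            constraints.append(
                W[yn] @ X[n] + b[yn] >= W[k] @ X[n] + b[k] + 2 - xi[n, k]
            )

    objective = cp.Minimize(0.5 * cp.sum_squares(W) + C * cp.sum(xi))
    cp.Problem(objective, constraints).solve()
    return W.value, b.value
```

In practice the problem is usually solved through its dual, which is developed next via the Lagrangian, but the primal sketch above makes the roles of $C$ and $\xi_{nk}$ concrete.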

The Lagrangian function of this constrained optimization problem is

$\displaystyle L({\bf w},b,\xi,\alpha,\beta)=\frac{1}{2}\sum_{k=1}^K\vert\vert{\bf w}_k\vert\vert^2+C\sum_k\sum_n \xi_{nk}
-\sum_k\sum_n \alpha_{nk}\left[({\bf w}_{y_n}-{\bf w}_k)^T{\bf x}_n+b_{y_n}-b_k-2+\xi_{nk}\right]
-\sum_k\sum_n \beta_{nk}\xi_{nk}$ (165)

where
$\displaystyle \alpha_{ny_n}=\beta_{ny_n}=0,\;\;\;\;\;\;\xi_{ny_n}=2;$  
$\displaystyle \alpha_{nk}\ge 0,\;\;\;\;\beta_{nk}\ge 0,\;\;\;\;\xi_{nk}\ge 0
\;\;\;\;\;(n=1,\cdots,N,\;\;\;k=1,\cdots,K,\;k\ne y_n)$ (166)

Setting the partial derivatives of the Lagrangian to zero, we get

$\displaystyle \frac{\partial}{\partial{\bf w}_k}L({\bf w},b,\xi,\alpha,\beta)
={\bf w}_k-\sum_{n:\,y_n=k}\Big(\sum_{j\ne k}\alpha_{nj}\Big){\bf x}_n
+\sum_{n=1}^N \alpha_{nk}{\bf x}_n={\bf0}$ (167)

$\displaystyle \frac{\partial}{\partial b_k}L({\bf w},b,\xi,\alpha,\beta)
=\sum_{n=1}^N\alpha_{nk}-\sum_{n:\,y_n=k}\sum_{j\ne k}\alpha_{nj}=0$ (168)

$\displaystyle \frac{\partial}{\partial \xi_{nk}}L({\bf w},b,\xi,\alpha,\beta)
=C-\alpha_{nk}-\beta_{nk}=0$ (169)
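In particular, since $\alpha_{nk}\ge 0$ and $\beta_{nk}\ge 0$, Eq. (169) implies the box constraint on the Lagrange multipliers:

$\displaystyle \beta_{nk}=C-\alpha_{nk}\ge 0\;\;\;\Longrightarrow\;\;\;0\le\alpha_{nk}\le C$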

The $K$ linear functions can also be collected into a single vector-valued function in matrix form (here the bias terms are dropped; they can be absorbed into the weight vectors by augmenting each sample ${\bf x}$ with a constant component 1):

$\displaystyle {\bf f}({\bf x})={\bf W}^T{\bf x}$ (170)

where

$\displaystyle {\bf W}=[{\bf w}_1,\cdots,{\bf w}_K]$ (171)

An unlabeled sample ${\bf x}$ is classified to class $C_i$ if $f_i({\bf x})=\max(f_1({\bf x}),\cdots,f_K({\bf x}))$.
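In this matrix form the decision rule can be applied to a whole batch of samples at once; a short Python sketch follows (the names `X` and `W` are assumptions, with the columns of `W` being the weight vectors and each sample augmented with a constant 1 if biases are used):

```python
import numpy as np

def classify_batch(X, W):
    """Classify each row of X with f(x) = W^T x, as in Eq. (170).

    X : (N, d) samples (already augmented with a constant 1 if biases are used)
    W : (d, K) matrix whose columns are the weight vectors w_1, ..., w_K
    Returns an (N,) array of 1-based class labels.
    """
    F = X @ W                     # F[n, k-1] = f_k(x_n)
    return np.argmax(F, axis=1) + 1
```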

Training the classifier amounts to finding all weight vectors in ${\bf W}$ from the training dataset. To do so, we first define a cost function
