When the two classes are not linearly separable, the condition for the
optimal hyperplane in Eq. (72) can be relaxed by including an extra error
term (slack variable) $\xi_n\ge 0$:

$\displaystyle y_n({\bf w}^T{\bf x}_n+b)\ge 1-\xi_n,\;\;\;\;\xi_n\ge 0,\;\;\;\;(n=1,\cdots,N)$    (115)

For a better classification result, this error needs to be minimized as well
as $\vert\vert{\bf w}\vert\vert^2/2$. Now the primal problem in Eq. (73) can
be modified to the following minimization problem with an objective function
and two sets of inequality (non-negative) constraints:
minimize: $\displaystyle \frac{1}{2}\vert\vert{\bf w}\vert\vert^2+C\sum_{n=1}^N\xi_n$

subject to: $\displaystyle y_n({\bf w}^T{\bf x}_n+b)-1+\xi_n\ge 0$ and $\displaystyle \xi_n\ge 0,\;\;\;\;(n=1,\cdots,N)$    (116)
The value of the parameter $C$ can be adjusted to make a proper trade-off
between underfitting, emphasizing the global distribution of the training
samples, and overfitting, emphasizing the local samples close to the decision
boundary. A large $C$ value emphasizes the minimization of the error term in
the objective function, and the resulting decision boundary is mostly
dictated by a small number of local support vectors in the region between the
two classes, possibly including some noisy outliers, thereby risking
overfitting. On the other hand, a small $C$ value de-emphasizes the error
term and the classifier can tolerate greater errors, thereby allowing more
data samples farther away from the decision boundary to become support
vectors and play a role in determining it. Consequently the classifier tends
to better reflect the overall distribution of the data samples, but with the
risk of underfitting.
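As an illustration of this trade-off, the following minimal sketch (assuming
NumPy and scikit-learn are available; the synthetic data and the specific $C$
values are arbitrary choices, not part of the original text) trains a linear
soft-margin SVM with several values of $C$ and reports how many support
vectors each choice produces:

```python
# Sketch: effect of the soft-margin parameter C (assumes numpy and scikit-learn).
# Small C tolerates more margin violations (many support vectors, boundary shaped
# by the overall distribution); large C penalizes violations heavily (few support
# vectors near the class overlap, risking overfitting).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two overlapping Gaussian classes, hence not linearly separable
X = np.vstack([rng.normal(-1.0, 1.5, (50, 2)), rng.normal(1.0, 1.5, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {clf.n_support_.sum():3d} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```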
This constrained minimization problem can be converted to the minimization of
the corresponding Lagrangian:

$\displaystyle L_p({\bf w},b,{\bf\xi},{\bf\alpha},{\bf\beta})
=\frac{1}{2}\vert\vert{\bf w}\vert\vert^2+C\sum_{n=1}^N\xi_n
-\sum_{n=1}^N\alpha_n[y_n({\bf w}^T{\bf x}_n+b)+\xi_n-1]
-\sum_{n=1}^N\beta_n\xi_n$    (117)

where $\alpha_n$ and $\beta_n$ are the Lagrange multipliers for the two sets
of constraints. As the constraints are non-negative (instead of non-positive
as in the previous section), a minus sign is used in front of the last two
terms, so that the Lagrange multipliers $\alpha_n$ and $\beta_n$ are still
required to be non-negative (consistent with the table given previously). The
KKT conditions of this minimization problem are listed below (for all
$n=1,\cdots,N$):
- Stationarity:

  $\displaystyle \frac{\partial L_p}{\partial{\bf w}}
  ={\bf w}-\sum_{n=1}^N\alpha_ny_n{\bf x}_n={\bf0}$    (118)

  $\displaystyle \frac{\partial L_p}{\partial b}
  =-\sum_{n=1}^N\alpha_ny_n=0$    (119)

  $\displaystyle \frac{\partial L_p}{\partial\xi_n}
  =C-\alpha_n-\beta_n=0$    (120)

- Primal feasibility:

  $\displaystyle y_n({\bf w}^T{\bf x}_n+b)+\xi_n-1\ge 0,\;\;\;\;\xi_n\ge 0$    (121)

- Complementarity:

  $\displaystyle \alpha_n[y_n({\bf w}^T{\bf x}_n+b)+\xi_n-1]=0,\;\;\;\;\beta_n\xi_n=0$    (122)

- Dual feasibility:

  $\displaystyle \alpha_n\ge 0,\;\;\;\;\beta_n\ge 0$    (123)
Depending on the value of $\alpha_n$, there exist three possible cases, due
to the complementarity (Eq. (122)):

- If $\alpha_n=0$, then $\beta_n=C-\alpha_n=C>0$ and therefore $\xi_n=0$;
  primal feasibility gives $y_n({\bf w}^T{\bf x}_n+b)-1\ge 0$, i.e.,
  ${\bf x}_n$ is correctly classified, on or outside the margin, and is not
  a support vector.
- If $0<\alpha_n<C$, then $\beta_n=C-\alpha_n>0$ and therefore $\xi_n=0$;
  complementarity gives $y_n({\bf w}^T{\bf x}_n+b)-1=0$, i.e., ${\bf x}_n$ is
  exactly on the margin and is a support vector.
- If $\alpha_n=C$, then $\beta_n=0$ and $\xi_n\ge 0$; complementarity gives
  $y_n({\bf w}^T{\bf x}_n+b)-1=-\xi_n\le 0$, i.e., ${\bf x}_n$ is inside the
  margin (or misclassified if $\xi_n>1$) and is also a support vector.

In all three cases the expression $y_n({\bf w}^T{\bf x}_n+b)-1$ can be
rewritten as

$\displaystyle y_n({\bf w}^T{\bf x}_n+b)-1=y_n({\bf w}^T{\bf x}_n+b)-y_n^2
=y_n[({\bf w}^T{\bf x}_n+b)-y_n]=y_n\,E_n$    (129)

where $y_n^2=1$ as $y_n=\pm 1$, and $E_n=({\bf w}^T{\bf x}_n+b)-y_n$ is the
error, or the difference between the actual and desired output. We summarize
all three cases above to get:

$\displaystyle \left\{\begin{array}{lll}
\alpha_n=0 &\Longrightarrow& y_nE_n\ge 0\\
0<\alpha_n<C &\Longrightarrow& y_nE_n=0\\
\alpha_n=C &\Longrightarrow& y_nE_n\le 0
\end{array}\right.$    (130)
These results derived from the KKT conditions in Eqs. (118) through (123) are
an alternative form of the KKT conditions, which will be used in the SMO
algorithm to check whether the KKT conditions are violated by any $\alpha_n$.
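To show how this check can be coded (a minimal NumPy sketch; the function
name, the tolerance `tol`, and the array layout are illustrative assumptions,
not part of the original text), each $\alpha_n$ can be tested against the
three conditions of Eq. (130):

```python
import numpy as np

def kkt_violations(alpha, y, E, C, tol=1e-3):
    """Flag samples whose alpha_n violates the alternative KKT conditions (Eq. 130).

    alpha : (N,) Lagrange multipliers
    y     : (N,) class labels, +1 or -1
    E     : (N,) errors E_n = (w^T x_n + b) - y_n
    C     : soft-margin penalty parameter
    """
    yE = y * E
    violated = np.zeros(alpha.shape, dtype=bool)
    violated |= (alpha < tol) & (yE < -tol)                          # alpha_n = 0     requires y_n E_n >= 0
    violated |= (alpha > C - tol) & (yE > tol)                       # alpha_n = C     requires y_n E_n <= 0
    violated |= (alpha > tol) & (alpha < C - tol) & (abs(yE) > tol)  # 0 < alpha_n < C requires y_n E_n = 0
    return violated
```

In SMO, samples flagged by such a test are the candidates whose multipliers
are optimized next.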
The dual of the primal problem in Eq. (117) can be obtained by substituting
the first set of equations (stationarity) into the primal Lagrangian:

maximize: $\displaystyle L_d({\bf\alpha})=\sum_{n=1}^N\alpha_n
-\frac{1}{2}\sum_{m=1}^N\sum_{n=1}^N\alpha_m\alpha_ny_my_n\,{\bf x}_m^T{\bf x}_n$

subject to: $\displaystyle \sum_{n=1}^N\alpha_ny_n=0,\;\;\;\;
0\le\alpha_n\le C,\;\;\;\;(n=1,\cdots,N)$    (131)
Note that due to the condition $C-\alpha_n-\beta_n=0$, the slack variables
$\xi_n$ and the Lagrange multipliers $\beta_n$ no longer appear in the
Lagrangian of the dual problem, which happens to be identical to that of the
linearly separable problem discussed in the previous section. Also, since
$\beta_n=C-\alpha_n$ and $\beta_n\ge 0$, we have $0\le\alpha_n\le C$. Having
solved this QP problem for ${\bf\alpha}=[\alpha_1,\cdots,\alpha_N]^T$, we get
the optimal decision plane in terms of its normal vector

$\displaystyle {\bf w}=\sum_{n=1}^N\alpha_ny_n{\bf x}_n$    (132)
based on Eq. (86), and the bias term $b$ based on Eq. (88):

$\displaystyle b=y_n-{\bf w}^T{\bf x}_n
=y_n-\sum_{m=1}^N\alpha_my_m\,{\bf x}_m^T{\bf x}_n$    (133)

where the inner product ${\bf x}_m^T{\bf x}_n$ is replaced by the kernel
function $K({\bf x}_m,{\bf x}_n)$ if the mapping is nonlinear.
Computationally, we take the average of such $b$ values over all support
vectors as the offset term.
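Putting the pieces together, here is a minimal NumPy/SciPy sketch (the solver
choice, function names, and thresholds are illustrative assumptions; a
dedicated algorithm such as SMO would normally be used) that solves the dual
QP of Eq. (131) for a linear kernel and then recovers ${\bf w}$ and the
averaged bias $b$ via Eqs. (132) and (133):

```python
import numpy as np
from scipy.optimize import minimize

def train_soft_margin_svm(X, y, C=1.0):
    """Solve the dual QP of Eq. (131) with a linear kernel; return w, b, alpha."""
    N = len(y)
    K = X @ X.T                                  # Gram matrix of inner products x_m^T x_n
    Q = (y[:, None] * y[None, :]) * K

    # dual objective, negated because scipy minimizes: -L_d(alpha)
    def neg_dual(alpha):
        return 0.5 * alpha @ Q @ alpha - alpha.sum()

    def neg_dual_grad(alpha):
        return Q @ alpha - np.ones(N)

    res = minimize(neg_dual, np.zeros(N), jac=neg_dual_grad, method="SLSQP",
                   bounds=[(0.0, C)] * N,                                # 0 <= alpha_n <= C
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_n alpha_n y_n = 0
    alpha = res.x

    # Eq. (132): normal vector of the decision plane
    w = (alpha * y) @ X
    # Eq. (133): bias, averaged over on-margin support vectors (0 < alpha_n < C),
    # where xi_n = 0 so the relation b = y_n - w^T x_n holds exactly
    # (assumes at least one such support vector exists)
    sv = (alpha > 1e-6) & (alpha < C - 1e-6)
    b = np.mean(y[sv] - X[sv] @ w)
    return w, b, alpha
```

The sketch averages $b$ only over the on-margin support vectors
($0<\alpha_n<C$), since for those $\xi_n=0$ and Eq. (133) holds exactly.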
Given ${\bf w}$ and $b$, any unlabeled point ${\bf x}$ can be classified into
either of the two classes depending on whether it is on the positive or
negative side of the decision plane:

$\displaystyle \hat{y}=\mbox{sign}({\bf w}^T{\bf x}+b)
=\mbox{sign}\left(\sum_{n=1}^N\alpha_ny_n\,{\bf x}_n^T{\bf x}+b\right)$    (134)

where the inner product is replaced by a kernel function if it is nonlinear.
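Continuing the sketch above (again illustrative; `kernel_fn` is an assumed
helper, not something defined in the text), classification of unlabeled
points according to Eq. (134) can be written so that a kernel function simply
replaces the inner product:

```python
import numpy as np

def svm_predict(X_train, y_train, alpha, b, X_new, kernel_fn=None):
    """Classify rows of X_new by the sign of the decision function in Eq. (134).

    If kernel_fn is None the plain inner product (linear kernel) is used;
    otherwise kernel_fn(X_train, X_new) must return the (N_train, N_new)
    matrix of kernel values K(x_n, x).
    """
    K = X_train @ X_new.T if kernel_fn is None else kernel_fn(X_train, X_new)
    decision = (alpha * y_train) @ K + b      # sum_n alpha_n y_n K(x_n, x) + b
    return np.sign(decision)
```

For example, with the `alpha` and `b` returned by the training sketch above,
`svm_predict(X, y, alpha, b, X_test)` assigns a label of $\pm 1$ to each row
of `X_test`.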