For a decision hyperplane $\mathbf{w}^T\mathbf{x}+b=0$ to separate the two classes $C_+$ (labeled $y_n=+1$) and $C_-$ (labeled $y_n=-1$), it has to satisfy
$$y_n(\mathbf{w}^T\mathbf{x}_n+b)>0$$
for both $\mathbf{x}_n\in C_+$ and $\mathbf{x}_n\in C_-$. Among all such planes satisfying this condition, we want to find the optimal one $H$ that separates the two classes with the maximal margin (the distance between the decision plane and the closest sample points).
The optimal plane should be in the middle of the two classes, so that the distance from the plane to the closest point on either side is the same. We define two additional planes $H_+$ and $H_-$ that are parallel to $H$ and go through the point closest to the plane on either side:
$$H_+:\;\mathbf{w}^T\mathbf{x}+b=1,\qquad H_-:\;\mathbf{w}^T\mathbf{x}+b=-1$$
All points $\mathbf{x}_n\in C_+$ on the positive side should satisfy
$$\mathbf{w}^T\mathbf{x}_n+b\ge 1,\qquad y_n=+1$$
and all points $\mathbf{x}_n\in C_-$ on the negative side should satisfy
$$\mathbf{w}^T\mathbf{x}_n+b\le -1,\qquad y_n=-1$$
These can be combined into one inequality:
$$y_n(\mathbf{w}^T\mathbf{x}_n+b)\ge 1,\qquad n=1,\cdots,N$$
The equality holds for those points on the planes $H_+$ or $H_-$. Such points are called support vectors, for which
$$\mathbf{w}^T\mathbf{x}_n+b=y_n$$
i.e., the following holds for all support vectors:
$$y_n(\mathbf{w}^T\mathbf{x}_n+b)=y_n^2=1$$
Moreover, the distances from the origin to the three parallel planes $H_+$, $H$, and $H_-$ are, respectively, $|1-b|/\|\mathbf{w}\|$, $|b|/\|\mathbf{w}\|$, and $|1+b|/\|\mathbf{w}\|$, and the distance between planes $H_+$ and $H_-$ is $2/\|\mathbf{w}\|$.
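For concreteness, here is a small worked example (the numbers are assumed for illustration and are not from the text): let $\mathbf{w}=(1,1)^T$ and $b=-3$, so $\|\mathbf{w}\|=\sqrt{2}$ and the three planes in the $(x_1,x_2)$ plane are
$$H:\;x_1+x_2-3=0,\qquad H_+:\;x_1+x_2-3=1,\qquad H_-:\;x_1+x_2-3=-1$$
Their distances from the origin are $|b|/\|\mathbf{w}\|=3/\sqrt{2}$, $|1-b|/\|\mathbf{w}\|=4/\sqrt{2}$, and $|1+b|/\|\mathbf{w}\|=2/\sqrt{2}$, respectively, and the gap between $H_+$ and $H_-$ is indeed $4/\sqrt{2}-2/\sqrt{2}=2/\sqrt{2}=2/\|\mathbf{w}\|=\sqrt{2}$.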
Our goal is to maximize this distance, or, equivalently, to minimize the norm $\|\mathbf{w}\|$. Now the problem of finding the optimal decision plane in terms of $\mathbf{w}$ and $b$ can be formulated as:
$$\min_{\mathbf{w},\,b}\;\frac{1}{2}\|\mathbf{w}\|^2=\frac{1}{2}\mathbf{w}^T\mathbf{w},
\qquad\mbox{subject to}\qquad y_n(\mathbf{w}^T\mathbf{x}_n+b)\ge 1,\quad n=1,\cdots,N$$
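Before moving to the Lagrangian treatment, here is a minimal numerical sketch of this primal QP, assuming a tiny two-dimensional toy data set; the data and the use of scipy's general-purpose SLSQP routine (rather than a dedicated QP solver) are illustrative choices, not part of the original text.

import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (assumed for illustration): x_n in R^2, y_n in {+1, -1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [1.0, 0.5], [0.0, 1.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

def objective(params):
    w = params[:2]
    return 0.5 * np.dot(w, w)            # (1/2) ||w||^2

def constraint(params):
    w, b = params[:2], params[2]
    return y * (X @ w + b) - 1.0         # y_n (w^T x_n + b) - 1 >= 0 for every n

res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]), method="SLSQP",
               constraints=[{"type": "ineq", "fun": constraint}])
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 1.0 / np.linalg.norm(w))

For separable data the solver returns the maximum-margin $\mathbf{w}$ and $b$; in practice a dedicated QP or SVM package would be used instead of a general-purpose minimizer.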
Since the objective function is quadratic, this constrained optimization problem is called a quadratic programming (QP) problem. (If the objective function were linear instead, the problem would be a linear programming (LP) problem.) This QP problem can be solved by the Lagrange multiplier method, i.e., by minimizing the following Lagrangian
$$L_p(\mathbf{w},b,\boldsymbol{\alpha})=\frac{1}{2}\mathbf{w}^T\mathbf{w}
-\sum_{n=1}^N\alpha_n\left[y_n(\mathbf{w}^T\mathbf{x}_n+b)-1\right]$$
with respect to $\mathbf{w}$, $b$, and the Lagrange coefficients $\alpha_n\ge 0$ ($n=1,\cdots,N$). We let
$$\frac{\partial}{\partial\mathbf{w}}L_p(\mathbf{w},b,\boldsymbol{\alpha})=\mathbf{0},
\qquad
\frac{\partial}{\partial b}L_p(\mathbf{w},b,\boldsymbol{\alpha})=0$$
These lead, respectively, to
$$\mathbf{w}=\sum_{n=1}^N\alpha_n y_n\mathbf{x}_n,
\qquad
\sum_{n=1}^N\alpha_n y_n=0$$
Substituting these two equations back into the expression of $L_p(\mathbf{w},b,\boldsymbol{\alpha})$, we get the dual problem (with respect to $\boldsymbol{\alpha}$) of the above primal problem:
$$\max_{\boldsymbol{\alpha}}\;L_d(\boldsymbol{\alpha})
=\sum_{n=1}^N\alpha_n
-\frac{1}{2}\sum_{m=1}^N\sum_{n=1}^N\alpha_m\alpha_n y_m y_n\,\mathbf{x}_m^T\mathbf{x}_n,
\qquad\mbox{subject to}\quad\alpha_n\ge 0,\quad\sum_{n=1}^N\alpha_n y_n=0$$
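The substitution step can be spelled out as follows. Expanding the Lagrangian,
$$L_p=\frac{1}{2}\mathbf{w}^T\mathbf{w}
-\sum_{n=1}^N\alpha_n y_n\,\mathbf{w}^T\mathbf{x}_n
-b\sum_{n=1}^N\alpha_n y_n
+\sum_{n=1}^N\alpha_n$$
With $\mathbf{w}=\sum_n\alpha_n y_n\mathbf{x}_n$ the second term equals $\mathbf{w}^T\mathbf{w}$, and with $\sum_n\alpha_n y_n=0$ the third term vanishes, so
$$L_p=\sum_{n=1}^N\alpha_n-\frac{1}{2}\mathbf{w}^T\mathbf{w}
=\sum_{n=1}^N\alpha_n
-\frac{1}{2}\sum_{m=1}^N\sum_{n=1}^N\alpha_m\alpha_n y_m y_n\,\mathbf{x}_m^T\mathbf{x}_n
=L_d(\boldsymbol{\alpha})$$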
The dual problem is related to the primal problem by:
$$L_d(\boldsymbol{\alpha})=\inf_{\mathbf{w},\,b}\;L_p(\mathbf{w},b,\boldsymbol{\alpha})$$
i.e., $L_d(\boldsymbol{\alpha})$ is the greatest lower bound (infimum) of $L_p(\mathbf{w},b,\boldsymbol{\alpha})$ over all $\mathbf{w}$ and $b$.
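In particular (a standard weak-duality observation, spelled out here for completeness), for any $(\mathbf{w},b)$ satisfying the primal constraints and any $\boldsymbol{\alpha}\ge\mathbf{0}$,
$$L_d(\boldsymbol{\alpha})
\le L_p(\mathbf{w},b,\boldsymbol{\alpha})
=\frac{1}{2}\|\mathbf{w}\|^2
-\sum_{n=1}^N\alpha_n\underbrace{\left[y_n(\mathbf{w}^T\mathbf{x}_n+b)-1\right]}_{\ge 0}
\le\frac{1}{2}\|\mathbf{w}\|^2$$
so every value of the dual objective is a lower bound on the primal objective, and maximizing $L_d(\boldsymbol{\alpha})$ tightens this bound.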
Solving this dual problem (an easier problem than the primal one), we get the optimal multipliers $\alpha_n$, from which the weight vector $\mathbf{w}=\sum_{n=1}^N\alpha_n y_n\mathbf{x}_n$ of the optimal plane can be found. Those points $\mathbf{x}_n$ on either of the two planes $H_+$ and $H_-$ (for which the equality $y_n(\mathbf{w}^T\mathbf{x}_n+b)=1$ holds) are called support vectors, and they correspond to positive Lagrange multipliers $\alpha_n>0$. The training depends only on the support vectors, while all other samples away from the planes $H_+$ and $H_-$ (with $\alpha_n=0$) do not affect the solution.
For a support vector $\mathbf{x}_n$ (on the $H_+$ or $H_-$ plane), the constraining condition is
$$y_n(\mathbf{w}^T\mathbf{x}_n+b)=1,\qquad n\in S$$
where $S$ is the set of all indices of support vectors (those corresponding to $\alpha_n>0$). Substituting
$$\mathbf{w}=\sum_{m=1}^N\alpha_m y_m\mathbf{x}_m$$
and using $y_n^2=1$, we get
$$b=y_n-\mathbf{w}^T\mathbf{x}_n
=y_n-\sum_{m=1}^N\alpha_m y_m\,\mathbf{x}_m^T\mathbf{x}_n,\qquad n\in S$$
Note that the summation only contains terms corresponding to those support vectors $\mathbf{x}_m$ with $\alpha_m>0$, i.e.,
$$b=y_n-\sum_{m\in S}\alpha_m y_m\,\mathbf{x}_m^T\mathbf{x}_n,\qquad n\in S$$
For the optimal weight vector $\mathbf{w}$ and optimal $b$, we have:
$$\|\mathbf{w}\|^2=\mathbf{w}^T\mathbf{w}
=\sum_{n\in S}\alpha_n y_n\,\mathbf{w}^T\mathbf{x}_n
=\sum_{n\in S}\alpha_n y_n(y_n-b)
=\sum_{n\in S}\alpha_n-b\sum_{n\in S}\alpha_n y_n
=\sum_{n\in S}\alpha_n$$
The last equality is due to $\sum_{n\in S}\alpha_n y_n=\sum_{n=1}^N\alpha_n y_n=0$ shown above.
Recall that the distance between the two margin planes $H_+$ and $H_-$ is $2/\|\mathbf{w}\|$, and the margin, the distance between $H_+$ (or $H_-$) and the optimal decision plane $H$, is
$$\frac{1}{\|\mathbf{w}\|}=\frac{1}{\sqrt{\mathbf{w}^T\mathbf{w}}}
=\frac{1}{\sqrt{\sum_{n\in S}\alpha_n}}$$
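To close the loop, the dual can be solved numerically and the recovered $\mathbf{w}$, $b$, the identity $\|\mathbf{w}\|^2=\sum_{n\in S}\alpha_n$, and the margin $1/\|\mathbf{w}\|$ checked against each other. The sketch below reuses the assumed toy data from the earlier primal sketch and scipy's general-purpose SLSQP routine (not a dedicated QP or SVM package); it is illustrative only.

import numpy as np
from scipy.optimize import minimize

# Same assumed toy data as in the earlier primal sketch
X = np.array([[2.0, 2.0], [3.0, 3.0], [1.0, 0.5], [0.0, 1.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
N = len(y)
G = (y[:, None] * X) @ (y[:, None] * X).T       # G[m, n] = y_m y_n x_m^T x_n

def neg_dual(a):                                 # maximize L_d  <=>  minimize -L_d
    return -(a.sum() - 0.5 * a @ G @ a)

res = minimize(neg_dual, x0=np.zeros(N), method="SLSQP",
               bounds=[(0.0, None)] * N,                              # alpha_n >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_n alpha_n y_n = 0
alpha = res.x
S = alpha > 1e-6                                 # indices of support vectors (alpha_n > 0)

w = (alpha[S] * y[S]) @ X[S]                     # w = sum_{m in S} alpha_m y_m x_m
b = np.mean(y[S] - X[S] @ w)                     # b = y_n - w^T x_n, averaged over S
print("alpha =", alpha.round(4))
print("w =", w, "b =", b)
print("||w||^2 =", w @ w, "sum of alphas =", alpha[S].sum())
print("margin =", 1.0 / np.linalg.norm(w))

For separable data the primal and dual sketches should agree on $\mathbf{w}$, $b$, and the margin up to solver tolerance, and the printed $\|\mathbf{w}\|^2$ should match the sum of the positive multipliers.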
Ruye Wang
2016-08-19