The support vector machine (SVM) is a supervised binary classifier trained on a data set $\mathcal{D}=\{(\mathbf{x}_n,\,y_n),\ n=1,\dots,N\}$, in which each data sample $\mathbf{x}_n$, a vector in the $d$-dimensional feature space, belongs to either class $C_+$ if labeled by $y_n=1$ or class $C_-$ if labeled by $y_n=-1$. The result of the training process is the decision boundary, or decision plane, in the feature space, described by the linear decision equation:
$$f(\mathbf{x})=\mathbf{w}^T\mathbf{x}+b=0 \tag{62}$$
$$\text{if }\ f(\mathbf{x})=\mathbf{w}^T\mathbf{x}+b\ \gtrless\ 0,\ \text{ then }\ \mathbf{x}\in C_\pm \tag{63}$$
The initial setup of the SVM algorithm looks the same as that of linear regression used as a linear binary classifier, where the linear regression function is thresholded by zero to become $\hat{y}=\operatorname{sign}(\mathbf{w}^T\mathbf{x}+b)$. But for the SVM, the parameters $\mathbf{w}$ and $b$ are determined in such a way that the corresponding decision plane separates the training samples belonging to the two classes optimally, i.e., its distance to the closest samples on either side, called support vectors, is maximized. Also, as kernel mapping can be applied to the SVM, it can be used for nonlinear classification as well.
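As a quick illustration of the decision rule in Eq. (63), the following Matlab sketch classifies a few points by the sign of the linear decision function; the weight vector, bias, and points here are arbitrary values assumed for illustration, not the result of any training.

% Hypothetical decision plane in 2-D (values assumed for illustration only)
w = [1; 2];                      % normal vector of the plane
b = -2;                          % bias
X = [3 0 1; 1 0 1];              % three sample points as columns
f = w'*X + b;                    % decision function f(x) = w'x + b for each column
labels = sign(f);                % +1 -> class C+, -1 -> class C-
disp(f); disp(labels);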
We rewrite the decision equation as
$$\frac{\mathbf{w}^T\mathbf{x}}{\lVert\mathbf{w}\rVert}=\frac{-b}{\lVert\mathbf{w}\rVert} \tag{64}$$
which indicates that the projection of any point $\mathbf{x}$ on the plane onto the normal direction $\mathbf{w}/\lVert\mathbf{w}\rVert$ is the constant $-b/\lVert\mathbf{w}\rVert$, the signed distance from the origin to the plane. More generally, the signed distance from an arbitrary point $\mathbf{x}$ in the feature space to the plane is
$$d(\mathbf{x})=\frac{f(\mathbf{x})}{\lVert\mathbf{w}\rVert}=\frac{\mathbf{w}^T\mathbf{x}+b}{\lVert\mathbf{w}\rVert} \tag{66}$$
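The distance formula of Eq. (66) is straightforward to compute; the following Matlab sketch does so for the same arbitrary plane assumed in the previous sketch.

% Signed distance from points to the plane w'x + b = 0 (Eq. (66))
w = [1; 2];  b = -2;                 % hypothetical plane parameters
X = [3 0; 1 0];                      % two points as columns
d = (w'*X + b)/norm(w);              % signed distances; zero means the point is on the plane
disp(d);                             % the sign indicates on which side each point lies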
We desire to find the optimal decision plane that separates the training samples belonging to the two different classes, assumed to be linearly separable, in such a way that its distance to the support vectors, denoted by $\gamma$, is maximized:
$$\gamma=\frac{\lvert f(\mathbf{x}_{sv})\rvert}{\lVert\mathbf{w}\rVert}=\frac{y_{sv}\,(\mathbf{w}^T\mathbf{x}_{sv}+b)}{\lVert\mathbf{w}\rVert}\;\longrightarrow\;\max \tag{67}$$
As scaling both $\mathbf{w}$ and $b$ by a common factor does not change the decision plane, we can normalize them so that $f(\mathbf{x}_{sv})=\pm 1$ at the support vectors, which therefore lie on the two planes parallel to the decision plane:
$$P_+:\ \mathbf{w}^T\mathbf{x}+b=1,\qquad P_-:\ \mathbf{w}^T\mathbf{x}+b=-1 \tag{69}$$
and the margin to be maximized becomes $\gamma=1/\lVert\mathbf{w}\rVert$.
For these planes to correctly separate all samples in the training set $\mathcal{D}$, they have to satisfy the following two conditions:
$$\mathbf{w}^T\mathbf{x}_n+b\ \ge\ 1\qquad\text{if}\ \ y_n=1 \tag{70}$$
$$\mathbf{w}^T\mathbf{x}_n+b\ \le\ -1\qquad\text{if}\ \ y_n=-1 \tag{71}$$
which can be combined into a single condition:
$$y_n(\mathbf{w}^T\mathbf{x}_n+b)\ \ge\ 1,\qquad n=1,\dots,N \tag{72}$$
Maximizing the margin $\gamma=1/\lVert\mathbf{w}\rVert$ is therefore equivalent to the constrained minimization problem
$$\min_{\mathbf{w},\,b}\ \frac{1}{2}\lVert\mathbf{w}\rVert^2\qquad\text{subject to}\qquad y_n(\mathbf{w}^T\mathbf{x}_n+b)\ge 1,\quad n=1,\dots,N \tag{73}$$
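Whether a candidate plane is feasible can be checked directly against the constraints of Eq. (72); the following is a minimal Matlab sketch, with a toy training set and plane assumed here only for illustration.

% Check the margin constraints y_n (w'x_n + b) >= 1 of Eq. (72)
X = [2 3 -1 -2; 2 1 -1 -2];          % toy training samples (columns)
y = [1 1 -1 -1];                     % their labels
w = [1; 1];  b = 0;                  % candidate plane parameters
margins = y .* (w'*X + b);           % functional margins of all samples
feasible = all(margins >= 1);        % true if every constraint is satisfied
disp(margins); disp(feasible);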
Example:
The straight line $L$ in 2-D space shown above is described by the linear equation
$$f(\mathbf{x})=\mathbf{w}^T\mathbf{x}+b=w_1x_1+w_2x_2+b=0 \tag{74}$$
The two points $\mathbf{x}_+$ and $\mathbf{x}_-$ with equal distance to the straight line $L$ can be assumed to be two support vectors on either side of $L$, with
$$f(\mathbf{x}_+)=\mathbf{w}^T\mathbf{x}_++b=c>0 \tag{76}$$
$$f(\mathbf{x}_-)=\mathbf{w}^T\mathbf{x}_-+b=-c<0 \tag{77}$$
If both $\mathbf{w}$ and $b$ are scaled by $1/c$, the line $L$ remains the same, but the values of the decision function at the two support vectors become
$$f(\mathbf{x}_+)=1 \tag{78}$$
$$f(\mathbf{x}_-)=-1 \tag{79}$$
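The rescaling used in this example can be verified numerically; the line and the two equidistant points below are assumed for illustration only.

% Scaling w and b by 1/c leaves the line unchanged but gives f(x_sv) = +/-1
w = [1; 2];  b = -2;                 % hypothetical line w'x + b = 0
xp = [2; 1];  xm = [0; 0];           % hypothetical points with equal distance to the line
c  = w'*xp + b;                      % f(x+) = c = 2, while f(x-) = -c = -2
ws = w/c;  bs = b/c;                 % rescaled parameters describe the same line
disp([ws'*xp + bs, ws'*xm + bs]);    % prints 1 and -1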
For reasons to be discussed later, instead of directly solving the constrained minimization problem in Eq. (73), now called the primal problem, we actually solve the dual problem. Specifically, we first construct the Lagrangian function of the primal problem:
$$L_p(\mathbf{w},b,\boldsymbol{\alpha})=\frac{1}{2}\lVert\mathbf{w}\rVert^2+\sum_{n=1}^N\alpha_n\left[1-y_n(\mathbf{w}^T\mathbf{x}_n+b)\right] \tag{80}$$
where $\alpha_n\ (n=1,\dots,N)$ are the Lagrange multipliers; $L_p$ is called the primal function. For this minimization problem with non-positive constraints $1-y_n(\mathbf{w}^T\mathbf{x}_n+b)\le 0$, the Lagrange multipliers would be required to be negative, $\alpha_n\le 0$, according to the table here, if a minus sign were used for the second term. However, to be consistent with most of the SVM literature, we use the positive sign for the second term and require $\alpha_n\ge 0$.

Note that if $\mathbf{x}_n$ is a support vector on either $P_+$ or $P_-$, i.e., the equality constraint $y_n(\mathbf{w}^T\mathbf{x}_n+b)=1$ holds, then $\alpha_n>0$; but if $\mathbf{x}_n$ is not a support vector, the equality constraint does not hold, and $\alpha_n=0$.

We next find the minimum (or infimum) as the lower bound of the primal function in Eq. (80), by setting to zero its partial derivatives with respect to both $\mathbf{w}$ and $b$:
$$\frac{\partial L_p}{\partial\mathbf{w}}=\mathbf{w}-\sum_{n=1}^N\alpha_n y_n\mathbf{x}_n=\mathbf{0} \tag{81}$$
and
$$\frac{\partial L_p}{\partial b}=-\sum_{n=1}^N\alpha_n y_n=0 \tag{82}$$
which yield
$$\mathbf{w}=\sum_{n=1}^N\alpha_n y_n\mathbf{x}_n \tag{83}$$
$$\sum_{n=1}^N\alpha_n y_n=0 \tag{84}$$
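The conditions of Eqs. (83) and (84) are easy to check numerically. The tiny configuration below is assumed for illustration: two support vectors at $(1,1)$ and $(-1,-1)$ plus one interior sample, for which $\alpha=(0.25,\,0.25,\,0)$ happens to be a valid set of multipliers.

% Given multipliers alpha, recover w by Eq. (83) and check Eq. (84)
X = [1 -1 2; 1 -1 2];                % toy samples (columns)
y = [1 -1 1];                        % labels
alpha = [0.25 0.25 0];               % multipliers for this toy configuration
w = X*(alpha.*y)';                   % Eq. (83): w = sum_n alpha_n y_n x_n
check = sum(alpha.*y);               % Eq. (84): should be zero
disp(w); disp(check);                % prints w = [0.5; 0.5] and 0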
We can now find the normal direction of the optimal decision plane based on Eq. (83):
$$\mathbf{w}=\sum_{n=1}^N\alpha_n y_n\mathbf{x}_n=\sum_{y_n=1}\alpha_n\mathbf{x}_n-\sum_{y_n=-1}\alpha_n\mathbf{x}_n \tag{85}$$
Note that the normal vector $\mathbf{w}$ is the difference between the weighted means of the support vectors in class $C_+$ (first term) and class $C_-$ (second term), and it is determined only by the support vectors.

Having found $\mathbf{w}$, we can also find $b$ based on the equality of Eq. (72) for any of the support vectors $\mathbf{x}_{sv}$:
$$b=y_{sv}-\mathbf{w}^T\mathbf{x}_{sv} \tag{86}$$
or, averaged over all $N_{sv}$ support vectors for better numerical robustness,
$$b=\frac{1}{N_{sv}}\sum_{n\in SV}\left(y_n-\mathbf{w}^T\mathbf{x}_n\right) \tag{87}$$
We note that both $\mathbf{w}$ and $b$ depend only on the support vectors, which lie on the planes $P_+$ and $P_-$ and correspond to a positive $\alpha_n>0$, while all remaining samples, corresponding to $\alpha_n=0$, are farther away from the decision plane $f(\mathbf{x})=0$, behind either $P_+$ or $P_-$. We see that the SVM is based solely on the support vectors in the training set; once they are identified during the training process, all other non-support vectors are irrelevant.
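Accordingly, $\mathbf{w}$ and $b$ can be computed from the support vectors alone, as in Eqs. (85) and (87); a minimal Matlab sketch, reusing the toy configuration assumed in the previous sketch.

% Recover w and b from the support vectors only (Eqs. (85) and (87))
X = [1 -1 2; 1 -1 2];  y = [1 -1 1];  alpha = [0.25 0.25 0];   % toy configuration
I   = find(alpha > 1e-3);            % indices of support vectors (nonzero alphas)
Xsv = X(:,I);  asv = alpha(I);  ysv = y(I);
w   = Xsv*(asv.*ysv)';               % Eq. (85): weighted sum over support vectors
b   = mean(ysv - w'*Xsv);            % Eq. (87): averaged over the support vectors
disp(w); disp(b);                    % prints w = [0.5; 0.5] and b = 0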
Having obtained $\mathbf{w}$ and $b$ from the training process as shown above, any unlabeled sample $\mathbf{x}$ can be classified into either of the two classes depending on whether the following decision function is greater or smaller than zero:
$$f(\mathbf{x})=\mathbf{w}^T\mathbf{x}+b=\sum_{n\in SV}\alpha_n y_n\,\mathbf{x}_n^T\mathbf{x}+b \tag{88}$$
We also get, by substituting Eqs. (83) and (84) back into the primal function in Eq. (80), the dual function
$$L_d(\boldsymbol{\alpha})=\sum_{n=1}^N\alpha_n-\frac{1}{2}\sum_{m=1}^N\sum_{n=1}^N\alpha_m\alpha_n y_m y_n\,\mathbf{x}_m^T\mathbf{x}_n \tag{90}$$
which is to be maximized with respect to $\boldsymbol{\alpha}$, subject to the constraints
$$\sum_{n=1}^N\alpha_n y_n=0,\qquad \alpha_n\ge 0,\quad n=1,\dots,N \tag{91}$$
This constrained maximization is the dual problem.
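The decision function of Eq. (88) uses the training data only through inner products with the support vectors; the following Matlab sketch classifies a new sample this way, with the same toy support vectors assumed in the sketches above.

% Classify a new sample using only support vectors and inner products (Eq. (88))
Xsv = [1 -1; 1 -1];  ysv = [1 -1];  asv = [0.25 0.25];   % toy support vectors
b = 0;                                                   % toy bias
xnew = [0.5; 2];                                         % unlabeled sample
f = sum(asv.*ysv.*(xnew'*Xsv)) + b;  % f(x) = sum_n alpha_n y_n x_n'x + b
label = sign(f);                     % +1 -> class C+, -1 -> class C-
disp([f label]);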
The SVM algorithm for binary classification can now be summarized as the following steps:

1. Solve the dual problem in Eqs. (90) and (91), a quadratic programming (QP) problem, to find the Lagrange multipliers $\alpha_n$.
2. Identify the support vectors as the training samples with $\alpha_n>0$.
3. Compute the normal vector $\mathbf{w}$ by Eq. (83) and the bias $b$ by Eq. (87).
4. Classify any unlabeled sample $\mathbf{x}$ by the sign of the decision function $f(\mathbf{x})=\mathbf{w}^T\mathbf{x}+b$ in Eq. (88).
We prefer to solve this dual problem not only because it is easier than the original primal problem, but also, more importantly, because all data points appear in the dual problem only in the form of inner products $\mathbf{x}_m^T\mathbf{x}_n$, allowing the kernel method to be used, as discussed later.
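To illustrate the point about inner products, the quadratic matrix of the dual can be built from any kernel in place of the plain inner products. The following is only a sketch, assuming a Gaussian (RBF) kernel with an arbitrary width; it is not part of the algorithm described above.

% Dual quadratic matrix with inner products replaced by a kernel (sketch)
X = [1 -1 2; 1 -1 2];  y = [1 -1 1];           % toy data (columns) and labels
n = size(X,2);
G = X'*X;                                      % Gram matrix of plain inner products
Qlin = (y'*y).*G;                              % quadratic matrix of the linear dual
sq = diag(G);                                  % squared norms of the samples
D2 = repmat(sq,1,n) + repmat(sq',n,1) - 2*G;   % pairwise squared distances
sigma = 1;                                     % assumed kernel width (arbitrary)
K = exp(-D2/(2*sigma^2));                      % Gaussian kernel matrix
Qker = (y'*y).*K;                              % same dual with kernelized inner products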
The Matlab code for the essential part of the algorithm is listed below, where X and y are respectively the data array composed of the training vectors $\mathbf{x}_1,\dots,\mathbf{x}_N$ as its columns and the row vector of their corresponding labels $y_1,\dots,y_N$, and IPmethod is a function that implements the interior point method for solving a general QP problem of minimizing $\frac{1}{2}\boldsymbol{\alpha}^T\mathbf{Q}\boldsymbol{\alpha}+\mathbf{c}^T\boldsymbol{\alpha}$ subject to the linear equality constraint $\mathbf{A}\boldsymbol{\alpha}=\mathbf{b}$ and the non-negativity constraint $\boldsymbol{\alpha}\ge\mathbf{0}$.
[X y]=getData;                         % get the training data
[m n]=size(X);                         % m: feature dimension, n: number of samples
Q=(y'*y).*(X'*X);                      % compute the quadratic matrix of the dual
c=-ones(n,1);                          % coefficients of linear term
A=y;                                   % coefficient matrix and
b=0;                                   % constant of linear equality constraint
alpha=0.1*ones(n,1);                   % initial guess of solution
[alpha mu lambda]=IPmethod(Q,A,c,b,alpha);   % solve QP to find alphas
                                             % by interior point method
I=find(abs(alpha)>10^(-3));            % indices of non-zero alphas
asv=alpha(I);                          % non-zero alphas (column vector)
Xsv=X(:,I);                            % support vectors
ysv=y(:,I);                            % their labels
w=sum(repmat(ysv.*asv',m,1).*Xsv,2);   % normal vector w (Eq. (83))
bias=mean(ysv-w'*Xsv);                 % bias b (Eq. (87))
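If the interior point routine IPmethod is not available, the same dual QP can be solved with quadprog from Matlab's Optimization Toolbox; the following is a minimal sketch under that assumption, reusing getData from above.

% Alternative: solve the dual QP of Eqs. (90)-(91) with quadprog
[X y]=getData;                       % same training data as above
n=size(X,2);                         % number of samples
Q=(y'*y).*(X'*X);                    % quadratic matrix of the dual
c=-ones(n,1);                        % linear term
alpha=quadprog(Q,c,[],[],y,0,zeros(n,1),[]);  % min 0.5*a'Qa + c'a, s.t. y*a=0, a>=0
I=find(alpha>10^(-3));               % support vectors have nonzero alphas
w=X(:,I)*(alpha(I).*y(I)');          % normal vector from the support vectors
bias=mean(y(I)-w'*X(:,I));           % bias averaged over the support vectors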
Example: The training set contains two classes of 2-D points with Gaussian distributions generated based on the following mean vectors and covariance matrices:
(92)
(93)
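A sketch of how such a two-class Gaussian training set could be generated; the mean vectors and covariance matrices below are assumed for illustration only, not the ones specified in Eqs. (92) and (93).

% Generate a toy two-class Gaussian training set (assumed parameters)
N = 100;                                 % samples per class
m1 = [2; 2];   C1 = [1 0.5; 0.5 1];      % assumed mean and covariance of class C+
m2 = [-2; -2]; C2 = [1 -0.3; -0.3 1];    % assumed mean and covariance of class C-
X1 = chol(C1)'*randn(2,N) + repmat(m1,1,N);   % class C+ samples as columns
X2 = chol(C2)'*randn(2,N) + repmat(m2,1,N);   % class C- samples as columns
X = [X1 X2];                             % data array of training vectors
y = [ones(1,N) -ones(1,N)];              % corresponding labels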