When the two classes are not linearly separable, the condition for the optimal hyperplane in Eq. (72) can be relaxed by including an extra error term $\xi_i \ge 0$ (a slack variable):

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i,\qquad (i = 1,\dots,N) \tag{115}$$
For a better classification result, this error needs to be minimized together with $\|\mathbf{w}\|^2/2$. Now the primal problem in Eq. (73) can be modified to the following minimization problem with an objective function and two sets of inequality (non-negativity) constraints:

$$\begin{aligned}
\text{minimize:}\quad & \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^N \xi_i\\
\text{subject to:}\quad & y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i
\quad\text{and}\quad \xi_i \ge 0,\qquad (i = 1,\dots,N)
\end{aligned} \tag{116}$$
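As a rough sketch (my own illustration, not code from the notes), the primal problem in Eq. (116) can be handed directly to a generic convex solver; here the hypothetical helper `primal_soft_margin_svm` uses the cvxpy library, and the variable names simply mirror the symbols above.

```python
import numpy as np
import cvxpy as cp

def primal_soft_margin_svm(X, y, C=1.0):
    """Solve Eq. (116) directly: X is (N, d), y is (N,) with labels +/-1."""
    N, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    xi = cp.Variable(N, nonneg=True)          # slack variables xi_i >= 0
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    margins = cp.multiply(y, X @ w + b)       # y_i (w^T x_i + b), elementwise
    problem = cp.Problem(objective, [margins >= 1 - xi])
    problem.solve()
    return w.value, b.value, xi.value
```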
The value of the parameter $C$ can be adjusted to make a proper trade-off between underfitting, which emphasizes the global distribution of the training samples, and overfitting, which emphasizes the local samples close to the decision boundary. A large $C$ emphasizes the minimization of the error term in the objective function, and the resulting decision plane is mostly dictated by a small number of local support vectors in the region between the two classes, possibly including some noisy outliers, thereby risking overfitting. On the other hand, a small $C$ de-emphasizes the error term so that the classifier can tolerate greater errors, thereby allowing more data samples farther away from the decision boundary to become support vectors and play a role in determining it. Consequently the classification tends to better reflect the overall distribution of the data samples, but with the risk of underfitting.
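As a quick illustration of this trade-off (not part of the original derivation), the following sketch trains scikit-learn's `SVC` on synthetic overlapping data; the data and the two $C$ values are assumptions chosen only to show how the number of support vectors changes.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian clouds, so the classes are not linearly separable.
X = np.vstack([rng.normal(-1.0, 1.2, size=(50, 2)),
               rng.normal(+1.0, 1.2, size=(50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # A small C tolerates larger slack errors and keeps many support vectors;
    # a large C penalizes errors heavily and relies on fewer, more local ones.
    print(f"C = {C:>6}: {clf.support_vectors_.shape[0]} support vectors")
```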
This constrained minimization problem can be converted to the minimization of the corresponding Lagrangian:

$$L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\mu})
= \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^N \xi_i
- \sum_{i=1}^N \alpha_i\left[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\right]
- \sum_{i=1}^N \mu_i \xi_i \tag{117}$$

where $\alpha_i$ and $\mu_i$ are the Lagrange multipliers for the two sets of constraints. As the constraints are non-negative (instead of non-positive as in the previous section), a minus sign is used in front of the last two terms, so that the Lagrange multipliers $\alpha_i$ and $\mu_i$ are still required to be non-negative (consistent with Table ). The KKT conditions (Eq. ()) of this minimization problem are listed below (for all $i = 1,\dots,N$):
- Stationarity:
$$\frac{\partial L}{\partial\mathbf{w}} = \mathbf{w} - \sum_{i=1}^N \alpha_i y_i \mathbf{x}_i = \mathbf{0}
\;\Longrightarrow\; \mathbf{w} = \sum_{i=1}^N \alpha_i y_i \mathbf{x}_i \tag{118}$$
$$\frac{\partial L}{\partial b} = -\sum_{i=1}^N \alpha_i y_i = 0
\;\Longrightarrow\; \sum_{i=1}^N \alpha_i y_i = 0 \tag{119}$$
$$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \mu_i = 0
\;\Longrightarrow\; \alpha_i + \mu_i = C \tag{120}$$
- Primal feasibility:
$$y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i \ge 0,\qquad \xi_i \ge 0 \tag{121}$$
- Complementarity:
$$\alpha_i\left[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\right] = 0,\qquad \mu_i \xi_i = 0 \tag{122}$$
- Dual feasibility:
$$\alpha_i \ge 0,\qquad \mu_i \ge 0 \tag{123}$$
Depending on the value of $\alpha_i$, there exist three possible cases, due to the complementarity (Eq. (122)):

- If $\alpha_i = 0$, then $\mu_i = C - \alpha_i = C > 0$, so $\xi_i = 0$ and $y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \ge 0$; the sample $\mathbf{x}_i$ is correctly classified outside the margin and is not a support vector.
- If $0 < \alpha_i < C$, then $\mu_i = C - \alpha_i > 0$, so $\xi_i = 0$ and, by complementarity, $y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 = 0$; the sample $\mathbf{x}_i$ is a support vector on the margin.
- If $\alpha_i = C$, then $\mu_i = 0$, so $\xi_i \ge 0$ and $y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 = -\xi_i \le 0$; the sample $\mathbf{x}_i$ is a support vector inside the margin, or misclassified if $\xi_i > 1$.

In all three cases the expression $y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1$ can be rewritten as

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 = y_i f(\mathbf{x}_i) - y_i^2
= y_i\left[f(\mathbf{x}_i) - y_i\right] = y_i E_i \tag{129}$$

where $f(\mathbf{x}_i) = \mathbf{w}^T\mathbf{x}_i + b$, $y_i^2 = 1$ (as $y_i \in \{-1, +1\}$), and $E_i = f(\mathbf{x}_i) - y_i$ is the error, or the difference between the actual and desired output. We summarize all three cases above to get:

$$\begin{cases}
\alpha_i = 0 & \Longrightarrow\quad y_i E_i \ge 0\\
0 < \alpha_i < C & \Longrightarrow\quad y_i E_i = 0\\
\alpha_i = C & \Longrightarrow\quad y_i E_i \le 0
\end{cases} \tag{130}$$
These results, derived from the KKT conditions in Eqs. (118) through (123), are an alternative form of the KKT conditions, which will be used in the SMO algorithm to check whether the KKT conditions are violated by any $\alpha_i$.
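For illustration, a minimal sketch of such a check might look like the following; the function name `violates_kkt` and the tolerance `tol` are hypothetical, and only Eq. (130) itself comes from the notes.

```python
def violates_kkt(alpha_i, y_i, E_i, C, tol=1e-3):
    """Check alpha_i against Eq. (130), where E_i = f(x_i) - y_i."""
    r = y_i * E_i
    # Eq. (130): alpha_i = 0      requires y_i * E_i >= 0
    #            0 < alpha_i < C  requires y_i * E_i  = 0
    #            alpha_i = C      requires y_i * E_i <= 0
    return (r < -tol and alpha_i < C) or (r > tol and alpha_i > 0)
```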
The dual of the primal problem in Eq. (117) can be
obtained by substituting the first set of equations (stationarity)
into the primal Lagrangian:
$$\begin{aligned}
\text{maximize:}\quad & L_D(\boldsymbol{\alpha}) = \sum_{i=1}^N \alpha_i
- \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^T\mathbf{x}_j\\
\text{subject to:}\quad & 0 \le \alpha_i \le C,\qquad \sum_{i=1}^N \alpha_i y_i = 0
\end{aligned} \tag{131}$$
Note that due to the condition $\alpha_i + \mu_i = C$, the Lagrange multipliers $\mu_i$ no longer appear in the Lagrangian of the dual problem, which happens to be identical to that of the linearly separable problem discussed in the previous section; in particular, the slack terms cancel, as
$$C\sum_{i=1}^N \xi_i - \sum_{i=1}^N \alpha_i \xi_i - \sum_{i=1}^N \mu_i \xi_i
= \sum_{i=1}^N (C - \alpha_i - \mu_i)\,\xi_i = 0.$$
Also, since $\alpha_i \ge 0$ and $\mu_i = C - \alpha_i \ge 0$, we have $0 \le \alpha_i \le C$. Having solved this QP problem for $\alpha_i\;(i = 1,\dots,N)$, we get the optimal decision plane in terms of the normal vector of the decision plane
$$\mathbf{w} = \sum_{i=1}^N \alpha_i y_i \mathbf{x}_i \tag{132}$$
based on Eq. (86), and the bias term $b$ based on Eq. (88):

$$b = y_i - \mathbf{w}^T\mathbf{x}_i = y_i - \sum_{j=1}^N \alpha_j y_j\, \mathbf{x}_j^T\mathbf{x}_i,
\qquad \text{for any support vector } \mathbf{x}_i \text{ with } 0 < \alpha_i < C \tag{133}$$
where the inner product $\mathbf{x}_j^T\mathbf{x}_i$ is replaced by the kernel function $K(\mathbf{x}_j, \mathbf{x}_i)$ if the mapping is nonlinear. Computationally, we take the average of such $b$ values over all such support vectors as the offset term.
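As a sketch of these steps (assuming a linear kernel and a generic solver rather than a dedicated QP or SMO routine), the code below solves the dual problem in Eq. (131) with SciPy's SLSQP, recovers $\mathbf{w}$ from Eq. (132), and averages Eq. (133) over the support vectors with $0 < \alpha_i < C$; the function name and tolerances are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def train_dual_soft_margin_svm(X, y, C=1.0, tol=1e-6):
    """Solve Eq. (131) for alpha, then get w (Eq. 132) and b (Eq. 133)."""
    N = X.shape[0]
    Yx = y[:, None] * X                       # rows are y_i * x_i
    Q = Yx @ Yx.T                             # Q_ij = y_i y_j x_i . x_j

    # Minimize the negative of the dual objective L_D(alpha)
    neg_dual = lambda a: 0.5 * a @ Q @ a - a.sum()
    res = minimize(neg_dual, np.zeros(N), method="SLSQP",
                   bounds=[(0.0, C)] * N,
                   constraints={"type": "eq", "fun": lambda a: a @ y})
    alpha = res.x

    w = Yx.T @ alpha                               # Eq. (132): sum of alpha_i y_i x_i
    on_margin = (alpha > tol) & (alpha < C - tol)  # support vectors with 0 < alpha_i < C
    b = np.mean(y[on_margin] - X[on_margin] @ w)   # Eq. (133), averaged
    return w, b, alpha
```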
Given $\mathbf{w}$ and $b$, any unlabeled point $\mathbf{x}$ can be classified into either of the two classes depending on whether it is on the positive or negative side of the decision plane:

$$f(\mathbf{x}) = \operatorname{sign}\left(\mathbf{w}^T\mathbf{x} + b\right)
= \operatorname{sign}\left(\sum_{i=1}^N \alpha_i y_i\, \mathbf{x}_i^T\mathbf{x} + b\right) \tag{134}$$

where the inner product $\mathbf{x}_i^T\mathbf{x}$ is replaced by a kernel function $K(\mathbf{x}_i, \mathbf{x})$ if the mapping is nonlinear.
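A hypothetical snippet of this decision rule in kernel form is given below; the Gaussian (RBF) kernel stands in for the inner product, and the function names and kernel choice are assumptions for illustration only.

```python
import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    """Gaussian (RBF) kernel standing in for the inner product u^T v."""
    return np.exp(-gamma * np.sum((u - v) ** 2))

def classify(x, X_sv, y_sv, alpha_sv, b, kernel=rbf_kernel):
    """Eq. (134): f(x) = sign( sum_i alpha_i y_i k(x_i, x) + b )."""
    f = sum(a * yi * kernel(xi, x)
            for a, yi, xi in zip(alpha_sv, y_sv, X_sv)) + b
    return 1 if f >= 0 else -1
```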