When the two classes are not linearly separable, the condition for the
optimal hyperplane in Eq. (72) can be relaxed by including an extra error
term (slack variable) $\xi_n\ge 0$:

$\displaystyle y_n({\bf w}^T{\bf x}_n+b)\ge 1-\xi_n,\;\;\;\;\xi_n\ge 0,\;\;\;\;(n=1,\cdots,N)$    (115)

For a better classification result, this error needs to be minimized as well
as $\vert\vert{\bf w}\vert\vert^2/2$. Now the primal problem in Eq. (73) can
be modified to the following minimization problem with an objective function
and two sets of inequality (non-negative) constraints:
minimize: $\displaystyle \frac{1}{2}\vert\vert{\bf w}\vert\vert^2+C\sum_{n=1}^N\xi_n$

subject to: $\displaystyle y_n({\bf w}^T{\bf x}_n+b)-1+\xi_n\ge 0$ and $\displaystyle \xi_n\ge 0,\;\;\;\;(n=1,\cdots,N)$    (116)
The value of the parameter $C$ can be adjusted to make a proper trade-off
between underfitting, emphasizing the global distribution of the training
samples, and overfitting, emphasizing the local samples close to the decision
boundary. A large $C$ value emphasizes the minimization of the error term in
the objective function, and the resulting decision boundary is mostly
dictated by a small number of local support vectors in the region between the
two classes, possibly including some noisy outliers, thereby risking
overfitting. On the other hand, a small $C$ value de-emphasizes the error
term and the classifier can tolerate greater errors, thereby allowing more
data samples farther away from the decision boundary to become support
vectors and play a role in determining it. Consequently the classifier tends
to better reflect the overall distribution of the data samples, but with the
risk of underfitting.
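As an illustration of this trade-off, the following minimal sketch (assuming
NumPy and scikit-learn are available; the synthetic data and the specific $C$
values are arbitrary choices, not part of the original text) trains a linear
soft-margin SVM with several values of $C$ and reports how many support
vectors each choice produces:

```python
# Sketch: effect of the soft-margin parameter C (assumes numpy and scikit-learn).
# Small C tolerates more margin violations (many support vectors, boundary shaped
# by the overall distribution); large C penalizes violations heavily (few support
# vectors near the class overlap, risking overfitting).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two overlapping Gaussian classes, hence not linearly separable
X = np.vstack([rng.normal(-1.0, 1.5, (50, 2)), rng.normal(1.0, 1.5, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {clf.n_support_.sum():3d} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```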
This constrained minimization problem can be converted to the minimization of
the corresponding Lagrangian:

$\displaystyle L_p({\bf w},b,{\bf\xi},{\bf\alpha},{\bf\beta})
=\frac{1}{2}\vert\vert{\bf w}\vert\vert^2+C\sum_{n=1}^N\xi_n
-\sum_{n=1}^N\alpha_n[y_n({\bf w}^T{\bf x}_n+b)+\xi_n-1]
-\sum_{n=1}^N\beta_n\xi_n$    (117)

where $\alpha_n$ and $\beta_n$ are the Lagrange multipliers for the two sets
of constraints. As the constraints are non-negative (instead of non-positive
as in the previous section), a minus sign is used in front of the last two
terms, so that the Lagrange multipliers $\alpha_n$ and $\beta_n$ are still
required to be non-negative (consistent with the table given previously). The
KKT conditions of this minimization problem are listed below (for all
$n=1,\cdots,N$):
- Stationarity:

  $\displaystyle \frac{\partial L_p}{\partial{\bf w}}
  ={\bf w}-\sum_{n=1}^N\alpha_ny_n{\bf x}_n={\bf0}$    (118)

  $\displaystyle \frac{\partial L_p}{\partial b}
  =-\sum_{n=1}^N\alpha_ny_n=0$    (119)

  $\displaystyle \frac{\partial L_p}{\partial\xi_n}
  =C-\alpha_n-\beta_n=0$    (120)

- Primal feasibility:

  $\displaystyle y_n({\bf w}^T{\bf x}_n+b)+\xi_n-1\ge 0,\;\;\;\;\xi_n\ge 0$    (121)

- Complementarity:

  $\displaystyle \alpha_n[y_n({\bf w}^T{\bf x}_n+b)+\xi_n-1]=0,\;\;\;\;\beta_n\xi_n=0$    (122)

- Dual feasibility:

  $\displaystyle \alpha_n\ge 0,\;\;\;\;\beta_n\ge 0$    (123)
Depending on the value of $\alpha_n$, there exist three possible cases, due
to the complementarity (Eq. (122)):

- If $\alpha_n=0$, then $\beta_n=C-\alpha_n=C>0$ and therefore $\xi_n=0$;
  primal feasibility gives $y_n({\bf w}^T{\bf x}_n+b)-1\ge 0$, i.e.,
  ${\bf x}_n$ is correctly classified, on or outside the margin, and is not
  a support vector.
- If $0<\alpha_n<C$, then $\beta_n=C-\alpha_n>0$ and therefore $\xi_n=0$;
  complementarity gives $y_n({\bf w}^T{\bf x}_n+b)-1=0$, i.e., ${\bf x}_n$ is
  exactly on the margin and is a support vector.
- If $\alpha_n=C$, then $\beta_n=0$ and $\xi_n\ge 0$; complementarity gives
  $y_n({\bf w}^T{\bf x}_n+b)-1=-\xi_n\le 0$, i.e., ${\bf x}_n$ is inside the
  margin (or misclassified if $\xi_n>1$) and is also a support vector.

In all three cases the expression $y_n({\bf w}^T{\bf x}_n+b)-1$ can be
rewritten as

$\displaystyle y_n({\bf w}^T{\bf x}_n+b)-1=y_n({\bf w}^T{\bf x}_n+b)-y_n^2
=y_n[({\bf w}^T{\bf x}_n+b)-y_n]=y_n\,E_n$    (129)

where $y_n^2=1$ as $y_n=\pm 1$, and $E_n=({\bf w}^T{\bf x}_n+b)-y_n$ is the
error, or the difference between the actual and desired output. We summarize
all three cases above to get:

$\displaystyle \left\{\begin{array}{lll}
\alpha_n=0 &\Longrightarrow& y_nE_n\ge 0\\
0<\alpha_n<C &\Longrightarrow& y_nE_n=0\\
\alpha_n=C &\Longrightarrow& y_nE_n\le 0
\end{array}\right.$    (130)
These results derived from the KKT conditions in Eqs. (118) through (123) are
an alternative form of the KKT conditions, which will be used in the SMO
algorithm to check whether the KKT conditions are violated by any $\alpha_n$.
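To show how this check can be coded (a minimal NumPy sketch; the function
name, the tolerance `tol`, and the array layout are illustrative assumptions,
not part of the original text), each $\alpha_n$ can be tested against the
three conditions of Eq. (130):

```python
import numpy as np

def kkt_violations(alpha, y, E, C, tol=1e-3):
    """Flag samples whose alpha_n violates the alternative KKT conditions (Eq. 130).

    alpha : (N,) Lagrange multipliers
    y     : (N,) class labels, +1 or -1
    E     : (N,) errors E_n = (w^T x_n + b) - y_n
    C     : soft-margin penalty parameter
    """
    yE = y * E
    violated = np.zeros(alpha.shape, dtype=bool)
    violated |= (alpha < tol) & (yE < -tol)                          # alpha_n = 0     requires y_n E_n >= 0
    violated |= (alpha > C - tol) & (yE > tol)                       # alpha_n = C     requires y_n E_n <= 0
    violated |= (alpha > tol) & (alpha < C - tol) & (abs(yE) > tol)  # 0 < alpha_n < C requires y_n E_n = 0
    return violated
```

In SMO, samples flagged by such a test are the candidates whose multipliers
are optimized next.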
The dual of the primal problem in Eq. (117) can be obtained by substituting
the first set of equations (stationarity) into the primal Lagrangian:

maximize: $\displaystyle L_d({\bf\alpha})=\sum_{n=1}^N\alpha_n
-\frac{1}{2}\sum_{m=1}^N\sum_{n=1}^N\alpha_m\alpha_ny_my_n\,{\bf x}_m^T{\bf x}_n$

subject to: $\displaystyle \sum_{n=1}^N\alpha_ny_n=0,\;\;\;\;
0\le\alpha_n\le C,\;\;\;\;(n=1,\cdots,N)$    (131)
Note that due to the condition $C-\alpha_n-\beta_n=0$, the slack variables
$\xi_n$ and the Lagrange multipliers $\beta_n$ no longer appear in the
Lagrangian of the dual problem, which happens to be identical to that of the
linearly separable problem discussed in the previous section. Also, since
$\beta_n=C-\alpha_n$ and $\beta_n\ge 0$, we have $0\le\alpha_n\le C$. Having
solved this QP problem for ${\bf\alpha}=[\alpha_1,\cdots,\alpha_N]^T$, we get
the optimal decision plane in terms of its normal vector

$\displaystyle {\bf w}=\sum_{n=1}^N\alpha_ny_n{\bf x}_n$    (132)
based on Eq. (86), and the bias term $b$ based on Eq. (88):

$\displaystyle b=y_n-{\bf w}^T{\bf x}_n
=y_n-\sum_{m=1}^N\alpha_my_m\,{\bf x}_m^T{\bf x}_n$    (133)

where the inner product ${\bf x}_m^T{\bf x}_n$ is replaced by the kernel
function $K({\bf x}_m,{\bf x}_n)$ if the mapping is nonlinear.
Computationally, we take the average of such $b$ values over all support
vectors as the offset term.
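Putting the pieces together, here is a minimal NumPy/SciPy sketch (the solver
choice, function names, and thresholds are illustrative assumptions; a
dedicated algorithm such as SMO would normally be used) that solves the dual
QP of Eq. (131) for a linear kernel and then recovers ${\bf w}$ and the
averaged bias $b$ via Eqs. (132) and (133):

```python
import numpy as np
from scipy.optimize import minimize

def train_soft_margin_svm(X, y, C=1.0):
    """Solve the dual QP of Eq. (131) with a linear kernel; return w, b, alpha."""
    N = len(y)
    K = X @ X.T                                  # Gram matrix of inner products x_m^T x_n
    Q = (y[:, None] * y[None, :]) * K

    # dual objective, negated because scipy minimizes: -L_d(alpha)
    def neg_dual(alpha):
        return 0.5 * alpha @ Q @ alpha - alpha.sum()

    def neg_dual_grad(alpha):
        return Q @ alpha - np.ones(N)

    res = minimize(neg_dual, np.zeros(N), jac=neg_dual_grad, method="SLSQP",
                   bounds=[(0.0, C)] * N,                                # 0 <= alpha_n <= C
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_n alpha_n y_n = 0
    alpha = res.x

    # Eq. (132): normal vector of the decision plane
    w = (alpha * y) @ X
    # Eq. (133): bias, averaged over on-margin support vectors (0 < alpha_n < C),
    # where xi_n = 0 so the relation b = y_n - w^T x_n holds exactly
    # (assumes at least one such support vector exists)
    sv = (alpha > 1e-6) & (alpha < C - 1e-6)
    b = np.mean(y[sv] - X[sv] @ w)
    return w, b, alpha
```

The sketch averages $b$ only over the on-margin support vectors
($0<\alpha_n<C$), since for those $\xi_n=0$ and Eq. (133) holds exactly.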
Given ${\bf w}$ and $b$, any unlabeled point ${\bf x}$ can be classified into
either of the two classes depending on whether it is on the positive or
negative side of the decision plane:

$\displaystyle \hat{y}=\mbox{sign}({\bf w}^T{\bf x}+b)
=\mbox{sign}\left(\sum_{n=1}^N\alpha_ny_n\,{\bf x}_n^T{\bf x}+b\right)$    (134)

where the inner product is replaced by a kernel function if it is nonlinear.
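Continuing the sketch above (again illustrative; `kernel_fn` is an assumed
helper, not something defined in the text), classification of unlabeled
points according to Eq. (134) can be written so that a kernel function simply
replaces the inner product:

```python
import numpy as np

def svm_predict(X_train, y_train, alpha, b, X_new, kernel_fn=None):
    """Classify rows of X_new by the sign of the decision function in Eq. (134).

    If kernel_fn is None the plain inner product (linear kernel) is used;
    otherwise kernel_fn(X_train, X_new) must return the (N_train, N_new)
    matrix of kernel values K(x_n, x).
    """
    K = X_train @ X_new.T if kernel_fn is None else kernel_fn(X_train, X_new)
    decision = (alpha * y_train) @ K + b      # sum_n alpha_n y_n K(x_n, x) + b
    return np.sign(decision)
```

For example, with the `alpha` and `b` returned by the training sketch above,
`svm_predict(X, y, alpha, b, X_test)` assigns a label of $\pm 1$ to each row
of `X_test`.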