The SVM algorithm above converges only if the data points of the two classes in the training set are linearly separable. If this is not the case, we can consider the following kernel method.
In the kernel method, each data point, a vector $\mathbf{x}$ in the original $d$-dimensional feature space, is mapped by a function $\phi$ into a higher, possibly infinite, dimensional space:

$$\phi:\ \mathbf{x}\in\mathbb{R}^d \;\longmapsto\; \phi(\mathbf{x})\in\mathbb{R}^D, \qquad D > d \tag{94}$$
The motivation for such a kernel mapping is that the relevant operations such as classification and clustering may be carried out much more effectively once the dataset is mapped to the higher dimensional space. For example, classes not linearly separable in the original d-dimensional feature space may be trivially separable in a higher dimensional space, as illustrated by the following examples.
Example 1: In 1-D space, the two classes $C_1=\{x \,|\, x<a \ \text{or}\ x>b\}$ and $C_2=\{x \,|\, a\le x\le b\}$ are not linearly separable. However, by the following mapping from 1-D space to 2-D space:

$$\phi:\ x \;\longmapsto\; [\,x,\ (x-a)(x-b)\,]^T \tag{95}$$

the second component $(x-a)(x-b)$ is positive for all points in $C_1$ and negative for all points in $C_2$, so the two classes become separable by the line $x_2=0$.
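As a quick numeric sketch of Example 1 (the boundary values $a$, $b$ and the sample points below are arbitrary illustrative choices, not taken from the text):

```python
import numpy as np

# Illustrative interval boundaries a < b for Example 1.
a, b = -1.0, 1.0

def phi(x):
    """Map a 1-D point to 2-D: phi(x) = [x, (x-a)(x-b)] (Eq. 95)."""
    return np.array([x, (x - a) * (x - b)])

# Points of C1 (outside [a, b]) and C2 (inside [a, b]).
c1 = [-2.0, -1.5, 1.5, 2.0]
c2 = [-0.5, 0.0, 0.5]

# After mapping, the second coordinate is positive for C1 and negative
# for C2, so the line x2 = 0 separates the classes in 2-D.
print(all(phi(x)[1] > 0 for x in c1))  # True
print(all(phi(x)[1] < 0 for x in c2))  # True
```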
Example 2: The method above can be generalized to higher dimensional spaces, such as a mapping from 2-D to 3-D space. Consider two classes in a 2-D space that are not linearly separable: $C_1=\{\mathbf{x} \,|\, x_1^2+x_2^2<r^2\}$ inside a circle of radius $r$, and $C_2=\{\mathbf{x} \,|\, x_1^2+x_2^2>r^2\}$ outside it. However, by the following mapping from 2-D space to 3-D space:

$$\phi:\ [x_1,\ x_2]^T \;\longmapsto\; [\,x_1,\ x_2,\ x_1^2+x_2^2\,]^T \tag{96}$$

the third component is the squared distance to the origin, which is less than $r^2$ for $C_1$ and greater than $r^2$ for $C_2$, so the two classes become separable by the plane $x_3=r^2$.
Example 3: In 2-D space, consider the exclusive-OR dataset, in which class $C_1$ contains points in quadrants I and III (where $x_1x_2>0$) and class $C_2$ contains points in quadrants II and IV (where $x_1x_2<0$); the two classes are not linearly separable. However, by mapping the data points to a 3-D space:

$$\phi:\ [x_1,\ x_2]^T \;\longmapsto\; [\,x_1,\ x_2,\ x_1x_2\,]^T \tag{97}$$

the third component is positive for $C_1$ and negative for $C_2$, so the two classes become separable by the plane $x_3=0$.
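The XOR example can be verified in a few lines; a minimal sketch (the sample points are arbitrary illustrative choices):

```python
import numpy as np

def phi(x):
    """Map a 2-D point to 3-D: phi(x1, x2) = [x1, x2, x1*x2] (Eq. 97)."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x2])

# XOR data: C1 in quadrants I and III, C2 in quadrants II and IV.
C1 = [(1.0, 1.0), (-1.0, -1.0), (2.0, 0.5), (-0.5, -2.0)]
C2 = [(-1.0, 1.0), (1.0, -1.0), (-2.0, 0.5), (0.5, -2.0)]

# In 3-D the third coordinate x1*x2 is positive for C1 and negative
# for C2, so the plane x3 = 0 linearly separates the two classes.
print(all(phi(x)[2] > 0 for x in C1))  # True
print(all(phi(x)[2] < 0 for x in C2))  # True
```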
Definition: A kernel is a function $K$ that takes two vectors $\mathbf{x}$ and $\mathbf{y}$ as arguments and returns the inner product of their images $\phi(\mathbf{x})$ and $\phi(\mathbf{y})$:

$$K(\mathbf{x},\mathbf{y}) = \phi(\mathbf{x})^T\phi(\mathbf{y}) \tag{98}$$
The kernel function takes as input two vectors $\mathbf{x}$ and $\mathbf{y}$ in the original feature space, and returns a scalar value equal to the inner product of $\phi(\mathbf{x})$ and $\phi(\mathbf{y})$ in some higher dimensional space. If the data points in the original space appear only in the form of inner products, then the mapping $\phi$, called the kernel-induced implicit mapping, never needs to be explicitly specified, and the dimension of the new space does not even need to be known.
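To make this concrete, the sketch below computes a Gram matrix of pairwise inner products in the implicit space directly from the kernel, without ever constructing $\phi$. It uses the RBF kernel defined later in Eq. (108), whose image is infinite-dimensional; the dataset and the parameter `sigma` are arbitrary choices:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)); see Eq. (108) below.
    Its implicit image phi is infinite-dimensional, yet evaluating K
    costs only O(d) work in the original space."""
    d = x - y
    return np.exp(-(d @ d) / (2.0 * sigma**2))

# A small illustrative dataset in the original 2-D feature space.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])

# Gram matrix K[i, j] = phi(x_i)^T phi(x_j), computed without ever
# forming phi(x_i) explicitly -- this is the kernel trick.
n = len(X)
K = np.array([[rbf_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
print(K)  # symmetric, with ones on the diagonal
```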
The following is a set of commonly used kernel functions $K(\mathbf{x},\mathbf{y})$, each of which can be represented as an inner product of two vectors $\phi(\mathbf{x})$ and $\phi(\mathbf{y})$ in a higher dimensional space. Assume $\mathbf{x}=[x_1,\dots,x_d]^T$ and $\mathbf{y}=[y_1,\dots,y_d]^T$, so that

$$\mathbf{x}^T\mathbf{y} = \sum_{i=1}^d x_i y_i \tag{99}$$
The binomial theorem states:

$$(a+b)^n = \sum_{k=0}^n \binom{n}{k} a^k b^{n-k}, \qquad \binom{n}{k}=\frac{n!}{k!\,(n-k)!} \tag{100}$$

Applying it to the inner product of two 2-D vectors ($d=2$), we get

$$(\mathbf{x}^T\mathbf{y})^n = (x_1y_1+x_2y_2)^n = \sum_{k=0}^n \binom{n}{k}(x_1y_1)^k (x_2y_2)^{n-k} \tag{101}$$

$$= \sum_{k=0}^n \left[\sqrt{\binom{n}{k}}\,x_1^k x_2^{n-k}\right]\left[\sqrt{\binom{n}{k}}\,y_1^k y_2^{n-k}\right] \tag{102}$$

$$= \phi(\mathbf{x})^T\phi(\mathbf{y}), \qquad \text{where}\quad \phi(\mathbf{x}) = \left[\sqrt{\binom{n}{k}}\,x_1^k x_2^{n-k},\ \ k=0,\dots,n\right]^T \tag{103}$$
Now consider the homogeneous polynomial kernel for $d$-dimensional vectors defined as

$$K(\mathbf{x},\mathbf{y}) = (\mathbf{x}^T\mathbf{y})^n = \left(\sum_{i=1}^d x_iy_i\right)^n \tag{104}$$

By the multinomial generalization of the expansion above, this is again an inner product of images:

$$(\mathbf{x}^T\mathbf{y})^n = \sum_{k_1+\cdots+k_d=n} \binom{n}{k_1,\dots,k_d}\prod_{i=1}^d (x_iy_i)^{k_i} = \phi(\mathbf{x})^T\phi(\mathbf{y}) \tag{105}$$

where the components of $\phi(\mathbf{x})$ are $\sqrt{\binom{n}{k_1,\dots,k_d}}\,\prod_{i=1}^d x_i^{k_i}$ over all $k_1+\cdots+k_d=n$.
In particular, when $n=2$ and $d=2$, the polynomial kernel defined over 2-D vectors is:

$$K(\mathbf{x},\mathbf{y}) = (x_1y_1+x_2y_2)^2 = x_1^2y_1^2 + 2x_1x_2y_1y_2 + x_2^2y_2^2 = \phi(\mathbf{x})^T\phi(\mathbf{y}) \tag{106}$$

where $\phi(\mathbf{x}) = [\,x_1^2,\ \sqrt{2}\,x_1x_2,\ x_2^2\,]^T$.
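A quick numeric check of Eq. (106), confirming that the kernel value equals the inner product of the explicit 3-D images (a minimal sketch; the random test vectors are arbitrary):

```python
import numpy as np

def K(x, y):
    """Homogeneous polynomial kernel with n = 2, d = 2 (Eq. 106)."""
    return (x @ y) ** 2

def phi(x):
    """Explicit 3-D image: phi(x) = [x1^2, sqrt(2)*x1*x2, x2^2]."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

rng = np.random.default_rng(0)
x, y = rng.standard_normal(2), rng.standard_normal(2)

# The kernel value and the explicit inner product agree up to rounding.
print(np.isclose(K(x, y), phi(x) @ phi(y)))  # True
```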
A non-homogeneous polynomial kernel is defined as

$$K(\mathbf{x},\mathbf{y}) = (\mathbf{x}^T\mathbf{y} + c)^n, \qquad c > 0 \tag{107}$$
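For instance, taking $n=2$, $c=1$, and $d=2$ (an illustrative special case chosen by analogy with Eq. (106), not taken from the text), expanding the kernel shows that the non-homogeneous form simply augments the image of the homogeneous kernel with constant and linear terms:

$$(1+\mathbf{x}^T\mathbf{y})^2 = 1 + 2x_1y_1 + 2x_2y_2 + x_1^2y_1^2 + 2x_1x_2y_1y_2 + x_2^2y_2^2 = \phi(\mathbf{x})^T\phi(\mathbf{y})$$

where $\phi(\mathbf{x}) = [\,1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ \sqrt{2}\,x_1x_2,\ x_2^2\,]^T$.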
The RBF (radial basis function) kernel is defined as

$$K(\mathbf{x},\mathbf{y}) = \exp\left(-\frac{\|\mathbf{x}-\mathbf{y}\|^2}{2\sigma^2}\right) \tag{108}$$

It can be factored as

$$K(\mathbf{x},\mathbf{y}) = \exp\left(-\frac{\|\mathbf{x}\|^2}{2\sigma^2}\right)\exp\left(-\frac{\|\mathbf{y}\|^2}{2\sigma^2}\right)\exp\left(\frac{\mathbf{x}^T\mathbf{y}}{\sigma^2}\right) \tag{109}$$

where the last factor can be expanded into a Taylor series:

$$\exp\left(\frac{\mathbf{x}^T\mathbf{y}}{\sigma^2}\right) = \sum_{k=0}^\infty \frac{1}{k!}\left(\frac{\mathbf{x}^T\mathbf{y}}{\sigma^2}\right)^k \tag{110}$$

In the 1-D case this gives $K(x,y)=\phi(x)^T\phi(y)$ with the infinite-dimensional mapping

$$\phi(x) = e^{-x^2/2\sigma^2}\left[\,1,\ \frac{x}{\sigma\sqrt{1!}},\ \frac{x^2}{\sigma^2\sqrt{2!}},\ \frac{x^3}{\sigma^3\sqrt{3!}},\ \cdots\,\right]^T \tag{111}$$
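As a numeric sanity check of the 1-D expansion in Eq. (111), the sketch below compares truncated inner products $\phi(x)^T\phi(y)$ against the closed-form kernel; the truncation lengths, test points, and $\sigma$ are arbitrary choices:

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """1-D RBF kernel: K(x, y) = exp(-(x - y)^2 / (2 sigma^2)) (Eq. 108)."""
    return np.exp(-((x - y) ** 2) / (2 * sigma**2))

def phi(x, m, sigma=1.0):
    """First m coordinates of the infinite-dimensional image of Eq. (111):
    phi_k(x) = exp(-x^2/(2 sigma^2)) * x^k / (sigma^k * sqrt(k!))."""
    k = np.arange(m)
    # log(k!) for k = 0 .. m-1, computed stably via cumulative sums.
    log_fact = np.concatenate(([0.0], np.cumsum(np.log(np.arange(1, m)))))
    return np.exp(-x**2 / (2 * sigma**2)) * x**k / (sigma**k * np.exp(0.5 * log_fact))

x, y = 0.7, -0.3
# The truncated inner product converges to the kernel value as m grows.
for m in (2, 5, 10, 20):
    print(m, phi(x, m) @ phi(y, m), rbf(x, y))
```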
The method of kernel mapping can be applied to the SVM algorithm considered previously, as all data points appearing in the algorithm do so in the form of inner products. Specifically, during the training process, we replace the inner product $\mathbf{x}_i^T\mathbf{x}_j$ in both Eq. (85) and Eq. (88) by the kernel function $K(\mathbf{x}_i,\mathbf{x}_j)$ to get