The SVM algorithm above converges only if the data points of the two classes in the training set are linearly separable. If this is not the case, we can consider the following kernel method.
In the kernel method, each data point, a vector $\mathbf{x}=[x_1,\cdots,x_d]^T$ in the original $d$-dimensional feature space, is mapped by some function $\phi$ into a higher, possibly infinite, dimensional space:

$$\phi:\; \mathbf{x} \in \mathbb{R}^d \;\longmapsto\; \phi(\mathbf{x}) \in \mathbb{R}^D, \qquad D > d \tag{94}$$
The motivation for such a kernel mapping is that the relevant operations such as classification and clustering may be carried out much more effectively once the dataset is mapped to the higher dimensional space. For example, classes not linearly separable in the original d-dimensional feature space may be trivially separable in a higher dimensional space, as illustrated by the following examples.
Example 1:
In 1-D space, the two classes $C_1=\{x:\,-1<x<1\}$ and $C_2=\{x:\,x<-1 \text{ or } x>1\}$ are not linearly separable, as no single threshold on $x$ can divide them. However, under the following mapping from 1-D space to 2-D space,

$$\phi(x) = [x,\; x^2]^T \tag{95}$$

the two classes become separable by the straight line $x_2=1$, since $x_2=x^2<1$ for all points of $C_1$ and $x_2=x^2>1$ for all points of $C_2$.
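As a quick numeric illustration, the sketch below (a minimal example with a few hypothetical sample points, assuming the class definitions above) applies the mapping of Eq. (95) and checks the separation directly:

```python
import numpy as np

# Hypothetical sample points: C1 = {-1 < x < 1}, C2 = {x < -1 or x > 1}.
c1 = np.array([-0.8, -0.3, 0.2, 0.7])
c2 = np.array([-2.5, -1.4, 1.3, 2.1])

def phi(x):
    """Map 1-D points x to 2-D points (x, x^2), as in Eq. (95)."""
    return np.stack([x, x**2], axis=-1)

# In the mapped 2-D space, the line x2 = 1 separates the two classes.
print(np.all(phi(c1)[:, 1] < 1))  # True
print(np.all(phi(c2)[:, 1] > 1))  # True
```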
Example 2:
The method above can be generalized to mappings into higher dimensional spaces, such as from 2-D to 3-D space. Consider two classes in 2-D space that are not linearly separable: $C_1=\{\mathbf{x}:\,x_1^2+x_2^2<1\}$ inside the unit circle and $C_2=\{\mathbf{x}:\,x_1^2+x_2^2>1\}$ outside it. However, under the following mapping from 2-D space to 3-D space,

$$\phi(\mathbf{x}) = [x_1,\; x_2,\; x_1^2+x_2^2]^T \tag{96}$$

the two classes become separable by the plane $x_3=1$, since $x_3<1$ for all points of $C_1$ and $x_3>1$ for all points of $C_2$.
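The same kind of check works here (a sketch with hypothetical random samples inside and outside the unit circle):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical samples: C1 inside the unit circle, C2 outside it.
r1, r2 = rng.uniform(0.0, 0.9, 100), rng.uniform(1.1, 2.0, 100)
t1, t2 = rng.uniform(0, 2*np.pi, 100), rng.uniform(0, 2*np.pi, 100)
c1 = np.column_stack([r1*np.cos(t1), r1*np.sin(t1)])
c2 = np.column_stack([r2*np.cos(t2), r2*np.sin(t2)])

def phi(x):
    """Map 2-D points to 3-D as in Eq. (96): (x1, x2) -> (x1, x2, x1^2 + x2^2)."""
    return np.column_stack([x[:, 0], x[:, 1], x[:, 0]**2 + x[:, 1]**2])

# In the mapped 3-D space, the plane x3 = 1 separates the two classes.
print(np.all(phi(c1)[:, 2] < 1) and np.all(phi(c2)[:, 2] > 1))  # True
```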
Example 3:
In 2-D space, the two classes of the exclusive-OR data set, $C_1$ containing points in quadrants I and III and $C_2$ containing points in quadrants II and IV, are not linearly separable. However, by mapping the data points to a 3-D space,

$$\phi(\mathbf{x}) = [x_1,\; x_2,\; x_1 x_2]^T \tag{97}$$

the two classes become separable by the plane $x_3=0$, since $x_3=x_1x_2>0$ in quadrants I and III but $x_3=x_1x_2<0$ in quadrants II and IV.
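Again, a minimal sketch with hypothetical XOR-style samples confirms the separation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical XOR-style data: C1 in quadrants I and III, C2 in quadrants II and IV.
x = rng.uniform(-1, 1, size=(200, 2))
x = x[np.abs(x[:, 0] * x[:, 1]) > 1e-3]           # drop points too close to the axes
labels = np.where(x[:, 0] * x[:, 1] > 0, 1, -1)   # +1 for C1 (I, III), -1 for C2 (II, IV)

def phi(x):
    """Map 2-D points to 3-D as in Eq. (97): (x1, x2) -> (x1, x2, x1*x2)."""
    return np.column_stack([x[:, 0], x[:, 1], x[:, 0] * x[:, 1]])

# In the mapped 3-D space, the plane x3 = 0 separates the two classes.
print(np.all(np.sign(phi(x)[:, 2]) == labels))  # True
```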
Definition: A kernel is a function $K$ that takes two vectors $\mathbf{x}$ and $\mathbf{y}$ as arguments and returns the inner product of their images $\phi(\mathbf{x})$ and $\phi(\mathbf{y})$:

$$K(\mathbf{x},\mathbf{y}) = \phi(\mathbf{x})^T\phi(\mathbf{y}) = \langle \phi(\mathbf{x}),\, \phi(\mathbf{y}) \rangle \tag{98}$$
The kernel function $K$ takes as input two vectors $\mathbf{x}$ and $\mathbf{y}$ in the original feature space, and returns a scalar value, the inner product of $\phi(\mathbf{x})$ and $\phi(\mathbf{y})$ in some higher dimensional space. If the data points in the original space appear only in the form of inner products, then the mapping $\phi$, called the kernel-induced implicit mapping, never needs to be explicitly specified, and the dimension of the new space does not even need to be known.
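For instance, any algorithm that touches the data only through inner products can be run from a Gram matrix of pairwise kernel values, computed entirely in the original space. The sketch below (a minimal illustration, using the homogeneous polynomial kernel defined in Eq. (104) below) never forms $\phi(\mathbf{x})$:

```python
import numpy as np

def gram_matrix(X, kernel):
    """Gram matrix K[i, j] = kernel(x_i, x_j), computed in the original space;
    the implicit mapping phi is never constructed."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

# Example with the homogeneous polynomial kernel of Eq. (104), p = 2:
poly2 = lambda x, y: (x @ y) ** 2
X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
print(gram_matrix(X, poly2))
```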
The following is a set of commonly used kernel functions $K(\mathbf{x},\mathbf{y})$, each of which can be represented as an inner product of two vectors $\phi(\mathbf{x})$ and $\phi(\mathbf{y})$ in a higher dimensional space. Assume $\mathbf{x}=[x_1,\cdots,x_d]^T$ and $\mathbf{y}=[y_1,\cdots,y_d]^T$, with inner product

$$\mathbf{x}^T\mathbf{y} = \sum_{i=1}^d x_i y_i \tag{99}$$
The binomial theorem states:

$$(a+b)^n = \sum_{k=0}^n \binom{n}{k} a^k b^{n-k} \tag{100}$$

It generalizes to the multinomial theorem:

$$\left(\sum_{i=1}^d a_i\right)^n = \sum_{k_1+\cdots+k_d=n} \binom{n}{k_1,\cdots,k_d} \prod_{i=1}^d a_i^{k_i} \tag{101}$$

where the multinomial coefficient is

$$\binom{n}{k_1,\cdots,k_d} = \frac{n!}{k_1!\cdots k_d!} \tag{102}$$

Setting $a_i = x_i y_i$ and $n = p$, we can expand the $p$-th power of the inner product in Eq. (99):

$$(\mathbf{x}^T\mathbf{y})^p = \left(\sum_{i=1}^d x_iy_i\right)^p = \sum_{k_1+\cdots+k_d=p} \binom{p}{k_1,\cdots,k_d} \prod_{i=1}^d (x_iy_i)^{k_i} \tag{103}$$
Now consider the homogeneous polynomial kernel for $d$-dimensional vectors $\mathbf{x}$ and $\mathbf{y}$, defined as

$$K(\mathbf{x},\mathbf{y}) = (\mathbf{x}^T\mathbf{y})^p \tag{104}$$

As each term of the expansion in Eq. (103) factors into a function of $\mathbf{x}$ alone times the same function of $\mathbf{y}$ alone, the kernel can be written as an inner product $K(\mathbf{x},\mathbf{y}) = \phi(\mathbf{x})^T\phi(\mathbf{y})$, where the mapping $\phi$ has one component for each tuple $(k_1,\cdots,k_d)$ with $k_1+\cdots+k_d = p$:

$$\phi_{k_1,\cdots,k_d}(\mathbf{x}) = \sqrt{\binom{p}{k_1,\cdots,k_d}}\; \prod_{i=1}^d x_i^{k_i} \tag{105}$$
In particular, when $p=2$ and $d=2$, the polynomial kernel defined over 2-D vectors is:

$$K(\mathbf{x},\mathbf{y}) = (x_1y_1+x_2y_2)^2 = x_1^2y_1^2 + 2x_1x_2y_1y_2 + x_2^2y_2^2 = \phi(\mathbf{x})^T\phi(\mathbf{y}) \tag{106}$$

where $\phi(\mathbf{x}) = [x_1^2,\; \sqrt{2}\,x_1x_2,\; x_2^2]^T$.
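As a check of Eqs. (103) through (106), the sketch below (with arbitrary test vectors) enumerates the components of the mapping in Eq. (105) and confirms that their inner product reproduces the kernel value; with $p=2$ and $d=2$ the enumeration yields exactly the three components of Eq. (106):

```python
import numpy as np
from itertools import product
from math import factorial, prod, sqrt

def phi_poly(x, p):
    """Components of the mapping in Eq. (105): one component per tuple
    (k1, ..., kd) with k1 + ... + kd = p, weighted by the square root
    of the multinomial coefficient."""
    d = len(x)
    comps = []
    for ks in product(range(p + 1), repeat=d):
        if sum(ks) == p:
            coef = factorial(p) / prod(factorial(k) for k in ks)
            comps.append(sqrt(coef) * prod(x[i] ** ks[i] for i in range(d)))
    return np.array(comps)

x = np.array([0.5, -1.0, 2.0])
y = np.array([1.5, 0.25, -0.5])
p = 3
print(np.isclose(phi_poly(x, p) @ phi_poly(y, p), (x @ y) ** p))  # True
```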
A non-homogeneous polynomial kernel is defined as

$$K(\mathbf{x},\mathbf{y}) = (\mathbf{x}^T\mathbf{y} + c)^p, \qquad c > 0 \tag{107}$$

which, when expanded by the binomial theorem, contains the homogeneous polynomial kernels of all degrees up to $p$.
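For example, with $p=2$, $d=2$, and $c=1$, Eq. (107) corresponds to a 6-dimensional explicit mapping (a standard identity, checked here as a sketch with arbitrary test vectors):

```python
import numpy as np

# Explicit mapping for the kernel (x^T y + 1)^2 in 2-D:
# phi(x) = [1, sqrt(2)x1, sqrt(2)x2, x1^2, sqrt(2)x1x2, x2^2]
def phi(v):
    return np.array([1.0, np.sqrt(2)*v[0], np.sqrt(2)*v[1],
                     v[0]**2, np.sqrt(2)*v[0]*v[1], v[1]**2])

x, y = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(np.isclose(phi(x) @ phi(y), (x @ y + 1.0) ** 2))  # True
```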
The radial basis function (RBF) kernel is defined as

$$K(\mathbf{x},\mathbf{y}) = \exp\left(-\frac{\|\mathbf{x}-\mathbf{y}\|^2}{2\sigma^2}\right) \tag{108}$$

To see that it, too, is an inner product of two mapped vectors, expand the squared distance

$$\|\mathbf{x}-\mathbf{y}\|^2 = \mathbf{x}^T\mathbf{x} - 2\,\mathbf{x}^T\mathbf{y} + \mathbf{y}^T\mathbf{y} \tag{109}$$

so that the kernel factors as

$$K(\mathbf{x},\mathbf{y}) = \exp\left(-\frac{\mathbf{x}^T\mathbf{x}}{2\sigma^2}\right) \exp\left(-\frac{\mathbf{y}^T\mathbf{y}}{2\sigma^2}\right) \exp\left(\frac{\mathbf{x}^T\mathbf{y}}{\sigma^2}\right) \tag{110}$$

Expanding the last factor by the Taylor series $e^z = \sum_{n=0}^\infty z^n/n!$ gives

$$K(\mathbf{x},\mathbf{y}) = \exp\left(-\frac{\mathbf{x}^T\mathbf{x}}{2\sigma^2}\right) \exp\left(-\frac{\mathbf{y}^T\mathbf{y}}{2\sigma^2}\right) \sum_{n=0}^\infty \frac{(\mathbf{x}^T\mathbf{y})^n}{\sigma^{2n}\, n!} = \phi(\mathbf{x})^T\phi(\mathbf{y}) \tag{111}$$

a weighted sum of homogeneous polynomial kernels of all degrees $n$; the corresponding mapping $\phi$ is therefore into an infinite-dimensional space.
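A quick numeric check of the factorization and series expansion (a sketch with arbitrary test vectors and $\sigma = 1$, truncating the series at 20 terms):

```python
import numpy as np
from math import factorial

x, y = np.array([0.3, -0.7]), np.array([1.1, 0.4])
sigma = 1.0

# Exact RBF kernel value, Eq. (108):
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

# Truncated series of Eq. (111): a weighted sum of polynomial kernels (x^T y)^n.
pref = np.exp(-(x @ x) / (2 * sigma ** 2)) * np.exp(-(y @ y) / (2 * sigma ** 2))
series = pref * sum((x @ y) ** n / (sigma ** (2 * n) * factorial(n)) for n in range(20))
print(np.isclose(exact, series))  # True
```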
The method of kernel mapping can be applied to the SVM algorithm considered previously, as all data points appear in the algorithm only in the form of inner products. Specifically, during the training process, we replace the inner product $\mathbf{x}_i^T\mathbf{x}_j$ in both Eq. (85) and Eq. (88) by the kernel function $K(\mathbf{x}_i,\mathbf{x}_j)$ to get