Kernel Mapping

The SVM algorithm above converges only if the data points of the two classes in the training set are linearly separable. If this is not the case, we can consider the following kernel method.

In the kernel method, each data point, represented as a vector ${\bf x}=[x_1,\cdots,x_d]^T$ in the original d-dimensional feature space, is mapped by a function ${\bf z}=\phi({\bf x})$ into a higher, possibly infinite, dimensional space

$\displaystyle {\bf x} \Longrightarrow {\bf z}=\phi({\bf x})$ (94)

The premise of the method is that the data points appear in the algorithm only in the form of inner products ${\bf x}_m^T{\bf x}_n$, so the mapping ${\bf z}=\phi({\bf x})$ never needs to be carried out explicitly. In fact, neither the form of the mapping $\phi({\bf x})$ nor the dimensionality of the higher dimensional space needs to be explicitly specified or known.

The motivation for such a kernel mapping is that the relevant operations such as classification and clustering may be carried out much more effectively once the dataset is mapped to the higher dimensional space. For example, classes not linearly separable in the original d-dimensional feature space may be trivially separable in a higher dimensional space, as illustrated by the following examples.

Example 1:

In 1-D space, the classes $C_-=\{x\,\big\vert\,a\le x\le b\}$ and $C_+=\{x\,\big\vert\,(x< a)\;$or$\;(x> b)\}$ are not linearly separable. By the following mapping from the 1-D space to a 2-D space:

$\displaystyle {\bf z}=\phi(x)=\left[\begin{array}{c}z_1\\ z_2\end{array}\right]=\left[\begin{array}{c} x \\ (x-(a+b)/2)^2 \end{array}\right]$ (95)

the two classes can be separated by a threshold in the second dimension of the 2-D space.
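
To make Example 1 concrete, the short sketch below applies the mapping of Eq. (95) and checks that a single threshold on $z_2$ separates the two classes. The endpoints a = 1, b = 3 and the sample points are illustrative choices, not values from the text:

```python
import numpy as np

a, b = 1.0, 3.0                      # assumed interval endpoints, for illustration only
phi = lambda x: np.stack([x, (x - (a + b) / 2) ** 2], axis=-1)

x_minus = np.array([1.2, 2.0, 2.8])  # points inside [a, b]  -> class C_-
x_plus  = np.array([0.3, 0.9, 3.5])  # points outside [a, b] -> class C_+

z_minus, z_plus = phi(x_minus), phi(x_plus)

# In the second coordinate z_2 = (x - (a+b)/2)^2 the classes separate at the
# threshold ((b - a)/2)^2 = 1: C_- falls below it and C_+ above it.
threshold = ((b - a) / 2) ** 2
print(z_minus[:, 1] < threshold)     # [ True  True  True ]
print(z_plus[:, 1] > threshold)      # [ True  True  True ]
```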

Figure: KernelExamples.png (illustrations of the kernel mapping examples)

Example 2:

The method above can be generalized to higher dimensional spaces, such as a mapping from a 2-D to a 3-D space. Consider two classes in a 2-D space that are not linearly separable: $C_-=\{{\bf x}\,\big\vert\,\vert\vert{\bf x}\vert\vert< D\}$ and $C_+=\{{\bf x}\,\big\vert\,\vert\vert{\bf x}\vert\vert> D\}$. However, by the following mapping from the 2-D space to a 3-D space:

$\displaystyle {\bf z}=\phi({\bf x})=\left[\begin{array}{c}z_1\\ z_2\\ z_3\end{array}\right]=\left[\begin{array}{c}x_1\\ x_2\\ x_1^2+x_2^2\end{array}\right]$ (96)

the two classes can be trivially separated by thresholding in the third dimension of the 3-D space.
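
A similar numerical check for the mapping of Eq. (96); the radius D = 1 and the sample points are again made up for illustration:

```python
import numpy as np

D = 1.0                                   # assumed class-boundary radius
phi = lambda x: np.column_stack([x[:, 0], x[:, 1], x[:, 0]**2 + x[:, 1]**2])

x_minus = np.array([[0.2, 0.3], [-0.5, 0.1]])   # ||x|| < D  -> C_-
x_plus  = np.array([[1.5, 0.0], [-1.0, 1.2]])   # ||x|| > D  -> C_+

# The third coordinate z_3 = x_1^2 + x_2^2 = ||x||^2, so the plane z_3 = D^2
# in the 3-D space separates the two classes.
print(phi(x_minus)[:, 2] < D**2)          # [ True  True ]
print(phi(x_plus)[:, 2] > D**2)           # [ True  True ]
```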

Example 3:

In 2-D space, consider the exclusive-OR (XOR) dataset, in which class $C_-$ contains points in quadrants I and III and class $C_+$ contains points in quadrants II and IV; the two classes are not linearly separable. However, by mapping the data points to a 3-D space:

$\displaystyle {\bf z}=\phi({\bf x})=\left[\begin{array}{c}z_1\\ z_2\\ z_3\end{array}\right]=\left[\begin{array}{c}x_1\\ x_2\\ x_1x_2\end{array}\right]$ (97)

the two classes can be separated by simply thresholding in the third dimension of the 3-D space.
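
A corresponding sketch for the XOR case of Eq. (97), with hand-picked sample points in the four quadrants:

```python
import numpy as np

phi = lambda x: np.column_stack([x[:, 0], x[:, 1], x[:, 0] * x[:, 1]])

x_minus = np.array([[ 1.0,  2.0], [-0.5, -1.5]])   # quadrants I and III -> C_-
x_plus  = np.array([[-1.0,  0.8], [ 2.0, -0.3]])   # quadrants II and IV -> C_+

# The third coordinate z_3 = x_1 x_2 is positive in quadrants I/III and
# negative in II/IV, so the plane z_3 = 0 separates the two classes.
print(phi(x_minus)[:, 2] > 0)     # [ True  True ]
print(phi(x_plus)[:, 2] < 0)      # [ True  True ]
```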

Definition: A kernel is a function that takes two vectors ${\bf x}_m$ and ${\bf x}_n$ as arguments and returns the inner product of their images ${\bf z}_m=\phi({\bf x}_m)$ and ${\bf z}_n=\phi({\bf x}_n)$:

$\displaystyle K({\bf x}_m,{\bf x}_n)=\phi({\bf x}_m)^T\phi({\bf x}_n)={\bf z}_m^T{\bf z}_n$ (98)

The kernel function takes as input two vectors ${\bf x}_m$ and ${\bf x}_n$ in the original feature space and returns a scalar value, the inner product of ${\bf z}_m$ and ${\bf z}_n$ in some higher dimensional space. If the data points in the original space appear only in the form of inner products, then the mapping ${\bf z}=\phi({\bf x})$, called the kernel-induced implicit mapping, never needs to be explicitly specified, and the dimension of the new space does not even need to be known.
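
As a concrete instance of this definition, the sketch below uses the degree-2 homogeneous polynomial kernel $K({\bf x},{\bf x}')=({\bf x}^T{\bf x}')^2$ on 2-D inputs, whose explicit map is $\phi({\bf x})=[x_1^2,\;\sqrt{2}x_1x_2,\;x_2^2]^T$ (a standard example chosen here only for illustration), and verifies numerically that the kernel value equals the inner product of the mapped vectors, as Eq. (98) requires:

```python
import numpy as np

def K(xm, xn):
    """Degree-2 homogeneous polynomial kernel K(x, x') = (x^T x')^2."""
    return (xm @ xn) ** 2

def phi(x):
    """Explicit feature map (2-D -> 3-D) whose inner product reproduces K."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

xm = np.array([1.0, 2.0])
xn = np.array([0.5, -1.0])

print(K(xm, xn))            # -> 2.25
print(phi(xm) @ phi(xn))    # -> 2.25, the same value
```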

Commonly used kernel functions $K({\bf x},\,{\bf x}')$, each of which can be represented as an inner product of two vectors ${\bf z}=\phi({\bf x})$ and ${\bf z}'=\phi({\bf x}')$ in some higher dimensional space, include the polynomial kernel $K({\bf x},\,{\bf x}')=({\bf x}^T{\bf x}'+c)^p$ and the Gaussian radial basis function (RBF) kernel $K({\bf x},\,{\bf x}')=\exp(-\vert\vert{\bf x}-{\bf x}'\vert\vert^2/2\sigma^2)$.
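
Both kernels can be coded directly; in the sketch below the degree p, offset c, and width $\sigma$ are free parameters chosen for illustration, not prescribed by the text:

```python
import numpy as np

def polynomial_kernel(x, xp, p=3, c=1.0):
    """Polynomial kernel K(x, x') = (x^T x' + c)^p."""
    return (x @ xp + c) ** p

def rbf_kernel(x, xp, sigma=1.0):
    """Gaussian RBF kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

x, xp = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(polynomial_kernel(x, xp))   # (0 + 1)^3 = 1.0
print(rbf_kernel(x, xp))          # exp(-2/2) = exp(-1) ~ 0.3679
```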

The method of kernel mapping can be applied to the SVM algorithm considered previously, as all data points appear in the algorithm in the form of inner products. Specifically, during the training process, we replace the inner product ${\bf x}_m^T{\bf x}_n$ in both Eq. (85) and Eq. (88) by the kernel function $K({\bf x}_m,\,{\bf x}_n)$ to get

maximize:$\displaystyle \;\;L_d({\bf\alpha})=\sum_{n=1}^N\alpha_n -\frac{1}{2}\sum_{n=1}^N \sum_{m=1}^N \alpha_n \alpha_m y_n y_m K({\bf x}_n,{\bf x}_m)$
subject to:$\displaystyle \;\;\sum_{n=1}^N \alpha_n y_n={\bf y}^T{\bf\alpha}=0, \;\;\;\; \alpha_n\ge 0\;\;\;(n=1,\cdots,N)$ (112)

and

$\displaystyle b=y_n-\sum_{m \in sv} \alpha_m y_m K({\bf x}_m,{\bf x}_n),\;\;\;\;\;(n\in sv)$ (113)

We also replace the inner product ${\bf x}^T_n{\bf x}$ in Eq. (89) by $K({\bf x}_n,\,{\bf x})$ for the classification of any unlabeled point ${\bf x}$, mapped to ${\bf z}=\phi({\bf x})$:

$\displaystyle f({\bf z})={\bf w}^T{\bf z}+b=\sum_{n\in sv}\alpha_ny_n\,({\bf z}_n^T{\bf z})+b=\sum_{n\in sv}\alpha_ny_n\,K({\bf x}_n,{\bf x})+b\;\;\left\{\begin{array}{ll}>0 & {\bf x}\in C_+\\ <0 & {\bf x}\in C_-\end{array}\right.$ (114)
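
Putting the kernelized training and classification steps together, the sketch below solves the dual of Eq. (112) with a general-purpose optimizer (scipy's SLSQP, used here only for brevity; a dedicated QP or SMO solver would normally be used), computes b by averaging Eq. (113) over the support vectors for numerical stability, and classifies new points with Eq. (114). The toy circular dataset and the RBF kernel with unit width are illustrative choices, not from the text:

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(xm, xn, sigma=1.0):
    """Gaussian RBF kernel K(x_m, x_n) = exp(-||x_m - x_n||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((xm - xn) ** 2) / (2.0 * sigma ** 2))

# Toy data: C_- near the origin (y = -1), C_+ on a ring of radius 2 (y = +1).
rng = np.random.default_rng(0)
X_in  = rng.normal(scale=0.3, size=(20, 2))
X_out = rng.normal(size=(20, 2))
X_out = 2.0 * X_out / np.linalg.norm(X_out, axis=1, keepdims=True)
X = np.vstack([X_in, X_out])
y = np.hstack([-np.ones(20), np.ones(20)])
N = len(y)

# Gram matrix K[m, n] = K(x_m, x_n): the only way the data enter the dual.
K = np.array([[rbf_kernel(X[m], X[n]) for n in range(N)] for m in range(N)])

# Maximizing L_d(alpha) of Eq. (112) is equivalent to minimizing its negative,
# subject to y^T alpha = 0 and alpha_n >= 0.
def neg_dual(alpha):
    return 0.5 * (alpha * y) @ K @ (alpha * y) - np.sum(alpha)

res = minimize(neg_dual, np.zeros(N), method="SLSQP",
               bounds=[(0, None)] * N,
               constraints=[{"type": "eq", "fun": lambda a: y @ a}])
alpha = res.x

# Support vectors have alpha_n > 0; compute b from Eq. (113), averaged over them.
sv = alpha > 1e-6
b = np.mean([y[n] - np.sum(alpha[sv] * y[sv] * K[sv, n]) for n in np.flatnonzero(sv)])

# Eq. (114): f = sum_{n in sv} alpha_n y_n K(x_n, x) + b; its sign gives the class.
def classify(x):
    f = sum(alpha[n] * y[n] * rbf_kernel(X[n], x) for n in np.flatnonzero(sv)) + b
    return +1 if f > 0 else -1

print(classify(np.array([0.1, 0.1])))   # expected -1 (near the origin)
print(classify(np.array([1.8, 0.0])))   # expected +1 (far from the origin)
```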

Again, we note that the normal vector ${\bf w}$ in Eq. (86) never needs to be explicitly calculated. As both the training and the classification are now carried out in the higher dimensional space of ${\bf z}=\phi({\bf x})$, in which the classes are more likely to be linearly separable, the classification can be carried out more effectively. More generally, the kernel method can be applied to any algorithm in which the data always appear in the form of inner products.