Kernel Methods

In kernel methods, each data point, represented as a vector ${\bf x}=[x_1,\cdots,x_d]^T$ in the original d-dimensional feature space, is mapped by a function ${\bf z}=\phi({\bf x})$ into a higher, possibly infinite, dimensional space:

$\displaystyle {\bf x} \Longrightarrow {\bf z}=\phi({\bf x})$ (90)

The premise of the method is that the data points appear in the algorithm only in the form of inner products ${\bf x}_m^T{\bf x}_n$; then, by the kernel trick, the mapping ${\bf z}=\phi({\bf x})$ never needs to be carried out explicitly. In fact, the form of the mapping $\phi({\bf x})$ and the dimensionality of the higher dimensional space do not need to be explicitly specified or even known.
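For a concrete sense of the trick, the following minimal Python sketch assumes the quadratic kernel $K({\bf x},{\bf y})=({\bf x}^T{\bf y})^2$ in 2-D, whose explicit mapping happens to be known, and checks that the kernel value equals the inner product of the mapped vectors:

```python
import numpy as np

# A minimal sketch of the kernel trick, assuming the quadratic kernel
# K(x, y) = (x^T y)^2 in 2-D, whose explicit mapping is
# phi(x) = [x1^2, sqrt(2)*x1*x2, x2^2]^T.
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def K(x, y):
    return (x @ y)**2          # evaluated entirely in the original 2-D space

x = np.array([1.0, 2.0])       # hypothetical sample points
y = np.array([3.0, -1.0])

# Both routes give the same value, so phi never needs to be evaluated
# when an algorithm uses the data only through inner products.
print(K(x, y))                 # 1.0
print(phi(x) @ phi(y))         # 1.0
```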

The motivation for such a kernel mapping is that the relevant operations such as classification and clustering may be carried out much more effectively once the dataset is mapped to the higher dimensional space. For example, classes not linearly separable in the original d-dimensional feature space may be trivially separable in a higher dimensional space, as illustrated by the following examples.

Example 1:

In 1-D space, the classes $C_-=\{x\big\vert\;a\le x\le b\}$ and $C_+=\{x\big\vert\;(x< a)\;$or$\;(x> b)\}$ are not linearly separable. However, by the following mapping from the 1-D space to a 2-D space:

$\displaystyle {\bf z}=\phi(x)=\left[\begin{array}{l}z_1\\ z_2\end{array}\right]
=\left[ \begin{array}{c} x \\ (x-(a+b)/2)^2 \end{array}\right]$ (91)

the two classes can be separated by thresholding the second dimension $z_2$ at $((b-a)/2)^2$ in the 2-D space.
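The separation can be checked numerically; the sketch below assumes the hypothetical endpoints $a=-1$, $b=1$ and a few sample points:

```python
import numpy as np

# Sketch of Example 1 with assumed endpoints a = -1, b = 1: C- lies inside
# [a, b], C+ outside. No single threshold on x separates them, but
# z2 = (x - (a+b)/2)^2 does, with the threshold ((b-a)/2)^2 = 1.
a, b = -1.0, 1.0
x_neg = np.array([-0.9, 0.0, 0.8])    # hypothetical points in C-
x_pos = np.array([-2.0, 1.5, 3.0])    # hypothetical points in C+

z2 = lambda x: (x - (a + b) / 2)**2
print(z2(x_neg))                      # all below 1 -> C-
print(z2(x_pos))                      # all above 1 -> C+
```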

Figure: KernelExamples.png

Example 2:

The method above generalizes to higher dimensional spaces, such as a mapping from a 2-D to a 3-D space. Consider two classes in a 2-D space that are not linearly separable: $C_-=\{{\bf x}\,\big\vert\,\vert\vert{\bf x}\vert\vert< D\}$ inside a circle of radius $D$, and $C_+=\{{\bf x}\,\big\vert\,\vert\vert{\bf x}\vert\vert> D\}$ outside it. By the following mapping from the 2-D space to a 3-D space:

$\displaystyle {\bf z}=\phi({\bf x})=\left[\begin{array}{l}z_1\\ z_2\\ z_3\end{array}\right]
=\left[\begin{array}{c}x_1\\ x_2\\ x_1^2+x_2^2\end{array}\right]$ (92)

the two classes can be trivially separated by thresholding the third dimension $z_3=x_1^2+x_2^2$ at $D^2$ in the 3-D space.
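A short numerical check of this mapping, assuming a radius $D=1$ and a few hypothetical sample points:

```python
import numpy as np

# Sketch of Example 2 with an assumed radius D = 1: points inside the
# circle vs. points outside it. Thresholding z3 = x1^2 + x2^2 at D^2
# separates the classes linearly in the 3-D space.
D = 1.0
X_neg = np.array([[0.1, 0.2], [-0.5, 0.3]])   # ||x|| < D
X_pos = np.array([[1.5, 0.0], [1.0, 1.0]])    # ||x|| > D

z3 = lambda X: X[:, 0]**2 + X[:, 1]**2
print(z3(X_neg))                              # below D^2 = 1 -> C-
print(z3(X_pos))                              # above D^2 = 1 -> C+
```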

Example 3:

In 2-D space, consider the exclusive-OR (XOR) dataset: class $C_-$ contains points in quadrants I and III, while class $C_+$ contains points in quadrants II and IV; the two classes are not linearly separable. However, by mapping the data points to a 3-D space:

$\displaystyle {\bf z}=\phi({\bf x})=\left[\begin{array}{l}z_1\\ z_2\\ z_3\end{array}\right]
=\left[\begin{array}{c}x_1\\ x_2\\ x_1x_2\end{array}\right]$ (93)

the two classes can be separated by simply thresholding the third dimension $z_3=x_1x_2$ at zero in the 3-D space.
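Again, a short numerical check with hypothetical sample points, one per quadrant:

```python
import numpy as np

# Sketch of Example 3, the XOR pattern: z3 = x1 * x2 is positive in
# quadrants I and III (C-) and negative in quadrants II and IV (C+),
# so a threshold at zero in the third dimension separates the classes.
X_neg = np.array([[1.0, 1.0], [-1.0, -1.0]])   # quadrants I, III
X_pos = np.array([[-1.0, 1.0], [1.0, -1.0]])   # quadrants II, IV

z3 = lambda X: X[:, 0] * X[:, 1]
print(z3(X_neg))                               # positive -> C-
print(z3(X_pos))                               # negative -> C+
```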

Definition: A kernel is a function that takes two vectors ${\bf x}_m$ and ${\bf x}_n$ as arguments and returns the inner product of their images ${\bf z}_m=\phi({\bf x}_m)$ and ${\bf z}_n=\phi({\bf x}_n)$:

$\displaystyle K({\bf x}_m,{\bf x}_n)=\phi({\bf x}_m)^T\phi({\bf x}_n)
={\bf z}_m^T{\bf z}_n$ (94)

The kernel function takes as input two vectors ${\bf x}_m$ and ${\bf x}_n$ in the original feature space, and returns a scalar value equal to the inner product of ${\bf z}_m$ and ${\bf z}_n$ in some higher dimensional space. If the data points in the original space appear only in the form of inner products, then the mapping ${\bf z}=\phi({\bf x})$, called the kernel-induced implicit mapping, never needs to be explicitly specified, and the dimension of the new space does not even need to be known.
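In practice, an algorithm needs only the pairwise kernel values, which can be collected in a Gram matrix. The sketch below assumes a Gaussian (RBF) kernel, one common choice whose induced mapping is infinite-dimensional, yet every entry is computed directly in the original space:

```python
import numpy as np

# A minimal sketch: the Gram matrix G[m, n] = K(x_m, x_n) for a Gaussian
# (RBF) kernel. Its induced mapping phi is infinite-dimensional, yet every
# entry is computed directly in the original 2-D space.
def rbf(x, y, sigma=1.0):                     # sigma is an assumed width
    return np.exp(-np.sum((x - y)**2) / (2 * sigma**2))

X = np.array([[0.0, 0.0],                     # three hypothetical points
              [1.0, 0.0],
              [0.0, 2.0]])
G = np.array([[rbf(xm, xn) for xn in X] for xm in X])
print(G)                                      # symmetric, ones on diagonal
```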

The following is a set of commonly used kernel functions $K({\bf x},\,{\bf x}')$, each of which can be represented as the inner product of two vectors ${\bf z}=\phi({\bf x})$ and ${\bf z}'=\phi({\bf x}')$ in a higher dimensional space.