Kernel Methods

In kernel methods, each data point, represented as a vector ${\bf x}=[x_1,\cdots,x_d]^T$ in the original d-dimensional feature space, is mapped by a function ${\bf z}=\phi({\bf x})$ into a higher, possibly infinite, dimensional space:

$\displaystyle {\bf x} \Longrightarrow {\bf z}=\phi({\bf x})$ (90)

The premise of the method is that the data points appear in the algorithm only in the form of inner products ${\bf x}_m^T{\bf x}_n$; then, by the kernel trick, the mapping ${\bf z}=\phi({\bf x})$ never needs to be carried out explicitly. In fact, the form of the mapping $\phi({\bf x})$ and the dimensionality of the higher dimensional space do not need to be explicitly specified or even known.
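For a concrete sense of the trick, the following minimal Python sketch assumes the quadratic kernel $K({\bf x},{\bf y})=({\bf x}^T{\bf y})^2$ in 2-D, whose explicit mapping happens to be known, and checks that the kernel value equals the inner product of the mapped vectors:

```python
import numpy as np

# A minimal sketch of the kernel trick, assuming the quadratic kernel
# K(x, y) = (x^T y)^2 in 2-D, whose explicit mapping is
# phi(x) = [x1^2, sqrt(2)*x1*x2, x2^2]^T.
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def K(x, y):
    return (x @ y)**2          # evaluated entirely in the original 2-D space

x = np.array([1.0, 2.0])       # hypothetical sample points
y = np.array([3.0, -1.0])

# Both routes give the same value, so phi never needs to be evaluated
# when an algorithm uses the data only through inner products.
print(K(x, y))                 # 1.0
print(phi(x) @ phi(y))         # 1.0
```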

The motivation for such a kernel mapping is that the relevant operations such as classification and clustering may be carried out much more effectively once the dataset is mapped to the higher dimensional space. For example, classes not linearly separable in the original d-dimensional feature space may be trivially separable in a higher dimensional space, as illustrated by the following examples.

Example 1:

In 1-D space, the classes $C_-=\{x\big\vert\;a\le x\le b\}$ and $C_+=\{x\big\vert\;(x< a)\;$or$\;(x> b)\}$ are not linearly separable. However, by the following mapping from the 1-D space to a 2-D space:

$\displaystyle {\bf z}=\phi(x)=\left[\begin{array}{l}z_1\\ z_2\end{array}\right]
=\left[ \begin{array}{c} x \\ (x-(a+b)/2)^2 \end{array}\right]$ (91)

the two classes can be separated by thresholding the second dimension $z_2$ at $((b-a)/2)^2$ in the 2-D space.
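The separation can be checked numerically; the sketch below assumes the hypothetical endpoints $a=-1$, $b=1$ and a few sample points:

```python
import numpy as np

# Sketch of Example 1 with assumed endpoints a = -1, b = 1: C- lies inside
# [a, b], C+ outside. No single threshold on x separates them, but
# z2 = (x - (a+b)/2)^2 does, with the threshold ((b-a)/2)^2 = 1.
a, b = -1.0, 1.0
x_neg = np.array([-0.9, 0.0, 0.8])    # hypothetical points in C-
x_pos = np.array([-2.0, 1.5, 3.0])    # hypothetical points in C+

z2 = lambda x: (x - (a + b) / 2)**2
print(z2(x_neg))                      # all below 1 -> C-
print(z2(x_pos))                      # all above 1 -> C+
```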

Figure: KernelExamples.png

Example 2:

The method above generalizes to higher dimensional spaces, such as a mapping from a 2-D to a 3-D space. Consider two classes in a 2-D space that are not linearly separable: $C_-=\{{\bf x}\,\big\vert\,\vert\vert{\bf x}\vert\vert< D\}$ inside a circle of radius $D$, and $C_+=\{{\bf x}\,\big\vert\,\vert\vert{\bf x}\vert\vert> D\}$ outside it. By the following mapping from the 2-D space to a 3-D space:

$\displaystyle {\bf z}=\phi({\bf x})=\left[\begin{array}{l}z_1\\ z_2\\ z_3\end{array}\right]
=\left[\begin{array}{c}x_1\\ x_2\\ x_1^2+x_2^2\end{array}\right]$ (92)

the two classes can be trivially separated by thresholding the third dimension $z_3=x_1^2+x_2^2$ at $D^2$ in the 3-D space.
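A short numerical check of this mapping, assuming a radius $D=1$ and a few hypothetical sample points:

```python
import numpy as np

# Sketch of Example 2 with an assumed radius D = 1: points inside the
# circle vs. points outside it. Thresholding z3 = x1^2 + x2^2 at D^2
# separates the classes linearly in the 3-D space.
D = 1.0
X_neg = np.array([[0.1, 0.2], [-0.5, 0.3]])   # ||x|| < D
X_pos = np.array([[1.5, 0.0], [1.0, 1.0]])    # ||x|| > D

z3 = lambda X: X[:, 0]**2 + X[:, 1]**2
print(z3(X_neg))                              # below D^2 = 1 -> C-
print(z3(X_pos))                              # above D^2 = 1 -> C+
```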

Example 3:

In 2-D space, consider the exclusive-OR (XOR) dataset: class $C_-$ contains points in quadrants I and III, while class $C_+$ contains points in quadrants II and IV; the two classes are not linearly separable. However, by mapping the data points to a 3-D space:

$\displaystyle {\bf z}=\phi({\bf x})=\left[\begin{array}{l}z_1\\ z_2\\ z_3\end{array}\right]
=\left[\begin{array}{c}x_1\\ x_2\\ x_1x_2\end{array}\right]$ (93)

the two classes can be separated by simply thresholding the third dimension $z_3=x_1x_2$ at zero in the 3-D space.
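Again, a short numerical check with hypothetical sample points, one per quadrant:

```python
import numpy as np

# Sketch of Example 3, the XOR pattern: z3 = x1 * x2 is positive in
# quadrants I and III (C-) and negative in quadrants II and IV (C+),
# so a threshold at zero in the third dimension separates the classes.
X_neg = np.array([[1.0, 1.0], [-1.0, -1.0]])   # quadrants I, III
X_pos = np.array([[-1.0, 1.0], [1.0, -1.0]])   # quadrants II, IV

z3 = lambda X: X[:, 0] * X[:, 1]
print(z3(X_neg))                               # positive -> C-
print(z3(X_pos))                               # negative -> C+
```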

Definition: A kernel is a function that takes two vectors ${\bf x}_m$ and ${\bf x}_n$ as arguments and returns the inner product of their images ${\bf z}_m=\phi({\bf x}_m)$ and ${\bf z}_n=\phi({\bf x}_n)$:

$\displaystyle K({\bf x}_m,{\bf x}_n)=\phi({\bf x}_m)^T\phi({\bf x}_n)
={\bf z}_m^T{\bf z}_n$ (94)

The kernel function takes as input two vectors ${\bf x}_m$ and ${\bf x}_n$ in the original feature space, and returns a scalar value equal to the inner product of ${\bf z}_m$ and ${\bf z}_n$ in some higher dimensional space. If the data points in the original space appear only in the form of inner products, then the mapping ${\bf z}=\phi({\bf x})$, called the kernel-induced implicit mapping, never needs to be explicitly specified, and the dimension of the new space does not even need to be known.
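In practice, an algorithm needs only the pairwise kernel values, which can be collected in a Gram matrix. The sketch below assumes a Gaussian (RBF) kernel, one common choice whose induced mapping is infinite-dimensional, yet every entry is computed directly in the original space:

```python
import numpy as np

# A minimal sketch: the Gram matrix G[m, n] = K(x_m, x_n) for a Gaussian
# (RBF) kernel. Its induced mapping phi is infinite-dimensional, yet every
# entry is computed directly in the original 2-D space.
def rbf(x, y, sigma=1.0):                     # sigma is an assumed width
    return np.exp(-np.sum((x - y)**2) / (2 * sigma**2))

X = np.array([[0.0, 0.0],                     # three hypothetical points
              [1.0, 0.0],
              [0.0, 2.0]])
G = np.array([[rbf(xm, xn) for xn in X] for xm in X])
print(G)                                      # symmetric, ones on diagonal
```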

The following is a set of commonly used kernel functions $K({\bf x},\,{\bf x}')$, each of which can be represented as the inner product of two vectors ${\bf z}=\phi({\bf x})$ and ${\bf z}'=\phi({\bf x}')$ in a higher dimensional space.