First consider nonlinearly mapping all data points $\mathbf{x}_n$ ($n=1,\dots,N$) to $\phi(\mathbf{x}_n)$ in a higher dimensional feature space $\mathcal{F}$, where the covariance matrix can be estimated as

$$\mathbf{C}=\frac{1}{N}\sum_{n=1}^{N}\phi(\mathbf{x}_n)\,\phi(\mathbf{x}_n)^T$$

Any new data point $\mathbf{x}$ can now be mapped to $\phi(\mathbf{x})$ in the high-dimensional space $\mathcal{F}$, where its $i$-th PCA component can be obtained as its projection on the $i$-th eigenvector $\mathbf{v}_i$ of $\mathbf{C}$:

$$y_i=\mathbf{v}_i^T\phi(\mathbf{x})=\sum_{n=1}^{N}a_{in}\,\phi(\mathbf{x}_n)^T\phi(\mathbf{x})=\sum_{n=1}^{N}a_{in}\,k(\mathbf{x}_n,\mathbf{x})$$

where $a_{in}$ are the coefficients of the expansion $\mathbf{v}_i=\sum_{n=1}^{N}a_{in}\,\phi(\mathbf{x}_n)$ of the eigenvector in terms of the mapped training points, and $k(\mathbf{x}_m,\mathbf{x}_n)=\phi(\mathbf{x}_m)^T\phi(\mathbf{x}_n)$ is the kernel function.
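The expansion of $\mathbf{v}_i$ in terms of the mapped points follows from the eigenvalue equation of $\mathbf{C}$ itself; a brief sketch of this step (the standard kernel PCA argument, restated here in the notation above) is:

$$\mathbf{C}\mathbf{v}_i=\lambda_i\mathbf{v}_i=\frac{1}{N}\sum_{n=1}^{N}\phi(\mathbf{x}_n)\big(\phi(\mathbf{x}_n)^T\mathbf{v}_i\big)\;\;\Longrightarrow\;\;\mathbf{v}_i=\sum_{n=1}^{N}\underbrace{\frac{\phi(\mathbf{x}_n)^T\mathbf{v}_i}{N\lambda_i}}_{a_{in}}\,\phi(\mathbf{x}_n)$$

Substituting this expansion back and taking inner products with each $\phi(\mathbf{x}_m)$ gives $\mathbf{K}^2\mathbf{a}_i=N\lambda_i\,\mathbf{K}\mathbf{a}_i$, which for nonzero eigenvalues reduces to an eigenvalue problem of the $N\times N$ kernel matrix $\mathbf{K}$ with $K_{mn}=k(\mathbf{x}_m,\mathbf{x}_n)$:

$$\mathbf{K}\mathbf{a}_i=N\lambda_i\,\mathbf{a}_i,\qquad \mathbf{a}_i=(a_{i1},\dots,a_{iN})^T$$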
These are the steps for the implementation of the kernel PCA algorithm (a code sketch follows the list):

1. Choose a kernel function $k(\cdot,\cdot)$ and compute the $N\times N$ kernel matrix $K_{mn}=k(\mathbf{x}_m,\mathbf{x}_n)$ over the $N$ training points.
2. Center the kernel matrix (the de-meaning step discussed below).
3. Solve the eigenvalue problem $\mathbf{K}\mathbf{a}_i=N\lambda_i\,\mathbf{a}_i$ and keep the eigenvectors corresponding to the largest eigenvalues.
4. Normalize each $\mathbf{a}_i$ so that $N\lambda_i\,\mathbf{a}_i^T\mathbf{a}_i=1$, which is equivalent to $\mathbf{v}_i^T\mathbf{v}_i=1$.
5. Obtain the $i$-th component of any data point $\mathbf{x}$ as $y_i=\sum_{n=1}^{N}a_{in}\,k(\mathbf{x}_n,\mathbf{x})$.
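A minimal NumPy sketch of these steps is given below; the function names and the choice of a Gaussian kernel are illustrative, not prescribed by the text above.

```python
import numpy as np

def gaussian_kernel_matrix(X, Y, sigma=1.0):
    """Gaussian (RBF) kernel matrix with entries exp(-||x_m - y_n||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / (2.0 * sigma**2))

def kernel_pca(X, n_components=2, sigma=1.0):
    """Return the leading kernel PCA components of the N x d data matrix X."""
    N = X.shape[0]
    K = gaussian_kernel_matrix(X, X, sigma)              # step 1: kernel matrix
    one_N = np.full((N, N), 1.0 / N)
    K_c = K - one_N @ K - K @ one_N + one_N @ K @ one_N  # step 2: center the kernel matrix
    lam, A = np.linalg.eigh(K_c)                         # step 3: eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:n_components]           # keep the largest eigenvalues
    lam, A = lam[idx], A[:, idx]
    A = A / np.sqrt(lam)                                 # step 4: normalize so that v_i^T v_i = 1
    Y = K_c @ A                                          # step 5: components of the training points
    return Y, A, lam
```

For example, `Y, A, lam = kernel_pca(X, n_components=2, sigma=0.5)` returns the first two nonlinear components of the rows of `X`; note that the eigenvalue returned for the kernel matrix is $N\lambda_i$ rather than $\lambda_i$ itself.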
Some commonly used kernels are (illustrated in code below):

- Polynomial kernel: $k(\mathbf{x},\mathbf{y})=(\mathbf{x}^T\mathbf{y}+c)^d$
- Gaussian (radial basis function) kernel: $k(\mathbf{x},\mathbf{y})=\exp\left(-\|\mathbf{x}-\mathbf{y}\|^2/2\sigma^2\right)$
- Hyperbolic tangent (sigmoid) kernel: $k(\mathbf{x},\mathbf{y})=\tanh(\kappa\,\mathbf{x}^T\mathbf{y}+\theta)$
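These kernels could be written, for instance, as the following NumPy functions operating on single vectors; the parameter names are illustrative.

```python
import numpy as np

def polynomial_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel (x^T y + c)^d."""
    return (x @ y + c) ** d

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    """Hyperbolic tangent (sigmoid) kernel tanh(kappa x^T y + theta)."""
    return np.tanh(kappa * (x @ y) + theta)
```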
The above discussion is based on the assumption that the data points have zero mean, so that the covariance of two data points is the same as their correlation. While the data points can easily be de-meaned in the original space by subtracting the mean vector from all points, in the feature space this has to be done implicitly, through the kernel matrix. The de-meaned data points in the feature space are:
$$\tilde{\phi}(\mathbf{x}_n)=\phi(\mathbf{x}_n)-\frac{1}{N}\sum_{m=1}^{N}\phi(\mathbf{x}_m)$$

and the corresponding elements of the de-meaned (centered) kernel matrix are

$$\tilde{K}_{mn}=\tilde{\phi}(\mathbf{x}_m)^T\tilde{\phi}(\mathbf{x}_n)=K_{mn}-\frac{1}{N}\sum_{l=1}^{N}K_{ln}-\frac{1}{N}\sum_{l=1}^{N}K_{ml}+\frac{1}{N^2}\sum_{l=1}^{N}\sum_{l'=1}^{N}K_{ll'}$$

or, in matrix form,

$$\tilde{\mathbf{K}}=\mathbf{K}-\mathbf{1}_N\mathbf{K}-\mathbf{K}\mathbf{1}_N+\mathbf{1}_N\mathbf{K}\mathbf{1}_N$$

where $\mathbf{1}_N$ is the $N\times N$ matrix with all elements equal to $1/N$.
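In implementation this centering is applied to the kernel matrix directly, never to the mapped points. A minimal sketch, assuming `K` holds $k(\mathbf{x}_m,\mathbf{x}_n)$ over the $N$ training points and `K_new` holds $k(\mathbf{x},\mathbf{x}_n)$ for $M$ new points:

```python
import numpy as np

def center_kernel(K):
    """Center an N x N training kernel matrix: K~ = K - 1_N K - K 1_N + 1_N K 1_N."""
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)
    return K - one_N @ K - K @ one_N + one_N @ K @ one_N

def center_new_kernel(K_new, K):
    """Center an M x N kernel matrix between M new points and the N training points."""
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)
    one_MN = np.full((K_new.shape[0], N), 1.0 / N)
    return K_new - one_MN @ K - K_new @ one_N + one_MN @ K @ one_N

# The i-th component of a new point x is then sum_n a_in * k~(x_n, x), i.e. the rows of
# center_new_kernel(K_new, K) @ A, with A holding the normalized eigenvectors a_i as columns.
```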
In summary, the mapping $\phi(\cdot)$ in the kernel PCA algorithm is never explicitly specified, nor is the dimensionality of the feature space $\mathcal{F}$. Similarly, the covariance matrix $\mathbf{C}$ and its eigenvectors $\mathbf{v}_i$ appear only in the derivation above; they never need to be computed in the implementation. The potentially high dimensionality of $\mathcal{F}$ costs no extra computation, as only the kernel $k(\mathbf{x}_m,\mathbf{x}_n)$ is needed in the implementation.
Kernel PCA has all the advantages of regular PCA, as well as an implicit nonlinear mapping to a feature space where the features representing the structure in the data may be better extracted. For example, data classification may be carried out much more easily in $\mathcal{F}$ due to linear separability.
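As an illustration of this last point, the following small experiment (a sketch assuming scikit-learn is available; the dataset and parameter values are illustrative) maps two concentric rings, which no straight line can separate in the input plane, through kernel PCA with a Gaussian kernel and then fits a linear classifier to the resulting components:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Nonlinear features via kernel PCA with a Gaussian (RBF) kernel.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
Z = kpca.fit_transform(X)

# Compare a linear classifier in the original space vs. in the kernel PCA feature space.
acc_original = LogisticRegression().fit(X, y).score(X, y)
acc_kpca = LogisticRegression().fit(Z, y).score(Z, y)
print(f"linear classifier accuracy, original space:   {acc_original:.2f}")
print(f"linear classifier accuracy, kernel PCA space: {acc_kpca:.2f}")
```

With these parameters the linear classifier should perform far better on the kernel PCA features than on the raw coordinates, reflecting the linear separability gained in the feature space.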