next up previous
Next: Information conservation in feature Up: Feature Selection Previous: Optimal transformation for maximizing

Optimal transformation for maximizing $tr({\bf S}_{B/T})$

The previous method only maximizes the between-class scatteredness ${\bf S}_B$ without taking into consideration the within-class scatteredness ${\bf S}_W$, or, equivalently, the total scatteredness ${\bf S}_T={\bf S}_B+{\bf S}_W$. As it is possible that in the space after the transform ${\bf y}={\bf A}^T {\bf x}$ the between-class scatteredness ${\bf S}_B^{(y)}$ is maximized but so is the total scatteredness ${\bf S}_T^{(y)}$, while it is desirable to maximize ${\bf S}_B$ and minimize ${\bf S}_W$ at the same time in order to maximize the separability. A better criterion for feature selection can be $J=tr[{\bf S}_W^{-1}{\bf S}_B]$ as it relative separability.

In order to find the optimal transform matrix ${\bf A}=[{\bf a}_1,\cdots,{\bf a}_M]$, we consider the maximization of the objective function $J({\bf A})$:

\begin{displaymath}
J({\bf A}) = tr\;[ ({\bf S}_T^{(y)})^{-1}{\bf S}_B^{(y)}]=tr\;({\bf S}^{(y)}_{B/T})
\end{displaymath}

where

\begin{displaymath}
{\bf S}_T^{(y)}={\bf A}^T{\bf S}_T^{(x)}{\bf A},\;\;\;\;\;\;\;\;
{\bf S}_B^{(y)}={\bf A}^T{\bf S}_B^{(x)}{\bf A}
\end{displaymath}

and
$\displaystyle {\bf S}_{B/T}^{(y)}$ $\textstyle =$ $\displaystyle ({\bf S}_T^{(y)})^{-1}{\bf S}_B^{(y)}
=({\bf A}^T{\bf S}_T^{(x)}...
...bf A}^{-1}({\bf S}_T^{(x)})^{-1}({\bf A}^T)^{-1}{\bf A}^T{\bf S}_B^{(x)}{\bf A}$  
  $\textstyle =$ $\displaystyle {\bf A}^{-1}({\bf S}_T^{(x)})^{-1}{\bf S}_B^{(x)}{\bf A}
={\bf A}^{-1} {\bf S}_{B/T}^{(x)} {\bf A}$  

Note that as ${\bf S}^{(x)}_{B/T}$ is not symmetric (the product of two symmetric matrices is in general not symmetric), the KLT method used for the maximization of $tr\;{\bf S}_B^{(y)}$ considered previously can no longer be used to maximize $tr\;({\bf S}^{(y)}_{B/T})$.

To find the transform matrix ${\bf A}$ so that $J({\bf A})=tr\;[ ({\bf S}_T^{(y)})^{-1}{\bf S}_B^{(y)}]$ is maximized, we first consider the case of $M=1$ to find a single feature $y={\bf a}^T{\bf x}$, obtained by an $N\times 1$ vector ${\bf A}={\bf a}$. The objective function $J({\bf A})$ above becomes

\begin{displaymath}
J({\bf a})=\frac{s_B^{(y)}}{s_T^{(y)}}
=\frac{{\bf a}^T{\bf S}^{(x)}_B{\bf a}}{{\bf a}^T{\bf S}^{(x)}_T{\bf a}}
=R({\bf a})
\end{displaymath}

This function of ${\bf a}$ is the Rayleigh quotient $R({\bf a})$ of the two symmetric matrices ${\bf S}_B^{(x)}$ and ${\bf S}_T^{(x)}$. The optimal transform vector ${\bf a}$ that maximizes $J({\bf a})$ can be found by solving the corresponding generalized eigenvalue problem

\begin{displaymath}
{\bf S}_B^{(x)}{\bf\phi}_i=\lambda_i {\bf S}_T^{(x)}{\bf\phi}_i,\;\;\;\;\;\;
(i=1,\cdots,N)
\end{displaymath}

or in matrix form:

\begin{displaymath}
{\bf S}_B^{(x)}{\bf\Phi}={\bf S}_T^{(x)}{\bf\Phi}{\bf\Lambda}
\end{displaymath}

where ${\bf\Phi}=[{\bf\phi}_1,\cdots,{\bf\phi}_N]$, and ${\bf\phi}_i$ is the ith eigenvector of ${\bf S}_{B/T}=[{\bf S}_T^{(x)}]^{-1}{\bf S}_B^{(x)}$, and the corresponding eigenvalue is $\lambda_i=R({\bf a})=J({\bf a})$, which is maximized by the eigenvector ${\bf\phi}_1>{\bf\phi}_i$ ($i=2,\cdots,N$) corresponding to the greatest eigenvalue $\lambda_{max}$.

By solving the above generalized eigen-equation, we get the eigenvector matrix ${\bf\Phi}$, which can be used as the linear transform matrix ${\bf A}$ by which both ${\bf S}_T^{(x)}$ and ${\bf S}_B^{(x)}$ can be diagonalized at the same time:

\begin{displaymath}
\left\{ \begin{array}{l}
{\bf S}_B^{(y)}={\bf\Phi}^T{\bf S}...
...{\bf\Phi}^T{\bf S}_T^{(x)}{\bf\Phi}={\bf I}
\end{array}\right.
\end{displaymath}

and we get

\begin{displaymath}
{\bf S}_{B/T}^{(y)}=({\bf S}_T^{(y)})^{-1}{\bf S}_B^{(y)}={\bf\Lambda}
\end{displaymath}

We can therefore construct a transform matrix ${\bf\Phi}_{N\times M}=[{\bf\phi}_1,
\cdots,{\bf\phi}_M]$ composed of the $M$ eigenvectors ${\bf\phi}_m$ corresponding to the $M$ largest eigenvalues the Rayleigh quotient $\lambda_m=R({\bf\phi}_m)=J({\bf\phi}_m)$ representing the separability in the mth dimension for $y_i={\bf\phi}_i^T{\bf x}$, ($m=1,\cdots,M$), so that in the resulting M-D space,

The generalized eigen-equation above can be further written as

\begin{displaymath}
{\bf S}_B^{(x)}{\bf\Phi}={\bf S}_T^{(x)}{\bf\Phi}{\bf\Lambd...
...hi}{\bf\Lambda}={\bf S}_B^{(x)}{\bf\Phi}({\bf I}-{\bf\Lambda})
\end{displaymath}


\begin{displaymath}
{\bf S}_B^{(x)}{\bf\Phi}={\bf S}_W^{(x)} {\bf\Phi}{\bf\Lamb...
...}{\bf\Phi}
={\bf S}_{B/W}^{(x)}{\bf\Phi}={\bf\Phi}{\bf\Theta}
\end{displaymath}

where ${\bf\Theta}$ is a diagonal matrix composed of $\theta_i=\lambda_i/(1-\lambda_i)$ on its diagonal:

\begin{displaymath}
{\bf\Theta}=diag[\theta_1,\cdots,\theta_N]
=diag[\lambda_1/(1-\lambda_1),\cdots,\lambda_N/(1-\lambda_N)]
\end{displaymath}

We see that using the $M$ greatest eigenvalues $\theta_i$ of ${\bf S}_{B/W}$ to maximize ${\bf S}_{B/W}$ is equivalent to using the $M$ greatest eigenvalues $\lambda_i$ of ${\bf S}_{B/T}$ to maximize ${\bf S}_{B/T}$.


next up previous
Next: Information conservation in feature Up: Feature Selection Previous: Optimal transformation for maximizing
Ruye Wang 2016-11-30