Optimal transformation for maximizing $tr({\bf S}_{B/T})$

The previous method only maximizes the between-class scatteredness ${\bf S}_B$ without taking the within-class scatteredness ${\bf S}_W$, or, equivalently, the total scatteredness ${\bf S}_T={\bf S}_B+{\bf S}_W$, into consideration. It is therefore possible that in the space after the transform ${\bf y}={\bf A}^T{\bf x}$ the between-class scatteredness ${\bf S}_B^{(y)}$ is maximized but so is the total scatteredness ${\bf S}_T^{(y)}$, whereas to maximize the separability we want to maximize ${\bf S}_B$ and minimize ${\bf S}_W$ at the same time. We therefore choose $J=tr[{\bf S}_T^{-1}{\bf S}_B]=tr({\bf S}_{B/T})$ (which, as shown at the end of this section, is equivalent to using $tr[{\bf S}_W^{-1}{\bf S}_B]$) as a better criterion for feature selection.

In order to find the optimal transform matrix ${\bf A}=[{\bf a}_1,\cdots,{\bf a}_M]$, we consider the maximization of the objective function $J({\bf A})$:

\begin{displaymath}
J({\bf A}) = tr\;[ ({\bf S}_T^{(y)})^{-1}{\bf S}_B^{(y)}]=tr\;({\bf S}^{(y)}_{B/T})
\end{displaymath}

where

\begin{displaymath}
{\bf S}_T^{(y)}={\bf A}^T{\bf S}_T^{(x)}{\bf A},\;\;\;\;\;\;\;\;
{\bf S}_B^{(y)}={\bf A}^T{\bf S}_B^{(x)}{\bf A}
\end{displaymath}

and
\begin{eqnarray*}
{\bf S}_{B/T}^{(y)} &=& ({\bf S}_T^{(y)})^{-1}{\bf S}_B^{(y)}
=({\bf A}^T{\bf S}_T^{(x)}{\bf A})^{-1}({\bf A}^T{\bf S}_B^{(x)}{\bf A})
={\bf A}^{-1}({\bf S}_T^{(x)})^{-1}({\bf A}^T)^{-1}{\bf A}^T{\bf S}_B^{(x)}{\bf A} \\
&=& {\bf A}^{-1}({\bf S}_T^{(x)})^{-1}{\bf S}_B^{(x)}{\bf A}
={\bf A}^{-1}\,{\bf S}_{B/T}^{(x)}\,{\bf A}
\end{eqnarray*}

Note that as ${\bf S}^{(x)}_{B/T}$ is not symmetric (the product of two symmetric matrices is in general not symmetric), the method used previously for the maximization of $tr\;{\bf S}_B^{(y)}$ can no longer be used to maximize $tr\;({\bf S}^{(y)}_{B/T})$.
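To make the quantities above concrete, the following minimal NumPy sketch (an illustration added here, not part of the original derivation) estimates ${\bf S}_W$, ${\bf S}_B$ and ${\bf S}_T={\bf S}_W+{\bf S}_B$ from a small synthetic labeled sample, and checks that ${\bf S}_{B/T}={\bf S}_T^{-1}{\bf S}_B$ is indeed not symmetric; the data and all variable names are assumptions made purely for the example.

\begin{verbatim}
import numpy as np

def scatter_matrices(X, labels):
    """Estimate S_W, S_B and S_T = S_W + S_B from samples X (rows) with class labels."""
    N = X.shape[1]
    mean_all = X.mean(axis=0)                   # overall mean vector
    S_W = np.zeros((N, N))                      # within-class scatter
    S_B = np.zeros((N, N))                      # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        S_W += (Xc - mean_c).T @ (Xc - mean_c)  # scatter of class c about its mean
        d = (mean_c - mean_all)[:, None]
        S_B += Xc.shape[0] * (d @ d.T)          # class-mean scatter, weighted by size
    return S_W, S_B, S_W + S_B

# synthetic 3-class sample in N = 4 dimensions (illustrative data only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, size=(50, 4)) for m in (0.0, 1.0, 2.0)])
labels = np.repeat([0, 1, 2], 50)

S_W, S_B, S_T = scatter_matrices(X, labels)
S_BT = np.linalg.inv(S_T) @ S_B                 # S_{B/T}, generally not symmetric
print(np.abs(S_BT - S_BT.T).max())              # clearly nonzero
\end{verbatim}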

To address the problem of maximization of $J({\bf A})$, we first consider the case of $M=1$, i.e., ${\bf A}={\bf a}$, to find a single feature $y={\bf a}^T{\bf x}$. The objective function $J({\bf A})$ above becomes

\begin{displaymath}
J({\bf a})=\frac{s_B^{(y)}}{s_T^{(y)}}
=\frac{{\bf a}^T{\bf S}^{(x)}_B{\bf a}}{{\bf a}^T{\bf S}^{(x)}_T{\bf a}}
=R({\bf a})
\end{displaymath}

This function of ${\bf a}$ is the Rayleigh quotient $R({\bf a})$ of the two symmetric matrices ${\bf S}_B^{(x)}$ and ${\bf S}_T^{(x)}$. The optimal transform vector ${\bf a}$ that maximizes $J({\bf a})$ can be found by solving the corresponding generalized eigenvalue problem

\begin{displaymath}
{\bf S}_B^{(x)}{\bf\phi}=\lambda {\bf S}_T^{(x)}{\bf\phi}
\end{displaymath}

where ${\bf\phi}$ is an eigenvector of ${\bf S}_{B/T}=[{\bf S}_T^{(x)}]^{-1}{\bf S}_B^{(x)}$ with the corresponding eigenvalue $\lambda=R({\bf\phi})=J({\bf\phi})$. The objective $J({\bf a})$ is therefore maximized by choosing ${\bf a}$ to be the eigenvector ${\bf\phi}$ corresponding to the greatest eigenvalue $\lambda_{max}$.
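As a hedged illustration of this step (continuing the sketch above, not code from the original notes), SciPy's generalized symmetric eigensolver can solve ${\bf S}_B^{(x)}{\bf\phi}=\lambda{\bf S}_T^{(x)}{\bf\phi}$ directly, and the eigenvector associated with the largest eigenvalue attains the maximal Rayleigh quotient:

\begin{verbatim}
import numpy as np
from scipy.linalg import eigh

# Generalized eigenproblem S_B phi = lambda S_T phi (S_T assumed positive definite);
# eigh returns eigenvalues in ascending order, so the last pair is the best one.
lam, V = eigh(S_B, S_T)        # S_B, S_T as computed in the previous sketch
a = V[:, -1]                   # eigenvector for the greatest eigenvalue lambda_max

# The Rayleigh quotient of this direction equals lambda_max = J(a).
rayleigh = (a @ S_B @ a) / (a @ S_T @ a)
print(np.isclose(rayleigh, lam[-1]))   # True
\end{verbatim}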

Next, we generalize this approach to the case of $M>1$, by solving the generalized eigen-equation

\begin{displaymath}
{\bf S}_B^{(x)}{\bf\Phi}={\bf S}_T^{(x)}{\bf\Phi}{\bf\Lambda}
\end{displaymath}

where the eigenvector matrix ${\bf\Phi}={\bf A}$ is used as the linear transform matrix by which both ${\bf S}_T^{(x)}$ and ${\bf S}_B^{(x)}$ can be diagonalized at the same time:

\begin{displaymath}
\left\{ \begin{array}{l}
{\bf S}_B^{(y)}={\bf\Phi}^T{\bf S}_B^{(x)}{\bf\Phi}={\bf\Lambda} \\
{\bf S}_T^{(y)}={\bf\Phi}^T{\bf S}_T^{(x)}{\bf\Phi}={\bf I}
\end{array}\right.
\end{displaymath}

We see that the signal components in ${\bf y}$ are completely decorrelated, i.e., each of them carries some separability information independent of the others. We can therefore construct the transform matrix ${\bf\Phi}_{N\times M}=[{\bf\phi}_1,\cdots,{\bf\phi}_M]$ composed of the $M$ eigenvectors corresponding to the $M$ largest eigenvalues, with the Rayleigh quotient $\lambda_i=R({\bf\phi}_i)=J({\bf\phi}_i)$ representing the separability along the $i$th dimension $y_i={\bf\phi}_i^T{\bf x}$ ($i=1,\cdots,M$).
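A possible sketch of this construction (again continuing the example above, under the assumption that S_B and S_T are the scatter matrices computed earlier): the $M$ generalized eigenvectors with the largest eigenvalues form ${\bf\Phi}$, which simultaneously diagonalizes both scatter matrices.

\begin{verbatim}
import numpy as np
from scipy.linalg import eigh

M = 2                                   # number of features to keep (M < N)
lam, V = eigh(S_B, S_T)                 # eigenvalues in ascending order
Phi = V[:, ::-1][:, :M]                 # N x M, columns for the M largest eigenvalues

# Phi diagonalizes both scatter matrices at the same time.
print(np.allclose(Phi.T @ S_T @ Phi, np.eye(M)))               # S_T^(y) = I
print(np.allclose(Phi.T @ S_B @ Phi, np.diag(lam[::-1][:M])))  # S_B^(y) = Lambda

Y = X @ Phi                             # the M selected features y = Phi^T x
\end{verbatim}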

The generalized eigen-equation above can be further written as

\begin{displaymath}
{\bf S}_B^{(x)}{\bf\Phi}={\bf S}_T^{(x)}{\bf\Phi}{\bf\Lambda}
=({\bf S}_B^{(x)}+{\bf S}_W^{(x)}){\bf\Phi}{\bf\Lambda},
\;\;\;\;\mbox{i.e.,}\;\;\;\;
{\bf S}_W^{(x)}{\bf\Phi}{\bf\Lambda}={\bf S}_B^{(x)}{\bf\Phi}({\bf I}-{\bf\Lambda})
\end{displaymath}


\begin{displaymath}
{\bf S}_B^{(x)}{\bf\Phi}={\bf S}_W^{(x)}{\bf\Phi}{\bf\Lambda}({\bf I}-{\bf\Lambda})^{-1}
={\bf S}_W^{(x)}{\bf\Phi}{\bf\Theta},
\;\;\;\;\mbox{i.e.,}\;\;\;\;
({\bf S}_W^{(x)})^{-1}{\bf S}_B^{(x)}{\bf\Phi}
={\bf S}_{B/W}^{(x)}{\bf\Phi}={\bf\Phi}{\bf\Theta}
\end{displaymath}

where ${\bf\Theta}$ is a diagonal matrix composed of $\theta_i=\lambda_i/(1-\lambda_i)$ on its diagonal:

\begin{displaymath}
{\bf\Theta}=diag[\theta_1,\cdots,\theta_N]
=diag[\lambda_1/(1-\lambda_1),\cdots,\lambda_N/(1-\lambda_N)]
\end{displaymath}

We see that ${\bf S}_{B/W}$ and ${\bf S}_{B/T}$ share the same eigenvectors, and since $\theta=\lambda/(1-\lambda)$ is a monotonically increasing function of $\lambda\in[0,1)$, the two sets of eigenvalues rank these eigenvectors in the same order. Consequently, using the $M$ greatest eigenvalues $\theta_i$ of ${\bf S}_{B/W}$ to maximize $tr({\bf S}_{B/W}^{(y)})$ is equivalent to using the $M$ greatest eigenvalues $\lambda_i$ of ${\bf S}_{B/T}$ to maximize $tr({\bf S}_{B/T}^{(y)})$.
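This equivalence can be checked numerically with the same assumed data as before (a hedged verification, not part of the original notes): solving the two generalized eigenvalue problems gives eigenvalues related by $\theta_i=\lambda_i/(1-\lambda_i)$, so both criteria select the same directions.

\begin{verbatim}
import numpy as np
from scipy.linalg import eigh

lam, _ = eigh(S_B, S_T)        # eigenvalues of S_{B/T} = S_T^{-1} S_B
theta, _ = eigh(S_B, S_W)      # eigenvalues of S_{B/W} = S_W^{-1} S_B

# theta_i = lambda_i / (1 - lambda_i) is monotonically increasing in lambda_i,
# so both criteria rank the eigen-directions in the same order.
print(np.allclose(theta, lam / (1.0 - lam)))   # True
\end{verbatim}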


Ruye Wang 2017-03-30