Optimal transformation for maximizing $tr({\bf S}_{B/T})$

The previous method only maximizes the between-class scatteredness ${\bf S}_B$ without taking the within-class scatteredness ${\bf S}_W$, or, equivalently, the total scatteredness ${\bf S}_T={\bf S}_B+{\bf S}_W$, into consideration. It is therefore possible that in the space after the transform ${\bf y}={\bf A}^T{\bf x}$ the between-class scatteredness ${\bf S}_B^{(y)}$ is maximized but so is the total scatteredness ${\bf S}_T^{(y)}$, whereas to maximize the separability we want to maximize ${\bf S}_B$ and minimize ${\bf S}_W$ at the same time. We therefore choose $J=tr[{\bf S}_T^{-1}{\bf S}_B]=tr({\bf S}_{B/T})$ (which, as shown at the end of this section, is equivalent to using $tr[{\bf S}_W^{-1}{\bf S}_B]$) as a better criterion for feature selection.

In order to find the optimal transform matrix ${\bf A}=[{\bf a}_1,\cdots,{\bf a}_M]$, we consider the maximization of the objective function $J({\bf A})$:

\begin{displaymath}
J({\bf A}) = tr\;[ ({\bf S}_T^{(y)})^{-1}{\bf S}_B^{(y)}]=tr\;({\bf S}^{(y)}_{B/T})
\end{displaymath}

where

\begin{displaymath}
{\bf S}_T^{(y)}={\bf A}^T{\bf S}_T^{(x)}{\bf A},\;\;\;\;\;\;\;\;
{\bf S}_B^{(y)}={\bf A}^T{\bf S}_B^{(x)}{\bf A}
\end{displaymath}

and
\begin{eqnarray*}
{\bf S}_{B/T}^{(y)} &=& ({\bf S}_T^{(y)})^{-1}{\bf S}_B^{(y)}
=({\bf A}^T{\bf S}_T^{(x)}{\bf A})^{-1}({\bf A}^T{\bf S}_B^{(x)}{\bf A})
={\bf A}^{-1}({\bf S}_T^{(x)})^{-1}({\bf A}^T)^{-1}{\bf A}^T{\bf S}_B^{(x)}{\bf A} \\
&=& {\bf A}^{-1}({\bf S}_T^{(x)})^{-1}{\bf S}_B^{(x)}{\bf A}
={\bf A}^{-1}\,{\bf S}_{B/T}^{(x)}\,{\bf A}
\end{eqnarray*}

Note that as ${\bf S}^{(x)}_{B/T}$ is not symmetric (the product of two symmetric matrices is in general not symmetric), the method used previously for the maximization of $tr\;{\bf S}_B^{(y)}$ can no longer be used to maximize $tr\;({\bf S}^{(y)}_{B/T})$.
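To make the quantities above concrete, the following minimal NumPy sketch (an illustration added here, not part of the original derivation) estimates ${\bf S}_W$, ${\bf S}_B$ and ${\bf S}_T={\bf S}_W+{\bf S}_B$ from a small synthetic labeled sample, and checks that ${\bf S}_{B/T}={\bf S}_T^{-1}{\bf S}_B$ is indeed not symmetric; the data and all variable names are assumptions made purely for the example.

\begin{verbatim}
import numpy as np

def scatter_matrices(X, labels):
    """Estimate S_W, S_B and S_T = S_W + S_B from samples X (rows) with class labels."""
    N = X.shape[1]
    mean_all = X.mean(axis=0)                   # overall mean vector
    S_W = np.zeros((N, N))                      # within-class scatter
    S_B = np.zeros((N, N))                      # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        S_W += (Xc - mean_c).T @ (Xc - mean_c)  # scatter of class c about its mean
        d = (mean_c - mean_all)[:, None]
        S_B += Xc.shape[0] * (d @ d.T)          # class-mean scatter, weighted by size
    return S_W, S_B, S_W + S_B

# synthetic 3-class sample in N = 4 dimensions (illustrative data only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, size=(50, 4)) for m in (0.0, 1.0, 2.0)])
labels = np.repeat([0, 1, 2], 50)

S_W, S_B, S_T = scatter_matrices(X, labels)
S_BT = np.linalg.inv(S_T) @ S_B                 # S_{B/T}, generally not symmetric
print(np.abs(S_BT - S_BT.T).max())              # clearly nonzero
\end{verbatim}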

To address the problem of maximization of $J({\bf A})$, we first consider the case of $M=1$, i.e., ${\bf A}={\bf a}$, to find a single feature $y={\bf a}^T{\bf x}$. The objective function $J({\bf A})$ above becomes

\begin{displaymath}
J({\bf a})=\frac{s_B^{(y)}}{s_T^{(y)}}
=\frac{{\bf a}^T{\bf S}^{(x)}_B{\bf a}}{{\bf a}^T{\bf S}^{(x)}_T{\bf a}}
=R({\bf a})
\end{displaymath}

This function of ${\bf a}$ is the Rayleigh quotient $R({\bf a})$ of the two symmetric matrices ${\bf S}_B^{(x)}$ and ${\bf S}_T^{(x)}$. The optimal transform vector ${\bf a}$ that maximizes $J({\bf a})$ can be found by solving the corresponding generalized eigenvalue problem

\begin{displaymath}
{\bf S}_B^{(x)}{\bf\phi}=\lambda {\bf S}_T^{(x)}{\bf\phi}
\end{displaymath}

where ${\bf\phi}$ is an eigenvector of ${\bf S}_{B/T}=[{\bf S}_T^{(x)}]^{-1}{\bf S}_B^{(x)}$ with the corresponding eigenvalue $\lambda=R({\bf\phi})=J({\bf\phi})$. The objective $J({\bf a})$ is therefore maximized by choosing ${\bf a}$ to be the eigenvector ${\bf\phi}$ corresponding to the greatest eigenvalue $\lambda_{max}$.
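As a hedged illustration of this step (continuing the sketch above, not code from the original notes), SciPy's generalized symmetric eigensolver can solve ${\bf S}_B^{(x)}{\bf\phi}=\lambda{\bf S}_T^{(x)}{\bf\phi}$ directly, and the eigenvector associated with the largest eigenvalue attains the maximal Rayleigh quotient:

\begin{verbatim}
import numpy as np
from scipy.linalg import eigh

# Generalized eigenproblem S_B phi = lambda S_T phi (S_T assumed positive definite);
# eigh returns eigenvalues in ascending order, so the last pair is the best one.
lam, V = eigh(S_B, S_T)        # S_B, S_T as computed in the previous sketch
a = V[:, -1]                   # eigenvector for the greatest eigenvalue lambda_max

# The Rayleigh quotient of this direction equals lambda_max = J(a).
rayleigh = (a @ S_B @ a) / (a @ S_T @ a)
print(np.isclose(rayleigh, lam[-1]))   # True
\end{verbatim}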

Next, we generalize this approach to the case of $M>1$, by solving the generalized eigen-equation

\begin{displaymath}
{\bf S}_B^{(x)}{\bf\Phi}={\bf S}_T^{(x)}{\bf\Phi}{\bf\Lambda}
\end{displaymath}

where the eigenvector matrix ${\bf\Phi}={\bf A}$ is used as the linear transform matrix by which both ${\bf S}_T^{(x)}$ and ${\bf S}_B^{(x)}$ can be diagonalized at the same time:

\begin{displaymath}
\left\{ \begin{array}{l}
{\bf S}_B^{(y)}={\bf\Phi}^T{\bf S}_B^{(x)}{\bf\Phi}={\bf\Lambda} \\
{\bf S}_T^{(y)}={\bf\Phi}^T{\bf S}_T^{(x)}{\bf\Phi}={\bf I}
\end{array}\right.
\end{displaymath}

We see that the signal components in ${\bf y}$ are completely decorrelated, i.e., each of them carries some separability information independent of the others. We can therefore construct the transform matrix ${\bf\Phi}_{N\times M}=[{\bf\phi}_1,\cdots,{\bf\phi}_M]$ composed of the $M$ eigenvectors corresponding to the $M$ largest eigenvalues, with the Rayleigh quotient $\lambda_i=R({\bf\phi}_i)=J({\bf\phi}_i)$ representing the separability along the $i$th dimension $y_i={\bf\phi}_i^T{\bf x}$ ($i=1,\cdots,M$).
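A possible sketch of this construction (again continuing the example above, under the assumption that S_B and S_T are the scatter matrices computed earlier): the $M$ generalized eigenvectors with the largest eigenvalues form ${\bf\Phi}$, which simultaneously diagonalizes both scatter matrices.

\begin{verbatim}
import numpy as np
from scipy.linalg import eigh

M = 2                                   # number of features to keep (M < N)
lam, V = eigh(S_B, S_T)                 # eigenvalues in ascending order
Phi = V[:, ::-1][:, :M]                 # N x M, columns for the M largest eigenvalues

# Phi diagonalizes both scatter matrices at the same time.
print(np.allclose(Phi.T @ S_T @ Phi, np.eye(M)))               # S_T^(y) = I
print(np.allclose(Phi.T @ S_B @ Phi, np.diag(lam[::-1][:M])))  # S_B^(y) = Lambda

Y = X @ Phi                             # the M selected features y = Phi^T x
\end{verbatim}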

The generalized eigen-equation above can be further written as

\begin{displaymath}
{\bf S}_B^{(x)}{\bf\Phi}={\bf S}_T^{(x)}{\bf\Phi}{\bf\Lambda}
=({\bf S}_B^{(x)}+{\bf S}_W^{(x)}){\bf\Phi}{\bf\Lambda},
\;\;\;\;\mbox{i.e.,}\;\;\;\;
{\bf S}_W^{(x)}{\bf\Phi}{\bf\Lambda}={\bf S}_B^{(x)}{\bf\Phi}({\bf I}-{\bf\Lambda})
\end{displaymath}


\begin{displaymath}
{\bf S}_B^{(x)}{\bf\Phi}={\bf S}_W^{(x)}{\bf\Phi}{\bf\Lambda}({\bf I}-{\bf\Lambda})^{-1}
={\bf S}_W^{(x)}{\bf\Phi}{\bf\Theta},
\;\;\;\;\mbox{i.e.,}\;\;\;\;
({\bf S}_W^{(x)})^{-1}{\bf S}_B^{(x)}{\bf\Phi}
={\bf S}_{B/W}^{(x)}{\bf\Phi}={\bf\Phi}{\bf\Theta}
\end{displaymath}

where ${\bf\Theta}$ is a diagonal matrix composed of $\theta_i=\lambda_i/(1-\lambda_i)$ on its diagonal:

\begin{displaymath}
{\bf\Theta}=diag[\theta_1,\cdots,\theta_N]
=diag[\lambda_1/(1-\lambda_1),\cdots,\lambda_N/(1-\lambda_N)]
\end{displaymath}

We see that ${\bf S}_{B/W}$ and ${\bf S}_{B/T}$ share the same eigenvectors, and since $\theta=\lambda/(1-\lambda)$ is a monotonically increasing function of $\lambda\in[0,1)$, the two sets of eigenvalues rank these eigenvectors in the same order. Consequently, using the $M$ greatest eigenvalues $\theta_i$ of ${\bf S}_{B/W}$ to maximize $tr({\bf S}_{B/W}^{(y)})$ is equivalent to using the $M$ greatest eigenvalues $\lambda_i$ of ${\bf S}_{B/T}$ to maximize $tr({\bf S}_{B/T}^{(y)})$.
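This equivalence can be checked numerically with the same assumed data as before (a hedged verification, not part of the original notes): solving the two generalized eigenvalue problems gives eigenvalues related by $\theta_i=\lambda_i/(1-\lambda_i)$, so both criteria select the same directions.

\begin{verbatim}
import numpy as np
from scipy.linalg import eigh

lam, _ = eigh(S_B, S_T)        # eigenvalues of S_{B/T} = S_T^{-1} S_B
theta, _ = eigh(S_B, S_W)      # eigenvalues of S_{B/W} = S_W^{-1} S_B

# theta_i = lambda_i / (1 - lambda_i) is monotonically increasing in lambda_i,
# so both criteria rank the eigen-directions in the same order.
print(np.allclose(theta, lam / (1.0 - lam)))   # True
\end{verbatim}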


Ruye Wang 2017-03-30