Optimal transformation for maximizing

Next: Information conservation in feature Up: Feature Selection Previous: Optimal transformation for maximizing

Optimal transformation for maximizing $tr({\bf S}_{B/T})$

The previous method only maximizes the between-class scatteredness ${\bf S}_B$ without taking into consideration the within-class scatteredness ${\bf S}_W$ , or, equivalently, the total scatteredness ${\bf S}_T={\bf S}_B+{\bf S}_W$ . As it is possible that in the space after the transform ${\bf y}={\bf A}^T {\bf x}$ the between-class scatteredness ${\bf S}_B^{(y)}$ is maximized but so is the total scatteredness ${\bf S}_T^{(y)}$ , while it is desirable to maximize ${\bf S}_B$ and minimize ${\bf S}_W$ at the same time in order to maximize the separability. A better criterion for feature selection can be $J=tr[{\bf S}_W^{-1}{\bf S}_B]$ as it relative separability.

In order to find the optimal transform matrix ${\bf A}=[{\bf a}_1,\cdots,{\bf a}_M]$ , we consider the maximization of the objective function $J({\bf A})$ :

$\begin{displaymath} J({\bf A}) = tr\;[ ({\bf S}_T^{(y)})^{-1}{\bf S}_B^{(y)}]=tr\;({\bf S}^{(y)}_{B/T}) \end{displaymath}$

where

$\begin{displaymath} {\bf S}_T^{(y)}={\bf A}^T{\bf S}_T^{(x)}{\bf A},\;\;\;\;\;\;\;\; {\bf S}_B^{(y)}={\bf A}^T{\bf S}_B^{(x)}{\bf A} \end{displaymath}$

and

$\displaystyle {\bf S}_{B/T}^{(y)}$	$\textstyle =$	$\displaystyle ({\bf S}_T^{(y)})^{-1}{\bf S}_B^{(y)} =({\bf A}^T{\bf S}_T^{(x)}... ...bf A}^{-1}({\bf S}_T^{(x)})^{-1}({\bf A}^T)^{-1}{\bf A}^T{\bf S}_B^{(x)}{\bf A}$
	$\textstyle =$	$\displaystyle {\bf A}^{-1}({\bf S}_T^{(x)})^{-1}{\bf S}_B^{(x)}{\bf A} ={\bf A}^{-1} {\bf S}_{B/T}^{(x)} {\bf A}$

Note that as ${\bf S}^{(x)}_{B/T}$ is not symmetric (the product of two symmetric matrices is in general not symmetric), the KLT method used for the maximization of $tr\;{\bf S}_B^{(y)}$ considered previously can no longer be used to maximize $tr\;({\bf S}^{(y)}_{B/T})$ .

To find the transform matrix ${\bf A}$ so that $J({\bf A})=tr\;[ ({\bf S}_T^{(y)})^{-1}{\bf S}_B^{(y)}]$ is maximized, we first consider the case of to find a single feature $y={\bf a}^T{\bf x}$ , obtained by an $N\times 1$ vector ${\bf A}={\bf a}$ . The objective function $J({\bf A})$ above becomes

$\begin{displaymath} J({\bf a})=\frac{s_B^{(y)}}{s_T^{(y)}} =\frac{{\bf a}^T{\bf S}^{(x)}_B{\bf a}}{{\bf a}^T{\bf S}^{(x)}_T{\bf a}} =R({\bf a}) \end{displaymath}$

This function of ${\bf a}$ is the Rayleigh quotient $R({\bf a})$ of the two symmetric matrices ${\bf S}_B^{(x)}$ and ${\bf S}_T^{(x)}$ . The optimal transform vector ${\bf a}$ that maximizes $J({\bf a})$ can be found by solving the corresponding generalized eigenvalue problem

$\begin{displaymath} {\bf S}_B^{(x)}{\bf\phi}_i=\lambda_i {\bf S}_T^{(x)}{\bf\phi}_i,\;\;\;\;\;\; (i=1,\cdots,N) \end{displaymath}$

or in matrix form:

$\begin{displaymath} {\bf S}_B^{(x)}{\bf\Phi}={\bf S}_T^{(x)}{\bf\Phi}{\bf\Lambda} \end{displaymath}$

where ${\bf\Phi}=[{\bf\phi}_1,\cdots,{\bf\phi}_N]$ , and ${\bf\phi}_i$ is the ith eigenvector of ${\bf S}_{B/T}=[{\bf S}_T^{(x)}]^{-1}{\bf S}_B^{(x)}$ , and the corresponding eigenvalue is $\lambda_i=R({\bf a})=J({\bf a})$ , which is maximized by the eigenvector ${\bf\phi}_1>{\bf\phi}_i$ ( $i=2,\cdots,N$ ) corresponding to the greatest eigenvalue $\lambda_{max}$ .

By solving the above generalized eigen-equation, we get the eigenvector matrix ${\bf\Phi}$ , which can be used as the linear transform matrix ${\bf A}$ by which both ${\bf S}_T^{(x)}$ and ${\bf S}_B^{(x)}$ can be diagonalized at the same time:

$\begin{displaymath} \left\{ \begin{array}{l} {\bf S}_B^{(y)}={\bf\Phi}^T{\bf S}... ...{\bf\Phi}^T{\bf S}_T^{(x)}{\bf\Phi}={\bf I} \end{array}\right. \end{displaymath}$

and we get

$\begin{displaymath} {\bf S}_{B/T}^{(y)}=({\bf S}_T^{(y)})^{-1}{\bf S}_B^{(y)}={\bf\Lambda} \end{displaymath}$

We can therefore construct a transform matrix ${\bf\Phi}_{N\times M}=[{\bf\phi}_1, \cdots,{\bf\phi}_M]$ composed of the

eigenvectors ${\bf\phi}_m$ corresponding to the

largest eigenvalues the Rayleigh quotient $\lambda_m=R({\bf\phi}_m)=J({\bf\phi}_m)$ representing the separability in the mth dimension for $y_i={\bf\phi}_i^T{\bf x}$ , ( $m=1,\cdots,M$ ), so that in the resulting M-D space,

the signal components in ${\bf y}={\bf\Phi}^T{\bf x}$ are completely decorrelated, i.e., they each carry some separability information independent of others.
the separability along each signal component $y_m={\bf\phi}_m{\bf x}$ is maximized

$\begin{displaymath} J_m=\frac{s_B^{y_m}}{s_T^{y_m}}=\lambda_m,\;\;\;\;\;(m=1,\cdots,M) \end{displaymath}$

The generalized eigen-equation above can be further written as

$\begin{displaymath} {\bf S}_B^{(x)}{\bf\Phi}={\bf S}_T^{(x)}{\bf\Phi}{\bf\Lambd... ...hi}{\bf\Lambda}={\bf S}_B^{(x)}{\bf\Phi}({\bf I}-{\bf\Lambda}) \end{displaymath}$

$\begin{displaymath} {\bf S}_B^{(x)}{\bf\Phi}={\bf S}_W^{(x)} {\bf\Phi}{\bf\Lamb... ...}{\bf\Phi} ={\bf S}_{B/W}^{(x)}{\bf\Phi}={\bf\Phi}{\bf\Theta} \end{displaymath}$

where ${\bf\Theta}$ is a diagonal matrix composed of $\theta_i=\lambda_i/(1-\lambda_i)$ on its diagonal:

$\begin{displaymath} {\bf\Theta}=diag[\theta_1,\cdots,\theta_N] =diag[\lambda_1/(1-\lambda_1),\cdots,\lambda_N/(1-\lambda_N)] \end{displaymath}$

We see that using the

greatest eigenvalues $\theta_i$ of ${\bf S}_{B/W}$ to maximize ${\bf S}_{B/W}$ is equivalent to using the

greatest eigenvalues $\lambda_i$ of ${\bf S}_{B/T}$ to maximize ${\bf S}_{B/T}$ .

Next: Information conservation in feature Up: Feature Selection Previous: Optimal transformation for maximizing

Ruye Wang 2016-11-30