
Feature Selection

The main purpose of feature selection is to reduce the computational cost of recognition/classification by using only $M$ features instead of all $N$ original ones. These features can be either chosen directly from the original ones, or generated as linear combinations of the original ones. The selected features should retain as much separability information as possible.

Choose features from original ones
There are

$\begin{displaymath} C_N^M=\frac{N!}{(N-M)!\;M!} \end{displaymath}$

ways to choose $M$ features out of the $N$ original ones. We need to find the $M$ features that span an M-D feature space in which one of the following separability criteria is maximized:
- $\begin{displaymath} J_1=\sum_{i \neq j} P_iP_j d_B(\omega_i,\omega_j) \end{displaymath}$
  
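  where $P_i$ and $P_j$ are the a priori probabilities of classes $\omega_i$ and $\omega_j$, respectively, and $d_B(\omega_i,\omega_j)$ is the Bhattacharyya distance between the ith and jth classes (assumed normally distributed with means ${\bf m}_i$, ${\bf m}_j$ and covariances ${\bf\Sigma}_i$, ${\bf\Sigma}_j$):
  
  $\begin{displaymath} d_B(\omega_i, \omega_j)=\frac{1}{4}({\bf m}_i-{\bf m}_j)^T({\bf\Sigma}_i+{\bf\Sigma}_j)^{-1}({\bf m}_i-{\bf m}_j)+\frac{1}{2}\log\left[\frac{\left\vert\frac{{\bf\Sigma}_i+{\bf\Sigma}_j}{2}\right\vert}{(\left\vert{\bf\Sigma}_i\right\vert\;\left\vert{\bf\Sigma}_j\right\vert)^{1/2}}\right] \end{displaymath}$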
- $\begin{displaymath} J_2=tr\;( {\bf S}_W^{-1} {\bf S}_B)=tr\; ({\bf S}_{B/W}) \end{displaymath}$
  
  where, for convenience, ${\bf S}_{B/W}$ is defined as
  
  $\begin{displaymath} {\bf S}_{B/W}\stackrel{\triangle}{=}{\bf S}_W^{-1}{\bf S}_B \end{displaymath}$
- $\begin{displaymath} J_3=tr\;( {\bf S}_T^{-1} {\bf S}_B)=tr\; ({\bf S}_{B/T}) \end{displaymath}$
  
  where ${\bf S}_{B/T}$ is defined as
  
  $\begin{displaymath} {\bf S}_{B/T}\stackrel{\triangle}{=}{\bf S}_T^{-1}{\bf S}_B \end{displaymath}$
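
The criterion $J_1$ can be evaluated numerically once the class means, covariances and priors are known. The following is a minimal sketch in Python with NumPy, assuming Gaussian classes and the standard closed form of the Bhattacharyya distance (quadratic term plus one half of the log-determinant ratio). The function names `bhattacharyya` and `j1` are illustrative and not part of the original notes.

```python
import numpy as np

def bhattacharyya(m_i, S_i, m_j, S_j):
    """Bhattacharyya distance between two Gaussian classes.

    m_i, m_j: mean vectors; S_i, S_j: covariance matrices (assumed invertible).
    """
    dm = m_i - m_j
    S_sum = S_i + S_j
    # Quadratic term: (1/4) dm^T (S_i + S_j)^{-1} dm
    quad = 0.25 * dm @ np.linalg.solve(S_sum, dm)
    # Log term: (1/2) log( |(S_i+S_j)/2| / sqrt(|S_i| |S_j|) ),
    # computed with slogdet to avoid overflow in the determinants.
    _, ld_avg = np.linalg.slogdet(S_sum / 2.0)
    _, ld_i = np.linalg.slogdet(S_i)
    _, ld_j = np.linalg.slogdet(S_j)
    return quad + 0.5 * (ld_avg - 0.5 * (ld_i + ld_j))

def j1(priors, means, covs):
    """J_1 = sum over all ordered pairs i != j of P_i P_j d_B(w_i, w_j)."""
    C = len(priors)
    total = 0.0
    for i in range(C):
        for j in range(C):
            if i != j:
                total += priors[i] * priors[j] * bhattacharyya(
                    means[i], covs[i], means[j], covs[j])
    return total

# Hypothetical two-class example: unit covariances, means two units apart.
means = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
covs = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]
print("d_B =", bhattacharyya(means[0], covs[0], means[1], covs[1]))
print("J_1 =", j1(priors, means, covs))
```

With equal covariances the log term vanishes, so in this example the distance comes from the quadratic term alone.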

As ${\bf S}_T={\bf S}_W+{\bf S}_B$, $J_2$ and $J_3$ are equivalent.

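The exhaustive search over all $C_N^M$ subsets can be sketched as follows, using $J_2$ as the criterion. This is only an illustration under stated assumptions: the scatter matrices are taken to be ${\bf S}_W=\sum_i P_i{\bf\Sigma}_i$ and ${\bf S}_B=\sum_i P_i({\bf m}_i-{\bf m})({\bf m}_i-{\bf m})^T$ with priors estimated from class frequencies (these definitions are not restated in this section), and the helper names and the synthetic data are hypothetical.

```python
import numpy as np
from itertools import combinations
from math import comb

def scatter_matrices(X, labels):
    """Within-class (S_W) and between-class (S_B) scatter matrices.

    X: (n_samples, N) data matrix; labels: (n_samples,) class labels.
    Priors P_i are estimated as class frequencies (an assumption).
    """
    N = X.shape[1]
    m = X.mean(axis=0)
    S_W = np.zeros((N, N))
    S_B = np.zeros((N, N))
    for c in np.unique(labels):
        Xc = X[labels == c]
        P = len(Xc) / len(X)
        mc = Xc.mean(axis=0)
        S_W += P * np.cov(Xc, rowvar=False, bias=True)
        d = (mc - m)[:, None]
        S_B += P * (d @ d.T)
    return S_W, S_B

def j2(S_W, S_B, idx):
    """J_2 = tr(S_W^{-1} S_B) in the subspace of the features listed in idx."""
    sub = np.ix_(list(idx), list(idx))
    return np.trace(np.linalg.solve(S_W[sub], S_B[sub]))

def best_subset(X, labels, M):
    """Try all C(N, M) feature subsets and return the one maximizing J_2."""
    S_W, S_B = scatter_matrices(X, labels)
    N = X.shape[1]
    return max(combinations(range(N), M), key=lambda idx: j2(S_W, S_B, idx))

# Synthetic data: only features 0 and 2 separate the two classes.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 200)
X = rng.normal(size=(400, 4))
X[labels == 1, 0] += 5.0
X[labels == 1, 2] += 5.0

print(comb(4, 2), "candidate subsets")
best = best_subset(X, labels, 2)
print("selected features:", best)
```

The number of subsets grows combinatorially with $N$, so a brute-force search of this kind is only practical for small $N$ and $M$.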
Generate features from original ones
If the features chosen optimally above do not produce satisfactory separability, we can try to generate new features as linear combinations of the original ones by a linear transform:

$\begin{displaymath} {\bf y}=\left[ \begin{array}{c} y_1 \\ \vdots \\ y_M \end{array} \right]={\bf A}^T{\bf x}=\left[ \begin{array}{c} {\bf a}_1^T \\ \vdots \\ {\bf a}_M^T \end{array} \right] {\bf x} \end{displaymath}$

Here ${\bf A}=[ {\bf a}_1, \cdots, {\bf a}_M ]_{N\times M}$ is an $N \times M$ matrix composed of $M$ N-D column vectors ${\bf a}_i$, and its transpose is an $M \times N$ matrix

$\begin{displaymath} {\bf A}^T=\left[ \begin{array}{c} {\bf a}_1^T \\ \vdots \\ {\bf a}_M^T \end{array} \right]_{M\times N} \end{displaymath}$

and ${\bf y}$ is an M-D vector whose elements are the new features $\{ y_i={\bf a}_i^T{\bf x},\;\;\;i=1,\cdots,M\}$.
After a linear transform ${\bf y}={\bf A}^T {\bf x}$ , the mean vectors and covariance matrices of each class become

$\begin{displaymath} {\bf m}_i^{(y)}={\bf A}^T{\bf m}_i^{(x)}, \;\;\;\;\;\; {\bf\Sigma}_i^{(y)}={\bf A}^T{\bf\Sigma}_i^{(x)}{\bf A}\;\;\;\;\;\;(i=1,\cdots,C) \end{displaymath}$
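
The way the mean and covariance change under ${\bf y}={\bf A}^T{\bf x}$ can be checked numerically. The sketch below uses NumPy with randomly generated data for a single class. The sizes and the random matrix `A` are arbitrary choices made for illustration, and samples are stored as rows, so the transform is written as `X @ A`.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, n = 4, 2, 500

# n samples of an N-D class with correlated features (rows are x^T).
X = rng.normal(size=(n, N)) @ rng.normal(size=(N, N))
# Arbitrary N x M transform; its columns play the role of a_1, ..., a_M.
A = rng.normal(size=(N, M))

# Each row of Y is y^T = x^T A, which is the same as y = A^T x.
Y = X @ A

m_x, Sigma_x = X.mean(axis=0), np.cov(X, rowvar=False)
m_y, Sigma_y = Y.mean(axis=0), np.cov(Y, rowvar=False)

# Compare the statistics measured in the y space with the transformed ones.
mean_ok = np.allclose(m_y, A.T @ m_x)
cov_ok = np.allclose(Sigma_y, A.T @ Sigma_x @ A)
print("mean relation holds:", mean_ok)
print("covariance relation holds:", cov_ok)
```

Both relations are exact identities for sample statistics, so the comparison agrees up to floating-point error.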

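and the scatter matrices become

$\begin{displaymath} {\bf S}_W^{(y)}={\bf A}^T{\bf S}_W^{(x)}{\bf A},\;\;\;\;\; {\bf S}_B^{(y)}={\bf A}^T{\bf S}_B^{(x)}{\bf A},\;\;\;\;\; {\bf S}_T^{(y)}={\bf A}^T{\bf S}_T^{(x)}{\bf A} \end{displaymath}$

so that the criterion matrices in the new space are ${\bf S}_{B/W}^{(y)}=({\bf A}^T{\bf S}_W^{(x)}{\bf A})^{-1}{\bf A}^T{\bf S}_B^{(x)}{\bf A}$ and ${\bf S}_{B/T}^{(y)}=({\bf A}^T{\bf S}_T^{(x)}{\bf A})^{-1}{\bf A}^T{\bf S}_B^{(x)}{\bf A}$.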

We need to find the optimal matrix ${\bf A}$ which maximizes $J({\bf A})$ in the M-D feature space spanned by the new features ${\bf y}={\bf A}^T {\bf x}$ .
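
To make the objective concrete, the sketch below evaluates $J({\bf A})$ for a given ${\bf A}$, taking $J$ to be the trace criterion $J_2$ computed from the scatter matrices in the ${\bf y}$ space. It also shows that choosing features directly from the original ones corresponds to the special case where the columns of ${\bf A}$ are unit vectors. The function names and the small diagonal scatter matrices are made up for illustration.

```python
import numpy as np

def criterion(A, S_W, S_B):
    """J(A) = tr[(A^T S_W A)^{-1} (A^T S_B A)], i.e. J_2 in the y space."""
    S_W_y = A.T @ S_W @ A
    S_B_y = A.T @ S_B @ A
    return np.trace(np.linalg.solve(S_W_y, S_B_y))

def selection_matrix(idx, N):
    """N x M matrix whose columns are the unit vectors of the features in idx."""
    return np.eye(N)[:, list(idx)]

# Hypothetical scatter matrices for N = 3 original features.
S_W = np.diag([1.0, 2.0, 4.0])
S_B = np.diag([3.0, 2.0, 4.0])

# Keeping features 0 and 2 is the linear transform y = A^T x with A = [e_0, e_2].
A_sel = selection_matrix((0, 2), 3)
J_sel = criterion(A_sel, S_W, S_B)

# Keeping all features (A = identity) gives the criterion in the original space.
J_all = criterion(np.eye(3), S_W, S_B)

print("J for features (0, 2):", J_sel)
print("J for all features:", J_all)
```

Writing the criterion as a function of ${\bf A}$ lets the same code score both a subset of the original features and any other candidate linear transform.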


Ruye Wang 2016-11-30