Distances and Separability Measurements

We first consider various distances widely used for feature selection in pattern classification.

We also define a set of scatter matrices, all related to the separability of the classes/clusters in the feature space, based on which different subsets of features can be compared and the best subset selected. We will also consider methods of feature selection or feature extraction that reduce the dimensionality of the feature space. Here we assume there are $K$ classes (or clusters) in a dataset containing $N$ data points, and the $k$th class $C_k$ contains $N_k$ data points, i.e., $N_1+\cdots+N_K=N$.

We can show that ${\bf S}_T={\bf S}_W+{\bf S}_B$, i.e., the total scatter matrix is the sum of within-class and between-class scatter matrices:
$\displaystyle {\bf S}_T$ $\displaystyle =$ $\displaystyle \frac{1}{N}\sum_{k=1}^K\sum_{{\bf x}\in C_k}
({\bf x}-{\bf m})({\bf x}-{\bf m})^T
=\frac{1}{N}\sum_{k=1}^K\sum_{{\bf x}\in C_k}
({\bf x}-{\bf m}_k+{\bf m}_k-{\bf m})({\bf x}-{\bf m}_k+{\bf m}_k-{\bf m})^T$  
  $\displaystyle =$ $\displaystyle \frac{1}{N}\sum_{k=1}^K \sum_{{\bf x} \in C_k}
[({\bf x}-{\bf m}_k)({\bf x}-{\bf m}_k)^T
+({\bf x}-{\bf m}_k)({\bf m}_k-{\bf m})^T
+({\bf m}_k-{\bf m})({\bf x}-{\bf m}_k)^T
+({\bf m}_k-{\bf m})({\bf m}_k-{\bf m})^T]$  
  $\displaystyle =$ $\displaystyle \frac{1}{N}\sum_{k=1}^K \sum_{{\bf x} \in C_k}
({\bf x}-{\bf m}_k)({\bf x}-{\bf m}_k)^T
+\frac{1}{N}\sum_{k=1}^K \sum_{{\bf x} \in C_k}
({\bf m}_k-{\bf m})({\bf m}_k-{\bf m})^T$  
  $\displaystyle =$ $\displaystyle \sum_{k=1}^K \frac{N_k}{N}{\bf\Sigma}_k +
\sum_{k=1}^K \frac{N_k}{N}({\bf m}_k-{\bf m})({\bf m}_k-{\bf m})^T
={\bf S}_W+{\bf S}_B$ (21)

where the two cross terms vanish because $\sum_{{\bf x}\in C_k}({\bf x}-{\bf m}_k)={\bf0}$, and ${\bf\Sigma}_k=\frac{1}{N_k}\sum_{{\bf x}\in C_k}({\bf x}-{\bf m}_k)({\bf x}-{\bf m}_k)^T$ is the covariance matrix of the $k$th class $C_k$ with mean ${\bf m}_k$, while ${\bf m}$ is the mean of the entire dataset.
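
To make the decomposition concrete, here is a minimal numpy sketch, assuming a data matrix `X` of shape $(N,n)$ (one sample per row) and an integer label vector `y`; the function name `scatter_matrices` and the synthetic test data are illustrative only, not part of the text:

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class, between-class and total scatter matrices as in Eq. (21).

    X: (N, n) data matrix, one sample per row; y: (N,) integer class labels.
    S_W = sum_k (N_k/N) Sigma_k,  S_B = sum_k (N_k/N)(m_k-m)(m_k-m)^T,
    S_T = (1/N) sum_x (x-m)(x-m)^T.
    """
    N, n = X.shape
    m = X.mean(axis=0)                               # overall mean
    S_W = np.zeros((n, n))
    S_B = np.zeros((n, n))
    for k in np.unique(y):
        Xk = X[y == k]                               # samples of class C_k
        Nk = Xk.shape[0]
        mk = Xk.mean(axis=0)                         # class mean m_k
        Sigma_k = (Xk - mk).T @ (Xk - mk) / Nk       # class covariance Sigma_k
        S_W += (Nk / N) * Sigma_k
        S_B += (Nk / N) * np.outer(mk - m, mk - m)
    S_T = (X - m).T @ (X - m) / N                    # total scatter
    return S_W, S_B, S_T

# quick numerical check of S_T = S_W + S_B on synthetic data (3 classes, 5 features)
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=300)
X = rng.normal(size=(300, 5)) + 2.0 * y[:, None]     # class-dependent shift
S_W, S_B, S_T = scatter_matrices(X, y)
assert np.allclose(S_T, S_W + S_B)
```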

Obviously, for better separability, we want the within-class scatter to be small but the between-class scatter to be large. However, we cannot use the scatter matrices directly, as matrices cannot be compared in terms of their sizes. Instead, we use their traces (or determinants) as the scalar measurements of the separability, with $P_k=N_k/N$ denoting the fraction of the data points in class $C_k$:

$\displaystyle J_B$ $\displaystyle =$ $\displaystyle tr\;{\bf S}_B=tr\left[\sum_{k=1}^K
P_k({\bf m}_k-{\bf m})({\bf m}_k-{\bf m})^T\right]$  
  $\displaystyle =$ $\displaystyle \sum_{k=1}^K P_k\;tr\left[({\bf m}_k-{\bf m})({\bf m}_k-{\bf m})^T\right]
=\sum_{k=1}^K P_k\;\vert\vert{\bf m}_k-{\bf m}\vert\vert^2$ (22)

and
$\displaystyle J_W$ $\displaystyle =$ $\displaystyle tr\;{\bf S}_W
=tr\left[\sum_{k=1}^K P_k{\bf\Sigma}_k\right]
=tr\left[\frac{1}{N}\sum_{k=1}^K \sum_{{\bf x}\in C_k}
({\bf x}-{\bf m}_k)({\bf x}-{\bf m}_k)^T \right]$  
  $\displaystyle =$ $\displaystyle \frac{1}{N}\sum_{k=1}^K \sum_{{\bf x}\in C_k}
tr\left[({\bf x}-{\bf m}_k)({\bf x}-{\bf m}_k)^T\right]
=\frac{1}{N}\sum_{k=1}^K \sum_{{\bf x}\in C_k}\vert\vert{\bf x}-{\bf m}_k\vert\vert^2$ (23)

We see that $J_B$ is the weighted sum of the squared Euclidean distances between the class means ${\bf m}_k$ and the overall mean ${\bf m}$ over all $K$ classes, and $J_W$ is the average of the squared Euclidean distances from every sample ${\bf x}\in C_k$ to its class mean ${\bf m}_k$, taken over all $K$ classes $C_1,\cdots,C_K$.
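
Continuing the illustrative sketch above (reusing `X`, `y`, `S_B`, `S_W`), the two criteria are simply traces, and can be cross-checked against the sums of squared distances in (22) and (23):

```python
J_B = np.trace(S_B)
J_W = np.trace(S_W)

# closed forms of Eqs. (22) and (23)
N = X.shape[0]
m = X.mean(axis=0)
J_B_direct = sum((X[y == k].shape[0] / N) * np.sum((X[y == k].mean(axis=0) - m) ** 2)
                 for k in np.unique(y))
J_W_direct = sum(np.sum((X[y == k] - X[y == k].mean(axis=0)) ** 2)
                 for k in np.unique(y)) / N
assert np.isclose(J_B, J_B_direct) and np.isclose(J_W, J_W_direct)
```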

To achieve the best separability, we want to maximize $tr\;{\bf S}_B$ while at the same time minimizing $tr\;{\bf S}_W$. Within a given feature space, ${\bf S}_T$ is fixed, so maximizing $tr\;{\bf S}_B$ is equivalent to minimizing $tr\;{\bf S}_W$, due to the relationship ${\bf S}_B+{\bf S}_W={\bf S}_T$. To compare the separability across different spaces, however, ${\bf S}_B$ needs to be normalized by either ${\bf S}_T$ or, equivalently, ${\bf S}_W$:

$\displaystyle J_{B/W}=tr({\bf S}_W^{-1}{\bf S}_B)=tr({\bf S}_B{\bf S}_W^{-1}),
\;\;\;\;\;\;\;\;
J_{B/T}=tr({\bf S}_T^{-1}{\bf S}_B)=tr({\bf S}_B{\bf S}_T^{-1})$ (24)
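
As a small sketch of (24), reusing the illustrative `S_W`, `S_B`, `S_T` computed above (and assuming they are nonsingular), `np.linalg.solve` is used instead of forming explicit inverses:

```python
# J_{B/W} = tr(S_W^{-1} S_B) and J_{B/T} = tr(S_T^{-1} S_B); solve() avoids explicit inverses
J_BW = np.trace(np.linalg.solve(S_W, S_B))
J_BT = np.trace(np.linalg.solve(S_T, S_B))
print(J_BW, J_BT)    # larger values indicate better (normalized) separability
```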

The separability measured by the scatter matrices can be used as the criterion for selecting $m<n$ of the $n$ original features, reducing the computational cost while keeping the classification effective. Moreover, we can also generate $m$ new features as linear combinations of the $n$ original ones by a linear transform:

$\displaystyle {\bf y}_{m\times 1}=({\bf A}_{n\times m})^T{\bf x}_{n\times 1}$ (25)

We want to find the $n\times m$ transform matrix ${\bf A}=[{\bf a}_1,\cdots,{\bf a}_m]$ so that the separability in the original $n$-D space is maximally preserved in the new $m$-D space of lower dimensionality.
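
As a hedged illustration of one standard choice under the $J_{B/W}$ criterion (not necessarily the construction developed in the rest of the text), ${\bf A}$ can be taken as the $m$ eigenvectors of ${\bf S}_W^{-1}{\bf S}_B$ with the largest eigenvalues; the sketch below reuses the illustrative `scatter_matrices`, `X`, and `y` from above:

```python
def separability_transform(X, y, m):
    """Columns of A are the m leading eigenvectors of S_W^{-1} S_B (illustrative)."""
    S_W, S_B, _ = scatter_matrices(X, y)
    # eigen-decomposition of S_W^{-1} S_B; keep real parts (imaginary parts are round-off)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]           # sort by decreasing eigenvalue
    return eigvecs[:, order[:m]].real                # n x m transform matrix A

A = separability_transform(X, y, m=2)                # rank(S_B) <= K-1, so m <= K-1 is meaningful
Y = X @ A                                            # y = A^T x, applied row-wise
S_Wy, S_By, _ = scatter_matrices(Y, y)
print(np.trace(np.linalg.solve(S_Wy, S_By)))         # separability retained in the m-D space
```

In exact arithmetic, the value printed equals the sum of the $m$ largest eigenvalues of ${\bf S}_W^{-1}{\bf S}_B$, i.e., the portion of $J_{B/W}$ retained by the projection.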