We first consider various distances widely used for feature
selection in pattern classification.
- Distance between two data points
The distance between two points $\mathbf{x}=[x_1,\dots,x_n]^T$ and $\mathbf{y}=[y_1,\dots,y_n]^T$ in the n-dimensional feature space can be measured by the p-norm of the difference $\mathbf{x}-\mathbf{y}$:
$$d_p(\mathbf{x},\mathbf{y}) = \|\mathbf{x}-\mathbf{y}\|_p = \left(\sum_{i=1}^n |x_i-y_i|^p\right)^{1/p} \qquad (1)$$
In particular, consider the following three special cases:
- $p=1$, the city block or Manhattan distance:
$$d_1(\mathbf{x},\mathbf{y}) = \sum_{i=1}^n |x_i-y_i| \qquad (2)$$
- $p=2$, the Euclidean distance:
$$d_2(\mathbf{x},\mathbf{y}) = \left(\sum_{i=1}^n (x_i-y_i)^2\right)^{1/2} \qquad (3)$$
- $p\to\infty$, the Chebyshev distance:
$$d_\infty(\mathbf{x},\mathbf{y}) = \max_{1\le i\le n} |x_i-y_i| \qquad (4)$$
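As a quick illustration of these three special cases, here is a minimal NumPy sketch (the variable names are mine, not from the text):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.5])
d = x - y

d1   = np.sum(np.abs(d))      # p = 1: city block / Manhattan distance, Eq. (2)
d2   = np.sqrt(np.sum(d**2))  # p = 2: Euclidean distance, Eq. (3)
dinf = np.max(np.abs(d))      # p -> infinity: Chebyshev distance, Eq. (4)

print(d1, d2, dinf)           # 5.5  3.6400...  3.0
```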
- Intra-class distance
An intra-class distance represents how widely (or narrowly) all samples in a class $C$ are distributed in the feature space. It needs to be small for good separability.
- The max diameter:
$$d_{\max}(C) = \max_{\mathbf{x},\mathbf{y}\in C} \|\mathbf{x}-\mathbf{y}\| \qquad (5)$$
- The average diameter, i.e., the average of all pairwise distances within the class of $N_C$ samples:
$$d_{avg}(C) = \frac{1}{N_C(N_C-1)} \sum_{\mathbf{x}\in C}\ \sum_{\mathbf{y}\in C,\,\mathbf{y}\ne\mathbf{x}} \|\mathbf{x}-\mathbf{y}\| \qquad (6)$$
- The average distance from each sample to the mean vector of the cluster:
$$d_{mean}(C) = \frac{1}{N_C} \sum_{\mathbf{x}\in C} \|\mathbf{x}-\mathbf{m}\|, \qquad \text{where } \mathbf{m} = \frac{1}{N_C}\sum_{\mathbf{x}\in C}\mathbf{x} \qquad (7)$$
- The covariance of all samples in the class represents the tightness of the samples in the class:
$$\mathbf{\Sigma} = \frac{1}{N_C} \sum_{\mathbf{x}\in C} (\mathbf{x}-\mathbf{m})(\mathbf{x}-\mathbf{m})^T \qquad (8)$$
and its determinant or trace can be used as such a scalar measurement of the tightness:
$$\det(\mathbf{\Sigma}) = \prod_{i=1}^n \lambda_i, \qquad \operatorname{tr}(\mathbf{\Sigma}) = \sum_{i=1}^n \lambda_i \qquad (9)$$
As $\mathbf{\Sigma}$ is positive semi-definite, all of its eigenvalues $\lambda_i$ are non-negative, and so are its determinant and trace.
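The intra-class measures above might be computed for a single class stored as an $N_C \times n$ sample matrix roughly as in the following sketch (function and variable names are illustrative, not from the text):

```python
import numpy as np

def intra_class_measures(X):
    """X: (N_C, n) array holding the samples of one class, one row per sample."""
    N_C = X.shape[0]
    pdist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # all pairwise Euclidean distances

    d_max = pdist.max()                              # max diameter, Eq. (5)
    d_avg = pdist.sum() / (N_C * (N_C - 1))          # average diameter, Eq. (6)

    m = X.mean(axis=0)                               # class mean vector
    d_mean = np.linalg.norm(X - m, axis=1).mean()    # average distance to the mean, Eq. (7)

    Sigma = np.cov(X, rowvar=False, bias=True)       # class covariance, Eq. (8)
    tr_S, det_S = np.trace(Sigma), np.linalg.det(Sigma)  # scalar tightness measures, Eq. (9)
    return d_max, d_avg, d_mean, tr_S, det_S
```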
- Distance between a point and a cluster/class of points
- The max- and min-distances:
$$d_{\max}(\mathbf{x},C) = \max_{\mathbf{y}\in C} \|\mathbf{x}-\mathbf{y}\|, \qquad d_{\min}(\mathbf{x},C) = \min_{\mathbf{y}\in C} \|\mathbf{x}-\mathbf{y}\| \qquad (10)$$
- The centroid distance:
$$d_{mean}(\mathbf{x},C) = \|\mathbf{x}-\mathbf{m}\| \qquad (11)$$
- The Mahalanobis distance:
$$d_M(\mathbf{x},C) = \sqrt{(\mathbf{x}-\mathbf{m})^T \mathbf{\Sigma}^{-1} (\mathbf{x}-\mathbf{m})} \qquad (12)$$
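These point-to-class distances could be sketched as follows (again the names are mine; the class covariance is assumed invertible):

```python
import numpy as np

def point_to_class_distances(x, X):
    """x: (n,) query point; X: (N_C, n) samples of one class."""
    d = np.linalg.norm(X - x, axis=1)              # Euclidean distance to every class member
    d_max, d_min = d.max(), d.min()                # max- and min-distances, Eq. (10)

    m = X.mean(axis=0)
    d_centroid = np.linalg.norm(x - m)             # centroid distance, Eq. (11)

    Sigma = np.cov(X, rowvar=False, bias=True)     # class covariance (assumed nonsingular)
    diff = x - m
    d_mahal = np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)  # Mahalanobis distance, Eq. (12)
    return d_max, d_min, d_centroid, d_mahal
```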
- Inter-class distance
An inter-class distance measures the difference between two classes; it needs to be large for good separability.
- The max- and min-distances:
$$d_{\max}(C_i,C_j) = \max_{\mathbf{x}\in C_i,\,\mathbf{y}\in C_j} \|\mathbf{x}-\mathbf{y}\|, \qquad d_{\min}(C_i,C_j) = \min_{\mathbf{x}\in C_i,\,\mathbf{y}\in C_j} \|\mathbf{x}-\mathbf{y}\| \qquad (13)$$
- The average distance between two clusters is the average of all pair-wise distances (e.g., Euclidean) between members of the two classes:
$$d_{avg}(C_i,C_j) = \frac{1}{N_i N_j} \sum_{\mathbf{x}\in C_i}\sum_{\mathbf{y}\in C_j} \|\mathbf{x}-\mathbf{y}\| \qquad (14)$$
- The centroid distance is the distance (e.g., Euclidean) between the centroids (mean vectors) of the two classes:
$$d_{mean}(C_i,C_j) = \|\mathbf{m}_i-\mathbf{m}_j\| \qquad (15)$$
- The Bhattacharyya distance:
$$d_B(C_i,C_j) = \frac{1}{8}(\mathbf{m}_i-\mathbf{m}_j)^T \left[\frac{\mathbf{\Sigma}_i+\mathbf{\Sigma}_j}{2}\right]^{-1} (\mathbf{m}_i-\mathbf{m}_j) + \frac{1}{2}\ln\frac{\left|\,(\mathbf{\Sigma}_i+\mathbf{\Sigma}_j)/2\,\right|}{\sqrt{|\mathbf{\Sigma}_i|\,|\mathbf{\Sigma}_j|}} \qquad (16)$$
The first term reflects mostly the difference between the two mean vectors of the two clusters, while the second term reflects the difference between the distributions of the two clusters, which is always non-negative due to the AM-GM inequality of the arithmetic and geometric means:
$$\left|\frac{\mathbf{\Sigma}_i+\mathbf{\Sigma}_j}{2}\right| \ge \sqrt{|\mathbf{\Sigma}_i|\,|\mathbf{\Sigma}_j|} \qquad (17)$$
Note that even when the first term is zero due to $\mathbf{m}_i=\mathbf{m}_j$, the Bhattacharyya distance may still be non-zero due to the non-zero second term if $\mathbf{\Sigma}_i\ne\mathbf{\Sigma}_j$.
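The Bhattacharyya distance of Eq. (16) might be computed from the two sample sets as in the sketch below (the function name is mine; slogdet is used only for numerical stability of the log-determinants):

```python
import numpy as np

def bhattacharyya(Xi, Xj):
    """Xi, Xj: (N_i, n) and (N_j, n) sample matrices of the two classes."""
    mi, mj = Xi.mean(axis=0), Xj.mean(axis=0)
    Si = np.cov(Xi, rowvar=False, bias=True)
    Sj = np.cov(Xj, rowvar=False, bias=True)
    S  = (Si + Sj) / 2

    dm = mi - mj
    term1 = dm @ np.linalg.inv(S) @ dm / 8                    # mean-difference term
    _, logdet_S  = np.linalg.slogdet(S)
    _, logdet_Si = np.linalg.slogdet(Si)
    _, logdet_Sj = np.linalg.slogdet(Sj)
    term2 = 0.5 * (logdet_S - 0.5 * (logdet_Si + logdet_Sj))  # covariance term, >= 0 by Eq. (17)
    return term1 + term2
```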
We also define a set of scatter matrices all related to the
separability of the classes/clusters in the feature
space, based on which different subsets of features can be
compared and the best subset can be selected. We will also
consider methods for feature selection or
feature extraction to reduce the dimensionality of
the feature space. Here we assume there are $K$ classes (or clusters) $C_1,\dots,C_K$ in the dataset containing $N$ data points, and the kth class $C_k$ contains $N_k$ data points, i.e., $\sum_{k=1}^K N_k = N$. Let $\mathbf{m}_k$ and $\mathbf{\Sigma}_k$ denote the mean vector and covariance matrix of class $C_k$, $\mathbf{m}$ the mean vector of the whole dataset, and $P_k = N_k/N$ the weight of class $C_k$. The within-class, between-class, and total scatter matrices are defined as
$$\mathbf{S}_w = \sum_{k=1}^K P_k \mathbf{\Sigma}_k = \sum_{k=1}^K \frac{P_k}{N_k} \sum_{\mathbf{x}\in C_k} (\mathbf{x}-\mathbf{m}_k)(\mathbf{x}-\mathbf{m}_k)^T \qquad (18)$$
$$\mathbf{S}_b = \sum_{k=1}^K P_k (\mathbf{m}_k-\mathbf{m})(\mathbf{m}_k-\mathbf{m})^T \qquad (19)$$
$$\mathbf{S}_t = \frac{1}{N} \sum_{k=1}^K \sum_{\mathbf{x}\in C_k} (\mathbf{x}-\mathbf{m})(\mathbf{x}-\mathbf{m})^T \qquad (20)$$
We can show that $\mathbf{S}_t = \mathbf{S}_w + \mathbf{S}_b$, i.e., the total scatter matrix is the sum of the within-class and between-class scatter matrices:
$$\mathbf{S}_t = \frac{1}{N}\sum_{k=1}^K\sum_{\mathbf{x}\in C_k} \big[(\mathbf{x}-\mathbf{m}_k)+(\mathbf{m}_k-\mathbf{m})\big]\big[(\mathbf{x}-\mathbf{m}_k)+(\mathbf{m}_k-\mathbf{m})\big]^T = \mathbf{S}_w + \mathbf{S}_b \qquad (21)$$
where the cross terms vanish because $\sum_{\mathbf{x}\in C_k}(\mathbf{x}-\mathbf{m}_k)=\mathbf{0}$ for each class.
Obviously, for better separability, we want the within-class scatter to be small but the between-class scatter to be large. However, we cannot use the scatter matrices directly, as they cannot be compared in terms of their sizes. Instead, we use their traces (or determinants) as the scalar measurements of the separability:
$$\operatorname{tr}(\mathbf{S}_b) = \sum_{k=1}^K P_k \|\mathbf{m}_k-\mathbf{m}\|^2 \qquad (22)$$
and
$$\operatorname{tr}(\mathbf{S}_w) = \sum_{k=1}^K \frac{P_k}{N_k} \sum_{\mathbf{x}\in C_k} \|\mathbf{x}-\mathbf{m}_k\|^2 \qquad (23)$$
We see that $\operatorname{tr}(\mathbf{S}_b)$ is the weighted sum of the squared Euclidean distances between the class means $\mathbf{m}_k$ and the overall mean $\mathbf{m}$ over all $K$ classes, and $\operatorname{tr}(\mathbf{S}_w)$ is the weighted sum of the sums of squared Euclidean distances from all samples $\mathbf{x}\in C_k$ to the class mean $\mathbf{m}_k$ over all $K$ classes.
To achieve the best separability, we maximize $\operatorname{tr}(\mathbf{S}_b)$ while at the same time minimizing $\operatorname{tr}(\mathbf{S}_w)$. In the same space with a constant $\operatorname{tr}(\mathbf{S}_t)$, maximizing $\operatorname{tr}(\mathbf{S}_b)$ is equivalent to minimizing $\operatorname{tr}(\mathbf{S}_w)$, due to the relationship $\operatorname{tr}(\mathbf{S}_t)=\operatorname{tr}(\mathbf{S}_w)+\operatorname{tr}(\mathbf{S}_b)$.
To measure the separability across different spaces, $\operatorname{tr}(\mathbf{S}_b)$ can still be used, but now it needs to be normalized by either $\operatorname{tr}(\mathbf{S}_w)$ or, equivalently, $\operatorname{tr}(\mathbf{S}_t)$:
$$J = \frac{\operatorname{tr}(\mathbf{S}_b)}{\operatorname{tr}(\mathbf{S}_w)} \qquad \text{or} \qquad J' = \frac{\operatorname{tr}(\mathbf{S}_b)}{\operatorname{tr}(\mathbf{S}_t)} \qquad (24)$$
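A minimal NumPy sketch (the function name and layout are mine) of how the scatter matrices and the criterion of Eq. (24) might be computed from a labelled data matrix, following the weighted definitions above:

```python
import numpy as np

def scatter_matrices(X, labels):
    """X: (N, n) data matrix; labels: (N,) class labels. Returns S_w, S_b, S_t."""
    N, n = X.shape
    m = X.mean(axis=0)                                   # overall mean vector
    S_w = np.zeros((n, n))
    S_b = np.zeros((n, n))
    for k in np.unique(labels):
        Xk = X[labels == k]
        Pk = len(Xk) / N                                 # class weight P_k = N_k / N
        mk = Xk.mean(axis=0)                             # class mean m_k
        S_w += Pk * np.cov(Xk, rowvar=False, bias=True)  # within-class scatter, Eq. (18)
        S_b += Pk * np.outer(mk - m, mk - m)             # between-class scatter, Eq. (19)
    S_t = np.cov(X, rowvar=False, bias=True)             # total scatter, Eq. (20)
    assert np.allclose(S_t, S_w + S_b)                   # S_t = S_w + S_b, Eq. (21)
    return S_w, S_b, S_t

# separability criterion of Eq. (24):
# S_w, S_b, S_t = scatter_matrices(X, labels)
# J = np.trace(S_b) / np.trace(S_w)
```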
The separability measured by the scatter matrices can be used as the criterion for selecting $m$ of the $n$ original features to reduce the computational cost while still keeping the effectiveness of the classification. Moreover, we can also generate $m$ new features as linear combinations of the $n$ original ones by a linear transform:
$$\mathbf{y} = \mathbf{A}^T\mathbf{x}, \qquad \mathbf{A}=[\mathbf{a}_1,\dots,\mathbf{a}_m] \in \mathbb{R}^{n\times m} \qquad (25)$$
We desire to find the $n\times m$ transform matrix $\mathbf{A}$ so that the separability in the original n-D space is maximally conserved in the new m-D space of lower dimensionality $m<n$.
- Optimal transformation for maximizing $\operatorname{tr}(\mathbf{S}_b)$
The separability criterion $\operatorname{tr}(\mathbf{S}_b')$ in the new m-D space can be expressed as:
$$\operatorname{tr}(\mathbf{S}_b') = \operatorname{tr}(\mathbf{A}^T\mathbf{S}_b\mathbf{A}) = \sum_{i=1}^m \mathbf{a}_i^T\mathbf{S}_b\mathbf{a}_i \qquad (26)$$
To find the optimal transform matrix $\mathbf{A}$ that maximizes $\operatorname{tr}(\mathbf{A}^T\mathbf{S}_b\mathbf{A})$, we consider the following constrained maximization problem:
$$\max_{\mathbf{A}}\ \operatorname{tr}(\mathbf{A}^T\mathbf{S}_b\mathbf{A}) \qquad \text{subject to} \qquad \mathbf{a}_i^T\mathbf{a}_i = 1, \quad i=1,\dots,m \qquad (27)$$
Here the constraint is to guarantee that the column vectors of $\mathbf{A}$ are normalized. As in the PCA problem discussed in the Appendix, this constrained optimization problem can be solved by the Lagrange multiplier method:
$$\frac{\partial}{\partial \mathbf{a}_i}\left[\sum_{j=1}^m \mathbf{a}_j^T\mathbf{S}_b\mathbf{a}_j - \sum_{j=1}^m \lambda_j\left(\mathbf{a}_j^T\mathbf{a}_j-1\right)\right] = 2\,\mathbf{S}_b\mathbf{a}_i - 2\lambda_i\mathbf{a}_i = \mathbf{0} \quad\Longrightarrow\quad \mathbf{S}_b\mathbf{a}_i = \lambda_i\mathbf{a}_i \qquad (28)$$
We see that the optimal feature extraction transform is the PCA transform, which compacts most of the energy/information (representing separability in this context) into $m$ components. The column vectors of $\mathbf{A}$ must be the orthogonal eigenvectors of the symmetric matrix $\mathbf{S}_b$ corresponding to its $m$ greatest eigenvalues, and the $m$ new features can be obtained by
$$y_i = \mathbf{a}_i^T\mathbf{x}, \quad i=1,\dots,m \qquad \text{or} \qquad \mathbf{y} = \mathbf{A}^T\mathbf{x} \qquad (29)$$
and
$$\operatorname{tr}(\mathbf{S}_b') = \operatorname{tr}(\mathbf{A}^T\mathbf{S}_b\mathbf{A}) = \sum_{i=1}^m \lambda_i \qquad (30)$$
where the eigenvalues of $\mathbf{S}_b$ are sorted in descending order $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$.
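A sketch of this transform, assuming $\mathbf{S}_b$ has already been computed (e.g., by a helper like the scatter_matrices sketch above):

```python
import numpy as np

def transform_max_tr_Sb(S_b, m):
    """Return the n x m transform A maximizing tr(A^T S_b A) under a_i^T a_i = 1."""
    evals, evecs = np.linalg.eigh(S_b)        # S_b is symmetric positive semi-definite
    order = np.argsort(evals)[::-1]           # sort eigenvalues in descending order
    A = evecs[:, order[:m]]                   # top-m orthogonal eigenvectors of S_b
    separability = evals[order[:m]].sum()     # tr(A^T S_b A) = sum of m greatest eigenvalues, Eq. (30)
    return A, separability

# new features for a data matrix X of shape (N, n):  Y = X @ A  (i.e., y = A^T x per sample), Eq. (29)
```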
- Optimal transformation for maximizing $\operatorname{tr}(\mathbf{S}_w^{-1}\mathbf{S}_b)$
The previous method only maximizes $\operatorname{tr}(\mathbf{S}_b)$ without taking into consideration $\operatorname{tr}(\mathbf{S}_w)$ or, equivalently, $\operatorname{tr}(\mathbf{S}_t)$. If in the m-D space after the transform $\operatorname{tr}(\mathbf{S}_w')$ has changed as well as $\operatorname{tr}(\mathbf{S}_b')$, then we need to maximize $\operatorname{tr}(\mathbf{S}_b')$ while at the same time minimizing $\operatorname{tr}(\mathbf{S}_w')$, in order to maximize the separability. Or, equivalently, we need to maximize $\operatorname{tr}(\mathbf{S}_b')$ normalized by the total scatter $\operatorname{tr}(\mathbf{S}_t')$. We can therefore use the trace of $\mathbf{S}_w'^{-1}\mathbf{S}_b'$ (or $\mathbf{S}_t'^{-1}\mathbf{S}_b'$) as the measurement of the separability in the new m-D space:
$$J(\mathbf{A}) = \operatorname{tr}\!\left(\mathbf{S}_w'^{-1}\mathbf{S}_b'\right) = \operatorname{tr}\!\left[\left(\mathbf{A}^T\mathbf{S}_w\mathbf{A}\right)^{-1}\left(\mathbf{A}^T\mathbf{S}_b\mathbf{A}\right)\right] \qquad (31)$$
We need to find the $\mathbf{A}$ that maximizes the trace of the matrix above in the m-D space of $\mathbf{y}=\mathbf{A}^T\mathbf{x}$. As $\mathbf{S}_w^{-1}\mathbf{S}_b$ (a product of two symmetric matrices) is not itself a symmetric matrix, the PCA method above cannot be used; we will find the optimal matrix $\mathbf{A}$ in a different way.
We first let $m=1$, so that $\mathbf{A}$ reduces to an $n\times 1$ vector $\mathbf{a}$ that maximizes the following objective function in the 1-D space $y=\mathbf{a}^T\mathbf{x}$:
$$J(\mathbf{a}) = \frac{\mathbf{a}^T\mathbf{S}_b\mathbf{a}}{\mathbf{a}^T\mathbf{S}_w\mathbf{a}} \qquad (32)$$
This function $J(\mathbf{a})$ is the Rayleigh quotient of the two symmetric matrices $\mathbf{S}_b$ and $\mathbf{S}_w$. The optimal transform vector $\mathbf{a}$ that maximizes this $J(\mathbf{a})$ can be found by solving the corresponding generalized eigenvalue problem:
$$\mathbf{S}_b\mathbf{a} = \lambda\,\mathbf{S}_w\mathbf{a}, \qquad \text{i.e.,} \qquad \mathbf{S}_w^{-1}\mathbf{S}_b\,\mathbf{a} = \lambda\,\mathbf{a} \qquad (33)$$
where $\lambda$ is an eigenvalue and $\mathbf{a}$ the corresponding eigenvector of $\mathbf{S}_w^{-1}\mathbf{S}_b$. Obviously, the transform vector $\mathbf{a}$ that maximizes $J(\mathbf{a})$ is the eigenvector corresponding to the greatest eigenvalue $\lambda_1$.
We next generalize the method above to the case of $m>1$. The matrix form of the eigenequation above is:
$$\mathbf{S}_b\mathbf{V} = \mathbf{S}_w\mathbf{V}\mathbf{\Lambda} \qquad (34)$$
where $\mathbf{\Lambda}=\operatorname{diag}(\lambda_1,\dots,\lambda_n)$ is the eigenvalue matrix of $\mathbf{S}_w^{-1}\mathbf{S}_b$, and $\mathbf{V}=[\mathbf{v}_1,\dots,\mathbf{v}_n]$ the eigenvector matrix (no longer orthogonal in general). This generalized eigenvalue problem can be solved by finding the matrix $\mathbf{V}$ that diagonalizes both $\mathbf{S}_w$ and $\mathbf{S}_b$ at the same time:
$$\mathbf{V}^T\mathbf{S}_b\mathbf{V} = \mathbf{\Lambda}, \qquad \mathbf{V}^T\mathbf{S}_w\mathbf{V} = \mathbf{I} \qquad (35)$$
Left multiplying the first equation by the inverse of the second, we get
$$\left(\mathbf{V}^T\mathbf{S}_w\mathbf{V}\right)^{-1}\left(\mathbf{V}^T\mathbf{S}_b\mathbf{V}\right) = \mathbf{V}^{-1}\mathbf{S}_w^{-1}\mathbf{S}_b\mathbf{V} = \mathbf{\Lambda} \qquad (36)$$
The optimal transform matrix $\mathbf{A}=[\mathbf{v}_1,\dots,\mathbf{v}_m]$ is composed of the eigenvectors corresponding to the $m$ greatest of all $n$ eigenvalues $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n$, so that
- the signal components in $\mathbf{y}=\mathbf{A}^T\mathbf{x}$ are completely decorrelated, and each component $y_i$ carries a certain amount of separability information ($\lambda_i$), independent of the others;
- the total separability contained in the m-D space, as the sum of the $m$ greatest eigenvalues $\sum_{i=1}^m \lambda_i$, is maximized.
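The whole procedure might be sketched as follows, using scipy.linalg.eigh to solve the generalized symmetric eigenproblem of Eqs. (33)/(34) directly ($\mathbf{S}_w$ is assumed nonsingular; the function name is mine):

```python
import numpy as np
from scipy.linalg import eigh

def transform_max_tr_SwInvSb(S_w, S_b, m):
    """Return the n x m transform A built from the m leading generalized eigenvectors."""
    evals, V = eigh(S_b, S_w)                 # solves S_b v = lambda S_w v, with V^T S_w V = I, Eq. (35)
    order = np.argsort(evals)[::-1]           # eigenvalues of S_w^{-1} S_b in descending order
    A = V[:, order[:m]]                       # columns are not orthogonal in general
    separability = evals[order[:m]].sum()     # sum of the m greatest eigenvalues
    return A, separability
```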