We first consider various distances widely used for feature
selection in pattern classification.
- Distance between two data points
The distance between two points $\mathbf{x}=[x_1,\dots,x_n]^T$ and $\mathbf{y}=[y_1,\dots,y_n]^T$ in the n-dimensional feature space can be measured by the p-norm of the difference $\mathbf{x}-\mathbf{y}$:
$$d_p(\mathbf{x},\mathbf{y}) = \|\mathbf{x}-\mathbf{y}\|_p = \left(\sum_{i=1}^n |x_i-y_i|^p\right)^{1/p} \qquad (1)$$
In particular, consider the following three special cases:
- $p=1$, the city block or Manhattan distance:
$$d_1(\mathbf{x},\mathbf{y}) = \sum_{i=1}^n |x_i-y_i| \qquad (2)$$
- $p=2$, the Euclidean distance:
$$d_2(\mathbf{x},\mathbf{y}) = \left(\sum_{i=1}^n (x_i-y_i)^2\right)^{1/2} \qquad (3)$$
- $p\to\infty$, the Chebyshev distance:
$$d_\infty(\mathbf{x},\mathbf{y}) = \max_{1\le i\le n} |x_i-y_i| \qquad (4)$$
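As a quick illustration of these three special cases, here is a minimal NumPy sketch (the variable names are mine, not from the text):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.5])
d = x - y

d1   = np.sum(np.abs(d))      # p = 1: city block / Manhattan distance, Eq. (2)
d2   = np.sqrt(np.sum(d**2))  # p = 2: Euclidean distance, Eq. (3)
dinf = np.max(np.abs(d))      # p -> infinity: Chebyshev distance, Eq. (4)

print(d1, d2, dinf)           # 5.5  3.6400...  3.0
```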
- Intra-class distance
An intra-class distance represents how widely (or narrowly) all samples in a class $C$ are distributed in the feature space. It needs to be small for good separability.
- The max diameter:
$$d_{\max}(C) = \max_{\mathbf{x},\mathbf{y}\in C} \|\mathbf{x}-\mathbf{y}\| \qquad (5)$$
- The average diameter, i.e., the average of all pairwise distances within the class of $N_C$ samples:
$$d_{avg}(C) = \frac{1}{N_C(N_C-1)} \sum_{\mathbf{x}\in C}\ \sum_{\mathbf{y}\in C,\,\mathbf{y}\ne\mathbf{x}} \|\mathbf{x}-\mathbf{y}\| \qquad (6)$$
- The average distance from each sample to the mean vector of the cluster:
$$d_{mean}(C) = \frac{1}{N_C} \sum_{\mathbf{x}\in C} \|\mathbf{x}-\mathbf{m}\|, \qquad \text{where } \mathbf{m} = \frac{1}{N_C}\sum_{\mathbf{x}\in C}\mathbf{x} \qquad (7)$$
- The covariance of all samples in the class represents the tightness of the samples in the class:
$$\mathbf{\Sigma} = \frac{1}{N_C} \sum_{\mathbf{x}\in C} (\mathbf{x}-\mathbf{m})(\mathbf{x}-\mathbf{m})^T \qquad (8)$$
and its determinant or trace can be used as such a scalar measurement of the tightness:
$$\det(\mathbf{\Sigma}) = \prod_{i=1}^n \lambda_i, \qquad \operatorname{tr}(\mathbf{\Sigma}) = \sum_{i=1}^n \lambda_i \qquad (9)$$
As $\mathbf{\Sigma}$ is positive semi-definite, all of its eigenvalues $\lambda_i$ are non-negative, and so are its determinant and trace.
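The intra-class measures above might be computed for a single class stored as an $N_C \times n$ sample matrix roughly as in the following sketch (function and variable names are illustrative, not from the text):

```python
import numpy as np

def intra_class_measures(X):
    """X: (N_C, n) array holding the samples of one class, one row per sample."""
    N_C = X.shape[0]
    pdist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # all pairwise Euclidean distances

    d_max = pdist.max()                              # max diameter, Eq. (5)
    d_avg = pdist.sum() / (N_C * (N_C - 1))          # average diameter, Eq. (6)

    m = X.mean(axis=0)                               # class mean vector
    d_mean = np.linalg.norm(X - m, axis=1).mean()    # average distance to the mean, Eq. (7)

    Sigma = np.cov(X, rowvar=False, bias=True)       # class covariance, Eq. (8)
    tr_S, det_S = np.trace(Sigma), np.linalg.det(Sigma)  # scalar tightness measures, Eq. (9)
    return d_max, d_avg, d_mean, tr_S, det_S
```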
- Distance between a point and a cluster/class of points
- The max- and min-distances:
$$d_{\max}(\mathbf{x},C) = \max_{\mathbf{y}\in C} \|\mathbf{x}-\mathbf{y}\|, \qquad d_{\min}(\mathbf{x},C) = \min_{\mathbf{y}\in C} \|\mathbf{x}-\mathbf{y}\| \qquad (10)$$
- The centroid distance:
$$d_{mean}(\mathbf{x},C) = \|\mathbf{x}-\mathbf{m}\| \qquad (11)$$
- The Mahalanobis distance:
$$d_M(\mathbf{x},C) = \sqrt{(\mathbf{x}-\mathbf{m})^T \mathbf{\Sigma}^{-1} (\mathbf{x}-\mathbf{m})} \qquad (12)$$
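These point-to-class distances could be sketched as follows (again the names are mine; the class covariance is assumed invertible):

```python
import numpy as np

def point_to_class_distances(x, X):
    """x: (n,) query point; X: (N_C, n) samples of one class."""
    d = np.linalg.norm(X - x, axis=1)              # Euclidean distance to every class member
    d_max, d_min = d.max(), d.min()                # max- and min-distances, Eq. (10)

    m = X.mean(axis=0)
    d_centroid = np.linalg.norm(x - m)             # centroid distance, Eq. (11)

    Sigma = np.cov(X, rowvar=False, bias=True)     # class covariance (assumed nonsingular)
    diff = x - m
    d_mahal = np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)  # Mahalanobis distance, Eq. (12)
    return d_max, d_min, d_centroid, d_mahal
```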
- Inter-class distance
An inter-class distance measures the difference between two classes; it needs to be large for good separability.
- The max- and min-distances:
$$d_{\max}(C_i,C_j) = \max_{\mathbf{x}\in C_i,\,\mathbf{y}\in C_j} \|\mathbf{x}-\mathbf{y}\|, \qquad d_{\min}(C_i,C_j) = \min_{\mathbf{x}\in C_i,\,\mathbf{y}\in C_j} \|\mathbf{x}-\mathbf{y}\| \qquad (13)$$
- The average distance between two clusters is the average of all pair-wise distances (e.g., Euclidean) between members of the two classes:
$$d_{avg}(C_i,C_j) = \frac{1}{N_i N_j} \sum_{\mathbf{x}\in C_i}\sum_{\mathbf{y}\in C_j} \|\mathbf{x}-\mathbf{y}\| \qquad (14)$$
- The centroid distance is the distance (e.g., Euclidean) between the centroids (mean vectors) of the two classes:
$$d_{mean}(C_i,C_j) = \|\mathbf{m}_i-\mathbf{m}_j\| \qquad (15)$$
- The Bhattacharyya distance:
$$d_B(C_i,C_j) = \frac{1}{8}(\mathbf{m}_i-\mathbf{m}_j)^T \left[\frac{\mathbf{\Sigma}_i+\mathbf{\Sigma}_j}{2}\right]^{-1} (\mathbf{m}_i-\mathbf{m}_j) + \frac{1}{2}\ln\frac{\left|\,(\mathbf{\Sigma}_i+\mathbf{\Sigma}_j)/2\,\right|}{\sqrt{|\mathbf{\Sigma}_i|\,|\mathbf{\Sigma}_j|}} \qquad (16)$$
The first term reflects mostly the difference between the two mean vectors of the two clusters, while the second term reflects the difference between the distributions of the two clusters, which is always non-negative due to the AM-GM inequality of the arithmetic and geometric means:
$$\left|\frac{\mathbf{\Sigma}_i+\mathbf{\Sigma}_j}{2}\right| \ge \sqrt{|\mathbf{\Sigma}_i|\,|\mathbf{\Sigma}_j|} \qquad (17)$$
Note that even when the first term is zero due to $\mathbf{m}_i=\mathbf{m}_j$, the Bhattacharyya distance may still be non-zero due to the non-zero second term if $\mathbf{\Sigma}_i\ne\mathbf{\Sigma}_j$.
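The Bhattacharyya distance of Eq. (16) might be computed from the two sample sets as in the sketch below (the function name is mine; slogdet is used only for numerical stability of the log-determinants):

```python
import numpy as np

def bhattacharyya(Xi, Xj):
    """Xi, Xj: (N_i, n) and (N_j, n) sample matrices of the two classes."""
    mi, mj = Xi.mean(axis=0), Xj.mean(axis=0)
    Si = np.cov(Xi, rowvar=False, bias=True)
    Sj = np.cov(Xj, rowvar=False, bias=True)
    S  = (Si + Sj) / 2

    dm = mi - mj
    term1 = dm @ np.linalg.inv(S) @ dm / 8                    # mean-difference term
    _, logdet_S  = np.linalg.slogdet(S)
    _, logdet_Si = np.linalg.slogdet(Si)
    _, logdet_Sj = np.linalg.slogdet(Sj)
    term2 = 0.5 * (logdet_S - 0.5 * (logdet_Si + logdet_Sj))  # covariance term, >= 0 by Eq. (17)
    return term1 + term2
```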
We also define a set of scatter matrices all related to the
separability of the classes/clusters in the feature
space, based on which different subsets of features can be
compared and the best subset can be selected. We will also
consider methods for feature selection or
feature extraction to reduce the dimensionality of
the feature space. Here we assume there are $K$ classes (or clusters) $C_1,\dots,C_K$ in the dataset containing $N$ data points, and the kth class $C_k$ contains $N_k$ data points, i.e., $\sum_{k=1}^K N_k = N$. Let $\mathbf{m}_k$ and $\mathbf{\Sigma}_k$ denote the mean vector and covariance matrix of class $C_k$, $\mathbf{m}$ the mean vector of the whole dataset, and $P_k = N_k/N$ the weight of class $C_k$. The within-class, between-class, and total scatter matrices are defined as
$$\mathbf{S}_w = \sum_{k=1}^K P_k \mathbf{\Sigma}_k = \sum_{k=1}^K \frac{P_k}{N_k} \sum_{\mathbf{x}\in C_k} (\mathbf{x}-\mathbf{m}_k)(\mathbf{x}-\mathbf{m}_k)^T \qquad (18)$$
$$\mathbf{S}_b = \sum_{k=1}^K P_k (\mathbf{m}_k-\mathbf{m})(\mathbf{m}_k-\mathbf{m})^T \qquad (19)$$
$$\mathbf{S}_t = \frac{1}{N} \sum_{k=1}^K \sum_{\mathbf{x}\in C_k} (\mathbf{x}-\mathbf{m})(\mathbf{x}-\mathbf{m})^T \qquad (20)$$
We can show that $\mathbf{S}_t = \mathbf{S}_w + \mathbf{S}_b$, i.e., the total scatter matrix is the sum of the within-class and between-class scatter matrices:
$$\mathbf{S}_t = \frac{1}{N}\sum_{k=1}^K\sum_{\mathbf{x}\in C_k} \big[(\mathbf{x}-\mathbf{m}_k)+(\mathbf{m}_k-\mathbf{m})\big]\big[(\mathbf{x}-\mathbf{m}_k)+(\mathbf{m}_k-\mathbf{m})\big]^T = \mathbf{S}_w + \mathbf{S}_b \qquad (21)$$
where the cross terms vanish because $\sum_{\mathbf{x}\in C_k}(\mathbf{x}-\mathbf{m}_k)=\mathbf{0}$ for each class.
Obviously, for better separability, we want the within-class scatter to be small but the between-class scatter to be large. However, we cannot use the scatter matrices directly, as they cannot be compared in terms of their sizes. Instead, we use their traces (or determinants) as the scalar measurements of the separability:
$$\operatorname{tr}(\mathbf{S}_b) = \sum_{k=1}^K P_k \|\mathbf{m}_k-\mathbf{m}\|^2 \qquad (22)$$
and
$$\operatorname{tr}(\mathbf{S}_w) = \sum_{k=1}^K \frac{P_k}{N_k} \sum_{\mathbf{x}\in C_k} \|\mathbf{x}-\mathbf{m}_k\|^2 \qquad (23)$$
We see that $\operatorname{tr}(\mathbf{S}_b)$ is the weighted sum of the squared Euclidean distances between the class means $\mathbf{m}_k$ and the overall mean $\mathbf{m}$ over all $K$ classes, and $\operatorname{tr}(\mathbf{S}_w)$ is the weighted sum of the sums of squared Euclidean distances from all samples $\mathbf{x}\in C_k$ to the class mean $\mathbf{m}_k$ over all $K$ classes.
To achieve the best separability, we maximize $\operatorname{tr}(\mathbf{S}_b)$ while at the same time minimizing $\operatorname{tr}(\mathbf{S}_w)$. In the same space with a constant $\operatorname{tr}(\mathbf{S}_t)$, maximizing $\operatorname{tr}(\mathbf{S}_b)$ is equivalent to minimizing $\operatorname{tr}(\mathbf{S}_w)$, due to the relationship $\operatorname{tr}(\mathbf{S}_t)=\operatorname{tr}(\mathbf{S}_w)+\operatorname{tr}(\mathbf{S}_b)$.
To measure the separability across different spaces, $\operatorname{tr}(\mathbf{S}_b)$ can still be used, but now it needs to be normalized by either $\operatorname{tr}(\mathbf{S}_w)$ or, equivalently, $\operatorname{tr}(\mathbf{S}_t)$:
$$J = \frac{\operatorname{tr}(\mathbf{S}_b)}{\operatorname{tr}(\mathbf{S}_w)} \qquad \text{or} \qquad J' = \frac{\operatorname{tr}(\mathbf{S}_b)}{\operatorname{tr}(\mathbf{S}_t)} \qquad (24)$$
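A minimal NumPy sketch (the function name and layout are mine) of how the scatter matrices and the criterion of Eq. (24) might be computed from a labelled data matrix, following the weighted definitions above:

```python
import numpy as np

def scatter_matrices(X, labels):
    """X: (N, n) data matrix; labels: (N,) class labels. Returns S_w, S_b, S_t."""
    N, n = X.shape
    m = X.mean(axis=0)                                   # overall mean vector
    S_w = np.zeros((n, n))
    S_b = np.zeros((n, n))
    for k in np.unique(labels):
        Xk = X[labels == k]
        Pk = len(Xk) / N                                 # class weight P_k = N_k / N
        mk = Xk.mean(axis=0)                             # class mean m_k
        S_w += Pk * np.cov(Xk, rowvar=False, bias=True)  # within-class scatter, Eq. (18)
        S_b += Pk * np.outer(mk - m, mk - m)             # between-class scatter, Eq. (19)
    S_t = np.cov(X, rowvar=False, bias=True)             # total scatter, Eq. (20)
    assert np.allclose(S_t, S_w + S_b)                   # S_t = S_w + S_b, Eq. (21)
    return S_w, S_b, S_t

# separability criterion of Eq. (24):
# S_w, S_b, S_t = scatter_matrices(X, labels)
# J = np.trace(S_b) / np.trace(S_w)
```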
The separability measured by the scatter matrices can be used as the criterion for selecting $m$ of the $n$ original features to reduce the computational cost while still keeping the effectiveness of the classification. Moreover, we can also generate $m$ new features as linear combinations of the $n$ original ones by a linear transform:
$$\mathbf{y} = \mathbf{A}^T\mathbf{x}, \qquad \mathbf{A}=[\mathbf{a}_1,\dots,\mathbf{a}_m] \in \mathbb{R}^{n\times m} \qquad (25)$$
We desire to find the $n\times m$ transform matrix $\mathbf{A}$ so that the separability in the original n-D space is maximally conserved in the new m-D space of lower dimensionality $m<n$.
- Optimal transformation for maximizing $\operatorname{tr}(\mathbf{S}_b)$
The separability criterion $\operatorname{tr}(\mathbf{S}_b')$ in the new m-D space can be expressed as:
$$\operatorname{tr}(\mathbf{S}_b') = \operatorname{tr}(\mathbf{A}^T\mathbf{S}_b\mathbf{A}) = \sum_{i=1}^m \mathbf{a}_i^T\mathbf{S}_b\mathbf{a}_i \qquad (26)$$
To find the optimal transform matrix $\mathbf{A}$ that maximizes $\operatorname{tr}(\mathbf{A}^T\mathbf{S}_b\mathbf{A})$, we consider the following constrained maximization problem:
$$\max_{\mathbf{A}}\ \operatorname{tr}(\mathbf{A}^T\mathbf{S}_b\mathbf{A}) \qquad \text{subject to} \qquad \mathbf{a}_i^T\mathbf{a}_i = 1, \quad i=1,\dots,m \qquad (27)$$
Here the constraint is to guarantee that the column vectors of $\mathbf{A}$ are normalized. As in the PCA problem discussed in the Appendix, this constrained optimization problem can be solved by the Lagrange multiplier method:
$$\frac{\partial}{\partial \mathbf{a}_i}\left[\sum_{j=1}^m \mathbf{a}_j^T\mathbf{S}_b\mathbf{a}_j - \sum_{j=1}^m \lambda_j\left(\mathbf{a}_j^T\mathbf{a}_j-1\right)\right] = 2\,\mathbf{S}_b\mathbf{a}_i - 2\lambda_i\mathbf{a}_i = \mathbf{0} \quad\Longrightarrow\quad \mathbf{S}_b\mathbf{a}_i = \lambda_i\mathbf{a}_i \qquad (28)$$
We see that the optimal feature extraction transform is the PCA transform, which compacts most of the energy/information (representing separability in this context) into $m$ components. The column vectors of $\mathbf{A}$ must be the orthogonal eigenvectors of the symmetric matrix $\mathbf{S}_b$ corresponding to its $m$ greatest eigenvalues, and the $m$ new features can be obtained by
$$y_i = \mathbf{a}_i^T\mathbf{x}, \quad i=1,\dots,m \qquad \text{or} \qquad \mathbf{y} = \mathbf{A}^T\mathbf{x} \qquad (29)$$
and
$$\operatorname{tr}(\mathbf{S}_b') = \operatorname{tr}(\mathbf{A}^T\mathbf{S}_b\mathbf{A}) = \sum_{i=1}^m \lambda_i \qquad (30)$$
where the eigenvalues of $\mathbf{S}_b$ are sorted in descending order $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$.
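A sketch of this transform, assuming $\mathbf{S}_b$ has already been computed (e.g., by a helper like the scatter_matrices sketch above):

```python
import numpy as np

def transform_max_tr_Sb(S_b, m):
    """Return the n x m transform A maximizing tr(A^T S_b A) under a_i^T a_i = 1."""
    evals, evecs = np.linalg.eigh(S_b)        # S_b is symmetric positive semi-definite
    order = np.argsort(evals)[::-1]           # sort eigenvalues in descending order
    A = evecs[:, order[:m]]                   # top-m orthogonal eigenvectors of S_b
    separability = evals[order[:m]].sum()     # tr(A^T S_b A) = sum of m greatest eigenvalues, Eq. (30)
    return A, separability

# new features for a data matrix X of shape (N, n):  Y = X @ A  (i.e., y = A^T x per sample), Eq. (29)
```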
- Optimal transformation for maximizing $\operatorname{tr}(\mathbf{S}_w^{-1}\mathbf{S}_b)$
The previous method only maximizes $\operatorname{tr}(\mathbf{S}_b)$ without taking into consideration $\operatorname{tr}(\mathbf{S}_w)$ or, equivalently, $\operatorname{tr}(\mathbf{S}_t)$. If in the m-D space after the transform $\operatorname{tr}(\mathbf{S}_w')$ has changed as well as $\operatorname{tr}(\mathbf{S}_b')$, then we need to maximize $\operatorname{tr}(\mathbf{S}_b')$ while at the same time minimizing $\operatorname{tr}(\mathbf{S}_w')$, in order to maximize the separability. Or, equivalently, we need to maximize $\operatorname{tr}(\mathbf{S}_b')$ normalized by the total scatter $\operatorname{tr}(\mathbf{S}_t')$. We can therefore use the trace of $\mathbf{S}_w'^{-1}\mathbf{S}_b'$ (or $\mathbf{S}_t'^{-1}\mathbf{S}_b'$) as the measurement of the separability in the new m-D space:
$$J(\mathbf{A}) = \operatorname{tr}\!\left(\mathbf{S}_w'^{-1}\mathbf{S}_b'\right) = \operatorname{tr}\!\left[\left(\mathbf{A}^T\mathbf{S}_w\mathbf{A}\right)^{-1}\left(\mathbf{A}^T\mathbf{S}_b\mathbf{A}\right)\right] \qquad (31)$$
We need to find the $\mathbf{A}$ that maximizes the trace of the matrix above in the m-D space of $\mathbf{y}=\mathbf{A}^T\mathbf{x}$. As $\mathbf{S}_w^{-1}\mathbf{S}_b$ (a product of two symmetric matrices) is not itself a symmetric matrix, the PCA method above cannot be used; we will find the optimal matrix $\mathbf{A}$ in a different way.
We first let $m=1$, so that $\mathbf{A}$ reduces to an $n\times 1$ vector $\mathbf{a}$ that maximizes the following objective function in the 1-D space $y=\mathbf{a}^T\mathbf{x}$:
$$J(\mathbf{a}) = \frac{\mathbf{a}^T\mathbf{S}_b\mathbf{a}}{\mathbf{a}^T\mathbf{S}_w\mathbf{a}} \qquad (32)$$
This function $J(\mathbf{a})$ is the Rayleigh quotient of the two symmetric matrices $\mathbf{S}_b$ and $\mathbf{S}_w$. The optimal transform vector $\mathbf{a}$ that maximizes this $J(\mathbf{a})$ can be found by solving the corresponding generalized eigenvalue problem:
$$\mathbf{S}_b\mathbf{a} = \lambda\,\mathbf{S}_w\mathbf{a}, \qquad \text{i.e.,} \qquad \mathbf{S}_w^{-1}\mathbf{S}_b\,\mathbf{a} = \lambda\,\mathbf{a} \qquad (33)$$
where $\lambda$ is an eigenvalue and $\mathbf{a}$ the corresponding eigenvector of $\mathbf{S}_w^{-1}\mathbf{S}_b$. Obviously, the transform vector $\mathbf{a}$ that maximizes $J(\mathbf{a})$ is the eigenvector corresponding to the greatest eigenvalue $\lambda_1$.
We next generalize the method above to the case of $m>1$. The matrix form of the eigenequation above is:
$$\mathbf{S}_b\mathbf{V} = \mathbf{S}_w\mathbf{V}\mathbf{\Lambda} \qquad (34)$$
where $\mathbf{\Lambda}=\operatorname{diag}(\lambda_1,\dots,\lambda_n)$ is the eigenvalue matrix of $\mathbf{S}_w^{-1}\mathbf{S}_b$, and $\mathbf{V}=[\mathbf{v}_1,\dots,\mathbf{v}_n]$ the eigenvector matrix (no longer orthogonal in general). This generalized eigenvalue problem can be solved by finding the matrix $\mathbf{V}$ that diagonalizes both $\mathbf{S}_w$ and $\mathbf{S}_b$ at the same time:
$$\mathbf{V}^T\mathbf{S}_b\mathbf{V} = \mathbf{\Lambda}, \qquad \mathbf{V}^T\mathbf{S}_w\mathbf{V} = \mathbf{I} \qquad (35)$$
Left multiplying the first equation by the inverse of the second, we get
$$\left(\mathbf{V}^T\mathbf{S}_w\mathbf{V}\right)^{-1}\left(\mathbf{V}^T\mathbf{S}_b\mathbf{V}\right) = \mathbf{V}^{-1}\mathbf{S}_w^{-1}\mathbf{S}_b\mathbf{V} = \mathbf{\Lambda} \qquad (36)$$
The optimal transform matrix $\mathbf{A}=[\mathbf{v}_1,\dots,\mathbf{v}_m]$ is composed of the eigenvectors corresponding to the $m$ greatest of all $n$ eigenvalues $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n$, so that
- the signal components in $\mathbf{y}=\mathbf{A}^T\mathbf{x}$ are completely decorrelated, and each component $y_i$ carries a certain amount of separability information ($\lambda_i$), independent of the others;
- the total separability contained in the m-D space, as the sum of the $m$ greatest eigenvalues $\sum_{i=1}^m \lambda_i$, is maximized.
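The whole procedure might be sketched as follows, using scipy.linalg.eigh to solve the generalized symmetric eigenproblem of Eqs. (33)/(34) directly ($\mathbf{S}_w$ is assumed nonsingular; the function name is mine):

```python
import numpy as np
from scipy.linalg import eigh

def transform_max_tr_SwInvSb(S_w, S_b, m):
    """Return the n x m transform A built from the m leading generalized eigenvectors."""
    evals, V = eigh(S_b, S_w)                 # solves S_b v = lambda S_w v, with V^T S_w V = I, Eq. (35)
    order = np.argsort(evals)[::-1]           # eigenvalues of S_w^{-1} S_b in descending order
    A = V[:, order[:m]]                       # columns are not orthogonal in general
    separability = evals[order[:m]].sum()     # sum of the m greatest eigenvalues
    return A, separability
```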