Feature Selection and Dimensionality Reduction

The main goal of pattern classification is to assign each pattern, an object of interest described by $d$ features and represented as a $d$-dimensional vector ${\bf x}=[x_1,\cdots,x_d]^T$ in the feature space, to one of $K$ classes $\{C_1,\cdots, C_K\}$. In supervised classification, we partition the feature space into $K$ regions, each corresponding to one of the classes, based on a provided training set ${\cal D}=\{{\bf X},{\bf y}\}$, in which each training sample in ${\bf X}=[{\bf x}_1,\cdots,{\bf x}_N]$ is labeled by the corresponding component of ${\bf y}=[y_1,\cdots,y_N]^T$. For example, if $y_n\in\{1,2,\cdots,K\}$, then ${\bf x}_n\in C_{y_n}$. However, other labeling schemes can be used depending on the specific classification algorithm, such as the binary labeling used in certain regression algorithms considered previously.
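As a minimal sketch of this setup, the following Python snippet builds a training set with hypothetical sizes ($d=4$ features, $N=6$ samples, $K=3$ classes, all chosen here only for illustration), storing the samples as columns of ${\bf X}$ and the integer labels in ${\bf y}$:

```python
import numpy as np

# Hypothetical sizes for illustration: d features, N samples, K classes.
d, N, K = 4, 6, 3
rng = np.random.default_rng(0)

# X collects the N training samples as columns, so X is d x N,
# matching X = [x_1, ..., x_N] with each x_n a d-dimensional vector.
X = rng.standard_normal((d, N))

# y holds the integer labels y_n in {1, ..., K}; x_n belongs to class C_{y_n}.
y = rng.integers(1, K + 1, size=N)

# The n-th labeled training pair (x_n, y_n):
n = 0
x_n, y_n = X[:, n], y[n]
print(f"sample x_{n+1} = {x_n} is labeled as class C_{y_n}")
```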

When the dimensionality $d$ of the feature space is high, there is a need to reduce it significantly to $d'\ll d$ while still maintaining most of the information relevant to the task of classification, i.e., the separability of the classes. This can be done either by selecting $d'$ features directly from the $d$ original ones (feature selection, for which there are $C_d^{d'}=d!/(d'!(d-d')!)$ possible choices), or by generating $d'$ new features as certain linear combinations of the $d$ original ones (feature extraction). In either case, the separability of the data is to be maximally maintained in the resulting space spanned by the $d'$ features, so that classification can be carried out both efficiently and effectively in the much lower-dimensional feature space.
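The sketch below contrasts the two routes on a data matrix with hypothetical sizes $d=10$ and $d'=3$: the first simply keeps $d'$ of the original feature rows (the indices kept here are a placeholder for an actual selection criterion), while the second forms $d'$ linear combinations ${\bf w}^T{\bf x}$ of the original features, using PCA via the SVD of the centered data as one standard choice:

```python
import numpy as np
from math import comb

# Hypothetical sizes for illustration.
d, d_prime, N = 10, 3, 50
rng = np.random.default_rng(1)
X = rng.standard_normal((d, N))  # d x N data matrix

# Number of ways to pick d' of the d original features: C(d, d').
print(comb(d, d_prime))  # 120

# Route 1 (feature selection): keep d' of the original feature rows;
# the indices below stand in for the output of a selection criterion.
selected = [0, 1, 2]
X_sel = X[selected, :]            # d' x N

# Route 2 (feature extraction): form d' new features as linear
# combinations of the original ones; here, project the centered data
# onto its top-d' principal directions (PCA).
Xc = X - X.mean(axis=1, keepdims=True)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
W = U[:, :d_prime]                # d x d' projection matrix
X_pca = W.T @ Xc                  # d' x N

print(X_sel.shape, X_pca.shape)   # (3, 50) (3, 50)
```

In both routes the reduced data remain a $d'\times N$ matrix, so any classifier designed for the original representation can be applied unchanged in the lower-dimensional space.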