Discriminative vs. Generative Methods for Classification

Pattern classification, one of the most important tasks in machine learning, is the assignment of objects of interest, generically referred to as patterns and described by a set of $d$ features or attributes, to one of $K$ classes or categories. Each pattern is represented by a vector (or a point) ${\bf x}=[x_1,\cdots,x_d]^T$ in a $d$-dimensional feature space, where $x_i\;(i=1,\cdots,d)$ is a variable for the measurement of the $i$th feature. Symbolically, the $K$ classes are denoted by $\{C_1,\cdots,C_K\}$, and a pattern ${\bf x}$ belonging to the $k$th class is denoted by ${\bf x}\in C_k$. Pattern classification can therefore be viewed as the process by which the $d$-dimensional feature space is partitioned into $K$ regions, each corresponding to one of the $K$ classes. The boundaries between these regions, called decision boundaries, are determined by the specific algorithm, called a classifier, used for the classification.

Pattern classification can be carried out as either a supervised or an unsupervised learning process, depending on the availability of a training set containing patterns of known class identities. Specifically, the training set contains $N$ patterns in ${\bf X}=[{\bf x}_1,\cdots,{\bf x}_N]$, labeled respectively by the components of ${\bf y}=[y_1,\cdots,y_N]^T$, which represent the class identities of the corresponding patterns in some way. For example, we can use $y_k\in\{1,\cdots,K\}$ to indicate ${\bf x}_k\in C_{y_k}$. In the special case of $K=2$, there are only two classes $C_1=C_+$ and $C_0=C_-$, and the classifier becomes binary, based on training patterns ${\bf x}_j\;(j=1,\cdots,N)$, each labeled by $y_j=1$ if ${\bf x}_j\in C_+$ or $y_j=-1$ if ${\bf x}_j\in C_-$.
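As a concrete illustration of this setup, here is a minimal sketch in Python/NumPy; the specific patterns and the column-wise storage of ${\bf X}$ are illustrative assumptions, not taken from the text:

```python
import numpy as np

# A hypothetical training set of N = 6 patterns in a d = 2 dimensional
# feature space, stored column-wise as X = [x_1, ..., x_N].
X = np.array([[0.5, 1.2, 0.9, 3.1, 2.8, 3.5],   # feature x_1 of each pattern
              [1.0, 0.7, 1.4, 2.9, 3.3, 2.6]])  # feature x_2 of each pattern

# Binary labels y_j in {+1, -1}: y_j = 1 if x_j in C_+, y_j = -1 if x_j in C_-.
y = np.array([1, 1, 1, -1, -1, -1])

d, N = X.shape   # d = 2 features, N = 6 training patterns
print(d, N)      # 2 6
```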

We assume there are $N_k$ training samples $\{{\bf x}_1^{(k)},\cdots,{\bf x}_{N_k}^{(k)}\}$ all labeled as belonging to $C_k\;(k=1,\cdots,K)$, for a total of $N=\sum_{k=1}^K N_k$ samples in the training set. If the training set is a fair representation of the patterns of the different classes in the entire dataset, then $P_k=N_k/N$ can be treated as an estimate of the a priori probability that a randomly selected pattern ${\bf x}$ belongs to class $C_k$, in the absence of any prior knowledge of the pattern.
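The prior estimate $P_k=N_k/N$ is just a normalized class count. A minimal sketch, continuing the hypothetical labels above:

```python
import numpy as np

y = np.array([1, 1, 1, -1, -1, -1])   # hypothetical labels from above

# Count N_k for each class and estimate the priors P_k = N_k / N.
classes, counts = np.unique(y, return_counts=True)
priors = counts / y.size
for c, p in zip(classes, priors):
    print(f"P(C = {c}) = {p:.3f}")    # P(C = -1) = 0.500, P(C = 1) = 0.500
```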

Once a classifier is properly trained according to a specific algorithm based on the training set, the feature space is partitioned into regions corresponding to the different classes, and any unlabeled pattern of unknown class, represented as a vector ${\bf x}$ in the feature space, can be classified into one of the $K$ classes.

Supervised classification can be considered as a process of establishing the relationship between the patterns ${\bf x}_1,\cdots,{\bf x}_N$, treated as the independent or input variables to the classifier, and the classes $C_1,\cdots,C_K$ to which the input patterns belong, treated as the dependent or output variables. Therefore regression and classification can be considered as the same supervised learning process: modeling the relationship between the data points in $\{{\bf x}_1,\cdots,{\bf x}_N\}$ and their corresponding labels (or targets) in $\{y_1,\cdots,y_N\}$. This process is regression when the labels take continuous real values, but it is classification when they are discrete categorical values representing different classes. Some methods in the previous chapter on regression analysis can actually be used as classifiers, such as logistic and softmax regressions, and the method of Gaussian process can also be used for classification.
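To make this connection concrete, here is a minimal sketch of logistic regression used as a binary classifier, implemented directly in NumPy with plain gradient descent; the data, learning rate, and iteration count are illustrative assumptions, not from the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 2-class training data: rows are patterns, columns are features.
X = np.array([[0.5, 1.0], [1.2, 0.7], [0.9, 1.4],
              [3.1, 2.9], [2.8, 3.3], [3.5, 2.6]])
t = np.array([0, 0, 0, 1, 1, 1])        # labels coded 0/1 for the logistic model

Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a bias column
w = np.zeros(Xb.shape[1])

# Gradient descent on the logistic (cross-entropy) loss.
for _ in range(1000):
    p = sigmoid(Xb @ w)                 # predicted P(C_1 | x) for each pattern
    w -= 0.1 * Xb.T @ (p - t) / t.size  # gradient step

# Classify a new pattern: decide C_1 if P(C_1 | x) > 0.5, else C_0.
x_new = np.array([1.0, 3.0, 3.0])       # [bias, x_1, x_2]
print(sigmoid(x_new @ w) > 0.5)         # True -> class C_1
```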

If the training data of labeled patterns are unavailable, various unsupervised learning methods can be used to assign each unlabeled pattern to one of $K$ different groups, called clusters, according to its position in the feature space, based on the overall spatial structure and distribution of the dataset in the feature space. This process is called cluster analysis or simply clustering.
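As an example of such an unsupervised method, here is a minimal sketch of the standard K-means clustering algorithm, chosen purely for illustration; the data and the choice of $K$ are assumptions:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Assign each row of X to one of K clusters by iteratively updating
    cluster means (a minimal sketch; assumes no cluster becomes empty)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]  # initial means
    for _ in range(n_iter):
        # Assign each pattern to its nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned patterns.
        centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    return labels, centers

X = np.array([[0.5, 1.0], [1.2, 0.7], [0.9, 1.4],
              [3.1, 2.9], [2.8, 3.3], [3.5, 2.6]])
labels, centers = kmeans(X, K=2)
print(labels)   # e.g., [0 0 0 1 1 1] (cluster indices, not class identities)
```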

There exist a variety of methods for learning, including both regression and classification, based on different assumed models. One way to characterize these methods is to put them all in a probabilistic framework, in terms of the probabilities of the given dataset, including the data points ${\bf X}$ and the corresponding labels ${\bf y}$. A method can then be categorized into one of the following two groups:

- Discriminative methods model the posterior probability $p({\bf y}|{\bf X})$ of the labels given the observed data directly, without modeling how the data themselves are distributed, and classify a pattern ${\bf x}$ by the class that maximizes this posterior.

- Generative methods model the joint probability $p({\bf X},{\bf y})=p({\bf X}|{\bf y})\,p({\bf y})$, i.e., both the class-conditional distribution of the data and the prior of the classes, from which the posterior needed for classification can be obtained by Bayes' theorem.

(Figure: discriminative vs. generative methods.)
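A small sketch contrasting one representative method from each group, using scikit-learn's logistic regression (discriminative) and Gaussian naive Bayes (generative) as stand-in examples; the data are the same hypothetical patterns used above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Hypothetical two-class training data (rows are patterns).
X = np.array([[0.5, 1.0], [1.2, 0.7], [0.9, 1.4],
              [3.1, 2.9], [2.8, 3.3], [3.5, 2.6]])
y = np.array([1, 1, 1, -1, -1, -1])

# Discriminative: fits p(y | x) directly via a parametric decision rule.
disc = LogisticRegression().fit(X, y)

# Generative: fits p(x | C_k) as a Gaussian per class, plus priors P_k,
# then classifies via Bayes' theorem.
gen = GaussianNB().fit(X, y)

x_new = np.array([[3.0, 3.0]])
print(disc.predict(x_new), gen.predict(x_new))   # both should output [-1]
```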

Here are some comparisons between the two approaches: