In K-means clustering, each sample point is assigned to the cluster whose mean is closest to it in Euclidean distance. However, the distribution of the samples within each cluster, represented by its covariance, is not taken into consideration.
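For concreteness, the short Python sketch below shows this hard nearest-mean assignment; the sample points and cluster means are made-up illustrative values.

```python
import numpy as np

# Hard K-means assignment: each point goes to the cluster whose mean is nearest
# in Euclidean distance; the spread (covariance) of each cluster is ignored.
X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 3.0]])   # made-up sample points
means = np.array([[0.0, 0.0], [4.0, 4.0]])           # made-up cluster means

d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)  # squared distances
labels = d2.argmin(axis=1)
print(labels)   # -> [0 0 1]
```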
This issue can be addressed by the Gaussian mixture model (GMM), based on the assumption that each of the $K$ clusters in the given dataset can be modeled by a Gaussian distribution $\mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$ in terms of both its mean vector $\boldsymbol{\mu}_k$ and covariance matrix $\boldsymbol{\Sigma}_k$. The overall distribution of the entire dataset is the weighted sum of these Gaussian distributions:

$$p(\mathbf{x}) = \sum_{k=1}^K \pi_k\,\mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k) \tag{209}$$

where $\pi_k$ is the weight of the kth Gaussian $\mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$, satisfying

$$\sum_{k=1}^K \pi_k = 1, \qquad 0 \le \pi_k \le 1 \tag{210}$$
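As an illustration, the following Python sketch evaluates the mixture density of Eq. (209) as a weighted sum of Gaussian components using SciPy's multivariate normal density; the weights, means, and covariances are made-up values chosen to satisfy Eq. (210).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-component GMM in 2D (made-up parameters).
pi = np.array([0.4, 0.6])                 # mixture weights, sum to 1 (Eq. 210)
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigma = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]

def gmm_pdf(x, pi, mu, Sigma):
    """Weighted sum of Gaussian densities, Eq. (209)."""
    return sum(p * multivariate_normal.pdf(x, mean=m, cov=S)
               for p, m, S in zip(pi, mu, Sigma))

x = np.array([1.0, 2.0])
print(gmm_pdf(x, pi, mu, Sigma))
```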
We note that the GMM of Eq. (209) has the same form as Eq. (11) in the naive Bayes classification. The two methods are similar in the sense that each cluster or class $C_k$ is modeled by a Gaussian $\mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$ weighted by $\pi_k$, and the model parameters $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$, as well as $\pi_k$, are to be estimated from the given dataset. However, the two methods differ in that the dataset in the supervised naive Bayes method is labeled by the given labeling $\mathbf{y}$, while in GMM such a labeling vector is not available. Instead, here we introduce a latent or hidden variable $\mathbf{z}=[z_1,\dots,z_K]^T$ as the cluster labeling of the samples in the given dataset. This latent variable is a binary vector, of which each component is a binary random variable $z_k\in\{0,1\}$. Only one of these components is 1, e.g., $z_k=1$, indicating that a sample in the dataset belongs to the kth cluster $C_k$, while all others are 0, i.e., these binary variables add up to 1: $\sum_{k=1}^K z_k = 1$.
We further introduce the following probabilities for each of the $K$ clusters:

$$P(z_k=1)=\pi_k, \qquad p(\mathbf{x}\,|\,z_k=1)=\mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$$

All such probabilities defined for the individual components $z_k$ can be generalized to the latent vector $\mathbf{z}$ for all clusters:

$$P(\mathbf{z})=\prod_{k=1}^K \pi_k^{z_k} \tag{215}$$

$$p(\mathbf{x}\,|\,\mathbf{z})=\prod_{k=1}^K \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)^{z_k} \tag{216}$$

$$p(\mathbf{x})=\sum_{\mathbf{z}} P(\mathbf{z})\,p(\mathbf{x}\,|\,\mathbf{z})=\sum_{k=1}^K \pi_k\,\mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k) \tag{217}$$
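The probabilities above also describe a generative process: first draw $\mathbf{z}$ with $P(z_k=1)=\pi_k$, then draw $\mathbf{x}$ from the selected Gaussian. The following sketch, with made-up parameters, samples from a GMM in exactly this way.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 3-component GMM in 2D.
pi = np.array([0.5, 0.3, 0.2])
mu = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
Sigma = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.5])])

def sample_gmm(n):
    """Draw z ~ P(z) = prod_k pi_k^{z_k}, then x ~ p(x|z) (Eqs. 215-216)."""
    ks = rng.choice(len(pi), size=n, p=pi)    # index of the single 1 in each z_n
    X = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in ks])
    Z = np.eye(len(pi))[ks]                   # one-hot latent vectors z_n
    return X, Z

X, Z = sample_gmm(5)
print(X.shape)
print(Z)
```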
Given the dataset $\mathbf{X}=[\mathbf{x}_1,\dots,\mathbf{x}_N]$ containing $N$ i.i.d. samples, we introduce $N$ corresponding latent variables in $\mathbf{Z}=[\mathbf{z}_1,\dots,\mathbf{z}_N]$, of which the nth column $\mathbf{z}_n$ is the labeling vector of $\mathbf{x}_n$, i.e., $\mathbf{x}_n$ belongs to $C_k$ if $z_{nk}=1$ (while $z_{nj}=0$ for all $j\ne k$). Note that $\mathbf{z}_n$ here is defined in the same way as $\mathbf{y}_n$ in softmax regression, both as the labeling of $\mathbf{x}_n$, with the only difference that $\mathbf{y}_n$ is provided in the training data available to a supervised method, while here $\mathbf{z}_n$ is a latent variable not part of the data provided for unsupervised clustering. Now we have the likelihood of the entire dataset:

$$p(\mathbf{X})=\prod_{n=1}^N p(\mathbf{x}_n)=\prod_{n=1}^N \sum_{k=1}^K \pi_k\,\mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k) \tag{218}$$
We first find the posterior probability for an observed $\mathbf{x}_n$ to belong to cluster $C_k$, indicated by $z_{nk}=1$ (and $z_{nj}=0$ for all $j\ne k$):

$$\gamma(z_{nk}) = P(z_{nk}=1\,|\,\mathbf{x}_n)
= \frac{P(z_{nk}=1)\,p(\mathbf{x}_n\,|\,z_{nk}=1)}{\sum_{j=1}^K P(z_{nj}=1)\,p(\mathbf{x}_n\,|\,z_{nj}=1)}
= \frac{\pi_k\,\mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)}{\sum_{j=1}^K \pi_j\,\mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)} \tag{222}$$

This quantity is called the responsibility of cluster $C_k$ for sample $\mathbf{x}_n$.
We note that the definition of $\gamma(z_{nk})$ is similar in form to the softmax function, with the only difference that here the weighted Gaussian functions $\pi_k\,\mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$ are used instead of the exponential functions used in softmax regression.
We also note that the posterior probability defined above represents a soft decision, in the sense that any $\mathbf{x}_n$ may belong to any $C_k$ with probability $\gamma(z_{nk})$ for all $k=1,\dots,K$, instead of the hard decision of K-means clustering, in which $\mathbf{x}_n$ belongs to only one specific $C_k$ with $z_{nk}=1$ while $z_{nj}=0$ for all $j\ne k$.
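A minimal sketch of Eq. (222), assuming made-up parameters and data points, computes the responsibility matrix $\gamma(z_{nk})$ and contrasts the soft assignment with the hard arg-max assignment used by K-means.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mu, Sigma):
    """Posterior gamma(z_nk) of Eq. (222): soft assignment of each x_n to each C_k."""
    N, K = X.shape[0], len(pi)
    gamma = np.empty((N, K))
    for k in range(K):
        gamma[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
    gamma /= gamma.sum(axis=1, keepdims=True)   # normalize over k
    return gamma

# Made-up parameters and data points for illustration.
pi = np.array([0.5, 0.5])
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigma = [np.eye(2), np.eye(2)]
X = np.array([[0.5, 0.5], [1.5, 1.5], [3.0, 2.5]])

gamma = responsibilities(X, pi, mu, Sigma)
print(gamma)                  # soft decision: each row sums to 1
print(gamma.argmax(axis=1))   # hard decision, as in K-means
```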
We can now find the expectation of the complete-data log likelihood with respect to the latent variables in $\mathbf{Z}$:

$$E_{\mathbf{Z}}\left[\ln p(\mathbf{X},\mathbf{Z})\right] = \sum_{n=1}^N \sum_{k=1}^K \gamma(z_{nk})\left[\ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)\right] \tag{224}$$
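Once the responsibilities are known, the expectation in Eq. (224) can be evaluated directly; the sketch below does so with made-up values and a hypothetical helper name.

```python
import numpy as np
from scipy.stats import multivariate_normal

def expected_log_likelihood(X, gamma, pi, mu, Sigma):
    """E_Z[ln p(X,Z)] of Eq. (224): sum_n sum_k gamma_nk (ln pi_k + ln N(x_n|mu_k,Sigma_k))."""
    total = 0.0
    for k in range(len(pi)):
        log_pdf = multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma[k])
        total += np.sum(gamma[:, k] * (np.log(pi[k]) + log_pdf))
    return total

# Tiny illustration with made-up parameters and a uniform soft assignment.
X = np.array([[0.0, 0.0], [3.0, 3.0]])
pi = np.array([0.5, 0.5])
mu = [np.zeros(2), np.full(2, 3.0)]
Sigma = [np.eye(2), np.eye(2)]
gamma = np.full((2, 2), 0.5)
print(expected_log_likelihood(X, gamma, pi, mu, Sigma))
```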
We set to zero the derivatives of this expected log likelihood with respect to each of the parameters $\pi_k$, $\boldsymbol{\mu}_k$, and $\boldsymbol{\Sigma}_k$, and then solve the resulting equations to get the optimal parameters.
As the weights $\pi_k$ need to satisfy the constraint $\sum_{k=1}^K\pi_k=1$, we first construct a Lagrangian function composed of the expected log likelihood as the objective function together with an extra Lagrange multiplier term for the constraint:
$$L(\pi,\lambda) = \sum_{n=1}^N \sum_{k=1}^K \gamma(z_{nk})\left[\ln\pi_k + \ln\mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)\right] + \lambda\left(\sum_{k=1}^K \pi_k - 1\right) \tag{225}$$

Setting its derivative with respect to $\pi_k$ to zero,

$$\frac{\partial L}{\partial \pi_k} = \sum_{n=1}^N \frac{\gamma(z_{nk})}{\pi_k} + \lambda = 0 \tag{226}$$

then multiplying by $\pi_k$ and summing over $k$ gives $\lambda=-N$, and therefore

$$\pi_k = \frac{N_k}{N}, \qquad \text{where } N_k = \sum_{n=1}^N \gamma(z_{nk}) \tag{229}$$

Setting the derivative of the expected log likelihood in Eq. (224) with respect to $\boldsymbol{\mu}_k$ to zero,

$$\sum_{n=1}^N \gamma(z_{nk})\,\boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_k) = 0 \tag{230}$$

we get

$$\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{n=1}^N \gamma(z_{nk})\,\mathbf{x}_n \tag{233}$$

Similarly, setting the derivative with respect to $\boldsymbol{\Sigma}_k$ to zero,

$$\sum_{n=1}^N \gamma(z_{nk})\left[\boldsymbol{\Sigma}_k - (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^T\right] = 0 \tag{236}$$

we get

$$\boldsymbol{\Sigma}_k = \frac{1}{N_k}\sum_{n=1}^N \gamma(z_{nk})(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^T \tag{237}$$

Note that these update formulas are implicit, as the responsibilities $\gamma(z_{nk})$ on the right-hand side themselves depend on the parameters; this is why the parameters are estimated by the iterative EM algorithm summarized below.
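Given a responsibility matrix, the re-estimation formulas in Eqs. (229), (233), and (237) can be implemented in a few lines; the following is a minimal sketch with made-up inputs.

```python
import numpy as np

def m_step(X, gamma):
    """Re-estimate pi_k, mu_k, Sigma_k from the responsibilities (Eqs. 229, 233, 237)."""
    N, d = X.shape
    Nk = gamma.sum(axis=0)                # N_k = sum_n gamma_nk
    pi = Nk / N                           # Eq. (229)
    mu = (gamma.T @ X) / Nk[:, None]      # Eq. (233)
    Sigma = []
    for k in range(len(Nk)):
        D = X - mu[k]                     # deviations x_n - mu_k
        Sigma.append((gamma[:, k, None] * D).T @ D / Nk[k])   # Eq. (237)
    return pi, mu, np.array(Sigma)

# Illustration with a made-up responsibility matrix.
X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
gamma = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
pi, mu, Sigma = m_step(X, gamma)
print(pi, mu, Sigma, sep="\n")
```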
In summary, here is the EM clustering algorithm based on the Gaussian mixture model:
E-step: find the responsibility $\gamma(z_{nk})$ for all $N$ data points and all $K$ clusters, and then $N_k$:

$$\gamma(z_{nk}) = \frac{\pi_k\,\mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)}{\sum_{j=1}^K \pi_j\,\mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)}, \qquad N_k = \sum_{n=1}^N \gamma(z_{nk}) \tag{239}$$
M-step: recalculate the parameters that maximize the expected log likelihood:

$$\pi_k = \frac{N_k}{N}, \qquad \boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{n=1}^N \gamma(z_{nk})\,\mathbf{x}_n, \qquad \boldsymbol{\Sigma}_k = \frac{1}{N_k}\sum_{n=1}^N \gamma(z_{nk})(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^T \tag{240}$$

These two steps are repeated until the iteration converges.
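Putting the two steps together, the following is one possible self-contained sketch of the EM clustering algorithm; the random initialization, the small ridge added to each covariance for numerical stability, and the stopping rule are implementation choices not specified above, and the function name is just for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    """EM clustering with a Gaussian mixture model (Eqs. 239-240)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialization (one of many possible choices): means drawn from the data,
    # a common covariance estimated from the data, uniform weights.
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    Sigma = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(d) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities gamma(z_nk) and N_k, Eq. (239).
        dens = np.column_stack(
            [pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        Nk = gamma.sum(axis=0)
        # M-step: re-estimate the parameters, Eq. (240).
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            D = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * D).T @ D / Nk[k] + 1e-6 * np.eye(d)
        # Stop when the log of the likelihood in Eq. (218) no longer improves.
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma

# Example: two made-up Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)), rng.normal([4, 4], 0.5, (100, 2))])
pi, mu, Sigma, gamma = em_gmm(X, K=2)
print(pi)
print(mu)
```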
We can show that the K-means algorithm is actually a special case of the EM algorithm, in which all covariance matrices are the same isotropic matrix $\boldsymbol{\Sigma}_k = \epsilon\mathbf{I}$, where $\epsilon$ is a scaling factor that approaches zero. In this case we have

$$\mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \epsilon\mathbf{I}) = \frac{1}{(2\pi\epsilon)^{d/2}} \exp\left(-\frac{\|\mathbf{x}-\boldsymbol{\mu}_k\|^2}{2\epsilon}\right) \tag{241}$$

where $d$ is the dimensionality of $\mathbf{x}$, and the responsibility of Eq. (222) becomes

$$\gamma(z_{nk}) = \frac{\pi_k \exp\left(-\|\mathbf{x}_n-\boldsymbol{\mu}_k\|^2/2\epsilon\right)}{\sum_{j=1}^K \pi_j \exp\left(-\|\mathbf{x}_n-\boldsymbol{\mu}_j\|^2/2\epsilon\right)} \tag{242}$$

As $\epsilon\to 0$, the term with the smallest distance $\|\mathbf{x}_n-\boldsymbol{\mu}_k\|^2$ dominates, so the soft assignment hardens into the 0/1 assignment of K-means:

$$\gamma(z_{nk}) \to \begin{cases} 1 & \text{if } k = \arg\min_j \|\mathbf{x}_n-\boldsymbol{\mu}_j\|^2 \\ 0 & \text{otherwise} \end{cases} \tag{243}$$
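A small numerical illustration of Eqs. (241) through (243): with shared covariance $\epsilon\mathbf{I}$, the responsibilities of Eq. (242) harden toward the 0/1 K-means assignment as $\epsilon$ shrinks. The point and means below are made up.

```python
import numpy as np

def soft_assignment_isotropic(X, pi, mu, eps):
    """gamma(z_nk) of Eq. (242) when all Sigma_k = eps * I."""
    # Squared distances ||x_n - mu_k||^2; the (2*pi*eps)^(-d/2) factor cancels in the ratio.
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    w = pi * np.exp(-d2 / (2 * eps))
    return w / w.sum(axis=1, keepdims=True)

# One made-up point between two means, closer to the first.
X = np.array([[1.0, 0.0]])
mu = np.array([[0.0, 0.0], [3.0, 0.0]])
pi = np.array([0.5, 0.5])

for eps in [10.0, 1.0, 0.1]:
    print(eps, soft_assignment_isotropic(X, pi, mu, eps))
# As eps -> 0, the responsibility of the nearest mean approaches 1 (Eq. 243),
# i.e., the hard assignment used by K-means.
```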
We can also make a comparison between the GMM method for unsupervised clustering and softmax regression for supervised classification. First, the latent variables $\mathbf{z}_n$ in GMM play a similar role to the labelings $\mathbf{y}_n$ in softmax regression for multi-class classification. The difference is that $\mathbf{y}_n$ is explicitly given in the training set for supervised classification, while $\mathbf{z}_n$ is hidden in unsupervised clustering analysis. Second, we note that the probability $\gamma(z_{nk})$ given in Eq. (222) is similar in form to the softmax function in the softmax method, with the only difference that the Gaussian function is used in GMM while the exponential function is used in softmax.
Examples
The same dataset is used to test both the K-means and EM clustering methods. The first panel shows 10 iterations of the K-means method, while the second panel shows 16 iterations of the EM method. In both cases, the iteration converges to the last plot. Comparing the two clustering results, we see that the K-means method cannot separate the red and green data points, which come from two different clusters that are both normally distributed with similar means but very different covariance matrices, while the blue data points, all from a single cluster, are split into two clusters. The EM method based on the Gaussian mixture model, on the other hand, correctly identifies all three clusters.
The two clustering methods are also applied to the Iris dataset, which contains three classes, each of 50 four-dimensional sample vectors. The PCA method is used to visualize the data in terms of its first two principal components, as shown below. Also, as can be seen from their confusion matrices, the error rate of the K-means method is 18/150, while that of the EM method is 5/150.
(244) Confusion matrices of the K-means and EM clustering results.
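This Iris experiment can be reproduced approximately with scikit-learn's KMeans and GaussianMixture, as sketched below. Since cluster labels are arbitrary, they are matched to the true classes by trying all label permutations, and the exact error counts depend on initialization, so they need not equal the 18/150 and 5/150 quoted above.

```python
import numpy as np
from itertools import permutations
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.mixture import GaussianMixture

# Iris: 150 samples, 4 features, 3 classes of 50 samples each.
X, y = load_iris(return_X_y=True)

# First two principal components, for the kind of 2-D visualization described above
# (Z2 can be scattered with matplotlib, colored by km_labels / gm_labels).
Z2 = PCA(n_components=2).fit_transform(X)

def best_error_count(y_true, y_pred, n_classes=3):
    """Cluster labels are arbitrary: try every relabeling and keep the best match."""
    return min(int(np.sum(y_true != np.array(p)[y_pred]))
               for p in permutations(range(n_classes)))

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
gm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

print("K-means errors:", best_error_count(y, km_labels), "out of", len(y))
print("GMM/EM errors: ", best_error_count(y, gm_labels), "out of", len(y))
print(confusion_matrix(y, km_labels))   # rows: true classes, columns: cluster labels
print(confusion_matrix(y, gm_labels))
```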