In K-means clustering, each sample point is assigned to the one of the $K$ clusters whose mean is at the minimum Euclidean distance from the point. However, the distribution of the data samples within each cluster, represented by its covariance, is not taken into consideration.
This issue can be addressed by the Gaussian mixture model (GMM), based on the assumption that each of the $K$ clusters in the given dataset can be modeled by a Gaussian distribution $\mathcal{N}(\mathbf{x}\,|\,\boldsymbol\mu_k,\,\boldsymbol\Sigma_k)$ in terms of both the mean vector $\boldsymbol\mu_k$ and the covariance matrix $\boldsymbol\Sigma_k$. The overall distribution of the entire dataset is the weighted sum of the $K$ Gaussian distributions:
$$p(\mathbf{x})=\sum_{k=1}^K \pi_k\,\mathcal{N}(\mathbf{x}\,|\,\boldsymbol\mu_k,\,\boldsymbol\Sigma_k) \tag{209}$$
$$\sum_{k=1}^K \pi_k=1,\qquad 0\le\pi_k\le 1 \tag{210}$$
We note that the GMM model in Eq. (209) is actually the same as Eq. (11) in the naive Bayes classification. The two methods are similar in the sense that each cluster or class is modeled by a Gaussian $\mathcal{N}(\mathbf{x}\,|\,\boldsymbol\mu_k,\,\boldsymbol\Sigma_k)$ weighted by $\pi_k$, and the model parameters $\boldsymbol\mu_k$ and $\boldsymbol\Sigma_k$, as well as $\pi_k$, are to be estimated based on the given dataset.
However, the two methods are different in that the dataset in the supervised naive Bayes method is labeled by $\mathbf{y}$, while in GMM such a labeling vector is not available. Instead, here we will introduce a latent or hidden variable $\mathbf{z}$ as the cluster labeling of the samples in the given dataset $\mathbf{X}$.
This latent variable $\mathbf{z}=[z_1,\dots,z_K]^T$ is a binary vector, of which each component $z_k$ is a binary random variable $z_k\in\{0,1\}$. Only one of these $K$ components is 1, e.g., $z_k=1$, indicating that a sample $\mathbf{x}$ in the dataset belongs to the $k$th cluster $C_k$, while all others are 0, i.e., these $K$ binary variables add up to 1, $\sum_{k=1}^K z_k=1$. For example, with $K=4$, $\mathbf{z}=[0,\,0,\,1,\,0]^T$ indicates that the sample belongs to cluster $C_3$.
We further introduce the following probabilities for each of the $K$ clusters:
$$P(z_k=1)=\pi_k,\qquad p(\mathbf{x}\,|\,z_k=1)=\mathcal{N}(\mathbf{x}\,|\,\boldsymbol\mu_k,\,\boldsymbol\Sigma_k)$$
All such probabilities defined for an individual component $z_k$ can be generalized to distributions for all $K$ clusters:
$$p(\mathbf{z})=\prod_{k=1}^K \pi_k^{z_k} \tag{215}$$
$$p(\mathbf{x}\,|\,\mathbf{z})=\prod_{k=1}^K \mathcal{N}(\mathbf{x}\,|\,\boldsymbol\mu_k,\,\boldsymbol\Sigma_k)^{z_k} \tag{216}$$
$$p(\mathbf{x})=\sum_{\mathbf{z}} p(\mathbf{z})\,p(\mathbf{x}\,|\,\mathbf{z})=\sum_{k=1}^K \pi_k\,\mathcal{N}(\mathbf{x}\,|\,\boldsymbol\mu_k,\,\boldsymbol\Sigma_k) \tag{217}$$
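As a concrete illustration of the mixture density in Eq. (217), the following is a minimal sketch (assuming NumPy and SciPy; the two-component parameter values are made up for illustration):

```python
# Minimal sketch: evaluating p(x) = sum_k pi_k N(x | mu_k, Sigma_k), Eq. (217),
# for a hypothetical 2-component, 2-D GMM with made-up parameters.
import numpy as np
from scipy.stats import multivariate_normal

pi = np.array([0.4, 0.6])                                 # mixing weights, sum to 1
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]         # component means
Sigma = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]   # component covariances

def gmm_density(x):
    """Mixture density p(x) as the weighted sum of the component Gaussians."""
    return sum(p * multivariate_normal.pdf(x, mean=m, cov=S)
               for p, m, S in zip(pi, mu, Sigma))

print(gmm_density(np.array([1.0, 1.0])))
```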
Given the dataset $\mathbf{X}=[\mathbf{x}_1,\dots,\mathbf{x}_N]$ containing $N$ i.i.d. samples, we introduce $N$ corresponding latent variables in $\mathbf{Z}=[\mathbf{z}_1,\dots,\mathbf{z}_N]$, of which the $n$th column $\mathbf{z}_n$ is the labeling vector of $\mathbf{x}_n$, i.e., $\mathbf{x}_n$ belongs to $C_k$ if $z_{nk}=1$ (while $z_{nj}=0$ for all $j\ne k$). Note that here $\mathbf{z}_n$ is defined in the same way as $\mathbf{y}_n$ in softmax regression, both as the labeling of $\mathbf{x}_n$, with the only difference that $\mathbf{y}_n$ is provided in the training data available for a supervised method, while here $\mathbf{z}_n$ is a latent variable not part of the data provided for unsupervised clustering. Now we have
$$p(\mathbf{X}\,|\,\boldsymbol\pi,\boldsymbol\mu,\boldsymbol\Sigma)=\prod_{n=1}^N p(\mathbf{x}_n)=\prod_{n=1}^N\sum_{k=1}^K \pi_k\,\mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol\mu_k,\,\boldsymbol\Sigma_k) \tag{218}$$
We first find the posterior probability for an observed $\mathbf{x}_n$ to belong to cluster $C_k$, indicated by $z_{nk}=1$ (and $z_{nj}=0$ for all $j\ne k$):
$$\gamma_{nk}=P(z_{nk}=1\,|\,\mathbf{x}_n)=\frac{P(z_{nk}=1)\,p(\mathbf{x}_n\,|\,z_{nk}=1)}{p(\mathbf{x}_n)} \tag{221}$$
$$=\frac{\pi_k\,\mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol\mu_k,\,\boldsymbol\Sigma_k)}{\sum_{j=1}^K \pi_j\,\mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol\mu_j,\,\boldsymbol\Sigma_j)} \tag{222}$$
We note that the definition of $\gamma_{nk}$ is similar to the softmax function, with the only difference that here the weighted Gaussian functions are used instead of the exponential functions used in softmax regression.
We also note that the posterior probability defined above represents a soft decision, in the sense that it is possible for any $\mathbf{x}_n$ to belong to any $C_k$ with probability $0\le\gamma_{nk}\le 1$ for all $k=1,\dots,K$, instead of a hard decision, in the sense that $\mathbf{x}_n$ belongs to only one specific $C_k$ with $\gamma_{nk}=1$, while $\gamma_{nj}=0$ for all $j\ne k$, as in K-means clustering.
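To illustrate the difference, the short sketch below (same hypothetical two-component model as above, assuming NumPy/SciPy) computes the soft responsibilities of Eq. (222) for one sample and contrasts them with the hard nearest-mean assignment used by K-means:

```python
# Soft (GMM) vs. hard (K-means style) assignment for a single sample x_n,
# using the same hypothetical 2-component model as above.
import numpy as np
from scipy.stats import multivariate_normal

pi = np.array([0.4, 0.6])
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigma = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]

x_n = np.array([1.5, 1.5])
weighted = np.array([p * multivariate_normal.pdf(x_n, mean=m, cov=S)
                     for p, m, S in zip(pi, mu, Sigma)])
gamma = weighted / weighted.sum()       # soft decision: gamma_nk in [0, 1], sums to 1
hard = np.argmin([np.linalg.norm(x_n - m) for m in mu])   # hard decision: nearest mean
print(gamma, hard)
```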
We can now find the expectation of the log likelihood with respect to the latent variables in $\mathbf{Z}$:
$$E_{\mathbf{Z}}\big[\log p(\mathbf{X},\mathbf{Z}\,|\,\boldsymbol\pi,\boldsymbol\mu,\boldsymbol\Sigma)\big]=\sum_{n=1}^N\sum_{k=1}^K \gamma_{nk}\Big[\log\pi_k+\log\mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol\mu_k,\,\boldsymbol\Sigma_k)\Big] \tag{224}$$
We first set to zero the derivatives of this expectation of the log likelihood with respect to each of the parameters in $\{\pi_k,\,\boldsymbol\mu_k,\,\boldsymbol\Sigma_k\}$, and then solve the resulting equations to get the optimal parameters. As the $\pi_k$'s need to satisfy the constraint $\sum_{k=1}^K\pi_k=1$, we first construct the Lagrangian function, composed of the log likelihood as the objective function and an extra Lagrange multiplier term for the constraint:
$$L=\sum_{n=1}^N\sum_{k=1}^K \gamma_{nk}\Big[\log\pi_k+\log\mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol\mu_k,\,\boldsymbol\Sigma_k)\Big]+\lambda\Big(\sum_{k=1}^K\pi_k-1\Big) \tag{225}$$
Setting $\partial L/\partial\pi_k=0$ we get
$$\sum_{n=1}^N\frac{\gamma_{nk}}{\pi_k}+\lambda=0 \tag{226}$$
Multiplying by $\pi_k$, summing over $k$, and using $\sum_{k=1}^K\pi_k=1$ gives $\lambda=-N$, and therefore
$$\pi_k=\frac{N_k}{N},\qquad\text{where}\quad N_k=\sum_{n=1}^N\gamma_{nk} \tag{229}$$
Setting $\partial L/\partial\boldsymbol\mu_k=0$ we get
$$\sum_{n=1}^N\gamma_{nk}\,\boldsymbol\Sigma_k^{-1}(\mathbf{x}_n-\boldsymbol\mu_k)=\mathbf{0} \tag{230}$$
which is solved by
$$\boldsymbol\mu_k=\frac{1}{N_k}\sum_{n=1}^N\gamma_{nk}\,\mathbf{x}_n \tag{233}$$
Similarly, setting $\partial L/\partial\boldsymbol\Sigma_k=0$ we get
$$\sum_{n=1}^N\gamma_{nk}\Big[\boldsymbol\Sigma_k^{-1}-\boldsymbol\Sigma_k^{-1}(\mathbf{x}_n-\boldsymbol\mu_k)(\mathbf{x}_n-\boldsymbol\mu_k)^T\boldsymbol\Sigma_k^{-1}\Big]=\mathbf{0} \tag{236}$$
which is solved by
$$\boldsymbol\Sigma_k=\frac{1}{N_k}\sum_{n=1}^N\gamma_{nk}(\mathbf{x}_n-\boldsymbol\mu_k)(\mathbf{x}_n-\boldsymbol\mu_k)^T \tag{237}$$
In summary, here is the EM clustering algorithm based on the Gaussian mixture model:

E-step: find the responsibility $\gamma_{nk}$ for all $N$ data points and all $K$ clusters, and then $N_k$:
$$\gamma_{nk}=\frac{\pi_k\,\mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol\mu_k,\,\boldsymbol\Sigma_k)}{\sum_{j=1}^K \pi_j\,\mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol\mu_j,\,\boldsymbol\Sigma_j)},\qquad N_k=\sum_{n=1}^N\gamma_{nk} \tag{239}$$

M-step: recalculate the parameters that maximize the likelihood function:
$$\boldsymbol\mu_k=\frac{1}{N_k}\sum_{n=1}^N\gamma_{nk}\,\mathbf{x}_n,\qquad \boldsymbol\Sigma_k=\frac{1}{N_k}\sum_{n=1}^N\gamma_{nk}(\mathbf{x}_n-\boldsymbol\mu_k)(\mathbf{x}_n-\boldsymbol\mu_k)^T,\qquad \pi_k=\frac{N_k}{N} \tag{240}$$

These two steps are repeated until convergence.
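The complete iteration can be sketched as follows (assuming NumPy/SciPy; the function name em_gmm and its parameters are chosen here for illustration, not taken from any particular library):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """EM clustering of data X (N x d) into K Gaussian components,
    following Eqs. (239) and (240)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)                    # equal mixing weights to start
    mu = X[rng.choice(N, K, replace=False)]     # means initialized to random samples
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities gamma_nk, Eq. (239)
        w = np.column_stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                             for k in range(K)])
        gamma = w / w.sum(axis=1, keepdims=True)
        Nk = gamma.sum(axis=0)
        # M-step: parameter updates, Eq. (240)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(d)
        pi = Nk / N
    return gamma, pi, mu, Sigma
```

The final (hard) cluster assignment of each sample can then be read off as `gamma.argmax(axis=1)`, or the soft responsibilities can be kept as they are.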
We can show that the K-means algorithm is actually a special case of the EM algorithm, in which all covariance matrices are the same, $\boldsymbol\Sigma_k=\epsilon\mathbf{I}$, where $\epsilon$ is a scaling factor that approaches zero. In this case we have:
$$\mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol\mu_k,\,\epsilon\mathbf{I})=\frac{1}{(2\pi\epsilon)^{d/2}}\exp\Big(-\frac{\|\mathbf{x}_n-\boldsymbol\mu_k\|^2}{2\epsilon}\Big) \tag{241}$$
$$\gamma_{nk}=\frac{\pi_k\exp\big(-\|\mathbf{x}_n-\boldsymbol\mu_k\|^2/2\epsilon\big)}{\sum_{j=1}^K\pi_j\exp\big(-\|\mathbf{x}_n-\boldsymbol\mu_j\|^2/2\epsilon\big)} \tag{242}$$
As $\epsilon\to 0$, the term in the denominator with the smallest distance $\|\mathbf{x}_n-\boldsymbol\mu_j\|^2$ dominates, so the responsibility becomes a hard assignment of $\mathbf{x}_n$ to its nearest mean, exactly as in K-means:
$$\gamma_{nk}\;\to\; r_{nk}=\begin{cases}1 & \text{if } k=\arg\min_j\|\mathbf{x}_n-\boldsymbol\mu_j\|^2\\ 0 & \text{otherwise}\end{cases} \tag{243}$$
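A quick numerical check of this limiting behavior (toy means and a single test point, assuming NumPy) shows the responsibilities of Eq. (242) collapsing toward a one-hot vector as $\epsilon$ shrinks:

```python
# With shared covariance eps*I, the responsibilities approach a hard (one-hot)
# assignment as eps -> 0.  Toy values only.
import numpy as np

mu = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 4.0]])   # three cluster means
pi = np.array([1/3, 1/3, 1/3])                        # equal mixing weights
x = np.array([1.0, 1.0])                              # test point

for eps in [10.0, 1.0, 0.1, 0.01]:
    d2 = ((x - mu) ** 2).sum(axis=1)                  # squared distances ||x - mu_k||^2
    w = pi * np.exp(-d2 / (2 * eps))
    gamma = w / w.sum()                               # responsibilities, Eq. (242)
    print(eps, np.round(gamma, 4))
```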
We can also make a comparison between the GMM method for unsupervised clustering and the softmax regression for supervised classification. First, the latent variables $\mathbf{z}_n$ in GMM play a similar role as the labelings $\mathbf{y}_n$ in softmax regression for multi-class classification. However, the difference is that $\mathbf{y}_n$ is explicitly given in the training set for a supervised classification, while $\mathbf{z}_n$ is hidden for an unsupervised clustering analysis. Second, we note that the probability $\gamma_{nk}$ given in Eq. (221) is similar in form to the softmax function in the softmax method, with the only difference that the Gaussian function is used for GMM while the exponential function is used for softmax.
Examples
The same dataset is used to test both the K-means and EM clustering methods. The first panel shows 10 iterations of the K-means method, while the second panel shows 16 iterations of the EM method. In both cases, the iteration converges to the last plot. Comparing the two clustering results, we see that the K-means method cannot separate the red and green data points, which come from two different clusters that are both normally distributed with similar means but very different covariance matrices, while the blue data points, all in the same cluster, are split into two clusters. The EM method based on the Gaussian mixture model, however, can correctly identify all three clusters.
The two clustering methods are also applied to the Iris dataset, which has three classes, each of 50 four-dimensional sample vectors. The PCA method is used to visualize the first two principal components, as shown below. Also, as can be seen from their confusion matrices, the error rate of the K-means method is 18/150, while that of the EM method is 5/150.
(244) [Confusion matrices of the K-means and EM clustering results on the Iris dataset]
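A comparable experiment can be reproduced along the following lines, assuming scikit-learn is available; the exact error counts depend on initialization and on how each cluster is matched to a true class, so they may differ from the figures quoted above:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)          # 150 samples, 4 features, 3 classes

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0).fit(X)

def error_count(labels, y):
    """Count misclassified samples after matching each cluster to its majority class."""
    errors = 0
    for c in np.unique(labels):
        members = y[labels == c]
        errors += (members != np.bincount(members).argmax()).sum()
    return errors

print("K-means errors:", error_count(km.labels_, y))
print("GMM/EM errors: ", error_count(gmm.predict(X), y))

Z = PCA(n_components=2).fit_transform(X)   # first two principal components for visualization
```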