Gaussian mixture model

In K-means clustering, each sample point ${\bf x}$ is assigned to the one of the $K$ clusters whose mean has the minimum Euclidean distance to ${\bf x}$. However, the distribution of the data samples within each cluster, as represented by its covariance, is not taken into consideration.

This issue can be addressed by the Gaussian mixture model (GMM), based on the assumption that each of the $K$ clusters $\{C_1,\cdots,C_K\}$ in the given dataset can be modeled by a Gaussian distribution ${\cal N}({\bf x}; {\bf m}_k,{\bf\Sigma}_k),\;(k=1,\cdots,K)$ in terms of both its mean vector ${\bf m}_k$ and covariance matrix ${\bf\Sigma}_k$. The overall distribution of the entire dataset ${\bf X}=[{\bf x}_1,\cdots,{\bf x}_N]$ is the weighted sum of the $K$ Gaussian distributions:

$\displaystyle p({\bf x})=\sum_{k=1}^K P_k {\cal N}({\bf x}; {\bf m}_k,{\bf\Sigma}_k)
=\sum_{k=1}^K P_k\left[\frac{1}{(2\pi)^{d/2}\,\vert{\bf\Sigma}_k\vert^{1/2}}
\exp\left(-\frac{1}{2}({\bf x}-{\bf m}_k)^T
{\bf\Sigma}_k^{-1}({\bf x}-{\bf m}_k)\right)\right]$ (209)

where $d$ is the dimensionality of ${\bf x}$, and $P_k$ is the weight of the kth Gaussian ${\cal N}({\bf x}; {\bf m}_k,{\bf\Sigma}_k)$, satisfying

$\displaystyle \int_{-\infty}^\infty p({\bf x})\,d{\bf x}
=\sum_{k=1}^K P_k \int_{-\infty}^\infty
{\cal N}({\bf x}; {\bf m}_k,{\bf\Sigma}_k)\,d{\bf x}
=\sum_{k=1}^K P_k =1$ (210)
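As a concrete illustration of Eqs. (209) and (210), the following Python sketch (assuming NumPy and SciPy are available; the two-component parameters $P_k$, ${\bf m}_k$, ${\bf\Sigma}_k$ and the test point are made up purely for illustration) evaluates the mixture density $p({\bf x})$ at a point and checks that the weights sum to one:

import numpy as np
from scipy.stats import multivariate_normal

# hypothetical 2-component GMM in 2 dimensions (made-up parameters)
P = [0.3, 0.7]                                    # mixture weights P_k, summing to 1
m = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # mean vectors m_k
S = [np.eye(2), np.array([[2.0, 0.5],
                          [0.5, 1.0]])]           # covariance matrices Sigma_k

def gmm_pdf(x, P, m, S):
    # mixture density p(x) = sum_k P_k N(x; m_k, Sigma_k), Eq. (209)
    return sum(Pk * multivariate_normal.pdf(x, mean=mk, cov=Sk)
               for Pk, mk, Sk in zip(P, m, S))

x = np.array([1.0, 2.0])
print("p(x) =", gmm_pdf(x, P, m, S))
print("sum of weights =", sum(P))   # must be 1 so that p(x) integrates to 1, Eq. (210)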

The GMM clustering is carried out by the method of expectation maximization (EM), by which all model parameters denoted by $\theta =\{P_k,{\bf m}_k,{\bf\Sigma}_k,\;(k=1,\cdots,K)\}$ are estimated based on the given dataset, and then each sample in the dataset is assigned to one of the clusters.

We note that the GMM model in Eq. (209) is actually the same as Eq. (11) in the naive Bayes classification. These two methods are similar in the sense that each cluster or class $C_k$ is modeled by a Gaussian ${\cal N}({\bf x},{\bf m}_k,{\bf\Sigma}_k)$, weighted by $P_k$, and the model parameters ${\bf m}_k$ and ${\bf\Sigma}_k$, as well as $P_k$, are to be estimated based on the given dataset. However, the two methods are different in that the dataset ${\bf X}$ in the supervised naive Bayes method is labeled by ${\bf y}$, while in GMM such a labeling vector is not available. Instead, here we will introduce a latent or hidden variable ${\bf z}=[z_1,\cdots,z_K]^T$ as the cluster labeling of the samples in the given dataset ${\bf X}$. This latent variable ${\bf z}$ is a binary vector, of which each component is a binary random variable $z_k\in\{0,\,1\}$. Only one of these $K$ components is 1, e.g., $z_k=1$, indicating that a sample ${\bf x}$ in the dataset belongs to the kth cluster $C_k$, while all others are 0, i.e., these $K$ binary variables add up to 1, $\sum_{k=1}^K z_k=1$. For example, with $K=3$, ${\bf z}=[0,1,0]^T$ indicates that ${\bf x}$ belongs to $C_2$.

We further introduce the following probabilities for each of the $K$ clusters $C_k\;\;(k=1,\cdots,K)$: the prior probability that a sample belongs to $C_k$,

$\displaystyle P(z_k=1)=P_k$ (211)

the conditional probability (likelihood) of ${\bf x}$ given that it belongs to $C_k$,

$\displaystyle p({\bf x}\vert z_k=1,\theta)={\cal N}({\bf x}; {\bf m}_k,{\bf\Sigma}_k)$ (212)

and the joint probability of ${\bf x}$ and $z_k=1$,

$\displaystyle p({\bf x},z_k=1\vert\theta)=p({\bf x}\vert z_k=1,\theta)\;P(z_k=1)
=P_k {\cal N}({\bf x}; {\bf m}_k,{\bf\Sigma}_k)$ (213)

Marginalizing this joint probability over the latent variable $z_k$, we get the Gaussian mixture model, i.e., the distribution $p({\bf x})$ of any sample ${\bf x}$ regardless of which cluster it belongs to:

$\displaystyle p({\bf x}\vert\theta)=\sum_{k=1}^K p({\bf x},z_k=1\vert\theta)
=\sum_{k=1}^K p({\bf x}\vert z_k=1,\theta)\;P(z_k=1)
=\sum_{k=1}^K P_k {\cal N}({\bf x}; {\bf m}_k,{\bf\Sigma}_k)$ (214)

Note that Eqs. (211) through (214) are the same as Eqs. (8) through (11) in the naive Bayes classifier, respectively.

All such probabilities defined for $z_k=1$ can be generalized to ${\bf z}=[z_1,\cdots,z_K]^T$ for all $K$ clusters:

$\displaystyle p({\bf z}\vert{\bf\theta})$ $\displaystyle =$ $\displaystyle \prod_{k=1}^K P_k^{z_k}$ (215)
$\displaystyle p({\bf x}\vert{\bf z},{\bf\theta})$ $\displaystyle =$ $\displaystyle \prod_{k=1}^K {\cal N}({\bf x},{\bf m}_k,{\bf\Sigma}_k)^{z_k}$ (216)
$\displaystyle p({\bf x},{\bf z}\vert{\bf\theta})$ $\displaystyle =$ $\displaystyle p({\bf z})\;p({\bf x}\vert{\bf z},\theta)
=\prod_{k=1}^K \left(P_k {\cal N}({\bf x},{\bf m}_k,{\bf\Sigma}_k)\right)^{z_k}$ (217)
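To see how the one-hot exponents in Eqs. (215) through (217) work, consider a particular labeling ${\bf z}$ with $z_j=1$ and $z_k=0$ for all $k\ne j$; every factor raised to the power $0$ equals $1$, and only the $j$th factor survives:

$\displaystyle p({\bf x},{\bf z}\vert{\bf\theta})\Big\vert_{z_j=1}
=\prod_{k=1}^K \left(P_k {\cal N}({\bf x},{\bf m}_k,{\bf\Sigma}_k)\right)^{z_k}
=P_j\,{\cal N}({\bf x},{\bf m}_j,{\bf\Sigma}_j)$

which is just the joint probability of Eq. (213) for the $j$th cluster.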

Given the dataset ${\bf X}=[{\bf x}_1,\cdots,{\bf x}_N]$ containing $N$ i.i.d. samples, we introduce $N$ corresponding latent variables in ${\bf Z}=[{\bf z}_1,\cdots,{\bf z}_N]$, of which the nth column ${\bf z}_n=[z_{n1},\cdots,z_{nK}]^T$ is the labeling vector of ${\bf x}_n$, i.e., ${\bf x}_n$ belongs to $C_k$ if $z_{nk}=1$ (while $z_{nl}=0$ for all $l\ne k$). Note that ${\bf Z}=[{\bf z}_1,\cdots,{\bf z}_N]$ is defined in the same way as ${\bf Y}=[{\bf y}_1,\cdots,{\bf y}_N]$ in softmax regression, both being labelings of ${\bf X}$; the only difference is that ${\bf Y}$ is provided as part of the training data available to a supervised method, while here ${\bf Z}$ is a latent variable not part of the data provided for unsupervised clustering. Now we have

$\displaystyle p({\bf x}_n,{\bf z}_n\vert\theta)=\prod_{k=1}^K
\left(P_k {\cal N}({\bf x}_n,{\bf m}_k,{\bf\Sigma}_k)\right)^{z_{nk}},
\;\;\;\;\;\;\;\;\;\;\;\;\;\;(n=1,\cdots,N)$ (218)

The likelihood function of the GMM model parameters ${\bf\theta}$ to be estimated can be expressed as:
$\displaystyle L(\theta\vert{\bf X},{\bf Z})$ $\displaystyle =$ $\displaystyle p({\bf X},{\bf Z}\vert\theta)
=p\left([{\bf x}_1,\cdots,{\bf x}_N],\,[{\bf z}_1,\cdots,{\bf z}_N]\,\bigg\vert\,{\bf m}_k,{\bf\Sigma}_k,P_k\;(k=1,\cdots,K)\right)$  
  $\displaystyle =$ $\displaystyle \prod_{n=1}^N p({\bf x}_n,{\bf z}_n\vert\theta)
=\prod_{n=1}^N \prod_{k=1}^K \left(P_k{\cal N}({\bf x}_n,{\bf m}_k,{\bf\Sigma}_k)\right)^{z_{nk}}$ (219)

and the log likelihood is:
$\displaystyle l({\bf\theta}\vert{\bf X},{\bf Z})=
\log L(\theta\vert{\bf X},{\bf Z})$ $\displaystyle =$ $\displaystyle \log p({\bf X},{\bf Z}\vert\theta)
=\log \left[ \prod_{n=1}^N \prod_{k=1}^K
\left(P_k{\cal N}({\bf x}_n,{\bf m}_k,{\bf\Sigma}_k)\right)^{z_{nk}}\right]$  
  $\displaystyle =$ $\displaystyle \sum_{n=1}^N \sum_{k=1}^K z_{nk}
\left[\log P_k+\log {\cal N}({\bf x}_n,{\bf m}_k,{\bf\Sigma}_k)\right]$ (220)
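The complete-data log likelihood of Eq. (220) is straightforward to compute once ${\bf Z}$ is known. The Python sketch below (assuming NumPy and SciPy; the argument layout of X, Z, P, m, S is a choice of this sketch, not notation from the text) evaluates it directly:

import numpy as np
from scipy.stats import multivariate_normal

def complete_log_likelihood(X, Z, P, m, S):
    # Eq. (220): sum_n sum_k z_nk [ log P_k + log N(x_n; m_k, Sigma_k) ]
    # X: (N, d) samples, Z: (N, K) one-hot labels,
    # P: (K,) weights, m: list of K mean vectors, S: list of K covariance matrices
    N, K = Z.shape
    ll = 0.0
    for k in range(K):
        log_pk = np.log(P[k])
        log_norm = multivariate_normal.logpdf(X, mean=m[k], cov=S[k])  # length-N vector
        ll += np.sum(Z[:, k] * (log_pk + log_norm))
    return ll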

Similar to the method of maximum likelihood estimation (MLE), which finds the model parameters in ${\bf\theta}$ as those that maximize the likelihood function $L({\bf\theta})$ or its logarithm $\log L({\bf\theta})$, here we find the model parameters in $\theta =\{P_k,{\bf m}_k,{\bf\Sigma}_k,\;(k=1,\cdots,K)\}$ as those that maximize the expectation of the log likelihood function above with respect to the latent variables in ${\bf Z}$, in the following two iterative steps of the EM method:

We note that all model parameters in ${\bf\theta}=\{P_k,{\bf m}_k,{\bf\Sigma}_k,(k=1,\cdots,K)\}$ found respectively in Eqs. (231), (234), and (238) in the M-step depend on $P_{nk}$ in Eq. (221) in the E-step, which in turn is a function of these parameters. The E-step and M-step therefore need to be carried out in an alternating, iterative fashion, starting from some initial values of the parameters, until convergence.

In summary, here is the EM clustering algorithm based on the Gaussian mixture model (a code sketch of these steps is given after the list):

  1. Initialize the means ${\bf m}_k$, covariances ${\bf\Sigma}_k$, and weights $P_k$.

  2. The E-step:

    Find the responsibility $P_{nk}$ for all $N$ data points and all $K$ clusters and then $N_k$:

    $\displaystyle P_{nk}=P(z_{nk}=1\vert{\bf x}_n)
=\frac{P_k\,{\cal N}({\bf x}_n;{\bf m}_k,{\bf\Sigma}_k)}
{\sum_{l=1}^K P_l\,{\cal N}({\bf x}_n;{\bf m}_l,{\bf\Sigma}_l)},
\;\;\;\;\;\;\;N_k=\sum_{n=1}^N P_{nk}$ (239)

  3. The M-step:

    Recalculate the parameters that maximize the likelihood function:

    $\displaystyle P_k$ $\displaystyle =$ $\displaystyle \frac{N_k}{N}$  
    $\displaystyle {\bf m}_k$ $\displaystyle =$ $\displaystyle \frac{1}{N_k} \sum_{n=1}^N P_{nk}{\bf x}_n$  
    $\displaystyle {\bf\Sigma}_k$ $\displaystyle =$ $\displaystyle \frac{1}{N_k}\sum_{n=1}^N P_{nk}({\bf x}_n-{\bf m}_k)({\bf x}_n-{\bf m}_k)^T$ (240)

  4. If the parameters or the log likelihood function have not converged, go back to step 2; otherwise terminate the iteration. The probability for each sample ${\bf x}_n$ to belong to cluster $C_l$ is $P_{nl}=P(z_{nl}=1\vert{\bf x}_n,\theta)$, and ${\bf x}_n$ is therefore assigned to the cluster $C_k$ for which $P_{nk}=\max_l P_{nl}$.
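A minimal Python sketch of these four steps is given below (assuming NumPy and SciPy; the function name gmm_em, the random initialization from sampled data points, and the small regularization added to the covariances are choices of this sketch, not part of the algorithm as stated above):

import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, max_iter=100, tol=1e-6, seed=0):
    # X: (N, d) data matrix; K: number of clusters
    # returns the parameters, the responsibilities P_nk, and hard cluster labels
    rng = np.random.default_rng(seed)
    N, d = X.shape

    # step 1: initialize means (random samples), covariances and weights
    m = X[rng.choice(N, size=K, replace=False)]          # (K, d) means m_k
    S = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * K)   # (K, d, d) covariances Sigma_k
    P = np.full(K, 1.0 / K)                              # (K,) weights P_k

    prev_ll = -np.inf
    for _ in range(max_iter):
        # step 2 (E-step): responsibilities P_nk and N_k, Eq. (239)
        dens = np.column_stack([P[k] * multivariate_normal.pdf(X, mean=m[k], cov=S[k])
                                for k in range(K)])      # (N, K): P_k N(x_n; m_k, Sigma_k)
        R = dens / dens.sum(axis=1, keepdims=True)       # normalize over k
        Nk = R.sum(axis=0)                               # N_k = sum_n P_nk

        # step 3 (M-step): re-estimate the parameters, Eq. (240)
        P = Nk / N
        m = (R.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - m[k]
            S[k] = (R[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(d)

        # step 4: check convergence of the log likelihood
        ll = np.sum(np.log(dens.sum(axis=1)))
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll

    return P, m, S, R, R.argmax(axis=1)   # hard assignment: k maximizing P_nl

For example, calling gmm_em(X, K=3) returns the soft responsibilities R of step 2 together with a hard cluster label for each sample, as in step 4.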

We can show that the K-means algorithm is actually a special case of the EM algorithm, in which all covariance matrices are the same, ${\bf\Sigma}_k=\varepsilon{\bf I}$, where $\varepsilon$ is a scaling factor that approaches zero. In this case we have:

$\displaystyle p({\bf x}\vert z_k=1,\theta)={\cal N}({\bf x}\vert{\bf m}_k,\varepsilon{\bf I})
=\frac{1}{(2\pi\varepsilon)^{d/2}}
\exp\left(-\frac{1}{2\varepsilon}\vert\vert{\bf x}-{\bf m}_k\vert\vert^2\right)$ (241)

and the probability for any ${\bf x}_n\in{\bf X}$ to belong to cluster $C_k$ is:

$\displaystyle P_{nk}=P(z_k=1\vert{\bf x}_n,\theta)
=\frac{P_k{\cal N}({\bf x}_n\vert{\bf m}_k,\varepsilon{\bf I})}
{\sum_{l=1}^K P_l{\cal N}({\bf x}_n\vert{\bf m}_l,\varepsilon{\bf I})}
=\frac{P_k\exp(-\vert\vert{\bf x}_n-{\bf m}_k\vert\vert^2/2\varepsilon)}
{\sum_{l=1}^K P_l\exp(-\vert\vert{\bf x}_n-{\bf m}_l\vert\vert^2/2\varepsilon)}$ (242)

When $\varepsilon\rightarrow 0$, all terms in the denominator approach zero, but the one with the minimum distance $\vert\vert{\bf x}_n-{\bf m}_l\vert\vert$ approaches zero most slowly and becomes the dominant term of the denominator. If the numerator happens to be this term as well, then $P_{nk}\rightarrow 1$; otherwise the numerator approaches zero more quickly than the denominator and $P_{nk}\rightarrow 0$. Now $P_{nk}$ defined above becomes:

$\displaystyle \lim\limits_{\varepsilon\rightarrow 0} P_{nk}
=\left\{\begin{array}{ll}
1 & \mbox{if $\vert\vert{\bf x}_n-{\bf m}_k\vert\vert
=\min\limits_l \vert\vert{\bf x}_n-{\bf m}_l\vert\vert$}\\
0 & \mbox{otherwise}\end{array}\right.$ (243)

Now the posterior probability $0<P_{nk}<1$ defined in Eq. (221) for a soft decision becomes a binary value $P_{nk}\in\{0,\;1\}$ for a hard decision that assigns ${\bf x}_n$ to the cluster $C_k$ with the smallest distance $\vert\vert{\bf x}_n-{\bf m}_k\vert\vert$. Also, $N_k=\sum_{n=1}^N P_{nk}$, defined in Eq. (228) as the sum of the posterior probabilities of all $N$ data points belonging to $C_k$, simply becomes the number of data samples assigned to $C_k$. In other words, the probabilistic EM method based on both ${\bf m}$ and ${\bf\Sigma}$ reduces to the deterministic K-means method based on ${\bf m}$ only.
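The limiting behavior of Eq. (243) can also be observed numerically. The sketch below (NumPy only; the data points, means, and the function name responsibilities_shared_cov are made up for illustration) computes the responsibilities of Eq. (242) for a shared covariance $\varepsilon{\bf I}$ and shows them hardening into one-hot assignments as $\varepsilon$ shrinks:

import numpy as np

def responsibilities_shared_cov(X, m, P, eps):
    # responsibilities of Eq. (242) when all covariances equal eps*I
    # X: (N, d) samples, m: (K, d) means, P: (K,) weights
    d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)  # squared distances ||x_n - m_k||^2
    logits = np.log(P)[None, :] - d2 / (2.0 * eps)           # log of numerator (the shared Gaussian constant cancels)
    logits -= logits.max(axis=1, keepdims=True)              # stabilize before exponentiating
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

# made-up data points and cluster means for illustration
X = np.array([[0.1, 0.0], [2.9, 3.1], [1.6, 1.4]])
m = np.array([[0.0, 0.0], [3.0, 3.0]])
P = np.array([0.5, 0.5])
for eps in (10.0, 1.0, 0.01):
    print(eps, np.round(responsibilities_shared_cov(X, m, P, eps), 3))
# as eps -> 0 each row approaches a one-hot vector, i.e., the hard K-means assignment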

We can also make a comparison between the GMM method for unsupervised clustering and softmax regression for supervised classification. First, the latent variables ${\bf Z}=[{\bf z}_1,\cdots,{\bf z}_N]$ in GMM play a similar role to the labeling ${\bf Y}=[{\bf y}_1,\cdots,{\bf y}_N]$ in softmax regression for multi-class classification. However, the difference is that ${\bf Y}$ is explicitly given in the training set for supervised classification, while ${\bf Z}$ is hidden in an unsupervised clustering analysis. Second, we note that the probability $P_{nk}=p(z_{nk}=1\vert{\bf x}_n,\theta)$ given in Eq. (221) is similar in form to the softmax function $\phi_{nk}=P(y'=k\vert{\bf x}_n)$ in the softmax method, with the only difference that the Gaussian function is used in GMM while the exponential function is used in softmax.
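For reference, the two expressions can be written side by side; the form of the softmax function below assumes the usual parameterization with weight vectors ${\bf w}_k$, which may differ in notation from the softmax regression section:

$\displaystyle P_{nk}=\frac{P_k\,{\cal N}({\bf x}_n;{\bf m}_k,{\bf\Sigma}_k)}
{\sum_{l=1}^K P_l\,{\cal N}({\bf x}_n;{\bf m}_l,{\bf\Sigma}_l)},
\;\;\;\;\;\;\;\;
\phi_{nk}=\frac{\exp({\bf w}_k^T{\bf x}_n)}{\sum_{l=1}^K \exp({\bf w}_l^T{\bf x}_n)}$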

Examples

The same dataset is used to test both the K-means and EM clustering methods. The first panel shows 10 iterations of the K-means method, while the second panel shows 16 iterations of the EM method. In both cases, the iteration converges to the result in the last plot. Comparing the two clustering results, we see that the K-means method cannot separate the red and green data points, which come from two different clusters that are both normally distributed with similar means but very different covariance matrices, while the blue data points, all from the same cluster, are split into two clusters. The EM method based on the Gaussian mixture model, on the other hand, correctly identifies all three clusters.

Figures: ClusteringKmeans.png (K-means iterations) and clusteringEM.png (EM iterations).

The two clustering methods are also applied to the Iris dataset, which has three classes, each containing 50 four-dimensional sample vectors. The PCA method is used to visualize the data in terms of the first two principal components, as shown below. Also, as can be seen from their confusion matrices, the error rate of the K-means method is 18/150, while that of the EM method is 5/150 (the confusion matrix of the EM result is shown in Eq. (244)).

$\displaystyle \mbox{EM:}\;\;\;
\begin{tabular}{r\vert r\vert r}
0 & 0 & 50\\ \hline
45 & 5 & 0\\ \hline
0 & 50 & 0
\end{tabular}$ (244)

Figures: IrisKmeans.png (K-means result) and IrisEM.png (EM result).
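A quick way to reproduce this kind of comparison is with scikit-learn's KMeans and GaussianMixture classes (a sketch under the assumption that scikit-learn is available; it is not the code used to generate the figures above, and the exact error counts may differ with initialization):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

X, y = load_iris(return_X_y=True)          # 150 four-dimensional samples, 3 classes

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
em_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

Z = PCA(n_components=2).fit_transform(X)   # first two principal components for plotting

def error_rate(labels, y, K=3):
    # cluster indices are arbitrary: within each cluster, count the samples
    # that do not belong to that cluster's majority class
    errors = 0
    for k in range(K):
        members = y[labels == k]
        if members.size:
            errors += members.size - np.bincount(members).max()
    return errors

print("K-means errors:", error_rate(km_labels, y), "/ 150")
print("EM (GMM) errors:", error_rate(em_labels, y), "/ 150")

Because the cluster indices returned by the two methods are arbitrary, the helper error_rate simply counts, within each cluster, the samples that do not belong to that cluster's majority class.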