Assume $N$ random samples $X = \{\mathbf{x}_1,\ldots,\mathbf{x}_N\}$ are drawn from a mixture distribution
$$p(\mathbf{x}|\Theta) = \sum_{m=1}^M \pi_m\, p_m(\mathbf{x}|\theta_m)$$
where $p_m(\mathbf{x}|\theta_m)$ is the $m$th distribution component parameterized by $\theta_m$, and $\pi_m$ is the mixing coefficient or prior probability of each mixture component satisfying
$$\sum_{m=1}^M \pi_m = 1.$$
The parameters are $\Theta = \{\pi_1,\ldots,\pi_M,\ \theta_1,\ldots,\theta_M\}$.

As $\mathbf{x}_1,\ldots,\mathbf{x}_N$ are independent, their joint distribution is
$$p(X|\Theta) = \prod_{n=1}^N p(\mathbf{x}_n|\Theta) = \prod_{n=1}^N \sum_{m=1}^M \pi_m\, p_m(\mathbf{x}_n|\theta_m)$$
and the log-likelihood of the parameters is
$$l(\Theta|X) = \log p(X|\Theta) = \sum_{n=1}^N \log\left[\sum_{m=1}^M \pi_m\, p_m(\mathbf{x}_n|\theta_m)\right].$$
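As a quick numerical illustration, the following Python sketch evaluates this log-likelihood for a given parameter set. The function name mixture_log_likelihood, the use of scipy.stats.multivariate_normal for the component densities, and the example numbers are illustrative choices of ours, not part of the derivation.

import numpy as np
from scipy.stats import multivariate_normal

def mixture_log_likelihood(X, pis, component_pdfs):
    """l(Theta|X) = sum_n log( sum_m pi_m p_m(x_n | theta_m) )."""
    # X: (N, d) samples; pis: (M,) mixing coefficients summing to 1;
    # component_pdfs: list of M callables, each returning p_m(x_n) for all rows of X.
    N, M = X.shape[0], len(pis)
    p = np.zeros((N, M))
    for m, pdf in enumerate(component_pdfs):
        p[:, m] = pdf(X)                         # p_m(x_n | theta_m)
    return np.sum(np.log(p @ np.asarray(pis)))   # sum_n log sum_m pi_m p_m(x_n)

# Example with two (assumed) Gaussian components and made-up data:
X = np.random.default_rng(0).normal(size=(100, 2))
pdfs = [multivariate_normal(mean=[0, 0]).pdf, multivariate_normal(mean=[3, 3]).pdf]
print(mixture_log_likelihood(X, [0.5, 0.5], pdfs))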
Finding a $\Theta$ that maximizes this log-likelihood is not easy. To make the problem easier, now assume some hidden or latent random variables $Y = \{y_1,\ldots,y_N\}$, where $y_n = m$ if the $n$th sample $\mathbf{x}_n$ is generated by the $m$th component $p_m(\mathbf{x}|\theta_m)$ of the mixture distribution. Now the log-likelihood can be written in terms of both $X$ and $Y$:
$$l(\Theta|X,Y) = \log p(X,Y|\Theta) = \sum_{n=1}^N \log\left[\pi_{y_n}\, p_{y_n}(\mathbf{x}_n|\theta_{y_n})\right].$$
The last equal sign is due to the definition of $y_n$, i.e., $\mathbf{x}_n$ is known to be generated by the $y_n$th distribution component $p_{y_n}(\mathbf{x}|\theta_{y_n})$, therefore all other terms in the summation $\sum_{m=1}^M \pi_m\, p_m(\mathbf{x}_n|\theta_m)$ can be dropped. The expectation of the log-likelihood with respect to $Y$ is
$$Q(\Theta) = E_Y\left[\log p(X,Y|\Theta)\,\big|\,X,\Theta'\right] = \sum_Y P(Y|X,\Theta')\,\log p(X,Y|\Theta)$$
where $\Theta'$ denotes the current estimate of the parameters. But as any $y_n$ can only take one of the $M$ integers $1,\ldots,M$, the expectation above can be written as
$$Q(\Theta) = \sum_{n=1}^N \sum_{m=1}^M P(m|\mathbf{x}_n,\Theta')\,\log\left[\pi_m\, p_m(\mathbf{x}_n|\theta_m)\right]
= \sum_{n=1}^N \sum_{m=1}^M P(m|\mathbf{x}_n,\Theta')\,\log \pi_m + \sum_{n=1}^N \sum_{m=1}^M P(m|\mathbf{x}_n,\Theta')\,\log p_m(\mathbf{x}_n|\theta_m).$$
The two terms can be maximized separately as $\pi_m$ and $\theta_m$ are not related.
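This separability is easy to see in code. The short sketch below (naming is ours; R denotes the $N\times M$ matrix of posteriors $P(m|\mathbf{x}_n,\Theta')$, obtained by Bayes' rule as described below) evaluates $Q(\Theta)$ as the sum of the two independent pieces.

import numpy as np

def q_function(R, pis, log_pdfs):
    """Expected complete-data log-likelihood, split into its two separable terms.

    R:        (N, M) posteriors P(m | x_n, Theta') from the current estimate
    pis:      (M,)   mixing coefficients pi_m
    log_pdfs: (N, M) values of log p_m(x_n | theta_m)
    """
    term_pi = np.sum(R * np.log(pis))    # sum_n sum_m P(m|x_n) log pi_m
    term_theta = np.sum(R * log_pdfs)    # sum_n sum_m P(m|x_n) log p_m(x_n|theta_m)
    return term_pi + term_theta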
- Find $\pi_m$:

To find the $\pi_m$ that maximize the first term, we solve the following
$$\frac{\partial}{\partial \pi_m}\left[\sum_{n=1}^N \sum_{k=1}^M P(k|\mathbf{x}_n,\Theta')\,\log \pi_k + \lambda\left(\sum_{k=1}^M \pi_k - 1\right)\right] = 0$$
where $\lambda$ is the Lagrange multiplier to impose the constraint $\sum_{m=1}^M \pi_m = 1$ in the optimization. Solving this equation, we get:
$$\pi_m = -\frac{1}{\lambda}\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta').$$
Summing both sides over $m$, we get
$$1 = \sum_{m=1}^M \pi_m = -\frac{1}{\lambda}\sum_{m=1}^M\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta') = -\frac{N}{\lambda}$$
which yields $\lambda = -N$ and
$$\pi_m = \frac{1}{N}\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta').$$
These probabilities $P(m|\mathbf{x}_n,\Theta')$ can be found by Bayes's rule (see the sketch after this list):
$$P(m|\mathbf{x}_n,\Theta') = \frac{P(m)\, p_m(\mathbf{x}_n|\theta'_m)}{p(\mathbf{x}_n|\Theta')} = \frac{\pi'_m\, p_m(\mathbf{x}_n|\theta'_m)}{\sum_{k=1}^M \pi'_k\, p_k(\mathbf{x}_n|\theta'_k)}$$
where $P(m|\mathbf{x}_n,\Theta')$ on the left is the probability for a given sample $\mathbf{x}_n$ to be from the $m$th component distribution, $p_m(\mathbf{x}_n|\theta'_m)$ is the probability of $\mathbf{x}_n$ given that it is from the $m$th component distribution, and $P(m)$ is the a priori probability of the $m$th component, which is the same as the coefficient $\pi'_m$ for that component. Given the current estimation $\Theta'$ of the parameters of the specific distributions, all variables on the right-hand side are available and the conditional probabilities of the hidden variables can be obtained.
- Find $\theta_m = (\boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m)$:

Given a specific distribution function, e.g., a Gaussian distribution
$$p_m(\mathbf{x}|\theta_m) = N(\mathbf{x};\,\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}_m|^{1/2}}\,\exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_m)^T\boldsymbol{\Sigma}_m^{-1}(\mathbf{x}-\boldsymbol{\mu}_m)\right]$$
where $\boldsymbol{\mu}_m$ and $\boldsymbol{\Sigma}_m$ are the mean vector and the covariance matrix, which for a single Gaussian could be estimated by the sample mean and sample covariance
$$\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{n=1}^N \mathbf{x}_n, \qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{n=1}^N (\mathbf{x}_n-\hat{\boldsymbol{\mu}})(\mathbf{x}_n-\hat{\boldsymbol{\mu}})^T,$$
we can proceed to find the parameters $\theta_m$ that maximize the second term of the log-likelihood function above by solving
$$\frac{\partial}{\partial \theta_m}\sum_{n=1}^N \sum_{k=1}^M P(k|\mathbf{x}_n,\Theta')\,\log p_k(\mathbf{x}_n|\theta_k) = 0.$$
Taking the log inside and dropping the constants, we get
$$\sum_{n=1}^N \sum_{m=1}^M P(m|\mathbf{x}_n,\Theta')\left[-\frac{1}{2}\log|\boldsymbol{\Sigma}_m| - \frac{1}{2}(\mathbf{x}_n-\boldsymbol{\mu}_m)^T\boldsymbol{\Sigma}_m^{-1}(\mathbf{x}_n-\boldsymbol{\mu}_m)\right].$$
First consider taking the derivative with respect to $\boldsymbol{\mu}_m$ of this expression; we get
$$\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\,\boldsymbol{\Sigma}_m^{-1}(\mathbf{x}_n-\boldsymbol{\mu}_m) = \mathbf{0},$$
i.e.,
$$\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\,(\mathbf{x}_n-\boldsymbol{\mu}_m) = \mathbf{0},$$
which can be solved for $\boldsymbol{\mu}_m$ to get
$$\boldsymbol{\mu}_m = \frac{\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\,\mathbf{x}_n}{\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')}.$$
Next consider taking the derivative with respect to $\boldsymbol{\Sigma}_m$ of the same expression. We first rearrange the log-likelihood function above to get
$$\sum_{m=1}^M\left[\frac{1}{2}\log|\boldsymbol{\Sigma}_m^{-1}|\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta') - \frac{1}{2}\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\,\mathrm{tr}\!\left(\boldsymbol{\Sigma}_m^{-1}\mathbf{D}_{n,m}\right)\right]$$
where $\mathbf{D}_{n,m} = (\mathbf{x}_n-\boldsymbol{\mu}_m)(\mathbf{x}_n-\boldsymbol{\mu}_m)^T$. Then taking the derivative with respect to $\boldsymbol{\Sigma}_m^{-1}$ and setting it to zero, we get (see Appendix B):
$$\frac{1}{2}\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\left[2\boldsymbol{\Sigma}_m - \mathrm{diag}(\boldsymbol{\Sigma}_m)\right] - \frac{1}{2}\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\left[2\mathbf{D}_{n,m} - \mathrm{diag}(\mathbf{D}_{n,m})\right] = 2\mathbf{S} - \mathrm{diag}(\mathbf{S}) = \mathbf{0}$$
where $\mathbf{M}_{n,m} = \boldsymbol{\Sigma}_m - \mathbf{D}_{n,m}$, $\mathbf{S} = \frac{1}{2}\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\,\mathbf{M}_{n,m}$. This result implies $\mathbf{S} = \mathbf{0}$, i.e.,
$$\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\left[\boldsymbol{\Sigma}_m - (\mathbf{x}_n-\boldsymbol{\mu}_m)(\mathbf{x}_n-\boldsymbol{\mu}_m)^T\right] = \mathbf{0}.$$
This leads to
$$\boldsymbol{\Sigma}_m = \frac{\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\,(\mathbf{x}_n-\boldsymbol{\mu}_m)(\mathbf{x}_n-\boldsymbol{\mu}_m)^T}{\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')}.$$
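As referenced in the first item above, here is a minimal sketch of the Bayes-rule computation of the posteriors $P(m|\mathbf{x}_n,\Theta')$, assuming Gaussian components and using scipy.stats.multivariate_normal; the function name responsibilities and all variable names are ours.

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, means, covs):
    """Bayes-rule posteriors P(m | x_n, Theta') for Gaussian components.

    Returns an (N, M) matrix whose rows sum to 1.
    """
    N, M = X.shape[0], len(pis)
    num = np.zeros((N, M))
    for m in range(M):
        # numerator: pi_m * p_m(x_n | mu_m, Sigma_m)
        num[:, m] = pis[m] * multivariate_normal(mean=means[m], cov=covs[m]).pdf(X)
    # denominator: sum_k pi_k p_k(x_n | theta_k)
    return num / num.sum(axis=1, keepdims=True)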
The results obtained above can be summarized as below:
$$\pi_m = \frac{1}{N}\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')$$
$$\boldsymbol{\mu}_m = \frac{\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\,\mathbf{x}_n}{\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')}$$
$$\boldsymbol{\Sigma}_m = \frac{\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\,(\mathbf{x}_n-\boldsymbol{\mu}_m)(\mathbf{x}_n-\boldsymbol{\mu}_m)^T}{\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')}$$
Although the above derivation seems tedious, the final results all make sense intuitively: the coefficient $\pi_m$ for the $m$th component distribution is the average of the posterior probabilities $P(m|\mathbf{x}_n,\Theta')$ for the $N$ samples to be from the $m$th component distribution; the estimated mean vector $\boldsymbol{\mu}_m$ is the average of the samples $\mathbf{x}_n$; and the estimated covariance matrix $\boldsymbol{\Sigma}_m$ is the average of the squared differences between $\mathbf{x}_n$ and $\boldsymbol{\mu}_m$, the latter two both weighted by the posterior probability $P(m|\mathbf{x}_n,\Theta')$ that a given $\mathbf{x}_n$ is from the $m$th component distribution. Also note that both the E and M steps are simultaneously carried out by these three iterative equations, starting from some arbitrary initial guesses of these parameters.
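To make the iteration concrete, here is a minimal, self-contained Python sketch of one EM step for a Gaussian mixture, implementing the three update equations above. The function and variable names (em_step, R, Nm, etc.) are our own, and the use of scipy.stats.multivariate_normal is just one convenient choice for the component density.

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, means, covs):
    """One EM iteration for a Gaussian mixture: E step (posteriors by Bayes' rule)
    followed by the three M-step updates for pi_m, mu_m and Sigma_m derived above."""
    N, d = X.shape
    M = len(pis)

    # E step: R[n, m] = P(m | x_n, Theta')
    R = np.zeros((N, M))
    for m in range(M):
        R[:, m] = pis[m] * multivariate_normal(mean=means[m], cov=covs[m]).pdf(X)
    R /= R.sum(axis=1, keepdims=True)

    # M step
    Nm = R.sum(axis=0)                     # sum_n P(m | x_n, Theta')
    new_pis = Nm / N                       # pi_m: average posterior probability
    new_means = (R.T @ X) / Nm[:, None]    # mu_m: posterior-weighted sample mean
    new_covs = []
    for m in range(M):
        diff = X - new_means[m]
        new_covs.append((R[:, m, None] * diff).T @ diff / Nm[m])  # weighted covariance
    return new_pis, new_means, new_covs

In practice one would call em_step repeatedly, starting from arbitrary initial guesses, until the log-likelihood (or the parameters) stop changing noticeably.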
Ruye Wang
2006-10-11