General EM Algorithm

The EM method for the Gaussian mixture model can be generalized to estimate the parameters ${\bf\theta}=\{\theta_1,\cdots,\theta_m\}$ of a generic probabilistic model $p({\bf x},z\vert{\bf\theta})$, where $z$ is a hidden variable that takes any of the $K$ values $\{z_1,\cdots,z_K\}$ with the corresponding probabilities $P_1,\cdots,P_K$, satisfying $\sum_{k=1}^K P_k=1$.

By marginalizing over the hidden variable $z$, we get (similar to Eq. (214)):

$\displaystyle p({\bf x}\vert{\bf\theta})
=\sum_{k=1}^K p({\bf x}, z_k \vert {\bf\theta})
=\sum_{k=1}^K P(z_k\vert{\bf\theta})\,p({\bf x}\vert z_k, {\bf\theta})$ (268)
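
To make the notation concrete, the following is a minimal numerical sketch of Eq. (268) in Python/NumPy, assuming, purely as a hypothetical example not required by the general derivation, that $p({\bf x},z_k\vert{\bf\theta})$ is a one-dimensional Gaussian mixture with $K=2$ components; the parameter values are arbitrary illustrative choices:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical parameters theta of a K = 2 component model (illustrative values only)
P = np.array([0.3, 0.7])       # P(z_k | theta), summing to 1
mu = np.array([-1.0, 2.0])     # component means
sigma = np.array([0.5, 1.5])   # component standard deviations

x = 0.8                        # a single observation

# Joint p(x, z_k | theta) = P(z_k | theta) * p(x | z_k, theta) for each k
joint = P * gauss_pdf(x, mu, sigma)

# Marginal p(x | theta) = sum_k p(x, z_k | theta), as in Eq. (268)
marginal = joint.sum()
print(marginal)
```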

Based on the training set ${\bf X}=[{\bf x}_1,\cdots,{\bf x}_N]$ containing $N$ i.i.d. samples, we further obtain the likelihood function of the parameters ${\bf\theta}$ (similar to Eq. (219)):


$\displaystyle L({\bf\theta}\vert{\bf X})$ $\displaystyle =$ $\displaystyle \prod_{n=1}^N p({\bf x}_n \vert{\bf\theta})
=\prod_{n=1}^N \sum_{k=1}^K p({\bf x}_n, z_k \vert {\bf\theta})
=\prod_{n=1}^N \sum_{k=1}^K P_k
\frac{ p({\bf x}_n, z_k \vert {\bf\theta} ) }{P_k}$  
  $\displaystyle =$ $\displaystyle \prod_{n=1}^N E_z \left[\frac{ p({\bf x}_n, z_k \vert{\bf\theta})}{P_k} \right]$ (269)

and the log likelihood
$\displaystyle l({\bf\theta}\vert{\bf X})$ $\displaystyle =$ $\displaystyle \log L({\bf\theta}\vert{\bf X})
=\log \prod_{n=1}^N E_z
\left[\frac{ p({\bf x}_n, z_k \vert {\bf\theta})}{P_k}
\right]$  
  $\displaystyle =$ $\displaystyle \sum_{n=1}^N \log
E_z \left[\frac{ p({\bf x}_n, z_k \vert {\bf\theta})}{P_k} \right]
\ge \sum_{n=1}^N E_z \log
\left[\frac{ p({\bf x}_n, z_k \vert {\bf\theta})}{P_k}
\right]$ (270)

The last step follows from Jensen's inequality applied to the concave logarithmic function $f(x)=\log x$, which gives $\log E[y] \ge E[\log y]$.
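
As a quick numerical sanity check of this inequality, the following sketch compares $\log E_z[p({\bf x}_n,z_k\vert{\bf\theta})/P_k]$ and $E_z\log[p({\bf x}_n,z_k\vert{\bf\theta})/P_k]$ for a single sample, reusing the hypothetical two-component Gaussian mixture from the snippet above (all values are arbitrary and only for illustration):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Same hypothetical two-component model as before (illustrative values)
P = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
sigma = np.array([0.5, 1.5])
x = 0.8

joint = P * gauss_pdf(x, mu, sigma)   # p(x, z_k | theta)
ratio = joint / P                     # the bracketed random variable

lhs = np.log(np.sum(P * ratio))       # log E_z[ratio]  (= log p(x | theta))
rhs = np.sum(P * np.log(ratio))       # E_z[log ratio]  (the lower bound)

print(lhs, rhs, lhs >= rhs)           # Jensen's inequality: lhs >= rhs
```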

The final expression is a lower bound of the log likelihood function. The bound becomes tight (the inequality holds with equality) when the expression inside the brackets is not a random variable but a constant, which is the case if $P_k=c\,p({\bf x}_n, z_k \vert {\bf\theta})$ for some constant $c$ and all $k=1,\cdots,K$. This $c$ can be found from the normalization constraint:

$\displaystyle \sum_{k=1}^K P_k=1=c\sum_{k=1}^K p({\bf x}_n, z_k \vert {\bf\theta}),
\;\;\;\;\;\;$ i.e., $\displaystyle \;\;\;\;\;
c=\frac{1}{\sum_{k=1}^K p({\bf x}_n, z_k \vert {\bf\theta}) }$ (271)

Now we have

$\displaystyle P_k=c\,p({\bf x}_n, z_k \vert {\bf\theta})
=\frac{p({\bf x}_n, z_k \vert {\bf\theta})}{\sum_{l=1}^K p({\bf x}_n, z_l \vert {\bf\theta})}
=\frac{p({\bf x}_n, z_k \vert {\bf\theta})}{p({\bf x}_n\vert{\bf\theta})}
=p(z_k \vert {\bf x}_n, {\bf\theta})$ (272)
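
In other words, the optimal $P_k$ is obtained by normalizing the joint probabilities $p({\bf x}_n, z_k \vert {\bf\theta})$ over $k$. Below is a minimal sketch of this computation for all samples at once, again assuming the hypothetical two-component Gaussian mixture used in the earlier snippets:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical current parameter estimate theta (illustrative values)
prior = np.array([0.3, 0.7])      # P(z_k | theta)
mu = np.array([-1.0, 2.0])
sigma = np.array([0.5, 1.5])

X = np.array([0.8, -1.2, 2.5])    # N = 3 samples

# Joint p(x_n, z_k | theta) for all n, k  (shape N x K)
joint = prior * gauss_pdf(X[:, None], mu, sigma)

# c = 1 / sum_k p(x_n, z_k | theta), one constant per sample, as in Eq. (271)
c = 1.0 / joint.sum(axis=1, keepdims=True)

# P_k = c * p(x_n, z_k | theta) = p(z_k | x_n, theta), as in Eq. (272)
posterior = c * joint
print(posterior)                  # each row sums to 1
```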

We see that if we set $P_k$ to be the posterior probability $p(z_k\vert{\bf x}_n, {\bf\theta})$ of $z_k$, then the right-hand side of Eq. (270) equals the log likelihood $l({\bf\theta}\vert{\bf X})$. The optimal parameters ${\bf\theta}$ that maximize this log likelihood can be found by iterating the following two steps, starting from some initial guess of ${\bf\theta}$:

E-step: compute $P_k=p(z_k\vert{\bf x}_n,{\bf\theta})$ for each sample ${\bf x}_n$ and each $k=1,\cdots,K$, based on the current estimate of ${\bf\theta}$, as in Eq. (272);

M-step: maximize the lower bound $\sum_{n=1}^N E_z \log\left[ p({\bf x}_n, z_k\vert{\bf\theta})/P_k \right]$ in Eq. (270) with respect to ${\bf\theta}$, while holding $P_k$ fixed.

As $P_k$ in the E-step depends on ${\bf\theta}$, which is in turn updated in the M-step based on $P_k$, these two steps need to be carried out alternately and iteratively until ${\bf\theta}$ converges.
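
To make the alternation concrete, the following sketch runs the full E-step/M-step loop for the special case of a one-dimensional Gaussian mixture (the hypothetical model used in the snippets above). The closed-form M-step updates shown are the standard ones for a Gaussian mixture, not part of the general derivation; the synthetic data, initial guess, iteration cap, and stopping tolerance are all arbitrary:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians (illustrative only)
X = np.concatenate([rng.normal(-1.0, 0.5, 200), rng.normal(2.0, 1.5, 300)])

K = 2
# Initial guess of theta = (priors, means, standard deviations)
prior = np.full(K, 1.0 / K)
mu = np.array([0.0, 1.0])
sigma = np.array([1.0, 1.0])

prev_ll = -np.inf
for _ in range(200):
    # E-step: P_k = p(z_k | x_n, theta) under the current theta, Eq. (272)
    joint = prior * gauss_pdf(X[:, None], mu, sigma)      # shape N x K
    posterior = joint / joint.sum(axis=1, keepdims=True)

    # M-step: re-estimate theta with P_k held fixed
    Nk = posterior.sum(axis=0)
    prior = Nk / len(X)
    mu = (posterior * X[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((posterior * (X[:, None] - mu) ** 2).sum(axis=0) / Nk)

    # Monitor the log likelihood of the previous theta until it stops improving
    ll = np.log(joint.sum(axis=1)).sum()
    if ll - prev_ll < 1e-8:
        break
    prev_ll = ll

print(prior, mu, sigma)
```

Note that for a general model $p({\bf x},z\vert{\bf\theta})$ the M-step maximization may not have a closed form as it does here, in which case it has to be carried out by numerical optimization.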