General EM Algorithm

The EM method for the Gaussian mixture model can be generalized to estimate the parameters ${\bf\theta}=\{\theta_1,\cdots,\theta_m\}$ of a generic probabilistic model $p({\bf x},z\vert{\bf\theta})$, where $z$ is a hidden variable that takes any of the $K$ values $\{z_1,\cdots,z_K\}$ with the corresponding probabilities $P_1,\cdots,P_K$, satisfying $\sum_{k=1}^K P_k=1$.

By marginalizing over the hidden variable $z$, we get (similar to Eq. (214)):

$\displaystyle p({\bf x}\vert{\bf\theta})
=\sum_{k=1}^K p({\bf x}, z_k \vert {\bf\theta})
=\sum_{k=1}^K P(z_k\vert{\bf\theta})\,p({\bf x}\vert z_k, {\bf\theta})$ (268)
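
To make the notation concrete, the following is a minimal numerical sketch of Eq. (268) in Python/NumPy, assuming, purely as a hypothetical example not required by the general derivation, that $p({\bf x},z_k\vert{\bf\theta})$ is a one-dimensional Gaussian mixture with $K=2$ components; the parameter values are arbitrary illustrative choices:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical parameters theta of a K = 2 component model (illustrative values only)
P = np.array([0.3, 0.7])       # P(z_k | theta), summing to 1
mu = np.array([-1.0, 2.0])     # component means
sigma = np.array([0.5, 1.5])   # component standard deviations

x = 0.8                        # a single observation

# Joint p(x, z_k | theta) = P(z_k | theta) * p(x | z_k, theta) for each k
joint = P * gauss_pdf(x, mu, sigma)

# Marginal p(x | theta) = sum_k p(x, z_k | theta), as in Eq. (268)
marginal = joint.sum()
print(marginal)
```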

Based on the training set ${\bf X}=[{\bf x}_1,\cdots,{\bf x}_N]$ containing $N$ i.i.d. samples, we further obtain the likelihood function of the parameters ${\bf\theta}$ (similar to Eq. (219)):


$\displaystyle L({\bf\theta}\vert{\bf X})$ $\displaystyle =$ $\displaystyle \prod_{n=1}^N p({\bf x}_n \vert{\bf\theta})
=\prod_{n=1}^N \sum_{k=1}^K p({\bf x}_n, z_k \vert {\bf\theta})
=\prod_{n=1}^N \sum_{k=1}^K P_k
\frac{ p({\bf x}_n, z_k \vert {\bf\theta} ) }{P_k}$  
  $\displaystyle =$ $\displaystyle \prod_{n=1}^N E_z \left[\frac{ p({\bf x}_n, z_k \vert{\bf\theta})}{P_k} \right]$ (269)

and the log likelihood
$\displaystyle l({\bf\theta}\vert{\bf X})$ $\displaystyle =$ $\displaystyle \log L({\bf\theta}\vert{\bf X})
=\log \prod_{n=1}^N E_z
\left[\frac{ p({\bf x}_n, z_k \vert {\bf\theta})}{P_k}
\right]$  
  $\displaystyle =$ $\displaystyle \sum_{n=1}^N \log
E_z \left[\frac{ p({\bf x}_n, z_k \vert {\bf\theta})}{P_k} \right]
\ge \sum_{n=1}^N E_z \log
\left[\frac{ p({\bf x}_n, z_k \vert {\bf\theta})}{P_k}
\right]$ (270)

The last step follows from Jensen's inequality applied to the concave logarithmic function $f(x)=\log x$, which gives $\log E[y] \ge E[\log y]$.
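
As a quick numerical sanity check of this inequality, the following sketch compares $\log E_z[p({\bf x}_n,z_k\vert{\bf\theta})/P_k]$ and $E_z\log[p({\bf x}_n,z_k\vert{\bf\theta})/P_k]$ for a single sample, reusing the hypothetical two-component Gaussian mixture from the snippet above (all values are arbitrary and only for illustration):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Same hypothetical two-component model as before (illustrative values)
P = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
sigma = np.array([0.5, 1.5])
x = 0.8

joint = P * gauss_pdf(x, mu, sigma)   # p(x, z_k | theta)
ratio = joint / P                     # the bracketed random variable

lhs = np.log(np.sum(P * ratio))       # log E_z[ratio]  (= log p(x | theta))
rhs = np.sum(P * np.log(ratio))       # E_z[log ratio]  (the lower bound)

print(lhs, rhs, lhs >= rhs)           # Jensen's inequality: lhs >= rhs
```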

The final expression is a lower bound of the log likelihood function. The bound becomes tight (the inequality holds with equality) when the expression inside the brackets is not a random variable but a constant, which is the case if $P_k=c\,p({\bf x}_n, z_k \vert {\bf\theta})$ for some constant $c$ and all $k=1,\cdots,K$. This $c$ can be found from the normalization constraint:

$\displaystyle \sum_{k=1}^K P_k=1=c\sum_{k=1}^K p({\bf x}_n, z_k \vert {\bf\theta}),
\;\;\;\;\;\;$ i.e., $\displaystyle \;\;\;\;\;
c=\frac{1}{\sum_{k=1}^K p({\bf x}_n, z_k \vert {\bf\theta}) }$ (271)

Now we have

$\displaystyle P_k=c\,p({\bf x}_n, z_k \vert {\bf\theta})
=\frac{p({\bf x}_n, z_k \vert {\bf\theta})}{\sum_{l=1}^K p({\bf x}_n, z_l \vert {\bf\theta})}
=\frac{p({\bf x}_n, z_k \vert {\bf\theta})}{p({\bf x}_n\vert{\bf\theta})}
=p(z_k \vert {\bf x}_n, {\bf\theta})$ (272)
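
In other words, the optimal $P_k$ is obtained by normalizing the joint probabilities $p({\bf x}_n, z_k \vert {\bf\theta})$ over $k$. Below is a minimal sketch of this computation for all samples at once, again assuming the hypothetical two-component Gaussian mixture used in the earlier snippets:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical current parameter estimate theta (illustrative values)
prior = np.array([0.3, 0.7])      # P(z_k | theta)
mu = np.array([-1.0, 2.0])
sigma = np.array([0.5, 1.5])

X = np.array([0.8, -1.2, 2.5])    # N = 3 samples

# Joint p(x_n, z_k | theta) for all n, k  (shape N x K)
joint = prior * gauss_pdf(X[:, None], mu, sigma)

# c = 1 / sum_k p(x_n, z_k | theta), one constant per sample, as in Eq. (271)
c = 1.0 / joint.sum(axis=1, keepdims=True)

# P_k = c * p(x_n, z_k | theta) = p(z_k | x_n, theta), as in Eq. (272)
posterior = c * joint
print(posterior)                  # each row sums to 1
```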

We see that if we set $P_k$ to be the posterior probability $p(z_k\vert{\bf x}_n, {\bf\theta})$ of $z_k$, then the right-hand side of Eq. (270) equals the log likelihood $l({\bf\theta}\vert{\bf X})$. The optimal parameters ${\bf\theta}$ that maximize this log likelihood can be found by iterating the following two steps, starting from some initial guess of ${\bf\theta}$:

E-step: compute $P_k=p(z_k\vert{\bf x}_n,{\bf\theta})$ for each sample ${\bf x}_n$ and each $k=1,\cdots,K$, based on the current estimate of ${\bf\theta}$, as in Eq. (272);

M-step: maximize the lower bound $\sum_{n=1}^N E_z \log\left[ p({\bf x}_n, z_k\vert{\bf\theta})/P_k \right]$ in Eq. (270) with respect to ${\bf\theta}$, while holding $P_k$ fixed.

As $P_k$ in the E-step depends on ${\bf\theta}$, which is in turn updated in the M-step based on $P_k$, these two steps need to be carried out alternately and iteratively until ${\bf\theta}$ converges.
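
To make the alternation concrete, the following sketch runs the full E-step/M-step loop for the special case of a one-dimensional Gaussian mixture (the hypothetical model used in the snippets above). The closed-form M-step updates shown are the standard ones for a Gaussian mixture, not part of the general derivation; the synthetic data, initial guess, iteration cap, and stopping tolerance are all arbitrary:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians (illustrative only)
X = np.concatenate([rng.normal(-1.0, 0.5, 200), rng.normal(2.0, 1.5, 300)])

K = 2
# Initial guess of theta = (priors, means, standard deviations)
prior = np.full(K, 1.0 / K)
mu = np.array([0.0, 1.0])
sigma = np.array([1.0, 1.0])

prev_ll = -np.inf
for _ in range(200):
    # E-step: P_k = p(z_k | x_n, theta) under the current theta, Eq. (272)
    joint = prior * gauss_pdf(X[:, None], mu, sigma)      # shape N x K
    posterior = joint / joint.sum(axis=1, keepdims=True)

    # M-step: re-estimate theta with P_k held fixed
    Nk = posterior.sum(axis=0)
    prior = Nk / len(X)
    mu = (posterior * X[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((posterior * (X[:, None] - mu) ** 2).sum(axis=0) / Nk)

    # Monitor the log likelihood of the previous theta until it stops improving
    ll = np.log(joint.sum(axis=1)).sum()
    if ll - prev_ll < 1e-8:
        break
    prev_ll = ll

print(prior, mu, sigma)
```

Note that for a general model $p({\bf x},z\vert{\bf\theta})$ the M-step maximization may not have a closed form as it does here, in which case it has to be carried out by numerical optimization.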