
Expectation Maximization (EM)

Assume the prior distribution of the parameters is $P(\theta)$ and the distribution of the data is $P(X)$. Then the joint probability of the data and parameters is

\begin{displaymath}P(X,\theta)=P(X\vert\theta) P(\theta) \end{displaymath}

The posterior distribution of the model parameters can be obtained according to Bayes' theorem:

\begin{displaymath}P(\theta\vert X)=\frac{P(\theta)P(X\vert\theta)}{P(X)}
=\frac{P(\theta)P(X\vert\theta)}{\int P(\theta) P(X\vert\theta)\;d\theta}
\end{displaymath}

where $P(X\vert\theta)=L(\theta\vert X)$ is the likelihood function of the parameters $\theta$, given the observed data $X$. The goal of maximum-likelihood estimation is to find $\theta$ that maximizes the likelihood $L$:

\begin{displaymath}\theta^*=argmax_{\theta}\;L(\theta\vert X) \end{displaymath}
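To illustrate the two formulas above numerically, here is a minimal Python sketch, assuming the exponential model $p(x\vert\theta)=\theta e^{-\theta x}$ used in the example at the end of this section and an arbitrarily chosen $Exp(1)$ prior: it discretizes $\theta$ on a grid, multiplies prior by likelihood, and normalizes by a sum approximating the integral in the denominator.

\begin{verbatim}
import numpy as np

x = 5.0                                  # one observed data point
theta = np.linspace(1e-3, 2.0, 2000)     # grid over the parameter
prior = np.exp(-theta)                   # assumed prior P(theta): Exp(1)
like = theta * np.exp(-theta * x)        # P(X|theta) for the exponential model
post = prior * like                      # numerator P(theta) P(X|theta)
dtheta = theta[1] - theta[0]
post = post / (post.sum() * dtheta)      # approximates dividing by P(X)

print(theta[np.argmax(like)])            # MLE: argmax of L(theta|X), about 1/x = 0.2
\end{verbatim}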

The expectation-maximization (EM) algorithm is a method for finding maximum likelihood estimates of the parameters $\theta$ of a probabilistic model, based on unobserved or hidden variables $Y$, as well as the observed variables $X$.

Let $(X,Y)$ be the complete data, containing both the observed (but incomplete) data $X$ and the missing or hidden data $Y$. Then the complete-data likelihood of the parameters $\theta$ is

\begin{displaymath}L(\theta\vert X,Y)=p(X,Y\vert\theta)=p(Y\vert X,\theta) p(X\vert\theta) \end{displaymath}

This likelihood is a random variable, since the missing data $Y$ is unknown and assumed to be random with some distribution. By Bayes' rule, the incomplete-data likelihood is

\begin{displaymath}L(\theta\vert X)=p(X\vert\theta)=\frac{p(X,Y\vert\theta)}{p(Y\vert X,\theta)} \end{displaymath}

The EM algorithm attempts to find the value of $\theta$ which maximizes $p(X\vert\theta)$ given the observed $X$, by making use of the associated family $p(X, Y\vert\theta)$ [Dempster et al., 1977]. EM alternates between the following two steps:

The E step computes the conditional expectation of the complete-data log-likelihood, given the observed data $X$ and the current parameter estimate $\theta_t$:

\begin{displaymath}Q(\theta,\theta_t)=E_Y[\,log\;p(X,Y\vert\theta)\;\vert\;X,\theta_t\,]
=\int log\;p(X,Y\vert\theta)\;p(Y\vert X,\theta_t)\;dY
=\int log\;p(X,Y\vert\theta)\;\frac{p(Y,X\vert\theta_t)}{p(X\vert\theta_t)}\;dY \end{displaymath}

The M step finds the new parameter estimate that maximizes this expectation:

\begin{displaymath}\theta_{t+1}=argmax_{\theta}\;Q(\theta,\theta_t) \end{displaymath}

In the E step, the first argument $\theta$ of $Q(\theta,\theta_t)$ represents the parameter to be optimized to maximize the likelihood, while the second argument $\theta_t$ represents the parameters used to evaluate the expectation. In the M step, $\theta_{t+1}$ is the value that maximizes (M) the conditional expectation (E) of the complete-data log-likelihood given the observed variables $X$ under the previous parameter value $\theta_t$. The parameters $\theta_{t+1}$ found in the M step are then used to begin another E step, and the process is repeated.

Also note that the last equal sign above uses $p(Y\vert X,\theta_t)=p(Y,X\vert\theta_t)/p(X\vert\theta_t)$, where the denominator $p(X\vert\theta_t)$ is not a function of $\theta$ and is therefore a constant with respect to the maximization. In other words, up to this constant factor, the $Q$ function in the E step can also be written as

\begin{displaymath}Q(\theta,\theta_t)=\int log\;p(X,Y\vert\theta) \;p(Y,X\vert\theta_t) dY \end{displaymath}
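To make the alternation concrete, here is a minimal sketch of the generic EM loop in Python. It assumes the E step has already been reduced to a closed-form $Q(\theta,\theta_t)$ (for runnability, the one from the exponential example at the end of this section, with observed $x=5$) and carries out the M step by numerical maximization; the function names are illustrative, not part of the algorithm.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize_scalar

x = 5.0                             # the observed data

def Q(theta, theta_t):
    # E step result, borrowed from the worked example below:
    # Q(theta, theta_t) = 2 log(theta) - x*theta - theta/theta_t
    return 2.0 * np.log(theta) - x * theta - theta / theta_t

theta_t = 1.0                       # initial guess theta_1
for _ in range(50):
    # M step: maximize Q(., theta_t) over theta > 0
    res = minimize_scalar(lambda th: -Q(th, theta_t),
                          bounds=(1e-9, 10.0), method="bounded")
    theta_t = res.x

print(theta_t)                      # converges to 1/x = 0.2
\end{verbatim}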

Theorem: The procedure in the M step above guarantees that

\begin{displaymath}log[L(\theta_{t+1}\vert X)] \ge log[L(\theta_t\vert X)] \end{displaymath}

with equality iff $Q(\theta_{t+1},\theta_t)=Q(\theta_t,\theta_t)$, for $t=1,2,\cdots$.

Proof: As $p(X\vert\theta)=p(X,Y\vert\theta)/p(Y\vert X,\theta)$, we have

\begin{eqnarray*}
log\;L(\theta\vert X) &=& log\;p(X\vert\theta)=log\;p(X,Y\vert\theta)-log\;p(Y\vert X,\theta) \\
&=& E_Y[\,log\;p(X,Y\vert\theta)\;\vert\;X,\theta_t\,]-E_Y[\,log\;p(Y\vert X,\theta)\;\vert\;X,\theta_t\,]
=Q(\theta,\theta_t)-H(\theta,\theta_t)
\end{eqnarray*}

(the second line follows by taking the expectation of both sides with respect to $p(Y\vert X,\theta_t)$, which leaves the left-hand side unchanged since it does not depend on $Y$)

where $H(\theta,\theta_t)$ is defined as

\begin{displaymath}H(\theta,\theta_t)\stackrel{\triangle}{=}E_Y[\,log\;p(Y\vert X,\theta)\;\vert\;X,\theta_t\,]
=\int log\;p(Y\vert X,\theta)\;p(Y\vert X,\theta_t)\;dY \end{displaymath}

Now we get:

\begin{eqnarray*}
log[L(\theta_{t+1}\vert X)]-log[L(\theta_t\vert X)] &=& log[p(X\vert\theta_{t+1})]-log[p(X\vert\theta_t)] \\
&=& [Q(\theta_{t+1},\theta_t)-Q(\theta_t,\theta_t)]+[H(\theta_t,\theta_t)-H(\theta_{t+1},\theta_t)] \ge 0
\end{eqnarray*}

since the first term is greater than or equal to zero due to the M step of the algorithm, and the second term can be shown to be greater than or equal to zero as well:
\begin{eqnarray*}
H(\theta_t,\theta_t)-H(\theta_{t+1},\theta_t) &=& \int log\;p(Y\vert X,\theta_t)\;p(Y\vert X,\theta_t)\;dY
-\int log\;p(Y\vert X,\theta_{t+1})\;p(Y\vert X,\theta_t)\;dY \\
&=& \int log\,\frac{p(Y\vert X,\theta_t)}{p(Y\vert X,\theta_{t+1})}\;p(Y\vert X,\theta_t)\;dY \;\ge\; 0
\end{eqnarray*}

The last expression is the Kullback-Leibler divergence (KL divergence) between $p(Y\vert X,\theta_t)$ and $p(Y\vert X,\theta_{t+1})$, which is always non-negative, and is zero iff

\begin{displaymath}p(Y\vert X,\theta_t)=p(Y\vert X,\theta_{t+1}) \end{displaymath}
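For exponential conditionals like those in the example below, this KL divergence has the closed form $log(a/b)+b/a-1$ for rates $a$ and $b$. A small Python sketch (assuming scipy is available) compares the closed form against direct numerical integration to confirm non-negativity:

\begin{verbatim}
import numpy as np
from scipy.integrate import quad

def kl_exponential(a, b):
    # closed-form KL( Exp(a) || Exp(b) ) = log(a/b) + b/a - 1 >= 0
    return np.log(a / b) + b / a - 1.0

def kl_numeric(a, b):
    # direct integration of int_0^inf p_a(y) log( p_a(y)/p_b(y) ) dy,
    # with the log ratio expanded as (log a - a y) - (log b - b y)
    integrand = lambda y: a * np.exp(-a * y) * ((np.log(a) - a * y) - (np.log(b) - b * y))
    val, _ = quad(integrand, 0.0, np.inf)
    return val

a, b = 0.5, 0.2     # e.g. theta_t and theta_{t+1} in the H-term above
print(kl_exponential(a, b), kl_numeric(a, b))   # both about 0.316, and >= 0
\end{verbatim}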

Example: Assume the observed variable $x=5$ and the hidden variable $y$ are independently and identically generated by an exponential distribution $p(x\vert\theta)=\theta\;e^{-\theta x}$ ($x\ge 0$):

\begin{displaymath}\int_0^\infty p(x\vert\theta) dx=\int_0^\infty \theta e^{-\theta x} dx =1\end{displaymath}


\begin{displaymath}E[x]=\int_0^\infty x\;p(x\vert\theta)\;dx
=\int_0^\infty x\;\theta e^{-\theta x}\;dx=\frac{1}{\theta} \end{displaymath}

Then the joint distribution of $x$ and $y$ is

\begin{displaymath}p(x,y\vert\theta)=p(x\vert\theta) p(y\vert\theta)=\theta e^{-\theta x}\;\theta e^{-\theta y} \end{displaymath}

and the log likelihood is

\begin{displaymath}log\;L(\theta\vert x,y)=log\;p(x,y\vert\theta)=log[\theta e^{-\theta x}\;\theta e^{-\theta y}]
=2 log\;\theta-\theta x-\theta y=2 log\;\theta-5\theta-\theta y \end{displaymath}

As $x$ and $y$ are independent, the conditional probability of $y$ is

\begin{displaymath}p(y\vert x,\theta)=p(y\vert\theta)=\theta\; e^{-\theta y} \end{displaymath}

The E step: compute $Q(\theta,\theta_t)$:

\begin{displaymath}Q(\theta,\theta_t)=\int_0^\infty log\;p(x,y\vert\theta)\;p(y\vert x,\theta_t)\;dy
=\int_0^\infty (2 log\;\theta-5\theta-\theta y)\;\theta_t e^{-\theta_t y}\;dy
=2 log\;\theta-5\theta-\frac{\theta}{\theta_t} \end{displaymath}
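This closed form can be verified by numerical integration; a quick sketch, assuming scipy is available:

\begin{verbatim}
import numpy as np
from scipy.integrate import quad

x = 5.0

def Q_numeric(theta, theta_t):
    # integrand: log p(x,y|theta) times p(y|x,theta_t)
    f = lambda y: (2*np.log(theta) - theta*x - theta*y) * theta_t * np.exp(-theta_t * y)
    val, _ = quad(f, 0.0, np.inf)
    return val

def Q_closed(theta, theta_t):
    return 2*np.log(theta) - 5.0*theta - theta/theta_t

print(Q_numeric(0.3, 0.7), Q_closed(0.3, 0.7))  # the two values should agree
\end{verbatim}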

The M step: maximize $Q(\theta,\theta_t)$:

\begin{displaymath}\frac{d}{d\theta}[2 log\;\theta-5\theta-\frac{\theta}{\theta_t}]=0 \end{displaymath}

which can be solved to yield the iteration formula:

\begin{displaymath}\theta_{t+1}=\frac{2\theta_t}{5\theta_t +1} \end{displaymath}

This iteration always converges to $\theta=0.2$, independent of the (positive) initial value $\theta_1$. In fact, for any value of the observed sample $x$, the iteration converges to $\theta=1/x$, which agrees with $E[x]=1/\theta$ for the exponential distribution.
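A quick numerical check of this convergence, and of the monotone-ascent theorem via the observed-data log-likelihood $log\;L(\theta\vert x)=log\;\theta-5\theta$ (a minimal sketch):

\begin{verbatim}
import math

x = 5.0
theta = 3.0                         # arbitrary positive starting value theta_1
prev_ll = -math.inf
for t in range(30):
    ll = math.log(theta) - theta * x           # log L(theta_t | x)
    assert ll >= prev_ll - 1e-12               # theorem: monotone ascent
    prev_ll = ll
    theta = 2.0 * theta / (5.0 * theta + 1.0)  # the M-step update derived above
print(theta)                                   # -> 0.2 = 1/x
\end{verbatim}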


Ruye Wang 2006-10-11