Assume $N$ random samples $X = \{\mathbf{x}_1,\ldots,\mathbf{x}_N\}$ are drawn from a mixture distribution
$$p(\mathbf{x}|\Theta) = \sum_{m=1}^M \pi_m\, p_m(\mathbf{x}|\theta_m)$$
where $p_m(\mathbf{x}|\theta_m)$ is the $m$th distribution component parameterized by $\theta_m$, and $\pi_m$ is the mixing coefficient or prior probability of each mixture component satisfying
$$\sum_{m=1}^M \pi_m = 1.$$
The parameters are $\Theta = \{\pi_1,\ldots,\pi_M,\ \theta_1,\ldots,\theta_M\}$.

As $\mathbf{x}_1,\ldots,\mathbf{x}_N$ are independent, their joint distribution is
$$p(X|\Theta) = \prod_{n=1}^N p(\mathbf{x}_n|\Theta) = \prod_{n=1}^N \sum_{m=1}^M \pi_m\, p_m(\mathbf{x}_n|\theta_m)$$
and the log-likelihood of the parameters is
$$l(\Theta|X) = \log p(X|\Theta) = \sum_{n=1}^N \log\left[\sum_{m=1}^M \pi_m\, p_m(\mathbf{x}_n|\theta_m)\right].$$
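As a quick numerical illustration, the following Python sketch evaluates this log-likelihood for a given parameter set. The function name mixture_log_likelihood, the use of scipy.stats.multivariate_normal for the component densities, and the example numbers are illustrative choices of ours, not part of the derivation.

import numpy as np
from scipy.stats import multivariate_normal

def mixture_log_likelihood(X, pis, component_pdfs):
    """l(Theta|X) = sum_n log( sum_m pi_m p_m(x_n | theta_m) )."""
    # X: (N, d) samples; pis: (M,) mixing coefficients summing to 1;
    # component_pdfs: list of M callables, each returning p_m(x_n) for all rows of X.
    N, M = X.shape[0], len(pis)
    p = np.zeros((N, M))
    for m, pdf in enumerate(component_pdfs):
        p[:, m] = pdf(X)                         # p_m(x_n | theta_m)
    return np.sum(np.log(p @ np.asarray(pis)))   # sum_n log sum_m pi_m p_m(x_n)

# Example with two (assumed) Gaussian components and made-up data:
X = np.random.default_rng(0).normal(size=(100, 2))
pdfs = [multivariate_normal(mean=[0, 0]).pdf, multivariate_normal(mean=[3, 3]).pdf]
print(mixture_log_likelihood(X, [0.5, 0.5], pdfs))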
Finding a $\Theta$ that maximizes this log-likelihood is not easy. To make the problem easier, now assume some hidden or latent random variables $Y = \{y_1,\ldots,y_N\}$, where $y_n = m$ if the $n$th sample $\mathbf{x}_n$ is generated by the $m$th component $p_m(\mathbf{x}|\theta_m)$ of the mixture distribution. Now the log-likelihood can be written in terms of both $X$ and $Y$:
$$l(\Theta|X,Y) = \log p(X,Y|\Theta) = \sum_{n=1}^N \log\left[\pi_{y_n}\, p_{y_n}(\mathbf{x}_n|\theta_{y_n})\right].$$
The last equal sign is due to the definition of $y_n$, i.e., $\mathbf{x}_n$ is known to be generated by the $y_n$th distribution component $p_{y_n}(\mathbf{x}|\theta_{y_n})$, therefore all other terms in the summation $\sum_{m=1}^M \pi_m\, p_m(\mathbf{x}_n|\theta_m)$ can be dropped. The expectation of the log-likelihood with respect to $Y$ is
$$Q(\Theta) = E_Y\left[\log p(X,Y|\Theta)\,\big|\,X,\Theta'\right] = \sum_Y P(Y|X,\Theta')\,\log p(X,Y|\Theta)$$
where $\Theta'$ denotes the current estimate of the parameters. But as any $y_n$ can only take one of the $M$ integers $1,\ldots,M$, the expectation above can be written as
$$Q(\Theta) = \sum_{n=1}^N \sum_{m=1}^M P(m|\mathbf{x}_n,\Theta')\,\log\left[\pi_m\, p_m(\mathbf{x}_n|\theta_m)\right]
= \sum_{n=1}^N \sum_{m=1}^M P(m|\mathbf{x}_n,\Theta')\,\log \pi_m + \sum_{n=1}^N \sum_{m=1}^M P(m|\mathbf{x}_n,\Theta')\,\log p_m(\mathbf{x}_n|\theta_m).$$
The two terms can be maximized separately as $\pi_m$ and $\theta_m$ are not related.
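This separability is easy to see in code. The short sketch below (naming is ours; R denotes the $N\times M$ matrix of posteriors $P(m|\mathbf{x}_n,\Theta')$, obtained by Bayes' rule as described below) evaluates $Q(\Theta)$ as the sum of the two independent pieces.

import numpy as np

def q_function(R, pis, log_pdfs):
    """Expected complete-data log-likelihood, split into its two separable terms.

    R:        (N, M) posteriors P(m | x_n, Theta') from the current estimate
    pis:      (M,)   mixing coefficients pi_m
    log_pdfs: (N, M) values of log p_m(x_n | theta_m)
    """
    term_pi = np.sum(R * np.log(pis))    # sum_n sum_m P(m|x_n) log pi_m
    term_theta = np.sum(R * log_pdfs)    # sum_n sum_m P(m|x_n) log p_m(x_n|theta_m)
    return term_pi + term_theta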
- Find $\pi_m$:

To find the $\pi_m$ that maximize the first term, we solve the following
$$\frac{\partial}{\partial \pi_m}\left[\sum_{n=1}^N \sum_{k=1}^M P(k|\mathbf{x}_n,\Theta')\,\log \pi_k + \lambda\left(\sum_{k=1}^M \pi_k - 1\right)\right] = 0$$
where $\lambda$ is the Lagrange multiplier to impose the constraint $\sum_{m=1}^M \pi_m = 1$ in the optimization. Solving this equation, we get:
$$\pi_m = -\frac{1}{\lambda}\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta').$$
Summing both sides over $m$, we get
$$1 = \sum_{m=1}^M \pi_m = -\frac{1}{\lambda}\sum_{m=1}^M\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta') = -\frac{N}{\lambda}$$
which yields $\lambda = -N$ and
$$\pi_m = \frac{1}{N}\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta').$$
These probabilities $P(m|\mathbf{x}_n,\Theta')$ can be found by Bayes's rule (see the sketch after this list):
$$P(m|\mathbf{x}_n,\Theta') = \frac{P(m)\, p_m(\mathbf{x}_n|\theta'_m)}{p(\mathbf{x}_n|\Theta')} = \frac{\pi'_m\, p_m(\mathbf{x}_n|\theta'_m)}{\sum_{k=1}^M \pi'_k\, p_k(\mathbf{x}_n|\theta'_k)}$$
where $P(m|\mathbf{x}_n,\Theta')$ on the left is the probability for a given sample $\mathbf{x}_n$ to be from the $m$th component distribution, $p_m(\mathbf{x}_n|\theta'_m)$ is the probability of $\mathbf{x}_n$ given that it is from the $m$th component distribution, and $P(m)$ is the a priori probability of the $m$th component, which is the same as the coefficient $\pi'_m$ for that component. Given the current estimation $\Theta'$ of the parameters of the specific distributions, all variables on the right-hand side are available and the conditional probabilities of the hidden variables can be obtained.
- Find $\theta_m = (\boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m)$:

Given a specific distribution function, e.g., a Gaussian distribution
$$p_m(\mathbf{x}|\theta_m) = N(\mathbf{x};\,\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}_m|^{1/2}}\,\exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_m)^T\boldsymbol{\Sigma}_m^{-1}(\mathbf{x}-\boldsymbol{\mu}_m)\right]$$
where $\boldsymbol{\mu}_m$ and $\boldsymbol{\Sigma}_m$ are the mean vector and the covariance matrix, which for a single Gaussian could be estimated by the sample mean and sample covariance
$$\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{n=1}^N \mathbf{x}_n, \qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{n=1}^N (\mathbf{x}_n-\hat{\boldsymbol{\mu}})(\mathbf{x}_n-\hat{\boldsymbol{\mu}})^T,$$
we can proceed to find the parameters $\theta_m$ that maximize the second term of the log-likelihood function above by solving
$$\frac{\partial}{\partial \theta_m}\sum_{n=1}^N \sum_{k=1}^M P(k|\mathbf{x}_n,\Theta')\,\log p_k(\mathbf{x}_n|\theta_k) = 0.$$
Taking the log inside and dropping the constants, we get
$$\sum_{n=1}^N \sum_{m=1}^M P(m|\mathbf{x}_n,\Theta')\left[-\frac{1}{2}\log|\boldsymbol{\Sigma}_m| - \frac{1}{2}(\mathbf{x}_n-\boldsymbol{\mu}_m)^T\boldsymbol{\Sigma}_m^{-1}(\mathbf{x}_n-\boldsymbol{\mu}_m)\right].$$
First consider taking the derivative with respect to $\boldsymbol{\mu}_m$ of this expression; we get
$$\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\,\boldsymbol{\Sigma}_m^{-1}(\mathbf{x}_n-\boldsymbol{\mu}_m) = \mathbf{0},$$
i.e.,
$$\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\,(\mathbf{x}_n-\boldsymbol{\mu}_m) = \mathbf{0},$$
which can be solved for $\boldsymbol{\mu}_m$ to get
$$\boldsymbol{\mu}_m = \frac{\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\,\mathbf{x}_n}{\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')}.$$
Next consider taking the derivative with respect to $\boldsymbol{\Sigma}_m$ of the same expression. We first rearrange the log-likelihood function above to get
$$\sum_{m=1}^M\left[\frac{1}{2}\log|\boldsymbol{\Sigma}_m^{-1}|\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta') - \frac{1}{2}\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\,\mathrm{tr}\!\left(\boldsymbol{\Sigma}_m^{-1}\mathbf{D}_{n,m}\right)\right]$$
where $\mathbf{D}_{n,m} = (\mathbf{x}_n-\boldsymbol{\mu}_m)(\mathbf{x}_n-\boldsymbol{\mu}_m)^T$. Then taking the derivative with respect to $\boldsymbol{\Sigma}_m^{-1}$ and setting it to zero, we get (see Appendix B):
$$\frac{1}{2}\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\left[2\boldsymbol{\Sigma}_m - \mathrm{diag}(\boldsymbol{\Sigma}_m)\right] - \frac{1}{2}\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\left[2\mathbf{D}_{n,m} - \mathrm{diag}(\mathbf{D}_{n,m})\right] = 2\mathbf{S} - \mathrm{diag}(\mathbf{S}) = \mathbf{0}$$
where $\mathbf{M}_{n,m} = \boldsymbol{\Sigma}_m - \mathbf{D}_{n,m}$, $\mathbf{S} = \frac{1}{2}\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\,\mathbf{M}_{n,m}$. This result implies $\mathbf{S} = \mathbf{0}$, i.e.,
$$\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\left[\boldsymbol{\Sigma}_m - (\mathbf{x}_n-\boldsymbol{\mu}_m)(\mathbf{x}_n-\boldsymbol{\mu}_m)^T\right] = \mathbf{0}.$$
This leads to
$$\boldsymbol{\Sigma}_m = \frac{\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\,(\mathbf{x}_n-\boldsymbol{\mu}_m)(\mathbf{x}_n-\boldsymbol{\mu}_m)^T}{\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')}.$$
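As referenced in the first item above, here is a minimal sketch of the Bayes-rule computation of the posteriors $P(m|\mathbf{x}_n,\Theta')$, assuming Gaussian components and using scipy.stats.multivariate_normal; the function name responsibilities and all variable names are ours.

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, means, covs):
    """Bayes-rule posteriors P(m | x_n, Theta') for Gaussian components.

    Returns an (N, M) matrix whose rows sum to 1.
    """
    N, M = X.shape[0], len(pis)
    num = np.zeros((N, M))
    for m in range(M):
        # numerator: pi_m * p_m(x_n | mu_m, Sigma_m)
        num[:, m] = pis[m] * multivariate_normal(mean=means[m], cov=covs[m]).pdf(X)
    # denominator: sum_k pi_k p_k(x_n | theta_k)
    return num / num.sum(axis=1, keepdims=True)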
The results obtained above can be summarized as below:
$$\pi_m = \frac{1}{N}\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')$$
$$\boldsymbol{\mu}_m = \frac{\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\,\mathbf{x}_n}{\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')}$$
$$\boldsymbol{\Sigma}_m = \frac{\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')\,(\mathbf{x}_n-\boldsymbol{\mu}_m)(\mathbf{x}_n-\boldsymbol{\mu}_m)^T}{\sum_{n=1}^N P(m|\mathbf{x}_n,\Theta')}$$
Although the above derivation seems tedious, the final results all make sense intuitively: the coefficient $\pi_m$ for the $m$th component distribution is the average of the posterior probabilities $P(m|\mathbf{x}_n,\Theta')$ for the $N$ samples to be from the $m$th component distribution; the estimated mean vector $\boldsymbol{\mu}_m$ is the average of the samples $\mathbf{x}_n$; and the estimated covariance matrix $\boldsymbol{\Sigma}_m$ is the average of the squared differences between $\mathbf{x}_n$ and $\boldsymbol{\mu}_m$, the latter two both weighted by the posterior probability $P(m|\mathbf{x}_n,\Theta')$ that a given $\mathbf{x}_n$ is from the $m$th component distribution. Also note that both the E and M steps are simultaneously carried out by these three iterative equations, starting from some arbitrary initial guesses of these parameters.
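To make the iteration concrete, here is a minimal, self-contained Python sketch of one EM step for a Gaussian mixture, implementing the three update equations above. The function and variable names (em_step, R, Nm, etc.) are our own, and the use of scipy.stats.multivariate_normal is just one convenient choice for the component density.

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, means, covs):
    """One EM iteration for a Gaussian mixture: E step (posteriors by Bayes' rule)
    followed by the three M-step updates for pi_m, mu_m and Sigma_m derived above."""
    N, d = X.shape
    M = len(pis)

    # E step: R[n, m] = P(m | x_n, Theta')
    R = np.zeros((N, M))
    for m in range(M):
        R[:, m] = pis[m] * multivariate_normal(mean=means[m], cov=covs[m]).pdf(X)
    R /= R.sum(axis=1, keepdims=True)

    # M step
    Nm = R.sum(axis=0)                     # sum_n P(m | x_n, Theta')
    new_pis = Nm / N                       # pi_m: average posterior probability
    new_means = (R.T @ X) / Nm[:, None]    # mu_m: posterior-weighted sample mean
    new_covs = []
    for m in range(M):
        diff = X - new_means[m]
        new_covs.append((R[:, m, None] * diff).T @ diff / Nm[m])  # weighted covariance
    return new_pis, new_means, new_covs

In practice one would call em_step repeatedly, starting from arbitrary initial guesses, until the log-likelihood (or the parameters) stop changing noticeably.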
Ruye Wang
2006-10-11