If the data are binary, i.e., each data point is treated as a discrete random variable $x$ that takes one of two values, such as $1$ and $0$, with probabilities $\mu$ and $1-\mu$ respectively, then the assumption of a Gaussian distribution of the dataset is no longer valid and the Gaussian mixture model is not suitable. In this case, the probability mass function (pmf) of the Bernoulli distribution can be used instead:

$$p(x\mid\mu)=\mu^x(1-\mu)^{1-x},\qquad x\in\{0,1\}\tag{245}$$
The mean and variance of $x$ are

$$\mathrm{E}[x]=1\cdot\mu+0\cdot(1-\mu)=\mu\tag{246}$$

$$\mathrm{Var}[x]=\mathrm{E}[x^2]-\mathrm{E}^2[x]=\mu-\mu^2=\mu(1-\mu)\tag{247}$$
A set of $d$ independent binary variables $x_1,\dots,x_d$ can be represented as a random vector $\mathbf{x}=[x_1,\dots,x_d]^T$ with mean vector

$$\mathbf{m}=\mathrm{E}[\mathbf{x}]=[\mu_1,\dots,\mu_d]^T=\boldsymbol{\mu}\tag{248}$$

and covariance matrix as shown below:

$$\boldsymbol{\Sigma}=\mathrm{Cov}[\mathbf{x}]=\mathrm{diag}\{\mu_1(1-\mu_1),\dots,\mu_d(1-\mu_d)\}\tag{249}$$

Note that the covariance matrix is diagonal, as the components of $\mathbf{x}$ are independent, and it is solely determined by the means $\mu_1,\dots,\mu_d$.
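For instance, with $d=2$, $\mu_1=0.2$, and $\mu_2=0.7$ (values picked here purely for illustration), we get

$$\mathbf{m}=\begin{bmatrix}0.2\\0.7\end{bmatrix},\qquad
\boldsymbol{\Sigma}=\begin{bmatrix}0.2\,(1-0.2)&0\\0&0.7\,(1-0.7)\end{bmatrix}
=\begin{bmatrix}0.16&0\\0&0.21\end{bmatrix}.$$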
Now we can get the pmf of a binary random vector $\mathbf{x}$:

$$p(\mathbf{x}\mid\boldsymbol{\mu})=\prod_{i=1}^d\mu_i^{x_i}(1-\mu_i)^{1-x_i}\tag{250}$$

and the log pmf:

$$\log p(\mathbf{x}\mid\boldsymbol{\mu})=\sum_{i=1}^d\left[x_i\log\mu_i+(1-x_i)\log(1-\mu_i)\right]\tag{251}$$
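As a concrete illustration of Eqs. (250) and (251), here is a minimal NumPy sketch (the function names are ours, not from the text; it assumes the entries of $\boldsymbol{\mu}$ lie strictly between 0 and 1 so the logarithms are defined):

```python
import numpy as np

def bernoulli_log_pmf(x, mu):
    """Log pmf of a vector of independent Bernoulli variables, Eq. (251)."""
    return np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))

def bernoulli_pmf(x, mu):
    """Pmf of Eq. (250), evaluated through the log pmf."""
    return np.exp(bernoulli_log_pmf(x, mu))

x = np.array([1, 0, 1])         # a binary observation, d = 3
mu = np.array([0.9, 0.2, 0.5])  # component means, strictly inside (0, 1)
print(bernoulli_pmf(x, mu))     # 0.9 * 0.8 * 0.5 = 0.36
```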
Similar to the Gaussian mixture model, the Bernoulli mixture model of $K$ multivariate Bernoulli distributions is defined as:

$$p(\mathbf{x}\mid\boldsymbol{\theta})=\sum_{k=1}^K\pi_k\,p(\mathbf{x}\mid\boldsymbol{\mu}_k)\tag{252}$$

where $\boldsymbol{\theta}=\{\pi_1,\dots,\pi_K,\;\boldsymbol{\mu}_1,\dots,\boldsymbol{\mu}_K\}$ denotes all parameters of the mixture model to be estimated based on the given dataset, and the mixing coefficients satisfy $0\le\pi_k\le 1$ and $\sum_{k=1}^K\pi_k=1$. The mean of this mixture model is

$$\mathrm{E}[\mathbf{x}]=\sum_{k=1}^K\pi_k\boldsymbol{\mu}_k\tag{253}$$
Also similar to the Gaussian mixture model, we introduce a set of latent binary random variables $\mathbf{z}=[z_1,\dots,z_K]^T$ with binary components $z_k\in\{0,1\}$ and $\sum_{k=1}^K z_k=1$, and get the prior probability of $\mathbf{z}$, the conditional probability of $\mathbf{x}$ given $\mathbf{z}$, and the joint probability of $\mathbf{x}$ and $\mathbf{z}$ as the following:

$$p(\mathbf{z})=\prod_{k=1}^K\pi_k^{z_k}\tag{254}$$

$$p(\mathbf{x}\mid\mathbf{z})=\prod_{k=1}^K p(\mathbf{x}\mid\boldsymbol{\mu}_k)^{z_k}\tag{255}$$

$$p(\mathbf{x},\mathbf{z})=p(\mathbf{z})\,p(\mathbf{x}\mid\mathbf{z})=\prod_{k=1}^K\left[\pi_k\,p(\mathbf{x}\mid\boldsymbol{\mu}_k)\right]^{z_k}\tag{256}$$
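The latent-variable factorization also describes how to sample from the model: draw $\mathbf{z}$ from the prior of Eq. (254), then draw $\mathbf{x}$ from the selected component as in Eq. (255). A small sketch, with illustrative parameter values of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (K = 2 components, d = 4); values are arbitrary.
pi = np.array([0.3, 0.7])              # mixing coefficients, sum to 1
mu = np.array([[0.9, 0.9, 0.1, 0.1],   # mean vector of component 1
               [0.1, 0.1, 0.9, 0.9]])  # mean vector of component 2

def sample(n):
    """Draw n samples (x, z), with z encoded as its 1-of-K index."""
    k = rng.choice(len(pi), size=n, p=pi)                   # z ~ Eq. (254)
    x = (rng.random((n, mu.shape[1])) < mu[k]).astype(int)  # x | z ~ Eq. (255)
    return x, k

X, labels = sample(5)
print(X)
print(labels)
```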
Given the dataset $\mathbf{X}=\{\mathbf{x}_1,\dots,\mathbf{x}_N\}$ containing $N$ i.i.d. samples, we introduce the corresponding latent variables in $\mathbf{Z}=\{\mathbf{z}_1,\dots,\mathbf{z}_N\}$, of which each $\mathbf{z}_n$ is for the labeling of $\mathbf{x}_n$. Then we can find the likelihood function of the Bernoulli mixture model parameters $\boldsymbol{\theta}$:

$$p(\mathbf{X}\mid\boldsymbol{\theta})=\prod_{n=1}^N p(\mathbf{x}_n\mid\boldsymbol{\theta})=\prod_{n=1}^N\left[\sum_{k=1}^K\pi_k\,p(\mathbf{x}_n\mid\boldsymbol{\mu}_k)\right]\tag{257}$$

and the log likelihood function:

$$\log p(\mathbf{X}\mid\boldsymbol{\theta})=\sum_{n=1}^N\log\left[\sum_{k=1}^K\pi_k\,p(\mathbf{x}_n\mid\boldsymbol{\mu}_k)\right]\tag{258}$$
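Evaluating Eq. (258) directly can underflow for high-dimensional binary vectors, since each $p(\mathbf{x}_n\mid\boldsymbol{\mu}_k)$ is a product of many factors below 1; a standard remedy is to work with log probabilities and the log-sum-exp trick. A sketch (our own, assuming NumPy and SciPy):

```python
import numpy as np
from scipy.special import logsumexp

def log_likelihood(X, pi, mu):
    """Log likelihood of Eq. (258) for an (N, d) binary matrix X,
    mixing coefficients pi of shape (K,) and means mu of shape (K, d)."""
    # (N, K) matrix of log p(x_n | mu_k), using Eq. (251)
    log_p = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T
    # log sum_k pi_k p(x_n | mu_k) for each n, then summed over n
    return np.sum(logsumexp(np.log(pi) + log_p, axis=1))
```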
Based on the same EM method used in the Gaussian mixture model, we can find the optimal parameters that maximize the expectation of the log likelihood function in the following two steps:
- E-step: Find the expectation of the likelihood function.
We first find the posterior probability for any sample $\mathbf{x}_n$ to belong to cluster $k$, denoted by $\gamma(z_{nk})$:

$$\gamma(z_{nk})=p(z_{nk}=1\mid\mathbf{x}_n)=\frac{\pi_k\,p(\mathbf{x}_n\mid\boldsymbol{\mu}_k)}{\sum_{j=1}^K\pi_j\,p(\mathbf{x}_n\mid\boldsymbol{\mu}_j)}\tag{259}$$

which is the expectation of $z_{nk}$:

$$\mathrm{E}[z_{nk}]=0\cdot p(z_{nk}=0\mid\mathbf{x}_n)+1\cdot p(z_{nk}=1\mid\mathbf{x}_n)=\gamma(z_{nk})\tag{260}$$

Now we can find the expectation of the log likelihood with respect to the latent variables in $\mathbf{Z}$ (implemented in the sketch following this list):

$$\mathrm{E}_{\mathbf{Z}}[\log p(\mathbf{X},\mathbf{Z}\mid\boldsymbol{\theta})]=\sum_{n=1}^N\sum_{k=1}^K\gamma(z_{nk})\left\{\log\pi_k+\sum_{i=1}^d\left[x_{ni}\log\mu_{ki}+(1-x_{ni})\log(1-\mu_{ki})\right]\right\}\tag{261}$$
- M-step: Find the optimal model parameters that maximize
the expectation of the log likelihood function.
We first set to zero the derivatives of the expectation of
the log likelihood with respect to each of the parameters in
, and then solve
the resulting equations to get the optimal parameters.
- Find $\pi_k$: same as in the case of the GMM model:

$$\pi_k=\frac{N_k}{N},\qquad\text{where}\quad N_k=\sum_{n=1}^N\gamma(z_{nk})\tag{262}$$
- Find $\boldsymbol{\mu}_k$: We set the derivative of the expectation in Eq. (261) with respect to $\boldsymbol{\mu}_k$ to zero:

$$\frac{\partial}{\partial\boldsymbol{\mu}_k}\,\mathrm{E}_{\mathbf{Z}}[\log p(\mathbf{X},\mathbf{Z}\mid\boldsymbol{\theta})]=\mathbf{0}\tag{263}$$

The $i$th component of the equation is

$$\sum_{n=1}^N\gamma(z_{nk})\left[\frac{x_{ni}}{\mu_{ki}}-\frac{1-x_{ni}}{1-\mu_{ki}}\right]=0\tag{264}$$

i.e.,

$$\sum_{n=1}^N\gamma(z_{nk})\left[x_{ni}(1-\mu_{ki})-(1-x_{ni})\,\mu_{ki}\right]=\sum_{n=1}^N\gamma(z_{nk})\,(x_{ni}-\mu_{ki})=0\tag{265}$$

Solving for $\mu_{ki}$ we get

$$\mu_{ki}=\frac{1}{N_k}\sum_{n=1}^N\gamma(z_{nk})\,x_{ni}\tag{266}$$

or, in vector form,

$$\boldsymbol{\mu}_k=\frac{1}{N_k}\sum_{n=1}^N\gamma(z_{nk})\,\mathbf{x}_n\tag{267}$$
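Putting the two steps together, here is a minimal sketch of one EM iteration, implementing the responsibilities of Eq. (259) and the updates of Eqs. (262) and (267) (our own illustration; the clipping constant eps is a numerical safeguard not needed in the derivation above):

```python
import numpy as np
from scipy.special import logsumexp

def e_step(X, pi, mu):
    """E-step: responsibilities gamma(z_nk) of Eq. (259), shape (N, K)."""
    log_p = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T  # Eq. (251)
    log_w = np.log(pi) + log_p                             # log [pi_k p(x_n | mu_k)]
    return np.exp(log_w - logsumexp(log_w, axis=1, keepdims=True))

def m_step(X, gamma, eps=1e-10):
    """M-step: updated mixing coefficients (Eq. 262) and means (Eq. 267)."""
    Nk = gamma.sum(axis=0)                # N_k = sum_n gamma(z_nk)
    pi = Nk / X.shape[0]                  # Eq. (262)
    mu = (gamma.T @ X) / Nk[:, None]      # Eq. (267)
    return pi, np.clip(mu, eps, 1 - eps)  # keep mu away from 0 and 1
```

Alternating e_step and m_step until the log likelihood of Eq. (258) stops improving yields the fitted mixture.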
Example:
Clustering results of hand-written digits (binary images), with $N$ samples grouped into $K$ clusters. The mean vectors $\boldsymbol{\mu}_k$ of the $K$ clusters are visualized as images of the cluster prototypes. [Figure not reproduced here.]
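For completeness, a self-contained demo that runs this EM loop on synthetic binary data (our own example; it does not reproduce the digits experiment above, and the data-generating parameters are arbitrary):

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(1)

# Synthetic binary data from two known components (ground truth for this demo).
true_mu = np.array([[0.9, 0.9, 0.1, 0.1],
                    [0.1, 0.1, 0.9, 0.9]])
z = rng.choice(2, size=500, p=[0.4, 0.6])
X = (rng.random((500, 4)) < true_mu[z]).astype(float)

# Random initialization of the mixture parameters (K = 2).
K, d = 2, X.shape[1]
pi = np.full(K, 1.0 / K)
mu = rng.uniform(0.25, 0.75, size=(K, d))

for _ in range(100):                                       # EM iterations
    log_p = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T  # Eq. (251)
    log_w = np.log(pi) + log_p
    gamma = np.exp(log_w - logsumexp(log_w, axis=1, keepdims=True))  # Eq. (259)
    Nk = gamma.sum(axis=0)
    pi = Nk / X.shape[0]                                             # Eq. (262)
    mu = np.clip((gamma.T @ X) / Nk[:, None], 1e-10, 1 - 1e-10)      # Eq. (267)

print("estimated pi:", np.round(pi, 2))
print("estimated mu:", np.round(mu, 2))  # close to true_mu up to label permutation
```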