Mixture of Bernoulli

If the data are binary, i.e., each data point $x$ is treated as a discrete random variable that takes one of the two values $1$ and $0$ with probabilities $\mu$ and $1-\mu$ respectively, then the assumption that the dataset is Gaussian distributed is no longer valid and the Gaussian mixture model is not suitable. In this case, the probability mass function (pmf) of the Bernoulli distribution can be used instead:

$\displaystyle {\cal B}(x\vert\mu)=\mu^x(1-\mu)^{1-x}=\left\{\begin{array}{cl}
\mu & \mbox{if $x=1$} \\ 1-\mu & \mbox{if $x=0$} \end{array}\right.$ (245)

The mean and variance of $x$ are
$\displaystyle E(x)$ $\displaystyle =$ $\displaystyle 1\;P(x=1)+0\;P(x=0)=1\;\mu+0\;(1-\mu)=\mu$ (246)
$\displaystyle Var(x)$ $\displaystyle =$ $\displaystyle E[(x-E(x))^2]=E(x^2)-E(x)^2$  
  $\displaystyle =$ $\displaystyle 1^2\;P(x=1)+0^2\;P(x=0)-\mu^2=\mu-\mu^2=\mu(1-\mu)$ (247)
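As a quick numerical check of (245) through (247), here is a minimal NumPy sketch (the variable names are illustrative); the sample mean and variance of Bernoulli draws should approach $\mu$ and $\mu(1-\mu)$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.3                                   # Bernoulli parameter

# pmf B(x|mu) = mu^x (1-mu)^(1-x), equation (245)
pmf = lambda x: mu**x * (1 - mu)**(1 - x)
print(pmf(1), pmf(0))                      # 0.3 0.7

# empirical mean and variance of i.i.d. Bernoulli(mu) samples
x = rng.binomial(n=1, p=mu, size=100_000)
print(x.mean(), mu)                        # both ~ 0.3
print(x.var(), mu * (1 - mu))              # both ~ 0.21
```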

A set of $d$ independent binary variables can be represented as a random vector ${\bf x}=[x_1,\cdots,x_d]^T$ with mean vector and covariance matrix as shown below:
$\displaystyle E({\bf x})$ $\displaystyle =$ $\displaystyle {\bf m}=[\mu_1,\cdots,\mu_d]^T$ (248)
$\displaystyle Cov({\bf x})$ $\displaystyle =$ $\displaystyle {\bf\Sigma}=diag\left(\mu_1(1-\mu_1),\cdots,\mu_d(1-\mu_d)\right)
=\left[ \begin{array}{ccc}
\mu_1(1-\mu_1) & & 0 \\ & \ddots & \\
0 & & \mu_d(1-\mu_d)\end{array}\right]$ (249)

Note that the covariance matrix ${\bf\Sigma}$ is diagonal because the $d$ components are assumed independent, and it is solely determined by the means $\{\mu_1,\cdots,\mu_d\}$.
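A brief sketch of (248) and (249), again assuming NumPy, with an empirical check that the off-diagonal covariances of independent bits vanish:

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.9])    # mu_1, ..., mu_d for d = 3 independent bits
m = mu                            # mean vector, equation (248)
Sigma = np.diag(mu * (1 - mu))    # diagonal covariance, equation (249)

# sample covariance of independent Bernoulli bits approaches Sigma
rng = np.random.default_rng(0)
X = rng.binomial(1, mu, size=(100_000, mu.size))
print(np.allclose(np.cov(X, rowvar=False), Sigma, atol=0.01))  # True
```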

Now we can get the pmf of a binary random vector ${\bf x}$:

$\displaystyle {\cal B}({\bf x}\vert{\bf m})=\prod_{i=1}^d {\cal B}(x_i\vert\mu_i)
=\prod_{i=1}^d \mu_i^{x_i}(1-\mu_i)^{1-x_i}$ (250)

and the log pmf:

$\displaystyle \log {\cal B}({\bf x}\vert{\bf m})
=\log\left(\prod_{i=1}^d {\cal B}(x_i\vert\mu_i)\right)
=\sum_{i=1}^d \left[ x_i\log\mu_i+(1-x_i)\log(1-\mu_i) \right]$ (251)
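In code, the log pmf (251) is a single sum over the $d$ components; the sketch below clips $\mu_i$ away from $0$ and $1$ to guard against $\log 0$, which is an implementation choice rather than part of the formula:

```python
import numpy as np

def log_bernoulli(x, mu, eps=1e-12):
    """Log pmf of a d-dimensional Bernoulli vector, equation (251)."""
    mu = np.clip(mu, eps, 1 - eps)          # avoid log(0) at mu_i = 0 or 1
    return np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))

x  = np.array([1, 0, 1])
mu = np.array([0.2, 0.5, 0.9])
print(np.exp(log_bernoulli(x, mu)))         # 0.2 * 0.5 * 0.9 = 0.09
```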

Similar to the Gaussian mixture model, the Bernoulli mixture model of $K$ multivariate Bernoulli distributions is defined as:

$\displaystyle p({\bf x}\vert{\bf\theta})
=\sum_{k=1}^K P_k\,{\cal B}({\bf x}\vert{\bf m}_k)
=\sum_{k=1}^K P_k \prod_{i=1}^d \mu_{ki}^{x_i}(1-\mu_{ki})^{1-x_i}$ (252)

where ${\bf\theta}=\{{\bf m}_k,P_k,\;(k=1,\cdots,K)\}$ denotes all parameters of the mixture model to be estimated from the given dataset, and ${\bf m}_k=E_k({\bf x})$ is the mean with respect to the $k$th component ${\cal B}({\bf x}\vert{\bf m}_k)$. The mean of this mixture model is

$\displaystyle {\bf m}=E({\bf x})=\sum_{k=1}^KP_k E_k({\bf x})=\sum_{k=1}^K P_k{\bf m}_k$ (253)
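A sketch evaluating the mixture pmf (252) and the mixture mean (253), reusing the `log_bernoulli` helper above; SciPy's `logsumexp` is used only for numerical stability, and the parameter values are illustrative:

```python
import numpy as np
from scipy.special import logsumexp

def log_mixture(x, M, P):
    """Log of equation (252); M is K x d component means, P the K mixing weights."""
    logs = [np.log(P[k]) + log_bernoulli(x, M[k]) for k in range(len(P))]
    return logsumexp(logs)                  # log sum_k P_k B(x|m_k)

M = np.array([[0.9, 0.9, 0.1],              # m_1
              [0.1, 0.2, 0.8]])             # m_2
P = np.array([0.6, 0.4])

print(np.exp(log_mixture(np.array([1, 1, 0]), M, P)))  # p(x|theta)
print(P @ M)                                # mixture mean, equation (253)
```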

Also similar to the Gaussian mixture model, we introduce a set of $K$ latent binary random variables ${\bf z}=[z_1,\cdots,z_K]^T$ with binary components $z_k\in\{0,\;1\}$ and $\sum_{k=1}^K z_k=1$, and get the prior probability of ${\bf z}$, the conditional probability of ${\bf x}$ given ${\bf z}$, and the joint probability of ${\bf x}$ and ${\bf z}$ as follows:

$\displaystyle p({\bf z}\vert{\bf\theta})$ $\displaystyle =$ $\displaystyle \prod_{k=1}^K P_k^{z_k}$ (254)
$\displaystyle p({\bf x}\vert{\bf z},{\bf\theta})$ $\displaystyle =$ $\displaystyle \prod_{k=1}^K {\cal B}({\bf x}\vert{\bf m}_k)^{z_k}$ (255)
$\displaystyle p({\bf x},{\bf z}\vert{\bf\theta})$ $\displaystyle =$ $\displaystyle p({\bf z}\vert{\bf\theta})\;p({\bf x}\vert{\bf z},{\bf\theta})
=\prod_{k=1}^K \left(P_k\;{\cal B}({\bf x}\vert{\bf m}_k)\right)^{z_k}$ (256)
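The latent-variable view is exactly the generative process: draw a one-hot ${\bf z}$ with probabilities $P_k$, then draw each bit of ${\bf x}$ from the selected component. A minimal sampling sketch, reusing `M` and `P` from above:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(M, P, rng):
    """Draw (x, z) from the joint distribution of equation (256)."""
    k = rng.choice(len(P), p=P)             # pick a component; z is one-hot at k
    z = np.eye(len(P), dtype=int)[k]
    x = rng.binomial(1, M[k])               # x_i ~ Bernoulli(mu_ki)
    return x, z

x, z = sample(M, P, rng)
print(x, z)
```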

Given the dataset ${\bf X}=[{\bf x}_1,\cdots,{\bf x}_N]$ containing $N$ i.i.d. samples, we introduce the corresponding latent variables in ${\bf Z}=[{\bf z}_1,\cdots,{\bf z}_N]$, of which each ${\bf z}_n=[z_{n1},\cdots,z_{nK}]^T$ labels the component membership of ${\bf x}_n$. Then we can find the likelihood function of the Bernoulli mixture model parameters ${\bf\theta}=\{P_k,{\bf m}_k,\;(k=1,\cdots,K)\}$:
$\displaystyle L({\bf\theta}\vert{\bf X},{\bf Z})=p({\bf X},{\bf Z}\vert{\bf\theta})$ $\displaystyle =$ $\displaystyle p([{\bf x}_1,\cdots,{\bf x}_N],[{\bf z}_1,\cdots,{\bf z}_N]
\vert{\bf m}_k,P_k,(k=1,\cdots,K))$  
  $\displaystyle =$ $\displaystyle \prod_{n=1}^N p({\bf x}_n,{\bf z}_n\vert{\bf\theta})
=\prod_{n=1}^N \prod_{k=1}^K \left(P_k\,{\cal B}({\bf x}_n\vert{\bf m}_k)\right)^{z_{nk}}$ (257)

and the log likelihood function:
$\displaystyle \log\;L({\bf\theta}\vert{\bf X},{\bf Z})$ $\displaystyle =$ $\displaystyle \log p({\bf X},{\bf Z}\vert{\bf\theta})
=\log\prod_{n=1}^N \prod_{k=1}^K \left(P_k\,{\cal B}({\bf x}_n\vert{\bf m}_k)\right)^{z_{nk}}$  
  $\displaystyle =$ $\displaystyle \sum_{n=1}^N \sum_{k=1}^K {z_{nk}} \left[ \log P_k
+\log {\cal B}({\bf x}_n\vert{\bf m}_k)\right]$ (258)
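With ${\bf Z}$ stored as one-hot rows, (258) is a direct double sum over $n$ and $k$; a sketch building on the helpers and parameters above, with illustrative data:

```python
import numpy as np

def complete_log_likelihood(X, Z, M, P):
    """Complete-data log likelihood of equation (258)."""
    total = 0.0
    for n in range(Z.shape[0]):
        for k in range(Z.shape[1]):
            total += Z[n, k] * (np.log(P[k]) + log_bernoulli(X[n], M[k]))
    return total

X = np.array([[1, 1, 0], [0, 0, 1]])        # N = 2 samples
Z = np.array([[1, 0], [0, 1]])              # one-hot labels z_1, z_2
print(complete_log_likelihood(X, Z, M, P))
```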

Based on the same EM method used in the Gaussian mixture model, we can find the optimal parameters that maximize the expectation of the log likelihood function in the following two steps: