Minimization of Mutual Information

The mutual information $I(x,y)$ of two random variables $x$ and $y$ is defined as

$\displaystyle I(x,y)=H(x)+H(y)-H(x,y)=H(x)-H(x\vert y)=H(y)-H(y\vert x)$ (216)

Obviously, when $x$ and $y$ are independent, i.e., $H(y\vert x)=H(y)$ and $H(x\vert y)=H(x)$, their mutual information $I(x,y)$ is zero.

[Figure: mutual_info.gif]
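As a quick numerical check of Eq. (216) for discrete variables (a sketch; the joint distribution below is made up for illustration), the mutual information computed from the three entropies is positive for a dependent pair and vanishes when the joint distribution factorizes:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution, ignoring zero cells."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Made-up 2x2 joint distribution p(x,y) of two dependent binary variables
pxy = np.array([[0.30, 0.10],
                [0.20, 0.40]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)          # marginals

I = entropy(px) + entropy(py) - entropy(pxy)       # Eq. (216)
print("dependent pair:   I(x,y) =", round(I, 4))   # > 0

# If p(x,y) = p(x)p(y), the mutual information vanishes
I0 = entropy(px) + entropy(py) - entropy(np.outer(px, py))
print("independent pair: I(x,y) =", round(I0, 4))  # ~ 0
```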

Similarly, the mutual information $I(y_1,\cdots,y_n)$ of a set of $n$ variables $y_i$ ($i=1,\cdots,n$) is defined as

$\displaystyle I(y_1,\cdots,y_n)=\sum_{i=1}^n H(y_i)-H(y_1,\cdots,y_n)$ (217)
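For jointly Gaussian variables, Eq. (217) has a closed form, since $H(y_i)=\frac{1}{2}log(2\pi e\sigma_i^2)$ and $H(y_1,\cdots,y_n)=\frac{1}{2}log\,((2\pi e)^n\,det\,{\bf\Sigma})$. A small sketch with a made-up covariance matrix:

```python
import numpy as np

def gaussian_mutual_info(Sigma):
    """I(y_1,...,y_n) in nats for a zero-mean Gaussian with covariance Sigma,
       as the sum of marginal entropies minus the joint entropy (Eq. 217)."""
    Sigma = np.asarray(Sigma, dtype=float)
    n = Sigma.shape[0]
    h_marginal = 0.5 * np.log(2 * np.pi * np.e * np.diag(Sigma))        # H(y_i)
    h_joint = 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(Sigma))
    return h_marginal.sum() - h_joint

print(gaussian_mutual_info([[1.0, 0.8],
                            [0.8, 1.0]]))   # correlated components -> I > 0
print(gaussian_mutual_info(np.eye(2)))      # independent components -> I = 0
```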

If random vector ${\bf y}=[y_1,\cdots,y_n]^T$ is a linear transform of another random vector ${\bf x}=[x_1,\cdots,x_n]^T$:

$\displaystyle y_i=\sum_{j=1}^n w_{ij} x_j,\;\;\;\;\;$or$\displaystyle \;\;\;\;{\bf y=Wx}$ (218)

then the entropy of ${\bf y}$ is related to that of ${\bf x}$ by:
$\displaystyle H(y_1,\cdots,y_n)=H(x_1,\cdots,x_n)+E\;\{ log\;\vert J(x_1,\cdots,x_n)\vert\}
=H(x_1,\cdots,x_n)+ log\;\vert det\; {\bf W}\vert$ (219)

where $J(x_1,\cdots,x_n)$ is the Jacobian of the above transformation:

$\displaystyle J(x_1,\cdots,x_n)=\left\vert \begin{array}{ccc}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n}\\
\vdots & \ddots & \vdots\\
\frac{\partial y_n}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_n}
\end{array} \right\vert =det\;{\bf W}$ (220)
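Equation (219) can be verified in closed form when ${\bf x}$ is Gaussian, since ${\bf y=Wx}$ is then also Gaussian with covariance ${\bf W\Sigma_x W}^T$; the covariance and transform below are made up for illustration:

```python
import numpy as np

def gaussian_entropy(Sigma):
    """Differential entropy (nats) of a zero-mean Gaussian with covariance Sigma."""
    n = Sigma.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(Sigma))

Sigma_x = np.array([[2.0, 0.5],      # made-up covariance of x
                    [0.5, 1.0]])
W = np.array([[1.0, 2.0],            # made-up invertible transform
              [0.0, 3.0]])

H_x = gaussian_entropy(Sigma_x)
H_y = gaussian_entropy(W @ Sigma_x @ W.T)   # y = Wx has covariance W Sigma_x W^T

print(H_y)                                  # left-hand side of Eq. (219)
print(H_x + np.log(abs(np.linalg.det(W))))  # right-hand side: H(x) + log|det W|
```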

The mutual information above can be written as

$\displaystyle I(y_1,\cdots,y_n)=\sum_{i=1}^n H(y_i)-H(y_1,\cdots,y_n)
=\sum_{i=1}^n H(y_i)-H(x_1,\cdots,x_n)-log\;\vert det\; {\bf W}\vert$ (221)

We further assume $y_i$ to be uncorrelated and of unit variance, i.e., the covariance matrix of ${\bf y}$ is

$\displaystyle E\{{\bf yy^T}\}={\bf W}E\{{\bf xx^T}\}{\bf W^T}={\bf I}$ (222)

and its determinant is

$\displaystyle det\;{\bf I}=1=(det\;{\bf W})\;(det\;E\{{\bf xx}^T\})\;(det\;{\bf W}^T)$ (223)

This means $\vert det\;{\bf W}\vert=(det\;E\{{\bf xx}^T\})^{-1/2}$ is a constant, determined by the data alone and therefore the same for any ${\bf W}$ satisfying the whitening constraint. Moreover, the second term in the mutual information expression, $H(x_1,\cdots,x_n)$, is likewise a constant (invariant with respect to ${\bf W}$), so we have

$\displaystyle I(y_1,\cdots,y_n)=\sum_{i=1}^n H(y_i)+$Constant (224)
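The constancy of $\vert det\;{\bf W}\vert$ under the whitening constraint of Eq. (222) can be checked numerically; the sketch below compares two standard whitening matrices (PCA and ZCA whitening) of the same made-up data set:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.standard_normal((3, 10000)) * np.array([[3.0], [1.0], [0.5]])
A = np.array([[1.0, 0.4, 0.0],        # made-up mixing to correlate the components
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 1.0]])
X = A @ S

C = np.cov(X)                         # sample estimate of E{x x^T}
d, E = np.linalg.eigh(C)

W_pca = np.diag(d ** -0.5) @ E.T      # PCA whitening
W_zca = E @ np.diag(d ** -0.5) @ E.T  # ZCA whitening

for W in (W_pca, W_zca):
    print(np.round(np.cov(W @ X), 3))  # ~ identity, Eq. (222)
    print(abs(np.linalg.det(W)))       # same value for both W
print(np.linalg.det(C) ** -0.5)        # = (det E{xx^T})^{-1/2}
```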

That is, minimization of the mutual information $I(y_1,\cdots,y_n)$ is achieved by minimizing the marginal entropies

$\displaystyle H(y_i)=-\int p_i(y_i) log\;p_i(y_i) dy_i=-E\{ log\;p_i(y_i) \}$ (225)

As the Gaussian density has the maximal entropy among all densities with the same variance, minimizing the entropy $H(y_i)$ at fixed unit variance is equivalent to making $y_i$ as non-Gaussian as possible.
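This can be seen from the closed-form entropies of a few unit-variance densities; the sketch below also computes $C-H$, the quantity identified below as the negentropy:

```python
import numpy as np

# Closed-form differential entropies (nats) of three unit-variance densities
H_gauss   = 0.5 * np.log(2 * np.pi * np.e)   # N(0,1)
H_uniform = np.log(2 * np.sqrt(3.0))         # uniform on [-sqrt(3), sqrt(3)]
H_laplace = 1.0 + np.log(np.sqrt(2.0))       # Laplace with scale b = 1/sqrt(2)

C = H_gauss                                  # entropy of the unit-variance Gaussian
for name, H in [("gaussian", H_gauss), ("uniform", H_uniform), ("laplace", H_laplace)]:
    print(f"{name:8s}  H = {H:.4f}   C - H = {C - H:.4f}")
```

The Gaussian has the largest entropy of the three, so $C-H$ is zero for the Gaussian and positive for the other two.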

Moreover, since all $y_i$ have the same unit variance, their negentropy becomes

$\displaystyle J(y_i)=H(y_G)-H(y_i)=C-H(y_i)$ (226)

where $C=H(y_G)$ is the entropy of a unit-variance Gaussian, the same for all $y_i$. Substituting $H(y_i)=C-J(y_i)$ into the expression for the mutual information, and noting that the other two terms $H({\bf x})$ and $log\;\vert det\;{\bf W}\vert$ are both constant (the same for any whitening ${\bf W}$), we get

$\displaystyle I(y_1,\cdots,y_n)=Const-\sum_{i=1}^n J(y_i)$ (227)

where $Const$ is a constant (collecting the terms $C$, $H({\bf x})$ and $log\;\vert det\;{\bf W}\vert$) that is the same for any linear transform matrix ${\bf W}$ satisfying the whitening constraint. This is the fundamental relation between the mutual information and the negentropies of the variables $y_i$: if the mutual information of the set of variables decreases (the variables become less dependent), the total negentropy $\sum_{i=1}^n J(y_i)$ increases, i.e., the $y_i$ become less Gaussian. We therefore seek a linear transform matrix ${\bf W}$ that minimizes the mutual information $I(y_1,\cdots,y_n)$, or, equivalently, maximizes the total negentropy (under the assumption that the $y_i$ are uncorrelated and of unit variance).
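To make this concrete, the following sketch mixes two independent uniform (non-Gaussian) sources, whitens the mixture, and restricts ${\bf W}$ to a rotation of the whitened data; the rotation that maximizes the total negentropy recovers the independent sources. The $log\,cosh$-based negentropy approximation used here is a common proxy that is not derived in this section, and the mixing matrix is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20000

# Two independent, non-Gaussian (uniform), zero-mean, unit-variance sources
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, N))
A = np.array([[2.0, 1.0],
              [1.0, 1.0]])                    # made-up mixing matrix
X = A @ S

# Whiten the mixture so that the remaining freedom in W is a rotation
d, E = np.linalg.eigh(np.cov(X))
Z = np.diag(d ** -0.5) @ E.T @ X

G = lambda u: np.log(np.cosh(u))
g_gauss = np.mean(G(rng.standard_normal(N)))  # E{G(nu)}, nu ~ N(0,1)

def negentropy(y):
    """Approximate negentropy J(y) ~ (E{G(y)} - E{G(nu)})^2 (a proxy, not exact)."""
    return (np.mean(G(y)) - g_gauss) ** 2

def rot(a):
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

angles = np.linspace(0.0, np.pi / 2, 91)
best = max(angles, key=lambda a: sum(negentropy(y) for y in rot(a) @ Z))
Y = rot(best) @ Z

# Each row of Y should match one source up to sign: the cross-correlation of
# Y with S is close to a signed permutation matrix
print(np.round(np.corrcoef(np.vstack([Y, S]))[:2, 2:], 2))
```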