next up previous
Next: The algorithm Up: Image Compression with Huffman Previous: Image Compression with Huffman

Shannon's source coding theorem

Assume a set of $N$ symbols (26 English letters and some additional symbols such as space, period, etc.) is to be transmitted through the communication channel. These symbols can be treated as $N$ independent samples of a random variable $X$ with probability $P(X)$ and entropy $H(X)=-\sum_X P(X) log\;P(X)$. For example, $P(X=''s'')$ is much higher than $P(X=''z'')$. To minimize the code length, number of bits, for these symbols, it is natural to assign shorter code for symbols of high probabilities. For example, shorter code for ``s'' than the code for ``z''. The length of the code for a symbol $x$ with $P(X=x)$ can be its surprise $-log P(X=x)$. Let $L$ be the average number of bits to encode the $N$ symbols. Shannon proved that the minimum $L$ satisfies

\begin{displaymath}H(X) \le L < H(X)+\frac{1}{N} \end{displaymath}

This source coding theorem establishes the limits of data compression.

In reality, the true probability $P(X)$ is not available and can only be estimated by $Q(X)$ to be used for source coding purpose. In this case, the minimum $L_Q$ satisfies:

\begin{displaymath}H(X)+KL(P\vert\vert Q) \le L_Q < H(X)+KL(P\vert\vert Q+\frac{1}{N} \end{displaymath}

where $KL(P\vert\vert Q)$ is the relative entropy or the Kullback-Leibler (KL) divergence defined as

\begin{displaymath}KL(P\vert\vert Q)=\sum_x P(x) log\frac{P(x)}{Q(x)} \ge 0 \end{displaymath}

which is a measure of the difference between two probability distributions, from the true probability distribution $P(X)$ to its estimate $Q(X)$. Obviously the KL divergence is zero if and only if $P(x)=Q(x)$. However, note that the KL divergence is not a distance metric as it is not symmetric (hence the name divergence rather than distance): $ KL(P\vert\vert Q) \ne KL(Q\vert\vert P)$


next up previous
Next: The algorithm Up: Image Compression with Huffman Previous: Image Compression with Huffman
Ruye Wang 2021-03-28