
5.7: Detail- Efficient Source Code


    Sometimes source coding and compression for communication systems of the sort shown in Figure 5.1 are done together (it is an open question whether there are practical benefits to combining source and channel coding). For sources with a finite number of symbols, but with unequal probabilities of appearing in the input stream, there is an elegant, simple technique for source coding with minimum redundancy.

    Example of a Finite Source

    Consider a source which generates symbols which are MIT letter grades, with possible values A, B, C, D, and F. You are asked to design a system which can transmit a stream of such grades, produced at the rate of one symbol per second, over a communications channel that can only carry two boolean digits, each 0 or 1, per second.\(^3\)

    First, assume nothing about the grade distribution. To transmit each symbol separately, you must encode each as a sequence of bits (boolean digits). Using 7-bit ASCII code is wasteful; we have only five symbols, and ASCII can handle 128. Since there are only five possible values, the grades can be coded in three bits per symbol. But the channel can only process two bits per second.

    However, three bits per symbol is more than is needed. The entropy, assuming there is no information about the probabilities, is at most \(\log_2(5)\) = 2.32 bits. This is also \(\sum_{i} p(A_i) \log_2(1/p(A_i))\) where there are five such probabilities, each equal to 1/5. Why did we need three bits in the first case? Because we had no way of transmitting a partial bit. To do better, we can use “block coding.” We group the symbols in blocks of, say, three. The information in each block is three times the information per symbol, or 6.97 bits. Thus a block can be transmitted using 7 boolean digits (there are 125 distinct sequences of three grades and 128 possible patterns available in 7 bits). Of course we also need a way of signifying the end, and a way of saying that the final word transmitted has only one valid grade (or two), not three.
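
    As an aside, here is a minimal sketch of such a block code in Python (the helper names encode_block and decode_block and the base-5 packing are illustrative choices, not anything specified above). At 7 bits per block of three grades the rate is still about 2.33 bits per symbol.

        # Fixed-length block coding: pack three grades into one 7-bit pattern.
        GRADES = "ABCDF"   # 5 symbols, so a block of 3 has 5**3 = 125 possibilities

        def encode_block(block: str) -> int:
            """Map a block of three grades to an integer in 0..124 (fits in 7 bits)."""
            value = 0
            for g in block:
                value = value * 5 + GRADES.index(g)
            return value

        def decode_block(value: int) -> str:
            """Invert encode_block."""
            symbols = []
            for _ in range(3):
                symbols.append(GRADES[value % 5])
                value //= 5
            return "".join(reversed(symbols))

        assert decode_block(encode_block("BAD")) == "BAD"
        print(f"{encode_block('BAD'):07b}")    # the 7-bit pattern for the block "BAD"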

    But this is still too many bits per second for the channel to handle. So let’s look at the probability distribution of the symbols. In a typical “B-centered” MIT course with good students, the grade distribution might be as shown in Table 5.3. Assuming this as a probability distribution, what is the information per symbol and what is the average information per symbol? This calculation is shown in Table 5.4. The average information per symbol is 1.840 bits. Since this is less than two bits, perhaps the symbols can be encoded to use this channel.

    \[\nonumber \begin{array}{ccccc}
    \text{A} & \text{B} & \text{C} & \text{D} & \text{F} \\
    25\% & 50\% & 12.5\% & 10\% & 2.5\%
    \end{array} \nonumber \]
    Table 5.3: Distribution of grades for a typical MIT course

    \[\nonumber \begin{array}{cccc}
    \text { Symbol } & \text { Probability } & \text { Information } & \begin{array}{c}
    \text { Contribution } \\
    \text { to average }
    \end{array} \\
    & p & \log_2 \left(\frac{1}{p}\right) & p \log_2 \left(\frac{1}{p}\right) \\
    \mathrm{A} & 0.25 & 2 \mathrm{\,bits} & 0.5 \mathrm{\,bits} \\
    \mathrm{B} & 0.50 & 1 \mathrm{\,bit} & 0.5 \mathrm{\,bits} \\
    \mathrm{C} & 0.125 & 3 \mathrm{\,bits} & 0.375 \mathrm{\,bits} \\
    \mathrm{D} & 0.10 & 3.32 \mathrm{\,bits} & 0.332 \mathrm{\,bits} \\
    \mathrm{F} & 0.025 & 5.32 \mathrm{\,bits} & 0.133 \mathrm{\,bits} \\
    \hline
    \text { Total } & 1.00 & & 1.840 \mathrm{\,bits}
    \end{array} \nonumber \]

    Table 5.4: Information content of grades for a typical MIT distribution
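
    The arithmetic in Table 5.4 can be checked with a few lines of Python (the variable names are illustrative):

        from math import log2

        # Reproduce the per-symbol information and the average from Table 5.4,
        # using the probabilities of Table 5.3.
        p = {"A": 0.25, "B": 0.50, "C": 0.125, "D": 0.10, "F": 0.025}

        average = 0.0
        for symbol, prob in p.items():
            info = log2(1 / prob)       # information in bits for this symbol
            average += prob * info      # its contribution to the average
            print(f"{symbol}: {info:.2f} bits, contributing {prob * info:.3f} bits")

        print(f"Average information per symbol: {average:.3f} bits")   # 1.840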

    Huffman Code

    David A. Huffman (August 9, 1925 - October 6, 1999) was a graduate student at MIT. To solve a homework assignment for a course he was taking from Prof. Robert M. Fano, he devised a way of encoding symbols with different probabilities, with minimum redundancy and without special symbol frames, and hence most compactly. He described it in Proceedings of the IRE, September 1952. His algorithm is very simple. The objective is to come up with a “codebook” (a string of bits for each symbol) so that the average code length is minimized. Presumably infrequent symbols would get long codes, and common symbols short codes, just like in Morse code. The algorithm is as follows (you can follow along by referring to Table 5.5):

    1. Initialize: Let the partial code for each symbol initially be the empty bit string. Define corresponding to each symbol a “symbol-set,” with just that one symbol in it, and a probability equal to the probability of that symbol.
    2. Loop-test: If there is exactly one symbol-set (its probability must be 1) you are done. The codebook consists of the codes associated with each of the symbols in that symbol-set.
    3. Loop-action: If there are two or more symbol-sets, take the two with the lowest probabilities (in case of a tie, choose any two). Prepend the codes for those in one symbol-set with 0, and the other with 1. Define a new symbol-set which is the union of the two symbol-sets just processed, with probability equal to the sum of the two probabilities. Replace the two symbol-sets with the new one. The number of symbol-sets is thereby reduced by one. Repeat this loop, including the loop test, until only one symbol-set remains.

    Note that this procedure generally produces a variable-length code. If there are \(n\) distinct symbols, at least two of them have codes with the maximum length.

    For our example, we start out with five symbol-sets, and reduce the number of symbol-sets by one each step, until we are left with just one. The steps are shown in Table 5.5, and the final codebook in Table 5.6.

    Start: (A='-' p=0.25) (B='-' p=0.5) (C='-' p=0.125) (D='-' p=0.1) (F='-' p=0.025)
    Next: (A='-' p=0.25) (B='-' p=0.5) (C='-' p=0.125) (D='1' F='0' p=0.125)
    Next: (A='-' p=0.25) (B='-' p=0.5) (C='1' D='01' F='00' p=0.25)
    Next: (B='-' p=0.5) (A='1' C='01' D='001' F='000' p=0.5)
    Final: (B='1' A='01' C='001' D='0001' F='0000' p=1.0)
    Table 5.5: Huffman coding for MIT course grade distribution, where '-' denotes the empty bit string
    Symbol   Code
    A        01
    B        1
    C        001
    D        0001
    F        0000
    Table 5.6: Huffman Codebook for typical MIT grade distribution
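
    The procedure above can also be written as a short program. The sketch below, in Python (the function name huffman_codebook is an illustrative choice), follows the symbol-set description literally: it repeatedly merges the two least probable symbol-sets, prepending 0 to the codes in one and 1 to the codes in the other. Which set receives the 0 and how ties are broken are arbitrary choices; with the choices made here the sketch happens to reproduce the codebook of Table 5.6, and it prints the average code length for this distribution (compare Table 5.7).

        def huffman_codebook(probabilities):
            """Build a Huffman codebook by repeatedly merging the two least
            probable symbol-sets, prepending '0' to one and '1' to the other."""
            # Each symbol-set is a pair: (total probability, {symbol: partial code}).
            sets = [(prob, {sym: ""}) for sym, prob in probabilities.items()]
            while len(sets) > 1:
                sets.sort(key=lambda s: s[0])          # the two lowest probabilities come first
                (p0, codes0), (p1, codes1) = sets[0], sets[1]
                merged = {sym: "0" + code for sym, code in codes0.items()}
                merged.update({sym: "1" + code for sym, code in codes1.items()})
                sets = [(p0 + p1, merged)] + sets[2:]  # replace the two sets with their union
            return sets[0][1]

        p = {"A": 0.25, "B": 0.50, "C": 0.125, "D": 0.10, "F": 0.025}
        codebook = huffman_codebook(p)
        print(codebook)    # B -> 1, A -> 01, C -> 001, D -> 0001, F -> 0000
        average = sum(p[s] * len(code) for s, code in codebook.items())
        print(f"Average code length: {average:.3f} bits per symbol")   # 1.875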

    Is this code really compact? The most frequent symbol (B) is given the shortest code and the least frequent symbols (D and F) the longest codes, so on average the number of bits needed for an input stream which obeys the assumed probability distribution is indeed short, as shown in Table 5.7.

    Compare Table 5.7 with the earlier table of information content, Table 5.4. Note that the average coded length per symbol, 1.875 bits, is greater than the information per symbol, which is 1.840 bits. This is because the symbols D and F cannot be encoded in fractional bits.

    \[\nonumber \begin{array}{ccccc}
    \text { Symbol } & \text { Code } & \text { Probability } & \begin{array}{c}
    \text { Code } \\
    \text { length }
    \end{array} & \begin{array}{c}
    \text { Contribution } \\
    \text { to average }
    \end{array} \\
    \text { A } & 01 & 0.25 & 2 & 0.5 \\
    \text { B } & 1 & 0.50 & 1 & 0.5 \\
    \text { C } & 001 & 0.125 & 3 & 0.375 \\
    \text { D } & 0001 & 0.10 & 4 & 0.4 \\
    \text { F } & 0000 & 0.025 & 4 & 0.1 \\
    \hline
    \text { Total } & & 1.00 & & 1.875 \text { bits }
    \end{array} \nonumber \]

    Table 5.7: Huffman coding of typical MIT grade distribution

    If a block of several symbols were considered together, the average length of the Huffman code could be closer to the actual information per symbol, but not below it.
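
    To see the effect of block coding here, the sketch below treats each pair of grades as a single super-symbol (assuming successive grades are independent) and uses a compact heap-based variant of the same merging procedure that tracks only code lengths. It prints the average Huffman code length per grade for pair coding, which can be no less than the 1.840 bits of entropy and no more than the 1.875-bit single-symbol average.

        import heapq
        from itertools import product
        from math import log2

        p = {"A": 0.25, "B": 0.50, "C": 0.125, "D": 0.10, "F": 0.025}

        # Treat each pair of grades as one super-symbol; this assumes successive
        # grades are independent, as the single-symbol analysis implicitly does.
        pairs = {x + y: p[x] * p[y] for x, y in product(p, repeat=2)}

        # Huffman code lengths via a priority queue: repeatedly merge the two least
        # probable groups; each merge adds one bit to every symbol in those groups.
        # The integer counter breaks ties so the heap never compares the lists.
        heap = [(prob, i, [sym]) for i, (sym, prob) in enumerate(pairs.items())]
        heapq.heapify(heap)
        length = {sym: 0 for sym in pairs}
        counter = len(heap)
        while len(heap) > 1:
            p0, _, group0 = heapq.heappop(heap)
            p1, _, group1 = heapq.heappop(heap)
            for sym in group0 + group1:
                length[sym] += 1
            heapq.heappush(heap, (p0 + p1, counter, group0 + group1))
            counter += 1

        avg_per_grade = sum(pairs[s] * length[s] for s in pairs) / 2
        entropy = sum(q * log2(1 / q) for q in p.values())
        print(f"Pair coding: {avg_per_grade:.3f} bits per grade (entropy {entropy:.3f})")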

    The channel can handle two bits per second. By using this code, you can transmit over the channel slightly more than one symbol per second on average. You can achieve your design objective. There are at least six practical things to consider about Huffman codes:

    • A burst of D or F grades might occur. It is necessary for the encoder to store these bits until the channel can catch up. How big a buffer is needed for storage? What will happen if the buffer overflows?
    • The output may be delayed because of a buffer backup. The time between an input and the associated output is called the “latency.” For interactive systems you want to keep latency low. The number of bits processed per second, the “throughput,” matters more in applications that are not interactive.
    • The output will not occur at regularly spaced intervals, because of delays caused by bursts. In some real-time applications like audio, this may be important.
    • What if we are wrong in our presumed probability distributions? One large course might give fewer A and B grades and more C and D. Our coding would be inefficient, and there might be buffer overflow.
    • The decoder needs to know how to break the stream of bits into individual codes. The rule in this case is to break after a ’1’ or after ’0000’, whichever comes first (see the decoding sketch after this list). However, there are many possible Huffman codes, corresponding to different choices for 0 and 1 prepending in step 3 of the algorithm above. Most do not have such simple rules. It can be hard (although it is always possible) to know where the inter-symbol breaks should be put.
    • The codebook itself must be given, in advance, to both the encoder and decoder. This might be done by transmitting the codebook over the channel once.
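
    Here is the decoding sketch referred to in the list above: a minimal prefix-code decoder for the codebook of Table 5.6 (the names CODEBOOK and decode are illustrative). Because no code in this codebook is a prefix of another, emitting a symbol as soon as the accumulated bits match a complete code is equivalent to the “break after ’1’ or after ’0000’” rule.

        # Minimal prefix-code decoder for the codebook of Table 5.6.
        CODEBOOK = {"A": "01", "B": "1", "C": "001", "D": "0001", "F": "0000"}
        DECODE = {code: sym for sym, code in CODEBOOK.items()}

        def decode(bits: str) -> str:
            grades, buffer = [], ""
            for b in bits:
                buffer += b
                if buffer in DECODE:       # a complete code has been accumulated
                    grades.append(DECODE[buffer])
                    buffer = ""
            if buffer:
                raise ValueError("stream ended in the middle of a code")
            return "".join(grades)

        print(decode("101000110001"))      # 1 | 01 | 0001 | 1 | 0001 -> "BADBD"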

    Another Example

    Freshmen at MIT are on a “pass/no-record” system during their first semester on campus. Grades of A, B, and C are reported on transcripts as P (pass), and D and F are not reported (for our purposes we will designate this as no-record, N). Let’s design a system to send these P and N symbols to a printer at the fastest average rate. Without considering probabilities, 1 bit per symbol is needed. But the probabilities (assuming the typical MIT grade distribution in Table 5.3) are \(p(P) = p(A) + p(B) + p(C) = 0.875\), and \(p(N) = p(D) + p(F) = 0.125\). The information per symbol is therefore not 1 bit but only 0.875 × \(\log_2(1/0.875)\) + 0.125 × \(\log_2(1/0.125)\) = 0.544 bits. Huffman coding on single symbols does not help. We need to code groups of symbols together. For example, eleven grades taken as a block would have 5.98 bits of information and could in principle be encoded to require only 6 bits to be sent to the printer.
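
    The numbers in this paragraph can be reproduced directly (a short sketch; the final figure is the in-principle bound quoted above):

        from math import ceil, log2

        # Reproduce the pass/no-record numbers from the paragraph above.
        p_pass, p_norecord = 0.875, 0.125
        h = p_pass * log2(1 / p_pass) + p_norecord * log2(1 / p_norecord)
        print(f"Information per P/N symbol: {h:.3f} bits")       # about 0.544 bits
        print(f"Information in 11 grades:   {11 * h:.2f} bits")   # about 5.98 bits
        print(f"Whole bits for one block:    {ceil(11 * h)}")      # the 6 bits quoted above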


    \(^3\)Boolean digits, or binary digits, are usually called “bits.” The word “bit” also refers to a unit of information. When a boolean digit carries exactly one bit of information there may be no confusion. But inefficient codes or redundant codes may have boolean digit sequences that are longer than the minimum and therefore carry less than one bit of information per boolean digit. This same confusion attends other units of measure, for example meter, second, etc.


    This page titled 5.7: Detail- Efficient Source Code is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Paul Penfield, Jr. (MIT OpenCourseWare) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.