6.6: Noisy Channel

    If the channel introduces noise, then the output is not a unique function of the input. We will model this case by saying that for every possible input (the inputs are mutually exclusive states indexed by \(i\)) there may be more than one possible output. Which output actually occurs is a matter of chance, and we will model the channel by the set of probabilities that each of the output events \(B_j\) occurs when each of the possible input events \(A_i\) happens. These transition probabilities \(c_{ji}\) are, of course, probabilities, but they are properties of the channel and do not depend on the probability distribution \(p(A_i)\) of the input. Like all probabilities, they have values between 0 and 1

    \(0 \leq c_{ji} \leq 1 \tag{6.15}\)

    and may be thought of as forming a matrix with as many columns as there are input events, and as many rows as there are output events. Because each input event must lead to exactly one output event,

    \(1 = \displaystyle \sum_{j} c_{ji} \tag{6.16}\)

    for each \(i\). (In other words, the sum of \(c_{ji}\) in each column \(i\) is 1.) If the channel is noiseless, for each value of \(i\) exactly one of the various \(c_{ji}\) is equal to 1 and all others are 0.
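
    To make the matrix picture concrete, here is a minimal sketch in Python. The particular channel, with two inputs (columns) and three outputs (rows), is made up purely for illustration and does not come from the text:

    ```python
    import numpy as np

    # Hypothetical channel: entry c[j, i] is the probability that output B_j
    # occurs when input A_i is sent (each entry between 0 and 1, Equation 6.15).
    c = np.array([
        [0.8, 0.1],
        [0.1, 0.7],
        [0.1, 0.2],
    ])

    # Each input leads to exactly one output, so each column sums to 1 (Equation 6.16).
    assert np.allclose(c.sum(axis=0), 1.0)

    # A noiseless channel has exactly one 1 in each column, e.g. for two inputs
    # and two outputs:
    c_noiseless = np.array([
        [1.0, 0.0],
        [0.0, 1.0],
    ])
    assert np.allclose(c_noiseless.sum(axis=0), 1.0)
    ```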

    When the channel is driven by a source with probabilities \(p(A_i)\), the conditional probabilities of the output events, conditioned on the input events, are

    \(p(B_j \; | \; A_i) = c_{ji} \tag{6.17}\)

    The unconditional probability of each output \(p(B_j)\) is

    \(p(B_j) = \displaystyle \sum_{i} c_{ji} p(A_i) \tag{6.18}\)

    The backward conditional probabilities \(p(A_i \;|\; B_j)\) can be found using Bayes’ Theorem:

    \[\begin{align*} p(A_i, \ B_j) \ &= \ p(B_j)p(A_i \; | \; B_j) \\ &= \ p(A_i)p(B_j \; | \; A_i) \\ &= \ p(A_i)c_{ji} \tag{6.19} \end{align*} \]
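
    Equations 6.17–6.19 can be checked numerically. The sketch below reuses the made-up 3×2 channel from above and assumes an illustrative input distribution \(p(A_i)\); none of the numbers come from the text:

    ```python
    import numpy as np

    # Transition matrix c[j, i] = p(B_j | A_i) and an illustrative input
    # distribution p(A_i); both are made-up numbers.
    c = np.array([
        [0.8, 0.1],
        [0.1, 0.7],
        [0.1, 0.2],
    ])
    p_A = np.array([0.6, 0.4])

    # Equation 6.18: unconditional output probabilities p(B_j) = sum_i c_ji p(A_i).
    p_B = c @ p_A

    # Equation 6.19 / Bayes' Theorem: p(A_i | B_j) = p(A_i) c_ji / p(B_j).
    # p_A_given_B[j, i] holds p(A_i | B_j).
    p_A_given_B = (c * p_A) / p_B[:, np.newaxis]

    # Each row of the backward matrix is a probability distribution over i.
    assert np.allclose(p_A_given_B.sum(axis=1), 1.0)
    print(p_B)            # [0.52 0.34 0.14]
    print(p_A_given_B)
    ```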

    The simplest noisy channel is the symmetric binary channel, for which there is a (hopefully small) probability \(\epsilon\) of an error, so the transition matrix is

    \[\begin{bmatrix} c_{00} & c_{01} \\ c_{10} & c_{11} \end{bmatrix} = \begin{bmatrix} 1-\epsilon & \epsilon \\ \epsilon & 1-\epsilon \end{bmatrix} \tag{6.20}\]

    This binary channel is called symmetric because the probability of an error for both inputs is the same. If \(\epsilon\) = 0 then this channel is noiseless (it is also noiseless if \(\epsilon\) = 1, in which case it behaves like an inverter). Figure 6.3 can be made more useful for the noisy channel if the possible transitions from input to output are shown, as in Figure 6.5.

    If the output \(B_j\) is observed to be in one of its (mutually exclusive) states, can the input \(A_i\) that caused it be determined? In the absence of noise, yes; there is no uncertainty about the input once the output is known. However, with noise there is some residual uncertainty. We will calculate this uncertainty in terms of the transition probabilities \(c_{ji}\) and define the information that we have learned about the input as a result of knowing the output as the mutual information. From that we will define the channel capacity \(C\).

    Figure 6.5: Symmetric binary channel

    Before we know the output, what is our uncertainty \(U_{\text{before}}\) about the identity of the input event? This is the entropy of the input:

    \(U_{\text{before}} = \displaystyle \sum_{i} p(A_i)\log_2\Big(\dfrac{1}{p(A_i)}\Big) \tag{6.21}\)

    After some particular output event \(B_j\) has been observed, what is the residual uncertainty \(U_{\text{after}}(B_j)\) about the input event? A similar formula applies, with \(p(A_i)\) replaced by the conditional backward probability \(p(A_i \;|\; B_j )\):

    \(U_{\text{after}}(B_j) = \displaystyle \sum_{i} p(A_i \;|\; B_j)\log_2\Big(\dfrac{1}{p(A_i \;|\; B_j)}\Big) \tag{6.22}\)

    The amount we learned in the case of this particular output event is the difference between \(U_{\text{before}}\) and \(U_{\text{after}}(B_j)\). The mutual information \(M\) is defined as the average, over all outputs, of the amount so learned,

    \(M = U_{\text{before}} - \displaystyle \sum_{j} p(B_j)U_{\text{after}}(B_j) \tag{6.23}\)
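
    The following sketch evaluates Equations 6.21–6.23 for the same made-up channel and input distribution used above (again, the numbers are illustrative only):

    ```python
    import numpy as np

    # Illustrative channel and input distribution (same made-up numbers as before).
    c = np.array([
        [0.8, 0.1],
        [0.1, 0.7],
        [0.1, 0.2],
    ])
    p_A = np.array([0.6, 0.4])

    p_B = c @ p_A                                   # Equation 6.18
    p_A_given_B = (c * p_A) / p_B[:, np.newaxis]    # Equation 6.19

    def entropy(p):
        """Entropy in bits, sum_i p_i log2(1/p_i), with 0 log2(1/0) taken as 0."""
        p = p[p > 0]
        return float(np.sum(p * np.log2(1.0 / p)))

    U_before = entropy(p_A)                                    # Equation 6.21
    U_after = np.array([entropy(row) for row in p_A_given_B])  # Equation 6.22

    # Equation 6.23: mutual information as the average reduction in uncertainty.
    M = U_before - np.sum(p_B * U_after)
    print(U_before, M)   # M is nonnegative, as proved next
    ```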

    It is not difficult to prove that \(M\) ≥ 0, i.e., that our knowledge about the input is not, on average, made more uncertain by learning the output event. To prove this, the Gibbs inequality is used, for each \(j\):

    \[\begin{align*} U_{\text{after}}(B_j) \ &= \ \displaystyle \sum_{i} p(A_i \;|\; B_j)\log_2\Big(\dfrac{1}{p(A_i \;|\; B_j)}\Big) \\ &\leq \ \displaystyle \sum_{i} p(A_i \;|\; B_j)\log_2\Big(\dfrac{1}{p(A_i)}\Big) \tag{6.24} \end{align*} \]

    This use of the Gibbs inequality is valid because, for each \(j\), \(p(A_i \;|\; B_j)\) is a probability distribution over \(i\), and \(p(A_i)\) is another probability distribution over \(i\), different from the one that weights the terms of the sum. This inequality holds for every value of \(j\) and therefore for the average over all \(j\):

    \[\begin{align*}
    \sum_{j} p(B_{j}) U_{\text{after}}(B_{j}) \ & \leq \ \sum_{j} p(B_{j}) \sum_{i} p(A_{i} \;|\; B_{j}) \log _{2}\left(\frac{1}{p\left(A_{i}\right)}\right) \\
    &=\ \sum_{j i} p(B_{j}) p(A_{i} \;|\; B_{j}) \log _{2}\Big( \frac{1}{p(A_{i})}\Big ) \\
    &=\ \sum_{i j} p(B_{j} \;|\; A_{i}) p(A_{i}) \log _{2}\left(\frac{1}{p\left(A_{i}\right)}\right) \\
    &=\ \sum_{i} p\left(A_{i}\right) \log _{2}\left(\frac{1}{p\left(A_{i}\right)}\right) \\
    &=\ U_{\text{before} }\tag{6.25}
    \end{align*} \nonumber \]

    We are now in a position to find \(M\) in terms of the input probability distribution and the properties of the channel. Substitution in Equation 6.23 and simplification leads to

    \[M=\sum_{j}\left(\sum_{i} p\left(A_{i}\right) c_{j i}\right) \log _{2}\left(\frac{1}{\sum_{i} p\left(A_{i}\right) c_{j i}}\right)-\sum_{i j} p\left(A_{i}\right) c_{j i} \log _{2}\left(\frac{1}{c_{j i}}\right) \tag{6.26 } \]
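
    As a numerical check, Equation 6.26 can be evaluated directly from the transition matrix and the input distribution; the sketch below again uses the made-up channel from the earlier examples and agrees with the value of \(M\) computed from Equation 6.23:

    ```python
    import numpy as np

    c = np.array([
        [0.8, 0.1],
        [0.1, 0.7],
        [0.1, 0.2],
    ])
    p_A = np.array([0.6, 0.4])
    p_B = c @ p_A

    def h(p):
        """Elementwise p * log2(1/p), with the 0 log2(1/0) = 0 convention."""
        safe = np.where(p > 0, p, 1.0)
        return np.where(p > 0, p * np.log2(1.0 / safe), 0.0)

    # Equation 6.26: first term is the output entropy, second term is the
    # average over inputs of the entropy of each column of c.
    M = np.sum(h(p_B)) - np.sum(p_A * h(c).sum(axis=0))
    print(M)
    ```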

    Note that Equation 6.26 was derived for the case where the input “causes” the output. At least, that was the way the description went. However, such a cause-and-effect relationship is not necessary. The term mutual information suggests (correctly) that it is just as valid to view the output as causing the input, or to ignore completely the question of what causes what. Two alternate formulas for \(M\) show that \(M\) can be interpreted in either direction:

    \[\begin{align*}
    M &=\sum_{i} p(A_{i}) \log _{2}\Big (\frac{1}{p(A_{i})}\Big )-\sum_{j} p(B_{j}) \sum_{i} p(A_{i} \;|\; B_{j}) \log _{2}\Big (\frac{1}{p(A_{i} \;|\; B_{j})}\Big ) \\
    &=\sum_{j} p(B_{j}) \log _{2}\Big (\frac{1}{p(B_{j})}\Big )-\sum_{i} p(A_{i}) \sum_{j} p(B_{j} \mid A_{i}) \log _{2}\Big (\frac{1}{p(B_{j} \mid A_{i})}\Big ) \tag{6.27 }
    \end{align*} \nonumber \]
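
    A quick numerical check, with the same made-up channel, confirms that the two lines of Equation 6.27 give the same value of \(M\):

    ```python
    import numpy as np

    c = np.array([
        [0.8, 0.1],
        [0.1, 0.7],
        [0.1, 0.2],
    ])
    p_A = np.array([0.6, 0.4])
    p_B = c @ p_A
    p_A_given_B = (c * p_A) / p_B[:, np.newaxis]

    def entropy(p):
        p = p[p > 0]
        return float(np.sum(p * np.log2(1.0 / p)))

    # First line of Equation 6.27: H(A) minus the average entropy of the
    # backward distributions p(A_i | B_j).
    M_forward = entropy(p_A) - sum(
        p_B[j] * entropy(p_A_given_B[j]) for j in range(len(p_B)))

    # Second line: H(B) minus the average entropy of the columns p(B_j | A_i).
    M_backward = entropy(p_B) - sum(
        p_A[i] * entropy(c[:, i]) for i in range(len(p_A)))

    assert np.isclose(M_forward, M_backward)
    print(M_forward)
    ```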

    Rather than give a general interpretation of these or similar formulas, let’s simply look at the symmetric binary channel with equally likely inputs. In this case both \(p(A_i)\) are equal to 0.5, and by Equation 6.18 so are both \(p(B_j)\). The first term in the expression for \(M\) in Equation 6.26 is therefore 1 bit, and the second term can be written in terms of \(\epsilon\):

    \[M=1-\varepsilon \log _{2}\left(\frac{1}{\varepsilon}\right)-(1-\varepsilon) \log _{2}\left(\frac{1}{(1-\varepsilon)}\right) \tag{6.28} \]

    which happens to be 1 bit minus the entropy of a binary source with probabilities \(\epsilon\) and \(1 - \epsilon\). This is a cup-shaped curve that goes from a value of 1 when \(\epsilon = 0\) down to 0 at \(\epsilon = 0.5\) and then back up to 1 when \(\epsilon = 1\). See Figure 6.6. The interpretation of this result is straightforward. When \(\epsilon = 0\) (or when \(\epsilon = 1\)) the input can be determined exactly whenever the output is known, so there is no loss of information. The mutual information is therefore the same as the input information, 1 bit. When \(\epsilon = 0.5\) each output is equally likely, no matter what the input is, so learning the output tells us nothing about the input. The mutual information is 0.
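
    The sketch below evaluates Equation 6.28 at a few values of \(\epsilon\) (the sample values are arbitrary), reproducing the endpoints of the cup-shaped curve of Figure 6.6:

    ```python
    import numpy as np

    def mutual_info_sbc(eps):
        """Mutual information of the symmetric binary channel with equal input
        probabilities (Equation 6.28), in bits: 1 minus the entropy of a binary
        source with probabilities eps and 1 - eps."""
        eps = np.asarray(eps, dtype=float)
        # Guard the logarithms so that 0 log2(1/0) is treated as 0.
        a = np.where(eps > 0, eps, 1.0)
        b = np.where(eps < 1, 1.0 - eps, 1.0)
        return 1.0 - (eps * np.log2(1.0 / a) + (1.0 - eps) * np.log2(1.0 / b))

    for eps in (0.0, 0.1, 0.5, 0.9, 1.0):
        print(eps, float(mutual_info_sbc(eps)))
    # eps = 0.0 -> 1.0, eps = 0.5 -> 0.0, eps = 1.0 -> 1.0
    ```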


    This page titled 6.6: Noisy Channel is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Paul Penfield, Jr. (MIT OpenCourseWare) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.