
7.3: Information, Loss, and Noise


For the general discrete memoryless process, useful measures of the amount of information presented at the input and the amount transmitted to the output can be defined. We suppose the process state is represented by random events \(A_i\) with probability distribution \(p(A_i)\). The information at the input \(I\) is the same as the entropy of this source. (We have chosen the letter \(I\) for the input information not because it stands for “input” or “information” but because it matches the index \(i\) that runs over the input probability distribution; the output information will be denoted \(J\) for a similar reason.)

    \(I = \displaystyle \sum_{i} p(A_i)\log_2\Big(\dfrac{1}{p(A_i)}\Big) \tag{7.12}\)

    This is the amount of uncertainty we have about the input if we do not know what it is, or before it has been selected by the source.
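As a minimal sketch, Equation 7.12 can be computed directly. The two-state distribution below is an illustrative assumption, not one taken from the text:

```python
import math

def entropy(p):
    # Entropy in bits of a probability distribution; terms with zero
    # probability contribute nothing (the limit of x log2(1/x) is 0).
    return sum(x * math.log2(1 / x) for x in p if x > 0)

# Hypothetical input distribution p(A_i) for a two-state source
p_A = [0.9, 0.1]
I = entropy(p_A)          # about 0.47 bits
```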

    A similar formula applies at the output. The output information \(J\) can also be expressed in terms of the input probability distribution and the channel transition matrix:

    \(\begin{align*} J \;&= \;\displaystyle \sum_{j} p(B_j)\log_2\Big(\dfrac{1}{p(B_j)}\Big) \\ &= \;\displaystyle \sum_{j} \Big(\sum_{i} c_{ji}p(A_i) \Big) \log_2\Big(\dfrac{1}{\sum_{i} c_{ji}p(A_i)}\Big) \tag{7.13} \end{align*}\)
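Equation 7.13 can be sketched the same way. The transition matrix below (a symmetric binary channel with crossover probability 0.1) and the input distribution are assumptions chosen for illustration:

```python
import math

def entropy(p):
    # Entropy in bits; zero-probability terms contribute nothing.
    return sum(x * math.log2(1 / x) for x in p if x > 0)

# Assumed channel: c[j][i] = p(B_j | A_i), a symmetric binary channel
# with crossover probability 0.1 (an illustrative choice).
c = [[0.9, 0.1],
     [0.1, 0.9]]
p_A = [0.9, 0.1]          # assumed input distribution

# Output distribution p(B_j) = sum_i c_ji p(A_i), then J per Equation 7.13
p_B = [sum(c[j][i] * p_A[i] for i in range(len(p_A))) for j in range(len(c))]
J = entropy(p_B)          # about 0.68 bits
```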

Note that this measure of information at the output \(J\) refers to the identity of the output state, not the input state. It represents our uncertainty about the output state before we discover what it is. If our objectiveive is to determine the input, \(J\) is not what we want. Instead, we should ask about the uncertainty of our knowledge of the input state. This can be expressed from the vantage point of the output by asking about the uncertainty of the input state given one particular output state, and then averaging over the output states. This uncertainty, for each \(j\), is given by a formula like those above but using the reverse conditional probabilities \(p(A_i \;|\; B_j)\):

    \(\displaystyle \sum_{i} p(A_i \;|\; B_j )\log_2\Big(\dfrac{1}{p(A_i \;|\; B_j )}\Big) \tag{7.14}\)

Then our average uncertainty about the input after learning the output is found by averaging over the output probability distribution, i.e., by multiplying by \(p(B_j)\) and summing over \(j\):

    \(\begin{align*} L \;&= \; \displaystyle \sum_{j} p(B_j) \sum_{i} p(A_i \;|\; B_j )\log_2\Big(\dfrac{1}{p(A_i \;|\; B_j )}\Big) \\ &= \; \displaystyle \sum_{ij} p(A_i, B_j)\log_2\Big(\dfrac{1}{p(A_i \;|\; B_j )}\Big ) \tag{7.15} \end{align*}\)

Note that the second formula uses the joint probability distribution \(p(A_i, B_j)\). We have denoted this average uncertainty by \(L\) and will call it the “loss.” The term is appropriate because \(L\) is the amount of information about the input that cannot be determined by examining the output state; in this sense it is “lost” in the transition from input to output. In the special case that the process allows the input state to be identified uniquely for each possible output state, the process is “lossless” and, as you would expect, \(L = 0\).
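A sketch of Equation 7.15, using \(p(A_i \;|\; B_j) = p(A_i, B_j)/p(B_j)\); the channel matrix and input distribution are again illustrative assumptions:

```python
import math

# Assumed symmetric binary channel (crossover 0.1) and input distribution.
c = [[0.9, 0.1],
     [0.1, 0.9]]          # c[j][i] = p(B_j | A_i)
p_A = [0.9, 0.1]

# Joint distribution p(A_i, B_j) = c_ji p(A_i), and output marginals p(B_j)
p_AB = [[c[j][i] * p_A[i] for i in range(2)] for j in range(2)]
p_B = [sum(p_AB[j]) for j in range(2)]

# Loss, Equation 7.15, written with p(A_i | B_j) = p(A_i, B_j) / p(B_j)
L = sum(p_AB[j][i] * math.log2(p_B[j] / p_AB[j][i])
        for j in range(2) for i in range(2) if p_AB[j][i] > 0)
# L is about 0.26 bits: the information about the input that the output fails to convey
```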

    It was proved in Chapter 6 that \(L \leq I\) or, in words, that the uncertainty after learning the output is less than (or perhaps equal to) the uncertainty before. This result was proved using the Gibbs inequality.

    The amount of information we learn about the input state upon being told the output state is our uncertainty before being told, which is \(I\), less our uncertainty after being told, which is \(L\). We have just shown that this amount cannot be negative, since \(L \leq I\). As was done in Chapter 6, we denote the amount we have learned as \(M = I − L\), and call this the “mutual information” between input and output. This is an important quantity because it is the amount of information that gets through the process.

    To recapitulate the relations among these information quantities:

    \(I = \displaystyle \sum_{i} p(A_i)\log_2\Big(\dfrac{1}{p(A_i)}\Big) \tag{7.16}\)

    \(L \; = \; \displaystyle \sum_{j} p(B_j) \sum_{i} p(A_i \;|\; B_j )\log_2\Big(\dfrac{1}{p(A_i \;|\; B_j )}\Big) \tag{7.17}\)

    \(M = I - L \tag{7.18}\)

    \(0 \leq M \leq I \tag{7.19}\)

    \(0 \leq L \leq I \tag{7.20}\)

Processes with outputs that can be produced by more than one input have loss. These processes may also be nondeterministic, in the sense that one input state can lead to more than one output state. The symmetric binary channel with loss is an example of a process that has loss and is also nondeterministic. However, some processes have loss but are deterministic. An example is the \(AND\) logic gate, which has four mutually exclusive inputs 00, 01, 10, and 11, and two outputs, 0 and 1. Three of the four inputs lead to the output 0. This gate has loss but is perfectly deterministic, because each input state leads to exactly one output state. The fact that there is loss means that the \(AND\) gate is not reversible.
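The loss of the \(AND\) gate can be worked out directly. Assuming equiprobable inputs (an assumption not stated in the text), the output 1 identifies the input exactly, while the output 0 leaves three equally likely candidates:

```python
import math

# AND gate: inputs 00, 01, 10, 11 assumed equiprobable (illustrative assumption).
# Output is 0 for the first three inputs and 1 for input 11.
I = 2.0                        # entropy of four equiprobable input states, in bits

# Given output 1, the input must be 11: no remaining uncertainty.
# Given output 0 (probability 3/4), the input is one of three equally
# likely states, leaving log2(3) bits of uncertainty.
L = 0.75 * math.log2(3)        # about 1.19 bits
M = I - L                      # about 0.81 bits get through the gate
```

Because the gate is deterministic, its noise \(N\) is zero, so \(M\) also equals the output information \(J\), the entropy of the distribution (3/4, 1/4), which is about 0.81 bits.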

    There is a quantity similar to \(L\) that characterizes a nondeterministic process, whether or not it has loss. The output of a nondeterministic process contains variations that cannot be predicted from knowing the input, that behave like noise in audio systems. We will define the noise \(N\) of a process as the uncertainty in the output, given the input state, averaged over all input states. It is very similar to the definition of loss, but with the roles of input and output reversed. Thus

    \(\begin{align*} N \;&= \; \displaystyle \sum_{i} p(A_i) \sum_{j} p(B_j \;|\; A_i )\log_2\Big(\dfrac{1}{p(B_j \;|\; A_i )}\Big) \\ &= \; \displaystyle \sum_{i} p(A_i) \sum_{j} c_{ji} \log_2\Big(\dfrac{1}{c_{ji}}\Big) \tag{7.21} \end{align*} \)
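Equation 7.21 in code, with an asymmetric, nondeterministic channel; the matrix \(c_{ji}\) and the input distribution are illustrative assumptions:

```python
import math

# Assumed channel c[j][i] = p(B_j | A_i); the columns need not be symmetric.
c = [[0.8, 0.3],
     [0.2, 0.7]]
p_A = [0.6, 0.4]               # assumed input distribution

# Noise, Equation 7.21: the uncertainty of the output given each input state,
# averaged over the input distribution.
N = sum(p_A[i] * sum(c[j][i] * math.log2(1 / c[j][i])
                     for j in range(2) if c[j][i] > 0)
        for i in range(2))
# N is about 0.79 bits
```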

Steps similar to those above for loss give analogous results. What may not be obvious, but can be proven easily, is that the mutual information \(M\) plays exactly the same role for noise as it does for loss. The formulas relating noise to the other information measures parallel those for loss above, with the same mutual information \(M\):

    \(J \;=\; \displaystyle \sum_{j} p(B_j)\log_2\Big(\dfrac{1}{p(B_j)}\Big) \tag{7.22} \)

    \(N \;=\; \displaystyle \sum_{i} p(A_i) \sum_{j} c_{ji} \log_2\Big(\dfrac{1}{c_{ji}}\Big) \tag{7.23}\)

    \(M = J - N \tag{7.24}\)

    \(0 \leq M \leq J \tag{7.25}\)

    \(0 \leq N \leq J \tag{7.26}\)

Since the mutual information satisfies \(M = I - L = J - N\), it follows from these results that

    \(J − I = N − L \tag{7.27}\)
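Equation 7.27 can be checked numerically. For an illustrative symmetric binary channel (crossover probability 0.1 and input distribution (0.9, 0.1), both assumed values), all four quantities can be computed and the identity verified:

```python
import math

def entropy(p):
    # Entropy in bits; zero-probability terms contribute nothing.
    return sum(x * math.log2(1 / x) for x in p if x > 0)

# Illustrative symmetric binary channel: c[j][i] = p(B_j | A_i).
c = [[0.9, 0.1],
     [0.1, 0.9]]
p_A = [0.9, 0.1]

p_B  = [sum(c[j][i] * p_A[i] for i in range(2)) for j in range(2)]
p_AB = [[c[j][i] * p_A[i] for i in range(2)] for j in range(2)]

I = entropy(p_A)               # input information, Eq. 7.16
J = entropy(p_B)               # output information, Eq. 7.22
L = sum(p_AB[j][i] * math.log2(p_B[j] / p_AB[j][i])
        for j in range(2) for i in range(2) if p_AB[j][i] > 0)   # loss, Eq. 7.17
N = sum(p_A[i] * entropy([c[j][i] for j in range(2)])
        for i in range(2))     # noise, Eq. 7.23

assert abs((J - I) - (N - L)) < 1e-12    # Equation 7.27
assert abs((I - L) - (J - N)) < 1e-12    # both differences equal M
```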


    This page titled 7.3: Information, Loss, and Noise is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Paul Penfield, Jr. (MIT OpenCourseWare) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.