2.8: Application - A Fault Tolerance Model for CMOS RAM

    This section develops a fault tolerance model for words of CMOS random access memory, first without and then with a simple error-correction code, comparing the probability of error in the two cases.

    CMOS RAM is both low in cost and extraordinarily reliable, so much so that error masking is often not implemented in mass production systems such as television sets and personal computers. But some systems, for example life-support, air traffic control, or banking systems, cannot afford to take unnecessary risks. Such systems usually employ the same low-cost memory technology but add incremental redundancy.

    A common failure of CMOS RAM is that noise intermittently causes a single bit to read or write incorrectly. If intermittent noise affected only reads, then it might be sufficient to detect the error and retry the read. But the possibility of errors on writes suggests using a forward error-correction code.

    We start with a fault tolerance model that applies when reading a word from memory without error correction. The model assumes that errors in different bits are independent and it assigns \(p\) as the (presumably small) probability that any individual bit is in error. The notation \(\text{O}(p^n)\) means terms involving \(p^n\) and higher, presumably negligible, powers. Here are the possibilities and their associated probabilities:

    Table \(\PageIndex{1}\): Fault tolerance model for raw CMOS random access memory
    Case                                                Probability
    error-free case: all 32 bits are correct            \( (1-p)^{32} = 1 - \text{O}(p) \)
    errors:
      untolerated: one bit is in error                  \( 32p (1-p)^{31} = \text{O}(p) \)
      untolerated: two bits are in error                \( (31 \cdot 32/2) p^2 (1-p)^{30} = \text{O}(p^2) \)
      untolerated: three or more bits are in error      \( \left( 30 \cdot 31 \cdot 32 / (3 \cdot 2) \right) p^3 (1-p)^{29} + \dots + p^{32} = \text{O}(p^3) \)

    The coefficients \(32\), \((31 \cdot 32)/2\), etc. arise by counting the number of ways that one, two, etc., bits could be in error.
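
    For concreteness, the following short Python sketch evaluates the rows of Table \(\PageIndex{1}\) numerically. It simply carries out the binomial-probability arithmetic above; the per-bit error probability \(p = 10^{-6}\) is an assumed, illustrative value, not one taken from the text or from any measurement.

        # Numeric check of the table above, assuming an illustrative
        # per-bit error probability p = 1e-6 (this value is an assumption).
        from math import comb

        p = 1e-6          # assumed probability that any individual bit is in error
        n = 32            # bits in the memory word

        def prob_k_errors(k):
            """Probability that exactly k of the n bits are in error,
            assuming errors in different bits are independent."""
            return comb(n, k) * p**k * (1 - p)**(n - k)

        print(f"error-free (k = 0)     : {prob_k_errors(0):.6e}")   # about 1 - 32p
        print(f"one bit in error       : {prob_k_errors(1):.6e}")   # about 32p       = O(p)
        print(f"two bits in error      : {prob_k_errors(2):.6e}")   # about 496 p^2   = O(p^2)
        three_or_more = sum(prob_k_errors(k) for k in range(3, n + 1))
        print(f"three or more in error : {three_or_more:.6e}")      # O(p^3)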

    Suppose now that the 32-bit block of memory is encoded using a code with a Hamming distance of 4 (see Section 2.5.1), for example a single-error-correcting Hamming code extended with an overall parity bit. Such a code allows any single-bit error to be corrected while any double-bit error is still detected; a distance of 3 would allow single-error correction or double-error detection, but not both at once. After applying the decoding algorithm, the fault tolerance model changes to:

    Table \(\PageIndex{2}\): Fault tolerance model for CMOS memory with error correction
    Case                                                Probability
    error-free case: all 32 bits are correct            \( (1-p)^{32} = 1 - \text{O}(p) \)
    errors:
      tolerated: one bit is in error                    \( 32p (1-p)^{31} = \text{O}(p) \)
      detected: two bits are in error                   \( (31 \cdot 32/2) p^2 (1-p)^{30} = \text{O}(p^2) \)
      untolerated: three or more bits are in error      \( \left( 30 \cdot 31 \cdot 32 / (3 \cdot 2) \right) p^3 (1-p)^{29} + \dots + p^{32} = \text{O}(p^3) \)

    The interesting change is in the probability that the decoded value is correct. That probability is the sum of the probabilities that there were no errors and that there was one, tolerated error:

    \begin{align*} \text{Prob}( \text{decoded value is correct} ) &= (1-p)^{32} + 32p (1-p)^{31} \\ &= \left( 1 - 32p + (31 \cdot 32/2)p^2 - \dots \right) + \left( 32p - (31 \cdot 32)p^2 + \dots \right) \\ &= 1 - \text{O}(p^2) \end{align*}

    The decoding algorithm has thus eliminated the errors that have probability of order \(p\). It has not eliminated the two-bit errors, which have probability of order \(p^2\), but for two-bit errors the algorithm is fail-fast, so a higher-level procedure has an opportunity to recover, perhaps by requesting retransmission of the data. The code does not help if three or more bits are in error, a situation whose probability is of order \(p^3\), but presumably the designer has determined that probabilities of that order are negligible. If they are not, the designer should adopt a more powerful error-correction code.
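
    The improvement is easy to see numerically. The sketch below compares the probability that the word is not recovered correctly with and without the code, again for an assumed, illustrative per-bit error probability \(p = 10^{-6}\).

        # Residual error probability with and without the error-correction code,
        # for an assumed p = 1e-6 (illustrative only).
        p = 1e-6

        p_raw     = (1 - p)**32                        # raw word: all 32 bits must be correct
        p_decoded = (1 - p)**32 + 32*p*(1 - p)**31     # with the code: zero or one bit in error

        print(f"residual error, raw word : {1 - p_raw:.3e}")       # about 32p           ~ 3.2e-05
        print(f"residual error, decoded  : {1 - p_decoded:.3e}")   # about (31*32/2) p^2 ~ 5.0e-10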

    With this model in mind, one can review the two design questions suggested in Section 2.4.2. The first question is whether the estimate of the bit error probability is realistic, and whether it is realistic to suppose that errors in different bits are statistically independent of one another. (Error independence appeared in the analysis in the claim that the probability of an \(n\)-bit error is of the order of the \(n\)th power of the probability of a one-bit error.) Those questions concern the real world and the accuracy of the designer's model of it. For example, this failure model does not consider power failures, which might take out all of the bits at once, or an error in the driver logic that might take out all of the even-numbered bits. It also ignores the possibility of faults that lead to errors in the logic of the error-correction circuitry itself.

    The second question is whether the coding algorithm actually corrects all one-bit errors and detects all two-bit errors. That question is explored by examining the mathematical structure of the error-correction code and is quite independent of anybody’s estimate or measurement of real-world failure types and rates. There are many off-the-shelf coding algorithms that have been thoroughly analyzed and for which the answer to this question is yes.
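
    For a small code, that mathematical question can even be settled by exhaustive check. The sketch below is not the code of Section 2.5.1; it is a generic single-error-correcting, double-error-detecting (SEC-DED) Hamming code on 8 data bits, with an illustrative bit layout and function names, and it verifies by brute force that every one-bit error is corrected and every two-bit error is detected.

        # A SEC-DED Hamming code on 8 data bits: 4 check bits at the
        # power-of-two positions 1, 2, 4, 8 of a 12-bit Hamming codeword,
        # plus one overall parity bit stored at index 0 (13 bits total).
        from itertools import combinations

        CHECK_POS = [1, 2, 4, 8]
        DATA_POS = [i for i in range(1, 13) if i not in CHECK_POS]   # 8 data positions

        def encode(data):
            """Map 8 data bits (0..255) to a list of 13 code bits."""
            bits = [0] * 13
            for k, pos in enumerate(DATA_POS):        # scatter the data bits
                bits[pos] = (data >> k) & 1
            for c in CHECK_POS:                       # check bit c covers positions i with i & c != 0
                bits[c] = 0
                for i in range(1, 13):
                    if i != c and (i & c):
                        bits[c] ^= bits[i]
            bits[0] = sum(bits[1:]) % 2               # overall parity makes total parity even
            return bits

        def decode(bits):
            """Return ('ok' | 'corrected' | 'detected', data or None)."""
            syndrome = 0
            for i in range(1, 13):
                if bits[i]:
                    syndrome ^= i                     # XOR of set positions; 0 for a valid codeword
            overall = sum(bits) % 2                   # 0 for a valid codeword
            if syndrome == 0 and overall == 0:
                status = "ok"
            elif overall == 1:                        # odd overall parity: a single-bit error
                if syndrome:
                    bits = bits[:]
                    bits[syndrome] ^= 1               # flip the indicated bit back
                status = "corrected"                  # syndrome == 0 means the parity bit itself flipped
            else:                                     # even parity, nonzero syndrome: a two-bit error
                return "detected", None
            data = 0
            for k, pos in enumerate(DATA_POS):
                data |= bits[pos] << k
            return status, data

        # Exhaustive check over all 256 data words: every one-bit error is
        # corrected and every two-bit error is detected.
        for data in range(256):
            codeword = encode(data)
            for i in range(13):
                damaged = codeword[:]
                damaged[i] ^= 1
                assert decode(damaged) == ("corrected", data)
            for i, j in combinations(range(13), 2):
                damaged = codeword[:]
                damaged[i] ^= 1
                damaged[j] ^= 1
                assert decode(damaged) == ("detected", None)
        print("all one-bit errors corrected; all two-bit errors detected")

    A memory controller would apply the same idea to wider words, for example a (39,32) SEC-DED code for 32 data bits, implemented in combinational logic rather than software.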


    This page titled 2.8: Application - A Fault Tolerance Model for CMOS RAM is shared under a CC BY-NC-SA license and was authored, remixed, and/or curated by Jerome H. Saltzer & M. Frans Kaashoek (MIT OpenCourseWare).
