Engineering LibreTexts

2: Fault Tolerance - Reliable Systems from Unreliable Components


    • 2.1: Overview
    • 2.2: Faults, Failures, and Fault-Tolerant Design
      Definition of faults, failures, and errors in a system-reliability context. Discussion of fault tolerance and the fault-tolerance design process.
    • 2.3: Measures of Reliability and Failure Tolerance
      Statistical measures of system reliability: availability, time to failure, time to repair, mean time between failures, and reliability functions. The difference between these statistical quantities and the modeled specifications of real-world parts.
    • 2.4: Tolerating Active Faults
      Methods of responding to active faults. Classification of errors into the categories of detectable vs. undetectable, detected vs. undetected, maskable vs. unmaskable, masked vs. unmasked, and tolerated vs. untolerated. Fault tolerance models based on these distinctions between errors.
    • 2.5: Systematically Applying Redundancy
      Methods of systematically applying redundancy to detect and mask errors. Includes discussion of forward error correction, replication, \(N\)-modular redundancy, and repair.
    • 2.6: Applying Redundancy to Software and Data
      Methods of applying redundancy to software and data in order to preserve data integrity. Includes discussion of \(N\)-version programming, valid construction, firewalls for separating stored state data, and durable storage.
    • 2.7: Wrapping Up Reliability
      Concluding the conceptual material on reliability with a discussion of overall design principles and strategies, cautions, and suggestions for further reading.
    • 2.8: Application - A Fault Tolerance Model for CMOS RAM
      Comparison of the probability of error in a fault-tolerance model for words of CMOS random-access memory, with and without a simple error-correction code.
    • 2.9: War Stories - Fault-Tolerant Systems That Failed
      Six real-world failures of fault-tolerant systems, some communications-related and some not, and the system-design lessons that can be learned from them.
    • 2.10: Exercises


    This page titled 2: Fault Tolerance - Reliable Systems from Unreliable Components is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Jerome H. Saltzer & M. Frans Kaashoek (MIT OpenCourseWare).
