2.10: Exercises

Last updated
Save as PDF

Page ID: 58731

Jerome H. Saltzer & M. Frans Kaashoek
Massachusetts Institute of Technology via MIT OpenCourseWare

$ \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } $ $ \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} $$\newcommand{\id}{\mathrm{id}}$ $ \newcommand{\Span}{\mathrm{span}}$ $ \newcommand{\kernel}{\mathrm{null}\,}$ $ \newcommand{\range}{\mathrm{range}\,}$ $ \newcommand{\RealPart}{\mathrm{Re}}$ $ \newcommand{\ImaginaryPart}{\mathrm{Im}}$ $ \newcommand{\Argument}{\mathrm{Arg}}$ $ \newcommand{\norm}[1]{\| #1 \|}$ $ \newcommand{\inner}[2]{\langle #1, #2 \rangle}$ $ \newcommand{\Span}{\mathrm{span}}$ $\newcommand{\id}{\mathrm{id}}$ $ \newcommand{\Span}{\mathrm{span}}$ $ \newcommand{\kernel}{\mathrm{null}\,}$ $ \newcommand{\range}{\mathrm{range}\,}$ $ \newcommand{\RealPart}{\mathrm{Re}}$ $ \newcommand{\ImaginaryPart}{\mathrm{Im}}$ $ \newcommand{\Argument}{\mathrm{Arg}}$ $ \newcommand{\norm}[1]{\| #1 \|}$ $ \newcommand{\inner}[2]{\langle #1, #2 \rangle}$ $ \newcommand{\Span}{\mathrm{span}}$$\newcommand{\AA}{\unicode[.8,0]{x212B}}$

Exercise $\PageIndex{1}$

1999–3–01

Failures are

A. Faults that are latent.

B. Errors that are contained within a module.

C. Errors that propagate out of a module.

D. Faults that turn into errors.

Exercise $\PageIndex{2}$

2006–2–3

Ben Bitdiddle has been asked to perform a deterministic computation to calculate the orbit of a near-Earth asteroid for the next 500 years, to find out whether or not the asteroid will hit the Earth. The calculation will take roughly two years to complete, and Ben wants be be sure that the result will be correct. He buys 30 identical computers and runs the same program with the same inputs on all of them. Once each hour the software pauses long enough to write all intermediate results to a hard disk on that computer. When the computers return their results at the end of the two years, a voter selects the majority answer. Which of the following failures can this scheme tolerate, assuming the voter works correctly?

A. The software carrying out the deterministic computation has a bug in it, causing the program to compute the wrong answer for certain inputs.

B. Over the course of the two years, cosmic rays corrupt data stored in memory at twelve of the computers, causing them to return incorrect results.

C. Over the course of the two years, on 24 different days the power fails in the computer room. When the power comes back on, each computer reboots and then continues its computation, starting with the state it finds on its hard disk.

Exercise $\PageIndex{3}$

1997–0–01

Ben Bitdiddle has seven smoke detectors installed in various places in his house. Since the fire department charges $100 for responding to a false alarm, Ben has connected the outputs of the smoke detectors to a simple majority voter, which in turn can activate an automatic dialer that calls the fire department. Ben returns home one day to find his house on fire, and the fire department has not been called. There is smoke at every smoke detector. What did Ben do wrong?

A. He should have used fail-fast smoke detectors.

B. He should have used a voter that ignores failed inputs from fail-fast sources.

C. He should have used a voter that ignores non-active inputs.

D. He should have done both A and B.

E. He should have done both A and C.

Exercise $\PageIndex{4}$

2005–2–5

You will be flying home from a job interview in Silicon Valley. Your travel agent gives you the following choice of flights:

A. Flight A uses a plane whose mean time to failure (MTTF) is believed to be 6,000 hours. With this plane, the flight is scheduled to take 6 hours.

B. Flight B uses a plane whose MTTF is believed to be 5,000 hours. With this plane, the flight takes 5 hours.

The agent assures you that both planes' failures occur according to memoryless random processes (not a "bathtub" curve). Assuming that model, which flight should you choose to minimize the chance of your plane failing during the flight?

Exercise $\PageIndex{5}$

1995–3–1a

(Note: solving this problem is best done with use of probability through the level of Markov chains.) You are designing a computer system to control the power grid for the Northeastern United States. If your system goes down, the lights go out and civil disorder—riots, looting, fires, etc.—will ensue. Thus, you have set a goal of having a system MTTF of at least 100 years (about 10⁶ hours). For hardware you are constrained to use a building block computer that has a MTTF of 1000 hours and a MTTR of 1 hour. Assuming that the building blocks are fail-fast, memoryless, and fail independently of one another, how can you arrange to meet your goal?

Exercise $\PageIndex{6}$

1985–0–1

The town council wants to implement a municipal network to connect the local area networks in the library, the town hall, and the school. They want to minimize the chance that any building is completely disconnected from the others. They are considering two network topologies:

Figure $\PageIndex{1}$: "Daisy chain" vs fully connected network topologies, each connecting three local area networks.

Each link in the network has a failure probability of $p$.

a) What is the probability that the daisy chain network is connecting all the buildings?

b) What is the probability that the fully connected network is connecting all the buildings?

c) The town council has a limited budget, with which it can buy either a daisy chain network with two high reliability links $(p = .000001)$, or a fully connected network with three low-reliability links $(p = .0001)$. Which should they purchase?

Exercise $\PageIndex{7}$

2001–3–05

Figure $2.5.8$ shows the failure points of three different 5MR supermodule designs, if repair does not happen in time. Draw the corresponding figure for the same three different TMR supermodule designs.

Exercise $\PageIndex{8}$

1976–0–3

An astronomer calculating the trajectory of Pluto has a program that requires the execution of 10¹³ machine operations. The fastest processor available in the lab runs only 10⁹ operations per second and, unfortunately, has a probability of 10^-12 of failing on any one operation. (The failure process is memoryless.) The good news is that the processor is fail-fast, so when a failure occurs it stops dead in its tracks and starts ringing a bell. The bad news is that when it fails, it loses all state, so whatever it was doing is lost, and has to be started over from the beginning.

Seeing that in practical terms, the program needs to run for about 3 hours, and the machine has an MTTF of only 1/10 of that time, Louis Reasoner and Ben Bitdiddle have proposed two ways to organize the computation:

Louis says run it from the beginning and hope for the best. If the machine fails, just try again; keep trying until the calculation successfully completes.
Ben suggests dividing the calculation into ten equal-length segments; if the calculation gets to the end of a segment, it writes its state out to the disk. When a failure occurs, restart from the last state saved on the disk.

Saving state and restart both take zero time. What is the ratio of the expected time to complete the calculation under the two strategies?

Warning: A straightforward solution to this problem involves advanced probability techniques.

Exercise $\PageIndex{9}$

2005–0–1

Draw a figure, similar to that of Figure $2.5.3$, that shows the recovery procedure for one sector of a 5-disk RAID 4 system when disk 2 fails and is replaced.

Exercise $\PageIndex{10}$

1997–3–01, 2000–3–01, 1998–3–01, 2000–3–01

Louis Reasoner has just read an advertisement for a RAID controller that provides a choice of two configurations. According to the advertisement, the first configuration is exactly the RAID 4 system described in Section 2.5.1. The advertisement goes on to say that the configuration called RAID 5 has just one difference: in an N-disk configuration, the parity block, rather than being written on disk $N$, is written on the disk number $(1 +$sector_address $\text{modulo} \ N)$. Thus, for example, in a five-disk system, the parity block for sector 18 would be on disk 4 (because $1 + (18 \ \text{modulo} \ 5) = 4$), while the parity block for sector 19 would be on disk 5 (because $1 + (19 \ \text{modulo} \ 5) = 5$). Louis is hoping you can help him understand why this idea might be a good one.

a) RAID 5 has the advantage over RAID 4 that

A. It tolerates single-drive failures.

B. Read performance in the absence of errors is enhanced.

C. Write performance in the absence of errors is enhanced.

D. Locating data on the drives is easier.

E. Allocating space on the drives is easier.

F. It requires less disk space.

G. There's no real advantage; it's just another advertising gimmick.

b) Is there any workload for which RAID 4 has better write performance than RAID 5?

c) Louis is also wondering about whether he might be better off using a RAID 1 system (see Section 2.6.4.6). How does the number of disks required compare between RAID 1 and RAID 5?

d) Out of RAID 1 and RAID 5, which has better performance for a workload consisting of small reads and small writes?

Exercise $\PageIndex{11}$

2002–3–01

A system administrator notices that a file service disk is failing for two unrelated reasons. Once every 30 days, on average, vibration due to nearby construction breaks the disk’s arm. Once every 60 days, on average, a power surge destroys the disk’s electronics. The system administrator fixes the disk instantly each time it fails. The two failure modes are independent of each other, and independent of the age of the disk. What is the mean time to failure of the disk?