
3.8: A More Complete Model of Disk Failure (Advanced Topic)


    Section 3.2 of this chapter developed a failure analysis model for a calendar management program in which a system crash may corrupt at most one disk sector—the one, if any, that was being written at the instant of the crash. That section also developed a masking strategy for that problem, creating all-or-nothing disk storage. To keep that development simple, the strategy ignored decay events. This section revisits that model, considering how to also mask decay events. The result will be all-or-nothing durable storage, meaning that it is both all-or-nothing in the event of a system crash and durable in the face of decay events.

    Storage that is Both All-or-Nothing and Durable

    In Chapter 2 we learned that to obtain durable storage we should write two or more replicas of each disk sector. In the current chapter we learned that to recover from a system crash while writing a disk sector we should never overwrite the previous version of that sector; instead, we should write the new version in a different place. To obtain storage that is both durable and all-or-nothing we combine these two observations: make more than one replica, and don't overwrite the previous version. One easy way to do that would be to simply build the all-or-nothing storage layer of the current chapter on top of the durable storage layer of Chapter 2. That method would certainly work, but it is a bit heavy-handed: with a replication count of just two, it would lead to allocating six disk sectors for each sector of real data. This is a case in which modularity has an excessive cost.

    Recall that the parameter that Chapter 2 used to determine the frequency of checking the integrity of disk storage was the expected time to decay, \(T_d\). Suppose for the moment that the durability requirement can be achieved by maintaining only two copies. In that case, \(T_d\) must be much greater than the time required to write two copies of a sector on two disks. Put another way, a large \(T_d\) means that the short-term chance of a decay event is small enough that the designer may be able to safely neglect it. We can take advantage of this observation to devise a slightly risky but far more economical method of implementing storage that is both durable and all-or-nothing with just two replicas. The basic idea is that if we are confident that we have two good replicas of some piece of data for durability, it is safe (for all-or-nothing atomicity) to overwrite one of the two replicas; the second replica can be used as a backup to ensure all-or-nothing atomicity if the system should happen to crash while writing the first one. Once we are confident that the first replica has been correctly written with new data, we can safely overwrite the second one, to regain long-term durability. If the time to complete the two writes is short compared with \(T_d\), the probability that a decay event interferes with this algorithm will be negligible. Figure \(\PageIndex{1}\) shows the algorithm and the two replicas of the data, here named D0 and D1.

    procedure ALL_OR_NOTHING_DURABLE_GET (reference data, atomic_sector)
        ds ← CAREFUL_GET (data, atomic_sector.D0)
        if ds = BAD then
            ds ← CAREFUL_GET (data, atomic_sector.D1)
        return ds

    procedure ALL_OR_NOTHING_DURABLE_PUT (new_data, atomic_sector)
        SALVAGE (atomic_sector)
        ds ← CAREFUL_PUT (new_data, atomic_sector.D0)
        ds ← CAREFUL_PUT (new_data, atomic_sector.D1)
        return ds

    procedure SALVAGE (atomic_sector)        // run every T_d seconds
        ds0 ← CAREFUL_GET (data0, atomic_sector.D0)
        ds1 ← CAREFUL_GET (data1, atomic_sector.D1)
        if ds0 = BAD then
            CAREFUL_PUT (data1, atomic_sector.D0)
        else if ds1 = BAD then
            CAREFUL_PUT (data0, atomic_sector.D1)
        if data0 ≠ data1 then
            CAREFUL_PUT (data0, atomic_sector.D1)

    Figure \(\PageIndex{1}\): Data arrangement and algorithms to implement all-or-nothing durable storage on top of the careful storage layer of Figure \(2.6.1\).
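    As a concrete illustration, here is a toy in-memory sketch of these procedures in Python. The procedure names, the BAD status, and the replica names D0 and D1 come from the text; the object model, the (status, data) return convention, and the simulated careful-storage layer are assumptions made for this sketch, not the book's actual interfaces.

```python
BAD = object()   # marker used both as a status and as a damaged-replica value

class AtomicSector:
    """Two replicas of one data sector, ideally on independent disks."""
    def __init__(self):
        self.D0 = BAD    # a freshly allocated sector holds no valid data
        self.D1 = BAD

def careful_get(sector, which):
    """Simulated CAREFUL_GET: returns (status, data)."""
    data = getattr(sector, which)
    return (BAD, None) if data is BAD else ("OK", data)

def careful_put(new_data, sector, which):
    """Simulated CAREFUL_PUT: always succeeds in this in-memory model."""
    setattr(sector, which, new_data)
    return "OK"

def all_or_nothing_durable_get(sector):
    status, data = careful_get(sector, "D0")
    if status is BAD:    # fall back to the second replica; its status
        status, data = careful_get(sector, "D1")  # is passed along unchecked
    return status, data

def salvage(sector):
    """Make the replicas identical again; run every T_d seconds."""
    s0, d0 = careful_get(sector, "D0")
    s1, d1 = careful_get(sector, "D1")
    if s0 is BAD:
        careful_put(d1, sector, "D0")       # restore D0 from D1
    elif s1 is BAD or d0 != d1:
        careful_put(d0, sector, "D1")       # restore D1 from D0

def all_or_nothing_durable_put(new_data, sector):
    salvage(sector)                         # repair any earlier damage first
    careful_put(new_data, sector, "D0")     # overwrite D0 while D1 is the backup,
    return careful_put(new_data, sector, "D1")  # then rewrite D1 for durability
```

    For example, after a successful ALL_OR_NOTHING_DURABLE_PUT, marking D0 as BAD (simulating a decayed or crash-corrupted replica) still lets ALL_OR_NOTHING_DURABLE_GET return the data from D1, and a subsequent SALVAGE repairs D0.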

    An interesting point is that ALL_OR_NOTHING_DURABLE_GET does not bother to check the status returned upon reading D1; it just passes that status value along to its caller. The reason is that in the absence of decay, CAREFUL_GET has no expected errors when reading data that CAREFUL_PUT was allowed to finish writing. Thus the returned status would be BAD in only two cases:

    1. CAREFUL_PUT of D1 was interrupted in mid-operation, or
    2. D1 was subject to an unexpected decay.

    The algorithm guarantees that the first case cannot happen: ALL_OR_NOTHING_DURABLE_PUT does not begin CAREFUL_PUT on D1 until its CAREFUL_PUT on D0 has completed. At most one of the two copies can be BAD because of a system crash during CAREFUL_PUT, so if the first copy (D0) is BAD, we expect that the second one (D1) is OK.
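    This ordering argument can be exercised with a small simulation. Everything here (the two-element replica list, the BAD marker, the crash_after parameter) is a hypothetical model built only for illustration:

```python
BAD = "BAD"

def durable_put_with_crash(new_data, replicas, crash_after=None):
    # Write D0 to completion, then D1; a crash corrupts at most the one
    # sector being written at that instant, never both.
    for i in (0, 1):
        if crash_after == i:
            replicas[i] = BAD       # interrupted write corrupts this replica
            return
        replicas[i] = new_data

def durable_get(replicas):
    # Same fallback rule as ALL_OR_NOTHING_DURABLE_GET.
    return replicas[1] if replicas[0] == BAD else replicas[0]

r = ["old", "old"]
durable_put_with_crash("new", r, crash_after=0)   # crash while writing D0
assert durable_get(r) == "old"    # all-or-nothing: old version survives in D1

r = ["old", "old"]
durable_put_with_crash("new", r, crash_after=1)   # crash while writing D1
assert durable_get(r) == "new"    # D0 already holds the complete new version
```

    Whichever write the crash interrupts, the reader sees either the complete old version or the complete new one, never a mixture.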

    The risk of the second case is real, but we have assumed its probability to be small: it arises only if there is a random decay of D1 in a time much shorter than \(T_d\). In reading D1 we have an opportunity to detect that error through the status value, but we have no way to recover when both data copies are damaged, so this detectable error must be classified as untolerated. All we can do is pass a status report along to the application so that it knows that there was an untolerated error.

    There is one step in SALVAGE that is unnecessary at this point: if D0 is BAD, nothing is gained by copying D1 onto D0, since ALL_OR_NOTHING_DURABLE_PUT, which called SALVAGE, will immediately overwrite D0 with new data. The step is included anyway because it allows SALVAGE to be reused in a refinement of the algorithm.

    In the absence of decay events, this algorithm would be just as good as the all-or-nothing procedures of Figures \(3.3.1\) and \(3.3.2\), and it would perform somewhat better, since it involves only two copies. Assuming that errors are rare enough that recovery operations do not dominate performance, the usual cost of ALL_OR_NOTHING_DURABLE_GET is just one disk read, compared with three in the ALL_OR_NOTHING_GET algorithm. The cost of ALL_OR_NOTHING_DURABLE_PUT is two disk reads (in SALVAGE) and two disk writes, compared with three disk reads and three disk writes for the ALL_OR_NOTHING_PUT algorithm.

    That analysis is based on a decay-free system. To deal with decay events, thus making the scheme both all-or-nothing and durable, the designer adopts two ideas from the discussion of durability in Chapter 2, the second of which eats up some of the better performance:

    1. Place the two copies, D0 and D1, in independent decay sets (for example, write them on two different disk drives, preferably from different vendors).
    2. Have a clerk run the SALVAGE program on every atomic sector at least once every \(T_d\) seconds.

    The clerk running the SALVAGE program performs \(2N\) disk reads every \(T_d\) seconds to maintain \(N\) durable sectors. This extra expense is the price of durability against disk decay. The performance cost of the clerk depends on the choice of \(T_d\), the value of \(N\), and the priority of the clerk. Since the expected operational lifetime of a hard disk is usually several years, setting \(T_d\) to a few weeks should make the chance of untolerated failure from decay negligible, especially if there is also an operating practice of routinely replacing disks well before they reach their expected operational lifetime. A modern hard disk with a capacity of one terabyte has about \(N = 10^9\) kilobyte-sized sectors. At 10 milliseconds per individual sector read, a clerk would need about \(2 \times 10^7\) seconds (roughly 230 days) to read all of the contents of two one-terabyte hard disks, so a practical clerk must read many contiguous sectors per disk operation; reading in large sequential blocks brings the scan down to a few hours. If the work of the clerk is scheduled to occur at night, or uses a priority system that runs the clerk when the system is otherwise not being used heavily, that reading can be spread out over a few weeks and the performance impact can be minor.
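    The clerk's cost arithmetic can be worked out explicitly. The 100 megabytes per second sequential transfer rate used below is an assumed, typical figure, not one given in the text:

```python
# Number of kilobyte-sized sectors on a one-terabyte disk, as in the text.
N = 10**9
SECTOR_READ_S = 0.010            # 10 ms per individual sector read

# Sector-at-a-time scan of both replicas of every sector:
scan_seconds = 2 * N * SECTOR_READ_S
scan_days = scan_seconds / 86400
print(scan_seconds, scan_days)   # 2e7 seconds, roughly 230 days

# Reading large sequential blocks instead, at an assumed 100 MB/s:
SEQ_RATE = 100 * 10**6           # bytes per second (illustrative figure)
seq_seconds = 2 * 10**12 / SEQ_RATE   # two full 1-TB disks
print(seq_seconds / 3600)        # a few hours for both disks
```

    The contrast shows why the clerk must scan sequentially rather than issuing one random read per sector.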

    A few paragraphs above, it was mentioned that there is room for a refinement: if we also run the SALVAGE program on every atomic sector immediately following every system crash, then it is not necessary to run it at the beginning of every ALL_OR_NOTHING_DURABLE_PUT. That variation, which is more economical if crashes are infrequent and disks are not too large, is due to Butler Lampson and Howard Sturgis [Suggestions for Further Reading 3.3.3]. It raises one minor concern, because it depends on the rarity of a coincidence of two failures: the spontaneous decay of one data replica at about the same time that CAREFUL_PUT crashes in the middle of rewriting the other replica of that same sector. If we are convinced that such a coincidence is rare, we can declare it an untolerated error, and we have a self-consistent and more economical algorithm. With this scheme, the cost of ALL_OR_NOTHING_DURABLE_PUT drops to just two disk writes.
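    A minimal sketch of this refinement, assuming a toy dict-based disk model: SALVAGE runs once over every atomic sector during crash recovery, so PUT itself needs only two writes. The data structure and names are illustrative assumptions, not the book's interfaces.

```python
BAD = "BAD"   # marker for a damaged replica (illustrative)

def salvage(replicas):
    """Repair one atomic sector's replica pair [D0, D1] in place."""
    d0, d1 = replicas
    if d0 == BAD:
        replicas[0] = d1          # D0 damaged: restore it from D1
    elif d1 == BAD or d0 != d1:
        replicas[1] = d0          # D1 damaged or stale: restore it from D0

def recover_after_crash(disk):
    """One SALVAGE pass over every atomic sector, run at crash recovery."""
    for replicas in disk.values():
        salvage(replicas)

def all_or_nothing_durable_put(new_data, replicas):
    # No SALVAGE call here: crash recovery already repaired any damage,
    # and we still finish writing D0 before starting D1.
    replicas[0] = new_data
    replicas[1] = new_data        # total cost: just two disk writes

# A crash interrupted a write of sector 7's D1 replica:
disk = {7: ["v1", BAD]}
recover_after_crash(disk)
assert disk[7] == ["v1", "v1"]
```

    The refinement trades a full scan at recovery time for cheaper writes in normal operation, a good bargain when crashes are rare and the disks are small enough to scan quickly.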


    This page titled 3.8: A More Complete Model of Disk Failure (Advanced Topic) is shared under a CC BY-NC-SA license and was authored, remixed, and/or curated by Jerome H. Saltzer & M. Frans Kaashoek (MIT OpenCourseWare).