6.2.2: Problem Sets 11-20
- Page ID
- 76175
\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)
\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)
\( \newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\)
( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\)
\( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)
\( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\)
\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)
\( \newcommand{\Span}{\mathrm{span}}\)
\( \newcommand{\id}{\mathrm{id}}\)
\( \newcommand{\Span}{\mathrm{span}}\)
\( \newcommand{\kernel}{\mathrm{null}\,}\)
\( \newcommand{\range}{\mathrm{range}\,}\)
\( \newcommand{\RealPart}{\mathrm{Re}}\)
\( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)
\( \newcommand{\Argument}{\mathrm{Arg}}\)
\( \newcommand{\norm}[1]{\| #1 \|}\)
\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)
\( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\AA}{\unicode[.8,0]{x212B}}\)
\( \newcommand{\vectorA}[1]{\vec{#1}} % arrow\)
\( \newcommand{\vectorAt}[1]{\vec{\text{#1}}} % arrow\)
\( \newcommand{\vectorB}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)
\( \newcommand{\vectorC}[1]{\textbf{#1}} \)
\( \newcommand{\vectorD}[1]{\overrightarrow{#1}} \)
\( \newcommand{\vectorDt}[1]{\overrightarrow{\text{#1}}} \)
\( \newcommand{\vectE}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash{\mathbf {#1}}}} \)
\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)
\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)
\(\newcommand{\avec}{\mathbf a}\) \(\newcommand{\bvec}{\mathbf b}\) \(\newcommand{\cvec}{\mathbf c}\) \(\newcommand{\dvec}{\mathbf d}\) \(\newcommand{\dtil}{\widetilde{\mathbf d}}\) \(\newcommand{\evec}{\mathbf e}\) \(\newcommand{\fvec}{\mathbf f}\) \(\newcommand{\nvec}{\mathbf n}\) \(\newcommand{\pvec}{\mathbf p}\) \(\newcommand{\qvec}{\mathbf q}\) \(\newcommand{\svec}{\mathbf s}\) \(\newcommand{\tvec}{\mathbf t}\) \(\newcommand{\uvec}{\mathbf u}\) \(\newcommand{\vvec}{\mathbf v}\) \(\newcommand{\wvec}{\mathbf w}\) \(\newcommand{\xvec}{\mathbf x}\) \(\newcommand{\yvec}{\mathbf y}\) \(\newcommand{\zvec}{\mathbf z}\) \(\newcommand{\rvec}{\mathbf r}\) \(\newcommand{\mvec}{\mathbf m}\) \(\newcommand{\zerovec}{\mathbf 0}\) \(\newcommand{\onevec}{\mathbf 1}\) \(\newcommand{\real}{\mathbb R}\) \(\newcommand{\twovec}[2]{\left[\begin{array}{r}#1 \\ #2 \end{array}\right]}\) \(\newcommand{\ctwovec}[2]{\left[\begin{array}{c}#1 \\ #2 \end{array}\right]}\) \(\newcommand{\threevec}[3]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \end{array}\right]}\) \(\newcommand{\cthreevec}[3]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \end{array}\right]}\) \(\newcommand{\fourvec}[4]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \\ #4 \end{array}\right]}\) \(\newcommand{\cfourvec}[4]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \\ #4 \end{array}\right]}\) \(\newcommand{\fivevec}[5]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \\ #4 \\ #5 \\ \end{array}\right]}\) \(\newcommand{\cfivevec}[5]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \\ #4 \\ #5 \\ \end{array}\right]}\) \(\newcommand{\mattwo}[4]{\left[\begin{array}{rr}#1 \amp #2 \\ #3 \amp #4 \\ \end{array}\right]}\) \(\newcommand{\laspan}[1]{\text{Span}\{#1\}}\) \(\newcommand{\bcal}{\cal B}\) \(\newcommand{\ccal}{\cal C}\) \(\newcommand{\scal}{\cal S}\) \(\newcommand{\wcal}{\cal W}\) \(\newcommand{\ecal}{\cal E}\) \(\newcommand{\coords}[2]{\left\{#1\right\}_{#2}}\) \(\newcommand{\gray}[1]{\color{gray}{#1}}\) \(\newcommand{\lgray}[1]{\color{lightgray}{#1}}\) \(\newcommand{\rank}{\operatorname{rank}}\) \(\newcommand{\row}{\text{Row}}\) \(\newcommand{\col}{\text{Col}}\) \(\renewcommand{\row}{\text{Row}}\) \(\newcommand{\nul}{\text{Nul}}\) \(\newcommand{\var}{\text{Var}}\) \(\newcommand{\corr}{\text{corr}}\) \(\newcommand{\len}[1]{\left|#1\right|}\) \(\newcommand{\bbar}{\overline{\bvec}}\) \(\newcommand{\bhat}{\widehat{\bvec}}\) \(\newcommand{\bperp}{\bvec^\perp}\) \(\newcommand{\xhat}{\widehat{\xvec}}\) \(\newcommand{\vhat}{\widehat{\vvec}}\) \(\newcommand{\uhat}{\widehat{\uvec}}\) \(\newcommand{\what}{\widehat{\wvec}}\) \(\newcommand{\Sighat}{\widehat{\Sigma}}\) \(\newcommand{\lt}{<}\) \(\newcommand{\gt}{>}\) \(\newcommand{\amp}{&}\) \(\definecolor{fillinmathshade}{gray}{0.9}\)Problem Set 11: RaidCo\(^*\)
\(^*\) Credit for developing this problem set goes to Stephen A. Ward.
2007–2–11
(Chapter 2)
RaidCo is a company that makes pin-compatible hard disk replacements using tiny, chip-sized hard disks ("microdrives") that have become available cheaply. Each RaidCo product behaves like a hard disk, supporting the operations:
error ← GET (nblocks, starting_block_number, buffer_address)
error ← PUT (nblocks, starting_block_number, buffer_address)
to get or put an integral number of consecutive blocks from or to the disk array. Each operation returns a status value indicating whether an error has occurred.
The models are as follows:
- R0: sector-level striping across all twelve microdrives, no redundancy/error correction
- R1: six pairs of two mirrored microdrives, no striping
- R2: 12-microdrive RAID 2, using bit-level striping, error detection, and error correction); microdrive’s internal sector-level error detection is disabled.
- R3: 12-microdrive RAID 3, using sector-level striping and error correction.
- R4: 12-microdrive RAID 4, no striping, dedicated parity disk
- R5: 12-microdrive RAID 5, no striping, distributed parity
The microdrives each conform to the same read/write API sketched above, each microdrive providing 100,000 sectors of 1,000 bytes each, and offering a uniform 10 millisecond seek time and a read/write bandwidth of 100 megabytes per second; thus the entire 100 megabytes of data on a microdrive can be fetched using a single GET operation in one second. The RaidCo products do no caching or buffering: each GET or PUT involves actual data transfer to or from the involved microdrives. Since the microdrives have uniform seek time, the RaidCo products do not need, and do not use, any seek optimizations.
Q11.1 As good as the students were at programming, they unfortunately left the documentation unfinished. Your job is to complete the following table, showing certain specifications for each model drive (i.e., the size and performance parameters of the API supported by each RAID system). Entries assume error-free operation, and ignore transfer times that are small compared to seek times encountered.
R0 | R1 | R2 | R3 | R4 | R5 | |
---|---|---|---|---|---|---|
Block size (kilobytes) exposed to GET/PUT |
1 kB | 1 kB | 11 kB | 1 kB | ||
Capacity, in blocks | 1,100,000 | |||||
Max time for a single 100-megabyte GET (seconds) |
1/12 s | 1 s | 1 s | |||
Time for a 1-block PUT (milliseconds) |
10 ms | 10 ms | 10 ms | 20 ms | ||
Typical number of microdrives involved in a 1-block GET |
1 | 1 | ||||
Typical number of microdrives involved in a 2-block GET |
2 | 1 | 1 | |||
Typical number of microdrives involved in a 1-block PUT |
2 |
Add exercises text here.
Problem Set 12: ColdFusion \(^*\)
\(^*\) Credit for developing this problem set goes to Hari Balakrishnan.
2000–3–8…15
(Chapter 2, with a bit of Chapter 3)
Alyssa P. Hacker and Ben Bitdiddle are designing a hot new system, called ColdFusion, whose goal is to allow users to back up their storage systems with copies stored in a distributed network of ColdFusion servers. Users interact with ColdFusion using PUT
and GET
operations.
PUT (data, fid)
takes a data buffer and reliably stores it under a unique identifierfid
, a positive integer, on some subset of the servers. It returnsSUCCESS
if it was successful in storing it on all the machines in the chosen subset, andFAILURE
otherwise.GET (fid)
returns the contents of the most recent successfulPUT
to the system for the file identified byfid
.
Because high availability is a key competitive advantage, Alyssa decides to replicate user data on more than one server machine. But rather than replicate each file on every server, she decides to be clever and use only a subset of the servers for each file. Thus, the PUT
of a file stores it on some number (\(A\)) of the servers, invoking a SERVER_PUT
operation on each server. SERVER_PUT
is atomic and is implemented by each server.
If PUT
is unable to successfully store the file on \(A\) servers, it returns FAILURE
.
When a client does a GET
of the file, the GET
software attempts to retrieve the file from some subset of the servers and picks the version using an election algorithm. It chooses \(B\) servers to read data from, using an atomic SERVER_GET
operation implemented by each server, following which it calls PICK_MAJORITY ()
. PICK_MAJORITY
returns valid data corresponding to a version that is shared by more than 50% of the \(B\) copies retrieved, and NULL
otherwise. Even though the client may not know which specific servers hold the current copy, the number of servers (\(A\)) in PUT
and the number (\(B\)) in GET
are chosen so that if a client GET
succeeds, it is certain to have received the most recent copy.
Alyssa and Ben write the following code for PUT
and GET
. There are \(S\) servers in all, and \(S > 2\). The particular ordering of the servers in the code below may be different at different clients, but all clients have the same list. They hire you as a consultant to help them figure out the missing parameters (\(A\) and \(B\)) and analyze the system for correctness.
// Internet addresses of the S servers, S > 2
ip_address server[S] // each client may have a different ordering
procedure PUT (byte data[], integer fid)
integer ntried, nputs ← 0 // # of servers tried and # successfully put
for ntried from 0 to S do // Put “data” into a file identified by fid at server[ntried]
status ← SERVER_PUT (data, fid, server[ntried])
if status = success then nputs ← nputs + 1
if nputs ≥ A then // yes! have SERVER_PUT () to A servers!
return SUCCESS;
return FAILURE // found < A servers to SERVER_PUT() to
procedure GET (integer fid, byte data[])
integer ntried, ngets ← 0 // # times tried and # times server_get returned success
byte files[S][MAX_FILE_LENGTH] // array of files; an entry is a copy from a server
integer index
byte data[]
for ntried from 0 to S do // Get file fid into buffer files[ntried] from server[ntried]
status ← SERVER_GET (fid, files[ntried], server[ntried])
if status = success then ngets ← ngets + 1
if ngets ≥ B then // yes! have gotten data from B servers
// PICK_MAJORITY () takes the array of files and magically knows which ones are
// valid. It scans the ngets valid ones and returns an index in the files[] array
// for one of the good copies, which corresponds to a version returned by more
// than 50% of the servers. Otherwise, it returns –1. If ngets = 1,
// PICK_MAJORITY () simply returns an index to that version.
index ← PICK_MAJORITY (files, ngets);
if index ≠ –1 then
COPY (data, files[index]) // copy into data buffer
return SUCCESS
else return FAILURE
return FAILURE // didn’t find B servers to SERVER_GET from
For questions 12.1 through 12.4, assume that operations execute serially (i.e., there is no concurrency). Assume also that the end-to-end protocol correctly handles all packet losses and delivers messages in to a recipient in the same order that the sender dispatched them. In other words, no operations are prevented from completing because of lost or reordered packets. However, servers may crash and subsequently recover.
Q12.1 Which reliability technique is the best description of the one being attempted by Alyssa and Ben?
A. Fail-safe design.
B. N-modular redundancy.
C. Pair-and-compare.
D. Temporal redundancy.
Q12.2 Which of the following combinations of \(A\) and \(B\) in the code above ensures that GET
returns the results of the last successful PUT
, as long as no servers fail? (Here, \( \lfloor x \rfloor\) is the largest integer \(\leq x\), and \( \lceil y \rceil\) is the smallest integer \(\geq y\). Thus \(\lfloor 2.3 \rfloor = 2\), and \(\lceil 2.3 \rceil = 3\). Remember also that \(S > 2\).)
A. \(A = 1 \quad\quad\quad\quad\quad\ B = S\)
B. \(A = \lceil S/3 \rceil \quad\quad\quad B = S\)
C. \(A = \lceil S/2 \rceil \quad\quad\quad B = S\)
D. \(A = \lceil (3S)/4 \rceil \quad\,\,\, B = \lfloor S+2 \rfloor + 1\)
E. \(A = S \quad\quad\quad\quad\,\,\,\ B = 1\)
Q12.3 Suppose that the number of servers \(S\) is an odd number larger than 2, and that the number of servers used for PUT
is \(A = S ⁄ 2\). If only PUT
and no GET
operations are done, how does the mean time to failure (MTTF) of the PUT
operation change as \(S\) increases? The PUT
operation fails if the return value from PUT
is FAILURE
. Assume that the process that causes servers to fail is memoryless, and that no repairs are done.
A. As \(S\) increases, there is more redundancy. So, the MTTF increases.
B. As \(S\) increases, one still needs about one-half of the servers to be accessible for a successful PUT
. So, the MTTF does not change with \(S\).
C. As \(S\) increases the MTTF decreases even though we have more servers in the system.
D. The MTTF is not a monotonic function of \(S\); it first decreases and then increases.
Q12.4 Which of the following is true of ColdFusion's PUT
and GET
operations, for choices of \(A\) and \(B\) that guarantee that GET
successfully returns the data from the last successful PUT
when no servers fail?
A. A PUT
that fails because some server was unavailable to it, done after a successful PUT
, may cause subsequent GET
attempts to fail, even if \(B\) servers are available.
B. A failed PUT
attempt done after a successful PUT
cannot cause subsequent GET
attempts to fail if \(B\) servers are available.
C. A failed PUT
attempt done after a successful PUT
always causes subsequent GET
attempts to fail, even if \(B\) servers are available.
D. None of the above.
ColdFusion unveils their system for use on the Internet with \(S\) = 15 servers, using \(A = 2S/3\) and \(B = 1 + 2S/3\). However, they find that the specifications are not always met—several times, GET
does not return the data from the last PUT
that returned SUCCESS
.
In questions 12.5 through 12.8, assume that there may be concurrent operations.
Q12.5 Under which of these scenarios does ColdFusion always meet its specification (i.e., GET
returns SUCCESS
and the data corresponding to the last successful PUT
)?
A. There is no scenario under which ColdFusion meets its specification for this choice of \(A\) and \(B\).
B. When a user PUT
's data to a file with some fid
, and at about the same time someone else PUT
's different data to the same fid.
C. When a user PUT
's a file successfully from her computer at home, drives to work and attempts to GET
the file an hour later. In the meantime, no one performs any PUT
operations to the same file, but three of the servers crash and are unavailable when she does her GET
.
D. When the PUT
of a file succeeds at some point in time, but some subsequent PUT
's fail because some servers are unavailable, and then a GET
is done to that file, which returns SUCCESS
.
You tell Ben to pay attention to multisite coordination, and he implements his version of the two-phase commit protocol. Here, each server maintains a log containing READY
(a new record he has invented), ABORT
, and COMMIT
records. The server always returns to a client the version of a file that last underwent COMMIT
.
In Ben's protocol, when the client PUT
's a file, the server returns SUCCESS
or FAILURE
as before. If it returns SUCCESS
, the server appends a READY
entry for this fid in its log. If the client sees that all the servers it asked returns SUCCESS
, it sends a message asking them all to COMMIT
. When a server receives this message, it writes a COMMIT
entry in its log together with the file's fid
. On the other hand, if even one of the servers returns FAILURE
, the client sends a message to all the servers asking them to abort the operation, and each server writes an ABORT
entry in its log. Finally, if a server gets a server put request for some fid that is in the READY
state, it returns FAILURE
to the requesting client.
Q12.6 Under which of these scenarios does ColdFusion always meet its specification (i.e., GET
returns SUCCESS
and the data corresponding to the last successful PUT
)?
A. There is no scenario under which ColdFusion meets its specification for this choice of \(A\) and \(B\).
B. When a user PUT
's data to a file with some fid
, and at about the same time someone else PUT
's different data to the same fid.
C. When a user PUT
's a file successfully from her computer at home, drives to work and attempts to GET
the file an hour later. In the meantime, no one performs any PUT
operations to the same file, but three of the servers crash and are unavailable when she does her GET
.
D. When the PUT
of a file succeeds at some point in time and the corresponding COMMIT
messages have reached the servers, but some subsequent PUT
s fail because some servers are unavailable, and then a GET
is done to that file, which returns SUCCESS
.
Q12.7 When a server crashes and recovers, the original clients that initiated PUT
s may be unreachable. This makes it hard for a server to know the status of its READY
actions, since it cannot ask the clients that originated them. Assuming that no more than one server is unavailable at any time in the system, which of the following strategies allows a server to correctly learn the status of a past READY
action when it recovers from a crash?
A. Contact any server that is up and running and call SERVER_GET
with the file's fid
; if the server responds, change READY
to COMMIT
in the log.
B. Ask all the other servers that are up and running using server GET()
with the file's fid
; if more than 50% of the other servers respond with identical data, just change READY
to COMMIT
.
C. Pretend to be a client and invoke GET
with the file's fid; if GET
is successful and the data returned is the same as what is at this server, just change READY
to COMMIT
.
D. None of the above.
To accommodate the possibility of users operating on entire directories at once, ColdFusion adds a two-phase locking protocol on individual files within a directory. Alyssa and Ben find that although this sometimes works, deadlocks do occur when a GET
owns some locks that a PUT
needs, and vice versa.
Q12.8 Ben analyzes the problem and comes up with several "solutions" (as usual). Which of his proposals will actually work, always preventing deadlocks from happening?
A. Ensure that the actions grab locks for individual files in increasing order of the fid
of the file.
B. Ensure that no two actions grab locks for individual files in the same order.
C. Assign an incrementing timestamp to each action when it starts. If action Ai
needs a lock owned by action Aj
with a larger timestamp, abort action Ai
and continue.
D. Assign an incrementing timestamp to each action when it starts. If action Ai
needs a lock owned by action Aj
with a smaller timestamp, abort action Ai
and continue.
Problem Set 13: PigeonExpress.com
1999–3–5…14
(Chapter 3, based on Chapter 1)
After selling PigeonExpress!.com and taking a trip around the world, Ben Bitdiddle is planning his next start-up, AtomicPigeon!.com. AtomicPigeon improves over PigeonExpress by offering an atomic data delivery system.
Recall from problem set 2 that when sending a pigeon, Ben's software prints out a little header and writes a CD, both of which are given to the pigeon. The header contains the GPS coordinates of the sender and receiver, a type ( REQUEST
or ACKNOWLEDGMENT
), and a sequence number:
structure header
GPS source
GPS destination
integer type
integer sequence_no
Ben starts with the code for the simple end-to-end protocol ( BEEP
) for PigeonExpress!.com. He makes a number of modifications to the sending and receiving code.
At the sender, Ben simplifies the code. The BEEP
protocol transfers only a single CD:
shared next_sequence initially 0 // a globally shared sequence number.
procedure BEEP (target, CD[]) // send 1 CD to target
header h // h is an instance of header.
h.source ← MY_GPS // set source to my GPS coordinates
h.destination ← target // set destination
h.type ← REQUEST // this is a request message
h.sequence_no ← next_sequence // set seq number
// loop until we receive the corresponding ack, retransmitting if needed
while h.sequence_no = next_sequence do
send pigeon with h, CD // transmit
wait 2,000 seconds
As before, pending and incoming acknowledgments are processed only when the sender is waiting:
procedure PROCESS_ACK (h) // process acknowledgment
if h.sequence_no = sequence then // ack for current outstanding CD?
next_sequence ← next_sequence + 1
Ben makes a small change to the code running on the receiving computer. He adds a variable expected_sequence
at the receiver, which is used by PROCESS_REQUEST
to filter duplicates:
integer expected_sequence initially 0 // duplicate filter.
procedure PROCESS_REQUEST (h, CD) // process request
if h.sequence_no = expected_sequence then // the expected seq #?
PROCESS (CD) // yes, process data
expected_sequence ← expected_sequence + 1 // increase expectation
h.destination ← h.source // send to where the pigeon came from
h.source ← MY_GPS
h.sequence_no ← h.sequence_no // unchanged
h.type ← ACKNOWLEDGMENT;
send pigeon with h // send an acknowledgment back
The assumptions for the pigeon network are the same as in problem set 2:
- Some pigeons might get lost, but, if they arrive, they deliver data correctly (uncorrupted)
- The network has one sender and one receiver
- The sender and the receiver are single-threaded
Q13.1 Assume the sender and receiver do not fail (i.e., the only failures are that some pigeons may get lost). Does PROCESS
in PROCESS_REQUEST
process the value of CD exactly once?
A. Yes, since next_sequence
is a nonce and the receiver processes data only when it sees a new nonce.
B. No, since next_sequence
and expected_sequence
may get out of sync because the receiver acknowledges requests even when it skips processing.
C. No, since if the acknowledgment isn't received within 2,000 seconds, the sender will send the same data again.
D. Yes, since pigeons with the same data are never retransmitted.
Ben's new goal is to provide atomicity, even in the presence of sender or receiver failures. The reason Ben is interested in providing atomicity is that he wants to use the pigeon network to provide P-commerce (something similar to E-commerce). He would like to write applications of the form:
procedure TRANSFER (amount, destination)
WRITE (amount, CD) // write amount on a CD
BEEP (destination, CD) // send amount
The amount always fits on a single CD.
If the sender or receiver fails, the failure is fail-fast. For now, let's assume that if the sender or receiver fails, it just stops and does not reboot; later, we will relax this constraint.
Q13.2 Given the current implementation of the BEEP
protocol and assuming that only the sender may fail, what could happen during, say, the 100th call to TRANSFER
?
A. That TRANSFER
might never succeed.
B. That TRANSFER
might succeed.
C. PROCESS
in PROCESS_REQUEST
might process amount
more than once.
D. PROCESS
in PROCESS_REQUEST
might process amount
exactly once.
Ben's goal is to make transfer always succeed by allowing the sender to reboot and finish failed transfers. That is, after the sender fails, it clears volatile memory (including the nonce counter) and restarts the application. The application starts by running a recovery procedure, named RECOVER_SENDER
, which retries a failed transfer if any exist.
To allow for restartable transfers, Ben supplies the sender and the receiver with durable storage that never fails. On the durable storage, Ben stores a log, in which each entry has the following form:
structure log_entry
integer type // STARTED or COMMITTED
integer sequence_no // a sequence number
The main objective of the sender's log is to allow RECOVER_SENDER
to restore the value of next_sequence
and to allow the application to restart an unfinished transfer, if any. Ben edits TRANSFER
to use the log:
1 procedure TRANSFER (amount, destination)
2 WRITE (amount, CD) // write amount on a CD
3 ADD_LOG (STARTED, next_sequence) // append STARTED record
4 BEEP (destination, CD) // send amount (BEEP increases next_sequence)
5 ADD_LOG (COMMITTED, next_sequence – 1) // append COMMITTED record
ADD_LOG
atomically appends a record to the log on durable storage. If ADD_LOG
returns, the entry has been appended. Logs contain sufficient space for new records and they don't have to be garbage-collected.
Q13.3 Identify the line in this new version of TRANSFER
that is the commit point.
Q13.4 How can the sender discover that a failure caused the transfer not to complete?
A. The log contains a STARTED
record with no corresponding COMMITTED
record.
B. The log contains a STARTED
record with a corresponding ABORTED
record.
C. The log contains a STARTED
record with a corresponding COMMITTED
record.
D. The log contains a COMMITTED
record with no corresponding STARTED
record.
Ben tries to write RECOVER_SENDER
to recover next_sequence
, but his editor crashes before committing the final editing and the expression in the if
statement is missing, as indicated by a ?
in the code below. Your job is to edit the code such that the correct expression is evaluated.
procedure RECOVER_SENDER ()
next_sequence ← 0
starting at end of log…
for each entry in log do
if ( ? ) then // What goes here? (See question 13.6)
next_sequence ← (sequence_no of entry) + 1
break // terminate scan of log
Q13.5 After you edit RECOVER_SENDER
for Ben, which of the following sequences could appear in the log? (The log records are represented as <type sequence_no>
.
A. ..., <STARTED 1>, <COMMITTED 1>, <STARTED 2>, <COMMITTED 2>
B. ..., <STARTED 1>, <STARTED 1>, <COMMITTED 1>
C. ..., <STARTED 1>, <STARTED 1>, <STARTED 2>, <COMMITTED 1>, <STARTED 2>
D. ..., <STARTED 1>, <COMMITTED 1>, <COMMITTED 1>
Q13.6 What expression should replace the ?
in the RECOVER_SENDER
code above?
A. entry.type = COMMITTED
B. entry.type = STARTED
C. entry.type = ABORTED
D. FALSE
Q13.7 Given the current implementation of the BEEP
protocol what could happen, say, during the 100th call to TRANSFER
? (Remember only the sending computer may fail.)
A. If the sending computer keeps failing during recovery, that TRANSFER
might never succeed.
B. That TRANSFER
might succeed.
C. PROCESS
in PROCESS_REQUEST
might process amount
more than once.
D. PROCESS
in PROCESS_REQUEST
might process amount
exactly once.
Ben's next goal is to make PROCESS_REQUEST
all-or-nothing. In the following questions, assume that whenever the receiving computer fails, it reboots, calls RECOVER_RECEIVER
, and after RECOVER_RECEIVER
is finished, it waits for messages and calls PROCESS_REQUEST
on each message.
To make expected_sequence
all-or-nothing, Ben tries to change the receiver in a way similar to the change he made to the sender. Again, his editor didn't commit all the changes in time. The missing code is marked by ?
and #
. The missing expression marked by ?
evaluates the same expression as did the ?
in RECOVER_SENDER
. The new missing expressions are marked by #
in PROCESS_REQUEST
:
procedure RECOVER_RECEIVER ()
expected_sequence ← 0
starting at end of log…
for each entry in log do
if ( ? ) then // The expression of question 13.6
expected_sequence ← sequence_no of entry + 1
break // terminate scan of log
1 procedure PROCESS_REQUEST (h, CD)
2 if h.sequence_no = expected_sequence then // the expected seq #?
3 ADD_LOG (#, #) // See question 13.8.
4 PROCESS (CD) // yes, process data
5 expected_sequence ← expected_sequence + 1 // increase expectation
6 ADD_LOG (#, #) // See question 13.8.
7 h.destination ← source of h // send to where the pigeon came from
8 source of h.source ← MY_GPS
9 h.sequence_no ← h.sequence_no // unchanged
10 h.type ← ACKNOWLEDGMENT
11 send pigeon with h // send an acknowledgment back
As you can see from the code, Ben chose not to implement a write-ahead protocol because PROCESS
is implemented by a third party, for example, a bank: PROCESS
might be a call into the bank's transaction database system.
Q13.8 Complete the ADD_LOG
calls on the lines 3 and 6 in PROCESS_REQUEST
such that expected_sequence
will be all-or-nothing.
3 ADD_LOG ( , )
6 ADD_LOG ( , )
Q13.9 Can PROCESS
in PROCESS_REQUEST
be called multiple times for a particular call to TRANSFER
?
A. No, because expected_sequence
is recovered and h.sequence_no
is checked against it.
B. Yes, because failed transfers will be restarted and result in the acknowledgment being retransmitted.
C. Yes, because after 2,000 seconds a request will be retransmitted.
D. Yes, because the receiver may fail after PROCESS
, but before it commits.
Q13.10 How should PROCESS
, called by PROCESS_REQUEST
, be implemented to guarantee exactly-once semantics for transfers? (Remember that the sender is persistent.)
A. As a normal procedure call;
B. As a remote procedure call;
C. As a nested transaction;
D. As a top-level transaction.
Problem Set 14: Sick Transit \(^*\)
\(^*\) Credit for developing this problem set goes to Stephen A. Ward.
1997–3–6…15
(Chapter 3)
Gloria Mundi, who stopped reading the text before getting to Chapter 3, is undertaking to resurrect the failed London Ambulance Service as a new streamlined company called Sick Transit. She has built a new computer she intends to use for processing ST's activities.
A key component in Gloria's machine is a highly reliable sequential-access infinite tape, which she plans to use as an append-only log. Records can be appended to the tape, but once written are immutable and durable. Records on the tape can be read any number of times, from front-to-back or from back-to-front. There is no disk in the ST system; the tape is the only non-volatile storage.
Because of the high cost of the infinite tape, Gloria compromised on the quality of more conventional components like RAM and CPU, which fail frequently but fortunately are fail-fast: every error causes an immediate system crash. Gloria plans to ensure that, after a crash, a consistent state can be reconstructed from the log on the infinite tape.
Gloria's code uses transactions, each identified by a unique transaction ID. The visible effect of a completed transaction is confined to changes in global variables whose WRITE
operations are logged. The log will contain entries recording the following operations:
BEGIN (tid) // start a new transaction, whose unique ID is tid
COMMIT (tid) // commit a transaction
ABORT (tid) // abort a transaction
WRITE (tid, variable, old_value, new_value)
// write a global variable, specifying previous & new values.
To keep the system simple, Gloria plans to use the above forms as the application-code interface, in addition to a READ (tid, variable)
call which returns the current value of variable
. Each of the calls will perform the indicated operation and write a log entry as appropriate. Reading an unwritten variable is to return ZERO
.
Gloria begins by considering the single-threaded case (only one transaction is active at any time). She stores values of global variables in a table in RAM. Gloria is now trying to figure out how to reset variables to committed values following a crash, using the log tape.
Q14.1 In the single-threaded case, what value should variable v
be restored to following a crash?
A. 37
B. new_value
from the last logged WRITE (tid, v, old_value, new_value)
, or ZERO
if unwritten
C. new_value
from the last logged WRITE (tid, v, old_value, new_value)
that is not followed by an ABORT (tid)
, or ZERO
if unwritten.
D. new_value from the last logged WRITE (tid, v, old_value, new_value)
that is followed by a COMMIT (tid)
, or ZERO
otherwise.
E. Either old_value
or new_value
from the last logged WRITE (tid, v, old_value, new_value)
, depending on whether that WRITE
is followed by a COMMIT
on the same tid
, or ZERO
if unwritten.
Gloria now tries running concurrent transactions on her system. Accesses to the log are serialized by the sequential-access tape drive.
Her first trial involves concurrent execution of these two transactions:
|
|
The initial values of x
and y
are ZERO
, as are all uninitialized variables in her system. Here the READ
primitive simply returns the most recently written value of a variable from the RAM table, ignoring COMMIT
s.
Q14.2 In the absence of locks or other synchronization mechanism, will the result necessarily correspond to some serial execution of the two transactions?
A. Yes.
B. No, since the execution might result in x = 3
, y = 3
.
C. No, since the execution might result in x = 1
, y = 3
.
D. No, since the execution might result in x = 1
, y = 2.
Gloria is considering using locks, and automatically adding code to each transaction to guarantee before-or-after atomicity. She would like to maximize concurrency; she is, however, anxious to avoid deadlocks. For each of the following proposals, decide whether the approach (1) yields semantics consistent with before-or-after atomicity and (2) introduces potential deadlocks.
Q14.3 A single, global lock where ACQUIRE
is at the start of each transaction and RELEASE
occurs at COMMIT
.
Q14.4 A lock for each variable. Every READ
or WRITE
operation is immediately surrounded by an ACQUIRE
and RELEASE
of that variable's lock.
Q14.5 A lock for each variable on which a transaction performs READ
or WRITE
, acquired immediately prior to the first reference to that variable in the transaction; all locks are released at COMMIT
.
Q14.6 A lock for each variable on which a transaction performs READ
or WRITE
, acquired in alphabetical order, immediately following the BEGIN
. All locks are released at COMMIT
.
Q14.7 A lock for each variable on which a transaction performs WRITE
, acquired, in alphabetical order, immediately following the BEGIN
. All locks are released at COMMIT
.
In the general case (concurrent transactions) Gloria would like to avoid having to read the entire log during crash recovery. She proposes periodically adding a CHECKPOINT
entry to the log, and reading the log backwards from the end restoring committed values to RAM. The backwards scan should end as soon as committed values have been restored to all variables. Each CHECKPOINT
entry in the log contains current values of all variables and a list of uncommitted transactions at the time of the CHECKPOINT
.
Q14.8 What portion of the tape must be read to properly restore values committed at the time of the crash?
A. All of the tape; checkpoints don't help.
B. Enough to include the STARTED
record from each transaction that was uncommitted at the time of the crash.
C. Enough to include the last CHECKPOINT
, as well as the STARTED
record from each transaction that was uncommitted at the time of the crash.
D. Enough to include the last CHECKPOINT
, as well as the STARTED
record from each transaction that was uncommitted at the time of the last checkpoint.
Simplicity Winns, Gloria's one-time classmate, observes that since global variable values can be reconstructed from the log their storage in RAM is redundant. She proposes eliminating the RAM as well as all of Gloria's proposed locks, and implementing a READ (tid, var)
primitive which returns an appropriate value of var
by examining the log.
Simplicity's plan is to implement READ
so that each transaction a
"sees" the values of global values at the time of BEGIN (a)
, as well as changes made within a
. She quickly sketches an implementation of READ
which she claims gives appropriate atomicity semantics:
procedure READ(tid, var)
winners ← EMPTY // winners is a list.
prior ← FALSE
for each entry in log do
if entry is STARTED (tid) then prior ← TRUE
if entry is COMMITTED (Etid) and prior = TRUE then add Etid to winners
if entry is WRITE (Etid, var, old_value, new_value) then
if Etid = tid then return new_value
if Etid is in winners then return new_value
return 0
Gloria is a little dazed by Simplicity's quick synopsis, but thinks that Simplicity is likely correct. Gloria asks for your help in figuring out what Simplicity's algorithm actually does.
Q14.9 Suppose transaction t
performs a READ
of variable x
but does not write it. Will each READ
of x
in t
see the same value? If so, concisely describe the value returned by each READ
; if not, explain.
Q14.10 Does Simplicity's scheme REALLY offer transaction semantics yet avoid deadlocks?
A. Yup. Read it and weep, Gloria.
B. It doesn't introduce deadlocks, but doesn't guarantee before-or-after transaction semantics either.
C. It gives before-or-after transactions, but introduces possible deadlocks.
D. Simplicity's approach doesn't work even when there's no concurrency—it gives wrong answers.
Q14.11 The real motivation of the Sick Transit problem is a stupid pun. What does sic transit gloria mundi actually mean?
A. Thus passes a glorious Monday.
B. Thus passes the glory of the world.
C. Gloria threw up on the T Monday.
D. This bus for First Class and Coach.
E. This is the last straw! If I wanted to take @#%!*% Latin, I'd have gone to Oxford.
Problem Set 15: The Bank of Central Peoria, Limited
1998–3–7…14
(Chapter 3)
Ben Bitdiddle decides to go into business. He bids $1 at a Resolution Trust Corporation auction and becomes the owner of the Bank of Central Peoria, Limited (BCPL).
When he arrives at BCPL, Ben is shocked to learn that the only programmer who understood BCPL's database has left the company to work on new animation techniques for South Park. Hiring programmers is difficult in Peoria, so Ben decides to take over the database code himself.
Ben learns that an account is represented as a structure with the following information:
structure account
integer account_id // account identification number
integer balance // account balance
The BCPL system implements a standard transaction interface for accessing accounts:
tid ← BEGIN () // Starts a new transaction that will be identified as number tid
balance ← READ (tid, account.account_id) // Returns the balance of an account
WRITE (tid, account.account_id, newbalance) // Updates the balance of an account
COMMIT (tid) // Makes the updates of transaction tid visible to other transactions
The BCPL system uses two disks, both accessed synchronously (i.e., GET
and PUT
operations on the disks won't return until the data is read from the disk or has safely been written to the disk, respectively). One disk contains nothing but the account balances, indexed by number. This disk is called the database disk. The other disk is called the log disk and exclusively stores, in chronological order, a sequence of records of the form:
structure logrecord
integer op // WRITE, COMMIT, or END
integer tid // transaction number
integer account_id // account number
integer new_balance // new balance for account “account”
where the meaning of each record is given by the op
field:
op = WRITE // Update of an account to a new balance by transaction tid
op = COMMITTED // Transaction tid’s updates are now visible to other transactions // and durable across crashes
op = END // Transaction tid’s writes have all been installed on the database disk
For each active transaction, the BCPL system keeps a list in volatile memory called intentions
containing pairs ( account_id
, new_balance
). The implementation of READ
is as follows:
procedure READ (tid, account_id)
if account_id is in intentions of tid then
pair ← last pair containing account_id from intentions of tid
return pair.new_balance
else
GET account containing account_id from database
return account.balance
A. Recovery
For this section, assume that there are no concurrent transactions.
Ben asks whether the database computer has ever crashed and learns that it crashed frequently due to intense sound vibrations from the jail next door. Ben decides he had better understand how recovery works in the BCPL system. He examines the implementation of the recovery procedure. He finds the following code:
1 procedure RECOVERY ()
2 winners ← NULL
3 reading the log from oldest to newest,
4 for each record in log do
5 if record.op = COMMITTED then add record.tid to winners
6 if record.op = END then remove record.tid from winners
7 again reading the log from oldest to newest,
8 for each record in log do
9 if record.op = WRITE and record.tid is in winners then
10 INSTALL (record.new_balance in database for record.account_id
11 for each tid in winners do
12 LOG {END, tid}
Q15.1 What would happen if lines 11 and 12 were omitted?
A. The system might fail to recover correctly from the first crash that occurs.
B. The system would recover correctly from the first crash but the log would be corrupt so the system might fail to recover correctly from the second crash.
C. The system would recover correctly from multiple crashes but would have to do more work when recovering from the second and subsequent crashes.
Q15.2 For the RECOVERY
and READ
procedures to be correct, which of the following could be correct implementations of the COMMIT
procedure?
A. procedure COMMIT (tid) {}
B. procedure COMMIT (tid)
for each pair in tid.intentions do
INSTALL (pair.new_balance in database for pair.account_id)
tid.intentions ← NULL
LOG {COMMITTED, tid}
C. procedure COMMIT (tid)
LOG {END, tid}
for each pair in tid.intentions do
INSTALL (pair.new_balance in database for pair.account_id)
tid.intentions ← NULL
LOG {COMMITTED, tid}
D. procedure COMMIT (tid)
LOG {COMMITTED, tid}
for each pair in tid.intentions do
INSTALL (pair.new_balance in database for pair.account_id)
tid.intentions ← NULL
LOG {END, tid}
Q15.3 For the RECOVERY
and READ
procedures to be correct, which of the following could be correct implementations of the WRITE
procedure?
A. procedure WRITE (tid, account_id, new_balance)
LOG {WRITE, tid, account_id, new_balance}
B. procedure WRITE (tid, account_id, new_balance)
add the pair {account_id, new_balance} to tid.intentions
LOG {WRITE, tid, account_id, new_balance}
C. procedure WRITE (tid, account_id, new_balance)
LOG {WRITE, tid, account_id, new_balance}
add the pair {account_id, new_balance} to tid.intentions
D. procedure WRITE (tid, account_id, new_balance)
LOG {WRITE, tid, account_id, new_balance}
add the pair {account_id, new_balance} to tid.intentions
INSTALL new_balance in database for account_id
Ben is rather surprised to see there is no ABORT (tid)
procedure that terminates a transaction and erases its database updates. He calls up the database developer, who says it should be easy to add. Ben figures he might as well add the feature now, and adds a new log record type ABORTED
.
Q15.4 Which of the following could be correct implementations of the ABORT
procedure? Assume that the RECOVERY
procedure is changed correspondingly.
A. procedure ABORT (tid)
tid.intentions ← NULL
B. procedure ABORT (tid)
LOG {ABORTED, tid}
C. procedure ABORT (tid)
LOG {ABORTED, tid}
tid.intentions ← NULL
D. procedure ABORT (tid)
tid.intentions ← NULL
LOG {ABORTED, tid}
B. Buffer cache
In section B, again assume that there are no concurrent transactions.
BCPL is in intense competition with the nearby branch of Peoria Authorized Savings, Credit and Loan. BCPL's competitive edge is lower account fees. Ben decides to save the cost of upgrading the computer system hardware by adding a volatile memory buffer cache, which will make the database much more efficient on the current hardware. The buffer cache is used for GET
s and PUT
s to the database disk only; GET
s and PUT
s to the log disk remain write-through and synchronous.
The buffer cache uses an LRU replacement policy. Each account record on the database disk is cached or replaced separately. In other words, the cache block size, disk block size, and account record sizes are all identical.
Q15.5 Why will adding a buffer cache for the database disk make the system more efficient?
A. It is faster to copy from the buffer cache than to GET
from the disk.
B. If common access patterns can be identified, performance can be improved by prefetching multiple account balances into the cache.
C. It reduces the total number of disk GET
s when one transaction reads the same account balance multiple times without updating it.
D. It reduces the total number of disk GET
s when multiple consecutive transactions read the same account balance.
Ben then makes a mistake. He reasons that the intentions list described in section A is now unnecessary, since the list just keeps in-memory copies of database data, which is the same thing done by the buffer cache. He deletes the intentions list code and modifies PUT
so it updates the copy of the account balance in the buffer cache. He also modifies the system to delay writing the END
record until all buffered accounts modified by that transaction have been written back to the database disk. Much to his horror, the next time the inmates next door try an escape and the resulting commotion causes the BCPL system to crash, the database does not recover to a consistent state.
Q15.6 What might have caused recovery to fail?
A. The system crashed when only some of the modifications made by a committed transaction had reached the database disk.
B. The LRU replacement policy updated the database disk with data modified by an uncommitted transaction, which later committed before the crash.
C. The LRU replacement policy updated the database disk with data modified by an uncommitted transaction, which failed to commit before the crash.
D. The LRU replacement policy updated the database disk with data modified by a committed transaction, which later completed before the crash.
E. The LRU replacement policy updated the database disk with data modified by a committed transaction, which did not complete before the crash.
C. Concurrency
Ben restores the intention-list code, deletes the buffer cache code and goes back to the simpler system described in section A. He is finally ready to investigate how the BCPL system manages concurrent transactions. He calls up the developer and she tells him that there is a lock stored in main memory for each account in the database, used by the CONCURRENT_BEGIN
and CONCURRENT_COMMIT
procedures. Since BCPL runs concurrent transactions, all its applications actually use these two procedures rather than the lower-level BEGIN
and COMMIT
procedures described earlier.
An application doing a concurrent transaction must declare the list of accounts it will use as an argument to the CONCURRENT_BEGIN
procedure.
procedure CONCURRENT_BEGIN (account_list)
do atomically
for each account in account_list do
ACQUIRE (account.lock)
tid ← BEGIN ()
tid.account_list ← account_list
return tid
procedure CONCURRENT_COMMIT (tid)
COMMIT (tid)
for each account in tid.account_list do
RELEASE (account.lock)
Ben runs two transactions concurrently. Both transactions update account number 2:
|
|
MAKELIST
creates a list from its arguments; in this case the list has just one element. The initial balance of account 2 before these transactions start is 0.
Q15.7 What possible values can account 2 have after completing these two transactions (assuming no crashes)?
A. 0
B. 1
C. 2
D. 3
E. 4
Ben is surprised by the order of the operations in CONCURRENT_COMMIT
, since COMMIT
is expensive (requiring synchronous writes to the log disk). It would be faster to release the locks first.
Q15.8 If the initial balance of account 2 is zero, what possible values can account 2 have after completing these two transactions (assuming no crashes) if the locks are released before the call to COMMIT
?
A. 0
B. 1
C. 2
D. 3
E. 4
Problem Set 16: Whisks\(^*\)
\(^*\) Credit for developing this problem set goes to Eddie Kohler.
1996–3–4a…d
(Chapter 3)
The Odd Disk Company (ODC) has just invented a new kind of non-volatile storage, the Whisk. A Whisk is unlike a disk in the following ways:
- Compared with disks, Whisks have very low read and write latencies.
- On the other hand, the data rate when reading and writing a Whisk is much less than that of a disk.
- Whisks are associative. Where disks use sector addresses, a Whisk block is named with a pair of items: an address and a tag. We write these pairs as \(A/t\), where \(A\) is the address and \(t\) is the tag. Thus, for example, there might be three blocks on the Whisk with address 49, each with different tags: 49/1, 49/2, and 49/97.
The Whisk provides four important operations:
data ← GET (A/t)
: This is the normal read operation.PUT (A/t, data)
: Just like a normal disk. If the system crashes during a write operation, a partially written block may result.boolean ← EXISTS (A/t)
: ReturnsTRUE
if blockA/t
exists on the Whisk.CHANGE_TAG (A/m, n)
: Atomically changes the tagm
of blockA/m
ton
(deleting any previous blockA/n
in the process). The atomicity includes both all-or-nothing atomicity and before-or-after atomicity.
Ben Bitdiddle is excited about the properties of Whisks. Help him develop different storage systems using Whisks as the medium.
Q16.1 At first, Ben emulates a normal disk by writing all blocks with tag 0. But now he wants to add an ATOMIC_PUT
operation. Design an ATOMIC_PUT
for Ben's Whisk, and identify the step that is the commit point.
Ben has started work on a Whisk transaction system; he'd like you to help him finish it. Looking through his notes, you see that Ben's system will use no caches or logs: all writes go straight to the Whisk. One sentence particularly catches your eye: a joyfully scrawled "Transaction IDs Are Tags!!" Ben's basic idea is this. The current state of the database will be stored in blocks with tag 0. When a transaction t
writes a block, the data is stored in the separate block A/t
until the transaction commits.
Ben has set aside a special disk address, ComRec, to hold commit records for all running transactions. For a transaction t
, the contents of ComRec/t
is either committed, aborted, or pending, depending on the state of transaction t
.
So far, three procedures have been implemented. In these programs, t
is a transaction ID, A
is a Whisk block address, and data
is a data block.
procedure AA_BEGIN (t) |
procedure AA_READ (t, A) |
procedure AA_WRITE (t, A, data) |
The following questions are concerned only with all-or-nothing atomicity; there are no concurrent transactions.
Q16.2 Write pseudocode for AA_COMMIT (t)
and AA_ABORT (t)
, and identify the commit point in AA_COMMIT
. Assume that the variable dirty
, an array with num_dirty
elements, holds all the addresses to which t
has written. (Don't worry about any garbage an aborted transaction might leave on disk, and assume transaction IDs are never reused.)
Q16.3 Write the pseudocode for AA_RECOVER
, the program that handles recovery after a crash. Ben has already done some of the work: his code examines the ComRec blocks and determines which transactions are COMMITTED
, ABORTED
, or PENDING
. When your pseudocode is called, he has already set 6 variables for you (you might not need them all):
num_committed // the number of committed transactions
committed[i] // an array holding the committed transactions’ IDs
num_aborted // the number of aborted transactions
aborted[i] // an array holding the aborted transactions’ IDs
num_pending // the number of transactions in progress
in_progress[i] // an array holding the in-progress transactions’ IDs
Whisk addresses run from 0 to N.
Problem Set 17: ANTS (Advanced "Nonce-ensical Transcation System)\(^*\)
\(^*\) Credit for developing this problem set goes to Hari Balakrishnan.
2002–3–6…13
(Chapter 3)
Sara Bellum, forever searching for elegance, sets out to design a new transaction system called ANTS, based on the idea of nonces. She observes that the locking schemes she learned in Chapter 3 cause transactions to wait for locks held by other transactions. She observes that it is possible for a transaction to simply abort and retry, instead of waiting for a lock. A little bit more work convinces her that this idea may allow her to design a system in which transactions don't need to use locks for before-or-after atomicity.
Sara sets out to write pseudocode for the following operations: BEGIN ()
, READ ()
, WRITE ()
, COMMIT ()
, ABORT ()
, and RECOVER ()
. She intends that, together, these operations will provide transaction semantics: before-or-after atomicity, all-or-nothing atomicity, and durability. You may assume that once any of these operations starts, it runs to completion without preemption or failure, and that no other thread is running any of the procedures at the same time. The system may interleave the execution of multiple transactions, however.
Sara's implementation assigns a transaction identifier (TID) to a transaction when it calls BEGIN ()
. The TIDs are integers, and ANTS assigns them in numerically increasing order. Sara's plan for the transaction system's storage is to maintain cell storage for variables, and a write-ahead log for recovery. Sara implements both the cell storage and the log using non-volatile storage. The log contains the following types of records:
STARTED TID
COMMITTED TID
ABORTED TID
UPDATED TID, Variable Name, Old Value
Sara implements BEGIN ()
, COMMIT ()
, ABORT ()
, and RECOVER ()
as follows:
BEGIN ()
allocates the next TID , appends aSTARTED
record to the log, and returns the TID.COMMIT ()
appends aCOMMITTED
record to the log and returns.ABORT (TID)
undoes all of transactionTID
'sWRITE ()
operations by scanning the log backwards and writing the old values from the transaction'sUPDATED
records back to the cell storage. After completing the undo,ABORT (TID)
appends anABORTED
entry to the log, and returns.RECOVER ()
is called after a crash and restart, before starting any more transactions. It scans the log backwards, undoing eachWRITE
record of each transaction that had neither committed nor aborted at the time of the crash.RECOVER ()
appends oneABORTED
record to the log for each such transaction.
Sara's before-or-after intention is that the result of executing multiple transactions concurrently is the same as executing those same transactions one at a time, in increasing TID order. Sara wants her READ ()
and WRITE ()
implementations to provide before-or-after atomicity by adhering to the following rule:
Suppose a transaction with TID t
executes READ (X)
. Let u
be the highest TID less than t
that calls WRITE (X)
and commits. The READ (X)
executed by t
should return the value that u
writes.
Sara observes that this rule does not require her system to execute transactions in strict TID order. For example, the fact that two transactions call READ ()
on the same variable does not (by itself) constrain the order in which the transactions must execute.
To see how Sara intends ANTS to work, consider the following two transactions:
Transaction TA | Transaction TB | |
1 |
tida ← BEGIN () // returns 15 |
|
2 |
tidb ← BEGIN () // returns 16 |
|
3 |
va ← READ (tida, X) |
|
4 |
va ← va + 1 |
|
5 |
vb ← READ (tidb, X) |
|
6 |
vb ← vb + 1 |
|
α |
WRITE (tida, X, va) |
WRITE (tidb, X, vb) |
β |
COMMIT (tida) |
COMMIT (tidb) |
Each transaction marks its start with a call to BEGIN
, then reads the variable X
from the cell store and stores it in a local variable, then adds one to that local variable, then writes the local variable to X
in the cell store, and then commits. Each transaction passes its TID ( tida
and tidb
respectively) to the READ
, WRITE
, and COMMIT
procedures.
These transactions both read and write the same piece of data, X
. Suppose that TA
starts just before TB
, and Sara's BEGIN
allocates TIDs 15 and 16 to TA and TB, respectively. Suppose that ANTS interleaves the execution of the transactions as shown through line 6, but that ANTS has not yet executed lines α
and β
. No other transactions are executing, and no failures occur.
Q17.1 In this situation, which of the following actions can ANTS take in order to ensure before-or-after atomicity?
A. Force just TA to abort, and let TB proceed.
B. Force just TB to abort, and let TA proceed.
C. Force neither TA nor TB to abort, and let both proceed.
D. Force both TA and TB to abort.
To help enforce the before-or-after intention, Sara's implementation of ANTS maintains the following two pieces of information for each variable:
ReadID
— the TID of the highest-numbered transaction that has successfully read this variable usingREAD
.WriteID
— the TID of the highest-numbered transaction that has successfully written this variable usingWRITE
.
Sara defines the following utility procedures in her implementation of ANTS:
INPROGRESS (TID)
returnsFALSE
ifTID
has committed or aborted, and otherwiseTRUE
. (All transactions interrupted by a crash are aborted by theRECOVER
procedure.)EXIT ()
terminates the current thread immediately.LOG ()
appends a record to the log and waits for the write to the log to complete.READ_DATA (x)
reads cell storage and returns the corresponding value.WRITE_DATA (x, v)
writes valuev
into cell storagex
.
Sara now sets out to write pseudocode for READ
and WRITE
:
1 procedure READ (tid, x) // Return the value stored in cell x
2 if tid < x.WriteID then
3 ABORT (tid)
4 EXIT ()
5 if tid > x.WriteID and INPROGRESS (x.WriteID) then
6 ABORT (tid) // Last transaction to have written x is still in progress
7 EXIT ()
8 v ← READ_DATA (x) // In all other cases execute the read
9 x.ReadID ← MAX (tid, x.ReadID) // Update ReadID of x
10 return v
11 procedure WRITE (tid, x, v) // Store value v in cell storage x
12 if tid < x.ReadID then
13 ABORT (tid)
14 EXIT ()
15 else if tid < x.WriteID then
16 [Mystery Statement I] // See question 17.3
17 else if tid > x.WriteID and INPROGRESS(x.WriteID) then
18 ABORT (tid)
19 EXIT ()
20 LOG (WRITE, tid, x, READ_DATA (x))
21 WRITE_DATA (x, v)
22 [Mystery Statement II] // Now update ReadID of x (see question 17.5)
Help Sara complete the design above by answering the following questions.
Q17.2 Consider lines 5–7 of READ. Sara is not sure if these lines are necessary. If lines 5–7 are removed, will the implementation preserve Sara's before-or-after intention?
A. Yes, the lines can be removed. Because the previous WRITE
to x
, by the transaction identified by x.WriteID
, cannot be affected by transaction tid
, READ_DATA (x)
can safely execute.
B. Yes, the lines can be removed. Suppose transaction T1 successfully executes WRITE (x)
, and then transaction T2 executes READ (x)
before T1 commits. After this, T1 cannot execute WRITE (x)
successfully, so T2 would have correctly read the last written value of x
from T1.
C. No, the lines cannot be removed. One reason is: The only transaction that can correctly execute READ_DATA (x)
is the transaction with TID equal to x.WriteID
. Therefore, the condition on line 5 of READ
should simply read: "if tid > x.WriteID
".
D. No, the lines cannot be removed. One reason is that before-or-after atomicity might not be preserved when transactions abort.
Q17.3 Consider Mystery Statement I on line 16 of WRITE .
Which of the following operations for this statement preserves Sara's before-or-after intention?
A. ABORT (tid)
; EXIT ()
;
B. return
(without aborting tid
)
C. Find the higher-numbered transaction Th corresponding to x.WriteID
; ABORT (Th)
and terminate the thread that was running Th; perform WRITE_DATA (x, v)
in transaction tid
; and return.
D. All of the above choices.
Q17.4 Consider lines 17–19 of WRITE
. Sara is not sure if these lines are necessary. If lines 17–19 are removed, will Sara's implementation preserve her before-or-after intention? Why or why not?
A. Yes, the lines can be removed. We can always recover the correct values from the log.
B. Yes, the lines can be removed since this is the WRITE
call; it's only on a READ
call that we need to be worried about the partial results from a previous transaction being visible to another running transaction.
C. No, the lines cannot be removed. One reason is: If transaction T1 writes to cell x
and then transaction T2 writes to cell x
, then an abort of T2 followed by an abort of T1 may leave x
in an incorrect state.
D. No, the lines cannot be removed. One reason is: If transaction T1 writes to cell x
and then transaction T2 writes to cell x
, then an abort of T1 followed by an abort of T2 may leave x
in an incorrect state.
Q17.5 Which of these operations for Mystery Statement II on line 22 of WRITE
preserves Sara's before-or-after intention?
A. (x.WriteID) ← tid
B. (x.WriteID) ← MIN(x.WriteID, tid)
C. (x.WriteID) ← MAX(x.WriteID, tid)
D. (x.WriteID) ← MAX(x.WriteID, x.ReadID)
Ben Bitdiddle looks at the READ
and WRITE
pseudocode shown above for Sara's system and concludes that her system is in fact nonsensical! To make his case, he constructs the following concurrent transactions:
Transaction T1 | Transaction T2 | |
1 |
tid1 ← BEGIN () |
|
2 |
tid2 ← BEGIN () |
|
3 |
WRITE (tid1, A, v1) |
|
4 |
v2 ← READ (tid2, A) |
|
5 |
WRITE (tid2, B, v2) |
|
6 |
COMMIT (tid2) |
|
7 |
v1 ← READ (tid1, B) |
|
8 |
COMMIT (tid1) |
The two transactions are interleaved in the order shown above. Note that T1 begins before T2. Ben argues that this leads to a deadlock.
Q17.6 Why is Ben's argument incorrect?
A. Both transactions will abort, but they can both retry if they like.
B. Only T2 will abort on line 4. So, T1 can proceed.
C. Only T1 will abort on line 7. So, T2 can proceed.
D. Sara's system does not suffer from deadlocks, though concurrent transactions may repeatedly abort and never commit.
Recall that Sara uses a write-ahead log for crash recovery.
Q17.7 Which of these statements is true about log entries in Sara's ANTS implementation?
A. The order of STARTED
entries in the log is in increasing TID order.
B. The order of COMMITTED
entries in the log is in increasing TID order.
C. The order of ABORTED
entries in the log is in increasing TID order.
D. The order of UPDATED
entries in the log for any given variable is in increasing TID order.
Q17.8 The WRITE
procedure appends the UPDATED
record to the log before it installs in cell storage. Sara wants to improve performance by caching the non-volatile cell storage in the volatile main memory. She changes READ_DATA
to read the value from the cache if it is there; if it isn't, READ_DATA
reads from non-volatile cell storage. She changes WRITE_DATA
to update just the cache; ANTS will install to non-volatile cell storage later. Can ANTS delay the install to non-volatile cell storage until after the COMMITTED
record has been written to the log, and still ensure transaction semantics?
A. No, because if the system crashed between the COMMIT
and the write to non-volatile storage, RECOVER
would not recover cell storage correctly.
B. Yes, because the log contains enough information to undo uncommitted transactions after a crash.
C. Yes, because line 3 of READ
won't let another transaction read the data until after the write to non-volatile storage completes.
D. None of the above.
Problem Set 18: KeyDB\(^*\)
\(^*\) Credit for developing this problem set goes to Hari Balakrishnan.
2005–3–13
(Chapter 3)
Keys–R–Us has contracted with you to implement an in-memory key-value transactional store named KeyDB. KeyDB provides a hash table interface to store key-value bindings and to retrieve the value previously associated with a key.
You decide to use locks to provide before-or-after atomicity. Lock Lk
is a lock for key k
, which corresponds to the entry KeyDB[k]
. A single transaction may read or write multiple KeyDB entries. Your goal is to achieve correct before-or-after atomicity for all transactions that use KeyDB. Transactions may abort. ACQUIRE (Lk)
is called before the first READ
or WRITE
to KeyDB[k]
and RELEASE (Lk)
is called after the last access to KeyDB[k]
.
Q18.1 For each of the following locking rules, is the rule is necessary, sufficient, or neither necessary nor sufficient to always guarantee correct before-or-after atomicity between any set of concurrent transactions?
A. An ACQUIRE (Lk)
must be performed after the start of a transaction and before the first READ
or WRITE
of KeyDB[k]
, and a RELEASE (Lk)
must be performed some time after the last READ
or WRITE
of KeyDB[k]
and before the end of the transaction.
B. ACQUIRE
s of every needed lock must occur after the start of a transaction and before any other operation, and there can be no RELEASE
of a lock before COMMIT
or ABORT
if the corresponding data item was modified by the thread.
C. ACQUIRE
s of every needed lock must occur after the start of a transaction and before the first RELEASE
, and there can be no RELEASE
of a lock before COMMIT
or ABORT
if the corresponding data item was modified by the thread.
D. All threads that ACQUIRE
more than one lock must ACQUIRE
the locks in the same order, and there may be no RELEASE
s of locks before COMMIT
or ABORT
.
E. ACQUIRE
s of every needed lock must occur after the start of a transaction and before the first RELEASE
, and a lock may be RELEASE
d at at any time after the last READ
or WRITE
of the corresponding data before COMMIT
or ABORT
.
Q18.2 Determine whether each of the following locking rules either avoids or is likely (with probability approaching 1 as time goes to infinity) to eliminate permanent deadlock between any set of concurrent transactions.
A. ACQUIRE
s of every needed lock must occur after the start of a transaction and before any other operation, and there can be no RELEASE
of a lock before COMMIT
or ABORT
.
B. ACQUIRE
s of every needed lock must occur after the start of a transaction and before the first RELEASE
, and there can be no RELEASE
of a lock before COMMIT
or ABORT
.
C. All threads that ACQUIRE
more than one lock must ACQUIRE
the locks in the same order.
D. When a transaction begins, set a timer to a value longer than the transaction is expected to take. If the timer expires, ABORT
the transaction and try it again with a timer set to a value chosen with random exponential backoff.
Problem Set 19: Alice's Reliable Book Store\(^*\)
\(^*\) Credit for developing this problem set goes to Barbara Liskov
2006–3–9
(Chapter 3)
Alice has implemented a version of ALL_OR_NOTHING_PUT
and ALL_OR_NOTHING_GET
using only two copies, based an idea she got by reading Section 3.8.1. Her implementation appears below. In her implementation each virtual all-or-nothing sector x
is stored at two disk locations, x.D0
and x.D1
, which are updated and read as follows:
// Write the bits in data at item x
1 procedure ALL_OR_NOTHING_PUT (data, x)
2 flag ← CAREFUL_GET (buffer, x.D0); // read into a temporary buffer
3 if flag = OK then
4 CAREFUL_PUT (data, x.D1);
5 CAREFUL_PUT (data, x.D0);
6 else
7 CAREFUL_PUT (data, x.D0);
8 CAREFUL_PUT (data, x.D1);
// Read the bits of item x and return them in data
1 procedure ALL_OR_NOTHING_GET (reference data, x)
2 flag ← CAREFUL_GET (data, x.D0);
3 if flag = OK then
4 return;
5 CAREFUL_GET (data, x.D1);
The CAREFUL_GET
and CAREFUL_PUT
procedures are as specified in Section 2.6.4.5 and Figure \(2.6.1\). The property of these two procedures that is relevant is that CAREFUL_GET
can detect cases when the original data is damaged by a system crash during CAREFUL_PUT
.
Assume that the only failure to be considered is a fail-stop failure of the system during the execution of ALL_OR_NOTHING_GET
or ALL_OR_NOTHING_PUT
. After a fail-stop failure the system restarts.
Q19.1 Which of the following statements are true and which are false for Alice's implementation of ALL_OR_NOTHING_PUT
and ALL_OR_NOTHING_GET
?
A. Her code obeys the rule "never overwrite the only copy."
B. ALL_OR_NOTHING_PUT
and ALL_OR_NOTHING_GET
ensure that if just one of the two copies is good (i.e., CAREFUL_GET
will succeed for one of the two copies), the caller of ALL_OR_NOTHING_GET
will see it.
C. ALL_OR_NOTHING_PUT
and ALL_OR_NOTHING_GET
ensure that the caller will always see the result of the last ALL_OR_NOTHING_PUT
that wrote at least one copy to disk.
Q19.2 Suppose that when ALL_OR_NOTHING_PUT
starts running, the copy at x.D0
is good. Which statement's completion is the commit point of ALL_OR_NOTHING_PUT
?
Q19.3 Suppose that when ALL_OR_NOTHING_PUT
starts running, the copy at x.D0
is bad. Which statement's completion is the commit point of ALL_OR_NOTHING_PUT
?
Consider the following chart showing possible states that the data could be in prior running ALL_OR_NOTHING_PUT
:
State 1 | State 2 | State 3 | |
---|---|---|---|
x.D0 |
old | old | bad |
x.D1 |
old | bad | old |
For example, when the system is in state 2, x.D0
contains an old value and x.D1
contains a bad value, meaning that CAREFUL_GET
will return an error if someone tries to read x.D1 .
Q19.4 Assume that ALL_OR_NOTHING_PUT
is attempting to store a new value into item x
and the system fails. Which of the following statements are true?
A. ( x.D0
= new, x.D1
= new) is a potential outcome of ALL_OR_NOTHING_PUT
, starting in any of the three states.
B. Starting in state S1, a possible outcome is ( x.D0
= bad, x.D1
= old).
C. Starting in state S2, a possible outcome is ( x.D0
= bad, x.D1
= new).
D. Starting in state S3, a possible outcome is ( x.D0
= old, x.D1
= new).
E. Starting in state S1, a possible outcome is ( x.D0
= old, x.D1
= new).
Ben Bitdiddle proposes a simpler version of ALL_OR_NOTHING_PUT
. His simpler version, named SIMPLE_PUT
, would be used with the existing ALL_OR_NOTHING_GET
.
procedure SIMPLE_PUT (data, x)
CAREFUL_PUT (data, x.D0)
CAREFUL_PUT (data, x.D1)
Q19.5 Will the system work correctly if Ben replaces ALL_OR_NOTHING_PUT
with SIMPLE_PUT
? Explain.
Q19.6 Now consider failures other than system failures while running the original ALL_OR_NOTHING_PUT
. Which of the following statements is true and which false?
A. Suppose x.D0
and x.D1
are stored on different disks. Then ALL_OR_NOTHING_PUT
and ALL_OR_NOTHING_GET
also mask a single head crash (i.e., the disk head hits the surface of a spinning platter), assuming no other failures.
B. Suppose x.D0
and x.D1
are stored as different sectors on the same track. Then ALL_OR_NOTHING_PUT
and ALL_OR_NOTHING_GET
also mask a single head crash, assuming no other failures.
C. Suppose that the failure is that the operating system overwrites the in-memory copy of the data being written to disk by ALL_OR_NOTHING_PUT
. Nevertheless, ALL_OR_NOTHING_PUT
and ALL_OR_NOTHING_GET
mask this failure, assuming no other failures.
Now consider how to handle decay failures. The approach is to periodically correct them by running a SALVAGE
routine. This routine checks each replicated item periodically and if one of the two copies is bad, it overwrites that copy with the good copy. The code for SALVAGE
is in Figure 3.8.1.
Assume that there is a decay interval \(D\) such that at least one copy of a duplicated sector will probably still be good \(D\) seconds after the last execution of ALL_OR_NOTHING_PUT
or SALVAGE
on that duplicated sector. Further assume that the system recovers from a failure in less than \(F\) seconds, where \(F << D\), and that system failures happen so infrequently that it is unlikely that more than one will happen in a period of \(D\) seconds.
Q19.7 Which of the following methods ensures that the approach handles decay failures with very high probability?
A. SALVAGE
runs only in a background thread that cycles through the disk with the guarantee that each replicated sector is salvaged every \(P\) seconds, where \(P\) is less than \(D - F\).
B. SALVAGE
runs as the first step of ALL_OR_NOTHING_PUT
, and only then.
C. SALVAGE
runs as the first step of both ALL_OR_NOTHING_PUT
and ALL_OR_NOTHING_GET
, and only then.
D. SALVAGE
runs on all duplicated sectors as part of recovering from a fail-stop failure, and only then.
Problem Set 20: Establishing Serializability\(^*\)
\(^*\) Credit for developing this problem set goes to Barbara Liskov.
2006–3–17
(Chapter 3)
Chapter 3 explained that one technique of ensuring correctness is to serialize concurrent transactions that act on shared variables, and it offered methods such as version histories or two-phase locking to ensure serialization. Louis Reasoner has come up with his own locking scheme that does not have an easy proof of correctness, and he wants to know whether or not it actually leads to correct results. Louis implements his locking scheme, runs a particular set of three transactions two different times, and observes the order in which individual actions of the transactions occur. The observed order is known as a schedule.
Here are Louis's three transactions:
- T1:
BEGIN (); WRITE (x); READ (y); WRITE (z); COMMIT ();
- T2:
BEGIN (); READ (x); WRITE (z); COMMIT ();
- T3:
BEGIN (); READ (z); WRITE (y); COMMIT ();
The records x
, y
and z
are stored on disk. Louis's first run produces schedule 1 and his second run produces schedule 2:
Schedule 1 | Schedule 2 | |
---|---|---|
1 |
T1: WRITE (x) |
T3: READ (z) |
2 |
T2: READ (x) |
T2: READ (x) |
3 |
T1: READ (y) |
T1: WRITE (x) |
4 |
T3: READ (z) |
T3: WRITE (y) |
5 |
T3: WRITE (y) |
T1: READ (y) |
6 |
T2: WRITE (z) |
T2: WRITE (z) |
7 |
T1: WRITE (z) |
T1: WRITE (z) |
The question Louis needs to answer is whether or not these two schedules can be serialized. One way to establish serializability is to create what is called an action graph. An action graph contains one node for each transaction and an arrow (directed edge) from Ti to Tj if Ti and Tj both use the same record r
in conflicting modes (that is, both transactions write r
or one writes r
before the other reads r
) and Ti uses r
first. If for a particular schedule there is a cycle in its action graph, that schedule is not serializable. If there is no cycle, then the arrows reveal a serialization of those transactions.
Q20.1 The table below lists all of the possible arrows that might lead from one transaction to another. For schedule 1, fill in the table, showing whether or not that arrow exists, and if so list the two steps that create that arrow. To get you started, one row is filled in. Also, draw the arrows in Figure \(PS.20.1\).
Arrow | Exists? | Steps |
---|---|---|
T1 → T2 | ||
T1 → T3 | ||
T2 → T1 | ||
T2 → T3 | ||
T3 → T1 | ||
T3 → T2 | Yes | 4 and 6 |
Figure \(\PS.20.1\): Answer template for Q20.1.
Q20.2 Is schedule 1 serializable? If not, explain briefly why not. If so, give a serial schedule for it.
Q20.3 Now fill in the table for schedule 2. This time you get to fill in the whole table yourself. Also, draw in the arrows in Figure \(PS.20.2\).
Arrow | Exists? | Steps |
---|---|---|
T1 → T2 | ||
T1 → T3 | ||
T2 → T1 | ||
T2 → T3 | ||
T3 → T1 | ||
T3 → T2 |
Figure \(PS.20.2\): Answer template for Q20.2.
Q20.4 Is schedule 2 serializable? If not, explain why not. If so, give a serial schedule for it.
Q20.5 Could schedule 2 have been produced by two-phase locking, in which a transaction acquires a lock on an object as the first part of the step in which it first uses that object? For example, step 3 of schedule 2 is the first time that transaction T1 uses record x
, so it would start that step by acquiring a lock for x
. Explain.
Louis is also concerned about recovery. When he ran the three transactions and obtained schedule 2, he found that the system generated the following log:
1 BEGIN (transaction: T1)
2 BEGIN (transaction: T2)
3 BEGIN (transaction: T3)
4 CHANGE (transaction: T1, record: x, undo: 1, redo: 2)
5 CHANGE (transaction: T3, record: y, undo: 1, redo: 2)
6 CHANGE (transaction: T2, record: z, undo: 1, redo: 2)
7 COMMIT (transaction: T3)
8 COMMIT (transaction: T2)
9 CHANGE (transaction: T1, record: z, undo: 2, redo: 3)
10 COMMIT (transaction: T1)
In a CHANGE
record the undo field gives the old value before this change, and the redo field gives the new value afterwards. For example, entry 4 indicates that the old value of x
was 1 and the new value is 2. The system uses the redo/undo recovery procedure of Figure 3.4.7.
Q20.6 Suppose the system crashed after record 7 of the log has made it to disk but before record 8 is written. What states do x
, y
, and z
have after recovery is complete?
Q20.7 Suppose instead that the system crashed after record 9 of the log has made it to disk but before record 10 is written. What states do x
, y
, and z
have after recovery is complete?
Louis's database consists of a collection of integer objects stored on disk. Each WRITE
operation increments by 1 the object being modified. The system is using a write-ahead logging protocol and there is an in-memory cache that the system periodically flushes to disk, without checking to see if the cached objects belong to committed transactions.
To save space in the log, Louis's friend Ben Bitdiddle suggests that CHANGE
records could just indicate the operation that was performed. For example, log entry 4 would be:
4 CHANGE (transaction: T1, record: x, operation: increment)
When the recovery manager sees this entry, it performs the specified operation: increment x
by 1. Ben makes no other changes to the recovery protocol.
Q20.8 All objects are initialized to 0. Louis tries Ben's plan, but after the first system crash and recovery he discovers that it doesn't work. Explain why.