
6.3: Suggestions for Further Reading


    Chapter 1: The Network as a System and as a System Component

    Proceedings of the IEEE 66, 11 (November 1978), a special issue of that journal devoted to packet switching, contains several papers mentioned under various topics here. Collectively, they provide an extensive early bibliography on computer communications.

    Topic 1.1: Networks

    1.1.1 Radia Perlman. Interconnections, Second Edition: Bridges, Routers, Switches, and Internetworking Protocols. Addison-Wesley, 1999. ISBN: 978–0–201–63448–8. 560 pages.

    This book presents everything you could possibly want to know about how the network layer actually works. The style is engagingly informal, but the content is absolutely first-class, and every possible variation is explored. The previous edition was simply titled Interconnections: Bridges and Routers.

    1.1.2 David D. Clark, Kenneth T. Pogran, and David P. Reed. An introduction to local area networks. Proceedings of the IEEE 66, 11 (November 1978), pages 1497–1517.

    This basic tutorial on local area network communications characterizes the various modular components of a local area network, both interface and protocols, gives specific examples, and explains how local area networks relate to larger, interconnected networks. The specific examples are now out of date, but the rest of the material is timeless.

    1.1.3 Robert M. Metcalfe and David R. Boggs. Ethernet: Distributed packet switching for local computer networks. Communications of the ACM 19, 7 (July 1976), pages 395–404.

    This paper provides the design of what has proven to be the most popular local area network technology.

    Topic 1.2: Protocols

    1.2.1 Louis Pouzin and Hubert Zimmermann. A tutorial on protocols. Proceedings of the IEEE 66, 11 (November 1978), pages 1346–1370.

    This paper is well written and provides perspective along with the details. The fact that it was written a long time ago turns out to be its major appeal. Because networks were not widely understood at the time, it was necessary to fully explain all of the assumptions and offer extensive analogies. This paper does an excellent job of both, and as a consequence it provides a useful complement to modern texts. While reading this paper, anyone who is familiar with current network technology will frequently exclaim, "So that's why the Internet works that way." 

    1.2.2 Vinton G. Cerf and Peter T. Kirstein. Issues in packet-network interconnection. Proceedings of the IEEE 66, 11 (November 1978), pages 1386–1408.

    At the time this paper was written, an emerging problem was the interconnection of independently administered data communication networks. This paper explores the issues in both breadth and depth, a combination that more recent papers do not provide.

    1.2.3 David D. Clark and David L. Tennenhouse. Architectural considerations for a new generation of protocols. ACM SIGCOMM ’90 Conference: Communications Architectures and Protocols, in Computer Communication Review 20, 4 (September 1990), pages 200–208.

    This paper captures 20 years of experience in protocol design and implementation and lays out the requirements for the next few rounds of protocol design. The basic observation is that the performance requirements of future high-speed networks and applications will require that the layers used for protocol description not constrain implementations to be similarly layered. This paper is required reading for anyone who is developing a new protocol or protocol suite.

    1.2.4 Danny Cohen. On holy wars and a plea for peace. IEEE Computer 14, 10 (October 1981), pages 48–54.

    This is an entertaining discussion of big-endian and little-endian arguments in protocol design.

    1.2.5 Danny Cohen. Flow control for real-time communication. Computer Communication Review 10, 1–2 (January/April 1980), pages 41–47.

    This brief item is the source of the "servant's dilemma", a parable that provides helpful insight into why flow control decisions must involve the application.

    1.2.6 Geoff Huston. Anatomy: A look inside network address translators. The Internet Protocol Journal 7, 3 (September 2004), pages 2–32.

    Network address translators (NATs) break down the universal connectivity property of the Internet: when NATs are in use, one can no longer assume that every computer in the Internet can communicate with every other computer in the Internet. This paper discusses the motivation for NATs, how they work, and in what ways they create havoc for some Internet applications.

    1.2.7 Van Jacobson. Congestion avoidance and control. Proceedings of the Symposium on Communications Architectures and Protocols (SIGCOMM '88), pages 314–329. Also in Computer Communication Review 18, 4 (August 1988).

    Sidebar 1.7.2 gives a simplified description of the congestion avoidance and control mechanism of TCP, the most commonly used transport protocol in the Internet. This paper explains those mechanisms in full detail. They are surprisingly simple but have proven to be effective.

    1.2.8 Jordan Ritter. Why Gnutella can't scale. No, really. Unpublished grey literature. <http://www.darkridge.com/~jpr5/doc/gnutella.html>.

    This paper offers a simple performance model to explain why the Gnutella protocol (see problem set 20) cannot support large networks of Gnutella peers. The problem is incommensurate scaling of its bandwidth requirements.

    1.2.9 David B. Johnson. Scalable support for transparent mobile host internetworking. Wireless Networks 1, 3 (1995), pages 311–321.

    Addressing a laptop computer that is connected to a network by a radio link and that can move from place to place without disrupting network connections can be a challenge. This paper proposes a systematic approach based on maintaining a tunnel between the laptop computer's current location and an agent located at its usual home location. Variations of this paper (based on the author's 1993 Ph.D. thesis at Carnegie-Mellon University and available as CMU Computer Science Technical Report CS–93–128) have appeared in several 1993 and 1994 workshops and conferences, as well as in the book Mobile Computing, Tomasz Imielinski and Henry F. Korth, editors, Kluwer Academic Publishers, c. 1996. ISBN: 0–7923–9697–9.

    Subtopic 1.2.1: Remote Procedure Call

    1.2.1.1 Andrew D. Birrell and Bruce Jay Nelson. Implementing remote procedure calls. ACM Transactions on Computer Systems 2, 1 (February 1984), pages 39–59.

    A well-written paper that shows, first, the simplicity of the basic idea; second, the complexity required to deal with real implementations; and third, the refinements needed for high effectiveness.

    1.2.1.2 Andrew S. Tanenbaum. Modern Operating Systems. Prentice-Hall, third edition, 2008. ISBN 978–0–13–600663-3 (hardcover). 952 pages.

    This book provides a thorough tutorial introduction to the world of operating systems but with a tendency to emphasize the mechanics. Insight into why things are designed the way they are is there, but in many cases it requires teasing out. Nevertheless, as a starting point, it is filled with street knowledge that is needed to get into the rest of the literature. It includes useful case studies of GNU/Linux, Windows Vista, and Symbian OS, an operating system for mobile phones. Section 4.4 covers the protocol of remote procedure call in detail.

    Topic 1.3: Organization for communication

    1.3.1 Leonard Kleinrock. Principles and lessons in packet communications. Proceedings of the IEEE 66, 11 (November 1978), pages 1320–1329.

    1.3.2 Lawrence G. Roberts. The evolution of packet switching. Proceedings of the IEEE 66, 11 (November 1978), pages 1307–1313.

    These two papers discuss experience with the ARPANET. Anyone faced with the need to design a network should look over these two papers, which focus on lessons learned and the sources of surprise.

    1.3.3 J[erome] H. Saltzer, D[avid]. P. Reed, and D[avid]. D. Clark. End-to-end arguments in system design. ACM Transactions on Computer Systems 2, 4 (November 1984), pages 277–288. An earlier version appears in the Proceedings of the Second International Conference on Distributed Computing Systems (April 1981), pages 504–512.

    This paper proposes a design rationale for deciding which functions belong in which layers of a layered network implementation. It is one of the few papers available that provides a system design principle.

    1.3.4 Leonard Kleinrock. The latency/bandwidth trade-off in gigabit networks. IEEE Communications Magazine 30, 4 (April 1992), pages 36–40.

    Technology has made gigabit/second data rates economically feasible over long distances. But long distances and high data rates conspire to change some fundamental properties of a packet network—latency becomes the dominant factor that limits applications. This paper provides a good explanation of the problem.

    Topic 1.4: Practical aspects

    For the complete word on the Internet protocols, check out the following series of books.

    1.4.1 W. Richard Stevens. TCP/IP Illustrated. Addison-Wesley; v. 1, 1994, ISBN 0–201–63346–9, 576 pages; v. 2 (with co-author Gary R. Wright), 1995, ISBN 0–201–63354–X, 1174 pages; v. 3, 1996, ISBN 0–201–63495–3, 328 pages. Volume 1: The Protocols. Volume 2: The Implementation. Volume 3: TCP for Transactions, HTTP, NNTP, and the UNIX® Domain Protocols.

    These three volumes will tell you more than you wanted to know about how TCP/IP is implemented, using the network implementation of the Berkeley System Distribution for reference. The word "illustrated" refers more to computer printouts—listings of packet traces and programs—than to diagrams. If you want to know how some aspect of the Internet protocol suite is actually implemented, this is the place to look—though it does not often explain why particular implementation choices were made.

     

    Chapter 2: Fault Tolerance

    Topic 2.1: Fault tolerance

    2.1.1 Jim [N.] Gray and Andreas Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, San Mateo, California, 1993 (Look for the low-bulk paper edition, which became available with the third printing in 1994). ISBN: 978–1–55860–190–1. 1,070 pages.

    All aspects of fault tolerance, atomicity, coordination, recovery, rollback, logs, locks, transactions, and engineering trade-offs for performance are pulled together in this comprehensive book. This is the definitive work on transactions. Though not intended for beginners, the book makes this complex material surprisingly accessible thanks to the high quality of its explanations. The glossary of terms is excellent; the historical notes are good as far as they go, but they are somewhat database-centric and should not be taken as the final word. In particular, Chapter 3 provides a bedrock text on the subject of fault tolerance.

    2.1.2 Jim [N.] Gray and Daniel P. Siewiorek. High-availability computer systems. Computer 24, 9 (September 1991), pages 39–48.

    This is a nice, easy-to-read overview of how high availability can be achieved.

    2.1.3 Daniel P. Siewiorek. Architecture of fault-tolerant computers. Computer 17, 8 (August 1984), pages 9–18.

    This paper provides an excellent taxonomy, as well as a good overview of several architectural approaches to designing computers that continue running even when a single hardware component fails.

    Topic 2.2: Software errors

    2.2.1 Dawson Engler et al. Bugs as deviant behavior: A general approach to inferring errors in systems code. Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles, 2001, in Operating Systems Review 35, 5 (December 2001), pages 57–72.

    This paper describes a method for finding possible programming faults in large systems by looking for inconsistencies. For example, if in most cases an invocation of a certain function is preceded by disabling interrupts but in a few cases it is not, there is a good chance that a programming fault is present. The paper uses this insight to create a tool for finding potential faults in large systems.
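
    As a concrete illustration of the inference described above, here is a minimal sketch in Python (not the authors' tool, which analyzes C source): it flags call sites that violate a pattern only when the pattern holds at most other sites. The site names and the lock-held table are invented for the example.

        def flag_deviants(call_sites, follows_pattern, threshold=0.7):
            """Return the call sites that violate a pattern the majority obeys."""
            conforming = [s for s in call_sites if follows_pattern(s)]
            rate = len(conforming) / len(call_sites)
            if rate >= threshold:                      # the pattern is the common case,
                return [s for s in call_sites          # so the few deviants are suspect
                        if not follows_pattern(s)]
            return []                                  # no strong pattern, report nothing

        # Example: is a lock held around each update of a shared counter?
        held = {"a.c:10": True, "b.c:42": True, "c.c:7": True, "d.c:99": False}
        print(flag_deviants(list(held), lambda s: held[s]))   # ['d.c:99'] is the likely fault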

    2.2.2 Michael M. Swift et al. Recovering device drivers. Proceedings of the Sixth Symposium on Operating System Design and Implementation (December 2004), pages 1–16.

    This paper observes that software faults in device drivers often lead to fatal errors that cause operating systems to fail and thus require a reboot. It then describes how virtual memory techniques can be used to enforce modularity between device drivers and the rest of the operating system kernel, and how the operating system can recover device drivers when they fail, reducing the number of reboots.

    Topic 2.3: Disk failures

    2.3.1 Bianca Schroeder and Garth A. Gibson. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? Proceedings of the Fifth USENIX Conference on File and Storage Technologies (2007), pages 1–16.

    As explained in Section 2.2, it is not uncommon that data sheets for disk drives specify MTTFs of one hundred years or more, many times the actual observed lifetimes of those drives in the field. This paper looks at disk replacement data for 100,000 disk drives and discusses what MTTF means for those disk drives.
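
    For a rough sense of the number in the paper's title (a back-of-the-envelope conversion, not a figure from the paper), an MTTF of one million hours corresponds to

    \( \dfrac{1{,}000{,}000 \text{ hours}}{24 \times 365 \text{ hours/year}} \approx 114 \text{ years}, \)

    which is where the "one hundred years or more" of Section 2.2 comes from.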

    2.3.2 Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. Failure trends in a large disk drive population. Proceedings of the Fifth USENIX Conference on File and Storage Technologies (2007), pages 17–28.

    Recently, outfits such as Google have deployed large enough numbers of off-the-shelf disk drives for a long enough time that they can make their own evaluations of disk drive failure rates and lifetimes, for comparison with the a priori reliability models of the disk vendors. This paper reports data collected from such observations. It analyzes the correlation between failures and several parameters that are generally believed to impact the lifetime of disks and finds some surprises. For example, it reports that temperature is less correlated with disk drive failure than was previously reported, as long as the temperature is within a certain range and stable.

     

    Chapter 3: Atomicity

    Topic 3.1: Atomicity, Coordination, and Recovery

    The best source on this topic is the book by Gray and Reuter listed above as reading number 2.1.1, but at a thousand pages long it can be a bit overwhelming.

    3.1.1 Warren A. Montgomery. Robust Concurrency Control for a Distributed Information System. Ph.D. thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, December 1978. Also available as M.I.T. Laboratory for Computer Science Technical Report TR–207, January 1979. 197 pages.

    This work describes alternative strategies that maximize concurrent activity while achieving atomicity: maintaining multiple values for some variables, and atomically broadcasting messages to achieve proper sequencing.

    3.1.2 D. B. Lomet. Process structuring, synchronization, and recovery using atomic actions. Proceedings of an ACM Conference on Language Design for Reliable Software (March 1977), pages 128–137. Published as ACM SIGPLAN Notices 12, 3 (March 1977); Operating Systems Review 11, 2 (April 1977); and Software Engineering Notes 2, 2 (March 1977).

    This is one of the first attempts to link atomicity to both recovery and coordination. It is written from a language, rather than an implementation, perspective.

    Topic 3.2: Databases

    3.2.1 Jim [N.] Gray et al. The recovery manager of the System R database manager. ACM Computing Surveys 13, 2 (June 1981), pages 223–242.

    This paper is a case study of a sophisticated, real, high-performance logging and locking system. It is one of the most interesting case studies of its type because it shows the number of different, interacting mechanisms needed to construct a system that performs well.

    3.2.2 C. Mohan et al. ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems 17, 1 (1992), pages 94–162.

    This paper describes all the intricate design details of a fully featured, commercial-quality database transaction system that uses write-ahead logging.

    3.2.3 C. Mohan, Bruce Lindsay, and Ron Obermarck. Transaction management in the R* distributed database management system. ACM Transactions on Database Systems (TODS) 11, 4 (December 1986), pages 378–396.

    This paper deals with transaction management for distributed databases, and introduces two new protocols (Presumed Abort and Presumed Commit) that optimize two-phase commit (see Section 3.7), resulting in fewer messages and log writes. Presumed Abort is optimized for transactions that perform only read operations, and Presumed Commit is optimized for transactions with updates that involve several distributed databases.

    3.2.4 Tom Barclay, Jim Gray, and Don Slutz. Microsoft TerraServer: A spatial data warehouse. Microsoft Technical Report MS-TR-99-29. June 1999.

    The authors report on building a popular Web site that hosts aerial, satellite, and topographic images of Earth using off-the-shelf components, including a standard database system for storing the terabytes of data.

    3.2.5 Ben Vandiver et al. Tolerating Byzantine faults in transaction processing systems using commit barrier scheduling. Proceedings of the Twenty-first ACM Symposium on Operating Systems Principles, in Operating Systems Review 41, 6 (December 2007), pages 59–79.

    This paper describes a replication scheme for handling Byzantine faults in database systems. It issues queries and updates to multiple replicas of unmodified, off-the-shelf database systems, and it compares their responses, thus creating a single database that is Byzantine fault tolerant (see Section 2.7 for the definition of Byzantine).

    Topic 3.3: Atomicity-related topics

    3.3.1 Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems 10, 1 (February 1992), pages 26–52. Originally published in Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, in Operating Systems Review 25, 5 (December 1991), pages 1–15.

    Although it has long been suggested that one could in principle store the contents of a file system on disk in the form of a finite log, this design is one of the few that demonstrates the full implications of that design strategy. The paper also presents a fine example of how to approach a system problem by carefully defining the objective, measuring previous systems to obtain a benchmark, and then comparing performance as well as functional aspects that cannot be measured.

    3.3.2 H. T. Kung and John T. Robinson. On optimistic methods for concurrency control. ACM Transactions on Database Systems 6, 2 (June 1981), pages 213–226.

    This early paper introduced the idea of using optimistic approaches to controlling updates to shared data. An optimistic scheme is one in which a transaction proceeds in the hope that its updates are not conflicting with concurrent updates of another transaction. At commit time, the transaction checks to see if the hope was justified. If so, the transaction commits. If not, the transaction aborts and tries again. Applications that use a database in which contention for particular records is infrequent may run more efficiently with this optimistic scheme than with a scheme that always acquires locks to coordinate updates.
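
    The validate-at-commit idea can be pictured with a short Python sketch. It assumes a single in-memory store, and the names (OptimisticStore, begin, read, write, commit) are invented for the example; it illustrates the general technique rather than the Kung-Robinson algorithm in detail.

        class OptimisticStore:
            def __init__(self):
                self.data = {}       # name -> current value
                self.version = {}    # name -> count of committed updates, used to detect conflicts

            def begin(self):
                return {"reads": {}, "writes": {}}

            def read(self, tx, name):
                tx["reads"][name] = self.version.get(name, 0)      # remember the version we saw
                return tx["writes"].get(name, self.data.get(name))

            def write(self, tx, name, value):
                tx["writes"][name] = value                         # buffer updates until commit

            def commit(self, tx):
                # Validation: if anything we read has since been committed by another
                # transaction, the hope was not justified, so abort.
                for name, seen in tx["reads"].items():
                    if self.version.get(name, 0) != seen:
                        return False                               # caller aborts and retries
                for name, value in tx["writes"].items():           # install buffered writes
                    self.data[name] = value
                    self.version[name] = self.version.get(name, 0) + 1
                return True

    A caller that gets False back from commit simply re-runs the whole transaction, which is cheap exactly when conflicts on individual records are rare.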

    3.3.3 Butler W. Lampson and Howard Sturgis. Crash recovery in a distributed data storage system. Working paper, Xerox Palo Alto Research Center, November 1976, and April 1979. (Never published)

    Jim Gray called the 1976 version of this paper "an underground classic." The 1979 version presents the first good definition of models of failure. Both describe algorithms for coordinating distributed updates; they are sufficiently different that both are worth reading.

    3.3.4 Gregory R. Ganger and Yale N. Patt. Metadata update performance in file systems. Proceedings of the First USENIX Symposium on Operating Systems Design and Implementation (November 1994), pages 49–60.

    This paper is an application to file systems of some recovery and consistency concepts originally developed for database systems. It describes a few simple rules (e.g., an inode should be written to the disk after writing the disk blocks to which it points) that allow a system designer to implement a file system that is high performance and always keeps its on-disk data structures consistent in the presence of failures. As applications perform file operations, the rules create dependencies between data blocks in the write-behind cache. A disk driver that knows about these dependencies can write the cached blocks to disk in an order that maintains consistency of on-disk data structures despite system crashes.
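
    The effect of those rules can be sketched in a few lines of Python: each cached block records which blocks must reach the disk before it, and the flushing code honors that order. The block names, cache layout, and write_block callback are invented for illustration, and the sketch assumes the dependencies contain no cycles.

        def flush(cache, write_block):
            """cache: block_id -> {"data": bytes, "must_follow": set of block_ids}."""
            written = set()

            def flush_one(bid):
                if bid in written:
                    return
                for dep in cache[bid]["must_follow"]:    # e.g., data blocks before the inode
                    flush_one(dep)
                write_block(bid, cache[bid]["data"])     # assumed to be a synchronous device write
                written.add(bid)

            for bid in list(cache):
                flush_one(bid)

        # The rule quoted above: write the inode only after the data block it points to.
        cache = {
            "data-7": {"data": b"...", "must_follow": set()},
            "inode-3": {"data": b"...", "must_follow": {"data-7"}},
        }
        flush(cache, lambda bid, data: print("writing", bid))   # writes data-7, then inode-3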

     

    Chapter 4: Consistency and Durable Storage

    Topic 4.1: Consistency

    4.1.1 J. R. Goodman. Using cache memory to reduce processor-memory traffic. Proceedings of the 10th Annual International Symposium on Computer Architecture, pages 124–132 (1983).

    This is the paper that introduced a protocol for cache-coherent shared memory using snoopy caches. It also sparked much research into more scalable designs for cache-coherent shared memory.

    4.1.2 James J. Kistler and M[ahadev] Satyanarayanan. Disconnected operation in the Coda file system. Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, in Operating Systems Review 25, 5 (December 1991), pages 213–225.

    Coda is a variation of the Andrew File System (AFS) that provides extra fault tolerance features. It is notable for using the same underlying mechanism to deal both with accidental disconnection due to network partition and the intentional disconnection associated with portable computers. This paper is well written.

    4.1.3 Jim Gray et al. The dangers of replication and a solution. Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, in ACM SIGMOD Record 25, 2 (June 1996), pages 173–182.

    This paper describes the challenges for replication protocols in situations where the replicas are stored on mobile computers that are frequently disconnected. The paper argues that trying to provide transactional semantics for an optimistic replication protocol in this setting is unstable because there will be too many reconciliation conflicts. It proposes a new two-tier protocol for reconciling disconnected replicas that addresses this problem.

    4.1.4 Leslie Lamport. Paxos made simple. Distributed computing (column), ACM SIGACT News 32, 4 (Whole Number 121, December 2001), pages 51–58.

    This paper describes an intricate protocol, Paxos, in a simple way. The Paxos protocol allows several computers to agree on a value (e.g., the list of available computers in a replicated service) in the face of network and computer failures. It is an important building block in building fault-tolerant services.

    4.1.5 Fred Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys 22, 4 (1990), pages 299–319.

    This paper provides a clear description of one of the most popular approaches for building fault-tolerant services, the replicated state machine approach.

    4.1.6 Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM 21, 7 (1978), pages 558–565.

    This paper introduces an idea that is now known as Lamport clocks. A Lamport clock provides a global, logical clock for a distributed system that respects the physical clocks of the computers comprising the distributed system and the communication between them. The paper also introduces the idea of replicated state machines.
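
    A minimal sketch of the rule, assuming the usual formulation (advance the counter on every local event and send, and on receipt move it past the timestamp carried by the message):

        class LamportClock:
            def __init__(self):
                self.time = 0

            def local_event(self):
                self.time += 1
                return self.time

            def send(self):
                self.time += 1
                return self.time                    # timestamp carried on the message

            def receive(self, msg_time):
                self.time = max(self.time, msg_time) + 1
                return self.time

        a, b = LamportClock(), LamportClock()
        t = a.send()        # a's clock: 1
        b.receive(t)        # b's clock: 2, so the receive is ordered after the send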

    4.1.7 David K. Gifford. Weighted voting for replicated data. Proceedings of the Seventh ACM Symposium on Operating Systems Principles, in Operating Systems Review 13, 5 (December 1979), pages 150–162. Also available as Xerox Palo Alto Research Center Technical Report CSL–79–14 (September 1979).

    The work discusses a replicated data algorithm that allows the trade-off between reliability and performance to be adjusted by assigning weights to each data copy and requiring transactions to collect a quorum of those weights before reading or writing.
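
    The arithmetic behind the quorums can be shown in a few lines; the weights and quorum sizes below are invented for illustration. The constraints are that the read quorum r and write quorum w satisfy r + w > N and 2w > N, where N is the total weight, so that any read quorum overlaps any write quorum and any two write quorums overlap.

        weights = {"replica1": 2, "replica2": 1, "replica3": 1}   # total weight N = 4
        N = sum(weights.values())
        r, w = 2, 3                                               # r + w = 5 > 4 and 2w = 6 > 4
        assert r + w > N and 2 * w > N

        def has_quorum(responding, quorum):
            return sum(weights[name] for name in responding) >= quorum

        print(has_quorum({"replica1"}, r))                # True: weight 2 meets the read quorum
        print(has_quorum({"replica2", "replica3"}, w))    # False: weight 2 falls short of the write quorum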

    4.1.8 Kai Li and Paul Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems 7, 4 (November 1989), pages 321–359.

    This paper describes a method to create a shared virtual memory across several separated computers that can communicate only with messages. It uses hardware support for virtual memory to cause the results of a write to a page to be observed by readers of that page on other computers. The goal is to allow programmers to write parallel applications on a distributed computer system in shared-memory style instead of a message-passing style.

    4.1.9 Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (October 2003), pages 29–43. Also in Operating Systems Review 37, 5 (December 2003).

    This paper introduces a file system used in many of Google's applications. It aggregates the disks of thousands of computers in a cluster into a single storage system with a simple file system interface. Its design is optimized for large files and replicates files for fault tolerance. The Google File System is used in the storage back-end of many of Google's applications, including Google search. 

    4.1.10 F[ay] Chang et al. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems 26, 2, article 4 (2008), pages 1–26.

    This paper describes a database-like system for storing petabytes of structured data on thousands of commodity servers.

    Topic 4.2: Durable Storage

    4.2.1 Raymond A. Lorie. The long-term preservation of digital information. Proceedings of the First ACM/IEEE Joint Conference on Digital Libraries (2001), pages 346–352.

    This is a thoughtful discussion of the problems of archiving digital information despite medium and technology obsolescence.

    4.2.2 Randy H. Katz, Garth A. Gibson, and David A. Patterson. Disk system architectures for high performance computing. Proceedings of the IEEE 77, 12 (December 1989), pages 1842–1857.

    The first part of this reference paper on Redundant Arrays of Independent Disks (RAID) reviews disk technology; the important material is the catalog of six varieties of RAID organization.

    4.2.3 Petros Maniatis et al. LOCKSS: A peer-to-peer digital preservation system. ACM Transactions on Computer Systems 23, 1 (February 2005), pages 2–50.

    This paper describes a peer-to-peer system for preserving access to journals and other archival information published on the Web. Its design is based on the mantra "lots of copies keep stuff safe" (LOCKSS). A large number of persistent Web caches keep copies and cooperate to detect and repair damage to their copies using a new voting scheme.

    4.2.4 A[lan J.] Demers et al. Epidemic algorithms for replicated database maintenance. Proceedings of the Sixth Symposium on Principles of Distributed Computing (August 1987), pages 1–12. Also in Operating Systems Review 22, 1 (January 1988), pages 8–32.

    This paper describes an epidemic protocol to update data that is replicated on many machines. The essence of an epidemic protocol is that each computer periodically gossips with some other, randomly chosen computer and exchanges information; multiple computers thus learn about all updates in a viral fashion. Epidemic protocols can be simple and robust, yet can spread updates relatively quickly.
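
    The viral spread is easy to see in a toy Python version of an anti-entropy round; the Node class and the timestamped-update format are invented for the example, not taken from the paper.

        import random

        class Node:
            def __init__(self):
                self.updates = {}                    # key -> (timestamp, value)

            def gossip_with(self, peer):
                # Exchange in both directions; the newer update for each key wins.
                for src, dst in ((self.updates, peer.updates), (peer.updates, self.updates)):
                    for key, (ts, value) in src.items():
                        if key not in dst or dst[key][0] < ts:
                            dst[key] = (ts, value)

        nodes = [Node() for _ in range(10)]
        nodes[0].updates["x"] = (1, "hello")         # one node learns an update...
        for _ in range(5):                           # ...and a few random rounds spread it
            for n in nodes:
                peer = random.choice(nodes)
                if peer is not n:
                    n.gossip_with(peer)
        print(sum("x" in n.updates for n in nodes), "of", len(nodes), "nodes have the update")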

    Topic 4.3: Reconciliation

    4.3.1 Douglas B. Terry et al. Managing update conflicts in Bayou, a weakly connected replicated storage system. Proceedings of the Fifteenth Symposium on Operating Systems Principles (December 1995), in Operating Systems Review 29, 5 (December 1995), pages 172–183.

    This paper introduces a replication scheme for computers that share data but are not always connected. For example, each computer may have a copy of a calendar, which it can update optimistically. Bayou will propagate these updates, detect conflicts, and attempt to resolve conflicts, if possible.

    4.3.2 Trevor Jim, Benjamin C. Pierce, and Jérôme Vouillon. How to build a file synchronizer. (A widely circulated piece of grey literature—dated February 22, 2002 but never published.)

    This paper describes the nuts and bolts of Unison, a tool that efficiently synchronizes the files stored on two computers. Unison is targeted to users who have their files stored in several places (e.g., on a server at work, a laptop to carry while traveling, and a desktop at home) and would like to have all the files on the different computers be the same.

     

    Chapter 5: Information Security

    Topic 5.1: Privacy

    5.1.1 Alan F. Westin. Privacy and Freedom. Atheneum Press, 1967. 487 pages. (Out of print.)

    If you have any interest in privacy, track down a copy of this book in a library or used-book store. It is the comprehensive treatment, by a constitutional lawyer, of what privacy is, why it matters, and how it is positioned in the U.S. legal framework.

    5.1.2 Arthur R. Miller. The Assault on Privacy. University of Michigan Press, Ann Arbor, Michigan, 1971. ISBN: 0–472–65500–0. 333 pages. (Out of print.)

    This book articulately spells out the potential effects of computerized data-gathering systems on privacy and possible approaches to improving legal protection. Part of the latter is now out of date because of advances in legislation, but most of this book is still of much interest.

    5.1.3 Daniel J. Weitzner et al. Information accountability. Communications of the ACM 51, 6 (June 2008), pages 82–87.

    The paper suggests that in the modern world, Westin's definition covers only a subset of privacy. See Sidebar 5.2.1 for a discussion of the paper's proposed extended definition.

    Topic 5.2: Protection Architectures

    5.2.1 Jerome H. Saltzer and Michael D. Schroeder. The protection of information in computer systems. Proceedings of the IEEE 63, 9 (September 1975), pages 1278–1308.

    After more than 30 years, this paper (an early version of the current Chapter 5) still provides an effective treatment of protection mechanics in multiuser systems. Its emphasis on protection inside a single system, rather than between systems connected to a network, is one of its chief shortcomings, along with antique examples and omission of newer techniques of certification such as authentication logic.

    5.2.2 R[oger] M. Needham. Protection systems and protection implementations. AFIPS Fall Joint Conference 41, Part I (December 1972), pages 571–578.

    This paper is probably as clear an explanation of capability systems as one is likely to find. For another important paper on capabilities, see reading 5.2.3 below, by Fabry.

    5.2.3 R[obert] S. Fabry. Capability-based addressing. Communications of the ACM 17, 7 (July 1974), pages 403–412.

    This is the first comprehensive treatment of capabilities, a mechanism introduced to enforce modularity but actually more of a naming feature.

    Topic 5.3: Certification, Trusted Computer Systems and Security Kernels

    5.3.1 Butler [W.] Lampson, Martín Abadi, Michael Burrows, and Edward Wobber. Authentication in distributed systems: Theory and practice. ACM Transactions on Computer Systems 10, 4 (November 1992), pages 265–310.

    This paper, one of a series on a logic that can be used to reason systematically about authentication, provides a relatively complete explication of the theory and shows how to apply it to the protocols of a distributed system.

    5.3.2 Edward Wobber, Martín Abadi, Michael Burrows, and Butler W. Lampson. Authentication in the Taos operating system. Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, in Operating Systems Review 27, 5 (December 1993), pages 256–269.

    This paper applies the authentication logic developed in reading 5.3.1 to an experimental operating system. In addition to providing a concrete example, the explanation of the authentication logic itself is a little more accessible than that in the other paper.

    5.3.3 Ken L. Thompson. Reflections on trusting trust. Communications of the ACM 27, 8 (August 1984), pages 761–763.

    Anyone seriously interested in developing trusted computer systems should think hard about the implications for verification that this paper raises. Thompson demonstrates the ease with which a compiler expert can insert undetectable Trojan Horses into a system. Reading 5.3.4 describes a way to detect a Trojan horse. [The original idea that Thompson describes came from a paper whose identity he could not recall at the time, and which is credited with a footnote asking for help locating it. The paper was a technical report of the United States Air Force Electronic Systems Division at Hanscom Air Force Base. Paul A. Karger and Roger R. Schell. Multics Security Evaluation: Vulnerability Analysis. ESD–TR–74–193, Volume II (June 1974), page 52.]

    5.3.4 David A. Wheeler. Countering trusting trust through diverse double-compiling. Proceedings of the 21st Annual Computer Security Applications Conference (2005), pages 28–40.

    This paper proposes a solution, which the author calls "diverse double-compiling", to detect the attack discussed in Thompson's paper on trusting trust (see reading 5.3.3). The idea is to recompile a new, untrusted compiler's source code twice: first using a trusted compiler, and second using the result of this compilation. If the resulting binary for the compiler is bit-for-bit identical with the untrusted compiler's original binary, then the source code accurately represents the untrusted binary, which is the first step in developing trust in the new compiler.
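
    In schematic form (compile_with is a hypothetical helper standing in for whatever build step applies; it takes a compiler binary and source text and returns the bytes of the binary it produces):

        def diverse_double_compile(untrusted_binary, compiler_source, trusted_compiler, compile_with):
            stage1 = compile_with(trusted_compiler, compiler_source)   # first pass: a trusted compiler builds the source
            stage2 = compile_with(stage1, compiler_source)             # second pass: that result rebuilds the same source
            # Bit-for-bit equality indicates that the source faithfully represents
            # the untrusted binary (assuming deterministic compilation; see the
            # paper for the conditions and caveats).
            return stage2 == untrusted_binary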

    5.3.5 Paul A. Karger et al. A VMM security kernel for the VAX architecture. 1990 IEEE Computer Society Symposium on Security and Privacy (May 1990), pages 2–19.

    In the 1970s, the U.S. Department of Defense undertook a research effort to create trusted computer systems for defense purposes and in the process created a large body of literature on the subject. This paper distills most of the relevant ideas from that literature into a single, readable case study, and it also provides pointers to other key papers for those seeking more details on these ideas.

    5.3.6 David D. Clark and David R. Wilson. A comparison of commercial and military computer security policies. 1987 IEEE Symposium on Security and Privacy (April 1987), pages 184–194.

    This thought-provoking paper outlines the requirements for security policy in commercial settings and argues that the lattice model is often not applicable. It suggests that these applications require a more object-oriented model in which data may be modified only by trusted programs.

    5.3.7 Jaap-Henk Hoepman and Bart Jacobs. Increased security through open source. Communications of the ACM 50, 1 (January 2007), pages 79–83.

    It has long been argued that the open design principle (see Section 5.2.4) is important to designing secure systems. This paper extends that argument by making the case that the availability of source code for a system is important in ensuring the security of its implementation.

    5.3.8 Simson Garfinkel and Gene [Eugene H.] Spafford. Practical UNIX and Internet Security. O'Reilly & Associates, Sebastopol, California, third edition, 2003. ISBN 978–0–596–00323–4 (paperback). 986 pages.

    This is a really comprehensive guide to how to run a network-attached UNIX system with some confidence that it is relatively safe against casual intruders. In addition to providing practical information for a system manager, it incidentally gives the reader quite a bit of insight into the style of thinking and design needed to provide security.

    5.3.9 Butler W. Lampson and Howard E. Sturgis. Reflections on an operating system design. Communications of the ACM 19, 5 (May 1976), pages 251–265.

    An operating system named CAL, designed at the University of California at Berkeley, appears to be the first system to make explicit use of types in the interface to the operating system. In addition to introducing this idea, Lampson and Sturgis give good insight into the pros and cons of various design decisions. Although documented late, the system was actually implemented in 1969.

    5.3.10 Michael D. Schroeder, David D. Clark, and Jerome H. Saltzer. The Multics kernel design project. Proceedings of the Sixth ACM Symposium on Operating Systems Principles, in Operating Systems Review 11, 5 (November 1977), pages 43–56.

    This paper addresses a wide range of issues encountered in applying type extension (as well as microkernel thinking, though it wasn't called that at the time) to Multics in order to simplify its internal organization and reduce the size of its trusted base. Many of these ideas were explored in even more depth in Philippe Janson's Ph.D. thesis, Using Type Extension to Organize Virtual Memory Mechanisms, M.I.T. Department of Electrical Engineering and Computer Science, August 1976. That thesis is also available as M.I.T. Laboratory for Computer Science Technical Report TR–167, September 1976.

    Topic 5.4: Authentication

    5.4.1 Robert [H.] Morris and Ken [L.] Thompson. Password security: A case history. Communications of the ACM 22, 11 (November 1979), pages 594–597.

    This paper is a model of how to explain something in an accessible way. With a minimum of jargon and a historical development designed to simplify things for the reader, it describes the UNIX password security mechanism.

    5.4.2 Frank Stajano and Ross J. Anderson. The resurrecting duckling: Security issues for ad-hoc wireless networks. Security Protocols Workshop 1999, pages 172–194.

    This paper discusses the problem of how a new device (e.g., a surveillance camera) can establish a secure relationship with the remote controller of the device's owner, rather than with a neighbor's or an adversary's. The paper's solution is that a device recognizes as its owner the first principal that sends it an authentication key. As soon as the device receives a key, its status changes from newborn to imprinted, and it stays faithful to that key until its death. The paper illustrates the problem and the solution using a vivid analogy of how ducklings authenticate their mother (see Sidebar 5.4.1).
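
    A toy version of the imprinting policy might look like the following Python sketch; the class and its states are illustrative only, and a real design would authenticate commands cryptographically rather than compare raw keys.

        class Duckling:
            def __init__(self):
                self.owner_key = None                # newborn: no owner yet

            def receive_key(self, key):
                if self.owner_key is None:
                    self.owner_key = key             # imprint on the first key offered
                    return "imprinted"
                return "rejected"                    # already faithful to another owner

            def command(self, key, action):
                if key == self.owner_key:
                    return "doing " + action
                return "refused"

            def die(self):
                self.owner_key = None                # "death" returns it to the newborn state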

    5.4.3 David Mazières. Self-certifying file system. Ph.D. thesis, Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science (May 2000).

    This thesis proposes a design for a cross-administrative domain file system that separates the file system from the security mechanism using an idea called self-certifying path names. Self-certifying names can be found in several other systems.

    5.4.4 Bryan Ford et al. Persistent personal names for globally connected mobile devices. Proceedings of the Seventh USENIX Symposium on Operating Systems Design and Implementation (November 2006), pages 233–248.

    This paper describes a naming system for personal devices. Each device is a root of its own naming network and can use short, convenient names for other devices belonging to the same user or to people in the user's social network. The implementation of the naming system allows devices that are disconnected from the Internet to resolve the names of devices that are still reachable. The first five pages lay out the basic naming plan. Later sections explain security properties and a security-based implementation, which makes this portion of the paper the most relevant to the material of Chapter 5.

    For more on authentication, see Sidebar 5.6.1 on Kerberos.

    Topic 5.5: Cryptographic techniques

    The fundamental books about cryptography applied to computer systems are the two readings that follow:

    5.5.1 Bruce Schneier. Applied Cryptography. John Wiley and Sons, second edition, 1996. ISBN: 978–0–471–12845–8 (hardcover), 978–0–471–11709–4 (paperback). 784 pages.

    Here is everything you might want to know about cryptography and cryptographic protocols, including a well-balanced perspective on what works and what doesn't. This book saves you from having to read and sort through the thousand or so technical papers on the subject. Protocols, techniques, algorithms, real-world considerations, and source code can all be found here. In addition to being competent, it is also articulate and entertainingly written. Be aware that a number of minor errors have been reported in this book; if you are implementing code, it would be a good idea to verify the details by consulting reading 5.5.2.

    5.5.2 A[lfred] J. Menezes, Paul C. van Oorschot, and Scott A. Vanstone. Handbook of Applied Cryptography. CRC Press, 1997. ISBN: 978–0–8493–8523–0. 816 pages.

    This book is exactly what its title claims: a complete handbook on putting cryptography to work. It lacks the background and perspective of reading 5.5.1, and it is extremely technical, which makes parts of it inaccessible to less mathematically inclined readers. But its precise definitions and careful explanations make this by far the best reference book available on the subject.

    In light of these two books, the first few papers from the 1970s listed below are primarily of historical interest. There is also a good, more elementary, treatment of cryptography in the book by Simson Garfinkel, which is listed above as reading 5.3.8.

    5.5.3 R[onald] L. Rivest, A[di] Shamir, and L[en] Adleman. A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM 21, 2 (February 1978), pages 120–126.

    This paper was the first to suggest a possibly workable public key system.

    5.5.4 Whitfield Diffie and Martin E. Hellman. Exhaustive cryptanalysis of the NBS Data Encryption Standard. Computer 10, 6 (June 1977), pages 74–84.

    This is the unofficial analysis of how to break the DES by brute force—by building special-purpose chips and arraying them in parallel. Twenty-five years later, brute force still seems to be the only promising attack on DES, but the intervening improvements in hardware technology make special chips unnecessary—an array of personal computers on the Internet can do the job. The Advanced Encryption Standard (AES) is DES's successor (see Section 5.9.3.1).

    5.5.5 Ross J. Anderson. Why cryptosystems fail. Communications of the ACM 37, 11 (November 1994), pages 32–40.

    Anderson presents a nice analysis of what goes wrong in real-world cryptosystems—secure modules don't necessarily lead to secure systems—and the applicability of systems thinking in their design. He points out that merely doing the best possible design isn't enough; a feedback loop that corrects errors in the design following experience in the field is an equally important component that is sometimes forgotten.

    5.5.6 David Wagner and Bruce Schneier. Analysis of the SSL 3.0 protocol. Proceedings of the Second USENIX Workshop on Electronic Commerce, Volume 2 (November 1996), pages 29–40.

    This paper is useful not only because it provides a careful analysis of the security of the subject protocol, but it also explains how the protocol works in a form that is more accessible than the protocol specification documents. The originally published version was almost immediately revised with corrections. The revised version is available on the World Wide Web at <http://www.counterpane.com/ssl.html>.

    5.5.7 M[ihir] Bellare, R[an] Canetti, and H[ugo] Krawczyk. Keying hash functions for message authentication. Proceedings of the Sixteenth International Cryptology Conference (August 1996), pages 1–15. (Also see H. Krawczyk, M. Bellare, and R. Canetti, HMAC: Keyed-hashing for message authentication, Request for Comments RFC 2104, Internet Engineering Task Force, February 1997.)

    This paper and the RFC introduce and define HMAC, a hash function used in widely deployed protocols.
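
    For readers who want to see keyed hashing in action, here is a minimal example using Python's standard hmac and hashlib modules with SHA-256 (a modern instantiation; the key and message are placeholders):

        import hashlib
        import hmac

        key = b"shared secret key"
        message = b"transfer $10 to account 42"

        tag = hmac.new(key, message, hashlib.sha256).hexdigest()   # sender attaches this tag

        # The receiver recomputes the tag with the shared key and compares in constant time.
        expected = hmac.new(key, message, hashlib.sha256).hexdigest()
        print(hmac.compare_digest(tag, expected))   # True only if key and message both match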

    5.5.8 David Chaum. Untraceable electronic mail, return addresses, and digital pseudonyms. Communications of the ACM 24, 2 (February 1981), pages 84–88.

    This paper introduces a system design, named mixnet, that allows a sender of a message to hide its true identity from a receiver but still allow the receiver to respond.

    Note that all of these books and papers focus on the application of cryptography, not on crypto-mathematics, which is a distinct area of specialization not covered in this reading list. An accessible crypto-mathematics reference is the following reading:

    5.5.9 Johannes A. Buchmann. Introduction to Cryptography. Springer, second edition, 2004. ISBN 978–0–387–21156–5 (hardcover), 978–0–387–20756–8 (paperback). 335 pages.

    Buchmann provides a nice, concise introduction to number theory for cryptography.

    Topic 5.6: Adversaries (the dark side)

    Section 5.12 on war stories gives a wide range of examples of how adversaries can break a system's security. This section lists a few papers that provide longer and more detailed descriptions of attacks. This is a fast-moving area: as soon as designers fend off known attacks, adversaries look for new ones. This arms race is reflected in some of the following readings, and although some of the attacks described have become ineffective (or will over time), these papers provide valuable insights. The proceedings of USENIX Security and of the ACM Conference on Computer and Communications Security often contain papers explaining current attacks, and conferences run by the so-called "black hat" community document the "progress" on the dark side.

    5.6.1 Eugene Spafford. Crisis and aftermath. Communications of the ACM 32, 6 (June 1989), pages 678–687.

    This paper documents how the Morris worm works. It was one of the first worms, as well as one of the most sophisticated.

    5.6.2 Jonathan Pincus and Brandon Baker. Beyond stack smashing: Recent advances in exploiting buffer overruns. IEEE Security and Privacy 2, 4 (August 2004), pages 20–27.

    This paper describes how buffer overrun attacks have evolved since the Morris worm.

    5.6.3 Abhishek Kumar, Vern Paxson, and Nicholas Weaver. Exploiting underlying structure for detailed reconstruction of an Internet scale event. Proceedings of the ACM Internet Measurement Conference (October 2005), pages 351–364.

    This paper describes the Witty worm and how the authors were able to track down its source. The work contains many interesting nuggets of information.

    5.6.4 Vern Paxson. An analysis of using reflectors for distributed denial-of-service attacks. Computer Communication Review 31, 3 (July 2001), pages 38–47.

    This paper describes how an adversary can trick a large set of Internet servers into sending their combined replies to a victim and in that way launch a denial-of-service attack on the victim. It speculates on several possible directions for defending against such attacks.

    5.6.5 Chris Kanich et al. Spamalytics: an empirical analysis of spam marketing conversion. Proceedings of the ACM Conference on Computer and Communications Security (CCS), Arlington, Virginia (October 2008), pages 3–14.

    This paper describes the infrastructure that spammers use to send unsolicited email and tries to establish what the financial reward system is for spammers. This paper has its shortcomings, but it is one of the few papers that tries to understand the economics behind spam.

    5.6.6 Tom Jagatic, Nathaniel Johnson, Markus Jakobsson, and Filippo Menczer. Social phishing. Communications of the ACM 50, 10 (October 2007), pages 94–100.

    This study investigates the success rate of individual phishing attacks.

