2016 Theses Doctoral
Deterministic, Mutable, and Distributed Record-Replay for Operating Systems and Database Systems
Application record and replay is the ability to record application execution and replay it at a later time. Record-replay has many use cases including diagnosing and debugging applications by capturing and reproducing hard to find bugs, providing transparent application fault tolerance by maintaining a live replica of a running program, and offline instrumentation that would be too costly to run in a production environment. Different record-replay systems may offer different levels of replay faithfulness, the strongest level being deterministic replay which guarantees an identical reenactment of the original execution. Such a guarantee requires capturing all sources of nondeterminism during the recording phase. In the general case, such record-replay systems can dramatically hinder application performance, rendering them unpractical in certain application domains. Furthermore, various use cases are incompatible with strictly replaying the original execution. For example, in a primary-secondary database scenario, the secondary database would be unable to serve additional traffic while being replicated. No record-replay system fit all use cases.
This dissertation shows how to make deterministic record-replay fast and efficient, how broadening replay semantics can enable powerful new use cases, and how choosing the right level of abstraction for record-replay can support distributed and heterogeneous database replication with little effort.
We explore four record-replay systems with different semantics enabling different use cases. We first present Scribe, an OS-level deterministic record-replay mechanism that support multi-process applications on multi-core systems. One of the main challenge is to record the interaction of threads running on different CPU cores in an efficient manner. Scribe introduces two new lightweight OS mechanisms, rendezvous point and sync points, to efficiently record nondeterministic interactions such as related system calls, signals, and shared memory accesses. Scribe allows the capture and replication of hard to find bugs to facilitate debugging and serves as a solid foundation for our two following systems.
We then present RacePro, a process race detection system to improve software correctness. Process races occur when multiple processes access shared operating system resources, such as files, without proper synchronization. Detecting process races is difficult due to the elusive nature of these bugs, and the heterogeneity of frameworks involved in such bugs. RacePro is the first tool to detect such process races. RacePro records application executions in deployed systems, allowing offline race detection by analyzing the previously recorded log. RacePro then replays the application execution and forces the manifestation of detected races to check their effect on the application. Upon failure, RacePro reports potentially harmful races to developers.
Third, we present Dora, a mutable record-replay system which allows a recorded execution of an application to be replayed with a modified version of the application. Mutable record-replay provides a number of benefits for reproducing, diagnosing, and fixing software bugs. Given a recording and a modified application, finding a mutable replay is challenging, and undecidable in the general case. Despite the difficulty of the problem, we show a very simple but effective algorithm to search for suitable replays.
Lastly, we present Synapse, a heterogeneous database replication system designed for Web applications. Web applications are increasingly built using a service-oriented architecture that integrates services powered by a variety of databases. Often, the same data, needed by multiple services, must be replicated across different databases and kept in sync. Unfortunately, these databases use vendor specific data replication engines which are not compatible with each other. To solve this challenge, Synapse operates at the application level to access a unified data representation through object relational mappers. Additionally, Synapse leverages application semantics to replicate data with good consistency semantics using mechanisms similar to Scribe.
- Viennot_columbia_0054D_13491.pdf binary/octet-stream 1.46 MB Download File
More About This Work
- Academic Units
- Computer Science
- Thesis Advisors
- Nieh, Jason
- Ph.D., Columbia University
- Published Here
- January 5, 2017