Algorithm Based Fault Tolerance in Massively Parallel Systems
Mark D. Lerner
- Algorithm Based Fault Tolerance in Massively Parallel Systems
- Lerner, Mark D.
- Technical reports
- Computer Science
- Permanent URL:
- Columbia University Computer Science Technical Reports
- Part Number:
- Department of Computer Science, Columbia University
- Publisher Location:
- New York
- An A complex computer system consists of billions of transistors, miles of wires, and many interactions with an unpredictable environment. Correct results must be produced despite faults that dynamically occur in some of these components. Many techniques have been developed for fault tolerant computation. General purpose methods are independent of the application, yet incur an overhead cost which may be unacceptable for massively parallel systems. Algorithm-specific methods, which can operate at lower cost, are a developing alternative [1, 72]. This paper first reviews the general-purpose approach and then focuses on the algorithm-specific method, with an eye toward massively parallel processors. Algorithm-based fault tolerance has the attraction of low overhead; furthermore it addresses both the detection and also the correction problems. The principle is to build low-cost checking and correcting mechanism based exclusively on the redundancies inherent in the system.
- Computer science
- Item views: