Algorithm Based Fault Tolerance in Massively Parallel Systems

Lerner, Mark D.

An A complex computer system consists of billions of transistors, miles of wires, and many interactions with an unpredictable environment. Correct results must be produced despite faults that dynamically occur in some of these components. Many techniques have been developed for fault tolerant computation. General purpose methods are independent of the application, yet incur an overhead cost which may be unacceptable for massively parallel systems. Algorithm-specific methods, which can operate at lower cost, are a developing alternative [1, 72]. This paper first reviews the general-purpose approach and then focuses on the algorithm-specific method, with an eye toward massively parallel processors. Algorithm-based fault tolerance has the attraction of low overhead; furthermore it addresses both the detection and also the correction problems. The principle is to build low-cost checking and correcting mechanism based exclusively on the redundancies inherent in the system.



More About This Work

Academic Units
Computer Science
Department of Computer Science, Columbia University
Columbia University Computer Science Technical Reports, CUCS-360-88
Published Here
December 19, 2011