Home

Algorithm Based Fault Tolerance in Massively Parallel Systems

Mark D. Lerner

Title:
Algorithm Based Fault Tolerance in Massively Parallel Systems
Author(s):
Lerner, Mark D.
Date:
Type:
Technical reports
Department:
Computer Science
Permanent URL:
Series:
Columbia University Computer Science Technical Reports
Part Number:
CUCS-360-88
Publisher:
Department of Computer Science, Columbia University
Publisher Location:
New York
Abstract:
An A complex computer system consists of billions of transistors, miles of wires, and many interactions with an unpredictable environment. Correct results must be produced despite faults that dynamically occur in some of these components. Many techniques have been developed for fault tolerant computation. General purpose methods are independent of the application, yet incur an overhead cost which may be unacceptable for massively parallel systems. Algorithm-specific methods, which can operate at lower cost, are a developing alternative [1, 72]. This paper first reviews the general-purpose approach and then focuses on the algorithm-specific method, with an eye toward massively parallel processors. Algorithm-based fault tolerance has the attraction of low overhead; furthermore it addresses both the detection and also the correction problems. The principle is to build low-cost checking and correcting mechanism based exclusively on the redundancies inherent in the system.
Subject(s):
Computer science
Item views:
285
Metadata:
View

In Partnership with the Center for Digital Research and Scholarship at Columbia University Libraries/Information Services.