May 31, 2005
Modern high-performance computing relies heavily on commodity processors arranged in clusters. These clusters consist of individual nodes (typically off-the-shelf single- or dual-processor machines) connected by a high-speed interconnect. Cluster computation has many benefits, but it is also failure prone due to the sheer number of components involved. Many effective solutions have been proposed to aid failure recovery in clusters; their one significant downside is the limited range of failure models they support. Most work in the area has focused on detecting and correcting fail-stop errors. We propose a system that also detects errors under more general models, such as Byzantine errors, allowing existing failure recovery methods to handle them correctly.
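The abstract does not spell out the detection mechanism, but a common way to catch errors beyond fail-stop is redundant execution with majority voting. A minimal sketch in C with MPI, assuming a hypothetical triple-modular-redundancy layout; the grouping and the stand-in kernel are illustrative, not the proposed system:

    /* Hedged sketch: detect arbitrary-value (Byzantine) errors by running
     * each work item on three replicas and majority-voting the results.
     * compute_result() and the group layout are assumptions for illustration. */
    #include <mpi.h>
    #include <stdio.h>

    static double compute_result(int work_item) {
        return 3.0 * work_item + 7.0;          /* stand-in for the real kernel */
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Every three consecutive ranks form one replica group. */
        MPI_Comm group;
        MPI_Comm_split(MPI_COMM_WORLD, rank / 3, rank % 3, &group);

        int gsize;
        MPI_Comm_size(group, &gsize);
        if (gsize == 3) {
            double mine = compute_result(rank / 3), vals[3];
            MPI_Allgather(&mine, 1, MPI_DOUBLE, vals, 1, MPI_DOUBLE, group);

            /* Majority vote: the replica that disagrees with the other two is
             * reported as failed.  A real system would compare with a tolerance
             * rather than exact equality. */
            int suspect = -1;
            if (vals[0] == vals[1] && vals[2] != vals[0]) suspect = 2;
            else if (vals[0] == vals[2] && vals[1] != vals[0]) suspect = 1;
            else if (vals[1] == vals[2] && vals[0] != vals[1]) suspect = 0;

            if (rank % 3 == 0 && suspect >= 0)
                fprintf(stderr, "group %d: replica %d disagrees; flagging as failed\n",
                        rank / 3, suspect);
        }

        MPI_Comm_free(&group);
        MPI_Finalize();
        return 0;
    }

Once a disagreeing replica is reported as failed, it can be handed off to an ordinary fail-stop recovery scheme, which is the hand-off the abstract argues for.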
Similar papers
January 1, 2005
Supercomputing systems today often take the form of large numbers of commodity systems linked together into a computing cluster. Like any distributed system, these systems can have large numbers of independent hardware components cooperating on a computation. Unfortunately, any one of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence...
September 5, 2022
The increasing size of HPC architectures makes faults an ever more frequent eventuality. This issue is especially relevant since MPI, the de-facto standard for inter-process communication, lacks proper fault-management functionality. Past efforts produced extensions to the MPI standard that enable fault management, including ULFM. While it provides powerful tools to handle faults, it still faces limitations such as the collectiveness of the repair procedur...
April 29, 2021
Due to the increasing size of HPC machines, faults are becoming an eventuality that applications must face. Natively, MPI provides no support for execution past the detection of a fault, and this is becoming more and more constraining. The introduction of ULFM (the User Level Fault Mitigation library) provides a possible way to overcome a fault during application execution, at the cost of code modifications. ULFM is intrusive in the applic...
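To make the "cost of code modifications" concrete, here is a minimal sketch in C of how an application typically uses the ULFM extensions, assuming the MPIX_* prototypes are available via <mpi-ext.h>; the surrounding structure is illustrative, not taken from the paper:

    /* Hedged sketch of the pattern ULFM asks for: switch MPI to returning
     * errors, detect a failed process, revoke the communicator so every rank
     * learns about it, and shrink it to continue on the survivors. */
    #include <mpi.h>
    #include <mpi-ext.h>    /* ULFM extensions: MPIX_Comm_revoke, MPIX_Comm_shrink */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Comm comm;
        MPI_Comm_dup(MPI_COMM_WORLD, &comm);
        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);   /* do not abort on failure */

        double local = 1.0, sum;
        int err = MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, comm);

        if (err != MPI_SUCCESS) {
            int cls;
            MPI_Error_class(err, &cls);
            if (cls == MPIX_ERR_PROC_FAILED || cls == MPIX_ERR_REVOKED) {
                MPI_Comm survivors;
                MPIX_Comm_revoke(comm);               /* make the failure globally known */
                MPIX_Comm_shrink(comm, &survivors);   /* new communicator without dead ranks */
                MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, survivors);
                MPI_Comm_free(&comm);
                comm = survivors;                     /* carry on with the repaired communicator */
            }
        }

        MPI_Comm_free(&comm);
        MPI_Finalize();
        return 0;
    }

Note that MPIX_Comm_shrink is itself a collective operation over the surviving ranks, which is the "collectiveness of the repair procedure" the other abstracts point to as a limitation.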
December 16, 2022
Scientific applications have long embraced MPI as the environment of choice for executing on large distributed systems. The User-Level Failure Mitigation (ULFM) specification extends the MPI standard to address resilience and enable MPI applications to restore their communication capability after a failure. This work builds upon the wide body of experience gained in the field to eliminate a gap between current practice and the ideal, more asynchronous, recovery model in whi...
August 29, 2006
Grid environments have recently been developed with low stretch and overheads that increase with the logarithm of the number of nodes in the system. Getting and sending data to/from large numbers of nodes is gaining importance due to an increasing number of independent data providers and the heterogeneity of the network/Grid. One of the key challenges is to achieve a balance between low bandwidth consumption and good reliability. In this paper we present an implementation o...
October 9, 2003
In message-passing programs, once a process terminates with an unexpected error, the terminated process can propagate the error to the rest of the processes through communication dependencies, resulting in a program failure. Therefore, to locate faults, developers must identify both the group of processes involved in the original error and the faulty processes that activated the faults. This paper presents a novel debugging tool, named MPI-PreDebugger (MPI-PD), for localizing faulty processes ...
November 25, 2024
Supercomputers are getting ever larger and more energy-efficient, which is at odds with the reliability of the hardware they use. Thus, the time intervals between component failures are decreasing. Conversely, the latencies of individual operations of coarse-grained big-data tools grow with the number of processors. To overcome the resulting scalability limit, we need to go beyond the current practice of inter-operation checkpointing. We give first results on how to achieve this for the popular ...
October 25, 2023
As we have entered exascale computing, faults in high-performance systems are expected to increase considerably. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a much higher frequency, resulting in an excessive amount of overhead that would not be sustainable for many scientific applications. Replication allows for fast recovery from failures by simply dropping the failed processes and using their rep...
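The overhead argument can be made concrete with Young's classical first-order approximation (standard notation, not taken from the abstract): with checkpoint cost C and mean time between failures M, the optimal checkpoint interval and the resulting wasted-time fraction are roughly

    \tau_{\mathrm{opt}} \approx \sqrt{2\,C\,M},
    \qquad
    \text{waste at } \tau_{\mathrm{opt}} \approx \frac{C}{\tau_{\mathrm{opt}}} + \frac{\tau_{\mathrm{opt}}}{2M} = \sqrt{\frac{2C}{M}}

so as M shrinks, checkpoints must be taken more often and the wasted fraction grows like 1/\sqrt{M}, which is the unsustainable overhead the abstract refers to.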
July 16, 2020
Handling faults is a growing concern in HPC. In future exascale systems, silent undetected errors are projected to occur several times a day, increasing the occurrence of corrupted results. In this article, we propose SEDAR, a methodology that improves system reliability against transient faults when running parallel message-passing applications. Our approach, based on process replication for detection combined with different levels of checkpointing for aut...
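As a rough illustration of the detection-plus-checkpointing combination described here (not the authors' SEDAR implementation), a duplicated process can compare a checksum with its twin after every step and roll back to the last checkpoint on a mismatch. A minimal C/MPI sketch, assuming an even number of ranks and an in-memory checkpoint:

    /* Hedged sketch: replica comparison detects a silent error, the
     * checkpoint provides recovery.  Assumes an even number of ranks;
     * rank r pairs with rank r + size/2. */
    #include <mpi.h>
    #include <string.h>

    #define N 1024
    static double state[N], checkpoint[N];

    static double checksum(const double *v) {
        double s = 0.0;
        for (int i = 0; i < N; i++) s += v[i];
        return s;
    }

    static void step(double *v) {               /* stand-in compute step */
        for (int i = 0; i < N; i++) v[i] += 1.0;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int twin = (rank < size / 2) ? rank + size / 2 : rank - size / 2;

        for (int it = 0; it < 100; it++) {
            memcpy(checkpoint, state, sizeof state);   /* lightweight checkpoint */
            step(state);

            double mine = checksum(state), theirs;
            MPI_Sendrecv(&mine, 1, MPI_DOUBLE, twin, 0,
                         &theirs, 1, MPI_DOUBLE, twin, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            if (mine != theirs) {                      /* silent error detected */
                memcpy(state, checkpoint, sizeof state);
                it--;                                  /* redo the corrupted step */
            }
        }

        MPI_Finalize();
        return 0;
    }

Here detection comes from the replica comparison and recovery from the checkpoint, mirroring the division of roles in the abstract.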
August 19, 2021
Traditional resilient systems operate on fully replicated fault-tolerant clusters, which limits their scalability and performance. One way to take a step toward resilient high-performance systems that can deal with huge workloads is to enable independent fault-tolerant clusters to communicate and cooperate with each other efficiently, as this also enables the use of high-performance techniques such as sharding and parallel processing. Recently, such inter-cluster comm...