May 31, 2005
Modern high-performance computing relies heavily on commodity processors arranged in clusters. These clusters consist of individual nodes (typically off-the-shelf single- or dual-processor machines) connected by a high-speed interconnect. Cluster computation has many benefits, but it is also failure prone due to the sheer number of components involved. Many effective solutions have been proposed to aid failure recovery in clusters; their one significant downside is the limited range of failure models they support. Most work in the area has focused on detecting and correcting fail-stop errors. We propose a system that also detects errors under more general models, such as Byzantine errors, allowing existing failure recovery methods to handle them correctly.
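The abstract does not spell out the detection mechanism, but a common way to catch errors beyond fail-stop is redundant execution with majority voting. A minimal sketch in C with MPI, assuming a hypothetical triple-modular-redundancy layout; the grouping and the stand-in kernel are illustrative, not the proposed system:

    /* Hedged sketch: detect arbitrary-value (Byzantine) errors by running
     * each work item on three replicas and majority-voting the results.
     * compute_result() and the group layout are assumptions for illustration. */
    #include <mpi.h>
    #include <stdio.h>

    static double compute_result(int work_item) {
        return 3.0 * work_item + 7.0;          /* stand-in for the real kernel */
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Every three consecutive ranks form one replica group. */
        MPI_Comm group;
        MPI_Comm_split(MPI_COMM_WORLD, rank / 3, rank % 3, &group);

        int gsize;
        MPI_Comm_size(group, &gsize);
        if (gsize == 3) {
            double mine = compute_result(rank / 3), vals[3];
            MPI_Allgather(&mine, 1, MPI_DOUBLE, vals, 1, MPI_DOUBLE, group);

            /* Majority vote: the replica that disagrees with the other two is
             * reported as failed.  A real system would compare with a tolerance
             * rather than exact equality. */
            int suspect = -1;
            if (vals[0] == vals[1] && vals[2] != vals[0]) suspect = 2;
            else if (vals[0] == vals[2] && vals[1] != vals[0]) suspect = 1;
            else if (vals[1] == vals[2] && vals[0] != vals[1]) suspect = 0;

            if (rank % 3 == 0 && suspect >= 0)
                fprintf(stderr, "group %d: replica %d disagrees; flagging as failed\n",
                        rank / 3, suspect);
        }

        MPI_Comm_free(&group);
        MPI_Finalize();
        return 0;
    }

Once a disagreeing replica is reported as failed, it can be handed off to an ordinary fail-stop recovery scheme, which is the hand-off the abstract argues for.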
Similar papers
January 1, 2005
Supercomputing systems today often take the form of large numbers of commodity systems linked together into a computing cluster. Like any distributed system, these systems can have large numbers of independent hardware components cooperating on a computation. Unfortunately, any one of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence...
September 5, 2022
The increasing size of HPC architectures makes faults an ever more frequent eventuality. This issue is especially relevant since MPI, the de-facto standard for inter-process communication, lacks proper fault-management functionality. Past efforts produced extensions to the MPI standard that enable fault management, including ULFM. While it provides powerful tools to handle faults, it still faces limitations such as the collectiveness of the repair procedur...
April 29, 2021
Due to the increasing size of HPC machines, faults are becoming an eventuality that applications must face. Natively, MPI provides no support for execution past the detection of a fault, and this is becoming more and more constraining. The introduction of ULFM (the User Level Fault Mitigation library) provides a possible way to overcome a fault during application execution, at the cost of code modifications. ULFM is intrusive in the applic...
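To make the "cost of code modifications" concrete, here is a minimal sketch in C of how an application typically uses the ULFM extensions, assuming the MPIX_* prototypes are available via <mpi-ext.h>; the surrounding structure is illustrative, not taken from the paper:

    /* Hedged sketch of the pattern ULFM asks for: switch MPI to returning
     * errors, detect a failed process, revoke the communicator so every rank
     * learns about it, and shrink it to continue on the survivors. */
    #include <mpi.h>
    #include <mpi-ext.h>    /* ULFM extensions: MPIX_Comm_revoke, MPIX_Comm_shrink */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Comm comm;
        MPI_Comm_dup(MPI_COMM_WORLD, &comm);
        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);   /* do not abort on failure */

        double local = 1.0, sum;
        int err = MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, comm);

        if (err != MPI_SUCCESS) {
            int cls;
            MPI_Error_class(err, &cls);
            if (cls == MPIX_ERR_PROC_FAILED || cls == MPIX_ERR_REVOKED) {
                MPI_Comm survivors;
                MPIX_Comm_revoke(comm);               /* make the failure globally known */
                MPIX_Comm_shrink(comm, &survivors);   /* new communicator without dead ranks */
                MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, survivors);
                MPI_Comm_free(&comm);
                comm = survivors;                     /* carry on with the repaired communicator */
            }
        }

        MPI_Comm_free(&comm);
        MPI_Finalize();
        return 0;
    }

Note that MPIX_Comm_shrink is itself a collective operation over the surviving ranks, which is the "collectiveness of the repair procedure" the other abstracts point to as a limitation.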
December 16, 2022
Scientific applications have long embraced MPI as the environment of choice for executing on large distributed systems. The User-Level Failure Mitigation (ULFM) specification extends the MPI standard to address resilience and enable MPI applications to restore their communication capability after a failure. This work builds upon the wide body of experience gained in the field to eliminate a gap between current practice and the ideal, more asynchronous, recovery model in whi...
August 29, 2006
Grid environments have recently been developed with low stretch and overheads that increase with the logarithm of the number of nodes in the system. Getting and sending data to/from large numbers of nodes is gaining importance due to an increasing number of independent data providers and the heterogeneity of the network/Grid. One of the key challenges is to achieve a balance between low bandwidth consumption and good reliability. In this paper we present an implementation o...
October 9, 2003
In message-passing programs, once a process terminates with an unexpected error, the terminated process can propagate the error to the rest of the processes through communication dependencies, resulting in a program failure. Therefore, to locate faults, developers must identify both the group of processes involved in the original error and the faulty processes that activated the faults. This paper presents a novel debugging tool, named MPI-PreDebugger (MPI-PD), for localizing faulty processes ...
November 25, 2024
Supercomputers are getting ever larger and more energy-efficient, which is at odds with the reliability of the hardware they use. Thus, the time intervals between component failures are decreasing. Conversely, the latencies of individual operations of coarse-grained big-data tools grow with the number of processors. To overcome the resulting scalability limit, we need to go beyond the current practice of inter-operation checkpointing. We give first results on how to achieve this for the popular ...
October 25, 2023
As we have entered exascale computing, faults in high-performance systems are expected to increase considerably. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a much higher frequency, resulting in an excessive amount of overhead that would not be sustainable for many scientific applications. Replication allows for fast recovery from failures by simply dropping the failed processes and using their rep...
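The overhead argument can be made concrete with Young's classical first-order approximation (standard notation, not taken from the abstract): with checkpoint cost C and mean time between failures M, the optimal checkpoint interval and the resulting wasted-time fraction are roughly

    \tau_{\mathrm{opt}} \approx \sqrt{2\,C\,M},
    \qquad
    \text{waste at } \tau_{\mathrm{opt}} \approx \frac{C}{\tau_{\mathrm{opt}}} + \frac{\tau_{\mathrm{opt}}}{2M} = \sqrt{\frac{2C}{M}}

so as M shrinks, checkpoints must be taken more often and the wasted fraction grows like 1/\sqrt{M}, which is the unsustainable overhead the abstract refers to.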
July 16, 2020
Handling faults is a growing concern in HPC. In future exascale systems, silent undetected errors are projected to occur several times a day, increasing the occurrence of corrupted results. In this article, we propose SEDAR, a methodology that improves system reliability against transient faults when running parallel message-passing applications. Our approach, based on process replication for detection combined with different levels of checkpointing for aut...
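As a rough illustration of the detection-plus-checkpointing combination described here (not the authors' SEDAR implementation), a duplicated process can compare a checksum with its twin after every step and roll back to the last checkpoint on a mismatch. A minimal C/MPI sketch, assuming an even number of ranks and an in-memory checkpoint:

    /* Hedged sketch: replica comparison detects a silent error, the
     * checkpoint provides recovery.  Assumes an even number of ranks;
     * rank r pairs with rank r + size/2. */
    #include <mpi.h>
    #include <string.h>

    #define N 1024
    static double state[N], checkpoint[N];

    static double checksum(const double *v) {
        double s = 0.0;
        for (int i = 0; i < N; i++) s += v[i];
        return s;
    }

    static void step(double *v) {               /* stand-in compute step */
        for (int i = 0; i < N; i++) v[i] += 1.0;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int twin = (rank < size / 2) ? rank + size / 2 : rank - size / 2;

        for (int it = 0; it < 100; it++) {
            memcpy(checkpoint, state, sizeof state);   /* lightweight checkpoint */
            step(state);

            double mine = checksum(state), theirs;
            MPI_Sendrecv(&mine, 1, MPI_DOUBLE, twin, 0,
                         &theirs, 1, MPI_DOUBLE, twin, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            if (mine != theirs) {                      /* silent error detected */
                memcpy(state, checkpoint, sizeof state);
                it--;                                  /* redo the corrupted step */
            }
        }

        MPI_Finalize();
        return 0;
    }

Here detection comes from the replica comparison and recovery from the checkpoint, mirroring the division of roles in the abstract.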
August 19, 2021
Traditional resilient systems operate on fully replicated fault-tolerant clusters, which limits their scalability and performance. One way to take a step toward resilient high-performance systems that can deal with huge workloads is to enable independent fault-tolerant clusters to communicate and cooperate with each other efficiently, as this also enables the use of high-performance techniques such as sharding and parallel processing. Recently, such inter-cluster comm...