SafeMPI - Extending MPI for Byzantine Error Detection on Parallel Clusters

Can Agent Intelligence be used to Achieve Fault Tolerant Parallel Computing Systems?

August 13, 2013

84% Match

Blesson Varghese, Gerard McKee, Vassil Alexandrov

Distributed, Parallel, and C...

Multiagent Systems

The work reported in this paper is motivated by the aim of validating an alternative approach to fault tolerance over traditional methods like checkpointing that constrain efficacious fault tolerance. Can agent intelligence be used to achieve fault tolerant parallel computing systems? If so, "What agent capabilities are required for fault tolerance?", "What parallel computational tasks can benefit from such agent capabilities?" and "How can agent capabilities be implemented for fa...


Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.0)

November 8, 2016

84% Match

Saurabh Hukerikar, Christian Engelmann

Distributed, Parallel, and C...

Software Engineering

In this document, we develop a structured approach to the management of HPC resilience based on the concept of resilience-based design patterns. A design pattern is a general repeatable solution to a commonly occurring problem. We identify the commonly occurring problems and solutions used to deal with faults, errors and failures in HPC systems. The catalog of resilience design patterns provides designers with reusable design elements. We define a design framework that enhanc...


Fault Awareness in the MPI 4.0 Session Model

March 6, 2023

84% Match

Roberto Rocco, Gianluca Palermo, Daniele Gregori

Distributed, Parallel, and C...

The latest version of MPI introduces new functionalities like the Session model, but it still lacks fault management mechanisms. Past efforts produced tools and MPI standard extensions to manage fault presence, including ULFM. These measures are effective against faults but do not fully support the new additions to the standard. In this paper, we combine the fault management possibilities of ULFM with the new Session model functionality introduced in version 4.0 of the standa...


Towards Management of Energy Consumption in HPC Systems with Fault Tolerance

December 21, 2020

84% Match

Marina Morán, Javier Balladini, ... , Enzo Rucci

Distributed, Parallel, and C...

High-performance computing continues to increase its computing power and energy efficiency. However, energy consumption continues to rise and finding ways to limit and/or decrease it is a crucial point in current research. For high-performance MPI applications, there are rollback recovery based fault tolerance methods, such as uncoordinated checkpoints. These methods allow only some processes to go back in the face of failure, while the rest of the processes continue to run. ...


Failure Data Analysis of HPC Systems

February 20, 2013

84% Match

Charng-Da Lu

Distributed, Parallel, and C...

Continuous availability of HPC systems built from commodity components has become a primary concern as system size grows to thousands of processors. In this paper, we present the analysis of 8-24 months of real failure data collected from three HPC systems at the National Center for Supercomputing Applications (NCSA) during 2001-2004. The results show that the availability is 98.7-99.8% and most outages are due to software halts. On the other hand, the downtime is mostly co...


PGMPI: Automatically Verifying Self-Consistent MPI Performance Guidelines

June 1, 2016

84% Match

Sascha Hunold, Alexandra Carpen-Amarie, ... , Jesper Larsson Träff

Distributed, Parallel, and C...

The Message Passing Interface (MPI) is the most commonly used application programming interface for process communication on current large-scale parallel systems. Due to the scale and complexity of modern parallel architectures, it is becoming increasingly difficult to optimize MPI libraries, as many factors can influence the communication performance. To assist MPI developers and users, we propose an automatic way to check whether MPI libraries respect self-consistent perfor...


Implementing Efficient Message Logging Protocols as MPI Application Extensions

May 8, 2019

84% Match

Kiril Dichev, Dimitrios S. Nikolopoulos

Distributed, Parallel, and C...

Message logging protocols are enablers of local rollback, a more efficient alternative to global rollback, for fault tolerant MPI applications. Until now, message logging MPI implementations have incurred the overheads of a redesign and redeployment of an MPI library, as well as continued performance penalties across various kernels. Successful research efforts for message logging implementations do exist, but not a single one of them can be easily deployed today by more than...


A high-level C++ approach to manage local errors, asynchrony and faults in an MPI application

April 12, 2018

84% Match

Christian Engwer, Mirco Altenbernd, ... , Dominik Göddeke

Distributed, Parallel, and C...

C++ advocates exceptions as the preferred way to handle unexpected behaviour of an implementation in the code. This does not integrate well with the error handling of MPI, which more or less always results in program termination in case of MPI failures. In particular, a local C++ exception can currently lead to a deadlock due to unfinished communication requests on remote hosts. At the same time, future MPI implementations are expected to include an API to continue computatio...


A Pattern Language for High-Performance Computing Resilience

October 25, 2017

84% Match

Saurabh Hukerikar, Christian Engelmann

Distributed, Parallel, and C...

Software Engineering

High-performance computing (HPC) systems provide powerful capabilities for modeling, simulation, and data analytics for a broad class of computational problems. They enable extreme performance on the order of quadrillions of floating-point arithmetic calculations per second by aggregating the power of millions of compute, memory, networking and storage components. With the rapidly growing scale and complexity of HPC systems for achieving even greater performance, ensuring their r...


Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale

August 23, 2017

83% Match

Saurabh Hukerikar, Christian Engelmann

Distributed, Parallel, and C...

Software Engineering

Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. While the HPC community has developed various resilience solutions, the solution space remains fragmented. There are no formal methods and metrics to integrate the various HPC resilience techniques into composite solutions, nor are there methods to holistically evaluate the adequacy and efficacy of such solutions in terms of their protection coverage, and their performance & po...

