SafeMPI - Extending MPI for Byzantine Error Detection on Parallel Clusters

May 31, 2005

Checkpoint-Restart Libraries Must Become More Fault Tolerant

December 20, 2021

85% Match

Anthony Skjellum, Derek Schafer

Distributed, Parallel, and C...

Production MPI codes need checkpoint-restart (CPR) support. Clearly, checkpoint-restart libraries must be fault tolerant lest they open up a window of vulnerability for failures with Byzantine outcomes. However, certain popular libraries that leverage MPI are evidently not fault tolerant. Nowadays, fault detection with automatic recovery without batch requeueing is a strong requirement for production environments. Thus, allowing deadlock and setting long timeouts are suboptimal f...


Building a fault tolerant application using the GASPI communication layer

May 18, 2015

85% Match

Faisal Shahzad, Moritz Kreutzer, Thomas Zeiser, Rui Machado, Andreas Pieper, ... , Gerhard Wellein

Distributed, Parallel, and C...

It is commonly agreed that highly parallel software on exascale computers will suffer from many more runtime failures because of the decreasing mean time to failure (MTTF). Therefore, it is not surprising that a lot of research is going on in the area of fault tolerance and fault mitigation. Applications should survive a failure and/or be able to recover with minimal cost. MPI is not yet very mature in handling failures; the User-Level Failure Mitigation (ULFM) prop...


Collective Vector Clocks: Low-Overhead Transparent Checkpointing for MPI

December 12, 2022

85% Match

Yao Xu, Gene Cooperman

Distributed, Parallel, and C...

Taking snapshots of the state of a distributed computation is useful for off-line analysis of the computational state, for later restarting from the saved snapshot, for cloning a copy of the computation, and for migration to a new cluster. The problem is made more difficult when supporting collective operations across processes, such as barrier, reduce operations, scatter and gather, etc. Some processes may have reached the barrier or other collective operation, while other p...


MATCH: An MPI Fault Tolerance Benchmark Suite

February 13, 2021

85% Match

Luanzheng Guo, Giorgis Georgakoudis, Konstantinos Parasyris, ... , Dong Li

Distributed, Parallel, and C...

MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate distributed scientific applications running on tens of hundreds of processes and compute nodes. Maintaining the correctness and integrity of MPI application execution is critical, especially for safety-critical scientific applications. Therefore, a collection of effective MPI fault tolerance techniques has been proposed to enable MPI application execution to efficiently resume from system failure...


Improving the Performance and Resilience of MPI Parallel Jobs with Topology and Fault-Aware Process Placement

December 29, 2020

85% Match

Ioannis Vardas, Manolis Ploumidis, Manolis Marazakis

Distributed, Parallel, and C...

HPC systems keep growing in size to meet the ever-increasing demand for performance and computational resources. Apart from increased performance, large scale systems face two challenges that hinder further growth: energy efficiency and resiliency. At the same time, applications seeking increased performance rely on advanced parallelism for exploiting system resources, which leads to increased pressure on system interconnects. At large system scales, increased communication l...


Robust Failure Detection Architecture for Large Scale Distributed Systems

October 5, 2009

85% Match

Ciprian Mihai Dobre, Florin Pop, Alexandru Costan, ... , Valentin Cristea

Distributed, Parallel, and C...

Networking and Internet Arch...

Failure detection is a fundamental building block for ensuring fault tolerance in large scale distributed systems. There are many approaches to and implementations of failure detectors. Providing flexible failure detection in off-the-shelf distributed systems is difficult. In this paper we present an innovative solution to this problem. Our approach is based on adaptive, decentralized failure detectors, capable of working asynchronously and independently of the application flow. ...


Cluster Computing White Paper

April 25, 2000

85% Match

Mark Baker (University of Portsmouth, UK)

Distributed, Parallel, and C...

Hardware Architecture

Networking and Internet Arch...

Cluster computing is not a new area of computing. It is, however, evident that there is a growing interest in its usage in all areas where applications have traditionally used parallel or distributed computing platforms. The growing interest has been fuelled in part by the availability of powerful microprocessors and high-speed networks as off-the-shelf commodity components as well as in part by the rapidly maturing software components available to support high performance an...


Toward Resilient Algorithms and Applications

February 16, 2014

85% Match

Michael A. Heroux

Mathematical Software

Distributed, Parallel, and C...

Over the past decade, the high performance computing community has become increasingly concerned that preserving the reliable, digital machine model will become too costly or infeasible. In this paper we discuss four approaches for developing new algorithms that are resilient to hard and soft failures.


Reinit++: Evaluating the Performance of Global-Restart Recovery Methods For MPI Fault Tolerance

February 13, 2021

85% Match

Giorgis Georgakoudis, Luanzheng Guo, Ignacio Laguna

Distributed, Parallel, and C...

Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint. However, re-deploying an application incurs overhead by tearing down and re-instating execution, and possibly limiting checkpoint retrieval to slow permanent storage. In this paper we present...


ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms

March 2, 2022

85% Match

Lukas Hübner, Demian Hespe, ... , Alexandros Stamatakis

Distributed, Parallel, and C...

Fault-tolerant distributed applications require mechanisms to recover data lost via a process failure. On modern cluster systems it is typically impractical to request replacement resources after such a failure. Therefore, applications have to continue working with the remaining resources. This requires redistributing the workload and having the non-failed processes reload data. We present an algorithmic framework and its C++ library implementation ReStore for MPI programs that...
