ID: cs/0506001

SafeMPI - Extending MPI for Byzantine Error Detection on Parallel Clusters

May 31, 2005


Similar papers (page 3)

MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing

April 20, 2019

85% Match
Rohan Garg, Gregory Price, Gene Cooperman
Distributed, Parallel, and Cluster Computing
Operating Systems

Transparently checkpointing MPI for fault tolerance and load balancing is a long-standing problem in HPC. The problem has been complicated by the need to provide checkpoint-restart services for all combinations of an MPI implementation over all network interconnects. This work presents MANA (MPI-Agnostic Network-Agnostic transparent checkpointing), a single code base which supports all MPI implementation and interconnect combinations. The agnostic properties imply that one ca...


The fault-tolerant cluster-sending problem

August 5, 2019

85% Match
Jelle Hellings, Mohammad Sadoghi
Distributed, Parallel, and Cluster Computing

The development of fault-tolerant distributed systems that can tolerate Byzantine behavior has traditionally been focused on consensus protocols, which support fully-replicated designs. For the development of more sophisticated high-performance Byzantine distributed systems, however, more specialized fault-tolerant communication primitives are necessary. In this paper, we identify an essential communication primitive and study it in depth. Specifically, we formalize the clu...


Large Scale Parallelization Using File-Based Communications

September 3, 2019

85% Match
Chansup Byun, Jeremy Kepner, William Arcand, David Bestor, Bill Bergeron, Vijay Gadepally, Michael Houle, Matthew Hubbell, Michael Jones, Anna Klein, Peter Michaleas, Julie Mullen, Andrew Prout, Antonio Rosa, Siddharth Samsi, ... , Albert Reuther
Distributed, Parallel, and Cluster Computing

In this paper, we present a novel file-based communication architecture that uses the local filesystem for large-scale parallelization. This new approach eliminates the issues of filesystem overload and resource contention that arise when using the central filesystem for large parallel jobs. The new approach incurs additional overhead due to inter-node message file transfers when the sending and receiving processes are on different nodes. However, even with this additional ...
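
The core mechanism described above — passing messages as files through a per-receiver spool directory, with inter-node file transfer when needed — can be sketched as follows (hypothetical send/recv names and file layout; not the paper's actual implementation):

```python
import os
import tempfile

def send(spool_dir, sender_id, msg_id, payload):
    """Write a message file atomically: write to a temp file, then rename.
    rename() within one filesystem is atomic on POSIX, so the receiver
    never observes a partially written message."""
    os.makedirs(spool_dir, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=spool_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    os.replace(tmp, os.path.join(spool_dir, f"{sender_id}-{msg_id}.msg"))

def recv(spool_dir, sender_id, msg_id):
    """Return the payload if the message file has arrived, else None.
    The file is deleted on read, i.e. each message is consumed once."""
    path = os.path.join(spool_dir, f"{sender_id}-{msg_id}.msg")
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        data = f.read()
    os.remove(path)
    return data
```

On the same node this is pure local-filesystem traffic; across nodes a transfer daemon (e.g. rsync or scp in a background loop) would ship message files into the remote spool directory, which is where the extra overhead mentioned in the abstract comes from.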


Shrink or Substitute: Handling Process Failures in HPC Systems using In-situ Recovery

January 14, 2018

85% Match
Rizwan A. Ashraf, Saurabh Hukerikar, Christian Engelmann
Distributed, Parallel, and Cluster Computing

Efficient utilization of today's high-performance computing (HPC) systems with complex hardware and software components requires that the HPC applications are designed to tolerate process failures at runtime. With low mean time to failure (MTTF) of current and future HPC systems, long running simulations on these systems require capabilities for gracefully handling process failures by the applications themselves. In this paper, we explore the use of fault tolerance extensions...
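
The title names the two in-situ recovery strategies: shrink the job to the surviving processes, or substitute spares for the failed ones. A small illustrative sketch of that decision (hypothetical recover helper; the paper's actual mechanism builds on MPI fault-tolerance extensions):

```python
def recover(workers, spares, failed, strategy):
    """Return (new_workers, remaining_spares) after a failure.

    "shrink":     continue with the surviving workers only; work must be
                  redistributed over a smaller set of processes.
    "substitute": replace each failed worker with a standby spare, keeping
                  the process count (and data distribution) unchanged.
    """
    survivors = [w for w in workers if w not in failed]
    if strategy == "shrink":
        return survivors, spares
    if strategy == "substitute":
        if len(failed) > len(spares):
            raise RuntimeError("not enough spares; fall back to shrink")
        replacements = spares[:len(failed)]
        remaining = spares[len(failed):]
        return survivors + replacements, remaining
    raise ValueError(f"unknown strategy: {strategy}")
```

The trade-off the paper explores follows directly: shrinking avoids idle spares but forces redistribution of the failed process's workload, while substitution preserves the original decomposition at the cost of reserving standby resources.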


Learning from the Success of MPI

September 13, 2001

84% Match
William D. Gropp
Distributed, Parallel, and Cluster Computing

The Message Passing Interface (MPI) has been extremely successful as a portable way to program high-performance parallel computers. This success has occurred in spite of the view of many that message passing is difficult and that other approaches, including automatic parallelization and directive-based parallelism, are easier to use. This paper argues that MPI has succeeded because it addresses all of the important issues in providing a parallel programming model.


Checkpoint/restart approaches for a thread-based MPI runtime

June 12, 2019

84% Match
Julien Adam, Maxime Kermarquer, Jean-Baptiste Besnard, Leonardo Bautista-Gomez, Marc Perache, Patrick Carribault, Julien Jaeger, ... , Sameer Shende
Distributed, Parallel, and Cluster Computing

Fault-tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing units. Moreover, the larger jobs are, the more computing hours would be wasted by a crash. In this paper, we describe the work done in our MPI runtime to enable both transparent and application-level checkpointing mechanisms. Unlike the MPI 4.0...
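
The application-level side of the checkpointing the abstract mentions can be sketched in a few lines: the application periodically serializes its own state, and after a crash a restart resumes from the last complete checkpoint (hypothetical Checkpointer API, not the runtime described in the paper):

```python
import json
import os

class Checkpointer:
    """Toy application-level checkpointing: the application itself decides
    what state to save and when."""

    def __init__(self, path):
        self.path = path

    def save(self, state):
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        # atomic replace: a crash mid-write leaves the old checkpoint intact
        os.replace(tmp, self.path)

    def restore(self, default):
        if os.path.exists(self.path):
            with open(self.path) as f:
                return json.load(f)
        return default

def run(ckpt, total_steps=10, ckpt_every=3):
    """A restartable loop: after a crash, rerunning resumes from the
    last checkpoint instead of from step 0."""
    state = ckpt.restore({"step": 0, "acc": 0})
    while state["step"] < total_steps:
        state["acc"] += state["step"]   # the "computation"
        state["step"] += 1
        if state["step"] % ckpt_every == 0:
            ckpt.save(state)
    return state["acc"]
```

Transparent checkpointing, by contrast, snapshots the whole process image without application involvement; the paper's runtime supports both modes.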


Parallel Computing on a PC Cluster

September 4, 2001

84% Match
X. Q. Luo (Zhongshan University), E. B. Gregory (Zhongshan University), J. C. Yang (Guoxun Ltd), Y. L. Wang (Guoxun Ltd), ... , Y. Lin (Guoxun Ltd)
Distributed, Parallel, and Cluster Computing

The tremendous advance in computer technology in the past decade has made it possible to achieve the performance of a supercomputer on a very small budget. We have built a multi-CPU cluster of Pentium PC capable of parallel computations using the Message Passing Interface (MPI). We will discuss the configuration, performance, and application of the cluster to our work in physics.


Performance Evaluation of an Algorithm-based Asynchronous Checkpoint-Restart Fault Tolerant Application Using Mixed MPI/GPI-2

April 30, 2018

84% Match
Adrian Bazaga, Michal Pitonak
Distributed, Parallel, and Cluster Computing

One of the hardest challenges of the current Big Data landscape is the lack of ability to process huge volumes of information in an acceptable time. The goal of this work is to ascertain whether it is useful to use typical Big Data tools to solve High Performance Computing problems, by exploring and comparing a distributed computing framework implemented on a commodity cluster architecture: the experiment will depend on the computational time required using tools such as Apache S...


An Algorithm for Tolerating Crash Failures in Distributed Systems

January 16, 2016

84% Match
Vincenzo De Florio, Geert Deconinck, Rudy Lauwereins
Distributed, Parallel, and Cluster Computing

In the framework of the ESPRIT project 28620 "TIRAN" (tailorable fault tolerance frameworks for embedded applications), a toolset of error detection, isolation, and recovery components is being designed to serve as a basic means for orchestrating application-level fault tolerance. These tools will be used either as stand-alone components or as the peripheral components of a distributed application, which we call "the backbone". The backbone is to run in the background of the u...


A Model for Communication in Clusters of Multi-core Machines

October 13, 2008

84% Match
Christine Task, Arun Chauhan
Distributed, Parallel, and Cluster Computing
Data Structures and Algorithms

A common paradigm for scientific computing is distributed message-passing systems, and a common approach to these systems is to implement them across clusters of high-performance workstations. As multi-core architectures become increasingly mainstream, these clusters are very likely to include multi-core machines. However, the theoretical models which are currently used to develop communication algorithms across these systems do not take into account the unique properties of ...
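
The property flat models miss is that intra-node communication (through shared memory) is far cheaper than inter-node communication (over the network). A two-level alpha-beta (latency + bandwidth) cost model makes the consequence concrete (all parameter values below are illustrative, not taken from the paper):

```python
def comm_cost(msg_bytes, same_node,
              alpha_intra=1e-6, beta_intra=1e-10,
              alpha_inter=5e-5, beta_inter=1e-9):
    """Two-level alpha-beta cost model: cost = latency + bytes/bandwidth,
    with cheaper parameters for intra-node (shared-memory) transfers."""
    if same_node:
        return alpha_intra + beta_intra * msg_bytes
    return alpha_inter + beta_inter * msg_bytes

def naive_bcast_cost(n_nodes, cores_per_node, msg_bytes):
    """Root sends to every other process directly, ignoring node locality."""
    p = n_nodes * cores_per_node
    return (p - 1) * comm_cost(msg_bytes, same_node=False)

def hierarchical_bcast_cost(n_nodes, cores_per_node, msg_bytes):
    """One inter-node send per remote node, then intra-node fan-out."""
    return ((n_nodes - 1) * comm_cost(msg_bytes, same_node=False)
            + (cores_per_node - 1) * comm_cost(msg_bytes, same_node=True))
```

Under any parameters where inter-node traffic dominates, the hierarchical schedule wins — which is exactly the kind of locality-aware algorithm design a flat, single-level model cannot motivate.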
