May 31, 2005
Similar papers 3
April 20, 2019
Transparently checkpointing MPI for fault tolerance and load balancing is a long-standing problem in HPC. The problem has been complicated by the need to provide checkpoint-restart services for all combinations of an MPI implementation over all network interconnects. This work presents MANA (MPI-Agnostic Network-Agnostic transparent checkpointing), a single code base which supports all MPI implementation and interconnect combinations. The agnostic properties imply that one ca...
August 5, 2019
The development of fault-tolerant distributed systems that can tolerate Byzantine behavior has traditionally been focused on consensus protocols, which support fully-replicated designs. For the development of more sophisticated high-performance Byzantine distributed systems, more specialized fault-tolerant communication primitives are necessary, however. In this paper, we identify an essential communication primitive and study it in depth. In specifics, we formalize the clu...
September 3, 2019
In this paper, we present a novel and new file-based communication architecture using the local filesystem for large scale parallelization. This new approach eliminates the issues with filesystem overload and resource contention when using the central filesystem for large parallel jobs. The new approach incurs additional overhead due to inter-node message file transfers when both the sending and receiving processes are not on the same node. However, even with this additional ...
January 14, 2018
Efficient utilization of today's high-performance computing (HPC) systems with complex hardware and software components requires that the HPC applications are designed to tolerate process failures at runtime. With low mean time to failure (MTTF) of current and future HPC systems, long running simulations on these systems require capabilities for gracefully handling process failures by the applications themselves. In this paper, we explore the use of fault tolerance extensions...
September 13, 2001
The Message Passing Interface (MPI) has been extremely successful as a portable way to program high-performance parallel computers. This success has occurred in spite of the view of many that message passing is difficult and that other approaches, including automatic parallelization and directive-based parallelism, are easier to use. This paper argues that MPI has succeeded because it addresses all of the important issues in providing a parallel programming model.
June 12, 2019
Fault-tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing units. Moreover, the larger jobs are, the more computing hours would be wasted by a crash. In this paper, we describe the work done in our MPI runtime to enable both transparent and application-level checkpointing mechanisms. Unlike the MPI 4.0...
September 4, 2001
The tremendous advance in computer technology in the past decade has made it possible to achieve the performance of a supercomputer on a very small budget. We have built a multi-CPU cluster of Pentium PC capable of parallel computations using the Message Passing Interface (MPI). We will discuss the configuration, performance, and application of the cluster to our work in physics.
April 30, 2018
One of the hardest challenges of the current Big Data landscape is the lack of ability to process huge volumes of information in an acceptable time. The goal of this work, is to ascertain if it is useful to use typical Big Data tools to solve High Performance Computing problems, by exploring and comparing a distributed computing framework implemented on a commodity cluster architecture: the experiment will depend on the computational time required using tools such as Apache S...
January 16, 2016
In the framework of the ESPRIT project 28620 "TIRAN" (tailorable fault tolerance frameworks for embedded applications), a toolset of error detection, isolation, and recovery components is being designed to serve as a basic means for orchestrating application-level fault tolerance. These tools will be used either as stand-alone components or as the peripheral components of a distributed application, that we call 'the backbone". The backbone is to run in the background of the u...
October 13, 2008
A common paradigm for scientific computing is distributed message-passing systems, and a common approach to these systems is to implement them across clusters of high-performance workstations. As multi-core architectures become increasingly mainstream, these clusters are very likely to include multi-core machines. However, the theoretical models which are currently used to develop communication algorithms across these systems do not take into account the unique properties of ...