SafeMPI - Extending MPI for Byzantine Error Detection on Parallel Clusters

May 31, 2005

MPI Advance: Open-Source Message Passing Optimizations

September 13, 2023

Amanda Bienz, Derek Schafer, Anthony Skjellum

Distributed, Parallel, and C...

The large variety of production implementations of the message passing interface (MPI) each provide unique and varying underlying algorithms. Each emerging supercomputer supports one or a small number of system MPI installations, tuned for the given architecture. Performance varies with MPI version, but application programmers are typically unable to achieve optimal performance with local MPI installations and therefore rely on whichever implementation is provided as a system...

An Optimized Error-controlled MPI Collective Framework Integrated with Lossy Compression

April 8, 2023

Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Zhaorui Zhang, Jinyang Liu, Xiaoyi Lu, Ken Raffenetti, Hui Zhou, Kai Zhao, Zizhong Chen, Franck Cappello, ... , Rajeev Thakur

Distributed, Parallel, and C...

With the ever-increasing computing power of supercomputers and the growing scale of scientific applications, the efficiency of MPI collective communications turns out to be a critical bottleneck in large-scale distributed and parallel processing. The large message size in MPI collectives is particularly concerning because it can significantly degrade the overall parallel performance. To address this issue, prior research simply applies the off-the-shelf fixed-rate lossy compres...

Enabling Practical Transparent Checkpointing for MPI: A Topological Sort Approach

August 5, 2024

Yao Xu, Gene Cooperman

Distributed, Parallel, and C...

MPI is the de facto standard for parallel computing on a cluster of computers. Checkpointing is an important component in any strategy for software resilience and for long-running jobs that must be executed by chaining together time-bounded resource allocations. This work solves an old problem: a practical and general algorithm for transparent checkpointing of MPI that is both efficient and compatible with most of the latest network software. Transparent checkpointing is attr...

Algorithmic Based Fault Tolerance Applied to High Performance Computing

June 19, 2008

George Bosilca, Remi Delmas, ... , Julien Langou

Distributed, Parallel, and C...

Mathematical Software

We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the needs of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flips) on the fly during a computation. To assess the viability of our approach, we have developed a fault tolerant matrix-ma...

A Scalable Byzantine Grid

October 17, 2012

Alexandre Maurer (LIP6, LINCS), Sébastien Tixeuil (LIP6, LINCS, IUF)

Distributed, Parallel, and C...

Cryptography and Security

Networking and Internet Arch...

Modern networks assemble an ever-growing number of nodes. However, it remains difficult to increase the number of channels per node, so the maximal degree of the network may be bounded. This is typically the case in grid topology networks, where each node has at most four neighbors. In this paper, we address the following issue: if each node is likely to fail in an unpredictable manner, how can we preserve some global reliability guarantees when the number of nodes keeps in...

Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing

February 22, 2018

Rizwan A. Ashraf, Saurabh Hukerikar, Christian Engelmann

Distributed, Parallel, and C...

Resiliency is the ability of large-scale high-performance computing (HPC) applications to gracefully handle errors, and recover from failures. In this paper, we propose a pattern-based approach to constructing resilience solutions that handle multiple error modes. Using resilience patterns, we evaluate the performance and reliability characteristics of detection, containment and mitigation techniques for transient errors that cause silent data corruptions and techniques for f...

MPIgnite: An MPI-Like Language and Prototype Implementation for Apache Spark

July 15, 2017

Brandon L. Morris, Anthony Skjellum

Distributed, Parallel, and C...

Scale-out parallel processing based on MPI is a 25-year-old standard with at least another decade of preceding history of enabling technologies in the High Performance Computing community. Newer frameworks such as MapReduce, Hadoop, and Spark represent industrial scalable computing solutions that have received broad adoption because of their comparative simplicity of use, applicability to relevant problems, and ability to harness scalable, distributed resources. While MPI pro...

Performance Evaluation of Checkpoint/Restart Techniques

November 29, 2023

Basma Abdel Azeem, Manal Helal

Distributed, Parallel, and C...

Distributed applications running on a large cluster environment, such as cloud instances, will have shorter execution times. However, the application might suffer from sudden termination due to unpredicted computing node failures, thus losing the whole computation. Checkpoint/restart is a fault tolerance technique used to solve this problem. In this work we evaluated the performance of two of the most commonly used checkpoint/restart techniques (Distributed Multithreaded C...

CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

August 7, 2017

Faisal Shahzad, Jonas Thies, Moritz Kreutzer, Thomas Zeiser, ... , Gerhard Wellein

Distributed, Parallel, and C...

In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency but it takes a lot of implementation effort. This work presents the implementation o...

Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems

November 10, 2023

Marina Moran, Javier Balladini, ... , Enzo Rucci

Distributed, Parallel, and C...

Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers in scientific and technological research. As large-scale HPC systems require some fault-tolerant method, the opportunities to reduce energy consumption should be explored. In particular, rollback-recovery methods using uncoordinated checkpoints prevent all processes from re-executing when a failure occurs. In this context, it is possible to take actions to reduce t...
