MATCH: An MPI Fault Tolerance Benchmark Suite

02/13/2021
by Luanzheng Guo, et al.

MPI is ubiquitously deployed on flagship HPC systems to accelerate distributed scientific applications running on tens of hundreds of processes and compute nodes. Maintaining the correctness and integrity of MPI application execution is critical, especially for safety-critical scientific applications. Therefore, a collection of effective MPI fault tolerance techniques has been proposed to enable MPI application execution to resume efficiently from system failures. However, there is no structured way to study and compare different MPI fault tolerance designs so as to guide the selection and development of efficient MPI fault tolerance techniques for distinct scenarios. To solve this problem, we design, develop, and evaluate a benchmark suite called MATCH to characterize, research, and comprehensively compare different combinations and configurations of MPI fault tolerance designs. Our investigation yields three useful findings: (1) Reinit recovery in general performs better than ULFM recovery; (2) Reinit recovery is independent of the scaling size and the input problem size, whereas ULFM recovery is not; (3) using Reinit recovery with FTI checkpointing is a highly efficient fault tolerance design. MATCH code is available at https://github.com/kakulo/MPI-FT-Bench.

research
02/13/2021

Reinit++: Evaluating the Performance of Global-Restart Recovery Methods For MPI Fault Tolerance

Scaling supercomputers comes with an increase in failure rates due to th...
research
12/16/2022

Implicit Actions and Non-blocking Failure Recovery with MPI

Scientific applications have long embraced the MPI as the environment of...
research
04/12/2018

A high-level C++ approach to manage local errors, asynchrony and faults in an MPI application

C++ advocates exceptions as the preferred way to handle unexpected behav...
research
04/05/2021

ECRM: Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding

Deep-learning-based recommendation models (DLRMs) are widely deployed to...
research
08/24/2017

Reliability and Fault-Tolerance by Choreographic Design

Distributed programs are hard to get right because they are required to ...
research
05/20/2019

Scylla: A Mesos Framework for Container Based MPI Jobs

Open source cloud technologies provide a wide range of support for creat...
research
12/01/2017

DAOS for Extreme-scale Systems in Scientific Applications

Exascale I/O initiatives will require new and fully integrated I/O model...
