Implicit Actions and Non-blocking Failure Recovery with MPI

12/16/2022
by Aurelien Bouteiller, et al.

Scientific applications have long embraced MPI as the environment of choice to execute on large distributed systems. The User-Level Failure Mitigation (ULFM) specification extends the MPI standard to address resilience and enable MPI applications to restore their communication capability after a failure. This work builds upon the wide body of experience gained in the field to eliminate a gap between current practice and the ideal, more asynchronous, recovery model in which the fault tolerance activities of multiple components can be carried out simultaneously and overlap. This work proposes to: (1) provide the required consistency in fault reporting to applications (i.e., enable an application to assess the success of a computational phase without incurring an unacceptable performance hit); (2) bring forward the building blocks that permit the effective scoping of fault recovery in an application, so that independent components in an application can recover without interfering with each other, and separate groups of processes in the application can recover independently or in unison; and (3) overlap recovery activities necessary to restore the consistency of the system (e.g., eviction of faulty processes from the communication group) with application recovery activities (e.g., dataset restoration from checkpoints).

research
02/13/2021

MATCH: An MPI Fault Tolerance Benchmark Suite

MPI has been ubiquitously deployed in flagship HPC systems aiming to acc...
research
02/13/2021

Reinit++: Evaluating the Performance of Global-Restart Recovery Methods For MPI Fault Tolerance

Scaling supercomputers comes with an increase in failure rates due to th...
research
03/02/2022

ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms

Fault-tolerant distributed applications require mechanisms to recover da...
research
12/20/2021

Checkpoint-Restart Libraries Must Become More Fault Tolerant

Production MPI codes need checkpoint-restart (CPR) support. Clearly, che...
research
06/12/2019

Checkpoint/restart approaches for a thread-based MPI runtime

Fault-tolerance has always been an important topic when it comes to runn...
research
07/27/2016

System-level Scalable Checkpoint-Restart for Petascale Computing

Fault tolerance for the upcoming exascale generation has long been an ar...
research
05/08/2019

Implementing Efficient Message Logging Protocols as MPI Application Extensions

Message logging protocols are enablers of local rollback, a more efficie...