Checkpoint-Restart Libraries Must Become More Fault Tolerant

12/20/2021
by Anthony Skjellum, et al.

Production MPI codes need checkpoint-restart (CPR) support. Clearly, checkpoint-restart libraries must be fault tolerant lest they open up a window of vulnerability for failures with byzantine outcomes. However, certain popular libraries that leverage MPI are evidently not fault tolerant. Nowadays, fault detection with automatic recovery, without batch requeueing, is a strong requirement for production environments. Thus, allowing deadlock and setting long timeouts are suboptimal means of fault detection, even when paired with conservative recovery from the penultimate checkpoint. When MPI is used as a communication mechanism within a CPR library, such libraries must offer fault-tolerant extensions with minimal detection, isolation, mitigation, and potential recovery semantics to aid the CPR library's fail-backward. Communication between MPI and the checkpoint library regarding system health may be valuable. For fault-tolerant MPI programs (e.g., using APIs like FA-MPI, Stages/Reinit, or ULFM), the checkpoint library must cooperate with the extended model, or else it invalidates fault-tolerant operation.
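As a rough illustration of the "conservative recovery from the penultimate checkpoint" mentioned in the abstract, the sketch below picks a restart point from a checkpoint history. It is not taken from the paper or from any CPR library: the `Checkpoint` record, its `complete` flag, and `pick_restart_point` are hypothetical names. It assumes that the most recent complete checkpoint is distrusted because it may have been written inside the failure window.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical checkpoint record (not any real library's format)."""
    step: int       # application step at which the checkpoint was taken
    complete: bool  # True only if every rank finished writing it


def pick_restart_point(history: List[Checkpoint]) -> Optional[Checkpoint]:
    """Return the penultimate complete checkpoint, or None if fewer than two exist.

    Assumption: the newest complete checkpoint is skipped on purpose, since
    a fault may have occurred while it was being written.
    """
    # Ignore partially written checkpoints, then order the rest by step.
    complete = sorted((c for c in history if c.complete), key=lambda c: c.step)
    if len(complete) < 2:
        return None  # nothing older to fall back to; caller restarts from scratch
    return complete[-2]


if __name__ == "__main__":
    history = [
        Checkpoint(step=10, complete=True),
        Checkpoint(step=20, complete=True),
        Checkpoint(step=30, complete=True),
        Checkpoint(step=40, complete=False),  # interrupted by the fault
    ]
    print(pick_restart_point(history))
```

Falling back one extra checkpoint trades lost work for safety. As the abstract argues, this policy is still not enough if the fault itself can only be detected through deadlock or a long timeout.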
