ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms

03/02/2022
by Demian Hespe, et al.

Fault-tolerant distributed applications require mechanisms to recover data lost in a process failure. On modern cluster systems, it is typically impractical to request replacement resources after such a failure, so applications have to continue working with the remaining resources. This requires redistributing the workload and having the surviving processes reload the lost data. We present an algorithmic framework and its C++ library implementation, ReStore, for MPI programs that enables recovery of lost data after one or more process failures. By storing all required data in memory via an appropriate data distribution and replication, recovery is substantially faster than with standard checkpointing schemes that rely on a parallel file system. As the application developer can specify which data to load, we also support shrinking recovery instead of recovery using spare compute nodes. We evaluate ReStore in both controlled, isolated environments and real applications. Our experiments show loading times of lost input data in the range of milliseconds on up to 24576 processors and a substantial speedup of the recovery time for the fault-tolerant version of a widely used bioinformatics application.
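
To make the idea of in-memory replication concrete, here is a minimal C++17 sketch of one possible placement rule. It is an illustration only and is not ReStore's actual data distribution or API: the function names `replica_holders` and `recovery_source`, the fixed-stride placement, and the `alive` vector standing in for failure detection are all assumptions made for this example. Each data block is copied to several ranks spaced evenly across the process range, and after a failure any surviving holder can serve the block from memory.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper (not part of ReStore): returns the ranks that hold a
// copy of `block` when every block is stored `replicas` times on `num_procs`
// ranks. Copies are placed num_procs / replicas ranks apart, so all holders
// of one block are distinct ranks.
std::vector<int> replica_holders(std::size_t block, int num_procs, int replicas) {
    assert(num_procs > 0 && replicas > 0 && replicas <= num_procs);
    const int stride = num_procs / replicas;
    const int first = static_cast<int>(block % static_cast<std::size_t>(num_procs));
    std::vector<int> holders;
    holders.reserve(static_cast<std::size_t>(replicas));
    for (int k = 0; k < replicas; ++k) {
        holders.push_back((first + k * stride) % num_procs);
    }
    return holders;
}

// Hypothetical helper (not part of ReStore): picks the first surviving rank
// that holds `block`. `alive[r]` is assumed to say whether rank r is still
// running. Returns std::nullopt if every copy of the block was lost.
std::optional<int> recovery_source(std::size_t block, int num_procs, int replicas,
                                   const std::vector<bool>& alive) {
    for (int rank : replica_holders(block, num_procs, replicas)) {
        if (alive[static_cast<std::size_t>(rank)]) {
            return rank;
        }
    }
    return std::nullopt;
}
```

In this sketch a block becomes unrecoverable only when all of its holders fail. A real implementation would also have to transfer the data between ranks and rebalance the workload among the surviving processes, which is outside the scope of this example.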


