Reinit++: Evaluating the Performance of Global-Restart Recovery Methods For MPI Fault Tolerance

02/13/2021
by Giorgis Georgakoudis, et al.

As supercomputers scale, failure rates rise because of the growing number of hardware components. In standard practice, applications are made resilient by checkpointing data and, after a failure, restarting execution to resume from the latest checkpoint. However, re-deploying an application incurs overhead from tearing down and re-instating execution, and may limit checkpoint retrieval to slow permanent storage. In this paper we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment. We extensively evaluate Reinit++ against ULFM, the leading MPI fault-tolerance approach, used here to implement global-restart recovery, and against the typical practice of restarting an application, to derive new insight on performance. Experiments with three HPC proxy applications made resilient to process and node failures show that Reinit++ recovers much faster than restarting (up to 6x) or ULFM (up to 3x), and that it scales excellently as the number of MPI processes grows.
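
The checkpoint-and-rollback idea behind global-restart recovery can be illustrated with a toy, single-process sketch. This is not the Reinit++ or ULFM API and it uses no MPI; the names `SimulatedFailure` and `run_with_global_restart` are hypothetical, and the in-memory checkpoint is an assumption made for illustration. It only shows the control flow: on a failure, execution rolls back to the latest checkpoint and resumes inside the same run instead of being re-deployed.

```python
import copy


class SimulatedFailure(Exception):
    """Hypothetical stand-in for a process or node failure."""


def run_with_global_restart(total_steps, checkpoint_every, fail_at_steps):
    """Run a toy iterative computation and roll back to the latest
    in-memory checkpoint whenever a simulated failure occurs.

    Returns the final accumulated value and the number of rollbacks.
    """
    pending_failures = set(fail_at_steps)
    state = {"step": 0, "value": 0}
    checkpoint = copy.deepcopy(state)  # latest checkpoint, kept in memory
    restarts = 0

    while state["step"] < total_steps:
        try:
            # Inject each failure once, before the step does its work.
            if state["step"] in pending_failures:
                pending_failures.discard(state["step"])
                raise SimulatedFailure(f"failure at step {state['step']}")

            # One unit of application work.
            state["value"] += state["step"]
            state["step"] += 1

            # Periodically save a consistent copy of the state.
            if state["step"] % checkpoint_every == 0:
                checkpoint = copy.deepcopy(state)
        except SimulatedFailure:
            # Roll back to the latest checkpoint and resume in the same
            # run, rather than tearing down and re-launching the job.
            state = copy.deepcopy(checkpoint)
            restarts += 1

    return state["value"], restarts


if __name__ == "__main__":
    value, restarts = run_with_global_restart(10, 3, [5, 8])
    print(value, restarts)
```

The result matches a failure-free run because work done after the latest checkpoint is simply recomputed; in a real MPI setting all surviving and re-spawned processes would have to roll back together, which is the part that Reinit++ and ULFM handle and that this sketch leaves out.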


