Implementing Efficient Message Logging Protocols as MPI Application Extensions

05/08/2019
by Kiril Dichev, et al.

Message logging protocols enable local rollback, a more efficient alternative to global rollback, for fault-tolerant MPI applications. Until now, message logging MPI implementations have incurred the overhead of redesigning and redeploying an MPI library, as well as continued performance penalties across various kernels. Successful research efforts on message logging implementations do exist, but none of them can be easily deployed today by more than a few experts. In contrast, in this work we build efficient message logging capabilities on top of an MPI library that has none; we do so for two different HPC kernels, one with a global exchange pattern (CG) and one with a neighbourhood exchange pattern (LULESH). Our library of choice, ULFM, detects failures and recovers MPI communicators; we build on that to restore the intra- and inter-process data consistency of both applications. This task turns out to be challenging, and we present the methodology for doing so in this work. In the end, we achieve message logging capabilities for each kernel without the need for an actual message logging runtime underneath. On the performance side, we match state-of-the-art solutions and (a) eliminate event logging and the event logger component altogether, and (b) design a hybrid protocol that gracefully shifts between global and local rollback, depending on the available payload logging memory. To our knowledge, such a hybrid protocol between local and global rollback has not been proposed before. Our extensions span a few hundred lines of code for each kernel, are open-sourced, and enable local and global rollback after process failure.
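
The hybrid idea in the abstract, falling back from local to global rollback when payload logging memory runs out, can be illustrated with a small sketch. This is not the authors' implementation: the class `PayloadLog`, its methods, the helper `agree_on_mode`, and the policy of discarding the log once the budget is exceeded are all hypothetical names and assumptions chosen for illustration, and the sketch uses plain Python rather than MPI.

```python
from dataclasses import dataclass, field


@dataclass
class PayloadLog:
    """Hypothetical sender-side payload log with a fixed memory budget."""
    budget_bytes: int
    used_bytes: int = 0
    complete: bool = True  # becomes False once a payload could not be logged
    entries: list = field(default_factory=list)  # (dest, seq, payload) tuples

    def record_send(self, dest: int, seq: int, payload: bytes) -> None:
        """Log a copy of an outgoing payload if the budget allows it."""
        if not self.complete:
            return  # log already abandoned for this checkpoint interval
        if self.used_bytes + len(payload) > self.budget_bytes:
            # Assumption: an incomplete log cannot support local rollback,
            # so drop it and mark this interval as global-rollback only.
            self.entries.clear()
            self.used_bytes = 0
            self.complete = False
            return
        self.entries.append((dest, seq, bytes(payload)))
        self.used_bytes += len(payload)

    def checkpoint(self) -> None:
        """A new checkpoint makes older payloads unnecessary; start afresh."""
        self.entries.clear()
        self.used_bytes = 0
        self.complete = True

    def recovery_mode(self) -> str:
        """Rollback mode this process can support in the current interval."""
        return "local" if self.complete else "global"

    def replay_for(self, failed_rank: int) -> list:
        """Payloads to resend, in order, to a restarted process."""
        if not self.complete:
            raise RuntimeError("log incomplete: global rollback required")
        return [(seq, p) for dest, seq, p in self.entries if dest == failed_rank]


def agree_on_mode(modes: list) -> str:
    """Local rollback is possible only if every survivor has a complete log."""
    return "local" if all(m == "local" for m in modes) else "global"


# Usage: a 16-byte budget holds the first two sends but not a third.
log = PayloadLog(budget_bytes=16)
log.record_send(dest=1, seq=0, payload=b"12345678")
log.record_send(dest=2, seq=1, payload=b"abcd")
mode_before = log.recovery_mode()
log.record_send(dest=1, seq=2, payload=b"12345678")
mode_after = log.recovery_mode()
```

In a real MPI setting the agreement step would be a collective over the surviving processes; here it is reduced to a plain function over a list so the decision rule stays visible.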


research
05/17/2023

Accelerating MPI Collectives with Process-in-Process-based Multi-object Techniques

In the exascale computing era, optimizing MPI collective performance in ...
research
12/28/2022

Hybrid Cloud and HPC Approach to High-Performance Dataframes

Data pre-processing is a fundamental component in any data-driven applic...
research
12/28/2020

TEMPI: An Interposed MPI Library with a Canonical Representation of CUDA-aware Datatypes

MPI derived datatypes are an abstraction that simplifies handling of non...
research
09/05/2022

A Fault Resilient Approach to Non-collective Communication Creation in MPI

The increasing size of HPC architectures makes the faults' presence an e...
research
01/21/2020

Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided

Modern interconnects offer remote direct memory access (RDMA) features. ...
research
12/16/2022

Implicit Actions and Non-blocking Failure Recovery with MPI

Scientific applications have long embraced the MPI as the environment of...
research
06/05/2018

Energy-efficient localised rollback after failures via data flow analysis

Exascale systems will suffer failures hourly. HPC programmers rely mostl...
