A high-level C++ approach to manage local errors, asynchrony and faults in an MPI application

04/12/2018
by Christian Engwer, et al.

C++ advocates exceptions as the preferred way to handle unexpected behaviour of an implementation in the code. This does not integrate well with the error handling of MPI, which in most cases terminates the program when an MPI failure occurs. In particular, a local C++ exception can currently lead to a deadlock due to unfinished communication requests on remote hosts. At the same time, future MPI implementations are expected to include an API to continue computations even after a hard fault (node loss), i.e. the worst possible unexpected behaviour. In this paper we present an approach that adds extended exception propagation support to C++ MPI programs. Our technique allows local exceptions to be propagated to remote hosts to avoid deadlocks, and MPI failures on remote hosts to be mapped to local exceptions. A use case of particular interest is asynchronous 'local failure local recovery' resilience. Our prototype implementation uses MPI-3.0 features only. In addition, we present a dedicated implementation that integrates seamlessly with MPI-ULFM, the most prominent proposal for extending MPI towards fault tolerance. Our implementation is available at https://gitlab.dune-project.org/christi/test-mpi-exceptions.
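The deadlock scenario and its remedy can be illustrated with a minimal sketch. It is not the implementation from the paper and does not use MPI: two `std::thread`s stand in for two ranks, and the names `Channel`, `FailureGuard` and `run_demo` are hypothetical, introduced only for this illustration. The assumption is that a scope guard notices when its scope is left by an exception and notifies the peer, so that the peer's blocking receive throws instead of waiting forever. The sketch requires C++17 for `std::uncaught_exceptions`.

```cpp
#include <cassert>
#include <condition_variable>
#include <exception>
#include <mutex>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical stand-in for a point-to-point link between two "ranks".
// Real code would use MPI requests; here two threads share this object.
struct Channel {
  std::mutex m;
  std::condition_variable cv;
  bool has_value = false;    // true once a payload was sent
  int value = 0;             // the payload
  bool peer_failed = false;  // true once the sender reported a failure

  void send(int v) {
    { std::lock_guard<std::mutex> lock(m); value = v; has_value = true; }
    cv.notify_all();
  }
  void fail() {
    { std::lock_guard<std::mutex> lock(m); peer_failed = true; }
    cv.notify_all();
  }
  // Blocks until data or a failure notice arrives. Without the failure
  // notice, a sender that throws before send() would leave us waiting forever.
  int recv() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [this] { return has_value || peer_failed; });
    if (has_value) return value;
    throw std::runtime_error("remote exception propagated");
  }
};

// Scope guard: if the enclosing scope is left by an exception,
// tell the peer so its pending receive can be abandoned.
class FailureGuard {
  Channel& ch_;
  int uncaught_ = std::uncaught_exceptions();
public:
  explicit FailureGuard(Channel& ch) : ch_(ch) {}
  ~FailureGuard() {
    if (std::uncaught_exceptions() > uncaught_) ch_.fail();
  }
};

// Runs one sender and one receiver; returns what the receiver observed.
std::string run_demo(bool sender_throws) {
  Channel ch;
  std::thread sender([&] {
    try {
      FailureGuard guard(ch);
      if (sender_throws) throw std::runtime_error("local error");
      ch.send(42);
    } catch (const std::exception&) {
      // The local exception is handled locally; the guard has already
      // informed the peer during stack unwinding.
    }
  });
  std::string result;
  try {
    result = "received " + std::to_string(ch.recv());
  } catch (const std::exception& e) {
    result = std::string("caught: ") + e.what();
  }
  sender.join();
  return result;
}
```

Calling `run_demo(false)` lets the receiver obtain the payload, while `run_demo(true)` makes the receiver observe an exception rather than block. In an actual MPI program the failure notice would have to travel over the network and the pending requests would have to be cancelled, which this thread-based analogy leaves out.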

research
04/20/2019

MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing

Transparently checkpointing MPI for fault tolerance and load balancing i...
research
02/13/2021

MATCH: An MPI Fault Tolerance Benchmark Suite

MPI has been ubiquitously deployed in flagship HPC systems aiming to acc...
research
03/01/2018

Developing a functional prototype master patient index (MPI) for interoperability of e-health systems in Sri Lanka

Introduction: A Master Patient Index(MPI) is a centralized index of all ...
research
03/06/2023

Fault Awareness in the MPI 4.0 Session Model

The latest version of MPI introduces new functionalities like the Sessio...
research
04/25/2018

Fast parallel multidimensional FFT using advanced MPI

We present a new method for performing global redistributions of multidi...
research
08/24/2017

Reliability and Fault-Tolerance by Choreographic Design

Distributed programs are hard to get right because they are required to ...
research
07/27/2016

System-level Scalable Checkpoint-Restart for Petascale Computing

Fault tolerance for the upcoming exascale generation has long been an ar...
