A high-level C++ approach to manage local errors, asynchrony and faults in an MPI application

04/12/2018


C++ advocates exceptions as the preferred way to handle unexpected behaviour of an implementation in the code. This does not integrate well with the error handling of MPI, which almost always results in program termination when an MPI call fails. In particular, a local C++ exception can currently lead to a deadlock, because communication requests on remote hosts are left unfinished. At the same time, future MPI implementations are expected to include an API to continue computations even after a hard fault (node loss), i.e. the worst possible unexpected behaviour. In this paper we present an approach that adds extended exception propagation support to C++ MPI programs. Our technique allows local exceptions to be propagated to remote hosts to avoid deadlocks, and MPI failures on remote hosts to be mapped to local exceptions. A use case of particular interest is asynchronous 'local failure local recovery' resilience approaches. Our prototype implementation uses MPI-3.0 features only. In addition, we present a dedicated implementation that integrates seamlessly with MPI-ULFM, the most prominent proposal for extending MPI towards fault tolerance. Our implementation is available at https://gitlab.dune-project.org/christi/test-mpi-exceptions.
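To make the deadlock problem concrete, the following is a minimal sketch of a much simpler, blocking pattern than the one the paper proposes: every rank catches its local exception, all ranks agree on a global error flag, and ranks that did not fail throw a "remote" exception. It is not the authors' implementation. The names `guarded`, `RemoteError` and `OrReduce` are hypothetical, and the collective logical OR is passed in as a callable so the sketch compiles without an MPI installation; in an MPI program that callable could wrap `MPI_Allreduce` with `MPI_LOR`.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>

// Hypothetical exception type: thrown on ranks that did not fail themselves
// but learned that some other rank did.
struct RemoteError : std::runtime_error {
    RemoteError() : std::runtime_error("error reported by another rank") {}
};

// Hypothetical stand-in for a collective logical OR over all ranks.
// Assumption: in an MPI program this could wrap
//   MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_LOR, comm);
using OrReduce = std::function<bool(bool)>;

// Run `work` locally, then agree with all ranks on whether anyone failed.
// Every rank reaches the reduction, even if its local work threw, so no
// rank is left waiting for a peer that has abandoned the communication.
template <class Work>
void guarded(Work&& work, const OrReduce& any_rank_failed) {
    std::exception_ptr local_error;
    try {
        work();
    } catch (...) {
        // Keep the exception instead of unwinding past the collective call.
        local_error = std::current_exception();
    }

    const bool global_failure = any_rank_failed(local_error != nullptr);

    if (local_error) {
        std::rethrow_exception(local_error);  // failing rank sees its own error
    }
    if (global_failure) {
        throw RemoteError();                  // other ranks see a remote error
    }
}
```

A caller would wrap each communication phase in `guarded(...)` and handle `RemoteError` next to its ordinary exception handlers. Unlike the approach described in the abstract, this sketch synchronises all ranks after every guarded phase and does not cover hard faults such as node loss.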
