Fault Awareness in the MPI 4.0 Session Model

03/06/2023
by Roberto Rocco, et al.

The latest version of the MPI standard introduces new functionality such as the Session model, but it still lacks fault management mechanisms. Past efforts produced tools and extensions to the MPI standard for handling faults, including ULFM (User-Level Failure Mitigation). These measures are effective against faults but do not fully support the new additions to the standard. In this paper, we combine the fault management capabilities of ULFM with the Session model introduced in version 4.0 of the standard. We focus on the communicator creation procedure, highlighting its critical points and proposing a method to circumvent them. The experimental campaign shows that the proposed solution does not significantly affect applications' execution time or scalability while handling the occurrence of faults more effectively.

