A Fault Resilient Approach to Non-collective Communication Creation in MPI

09/05/2022
by Roberto Rocco, et al.

The increasing size of HPC architectures makes faults increasingly frequent. This is especially relevant since MPI, the de facto standard for inter-process communication, lacks proper fault management functionalities. Past efforts produced extensions to the MPI standard that enable fault management, the most important being ULFM. In this paper, we introduce support for non-collective communication creation (MPI_Comm_create_group) in ULFM to improve its fault management capabilities. We integrate our solution into the Legio library and measure the overhead it introduces in the application. The proposed solution removes the possibility of the execution deadlocking after a fault and can inspire further efforts to improve the ULFM repair capabilities.

research
04/29/2021

Legio: Fault Resiliency for Embarrassingly Parallel MPI Applications

Due to the increasing size of HPC machines, the fault presence is becomi...
research
03/06/2023

Fault Awareness in the MPI 4.0 Session Model

The latest version of MPI introduces new functionalities like the Sessio...
research
05/17/2023

Accelerating MPI Collectives with Process-in-Process-based Multi-object Techniques

In the exascale computing era, optimizing MPI collective performance in ...
research
12/20/2021

Checkpoint-Restart Libraries Must Become More Fault Tolerant

Production MPI codes need checkpoint-restart (CPR) support. Clearly, che...
research
10/22/2017

Lightweight MPI Communicators with Applications to Perfectly Balanced Schizophrenic Quicksort

MPI uses the concept of communicators to connect groups of processes. It...
research
05/31/2023

A Survey of Potential MPI Complex Collectives: Large-Scale Mining and Analysis of HPC Applications

Offload of MPI collectives to network devices, e.g., NICs and switches, ...
research
05/08/2019

Implementing Efficient Message Logging Protocols as MPI Application Extensions

Message logging protocols are enablers of local rollback, a more efficie...
