MANA-2.0: A Future-Proof Design for Transparent Checkpointing of MPI at Scale

12/10/2021
by Yao Xu, et al.

MANA-2.0 is a scalable, future-proof design for transparent checkpointing of MPI-based computations. Its network transparency (the "network-agnostic" feature) is what allows MANA-2.0 to provide a viable, efficient mechanism for transparently checkpointing MPI applications on current and future supercomputers. MANA-2.0 enhances earlier work, the original MANA, which interposes on MPI calls. MANA-2.0 itself is a work in progress intended for production deployment. It implements a series of new algorithms and features that improve MANA's scalability and reliability, enabling transparent checkpoint-restart over thousands of MPI processes. MANA-2.0 is being tested on the Cori supercomputer at NERSC using the Cray MPICH library over the Cray GNI network, but it is designed to work with any standard MPI implementation running over an arbitrary network. Two widely used HPC applications were selected to demonstrate the enhanced features of MANA-2.0: GROMACS, a molecular dynamics simulation code with frequent point-to-point communication, and VASP, a materials science code with frequent MPI collective communication. Perhaps the most important lesson from MANA-2.0 lies in its series of algorithms and data structures for library-based transformations, which enable MPI-based computations running over MANA-2.0 to reliably survive the checkpoint-restart transition.
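To make the idea of interposing on MPI calls more concrete, the following is a minimal sketch and not MANA-2.0's actual implementation. It assumes a wrapper layer that hands the application stable virtual ids, logs each object-creating call, and replays that log against a freshly started library after restart. The names `FakeLowerHalf`, `Interposer`, `comm_create`, `real_handle`, and `restart` are all hypothetical and exist only for illustration; no real MPI library is involved.

```python
import itertools
from typing import Dict, List, Tuple


class FakeLowerHalf:
    """Hypothetical stand-in for an MPI library; its handles differ per launch."""

    def __init__(self, base: int) -> None:
        self._next_handle = itertools.count(base)

    def comm_create(self, ranks: Tuple[int, ...]) -> int:
        # Return an opaque "real" handle that is only valid for this launch.
        return next(self._next_handle)


class Interposer:
    """Illustrative wrapper: gives out stable virtual ids and logs creations."""

    def __init__(self, lib: FakeLowerHalf) -> None:
        self._lib = lib
        self._virt_to_real: Dict[int, int] = {}
        self._log: List[Tuple[int, Tuple[int, ...]]] = []
        self._next_virt = itertools.count(1)

    def comm_create(self, ranks: Tuple[int, ...]) -> int:
        # Forward the call, then hide the real handle behind a virtual id.
        real = self._lib.comm_create(ranks)
        virt = next(self._next_virt)
        self._virt_to_real[virt] = real
        self._log.append((virt, tuple(ranks)))
        return virt

    def real_handle(self, virt: int) -> int:
        # Translate a virtual id to whatever real handle is currently bound.
        return self._virt_to_real[virt]

    def restart(self, new_lib: FakeLowerHalf) -> None:
        # After restart, replay the creation log against the new library and
        # rebind every virtual id; the application's ids never change.
        self._lib = new_lib
        self._virt_to_real = {
            virt: new_lib.comm_create(ranks) for virt, ranks in self._log
        }


layer = Interposer(FakeLowerHalf(base=100))
comm = layer.comm_create((0, 1, 2))
before = layer.real_handle(comm)
layer.restart(FakeLowerHalf(base=500))
after = layer.real_handle(comm)
print(comm, before, after)
```

In this sketch the application only ever holds `comm`, so a change in the underlying real handle across the simulated restart is invisible to it. Whether MANA-2.0 uses this exact record-and-replay scheme is not stated in the abstract.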


research
03/15/2021

Improving scalability and reliability of MPI-agnostic transparent checkpointing for production workloads at NERSC

Checkpoint/restart (C/R) provides fault-tolerant computing capability, e...
research
04/20/2019

MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing

Transparently checkpointing MPI for fault tolerance and load balancing i...
research
05/31/2023

A Survey of Potential MPI Complex Collectives: Large-Scale Mining and Analysis of HPC Applications

Offload of MPI collectives to network devices, e.g., NICs and switches, ...
research
12/12/2022

Collective Vector Clocks: Low-Overhead Transparent Checkpointing for MPI

MPI is the de facto standard for parallel computation on a cluster of co...
research
12/13/2013

Transparent Checkpoint-Restart over InfiniBand

InfiniBand is widely used for low-latency, high-throughput cluster compu...
research
02/15/2021

Simulation-based Optimization and Sensibility Analysis of MPI Applications: Variability Matters

Finely tuning MPI applications and understanding the influence of keypar...
research
05/27/2021

Measuring OpenSHMEM Communication Routines with SKaMPI-OpenSHMEM User's manual

This document presents the OpenSHMEM extension for the Special Karlsruhe...
