Collective Vector Clocks: Low-Overhead Transparent Checkpointing for MPI

12/12/2022
by   Yao Xu, et al.

MPI is the de facto standard for parallel computation on a cluster of computers. Yet resilience for MPI continues to be an issue for large-scale computations, and especially for long-running computations that exceed the maximum time allocated to a job by a resource manager. Transparent checkpointing (with no modification of the underlying binary executable) is an important component in any strategy for software resilience and chaining of resource allocations. However, achieving low runtime overhead is critical for community acceptance of a transparent checkpointing solution. ("Runtime overhead" is the additional running time of an application, with no checkpoints taken, when it runs under the checkpointing package as compared to without it.) A collective-vector-clock algorithm for transparent checkpointing of MPI is presented. The algorithm is built using the software of the mature MANA project for transparent checkpointing of MPI. MANA's existing two-phase-commit algorithm produces very high runtime overhead as compared to "native" execution. For example, MANA was found to result in runtime overheads as high as 37% [...] micro-benchmarks, especially on workloads that intensively use collective communication. The new algorithm replaces two-phase commit. It is a novel variation on vector clock algorithms. It uses a vector of logical clocks, with an individual clock for each distinct group of MPI processes underlying the MPI communicators in the application. This contrasts with the traditional vector of logical clocks across individual processes. Micro-benchmarks show a runtime overhead of essentially zero for many MPI processes. And two real-world applications, VASP and GROMACS, show a runtime overhead ranging mostly from 0 to 7% [...] optimization of other sources of overhead.
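
The idea of keeping one logical clock per process group, rather than one per process, can be illustrated with a small simulation. This is only a sketch under stated assumptions, not MANA's implementation: it assumes each clock counts the collective calls a process has entered on a given group, and that a checkpoint request compares clocks across processes to find which ones lag behind. The names `CollectiveClock`, `target_clocks`, and `lagging_groups` are hypothetical, and no real MPI calls are made.

```python
from collections import defaultdict


class CollectiveClock:
    """Per-process vector of logical clocks, one entry per process group.

    Assumption for illustration: a group is identified by the set of
    ranks behind a communicator, so two communicators over the same
    ranks share one clock.
    """

    def __init__(self, rank):
        self.rank = rank
        self.clocks = defaultdict(int)  # frozenset of ranks -> call count

    def on_collective(self, group_ranks):
        """Record that this process entered a collective call on the group."""
        group = frozenset(group_ranks)
        if self.rank not in group:
            raise ValueError("rank %d is not in the group" % self.rank)
        self.clocks[group] += 1


def target_clocks(processes):
    """For each group, the highest clock value seen on any process."""
    targets = defaultdict(int)
    for proc in processes:
        for group, count in proc.clocks.items():
            targets[group] = max(targets[group], count)
    return dict(targets)


def lagging_groups(proc, targets):
    """Groups on which proc is behind, with the number of calls it still owes."""
    behind = {}
    for group, target in targets.items():
        local = proc.clocks.get(group, 0)
        if proc.rank in group and local < target:
            behind[group] = target - local
    return behind


# Hypothetical scenario: four ranks, a world group and a two-rank subgroup.
WORLD = frozenset({0, 1, 2, 3})
ROW = frozenset({0, 1})
procs = [CollectiveClock(r) for r in range(4)]

for p in procs:                      # every rank finishes one world collective
    p.on_collective(WORLD)
procs[0].on_collective(ROW)          # ranks 0 and 1 finish one row collective
procs[1].on_collective(ROW)
procs[0].on_collective(WORLD)        # rank 0 alone has entered a second one

targets = target_clocks(procs)
for p in procs:
    print(p.rank, lagging_groups(p, targets))
```

In this scenario rank 0 has entered a second collective on the world group while ranks 1 to 3 have not, so a checkpoint taken at that instant would split a collective call. Under the assumptions above, the lagging ranks would be allowed to run until their clocks reach the targets before the checkpoint is taken. The vector has one entry per group (two here) rather than one per process.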


