System-level Scalable Checkpoint-Restart for Petascale Computing

07/27/2016
by Jiajun Cao, et al.

Fault tolerance for the upcoming exascale generation has long been an area of active research. One of the components of a fault tolerance strategy is checkpointing. Petascale-level checkpointing is demonstrated through a new mechanism for virtualization of the InfiniBand UD (unreliable datagram) mode, and for updating the remote address on each UD-based send, due to lack of a fixed peer. Note that InfiniBand UD is required to support modern MPI implementations. An extrapolation from the current results to future SSD-based storage systems provides evidence that the current approach will remain practical in the exascale generation. This transparent checkpointing approach is evaluated using a framework of the DMTCP checkpointing package. Results are shown for HPCG (linear algebra), NAMD (molecular dynamics), and the NAS NPB benchmarks. In tests up to 32,752 MPI processes on 32,752 CPU cores, checkpointing of a computation with a 38 TB memory footprint in 11 minutes is demonstrated. Runtime overhead is reduced to less than 1%. The approach is also evaluated across three widely used MPI implementations.
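
To put the headline figures in perspective, the short sketch below works out the implied aggregate write rate and the per-process share. It is a back-of-the-envelope calculation, not a result from the paper: it assumes decimal terabytes (1 TB = 10^12 bytes), assumes the checkpoint image is roughly the size of the 38 TB memory footprint (no compression), and assumes the footprint is spread evenly over the 32,752 processes. The function name `checkpoint_stats` is hypothetical.

```python
def checkpoint_stats(footprint_bytes, seconds, processes):
    """Derive rough throughput figures from a checkpoint size and duration.

    Assumptions (not stated in the abstract): the checkpoint image is about
    as large as the memory footprint, and the load is evenly distributed.
    """
    aggregate_bytes_per_s = footprint_bytes / seconds
    return {
        # Total rate at which the storage system must absorb data.
        "aggregate_gb_per_s": aggregate_bytes_per_s / 1e9,
        # Average memory image written by each MPI process.
        "per_process_gb": footprint_bytes / processes / 1e9,
        # Average write rate each process contributes.
        "per_process_mb_per_s": aggregate_bytes_per_s / processes / 1e6,
    }


# Figures quoted in the abstract: 38 TB, 11 minutes, 32,752 MPI processes.
stats = checkpoint_stats(footprint_bytes=38e12, seconds=11 * 60, processes=32_752)

for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the aggregate rate comes to roughly 57.6 GB/s, with each process writing about 1.16 GB at an average of under 2 MB/s. That suggests the shared storage system's aggregate bandwidth, rather than any single process, is the limiting factor, which is consistent with the abstract's interest in future SSD-based storage.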

research
04/20/2019

MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing

Transparently checkpointing MPI for fault tolerance and load balancing i...
research
06/12/2019

Checkpoint/restart approaches for a thread-based MPI runtime

Fault-tolerance has always been an important topic when it comes to runn...
research
12/13/2013

Transparent Checkpoint-Restart over InfiniBand

InfiniBand is widely used for low-latency, high-throughput cluster compu...
research
04/12/2018

A high-level C++ approach to manage local errors, asynchrony and faults in an MPI application

C++ advocates exceptions as the preferred way to handle unexpected behav...
research
03/08/2021

Transparent Checkpointing for OpenGL Applications on GPUs

This work presents transparent checkpointing of OpenGL applications, ref...
research
09/22/2020

TaskTorrent: a Lightweight Distributed Task-Based Runtime System in C++

We present TaskTorrent, a lightweight distributed task-based runtime in ...
research
12/16/2022

Implicit Actions and Non-blocking Failure Recovery with MPI

Scientific applications have long embraced the MPI as the environment of...
