Transparent Checkpoint-Restart over InfiniBand

12/13/2013
by   Jiajun Cao, et al.
0

InfiniBand is widely used for low-latency, high-throughput cluster computing. Saving the state of the InfiniBand network as part of distributed checkpointing has been a long-standing challenge for researchers. Because of a lack of a solution, typical MPI implementations have included custom checkpoint-restart services that "tear down" the network, checkpoint each node as if the node were a standalone computer, and then re-connect the network again. We present the first example of transparent, system-initiated checkpoint-restart that directly supports InfiniBand. The new approach is independent of any particular Linux kernel, thus simplifying the current practice of using a kernel-based module, such as BLCR. This direct approach results in checkpoints that are found to be faster than with the use of a checkpoint-restart service. The generality of this approach is shown not only by checkpointing an MPI computation, but also a native UPC computation (Berkeley Unified Parallel C), which does not use MPI. Scalability is shown by checkpointing 2,048 MPI processes across 128 nodes (with 16 cores per node). In addition, a cost-effective debugging approach is also enabled, in which a checkpoint image from an InfiniBand-based production cluster is copied to a local Ethernet-based cluster, where it can be restarted and an interactive debugger can be attached to it. This work is based on a plugin that extends the DMTCP (Distributed MultiThreaded CheckPointing) checkpoint-restart package.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
04/20/2019

MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing

Transparently checkpointing MPI for fault tolerance and load balancing i...
research
07/27/2016

System-level Scalable Checkpoint-Restart for Petascale Computing

Fault tolerance for the upcoming exascale generation has long been an ar...
research
12/12/2022

Collective Vector Clocks: Low-Overhead Transparent Checkpointing for MPI

MPI is the de facto standard for parallel computation on a cluster of co...
research
12/10/2021

MANA-2.0: A Future-Proof Design for Transparent Checkpointing of MPI at Scale

MANA-2.0 is a scalable, future-proof design for transparent checkpointin...
research
10/29/2019

Decomposing Collectives for Exploiting Multi-lane Communication

Many modern, high-performance systems increase the cumulated node-bandwi...
research
06/29/2019

Open-MPI over MOSIX: paralleled computing in a clustered world

Recent increased interest in Cloud computing emphasizes the need to find...
research
05/07/2013

Somoclu: An Efficient Parallel Library for Self-Organizing Maps

Somoclu is a massively parallel tool for training self-organizing maps o...

Please sign up or login with your details

Forgot password? Click here to reset