CRUM: Checkpoint-Restart Support for CUDA's Unified Memory

08/01/2018
by   Rohan Garg, et al.
0

Unified Virtual Memory (UVM) was recently introduced on recent NVIDIA GPUs. Through software and hardware support, UVM provides a coherent shared memory across the entire heterogeneous node, migrating data as appropriate. The older CUDA programming style is akin to older large-memory UNIX applications which used to directly load and unload memory segments. Newer CUDA programs have started taking advantage of UVM for the same reasons of superior programmability that UNIX applications long ago switched to assuming the presence of virtual memory. Therefore, checkpointing of UVM will become increasingly important, especially as NVIDIA CUDA continues to gain wider popularity: 87 of the top 500 supercomputers in the latest listings are GPU-accelerated, with a current trend of ten additional GPU-based supercomputers each year. A new scalable checkpointing mechanism, CRUM (Checkpoint-Restart for Unified Memory), is demonstrated for hybrid CUDA/MPI computations across multiple computer nodes. CRUM supports a fast, forked checkpointing, which mostly overlaps the CUDA computation with storage of the checkpoint image in stable storage. The runtime overhead of using CRUM is 6 forked checkpointing is seen to be a factor of up to 40 times less than traditional, synchronous checkpointing.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
08/16/2017

Improving Multi-Application Concurrency Support Within the GPU Memory System

GPUs exploit a high degree of thread-level parallelism to hide long-late...
research
08/24/2020

CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM

The share of the top 500 supercomputers with NVIDIA GPUs is now over 25 ...
research
12/12/2017

Intra-node Memory Safe GPU Co-Scheduling

GPUs in High-Performance Computing systems remain under-utilised due to ...
research
05/19/2017

GPU System Calls

GPUs are becoming first-class compute citizens and are being tasked to p...
research
07/20/2020

UVMBench: A Comprehensive Benchmark Suite for Researching Unified Virtual Memory in GPUs

The recent introduction of Unified Virtual Memory (UVM) in GPUs offers a...
research
07/11/2019

Profiling based Out-of-core Hybrid Method for Large Neural Networks

GPUs are widely used to accelerate deep learning with NNs (NNs). On the ...
research
12/18/2019

HDOT – an Approach Towards Productive Programming of Hybrid Applications

MPI applications matter. However, with the advent of many-core processor...

Please sign up or login with your details

Forgot password? Click here to reset