CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM

08/24/2020
by Twinkle Jain, et al.

The share of the top 500 supercomputers with NVIDIA GPUs is now over 25% and continues to grow. While fault tolerance is a critical issue for supercomputing, there does not currently exist an efficient, scalable solution for CUDA applications on NVIDIA GPUs. CRAC (Checkpoint-Restart Architecture for CUDA) is a new checkpoint-restart solution for fault tolerance that supports the full range of CUDA applications. CRAC combines: low runtime overhead (approximately 1%); support for CUDA streams (for efficient usage of all of the thousands of GPU cores); and support for the full features of Unified Virtual Memory (eliminating the programmer's burden of migrating memory between device and host). CRAC achieves its flexible architecture by segregating application code (checkpointed) and its external GPU communication via non-reentrant CUDA libraries (not checkpointed) within a single process's memory. This eliminates the high overhead of inter-process communication in earlier approaches, and has fewer limitations.
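To make the two CUDA features named in the abstract concrete, the following is a minimal sketch of an application that uses both CUDA streams and Unified Virtual Memory. It is not taken from CRAC; the kernel `scale`, the helper `run_streams_uvm_demo`, and the problem sizes are hypothetical names chosen for illustration. It only shows the kind of state (concurrent stream work and managed allocations shared by host and device) that a checkpoint-restart tool for CUDA has to cope with.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: multiply each element of a slice by `factor`.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: returns the number of wrong elements, or -1 on a CUDA error.
int run_streams_uvm_demo(int n, int num_streams) {
    float *data = nullptr;
    // UVM: a single managed pointer is valid on both host and device,
    // so no explicit cudaMemcpy is needed.
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) return -1;
    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // host writes directly

    std::vector<cudaStream_t> streams(num_streams);
    for (int s = 0; s < num_streams; ++s) {
        if (cudaStreamCreate(&streams[s]) != cudaSuccess) return -1;
    }

    // Streams: each slice is processed in its own stream, so the
    // kernels may run concurrently on the GPU.
    int chunk = (n + num_streams - 1) / num_streams;
    for (int s = 0; s < num_streams; ++s) {
        int offset = s * chunk;
        int count = (offset + chunk <= n) ? chunk : n - offset;
        if (count <= 0) continue;
        scale<<<(count + 255) / 256, 256, 0, streams[s]>>>(data + offset, count, 2.0f);
    }
    if (cudaGetLastError() != cudaSuccess) return -1;

    // Wait for all streams to finish. Assumption: a quiescent point like
    // this, with no kernels in flight, is the simplest place for any
    // checkpointer to take a consistent snapshot.
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i) {
        if (data[i] != 2.0f) ++wrong;  // host reads the managed memory back
    }

    for (int s = 0; s < num_streams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(data);
    return wrong;
}

int main() {
    int wrong = run_streams_uvm_demo(1 << 20, 4);
    std::printf("mismatched elements: %d\n", wrong);
    return wrong == 0 ? 0 : 1;
}
```

Compiled with `nvcc` and run on a machine with a CUDA-capable GPU, the program reports how many elements differ from the expected value. The host never copies memory explicitly, which is the programmer convenience the abstract attributes to Unified Virtual Memory.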


