A Study of Checkpointing in Large Scale Training of Deep Neural Networks

12/01/2020
by Elvis Rojas, et al.

Deep learning (DL) applications are increasingly being deployed on HPC systems to leverage the massive parallelism and computing power of those systems for DL model training. While significant effort has been put into facilitating distributed training in DL frameworks, fault tolerance has been largely ignored. In this work, we evaluate checkpoint-restart, a common fault tolerance technique in HPC workloads. We perform experiments with three state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow). We evaluate the computational cost of checkpointing, file formats and file sizes, the impact of scale, and deterministic checkpointing. Our evaluation shows some critical differences in checkpoint mechanisms and exposes several bottlenecks in existing checkpointing implementations. We provide discussion points that can aid users in selecting a fault-tolerant framework to use in HPC. We also provide takeaway points that framework developers can use to facilitate better checkpointing of DL workloads in HPC.
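To illustrate the checkpoint-restart pattern the paper evaluates, here is a minimal, framework-agnostic sketch using only the Python standard library. The function names, checkpoint path, and the stand-in "training step" are all illustrative assumptions, not the paper's actual harness; in PyTorch the analogous calls would be `torch.save(model.state_dict(), path)` and `model.load_state_dict(torch.load(path))`, and TensorFlow offers `tf.train.Checkpoint`.

```python
# Minimal checkpoint-restart sketch (stdlib only; names are illustrative).
import os
import pickle
import tempfile

CKPT = os.path.join(tempfile.gettempdir(), "train_ckpt.pkl")

def save_checkpoint(step, state, path=CKPT):
    # Write atomically: dump to a temp file, then rename over the old
    # checkpoint, so a crash mid-write cannot corrupt the last good one.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path=CKPT):
    # On restart, resume from the last checkpoint if one exists.
    if not os.path.exists(path):
        return 0, {"loss": None}  # fresh start
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["step"], ckpt["state"]

def train(total_steps=10, ckpt_every=3):
    step, state = load_checkpoint()
    while step < total_steps:
        state = {"loss": 1.0 / (step + 1)}  # stand-in for a real training step
        step += 1
        if step % ckpt_every == 0:
            save_checkpoint(step, state)
    return step, state
```

The atomic-rename step matters in practice: the cost of writing the checkpoint file (format and size) and how often it is taken are exactly the overheads the paper measures across frameworks.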


