On Efficient Constructions of Checkpoints

09/28/2020
by Yu Chen, et al.

Efficient construction of checkpoints/snapshots is a critical tool for training and diagnosing deep learning models. In this paper, we propose a lossy compression scheme for checkpoint construction, called LC-Checkpoint. LC-Checkpoint simultaneously maximizes the compression rate and optimizes the recovery speed, under the assumption that SGD is used to train the model. LC-Checkpoint uses quantization and priority promotion to store the information most crucial for SGD to recover, and then applies Huffman coding to leverage the non-uniform distribution of the gradient scales. Our extensive experiments show that LC-Checkpoint achieves a compression rate of up to 28× and a recovery speedup of up to 5.77× over a state-of-the-art algorithm (SCAR).
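As a concrete illustration of the pipeline the abstract describes, the sketch below quantizes checkpoint deltas to base-2 exponent buckets and then Huffman-codes the bucket ids to exploit their skewed distribution. This is a minimal sketch based only on the abstract: the delta-based encoding, the clipping range, and the helper names (exponent_buckets, huffman_code) are assumptions for illustration, and the priority-promotion step is omitted entirely; it is not the authors' exact implementation.

```python
import heapq
from collections import Counter

import numpy as np


def exponent_buckets(delta: np.ndarray) -> np.ndarray:
    """Lossily quantize each weight delta to its base-2 exponent bucket."""
    buckets = np.zeros(delta.shape, dtype=np.int32)
    nonzero = delta != 0
    _, exp = np.frexp(delta[nonzero])          # |x| = m * 2**exp, 0.5 <= m < 1
    buckets[nonzero] = np.clip(exp, -30, 30)   # assumed clipping range
    return buckets


def huffman_code(symbols) -> dict:
    """Standard Huffman code over bucket ids; short codes for frequent buckets."""
    freq = Counter(symbols)
    # Heap entries: [weight, unique tie-breaker, partial codebook].
    heap = [[w, i, {sym: ""}] for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        lo[2] = {s: "0" + c for s, c in lo[2].items()}
        hi[2] = {s: "1" + c for s, c in hi[2].items()}
        heapq.heappush(heap, [lo[0] + hi[0], next_id, {**lo[2], **hi[2]}])
        next_id += 1
    return heap[0][2]


# Usage: compress the delta between two consecutive (hypothetical) checkpoints.
rng = np.random.default_rng(0)
prev = rng.standard_normal(10_000).astype(np.float32)
curr = prev + 0.01 * rng.standard_normal(10_000).astype(np.float32)

bucket_ids = exponent_buckets(curr - prev).tolist()
code = huffman_code(bucket_ids)
compressed_bits = sum(len(code[b]) for b in bucket_ids)
print(f"{compressed_bits} coded bits vs {curr.nbytes * 8} raw bits")
```

Huffman coding pays off in this setting precisely because the exponent buckets are highly non-uniform: most deltas between consecutive checkpoints are small and fall into a handful of low-magnitude buckets, which then receive the shortest codes.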

Related research

06/02/2022 · Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees
Communication compression is a crucial technique for modern distributed ...

06/15/2023 · Evaluation and Optimization of Gradient Compression for Distributed Deep Learning
To accelerate distributed training, many gradient compression methods ha...

06/06/2019 · Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification, and Local Computations
Communication bottleneck has been identified as a significant issue in d...

12/08/2021 · FastSGD: A Fast Compressed SGD Framework for Distributed Machine Learning
With the rapid increase of big data, distributed Machine Learning (ML) h...

07/22/2019 · Decentralized Deep Learning with Arbitrary Communication Compression
Decentralized training of deep learning models is a key element for enab...

06/20/2023 · DynaQuant: Compressing Deep Learning Training Checkpoints via Dynamic Quantization
With the increase in the scale of Deep Learning (DL) training workloads ...

02/28/2021 · On the Utility of Gradient Compression in Distributed Training Systems
Rapid growth in data sets and the scale of neural network architectures ...