DeepAI AI Chat
Log In Sign Up

GraVAC: Adaptive Compression for Communication-Efficient Distributed DL Training

by   Sahil Tyagi, et al.
Indiana University

Distributed data-parallel (DDP) training improves overall application throughput as multiple devices train on a subset of data and aggregate updates to produce a globally shared model. The periodic synchronization at each iteration incurs considerable overhead, exacerbated by the increasing size and complexity of state-of-the-art neural networks. Although many gradient compression techniques propose to reduce communication cost, the ideal compression factor that leads to maximum speedup or minimum data exchange remains an open-ended problem since it varies with the quality of compression, model size and structure, hardware, network topology and bandwidth. We propose GraVAC, a framework to dynamically adjust compression factor throughout training by evaluating model progress and assessing gradient information loss associated with compression. GraVAC works in an online, black-box manner without any prior assumptions about a model or its hyperparameters, while achieving the same or better accuracy than dense SGD (i.e., no compression) in the same number of iterations/epochs. As opposed to using a static compression factor, GraVAC reduces end-to-end training time for ResNet101, VGG16 and LSTM by 4.32x, 1.95x and 6.67x respectively. Compared to other adaptive schemes, our framework provides 1.94x to 5.63x overall speedup.


MergeComp: A Compression Scheduler for Scalable Communication-Efficient Distributed Training

Large-scale distributed training is increasingly becoming communication ...

Accordion: Adaptive Gradient Communication via Critical Learning Regime Identification

Distributed model training suffers from communication bottlenecks due to...

L-GreCo: An Efficient and General Framework for Layerwise-Adaptive Gradient Compression

Data-parallel distributed training of deep neural networks (DNN) has gai...

On the Utility of Gradient Compression in Distributed Training Systems

Rapid growth in data sets and the scale of neural network architectures ...

Communication-Efficient Adam-Type Algorithms for Distributed Data Mining

Distributed data mining is an emerging research topic to effectively and...

Pufferfish: Communication-efficient Models At No Extra Cost

To mitigate communication overheads in distributed model training, sever...

DynaQuant: Compressing Deep Learning Training Checkpoints via Dynamic Quantization

With the increase in the scale of Deep Learning (DL) training workloads ...