MergeComp: A Compression Scheduler for Scalable Communication-Efficient Distributed Training

03/28/2021
by Zhuang Wang et al.

Large-scale distributed training is increasingly communication bound. Many gradient compression algorithms have been proposed to reduce the communication overhead and improve scalability; however, gradient compression has been observed to sometimes harm the performance of distributed training. In this paper, we propose MergeComp, a compression scheduler that optimizes the scalability of communication-efficient distributed training. It automatically schedules compression operations to optimize the performance of compression algorithms, without requiring knowledge of model architectures or system parameters. We have applied MergeComp to nine popular compression algorithms. Our evaluations show that MergeComp can improve the performance of compression algorithms by up to 3.83x without losing accuracy. It can even achieve a scaling factor of distributed training of up to 99%.
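To illustrate the general idea of scheduling compression over merged gradients (rather than compressing each layer's gradient separately), below is a minimal sketch. It packs per-layer gradients into larger flat buckets and applies a toy top-k sparsifier once per bucket, which amortizes the per-invocation compression overhead. All names here (merge_gradients, topk_compress, BUCKET_SIZE) and the greedy bucketing policy are illustrative assumptions, not MergeComp's actual API or scheduling algorithm.

    # Sketch: bucket-then-compress, assuming PyTorch and a toy top-k compressor.
    import torch

    BUCKET_SIZE = 2 ** 20  # merge threshold in number of elements (assumed)

    def merge_gradients(grads, bucket_size=BUCKET_SIZE):
        """Greedily pack per-layer gradients into flat buckets of roughly bucket_size elements."""
        buckets, current, count = [], [], 0
        for g in grads:
            current.append(g.flatten())
            count += g.numel()
            if count >= bucket_size:
                buckets.append(torch.cat(current))
                current, count = [], 0
        if current:
            buckets.append(torch.cat(current))
        return buckets

    def topk_compress(bucket, ratio=0.01):
        """Toy top-k sparsification: keep the largest-magnitude 1% of values."""
        k = max(1, int(bucket.numel() * ratio))
        _, indices = torch.topk(bucket.abs(), k)
        return bucket[indices], indices  # (values, indices) would be communicated

    # Example: compress the gradients of a small model per bucket, not per layer.
    model = torch.nn.Sequential(torch.nn.Linear(128, 256), torch.nn.ReLU(),
                                torch.nn.Linear(256, 10))
    model(torch.randn(32, 128)).sum().backward()
    grads = [p.grad for p in model.parameters()]
    for bucket in merge_gradients(grads):
        values, indices = topk_compress(bucket)
        # in real distributed training these would be exchanged (e.g., via allgather)

The design question a scheduler like MergeComp answers is how to choose the partition of layers into such buckets: larger buckets reduce compression overhead but delay communication, while smaller buckets overlap better with backpropagation.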

Related research:

- Compressed Communication for Distributed Training: Adaptive Methods and System (05/17/2021)
- ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training (04/21/2021)
- CD-SGD: Distributed Stochastic Gradient Descent with Compression and Delay Compensation (06/21/2021)
- Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training (12/05/2017)
- Is Network the Bottleneck of Distributed Training? (06/17/2020)
- SuperNeurons: FFT-based Gradient Sparsification in the Distributed Training of Deep Neural Networks (11/21/2018)
- GraVAC: Adaptive Compression for Communication-Efficient Distributed DL Training (05/20/2023)
