Training Faster with Compressed Gradient

08/13/2020
by An Xu, et al.

Although distributed machine learning methods show promise for speeding up the training of large deep neural networks, communication cost has been a notorious bottleneck constraining their performance. To address this challenge, gradient-compression-based communication-efficient distributed learning methods were designed to reduce the communication cost, and more recently local error feedback has been incorporated to compensate for the resulting performance loss. In this paper, however, we show that local error feedback in centralized distributed training suffers from a "gradient mismatch" problem, which can lead to degraded performance compared with full-precision training. To solve this critical problem, we propose two novel techniques: 1) step ahead and 2) error averaging. Both our theoretical and empirical results show that the new methods alleviate the "gradient mismatch" problem. Experiments show that, measured in training epochs, training with compressed gradients can even be faster than full-precision training.
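For readers unfamiliar with the baseline the abstract refers to, below is a minimal single-worker sketch of compressed SGD with local error feedback, the scheme in which the "gradient mismatch" issue arises. It uses NumPy and a top-k compressor purely for illustration; the function names (topk_compress, worker_step), the learning rate, and the choice of k are assumptions of this sketch, and the paper's step-ahead and error-averaging corrections are not implemented here.

```python
import numpy as np

def topk_compress(grad, k):
    """Keep the k largest-magnitude entries of grad and zero the rest
    (an illustrative compressor, not the one used in the paper)."""
    flat = grad.ravel()
    if k >= flat.size:
        return grad.copy()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    compressed = np.zeros_like(flat)
    compressed[idx] = flat[idx]
    return compressed.reshape(grad.shape)

def worker_step(params, grad, error, lr=0.1, k=10):
    """One worker-side step of generic error-feedback compressed SGD."""
    # Add the locally accumulated compression error back to the fresh gradient.
    corrected = grad + error
    # Compress what would actually be communicated to the server.
    compressed = topk_compress(corrected, k)
    # The part dropped by compression becomes the new local error.
    new_error = corrected - compressed
    # In a real setup `compressed` would be sent and aggregated across workers;
    # here it is applied directly to emulate a single-worker update.
    new_params = params - lr * compressed
    return new_params, new_error

# Toy usage with a random stand-in for the stochastic gradient.
rng = np.random.default_rng(0)
params = rng.normal(size=100)
error = np.zeros_like(params)
for _ in range(5):
    grad = rng.normal(size=100)
    params, error = worker_step(params, grad, error)
```

Because the error term is fed back only locally, the gradient each worker compensates differs from the globally aggregated one, which is the mismatch the paper analyzes.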

Related research

11/01/2021
Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization
Due to the explosion in the size of the training datasets, distributed l...

05/17/2021
Compressed Communication for Distributed Training: Adaptive Methods and System
Communication overhead severely hinders the scalability of distributed m...

10/14/2022
Communication-Efficient Adam-Type Algorithms for Distributed Data Mining
Distributed data mining is an emerging research topic to effectively and...

05/27/2019
Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback
Communication overhead is a major bottleneck hampering the scalability o...

07/17/2019
DeepSqueeze: Decentralization Meets Error-Compensated Compression
Communication is a key bottleneck in distributed training. Recently, an ...

02/14/2021
Smoothness Matrices Beat Smoothness Constants: Better Communication Compression Techniques for Distributed Optimization
Large scale distributed optimization has become the default tool for the...

09/17/2019
Communication-Efficient Weighted Sampling and Quantile Summary for GBDT
Gradient boosting decision tree (GBDT) is a powerful and widely-used mac...
