Error Feedback Fixes SignSGD and other Gradient Compression Schemes

01/28/2019
by Sai Praneeth Karimireddy, et al.

Sign-based algorithms (e.g. signSGD) have been proposed as a biased gradient compression technique to alleviate the communication bottleneck in training large neural networks across multiple workers. We show simple convex counter-examples where signSGD does not converge to the optimum. Further, even when it does converge, signSGD may generalize poorly compared with SGD. These issues arise because of the biased nature of the sign compression operator. We then show that using error feedback, i.e., incorporating the error made by the compression operator into the next step, overcomes these issues. We prove that our algorithm EF-SGD achieves the same rate of convergence as SGD, without any additional assumptions, for arbitrary compression operators (including the sign operator), indicating that we get gradient compression for free. Our experiments thoroughly substantiate the theory and show the superiority of our algorithm.
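The error-feedback idea described in the abstract can be sketched in a few lines: compress the (scaled) gradient plus a residual, apply the compressed update, and store whatever was lost to compression for the next step. Below is a minimal NumPy sketch under stated assumptions; `grad_fn` is a hypothetical stochastic-gradient oracle, and the scaled-sign compressor follows one common convention rather than being taken verbatim from the paper.

```python
import numpy as np

def sign_compress(v):
    # Biased sign compressor: keep only the signs, rescaled by the mean
    # magnitude so the compressed vector has a comparable norm
    # (one common convention; an assumption, not the paper's exact choice).
    return np.sign(v) * np.mean(np.abs(v))

def ef_sgd(grad_fn, x0, lr=0.1, steps=100):
    """Error-feedback SGD sketch.

    grad_fn(x) -> stochastic gradient at x (hypothetical helper).
    The residual e accumulates the compression error and is folded back
    into the next update before compressing -- the key error-feedback step.
    """
    x = x0.copy()
    e = np.zeros_like(x0)           # error memory
    for _ in range(steps):
        g = grad_fn(x)
        p = lr * g + e              # incorporate past compression error
        delta = sign_compress(p)    # compressed update actually applied
        x -= delta
        e = p - delta               # store what the compressor discarded
    return x
```

The residual `e` is what distinguishes this from plain signSGD: without it, the bias of the sign operator can accumulate and prevent convergence on the counter-examples mentioned above.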


