Distributed Learning with Compressed Gradient Differences

by Konstantin Mishchenko et al.
King Abdullah University of Science and Technology

Training very large machine learning models requires a distributed computing approach, with communication of the model updates often being the bottleneck. For this reason, several methods based on the compression (e.g., sparsification and/or quantization) of the updates were recently proposed, including QSGD (Alistarh et al., 2017), TernGrad (Wen et al., 2017), SignSGD (Bernstein et al., 2018), and DQGD (Khirirat et al., 2018). However, none of these methods are able to learn the gradients, which means that they necessarily suffer from several issues, such as the inability to converge to the true optimum in the batch mode, inability to work with a nonsmooth regularizer, and slow convergence rates. In this work we propose a new distributed learning method---DIANA---which resolves these issues via compression of gradient differences. We perform a theoretical analysis in the strongly convex and nonconvex settings and show that our rates are vastly superior to existing rates. Our analysis of block-quantization and differences between ℓ_2 and ℓ_∞ quantization closes the gaps in theory and practice. Finally, by applying our analysis technique to TernGrad, we establish the first convergence rate for this method.
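The core idea of compressing gradient differences can be sketched in a few lines. Below is a minimal illustration, not the paper's full DIANA algorithm: each worker keeps a local shift h_i, compresses the difference g_i − h_i rather than the gradient itself, and both worker and server update the shift with the compressed difference. The rand-k sparsifier, step sizes, and variable names here are illustrative assumptions.

```python
import numpy as np

def compress(v, k):
    """Rand-k sparsification: keep k random coordinates,
    rescaled by d/k so the compressor stays unbiased."""
    d = v.size
    out = np.zeros_like(v)
    idx = np.random.choice(d, k, replace=False)
    out[idx] = v[idx] * (d / k)
    return out

def diana_step(x, grads, shifts, alpha, lr, k):
    """One sketch step: workers compress gradient *differences* g_i - h_i,
    the server averages them to form an unbiased gradient estimate,
    and all shifts h_i move toward the current gradients."""
    deltas = [compress(g - h, k) for g, h in zip(grads, shifts)]
    new_shifts = [h + alpha * d for h, d in zip(shifts, deltas)]
    # mean(h_i) is known to the server; only the compressed deltas are sent
    g_hat = np.mean(shifts, axis=0) + np.mean(deltas, axis=0)
    return x - lr * g_hat, new_shifts
```

Because the shifts h_i learn the workers' gradients over time, the differences being compressed shrink near the optimum, which is what lets the method converge exactly in the batch setting where plain compressed SGD stalls.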


