On Biased Compression for Distributed Learning

02/27/2020
by Aleksandr Beznosikov, et al.

In the last few years, various communication compression techniques have emerged as an indispensable tool for alleviating the communication bottleneck in distributed learning. However, despite the fact that biased compressors often show superior performance in practice when compared to the much more studied and better understood unbiased compressors, very little is known about them. In this work we study three classes of biased compression operators, two of which are new, and their performance when applied to (stochastic) gradient descent and distributed (stochastic) gradient descent. We show for the first time that biased compressors can lead to linear convergence rates both in the single-node and distributed settings. Our distributed SGD method enjoys the ergodic rate O(δL exp[−μK/(δL)]/μ + (C + D)/(Kμ)), where δ is a compression parameter which grows when more compression is applied, L and μ are the smoothness and strong convexity constants, C captures stochastic gradient noise (C = 0 if full gradients are computed on each node), and D captures the variance of the gradients at the optimum (D = 0 for over-parameterized models). Further, via a theoretical study of several synthetic and empirical distributions of communicated gradients, we shed light on why and by how much biased compressors outperform their unbiased variants. Finally, we propose a new, highly performant biased compressor, a combination of Top-k and natural dithering, which in our experiments outperforms all other compression techniques.
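The combined compressor proposed at the end of the abstract (Top-k followed by natural dithering) can be pictured with a short sketch. The snippet below is a minimal illustration rather than the authors' reference implementation: it assumes that natural dithering here means stochastic rounding of the magnitudes, normalized by the infinity norm, onto s power-of-two levels, and all function names (top_k, natural_dithering, top_k_natural_dithering) are ours.

```python
import numpy as np

def top_k(x, k):
    """Biased Top-k compressor: keep the k largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def natural_dithering(x, s=3, rng=None):
    """Stochastic rounding of |x_i| / ||x||_inf onto the power-of-two levels
    0, 2^-(s-1), ..., 1/2, 1 (an assumed reading of 'natural dithering')."""
    rng = np.random.default_rng() if rng is None else rng
    norm = np.max(np.abs(x))
    if norm == 0.0:
        return np.zeros_like(x)
    levels = np.concatenate(([0.0], 2.0 ** -np.arange(s - 1, -1, -1)))   # ascending levels
    r = np.abs(x) / norm                                  # normalized magnitudes in [0, 1]
    hi = np.clip(np.searchsorted(levels, r), 1, len(levels) - 1)
    lo = hi - 1
    p_up = (r - levels[lo]) / (levels[hi] - levels[lo])   # round up with this probability
    rounded = np.where(rng.random(x.shape) < p_up, levels[hi], levels[lo])
    return np.sign(x) * norm * rounded

def top_k_natural_dithering(x, k, s=3):
    """Combined compressor: Top-k sparsification followed by natural dithering."""
    return natural_dithering(top_k(x, k), s=s)
```

For example, top_k_natural_dithering(g, k=len(g) // 10, s=3) keeps roughly 10% of the coordinates of a gradient g and then quantizes the surviving magnitudes to powers of two, so each transmitted entry needs only an index, a sign, and a small exponent.

Biased compressors like the one above are typically paired with an error-feedback memory when plugged into distributed SGD; the sketch below shows that generic mechanism and is not necessarily the exact method analyzed in the paper (the names x, grads, memories, compressor and the learning rate lr are assumptions made for illustration).

```python
def ef_sgd_round(x, grads, memories, compressor, lr):
    """One round of distributed SGD with error feedback (generic sketch).
    Each worker compresses its error-corrected, scaled gradient, sends the
    compressed vector, and keeps the compression residual locally."""
    updates, new_memories = [], []
    for g, m in zip(grads, memories):
        corrected = m + lr * g            # re-inject the residual from previous rounds
        u = compressor(corrected)         # e.g. lambda v: top_k_natural_dithering(v, k=10)
        updates.append(u)
        new_memories.append(corrected - u)
    x_new = x - np.mean(updates, axis=0)  # server averages and applies the compressed updates
    return x_new, new_memories
```

Each worker carries its residual new_memories[i] into the next round, so no gradient information is permanently discarded by the biased compression.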


Related research

07/31/2020 · Analysis of SGD with Biased Gradient Estimators
08/04/2021 · ErrorCompensatedX: error compensation for variance reduced algorithms
06/21/2022 · Shifted Compression Framework: Generalizations and Improvements
11/19/2019 · On the Discrepancy between the Theoretical Analysis and Practical Implementations of Compressed Communication for Distributed Deep Learning
09/04/2020 · On Communication Compression for Distributed Optimization on Heterogeneous Data
02/26/2020 · LASG: Lazily Aggregated Stochastic Gradients for Communication-Efficient Distributed Learning
05/25/2023 · A Guide Through the Zoo of Biased SGD
