Communication-Efficient Gradient Coding for Straggler Mitigation in Distributed Learning

05/14/2020
by Swanand Kadhe, et al.

Distributed implementations of gradient-based methods, wherein a server distributes gradient computations across worker machines, need to overcome two limitations: delays caused by slow-running machines called 'stragglers', and communication overheads. Recently, Ye and Abbe [ICML 2018] proposed a coding-theoretic paradigm to characterize a fundamental trade-off between computation load per worker, communication overhead per worker, and straggler tolerance. However, their proposed coding schemes suffer from heavy decoding complexity and poor numerical stability. In this paper, we develop a communication-efficient gradient coding framework to overcome these drawbacks. Our proposed framework enables using any linear code to design the encoding and decoding functions. When a particular code is used in this framework, its block-length determines the computation load, its dimension determines the communication overhead, and its minimum distance determines the straggler tolerance. The flexibility of choosing a code allows us to gracefully trade off the straggler threshold and communication overhead for smaller decoding complexity and higher numerical stability. Further, we show that using a maximum distance separable (MDS) code generated by a random Gaussian matrix in our framework yields a gradient code that is optimal with respect to the trade-off and, in addition, satisfies stronger guarantees on numerical stability as compared to the previously proposed schemes. Finally, we evaluate our proposed framework on Amazon EC2 and demonstrate that it reduces the average iteration time by 16%.
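To make the MDS construction concrete, here is a minimal NumPy sketch of gradient coding with a random Gaussian encoding matrix. This is not the authors' implementation; all sizes and worker assignments are illustrative. It shows the key property the abstract relies on: because every square submatrix of a random Gaussian matrix is invertible with probability 1, the server can recover the full gradient from any k of the n workers' coded messages, tolerating up to n - k stragglers.

```python
import numpy as np

rng = np.random.default_rng(0)

k, n = 4, 6   # k gradient partitions, n workers: tolerates n - k = 2 stragglers
dim = 8       # dimension of each partial gradient (illustrative)

# True partial gradients, one per data partition (hypothetical values).
grads = rng.standard_normal((k, dim))

# Random Gaussian encoding matrix: any k x k submatrix is invertible with
# probability 1, so the induced code is MDS almost surely.
G = rng.standard_normal((n, k))

# Each worker j transmits a single coded vector: a linear combination of
# the partial gradients it computed.
coded = G @ grads            # shape (n, dim)

# Suppose workers 1 and 4 straggle; the server decodes from any k survivors
# by solving the k x k linear system restricted to their rows.
survivors = [0, 2, 3, 5]
recovered = np.linalg.solve(G[survivors], coded[survivors])

# The full gradient is the sum of the recovered partial gradients.
full_grad = recovered.sum(axis=0)
assert np.allclose(full_grad, grads.sum(axis=0))
```

Decoding here is a dense k x k solve; the paper's point is that choosing structured codes in place of this generic construction lets one trade some straggler tolerance or communication for cheaper, more stable decoding.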

Related research:

- Adaptive Gradient Coding (06/08/2020): This paper focuses on mitigating the impact of stragglers in distributed...
- Improved Latency-Communication Trade-Off for Map-Shuffle-Reduce Systems with Stragglers (08/20/2018): In a distributed computing system operating according to the map-shuffle...
- Communication-Computation Efficient Gradient Coding (02/09/2018): This paper develops coding techniques to reduce the running time of dist...
- Gradient Coding Based on Block Designs for Mitigating Adversarial Stragglers (04/30/2019): Distributed implementations of gradient-based methods, wherein a server ...
- Fundamental Limits of Approximate Gradient Coding (01/23/2019): It has been established that when the gradient coding problem is distrib...
- LAGC: Lazily Aggregated Gradient Coding for Straggler-Tolerant and Communication-Efficient Distributed Learning (05/22/2019): Gradient-based distributed learning in Parameter Server (PS) computing a...
- CodedReduce: A Fast and Robust Framework for Gradient Aggregation in Distributed Learning (02/06/2019): We focus on the commonly used synchronous Gradient Descent paradigm for ...
