Sequential Gradient Coding For Straggler Mitigation

11/24/2022
by M. Nikhil Krishnan, et al.

In distributed computing, slower nodes (stragglers) often become a bottleneck. Gradient Coding (GC), introduced by Tandon et al., is an efficient technique that uses principles of error-correcting codes to distribute gradient computation in the presence of stragglers. In this paper, we consider the distributed computation of a sequence of gradients {g(1), g(2), …, g(J)}, where the processing of each gradient g(t) starts in round t and finishes by round (t+T), with T ≥ 0 denoting a delay parameter. In the GC scheme, coding is applied only across computing nodes, which corresponds to a solution with T = 0. Allowing T > 0, on the other hand, makes it possible to design schemes that exploit the temporal dimension as well. In this work, we propose two schemes that improve upon GC. Our first scheme combines GC with selective repetition of previously unfinished tasks and achieves improved straggler mitigation. Our second scheme, which constitutes our main contribution, applies GC to a subset of the tasks and repetition to the remainder; these two classes of tasks are then multiplexed across workers and rounds in an adaptive manner, based on past straggler patterns. Through theoretical analysis, we show that the second scheme achieves a significant reduction in computational load. In our experiments, we study a practical setting in which multiple neural networks are trained concurrently over an AWS Lambda cluster of 256 worker nodes, a setting to which our framework naturally applies. We show that the second scheme can yield a 16% improvement in runtime over the baseline GC scheme in the presence of naturally occurring, non-simulated stragglers.
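As context for the abstract, the sketch below illustrates the baseline GC primitive the paper builds on: the classic (n = 3, s = 1) construction of Tandon et al., in which each worker transmits a fixed linear combination of its assigned partial gradients, and the master recovers the full gradient sum from any n − s responses. This is a minimal sketch only; the NumPy setting, the toy gradient vectors, and all variable names are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of the (n = 3, s = 1) gradient-coding instance from
# Tandon et al., assuming NumPy and toy partial gradients. The vectors
# g1, g2, g3 and the encode/decode names are illustrative stand-ins.
import numpy as np

# Encoding matrix B: row i is the linear combination of partial
# gradients that worker i transmits. Every 2-of-3 subset of rows
# spans the all-ones vector, so any single straggler is tolerated.
B = np.array([
    [0.5, 1.0,  0.0],   # worker 0 sends (1/2) g1 + g2
    [0.0, 1.0, -1.0],   # worker 1 sends g2 - g3
    [0.5, 0.0,  1.0],   # worker 2 sends (1/2) g1 + g3
])

# Decoding vectors a_S for each surviving-worker set S, chosen so that
# a_S @ B[S] equals (1, 1, 1), i.e. the sum of all partial gradients.
DECODE = {
    (0, 1): np.array([2.0, -1.0]),
    (0, 2): np.array([1.0,  1.0]),
    (1, 2): np.array([1.0,  2.0]),
}

rng = np.random.default_rng(0)
g = rng.standard_normal((3, 5))   # toy partial gradients g1, g2, g3
coded = B @ g                     # what each worker would transmit

# Whichever single worker straggles, the master decodes the full sum.
for survivors, a in DECODE.items():
    recovered = a @ coded[list(survivors)]
    assert np.allclose(recovered, g.sum(axis=0))
print("full gradient recovered from every 2-of-3 worker subset")
```

Per the abstract, the proposed sequential schemes build on this primitive across rounds: with delay T > 0, work left unfinished by stragglers in round t can still be completed by round t + T, via selective repetition in the first scheme and adaptive multiplexing of GC-coded and repeated tasks in the second.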
