Heterogeneity-aware Gradient Coding for Straggler Tolerance

01/27/2019
by   Haozhao Wang, et al.
0

Gradient descent algorithms are widely used in machine learning. In order to deal with huge volume of data, we consider the implementation of gradient descent algorithms in a distributed computing setting where multiple workers compute the gradient over some partial data and the master node aggregates their results to obtain the gradient over the whole data. However, its performance can be severely affected by straggler workers. Recently, some coding-based approaches are introduced to mitigate the straggler problem, but they are efficient only when the workers are homogeneous, i.e., having the same computation capabilities. In this paper, we consider that the workers are heterogeneous which are common in modern distributed systems. We propose a novel heterogeneity-aware gradient coding scheme which can not only tolerate a predetermined number of stragglers but also fully utilize the computation capabilities of heterogeneous workers. We show that this scheme is optimal when the computation capabilities of workers are estimated accurately. A variant of this scheme is further proposed to improve the performance when the estimations of the computation capabilities are not so accurate. We conduct our schemes for gradient descent based image classification on QingCloud clusters. Evaluation results show that our schemes can reduce the whole computation time by up to 3× compared with a state-of-the-art coding scheme.

READ FULL TEXT

page 1

page 7

page 8

page 9

research
07/01/2020

Distributed Linearly Separable Computation

This paper formulates a distributed computation problem, where a master ...
research
04/20/2019

Optimal Load Allocation for Coded Distributed Computation in Heterogeneous Clusters

Recently, coding has been a useful technique to mitigate the effect of s...
research
01/31/2022

Lightweight Projective Derivative Codes for Compressed Asynchronous Gradient Descent

Coded distributed computation has become common practice for performing ...
research
06/06/2022

Optimization-based Block Coordinate Gradient Coding for Mitigating Partial Stragglers in Distributed Learning

Gradient coding schemes effectively mitigate full stragglers in distribu...
research
10/27/2017

Near-Optimal Straggler Mitigation for Distributed Gradient Methods

Modern learning algorithms use gradient descent updates to train inferen...
research
07/10/2023

Coded Distributed Image Classification

In this paper, we present a coded computation (CC) scheme for distribute...
research
11/24/2022

Sequential Gradient Coding For Straggler Mitigation

In distributed computing, slower nodes (stragglers) usually become a bot...

Please sign up or login with your details

Forgot password? Click here to reset