Efficient Replication for Straggler Mitigation in Distributed Computing

06/03/2020
by Amir Behrouzi-Far, et al.

Master-worker distributed computing systems use task replication to mitigate the effect of slow workers, known as stragglers. Tasks are grouped into batches, and each batch is assigned to one or more workers for execution. We first consider the case where the batches do not overlap and, using results from majorization theory, show that, for a general class of workers' service time distributions, a balanced assignment of batches to workers minimizes the average job compute time. We next show that this balanced assignment of non-overlapping batches achieves a lower average job compute time than the overlapping schemes proposed in the literature. Furthermore, we derive the optimum redundancy level as a function of the service time distribution at the workers. We show that the redundancy level that minimizes the average job compute time is not necessarily the same as the redundancy level that maximizes the predictability of the job compute time, and thus there is a trade-off between optimizing the two metrics. Finally, by running experiments on Google cluster traces, we observe that redundancy can reduce the compute time of jobs in Google clusters by an order of magnitude, and that the optimum level of redundancy depends on the distribution of the tasks' service time.
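The claim about balanced assignment can be illustrated with a small calculation. The sketch below is not the paper's model or proof; it assumes, purely for illustration, that every worker finishes its batch after an independent exponential time with rate `mu`, that a batch completes when its fastest replica finishes, and that the job completes when every batch has completed. Under those assumptions the expected job compute time has a closed form via inclusion-exclusion. The function name `expected_job_time` and the two example assignments are hypothetical.

```python
from itertools import combinations


def expected_job_time(replicas, mu=1.0):
    """Exact E[max over batches of (min over that batch's replicas)].

    Assumption (not from the source): each worker's service time is an
    independent Exp(mu) variable. The fastest of n replicas is then
    Exp(n * mu), and the expected maximum of independent exponentials
    with rates r_1..r_B follows from inclusion-exclusion:
        E[max] = sum over non-empty subsets S of (-1)^(|S|+1) / sum(r_i, i in S)
    """
    rates = [n * mu for n in replicas]
    total = 0.0
    for k in range(1, len(rates) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for subset in combinations(rates, k):
            total += sign / sum(subset)
    return total


# 12 workers, 4 non-overlapping batches: balanced vs. skewed replication.
balanced = expected_job_time([3, 3, 3, 3])
skewed = expected_job_time([6, 2, 2, 2])
print(f"balanced: {balanced:.4f}  skewed: {skewed:.4f}")
```

With the same 12 workers, the balanced assignment gives the smaller expected job compute time in this toy setting, which is consistent with the abstract's statement; the exponential model is only one member of the class of distributions the abstract refers to.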

research
12/06/2019

Data Replication for Reducing Computing Time in Distributed Systems with Stragglers

In distributed computing systems with stragglers, various forms of redun...
research
06/12/2019

Optimizing Redundancy Levels in Master-Worker Compute Clusters for Straggler Mitigation

Runtime variability in computing systems causes some tasks to straggle a...
research
08/08/2018

On the Effect of Task-to-Worker Assignment in Distributed Computing Systems with Stragglers

We study the expected completion time of some recently proposed algorith...
research
10/01/2017

Straggler Mitigation by Delayed Relaunch of Tasks

Redundancy for straggler mitigation, originally in data download and mor...
research
03/01/2023

Computing Redundancy in Blocking Systems: Fast Service or No Service

Redundancy in distributed computing systems reduces job completion time....
research
08/07/2019

Redundancy Scheduling in Systems with Bi-Modal Job Service Time Distribution

Queuing systems with redundant requests have drawn great attention becau...
