Polynomially Coded Regression: Optimal Straggler Mitigation via Data Encoding

05/24/2018
by   Songze Li, et al.
0

We consider the problem of training a least-squares regression model on a large dataset using gradient descent. The computation is carried out on a distributed system consisting of a master node and multiple worker nodes. Such distributed systems are significantly slowed down due to the presence of slow-running machines (stragglers) as well as various communication bottlenecks. We propose "polynomially coded regression" (PCR) that substantially reduces the effect of stragglers and lessens the communication burden in such systems. The key idea of PCR is to encode the partial data stored at each worker, such that the computations at the workers can be viewed as evaluating a polynomial at distinct points. This allows the master to compute the final gradient by interpolating this polynomial. PCR significantly reduces the recovery threshold, defined as the number of workers the master has to wait for prior to computing the gradient. In particular, PCR requires a recovery threshold that scales inversely proportionally with the amount of computation/storage available at each worker. In comparison, state-of-the-art straggler-mitigation schemes require a much higher recovery threshold that only decreases linearly in the per worker computation/storage load. We prove that PCR's recovery threshold is near minimal and within a factor two of the best possible scheme. Our experiments over Amazon EC2 demonstrate that compared with state-of-the-art schemes, PCR improves the run-time by 1.50x 2.36x with naturally occurring stragglers, and by as much as 2.58x 4.29x with artificial stragglers.

READ FULL TEXT

Authors

page 5

10/27/2017

Near-Optimal Straggler Mitigation for Distributed Gradient Methods

Modern learning algorithms use gradient descent updates to train inferen...
01/31/2018

On the Optimal Recovery Threshold of Coded Matrix Multiplication

We provide novel coded computation strategies for distributed matrix-mat...
10/23/2018

Computation Scheduling for Distributed Machine Learning with Straggling Workers

We study the scheduling of computation tasks across n workers in a large...
06/02/2020

Age-Based Coded Computation for Bias Reduction in Distributed Learning

Coded computation can be used to speed up distributed learning in the pr...
05/16/2021

LocalNewton: Reducing Communication Bottleneck for Distributed Learning

To address the communication bottleneck problem in distributed optimizat...
01/23/2020

Coded Computing for Boolean Functions

The growing size of modern datasets necessitates a massive computation i...
01/21/2020

Serverless Straggler Mitigation using Local Error-Correcting Codes

Inexpensive cloud services, such as serverless computing, are often vuln...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.