Serverless Straggler Mitigation using Local Error-Correcting Codes

01/21/2020
by Vipul Gupta, et al.

Inexpensive cloud services, such as serverless computing, are often vulnerable to straggling nodes that increase end-to-end latency for distributed computation. We propose and implement simple yet principled approaches for straggler mitigation in serverless systems for matrix multiplication and evaluate them on several common applications from machine learning and high-performance computing. The proposed schemes are inspired by error-correcting codes and employ parallel encoding and decoding over the data stored in the cloud using serverless workers. This creates a fully distributed computing framework without using a master node to conduct encoding or decoding, which removes the computation, communication and storage bottleneck at the master. On the theory side, we establish that our proposed scheme is asymptotically optimal in terms of decoding time and provide a lower bound on the number of stragglers it can tolerate with high probability. Through extensive experiments, we show that our scheme outperforms existing schemes such as speculative execution and other coding theoretic methods by at least 25
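The abstract does not spell out the code construction. The sketch below shows only the core idea of a local code for coded matrix multiplication, under the simplest possible assumption: every group of `group_size` row blocks of A gets one sum-parity block. Because matrix multiplication is linear, the parity block's product equals the sum of its group's products, so any single straggler in a group can be recovered by subtraction. Each group decodes independently of the others, which is what allows decoding to be spread across workers rather than done at a master. This single-parity-check local code is an illustrative stand-in, not the paper's exact scheme. The function names (`encode_local`, `decode_local`, `coded_matmul`) are hypothetical, and the "workers" are simulated in-process.

```python
from typing import List, Optional, Set

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop product; stands in for the work one serverless worker does."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a: Matrix, b: Matrix) -> Matrix:
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def encode_local(blocks: List[Matrix], group_size: int) -> List[Matrix]:
    """Append one sum-parity block after every `group_size` data blocks (assumed local code)."""
    assert len(blocks) % group_size == 0, "sketch assumes equal-sized groups"
    coded: List[Matrix] = []
    for start in range(0, len(blocks), group_size):
        group = blocks[start:start + group_size]
        parity = group[0]
        for blk in group[1:]:
            parity = add(parity, blk)
        coded.extend(group + [parity])
    return coded


def decode_local(results: List[Optional[Matrix]], group_size: int) -> Optional[List[Matrix]]:
    """Recover data-block products group by group; None if any group lost too many blocks."""
    decoded: List[Matrix] = []
    width = group_size + 1  # data blocks plus one parity block per group
    for start in range(0, len(results), width):
        group = results[start:start + width]
        data, parity = group[:group_size], group[group_size]
        missing = [i for i, r in enumerate(data) if r is None]
        if len(missing) > 1 or (missing and parity is None):
            return None  # undecodable group; a real system would wait or re-launch workers
        if missing:
            # Parity product minus the surviving data products gives the lost product.
            recovered = parity
            for r in data:
                if r is not None:
                    recovered = sub(recovered, r)
            data[missing[0]] = recovered
        decoded.extend(data)
    return decoded


def coded_matmul(a: Matrix, b: Matrix, group_size: int, stragglers: Set[int]) -> Optional[Matrix]:
    """Split A into single-row blocks, encode, 'run' one worker per coded block, then decode."""
    blocks = [[row] for row in a]
    coded = encode_local(blocks, group_size)
    # Simulated workers: a straggler's output is simply missing (None).
    results = [None if w in stragglers else matmul(blk, b) for w, blk in enumerate(coded)]
    decoded = decode_local(results, group_size)
    if decoded is None:
        return None
    return [row for blk in decoded for row in blk]


A = [[1, 2], [3, 4], [5, 6], [7, 8]]
B = [[1, 0], [2, 1]]
print(coded_matmul(A, B, group_size=2, stragglers={1, 3}))  # [[5, 2], [11, 4], [17, 6], [23, 8]]
print(coded_matmul(A, B, group_size=2, stragglers={0, 1}))  # None: two losses in one group
```

With `group_size=2`, six simulated workers handle four data blocks. The job survives one straggler per group, but not two in the same group. Larger groups lower the redundancy overhead at the cost of tolerating fewer stragglers per block of data.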
