Dynamic backup workers for parallel machine learning

04/30/2020
by   Chuan Xu, et al.
0

The most popular framework for distributed training of machine learning models is the (synchronous) parameter server (PS). This paradigm consists of n workers, which iteratively compute updates of the model parameters, and a stateful PS, which waits and aggregates all updates to generate a new estimate of model parameters and sends it back to the workers for a new iteration. Transient computation slowdowns or transmission delays can intolerably lengthen the time of each iteration. An efficient way to mitigate this problem is to let the PS wait only for the fastest n-b updates, before generating the new parameters. The slowest b workers are called backup workers. The optimal number b of backup workers depends on the cluster configuration and workload, but also (as we show in this paper) on the hyper-parameters of the learning algorithm and the current stage of the training. We propose DBW, an algorithm that dynamically decides the number of backup workers during the training process to maximize the convergence speed at each iteration. Our experiments show that DBW 1) removes the necessity to tune b by preliminary time-consuming experiments, and 2) makes the training up to a factor 3 faster than the optimal static configuration.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
02/11/2021

Straggler-Resilient Distributed Machine Learning with Dynamic Backup Workers

With the increasing demand for large-scale training of machine learning ...
research
08/16/2019

Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning

Deep learning is a popular machine learning technique and has been appli...
research
11/06/2018

Elastic CoCoA: Scaling In to Improve Convergence

In this paper we experimentally analyze the convergence behavior of CoCo...
research
07/07/2020

Divide-and-Shuffle Synchronization for Distributed Machine Learning

Distributed Machine Learning suffers from the bottleneck of synchronizat...
research
06/07/2018

Fast Distributed Deep Learning via Worker-adaptive Batch Sizing

Deep neural network models are usually trained in cluster environments, ...
research
02/22/2019

Scaling Distributed Machine Learning with In-Network Aggregation

Training complex machine learning models in parallel is an increasingly ...
research
10/03/2021

Distributed Optimization using Heterogeneous Compute Systems

Hardware compute power has been growing at an unprecedented rate in rece...

Please sign up or login with your details

Forgot password? Click here to reset