Fast Distributed Deep Learning via Worker-adaptive Batch Sizing

06/07/2018
by Chen Chen, et al.

Deep neural network models are usually trained in cluster environments, where the model parameters are iteratively refined by multiple worker machines in parallel. One key challenge in this setting is the presence of stragglers, which significantly degrades learning performance. In this paper, we propose to eliminate stragglers by adapting each worker's training load to its processing capability; that is, slower workers receive a smaller batch of data to process. Following this idea, we develop a new synchronization scheme called LB-BSP (Load-balanced BSP), which coordinates the batch sizes of the workers so that they all finish batch processing at around the same time. A prerequisite for deciding the workers' batch sizes is knowing their processing speeds before each iteration starts. For the best prediction accuracy, we adopt NARX, an extended recurrent neural network that accounts for both historical speeds and driving factors such as CPU and memory utilization. We have implemented LB-BSP for both TensorFlow and MXNet. EC2 experiments against popular benchmarks show that LB-BSP can effectively accelerate the training of deep models, with up to a 2x speedup.
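To make the core idea concrete, below is a minimal Python sketch (not the authors' implementation) of worker-adaptive batch sizing: given each worker's predicted throughput for the upcoming iteration (e.g., from the NARX speed predictor described above), the global batch is split in proportion to those speeds so that every worker needs roughly the same time to process its share. The function name assign_batch_sizes and the example numbers are illustrative assumptions.

# Illustrative sketch of LB-BSP-style worker-adaptive batch sizing.
# Assumption: predicted_speeds[i] is worker i's predicted throughput
# (samples/second) for the upcoming iteration, e.g. the output of a
# speed-prediction model such as NARX.

def assign_batch_sizes(predicted_speeds, total_batch_size):
    """Split the global batch across workers in proportion to their
    predicted speeds, so that each worker's time for its share
    (batch_size / speed) is roughly equal across workers."""
    total_speed = sum(predicted_speeds)
    # Provisional proportional shares, rounded down to integers.
    sizes = [int(total_batch_size * s / total_speed) for s in predicted_speeds]
    # Hand the leftover samples to the fastest workers so that the
    # global batch size is preserved exactly.
    leftover = total_batch_size - sum(sizes)
    fastest_first = sorted(range(len(sizes)),
                           key=lambda i: predicted_speeds[i], reverse=True)
    for i in fastest_first[:leftover]:
        sizes[i] += 1
    return sizes

if __name__ == "__main__":
    # Hypothetical speeds (samples/sec); worker 2 is a straggler.
    speeds = [400.0, 350.0, 200.0, 380.0]
    print(assign_batch_sizes(speeds, total_batch_size=1024))
    # -> [308, 270, 153, 293]: the straggler gets the smallest batch.

In this sketch the straggler receives the smallest share, so the per-iteration synchronization barrier is no longer dominated by the slowest machine while the global batch size stays unchanged.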

Related research

- AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks (12/06/2017)
  Training deep neural networks with Stochastic Gradient Descent, or its v...

- DBS: Dynamic Batch Size For Distributed Deep Neural Network Training (07/23/2020)
  Synchronous strategies with data parallelism, such as the Synchronous St...

- Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning (08/16/2019)
  Deep learning is a popular machine learning technique and has been appli...

- Dynamic backup workers for parallel machine learning (04/30/2020)
  The most popular framework for distributed training of machine learning ...

- Large-Batch Training for LSTM and Beyond (01/24/2019)
  Large-batch training approaches have enabled researchers to utilize larg...

- Micro Batch Streaming: Allowing the Training of DNN models Using a large batch size on Small Memory Systems (10/24/2021)
  The size of the deep learning models has greatly increased over the past...

- Restructuring, Pruning, and Adjustment of Deep Models for Parallel Distributed Inference (08/19/2020)
  Using multiple nodes and parallel computing algorithms has become a prin...
