Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study

09/14/2015
by Suyog Gupta, et al.

This paper presents Rudra, a parameter-server-based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm, we study the impact of the synchronization protocol, stale gradient updates, mini-batch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that effectively bounds gradient staleness, improves runtime performance, and achieves good model accuracy. Our empirical investigation reveals a principled approach to distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system in order to preserve model accuracy. We validate this approach on two commonly used image classification benchmarks: CIFAR10 and ImageNet.
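The abstract does not spell out the learning rate modulation rule, so the toy sketch below is only an illustration of the general idea: a parameter server applies asynchronous gradient updates and scales a base learning rate by the inverse of each gradient's staleness. The function name async_sgd, the base_lr parameter, the 1/staleness rule, and the quadratic toy objective are all assumptions for this sketch, not the exact protocol described in the paper.

```python
# Minimal sketch of staleness-aware asynchronous SGD on a toy quadratic problem.
# The 1/staleness learning rate modulation is an assumed, illustrative rule.
import numpy as np

def grad(w, x, y):
    """Gradient of the squared error 0.5 * (w.x - y)^2 with respect to w."""
    return (w @ x - y) * x

def async_sgd(num_learners=4, steps=200, base_lr=0.1, seed=0):
    rng = np.random.RandomState(seed)
    dim = 5
    w_true = rng.randn(dim)
    w = np.zeros(dim)                  # parameters held by the parameter server
    version = 0                        # server-side parameter version counter
    # Each learner remembers the parameter copy and version it last pulled.
    pulled = {k: (w.copy(), version) for k in range(num_learners)}

    for _ in range(steps):
        k = rng.randint(num_learners)          # some learner finishes a mini-batch
        w_local, v_local = pulled[k]
        x = rng.randn(dim)
        y = w_true @ x
        g = grad(w_local, x, y)                # gradient computed on stale weights

        staleness = version - v_local + 1      # >= 1; 1 means fully fresh
        lr = base_lr / staleness               # dampen stale updates (assumed rule)
        w -= lr * g
        version += 1

        pulled[k] = (w.copy(), version)        # learner pulls fresh parameters
    return w, w_true

if __name__ == "__main__":
    w, w_true = async_sgd()
    print("parameter error:", np.linalg.norm(w - w_true))
```

In this sketch, a gradient computed against older parameters contributes less to the update, which is one simple way to counteract the accuracy loss that stale gradients can cause in asynchronous training.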

