OD-SGD: One-step Delay Stochastic Gradient Descent for Distributed Training

05/14/2020
by   Yemao Xu, et al.

Training modern deep neural networks calls for large amounts of computation, which is usually provided by GPUs or other dedicated accelerators. To scale out and train faster, two update algorithms are mainly applied in distributed training: the Synchronous SGD algorithm (SSGD) and the Asynchronous SGD algorithm (ASGD). SSGD reaches a good convergence point, but its training speed is limited by the synchronization barrier; ASGD trains faster but converges to a worse point than SSGD. To exploit the advantages of both, we propose a novel technique named One-step Delay SGD (OD-SGD), which combines their strengths so that training can achieve a convergence point similar to SSGD at a speed close to ASGD. To the best of our knowledge, this is the first attempt to combine the features of SSGD and ASGD to improve distributed training performance. Each iteration of OD-SGD consists of a global update on the parameter server node and a local update on each worker node; the local update is introduced to refresh and compensate the one-step-delayed local weights. We evaluate the proposed algorithm on the MNIST, CIFAR-10 and ImageNet datasets. Experimental results show that OD-SGD obtains accuracy similar to, or even slightly better than, SSGD, while training much faster, even exceeding the training speed of ASGD.
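To make the one-step-delay idea concrete, the following is a minimal single-process sketch, not the authors' reference implementation: it assumes a toy quadratic loss and folds the parameter-server global update and one worker's compensating local update into a single loop. Variable names such as `pending_grad` and the learning-rate choice are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, lr = 10, 0.1
target = rng.normal(size=dim)          # optimum of the toy quadratic loss


def grad(w):
    """Gradient of the toy loss 0.5 * ||w - target||^2."""
    return w - target


global_w = np.zeros(dim)               # weights held by the parameter server
local_w = global_w.copy()              # delayed copy held by a worker
pending_grad = None                    # gradient still "in flight" to the server

for step in range(50):
    g = grad(local_w)                  # worker computes a gradient on its local copy

    # Global update: the server applies the gradient sent in the previous
    # iteration, so the worker never waits on this step (one-step delay).
    if pending_grad is not None:
        global_w -= lr * pending_grad
    pending_grad = g

    # Local update: the worker pulls the (stale) global weights and
    # compensates the delay by applying its newest gradient locally.
    local_w = global_w - lr * g

print("distance to optimum:", np.linalg.norm(local_w - target))
```

In a real deployment the global step would run on the parameter server and the local step on each worker in parallel, which is where the speedup over SSGD comes from.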


Related research

12/10/2020  A Mechanism for Distributed Deep Learning Communication Optimization
Intensive communication and synchronization cost for gradients and param...

01/17/2021  Guided parallelized stochastic gradient descent for delay compensation
Stochastic gradient descent (SGD) algorithm and its variations have been...

10/19/2018  Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-Update SGD
Large-scale machine learning training, in particular distributed stochas...

08/12/2019  Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations
Load imbalance pervasively exists in distributed deep learning training ...

01/15/2016  Faster Asynchronous SGD
Asynchronous distributed stochastic gradient descent methods have troubl...

06/06/2021  Distributed Learning and its Application for Time-Series Prediction
Extreme events are occurrences whose magnitude and potential cause exten...

03/12/2020  Machine Learning on Volatile Instances
Due to the massive size of the neural network models and training datase...
