Shuffle-Exchange Brings Faster: Reduce the Idle Time During Communication for Decentralized Neural Network Training

07/01/2020
by Xiang Yang, et al.

As a crucial scheme for accelerating deep neural network (DNN) training, distributed stochastic gradient descent (DSGD) is widely adopted in many real-world applications. In most distributed deep learning (DL) frameworks, DSGD is implemented with the Ring-AllReduce architecture (Ring-SGD) and uses a computation-communication overlap strategy to hide the overhead of the massive communication that DSGD requires. However, we observe that although only O(1) gradient data needs to be communicated per worker in Ring-SGD, the O(n) handshakes it requires limit its usability when training with many workers or over high-latency networks. In this paper, we propose Shuffle-Exchange SGD (SESGD) to solve the dilemma of Ring-SGD. In a cluster of 16 workers with 0.1 ms Ethernet latency, SESGD accelerates DNN training by 1.7× without losing model accuracy. Moreover, training can be accelerated by up to 5× in high-latency networks (5 ms).
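The trade-off the abstract describes can be made concrete with the standard latency-bandwidth (alpha-beta) cost model: Ring-AllReduce moves only O(1) data per worker but needs 2(n-1) sequential steps, so its latency term grows linearly with the number of workers, whereas a log-depth exchange pattern pays the per-step latency only O(log n) times. The sketch below illustrates this under that generic model with assumed message size and bandwidth; it is not the paper's SESGD schedule or measurements.

```python
# Illustrative alpha-beta cost model: step count of Ring-AllReduce vs. a
# log-depth exchange pattern (recursive halving/doubling). All constants are
# assumptions for illustration, not taken from the paper.
import math

def ring_allreduce_time(n, msg_bytes, alpha, beta):
    """Ring-AllReduce: 2*(n-1) sequential steps (handshakes), each moving a 1/n chunk."""
    return 2 * (n - 1) * alpha + 2 * (n - 1) / n * msg_bytes * beta

def log_depth_allreduce_time(n, msg_bytes, alpha, beta):
    """Hypothetical log-depth exchange (recursive halving/doubling style):
    2*log2(n) steps, so the latency term grows with log(n) instead of n,
    while the total bytes moved per worker stay comparable to the ring."""
    return 2 * math.log2(n) * alpha + 2 * (n - 1) / n * msg_bytes * beta

if __name__ == "__main__":
    msg = 100e6        # 100 MB of gradients per step (assumed)
    beta = 1 / 1.25e9  # seconds per byte on ~10 Gb/s Ethernet (assumed)
    for alpha in (0.1e-3, 5e-3):   # 0.1 ms and 5 ms latency, the settings named in the abstract
        for n in (4, 16, 64):
            ring = ring_allreduce_time(n, msg, alpha, beta)
            logd = log_depth_allreduce_time(n, msg, alpha, beta)
            print(f"latency={alpha*1e3:.1f}ms workers={n:3d} "
                  f"ring={ring*1e3:8.1f}ms  log-depth={logd*1e3:8.1f}ms")
```

Running the sketch shows that with 0.1 ms latency the two schedules are close, while at 5 ms latency the 2(n-1) handshakes dominate the ring's cost as n grows, which is the regime where the abstract reports the largest speedups.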



Related research

- 11/20/2019: Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees. "To reduce the long training time of large deep neural network (DNN) mode..."
- 06/13/2019: Layered SGD: A Decentralized and Synchronous SGD Algorithm for Scalable Deep Neural Network Training. "Stochastic Gradient Descent (SGD) is the most popular algorithm for trai..."
- 12/30/2019: Variance Reduced Local SGD with Lower Communication Complexity. "To accelerate the training of machine learning models, distributed stoch..."
- 06/01/2023: DSGD-CECA: Decentralized SGD with Communication-Optimal Exact Consensus Algorithm. "Decentralized Stochastic Gradient Descent (SGD) is an emerging neural ne..."
- 09/04/2019: Performance Analysis and Comparison of Distributed Machine Learning Systems. "Deep learning has permeated through many aspects of computing/processing..."
- 06/03/2020: Local SGD With a Communication Overhead Depending Only on the Number of Workers. "We consider speeding up stochastic gradient descent (SGD) by parallelizi..."
- 10/04/2019: Distributed Learning of Deep Neural Networks using Independent Subnet Training. "Stochastic gradient descent (SGD) is the method of choice for distribute..."
