Crossover-SGD: A gossip-based communication in distributed deep learning for alleviating large mini-batch problem and enhancing scalability

12/30/2020
by Sangho Yeo, et al.

Distributed deep learning is an effective way to reduce the training time of deep learning for large datasets as well as complex models. However, the limited scalability caused by network overheads makes it difficult to synchronize the parameters of all workers. To resolve this problem, gossip-based methods, which demonstrate stable scalability regardless of the number of workers, have been proposed. However, to use gossip-based methods in general cases, their validation accuracy under large mini-batches needs to be verified. To this end, we first empirically study the characteristics of gossip methods on the large mini-batch problem and observe that gossip methods preserve higher validation accuracy than AllReduce-SGD (Stochastic Gradient Descent) when the batch size is increased and the number of workers is fixed. However, the delayed propagation of parameters in gossip-based methods decreases validation accuracy at large node scales. To cope with this problem, we propose Crossover-SGD, which alleviates the delayed propagation of weight parameters via segment-wise communication and a load-balanced random network topology. We also adopt hierarchical communication to limit the number of workers participating in gossip-based communication. To validate the effectiveness of the proposed method, we conduct empirical experiments and observe that Crossover-SGD shows higher node scalability than SGP (Stochastic Gradient Push).
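The abstract describes Crossover-SGD only at a high level, so the following is a minimal, single-process NumPy sketch of the general idea: segment-wise gossip averaging over a randomly paired, load-balanced topology, rather than a full-model AllReduce. It is an illustration under stated assumptions, not the authors' implementation; the function names (load_balanced_topology, segmentwise_gossip_round), the permutation-based pairing, and the random segment selection are hypothetical stand-ins for the paper's load-balanced random topology and segment-wise communication.

```python
import numpy as np

def load_balanced_topology(num_workers, rng):
    """Sample a random pairing so every worker has exactly one send target
    and one receive source per round (a hypothetical stand-in for the
    paper's load-balanced random topology)."""
    perm = rng.permutation(num_workers)
    # Re-sample until there are no self-loops (simple, not the paper's scheme).
    while np.any(perm == np.arange(num_workers)):
        perm = rng.permutation(num_workers)
    return perm  # perm[i] is the peer that worker i communicates with

def segmentwise_gossip_round(params, num_segments, rng):
    """One gossip round: each worker averages only one segment of its
    parameters with a randomly chosen peer, so no single round has to
    move the full model."""
    num_workers, dim = params.shape
    peers = load_balanced_topology(num_workers, rng)
    bounds = np.linspace(0, dim, num_segments + 1, dtype=int)
    new_params = params.copy()
    for w in range(num_workers):
        # Pick one segment for this round (chosen at random here; illustrative).
        s = rng.integers(num_segments)
        lo, hi = bounds[s], bounds[s + 1]
        peer = peers[w]
        # Pairwise average of the chosen segment only.
        new_params[w, lo:hi] = 0.5 * (params[w, lo:hi] + params[peer, lo:hi])
    return new_params

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    workers, dim = 8, 16
    params = rng.normal(size=(workers, dim))
    for _ in range(200):
        params = segmentwise_gossip_round(params, num_segments=4, rng=rng)
    # The spread across workers shrinks over rounds without any global AllReduce.
    print("std across workers:", params.std(axis=0).mean())
```

In this toy setting each round exchanges only one parameter segment per worker pair, which illustrates why segment-wise gossip keeps per-round communication low compared with synchronizing the full model across all workers.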


Related research

05/24/2018 | Local SGD Converges Fast and Communicates Little
Mini-batch stochastic gradient descent (SGD) is the state of the art in ...

08/22/2018 | Don't Use Large Mini-Batches, Use Local SGD
Mini-batch stochastic gradient methods are the current state of the art ...

07/17/2018 | Parallel Restarted SGD for Non-Convex Optimization with Faster Convergence and Less Communication
For large scale non-convex stochastic optimization, parallel mini-batch ...

11/27/2018 | Stochastic Gradient Push for Distributed Deep Learning
Large mini-batch parallel SGD is commonly used for distributed training ...

03/29/2019 | Yet Another Accelerated SGD: ResNet-50 Training on ImageNet in 74.7 seconds
There has been a strong demand for algorithms that can execute machine l...

10/11/2018 | signSGD with Majority Vote is Communication Efficient And Byzantine Fault Tolerant
Training neural networks on large datasets can be accelerated by distrib...

12/08/2017 | Neumann Optimizer: A Practical Optimization Algorithm for Deep Neural Networks
Progress in deep learning is slowed by the days or weeks it takes to tra...
