MG-WFBP: Merging Gradients Wisely for Efficient Communication in Distributed Deep Learning

12/18/2019
by Shaohuai Shi, et al.

Distributed synchronous stochastic gradient descent (SGD) has been widely used to train deep neural networks (DNNs) on computer clusters. As computational power grows, network communication increasingly limits system scalability. Wait-free backpropagation (WFBP) is a popular solution that overlaps communication with computation during training. In this paper, we observe that many DNNs have a large number of layers, each with only a small amount of data to communicate during distributed training, which can make WFBP inefficient. Based on the fact that merging several short communication tasks into a single one reduces the overall communication time, we formulate an optimization problem that minimizes the training time when pipelining communications and computations. We derive an optimal solution that can be computed efficiently without affecting training performance. We then use this solution in a distributed training algorithm named merged-gradient WFBP (MG-WFBP) and implement it on two platforms, Caffe and PyTorch. Extensive experiments on three GPU clusters verify the effectiveness of MG-WFBP. We further run trace-based simulations of 64 GPUs to explore the potential scaling efficiency of MG-WFBP. Experimental results show that MG-WFBP achieves much better scaling performance than existing methods.
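The benefit of merging short communication tasks can be illustrated with a startup-plus-bandwidth cost model, in which each message incurs a fixed startup latency alpha plus a per-byte transfer time beta. The sketch below is a minimal illustration under this commonly used alpha-beta model, not the paper's exact formulation; the constants ALPHA and BETA and the layer sizes are assumed values chosen for illustration only.

```python
# Minimal sketch of the alpha-beta communication cost model (assumed
# constants, not measurements from the paper).
ALPHA = 50e-6   # per-message startup latency in seconds (assumed)
BETA = 4e-9     # per-byte transfer time in seconds (assumed)

def allreduce_time(num_bytes):
    """Simplified cost of one all-reduce message: startup plus bandwidth term."""
    return ALPHA + BETA * num_bytes

def total_time(layer_sizes, merge=False):
    """Communication time for per-layer gradients, optionally merged into one buffer."""
    if merge:
        # One message carrying all gradients: a single startup cost.
        return allreduce_time(sum(layer_sizes))
    # One message per layer: one startup cost per layer.
    return sum(allreduce_time(s) for s in layer_sizes)

# Example: 100 layers with 40 KB of gradients each, typical of DNNs
# that have many small layers.
sizes = [40_000] * 100
print(f"per-layer: {total_time(sizes) * 1e3:.2f} ms")              # 100 startups
print(f"merged:    {total_time(sizes, merge=True) * 1e3:.2f} ms")  # 1 startup
```

Merging every layer into one buffer minimizes the total startup cost but delays communication until all gradients are ready, sacrificing the overlap that WFBP provides; the optimization problem in the paper balances these two effects to decide which gradients to merge.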

Related research

- MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms (11/27/2018). Distributed synchronous stochastic gradient descent has been widely used...
- Decoupling the All-Reduce Primitive for Accelerating Distributed Deep Learning (02/24/2023). Communication scheduling has been shown to be effective in accelerating ...
- Exascale Deep Learning for Scientific Inverse Problems (09/24/2019). We introduce novel communication strategies in synchronous distributed D...
- Modeling and Evaluation of Synchronous Stochastic Gradient Descent in Distributed Deep Learning on Multiple GPUs (05/10/2018). With huge amounts of training data, deep learning has made great breakth...
- Isolated Scheduling for Distributed Training Tasks in GPU Clusters (08/10/2023). Distributed machine learning (DML) technology makes it possible to train...
- Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks (07/14/2021). Distributed training with synchronous stochastic gradient descent (SGD) ...
- swCaffe: a Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight (03/16/2019). This paper reports our efforts on swCaffe, a highly efficient parallel f...
