Modeling and Evaluation of Synchronous Stochastic Gradient Descent in Distributed Deep Learning on Multiple GPUs

05/10/2018
by Shaohuai Shi, et al.

With huge amounts of training data, deep learning has achieved great breakthroughs in many artificial intelligence (AI) applications. However, such large-scale data sets present computational challenges, requiring training to be distributed across a cluster equipped with accelerators like GPUs. As GPU computing power grows rapidly, data communication among GPUs has become a potential bottleneck for overall training performance. In this paper, we first propose a general directed acyclic graph (DAG) model to describe the distributed synchronous stochastic gradient descent (S-SGD) algorithm, which has been widely used in distributed deep learning frameworks. To understand the practical impact of data communication on training performance, we conduct extensive empirical studies of four state-of-the-art distributed deep learning frameworks (i.e., Caffe-MPI, CNTK, MXNet, and TensorFlow) in multi-GPU and multi-node environments with different data communication techniques, including PCIe, NVLink, 10GbE, and InfiniBand. Through both analytical and experimental studies, we identify potential bottlenecks and overheads that could be further optimized. Finally, we make our experimental traces publicly available as a data set that can support simulation-based studies.
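
To make the compute-then-communicate structure that the DAG model captures concrete, below is a minimal, runnable Python/NumPy sketch of data-parallel S-SGD. The workers are simulated in-process, and the model, worker count, batch size, and learning rate are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Minimal simulation of synchronous SGD (S-SGD) with data parallelism.
# Each "worker" holds a replica of the model, computes gradients on its
# own shard of the global mini-batch, and the gradients are averaged
# (the all-reduce step) before a single, synchronized parameter update.

rng = np.random.default_rng(0)

# Hypothetical setup: linear regression, 4 workers, global batch of 64.
n_workers, global_batch, dim = 4, 64, 8
w = rng.normal(size=dim)          # replicated model parameters
w_true = rng.normal(size=dim)     # ground truth for synthetic data
lr = 0.1

for step in range(100):
    # Draw a global mini-batch and split it evenly across workers.
    X = rng.normal(size=(global_batch, dim))
    y = X @ w_true
    shards = np.array_split(np.arange(global_batch), n_workers)

    # Phase 1 (parallel on each GPU): local forward/backward pass.
    local_grads = []
    for idx in shards:
        Xi, yi = X[idx], y[idx]
        grad = 2.0 * Xi.T @ (Xi @ w - yi) / len(idx)  # MSE gradient
        local_grads.append(grad)

    # Phase 2 (communication): average the gradients so every worker
    # holds the same aggregated gradient. On a real cluster this is an
    # all-reduce over the interconnect.
    avg_grad = np.mean(local_grads, axis=0)

    # Phase 3 (synchronized update): identical update on every replica.
    w -= lr * avg_grad

print("parameter error:", np.linalg.norm(w - w_true))
```

On a real multi-GPU cluster, Phase 2 would be an all-reduce over the interconnect (e.g., via MPI or a collective-communication library), which is exactly the step where the paper's PCIe, NVLink, 10GbE, and InfiniBand comparisons apply: as Phase 1 gets faster on newer GPUs, Phase 2 increasingly dominates the iteration time.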
