Throughput Prediction of Asynchronous SGD in TensorFlow

11/12/2019
by Zhuojin Li, et al.

Modern machine learning frameworks can train neural networks using multiple nodes in parallel, each computing parameter updates with stochastic gradient descent (SGD) and sharing them asynchronously through a central parameter server. Due to communication overhead and bottlenecks, the total throughput of SGD updates in a cluster scales sublinearly, saturating as the number of nodes increases. In this paper, we present an approach to predicting training throughput from profiling traces collected on a single-node configuration. Our approach models the interaction of multiple nodes and the scheduling of concurrent transmissions between the parameter server and each node. By accounting for the dependencies between received parts and pending computations, we predict overlaps between computation and communication and generate synthetic execution traces for multi-node configurations. We validate our approach on TensorFlow training jobs for popular image classification neural networks, on AWS and on our in-house cluster, using nodes equipped with GPUs or only with CPUs. We also investigate the effects of the data transmission policies used in TensorFlow and the accuracy of our approach when combined with optimizations of the transmission schedule.
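
To make the saturation effect concrete, here is a minimal, illustrative sketch in Python (not the paper's model): it simulates workers that alternate between a fixed gradient-computation phase and a gradient/parameter exchange over a single parameter-server link that serves one worker at a time, and reports aggregate updates per second. All names and timing values (simulate_throughput, compute_time, comm_time) are assumptions chosen for illustration; the paper's predictor instead works from profiling traces of a single-node run and models TensorFlow's actual transmission scheduling.

import heapq

def simulate_throughput(num_workers, compute_time, comm_time, num_updates=1000):
    """Updates per second when every gradient exchange must pass through
    one parameter-server link that serves a single worker at a time."""
    # Each worker finishes its first mini-batch at t = compute_time.
    ready = [(compute_time, w) for w in range(num_workers)]
    heapq.heapify(ready)
    link_free_at = 0.0   # time at which the shared link next becomes idle
    finish = 0.0
    for _ in range(num_updates):
        t_ready, w = heapq.heappop(ready)
        start = max(t_ready, link_free_at)   # possibly wait for the link
        finish = start + comm_time           # push gradients, pull weights
        link_free_at = finish
        # The worker immediately starts computing its next mini-batch.
        heapq.heappush(ready, (finish + compute_time, w))
    return num_updates / finish

# Aggregate throughput saturates near 1 / comm_time as workers are added.
for n in (1, 2, 4, 8, 16, 32):
    print(n, round(simulate_throughput(n, compute_time=0.10, comm_time=0.02), 1))

With compute_time = 0.10 s and comm_time = 0.02 s, aggregate throughput grows almost linearly while the link is underutilized and then flattens near 1 / comm_time = 50 updates per second: the kind of sublinear saturation that the approach above aims to predict from single-node traces.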

Related research:

- DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression (05/15/2019). A standard approach in large scale machine learning is distributed stoch...
- Modeling and Evaluation of Synchronous Stochastic Gradient Descent in Distributed Deep Learning on Multiple GPUs (05/10/2018). With huge amounts of training data, deep learning has made great breakth...
- Overlap Local-SGD: An Algorithmic Approach to Hide Communication Delays in Distributed SGD (02/21/2020). Distributed stochastic gradient descent (SGD) is essential for scaling t...
- Weighted parallel SGD for distributed unbalanced-workload training system (08/16/2017). Stochastic gradient descent (SGD) is a popular stochastic optimization m...
- GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent (03/15/2018). In this paper, we present GossipGraD - a gossip communication protocol b...
- Efficient Training of Convolutional Neural Nets on Large Distributed Systems (11/02/2017). Deep Neural Networks (DNNs) have achieved impressive accuracy in many ...
- Scaling GRPC Tensorflow on 512 nodes of Cori Supercomputer (12/26/2017). We explore scaling of the standard distributed Tensorflow with GRPC prim...
