Performance Analysis and Comparison of Distributed Machine Learning Systems

09/04/2019
by   Salem Alqahtani, et al.
0

Deep learning has permeated through many aspects of computing/processing systems in recent years. While distributed training architectures/frameworks are adopted for training large deep learning models quickly, there has not been a systematic study of the communication bottlenecks of these architectures and their effects on the computation cycle time and scalability. In order to analyze this problem for synchronous Stochastic Gradient Descent (SGD) training of deep learning models, we developed a performance model of computation time and communication latency under three different system architectures: Parameter Server (PS), peer-to-peer (P2P), and Ring allreduce (RA). To complement and corroborate our analytical models with quantitative results, we evaluated the computation and communication performance of these system architectures of the systems via experiments performed with Tensorflow and Horovod frameworks. We found that the system architecture has a very significant effect on the performance of training. RA-based systems achieve scalable performance as they successfully decouple network usage from the number of workers in the system. In contrast, 1PS systems suffer from low performance due to network congestion at the parameter server side. While P2P systems fare better than 1PS systems, they still suffer from significant network bottleneck. Finally, RA systems also excel by virtue of overlapping computation time and communication time, which PS and P2P architectures fail to achieve.

READ FULL TEXT

page 3

page 5

page 7

research
09/19/2020

Sparse Communication for Training Deep Networks

Synchronous stochastic gradient descent (SGD) is the most common method ...
research
05/19/2023

A Generic Performance Model for Deep Learning in a Distributed Environment

Performance modelling of a deep learning application is essential to imp...
research
05/21/2019

Performance Analysis of Deep Learning Workloads on Leading-edge Systems

This work examines the performance of leading-edge systems designed for ...
research
03/08/2018

TicTac: Accelerating Distributed Deep Learning with Communication Scheduling

State-of-the-art deep learning systems rely on iterative distributed tra...
research
06/07/2020

Stochastic Automata Network for Performance Evaluation of Heterogeneous SoC Communication

To meet ever increasing demand for performance of emerging System-on-Chi...
research
04/17/2020

DynamicEmbedding: Extending TensorFlow for Colossal-Scale Applications

One of the limitations of deep learning models with sparse features toda...
research
07/01/2020

Shuffle-Exchange Brings Faster: Reduce the Idle Time During Communication for Decentralized Neural Network Training

As a crucial scheme to accelerate the deep neural network (DNN) training...

Please sign up or login with your details

Forgot password? Click here to reset