Priority-based Parameter Propagation for Distributed DNN Training

05/10/2019
by Anand Jayarajan et al.

Data parallel training is widely used for scaling distributed deep neural network (DNN) training. However, the performance benefits are often limited by the communication-heavy parameter synchronization step. In this paper, we take advantage of the domain-specific knowledge of DNN training and overlap parameter synchronization with computation in order to improve the training performance. We make two key observations: (1) the optimal data representation granularity for communication may differ from that used by the underlying DNN model implementation, and (2) different parameters can afford different synchronization delays. Based on these observations, we propose a new synchronization mechanism called Priority-based Parameter Propagation (P3). P3 synchronizes parameters at a finer granularity and schedules data transmission in such a way that the training process incurs minimal communication delay. We show that P3 can improve the training throughput of ResNet-50, Sockeye and VGG-19 by as much as 25%, 38% and 66% respectively on clusters with realistic network bandwidth.
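To make the abstract's two observations concrete, here is a minimal Python sketch of the scheduling idea: each layer's gradient is cut into fixed-size slices so that transmission granularity is decoupled from model granularity (observation 1), and a priority queue sends slices belonging to layers nearest the input first, because the next iteration's forward pass needs those parameters earliest (observation 2). All names here (Slice, SLICE_SIZE, make_slices, schedule_backward) and the slice size are illustrative assumptions for this demo, not the paper's actual implementation.

import heapq
from dataclasses import dataclass, field

# Hypothetical transmission granularity (elements per slice); the exact
# value is an assumption for this demo, not taken from the paper.
SLICE_SIZE = 4096

@dataclass(order=True)
class Slice:
    priority: int                     # smaller number = transmitted earlier
    layer: int = field(compare=False)
    offset: int = field(compare=False)
    length: int = field(compare=False)

def make_slices(layer: int, num_params: int) -> list:
    """Observation (1): decouple transmission granularity from model
    granularity by cutting one layer's gradient into fixed-size slices."""
    # Observation (2): the next forward pass consumes layers front-to-back,
    # so layers closer to the input get higher priority (= lower number).
    return [Slice(layer, layer, off, min(SLICE_SIZE, num_params - off))
            for off in range(0, num_params, SLICE_SIZE)]

def schedule_backward(layer_sizes: list) -> None:
    """Gradients become ready back-to-front during backpropagation, but the
    priority queue reorders their transmission front-to-back."""
    ready = []
    for layer in reversed(range(len(layer_sizes))):   # backprop order
        for s in make_slices(layer, layer_sizes[layer]):
            heapq.heappush(ready, s)
        # Pretend the network frees up once per layer and sends the
        # highest-priority slice available at that moment.
        s = heapq.heappop(ready)
        print(f"send layer {s.layer}, slice @{s.offset} (len {s.length})")
    while ready:                                      # drain the backlog
        s = heapq.heappop(ready)
        print(f"send layer {s.layer}, slice @{s.offset} (len {s.length})")

schedule_backward([10_000, 8_000, 6_000])   # three layers, input to output

In a real system the pop would happen whenever the network interface finishes a transfer, so a freshly computed slice of an early layer can overtake queued slices of deeper layers; the authors' implementation builds this scheduling into the parameter synchronization path of MXNet.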

Related research

PipeDream: Fast and Efficient Pipeline Parallel DNN Training (06/08/2018)
PipeDream is a Deep Neural Network (DNN) training system for GPUs that pa...

OSP: Boosting Distributed Model Training with 2-stage Synchronization (06/29/2023)
Distributed deep learning (DDL) is a promising research area, which aims...

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes (02/19/2019)
It is important to scale out deep neural network (DNN) training for redu...

HLSGD Hierarchical Local SGD With Stale Gradients Featuring (09/06/2020)
While distributed training significantly speeds up the training process ...

WRHT: Efficient All-reduce for Distributed DNN Training in Optical Interconnect System (07/22/2022)
Communication efficiency plays an important role in accelerating the dis...

Boosting Distributed Machine Learning Training Through Loss-tolerant Transmission Protocol (05/07/2023)
Distributed Machine Learning (DML) systems are utilized to enhance the s...

Blink: Fast and Generic Collectives for Distributed ML (10/11/2019)
Model parameter synchronization across GPUs introduces high overheads fo...