PipeDream: Fast and Efficient Pipeline Parallel DNN Training

06/08/2018
by Aaron Harlap, et al.

PipeDream is a Deep Neural Network (DNN) training system for GPUs that parallelizes computation by pipelining execution across multiple machines. Its pipeline-parallel computing model avoids the slowdowns faced by data-parallel training when large models and/or limited network bandwidth induce high communication-to-computation ratios. PipeDream reduces communication by up to 95% for large DNNs relative to data-parallel training, and allows perfect overlap of communication and computation. PipeDream keeps all available GPUs productive by systematically partitioning DNN layers among them to balance work and minimize communication, versions model parameters for backward-pass correctness, and schedules the forward and backward passes of different inputs in round-robin fashion to optimize "time to target accuracy". Experiments with five different DNNs on two different clusters show that PipeDream is up to 5x faster in time-to-accuracy compared to data-parallel training.
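
The weight-versioning and round-robin scheduling ideas in the abstract can be illustrated with a short sketch. The Python below is not PipeDream's implementation; it is a minimal simulation, using assumed names (`Stage`, `one_forward_one_backward`), of how a single pipeline stage might stash the weight version used by each forward pass so the matching backward pass sees consistent weights, while alternating forward and backward work after a warm-up phase.

```python
from collections import deque

class Stage:
    """One pipeline stage owning a contiguous slice of the model's layers."""

    def __init__(self, name):
        self.name = name
        self.version = 0      # current weight version on this stage
        self.stash = deque()  # weight versions stashed for in-flight inputs

    def forward(self, mb):
        # Weight stashing: record the version used for this minibatch so the
        # matching backward pass computes gradients against the same weights.
        self.stash.append(self.version)
        print(f"{self.name}: forward  mb{mb} (weights v{self.version})")

    def backward(self, mb):
        used = self.stash.popleft()  # oldest in-flight minibatch retires first
        print(f"{self.name}: backward mb{mb} (weights v{used})")
        self.version += 1            # gradient applied -> new weight version

def one_forward_one_backward(stage, stage_idx, num_stages, num_minibatches):
    """Round-robin schedule: after a warm-up that fills the downstream
    pipeline, the stage alternates one backward and one forward pass."""
    warmup = min(num_stages - stage_idx, num_minibatches)
    fwd = bwd = 0
    for _ in range(warmup):
        stage.forward(fwd)
        fwd += 1
    while bwd < num_minibatches:
        stage.backward(bwd)
        bwd += 1
        if fwd < num_minibatches:
            stage.forward(fwd)
            fwd += 1

# Hypothetical usage: the first of four stages processing six minibatches.
one_forward_one_backward(Stage("stage0"), stage_idx=0,
                         num_stages=4, num_minibatches=6)
```

Running the sketch shows backward passes for early minibatches using the stashed version v0 even after later updates have advanced the stage's weights, which is the correctness property the abstract attributes to parameter versioning.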

Related research

01/17/2022
Efficient DNN Training with Knowledge-Guided Layer Freezing
Training deep neural networks (DNNs) is time-consuming. While most exist...

07/22/2022
Layer-Wise Partitioning and Merging for Efficient and Scalable Deep Learning
Deep Neural Network (DNN) models are usually trained sequentially from o...

05/10/2019
Priority-based Parameter Propagation for Distributed DNN Training
Data parallel training is widely used for scaling distributed deep neura...

03/16/2021
Parareal Neural Networks Emulating a Parallel-in-time Algorithm
As deep neural networks (DNNs) become deeper, the training time increase...

04/22/2022
Efficient Pipeline Planning for Expedited Distributed DNN Training
To train modern large DNN models, pipeline parallelism has recently emer...

07/22/2023
Optimized Network Architectures for Large Language Model Training with Billions of Parameters
This paper challenges the well-established paradigm for building any-to-...

04/05/2020
Reducing Data Motion to Accelerate the Training of Deep Neural Networks
This paper reduces the cost of DNNs training by decreasing the amount of...
