torchgpipe: On-the-fly Pipeline Parallelism for Training Giant Models

04/21/2020
by   Chiheon Kim, et al.
0

We design and implement a ready-to-use library in PyTorch for performing micro-batch pipeline parallelism with checkpointing proposed by GPipe (Huang et al., 2019). In particular, we develop a set of design components to enable pipeline-parallel gradient computation in PyTorch's define-by-run and eager execution environment. We show that each component is necessary to fully benefit from pipeline parallelism in such environment, and demonstrate the efficiency of the library by applying it to various network architectures including AmoebaNet-D and U-Net. Our library is available at https://github.com/kakaobrain/torchgpipe .

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/11/2022

Breadth-First Pipeline Parallelism

We introduce Breadth-First Pipeline Parallelism, a novel training schedu...
research
02/01/2022

Pipeflow: An Efficient Task-Parallel Pipeline Programming Framework using Modern C++

Pipeline is a fundamental parallel programming pattern. Mainstream pipel...
research
10/09/2019

PipeMare: Asynchronous Pipeline Parallel DNN Training

Recently there has been a flurry of interest around using pipeline paral...
research
03/03/2023

Ada-Grouper: Accelerating Pipeline Parallelism in Preempted Network by Adaptive Group-Scheduling for Micro-Batches

Pipeline parallelism has been demonstrated to be a remarkable approach t...
research
11/10/2021

Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training

With deep learning models rapidly growing in size, systems-level solutio...
research
02/06/2023

Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models

In recent years, large-scale models have demonstrated state-of-the-art p...
research
07/25/2022

Dive into Big Model Training

The increasing scale of model size and continuous improvement of perform...

Please sign up or login with your details

Forgot password? Click here to reset