PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers

02/05/2021
by   Chaoyang He, et al.
29

The size of Transformer models is growing at an unprecedented pace. It has only taken less than one year to reach trillion-level parameters after the release of GPT-3 (175B). Training such models requires both substantial engineering efforts and enormous computing resources, which are luxuries most research teams cannot afford. In this paper, we propose PipeTransformer, which leverages automated and elastic pipelining and data parallelism for efficient distributed training of Transformer models. PipeTransformer automatically adjusts the pipelining and data parallelism by identifying and freezing some layers during the training, and instead allocates resources for training of the remaining active layers. More specifically, PipeTransformer dynamically excludes converged layers from the pipeline, packs active layers into fewer GPUs, and forks more replicas to increase data-parallel width. We evaluate PipeTransformer using Vision Transformer (ViT) on ImageNet and BERT on GLUE and SQuAD datasets. Our results show that PipeTransformer attains a 2.4 fold speedup compared to the state-of-the-art baseline. We also provide various performance analyses for a more comprehensive understanding of our algorithmic and system-wise design. We also develop open-sourced flexible APIs for PipeTransformer, which offer a clean separation among the freeze algorithm, model definitions, and training accelerations, hence allowing it to be applied to other algorithms that require similar freezing strategies.

READ FULL TEXT

page 1

page 2

09/17/2019

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Recent work in unsupervised language modeling demonstrates that training...
09/17/2019

Megatron-LM: Training Multi-Billion Parameter Language Models Using GPU Model Parallelism

Recent work in unsupervised language modeling demonstrates that training...
12/06/2021

Automap: Towards Ergonomic Automated Parallelism for ML Models

The rapid rise in demand for training large neural network architectures...
11/09/2021

Sliced Recursive Transformer

We present a neat yet effective recursive operation on vision transforme...
06/22/2020

LAMP: Large Deep Nets with Automated Model Parallelism for Image Segmentation

Deep Learning (DL) models are becoming larger, because the increase in m...
05/03/2022

Better plain ViT baselines for ImageNet-1k

It is commonly accepted that the Vision Transformer model requires sophi...
02/13/2020

Training Large Neural Networks with Constant Memory using a New Execution Algorithm

Widely popular transformer-based NLP models such as BERT and GPT have en...

Code Repositories