Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism

11/25/2022
by Xupeng Miao, et al.

Transformer models have achieved state-of-the-art performance across a variety of application domains and have gradually become the foundation of advanced large deep learning (DL) models. However, training these models efficiently over multiple GPUs remains challenging due to the large number of parallelism choices. Existing DL systems either rely on manual effort to construct distributed training plans or apply parallelism combinations within a very limited search space. In this paper, we propose Galvatron, a new system framework that incorporates multiple popular parallelism dimensions and automatically finds the most efficient hybrid parallelism strategy. To better explore such a vast search space, we 1) use a decision tree to decompose and prune the space based on reasonable intuitions, and then 2) design a dynamic programming search algorithm to generate the optimal plan. Evaluations on four representative Transformer workloads show that Galvatron automatically performs distributed training under different GPU memory budgets. Across all evaluated scenarios, Galvatron consistently achieves higher system throughput than previous work with limited parallelism support.
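The abstract describes a dynamic programming search over a per-layer parallelism-strategy space that has first been pruned with a decision tree. As a rough illustration of that kind of layer-wise search, the sketch below picks one strategy per layer to minimize estimated time under a GPU memory budget. The `Strategy` class, `dp_search` function, strategy names, and all cost numbers are hypothetical placeholders for illustration only; they are not Galvatron's actual cost model or interface.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Strategy:
    name: str            # e.g. "dp8", "tp4_dp2" -- illustrative labels only
    time_cost: float     # estimated per-layer time (ms) from a cost model
    memory_cost: float   # estimated per-GPU memory (MB) from a cost model

def dp_search(num_layers, candidates, memory_budget, mem_step=256.0):
    """Pick one strategy per layer, minimizing total time within the budget.

    States are keyed by the (discretized) per-GPU memory consumed so far;
    for each memory bucket only the fastest partial plan is kept.
    """
    states = {0: (0.0, [])}  # memory bucket -> (total_time, chosen strategies)
    for _ in range(num_layers):
        next_states = {}
        for bucket, (time_so_far, plan) in states.items():
            for s in candidates:
                mem = bucket * mem_step + s.memory_cost
                if mem > memory_budget:
                    continue  # this choice would exceed the GPU memory budget
                new_bucket = math.ceil(mem / mem_step)
                new_time = time_so_far + s.time_cost
                best = next_states.get(new_bucket)
                if best is None or new_time < best[0]:
                    next_states[new_bucket] = (new_time, plan + [s.name])
        states = next_states
        if not states:
            return None  # no feasible plan fits in the budget
    return min(states.values())  # (best total time, per-layer strategy plan)

# Example: a pruned candidate set (all numbers are made up for illustration).
candidates = [
    Strategy("dp8",         time_cost=1.0, memory_cost=1200.0),
    Strategy("tp4_dp2",     time_cost=1.4, memory_cost=700.0),
    Strategy("tp2_pp2_dp2", time_cost=1.6, memory_cost=500.0),
]
result = dp_search(num_layers=24, candidates=candidates, memory_budget=16000.0)
if result is not None:
    total_time, plan = result
    print(f"estimated time: {total_time:.1f} ms, plan: {plan}")
```

In this toy formulation, tighter memory budgets force the search toward strategies that trade extra communication time for lower per-GPU memory, which mirrors the memory-throughput trade-off the paper's search navigates.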


