Improving Automatic Parallel Training via Balanced Memory Workload Optimization

07/05/2023
by Yujie Wang, et al.

Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains, serving as the foundation for advanced large-scale deep learning (DL) models. However, efficiently training these models across multiple GPUs remains a complex challenge due to the abundance of parallelism options. Existing DL systems either require manual effort to design distributed training plans or limit parallelism combinations to a constrained search space. In this paper, we present Galvatron-BMW, a novel system framework that integrates multiple prevalent parallelism dimensions and automatically identifies the most efficient hybrid parallelism strategy. To navigate this vast search space effectively, we employ a decision tree approach for decomposition and pruning based on intuitive insights. We further utilize a dynamic programming search algorithm to derive the optimal plan. Moreover, to improve resource utilization and enhance system efficiency, we propose a bi-objective optimization workflow that focuses on workload balance. Our evaluations on different Transformer models demonstrate the capabilities of Galvatron-BMW in automating distributed training under varying GPU memory constraints. Across all tested scenarios, Galvatron-BMW consistently achieves superior system throughput, surpassing previous approaches that rely on limited parallelism strategies.
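To make the idea of a dynamic programming search over hybrid parallelism strategies concrete, the following is a minimal illustrative sketch, not Galvatron-BMW's actual algorithm or API. It assumes a hypothetical per-layer cost model (Strategy, time_cost, mem_cost, memory_budget are all made-up names) and picks one strategy per layer to minimize estimated execution time while respecting a per-GPU memory budget; cross-layer redistribution costs and the decision-tree pruning described in the paper are omitted for brevity.

```python
# Hypothetical sketch of a layer-wise dynamic-programming search for a hybrid
# parallelism plan under a GPU memory budget. All names and cost numbers are
# illustrative assumptions, not Galvatron-BMW's real interface.
from dataclasses import dataclass

@dataclass(frozen=True)
class Strategy:
    name: str          # e.g. "dp8", "tp4_dp2", "pp2_tp2_dp2"
    time_cost: float   # estimated per-layer execution time (ms)
    mem_cost: int      # estimated per-GPU memory footprint (MB)

def search_plan(layers, strategies, memory_budget):
    """Choose one strategy per layer minimizing total time within the budget.

    dp maps memory-used -> (best_total_time, chosen_strategy_names).
    Cross-layer transition (tensor redistribution) costs are ignored here.
    """
    dp = {0: (0.0, [])}
    for layer in layers:
        nxt = {}
        for used, (t, plan) in dp.items():
            for s in strategies[layer]:
                u = used + s.mem_cost
                if u > memory_budget:
                    continue  # prune infeasible states
                cand = (t + s.time_cost, plan + [s.name])
                if u not in nxt or cand[0] < nxt[u][0]:
                    nxt[u] = cand
        dp = nxt
    if not dp:
        return None  # no feasible plan under this memory budget
    return min(dp.values(), key=lambda v: v[0])

# Toy usage: two identical Transformer layers, three candidate strategies each.
cands = [Strategy("dp8", 10.0, 3000),
         Strategy("tp4_dp2", 12.0, 1800),
         Strategy("pp2_tp2_dp2", 14.0, 1200)]
layers = ["layer0", "layer1"]
best = search_plan(layers, {l: cands for l in layers}, memory_budget=4000)
print(best)  # -> (24.0, ['tp4_dp2', 'tp4_dp2'])
```

Pure data parallelism ("dp8") is fastest per layer in this toy setup but exceeds the memory budget when used everywhere, so the search falls back to a more memory-frugal hybrid, which mirrors the kind of trade-off the paper's balanced memory workload optimization targets.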

