Layered gradient accumulation and modular pipeline parallelism: fast and efficient training of large language models

06/04/2021
by   Joel Lamy-Poirier, et al.

The advent of the transformer has sparked rapid growth in the size of language models, far outpacing hardware improvements. (Dense) transformers are expected to reach the trillion-parameter scale in the near future, a scale at which training requires thousands or even tens of thousands of GPUs. We investigate the challenges of training at this scale and beyond on commercially available hardware. In particular, we analyse the shortest possible training time for different configurations of distributed training, leveraging empirical scaling laws for language models to estimate the optimal (critical) batch size. Contrary to popular belief, we find no evidence for a memory wall, and instead argue that the real limitation, other than cost, lies in the training duration. In addition to this analysis, we introduce two new methods, layered gradient accumulation and modular pipeline parallelism, which together cut the shortest training time by half. The methods also reduce data movement, lowering the network requirements to the point where a fast InfiniBand connection is not necessary. This increased network efficiency also improves on the methods introduced with the ZeRO optimizer, reducing the memory usage to a tiny fraction of the available GPU memory.
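For context, the sketch below shows plain gradient accumulation in PyTorch, the baseline that layered gradient accumulation builds on: gradients from several micro-batches are summed before a single optimizer step, emulating a larger effective batch. This is only a minimal illustration of the standard technique; the model, data loader, and accumulation_steps value are placeholders, and the paper's layered variant and its pipeline-parallel counterpart are not reproduced here.

```python
import torch
from torch import nn

# Minimal sketch of *standard* gradient accumulation (not the paper's
# layered variant). All names below are illustrative placeholders.

model = nn.Linear(512, 512)                  # stand-in for a transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()
accumulation_steps = 8                       # micro-batches per optimizer step

# Dummy data loader: (input, target) micro-batches.
loader = [(torch.randn(4, 512), torch.randn(4, 512)) for _ in range(32)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = loss_fn(model(inputs), targets)
    (loss / accumulation_steps).backward()   # accumulate scaled gradients
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                     # one update per accumulated batch
        optimizer.zero_grad()
```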
