Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training

08/13/2021
by Conglong Li, et al.

Recent works have demonstrated great success in training high-capacity autoregressive language models (GPT, GPT-2, GPT-3) on huge amounts of unlabeled text for text generation. Despite these strong results, such training raises two efficiency challenges. First, training on large corpora can be extremely time-consuming, and how to present training samples to the model to improve token-wise convergence speed remains a challenging and open question. Second, many of these large models have to be trained with hundreds or even thousands of processors using data parallelism with a very large batch size. Despite its better compute efficiency, large-batch training has been observed to run into training instability issues or to converge to solutions with poor generalization performance. To overcome these two challenges, we present a study of a curriculum learning based approach, which helps improve the pre-training convergence speed of autoregressive models. More importantly, we find that curriculum learning, as a regularization method, exerts a gradient variance reduction effect and enables training autoregressive models with much larger batch sizes and learning rates without training instability, further improving the training speed. Our evaluations demonstrate that curriculum learning enables training GPT-2 models (with up to 1.5B parameters) with 8x larger batch size and 4x larger learning rate, whereas the baseline approach struggles with training divergence. To achieve the same validation perplexity targets during pre-training, curriculum learning reduces the required number of tokens and wall clock time by up to 59%. To achieve the same or better zero-shot WikiText-103/LAMBADA evaluation results at the end of pre-training, curriculum learning reduces the required number of tokens and wall clock time by up to 13%.
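To make the idea concrete, below is a minimal sketch of a sequence-length-based curriculum schedule of the kind this line of work builds on: training begins on truncated short sequences and the allowed sequence length grows to the full context window over the curriculum duration. The function and parameter names (curriculum_seqlen, start_seqlen, full_seqlen, total_curriculum_steps) and the default values are illustrative assumptions, not the paper's or DeepSpeed's exact API.

```python
# Sketch of a linear sequence-length curriculum for autoregressive pre-training.
# Assumption: difficulty is measured by sequence length, and the schedule grows
# linearly from a short starting length to the full context window.

def curriculum_seqlen(step, start_seqlen=64, full_seqlen=1024,
                      total_curriculum_steps=10000, step_size=8):
    """Return the maximum sequence length allowed at a given training step."""
    if step >= total_curriculum_steps:
        return full_seqlen
    # Linear pacing from start_seqlen to full_seqlen.
    seqlen = start_seqlen + (full_seqlen - start_seqlen) * step / total_curriculum_steps
    # Round down to a multiple of step_size for hardware-friendly shapes.
    seqlen = int(seqlen) // step_size * step_size
    return max(start_seqlen, min(seqlen, full_seqlen))


def apply_curriculum(batch_token_ids, step):
    """Truncate each token-id sequence in a batch to the current curriculum length."""
    seqlen = curriculum_seqlen(step)
    return [ids[:seqlen] for ids in batch_token_ids]
```

Linear pacing is only one possible choice; a discrete (staircase) or root-shaped schedule fits the same interface by changing how curriculum_seqlen interpolates between the start and full lengths.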


Related research:

- Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation (06/06/2022)
- Does the Order of Training Samples Matter? Improving Neural Data-to-Text Generation with Curriculum Learning (02/06/2021)
- Continual Pre-Training of Large Language Models: How to (re)warm your model? (08/08/2023)
- One Epoch Is All You Need (06/16/2019)
- DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing (12/07/2022)
- 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed (04/13/2021)
- How to Scale Your EMA (07/25/2023)
