Cramming: Training a Language Model on a Single GPU in One Day

12/28/2022
by Jonas Geiping, et al.

Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the community are asking how to push the limits of extreme computation, we ask the opposite question: How far can we get with a single GPU in just one day? We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.
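To make the "cramming" setting concrete, the sketch below shows the core idea in miniature: masked-language-model pretraining of a transformer encoder that stops when a fixed wall-clock budget runs out. This is a minimal illustrative sketch, not the paper's actual pipeline; the tiny model, the BERT-style vocabulary constants, all hyperparameters, and the random toy batches standing in for a real corpus are assumptions made for the example.

```python
# Minimal sketch: MLM pretraining under a fixed wall-clock budget.
# All sizes and the random toy data are illustrative assumptions.
import time
import torch
import torch.nn as nn

VOCAB, MASK_ID = 30522, 103          # assumption: BERT-style vocab/mask ids
SEQ_LEN, BATCH = 128, 64
BUDGET_S = 24 * 3600                 # the one-day compute budget, in seconds

class TinyMLM(nn.Module):
    def __init__(self, d=256, layers=4, heads=4):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, d)
        self.pos = nn.Embedding(SEQ_LEN, d)
        layer = nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(d, VOCAB)

    def forward(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        return self.head(self.encoder(self.tok(ids) + self.pos(pos)))

def mask_tokens(ids, p=0.15):
    # Standard BERT-style objective: predict a random 15% of tokens.
    labels = ids.clone()
    mask = torch.rand_like(ids, dtype=torch.float) < p
    labels[~mask] = -100             # positions ignored by the loss
    masked = ids.clone()
    masked[mask] = MASK_ID
    return masked, labels

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyMLM().to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

start = time.monotonic()
while time.monotonic() - start < BUDGET_S:  # stop when the day is up
    ids = torch.randint(1, VOCAB, (BATCH, SEQ_LEN), device=device)  # toy data
    inputs, labels = mask_tokens(ids)
    loss = loss_fn(model(inputs).view(-1, VOCAB), labels.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The wall-clock stopping condition, rather than a fixed step count, is what makes the budget comparable across architectures and pipeline modifications: any change that speeds up a step buys more optimization steps within the same day.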


Related research

05/26/2023 · Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale
In recent years, language models have drastically grown in size, and the...

05/30/2023 · Likelihood-Based Diffusion Language Models
Despite a growing interest in diffusion-based language models, existing ...

09/24/2021 · Is the Number of Trainable Parameters All That Actually Matters?
Recent work has identified simple empirical scaling laws for language mo...

08/04/2021 · Curriculum learning for language modeling
Language Models like ELMo and BERT have provided robust representations ...

04/28/2022 · On the Effect of Pretraining Corpora on In-context Learning by a Large-scale Language Model
Many recent studies on large-scale language models have reported success...

05/24/2023 · Emergent inabilities? Inverse scaling over the course of pretraining
Does inverse scaling only occur as a function of model parameter size, o...

09/19/2019 · A Random Gossip BMUF Process for Neural Language Modeling
LSTM language model is an essential component of industrial ASR systems....
