Scaling Data-Constrained Language Models

05/25/2023
by Niklas Muennighoff et al.

The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches to mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are publicly available at https://github.com/huggingface/datablations.
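
The claim that repeated tokens lose value can be made concrete with an "effective data" mapping of the kind the paper's scaling law proposes: the first epoch counts in full, and each further repetition contributes exponentially less, so effective data saturates as repetitions grow. The sketch below illustrates that functional form only; the decay constant R_STAR_D and the helper effective_tokens are illustrative placeholders, not fitted values or code from the paper.

```python
import math

# Illustrative decay constant: how many repetitions' worth of value
# repeated data can add at most. Placeholder, not the paper's fitted value.
R_STAR_D = 15.0

def effective_tokens(unique_tokens: float, total_tokens: float,
                     r_star: float = R_STAR_D) -> float:
    """Map raw training tokens to 'effective' unique-equivalent tokens.

    The first epoch counts in full; each further repetition adds
    exponentially less, so effective data saturates at
    unique_tokens * (1 + r_star).
    """
    repetitions = total_tokens / unique_tokens - 1  # epochs beyond the first
    return unique_tokens + unique_tokens * r_star * (1 - math.exp(-repetitions / r_star))

# Example: a 100B-token corpus repeated for various epoch counts.
unique = 100e9
for epochs in (1, 4, 16, 64):
    d_eff = effective_tokens(unique, unique * epochs)
    print(f"{epochs:>2} epochs: {unique * epochs / 1e9:6.0f}B raw tokens "
          f"-> {d_eff / 1e9:5.0f}B effective")
```

With this placeholder constant, four epochs of the 100B-token corpus retain most of the value of fully unique data (roughly 372B effective tokens out of 400B raw), while sixteen or more epochs show the sharply diminishing returns the abstract describes.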

Related research

Training Compute-Optimal Large Language Models (03/29/2022)
We investigate the optimal model size and number of tokens for training ...

Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster (04/06/2023)
We study recent research advances that improve large language models thr...

Current Limitations of Language Models: What You Need is Retrieval (09/15/2020)
We classify and re-examine some of the current approaches to improve the...

Staged Training for Transformer Language Models (03/11/2022)
The current standard approach to scaling transformer language models tra...

Language Modeling at Scale (10/23/2018)
We show how Zipf's Law can be used to scale up language modeling (LM) to...

Broken Neural Scaling Laws (10/26/2022)
We present a smoothly broken power law functional form that accurately m...

Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning (10/26/2022)
We analyze the growth of dataset sizes used in machine learning for natu...
