Scaling Laws for Neural Language Models

by Jared Kaplan, et al.
Johns Hopkins University

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
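As a rough illustration of the kind of power-law relationship the abstract describes, the sketch below uses a toy loss curve of the form L(N) = (N_c / N)^α and recovers the exponent by linear regression in log-log space. The constants `ALPHA_N` and `N_C` are illustrative stand-ins (chosen to be roughly the magnitude of the model-size trend the paper reports), not measured values, and the data is synthetic.

```python
import numpy as np

# Toy power law for loss vs. parameter count; constants are illustrative only.
ALPHA_N = 0.076   # assumed exponent for demonstration
N_C = 8.8e13      # assumed scale constant for demonstration

def loss(n_params):
    """Loss predicted by the toy power law for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA_N

# A power law is a straight line in log-log space:
#   log L = alpha * log N_c - alpha * log N,
# so the slope of log L against log N is -alpha, and a linear fit recovers it.
n = np.logspace(6, 11, 20)   # synthetic model sizes, 1e6 to 1e11 parameters
l = loss(n)
slope, intercept = np.polyfit(np.log(n), np.log(l), 1)
alpha_fit = -slope
print(f"recovered exponent: {alpha_fit:.4f}")
```

Because the synthetic data follows the power law exactly, the fit recovers the exponent to numerical precision; with real training runs the same log-log regression gives the empirical exponent, and deviations from the straight line indicate where the power-law trend breaks down.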



Related papers:
Scaling Laws for Acoustic Models

There is a recent trend in machine learning to increase model quality by...

Is the Number of Trainable Parameters All That Actually Matters?

Recent work has identified simple empirical scaling laws for language mo...

A Neural Scaling Law from the Dimension of the Data Manifold

When data is plentiful, the loss achieved by well-trained neural network...

On the Predictability of Pruning Across Scales

We show that the error of magnitude-pruned networks follows a scaling la...

Scaling Laws for Autoregressive Generative Modeling

We identify empirical scaling laws for the cross-entropy loss in four do...

Scaling Laws for Transfer

We study empirical scaling laws for transfer learning between distributi...

Machine Learning Model Sizes and the Parameter Gap

We study trends in model size of notable machine learning systems over t...
