
Scaling Laws for Neural Language Models
We study empirical scaling laws for language model performance on the cr...

Learning Curve Theory
Recently a number of empirical "universal" scaling law papers have been ...

An Optimization-Based Generative Model of Power Laws Using a New Information Theory Based Metric
In this paper, we propose an optimization-based mechanism to explain pow...

Can machine learning identify interesting mathematics? An exploration using empirically observed laws
We explore the possibility of using machine learning to identify interes...

Scaling Scaling Laws with Board Games
The largest experiments in machine learning now require resources far be...

Measuring Mathematical Problem Solving With the MATH Dataset
Many intellectual endeavors require mathematical problem solving, but th...

Spatiotemporal Network Evolution of Anthropogenic Night Light 1992-2015
Satellite imaging of night light provides a global record of lighted dev...
Scaling Laws for Autoregressive Generative Modeling
We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image↔text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depends on the compute budget through a power-law, with exponents that are nearly universal across all data domains. The cross-entropy loss has an information-theoretic interpretation as S(True) + D_KL(True || Model), and the empirical scaling laws suggest a prediction for both the true data distribution's entropy and the KL divergence between the true and model distributions. With this interpretation, billion-parameter Transformers are nearly perfect models of the YFCC100M image distribution downsampled to an 8×8 resolution, and we can forecast the model size needed to achieve any given reducible loss (i.e. D_KL) in nats/image for other resolutions. We find a number of additional scaling laws in specific domains: (a) we identify a scaling relation for the mutual information between captions and images in multimodal models, and show how to answer the question "Is a picture worth a thousand words?"; (b) in the case of mathematical problem solving, we identify scaling laws for model performance when extrapolating beyond the training distribution; (c) we fine-tune generative image models for ImageNet classification and find smooth scaling of the classification loss and error rate, even as the generative loss levels off. Taken together, these results strengthen the case that scaling laws have important implications for neural network performance, including on downstream tasks.
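The power-law plus constant form described in the abstract, L(x) = L_inf + a·x^(-alpha), separates the loss into an irreducible term (an estimate of the true entropy S(True)) and a reducible power-law term (standing in for D_KL(True || Model)). A minimal sketch of fitting that form to measured losses, using hypothetical numbers and a simple grid-search routine of our own, not the paper's actual fitting procedure:

```python
import numpy as np

def scaling_law(x, l_inf, a, alpha):
    """Power-law plus constant form: L(x) = L_inf + a * x**(-alpha)."""
    return l_inf + a * x ** (-alpha)

def fit_scaling_law(x, y, n_grid=2000):
    """Grid-search the irreducible loss L_inf. For each candidate, the
    reducible loss y - L_inf is a pure power law, hence linear in
    log-log space, so the remaining two parameters come from a
    least-squares line fit; keep the candidate with the lowest SSE."""
    best = (np.inf, None)
    for l_inf in np.linspace(0.0, y.min() * 0.999, n_grid):
        resid = y - l_inf                      # candidate reducible loss
        slope, intercept = np.polyfit(np.log(x), np.log(resid), 1)
        a, alpha = np.exp(intercept), -slope
        sse = np.sum((scaling_law(x, l_inf, a, alpha) - y) ** 2)
        if sse < best[0]:
            best = (sse, (l_inf, a, alpha))
    return best[1]

# Hypothetical, noiseless loss measurements (for illustration only).
sizes = np.logspace(6, 10, 9)          # model sizes from 1e6 to 1e10 params
true = (2.0, 120.0, 0.3)               # assumed L_inf, amplitude, exponent
losses = scaling_law(sizes, *true)

l_inf, a, alpha = fit_scaling_law(sizes, losses)
print(f"irreducible loss ~ {l_inf:.3f} nats, exponent alpha ~ {alpha:.3f}")
```

In this decomposition the fitted L_inf plays the role of the entropy estimate and the fitted power-law term the KL-divergence estimate, which is how the abstract's forecast of "model size needed to achieve any given reducible loss" can be read off the fitted curve.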