Training Trajectories of Language Models Across Scales

12/19/2022
by Mengzhou Xia, et al.

Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al., 2022), ranging from 125M to 175B parameters, on next-token prediction, sequence-level generation, and downstream tasks. We find that 1) at a given perplexity, and independent of model size, a similar subset of training tokens sees the most significant reduction in loss, with the rest stagnating or showing double-descent behavior; 2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal distribution and larger ones eventually learning to assign these sequences lower probabilities; 3) perplexity is a strong predictor of in-context learning performance on 74 multiple-choice tasks from BIG-Bench, and this holds independent of model size. Together, these results show that perplexity is more predictive of model behaviors than model size or training computation.
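The analyses above all reduce to scoring text with model checkpoints: a loss for each training token, and their exponentiated mean, the perplexity. As a minimal sketch of that computation, the snippet below uses the Hugging Face transformers and torch packages with the publicly released facebook/opt-125m model standing in for the paper's intermediate checkpoints (which are not assumed to be available under public identifiers); the example text is a placeholder.

    # Minimal sketch: per-token loss and perplexity for a causal LM checkpoint.
    # Assumes `transformers` and `torch` are installed; the model name is the
    # public OPT-125M release, a stand-in for an intermediate checkpoint.
    import torch
    import torch.nn.functional as F
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "facebook/opt-125m"  # placeholder checkpoint identifier
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    text = "Scaling up language models has led to unprecedented gains."
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits  # (1, seq_len, vocab_size)

    # Shift so position t predicts token t+1, then take per-token cross-entropy.
    shift_logits = logits[:, :-1, :]
    shift_labels = inputs["input_ids"][:, 1:]
    per_token_loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    )  # one loss per predicted token

    perplexity = torch.exp(per_token_loss.mean())
    print(f"perplexity: {perplexity.item():.2f}")

Repeating this over a sequence of checkpoints would give the per-token loss trajectories the paper analyzes, with the mean loss (or its exponential, perplexity) serving as the axis against which downstream behavior is compared.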


