Characterizing Learning Curves During Language Model Pre-Training: Learning, Forgetting, and Stability

08/29/2023
by   Tyler A. Chang, et al.

How do language models learn to make predictions during pre-training? To study this question, we extract learning curves from five autoregressive English language model pre-training runs, for 1M tokens in context. We observe that the language models generate short repetitive phrases before learning to generate longer and more coherent text. We quantify the final surprisal, within-run variability, age of acquisition, forgettability, and cross-run variability of learning curves for individual tokens in context. More frequent tokens reach lower final surprisals, exhibit less variability within and across pre-training runs, are learned earlier, and are less likely to be "forgotten" during pre-training. Higher n-gram probabilities further accentuate these effects. Independent of the target token, shorter and more frequent contexts correlate with marginally more stable and quickly acquired predictions. Effects of part-of-speech are also small, although nouns tend to be acquired later and less stably than verbs, adverbs, and adjectives. Our work contributes to a better understanding of language model pre-training dynamics and informs the deployment of stable language models in practice.
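Below is a minimal sketch of how per-token learning-curve statistics like those named in the abstract could be computed from surprisals logged at successive pre-training checkpoints. It is not the authors' code: the array layout, the acquisition threshold, and the simplified definitions of age of acquisition and forgettability are illustrative assumptions, not the paper's exact metrics.

```python
# Sketch (not the paper's implementation) of summarizing one pre-training
# run's learning curves. Assumes `surprisal` is a
# (num_checkpoints, num_tokens) array of surprisals (-log2 p) for each
# token-in-context at each saved checkpoint.

import numpy as np

def curve_statistics(surprisal: np.ndarray, chance: float, threshold_frac: float = 0.5):
    """Summarize learning curves from a single pre-training run.

    surprisal: array of shape (num_checkpoints, num_tokens).
    chance: surprisal of a uniform baseline, e.g. log2(vocab_size).
    threshold_frac: a token counts as "acquired" once its surprisal first
        drops below chance * threshold_frac (an illustrative criterion).
    """
    final = surprisal[-1]                   # final surprisal per token
    within_run_var = surprisal.std(axis=0)  # variability across checkpoints

    # Age of acquisition: index of the first checkpoint below the threshold
    # (num_checkpoints if the token is never acquired).
    acquired = surprisal < chance * threshold_frac
    aoa = np.where(acquired.any(axis=0),
                   acquired.argmax(axis=0),
                   surprisal.shape[0])

    # Forgettability proxy: largest rise in surprisal after the lowest point
    # reached so far, i.e. how much a learned prediction later degrades.
    best_so_far = np.minimum.accumulate(surprisal, axis=0)
    forgetting = (surprisal - best_so_far).max(axis=0)

    return final, within_run_var, aoa, forgetting


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy curves: 20 checkpoints x 5 tokens, noisy exponential decay toward
    # token-specific floors, standing in for real learning curves.
    steps = np.arange(20)[:, None]
    floors = rng.uniform(1.0, 8.0, size=5)
    curves = floors + (15.0 - floors) * np.exp(-steps / 5.0) + rng.normal(0, 0.3, (20, 5))
    final, var, aoa, forget = curve_statistics(curves, chance=np.log2(50000))
    print("final surprisal:", np.round(final, 2))
    print("within-run std :", np.round(var, 2))
    print("age of acq.    :", aoa)
    print("forgetting     :", np.round(forget, 2))
```

Cross-run variability, also studied in the paper, would additionally require stacking these statistics across the five pre-training runs (for example, the standard deviation of a token's final surprisal across runs).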


