A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity

05/22/2023
by Shayne Longpre et al.

Pretraining is the preliminary and fundamental step in developing capable language models (LMs). Despite this, pretraining data design is critically under-documented and often guided by empirically unsupported intuitions. To address this, we pretrain 28 decoder-only models with 1.5B parameters, training on data curated (1) at different times, (2) with varying toxicity and quality filters, and (3) with different domain compositions. First, we quantify the effect of pretraining data age: a temporal shift between evaluation data and pretraining data leads to performance degradation that is not overcome by finetuning. Second, we explore the effect of quality and toxicity filters, showing a trade-off between performance on standard benchmarks and the risk of toxic generations. Our findings indicate that there is no one-size-fits-all solution to filtering training data, and that the effects of different types of filtering are not predictable from text domain characteristics. Lastly, we empirically validate that the inclusion of heterogeneous data sources, like books and web text, is broadly beneficial and warrants greater prioritization. These findings constitute the largest set of experiments to validate, quantify, and expose many undocumented intuitions about text pretraining, which we hope will support more informed data-centric decisions in LM development.

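To make the filtering setup concrete, the sketch below shows threshold-based quality and toxicity filtering over a small document set. It is a minimal illustration only: the heuristic scorers, the placeholder blocklist, and the thresholds are hypothetical stand-ins, not the classifier-based filters evaluated in the paper.

# Minimal sketch of threshold-based document filtering for a pretraining corpus.
# The quality and toxicity scorers below are toy heuristics and the thresholds
# are hypothetical; the paper studies classifier-based filters, not these.

def quality_score(text: str) -> float:
    """Toy quality proxy: fraction of alphabetic or whitespace characters."""
    if not text:
        return 0.0
    return sum(c.isalpha() or c.isspace() for c in text) / len(text)


TOXIC_TERMS = {"badword1", "badword2"}  # placeholder blocklist, not a real lexicon


def toxicity_score(text: str) -> float:
    """Toy toxicity proxy: fraction of whitespace tokens found on the blocklist."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(tok in TOXIC_TERMS for tok in tokens) / len(tokens)


def filter_corpus(docs, min_quality=0.8, max_toxicity=0.01):
    """Keep documents that clear the quality floor and stay under the toxicity ceiling."""
    return [
        d for d in docs
        if quality_score(d) >= min_quality and toxicity_score(d) <= max_toxicity
    ]


if __name__ == "__main__":
    corpus = [
        "A clean, well-formed paragraph of English prose.",
        "@@@### 1234 &&&&",            # mostly non-alphabetic: fails the quality floor
        "badword1 badword1 filler",    # blocklisted tokens: fail the toxicity ceiling
    ]
    print(filter_corpus(corpus))       # keeps only the first document

Raising min_quality or lowering max_toxicity removes more data; the paper's trade-off result is that such stricter filtering can reduce toxic generations while also hurting performance on some benchmarks.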

Related research

Don't Stop Pretraining: Adapt Language Models to Domains and Tasks (04/23/2020)
When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale (09/08/2023)
What do Large Language Models Learn beyond Language? (10/21/2022)
An Empirical Exploration in Quality Filtering of Text Data (09/02/2021)
Pretraining Language Models with Human Preferences (02/16/2023)
Call for Papers – The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus (01/27/2023)
Adversarial Nibbler: A Data-Centric Challenge for Improving the Safety of Text-to-Image Models (05/22/2023)
