The Pile: An 800GB Dataset of Diverse Text for Language Modeling

12/31/2020
by Leo Gao, et al.

Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets – both existing and newly constructed – many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.
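
The released corpus is distributed as zstandard-compressed JSON Lines shards, where each record carries a "text" field plus a "meta" field identifying which subset it came from. As a minimal sketch of how one might inspect that mix locally (assuming the third-party zstandard Python package, a downloaded shard with a placeholder name, and the "pile_set_name" metadata key; these are assumptions about the release format, not details stated in the abstract above), the following tallies documents per subset:

    import collections
    import io
    import json

    import zstandard  # third-party dependency, assumed installed: pip install zstandard

    def count_subsets(path, limit=100_000):
        """Stream one Pile shard (*.jsonl.zst) and tally documents per subset.

        Assumes each line is a JSON object with a "text" field and a "meta"
        field whose "pile_set_name" entry names the source subset.
        """
        counts = collections.Counter()
        with open(path, "rb") as fh:
            reader = zstandard.ZstdDecompressor().stream_reader(fh)
            for i, line in enumerate(io.TextIOWrapper(reader, encoding="utf-8")):
                if i >= limit:  # sample only a prefix of the shard
                    break
                record = json.loads(line)
                counts[record.get("meta", {}).get("pile_set_name", "unknown")] += 1
        return counts

    if __name__ == "__main__":
        # "00.jsonl.zst" is a hypothetical local filename for a downloaded shard.
        for name, n in count_subsets("00.jsonl.zst").most_common():
            print(f"{name}\t{n}")

Reading the shard as a stream rather than decompressing it to disk keeps memory use flat, which matters for a corpus of this size.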

research · 01/23/2022
A Large and Diverse Arabic Corpus for Language Modeling
Language models (LMs) have introduced a major paradigm shift in Natural ...

research · 10/22/2020
Language Models are Open Knowledge Graphs
This paper shows how to construct knowledge graphs (KGs) from pre-traine...

research · 02/06/2023
Data Selection for Language Models via Importance Resampling
Selecting a suitable training dataset is crucial for both general-domain...

research · 08/04/2021
Mitigating harm in language models with conditional-likelihood filtration
Language models trained on large-scale unfiltered datasets curated from ...

research · 03/30/2023
The Nordic Pile: A 1.2TB Nordic Dataset for Language Modeling
Pre-training Large Language Models (LLMs) require massive amounts of tex...

research · 01/25/2022
Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
Language models increasingly rely on massive web dumps for diverse text ...

research · 07/31/2023
DoDo Learning: DOmain-DemOgraphic Transfer in Language Models for Detecting Abuse Targeted at Public Figures
Public figures receive a disproportionate amount of abuse on social medi...