This datasheet describes the Pile, a 825 GiB dataset of human-authored text
compiled by EleutherAI for use in large-scale language modeling. The Pile is
comprised of 22 different text sources, ranging from original scrapes done for
this project, to text data made available by the data owners, to third-party
scrapes available online.