Datasheet for the Pile

by Stella Biderman, et al.

This datasheet describes the Pile, an 825 GiB dataset of human-authored text compiled by EleutherAI for use in large-scale language modeling. The Pile comprises 22 different text sources, ranging from original scrapes done for this project, to text data made available by the data owners, to third-party scrapes available online.




The Nordic Pile: A 1.2TB Nordic Dataset for Language Modeling

Pre-training Large Language Models (LLMs) require massive amounts of tex...

LSHTC: A Benchmark for Large-Scale Text Classification

LSHTC is a series of challenges which aims to assess the performance of ...

A Description of a Subtask Dataset with Glances

This paper describes a set of data made available that contains detailed...

Towards Large-Scale Exploratory Search over Heterogeneous Sources

Since time immemorial, people have been looking for ways to organize sci...

Tecnologica cosa: Modeling Storyteller Personalities in Boccaccio's Decameron

We explore Boccaccio's Decameron to see how digital humanities tools can...

An ELEGANT dataset with Denial of Service and Man in The Middle attacks

This document describes a dataset with diverse types of Denial of Servic...

Predicting and Understanding Law-Making with Word Vectors and an Ensemble Model

Out of nearly 70,000 bills introduced in the U.S. Congress from 2001 to ...
