Datasheet for the Pile

01/13/2022
by Stella Biderman, et al.

This datasheet describes the Pile, an 825 GiB dataset of human-authored text compiled by EleutherAI for use in large-scale language modeling. The Pile comprises 22 different text sources, ranging from original scrapes done for this project, to text data made available by the data owners, to third-party scrapes available online.


