Documenting the English Colossal Clean Crawled Corpus

by Jesse Dodge, et al.

As language models are trained on ever more text, researchers are turning to some of the largest corpora available. Unlike most other types of datasets in NLP, large unlabeled text corpora are often presented with minimal documentation, and best practices for documenting them have not been established. In this work we provide the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl. We begin with a high-level summary of the data, including distributions of where the text came from and when it was written. We then give a more detailed analysis of salient parts of this data, including the most frequent sources of text (e.g., one source that contains a significant percentage of machine-translated and/or OCR'd text), the effect that the filters had on the data (they disproportionately remove text in African American English, AAE), and evidence that examples from other benchmark NLP datasets are contained in the text. We release a web interface to an interactive, indexed copy of this dataset, encouraging the community to continuously explore and report additional findings.


Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

Large text corpora are increasingly important for a wide variety of Natu...

Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models

We introduce CoWeSe (the Corpus Web Salud Español), the largest Spanish ...

RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

Pretrained neural language models (LMs) are prone to generating racist, ...

Language Model for Text Analytic in Cybersecurity

NLP is a form of artificial intelligence and machine learning concerned ...

Spanish Language Models

This paper presents the Spanish RoBERTa-base and RoBERTa-large models, a...

TxPI-u: A Resource for Personality Identification of Undergraduates

Resources such as labeled corpora are necessary to train automatic model...

The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications

Innovation is a major driver of economic and social development, and inf...