The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

03/07/2023 ∙ by Hugo Laurençon, et al.

As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.

