CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages

09/17/2023
by   Thuat Nguyen, et al.

The driving factors behind the development of large language models (LLMs) with impressive learning capabilities are their colossal model sizes and extensive training datasets. Along with the progress in natural language processing, LLMs have frequently been made accessible to the public to foster deeper investigation and applications. However, the training datasets for these LLMs, especially the recent state-of-the-art models, are often not fully disclosed. Creating training data for high-performing LLMs involves extensive cleaning and deduplication to ensure the necessary level of quality. This lack of transparency around training data has hampered research on attributing and addressing hallucination and bias issues in LLMs, hindering replication efforts and further advancements in the community. These challenges become even more pronounced in multilingual learning scenarios, where the available multilingual text datasets are often inadequately collected and cleaned. Consequently, there is a lack of open-source, readily usable datasets for effectively training LLMs in multiple languages. To overcome this issue, we present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for LLM development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous multi-stage pipeline to achieve the best quality for model training, including language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication. CulturaX is fully released to the public on HuggingFace to facilitate research and advancements in multilingual LLMs: https://huggingface.co/datasets/uonlp/CulturaX.
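The multi-stage pipeline named in the abstract (URL-based filtering, metric-based cleaning, document refinement, and deduplication) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the blocklist, thresholds, and function names are assumptions, the language-identification stage is omitted because it depends on an external classifier, and the exact-hash deduplication here stands in for the near-deduplication used on large-scale corpora.

```python
import hashlib
import re

# Illustrative parameters -- NOT the actual values used by CulturaX.
URL_BLOCKLIST = {"spam.example.com", "ads.example.net"}
MIN_WORDS, MAX_WORDS = 50, 100_000
MAX_SYMBOL_RATIO = 0.1

def url_filter(doc):
    """URL-based filtering: drop documents from blocklisted domains."""
    return doc.get("domain") not in URL_BLOCKLIST

def metric_filter(doc):
    """Metric-based cleaning: simple length and symbol-ratio heuristics."""
    words = doc["text"].split()
    if not (MIN_WORDS <= len(words) <= MAX_WORDS):
        return False
    symbols = sum(1 for ch in doc["text"] if ch in "#{}<>|")
    return symbols / max(len(doc["text"]), 1) <= MAX_SYMBOL_RATIO

def refine(doc):
    """Document refinement: strip very short, likely-noisy lines."""
    lines = [ln.strip() for ln in doc["text"].splitlines()]
    doc["text"] = "\n".join(ln for ln in lines if len(ln.split()) > 2)
    return doc

def deduplicate(docs):
    """Exact deduplication by hash of whitespace-normalized text
    (a stand-in for the near-deduplication used in practice)."""
    seen, out = set(), []
    for doc in docs:
        norm = re.sub(r"\s+", " ", doc["text"]).lower().encode()
        key = hashlib.md5(norm).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append(doc)
    return out

def pipeline(docs):
    """Run the filtering, refinement, and deduplication stages in order."""
    docs = [d for d in docs if url_filter(d) and metric_filter(d)]
    docs = [refine(d) for d in docs]
    return deduplicate(docs)
```

In a real corpus-cleaning setting each stage would be driven by per-language statistics (the paper tunes filter thresholds per language), but the staged structure is the same: cheap URL- and metric-level filters first, then document-level refinement, then corpus-level deduplication.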


