The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

06/01/2023
by Guilherme Penedo, et al.

Large language models are commonly trained on a mixture of filtered web data and curated high-quality corpora, such as social media conversations, books, or technical papers. This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable curation is and whether we will run out of unique high-quality data soon. At variance with previous beliefs, we show that properly filtered and deduplicated web data alone can lead to powerful models, even significantly outperforming state-of-the-art models trained on The Pile. Despite extensive filtering, the high-quality data we extract from the web is still plentiful, and we are able to obtain five trillion tokens from CommonCrawl. We publicly release a 600-billion-token extract of our RefinedWeb dataset, and 1.3B- and 7.5B-parameter language models trained on it.
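
To make the "filtered and deduplicated web data" idea concrete, the toy Python sketch below shows the general shape of such a pass: a crude heuristic quality filter followed by hash-based exact deduplication. Every function name, heuristic, and threshold here is an illustrative assumption, not the RefinedWeb pipeline itself, which operates on CommonCrawl at vastly larger scale with much more sophisticated extraction, filtering, and fuzzy deduplication.

```python
# Purely illustrative sketch: a toy "filter, then deduplicate" pass over web
# documents. Heuristics and thresholds are placeholder assumptions, not the
# authors' actual pipeline.
import hashlib
import re


def looks_like_prose(text: str, min_words: int = 50, max_symbol_ratio: float = 0.1) -> bool:
    """Crude quality filter: keep documents that resemble natural-language prose."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return symbols / max(len(text), 1) <= max_symbol_ratio


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def filter_and_dedup(documents):
    """Yield documents that pass the quality filter and are not exact duplicates."""
    seen = set()
    for doc in documents:
        if not looks_like_prose(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield doc


if __name__ == "__main__":
    page = "The quick brown fox jumps over the lazy dog. " * 20
    corpus = [page, page, "too short"]  # one duplicate, one page below the word threshold
    kept = list(filter_and_dedup(corpus))
    print(f"kept {len(kept)} of {len(corpus)} documents")  # kept 1 of 3
```

At trillion-token scale, pipelines of this kind typically replace exact hashing with fuzzy deduplication (e.g. MinHash) and run the filtering stages in a distributed fashion.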


Related research

- 10/24/2022 · FCM: Forgetful Causal Masking Makes Causal Language Models Better Zero-Shot Learners
- 01/25/2022 · Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
- 07/13/2023 · mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs
- 02/16/2023 · Do We Still Need Clinical Language Models?
- 09/13/2023 · Pretraining on the Test Set Is All You Need
- 09/01/2023 · Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior
- 09/08/2023 · When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale
