The MiniPile Challenge for Data-Efficient Language Models

04/17/2023
by Jean Kaddour et al.

The ever-growing diversity of pre-training text corpora has equipped language models with generalization capabilities across various downstream tasks. However, such diverse datasets are often too large for academic budgets; hence, most research on Transformer architectures, training procedures, optimizers, etc. is conducted on smaller, homogeneous datasets. To close this gap, we present The MiniPile Challenge, in which one pre-trains a language model on a diverse text corpus containing at most 1M documents. MiniPile is a 6GB subset of the deduplicated 825GB Pile corpus. To curate MiniPile, we perform a simple, three-step data filtering process: we (1) infer embeddings for all documents of the Pile, (2) cluster the embedding space using k-means, and (3) filter out low-quality clusters. To verify MiniPile's suitability for language model pre-training, we use it to pre-train BERT and T5 models, yielding performance drops of only 1.9%/2.5% on the GLUE and SNI benchmarks compared to the original pre-trained checkpoints, which were trained on 2.6x/745x the amount of data. MiniPile is available at https://huggingface.co/datasets/JeanKaddour/minipile.
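
For intuition, the sketch below illustrates the three curation steps in Python. It is a hedged reconstruction, not the paper's code: the embedding model (intfloat/e5-large), the cluster count, and the excluded-cluster IDs are placeholder assumptions, and a small slice of MiniPile itself stands in for the deduplicated Pile, which is far too large for a quick illustration.

```python
# Hypothetical sketch of the three-step filtering pipeline described above:
# (1) embed documents, (2) k-means cluster the embedding space,
# (3) drop clusters judged low-quality.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.cluster import MiniBatchKMeans
import numpy as np

# Placeholder corpus: a small MiniPile slice stands in for the deduplicated Pile.
docs = load_dataset("JeanKaddour/minipile", split="train[:10000]")["text"]

# Step 1: infer one embedding per document (embedding model is an assumption).
encoder = SentenceTransformer("intfloat/e5-large")
embeddings = encoder.encode(docs, batch_size=64, show_progress_bar=True)

# Step 2: cluster the embedding space with k-means.
k = 200  # cluster count is a tunable hyperparameter, not the paper's value
kmeans = MiniBatchKMeans(n_clusters=k, random_state=0)
labels = kmeans.fit_predict(np.asarray(embeddings))

# Step 3: exclude low-quality clusters. In practice these would be chosen by
# inspecting documents near each centroid; the IDs here are placeholders.
excluded = {3, 17, 42}
kept = [doc for doc, cluster in zip(docs, labels) if cluster not in excluded]
print(f"Kept {len(kept)} of {len(docs)} documents after filtering.")
```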
