D4: Improving LLM Pretraining via Document De-Duplication and Diversification

08/23/2023
by Kushal Tirumala et al.

Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible, randomly selected from large-scale web corpora. While training on ever-larger portions of the internet leads to consistent performance improvements, the size of these improvements diminishes with scale, and there has been little work exploring the effect of data selection on pre-training and downstream performance beyond simple de-duplication methods such as MinHash. Here, we show that careful data selection (on top of de-duplicated data) via pre-trained model embeddings can speed up training (20% efficiency gains) and improve average downstream accuracy on 16 NLP tasks (up to 2%). Furthermore, we show that repeating data intelligently consistently outperforms baseline training (while repeating random data performs worse than baseline training). Our results indicate that clever data selection can significantly improve LLM pre-training, call into question the common practice of training for a single epoch on as much data as possible, and demonstrate a path to keep improving our models past the limits of randomly sampling web data.
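
The selection step described above works on top of MinHash de-duplicated data: documents are embedded with a pre-trained model, and the embedding space is used to prune redundant or overly prototypical examples so the remaining corpus is more diverse. The sketch below is only an illustrative approximation of this kind of embedding-based selection, not the authors' released pipeline; the function name diversify and the parameters keep_fraction and n_clusters are placeholders introduced for the example, and the input is assumed to be a pre-computed (n_docs, dim) embedding matrix from any pre-trained encoder.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def diversify(embeddings, keep_fraction=0.8, n_clusters=100, seed=0):
    """Return indices of documents to keep after embedding-space diversification."""
    X = normalize(embeddings)  # unit-normalize so distances follow cosine geometry
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)

    # Cosine distance of each document to its own cluster centroid; a small
    # distance marks a highly prototypical (redundant) document within its cluster.
    centroids = normalize(km.cluster_centers_)
    dist = 1.0 - np.einsum("ij,ij->i", X, centroids[km.labels_])

    # Keep the documents farthest from their centroids, i.e. the more diverse ones.
    n_keep = int(keep_fraction * len(X))
    return np.sort(np.argsort(dist)[-n_keep:])

# Toy usage with random vectors standing in for real document embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(10_000, 256))
kept = diversify(fake_embeddings, keep_fraction=0.8, n_clusters=50)
print(f"kept {len(kept)} of {len(fake_embeddings)} documents")

Dropping the points nearest their cluster centroid is one simple way to trade prototypical, near-redundant documents for more diverse ones; the paper's actual method additionally handles near-duplicate pairs in embedding space, which this sketch does not attempt to reproduce.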

Related research

08/08/2023 · Continual Pre-Training of Large Language Models: How to (re)warm your model?
Large language models (LLMs) are routinely pre-trained on billions of to...

05/21/2023 · Teaching the Pre-trained Model to Generate Simple Texts for Text Simplification
Randomly masking text spans in ordinary texts in the pre-training stage ...

06/05/2020 · DeBERTa: Decoding-enhanced BERT with Disentangled Attention
Recent progress in pre-trained neural language models has significantly ...

03/26/2023 · Koala: An Index for Quantifying Overlaps with Pre-training Corpora
In very recent years more attention has been placed on probing the role ...

11/09/2022 · Mask More and Mask Later: Efficient Pre-training of Masked Language Models by Disentangling the [MASK] Token
The pre-training of masked language models (MLMs) consumes massive compu...

05/11/2023 · INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Large Language Models
A salient characteristic of large pre-trained language models (PTLMs) is...

09/01/2023 · BatchPrompt: Accomplish more with less
As the ever-increasing token limits of large language models (LLMs) have...
