Deduplicating Training Data Makes Language Models Better

07/14/2021
by Katherine Lee et al.

We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets – for example removing from C4 a single 61-word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation. We release code for reproducing our work and performing dataset deduplication at https://github.com/google-research/deduplicate-text-datasets.
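One of the paper's two deduplication tools detects near-duplicate examples using MinHash, which approximates the Jaccard similarity between documents from compact signatures. As a rough illustration of that idea only (the function names, shingle size, and hash construction below are our own choices, not taken from the released code):

```python
import hashlib

def shingles(text, k=5):
    """Split text into overlapping word k-grams (shingles)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(text, num_hashes=64, k=5):
    """For each of num_hashes seeded hash functions, keep the minimum
    hash value over all shingles; the result is the MinHash signature."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text, k)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots is an unbiased estimate
    of the Jaccard similarity of the two shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents that differ by only a few words share most of their shingles, so their signatures agree in most slots; unrelated documents share almost none. The paper pairs this near-duplicate filter with an exact-substring deduplicator built on suffix arrays, which the MinHash sketch above does not cover.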


Related research

06/13/2023 · Large Language Models Sometimes Generate Purely Negatively-Reinforced Text
When using adversarial training, it is common practice to train against ...

03/11/2019 · Partially Shuffling the Training Data to Improve Language Models
Although SGD requires shuffling the training data between epochs, curren...

08/04/2021 · Mitigating harm in language models with conditional-likelihood filtration
Language models trained on large-scale unfiltered datasets curated from ...

03/24/2023 · TRAK: Attributing Model Behavior at Scale
The goal of data attribution is to trace model predictions back to train...

05/21/2022 · Scaling Laws and Interpretability of Learning from Repeated Data
Recent large language models have been trained on vast datasets, but als...

03/06/2023 · Data Portraits: Recording Foundation Model Training Data
Foundation models are trained on increasingly immense and opaque dataset...

04/10/2023 · Do We Train on Test Data? The Impact of Near-Duplicates on License Plate Recognition
This work draws attention to the large fraction of near-duplicates in th...
