Data Contamination: From Memorization to Exploitation

03/15/2022
by Inbal Magar, et al.

Pretrained language models are typically trained on massive web-based datasets, which are often "contaminated" with downstream test sets. It is unclear to what extent models exploit this contaminated data on downstream tasks. We present a principled method to study this question. We pretrain BERT models on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune them on the relevant task. Comparing performance between samples seen and unseen during pretraining allows us to define and quantify levels of memorization and exploitation. Experiments with two models and three downstream tasks show that exploitation exists in some cases, while in others the models memorize the contaminated data but do not exploit it. We show that these two measures are affected by different factors, such as the number of duplications of the contaminated data and the model size. Our results highlight the importance of analyzing massive web-scale datasets to verify that progress in NLP is driven by better language understanding and not better data exploitation.
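As a rough illustration of the seen/unseen comparison described above, the sketch below quantifies exploitation as the downstream accuracy gap between test samples that were seen during pretraining (contaminated) and samples that were not. The function names, the use of accuracy as the metric, and the gap-based definition are illustrative assumptions and may differ from the paper's exact formulation.

```python
# Minimal sketch (assumed formulation): "exploitation" as the downstream
# accuracy gap between contaminated (seen-during-pretraining) test samples
# and clean (unseen) test samples.

from typing import List


def accuracy(preds: List[int], labels: List[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    assert len(preds) == len(labels) and len(labels) > 0
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def exploitation_gap(preds: List[int], labels: List[int], seen_mask: List[bool]) -> float:
    """Accuracy on seen (contaminated) samples minus accuracy on unseen ones."""
    seen_preds = [p for p, s in zip(preds, seen_mask) if s]
    seen_labels = [y for y, s in zip(labels, seen_mask) if s]
    unseen_preds = [p for p, s in zip(preds, seen_mask) if not s]
    unseen_labels = [y for y, s in zip(labels, seen_mask) if not s]
    return accuracy(seen_preds, seen_labels) - accuracy(unseen_preds, unseen_labels)


if __name__ == "__main__":
    # Toy example: 4 seen and 4 unseen test samples.
    preds = [1, 1, 0, 1, 0, 1, 0, 0]
    labels = [1, 1, 0, 1, 1, 0, 0, 1]
    seen_mask = [True, True, True, True, False, False, False, False]
    print(f"exploitation gap: {exploitation_gap(preds, labels, seen_mask):+.2f}")
```

A gap close to zero would indicate memorization without exploitation, while a large positive gap would indicate that the model benefits from having seen the contaminated samples during pretraining.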

Related research

09/28/2022  Downstream Datasets Make Surprisingly Good Pretraining Corpora
04/28/2023  Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4
10/22/2022  PATS: Sensitivity-aware Noisy Learning for Pretrained Language Models
06/17/2021  Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning
02/21/2021  Pre-Training BERT on Arabic Tweets: Practical Considerations
11/25/2021  TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect
06/16/2023  How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese
