Downstream Datasets Make Surprisingly Good Pretraining Corpora

09/28/2022
by Kundan Krishna, et al.

For most natural language processing tasks, the dominant practice is to finetune large pretrained transformer models (e.g., BERT) using smaller downstream datasets. Despite the success of this approach, it remains unclear to what extent these gains are attributable to the massive background corpora employed for pretraining versus to the pretraining objectives themselves. This paper introduces a large-scale study of self-pretraining, where the same (downstream) training data is used for both pretraining and finetuning. In experiments addressing both ELECTRA and RoBERTa models and 10 distinct downstream datasets, we observe that self-pretraining rivals standard pretraining on the BookWiki corpus (despite using around 10×–500× less data), outperforming the latter on 7 and 5 datasets, respectively. Surprisingly, these task-specific pretrained models often perform well on other tasks, including the GLUE benchmark. Our results suggest that in many scenarios, performance gains attributable to pretraining are driven primarily by the pretraining objective itself and are not always attributable to the incorporation of massive datasets. These findings are especially relevant in light of concerns about intellectual property and offensive content in web-scale pretraining data.
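To make the self-pretraining recipe concrete, the sketch below shows one way it could look in a HuggingFace Transformers workflow: masked-language-model pretraining of a RoBERTa architecture from scratch on a downstream dataset's text, followed by finetuning on that same dataset's labels. The dataset choice ("ag_news"), the hyperparameters, and the reuse of the roberta-base tokenizer are illustrative assumptions, not the paper's actual configuration.

# A minimal sketch of self-pretraining, assuming a HuggingFace Transformers /
# Datasets workflow. The dataset ("ag_news"), hyperparameters, and reuse of the
# roberta-base tokenizer are illustrative assumptions, not the paper's setup.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

# The downstream dataset doubles as the pretraining corpus.
raw = load_dataset("ag_news")
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Step 1: masked-language-model pretraining from scratch on the downstream text.
mlm_model = RobertaForMaskedLM(RobertaConfig(vocab_size=tokenizer.vocab_size))
Trainer(
    model=mlm_model,
    args=TrainingArguments("self-pretrain", per_device_train_batch_size=32,
                           num_train_epochs=1),
    train_dataset=tokenized["train"].remove_columns(["label"]),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
).train()
mlm_model.save_pretrained("self-pretrained-roberta")

# Step 2: finetune the self-pretrained encoder on the same dataset's labels.
clf_model = RobertaForSequenceClassification.from_pretrained(
    "self-pretrained-roberta", num_labels=4)
Trainer(
    model=clf_model,
    args=TrainingArguments("self-finetune", per_device_train_batch_size=32,
                           num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # enables dynamic padding for classification batches
).train()

In the paper's experiments, pretraining on such a far smaller, task-specific corpus rivals pretraining on the BookWiki corpus across both ELECTRA and RoBERTa models.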


Related research

05/25/2022 · ORCA: Interpreting Prompted Language Models via Locating Supporting Data Evidence in the Ocean of Pretraining Data
Large pretrained language models have been performing increasingly well ...

09/30/2021 · Compositional generalization in semantic parsing with pretrained transformers
Large-scale pretraining instills large amounts of knowledge in deep neur...

02/16/2023 · Pretraining Language Models with Human Preferences
Language models (LMs) are pretrained to imitate internet text, including...

03/15/2022 · Data Contamination: From Memorization to Exploitation
Pretrained language models are typically trained on massive web-based da...

09/10/2021 · Does Pretraining for Summarization Require Knowledge Transfer?
Pretraining techniques leveraging enormous datasets have driven recent a...

11/30/2020 · Multimodal Pretraining Unmasked: Unifying the Vision and Language BERTs
Large-scale pretraining and task-specific fine-tuning is now the standar...

09/08/2023 · When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale
Large volumes of text data have contributed significantly to the develop...
