Automatic Document Selection for Efficient Encoder Pretraining

10/20/2022
by   Yukun Feng, et al.
0

Building pretrained language models is considered expensive and data-intensive, but must we increase dataset size to achieve better performance? We propose an alternative to larger training sets by automatically identifying smaller yet domain-representative subsets. We extend Cynical Data Selection, a statistical sentence scoring method that conditions on a representative target domain corpus. As an example, we treat the OntoNotes corpus as a target domain and pretrain a RoBERTa-like encoder from a cynically selected subset of the Pile. On both perplexity and across several downstream tasks in the target domain, it consistently outperforms random selection with 20x less data, 3x fewer training iterations, and 2x less estimated cloud compute cost, validating the recipe of automatic document selection for LM pretraining.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
03/19/2022

Domain Representative Keywords Selection: A Probabilistic Approach

We propose a probabilistic approach to select a subset of a target domai...
research
02/06/2023

Data Selection for Language Models via Importance Resampling

Selecting a suitable training dataset is crucial for both general-domain...
research
09/15/2021

On the Complementarity of Data Selection and Fine Tuning for Domain Adaptation

Domain adaptation of neural networks commonly relies on three training p...
research
05/22/2023

Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model

Pretrained language models have achieved remarkable success in various n...
research
04/28/2023

CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data

In recent years, the field of document understanding has progressed a lo...
research
04/16/2021

Temporal Adaptation of BERT and Performance on Downstream Document Classification: Insights from Social Media

Language use differs between domains and even within a domain, language ...
research
02/25/2022

OCR-IDL: OCR Annotations for Industry Document Library Dataset

Pretraining has proven successful in Document Intelligence tasks where d...

Please sign up or login with your details

Forgot password? Click here to reset