Automatic Document Selection for Efficient Encoder Pretraining

10/20/2022
by Yukun Feng, et al.

Building pretrained language models is considered expensive and data-intensive, but must we increase dataset size to achieve better performance? We propose an alternative to larger training sets by automatically identifying smaller yet domain-representative subsets. We extend Cynical Data Selection, a statistical sentence scoring method that conditions on a representative target domain corpus. As an example, we treat the OntoNotes corpus as a target domain and pretrain a RoBERTa-like encoder from a cynically selected subset of the Pile. Both on perplexity and across several downstream tasks in the target domain, it consistently outperforms random selection with 20x less data, 3x fewer training iterations, and 2x less estimated cloud compute cost, validating the recipe of automatic document selection for LM pretraining.
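
The abstract names Cynical Data Selection (a greedy, target-conditioned scoring method) but does not spell out its scoring function. The sketch below is only a simplified illustration of that general idea, selecting candidate documents one at a time to reduce unigram cross-entropy against a representative target corpus; it is not the paper's implementation, and the function names (greedy_select, cross_entropy, unigram_counts) and the add-one-smoothed unigram model are illustrative assumptions.

```python
import math
from collections import Counter


def unigram_counts(texts):
    """Whitespace-tokenized unigram counts over an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts


def cross_entropy(pool_counts, pool_total, target_probs):
    """Cross-entropy of the target distribution under the selected pool's
    add-one-smoothed unigram model (lower = pool models the target better)."""
    vocab = len(target_probs)
    h = 0.0
    for word, p in target_probs.items():
        q = (pool_counts.get(word, 0) + 1) / (pool_total + vocab)
        h -= p * math.log(q)
    return h


def greedy_select(candidates, target_texts, budget):
    """Greedily add the candidate document that most lowers the selected
    pool's cross-entropy against the representative target corpus."""
    target_counts = unigram_counts(target_texts)
    total = sum(target_counts.values())
    target_probs = {w: c / total for w, c in target_counts.items()}

    selected, pool_counts, pool_total = [], Counter(), 0
    remaining = list(candidates)
    while remaining and len(selected) < budget:
        best_idx, best_h = None, float("inf")
        for idx, doc in enumerate(remaining):
            doc_counts = unigram_counts([doc])
            h = cross_entropy(pool_counts + doc_counts,
                              pool_total + sum(doc_counts.values()),
                              target_probs)
            if h < best_h:
                best_idx, best_h = idx, h
        chosen = remaining.pop(best_idx)
        doc_counts = unigram_counts([chosen])
        pool_counts += doc_counts
        pool_total += sum(doc_counts.values())
        selected.append(chosen)
    return selected
```

In the paper's setting, the target corpus would play the role of OntoNotes and the candidate pool the role of the Pile; for instance, greedy_select(pile_documents, ontonotes_sentences, budget=10000) would return the documents whose combined unigram distribution best matches the target under this toy criterion.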

Related research

03/19/2022
Domain Representative Keywords Selection: A Probabilistic Approach
We propose a probabilistic approach to select a subset of a target domai...

02/06/2023
Data Selection for Language Models via Importance Resampling
Selecting a suitable training dataset is crucial for both general-domain...

09/15/2021
On the Complementarity of Data Selection and Fine Tuning for Domain Adaptation
Domain adaptation of neural networks commonly relies on three training p...

04/04/2019
Unsupervised Domain Adaptation of Contextualized Embeddings: A Case Study in Early Modern English
Contextualized word embeddings such as ELMo and BERT provide a foundatio...

01/19/2023
JCSE: Contrastive Learning of Japanese Sentence Embeddings and Its Applications
Contrastive learning is widely used for sentence representation learning...

04/20/2022
DAME: Domain Adaptation for Matching Entities
Entity matching (EM) identifies data records that refer to the same real...

02/25/2022
OCR-IDL: OCR Annotations for Industry Document Library Dataset
Pretraining has proven successful in Document Intelligence tasks where d...