A Compact Pretraining Approach for Neural Language Models

08/25/2022
by Shahriar Golchin, et al.

Domain adaptation for large neural language models (NLMs) typically relies on massive amounts of unstructured data during the pretraining phase. In this study, however, we show that pretrained NLMs learn in-domain information more effectively and more quickly from a compact subset of the data that focuses on the key information in the domain. We construct these compact subsets from the unstructured data using a combination of abstractive summaries and extractive keywords. In particular, we rely on BART to generate abstractive summaries and on KeyBERT to extract keywords from these summaries (or directly from the original unstructured text). We evaluate our approach in six different settings: three datasets combined with two distinct NLMs. Our results reveal that task-specific classifiers trained on top of NLMs pretrained with our method outperform those based on traditional pretraining, i.e., random masking over the entire data, as well as those built without any pretraining. Further, we show that our strategy reduces pretraining time by up to five times compared to vanilla pretraining. The code for all of our experiments is publicly available at https://github.com/shahriargolchin/compact-pretraining.
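The summarize-then-extract pipeline described in the abstract can be sketched in a few lines of Python. The snippet below is a minimal illustration, not the authors' implementation: it assumes the Hugging Face summarization pipeline with a generic BART checkpoint (facebook/bart-large-cnn) and the default KeyBERT model, and the helper name build_compact_subset and its parameters are hypothetical.

    from transformers import pipeline
    from keybert import KeyBERT

    # Illustrative model choices; the paper's exact checkpoints and
    # hyperparameters may differ.
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    kw_model = KeyBERT()

    def build_compact_subset(documents, max_summary_tokens=128, top_n_keywords=10):
        """Build a compact pretraining corpus: a BART summary plus KeyBERT keywords per document."""
        compact = []
        for doc in documents:
            # Abstractive summary of the unstructured in-domain document
            summary = summarizer(
                doc, max_length=max_summary_tokens, truncation=True
            )[0]["summary_text"]
            # Keywords extracted from the summary (the original text could be used instead)
            keywords = [kw for kw, _ in kw_model.extract_keywords(summary, top_n=top_n_keywords)]
            compact.append(summary + " " + " ".join(keywords))
        return compact

    # The resulting compact corpus would then be used for masked-language-model
    # pretraining of the target NLM, followed by training a task-specific classifier.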

Related research

09/15/2021 - Efficient Domain Adaptation of Language Models via Adaptive Tokenization
Contextual embedding-based language models trained on large data sets, s...

11/03/2020 - CMT in TREC-COVID Round 2: Mitigating the Generalization Gaps from Web to Special Domain Search
Neural rankers based on deep pretrained language models (LMs) have been ...

02/28/2023 - Turning a CLIP Model into a Scene Text Detector
The recent large-scale Contrastive Language-Image Pretraining (CLIP) mod...

05/31/2023 - Structure-Aware Language Model Pretraining Improves Dense Retrieval on Structured Data
This paper presents Structure Aware Dense Retrieval (SANTA) model, which...

06/09/2021 - Pretraining Representations for Data-Efficient Reinforcement Learning
Data efficiency is a key challenge for deep reinforcement learning. We a...

05/22/2023 - Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model
Pretrained language models have achieved remarkable success in various n...

01/25/2023 - An Experimental Study on Pretraining Transformers from Scratch for IR
Finetuning Pretrained Language Models (PLM) for IR has been de facto the...
