Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords

07/14/2023
by Shahriar Golchin, et al.

We propose a novel task-agnostic in-domain pre-training method that sits between generic pre-training and fine-tuning. Our approach selectively masks in-domain keywords, i.e., words that provide a compact representation of the target domain. We identify such keywords using KeyBERT (Grootendorst, 2020). We evaluate our approach in six settings: three datasets combined with two distinct pre-trained language models (PLMs). Our results show that PLMs adapted with our in-domain pre-training strategy and then fine-tuned outperform both PLMs adapted with random masking during in-domain pre-training and PLMs that follow the common pre-train-then-fine-tune paradigm. Further, the overhead of identifying in-domain keywords is reasonable, e.g., 7-15% of the pre-training time (for two epochs) for BERT Large (Devlin et al., 2019).
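The following is a minimal sketch of the idea described above, not the authors' released code: it assumes the `keybert` and Hugging Face `transformers` packages, and the checkpoint name ("bert-large-uncased"), the `top_n` value, and the subword-to-keyword matching heuristic are illustrative choices. KeyBERT extracts keywords that compactly represent a text, and those keyword tokens, rather than random tokens, are masked for masked-language-modeling (MLM) pre-training.

```python
import torch
from keybert import KeyBERT
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
kw_model = KeyBERT()  # defaults to a small sentence-transformers backbone

def keyword_mask(text, top_n=10):
    """Return (input_ids, labels) with in-domain keyword tokens masked for MLM."""
    # 1) Extract single-word keywords that compactly represent the text/domain.
    keywords = {kw.lower() for kw, _ in kw_model.extract_keywords(
        text, keyphrase_ngram_range=(1, 1), stop_words="english", top_n=top_n)}

    # 2) Mask every word piece that belongs to one of those keywords
    #    (rough substring heuristic, purely for illustration).
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    input_ids = enc["input_ids"].clone()
    labels = torch.full_like(input_ids, -100)  # -100 is ignored by the MLM loss

    for i, tok in enumerate(tokenizer.convert_ids_to_tokens(input_ids[0])):
        piece = tok[2:] if tok.startswith("##") else tok
        if any(piece in kw for kw in keywords):
            labels[0, i] = input_ids[0, i]         # predict the original token
            input_ids[0, i] = tokenizer.mask_token_id
    return input_ids, labels
```

An in-domain pre-training loop would then feed such (input_ids, labels) pairs to the PLM's masked-language-modeling head before the usual fine-tuning step.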

Related research

09/17/2021 · Task-adaptive Pre-training of Language Models with Word Embedding Regularization
Pre-trained language models (PTLMs) acquire domain-independent linguisti...

04/21/2020 · Train No Evil: Selective Masking for Task-guided Pre-training
Recently, pre-trained language models mostly follow the pre-training-the...

07/12/2021 · MidiBERT-Piano: Large-scale Pre-training for Symbolic Music Understanding
This paper presents an attempt to employ the mask language modeling appr...

09/26/2022 · Towards Simple and Efficient Task-Adaptive Pre-training for Text Classification
Language models are pre-trained using large corpora of generic data like...

03/19/2022 · Domain Representative Keywords Selection: A Probabilistic Approach
We propose a probabilistic approach to select a subset of a target domai...

07/19/2023 · What can we learn from Data Leakage and Unlearning for Law?
Large Language Models (LLMs) have a privacy concern because they memoriz...

10/01/2020 · An Empirical Investigation Towards Efficient Multi-Domain Language Model Pre-training
Pre-training large language models has become a standard in the natural ...
