Efficient Domain Adaptation of Language Models via Adaptive Tokenization

by   Vin Sachidananda, et al.

Contextual embedding-based language models trained on large data sets, such as BERT and RoBERTa, provide strong performance across a wide range of tasks and are ubiquitous in modern NLP. It has been observed that fine-tuning these models on tasks involving data from domains different from that on which they were pretrained can lead to suboptimal performance. Recent work has explored approaches to adapt pretrained language models to new domains by incorporating additional pretraining using domain-specific corpora and task data. We propose an alternative approach for transferring pretrained language models to new domains by adapting their tokenizers. We show that domain-specific subword sequences can be efficiently determined directly from divergences in the conditional token distributions of the base and domain-specific corpora. In datasets from four disparate domains, we find adaptive tokenization on a pretrained RoBERTa model provides >97 specific pretraining. Our approach produces smaller models and less training and inference time than other approaches using tokenizer augmentation. While adaptive tokenization incurs a 6 experimentation, due to the introduction of 10k new domain-specific tokens, our approach, using 64 vCPUs, is 72x faster than further pretraining the language model on domain-specific corpora on 8 TPUs.


page 1

page 2

page 3

page 4


Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

Pretraining large neural language models, such as BERT, has led to impre...

RadLing: Towards Efficient Radiology Report Understanding

Most natural language tasks in the radiology domain use language models ...

To pretrain or not to pretrain? A case study of domain-specific pretraining for semantic segmentation in histopathology

Annotating medical imaging datasets is costly, so fine-tuning (or transf...

AdapterSoup: Weight Averaging to Improve Generalization of Pretrained Language Models

Pretrained language models (PLMs) are trained on massive corpora, but of...

A Compact Pretraining Approach for Neural Language Models

Domain adaptation for large neural language models (NLMs) is coupled wit...

On the Domain Adaptation and Generalization of Pretrained Language Models: A Survey

Recent advances in NLP are brought by a range of large-scale pretrained ...

M2D2: A Massively Multi-domain Language Modeling Dataset

We present M2D2, a fine-grained, massively multi-domain corpus for study...

Please sign up or login with your details

Forgot password? Click here to reset