Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

by   Yu Gu, et al.

Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this paper, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly-available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition (NER). To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding Reasoning Benchmark) at


Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT

The availability of biomedical text data and advances in natural languag...

GPT-3 Models are Poor Few-Shot Learners in the Biomedical Domain

Deep neural language models have set new breakthroughs in many tasks of ...

Fine-Tuning Large Neural Language Models for Biomedical Natural Language Processing

Motivation: A perennial challenge for biomedical researchers and clinica...

BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model

Pretrained language models have served as important backbones for natura...

Efficient Domain Adaptation of Language Models via Adaptive Tokenization

Contextual embedding-based language models trained on large data sets, s...

Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario

This work presents biomedical and clinical language models for Spanish b...

Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing

Contrastive pretraining on parallel image-text data has attained great s...

Please sign up or login with your details

Forgot password? Click here to reset