Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

by Yu Gu, et al.

Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this paper, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly-available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition (NER). To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at






1. Introduction

Figure 1. Two paradigms for neural language model pretraining. Top: The prevailing mixed-domain paradigm assumes that out-domain text is still helpful and typically initializes domain-specific pretraining with a general-domain language model and inherits its vocabulary. Bottom: Domain-specific pretraining from scratch derives the vocabulary and conducts pretraining using solely in-domain text. In this paper, we show that for domains with abundant text such as biomedicine, domain-specific pretraining from scratch can substantially outperform the conventional mixed-domain approach.

In natural language processing (NLP), pretraining large neural language models on unlabeled text has proven to be a successful strategy for transfer learning. A prime example is Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019), which has become a standard building block for training task-specific NLP models. Existing pretraining work typically focuses on the newswire and Web domains. For example, the original BERT model was trained on Wikipedia and BookCorpus (Zhu et al., 2015), and subsequent efforts have focused on crawling additional text from the Web to power even larger-scale pretraining (Liu et al., 2019; Raffel et al., 2020).

In specialized domains like biomedicine, past work has shown that using in-domain text can provide additional gains over general-domain language models (Lee et al., 2019; Beltagy et al., 2019; Peng et al., 2019). However, a prevailing assumption is that out-domain text is still helpful and previous work typically adopts a mixed-domain approach, e.g., by starting domain-specific pretraining from an existing general-domain language model (Figure 1 top). In this paper, we question this assumption. We observe that mixed-domain pretraining such as continual pretraining can be viewed as a form of transfer learning in itself, where the source domain is general text, such as newswire and the Web, and the target domain is specialized text such as biomedical papers. Based on the rich literature of multi-task learning and transfer learning (Axelrod et al., 2011; Xu et al., 2019; Caruana, 1997; Liu et al., 2015), successful transfer learning occurs when the target data is scarce and the source domain is highly relevant to the target one. For domains with abundant unlabeled text such as biomedicine, it is unclear that domain-specific pretraining can benefit by transfer from general domains. In fact, the majority of general domain text is substantively different from biomedical text, raising the prospect of negative transfer that actually hinders the target performance.

We thus set out to conduct a rigorous study on domain-specific pretraining and its impact on downstream applications, using biomedicine as a running example. We show that domain-specific pretraining from scratch substantially outperforms continual pretraining of generic language models, thus demonstrating that the prevailing assumption in support of mixed-domain pretraining is not always applicable (Figure 1).

To facilitate this study, we compile a comprehensive biomedical NLP benchmark from publicly-available datasets, and conduct in-depth comparisons of modeling choices for pretraining and task-specific fine-tuning by their impact on domain-specific applications. Our experiments show that domain-specific pretraining from scratch can provide a solid foundation for biomedical NLP, leading to new state-of-the-art performance across a wide range of tasks. Additionally, we discover that the use of transformer-based models, like BERT, necessitates rethinking several common practices. For example, BIO tags and more complex variants are the standard label representation for named entity recognition (NER). However, we find that simply using IO (in or out of entity mentions) suffices with BERT models, leading to comparable or better performance.
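As a minimal sketch of the tagging-scheme simplification, collapsing BIO labels to the IO scheme is a one-line mapping (the entity type names below are illustrative):

```python
def bio_to_io(tags):
    """Collapse BIO tags to the simpler IO scheme:
    B-X (begin) and I-X (inside) both become I-X; O stays O."""
    return ["I-" + t[2:] if t[:2] in ("B-", "I-") else t for t in tags]
```

With BERT's contextual representations, the model no longer needs the begin/inside distinction to delimit adjacent mentions in most biomedical datasets, which is why the simpler scheme suffices.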

To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our comprehensive benchmark at

2. Language Model Pretraining

In this section, we give a quick overview of neural language model pretraining, using BERT (Devlin et al., 2019) as a running example.

2.1. Vocabulary

We assume that the input consists of text spans, such as sentences separated by special tokens (e.g., [SEP] in BERT). To address the problem of out-of-vocabulary words, neural language models generate a vocabulary from subword units, using Byte-Pair Encoding (BPE) (Sennrich et al., 2016) or variants such as WordPiece (Kudo and Richardson, 2018). Essentially, the BPE algorithm tries to greedily identify a small set of subwords that can compactly form all words in the given corpus. It does this by first shattering all words in the corpus and initializing the vocabulary with characters and delimiters. It then iteratively augments the vocabulary with a new subword that is most frequent in the corpus and can be formed by concatenating two existing subwords, until the vocabulary reaches the pre-specified size (e.g., 30,000 in standard BERT models or 50,000 in RoBERTa (Liu et al., 2019)). In this paper, we use the WordPiece algorithm, a BPE variant that uses likelihood under a unigram language model rather than frequency in choosing which subwords to concatenate. The text corpus and vocabulary may preserve the case (cased models) or convert all characters to lower case (uncased models).
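A toy implementation of the frequency-based BPE merge loop described above (simplified: no word delimiters, and frequency rather than WordPiece's likelihood criterion) might look like this:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent
    subword pair across the corpus."""
    # Shatter each word into characters; track word frequencies.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent subword pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Rewrite every word, replacing the pair with its concatenation.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges
```

Each learned merge adds one new subword to the vocabulary, so `num_merges` effectively controls the final vocabulary size.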

2.2. Model Architecture

State-of-the-art neural language models are generally based on transformer architectures (Vaswani et al., 2017), following the recent success of BERT (Devlin et al., 2019; Liu et al., 2019). The transformer model introduces a multi-layer, multi-head self-attention mechanism, which has demonstrated superiority in leveraging GPU-based parallel computation and modeling long-range dependencies in text, compared to recurrent neural networks such as LSTMs (Hochreiter and Schmidhuber, 1997). The input token sequence is first processed by a lexical encoder, which combines a token embedding, a (token) position embedding and a segment embedding (i.e., which text span the token belongs to) by element-wise summation. This embedding layer is then passed to multiple layers of transformer modules (Vaswani et al., 2017). In each transformer layer, a contextual representation is generated for each token by summing a non-linear transformation of the representations of all tokens in the prior layer, weighted by the attentions computed using the given token's representation in the prior layer as the query. The final layer outputs contextual representations for all tokens, which combine information from the whole text span.
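The attention-weighted summation described above can be sketched as a toy single-head self-attention with identity query/key/value projections (real transformer layers add learned projection matrices, multiple heads, feed-forward transformations, residual connections, and layer normalization):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(reps):
    """Toy single-head self-attention: each token's new representation is
    an attention-weighted sum of all token representations, with the
    token's own representation acting as the query."""
    d = len(reps[0])
    out = []
    for q in reps:
        # Scaled dot-product scores of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in reps]
        w = softmax(scores)
        # Weighted sum over all value vectors (here, the inputs themselves).
        out.append([sum(wj * v[i] for wj, v in zip(w, reps))
                    for i in range(d)])
    return out
```

Because the attention weights for each query sum to one, every output vector is a convex combination of the input representations, which is how each token's output mixes in information from the whole span.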

2.3. Self-Supervision

A key innovation in BERT (Devlin et al., 2019) is the use of a Masked Language Model (MLM) for self-supervised pretraining. Traditional language models are typically generative models that predict the next token based on the preceding tokens; for example, n-gram models represent the conditional probability of the next token by a multinomial of the preceding n-gram, with various smoothing strategies to handle rare occurrences (Ney et al., 1994). The Masked Language Model instead randomly replaces a subset of tokens with a special token (e.g., [MASK]), and asks the language model to predict them. The training objective is the cross-entropy loss between the original tokens and the predicted ones. In BERT and RoBERTa, 15% of the input tokens are chosen, among which a random 80% are replaced by [MASK], 10% are left unchanged and 10% are randomly replaced by a token from the vocabulary. Instead of using a constant masking rate of 15%, a standard approach is to gradually increase it from 5% to 25% with 5% increments for every 20% of training epochs, which makes pretraining more stable (Liu et al., 2020). The original BERT algorithm also uses Next Sentence Prediction (NSP), which determines for a given sentence pair whether one sentence follows the other in the original text. The utility of NSP has been called into question (Liu et al., 2019), but we include it in our pretraining experiments to enable a head-to-head comparison with prior BERT models.
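The 15% selection rate and 80/10/10 corruption scheme can be sketched as follows (the mask token and tiny vocabulary are placeholders):

```python
import random

MASK = "[MASK]"
VOCAB = ["a", "b", "c", "d"]  # placeholder vocabulary

def mask_tokens(tokens, rate=0.15, rng=random):
    """BERT-style MLM corruption: select `rate` of positions; of those,
    80% become [MASK], 10% stay unchanged, 10% become a random token.
    Returns the corrupted sequence and the prediction targets."""
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < rate:
            targets[i] = tok  # the model must recover the original token
            r = rng.random()
            if r < 0.8:
                out[i] = MASK
            elif r < 0.9:
                pass  # left unchanged
            else:
                out[i] = rng.choice(VOCAB)
    return out, targets
```

The cross-entropy loss is then computed only at the positions recorded in `targets`, comparing the model's predictions against the original tokens.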

2.4. Advanced Pretraining Techniques

In the original formulation of BERT (Devlin et al., 2019), the masked language model (MLM) simply selects random subwords to mask. When a word is only partially masked, it is relatively easy to predict the masked portion given the observed ones. In contrast, whole-word masking (WWM) enforces that the whole word must be masked if one of its subwords is chosen. This has been adopted as the standard approach, as it forces the language model to capture more contextual semantic dependencies.
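The grouping step behind whole-word masking can be sketched as follows: WordPiece continuation tokens (those prefixed with `##`) are attached to the preceding word, so that masking any piece of a word masks every piece.

```python
def whole_word_spans(tokens):
    """Group WordPiece tokens into whole-word index spans: a token
    starting with '##' continues the previous word, so the whole span
    is masked together under whole-word masking."""
    spans = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and spans:
            spans[-1].append(i)
        else:
            spans.append([i])
    return spans
```

A masking routine then samples over these spans rather than over individual subword positions, which prevents the easy case of predicting a masked piece from its observed siblings.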

In this paper, we also explore adversarial pretraining and its impact on downstream applications. Motivated by successes in countering adversarial attacks in computer vision, adversarial pretraining introduces perturbations in the input embedding layer that maximize the adversarial loss, thus forcing the model not only to optimize the standard training objective (MLM), but also to minimize the adversarial loss (Liu et al., 2020).
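A single gradient-based perturbation step in the spirit of this idea can be sketched as follows; this is a simplified one-step update on one embedding vector (the cited work uses projected gradient steps on a KL-based adversarial loss, which is not reproduced here):

```python
def adversarial_perturb(embedding, grad, epsilon=0.1):
    """One-step adversarial perturbation: move the input embedding a
    distance epsilon along the normalized gradient of the loss, i.e. the
    direction that locally maximizes the loss."""
    norm = sum(g * g for g in grad) ** 0.5 or 1.0  # avoid division by zero
    return [e + epsilon * g / norm for e, g in zip(embedding, grad)]
```

During pretraining, the model is then trained to make consistent predictions on both the clean and the perturbed embeddings, which is what makes the objective adversarial rather than merely noisy.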

3. Biomedical Language Model Pretraining

Pretraining language models using general text has proven beneficial for applications in open domains, such as newswire and the Web (Devlin et al., 2019; Liu et al., 2019). However, there are also many high-value applications in specialized domains, many of which have an abundance of in-domain text, such as biomedicine, finance, and law. In this paper, we will use biomedicine as a running example in our study of domain-specific pretraining. In other words, biomedical text is considered in-domain, while others are regarded as out-domain.

Intuitively, using in-domain text in pretraining should help with domain-specific applications. Indeed, prior work has shown that pretraining with PubMed text leads to better performance in biomedical NLP tasks (Lee et al., 2019; Beltagy et al., 2019; Peng et al., 2019). The main question is whether pretraining should include text from other domains. The prevailing assumption is that pretraining can always benefit from more text, including out-domain text. And, in fact, none of the prior biomedical-related BERT models have been pretrained using purely biomedical text (Lee et al., 2019; Beltagy et al., 2019; Peng et al., 2019). In this paper, we challenge this assumption and show that domain-specific pretraining from scratch is superior to mixed-domain pretraining for downstream applications.

3.1. Mixed-Domain Pretraining

The standard approach to pretraining a biomedical BERT model conducts continual pretraining of a general-domain pretrained model, as exemplified by BioBERT (Lee et al., 2019). Specifically, this approach would initialize with the standard BERT model (Devlin et al., 2019), pretrained using Wikipedia and BookCorpus. It then continues the pretraining process with MLM and NSP using biomedical text. In the case of BioBERT, continual pretraining is conducted using PubMed abstracts and full text. BlueBERT (Peng et al., 2019) uses both PubMed text and de-identified clinical notes from MIMIC-III (Johnson et al., 2016).

Note that in the continual pretraining approach, the vocabulary is the same as the original BERT model, in this case the one generated from Wikipedia and BookCorpus. While convenient, this is a major disadvantage for this approach, as the vocabulary is not representative of the target biomedical domain.

Compared to the other biomedical-related pretraining efforts, SciBERT (Beltagy et al., 2019) is a notable exception as it generates the vocabulary and pretrains from scratch, using biomedicine and computer science as representatives for the scientific literature. However, from the perspective of biomedical applications, SciBERT still adopts the mixed-domain pretraining approach, as computer science text is clearly out-domain.

3.2. Domain-Specific Pretraining from Scratch

The mixed-domain pretraining approach makes sense if the target application domain has little text of its own, and can thereby benefit from pretraining using related domains. However, this is not the case for biomedicine, which has over thirty million papers in PubMed, and adds over a million each year. We thus hypothesize that domain-specific pretraining from scratch is a better strategy for biomedical language model pretraining.

Biomedical Term Category BERT SciBERT PubMedBERT (Ours)
diabetes disease
leukemia disease
lithium drug
insulin drug
DNA gene
promoter gene
hypertension disease
nephropathy disease
lymphoma disease
lidocaine drug
oropharyngeal organ
cardiomyocyte cell
chloramphenicol drug
RecA gene
acetyltransferase gene
clonidine drug
naloxone drug
Table 1. Comparison of common biomedical terms in the vocabularies used by the standard BERT, SciBERT and PubMedBERT (ours). A check mark indicates that the biomedical term appears in the corresponding vocabulary; otherwise the term is shattered into smaller subwords.

A major advantage of domain-specific pretraining from scratch stems from having an in-domain vocabulary. Table 1 compares the vocabularies used in various pretraining strategies. BERT models using continual pretraining are stuck with the original vocabulary from the general-domain corpora, which does not contain many common biomedical terms. Even for SciBERT, which generates its vocabulary partially from biomedical text, the deficiency compared to a purely biomedical vocabulary is substantial. As a result, the BERT models are forced to divert parametrization capacity and training bandwidth to model biomedical terms using fragmented subwords. For example, lymphoma, a common medical term, is divided into four pieces (l, ##ym, ##ph, ##oma) by BERT, and acetyltransferase is shattered into seven (ace, ##ty, ##lt, ##ran, ##sf, ##eras, ##e).
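The fragmentation above follows from WordPiece's greedy longest-match-first decoding: at each step the tokenizer takes the longest vocabulary entry that prefixes the remaining characters (`##` marks word-internal continuations). A minimal sketch:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece tokenization of one word.
    Returns ['[UNK]'] if some remainder cannot be covered."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation of the current word
            if piece in vocab:
                cur = piece
                break
            end -= 1  # try a shorter prefix
        if cur is None:
            return ["[UNK]"]
        pieces.append(cur)
        start = end
    return pieces
```

With a general-domain vocabulary that lacks the whole word, `lymphoma` falls apart into four pieces; with an in-domain vocabulary containing it, the same routine emits a single token, so the model can devote its capacity to the term itself rather than to reassembling fragments.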

Another advantage of domain-specific pretraining from scratch is that the language model is trained using purely in-domain data. For example, SciBERT pretraining has to balance optimizing for biomedical text and computer science text, the latter of which is unlikely to be beneficial for biomedical applications. Continual pretraining, on the other hand, may potentially recover from out-domain modeling, though not completely. Aside from the vocabulary issue mentioned earlier, neural network training uses non-convex optimization, which means that continual pretraining may not be able to completely undo suboptimal initialization from the general-domain language model.

In our experiments, we show that domain-specific pretraining with in-domain vocabulary confers clear advantages over mixed-domain pretraining, be it continual pretraining of general-domain language models, or pretraining on mixed-domain text.

4. BLURB: A Comprehensive Benchmark for Biomedical NLP

BioBERT (Lee et al., 2019) SciBERT (Beltagy et al., 2019) BLUE (Peng et al., 2019) BLURB
BC5-chem (Li et al., 2016)
BC5-disease (Li et al., 2016)
NCBI-disease (Doğan et al., 2014) -
BC2GM (Smith et al., 2008) - -
JNLPBA (Kim et al., 2004) - -
EBM PICO (Nye et al., 2018) - -
ChemProt (Krallinger et al., 2017)
DDI (Herrero-Zazo et al., 2013) -
GAD (Bravo et al., 2015) - -
BIOSSES (Soğancıoğlu et al., 2017) - -
HoC (Hanahan and Weinberg, 2000) - -
PubMedQA (Jin et al., 2019) - - -
BioASQ (Nentidis et al., 2019) - -
Table 2. Comparison of the biomedical datasets in prior language model pretraining studies and BLURB.

The ultimate goal of language model pretraining is to improve performance on a wide range of downstream applications. In general-domain NLP, the creation of comprehensive benchmarks, such as GLUE (Wang et al., 2019b, a), greatly accelerates advances in language model pretraining by enabling head-to-head comparisons among pretrained language models. In contrast, prior work on biomedical pretraining tends to use different tasks and datasets for downstream evaluation, as shown in Table 2. This makes it hard to assess the impact of pretrained language models on the downstream tasks we care about. To the best of our knowledge, BLUE (Peng et al., 2019) is the first attempt to create an NLP benchmark in the biomedical domain. We aim to improve on its design by addressing some of its limitations. First, BLUE has limited coverage of biomedical applications used in other recent work on biomedical language models, as shown in Table 2. For example, it does not include any question-answering task. More importantly, BLUE mixes PubMed-based biomedical applications (six datasets such as BC5, ChemProt, and HoC) with MIMIC-based clinical applications (four datasets such as i2b2 and MedNLI). Clinical notes differ substantially from biomedical literature, to the extent that we observe BERT models pretrained on clinical notes perform poorly on biomedical tasks, similar to the standard BERT. Consequently, it is advantageous to create separate benchmarks for these two domains.

To facilitate investigations of biomedical language model pretraining and help accelerate progress in biomedical NLP, we create a new benchmark, the Biomedical Language Understanding & Reasoning Benchmark (BLURB). We focus on PubMed-based biomedical applications, and leave the exploration of the clinical domain, and other high-value verticals to future work. To make our effort tractable and facilitate head-to-head comparison with prior work, we prioritize the selection of datasets used in recent work on biomedical language models, and will explore the addition of other datasets in future work.

Dataset Task Train Dev Test Evaluation Metrics
BC5-chem NER 5203 5347 5385 F1 entity-level
BC5-disease NER 4182 4244 4424 F1 entity-level
NCBI-disease NER 5134 787 960 F1 entity-level
BC2GM NER 15197 3061 6325 F1 entity-level
JNLPBA NER 46750 4551 8662 F1 entity-level
EBM PICO PICO 339167 85321 16364 Macro F1 word-level
ChemProt Relation Extraction 18035 11268 15745 Micro F1
DDI Relation Extraction 22233 5559 5716 Micro F1
GAD Relation Extraction 4261 534 535 Micro F1
BIOSSES Sentence Similarity 64 16 20 Pearson
HoC Document Classification 1295 186 371 Average Micro F1
PubMedQA Question Answering 450 50 500 Accuracy
BioASQ Question Answering 670 75 140 Accuracy
Table 3. Datasets used in the BLURB biomedical NLP benchmark. We list the numbers of instances in train, dev, and test (e.g., entity mentions in NER and PICO elements in evidence-based medical information extraction).

BLURB comprises a comprehensive set of biomedical NLP tasks from publicly available datasets, including named entity recognition (NER), evidence-based medical information extraction (PICO), relation extraction, sentence similarity, document classification, and question answering. See Table 3 for an overview of the BLURB datasets. For question answering, prior work has considered both classification tasks (e.g., whether a reference text contains the answer to a given question) and more complex tasks such as list and summary questions (Nentidis et al., 2019). The latter often require additional engineering effort that is not relevant to evaluating neural language models. For simplicity, we focus on classification tasks such as yes/no question answering in BLURB, and leave the inclusion of more complex question-answering tasks to future work.

To compute a summary score for BLURB, the simplest way is to report the average score among all tasks. However, this may place undue emphasis on simpler tasks such as NER for which there are many existing datasets. Therefore, we group the datasets by their task types, compute the average score for each task type, and report the macro average among the task types. To help accelerate research in biomedical NLP, we release the BLURB benchmark as well as a leaderboard at
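The two-level averaging described above can be sketched directly (the task-type names and scores below are illustrative):

```python
def blurb_score(scores_by_task_type):
    """BLURB summary score: average the dataset scores within each task
    type, then macro-average across task types, so that task types with
    many datasets (e.g., NER) do not dominate the summary."""
    per_type = [sum(scores) / len(scores)
                for scores in scores_by_task_type.values()]
    return sum(per_type) / len(per_type)
```

For example, two NER datasets scoring 80 and 90 contribute a single task-type average of 85, weighted equally with a lone QA dataset.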

Below are detailed descriptions for each task and its corresponding dataset.

4.1. Named Entity Recognition (NER)

BC5-Chemical & BC5-Disease

The BioCreative V Chemical-Disease Relation corpus (Li et al., 2016) was created for evaluating relation extraction of drug-disease interactions, but is frequently used as a NER corpus for detecting chemical (drug) and disease entities. The dataset consists of 1500 PubMed abstracts broken into three even splits for training, development, and test. We use a pre-processed version of this dataset generated by Crichton et al. (2017), discard the relation labels, and train NER models for chemical (BC5-Chemical) and disease (BC5-Disease) separately.


NCBI-Disease

The National Center for Biotechnology Information Disease corpus (Doğan et al., 2014) contains 793 PubMed abstracts with 6892 annotated disease mentions linked to 790 distinct disease entities. We use a pre-processed set of train, development, and test splits generated by Crichton et al. (2017).


BC2GM

The BioCreative II Gene Mention corpus (Smith et al., 2008) consists of sentences from PubMed abstracts with manually labeled gene and alternative gene entities. Following prior work, we focus on the gene entity annotation. In its original form, BC2GM contains 15000 train and 5000 test sentences. We use a pre-processed version of the dataset generated by Crichton et al. (2017), which carves out 2500 sentences from the training data for development.


JNLPBA

The Joint Workshop on Natural Language Processing in Biomedicine and its Applications shared task (Kim et al., 2004) is an NER corpus on PubMed abstracts. The entity types are chosen for molecular biology applications: protein, DNA, RNA, cell line, and cell type. Some of the entity type distinctions are not very meaningful. For example, a gene mention often refers to both the DNA and gene products such as the RNA and protein. Following prior work that evaluates on this dataset (Lee et al., 2019), we ignore the type distinction and focus on detecting the entity mentions. We use the same train, development, and test splits as in Crichton et al. (2017).

4.2. Evidence-Based Medical Information Extraction (PICO)

EBM PICO

The Evidence-Based Medicine corpus (Nye et al., 2018) contains PubMed abstracts on clinical trials, where each abstract is annotated with P, I, and O in PICO: Participants (e.g., diabetic patients), Intervention (e.g., insulin), Comparator (e.g., placebo) and Outcome (e.g., blood glucose levels). Comparator (C) labels are omitted as they are standard in clinical trials: placebo for passive control and standard of care for active control. There are 4300, 500, and 200 abstracts in training, development, and test, respectively. The training and development sets were labeled by Amazon Mechanical Turkers, whereas the test set was labeled by Upwork contributors with prior medical training. EBM PICO provides labels at the word level for each PIO element. For each of the PIO elements in an abstract, we tally the F1 score at the word level, and then compute the final score as the average among PIO elements in the dataset. Occasionally, two PICO elements might overlap with each other (e.g., a participant span might contain within it an intervention span). In EBM-PICO, about 3% of the PIO words are in the overlap. Note that the dataset released along with SciBERT appears to remove the overlapping words from the larger span (e.g., the participant span as mentioned above). We instead use the original dataset (Nye et al., 2018) and their scripts for preprocessing and evaluation.
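The word-level scoring just described can be sketched as follows, representing each element's annotation as a set of word indices (a simplification of the original evaluation scripts):

```python
def word_f1(gold, pred):
    """Word-level F1 for one PICO element: gold and pred are sets of
    word indices labeled with that element."""
    tp = len(gold & pred)  # true positives: words labeled in both
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def ebm_pico_score(gold_by_elem, pred_by_elem):
    """Average the word-level F1 over the P, I, and O elements."""
    elems = gold_by_elem.keys()
    return sum(word_f1(gold_by_elem[e], pred_by_elem[e])
               for e in elems) / len(elems)
```

Because the score is averaged over elements rather than pooled over words, each of P, I, and O carries equal weight regardless of how many words it spans.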

4.3. Relation Extraction


ChemProt

The Chemical Protein Interaction corpus (Krallinger et al., 2017) consists of PubMed abstracts annotated with chemical-protein interactions between chemical and protein entities. There are 23 interactions organized in a hierarchy, with 10 high-level interactions. The vast majority of relation instances in ChemProt are within single sentences. Following prior work (Lee et al., 2019; Beltagy et al., 2019), we only consider sentence-level instances. We follow the ChemProt authors' suggestions and focus on classifying five of the high-level interactions, as well as a catch-all class for everything else. The ChemProt annotation is not exhaustive for all chemical-protein pairs. Following previous work (Peng et al., 2019; Lee et al., 2019), we expand the training and development sets by assigning a negative label to all chemical-protein pairs that occur in a training or development sentence but do not have an explicit label in the ChemProt corpus. Note that prior work uses slightly different label expansion of the test data. To facilitate head-to-head comparison, we will provide instructions for reproducing the test set in BLURB from the original dataset.
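The expansion of unannotated chemical-protein pairs into explicit negatives can be sketched per sentence as follows (the `"NONE"` label name is a placeholder, not the corpus's actual label):

```python
from itertools import product

def expand_labels(chemicals, proteins, annotated):
    """For one sentence, give every chemical-protein pair an explicit
    label: keep gold annotations, and mark all remaining pairs with a
    placeholder negative label."""
    labels = dict(annotated)  # gold pairs keep their interaction label
    for pair in product(chemicals, proteins):
        labels.setdefault(pair, "NONE")  # unannotated pair -> negative
    return labels
```

This turns the sparse gold annotation into a complete classification problem over all candidate pairs in each sentence.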


The Drug-Drug Interaction corpus (Herrero-Zazo et al., 2013) was created to facilitate research on pharmaceutical information extraction, with a particular focus on pharmacovigilance. It contains sentence-level annotation of drug-drug interactions on PubMed abstracts. Note that some prior work (Peng et al., 2019; Zhang et al., 2018) discarded 90 training files that the authors considered not conducive to learning drug-drug interactions. We instead use the original dataset and produce our train/dev/test split of 624/90/191 files.


The Genetic Association Database corpus (Bravo et al., 2015) was created semi-automatically using the Genetic Association Archive. Specifically, the archive contains a list of gene-disease associations, with the corresponding sentences in the PubMed abstracts reporting the association studies. Bravo et al. (2015) used a biomedical NER tool to identify gene and disease mentions, creating positive examples from the annotated sentences in the archive and negative examples from gene-disease co-occurrences that were not annotated in the archive. We use an existing preprocessed version of GAD and its corresponding train/dev/test split created by Lee et al. (2019).

4.4. Sentence Similarity


BIOSSES

The Sentence Similarity Estimation System for the Biomedical Domain (Soğancıoğlu et al., 2017) contains 100 pairs of PubMed sentences, each of which is annotated by five expert annotators with an estimated similarity score ranging from 0 (no relation) to 4 (equivalent meanings). It is a regression task, with the average score as the final annotation. We use the same train/dev/test split as in Peng et al. (2019) and use Pearson correlation for evaluation.
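Pearson correlation, the evaluation metric for this regression task, measures the linear agreement between predicted and gold similarity scores:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score
    lists, e.g. predicted vs. gold sentence-similarity scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value of 1 indicates perfect linear agreement, 0 no linear relationship, and -1 perfect inverse agreement; unlike accuracy-style metrics, it is insensitive to a constant shift or scaling of the predictions.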

4.5. Document Classification


HoC

The Hallmarks of Cancer corpus was motivated by the pioneering work on cancer hallmarks (Hanahan and Weinberg, 2000). It contains annotations on PubMed abstracts with binary labels, each of which signifies the discussion of a specific cancer hallmark. The authors use 37 fine-grained hallmarks, which are grouped into ten top-level ones. We focus on predicting the top-level labels. The dataset was released with 1499 PubMed abstracts (Baker et al., 2015) and has since been expanded to 1852 abstracts (Baker et al., 2017). Note that Peng et al. (2019) discarded a control subset of 272 abstracts that do not discuss any cancer hallmark (i.e., all binary labels are false). They also used a different evaluation metric. We instead adopt the original dataset and its evaluation metric (average micro F1 across the ten cancer hallmarks), and create our own train/dev/test split, as none was previously available (the original authors used cross-validation for their evaluation).

4.6. Question Answering (QA)


PubMedQA

The PubMedQA dataset (Jin et al., 2019) contains a set of research questions, each with a reference text from a PubMed abstract as well as an annotated label of whether the text contains the answer to the research question (yes/maybe/no). We use the original train/dev/test split with 450, 50, and 500 questions, respectively.


BioASQ

The BioASQ corpus (Nentidis et al., 2019) contains multiple question answering tasks annotated by biomedical experts, including yes/no, factoid, list, and summary questions. Pertaining to our objective of comparing neural language models, we focus on the yes/no questions (Task 7b), and leave the inclusion of other tasks to future work. Each question is paired with a reference text containing multiple sentences from a PubMed abstract and a yes/no answer. We use the official train/dev/test split of 670/75/140 questions.

5. Task-Specific Fine-Tuning


Figure 2. A general architecture for task-specific fine-tuning of neural language models, with a relation-extraction example. Note that the input goes through additional processing such as word-piece tokenization in the neural language model module.

Pretrained neural language models provide a unifying foundation for learning task-specific models. Given an input token sequence, the language model produces a sequence of vectors in the contextual representation. A task-specific prediction model is then layered on top to generate the final output for a task-specific application. Given task-specific training data, we can learn the task-specific model parameters and refine the BERT model parameters by gradient descent using backpropagation.

Prior work on biomedical NLP often adopts different task-specific models and fine-tuning methods, which makes it difficult to understand the impact of an underlying pretrained language model on task performance. In this section, we review standard methods and common variants used for each task. In our primary investigation comparing pretraining strategies, we fix the task-specific model architecture to the standard method identified here, to facilitate a head-to-head comparison among the pretrained neural language models. Subsequently, we start with the same pretrained BERT model and conduct additional investigation of the impact of various choices in the task-specific models. For prior biomedical BERT models, our standard task-specific methods generally lead to comparable or better performance when compared to their published results.

5.1. A General Architecture for Fine-Tuning Neural Language Models

Figure 2 shows a general architecture for fine-tuning neural language models on downstream applications. An input instance is first processed by a module that performs task-specific transformations, such as appending a special instance marker (e.g., [CLS]) or dummifying entity mentions for relation extraction. The transformed input is then tokenized using the neural language model’s vocabulary and fed into the neural language model. Next, the contextual representation at the top layer is processed by a featurization module, and then fed into the prediction module to generate the final output for a given task.

To facilitate a head-to-head comparison, we apply the same fine-tuning procedure for all BERT models and tasks. Specifically, we use cross-entropy loss for classification tasks and mean-square error for regression tasks. We conduct hyperparameter search using the development set based on task-specific metrics. Similarly to previous work, we jointly fine-tune the parameters of the task-specific prediction layer as well as the underlying neural language model.

5.2. Task-Specific Problem Formulation and Modeling Choices

Task | Problem Formulation | Modeling Choices
NER | Token Classification | Tagging Scheme, Classification Layer
PICO | Token Classification | Tagging Scheme, Classification Layer
Relation Extraction | Sequence Classification | Entity/Relation Representation, Classification Layer
Sentence Similarity | Sequence Regression | Sentence Representation, Regression Loss
Document Classification | Sequence Classification | Document Representation, Classification Layer
Question Answering | Sequence Classification | Question/Text Representation, Classification Layer
Table 4. Standard NLP tasks and their problem formulations and modeling choices.

Many NLP applications can be formulated as a classification or regression task, wherein either individual tokens or sequences are the prediction target. Modeling choices usually vary in two aspects: the instance representation and the prediction layer. Table 4 presents an overview of the problem formulation and modeling choices for tasks we consider and detailed descriptions are provided below. For each task, we highlight the standard modeling choices with an asterisk (*).


Given an input text span (usually a sentence), the NER task seeks to recognize mentions of entities of interest. It is typically formulated as a sequential labeling task, where each token is assigned a tag signifying whether it is in an entity mention or not. The modeling choices primarily vary in the tagging scheme and classification method. BIO is the standard tagging scheme, which classifies each token as the beginning of an entity (B), inside an entity (I), or outside (O). The NER tasks in BLURB are concerned with only one entity type (in JNLPBA, all the types are merged into one). When there are multiple entity types, the tags are further divided into fine-grained tags for specific types. Prior work has also considered more complex tagging schemes such as BIOUL, where L stands for the last word of an entity and U stands for a single-word entity. We also consider the simpler IO scheme, which only differentiates between inside and outside of an entity. Classification is done using a simple linear layer or more sophisticated sequential labeling methods such as an LSTM or a conditional random field (CRF) (Lafferty et al., 2001).

  • Input: returns the input sequence as is.

  • Featurizer: returns the BERT encoding of a given token.

  • Tagging scheme: BIO*; BIOUL; IO.

  • Classification layer: linear layer*; LSTM; CRF.
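A small sketch of how the three tagging schemes label the same entity spans; the helper function and its names are ours, for illustration only.

```python
def tag(n_tokens, spans, scheme="BIO"):
    """Label n_tokens tokens given entity spans (start, end), end exclusive."""
    tags = ["O"] * n_tokens
    for s, e in spans:
        if scheme == "IO":            # inside/outside only
            tags[s:e] = ["I"] * (e - s)
        elif scheme == "BIO":         # mark the beginning of each entity
            tags[s:e] = ["B"] + ["I"] * (e - s - 1)
        elif scheme == "BIOUL":       # also mark last word and single-word units
            if e - s == 1:
                tags[s] = "U"
            else:
                tags[s:e] = ["B"] + ["I"] * (e - s - 2) + ["L"]
    return tags

# Two entities over six tokens: a single-word one and a two-word one.
spans = [(0, 1), (3, 5)]
print(tag(6, spans, "BIO"))    # ['B', 'O', 'O', 'B', 'I', 'O']
print(tag(6, spans, "BIOUL"))  # ['U', 'O', 'O', 'B', 'L', 'O']
print(tag(6, spans, "IO"))     # ['I', 'O', 'O', 'I', 'I', 'O']
```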


Conceptually, evidence-based medical information extraction is akin to slot filling, as it tries to identify the PIO elements in an abstract describing a clinical trial. However, it can be formulated as a sequential tagging task like NER, by classifying tokens belonging to each element. A token may belong to more than one element, e.g., participant (P) and intervention (I).

  • Input: returns the input sequence as is.

  • Featurizer: returns the BERT encoding of a given token.

  • Tagging scheme: BIO*; BIOUL; IO.

  • Classification layer: linear layer*; LSTM; CRF.

Relation Extraction

Existing work on relation extraction tends to focus on binary relations. Given a pair of entity mentions in a text span (typically a sentence), the goal is to determine whether the text indicates a relation for the mention pair. There are significant variations in the entity and relation representations. To prevent overfitting by memorizing the entity pairs, the entity tokens are often augmented with start/end markers or replaced by a dummy token. For featurization, the relation instance is either represented by the special [CLS] token, or by concatenating the mention representations. In the latter case, if an entity mention contains multiple tokens, its representation is usually produced by pooling those of the individual tokens (max or average). For computational efficiency, we use padding or truncation to set the input length to 128 tokens for GAD and 256 tokens for ChemProt and DDI, which contain longer input sequences.

  • Input: entity (dummification*; start/end marker; original); relation ([CLS]*; original).

  • Featurizer: entity (dummy token*; pooling); relation ([CLS] BERT encoding*; concatenation of the mention BERT encodings).

  • Classification layer: linear layer*; more sophisticated classifiers (e.g., MLP).
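The entity transformations above can be sketched as follows; the tag strings ($DRUG, $GENE, [E1]/[/E1], …) follow common practice in the relation extraction literature and are illustrative assumptions rather than the paper's exact markers.

```python
def transform(tokens, e1, e2, mode="dummify"):
    """Apply entity dummification or entity start/end markers.

    e1, e2: (start, end) token spans of the two entity mentions, end exclusive.
    """
    out = []
    for i, tok in enumerate(tokens):
        if mode == "dummify":
            # Replace each entity span with a single generic type tag.
            if e1[0] <= i < e1[1]:
                tok = "$DRUG" if i == e1[0] else None
            elif e2[0] <= i < e2[1]:
                tok = "$GENE" if i == e2[0] else None
        elif mode == "markers":
            # Keep the original text but mark entity boundaries.
            if i == e1[0]:
                out.append("[E1]")
            if i == e2[0]:
                out.append("[E2]")
        if tok is not None:
            out.append(tok)
        if mode == "markers":
            if i == e1[1] - 1:
                out.append("[/E1]")
            if i == e2[1] - 1:
                out.append("[/E2]")
    return ["[CLS]"] + out

toks = ["aspirin", "inhibits", "cox", "2"]
print(transform(toks, (0, 1), (2, 4), "dummify"))
print(transform(toks, (0, 1), (2, 4), "markers"))
```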

Sentence Similarity

The similarity task can be formulated as a regression problem that generates a normalized score for a sentence pair. By default, a special [SEP] token is inserted to separate the two sentences, and a special [CLS] token is prepended to the beginning to represent the pair. The BERT encoding of [CLS] is used to compute the regression score.

  • Input: [CLS] S1 [SEP] S2 [SEP], for sentence pair (S1, S2).

  • Featurizer: [CLS] BERT encoding.

  • Regression layer: linear regression.
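The pair encoding (also used for question answering below, with truncation to a fixed length) can be sketched as follows, with whitespace tokenization standing in for the model's WordPiece tokenizer:

```python
def pair_input(a, b, max_len=512):
    """Encode a text pair as [CLS] a [SEP] b [SEP], truncated to max_len."""
    toks = ["[CLS]"] + a.split() + ["[SEP]"] + b.split() + ["[SEP]"]
    return toks[:max_len]   # padding up to max_len omitted for brevity

print(pair_input("the drug was effective", "the medication worked"))
```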

Document Classification

For each text span and category (an abstract and a cancer hallmark in HoC), the goal is to classify whether the text belongs to the category. By default, a [CLS] token is prepended to the beginning of the text, and its BERT encoding is used for the final classification, which typically uses a simple linear layer.

  • Input: [CLS] D, for document D.

  • Featurizer: [CLS] BERT encoding.

  • Classification layer: linear layer.

Question Answering

For the two-way (yes/no) or three-way (yes/maybe/no) question-answering task, the encoding is similar to that of the sentence similarity task. Namely, a [CLS] token is prepended to the beginning, followed by the question and the reference text, with a [SEP] token separating the two text spans. The [CLS] BERT encoding is then used for the final classification. For computational efficiency, we use padding or truncation to set the input length to 512 tokens.

  • Input: [CLS] Q [SEP] T [SEP], for question Q and reference text T.

  • Featurizer: [CLS] BERT encoding.

  • Classification layer: linear layer.

6. Experiments

In this section, we conduct thorough evaluation to assess the impact of domain-specific pretraining in biomedical NLP applications. First, we fix the standard task-specific model for each task in BLURB, and conduct a head-to-head comparison of domain-specific pretraining and mixed-domain pretraining. Next, we evaluate the impact of various pretraining options such as vocabulary, whole-word masking (WWM), and adversarial pretraining. Finally, we fix a pretrained BERT model and compare various modeling choices for task-specific fine-tuning.

6.1. Neural Language Models

Model | Vocabulary | Pretraining | Corpus | Text Size
BERT | Wiki + Books | - | Wiki + Books | 3.3B words / 16GB
RoBERTa | Web crawl | - | Web crawl | 160GB
BioBERT | Wiki + Books | continual pretraining | PubMed | 4.5B words
SciBERT | PMC + CS | from scratch | PMC + CS | 3.2B words
ClinicalBERT | Wiki + Books | continual pretraining | MIMIC | 0.5B words / 3.7GB
BlueBERT | Wiki + Books | continual pretraining | PubMed + MIMIC | 4.5B words
PubMedBERT | PubMed | from scratch | PubMed | 3.1B words / 21GB
Table 5. Summary of pretraining details for the various BERT models used in our experiments. Statistics for prior BERT models are taken from their publications when available. The size of a text corpus such as PubMed may vary a bit, depending on downloading time and preprocessing (e.g., filtering out empty or very short abstracts). Both BioBERT and PubMedBERT also have a version pretrained with additional PMC full text; here we list the standard version pretrained using PubMed only.

For biomedical domain-specific pretraining, we generate the vocabulary and conduct pretraining using the latest collection of PubMed abstracts: 14 million abstracts, 3.2 billion words, 21 GB. (The original collection contains over 4 billion words; we filter out any abstracts with fewer than 128 words to reduce noise.)

We follow the standard pretraining procedure, based on the TensorFlow implementation released by NVIDIA. We use Adam (Kingma and Ba, 2015) as the optimizer with a standard slanted triangular learning rate schedule: the learning rate increases linearly from zero to its peak in the first 10% of steps, and then decays linearly to zero in the remaining 90% of steps. Training is done for 62,500 steps with a batch size of 8,192, which is comparable to the computation used in previous biomedical pretraining. (For example, BioBERT started with the standard BERT model, which was pretrained using 1M steps with a batch size of 256, and ran another 1M steps of continual pretraining.) Training takes about 5 days on one DGX-2 machine with 16 V100 GPUs. In preliminary experiments, we found that the cased version performs similarly to the uncased version, so we focus on uncased models in this study. We use whole-word masking (WWM) with a masking rate of 15%. We denote this BERT model as PubMedBERT.
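The slanted triangular schedule can be written down directly; the step count is the paper's, while the peak rate value below is a placeholder assumption.

```python
def learning_rate(step, total_steps, peak, warmup_frac=0.1):
    """Slanted triangular schedule: linear warm-up, then linear decay to zero."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return peak * step / warmup                                # first 10%
    return peak * (total_steps - step) / (total_steps - warmup)    # last 90%

total, peak = 62_500, 1e-4   # 62,500 steps from the paper; peak is illustrative
print(learning_rate(6_250, total, peak))   # the peak is reached at the 10% mark
```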

For comparison, we use BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), BioBERT (Lee et al., 2019), SciBERT (Beltagy et al., 2019), ClinicalBERT (Alsentzer et al., 2019), BlueBERT (Peng et al., 2019) from their public release. See Table 5 for an overview. BioBERT and BlueBERT conduct continual pretraining from BERT, whereas ClinicalBERT conducts continual pretraining from BioBERT; thus, they all share the same vocabulary as BERT. BioBERT comes with two versions. We use BioBERT++ (v1.1), which was trained for a longer time and performed better. ClinicalBERT also comes with two versions. We use Bio+Clinical BERT.

Prior pretraining work has explored two settings: BERT-BASE with 12 transformer layers and 100 million parameters; BERT-LARGE with 24 transformer layers and 300 million parameters. Prior work in biomedical pretraining uses BERT-BASE only. For head-to-head comparison, we also use BERT-BASE in pretraining PubMedBERT. BERT-LARGE appears to yield improved performance in some preliminary experiments. We leave more in-depth exploration to future work.

6.2. Task-Specific Fine-Tuning

For task-specific fine-tuning, we use Adam (Kingma and Ba, 2015) with the standard slanted triangular learning rate schedule (warm-up in the first 10% of steps and cool-down in the remaining 90% of steps) and a dropout probability of 0.1. For all datasets and BERT models, we use the development set to tune the hyperparameters over the same ranges: learning rate (1e-5, 3e-5, 5e-5), batch size (16, 32), and number of epochs (3–60). Due to the random initialization of the task-specific model and dropout, performance may vary across random seeds, especially for small datasets like BIOSSES, BioASQ, and PubMedQA. We report the average scores from ten runs for BIOSSES, BioASQ, and PubMedQA, and five runs for the others.
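The discrete part of this search space can be enumerated as below; the three epoch values are samples from the 3–60 range, an assumption for illustration.

```python
import itertools

# Hyperparameter ranges from the text; epoch values sampled for illustration.
learning_rates = [1e-5, 3e-5, 5e-5]
batch_sizes = [16, 32]
epochs = [3, 30, 60]

grid = list(itertools.product(learning_rates, batch_sizes, epochs))
print(len(grid))   # number of candidate configurations in this sketch
```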

6.3. Domain-Specific Pretraining vs Mixed-Domain Pretraining

BERT (uncased) | BERT (cased) | RoBERTa (cased) | BioBERT (cased) | SciBERT (uncased) | SciBERT (cased) | ClinicalBERT (cased) | BlueBERT (cased) | PubMedBERT (uncased)
BC5-chem 90.54 90.85 90.16 93.41 93.12 93.31 91.51 91.98 94.06
BC5-disease 82.36 81.52 81.96 85.31 85.34 85.40 84.18 84.63 86.63
NCBI-disease 86.85 86.65 87.24 89.47 88.61 89.18 87.31 89.20 88.81
BC2GM 84.56 85.00 84.64 87.38 86.56 86.86 85.00 85.46 87.81
JNLPBA 80.02 79.90 80.11 80.98 80.77 80.85 80.00 79.97 81.36
EBM PICO 72.34 71.70 73.02 73.18 73.12 73.06 72.06 72.54 73.38
ChemProt 71.86 71.54 72.98 76.14 75.24 75.00 72.04 71.46 77.24
DDI 80.04 79.34 79.52 80.88 81.06 81.22 78.20 77.78 82.36
GAD 77.72 77.28 77.72 80.94 80.90 79.66 78.40 77.24 82.34
BIOSSES 82.68 81.40 81.25 89.52 86.25 87.15 91.23 85.38 92.30
HoC 80.20 80.12 79.66 81.54 80.66 81.16 80.74 80.48 82.32
PubMedQA 51.62 49.96 52.84 60.24 57.38 51.40 49.08 48.44 55.84
BioASQ 70.36 74.44 75.20 84.14 78.86 74.22 68.50 68.71 87.56
BLURB score 76.27 76.04 76.59 80.51 79.02 78.32 77.44 76.45 81.35
Table 6. Comparison of pretrained language models on the BLURB biomedical NLP benchmark. The standard task-specific models are used in the same fine-tuning process for all BERT models. The BLURB score is the macro average of average test results for each of the six tasks (NER, PICO, relation extraction, sentence similarity, document classification, question answering). See Table 3 for the evaluation metric used in each task.
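As a check on the caption's definition, the PubMedBERT column of Table 6 reproduces its reported BLURB score when averaged this way (the per-task groupings follow the benchmark's six tasks):

```python
# Per-dataset test scores for PubMedBERT (last column of Table 6),
# grouped into the six BLURB tasks.
tasks = {
    "NER":      [94.06, 86.63, 88.81, 87.81, 81.36],
    "PICO":     [73.38],
    "RelExt":   [77.24, 82.36, 82.34],
    "SentSim":  [92.30],
    "DocClass": [82.32],
    "QA":       [55.84, 87.56],
}

task_avg = {t: sum(v) / len(v) for t, v in tasks.items()}
blurb = sum(task_avg.values()) / len(task_avg)   # macro average over tasks
print(round(blurb, 2))   # matches the 81.35 reported in Table 6
```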

We compare BERT models by applying them to the downstream NLP applications in BLURB. For each task, we conduct the same fine-tuning process using the standard task-specific model as specified in section 5. Table 6 shows the results.

By conducting domain-specific pretraining from scratch, PubMedBERT consistently outperforms all the other BERT models on most biomedical NLP tasks, often by a significant margin. The gains are most substantial against BERT models trained on out-domain text. Most notably, while RoBERTa has the largest pretraining corpus, its performance on biomedical NLP tasks is among the worst, similar to that of the original BERT model. Models using biomedical text in pretraining generally perform better. However, mixing out-domain data into pretraining generally leads to worse performance. In particular, even though clinical notes are more relevant to the biomedical domain than general-domain text, adding them does not confer any advantage, as evidenced by the results of ClinicalBERT and BlueBERT. Not surprisingly, BioBERT is the closest to PubMedBERT, as it also uses PubMed text for pretraining. However, by conducting domain-specific pretraining from scratch, including using a PubMed-derived vocabulary, PubMedBERT obtains consistent gains over BioBERT in most tasks. A notable exception is PubMedQA, but this dataset is small, and variance is relatively high across runs with different random seeds.

Compared to the published results for BioBERT, SciBERT, and BlueBERT in their original papers, our results are generally comparable or better for the tasks they have been evaluated on. The ClinicalBERT paper does not report any results on these biomedical applications (Alsentzer et al., 2019).

6.4. Ablation Study on Pretraining Techniques

Wiki + Books (Word Piece) | Wiki + Books (Whole Word) | PubMed (Word Piece) | PubMed (Whole Word)
BC5-chem 93.70 93.81 93.52 94.06
BC5-disease 85.38 86.14 85.87 86.63
NCBI-disease 89.08 89.42 88.36 88.81
BC2GM 87.19 87.48 86.73 87.81
JNLPBA 80.91 81.09 80.81 81.36
EBM PICO 73.30 73.52 73.44 73.38
ChemProt 75.04 76.70 75.72 77.24
DDI 81.30 82.60 80.84 82.36
GAD 79.48 80.30 79.34 82.34
BIOSSES 91.36 91.79 92.45 92.30
HoC 81.76 81.74 80.38 82.32
PubMedQA 52.20 55.92 54.76 55.84
BioASQ 73.69 76.41 78.51 87.56
BLURB score 79.20 80.11 79.77 81.35
Table 7. Evaluation of the impact of vocabulary and whole word masking on the performance of PubMedBERT on BLURB.
PubMed | PubMed + PMC | PubMed + PMC (longer training)
BC5-chem 94.06 94.13 93.84
BC5-disease 86.63 86.33 86.36
NCBI-disease 88.81 88.89 89.01
BC2GM 87.81 87.67 87.69
JNLPBA 81.36 81.17 81.08
EBM PICO 73.38 73.64 73.72
ChemProt 77.24 76.96 76.80
DDI 82.36 83.56 82.06
GAD 82.34 82.24 81.58
BIOSSES 92.30 90.39 92.31
HoC 82.32 82.16 82.62
PubMedQA 55.84 61.02 60.02
BioASQ 87.56 83.43 87.20
BLURB score 81.35 81.16 81.67
Table 8. Evaluation of the impact of pretraining text on the performance of PubMedBERT on BLURB. The first result column corresponds to the standard PubMedBERT pretrained using PubMed abstracts (“PubMed”). The second one corresponds to PubMedBERT trained using both PubMed abstracts and PMC full text (“PubMed+PMC”). The last one corresponds to PubMedBERT trained using both PubMed abstracts and PMC full text, for 60% longer (“PubMed+PMC (longer training)”).

To assess the impact of pretraining options on downstream applications, we conduct several ablation studies using PubMedBERT as a running example. Table 7 shows results assessing the effect of vocabulary and whole-word masking (WWM). With the original BERT vocabulary derived from Wikipedia and BookCorpus (obtained by continual pretraining from the original BERT model), results are significantly worse than with an in-domain vocabulary built from PubMed. Additionally, WWM leads to consistent improvements across the board, regardless of the vocabulary in use.

In our standard PubMedBERT pretraining, we used PubMed abstracts only. We also tried adding full-text articles from PubMed Central (PMC), which increases the total pretraining text substantially, to 16.8 billion words (107 GB). Surprisingly, this generally leads to a slight degradation in performance across the board. However, by extending pretraining for 60% longer (100K steps in total), the overall results improve and slightly outperform the standard PubMedBERT using only abstracts. The improvement is somewhat mixed across tasks, with some gaining and others losing. We hypothesize that the reason for this behavior is two-fold. First, full texts generally contain more noise than abstracts; as most existing biomedical NLP tasks are based on abstracts, full text may be slightly out-domain relative to them. Moreover, even if full texts are potentially helpful, their inclusion requires additional pretraining cycles to make use of the extra information.

PubMedBERT | + adversarial
BC5-chem 94.06 93.85
BC5-disease 86.63 86.20
NCBI-disease 88.81 88.74
BC2GM 87.81 87.43
JNLPBA 81.36 81.35
EBM PICO 73.38 72.92
ChemProt 77.24 77.04
DDI 82.36 83.62
GAD 82.34 81.48
BIOSSES 92.30 94.11
HoC 82.32 82.20
PubMedQA 55.84 53.30
BioASQ 87.56 82.71
BLURB score 81.35 80.91
Table 9. Comparison of PubMedBERT performance on BLURB using standard and adversarial pretraining.

Adversarial pretraining has been shown to be highly effective in boosting performance in general-domain applications (Liu et al., 2020). We thus conducted adversarial pretraining in PubMedBERT and compared its performance with standard pretraining (Table 9). Surprisingly, adversarial pretraining generally leads to a slight degradation in performance, with some exceptions such as sentence similarity (BIOSSES). We hypothesize that the reason may be similar to what we observe in pretraining with full texts. Namely, adversarial training is most useful if the pretraining corpus is more diverse and relatively out-domain compared to the application tasks. We leave a more thorough evaluation of adversarial pretraining to future work.

6.5. Ablation Study on Fine-Tuning Methods

Task-Specific Model | Linear Layer | Bi-LSTM
BC5-chem 94.06 93.78
BC5-disease 86.63 86.15
JNLPBA 81.36 81.35
ChemProt 77.24 75.40
DDI 82.36 81.70
GAD 82.34 81.80
Table 10. Comparison of linear layers vs recurrent neural networks for task-specific fine-tuning in named entity recognition (entity-level F1) and relation extraction (micro F1), all using the standard PubMedBERT.
Tagging Scheme | BIO | BIOUL | IO
BC5-chem 94.06 93.86 93.84
BC5-disease 86.63 86.66 86.92
JNLPBA 81.36 81.38 82.23
Table 11. Comparison of entity-level F1 for biomedical named entity recognition (NER) using different tagging schemes and the standard PubMedBERT.

In the above studies on pretraining methods, we fix the fine-tuning methods to the standard methods described in section 5. Next, we will study the effect of modeling choices in task-specific fine-tuning, by fixing the underlying pretrained language model to our standard PubMedBERT (WWM, PubMed vocabulary, pretrained using PubMed abstracts).

Prior to the current success of pretraining neural language models, standard NLP approaches were often dominated by sequential labeling methods, such as conditional random fields (CRF) and more recently recurrent neural networks such as LSTM. Such methods were particularly popular for named entity recognition (NER) and relation extraction.

With the advent of BERT models and the self-attention mechanism, the utility of explicit sequential modeling becomes questionable. The top layer of a BERT model already captures many non-linear dependencies across the entire text span. It is therefore conceivable that even a linear layer on top can perform competitively. We find that this is indeed the case for NER and relation extraction, as shown in Table 10: using a bidirectional LSTM (Bi-LSTM) does not lead to any gain over a linear layer.

We also investigate the tagging scheme used in NER. The standard BIO scheme distinguishes words by their position within an entity. For sequential tagging methods such as CRFs and LSTMs, distinguishing the position within an entity is potentially advantageous compared to the minimal IO scheme, which only distinguishes between inside and outside of entities. But for BERT models, once again, the utility of more complex tagging schemes is diminished. We thus conducted a head-to-head comparison of the tagging schemes using three biomedical NER tasks in BLURB. As Table 11 shows, the differences are minuscule, suggesting that with self-attention, the sequential nature of the tags is less essential in NER modeling.

Input Text | Classification Encoding | ChemProt | DDI
ORIGINAL | [CLS] | 50.52 | 37.00
ENTITY MARKERS | [CLS] | 77.72 | 82.22
Table 12. Evaluation of the impact of entity dummification and relation encoding in relation extraction, all using PubMedBERT. With entity dummification, the entity mentions in question are anonymized using entity type tags such as $DRUG or $GENE. With entity markers, special tags marking the start and end of an entity are added around the entity mentions in question. The relation encoding is derived from the special [CLS] token prepended to the beginning of the text or from the special entity start token, or by concatenating the contextual representations of the entity mentions in question.

The use of neural methods also has subtle, but significant, implications for relation extraction. Previously, relation extraction was generally framed as a classification problem with manually-crafted feature templates. To prevent overfitting and enhance generalization, the feature templates would typically avoid using the entities in question. Neural methods do not need hand-crafted features, but rather use the neural encoding of the given text span, including the entities themselves. This introduces a potential risk that the neural network may simply memorize the entity combination. This problem is particularly pronounced in self-supervision settings, such as distant supervision, because the positive instances are derived from entity tuples with known relations. As a result, it is a common practice to “dummify” entities (i.e., replace an entity with a generic tag such as $DRUG or $GENE) (Wang and Poon, 2018; Jia et al., 2019).

This risk remains in the standard supervised setting, such as in the tasks that comprise BLURB. We thus conducted a systematic evaluation of entity dummification and relation encoding, using two relation extraction tasks in BLURB.

For entity marking, we consider three variants: dummifying the entities in question; using the original text; and adding start and end tags to the entities in question. For relation encoding, we consider three schemes. In the encoding introduced by the original BERT paper, the special [CLS] token is prepended to the beginning of the text span, and its contextual representation at the top layer is used as the input for the final classification. Another standard approach concatenates the BERT encodings of the given entity mentions, each obtained by applying max pooling to the corresponding token representations. Finally, following prior work, we also consider simply concatenating the top contextual representations of the entity start tags, when entity markers are in use (Baldini Soares et al., 2019).

Table 12 shows the results. Simply using the original text indeed exposes the neural methods to a significant overfitting risk. Using [CLS] with the original text is the worst choice, as the relation encoding has a hard time distinguishing which entities in the text span are in question. Dummification remains the most reliable method, working with either relation encoding. Interestingly, using entity markers leads to slightly better results on both datasets, as they appear to prevent overfitting while preserving useful entity information. We leave it to future work to study whether this generalizes to all relation extraction tasks.

7. Related Work

Standard supervised learning requires labeled examples, which are expensive and time-consuming to annotate. Self-supervision using unlabeled text is thus a long-standing direction for alleviating the annotation bottleneck via transfer learning. Early methods focused on clustering related words using distributional similarity, such as Brown Clusters (Brown et al., 1992; Liang, 2005). With the revival of neural approaches, neural embeddings have become the new staple for transfer learning from unlabeled text. This starts with simple stand-alone word embeddings (Mikolov et al., 2013; Pennington et al., 2014), and evolves into more sophisticated pretrained language models, from LSTMs in ULMFiT (Howard and Ruder, 2018) and ELMo (Peters et al., 2018) to transformer-based models in GPT (Radford et al., 2018, 2019) and BERT (Devlin et al., 2019; Liu et al., 2019). Their success is fueled by access to large text corpora, advanced hardware such as GPUs, and a culmination of advances in optimization methods, such as Adam (Kingma and Ba, 2015) and the slanted triangular learning rate schedule (Howard and Ruder, 2018). Here, transfer learning goes from the pretrained language models to fine-tuned task-specific models for downstream applications.

As the community ventures beyond the standard newswire and Web domains, and begins to explore high-value verticals such as biomedicine, a different kind of transfer learning is brought into play by combining text from various domains in pretraining language models. The prevailing assumption is that such mixed-domain pretraining is advantageous. In this paper, we show that this type of transfer learning may not be applicable when there is a sufficient amount of in-domain text, as is the case in biomedicine. In fact, our experiments comparing clinical BERTs with PubMedBERT on biomedical NLP tasks show that even related text such as clinical notes may not be helpful, since we already have abundant biomedical text from PubMed. Our results show that we should distinguish different types of transfer learning and separately assess their utility in various situations.

There is a plethora of biomedical NLP datasets, especially from various shared tasks such as BioCreative (Smith et al., 2008; Arighi et al., 2011; Mao et al., 2014; Kim et al., 2015), BioNLP (Kim et al., 2011; Demner-Fushman et al., 2019), SemEval (Apidianaki et al., 2018; Bethard et al., 2017, 2016; Diab et al., 2013), and BioASQ (Nentidis et al., 2019). The focus has evolved from simple tasks, such as named entity recognition, to more sophisticated tasks, such as relation extraction and question answering, and new tasks have been proposed for emerging application scenarios such as evidence-based medical information extraction (Nye et al., 2018). However, while comprehensive benchmarks and leaderboards are available for the general domain (e.g., GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a)), they are still a rarity in biomedical NLP. In this paper, inspired by prior effort in this direction (Peng et al., 2019), we create the first leaderboard for biomedical NLP, BLURB — a comprehensive benchmark containing thirteen datasets for six tasks.

8. Conclusion

In this paper, we challenge a prevailing assumption in pretraining neural language models and show that domain-specific pretraining from scratch can significantly outperform mixed-domain pretraining such as continual pretraining from a general-domain language model, leading to new state-of-the-art results for a wide range of biomedical NLP applications. To facilitate this study, we create BLURB, a comprehensive benchmark for biomedical NLP featuring a diverse set of tasks such as named entity recognition, relation extraction, document classification, and question answering. To accelerate research in biomedical NLP, we release our state-of-the-art biomedical BERT models and set up a leaderboard based on BLURB.

Future directions include: further exploration of domain-specific pretraining strategies; incorporating more tasks in biomedical NLP; extension of the BLURB benchmark to clinical and other high-value domains.

