BioBERT: a pre-trained biomedical language representation model for biomedical text mining

01/25/2019
by   Jinhyuk Lee, et al.
Korea University

Biomedical text mining has become more important than ever as the number of biomedical documents rapidly grows. With the progress of machine learning, extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning is boosting the development of effective biomedical text mining models. However, as deep learning models require a large amount of training data, biomedical text mining with deep learning often fails due to the small sizes of training datasets in biomedical fields. Recent research on learning contextualized language representation models from text corpora sheds light on the possibility of leveraging a large number of unannotated biomedical text corpora. We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), a domain specific language representation model pre-trained on large-scale biomedical corpora. Based on the BERT architecture, BioBERT effectively transfers the knowledge of a large amount of biomedical text into biomedical text mining models. While BERT shows performance competitive with previous state-of-the-art models, BioBERT significantly outperforms them on three representative biomedical text mining tasks: biomedical named entity recognition (0.51 F1 score improvement), biomedical relation extraction (3.49 F1 score improvement), and biomedical question answering (9.61 MRR improvement), with minimal task-specific architecture modifications. We make the pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained and the source code for fine-tuning the models at https://github.com/dmis-lab/biobert.




1 Introduction

The volume of biomedical literature continues to increase rapidly. On average, more than 3,000 new articles are published every day in peer-reviewed journals, excluding pre-prints and technical reports such as clinical trial reports in various archives. PubMed alone contains a total of 29M articles as of Jan. 2019. Reports containing valuable information about new discoveries and new insights are continuously added to the already overwhelming amount of literature. Consequently, the demand for accurate biomedical text mining tools that can extract information from the literature keeps increasing.

Figure 1: Overview of the pre-training and fine-tuning of BioBERT

Recent progress in biomedical text mining models was made possible by advances in machine learning, and in particular deep learning, which uses neural networks with many layers. Deep learning models are highly effective and efficient as they require less feature engineering. Models based on Long Short-Term Memory (LSTM) networks and Conditional Random Fields (CRF) have greatly improved performance in biomedical named entity recognition (NER) over the last few years (Habibi et al., 2017; Wang et al., 2018; Yoon et al., 2018). Other deep learning based models have made improvements in biomedical text mining tasks such as relation extraction (Bhasuran and Natarajan, 2018) and question answering (Wiese et al., 2017).

Deep learning requires a large amount of training data (LeCun et al., 2015). However, in biomedical text mining tasks, the construction of a large training set is very costly as it requires the use of experts in biomedical fields. As a result, most biomedical text mining models cannot exploit the full power of deep learning due to the small amount of training data available for biomedical text mining tasks. To address the lack of training data, much recent research has focused on training multi-task models (Wang et al., 2018) or conducting data augmentation (Giorgi and Bader, 2018).

In this paper, we use transfer learning to address the lack of training data. Word2vec (Mikolov et al., 2013), which extracts knowledge from unannotated texts, has been one of the major advancements in natural language processing. However, in biomedical text mining tasks, Word2vec needs to be modified when applied to biomedical data due to the large differences in vocabulary and expressions between a biomedical corpus and a general domain corpus (Pyysalo et al., 2013). The recent development of ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) has proved the effectiveness of contextualized representations from a deeper structure (e.g., a bidirectional language model) for transfer learning. While contextualized representations extracted from a general domain corpus such as Wikipedia have been used in biomedical text mining studies (Sheikhshabbafghi et al., 2018), only a few studies have focused on directly extracting contextualized representations from a biomedical corpus (Zhu et al., 2018).

In this paper, we introduce BioBERT, a contextualized language representation model based on BERT for biomedical text mining tasks. We pre-train BioBERT on different combinations of general and biomedical domain corpora to observe how its performance on biomedical text mining tasks changes when domain specific corpora are used for pre-training. We evaluate BioBERT on the following three popular biomedical text mining tasks: named entity recognition, relation extraction, and question answering. We make the pre-trained weights of BioBERT and the code for fine-tuning BioBERT publicly available.

2 Approach

We propose BioBERT, a pre-trained language representation model for the biomedical domain. The overall process of pre-training and fine-tuning BioBERT is illustrated in Figure 1. First, we initialize BioBERT with BERT, which was pre-trained on general domain corpora (e.g., English Wikipedia, BooksCorpus). Then, BioBERT is pre-trained on biomedical domain corpora (e.g., PubMed abstracts, PMC full-text articles). Finally, BioBERT is fine-tuned on several datasets to show its effectiveness on each biomedical text mining task. Like the original BERT, BioBERT requires only a minimal number of task-specific parameters. Additionally, we experimented with different combinations and sizes of general domain corpora and biomedical corpora to analyze the role and effect of each corpus on pre-training. The contributions of our paper are as follows:

  • BioBERT is the first domain specific BERT pre-trained on biomedical corpora for 20 days on 8 V100 GPUs, to transfer knowledge from biomedical corpora.

  • With minimum architecture modification, BioBERT outperforms the current state-of-the-art models in biomedical named entity recognition by 0.51 F1 score, biomedical relation extraction by 3.49 F1 score, and biomedical question answering by 9.61 MRR score.

  • We make the pre-trained weights of BioBERT and the source code for fine-tuning BioBERT for each task publicly available. BioBERT can be used for various other downstream biomedical text mining tasks with minimal task-specific architecture modifications to help achieve state-of-the-art performance.

3 Methods

BioBERT basically has the same structure as BERT (Devlin et al., 2018). We briefly discuss the recently proposed BERT and then describe the pre-training process of BioBERT in detail. We also explain the fine-tuning of BioBERT for different tasks, which requires only a minimal amount of domain specific engineering.

3.1 BERT: Bidirectional Encoder Representations from Transformers

Learning word representations from a large amount of unannotated text is a long-established method. Word representations have been obtained either as byproducts of training on specific NLP tasks such as language modeling (Bengio et al., 2003), or by directly optimizing the representations based on various hypotheses (Mikolov et al., 2013; Pennington et al., 2014). While previous studies on word representations focused on learning context independent representations, recent works have focused on learning context dependent representations using NLP tasks. For instance, ELMo (Peters et al., 2018) uses a bidirectional language model, while CoVe (McCann et al., 2017) uses machine translation to embed context information into word representations.

BERT (Devlin et al., 2018) is a contextualized word representation model that is pre-trained with a masked language model objective using bidirectional Transformers. Due to the nature of language modeling, where future words cannot be seen, previous bidirectional language models (biLMs) were limited to a combination of two unidirectional language models (i.e., left-to-right and right-to-left). BERT instead uses a masked language model that predicts randomly masked words in a sequence, and hence can learn better bidirectional representations. Combined with a next sentence prediction task and a larger text corpus (3.3B words from English Wikipedia and BooksCorpus), BERT obtains state-of-the-art performance on most NLP tasks with minimal task specific architecture modifications.
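To make the masked-language-model objective concrete, the following is a minimal sketch using the Hugging Face transformers library (not the TensorFlow BERT code on which the paper builds). It masks one token in a sentence and lets a general domain BERT checkpoint predict it from both left and right context; the example sentence is illustrative.

```python
# Minimal sketch of BERT's masked-language-model objective using the Hugging Face
# "transformers" library (illustrative only; the paper uses Google's original
# TensorFlow BERT implementation).
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

# Mask one word; BERT predicts it from both left and right context.
text = f"The patient was treated with a {tokenizer.mask_token} inhibitor."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
top_ids = logits[0, mask_pos].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids))  # top-5 candidate fillers
```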

3.2 Pre-training BioBERT

The performance of language representation models largely depends on the size and quality of the corpora on which they are pre-trained. Since it was designed as a general purpose language representation model, BERT was pre-trained on English Wikipedia and BooksCorpus. However, biomedical domain texts contain a considerable number of domain specific proper nouns (e.g., BRCA1, c.248T>C) and terms (e.g., transcriptional, antimicrobial), which are understood mostly by biomedical researchers. As a result, NLP models designed for general purpose language understanding often obtain poor performance in biomedical text mining tasks (Habibi et al., 2017). Even human experts in biomedical fields obtain low performance in biomedical text mining tasks, which shows that these tasks are very difficult (Kim et al., 2018).

To apply text mining models to biomedical texts, several researchers have trained domain specific language representation models on biomedical corpora. Pyysalo et al. (2013) trained Word2vec on PubMed abstracts and PubMed Central full-text articles, which had substantial effects on biomedical text mining models. For instance, Habibi et al. (2017) showed that the performance of biomedical NER models improved when the pre-trained word vectors of Pyysalo et al. (2013) were used. Like these previous works, we also pre-train our model BioBERT on PubMed abstracts (PubMed) and PubMed Central full-text articles (PMC). The statistics of the text corpora on which BioBERT was pre-trained are listed in Table 1.

Corpus # of words (B) Domain
English Wikipedia 2.5B General
BooksCorpus 0.8B General
PubMed Abstracts 4.5B Biomedical
PMC Full-text articles 13.5B Biomedical
Table 1: List of text corpora used for BioBERT
Corpus Domain
Wiki + Books General
Wiki + Books + PubMed General + Biomedical
Wiki + Books + PMC General + Biomedical
Wiki + Books + PubMed + PMC General + Biomedical
Table 2: Combinations of text corpora for pre-training BioBERT. Wiki denotes English Wikipedia, Books denotes BooksCorpus, PubMed denotes PubMed abstracts, and PMC denotes PMC full-text articles.

Our pre-training procedure aims to answer the following three research questions. First, how does the performance of BERT on biomedical text mining tasks differ when it is pre-trained on biomedical domain corpora in addition to general domain corpora, compared with general domain corpora alone? Second, what is the optimal combination of pre-training corpora that leads to the best results in biomedical text mining tasks? Third, how large should the domain specific corpus be for effectively training BioBERT? To answer these questions, we pre-train BERT on different combinations of text corpora while varying the size of the corpora.

We list the tested pre-training corpus combinations in Table 2. There are four combinations of text corpora, all of which include the general domain corpus. For computational efficiency, whenever the Wiki + Books corpora are used, we initialize BioBERT with the pre-trained BERT provided by Devlin et al. (2018). We define BioBERT as a language representation model whose pre-training corpora include biomedical corpora (e.g., Wiki + Books + PubMed). For brevity, we denote only the additional biomedical corpora used for pre-training BioBERT, such as BioBERT (+ PubMed), BioBERT (+ PMC), and BioBERT (+ PubMed + PMC), although the pre-training of BioBERT always includes the Wiki + Books corpora.
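As a rough sketch of this continued pre-training step (initializing from general domain BERT weights, then pre-training further on biomedical text), the code below uses the Hugging Face transformers Trainer rather than the paper's TensorFlow implementation. It covers only the masked-LM objective (the next sentence prediction task and distributed multi-GPU training used in the paper are omitted), the file path is a placeholder for pre-processed PubMed/PMC text, and the hyperparameter values are standard BERT defaults rather than values taken from the paper.

```python
# Hedged sketch of domain-adaptive pre-training: start from general domain BERT
# weights and continue masked-LM training on biomedical text. The original work
# uses Google's TensorFlow BERT code (MLM + next sentence prediction) on 8 V100s;
# this single-GPU sketch covers only the MLM part. "pubmed_sentences.txt" is a
# placeholder for pre-processed PubMed/PMC text.
from transformers import (BertTokenizer, BertForMaskedLM,
                          DataCollatorForLanguageModeling, LineByLineTextDataset,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")  # re-use BERT's cased WordPiece vocabulary
model = BertForMaskedLM.from_pretrained("bert-base-cased")    # initialize from general domain weights

dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="pubmed_sentences.txt",  # placeholder corpus file
                                block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="biobert_pubmed",
                         per_device_train_batch_size=32,
                         max_steps=200_000,      # roughly one epoch over PubMed abstracts per the paper
                         learning_rate=1e-4,     # standard BERT pre-training value, assumed here
                         save_steps=50_000)

Trainer(model=model, args=args, data_collator=collator,
        train_dataset=dataset).train()
```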

Task Dataset Entity Type # of Train # of Valid # of Test
Named Entity Recognition:
NCBI disease (Doğan et al., 2014) Disease 5,423 922 939
2010 i2b2/VA (Uzuner et al., 2011) [2] Disease 16,315 - 27,626
BC5CDR (Li et al., 2016) Disease 4,559 4,580 4,796
BC5CDR (Li et al., 2016) Drug/Chemical 4,559 4,580 4,796
BC4CHEMD (Krallinger et al., 2015) Drug/Chemical 30,681 30,638 26,363
BC2GM (Smith et al., 2008) Gene/Protein 12,573 2,518 5,037
JNLPBA (Kim et al., 2004) Gene/Protein 14,690 3,856 3,856
LINNAEUS (Gerner et al., 2010) Species 11,934 7,141 4,077
Species-800 (Pafilis et al., 2013) Species 5,733 830 1,630
Relation Extraction:
GAD (Bravo et al., 2015) [1] Gene-Disease 5,339 - -
EU-ADR (Van Mulligen et al., 2012) [1] Gene-Disease 355 - -
CHEMPROT (Krallinger et al., 2017) Gene-Chemical 16,522 10,362 14,443
Question Answering:
BioASQ 4b (Tsatsaronis et al., 2015) [2] Multiple 327 - 161
BioASQ 5b (Tsatsaronis et al., 2015) [2] Multiple 486 - 150
BioASQ 6b (Tsatsaronis et al., 2015) [2] Multiple 618 - 161

[1] Datasets with no validation/test sets. We report 10-fold cross-validation results on these datasets.

[2] Datasets with no validation sets. We perform a hyperparameter search based on 10-fold cross-validation results on the training data.

Table 3: Biomedical text mining task datasets

3.3 Fine-tuning BioBERT

Like BERT, BioBERT is applied to various downstream text mining tasks while requiring only minimal architecture modification. One common setting when fine-tuning BERT (and BioBERT) is the use of WordPiece tokenization (Wu et al., 2016), which mitigates the out-of-vocabulary issue. With WordPiece tokenization, any new word can be represented by frequent subwords (e.g., Immunoglobulin → I ##mm ##uno ##g ##lo ##bul ##in). We briefly describe the three biomedical text mining tasks along with some task-specific settings.
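The snippet below illustrates this subword splitting with a general domain WordPiece tokenizer; it is illustrative only, and the exact split depends on the vocabulary of the checkpoint used.

```python
# WordPiece splits out-of-vocabulary words into frequent subword units, so no
# token is ever truly "unknown". Illustrative only; the exact split depends on
# the checkpoint's vocabulary.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.tokenize("Immunoglobulin"))
# e.g. ['I', '##mm', '##uno', '##g', '##lo', '##bul', '##in'], as in the paper's example
print(tokenizer.tokenize("BRCA1 c.248T>C is a domain-specific term"))
```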

Named Entity Recognition is one of the most fundamental biomedical text mining tasks, which involves recognizing numerous domain specific proper nouns in a biomedical corpus. While most previous works were built upon LSTMs and CRFs (Habibi et al., 2017; Yoon et al., 2018), BERT does not use complex architectures: only token level BIO probabilities are computed using a single output layer based on the representations from the last layer of BERT. When fine-tuning BioBERT (or BERT), we found that the model performed similarly regardless of the choice of BIO or BIOES tagging, whereas previous LSTM + CRF based models performed better with BIOES. This may be because the BioBERT (and BERT) architecture does not use a CRF. Note that while previous works in biomedical NER often use word embeddings trained on PubMed or PMC (Habibi et al., 2017; Yoon et al., 2018), we do not need them, as BioBERT learns WordPiece embeddings from scratch during pre-training.
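A minimal sketch of this token-level tagging setup, using the Hugging Face transformers library rather than the paper's TensorFlow code: a single linear layer on top of the final hidden states produces BIO label scores per WordPiece token. The checkpoint name and the label set are assumptions for illustration.

```python
# Token-level NER head: a single output (linear) layer over BERT's last hidden
# states yields BIO tag scores per WordPiece token. Hedged sketch with the
# transformers library; "dmis-lab/biobert-v1.1" is assumed to be a released
# BioBERT checkpoint and the label set is illustrative.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

labels = ["O", "B-Disease", "I-Disease"]          # BIO scheme; BIOES performed similarly per the paper
tokenizer = BertTokenizerFast.from_pretrained("dmis-lab/biobert-v1.1")
model = BertForTokenClassification.from_pretrained("dmis-lab/biobert-v1.1",
                                                   num_labels=len(labels))

sentence = "Familial hemiplegic migraine is a rare disorder."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits               # shape: (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0]
for token, pid in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), pred_ids):
    print(token, labels[int(pid)])                # untrained head: predictions are random until fine-tuned
```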

Relation Extraction is a task of classifying relations between named entities in a biomedical corpus. As relation extraction can be regarded as a sentence classification task, we utilized the sentence classifier of the original version of BERT, which uses a [CLS] token for classification. Sentence classification is performed using a single output layer based on the [CLS] token representation from BERT. There are two methods of representing target named entities in relation extraction: we can either anonymize the named entities using pre-defined tags (e.g., @GENE$) (Bhasuran and Natarajan, 2018), or wrap the target entities with pre-defined tags (e.g., <GENE>, </GENE>) (Lee et al., 2016). We found that the first method consistently obtains better results, and we report the scores obtained using this method.
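A rough sketch of this setup: the target entities are anonymized with pre-defined tags, and the sentence is classified from the [CLS] representation through a single output layer. Again, this uses the transformers library rather than the original code, and the checkpoint name, tag names, and binary label set are illustrative assumptions.

```python
# Relation extraction as sentence classification: anonymize the target entities
# with pre-defined tags (@GENE$, @DISEASE$) and classify the sentence using the
# [CLS] token representation and a single output layer. Hedged sketch; the
# checkpoint name and label set are assumptions.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("dmis-lab/biobert-v1.1")
model = BertForSequenceClassification.from_pretrained("dmis-lab/biobert-v1.1",
                                                      num_labels=2)  # related / not related

# Original sentence: "Mutations in BRCA1 are associated with breast cancer."
anonymized = "Mutations in @GENE$ are associated with @DISEASE$."

inputs = tokenizer(anonymized, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits      # classification over the [CLS] representation
print(torch.softmax(logits, dim=-1))     # untrained head: scores are meaningless until fine-tuned
```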

Question Answering is a task of answering questions posed in natural language given related passages. Biomedical question answering involves answering patients' questions such as "In which breast cancer patients can palbociclib be used?" or questions raised by biomedical researchers such as "Where is the protein Pannexin1 located?" To fine-tune BioBERT for QA, we use the BERT model designed for SQuAD (Rajpurkar et al., 2016). As existing QA datasets used for biomedical text mining do not always assume an extractive setting (i.e., answers always exist as exact contiguous words in the passages) like the SQuAD dataset does, we use biomedical QA datasets such as BioASQ 5b factoid, which can adopt the same structure as the BERT model for SQuAD. Analogous to NER, token level probabilities for the start/end positions of answer phrases are computed using a single output layer. However, we observed that about 30% of each BioASQ factoid dataset was unanswerable in the extractive setting, as the exact answer phrases did not appear in the given passages. Like Wiese et al. (2017), we excluded the unanswerable questions from the training sets, but regarded them as wrong predictions in the test sets. Also, we used the same pre-training process as Wiese et al. (2017), which uses SQuAD (Rajpurkar et al., 2016), and it consistently improved the performance of BioBERT on the BioASQ datasets.
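A minimal extractive-QA sketch in the SQuAD style described above: a single output layer produces start and end scores per token, and the predicted answer is the span between the argmax start and end positions. It uses the transformers library; the checkpoint name is an assumption, and no SQuAD/BioASQ fine-tuning has been applied here.

```python
# Extractive QA head (SQuAD style): token-level start/end scores from a single
# output layer; the predicted answer is the span between the argmax start and
# end positions. Hedged sketch; the checkpoint name is an assumption and the
# head is not yet fine-tuned on SQuAD/BioASQ.
import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering

tokenizer = BertTokenizerFast.from_pretrained("dmis-lab/biobert-v1.1")
model = BertForQuestionAnswering.from_pretrained("dmis-lab/biobert-v1.1")

question = "Where is the protein Pannexin1 located?"
passage = ("Pannexin1 is localized to the plasma membrane, "
           "where it forms large-pore channels.")

inputs = tokenizer(question, passage, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)
start = int(out.start_logits.argmax())
end = int(out.end_logits.argmax())
answer_ids = inputs["input_ids"][0, start:end + 1]
print(tokenizer.decode(answer_ids))   # meaningful only after fine-tuning on SQuAD/BioASQ
```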

Entity Type Datasets Metrics State-of-the-art BERT (Wiki + Books) BioBERT (+ PubMed) BioBERT (+ PMC) BioBERT (+ PubMed + PMC)
Disease NCBI disease P 86.41 84.12 86.76 86.16 89.04
(Doğan et al., 2014) R 88.31 87.19 88.02 89.48 89.69
F 87.34 85.63 87.38 87.79 89.36
2010 i2b2/VA P 87.44 84.04 85.37 85.55 87.50
(Uzuner et al., 2011) R 86.25 84.08 85.64 85.72 85.44
F 86.84 84.06 85.51 85.64 86.46
BC5CDR P 85.61 81.97 85.80 84.67 85.86
(Li et al., 2016) R 82.61 82.48 86.60 85.87 87.27
F 84.08 82.41 86.20 85.27 86.56
Drug/Chemical BC5CDR P 94.26 90.94 92.52 92.46 93.27
(Li et al., 2016) R 92.38 91.38 92.76 92.63 93.61
F 93.31 91.16 92.64 92.54 93.44
BC4CHEMD P 91.30 91.19 91.77 91.65 92.23
(Krallinger et al., 2015) R 87.53 88.92 90.77 90.30 90.61
F 89.37 90.04 91.26 90.97 91.41
Gene/Protein BC2GM P 81.81 81.17 81.72 82.86 85.16
(Smith et al., 2008) R 81.57 82.42 83.38 84.21 83.65
F 81.69 81.79 82.54 83.53 84.40
JNLPBA P 74.43 69.57 71.11 71.17 72.68
(Kim et al., 2004) R 83.22 81.20 83.11 82.76 83.21
F 78.58 74.94 76.65 76.53 77.59
Species LINNAEUS P 92.80 91.17 91.83 91.62 93.84
(Gerner et al., 2010) R 94.29 84.30 84.72 85.48 86.11
F 93.54 87.6 88.13 88.45 89.81
Species-800 P 74.34 69.35 70.60 71.54 72.84
(Pafilis et al., 2013) R 75.96 74.05 75.75 74.71 77.97
F 74.98 71.63 73.08 73.09 75.31
Average P 85.38 82.61 84.16 84.19 85.82
R 85.79 84.00 85.64 85.68 86.40
F 85.53 83.25 84.82 84.87 86.04
Table 4: Test results in biomedical named entity recognition. Precision (P), Recall (R), and F1 (F) scores on each dataset are reported. The best scores are in bold, and the second best scores are underlined. State-of-the-art scores for the NCBI disease and BC2GM datasets were obtained from Sachan et al. (2017), scores for the 2010 i2b2/VA dataset were obtained from Zhu et al. (2018) (single model), scores for the BC5CDR and JNLPBA datasets were obtained from Yoon et al. (2018), scores for the BC4CHEMD dataset were obtained from Wang et al. (2018), and scores for the LINNAEUS and Species-800 datasets were obtained from Giorgi and Bader (2018).

4 Results

4.1 Experimental Setups

Following the work of Devlin et al. (2018), BERT was pre-trained for 1,000,000 steps on English Wikipedia and BooksCorpus. However, when BioBERT was initialized with BERT, we observed that 200K and 270K pre-training steps were sufficient for PubMed and PMC, respectively, both of which roughly correspond to a single epoch on each corpus. Other hyper-parameters, such as the batch size and the learning rate schedule for pre-training BioBERT, are the same as those for BERT unless stated otherwise. We used WordPiece tokenization (Wu et al., 2016) for pre-processing both the pre-training corpora and the fine-tuning datasets.

The datasets used for each biomedical text mining task are listed in Table 3. We collected NER datasets that are publicly available for each entity type. The RE datasets contain gene-disease relations and gene-chemical relations. As stated in Section 3.3, we used the BioASQ factoid datasets, which can be converted into the same format as the SQuAD dataset. Note that the sizes of the datasets are often very small (e.g., 486 training samples in BioASQ 5b), whereas general domain NLP datasets usually contain tens of thousands of training samples or more (e.g., 87,599 training samples in SQuAD v1.1). To compare BERT and BioBERT with other models, we also report the scores of the current state-of-the-art models. Note that the state-of-the-art models differ greatly from each other in architecture, while BERT and BioBERT have almost the same structure across the tasks, as described in Section 3.
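As a sketch of the conversion mentioned above, the snippet below builds one SQuAD-v1.1-style record from a question, a snippet passage, and an exact answer. The input values are hypothetical stand-ins (they do not reproduce the BioASQ JSON schema), and items whose exact answer does not occur verbatim in the passage are dropped from training, as described in Section 3.3.

```python
# Hedged sketch: convert one factoid QA item into a SQuAD-v1.1-style record so
# the BERT/BioBERT SQuAD model can be reused. The fields below are hypothetical
# stand-ins for the BioASQ data; items whose exact answer does not occur
# verbatim in the passage are dropped from training (Section 3.3).
import json

def to_squad_example(qid, question, passage, exact_answer):
    start = passage.find(exact_answer)
    if start == -1:                      # unanswerable in the extractive setting
        return None
    return {
        "title": "bioasq",
        "paragraphs": [{
            "context": passage,
            "qas": [{
                "id": qid,
                "question": question,
                "answers": [{"text": exact_answer, "answer_start": start}],
            }],
        }],
    }

example = to_squad_example(
    "bioasq_factoid_0001",               # hypothetical question id
    "Is RANKL secreted from the cells?",
    "RANKL is a secreted cytokine that binds to RANK.",
    "secreted",
)
print(json.dumps({"version": "1.1", "data": [example]}, indent=2))
```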

We pre-trained BioBERT using Naver Smart Machine Learning (NSML) (Sung et al., 2017), which is used for large-scale experiments that need to be run on several GPUs. We used 8 NVIDIA V100 (32GB) GPUs for pre-training (131,072 words/batch) using NSML. In this setting, it takes more than 10 days to pre-train BioBERT (+ PubMed + PMC), and nearly 5 days each for BioBERT (+ PubMed) and BioBERT (+ PMC). Despite our best efforts to use BERT-LARGE, we used only BERT-BASE due to the computational complexity of BERT-LARGE. We used a single NVIDIA Titan Xp (12GB) GPU to fine-tune BioBERT on each task. Note that the fine-tuning process is much more computationally efficient than pre-training BioBERT. For fine-tuning, the batch size was selected from {10, 16, 32, 64} and the learning rate from {5e-5, 3e-5, 1e-5} using the validation set of each task. Fine-tuning on each task usually takes less than an hour as the size of the training data is much smaller than that of the datasets used by Devlin et al. (2018).
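The small grid search over these fine-tuning hyperparameters can be written as below; fine_tune_and_evaluate is a hypothetical helper, not part of the released code, that fine-tunes BioBERT with the given setting and returns a validation score.

```python
# Sketch of the small fine-tuning grid search described above: batch size from
# {10, 16, 32, 64} and learning rate from {5e-5, 3e-5, 1e-5}, selected on the
# validation set. "fine_tune_and_evaluate" is a hypothetical helper.
from itertools import product

def fine_tune_and_evaluate(batch_size: int, learning_rate: float) -> float:
    raise NotImplementedError  # placeholder: task-specific fine-tuning + validation scoring

best_score, best_config = float("-inf"), None
for batch_size, lr in product([10, 16, 32, 64], [5e-5, 3e-5, 1e-5]):
    score = fine_tune_and_evaluate(batch_size, lr)
    if score > best_score:
        best_score, best_config = score, (batch_size, lr)
print("best (batch size, learning rate):", best_config)
```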

4.2 Named Entity Recognition

The results of NER are shown in Table 4. Note that the state-of-the-art models differ from each other in architecture, and some of them are based on multi-task learning (Wang et al., 2018; Yoon et al., 2018). BERT and BioBERT use a single architecture across the datasets, and multi-task learning was not used. For the evaluation metric, we used entity level precision, recall, and F1 score. First, we observe that BERT, which was pre-trained on only the general domain corpus, is quite effective. However, on average, the performance of BERT was lower than that of the state-of-the-art models by 2.28 in terms of F1 score. BioBERT achieves higher scores than BERT on all the datasets. On 6 out of 9 datasets, BioBERT even outperformed the current state-of-the-art models, and BioBERT (+ PubMed + PMC) outperformed the state-of-the-art models by 0.51 in terms of F1 score on average. Although we could not outperform the state-of-the-art scores on some datasets, we observed BioBERT's consistent performance improvement over BERT. On average, the order in performance from lowest to highest is as follows: BERT < BioBERT (+ PubMed) < BioBERT (+ PMC) < state-of-the-art models < BioBERT (+ PubMed + PMC).
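The paper does not name its evaluation code; one common way to compute entity level precision, recall, and F1 over BIO-tagged sequences is the seqeval package, sketched below for illustration.

```python
# Entity-level P/R/F1 over BIO tag sequences, as commonly computed with the
# seqeval package (illustrative; the paper does not specify its evaluation code).
from seqeval.metrics import precision_score, recall_score, f1_score

y_true = [["B-Disease", "I-Disease", "O", "O", "B-Disease"]]
y_pred = [["B-Disease", "I-Disease", "O", "B-Disease", "B-Disease"]]

print(precision_score(y_true, y_pred))  # correct entities / predicted entities -> 2/3
print(recall_score(y_true, y_pred))     # correct entities / gold entities      -> 2/2
print(f1_score(y_true, y_pred))         # harmonic mean of the two              -> 0.8
```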

Entity Type Datasets Metrics State-of-the-art BERT (Wiki + Books) BioBERT (+ PubMed) BioBERT (+ PMC) BioBERT (+ PubMed + PMC)
Gene-Disease GAD P 79.21 74.28 76.43 75.20 75.95
(Bravo et al., 2015) R 89.25 85.11 87.65 86.15 88.08
F 83.93 79.33 81.66 80.30 81.57
EU-ADR P 76.43 75.45 78.04 81.05 80.92
(Van Mulligen et al., 2012) R 98.01 96.55 93.86 93.90 90.81
F 85.34 84.71 85.22 87.00 85.58
Gene-Chemical CHEMPROT P 74.80 74.01 75.50 73.71 76.63
(Krallinger et al., 2017) R 56.00 70.79 76.86 75.55 76.74
F 64.10 72.36 76.17 74.62 76.68
Average P 76.81 74.58 76.66 76.66 77.83
R 81.09 84.15 86.12 85.20 85.21
F 77.79 78.80 81.02 80.64 81.28
Table 5: Biomedical relation extraction test results. Precision (P), Recall (R), and F1 (F) scores on each dataset are reported. The best scores are in bold, and the second best scores are underlined. State-of-the-art scores for the GAD and EU-ADR datasets were obtained from Bhasuran and Natarajan (2018), and the scores for the CHEMPROT dataset were obtained from Lim and Kang (2018).
Datasets Metrics State-of-the-art BERT (Wiki + Books) BioBERT (+ PubMed) BioBERT (+ PMC) BioBERT (+ PubMed + PMC)
BioASQ 4b S 20.59 26.75 28.96 35.59 36.48
(Tsatsaronis et al., 2015) L 29.24 40.80 45.46 52.43 48.89
M 24.04 32.35 34.95 42.09 41.05
BioASQ 5b S 41.82 37.73 43.37 40.45 41.56
(Tsatsaronis et al., 2015) L 57.43 49.59 55.75 52.32 54.00
M 47.73 42.50 48.28 44.57 46.32
BioASQ 6b S 25.12 28.45 38.83 37.12 35.58
(Tsatsaronis et al., 2015) L 40.20 46.85 50.55 48.42 51.39
M 29.28 35.18 42.76 41.15 42.51
Average S 29.18 30.97 37.05 37.72 37.87
L 42.29 45.75 50.59 51.06 51.43
M 33.68 36.68 42.00 42.60 43.29
Table 6: Biomedical question answering test results. Strict Accuracy (S), Lenient Accuracy (L), and Mean Reciprocal Rank (M) scores on each dataset are reported. The best scores are in bold, and the second best scores are underlined. State-of-the-art scores were obtained from the best BioASQ 4b/5b/6b scores on the BioASQ leaderboard (http://participants-area.bioasq.org). Note that for the state-of-the-art models, we averaged the best scores from each batch (possibly from multiple different models), while BERT and BioBERT were evaluated using single models on every batch.

4.3 Relation Extraction

The RE results are shown in Table 5. Unlike the NER results, BERT achieves better performance than that of the state-of-the-art models, which demonstrates its effectiveness in RE. On average, BioBERT (+ PubMed + PMC) outperformed the state-of-the-art models by 3.49 in terms of F1 score. On 2 out of 3 biomedical RE datasets, BioBERT achieved state-of-the-art performance in terms of F1 score. Considering the complexity of the state-of-the-art models which use numerous linguistic features (Bhasuran and Natarajan, 2018), the results are very promising. On average, we see the following order in performance: state-of-the-art < BERT < BioBERT.

4.4 Question Answering

The QA results are shown in Table 6. Note that transfer learning is particularly important in biomedical QA, as the average size of the datasets is very small (only a few hundred samples each). BERT outperforms the state-of-the-art models by 3.00 in terms of MRR on average. BioBERT (+ PubMed + PMC) significantly outperforms BERT and the state-of-the-art models, obtaining a Strict Accuracy of 37.87, a Lenient Accuracy of 51.43, and a Mean Reciprocal Rank of 43.29 on average. On all the biomedical QA datasets, BioBERT achieved new state-of-the-art performance in terms of MRR. Also, the average performance shows the same pattern as in the relation extraction task, which further demonstrates the effectiveness of BioBERT.
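For reference, the three metrics reported in Table 6 can be computed from each system's ranked list of candidate answers roughly as sketched below: strict accuracy counts only rank-1 matches, lenient accuracy counts a match anywhere in the (short) ranked list, and MRR averages the reciprocal rank of the first match. This is an illustrative simplification of the official BioASQ evaluation, which additionally normalizes answer strings.

```python
# Illustrative computation of the BioASQ factoid metrics from ranked candidate
# lists (simplified; the official evaluation also normalizes answer strings).
def bioasq_metrics(ranked_answers, gold_answers):
    strict = lenient = mrr = 0.0
    for candidates, gold in zip(ranked_answers, gold_answers):
        ranks = [i for i, c in enumerate(candidates, start=1) if c in gold]
        if ranks:
            lenient += 1
            mrr += 1.0 / ranks[0]
            if ranks[0] == 1:
                strict += 1
    n = len(gold_answers)
    return strict / n, lenient / n, mrr / n

# Toy example: two questions, each with a ranked candidate list and gold answers.
preds = [["palbociclib", "letrozole"], ["nucleus", "plasma membrane"]]
golds = [{"palbociclib"}, {"plasma membrane"}]
print(bioasq_metrics(preds, golds))   # (0.5, 1.0, 0.75)
```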

Figure 2: Effects of varying the size of the PubMed corpus for pre-training

5 Discussion

5.1 Size of the Corpus for Pre-training

To investigate the effect of the pre-training corpus size, we experimented with pre-training corpora of different sizes. Using BioBERT (+ PubMed), we set the number of pre-training steps to 200K and varied the size of the PubMed corpus. Figure 2 shows how the performance of the NER models on three NER datasets (NCBI disease, BC2GM, BC4CHEMD) changes in relation to the size of the PubMed corpus. Note that the NCBI disease dataset is the smallest and BC4CHEMD is the largest among the NER datasets. Pre-training on 1 billion words is already quite effective, and the performance on these datasets mostly improves until 4.5 billion words.

5.2 Number of Pre-training Steps

We saved the pre-trained weights of BioBERT (+ PubMed) at different pre-training steps to measure how the number of pre-training steps affects performance on fine-tuning tasks. We pre-trained BioBERT on the entire PubMed corpus for an additional 200K pre-training steps. Figure 3 shows how the performance of the NER models on three NER datasets (NCBI disease, BC2GM, BC4CHEMD) changes in relation to the number of pre-training steps. The results clearly show that the performance on each dataset improves as the number of pre-training steps increases. After 160K steps, the performance on each dataset seems to converge.

5.3 Size of the Fine-tuning dataset

As we used datasets of various sizes for fine-tuning, we investigated the correlation between the size of the fine-tuning dataset and the performance improvement of BioBERT over BERT. Figure 4 shows the performance improvements of BioBERT (+ PubMed + PMC) over BERT on all 15 datasets. F1 scores were used for NER/RE and MRR scores for QA. BioBERT significantly improves performance on most datasets, and the improvements on small datasets are considerable. This shows the effectiveness of BioBERT on small datasets, which are common in biomedical (and other domain specific) text mining tasks.

Figure 3: Performance of BioBERT for NER at different checkpoints.

Figure 4: Performance improvement of BioBERT (+ PubMed + PMC) over BERT.

6 Conclusion

In this paper, we introduced BioBERT, which is a pre-trained language representation model for biomedical text mining. While BERT was built for general purpose language understanding, BioBERT effectively leverages domain specific knowledge from a large set of unannotated biomedical texts. With minimal task-specific architecture modification, BioBERT outperforms previous models on biomedical text mining tasks such as NER, RE, and QA. We also provide some insights on the amount of domain specific data needed to build a domain specific BERT model. In future work, we plan to leverage BioBERT in other biomedical text mining tasks.

Funding

This work was supported by the National Research Foundation of Korea [NRF-2017R1A2A1A17069645, NRF-2017M3C4A7065887, NRF-2014M3C9A3063541].

References