SciBERT: Pretrained Contextualized Embeddings for Scientific Text

by Iz Beltagy et al.

Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained contextualized embedding model based on BERT (Devlin et al., 2018) to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks.







1 Introduction

There has been an exponential increase in the volume of scientific publications in the past decades, making NLP an essential tool for large-scale knowledge extraction and machine reading of scientific documents. Recent progress in NLP has been driven by the adoption of deep neural network models, but training such models often requires large amounts of labeled data. In general domains, large-scale training data can often be obtained through crowdsourcing. In scientific domains, however, annotated data is difficult and expensive to collect due to the expertise required to perform annotation tasks.

As shown by Elmo Peters et al. (2018), GPT Radford et al. (2018) and Bert Devlin et al. (2018), unsupervised pretraining of language models on large corpora significantly improves performance on many NLP tasks. These models read a sentence and return a contextualized embedding for each token, which can then be used in task-specific neural architectures. Given the success of these models across a variety of NLP tasks, leveraging unsupervised pretraining has become especially useful when task-specific annotations are difficult to obtain. Yet while both Bert and Elmo have released pretrained models, these are still trained on general-domain corpora such as news articles and Wikipedia.

In this work, we make the following contributions:

(i) We release SciBert, a new resource to successfully tackle a broad set of NLP tasks in the scientific domain. SciBert is based on Bert trained on a large corpus of scientific text. The code and pretrained models are available at

(ii) We evaluate SciBert against Bert on a suite of tasks in the scientific domain, i.e., sequence tagging, parsing, and text classification. (iii) With SciBert, we achieve new state-of-the-art results on many of these tasks with minimal task-specific architectures and without any hyperparameter search or fine-tuning.

2 Methods


The Bert model architecture Devlin et al. (2018) is based on a multilayer bidirectional Transformer Vaswani et al. (2017). Instead of the traditional left-to-right language modeling objective, Bert is pretrained on two tasks: predicting randomly masked tokens and predicting whether two sentences follow each other. SciBert follows the same architecture as Bert and is optimized on scientific text.
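The masked-token objective can be illustrated with a short sketch. This is a simplified toy version: the actual Bert recipe selects 15% of tokens and sometimes keeps or randomly replaces a selected token rather than always substituting the mask symbol.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    # Hide a random subset of tokens; keep the originals as prediction targets.
    # (Simplified: real BERT masking also keeps/replaces some chosen tokens.)
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok        # position -> original token to predict
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

masked, targets = mask_tokens("the gene encodes a membrane protein".split())
```

The model is then trained to recover each entry of `targets` from the corrupted sequence; the second pretraining task (next-sentence prediction) is a separate binary classification over sentence pairs.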

Model     Corpus             # of words
Bert      English Wiki       2.5B
          BooksCorpus        0.8B
SciBert   Biomedical         2.5B
          Computer Science   0.6B

Table 1: Size of pretraining corpora


Bert uses WordPiece for unsupervised tokenization of the input text. WordPiece, similar to byte pair encoding Sennrich et al. (2016), relies on a vocabulary of subword units. The vocabulary is built such that it contains the most frequently used words or subwords. We refer to the original vocabulary released with Bert as BaseVocab.

Frequently observed words and subwords in scientific text may differ from those occurring in general-domain text. To better capture the scientific vocabulary, we use the SentencePiece library to build a new WordPiece vocabulary, SciVocab, on our scientific corpus. We produce both cased and uncased vocabularies and set the vocabulary size to 30K, matching Bert. The resulting token overlap between BaseVocab and SciVocab is 42%.
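The overlap figure can be read as a simple set intersection between the two vocabulary files. A minimal sketch with toy, made-up subword entries (not the actual BaseVocab or SciVocab contents):

```python
def vocab_overlap(vocab_a, vocab_b):
    """Fraction of vocab_a's entries that also appear in vocab_b.

    Both vocabularies are treated as sets of (sub)word strings; for two
    vocabularies of equal size (30K here) this measure is symmetric.
    """
    a, b = set(vocab_a), set(vocab_b)
    return len(a & b) / len(a)

# Toy, hypothetical wordpiece entries for illustration only:
base = ["the", "##ing", "cell", "acid", "network"]
sci = ["the", "##ing", "protein", "acid", "genome"]
print(vocab_overlap(base, sci))  # 3 of 5 entries shared -> 0.6
```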


We train SciBert on a random sample of 1.14M papers from Semantic Scholar Ammar et al. (2018). This corpus consists of 18% papers from the computer science domain and 82% from the broad biomedical domain (see Table 1). We use the full text of the papers, not just the abstracts (we observed significantly worse performance when training only on abstracts compared with full text). It is worth noting that the full text is the output of a noisy PDF parser, but we keep those sentences in order to train a model robust to noisy inputs. The average paper length is 154 sentences (2,769 tokens), resulting in a corpus size of 3.17B tokens. This provides extensive language data from scientific text, comparable to the size of the general corpora on which Bert was trained (3.3B tokens). To split the documents into sentences, we use ScispaCy Neumann et al. (2019), which is optimized for scientific text.
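The quoted corpus statistics are mutually consistent, as a quick back-of-the-envelope check shows:

```python
# Back-of-the-envelope check of the corpus statistics quoted above.
papers = 1.14e6            # papers sampled from Semantic Scholar
tokens_per_paper = 2769    # average paper length in tokens
sentences_per_paper = 154  # average paper length in sentences

total_tokens = papers * tokens_per_paper                      # ~3.16B, matching the quoted 3.17B
tokens_per_sentence = tokens_per_paper / sentences_per_paper  # ~18 tokens per sentence

print(f"{total_tokens / 1e9:.2f}B tokens, ~{tokens_per_sentence:.0f} tokens/sentence")
```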

3 Experiments

To demonstrate the effectiveness of SciBert, we conduct extensive experiments with different variants of Bert and SciBert on a large suite of NLP tasks on scientific text. We first describe the Bert variants, then the tasks, models, and datasets.

3.1 Bert Variants


Bert-Base

This is the original model detailed in Devlin et al. (2018). We use the pretrained weights for Bert-Base released with the original Bert code. The vocabulary used is BaseVocab. We evaluate both cased and uncased versions of this model.


SciBert

We use the original Bert code to train SciBert on our corpus with the same configuration and size as Bert-Base. We train four versions of SciBert: (a) cased or uncased, and (b) using BaseVocab or SciVocab. The two models that use BaseVocab are fine-tuned from the corresponding Bert-Base models; the other two, which use the new SciVocab, are trained from scratch.

3.2 Pretraining Bert

Training Bert on long sentences can be slow. Following the recommended settings in the original Bert code, we set a maximum sentence length of 128 tokens and train the model until the training loss stops decreasing. We then continue training, allowing sentence lengths of up to 512 tokens.

We train the SciBert models on a single TPU v3 with 8 cores. Training the SciVocab models from scratch on our corpus of 3.17B tokens takes 1 week (5 days with max length 128, then 2 days with max length 512). For comparison, Bert's largest model was trained on 16 Cloud TPUs for 4 days, and on an 8-GPU machine it would be expected to take 40-70 days Dettmers (2019). Fine-tuning the BaseVocab models starting from the Bert-Base weights reduces the overall training time by 2 days.

Task  Dataset                                Domain  # sents
NER   BC5CDR Li et al. (2016)                Bio      15,030
      JNLPBA Collier and Kim (2004)          Bio      24,806
      NCBI-disease Dogan et al. (2014)       Bio       7,287
      SciERC Luan et al. (2018)              CS        3,187
PICO  EBM-NLP Nye et al. (2018)              Bio      50,362
DEP   GENIA Kim et al. (2003)                Bio      17,047
CLS   RCT-20k Dernoncourt and Lee (2017)     Bio     240,387
      ACL-ARC Jurgens et al. (2018)          CS        1,941
      Paper Field                            Multi   111,998
      ScienceCite Cohan et al. (2019)        Multi    10,104
REL   ChemProt Kringelum et al. (2016)       Bio      10,065
      SciERC Luan et al. (2018)              CS        4,648

Table 2: The domain and size of each dataset used for evaluation. The totals include train, dev and test splits.

3.3 Tasks

We evaluate SciBert on a broad set of tasks in the scientific NLP domain.

NER (Named Entity Recognition): This is a sequence labeling task in which tokens corresponding to entities in a sentence are labeled, possibly also classifying the entities into a predefined set of types.

PICO (Participant Intervention Comparison Outcome Extraction): This is also a sequence labeling task involving the extraction of spans of interest in clinical trial papers Huang et al. (2006). PICO is a technique used in evidence-based practice to frame and answer a health-related question and is helpful to develop literature search strategies in the biomedical domain.

CLS (Classification): This task assigns a sequence of tokens (e.g., a sentence) its corresponding label.

REL (Relation Classification): This task predicts the type of relation expressed between two entities in a sentence. To mark entity locations, the entity mentions are encapsulated by special characters, and the task is then framed as a multiclass sentence-level classification problem.
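The paper does not name the specific marker characters, so the following sketch uses placeholder markers `<<`/`>>` and `[[`/`]]` to illustrate the encapsulation step:

```python
def mark_entities(tokens, span1, span2):
    """Wrap two entity spans (inclusive token indices) in marker tokens so a
    sentence-level classifier can locate them. The marker strings are
    illustrative placeholders, not the characters used in the paper."""
    (s1, e1), (s2, e2) = span1, span2
    out = []
    for i, tok in enumerate(tokens):
        if i == s1:
            out.append("<<")
        if i == s2:
            out.append("[[")
        out.append(tok)
        if i == e1:
            out.append(">>")
        if i == e2:
            out.append("]]")
    return " ".join(out)

sent = "ibuprofen inhibits prostaglandin synthesis".split()
print(mark_entities(sent, (0, 0), (2, 2)))
# << ibuprofen >> inhibits [[ prostaglandin ]] synthesis
```

The marked sentence is then fed to the classifier as ordinary input text, reducing relation classification to sentence classification.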

DEP (Dependency Parsing): This task predicts the dependencies between the tokens of a sentence as a structured tree.

3.4 Models

To keep things simple, we use minimal task-specific architectures atop Bert-Base and SciBert embeddings. Each token is represented as the concatenation of its Bert embedding with a CNN-based character embedding. If the token has multiple Bert subword units, we use the first one.
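Selecting the first subword unit per original token can be sketched as follows, assuming WordPiece's `##` continuation-piece convention:

```python
def first_subword_indices(wordpieces):
    """Indices of the first subword unit of each original token, assuming
    WordPiece marks continuation pieces with a leading '##'."""
    return [i for i, wp in enumerate(wordpieces) if not wp.startswith("##")]

# Toy wordpiece sequence for a three-token sentence:
pieces = ["immuno", "##glob", "##ulin", "binds", "anti", "##gen"]
print(first_subword_indices(pieces))  # [0, 3, 4]
```

The embeddings at these indices would then stand in for the three original tokens in the downstream task model.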

We apply a multilayer BiLSTM to the token embeddings. For text classification, we apply a multilayer perceptron to the first and last BiLSTM states. For sequence tagging, we use a CRF on top of the BiLSTM, as done in Ma and Hovy (2016). For dependency parsing, we use the biaffine attention model from Dozat and Manning (2017).
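The arc-scoring step of biaffine attention can be sketched in NumPy. This is a minimal illustration in the spirit of Dozat and Manning (2017); the dimensions, the shared encoder states, and the single head-bias term are simplifying assumptions, not the paper's exact configuration:

```python
import numpy as np

def biaffine_arc_scores(h_dep, h_head, U, w):
    # score[i, j] = h_dep[i]^T U h_head[j] + w . h_head[j]
    # The second term is a per-head bias, as in biaffine arc scorers.
    return h_dep @ U @ h_head.T + h_head @ w

rng = np.random.default_rng(0)
n, d = 5, 8                               # 5 tokens, toy hidden size 8
h = rng.normal(size=(n, d))               # encoder states (shared here for brevity)
U = rng.normal(size=(d, d))
w = rng.normal(size=d)
scores = biaffine_arc_scores(h, h, U, w)  # (n, n) matrix of arc scores
heads = scores.argmax(axis=1)             # greedy head choice per dependent
```

A full parser would use separate MLP-projected "dependent" and "head" views of the encoder states and decode a well-formed tree rather than taking a row-wise argmax.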

3.5 Task-specific Training

For simplicity, experiments are performed without any hyperparameter tuning and with fixed Bert weights. (We found that fine-tuning the Bert weights results in 2.5x slower training times on average.)

All our models are implemented in AllenNLP Gardner et al. (2017), which provides an easy interface for using pretrained Bert embeddings. The Bert pretrained models are converted to be compatible with PyTorch using the pytorch-pretrained-Bert library.

Field  Task  Dataset           SOTA   Bert-Base  SciBert      SciBert
                                                 (BaseVocab)  (SciVocab)
Bio    NER   BC5CDR            87.12  85.72      88.11        88.94
             JNLPBA            78.58  74.48      75.83        75.95
             NCBI-disease      87.34  85.49      86.91        86.45
       PICO  EBM-NLP           66.30  70.07      70.82        71.18
       REL   ChemProt          64.10  69.22      73.7         76.12
       CLS   PubMed 20k RCT    92.6   86.19      86.80        86.81
       DEP   GENIA - LAS       91.92  91.29      91.26        91.41
             GENIA - UAS       92.84  92.33      92.32        92.46
CS     NER   SciERC            64.20  62.95      65.12        65.5
       REL   SciERC            n/a    72.46      74.42        74.64
       CLS   ACL-ARC           53.0   60.68      65.79        65.71
Multi  CLS   Paper Field       n/a    63.39      64.02        64.07
       CLS   ScienceCite       84.0   84.43      84.43        84.99
             Average                  75.53      77.27        77.64

Table 3: Results on all scientific fields, tasks and datasets. Bold indicates the best performing Bert variant, while underline indicates the best overall result. All SciBert results are statistically significantly higher than Bert-Base (based on 95% bootstrap confidence intervals) except on the ACL-ARC and ScienceCite datasets. All results are the average of multiple runs with different random seeds to control for the potential non-determinism associated with neural models Reimers and Gurevych (2017). Most results are macro F1 scores (span-level for NER, sentence-level for REL and CLS, and token-level for PICO), except ChemProt and PubMed 20k RCT (micro F1 scores). Parsing is evaluated using the labeled attachment score (LAS) and unlabeled attachment score (UAS), reported in two separate rows.
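Since the table mixes macro and micro F1, a small self-contained example may help make the distinction concrete. The labels below are toy data, and the metrics are hand-rolled to avoid dependencies:

```python
from collections import Counter

def f1_scores(gold, pred):
    """Per-class F1 plus macro (unweighted class average) and micro
    (pooled counts) F1 for single-label classification."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    def f1(t, false_pos, false_neg):
        return 2 * t / (2 * t + false_pos + false_neg) if t else 0.0
    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    macro = sum(per_class.values()) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_class, macro, micro

# Toy predictions: the rare class "B" drags the macro average down,
# while micro F1 equals plain accuracy for single-label tasks.
gold = ["A", "A", "A", "A", "B"]
pred = ["A", "A", "A", "B", "B"]
per_class, macro, micro = f1_scores(gold, pred)
```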

3.6 Datasets

We evaluate our models on a suite of well-established NLP datasets spanning across multiple scientific domains (Table 2). For brevity, we do not explain the details of older datasets and refer the reader to the corresponding citations. Instead, we briefly describe the newer datasets.

Pubmed RCT-20K Dernoncourt and Lee (2017) is a dataset of discourse labels (e.g. Background, Method, Results, etc.) for sentences in scientific abstracts. ScienceCite Cohan et al. (2019) and ACL-ARC Jurgens et al. (2018) include citation intent labels in scientific papers (e.g. Comparison, Extension, etc.). The SciERC dataset Luan et al. (2018) contains entities and relations from computer science abstracts. Finally, the Paper Field dataset contains paper titles mapped to 7 different fields of study and is built from the Microsoft Academic Graph Sinha et al. (2015). (The paper describing the Paper Field dataset was not yet published at the time of writing; we will add the corresponding citation once it becomes available.)

4 Results

Table 3 summarizes the results of all experiments. Following Devlin et al. (2018), we use the cased models for sequence tagging (NER and PICO) and dependency parsing (DEP). For text classification (CLS and REL), we use the uncased models.

All reported results are the average of multiple runs with different random seeds. Except for ACL-ARC and ScienceCite, all SciBert results are statistically significantly higher than Bert-Base based on 95% bootstrap confidence intervals.
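The significance test can be sketched as a paired bootstrap over per-example scores. This is a simplified illustration with made-up numbers; the paper does not spell out its exact resampling procedure:

```python
import numpy as np

def bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """95% bootstrap confidence interval for the mean score difference
    between two systems evaluated on the same examples (paired resampling).
    A simplified sketch, not the paper's exact procedure."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    idx = rng.integers(0, len(a), size=(n_boot, len(a)))  # resample example indices
    diffs = (a[idx] - b[idx]).mean(axis=1)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi  # the difference is "significant" if the interval excludes 0

# Toy per-example scores for two hypothetical systems:
scibert = [0.88, 0.91, 0.86, 0.90, 0.89, 0.92, 0.87, 0.90]
bert = [0.84, 0.88, 0.85, 0.86, 0.85, 0.89, 0.83, 0.87]
lo, hi = bootstrap_ci(scibert, bert)
```

Here every paired difference favors the first system, so the interval lies entirely above zero.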

Biomedical domain

The top part of Table 3 summarizes the results on datasets from the biomedical domain. We observe that SciBert always outperforms Bert-Base on biomedical tasks. On average across tasks, SciBert has a higher F1 score than Bert-Base (+1.57 with BaseVocab and +2.06 with SciVocab). In addition, SciBert achieves new state-of-the-art (SOTA) results on the following datasets: BC5CDR Yoon et al. (2018), EBM-NLP Nye et al. (2018), and ChemProt Lim and Kang (2018). SciBert performs slightly worse than the SOTA on JNLPBA  Yoon et al. (2018), PubMed 20K RCT Jin and Szolovits (2018), and GENIA Nguyen and Verspoor (2019). We suspect performance gaps can be explained by task-specific architectures and hyperparameter tuning used in the SOTA models. For example, the current SOTA results Jin and Szolovits (2018) on the PubMed 20K RCT have been obtained using a hierarchical BiLSTM-CRF architecture that also takes neighboring sentences as an important signal for prediction.

Computer Science Domain

The middle part of Table 3 demonstrates the results on datasets from the computer science domain. As shown, SciBert always outperforms Bert-Base on computer science datasets. On average across tasks, SciBert has a higher F1 score than Bert-Base (+3.08 with BaseVocab and +3.25 with SciVocab). In addition, SciBert outperforms the SOTA on ACL-ARC Jurgens et al. (2018), and the NER part of SciERC Luan et al. (2018). For relations in SciERC, our results are not comparable with those in Luan et al. (2018) because we are performing relation classification given gold entities, while they perform NER and relation extraction jointly.

Multidomain Results

The bottom part of Table 3 illustrates the results on datasets from multiple scientific domains. As shown, SciBert always outperforms Bert-Base on both tasks (+0.32 F1 with BaseVocab and +0.62 F1 with SciVocab). In addition, SciBert outperforms the SOTA on ScienceCite Cohan et al. (2019). For the Paper Field dataset, there are no published SOTA results at the time of writing.

5 Discussion

Effect of Vocabulary

Table 3 shows better performance when using SciBert with SciVocab than with BaseVocab. Averaging across tasks, we see an improvement of approximately +0.38 F1. This suggests that retraining the vocabulary could be an important step when retraining Bert embeddings on a new domain. Given an overlap between BaseVocab and SciVocab of 42%, this level of improvement seems reasonable.

Effect of Casing

We ran additional experiments to compare cased and uncased vocabularies. Averaging across tasks, we find for SciBert with SciVocab that the cased model performs better than the uncased one on sequence tagging and parsing (+0.04 F1) and worse on sentence classification (-0.18 F1). Interestingly, Bert-Base and SciBert with BaseVocab both show better performance with uncased vocabularies on all tasks.


Comparison with BioBert

BioBert, a version of Bert fine-tuned on a collection of biomedical text, was published on arXiv by Lee et al. (2019) during the course of our SciBert experiments. We performed preliminary experiments with BioBert on the same suite of tasks. For controlled experimentation, we use the released pretrained weights in the same manner as in our previous experiments. Averaged over tasks, SciBert achieves +0.51 and +0.89 F1 improvements over BioBert when using BaseVocab and SciVocab, respectively. We observed larger performance gains by SciBert over BioBert on the CS tasks.

6 Conclusion and Future Work

We release SciBert, a pretrained contextualized embedding model for scientific text based on Bert. We evaluate SciBert on a suite of tasks and datasets from scientific domains. SciBert often significantly outperforms Bert-Base and achieves new state-of-the-art results with minimal task-specific architectures and without fine-tuning.

An interesting line of future work would be to evaluate different proportions of papers from each domain, though one consideration is that these language models are costly to retrain. This also motivates our interest in building a single resource that is useful across multiple domains.

While we achieve significant improvements on many scientific NLP tasks, the absolute performance numbers show that there is still room for improvement in many of these tasks. We are optimistic that SciBert will be a helpful resource to foster research in the scientific NLP domain.


Acknowledgments

We thank Dan Weld, Waleed Ammar, Yoav Goldberg, and Doug Downey for their helpful feedback on the paper. Computations were supported in part by credits from Google Cloud.