A Semantically Enriched Dataset based on Biomedical NER for the COVID19 Open Research Dataset Challenge

05/18/2020 ∙ by Hermann Kroll, et al. ∙ Technische Universität Braunschweig

Research into COVID-19 is a big challenge and highly relevant at the moment. New tools are required to assist medical experts in their research with relevant and valuable information. The COVID-19 Open Research Dataset Challenge (CORD-19) is a "call to action" for computer scientists to develop these innovative tools. Many of these applications are empowered by entity information, i.e., knowing which entities are mentioned within a sentence. For this paper, we have built a pipeline on the latest Named Entity Recognition tools for Chemicals, Diseases, Genes and Species. We apply our pipeline to the COVID-19 research challenge and share the resulting entity mentions with the community.




1. Introduction

PubMed, the most extensive library for biomedical research, contains nearly 30 million publications. The Allen Institute for AI selected nearly 57,000 documents as relevant for COVID-19 research (V9), and around 47,000 full texts are included within this selection. Accessing such an extensive document collection and finding relevant information is a hard task for medical researchers. Especially at a time when new results are published within a few days, keeping an overview of the latest research can be exhausting. Novel tools are urgently needed to assist medical researchers in their workflows: novel search engines find relevant information precisely, and new access paths like summarization techniques offer new opportunities to cope with the flood of information. These tools are typically empowered by additional side information like knowledge graphs (Dietz et al., 2018).

Knowledge graphs are structured stores providing fact-style knowledge about entities, e.g., Simvastatin is used in the treatment of hypercholesterolemia. In the biomedical domain, entities of interest are mainly Chemicals, Diseases, Genes and Species. The central problem of utilizing structured information for text retrieval is to detect which entities are mentioned in the text. This problem is addressed by Named Entity Recognition (NER), which detects important entities in arbitrary texts. NER tools like Spotlight (DBpedia) and WAT (Wikidata) were developed to recognize a variety of different entities in several domains (Mendes et al., 2011; Piccinno and Ferragina, 2014). Unfortunately, the biomedical domain contains a large variety of specialized entities. Dictionary-based recognition tools might fail here because the entity denoted by a mention within a sentence depends on the context. Hence, homonyms must be resolved, e.g., the gene name CYP3A4 has different ids depending on whether the sentence talks about mice or humans. Fortunately, Named Entity Recognition tools suitable for the biomedical domain have already been designed and built by experts.

In this paper, we utilize two biomedical NER tools, namely TaggerOne (Leaman and Lu, 2016) and GNormPlus (Wei et al., 2015), and build a pipeline to annotate arbitrary biomedical texts. Finally, we apply our pipeline to the COVID-19 dataset. The detected entity mentions are published in our GitHub repository (https://github.com/HermannKroll/CORD19BiomedicalNERDataset) for free reuse. The code will be published under the MIT license (https://opensource.org/licenses/MIT). The data is published for free reuse under the Creative Commons Attribution 4.0 International license (CC BY 4.0, https://creativecommons.org/licenses/by/4.0/). We hope that this additional entity information can serve as a solid and high-quality platform for novel tools and thus enable more research about COVID-19.

2. A Biomedical NER Pipeline

First, we introduce a pipeline for biomedical Named Entity Recognition in arbitrary texts. The task of Named Entity Recognition is to detect entity mentions in texts. An entity represents a thing of interest in a specific domain, e.g., Chemicals and Diseases are of interest in the biomedical domain. Further, an entity consists of a unique id and an entity type, e.g., (Simvastatin, Chemical) is a valid entity. Entities are described by a predefined vocabulary, which is typically built by experts. Entities might be mentioned within a written text. We understand a text as a sequence of sentences and a sentence as a sequence of tokens (single words). A sequence of tokens within a sentence might represent an entity. We call this representation an entity mention. Hence, an entity mention consists of an entity and a sequence of corresponding tokens within a sentence.
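The definitions above can be sketched as small data structures. This is a minimal illustration, not the pipeline's actual data model: the class names, the field names and the naive `find_mentions` helper are assumptions introduced here, and the exact-match lookup only demonstrates the notion of a mention (the NER tools used later are context-aware).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Entity:
    entity_id: str    # unique id taken from a predefined vocabulary
    entity_type: str  # e.g. "Chemical" or "Disease"


@dataclass(frozen=True)
class EntityMention:
    entity: Entity
    sentence_index: int      # index of the sentence containing the mention
    tokens: Tuple[str, ...]  # token sequence that represents the entity


# A text is a sequence of sentences; a sentence is a sequence of tokens.
Text = List[List[str]]


def find_mentions(text: Text, surface: Tuple[str, ...], entity: Entity) -> List[EntityMention]:
    """Naive exact-match lookup of one token sequence (illustration only)."""
    mentions: List[EntityMention] = []
    if not surface:
        return mentions
    n = len(surface)
    for i, sentence in enumerate(text):
        for j in range(len(sentence) - n + 1):
            if tuple(sentence[j:j + n]) == surface:
                mentions.append(EntityMention(entity, i, surface))
    return mentions


simvastatin = Entity("Simvastatin", "Chemical")
text = [
    ["Simvastatin", "lowers", "cholesterol", "."],
    ["Patients", "tolerated", "the", "drug", "."],
]
mentions = find_mentions(text, ("Simvastatin",), simvastatin)
```

Keeping the entity (id and type) separate from the mention (tokens in a sentence) mirrors the distinction drawn in the text: one entity can be mentioned many times.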

The U.S. National Library of Medicine (https://www.nlm.nih.gov) provides several high-quality, expert-built tools for detecting entity mentions in text. These tools can be used via command line interfaces and are freely available. We build a pipeline upon these tools to automatically detect the following entity types in text: 1. Chemicals, 2. Diseases, 3. Genes and 4. Species. Chemicals are described by the Medical Subject Headings (MeSH) vocabulary (https://www.nlm.nih.gov/mesh/meshhome.html). Diseases are described either by MeSH terms or by OMIM (https://www.ncbi.nlm.nih.gov/omim). The NCBI Gene vocabulary (https://www.ncbi.nlm.nih.gov/gene/) is utilized for the Genes' NER and the NCBI Species Taxonomy (https://www.ncbi.nlm.nih.gov/taxonomy) likewise for the Species' NER.

Chemicals and Diseases are detected by TaggerOne (Leaman and Lu, 2016), which uses a semi-Markov structured linear classifier to run named entity recognition (NER) and normalization simultaneously, thus improving performance compared to other taggers. GNormPlus (Wei et al., 2015), which runs NER and normalization as two separate steps, is used for detecting Genes and Species. Both NER tools have been evaluated on real-world text corpora to determine the quality of their detected entity mentions. Benchmarks for the relevant corpora can be found in Table 1 for TaggerOne and Table 2 for GNormPlus. The NCBI Disease corpus is a test set for recognizing Diseases, and the BioCreative V corpus stems from a challenge for detecting Chemicals as well as Diseases. The GNormPlus evaluation is done on a gene normalization test set for humans. Besides, GNormPlus is capable of detecting gene families in texts. For more details about both applications, see (Leaman and Lu, 2016) for TaggerOne and (Wei et al., 2015) for GNormPlus.

| Corpus | Precision | Recall | F-measure |
|---|---|---|---|
| NCBI Disease | 85.1% | 80.8% | 82.9% |
| BioCreative V CDR | 94.2% | 88.8% | 91.4% |

Table 1. Benchmark results of TaggerOne (Leaman and Lu, 2016)

| Corpus | Precision | Recall | F-measure |
|---|---|---|---|
| BioCreative II GN | 87.1% | 86.4% | 86.7% |

Table 2. Benchmark results of GNormPlus (Human) (Wei et al., 2015)
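As a reading aid for the benchmark tables, the following sketch recomputes the F-measure from precision and recall. It assumes that the reported F-measure is the balanced F1 score, i.e. the harmonic mean of precision and recall; the function name is introduced here for illustration.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values in percent, taken from the BioCreative rows of the tables.
print(round(f_measure(94.2, 88.8), 1))  # 91.4 (TaggerOne, BioCreative V)
print(round(f_measure(87.1, 86.4), 1))  # 86.7 (GNormPlus, BioCreative II GN)
```

Because the harmonic mean always lies between its two inputs, a consistent row never reports an F-measure above both precision and recall.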


We have developed a pipeline utilizing TaggerOne and GNormPlus for biomedical NER. Our pipeline expects texts in the so-called PubTator format, see (Wei et al., 2013) and the description at https://www.ncbi.nlm.nih.gov/research/pubtator/. As input, the pipeline supports 1. a single PubTator file, 2. a composed PubTator file and 3. a directory of PubTator files. A composed PubTator file consists of the content of two PubTator files separated by two newlines. Besides, we support the tagging of multiple files in parallel. To this end, the pipeline splits the input and runs the underlying tools in parallel. The recognition steps store their produced data in a relational database. Finally, the pipeline exports the annotated entity mentions in a desired format like PubTator or JSON.
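The input handling described above can be sketched as follows. This is not the pipeline's own code: the function names are hypothetical, and the sketch assumes the usual PubTator layout of a `<id>|t|<title>` line, an `<id>|a|<abstract>` line and optional tab-separated annotation lines (id, start, end, mention, type, entity id). The entity id in the sample is a placeholder, not a real identifier.

```python
import re
from typing import Dict, List


def split_composed_pubtator(content: str) -> List[str]:
    """Split a composed PubTator string into single documents at blank lines."""
    return [doc.strip() for doc in re.split(r"\n\s*\n", content) if doc.strip()]


def parse_pubtator_document(doc: str) -> Dict[str, object]:
    """Read the title line, the abstract line and any annotation lines."""
    annotations: List[Dict[str, object]] = []
    parsed: Dict[str, object] = {"id": None, "title": "", "abstract": "",
                                 "annotations": annotations}
    for line in doc.splitlines():
        header = re.match(r"^(\w+)\|([ta])\|(.*)$", line)
        if header:
            doc_id, kind, text = header.groups()
            parsed["id"] = doc_id
            parsed["title" if kind == "t" else "abstract"] = text
            continue
        fields = line.split("\t")
        if len(fields) >= 5:  # the entity id column may be missing
            annotations.append({
                "start": int(fields[1]),
                "end": int(fields[2]),
                "mention": fields[3],
                "type": fields[4],
                "entity_id": fields[5] if len(fields) > 5 else None,
            })
    return parsed


composed = (
    "1|t|Simvastatin and cholesterol\n"
    "1|a|An abstract.\n"
    "1\t0\t11\tSimvastatin\tChemical\tMESH:EXAMPLE\n"
    "\n"
    "2|t|Another title\n"
    "2|a|Another abstract.\n"
)
documents = [parse_pubtator_document(d) for d in split_composed_pubtator(composed)]
```

Splitting first and parsing afterwards keeps the documents independent, so each one could be handed to a separate worker when tagging in parallel.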

3. The COVID-19 Open Research Dataset

| Statistic | Count |
|---|---|
| Number of documents | 57.4K |
| Number of full texts | 43.5K |
| JSON parses by source: | |
| PubMedCentral (PMC) | 49.7K |
| Elsevier | 24.8K |
| medRxiv | 2.3K |
| ArXiv | 1.2K |
| bioRxiv | 1.1K |
| Chan Zuckerberg Initiative (CZI) | 0.2K |

Table 3. Document Counts of CORD19 Sources

Research into COVID-19 is a big challenge and highly relevant at the moment. Therefore, scientists in the medical field must be assisted by innovative tools to access the current state of the literature efficiently. The COVID-19 Open Research Dataset Challenge (CORD-19) (Allen Institute for AI et al., 2020) is a "call to action" for computer scientists in the natural language processing (NLP) and data mining field to develop such innovative tools. The dataset in version 9 consists of ca. 57,000 scholarly articles, of which ca. 44,000 have a PDF parse of their full text attached to them. Articles are taken from various sources, most prominently the PubMedCentral collection. The document statistics of the dataset in version 9 can be seen in Table 3. Some documents are accessible in multiple sources and are counted more than once in the statistics. The abstracts and full texts of the documents are given paragraph-wise in JSON format, so the texts can easily be extracted and processed. Entity-centric information access plays a key role in the medical domain (Herskovic et al., 2007). Hence, we run our pipeline upon the challenge dataset to assist the community with valuable entity information.
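Extracting the texts from one JSON parse might look like the sketch below. The field names (`paper_id`, `metadata.title`, `abstract`, `body_text`, each paragraph carrying a `text` field) are assumptions about the CORD-19 JSON layout and should be checked against the dataset's schema description; joining the abstract paragraphs with a single space is also an assumption made here, not necessarily what our pipeline does.

```python
import json
from typing import Dict


def extract_text_fields(parse: Dict) -> Dict[str, object]:
    """Collect title, abstract and body texts from one (assumed) CORD-19 JSON parse."""
    title = parse.get("metadata", {}).get("title", "")
    # Assumption: abstract paragraphs are joined into one text with a space.
    abstract = " ".join(p.get("text", "") for p in parse.get("abstract", []))
    body_texts = [p.get("text", "") for p in parse.get("body_text", [])]
    return {
        "paper_id": parse.get("paper_id"),
        "title": title,
        "abstract": abstract,
        "body_texts": body_texts,
    }


# Tiny hand-written sample in the assumed layout (not real CORD-19 data).
sample = json.loads("""
{
  "paper_id": "example-id",
  "metadata": {"title": "An example title"},
  "abstract": [{"text": "First part."}, {"text": "Second part."}],
  "body_text": [{"text": "Body one."}, {"text": "Body two."}]
}
""")
fields = extract_text_fields(sample)
```

The resulting title, abstract and body texts could then be written out in PubTator format as input for the tagging step.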

| Corpus | Chemicals | Diseases | Genes | Species |
|---|---|---|---|---|
| Abstracts | 99K | 145K | 59K | 165K |
| Fulltexts | 3,407K | 4,039K | 2,232K | 4,667K |

Table 4. Number of Detected Entity Mentions for the CORD-19 (Abstracts and Fulltexts)

3.1. Detected Entity Mentions

We report the number of resulting entity mentions for each entity type. We create two different dumps: the first dump contains entity mentions within titles and abstracts, and the second dump contains entity mentions in the titles, abstracts and fulltexts of the documents. Table 4 lists the number of entity mentions for both dumps grouped by entity type. Our pipeline detects about 99K Chemicals, 145K Diseases, 59K Genes and 165K Species in titles and abstracts. For fulltexts, the pipeline detects around 3.4M Chemicals, 4.0M Diseases, 2.2M Genes and 4.7M Species. We estimate the annotations' quality to be comparable to the quality reported in the tools' original publications.
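Counts like these can be recomputed from a dump with a few lines of code. The sketch assumes the structure described in Section 3.2, i.e. a dictionary that maps each document id to a list of entity mentions with an `entity_type` field; the function name and the sample values are illustrative only.

```python
from collections import Counter
from typing import Dict, List


def count_mentions_by_type(dump: Dict[str, List[dict]]) -> Counter:
    """Count entity mentions per entity type over all documents of a dump."""
    counts: Counter = Counter()
    for mentions in dump.values():
        counts.update(m["entity_type"] for m in mentions)
    return counts


# Hand-written miniature dump (not real data).
tiny_dump = {
    "doc-a": [{"entity_type": "Chemical"}, {"entity_type": "Disease"}],
    "doc-b": [{"entity_type": "Disease"}, {"entity_type": "Species"}],
}
counts = count_mentions_by_type(tiny_dump)
```

For a published dump, the dictionary would be obtained with `json.load` from the corresponding file.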

3.2. Dump of the Entity Mentions

We publish the obtained entity mentions as two JSON files. The first file contains the entity mentions for titles and abstracts. The second file contains the entity mentions for titles, abstracts as well as fulltexts. We process the CORD19 fulltexts by selecting the available JSON files. These JSON files contain fulltexts as sequences of body texts. Hence, a fulltext document consists of a title, an abstract and a sequence of body texts. We publish the corresponding entity mentions in a form that matches this structure. Therefore, each entity mention contains an entity location in the texts, including:

  1. a paragraph representing the position in the text: 0 is an entity mention in the title, 1 is an entity mention in the abstract, 2 is an entity mention in the first body text field, and so on.

  2. a start position representing the position of the entity's first character within the corresponding text (title, abstract, body text element).

  3. an end position representing the position of the entity's last character within the corresponding text.

As an example, an entity location with paragraph 5, start 5 and end 10 means that the entity is mentioned in the fourth body text field (paragraph 2 being the first body text field), starting at character position 5 and ending at character position 10. The first character has the position 0. An entity mention contains the following components:

  1. an entity location,

  2. an entity string representing the entity’s token sequence in the text,

  3. an entity type (Chemical, Disease, Gene and Species), and

  4. an entity id corresponding to the previously described vocabularies.

The computed entity mentions are shared within a JSON file. The JSON file consists of a dictionary, where each CORD19 document id is mapped to a list of entity mentions. A short prototypical snapshot of the exported JSON file is shown below:

  {
    <paper_id: str>: [        # for every JSON parse of the dataset
      {                       # for every entity mention
        "location": {
          "paragraph": <int>, # 0 = title, 1 = abstract,
                              # > 1 = body text
          "start": <int>,     # 0 = first character of paragraph
          "end": <int>
        },
        "entity_str": <str>,  # entity mention in source text
        "entity_type": <"Chemical"|"Disease"|"Gene"|"Species">,
        "entity_id": <str>    # e.g. MeSH identifier
      },
      ...
    ],
    ...
  }

More details can be found in our regularly updated GitHub repository.
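To illustrate how an entity location can be resolved against a document, the following sketch maps the paragraph index to the title, the abstract or a body text and then cuts out the mentioned characters. The function names are hypothetical. Following the description above, the end position is treated as the position of the last character (inclusive); whether the published files follow this convention or an exclusive end should be verified against the data, hence the `end_inclusive` switch.

```python
from typing import Dict, List


def select_text(paragraph: int, title: str, abstract: str, body_texts: List[str]) -> str:
    """Map a paragraph index to its text: 0 = title, 1 = abstract, >= 2 = body texts."""
    if paragraph < 0:
        raise ValueError("paragraph must not be negative")
    if paragraph == 0:
        return title
    if paragraph == 1:
        return abstract
    return body_texts[paragraph - 2]


def mention_text(location: Dict[str, int], title: str, abstract: str,
                 body_texts: List[str], end_inclusive: bool = True) -> str:
    """Return the characters an entity location points to."""
    text = select_text(location["paragraph"], title, abstract, body_texts)
    # Assumption: "end" is the index of the last character unless stated otherwise.
    stop = location["end"] + 1 if end_inclusive else location["end"]
    return text[location["start"]:stop]


# Hand-written example document (not real CORD-19 data).
title = "Simvastatin in patients"
abstract = "An abstract."
body_texts = ["one", "two", "three", "With Remdesivir treatment"]

in_title = mention_text({"paragraph": 0, "start": 0, "end": 10}, title, abstract, body_texts)
in_body = mention_text({"paragraph": 5, "start": 5, "end": 14}, title, abstract, body_texts)
```

Comparing the extracted characters with the mention's `entity_str` is a quick way to check that offsets and text extraction are aligned.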

4. Summary and Outlook

In this paper, we discussed the importance and usefulness of entity mentions for retrieval applications. We developed an effective pipeline to automatically annotate biomedical entity mentions in arbitrary texts. Moreover, we built our pipeline on top of the latest available biomedical NER tools to ensure the quality of our entity mentions.

Applying our pipeline to the COVID-19 open research dataset, we published the resulting entity mentions as a semantically enriched dataset for free reuse on GitHub. We will continuously update our GitHub repository whenever new versions of the COVID-19 dataset are published.


References

  • L. Dietz, A. Kotov, and E. Meij (2018) Utilizing knowledge graphs for text-centric information retrieval. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '18, New York, NY, USA, pp. 1387–1390. ISBN 9781450356572.
  • Allen Institute for AI, The White House, et al. (2020) COVID-19 Open Research Dataset Challenge (CORD-19), version 9. Retrieved April 27, 2020 from https://www.kaggle.com/dataset/08dd9ead3afd4f61ef246bfd6aee098765a19d9f6dbf514f0142965748be859b/version/9
  • J. R. Herskovic, L. Y. Tanaka, W. Hersh, and E. V. Bernstam (2007) A Day in the Life of PubMed: Analysis of a Typical Day's Query Log. Journal of the American Medical Informatics Association 14 (2), pp. 212–220. ISSN 1067-5027.
  • R. Leaman and Z. Lu (2016) TaggerOne: joint named entity recognition and normalization with semi-Markov Models. Bioinformatics 32 (18), pp. 2839–2846. ISSN 1367-4803. https://academic.oup.com/bioinformatics/article-pdf/32/18/2839/24406872/btw343.pdf
  • P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer (2011) DBpedia Spotlight: shedding light on the web of documents. In Proceedings of the 7th Int. Conf. on Semantic Systems, I-Semantics '11, New York, NY, USA, pp. 1–8. ISBN 9781450306218.
  • F. Piccinno and P. Ferragina (2014) From TagME to WAT: a new entity annotator. In Proceedings of the First Int. Workshop on Entity Recognition & Disambiguation, ERD '14, New York, NY, USA, pp. 55–62. ISBN 9781450330237.
  • C. Wei, H. Kao, and Z. Lu (2013) PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Research 41 (W1), pp. W518–W522. ISSN 0305-1048. https://academic.oup.com/nar/article-pdf/41/W1/W518/3859973/gkt441.pdf
  • C. Wei, H. Kao, and Z. Lu (2015) GNormPlus: an integrative approach for tagging genes, gene families, and protein domains. BioMed Research International 2015, pp. 918710.