Automatic Textual Evidence Mining in COVID-19 Literature

We created the EvidenceMiner system for automatic textual evidence mining in COVID-19 literature. EvidenceMiner is a web-based system that lets users query a natural language statement and automatically retrieves textual evidence from a background corpus of life-science literature. It is constructed in a completely automated way, without any human effort for training data annotation. EvidenceMiner is supported by novel data-driven methods for distantly supervised named entity recognition and open information extraction. The named entities and meta-patterns are pre-computed and indexed offline to support fast online evidence retrieval. The annotation results are also highlighted in the original documents for better visualization. EvidenceMiner further includes analytic functionalities such as summarization of the most frequent entities and relations. The system is available at https://evidenceminer.firebaseapp.com/.


1 Introduction

Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease was first identified in 2019 in Wuhan, Central China, and has since spread globally, resulting in the 2019–2020 coronavirus pandemic. On March 16th, 2020, researchers and leaders from the Allen Institute for AI, Chan Zuckerberg Initiative (CZI), Georgetown University’s Center for Security and Emerging Technology (CSET), Microsoft, and the National Library of Medicine (NLM) at the National Institutes of Health released the COVID-19 Open Research Dataset (CORD-19, https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge), a corpus of scholarly literature about COVID-19, SARS-CoV-2, and the coronavirus group.

Traditional search engines for life sciences (e.g., PubMed) are designed for document retrieval and do not allow direct retrieval of specific statements. Some of these statements may serve as textual evidence that is key to tasks such as hypothesis generation and new finding validation. We created EvidenceMiner Wang et al. (2020a), a web-based system for textual evidence discovery for life sciences. We apply the EvidenceMiner system to the CORD-19 corpus to facilitate textual evidence mining for COVID-19 studies (https://xuanwang91.github.io/2020-04-15-cord19-evidenceminer/). Given a query as a natural language statement, EvidenceMiner automatically retrieves sentence-level textual evidence from the CORD-19 corpus. In the following sections, we introduce the details of the EvidenceMiner system. We also show some textual evidence retrieval results with EvidenceMiner on the CORD-19 corpus.

Figure 1: System architecture of EvidenceMiner.

2 EvidenceMiner System

EvidenceMiner is a web-based system for textual evidence discovery for life sciences (Figure 1). Given a query as a natural language statement, EvidenceMiner automatically retrieves sentence-level textual evidence from a background corpus of biomedical literature. EvidenceMiner is constructed in a completely automated way, without any human effort for training data annotation. It is supported by novel data-driven methods for distantly supervised named entity recognition and open information extraction.

2.1 Pre-processing

EvidenceMiner relies on external knowledge bases to provide distant supervision for named entity recognition (NER) Shang et al. (2018); Wang et al. (2018b, 2019). For this COVID-19 study, the NER results are obtained from the CORD-NER system Wang et al. (2020b). Based on the entity annotation results, EvidenceMiner automatically extracts informative meta-patterns (textual patterns containing entity types, e.g., CHEMICAL inhibit DISEASE) from sentences in the background corpus Jiang et al. (2017); Wang et al. (2018a); Li et al. (2018a, b). Sentences whose meta-patterns better match the query statement are more likely to be textual evidence.
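The meta-pattern extraction step above can be illustrated with a minimal sketch: each recognized entity mention is replaced by its type to obtain a typed textual pattern. The function name, the entity spans, and the simple string replacement are illustrative assumptions, not the actual CORD-NER/meta-pattern pipeline.

```python
def to_meta_pattern(sentence, entities):
    """Replace each (surface, type) entity mention with $TYPE to form a meta-pattern.
    Longer mentions are replaced first so nested spans are handled sensibly."""
    pattern = sentence
    for surface, etype in sorted(entities, key=lambda e: -len(e[0])):
        pattern = pattern.replace(surface, f"${etype}")
    return pattern

sent = "resveratrol inhibits pancreatic cancer"
ents = [("resveratrol", "CHEMICAL"), ("pancreatic cancer", "DISEASE")]
print(to_meta_pattern(sent, ents))  # $CHEMICAL inhibits $DISEASE
```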

2.2 Corpus Indexing

After extracting all entities and meta-patterns, we use them to guide the textual evidence discovery. We create three offline indexes of the input corpus based on the words, entities, and synonym meta-pattern groups from the previous steps. Indexing speeds up the online processing of textual evidence discovery given a user-specified query.

Word indexing. We normalize the words in the corpus to lowercase and split the corpus into single sentences for indexing. We take each word in the vocabulary of the corpus as the key and generate an identifier list as the value including the sentences containing the key word.
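The word index described above is a standard sentence-level inverted index. A minimal sketch (toy corpus and function name are ours, not part of the system):

```python
from collections import defaultdict

def build_word_index(sentences):
    """Map each lowercased word to the list of sentence identifiers containing it."""
    index = defaultdict(list)
    for sid, sent in enumerate(sentences):
        # set() deduplicates words within a sentence so each id appears once
        for word in set(sent.lower().split()):
            index[word].append(sid)
    return index

corpus = ["UV light inactivates viruses", "Masks reduce droplet spread"]
idx = build_word_index(corpus)
print(idx["uv"])  # [0]
```

The entity index in the next step has the same shape, with recognized entity mentions as keys instead of words.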

Entity indexing. Similar to word indexing, we take each entity recognized by PubTator as the key and generate an identifier list as the value including the sentences containing the key entity.

Pattern indexing. After the fine-grained meta-pattern matching, we take each meta-pattern together with its extracted entity set as two levels of keys and generate the identifier list as the value, including the sentences matched by the meta-patterns with their corresponding entity sets. We also maintain two dictionaries: one mapping each meta-pattern to its synonym meta-pattern group, and the other mapping each synonym meta-pattern group to its list of meta-patterns.
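The two-level pattern index and the two synonym-group dictionaries can be sketched as plain Python mappings. The concrete patterns, group id, and sentence id below are illustrative assumptions about the data layout, not the system's actual storage format.

```python
from collections import defaultdict

# Level 1: meta-pattern; level 2: the extracted entity set (as a frozenset,
# so it can serve as a dict key); value: sentence identifier list.
pattern_index = defaultdict(lambda: defaultdict(list))
pattern_index["$CHEMICAL inhibit $DISEASE"][
    frozenset({"resveratrol", "pancreatic cancer"})].append(42)

# Dictionary 1: meta-pattern -> synonym meta-pattern group id.
pattern_to_group = {"$CHEMICAL inhibit $DISEASE": "g1",
                    "$CHEMICAL suppress $DISEASE": "g1"}
# Dictionary 2: synonym group id -> list of member meta-patterns.
group_to_patterns = {"g1": ["$CHEMICAL inhibit $DISEASE",
                            "$CHEMICAL suppress $DISEASE"]}
```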

2.3 Textual Evidence Retrieval

Given a user-input query and the indexed corpus, we retrieve all candidate evidence sentences and rank them by a confidence score that measures how likely each sentence is textual evidence for the query. The confidence score is a weighted combination of three scores: a word score, an entity score, and a pattern score. Our evidence ranking function calculates the following three ranking scores:

  1. Word score: candidate evidence sentences covering more query-related words will be ranked higher.

  2. Entity score: candidate evidence sentences covering more query-related entities will be ranked higher.

  3. Pattern score: candidate evidence sentences covering more query-matched meta-patterns will be ranked higher.

Word Score. We use the BM25 Robertson et al. (2009) score as the word score to measure the relatedness between the query and the candidate evidence sentence. BM25 is a commonly used ranking score for information retrieval. Given a query $Q$ containing the words $q_1, \dots, q_n$, the BM25 word score of a candidate evidence sentence $S$ is

$$\mathrm{score}_{word}(Q, S) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, S) \cdot (k+1)}{f(q_i, S) + k \cdot \left(1 - b + b \cdot \frac{|S|}{\mathrm{avgsl}}\right)}$$

where $f(q_i, S)$ is the term frequency of $q_i$ in the sentence $S$, $|S|$ is the length of the sentence $S$, $\mathrm{avgsl}$ is the average length of all the sentences, and $k$ and $b$ are two free parameters chosen by the user. $\mathrm{IDF}(q_i)$ is the inverse document frequency of $q_i$, which is computed as

$$\mathrm{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)$$

where $N$ is the total number of sentences and $n(q_i)$ is the number of sentences containing $q_i$. A candidate evidence sentence that is more related to the query will have a higher word score.
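The sentence-level BM25 word score can be sketched directly from the formula. The toy corpus and function name are ours; the default parameter values `k=1.2` and `b=0.75` are common BM25 defaults, not values stated in the paper.

```python
import math

def bm25(query_terms, sentence, corpus, k=1.2, b=0.75):
    """BM25 score of `sentence` for `query_terms`, treating each sentence
    in `corpus` as a retrieval unit (sentence-level retrieval)."""
    N = len(corpus)
    avgsl = sum(len(s.split()) for s in corpus) / N
    words = sentence.lower().split()
    score = 0.0
    for q in query_terms:
        q = q.lower()
        n_q = sum(1 for s in corpus if q in s.lower().split())
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)
        f = words.count(q)
        score += idf * f * (k + 1) / (f + k * (1 - b + b * len(words) / avgsl))
    return score

corpus = ["uv light kills viruses", "masks block droplets"]
print(bm25(["uv"], corpus[0], corpus))  # positive score; 0.0 for corpus[1]
```

The entity score in the next paragraph uses the same formula with recognized entities in place of words.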

Entity Score. We also use the BM25 score as the entity score to measure the relatedness of the query and the candidate evidence sentence. Given a query $Q$ containing the entities $e_1, \dots, e_m$, the BM25 entity score of a candidate evidence sentence $S$ is

$$\mathrm{score}_{entity}(Q, S) = \sum_{j=1}^{m} \mathrm{IDF}(e_j) \cdot \frac{f(e_j, S) \cdot (k+1)}{f(e_j, S) + k \cdot \left(1 - b + b \cdot \frac{|S|}{\mathrm{avgsl}}\right)}$$

where $f(e_j, S)$ is the term frequency of $e_j$ in the sentence $S$, $|S|$ is the length of the sentence $S$, $\mathrm{avgsl}$ is the average length of all the sentences, and $k$ and $b$ are two free parameters chosen by the user. $\mathrm{IDF}(e_j)$ is the inverse document frequency of $e_j$, which is computed as

$$\mathrm{IDF}(e_j) = \ln\left(\frac{N - n(e_j) + 0.5}{n(e_j) + 0.5} + 1\right)$$

where $N$ is the total number of sentences and $n(e_j)$ is the number of sentences containing $e_j$. A candidate evidence sentence more related to the query will have a higher entity score.

Pattern Score. We measure how many synonym patterns of the query pattern can be matched on each candidate evidence sentence. Given an input query (e.g., (resveratrol, inhibit, pancreatic cancer)), we first try to convert it into a query meta-pattern (e.g., “$CHEMICAL inhibit $DISEASE”). If the query meta-pattern can be found in our pattern index, we directly retrieve all the synonym meta-patterns of the query meta-pattern. Then we measure how many meta-patterns among the synonym meta-patterns can be matched for each candidate evidence sentence on the query entities. Given a query $Q$ containing the entities $E = \{e_1, \dots, e_m\}$, the pattern score of a candidate evidence sentence $S$ is

$$\mathrm{score}_{pattern}(Q, S) = \sum_{j=1}^{M} \mathbb{1}(S, p_j)$$

where $M$ is the number of meta-patterns in the synonym meta-pattern group of the query on $E$, $p_j$ is each meta-pattern in the synonym meta-pattern group of the query on $E$, and $\mathbb{1}(S, p_j)$ is an indicator function that measures whether the sentence $S$ can be matched with $p_j$ on the query entities. A candidate evidence sentence is more likely to be confident evidence if it can be matched to more synonym meta-patterns of the query meta-pattern.
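The pattern score can be sketched as a count of synonym-group patterns the sentence matches on the query entities. The representation of sentence matches as (pattern, entity-set) pairs and the normalization by group size are our assumptions for illustration.

```python
def pattern_score(sentence_patterns, synonym_group, query_entities):
    """Fraction of meta-patterns in the query's synonym group that the
    sentence matches on the query entities.

    sentence_patterns: set of (meta_pattern, frozenset_of_entities) pairs
                       extracted from the sentence
    synonym_group:     list of meta-patterns synonymous with the query pattern
    """
    ents = frozenset(query_entities)
    matched = sum(1 for p in synonym_group if (p, ents) in sentence_patterns)
    return matched / len(synonym_group)

pats = {("$CHEMICAL inhibit $DISEASE",
         frozenset({"resveratrol", "pancreatic cancer"}))}
group = ["$CHEMICAL inhibit $DISEASE", "$CHEMICAL suppress $DISEASE"]
print(pattern_score(pats, group, {"resveratrol", "pancreatic cancer"}))  # 0.5
```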

Textual Evidence Score. The final score of the candidate evidence sentence is a weighted average of the three scores,

$$\mathrm{score}(Q, S) = \frac{w_1 \cdot \mathrm{score}_{word}(Q, S) + w_2 \cdot \mathrm{score}_{entity}(Q, S) + w_3 \cdot \mathrm{score}_{pattern}(Q, S)}{w_1 + w_2 + w_3}$$

where $\mathbf{w} = (w_1, w_2, w_3)$ is the weight vector indicating the importance of each aspect of the information (i.e., word, entity, and pattern), which can be adjusted by the user. The default weight vector gives equal weight to word, entity, and pattern in our experiments.
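The weighted combination is a one-liner; a sketch with the default equal weights (function name and signature are ours):

```python
def evidence_score(word_s, entity_s, pattern_s, w=(1.0, 1.0, 1.0)):
    """Weighted average of the word, entity, and pattern scores.
    The default weight vector gives the three aspects equal importance."""
    scores = (word_s, entity_s, pattern_s)
    return sum(wi * si for wi, si in zip(w, scores)) / sum(w)

print(evidence_score(0.9, 0.6, 0.3))                     # equal weights
print(evidence_score(0.9, 0.6, 0.3, w=(0.0, 0.0, 1.0)))  # pattern only
```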

3 Results on COVID-19

3.1 Textual Evidence Retrieval

To demonstrate the effectiveness of EvidenceMiner in textual evidence retrieval, we compare its performance with the traditional BM25 Robertson et al. (2009) and a recent sentence-level search engine, LitSense Allot et al. (2019). The background corpus is the same PubMed subset for all the compared methods. We first ask domain experts to generate 50 query statements based on the relationships between three biomedical entity types (gene, chemical, and disease) in the Comparative Toxicogenomics Database (http://ctdbase.org). Then we ask domain experts to manually label the top-10 retrieved evidence sentences of each method with three grades indicating the confidence of the evidence. We use the average normalized Discounted Cumulative Gain (nDCG) score to evaluate the textual evidence retrieval performance. In Table 1, we observe that EvidenceMiner consistently achieves the best performance among the compared methods. This demonstrates the effectiveness of using entities and meta-patterns to guide textual evidence discovery in biomedical literature.

Method / nDCG @1 @5 @10
BM25 0.714 0.720 0.746
LitSense 0.599 0.624 0.658
EvidenceMiner 0.855 0.861 0.889
Table 1: Performance comparison of the textual evidence retrieval systems with nDCG@1,5,10.
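The nDCG metric used in Table 1 can be computed from the graded relevance labels of a ranked list; a minimal sketch (the toy relevance grades are ours):

```python
import math

def ndcg_at_k(relevances, k):
    """nDCG@k for a ranked list of graded relevance scores.
    DCG discounts each grade by log2 of its (1-based) rank + 1; nDCG divides
    by the DCG of the ideal (descending) ordering."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([2, 1, 0], 3))  # 1.0: the ranking is already ideal
print(ndcg_at_k([0, 1, 2], 3))  # < 1.0: best result ranked last
```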

3.2 Case Study

We show some case studies of textual evidence mining for COVID-19 in Figures 2, 3 and 4. The top-retrieved evidence sentences are highly relevant to the input queries.

In Figure 2, the scientists want to study whether ultraviolet (UV) light can kill the SARS-CoV-2 virus. In the top results, we see many supporting sentences, such as “Ultraviolet-C (UV-C) radiation represents an alternative to chemical inactivation methods”, “We discuss 2 such modalities, respirators (face masks) and ultraviolet (UV) light” and “Boeing is also exploring a prototype self-sanitizing lavatory that uses ultraviolet light to kill 99.99% of pathogens”.

In Figure 3, the users are interested in whether wearing masks can help prevent the spread of COVID-19. This is currently a hotly debated topic. In the top results, we see many related statements, and among them are clearly two opposite opinions. For example, some statements support the use of masks to prevent the virus, such as “COVID-19 is transmitted by saliva droplets, …, which can be prevented by wearing masks”. Other statements question the effectiveness of wearing masks, such as “Although surgical masks are in widespread use …, there is no evidence that wearing these masks can prevent the acquisition of COVID-19 …”. An interesting direction for future work is to classify the opinions by their semantic polarity and further generate a summarization of the evidence retrieval results.

In Figure 4, the doctors want to study whether remdesivir is a potential drug treatment for COVID-19. Remdesivir is currently a very actively studied drug that could be repurposed for COVID-19 treatment. Similarly, in the top results, we can see many sentences regarding clinical trials of remdesivir against COVID-19.

Figure 2: Case study: (Ultraviolet, UV, kills, SARS-COV-2)
Figure 3: Case study: (COVID-19, masks)
Figure 4: Case study: (COVID-19, remdesivir)

4 Conclusion

EvidenceMiner on COVID-19 will be constantly updated based on the incremental updates of the CORD-19 corpus and the improvement of our system. We hope this system can help the text mining community build downstream applications for COVID-19-related tasks. We also hope this system can bring insights to COVID-19 studies and help make scientific discoveries.

Acknowledgment

Research was sponsored in part by US DARPA KAIROS Program No. FA8750-19-2-1004 and SocialSim Program No. W911NF-17-C-0099, National Science Foundation IIS 16-18481, IIS 17-04532, and IIS-17-41317, and DTRA HDTRA11810026. Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and should not be interpreted as necessarily representing the views, either expressed or implied, of DARPA or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for government purposes notwithstanding any copyright annotation hereon. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing the views of any funding agency.

References

  • Allot et al. (2019) Alexis Allot, Qingyu Chen, Sun Kim, Roberto Vera Alvarez, Donald C Comeau, W John Wilbur, and Zhiyong Lu. 2019. LitSense: making sense of biomedical literature at sentence level. Nucleic Acids Research.
  • Jiang et al. (2017) Meng Jiang, Jingbo Shang, Taylor Cassidy, Xiang Ren, Lance M Kaplan, Timothy P Hanratty, and Jiawei Han. 2017. MetaPAD: Meta pattern discovery from massive text corpora. In KDD, pages 877–886. ACM.
  • Li et al. (2018a) Qi Li, Meng Jiang, Xikun Zhang, Meng Qu, Timothy P Hanratty, Jing Gao, and Jiawei Han. 2018a. TruePIE: Discovering reliable patterns in pattern-based information extraction. In KDD, pages 1675–1684. ACM.
  • Li et al. (2018b) Qi Li, Xuan Wang, Yu Zhang, Qi Li, Fei Ling, Cathy H. Wu, and Jiawei Han. 2018b. Pattern discovery for wide-window open information extraction in biomedical literature. In BIBM. IEEE.
  • Robertson et al. (2009) Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. FnT Inf. Ret., 3(4):333–389.
  • Shang et al. (2018) Jingbo Shang, Liyuan Liu, Xiang Ren, Xiaotao Gu, Teng Ren, and Jiawei Han. 2018. Learning named entity tagger using domain-specific dictionary. In EMNLP. ACL.
  • Wang et al. (2020a) Xuan Wang, Yingjun Guan, Weili Liu, Aabhas Chauhan, Enyi Jiang, Qi Li, David Liem, Dibakar Sigdel, J. Harry Caufield, Peipei Ping, and Jiawei Han. 2020a. Evidenceminer: Textual evidence discovery for life sciences. In ACL: System Demonstrations.
  • Wang et al. (2020b) Xuan Wang, Xiangchen Song, Yingjun Guan, Bangzheng Li, and Jiawei Han. 2020b. Comprehensive named entity recognition on cord-19 with distant or weak supervision. arXiv preprint arXiv:2003.12218.
  • Wang et al. (2018a) Xuan Wang, Yu Zhang, Qi Li, Yinyin Chen, and Jiawei Han. 2018a. Open information extraction with meta-pattern discovery in biomedical literature. In BCB, pages 291–300. ACM.
  • Wang et al. (2019) Xuan Wang, Yu Zhang, Qi Li, Xiang Ren, Jingbo Shang, and Jiawei Han. 2019. Distantly supervised biomedical named entity recognition with dictionary expansion. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 496–503. IEEE.
  • Wang et al. (2018b) Xuan Wang, Yu Zhang, Qi Li, Cathy H. Wu, and Jiawei Han. 2018b. PENNER: pattern-enhanced nested named entity recognition in biomedical literature. In BIBM, pages 540–547.