COVID-19 Literature Knowledge Graph Construction and Drug Repurposing Report Generation

07/01/2020 ∙ by Qingyun Wang, et al. ∙ University of Illinois at Urbana-Champaign ∙ Columbia University

To combat COVID-19, both clinicians and scientists need to digest the vast amount of relevant biomedical knowledge in the literature to understand the disease mechanism and the related biological functions. We have developed a novel and comprehensive knowledge discovery framework, COVID-KG, to extract fine-grained multimedia knowledge elements (entities, relations and events) from scientific literature. We then exploit the constructed multimedia knowledge graphs (KGs) for question answering and report generation, using drug repurposing as a case study. Our framework also provides detailed contextual sentences, subfigures and knowledge subgraphs as evidence. All of the data, KGs, reports, resources and shared services are publicly available.


1 Introduction

Practical progress in combating COVID-19 depends heavily on effective search, discovery, assessment and extension of scientific research results. However, clinicians and scientists face two unique barriers to digesting these research papers.

Figure 1: The Growing Number of COVID-19 Papers at PubMed
Figure 2: COVID-KG Overview: From Data to Semantics to Knowledge

The first challenge is quantity. Such a bottleneck in knowledge access is exacerbated during a pandemic, when increased investment in relevant research leads to even faster growth of literature than usual. For example, as of April 28, 2020, PubMed (https://www.ncbi.nlm.nih.gov/pubmed/) listed 19,443 papers related to coronavirus; as of June 13, 2020, there were 140K+ related papers, with nearly 2.7K new papers per day (see Figure 1). This knowledge bottleneck causes significant delays in the development of vaccines and drugs for COVID-19. More intelligent knowledge discovery technologies need to be developed to enable researchers to access and digest relevant knowledge from literature more quickly and accurately.

The second challenge is quality, owing to the rapid rise in preprint manuscripts published without pre-publication peer review. Many research results about coronavirus from different research labs and sources are redundant, complementary, or even conflicting with each other, while some false information has been promoted both in formal publication venues and on social media platforms such as Twitter. As a result, some of the policy responses to the virus, and public perception of it, have been based on misleading, and at times erroneous, claims. The isolation of these knowledge resources makes it hard, if not impossible, for researchers to connect dots that exist in separate resources to obtain insights.

Let us consider drug repurposing as a case study. Besides the lengthy clinical trials and biomedical experiments, another major cause of the long process is the complexity of the problem and the difficulty of drug discovery in general. Current clinical trials for drug repurposing rely mainly on symptoms, considering drugs that can treat diseases with similar symptoms. However, there are too many drug candidates and too much misinformation published from multiple sources. Clinicians and scientists thus urgently need help to obtain a reliable ranked list of drugs with detailed evidence, and to gain new insights into the underlying molecular and cellular mechanisms of COVID-19 and into which pre-existing conditions may affect the mortality and severity of this disease.

To tackle these two challenges, we propose a new framework, COVID-KG, to accelerate scientific discovery and build a bridge between clinicians and biology scientists, as illustrated in Figure 2. COVID-KG starts by reading existing papers to build multimedia knowledge graphs (KGs), in which nodes are entities/concepts and edges represent relations and events involving these entities, extracted from both text and images. Given the KGs enriched with path ranking and evidence mining, COVID-KG answers natural language questions effectively. Using drug repurposing as a case study, for 11 typical questions that human experts aim to explore, we integrate our techniques to generate a comprehensive report for each candidate drug. Preliminary assessment by expert clinicians and medical school students shows that our generated reports are both informative and sound.

2 Multimedia Knowledge Graph Construction

Figure 3: Constructed KG Connecting Losartan (candidate drug in COVID-19) and cathepsin L pseudogene 2 (gene related to coronavirus).

2.1 Coarse-grained Text Knowledge Extraction

Our coarse-grained Information Extraction (IE) system consists of three components. (1) Coarse-grained entity extraction Wang et al. (2019a) and entity linking Zheng et al. (2015) for four entity types: Gene, Disease, Chemical, and Organism. We follow the entity ontology defined in the Comparative Toxicogenomics Database (CTD) Davis et al. (2016), and obtain a Medical Subject Headings (MeSH) Unique ID for each mention. (2) Relation extraction: based on the MeSH Unique IDs, we further link all entities to the CTD and extract 133 subtypes of relations such as Gene–Chemical–Interaction Relationships, Chemical–Disease Associations, Gene–Disease Associations, Chemical–GO Enrichment Associations and Chemical–Pathway Enrichment Associations. (3) Event extraction Li et al. (2019): we extract 13 event types and the roles of entities involved in these events, including Gene expression, Transcription, Localization, Protein catabolism, Binding, Protein modification, Phosphorylation, Ubiquitination, Acetylation, Deacetylation, Regulation, Positive regulation, and Negative regulation. Figure 3 shows an example of the constructed knowledge graph.
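The three kinds of knowledge elements described above can be pictured as a small typed data model. The following is a minimal sketch, not the authors' implementation: the class names, the `neighbors` helper, the relation label `connected_to`, and the `PLACEHOLDER-*` identifiers are all assumptions made for illustration, while the entity and event type names are taken from the text.

```python
from dataclasses import dataclass, field

# Entity and event types as listed in the text.
ENTITY_TYPES = {"Gene", "Disease", "Chemical", "Organism"}
EVENT_TYPES = {
    "Gene expression", "Transcription", "Localization", "Protein catabolism",
    "Binding", "Protein modification", "Phosphorylation", "Ubiquitination",
    "Acetylation", "Deacetylation", "Regulation", "Positive regulation",
    "Negative regulation",
}


@dataclass(frozen=True)
class Entity:
    mention: str      # surface string found in a paper
    entity_type: str  # one of ENTITY_TYPES
    mesh_id: str      # MeSH Unique ID from entity linking (placeholders here)

    def __post_init__(self):
        if self.entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.entity_type}")


@dataclass(frozen=True)
class Relation:
    head: Entity
    tail: Entity
    relation_type: str  # e.g. a CTD relation subtype (label is an assumption)


@dataclass(frozen=True)
class Event:
    event_type: str   # one of EVENT_TYPES
    trigger: str      # word or phrase that evokes the event
    arguments: tuple  # (role, Entity) pairs

    def __post_init__(self):
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type}")


@dataclass
class KnowledgeGraph:
    entities: dict = field(default_factory=dict)  # mesh_id -> Entity
    relations: list = field(default_factory=list)
    events: list = field(default_factory=list)

    def add_relation(self, relation):
        # Register both endpoints as nodes, then store the edge.
        for entity in (relation.head, relation.tail):
            self.entities.setdefault(entity.mesh_id, entity)
        self.relations.append(relation)

    def add_event(self, event):
        # Register every argument entity as a node, then store the event.
        for _, entity in event.arguments:
            self.entities.setdefault(entity.mesh_id, entity)
        self.events.append(event)

    def neighbors(self, mesh_id):
        """Return (relation_type, other_entity) pairs touching the given node."""
        result = []
        for rel in self.relations:
            if rel.head.mesh_id == mesh_id:
                result.append((rel.relation_type, rel.tail))
            elif rel.tail.mesh_id == mesh_id:
                result.append((rel.relation_type, rel.head))
        return result
```

Keying nodes by their linked identifier rather than by surface mention means that different mentions of the same concept collapse onto one node, which is the purpose of the entity linking step.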

2.2 Fine-grained Text Entity Extraction

Figure 4: Example of Fine-grained Entity Extraction

However, questions from experts often involve fine-grained knowledge elements, such as “Which amino acids in glycoprotein are most related to Glycan (CHEMICAL)?”. In order to answer these questions, we apply our fine-grained entity extraction system CORD-NER Wang et al. (2020c) to extract 75 types of entities to enrich the KG, including many COVID-19 specific new entity types (e.g., coronaviruses, viral proteins, evolution, materials, substrates and immune responses). CORD-NER relies on distantly- and weakly-supervised methods Wang et al. (2019b); Shang et al. (2018), and does not require expensive human annotation. Its entity annotation quality surpasses that of SciSpacy, a fully supervised BioNER tool (over 10% higher F1 score on a sample set of documents). Figure 4 shows some examples of the annotation results on a CORD-19 paper Zhang et al. (2020).

2.3 Image Processing and Cross-media Entity Grounding

Figures in biomedical papers contain rich information uniquely manifested in the visual modality, such as molecular structures, microscopic images, dosage response curves, relational diagrams, and other visual types. We have developed a visual IE subsystem to extract the visual information from figure images to enrich the knowledge graph. We start by designing a pipeline and automatic tools, shown in Figure 5, to extract figures from papers in the CORD-19 dataset and segment figures into close to half a million subfigures. Then, we perform cross-modal entity grounding to ground entities mentioned in captions or referring text to visual objects in the subfigures. Since most figures are embedded as part of PDF files, we employ Deepfigures Siegel et al. (2018) to automatically detect and extract figures from each PDF document. Each figure is associated with text in its caption or referring context (main body text referring to the figure). In this way, a figure can be coarsely attached to a KG entity if the entity is mentioned in the associated text.

Figure 5: System Pipeline for Automatic Figure Extraction and Subfigure Segmentation

To further delineate semantic and visual information contained in each subfigure, we have developed a pipeline to segment individual subfigures and then align each subfigure with its corresponding sub-caption. We employ Figure-separator Tsutsui and Crandall (2017) to detect and separate all non-overlapping image regions. Meanwhile, subfigures in a figure are typically marked with alphabetical letters (e.g., A, B, C, etc.). We use deep neural networks Zhou et al. (2017) to detect text in the figures and use OCR tools Smith (2007) to automatically recognize text information within each figure. To distinguish subfigure marker text from text labels that annotate figure content, we use location proximity between text labels and subfigures to locate subfigure text markers. Location information of such text markers can also be used to merge multiple image regions into a single subfigure. At the end, each subfigure is segmented and associated with its corresponding subcaption and referring context. The segmented subfigures and associated text labels provide rich information that can expand the KG constructed from text captions. For example, as shown in Figure 6, we apply a classifier to detect subfigure images containing molecular structures. Then, by linking specific drug names extracted from within-figure text to the drug entity in the coarse KG constructed from the caption text, a cross-modal expanded KG can be constructed that links specific molecular structure images to corresponding drug entities in the KG.
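The marker-based grouping step can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline's actual logic: boxes are assumed to be `(x0, y0, x1, y1)` tuples in pixel coordinates, a marker is assumed to be a single letter (optionally parenthesized), and each image region is assigned to the marker closest to it. The function names and the regular expression are hypothetical.

```python
import math
import re
from collections import defaultdict

# Assumed marker shape: one letter, optionally in parentheses, e.g. "A", "(b)", "C."
MARKER_PATTERN = re.compile(r"^\(?([A-Za-z])\)?[.:]?$")


def marker_letter(text):
    """Return the upper-cased panel letter if text looks like a marker, else None."""
    match = MARKER_PATTERN.match(text.strip())
    return match.group(1).upper() if match else None


def box_center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def point_to_box_distance(point, box):
    """Euclidean distance from a point to a box; 0 if the point lies inside it."""
    px, py = point
    x0, y0, x1, y1 = box
    dx = max(x0 - px, 0.0, px - x1)
    dy = max(y0 - py, 0.0, py - y1)
    return math.hypot(dx, dy)


def group_regions_by_marker(regions, text_boxes):
    """Assign each image region to its nearest marker, then merge regions
    that share a marker into one bounding box per subfigure."""
    markers = []
    for text, box in text_boxes:
        letter = marker_letter(text)
        if letter is not None:  # ordinary content labels are skipped
            markers.append((letter, box_center(box)))
    if not markers:
        return {}

    grouped = defaultdict(list)
    for region in regions:
        letter, _ = min(markers, key=lambda m: point_to_box_distance(m[1], region))
        grouped[letter].append(region)

    merged = {}
    for letter, boxes in grouped.items():
        merged[letter] = (
            min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes),
        )
    return merged
```

With two regions that belong to panel A, one region for panel B, and OCR output containing the markers "A" and "B" plus an ordinary content label, the call returns one merged box per marker and ignores the content label.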

Figure 6: Expanding KG through Subfigure Segmentation and Cross-modal Entity Grounding

2.4 Knowledge Graph Semantic Visualization

In order to enhance the exploration and discovery of the information mined from the COVID-19 literature through the algorithms discussed in previous sections, we have been developing techniques to create semantic visualizations over large complex networks of biomedical relations. Semantic visualization allows for visualization of user-defined subsets of these relations interactively through semantically typed tag clouds and heat maps. This allows researchers to get a global view of selected relation subtypes drawn from hundreds or thousands of papers at a single glance. This in turn allows for the ready identification of novel relationships that would typically be missed by directed keyword searches or simple unigram word cloud or heatmap displays (see https://www.semviz.org/).

We first build a data index from the knowledge elements in the constructed KGs, and then create a Kibana dashboard out of the generated data indices. Each Kibana dashboard has a collection of visualizations that are designed to interact with each other. Dashboards are implemented as web applications and are navigated mainly by clicking and searching. For example, clicking the protein keyword EIF2AK2 in the tag cloud named “Enzyme proteins participating Modification relations” adds a constraint on the type of proteins in modifications, and all the other visualizations change accordingly.

One unique feature of the SemViz semantic visualization is the creation of dense tag clouds and dense heatmaps, through a process of parameter reduction over relations, allowing for the visualization of relation sets as tag clouds and multiple chained relations as heatmaps. Figure 7 illustrates such a dense heatmap, where a functionally typed protein is implicated in a disease relation (e.g., “those proteins that are down-regulators of INF which are implicated in obesity”).

Figure 7: Regulatory Processes-Disease Interactions Heatmap

3 Knowledge-driven Question Answering

In contrast to most current question answering (QA) methods, which target single documents, we have developed a QA component based on a combination of knowledge graph matching and distributional semantic matching. We build knowledge graph indexing and searching functions so that users can pose queries and search effectively and efficiently. We also support semantic matching over the constructed KGs and related texts by accepting multi-hop queries.

A common category of queries is about the connections between two entities. Given two entities as a query, we generate a subgraph covering salient paths between them to show how they are connected through other entities. Figure 3 is an example subgraph summarizing the connections between Losartan and cathepsin L pseudogene 2. The paths are generated by traversing the constructed KG, and are ranked by the frequency of paths in the KG. Each edge is assigned a salience score by aggregating the scores of paths passing through it. In addition to knowledge elements, we also present related sentences as evidence. We use BioBERT Lee et al. (2020), a pre-trained language model, to represent each sentence along with its left and right neighboring sentences as local context. Applying the same model to every candidate sentence and to the user query, we aggregate the sequence embedding layer (the last hidden layer in the BERT architecture) with average pooling Reimers and Gurevych (2019). We then use the similarity between the embedding of each sentence and that of the query to extract the most relevant sentences as evidence.
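The path ranking and edge salience steps can be illustrated with a small sketch. The text does not define "frequency of paths" precisely, so the code below makes an explicit assumption: each extracted triple may occur several times across papers, and the frequency of a path is the product of the occurrence counts of its edges (the number of distinct ways to traverse that node sequence in the resulting multigraph). Edge salience is taken to be the sum of the scores of all paths through the edge. All function names and the `max_hops` limit are hypothetical.

```python
from collections import Counter, defaultdict
from math import prod


def build_adjacency(triples):
    """Count repeated (head, relation, tail) triples and index them by head."""
    counts = Counter(triples)
    adjacency = defaultdict(list)
    for (head, relation, tail), count in counts.items():
        adjacency[head].append((relation, tail, count))
    return adjacency


def enumerate_paths(adjacency, source, target, max_hops=3):
    """Depth-first enumeration of simple paths with at most max_hops edges."""
    paths = []

    def walk(node, visited, edges):
        if node == target:
            paths.append(tuple(edges))
            return
        if len(edges) == max_hops:
            return
        for relation, neighbor, count in adjacency.get(node, []):
            if neighbor not in visited:
                walk(neighbor, visited | {neighbor},
                     edges + [(node, relation, neighbor, count)])

    walk(source, {source}, [])
    return paths


def rank_paths(paths):
    """Score each path by the product of its edge counts (assumed frequency)."""
    scored = [(prod(edge[3] for edge in path), path) for path in paths]
    return sorted(scored, key=lambda item: item[0], reverse=True)


def edge_salience(scored_paths):
    """Aggregate path scores onto the edges they pass through."""
    salience = defaultdict(int)
    for score, path in scored_paths:
        for head, relation, tail, _ in path:
            salience[(head, relation, tail)] += score
    return dict(salience)
```

An edge shared by several high-frequency paths accumulates a larger salience than an edge that appears on a single path, which is what makes it stand out in a subgraph such as the one in Figure 3.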

Queries also often include entity types instead of entity instances, which requires us to extract evidence sentences based on type or pattern matching. We have developed EvidenceMiner Wang et al. (2020a, b), a web-based system that accepts a user’s query as a natural language statement or as an inquired relationship at the meta-symbol level (e.g., CHEMICAL, PROTEIN) and automatically retrieves textual evidence from a background corpus of COVID-19 literature.
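Type-level matching of this kind can be illustrated with a toy filter. This is a sketch of the general idea only and should not be read as EvidenceMiner's algorithm: the meta-symbol inventory, the sentence annotation format, and the function names are assumptions. Upper-case tokens from the inventory are treated as type constraints, and the remaining query words as keywords that must occur in the sentence.

```python
import re

# Hypothetical meta-symbol inventory; the system's full type set is not listed here.
META_SYMBOLS = {"CHEMICAL", "PROTEIN", "GENE", "DISEASE"}


def parse_query(query):
    """Split a query into required entity types and lower-cased keywords."""
    required_types, keywords = set(), set()
    for token in re.findall(r"\w+", query):
        if token in META_SYMBOLS:
            required_types.add(token)
        else:
            keywords.add(token.lower())
    return required_types, keywords


def retrieve_evidence(query, annotated_sentences):
    """Return sentences that contain every required entity type and keyword.

    Each sentence is a dict with a "text" string and an "entities" list of
    (mention, entity_type) pairs produced by an upstream NER step.
    """
    required_types, keywords = parse_query(query)
    hits = []
    for sentence in annotated_sentences:
        found_types = {entity_type for _, entity_type in sentence["entities"]}
        tokens = set(re.findall(r"\w+", sentence["text"].lower()))
        if required_types <= found_types and keywords <= tokens:
            hits.append(sentence["text"])
    return hits
```

For a query such as "CHEMICAL treat DISEASE", only sentences annotated with both a chemical and a disease mention, and containing the word "treat", are returned. A real system would also rank the hits rather than only filter them.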

4 A Case Study on Drug Repurposing Report Generation

4.1 Task and Data

A human-written report about drug repurposing usually answers the following typical questions:

  1. Current indication: what is the drug class? What is it currently approved to treat?
  2. Molecular structure (symbols desired, but a pointer to a reference is also useful)
  3. Mechanism of action, i.e., inhibits viral entry, replication, etc. (w/ a pointer to data)
  4. Was the drug identified by manual or computational screen?
  5. Who is studying the drug? (Source/lab name)
  6. In vitro data available (cell line used, assays run, viral strain used, cytopathic effects, toxicity, LD50, dosage response curve, etc.)
  7. Animal data available (what animal model, LD50, dosage response curve, etc.)
  8. Clinical trials ongoing (what phase, facility, target population, dosing, intervention, etc.)
  9. Funding source
  10. Has the drug shown evidence of systemic toxicity?
  11. List of relevant sources to pull data from.

As case studies, DARPA biologists suggest three drugs, Benazepril, Losartan, and Amodiaquine, and COVID-19 related chemicals/genes as shown in Figure 8:

Figure 8: COVID-19 related chemicals/genes.

Our KG results for many other drugs are visualized at our website (http://blender.cs.illinois.edu/covid19/visualization.html). We download new COVID-19 papers on a daily basis from three Application Programming Interfaces (APIs): NCBI PMC API, NCBI Pubtator API and CORD-19 archive. We provide incremental updates including new papers, removed papers and updated papers, and their metadata information, at our website (http://blender.cs.illinois.edu/covid19/).

4.2 Results

As of June 14, 2020, we have collected 140K papers. We select 25,534 peer-reviewed papers and construct a KG that includes 7,230 Diseases, 9,123 Chemicals and 50,864 Genes, with 1,725,518 chemical-gene links, 5,556,670 chemical-disease links, and 77,844,574 gene-disease links. The KG has been downloaded more than 1,000 times. Our final generated reports (http://blender.cs.illinois.edu/covid19/DrugRe-purposingReport_V2.0.docx) are shared publicly. For each question, our framework provides answers along with detailed evidence, knowledge subgraphs, and image segmentation and analysis results. Table 1 shows some example answers.

Several clinicians and medical school students in our team have manually reviewed the drug repurposing reports for three drugs, and also the KGs connecting 41 drugs and COVID-19 related chemicals/genes. Preliminary results show that most of our output is informative, valid and sound. For instance, after the coronavirus enters the cell in the lungs, it can cause a severe disease called Acute Respiratory Distress Syndrome. This condition causes the release of inflammatory molecules in the body named cytokines, such as Interleukin-2, Interleukin-6, Tumor Necrosis Factor, and Interleukin-10. We see all of these connections in our results, such as the examples shown in Figure 3 and Figure 9. Some results are a little surprising to scientists, who consider them worth further investigation. For example, in Figure 3 we can see that Losartan is connected to tumor protein p53, which is related to lung cancer.

Question / Example Answers

Q1
  Drug Class: angiotensin-converting enzyme (ACE) inhibitors
  Disease: hypertension
  Evidence Sentences:
    [PMID:32314699 (PMC7253125)] Past medical history was significant for hypertension, treated with amlodipine and benazepril, and chronic back pain.
    [PMID:32081428 (PMC7092824)] On the other hand, many ACE inhibitors are currently used to treat hypertension and other cardiovascular diseases. Among them are captopril, perindopril, ramipril, lisinopril, benazepril, and moexipril.

Q4
  Disease: COVID-19
  Evidence Sentences:
    [PMID:32081428 (PMC7092824)] By using a molecular docking approach, an earlier study identified N-(2-aminoethyl)-1 aziridine-ethanamine as a novel ACE2 inhibitor that effectively blocks the SARS-CoV RBD-mediated cell fusion. This has provided a potential candidate and lead compound for further therapeutic drug development. Meanwhile, biochemical and cell-based assays can be established to screen chemical compound libraries to identify novel inhibitors.

Q6
  Disease: cardiovascular disease
  Evidence Sentences:
    [PMID:22800722 (PMC7102827)] The in vitro half-maximal inhibitory concentration (IC50) values of food-derived ACE inhibitory peptides are about 1000 fold higher than that of synthetic captopril but they have higher in vivo activities than would be expected from their in vitro activities…..

Q8
  Disease: COVID-19
  Evidence Sentences:
    [PMID:32336612 (PMC7167588)] Two trials of losartan as additional treatment for SARS-CoV-2 infection in hospitalized (NCT04312009) or not hospitalized (NCT04311177) patients have been announced, supported by the background of the huge adverse impact of the ACE Angiotensin II AT1 receptor axis over-activity in these patients.
    [PMID:32350632 (PMC7189178)] To address the role of angiotensin in lung injury, there is an ongoing clinical trial to examine whether losartan treatment affects outcomes in COVID-19 associated ARDS (NCT04312009).
    [PMID:32439915 (PMC7242178)] Losartan was also the molecule chosen in two trials recently started in the United States by the University of Minnesota to treat patients with COVID-19 (clinical trials.gov NCT04311177 and NCT 104312009).

Table 1: Example Answers for Questions in Drug Repurposing Reports
Figure 9: Connections Involving Coronavirus Related Diseases

5 Related Work

A lot of previous work extracts biomedical entities Habibi et al. (2017); Crichton et al. (2017); Wang et al. (2018); Beltagy et al. (2019); Alsentzer et al. (2019); Wei et al. (2019); Wang et al. (2020c), relations Uzuner et al. (2011); Krallinger et al. (2011); Manandhar and Yuret (2013); Bui et al. (2014); Peng et al. (2016); Wei et al. (2015); Peng et al. (2017); Luo et al. (2017); Wei et al. (2019); Peng et al. (2019, 2020), and events Ananiadou et al. (2010); Van Landeghem et al. (2013); Nédellec et al. (2013); Deléger et al. (2016); Wei et al. (2019); Li et al. (2019); ShafieiBavani et al. (2020) from biomedical literature, and more recent work focuses on COVID-19 literature Hope et al. (2020); Ilievski et al. (2020); Wolinski (2020); Ahamed and Samad (2020).

Most of the recent biomedical QA work Yang et al. (2015, 2016); Chandu et al. (2017); Kraus et al. (2017) is driven by the BioASQ initiative Tsatsaronis et al. (2015), and many live QA systems, including COVIDASK (https://covidask.korea.ac.kr/) and AUEB (http://cslab241.cs.aueb.gr:5000/), and search engines Kricka et al. (2020); Esteva et al. (2020); Hope et al. (2020); Taub Tabib et al. (2020) have been developed. Our work is an application and extension of our recently developed multimedia knowledge extraction for the news domain Li et al. (2020a, b). As in the news domain, the knowledge elements extracted from text and images in literature are complementary. Our framework advances the state of the art by extending the knowledge elements to more fine-grained types, and by incorporating image analysis, cross-media knowledge grounding, and knowledge graph matching into QA.

6 Conclusions and Future Work

We have developed a novel framework, COVID-KG, that automatically transforms a massive scientific literature corpus into organized, structured, and actionable knowledge graphs, and uses them to answer questions in drug repurposing reporting. With COVID-KG, researchers and clinicians are able to obtain trustworthy and non-trivial answers from scientific literature, and can thus focus on more important hypothesis testing and prioritize the analysis efforts for candidate exploration directions. In our ongoing work, we have created a new ontology that includes 77 entity subtypes and 58 event subtypes, and we are building a neural IE system following this new ontology. In the future, we plan to extend COVID-KG to automate the creation of new hypotheses by predicting new links. We will also create a multimedia common semantic space Li et al. (2020a, b) for literature and apply it to improve cross-media knowledge grounding and inference.

References