Coronavirus Knowledge Graph: A Case Study

07/04/2020 ∙ by Chongyan Chen, et al. ∙ The University of Texas at Austin Peking University 0

The emergence of the novel COVID-19 pandemic has had a significant impact on global healthcare and the economy over the past few months. The virus's rapid widespread has led to a proliferation in biomedical research addressing the pandemic and its related topics. One of the essential Knowledge Discovery tools that could help the biomedical research community understand and eventually find a cure for COVID-19 are Knowledge Graphs. The CORD-19 dataset is a collection of publicly available full-text research articles that have been recently published on COVID-19 and coronavirus topics. Here, we use several Machine Learning, Deep Learning, and Knowledge Graph construction and mining techniques to formalize and extract insights from the PubMed dataset and the CORD-19 dataset to identify COVID-19 related experts and bio-entities. Besides, we suggest possible techniques to predict related diseases, drug candidates, gene, gene mutations, and related compounds as part of a systematic effort to apply Knowledge Discovery methods to help biomedical researchers tackle the pandemic.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

1.1. Knowledge Graphs

A Knowledge Graph (KG) is a graph-based data structure used to represent unstructured information so that a machine can read it. Emerging from Knowledge Bases, KGs now represent a ubiquitous set of methods for representing and integrating knowledge in various domains. A KG contains descriptions of entities and their relationships as information in the form of first-order logical facts such as ¡Mount Fuji is located in Japan¿ that can be retrieved and queried heuristically. KGs emerged to power what was known in the eighties and the nineties as Expert Systems - an early form of Artificial Intelligence and Decision Support Systems

(Russell and Norvig, 2016). The number of entity types and relationships in a KG is finite and is usually but not necessarily organized in a schema or an ontology. In 2012 Google introduced the Google Knowledge Graph (Paulheim, 2017), a technology that converts multiple information sources to a graph structure where the nodes represent real-life entities and types. The edges represent the relationships between those entities and types, and the technology was aimed at enhancing the users’ search experience through predicting the users’ search intent and introducing a Knowledge Panel on the right of page (Paulheim, 2017). Knowledge Graph technology today has been adopted in many domains and fields to store, integrate, and represent unstructured information in a structured format that is more flexible and machine-readable than the traditional entity-relationship data model (Chen, 1976).

Other examples of widely used and adopted KGs in different domains such as Social Networks and Life Sciences include the Facebook Social Graph (Ugander et al., 2011) and Chem2Bio2RDF (Chen et al., 2010). The advancement of KG construction and mining was powered by the already established research fields of Machine Learning, Deep Learning, Graph Mining, and Complex Networks. The techniques and methods developed by researchers in such fields are used to mine data and information in KGs to extract insights crucial to advancing knowledge in various domains. Research in Information Networks also played a role in the construction and mining of KGs mainly through relying on statistical methods and machine learning techniques (Sun and Han, 2012). Information networks are widely heterogeneous graphs of nodes and edges representing meta-information about a published corpus of literature such as authors, papers, publications, and venues. Hence, information networks are KGs in which graph mining techniques can be applied to extract insights about author collaboration patterns and their topics of interest. Mining information networks as KGs has to lead to understanding trends such as collaboration patterns and potential drug re-purposing opportunities in a specific domain without reading the entire literature in a field.

KGs have been curated manually, yet, over the years, KG construction techniques have changed. For example, Cyc (Lenat, 1995) is a KG that was manually curated, while Freebase (Bollacker et al., 2008) and Wikidata (Vrandečić and Krötzsch, 2014)

were crowd-sourced. KGs can also be extracted using Natural Language Processing (NLP) techniques such as in DBpedia

(Lehmann et al., 2015) and YAGO (Suchanek et al., 2007). Alternatively, KGs can be constructed using a combination of manual curation and automatic extraction like in NELL (Carlson et al., 2010) and Knowledge Vault (Dong et al., 2014). Regardless of the approach of how a KG has been constructed, KGs need to be queried and mined to map complex real-world phenomena and eventually be exploited to solve important research questions. For example, Facebook’s Social Graph needs to be mined to suggest new friends for users. Life Sciences KGs like Chem2Bio2RDF need to be mined to answer research questions related to biomedical science.

1.2. Named Entity Recognition in Knowledge Graph Construction

The move towards natural language understanding through semantic technologies has gained much ground in the past decade, promoting Named Entity Recognition (NER) to a central NLP task. NER has been crucial for building and constructing KGs as the primary method of extracting entities and possibly relations from free text. Also, tasks such as link prediction, relation extraction, and graph completion on KGs are aided by NER. NER can be impactful when applied to mine domain-specific scientific literature such as the biomedical literature to extract bio entities aiding in constructing KGs and advancing downstream knowledge discovery tasks in biomedicine.

Although research in NER has been advancing since the nineties (Nadeau and Sekine, 2007), early efforts in domain-specific biomedical NER came later in the early 2000s (Settles, 2004)

. Those methods in biomedical NER relied on feature engineering and graphical models such as Hidden Markov Models (HMM) and Conditional Random Fields (CRF)

(Settles, 2004)

. When applying CRF models to the biomedical text, the objective is to construct a chain out of the words then predict the assigned labels based on a conditionally trained finite state machine where the probability of each label assigned to a word is correlated with a feature set. The objective was then to maximize the log-likelihood of the label given the word directly. The accuracy of the recognition of bioentities in CRF and HMM models were quite low when compared to state of the art today. The current state of the art relies on the latest in Deep Learning in contextual embedding such as BERT. BERT is a deep learning model developed in

(Devlin et al., 2018) by a team at Google to be fine-tuned for machine translation tasks. The model was based on the transformer architecture described in (Vaswani et al., 2017). Multiple attention heads are used to train a contextual embedding where the task is to predict masked words of the input sentences. The sophisticated inner architecture of BERT based on multiple encoder-decoder layers allows for learning high-quality embedding from a large corpus of data where the learned weights can be later transferred and fine-tuned to downstream tasks. In (Lee et al., 2020), the authors trained a BERT model on the corpus of PubMed and PMC named BioBERT. The result was a biomedical contextual embedding model that was later fine-tuned and used in a biomedical NER task producing high accuracy tagging and extraction of bio entities such as drugs, diseases, and genes. The high accuracy of the BioBERT model allowed and aided in the construction of the PubMed KG presented in (Xu et al., 2020).

1.3. The COVID-19 Knowledge Graph

Several months amid the emergence of the acute respiratory syndrome COVID-19 caused by the novel coronavirus Sars-CoV-2 in China, the disease has risen to a global pandemic level affecting almost every country on earth and infecting more than 6 million people across the globe and killing more than 350000 (15). As a result, researchers from every domain have reoriented their efforts towards finding ways and solutions to tackle the pandemic. Specifically, the biomedical literature on COVID-19 and SARS-CoV-2 and other related acute respiratory syndromes that have reached an epidemic level such as SARS and MERS have increased exponentially since the virus’s appearance back in December 2019. As a result, government-backed calls and research institutions like the Allen Institute for AI have released a COVID-19 Open Research Dataset CORD-19 (5). The dataset contains over 65000 full-text scholarly articles related to COVID-19, SARS-CoV-2, and other related topics. This effort aims to encourage the NLP and KG researcher community to mine the dataset to generate insights through text mining techniques and methods to help point biomedical researchers in the right direction to fight against the virus.

We see the release of the CORD-19 dataset of machine-readable scientific literature as an opportunity to extract a comprehensive and cohesive COVID-19 KG of the entities and relationships though cooccurrence within the corpus of articles. The extracted KG will help understand the relationships between the diseases, the genes, the viruses, and the cures involved in and related to COVID-19 so that future graph and network mining efforts can be applied to extract insights from the dataset. Here we present our vision in contributing to that effort.

We demonstrate methods of entity extraction and KG building to harvest a COVID KG capable of being a useful dataset for future mining in the hope that it will help biomedical researchers find a cure and tackle the pandemic through generating deep insights. We first introduce how to use BioBERT for named entity recognition in the PubMed and CORD-19 datasets. Then we built several Coronavirus Knowledge Graphs based on two different kinds of measurements. One measure the relationship between source node and each target node based on co-occurrence frequency. The other is to use Cosine Similarity to measure the similarity between the source node and each target node.

2. Related Work

Previous efforts and trials to build a comprehensive COVID-19 KG have lacked in several areas. For example (Domingo-Fernandez et al., 2020) built a COVID-19 related KG based on 145 articles and provided a web application for ease of use and access. This COVID-19 KG contains 3954 nodes and 9484 relations, covering ten entity types. It reveals host-pathogen interactions, comorbidities, symptoms, and discovered over 300 candidate drugs for COVID-19. Nevertheless, the effort was limited in terms of the number of publications included in constructing the KG. (Ge et al., 2020) applied a machine learning model (BERE (Hong et al., 2019)) to integrate and mine KG to also aide in the effort of identifying candidate drugs for COVID-19. Besides, (Richardson et al., 2020) used a pre-built KG for COVID-19 drug discovery and identified the drug ”baricitinib” to protect lung cells from being infected by the virus. Previously mentioned efforts though promising, yet they lacked the large scale KG construction and mining approaches necessary to extract more profound and in-depth insights about the disease and possible cures, treatments, and genetic influences.

NLP techniques have also been utilized outside of the KG construction arena, for example, (Tang et al., 2020) introduced CovidQA, a question answering dataset, which comprises 124 questions and answers of triples built by hand from knowledge collected from the CORD-19 dataset. (Grujicic et al., 2020) developed a self-supervised context-aware COVID-19 document exploration based on BERT. (Liang and Xie, 2020) used BERT to analyze a large collection of COVID-19 literature from the CORD-19 dataset (Inc., 2020) to extract COVID-19 related radiological findings. Though rigorous in using large datasets such as CORD-19, the previous NLP techniques were limited in terms of applications and the impact of those applications on the COVID-19 oriented biomedical research field.

3. Datasets

3.1. PubMed dataset

The PubMed database contains more than 30 million citations within the various fields of life sciences. The PubMed citation database archived by the MEDLINE archive has always been the desired datasets for biomedical text and graph mining research communities.

We select PubMed dataset because it is a popular dataset in biomedical area and reflect general biomedical knowledge.

3.2. The PubMed Knowledge Graph

(Xu et al., 2020) built a PubMed KG which connects disambiguated author names, their articles, and bio-entities using the PubMed database were they parsed 29 million PubMed abstracts from 1781 till 2019. In addition to funding extracted from the National Institutes of Health using ExPORTER, and affiliations were extracted from ORCID and MapAffil.

3.3. The CORD-19 Dataset

The CORD-19 dataset was released in response to COVID-19, where the US Government has issued requests for research groups and institutions to combine efforts to release the COVID-19 Open Research Dataset (CORD-19). The datasets contain more than 135000 articles with over 68000 full texts on topics related to Coronavirus and the COVID-19 pandemic. The data set was released to help the biomedical research community by applying the latest in NLP to extract deep insights and understandings of the pandemic patterns and the possible drugs, cures, and genes that might be involved and identified (5).

Here we perform our analysis on the entities and relationships extracted from the three datasets and we show the potential in knowledge discovery.

4. Experiment-0 Identify Experts on Coronavirus topics

We would like to identify the experts for COVID to encourage collaboration. To do that, we analyzed COVID-19 44k dataset and ranked the researchers according to the number of articles they published in the COVID-19 44K dataset. Part of the results are shown in Table 1.

Author
# of articles published
in COVID-19 dataset
Perlman, Stanley 142
Drosten, Christian 137
Yuen, Kwok-Yung 136
Baric, Ralph S 132
Jiang, Shibo 120
Enjuanes, Luis 116
Snijder, Eric J 104
Weiss, Susan R 92
Table 1. COVID-19 related researchers

5. Experiment-1 Named Entity Recognition with BioBERT

5.1. Model

BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018) is a highly influential Natural Language Processing model that proposed back in 2018. BERT was inspired by many advanced Deep Learning models, such as semi-supervised sequence learning(Dai and Le, 2015), ELMo (Peters et al., 2018) and the Transformer architecture (Vaswani et al., 2017).

The input representation of BERT is the sum of a token embedding using WordPiece, a segmentation embedding indicating whether each token belongs to sentence A or sentence B, and a position embedding. A [CLS] flag is added before the first word of the sentence, and a [SEP] flag is added as a separator token.

BERT has two tasks for pre-training: Masked Language Model task and Next Sentence Prediction task. Considering most of traditional NLP model, instead of training a left-to-right or right-to-left model based on the input language, it is better to use the bidirectional model. However, the bidirectional model is not suitable for the conditional task. Thus, inspired by the Cloze task, a masked language model is adapted as the first task for BERT pre-training. The second task for BERT pre-training is Next Sentence Prediction (NSP), which allows the model to understand sentence relationships.

BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) (Lee et al., 2020) is a biomedical language representation model based on BERT (Devlin et al., 2018). It is proposed because directly adapting BERT to text mining in the biomedical area was not promising, given the word shift from generic domain to the biomedical domain. BioBERT is pre-trained on PubMed abstracts and PubMed Central full-text articles (PMC).

We select BioBERT-Base v1.1 (+PubMed 1M) based on the BERT-base-Cases model. For our fine-tuning section, we fine-tuned it on the NCBI disease dataset. The input to the BioBERT model is a sentence embedded following BERT’s embedding process. Parts of the token-level evaluation looks like: ”[CLS]Ang ##iot ##ens ##in - converting enzyme 2 (AC ##E ##2 ) as a SA ##RS - Co ##V - 2 receptor: molecular mechanisms and potential therapeutic target. [SEP]”. The output of this model will be a sentence with labels. Label ”B-MISC” means the Begining of bioentities, ”I-MISC” means Insdie the bio entities, and ”O-MISC” means Outside the bio entity. We tested BioBERT on PubMed KG and the CORD-19 dataset.

5.2. Results

Examples of entity-level recognized names from the PubMed dataset are shown in Table 2. The recognized bio-entities are ”acute respiratory disease”, ”pneumonia”, and ”acute respiratory syndrome coronavirus 2”. ”SARSCOV-2” should, but is not recognized as bio-entities.

Word Label
Asymptomatic O-MISC
carrier O-MISC
state O-MISC
, O-MISC
acute B-MISC
respiratory I-MISC
disease I-MISC
, O-MISC
and O-MISC
pneumonia B-MISC
due O-MISC
to O-MISC
severe O-MISC
acute B-MISC
respiratory I-MISC
syndrome I-MISC
coronavirus I-MISC
2 I-MISC
( O-MISC
SARSCoV O-MISC
- O-MISC
2 O-MISC
) O-MISC
: O-MISC
Facts O-MISC
and O-MISC
myths O-MISC
Table 2. Named Entity Recognition using BioBERT in PubMed dataset

Since we do not have detailed labels for BioBERT fine-tuning training and thus cannot predict detailed labels directly from BioBERT. To get more detailed labels, we trained a Random Forest model on around 100,000 PubMed bio-entities labeled with five categories: species, gene, diseases, drug, gene mutation, and tested on the bio-entities recognized by BioBERT. The F1-score is shown in Table

3. The F1-Score of disease, gene, and drug recognition are all over 75%. However, the model poorly predicts when it comes to label ”Species” and labels ”Gene mutation,” probably because the dataset was very unbalanced (Gene mutation only has 17 samples, and Species has only 1395 samples).

Label Precision Recall F1-Score Data Size
Species 0.72 0.31 0.43 15519
Disease 0.94 0.61 0.74 12077
Gene 0.95 0.64 0.76 18678
Drug 0.66 0.99 0.79 53523
Gene mutation 0.5 0.17 0.25 36
Table 3. Specific named entity classification using Random Forest

Examples of named entity recognized from CORD-19 44K dataset are shown in Table 4. In the example, the recognized bio-entities are ”coronavirus disease 2019”, and ”COVID-19”. ”Thrombocytopenia,” a kind of disease, should, but is not recognized. From these cases, we find that BioBERT does not perform greatly, despite its high accuracy. It may be because BioBERT can easily recognize the easy and common bio-entities with a high occurrence rate but fail to recognize rare biomedical terms.

Word Label
Thrombocytopenia O-MISC
is O-MISC
associated O-MISC
with O-MISC
severe O-MISC
coronavirus B-MISC
disease I-MISC
2019 I-MISC
COVID B-MISC
- I-MISC
19 I-MISC
infections O-MISC
A O-MISC
meta O-MISC
- O-MISC
analysis O-MISC
Table 4. named entity Recognition using BioBERT in COVID-19 44K dataset

6. Experiment-2.1 Co-occurrence frequency based Knowledge Graph

6.1. Method

We used Gephi to build the co-occurrence frequency based Knowledge Graph. Co-occurrence frequency is an above-chance frequency of occurrence of two entities from an article. The data is from the PubMed Knowledge Graph. For each target node (related bio-entities), we calculated the times it shows up with the source node and treat the times(co-occurrence frequency) as target node’s weight. The higher the co-occurrence frequency, the closer the target node is to the source node.

6.2. Result

Figure 1 and Figure 2 shows remdesivir centered KGs. In Figure 1, the source node is remdesivir and the target nodes are remdesivir related disease. As shown, “ COVID-19”, “Ebola”, “SARS”, “EVD”, “MERS”, “EBOV”, “cytokine storm”, “acute cardiac injury”, and “ ARDS” are related to remdesivir. In Figure 2 the source node is remdesivir and the target nodes are remdesivir related drugs from PubMed bio-entity knowledge graph. The edge’s weights are based on co-occurrence frequency. As shown, “favipiravir” (used for influenza), “ritonavir” (used for HIV), “lopinavir (used for HIV)”, “ribavirin” (used for severe lung infections), “chloroquine” (used for lupus, malaria, and rheumatoid arthritis), and “pyrazofurin”, which has antibiotic, antiviral and anti-cancer properties with severe side effects, are remdesivir related drugs.

Figure 1. remdesivir related disease KG based on co-occurrence frequency
Figure 2. remdesivir related drug KG based on co-occurrence frequency

We also generated other 6 drug-centered KGs. The 6 drugs are favipiravir, ritonavir, lopinavir, ribavirin, tamiflu, and umifenovir. The diseases highly related to favipiravir are: Ebola, influenza, Epstein-Barr virus, HBV infection, CCHF, avian influenza, hemorrhagic fever, Lassa fever, thrombocytopenia syndrome, Rift Valley fever, and etc. The drugs highly related to favipiravir are: ribavirin, tamiflu, peramivir, amantadine, laninamivir, BCX4430, T-1105, Pyrazine, and etc.

The diseases highly related to ritonavir are: HIV, AIDS, hepatitis C virus, cirrhosis, nausea, and etc. The drugs highly related to ritonavir are: lopinavir, indinavir, darunavir, lamivudine, atazanavir, nelfinavir, and etc.

The diseases highly related to lopinavir are: HIV, AIDS, cardiovascular disease, lipodystrophy, hepatitis C, diarrhea, malaria, nausea, and etc. The drugs highly related to lopinavir are: ritonavir, saquinavir, abacavir, nucleoside, indinavir, lamivudine, darunavir, atazanavir, nelfinavir, tenofovir, nevirapine, zidovudine, amprenavir, and etc.

The diseases highly related to ribavirin are: HIV, HCC, cirrhosis, hepatitis, liver cirrhosis, liver transplantation, AIDS, RSV, depression, chronic diseases, and etc. The drugs highly related to ribavirin are: telaprevir, ledipasvir, alanine, sofosbuvir, boceprevir, PEG, daclatasvir, simeprevir, ritonavir, paritaprevir, ombitasvir, and dasabuvir.

The diseases highly related to tamiflu are: influenza, avian influenza, CCHF, pneumonia, HBV infection, cough, headache, hypernatremia, fever, and etc. The drugs highly related to tamiflu are: amantadine, zanamivir, amino acid, oseltamivir, carboxylateribavirin, oseltamivir Phosphate, Peramivir, rimantadine, and laninamivir.

The diseases highly related to umifenovir are: influenza, acute respiratory infections, viral infections, pneumonia, fever, and mepatitis B virus (HBV) infection. The drugs highly related to umifenovir are tamiflu, rimantadine, ribavirin, ingavirin, amantadine, ARB,indole, Zanamivir, Triazavirin, Reaferon, and etc.

Figure 3, Figure 4, and Figure 5 show two corona virus diseases (SARS, MERS) and Ebola centered KG, respectively. Figure 3 shows that SARS’s highly related diseases are acuate respiratory distress syndrome, fever, influenza, HIV-1, Osteonecrosis, allergic inflammation, lung disease, atypical pneumonia, allergic rhinitis, cough, and etc. SARS’s highly related genes/chemicals are CD8+, CD4+, TNF-alpha, interferon-gamma, IFN-alpha, IL8, lgG, C-reactive protein, S protein, lactate dehydrogenase, ACE2 gene, and etc. SARS’s highly related drugs are ribavirin, methylpredinisolone, and corticosteroids, and etc.

Figure 4 shows that MERS’s highly related diseases are Severe Acute Respiratory Syndrome, PRCV infection, and influenza. MERS’s highly related drugs are macrolides, ribavirin, azithromycin, lopinavir, and nitonavir. MERS’s highly related genes/chemicals are CD26, S protein gene, amino peptidase N, and etc.

Figure 5 shows that Ebola’s highly related diseases are hemorrhagic fever, hyperthermia, malaria, and mosquito-borne infections. MERS’s highly related drugs are favipiravir and amodiaquine. MERS’s highly related genes/chemicals are CD8+, CD4, DC-SIGN, CD317, GP2, IFN-g, IRF3, RBBP6, and etc.

Figure 3. SARS centered KG based on co-occurrence frequency
Figure 4. MERS centered KG based on co-occurrence frequency
Figure 5. Ebola centered KG based on co-occurrence frequency

Figure 6 shows Angiotensin-converting enzyme 2 (ACE2) centered Knowledge Graph. ACE2 is an enzyme, which lowers blood pressure by catalysing the hydrolysis of angiotensin II into angiotensin (1-7). ACE2 is the receptor that COVID-19 uses to infect lung cells. It also serves as receptor for other coronaviruses such as HCoV-NL63, SARS-CoV. As shown, ACE2’s related genes/chemicals are renin, RAS, angiotensin, insulin, Mas receptor, vascular endothelial growth factor-A, and etc. ACE2’s related diseases are diabetes, hypertensive, chagas disease, severe acute respiratory syndrome-associated coronavirus. ACE2’s related drugs are streptozotocin, nitric-oxide, and aldosterone .ACE2’s related gene mutations are “rs2106809” and “rs2074192”.

Figure 6. ACE2 centered KG based on co-occurrence frequency

From the results we believe the PubMed Knowledge Graph is very promising. However, this kind of KG has a entity name disambiguation issue. For example, “Favipiravir” could also be shown as “favipiravir”. Another case is “ACE-2”, which is the abbreviation of Angiotensin-converting enzyme 2. Besides, the co-occurrence frequency cannot reflect the relationship between the source node and the target node well. For example, if “A has nothing to do with B” mentioned lots of times in different documents, its co-occurrence frequency will be very high.

7. Experiment-2.2 Cosine similarity based Knowledge Graph

7.1. Method

To deal with problems with co-occurrence frequency based KG, we first normalized the entity using some human designed rules to deal with entity name disambiguation issue. We mainly focus on case sensitive, singular and plural, and disambiguation. For example, “SIAsNN” will be normalized as ”siann” and“respiratory illnesses” will be normalized as ”respiratory

illness”. Then we used Word2Vec to convert the normalized entity to vector with length of 100. We then use Cosine Similarity to measure the similarity between the source node and each target node. The Cosine Similarity is defined as follows:

(1)

The KG based on cosine similarity is also built using Gephi software.

7.2. Results

Figure 7 shows part of the favipiravir-centered knowledge graph (chemical related). The source node is favipiravir and the target node are related chemicals. The edge is cosine similarity relations. The closer the target node is to the source node, the similar the target node is to the source node. As shown, the top 10 chemical related to favipiravir are guanine, csa, lysine, nh2, titanium, proline, methicillin, anthraquinone, rimantadine, polyacrylamide. Figure 8 shows part of the favipiravir-centered knowledge graph (gene related). As shown, the top 10 gene related to favipiravir are gm  csf, abortion, rig  i, isg15, csa, akt, mtor, p53, th1, p38, tgf  beta.

Figure 7. favipiravir-centered KG (related chemical) based on cosine similarity
Figure 8. favipiravir-centered KG (related gene) based on cosine similarity

We also generate other 5 drug-centered KGs based on cosine similarity. The top 10 chemicals related to lopinavir are retinoic  acid, nucleoside, tyr, glutamine, ribavirin, glycyrrhizin, co2, lopinavir, phosphonate, lymphoma, ifitm3. The top 10 genes related to lopinavir are neuraminidase, p53, eif2alpha, apod, ribavirin, infection, cox-2, ifitm3, iron.

The top 10 chemicals related to ribavirin are glycyrrhizin, lactate, corticosteroid, coronavirus, steroid, nucleoside, ribavirin, sodium, glucose, infection, oxygen, calcium, obesity. The top 10 genes related to ribavirin are toxicity, p53, swine, fibrosis, iron, neuraminidase, diabetes, ribavirin, anemia, inflammation, infection.

The top 10 chemicals related to ritonavir are atp, ritonavir, cyclophosphamide, mtt, toxicity, sialic  acid, sds, encephalitis, superoxide, sucrose, ethanol. The top 10 genes related to ritonavir are jnk, p53, rig  i, encephalitis, rnase  l, stat3, toxicity, akt, neuraminidase, stat1.

The top 10 chemicals related to tamiflu are superoxide, prednisolone, flavonol, proline, nitric  oxide, thymidine, glycyrrhizin, propidium  iodide, nitrogen, aspirin, tamiflu. The top 10 genes related to tamiflu are il-10, cd44, eif2alpha, tgf  beta1, tlr2, ifn, cxcl10, tumor  necrosis  factor  tnf)-alpha, ire1, ccl2, tbk1.

The top 10 chemicals related to umifenovir are: tacrolimus, alkyl, carbon  monoxide, ca(2, cd, nucleolin, cytosine, glycyrrhizic  acid, 2’-o, umifenovir, prostaglandin  e2. The top 10 genes related to umifenovir are parp, pd  l1, monocyte  chemoattractant  protein-1, nef, cxcr4, cd45, nucleolin, dc  ign, annexin  v, cd19, mmp-2.

8. Discussion and Conclusion

In this research, we first used BioBERT to recognize entities in the PubMed and the CORD-19 dataset. Our results show that most of the recognized entities are strictly biomedical. Most of the recognized entities in the CORD-19 dataset are disease lacking diversity in entity types due to a lack in finding a suitable bio-medical training dataset with detailed labeled bio-entity. For future work, we will explore more other bio-medical dataset and try other biomedical NLP models for named entity recognition, e.g., blueBERT (Peng et al., 2019).

Furthermore, we introduced the construction of Coronavirus Knowledge Graph based on two different methods: co-occurence frequency and cosine similarity. We explored and revealed that the drug candidates recommended by drug-centered KG are promising. We will consult experts in COVID-related research to verify our drug-centered KG and conduct a more in-depth analysis for future work.

Also, we aim to build a wider COVID related KG, connecting all COVID related bio-entities rather than small drug/disease-centered KG. The extracted KG will help understand the relationships between the diseases, the genes, the viruses, and the cures involved in and related to COVID-19.

Finally, we hope to build an automatic profiling system to generate expert, drug, or disease profiling. The expected disease profiling will look like Figure 9, which includes description, related bio entities (drugs, gene, protein, species), topic distribution, related experts, organization, and featured publications.

Figure 9. auto drug profiling example

9. Contributions

Y.D. and Y.B. proposed the idea and supervised the project. C.C. wrote the paper. I.A.E wrote the Introduction and revised this paper. C.C. conducted the named entity recognition and Knowledge Graph building. I.A.E conducted the Word2Vec for Experiment 2.2.

Acknowledgements.
We would like express our gratitude to Prof. Jaewoo Kange’s DMIS Lab team for pretraining BioBERT, Vinay Locharulu for suggestion and support, Prof. Jian Xu for providing PubMed Knowledge Graph, and Yifei Wu for conducting entity normalization.

References

  • K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 1247–1250. Cited by: §1.1.
  • A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka, and T. M. Mitchell (2010) Toward an architecture for never-ending language learning. In Twenty-Fourth AAAI Conference on Artificial Intelligence, Cited by: §1.1.
  • B. Chen, X. Dong, D. Jiao, H. Wang, Q. Zhu, Y. Ding, and D. J. Wild (2010) Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC bioinformatics 11 (1), pp. 255. Cited by: §1.1.
  • P. P. Chen (1976) The entity-relationship model—toward a unified view of data. ACM transactions on database systems (TODS) 1 (1), pp. 9–36. Cited by: §1.1.
  • [5] COVID-19 open research dataset challenge (cord-19). External Links: Link Cited by: Coronavirus Knowledge Graph: A Case Study, §1.3, §3.3.
  • A. M. Dai and Q. V. Le (2015) Semi-supervised sequence learning. In Advances in neural information processing systems, pp. 3079–3087. Cited by: §5.1.
  • F. Dernoncourt and J. Y. Lee (2017) Pubmed 200k rct: a dataset for sequential sentence classification in medical abstracts. arXiv preprint arXiv:1710.06071. Cited by: Coronavirus Knowledge Graph: A Case Study.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.2, §5.1, §5.1.
  • D. Domingo-Fernandez, S. Baksi, B. Schultz, Y. Gadiya, R. Karki, T. Raschka, C. Ebeling, M. Hofmann-Apitius, et al. (2020) COVID-19 knowledge graph: a computable, multi-modal, cause-and-effect knowledge model of covid-19 pathophysiology. BioRxiv. Cited by: §2.
  • X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang (2014) Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 601–610. Cited by: §1.1.
  • Y. Ge, T. Tian, S. Huang, F. Wan, J. Li, S. Li, H. Yang, L. Hong, N. Wu, E. Yuan, et al. (2020) A data-driven drug repositioning framework discovered a potential therapeutic agent targeting covid-19. bioRxiv. Cited by: §2.
  • D. Grujicic, G. Radevski, T. Tuytelaars, and M. B. Blaschko (2020) Self-supervised context-aware covid-19 document exploration through atlas grounding. Cited by: §2.
  • L. Hong, J. Lin, J. Tao, and J. Zeng (2019) BERE: an accurate distantly supervised biomedical entity relation extraction network. arXiv preprint arXiv:1906.06916. Cited by: §2.
  • S. Inc. (2020) Cited by: §2.
  • [15] Johns hopkins coronavirus resource center. External Links: Link Cited by: §1.3.
  • J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36 (4), pp. 1234–1240. Cited by: §1.2, §5.1.
  • J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. Van Kleef, S. Auer, et al. (2015) DBpedia–a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web 6 (2), pp. 167–195. Cited by: §1.1.
  • D. B. Lenat (1995) CYC: a large-scale investment in knowledge infrastructure. Communications of the ACM 38 (11), pp. 33–38. Cited by: §1.1.
  • Y. Liang and P. Xie (2020) Identifying radiological findings related to covid-19 from medical literature. arXiv preprint arXiv:2004.01862. Cited by: §2.
  • D. Nadeau and S. Sekine (2007) A survey of named entity recognition and classification. Lingvisticae Investigationes 30 (1), pp. 3–26. Cited by: §1.2.
  • H. Paulheim (2017) Knowledge graph refinement: a survey of approaches and evaluation methods. Semantic web 8 (3), pp. 489–508. Cited by: §1.1.
  • Y. Peng, S. Yan, and Z. Lu (2019) Transfer learning in biomedical natural language processing: an evaluation of bert and elmo on ten benchmarking datasets. arXiv preprint arXiv:1906.05474. Cited by: §8.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §5.1.
  • P. Richardson, I. Griffin, C. Tucker, D. Smith, O. Oechsle, A. Phelan, and J. Stebbing (2020) Baricitinib as potential treatment for 2019-ncov acute respiratory disease. Lancet (London, England) 395 (10223), pp. e30. Cited by: §2.
  • S. J. Russell and P. Norvig (2016) Artificial intelligence: a modern approach. malaysia. Pearson Education Limited. Cited by: §1.1.
  • B. Settles (2004) Biomedical named entity recognition using conditional random fields and rich feature sets. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP), pp. 107–110. Cited by: §1.2.
  • F. M. Suchanek, G. Kasneci, and G. Weikum (2007) Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web, pp. 697–706. Cited by: §1.1.
  • Y. Sun and J. Han (2012) Mining heterogeneous information networks: principles and methodologies. Synthesis Lectures on Data Mining and Knowledge Discovery 3 (2), pp. 1–159. Cited by: §1.1.
  • R. Tang, R. Nogueira, E. Zhang, N. Gupta, P. Cam, K. Cho, and J. Lin (2020) Rapidly bootstrapping a question answering dataset for covid-19. arXiv preprint arXiv:2004.11339. Cited by: §2.
  • J. Ugander, B. Karrer, L. Backstrom, and C. Marlow (2011) The anatomy of the facebook social graph. arXiv preprint arXiv:1111.4503. Cited by: §1.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1.2, §5.1.
  • D. Vrandečić and M. Krötzsch (2014) Wikidata: a free collaborative knowledgebase. Communications of the ACM 57 (10), pp. 78–85. Cited by: §1.1.
  • J. Xu, S. Kim, M. Song, M. Jeong, D. Kim, J. Kang, J. F. Rousseau, X. Li, W. Xu, V. I. Torvik, et al. (2020) Building a pubmed knowledge graph. arXiv preprint arXiv:2005.04308. Cited by: §1.2, §3.2.