Practical progress in combating COVID-19 depends heavily on effective search, discovery, assessment, and extension of scientific research results. However, clinicians and scientists face two unique barriers to digesting these research papers.
The first challenge is quantity. The bottleneck in knowledge access is exacerbated during a pandemic, when increased investment in relevant research leads to even faster growth of literature than usual. For example, as of April 28, 2020, PubMed (https://www.ncbi.nlm.nih.gov/pubmed/) listed 19,443 papers related to coronavirus; as of June 13, 2020, there were 140K+ related papers, with nearly 2.7K new papers per day (see Figure 1). This knowledge bottleneck causes significant delays in the development of vaccines and drugs for COVID-19. More intelligent knowledge discovery technologies need to be developed to enable researchers to access and digest relevant knowledge from the literature more quickly and accurately.
The second challenge is quality, due to the rapid rise in preprint manuscripts published without pre-publication peer review. Many research results about coronavirus from different research labs and sources are redundant, complementary, or even conflicting with each other, while some false information has been promoted in both formal publication venues and social media platforms such as Twitter. As a result, some of the policy responses to the virus, and public perception of it, have been based on misleading, and at times erroneous, claims. The isolation of these knowledge resources makes it hard, if not impossible, for researchers to connect dots that exist in separate resources to obtain insights.
Let us consider drug repurposing as a case study. Besides the lengthy clinical trials and biomedical experiments, another major cause of the long development process is the complexity of the problem and the difficulty of drug discovery in general. Current clinical trials for drug repurposing rely mainly on symptoms, considering drugs that can treat diseases with similar symptoms. However, there are too many drug candidates and too much misinformation published from multiple sources. Clinicians and scientists thus urgently need help to obtain a reliable ranked list of drugs with detailed evidence, and to gain new insights into the underlying molecular and cellular mechanisms of COVID-19 and into which pre-existing conditions may affect the mortality and severity of this disease.
To tackle these two challenges we propose a new framework, COVID-KG, to accelerate scientific discovery and build a bridge between clinicians and biology scientists, as illustrated in Figure 2. COVID-KG starts by reading existing papers to build multimedia knowledge graphs (KGs), in which nodes are entities/concepts and edges represent relations and events involving these entities, extracted from both text and images. Given the KGs enriched with path ranking and evidence mining, COVID-KG answers natural language questions effectively. Using drug repurposing as a case study, for 11 typical questions that human experts aim to explore, we integrate our techniques to generate a comprehensive report for each candidate drug. Preliminary assessment by expert clinicians and medical school students shows that our generated reports are both informative and sound.
2 Multimedia Knowledge Graph Construction
2.1 Coarse-grained Text Knowledge Extraction
Our coarse-grained Information Extraction (IE) system consists of three components. (1) Coarse-grained entity extraction Wang et al. (2019a) and entity linking Zheng et al. (2015) for four entity types: Gene, Disease, Chemical, and Organism. We follow the entity ontology defined in the Comparative Toxicogenomics Database (CTD) Davis et al. (2016), and obtain a Medical Subject Headings (MeSH) Unique ID for each mention. (2) Relation extraction: based on the MeSH Unique IDs, we further link all entities to the CTD and extract 133 subtypes of relations such as Gene–Chemical–Interaction Relationships, Chemical–Disease Associations, Gene–Disease Associations, Chemical–GO Enrichment Associations and Chemical–Pathway Enrichment Associations. (3) Event extraction Li et al. (2019): we extract 13 event types and the roles of entities involved in these events, including Gene expression, Transcription, Localization, Protein catabolism, Binding, Protein modification, Phosphorylation, Ubiquitination, Acetylation, Deacetylation, Regulation, Positive regulation, and Negative regulation. Figure 3 shows an example of the constructed knowledge graph.
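To make the structure concrete, the following is a minimal sketch of how linked entities and extracted relations could be stored as a typed graph. The class names (`Entity`, `Relation`, `KnowledgeGraph`), the field names, and the identifiers such as `C:0001` are illustrative assumptions, not the schema used by our system.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four coarse-grained entity types named in the text.
ENTITY_TYPES = {"Gene", "Disease", "Chemical", "Organism"}


@dataclass(frozen=True)
class Entity:
    uid: str    # hypothetical identifier standing in for a MeSH Unique ID
    name: str
    etype: str


@dataclass(frozen=True)
class Relation:
    head: str    # uid of the head entity
    label: str   # relation or event type
    tail: str    # uid of the tail entity
    source: str  # paper the relation was extracted from (placeholder value here)


class KnowledgeGraph:
    def __init__(self):
        self.entities = {}
        self.edges = defaultdict(list)

    def add_entity(self, entity):
        if entity.etype not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {entity.etype}")
        self.entities[entity.uid] = entity

    def add_relation(self, relation):
        # Both endpoints must already be linked entities.
        if relation.head not in self.entities or relation.tail not in self.entities:
            raise KeyError("relation refers to an unlinked entity")
        self.edges[relation.head].append(relation)

    def neighbors(self, uid):
        """Outgoing (relation label, neighbor name) pairs for one entity."""
        return [(r.label, self.entities[r.tail].name) for r in self.edges[uid]]


# Toy content for illustration only.
kg = KnowledgeGraph()
kg.add_entity(Entity("C:0001", "losartan", "Chemical"))
kg.add_entity(Entity("D:0001", "hypertension", "Disease"))
kg.add_relation(Relation("C:0001", "Chemical-Disease Association", "D:0001", "paper-1"))
```

Keying entities by a single identifier is what allows mentions from different papers to collapse onto the same node.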
2.2 Fine-grained Text Entity Extraction
However, questions from experts often involve fine-grained knowledge elements, such as “Which amino acids in glycoprotein are most related to Glycan (CHEMICAL)?”. In order to answer these questions, we apply our fine-grained entity extraction system CORD-NER Wang et al. (2020c) to extract 75 types of entities to enrich the KG, including many COVID-19 specific new entity types (e.g., coronaviruses, viral proteins, evolution, materials, substrates and immune responses). CORD-NER relies on distantly- and weakly-supervised methods Wang et al. (2019b); Shang et al. (2018), with no need for expensive human annotation. Its entity annotation quality surpasses SciSpacy (over 10% higher on the F1 score based on a sample set of documents), a fully supervised BioNER tool. Figure 4 shows some examples of the annotation results on a CORD-19 paper Zhang et al. (2020).
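The basic idea behind distant supervision, labeling text from a dictionary instead of from human annotation, can be sketched as follows. This is only a longest-match dictionary tagger written for illustration; the dictionary entries and type labels are assumptions, and CORD-NER itself is considerably more elaborate.

```python
# Hypothetical dictionary: lower-cased phrases (as token tuples) -> entity type.
DICTIONARY = {
    ("spike", "protein"): "VIRAL_PROTEIN",
    ("glycan",): "CHEMICAL",
    ("sars-cov-2",): "CORONAVIRUS",
}
MAX_LEN = max(len(phrase) for phrase in DICTIONARY)


def distant_labels(tokens):
    """Assign BIO labels by greedy longest-match lookup in the dictionary."""
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase first.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            etype = DICTIONARY.get(tuple(lowered[i:i + n]))
            if etype is not None:
                match = (n, etype)
                break
        if match is None:
            i += 1
            continue
        n, etype = match
        labels[i] = "B-" + etype
        for j in range(i + 1, i + n):
            labels[j] = "I-" + etype
        i += n
    return labels


tokens = "The SARS-CoV-2 spike protein carries glycan chains".split()
labels = distant_labels(tokens)
```

Labels produced this way are noisy, which is why distantly supervised systems typically train a tagger on them rather than using the dictionary matches directly.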
2.3 Image Processing and Cross-media Entity Grounding
Figures in biomedical papers contain rich information uniquely manifested in the visual modality, such as molecular structures, microscopic images, dosage response curves, relational diagrams, and other visual types. We have developed a visual IE subsystem to extract the visual information from figure images to enrich the knowledge graph. We start by designing a pipeline and automatic tools shown in Figure 5 to extract figures from papers in the CORD-19 dataset and segment figures into close to half a million subfigures. Then, we perform cross-modal entity grounding to ground entities mentioned in captions or referring text to visual objects in the subfigures. Since most figures are embedded as part of PDF files, we employ Deepfigures Siegel et al. (2018) to automatically detect and extract figures from each PDF document. Each figure is associated with text in its caption or referring context (main body text referring to the figure). In this way, a figure can be coarsely attached to a KG entity if the entity is mentioned in the associated text.
To further delineate semantic and visual information contained in each subfigure, we have developed a pipeline to segment individual subfigures and then align each subfigure with its corresponding sub-caption. We employ Figure-separator Tsutsui and Crandall (2017) to detect and separate all non-overlapping image regions. Meanwhile, subfigures in a figure are typically marked with alphabetical letters (e.g., A, B, C, etc.). We use deep neural networks Zhou et al. (2017) to detect text in the figures and use OCR tools Smith (2007) to automatically recognize text information within each figure. To distinguish subfigure marker text from text labels that annotate figure content, we use location proximity between text labels and subfigures to locate subfigure text markers. Location information of such text markers can also be used to merge multiple image regions into a single subfigure. In the end, each subfigure is segmented and associated with its corresponding sub-caption and referring context. The segmented subfigures and associated text labels provide rich information that can expand the KG constructed from text captions. For example, as shown in Figure 6, we apply a classifier to detect subfigure images containing molecular structures. Then, by linking specific drug names extracted from within-figure text to the drug entity in the coarse KG constructed from the caption text, a cross-modal expanded KG can be constructed that links specific molecular structure images to corresponding drug entities in the KG.
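The proximity heuristic for locating subfigure markers can be sketched as follows. The box format, the single-letter marker pattern, and the `max_dist` pixel threshold are assumptions made for this illustration; the inputs stand in for the outputs of the figure separator and of text detection plus OCR.

```python
import math
import re

# A box is (x_min, y_min, x_max, y_max) in pixel coordinates.


def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def distance_to_box(point, box):
    """Euclidean distance from a point to a box (0 if the point lies inside)."""
    px, py = point
    x0, y0, x1, y1 = box
    dx = max(x0 - px, 0.0, px - x1)
    dy = max(y0 - py, 0.0, py - y1)
    return math.hypot(dx, dy)


def assign_markers(regions, texts, max_dist=40.0):
    """Map region index -> marker letter using location proximity.

    regions:  boxes of the separated image regions
    texts:    (string, box) pairs from text detection and OCR
    max_dist: assumed threshold; labels farther than this from every region are ignored
    """
    markers = {}
    if not regions:
        return markers
    for string, box in texts:
        # Only single letters, optionally in parentheses, are marker candidates.
        if not re.fullmatch(r"\(?[A-Za-z]\)?", string.strip()):
            continue
        point = center(box)
        dists = [distance_to_box(point, region) for region in regions]
        best = min(range(len(regions)), key=dists.__getitem__)
        if dists[best] <= max_dist:
            markers[best] = string.strip("() ").upper()
    return markers


# Toy layout: two side-by-side regions, two markers, one content label, one stray letter.
regions = [(0, 0, 100, 100), (120, 0, 220, 100)]
texts = [
    ("A", (2, 2, 12, 12)),
    ("B", (122, 2, 132, 12)),
    ("IC50", (50, 50, 80, 60)),
    ("(c)", (500, 500, 510, 510)),
]
markers = assign_markers(regions, texts)
```

Longer strings such as axis labels are filtered out by the pattern, while the distance threshold discards isolated letters that do not sit near any image region.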
2.4 Knowledge Graph Semantic Visualization
In order to enhance the exploration and discovery of the information mined from the COVID-19 literature through the algorithms discussed in previous sections, we have been developing techniques to create semantic visualizations over large complex networks of biomedical relations. Semantic visualization allows for visualization of user-defined subsets of these relations interactively through semantically typed tag clouds and heat maps. This allows researchers to get a global view of selected relation subtypes drawn from hundreds or thousands of papers at a single glance. This in turn allows for the ready identification of novel relationships that would typically be missed by directed keyword searches or simple unigram word cloud or heatmap displays (https://www.semviz.org/).
We first build a data index from the knowledge elements in the constructed KGs, and then create a Kibana dashboard out of the generated data indices. Each Kibana dashboard has a collection of visualizations that are designed to interact with each other. Dashboards are implemented as web applications. The navigation of a dashboard is mainly through clicking and searching. By clicking the protein keyword EIF2AK2 in the tag cloud named “Enzyme proteins participating Modification relations”, a constraint on the type of proteins in modifications is added. Correspondingly, all the other visualizations will be changed.
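The interaction pattern, in which a click adds a constraint and all linked visualizations are recomputed, can be illustrated without any dashboard software. In the sketch below the records, field names, and placeholder values such as `SUBSTRATE_1` are invented for illustration; in the deployed system this logic is handled by the Kibana dashboard over the data index.

```python
from collections import Counter

# Made-up relation records; field names are assumptions for illustration.
docs = [
    {"relation": "Modification", "enzyme": "EIF2AK2", "substrate": "SUBSTRATE_1", "paper": "paper-1"},
    {"relation": "Modification", "enzyme": "EIF2AK2", "substrate": "SUBSTRATE_2", "paper": "paper-2"},
    {"relation": "Modification", "enzyme": "ENZYME_X", "substrate": "SUBSTRATE_1", "paper": "paper-3"},
    {"relation": "Binding", "enzyme": "ENZYME_X", "substrate": "SUBSTRATE_3", "paper": "paper-3"},
]


def apply_filters(records, filters):
    """Keep the records that satisfy every field == value constraint."""
    return [r for r in records if all(r.get(k) == v for k, v in filters.items())]


def tag_cloud(records, field):
    """Term frequencies for one field: the data behind a tag cloud."""
    return Counter(r[field] for r in records)


# Clicking "EIF2AK2" in the enzyme tag cloud adds one constraint...
filters = {"relation": "Modification", "enzyme": "EIF2AK2"}
selected = apply_filters(docs, filters)

# ...and every other visualization is recomputed from the filtered records.
substrate_cloud = tag_cloud(selected, "substrate")
paper_cloud = tag_cloud(selected, "paper")
```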
One unique feature of the SemViz semantic visualization is the creation of dense tag clouds and dense heatmaps, through a process of parameter reduction over relations, allowing for the visualization of relation sets as tag clouds and multiple chained relations as heatmaps. Figure 7 illustrates such a dense heatmap, where a functionally typed protein is implicated in a disease relation (e.g., “those proteins that are down-regulators of INF which are implicated in obesity”).
3 Knowledge-driven Question Answering
In contrast to most current question answering (QA) methods, which target single documents, we have developed a QA component based on a combination of knowledge graph matching and distributional semantic matching. We build knowledge graph indexing and searching functions so that users can pose queries and search effectively and efficiently. We also support semantic matching from the constructed KGs and related texts by accepting multi-hop queries.
A common category of queries is about the connections between two entities. Given two entities as a query, we generate a subgraph covering salient paths between them to show how they are connected through other entities. Figure 3 is an example subgraph summarizing the connections between Losartan and cathepsin L pseudogene 2. The paths are generated by traversing the constructed KG, and are ranked by the frequency of paths in the KG. Each edge is assigned a salience score by aggregating the scores of paths passing through it. In addition to knowledge elements, we also present related sentences as evidence. We use BioBERT Lee et al. (2020), a pre-trained language model, to represent each sentence along with its left and right neighboring sentences as local context. Applying the same architecture to every candidate sentence and to the user query, we aggregate the sequence embedding layer (the last hidden layer in the BERT architecture) with average pooling Reimers and Gurevych (2019). We use the similarity between the embedding representations of each sentence and the query to extract the most relevant sentences as evidence.
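The path ranking and edge salience steps can be sketched on a toy graph. Two choices below are assumptions made for illustration rather than a description of our implementation: a path is scored by the smallest mention count among its edges, and edge salience is the sum of the scores of the paths that use the edge. The node names and counts are invented.

```python
from collections import defaultdict

# Toy undirected KG: each edge carries an assumed count of supporting mentions.
EDGE_COUNTS = {
    ("drug", "gene_a"): 5,
    ("gene_a", "target"): 3,
    ("drug", "gene_b"): 2,
    ("gene_b", "target"): 4,
    ("gene_a", "gene_b"): 1,
}


def build_adjacency(edge_counts):
    adj = defaultdict(dict)
    for (u, v), count in edge_counts.items():
        adj[u][v] = count
        adj[v][u] = count
    return adj


def simple_paths(adj, start, goal, max_hops):
    """Enumerate cycle-free paths from start to goal with at most max_hops edges."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == goal:
            yield path
            continue
        if len(path) - 1 >= max_hops:
            continue
        for nxt in adj[node]:
            if nxt not in path:
                stack.append(path + [nxt])


def rank_paths(adj, start, goal, max_hops=3):
    """Score each path by its weakest edge count and sort best-first."""
    scored = []
    for path in simple_paths(adj, start, goal, max_hops):
        score = min(adj[a][b] for a, b in zip(path, path[1:]))
        scored.append((score, path))
    # Higher score first; ties broken by shorter path, then by node names.
    scored.sort(key=lambda item: (-item[0], len(item[1]), item[1]))
    return scored


def edge_salience(scored_paths):
    """Sum the scores of all paths passing through each edge."""
    salience = defaultdict(int)
    for score, path in scored_paths:
        for a, b in zip(path, path[1:]):
            salience[frozenset((a, b))] += score
    return dict(salience)


adj = build_adjacency(EDGE_COUNTS)
ranked = rank_paths(adj, "drug", "target")
salience = edge_salience(ranked)
```

Bounding the number of hops keeps the enumeration tractable on a large graph; the subgraph shown to the user can then be built from the top-ranked paths and their most salient edges.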
Queries also often include entity types instead of entity instances, which requires us to extract evidence sentences based on type or pattern matching. We have developed EvidenceMiner Wang et al. (2020a, b), a web-based system that accepts a user’s query as a natural language statement or an inquired relationship at the meta-symbol level (e.g., CHEMICAL, PROTEIN) and automatically retrieves textual evidence from a background corpus of COVID-19 literature.
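A query at the meta-symbol level can be matched by first rewriting each annotated sentence so that entity mentions are replaced by their types. The sketch below uses a plain bag-of-terms match over invented sentences and placeholder entity names; it illustrates the idea only and is not the retrieval method of EvidenceMiner.

```python
# Each sentence comes with entity annotations: (surface mention, entity type).
# Sentences and entity names are invented placeholders.
SENTENCES = [
    ("compound_x inhibits protein_y in infected cells.",
     [("compound_x", "CHEMICAL"), ("protein_y", "PROTEIN")]),
    ("protein_y binds protein_z.",
     [("protein_y", "PROTEIN"), ("protein_z", "PROTEIN")]),
    ("compound_x was well tolerated.",
     [("compound_x", "CHEMICAL")]),
]


def to_meta_level(sentence, annotations):
    """Replace every annotated mention with its entity type."""
    for mention, etype in annotations:
        sentence = sentence.replace(mention, etype)
    return sentence


def retrieve(query, sentences):
    """Return the sentences whose meta-level form contains every query term."""
    terms = query.split()
    hits = []
    for sentence, annotations in sentences:
        meta_tokens = to_meta_level(sentence, annotations).rstrip(".").split()
        if all(term in meta_tokens for term in terms):
            hits.append(sentence)
    return hits


hits = retrieve("CHEMICAL inhibits PROTEIN", SENTENCES)
```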
4 A Case Study on Drug Repurposing Report Generation
4.1 Task and Data
A human-written report about drug repurposing usually answers the following typical questions.
Current indication: what is the drug class? What is it currently approved to treat?
Molecular structure (symbols desired, but a pointer to a reference is also useful)
Mechanism of action i.e., inhibits viral entry, replication, etc. (w/ a pointer to data)
Was the drug identified by manual or computational screen?
Who is studying the drug? (Source/lab name)
In vitro data available (cell line used, assays run, viral strain used, cytopathic effects, toxicity, LD50, dosage response curve, etc.)
Animal data available (what animal model, LD50, dosage response curve, etc.)
Clinical trials ongoing (what phase, facility, target population, dosing, intervention, etc.)
Has the drug shown evidence of systemic toxicity?
List of relevant sources to pull data from.
As case studies, DARPA biologists suggested three drugs, Benazepril, Losartan, and Amodiaquine, and COVID-19 related chemicals/genes, as shown in Figure 8.
Our KG results for many other drugs are visualized at our website (http://blender.cs.illinois.edu/covid19/visualization.html). We download new COVID-19 papers on a daily basis from three Application Programming Interfaces (APIs): the NCBI PMC API, the NCBI Pubtator API, and the CORD-19 archive. We provide incremental updates, including new papers, removed papers, and updated papers, together with their metadata, at our website (http://blender.cs.illinois.edu/covid19/).
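An incremental update of this kind amounts to comparing two snapshots of the corpus. The sketch below assumes each snapshot is a mapping from a paper identifier to a content hash; the function name, the identifiers, and the hash values are hypothetical.

```python
def diff_snapshots(previous, current):
    """Compare two {paper_id: content_hash} snapshots of the corpus."""
    prev_ids, curr_ids = set(previous), set(current)
    return {
        "new": sorted(curr_ids - prev_ids),
        "removed": sorted(prev_ids - curr_ids),
        # A paper present in both snapshots counts as updated if its hash changed.
        "updated": sorted(pid for pid in prev_ids & curr_ids if previous[pid] != current[pid]),
    }


yesterday = {"paper-1": "h1", "paper-2": "h2", "paper-3": "h3"}
today = {"paper-1": "h1", "paper-2": "h2-revised", "paper-4": "h4"}
update = diff_snapshots(yesterday, today)
```

Only the papers listed under `new` and `updated` need to be re-run through extraction, and the knowledge elements of `removed` papers can be dropped from the KG.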
As of June 14, 2020, we have collected 140K papers. We select 25,534 peer-reviewed papers and construct a KG that includes 7,230 Diseases, 9,123 Chemicals, and 50,864 Genes, with 1,725,518 chemical-gene links, 5,556,670 chemical-disease links, and 77,844,574 gene-disease links. The KG has received more than 1,000 downloads. Our final generated reports (http://blender.cs.illinois.edu/covid19/DrugRe-purposingReport_V2.0.docx) are shared publicly. For each question, our framework provides answers along with detailed evidence, knowledge subgraphs, and image segmentation and analysis results. Table 1 shows some example answers.
Several clinicians and medical school students on our team have manually reviewed the drug repurposing reports for three drugs, and also the KGs connecting 41 drugs and COVID-19 related chemicals/genes. Preliminary results show that most of our output is informative, valid, and sound. For instance, after the coronavirus enters the cell in the lungs, it can cause a severe disease called Acute Respiratory Distress Syndrome. This condition causes the release of inflammatory molecules in the body named cytokines, such as Interleukin-2, Interleukin-6, Tumor Necrosis Factor, and Interleukin-10. We see all of these connections in our results, such as the examples shown in Figure 3 and Figure 9. Some results are somewhat surprising to scientists, who consider them worth further investigation. For example, in Figure 3 we can see that Losartan is connected to tumor protein p53, which is related to lung cancer.
|Q1||Drug Class||angiotensin-converting enzyme (ACE) inhibitors|
|Evidence Sentences||[PMID:32314699 (PMC7253125)] Past medical history was significant for hypertension, treated with amlodipine and benazepril, and chronic back pain. [PMID:32081428 (PMC7092824)] On the other hand, many ACE inhibitors are currently used to treat hypertension and other cardiovascular diseases. Among them are captopril, perindopril, ramipril, lisinopril, benazepril, and moexipril.|
|Evidence Sentences||[PMID:32081428 (PMC7092824)] By using a molecular docking approach, an earlier study identified N-(2-aminoethyl)-1 aziridine-ethanamine as a novel ACE2 inhibitor that effectively blocks the SARS-CoV RBD-mediated cell fusion. This has provided a potential candidate and lead compound for further therapeutic drug development. Meanwhile, biochemical and cell-based assays can be established to screen chemical compound libraries to identify novel inhibitors.|
|Evidence Sentences||[PMID:22800722 (PMC7102827)] The in vitro half-maximal inhibitory concentration (IC50) values of food-derived ACE inhibitory peptides are about 1000 fold higher than that of synthetic captopril but they have higher in vivo activities than would be expected from their in vitro activities…|
|Evidence Sentences||[PMID:32336612 (PMC7167588)] Two trials of losartan as additional treatment for SARS-CoV-2 infection in hospitalized (NCT04312009) or not hospitalized (NCT04311177) patients have been announced, supported by the background of the huge adverse impact of the ACE Angiotensin II AT1 receptor axis over-activity in these patients. [PMID:32350632 (PMC7189178)] To address the role of angiotensin in lung injury, there is an ongoing clinical trial to examine whether losartan treatment affects outcomes in COVID-19 associated ARDS (NCT04312009). [PMID:32439915 (PMC7242178)] Losartan was also the molecule chosen in two trials recently started in the United States by the University of Minnesota to treat patients with COVID-19 (clinical trials.gov NCT04311177 and NCT 104312009).|
5 Related Work
A lot of previous work extracts biomedical entities Habibi et al. (2017); Crichton et al. (2017); Wang et al. (2018); Beltagy et al. (2019); Alsentzer et al. (2019); Wei et al. (2019); Wang et al. (2020c), relations Uzuner et al. (2011); Krallinger et al. (2011); Manandhar and Yuret (2013); Bui et al. (2014); Peng et al. (2016); Wei et al. (2015); Peng et al. (2017); Luo et al. (2017); Wei et al. (2019); Peng et al. (2019, 2020), and events Ananiadou et al. (2010); Van Landeghem et al. (2013); Nédellec et al. (2013); Deléger et al. (2016); Wei et al. (2019); Li et al. (2019); ShafieiBavani et al. (2020) from biomedical literature, and more recent work focuses on COVID-19 literature Hope et al. (2020); Ilievski et al. (2020); Wolinski (2020); Ahamed and Samad (2020).
Most of the recent biomedical QA work Yang et al. (2015, 2016); Chandu et al. (2017); Kraus et al. (2017) is driven by the BioASQ initiative Tsatsaronis et al. (2015), and many live QA systems, including COVIDASK (https://covidask.korea.ac.kr/) and AUEB (http://cslab241.cs.aueb.gr:5000/), and search engines Kricka et al. (2020); Esteva et al. (2020); Hope et al. (2020); Taub Tabib et al. (2020) have been developed. Our work is an application and extension of our recently developed multimedia knowledge extraction for the news domain Li et al. (2020a, b). As in the news domain, the knowledge elements extracted from text and images in the literature are complementary. Our framework advances the state of the art by extending the knowledge elements to more fine-grained types and by incorporating image analysis, cross-media knowledge grounding, and knowledge graph matching into QA.
6 Conclusions and Future Work
We have developed a novel framework, COVID-KG, that automatically transforms a massive scientific literature corpus into organized, structured, and actionable knowledge graphs, and uses them to answer questions in drug repurposing reporting. With COVID-KG, researchers and clinicians are able to obtain trustworthy and non-trivial answers from scientific literature, and can thus focus on more important hypothesis testing and prioritize analysis efforts for candidate exploration directions. In our ongoing work we have created a new ontology that includes 77 entity subtypes and 58 event subtypes, and we are building a neural IE system following this new ontology. In the future we plan to extend COVID-KG to automate the creation of new hypotheses by predicting new links. We will also create a multimedia common semantic space Li et al. (2020a, b) for literature and apply it to improve cross-media knowledge grounding and inference.
- Ahamed and Samad (2020) Sabber Ahamed and Manar Samad. 2020. Information mining for covid-19 research from a large volume of scientific literature. Information Retrieval Repository, arXiv:2004.02085.
- Alsentzer et al. (2019) Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA. Association for Computational Linguistics.
- Ananiadou et al. (2010) Sophia Ananiadou, Sampo Pyysalo, Jun’ichi Tsujii, and Douglas B Kell. 2010. Event extraction for systems biology by text mining the literature. Trends in biotechnology, 28(7):381–390.
- Beltagy et al. (2019) Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.
- Bui et al. (2014) Quoc-Chinh Bui, Peter MA Sloot, Erik M Van Mulligen, and Jan A Kors. 2014. A novel feature-based approach to extract drug–drug interactions from biomedical text. Bioinformatics, 30(23):3365–3371.
- Chandu et al. (2017) Khyathi Chandu, Aakanksha Naik, Aditya Chandrasekar, Zi Yang, Niloy Gupta, and Eric Nyberg. 2017. Tackling biomedical text summarization: OAQA at BioASQ 5B. In BioNLP 2017, pages 58–66, Vancouver, Canada. Association for Computational Linguistics.
- Crichton et al. (2017) Gamal Crichton, Sampo Pyysalo, Billy Chiu, and Anna Korhonen. 2017. A neural network multi-task learning approach to biomedical named entity recognition. Bioinformatics, 18(1):368.
- Davis et al. (2016) Allan Peter Davis, Cynthia J. Grondin, Robin J. Johnson, Daniela Sciaky, Benjamin L. King, Roy McMorran, Jolene Wiegers, Thomas C. Wiegers, and Carolyn J. Mattingly. 2016. The Comparative Toxicogenomics Database: update 2017. Nucleic Acids Research, 45(D1):D972–D978.
- Deléger et al. (2016) Louise Deléger, Robert Bossy, Estelle Chaix, Mouhamadou Ba, Arnaud Ferré, Philippe Bessières, and Claire Nédellec. 2016. Overview of the bacteria biotope task at BioNLP shared task 2016. In Proceedings of the 4th BioNLP Shared Task Workshop, pages 12–22, Berlin, Germany. Association for Computational Linguistics.
- Esteva et al. (2020) Andre Esteva, Anuprit Kale, Romain Paulus, Kazuma Hashimoto, Wenpeng Yin, Dragomir Radev, and Richard Socher. 2020. Co-search: Covid-19 information retrieval with semantic search, question answering, and abstractive summarization. Information Retrieval Repository, arXiv:2006.09595.
- Habibi et al. (2017) Maryam Habibi, Leon Weber, Mariana Neves, David Luis Wiegandt, and Ulf Leser. 2017. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics, 33(14):37–48.
- Hope et al. (2020) Tom Hope, Jason Portenoy, Kishore Vasan, Jonathan Borchardt, Eric Horvitz, Daniel S Weld, Marti A Hearst, and Jevin West. 2020. Scisight: Combining faceted navigation and research group detection for covid-19 exploratory scientific search. Information Retrieval Repository, arXiv:2005.12668.
- Ilievski et al. (2020) Filip Ilievski, Daniel Garijo, Hans Chalupsky, Naren Teja Divvala, Yixiang Yao, Craig Rogers, Ronpeng Li, Jun Liu, Amandeep Singh, Daniel Schwabe, et al. 2020. Kgtk: A toolkit for large knowledge graph manipulation and analysis. Artificial Intelligence Repository, arXiv:2006.00088.
- Krallinger et al. (2011) Martin Krallinger, Miguel Vazquez, Florian Leitner, David Salgado, Andrew Chatr-Aryamontri, Andrew Winter, Livia Perfetto, Leonardo Briganti, Luana Licata, Marta Iannuccelli, et al. 2011. The protein-protein interaction tasks of biocreative iii: classification/ranking of articles and linking bio-ontology concepts to full text. BMC bioinformatics, 12(S8):S3.
- Kraus et al. (2017) Milena Kraus, Julian Niedermeier, Marcel Jankrift, Sören Tietböhl, Toni Stachewicz, Hendrik Folkerts, Matthias Uflacker, and Mariana Neves. 2017. Olelo: a web application for intuitive exploration of biomedical literature. Nucleic acids research, 45(W1):478–483.
- Kricka et al. (2020) Larry J Kricka, Sergei Polevikov, Jason Y Park, Paolo Fortina, Sergio Bernardini, Daniel Satchkov, Valentin Kolesov, and Maxim Grishkov. 2020. Artificial intelligence-powered search tools and resources in the fight against covid-19. EJIFCC, 31(2):106.
- Lee et al. (2020) Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.
- Li et al. (2019) Diya Li, Lifu Huang, Heng Ji, and Jiawei Han. 2019. Biomedical event extraction based on knowledge-driven tree-LSTM. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1421–1430, Minneapolis, Minnesota. Association for Computational Linguistics.
- Li et al. (2020a) Manling Li, Alireza Zareian, Ying Lin, Xiaoman Pan, Spencer Whitehead, Brian Chen, Bo Wu, Heng Ji, Shih-Fu Chang, Clare Voss, Daniel Napierski, and Marjorie Freedman. 2020a. GAIA: A fine-grained multimedia knowledge extraction system. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.
- Li et al. (2020b) Manling Li, Alireza Zareian, Qi Zeng, Spencer Whitehead, Di Lu, Heng Ji, and Shih-Fu Chang. 2020b. Cross-media structured common space for multimedia event extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
- Luo et al. (2017) Yuan Luo, Özlem Uzuner, and Peter Szolovits. 2017. Bridging semantics and syntax with graph algorithms—state-of-the-art of extracting biomedical relations. Briefings in bioinformatics, 18(1):160–178.
- Manandhar and Yuret (2013) Suresh Manandhar and Deniz Yuret, editors. 2013. Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013). Association for Computational Linguistics, Atlanta, Georgia, USA.
- Nédellec et al. (2013) Claire Nédellec, Robert Bossy, Jin-Dong Kim, Jung-jae Kim, Tomoko Ohta, Sampo Pyysalo, and Pierre Zweigenbaum. 2013. Overview of BioNLP shared task 2013. In Proceedings of the BioNLP Shared Task 2013 Workshop, pages 1–7, Sofia, Bulgaria. Association for Computational Linguistics.
- Peng et al. (2017) Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph lstms. Transactions of the Association for Computational Linguistics, 5:101–115.
- Peng et al. (2020) Yifan Peng, Qingyu Chen, and Zhiyong Lu. 2020. An empirical study of multi-task learning on BERT for biomedical text mining. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, pages 205–214, Online. Association for Computational Linguistics.
- Peng et al. (2016) Yifan Peng, Chih-Hsuan Wei, and Zhiyong Lu. 2016. Improving chemical disease relation extraction with rich features and weakly labeled data. Journal of cheminformatics, 8(1):53.
- Peng et al. (2019) Yifan Peng, Shankai Yan, and Zhiyong Lu. 2019. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 58–65, Florence, Italy. Association for Computational Linguistics.
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
- ShafieiBavani et al. (2020) Elaheh ShafieiBavani, Antonio Jimeno Yepes, Xu Zhong, and David Martinez Iraola. 2020. Global locality in biomedical relation and event extraction. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, pages 195–204, Online. Association for Computational Linguistics.
- Shang et al. (2018) Jingbo Shang, Liyuan Liu, Xiaotao Gu, Xiang Ren, Teng Ren, and Jiawei Han. 2018. Learning named entity tagger using domain-specific dictionary. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2054–2064, Brussels, Belgium. Association for Computational Linguistics.
- Siegel et al. (2018) Noah Siegel, Nicholas Lourie, Russell Power, and Waleed Ammar. 2018. Extracting scientific figures with distantly supervised neural networks. In Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, page 223–232, New York, NY, USA. Association for Computing Machinery.
- Smith (2007) Ray Smith. 2007. An overview of the tesseract ocr engine. In Proceedings of the 9th international conference on document analysis and recognition (ICDAR 2007), volume 2, pages 629–633.
- Taub Tabib et al. (2020) Hillel Taub Tabib, Micah Shlain, Shoval Sadde, Dan Lahav, Matan Eyal, Yaara Cohen, and Yoav Goldberg. 2020. Interactive extractive search over biomedical corpora. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, pages 28–37, Online. Association for Computational Linguistics.
- Tsatsaronis et al. (2015) George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, et al. 2015. An overview of the bioasq large-scale biomedical semantic indexing and question answering competition. BMC bioinformatics, 16(1):138.
- Tsutsui and Crandall (2017) Satoshi Tsutsui and David J Crandall. 2017. A data driven approach for compound figure separation using convolutional neural networks. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 533–540.
- Uzuner et al. (2011) Özlem Uzuner, Brett R South, Shuying Shen, and Scott L DuVall. 2011. 2010 i2b2/va challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556.
- Van Landeghem et al. (2013) Sofie Van Landeghem, Jari Björne, Chih-Hsuan Wei, Kai Hakala, Sampo Pyysalo, Sophia Ananiadou, Hung-Yu Kao, Zhiyong Lu, Tapio Salakoski, Yves Van de Peer, et al. 2013. Large-scale event extraction from literature with multi-level gene normalization. PloS one, 8(4):e55814.
- Wang et al. (2019a) Qingyun Wang, Lifu Huang, Zhiying Jiang, Kevin Knight, Heng Ji, Mohit Bansal, and Yi Luan. 2019a. PaperRobot: Incremental draft generation of scientific ideas. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1980–1991, Florence, Italy. Association for Computational Linguistics.
- Wang et al. (2020a) Xuan Wang, Yingjun Guan, Weili Liu, Aabhas Chauhan, Enyi Jiang, Qi Li, David Liem, Dibakar Sigdel, John Caufield, Peipei Ping, et al. 2020a. Evidenceminer: Textual evidence discovery for life sciences. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 56–62.
- Wang et al. (2020b) Xuan Wang, Weili Liu, Aabhas Chauhan, Yingjun Guan, and Jiawei Han. 2020b. Automatic textual evidence mining in covid-19 literature. Computation and Language Repository, arXiv:2004.12563.
- Wang et al. (2020c) Xuan Wang, Xiangchen Song, Yingjun Guan, Bangzheng Li, and Jiawei Han. 2020c. Comprehensive named entity recognition on cord-19 with distant or weak supervision. Computation and Language Repository, arXiv:2003.12218.
- Wang et al. (2019b) Xuan Wang, Yu Zhang, Qi Li, Xiang Ren, Jingbo Shang, and Jiawei Han. 2019b. Distantly supervised biomedical named entity recognition with dictionary expansion. In Proceedings of the 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 496–503.
- Wang et al. (2018) Xuan Wang, Yu Zhang, Xiang Ren, Yuhao Zhang, Marinka Zitnik, Jingbo Shang, Curtis Langlotz, and Jiawei Han. 2018. Cross-type biomedical named entity recognition with deep multi-task learning. Bioinformatics, 35(10):1745–1752.
- Wei et al. (2019) Chih-Hsuan Wei, Alexis Allot, Robert Leaman, and Zhiyong Lu. 2019. PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Research, 47(W1):587–593.
- Wei et al. (2015) Chih-Hsuan Wei, Yifan Peng, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Jiao Li, Thomas C Wiegers, and Zhiyong Lu. 2015. Overview of the biocreative v chemical disease relation (cdr) task. In Proceedings of the 5th BioCreative challenge evaluation workshop, volume 14.
- Wolinski (2020) Francis Wolinski. 2020. Visualization of diseases at risk in the covid-19 literature. Information Retrieval Repository, arXiv:2005.00848.
- Yang et al. (2015) Zi Yang, Niloy Gupta, Xiangyu Sun, Di Xu, Chi Zhang, and Eric Nyberg. 2015. Learning to answer biomedical factoid & list questions: Oaqa at bioasq 3b. CLEF (Working Notes), 1391.
- Yang et al. (2016) Zi Yang, Yue Zhou, and Eric Nyberg. 2016. Learning to answer biomedical questions: OAQA at BioASQ 4B. In Proceedings of the Fourth BioASQ workshop, pages 23–37, Berlin, Germany. Association for Computational Linguistics.
- Zhang et al. (2020) Haibo Zhang, Josef M Penninger, Yimin Li, Nanshan Zhong, and Arthur S Slutsky. 2020. Angiotensin-converting enzyme 2 (ace2) as a sars-cov-2 receptor: molecular mechanisms and potential therapeutic target. Intensive care medicine, 46(4):586–590.
- Zheng et al. (2015) Jin Guang Zheng, Daniel Howsmon, Boliang Zhang, Juergen Hahn, Deborah McGuinness, James Hendler, and Heng Ji. 2015. Entity linking for biomedical literature. In Proceedings of the BMC Medical Informatics and Decision Making, volume 15.
- Zhou et al. (2017) Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. 2017. East: an efficient and accurate scene text detector. In