In response to the COVID-19 pandemic, many scholarly articles have been published recently and made freely available. At the same time, there are growing requests from both the medical research community and the broader public for answers to various questions regarding COVID-19. A system that can provide reliable answers to COVID-19-related questions from the latest academic resources is crucial, especially for the medical community in the current time-critical race to treat patients and to find a cure for the virus.
To address the aforementioned requests by the medical community, we propose a deep-learning-based system that combines state-of-the-art natural language processing (NLP) question answering (QA) techniques with summarization to mine the available scientific literature. The system is an end-to-end neural open-domain QA system that can answer COVID-19-related questions, such as those posed in the CORD-19 Kaggle task (https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge). Through our system, users can obtain two kinds of output:
1) A ranked list of relevant snippets from the literature, given a query;
2) Fluent summaries of the relevant results. We provide paragraph-level summaries, which take the paragraphs containing the top three relevant snippets as input, to enable a more efficient way of understanding the content.
Our system consists of three modules: 1) Document Retriever, 2) Relevant Snippet Selector, and 3) Multi-Document Summarizer. The first module pre-processes a user's query and retrieves the most relevant academic publications. The second module outputs a list of the most relevant answer snippets from the retrieved documents and highlights the relevant keywords. The last module generates the second kind of output, namely a concise summary of the top-ranked relevant paragraphs returned by the two former modules.
2 Related Work
With the release of the COVID-19 Open Research Dataset (CORD-19, https://pages.semanticscholar.org/coronavirus-research) by the Allen Institute for AI, multiple systems have been built to help both researchers and the public explore valuable information related to COVID-19. CORD-19 Search (https://cord19.aws/) is a search engine that utilizes the CORD-19 dataset processed by Amazon Comprehend Medical. Covidex (https://covidex.ai/) applies a multi-stage search architecture, which can extract different features from the data. WellAI COVID-19 Research Tool (https://wellai.health/), an NLP medical relationship engine, is able to create a structured list of medical concepts with ranked probabilities related to COVID-19. tmCOVID (http://tmcovid.com/) is a bioconcept extraction and summarization tool for COVID-19 literature.
All these systems focus either on search, such as CORD-19 Search, or on medical concepts, such as WellAI COVID-19 Research Tool and tmCOVID. Our system, in addition to information retrieval, provides high-quality relevant snippets and summarization results based on the user query. We thus provide a more versatile system for public use, which can display various information about COVID-19 in a well-structured and concise manner.
3 System Architecture Overview
Figure 1 illustrates the architecture of our system, CAiRE-COVID, which consists of three major modules. We first paraphrase long and complicated queries into simpler queries, which are easier for our system to comprehend. The updated queries are then fed into our information retrieval (IR) module, which returns the k paragraphs from the full text of the articles with the highest matching scores to the query, where k is a hyper-parameter. Our QA models are then applied to each of the top k paragraphs to select relevant sentences from each paragraph as answers to the query. After re-ranking the retrieved paragraphs with highlighted answers that best fit the query, we pass the top-ranked paragraphs to our summarization models to obtain an extractive and an abstractive summary of those paragraphs.
4.1 Document Retrieval
4.1.1 Query Paraphrasing
The objective of this sub-module is to break down a user's query and rephrase complex query sentences into several shorter and simpler questions that convey the same meaning. As shorter sentences are generally better processed by NLP systems Narayan et al. (2017), this pre-processing step facilitates and improves the performance of the whole system, since the search engine and the question answering modules can find more relevant and less redundant results. Its effectiveness has been demonstrated in our Kaggle tasks, and we show some examples in Appendix B. In the future, we will explore automatic methods Min et al. (2019); Perez et al. (2020) to handle more complicated compound questions.
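As a rough illustration, a compound question can be decomposed with a simple rule-based split on clause-level conjunctions. This is only a sketch under our own assumptions; the sub-queries shown in Appendix B were written manually to convey the same meaning, not produced by this heuristic:

```python
import re

def split_query(query):
    """Naively decompose a compound question at clause-level 'and'.

    Illustrative heuristic only; it does not reproduce the manually
    paraphrased sub-queries shown in Appendix B."""
    body = query.rstrip("?")
    # Split on ' and ' only when followed by at least two words,
    # which suggests a new clause rather than a noun conjunction.
    parts = re.split(r",?\s+and\s+(?=\w+\s+\w+)", body)
    return [p.strip() + "?" for p in parts if p.strip()]
```

A query such as "What is the range of incubation periods and how does this vary across age?" is split into two shorter questions that can be sent to the IR module separately.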
4.1.2 Search Engine
We use Anserini Yang et al. (2018a)
to create the search engine that retrieves a preliminary candidate set of documents. Anserini is an information retrieval module wrapped around the open-source search engine Lucene (https://lucene.apache.org/). Although Lucene has been widely used to build industry-standard search applications, its complex indexing and lack of documentation for ad-hoc experimentation and testing on standard test sets have made it less popular in the information retrieval community. Anserini uses Lucene indexing to create an easy-to-understand information retrieval module in which standard ranking algorithms (e.g., bag of words, BM25) are already implemented, which enables us to use Lucene for our application. We use paragraph indexing for our purpose, where each paragraph of the full text of each article is indexed separately, together with the title and abstract. For each query, the module returns the top k paragraphs matching the query.
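To make the ranking step concrete, the following is a self-contained, pure-Python sketch of the BM25 scoring that Anserini provides via Lucene (not Anserini's actual implementation; the parameters k1 = 0.9 and b = 0.4 follow Anserini's commonly used defaults):

```python
import math
from collections import Counter

def bm25_rank(query, paragraphs, k1=0.9, b=0.4):
    """Rank paragraph indices by BM25 score for `query` (highest first).

    A toy sketch of the BM25 ranking used by Anserini/Lucene."""
    docs = [p.lower().split() for p in paragraphs]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n_docs = len(docs)
    df = Counter()                      # document frequency of each term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
```

In the real system, the same computation runs inside Lucene over the paragraph-level index of titles, abstracts, and full-text paragraphs.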
Question: What do we know about asymptomatic transmission of COVID-19?
Answer: "Health care professionals (HCPs) are at high risk since a recent study shows that a substantial proportion of virus spread occurs in the asymptomatic or pre-symptomatic phase." McCuaig (2020)
Summary: Many new patients which are asymptomatic or have only mild symptoms can transmit the virus. Research traced COVID-19 infections which resulted from a business meeting in Germany. The most common symptoms of SARS-CoV-2-related disease, called COVID-19, are fever, weakness, cough, and diarrhea.
Question: How is the seasonality of COVID-19 disease transmission?
Answer: "We also confirmed that the growth rate decreased with the temperature; however, the growth rate was affected by precipitation seasonality and warming velocity rather than temperature. In particular, a lower growth rate was observed for a higher precipitation seasonality and lower warming velocity." Chiyomaru and Takemoto (2020)
Question: How contagious is COVID-19?
Answer: "While the science is still evolving, it is believed that the Covid-19 R 0 is between 2 and 4. This means that without effective containment measures, an infected person can infect between two and four other individuals. Covid-19, thus, is believed to be more contagious than both the seasonal flu (R 0 1.3) and SARS (R 0 3.0)." Stannard (2020)
4.2 Relevant Snippet Selector
4.2.1 Question Answering
For the question answering (QA) module, we leverage the BioBERT Lee et al. (2020) QA model, which is fine-tuned on the SQuAD dataset, and the generalized MRQA model from Su et al. (2019). Instead of fine-tuning the QA models on COVID-19-related datasets, we focus on maintaining the generalization ability of our system and perform zero-shot question answering. For the MRQA model, the authors use multi-task learning over six datasets to fine-tune the large pre-trained language model XLNet Yang et al. (2019), which reduces over-fitting to the training data, enables generalization to out-of-domain data, and achieves promising results. To make the answers more readable, instead of providing small answer spans, we output the sentences that contain the predicted answers.
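Mapping a predicted span back to its containing sentence can be sketched as follows, assuming the QA model returns a character offset for the predicted span (the period-based sentence splitter here is a naive stand-in for a proper sentence tokenizer):

```python
import re

def span_to_sentence(context, answer_start):
    """Return the sentence of `context` containing character offset
    `answer_start` of a predicted answer span.

    The regex splitter is a naive stand-in for a real sentence
    tokenizer; real QA output post-processing would be more robust."""
    for m in re.finditer(r"[^.!?]+[.!?]?", context):
        if m.start() <= answer_start < m.end():
            return m.group().strip()
    return context  # fallback if the offset is out of range
```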
To better display the question answering results, we leverage the prediction probability of the QA models as a confidence score, which is utilized when re-ranking all the answers. To obtain the confidence score of the ensemble of the two QA models, we combine the confidence scores from the individual models, denoted s1 and s2, according to a set of ensemble rules.
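The exact ensemble rules are given by the equations in the original paper, which were lost in this rendering; the sketch below therefore shows only one plausible combination rule, stated as an assumption rather than the paper's formula:

```python
def ensemble_confidence(s1, s2, same_answer):
    """Combine the two QA models' confidence scores s1 and s2.

    NOTE: this combination rule is an illustrative assumption, not the
    exact ensemble equations from the paper. When both models select
    the same answer sentence we reward the agreement; otherwise we keep
    the more confident model's score."""
    if same_answer:
        return s1 + s2      # agreement: boost with the summed score
    return max(s1, s2)      # disagreement: back the stronger prediction
```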
4.2.2 Highlights Generation and Answer Re-ranking
There are two main components in our re-ranking module: (1) calculating a keyword-based matching score between the query and the retrieved paragraphs, and (2) re-ranking the paragraphs, with the relevant snippets selected by the QA module highlighted, based on the re-ranking score.
We calculate the matching score between a query and the retrieved paragraphs based on word matching. To obtain this score, we first select important keywords from the query based on POS tagging, taking only words with NN (noun), VB (verb), or JJ (adjective) tags into consideration. By summing up the term frequencies of the important keywords that appear in the paragraph, and separately counting how many of those keywords appear, we obtain two matching scores. For the term-frequency matching score, we normalize shorter paragraphs using a sigmoid value computed from the paragraph length, and we reward paragraphs containing more diverse keywords from the query. The final matching score combines these two components.
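A minimal sketch of this scoring follows. The POS tagging step is omitted (the keywords are passed in directly), the mixing weight alpha is a hypothetical value rather than one from the paper, and the sigmoid normalization shown is one plausible reading of the description above:

```python
import math

def matching_score(keywords, paragraph, alpha=0.5):
    """Keyword-based matching score between a query and one paragraph.

    `keywords` are the query's NN/VB/JJ tokens (POS tagging not shown);
    `alpha` is a hypothetical mixing weight, not a value from the paper."""
    tokens = paragraph.lower().split()
    # Term-frequency score: total occurrences of query keywords.
    tf = sum(tokens.count(k.lower()) for k in keywords)
    # Diversity score: number of distinct query keywords covered.
    coverage = sum(1 for k in keywords if k.lower() in tokens)
    # One plausible reading of the sigmoid length normalization:
    # dampen the raw term frequency by a length-dependent factor.
    length_norm = 1.0 / (1.0 + math.exp(-len(tokens) / 100.0))
    return alpha * tf * length_norm + (1 - alpha) * coverage
```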
Rerank and Display:
The re-ranking score is based on both the matching score above and the confidence score from the QA module (Equation 3). The relevant snippets are then re-ranked together with their corresponding paragraphs and displayed with the snippets highlighted.
In our system, considering that users may want to gather more related and compact information beyond the predicted answer spans, especially from the paragraphs containing those spans, we summarize the top three paragraphs most relevant to the query (as returned by the search engine and re-ranked by the QA answer scores) to generate both a query-related abstractive summary and an extractive summary.
Our abstractive model is based on two different pre-trained language models: UniLM Dong et al. (2019) and BART Lewis et al. (2019), both of which have obtained state-of-the-art results on summarization tasks (on the CNN/DM Hermann et al. (2015) and XSUM Narayan et al. (2018) datasets).
We fine-tuned the UniLM model using the biomedical review data from Yongkiatpanich and Wichadakul (2019), which includes literature on five types of diseases: cancer, cardiovascular disease, diabetes, allergy, and obesity. The original data is from PubMed. For BART, we used the model fine-tuned on the CNN/DailyMail dataset.
For each query, we generate a summary for each of the top paragraphs passed by the QA module and then concatenate them directly to form our final answer summary.
As for the extractive summary, we extract sentences from the answer spans based on their similarity score with the query. Specifically, we use the top five answer spans as candidates. The [CLS] token's contextual embedding from ALBERT Lan et al. (2019) is used as the sentence embedding, and we choose the three sentences with the highest cosine similarity to the user query's sentence embedding as our final extractive summary.
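The selection step itself reduces to cosine similarity and a top-n cut. In the sketch below, the query vector and each (sentence, vector) pair stand in for the ALBERT [CLS] embeddings, which are not computed here:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def extractive_summary(query_vec, candidates, top_n=3):
    """Pick the top-n candidate sentences by cosine similarity to the
    query. `candidates` is a list of (sentence, embedding) pairs; the
    embeddings stand in for ALBERT [CLS] vectors, not computed here."""
    ranked = sorted(candidates, key=lambda sv: cosine(query_vec, sv[1]),
                    reverse=True)
    return [s for s, _ in ranked[:top_n]]
```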
We have described our system, CAiRE-COVID, comprising three major modules, information retrieval, question answering, and summarization, which uses the CORD-19 dataset of published scientific articles concerning COVID-19. Our system answers user queries related to COVID-19 by retrieving relevant paragraphs from articles in the dataset, using our QA models to answer the question, and generating two versions of a concise summary of the top paragraphs via the two summarization models. We believe that by gathering factual information regarding COVID-19 and presenting it in a comprehensible way, we can prioritise scientific facts about the virus and help the community in the fight against the ongoing global pandemic.
We would like to thank Yongsheng Yang, Nayeon Lee and Chloe Kim for their help in launching our CAiRE-COVID website.
- Berant et al. (2014) Jonathan Berant, Vivek Srikumar, Pei-Chun Chen, Abby Vander Linden, Brittany Harding, Brad Huang, Peter Clark, and Christopher D Manning. 2014. Modeling biological processes for reading comprehension. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1499–1510.
- Chiyomaru and Takemoto (2020) Katsumi Chiyomaru and Kazuhiro Takemoto. 2020. Global covid-19 transmission rate is influenced by precipitation seasonality and the speed of climate temperature warming. medRxiv.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
- Dong et al. (2019) Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13042–13054.
- Dua et al. (2019) Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proc. of NAACL.
- Dunn et al. (2017) Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. Searchqa: A new q&a dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179.
- Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in neural information processing systems, pages 1693–1701.
- Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.
- Kembhavi et al. (2017) Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics.
- Lan et al. (2019) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
- Lee et al. (2020) Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.
- Lewis et al. (2019) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
- McCuaig (2020) Carly McCuaig. 2020. What we know so far (as of march 26, 2020) about covid-19–an mrt point of view. Journal of Medical Imaging and Radiation Sciences.
- Min et al. (2019) Sewon Min, Victor Zhong, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019. Multi-hop reading comprehension through question decomposition and rescoring. arXiv preprint arXiv:1906.02916.
- Narayan et al. (2018) Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745.
- Narayan et al. (2017) Shashi Narayan, Claire Gardent, Shay B Cohen, and Anastasia Shimorina. 2017. Split and rephrase. arXiv preprint arXiv:1707.06971.
- Perez et al. (2020) Ethan Perez, Patrick Lewis, Wen-tau Yih, Kyunghyun Cho, and Douwe Kiela. 2020. Unsupervised question decomposition for question answering. arXiv preprint arXiv:2002.09758.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
- Stannard (2020) Daphne Stannard. 2020. Covid-19: Impact on perianesthesia nursing areas. Journal of PeriAnesthesia Nursing.
- Su et al. (2019) Dan Su, Yan Xu, Genta Indra Winata, Peng Xu, Hyeondey Kim, Zihan Liu, and Pascale Fung. 2019. Generalizing question answering system with pre-trained language model fine-tuning. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 203–211.
- Tang et al. (2020) Raphael Tang, Rodrigo Nogueira, Edwin Zhang, Nikhil Gupta, Phuong Cam, Kyunghyun Cho, and Jimmy Lin. 2020. Rapidly bootstrapping a question answering dataset for covid-19. arXiv preprint arXiv:2004.11339.
- Trischler et al. (2017) Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. Newsqa: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191–200.
- Tsatsaronis et al. (2012) George Tsatsaronis, Michael Schroeder, Georgios Paliouras, Yannis Almirantis, Ion Androutsopoulos, Eric Gaussier, Patrick Gallinari, Thierry Artieres, Michael R Alvers, Matthias Zschunke, et al. 2012. Bioasq: A challenge on large-scale biomedical semantic indexing and question answering. In 2012 AAAI Fall Symposium Series.
- Yang et al. (2018a) Peilin Yang, Hui Fang, and Jimmy Lin. 2018a. Anserini: Reproducible ranking baselines using lucene. Journal of Data and Information Quality (JDIQ), 10(4):1–20.
- Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pages 5754–5764.
- Yang et al. (2018b) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018b. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.
- Yongkiatpanich and Wichadakul (2019) Chuleepohn Yongkiatpanich and Duangdao Wichadakul. 2019. Extractive text summarization using ontology and graph-based method. In 2019 IEEE 4th International Conference on Computer and Communication Systems (ICCCS), pages 105–110. IEEE.
Appendix A Question Answering
A.1 MRQA Model Details
The MRQA model Su et al. (2019) is leveraged in the CAiRE-COVID system. To equip the model with better generalization to unseen data, the MRQA model is trained in a multi-task learning scheme on six datasets: SQuAD Rajpurkar et al. (2016), NewsQA Trischler et al. (2017), TriviaQA Joshi et al. (2017), SearchQA Dunn et al. (2017), HotpotQA Yang et al. (2018b), and Natural Questions Kwiatkowski et al. (2019). The training sets vary from each other in terms of data source, context length, whether multi-hop reasoning is needed, and strategies for data augmentation. To evaluate generalization ability, the authors used as a baseline a BERT-large model Devlin et al. (2019) trained with the same method as the MRQA model. The models are evaluated on twelve unseen datasets, such as DROP Dua et al. (2019) and TextbookQA Kembhavi et al. (2017). As Table A1 shows, the MRQA model consistently outperforms the baseline and achieves promising results on QA samples that differ from the training samples in terms of data source and domain, including unseen biomedical datasets such as BioASQ Tsatsaronis et al. (2012) and BioProcess Berant et al. (2014).
A.2 Quantitative Evaluation
Recently, Tang et al. (2020) released the CovidQA dataset to the NLP community to bootstrap NLP research on COVID-19. The CovidQA dataset consists of 124 question–article pairs, which we use for zero-shot evaluation of transfer ability.
The evaluation on the CovidQA dataset is designed as a supporting-sentence selection task. Given a question and the corresponding article, QA models are supposed to select and rank the sentences from the whole article that possibly contain the gold answer. The results are evaluated using mean reciprocal rank (MRR), precision at rank one, and recall at rank three. Our QA model operates at the paragraph level, which makes it hard to select several different candidate sentences from the same context, so we only use precision and recall for evaluation. For precision, we evaluate the ensemble result directly, while for recall@2 we separate the QA outputs of the MRQA model and the BioBERT model, consider them as two possible answers, and compute the recall score. The evaluation results are shown in Table A2.
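The two metrics we report can be sketched as follows, where `ranked` is the system's ordered sentence predictions and `gold` is the set of annotated supporting sentences (a generic sketch of the metrics, not the CovidQA evaluation script):

```python
def precision_at_1(ranked, gold):
    """1.0 if the top-ranked sentence is a gold supporting sentence."""
    return 1.0 if ranked and ranked[0] in gold else 0.0

def recall_at_k(ranked, gold, k=2):
    """Fraction of gold sentences recovered in the top-k predictions."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & set(gold)) / len(gold)
```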
Appendix B Query Paraphrasing
In our Kaggle task, the queries are often long and complex sentences, in which case splitting and simplification are needed. Here we show some examples of the original task queries and the corresponding sub-queries we paraphrased:
Task Query 1:
What the literature reports about Range of incubation periods for the disease in humans (and how this varies across age and health status)?
What the literature reports about range of incubation periods for COVID-19 in humans?
How the range of incubation periods for COVID-19 varies across human health status?
How the range of incubation periods for COVID-19 varies across human age?
Task Query 2:
What the literature reports about the evidence that livestock could be infected and serve as a reservoir after the epidemic appears to be over?
What the literature reports about the evidence that livestock could be infected by COVID-19?
How the infected livestock serve as a COVID-19 reservoir after the epidemic appears to be over?
Task Query 3:
What the literature reports about access to geographic and temporal diverse sample sets to understand geographic distribution and genomic differences, and determine whether there is more than one strain in circulation?
What the literature reports about access to geographic sample sets of COVID-19?
What the literature reports about access to temporal sample sets of COVID-19?
What the literature reports about geographic-time distribution of COVID-19?
What the literature reports about number of strains in circulation of COVID-19?