Question Answering (QA) systems, such as Apple's Siri, Amazon's Alexa, or Google Now, answer questions by mining answers from unstructured text corpora or open domain Knowledge Graphs (KG). The direct applicability of these approaches to specialized domains such as scholarly knowledge is questionable. On the one hand, no extensive knowledge graph for scholarly knowledge exists that can be employed in a question answering system. On the other hand, scholarly knowledge is represented mainly as unstructured raw text in articles (in proceedings or journals). In unstructured artifacts, knowledge is not machine actionable, hardly processable, ambiguous, and in particular not FAIR. Still, amid unstructured information some semi-structured information exists, in particular in tabular representations (e.g., survey tables, literature overviews, and paper comparisons). QA on tabular data shares challenges with other types of question answering systems. We propose a method to perform QA specifically on scholarly knowledge graphs representing tabular data. Moreover, we create a benchmark of tabular data retrieved from a scholarly knowledge graph together with a set of related questions. This benchmark is collected using the Open Research Knowledge Graph (ORKG).
The remainder of this article is structured as follows. Section 1.0.1 motivates the work with an example. Section 2 presents related work, which is supplemented by an analysis of the strengths and weaknesses of existing systems in the context of digital libraries. Section 3 describes the proposed approach. Section 4 presents the implementation and evaluation. Section 5 discusses results and future work. Finally, Section 6 concludes the paper.
1.0.1 Motivating Example
The research community has proposed many QA systems, but to the best of our knowledge none focus on scholarly knowledge. Leveraging the ORKG and its structured scholarly knowledge, we propose a QA system specifically designed for this domain. Figure 1 illustrates a tabular comparison view (https://www.orkg.org/orkg/comparison/R8618) of structured scholarly contribution descriptions. Additionally, three questions related to the content of the comparison table are shown. The answers are implicitly or explicitly provided in the cells of the table. JarvisQA can answer different types of questions. For Q1, the answer has a direct correlation with the question. For Q2, the system should first find the “knowledge representations” in the table and then find the most common value. For Q3, the answer is conditional upon finding another piece of information in the table first (i.e., JarvisQA has to find “RASH” in the table first), and then narrow its search to that column (or that paper) to find the correct answer.
We tackle the following research questions:
RQ1: Can a QA system retrieve answers from tabular representations of scholarly knowledge?
RQ2: What type of questions can be posed on tabular scholarly knowledge?
2 Related Work
Question answering is an important research problem frequently tackled by research communities in different variations, applications, and directions.
In open domain question answering, various systems and techniques have been proposed that rely on different forms of background knowledge. Pipeline-based systems, such as OpenQA, present a modular framework using standardized components for creating QA systems on structured knowledge graphs (e.g., DBpedia). Frankenstein creates the most suitable QA pipeline out of community-created components based on the natural language input question. QAnswer is a multilingual QA system that queries different linked open data datasets to fetch correct answers. Diefenbach et al. discussed and compared other QA-over-KG systems (e.g., gAnswer, DEANNA, and SINA) within the context of the QALD (Question Answering over Linked Data) challenges.
Other types of QA systems rely on raw unstructured text to produce answers. Many of these are end-to-end systems that employ machine learning to mine the text and retrieve answers. Deep learning models (e.g., Transformers) are trained and fine-tuned on certain QA datasets to find answers within the text. ALBERT is a descendant of the BERT deep learning model. At the time of writing, ALBERT held the third position in answering the questions of SQuAD. Such techniques model linguistic knowledge from textual details and discard the clutter in the text. Other similar approaches include SG-Net, which uses syntax rules to guide machine comprehension encoder-transformer models.
Tabular QA systems are also diverse and tackle the task with different techniques. TF-IDF is used to extract features from tables and questions and to match them. Semantic parsers are used by Kwiatkowski et al. and Krishnamurthy and Kollar. Cheng et al. propose a neural semantic parser that uses predicate-argument structures to convert natural language text into intermediate structured representations, which are then mapped to different target domains (e.g., SQL).
Another category of table QA systems is neural systems. TableQA uses end-to-end memory networks to select a suitable cell in the table. Wang et al. propose a directional self-attention network to find candidate tables and BiGRUs to score the answers. Other table-oriented QA systems include HILDB, which converts natural language into SQL.
In the plethora of systems that the community has developed over the past decade, no system specifically addresses the scholarly information domain. We propose a system to fill this gap and address the issues of QA on scholarly tabular data in the context of digital libraries (specifically with the ORKG, https://orkg.org/).
Though a variety of QA techniques exist, Digital Libraries (DL) primarily rely on standard information retrieval techniques. We briefly analyze when and how QA techniques can be used to improve information retrieval and search capabilities in the context of DLs. Since DLs have different needs [11, 26], QA systems can improve information retrieval availability. We argue that Knowledge Graph-based QA systems (KG-QA) can work well within a DL context (i.e., aggregating information and listing candidate answers). Nevertheless, the majority of existing scholarly KGs (such as MAG and OC) focus on metadata (e.g., authors, venues, and citations), not on scholarly knowledge content.
Another category of QA systems works on raw text, an important approach for DLs. However, such systems are not fine-tuned on scholarly data; rather, they are designed for open domain data. Furthermore, many end-to-end neural models have a built-in limitation (i.e., model capacity) due to their architecture, and as such cannot be used out of the box. Some systems circumvent the capacity problem (i.e., the inability to feed the model large amounts of text) with an indexing component (e.g., an inverted index, or concept and entity recognition) that narrows down the amount of text the system needs to process as the context for questions.
We propose a system, called JarvisQA, that answers Natural Language (NL) questions on tabular views of scholarly knowledge graphs, specifically tabular views comprising research contribution information from scientific articles.
3.1 Data and Questions Collection
In order to evaluate our QA system, we create the ORKG-QA benchmark, collected using the ORKG. The ORKG provides structured comparisons of research contributions obtained from papers. The ORKG-QA benchmark comprises a dataset that integrates 13 tables, covering information spanning more than 100 academic publications. The data is collected through the ORKG API and the featured set of tables (https://www.orkg.org/orkg/featured-comparisons), which can be exported in CSV format.
Additionally, we created a set of questions that cover various types of information and facts that can be retrieved from those tables. The benchmark consists of 80 questions in English. The questions cover a variety of question types that can be asked in the context of tables in the scholarly literature. These types include aggregation questions (e.g., min, average, and most common), ask (true/false) questions, answer listing questions, and questions that rely on combining information. In the ORKG-QA dataset (https://doi.org/10.25835/0038751), 39% are normal questions addressing individual cells in tables, 20% are aggregation questions, 11% are questions whose answer relates to other parts of the table, and the rest are questions of other types (i.e., listings, ask queries, empty answers).
We also use the TabMCQ QA dataset, specifically the questions on the regents tables. TabMCQ was derived from multiple-choice questions of 4th grade science exams and contains 39 tables and related questions. While TabMCQ is not a scholarly dataset, it is, to the best of our knowledge, the closest one available. Since TabMCQ contains only multiple-choice questions, we adapted the questions by retaining only the correct choice as the answer.
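This adaptation step can be sketched as follows (a minimal sketch; the record fields `question`, `choices`, and `correct` are illustrative and do not reflect TabMCQ's exact file format):

```python
def mcq_to_qa(mcq_items):
    """Convert multiple-choice items into (question, answer) pairs by
    keeping only the correct choice as the gold answer."""
    return [(item["question"], item["choices"][item["correct"]])
            for item in mcq_items]

# Hypothetical example record, not taken from TabMCQ itself.
items = [{"question": "What is the boiling point of water?",
          "choices": ["90 C", "100 C", "110 C"],
          "correct": 1}]
pairs = mcq_to_qa(items)  # → [("What is the boiling point of water?", "100 C")]
```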
3.2 JarvisQA system architecture
JarvisQA is designed with modularity in mind; hence, the core QA components can be replaced with newer or more fine-tuned versions. Figure 2 depicts the architecture in more detail. Since we use a natural language QA system, we need a pre-processing step that transforms the table information into a textual description (representing only the information contained in the table, not the entire raw text of the article). With the output of the “Table2Text” step and the input question, the NL QA system can reason over the question with the provided context (the textual table description) and attempt to answer it. We now discuss the individual components of the architecture in more detail.
3.2.1 Table2Text (T2T) converter.
Although JarvisQA operates on tabular data, the core QA engine processes textual contexts. To that end, tables have to be converted into coherent text snippets that represent the entirety of the information presented in the table. The T2T component splits tables into their entries and converts the entries into triples. Table 1 illustrates a sample table containing information about three publications, along with the triples and textual representation compiled by the T2T component. Furthermore, the T2T component enriches the textual description with aggregated information (e.g., the maximum value of certain rows, or the most common value within some columns). This enables the system to answer aggregation-type questions such as “Which system has the maximum accuracy?” and “What is the most common method used among the papers?”.
| Paper | Semantic representation | Data type | Scope | High level |
| Paper 1 | ORKG | Free text | Summary | Yes |
| Paper 2 | Nanopublications | Free text | Statement level | Yes |
| Paper 3 | RASH | Quoted text | Full paper | Partially |

Triples:
(Paper1, hasSemanticRepresentation, ORKG)
(Paper1, hasDataType, FreeText)
(Paper1, hasScope, Summary)

Text: Paper 1’s semantic representation is “ORKG”, its data type is “Free Text”, and its scope is “Summary” …
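As an illustration, the T2T step can be sketched in Python as follows (a simplified sketch under assumed data structures, not the actual JarvisQA implementation; the column names and predicate naming scheme are illustrative):

```python
from collections import Counter

def table_to_triples(table, key="Paper"):
    """Split each row of a table (a list of dicts keyed by column header)
    into (subject, predicate, object) triples."""
    triples = []
    for row in table:
        for column, value in row.items():
            if column != key:
                predicate = "has" + column.title().replace(" ", "")
                triples.append((row[key], predicate, value))
    return triples

def table_to_text(table, key="Paper"):
    """Verbalize each row into a sentence; the concatenation serves as
    the textual context for the QA engine."""
    sentences = []
    for row in table:
        parts = [f'its {c.lower()} is "{v}"' for c, v in row.items() if c != key]
        sentence = f"{row[key]}'s {parts[0][4:]}"  # drop the leading "its " once
        if parts[1:]:
            sentence += ", " + ", ".join(parts[1:])
        sentences.append(sentence + ".")
    return " ".join(sentences)

def aggregation_sentences(table, key="Paper"):
    """Enrich the description with aggregated facts (most common values)."""
    sentences = []
    for column in table[0]:
        if column == key:
            continue
        value, _ = Counter(row[column] for row in table).most_common(1)[0]
        sentences.append(f'The most common {column.lower()} is "{value}".')
    return sentences

table = [
    {"Paper": "Paper 1", "Semantic representation": "ORKG", "Data type": "Free text"},
    {"Paper": "Paper 2", "Semantic representation": "Nanopublications", "Data type": "Free text"},
    {"Paper": "Paper 3", "Semantic representation": "RASH", "Data type": "Quoted text"},
]
context = table_to_text(table) + " " + " ".join(aggregation_sentences(table))
```

The resulting context string is what the QA engine receives in place of the table itself.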
3.2.2 QA core engine.
This component is the primary building block of JarvisQA; it is where reasoning over questions happens. The component uses a pre-trained natural language QA model: a deep transformer, fine-tuned on the SQuADv2 dataset to perform the QA task. The component can be replaced with any other similar transformer model (of different size and architecture). Our base implementation uses a fine-tuned BERT model, and we evaluate it using different model sizes and architectures. The model parameters are set as follows: maximum sequence length to 512, document stride to 128, top-k answers to 10, maximum answer length to 15, and maximum question length to 64. As illustrated in Figure 2, the QA engine extracts sets of features from the questions and the text (i.e., embeddings), then finds a set of candidate answers and ranks them by confidence score. The benefits of such an architecture are flexibility in model choice, multilingualism, and reusability. Different transformer models can replace ours to support other languages, other datasets, and potentially other features. To accomplish these objectives, the system is built using the Transformers framework.
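The candidate-selection step can be sketched as follows (an illustrative sketch only: the configuration values mirror the parameters listed above, but the function and field names are hypothetical, and the actual span extraction is delegated to the fine-tuned transformer model):

```python
# Parameters mirroring the configuration described above.
QA_CONFIG = {
    "max_seq_len": 512,      # maximum sequence length
    "doc_stride": 128,       # overlap between sliding context windows
    "top_k": 10,             # number of candidate answers to return
    "max_answer_len": 15,    # longest answer span (in tokens) to keep
    "max_question_len": 64,  # longest question (in tokens) accepted
}

def rank_candidates(candidates, config=QA_CONFIG):
    """Given candidate answer spans with model confidence scores,
    drop over-long spans and return the top-k ranked by score."""
    valid = [c for c in candidates
             if len(c["answer"].split()) <= config["max_answer_len"]]
    return sorted(valid, key=lambda c: c["score"], reverse=True)[:config["top_k"]]

# Mock model output for illustration.
candidates = [
    {"answer": "ORKG", "score": 0.91},
    {"answer": "Free text", "score": 0.42},
    {"answer": "an overly long span " * 5, "score": 0.99},  # 20 tokens, dropped
]
ranked = rank_candidates(candidates)
```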
4 Experimental Study
We empirically study the behavior of JarvisQA in the context of scholarly tables against different baselines. The experimental setup consists of metrics and baselines. Table 2 lists the evaluation metrics used to measure system performance. Since a QA system can produce multiple answers, and the correct answer can be any of the retrieved answers, we use metrics that take the position of the answer into account.
| Global Precision | Ratio between correct answers retrieved at the top-ranked position and the total number of questions. |
| Global Recall | Ratio between the number of questions answered correctly at any position (here, up to the 10th retrieved answer) and the total number of questions. |
| F1-Score | Harmonic mean of global precision and global recall. |
| Execution Time | Elapsed time between asking a question and returning the answer. |
| In-Memory Size | Total memory used by the system. |
| Precision@K | Cumulative precision at position K. |
| Recall@K | Ratio of correctly answered questions within the top K positions and the total number of questions. |
| F1-Score@K | Harmonic mean of precision and recall at position K. |
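Under these definitions, the metrics can be computed as in the following sketch (assuming each question is summarized by the rank at which its correct answer was retrieved, or None if it never was; the function names are illustrative):

```python
def global_precision(ranks):
    """Correct answers at the top-ranked position over all questions."""
    return sum(1 for r in ranks if r == 1) / len(ranks)

def global_recall(ranks, cutoff=10):
    """Questions answered correctly at any position up to the cutoff."""
    return sum(1 for r in ranks if r is not None and r <= cutoff) / len(ranks)

def recall_at_k(ranks, k):
    """Questions answered correctly within the top K positions."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Four questions: answered at ranks 1, 3, and 2; one never answered.
ranks = [1, 3, None, 2]
p, r = global_precision(ranks), global_recall(ranks)  # 0.25, 0.75
```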
As baselines we use the following two methods for answer generation:
Random: the answer is selected from all choices randomly.
Lucene (https://lucene.apache.org/): a library for indexing and retrieving unstructured information, commonly used as a search engine. We index the triple-generated sentences with Lucene. For each question, the top answer produced by Lucene is regarded as the final answer.
The evaluation was performed on an Ubuntu 18.04 machine with 128GB RAM and a 12-core Xeon processor. The implementation is mostly based on HuggingFace Transformers (https://github.com/huggingface/transformers) and is written in Python 3.7. The evaluation results for precision, recall, and F1-score are reproducible, while other metrics such as time and memory depend on the evaluation hardware. However, the relative differences between the baselines should remain similar or at least show the same trend. The code to reproduce the evaluation and the presented results are available online (https://doi.org/10.5281/zenodo.3738666).
4.0.1 Experiment 1 - JarvisQA performance on the ORKG-QA benchmark.
In order to evaluate the performance of JarvisQA, we run the system and the baselines on the ORKG-QA dataset at various positions K (K denotes the position of the correct answer among all retrieved answers; here, up to K=10). Moreover, the experiment was conducted on specific subsets of questions (by type) to show the performance of the system for certain categories of questions. The tested question categories are: Normal: questions about a specific cell in the table with a direct answer; Aggregation: questions involving aggregation tasks on top of the table; Related: questions that require retrieving the answer from another cell in the table; Similar: questions that address the table using similar properties (e.g., synonyms). Table 3 shows the performance of the baselines and our system on the ORKG-QA benchmark. The results show that JarvisQA performs better by two to three fold compared to the Lucene and Random baselines, respectively.
| Baseline | Precision@K | Recall@K | F1-Score@K |
4.0.2 Experiment 2 - Different models of QA and their performance.
We evaluate different types of QA models simultaneously to show the difference in performance metrics, execution time, and resource usage. Table 4 illustrates the difference in performance on the ORKG-QA benchmark dataset for different classes of questions and the overall dataset. JarvisQA’s QA engine employs the BERT L/U/S2 model due to its execution time and overall higher accuracy at higher positions.
| Model | Question type | Precision@K | Recall@K | F1-Score@K |
| Distil BERT B/U/S1 | Normal | 0.14 / 0.27 / 0.36 / 0.46 | 0.16 / 0.29 / 0.36 / 0.46 | 0.15 / 0.27 / 0.35 / 0.45 |

B=Base; L=Large; XL=X-Large; C=Cased; U=Uncased; S1=fine-tuned on SQuAD1; S2=fine-tuned on SQuAD2
4.0.3 Experiment 3 - Trade-offs between different performance metrics.
We illustrate trade-offs between different dimensions of performance for the JarvisQA approach compared to the baselines. We choose global precision, global recall, F1-score, in-memory size, and execution time as five dimensions. Figure 3 depicts the trade-offs between our system and the baselines: JarvisQA achieves higher precision and recall while consuming considerably more time and memory than the other baselines.
4.0.4 Experiment 4 - Performance on TabMCQ.
We also show the performance of our system on the TabMCQ dataset compared to the ORKG-QA dataset. We see the same trend in both datasets: JarvisQA outperforms the baselines by many folds. TabMCQ is not directly related to scholarly knowledge; however, it shows that JarvisQA can generalize to related data and answer questions about it. Table 5 presents the results of this experiment.
| System | Dataset | Precision@K | Recall@K | F1-Score@K |
5 Discussion and Future work
The main objective of JarvisQA is to serve as a system that allows users to ask natural language questions on tabular views of scholarly knowledge. As such, the system addresses only a small part of the scholarly information corpus.
We performed several experimental evaluations to benchmark the performance of JarvisQA against other baselines using two different QA datasets. Different datasets showed different results based on the types of questions and the nature of the scholarly data encoded in the tables. Based on these extensive experiments, we conclude that the usual information retrieval techniques used in search engines fail to find specific answers to questions posed by a user. JarvisQA outperforms the other baselines in terms of precision, recall, and F1-score at the cost of higher execution time and memory requirements. Moreover, our system cannot yet answer all types of questions (e.g., non-answerable questions and listing questions).
Since JarvisQA utilizes a BERT-based QA component, different components can perform differently depending on the use case and scenario. Our system struggles with answers spanning multiple cells of the table, and with answering true/false questions. Furthermore, answers are limited to the information in the table (an extractive method), since tables are not supplemented with further background information that could improve the answers.
As indicated, the system can still be significantly improved. Future work will focus on improving answer selection techniques and supporting more types of questions. Additionally, we will improve and enlarge the ORKG-QA dataset to become a better benchmark with more tables (content) and questions. JarvisQA currently selects the answer from a single table only, but use cases might require combining multiple tables or identifying the target table automatically (i.e., selecting the table containing the correct answer from a pool of tables). Moreover, in the context of digital libraries, we want to integrate the system into the ORKG infrastructure so it can be used on live data directly.
Retrieving answers from scientific literature is a complicated task. Manually answering questions on scholarly data is cumbersome and time-consuming; thus, an automatic method for answering questions posed on scientific content is needed. JarvisQA is a question answering system addressing scholarly data that is encoded in tables or sub-graphs representing table content. It can answer several types of questions on table content. Furthermore, our ORKG-QA benchmark is a starting point for collaborating on adding more data to better train, evaluate, and test QA systems designed for tabular views of scholarly knowledge. To conclude, JarvisQA addresses several open questions in current information retrieval in the scholarly communication domain and contributes towards improved information retrieval on scholarly knowledge. It can help researchers, librarians, and ordinary users to retrieve answers with higher accuracy than traditional information retrieval methods.
This work was co-funded by the European Research Council for the project ScienceGRAPH (Grant agreement ID: 819536) and the TIB Leibniz Information Centre for Science and Technology. The authors would like to thank our colleagues Kheir Eddine Farfar, Manuel Prinz, and especially Allard Oelen and Vitalis Wiens for their valuable input and comments.
DBpedia: A nucleus for a Web of open data. Lecture Notes in Computer Science, Vol. 4825 LNCS, pp. 722–735. Cited by: §2.
-  (2007) Ontology-based question answering for digital libraries. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 4675 LNCS, pp. 14–25. External Links: Cited by: §2.
-  (2015-11) Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references. Journal of the Association for Information Science and Technology 66 (11), pp. 2215–2222. External Links: Cited by: §1.
-  The Scholarly Commons - principles and practices to guide research communication. External Links: Cited by: §1.
-  (2017) Learning Structured Natural Language Representations for Semantic Parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Stroudsburg, PA, USA, pp. 44–55. External Links: Cited by: §2.
-  (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Stroudsburg, PA, USA, pp. 4171–4186. External Links: Cited by: §2.
-  (2018-06) Core techniques of question answering systems over knowledge bases: a survey. Knowledge and Information Systems 55 (3), pp. 529–569. External Links: Cited by: §2.
-  (2019-05) QAnswer: A question answering prototype bridging the gap between a considerable part of the LOD cloud and end-users. In The Web Conference 2019 - Proceedings of the World Wide Web Conference, WWW 2019, pp. 3507–3510. External Links: Cited by: §2.
-  (2013) Hindi language graphical user interface to database management system. In Proceedings - 2013 12th International Conference on Machine Learning and Applications, ICMLA 2013, Vol. 2, pp. 555–559. External Links: Cited by: §2.
-  (2010) The anatomy of a nanopublication. Information Services and Use 30 (1-2), pp. 51–56. External Links: Cited by: Table 1.
-  (2006-07) Information Retrieval and Digital Libraries. In Medical Informatics, pp. 237–275. External Links: Cited by: §2.
-  (2019) Open Research Knowledge Graph: Next Generation Infrastructure for Semantic Scholarly Knowledge. Marina Del K-CAP 19. External Links: Cited by: §1.0.1, §1, Table 1.
-  (2016-02) TabMCQ: A Dataset of General Knowledge Tables and Multiple-choice Questions. External Links: Cited by: §3.1.
-  (2019-03) Question Answering via Web Extracted Tables and Pipelined Models. External Links: Cited by: §1.
-  (2013-12) Jointly Learning to Parse and Perceive: Connecting Natural Language to the Physical World. Transactions of the Association for Computational Linguistics 1, pp. 193–206. External Links: Cited by: §2.
-  Scaling Semantic Parsers with On-the-fly Ontology Matching. Technical report Association for Computational Linguistics. External Links: Cited by: §2.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. External Links: Cited by: §2.
-  (2002) The Web as a Resource for Question Answering: Perspectives and Challenges. In LREC, Las Palmas. External Links: Cited by: §1.
-  (2013-08) Evaluating question answering over linked data. Journal of Web Semantics 21, pp. 3–13. External Links: Cited by: §2.
-  (2014-09) Towards an open question answering architecture. In ACM International Conference Proceeding Series, Vol. 2014-September, pp. 57–60. External Links: Cited by: §2.
-  (2020) Generate FAIR Literature Surveys with Scholarly Knowledge Graphs. JCDL ’20: The 20th ACM/IEEE Joint Conference on Digital Libraries (In Press). Cited by: §3.1.
-  (2017) Research Articles in Simplified HTML: A Web-first format for HTML-based scholarly articles. PeerJ Computer Science 2017 (10). External Links: Cited by: Table 1.
-  (2020-01) OpenCitations, an infrastructure organization for open scholarship. Quantitative Science Studies 1 (1), pp. 1–17. External Links: Cited by: §2.
SQuAD: 100,000+ questions for machine comprehension of text. EMNLP 2016 - Conference on Empirical Methods in Natural Language Processing, Proceedings, pp. 2383–2392. Cited by: §2.
-  Using TF-IDF to Determine Word Relevance in Document Queries. Technical report Cited by: §2.
-  (1997-01) Information retrieval in digital libraries: Bringing search to the net. Science 275 (5298), pp. 327–334. External Links: Cited by: §2.
-  (2015-01) SINA: Semantic interpretation of user queries for question answering on interlinked data. Journal of Web Semantics 30, pp. 39–51. External Links: Cited by: §2.
-  (2018) Why Reinvent the Wheel: Let’s Build Question Answering Systems Together. In WWW ’18: Proceedings of the 2018 World Wide Web Conference, pp. 1247–1256. External Links: Cited by: §2.
-  (2015-05) An overview of microsoft academic service (MAS) and applications. In WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web, New York, New York, USA, pp. 243–246. External Links: Cited by: §2.
-  (2017-05) TableQA: Question Answering on Tabular Data. External Links: Cited by: §2.
-  A Neural Question Answering Model Based on Semi-Structured Tables. Technical report Cited by: §2.
-  (2016-03) The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3 (1), pp. 1–9. External Links: Cited by: §1.
-  (2019-10) HuggingFace’s Transformers: State-of-the-art Natural Language Processing. External Links: Cited by: §3.2.2.
-  (2012) Natural Language Questions for the Web of Data. Technical report Association for Computational Linguistics. Cited by: §2.
-  (2015-12) Neural Generative Question Answering. IJCAI International Joint Conference on Artificial Intelligence 2016-January, pp. 2972–2978. External Links: Cited by: §2.
-  (2019-08) SG-Net: Syntax-Guided Machine Reading Comprehension. External Links: Cited by: §2.
-  (2012) On Writing Well, 30th Anniversary Edition: An Informal Guide to Writing Nonfiction. HarperCollins. External Links: Cited by: §2.
-  (2014) Natural language question answering over RDF - A graph data driven approach. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 313–324. External Links: Cited by: §2.