Question Answering with Knowledge Base (Kbqa) parses a natural-language question and returns an appropriate answer that can be found in a Knowledge Base (KB). Currently, one of the most exciting scenarios for Question Answering (QA) is the Web of Data, a fast-growing distributed cloud of interlinked KBs which comprises more than 100 billion edges [mccrae2018lod]. Similarly, Question Answering over Linked Data (Qald) is a research field aimed at transforming utterances into Sparql queries which can be executed against the Linked Open Data (Lod) cloud [lopez2013evaluating]. Qald and Kbqa are closely related, as they both target the retrieval of answers from KBs. However, the current benchmarks and datasets available for evaluating Qald approaches are limited to an unlinked and unstandardized vision of the structured question answering task. In this position paper, we point out the lack of methods to evaluate QA on datasets which exploit external links to the rest of the cloud. Moreover, we argue that several learning-based Kbqa approaches may be very competitive in Qald challenges, as their respective benchmarks currently differ only in the underlying KBs. Instead, our plea is to let language experts do language and Web semantics experts do semantics. We propose the creation of new evaluation methods and settings that leverage the advantages of the Semantic Web (SW) to achieve AI-complete QA over the Web of Data [diefenbach2018].
2 State of the Art
2.1 Question Answering over Linked Data
The most popular datasets for Qald are collected in the eponymous Qald benchmarks [lopez2013evaluating], which have been released through 9 challenge editions since 2011, the BioAsq challenge for biomedical QA [tsatsaronis2012bioasq], and the more recent Lc-Quad [trivedi2017lc]. Of the 29 datasets in the aforementioned benchmarks, 18 are open-domain (i.e., they target a knowledge graph such as DBpedia or Wikidata); among the domain-specific datasets, 8 are from the biomedical domain, 3 describe music, and 1 describes governmental data. QA systems are supposed to build queries aimed at retrieving information from the RDF dataset itself and, in 3 cases, from associated textual resources such as abstracts [hoffner2017survey]. As [diefenbach2017] reports, the Qald benchmark has been used to evaluate diverse systems, some of them based on query templates. The techniques adopted include graph search, Hidden Markov Models, structured perceptron, and (only recently) deep learning, as well as string similarity, language taxonomies, and distributional semantics.
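To make the notion of a query template concrete, the following is a minimal sketch, not the method of any surveyed system. It assumes two hand-written question patterns and a naive entity-linking step; the names `TEMPLATES`, `to_resource_name`, and `question_to_sparql` are hypothetical, and the DBpedia IRIs are used for illustration only.

```python
import re

# Hypothetical templates: a question pattern paired with a SPARQL skeleton.
# Double braces are literal braces after str.format(); {person}/{work} are slots.
TEMPLATES = [
    (
        re.compile(r"^where was (?P<person>.+) born\?$", re.IGNORECASE),
        "SELECT ?x WHERE {{ <http://dbpedia.org/resource/{person}> "
        "<http://dbpedia.org/ontology/birthPlace> ?x }}",
    ),
    (
        re.compile(r"^who wrote (?P<work>.+)\?$", re.IGNORECASE),
        "SELECT ?x WHERE {{ <http://dbpedia.org/resource/{work}> "
        "<http://dbpedia.org/ontology/author> ?x }}",
    ),
]


def to_resource_name(label: str) -> str:
    """Naive entity linking (an assumption): join the words with underscores."""
    return "_".join(label.strip().split())


def question_to_sparql(question: str):
    """Fill the first matching template with the linked entity, else return None."""
    for pattern, skeleton in TEMPLATES:
        match = pattern.match(question.strip())
        if match:
            slots = {k: to_resource_name(v) for k, v in match.groupdict().items()}
            return skeleton.format(**slots)
    return None


print(question_to_sparql("Where was Alan Turing born?"))
```

Questions that match no pattern return `None`, which illustrates why purely template-based systems depend heavily on template coverage.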
Even a 5-star Lod dataset may contain vocabulary which cannot be mapped to any other dataset in the cloud because of its uniqueness (e.g., a specific property of a protein might not exist in open-domain KBs). In order to achieve QA over the SW, we need to perform QA outside of a single KB by exploiting the advantages of the SW (e.g., external links, ontology alignments).
2.2 Question Answering with Knowledge Base
Within the SW community, Kbqa is not as popular as Qald. Research in Semantic Parsing, defined as “the task of converting a natural language utterance to a logical form”, was often excluded from comparisons with Qald approaches for being irrelevant to RDF [hoffner2017survey]. Datasets based on Freebase [bollacker2008freebase] such as WebQuestionsSp [berant2013semantic], SimpleQuestions and ComplexQuestions [bordes2015large] are rarely used in the Qald community, despite their full compatibility with RDF standards. On the other hand, they are widely adopted in the Kbqa community. Several works in the field of Computational Linguistics target both closed- and open-domain Kbqa. Early on, [yahya2012natural] proposed a supervised method to translate questions into queries, which however required a large amount of training data. Semantic parsing approaches were later introduced to address this problem by learning the queries from question-answer pairs, as in [berant2013semantic] and Neural Symbolic Machines [liang2016]. Very recently, [abujabal2017automated] proposed an approach to automate template generation, and [abujabal2018never] an architecture for never-ending learning from a small set of question-answer pairs. Most current Qald approaches still struggle to tackle the issues above, in addition to the lexical gap and support for complex operators [hoffner2017survey]. Given the task similarities, we expect the aforementioned Kbqa approaches to achieve high scores on Qald.
3 Where is Linked Data?
We argue that the open-domain Qald benchmarks are to DBpedia and Wikidata what the Kbqa benchmarks are to Freebase. From a strictly practical viewpoint, the two areas differ only in the underlying data. Hence the question in the title: “Where is Linked Data in Question Answering over Linked Data?”
Unfortunately, Qald benchmark datasets are self-contained, meaning that the desired information can either be found inside them or not be found at all. Such a scenario is different from the Lod cloud, where information about any real-world entity is usually spread across multiple datasets. As of today, to the best of our knowledge, no method exists to evaluate QA on datasets which exploit external links to the rest of the cloud. Consider the example in Figure 1: say we have a source dataset reporting employee data, including each employee's city of birth. The question “which employees were born in the US?” would need additional knowledge to be answered correctly (i.e., what “US” means and how it relates to the cities). We argue that this kind of QA problem can be addressed only by following the external links from the starting dataset; in this case, an employee can be born in dbr:Monterey,_California and, by virtue of the following DBpedia statement, be added to the result set.
dbr:Monterey,_California dbo:country dbr:United_States .
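The employee example can be sketched in a few lines, under stated assumptions: the `ex:` names, the second employee, and the `dbr:Leipzig` statement are made up for illustration, and the hypothetical `dereference` function stands in for an actual lookup of the external IRI on the Web.

```python
# Source dataset (hypothetical): employees and their cities of birth.
SOURCE = {
    ("ex:alice", "ex:birthCity", "dbr:Monterey,_California"),
    ("ex:bob", "ex:birthCity", "dbr:Leipzig"),
}

# Stand-in for the rest of the cloud: the triples each external IRI would
# return if it were dereferenced. Only the first statement is from the text.
EXTERNAL = {
    "dbr:Monterey,_California": {
        ("dbr:Monterey,_California", "dbo:country", "dbr:United_States"),
    },
    "dbr:Leipzig": {
        ("dbr:Leipzig", "dbo:country", "dbr:Germany"),
    },
}


def dereference(iri):
    """Return the triples describing an external resource (empty set if unknown)."""
    return EXTERNAL.get(iri, set())


def employees_born_in(country):
    """Follow each employee's external birth-city link and check its country."""
    result = set()
    for employee, prop, city in SOURCE:
        if prop != "ex:birthCity":
            continue
        for _, p, o in dereference(city):
            if p == "dbo:country" and o == country:
                result.add(employee)
    return result


print(employees_born_in("dbr:United_States"))
```

Note that no triple in `SOURCE` mentions `dbr:United_States`, so a system restricted to the source dataset alone cannot answer the question.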
One could argue that there is no conceptual difference between performing QA over two or more interlinked KBs and over the same KBs merged into one. However, in a real-world scenario, Linked Datasets can differ greatly in size, structure, and format, and can be subject to constant change. As up-to-dateness does not seem to be a concern for the Kbqa community (Freebase is now a defunct project containing obsolete data), the SW community needs to start considering this problem in its actual environment, i.e., the Web.
Nevertheless, in order to scale to the size of the Web, systems must be prepared. It is known that scalability is still an issue for the majority of Qald systems [diefenbach2017]. Recent works in Kbqa have managed to deal with billions of triples while achieving a satisfactory waiting time for the end user [abujabal2017automated]. In this respect, neural approaches are a promising alternative; despite being expensive during training, neural networks usually do not require as many resources for prediction [sorumarx2017].
Another point that would differentiate the two research areas is the variety of schemata used in the Lod cloud. Each dataset has its own vocabulary, adapted to its context, and some vocabularies use extremely specific or idiosyncratic terms. It is therefore not guaranteed that an approach which performs well on open-domain QA will also generalize well to other datasets. Term overlap is also frequent, leading to a scenario that is completely different from that of Kbqa.
4 What to do now?
In this section, we propose two settings of a hypothetical benchmark. Both settings could be generated semi-automatically without too much human effort, or at least with no more than is needed for a canonical Qald dataset. The availability of the numerous Qald benchmarks released so far reduces the risk considerably. After such new benchmarks are released, we expect the community to react and provide solutions within one year, by the next Qald edition.
In the first setting, a domain-specific dataset containing links to (e.g.) DBpedia and Schema.org entities is given in full to the QA systems. Questions can be answered only if the DBpedia and Schema.org links are dereferenced. A snapshot of the Lod cloud subset used can be created for reproducibility and distributed through query-ready formats such as KBox [marx2017kbox] and HDT [gallego2011hdt].
The second setting we propose exploits the fact that properties utilized in the Lod cloud are defined by widely-adopted standardized vocabularies. The task is to apply transfer learning over two or more datasets using one or more common upper ontologies (e.g., OWL, DCT, SKOS). Training data are only given on a dataset A, while questions target dataset B, where the different ontologies in A and B are aligned to an upper ontology.
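The idea behind the second setting can be illustrated with a toy sketch. It assumes that both datasets come with property alignments to DCT terms, and it replaces the learned model with a plain lookup table; the `a:` and `b:` properties, the training phrases, and all function names are hypothetical.

```python
# Hypothetical alignments from dataset-specific properties to upper-ontology terms.
ALIGN_A = {"a:heading": "dct:title", "a:madeBy": "dct:creator"}
ALIGN_B = {"b:name": "dct:title", "b:author": "dct:creator"}

# Training pairs exist only for dataset A: (question phrase, property in A).
TRAIN_A = [
    ("what is the title of", "a:heading"),
    ("who created", "a:madeBy"),
]


def learn_upper_lexicon(train, alignment):
    """Lift A-specific supervision to the upper ontology (a lookup stands in for a model)."""
    return {phrase: alignment[prop] for phrase, prop in train}


def invert(alignment):
    """Map each upper-ontology term back to the dataset-specific property."""
    return {upper: local for local, upper in alignment.items()}


def predict_property(phrase, lexicon, target_alignment):
    """Predict the target-dataset property for a phrase via the shared upper term."""
    upper = lexicon.get(phrase)
    if upper is None:
        return None
    return invert(target_alignment).get(upper)


lexicon = learn_upper_lexicon(TRAIN_A, ALIGN_A)
print(predict_property("who created", lexicon, ALIGN_B))
```

Because the lexicon is expressed in upper-ontology terms, nothing in it refers to dataset A once training is done, which is what allows questions over dataset B to be handled without any B-specific training data.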
We showed the need for the Qald community to tackle its eponymous problem not over a single KB, but from the perspective of the Web of Data. After such new benchmarks are released, we expect the community to react and provide solutions within one year, by the next Qald edition. Our plea is to let language experts do language and Web semantics experts do semantics. While the former will keep addressing all problems related to human language, new challenges will arise for the SW.
We thank Dennis Diefenbach for his kind suggestions and feedback.