IQA: Interactive Query Construction in Semantic Question Answering Systems

06/20/2020 ∙ by Hamid Zafar, et al. ∙ University of Bonn ∙ L3S Research Center

Semantic Question Answering (SQA) systems automatically interpret user questions expressed in a natural language in terms of semantic queries. This process involves uncertainty, such that the resulting queries do not always accurately match the user intent, especially for more complex and less common questions. In this article, we aim to empower users in guiding SQA systems towards the intended semantic queries through interaction. We introduce IQA - an interaction scheme for SQA pipelines. This scheme facilitates seamless integration of user feedback in the question answering process and relies on Option Gain - a novel metric that enables efficient and intuitive user interaction. Our evaluation shows that using the proposed scheme, even a small number of user interactions can lead to significant improvements in the performance of SQA systems.




1 Introduction

Openly available large-scale knowledge graphs such as DBpedia Lehmann et al. (2015), Wikidata Vrandecic and Krötzsch (2014), YAGO Hoffart et al. (2013) and EventKG Gottschalk and Demidova (2018a), Gottschalk and Demidova (2019) have evolved as the key reference sources of information and knowledge regarding real-world entities, events, and facts on the Web. The flexibility of the RDF-based knowledge representation, the large-scale editor base of popular knowledge graphs, and recent advances in the automatic knowledge graph completion methods lead to a growth of the data and the schema layers of these graphs at an unprecedented scale, with schemas including thousands of types and relations Paulheim (2017). As a result, the information contained in the knowledge graphs is very hard to query, in particular, due to the large scale, the heterogeneity of the entities, and the variety of their schema descriptions.

Semantic Question Answering (SQA) is the key technology to facilitate end-users to query knowledge graphs using natural language interfaces. In recent years, a large number of SQA approaches have been developed Höffner et al. (2017a). The objective of these approaches is to automatically interpret a user question formulated in a natural language as a semantic query (typically expressed in the SPARQL query language), which is then executed against the knowledge graph to obtain the results. Current SQA approaches are capable of effectively answering rather simple factual questions that contain a limited number of entities and relations.

In the case of complex questions, i.e., questions that involve multiple entities and relations, the performance of the existing SQA approaches is still limited. These limitations can, to a large extent, be attributed to the inherent uncertainty associated with the results of the individual pipeline components along with the propagation of errors of the component results through the entire SQA pipeline. This uncertainty often leads to imprecise question interpretations, especially for complex questions.

Figure 1: An example transformation of a question from the LC-QuAD dataset into possible semantic queries over the DBpedia knowledge graph using an SQA pipeline consisting of a Shallow Parser (SP), an Entity Linker (EL), a Relation Linker (RL) and a Query Builder (QB).

Figure 1 illustrates this problem using an example question from LC-QuAD Trivedi et al. (2017) - a state-of-the-art dataset for the evaluation of Semantic Question Answering systems: “List software that is written in C++ and runs on Mac OS.”. An SQA pipeline incrementally transforms the input question into a semantic query, using components such as a Shallow Parser (SP), an Entity Linker (EL), a Relation Linker (RL) and a Query Builder (QB). First, the Shallow Parser identifies the keyword phrases “software”, “written”, “C++”, “runs” and “Mac OS”. Then the Entity Linker and Relation Linker map these keyword phrases to entities and relations in the DBpedia knowledge graph. To obtain the correct interpretation, the Entity Linker should link the keyword phrase “C++” to the entity dbr:C++, the programming language, and “Mac OS” to the entity dbr:Mac_OS, the operating system. The Entity Linker should not confuse “C++” with, e.g., dbr:C, another programming language. The Relation Linker should link the keyword phrase “written” to the relation
dbo:programmingLanguage and “runs on” to the relation dbo:operatingSystem. Here, the task of relation linking is particularly difficult due to the lexical gap, the required domain knowledge, and the ambiguity of the candidates. To reduce the number of candidates, the Relation Linker can rely on the Entity Linker results, e.g., by taking into account the relations of the linked entities in the knowledge graph. Finally, the Query Builder component utilizes the results of the Entity Linker and Relation Linker to build the semantic query. Errors in the results of the Entity Linker and Relation Linker can often lead to the misinterpretation of the user question. With an increasing number of entities and relations mentioned in the user question, the likelihood of such errors increases.

The objective of this article is to address the limitations of the existing SQA approaches in answering complex questions through the provision of a novel user interaction scheme. While other domains like Information Retrieval and keyword search over structured data take significant advantage of user interaction models (e.g., Demidova et al. (2013b)), such models are not yet widely adopted in the context of Semantic Question Answering. The proposed IQA scheme can be particularly beneficial in answering complex questions, where the intended semantic interpretation of the question cannot be accurately inferred using automatic methods. From the algorithmic perspective, this scheme enables SQA systems to efficiently reduce uncertainty during the query interpretation process. From the user perspective, this scheme can empower users to effectively guide SQA algorithms towards the intended results.

Given an SQA pipeline and a user question, the goal of IQA is to facilitate an efficient and intuitive generation of the intended question interpretation through user interaction. The proposed interaction scheme incrementally refines user questions into the intended semantic queries by requesting user feedback on items called interaction options. The main challenge to be addressed here is the trade-off between efficiency and usability in the interaction scheme. In this context, efficiency refers to the minimization of the interaction cost (i.e., the number of requests for user feedback). Usability refers to the ease of use and understandability of the interaction options. To the best of our knowledge, none of the state-of-the-art SQA systems support user interaction in Semantic Question Answering in the way envisioned in this article.

Overall, in this article we make the following contributions:

  • We provide a formalization of an SQA pipeline, which captures the dependency of the pipeline components, and facilitates generalization of the proposed interaction scheme to a wide range of SQA systems.

  • We present a probabilistic foundation to estimate the likelihood of the generated question interpretations and interaction options. This model builds a basis for the systematic generation of effective interaction options in a variety of categories.

  • We propose a user interaction scheme that seamlessly incorporates user feedback in the Semantic Question Answering process to reduce uncertainty efficiently. We adopt a cost-sensitive decision tree to balance the trade-off between usability and efficiency of the options in the interaction process.

  • We incorporate the usability of interaction options into a new metric, Option Gain, that balances the usability and efficiency of interaction options and facilitates the selection of interaction options that are efficient and intuitive for the user.

  • We showcase an instantiation of the proposed user interaction scheme in a web-based IQA prototype while utilizing existing components developed by the SQA community.

We demonstrate the effectiveness and efficiency of the proposed interaction scheme for Semantic Question Answering in an extensive experimental evaluation and a user study. Our evaluation results on LC-QuAD, an established dataset for the assessment of Semantic Question Answering systems, demonstrate that IQA can significantly improve the effectiveness, efficiency, and usability of Semantic Question Answering systems for complex questions. In particular, the IQA-OG configuration that adopts Option Gain achieves an increase of up to 20 percentage points in terms of score compared to the baselines on a subset of LC-QuAD utilized in the user study. Furthermore, this configuration enhances the ease of use as reported by the users.

We organize the rest of the article as follows: First, we formalize the concept of Semantic Question Answering pipeline in Section 2. Then, in Section 3 we present the user interaction scheme of IQA. Following that, we describe the realization of the IQA pipeline in Section 4. The evaluation setup is described in Section 5. Our evaluation results are presented in Section 6. Section 7 discusses related work. We provide a conclusion in Section 8.

2 Formalization of an SQA Pipeline

A Semantic Question Answering pipeline (denoted as “SQA pipeline” in the following) transforms a user question specified in a natural language into a semantic query for the target knowledge graph. In this section, we present a formalization of an SQA pipeline that abstracts from the particular implementation. Notations frequently used in the article are summarized in Table 1.

Notation      Description
Q = (q, N)    a representation of the user question
q             a user question as a natural language expression
N             a multiset of information nuggets
I_p           a partial question interpretation
I_c           a complete question interpretation
f_i           an interpretation function
IS            the question interpretation space
IO            an interaction option
OG(IO)        Option Gain
IG(IO)        Information Gain

Table 1: Summary of frequently used notations.

2.1 Basic Concepts

The goal of Semantic Question Answering is to transform a user question expressed in a natural language into a semantic query for the target knowledge graph. In the following, we formalize the concepts of the knowledge graph, the user question, and the semantic query.

A knowledge graph K = (E, L, P, T) consists of a set of entities E, a set of literals L, a set of properties P and a set of triples T ⊆ E × P × (E ∪ L).

The entities in E represent real-world entities and concepts. The properties in P represent relations connecting two entities or an entity and a literal value.

A user question is a tuple Q = (q, N) that represents user input. Here, q is the initial user question expressed in a natural language, and N is a multiset of information nuggets mentioned in the user question.

Information nuggets can include surface forms of named entities, concepts, and relations mentioned in the question. Information nuggets can be extracted from the question using information extraction techniques such as shallow parsing.

For example, consider the question:

“List software that is written in C++ and
runs on Mac OS.”

This question can be transformed into the following set of information nuggets:

N = { “software”, “written”, “C++”, “runs”, “Mac OS” }
In the process of Semantic Question Answering, information nuggets mentioned in the user question are interpreted as elements of the knowledge graph. A nugget interpretation m(n) = k is a mapping from an information nugget n ∈ N to an element k of the knowledge graph K. An information nugget can be interpreted as an entity, a literal, a property, a single triple, or a set of triples.

For example, the nugget interpretation

m(“software”) = dbo:Software

maps the information nugget “software” to the entity “dbo:Software” of the knowledge graph. Other examples of nugget interpretations include:

m(“C++”) = dbr:C++,  m(“Mac OS”) = dbr:Mac_OS,
m(“written”) = dbo:programmingLanguage,  m(“runs”) = dbo:operatingSystem

When an SQA pipeline transforms the user question into a semantic query, the pipeline components can generate intermediate interpretation results that include several nugget interpretations. We refer to such intermediate results as partial question interpretations. More formally:

A partial question interpretation I_p is a set of nugget interpretations that interpret a (sub)set of the information nuggets contained in N.

For example, a partial question interpretation

I_p = { m(“C++”) = dbr:C++, m(“Mac OS”) = dbr:Mac_OS }

includes specific interpretations of two information nuggets representing entity surface forms in the user question.

Partial question interpretations serve as a basis for building semantic queries.

A semantic query is a complete question interpretation I_c that represents the user question as a whole. Intuitively, an I_c includes the elements of the knowledge graph that correspond to the nugget interpretations and connects them in a graph pattern.

Formally, a complete question interpretation is a tuple I_c = (I_p, AT, QG) that consists of a set of nugget interpretations I_p, an answer type AT and a query graph QG. The answer type AT is an element of {“ASK”, “SELECT”, “COUNT”}. Given a knowledge graph K = (E, L, P, T), a query graph QG = (E′, L′, P′, V, TP) is a graph pattern such that: E′ ⊆ E is a set of entities, L′ ⊆ L is a set of literals, P′ ⊆ P is a set of properties, V is a set of variables and TP ⊆ (E′ ∪ V) × (P′ ∪ V) × (E′ ∪ L′ ∪ V) is a set of triple patterns.

For example, I_c = (I_p, AT, QG) is a complete question interpretation of the example question presented above, where AT = “SELECT”,

I_p = { m(“software”) = dbo:Software, m(“C++”) = dbr:C++, m(“Mac OS”) = dbr:Mac_OS, m(“written”) = dbo:programmingLanguage, m(“runs”) = dbo:operatingSystem },

and QG contains the triple patterns (?software, rdf:type, dbo:Software), (?software, dbo:programmingLanguage, dbr:C++) and (?software, dbo:operatingSystem, dbr:Mac_OS).
To retrieve answers from a knowledge graph, a complete question interpretation can be translated into a query in the SPARQL query language. For example, the following SPARQL query corresponds to the complete question interpretation of the example question presented above:

  SELECT DISTINCT ?software WHERE {
    ?software rdf:type dbo:Software .
    ?software dbo:programmingLanguage dbr:C++ .
    ?software dbo:operatingSystem dbr:Mac_OS .
  }

Note that a complete question interpretation does not necessarily include interpretations of all information nuggets extracted from the user question. This is because the information nuggets extracted from the user question can potentially contain redundant information.

2.2 Semantic Question Answering Pipeline

A typical Semantic Question Answering pipeline considered in this article consists of: 1) a shallow parser constructing information nuggets; 2) linkers: here we support entity, relation and class linking separately or jointly, so there can be one or multiple linkers; and 3) a query builder creating complete question interpretations.

More formally, a Semantic Question Answering pipeline SQA = (C_1, …, C_n) is a list of components, where each component C_i implements an interpretation function f_i. The aim of an interpretation function is to incrementally transform the user question into candidate question interpretations.


A pipeline component can generate multiple candidate interpretations.

The component C_1 is a specific shallow parsing component at the first step of the pipeline, which transforms the user question into a set of information nuggets: f_1: QS → 2^NS, where QS is the set of natural language questions, and NS is the set of information nuggets.

A component C_i, 1 < i < n, takes the user question and, optionally, an interpretation produced by the previous pipeline component as an input and produces a set of partial interpretations as an output: f_i: QS × IP → 2^IP, where QS is the set of questions, IP is the set of partial question interpretations, and 2^IP is the power set of IP. Examples of interpretation functions of the components include entity linking, relation linking, and class linking. There can be a single joint linking step or multiple individual linking steps. By supporting all of those scenarios, the interaction framework described in this article can be applied to a broader range of existing SQA frameworks.

The component C_n is a specific query building component at the last step of the pipeline, which transforms a partial question interpretation into one or more complete question interpretations, i.e. f_n: IP → 2^IC, where IC is the set of complete question interpretations. Each question interpretation I ∈ IP ∪ IC is associated with a confidence score generated by the corresponding pipeline component.

Conceptually, as an SQA pipeline processes the user question, it incrementally generates a hierarchy of question interpretations, where partial question interpretations are the intermediate nodes, and complete question interpretations are the leaf nodes.
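As an illustration, the component signatures above can be sketched in Python; the dataclass, the toy lookup tables and the candidate scores below are our own simplifications for illustration, not part of the IQA implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NuggetInterpretation:
    """Maps a surface form from the question to a knowledge-graph element."""
    nugget: str        # e.g. "C++"
    kg_element: str    # e.g. "dbr:C++"
    confidence: float  # confidence score of the producing component

def shallow_parse(question: str) -> list:
    """C_1: transform the question into information nuggets (toy lookup)."""
    known = ["software", "written", "C++", "runs", "Mac OS"]
    return [n for n in known if n in question]

def entity_linker(nuggets: list) -> dict:
    """A linker component C_i: each nugget may yield several candidates."""
    toy_index = {
        "C++": [NuggetInterpretation("C++", "dbr:C++", 0.9),
                NuggetInterpretation("C++", "dbr:C", 0.4)],
        "Mac OS": [NuggetInterpretation("Mac OS", "dbr:Mac_OS", 0.8)],
    }
    return {n: toy_index.get(n, []) for n in nuggets}

question = "List software that is written in C++ and runs on Mac OS."
nuggets = shallow_parse(question)
candidates = entity_linker(nuggets)
```

The ambiguity of “C++” (two candidates with different confidences) is exactly the uncertainty the interaction scheme in the next section targets.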

3 IQA User Interaction Scheme

Given a user question Q and a large-scale knowledge graph K, a Semantic Question Answering pipeline can generate a large number of possible complete question interpretations. We denote the set of all complete question interpretations of Q generated by the pipeline given K as the question interpretation space IS.

IQA facilitates an efficient and intuitive generation of the intended question interpretation through a user interaction scheme. In IQA, an interaction option is a unit adapted for user interaction. The goal of the interaction scheme is to reduce the question interpretation space with each user interaction efficiently while providing intuitive interaction options.

Conceptually, the IQA interaction scheme resembles the induction of a cost-sensitive decision tree Lomax and Vadera (2013), where the cost reflects the complexity and usability of the interaction options from the user perspective. We rely on the notion of Option Gain introduced later in this section to facilitate the usability and efficiency of the interaction scheme.

3.1 Interaction Options and Subsumption Relation

An interaction option is a unit adapted for user interaction to reduce the question interpretation space. In IQA we group interaction options in the following categories: 1) nugget interpretations, 2) superclasses and types of entities, 3) answer types of semantic queries, and 4) complete question interpretations (i.e., semantic queries).

To facilitate an effective reduction of the question interpretation space by interaction, we establish a subsumption relation between interaction options and complete question interpretations.

We say that an interaction option subsumes a complete question interpretation if one of the following conditions applies:

  • The interaction option represents a nugget interpretation used in the generation of the semantic query, i.e., the nugget interpretation is an element of the complete question interpretation.

  • The interaction option is a superclass or a type of an entity included in the complete question interpretation: there must be a URI in the query graph of the complete question interpretation for which an rdf:type or rdfs:subClassOf triple connecting it to the interaction option exists in the knowledge graph.

  • The interaction option represents the answer type of the complete question interpretation.

  • The interaction option is equivalent to the semantic query.
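A minimal sketch of the subsumption check, assuming a simple dictionary encoding of complete question interpretations (the encoding is ours, for illustration only):

```python
def subsumes(option, query):
    """Check whether an interaction option subsumes a complete question
    interpretation under one of the four conditions above."""
    kind, value = option
    if kind == "nugget":        # 1) nugget interpretation used by the query
        return value in query["interpretations"]
    if kind == "class":         # 2) superclass or type of an entity in the query
        return value in query["entity_types"]
    if kind == "answer_type":   # 3) answer type of the query
        return value == query["answer_type"]
    if kind == "query":         # 4) the semantic query itself
        return value == query["id"]
    return False

query = {
    "id": "q1",
    "interpretations": {("C++", "dbr:C++"), ("Mac OS", "dbr:Mac_OS")},
    "entity_types": {"dbo:Software", "dbo:ProgrammingLanguage"},
    "answer_type": "SELECT",
}
```

Accepting or rejecting an option then amounts to keeping or discarding every interpretation for which `subsumes` holds.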

3.2 Option Gain

Interaction options vary in their complexity and usability. Complex interaction options can be difficult for users to understand, potentially leading to an error-prone interaction process (i.e., wrong user decisions) and decreasing overall user satisfaction.

The key concept of the IQA interaction scheme is the Option Gain OG(IO). Option Gain takes into account the usability US(IO) and the efficiency of the interaction option IO expressed using its Information Gain IG(IO). We define the Option Gain as:

OG(IO) = IG(IO) · US(IO)^β,

where β ≥ 0 is a parameter that controls the bias introduced by the usability of an interaction option IO in the interaction process, such that for β = 0 the Option Gain corresponds to the Information Gain without the usability bias.

In IQA the usability of an interaction option is reflected through the usability score US(IO) ∈ [0, 1], where values close to 1 correspond to the most intuitive options and values close to 0 to the most complex options:

US(IO) = 1 / (1 + complexity(IO))
The complexity of an interaction option can be characterized through the syntactic similarity of the interaction option to the initial user question, the degree of abstraction, and the structural complexity.

Given the user question Q, the uncertainty of the question interpretation is the result of several factors, including: F1) the ambiguity of information nuggets in Q and the resulting uncertainty when interpreting these nuggets in a large-scale knowledge graph; F2) the uncertainty of the expected answer type; and F3) a variety of possible graph structures connecting nugget interpretations in a semantic query. Interaction options proposed in IQA aim to reduce this uncertainty.

In the following, we discuss the complexity estimation of the interaction options, which were introduced in Section 3.1 above.

  • An interaction option in this category is a nugget interpretation. Intuitively, an option syntactically similar to the nugget in the user question may appear familiar, and thus less complex, to the user. Therefore, we estimate the complexity of an option in this category as the dissimilarity between the information nugget corresponding to the option in the user question and the representation (e.g., a label) of the option shown to the user in the interaction process. We adopt the Longest Common Substring (LCS) as a string similarity metric, as this metric was shown to be suitable for short phrases Christen (2006).

  • An interaction option in this category is a superclass or a type of an information nugget contained in the semantic query. The usability of such options depends on the degree of abstraction. We assume that less abstract categories such as “person” and “actor” can appear more intuitive to the users than more abstract categories, such as “living thing”. To reflect this intuition, we measure the complexity of an interaction option in this category as the length of the shortest path between the option and the element of the knowledge graph that directly maps to the corresponding information nugget in the user question.

  • An interaction option in this category represents an answer type of the semantic query. Given the relatively straightforward set of possible answer types, we set the complexity of the options in this category to zero.

  • The interaction options in this category are semantic queries. Intuitively, more complex queries that include a high number of nugget interpretations can appear more difficult to understand from the user perspective. Therefore, we compute the complexity of an interaction option in this category as the number of nugget interpretations it includes.
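For the first category, the LCS-based dissimilarity could be computed as follows; normalizing by the longer string is our assumption, as the article does not fix a normalization:

```python
def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest common substring (dynamic programming)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def nugget_option_complexity(nugget: str, label: str) -> float:
    """Dissimilarity between the question phrase and the option label:
    0 means identical surface forms, values near 1 mean disjoint ones."""
    lcs = longest_common_substring(nugget.lower(), label.lower())
    return 1.0 - lcs / max(len(nugget), len(label))
```

An option whose label repeats the question phrase verbatim (e.g. “C++” → dbr:C++) gets complexity 0 and is thus ranked as maximally intuitive.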

3.3 Information Gain

For the computation of the Information Gain of an interaction option in the question interpretation space, we build upon the probabilistic model proposed in our previous work Demidova et al. (2013b). We summarize the computation of the Information Gain in the following.


Let H(IS) be the entropy of the probability distribution in the question interpretation space IS. The Information Gain IG(IO) of an interaction option IO is computed as the entropy reduction given user feedback on IO.

Let IS_IO ⊆ IS be the set of complete question interpretations in IS subsumed by IO, and IS_¬IO = IS \ IS_IO be the set of all other complete question interpretations in IS. Furthermore, let p(IO) be the probability that the interaction option IO subsumes the user-intended complete question interpretation.

The entropy of the probability distribution in the question interpretation space IS is computed as:

H(IS) = − Σ_{I_c ∈ IS} p(I_c) · log p(I_c)
Then, the Information Gain of the interaction option IO is computed as the uncertainty reduction provided by this option:

IG(IO) = H(IS) − ( p(IO) · H(IS_IO) + (1 − p(IO)) · H(IS_¬IO) )
The probability of an interaction option is computed as the sum of the probabilities of the complete question interpretations subsumed by this option:

p(IO) = Σ_{I_c ∈ IS_IO} p(I_c)
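These quantities translate directly into code. A sketch using log base 2 and a toy three-query interpretation space (the space itself is an illustrative assumption):

```python
import math

def entropy(probs):
    """H = -sum p * log2(p) over a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(space, subsumed):
    """Entropy reduction given binary feedback on an option.
    space:    interpretation id -> probability (sums to 1)
    subsumed: ids of the interpretations the option subsumes"""
    p_in = [space[i] for i in subsumed]
    p_out = [space[i] for i in space if i not in subsumed]
    p_io = sum(p_in)
    h = entropy(space.values())
    # Conditional entropy over the renormalized halves of the space.
    h_in = entropy(p / p_io for p in p_in) if p_io > 0 else 0.0
    h_out = entropy(p / (1 - p_io) for p in p_out) if p_io < 1 else 0.0
    return h - (p_io * h_in + (1 - p_io) * h_out)

space = {"q1": 0.5, "q2": 0.25, "q3": 0.25}
```

For this space, an option subsuming only q1 has probability 0.5 and yields an Information Gain of exactly 1 bit.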
3.4 Probability of Complete Question Interpretations

To estimate the probability p(I_c | Q, K) of the complete question interpretation I_c to be intended by the user, given the user question Q and the knowledge graph K, we consider the following factors: 1) the likelihood of the partial question interpretation I_p from which I_c was composed by the SQA pipeline, represented as p(I_p | Q, K), and 2) the probability of the graph structure QG of the semantic query given the linguistic structure of the user question q, represented as p(QG | q).

For mathematical simplification, similar to Naïve Bayes, we assume that the probabilities of the nugget interpretations in the partial interpretation from which the complete interpretation is constructed, as well as the structure of the resulting semantic query, are mutually independent. Although the resulting probability estimation is potentially not very precise, it leads to an adequate prediction of query relevance, as shown by our experiments.

Then the probability of the complete question interpretation I_c can be estimated as:

p(I_c | Q, K) = p(QG | q) · Π_{m ∈ I_p} p(m | Q, K)
We estimate the probability p(m | Q, K) of a nugget interpretation m using the confidence score provided by the pipeline component that generates it. The probability p(QG | q) is estimated using the structural similarity between the graph structure of QG and the parse tree structure of the user question q. We provide more details regarding the computation later in Section 4.3.
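Under this independence assumption, the estimate can be sketched as follows; the toy scores and the final renormalization over the candidate set are illustrative assumptions:

```python
from math import prod

def interpretation_probability(structure_prob, nugget_probs):
    """Naive-Bayes-style estimate: the query-graph structure probability
    times the product of the (assumed independent) nugget probabilities."""
    return structure_prob * prod(nugget_probs)

def normalize(scores):
    """Renormalize scores into a probability distribution over the space."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}

scores = {
    "q1": interpretation_probability(0.8, [0.9, 0.8]),  # plausible linking
    "q2": interpretation_probability(0.5, [0.4, 0.8]),  # weaker candidates
}
probs = normalize(scores)
```

The normalized values can then feed the entropy and Information Gain computations of the previous subsection.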

3.5 User Interaction Process

The conceptual process of the interactive question interpretation using a generic Semantic Question Answering pipeline presented in Section 2 can be modeled as follows:

Step 1 (SQA Pipeline Execution): The user issues the question. The SQA pipeline is executed to generate the question interpretation space.

Step 2 (Pre-Processing): The partial and complete question interpretations generated by the pipeline are utilized to generate the interaction options. Then the subsumption relations between these options and the complete question interpretations in the interpretation space are established.

Step 3 (User Interaction): At each step of the interaction process, the user is simultaneously presented with:

  • The interaction option with the highest Option Gain, and

  • The most likely complete question interpretation in the interpretation space (in a natural language and a semantic representation).

For simplicity, we model the interaction process as a list of binary user decisions, i.e., we assume that the user is presented with one interaction option at a time. In practice, this process can be generalized to present several interaction options simultaneously.

At each step of the interaction process, the user has the following means to interact with the system:

  • Accept the interaction option, i.e., confirm that the presented interaction option correctly interprets (a part of) the question.

  • Reject the interaction option.

  • Accept the complete question interpretation, i.e., confirm that this interpretation correctly reflects the intention of the question.

After each interaction, complete question interpretations that do not comply with the user decision are removed from the question interpretation space using the subsumption relation. The Option Gain of all remaining interaction options is recomputed. The interaction process continues with the currently top-scored interaction option and complete question interpretation.

The interaction process for a question terminates if one of the following applies:

  • The user accepts the complete question interpretation.

  • The question interpretation space is empty, i.e., the correct interpretation cannot be identified given the user feedback.

  • The user terminates the process.

  • The number of interactions or the time spent by the user reaches a threshold.
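The overall loop can be sketched as follows; for brevity, this sketch replaces the full Option Gain computation with a simple even-split heuristic and simulates the user with an oracle function (both are our own simplifications):

```python
def interactive_qa(space, options, oracle, max_rounds=10):
    """Greedy interaction loop: present an option, prune the space by the
    user's answer, repeat until one interpretation remains.
    space:   interpretation id -> probability
    options: option label -> set of interpretation ids it subsumes
    oracle:  simulated user; returns True to accept an option"""
    for _ in range(max_rounds):
        if not space:
            return None               # feedback ruled out every interpretation
        if len(space) == 1:
            return next(iter(space))  # single candidate left: done
        # Stand-in for Option Gain: prefer the option that splits the
        # remaining space most evenly (maximizes worst-case reduction).
        live = set(space)
        best = max(options, key=lambda o: min(len(options[o] & live),
                                              len(live - options[o])))
        keep = options[best] if oracle(best) else live - options[best]
        space = {k: v for k, v in space.items() if k in keep}
    return max(space, key=space.get) if space else None

opts = {"'C++' -> dbr:C++": {"q1", "q2"}, "answer type SELECT": {"q1", "q3"}}
result = interactive_qa({"q1": 0.4, "q2": 0.3, "q3": 0.3}, opts,
                        oracle=lambda o: "q1" in opts[o])
```

With the oracle intending q1, two binary answers suffice to isolate the intended interpretation from the three candidates.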

4 Realization

In this section, we present the realization of the proposed IQA approach presented in Section 3, including in particular an IQA pipeline implementation and a prototypical user interface adopted in the user evaluation. Note that our approach is independent of any specific implementation of the SQA pipeline formalized in Section 2.

4.1 IQA Pipeline

The Semantic Question Answering pipeline of IQA instantiated in this work is illustrated in Figure 2. This pipeline consists of four components, namely a Shallow Parser, an Entity Linker, a Relation Linker, and a Query Builder.

Figure 2: An Interactive Question Answering (IQA) pipeline.

With the IQA pipeline, we aim to generate several relevant candidate question interpretations to build the interpretation space, in order to facilitate the user interaction scheme. This method differs from state-of-the-art SQA approaches such as “WDAqua” Diefenbach et al. (2019), which aim to generate only the single most likely question interpretation.

To increase the recall of relevant question interpretations generated by the IQA pipeline, we leverage multiple independent tools in each pipeline step to obtain complementary candidates. The output of each pipeline component is the union of all candidates produced by the individual tools. This approach increases the recall of the candidates generated in each pipeline component. Furthermore, it increases the overall recall of the relevant question interpretations resulting from the IQA pipeline. To facilitate efficient processing, we run the tools within each pipeline component in parallel.
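The union-of-candidates strategy with parallel tool execution can be sketched as follows; the toy linkers merely stand in for tools such as EARL and the 3-gram linker:

```python
from concurrent.futures import ThreadPoolExecutor

def run_component(tools, argument):
    """Run all tools of one pipeline component in parallel and return
    the union of their candidate sets (increases recall)."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda tool: tool(argument), tools))
    union = set()
    for candidates in results:
        union |= set(candidates)
    return union

# Toy linkers returning (surface form, KG element) candidate pairs.
linker_a = lambda nugget: {("C++", "dbr:C++")}
linker_b = lambda nugget: {("C++", "dbr:C++"), ("C++", "dbr:C")}
candidates = run_component([linker_a, linker_b], "C++")
```

Taking the union rather than the intersection trades precision for recall, which the interaction scheme then recovers through user feedback.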

To select the tools for each pipeline component in the current realization of IQA, we conducted preliminary experiments.

4.1.1 Shallow Parser

We analyzed three independent shallow parsing tools, namely MDP-Parser Zafar et al. (2020) developed in our previous work, SENNA Collobert et al. (2011), and an NLTK-based Bird et al. (2009) chunker implemented using a classification-based sequential tagger. MDP-Parser is a reinforcement learning-based approach to identify named entity and relation mentions in a distantly supervised setting. In our preliminary experiments, we observed that MDP-Parser shows superior performance for shallow parsing compared to the other approaches Zafar et al. (2020). Furthermore, we did not observe any significant performance increase in the results of the entity and relation linking by adopting multiple tools at this pipeline step. Hence, we adopt MDP-Parser as the only tool in the Shallow Parser pipeline component.

4.1.2 Entity Linker

At this stage, we considered two state-of-the-art entity linking tools: TagMe Hasibi et al. (2016) and EARL Dubey et al. (2018). To further increase recall, we implemented an additional linking tool, which utilizes a character-level n-gram representation of the information nuggets and performs linking between the information nuggets and the labels of entities in the knowledge graph using 3-gram similarity. We implemented this tool using an Apache Lucene index. In our preliminary experiments, we observed that the entity linking results obtained by a combination of the 3-gram similarity and EARL subsume the results of TagMe. Thus we adopt the 3-gram similarity and EARL as two independent entity linking tools in the current realization of the IQA pipeline.

4.1.3 Relation Linker

Relation linking is conducted analogously to the entity linking, using EARL and a word-matching similarity between the information nuggets and the knowledge graph properties.

4.1.4 Query Builder

We adopt the SQG Zafar et al. (2018) tool developed in our previous work as the Query Builder component.

4.3 Probability Estimation

An estimation of the probability of a complete question interpretation, presented in Section 3.4, requires estimating the probabilities of the nugget interpretations and of the query graph.

To estimate the probability of a nugget interpretation, we adopt the confidence score of the pipeline component that generates this interpretation. We normalize the confidence scores using min-max scaling.

The probability of the query graph is estimated using the structural similarity between the query graph and the user question. To this end, we use the Tree-LSTM-based model of the SQG tool adopted as the query building component. SQG estimates the syntactical similarity of a candidate query that it generates to the parse tree structure of the input question. To estimate the probability of the query graph, we normalize the similarity scores provided by SQG using the softmax function.
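Both normalizations named here are standard and can be sketched as:

```python
import math

def min_max_scale(scores):
    """Normalize component confidence scores into [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def softmax(scores):
    """Turn SQG similarity scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Min-max scaling keeps the relative order of the linker confidences, while the softmax guarantees that the query-graph scores sum to one, as required by the probabilistic model of Section 3.4.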

4.4 IQA User Interface

Figure 3: User interface of IQA adopted in the user study.

We implemented the IQA prototype as a web application. The user interface of IQA is exemplified in Figure 3. This interface was adopted in the user study described later in Section 5.4.2. In general, IQA accepts any user-defined question in a natural language. To enable a comparison of different approaches in the evaluation, during the user study we adopted a controlled set of questions selected from the LC-QuAD dataset (we elaborate on the dataset generation later in Section 5.1). In this section, we describe the user interface of IQA as it was presented to the users during the user study.

First, the user signs up in the system. Then, the user logs in and starts the user study, where a page similar to Figure 3 is presented. At the top of the interface, IQA displays the current question (#1). On the right-hand side, the top-ranked query is provided in its natural language representation (#2) (generated using SPARQL2NL Ngonga Ngomo et al. (2013)) and its SPARQL representation (#3). Using this part of the interface, the user can accept the top-ranked query (#4). Furthermore, if the user finds the presented question or the interaction options incomprehensible, the user can skip the question by choosing the corresponding reason and clicking the skip button (#5).

On the left-hand side, IQA provides the user with the current interaction option (#6). The interaction option is expressed as an inquiry (#6.1) along with a candidate answer (#6.2). The inquiry has the form “Does ’…’ refer to …?”, where ’…’ is a part of the original question. If applicable, a description and/or examples of usage of the interaction option (in case the interaction option represents a relation) are displayed (#6.3). The user can choose among the “yes”, “no”, and “I don’t know” answers to accept or reject the displayed interaction option (#6.4). The previously selected interaction options are listed below for the user’s reference (#7).

According to the user feedback, the interaction option and the top-ranked query are updated. The interaction continues until the user confirms the final semantic query or another termination criterion discussed in Section 3.5 is reached.

To collect the usability feedback, IQA shows a dialog to the user upon completion of each question. In this dialog, IQA asks the user to rate the ease of use of the system. The usability rating is conducted on the scale from one to five, with one being difficult to use and five being easy to use. Finally, IQA presents the user with the next question.

A demo version of the IQA system is publicly available online.

5 Evaluation Setup

The goal of the evaluation is to demonstrate that IQA is competitive compared to both state-of-the-art interactive baselines and non-interactive approaches in terms of the effectiveness, efficiency, and usability for questions of different complexity. In this section, we describe the datasets and methods adopted for the evaluation.

5.1 Knowledge Graph and Questions

We adopt LC-QuAD, an established dataset that contains complex questions for the evaluation of SQA systems Trivedi et al. (2017). Overall, the LC-QuAD dataset contains questions in four complexity categories, i.e., questions that include 2-5 named entities and relations in the corresponding semantic queries. Consequently, we use the DBpedia dataset version 2016-10 as the underlying knowledge graph to be compatible with the semantic queries in the LC-QuAD dataset.

To the best of our knowledge, Diefenbach et al. Diefenbach et al. (2019) provided the state-of-the-art results on the LC-QuAD dataset. Diefenbach et al. use a handcrafted vocabulary expansion to improve relation linking. This vocabulary is based on small parts of training data obtained from various Question Answering datasets, including SimpleQuestions and QALD-7. However, the authors did not clarify whether they use a portion of LC-QuAD to expand the vocabulary, as they do not provide any information regarding the train/test split for LC-QuAD. As the source code of Diefenbach et al. (2019) is not available, we used the online API provided by the authors to reproduce their results within each complexity category. We noticed that 2,789 out of 5,000 questions in LC-QuAD were not answerable due to the incompatibility between the DBpedia version used for the creation of LC-QuAD and the one used by the API. It was not possible to change the DBpedia version of the API; hence, to provide a fair comparison, we excluded the non-answerable questions and focused on the remaining 2,211 questions. On those questions, our computed F1 score for WDAqua is 0.438, which is comparable to their reported score of 0.46.

For the oracle-based evaluation, we use the same subset of 2,211 LC-QuAD questions that we used for the evaluation of WDAqua.

We refer to this LC-QuAD subset as Oracle Test Questions. Figure 4 illustrates the distribution of the questions across the different complexity categories in the Oracle Test Questions dataset. As we can observe, the majority of the questions are in the complexity categories from two to four.

Figure 4: Question complexity distribution in the Oracle Test Questions dataset. The X-Axis represents the complexity category. The Y-Axis represents the number of questions in the corresponding category.

For the user evaluation, we select questions for which the IQA pipeline realized in this article can generate the semantic query specified in the LC-QuAD dataset (i.e., this query is generated by the IQA pipeline, but is not necessarily top-ranked). From this set, we randomly sample a set of questions, such that the number of questions in each complexity category is balanced. We refer to the set of 90 questions adopted in the user evaluation as User Test Questions.

5.2 Evaluation Metrics

To assess the effectiveness, efficiency, and usability of the considered approaches, we adopt the metrics described in the following.

5.2.1 Effectiveness

To measure effectiveness, we choose Success Rate and the F1 score.

The Success Rate is the percentage of the questions in a dataset for which the SQA approach can generate the intended semantic query. Note, that in case an approach generates several candidates, the intended semantic query does not have to be top-ranked.

The F1 score is the harmonic mean of precision and recall. Here, the F1 score corresponds to the Success Rate at the top-1 result.
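The harmonic mean of precision and recall can be sketched as a small helper (`f1_score` is an illustrative name):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```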

5.2.2 Efficiency

To measure efficiency, we adopt Interaction Cost. We define the Interaction Cost as the number of interaction options that the users need to consider before they can identify the semantic query that correctly interprets the question. In the user evaluation, “identify” means that the user explicitly confirms the semantic query as correct. In the oracle-based evaluation of interaction, “identify” means that the semantic query ranked at top-1 at the specific interaction round corresponds to the query given in the LC-QuAD dataset.

In ranking-based approaches (e.g., in non-interactive baselines), the Interaction Cost is measured as the rank of the correct question interpretation, assuming that the user considers the semantic queries in their rank order.

Lower values of the Interaction Cost correspond to higher efficiency of an SQA system. An Interaction Cost of 1 corresponds to the case where the intended semantic query is immediately shown (ranked at top-1) and confirmed by the user.
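For ranking-based approaches, the Interaction Cost described above reduces to the rank of the intended query. A sketch, where `interaction_cost_ranked` is a hypothetical helper for illustration:

```python
def interaction_cost_ranked(ranked_queries, intended_query):
    """Interaction Cost of a ranking-based (non-interactive) approach:
    the 1-based rank at which the intended semantic query appears,
    assuming the user inspects candidates in rank order.
    Returns None if the intended query was not generated at all."""
    for rank, query in enumerate(ranked_queries, start=1):
        if query == intended_query:
            return rank
    return None
```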

5.2.3 Usability

To assess usability, we design a rating scheme in which users can provide their feedback on the ease of use on the scale from one to five, with one being difficult to use and five being easy to use.

5.3 Evaluated Approaches

In this work, we compare the performance of the SQA approaches and their configurations described in the following.

5.3.1 IQA Configurations

To assess the impact of the Option Gain proposed in this work as opposed to Information Gain, we compare two configurations of the proposed IQA approach: IQA-OG and IQA-IG.

In IQA-OG, the interaction options are selected based on their Option Gain. We set the parameter (see Equation 2) such that both the Information Gain and the usability of the options are taken into account equally.

IQA-IG is the interactive SQA method that takes only the Information Gain of the interaction options into account. In this case, we set the parameter accordingly (see Equation 2).
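As a hedged illustration only: the exact form of Option Gain is given by Equation 2, which is not reproduced here; the linear combination and the parameter name `alpha` below are assumptions made to convey the difference between the two configurations.

```python
def option_gain(information_gain, usability, alpha):
    """Illustrative trade-off between Information Gain and usability.
    `alpha` is a hypothetical stand-in for the parameter of Equation 2:
    alpha = 0.5 weighs both terms equally (an IQA-OG-like setting),
    while alpha = 1.0 considers Information Gain only (IQA-IG-like)."""
    return alpha * information_gain + (1 - alpha) * usability
```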

5.3.2 Baselines

To compare IQA to a state-of-the-art non-interactive SQA approach, we adopt NIB-WDAqua.

NIB-WDAqua: a Non-Interactive SQA Baseline using a state-of-the-art SQA approach. In this case, we take the state-of-the-art SQA approach “WDAqua-core1” Diefenbach et al. (2019) as a baseline. According to the recent evaluation on the Gerbil platform Usbeck et al. (2019), an SQA benchmarking system, “WDAqua-core1” shows the best performance on the LC-QuAD dataset adopted for the evaluation in this article. This baseline generates only one semantic query interpreting the user question. This query is provided by the authors of Diefenbach et al. (2019) through their API.

To demonstrate the performance of the proposed IQA pipeline in the non-interactive settings, we use NIB-IQA.

NIB-IQA: a Non-Interactive SQA Baseline using the IQA pipeline. This baseline represents the IQA pipeline running without interaction. We assume that the IQA pipeline runs entirely automatically and outputs a ranked list of semantic queries at the end, where each semantic query interprets the user question in a specific way. To compute the Interaction Cost for the NIB-IQA baseline, we assume that the user considers the semantic queries generated by the pipeline in their rank order. In this case, the Interaction Cost corresponds to the rank of the semantic query in the resulting list.

To demonstrate the performance of the proposed interaction scheme compared to an interactive baseline, we consider SIB.

SIB: a Simple Interactive Baseline. This baseline involves user interaction after the execution of each SQA pipeline component. We assume that each pipeline component outputs a ranked list of interaction options (e.g., nugget interpretations). The Interaction Cost of each pipeline component is the rank of the first interaction option generated by this component that leads to the intended semantic query. This option is passed as an input to the next pipeline component. The overall Interaction Cost of the pipeline is the sum of the Interaction Costs over all the pipeline components.
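The SIB cost computation can be sketched as follows; `leads_to_intended` is a hypothetical predicate standing in for the check whether an option leads to the intended semantic query:

```python
def sib_interaction_cost(components, leads_to_intended):
    """SIB baseline cost: for each pipeline component's ranked option
    list, find the 1-based rank of the first option that leads to the
    intended semantic query, and sum these ranks over all components."""
    total = 0
    for options in components:
        for rank, option in enumerate(options, start=1):
            if leads_to_intended(option):
                total += rank
                break
    return total
```

For example, if the correct entity is ranked second by the entity linker and the correct relation is ranked first by the relation linker, the overall cost is 2 + 1 = 3.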

5.4 Evaluation Settings

To assess the performance of IQA with respect to the evaluation metrics, facilitate comparison to the baselines and evaluate performance in the interaction involving human users, we performed an oracle-based evaluation and conducted a user study.

5.4.1 Oracle-Based Evaluation

To facilitate evaluation on an established large-scale dataset for Question Answering such as LC-QuAD, we adopt an oracle-based approach.

In particular, in the interaction process, we consider an interaction option to be correct if the selection of this option can lead to the construction of the semantic query specified in the LC-QuAD dataset. In the automatic evaluation, we simulate the user interaction process by letting the system automatically accept the first correct option suggested by the adopted SQA method. This corresponds to the assumption that the user would always select the correct option if this option is suggested by the system.
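The oracle simulation described above can be sketched as a small loop; `suggest_options` and `is_correct` are hypothetical callables standing in for the SQA pipeline's option generator and the comparison against the LC-QuAD gold query:

```python
def oracle_simulation(suggest_options, is_correct, max_rounds=20):
    """Simulate the oracle user: in each interaction round, scan the
    suggested interaction options and accept the first correct one;
    stop when a round yields no correct option or the round limit is
    reached. Returns the list of accepted options."""
    accepted = []
    for _ in range(max_rounds):
        options = suggest_options(accepted)
        chosen = next((o for o in options if is_correct(o)), None)
        if chosen is None:
            break
        accepted.append(chosen)
    return accepted
```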

5.4.2 User Study

To better understand the impact of the proposed Option Gain metric on the effectiveness, efficiency, and usability of the IQA scheme (IQA-OG) in comparison to the interaction based on Information Gain (IQA-IG) when involving human users, we conducted a user study.

To enable evaluation of the proposed approach in the controlled settings, we adopted a homogeneous user group with 15 post-graduate computer science students. We envision evaluation with other user groups to be an important part of the future research.

At the beginning of the study, the authors of the article briefly introduced the users to the IQA system. During the study, each user evaluated 12 questions on average (3 questions in each of the 4 complexity categories). On average, users spent 30 minutes to complete the study. For the configuration of the user study, the following rules were applied:

  • To facilitate a comparison of the methods, each question is evaluated using two IQA configurations: IQA-OG and IQA-IG.

  • During the study, each user interacts with the system using one fixed interaction configuration, either IQA-OG or IQA-IG.

  • The user does not receive the same question twice.

  • The user can mark a question as incomprehensible. Any question marked by a user is removed from the User Test Questions set.

    The remaining set of User Test Questions contains 80 questions.

Figure 3 illustrates the user interface of IQA adopted in the user study with an example question from the User Test Questions set.

User study results are discussed in Section 6.2.

5.5 Reproducibility

To support the reproducibility of results and facilitate further research, we make the software and the data adopted in the evaluation available. The source code of the interactive query construction is available in our GitHub repository. Similarly, the source code of the MDP-Parser, SQG, as well as EARL is available on GitHub. Furthermore, the experimental results of the oracle-based evaluation are provided in our GitHub repository.

6 Evaluation Results

In this section, we present the results of the oracle-based evaluation and the user study.

6.1 Oracle-based Evaluation Results

We assess the effectiveness and efficiency of the proposed IQA approach on an established large-scale LC-QuAD dataset using the oracle-based evaluation.

6.1.1 Effectiveness Results in the Oracle-based Evaluation

Figure 5 presents the Success Rate of the non-interactive baselines. The NIB-WDAqua baseline that represents a state-of-the-art Semantic Question Answering approach Diefenbach et al. (2019) generates only the top-1 semantic query. The NIB-IQA baseline, i.e., a non-interactive version of the proposed IQA approach, generates multiple candidate semantic queries. With NIB-IQA-Top-1, we consider only the top-1 query generated by the NIB-IQA baseline.

As we can observe in Figure 5, NIB-IQA outperforms the NIB-WDAqua baseline in terms of Success Rate in all complexity categories. Whereas the NIB-WDAqua outperforms the NIB-IQA with respect to the top-1 query (i.e., the NIB-IQA-Top1 baseline), the overall Success Rate of NIB-IQA is higher than the Success Rate of the NIB-WDAqua. This is because NIB-IQA generates multiple relevant question interpretations, whereas NIB-WDAqua does not provide such functionality and returns only one top-ranked query.

As expected, we can observe that the overall performance of all non-interactive question answering pipelines degrades with the increasing complexity of the questions in the categories 2-4. A special case is the Success Rate in the complexity category 5, where the questions follow a similar pattern, which makes it relatively easy for all considered SQA systems to construct the corresponding semantic query.

As we can observe in Figure 5, in the complexity category 2, 68% of the queries are answerable by NIB-IQA (i.e., the intended query is constructed by the IQA pipeline), whereas this query is ranked as top-1 (NIB-IQA-Top1) only in 52% of the cases. Overall, the difference between the NIB-IQA and NIB-IQA-Top1 is 16.7 percentage points on average across the complexity categories.

The approach proposed in this article fills this gap, such that the difference between NIB-IQA and NIB-IQA-Top1 is reduced through interaction (as demonstrated later in the results of the oracle-based evaluation in Section 6.1.2 and the discussion of the user study presented in Section 6.2). That is, with interaction, the Success Rate of NIB-IQA-Top1 increases and can reach the Success Rate of NIB-IQA, outperforming the NIB-WDAqua baseline.

Figure 5: Success Rate of the non-interactive baselines NIB-IQA, NIB-IQA-Top-1 and NIB-WDAqua for the questions in the Oracle Test Questions dataset. The X-Axis represents the complexity category. The Y-Axis represents the Success Rate.
(a) F1 score for the questions with complexity 2
(b) F1 score for the questions with complexity 3
(c) F1 score for the questions with complexity 4
(d) F1 score for the questions with complexity 5
Figure 6: Increase in the F1 score during the interaction process in the oracle-based evaluation. The X-Axis represents the number of interactions on a log scale. The Y-Axis represents the F1 score.

Figure 6 shows the F1 score obtained using the different methods and the evolution of the F1 score during the interaction process, achieved due to the reduction of the question interpretation space. The X-Axis represents the number of interactions on a log scale. The Y-Axis represents the F1 score. We show the results for the questions of different complexity in separate sub-figures of Figure 6.

The baseline method NIB-WDAqua conducts only one interaction with the user, i.e., it generates the top-1 semantic query that interprets the question Diefenbach et al. (2019). This semantic query remains unchanged during the interaction process (the API of Diefenbach et al. (2019) does not provide any other interpretations); therefore, the result of the NIB-WDAqua baseline is represented as a straight line in Figure 6.

As expected, given the results presented above, the NIB-WDAqua baseline shows the best results at the very beginning of the interaction process in categories 3-5. However, after a few interactions, the NIB-WDAqua baseline is outperformed by other approaches in all complexity categories.

The interactive configurations IQA-OG and IQA-IG of the proposed approach demonstrate similar performance.

The SIB interactive baseline shows the worst performance across the approaches presented in Figure 6 in all complexity categories. SIB implements an extensive interaction strategy and requests user feedback at every pipeline step. This result confirms our intuition that interaction alone is not sufficient to construct the intended question interpretation efficiently. The significant differences between SIB and the informed interaction strategy of IQA (reflected by IQA-OG and IQA-IG) highlight the clear advantage of our proposed approach in comparison to this baseline.

6.1.2 Efficiency Results in the Oracle-based Evaluation

Figure 7 presents the Interaction Cost and the standard deviation of the considered approaches in the different complexity categories in the oracle-based evaluation over the Oracle Test Questions dataset.

As we can observe in Figure 7, IQA-IG and IQA-OG have a significantly lower Interaction Cost compared to the NIB-IQA and SIB baselines. The Interaction Costs of IQA-OG and IQA-IG in the oracle-based settings are equivalent. This result demonstrates that an interactive approach based on Option Gain or Information Gain can significantly reduce the Interaction Cost compared to the baselines. This result also illustrates that although multiple outputs as produced by the NIB-IQA baseline can facilitate interaction, if taken without further optimization, such multiple outputs are not sufficient to effectively reduce the Interaction Cost.

Figure 7: Interaction Cost and std. deviation of different approaches in the oracle-based evaluation. The X-Axis represents the complexity category of the question. The Y-Axis represents the Interaction Cost. The Y-Axis is logarithmic. The bars represent the results of the proposed interactive approaches IQA-IG and IQA-OG as well as of the baselines NIB-IQA and SIB.

6.2 User Study Results

The goal of the user study is to assess the performance of IQA-OG and IQA-IG approaches in terms of their efficiency, usability, and effectiveness in the interaction involving human users. In this section, we present the results of the user study.

6.2.1 Efficiency

We measure the efficiency of interaction using Interaction Cost. Figure 8 presents the Interaction Cost observed in the user evaluation for the questions of different complexity while using IQA-OG and IQA-IG configurations of the proposed approach.

Overall, the Interaction Cost of both IQA-OG and IQA-IG is relatively low, with 3.8 interactions on average for IQA-IG and 3.6 for IQA-OG. As we can observe in Figure 8, both approaches indicate slight variations. However, the results of the paired t-test show that these differences are not statistically significant. We conclude that both methods, IQA-OG and IQA-IG, are equivalent in terms of efficiency.
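The paired t statistic used in such a comparison can be computed from the per-question cost differences; the following is an illustrative sketch (the study presumably relied on a standard statistics package), with significance then judged against the t distribution with n - 1 degrees of freedom:

```python
import math
from statistics import mean, stdev

def paired_t_statistic(costs_a, costs_b):
    """t statistic of a paired t-test on per-question Interaction Costs
    of two configurations measured on the same questions."""
    diffs = [a - b for a, b in zip(costs_a, costs_b)]
    n = len(diffs)
    # mean difference divided by the standard error of the differences
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))
```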

Compared to the results of the oracle-based evaluation, the Interaction Cost observed in the user study is slightly higher. The average Interaction Cost in the oracle-based evaluation presented in Figure 7 is 1.9-2.0, whereas, in the user study, we observed 3.6-3.8 interactions on average. This is because, in comparison to the oracle-based setting, the users do not always immediately confirm the top-ranked query once it is shown, but may continue the interaction process.

(a) IQA-IG
(b) IQA-OG
Figure 8: Interaction Cost of IQA-IG and IQA-OG in the user study in a boxplot representation. The X-Axis represents the complexity category. The Y-Axis represents the Interaction Cost.
(a) IQA-IG
(b) IQA-OG
Figure 9: User rating on IQA usability in a boxplot representation. Average rating of IQA-IG=4.13; average rating of IQA-OG=4.40.

6.2.2 Usability

Figure 9 presents the usability results of IQA-IG and IQA-OG computed from the user ratings. The average user rating is 4.13 for IQA-IG and 4.40 for IQA-OG. According to the paired t-test, this difference is statistically significant. As we can observe, the ratings obtained by IQA-IG are not only lower on average but also indicate much higher variation. We conclude that IQA-OG outperforms IQA-IG with respect to ease of use.

6.2.3 Effectiveness

We assess the effectiveness of the interaction scheme in the user evaluation as the accuracy in the construction of the intended semantic queries.

As discussed in Section 5.4.2, to complete the interaction process for each question, the user had to explicitly confirm if the constructed query correctly reflected the intention of the question. The query confirmed by the user can be different from the semantic query specified in the LC-QuAD dataset. In this section, we discuss the observed deviations between the queries confirmed by the users and the queries specified in the LC-QuAD dataset.

Figures 9(a) and 9(b) present the ratio of questions in different complexity categories that are: 1) confirmed by the users as correct (Conf-U), and 2) confirmed by the users as correct and also exactly correspond to the semantic query in the LC-QuAD dataset (Conf-B). We present these statistics for the IQA-OG and IQA-IG configurations.

As we can observe in Figures 9(a) and 9(b), the users have confirmed semantic queries that were not contained in the LC-QuAD dataset in all complexity categories, whereas the differences between Conf-U and Conf-B are much smaller for IQA-OG. Note that Conf-B directly corresponds to the F1 score presented in Figure 9(c).

Figure 9(c) indicates that the queries constructed using IQA-OG are more accurate, which is likely because the interaction options adopted by this approach are easier for users to understand. The average percentage of queries constructed by the users and confirmed by the LC-QuAD dataset is 62.0% for IQA-IG and 72.2% for IQA-OG. We observe that IQA-OG consistently outperforms IQA-IG in all complexity categories, with an average improvement of 10 percentage points in the F1 score.

This observation again indicates that IQA-OG that takes usability of the options into account can facilitate more effective user interaction than an interaction approach based solely on the Information Gain.

Overall, compared to IQA-IG, IQA-OG leads to a more intuitive user interaction that enables users to answer the questions more effectively within the same number of interactions.

(a) IQA-IG
(b) IQA-OG
(c) F1 score of IQA-IG and IQA-OG
Figure 10: Accuracy of the user judgments vs. the LC-QuAD dataset. The X-Axis represents query complexity. In 9(a) and 9(b), the Y-Axis represents the ratio of questions for which the semantic query was confirmed by the user (Conf-U) and the ratio of queries equivalent to the LC-QuAD dataset (Conf-B), obtained using IQA-IG and IQA-OG. In 9(c), the Y-Axis represents the F1 score achieved by the users using the IQA-IG and IQA-OG configurations.

Figure 11 depicts the F1 scores achieved on the User Test Questions by the different approaches. The IQA-IG and IQA-OG scores correspond to the user study results. NIB-WDAqua and NIB-IQA-Top1 are the baseline results achieved on the same dataset. As we can observe, the proposed interactive approach outperforms the best-performing non-interactive baseline NIB-WDAqua with respect to the F1 scores in all complexity categories. The average F1 score of IQA-IG is 0.62, an increase of 10 percentage points compared to the NIB-WDAqua baseline, which obtains an average F1 score of 0.52 on this dataset. With IQA-OG, we achieve an average F1 score of 0.72, which is 20 percentage points higher than the F1 score of the NIB-WDAqua baseline.

Figure 11: The X-Axis represents query complexity. The Y-Axis represents the F1 score achieved by different approaches on the User Test Questions. IQA-IG and IQA-OG correspond to the user study results.

6.2.4 Error Analysis

As for the failed questions, on average, 11% were rejected by the users due to incomprehensible questions or interaction options, whereas 15% failed as the users did not confirm the semantic query resulting from the interaction process.

To better understand the differences between the queries constructed and accepted by the users and the semantic queries in the LC-QuAD dataset, we conducted a manual inspection of all results where such deviation occurred. Overall, we observed several reasons for deviations, including:

  • The LC-QuAD interpretation is too restrictive: There exist several possible semantic interpretations for a question, and LC-QuAD only includes one such interpretation. For example, this can be observed in the case of synonymous relations, or inclusion/omission of the rdf:type statements in the semantic query that do not affect the results.

  • The user makes a mistake or fails to understand the specific differences between the intended interpretation and the interpretation suggested by the system. For example, this can happen in case of similar entities, or a wrong interpretation of the relation direction by the user.

  • The user selects a different answer type. For example, the user can accept a SELECT query instead of an ASK query specified in LC-QuAD.

We provide an overview of the typical differences, their frequency and the corresponding examples in Table 2. As we can observe, the most frequent reasons for the deviations are the synonymous relations (R1, in 43.4%), wrong relations (R2, in 19.5%), and the differences in the answer types (R3, in 19.5%).


Reason | Difference                 | %    | Example
R1     | Synonymous relations       | 43.4 | Q: Name the home stadium of FC Spartak Moscow?
       |                            |      | dbp:stadium vs. dbo:homeStadium
R1     | Completeness of the        |  8.6 | Q: Miguel de Cervantes wrote the musical extended from which book?
       | semantic query             |      | SELECT ?u WHERE { ?u dbo:author dbr:Miguel_de_Cervantes } vs.
       |                            |      | SELECT ?u WHERE { ?u dbo:author dbr:Miguel_de_Cervantes .
       |                            |      |                   ?u rdf:type dbo:Book }
R2     | Similar entities           |  4.5 | Q: In which state is Red Willow Creek?
       |                            |      | dbr:Willow_Creek_mine vs. dbr:Red_Willow_Creek
R2     | Wrong relation             | 19.5 | Q: List the producer of the TV shows whose company is HBO.
       |                            |      | dbo:distributor vs. dbo:company
R2     | Structural differences in  |  4.5 | Q: Who are the predecessors of John Randolph of Roanoke?
       | the semantic query         |      | SELECT ?u WHERE { dbr:John_Randolph_of_Roanoke dbp:predecessor ?u } vs.
       |                            |      | SELECT ?u WHERE { ?u dbp:predecessor dbr:John_Randolph_of_Roanoke }
R3     | Differences in the         | 19.5 | SELECT ?u WHERE … vs. ASK WHERE …
       | answer type                |      |
Table 2: Differences between the user interpretations and the LC-QuAD dataset.

6.2.5 User Feedback

After the evaluation session, we requested the users to provide unstructured feedback regarding any issues they observed or comments they had.

Overall, the users reported a positive experience with the IQA system. The typical issues reported by the users included the occasionally unclear formulation of the questions in the LC-QuAD dataset, the understandability of interaction options in some categories, and the natural language formulation of complex SPARQL queries.

As reported by the users, the LC-QuAD dataset contains some questions with linguistic issues. In cases where these issues affected the understandability of questions, the users could skip the question, as mentioned above. We consider such questions as failed in our results.

The users also reported occasional difficulties in understanding the semantics of some of the interaction options, in particular concerning the options representing relations and question types. This observation confirms our assumption used as a basis for the Option Gain computation that the usability of different interaction option types varies.

Finally, the users reported that some of the natural language representations of the SPARQL queries, especially in the context of the more complex questions, were difficult to understand. The generation of the natural language representations for the user interface is not in the scope of this work; in the IQA prototype implementation, we generated such representations using state-of-the-art tools. However, this observation indicates the need for future work in this area.

7 Related Work

Interactive methods to obtain user feedback have been adopted in Semantic Question Answering systems as well as in keyword search and natural language interfaces for structured data. In this section, we briefly summarize the differences between IQA and these approaches.

7.1 Interactive Keyword Search over Relational Data

In our previous work, we proposed FreeQ, an interactive keyword search approach for relational databases Demidova et al. (2013b); Demidova et al. (2012). FreeQ generates interaction options using a relational database schema and a mapping between the schema and an external ontology (utilizing, e.g., YAGO+F Demidova et al. (2013a)). User interaction in FreeQ is based on Information Gain. Whereas IQA builds upon our previous work in the area of interactive keyword search, in this article, we target the more complex problem of Semantic Question Answering. The input questions are more complex than the keyword queries supported by FreeQ, and so are the corresponding SQA pipelines. IQA addresses these challenges through a novel interaction scheme dedicated to Semantic Question Answering. In particular, in IQA, we developed an interaction scheme for generic Semantic Question Answering pipelines. Furthermore, we introduced the notion of Option Gain, which takes the usability of interaction options into account. As our evaluation demonstrates, these contributions lead to significant improvements in terms of usability and effectiveness, while maintaining a low interaction cost.

7.2 Semantic Question Answering

Semantic Question Answering over knowledge graphs is a difficult problem Höffner et al. (2017b); Diefenbach et al. (2017). Although SQA systems over simple questions have improved in recent years Lukovnikov et al. (2017); Bordes et al. (2015), solving complex questions Dubey et al. (2016); Yih et al. (2015); Trivedi et al. (2017) remains a difficult task. For example, the “WDAqua-core1” system Diefenbach et al. (2019), currently the best performing system on the LC-QuAD dataset containing complex queries, only achieves an F1 score of approximately 0.46. SQA systems usually suffer a performance loss due to wrong interpretations during the entity linking Dubey et al. (2018); Hasibi et al. (2016), relation linking, and query building Zafar et al. (2018) stages. These systems are typically optimized to produce one intended interpretation. In contrast to IQA, such systems do not support user feedback to refine their results.

7.3 Interactive Question Answering Systems

Existing SQA and search systems over knowledge graphs employ user feedback and additional input to improve disambiguation of the questions directly, or to generate training data. For example, Exemplar Queries Mottin et al. (2014) employs a user query as an example to search for similar structures. Su et al. Su et al. (2015) exploit relevance feedback to tune ranking functions in knowledge graph search. GQBE Jayaram et al. (2015) takes a question and an example relation as input and searches for similar graph patterns. Zheng et al. Zheng et al. (2017) conduct interactive graph search and let users verify the ambiguities in entity linking, relation linking, and query building. IMPROVE-QA Zhang and Zou (2018) asks the users to correct the output on training questions to improve the relation linking and query building processes by learning from user interaction. In contrast to existing SQA systems that adopt interaction, IQA explicitly addresses the usability aspects of user interaction through Option Gain and utilizes a broader range of interaction options.

7.4 Other Interactive Approaches using Knowledge Graphs

Sparklis Ferré (2017) is an exploration-based approach that allows users to build SPARQL queries interactively. In contrast, IQA is a Semantic Question Answering approach that adopts interaction for the disambiguation of user questions. EventKG+TL facilitates interactive generation of multilingual event timelines from a knowledge graph Gottschalk and Demidova (2018b). Conversational approaches such as CuriousCat Bradesko et al. (2017) provide another type of interaction. These approaches address other objectives, such as knowledge acquisition in a dialog.

7.5 Interactive Semantic Parsing

Several works on interactive semantic parsing adopt user feedback as a training signal to resolve utterance ambiguity and enhance parsing accuracy. These approaches translate natural language into formal domain-specific representations, including database queries Li and Jagadish (2014), API calls Su et al. (2018), and If-Then programs Yao et al. (2019a). Semantic parsing approaches that translate natural language into SQL queries for relational databases, e.g. Li and Jagadish (2014); Gur et al. (2018); Yao et al. (2019b), are the most closely related to our work. Approaches in this area are typically limited to rather small database schemas or simple query patterns. For example, Li and Jagadish (2014) evaluate on the Microsoft Academic Search (MAS) dataset, which includes only eight relations. DialSQL Gur et al. (2018) and MISP Yao et al. (2019b) adopt the WikiSQL dataset, which contains rather simple queries. In contrast, interactive SQA systems such as IQA aim to generate semantic queries for knowledge graphs that are much larger in scale, including thousands of concepts and relations, while enabling complex queries. This large scale poses additional challenges concerning scalability and interaction cost. Furthermore, approaches to interactive semantic parsing in databases invoke interaction based on ambiguity Li and Jagadish (2014) or error detection Yao et al. (2019b), and do not address usability aspects.

8 Conclusion

In this article, we presented IQA - a novel interactive approach to Semantic Question Answering. We formalized the concept of a Semantic Question Answering pipeline and proposed a novel probabilistic user interaction scheme. This scheme helps the user effectively identify the intended semantic query while increasing the usability of interaction and minimizing the interaction cost. Interaction options utilized by IQA belong to several categories, including interpretations of entities and relations, superclasses and types of entities, answer types, and semantic queries. In the interaction process, these options are selected based on their Option Gain, which takes into account both the usability and the efficiency of the options.
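The selection of interaction options described above can be illustrated with a minimal sketch. The exact combination of efficiency and usability used by IQA is defined earlier in the article; here we assume, for illustration only, that Option Gain weights the binary information gain of confirming an option by a usability score, and all option names, probabilities, and weights below are hypothetical:

```python
import math

def information_gain(p):
    """Binary entropy of asking the user to confirm an option that is
    correct with probability p (maximal at p = 0.5)."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def option_gain(p, usability):
    """Illustrative combination: information gain weighted by a
    usability score in (0, 1] reflecting how easy the option is to judge."""
    return information_gain(p) * usability

# Hypothetical candidate interaction options with pipeline confidences.
options = [
    {"label": "entity: dbr:Bonn", "p": 0.5, "usability": 0.9},
    {"label": "full SPARQL query", "p": 0.5, "usability": 0.3},
]

# At equal information gain, the simpler (more usable) option is shown first.
best = max(options, key=lambda o: option_gain(o["p"], o["usability"]))
print(best["label"])  # → entity: dbr:Bonn
```

Under this weighting, an easily judged option such as an entity interpretation is preferred over a full semantic query even when both would split the candidate space equally, which mirrors the intuition behind IQA-OG.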

To evaluate the effectiveness, efficiency, and usability of the proposed user interaction scheme, we conducted an extensive oracle-based experimental evaluation and a user study. Our experimental results over LC-QuAD, an established dataset for the assessment of SQA systems, demonstrate that IQA can significantly increase the effectiveness of SQA for complex questions while maintaining high usability of interaction and incurring only a small interaction cost.

We observed that the IQA-OG interaction strategy, based on Option Gain, leads to higher user satisfaction than IQA-IG, which is optimized for efficiency only. Furthermore, IQA-OG leads to higher effectiveness of the user interaction, as reflected by a higher ratio of successfully constructed semantic queries; its score outperforms the interaction strategy based on Information Gain by ten percentage points. Compared to the non-interactive baselines, IQA-OG achieves up to 20 percentage points improvement on the subset of LC-QuAD utilized in the user evaluation.

We believe that this improvement is due to the less complex and thus more understandable interaction options adopted by IQA-OG, which help to reduce potential errors.

In principle, the IQA interaction scheme is applicable on top of any Semantic Question Answering pipeline that realizes the generic architecture formalized in this article. In particular, we support variations of SQA pipelines in the linking step: there can be either a single joint linking step for entities and relations or multiple individual linking steps. This way, the IQA interaction approach can be applied to a broad range of existing SQA frameworks.

In our future work, we plan to further develop the proposed approach to better support user interaction in multilingual settings.


  • S. Bird, E. Klein, and E. Loper (2009) Natural language processing with python: analyzing text with the natural language toolkit. O'Reilly Media, Inc. Cited by: §4.1.1.
  • A. Bordes, N. Usunier, S. Chopra, and J. Weston (2015) Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075. Cited by: §7.2.
  • L. Bradesko, M. J. Witbrock, J. Starc, Z. Herga, M. Grobelnik, and D. Mladenic (2017) Curious cat-mobile, context-aware conversational crowdsourcing knowledge acquisition. ACM Trans. Inf. Syst. 35 (4), pp. 33:1–33:46. Cited by: §7.4.
  • P. Christen (2006) A comparison of personal name matching: techniques and practical issues. In Sixth IEEE International Conference on Data Mining-Workshops (ICDMW’06), pp. 290–294. Cited by: item C1.
  • R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa (2011) Natural language processing (almost) from scratch. Journal of Machine Learning Research 12 (Aug). Cited by: §4.1.1.
  • E. Demidova, I. Oelze, and W. Nejdl (2013a) Aligning freebase with the YAGO ontology. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, CIKM’13, pp. 579–588. Cited by: §7.1.
  • E. Demidova, X. Zhou, and W. Nejdl (2012) A probabilistic scheme for keyword-based incremental query construction. IEEE Trans. on Knowl. and Data Eng. 24 (3), pp. 426–439. External Links: ISSN 1041-4347 Cited by: §7.1.
  • E. Demidova, X. Zhou, and W. Nejdl (2012) FreeQ: an interactive query interface for freebase. In Proceedings of the 21st World Wide Web Conference, WWW 2012, Lyon, France, April 16-20, 2012 (Companion Volume), pp. 325–328. Cited by: §7.1.
  • E. Demidova, X. Zhou, and W. Nejdl (2013b) Efficient query construction for large scale data. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’13, pp. 573–582. Cited by: §1, §3.3, §7.1.
  • D. Diefenbach, A. Both, K. D. Singh, and P. Maret (2019) Towards a question answering system over the semantic web. Semantic Web Pre-press. Cited by: §4.1, §5.1, §5.3.2, §6.1.1, §6.1.1, §7.2.
  • D. Diefenbach, V. Lopez, K. Singh, and P. Maret (2017) Core techniques of question answering systems over knowledge bases: a survey. Knowledge and Information systems, pp. 1–41. Cited by: §7.2.
  • M. Dubey, D. Banerjee, D. Chaudhuri, and J. Lehmann (2018) EARL: joint entity and relation linking for question answering over knowledge graphs. In International Semantic Web Conference, Cited by: §4.2, §7.2.
  • M. Dubey, S. Dasgupta, A. Sharma, K. Höffner, and J. Lehmann (2016) Asknow: a framework for natural language query formalization in sparql. In International Semantic Web Conference, pp. 300–316. Cited by: §7.2.
  • S. Ferré (2017) Sparklis: an expressive query builder for SPARQL endpoints with guidance in natural language. Semantic Web 8 (3), pp. 405–418. Cited by: §7.4.
  • S. Gottschalk and E. Demidova (2018a) EventKG: A multilingual event-centric temporal knowledge graph. In Proceedings of the 15th International ESWC Conference, pp. 272–287. Cited by: §1.
  • S. Gottschalk and E. Demidova (2018b) EventKG+tl: creating cross-lingual timelines from an event-centric knowledge graph. In The Semantic Web: ESWC 2018 Satellite Events, Lecture Notes in Computer Science, Vol. 11155, pp. 164–169. Cited by: §7.4.
  • S. Gottschalk and E. Demidova (2019) EventKG - the Hub of Event Knowledge on the Web - and Biographical Timeline Generation. Semantic Web 10 (6), pp. 1039–1070. Cited by: §1.
  • I. Gur, S. Yavuz, Y. Su, and X. Yan (2018) DialSQL: dialogue based structured query generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Cited by: §7.5.
  • F. Hasibi, K. Balog, and S. E. Bratsberg (2016) On the reproducibility of the tagme entity linking system. In Proceedings of 38th European Conference on Information Retrieval, ECIR ’16, pp. 436–449. Cited by: §4.2, §7.2.
  • J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum (2013) YAGO2: A spatially and temporally enhanced knowledge base from wikipedia. Artif. Intell. 194, pp. 28–61. Cited by: §1.
  • K. Höffner, S. Walter, E. Marx, R. Usbeck, J. Lehmann, and A. N. Ngomo (2017a) Survey on challenges of question answering in the semantic web. Semantic Web 8 (6), pp. 895–920. Cited by: §1.
  • K. Höffner, S. Walter, E. Marx, R. Usbeck, J. Lehmann, and A. Ngonga Ngomo (2017b) Survey on challenges of question answering in the semantic web. Semantic Web 8 (6), pp. 895–920. Cited by: §7.2.
  • N. Jayaram, A. Khan, C. Li, X. Yan, and R. Elmasri (2015) Querying knowledge graphs by example entity tuples. IEEE Transactions on Knowledge and Data Engineering 27 (10), pp. 2797–2811. External Links: ISSN 1041-4347 Cited by: §7.3.
  • J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer (2015) DBpedia - A large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web 6 (2), pp. 167–195. Cited by: §1.
  • F. Li and H. V. Jagadish (2014) Constructing an interactive natural language interface for relational databases. Proc. VLDB Endow. 8 (1), pp. 73–84. External Links: ISSN 2150-8097 Cited by: §7.5.
  • S. Lomax and S. Vadera (2013) A survey of cost-sensitive decision tree induction algorithms. ACM Computing Surveys (CSUR) 45 (2), pp. 16. Cited by: §3.
  • D. Lukovnikov, A. Fischer, J. Lehmann, and S. Auer (2017) Neural network-based question answering over knowledge graphs on word and character level. In Proceedings of the 26th international conference on World Wide Web, pp. 1211–1220. Cited by: §7.2.
  • D. Mottin, M. Lissandrini, Y. Velegrakis, and T. Palpanas (2014) Exemplar queries: give me an example of what you need. Proc. VLDB Endow. 7 (5), pp. 365–376. Cited by: §7.3.
  • A. Ngonga Ngomo, L. Bühmann, C. Unger, J. Lehmann, and D. Gerber (2013) SPARQL2NL: verbalizing sparql queries. In Proceedings of the 22nd International Conference on World Wide Web, pp. 329–332. Cited by: §4.4.
  • H. Paulheim (2017) Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web 8 (3). Cited by: §1.
  • Y. Su, A. H. Awadallah, M. Wang, and R. W. White (2018) Natural language interfaces with fine-grained user interaction: A case study on web apis. In Proceedings of the SIGIR 2018, Cited by: §7.5.
  • Y. Su, S. Yang, H. Sun, M. Srivatsa, S. Kase, M. Vanni, and X. Yan (2015) Exploiting relevance feedback in knowledge graph search. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pp. 1135–1144. Cited by: §7.3.
  • P. Trivedi, G. Maheshwari, M. Dubey, and J. Lehmann (2017) LC-QuAD: A Corpus for Complex Question Answering over Knowledge Graphs. In The Semantic Web – ISWC 2017, pp. 210–218. Cited by: §1, §5.1, §7.2.
  • R. Usbeck, M. Röder, M. Hoffmann, F. Conrads, J. Huthmann, A. Ngonga-Ngomo, C. Demmler, and C. Unger (2019) Benchmarking question answering systems. Semantic Web (Preprint). Cited by: §5.3.2.
  • D. Vrandecic and M. Krötzsch (2014) Wikidata: a free collaborative knowledgebase. Commun. ACM 57 (10), pp. 78–85. Cited by: §1.
  • Z. Yao, X. Li, J. Gao, B. Sadler, and H. Sun (2019a) Interactive semantic parsing for if-then recipes via hierarchical reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 2547–2554. Cited by: §7.5.
  • Z. Yao, Y. Su, H. Sun, and W. Yih (2019b) Model-based interactive semantic parsing: a unified framework and a text-to-SQL case study. pp. 5446–5457. Cited by: §7.5.
  • S. W. Yih, M. Chang, X. He, and J. Gao (2015) Semantic parsing via staged query graph generation: question answering with knowledge base. Cited by: §7.2.
  • H. Zafar, G. Napolitano, and J. Lehmann (2018) Formal query generation for question answering over knowledge bases. In ESWC 2018, pp. 714–728. Cited by: §4.2.2, §7.2.
  • H. Zafar, M. Tavakol, and J. Lehmann (2020) Distantly supervised question parsing. In Proceedings of the Twenty-forth European Conference on Artificial Intelligence, Cited by: §4.1.1.
  • X. Zhang and L. Zou (2018) IMPROVE-qa: an interactive mechanism for rdf question/answering systems. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD ’18, pp. 1753–1756. Cited by: §7.3.
  • W. Zheng, H. Cheng, L. Zou, J. X. Yu, and K. Zhao (2017) Natural language question/answering: let users talk with the knowledge graph. In Proc. of the ACM CIKM 2017, pp. 217–226. Cited by: §7.3.