Message Passing for Complex Question Answering over Knowledge Graphs

08/19/2019 ∙ by Svitlana Vakulenko, et al. ∙ WU (Vienna University of Economics and Business) ∙ University of Amsterdam

Question answering over knowledge graphs (KGQA) has evolved from simple single-fact questions to complex questions that require graph traversal and aggregation. We propose a novel approach for complex KGQA that uses unsupervised message passing, which propagates confidence scores obtained by parsing an input question and matching terms in the knowledge graph to a set of possible answers. First, we identify entity, relationship, and class names mentioned in a natural language question, and map these to their counterparts in the graph. Then, the confidence scores of these mappings propagate through the graph structure to locate the answer entities. Finally, these are aggregated depending on the identified question type. This approach can be efficiently implemented as a series of sparse matrix multiplications mimicking joins over small local subgraphs. Our evaluation results show that the proposed approach outperforms the state-of-the-art on the LC-QuAD benchmark. Moreover, we show that the performance of the approach depends only on the quality of the question interpretation results, i.e., given a correct relevance score distribution, our approach always produces a correct answer ranking. Our error analysis reveals correct answers missing from the benchmark dataset and inconsistencies in the DBpedia knowledge graph. Finally, we provide a comprehensive evaluation of the proposed approach accompanied with an ablation study and an error analysis, which showcase the pitfalls for each of the question answering components in more detail.




1. Introduction

The amount of data shared on the Web grows every day (Gandomi and Haider, 2015). Information retrieval systems are very efficient, but their representation power is limited by the underlying data structure, which relies on an index over a single database table, i.e., a homogeneous collection of textual documents that share the same set of attributes, e.g., web pages or news articles (Manning et al., 2010). Knowledge graphs (KGs), i.e., graph-structured knowledge bases, such as DBpedia (Lehmann et al., 2015) or Wikidata (Vrandecic and Krötzsch, 2014), can interlink datasets with completely different schemas (Bonatti et al., 2018). Moreover, SPARQL is a very expressive query language that allows us to retrieve data from a KG that matches specified graph patterns (Harris and Seaborne, 2013). However, query formulation in SPARQL is hard in practice, since it requires knowing which datasets to access, as well as their vocabulary and structure (Freitas et al., 2012). Natural language interfaces can mitigate these issues, making data access more intuitive and available to the majority of lay users (Hendrix, 1982; Kaufmann and Bernstein, 2007). One of the core functionalities of such interfaces is question answering (QA), which goes beyond keyword or boolean queries but does not require knowledge of a specialized query language (Unger et al., 2014).

QA systems have been evolving since the early 1960s with early efforts in the database community to support natural language queries by translating them into structured queries (see, e.g., Green et al., 1963; Woods, 1977; Bronnenberg et al., 1980). Whereas a lot of recent work has considered answering questions using unstructured text corpora (Rajpurkar et al., 2016) or images (Goyal et al., 2017), we consider the task of answering questions using information stored in KGs. KGs are an important information source as an intermediate representation to integrate information from different sources and different modalities, such as images and text (de Faria et al., 2018). The resulting models are at the same time abstract, compact, and interpretable (Wilcke et al., 2017).

Question answering over knowledge graphs (KGQA) requires matching an input question to a subgraph, in the simplest case matching a single labeled edge (triple) in the KG, a task also called simple question answering (Bordes et al., 2015). The task of complex question answering is defined in contrast to simple KGQA and requires matching more than one triple in the KG (Trivedi et al., 2017). Previously proposed approaches to complex KGQA formulate it as a subgraph matching task (Bao et al., 2016; Maheshwari et al., 2018a; Sorokin and Gurevych, 2018), which is an NP-hard problem (by reduction to the subgraph isomorphism problem) (Zou et al., 2014), or attempt to translate a natural language question into template-based SPARQL queries to retrieve the answer from the KG (Diefenbach et al., 2018), which requires a large number of candidate templates (Singh et al., 2018b).

We propose an approach to complex KGQA, called QAmp, based on an unsupervised message-passing algorithm, which allows for efficient reasoning under uncertainty using text similarity and the graph structure. The results of our experimental evaluation demonstrate that QAmp is able to manage uncertainties in interpreting natural language questions, overcoming inconsistencies in a KG and incompleteness in the training data, conditions that restrict applications of alternative supervised approaches.

A core aspect of QAmp is in disentangling reasoning from the question interpretation process. We show that uncertainty in reasoning stems from the question interpretation phase alone, meaning that under correct question interpretations QAmp will always rank the correct answers at the top. QAmp is designed to accommodate uncertainty inherent in perception and interpretation processes via confidence scores that reflect natural language ambiguity, which depends on the ability to interpret terms correctly. These ranked confidence values are then aggregated through our message-passing in a well-defined manner, which allows us to simultaneously consider multiple alternative interpretations of the seed terms, favoring the most likely interpretation in terms of the question context and relations modeled within the KG. Rather than iterating over all possible orderings, we show how to evaluate multiple alternative question interpretations in parallel via efficient matrix operations.

Another assumption of QAmp that proves useful in practice is to deliberately disregard subject-object order, i.e., edge directions in the knowledge graph, thereby treating the graph as undirected. Due to relation sparsity, this model relaxation turns out to be sufficient for most of the questions in the benchmark dataset. We also demonstrate that, due to insufficient relation coverage in the benchmark dataset, any assumption about the correct order of the triples in the KG is prone to overfitting: more than one question-answer example per relation would be required to learn and evaluate a supervised model that predicts relation directionality.

Our evaluation on LC-QuAD (Trivedi et al., 2017), a recent large-scale benchmark for complex KGQA, shows that QAmp significantly outperforms the state-of-the-art, without the need to translate a natural language question into a formal query language such as SPARQL. We also show that QAmp is interpretable in terms of activation paths, and simple, effective and efficient at the same time. Moreover, our error analysis demonstrates limitations of the LC-QuAD benchmark, which was constructed using local graph patterns.

The rest of the paper is organized as follows. Section 2 summarizes the state of the art in KGQA. Section 3 presents our approach, QAmp, with particular attention to the question interpretation and answer inference phases. Section 4 describes the setup we use to evaluate QAmp on the LC-QuAD dataset, and Section 5 reports the results, including a detailed ablation, scalability and error study. Finally, Section 6 concludes and lists future work.

2. Related Work

The most commonly used KGQA benchmark is the SimpleQuestions (Bordes et al., 2015) dataset, which contains questions that require identifying a single triple to retrieve the correct answers. Recent results (Petrochuk and Zettlemoyer, 2018) show that most of these simple questions can be solved using a standard neural network architecture. This architecture consists of two components: (1) a conditional random field (CRF) tagger with GloVe word embeddings for subject recognition given the text of the question, and (2) a bidirectional LSTM with FastText word embeddings for relation classification given the text of the question and the subject from the previous component. Approaches to simple KGQA cannot easily be adapted to solving complex questions, since they rely heavily on the assumption that each question refers to only one entity and one relation in the KG, which is no longer the case for complex questions. Moreover, complex KGQA also requires matching more complex graph patterns beyond a single triple.

Since developing KGQA systems requires solving several tasks, namely entity, relation and class linking, and afterwards query building, they are often implemented as independent components and arranged into a single pipeline (Dubey et al., 2018). Frameworks such as QALL-ME (Ferrández et al., 2011), OKBQA (Kim et al., 2017) and Frankenstein (Singh et al., 2018a) allow one to share and reuse those components as a collaborative effort. For example, Frankenstein includes 29 components that can be combined and interchanged (Singh et al., 2018c). However, the distribution of the number of components designed for each task is very unbalanced: most of the components in Frankenstein support entity and relation linking (18 and 5 components, respectively), while only two components perform query building (Singh et al., 2018b).

There is a lack of diversity in the approaches considered for retrieving answers from a KG. OKBQA and Frankenstein both propose to translate natural language questions to SPARQL queries and then use an existing query processing mechanism to retrieve answers. We show that matrix algebra approaches are more efficient for natural language processing than traditional SPARQL-based approaches, since they are optimized for parallel computation and thereby allow us to explore multiple alternative question interpretations at the same time (Kepner et al., 2016; Jamour et al., 2018).

Query building approaches involve query generation and ranking steps (Maheshwari et al., 2018a; Zafar et al., 2018). These approaches essentially consider KGQA as a subgraph matching task (Bao et al., 2016; Maheshwari et al., 2018a; Sorokin and Gurevych, 2018), which is an NP-hard problem (by reduction to the subgraph isomorphism problem) (Zou et al., 2014). In practice, Singh et al. (2018b) report that the query building components of Frankenstein fail to process 46% of the questions from a subset of LC-QuAD due to the large number of triple patterns. The reason is that most approaches to query generation are template-based (Diefenbach et al., 2018) and complex questions require a large number of candidate templates (Singh et al., 2018b). For example, WDAqua (Diefenbach et al., 2018) generates 395 SPARQL queries as possible interpretations for the question "Give me philosophers born in Saint Etienne."

In summary, we identify the query building component as the main bottleneck for the development of KGQA systems and propose QAmp as an alternative to the query building approach. Also, the pipeline paradigm is inefficient since it requires KG access first for disambiguation and then again for query building. QAmp accesses the KG only to aggregate the confidence scores via graph traversal after question parsing and shallow linking that matches an input question to labels of nodes and edges in the KG.

The work most similar to ours is the spreading activation model of Treo (Freitas et al., 2013), which is also a SPARQL-free approach based on graph traversal that propagates relatedness scores for ranking nodes with a cut-off threshold. Treo relies on POS tags, the Stanford dependency parser, Wikipedia links and TF-IDF vectors for computing semantic relatedness scores between a question and terms in the KG. Despite good performance on the QALD 2011 dataset, the main limitation of Treo is an average query execution time of 203 seconds (Freitas et al., 2013). In this paper we show how to scale this kind of approach to large KGs and complement it with machine learning approaches for question parsing and word embeddings for semantic expansion.

Our approach overcomes the limitations of the previously proposed graph-based approach in terms of efficiency and scalability, which we demonstrate on a compelling benchmark. We evaluate QAmp on LC-QuAD (Trivedi et al., 2017), the largest dataset used for benchmarking complex KGQA. Our baseline is WDAqua, the state-of-the-art in KGQA and winner of the most recent Scalable Question Answering Challenge (SQA2018) (Napolitano et al., 2018). Our evaluation results demonstrate improvements in precision and recall over the SPARQL-based WDAqua, while also reducing the average execution time; QAmp is, in addition, orders of magnitude faster than the runtimes reported for the earlier graph-based approach Treo.

There is other work on KGQA that embeds queries into a vector space (Hamilton et al., 2018; Wang et al., 2018). The benefit of our graph-based approach is that it preserves the original structure of the KG, which can be used both for executing precise formal queries and for answering ambiguous natural language questions. The graph structure also makes the results traceable and, therefore, interpretable in terms of relevant paths and subgraphs, in contrast with vector space operations.

QAmp uses message passing, a family of approaches initially developed in the context of probabilistic graphical models (Pearl, 1988; Koller et al., 2009). Graph neural networks trained to learn patterns of message passing have recently been shown to be effective on a variety of tasks (Gilmer et al., 2017; Battaglia et al., 2018), including KG completion (Schlichtkrull et al., 2018). We show that our unsupervised message passing approach performs well on complex question answering and helps to overcome sampling biases in the training data, which supervised approaches are prone to.

3. Approach

QAmp, our KGQA approach, consists of two phases: (1) question interpretation, and (2) answer inference. In the question interpretation phase we identify the sets of entities and predicates that we consider relevant for answering the input question along with the corresponding confidence scores. In the second phase these confidence scores are propagated and aggregated directly over the structure of the KG, to provide a confidence distribution over the set of possible answers. Our notion of KG is inspired by common concepts from the Resource Description Framework (RDF) (Schreiber and Raimond, 2014), a standard representation used in many large-scale knowledge graphs, e.g., DBpedia and Wikidata:

Definition 3.1 ().

We define a (knowledge) graph as a tuple KG = ⟨E, P, T⟩ that contains a set of entities (nodes) E and a set of properties P, both represented by Unique Resource Identifiers (URIs), and a set of directed labeled edges T of the form ⟨s, p, o⟩, where s, o ∈ E and p ∈ P.

The set of edges T in a KG can be viewed as a (blank-node-free) RDF graph, with subject-predicate-object triples ⟨s, p, o⟩. In analogy with RDFS, we refer to the subset of entities appearing as objects of the special property rdf:type as classes. We also refer to classes, entities and properties collectively as terms. We ignore RDF literals, except for rdfs:label values, which are used for matching questions to terms in the KG.

The task of question answering over a knowledge graph (KGQA) is: given a natural language question Q and a knowledge graph KG, produce the correct answer A, which is either a subset of entities in the KG or the result of a computation performed on this subset, such as the number of entities in the subset (COUNT) or an assertion (ASK). These question types are the most frequent in existing KGQA benchmarks (Bordes et al., 2015; Trivedi et al., 2017; Usbeck et al., 2018). In the first phase, QAmp maps the natural language question Q to a structured question model, which the answer inference algorithm then operates on.

3.1. Question interpretation

To produce a question model we follow two steps: (1) parse, which extracts references (entity, predicate and class mentions) from the natural language question and identifies the question type; and (2) match, which assigns each of the extracted references to a ranked list of candidate entities, predicates and classes in the KG.

Effectively, a complex question requires answering several sub-questions, which may depend on or support each other. A dependence relation between sub-questions means that the answer to one sub-question is required to produce the answer to the other. We call such complex questions compound questions and match the sequence in which these sub-questions should be answered to hops (in the context of this paper, one-variable graph patterns) in the KG. Consider the sample compound question in Fig. 1, which consists of two hops: (1) find the car types assembled in Broadmeadows Victoria that have a hardtop style; (2) find the company that produces these car types. There is an intermediate answer (the car types with the specified properties), which is required to arrive at the final answer (the company).

Accordingly, we define (compound) questions as follows:

Definition 3.2 ().

A question model is a tuple Q = ⟨t, S⟩, where t ∈ T is the question type required to answer the question, and S = (⟨E_i, P_i, C_i⟩) for i = 1, …, h is a sequence of hops over the KG, in which E_i is a set of entity references, P_i a set of property references, and C_i a set of class references relevant for the i-th hop in the graph; T is the set of question types, such as {SELECT, ASK, COUNT}.

Hence, the question in Fig. 1 can be modeled as Q = ⟨SELECT, (⟨E_1, P_1, C_1⟩, ⟨E_2, P_2, C_2⟩)⟩ with E_1 = {"hardtop", "Broadmeadows, Victoria"}, P_1 = {"assembles", "style"}, C_1 = {"cars"} and P_2 = {"company"}, where E_i, P_i, C_i refer to the entity, predicate and class references in hop i.
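For concreteness, the question model of Definition 3.2 can be sketched as a small data structure. This is an illustrative sketch only: the class and field names are our own and do not come from the QAmp source code, and the placement of "company" in the second hop follows our reading of the example.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class Hop:
    """One hop: a one-variable graph pattern with its reference sets."""
    entities: Set[str]    # entity references E_i
    properties: Set[str]  # property references P_i
    classes: Set[str]     # class references C_i

@dataclass
class QuestionModel:
    """Question model Q = <t, S>: a question type and a sequence of hops."""
    qtype: str       # one of {"SELECT", "ASK", "COUNT"}
    hops: List[Hop]  # the sequence S of hops over the KG

# The compound question from Fig. 1, modeled with two hops:
q = QuestionModel(
    qtype="SELECT",
    hops=[
        Hop({"hardtop", "Broadmeadows, Victoria"}, {"assembles", "style"}, {"cars"}),
        Hop(set(), {"company"}, set()),
    ],
)
```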

In the following, we describe how the question model is produced by parsing the input question, after which we match the references in the model to entities and predicates in the graph.

Parsing. Given a natural language question, the goal is to classify its type t and parse the question into a sequence of reference sets according to Definition 3.2. Question type detection is implemented as a supervised classification model trained on a dataset of annotated question-type pairs, which learns to assign an input question to one of the predefined types.

We model reference (mention) extraction as a sequence labeling task (Lafferty et al., 2001), in which a question is represented as a sequence of tokens (words or characters). Then, a supervised machine learning model is trained on an annotated dataset to assign labels to tokens, which we use to extract references to entities, predicates and classes. Moreover, we define the set of labels to group entities, properties and classes referenced in the question into hops.

Figure 1. (a) A sample question highlighting different components of the question interpretation model: references and matched URIs with the corresponding confidence scores, along with (b) the illustration of a sample KG subgraph relevant to this question. The URIs in bold are the correct matches corresponding to the KG subgraph.

Matching. Next, the question model (Definition 3.2) is updated to an interpreted question model in which each component is represented by sets of ⟨URI, confidence⟩ pairs obtained by matching the references to concrete terms in the KG: for each entity (or property, or class) reference, we retrieve a ranked list of the most similar entities from the KG along with the matching confidence scores.

Fig. 1 also shows the result of this matching step on our example. For instance, each property reference for the first hop is replaced by a set of candidate property URIs together with their matching confidence scores.

3.2. Answer inference

Our answer inference approach iteratively traverses and aggregates confidence scores across the graph based on the initial assignment in the interpreted question model. An answer set, i.e., a set of entities along with their confidence scores, is produced after each hop and used as part of the input to the next hop, together with the terms matched for that hop. The entity set produced after the last hop can be further transformed to produce the final answer via an aggregation function from a predefined set of aggregation functions, one for each question type. We compute the answer set for each hop inductively in two steps: (1) subgraph extraction and (2) message passing.
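The final aggregation step conditioned on the question type might look as follows. This is a minimal sketch under the assumption of confidence-ranked answers and the three question types named in the text, not the authors' implementation.

```python
def aggregate(answers, qtype):
    """Turn the entity set of the last hop into the final answer.

    answers: dict mapping entity URI -> confidence score.
    """
    if qtype == "SELECT":
        # return the answer entities ranked by descending confidence
        return sorted(answers, key=answers.get, reverse=True)
    if qtype == "COUNT":
        return len(answers)          # number of answer entities
    if qtype == "ASK":
        return len(answers) > 0      # assertion: does any answer exist?
    raise ValueError(f"unknown question type: {qtype}")
```

For example, `aggregate({"a": 0.9, "b": 0.4}, "COUNT")` yields 2, while the SELECT variant returns the entities ordered by score.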

Subgraph extraction. This step retrieves the relevant triples from the KG that form a subgraph. The URIs of the matched entities and predicates in the query are used as seeds to retrieve those triples in the KG that contain at least one seed entity (in subject or object position) and one seed predicate. The extracted subgraph therefore contains all matched entities plus the entities adjacent to them through the matched properties.
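In its simplest form, this seed-based triple filter can be sketched as follows; this is an illustrative in-memory version, whereas the actual implementation works against a compressed KG index.

```python
def extract_subgraph(triples, seed_entities, seed_properties):
    """Keep triples that contain at least one seed entity (as subject
    or object) and whose predicate is one of the seed properties."""
    return [
        (s, p, o)
        for s, p, o in triples
        if p in seed_properties and (s in seed_entities or o in seed_entities)
    ]
```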

The subgraph is represented as a set of adjacency matrices S_1, …, S_m over the n entities in the subgraph, where m is the total number of matched property URIs. There is a separate matrix for each property used as a seed, in which the entry for a pair of entities is 1 if there is an edge labeled with that property between them, and 0 otherwise. All adjacency matrices are symmetric, because QAmp does not model edge directionality, i.e., it treats the KG as undirected. Diagonal entries are set to 0 to ignore self-loops.
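A toy version of this construction, with dense NumPy arrays standing in for the sparse matrices used in practice:

```python
import numpy as np

def adjacency_matrices(triples, entities, properties):
    """Build one symmetric 0/1 adjacency matrix per seed property,
    ignoring edge direction and self-loops."""
    idx = {e: i for i, e in enumerate(entities)}
    n = len(entities)
    mats = {p: np.zeros((n, n)) for p in properties}
    for s, p, o in triples:
        if p in mats and s in idx and o in idx and s != o:
            mats[p][idx[s], idx[o]] = 1
            mats[p][idx[o], idx[s]] = 1  # undirected, hence symmetric
    return mats
```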

Input: adjacency matrices S_1, …, S_m of the subgraph,
entity reference activations e and property reference activations p

Output: answer activations vector a

1:  initialize the activation sums and counters to zero
2:  for each reference set i do
3:      W ← Σ_j p_ij S_j ▷ property update
4:      y_i ← e_i W ▷ entity update
5:      s ← s + Σ_k y_ik ▷ sum of all activations
6:      c_i ← 1 if y_i ≠ 0 else 0 ▷ mark reference sets with non-zero activations
7:      a ← a + y_i ▷ activation sums per entity
8:  end for
9:  f ← (Σ_i c_i) / n ▷ activation fraction
10: a_k ← a_k if a_k ≥ threshold else 0 ▷ filter low-confidence answers
11: return a ⊙ f ▷ score aggregation
Algorithm 1 Message passing for KGQA
Figure 2. (a) A sample subgraph with three entities as candidate answers, (b) their scores after predicate and entity propagation, and (c) the final aggregated score.

Message passing. The second step of the answer inference phase involves message passing, i.e., propagation of the confidence scores from the entities and predicates matched in the question interpretation phase to adjacent entities in the extracted subgraph; the pseudocode is given in Algorithm 1. This process is performed in three steps: (1) property update, (2) entity update, and (3) score aggregation, detailed as follows.

For each of the property reference sets, we:

  1. select the subset of adjacency matrices for the matched property URIs and propagate the confidence scores to the edges of the corresponding adjacency matrices via element-wise multiplication. All adjacency matrices are then combined into a single adjacency matrix, which contains all of their edges with the sum of confidence scores where edges overlap (property update: line 3, Algorithm 1);

  2. perform the main message-passing step via the sum-product update, in which the confidence scores of the entity references are passed along all edges of the combined matrix to the adjacent entities (entity update: line 4, Algorithm 1);

  3. aggregate the confidence scores of all entities in the subgraph into a single vector by combining the sum of all confidence scores with the number of entity and predicate reference sets that received a non-zero confidence score. The intuition behind this score aggregation formula (line 11, Algorithm 1) is that answers that received confidence from the majority of entity and predicate references in the question should be preferred. The computation of the answer scores for our running example is illustrated in Fig. 2.
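The first two of these steps reduce to two matrix operations. The sketch below uses dense NumPy arrays (sparse matrices in practice) and deliberately omits the final aggregation formula, which additionally weights scores by the fraction of activated reference sets.

```python
import numpy as np

def propagate(adj_mats, prop_conf, ent_conf):
    """One message-passing step over the extracted subgraph.

    adj_mats: dict property URI -> symmetric (n, n) 0/1 adjacency matrix
    prop_conf: dict property URI -> matching confidence score
    ent_conf: (n,) vector of seed entity activations
    """
    # (1) property update: combine the adjacency matrices,
    #     weighting each edge by its predicate confidence
    combined = sum(prop_conf[p] * mat for p, mat in adj_mats.items())
    # (2) entity update: sum-product propagation of the seed
    #     entity activations to adjacent entities
    return ent_conf @ combined
```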

The minimal confidence for a candidate answer is regulated by a threshold that excludes partial and low-confidence matches. Finally, we also have the option to filter answers by considering only those entities in the answer set that have one of the matched candidate classes.

The same procedure is repeated for each hop in the sequence, using the corresponding URI activations for the entities, properties and classes modeled for that hop, augmented with the intermediate answers produced for the previous hop. Lastly, the answer to the question is produced from the final entity set, which is either returned as is or passed through an aggregation function conditioned on the question type.

4. Evaluation Setup

We evaluate QAmp, our KGQA approach, on the LC-QuAD dataset of complex questions constructed from the DBpedia KG (Trivedi et al., 2017). First, we report the evaluation results of the end-to-end approach, which incorporates our message-passing algorithm in addition to the initial question interpretation (question parser and matching functions). Second, we analyze the fraction and sources of errors produced by the different KGQA components, which provides a comprehensive perspective on the limitations of the current state-of-the-art for KGQA, the complexity of the task, and the limitations of the benchmark. Our implementation and evaluation scripts are open-sourced.

Baseline. We use WDAqua (Diefenbach et al., 2018) as our baseline; to the best of our knowledge, the results produced by WDAqua are the only published results on the end-to-end question answering task for the LC-QuAD benchmark to date. It is a rule-based framework that integrates several KGs in different languages and relies on a handful of SPARQL query patterns to generate SPARQL queries and rank them as likely question interpretations. We rely on the evaluation results reported by the authors (Diefenbach et al., 2018). WDAqua results were produced for the full LC-QuAD dataset, while other datasets were used for tuning the approach.


Metrics. We follow the standard evaluation metrics for the end-to-end KGQA task, i.e., we report precision (P) and recall (R) macro-averaged over all questions in the dataset, and then use them to compute the F-measure (F). Following the evaluation setup of the QALD-9 challenge (Usbeck et al., 2018), we assign both precision and recall of 0 to a question in the following cases: (1) for SELECT questions, an empty answer set is returned while there is a non-empty answer set in the ground-truth annotations; (2) for COUNT or ASK questions, the answer differs from the ground truth; (3) for all questions, the predicted answer type differs from the ground truth. In the ablation study, we also analyze the fraction of questions with errors for each component separately, where an error is any deviation from the exact ground-truth answer.
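Under these conventions, the metric computation can be sketched as follows; in this sketch, answer sets are Python sets for SELECT questions and scalars for COUNT/ASK, and the zero-assignment rules are the ones listed above.

```python
def macro_prf(predicted, gold):
    """Macro-averaged precision/recall with QALD-9-style zero rules;
    the F-measure is computed from the macro-averaged P and R."""
    ps, rs = [], []
    for pred, true in zip(predicted, gold):
        if isinstance(true, set):  # SELECT question
            if not pred:           # empty prediction vs. non-empty ground truth
                p = r = 0.0
            else:
                tp = len(pred & true)
                p = tp / len(pred)
                r = tp / len(true)
        else:                      # COUNT or ASK: exact match required
            p = r = 1.0 if pred == true else 0.0
        ps.append(p)
        rs.append(r)
    P = sum(ps) / len(ps)
    R = sum(rs) / len(rs)
    F = 2 * P * R / (P + R) if P + R > 0 else 0.0
    return P, R, F
```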

Hardware. We used a standard commodity server to train and evaluate QAmp: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz, RAM 16 GB DDR4 SDRAM 2400 MHz, 240 GB SSD, NVIDIA GP102 GeForce GTX 1080 Ti.

4.1. The LC-QuAD dataset

The LC-QuAD dataset (Trivedi et al., 2017) contains 5K question-query pairs that have correct answers in the DBpedia KG (2016-04 version). The questions were generated using a set of SPARQL templates, seeded with DBpedia entities and relations, and then paraphrased by human annotators. All queries are of type ASK, SELECT, or COUNT; they fit subgraphs with a diameter of at most two hops and contain 1–3 entities and 1–3 properties.

We used the train and test splits provided with the dataset (Table 1). Two queries with no answers in the graph were excluded. All questions are also annotated with ground-truth reference spans to evaluate the performance of entity linking and relation detection (Dubey et al., 2018).

Split   All           Complex      Compound
all     4,998 (100%)  3,911 (78%)  1,982 (40%)
train   3,999  (80%)  3,131 (78%)  1,599 (40%)
test      999  (20%)    780 (78%)    383 (38%)
Table 1. Dataset statistics: number of questions across the train and test splits; number of complex questions that reference more than one triple; number of complex questions that require two hops in the graph through an intermediate answer-entity.

4.2. Implementation details

Our implementation uses the English subset of the official DBpedia 2016-04 dump, losslessly compressed into a single HDT file (Fernández et al., 2013). HDT is a state-of-the-art compressed RDF self-index that scales linearly with the size of the graph and is therefore applicable to very large graphs in practice. This KG contains 1B triples, more than 26M entities (DBpedia namespace only) and 68,687 predicates. Access to the KG for subgraph extraction and class-constraint look-ups is implemented via the Python HDT API. This API builds an additional index (Martínez-Prieto et al., 2012) to speed up look-ups and consumes the HDT file mapped on disk with a 3% memory footprint: overall, DBpedia 2016-04 takes 18 GB on disk and 0.5 GB in main memory.

Our end-to-end KGQA solution integrates several components that can be trained and evaluated independently. The pipeline includes two supervised neural networks for (1) question type detection and (2) reference extraction; and unsupervised functions for (3) entity and (4) predicate matching, and (5) message passing.

Parsing. Question type detection is implemented as a BiLSTM neural-network classifier trained on question-type pairs. We use another BiLSTM-CRF neural network to extract references to entities, classes and predicates for at most two hops, using the set of six labels {"E1", "P1", "C1", "E2", "P2", "C2"}. Both models use GloVe word embeddings pre-trained on the Common Crawl corpus with 840B tokens and 300 dimensions (Pennington et al., 2014).

Matching. The labels of all entities and predicates in the KG (rdfs:label links) are indexed into two separate catalogs and embedded into two separate vector spaces using the English FastText model trained on Wikipedia (Bojanowski et al., 2017). We use two ranking functions for matching and assigning the corresponding confidence scores: index-based for entities and embedding-based for predicates. The index-based ranking function uses BM25 (Manning et al., 2010) to calculate confidence scores for the top-500 matches on a combination of n-grams and Snowball stems. Embedding-based confidence scores are computed using the Magnitude library (Patel et al., 2018) for the top-50 nearest neighbors in the FastText embedding space.

Many entity references in the LC-QuAD questions can be handled using simple string similarity matching; e.g., 'companies' can be mapped to an entity whose label is a close string match. We built an ElasticSearch (Lucene) index to efficiently retrieve such entity candidates via string similarity to their labels. The entity labels were automatically generated from the entity URIs by stripping the domain part of the URI and lower-casing, so that, for example, an entity URI ending in 'Company' receives the label 'company' to better match question words. LC-QuAD questions also contain more complex paraphrases of entity URIs, such as 'movie', 'stockholder' or 'has kids', that require semantic similarity computation beyond fuzzy string matching. We therefore embedded entity and predicate labels with FastText (Bojanowski et al., 2017) to detect semantic similarities beyond string matching.

Index-based retrieval scales much better than nearest-neighbor computation in the embedding space, which is a crucial requirement for the 26M-entity catalog. In our experiments, syntactic similarity was sufficient for entity matching in most cases, while property matching required capturing more semantic variation and greatly benefited from pre-trained embeddings.
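Embedding-based predicate matching boils down to a nearest-neighbor search over label vectors. A toy cosine-similarity version, with arbitrary vectors standing in for FastText embeddings and a function name of our own choosing:

```python
import numpy as np

def top_matches(ref_vec, label_vecs, k=3):
    """Rank predicate labels by cosine similarity to a reference vector."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    scored = [(label, cos(ref_vec, v)) for label, v in label_vecs.items()]
    return sorted(scored, key=lambda lv: -lv[1])[:k]
```

In practice a library such as Magnitude performs this search efficiently over the full predicate catalog; the sketch only illustrates the scoring principle.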

5. Evaluation Results

Table 2 shows the performance of QAmp on the KGQA task in comparison with the results previously reported by Diefenbach et al. (2018). There is a noticeable improvement in recall (we were able to retrieve answers to 50% of the benchmark questions), while maintaining a comparable precision score. For the most recent QALD challenge, the guidelines were updated to penalize systems that miss correct answers, i.e., that are low in recall, which clearly signals the importance of recall for this task (Usbeck et al., 2018). While it is often trivial for users to filter out a small number of incorrect answers that stem from interpretation ambiguity, it is much harder for them to recover missing correct answers. Indeed, we show that QAmp is able to identify correct answers that are missing even from the benchmark dataset, as they were overlooked by the benchmark authors due to sampling bias.

Approach              P      R     F     Runtime
WDAqua                0.22*  0.38  0.28  1.50 s/q
QAmp (our approach)   0.25   0.50  0.33  0.72 s/q

Table 2. Evaluation results. (*) P of the WDAqua baseline is estimated from the reported precision of 0.59 for answered questions only. Runtime is reported in seconds per question, averaged across all questions in the dataset. The distribution of runtimes for QAmp is Min: 0.01, Median: 0.67, Mean: 0.72, Max: 13.83.
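For reference, the scores in Table 2 are computed per question over answer sets and then macro-averaged across questions; a minimal sketch of such a metric (our own helper, not the official QALD evaluation script, and the convention that empty predictions score zero is an assumption in line with the stricter recent guidelines):

```python
def prf(gold: set, pred: set):
    """Per-question precision/recall/F1 over answer sets;
    empty predictions score zero rather than being skipped."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Macro-average over two toy questions:
results = [prf({'a', 'b'}, {'a', 'c'}), prf({'x'}, {'x'})]
macro = [sum(m) / len(results) for m in zip(*results)]
print(macro)  # averaged P, R, F across questions
```

Under this convention, unanswered questions drag down both precision and recall, which explains why WDAqua's estimated P (0.22) is much lower than its reported precision over answered questions only (0.59).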

5.1. Ablation study

Table 3 summarizes the results of our ablation study for different setups. We report the fraction of all questions that have at least one answer deviating from the ground truth (Total column), questions with missing term matches (No match), and other errors. Revised errors are the subset of other errors that were confirmed as true errors in the manual error analysis.

Setup               Interpretation   P     R     F     Total  No match  Other  Revised
1 Question model*   all GT           0.97  0.99  0.98   9%    -          9%     5%
2 Question type     PR               0.96  0.98  0.97  10%    -          1%     1%
3 Ignore classes    None             0.94  0.99  0.96  14%    -          5%     3%
4 Classes           GT span          0.89  0.92  0.90  17%     8%       -      -
5 Entities          GT span          0.85  0.88  0.86  20%    10%        1%     1%
6 Entities          PR               0.64  0.74  0.69  46%    27%       10%     5%
7 Predicates        GT span          0.56  0.59  0.57  48%    34%        5%     3%
8 Predicates        PR               0.36  0.53  0.43  74%    34%       31%    19%

Table 3. Ablation study results. Each setup varies one question interpretation component (question type, entities, properties, or classes), using either predicted results (PR), ground-truth spans (GT span), or no input (None), while all remaining components use ground-truth (GT) term URIs and question types extracted from the ground-truth queries. The error columns report the fraction of questions with errors (Total) and the new errors introduced by the varied component (No match, Other, Revised). (*) The Question model results set the optimal performance for our approach, assuming that the question interpretation is perfectly aligned with the ground-truth annotations. GT span uses spans from the ground-truth annotations and then corrects the distribution of the matched entities/properties to mimic a correct question interpretation with a low-confidence tail of alternative matches. PR (Parsed Results) stands for predictions made by the question parsing and matching models (see Section 4.2).

Firstly, we verify that the relaxations in our question interpretation model hold for the majority of questions in the benchmark dataset (95%) by feeding all ground-truth entity, class, and property URIs to the answer inference module (Setup 1 in Table 3). We found that only 53 test questions (5%) require modeling the exact order of entities in the triple, i.e., the subject and object positions. These questions explicitly refer to a hierarchy of entity relations, such as dbp:doctoralStudents and dbp:doctoralAdvisor (see Figure 3; the numbers on top of each entity show its number of predicates and triples; all sample graph visualizations illustrating the different error types discovered in the LC-QuAD dataset in Figures 3, 4, and 6 were generated using the LODmilla web tool (Micsik et al., 2014) with data from the DBpedia KG), and their directionality has to be interpreted to answer such questions correctly. We also recovered a set of correct answers missing from the benchmark for relations that are symmetric by nature but were considered in only one direction by the benchmark, e.g., dbo:related, dbo:associatedBand, and dbo:sisterStation (see Figure 4).

Figure 3. Directed relation example (dbp:doctoralStudents and dbp:doctoralAdvisor hierarchy) that requires modeling the directionality of the relation. LC-QuAD question #3267: “Name the scientist whose supervisor also supervised Mary Ainsworth?” (correct answer: Abraham Maslow) can easily be confused with the question: “Name the scientist who also supervised the supervisor of Mary Ainsworth?” (correct answer: Lewis Terman). The LC-QuAD benchmark is not well suited for evaluating directionality interpretation, since only 35 questions (3.5%) of the LC-QuAD test split use relations of this type, which explains the high performance of QAmp, which treats all relations as undirected.
Figure 4. Undirected relation example (dbo:sisterStation) that reflects a bi-directional association between the adjacent entities (Missouri radio stations). LC-QuAD question #4486: “In which city is the sister station of KTXY located?” (correct answers: dbr:California,_Missouri, dbr:Missouri; missing answer: dbr:Eldon,_Missouri). DBpedia does not model bi-directional relations, and the relation direction is selected at random in these cases. LC-QuAD does not reflect bi-directionality either: it picks only one of the directions as correct and rejects correct solutions (dbr:KZWY dbr:Eldon,_Missouri). QAmp was able to retrieve this false-negative sample due to the default assumption, built into the question interpretation model, that relations are undirected.

These results indicate that a more complex question model that attempts to reflect the structural semantics of a question in terms of the expected edges and their directions (parse graph or lambda calculus) is likely to fall short when trained on this dataset: 53 sample questions are insufficient to train a reliable supervised model that recognizes relation directions from text, which explains the poor results of a slot-matching model for subgraph ranking reported on this dataset (Maheshwari et al., 2018b).

There were only 8 errors (1%) due to a wrongly detected question type, caused by misspelled or grammatically incorrect questions (row 2 in Table 3). Next, we experimented with removing class constraints and found that, although they generally help to filter out incorrect answers (row 3), our matching function missed many correct classes even when using the ground-truth spans from the benchmark annotations (row 4).

The last four evaluation setups (5–8) in Table 3 show the errors from parsing and matching reference spans to entities and predicates in the KG. Most errors were due to missing term matches (10–34% of questions), which indicates that the parsing and matching functions constitute the bottleneck of our end-to-end KGQA approach. Even with ground-truth span annotations for predicate references, performance is below 0.6 (34% of questions with missing matches), which indicates that relation detection is much harder than entity linking; this is in line with the results reported by Dubey et al. (2018) and Singh et al. (2018b).

The experiments marked GT span were performed by matching terms to the KG using the ground-truth span annotations, then down-scaling the confidence scores of all matches and setting the confidence score of the match used in the ground-truth query to the maximum confidence of 1. In this setup, all answers that are correct according to the benchmark were ranked at the top, which demonstrates the correctness of the message-passing and score-aggregation algorithm.
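The message passing and score aggregation can be sketched on a toy graph. This is our own NumPy illustration: QAmp operates on large sparse matrices (roughly one per matched predicate), and the entity and predicate confidence values here are assumed, not taken from the system.

```python
import numpy as np

# Toy adjacency matrix for a single matched predicate over 4 entities,
# built from (subject, object) triples.
A = np.zeros((4, 4))
for s, o in [(0, 1), (1, 2), (1, 3)]:
    A[s, o] = 1.0
A = A + A.T                  # core assumption: treat edges as undirected

e = np.zeros(4)
e[0] = 0.9                   # confidence of the matched question entity
p_conf = 0.8                 # confidence of the matched predicate (assumed)

# One message-passing step: a sparse matrix-vector product that mimics a
# join over the local subgraph around the matched entity.
answers = p_conf * (A @ e)
print(answers)               # only the neighbours of entity 0 receive scores
```

Given a correct confidence distribution over the matched terms, the entity used in the ground-truth query dominates the product, so the correct answers end up ranked at the top.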

5.2. Scalability analysis

As reported in Table 2, QAmp is twice as fast as the WDAqua baseline on a comparable hardware configuration. Figure 5 shows the distribution of processing times and the number of examined triples per question from the LC-QuAD test split. The results are in line with the expected fast retrieval from HDT (Fernández et al., 2013), which scales linearly with the size of the graph. Most questions are processed within 2 seconds (with a median and mean around 0.7s), even those examining more than 50K triples. Note that only 10 questions took more than 2 seconds to process, and 3 of them took more than 3 seconds. These outliers end up examining a large number of alternative interpretations (up to 300K triples), which could be prevented by setting a tighter confidence threshold. Finally, it is worth mentioning that some questions end up with no results (i.e., 0 triples accessed) but can still take up to 2 seconds for parsing and matching.

Figure 5. Processing times per question from the LC-QuAD test split (Min: 0.01s, Median: 0.68s, Mean: 0.72s, Max: 13.97s)

5.3. Error analysis

We sampled questions with errors (P < 1 or R < 1) for each of the setups and performed an error analysis for a total of 206 questions. Half of the errors were due to the incompleteness of the benchmark dataset and inconsistencies in the KG (column Revised in Table 3). Since the benchmark provides only a single SPARQL query per question, containing a single URI for each entity, predicate, and class, all alternative though correct matches are missing, e.g., the ground-truth query using dbp:writer will miss dbo:writer, or match all dbo:languages triples but not dbo:language, etc.

Figure 6. Alternative entity example that demonstrates a missing answer when only a single correct entity URI is considered (dbr:Rome and not dbr:Pantheon,_Rome). LC-QuAD question #261: “Give me a count of royalties buried in Rome?” (correct answer: dbr:Augustus; missing answer: dbr:Margherita_of_Savoy). QAmp was able to retrieve this false-negative sample thanks to the string matching function and to retaining a list of alternative URIs per entity mention.

QAmp was able to recover many such cases and produce additional correct answers using: (1) missing or alternative class URIs, e.g., dbr:Fire_Phone was missing from the answers for technological products manufactured by Foxconn since it was annotated as a device and not as an information appliance; (2) related or alternative entity URIs, e.g., the set of royalties buried in dbr:Rome should also include those buried in dbr:Pantheon,_Rome (see Figure 6); (3) alternative properties, e.g., dbo:hometown as dbo:birthPlace.

We also discovered alternative answers due to a majority-vote effect, where many entities with low confidence jointly boost a single answer. Majority voting can produce a best-effort guess based on the data in the KG even if the correct terms are missing from the KG or could not be recovered by the matching function, e.g., for “In which time zone is Pong Pha?”, even if Pong Pha is not in the KG, many other locations with similar names are likely to be located in the same geographic area.
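The majority-vote effect can be illustrated with a small sketch. The candidate matches and confidence scores below are hypothetical, purely for illustration:

```python
from collections import defaultdict

# Hypothetical low-confidence entity matches for "Pong Pha": several
# similarly named locations, each voting for the time zone of its subgraph.
votes = [('Asia/Bangkok', 0.30), ('Asia/Bangkok', 0.25),
         ('Asia/Bangkok', 0.20), ('Europe/Paris', 0.40)]

scores = defaultdict(float)
for answer, conf in votes:
    scores[answer] += conf   # low-confidence matches accumulate

best = max(scores, key=scores.get)
print(best)                  # 'Asia/Bangkok' wins despite no single strong match
```

No individual match outscores the spurious candidate, but the aggregated confidence of the consistent answers does, which is the best-effort behaviour described above.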

Overall, our evaluation results indicate that the answer set of the LC-QuAD benchmark can be used only as a seed to estimate recall but does not provide us with a reliable metric for precision. Attempts to further improve performance on such a dataset can lead to learning the biases embedded in the construction of the dataset, e.g., the set of relations and their directions. QAmp is able to mitigate this pitfall by resorting to unsupervised message passing that collects answers from all local subgraphs containing terms matching the input question, in parallel.

6. Conclusion

We have proposed QAmp, a novel approach for complex KGQA using message passing, which sets the new state-of-the-art results on the LC-QuAD benchmark for complex question answering. We have shown that QAmp is scalable and can be successfully applied to very large KGs, such as DBpedia, which is one of the biggest cross-domain KGs. QAmp does not require supervision in the answer inference phase, which helps to avoid overfitting and to discover correct answers missing from the benchmark due to the limitations of its construction. Moreover, the answer inference process can be explained by the extracted subgraph and the confidence score distribution. QAmp requires only a handful of hyper-parameters to model confidence thresholds in order to stepwise filter partial results and trade off recall for precision.

QAmp is built on a basic assumption of considering edges as undirected in the graph, which proved reasonable and effective in our experiments. The error analysis revealed that, in fact, symmetric edges were often missing in the KG, i.e., the decision on the order of entities in KG triples is made arbitrarily and is not duplicated in the reverse order. However, there is also a (small) subset of relations, e.g., hierarchy relations, for which relation direction is essential.

Question answering over KGs is hard due to (1) ambiguities stemming from question interpretation, (2) inconsistencies in knowledge graphs, and (3) challenges in constructing a reliable benchmark, which motivate the development of robust methods able to cope with uncertainties and also provide assistance to end-users in interpreting the provenance and confidence of the answers.

QAmp is not without limitations. It is designed to handle questions where the answer is a subset of entities or an aggregate based on this subset, e.g., questions for which the expected answer is a subset of properties in the graph, are currently out of scope. An important next step is to use QAmp to improve the recall of the benchmark dataset by complementing the answer set with missing answers derived from relaxing the dataset assumptions. Recognizing relation directionality is an important direction for future work, which requires extending existing benchmark datasets and the addition of more cases where an explicit order is required to retrieve correct answers. Another direction is to improve predicate matching, which is the weakest component of the proposed approach as identified in our ablation study. Finally, unsupervised message passing can be adopted for other tasks that require uncertain reasoning on KGs, such as knowledge base completion, text entailment, summarization, and dialogue response generation.


This work was supported by the EU H2020 programme under the MSCA-RISE agreement 645751 (RISE_BPM), the Austrian Research Promotion Agency (FFG) under projects CommuniData (855407) and Cityspin (861213), Ahold Delhaize, the Association of Universities in the Netherlands (VSNU), and the Innovation Center for Artificial Intelligence (ICAI). All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.


  • J. Bao, N. Duan, Z. Yan, M. Zhou, and T. Zhao (2016) Constraint-based question answering with knowledge graph. In COLING 2016, pp. 2503–2514. Cited by: §1, §2.
  • P. W. Battaglia, J. B. Hamrick, et al. (2018) Relational inductive biases, deep learning, and graph networks. CoRR abs/1806.01261. External Links: 1806.01261 Cited by: §2.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146. External Links: ISSN 2307-387X Cited by: §4.2, §4.2.
  • P. A. Bonatti, S. Decker, A. Polleres, and V. Presutti (2018) Knowledge graphs: new directions for knowledge representation on the semantic web (dagstuhl seminar 18371). Dagstuhl Reports 8 (9), pp. 29–111. Cited by: §1.
  • A. Bordes, N. Usunier, S. Chopra, and J. Weston (2015) Large-scale simple question answering with memory networks. CoRR abs/1506.02075. External Links: 1506.02075 Cited by: §1, §2, §3.
  • W. Bronnenberg, H. Bunt, J. Landsbergen, R. Scha, W. Schoenmakers, and E. van Utteren (1980) The question answering system Phliqa1. In Natural Language Question Answering Systems, L. Bolc (Ed.), pp. 217–305. Cited by: §1.
  • F. F. de Faria, R. Usbeck, A. Sarullo, T. Mu, and A. Freitas (2018) Question answering mediated by visual clues and knowledge graphs. In WWW, pp. 1937–1939. Cited by: §1.
  • D. Diefenbach, A. Both, K. D. Singh, and P. Maret (2018) Towards a question answering system over the semantic web. CoRR abs/1803.00832. External Links: 1803.00832 Cited by: §1, §2, §4, §5.
  • M. Dubey, D. Banerjee, D. Chaudhuri, and J. Lehmann (2018) EARL: joint entity and relation linking for question answering over knowledge graphs. In ISWC 2018, pp. 108–126. Cited by: §2, §4.1, §5.1.
  • J. D. Fernández, M. A. Martínez-Prieto, C. Gutiérrez, A. Polleres, and M. Arias (2013) Binary rdf representation for publication and exchange (hdt). J. Web Semant. 19, pp. 22–41. Cited by: §4.2, §5.2.
  • Ó. Ferrández, C. Spurk, M. Kouylekov, I. Dornescu, S. Ferrández, M. Negri, R. Izquierdo, D. Tomás, C. Orasan, G. Neumann, B. Magnini, and J. L. V. González (2011) The QALL-ME framework: A specifiable-domain multilingual question answering architecture. J. Web Semant. 9 (2), pp. 137–145. Cited by: §2.
  • A. Freitas, E. Curry, J. G. Oliveira, and S. O’Riain (2012) Querying heterogeneous datasets on the linked data web: challenges, approaches, and trends. IEEE Internet Computing 16 (1), pp. 24–33. Cited by: §1.
  • A. Freitas, J. G. Oliveira, S. O’Riain, J. C. P. da Silva, and E. Curry (2013) Querying linked data graphs using semantic relatedness: A vocabulary independent approach. Data Knowl. Eng. 88, pp. 126–141. Cited by: §2.
  • A. Gandomi and M. Haider (2015) Beyond the hype: big data concepts, methods, and analytics. International journal of information management 35 (2), pp. 137–144. Cited by: §1.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, et al. (2017) Neural message passing for quantum chemistry. In ICML 2017, pp. 1263–1272. Cited by: §2.
  • Y. Goyal, T. Khot, et al. (2017) Making the V in VQA matter: elevating the role of image understanding in Visual Question Answering. In CVPR 2017, Cited by: §1.
  • B. F. Green, A. K. Wolf, C. Chomsky, and K. Laughery (1963) Baseball: an automatic question answerer.. In Computers and Thought, pp. 219–224. Cited by: §1.
  • W. L. Hamilton, P. Bajaj, M. Zitnik, D. Jurafsky, and J. Leskovec (2018) Embedding logical queries on knowledge graphs. In NeurIPS, pp. 2030–2041. Cited by: §2.
  • S. Harris and A. Seaborne (2013) SPARQL 1.1 query language. Note: W3C Recommendation Cited by: §1.
  • G. G. Hendrix (1982) Natural-language interface. Computational Linguistics 8 (2), pp. 56–61. Cited by: §1.
  • F. Jamour, I. Abdelaziz, and P. Kalnis (2018) A demonstration of magiq: matrix algebra approach for solving rdf graph queries. Proc. the VLDB Endowment 11 (12), pp. 1978–1981. Cited by: §2.
  • E. Kaufmann and A. Bernstein (2007) How useful are natural language interfaces to the semantic web for casual end-users?. In ISWC, pp. 281–294. Cited by: §1.
  • J. Kepner, P. Aaltonen, D. A. Bader, A. Buluç, F. Franchetti, J. R. Gilbert, D. Hutchison, M. Kumar, A. Lumsdaine, H. Meyerhenke, S. McMillan, C. Yang, J. D. Owens, M. Zalewski, T. G. Mattson, and J. E. Moreira (2016) Mathematical foundations of the GraphBLAS. In HPEC, pp. 1–9. Cited by: §2.
  • J. Kim, C. Unger, A. N. Ngomo, A. Freitas, Y. Hahm, J. Kim, G. Choi, J. Kim, R. Usbeck, M. Kang, and K. Choi (2017) OKBQA: an open collaboration framework for development of natural language question-answering over knowledge bases. In ISWC, Cited by: §2.
  • D. Koller, N. Friedman, and F. Bach (2009) Probabilistic graphical models: principles and techniques. MIT press. Cited by: §2.
  • J. D. Lafferty, A. McCallum, and F. C. N. Pereira (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In ICML 2001, pp. 282–289. Cited by: §3.1.
  • J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, et al. (2015) DBpedia - A large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web 6 (2), pp. 167–195. Cited by: §1.
  • G. Maheshwari, P. Trivedi, D. Lukovnikov, N. Chakraborty, A. Fischer, and J. Lehmann (2018a) Learning to rank query graphs for complex question answering over knowledge graphs. arXiv preprint arXiv:1811.01118. Cited by: §1, §2.
  • G. Maheshwari, P. Trivedi, et al. (2018b) Learning to rank query graphs for complex question answering over knowledge graphs. arXiv preprint arXiv:1811.01118. Cited by: §5.1.
  • C. Manning, P. Raghavan, and H. Schütze (2010) Introduction to information retrieval. Natural Language Engineering 16 (1), pp. 100–103. Cited by: §1, §4.2.
  • M. A. Martínez-Prieto, M. A. Gallego, and J. D. Fernández (2012) Exchange and consumption of huge rdf data. In ESWC, pp. 437–452. Cited by: §4.2.
  • A. Micsik, S. Turbucz, and A. Györök (2014) LODmilla: a linked data browser for all. In Posters&Demos SEMANTiCS 2014, S. Harald, F. Agata, L. Jens, and H. Sebastian (Eds.), pp. 31–34. Cited by: footnote 13.
  • G. Napolitano, R. Usbeck, and A. N. Ngomo (2018) The scalable question answering over linked data (SQA) challenge 2018. In SemWebEval Challenge at ESWC, pp. 69–75. Cited by: §2.
  • A. Patel, A. Sands, C. Callison-Burch, and M. Apidianaki (2018) Magnitude: a fast, efficient universal vector embedding utility package. In EMNLP 2018, pp. 120–126. Cited by: §4.2.
  • J. Pearl (1988) Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers Inc.. Cited by: §2.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In EMNLP 2014, pp. 1532–1543. Cited by: §4.2.
  • M. Petrochuk and L. Zettlemoyer (2018) SimpleQuestions nearly solved: A new upperbound and baseline approach. In EMNLP 2018, pp. 554–558. Cited by: §2.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100, 000+ questions for machine comprehension of text. In EMNLP 2016, pp. 2383–2392. Cited by: §1.
  • M. S. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling (2018) Modeling relational data with graph convolutional networks. In ESWC, pp. 593–607. Cited by: §2.
  • G. Schreiber and Y. Raimond (2014) RDF 1.1 primer. Note: W3C Note Cited by: §3.
  • K. Singh, A. Both, A. S. Radhakrishna, and S. Shekarpour (2018a) Frankenstein: A platform enabling reuse of question answering components. In ESWC, pp. 624–638. Cited by: §2.
  • K. Singh, I. Lytra, A. S. Radhakrishna, S. Shekarpour, M. Vidal, and J. Lehmann (2018b) No one is perfect: analysing the performance of question answering components over the dbpedia knowledge graph. CoRR abs/1809.10044. Cited by: §1, §2, §2, §5.1.
  • K. Singh, A. S. Radhakrishna, A. Both, S. Shekarpour, I. Lytra, R. Usbeck, A. Vyas, A. Khikmatullaev, D. Punjani, C. Lange, M. Vidal, J. Lehmann, and S. Auer (2018c) Why reinvent the wheel: let’s build question answering systems together. In WWW, pp. 1247–1256. Cited by: §2.
  • D. Sorokin and I. Gurevych (2018) Modeling semantics with gated graph neural networks for knowledge base question answering. In COLING 2018, pp. 3306–3317. Cited by: §1, §2.
  • P. Trivedi, G. Maheshwari, M. Dubey, and J. Lehmann (2017) LC-quad: A corpus for complex question answering over knowledge graphs. In ISWC, pp. 210–218. Cited by: §1, §1, §2, §3, §4.1, §4.
  • C. Unger, A. Freitas, and P. Cimiano (2014) An introduction to question answering over linked data. In Reasoning Web, pp. 100–140. Cited by: §1.
  • R. Usbeck, R. H. Gusmita, A. N. Ngomo, and M. Saleem (2018) 9th challenge on question answering over linked data. In QALD at ISWC, pp. 58–64. Cited by: §3, §4, §5.
  • D. Vrandecic and M. Krötzsch (2014) Wikidata: a free collaborative knowledgebase. Commun. ACM 57 (10), pp. 78–85. Cited by: §1.
  • M. Wang, R. Wang, J. Liu, Y. Chen, L. Zhang, and G. Qi (2018) Towards empty answers in SPARQL: approximating querying with RDF embedding. In ISWC, pp. 513–529. Cited by: §2.
  • X. Wilcke, P. Bloem, and V. De Boer (2017) The knowledge graph as the default data model for learning on heterogeneous knowledge. Data Science, pp. 1–19. Cited by: §1.
  • W. A. Woods (1977) Lunar rocks in natural English: Explorations in natural language question answering. In Linguistic Structures Processing, A. Zampoli (Ed.), pp. 521–569. Cited by: §1.
  • H. Zafar, G. Napolitano, and J. Lehmann (2018) Formal query generation for question answering over knowledge bases. In ESWC, pp. 714–728. Cited by: §2.
  • L. Zou, R. Huang, H. Wang, J. X. Yu, W. He, and D. Zhao (2014) Natural language question answering over RDF: a graph data driven approach. In SIGMOD 2014, pp. 313–324. Cited by: §1, §2.