UNIQORN: Unified Question Answering over RDF Knowledge Graphs and Natural Language Text

08/19/2021 · by Soumajit Pramanik, et al. · Max Planck Society

Question answering over knowledge graphs and other RDF data has advanced greatly, with a number of systems providing crisp answers to natural language questions or telegraphic queries. Some of these systems incorporate textual sources as additional evidence for the answering process, but cannot compute answers that are present in text alone. Conversely, systems from the IR and NLP communities have addressed QA over text, but barely utilize semantic data and knowledge. This paper presents the first QA system that can seamlessly operate over RDF datasets, text corpora, or both together, in a unified framework. Our method, called UNIQORN, builds a context graph on the fly by retrieving question-relevant triples from the RDF data and/or the text corpus, where the latter case is handled by automatic information extraction. The resulting graph is typically rich but highly noisy. UNIQORN copes with this input through advanced graph algorithms for Group Steiner Trees that identify the best answer candidates in the context graph. Experimental results on several benchmarks of complex questions with multiple entities and relations show that UNIQORN, an unsupervised method with only five parameters, produces results comparable to the state of the art on KGs, text corpora, and heterogeneous sources. The graph-based methodology provides user-interpretable evidence for the complete answering process.







1. Introduction

Motivation. Question answering (QA) aims to compute direct answers to information needs posed as natural language (NL) utterances (Lu et al., 2019; Sun et al., 2019; Clark and Gardner, 2018; Diefenbach et al., 2019; Unger et al., 2012; Rajpurkar et al., 2016; Kaiser et al., 2021). We focus on the class of factoid questions that are objective in nature and have one or more named entities as answers (Berant et al., 2013; Abujabal et al., 2019; Dubey et al., 2019; Vakulenko et al., 2019). A running example in this paper is:

Question: director of the western for which Leo won an Oscar? [Answer: Alejandro Iñàrritu]

Early research (Ravichandran and Hovy, 2002; Voorhees, 1999) used patterns to extract text passages with candidate answers, or had sophisticated pipelines like the proprietary IBM Watson system (Ferrucci et al., 2010; Ferrucci, 2012) that won the Jeopardy! quiz show. With the rise of large knowledge graphs (KGs) like YAGO (Suchanek et al., 2007), DBpedia (Auer et al., 2007), Freebase (Bollacker et al., 2008), and Wikidata (Vrandečić and Krötzsch, 2014), the focus shifted from text corpora as inputs to these structured RDF data sources, represented as subject-predicate-object (SPO) triples. The prevalent paradigm for QA over knowledge graphs, KG-QA for short, is to translate questions into SPARQL queries that can be executed over the RDF triples (Unger et al., 2012; Abujabal et al., 2018; Diefenbach et al., 2018; Hu et al., 2018; Bast and Haussmann, 2015).

While KGs capture a large part of the world’s encyclopedic knowledge, they are inherently incomplete. They cannot stay up to date with the latest information, so emerging and ephemeral facts (e.g., teams losing in finals of sports leagues or celebrities dating each other) are not included. Also, user interests go way beyond the predicates that are modeled in KGs like Wikidata. As a result, answering over Web text, like news sites and review forums, is an absolute necessity. We refer to this paradigm as question answering over text, Text-QA for short. Research in the NLP community has revived and advanced this line of Text-QA using deep learning techniques, and focuses on extracting precise spans from one or more textual passages (Rajpurkar et al., 2016; Chen et al., 2017; Clark and Gardner, 2018).

Limitations of the state-of-the-art. This fragmented landscape of QA research has led to a state where methods from one paradigm are completely incompatible with the other. As specific examples, the only working QA systems with sustained online service for Wikidata, QAnswer (Diefenbach et al., 2019) and Platypus (Tanon et al., 2018), are not equipped at all to compute answers from text passages. On the other hand, powerful deep learning methods like DrQA (Chen et al., 2017) and DocumentQA (Clark and Gardner, 2018), which work remarkably well on several text corpora, are not geared at all towards tapping into KGs and other RDF data. This is because these algorithms are essentially classifiers trained to predict the start and end positions of the most likely answer span in a given piece of text, a setup that does not agree with RDF triples.

Some recent works on heterogeneous QA (Savenkov and Agichtein, 2016; Xu et al., 2016; Sun et al., 2018, 2019) incorporate text (typically articles in Wikipedia or search results) as a supplement to RDF triples, or the other way around (Sun et al., 2015; Das et al., 2017). However, KG and text are combined here merely to improve the ranking of candidate answers through additional evidence; the answers themselves come from the KG alone, from text alone, or both. Moreover, none of these methods is geared to handle the extreme but realistic situations where the input is either only a KG or only a text corpus. The state-of-the-art method in heterogeneous QA, PullNet (Sun et al., 2019), needs entity-linked text and shortest paths in KGs for creating training data, needs trained entity embeddings, and can only return such KG entities as answers. Thus, it cannot operate over arbitrary text corpora with an open vocabulary where ad hoc phrases could be answers.

Problems are exacerbated for complex questions, when SPARQL queries become particularly brittle for KG-QA systems, or text evidence for Text-QA systems has to be joined from multiple documents. For example, the correct query for our running example would be:

SELECT ?x WHERE {
  ?y director ?x .
  ?y castMember LeonardoDiCaprio .
  ?y genre western .
  LeonardoDiCaprio awarded AcademyAward [forWork ?y] .
}

Hardly any KG-QA system would get this mapping onto KG predicates and entities perfectly right. For Text-QA, computing answers requires stitching together information from several texts, as it is not easy to find a single web page that contains all relevant cues in one spot.

(a) Example with a KG as input.
(b) Example with text as input.
Figure 1. Context graphs (XG) built by UniQORN for the question “director of the western for which Leo won an Oscar?”. Anchors are nodes with underlined fragments; answers are in bold. Orange subgraphs are Group Steiner Trees.

Approach. To overcome these limitations, we propose UniQORN, a Unified framework for question answering over RDF knowledge graphs and NL text. Our proposal hinges on two key ideas:

  • Instead of attempting to compute perfect translations of questions into crisp SPARQL queries (which seems elusive for complex questions), we relax the search over KGs by devising graph algorithms for Group Steiner Trees, thereby connecting question-relevant cues from which candidate answers can be extracted and ranked.

  • Handling the KG-QA, Text-QA and mixture settings with the same unified method, by building a noisy KG-like context graph from text inputs on the fly for each question, using open information extraction (Open IE) and other techniques.

In a nutshell, UniQORN works as follows. Given an input question, we construct a context graph (XG) that contains question entities and classes, relevant predicates, and candidate answers. The XG consists of either: (i) KG facts defining the neighborhood of the question entities and classes, or (ii) a quasi-KG dynamically built by joining Open IE triples extracted from text results retrieved from a search engine, or (iii) both, in case of retrieval from heterogeneous sources. We identify anchor nodes in the XG that are most related to phrases in the question, using lexicon-based or embedding-based similarities. Treating the anchors as terminals, a Group Steiner Tree (GST) (Ding et al., 2007) is computed that contains candidate answers. The GST establishes the joint context for entities, relations, and classes mentioned in the question. Candidate answers are ranked by statistical measures over the GSTs. Fig. 1 illustrates this unified approach for the settings of KG-QA and Text-QA.

Contributions. This work makes the following salient contributions: (i) proposing the first unified method and system for answering factoid questions over knowledge graphs or text corpora; (ii) applying Group Steiner Trees as a strategy for locating answers to complex questions involving multiple entities and relations; (iii) experimentally comparing UniQORN on several benchmarks of complex questions against state-of-the-art baselines on KGs, text and heterogeneous sources. All code and data for this project are available at: https://uniqorn.mpi-inf.mpg.de/.

2. Concepts and Notation

2.1. General Concepts

Knowledge graph. An RDF knowledge graph K (like Wikidata) consists of entities (like ‘Leonardo DiCaprio’), predicates (like ‘award received’), classes (like ‘film’), and literals (like ‘07 January 2016’), organized as a set of subject-predicate-object (SPO) triples <S, P, O>, where the subject is an entity and the object is an entity, class, or literal. Optionally, a triple may be accompanied by qualifier information as one additional predicate and object each. The following is an example of a triple with a context qualifier (https://www.wikidata.org/wiki/Help:Qualifiers): <LeonardoDiCaprio awardReceived AcademyAward, forWork TheRevenant>. Each triple represents a fact in K.

Text corpus. A text corpus D is a collection of documents, where each document contains a set of natural language sentences. In practice, this could be a static collection like ClueWeb12, Wikipedia articles or a news archive, or the open Web. To induce structure on D, Open IE (Del Corro and Gemulla, 2013; Angeli et al., 2015; Mausam, 2016) is performed on each sentence of every document, to return a set of triples in SPO format, where each such triple is a fact (evidence) mined from some sentence. These triples are augmented with those built from Hearst patterns (Hearst, 1992) run over D that indicate entity-class memberships. This non-canonicalized triple store is referred to as a quasi knowledge graph K_D built from the document corpus. Thus, each triple is a fact in K_D.

Heterogeneous source. We refer to the mixture of the knowledge graph K and the quasi-KG K_D as a heterogeneous source K_het, where each triple in this heterogeneous KG can come from either the RDF store or the text corpus.

Question. A question q is posed either as a full-fledged interrogative sentence (“Who is the director of the western film for which Leonardo DiCaprio won an Oscar?”) or in telegraphic (Joshi et al., 2014; Sawant et al., 2019) / utterance (Abujabal et al., 2018; Yih et al., 2015) form (“director of the western with Oscar for Leo”). We write q = <q_1, ..., q_k>, where the q_i are the tokens in the question (words or phrases like entity mentions detected by a recognition system), and k is the total number of tokens in q.

Answer. An answer a to question q is an entity or a literal in the KG (like entity Q18002795 in Wikidata with label The Revenant), or a span of text (a sequence of words) from some sentence in the corpus (like ‘The Revenant film’). A(q) is the set of all correct answers to q.

Complex questions. UniQORN is motivated by the need for a unified approach to complex questions (Sun et al., 2019; Lu et al., 2019; Vakulenko et al., 2019; Dubey et al., 2019; Talmor and Berant, 2018), as simple questions are already well-addressed by prior works (Petrochuk and Zettlemoyer, 2018; Abujabal et al., 2018; Bast and Haussmann, 2015; Yih et al., 2015). We call a question “simple” if it translates into a query or a logical form with a single entity and a single predicate (like “capital of Greece?”). Questions where the simplest proper query requires multiple entities or multiple predicates are considered “complex” (like “director of the western for which Leo won an Oscar?”). There are other notions of complex questions (Höffner et al., 2017), like those requiring grouping and aggregation (e.g., “which director made the highest number of movies that won an Oscar in any category?”), or questions involving negations (e.g., “which director has won multiple Oscars but never a Golden Globe?”). These are not considered in this paper.

2.2. Graph Concepts

Context graph. A context graph XG_q for a question q is defined as a subgraph of the full / quasi / heterogeneous knowledge graph, i.e., of K (KG), K_D (text), or K_het (mixture), such that it contains all triples potentially relevant for answering q. Thus, XG_q is expected to contain every answer entity a in A(q). An XG has nodes and edges with types and weights as discussed below. An XG is always question-specific, and to simplify notation, we often write XG instead of XG_q.

Node types. A node in an XG is mapped to one of four types: (i) entity, (ii) predicate, (iii) class, or (iv) literal, via a mapping function from nodes to types. Each node is produced from a subject, predicate, or object of the triples in K or K_D. For K, we make no distinction between predicates and qualifier predicates, or between standard objects and qualifier objects (there are no qualifiers in K_D). Even though it is standard practice to treat predicates as edge labels in a KG, we model them as nodes, because this design simplifies the application of our graph algorithms on the XG. Note that predicates originating from different triples are assigned unique identifiers in the graph. For instance, for triples <BarackObama, married, MichelleObama> and <BillGates, married, MelindaGates>, we will obtain two married nodes, marked as married-1 and married-2 in the context graph. Such distinctions prevent false inferences when answering over some XG. For simplicity, we do not show such labelings in Fig. 1.

For text-based XGs, we make no distinction between entities and literals, as the Open IE markups that produce these often lack such “literal” annotations. Class nodes in KG-based XGs come from the objects of instance of facts in K (e.g., P31 in Wikidata), while those in text-based XGs originate from the outputs of Hearst patterns (Hearst, 1992) over D. In Fig. 0(a), the nodes ‘The Revenant’, ‘director’, ‘Academy Awards’, and ‘2016’ are of type entity, predicate, class, and literal, respectively.

Edge types. An edge in an XG is mapped to one of three types: (i) triple, (ii) class, or (iii) alignment edges, via a mapping function from edges to types. Triples in K, K_D, or K_het whose object is of node type entity or literal contribute triple edges to the XG. For example, in Fig. 0(a), the two edges between ‘The Revenant’ and ‘genre’ and between ‘genre’ and ‘Western film’ are triple edges. Triples whose object is a class are referred to as class triples, and each class triple contributes two class edges to the XG (edges between ‘Academy Award for Best Director’ and ‘instance of’, and between ‘instance of’ and ‘Academy Awards’ in Fig. 0(a)). Alignment edges represent potential synonymy between nodes, and are inserted via KG aliases (for K), external lexicons and resources (for K_D), or both (for K_het). Examples are the bidirectional edges between ‘Leo DiCaprio’ and ‘Leonardo diCaprio’ in Fig. 0(a). Note that the insertion of alignment edges, as opposed to naive merging of synonymous nodes, is a deliberate choice: it enables (subsequent) wider matching and principled disambiguation of question concepts, precludes the problem of choosing a representative label for a merged cluster, and avoids topic drifts arising from the transitive effects of merging nodes at this stage.

Node weights. A node n in an XG is weighted according to its similarity to the closest question token q_i, i.e., w(n) = max_i sim(label(n), q_i), where label(n) is the text associated with node n, and sim is a plug-in semantic similarity function between two words or phrases. For entities, we are actually interested in equivalence (‘Cruise’ matching ‘Tom Cruise’) rather than relatedness or similarity. As a result, we score sim(label(n), q_i) as the standard Jaccard index (Jaccard similarity coefficient) between E(label(n)) and E(q_i), where E(m) refers to the set of all entities that a particular mention m may refer to (such lookups are possible in mention-entity lexicons curated from Wikipedia (Hoffart et al., 2011)).

Similarity with respect to predicates or classes is defined in the standard manner using cosine similarities between word embeddings from word2vec (Mikolov et al., 2013) (or alternatively GloVe (Pennington et al., 2014) or BERT (Devlin et al., 2019)). Vectors of multiword phrases are averaged before computing similarities. The resultant similarity values are scaled from [-1, 1] to [0, 1] using min-max normalization (for compatibility with the other weights). A Jaccard index computed on words is used for similarity with literals: sim(‘2016’, ‘June 2016’) = 1/2.
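To make the node-weighting scheme concrete, here is a minimal Python sketch (not the paper's implementation): the toy LEXICON is a hypothetical stand-in for a Wikipedia-derived mention-entity lexicon, and scale approximates the min-max normalization of cosine values.

```python
def jaccard(a, b):
    """Standard Jaccard index between two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Toy stand-in for a mention-entity lexicon (hypothetical entries).
LEXICON = {
    "cruise": {"Tom_Cruise", "Cruise_ship", "Penelope_Cruz"},
    "tom cruise": {"Tom_Cruise"},
}

def entity_sim(node_label, question_token):
    """Entity similarity = Jaccard overlap of candidate-entity sets."""
    cands_n = LEXICON.get(node_label.lower(), {node_label})
    cands_q = LEXICON.get(question_token.lower(), {question_token})
    return jaccard(cands_n, cands_q)

def literal_sim(a, b):
    """Literal similarity = Jaccard index over words."""
    return jaccard(a.split(), b.split())

def scale(cosine):
    """Min-max scaling of a cosine value from [-1, 1] to [0, 1]."""
    return (cosine + 1) / 2

print(literal_sim("2016", "June 2016"))  # 0.5
```

Here entity_sim(‘Cruise’, ‘Tom Cruise’) yields 1/3, since the two mentions share one of three candidate entities in the toy lexicon.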

Edge weights. Each edge in an XG is assigned a confidence weight according to the following criteria. All edges in a KG-based XG have uniform weights, as all facts in a KG are considered equally valid; for computational convenience in converting to costs later on, all KG edge weights are set to zero. However, this is not the case for noisy quasi KGs extracted from text. Triple edges in K_D are weighted by term proximity (Cummins and O’Riordan, 2009), i.e., by the minimum distance between S and P, and between P and O, in the sentence from which the triple was extracted. The distance between two tokens is measured as the number of intruding tokens plus one, to avoid zero distances. Since confidence is inversely proportional to the distance between the components of a triple, the weight of an S-P (or P-O) edge is defined as the inverse of this distance. If some S-P (or P-O) pair appears multiple times in D, the corresponding inverse distances are summed, to harness redundancy of evidence. This captures the intuition that the closer two parts appear in text, the higher their subsequent edge weight. Unlike conventional Open IE tools (Angeli et al., 2015; Mausam, 2016), this mechanism allows S-P and P-O pairs to eventually take different weights.
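As an illustration of the proximity weighting, a small sketch under simplifying assumptions (single-token S/P/O phrases; summing inverse distances over sentences reflects our reading of the redundancy rule):

```python
def distance(tokens, a, b):
    """Number of intruding tokens plus one (never zero)."""
    return abs(tokens.index(a) - tokens.index(b))

def proximity_weight(sentences, a, b):
    """Inverse-distance weight, summed over all sentences that
    contain both phrases, to reward redundant evidence."""
    w = 0.0
    for sent in sentences:
        toks = sent.split()
        if a in toks and b in toks:
            w += 1.0 / distance(toks, a, b)
    return w

sents = ["Inarritu directed the western Revenant"]
print(proximity_weight(sents, "Inarritu", "directed"))  # 1.0
```

For the P-O pair (‘directed’, ‘Revenant’), two tokens intrude, so the distance is 3 and the weight 1/3; adjacent S-P pairs get the maximal weight of 1.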

Class edges in K_D are obtained from Hearst patterns (Hearst, 1992); they are all considered equally confident and assigned a weight of one. Alignment edges, which indicate node-node synonymy, are weighted with the same similarity function used for node weights, simply replacing the question token by the other node label. An alignment edge is inserted into an XG only if the similarity exceeds or equals some threshold θ > 0. Zero is not an acceptable value for θ, as that would mean inserting an edge between every pair of possibly unrelated nodes. This alignment insertion threshold can be different for entities (θ_E) and predicates (θ_P), due to the use of different similarity functions. These (and other) hyper-parameters are tuned on a development set of question-answer pairs.

Anchor nodes. A node in an XG is an anchor node if its node weight is greater than or equal to some threshold α > 0. Again, α = 0 is not acceptable, as that would make all nodes in the XG degenerate anchors; α can be different for entities (α_E) and predicates (α_P). Anchors are grouped into sets (equivalence classes): all anchor nodes whose maximum similarity is attained for the same question token q_i are grouped into one set A_i. Anchors thus identify the most question-relevant nodes of the XG, in the vicinity of which answers can be expected to be found. Any node of type entity, predicate, class, or literal can qualify as an anchor.
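The anchor selection and grouping can be sketched as follows (a hypothetical helper; the five-anchors-per-token cap is the one mentioned in Sec. 3.2):

```python
def select_anchors(node_scores, alpha, max_per_token=5):
    """node_scores: {node: (question_token, weight)}, where weight is
    the node's maximum similarity and question_token the token that
    attains it. Returns anchor groups keyed by question token."""
    groups = {}
    for node, (token, weight) in node_scores.items():
        if weight >= alpha:  # anchor threshold alpha > 0
            groups.setdefault(token, []).append((weight, node))
    return {tok: [n for _, n in sorted(cands, reverse=True)[:max_per_token]]
            for tok, cands in groups.items()}

scores = {"Leonardo DiCaprio": ("Leo", 0.9),
          "Leo Awards": ("Leo", 0.5),
          "Academy Award": ("Oscar", 0.8),
          "football": ("Oscar", 0.1)}
print(select_anchors(scores, alpha=0.4))
# {'Leo': ['Leonardo DiCaprio', 'Leo Awards'], 'Oscar': ['Academy Award']}
```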

A glossary of notations is provided for readability in Table 1.

Notation  Definition                        Notation  Definition
K         Knowledge graph (KG)              s         Sentence
D         Document corpus                   q, q_i    Question, question words
K_D       Quasi KG from D                   a         Answer to q
K_het     Heterogeneous source              XG        Context graph for q in K / K_D / K_het
E         Entity                            n         Node
P         Predicate                         e         Edge
C         Class                             T         Node types, edge types
L         Literal                           W         Node weights, edge weights
S         Subject                           type(.)   Mapping fn. from node/edge to type
O         Object                            w(.)      Mapping fn. from node/edge to weight
t         Triple in K / K_D / K_het         θ         Alignment threshold for XG
m         Entity mention                    α         Anchor threshold for XG
A_i       Anchor group                      GST-k     Top-k GSTs
Table 1. Table of notations.

3. Method

Fig. 2 gives an overview of the UniQORN architecture. The two main stages, construction of the question-specific context graph and computation of answers via Group Steiner Trees (GSTs), are described in the following subsections.

Figure 2. System architecture of UniQORN.

3.1. Building the Context Graph (XG)

We describe the XG construction, using our example question for illustration: “director of the western for which Leo won an Oscar?”. Instantiations of key factors in the two settings, KG-QA and Text-QA, are highlighted in Table 2. The heterogeneous setting can be understood as simply the union of the two individual cases.

Scenario KG Text
Triples in XG NER + KG-lookups Retrieval system + Open IE
Node types Entity, predicate, class, literal Entity, predicate, class
Class nodes instance of triples Hearst patterns
Edge types Triple, alignment, class Triple, alignment, class
Entity alignments With KG aliases With entity-mention lexicons
Predicate alignments With word embeddings With word embeddings
Node weights W.r.t. question tokens W.r.t. question tokens
Edge weights Degenerate With term proximity and lexicons
Table 2. Instantiations of different factors of XGs from KG and text corpus.

XG from knowledge graph. Our goal is to reduce the huge KG to a reasonably-sized question-relevant subgraph, over which graph algorithms can be run with interactive response times. We first identify entities in the question by tools for Named Entity Recognition and Disambiguation (NERD) (Ferragina and Scaiella, 2010; Hoffart et al., 2011), obtaining KG entities (Leonardo DiCaprio, Academy Awards). However, as NED is quite error-prone on very short inputs (e.g., ‘Leo’ mapped to Leo Awards, the Canadian film award, and ‘Oscar’ mapped to Oscar, the Brazilian footballer), we also use the NER stage to mark names and phrases in the question that denote potential entity mentions. For these, we use a dictionary of entity-name pairs (Hoffart et al., 2011) to generate a set of top-ranked likely alternatives for the entities of interest (e.g., Leo, Leonardo DiCaprio, pope Leo I). All these candidate entities form the question entity set. Next, all KG triples are fetched that are in the 2-hop neighborhood of one of these entities. (The 2-hop neighborhood of KG entities can be enormously large, especially for popular entities like countries or football clubs with thousands of 1-hop neighbors; this is further exacerbated by proliferations via class nodes: all humans are within two hops of each other on Wikidata.) Finally, the largest connected component (LCC) of the graph is identified, to retain only nodes that are mutually related and reflect the coherent context of the question (e.g., pruning facts about pope Leo I). This LCC forms the context graph XG.

XG from text corpus. Analogous to the KG case, we collect a small set of potentially relevant documents for the question from D, using a commercial search engine (when D is the open Web) or an IR system like ElasticSearch (when D is a fixed corpus such as Wikipedia full-text). To mimic the way we process KGs, we impose structure on the retrieved documents by applying Open IE (Mausam, 2016). This yields a set of SPO triples where each component is a concise textual phrase, without any linking to a KG.

As off-the-shelf tools like Stanford Open IE (Angeli et al., 2015), OpenIE 5.1 (Saha and Mausam, 2018), ClausIE (Del Corro and Gemulla, 2013) and MinIE (Gashteovski et al., 2017) all have limitations regarding either precision, recall, or efficiency, we developed our own custom-made Open IE extractor. We start with POS tagging and NER on the input sentences, and perform light-weight coreference resolution by replacing each third-person personal and possessive pronoun (‘he, him, his, she, her, hers’) by its nearest preceding entity of type PERSON. We then define a set of POS patterns that may indicate an entity or a predicate. Entities are marked by an unbroken sequence of nouns, adjectives, cardinal numbers, or mention phrases from the NER system (e.g., ‘Leonardo DiCaprio’, ‘2016 American western film’, ‘Wolf of Wall Street’). To capture both verb- and noun-mediated relations, predicates correspond to the POS patterns verb, verb+preposition, or noun+preposition (e.g., ‘produced’, ‘collaborated with’, ‘director of’). See node labels in Fig. 0(b) for examples. The patterns are applied to each sentence to produce a markup like <E1> ... <P1> ... <E2> ... <P2> ... <E3>, where the ellipses (...) denote intervening words in the sentence. From this markup, UniQORN creates a triple for every pair of entities that has exactly one predicate between it, here <E1, P1, E2> and <E2, P2, E3> (but excluding a triple over E1 and E3, as two predicates intervene). Patterns from (Yahya et al., 2014), specially designed for noun phrases as relations (e.g., ‘Oscar winner’), are applied as well. The rationale for this heuristic extractor is to achieve high recall towards answers, at the cost of introducing noise. The noise is taken care of in the answering stage later.
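The pairing rule just described can be sketched as follows, on an already-marked-up sentence (the E/P chunk tags are assumed to come from the POS patterns):

```python
from itertools import combinations

def extract_triples(markup):
    """markup: list of ('E', phrase) / ('P', phrase) chunks in sentence
    order. Emit an SPO triple for each entity pair with exactly one
    predicate between them."""
    ent_pos = [i for i, (tag, _) in enumerate(markup) if tag == "E"]
    triples = []
    for i, j in combinations(ent_pos, 2):
        preds = [phrase for tag, phrase in markup[i + 1:j] if tag == "P"]
        if len(preds) == 1:
            triples.append((markup[i][1], preds[0], markup[j][1]))
    return triples

markup = [("E", "Leo"), ("P", "won"), ("E", "Oscar"),
          ("P", "for"), ("E", "The Revenant")]
print(extract_triples(markup))
# [('Leo', 'won', 'Oscar'), ('Oscar', 'for', 'The Revenant')]
```

Note that no triple is emitted for the pair (Leo, The Revenant), since two predicates lie between them.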

Additionally, to extract entity-class information that is often useful in QA (Abujabal et al., 2017; Yavuz et al., 2016; Ziegler et al., 2017), we leverage Hearst patterns (Hearst, 1992), like ‘NP1 such as NP2’ (matched by, say, western films such as The Revenant), ‘NP2 is a(n) NP1’ (matched by, say, The Revenant is a 2015 American western film), or ‘NP2 and other NP1’ (matched by, e.g., Alejandro Iñàrritu and other Mexican film directors). Here NP denotes a noun phrase, detected by a constituency parser. The resulting triples about class membership (of the form <NP2, class, NP1>, e.g., <The Revenant, class, 2015 American western film>) are added to the triple collection. Finally, as in the KG-QA case, all triples are joined on their subjects or objects with exact string matches, and the LCC is computed, to produce the final XG. To compensate for the diversity of surface forms, where different strings may denote the same entity or predicate, alignment edges are inserted into the XG for node pairs as discussed in Sec. 2.2.
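A minimal sketch of two of these Hearst patterns; the crude regex over capitalized or numeric phrases is an assumption standing in for a real constituency parser:

```python
import re

# Crude NP approximation (capitalized or numeric first token), a
# stand-in for constituency-parser noun phrases.
NP = r"([A-Z0-9][\w]*(?: [A-Za-z0-9][\w]*)*)"
PATTERNS = [
    # "NP1 such as NP2"  ->  <NP2, class, NP1>
    (re.compile(NP + r" such as " + NP), (2, 1)),
    # "NP2 is a(n) NP1"  ->  <NP2, class, NP1>
    (re.compile(NP + r" is an? " + NP), (1, 2)),
]

def class_triples(sentence):
    triples = []
    for pattern, (inst, cls) in PATTERNS:
        for m in pattern.finditer(sentence):
            triples.append((m.group(inst), "class", m.group(cls)))
    return triples

print(class_triples("Western films such as The Revenant"))
# [('The Revenant', 'class', 'Western films')]
```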

3.2. Answering over the Context Graph

For the given context graph, we find (candidate) answers as follows. First, nodes in the XG are identified that best capture the question; these nodes become anchors. To this end, we assign node weights proportional to the similarity of the node labels against question tokens. These weights are then thresholded at level α, and all nodes that qualify are selected as anchor nodes (up to a maximum of five per question token). Anchors are grouped into equivalence classes based on the question token that they correspond to. At this stage, we have a directed and weighted context graph, given by its nodes, edges, and the type and weight functions. For simplification, we disregard the direction of edges, turning the XG into an undirected graph.

Group Steiner Tree. The criteria for identifying good answer candidates in the XG are then: (i) answers lie on paths connecting anchors, (ii) shorter paths with higher weights are more likely to contain correct answers, and (iii) at least one anchor from each group must be included (to satisfy all conditions in the question). Formalizing these desiderata leads us to the notion of Steiner Trees (Feldmann and Marx, 2016; Bhalotia et al., 2002; Kacholia et al., 2005; Kasneci et al., 2009): for an undirected weighted graph and a subset of nodes called ‘terminals’, find the tree of least cost that connects all terminals. If the number of terminals is two, this is the weighted shortest path problem; if all nodes of the graph are terminals, it becomes the minimum spanning tree (MST) problem. In our setting, the graph is the XG, and the terminals are the anchor nodes. As anchors are arranged into groups, we pursue the generalized notion of Group Steiner Trees (GST) (Garg et al., 2000; Li et al., 2016; Ding et al., 2007; Shi et al., 2020; Chanial et al., 2018): compute a minimum-cost Steiner Tree that connects at least one terminal from each group, where edge weights are converted into costs as cost(e) = 1 - w(e). At this point, the reader is referred to Fig. 1 again, for illustrations of what GSTs look like (shown in orange).

Formally, the GST problem in our setting can be defined as follows. Given an undirected and weighted graph G = (N, E) and groups of anchor nodes A_1, ..., A_g with each A_i a subset of N, find the minimum-cost tree T* = argmin_T cost(T), where T is any tree that connects at least one node from each of the A_i, and cost(T) = Σ_{e in T} cost(e), with cost(e) = 1 - w(e).

A key fallout of this is that all edges in a KG-based XG now have edge costs of one; hence, cost-optimal GSTs will try to find trees with the least number of edges. For text-based XGs, class edges have the least cost (zero), which implies that adding them to the GST comes at no extra cost, and they are often generously included in the final results. Since class nodes usually bring value to answer quality in connection with expected answer types (Ziegler et al., 2017; Yavuz et al., 2016), we posit that this is actually a good thing. The added value of class edges is also verified experimentally later on.

While finding the GST is an NP-hard problem, there are approximation algorithms (Garg et al., 2000), as well as exact methods that are fixed-parameter tractable with respect to the number of terminal nodes. We adapted the method of Ding et al. (Ding et al., 2007) from the latter family, which is exponential in the number of terminals but polynomial in the graph size. Luckily for us, the number of anchors is typically much less of a bottleneck than the large size of the XG in terms of nodes and edges. Specifically, the terminals are the anchor nodes derived from the question tokens, so their number is not prohibitive with respect to computational costs.

The algorithm is based on dynamic programming and works as follows. It starts from a set of singleton trees, one for each terminal group, rooted at one of the corresponding anchor nodes. These trees are grown iteratively by exploring immediate neighborhoods for least-cost nodes as expansion points. Trees are merged when common nodes are encountered while growing. The trees are stored in priority queues, efficiently implemented with Fibonacci heaps. The process terminates when a GST is found that covers all terminals (i.e., one per group). Bottom-up dynamic programming guarantees that the result is optimal.
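For intuition only (this is not the fixed-parameter algorithm of Ding et al.), a brute-force exact GST on a toy graph: enumerate node subsets that cover every anchor group, and take the cheapest one whose induced subgraph is connected, costed by its minimum spanning tree. The toy edge costs below are illustrative, not taken from the paper.

```python
import heapq
from itertools import combinations

def mst_cost(nodes, edges):
    """Prim's MST cost over the induced subgraph; None if disconnected."""
    nodes = set(nodes)
    adj = {n: [] for n in nodes}
    for (u, v), c in edges.items():
        if u in nodes and v in nodes:
            adj[u].append((c, v))
            adj[v].append((c, u))
    start = next(iter(nodes))
    seen, total = {start}, 0.0
    frontier = list(adj[start])
    heapq.heapify(frontier)
    while frontier and len(seen) < len(nodes):
        c, v = heapq.heappop(frontier)
        if v in seen:
            continue
        seen.add(v)
        total += c
        for item in adj[v]:
            heapq.heappush(frontier, item)
    return total if len(seen) == len(nodes) else None

def group_steiner(nodes, edges, groups):
    """Exact GST by exhaustive search; feasible only for toy graphs."""
    best = None
    nodes = list(nodes)
    for r in range(1, len(nodes) + 1):
        for sub in combinations(nodes, r):
            s = set(sub)
            if not all(s & set(g) for g in groups):
                continue  # must cover one anchor per group
            c = mst_cost(s, edges)
            if c is not None and (best is None or c < best[0]):
                best = (c, s)
    return best

# Toy XG with edge costs (1 - weight), loosely shaped like Fig. 1
edges = {("Leo", "won"): 0.2, ("won", "Oscar"): 0.2,
         ("Oscar", "forWork"): 0.5, ("forWork", "Revenant"): 0.2,
         ("Revenant", "director"): 0.2, ("director", "Inarritu"): 0.2,
         ("Revenant", "genre"): 0.3, ("genre", "western"): 0.3}
nodes = {n for pair in edges for n in pair}
groups = [["Leo"], ["Oscar"], ["western"]]
cost, tree_nodes = group_steiner(nodes, edges, groups)
print(round(cost, 2), "Revenant" in tree_nodes)  # 1.7 True
```

The non-terminal ‘Revenant’ ends up inside the tree connecting the three anchor groups, illustrating how answer candidates emerge as internal GST nodes.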

Relaxation to top-k GSTs. It is possible that the GST simply connects a terminal from each anchor group without having any internal nodes at all, or with only predicates or classes as internal nodes. Since we need entities as answers, such possibilities necessitate a relaxation of our solution to compute a number of top-k least-cost GSTs. GST-k ensures that we always get a non-empty answer set, albeit at the cost of making some detours in the graph. Moreover, using GST-k provides a natural answer ranking strategy, where the score for an answer can be appropriately reinforced if it appears in multiple such low-cost GSTs. This postulate, and the effect of k, is later substantiated in our experiments. Note that since the tree with the least cost is always kept at the top of the priority queue, the trees can be found in increasing order of cost, and no additional sorting is needed. In other words, the priority-queue-based implementation of the GST algorithm (Ding et al., 2007) easily supports this top-k computation. The time and space complexities for obtaining GST-k are the same as those for GST-1. Fig. 3 gives an example of GST-k (k = 3).

Figure 3. Illustrating GST-k (k = 3), showing edge costs and node weights. Anchors (terminals) and answer candidates (non-terminals A1, A2, A3) are shown as black and white circles, respectively. {(C11, C12), (C21, C22, C23), (C31)} represent anchor groups. Edge costs are used in finding GST-k, while node weights are used in anchor selection and answer ranking. A1 is likely to be a better answer due to its presence in two GSTs in the top-3.

Answer ranking. Non-terminals in GSTs are candidates for final answers. However, this mandates ranking. We use the number of top-k GSTs that an answer candidate lies on as the ranking criterion. Alternative choices, like weighted proximity to anchor nodes, are investigated in Sec. 5.2. For the text and KG+text settings, we perform an additional answer post-processing step that is necessary because entities are not canonicalized. Answers that are subsequences of another answer are merged (‘Alejandro Iñàrritu’ merged with ‘Alejandro Gonzáles Iñàrritu’). Further, two answers are merged if they share an alignment edge (like ‘Alejandro’ and ‘Alejandro Iñàrritu’ in Fig. 0(b)). Merged forms are retained as piped forms (‘Alejandro Iñàrritu’|‘Alejandro Gonzáles Iñàrritu’|‘Alejandro’). During evaluation, at least one of the merged answers must match exactly with the gold answer string to be deemed correct.
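A minimal sketch of the counting-based ranking and the merging step (a word-level subset test is used here as an approximation of the subsequence check):

```python
from collections import Counter

def rank_answers(gsts, terminals):
    """Score each non-terminal by the number of top-k GSTs it lies on."""
    counts = Counter(n for tree in gsts for n in tree if n not in terminals)
    return [node for node, _ in counts.most_common()]

def merge_subsequences(answers):
    """Merge answers whose words are contained in a longer answer;
    merged forms are retained as piped strings."""
    groups = []
    for ans in sorted(answers, key=lambda a: -len(a.split())):
        for g in groups:
            if set(ans.split()) <= set(g[0].split()):
                g.append(ans)
                break
        else:
            groups.append([ans])
    return ["|".join(g) for g in groups]

gsts = [{"A1", "C11", "C21"}, {"A1", "C12", "C21"}, {"A2", "C11", "C31"}]
terminals = {"C11", "C12", "C21", "C31"}
print(rank_answers(gsts, terminals))  # ['A1', 'A2']
print(merge_subsequences(["Alejandro", "Alejandro Inarritu"]))
# ['Alejandro Inarritu|Alejandro']
```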

4. Experimental Setup

4.1. Benchmarks

4.1.1. Knowledge sources

As our knowledge graph, we use the full Wikidata (including qualifiers). The dump is GB uncompressed (the index has an additional size of GB), and contains billion triples. Note that there is nothing in our method specific to Wikidata, and it can easily be extended to other KGs. Wikidata was chosen as it is one of the popular choices in the community today, and has an active community that contributes to the growth of the KG, akin to Wikipedia (both are supported by the Wikimedia Foundation). In 2016, Google ported a large volume of the erstwhile Freebase into Wikidata (Pellissier Tanon et al., 2016).

For text, we create a pool of ten Web pages per question, obtained from Google Web search in June 2019. Specifically, we issue the full question as a query to Google Web search, and create a question-specific corpus from the top results obtained. This was done using the Web search option inside the Google Custom Search API (https://developers.google.com/custom-search/v1/overview). This design choice of fetching pages from the Web was made to stay close to the setting of direct answering over the Web, and not be restricted to specific benchmarks that come with associated corpora. This also enables comparing baselines from different QA families on a fair testbed. This was done despite the fact that it entails significant resources, as it has to be done for thousands of questions, and, in addition, these Web documents have to be entity-linked to Wikidata to enable training some of the supervised baselines (details later).

The heterogeneous answering setup was created by considering the above two sources together. To be specific, each question from our benchmark is answered over the entire KG and the corresponding text corpus.

Benchmark | #Questions | Answers
Complex questions from LC-QuAD 2.0 (Dubey et al., 2019) | | Wikidata entities
Complex questions from LC-QuAD 1.0 (Trivedi et al., 2017) | | DBpedia entities, mapped to Wikidata via Wikipedia
Complex questions from WikiAnswers (CQ-W) (Abujabal et al., 2017) | | Freebase entities, mapped to Wikidata via Wikipedia
Complex questions from Google Trends (CQ-T) (Lu et al., 2019) | | Entity mention text, manually mapped to Wikidata
Complex questions from QALD (Usbeck et al., 2018) | | DBpedia entities, mapped to Wikidata via Wikipedia
Complex questions from ComQA (Abujabal et al., 2019) | | Wikipedia URLs, mapped to Wikidata entities
Total number of complex questions | | Wikidata entities
Table 3. Details of our benchmark of complex questions.

4.1.2. Question-answer pairs

We compiled a new benchmark of complex questions, containing multiple entities and relations, from six sources, as described below. To ensure that the benchmark questions pose sufficient difficulty for the systems under test, all questions from the individual sources were manually examined to ensure that they indeed involve multiple entities and relations. Questions that do not have a ground-truth answer in our KG Wikidata, as well as aggregation and existential questions (i.e., questions with counts and yes/no answers, respectively), were also removed. The number of questions finally contributed by each source is shown in Table 3. These are factoid questions: each question usually has one or a few entities as correct answers.

(i) LC-QuAD 2.0 (Dubey et al., 2019): This very recent and large QA benchmark from October 2019 was compiled using crowdsourcing with human annotators verbalizing KG templates that were shown to them. Answers are Wikidata entities.

(ii) LC-QuAD 1.0 (Trivedi et al., 2017): This dataset was curated by the same team as LC-QuAD 2.0, by a similar process, but where the crowdworkers directly corrected an automatically generated natural language question. Answers are DBpedia entities; we linked them to Wikidata via their Wikipedia URLs, which act as bridges between popular curated KGs like Wikidata, DBpedia, Freebase, and YAGO.

(iii) CQ-W (Abujabal et al., 2017): These questions (complex questions from WikiAnswers, henceforth CQ-W) were sampled from WikiAnswers and contain multiple entities or relations. Answers are Freebase entities; we mapped them to Wikidata using Wikipedia URLs.

(iv) CQ-T (Lu et al., 2019): This contains complex questions about emerging (several among them out-of-KG) entities (e.g., people dated by celebrities), created from queries in Google Trends (complex questions from Google Trends, henceforth CQ-T). Answers are text mentions of entities, which we manually map to Wikidata.

(v) QALD (Usbeck et al., 2018): We collated these questions by going through nine years’ worth of QA datasets from the QALD benchmarking campaigns (http://qald.aksw.org/). Answers are DBpedia entities, which we map to Wikidata using Wikipedia links.

(vi) ComQA (Abujabal et al., 2019): These factoid questions, like those in (Abujabal et al., 2017), were also sampled from the WikiAnswers community QA corpus. Answers are Wikipedia URLs or free text, which we map to Wikidata.

Train-dev-test splits. We have supervised methods as baselines, so we need to split our benchmark into training, development, and test sets. Since LC-QuAD 2.0 is an order of magnitude larger than all the other sources combined, we split it into train, dev, and test sets. The five parameters of UniQORN (thresholds and GST-k) are also tuned on this dev set. The questions from the other, smaller benchmarks are directly considered as test questions: this gives us the ability to evaluate the open-domain performance, or generalizability, of the compared methods. At this stage, note that we have Wikidata entity labels as answers to questions from all benchmarks. All questions and ground-truth answers are available at https://uniqorn.mpi-inf.mpg.de/.

4.2. Systems under Test

Setup #Nodes #Edges
Entity Predicate Class Triple Alignment Class
KG N. A.
Table 4. Basic properties of UniQORN’s XGs, averaged over LC-QuAD test questions. Parameters affect these sizes: the best values in each setting were used for these measurements. The graphs typically get bigger and denser as the alignment edge insertion thresholds are lowered. The KG+Text numbers may not be exact sums of the two individual settings, due to the computation of the largest connected component and slight parameter variations. Expectedly, the heterogeneous setting has the largest numbers of nodes and edges.

4.2.1. UniQORN Implementation

UniQORN’s XGs in the various settings are characterized in Table 4, and are built as follows. The WAT API for TagMe (https://services.d4science.org/web/tagme/wat-api) (Ferragina and Scaiella, 2010) was used for named entity recognition and disambiguation (NERD) on the questions. The internal threshold for TagMe was chosen by manual observation, so that we catch as many named entities as possible while pruning out really noisy links. TagMe disambiguates to Wikipedia, and we follow the links to Wikidata using a KG portal. However, since NED can often be erroneous, we boost it with top- entities via expansion with NER. Specifically, we take the entity mention from TagMe’s NER span and find the top- scoring entities for this mention from the scored AIDA mention-entity lexicon (Hoffart et al., 2011) (a 2019 version was obtained from the authors upon request). Entities in the AIDA lexicon are Wikipedia URLs, which we mapped to Wikidata. For building the XG, all -hop neighbors of entities linked by NED and top- NER were included, and the largest connected component was finally retained. Entity alignment edges were added using the same AIDA lexicon. Class edges are obtained from Wikidata instance of triples, like <Academy Award for Best Director, instance of, Academy Awards>.

Context graphs in the text and heterogeneous settings are also built in the manner described earlier (Sec. 3.1). The Stanford parser (https://stanfordnlp.github.io/CoreNLP/) was used for POS tagging (for Open IE) and noun phrase detection (for Hearst patterns). -dimensional Word2vec embeddings pre-trained on the Google News corpus (obtained via the gensim library, https://radimrehurek.com/gensim/index.html) are used for all similarity computations. The five parameters (thresholds and GST-k) were tuned on the dev set using grid search in each of the three settings: (i) for KG (no alignment necessary for KGs); (ii) for text; (iii) for KG+text. The number of GSTs to consider, k, is still orders of magnitude smaller than the total possible number of GSTs in graphs of the sizes shown in Table 4. Effects of these parameters are examined later in Sec. 5.2. All baselines were exposed to the same knowledge source as UniQORN, according to the KG, text, or heterogeneous setting.
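The grid search over these few parameters can be sketched generically; the parameter names in the example grid and the `evaluate` callback are illustrative placeholders, not UniQORN's actual API:

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Exhaustively try every combination in param_grid; evaluate(cfg)
    should return a dev-set quality score (e.g., MRR) to maximize."""
    best_cfg, best_score = None, float('-inf')
    for values in product(*param_grid.values()):
        cfg = dict(zip(param_grid, values))
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Hypothetical grid: anchor/alignment thresholds and the GST cut-off k.
grid = {
    "anchor_threshold": [0.5, 0.7, 0.9],
    "alignment_threshold": [0.6, 0.8],
    "k": [10, 25, 50],
}
```

With five tuned parameters and a handful of candidate values each, such an exhaustive search remains cheap because each configuration is evaluated only on the dev set.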

4.2.2. Baselines for the heterogeneous setup

We use the state-of-the-art systems PullNet (from Google Research) (Sun et al., 2019) and its predecessor GRAFT-Net (Sun et al., 2018) as baselines for the KG+text heterogeneous setup. PullNet was specially designed with complex questions in mind. Since these methods work only on Freebase, we reimplemented them to enable answering over Wikidata.

(i) PullNet: PullNet uses an iterative process to construct a question-specific subgraph, where in each step a graph convolutional neural network (CNN) is used to find subgraph nodes that should be expanded (to support complex questions) using pull operations on the corpus and/or KB. After the expansion, another graph CNN is used to extract the answer from the expanded subgraph.

(ii) GRAFT-Net: GRAFT-Net (Graphs of Relations Among Facts and Text Networks) also uses graph CNNs, specifically designed to operate over heterogeneous graphs of KG facts and entity-linked text sentences. GRAFT-Net uses LSTM-based updates for text nodes and directed propagation using personalized PageRank, and tackles the heterogeneous setting with an early fusion philosophy.

As indicated earlier, PullNet and GRAFT-Net were trained on the training questions, and their parameters were tuned on the dev questions. These methods need entity-linked text for learning their models, for which we used the same text corpora described earlier (ten documents per question), tagged with TagMe using the same threshold as before (see Sec. 4.2.1).

4.2.3. KG-QA baselines

PullNet and GRAFT-Net return KG entities as answers, and can be run in KG-only mode as well, so they naturally serve as KG-QA baselines for us. In addition, we use the systems QAnswer (Diefenbach et al., 2019, 2018) (https://qanswer-frontend.univ-st-etienne.fr/) and Platypus (Tanon et al., 2018) (https://askplatyp.us/) as baselines for KG-QA. As of June 2020, to the best of our knowledge, these are the only systems for Wikidata with sustained online services and APIs, with QAnswer having state-of-the-art performance.

(i) QAnswer: QAnswer is an extremely efficient system that relies on overgenerating SPARQL queries from simple templates; the candidate queries are subsequently ranked, and the best query is executed to fetch the answer. The queries are generated by a fast graph breadth-first search (BFS) algorithm relying on HDT indexing.

(ii) Platypus: Platypus was designed as a QA system driven by natural language understanding, targeting complex questions using grammar rules and template-based techniques. Questions are translated not exactly to SPARQL, but to a custom logical representation inspired by dependency-based compositional semantics.

4.2.4. Text-QA baselines

There are many systems today performing machine reading comprehension (MRC) as QA over text corpora. As text-QA baselines against which to compare UniQORN, we focus on distantly supervised methods. These include neural QA models which are pre-trained on large question-answer collections, with additional input from word embeddings. These methods are well-trained for QA in general, but not biased towards specific benchmark collections. We are interested in robust behavior for ad hoc questions, to reflect the rapid evolution and unpredictability of topics in questions on the open Web; hence this focus on distantly supervised methods. As specific choices, we adopt the well-known DocQA or DocumentQA (Clark and Gardner, 2018) and DrQA (Chen et al., 2017) systems as representatives of robust open-source implementations that can work seamlessly on unseen questions and passage collections. Both methods can deal with multi-paragraph documents, and are deep learning-based systems with large-scale training via Wikipedia and reading-comprehension resources like SQuAD (Rajpurkar et al., 2016) and TriviaQA (Joshi et al., 2017); we run pre-trained QA models on our test sets.

(i) DrQA: This system (from Facebook Research) combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in paragraphs. Since we do not have passages manually annotated with answer spans, we run the DrQA model pre-trained on SQuAD (Rajpurkar et al., 2016) on our test questions, with passages from ten documents as the source for answer extraction. Default configuration settings were used (https://github.com/facebookresearch/DrQA).

(ii) DocQA: The DocQA system (from the Allen Institute for AI) adapted passage-level reading comprehension to the multi-document setting. It samples multiple paragraphs from documents during training, and uses a shared normalization training objective that encourages the model to produce globally correct output. The DocQA model uses a sophisticated neural pipeline consisting of multiple CNN and bidirectional GRU layers, coupled with self-attention. For DocQA, we had a number of pre-trained models to choose from; we use the one trained on TriviaQA (triviaQA-web-shared-norm), as it produced the best results on our dev set. Default configuration was used for the remaining parameters (https://github.com/allenai/document-qa).

4.2.5. Graph baselines

We compare our GST-based method to simpler graph algorithms based on breadth-first search (BFS) and shortest paths (SP) as answering mechanisms:

(i) BFS: In BFS, iterators start from each anchor in a round-robin manner, and whenever at least one iterator from each anchor group meets at some node, that node is marked as a candidate answer. At the end of the iterations, answers are ranked by the number of iterators that met at them.

(ii) SP: In the other graph baseline, shortest paths are computed between every pair of anchors, and answers are ranked by the number of shortest paths they lie on.
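A minimal sketch of the SP baseline, under simplifying assumptions of ours (an unweighted context graph, and a single BFS shortest path per anchor pair, whereas the actual baseline may use edge weights and count all shortest paths):

```python
from collections import deque, Counter

def bfs_shortest_path(adj, src, dst):
    # One shortest path in an unweighted graph, via BFS with predecessors.
    prev = {src: None}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                q.append(v)
    return None  # dst unreachable from src

def sp_rank(adj, anchors):
    """Rank candidate answers by the number of anchor-pair
    shortest paths they lie on."""
    counts = Counter()
    for i, a in enumerate(anchors):
        for b in anchors[i + 1:]:
            path = bfs_shortest_path(adj, a, b)
            if path:
                for node in path[1:-1]:  # interior (non-anchor) nodes only
                    counts[node] += 1
    return counts.most_common()
```

A node that connects all anchors, like the answer candidate in Fig. 3, lies on every pairwise shortest path and is ranked first.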

We fine-tune the BFS and SP baselines by finding the best thresholds for alignment edge insertion and anchor detection for these methods, using the development set.

4.3. Metrics

We use standard metrics for factoid QA (most questions have one or a small number of correct answers): (i) Precision at top-1 (P@1), (ii) Mean Reciprocal Rank (MRR), and (iii) a hit within the top-5 (Hit@5). P@1 is 1 if the first answer is correct, and 0 otherwise. MRR is the reciprocal of the highest rank at which a correct answer is observed. Finally, Hit@5 is 1 if the top-5 contains a correct answer, and 0 otherwise. P@1 and MRR can only be evaluated for ranking-based systems (PullNet, GRAFT-Net, DocumentQA, DrQA, BFS, SP, and UniQORN). So while Hit@5 may appear to be a lenient metric, it is the only way we can compare results with systems that return an unordered set (QAnswer, Platypus). For UniQORN, we take the top-5 results to be this “set”. While these systems may return many more than five results, depending on what is obtained by executing the best SPARQL query, we believe that computing Hit at larger cut-offs may provide an unfair advantage to UniQORN. For systems that return unordered answer sets (QAnswer, Platypus), we thus give them the advantage of setting Hit@5 to 1 if the entire result set contains at least one correct answer. All metrics are macro-averaged over all test questions.
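These per-question metrics are straightforward to compute; a sketch (macro-averaging over questions is then a simple mean of the per-question values):

```python
def precision_at_1(ranked, gold):
    """1 if the top-ranked answer is correct, else 0."""
    return 1.0 if ranked and ranked[0] in gold else 0.0

def mrr(ranked, gold):
    """Reciprocal of the highest rank at which a correct answer appears."""
    for rank, ans in enumerate(ranked, start=1):
        if ans in gold:
            return 1.0 / rank
    return 0.0

def hit_at_5(ranked, gold):
    """1 if any of the top-5 answers is correct, else 0."""
    return 1.0 if any(a in gold for a in ranked[:5]) else 0.0
```

For example, a correct answer at rank 2 yields P@1 = 0, MRR = 0.5, and Hit@5 = 1.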

5. Experimental Results

5.1. Key Findings

Method KG Text KG+Text
P@1 MRR Hit@5 P@1 MRR Hit@5 P@1 MRR Hit@5
QAnswer (Diefenbach et al., 2018) - - - - - - - -
Platypus (Tanon et al., 2018) x x - - - - - -
DocQA (Clark and Gardner, 2018) - - - x x - - -
DrQA (Chen et al., 2017) - - - - - -
PullNet (Sun et al., 2019) - - -
GRAFT-Net (Sun et al., 2018) - - -
BFS (Kasneci et al., 2009)
Table 5. Comparison of UniQORN and baselines over the LC-QuAD test set. The best value per column is in bold. An asterisk (*) indicates statistical significance of UniQORN over the best baseline in that column. A hyphen (‘-’) indicates that the corresponding baseline cannot be applied to that particular setting. An ‘x’ denotes the particular metric cannot be computed.
Method KG (Hit@5) Text (P@1) KG+Text (MRR)
QAnswer - - - - - - - - - -
Platypus - - - - - - - - - -
DocQA - - - - - - - - - -
DrQA - - - - - - - - - -
PullNet - - - - -
GRAFT-Net - - - - -
Table 6. Comparison of UniQORN and baselines for the LC-QuAD 1.0, ComQA, CQ-W, CQ-T, and QALD datasets. The best value per column is in bold. An asterisk (*) indicates statistical significance of UniQORN over the best baseline in that column. ‘-’ indicates that the corresponding baseline cannot be applied to this setting.

Our main results are presented in Tables 5 and 6. The following discusses our key findings. Some of the baselines return only one answer (DocQA; MRR and Hit@5 cannot be computed) or an unordered set (QAnswer and Platypus; P@1 and MRR cannot be computed), resulting in cases with missing metrics in certain rows, denoted by an ‘x’. All tests of statistical significance hereafter correspond to the two-tailed paired t-test.

(i) UniQORN yields competitive results for most of the settings considered: This performance is observed across the board, by knowledge source, benchmark, baseline (including neural models like PullNet with large-scale training), and metric. Achieving this across settings is a success, given that the competitors for most paradigms do not have any entries in the others (as seen from the numerous hyphens in the table). Note that PullNet and GRAFT-Net cannot deal with unseen KG entities at test time, which is very often the case for benchmarks. To be fair to these strong baselines, we allowed them access to the questions in the test set during training, so that appropriate entity embeddings could be learnt. Additionally, we have only considered questions where their final classifier receives correct input (i.e., with the ground truth in the entity set) from the previous classification steps. If this were not allowed, their performance would drop drastically: MRRs of and for PullNet, the better of the two methods, for KG+Text and KG, respectively. Table 6 tests open-domain performance: the ability to answer questions on benchmarks (here, the multiple smaller ones) different from the one trained on (the largest one).

Unlike many other systems, GRAFT-Net and PullNet can indeed take into account information in qualifiers (these were originally designed for Freebase CVTs; we used analogous ideas when reimplementing them for Wikidata qualifiers). We attribute the relatively poor performance of the complex QA system PullNet in some scenarios to its tackling of only certain classes of complex questions: namely, the multi-hop ones (“brother of director of Inception?”). However, there are also other common classes (“coach who managed both Liverpool and Barcelona?”). UniQORN’s ability to handle arbitrary types of questions with multiple entities and relations contributes to these improved numbers. This is a reason why GRAFT-Net systematically shows better performance than PullNet, even though the latter is an extension of the former.

(ii) Computing GSTs on a question-specific context graph is a preferable alternative to SPARQL-based systems over RDF graphs: This is seen from the Hit@5 of UniQORN versus those of QAnswer and Platypus, the two SPARQL-based systems. Let us examine this in more detail. For KGs, QAnswer is clearly the stronger baseline. It operates by mapping names and phrases in the question to KG concepts, enumerating all possible SPARQL queries connecting these concepts that return a non-empty answer, followed by a query ranking phase. This approach works really well for simple questions with relatively few keywords. However, the query ranking phase becomes a bottleneck for complex questions with multiple entities and relations, as there are too many possibilities. This is why reliance on a single best SPARQL query may be a bad choice. Additionally, unlike UniQORN, QAnswer cannot leverage any qualifier information in Wikidata. This is due both to the complexity of the SPARQL queries necessary to tap into such records, and to an explosion in the number of query candidates if qualifier attributes are allowed into the picture. Our GST establishes a common question context by a joint disambiguation of all question phrases, and smart answer ranking helps to cope with the noise and irrelevant information in the XG. While QAnswer is completely syntax-agnostic, Platypus is the other extreme: it relies on accurate dependency parses of the NL question and hand-crafted question templates to come up with the best logical form. This performs impressively when questions fit the supported set of syntactic patterns, but is brittle when exposed to a multitude of open formulations from an ensemble of QA benchmarks. GRAFT-Net performing better than QAnswer also attests to the superiority of graph-based search over SPARQL for complex questions.

(iii) UniQORN’s results are better than those of the MRC systems: It is remarkable that UniQORN significantly outperforms the reading comprehension systems DocQA and DrQA. Both systems use sophisticated neural networks with extensive training and leverage large resources like Wikipedia, SQuAD, or TriviaQA, whereas UniQORN is completely unsupervised and can be readily applied to other benchmarks with no further tuning. UniQORN’s success comes from its unique ability to stitch together evidence from multiple documents to faithfully answer a question. The text-QA baselines can also handle multiple passages, but the assumption is still that a single passage or sentence contains the correct answer and all question conditions. This is flawed: as a consequence, the accuracy of these systems is often gained by satisfying only part of the question (for “Who directed Braveheart and Paparazzi?”, spotting only the director of Braveheart suffices). Also, DocQA and DrQA are not able to address the levels of complexity demanded by our challenging benchmarks.

(iv) Graph-based methods open up the avenue for this unified answering process: What is especially noteworthy is that the joint optimization of interconnections between anchors via GSTs is essential and powerful. Use of GSTs is indeed required for the best performance, and cannot be easily approximated with simpler graph methods like BFS and SP. It is worthwhile to note that in the open-domain experiments (Table 6), at least for the text setting, BFS performs notably well (ComQA, CQ-W, CQ-T, QALD). BFS tree search is actually a building block for many Steiner Tree approximation algorithms (Kasneci et al., 2009). ShortestPaths (SP) also performs respectably in these settings, as notions of shortest paths are also implicit in GSTs. The brittleness of these ad hoc techniques is exposed in the larger and more challenging settings of KG and KG+Text.

We show representative examples from the various benchmarks in Table 7, to give readers a feel for the complexity of information needs that UniQORN can handle.

KG Text KG+Text
“Which character in Star Wars is killed by Luke Skywalker?” (LC-QuAD 2.0) “What city is the twin city of Oslo and also the setting for “A Tree Grows in Brooklyn”?” (LC-QuAD 2.0) “Who is the master of Leonardo da Vinci, who was employed as a sculptor?” (LC-QuAD 2.0)
“What is the position held by Lyndon B. Johnson by replacing Richard Nixon?” (LC-QuAD 2.0) “Who replaced Albus Dumbledore as headmaster of Hogwarts?” (LC-QuAD 2.0) “What is the name of the reservoir the Grand Coulee Dam made?” (LC-QuAD 2.0)
“Which is the new religious movement in the standard body of the Religious Technology Center and that contains the word Scientology in it’s name?” (LC-QuAD 2.0) “Which toponym of Classical Cepheid variable has part of constellation that is Cepheus?” (LC-QuAD 2.0) “What Empire used to have Istanbul as its capital?” (LC-QuAD 2.0)
“Jon Voight was nominated for what award for his work on Anaconda?” (LC-QuAD 2.0) “Which tributary of river inflows from Menindee Lakes?” (LC-QuAD 2.0) “What is the jurisdiction of the area of East Nusa Tenggara Province?” (LC-QuAD 2.0)
“What military branch of the air force did Yuri Gagarin work for?” (LC-QuAD 2.0) “What is the namesake of Maxwell relations, whose place of work is Cambridge?” (LC-QuAD 2.0) “Tell me which logographic writing system is used by the Japanese.” (LC-QuAD 2.0)
“What is a country in Africa that borders Sedan and Kenya ?” (ComQA) “To which artistic movement did the painter of The Three Dancers belong?” (QALD) “Which high school located in Texas did Baker Mayfield play for?” (CQ-T)
“Which Scottish singer founded Frightened Rabbit and studied at the Glasgow School of Art?” (CQ-T) “Who played for Barcelona and managed Real Madrid?” (CQ-T) “Which American singer is an uncle of Kevin Love and co-founded The Beach Boys?” (CQ-T)
“What is the television show whose cast members are Jeff Conaway and Jason Carter?” (LC-QuAD 1.0) “Who is Diana Ross ’s daughter ’s father?” (ComQA) “What is the president whose lieutenants are Winston Bryant and Joe Purcell?” (LC-QuAD 1.0)
“Whose opponents are Ike Clanton and Billy Clanton?” (LC-QuAD 1.0) “Who portrays a chemistry teacher and drug dealer in a famous TV series?” (QALD) “Name a office holder whose predecessor is Edward Douglass White and has son named Charles Phelps Taft II?” (LC-QuAD 1.0)
“Name the movie whose director is Stanley Kubrick and editor is Anthony Harvey?” (LC-QuAD 1.0) “Which senator served in military units VMF-155 and VMF-218?” (LC-QuAD 1.0) “What is the field of the Jon Speelman and occupation of the Karen Grigorian?” (LC-QuAD 1.0)
Table 7. Anecdotal examples from all the datasets where UniQORN was able to compute a correct answer at the top rank (P@1 = 1), but none of the baselines could.

5.2. Insights and Analysis

Configuration KG Text KG+Text
No NED expansion -
No types
No entity alignment -
No predicate alignment -
Table 8. Pipeline ablation results on the full LC-QuAD 2.0 test set with MRR. Best values per column are in bold. Statistically significant differences from the full configuration are marked with *.

Ablation experiments. To get a proper understanding of UniQORN’s successes and failures, it is important to systematically ablate its pipeline configuration (Table 8). We note the following:

  • [Row 1] The full pipeline of UniQORN achieves the best performance, showing that all cogs in its wheel play a vital role, as detailed below.

  • [Row 2] “No NED expansion” implies directly using entities from NED for KG-lookups. We find that expanding lookups for XG construction with NER + lexicons compensates for NED errors substantially;

  • [Row 3] Using entity types is useful, as seen by the significant drops of to for KG+Text and to for text;

  • [Rows 4 and 5] On-the-fly entity and predicate alignments are crucial for addressing variability in surface forms for text-QA (also, as a result, for KG+Text).

Answer ranking KG Text KG+Text
GST counts
GST costs
GST node weights
Anchor distances
Weighted anchor distances
Table 9. Different answer ranking results on the LC-QuAD 2.0 test set on MRR. Best values per column are in bold. Statistically significant differences from the first row are marked with *.

Answer ranking. UniQORN ranks answers by the number of different GSTs that they occur in [Row 1]. However, there are natural variants that leverage a weighted sum rather than a simple count (Table 9). This weight can come from the following sources: (i) the total cost of the GST, the less the better [Row 2]; (ii) the sum of node weights in the GST, reflecting question relevance, the more the better [Row 3]; (iii) total distances of answers to the anchors in the GST, the less the better [Row 4]; and (iv) a weighted version of anchor distances, when edge weights are available, the less the better [Row 5]. We find that using plain GST counts works surprisingly well. Zooming in to the GST and examining anchor proximities is not necessary (and actually hurts significantly for KG+Text).
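The count-based criterion and two of the weighted variants can be sketched uniformly; here each GST is represented by a dict whose fields (tree cost, node-weight sum, answer candidates in the tree) are our stand-ins for the quantities described above:

```python
from collections import defaultdict

def score_answers(gsts, scheme="count"):
    """Score answer candidates across the top-k GSTs under different schemes."""
    scores = defaultdict(float)
    for t in gsts:
        for ans in t["answers"]:
            if scheme == "count":          # plain GST counts (Row 1)
                scores[ans] += 1.0
            elif scheme == "inv_cost":     # cheaper trees count more (Row 2)
                scores[ans] += 1.0 / t["cost"]
            elif scheme == "node_weight":  # question-relevant trees count more (Row 3)
                scores[ans] += t["node_weight"]
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

An answer appearing in two of the top-3 GSTs, as in Fig. 3, outranks one appearing in a single tree under the count scheme, regardless of tree costs.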

Error scenario KG Text KG+Text
Answer not in corpus -
Answer in corpus but not in XG
Answer in XG but not in top- GSTs
Answer in top- GSTs but not in top-
Answer in top- but not in top-
Table 10. Percentages of different error scenarios where the answer is not in top-, averaged over the full LC-QuAD 2.0 test set.

Error analysis. In Table 10, we extract all questions for which UniQORN produces an imperfect ranking, and discuss the cases in a cascaded style. Each column adds up to 100%, and reports the distribution of errors for the specific setting. Note that values across rows are not comparable. We make the following observations:

(i) [Row 1] indicates the sub-optimal ranking of Google’s Web retrieval with respect to complex questions. Strictly speaking, this is out of scope for UniQORN; nevertheless, an ensemble of search engines (Bing + Google) may help improve answer coverage. This cell is null for KGs, as all the sampled LC-QuAD questions were guaranteed to have answers in Wikidata.

(ii) [Row 2] indicates answer presence in the underlying knowledge source but not in the XG, and points to NED/NER errors for KG, and incorrect phrase matches for text. The largest connected component (LCC) computation is also a likely culprit at this stage, pruning away question-relevant entities in favor of more popular ones. Reducing reliance on NERD systems and computing larger neighborhood joins as opposed to LCCs could be useful here.

(iii) [Row 3] Presence of an answer in the XG but not in the top-k GSTs usually indicates incorrect anchor matching. This could be due to problems in the similarity functions for entities (currently based on Jaccard overlaps of mention-entity lexicons, which are inherently incomplete) and predicates. For example, incorrect predicate matching can happen due to shortcomings of embedding-based similarity (‘married’ and ‘divorced’ having a very high similarity, making the latter an anchor too when the former is in the question). We had conducted extensive experiments on hyperparameter-based linear combinations of various similarity measures for both entities and predicates (word Jaccard, mention-entity lexical Jaccard, word2vec, lexicon priors based on anchor links) to see what worked best: the single choices mentioned above performed best and were used. But clearly, there is much room for improvement with respect to these similarity functions, as both Rows 2 and 3 are affected by this.

(iv) [Rows 4 and 5] represent cases where the answer is in the top-k GSTs but languishes at lower ranks among the candidates. Exploring weighted rank aggregation, tuned on the development set with the variants in Table 9, is a likely remedy. A high volume of errors in this bucket actually has a positive outlook: the core GST algorithm generally works well, and significant performance gains can be obtained by fine-tuning the ranking function with additional parameters.

GST Ranks | Avg. #Docs in GST (KG / Text / KG+Text) | % of Questions with Answers in GST (KG / Text / KG+Text)
1-100 -
101-200 -
201-300 -
301-400 -
401-500 -
#Docs in GST | Avg. Rank of GST (KG / Text / KG+Text) | % of Questions with Answers in GST (KG / Text / KG+Text)
1 - -
2 - -
3 - -
4 - -
5 - -
Table 11. Effect of multi-document evidence, shown via edge contributions by distinct documents to GSTs, for the LC-QuAD test set.
Figure 4. Results of parameter tuning for different settings of UniQORN on the LC-QuAD 2.0 development set: (a) KG - Anchors; (b) KG - GSTs; (c) Text - Alignments; (d) Text - Anchors; (e) Text - GSTs; (f) KG+Text - Alignments; (g) KG+Text - Anchors; (h) KG+Text - GSTs.

Usage of multi-document evidence. For text-QA, UniQORN improved over DrQA and DocQA on all benchmarks, even though the latter are supervised deep learning methods trained on the large SQuAD and TriviaQA datasets. This is because reading comprehension (RC) QA systems search for the best answer span within a passage, and do not work well unless the passage matches the question tokens and contains the answer. While DrQA and DocQA can additionally select a good set of documents from a collection, they still rely on a single best document to extract the answer from. UniQORN, by joining fragments of evidence across documents via GSTs, improves over these methods without any training. UniQORN benefits from multi-document evidence in two ways:

  • Confidence in an answer increases when all conditions for correctness are indeed satisfiable (and found) only when looking at multiple documents. This increases the answer’s likelihood of appearing in some GST.

  • Confidence in an answer increases when it is spotted in multiple documents. This increases its likelihood of appearing in the top-k GSTs, as presence in multiple documents increases weights and lowers costs of the corresponding edges.

A detailed investigation of the use of multi-document information is presented in Table 11. We make the following observations: (i) Looking at the “Avg. #Docs in GST” columns in the upper half, we see that considering the top-k GSTs is worthwhile, as all the rank bins combine evidence from multiple documents on average. This is measured by labeling the edges in GSTs with the identifiers of the documents that contribute the corresponding edges. (ii) Moreover, the bins contain the correct answer uniformly often (corresponding “% of Questions with Answers in GST” columns). (iii) The bottom half of the table inspects the inverse phenomenon, and finds that considering only the top few GSTs is not sufficient for aggregating multi-document evidence. (iv) Finally, there is a sweet spot for GSTs aggregating nuggets from multiple documents to contain correct answers, and this turns out to be around three documents (see the corresponding “% of Questions with Answers in GST” columns). This, however, is an effect of the questions in our benchmarks, which are not complex enough to require stitching evidence across more than three documents. Deep-learning-based RC methods over text can handle syntactic complexity very well, but are typically restricted to identifying answer spans from a single text passage. DrQA and DocQA could not properly tap the answer evidence that comes from combining cues spread across multiple documents. Analogous results for KG and KG+Text are provided for completeness.
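The per-GST document counts behind Table 11 can be sketched as follows, with an assumed edge-to-documents mapping (the actual data layout in UniQORN may differ):

```python
def docs_in_gst(gst_edges, edge_docs):
    """Number of distinct documents contributing edges to one GST.

    gst_edges: iterable of (u, v) node pairs forming the tree
    edge_docs: dict mapping (u, v) -> set of contributing document ids
    """
    seen = set()
    for edge in gst_edges:
        seen |= edge_docs.get(edge, set())
    return len(seen)
```

Averaging this count over the GSTs in a rank bin yields the “Avg. #Docs in GST” columns of the table.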

Parameter tuning. There are only five parameters in UniQORN: the alignment-edge insertion thresholds on node-node similarity (one for entities and one for predicates), the analogous anchor selection thresholds on node weight, and the number k of GSTs. Variation of these parameters against MRR is shown in Fig. 4 for the LC-QuAD development set. We observe that: (i) having several alignment edges in the graph (valid only for the text and KG+text setups) for entities (corresponding to low thresholds) actually helps performance despite apparently inviting noise (dark zones indicate good performance). For predicates, relatively high thresholds work best to avoid spurious matches. Note that the exact values on the x- and y-axes of the heat maps are not comparable due to the different similarity functions used in the two cases. The predicate-threshold exploration is confined to the zone in which the graph algorithm remained tractable (with a generous cut-off time of thirty minutes to produce the answer); (ii) the situation is similar for anchor thresholds: low thresholds work best for entities, while UniQORN is not particularly sensitive to the exact choice for predicates. The specifics vary across the setups, but as a general guideline, most non-extreme choices work fine. The white zones in the top-right corners correspond to setting both thresholds to very high values (almost no anchors qualify, resulting in zero performance); (iii) going beyond the chosen number of GSTs (for all setups) gives only diminishing returns.
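To make the role of these thresholds concrete, the following toy sketch inserts an alignment edge between two nodes whenever their similarity clears a threshold; plain token-level Jaccard stands in here for the similarity functions actually used (mention-entity lexicons for entities, embeddings for predicates):

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two node labels."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def alignment_edges(labels, threshold):
    """All node pairs whose similarity reaches the insertion threshold."""
    edges = []
    for i, u in enumerate(labels):
        for v in labels[i + 1:]:
            sim = jaccard(u, v)
            if sim >= threshold:
                edges.append((u, v, sim))
    return edges
```

Lowering the threshold admits more (and noisier) alignment edges; per Fig. 4, this helps for entities but invites spurious matches for predicates.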

Effect of Open IE. UniQORN’s noisy triple extractor results in quasi KGs that contain the correct answer for a substantially larger fraction of questions on CQ-W and CQ-T than quasi KGs built instead from Stanford OpenIE (Angeli et al., 2015) triples. Thus, in the context of QA, losing information with fewer, precision-oriented triples definitely hurts more than adding potentially many noisy ones.
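The precision-versus-recall trade-off can be illustrated with a deliberately crude contrast: a strict, precision-oriented extractor that fires on one exact pattern, versus a loose, recall-oriented one that keeps every verb-mediated token pair. Both the pattern and the verb list below are invented for this sketch and bear no relation to the actual extractors:

```python
import re

# Invented verb list, for illustration only.
VERBS = {"directed", "married", "played", "won"}

def strict_triples(sentence):
    """Precision-oriented: one triple, only for an exact 'X <verb> Y' form."""
    m = re.fullmatch(r"(\w+) (directed|married|played|won) (\w+)\.?", sentence)
    return [(m.group(1), m.group(2), m.group(3))] if m else []

def loose_triples(sentence):
    """Recall-oriented: every (left-token, verb, right-token) combination."""
    tokens = sentence.rstrip(".").split()
    out = []
    for i, tok in enumerate(tokens):
        if tok in VERBS:
            for left in tokens[:i]:
                for right in tokens[i + 1:]:
                    out.append((left, tok, right))
    return out
```

On ‘Curtiz directed Casablanca and won an Oscar’, the strict extractor yields nothing, while the loose one yields many candidates, among them the useful (Curtiz, directed, Casablanca); the GST stage is what copes with the accompanying noise.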

(a) Example GST for “Which aspiring model split with Chloe Moretz and is dating Lexi Wood?”
(b) Example GST for “Who played for FC Munich and was born in Karlsruhe?”
(c) Example GST for “Which actor is married to Kaaren Verne and played in Casablanca?”
Figure 5. GSTs construct interpretable evidence for complex questions.

GSTs contribute to explainability. Finally, we posit that Group Steiner Trees over question-focused context graphs help an end-user understand the process of answer derivation. We illustrate this using three anecdotal examples of GSTs for text-QA, in Figs. 5(a) through 5(c); the corresponding question is given in each subfigure caption. Anchor nodes are in orange, and answers in light blue. Entities, predicates, and class nodes are shown in rectangles, ellipses, and diamonds, respectively. Class edges are dashed, while triple and alignment edges are solid.
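To make the evidence-construction step concrete: given terminal groups (the sets of anchor nodes matched per question concept), a Group Steiner Tree is a minimum-cost tree touching at least one node of every group, and the non-anchor nodes it passes through become answer candidates. The brute-force sketch below is for illustration only and is feasible only on tiny toy graphs; UniQORN uses a fixed-parameter tractable exact algorithm instead. The toy graph and its edge weights are invented, loosely following the example of Fig. 5(c):

```python
from itertools import combinations

def mst(nodes, edges):
    """Kruskal's MST on the subgraph induced by `nodes`.
    Returns (cost, tree_edges), or None if that subgraph is disconnected."""
    nodes = set(nodes)
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    cost, tree = 0, []
    induced = [e for e in edges if e[0] in nodes and e[1] in nodes]
    for u, v, w in sorted(induced, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            cost += w
            tree.append((u, v, w))
    return (cost, tree) if len(tree) == len(nodes) - 1 else None

def group_steiner_tree(all_nodes, edges, groups):
    """Cheapest tree covering at least one node from every terminal group."""
    best = None
    for r in range(1, len(all_nodes) + 1):
        for subset in combinations(all_nodes, r):
            chosen = set(subset)
            if not all(chosen & group for group in groups):
                continue  # some terminal group left uncovered
            result = mst(chosen, edges)
            if result and (best is None or result[0] < best[0]):
                best = result
    return best

# Toy context graph for "Which actor is married to Kaaren Verne
# and played in Casablanca?" (edge weights are illustrative).
edges = [
    ("Kaaren Verne", "Peter Lorre", 1),  # spouse triple
    ("Peter Lorre", "Casablanca", 1),    # cast-member triple
    ("Kaaren Verne", "Hollywood", 2),    # noisy evidence
    ("Hollywood", "Casablanca", 2),      # noisy evidence
]
nodes = ["Kaaren Verne", "Peter Lorre", "Casablanca", "Hollywood"]
groups = [{"Kaaren Verne"}, {"Casablanca"}]  # anchors per question concept

cost, tree = group_steiner_tree(nodes, edges, groups)
# The cheapest tree (cost 2) runs through the answer node "Peter Lorre".
```

The tree itself, with its labeled triple edges, is exactly the user-interpretable evidence referred to above.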

Figure 6. Runtimes of UniQORN on the LC-QuAD 2.0 test set: (a) XG construction time, (b) GST computation time, and (c) total time.

Runtime analysis. To conclude our detailed introspection into the inner workings of UniQORN, we examine the distribution of runtimes. Two main components contribute to the total runtime: the construction of the context graph XG, and the computation of the GSTs over the XG. We find that the first step is usually the more expensive of the two, as it involves a large number of similarity computations. The use of the fixed-parameter tractable exact algorithm for GSTs helps achieve interactive response times for a large number of questions. As expected, efficiency in the KG-only setup is much better than in the noisier text and heterogeneous setups, as the graphs are much bigger in the latter cases. Nevertheless, runtimes can get quite large for several questions, and reducing them is the most promising direction for future work.

6. Related Work

6.1. QA over Knowledge Graphs

The inception of large KGs like Freebase (Bollacker et al., 2008), YAGO (Suchanek et al., 2007), DBpedia (Auer et al., 2007), and Wikidata (Vrandečić and Krötzsch, 2014), gave rise to question answering over knowledge graphs (KG-QA) that typically provides answers as single entities or entity lists from the KG. KG-QA is now an increasingly active research avenue, where the goal is to translate an NL question into a structured query, usually in SPARQL syntax or an equivalent logical form, that is directly executable over the KG triple store containing entities, predicates, classes and literals (Wu et al., 2020; Qiu et al., 2020; Vakulenko et al., 2019; Bhutani et al., 2019; Christmann et al., 2019). Challenges in KG-QA include entity disambiguation, bridging the vocabulary gap between phrases in questions and the terminology of the KG, and finally, constructing the SPARQL(-like) query. Early work on KG-QA built on paraphrase-based mappings and question-query templates that typically had a single entity or a single predicate as slots (Berant et al., 2013; Unger et al., 2012; Yahya et al., 2013). This direction was advanced by (Bast and Haussmann, 2015; Bao et al., 2016; Abujabal et al., 2017; Hu et al., 2017), including templates that were automatically learnt from graph patterns in the KG. Unfortunately, this dependence on templates prevents such approaches from coping with arbitrary syntactic formulations in a robust manner. This has led to the development of deep learning-based methods with sequence models, and especially key-value memory networks (Xu et al., 2019, 2016; Tan et al., 2018; Huang et al., 2019; Chen et al., 2019; Jain, 2016). These have been most successful on benchmarks like WebQuestions (Berant et al., 2013) and QALD (Usbeck et al., 2018). However, all these methods critically build on sufficient amounts of training data in the form of question-answer pairs. In contrast, UniQORN is fully unsupervised and needs neither templates nor training data.

Complex question answering is an area of intense focus in KG-QA now (Lu et al., 2019; Bhutani et al., 2019; Qiu et al., 2020; Ding et al., 2019; Vakulenko et al., 2019; Hu et al., 2018; Jia et al., 2018), where the general approach is often guided by the existence and detection of substructures for the executable query. UniQORN treats this as a potential drawback and adopts a joint disambiguation of question concepts using algorithms for Group Steiner Trees, instead of looking for nested question units that can be mapped to simpler queries. Approaches based on question decomposition (explicit or implicit) are brittle due to the huge variety of question formulation patterns (especially for complex questions), and are particularly vulnerable when questions are posed in telegraphic form (“oscar-winning nolan films?” has to be interpreted as “Which films were directed by Nolan and won an Oscar?”, which is highly non-trivial). Another branch of complex KG-QA arose from the task of knowledge graph reasoning (KGR) (Das et al., 2018; Qiu et al., 2020; Dhingra et al., 2020; Cohen et al., 2020; Zhang et al., 2018), where the key idea is: given a KG entity (Albert Einstein) and a textual relation (‘nephew’), find the best KG path from the input entity to the target (answer) entity. This can be generalized into a so-called QA task where the topic (question) entity is known and the question is assumed to be a paraphrase of a multi-hop KG relation (under the assumption that ‘nephew’ is not directly a KG predicate). Nevertheless, this is a restricted view of complex KG-QA that only deals with such indirection or chain questions (‘nephew’ has to be matched with sibling followed by child in the KG), evaluated on truncated subsets of the full KG that typically lack the complexity of qualifier triples.

6.2. QA over Text

Originally, in the late 1990s and early 2000s, question answering considered textual document collections as its underlying source. Classical approaches based on statistical scoring (Ravichandran and Hovy, 2002; Voorhees, 1999) extracted answers as short text units from passages or sentences that matched most cue words from the question. Such models made intensive use of IR techniques for scoring of sentences or passages and aggregation of evidence for answer candidates. IBM Watson (Ferrucci et al., 2010), a thoroughly engineered system that won the Jeopardy! quiz show, extended this paradigm by combining it with learned models for special question types. TREC ran a QA benchmarking series from 1999 to 2007, and more recently revived it as the LiveQA (Agichtein et al., 2015) and Complex Answer Retrieval (CAR) tracks (Dietz et al., 2017).

Machine reading comprehension (MRC) was originally motivated by the goal of testing whether algorithms actually understood textual content. This eventually became a QA variation where a question needs to be answered as a short span of words from a given text paragraph (Rajpurkar et al., 2016; Yang et al., 2018), and is different from the typical fact-centric answer-finding task in IR. Exemplary approaches in MRC that extended the original single-passage setting to a multi-document one can be found in DrQA (Chen et al., 2017) and DocumentQA (Clark and Gardner, 2018), among many others. In the scope of such text-QA, we compared with, and outperformed, both of these models, which can select relevant documents as well as extract answers from them. Traditional fact-centric QA over text and multi-document MRC are recently emerging as a joint topic referred to as open-domain question answering (Lin et al., 2018; Dehghani et al., 2019; Wang et al., 2019). Open-domain QA tries to combine an IR-based retrieval pipeline with NLP-style reading comprehension, to produce crisp answers extracted from passages retrieved on the fly from large corpora.

6.3. QA over Heterogeneous Sources

Limitations of QA over KGs have led to a revival of considering textual sources, in combination with KGs (Savenkov and Agichtein, 2016; Xu et al., 2016; Sun et al., 2018, 2019). Early methods like PARALEX (Fader et al., 2013) and OQA (Fader et al., 2014) supported noisy KGs in the form of triple spaces compiled via Open IE (Mausam, 2016) on Wikipedia articles or Web corpora. TupleInf (Khot et al., 2017) extended and generalized the Open-IE-based PARALEX approach to complex questions, but is geared specifically towards multiple-choice answer options, and is thus inapplicable to our task. TAQA (Yin et al., 2015) is another generalization of Open-IE-based QA, constructing a KG of n-tuples from Wikipedia full-text and question-specific search results; unfortunately, this method is restricted to questions with prepositional and adverbial constraints only. WebAsKB (Talmor and Berant, 2018) is an MRC-inspired method that addresses complex questions by decomposing them into a sequence of simple questions, but relies on large-scale crowdsourced training data. Some methods for such hybrid QA start with KGs as a source of candidate answers and use text corpora like Wikipedia or ClueWeb as additional evidence (Xu et al., 2016; Das et al., 2017; Sun et al., 2018, 2019; Sydorova et al., 2019; Xiong et al., 2019), while others start with answer sentences from text corpora and combine these with KGs to give crisp entity answers (Sun et al., 2015; Savenkov and Agichtein, 2016). Most of these are based on neural networks, and are only designed for simple questions like those in the WebQuestions (Berant et al., 2013), SimpleQuestions (Bordes et al., 2015), or WikiMovies (Miller et al., 2016) benchmarks. In contrast, UniQORN, through its anchor graphs and GSTs, can handle arbitrary kinds of complex questions and can construct explanatory evidence for its answers – an unsolved concern for neural methods.

7. Conclusions and Future Work

We provide a fresh perspective in this age of neural methods: showing that an unsupervised QA system, UniQORN, with just five tunable parameters can reach performance comparable to state-of-the-art baselines on benchmarks of complex questions. UniQORN is geared to work over RDF KGs, text corpora, or combinations. Through our proposal of computing Group Steiner Trees on dynamically constructed context graphs, we aim to unify the fragmented paradigms of KG-QA and text-QA. Even for canonicalized KGs as input, sole reliance on accurate SPARQL queries is inferior when it comes to answering complex questions with multiple entities or predicates. We show that relaxing crisp SPARQL-style querying to an approximate graph pattern search is vital for question answering over KGs, and in the process identify the bridge to text-QA. Finally, UniQORN has the unique design rationale of deliberately allowing noise from heuristic choices (like coreference resolution, triple extraction, and alignment insertion) into early stages of the answering pipeline and coping with it later. This makes it a practical choice for providing direct answers in Web-based QA, with Steiner Trees adding the bonus of interpretable insights into answer derivation.


  • A. Abujabal, R. Saha Roy, M. Yahya, and G. Weikum (2018) Never-ending learning for open-domain question answering over knowledge bases. In WWW, Cited by: §1, §2.1, §2.1.
  • A. Abujabal, R. Saha Roy, M. Yahya, and G. Weikum (2019) ComQA: A Community-sourced Dataset for Complex Factoid Question Answering with Paraphrase Clusters. In NAACL-HLT, Cited by: §1, §4.1.2, Table 3.
  • A. Abujabal, M. Yahya, M. Riedewald, and G. Weikum (2017) Automated template generation for question answering over knowledge graphs. In WWW, Cited by: §3.1, §4.1.2, §4.1.2, Table 3, §6.1.
  • E. Agichtein, D. Carmel, D. Pelleg, Y. Pinter, and D. Harman (2015) Overview of the TREC 2015 LiveQA Track. In TREC, Cited by: §6.2.
  • G. Angeli, M. J. J. Premkumar, and C. D. Manning (2015) Leveraging linguistic structure for open domain information extraction. In ACL, Cited by: §2.1, §2.2, §3.1, §5.2.
  • S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives (2007) DBpedia: A nucleus for a Web of open data. Cited by: §1, §6.1.
  • J. Bao, N. Duan, Z. Yan, M. Zhou, and T. Zhao (2016) Constraint-based question answering with knowledge graph. In COLING, Cited by: §6.1.
  • H. Bast and E. Haussmann (2015) More accurate question answering on freebase. In CIKM, Cited by: §1, §2.1, §6.1.
  • J. Berant, A. Chou, R. Frostig, and P. Liang (2013) Semantic parsing on Freebase from question-answer pairs. In EMNLP, Cited by: §1, §6.1, §6.3.
  • G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan (2002) Keyword searching and browsing in databases using banks. In ICDE, Cited by: §3.2.
  • N. Bhutani, X. Zheng, and H. Jagadish (2019) Learning to answer complex questions over knowledge bases with query composition. In CIKM, Cited by: §6.1, §6.1.
  • K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor (2008) Freebase: A collaboratively created graph database for structuring human knowledge. In SIGMOD, Cited by: §1, §6.1.
  • A. Bordes, N. Usunier, S. Chopra, and J. Weston (2015) Large-scale simple question answering with memory networks. In arXiv, Cited by: §6.3.
  • C. Chanial, R. Dziri, H. Galhardas, J. Leblay, M. L. Nguyen, and I. Manolescu (2018) Connectionlens: finding connections across heterogeneous data sources. In VLDB, Cited by: §3.2.
  • D. Chen, A. Fisch, J. Weston, and A. Bordes (2017) Reading wikipedia to answer open-domain questions. In ACL, Cited by: §1, §1, §4.2.4, Table 5, §6.2.
  • Y. Chen, L. Wu, and M. J. Zaki (2019) Bidirectional attentive memory networks for question answering over knowledge bases. In NAACL-HLT, Cited by: §6.1.
  • P. Christmann, R. Saha Roy, A. Abujabal, J. Singh, and G. Weikum (2019) Look before you hop: conversational question answering over knowledge graphs using judicious context expansion. In CIKM, Cited by: §6.1.
  • C. Clark and M. Gardner (2018) Simple and effective multi-paragraph reading comprehension. In ACL, Cited by: §1, §1, §1, §4.2.4, Table 5, §6.2.
  • W. W. Cohen, H. Sun, R. A. Hofer, and M. Siegler (2020) Scalable neural methods for reasoning with a symbolic knowledge base. In ICLR, Cited by: §6.1.
  • R. Cummins and C. O’Riordan (2009) Learning in a pairwise term-term proximity framework for information retrieval. In SIGIR, Cited by: §2.2.
  • R. Das, S. Dhuliawala, M. Zaheer, L. Vilnis, I. Durugkar, A. Krishnamurthy, A. Smola, and A. McCallum (2018) Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. In ICLR, Cited by: §6.1.
  • R. Das, M. Zaheer, S. Reddy, and A. McCallum (2017) Question answering on knowledge bases and text using universal schema and memory networks. In ACL, Cited by: §1, §6.3.
  • M. Dehghani, H. Azarbonyad, J. Kamps, and M. de Rijke (2019) Learning to transform, combine, and reason in open-domain question answering. In WSDM, Cited by: §6.2.
  • L. Del Corro and R. Gemulla (2013) ClausIE: Clause-based open information extraction. In WWW, Cited by: §2.1, §3.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Cited by: §2.2.
  • B. Dhingra, M. Zaheer, V. Balachandran, G. Neubig, R. Salakhutdinov, and W. W. Cohen (2020) Differentiable reasoning over a virtual knowledge base. In ICLR, Cited by: §6.1.
  • D. Diefenbach, A. Both, K. Singh, and P. Maret (2018) Towards a question answering system over the Semantic Web. Semantic Web. Cited by: §1, §4.2.3, Table 5.
  • D. Diefenbach, P. H. Migliatti, O. Qawasmeh, V. Lully, K. Singh, and P. Maret (2019) QAnswer: A question answering prototype bridging the gap between a considerable part of the LOD cloud and end-users. In WWW, Cited by: §1, §1, §4.2.3.
  • L. Dietz, M. Verma, F. Radlinski, and N. Craswell (2017) TREC Complex Answer Retrieval Overview. In TREC, Cited by: §6.2.
  • B. Ding, J. X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin (2007) Finding top-k min-cost connected trees in databases. In ICDE, Cited by: §1, §3.2, §3.2, §3.2.
  • J. Ding, W. Hu, Q. Xu, and Y. Qu (2019) Leveraging frequent query substructures to generate formal queries for complex question answering. In EMNLP-IJCNLP, Cited by: §6.1.
  • M. Dubey, D. Banerjee, A. Abdelkawi, and J. Lehmann (2019) LC-QuAD 2.0: A large dataset for complex question answering over Wikidata and DBpedia. In ISWC, Cited by: §1, §2.1, §4.1.2, Table 3.
  • A. Fader, L. Zettlemoyer, and O. Etzioni (2013) Paraphrase-driven learning for open question answering. In ACL, Cited by: §6.3.
  • A. Fader, L. Zettlemoyer, and O. Etzioni (2014) Open question answering over curated and extracted knowledge bases. In KDD, Cited by: §6.3.
  • A. E. Feldmann and D. Marx (2016) The complexity landscape of fixed-parameter directed steiner network problems. In ICALP, Cited by: §3.2.
  • P. Ferragina and U. Scaiella (2010) TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In CIKM, pp. 1625–1628. Cited by: §3.1, §4.2.1.
  • D. A. Ferrucci (2012) Introduction to ”This is Watson”. IBM Journal of Research and Development 56 (3.4), pp. 1–1. Cited by: §1.
  • D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, N. Schlaefer, and C. Welty (2010) Building Watson: An overview of the DeepQA project. AI magazine 31 (3). Cited by: §1, §6.2.
  • N. Garg, G. Konjevod, and R. Ravi (2000) A polylogarithmic approximation algorithm for the group Steiner tree problem. Journal of Algorithms 37 (1). Cited by: §3.2, §3.2.
  • K. Gashteovski, R. Gemulla, and L. d. Corro (2017) MinIE: Minimizing facts in open information extraction. Cited by: §3.1.
  • M. A. Hearst (1992) Automatic acquisition of hyponyms from large text corpora. In COLING, Cited by: §2.1, §2.2, §2.2, §3.1.
  • J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum (2011) Robust disambiguation of named entities in text. In EMNLP, Cited by: §2.2, §3.1, §4.2.1.
  • K. Höffner, S. Walter, E. Marx, R. Usbeck, J. Lehmann, and A. Ngonga Ngomo (2017) Survey on challenges of question answering in the Semantic Web. Semantic Web 8 (6). Cited by: §2.1.
  • S. Hu, L. Zou, J. X. Yu, H. Wang, and D. Zhao (2017) Answering natural language questions by subgraph matching over knowledge graphs. TKDE 30 (5). Cited by: §6.1.
  • S. Hu, L. Zou, and X. Zhang (2018) A state-transition framework to answer complex questions over knowledge base. In EMNLP, Cited by: §1, §6.1.
  • X. Huang, J. Zhang, D. Li, and P. Li (2019) Knowledge graph embedding based question answering. In WSDM, Cited by: §6.1.
  • S. Jain (2016) Question answering over knowledge base using factual memory networks. In NAACL-HLT, pp. 109–115. Cited by: §6.1.
  • Z. Jia, A. Abujabal, R. Saha Roy, J. Strötgen, and G. Weikum (2018) TEQUILA: Temporal Question Answering over Knowledge Bases. In CIKM, Cited by: §6.1.
  • M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In ACL, Cited by: §4.2.4.
  • M. Joshi, U. Sawant, and S. Chakrabarti (2014) Knowledge graph and corpus driven segmentation and answer inference for telegraphic entity-seeking queries. In EMNLP, Cited by: §2.1.
  • V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar (2005) Bidirectional expansion for keyword search on graph databases. In VLDB, Cited by: §3.2.
  • M. Kaiser, R. Saha Roy, and G. Weikum (2021) Reinforcement learning from reformulations in conversational question answering over knowledge graphs. In SIGIR, Cited by: §1.
  • G. Kasneci, M. Ramanath, M. Sozio, F. M. Suchanek, and G. Weikum (2009) STAR: Steiner-tree approximation in relationship graphs. In ICDE, pp. 868–879. Cited by: §3.2, §5.1, Table 5.
  • T. Khot, A. Sabharwal, and P. Clark (2017) Answering complex questions using open information extraction. In ACL, Cited by: §6.3.
  • R. Li, L. Qin, J. X. Yu, and R. Mao (2016) Efficient and progressive group steiner tree search. In SIGMOD, Cited by: §3.2.
  • Y. Lin, H. Ji, Z. Liu, and M. Sun (2018) Denoising distantly supervised open-domain question answering. In ACL, Cited by: §6.2.
  • X. Lu, S. Pramanik, R. Saha Roy, A. Abujabal, Y. Wang, and G. Weikum (2019) Answering complex questions by joining multi-document evidence with quasi knowledge graphs. In SIGIR, Cited by: §1, §2.1, §4.1.2, Table 3, §6.1.
  • Mausam (2016) Open information extraction systems and downstream applications. In IJCAI, Cited by: §2.1, §2.2, §3.1, §6.3.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NeurIPS, Cited by: §2.2.
  • A. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes, and J. Weston (2016) Key-value memory networks for directly reading documents. In arXiv, Cited by: §6.3.
  • T. Pellissier Tanon, D. Vrandečić, S. Schaffert, T. Steiner, and L. Pintscher (2016) From Freebase to Wikidata: The great migration. In WWW, Cited by: §4.1.1.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: Global vectors for word representation. In EMNLP, Cited by: §2.2.
  • M. Petrochuk and L. Zettlemoyer (2018) SimpleQuestions Nearly Solved: A New Upperbound and Baseline Approach. In EMNLP, pp. 554–558. Cited by: §2.1.
  • Y. Qiu, Y. Wang, X. Jin, and K. Zhang (2020) Stepwise reasoning for multi-relation question answering over knowledge graph with weak supervision. In WSDM, Cited by: §6.1, §6.1.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ Questions for machine comprehension of text. In EMNLP, Cited by: §1, §1, §4.2.4, §4.2.4, §6.2.
  • D. Ravichandran and E. Hovy (2002) Learning surface text patterns for a question answering system. In ACL, Cited by: §1, §6.2.
  • S. Saha and Mausam (2018) Open information extraction from conjunctive sentences. In COLING, Cited by: §3.1.
  • D. Savenkov and E. Agichtein (2016) When a knowledge base is not enough: Question answering over knowledge bases with external text data. In SIGIR, Cited by: §1, §6.3.
  • U. Sawant, S. Garg, S. Chakrabarti, and G. Ramakrishnan (2019) Neural architecture for question answering using a knowledge graph and Web corpus. Information Retrieval Journal. Cited by: §2.1.
  • Y. Shi, G. Cheng, and E. Kharlamov (2020) Keyword search over knowledge graphs via static and dynamic hub labelings. In WWW, Cited by: §3.2.
  • F. Suchanek, G. Kasneci, and G. Weikum (2007) YAGO: A core of semantic knowledge. In WWW, Cited by: §1, §6.1.
  • H. Sun, T. Bedrax-Weiss, and W. Cohen (2019) PullNet: Open domain question answering with iterative retrieval on knowledge bases and text. In EMNLP-IJCNLP, Cited by: §1, §1, §2.1, §4.2.2, Table 5, §6.3.
  • H. Sun, B. Dhingra, M. Zaheer, K. Mazaitis, R. Salakhutdinov, and W. Cohen (2018) Open domain question answering using early fusion of knowledge bases and text. In EMNLP, Cited by: §1, §4.2.2, Table 5, §6.3.
  • H. Sun, H. Ma, W. Yih, C. Tsai, J. Liu, and M. Chang (2015) Open domain question answering via semantic enrichment. In WWW, Cited by: §1, §6.3.
  • A. Sydorova, N. Poerner, and B. Roth (2019) Interpretable question answering on knowledge bases and text. In ACL, Cited by: §6.3.
  • A. Talmor and J. Berant (2018) The Web as a Knowledge-Base for Answering Complex Questions. In NAACL-HLT, Cited by: §2.1, §6.3.
  • C. Tan, F. Wei, Q. Zhou, N. Yang, B. Du, W. Lv, and M. Zhou (2018) Context-aware answer sentence selection with hierarchical gated recurrent neural networks. IEEE/ACM Trans. Audio, Speech & Language Processing 26 (3). Cited by: §6.1.
  • T. P. Tanon, M. D. de Assuncao, E. Caron, and F. Suchanek (2018) Demoing Platypus – A multilingual question answering platform for Wikidata. In ESWC, Cited by: §1, §4.2.3, Table 5.
  • P. Trivedi, G. Maheshwari, M. Dubey, and J. Lehmann (2017) LC-QuAD: A corpus for complex question answering over knowledge graphs. In ISWC, Cited by: §4.1.2, Table 3.
  • C. Unger, L. Bühmann, J. Lehmann, A. Ngonga Ngomo, D. Gerber, and P. Cimiano (2012) Template-based question answering over RDF data. In WWW, Cited by: §1, §1, §6.1.
  • R. Usbeck, R. H. Gusmita, A. N. Ngomo, and M. Saleem (2018) 9th Challenge on question answering over linked data (QALD-9). In Semdeep/NLIWoD@ ISWC, Cited by: §4.1.2, Table 3, §6.1.
  • S. Vakulenko, J. D. Fernandez Garcia, A. Polleres, M. de Rijke, and M. Cochez (2019) Message passing for complex question answering over knowledge graphs. In CIKM, Cited by: §1, §2.1, §6.1, §6.1.
  • E. M. Voorhees (1999) The TREC-8 question answering track report. In TREC, Cited by: §1, §6.2.
  • D. Vrandečić and M. Krötzsch (2014) Wikidata: A free collaborative knowledge base. CACM 57 (10). Cited by: §1, §6.1.
  • B. Wang, T. Yao, Q. Zhang, J. Xu, Z. Tian, K. Liu, and J. Zhao (2019) Document gated reader for open-domain question answering. In SIGIR, Cited by: §6.2.
  • Z. Wu, B. Kao, T. Wu, P. Yin, and Q. Liu (2020) PERQ: Predicting, Explaining, and Rectifying Failed Questions in KB-QA Systems. In WSDM, Cited by: §6.1.
  • W. Xiong, M. Yu, S. Chang, X. Guo, and W. Y. Wang (2019) Improving question answering over incomplete KBs with knowledge-aware reader. In ACL, Cited by: §6.3.
  • K. Xu, Y. Lai, Y. Feng, and Z. Wang (2019) Enhancing key-value memory neural networks for knowledge based question answering. In NAACL-HLT, Cited by: §6.1.
  • K. Xu, S. Reddy, Y. Feng, S. Huang, and D. Zhao (2016) Question answering on freebase via relation extraction and textual evidence. In ACL, Cited by: §1, §6.1, §6.3.
  • M. Yahya, K. Berberich, S. Elbassuoni, and G. Weikum (2013) Robust question answering over the web of linked data. In CIKM, Cited by: §6.1.
  • M. Yahya, S. Whang, R. Gupta, and A. Halevy (2014) ReNoun: Fact extraction for nominal attributes. In EMNLP, Cited by: §3.1.
  • Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In EMNLP, Cited by: §6.2.
  • S. Yavuz, I. Gur, Y. Su, M. Srivatsa, and X. Yan (2016) Improving semantic parsing via answer type inference. In EMNLP, Cited by: §3.1, §3.2.
  • S. W. Yih, M. Chang, X. He, and J. Gao (2015) Semantic parsing via staged query graph generation: Question answering with knowledge base. In ACL, Cited by: §2.1, §2.1.
  • P. Yin, N. Duan, B. Kao, J. Bao, and M. Zhou (2015) Answering questions with complex semantic constraints on open knowledge bases. In CIKM, Cited by: §6.3.
  • Y. Zhang, H. Dai, Z. Kozareva, A. J. Smola, and L. Song (2018) Variational reasoning for question answering with knowledge graph. In AAAI, Cited by: §6.1.
  • D. Ziegler, A. Abujabal, R. S. Roy, and G. Weikum (2017) Efficiency-aware answering of compositional questions using answer type prediction. In IJCNLP, Cited by: §3.1, §3.2.