Complex Temporal Question Answering on Knowledge Graphs

09/18/2021 ∙ by Zhen Jia, et al. ∙ Indian Institute of Technology Bhilai ∙ Max Planck Society ∙ Southwest Jiaotong University

Question answering over knowledge graphs (KG-QA) is a vital topic in IR. Questions with temporal intent are a special class of practical importance, but have not received much attention in research. This work presents EXAQT, the first end-to-end system for answering complex temporal questions that have multiple entities and predicates, and associated temporal conditions. EXAQT answers natural language questions over KGs in two stages, one geared towards high recall, the other towards precision at top ranks. The first step computes question-relevant compact subgraphs within the KG, and judiciously enhances them with pertinent temporal facts, using Group Steiner Trees and fine-tuned BERT models. The second step constructs relational graph convolutional networks (R-GCNs) from the first step's output, and enhances the R-GCNs with time-aware entity embeddings and attention over temporal relations. We evaluate EXAQT on TimeQuestions, a large dataset of 16k temporal questions we compiled from a variety of general purpose KG-QA benchmarks. Results show that EXAQT outperforms three state-of-the-art systems for answering complex questions over KGs, thereby justifying specialized treatment of temporal QA.

Code Repositories

EXAQT

Code for our CIKM'21 paper "Complex Temporal Question Answering on Knowledge Graphs"


1. Introduction

Motivation. Questions and queries with temporal information needs (Kanhabua and Anand, 2016; Alonso et al., 2011, 2007; Berberich et al., 2010; Campos et al., 2014) represent a substantial use case in search. For factual questions, knowledge graphs (KGs) like Wikidata (Vrandečić and Krötzsch, 2014), YAGO (Suchanek et al., 2007), or DBpedia (Auer et al., 2007) have become the go-to resource for search engines, tapping into structured facts on entities. While question answering over KGs (Bast and Haussmann, 2015; Bhutani et al., 2019; Saha Roy and Anand, 2020; Yahya et al., 2013; Abujabal et al., 2018; Vakulenko et al., 2019; Berant et al., 2013; Yih et al., 2015; Diefenbach et al., 2019) has been a major topic, little attention has been paid to the case of temporal questions. Such questions involve explicit or implicit notions of constraining answers by associated timestamps in the KG. This spans a spectrum, starting from simpler cases such as when was obama born?, where did obama live in 2001?, and where did obama live during 9/11? to more complex temporal questions like:

where did obama’s children study when he became president?

Complex questions must consider multi-hop constraints (Barack Obama → child → Malia Obama, Sasha Obama → educated at → Sidwell Friends School), and reason on the overlap of time points and intervals (the start of the presidency in 2009 with the study period at the school, 2009 – 2016). A simplified excerpt of the relevant zone in the Wikidata KG that is necessary for answering the question is shown in Fig. 1. This paper addresses the challenges that arise for such complex temporal questions.

Figure 1. Wikidata excerpt showing the relevant KG zone for the question where did obama’s children study when he became president? with answer Sidwell Friends School.

Limitations of state-of-the-art. Early works on temporal QA over unstructured text sources (Harabagiu and Bejan, 2005; Ahn et al., 2006; Schilder and Habel, 2003; Bruce, 1972; Uzaaman et al., 2012; Pustejovsky et al., 2002; Saquete et al., 2009) involve various forms of question and document parsing, but do not carry over to KGs with structured facts comprised of entities and predicates. The few works specifically geared for time-aware QA over KGs include (Jia et al., 2018b; Costa et al., 2020; Wu et al., 2020). (Jia et al., 2018b) uses a small set of hand-crafted rules for question decomposition and temporal reasoning. This approach needs human experts for the rules and does not cope with complex questions. (Costa et al., 2020) creates a QA collection for KGs that capture events and their timelines. A key-value memory network in (Wu et al., 2020) includes time information from KGs for answering simple questions.

Approach. We present Exaqt: EXplainable Answering of complex Questions with Temporal intent, a system that does not rely on manual rules for question understanding and reasoning. Exaqt answers complex temporal questions in two steps:

  • Identifying a compact, tractable answer graph that contains all cues required for answering the question, based on dense-subgraph algorithms and fine-tuned BERT models; and

  • A relational graph convolutional network (R-GCN) (Sun et al., 2018) to infer the answer in the graph, augmented with signals about time.

The two stages work as follows (partly illustrated in Fig. 1).

Stage 1: Answer graph construction. Exaqt fetches all KG facts of entities mentioned in the question (Barack Obama, President of the United States: dashed outline boxes), as detected by off-the-shelf NERD systems (Ferragina and Scaiella, 2010; Hoffart et al., 2011; Li et al., 2020). The resulting noisy set of facts is distilled into a tractable set by means of a fine-tuned BERT model (admitting information about the children Malia and Sasha, but not Michelle Obama). To construct a KG subgraph of all question-relevant KG items and their interconnections from this set, Group Steiner Trees (GST) (Shi et al., 2020; Lu et al., 2019; Chanial et al., 2018) are computed (dark orange nodes, terminals or keyword matches underlined: “obama”, “president”, “child”, “educated at”) and completed (light orange nodes). The last and decisive step at this point augments this candidate answer graph with pertinent temporal facts, to bring in cues (potentially multiple hops away from the question entities) about relevant dates, events and time-related predicates. To this end, we use an analogous BERT model for identifying question-relevant temporal facts (blue nodes: educational affiliations of Malia and Sasha and their dates). The resulting answer graph is the input of the second stage.

Stage 2: Answer prediction by R-GCN. Inspired by the popular GRAFT-Net model (Sun et al., 2018) and related work (Schlichtkrull et al., 2018; Sun et al., 2019), we construct an R-GCN that learns entity embeddings over the answer graph and casts answer prediction into a node classification task. However, R-GCNs as used in prior works are ignorant of temporal constraints (Allen, 1983). To overcome this obstacle, we augment the R-GCN with time-aware entity embeddings, attention over temporal relations, and encodings of timestamps (Zhang et al., 2020), temporal signals (Setzer, 2002), and temporal question categories (Jia et al., 2018b). In our running example, temporal attention helps Exaqt focus on educated at as a question-relevant relation (partly shaded nodes). The time-enhanced representation of Barack Obama flows through the R-GCN (thick edges) and boosts the likelihood of Sidwell Friends School as the answer (node with thick borders), which contains 2009 (in bold) among its temporal facts. By producing such concise KG snippets for each question (as colored in Fig. 1), Exaqt yields explainable evidence for its answers.

Contributions. This work makes the following contributions:

  • We propose Exaqt, the first end-to-end system for answering complex temporal questions over large-scale knowledge graphs;

  • Exaqt applies fine-tuned BERT models and convolutional graph networks to solve the specific challenges of identifying relevant KG facts for complex temporal questions;

  • We compile and release TimeQuestions, a benchmark of about 16k temporal questions (examples in Table 1);

  • Experiments over the full Wikidata KG show the superiority of Exaqt over three state-of-the-art complex KG-QA baselines. All resources from this project are available at https://exaqt.mpi-inf.mpg.de/ and https://github.com/zhenjia2017/EXAQT.

2. Concepts and notation

Category      Question
Explicit      who won oscar for best actress 1986?
              which movie did jaco van dormael direct in 2009?
              what currency is used in germany 2012?
Implicit      who was king of france during the ninth crusade?
              what did thomas jefferson do before he was president?
              what club did cristiano ronaldo play for after manchester united?
Ordinal       what was the first film julie andrews starred in?
              what was the second position held by pierre de coubertin?
              who is elizabeth taylor’s last husband?
Temp. Ans.    what year did lakers win their first championship?
              when was james cagney’s spouse born?
              when was the last time the orioles won the world series?
Table 1. Sample temporal questions from TimeQuestions.

We now define the salient concepts that underlie Exaqt.

Knowledge graph. A knowledge graph (aka knowledge base) is a collection of facts organized as a set of <subject, predicate, object> triples. It can be stored as an RDF database of such triples, or equivalently as a graph with nodes and edges. Examples are Wikidata (Vrandečić and Krötzsch, 2014), YAGO (Suchanek et al., 2007), DBpedia (Auer et al., 2007), Freebase (Bollacker et al., 2008) and industrial KGs. When stored as a graph, edges are directed: subject → predicate → object. Subjects and objects are always nodes, while predicates (aka relations) often become edge labels.

Fact. A fact can either be binary, containing a subject and an object connected by a predicate, or n-ary, combining multiple items via main predicates and qualifier predicates. An example of a binary fact is <Barack Obama, child, Malia Obama>, where subjects are entities (Barack Obama), and objects may be entities (Malia Obama), literals (constants such as dates in <Malia Obama, date of birth, 04-07-1998>), or types aka classes (private school in <Sidwell Friends School, type, private school>). We use the terms predicate and relation interchangeably in this text.

An n-ary fact combines several triples that belong together, such as <Barack Obama, position held, President of the US; start date, 20-01-2009; end date, 20-01-2017> (see Fig. 1). position held is the main predicate, President of the US is the main object, while the remaining data are <qualifier predicate, qualifier object> pairs. n-ary facts are of vital importance in temporal QA, with a large fraction of temporal information in modern KGs being stored as qualifiers. One way of representing qualifiers in a KG is shown in Fig. 1, via paths from the main predicate to the qualifier predicate and on to the qualifier object.

Temporal fact. We define a temporal fact as one where the main object or any of the qualifier objects is a timestamp. Examples are <Vietnam War, end date, 30-04-1975> (binary), or <Steven Spielberg, award received, Academy Award for Best Director; for work, Schindler’s List; point in time, 1993> (n-ary).

Temporal predicate. We define a temporal predicate as one that can have a timestamp as its direct object or one of its qualifier objects. Examples are date of birth and position held.
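
The definitions of binary facts, n-ary facts, and temporal facts can be made concrete with a small data structure. The following is a minimal sketch, not the representation used by Exaqt: the `Fact` class, the helper names, and the assumption that timestamps appear as DD-MM-YYYY, MM-YYYY, or YYYY strings (as in the examples of this section) are all illustrative.

```python
import re
from dataclasses import dataclass, field

# Assumption: timestamps are strings shaped like DD-MM-YYYY, MM-YYYY, or YYYY,
# mirroring the examples in the text. A real KG would use typed literals.
TIMESTAMP = re.compile(r"^(\d{2}-)?(\d{2}-)?\d{4}$")


@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    # (qualifier predicate, qualifier object) pairs; empty for binary facts
    qualifiers: list = field(default_factory=list)


def is_timestamp(value: str) -> bool:
    return bool(TIMESTAMP.match(value))


def is_temporal_fact(fact: Fact) -> bool:
    """True if the main object or any qualifier object is a timestamp."""
    return is_timestamp(fact.obj) or any(is_timestamp(o) for _, o in fact.qualifiers)


presidency = Fact(
    "Barack Obama", "position held", "President of the US",
    [("start date", "20-01-2009"), ("end date", "20-01-2017")],
)
print(is_temporal_fact(presidency))
```

Under this sketch, a temporal predicate would be any predicate (main or qualifier) that occurs in at least one fact with a timestamp object.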

Temporal question. A temporal question is one that contains a temporal expression or a temporal signal, or whose answer is of temporal nature (Jia et al., 2018a). Examples of temporal expressions are “in the year 1998”, “Obama’s presidency”, “New Year’s Eve”, etc. which indicate explicit or implicit temporal scopes (Kuzey et al., 2016). Temporal signals (Setzer, 2002) are markers of temporal relations (BEFORE, AFTER, OVERLAP, ...) (Allen, 1983) and are expressed with words like “prior to, after, during, …” that indicate the need for temporal reasoning. In our models, a question $q$ is represented as a set of keywords $\langle q_1, q_2, \ldots, q_{|q|} \rangle$.

Temporal question categories. Temporal questions fall into four basic categories (Jia et al., 2018a): (i) containing explicit temporal expressions (“in 2009”), (ii) containing implicit temporal expressions (“when Obama became president”), (iii) containing temporal ordinals (“first president”), and (iv) having temporal answers (“When did …”). Table 1 gives several examples of temporal questions. A question may belong to multiple categories. For example, what was the first film julie andrews starred in after her divorce with tony walton? contains both an implicit temporal expression and a temporal ordinal.

Answer. An answer to a temporal question is a (possibly singleton) set of entities or literals, e.g., {Chicago University Lab School, Sidwell Friends School} for Where did Malia Obama study before Harvard?, or {08-2017} for When did Malia start at Harvard?

Answer graph. An answer graph is a subset of the KG that contains all the necessary facts for correctly answering the question.

Figure 2. An overview of the two-stage Exaqt pipeline.

3. Constructing answer graphs

Fig. 2 is an overview of Exaqt, with two main stages: (i) answer graph construction (Sec. 3), and (ii) answer prediction (Sec. 4).

3.1. Finding question-relevant KG facts

NERD for question entities. Like most QA pipelines (Qiu et al., 2020; Bhutani et al., 2019), we start off by running named entity recognition and disambiguation (NERD) (Li et al., 2020; van Hulst et al., 2020; Hoffart et al., 2011) on the input question (where did obama’s children study when he became president?). NERD systems identify spans of words in the question as mentions of entities (“obama”, “president”), and link these spans to KG items or Wikipedia articles (which can easily be mapped to popular KGs). The facts of these linked entities (Barack Obama, President of the United States) provide us with a zone in the KG to start looking for the answer. NERD is a critical cog in the QA wheel: entity linking errors leave the main QA pipeline helpless with respect to answer detection. To mitigate this effect, we use two different systems, TagMe and ELQ (Ferragina and Scaiella, 2010; Li et al., 2020), to boost answer recall. Complex questions often contain multiple entity mentions, and accounting for two NERD systems, we could easily have several different entities per question. The total number of associated facts can thus be several hundred or more. To reduce this large and noisy set of facts to a few question-relevant ones, we fine-tune BERT (Devlin et al., 2019) as follows.

Training a classifier for question-relevant facts. For each question in our training set, we run NERD and retrieve all KG facts of the detected entities. We then use a distant supervision mechanism: out of these facts, the ones that contain the gold answer(s) are labeled as positive instances. While several complex questions may not have their answer in the facts of the question entities (multi-hop cases), the ones that do comprise a reasonable amount of training data for our classifier for question-relevance. Note that facts with qualifiers are also retrieved for the question entities (complete facts where the question entity appears as a subject, object, or qualifier object): this increases our coverage for obtaining positive examples.

For each positive instance, we randomly sample five negative instances from the facts that do not contain the answer. Sampling question-specific negative instances helps learn a more discriminative classifier, as all negative instances are guaranteed to contain at least one entity from the question (say, <Barack Obama, spouse, Michelle Obama>). Using all facts that do not contain an answer would result in severe class imbalance, as their number is much higher than the number of positive instances.
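
The distant supervision and negative sampling described here can be sketched in a few lines. This is an illustrative approximation rather than the authors' code: the function name `build_training_pairs`, the representation of a fact as a tuple of item labels, and the choice to draw all negatives in one call (instead of five per positive, one positive at a time) are assumptions.

```python
import random


def build_training_pairs(question, facts, gold_answers, negatives_per_positive=5, seed=0):
    """Label facts by distant supervision and sample question-specific negatives.

    `facts` is a list of tuples of KG item labels, all retrieved for the
    question entities. A fact is positive if any of its items is a gold answer.
    """
    rng = random.Random(seed)
    positives = [f for f in facts if any(item in gold_answers for item in f)]
    negatives = [f for f in facts if f not in positives]
    pairs = [(question, f, 1) for f in positives]
    # Cap the sample size: there may be fewer negatives than requested.
    n_neg = min(len(negatives), negatives_per_positive * len(positives))
    pairs += [(question, f, 0) for f in rng.sample(negatives, n_neg)]
    return pairs
```

Because negatives are drawn only from facts of the question entities, each of them shares at least one entity with the question, which is the property the paragraph above relies on.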

We then pool together the <question, fact> paired positive and negative instances for all training questions. The fact in this pair is now verbalized as a natural language sentence by concatenating its constituents; qualifier statements are joined using “and” (Oguz et al., 2021). For example, the full fact for Obama’s marriage (a negative instance) is: <Barack Obama, spouse, Michelle Obama; start date, 03-10-1992; place of marriage, Trinity United Church of Christ>. This has two qualifiers, and would be verbalized as “Barack Obama spouse Michelle Obama and start date 03-10-1992 and place of marriage Trinity United Church of Christ.”. The questions paired with the verbalized facts, along with the binary ground-truth labels, are fed as training input to a sequence pair classification model for BERT.
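
A minimal sketch of this verbalization step, assuming a fact is given as a main triple plus a list of (qualifier predicate, qualifier object) pairs; the function name `verbalize_fact` is hypothetical.

```python
def verbalize_fact(subject, predicate, obj, qualifiers=()):
    """Concatenate the fact's constituents; join qualifier statements with 'and'."""
    parts = [f"{subject} {predicate} {obj}"]
    parts += [f"{qp} {qo}" for qp, qo in qualifiers]
    return " and ".join(parts) + "."


sentence = verbalize_fact(
    "Barack Obama", "spouse", "Michelle Obama",
    [("start date", "03-10-1992"),
     ("place of marriage", "Trinity United Church of Christ")],
)
print(sentence)
```

Run on the marriage fact above, this reproduces the verbalized sentence quoted in the text.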

Applying the classifier. Following (Devlin et al., 2019), the question and the fact are concatenated with the special separator token [SEP] in between, and the special classification token [CLS] is added in front of this sequence. The final hidden vector corresponding to [CLS], denoted by $C \in \mathbb{R}^{H}$ ($H$ is the size of the hidden state), is considered to be the accumulated representation. Weights $W$ of a classification layer are the only parameters introduced during fine-tuning, where $W \in \mathbb{R}^{K \times H}$ and $K$ is the number of class labels ($K = 2$ here: the fact is question-relevant or not). $\log(\mathrm{softmax}(CW^{\top}))$ is used as the classification loss function. Once the classifier is trained, given a new <question, fact> pair, it outputs the probability (and the label) of the fact being relevant for the question. We make this prediction for all candidate facts pertinent to a question, and sort them in descending order of this question-relevance likelihood. We pick the top-scoring facts from here as our question-relevant set.
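
The classification layer and the final ranking step can be illustrated without a deep learning framework. The sketch below assumes the [CLS] vector has already been computed by BERT and only shows the standard head from Devlin et al. (a linear layer followed by log-softmax) and the top-scoring selection; the function names and the toy numbers are made up for illustration.

```python
import math


def log_softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def classify(cls_vector, weights):
    """cls_vector: [CLS] vector of length H; weights: K rows of length H.
    Returns the K log-probabilities of the classification layer."""
    logits = [sum(c * w for c, w in zip(cls_vector, row)) for row in weights]
    return log_softmax(logits)


def top_facts(scored_facts, k):
    """scored_facts: (fact, relevance probability) pairs; keep the k best."""
    return sorted(scored_facts, key=lambda x: x[1], reverse=True)[:k]


log_probs = classify([1.0, 2.0], [[0.5, 0.0], [0.0, 0.5]])  # toy H = 2, K = 2
ranked = top_facts([("fact a", 0.2), ("fact b", 0.9), ("fact c", 0.5)], k=2)
```

In practice the probability of the "relevant" class would serve as the score passed to `top_facts`.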

3.2. Computing compact subgraphs

The set of top-scoring facts contains question-relevant facts but does not indicate which KG items form a coherent set that matters for this question, and how they are connected. To this end, we induce a graph, as shown in Fig. 1, from the above set of facts, where each KG item (entity, predicate, type, literal) becomes a node of its own. Edges run between components of the same fact in the direction mandated in the KG: subject → predicate → object for the main fact, and subject → predicate → qualifier predicate → qualifier object for (optional) qualifiers.
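
One way to picture this graph induction is the sketch below. It is only an illustration: the function name `induce_graph` is hypothetical, and giving each predicate node an index that ties it to its fact is an assumption made here so that different facts with the same predicate do not collapse into one node.

```python
def induce_graph(facts):
    """facts: list of (subject, predicate, object, qualifiers) tuples, where
    qualifiers is a list of (qualifier predicate, qualifier object) pairs.
    Returns a set of directed edges; every KG item becomes a node."""
    edges = set()
    for i, (s, p, o, qualifiers) in enumerate(facts):
        p_node = (p, i)  # assumption: predicate nodes are specific to their fact
        edges.add((s, p_node))       # subject -> predicate
        edges.add((p_node, o))       # predicate -> object
        for j, (qp, qo) in enumerate(qualifiers):
            qp_node = (qp, i, j)
            edges.add((p_node, qp_node))  # predicate -> qualifier predicate
            edges.add((qp_node, qo))      # qualifier predicate -> qualifier object
    return edges


graph = induce_graph([
    ("Barack Obama", "position held", "President of the US",
     [("start date", "20-01-2009"), ("end date", "20-01-2017")]),
])
```

Entities and literals keep their plain labels as node identifiers, so facts that share an entity become connected through it.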

Injecting connectivity. BERT selects from the facts of a number of entities as detected by our NERD systems. These entities may not be connected to each other via shared KG facts. However, a connected graph is needed so that our subsequent GST and R-GCN algorithms can produce the desired effects. To inject connectivity in the graph induced from BERT facts, we compute the shortest KG path between every pair of question entities, and add these paths to our graph. In case of multiple paths of the same length between two entities, they are scored for question-relevance as follows. A KG path is a set of facts: a path of length one is made up of one fact (Barack Obama → position held → President of the United States), a path of length two is made up of two facts (Barack Obama → country → United States of America → office held by head of state → President of the United States), and so on. Each candidate path is verbalized as a set of facts (a period separating two facts) and encoded with BERT (Kaiser et al., 2021), and so is the question. These BERT encodings are stored in the corresponding [CLS] tokens. We compute the cosine similarity of [CLS](question) with [CLS](path), and add the path with the highest cosine similarity to our answer graph.
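
The tie-breaking among equally long paths can be sketched as follows. The `encode` argument is a stand-in for whatever produces the BERT [CLS] vector of a text and is purely hypothetical here; `verbalize_path` and `best_path` are illustrative names as well.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def verbalize_path(path_facts):
    """Each fact is a tuple of labels; a period separates two facts."""
    return ". ".join(" ".join(fact) for fact in path_facts)


def best_path(question_vec, candidate_paths, encode):
    """Return the candidate whose encoding is most similar to the question's.
    `encode` maps a verbalized path to a vector (a BERT encoder in the paper,
    any callable in this sketch)."""
    return max(
        candidate_paths,
        key=lambda path: cosine(question_vec, encode(verbalize_path(path))),
    )
```

Any sentence encoder that returns fixed-size vectors could be plugged in for `encode` when experimenting with this step.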

GST model. Computing Group Steiner Trees (GST) (Shi et al., 2020; Sun et al., 2021; Pramanik et al., 2021; Lu et al., 2019) has been shown to be an effective mechanism in identifying query-specific backbone structures in larger graphs, for instance, in keyword search over database graphs (Aditya et al., 2002; Ding et al., 2007). Given a subset of nodes in the graph, called terminals, the Steiner Tree (ST) is the lowest-cost tree that connects all terminals. This reduces to the minimum spanning tree problem when all nodes of the graph are terminals, and to the shortest path problem when there are only two terminals. The GST models a more complex situation where the terminals are arranged into groups or sets, and it suffices to find a Steiner Tree that connects at least one node from each group. This scenario fits our requirement perfectly, where each question keyword can match multiple nodes in the graph, and naturally induces a terminal group. Finding a tree that runs through each and every matched node is unrealistic, hence the group model.

Edge costs. An integral part of the GST problem is how to define edge costs. Since edges emanate from KG facts, we leverage the question-relevance scores assigned by the classifier of Sec. 3.1: each score $s \in [0, 1]$ is converted to an edge cost $1 - s \in [0, 1]$, so that more relevant facts yield cheaper edges.

GST algorithm. There are good approximation algorithms for GSTs (Li et al., 2016; Sun et al., 2021), but QA needs high precision. Therefore, we adopted the fixed-parameter-tractable exact algorithm by Ding et al. (2007). It iteratively grows and merges smaller trees over the bigger graph to arrive at the minimal trees. Only taking the best tree can be risky in light of spurious connections potentially irrelevant to the question. Thus, we used a variant that returns several top-ranked trees, which is naturally supported by the dynamic programming algorithm of (Ding et al., 2007).
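
To make the Group Steiner Tree objective tangible, here is a brute-force sketch for very small graphs. It is not the dynamic programming algorithm of Ding et al.: it simply enumerates node subsets, which is exponential and only meant to show what is being optimized. It also assumes that edge costs are one minus a relevance score in [0, 1] and that the graph is undirected; all function names are illustrative.

```python
from itertools import combinations


def mst_cost(nodes, edges):
    """Cost of a minimum spanning tree restricted to `nodes` (Kruskal with
    union-find). Returns None if `nodes` is not connected by `edges`."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total, used = 0.0, 0
    for (u, v), cost in sorted(edges.items(), key=lambda e: e[1]):
        if u in parent and v in parent and find(u) != find(v):
            parent[find(u)] = find(v)
            total += cost
            used += 1
    return total if used == len(nodes) - 1 else None


def group_steiner_tree(edges, groups):
    """Cheapest tree touching at least one node of every terminal group.
    edges: {(u, v): cost}, treated as undirected; groups: list of node sets.
    Returns (cost, node set), or None if no such tree exists."""
    all_nodes = sorted({n for edge in edges for n in edge})
    best = None
    for size in range(1, len(all_nodes) + 1):
        for subset in combinations(all_nodes, size):
            chosen = set(subset)
            if not all(chosen & group for group in groups):
                continue
            cost = mst_cost(chosen, edges)
            if cost is not None and (best is None or cost < best[0]):
                best = (cost, chosen)
    return best


# Hypothetical relevance scores on edges, converted to costs.
scores = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "d"): 0.1, ("d", "c"): 0.1, ("c", "e"): 0.5}
costs = {edge: 1.0 - s for edge, s in scores.items()}
result = group_steiner_tree(costs, [{"a"}, {"c", "e"}])
```

With non-negative costs, the optimal tree is the minimum spanning tree of its own node set, which is why enumerating node subsets suffices in this toy setting.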

GST completion. As shown in Fig. 1, the GST yields a skeleton connecting the most relevant question nodes. To transform this into a coherent context for the question, we need to complete it with the facts from which this skeleton was built. Nodes introduced due to this step are shown in light orange in the figure: dates about the presidency, Obama’s children, and the (noisy) fact about Obama’s education. In case the graph has multiple connected components (still possible as our previous connectivity insertions worked only pairwise over entities), the top-ranked GSTs are computed for each component and the union graph is used for this fact completion step.

Example. We show a simplified example in Fig. 1, where the node Barack Obama matches the question keyword “Obama”, child matches “children”, educated at matches “study”, and President of the United States matches “president”. The educated at nodes connected to Malia and Sasha do not feature here as they are not contained in the facts of Barack Obama, and do not yet feature in our answer graph. We consider exact matches, although not just in node labels but also in the set of aliases present in the KG that list common synonyms of entities, predicates and types. This helps us consider relaxed matches without relying on models like word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014), that need inconvenient thresholding on similarity values as a noisy proxy for synonyms. The GST is shown using dark orange nodes with the associated question keyword matches underlined (denoting the terminal nodes). In experiments, we only consider as terminals NERD matches for entities, and keyword matches with aliases for other KG items. The GST naturally includes the internal nodes and edges necessary to connect the terminals. Note that the graph is considered undirected (equivalently, bidirectional) for the purpose of GST computation.

3.3. Augmenting subgraphs with temporal facts

The final step towards the desired answer graph is to enhance it with temporal facts. Here, we add question-relevant temporal facts of entities in the completed GST. This pulls in temporal information necessary for answering questions that need evidence more than one hop away from the question entities (blue nodes in Fig. 1): <Malia Obama, educated at, Sidwell Friends School; start date, 05-01-2009> (+ noise like Malia’s date of birth). The rationale behind this step is to capture facts necessary for faithfully answering the question, where faithful refers to arriving at the answer not by chance but after satisfying all necessary constraints in the question. For example, the question which oscar did leonardo dicaprio win in 2016? can be answered without temporal reasoning, as he only won one Oscar. We wish to avoid such cases in faithful answering.

To this end, we first retrieve from the KG all temporal facts of each entity in the completed GST. We then use an analogously fine-tuned BERT model for question-relevance of temporal facts. The model predicts, for each temporal fact, its likelihood of containing the answer. It is trained using temporal facts of question entities that contain the answer as positive examples, while negative examples are chosen at random from these temporal facts. To trap multi-hop temporal questions in our net, we explore 2-hop facts of question entities for ground truth answers. A larger neighborhood was not used during the first fine-tuning as the total number of facts in two hops of question entities is rather large, but the count of 2-hop temporal facts is a much more tractable number. Moreover, this is in line with our focus on complex temporal questions. The model assigns a likelihood score to each temporal fact of an entity in the completed GST. As before, we take the top-scoring temporal facts and add them to the answer graph, which is then passed on to Stage 2.

4. Predicting answers with R-GCN

Figure 3. Architecture of the R-GCN model in Exaqt, that includes several signals of temporal information.

R-GCN basics. The answer prediction method of Exaqt is inspired by the Relational Graph Convolution Network model (Schlichtkrull et al., 2018), an extension of GCNs (Duvenaud et al., 2015) tailored for handling large-scale relational data such as knowledge graphs. Typically, a GCN convolves the features (equivalently, representations or embedding vectors) of nodes belonging to a local neighborhood and propagates them to their nearest neighbors. The learned entity representations are used in node classification. Here, this classification decision is whether a node is an answer to the input question or not.

In this work, we use the widely popular GRAFT-Net model (Sun et al., 2018) that adapted R-GCNs to deal with heterogeneous QA over KGs and text (Bhutani and Jagadish, 2019; Oguz et al., 2021). In order to apply such a mechanism for answer prediction in our setup, we convert our answer graph from the previous step into a directed relational graph and build upon the KG-only setting of GRAFT-Net. In a relational graph, entities, literals, and types become nodes, while predicates (relations) become edge labels. Specifically, we use the KG RDF dump that contains normal SPO triples for binary facts by employing reification (Hernández et al., 2015). Reified triples can then be straightforwardly represented as a directed relational graph (Sun et al., 2018). Exaqt introduces four major extensions over the R-GCN in GRAFT-Net to deal with the task of temporal QA:

  • we embed temporal facts to enrich representations of entity nodes, creating time-aware entity embeddings (TEE);

  • we encode temporal question categories (TC) and temporal signals (TS) to enrich question representations;

  • we employ time encoding (TE) to obtain the vector representations for timestamps;

  • we propose attention over temporal relations (ATR) to distinguish the same relation but with different timestamps as objects.

In the following, we describe how we encode and update the node representations and perform answer prediction in our extended R-GCN architecture for handling temporal questions. Our neural architecture is shown in Fig. 3, while Table 2 summarizes notation for the salient concepts used in this phase.

4.1. Question representation

4.1.1. Initialization

To encode a temporal question, we first determine its temporal category and extract temporal signals (Sec. 2).

Temporal category encoding (TCE). We adopt a noisy yet effective strategy for labeling categories for temporal questions, and leave more sophisticated (multi-label) classification as future work. We use a four-bit multi-hot (recall that a question can belong to multiple categories) vector where each bit indicates whether the question falls into that category. Our tagger works as follows:

  • A question is tagged with the “EXPLICIT” category if the annotators SUTime (Chang and Manning, 2012) or HeidelTime (Strötgen and Gertz, 2010) detect an explicit temporal expression inside it;

  • A question is tagged with the “IMPLICIT” category if it contains any of the temporal signal words (we used the dictionary compiled by (Setzer, 2002)), and satisfies certain part-of-speech patterns;

  • A question is of type “TEMPORAL ANSWER” if it starts with phrases like “when …”, “in which year …”, and “on what date …”;

  • A question is tagged with the “ORDINAL” category if it contains an ordinal tag as labeled by the Stanford CoreNLP system (Angeli et al., 2015), along with certain keywords and part-of-speech patterns.

Temporal signal encoding (TSE). There are 13 temporal relations defined in Allen’s interval algebra for temporal reasoning (Allen, 1983), namely: “equals”, “before”, “meets”, “overlaps”, “during”, “starts”, and “finishes”, with respective inverses for all of them except “equals”. We simplify these relations and adapt the strategy in (Jia et al., 2018a) into 7 broad classes of temporal signals:

  • “before” and “meets” relations are treated as “BEFORE” signals;

  • “before-inverse” and “meets-inverse” relations are collapsed into “AFTER” signals;

  • “starts” and “finishes” relations are respectively mapped to “START” and “FINISH” signals;

  • words with ordinal tags and “last” are mapped to “ORDINAL”;

  • all other relations are treated as “OVERLAP” signals;

  • absence of any signal word triggers the “NO SIGNAL” case.

We map signal words to temporal signals in questions using a dictionary. We then encode these signals using a 7-bit multi-hot vector (a question can contain multiple signals), where each bit indicates the presence or absence of a particular temporal signal.
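
The dictionary lookup and multi-hot encoding can be sketched as below. The signal-word dictionary shown here is a tiny, hypothetical stand-in (the paper uses a dictionary compiled from prior work), the order of the bits is an arbitrary choice, and the whitespace tokenization is deliberately naive.

```python
SIGNALS = ["BEFORE", "AFTER", "START", "FINISH", "ORDINAL", "OVERLAP", "NO SIGNAL"]

# Hypothetical, incomplete signal-word dictionary for illustration only.
SIGNAL_WORDS = {
    "before": "BEFORE", "prior": "BEFORE",
    "after": "AFTER",
    "first": "ORDINAL", "second": "ORDINAL", "last": "ORDINAL",
    "during": "OVERLAP", "while": "OVERLAP",
}


def encode_signals(question):
    """Return a multi-hot vector with one bit per temporal signal class."""
    words = question.lower().replace("?", "").split()
    found = {SIGNAL_WORDS[w] for w in words if w in SIGNAL_WORDS}
    if not found:
        found = {"NO SIGNAL"}
    return [1 if signal in found else 0 for signal in SIGNALS]


vec = encode_signals(
    "what was the first film julie andrews starred in after her divorce with tony walton?"
)
```

For this example question, the bits for AFTER and ORDINAL are set, reflecting that a question can carry several signals at once.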

Along with these temporal categories and temporal signals, we use a Long Short-Term Memory Network (LSTM) to model the words in the question as a sequence (see block A in Fig. 3). Overall, we represent a question $q$ with $|q|$ words as:

$$h_q^0 = \mathrm{FFN}\big(\mathrm{TCE}(q) \oplus \mathrm{TSE}(q) \oplus \mathrm{LSTM}(w_1, \ldots, w_{|q|})\big) \qquad (1)$$

Here $\mathrm{TCE}(q)$ and $\mathrm{TSE}(q)$ are multi-hot vectors encoding the temporal categories and temporal signals present in $q$, and $w_i$ represents the pre-trained word embedding (from Wikipedia2Vec (Yamada et al., 2020)) of the $i^{\text{th}}$ word in $q$. We concatenate ($\oplus$) the $\mathrm{TCE}(q)$ and $\mathrm{TSE}(q)$ vectors with the output vector from the final state of the LSTM. Finally, we pass this concatenated vector through a Feed Forward Network (FFN) and obtain the initial embedding of $q$, denoted as $h_q^0$.

4.1.2. Update

In subsequent layers, the embedding of the question gets updated with the embeddings of the entities belonging to it (i.e. the question entities obtained from NERD) as follows:

$$h_q^l = \mathrm{FFN}\Big(\sum_{e \in \mathrm{NERD}(q)} h_e^{l-1}\Big) \qquad (2)$$

where $\mathrm{NERD}(q)$ contains the entities for question $q$ and $h_e^{l-1}$ denotes the embedding of an entity $e$ at layer $l-1$.

4.2. Entity representation

4.2.1. Initialization

For initializing each entity $e$ in the relational graph, we use fixed-size pre-trained embeddings $x_e$, also from Wikipedia2Vec (Yamada et al., 2020). Along with conventional skip-gram and context models, Wikipedia2Vec utilizes the Wikipedia link graph that learns entity embeddings by predicting neighboring entities in the Wikipedia graph, producing more reliable entity embeddings:

$$h_e^0 = x_e \qquad (3)$$

4.2.2. Update

Prior to understanding the update rule for the entities in subsequent layers, we need to introduce the following concepts: (i) Time encoding (TE); (ii) Time-aware entity embeddings (TEE); and (iii) Attention over temporal relations (ATR).

Time encoding (TE). Time as an ordering sequence has an inherent similarity to positions of words in text: we thus employ a sinusoidal position encoding method (Vaswani et al., 2017; Zhang et al., 2020) to represent a timestamp $ts$. Here, the $k^{\text{th}}$ position (day, month, etc.) in $ts$ will be encoded as:

$$\mathrm{TE}_k(j) = \begin{cases} \sin\big(k / 10000^{2i/d}\big), & \text{if } j = 2i \\ \cos\big(k / 10000^{2i/d}\big), & \text{if } j = 2i + 1 \end{cases} \qquad (4)$$

where $d$ is the dimension of the time encoding and $j$ is the (even/odd) position in the $d$-dimensional vector. Further, we represent $\mathrm{TE}(ts)$, i.e. the time encoding of $ts$, as the summation of the encodings of each of its corresponding positions. This time encoding method provides a unique encoding to each timestamp and ensures sequential ordering among the timestamps (Zhang et al., 2020), which is vital for reasoning with signals like before and after in temporal questions.
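
A minimal sketch of the sinusoidal encoding of a single position, following the standard formulation of Vaswani et al. (sine on even dimensions, cosine on odd ones). The function name and the default dimension are illustrative, and how the components of a timestamp are mapped to positions before summation is left out because the text does not spell it out.

```python
import math


def position_encoding(k, d=8):
    """Encode integer position k as a d-dimensional sinusoidal vector."""
    vec = []
    for j in range(d):
        i = j // 2  # dimensions 2i and 2i + 1 share one frequency
        angle = k / (10000 ** (2 * i / d))
        vec.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return vec


enc_2009 = position_encoding(2009)
enc_2016 = position_encoding(2016)
```

Nearby positions receive similar vectors on the low-frequency dimensions, which is the property that makes such encodings useful for ordering.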

Time-aware entity embedding (TEE). An entity $e$ present in the relational graph is associated with a number of temporal facts $tf^e$ (Sec. 2) in our answer graph. A temporal fact is said to be associated with an entity $e$ if $e$ is present in any position of the fact (subject, object or qualifier object). We encode each fact $tf_i^e$ as the concatenation of its entity embeddings, relation embeddings (averaged) and time encodings of the timestamps (as shown in block B of Fig. 3). Further, we arrange the facts in $tf^e$ in chronological order and pass them through an LSTM network. Finally, the output from the final state of the LSTM can be used as the time-aware entity representation of $e$, $TEE(e)$, which is vital for reasoning through the R-GCN model:

$$TEE(e) = h_{tf^e}^0 = \mathrm{LSTM}\big(h_{tf_1^e}^0, h_{tf_2^e}^0, \ldots\big) \qquad (5)$$

In subsequent layers, the embedding $h_{tf^e}^l$ will be updated as the embeddings of its constituent entities get updated.

Attention over temporal relations (ATR). In temporal QA, we need to distinguish entities associated with the same relation but having different timestamps (facts with same temporal predicate but different objects, like several educated at facts for a person). We thus introduce the concept of temporal attention here, adapting the more general notion of attention over relations in GRAFT-Net (Sun et al., 2018).

While computing temporal attention over a relation $r$ connected with entity $e$, we concatenate the corresponding relation embedding with the time encoding of its timestamp object and compute its similarity with the question embedding at that stage:

$$ATR(e, r) = \mathrm{softmax}\big((x_r \oplus TE(ts_r))^{\top} h_q^{l-1}\big) \qquad (6)$$

where the softmax normalization is over all outgoing edges from $e$, $x_r$ is the pre-trained relation vector embedding for relation $r$ (Wikipedia2Vec embeddings averaged over each word of the KG predicate), and $TE(ts_r)$ is the time encoding of the timestamp associated with the relation $r$. For relations not connected with any timestamp, we use a random vector for $TE(ts_r)$.
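As a rough illustration of attention over temporal relations, the sketch below scores each outgoing edge of one entity and normalizes the scores with a softmax. The dot product as similarity function, the assumption that the question embedding has the same length as the concatenated relation-plus-time vector, and all names are hypothetical choices, not details confirmed by the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]

def temporal_attention(edges, h_q):
    """Attention weights over the outgoing edges of a single entity.

    edges: list of (relation_embedding, time_encoding) pairs, one per edge.
    h_q:   question embedding; assumed to have length
           len(relation_embedding) + len(time_encoding).
    """
    scores = []
    for x_r, te in edges:
        key = list(x_r) + list(te)  # concatenate relation embedding and time encoding
        scores.append(sum(a * b for a, b in zip(key, h_q)))  # dot-product similarity
    return softmax(scores)

if __name__ == "__main__":
    # Two edges with the same relation (e.g. "educated at") but different timestamps.
    edges = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
    print(temporal_attention(edges, h_q=[1.0, 0.0, 1.0, 0.0]))
```

Because the relation part of both keys is identical in the example, only the time encoding separates the two edges, which is the situation this attention mechanism targets.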

Putting it together. We are now in a position to specify the update rule for entity nodes, which involves a single-layer FFN over the concatenation of the following four states (see block C of Fig. 3):

$$h_e^l = \mathrm{FFN}\Big(h_e^{l-1} \oplus h_q^{l-1} \oplus h_{tf^e}^{l-1} \oplus \sum_r \sum_{e' \in nbd_r(e)} ATR(e', r) \cdot \psi_r(h_{e'}^{l-1})\Big) \qquad (7)$$

Here, (i) the first term corresponds to the entity’s representation from the previous layer; (ii) the second term denotes the question’s representation from the previous layer; (iii) the third term denotes the previous layer’s time-aware entity representation $h_{tf^e}^{l-1}$; and (iv) the fourth term aggregates the states from the neighbors of entity $e$. In the fourth term, the relation-specific neighborhood $nbd_r(e)$ corresponds to the set of entities connected to $e$ via relation $r$, $ATR(e', r)$ is the attention over temporal relations, and $\psi_r(h_{e'}^{l-1})$ is the relation-specific transformation depending on the type and direction of an edge:

$$\psi_r(h_{e'}^{l-1}) = PPR(e')^{l-1} \cdot \mathrm{FFN}\big(x_r \oplus h_{e'}^{l-1}\big) \qquad (8)$$

Here $PPR(e')^{l-1}$ is a Personalized PageRank (Haveliwala, 2003) score obtained in the same way as in GRAFT-Net (Sun et al., 2018) to control the propagation of embeddings along paths starting from the question entities.

4.3. Answer prediction

The final entity representations ($h_e^L$) obtained at layer $L$ are then used in a binary classification setup to select the answers. For each entity $e$, we define its probability to be an answer to $q$:

$$\Pr(e \in \{a\}_q \mid RG_q, q) = \sigma\big(w^{\top} h_e^L + b\big) \qquad (9)$$

where $\{a\}_q$ is the set of ground truth answers for question $q$, $RG_q$ is the relational graph built for answering $q$ from its answer graph, and $\sigma$ is the sigmoid activation function. $w$ and $b$ are respectively the weight and bias vectors corresponding to the classifier, which is trained using binary cross-entropy loss over these probabilities.
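The binary classification setup can be sketched in a few lines. This is a simplified, framework-free illustration under stated assumptions: the function names are hypothetical, the classifier is a single linear layer followed by a sigmoid, and the loss is averaged over the candidate entities of one question.

```python
import math

def answer_probability(h_e, w, b):
    """Sigmoid of a linear score over the final entity representation."""
    z = sum(wi * hi for wi, hi in zip(w, h_e)) + b
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy; labels are 1 for gold answers, 0 otherwise."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

if __name__ == "__main__":
    # Two candidate entities with toy 2-dimensional representations.
    candidates = [[2.0, 1.0], [-1.0, 0.5]]
    w, b = [1.0, -0.5], 0.0
    probs = [answer_probability(h, w, b) for h in candidates]
    print(probs, binary_cross_entropy(probs, labels=[1, 0]))
```

At inference time, sorting the candidate entities by this probability yields the ranked answer list that the evaluation metrics operate on.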

Notation    Concept
$h_e^l$    Representation of entity $e$ at layer $l$
$h_q^l$    Representation of question $q$ at layer $l$
$TCE(q)$    Temporal category encoding for question $q$
$TSE(q)$    Temporal signal encoding for question $q$
$NERD(q)$    Question entities obtained from NERD
$x_e$, $x_r$    Pre-trained entity ($x_e$) and relation ($x_r$) embeddings
$TE(ts)$    Time encoding for timestamp $ts$
$tf^e$    Chronologically ordered temporal facts for $e$
$h_{tf_i^e}^l$    Representation of the $i$-th temporal fact for $e$ at layer $l$
$h_{tf^e}^l$    Time-aware entity representation of $e$ at layer $l$
$ATR(e', r)$    Attention over temporal relation $r$ connected with $e'$
$\psi_r(h_{e'}^l)$    Relation $r$-specific transformation of $h_{e'}^l$
$PPR(e)^l$    Personalized PageRank score for entity $e$ at layer $l$
Table 2. Notation for concepts in the R-GCN of Exaqt.

5. Experimental Setup

5.1. Benchmark

Category Explicit Implicit Temp. Ans. Ordinal Total
Free917 (Cai and Yates, 2013)
WebQ (Berant et al., 2013)
ComplexQ (Bao et al., 2016)
GraphQ (Su et al., 2016)
ComplexWebQ (Talmor and Berant, 2018)
ComQA (Abujabal et al., 2019)
LC-QuAD (Trivedi et al., 2017a)
LC-QuAD 2.0 (Dubey et al., 2019)
Total
Table 3. Distribution of question types by source in TimeQuestions. The sum exceeds the number of questions as some questions belong to multiple categories.

Previous collections on temporal questions, TempQuestions (Jia et al., 2018a) and Event-QA (Costa et al., 2020), contain only about a thousand questions each, and are not suitable for building neural models. We leverage recent community efforts in QA benchmarking, and we search through eight KG-QA datasets for time-related questions. The result is a new compilation, TimeQuestions, with about 16k questions, that we release with this paper (details in Table 3). Since some of these previous benchmarks were over Freebase or DBpedia, we used Wikipedia links in these KGs to map them to Wikidata, the largest and most actively growing public KG today, and the one that we use in this work. Questions in each benchmark are tagged for temporal expressions using SUTime (Chang and Manning, 2012) and HeidelTime (Strötgen and Gertz, 2010), and for signal words using a dictionary compiled by Setzer (2002). Whenever a question is found to have at least one temporal expression or signal word, it becomes a candidate temporal question. This candidate set (ca. questions) was filtered for false positives by the authors. For each of these questions, the authors manually verified the correctness of the answer, and if incorrect, replaced it with the right one. Moreover, each question is manually tagged with its temporal question category (explicit, implicit, temporal answer, or ordinal), which may help in building automated classifiers for temporal questions, a sub-problem interesting in its own right. We split our benchmark in a ratio for creating the training ( questions), development () and test () sets.
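The candidate selection rule (at least one temporal expression or signal word) can be mimicked with a toy filter. This is only a stand-in: the actual pipeline uses SUTime and HeidelTime for temporal expressions and a signal-word dictionary from Setzer (2002), whereas the sketch below assumes a tiny illustrative word list and a four-digit-year regex, both hypothetical.

```python
import re

# Illustrative signal words only; NOT the dictionary used in the paper.
SIGNAL_WORDS = {"before", "after", "during", "when", "while", "since", "until"}

# Crude stand-in for a temporal tagger: matches four-digit years 1000-2099.
YEAR_PATTERN = re.compile(r"\b(1\d{3}|20\d{2})\b")

def is_candidate_temporal_question(question):
    """True if the question has a signal word or a year-like expression."""
    tokens = re.findall(r"[a-z]+", question.lower())
    has_signal = any(token in SIGNAL_WORDS for token in tokens)
    has_expression = YEAR_PATTERN.search(question) is not None
    return has_signal or has_expression

if __name__ == "__main__":
    for q in ["who became president of south africa in 1989?",
              "where did harriet tubman live after the civil war?",
              "who wrote hamlet?"]:
        print(q, is_candidate_temporal_question(q))
```

A filter of this kind is deliberately recall-oriented, which is why a subsequent manual pass to remove false positives is needed.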

5.2. Baselines

We use the following recent methods for complex KG-QA as baselines to compare Exaqt with. All baselines were trained and fine-tuned using the train and dev sets of TimeQuestions, respectively. They are the most natural choice of baselines as Exaqt is inspired by components in these methods for building its pipeline: while Uniqorn (Pramanik et al., 2021) showed the effectiveness of GSTs in complex KG-QA, GRAFT-Net (Sun et al., 2018) and PullNet (Sun et al., 2019) showed the value of R-GCNs for answer prediction. These techniques are designed for dealing with heterogeneous answering sources (KGs and text), and we use their KG-only variants:

  • Uniqorn (Pramanik et al., 2021): This is a method for answering complex questions using Group Steiner Trees, and is an extension of (Lu et al., 2019);

  • GRAFT-Net (Sun et al., 2018): This was the first technique to adapt R-GCNs for QA over heterogeneous sources;

  • PullNet (Sun et al., 2019): This algorithm extended the GRAFT-Net classifier to the scenario of multi-hop questions. We used a reimplementation as the code is not public.

5.3. Metrics

All systems return a ranked list of answers, consisting of KG entities or literals associated with unique identifiers. We thus use the following metrics for evaluating Exaqt and the baselines, averaged over questions in the benchmark:

  • P@1: Precision at the top rank is one if the highest ranked answer is correct, and zero otherwise.

  • MRR: This is the reciprocal of the first rank where we have a correct answer. If the correct answer does not feature in the ranked list, MRR is zero.

  • Hit@5: This is set to one if a correct answer appears in the first five positions, and zero otherwise.
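The three metrics above can be computed as in the following sketch, which assumes each system output is a ranked list of answer identifiers and each gold standard is a set of identifiers; the function names are illustrative.

```python
def precision_at_1(ranked, gold):
    """1 if the top-ranked answer is correct, else 0."""
    return 1.0 if ranked and ranked[0] in gold else 0.0

def reciprocal_rank(ranked, gold):
    """Reciprocal of the first rank holding a correct answer, else 0."""
    for rank, answer in enumerate(ranked, start=1):
        if answer in gold:
            return 1.0 / rank
    return 0.0

def hit_at_k(ranked, gold, k=5):
    """1 if a correct answer appears in the first k positions, else 0."""
    return 1.0 if any(answer in gold for answer in ranked[:k]) else 0.0

def evaluate(ranked_lists, gold_sets, k=5):
    """Average each metric over all questions."""
    n = len(ranked_lists)
    pairs = list(zip(ranked_lists, gold_sets))
    return {
        "P@1": sum(precision_at_1(r, g) for r, g in pairs) / n,
        "MRR": sum(reciprocal_rank(r, g) for r, g in pairs) / n,
        "Hit@%d" % k: sum(hit_at_k(r, g, k) for r, g in pairs) / n,
    }

if __name__ == "__main__":
    ranked = [["a", "b"], ["x", "y", "z"], ["m"]]
    gold = [{"a"}, {"z"}, {"q"}]
    print(evaluate(ranked, gold))
```

In the toy run, the first question is answered at rank 1, the second at rank 3, and the third not at all, so the averaged MRR is (1 + 1/3 + 0) / 3.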

5.4. Initialization

Configuration. We use the Wikidata KG dump (https://dumps.wikimedia.org/wikidatawiki/entities/) in NTriples format from April 2020, comprising triples and taking TB when uncompressed on disk. We subsequently removed language tags, external IDs, schema labels and URLs from the dump, leaving us with about triples with GB disk space consumption.

For BERT fine-tuning, positive and negative instances were created from the TimeQuestions train and dev sets in the ratio . These instances were combined and split in the ratio (test set not needed), where the first split was used for training and the second for hyperparameter selection, respectively, for BERT fine-tuning. We use the BERT-base-cased model for sequence pair classification (https://bit.ly/3fRVqAG). Best parameters for fine-tuning were: accumulation , number of epochs , dropout , mini-batch size and weight decay . We use AdamW as the optimizer with a learning rate of . During answer graph construction, we use top- question-relevant facts (), top- GSTs (), and top- temporal facts ().

R-GCN model training. 100-dimensional embeddings for question words, relation (KG predicate) words and entities are obtained from Wikipedia2Vec (Yamada et al., 2020), learned from the Wikipedia dump of March 2021. Dimensions of TCE, TSE, TE and TEE (Sec. 4) were all set to 100 as well. The last hidden states of LSTMs were used as encodings wherever applicable. The model was trained on an Nvidia Quadro RTX 8000 GPU server. Hyperparameter values were tuned on the TimeQuestions dev set: number of GCN layers = , number of epochs = , mini-batch size = , gradient clip = , learning rate = , LSTM dropout = , linear dropout = , and fact dropout = . The ReLU activation function was used.

6. Key findings

Category Overall Explicit Implicit Temp. Ans. Ordinal
Method P@1 MRR Hit@5 P@1 MRR Hit@5 P@1 MRR Hit@5 P@1 MRR Hit@5 P@1 MRR Hit@5
Uniqorn (Pramanik et al., 2021)
GRAFT-Net (Sun et al., 2018)
PullNet (Sun et al., 2019)
Exaqt * * * * * * * * * * * * * * *

Statistical significance of Exaqt over the strongest baseline (GRAFT-Net), under the -tailed paired -test, is marked with an asterisk (*) ().

Table 4. Performance comparison of Exaqt with three complex QA baselines over the TimeQuestions test set.

The answering performance of Exaqt and the baselines is shown in Table 4 (best value in each column in bold). The main observations are as follows.

Exaqt outperforms baselines. The main observation from Table 4 is the across-the-board superiority of Exaqt over the baselines. Statistically significant results for each category, baseline and metric, indicate that general-purpose complex QA systems are not able to deal with the challenging requirements of temporal QA, and that temporally augmented methods are needed. Outperforming each baseline offers individual insights, as discussed below.

GSTs are not enough. GSTs are a powerful mechanism for complex QA that identify backbone skeletons in KG subsets and prune irrelevant information from noisy graphs. While this motivated the use of GSTs as a building block in Exaqt, outperforming the Uniqorn (Pramanik et al., 2021) method shows that non-terminals (internal nodes) in GSTs, by themselves, are not enough to answer temporal questions.

Augmenting R-GCNs with time information works well. That R-GCNs are a powerful model is clear from the fact that GRAFT-Net, without any explicit support for temporal QA, emerges as the strongest baseline in this challenging setup. A core contribution of our work is to extend R-GCNs with different kinds of temporal evidence. Improving over GRAFT-Net shows that our multi-pronged mechanism (with TEE, ATR, TCE, TSE, and TE) succeeds in advancing the scope of R-GCN models to questions with temporal intent. Ablation studies (Sec. 7) show that each of these “prongs” plays an active role in the overall performance of Exaqt.

Not every question is multi-hop. PullNet is a state-of-the-art system for answering multi-hop chain-join questions (where was Obama’s father born?). It may appear strange that PullNet, offered as an improvement over GRAFT-Net, falls short in our setup. Inspecting examples makes the reason for this clear: PullNet has an assumption that all answers are located on a -hop circumference of the question entities (ideally, -hop, where is a variable that needs to be fixed for a benchmark: is an oversimplification, while is intractable for a large KG, and hence our choice of for TimeQuestions). When this is not the case (for instance, the slightly tricky situation when an answer is in a qualifier of a 2-hop fact: when did obama’s children start studying at sidwell friends school? or the question is simple: when was obama born?), PullNet cannot make use of this training point as it relies on shortest KG paths between question and answer entities. This uniform -hop assumption is not always practical, and does not generalize to situations beyond what PullNet was trained and evaluated on.

Temporal categories vary by difficulty. We use manual ground-truth labels of question categories from our benchmark to drill down on class-wise results (the noisy tagger from Sec. 4.1.1 has accuracy). Questions with temporal answers are clearly the easiest. Note that this includes questions starting with “when”, which many models tackle with dedicated lexical answer types (Bast and Haussmann, 2015; Abujabal et al., 2017), analogous to location-type answers for “where …?” questions. Questions with explicit temporal expressions are the next rung of the ladder: while they do require reasoning, explicit years often make this matching easier (who became president of south africa in 1989?). Questions with implicit expressions are more challenging: we believe that this is where the power of R-GCNs truly shines, as GST-based Uniqorn clearly falls short. Finally, questions with temporal ordinals seem to be beyond what implicit reasoning in graph neural networks can handle: with P@1 , they pose the biggest research challenge. We believe that this calls for revisiting symbolic reasoning, ideally plugged into neural GCN architectures.

7. In-depth analysis

NERD Recall #Question entities
TagMe
ELQ
AIDA
TagMe + ELQ
AIDA + ELQ
TagMe + AIDA
Table 5. Comparing various NERD methods on the test set.

NERD variants. We experimented with TagMe (Ferragina and Scaiella, 2010), AIDA (Hoffart et al., 2011), and ELQ (Li et al., 2020), going by the most popular to the most recent choices. Effects of various choices are in Table 5. Our best configuration is TagMe + ELQ. TagMe (used without threshold on pruning entities) and ELQ (run with default parameters) nicely complement each other, since one is recall-oriented (TagMe) and the other precision-biased (ELQ). Answer recall measures the fraction of questions for which at least one gold answer was present in the final answer graph (test set). AIDA + ELQ detects a similar number of entities per question, but is slightly worse w.r.t. answer recall.
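Answer recall, as defined above, is straightforward to compute once the answer graph of each question is available. The sketch below assumes each answer graph is given as a collection of node identifiers (entities and literals) and each gold standard as a collection of answer identifiers; the function names are illustrative.

```python
def answer_recall(answer_graph_nodes, gold_answers):
    """Fraction of questions whose answer graph contains at least one gold answer."""
    hits = sum(
        1 for nodes, gold in zip(answer_graph_nodes, gold_answers)
        if set(nodes) & set(gold)
    )
    return hits / len(gold_answers)

def average_candidates(answer_graph_nodes):
    """Average number of distinct answer candidates per question."""
    return sum(len(set(nodes)) for nodes in answer_graph_nodes) / len(answer_graph_nodes)

if __name__ == "__main__":
    graphs = [["Q1", "Q2", "1989"], ["Q7"], ["Q3", "Q4"]]
    gold = [["1989"], ["Q8"], ["Q4", "Q5"]]
    print(answer_recall(graphs, gold), average_candidates(graphs))
```

Tracking both numbers side by side makes the trade-off between graph size and recall visible at each pipeline step.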

Step in Exaqt pipeline Recall #Candidates
All KG facts of NERD entities
Facts selected by BERT
Shortest paths injected for connectivity
GSTs on largest component
Union of GSTs from all components
Completed GSTs from all components
Temporal facts added by BERT
Table 6. Understanding the recall-oriented Stage 1 of Exaqt.

Understanding Stage 1. Traversing over the steps in the recall-oriented graph construction phase of Exaqt, we try to understand where we gain (and lose) answers to temporal questions (Table 6, test set). First, we see that even two NERD systems cannot guarantee perfect answer recall (). The fall from Row 1 to 2 is expected, as one cannot compute graph algorithms efficiently over such large graphs as induced by all facts from Row 1. Adding shortest paths (Row 3), while making the answer graph more connected (before: connected components per question, after: ), also marginally helps in bringing correct answers into the graph. From Rows 4 and 5, we see that taking a union of top- () GSTs from each connected component proves worthwhile (increase from 0.613 to 0.640), and so does completing the GSTs (further rise to ). Finally, adding temporal facts provides a critical boost, taking the answer recall at the end of Stage 1 to a respectable . This translates to questions having answers in the graph passed on to the R-GCN (cf. answers are present in the PPR-based answer graph of GRAFT-Net), out of which are answered correctly at the end. The second column, that counts the average number of entities and literals in the answer graph (answer candidates) is highly insightful to get an idea of the graph size at each step, and its potential trade-off with respect to answer recall.

Category Overall Explicit Implicit Temp. Ans. Ordinal
Exaqt (Full)
Exaqt - TCE
Exaqt - TSE
Exaqt - TEE
Exaqt - TE
Exaqt - ATR
Table 7. Inspecting the precision-oriented Stage 2 of Exaqt.

Understanding Stage 2. We performed ablation studies to understand the relative influence of the individual temporal components in the precision-oriented Stage 2 of Exaqt: the R-GCN answer classifier. Table 7 shows P@1 results on the test set, where the full model achieves the best results overall and also for each category. The amount of drop from the full model (Row 1) indicates the degree of importance of a particular component. The most vital enhancement is the attention over temporal relations (ATR). All other factors offer varying degrees of assistance. An interesting observation is that TCE, while playing a moderate role in most categories, is of the highest importance for questions with temporal answers: even knowing that a question belongs to this category helps the model.

what did abraham lincoln do before he was president?
who was the king of troy when the trojan war was going on?
what films are nominated for the oscar for best picture in 2009?
where did harriet tubman live after the civil war?
when did owner bill neukom’s sports team last win the world series?
Table 8. Anecdotal examples that Exaqt answered correctly.

Anecdotal examples. Table 8 shows samples of test questions that are successfully processed by Exaqt but by none of the baselines.

8. Related Work

Temporal QA in IR. Supporting temporal intent in query and document processing has been a long-standing research topic in IR (Setzer, 2002; Campos et al., 2014; Kanhabua and Anand, 2016; Alonso et al., 2011; Berberich et al., 2010; Navarro-Colorado and Saquete, 2015). This includes work inside the specific use case of QA over text (Harabagiu and Bejan, 2005; Saquete et al., 2009; Ahn et al., 2006; Lloret et al., 2011). Most of these efforts require significant preprocessing and markup of documents. There is also an onus on questions to be formulated in specific ways so as to conform to carefully crafted parsers. These directions often fall short of realistic settings on the Web, where documents and questions are both formulated ad hoc. Moreover, such corpus markup unfortunately does not play a role in structured knowledge graphs. A notable effort in temporal QA is the work of Saquete et al. (2009), which decomposes complex questions into simpler components, and recomposes answer fragments into responses that satisfy the original intent. Such approaches have bottlenecks from parsing issues. Exaqt makes no assumptions on how questions are formulated.

Temporal QA over KGs. Questions with temporal conditions have not received much attention in the KG-QA literature. The few works that specifically address temporal questions include (Jia et al., 2018b; Costa et al., 2020; Wu et al., 2020). Among these, (Jia et al., 2018b) relies on hand-crafted rules with limited generalization, whereas Exaqt is automatically trained with distant supervision and covers a much wider territory of questions. (Costa et al., 2020) introduces the task of event-centric QA, which overlaps with our notion of temporal questions, and introduces a benchmark collection. (Wu et al., 2020) presents a key-value memory network to include KG information about time into a QA pipeline. The method is geared for simple questions, as present in the WebQuestions benchmark.

Temporal KGs. Of late, understanding large KGs as a dynamic body of knowledge has gained attention, giving rise to the notion of temporal knowledge graphs or temporal knowledge bases (Dhingra et al., 2021; Trivedi et al., 2017b). Here, each edge (corresponding to a fact) is associated with a temporal scope or validity (Leblay and Chekol, 2018), with current efforts mostly focusing on the topic of temporal KG completion (Garcia-Duran et al., 2018; Lacroix et al., 2020; Goel et al., 2020). A very recent approach has explored QA over such temporal KGs, along with the creation of an associated benchmark (Saxena et al., 2021).

9. Conclusions

Temporal questions have been underexplored in QA, and so has temporal information in KGs, despite their importance for knowledge workers like analysts or journalists as well as advanced information needs of lay users. This work on the Exaqt method has presented a complete pipeline for filling this gap, based on a judicious combination of BERT-based classifiers and graph convolutional networks. Most crucially, we devised new methods for augmenting these components with temporal signals. Experimental results with a large collection of complex temporal questions demonstrate the superiority of Exaqt over state-of-the-art general-purpose methods for QA over knowledge graphs.

Acknowledgements. We thank Philipp Christmann and Jesujoba Alabi from the MPI for Informatics for useful inputs at various stages of this work. Zhen Jia was supported by (i) China Academy of Railway Sciences Corporation Limited (2019YJ106); and (ii) Sichuan Science and Technology Program (2020YFG0035).

References

  • A. Abujabal, R. Saha Roy, M. Yahya, and G. Weikum (2018) Never-ending learning for open-domain question answering over knowledge bases. In WWW, Cited by: §1.
  • A. Abujabal, R. Saha Roy, M. Yahya, and G. Weikum (2019) ComQA: A Community-sourced Dataset for Complex Factoid Question Answering with Paraphrase Clusters. In NAACL-HLT, Cited by: Table 3.
  • A. Abujabal, M. Yahya, M. Riedewald, and G. Weikum (2017) Automated template generation for question answering over knowledge graphs. In WWW, Cited by: §6.
  • B. Aditya, G. Bhalotia, S. Chakrabarti, A. Hulgeri, C. Nakhe, S. Sudarshan, et al. (2002) BANKS: Browsing and keyword searching in relational databases. In VLDB, Cited by: §3.2.
  • D. Ahn, S. Schockaert, M. De Cock, and E. Kerre (2006) Supporting temporal question answering: strategies for offline data collection. In ICoS-5, Cited by: §1, §8.
  • J. F. Allen (1983) Maintaining knowledge about temporal intervals. CACM 26 (11). Cited by: §1, §2, §4.1.1.
  • O. Alonso, M. Gertz, and R. Baeza-Yates (2007) On the value of temporal information in information retrieval. In SIGIR Forum, Cited by: §1.
  • O. Alonso, J. Strötgen, R. Baeza-Yates, and M. Gertz (2011) Temporal Information Retrieval: Challenges and Opportunities. TWAW 11. Cited by: §1, §8.
  • G. Angeli, M. J. J. Premkumar, and C. D. Manning (2015) Leveraging linguistic structure for open domain information extraction. In ACL, Cited by: 4th item.
  • S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives (2007) DBpedia: A nucleus for a Web of open data. In ISWC, Cited by: §1, §2.
  • J. Bao, N. Duan, Z. Yan, M. Zhou, and T. Zhao (2016) Constraint-based question answering with knowledge graph. In COLING, Cited by: Table 3.
  • H. Bast and E. Haussmann (2015) More accurate question answering on Freebase. In CIKM, Cited by: §1, §6.
  • J. Berant, A. Chou, R. Frostig, and P. Liang (2013) Semantic parsing on freebase from question-answer pairs. In EMNLP, Cited by: §1, Table 3.
  • K. Berberich, S. Bedathur, O. Alonso, and G. Weikum (2010) A language modeling approach for temporal information needs. In ECIR, Cited by: §1, §8.
  • N. Bhutani and H. V. Jagadish (2019) Online schemaless querying of heterogeneous open knowledge bases. In CIKM, Cited by: §4.
  • N. Bhutani, X. Zheng, and H. Jagadish (2019) Learning to answer complex questions over knowledge bases with query composition. In CIKM, Cited by: §1, §3.1.
  • K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor (2008) Freebase: A collaboratively created graph database for structuring human knowledge. In SIGMOD, Cited by: §2.
  • B. C. Bruce (1972) A model for temporal references and its application in a question answering program. Artificial intelligence 3. Cited by: §1.
  • Q. Cai and A. Yates (2013) Large-scale semantic parsing via schema matching and lexicon extension. In ACL, Cited by: Table 3.
  • R. Campos, G. Dias, A. M. Jorge, and A. Jatowt (2014) Survey of temporal information retrieval and related applications. CSUR 47 (2). Cited by: §1, §8.
  • A. X. Chang and C. D. Manning (2012) SUTime: A library for recognizing and normalizing time expressions. In LREC, Cited by: 1st item, §5.1.
  • C. Chanial, R. Dziri, H. Galhardas, J. Leblay, M. Le Nguyen, and I. Manolescu (2018) ConnectionLens: Finding Connections Across Heterogeneous Data Sources. In VLDB, Cited by: §1.
  • T. S. Costa, S. Gottschalk, and E. Demidova (2020) Event-QA: A dataset for event-centric question answering over knowledge graphs. In CIKM, Cited by: §1, §5.1, §8.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Cited by: §3.1, §3.1.
  • B. Dhingra, J. R. Cole, J. M. Eisenschlos, D. Gillick, J. Eisenstein, and W. W. Cohen (2021) Time-aware language models as temporal knowledge bases. In arXiv, Cited by: §8.
  • D. Diefenbach, P. H. Migliatti, O. Qawasmeh, V. Lully, K. Singh, and P. Maret (2019) QAnswer: A question answering prototype bridging the gap between a considerable part of the LOD cloud and end-users. In WWW, Cited by: §1.
  • B. Ding, J. X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin (2007) Finding top-k min-cost connected trees in databases. In ICDE, Cited by: §3.2, §3.2.
  • M. Dubey, D. Banerjee, A. Abdelkawi, and J. Lehmann (2019) LC-QuAD 2.0: A large dataset for complex question answering over Wikidata and DBpedia. In ISWC, Cited by: Table 3.
  • D. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In NIPS, Cited by: §4.
  • P. Ferragina and U. Scaiella (2010) TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In CIKM, Cited by: §1, §3.1, §7.
  • A. Garcia-Duran, S. Dumančić, and M. Niepert (2018) Learning sequence encoders for temporal knowledge graph completion. In EMNLP, Cited by: §8.
  • R. Goel, S. M. Kazemi, M. Brubaker, and P. Poupart (2020) Diachronic embedding for temporal knowledge graph completion. In AAAI, Cited by: §8.
  • S. Harabagiu and C. A. Bejan (2005) Question answering based on temporal inference. In AAAI Workshop on inference for textual question answering, Cited by: §1, §8.
  • T. H. Haveliwala (2003) Topic-sensitive pagerank: A context-sensitive ranking algorithm for Web search. TKDE 15 (4). Cited by: §4.2.2.
  • D. Hernández, A. Hogan, and M. Krötzsch (2015) Reifying RDF: What works well with Wikidata?. In SSWS@ISWC, Cited by: §4.
  • J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum (2011) Robust disambiguation of named entities in text. In EMNLP, Cited by: §1, §3.1, §7.
  • Z. Jia, A. Abujabal, R. Saha Roy, J. Strötgen, and G. Weikum (2018a) TempQuestions: A benchmark for temporal question answering. In HQA, Cited by: §2, §2, §4.1.1, §5.1.
  • Z. Jia, A. Abujabal, R. Saha Roy, J. Strötgen, and G. Weikum (2018b) TEQUILA: Temporal Question Answering over Knowledge Bases. In CIKM, Cited by: §1, §1, §8.
  • M. Kaiser, R. Saha Roy, and G. Weikum (2021) Reinforcement learning from reformulations in conversational question answering over knowledge graphs. In SIGIR, Cited by: §3.2.
  • N. Kanhabua and A. Anand (2016) Temporal information retrieval. In SIGIR, Cited by: §1, §8.
  • E. Kuzey, V. Setty, J. Strötgen, and G. Weikum (2016) As time goes by: Comprehensive tagging of textual phrases with temporal scopes. In WWW, Cited by: §2.
  • T. Lacroix, G. Obozinski, and N. Usunier (2020) Tensor decompositions for temporal knowledge base completion. In ICLR, Cited by: §8.
  • J. Leblay and M. W. Chekol (2018) Deriving validity time in knowledge graph. In TempWeb, Cited by: §8.
  • B. Z. Li, S. Min, S. Iyer, Y. Mehdad, and W. Yih (2020) Efficient one-pass end-to-end entity linking for questions. In EMNLP, Cited by: §1, §3.1, §7.
  • R. Li, L. Qin, J. X. Yu, and R. Mao (2016) Efficient and progressive group steiner tree search. In SIGMOD, Cited by: §3.2.
  • E. Lloret, H. Llorens, P. Moreda, E. Saquete, and M. Palomar (2011) Text summarization contribution to semantic question answering: New approaches for finding answers on the web. International Journal of Intelligent Systems 26 (12), pp. 1125–1152. Cited by: §8.
  • X. Lu, S. Pramanik, R. Saha Roy, A. Abujabal, Y. Wang, and G. Weikum (2019) Answering complex questions by joining multi-document evidence with quasi knowledge graphs. In SIGIR, Cited by: §1, §3.2, 1st item.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NIPS, Cited by: §3.2.
  • B. Navarro-Colorado and E. Saquete (2015) Combining temporal information and topic modeling for cross-document event ordering. In arXiv, Cited by: §8.
  • B. Oguz, X. Chen, V. Karpukhin, S. Peshterliev, D. Okhonko, M. Schlichtkrull, S. Gupta, Y. Mehdad, and S. Yih (2021) UniK-QA: Unified Representations of Structured and Unstructured Knowledge for Open-Domain Question Answering. In arXiv, Cited by: §3.1, §4.
  • J. Pennington, R. Socher, and C. Manning (2014) GloVe: Global vectors for word representation. In EMNLP, Cited by: §3.2.
  • S. Pramanik, J. Alabi, R. S. Roy, and G. Weikum (2021) UNIQORN: Unified Question Answering over RDF Knowledge Graphs and Natural Language Text. In arXiv, Cited by: §3.2, 1st item, §5.2, Table 4, §6.
  • J. Pustejovsky, J. Wiebe, and M. Maybury (2002) Multiple-perspective and temporal question answering. In Question Answering: Strategy and Resources Workshop Program, Cited by: §1.
  • Y. Qiu, Y. Wang, X. Jin, and K. Zhang (2020) Stepwise reasoning for multi-relation question answering over knowledge graph with weak supervision. In WSDM, Cited by: §3.1.
  • R. Saha Roy and A. Anand (2020) Question Answering over Curated and Open Web Sources. In SIGIR, Cited by: §1.
  • E. Saquete, J. L. Vicedo, P. Martínez-Barco, R. Munoz, and H. Llorens (2009) Enhancing QA systems with complex temporal question processing capabilities. JAIR 35. Cited by: §1, §8.
  • A. Saxena, S. Chakrabarti, and P. Talukdar (2021) Question answering over temporal knowledge graphs. In ACL, Cited by: §8.
  • F. Schilder and C. Habel (2003) Temporal information extraction for temporal question answering. In New Directions in Question Answering, AAAI Technical Report, Cited by: §1.
  • M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling (2018) Modeling relational data with graph convolutional networks. In ESWC, Cited by: §1, §4.
  • A. Setzer (2002) Temporal information in newswire articles: An annotation scheme and corpus study. Ph.D. Thesis, University of Sheffield. Cited by: §1, §2, 2nd item, §5.1, §8.
  • Y. Shi, G. Cheng, and E. Kharlamov (2020) Keyword search over knowledge graphs via static and dynamic hub labelings. In WWW, Cited by: §1, §3.2.
  • J. Strötgen and M. Gertz (2010) HeidelTime: High quality rule-based extraction and normalization of temporal expressions. In SemEval, Cited by: 1st item, §5.1.
  • Y. Su, H. Sun, B. Sadler, M. Srivatsa, I. Gür, Z. Yan, and X. Yan (2016) On generating characteristic-rich question sets for QA evaluation. In EMNLP, Cited by: Table 3.
  • F. Suchanek, G. Kasneci, and G. Weikum (2007) YAGO: A core of semantic knowledge. In WWW, Cited by: §1, §2.
  • H. Sun, T. Bedrax-Weiss, and W. Cohen (2019) PullNet: Open domain question answering with iterative retrieval on knowledge bases and text. In EMNLP-IJCNLP, Cited by: §1, 3rd item, §5.2, Table 4.
  • H. Sun, B. Dhingra, M. Zaheer, K. Mazaitis, R. Salakhutdinov, and W. Cohen (2018) Open domain question answering using early fusion of knowledge bases and text. In EMNLP, Cited by: item (ii), §1, §4.2.2, §4.2.2, §4, 2nd item, §5.2, Table 4.
  • Y. Sun, X. Xiao, B. Cui, S. Halgamuge, T. Lappas, and J. Luo (2021) Finding group Steiner trees in graphs with both vertex and edge weights. In VLDB, Cited by: §3.2, §3.2.
  • A. Talmor and J. Berant (2018) The Web as a Knowledge-Base for Answering Complex Questions. In NAACL-HLT, Cited by: Table 3.
  • P. Trivedi, G. Maheshwari, M. Dubey, and J. Lehmann (2017a) LC-QuAD: A corpus for complex question answering over knowledge graphs. In ISWC, Cited by: Table 3.
  • R. Trivedi, H. Dai, Y. Wang, and L. Song (2017b) Know-Evolve: Deep temporal reasoning for dynamic knowledge graphs. In ICML, Cited by: §8.
  • N. UzZaman, H. Llorens, and J. Allen (2012) Evaluating temporal information understanding with temporal question answering. In ICSC, Cited by: §1.
  • S. Vakulenko, J. D. Fernandez Garcia, A. Polleres, M. de Rijke, and M. Cochez (2019) Message passing for complex question answering over knowledge graphs. In CIKM, Cited by: §1.
  • J. M. van Hulst, F. Hasibi, K. Dercksen, K. Balog, and A. P. de Vries (2020) REL: An entity linker standing on the shoulders of giants. In SIGIR, Cited by: §3.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, Cited by: §4.2.2.
  • D. Vrandečić and M. Krötzsch (2014) Wikidata: A free collaborative knowledge base. CACM 57 (10). Cited by: §1, §2.
  • W. Wu, Z. Zhu, Q. Lu, D. Zhang, and Q. Guo (2020) Introducing external knowledge to answer questions with implicit temporal constraints over knowledge base. Future Internet 12 (3), pp. 45. Cited by: §1, §8.
  • M. Yahya, K. Berberich, S. Elbassuoni, and G. Weikum (2013) Robust question answering over the web of linked data. In CIKM, Cited by: §1.
  • I. Yamada, A. Asai, J. Sakuma, H. Shindo, H. Takeda, Y. Takefuji, and Y. Matsumoto (2020) Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia. In EMNLP, Cited by: §4.1.1, §4.2.1, §5.4.
  • W. Yih, M. Chang, X. He, and J. Gao (2015) Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base. In ACL-IJCNLP, Cited by: §1.
  • X. Zhang, W. Cheng, B. Zong, Y. Chen, J. Xu, D. Li, and H. Chen (2020) Temporal context-aware representation learning for question routing. In WSDM, Cited by: §1, §4.2.2.