Asking Complex Questions with Multi-hop Answer-focused Reasoning

09/16/2020 ∙ by Xiyao Ma, et al. ∙ University of Florida

Asking questions about natural language text has attracted increasing attention recently, and several schemes have been proposed with promising results by asking the right question words and copying relevant words from the input into the question. However, most state-of-the-art methods focus on asking simple questions involving single-hop relations. In this paper, we propose a new task called multi-hop question generation, which asks complex and semantically relevant questions by additionally discovering and modeling the multiple entities and their semantic relations, given a collection of documents and the corresponding answer. To solve the problem, we propose multi-hop answer-focused reasoning on a grounded answer-centric entity graph to include different granularity levels of semantic information, including the word-level and document-level semantics of the entities and their semantic relations. Through extensive experiments on the HOTPOTQA dataset, we demonstrate the superiority and effectiveness of our proposed model, which serves as a baseline to motivate future work.




1 Introduction

Given a background context and the corresponding answer, the question generation (QG) task aims to ask a semantically relevant question. QG has considerable benefits in education scenarios, dialogue systems, and question answering Du et al. (2017). Recently, many approaches have been proposed to solve the problem Zhou et al. (2017); Sun et al. (2018); Ma et al. (2019), mostly realized by variants of the seq-to-seq model Sutskever et al. (2014) with attention and copy mechanisms Cho et al. (2014); Bahdanau et al. (2014).

However, existing works mainly focus on asking a simple question by capturing only one direct relation among the entities in the input context. Taking one example from the SQuAD dataset Rajpurkar et al. (2016), as shown in the upper part of Table 1, the model only needs to capture the single-hop relation between the entity “Donald Davies” and the answer entity “Message Routing Methodology” and ask the question “What did Donald Davies develop?”

Single-hop Question Generation
Document: Starting in 1965, Donald Davies at the National Physical Laboratory, UK, independently developed the same Message Routing Methodology as developed by Baran.
Question: What did Donald Davies develop?
Multi-hop Question Generation
Document 1: [Peggy Seeger] Margaret “Peggy” Seeger (born June 17, 1935) is an American folksinger. She is also well known in Britain, where she has lived for more than 30 years, and was married to the singer and songwriter Ewan MacColl until his death in 1989.
Document 2: [Ewan MacColl] James Henry Miller (25 January 1915 – 22 October 1989), better known by his stage name Ewan MacColl, was an English folk singer, songwriter, communist, labour activist, actor, poet, playwright and record producer.
Question: What nationality was James Henry Miller’s wife?

Table 1: Comparison of the single-hop question generation task on the SQuAD dataset Rajpurkar et al. (2016) and the proposed multi-hop question generation task on the HOTPOTQA dataset Yang et al. (2018)

In this paper, we propose a new task called multi-hop neural question generation. Given a collection of documents, each containing a context and a title, and assuming that the answer exists in at least one document, the model aims to generate a complex and semantically relevant question involving multiple entities and their semantic relations. One example is shown in the lower part of Table 1. The model needs to discover and capture the entities (e.g., “Peggy Seeger”, “Ewan MacColl”, and “James Henry Miller”) and their relations (e.g., “Peggy Seeger” was married to “James Henry Miller”, and “Ewan MacColl” is the stage name of “James Henry Miller”), and then ask the question “What nationality was James Henry Miller’s wife?” according to the answer “American”.

In addition to the common challenges in the single-hop question generation task, where the model needs to understand, paraphrase, and re-organize semantic information from the answer and the background context, another key challenge lies in discovering and modeling the entities and the multi-hop semantic relations across documents to understand the semantic relation between the answer and the background context. Merely applying a seq-to-seq model to the document text does not deliver comparable results, as such a model performs poorly at capturing the structured relations among the entities through multi-hop reasoning.

In this paper, we propose a multi-hop answer-focused reasoning model to tackle the problem. Specifically, instead of utilizing the unstructured text as the only input, we build an answer-centric entity graph with different types of extracted semantic relations among the entities across the documents to enable multi-hop reasoning. Inspired by the success of graph convolutional network (GCN) models, we further leverage the relational graph convolutional network (RGCN) Schlichtkrull et al. (2018) to perform answer-aware multi-hop reasoning by aggregating different granularity levels of answer-aware contextual entity representations and the semantic relations among the entities. Extensive experiments demonstrate that our proposed model outperforms the baselines in terms of various metrics. Our contributions are three-fold:

  • To the best of our knowledge, we are the first to propose the multi-hop neural question generation task, asking complex questions from a collection of documents through multi-hop reasoning.

  • We propose a multi-hop answer-focused reasoning model to dynamically reason and aggregate different granularity levels of answer-aware contextual entity representation and semantic relations among the entities in the grounded answer-centric entity graph.

  • We conduct extensive experiments to demonstrate that our proposed model outperforms SOTA single-hop QG models and a graph-based multi-hop QG model in terms of the main metrics, downstream multi-hop reading comprehension metrics, and human judgments. Our work offers a new baseline and motivates future research on the task.

2 Methods

In this section, we present the architecture and each module of the proposed multi-hop answer-focused reasoning model. The overall architecture is shown in Figure 1. Our proposed method adopts a seq-to-seq backbone Sutskever et al. (2014) incorporating the attention and copy mechanisms Bahdanau et al. (2014); Gulcehre et al. (2016). Our model consists of three parts: (i) answer-focused document encoding, (ii) multi-hop answer-focused reasoning, and (iii) an aggregation layer, which together provide an answer-focused and enriched contextual representation.

Figure 1: Architecture of Answer-focused Multi-hop Reasoning Model.

2.1 Answer-focused Document Encoding

Document Encoding

Given input documents to the model, we represent them as a sequence of words by concatenating the text words and the title words of each document:


Following Zhou et al. (2017), for each word, we obtain its embedding by concatenating its word embedding, answer positional embedding, and feature-enriched embedding (e.g., POS, NER).

A one-layer bi-directional LSTM Hochreiter and Schmidhuber (1997) is utilized as the encoder to obtain the document representation:
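As a sketch, the feature-enriched input embedding can be formed as below; the 300/16/16 dimensions follow the implementation details in Section 3.4, while the token and tag names are hypothetical illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the paper: 300-dim GloVe word vectors,
# 16-dim answer-positional features, 16-dim lexical (POS/NER/case) features.
WORD_DIM, ANS_DIM, FEAT_DIM = 300, 16, 16

word_emb = {"davies": rng.normal(size=WORD_DIM)}      # pretrained word vectors
ans_pos_emb = {0: rng.normal(size=ANS_DIM),           # 0 = outside the answer span
               1: rng.normal(size=ANS_DIM)}           # 1 = inside the answer span
feat_emb = {"PROPN|PERSON": rng.normal(size=FEAT_DIM)}  # hypothetical POS|NER tag embedding

def embed(token, in_answer, tag):
    """Concatenate word, answer-positional, and lexical-feature embeddings."""
    return np.concatenate([word_emb[token], ans_pos_emb[in_answer], feat_emb[tag]])

x = embed("davies", 1, "PROPN|PERSON")
assert x.shape == (WORD_DIM + ANS_DIM + FEAT_DIM,)  # this vector is fed to the Bi-LSTM
```

The resulting 332-dim vectors form the token sequence consumed by the bi-directional LSTM encoder.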


Gated Self-attention Layer

The above document representation has limited knowledge of the context Wang et al. (2017). The gated self-attention layer is utilized to learn a contextual document representation with a Bi-GRU Chung et al. (2014):



The contextual vector is obtained by attending to the context:


where the parameters involved are trainable weights of the network.

Answer Gating Mechanism

We further propose the answer gating mechanism to empower the model to learn the answer-focused document representation. Utilizing a gate computed by a sigmoid function to control the information flow, only the answer-related semantic information of the documents is forwarded to the downstream multi-hop reasoning:

\[ \hat{s}_t = \sigma\big(s_t^\top W_b \, a\big) \cdot s_t \]

where the answer vector $a$ is the hidden state of the first answer word, and $W_b$ is a trainable parameter of the bilinear function.
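A minimal sketch of the gate, assuming a sigmoid over a bilinear score between each contextual state and the answer vector (shapes are illustrative, not the paper's actual dimensions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def answer_gate(S, a, W):
    """Gate each contextual state by its bilinear affinity with the answer vector.

    S: (T, d) contextual document states; a: (d,) hidden state of the first
    answer word; W: (d, d) trainable bilinear weight.
    Returns the answer-focused states: one scalar gate in (0, 1) per position.
    """
    g = sigmoid(S @ W @ a)   # (T,) gate per position
    return g[:, None] * S    # pass only answer-related information downstream

rng = np.random.default_rng(1)
S = rng.normal(size=(5, 4))
a = rng.normal(size=4)
W = rng.normal(size=(4, 4))
S_hat = answer_gate(S, a, W)
assert S_hat.shape == S.shape
```

Because each gate lies in (0, 1), the gated states are always element-wise shrunk versions of the original states, so weakly answer-related positions contribute less to the reasoning stage.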

2.2 Multi-hop Answer-focused Reasoning

Answer-centric Entity Graph Grounding

Figure 2: Diagram of an answer-centric entity graph built from the documents in Table 1. The text in ovals and the solid lines in different colors indicate the different semantic types of the nodes and the edges, respectively. The answer node is connected with all other nodes in the graph; these edges are not drawn for conciseness.

To explicitly discover and model the multiple entities and their semantic relations across documents, we ground an answer-centric entity graph from the unstructured text.

Let an answer-centric entity graph be denoted as , where denotes entity nodes at different levels, and denotes the edges between nodes annotated with different semantic relations.

To do so, we first exploit the Spacy toolkit Honnibal and Montani (2017) to extract the named entities and all coreference words. Then, we identify the exactly matched non-stop words across the documents. We treat these exactly matched non-stop words, the named entities, the answer, and the titles as the nodes of the answer-centric entity graph, which represent different granularity levels of the contextual representation: (1) the exactly matched non-stop words and entity nodes encode the word-level and local representation in the specific document context; (2) the title nodes represent the document-level semantics; (3) the answer node offers the answer-aware representation for the graph reasoning and models a global representation across documents.

We then define the edges between nodes by leveraging different types of semantics within the documents, using the following heuristics:

(1) We connect all exactly matched named entities, whether they appear in the same document or in different documents (e.g., “Ewan MacColl”).

(2) We connect all inter-document and intra-document exactly matched non-stop words (e.g., “singer”, “songwriter”).

(3) All coreference words are then linked to each other.

(4) We further connect the title node with all entity nodes within the same document.

(5) We add dense connections between all title nodes.

(6) The answer node is connected to all other nodes in the graph, resulting in an answer-centric entity graph.

An example graph built from the documents of the example in Table 1 is shown in Figure 2, representing different granularity levels of the semantic information with various nodes and edges.
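The graph-grounding heuristics above can be sketched as follows, with hypothetical node names standing in for the extracted entities, matched words, titles, and the answer; coreference edges (rule 3) are omitted for brevity:

```python
from collections import defaultdict

# Hypothetical nodes; in the model they come from spaCy NER, coreference
# resolution, and exact matching of non-stop words across documents.
nodes = {
    "ewan_maccoll@d1": "entity", "ewan_maccoll@d2": "entity",
    "singer@d1": "word", "singer@d2": "word",
    "title@d1": "title", "title@d2": "title",
    "answer": "answer",
}

edges = defaultdict(set)
def connect(u, v, rel):
    """Add an undirected edge labeled with its semantic relation type."""
    edges[u].add((v, rel))
    edges[v].add((u, rel))

# (1)-(2): link exactly matched entities / non-stop words across documents
connect("ewan_maccoll@d1", "ewan_maccoll@d2", "match")
connect("singer@d1", "singer@d2", "match")
# (4): connect each title to the entity nodes in the same document
connect("title@d1", "ewan_maccoll@d1", "in_doc")
connect("title@d2", "ewan_maccoll@d2", "in_doc")
# (5): dense connections between all title nodes
connect("title@d1", "title@d2", "title_title")
# (6): the answer node is connected to every other node
for v in nodes:
    if v != "answer":
        connect("answer", v, "answer_edge")

assert all(("answer", "answer_edge") in edges[v] for v in nodes if v != "answer")
```

The edge labels become the relation types that the downstream RGCN layers condition on.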

Multi-hop Reasoning with RGCN

To make use of the grounded answer-centric entity graph, we leverage a GNN-based model to conduct the multi-hop reasoning. In general, with different message passing strategies, graph neural network-based models update each node representation based on its first-order neighbors.

Specifically, we employ the RGCN for the multi-hop reasoning Schlichtkrull et al. (2018). We first initialize the representation of each node with the output of the answer gating mechanism, using the corresponding hidden state (averaged over its words when the entity node spans multiple words). Meanwhile, edges are annotated with one-hot vectors indicating the different semantic relations. In each layer $l$, the representation of node $i$ is updated by summing the transformation of its own representation and the transformations of its neighbors:

\[ h_i^{(l+1)} = \sigma\Big( W_0^{(l)} h_i^{(l)} + \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} \Big) \]
where $W_r^{(l)}$ is a relation-specific trainable weight and $c_{i,r}$ is a normalization constant. The number of parameters is further reduced by expressing each relation weight as a linear combination of basis weights $V_b^{(l)}$ with relation-specific coefficients $a_{rb}^{(l)}$:

\[ W_r^{(l)} = \sum_{b=1}^{B} a_{rb}^{(l)} V_b^{(l)} \]
After $L$ layers of reasoning, at most $L$-hop relations can be captured.
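A minimal NumPy sketch of one such RGCN layer with basis decomposition (toy dimensions, degree normalization, and ReLU activation are illustrative choices, not the paper's exact implementation):

```python
import numpy as np

def rgcn_layer(H, adj_by_rel, V, coeff, W0):
    """One RGCN layer (Schlichtkrull et al., 2018) with basis decomposition.

    H: (N, d) node states; adj_by_rel: one (N, N) adjacency matrix per relation;
    V: (B, d, d) basis weights; coeff: (R, B) relation coefficients;
    W0: (d, d) self-loop weight.
    """
    out = H @ W0                                   # self-transformation
    for r, A in enumerate(adj_by_rel):
        Wr = np.tensordot(coeff[r], V, axes=1)     # W_r = sum_b a_{rb} V_b
        deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)
        out += (A / deg) @ H @ Wr                  # normalized neighbor aggregation
    return np.maximum(out, 0.0)                    # ReLU nonlinearity

rng = np.random.default_rng(2)
N, d, R, B = 4, 3, 2, 2                            # nodes, dims, relations, bases
H = rng.normal(size=(N, d))
adj = [rng.integers(0, 2, size=(N, N)).astype(float) for _ in range(R)]
V = rng.normal(size=(B, d, d))
coeff = rng.normal(size=(R, B))
W0 = rng.normal(size=(d, d))
H1 = rgcn_layer(H, adj, V, coeff, W0)
assert H1.shape == (N, d)
```

Stacking the layer $L$ times propagates information along paths of up to $L$ edges, which is what enables the multi-hop reasoning.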

2.3 Aggregation Layer

Inspired by Peters et al. (2018), the final answer-aware contextual representation is computed by selectively aggregating the output of each RGCN layer and the answer-aware document representation with trainable layer-wise weights. Similarly, the answer node representation of each layer and the last hidden state of the LSTM are stacked together to produce a more accurate document-level and global representation:


where the node representations and the answer node representation of each layer are combined with layer-wise trainable weights. By doing so, the different granularities of contextual representations expressing various types of semantics are aggregated to produce the final entity-level and document-level representation for the decoder.
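A minimal sketch of such layer-wise aggregation, assuming softmax-normalized scalar weights in the style of Peters et al. (2018); the exact weighting scheme in the model may differ:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def aggregate(layers, weights, gamma=1.0):
    """ELMo-style aggregation: a softmax-weighted sum of per-layer
    representations, scaled by a global factor gamma.

    layers: list of (N, d) arrays (e.g., encoder output plus each RGCN layer);
    weights: (L,) trainable scalars, one per layer.
    """
    s = softmax(weights)
    return gamma * sum(s[l] * layers[l] for l in range(len(layers)))

rng = np.random.default_rng(3)
layers = [rng.normal(size=(4, 3)) for _ in range(3)]  # 3 layers, 4 nodes, 3 dims
out = aggregate(layers, np.array([0.1, 0.5, -0.2]))
assert out.shape == (4, 3)
```

The softmax keeps the mixture weights positive and normalized, so the model can learn how much each reasoning depth should contribute to the decoder input.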

2.4 Decoder

With its hidden state initialized from the aggregated representation, a uni-directional LSTM is utilized as the decoder to generate the question, where the current hidden state is updated given the previously generated word and the previous hidden state:


where the context vector is computed with the attention mechanism Bahdanau et al. (2014) by attending to the encoder hidden states:


To address the out-of-vocabulary issue, we also exploit the copy mechanism to steer the model to copy a word from the input See et al. (2017); Gulcehre et al. (2016). Specifically, at each decoding step, a copy probability is computed that decides whether to copy a word from the input documents based on the attention matrix or to generate a word from the vocabulary via an output layer with a softmax function:


Finally, treating the attention weights as the copy distribution, the final word distribution is the sum of the probability of generating a word from the vocabulary and the probability of copying a word from the input:

\[ P(w) = p_{\text{gen}} \, P_{\text{vocab}}(w) + (1 - p_{\text{gen}}) \sum_{t:\, x_t = w} \alpha_t \]
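A minimal sketch of this mixture, following the pointer-generator formulation of See et al. (2017); the variable names and toy numbers are illustrative:

```python
import numpy as np

def final_distribution(p_gen, p_vocab, attn, src_ids):
    """Mix the generation and copy distributions.

    p_gen: scalar in (0, 1) deciding generate-vs-copy;
    p_vocab: (V,) softmax over the vocabulary;
    attn: (T,) attention weights over source positions;
    src_ids: (T,) vocabulary ids of the source tokens
    (true OOV words would extend the vocabulary, omitted here).
    """
    dist = p_gen * p_vocab.copy()
    # Scatter-add the copy mass onto the vocabulary ids of the source tokens;
    # np.add.at accumulates correctly when a word appears at several positions.
    np.add.at(dist, src_ids, (1.0 - p_gen) * attn)
    return dist

p_vocab = np.array([0.7, 0.2, 0.1])
attn = np.array([0.5, 0.5])          # source repeats the word with id 2 twice
dist = final_distribution(0.6, p_vocab, attn, np.array([2, 2]))
assert np.isclose(dist.sum(), 1.0)   # still a valid probability distribution
```

Here the copied word (id 2) ends up with probability 0.6·0.1 + 0.4·(0.5+0.5) = 0.46, showing how copy mass accumulates over repeated source occurrences.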
3 Experiment

In this section, we conduct extensive experiments on the HOTPOTQA dataset Yang et al. (2018), demonstrating the performance of the proposed model by comparing it with the existing SOTA single-hop question generation models and a multi-hop question generation model with GAT Veličković et al. (2017).

3.1 Experiment Setting


The HOTPOTQA dataset is a publicly available dataset collected from Wikipedia articles for the multi-hop reading comprehension task Yang et al. (2018). We discard the questions of the “comparison” type, and we only collect the text labeled “supporting facts” in each set of documents. Lacking access to the original test set, we combine the training and development sets and randomly split them into training, development, and test sets of 68,758, 4,992, and 4,991 samples, respectively.

Models BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L
NQG++ Zhou et al. (2017) 44.55 33.18 26.57 21.99 24.35 41.08
PG See et al. (2017) 46.13 35.14 28.71 24.12 24.14 42.18
SM-API Ma et al. (2019) 46.95 35.76 29.02 24.34 24.30 42.32
PG + GAT 47.35 36.10 29.85 24.98 24.56 42.62
Proposed 50.93 38.93 31.78 26.70 25.40 43.88
Table 2: Comparison of model performance in terms of the main metrics on the HOTPOTQA dataset.


In the experiments, we compare the performance of our proposed model and several baseline models as follows:

  • NQG++ Zhou et al. (2017): It is a commonly used baseline for the single-hop neural question generation task. The concatenated document text is passed into the seq-to-seq model with the answer positional embedding and enriched lexical features (e.g., named entity, part-of-speech, and case). Attention and copy mechanisms are adopted in the decoder.

  • Pointer-generator (PG) See et al. (2017): Originally proposed for the text summarization task, it is adapted here to the question generation problem; its copy mechanism is realized differently from NQG++. We also add the enriched lexical features in the embedding layer, as in NQG++.

  • Sentence-level Semantic Matching and Answer Position Inferring (SM-API) Ma et al. (2019): It is a state-of-the-art model on the single-hop neural question generation task. It proposes two modules called sentence-level semantic matching and answer position inferring, trained jointly with the seq-to-seq model to ask questions containing the right question words, keywords, and answer-aware semantics.

  • PG + GAT: The graph attention network (GAT) Veličković et al. (2017) updates a node representation by attending to the representations of its neighbors. One straightforward way to perform multi-hop reasoning is to apply three layers of GAT to the built answer-centric entity graph described in Section 2.2.


3.2 Results and Analysis

Main Metrics

We evaluate model performance in terms of BLEU-1 to BLEU-4 Papineni et al. (2002), METEOR Denkowski and Lavie (2014), and ROUGE-L Lin (2004) on the HOTPOTQA dataset in Table 2.

The SM-API model only marginally improves over the PG model on the BLEU-4 score and does not show a considerable advantage on the multi-hop question generation task. This is partly because the answer position inferring module, designed explicitly for single-hop answer position prediction, does not offer an accurate supervision signal for model training on the multi-hop dataset. On the other hand, the dataset does not include samples where questions about different answers are asked given the same context, which limits the power of the sentence-level semantic matching module.

Stacking several layers of GAT directly on the LSTM encoder improves the performance by leveraging the answer-centric entity graph for multi-hop reasoning; nevertheless, the different semantic relations and the answer-focused entity representations are ignored during the reasoning.

Our proposed multi-hop answer-focused reasoning model achieves much higher scores than the baselines as it leverages different granularity levels of answer-aware contextual entity representation and semantic relations among the entities in the grounded answer-centric entity graph, producing precise and enriched semantics for the decoder.

Downstream Task Metrics

The main metrics have limitations, as they only show that the proposed model can generate questions similar to the reference ones. We therefore further evaluate the generated questions on the downstream multi-hop machine comprehension task.

Specifically, we choose a well-trained DecompRC Min et al. (2019), a state-of-the-art model for the multi-hop machine comprehension problem on the same HOTPOTQA dataset, to conduct the experiment. In general, DecompRC decomposes complex questions requiring multi-hop reasoning into a series of simple questions that can be answered with single-hop reasoning. The performance of DecompRC on the generated questions thus reflects both their quality and the multi-hop reasoning ability of the generating models.

Questions EM (%) F1 (%)
Reference Questions 71.84 83.73
NQG++ Zhou et al. (2017) 65.82 76.97
PG See et al. (2017) 66.70 78.03
SM-API Ma et al. (2019) 67.01 78.43
PG + GAT 67.23 79.01
Proposed 69.92 81.25
Table 3: Performance of the DecompRC model in the downstream machine comprehension task in terms of EM and F1 score.

We report the Exact Match (EM) and F1 scores achieved by the DecompRC model in Table 3, given the reference questions and the questions generated by different models. The human-written reference questions yield the best performance. The DecompRC model achieves much higher EM and F1 scores on the questions generated by our proposed model than on those of the baseline models.

Analysis of Answer-focused Multi-hop Reasoning

The design of the answer-focused multi-hop reasoning model is to discover and capture the entities relevant to the answer by utilizing the various types of semantic relations among them. We analyze this effect by measuring the named entities in the generated questions in terms of Precision and Recall, similar to Sun et al. (2018). Quantitatively, given a generated question $\hat{q}$ and its reference question $q$, we define:

\[ \text{Precision} = \frac{|E(\hat{q}) \cap E(q)|}{|E(\hat{q})|}, \qquad \text{Recall} = \frac{|E(\hat{q}) \cap E(q)|}{|E(q)|} \]

where $E(\cdot)$ denotes the set of named entities in a question.
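The metric can be sketched as follows; the set-based formulation is one reasonable reading of the definition above, with entity sets assumed to come from a NER tool such as spaCy:

```python
def entity_pr(generated, reference):
    """Precision/recall of named entities in a generated question
    against its reference question.

    generated, reference: sets of entity strings (e.g., from spaCy NER).
    """
    if not generated or not reference:
        return 0.0, 0.0
    overlap = len(generated & reference)
    return overlap / len(generated), overlap / len(reference)

# Toy example: one of two generated entities matches the single reference entity.
p, r = entity_pr({"donald davies", "uk"}, {"donald davies"})
assert (p, r) == (0.5, 1.0)
```

Averaging these per-question scores over the test set gives the Precision and Recall values reported in Table 4.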

Models Precision Recall
NQG++ 46.59 52.66
PG 46.64 52.45
SM-API 46.82 53.10
PG + GAT 47.30 53.44
Proposed 49.29 54.64
Table 4: Comparison of Precision and Recall on different model-generated questions.

As reported in Table 4, our proposed model outperforms the baselines, indicating that our model can generate questions involving more answer-aware entities by leveraging the answer-focused multi-hop reasoning.

3.3 Human Evaluation

We further examine 100 generated questions with a human evaluation, scoring them on a scale from 1 to 5 in terms of semantic relatedness, fluency, and complexity. Semantic relatedness measures how well a generated question matches the documents and the answer. Fluency reflects the naturalness of the generated questions, and complexity measures whether the generated questions are complicated and involve multiple entities.

Models Semantic Relatedness Fluency Complexity
NQG++ 2.86 3.22 3.06
PG 3.03 3.31 3.02
SM-API 3.06 3.21 3.20
PG + GAT 3.11 3.29 3.43
Proposed 3.20 3.34 3.71
Table 5: Human evaluation of the graph-based models and the baseline models.
Document 1: [Muriel Humphrey Brown] Muriel Fay Buck Humphrey Brown (February 20, 1912 – September 20, 1998) was an American politician who served as the Second Lady of the United States and as a U.S. Senator from Minnesota. She was married to the 38th Vice President of the United States, Hubert Humphrey.
Document 2: [Hubert Humphrey] Hubert Horatio Humphrey Jr. (May 27, 1911 – January 13, 1978) was an American politician who served as the 38th Vice President of the United States from 1965 to 1969.
Reference: who is the minnesota senator that was married to muriel humphrey and served as the 38th vice president of the united states ?
PG: who was an american politician who served as the 38th vice president of the united states from 1965 to 1969 ?
SM-API: who served as the 38th vice president of the united states from 1965 to 1969 ?
PG+GAT: who married to Hubert Humphrey who served as the 38th vice president of the united states from 1965 to 1969 ?
Proposed: muriel humphrey brown was an american politician who served as the second lady of the united states and as a u.s. senator from minnesota married to which american politician who served as vice president of the united states from 1965 to 1969 ?
Table 6: Case study showing the superiority of leveraging structured graph data with linguistic relations.

As reported in Table 5, by leveraging the answer-focused multi-hop reasoning, the questions generated by our approach are more complex and semantically relevant to the context and the answer than the baselines.

Case Study

Table 6 shows question samples generated by the models. PG and SM-API fail to discover or capture the entities and their semantic relation in Document 1 (e.g., “Muriel Humphrey married to Hubert Humphrey”) and ask a question about “Hubert Humphrey served as the 38th vice president of the united states” by focusing only on the semantics of Document 2.

However, utilizing the grounded entity graph, the GAT-based model generates a more complex question by involving the information “Muriel Humphrey married to Hubert Humphrey”. Furthermore, by leveraging different granularity levels of the semantic relations among the entities with the answer-focused multi-hop reasoning, the question generated by our model is not only more complex, involving more semantics (e.g., “Muriel Humphrey married to Hubert Humphrey” and “Muriel Humphrey served as the Second Lady of the United States and as a U.S. Senator from Minnesota”), but also more relevant to the answer than those of the other models.

3.4 Implementation Details

We employ the Spacy toolkit Honnibal and Montani (2017) for tokenization, NER and POS tagging, and coreference resolution. We use 300-dim pre-trained GloVe vectors as the word embeddings. Following NQG++ Zhou et al. (2017), we concatenate the word embedding with 16-dim answer positional features and 16-dim linguistic feature embeddings, including the case, NER, and POS tag features. We train the model with the Adam optimizer Kingma and Ba (2014) on an NVIDIA V100 GPU. We halve the initial learning rate when the validation BLEU-4 score does not improve on the dev set, and we employ beam search during inference.

4 Related Work

Single-relation Question Generation

Existing work dealing with the question generation task can be classified into two categories: rule-based and neural network-based methods. Rule-based approaches mainly adopt human-designed linguistic templates or rules and are difficult, time-consuming, and expensive to scale up. Meanwhile, the rigid templates also limit the diversity of the generated questions Mazidi and Nielsen (2014); Labutov et al. (2015).

Recently, a series of neural network-based models have been proposed to solve the problem, as they show a flexible ability to understand and generate natural language, outperforming the rigid rule-based approaches. Du et al. (2017) first proposed the question generation task of asking a free-form question given the context. Zhou et al. (2017) then proposed to ask answer-relevant questions given the answer and to incorporate linguistic feature embeddings in the model. Sun et al. (2018) further improved the performance by utilizing an additional vocabulary for question word generation and employing relative answer positional embeddings. Ma et al. (2019) proposed to train two general modules jointly with the seq-to-seq model for generating the right keywords and question words and copying the answer-relevant words.

However, the existing models mainly focus on generating questions from single-relation contexts. Different from previous works, we propose a new and challenging task of asking complex questions from a collection of documents, requiring the model to discover and reason over the entities and the semantic relations among them.

Graph Neural Network on NLP tasks

Leveraging GNN-based models for NLP tasks has gained popularity recently. GNN-based models are mainly adopted to model the semantic and syntactic information in natural language text. Zhang et al. (2018) employ the GCN model on dependency trees to tackle relation extraction. A recurrent graph-based model was proposed to solve the bAbI tasks given graph-structured input Li et al. (2015). Liu et al. (2019) apply the GCN model to the dependency tree parsed from the input sentence to predict clues for asking questions.

Multi-hop Reasoning

Several works realize multi-hop reasoning for question answering given multiple documents. Yoon et al. (2019) apply a graph neural network to a structured graph built from sentence, document, and query nodes to classify the supporting facts used for answering the query. Min et al. (2019) realize multi-hop reasoning by decomposing the multi-hop query into single-hop queries. To model different levels of semantics, Tu et al. (2019) build a heterogeneous graph with entities, documents, and candidates as nodes, which inspires our answer-centric entity graph.

Different from existing models, our proposed model remains answer-focused throughout the multi-hop reasoning. Our proposed multi-hop answer-focused reasoning with RGCN facilitates modeling the different levels of semantic information.

5 Conclusion and Future Work

In this paper, we propose a new task that asks complex questions given a collection of documents and the corresponding answer by discovering and modeling the multiple entities and their semantic relations across the documents. To solve the problem, we propose answer-focused multi-hop reasoning that leverages different granularity levels of semantic information in the answer-centric entity graph built from natural language text. Extensive experimental results demonstrate the superiority of our proposed model in terms of automatically computed metrics and human evaluation. Our work provides a baseline for the new task and sheds light on future work in the multi-hop question generation scenario.

In the future, we would like to investigate whether commonsense knowledge can be incorporated during the multi-hop reasoning to ask reasonable questions.


  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1, §2.4, §2.
  • K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §1.
  • J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014)

    Empirical evaluation of gated recurrent neural networks on sequence modeling

    arXiv preprint arXiv:1412.3555. Cited by: §2.1.
  • M. Denkowski and A. Lavie (2014) Meteor universal: language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, pp. 376–380. Cited by: §3.2.
  • X. Du, J. Shao, and C. Cardie (2017) Learning to ask: neural question generation for reading comprehension. arXiv preprint arXiv:1705.00106. Cited by: §1, §4.
  • C. Gulcehre, S. Ahn, R. Nallapati, B. Zhou, and Y. Bengio (2016) Pointing the unknown words. arXiv preprint arXiv:1603.08148. Cited by: §2.4, §2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.1.
  • M. Honnibal and I. Montani (2017)

    Spacy 2: natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing

    To appear 7. Cited by: §2.2, §3.4.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.4.
  • I. Labutov, S. Basu, and L. Vanderwende (2015) Deep questions without deep understanding. In

    Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing

    Vol. 1, pp. 889–898. Cited by: §4.
  • Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2015) Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Cited by: §4.
  • C. Lin (2004) Rouge: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81. Cited by: §3.2.
  • B. Liu, M. Zhao, D. Niu, K. Lai, Y. He, H. Wei, and Y. Xu (2019) Learning to generate questions by learning what not to generate. In The World Wide Web Conference, pp. 1106–1118. Cited by: §4.
  • X. Ma, Q. Zhu, Y. Zhou, X. Li, and D. Wu (2019) Improving question generation with sentence-level semantic matching and answer position inferring. arXiv preprint arXiv:1912.00879. Cited by: §1, 3rd item, Table 2, Table 3, §4.
  • K. Mazidi and R. D. Nielsen (2014) Linguistic considerations in automatic question generation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vol. 2, pp. 321–326. Cited by: §4.
  • S. Min, V. Zhong, L. Zettlemoyer, and H. Hajishirzi (2019) Multi-hop reading comprehension through question decomposition and rescoring. arXiv preprint arXiv:1906.02916. Cited by: §3.2, §4.
  • L. Pan, Y. Xie, Y. Feng, T. Chua, and M. Kan (2020) Semantic graphs for generating deep questions. arXiv preprint arXiv:2004.12704. Cited by: footnote 1.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §3.2.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §2.3.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. Cited by: Table 1, §1.
  • M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling (2018) Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607. Cited by: §1, §2.2.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368. Cited by: §2.4, 2nd item, Table 2, Table 3.
  • X. Sun, J. Liu, Y. Lyu, W. He, Y. Ma, and S. Wang (2018) Answer-focused and position-aware neural question generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3930–3939. Cited by: §1, §3.2, §4.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §1, §2.
  • M. Tu, G. Wang, J. Huang, Y. Tang, X. He, and B. Zhou (2019) Multi-hop reading comprehension across multiple documents by reasoning over heterogeneous graphs. arXiv preprint arXiv:1905.07374. Cited by: §4.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: 4th item, §3.
  • W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou (2017) Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 189–198. Cited by: §2.1.
  • Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) Hotpotqa: a dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600. Cited by: Table 1, §3.1, §3.
  • S. Yoon, F. Dernoncourt, D. S. Kim, T. Bui, and K. Jung (2019) Propagate-selector: detecting supporting sentences for question answering via graph neural networks. arXiv preprint arXiv:1908.09137. Cited by: §4.
  • Y. Zhang, P. Qi, and C. D. Manning (2018) Graph convolution over pruned dependency trees improves relation extraction. arXiv preprint arXiv:1809.10185. Cited by: §4.
  • Q. Zhou, N. Yang, F. Wei, C. Tan, H. Bao, and M. Zhou (2017) Neural question generation from text: a preliminary study. In National CCF Conference on Natural Language Processing and Chinese Computing, pp. 662–671. Cited by: §1, §2.1, 1st item, §3.4, Table 2, Table 3, §4.