Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation

08/14/2019 ∙ by Yu Chen, et al. ∙ Rensselaer Polytechnic Institute ∙ William & Mary

Natural question generation (QG) is a challenging yet rewarding task that aims to generate questions given an input passage and a target answer. Previous works on QG, however, either (i) ignore the rich structure information hidden in the word sequence, (ii) fail to fully exploit the target answer, or (iii) rely solely on a cross-entropy loss that leads to issues like exposure bias and an evaluation discrepancy between training and testing. To address these limitations, we propose a reinforcement learning (RL) based graph-to-sequence (Graph2Seq) architecture for the QG task. Our model consists of a Graph2Seq generator, in which a novel bidirectional graph neural network (GNN) based encoder embeds the input passage and incorporates the answer information via a simple yet effective Deep Alignment Network, and an evaluator, in which a mixed objective function combining both cross-entropy loss and RL loss is designed to ensure the generation of semantically and syntactically valid text. The proposed model is end-to-end trainable, achieves new state-of-the-art scores, and outperforms all previous methods by a great margin on the SQuAD benchmark.


1 Introduction

Natural question generation (QG) is a challenging yet rewarding problem. It has many useful applications including improving the question answering task Chen et al. (2017, 2019a) by providing more training data Du et al. (2017); Tang et al. (2017); Song et al. (2017); Yuan et al. (2017); Li et al. (2018), generating practice exercises and assessments for educational purposes Heilman and Smith (2010); Heilman (2011); Danon and Last (2017) and helping dialog systems, such as Alexa and Google Assistant, to kick-start and continue a conversation with human users Mostafazadeh et al. (2016). While many other works focus on QG from images Mostafazadeh et al. (2016); Fan et al. (2018); Li et al. (2018) or knowledge bases Serban et al. (2016); Elsahar et al. (2018), in this work, we focus on QG from textual data.

Conventional methods Mostow and Chen (2009); Heilman and Smith (2010); Heilman (2011); Hussein et al. (2014) for QG rely on heuristic rules or hand-crafted templates, with low generalizability and scalability. Recent attempts have focused on neural network-based approaches that do not require manually-designed rules and are end-to-end trainable. Inspired by neural machine translation, these approaches formulate the QG task as a sequence-to-sequence (Seq2Seq) learning problem, applying various types of encoders and decoders, and have shown promising results Du et al. (2017); Zhou et al. (2017); Song et al. (2018a, 2017); Kumar et al. (2018a). However, these methods ignore the hidden structural information associated with a word sequence, such as its syntactic parse tree. Thus, they may fail to utilize the rich text structure that complements the simple word sequence.

It has been observed that, in general, training Seq2Seq models using cross-entropy based objectives has limitations such as exposure bias and inconsistency between training and testing measurements Ranzato et al. (2015); Wu et al. (2016); Norouzi et al. (2016); Paulus et al. (2017); Gong et al. (2019), and thus does not always produce the best results on discrete evaluation metrics. To tackle these issues, some recent QG approaches Song et al. (2017); Kumar et al. (2018b) aim at directly optimizing evaluation metrics using Reinforcement Learning (RL) Williams (1992). However, existing approaches generally do not consider joint mixed objective functions with both semantic and syntactic constraints for guiding text generation.

Early works on neural QG did not take into account the target answer information when generating a question. Recent works Zhou et al. (2017); Song et al. (2018a, 2017); Kumar et al. (2018a); Kim et al. (2018); Liu et al. (2019) have explored various ways of utilizing the target answer to guide the generation of the question, resulting in more relevant questions with better quality. These methods mainly focus on either simply marking the answer location in the passage Zhou et al. (2017); Liu et al. (2019), using complex passage-answer matching strategies Song et al. (2018a, 2017), or carefully separating answers from passages when applying a Seq2Seq model. However, they neglect potential semantic relations between word pairs and thus do not explicitly model the global interactions among sequence words in the embedding space.

To address the aforementioned issues, we propose a reinforcement learning based generator-evaluator architecture for QG. Our generator employs a modified graph-to-sequence (Graph2Seq) Xu et al. (2018a) model that encodes the graph representation of a passage using a graph neural network (GNN) based encoder, and outputs a question sequence using an RNN decoder. Our evaluator is trained by optimizing a mixed objective function combining both cross-entropy loss and RL loss, where the RL loss is based on discrete evaluation metrics and the model is first pretrained with the regular cross-entropy loss. To apply GNNs to non-structured textual data, we also explore both static and dynamic ways of constructing a graph. In addition, we introduce a simple yet effective Deep Alignment Network (DAN) for incorporating answer information into the passage.

We highlight our contributions as follows:


  • We propose a novel RL based Graph2Seq model for natural question generation.

  • We design a simple yet effective Deep Alignment Network for explicitly modeling answer information.

  • We design a novel bidirectional Gated Graph Neural Network to process directed passage graphs.

  • We design a two-stage training strategy to train the proposed model with both cross-entropy and RL losses.

  • We explore different ways of constructing passage graphs and investigate their performance impact on a GNN encoder.

  • The proposed model achieves new state-of-the-art scores and outperforms all previous methods by a great margin on the SQuAD Rajpurkar et al. (2016) benchmark.

2 An RL-based Generator-Evaluator Architecture

Figure 1: Overall architecture of the proposed model. Best viewed in color.

Given a passage X^p and a target answer X^a, the task of natural question generation is to generate the best natural language question Ŷ = argmax_Y P(Y | X^p, X^a), i.e., the question that maximizes the conditional likelihood given the passage and the answer.

We use 300-dim GloVe Pennington et al. (2014) embeddings and 1024-dim BERT Devlin et al. (2018) embeddings to embed each word in the passage and the answer. We denote the GloVe embeddings of the i-th passage word and the j-th answer word by x_i^p and x_j^a, and their BERT embeddings by b_i^p and b_j^a, respectively.

2.1 Deep Alignment Network

Answer information is crucial for generating relevant questions from a passage. Unlike previous methods that simply mark the answer location in the passage Zhou et al. (2017), we propose the Deep Alignment Network (DAN) for incorporating answer information into the passage at multiple levels of granularity. To model the interactions between a passage and a target answer at different granularity levels, we conduct attention-based soft-alignment at both the word level and the contextualized hidden state level.

Let us denote X^p and X^a as the passage embeddings and the answer embeddings, respectively. The soft-alignment operation consists of three steps: i) compute an attention score β_{ij} for each pair of passage word x_i^p and answer word x_j^a; ii) multiply the attention matrix β with the answer embeddings to obtain the aligned answer embeddings for the passage; iii) concatenate the aligned answer embeddings with the passage embeddings to get the final passage embeddings. To simplify notation, we denote the soft-alignment function as Align(X, Y; Z), meaning that an attention matrix is computed between two sets of vectors X and Y, and is then used to form a linear combination of the vector set Z.

In this work, we define the attention score as β_{ij} = ReLU(W x_i^p)^T ReLU(W x_j^a), where W ∈ R^{d×d} is a trainable model parameter, with d being the hidden state size.
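The soft-alignment step can be sketched as follows. This is an illustrative numpy implementation, not the authors' code; dimensions and the random projection matrix are stand-ins for the trained parameters.

```python
# Sketch of attention-based soft-alignment: ReLU-projected dot-product scores,
# row-wise softmax, then a linear combination of the answer embeddings.
import numpy as np

def soft_align(passage, answer, W):
    """passage: (n, d), answer: (m, d), W: (d, d) trainable projection.
    Returns (n, d) aligned answer embeddings and the (n, m) attention matrix."""
    p = np.maximum(passage @ W, 0.0)          # ReLU(W x_i^p)
    a = np.maximum(answer @ W, 0.0)           # ReLU(W x_j^a)
    scores = p @ a.T                          # (n, m) attention scores
    scores = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)   # row-wise softmax
    aligned = attn @ answer                   # linear combination of answer vectors
    return aligned, attn

n, m, d = 5, 3, 4
rng = np.random.default_rng(0)
passage = rng.normal(size=(n, d))
answer = rng.normal(size=(m, d))
W = rng.normal(size=(d, d))
aligned, attn = soft_align(passage, answer, W)
# Aligned answer embeddings are concatenated with the passage embeddings.
enriched = np.concatenate([passage, aligned], axis=1)   # shape (n, 2d)
```

Each passage word thus receives an answer summary weighted by its affinity to each answer word.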

2.1.1 Word-level Alignment

We first do soft-alignment at the word level based on the GloVe embeddings of the passage and the answer. Specifically, we compute the aligned answer embeddings by applying the soft-alignment function to the word-level passage and answer embeddings, so that each passage word receives its own aligned answer embedding.

For each passage word, we concatenate its GloVe embedding, BERT embedding, linguistic feature vector (i.e., case, NER and POS embeddings) and the aligned answer embedding to obtain its final word representation. A bidirectional LSTM Hochreiter and Schmidhuber (1997) (BiLSTM) is then applied to the passage embeddings to obtain contextualized passage embeddings. Similarly, for each answer word, we concatenate its GloVe embedding with its BERT embedding, and apply another BiLSTM to the answer embeddings to obtain contextualized answer embeddings.
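For concreteness, the per-word input dimensionalities implied by this concatenation can be tallied using the embedding sizes reported in Section 3.2 (GloVe 300, BERT 1024, case/POS/NER tags of size 3/12/8); treating the aligned answer embedding as GloVe-sized is our assumption, since the word-level alignment operates on GloVe vectors.

```python
# Back-of-the-envelope check of the per-word representation sizes.
GLOVE_DIM = 300
BERT_DIM = 1024
FEAT_DIM = 3 + 12 + 8          # case + POS + NER tag embeddings (Section 3.2)
ALIGNED_DIM = GLOVE_DIM        # word-level alignment is over GloVe vectors

passage_word_dim = GLOVE_DIM + BERT_DIM + FEAT_DIM + ALIGNED_DIM
answer_word_dim = GLOVE_DIM + BERT_DIM
print(passage_word_dim, answer_word_dim)   # 1647 1324
```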

2.1.2 Hidden-level Alignment

We proceed to do soft-alignment at the hidden state level, applying the same soft-alignment function to the contextualized passage and answer embeddings to compute the aligned answer embeddings.

Finally, we apply another BiLSTM to the concatenation of the contextualized passage embeddings and the aligned answer embeddings to get the final passage embedding matrix.

2.2 Bidirectional Graph Encoder

While a Recurrent Neural Network (RNN) is good at modeling local interactions among consecutive passage words, a Graph Neural Network (GNN) can better utilize the rich text structure and the global interactions among sequence words, and thus further improve the representations. Therefore, we construct a passage graph that contains each passage word as a node, and apply a GNN to this graph.

2.2.1 Passage Graph Construction

While it is straightforward to apply GNNs to graph structured data, it is not clear what is the best way of representing text as a graph. In this work, we explore both static and dynamic ways of constructing a graph for textual data.

Syntax-based static graph construction: We construct a directed, unweighted passage graph based on dependency parsing. For each sentence in a passage, we first obtain its dependency parse tree. We then connect neighboring parse trees by adding an edge between the two adjacent words that lie on the sentence boundary.
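A minimal sketch of this static construction, assuming per-sentence dependency edges are already given by an external parser (the token lists and edge pairs below are illustrative):

```python
# Build a passage-level graph from per-sentence dependency edges, linking the
# boundary words of adjacent sentences to connect neighboring parse trees.
def build_passage_graph(sentences):
    """sentences: list of (tokens, edges), with edges as (head, dependent)
    index pairs local to the sentence. Returns global tokens and edges."""
    tokens, edges, offset = [], [], 0
    prev_last = None
    for sent_tokens, sent_edges in sentences:
        for h, d in sent_edges:
            edges.append((h + offset, d + offset))   # shift to global indices
        if prev_last is not None:
            # link last word of the previous sentence to the first of this one
            edges.append((prev_last, offset))
        tokens.extend(sent_tokens)
        prev_last = offset + len(sent_tokens) - 1
        offset += len(sent_tokens)
    return tokens, edges

sents = [
    (["effective", "planning", "is", "essential"], [(3, 2), (3, 1), (1, 0)]),
    (["it", "matters"], [(1, 0)]),
]
tokens, edges = build_passage_graph(sents)
```

The cross-sentence edge (here, from "essential" to "it") is what turns the per-sentence trees into a single connected passage graph.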

Semantic-aware dynamic graph construction: We dynamically build a weighted graph to model semantic relationships among passage words, and make the graph construction depend not only on the passage, but also on the answer. Specifically, we first apply a self-attention mechanism to the word-level passage embeddings to compute a dense attention matrix A, serving as a weighted adjacency matrix for the passage graph, where the attention scores are computed with a trainable weight matrix.

Considering that a fully connected passage graph is not only computationally expensive but also makes little sense for graph processing, we proceed to extract a sparse, directed graph from A via a kNN-style strategy: for each node, we keep only the k nearest neighbors (including the node itself) together with the associated attention scores, and mask off the remaining attention scores. Finally, we apply a softmax function to the selected adjacency matrix elements to get two normalized adjacency matrices for the outgoing and incoming directions, respectively:

(1)  A_out = softmax(kNN(A)),   A_in = softmax(kNN(A^T))

Note that the supervision signal is still able to back-propagate through the kNN-style graph sparsification operation, since the nearest attention scores are kept and used to compute the normalized adjacency matrices.
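The sparsification-and-normalization step can be sketched as below. This is an illustrative numpy version under our own assumptions (a ReLU self-attention projection and a per-row top-k mask); the exact parameterization in the paper may differ.

```python
# Dense self-attention adjacency -> kNN-style sparsification -> row softmax.
import numpy as np

def dynamic_adjacency(X, U, k):
    """X: (n, d) word-level passage embeddings, U: (d, d) trainable weight."""
    proj = np.maximum(X @ U, 0.0)
    A = proj @ proj.T                        # dense attention scores, (n, n)
    masked = np.full_like(A, -np.inf)
    for i in range(A.shape[0]):
        nbrs = np.argsort(A[i])[-k:]         # keep the k largest scores per node
        masked[i, nbrs] = A[i, nbrs]
    e = np.exp(masked - masked.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)  # row-stochastic adjacency

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))
U = rng.normal(size=(4, 4))
A_out = dynamic_adjacency(X, U, k=3)   # the incoming direction would use A^T
```

Since exp(-inf) is exactly zero, the masked entries receive no probability mass, while gradients still flow through the k retained scores.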

2.2.2 Bidirectional Gated Graph Neural Networks

Unlike the previous Graph2Seq work Xu et al. (2018a) that uses bidirectional GraphSAGE Hamilton et al. (2017), we instead propose a novel Bidirectional Gated Graph Neural Network (BiGGNN) to process the directed passage graph. We extend the Gated Graph Sequence Neural Networks Li et al. (2015) to handle directed graphs. Node embeddings are initialized to the passage embeddings produced by the BiLSTM network.

In BiGGNN, the same set of network parameters is shared at every hop of computation. At each hop, for every node in the graph, we apply an aggregation function which takes as input a set of incoming (or outgoing) neighboring node vectors and outputs a backward (or forward) aggregation vector. For the syntax-based static graph, we use a mean aggregator:

(2)  h_{N(v)}^{k,in} = MEAN({h_v^{k-1}} ∪ {h_u^{k-1} : u ∈ N_in(v)}),  and analogously for N_out(v)

For the semantic-aware dynamic graph, we instead compute a weighted average for aggregation, where the weights come from the normalized adjacency matrices of the two directions:

(3)  h_{N(v)}^{k,in} = Σ_{u ∈ N_in(v)} a_{vu} h_u^{k-1},  and analogously for N_out(v)

While Xu et al. (2018a) learn separate node embeddings for both directions independently, we instead use a fusion function to fuse the information aggregated in the two directions at every hop, which we find works better:

(4)  h_{N(v)}^k = Fuse(h_{N(v)}^{k,in}, h_{N(v)}^{k,out})

The fusion function is designed as a gated sum of the two information sources:

(5)  Fuse(a, b) = z ⊙ a + (1 − z) ⊙ b,   z = σ(W_z [a; b; a ⊙ b; a − b] + b_z)

where σ is a sigmoid function and z is a gating vector.

Finally, a Gated Recurrent Unit (GRU) Cho et al. (2014) is used to update the node embeddings by incorporating the aggregated information:

(6)  h_v^k = GRU(h_v^{k-1}, h_{N(v)}^k)

After n hops of GNN computation, where n is a hyperparameter, we obtain the final state representation h_v^n for each node v. To compute the graph-level representation, we first apply a linear projection to the node embeddings, and then apply max-pooling over all node embeddings to get a d-dim graph embedding vector.
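One hop of BiGGNN on the static graph can be sketched as follows: mean aggregation in each edge direction, gated fusion of the two directions, and a GRU-style update. All weights, the toy graph, and the simplified GRU cell are illustrative stand-ins, not the trained model.

```python
# One BiGGNN hop: aggregate per direction, fuse with a gate, update with a GRU.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_aggregate(H, neighbors):
    """neighbors[v] -> list of neighbor ids; the node itself is included."""
    return np.stack([H[[v] + neighbors[v]].mean(axis=0) for v in range(len(H))])

def fuse(a, b, Wz, bz):
    # gated sum of the two directional aggregations (Eqs. 4-5)
    z = sigmoid(np.concatenate([a, b, a * b, a - b], axis=1) @ Wz + bz)
    return z * a + (1.0 - z) * b

def gru_update(h, m, Wr, Wu, Wc):
    """Simplified (bias-free) GRU cell over the pair (state h, message m)."""
    x = np.concatenate([h, m], axis=1)
    r = sigmoid(x @ Wr)                                  # reset gate
    u = sigmoid(x @ Wu)                                  # update gate
    c = np.tanh(np.concatenate([r * h, m], axis=1) @ Wc)  # candidate state
    return (1.0 - u) * h + u * c

n, d = 4, 3
rng = np.random.default_rng(2)
H = rng.normal(size=(n, d))
in_nbrs = {0: [1], 1: [2], 2: [3], 3: [0]}      # incoming edges per node
out_nbrs = {0: [3], 1: [0], 2: [1], 3: [2]}     # outgoing edges per node
Wz, bz = rng.normal(size=(4 * d, d)), np.zeros(d)
Wr, Wu = rng.normal(size=(2 * d, d)), rng.normal(size=(2 * d, d))
Wc = rng.normal(size=(2 * d, d))

m_in = mean_aggregate(H, in_nbrs)       # Eq. (2), incoming direction
m_out = mean_aggregate(H, out_nbrs)     # Eq. (2), outgoing direction
m = fuse(m_in, m_out, Wz, bz)           # Eqs. (4)-(5)
H_next = gru_update(H, m, Wr, Wu, Wc)   # Eq. (6)
```

Stacking n such hops (with shared parameters) yields the final node states.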

2.3 Decoder

On the decoder side, we follow previous works See et al. (2017); Song et al. (2017) and adopt an attention-based Bahdanau et al. (2014); Luong et al. (2015) LSTM model with copy Vinyals et al. (2015); Gu et al. (2016) and coverage mechanisms Tu et al. (2016). The decoder takes the graph-level embedding, passed through two separate fully-connected layers, as its initial hidden state and cell state, uses the node embeddings as the attention memory, and generates the output sequence one word at a time.

The particular decoder used in this paper closely follows See et al. (2017), to which we refer the reader for details; we briefly describe it here. At each decoding step t, an attention mechanism learns to attend to the most relevant words in the input sequence, and computes a context vector based on the current decoding state, the current coverage vector and the attention memory. In addition, the generation probability p_gen ∈ [0, 1] is calculated from the context vector, the decoder state and the decoder input. Next, p_gen is used as a soft switch to choose between generating a word from the vocabulary and copying a word from the input sequence. We dynamically maintain an extended vocabulary, which is the union of the usual vocabulary and all words appearing in a batch of source examples (i.e., passages and answers). Finally, to encourage the decoder to utilize the diverse components of the input sequence, a coverage mechanism is applied: at each step, we maintain a coverage vector, the sum of the attention distributions over all previous decoder time steps, and compute a coverage loss to penalize repeatedly attending to the same locations of the input sequence.
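The copy mechanism's final output distribution (following the pointer-generator formulation of See et al. (2017)) can be illustrated with a toy example; the numbers and vocabulary sizes below are made up.

```python
# Mix generation and copy probabilities over the extended vocabulary:
# P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on source
# positions whose token is w.
import numpy as np

def final_distribution(p_gen, p_vocab, attn, src_ids, extended_size):
    """p_vocab: (V,) fixed-vocab distribution; attn: (n,) source attention;
    src_ids: (n,) extended-vocab id of each source token."""
    p = np.zeros(extended_size)
    p[: len(p_vocab)] = p_gen * p_vocab
    for pos, tok in enumerate(src_ids):
        p[tok] += (1.0 - p_gen) * attn[pos]   # copy mass for source tokens
    return p

p_vocab = np.array([0.5, 0.3, 0.2])
attn = np.array([0.6, 0.4])
src_ids = np.array([1, 3])          # token id 3 is out-of-vocabulary (OOV)
p = final_distribution(0.7, p_vocab, attn, src_ids, extended_size=4)
# p sums to 1; the OOV token (id 3) receives only copy probability.
```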

2.4 Policy Based Reinforcement Learning

The most widely used way to train a sequence learning model is to optimize the log-likelihood of the ground-truth output sequence at each decoding step, which is known as the teacher forcing algorithm Williams and Zipser (1989).

However, it has been observed that optimizing such cross-entropy based training objectives for sequence learning does not always produce the best results on discrete evaluation metrics Ranzato et al. (2015); Wu et al. (2016); Norouzi et al. (2016); Paulus et al. (2017); Gong et al. (2019). There are two main limitations of this method. First, a model has access to the ground-truth sequence up to the next token during training but does not have such supervision when testing, resulting in accumulated errors. This gap of model behavior between training and inference is called exposure bias Ranzato et al. (2015). Second, there is an evaluation discrepancy between training and testing. A model is optimized with cross-entropy loss during training while evaluated with discrete evaluation metrics during testing.

To address the above issues, we present an evaluator in which a mixed objective function combining both cross-entropy loss and RL loss is designed to ensure the generation of semantically and syntactically valid text. We learn a policy that directly optimizes an evaluation metric using REINFORCE Williams (1992). Specifically, we adopt the self-critical sequence training algorithm Rennie et al. (2017), an efficient REINFORCE variant that, rather than estimating the reward signal or how it should be normalized, uses the output of its own test-time inference procedure to normalize the rewards it experiences.

For this training algorithm, at each training iteration, the model generates two output sequences: the sampled output Y^s, produced by multinomial sampling (each word is sampled according to the likelihood predicted by the generator), and the baseline output Y^b, obtained by greedy search (maximizing the output probability distribution at each decoding step). We define r(Y) as the reward of an output sequence Y, computed by comparing it to the corresponding ground-truth sequence with some reward metric. The loss function is defined as:

(7)  L_rl = (r(Y^b) − r(Y^s)) Σ_t log P(y_t^s | X, y_{<t}^s)

As we can see, if the sampled output has a higher reward than the baseline one, minimizing this loss maximizes its likelihood, and vice versa.
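A toy computation makes the sign behavior of this self-critical loss concrete (the log-probabilities and rewards below are invented for illustration):

```python
# Self-critical loss (Eq. 7) on a toy example: minimizing it increases the
# likelihood of sampled sequences that beat the greedy baseline.
import math

def self_critical_loss(logprobs_sampled, reward_sampled, reward_baseline):
    # (r(Y^b) - r(Y^s)) * sum_t log P(y_t^s | ...)
    return (reward_baseline - reward_sampled) * sum(logprobs_sampled)

logprobs = [math.log(0.5), math.log(0.25)]   # sum of log-probs is negative
loss_good = self_critical_loss(logprobs, reward_sampled=0.8, reward_baseline=0.5)
loss_bad = self_critical_loss(logprobs, reward_sampled=0.2, reward_baseline=0.5)
# When the sample beats the baseline, the gradient of this loss pushes the
# sampled tokens' probabilities up; when it loses, it pushes them down.
```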

Evaluation metric as reward function: We use one of our evaluation metrics, BLEU-4, as our reward function r_eval, which lets us directly optimize the model towards the evaluation metric.

Semantic metric as reward function: One drawback of evaluation metrics like BLEU is that they do not measure meaning, but only reward systems for n-grams that exactly match the reference. To make our reward function more effective and robust, we additionally use Word Mover's Distance (WMD) as a semantic reward function r_sem. WMD is a state-of-the-art approach to measuring the dissimilarity between two sentences based on word embeddings Kusner et al. (2015). Following Gong et al. (2019), we take the negative of the WMD between a generated sequence and the ground-truth sequence, divided by the sequence length, as its semantic score.

We define the final reward function as r(Y) = r_eval(Y) + α r_sem(Y), where α is a scalar weighting the semantic reward.

2.5 Training and Testing

We train our model in two stages. In the first stage, we train the model using the regular cross-entropy loss:

(8)  L_lm = Σ_t (− log P(y_t* | X, y_{<t}*) + λ covloss_t)

where y_t* is the word at the t-th position of the ground-truth output sequence and covloss_t is the coverage loss at step t, defined as Σ_i min(a_i^t, c_i^t), with a_i^t being the i-th element of the attention vector over the input sequence at time step t and c_i^t the corresponding element of the coverage vector. Note that the scheduled teacher forcing strategy Bengio et al. (2015) is adopted to alleviate the exposure bias problem.
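The coverage penalty can be traced on a tiny example; attending twice to the same position is penalized, while spreading attention across positions is not (the two-position attention vectors are illustrative):

```python
# Toy coverage penalty: sum_t sum_i min(a_i^t, c_i^t), where the coverage
# vector c accumulates the attention distributions of previous steps.
def coverage_loss(attn_steps):
    cov = [0.0] * len(attn_steps[0])
    total = 0.0
    for attn in attn_steps:
        total += sum(min(a, c) for a, c in zip(attn, cov))
        cov = [c + a for c, a in zip(cov, attn)]   # accumulate coverage
    return total

repeat = coverage_loss([[1.0, 0.0], [1.0, 0.0]])   # re-attends position 0 -> 1.0
spread = coverage_loss([[1.0, 0.0], [0.0, 1.0]])   # no overlap -> 0.0
```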

In the second stage, we fine-tune the model by optimizing a mixed objective function combining both cross-entropy loss and reinforcement loss:

(9)  L = γ L_rl + (1 − γ) L_lm

where γ is a scaling factor controlling the trade-off between the cross-entropy and reinforcement losses. A similar mixed-objective learning function has been used by Wu et al. (2016); Paulus et al. (2017) for machine translation and text summarization.

During the testing phase, we use beam search of width 5 to generate our final predictions.
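A minimal beam search sketch is shown below (width 5 in the paper; width 2 here for brevity). The toy next-token table, token ids, and EOS convention are all hypothetical.

```python
# Generic beam search over summed log-probabilities with an EOS token.
import math

def beam_search(step_fn, start, width, max_len, eos):
    beams = [([start], 0.0)]                 # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:               # finished beams carry over as-is
                candidates.append((seq, score))
                continue
            for tok, logp in step_fn(seq):
                candidates.append((seq + [tok], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]

# Hypothetical scorer: after token 0, "1 then 9 (EOS)" is the best continuation.
TABLE = {
    (0,): [(1, math.log(0.6)), (2, math.log(0.4))],
    (0, 1): [(9, math.log(0.9)), (2, math.log(0.1))],
    (0, 2): [(9, math.log(0.5)), (1, math.log(0.5))],
}
best = beam_search(lambda seq: TABLE.get(tuple(seq), [(9, 0.0)]), 0, 2, 4, eos=9)
```

In the full model, `step_fn` would be the decoder's next-token distribution over the extended vocabulary.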

3 Experiments

In this section, we evaluate our proposed model against state-of-the-art baselines. Following previous works, we use the SQuAD dataset Rajpurkar et al. (2016) as our benchmark, described below. The implementation of the model will be publicly available soon.

3.1 Data and Metrics

SQuAD contains more than 100K questions posed by crowd workers on 536 Wikipedia articles. Since the test set of the original SQuAD is not publicly available, the accessible portion (∼90%) is used as the entire dataset in our experiments. For fair comparison, we evaluate our model on both data split-1 Song et al. (2018a) (https://www.cs.rochester.edu/~lsong10/downloads/nqg_data.tgz), which contains 75,500/17,934/11,805 train/dev/test examples, and data split-2 Zhou et al. (2017) (https://res.qyzhou.me/redistribute.zip), which contains 86,635/8,965/8,964 examples.

Following previous works, we report and use BLEU-4 Papineni et al. (2002), METEOR Banerjee and Lavie (2005) and ROUGE-L Lin (2004) as our evaluation metrics (BLEU-4 and METEOR were initially designed for evaluating machine translation systems and ROUGE-L was initially designed for evaluating text summarization systems).

3.2 Model Settings

We keep and fix the GloVe vectors for the 70,000 most frequent words in the training set. We compute BERT embeddings on the fly for each word in the text using a (trainable) weighted sum of all BERT layer outputs. The embedding sizes of case, POS and NER tags are set to 3, 12 and 8, respectively. The size of all hidden layers is set to 300. We apply a variational dropout Kingma et al. (2015) rate of 0.4 after word embedding layers and 0.3 after RNN layers. We set the neighborhood size k to 10 for dynamic graph construction, and the number of GNN hops to 3. During training, in each epoch, we set the initial teacher forcing probability to 0.75 and exponentially increase it as a function of the training step. We set the scalar weight in the reward function to 0.1, the scaling factor in the mixed loss function to 0.99, and the coverage loss ratio to 0.4. We use Adam Kingma and Ba (2014) as the optimizer, with the learning rate set to 0.001 in the pretraining stage and 0.00001 in the fine-tuning stage. We reduce the learning rate by a factor of 0.5 if the validation BLEU-4 score stops improving for three epochs, and stop training when no improvement is seen for 10 epochs. We clip the gradient at length 10. The batch size is set to 60 and 50 on data split-1 and split-2, respectively. All hyperparameters are tuned on the development set.
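For reference, the hyperparameters stated in this section can be collected in one configuration; the key names are our own, but every value is taken verbatim from the text (the reward weight, mixed-loss factor and coverage ratio correspond to the scalars defined in Sections 2.4 and 2.5).

```python
# Hyperparameters from Section 3.2, gathered into a single config mapping.
CONFIG = {
    "glove_dim": 300,
    "bert_dim": 1024,
    "hidden_size": 300,
    "dropout_word": 0.4,          # variational dropout after word embeddings
    "dropout_rnn": 0.3,           # variational dropout after RNN layers
    "knn_neighborhood": 10,       # k for dynamic graph construction
    "gnn_hops": 3,
    "reward_weight": 0.1,         # weight on the semantic (WMD) reward
    "mixed_loss_gamma": 0.99,     # RL vs cross-entropy trade-off
    "coverage_ratio": 0.4,
    "lr_pretrain": 1e-3,
    "lr_finetune": 1e-5,
    "grad_clip": 10,
    "batch_size": {"split-1": 60, "split-2": 50},
    "beam_width": 5,              # from Section 2.5 (testing phase)
}
```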

3.3 Baseline methods

We compare our model with the following baseline methods: NQG++ Zhou et al. (2017), MPQG+R Song et al. (2017), ASs2s Kim et al. (2018) and CGC-QG Liu et al. (2019).

3.4 Evaluation Results and Case Study

Methods                      |        Split-1          | Split-2
                             | BLEU-4  METEOR  ROUGE-L | BLEU-4
NQG++ Zhou et al. (2017)     |   –       –       –     |  13.29
MPQG+R Song et al. (2017)    | 13.98   18.77   42.72   |  13.91
ASs2s Kim et al. (2018)      | 16.20   19.92   43.96   |  16.17
CGC-QG Liu et al. (2019)     |   –       –       –     |  17.55
G2S +BERT+RL (our model)     | 17.94   21.76   46.02   |  18.30
Table 1: Evaluation results of baseline methods and our model on the SQuAD test set. Higher scores indicate better performance. Note that we only report BLEU-4 on Split-2 since most of the baselines only report this result.
Passage: for the successful execution of a project , effective planning is essential .
Gold: what is essential for the successful execution of a project ?
Seq2Seq: what type of planning is essential for the project ?
G2S w/o DAN: what type of planning is essential for the successful execution of a project ?
G2S: what is essential for the successful execution of a project ?
G2S +BERT: what is essential for the successful execution of a project ?
G2S +BERT+RL: what is essential for the successful execution of a project ?

Passage: the church operates three hundred sixty schools and institutions overseas .
Gold: how many schools and institutions does the church operate overseas ?
Seq2Seq: how many schools does the church have ?
G2S w/o DAN: how many schools does the church have ?
G2S: how many schools and institutions does the church have ?
G2S +BERT: how many schools and institutions does the church have ?
G2S +BERT+RL: how many schools and institutions does the church operate ?
Table 2: Generated questions on SQuAD split-2 test set. Target answers are underlined.

Table 1 shows the experimental results comparing against all baseline methods. First of all, our full model G2S +BERT+RL achieves new state-of-the-art scores on both data splits and outperforms all previous methods by a great margin. Notably, some previous state-of-the-art methods rely on many heuristic rules and ad-hoc strategies. For instance, based on the observation that generated words are mostly frequent words while most low-frequency words are copied from the input rather than generated, CGC-QG Liu et al. (2019) annotates clue words in the passage based on word frequency and overlap, masks out low-frequency passage word embeddings, and reduces the target output vocabulary to boost model performance. ASs2s Kim et al. (2018) replaces the target answer in the original passage with a special token. Our proposed model does not rely on any of these hand-crafted rules or ad-hoc strategies.

In Table 2, we further show a few examples that illustrate the quality of the text generated from a passage under different models. As we can see, incorporating answer information helps the model identify the answer type of the question to be generated, thus making the generated questions more relevant and specific. We also find that our Graph2Seq model generates more complete and valid questions than the Seq2Seq baseline; we attribute this to the Graph2Seq model's ability to exploit the rich text structure better than a Seq2Seq model. Lastly, the examples show that fine-tuning the model using REINFORCE can improve the quality of the generated questions.

3.5 Ablation Study of the Proposed Model

Methods Split-2
BLEU-4
G2S +BERT+RL 18.30
G2S +BERT-fixed+RL 18.20
G2S +BERT 18.02
G2S +BERT-fixed 17.86
G2S +RL 17.49
G2S 16.96
G2S 16.81
G2S w/o feat. 16.51
G2S w/o feat. 16.65
G2S w/o DAN 12.58
G2S w/o DAN 12.62
G2S w/ DAN-word only 15.92
G2S w/ DAN-hidden only 16.07
G2S w/ GGNN-forward 16.53
G2S w/ GGNN-backward 16.75
G2S w/o BIGGNN (Seq2Seq) 16.14
G2S w/o BIGGNN, w/ GCN 14.47
Table 3: Ablation study on the SQuAD split-2 test set. Higher scores indicate better performance.

As shown in Table 3, we also perform an ablation study on the impact of different components (e.g., DAN, BiGGNN and RL fine-tuning) on the SQuAD split-2 test set. Turning off the Deep Alignment Network (DAN) dramatically drops the BLEU-4 score of G2S by more than four points (cf. Table 3), which indicates the importance of answer information for QG and shows the effectiveness of DAN; we make a similar observation for both graph-construction variants of G2S. Further experiments demonstrate that both word-level (G2S w/ DAN-word only) and hidden-level (G2S w/ DAN-hidden only) answer alignments in DAN are helpful. Comparing G2S against the Seq2Seq baseline shows the advantage of Graph2Seq learning over Seq2Seq learning on this task. In our experiments, we also observe that doing both forward and backward message passing in the GNN encoder is beneficial. Surprisingly, using GCN Kipf and Welling (2016) as the graph encoder (and converting the input graph to an undirected graph) harms performance. In addition, fine-tuning the model using REINFORCE further improves performance in all settings (i.e., with and without BERT), which shows the benefit of directly optimizing the evaluation metrics. Besides, we find that the pretrained BERT embeddings have a considerable impact on performance, and that fine-tuning the BERT embeddings improves it further, which demonstrates the power of large-scale pretrained language models. Incorporating common linguistic features (i.e., case, POS, NER) also helps the overall performance to some extent. Lastly, we find that syntax-based static graph construction performs slightly better than semantic-aware dynamic graph construction, even though the latter seems more powerful (i.e., answer-aware and history-aware) and can be optimized towards the QG task in an end-to-end manner. A big advantage of dynamic graph construction, however, is that it does not rely on domain knowledge (e.g., a dependency parser) to construct the graph.
We leave how to better construct a dynamic graph for textual data in an end-to-end manner as future work. Another interesting direction is to explore effective ways of combining both the static and dynamic graphs.

4 Related Work

4.1 Natural Question Generation

Early works Mostow and Chen (2009); Heilman and Smith (2010); Heilman (2011); Hussein et al. (2014) for QG focused on rule-based approaches that rely on heuristic rules or hand-crafted templates, with low generalizability and scalability. Recent attempts have focused on Neural Network (NN) based approaches that do not require manually-designed rules and are end-to-end trainable. Existing NN based approaches Du et al. (2017); Zhou et al. (2017); Song et al. (2018a); Kumar et al. (2018a) rely on the Seq2Seq model with attention, copy or coverage mechanisms. In addition, various ways Zhou et al. (2017); Song et al. (2018a, 2017); Kim et al. (2018) have been proposed to utilize the target answer so as to guide the generation of the question. To address the limitations of cross-entropy based sequence learning, some approaches Song et al. (2017); Kumar et al. (2018b) aim at directly optimizing evaluation metrics using REINFORCE.

4.2 Graph Neural Networks

Over the past few years, graph neural networks (GNNs) Kipf and Welling (2016); Gilmer et al. (2017); Hamilton et al. (2017); Li et al. (2015) have drawn increasing attention. Recently, GNNs have been applied to extend the widely used Seq2Seq architectures Sutskever et al. (2014); Cho et al. (2014) to Graph2Seq architectures Xu et al. (2018a, b, c); Song et al. (2018b). Very recently, researchers have explored methods to automatically construct a graph of visual objects Norcliffe-Brown et al. (2018) or words Liu et al. (2018); Chen et al. (2019b) when applying GNNs to non-graph structured data.

5 Conclusion

We proposed a novel reinforcement learning based Graph2Seq model for natural question generation, where the answer information is utilized by a simple yet effective Deep Alignment Network and a novel bidirectional GNN is proposed to process the directed passage graph. Our two-stage training strategy takes the benefits of both cross-entropy based and REINFORCE based training when training a sequence learning model. On the SQuAD dataset, our proposed model achieves the new state-of-the-art scores and outperforms all previous methods by a great margin. We also explore different ways of constructing graphs of textual data for graph neural networks. In the future, we would like to investigate more effective ways of automatically learning graph structures from free text.

References

  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §2.3.
  • S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72. Cited by: §3.1.
  • S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179. Cited by: §2.5.
  • D. Chen, A. Fisch, J. Weston, and A. Bordes (2017) Reading wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051. Cited by: §1.
  • Y. Chen, L. Wu, and M. J. Zaki (2019a) Bidirectional attentive memory networks for question answering over knowledge bases. arXiv preprint arXiv:1903.02188. Cited by: §1.
  • Y. Chen, L. Wu, and M. J. Zaki (2019b) GraphFlow: exploiting conversation flow with graph neural networks for conversational machine comprehension. arXiv preprint arXiv:1908.00059. Cited by: §4.2.
  • K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder–decoder for statistical machine translation. In EMNLP, pp. 1724–1734. Cited by: §2.2.2, §4.2.
  • G. Danon and M. Last (2017) A syntactic approach to domain-specific automatic question generation. arXiv preprint arXiv:1712.09827. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.
  • X. Du, J. Shao, and C. Cardie (2017) Learning to ask: neural question generation for reading comprehension. arXiv preprint arXiv:1705.00106. Cited by: §1, §1, §4.1.
  • H. Elsahar, C. Gravier, and F. Laforest (2018) Zero-shot question generation from knowledge graphs for unseen predicates and entity types. arXiv preprint arXiv:1802.06842. Cited by: §1.
  • Z. Fan, Z. Wei, S. Wang, Y. Liu, and X. Huang (2018) A reinforcement learning framework for natural question generation using bi-discriminators. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1763–1774. Cited by: §1.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1263–1272. Cited by: §4.2.
  • H. Gong, S. Bhat, L. Wu, J. Xiong, and W. Hwu (2019) Reinforcement learning based text style transfer without parallel training corpus. arXiv preprint arXiv:1903.10671. Cited by: §1, §2.4, §2.4.
  • J. Gu, Z. Lu, H. Li, and V. O. Li (2016) Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393. Cited by: §2.3.
  • W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034. Cited by: §2.2.2, §4.2.
  • M. Heilman and N. A. Smith (2010) Good question! statistical ranking for question generation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 609–617. Cited by: §1, §1, §4.1.
  • M. Heilman (2011) Automatic factual question generation from text. Cited by: §1, §1, §4.1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.1.1.
  • H. Hussein, M. Elmogy, and S. Guirguis (2014) Automatic english question generation system based on template driven scheme. International Journal of Computer Science Issues (IJCSI) 11 (6), pp. 45. Cited by: §1, §4.1.
  • Y. Kim, H. Lee, J. Shin, and K. Jung (2018) Improving neural question generation using answer separation. arXiv preprint arXiv:1809.02393. Cited by: §1, §3.3, §3.4, Table 1, §4.1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.2.
  • D. P. Kingma, T. Salimans, and M. Welling (2015) Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583. Cited by: §3.2.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §3.5, §4.2.
  • V. Kumar, K. Boorla, Y. Meena, G. Ramakrishnan, and Y. Li (2018a) Automating reading comprehension by generating question and answer pairs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 335–348. Cited by: §1, §1, §4.1.
  • V. Kumar, G. Ramakrishnan, and Y. Li (2018b) A framework for automatic question generation from text using deep reinforcement learning. arXiv preprint arXiv:1808.04961. Cited by: §1, §4.1.
  • M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger (2015) From word embeddings to document distances. In International Conference on Machine Learning, pp. 957–966. Cited by: §2.4.
  • Y. Li, N. Duan, B. Zhou, X. Chu, W. Ouyang, X. Wang, and M. Zhou (2018) Visual question generation as dual task of visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6116–6124. Cited by: §1.
  • Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2015) Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Cited by: §2.2.2, §4.2.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out. Cited by: §3.1.
  • B. Liu, M. Zhao, D. Niu, K. Lai, Y. He, H. Wei, and Y. Xu (2019) Learning to generate questions by learning what not to generate. arXiv preprint arXiv:1902.10418. Cited by: §1, §3.3, §3.4, Table 1.
  • P. Liu, S. Chang, X. Huang, J. Tang, and J. C. K. Cheung (2018) Contextualized non-local neural networks for sequence learning. arXiv preprint arXiv:1811.08600. Cited by: §4.2.
  • M. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. Cited by: §2.3.
  • N. Mostafazadeh, I. Misra, J. Devlin, M. Mitchell, X. He, and L. Vanderwende (2016) Generating natural questions about an image. arXiv preprint arXiv:1603.06059. Cited by: §1.
  • J. Mostow and W. Chen (2009) Generating instruction automatically for the reading strategy of self-questioning. Cited by: §1, §4.1.
  • W. Norcliffe-Brown, S. Vafeias, and S. Parisot (2018) Learning conditioned graph structures for interpretable visual question answering. In Advances in Neural Information Processing Systems, pp. 8344–8353. Cited by: §4.2.
  • M. Norouzi, S. Bengio, N. Jaitly, M. Schuster, Y. Wu, D. Schuurmans, et al. (2016) Reward augmented maximum likelihood for neural structured prediction. In Advances In Neural Information Processing Systems, pp. 1723–1731. Cited by: §1, §2.4.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318. Cited by: §3.1.
  • R. Paulus, C. Xiong, and R. Socher (2017) A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304. Cited by: §1, §2.4, §2.5.
  • J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Cited by: §2.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. Cited by: 6th item, §3.
  • M. Ranzato, S. Chopra, M. Auli, and W. Zaremba (2015) Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732. Cited by: §1, §2.4.
  • S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024. Cited by: §2.4.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368. Cited by: §2.3, §2.3.
  • I. V. Serban, A. García-Durán, C. Gulcehre, S. Ahn, S. Chandar, A. Courville, and Y. Bengio (2016) Generating factoid questions with recurrent neural networks: the 30m factoid question-answer corpus. arXiv preprint arXiv:1603.06807. Cited by: §1.
  • L. Song, Z. Wang, W. Hamza, Y. Zhang, and D. Gildea (2018a) Leveraging context information for natural question generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 569–574. Cited by: §1, §1, §3.1, §4.1.
  • L. Song, Z. Wang, and W. Hamza (2017) A unified query-based generative model for question generation and question answering. arXiv preprint arXiv:1709.01058. Cited by: §1, §1, §1, §1, §2.3, §3.3, Table 1, §4.1.
  • L. Song, Y. Zhang, Z. Wang, and D. Gildea (2018b) A graph-to-sequence model for amr-to-text generation. arXiv preprint arXiv:1805.02473. Cited by: §4.2.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §4.2.
  • D. Tang, N. Duan, T. Qin, Z. Yan, and M. Zhou (2017) Question answering and question generation as dual tasks. arXiv preprint arXiv:1706.02027. Cited by: §1.
  • Z. Tu, Z. Lu, Y. Liu, X. Liu, and H. Li (2016) Modeling coverage for neural machine translation. arXiv preprint arXiv:1601.04811. Cited by: §2.3.
  • O. Vinyals, M. Fortunato, and N. Jaitly (2015) Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692–2700. Cited by: §2.3.
  • R. J. Williams and D. Zipser (1989) A learning algorithm for continually running fully recurrent neural networks. Neural computation 1 (2), pp. 270–280. Cited by: §2.4.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §1, §2.4.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §1, §2.4, §2.5.
  • K. Xu, L. Wu, Z. Wang, and V. Sheinin (2018a) Graph2seq: graph to sequence learning with attention-based neural networks. arXiv preprint arXiv:1804.00823. Cited by: §1, §2.2.2, §2.2.2, §4.2.
  • K. Xu, L. Wu, Z. Wang, M. Yu, L. Chen, and V. Sheinin (2018b) Exploiting rich syntactic information for semantic parsing with graph-to-sequence model. arXiv preprint arXiv:1808.07624. Cited by: §4.2.
  • K. Xu, L. Wu, Z. Wang, M. Yu, L. Chen, and V. Sheinin (2018c) SQL-to-text generation with graph-to-sequence model. arXiv preprint arXiv:1809.05255. Cited by: §4.2.
  • X. Yuan, T. Wang, C. Gulcehre, A. Sordoni, P. Bachman, S. Subramanian, S. Zhang, and A. Trischler (2017) Machine comprehension by text-to-text neural question generation. arXiv preprint arXiv:1705.02012. Cited by: §1.
  • Q. Zhou, N. Yang, F. Wei, C. Tan, H. Bao, and M. Zhou (2017) Neural question generation from text: a preliminary study. In National CCF Conference on Natural Language Processing and Chinese Computing, pp. 662–671. Cited by: §1, §1, §2.1, §3.1, §3.3, Table 1, §4.1.