Natural Question Generation with Reinforcement Learning Based Graph-to-Sequence Model

Yu Chen, et al. · 10/19/2019

Natural question generation (QG) aims to generate questions from a passage and an answer. In this paper, we propose a novel reinforcement learning (RL) based graph-to-sequence (Graph2Seq) model for QG. Our model consists of a Graph2Seq generator where a novel Bidirectional Gated Graph Neural Network is proposed to embed the passage, and a hybrid evaluator with a mixed objective combining both cross-entropy and RL losses to ensure the generation of syntactically and semantically valid text. The proposed model outperforms previous state-of-the-art methods by a large margin on the SQuAD dataset.


1 Introduction

Natural question generation (QG) is a dual task to question answering (Chen et al., 2019a, b). Given a passage $X^p$ and a target answer $X^a$, the goal of QG is to generate the best question $\bar{Y}$ which maximizes the conditional likelihood $P(\bar{Y} \mid X^p, X^a)$. Recent works on QG mostly formulate it as a sequence-to-sequence (Seq2Seq) learning problem (Du et al., 2017; Yao et al.; Kumar et al., 2018a). However, these methods fail to utilize the rich text structure that could complement the simple word sequence. Moreover, cross-entropy based sequence training has well-known limitations such as exposure bias and the mismatch between the training loss and the test-time evaluation metrics (Ranzato et al., 2015; Wu et al., 2016; Paulus et al., 2017). To tackle these limitations, some recent QG approaches (Song et al., 2017; Kumar et al., 2018b) aim at directly optimizing evaluation metrics using Reinforcement Learning (RL) (Williams, 1992). However, they generally do not consider joint mixed objective functions with both syntactic and semantic constraints for guiding text generation.

Early works on neural QG did not take into account the answer information when generating a question. Recent works (Zhou et al., 2017; Kim et al., 2018) have explored various means of utilizing the answers to make the generated questions more relevant. However, they neglect potential semantic relations between the passage and answer, and thus fail to explicitly model the global interactions among them.

To address the aforementioned issues, as shown in Fig. 1, we propose an RL-based generator-evaluator architecture for QG. Our generator incorporates the answer information into the passage via an effective Deep Alignment Network, and extends Gated Graph Neural Networks (Li et al., 2015) to a Bidirectional Gated Graph Neural Network (BiGGNN) that considers both incoming and outgoing edge information when encoding the passage graph; it then outputs a question using an RNN-based decoder. Our hybrid evaluator is trained by optimizing a mixed objective function combining both cross-entropy loss and RL loss. The proposed model is end-to-end trainable, and outperforms previous state-of-the-art methods by a wide margin on the SQuAD dataset.

2 An RL-based generator-evaluator architecture

Deep alignment network. Answer information is crucial for generating relevant and high quality questions from a passage. However, previous methods often neglect potential semantic relations between passage and answer words. We thus propose a novel Deep Alignment Network (DAN) for effectively incorporating answer information into the passage by performing soft-alignment at both word-level and contextualized hidden state level.

Figure 1: Overall architecture of the proposed model. Best viewed in color.

Let $\mathbf{X}^p$ and $\tilde{\mathbf{X}}^p$ denote two embeddings associated with the passage text. Similarly, let $\mathbf{X}^a$ and $\tilde{\mathbf{X}}^a$ denote two embeddings associated with the answer text. Formally, we define our soft-alignment function as $\mathbf{H}^p = \mathrm{Align}(\mathbf{X}^p, \mathbf{X}^a, \tilde{\mathbf{X}}^p, \tilde{\mathbf{X}}^a) = \mathrm{CAT}(\tilde{\mathbf{X}}^p; \tilde{\mathbf{X}}^a \boldsymbol{\beta}^T)$, where $\mathbf{H}^p$ is the final passage embedding, CAT denotes concatenation, and $\boldsymbol{\beta}$ is an attention score matrix, computed by $\boldsymbol{\beta} \propto \exp\big(\mathrm{ReLU}(\mathbf{W}\mathbf{X}^p)^T \mathrm{ReLU}(\mathbf{W}\mathbf{X}^a)\big)$. Here $\mathbf{W}$ is a trainable weight matrix whose output dimension $d$ is the hidden state size, and ReLU is the rectified linear unit (Nair and Hinton, 2010). We next introduce how we perform soft-alignment at both the word level and the contextualized hidden state level.
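To make the alignment concrete, here is a minimal PyTorch sketch of such a soft-alignment module; the class and variable names are our own illustrative choices, not the authors' released code.

```python
import torch
import torch.nn as nn

class SoftAlign(nn.Module):
    """Illustrative sketch of the DAN soft-alignment function."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, hidden_dim, bias=False)  # shared projection W

    def forward(self, X_p, X_a, X_p_tilde, X_a_tilde):
        # X_p: (N, F) passage embeddings used to score attention
        # X_a: (M, F) answer embeddings used to score attention
        # X_p_tilde: (N, Fp') passage embeddings to be enriched
        # X_a_tilde: (M, Fa') answer embeddings to be aligned and attached
        scores = torch.relu(self.W(X_p)) @ torch.relu(self.W(X_a)).t()  # (N, M)
        beta = torch.softmax(scores, dim=-1)        # attention over answer words
        aligned_answer = beta @ X_a_tilde           # (N, Fa') answer info per passage word
        return torch.cat([X_p_tilde, aligned_answer], dim=-1)  # CAT(X_p_tilde ; beta @ X_a_tilde)
```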

Word-level alignment: In the word-level alignment stage, we first perform a soft-alignment between the passage and the answer based only on their pretrained GloVe embeddings, and compute the word-level aligned passage embeddings as $\mathrm{Align}(\mathbf{G}^p, \mathbf{G}^a, \mathrm{CAT}(\mathbf{G}^p; \mathbf{B}^p; \mathbf{L}^p), \mathbf{G}^a)$, where $\mathbf{G}^p$, $\mathbf{B}^p$, and $\mathbf{L}^p$ are the corresponding GloVe embedding (Pennington et al., 2014), BERT embedding (Devlin et al., 2018), and linguistic feature (i.e., case, NER and POS) embedding of the passage text, respectively. Then a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) is applied to the aligned passage embeddings to obtain contextualized passage embeddings $\bar{\mathbf{H}}^p$. Similarly, on the answer side, we simply concatenate its GloVe embedding $\mathbf{G}^a$ with its BERT embedding $\mathbf{B}^a$. Another BiLSTM is then applied to the concatenated answer embedding sequence to obtain the contextualized answer embeddings $\bar{\mathbf{H}}^a$.

Hidden-level alignment: In the hidden-level alignment stage, we perform another soft-alignment based on the contextualized passage and answer embeddings $\bar{\mathbf{H}}^p$ and $\bar{\mathbf{H}}^a$. As before, we compute the aligned answer embedding and concatenate it with the contextualized passage embedding. Finally, we apply another BiLSTM to the concatenated embedding to obtain the final passage embedding matrix $\mathbf{X}$.

Bidirectional graph encoder. Existing methods have exploited RNNs to capture local dependencies among sequence words, which, however, neglect the rich hidden structured information in text. GNNs provide a better way to utilize the rich text structure and to model the global interactions among sequence words (Xu et al., 2018b, c; Subburathinam et al., 2019). Therefore, we explore various ways of constructing a passage graph containing each word as a node, and then apply GNNs to encode the passage graph.

Passage graph construction: We explore syntax-based static graph construction. For each sentence in a passage, we first get its dependency parse tree. We then connect neighboring dependency parse trees by connecting those nodes that are at a sentence boundary and next to each other in text.
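As an illustration, such a syntax-based static graph could be assembled from dependency parses roughly as follows. The sketch uses spaCy purely for convenience; the paper does not prescribe a particular parser.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any dependency parser would do; spaCy is just for illustration

def build_passage_graph(passage: str):
    """Return a directed edge list over token indices: dependency edges within
    each sentence, plus one edge linking adjacent sentences at their boundary."""
    doc = nlp(passage)
    edges = []
    for token in doc:
        if token.head.i != token.i:                # skip the root's self-loop
            edges.append((token.head.i, token.i))  # head -> dependent
    sents = list(doc.sents)
    for prev, nxt in zip(sents, sents[1:]):
        edges.append((prev[-1].i, nxt[0].i))       # connect neighboring parse trees
    return edges

print(build_passage_graph("Effective planning is essential. It guides the project."))
```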

We also explore semantics-aware dynamic graph construction, which consists of three steps: i) we compute a dense adjacency matrix $\mathbf{A}$ for the passage graph by applying self-attention to the passage embeddings; ii) a KNN-style graph sparsification is adopted to obtain a sparse adjacency matrix $\bar{\mathbf{A}}$; and iii) we apply softmax to $\bar{\mathbf{A}}$ and $\bar{\mathbf{A}}^T$ to get two normalized adjacency matrices, namely, $\mathbf{A}^{\dashv}$ and $\mathbf{A}^{\vdash}$, for the incoming and outgoing directions, respectively. Please refer to Appendix A for more details.

Bidirectional gated graph neural networks: Unlike Xu et al. (2018a), we propose a novel BiGGNN, an extension of Gated Graph Neural Networks (GGNNs) (Li et al., 2015), to process the directed passage graph. Node embeddings are initialized to the passage embeddings returned by DAN.

At each hop of BiGGNN, for every node, we apply an aggregation function which takes as input a set of incoming (or outgoing) neighboring node vectors and outputs a backward (or forward) aggregation vector, denoted as $\mathbf{h}^k_{\mathcal{N}_{\dashv}(v)}$ (or $\mathbf{h}^k_{\mathcal{N}_{\vdash}(v)}$). For the syntax-based static graph, the backward (or forward) aggregation vector is computed as the average of all incoming (or outgoing) neighboring node vectors plus the node itself. For the semantics-aware dynamic graph, we compute a weighted average for aggregation, where the weights come from the normalized adjacency matrices $\mathbf{A}^{\dashv}$ and $\mathbf{A}^{\vdash}$.

Unlike Xu et al. (2018a), who learn separate node embeddings for the two directions independently, we fuse the information aggregated in the two directions at each hop, defined as $\mathbf{h}^k_{\mathcal{N}(v)} = \mathrm{Fuse}(\mathbf{h}^k_{\mathcal{N}_{\dashv}(v)}, \mathbf{h}^k_{\mathcal{N}_{\vdash}(v)})$. The fusion function is designed as a gated sum of the two information sources, defined as $\mathrm{Fuse}(\mathbf{a}, \mathbf{b}) = \mathbf{z} \odot \mathbf{a} + (1 - \mathbf{z}) \odot \mathbf{b}$ with $\mathbf{z} = \sigma(\mathbf{W}_z[\mathbf{a}; \mathbf{b}; \mathbf{a} \odot \mathbf{b}; \mathbf{a} - \mathbf{b}] + \mathbf{b}_z)$, where $\sigma$ is a sigmoid function and $\mathbf{z}$ is a gating vector. Finally, a Gated Recurrent Unit (GRU) (Cho et al., 2014) is used to update the node embeddings by incorporating the aggregation information, defined as $\mathbf{h}^k_v = \mathrm{GRU}(\mathbf{h}^{k-1}_v, \mathbf{h}^k_{\mathcal{N}(v)})$.
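Below is a minimal sketch of one BiGGNN hop under these definitions. It assumes the per-direction aggregation is expressed as multiplication by a row-normalized adjacency matrix (self-loops, as used for the static graph, can be folded into these matrices); the names and shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BiGGNNHop(nn.Module):
    """One message-passing hop of a bidirectional gated GNN (illustrative sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(4 * dim, dim)   # gating over [fwd; bwd; fwd*bwd; fwd-bwd]
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, h, A_in, A_out):
        # h: (N, d) node states; A_in/A_out: (N, N) row-normalized adjacency matrices
        h_bwd = A_in @ h                       # aggregate over incoming neighbors
        h_fwd = A_out @ h                      # aggregate over outgoing neighbors
        z = torch.sigmoid(self.gate(
            torch.cat([h_fwd, h_bwd, h_fwd * h_bwd, h_fwd - h_bwd], dim=-1)))
        h_agg = z * h_fwd + (1 - z) * h_bwd    # gated fusion of the two directions
        return self.gru(h_agg, h)              # GRU update from aggregation + previous state
```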

After $n$ hops of GNN computation, where $n$ is a hyperparameter, we obtain the final state embedding $\mathbf{h}^n_v$ for each node $v$. To compute the graph-level embedding, we first apply a linear projection to the node embeddings, and then apply max-pooling over all node embeddings to get a $d$-dim vector $\mathbf{h}^{\mathcal{G}}$. The decoder takes the graph-level embedding $\mathbf{h}^{\mathcal{G}}$, passed through two separate fully-connected layers, as its initial hidden and cell states, and the node embeddings $\{\mathbf{h}^n_v\}$ as the attention memory. Our decoder closely follows See et al. (2017). We refer the readers to Appendix B for more details.
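A rough sketch of this graph-level readout and decoder initialization might look as follows (again an illustrative assumption, not the released code):

```python
import torch
import torch.nn as nn

class GraphReadout(nn.Module):
    """Linear projection followed by max-pooling over nodes (illustrative sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.to_h0 = nn.Linear(dim, dim)  # two separate FC layers produce the
        self.to_c0 = nn.Linear(dim, dim)  # decoder's initial hidden and cell states

    def forward(self, node_states):                            # node_states: (N, d)
        g = torch.max(self.proj(node_states), dim=0).values    # (d,) graph embedding
        return self.to_h0(g), self.to_c0(g)                    # decoder initial states
```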

Hybrid evaluator. Some recent QG approaches (Song et al., 2017; Kumar et al., 2018b) directly optimize evaluation metrics using REINFORCE to overcome the loss mismatch issue with cross-entropy based sequence training. However, they often fail to generate semantically meaningful and syntactically coherent text. To address these issues, we present a hybrid evaluator with a mixed objective combining both cross-entropy and RL losses so as to ensure the generation of syntactically and semantically valid text.

For the RL part, we adopt the self-critical sequence training (SCST) algorithm (Rennie et al., 2017) to directly optimize the evaluation metrics. In SCST, at each training iteration, the model generates two output sequences: the sampled output $Y^s$, produced by multinomial sampling, and the baseline output $\hat{Y}$, obtained by greedy search. We define $r(Y)$ as the reward of an output sequence $Y$, computed by comparing it to the corresponding ground-truth sequence $Y^*$ with some reward metric. The loss function is defined as $\mathcal{L}_{rl} = (r(\hat{Y}) - r(Y^s)) \sum_t \log P(y^s_t \mid X, y^s_{<t})$. As we can see, if the sampled output has a higher reward than the baseline one, we maximize its likelihood, and vice versa.
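A minimal sketch of this SCST loss, assuming per-token log-probabilities of the sampled sequence are available:

```python
import torch

def scst_loss(sample_log_probs, sample_reward, baseline_reward):
    """Self-critical sequence training loss (sketch).

    sample_log_probs: (T,) log-probabilities of the sampled tokens y^s_t
    sample_reward:    scalar reward r(Y^s) of the sampled sequence
    baseline_reward:  scalar reward r(Y^hat) of the greedy baseline sequence
    """
    # If the sample beats the baseline, the advantage is positive and minimizing
    # the loss increases the sample's likelihood; otherwise it is decreased.
    advantage = sample_reward - baseline_reward
    return -advantage * sample_log_probs.sum()
```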

We use one of our evaluation metrics, BLEU-4, as our evaluation reward function $r_{\mathrm{eval}}$, which lets us directly optimize the model towards the evaluation metrics. One drawback of evaluation metrics like BLEU is that they do not measure meaning, but only reward systems for n-grams that have exact matches in the reference. To make our reward function more effective and robust, following Gong et al. (2019), we additionally use word mover's distance (WMD) (Kusner et al., 2015) as a semantic reward function $r_{\mathrm{sem}}$. We define the final reward function as $r(Y) = r_{\mathrm{eval}}(Y) + \alpha\, r_{\mathrm{sem}}(Y)$, where $\alpha$ is a scalar.
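A hedged sketch of such a combined reward is shown below: sentence-level BLEU comes from NLTK, while the WMD term is expressed via gensim's `wmdistance` and simply negated to turn a distance into a reward, which is one plausible choice rather than the paper's exact formulation.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu4_reward(hyp_tokens, ref_tokens):
    # Evaluation-metric reward r_eval: smoothed sentence-level BLEU-4.
    return sentence_bleu([ref_tokens], hyp_tokens,
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=SmoothingFunction().method1)

def wmd_reward(hyp_tokens, ref_tokens, wordvecs):
    # Semantic reward r_sem: negated word mover's distance, computed e.g. with
    # a gensim KeyedVectors model (smaller distance -> larger reward).
    return -wordvecs.wmdistance(hyp_tokens, ref_tokens)

def total_reward(hyp_tokens, ref_tokens, wordvecs, alpha=0.1):
    # r(Y) = r_eval(Y) + alpha * r_sem(Y); alpha is set to 0.1 in the experiments.
    return bleu4_reward(hyp_tokens, ref_tokens) + alpha * wmd_reward(hyp_tokens, ref_tokens, wordvecs)
```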

We train our model in two stages. In the first stage, we train the model using the regular cross-entropy loss. In the second stage, we fine-tune the model by optimizing a mixed objective function combining both the cross-entropy loss and the RL loss. During testing, we use beam search to generate the final predictions. Further details of the training strategy can be found in Appendix C.

3 Experiments

In this section, we evaluate our proposed model against state-of-the-art methods on the SQuAD dataset (Rajpurkar et al., 2016). The baseline methods in our experiments include SeqCopyNet (Zhou et al., 2018), NQG++ (Zhou et al., 2017), MPQG+R (Song et al., 2017), Answer-focused Position-aware model (Sun et al., 2018), s2sa-at-mp-gsa (Zhao et al., 2018), ASs2s (Kim et al., 2018) and CGC-QG (Liu et al., 2019). Note that experiments on baselines followed by * are conducted using released source code. Detailed description of the baselines is provided in Appendix D. For fair comparison with baselines, we do experiments on both SQuAD split-1 (Song et al., 2018) and split-2 (Zhou et al., 2017). For model settings and sensitivity analysis of hyperparameters, please refer to Appendix E and Appendix F. The implementation of our model will be made publicly available at https://github.com/hugochan/RL-based-Graph2Seq-for-NQG.

Following previous works, we use BLEU-4 (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005) and ROUGE-L (Lin, 2004) as our evaluation metrics. Note that we only report BLEU-4 on split-2 since most of the baselines only report this result. Besides automatic evaluation metrics, we also conduct human evaluation on split-2. Further details on human evaluation can be found in Appendix G.

Methods                                Split-1                      Split-2
                                       BLEU-4   METEOR   ROUGE-L    BLEU-4
SeqCopyNet                             –        –        –          13.02
NQG++                                  –        –        –          13.29
MPQG+R*                                14.39    18.99    42.46      14.71
Answer-focused Position-aware model    –        –        –          15.64
s2sa-at-mp-gsa                         15.32    19.29    43.91      15.82
ASs2s                                  16.20    19.92    43.96      16.17
CGC-QG                                 –        –        –          17.55
G2S_dyn+BERT+RL                        17.55    21.42    45.59      18.06
G2S_sta+BERT+RL                        17.94    21.76    46.02      18.30
Table 1: Automatic evaluation results on the SQuAD test set. G2S_sta and G2S_dyn denote our model with static and dynamic graph construction, respectively.

Experimental results and analysis. Table 1 shows the automatic evaluation results compared against all baselines. First of all, our full models G2S_sta+BERT+RL and G2S_dyn+BERT+RL outperform previous state-of-the-art methods by a wide margin. Moreover, unlike previous methods such as CGC-QG (Liu et al., 2019) and ASs2s (Kim et al., 2018), our model does not rely on any hand-crafted rules or ad-hoc strategies.

Methods                   BLEU-4     Methods                          BLEU-4
G2S_dyn+BERT+RL           18.06      G2S_dyn                          16.81
G2S_sta+BERT+RL           18.30      G2S_sta                          16.96
G2S+BERT-fixed+RL         18.20      G2S_dyn w/o DAN                  12.58
G2S_dyn+BERT              17.56      G2S_sta w/o DAN                  12.62
G2S_sta+BERT              18.02      G2S w/o BiGGNN, w/ Seq2Seq       16.14
G2S+BERT-fixed            17.86      G2S w/o BiGGNN, w/ GCN           14.47
G2S_dyn+RL                17.18      G2S w/ GGNN-forward              16.53
G2S_sta+RL                17.49      G2S w/ GGNN-backward             16.75
Table 2: Ablation study on the SQuAD split-2 test set.

We also perform an ablation study to assess the impact of different model components on the SQuAD split-2 test set, as shown in Table 2; complete results are given in Appendix H. Turning off DAN dramatically drops the BLEU-4 score of G2S_sta from 16.96 to 12.62 (and similarly for G2S_dyn), which shows the effectiveness of DAN. We can see the advantage of Graph2Seq learning over Seq2Seq learning by comparing the performance of G2S with its Seq2Seq counterpart (G2S w/o BiGGNN, w/ Seq2Seq). Fine-tuning the model using REINFORCE further improves performance, which shows the benefit of directly optimizing the evaluation metrics. We also find that BERT has a considerable impact on the performance. Lastly, we find that static graph construction slightly outperforms dynamic graph construction. We refer the readers to Appendix I for a case study of the different ablated systems.

4 Conclusion

We proposed a novel RL based Graph2Seq model for QG, where the answer information is utilized by an effective Deep Alignment Network and a novel bidirectional GNN is proposed to process the directed passage graph. Our two-stage training strategy benefits from both cross-entropy based and REINFORCE based sequence training. We also explore both static and dynamic approaches for constructing graphs when applying GNNs to textual data. On the SQuAD dataset, our model outperforms previous state-of-the-art methods by a wide margin. In the future, we would like to investigate more effective ways of automatically learning graph structures from free-form text.

Acknowledgments

This work is supported by IBM Research AI through the IBM AI Horizons Network. We thank the human evaluators who evaluated our system. We thank the anonymous reviewers for their feedback.

References

  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: Appendix B.
  • S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72. Cited by: §3.
  • S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179. Cited by: Appendix C.
  • Y. Chen, L. Wu, and M. J. Zaki (2019a) Bidirectional attentive memory networks for question answering over knowledge bases. arXiv preprint arXiv:1903.02188. Cited by: §1.
  • Y. Chen, L. Wu, and M. J. Zaki (2019b) GraphFlow: exploiting conversation flow with graph neural networks for conversational machine comprehension. arXiv preprint arXiv:1908.00059. Cited by: §1.
  • K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder–decoder for statistical machine translation. In EMNLP, pp. 1724–1734. Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.
  • X. Du, J. Shao, and C. Cardie (2017) Learning to ask: neural question generation for reading comprehension. arXiv preprint arXiv:1705.00106. Cited by: §1.
  • H. Gong, S. Bhat, L. Wu, J. Xiong, and W. Hwu (2019) Reinforcement learning based text style transfer without parallel training corpus. arXiv preprint arXiv:1903.10671. Cited by: §2.
  • J. Gu, Z. Lu, H. Li, and V. O. Li (2016) Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393. Cited by: Appendix B.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.
  • Y. Kim, H. Lee, J. Shin, and K. Jung (2018) Improving neural question generation using answer separation. arXiv preprint arXiv:1809.02393. Cited by: Appendix D, §1, §3, §3.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix E.
  • D. P. Kingma, T. Salimans, and M. Welling (2015) Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583. Cited by: Appendix E.
  • V. Kumar, K. Boorla, Y. Meena, G. Ramakrishnan, and Y. Li (2018a) Automating reading comprehension by generating question and answer pairs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 335–348. Cited by: §1.
  • V. Kumar, G. Ramakrishnan, and Y. Li (2018b) A framework for automatic question generation from text using deep reinforcement learning. arXiv preprint arXiv:1808.04961. Cited by: §1, §2.
  • M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger (2015) From word embeddings to document distances. In International Conference on Machine Learning, pp. 957–966. Cited by: §2.
  • Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2015) Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Cited by: §1, §2.
  • C. Lin (2004) Rouge: a package for automatic evaluation of summaries. Text Summarization Branches Out. Cited by: §3.
  • B. Liu, M. Zhao, D. Niu, K. Lai, Y. He, H. Wei, and Y. Xu (2019) Learning to generate questions by learning what not to generate. arXiv preprint arXiv:1902.10418. Cited by: Appendix D, §3, §3.
  • M. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. Cited by: Appendix B.
  • V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814. Cited by: §2.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §3.
  • R. Paulus, C. Xiong, and R. Socher (2017) A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304. Cited by: Appendix C, §1.
  • J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Cited by: §2.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. Cited by: §3.
  • M. Ranzato, S. Chopra, M. Auli, and W. Zaremba (2015) Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732. Cited by: §1.
  • S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024. Cited by: §2.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368. Cited by: §2.
  • L. Song, Z. Wang, W. Hamza, Y. Zhang, and D. Gildea (2018) Leveraging context information for natural question generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 569–574. Cited by: §3.
  • L. Song, Z. Wang, and W. Hamza (2017) A unified query-based generative model for question generation and question answering. arXiv preprint arXiv:1709.01058. Cited by: Appendix D, §1, §2, §3.
  • A. Subburathinam, D. Lu, H. Ji, J. May, S. Chang, A. Sil, and C. Voss (2019) Cross-lingual structure transfer for relation and event extraction. In Empirical Methods in Natural Language Processing (EMNLP), Cited by: §2.
  • X. Sun, J. Liu, Y. Lyu, W. He, Y. Ma, and S. Wang (2018) Answer-focused and position-aware neural question generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3930–3939. Cited by: Appendix D, §3.
  • Z. Tu, Z. Lu, Y. Liu, X. Liu, and H. Li (2016) Modeling coverage for neural machine translation. arXiv preprint arXiv:1601.04811. Cited by: Appendix B.
  • O. Vinyals, M. Fortunato, and N. Jaitly (2015) Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692–2700. Cited by: Appendix B.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §1.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: Appendix C, §1.
  • K. Xu, L. Wu, Z. Wang, and V. Sheinin (2018a) Graph2seq: graph to sequence learning with attention-based neural networks. arXiv preprint arXiv:1804.00823. Cited by: §2, §2.
  • K. Xu, L. Wu, Z. Wang, M. Yu, L. Chen, and V. Sheinin (2018b) Exploiting rich syntactic information for semantic parsing with graph-to-sequence model. arXiv preprint arXiv:1808.07624. Cited by: §2.
  • K. Xu, L. Wu, Z. Wang, M. Yu, L. Chen, and V. Sheinin (2018c) SQL-to-text generation with graph-to-sequence model. arXiv preprint arXiv:1809.05255. Cited by: §2.
  • K. Yao, L. Zhang, T. Luo, L. Tao, and Y. Wu. Teaching machines to ask questions. Cited by: §1.
  • Y. Zhao, X. Ni, Y. Ding, and Q. Ke (2018) Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3901–3910. Cited by: Appendix D, §3.
  • Q. Zhou, N. Yang, F. Wei, C. Tan, H. Bao, and M. Zhou (2017) Neural question generation from text: a preliminary study. In National CCF Conference on Natural Language Processing and Chinese Computing, pp. 662–671. Cited by: Appendix D, §1, §3.
  • Q. Zhou, N. Yang, F. Wei, and M. Zhou (2018) Sequential copying networks. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: Appendix D, §3.

Appendix A Details on dynamic graph construction

We dynamically build a directed and weighted graph to model semantic relationships among passage words. We make the process of building such a graph depend not only on the passage, but also on the answer. The graph construction procedure consists of three steps: i) we compute a dense adjacency matrix $\mathbf{A}$ for the passage graph by applying self-attention to the word-level passage embeddings $\tilde{\mathbf{X}}^p$; ii) a KNN-style graph sparsification strategy is adopted to obtain a sparse adjacency matrix $\bar{\mathbf{A}}$, where for each node we only keep the $k$ nearest neighbors (including itself) as well as the associated attention scores (i.e., the remaining attention scores are masked off); and iii) we apply softmax to $\bar{\mathbf{A}}$ and $\bar{\mathbf{A}}^T$ to get two normalized adjacency matrices, namely, $\mathbf{A}^{\dashv}$ and $\mathbf{A}^{\vdash}$, for the incoming and outgoing directions, respectively:

$$\mathbf{A} = \mathrm{ReLU}(\mathbf{U}\tilde{\mathbf{X}}^p)^T\, \mathrm{ReLU}(\mathbf{U}\tilde{\mathbf{X}}^p), \qquad \bar{\mathbf{A}} = \mathrm{kNN}(\mathbf{A}), \qquad \mathbf{A}^{\dashv},\ \mathbf{A}^{\vdash} = \mathrm{softmax}(\{\bar{\mathbf{A}}, \bar{\mathbf{A}}^T\}) \tag{1}$$

where $\mathbf{U}$ is a trainable weight matrix. Note that the supervision signal is able to back-propagate through the KNN-style graph sparsification operation since the $k$ nearest attention scores are kept.
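A compact PyTorch sketch of this procedure, under the reconstruction above (tensor names and the exact masking scheme are illustrative assumptions):

```python
import torch

def dynamic_graph(passage_emb, W, k=10):
    """Semantics-aware dynamic graph construction (illustrative sketch).

    passage_emb: (N, F) word-level passage embeddings (rows are words)
    W:           (F, d) trainable projection
    k:           neighborhood size (set to 10 in the experiments)
    """
    u = torch.relu(passage_emb @ W)                  # (N, d)
    A = u @ u.t()                                    # dense self-attention adjacency
    k = min(k, A.size(-1))
    # KNN-style sparsification: keep only the k largest scores per node (incl. itself)
    # and mask the rest with -inf so that softmax assigns them zero weight.
    topk = torch.topk(A, k, dim=-1)
    masked = torch.full_like(A, float('-inf')).scatter_(-1, topk.indices, topk.values)
    A_out = torch.softmax(masked, dim=-1)            # outgoing direction
    A_in = torch.softmax(masked.t(), dim=-1)         # incoming direction (transposed scores)
    return A_in, A_out
```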

Appendix B Details on the RNN decoder

On the decoder side, we adopt an attention-based (Bahdanau et al., 2014; Luong et al., 2015) LSTM decoder with copy (Vinyals et al., 2015; Gu et al., 2016) and coverage (Tu et al., 2016) mechanisms. At each decoding step $t$, an attention mechanism learns to attend to the most relevant words in the input sequence, and computes a context vector $\mathbf{h}^*_t$ based on the current decoding state $\mathbf{s}_t$, the current coverage vector $\mathbf{c}^t$, and the attention memory. In addition, the generation probability $p_{\mathrm{gen}} \in [0, 1]$ is calculated from the context vector $\mathbf{h}^*_t$, the decoder state $\mathbf{s}_t$, and the decoder input $x_t$. Next, $p_{\mathrm{gen}}$ is used as a soft switch to choose between generating a word from the vocabulary or copying a word from the input sequence. We dynamically maintain an extended vocabulary which is the union of the usual vocabulary and all words appearing in a batch of source examples (i.e., passages and answers). Finally, in order to encourage the decoder to utilize the diverse components of the input sequence, a coverage mechanism is applied. At each step, we maintain a coverage vector $\mathbf{c}^t$, which is the sum of the attention distributions over all previous decoder time steps. A coverage loss is also computed to penalize repeatedly attending to the same locations of the input sequence.
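The soft-switch computation can be sketched as follows, in the spirit of the pointer-generator decoder of See et al. (2017); the module and argument names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class CopySwitch(nn.Module):
    """Soft switch between generating from the vocabulary and copying from the input."""
    def __init__(self, ctx_dim, state_dim, inp_dim, vocab_size):
        super().__init__()
        self.p_gen = nn.Linear(ctx_dim + state_dim + inp_dim, 1)
        self.out = nn.Linear(ctx_dim + state_dim, vocab_size)

    def forward(self, context, state, dec_input, attn, src_ids, ext_vocab_size):
        # context: (B, ctx_dim), state: (B, state_dim), dec_input: (B, inp_dim)
        # attn: (B, S) attention over source tokens; src_ids: (B, S) ids in the extended vocab
        p_gen = torch.sigmoid(self.p_gen(torch.cat([context, state, dec_input], dim=-1)))
        p_vocab = torch.softmax(self.out(torch.cat([context, state], dim=-1)), dim=-1)
        dist = torch.zeros(attn.size(0), ext_vocab_size, device=attn.device)
        dist[:, :p_vocab.size(1)] = p_gen * p_vocab          # generation probability mass
        dist.scatter_add_(1, src_ids, (1 - p_gen) * attn)    # copy probability mass
        return dist
```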

Appendix C Training details

We train our model in two stages. In the first stage, we train the model using the regular cross-entropy loss, defined as $\mathcal{L}_{lm} = \sum_t \big( -\log P(y^*_t \mid X, y^*_{<t}) + \lambda\, \mathrm{covloss}_t \big)$, where $y^*_t$ is the word at the $t$-th position of the ground-truth output sequence and $\mathrm{covloss}_t = \sum_i \min(a^t_i, c^t_i)$ is the coverage loss, with $a^t_i$ being the $i$-th element of the attention vector over the input sequence at time step $t$ and $c^t_i$ the corresponding element of the coverage vector. Scheduled teacher forcing (Bengio et al., 2015) is adopted to alleviate the exposure bias problem. In the second stage, we fine-tune the model by optimizing a mixed objective function combining both the cross-entropy loss and the RL loss, defined as $\mathcal{L} = \gamma \mathcal{L}_{rl} + (1 - \gamma) \mathcal{L}_{lm}$, where $\gamma$ is a scaling factor controlling the trade-off between the two losses. A similar mixed-objective learning function has been used by Wu et al. (2016) and Paulus et al. (2017) for machine translation and text summarization.
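A small sketch of these two loss terms, using the symbols defined above (an illustration under our reconstruction, not the released implementation):

```python
import torch

def xent_with_coverage(log_probs_gold, attn, coverage, lam=0.4):
    # L_lm = sum_t [ -log P(y*_t | ...) + lam * sum_i min(a_i^t, c_i^t) ]
    # log_probs_gold: (T,) log-probs of gold tokens; attn, coverage: (T, S)
    covloss = torch.minimum(attn, coverage).sum(dim=-1)        # (T,)
    return (-log_probs_gold + lam * covloss).sum()

def mixed_loss(xent_loss, rl_loss, gamma=0.99):
    # Stage-2 objective: L = gamma * L_rl + (1 - gamma) * L_lm
    # (gamma = 0.99 in the experiments, Appendix E).
    return gamma * rl_loss + (1 - gamma) * xent_loss
```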

Appendix D Details on baseline methods

SeqCopyNet (Zhou et al., 2018) proposed an extension to the copy mechanism which learns to copy not only single words but also sequences from the input sentence.

NQG++ (Zhou et al., 2017) proposed an attention-based Seq2Seq model equipped with a copy mechanism and a feature-rich encoder to encode answer position, POS and NER tag information.

MPQG+R (Song et al., 2017) proposed an RL-based Seq2Seq model with a multi-perspective matching encoder to incorporate answer information. Copy and coverage mechanisms are applied.

Answer-focused Position-aware model (Sun et al., 2018) consists of an answer-focused component which generates an interrogative word matching the answer type, and a position-aware component which is aware of the position of the context words when generating a question by modeling the relative distance between the context words and the answer.

s2sa-at-mp-gsa (Zhao et al., 2018) proposed a model which contains a gated attention encoder and a maxout pointer decoder to tackle the challenges of processing long input sequences. For fair comparison, we report the results of the sentence-level version of their model to match our settings.

ASs2s (Kim et al., 2018) proposed an answer-separated Seq2Seq model which treats the passage and the answer separately.

CGC-QG (Liu et al., 2019) proposed a multi-task learning framework to guide the model to learn the accurate boundaries between copying and generation.

Appendix E Model settings

We keep and fix the 300-dim GloVe vectors for the most frequent 70,000 words in the training set. We compute the 1024-dim BERT embeddings on the fly for each word in the text using a (trainable) weighted sum of all BERT layer outputs. The embedding sizes of the case, POS and NER tags are set to 3, 12 and 8, respectively. We set the hidden state size of the BiLSTM to 150 so that the concatenated state size for both directions is 300. The size of all other hidden layers is set to 300. We apply a variational dropout (Kingma et al., 2015) rate of 0.4 after word embedding layers and 0.3 after RNN layers. We set the neighborhood size $k$ to 10 for dynamic graph construction. The number of GNN hops is set to 3. During training, in each epoch, we set the initial teacher forcing probability to 0.75 and exponentially increase it as a function of the training step. We set $\alpha$ in the reward function to 0.1, $\gamma$ in the mixed loss function to 0.99, and the coverage loss ratio $\lambda$ to 0.4. We use Adam (Kingma and Ba, 2014) as the optimizer, and the learning rate is set to 0.001 in the pretraining stage and 0.00001 in the fine-tuning stage. We reduce the learning rate by a factor of 0.5 if the validation BLEU-4 score stops improving for three epochs. We stop the training when no improvement is seen for 10 epochs. We clip the gradient at length 10. The batch size is set to 60 and 50 on data split-1 and split-2, respectively. The beam search width is set to 5. All hyperparameters are tuned on the development set.
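For reference, these settings can be collected into a configuration sketch; the key names below are hypothetical and only summarize the values stated above.

```python
# Hypothetical config dict collecting the hyperparameters listed above;
# the key names are our own, not taken from the released code.
CONFIG = {
    "glove_dim": 300, "glove_vocab": 70000,
    "bert_dim": 1024,
    "case_emb_dim": 3, "pos_emb_dim": 12, "ner_emb_dim": 8,
    "rnn_hidden": 150,            # per direction; 300 after concatenation
    "hidden_size": 300,
    "dropout_emb": 0.4, "dropout_rnn": 0.3,
    "graph_knn_size": 10, "gnn_hops": 3,
    "teacher_forcing_init": 0.75,
    "reward_alpha": 0.1, "mixed_loss_gamma": 0.99, "coverage_lambda": 0.4,
    "lr_pretrain": 1e-3, "lr_finetune": 1e-5,
    "lr_decay_factor": 0.5, "lr_patience_epochs": 3, "early_stop_epochs": 10,
    "grad_clip": 10, "batch_size": {"split1": 60, "split2": 50},
    "beam_width": 5,
}
```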

Appendix F Sensitivity analysis of hyperparameters

We study the effect of the number of GNN hops on model performance. We conduct experiments with the G2S model on the SQuAD split-2 data while varying the number of GNN hops. Fig. 2 shows that our model is not very sensitive to the number of GNN hops and can achieve reasonably good results with various numbers of hops.

Figure 2: Effect of the number of GNN hops.

Appendix G Details on human evaluation

We conducted a small-scale (i.e., 50 random examples per system) human evaluation on the split-2 data. We asked 5 human evaluators to give feedback on the quality of questions generated by a set of anonymised competing systems. In each example, given a triple containing a source passage, a target answer and an anonymised system output, they were asked to rate the quality of the system output by answering the following three questions: i) is this generated question syntactically correct? ii) is this generated question semantically correct? and iii) is this generated question relevant to the passage? For each evaluation question, the rating scale is from 1 to 5, where a higher score means better quality (i.e., 1: Poor, 2: Marginal, 3: Acceptable, 4: Good, 5: Excellent). Responses from all evaluators were collected and averaged.

As shown in Table 3, we conducted a human evaluation study to assess the quality of the questions generated by our model, the baseline method MPQG+R, and the ground-truth data in terms of syntax, semantics and relevance. We can see that our best performing model achieves good results even compared to the ground-truth, and outperforms the strong baseline method MPQG+R. Our error analysis shows that the main syntactic errors are repeated or unknown words in the generated questions. Further, the slightly lower quality on semantics also affects relevance.

Methods            Syntactically correct    Semantically correct    Relevant
MPQG+R*            4.34                     4.01                    3.21
G2S+BERT+RL        4.41                     4.31                    3.79
Ground-truth       4.74                     4.74                    4.25
Table 3: Human evaluation results on the SQuAD split-2 test set (average ratings on a 1–5 scale).

Appendix H Complete results on Ablation Study

Methods                   BLEU-4     Methods                          BLEU-4
G2S_dyn+BERT+RL           18.06      G2S_dyn w/o feat                 16.51
G2S_sta+BERT+RL           18.30      G2S_sta w/o feat                 16.65
G2S+BERT-fixed+RL         18.20      G2S_dyn w/o DAN                  12.58
G2S_dyn+BERT              17.56      G2S_sta w/o DAN                  12.62
G2S_sta+BERT              18.02      G2S w/ DAN-word only             15.92
G2S+BERT-fixed            17.86      G2S w/ DAN-hidden only           16.07
G2S_dyn+RL                17.18      G2S w/ GGNN-forward              16.53
G2S_sta+RL                17.49      G2S w/ GGNN-backward             16.75
G2S_dyn                   16.81      G2S w/o BiGGNN, w/ Seq2Seq       16.14
G2S_sta                   16.96      G2S w/o BiGGNN, w/ GCN           14.47
Table 4: Ablation study on the SQuAD split-2 test set.

We perform a comprehensive ablation study to systematically assess the impact of different model components (e.g., BERT, RL, DAN, BiGGNN, FEAT, DAN-word, and DAN-hidden) on the two proposed full model variants (static vs. dynamic) on the SQuAD split-2 test set. Our experimental results confirm that every component of our proposed model contributes to the overall performance.

Appendix I Case study of ablated systems

In Table 5, we further show a few examples that illustrate the quality of generated text given a passage under different ablated systems. As we can see, incorporating answer information helps the model identify the answer type of the question to be generated, and thus makes the generated questions more relevant and specific. Also, we find our Graph2Seq model can generate more complete and valid questions compared to the Seq2Seq baseline. We think it is because a Graph2Seq model is able to exploit the rich text structure information better than a Seq2Seq model. Lastly, it shows that fine-tuning the model using REINFORCE can improve the quality of the generated questions.

Passage: for the successful execution of a project , effective planning is essential .
Gold: what is essential for the successful execution of a project ?
G2S w/o BiGGNN (Seq2Seq): what type of planning is essential for the project ?
G2S w/o DAN: what type of planning is essential for the successful execution of a project ?
G2S: what is essential for the successful execution of a project ?
G2S+BERT: what is essential for the successful execution of a project ?
G2S_dyn+BERT+RL: what is essential for the successful execution of a project ?
G2S_sta+BERT+RL: what is essential for the successful execution of a project ?
Passage: the church operates three hundred sixty schools and institutions overseas .
Gold: how many schools and institutions does the church operate overseas ?
G2S w/o BiGGNN (Seq2Seq): how many schools does the church have ?
G2S w/o DAN: how many schools does the church have ?
G2S: how many schools and institutions does the church have ?
G2S+BERT: how many schools and institutions does the church have ?
G2S_dyn+BERT+RL: how many schools and institutions does the church operate ?
G2S_sta+BERT+RL: how many schools does the church operate ?
Table 5: Generated questions on SQuAD split-2 test set. Target answers are underlined.