Machine Comprehension by Text-to-Text Neural Question Generation

by Xingdi Yuan, et al.

We propose a recurrent neural model that generates natural-language questions from documents, conditioned on answers. We show how to train the model using a combination of supervised and reinforcement learning. After teacher forcing for standard maximum likelihood training, we fine-tune the model using policy gradient techniques to maximize several rewards that measure question quality. Most notably, one of these rewards is the performance of a question-answering system. We motivate question generation as a means to improve the performance of question answering systems. Our model is trained and evaluated on the recent question-answering dataset SQuAD.





1 Introduction

People ask questions to improve their knowledge and understanding of the world. Questions can be used to access the knowledge of others or to direct one’s own information-seeking behavior. Here we study the generation of natural-language questions by machines, based on text passages. This task is synergistic with machine comprehension (MC), which pursues the understanding of written language by machines at a near-human level. Because most human knowledge is recorded in text, this would enable transformative applications.

Many machine comprehension datasets have been released recently. These generally comprise (document, question, answer) triples (Hermann et al., 2015; Hill et al., 2015; Rajpurkar et al., 2016; Trischler et al., 2016; Nguyen et al., 2016), where the goal is to predict an answer, conditioned on a document and question. The availability of large labeled datasets has spurred development of increasingly advanced models for question answering (QA) from text (Kadlec et al., 2016; Seo et al., 2016; Wang et al., 2016; Shen et al., 2016).

Text Passage
in 1066 , duke william ii of normandy conquered england killing king harold ii at the battle of hastings. the invading normans and their descendants replaced the anglo-saxons as the ruling class of england.
Questions Generated by our System
1) when did the battle of hastings take place?
2) in what year was the battle of hastings fought?
3) who conquered king harold ii at the battle of hastings?
4) who became the ruling class of england?
Table 1: Examples of conditional question generation given a context and an answer from the SQuAD dataset, using one of the training schemes described below. Bold text in the passage indicates the answers used to generate the numbered questions.

In this paper we reframe the standard MC task: rather than answering questions about a document, we teach machines to ask questions. Our work has several motivations. First, we believe that posing appropriate questions is an important aspect of information acquisition in intelligent systems. Second, learning to ask questions may improve the ability to answer them. Singer and Donlan (1982) demonstrated that having students devise questions before reading can increase scores on subsequent comprehension tests. Third, answering the questions in most existing QA datasets is an extractive task – it requires selecting some span of text within the document – while question asking is comparatively abstractive – it requires generation of text that may not appear in the document. Fourth, asking good questions involves skills beyond those used to answer them. For instance, in existing QA datasets, a typical (document, question) pair specifies a unique answer. Conversely, a typical (document, answer) pair may be associated with multiple questions, since a valid question can be formed from any information or relations which uniquely specify the given answer. Finally, a mechanism to ask informative questions about documents (and eventually answer them) has many practical applications, e.g.: generating training data for question answering (Serban et al., 2016; Yang et al., 2017), synthesising frequently asked question (FAQ) documentation, and automatic tutoring systems (Lindberg et al., 2013).

We adapt the sequence-to-sequence approach of Cho et al. (2014) for generating questions, conditioned on a document and answer: first we encode the document and answer, then output question words sequentially with a decoder that conditions on the document and answer encodings. We augment the standard encoder-decoder approach with several modifications geared towards the question generation task. During training, in addition to maximum likelihood for predicting questions from (document, answer) tuples, we use policy gradient optimization to maximize several auxiliary rewards. These include a language-model-based score for fluency and the performance of a pretrained question-answering model on generated questions. We show quantitatively that policy gradient increases the rewards earned by generated questions at test time, and provide examples to illustrate the qualitative effects of different training schemes. To our knowledge, we present the first end-to-end, text-to-text model for question generation.

2 Related Work

Recently, automatic question generation has received increased attention from the research community. It has been harnessed, for example, as a means to build automatic tutoring systems (Heilman and Smith, 2010; Ali et al., 2010; Lindberg et al., 2013; Labutov et al., 2015; Mazidi and Nielsen, 2015), to reroute queries to community question-answering systems (Zhao et al., 2011), and to enrich training data for question-answering systems (Serban et al., 2016; Yang et al., 2017).

Several earlier works process documents as individual sentences using syntactic (Heilman and Smith, 2010; Ali et al., 2010; Kumar et al., 2015) or semantic-based parsing (Mannem et al., 2010; Lindberg et al., 2013), then reformulate questions using hand-crafted rules acting on parse trees. These traditional approaches generate questions with a high word overlap with the original text that pertain specifically to the given sentence by re-arranging the sentence parse tree. An alternative approach is to use generic question templates whose slots can be filled with entities from the document (Lindberg et al., 2013; Chali and Golestanirad, 2016). Labutov et al. (2015), for example, use ontology-derived templates to generate high-level questions related to larger portions of the document. These approaches comprise pipelines of independent components that are difficult to tune for final performance measures.

More recently, neural networks have enabled end-to-end training of question generation systems. Serban et al. (2016) train a neural system to convert knowledge base (KB) triples into natural-language questions. The head and the relation form a context for the question and the tail serves as the answer. Similarly, we assume that the answer is known a priori, but we extend the context to encompass a span of unstructured text. Mostafazadeh et al. (2016) use a neural architecture to generate questions from images rather than text. Contemporaneously with this work, Yang et al. (2017) developed generative domain-adaptive networks, which perform question generation as an auxiliary task in training a QA system. The main goal of their question generation is data augmentation, so the questions themselves are not evaluated. In contrast, our work focuses primarily on developing a neural model for question generation that could be applied to a variety of downstream tasks, including question answering.

Our model shares similarities with recent end-to-end neural QA systems, e.g., Seo et al. (2016) and Wang et al. (2016). Specifically, we use an encoder-decoder structure, where the encoder processes the answer and document (instead of question and document) and the decoder generates a question (instead of an answer). While existing question answering systems typically extract the answer from the document, our decoder is a fully generative model.

Finally, our work relates to the recent body of work that applies reinforcement learning to natural language generation, such as Li et al. (2016), Ranzato et al. (2016), Kandasamy and Bachrach (2017), and Zhang and Lapata (2017). We similarly apply a REINFORCE-style (Williams, 1992) algorithm to maximize various rewards earned by generated questions.

3 Encoder-Decoder Model for Question Generation

We adapt the simple encoder-decoder architecture first outlined by Cho et al. (2014) to the question generation problem. In particular, we base our model on the attention mechanism of Bahdanau et al. (2015) and the pointer-softmax copying mechanism of Gulcehre et al. (2016). In question generation, we can condition our encoder on two different sources of information (compared to the single source in neural machine translation (NMT)): a document that the question should be about and an answer that should fit the generated question. Next, we describe how we adapt the encoder and decoder architectures in detail.

3.1 Encoder

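Our encoder is a neural model acting on two input sequences: the document, $D = (d_1, \dots, d_n)$, and the answer, $A = (a_1, \dots, a_m)$. Sequence elements $d_i$, $a_j$ are given by embedding vectors $\mathbf{d}_i, \mathbf{a}_j \in \mathbb{R}^{D_e}$ (Bengio et al., 2001).

In the first stage of encoding, similar to current question answering systems, e.g., Seo et al. (2016), we augment each document word embedding with a binary feature that indicates if the document word belongs to the answer. Then, we run a bidirectional long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997) on the augmented document sequence, producing annotation vectors $\mathbf{h}^d = (\mathbf{h}^d_1, \dots, \mathbf{h}^d_n)$. Here, $\mathbf{h}^d_i \in \mathbb{R}^{D_h}$ is the concatenation of the network's forward ($\overrightarrow{\mathbf{h}}^d_i$) and backward ($\overleftarrow{\mathbf{h}}^d_i$) hidden states for input token $i$, i.e., $\mathbf{h}^d_i = [\overrightarrow{\mathbf{h}}^d_i; \overleftarrow{\mathbf{h}}^d_i]$. (We use the notation $[\cdot;\cdot]$ to denote concatenation of two vectors throughout the paper.)

Our model operates on QA datasets where the answer is extractive; thus, we encode the answer using the annotation vectors corresponding to the answer word positions in the document. We assume that, without loss of generality, $A$ consists of the sequence of words $(d_s, \dots, d_e)$ in the document, s.t. $1 \le s \le e \le n$. We concatenate the annotation sequence $(\mathbf{h}^d_s, \dots, \mathbf{h}^d_e)$ with the corresponding answer word embeddings $(\mathbf{a}_s, \dots, \mathbf{a}_e)$, i.e., $[\mathbf{h}^d_j; \mathbf{a}_j]$ for $s \le j \le e$, then apply a second bidirectional LSTM (biLSTM) over the resulting sequence of vectors to obtain the extractive condition encoding $\mathbf{h}^a \in \mathbb{R}^{D_h}$. We form $\mathbf{h}^a$ by concatenating the final hidden states from each direction of the biLSTM.

We also compute an initial state $\mathbf{s}_0 \in \mathbb{R}^{D_s}$ for the decoder using the annotation vectors and the extractive condition encoding:

$$\mathbf{r} = \mathbf{L} \mathbf{h}^a + \frac{1}{n} \sum_{i=1}^{|D|} \mathbf{h}^d_i, \qquad \mathbf{s}_0 = \tanh(\mathbf{W}_0 \mathbf{r} + \mathbf{b}_0),$$

where $\mathbf{L} \in \mathbb{R}^{D_h \times D_h}$, $\mathbf{W}_0 \in \mathbb{R}^{D_s \times D_h}$ and $\mathbf{b}_0 \in \mathbb{R}^{D_s}$ are parameters. (Here $|X|$ denotes the length of sequence $X$.)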
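As a minimal sketch of the first encoding stage described above, the snippet below appends a binary answer-membership feature to each document word embedding before it would be passed to the bidirectional LSTM. The function name, the plain-list representation of embeddings, and the inclusive 0-based span indices are assumptions made for illustration; they are not taken from the authors' implementation.

```python
from typing import List, Sequence


def augment_with_answer_flag(
    doc_embeddings: Sequence[Sequence[float]],
    answer_start: int,
    answer_end: int,
) -> List[List[float]]:
    """Append a 0/1 feature marking whether each document token is in the answer.

    answer_start and answer_end are inclusive, 0-based token positions
    (an indexing convention assumed for this sketch).
    """
    augmented = []
    for position, embedding in enumerate(doc_embeddings):
        in_answer = 1.0 if answer_start <= position <= answer_end else 0.0
        # The original embedding is kept intact; only one extra dimension is added.
        augmented.append(list(embedding) + [in_answer])
    return augmented
```

Each augmented vector has one more dimension than the original embedding, so the input size of the recurrent encoder would need to grow by one accordingly.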

3.2 Decoder

Our decoder is a neural model that generates outputs sequentially. At each time-step $t$, the decoder models a conditional distribution parametrized by $\theta$,

$$p_\theta(y_t \mid y_{<t}, D, A), \tag{1}$$

where $y_{<t}$ represents the outputs at earlier time-steps. In question generation, output $y_t$ is a word sampled according to (1).

When formulating questions based on documents, it is common to refer to phrases and entities that appear directly in the text. We therefore incorporate into our decoder a mechanism for copying relevant words from $D$. We use the pointer-softmax formulation (Gulcehre et al., 2016), which has two output layers: the shortlist softmax and the location softmax. The shortlist softmax induces a distribution over words in a predefined output vocabulary. The location softmax is a pointer network (Vinyals et al., 2015) that induces a distribution over document tokens to be copied. A source switching network enables the model to interpolate between these distributions.

In more detail, the decoder is a recurrent neural network. Its internal state, $\mathbf{s}_t \in \mathbb{R}^{D_s}$, updates according to the long short-term memory function (Hochreiter and Schmidhuber, 1997), i.e.,

$$\mathbf{s}_t = \mathrm{LSTM}(\mathbf{s}_{t-1}, \mathbf{y}_{t-1}, \mathbf{v}_t), \tag{2}$$

where $\mathbf{v}_t$ is the context vector computed from the document and answer encodings.

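At every time-step $t$, the model computes a soft-alignment score over the document to decide which words are more relevant to the question being generated. As in a traditional NMT architecture, the decoder computes a relevance weight $\alpha_{tj}$ for every $j$th word in the document when generating the $t$th word in the question. The alignment score vector $\boldsymbol{\alpha}_t \in \mathbb{R}^{|D|}$ is computed with a single-layer feedforward neural network $f$ using the $\tanh(\cdot)$ activation function. The scores $\boldsymbol{\alpha}_t$ are also used as the location softmax distribution. The network defined by $f$ computes energies according to (3) for the alignments, and the normalized alignments $\alpha_{tj}$ are computed as in (4):

$$e_{tj} = f(\mathbf{h}^d_j, \mathbf{h}^a, \mathbf{s}_{t-1}), \tag{3}$$

$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{i=1}^{n} \exp(e_{ti})}. \tag{4}$$

To compute the context vector $\mathbf{v}_t$ used in (2), we first construct context vector $\mathbf{c}_t$ for the document and then concatenate it with $\mathbf{h}^a$:

$$\mathbf{c}_t = \sum_{i=1}^{n} \alpha_{ti} \mathbf{h}^d_i, \tag{5}$$

$$\mathbf{v}_t = [\mathbf{c}_t; \mathbf{h}^a]. \tag{6}$$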
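The alignment and context computation just described can be sketched as follows. This is an illustrative sketch under stated assumptions: the energies are taken as given (the scoring network itself is not implemented), vectors are plain Python lists, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize energies into weights that sum to one (max-shifted for stability)."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def document_context(
    energies: Sequence[float],
    annotations: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Return the alignment weights and the weighted sum of annotation vectors."""
    weights = softmax(energies)
    dim = len(annotations[0])
    context = [
        sum(w * h[k] for w, h in zip(weights, annotations)) for k in range(dim)
    ]
    return weights, context


def full_context(context: Sequence[float], answer_encoding: Sequence[float]) -> List[float]:
    """Concatenate the document context with the answer encoding."""
    return list(context) + list(answer_encoding)
```

The same weights serve two purposes in the model described here: they pool the annotation vectors into a context vector, and they act as the distribution over document positions for copying.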


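We use a deep output layer (Pascanu et al., 2013) at each time-step for the shortlist softmax vector $\mathbf{o}_t$. This layer fuses the information coming from $\mathbf{s}_t$, $\mathbf{v}_t$ and $\mathbf{y}_{t-1}$ through a simple MLP $g$ to predict the word logits for the softmax as in (7). Parameters of the softmax layer are denoted as $\mathbf{W}_o \in \mathbb{R}^{|V| \times D_h}$ and $\mathbf{b}_o \in \mathbb{R}^{|V|}$, where $|V|$ is the size of the shortlist vocabulary (2000 words herein).

$$\mathbf{e}_t = g(\mathbf{s}_t, \mathbf{v}_t, \mathbf{y}_{t-1}), \qquad \mathbf{o}_t = \mathrm{softmax}(\mathbf{W}_o \mathbf{e}_t + \mathbf{b}_o). \tag{7}$$

A source switching variable $z_t$ enables the model to interpolate between document copying and generation from the shortlist. It is computed by an MLP with two hidden layers using $\tanh$ units (Gulcehre et al., 2016). Similarly to the computation of the shortlist softmax, the switching network takes $\mathbf{s}_t$, $\mathbf{v}_t$ and $\mathbf{y}_{t-1}$ as inputs. Its output layer generates the scalar $z_t$ through the logistic sigmoid activation function.

Finally, $p_\theta(y_t \mid y_{<t}, D, A)$ is approximated by the full pointer-softmax $\mathbf{p}_t \in \mathbb{R}^{|V|+|D|}$ by concatenating $\mathbf{o}_t$ and $\boldsymbol{\alpha}_t$ after both are weighted by $z_t$:

$$\mathbf{p}_t = [z_t \mathbf{o}_t; (1 - z_t) \boldsymbol{\alpha}_t]. \tag{8}$$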
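To make the interpolation concrete, here is a small sketch of how a switch value could combine the shortlist and location distributions into one distribution over shortlist words plus document positions. Which distribution is scaled by the switch and which by its complement is an assumption of this sketch, as are the function names and the toy vocabulary in the usage note.

```python
from typing import List, Sequence


def pointer_softmax(
    shortlist_probs: Sequence[float],
    location_probs: Sequence[float],
    switch: float,
) -> List[float]:
    """Concatenate the two distributions after weighting them by the switch.

    Assumption: the shortlist is scaled by `switch` and the copy
    distribution by `1 - switch`, so the result still sums to one.
    """
    if not 0.0 <= switch <= 1.0:
        raise ValueError("switch must lie in [0, 1]")
    generated = [switch * p for p in shortlist_probs]
    copied = [(1.0 - switch) * a for a in location_probs]
    return generated + copied


def decode_index(
    index: int,
    shortlist_vocab: Sequence[str],
    document_tokens: Sequence[str],
) -> str:
    """Map an index in the concatenated distribution back to a word."""
    if index < len(shortlist_vocab):
        return shortlist_vocab[index]
    return document_tokens[index - len(shortlist_vocab)]
```

For example, with a two-word shortlist `["what", "who"]` and the document tokens `["battle", "of", "hastings"]`, index 3 of the combined distribution refers to the copied token `"of"`.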


As is standard in NMT, during decoding we use a beam search (Graves, 2012) to maximize (approximately) the conditional probability of an output sequence. We discuss this in more detail in the following section.

3.3 Training

The model is trained initially to minimize the negative log-likelihood of the training data under the model distribution,

$$\mathcal{L} = -\sum_{t} \log p_\theta(y_t \mid y_{<t}, D, A), \tag{9}$$

where, in the decoder as defined in (2), the previous token $y_{t-1}$ comes from the ground-truth sequence rather than the model output (this is called teacher forcing).

Based on our knowledge of the task, we introduce additional training signals to aid the model’s learning. First, we encourage the model not to generate answer words in the question. We use the soft answer-suppression constraint given in (10) with the penalty hyperparameter $\lambda_s$; $\bar{A}$ denotes the set of words that appear in the answer but not in the ground-truth question:

$$\mathcal{L}_s = \lambda_s \sum_{t} \sum_{\bar{a} \in \bar{A}} p_\theta(y_t = \bar{a} \mid y_{<t}, D, A). \tag{10}$$

We also encourage variety in the output words to counteract the degeneracy often observed in NLG systems towards common outputs (Sordoni et al., 2015). This is achieved with a loss term that maximizes entropy in the output softmax (8), i.e.,

$$\mathcal{L}_e = \lambda_e \sum_{t} \mathbf{p}_t^{\top} \log \mathbf{p}_t. \tag{11}$$
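The three training signals of this section (teacher-forced likelihood, answer suppression, and entropy maximization) can be sketched on plain per-step probability vectors. This is a hedged illustration rather than the authors' code: the function names, the list-of-lists representation, and the way the terms are scaled are assumptions.

```python
import math
from typing import Iterable, Sequence


def nll_loss(step_probs: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Negative log-likelihood of the ground-truth token at every step."""
    return -sum(math.log(p[y]) for p, y in zip(step_probs, targets))


def answer_suppression_loss(
    step_probs: Sequence[Sequence[float]],
    suppressed: Iterable[int],
    penalty: float,
) -> float:
    """Penalize probability mass placed on answer words absent from the true question."""
    suppressed = list(suppressed)
    return penalty * sum(p[i] for p in step_probs for i in suppressed)


def negative_entropy_loss(step_probs: Sequence[Sequence[float]], weight: float) -> float:
    """Sum of p * log(p); minimizing this term maximizes output entropy."""
    # Zero-probability entries contribute nothing and are skipped to avoid log(0).
    return weight * sum(q * math.log(q) for p in step_probs for q in p if q > 0.0)
```

In a full training loop the three values would simply be added before backpropagation; here they are returned separately so that each term can be inspected on its own.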


4 Policy Gradient Optimization

As described above, we use teacher forcing to train our model to generate text by maximizing ground-truth likelihood. Teacher forcing introduces critical differences between the training phase (in which the model is driven by ground-truth sequences) and the testing phase (in which the model is driven by its own outputs) (Bahdanau et al., 2016). Significantly, teacher forcing prevents the model from making and learning from mistakes during training. This is related to the observation that maximizing ground-truth likelihood does not teach the model how to distribute probability mass among examples other than the ground-truth, some of which may be valid questions and some of which may be completely incoherent. This is especially problematic in language, where there are often many ways to say the same thing. A reinforcement learning (RL) approach, by which a model is rewarded or penalized for its own actions, could mitigate these issues – though likely at the expense of reduced stability during training. A properly designed reward, maximized via RL, could provide a model with more information about how to distribute probability mass among sequences that do not occur in the training set (Norouzi et al., 2016).

We investigate the use of RL to fine-tune our question generation model. Specifically, we perform policy gradient optimization following a period of “pretraining” on maximum likelihood, using a combination of scalar rewards correlated to question quality. We detail this process below. To make clear that the model is acting freely without teacher forcing, we indicate model-generated tokens with $\hat{y}_t$ and sequences with $\hat{Y}$.

4.1 Rewards

Question answering (QA)

One obvious measure of a question’s quality is whether it can be answered correctly given the context document $D$. We therefore feed model-generated questions into a pretrained question-answering system and use that system’s accuracy as a reward. We use the recently proposed Multi-Perspective Context Matching (MPCM) (Wang et al., 2016) model as our reference QA system, sans character-level encoding. Broadly, that model takes in a generated question $\hat{Y}$ and a document $D$, processes them through bidirectional recurrent neural networks, applies an attention mechanism, and points to the start and end tokens of the answer in $D$. After training an MPCM model on the SQuAD dataset, the reward $R_{QA}(\hat{Y})$ is given by MPCM’s answer accuracy on $\hat{Y}$ in terms of the F1 score, a token-based measure proposed by Rajpurkar et al. (2016) that accounts for partial word matches:

$$R_{QA}(\hat{Y}) = \mathrm{F1}(\hat{A}, A), \tag{12}$$

where $\hat{A} = \mathrm{MPCM}(\hat{Y})$ is the answer to the generated question by the MPCM model. Optimizing the QA reward could lead to ‘friendly’ questions that are either overly simplistic or that somehow cheat by exploiting quirks in the MPCM model. One obvious way to cheat would be to inject answer words into the question. We prevented this by masking these out in the location softmax, a hard version of the answer suppression loss (10).
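For readers unfamiliar with the token-based F1 used as the QA reward, a minimal sketch follows. It splits on whitespace only; the answer normalization applied by the official SQuAD evaluation (lowercasing, punctuation and article removal) is deliberately omitted, and the function name is an assumption.

```python
from collections import Counter


def token_f1(predicted_answer: str, reference_answer: str) -> float:
    """Harmonic mean of token precision and recall between two answer strings."""
    predicted = predicted_answer.split()
    reference = reference_answer.split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)
```

A predicted span such as `"battle of hastings"` scored against `"the battle of hastings"` has perfect precision but a recall of 3/4, which is how partial matches earn partial credit.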

Fluency (PPL)

Another measure of quality is a question’s fluency, i.e., whether it is stated in proper, grammatical English. As simultaneously proposed in Zhang and Lapata (2017), we use a language model to measure and reward the fluency of generated questions. In particular, we use the perplexity assigned to $\hat{Y}$ by an LSTM language model:

$$R_{PPL}(\hat{Y}) = -2^{-\frac{1}{T} \sum_{t=1}^{T} \log_2 p_{LM}(\hat{y}_t \mid \hat{y}_{<t})}, \tag{13}$$

where $T$ is the length of $\hat{Y}$ and the negation is to reward the model for minimizing perplexity. The language model is trained through maximum likelihood estimation on the human-generated questions from the SQuAD training set.
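The fluency reward can be sketched from per-token log-probabilities supplied by any language model. The sketch assumes natural-log probabilities as input (perplexity does not depend on the base as long as the logarithm and the exponentiation agree); the function names are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity of a sequence given natural-log probabilities of its tokens."""
    if not token_log_probs:
        raise ValueError("at least one token is required")
    average = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-average)


def fluency_reward(token_log_probs: Sequence[float]) -> float:
    """Negated perplexity, so that more fluent questions earn a larger reward."""
    return -perplexity(token_log_probs)
```

A sequence whose tokens each receive probability 0.25 has perplexity 4 and therefore a reward of -4; a more predictable sequence yields a reward closer to -1.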


For the total scalar reward earned by the word sequence $\hat{Y}$, we also test a weighted combination of the individual rewards:

$$R_{PPL+QA}(\hat{Y}) = \lambda_{QA} R_{QA}(\hat{Y}) + \lambda_{PPL} R_{PPL}(\hat{Y}),$$

where $\lambda_{QA}$ and $\lambda_{PPL}$ are hyperparameters. The individual reward functions use neural models to tune the neural question generator. This is reminiscent of recent work on GANs (Goodfellow et al., 2014) and actor-critic methods (Bahdanau et al., 2016). We treat the reward models as black boxes, rather than attempting to optimize them jointly or backpropagate error signals through them. We leave these directions for future work.

We also experimented with several other rewards, most notably the BLEU score (Papineni et al., 2002) between $\hat{Y}$ and the ground-truth question for the given document and answer, and a softer measure of similarity between output and ground-truth based on skip-thought vectors (Kiros et al., 2015). Empirically, we were unable to obtain consistent improvements on these rewards through training, though this may be an issue with hyperparameter settings.

4.2 REINFORCE

We use the REINFORCE algorithm (Williams, 1992) to maximize the model’s expected reward. For each generated question $\hat{Y}$, we define the loss

$$\mathcal{L}_{RL} = -\mathbb{E}_{\hat{Y} \sim \pi(\hat{Y} \mid D, A)}\left[R(\hat{Y})\right], \tag{14}$$

where $\pi$ is the policy to be trained. The policy is a distribution over discrete actions, i.e. words $\hat{y}_t$ that make up the sequence $\hat{Y}$. It is the distribution induced at the output layer of the encoder-decoder model (8), initialized with the parameters determined through likelihood optimization. (The policy also depends on the switch values, but we omit these for brevity.)

REINFORCE approximates the expectation in (14) with independent samples from the policy distribution, yielding the policy gradient

$$\nabla \mathcal{L}_{RL} \approx -\sum_{t} \nabla \log \pi(\hat{y}_t \mid \hat{y}_{<t}, D, A) \, \frac{R(\hat{Y}) - \mu_R}{\sigma_R}, \tag{15}$$

where the optional $\mu_R$ and $\sigma_R$ are the running mean and standard deviation of the reward, such that $\frac{R(\hat{Y}) - \mu_R}{\sigma_R}$ has zero mean and unit variance. The resulting “whitening” of the rewards is a simple version of PopArt (van Hasselt et al., 2016), and we found empirically that it stabilized learning.
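A small sketch of reward whitening and the corresponding REINFORCE surrogate loss is given below. The text does not specify how the running statistics are maintained, so Welford's online update is used here as one reasonable choice; the class and function names, and the small epsilon guarding the division, are assumptions.

```python
import math
from typing import Sequence


class RunningWhitener:
    """Tracks a running mean and standard deviation of scalar rewards."""

    def __init__(self, eps: float = 1e-8) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, reward: float) -> None:
        """Fold one new reward into the running statistics (Welford's method)."""
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (reward - self.mean)

    @property
    def std(self) -> float:
        # With fewer than two rewards the spread is undefined; fall back to 1.
        if self.count < 2:
            return 1.0
        return math.sqrt(self._m2 / self.count)

    def whiten(self, reward: float) -> float:
        """Shift and scale a reward by the running mean and standard deviation."""
        return (reward - self.mean) / (self.std + self.eps)


def reinforce_loss(token_log_probs: Sequence[float], whitened_reward: float) -> float:
    """Surrogate loss: the reward is a constant weight on the sequence log-likelihood."""
    return -whitened_reward * sum(token_log_probs)
```

Because the whitened reward is treated as a constant, differentiating this surrogate with respect to the model parameters gives a reward-weighted sum of log-probability gradients, which is the form of the policy gradient described above.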

It is straightforward to combine policy gradient with maximum likelihood, as both gradients can be computed by backpropagating through a properly reweighted sequence-level log-likelihood. The sequences for policy gradient are sampled from the model and weighted by a whitened reward, and the likelihood sequences are sampled from the training set and weighted by 1.

4.3 Training Scheme

Instead of sampling from the model’s output distribution, we use beam search to generate questions from the model and approximate the expectation in (14). Empirically we found that rewards could not be improved through training without this approach. Randomly sampling from the model’s distribution may not be as effective for estimating the modes of the generation policy and it may introduce more variance into the policy gradient.

Beam search keeps a running set of candidates that expands and contracts adaptively. At each time-step $t$, $k$ output words that maximize the probabilities of their respective paths are selected and added to the candidate sequences, where $k$ is the beam size. The probabilities of these candidates are given by their accumulated log-likelihood up to $t$. (We also experimented with a stochastic version of beam search by randomly sampling words from the top-ranked predictions, sorted by candidate sequence probability, at each time step. No performance improvement was observed.)
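The beam search procedure can be sketched generically. The sketch below is a simplified, fixed-width variant written for illustration: `next_log_probs` is a hypothetical callback that stands in for the decoder and returns log-probabilities of candidate next tokens given a prefix, and the end-of-sequence symbol is an assumed convention.

```python
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[Tuple[str, ...], float]


def beam_search(
    next_log_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
    beam_size: int,
    max_len: int,
    eos: str = "</s>",
) -> List[Hypothesis]:
    """Return up to `beam_size` sequences with their accumulated log-likelihoods."""
    beams: List[Hypothesis] = [((), 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Expand every live hypothesis by every candidate next token.
        candidates: List[Hypothesis] = []
        for prefix, score in beams:
            for token, log_prob in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + log_prob))
        # Keep the best-scoring paths; completed ones leave the live beam.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for sequence, score in candidates[:beam_size]:
            if sequence[-1] == eos:
                finished.append((sequence, score))
            else:
                beams.append((sequence, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len are kept as-is
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_size]
```

The accumulated score returned with each sequence is the quantity that would be recomputed under teacher forcing when the sampled sequence is used for a policy gradient update.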

Given a complete sample from the beam search and its accumulated log-likelihood, the gradient in (15) can be estimated as follows. After calculating the reward with a sequence generated by beam search, we use the sample to teacher-force the decoder so as to recreate exactly the model states from which the sequence was generated. The model can then be accurately updated by coupling the parameter-independent reward with the log-likelihood of the generated sequence. This approach adds a computational overhead but it significantly increases the initial reward values earned by the model and stabilizes policy gradient training.

We also further tune the likelihood during policy gradient optimization to prevent the model from overwriting its earlier training. We combine the policy gradient update to the model parameters, $\nabla \mathcal{L}_{RL}$, with an update from $\nabla \mathcal{L}$ based on teacher forcing on the ground-truth signal.

5 Experiments

5.1 Dataset

We conducted our experiments on the SQuAD dataset for machine comprehension (Rajpurkar et al., 2016), a large-scale, human-generated corpus of (document, question, answer) triples. Documents are paragraphs from 536 high-PageRank Wikipedia articles covering a variety of subjects. Questions are posed by crowdworkers in natural language and answers are spans of text in the related paragraph highlighted by the same crowdworkers. There are 107,785 question-answer pairs in total, including 87,599 training instances and 10,570 development instances.

5.2 Baseline Seq2Seq System

We build a simple baseline system, “Seq2Seq,” on the encoder-decoder architecture with attention and pointer-softmax outlined in Bahdanau et al. (2015) and Gulcehre et al. (2016). The baseline conditions question generation on the answer by setting $\mathbf{h}^a$ to be the average of the document encodings corresponding to the answer positions in $D$.

5.3 Quantitative Evaluation

We use several automatic evaluation metrics to judge the quality of generated questions with respect to the ground-truth questions from the dataset. We are undertaking a large-scale human evaluation to determine how these metrics align with human judgments. The first metric is BLEU (Papineni et al., 2002), a standard in machine translation, which computes {1,2,3,4}-gram matches between generated and ground-truth questions. Next we use F1, which focuses on unigram matches (Rajpurkar et al., 2016). We also report fluency and QA performance metrics used in our reward computation. Fluency is measured by the perplexity (PPL) of the generated question computed by the pretrained question language model. The PPL score reflects the marginal probability of the question as estimated from the corpus: lower perplexity corresponds to higher probability. The QA performance is measured by running the pretrained MPCM model on the generated questions and measuring F1 between the predicted answer and the conditioning answer.
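To illustrate the n-gram matching that underlies BLEU, the following sketch computes clipped n-gram precision for a single candidate and a single reference. It is not a full BLEU implementation: the brevity penalty, the geometric mean over n-gram orders, and corpus-level aggregation are omitted, and the function names are assumptions.

```python
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    candidate_counts = ngram_counts(candidate.split(), n)
    reference_counts = ngram_counts(reference.split(), n)
    total = sum(candidate_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(
        min(count, reference_counts[gram]) for gram, count in candidate_counts.items()
    )
    return matched / total
```

Two valid questions about the same answer can share few higher-order n-grams, which is one reason overlap-based scores of this kind can be low even for reasonable generated questions.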

System                NLL    BLEU   F1     QA     PPL
Seq2Seq               45.8   4.9    31.2   45.6   153.2
Our System            35.3   10.2   39.5   65.3   175.7
+ PG ($R_{PPL}$)      35.7   9.2    38.2   61.1   155.6
+ PG ($R_{QA}$)       39.8   10.5   40.1   74.2   300.9
+ PG ($R_{PPL+QA}$)   39.0   9.2    37.8   70.2   183.1
Question LM           -      -      -      -      87.7
MPCM                  -      -      -      70.5   -
Table 2: Automatic metrics on SQuAD’s dev set. NLL is the negative log-likelihood. BLEU and F1 are computed with respect to the ground-truth questions. QA is the F1 obtained by the MPCM model answers to generated questions and PPL is the perplexity computed with the question language model (LM) (lower is better). PG denotes policy gradient training. The bottom two lines report performance on ground-truth questions.

Text Passage
…the court of justice accepted that a requirement to speak gaelic to teach in a dublin design college could be justified as part of the public policy of promoting the irish language.
Generated Questions
1) what did the court of justice not claim to do?
2) what language did the court of justice say should be justified as part of the public language?
3) what language did the court of justice decide to speak?
4) what language did the court of justice adopt a requirement to speak?
5) what language did the court of justice say should be justified as part of?
Table 3: Examples of generated questions given a context and an answer. Questions are generated by the five systems in Table 2, in order.

Training        Generated Questions                                                                     QA     PPL
$R_{PPL}$       what was the name of the library that was listed on the grainger market?                0      73.2
$R_{QA}$        the grainger market architecture was listed in 1954 by what?                            100    775
$R_{PPL+QA}$    what language did the grainger market architecture belong to?                           0      257

$R_{PPL}$       what are the main areas of southern california?                                         0      114
$R_{QA}$        southern california is famous for what?                                                 16.6   269
$R_{PPL+QA}$    what is southern california known for?                                                  16.6   179

$R_{PPL}$       what was the goal of the imperial academy of medicine?                                  19.1   44.3
$R_{QA}$        why were confucian scholars attracted to the medical profession?                        73.7   405
$R_{PPL+QA}$    what did the confucian scholars believe were attracted to the medical schools?          90.9   135

$R_{PPL}$       what is an example of a theory that can be solved in theory?                            0      38
$R_{QA}$        in complexity theory, it is known as what?                                              100    194
$R_{PPL+QA}$    what is an example of a theory that can cause polynomial-time solutions to be useful?   100    37
Table 4: Comparison of questions from different reward combinations on the same text and answer.

5.4 Results and qualitative analysis

Our results for automatic evaluation on SQuAD’s development set are presented in Table 2. Implementation details for all models are given in the supplementary material. One striking feature is that BLEU scores are quite low for all systems tested, which relates to our earlier argument that a typical (document, answer) pair may be associated with multiple semantically-distinct questions. This seems to be borne out by the results, since most generated samples look reasonable despite low BLEU scores (see Tables 1 and 3).

Our system vs. Seq2Seq

Comparing our model to the Seq2Seq baseline, we see that all metrics improve notably with the exception of PPL. Interestingly, our system performs worse in terms of PPL despite achieving lower negative log-likelihood. This, along with the improvements in BLEU, F1 and QA, suggests that our system learns a more powerful conditional model at the expense of accurately modelling the marginal distribution over questions. It is likely challenging for the model to allocate probability mass to rarer keywords that are helpful to recover the desired answer while also minimizing perplexity. We illustrate with samples from both models, specifically the first two samples in Table 3. The Seq2Seq baseline generated a well-formed English question, which is also quite vague – it is only weakly conditioned on the answer. On the other hand, our system’s generated question is more specific, but still not correct given the context and perhaps less fluent given the repetition of the word language. We found that our proposed entropy regularization helped to avoid over-fitting and worked nicely in tandem with dropout: the training loss for our regularized model was 26.6 compared to 22.0 for the Seq2Seq baseline that used only dropout regularization.

Policy gradient ($R_{PPL}$)

Policy gradient training with the negative perplexity of the pretrained language model improves the generator’s PPL score as desired, which approaches that of the baseline Seq2Seq model. However, QA, F1, and BLEU scores decrease. This aligns with the above observation that fluency and answerability (as measured by the automatic scores) may be in competition. As an example, the third sample in Table 3 is more fluent than the previous examples but does not refer to the desired answer.

Policy gradient ($R_{QA}$)

Policy gradient is very effective at maximizing the QA reward, gaining 8.9 points of QA F1 over the improved Seq2Seq model (from 65.3 to 74.2 in Table 2) and improving most other metrics as well. The fact that the QA score is 3.7 points higher than that obtained on the ground-truth questions (70.5) suggests that the question generator may have learned to exploit MPCM’s answering mechanism, and the higher reported perplexity suggests questions under this scheme may be less fluent. We explore this in more detail below. The fourth sample in Table 3, in contrast to the others, is clearly answered by the context word gaelic as desired.

Policy gradient ($R_{PPL+QA}$)

We attempted to improve fluency and answerability in tandem by combining QA and PPL rewards. The PPL reward adds a prior towards questions that look natural. According to Table 2, this optimization scheme yields a good balance of performance, improving over the maximum-likelihood model by a large margin in terms of QA performance and gaining back some PPL. In the sample shown in Table 3, however, the question is specific to the answer but ends prematurely.

In Table 4 we provide additional generated samples from the different PG rewards. This table reveals one of the ‘tricks’ encouraged by the QA reward for improving MPCM performance: questions are often phrased with the interrogative ‘wh’ word at the end. This gives the language high perplexity, since such questions are rarer in the training data, but brings the question form closer to the form of the source text for answer matching.

5.5 Discussion

Looking through examples revealed certain difficulties in the task and some pathologies in the model that should be rectified through future work.

Entities and Verbs

Similar entities and related verbs are often swapped, e.g., miami for jacksonville in a question about population. This issue could be mitigated by biasing the pointer softmax towards the document for certain word types.


We desire a system that generates interesting questions, which are not limited to reordering words from the context but exhibit some abstraction. Rewards from existing QA systems do not seem beneficial for this purpose. Questions generated through NLL training show more abstraction at the expense of decreased specificity.

Commonsense and Reasoning

Commonsense understanding appears critical for generating questions that are well-posed and show abstraction from the original text. Likewise, the ability to reason about and compose relations between entities could lead to more abstract and interesting questions. The existing model has no such capacity.

Text Passage
some yuan documents such as wang zhen’s nong shu were printed with earthenware movable type, a technology invented in the 12th century.
Human- and Model- Generated Questions
1) when was earthenware movable type invented?
2) when was wang zhen’s nong shu printed?
Table 5: A ground-truth question (1) and a valid generated question (2) with low word overlap.


Due to the large number of possible questions given a predefined answer, it is challenging to evaluate the outputs using standard overlap-based metrics such as BLEU. This issue is made clear in the examples of Table 5. There, the second, model-generated question is valid given the context and refers clearly to the answer, but has low word overlap with the first, human-generated question. This suggests that question generation from text is similar to other tasks with large output spaces (Galley et al., 2015) and may benefit from corpora with multiple ground-truth questions, each associated with a quality rating (Mostafazadeh et al., 2016).

6 Conclusion

We proposed a recurrent neural model that generates natural-language questions conditioned on text passages and predefined answers. We showed how to train this model using a combination of maximum likelihood and policy gradient optimization, and demonstrated both quantitatively and qualitatively how several reward combinations affect the generated outputs. We are now undertaking a human evaluation to determine the correlation between rewards and human judgments, improving our model, and testing on additional datasets.


Supplementary Material

Appendix A Implementation details

All models are implemented using Keras (Chollet, 2015) with a Theano (Theano Development Team, 2016) backend. We used Adam (Kingma and Ba, 2014) with an initial learning rate of 2e-4 for both maximum likelihood and policy gradient updates. Word embeddings were initialized with the GloVe vectors (Pennington et al., 2014) and updated during training. The hidden size for all RNNs is 768.

Dropout (Srivastava et al., 2014) is applied with a rate of 0.3 to the embedding layers as well as all the RNNs (between both input-hidden and hidden-hidden connections).

The penalty weights $\lambda_s$ for answer-suppression and $\lambda_e$ for entropy maximization are both set to the same value.

We used beam search with a beam size of 32 in all experiments.

The reward weights used in policy gradient training are listed in Table 6.

1.0 -
- 0.1
0.5 0.25
Table 6: Hyperparameter settings for policy gradient training.