Let's Ask Again: Refine Network for Automatic Question Generation

In this work, we focus on the task of Automatic Question Generation (AQG) where given a passage and an answer the task is to generate the corresponding question. It is desired that the generated question should be (i) grammatically correct (ii) answerable from the passage and (iii) specific to the given answer. An analysis of existing AQG models shows that they produce questions which do not adhere to one or more of the above-mentioned qualities. In particular, the generated questions look like an incomplete draft of the desired question with a clear scope for refinement. To alleviate this shortcoming, we propose a method which tries to mimic the human process of generating questions by first creating an initial draft and then refining it. More specifically, we propose Refine Network (RefNet) which contains two decoders. The second decoder uses a dual attention network which pays attention to both (i) the original passage and (ii) the question (initial draft) generated by the first decoder. In effect, it refines the question generated by the first decoder, thereby making it more correct and complete. We evaluate RefNet on three datasets, viz., SQuAD, HOTPOT-QA, and DROP, and show that it outperforms existing state-of-the-art methods by 7-16% on all of these datasets. Lastly, we show that we can improve the quality of the second decoder on specific metrics, such as, fluency and answerability by explicitly rewarding revisions that improve on the corresponding metric during training. The code has been made publicly available [https://github.com/PrekshaNema25/RefNet-QG]



1 Introduction


Passage 1: Liberated by Napoleon’s army in 1806, Warsaw was made the capital of the newly created Duchy of Warsaw.
Generated Questions:
Baseline: What was the capital of the newly duchy of Warsaw?
RefNet: Who liberated Warsaw in 1806?
Reward-RefNet: Whose army liberated Warsaw in 1806?

Passage 2: To fix carbon dioxide into sugar molecules in the process of photosynthesis, chloroplasts use an enzyme called rubisco.
Generated Questions:
Baseline: What does chloroplasts use?
RefNet: What does chloroplasts use to fix carbon dioxide into sugar molecules?
Reward-RefNet: What do chloroplasts use to fix carbon dioxide into sugar molecules?

Table 1: Samples of questions generated by the Baseline, RefNet, and Reward-RefNet models on the SQuAD dataset. Answers are shown in blue.


Over the past few years, there has been growing interest in Automatic Question Generation (AQG) from text, the task of generating a question from a passage and, optionally, an answer. AQG is used in curating Question Answering datasets, enhancing user experience in conversational AI systems Shum et al. (2018), and creating educational materials Heilman and Smith (2010). For these applications, it is essential that the questions are (i) grammatically correct, (ii) answerable from the passage, and (iii) specific to the answer. Existing approaches focus on encoding the passage, the answer, and the relationship between them using complex functions, and then generate the question in a single pass. However, by carefully analysing the generated questions, we observe that these approaches tend to miss one or more of the important aspects of the question. For instance, in Table 1, the question generated by the single-pass baseline model for the first passage is grammatically correct but is not specific to the answer. In the second example, the generated question is both syntactically incorrect and incomplete.

The above examples indicate that there is clear scope for improving the general quality of the questions. Additionally, the quality can be improved with respect to specific aspects such as fluency (Example 2) and answerability (Example 1). One way to approach this is to revisit the passage and answer with the aim of refining the initial draft, generating a better question in a second pass and then improving it with respect to a certain aspect. This process is comparable to how humans tend to write a rough initial draft first and then refine it over multiple passes, where the later revisions focus on improving the draft with respect to certain aspects like fluency or completeness. With this motivation, we propose Refine Network (RefNet), which examines the initially generated question and performs a second pass to generate a revised question. Furthermore, we propose Reward-RefNet, which uses explicit reward signals to achieve refinement focused on specific properties of the question, such as fluency and answerability.

Our RefNet is a seq2seq-based model that comprises two decoders: a Preliminary Decoder and a Refinement Decoder. The Refinement Decoder takes the initial draft of the question generated by the Preliminary Decoder as an input, along with the passage and answer, and generates the refined question by attending to both the passage and the initial draft using a Dual Attention Network. The proposed dual attention helps RefNet generate the final question by revisiting the appropriate parts of the input passage and initial draft. From Table 1, we can infer that our RefNet model is able to generate better questions in the second pass by fixing the errors in the initial draft. Our Reward-RefNet model uses the REINFORCE-with-baseline algorithm to explicitly reward the Refinement Decoder for generating a question that is better than the Preliminary Decoder's on certain desired parameters, such as fluency and answerability. This leads to more answerable (see the Reward-RefNet example for Passage 1 in Table 1) and more fluent (see the Reward-RefNet example for Passage 2 in Table 1) questions as compared to the vanilla RefNet model.

Our experiments show that the proposed RefNet model outperforms existing state-of-the-art models on the SQuAD dataset by % and % (on BLEU) given the relevant sentence and passage respectively. We also achieve state-of-the-art results on the HOTPOT-QA and DROP datasets, with improvements of % and % respectively over the single-decoder baseline (on BLEU). Our human evaluations further validate these results. We further analyze and explain the impact of including the Refinement Decoder by examining the interaction between the two decoders. Interestingly, we observe that the inclusion of the Refinement Decoder also boosts the quality of the questions generated by the Preliminary Decoder. Lastly, our human evaluation of the questions generated by Reward-RefNet corroborates the empirical results, i.e., it improves the questions w.r.t. fluency and answerability as compared to RefNet's questions.

Figure 1: Our RefNet model with Preliminary and Refinement Decoder.

2 Refine Networks (RefNet) Model

In this section, we discuss the various components of our proposed model, as shown in Figure 1. For a given passage and answer, we first obtain an answer-aware latent representation for every word of the passage, along with an answer representation (as described in Section 2.1). We then generate an initial draft of the question using a probability distribution modeled by the Preliminary Decoder, and refine this initial draft using the Refinement Decoder to obtain the refined draft. Finally, we use explicit rewards to enforce refinement on a desired metric, such as fluency or answerability, through our Reward-RefNet model. In the following sub-sections, we describe the passage encoder, the preliminary and refinement decoders, and our reward mechanism.
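The two-pass generation scheme can be sketched as follows. This is an illustrative outline with toy stand-in decode functions, not the authors' implementation; it only shows the interface between the two passes.

```python
# Illustrative sketch of RefNet's two-pass generation (not the authors' code).

def generate_question(passage, answer, preliminary_decode, refinement_decode):
    """Two-pass AQG: draft with the first decoder, then refine."""
    # Pass 1: the Preliminary Decoder conditions on passage + answer only.
    initial_draft = preliminary_decode(passage, answer)
    # Pass 2: the Refinement Decoder additionally attends to the initial draft.
    refined = refinement_decode(passage, answer, initial_draft)
    return initial_draft, refined

# Toy stand-ins that only demonstrate the calling convention.
prelim = lambda p, a: ["what", "was", "liberated", "?"]
refine = lambda p, a, q1: q1[:-1] + ["in", "1806", "?"]

draft, final = generate_question("...", "...", prelim, refine)
```

The point is that the refinement pass sees both the original inputs and the first pass's output, so it can repair or complete the draft rather than generate from scratch.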

2.1 Passage and Answer Encoder

We use a three-layer encoder consisting of (i) an Embedding layer, (ii) a Contextual layer, and (iii) a Passage-Answer Fusion layer, as described below. To capture the interaction between the passage and the answer, we ensure that their representations are fused together at every layer.

Embedding Layer:

In this layer, we compute an embedding for every word in the passage and the answer. This embedding is obtained by concatenating the word's GloVe embedding Pennington et al. (2014) with its character-based embedding, as discussed in Seo et al. (2016). Additionally, for passage words, we also compute a positional embedding based on the relative position of the word w.r.t. the answer span, as described in Zhao et al. (2018). For every passage word, this positional embedding is also concatenated to the word and character-based embeddings. We discuss the impact of character embeddings and answer tagging in Appendix A.
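The concatenation in this layer can be sketched as follows. The dimensions below are placeholders for illustration only, not the paper's actual sizes (those are stated in Section 3.2).

```python
import numpy as np

# Placeholder dimensions, NOT the paper's actual embedding sizes.
D_WORD, D_CHAR, D_POS = 4, 3, 2

def passage_word_embedding(glove_vec, char_vec, pos_vec):
    """Concatenate GloVe, character-level, and answer-relative positional
    embeddings for one passage word (answer words omit the positional part)."""
    return np.concatenate([glove_vec, char_vec, pos_vec])

e = passage_word_embedding(np.zeros(D_WORD), np.ones(D_CHAR), np.zeros(D_POS))
# e has dimension D_WORD + D_CHAR + D_POS
```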

Contextual Layer:

In this layer, we compute a contextualized representation for every word in the passage by passing the word embeddings (as computed above) through a bidirectional LSTM (Bi-LSTM) Hochreiter and Schmidhuber (1997). We then concatenate the forward and backward hidden states to obtain the contextualized representation of each passage word.

The answer could correspond to a span in the passage, delimited by its start and end indices. We can thus treat the corresponding contextualized states as the representations of the answer words in the context of the passage. We then obtain contextualized representations for the answer words by passing these states through a Bi-LSTM. The final state of this Bi-LSTM is used as the answer representation in the subsequent stages. When the answer is not present in the passage, only the answer's own word embeddings are passed to the Bi-LSTM.

Passage-Answer Fusion Layer:

In this layer, we refine the representations of the passage words based on the answer representation as follows:

This is similar to how Seo et al. (2016) capture interactions between the passage and the question for QA. We use the resulting representations as the fused passage-answer representation, which is then used by our decoder(s) to generate the question.
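A minimal sketch of this fusion step is given below, assuming a BiDAF-style interaction between the passage states and the answer vector (the paper's exact equations are not reproduced here, so treat the scoring function and feature set as illustrative):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def fuse_passage_answer(H, a, W):
    """Sketch of a passage-answer fusion layer (our reading of the text,
    not the paper's exact equations): score each passage word against the
    answer vector and augment its representation with answer-aware features."""
    # H: (n, d) contextual passage states; a: (d,) answer vector; W: (d, d).
    scores = H @ W @ a                  # similarity of each word to the answer
    alpha = softmax(scores)             # attention over passage words
    a_tilde = alpha[:, None] * a        # answer features per position
    return np.concatenate([H, a_tilde, H * a_tilde], axis=1)  # fused reps

n, d = 5, 4
H, a, W = np.random.randn(n, d), np.random.randn(d), np.eye(d)
F = fuse_passage_answer(H, a, W)        # shape (n, 3 * d)
```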

2.2 Preliminary and Refinement Decoders

As discussed earlier, RefNet has two decoders, viz., Preliminary Decoder and Refinement Decoder, as described below:

Preliminary Decoder:

This decoder generates an initial draft of the question, one word at a time, using an LSTM as follows:


Here, the decoder's hidden state at each time step is computed from the answer representation (as computed above), an attention-weighted sum of the contextualized passage word representations with parameterized and normalized attention weights Bahdanau et al. (2014), and the embedding of the previously generated word. We refer to this attention network as the passage attention network.


where a weight matrix and an output matrix project the final representation to a distribution over the vocabulary.
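The additive (Bahdanau) attention used by the Preliminary Decoder can be sketched as follows; shapes and parameter names are illustrative placeholders, not the paper's notation.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def bahdanau_context(s_t, H, W_s, W_h, v):
    """Additive attention: score each passage state against the decoder state,
    normalize the scores, and return the weighted context vector."""
    # H: (n, d) passage states; s_t: (d,) decoder state; W_*: (k, d); v: (k,).
    scores = np.tanh(H @ W_h.T + s_t @ W_s.T) @ v   # (n,) unnormalized scores
    alpha = softmax(scores)                          # attention weights
    return alpha @ H, alpha                          # context c_t and weights

n, d, k = 6, 4, 3
H = np.random.randn(n, d)        # contextualized passage representations
s_t = np.random.randn(d)         # decoder hidden state at step t
W_s, W_h, v = np.random.randn(k, d), np.random.randn(k, d), np.random.randn(k)
c_t, alpha = bahdanau_context(s_t, H, W_s, W_h, v)
```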

Refinement Decoder: Once the Preliminary Decoder generates the entire question, the Refinement Decoder uses it to generate an updated version of the question using a Dual Attention Network. It first computes an attention-weighted sum of the embeddings of the words generated by the first decoder, where the attention weights are parameterized, normalized, and computed by a second attention network. Since the initial draft could be erroneous or incomplete, we obtain additional information from the passage instead of relying only on the output of the first decoder. We do so by computing a context vector over the passage, where the attention weights are again parameterized, normalized, and computed by a third attention network. The hidden state of the Refinement Decoder at each time step is computed from these two context vectors, and the output word is predicted using a weight matrix and the output matrix, which is shared with the Preliminary Decoder (Equation 2). Note that RefNet generates two variants of the question: the initial draft and the final draft. We compare these two versions of the generated questions in Section 4.
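The dual attention can be sketched as follows, using simple dot-product scoring as a stand-in for the parameterized attention networks described above:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def dual_attention_input(s_t, H_passage, E_draft):
    """Sketch of the Dual Attention Network: at each refinement step the
    decoder state attends separately to the passage states and to the
    embeddings of the initial draft; both contexts feed the decoder input.
    Dot-product scoring stands in for the parameterized attention networks."""
    a_p = softmax(H_passage @ s_t)     # attention over the passage
    a_q = softmax(E_draft @ s_t)       # attention over the initial draft
    c_passage = a_p @ H_passage        # passage context vector
    c_draft = a_q @ E_draft            # draft context vector
    return np.concatenate([c_passage, c_draft])

d = 4
H = np.random.randn(7, d)              # contextual passage representations
E = np.random.randn(5, d)              # initial-draft word embeddings
x_t = dual_attention_input(np.random.randn(d), H, E)   # (2 * d,)
```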

2.3 Reward-RefNet

Next, we address the following question: can the Refinement Decoder be explicitly rewarded for generating a question which is better than that generated by the Preliminary Decoder on certain desired parameters? For example, Nema and Khapra (2018) define fluency and answerability as desired qualities in the generated question. They evaluate fluency using the BLEU score, and answerability using a score which captures whether the question contains the required {named entities, important words, function words, question types} (and is thus answerable). We use these fluency and answerability scores proposed by Nema and Khapra (2018) as reward signals. We first compute the rewards for the questions generated by the Preliminary and Refinement Decoders respectively. We then use the "REINFORCE with a baseline" algorithm Williams (1992) to reward the Refinement Decoder, using the Preliminary Decoder's reward as the baseline. More specifically, given the Preliminary Decoder's generated word sequence and the Refinement Decoder's generated word sequence, the training loss is defined as follows:

where the two rewards are obtained by comparing each generated question with the reference question. As mentioned, this reward can be the fluency score or the answerability score as defined by Nema and Khapra (2018).
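A minimal sketch of this loss for a single example: sequence-level REINFORCE where the Preliminary Decoder's reward serves as the baseline, with the refined question's per-step log-probabilities assumed given.

```python
import math

def reward_refnet_loss(log_probs_refined, r_refined, r_preliminary):
    """Sketch of the Reward-RefNet objective: REINFORCE with the Preliminary
    Decoder's reward as the baseline, so the Refinement Decoder is pushed to
    *improve on* the first draft rather than merely score well in isolation.
    `log_probs_refined` are per-step log-probabilities of the refined question."""
    advantage = r_refined - r_preliminary
    # Minimizing this loss raises the likelihood of refinements that beat
    # the draft (positive advantage) and lowers it otherwise.
    return -advantage * sum(log_probs_refined)

# Example: the refined question scores 0.8 vs. the draft's 0.6.
loss = reward_refnet_loss([math.log(0.5)] * 4, r_refined=0.8, r_preliminary=0.6)
```

When the refined question is worse than the draft, the advantage (and hence the gradient direction) flips sign, discouraging that revision.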

2.4 Copy Module

Along with the three modules described above, we adopt the pointer network and coverage mechanism from See et al. (2017). We use them to (i) handle out-of-vocabulary words and (ii) avoid repeating phrases in the generated questions.


Dataset Model
SQuAD (Sentence Level): Sun et al. (2018); Zhao et al. (2018); Kim et al. (2019)
SQuAD (Passage Level): Zhao et al. (2018)
HOTPOT-QA: Zhao et al. (2018)*
DROP: Zhao et al. (2018)*

Table 2: Comparison of the RefNet model with existing approaches and the EAD model on n-gram metrics. Here * denotes our implementation of the corresponding work.

3 Experimental Details

In this section, we discuss (i) the datasets on which we evaluated our proposed model, (ii) implementation details, and (iii) the evaluation metrics used to compare our model with the baseline and existing works.

3.1 Datasets

SQuAD Rajpurkar et al. (2016): It contains 100K+ (question, answer) pairs obtained from Wikipedia articles, where each answer is a span in the passage. For SQuAD, AQG has been attempted from both sentences and passages. In the former case, only the sentence which contains the answer span is used as input, whereas in the latter case the entire passage is used. We use the same train-validation-test splits as Zhao et al. (2018).

HOTPOT-QA Yang et al. (2018): HOTPOT-QA is a multi-document, multi-hop QA dataset. Along with the (P, A, Q) triplets, the authors also provide supporting facts that potentially lead to the answer. The answers here are either yes/no or a span in P. We concatenate these supporting facts to form the passage. We use % of the training data for validation and use the original dev set as the test set.

DROP Dua et al. (2019): DROP is a reading comprehension benchmark which requires discrete reasoning over passages. It contains K questions which require discrete operations such as addition, counting, or sorting to obtain the answer. We use % of the original training data for validation and use the original dev set as the test set.

3.2 Implementation Details

We use pre-trained GloVe word embeddings, which are kept fixed during training. For character-level embeddings, we initially use an embedding for each character, which is then projected to a fixed dimension. For answer tagging, we use a separate embedding. The hidden size is fixed across all the LSTMs. We use a 2-layer, 1-layer, and 2-layer stacked Bi-LSTM for the passage encoder, the answer encoder, and both decoders, respectively. We take the most frequent words as the vocabulary. We use the Adam optimizer and train our models using cross-entropy loss. For the Reward-RefNet model, we fine-tune the pre-trained model with the loss function mentioned in Section 2.3. The best model is chosen based on the BLEU Papineni et al. (2002) score on the validation split. For all the results, we use beam search decoding.

3.3 Evaluation

We evaluate our models using the n-gram similarity metrics BLEU Papineni et al. (2002), ROUGE-L Lin (2004), and METEOR Lavie and Denkowski (2009), computed with the package released by Sharma et al. (2017) (https://github.com/Maluuba/nlg-eval). We also quantify the answerability of our models using QBLEU-4 (https://github.com/PrekshaNema25/Answerability-Metric) Nema and Khapra (2018).

4 Results and Discussions

In this section, we present the results and analysis of our proposed model RefNet. Throughout this section, we refer to our models as follows:

Encode-Attend-Decode (EAD) model

is our single decoder model containing the encoder and the Preliminary Decoder described earlier. Note that the performance of this model is comparable to our implementation of the model proposed in Zhao et al. (2018).

Refine Network (RefNet) model

includes the encoder, the Preliminary Decoder and the Refinement Decoder.
We will (i) compare RefNet's performance with EAD and existing models across all the mentioned datasets, (ii) report human evaluations comparing RefNet and EAD, (iii) analyze the Refinement and Preliminary Decoders, and (iv) present the performance of Reward-RefNet with two different reward signals (fluency and answerability).

4.1 RefNet’s performance across datasets

In Table 2, we compare the performance of RefNet with existing single-decoder architectures across different datasets. On the BLEU-4 metric, RefNet beats the existing state-of-the-art model by %, %, %, and % respectively on the SQuAD (sentence), HOTPOT-QA, DROP, and SQuAD (passage) datasets. It also outperforms EAD by %, %, %, and % respectively on SQuAD (sentence), HOTPOT-QA, DROP, and SQuAD (passage). In general, RefNet is consistently better than existing models across all n-gram scores (BLEU, ROUGE-L, and METEOR). Along with n-gram scores, we also observe improvements on Q-BLEU4, which, as described earlier, gives a measure of both answerability and fluency.

4.2 Human Evaluations

We conducted human evaluations to analyze the quality of the questions produced by EAD and RefNet. We randomly sampled questions generated from the SQuAD (sentence level) dataset and asked the annotators to compare the quality of the generated questions. The annotators were shown a pair of questions, one generated by EAD and one by RefNet from the same sentence, and were asked to decide which one was better in terms of Fluency, Completeness, and Answerability. They were allowed to skip the question pairs where they could not make a clear choice. Three annotators rated each question and the final label was calculated based on majority voting. We observed that the RefNet model outperforms the EAD model across all three metrics. Over %, % and % of the generated questions from RefNet were respectively more fluent, complete and answerable when compared to the EAD model. However, there are some cases where EAD does better than RefNet. For example, in Table 3, we show that while trying to generate a more elaborate question, RefNet introduces an additional phrase “in the united” which is not required. Due to such instances, annotators preferred the EAD model in around % of the instances.

Passage: Before the freeze ended in 1952, there were only 108 existing television stations in the United States; a few major cities (such as Boston) had only two television stations, …
EAD: how many television stations existed in boston ?
RefNet: how many television stations did boston have in the united ?
Table 3: An example where EAD model was better than RefNet. The ground truth answers are shown in blue.
Model Decoder BLEU-4 QBLEU-4
without RefNet
Initial Draft
with RefNet
Initial Draft
Table 4: Comparison between Preliminary Decoder and Refinement Decoder in RefNet Model for SQuAD Sentence Level QG.
Sentence: For instance, the language { xx — x is any binary string } can be solved in linear time on a multi-tape Turing machine, but necessarily requires quadratic time in the model of single-tape Turing machines.
Reference Question: A multi-tape Turing machine requires what type of time for a solution?
Refinement Decoder: in what time can the language be solved on a multi-tape turing machine ?
Preliminary Decoder: in what time can the language be solved ?
Refinement Decoder: in what time can the language { xx — x x x x is any binary string ?
Preliminary Decoder: in what time can the language — x x x x is solved ?
Table 5: Samples generated by the Preliminary Decoder and Refinement Decoder in the RefNet model.

4.3 Analysis of Refinement Decoder and Preliminary Decoder

The two decoders influence each other through two paths: (i) an indirect path, where they share the encoder and the output projection to the vocabulary, and (ii) a direct path, via the dual attention network, where the initial draft of the question is attended to by the Refinement Decoder. When RefNet has only the indirect path, we can infer from row 1 of Table 4 that the performance of the Preliminary Decoder improves when compared to the EAD model ( vs. BLEU). This suggests that generating two variants of the question improves the first decoding pass as well, perhaps due to the additional feedback that the shared encoder and output layer receive from the Refinement Decoder. When we add the direct path (attention network) between the two decoders, the performance of the Refinement Decoder improves as compared to the Preliminary Decoder, as shown in rows 3 and 4 of Table 4.
Comparison on Answerability: We also evaluate both the initial and refined drafts using QBLEU4. As discussed earlier, the Q-Metric measures answerability using four components, viz., Named Entities, Important Words, Function Words, and Question Type. We observe that the increase in Q-Metric for refined questions arises because the RefNet model can correct or add the relevant Named Entities in the question. In particular, we observe that the Named Entity component score in the Q-Metric increases from the first draft to the refined draft.

Figure 2: Generated Question Length Distribution for Preliminary Decoder (First Decoder) and Refinement Decoder (Second Decoder).

Qualitative Analysis: Figure 2 shows that the RefNet model indeed generates more elaborate questions when compared to the Preliminary Decoder. As shown in Table 5, the quality of the refined question is better than the initial draft of the questions. Here RefNet adds the phrase “multi-tape Turing Machine,” (row 2) which removes any ambiguity in the question.

4.4 Analysis of Reward-RefNet

In this section, we analyze the impact of employing different reward signals in Reward-RefNet. As discussed in Section 2.3, we use fluency and answerability scores as reward signals. As shown in Table 6, when BLEU-4 (fluency) is used as the reward signal, there is an improvement in the BLEU-4 scores of Reward-RefNet as compared to the RefNet model. We validated these results through human evaluations on sampled outputs. Annotators prefer the Reward-RefNet model in % of the cases for fluency. Similarly, when we use the answerability score as the reward signal, answerability improves and annotators prefer Reward-RefNet in % of the cases for answerability. The performance of Reward-RefNet on fluency and answerability is similar for other datasets (see Appendix B).

Model BLEU Reward Signal Answerability Reward Signal
RefNet % %
Reward-RefNet % %
Table 6: Impact of Reward-RefNet on fluency and answerability. %preference denotes the percentage of times annotators prefer the generated output from the model: for fluency in the case of the BLEU reward signal, and for answerability in the case of the answerability reward signal.

Passage: Cost engineers and estimators apply expertise to relate the work and materials involved to a proper valuation.
Generated: Who apply expertise to relate the work and materials involved to a proper valuation ?
True: Who applies expertise to relate the work and materials involved to a proper valuation ?
Table 7: An example of a question with significant overlap with the passage. The answer is shown in blue.

Case Study: Originality of the Questions
We observe that current state-of-the-art models perform very well in terms of BLEU/QBLEU scores when the actual question has significant overlap with the passage. For example, consider the passage from the SQuAD dataset in Table 7: except for the question word who, the model sequentially copies everything from the passage and achieves a high QBLEU score. However, the model performs poorly in situations where the true question is novel and does not contain a large sequence of words from the passage itself. In order to quantify this, we first sort the true questions in ascending order of their BLEU-2 overlap with the passage. We then select the first true questions and compute the QBLEU score against the generated questions. The results are shown in red in Figure 3. Towards the left, where the true questions have low overlap with the passage, the performance is poor, but it gradually improves as the overlap increases.
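The overlap measure used for this sorting can be sketched as a simple bigram precision, a stand-in for BLEU-2 without brevity penalty or smoothing:

```python
def bigram_precision(question, passage):
    """Fraction of the question's bigrams that also occur in the passage:
    a rough proxy for BLEU-2 overlap (no brevity penalty, no smoothing)."""
    q, p = question.lower().split(), passage.lower().split()
    q_bigrams = list(zip(q, q[1:]))
    p_bigrams = set(zip(p, p[1:]))
    if not q_bigrams:
        return 0.0
    return sum(b in p_bigrams for b in q_bigrams) / len(q_bigrams)

# A question copied almost verbatim from the passage scores near 1;
# a freshly phrased one scores near 0.
copied = bigram_precision("who apply expertise to relate the work",
                          "cost engineers apply expertise to relate the work")
```

Sorting the true questions by this score and bucketing them reproduces the originality analysis: low-overlap buckets are exactly where copy-heavy models struggle.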

The task of generating questions with high originality (where the model phrases the question in its own words) is a challenging aspect of AQG, since it requires a complete understanding of the semantics and syntax of the language. In order to improve the originality of the generated questions, we explicitly reward our model for having a lower n-gram overlap with the passage than the initial draft. As a result, we observe that with Reward-RefNet (Originality), performance improves where the overlap with the passage is low (shown in blue in Figure 3). As shown in Table 8, although both questions are answerable given the passage, the question generated by Reward-RefNet (Originality) is better.

Figure 3: Originality Analysis: plot of Q-BLEU score vs. the number of points selected.
Passage: McLetchie was elected on the Lothian regional list and the Conservatives suffered a net loss of five seats , with leader Annabel Goldie claiming that their support had held firm, nevertheless, she too announced she would step down as leader of the party.
True: Who announced she would step down as leader of the Conservatives ?
RefNet: who claiming that their support had held firm ?
Reward-RefNet: who was the leader of the conservatives?
Table 8: An example where Reward-RefNet(Originality) is better than RefNet.

5 Related Work

Early works on Question Generation were essentially rule-based systems Heilman and Smith (2010); Mostow and Chen (2009); Lindberg et al. (2013); Labutov et al. (2015). Current models for AQG are based on the encode-attend-decode paradigm, and they either generate questions from the passage alone Du and Cardie (2017); Du et al. (2017); Yao et al. (2018) or from the passage and a given answer (in which case the generated question must result in the given answer). Over the past couple of years, several variants of the encode-attend-decode model have been proposed. For example, Zhou et al. (2018) proposed a sequential copying mechanism to explicitly select a sub-span from the passage. Similarly, Zhao et al. (2018) focus mainly on efficiently incorporating paragraph-level content by using Gated Self-Attention and Maxout pointer networks. Some works Yuan et al. (2017) even use Question Answering as a metric to evaluate the generated questions. There has also been work on generating questions from images Jain et al. (2017); Li et al. (2017) and from knowledge bases Serban et al. (2016); Reddy et al. (2017). The idea of multi-pass decoding, which is central to our work, has been used by Xia et al. (2017) for machine translation and text summarization, albeit with a different objective. Some works have also augmented seq2seq models Rennie et al. (2017); Paulus et al. (2018); Song et al. (2017) with external reward signals using the REINFORCE-with-baseline algorithm Williams (1992). The typical rewards used in these works are BLEU and ROUGE scores. Our REINFORCE loss differs from previous ones in that it uses the first decoder's reward as the baseline instead of the reward of the greedy policy.

6 Conclusion and Future Work

In this work, we proposed Refine Networks (RefNet) for Question Generation, which focus on refining and improving the initial version of the generated question. Our proposed RefNet model, consisting of a Preliminary Decoder and a Refinement Decoder with a Dual Attention Network, outperforms existing state-of-the-art models on the SQuAD, HOTPOT-QA, and DROP datasets. Along with automated evaluations, we also conducted human evaluations to validate our findings. We further showed that Reward-RefNet improves the initial draft on specific aspects like fluency, answerability, and originality. As future work, we would like to extend RefNet with the ability to decide whether the generated initial draft needs refinement at all.


We thank Amazon Web Services for providing free GPU compute and Google for supporting Preksha Nema’s contribution in this work through Google Ph.D. Fellowship programme. We would like to acknowledge Department of Computer Science and Engineering, IIT Madras and Robert Bosch Center for Data Sciences and Artificial Intelligence, IIT Madras (RBC-DSAI) for providing us sufficient resources. We would also like to thank Patanjali SLPSK, Sahana Ramnath, Rahul Ramesh, Anirban Laha, Nikita Moghe and the anonymous reviewers for their valuable and constructive suggestions.


  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473. Cited by: §2.2.
  • X. Du and C. Cardie (2017) Identifying where to focus in reading comprehension for neural question generation. In EMNLP, pp. 2067–2073. Cited by: §5.
  • X. Du, J. Shao, and C. Cardie (2017) Learning to ask: neural question generation for reading comprehension. In ACL (1), pp. 1342–1352. Cited by: §5.
  • D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner (2019) DROP: a reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proc. of NAACL, Cited by: §3.1.
  • M. Heilman and N. A. Smith (2010) Good question! statistical ranking for question generation. In HLT-NAACL, pp. 609–617. Cited by: §1, §5.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. External Links: ISSN 0899-7667 Cited by: §2.1.
  • U. Jain, Z. Zhang, and A. G. Schwing (2017) Creativity: generating diverse questions using variational autoencoders. In CVPR, pp. 5415–5424. Cited by: §5.
  • Y. Kim, H. Lee, J. Shin, and K. Jung (2019) Improving neural question generation using answer separation. CoRR abs/1809.02393. Cited by: Table 2.
  • I. Labutov, S. Basu, and L. Vanderwende (2015) Deep questions without deep understanding. In ACL (1), pp. 889–898. Cited by: §5.
  • A. Lavie and M. J. Denkowski (2009) The meteor metric for automatic evaluation of machine translation. Machine Translation 23 (2-3), pp. 105–115. External Links: ISSN 0922-6567 Cited by: §3.3.
  • Y. Li, N. Duan, B. Zhou, X. Chu, W. Ouyang, and X. Wang (2017) Visual question generation as dual task of visual question answering. CoRR abs/1709.07192. Cited by: §5.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Proc. ACL workshop on Text Summarization Branches Out, pp. 10. Cited by: §3.3.
  • D. Lindberg, F. Popowich, J. C. Nesbit, and P. H. Winne (2013) Generating natural language questions to support learning on-line. In ENLG, pp. 105–114. Cited by: §5.
  • J. Mostow and W. Chen (2009) Generating instruction automatically for the reading strategy of self-questioning. In AIED, Frontiers in Artificial Intelligence and Applications, Vol. 200, pp. 465–472. Cited by: §5.
  • P. Nema and M. M. Khapra (2018) Towards a better metric for evaluating question generation systems. CoRR abs/1808.10192. Cited by: §2.3, §3.3.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In ACL, pp. 311–318. Cited by: §3.2, §3.3.
  • R. Paulus, C. Xiong, and R. Socher (2018) A deep reinforced model for abstractive summarization. CoRR abs/1705.04304. Cited by: §5.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In

    Empirical Methods in Natural Language Processing (EMNLP)

    pp. 1532–1543. Cited by: §2.1.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100, 000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pp. 2383–2392. Cited by: §3.1.
  • S. Reddy, D. Raghu, M. M. Khapra, and S. Josh (2017)

    Generating natural language question-answer pairs from a knowledge graph using a RNN based question generation model

    In EACL (1), pp. 376–385. Cited by: §5.
  • S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017)

    Self-critical sequence training for image captioning


    2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    , pp. 1179–1195.
    Cited by: §5.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. In ACL, Cited by: §2.4.
  • M. J. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi (2016) Bidirectional attention flow for machine comprehension. CoRR abs/1611.01603. Cited by: §2.1, §2.1.
  • I. V. Serban, A. García-Durán, Ç. Gülçehre, S. Ahn, S. Chandar, A. C. Courville, and Y. Bengio (2016)

    Generating factoid questions with recurrent neural networks: the 30m factoid question-answer corpus

    In ACL (1), Cited by: §5.
  • S. Sharma, L. E. Asri, H. Schulz, and J. Zumer (2017)

    Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation

    CoRR abs/1706.09799. Cited by: §3.3.
  • H. Shum, X. He, and D. Li (2018) From eliza to xiaoice: challenges and opportunities with social chatbots. Frontiers of Information Technology & Electronic Engineering 19 (1), pp. 10–26. External Links: ISSN 2095-9230 Cited by: §1.
  • L. Song, Z. Wang, and W. Hamza (2017) A unified query-based generative model for question generation and question answering. CoRR abs/1709.01058. Cited by: §5.
  • X. Sun, J. Liu, Y. Lyu, W. He, Y. Ma, and S. Wang (2018) Answer-focused and position-aware neural question generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3930–3939. Cited by: Table 2.
  • R. J. Williams (1992)

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Mach. Learn. 8 (3-4), pp. 229–256. External Links: ISSN 0885-6125 Cited by: §2.3, §5.
  • Y. Xia, F. Tian, L. Wu, J. Lin, T. Qin, N. Yu, and T. Liu (2017) Deliberation networks: sequence generation beyond one-pass decoding. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 1784–1794. Cited by: §5.
  • Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) HotpotQA: a dataset for diverse, explainable multi-hop question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §3.1.
  • K. Yao, L. Zhang, T. Luo, L. Tao, and Y. Wu (2018) Teaching machines to ask questions. In IJCAI, Cited by: §5.
  • X. Yuan, T. Wang, Ç. Gülçehre, A. Sordoni, P. Bachman, S. Zhang, S. Subramanian, and A. Trischler (2017) Machine comprehension by text-to-text neural question generation. In Rep4NLP@ACL, pp. 15–25. Cited by: §5.
  • Y. Zhao, X. Ni, Y. Ding, and Q. Ke (2018) Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In EMNLP, Cited by: §2.1, Table 2, §3.1, §4, §5.
  • Q. Zhou, N. Yang, F. Wei, and M. Zhou (2018) Sequential copying networks. In AAAI, Cited by: §5.

Appendix A Impact of Various Embeddings

We perform an ablation study to identify the impact of the various word embeddings used in RefNet. When character embeddings are not used in RefNet, the BLEU-4 score on SQuAD sentence-level drops from to . Similarly, when positional embeddings are dropped, the BLEU-4 score decreases to .
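As a rough illustration of how these embeddings combine, the sketch below concatenates word, character-level, and positional features into a single token representation. This is a toy example with made-up dimensions and lookup tables (the averaging char encoder and sinusoidal positions are stand-ins), not the paper's actual implementation.

```python
import math
import random

random.seed(0)

WORD_DIM, CHAR_DIM, POS_DIM = 4, 3, 2  # toy sizes, far smaller than in practice

# Hypothetical toy lookup tables standing in for GloVe / learned char embeddings.
word_emb = {w: [random.uniform(-1, 1) for _ in range(WORD_DIM)]
            for w in ["what", "causes", "cancer"]}
char_emb = {c: [random.uniform(-1, 1) for _ in range(CHAR_DIM)]
            for c in "abcdefghijklmnopqrstuvwxyz"}

def char_features(word):
    """Average character embeddings (a stand-in for a learned char encoder)."""
    vecs = [char_emb[c] for c in word if c in char_emb]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def positional_features(pos):
    """Sinusoidal position features, one common choice of positional embedding."""
    return [math.sin(pos / 10000 ** (i / POS_DIM)) if i % 2 == 0
            else math.cos(pos / 10000 ** ((i - 1) / POS_DIM))
            for i in range(POS_DIM)]

def token_representation(word, pos):
    """Concatenate word, character, and positional features for one token."""
    return word_emb[word] + char_features(word) + positional_features(pos)

vec = token_representation("causes", pos=1)
assert len(vec) == WORD_DIM + CHAR_DIM + POS_DIM
```

Dropping either the character or the positional component simply shortens this concatenated vector, which is what the ablation above removes.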

Appendix B Reward-RefNet on Various Datasets

Table 9 compares RefNet and Reward-RefNet on BLEU-4 score and answerability score when the respective scores are used as rewards in Reward-RefNet. We can infer from Table 9 that there is an improvement in both fluency and answerability across all the datasets.

Datasets     Model            Reward Signal:   Reward Signal:
                              BLEU-4           Answerability (Passage Level)
SQuAD        RefNet           16.99            26.6
             Reward-RefNet    17.11            27.3
HOTPOT-QA    RefNet           21.17            28.7
             Reward-RefNet    21.32            29.2
DROP         RefNet           21.23            33.6
             Reward-RefNet    21.60            34.3
Table 9: Impact of Reward-RefNet on various datasets when fluency (BLEU-4) and answerability are used as reward signals.

Appendix C Visualization of Attention Weights

We plot the aggregated attention given to the passage and to the initial draft of the generated question across the decoder time-steps in Figure 4. Although both questions are specific to the answer, the refine decoder also pays some attention to the context surrounding the answer, which leads to a complete question. Also, note that while attending to the initial draft, the word "oncogenic" receives little attention, and thus the final draft revises the initial draft by correcting it to generate a better question.

Figure 4: Attention plots (panels a, b, c)
Initial Generated Question: “What is the name of the oncogenic virus?”
Refined Generated Question: “What is the name of the organism that causes cervical cancer?”
Passage: “The antigens expressed by tumors have several sources ; some are derived from oncogenic viruses like human papillomavirus , which causes cervical cancer , while others are the organism’s own proteins that occur at low levels in normal cells but reach high levels in tumor cells.”
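The aggregation behind Figure 4 can be sketched as summing the per-step attention distributions over all decoder time-steps and renormalizing, so each source token's share of total attention mass becomes visible. The weights below are made up for illustration; they mirror the observation that "oncogenic" receives little attention from the refine decoder.

```python
def aggregate_attention(attn, tokens):
    """Sum per-step attention over all decoder time-steps and renormalize,
    giving the total attention mass each source token received."""
    totals = [sum(step[j] for step in attn) for j in range(len(tokens))]
    z = sum(totals)
    return {tok: t / z for tok, t in zip(tokens, totals)}

# Toy example: 3 decoder steps attending over a 4-token initial draft.
draft = ["the", "oncogenic", "virus", "?"]
attn = [
    [0.40, 0.02, 0.50, 0.08],
    [0.30, 0.01, 0.60, 0.09],
    [0.50, 0.02, 0.40, 0.08],
]
agg = aggregate_attention(attn, draft)
assert agg["oncogenic"] < agg["virus"]  # "oncogenic" is largely ignored
```

Tokens with low aggregated mass are the ones the second decoder effectively discards when revising the initial draft.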