Towards Automatic Generation of Questions from Long Answers

Automatic question generation (AQG) has broad applicability in domains such as tutoring systems, conversational agents, healthcare literacy, and information retrieval. Existing efforts at AQG have been limited to short answer lengths of up to two or three sentences. However, several real-world applications require question generation from answers that span several sentences. Therefore, we propose a novel evaluation benchmark to assess the performance of existing AQG systems on long-text answers. We leverage the large-scale open-source Google Natural Questions dataset to create this long-answer AQG benchmark. We empirically demonstrate that the performance of existing AQG methods significantly degrades as the length of the answer increases. Transformer-based methods outperform other existing AQG methods on long answers in terms of both automatic and human evaluation. However, we still observe degradation in the performance of our best performing models with increasing answer length, suggesting that long-answer QG is a challenging benchmark task for future research.




1 Introduction

Automatic question generation (AQG) refers to the automatic generation of a natural text question for a given natural text answer. AQG has several application domains such as patient literacy (Lalor et al., 2018), healthcare (Raghavan et al., 2018; Pampari et al., 2018), education (Le and Pinkwart, 2015; Kuyten et al., 2012), automated testing (Brown et al., 2005), and information retrieval (Chali and Hasan, 2015). Additionally, large-scale access to affordable, quality education has fueled the recent surge of Massive Open Online Courses (MOOCs), which can benefit from accurate AQG systems. AQG has also been posed as a proxy task for machine comprehension (Yuan et al., 2017). Generated questions can also serve as additional weak labels for joint training of question-answering and AQG systems (Sachan and Xing, 2018), or can possibly be used in self-training schemes. Due to its wide applicability, there is growing interest in improving automated AQG systems (Nguyen et al., 2016; Dunn et al., 2017; Trischler et al., 2017; Kocisky et al., 2018) to produce semantically relevant and natural-sounding questions. However, current AQG systems focus on generating questions from short answers only (fewer than 2-3 sentences).

Generating questions for long text answers is a natural extension of automated question generation, applicable to instructional domains such as education (Holme, 2003; Shapley, 2000; Baral et al., 2007; Yudkowsky et al., 2019), including examination-based assessment (Ory, 1983; Livingston, 2009; Ganzfried and Yusuf, 2018), and patient literacy of health records. In particular, clinical documents often contain long dependency spans for relevant information (Jagannatha, 2016). This can result in long answers to patient-centric questions such as "Why is the patient having this adverse reaction?". In the context of using an AQG framework for weakly-supervised learning, long-answer questions have the potential to improve the modeling of large contexts in machine comprehension systems. It is therefore important to focus on developing high-performing long-answer AQG (LAQG) systems in order to better address the specifications of several practical applications.

LAQG has largely been ignored due to the lack of large-scale long-text answer datasets. The most popular AQG datasets (Rajpurkar et al., 2016; Nguyen et al., 2016) contain short-answer questions. Datasets such as Yang et al. (2015) have long-answer questions, but their small sample sizes prohibit the use of deep-learning models. Kwiatkowski et al. (2019) released the large-scale Natural Questions (NQ) corpus, which contains text passages from Wikipedia and user-generated question/answer pairs based on the entire page/passage. The Google NQ dataset was designed to serve as a benchmark machine comprehension task with long-text passages. We leverage the Google NQ dataset as a repository of long answers with associated questions to design experiments that uncover the challenges of the transition from the current AQG setting to an LAQG setting. Our work establishes the first benchmark study for LAQG using an exhaustive set of existing AQG models and aims to motivate further research in this domain.

The main challenge in LAQG stems from the requirement of modeling dependencies across a much larger context than in traditional AQG systems. We therefore perform a comprehensive evaluation of several NLG models under the LAQG setting to analyze their performance on long-answer inputs. The analyzed models are Recurrent-Neural-Network (RNN) or Transformer-based networks with various mechanisms aimed at improving NLG for questions. RNNs have been successfully used in several AQG systems (Du et al., 2017; Zhou et al., 2017; Gao et al., 2018; Subramanian et al., 2018; Wang et al., 2018; Zhao et al., 2018; Sachan and Xing, 2018). Similarly, transformer-based networks (Vaswani et al., 2017) have been widely used in several natural language processing tasks (Devlin et al., 2018; Libovickỳ et al., 2018) and have shown promising results (Radford et al.) towards capturing long-range dependencies in text. We also investigate the use of an answer summary to guide question generation, via a Multi-Source Transformer-based Attention mechanism (MSTA). The primary input to the MSTA is the long answer itself, with a neural-network-generated summary vector of the answer as the secondary input. This is in contrast with previous approaches (Zhao et al., 2018; Du et al., 2017) that use the answer-containing paragraph as additional contextual information for short-answer question generation.

We use common automated NLG evaluation methods such as BLEU and also carry out human evaluation studies to assess the performance of the analyzed and proposed methods.

Our main contributions are:

  • Long-answer QG task (LAQG): We introduce the task of question generation from long-text answers, referred to as LAQG. We redesign the recently introduced Google NQ dataset as a benchmark for this task.

  • Empirical study: We provide a thorough empirical analysis of the existing AQG systems under the LAQG setting, using both standard automated evaluation metrics and human evaluation.

  • LAQG performance of Transformer vs. LSTM: We use our study to show the different behaviours of Transformer-based and RNN-based networks as the length of answers increases in the context of LAQG.

Annotated Long Answer:
Succession to the British throne is determined by descent, gender for people born before October, legitimacy, and religion.
Under common law the Crown is inherited by a sovereign's children or by a childless sovereign's nearest collateral line.
The Bill of Rights and the Act of Settlement restrict succession to the throne to the legitimate Protestant descendants
of Sophia of Hanover that are in communion with the Church of England. Spouses of Roman Catholics were disqualified
from until the law was amended in. Protestant descendants of those excluded for being Roman Catholics are eligible.
Original Query: who will take the throne after the queen dies
Transformer Generated Question: who is next in line for the throne
Max-Pointer Generated Question: who is known for the third throne of peace
Table 2: Example of an annotated long answer and original query from the Google Natural Questions corpus, along with the question generated by our transformer model (Section 3.3) and by the previous LSTM approach using the maxout pointer (Section 3.2). Another example is shown in Table A1 in Appendix A.

2 Task Definition and Baseline Architectures

In this section, we describe the problem formulation and define the proposed approaches for the LAQG task.

2.1 Problem Definition

Formally, AQG refers to the NLG task of generating a natural language question, $q$, from a natural language answer, $a$. The training objective is

$\bar{\theta} = \arg\max_{\theta} \sum_{(q, a)} \log P(q \mid a; \theta),$

with the conditional probability $P(q \mid a; \theta)$ for a given question $q$ and answer $a$ parameterized by $\theta$. The predicted question $\hat{q}$ for a given $a$ is obtained through MAP inference:

$\hat{q} = \arg\max_{q} P(q \mid a; \bar{\theta}).$

2.2 Transformer Networks

Transformer models are based on the attention mechanism proposed in Bahdanau et al. (2015) and use multi-head attention (Vaswani et al., 2017) to model the long-range dependencies required for LAQG. The input to a transformer network is first fed through a self-attention layer followed by a feed-forward network. The self-attention layer uses "multi-headed" attention, where each head attends to different parts of the sequence to leverage (global) information from the complete input sequence. Since the attention mechanism does not model sequential information, positional embeddings are used to provide a notion of locality within an input sequence.
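As a concrete illustration, the multi-head self-attention computation and sinusoidal positional encodings described above can be sketched in plain NumPy. This is a minimal, framework-free sketch, not the paper's implementation; the projection matrices would be learned in practice:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Scaled dot-product self-attention with n_heads heads.

    X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model).
    """
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Split projections into heads: (n_heads, seq_len, d_head)
    split = lambda M: M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    # Each head attends over the full sequence (global context).
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    ctx = softmax(scores) @ Vh                      # (n_heads, seq_len, d_head)
    # Concatenate heads and apply the output projection.
    concat = ctx.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

def positional_encoding(seq_len, d_model):
    """Sinusoidal positions added to token embeddings to restore order."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
```

Because every position attends to every other position in a single layer, path length between distant tokens is constant, which is the property that makes this architecture attractive for long answers.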

2.3 Multi-Source Transformer

Previous studies (Libovickỳ et al., 2018) have shown the benefits of additional information for guiding machine translation systems. Additionally, long-text answers can admit multiple different questions, each targeting different aspects (Hu et al., 2018) or difficulty levels (Gao et al., 2018). Therefore, we use a multi-source transformer to provide extra contextual information that helps guide question generation under the LAQG setting. Specifically, we propose to input a summary sentence, extracted using a reinforcement-learning-based summarization approach (Narayan et al., 2018), as an additional signal in a two-source transformer setup (Table 6).
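One simple way to realize a two-source setup is to let each decoder step attend to both encoders and combine the resulting context vectors. The serial combination sketched below is only one of the strategies discussed by Libovickỳ et al. (2018) and is an assumption for illustration, not the paper's exact architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, memory):
    """Dot-product attention of one decoder query over encoder states.

    memory: (src_len, d) encoder states; returns a (d,) context vector.
    """
    weights = softmax(memory @ query)   # (src_len,) attention distribution
    return weights @ memory

def multi_source_decoder_step(dec_state, answer_states, summary_states):
    """Serial multi-source attention (hypothetical scheme): attend to the
    primary source (the long answer), then use the enriched state as the
    query for the secondary source (the summary), and sum the contexts."""
    ctx_answer = attend(dec_state, answer_states)
    ctx_summary = attend(dec_state + ctx_answer, summary_states)
    return dec_state + ctx_answer + ctx_summary
```

The primary input (the long answer) still carries most of the signal; the secondary encoder only nudges the decoder toward the aspect expressed by the summary sentence.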

3 Experiments

In this section, we first discuss the dataset used in our experiments in Sec. 3.1 followed by the descriptions of the contemporary AQG systems and their variants in Sec. 3.2 and Sec. 3.3 and some implementation details in Sec. 3.4.

3.1 Data

Google Natural Questions (NQ) (Kwiatkowski et al., 2019): Google NQ is a question-answer dataset that contains real-world user questions obtained from Google search, with the associated answers given as human-annotated text spans from Wikipedia. It is built from aggregated search-engine queries with annotations for long answers (typically a paragraph from the relevant Wikipedia page). (DuReader is a similar Chinese dataset (He et al., 2018); we focus on English.) From this dataset, we select all questions with the "long-answer" tag and filter out cases where the answer does not start with the HTML paragraph tag. This leaves 77,501 examples, which we split 90/10 for training/validation, plus 2,136 examples from the original dev set, which we use as testing data. With an average target-answer length greater than 4 sentences (Table 3), Google NQ serves as a dataset that allows us to propose and explore the LAQG setting. (We do not use WIKI QA (Rosso-Mateus et al., 2017) because it only contains 3,047 questions.)
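The filtering step described above can be sketched as follows. The field names (`document_text`, `annotations`, `long_answer`, `start_token`, `end_token`) follow the simplified NQ JSONL release but should be treated as assumptions here, not as the paper's exact preprocessing code:

```python
import json

def filter_long_answer_examples(jsonl_lines):
    """Keep (question, long-answer) pairs whose long answer is a
    paragraph, i.e. whose annotated span starts with an HTML <P> tag."""
    pairs = []
    for line in jsonl_lines:
        ex = json.loads(line)
        for ann in ex.get("annotations", []):
            la = ann.get("long_answer", {})
            if la.get("start_token", -1) < 0:
                continue                      # no long answer annotated
            tokens = ex["document_text"].split(" ")
            span = tokens[la["start_token"]:la["end_token"]]
            if span and span[0] == "<P>":     # keep paragraph answers only
                text = " ".join(t for t in span if not t.startswith("<"))
                pairs.append((ex["question_text"], text))
            break                             # use one annotation per example
    return pairs
```

Spans beginning with other tags (tables, lists) are the cases filtered out, since they are not natural-language paragraphs.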

Dataset #Training Examples Average #Sentences Average #Words
NQ 77,501 4.59 77.92
SQuAD 70,484 2.19 32.86
Table 3: Statistics of the "long-answer" tag filtered Google NQ and SQuAD v1.1 datasets.

3.2 Baselines

We provide a benchmarking study of multiple AQG approaches that have been shown to perform well under the short-answer setting.

  1. Du et al. (2017) proposed the first end-to-end trained neural approach for QG and showed that it significantly outperformed rule-based approaches on SQuAD v1.1. They used an RNN encoder-decoder Bahdanau et al. (2015); Cho et al. (2014) approach with global attention Luong et al. (2015).

  2. Copy Mechanism: Zhao et al. (2018) add the copy/pointer mechanism of Gu et al. (2016) to the encoder-decoder to outperform Du et al. (2017) on SQuAD v1.1. It treats every word as a copy target and uses the attention scores over the input sequence to calculate the final score for a word as the sum of all scores pointing to that word. Intuitively, this works because short-answer questions frequently have large overlaps with the corresponding answer, which may not hold true under the LAQG setting.

  3. Maxout Pointer: Zhao et al. (2018) also proposed a maxout-pointer approach that helps reduce the repetition that occurs in generated questions (and occurs more frequently in longer sequences). It works by picking the maximum attention score for a repeated word instead of summing all the scores pointing to it.
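The difference between the last two baselines reduces to how per-position attention is aggregated into per-word copy scores: the copy mechanism sums the attention mass pointing at each word, while the maxout pointer keeps only the maximum. A minimal sketch (the attention values and tokens below are illustrative, not from the models):

```python
import numpy as np

def copy_scores(attention, src_tokens, mode="sum"):
    """Aggregate per-position attention into per-word copy scores.

    attention: (src_len,) attention weights for one decoding step.
    src_tokens: the source word at each position.
    mode="sum" mimics the standard copy mechanism; mode="max" mimics the
    maxout pointer, which keeps only the strongest pointer to each word
    and thereby discourages repeated copying.
    """
    scores = {}
    for a, tok in zip(attention, src_tokens):
        if mode == "sum":
            scores[tok] = scores.get(tok, 0.0) + a
        else:  # maxout pointer
            scores[tok] = max(scores.get(tok, 0.0), a)
    return scores

# A word repeated in a long answer accumulates a large summed score,
# while the maxout pointer caps it at the single best position.
attn = np.array([0.25, 0.25, 0.3, 0.2])
toks = ["throne", "throne", "queen", "dies"]
print(copy_scores(attn, toks, "sum")["throne"])   # 0.5
print(copy_scores(attn, toks, "max")["throne"])   # 0.25
```

This illustrates why the maxout pointer is expected to matter more as answers grow longer: repeated content words become more common, and the summed score increasingly over-rewards copying them again.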

B1 B2 B3 B4 M R
Du et al. (2017) 29.37 19.32 14.55 9.97 14.31 28.25
Copy Mechanism (Gu et al., 2016) 33.81 22.73 16.41 12.07 16.62 33.99
Maxout Pointer (Zhao et al., 2018) 34.07 23.18 16.78 12.37 16.68 34.26
Transformer (Vaswani et al., 2017) 36.37 25.09 18.27 13.35 17.32 34.13
Transformer Copy (Vaswani et al., 2017) 34.74 23.54 16.97 12.38 17.01 24.64
Transformer_iwslt_de_en (Ott et al., 2018) 36.80 25.58 18.78 13.90 17.45 35.56
Transformer_wmt_en_fr_big (Ott et al., 2018) 34.48 23.52 16.98 12.41 16.25 33.89
Transformer_wmt_en_de_big (Ott et al., 2018) 34.79 23.43 16.79 12.16 16.30 33.92

Multi-Source Transformer (Summary)
36.04 24.67 18.00 13.30 16.80 34.58
Multi-Source Transformer (First Sentence) 36.01 24.71 18.00 13.27 16.97 35.02
Transformer Summary Baseline 33.79 22.70 16.34 11.89 15.70 32.58
Transformer First sentence Baseline 33.66 22.34 16.08 11.81 15.52 32.17
Table 4: Performance in terms of automatically computed metrics (B1–B4: BLEU-1 to BLEU-4, M: METEOR, R: ROUGE) for LAQG on Google NQ. The first three baselines use an RNN/LSTM-based encoder-decoder approach (Section 3.2). Transformer-based variants (rows 4–8) show significant improvements over the RNN/LSTM approaches, especially the Transformer_iwslt variant (Section 3.3). The Multi-Source Transformer models (rows 9–10) do not improve over the single-source transformer, which uses only the answer. The last two rows show the results for the Transformer model when fed with only the summary or the first sentence as input.

3.3 Proposed Transformer-based Variants for LAQG

This section builds upon the basic Transformer model, explained in Sections 2.2 and 2.3, and proposes different variants for the LAQG task.

Transformer models based on Vaswani et al. (2017) (Section 2.2):

  • Transformer Copy: Motivated by the benefits of the copy mechanism shown in Zhao et al. (2018), we add a copy mechanism to the transformer by using the transformer's final attention output as the attention scores that guide copying.

  • Transformer_iwslt_de_en: The best performing model on English-to-German (En-De) IWSLT'14 training data (Ott et al., 2018). This model has 6 encoder and decoder layers, 16 encoder and decoder attention heads, an embedding dimension of 1024, and a feed-forward network (FFN) embedding dimension of 4096.

  • Transformer_wmt_en_de_big: The best performing model on English-to-German (En-De) WMT'16 training data (Ott et al., 2018). It has 6 encoder and decoder layers, 4 encoder and decoder attention heads, an embedding dimension of 1024, and an FFN embedding dimension of 1024.

  • Transformer_wmt_en_fr_big: The best performing model on English-to-French (En-Fr) WMT'16 training data (Ott et al., 2018). It has 16 encoder and decoder layers, 16 encoder and decoder attention heads, an embedding dimension of 1024, and an FFN embedding dimension of 4096.

Multi-Source Transformers based on Libovickỳ et al. (2018) (Section 2.3):

  • Multi-Source Transformer (Summary): We extract a summary sentence from the answer text using a reinforcement-learning-based approach (Narayan et al., 2018). We generate questions using the summary sentence as the secondary input to the multi-source transformer model (along with the target answer) to assess the benefit of the summary sentence as an additional signal.

  • Multi-Source Transformer (First Sentence): We also experimented with using just the first sentence of the long target answer as the additional input instead of the answer summary.

3.4 Implementation Details of Transformer

We implemented our models on top of the OpenNMT-py (Klein et al., 2017) and Fairseq toolkits, using 5 encoder layers and 5 decoder layers, with the Adam optimizer (Kingma and Ba, 2014).

4 Results and Discussion

This section discusses the results of our models on Google NQ dataset under the proposed LAQG setting.

4.1 Comparison using Standard Automatic Metrics

Table 4 shows the LAQG results for all the models (Section 3.2) on NQ. We observe that the Transformer_iwslt_de_en (Ott et al., 2018) model with 6 encoder and decoder layers, 16 attention heads is the best performing model. The best performing RNN based model is the Maxout pointer model.

Among the MSTA models, the Multi-Source Transformer (Summary) performs better than the Multi-Source Transformer (First Sentence). We also observe that single-source Transformer models outperform all MSTA models. A possible reason is that long answer passages may have multiple themes or aspects; one long answer passage may therefore correspond to multiple different questions depending on the aspect under consideration. A discrepancy between the aspect of a secondary input to the MSTA and the theme of its corresponding gold question can reduce automated metric scores.

We believe that a guided QG approach using Multi-source Transformer has merits and can be useful for focusing question generation towards a particular aspect of the long answer.

Finally, we also design baseline experiments to evaluate the complexity introduced by the inclusion of long answers in the question generation task. To evaluate this, we use short summary sentences generated by the model from Narayan et al. (2018) as inputs to a single source transformer. This model is then trained to generate the corresponding question in NQ dataset. Our experiment aims to convert the long answer QG task to a short answer QG by compressing the answer to a short sentence summary. Table 4 shows that the performance of Transformer model trained on a summary sentence is significantly lower than the same model trained on long answers. This suggests that LAQG cannot simply be solved by compressing the long answer. Replacing the short summary sentence by a short first sentence results in similar trends.

#Sentences Fluency (Transformer) Fluency (Maxout LSTM) Correctness (Transformer) Correctness (Maxout LSTM)
1 3.71 3.58 2.55 2.59
2 3.63 3.61 2.46 2.49
3 3.70 4.09 2.46 2.59
4 4.03 3.87 2.69 2.41
5 4.68 2.72 3.74 1.76
6 4.46 2.54 3.22 1.64
Table 5: Human Evaluation Results: Transformer model does better than the previous LSTM based QG approach in terms of grammatical fluency as well as correctness of the QA pair. This gap in human ratings increases with increasing number of sentences in the target answer.

4.2 Human Evaluation

Due to the high cost of human evaluation, we restrict our human evaluation to the best performing RNN (maxout pointer in Section 3.2) and Transformer baseline (Transformer_iwslt in Section 3.3). We use two metrics for evaluation, namely "fluency" and "correctness". Fluency, defined in the guidelines by (Du et al., 2017; Stent et al., 2005), measures the grammatical coherence of the generated questions. Correctness judges whether the provided answer input is a correct answer to the generated question. We randomly selected 100 question-answer pairs and asked 6 English speakers to rate the generated questions on the above-mentioned metrics across all the pairs. We use at least two annotators per pair. Our annotations have an average inter-annotator agreement of 0.36 for fluency and 0.42 for correctness. We group the generated questions by the sentence length of their corresponding answers. Table 5 shows that the questions generated by the transformer model are rated higher in terms of fluency and correctness by humans. The gap between the two models widens with increasing answer length. Our results suggest that the Transformer architecture can more effectively model long-term dependencies over the text and is better suited for the LAQG task.
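The agreement statistic used above is not specified; as one plausible choice, Cohen's kappa between two annotators can be computed as follows (a sketch of a standard agreement measure, not necessarily the paper's exact procedure):

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa: observed agreement between two annotators'
    ratings, corrected for the agreement expected by chance."""
    assert len(r1) == len(r2) and len(r1) > 0
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)
```

Values around 0.3-0.4, as reported here, are conventionally read as fair-to-moderate agreement, which is typical for subjective 5-point rating tasks.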

#Words Transformer Maxout LSTM
0-50 14.10 12.59
50-100 13.96 12.34
100+ 13.93 12.37
Table 6: Our best-performing transformer model (Transformer_iwslt_de_en in Table 4) vs. LSTM with maxout pointer on LAQG when the answer length falls in different 50-word bins.

5 Other Analysis

This section discusses the results of increasing the answer length and analysis of Multi-Source and Single-Source Transformers.

5.1 Transformer vs LSTM with Increasing Answer Length

We also analyse the performance of our models on varying answer lengths using automated NLG evaluation metrics. We observe that performance degrades as answer length increases for both the RNN-based maxout pointer baseline and our Transformer model. However, our best Transformer model consistently outperforms the best RNN models across all answer lengths (Table 6). A more fine-grained version of Table 6 is included in Appendix A (Tables A2 and A3). These fine-grained results also support the finding that Transformer LAQG models are more robust to increased answer length.

Annotated Long Answer:
The Kingdom of England and the Kingdom of Scotland fought dozens of battles with each other. They fought typically
over land, particularly Berwick-upon-Tweed, and the Anglo-Scottish border frequently changed as a result. Prior to the
establishment of the two kingdoms in the 10th and 9th centuries their predecessors, the Northumbrians and the Picts or
Dal Riatans, also fought a number of battles. Major conflicts between the two parties include the Wars of Scottish
Independence (1296–1357) and the Rough Wooing (1544–1551) as well as numerous smaller campaigns and individual
confrontations. In 1603 England and Scotland were joined in a personal union when King James VI of Scotland
succeeded to the throne of England as King James I. War between the two states largely ceased, although the Wars of the
Three Kingdoms in the 17th century and the Jacobite Risings of the 18th century are sometimes characterised as Anglo-
Scottish conflicts despite really being British civil wars.
Original Query: Why did scotland go to war with england?
Generated Summary: The Kingdom of England and the Kingdom of Scotland fought dozens of battles with each other.
Table 7: Example of a summary sentence generated using the RL-based approach of Narayan et al. (2018). The generated summary sentence does not capture the information needed to answer the question, which could explain the lack of improvement in automatic metric scores for the Multi-Source Transformer (Summary) (Table 4).

5.2 Multi-Source vs Single-Source Transformers for LAQG

MSTA uses a sentence as a secondary input to guide question generation. It can therefore be expected to perform better than single-source transformer variants if the secondary input is relevant and correct. However, our MSTA models do not exhibit an increase in performance compared to single-source Transformer models (Table 4). A possible explanation is that the secondary input can focus on an aspect that is different from the one targeted by the gold question. Since our summary extraction model (Narayan et al., 2018) is trained on a different text corpus, it may frequently produce noisy summary sentences on our QG data, or may focus on an aspect of the passage that does not align well with the gold answer. Similarly, the first-sentence baseline uses the first sentence of the answer as the secondary input, which may not be relevant to the gold question. An example of the discrepancy between the gold question and the summary sentence is shown in Table 7.

6 Related Work

AQG is an important problem in natural language generation (NLG) and has received significant attention in recent years (Pan et al., 2019). Broadly, AQG approaches fall into two categories: (a) heuristic rule-based, where questions are created using pre-defined rules or from manually constructed templates followed by ranking (Heilman and Smith, 2010; Mazidi and Nielsen, 2014a; Labutov et al., 2015), and (b) learning-based. Rule-based approaches are difficult to scale to multiple domains and often perform poorly; they have therefore given way to the more versatile learning-based approaches. Several learning-based AQG approaches are modelled as neural sequence-to-sequence (seq2seq) models (Du et al., 2017; Song et al., 2018; Zhao et al., 2018; Yuan et al., 2017; Zhou et al., 2017) based on the encoder-decoder framework.

Yuan et al. (2017) use supervised and reinforcement learning with various rewards and policies to generate better quality questions. Song et al. (2018) show the benefit of modeling the shared information between the (short) answer and the paragraph containing the answer. Kim et al. (2019), on the other hand, report that over-relying on the context can degrade performance and propose to separate the answer from the context by masking it with a special token. These and other methods used reading comprehension datasets with short answers (one to three sentences long on average), either as spans within longer paragraphs, as in SQuAD (Rajpurkar et al., 2016) and NewsQA (Trischler et al., 2017), or as human-generated short answers, as in MS MARCO (Nguyen et al., 2016) and NarrativeQA (Kocisky et al., 2018). Other datasets include multiple-choice questions in RACE (Lai et al., 2017) and documents (but no answers) to generate questions from in LearningQ (Chen et al., 2018), both of which focus on the education domain.

To the best of our knowledge, we are the first to tackle the task of question generation for long answers using the Google Natural Questions corpus (Kwiatkowski et al., 2019) with answers spanning 4 or more sentences on average (Table 3). A non-exhaustive list of relevant works divided according to various types of AQG settings is given in Table 1.

A major limitation of focusing on AQG with short answers is that the insights and approaches that work well for short answers may not generalize to the LAQG setting. For example, several NLG models (Zhao et al., 2018; Sun et al., 2018; Dehghani et al., 2018) use the copy mechanism (Gu et al., 2016) with seq2seq architectures, which affords copying words from the input to the generated text. As discussed in Section 3.2, this mechanism shows improvements in short-answer AQG, but the use of the copy mechanism for LAQG does not help and can in fact hurt performance in terms of both automatic and human evaluation (Tables 4 and 5). On the other hand, the maxout-pointer approach for QG (Zhao et al., 2018) serves as a strong baseline for LAQG.

Transformer-based models have become integral to state-of-the-art systems for many NLP tasks such as translation, QA, and text classification (Vaswani et al., 2017; Radford, 2018; Devlin et al., 2019), because such models can better handle long-range dependencies using self-attention and can also go beyond fixed-length contexts (Dai et al., 2019). Since LAQG requires connecting the dots across a large range of text, in the form of a long answer, to generate an adequate question, we explore transformers for our task. We report superior performance of transformer-based models, in terms of both automatic and human evaluation, over previous seq2seq models for AQG. To provide an additional contextual signal alongside the long-text answer, we use multi-source transformers (Libovickỳ et al., 2018) with multiple inputs (we use two). We show that using a summary sentence extracted from the answer, or the first sentence of the answer, as additional input did not improve performance, and we analyze a potential reason in Section 4.

7 Conclusion

We take the first step towards question generation in the long-answer setting, where answers contain 4 or more sentences on average. We benchmark the newly released Natural Questions corpus for question generation with both transformer networks (used for the first time in the context of QG) and previously used LSTM networks. We show that transformer models outperform LSTM-based models in terms of automatically computed metrics such as BLEU as well as human evaluation. We also provide a detailed empirical study showing the effect of answer length on commonly used QG approaches. Our work can be directly applied in different domains, for example to improve the quality of question/long-answer pairs in educational testing.


  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In Proc. ICLR, Cited by: §2.2, item 1.
  • N. Baral, B. H. Paudel, B. Das, M. Aryal, B. P. Das, N. Jha, and M. Lamsal (2007) An evaluation of training of teachers in medical education in four medical schools of nepal. Nepal Med Coll J 9 (3), pp. 157–61. Cited by: §1.
  • J. C. Brown, G. A. Frishkoff, and M. Eskenazi (2005) Automatic question generation for vocabulary assessment. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 819–826. Cited by: §1.
  • Y. Chali and S. A. Hasan (2015) Towards topic-to-question generation. Computational Linguistics 41 (1), pp. 1–20. Cited by: §1.
  • G. Chen, J. Yang, C. Hauff, and G. Houben (2018) LearningQ: a large-scale dataset for educational question generation. In Twelfth International AAAI Conference on Web and Social Media, Cited by: §6.
  • K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734. Cited by: item 1.
  • Z. Dai, Z. Yang, Y. Yang, W. W. Cohen, J. Carbonell, Q. V. Le, and R. Salakhutdinov (2019) Transformer-xl: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860. Cited by: §6.
  • M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser (2018) Universal transformers. CoRR abs/1807.03819. Cited by: §6.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §6.
  • X. Du and C. Cardie (2017) Identifying where to focus in reading comprehension for neural question generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2067–2073. Cited by: Table 1.
  • X. Du, J. Shao, and C. Cardie (2017) Learning to ask: neural question generation for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1342–1352. Cited by: Table 1, §1, item 1, item 2, Table 4, §4.2, §6.
  • M. Dunn, L. Sagun, M. Higgins, V. U. Güney, V. Cirik, and K. Cho (2017) SearchQA: a new Q&A dataset augmented with context from a search engine. CoRR abs/1704.05179. Cited by: §1.
  • S. Ganzfried and F. Yusuf (2018) Optimal weighting for exam composition. Education Sciences 8 (1), pp. 36. Cited by: §1.
  • Y. Gao, J. Wang, L. Bing, I. King, and M. R. Lyu (2018) Difficulty controllable question generation for reading comprehension. arXiv preprint arXiv:1807.03586. Cited by: §1, §2.3.
  • J. Gu, Z. Lu, H. Li, and V. O. Li (2016) Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1631–1640. Cited by: item 2, Table 4, §6.
  • W. He, K. Liu, J. Liu, Y. Lyu, S. Zhao, X. Xiao, Y. Liu, Y. Wang, H. Wu, Q. She, et al. (2018) DuReader: a Chinese machine reading comprehension dataset from real-world applications. In Proceedings of the Workshop on Machine Reading for Question Answering, pp. 37–46. Cited by: footnote 1.
  • M. Heilman and N. A. Smith (2010) Good question! statistical ranking for question generation. In HLT-NAACL, Cited by: §6.
  • M. Heilman (2011) Automatic factual question generation from text. Language Technologies Institute School of Computer Science Carnegie Mellon University 195. Cited by: Table 1.
  • T. Holme (2003) Assessment and quality control in chemistry education. Journal of Chemical Education 80 (6), pp. 594. Cited by: §1.
  • W. Hu, B. Liu, J. Ma, D. Zhao, and R. Yan (2018) Aspect-based question generation. Cited by: §2.3.
  • A. Jagannatha and H. Yu (2016) Bidirectional RNN for medical event detection in electronic health records. In Proceedings of the Association for Computational Linguistics Conference, pp. 12–17. Cited by: §1.
  • Y. Kim, H. Lee, J. Shin, and K. Jung (2019) Improving neural question generation using answer separation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6602–6609. Cited by: §6.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980. External Links: Link Cited by: §3.4.
  • G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush (2017) OpenNMT: open-source toolkit for neural machine translation. In Proc. ACL, External Links: Link, Document Cited by: §3.4.
  • T. Kocisky, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette (2018) The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics 6, pp. 317–328. Cited by: §1, §6.
  • P. Kuyten, T. Bickmore, S. Stoyanchev, P. Piwek, H. Prendinger, and M. Ishizuka (2012) Fully automated generation of question-answer pairs for scripted virtual instruction. In International Conference on Intelligent Virtual Agents, pp. 1–14. Cited by: §1.
  • T. Kwiatkowski, J. Palomaki, O. Rhinehart, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, M. Kelcey, J. Devlin, et al. (2019) Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, pp. 453–466. Cited by: §1, §3.1, §6.
  • I. Labutov, S. Basu, and L. Vanderwende (2015) Deep questions without deep understanding. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1, pp. 889–898. Cited by: Table 1, §6.
  • G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy (2017) RACE: large-scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 785–794. Cited by: §6.
  • J. P. Lalor, H. Wu, L. Chen, K. M. Mazor, and H. Yu (2018) ComprehENotes, an instrument to assess patient reading comprehension of electronic health record notes: development and validation. Journal of Medical Internet Research 20 (4), pp. e139. Cited by: §1.
  • N. Le and N. Pinkwart (2015) Evaluation of a question generation approach using semantic web for supporting argumentation. Research and practice in technology enhanced learning 10 (1), pp. 3. Cited by: §1.
  • J. Libovickỳ, J. Helcl, and D. Mareček (2018) Input combination strategies for multi-source transformer decoder. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 253–260. Cited by: §1, §2.3, §3.3, §6.
  • S. A. Livingston (2009) Constructed-response test questions: why we use them; how we score them. R&D Connections, No. 11. Educational Testing Service. Cited by: §1.
  • T. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421. Cited by: item 1.
  • K. Mazidi and R. D. Nielsen (2014a) Linguistic considerations in automatic question generation. In ACL, Cited by: §6.
  • K. Mazidi and R. D. Nielsen (2014b) Linguistic considerations in automatic question generation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vol. 2, pp. 321–326. Cited by: Table 1.
  • S. Narayan, S. B. Cohen, and M. Lapata (2018) Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Vol. 1, pp. 1747–1759. Cited by: §2.3, 1st item, §4.1, §5.2, Table 7.
  • T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng (2016) MS MARCO: a human generated machine reading comprehension dataset. CoRR abs/1611.09268. Cited by: §1, §1, §6.
  • J. C. Ory (1983) Improving your test questions. Cited by: §1.
  • M. Ott, S. Edunov, D. Grangier, and M. Auli (2018) Scaling neural machine translation. ArXiv abs/1806.00187. Cited by: 2nd item, 3rd item, 4th item, Table 4, §4.1.
  • A. Pampari, P. Raghavan, J. Liang, and J. Peng (2018) EmrQA: a large corpus for question answering on electronic medical records. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2357–2368. Cited by: §1.
  • L. Pan, W. Lei, T. Chua, and M. Kan (2019) Recent advances in neural question generation. arXiv preprint arXiv:1905.08949. Cited by: §6.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Cited by: §1.
  • A. Radford (2018) Improving language understanding by generative pre-training. Cited by: §6.
  • P. Raghavan, S. Patwardhan, J. J. Liang, and M. V. Devarakonda (2018) Annotating electronic medical records for question answering. arXiv preprint arXiv:1805.06816. Cited by: §1.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392. Cited by: §1, §6.
  • A. Rosso-Mateus, F. A. González, and M. Montes-y-Gómez (2017) A two-step neural network approach to passage retrieval for open domain question answering. In Iberoamerican Congress on Pattern Recognition, pp. 566–574. Cited by: footnote 2.
  • V. Rus, B. Wyse, P. Piwek, M. Lintean, S. Stoyanchev, and C. Moldovan (2010) The first question generation shared task evaluation challenge. In Proceedings of the 6th International Natural Language Generation Conference, Cited by: Table 1.
  • M. Sachan and E. Xing (2018) Self-training for jointly learning to ask and answer questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Vol. 1, pp. 629–640. Cited by: Table 1, §1, §1.
  • P. Shapley (2000) On-line education to develop complex reasoning skills in organic chemistry. Journal of Asynchronous Learning Networks 4 (2), pp. 43–52. Cited by: §1.
  • L. Song, Z. Wang, W. Hamza, Y. Zhang, and D. Gildea (2018) Leveraging context information for natural question generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Vol. 2, pp. 569–574. Cited by: Table 1, §6, §6.
  • A. Stent, M. Marge, and M. Singhai (2005) Evaluating evaluation methods for generation in the presence of variation. In International Conference on Intelligent Text Processing and Computational Linguistics, pp. 341–351. Cited by: §4.2.
  • S. Subramanian, T. Wang, X. Yuan, S. Zhang, A. Trischler, and Y. Bengio (2018) Neural models for key phrase extraction and question generation. In Proceedings of the Workshop on Machine Reading for Question Answering, pp. 78–88. Cited by: §1.
  • X. Sun, J. Liu, Y. Lyu, W. He, Y. Ma, and S. Wang (2018) Answer-focused and position-aware neural question generation. In EMNLP, Cited by: §6.
  • A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman (2017) NewsQA: a machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pp. 191–200. Cited by: §1, §6.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §1, §2.2, §3.3, Table 4, §6.
  • Z. Wang, A. S. Lan, W. Nie, A. E. Waters, P. J. Grimaldi, and R. G. Baraniuk (2018) QG-net: a data-driven question generation model for educational content. In Proceedings of the Fifth Annual ACM Conference on Learning at Scale, pp. 7. Cited by: §1.
  • Y. Yang, W. Yih, and C. Meek (2015) WikiQA: a challenge dataset for open-domain question answering. In EMNLP, Cited by: §1.
  • X. Yuan, T. Wang, C. Gulcehre, A. Sordoni, P. Bachman, S. Zhang, S. Subramanian, and A. Trischler (2017) Machine comprehension by text-to-text neural question generation. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pp. 15–25. Cited by: Table 1, §1, §6, §6.
  • R. Yudkowsky, Y. S. Park, and S. M. Downing (2019) Assessment in health professions education. Routledge. Cited by: §1.
  • Y. Zhao, X. Ni, Y. Ding, and Q. Ke (2018) Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3901–3910. Cited by: Table 1, §1, item 2, item 3, 1st item, Table 4, §6, §6.
  • Q. Zhou, N. Yang, F. Wei, C. Tan, H. Bao, and M. Zhou (2017) Neural question generation from text: a preliminary study. In National CCF Conference on Natural Language Processing and Chinese Computing, pp. 662–671. Cited by: Table 1, §1, §6.