Question Generation (QG), the inverse problem of Question Answering (QA), aims to generate a semantically relevant question given a context and a corresponding answer. It has great potential in educational scenarios, dialogue systems, and question answering. A number of sequence-to-sequence (seq-to-seq) models with the attention mechanism have been proposed for neural question generation [28, 6].
Enriched linguistic features such as Part-Of-Speech (POS) tags, relative position information, and paragraph context are incorporated in the embedding layers [28, 10, 27]. The copy mechanism is exploited to enhance the output quality of decoders [27, 22].
|Sentence: starting in 1965, donald davies at the national physical laboratory, uk, independently developed|
|the same message routing methodology as developed by baran.|
|Reference: what did donald davies develop?|
|NQG++: what is the national physical laboratory?|
|Pointer-generator: what did baran develop?|
|Our model: what did donald devies develop at the national physical laboratory?|
|Sentence: in 1979 , the soviet union deployed its 40th army into afghanistan , attempting to suppress|
|an islamic rebellion against an allied marxist regime in the afghan civil war.|
|Reference: who deployed its army into afghanistan in 1979?|
|NQG++: in what year did the soviet union invade afghanistan?|
|Pointer-generator: what deployed their army into afghanistan?|
|Our model: who deployed their army into afghanistan?|
|Sentence: as of 2012 , quality private schools in the united states charged substantial tuition , close to $ 40,000|
|annually for day schools in new york city , and nearly $ 50,000 for boarding schools.|
|Reference: what would a parent have to pay to send their child to a boarding school in 2012?|
|NQG++: how much money did quality private schools in the us have in 2012?|
|Pointer-generator: how much money is charged substantial tuition for boarding school?|
|Our model: how much money for boarding schools in new york city in 2012?|
|Sentence: during his second year of study at graz , tesla developed a passion for and became very proficient|
|at billiards , chess and card-playing , sometimes spending more than 48 hours in a stretch at a gaming table.|
|Reference: how long would tesla spend gambling sometimes?|
|NQG++: how long did the billiards of tesla get in a stretch?|
|Pointer-generator: how long did tesla become very proficient in a stretch at a gaming table ?|
|Our model: how many hours did tesla spend in a stretch at a gaming table?|
Built on techniques originally developed for text summarization, modern question generation models face two main issues: (1) Wrong keywords and question words: the model may ask questions with wrong keywords and wrong question words, as shown in the examples in Table 1. (2) Poor copy mechanism: the model copies context words that are semantically irrelevant to the answer, as illustrated in the examples in Table 2.
The conventional seq-to-seq model is trained by maximizing the generation probability of each reference question word, given the previously generated words and conditioned on the encoded context. However, the decoder may focus on local word semantics while ignoring the global question semantics during generation, resulting in the above-mentioned issues. Meanwhile, the answer position-aware features are not exploited well by the copy mechanism, so answer-irrelevant context words are copied from the input.
To alleviate these issues, we claim that learning the sentence-level semantics and answer position-awareness in a Multi-Task Learning (MTL) fashion results in better performance, as shown in Tables 1 and 2. To do so, we first propose a sentence-level semantic matching module that learns global semantics from both the encoder and the decoder simultaneously. Then, an answer position inferring module is introduced to force the model, through the copy mechanism, to emphasize the relevant context words with answer position-awareness. Furthermore, we propose an answer-aware gated fusion mechanism that provides an improved answer-aware sentence vector for the decoder.
We further conduct extensive experiments on the SQuAD and MS MARCO datasets to show the superiority of our proposed model. The experimental results show that our model not only outperforms the SOTA models on the main metrics, auxiliary metrics, and human judgments, but also improves different models due to its generality. Our contributions are three-fold:
We analyze the questions generated by strong baselines and find two issues: wrong keywords and question words, and copying of answer-irrelevant context words. We identify the lack of whole-question semantics and the poor exploitation of answer position-awareness as the key root causes.
To address the issues, we propose a neural question generation model with sentence-level semantic matching, answer position inferring, and gated fusion.
We conduct extensive experiments to demonstrate the superiority of our proposed model for improving question generation performance in terms of the main metrics, auxiliary machine comprehension metrics, and human judgments. Besides, our work can improve current models significantly due to its generality.
In this section, we describe the details of our proposed model, starting with an overview of the question generation problem. Then, we illustrate our backbone seq-to-seq model with gated fusion, which provides an improved answer-aware sentence vector for generation. Finally, we illustrate sentence-level semantic matching and answer position inferring, which alleviate the issues discussed in the previous section.
In a question generation problem, a sentence containing an answer, which is a contiguous span of the sentence, is given, and the goal is to generate a question that semantically matches both the sentence and the answer.
Seq-to-seq model with Answer-aware Gated Fusion
Following the baseline model, we use an attention-based seq-to-seq model with the same enriched semantic and lexical features (i.e., NER features, POS tags, case, and answer position features) as input in the embedding layer.
With a bi-directional LSTM as the encoder, the sentence representation, a sequence of D-dimensional hidden states, is produced by concatenating the forward hidden state and the backward hidden state at each position of the input sentence.
Answer-aware Gated Fusion:
Instead of passing the last hidden state of the encoder to the decoder as the initial hidden state, we propose gated fusion to provide an improved answer-aware sentence vector for the decoder.
Similar to the gates in an LSTM, we use two information flow gates, each computed by a sigmoid function, to control the information flow of the sentence vector and the answer vector.
The weights and biases of the two gates are trainable. We take the hidden state at the answer starting position as the answer vector since, with the bi-directional LSTM, it encodes the whole answer semantics.
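The gated fusion described above can be illustrated with a minimal sketch. The exact gate equations are not reproduced here, so the form below is an assumption: each gate is a sigmoid over the concatenation of the sentence vector and the answer vector, and the fused vector is the gated sum. The names `gated_fusion`, `W_s`, `b_s`, `W_a`, and `b_a` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gated_fusion(h_sent, h_ans, W_s, b_s, W_a, b_a):
    """Fuse a sentence vector and an answer vector with two sigmoid gates.

    Assumed form (not taken from the paper's equations):
        g_s = sigmoid(W_s [h_sent; h_ans] + b_s)
        g_a = sigmoid(W_a [h_sent; h_ans] + b_a)
        z   = g_s * h_sent + g_a * h_ans
    """
    joint = h_sent + h_ans  # list concatenation [h_sent; h_ans]
    g_s = [sigmoid(z + b) for z, b in zip(matvec(W_s, joint), b_s)]
    g_a = [sigmoid(z + b) for z, b in zip(matvec(W_a, joint), b_a)]
    return [gs * s + ga * a for gs, s, ga, a in zip(g_s, h_sent, g_a, h_ans)]


# With all-zero weights both gates equal 0.5, so the result is the average.
zeros = [[0.0] * 4 for _ in range(2)]
fused = gated_fusion([1.0, 2.0], [3.0, 4.0], zeros, [0.0, 0.0], zeros, [0.0, 0.0])
```

The fused vector would then serve as the initial hidden state of the decoder.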
Taking the encoder hidden states as the context and the improved answer-aware sentence vector as the initial hidden state, a one-layer uni-directional LSTM updates its current hidden state with the previously decoded word as the input.
Meanwhile, the attention mechanism is exploited by attending the current decoder state to the encoder context. The context vector is computed as the sum of the encoder hidden states weighted by the normalized attention vector.
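As a rough illustration of the weighted-sum context vector, the sketch below scores each encoder state against the decoder state, normalizes the scores with a softmax, and sums the encoder states with those weights. Dot-product scoring is an assumption made for brevity; the scoring function actually used by the model may differ. The name `attend` is hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (attention weights, context vector).

    Assumption: dot-product scoring between the decoder state and
    each encoder hidden state.
    """
    scores = [sum(d * h for d, h in zip(decoder_state, h_i))
              for h_i in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h_i[k] for w, h_i in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context


weights, context = attend([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
```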
The question word is generated from the vocabulary with a softmax function.
The scoring function inside the softmax is realized by a two-layer feed-forward network.
Copy Mechanism / Pointer-generator:
The copy mechanism and the pointer-generator network are introduced to enable the model to either generate words from the vocabulary or copy words from the input sentence, in the latter case taking the word with the highest attention weight computed in Equation 9.
Generally, when generating a question word, a copy switch is computed to decide whether the word is generated from the vocabulary or copied from the source sentence, given the current decoder hidden state and the context vector.
The weights and biases of the copy switch are learnable.
The final word distribution is obtained by combining the probability of the generate mode and the probability of the copy mode.
The remaining weights and biases are parameters of the neural network.
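One common way to combine the two modes is the pointer-generator mixture, sketched below under stated assumptions: `p_gen` is taken to be the probability of the generate mode produced by the copy switch, and attention weights on repeated source tokens are summed. This is an illustrative form, not necessarily the exact combination used by every model discussed here; the name `final_distribution` is hypothetical.

```python
def final_distribution(p_vocab, attention, source_tokens, p_gen):
    """Mix the generate-mode and copy-mode distributions.

    p_vocab:       dict mapping vocabulary words to generation probabilities
    attention:     attention weights over the source positions (sum to 1)
    source_tokens: tokens of the input sentence, aligned with `attention`
    p_gen:         assumed probability of the generate mode (copy switch)
    """
    final = {word: p_gen * p for word, p in p_vocab.items()}
    for token, weight in zip(source_tokens, attention):
        # Copy mode can add mass to out-of-vocabulary source tokens.
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


dist = final_distribution(
    p_vocab={"what": 0.6, "did": 0.4},
    attention=[0.5, 0.5],
    source_tokens=["davies", "did"],
    p_gen=0.8,
)
```

In this toy case the out-of-vocabulary token "davies" receives probability mass only through the copy mode.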
We use the negative log likelihood as the seq-to-seq loss.
It is minimized over the parameters of the seq-to-seq model, and N denotes the number of examples in the training dataset.
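A minimal sketch of this loss follows, assuming that the per-word log probabilities are summed within each question and averaged over the N training examples. The averaging convention and the name `seq2seq_nll` are assumptions for illustration.

```python
import math


def seq2seq_nll(batch_word_probs):
    """Negative log likelihood of the reference questions.

    batch_word_probs: one list per training example, holding the
    probability the model assigns to each reference question word.
    """
    n = len(batch_word_probs)
    total = 0.0
    for word_probs in batch_word_probs:
        total -= sum(math.log(p) for p in word_probs)
    return total / n


loss = seq2seq_nll([[0.9, 0.8, 0.7], [0.6, 0.5]])
```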
Sentence-level Semantic Matching
Existing models, especially the decoders, generate each question word from the partial question generated so far without considering the global semantics of the whole question, and are therefore prone to wrong question words or keywords. Meanwhile, we found that in the SQuAD and MARCO datasets the same sentence can have different reference questions that target different answers. However, the baseline model is prone to generating generic questions in this case. To overcome this problem, we propose the sentence-level semantic matching module to learn the sentence-level semantics from both the encoder and decoder sides in a Multi-Task Learning fashion.
We already have the improved answer-aware sentence vector obtained by our gated fusion. Treating the decoder, a uni-directional LSTM, as an encoder for the question, we take its last hidden state as the question vector.
Then, as illustrated in Figure 2, we train two classifiers to distinguish the non-matching pairs (a mismatched sentence with the question, and the sentence with a mismatched question) from the semantically matching pair (the sentence with its own question), respectively, where the mismatched sentence and question vectors are randomly sampled from the same passage.
The input of each classifier is the concatenation of a sentence vector and a question vector.
We take the sum of the binary cross entropy losses of the two classifiers as the sentence-level semantic matching loss.
It is computed over all pairs from the prediction probabilities of the two classifiers and the labels indicating whether each pair is matched, and it is minimized over the parameters of the two classifiers.
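The matching loss can be sketched as follows, assuming each classifier outputs a probability that a (sentence, question) pair is matched and that each classifier's binary cross entropy is averaged over its pairs before the two are summed. The function names and the averaging convention are assumptions.

```python
import math


def binary_cross_entropy(prob, label, eps=1e-12):
    prob = min(max(prob, eps), 1.0 - eps)  # guard against log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def semantic_matching_loss(sentence_side, question_side):
    """Sum of the BCE losses of the two matching classifiers.

    Each argument is a list of (predicted_probability, label) tuples,
    where label is 1 for a matched pair and 0 for a mismatched pair.
    """
    loss_s = sum(binary_cross_entropy(p, y) for p, y in sentence_side)
    loss_q = sum(binary_cross_entropy(p, y) for p, y in question_side)
    return loss_s / len(sentence_side) + loss_q / len(question_side)


# One matched and one mismatched pair per classifier.
loss = semantic_matching_loss([(0.9, 1), (0.2, 0)], [(0.8, 1), (0.3, 0)])
```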
Answer Position Inferring
Another issue of the baseline model is that it copies answer-irrelevant words from the input sentence. One potential reason is that the model does not learn the answer position features well, so the attention matrix does not emphasize the context words relevant to the answer. To address the issue, we leverage an answer position inferring module to endow the model with answer position-awareness, again in a Multi-Task Learning fashion.
We borrow the bi-directional attention flow network and the output layer from the BiDAF model to infer the answer position, as shown in Figure 3, taking the sentence representation from the encoder and the question representation from the decoder as inputs.
Formally, we use Sentence-to-Question (S2Q) attention and Question-to-Sentence (Q2S) attention to emphasize the mutual semantic relevance between each sentence word and each question word, and we obtain the question-aware sentence representation and the sentence-aware question representation with an attention mechanism similar to Equation 9.
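|Model||SQuAD BLEU-1||SQuAD BLEU-2||SQuAD BLEU-3||SQuAD BLEU-4||SQuAD METEOR||SQuAD ROUGE-L||MS MARCO BLEU-1||MS MARCO BLEU-2||MS MARCO BLEU-3||MS MARCO BLEU-4||MS MARCO METEOR||MS MARCO ROUGE-L|
|Gated Self-attention||44.51||29.07||21.06||15.82||19.67||44.24||-||-||-||-||-||-|
|Model with Sentence-level Semantic Matching||43.67||28.53||20.59||15.66||19.23||43.86||48.97||35.84||26.31||19.79||23.83||48.93|
|Model with Answer Position Inferring||43.88||28.55||28.87||15.77||19.55||43.98||49.73||36.77||26.46||20.14||24.22||49.33|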
Then, two two-layer bidirectional LSTMs are used to capture the interactions among the sentence words conditioned on the question. The answer starting index and ending index are predicted by the output layer with a softmax function.
The projection weights of the output layer are trainable, and the scoring function is a trainable multi-layer perceptron (MLP) network.
We compute the loss as the negative log likelihood of the ground-truth answer starting index and ending index under the predicted distributions.
It is minimized over the parameters of the answer position inferring module.
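For a single example, this loss can be sketched as below, assuming the output layer yields one probability distribution over sentence positions for the start index and another for the end index. The name `answer_position_loss` is hypothetical.

```python
import math


def answer_position_loss(p_start, p_end, start_idx, end_idx):
    """Negative log likelihood of the ground-truth answer span.

    p_start, p_end: predicted distributions over sentence positions
    start_idx, end_idx: ground-truth start and end positions
    """
    return -(math.log(p_start[start_idx]) + math.log(p_end[end_idx]))


# A five-word sentence whose answer spans positions 1..2.
loss = answer_position_loss(
    p_start=[0.1, 0.6, 0.1, 0.1, 0.1],
    p_end=[0.1, 0.1, 0.6, 0.1, 0.1],
    start_idx=1,
    end_idx=2,
)
```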
To jointly train the generation model with the proposed modules in a Multi-Task Learning approach, we minimize the total loss during training.
In the total loss, two weighting coefficients control the magnitude of the sentence-level semantic matching loss and the answer position inferring loss. By minimizing this loss function, our model is expected to discover the sentence-level and answer position-aware semantics of the question and sentence.
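The joint objective can be sketched as a weighted sum of the three losses. The additive form and the names `alpha` and `beta` are assumptions; the default values follow the weights of 1 and 2 reported in the implementation details, and assigning them to the matching loss and the position loss in that order is also an assumption.

```python
def total_loss(loss_s2s, loss_sm, loss_ap, alpha=1.0, beta=2.0):
    """Weighted multi-task objective.

    loss_s2s: seq-to-seq negative log likelihood
    loss_sm:  sentence-level semantic matching loss, weighted by alpha
    loss_ap:  answer position inferring loss, weighted by beta
    """
    return loss_s2s + alpha * loss_sm + beta * loss_ap


joint = total_loss(loss_s2s=1.0, loss_sm=0.5, loss_ap=0.25)
```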
Experiments and Results
In this section, we conduct extensive experiments on the SQuAD and MS MARCO datasets, demonstrating the superiority of our proposed model compared with existing approaches.
The SQuAD V1.1 dataset contains 536 Wikipedia articles and more than 100K questions posed about the articles. The answer to each question is also given as a sub-span of the sentence. Following the baseline, we use the training dataset (86635) to train our model, and we split the dev dataset evenly (50%-50%) into a dev set (8965) and a test set (8964) for evaluation.
MS MARCO contains more than one million queries along with answers that are either generated by humans or selected from passages. We select a subset of MS MARCO in which the answers are sub-spans of the passages. We split it into a train set (86039), a dev set (9480), and a test set (7921) for model training and evaluation.
In the experiments, we have several baselines for comparisons:
NQG++: It is a baseline for the Neural Question Generation task. It uses enriched semantic and lexical features in the encoder embedding layer of the seq-to-seq model. The attention mechanism and the copy mechanism are also used.
Feature-enriched Pointer-generator: It is a seq-to-seq model with the attention mechanism and the copy mechanism. The copy mechanism is realized differently from NQG++. We add the enriched features used in NQG++ to the embedding layer.
Answer-focused: It is a SOTA model on QG that uses an additional vocabulary for question word generation, with relative answer position information instead of the BIO tags used in NQG++.
Gated Self-attention: It is also a SOTA model on QG that takes the paragraph as input and applies gated self-attention on top of the RNN in the encoder. Meanwhile, an improved maxout pointer is introduced.
Results and Analysis
We report the main metrics of different models on the SQuAD and MS MARCO datasets in Table 3.
The Answer-focused model improves the performance by using a separate vocabulary for question word generation along with the relative answer position. The Gated Self-Attention model emphasizes the intra-attention within the sentence and uses an improved maxout pointer.
Different from the models above, our work aims to improve the model by learning the sentence-level semantic-matching features on both the encoder and decoder sides. The result shows that our model outperforms the two SOTA models on the main metrics.
Although the main metrics can reflect the similarity between the generated question and the references, they are limited in reflecting the semantics of the generated question.
Alternatively, since machine comprehension takes the article and the corresponding question as input to find the answer in the passage, we adopt machine comprehension metrics to evaluate the quality of the questions generated by different models.
We show the performance of BiDAF, pre-trained by AllenNLP, in terms of Exact Match (EM) and F1 on reference questions, questions generated by the baselines, and questions generated by our model in Table 4.
Our model outperforms NQG++ and Pointer-generator on EM and F1 significantly, since our model generates more answer-relevant questions by discovering sentence-level semantics and answer position features.
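For readers unfamiliar with the two auxiliary metrics, the sketch below computes Exact Match and token-level F1 between a predicted answer and a gold answer. It is a simplified version that only lowercases and splits on whitespace; standard SQuAD evaluation additionally strips punctuation and articles, so exact numbers would differ.

```python
from collections import Counter


def exact_match(prediction, gold):
    # 1.0 if the answers are identical after simple normalization.
    return float(prediction.strip().lower() == gold.strip().lower())


def token_f1(prediction, gold):
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("Privy Council", "privy council")
f1 = token_f1("the privy council", "privy council")
```

A question that is more relevant to the answer should let the comprehension model recover the answer span more often, raising both scores.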
Sentence-level Semantic Matching Analysis
To analyze the ability of our model to generate the right question words and keywords, we randomly sample 200 questions generated by each of NQG++, Pointer-generator, and our model. A generated question is considered to have the right question words if they match the question words of the reference question in meaning. For example, given a generated question "what place …" and a reference question "where …", we consider that the model generates a question with the right question words. In addition, we choose the words with the most semantic importance, which indicate the sentence topic and content, as the keywords. We report the number of questions with the right question words and keywords for the different models in Table 5.
|Models||# right question words||# right keywords|
|Model with Sentence-level Semantic Matching||150||156|
The main reason that our model outperforms the existing models is that learning the sentence-level semantics helps to capture the key semantics, which results in better generation of semantically matching keywords.
Answer Position Inferring Analysis
We also conduct a similar experiment to evaluate the copy mechanisms of different models in terms of the precision and recall used in prior work. Given one generated question G and its reference question R, we define precision and recall on the pair (G, R).
|Models||Precision||Recall|
|Model with Answer Position Inferring||48.35%||40.27%|
As reported in Table 6, the improvement of Precision and Recall indicates that answer position inferring can help copy OOV words from the input sentence.
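Since the analysis concerns copying OOV words, one plausible reading of these metrics is sketched below. It is an assumption, not the paper's stated formula: precision is the share of OOV words in the generated question G that also appear as OOV words in the reference R, and recall is the share of OOV words in R that are recovered in G. Counting distinct words rather than occurrences is a further assumption.

```python
def copy_precision_recall(generated, reference, vocab):
    """Precision and recall of copied out-of-vocabulary (OOV) words.

    generated, reference: token lists for G and R
    vocab: set of in-vocabulary words; anything else counts as OOV
    """
    gen_oov = {w for w in generated if w not in vocab}
    ref_oov = {w for w in reference if w not in vocab}
    shared = gen_oov & ref_oov
    precision = len(shared) / len(gen_oov) if gen_oov else 0.0
    recall = len(shared) / len(ref_oov) if ref_oov else 0.0
    return precision, recall


vocab = {"what", "did", "develop", "?"}
precision, recall = copy_precision_recall(
    "what did donald davies develop ?".split(),
    "what did davies develop ?".split(),
    vocab,
)
```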
To show the effectiveness and generality of our work, we evaluate its validity by applying it to current representative models without revising them. As shown in Table 7, our work can improve existing models by more than 2% on QG tasks.
We also conduct a human evaluation to examine the quality of the questions generated by the models and of the reference questions, scoring them on a scale of 1 to 5 in terms of semantic matching, fluency, and syntactic correctness. As reported in Table 8, our model generates questions with higher scores on the three metrics than the two baseline models, indicating the benefit of utilizing the sentence-level semantics and answer position-awareness.
|Models||Semantic Matching||Fluency||Syntactic Correctness|
|Sentence: another example was the insignificance of the ministry of war compared with native chinese|
|dynasties , as the real military authority in yuan times resided in the privy council.|
|Reference: who had no real military power during the yuan?|
|NQG++: the insignificance of what war was compared to native chinese dynasties?|
|Pointer-generator: what was the insignificance?|
|Our model: what was insignificance compared with native Chinese dynasties?|
|Sentence: another example was the insignificance of the ministry of war compared with native chinese|
|dynasties , as the real military authority in yuan times resided in the privy council.|
|Reference: who had military control during the yuan?|
|NQG++: what did the chinese dynasties call the insignificance of the ministry of war?|
|Pointer-generator: in where the insignificance as the real military authority in yuan times?|
|Our model: the real military authority in yuan times resided where?|
In this section, we present some examples of questions generated by our model.
Furthermore, we present in Table 9 a pair of examples that share the same input sentence. Whereas NQG++ generates similar questions that do not match the references semantically, our model asks different and better-matching questions that target the different answers.
Following NQG++, we conduct our experiments on preprocessed data provided by prior work. We use a one-layer LSTM as the RNN cell for both the encoder and the decoder, and the encoder is bidirectional. The hidden size of both the encoder and the decoder is 512. We use 300-dimensional pre-trained GloVe vectors as the word embeddings. As in NQG++, the dimensions of the lexical features and the answer position features are 16. We use the Adam optimizer for model training with an initial learning rate of 0.001, and we halve it when the validation score does not improve. During the training of the Sentence-level Semantic Matching module, we sample the negative sentences and questions from nearby data samples in the same batch, because the preprocessed data lacks information about which data samples come from the same passage. We compute our total loss function with the two loss weights set to 1 and 2, respectively. Models are trained for 20 epochs with a mini-batch size of 32. We choose the model that achieves the best performance on the dev dataset.
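The learning rate schedule described above can be sketched as follows, assuming that a higher validation score is better and that the rate is halved after every epoch without improvement over the best score so far. The function name `update_learning_rate` and the example scores are hypothetical.

```python
def update_learning_rate(lr, val_score, best_score):
    """Halve the learning rate when the validation score does not improve.

    Returns the (possibly halved) learning rate and the best score so far.
    """
    if val_score > best_score:
        return lr, val_score
    return lr / 2.0, best_score


lr, best = 0.001, float("-inf")
for score in [10.0, 12.0, 11.0, 13.0, 13.0]:  # hypothetical dev scores
    lr, best = update_learning_rate(lr, score, best)
```

In this toy run the rate is halved twice (after the third and fifth epochs).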
Question generation methods can be categorized into two classes. The first is rule-based: lexical rules or templates are manually designed to convert the context into questions without a deep understanding of the context semantics [14, 12]. The second is neural network based: a seq-to-seq or encoder-decoder framework generates question words from scratch [6, 28]. Our work belongs to the second category.
One line of work uses a feature-enriched encoder that encodes the input sentence by concatenating word embeddings with lexical features as the encoder input, and answer position features are included to inform the model where the answer is. This is shown to bring considerable improvements to the model. With the success of reinforcement learning, another approach combines supervised learning and reinforcement learning for question generation by applying policy gradient after training the model in a supervised way. The reward term in the policy gradient loss function can be perplexity or the BLEU score. To tackle the issue that question words do not match the answer type, a separate vocabulary is introduced only for generating question words. Finally, using the paragraph as the input is proposed to provide more semantic information, together with an improved maxout pointer for copying words from the input.
Different from existing methods that focus on utilizing more informative features and improving the copy mechanism, we point out that the inability to capture sentence-level semantics and to exploit answer-aware features is the main cause of these issues. We alleviate the problem with two modules, sentence-level semantic matching and answer position inferring, which can be integrated with any base model in a Multi-Task Learning fashion.
In this paper, we observe two issues with the widely used baseline model on question generation. We point out the root cause is that existing models neither consider the whole question semantics nor exploit the answer position-aware features well. To address the issue, we propose the neural question generation model with sentence-level semantic matching, answer position inferring, and gated fusion. Extensive experimental results show that our work improves existing models significantly and outperforms the SOTA models on SQuAD and MARCO datasets.
This research was supported by CBL industry and agency members and by the IUCRC Program of the National Science Foundation under Grant No. CNS-1747783.
- [1] (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- [2] (1992) A simple rule-based part of speech tagger. In Proceedings of the third conference on Applied natural language processing, pp. 152–155.
- [3] (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- [4] (2014) Meteor universal: language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, pp. 376–380.
- [5] (2018) Harvesting paragraph-level question-answer pairs from Wikipedia. arXiv preprint arXiv:1805.05942.
- [6] (2017) Learning to ask: neural question generation for reading comprehension. arXiv preprint arXiv:1705.00106.
- [7] (2017) AllenNLP: a deep semantic natural language processing platform.
- [8] (2016) Pointing the unknown words. arXiv preprint arXiv:1603.08148.
- [9] (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780.
- [10] (2018) Improving neural question generation using answer separation. arXiv preprint arXiv:1809.02393.
- [11] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [12] (2015) Deep questions without deep understanding. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Vol. 1, pp. 889–898.
- [13] (2004) ROUGE: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81.
- [14] (2014) Linguistic considerations in automatic question generation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vol. 2, pp. 321–326.
- [15] (2016) MS MARCO: a human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
- [16] (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318.
- [17] (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543.
- [18] (2016) SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
- [19] (2003) Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. arXiv preprint cs/0306050.
- [20] (2017) Get to the point: summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.
- [21] (2016) Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.
- [22] (2018) Answer-focused and position-aware neural question generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3930–3939.
- [23] (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112.
- [24] (2017) Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 189–198.
- [25] (2018) LSDSCC: a large scale domain-specific conversational corpus for response generation with diversity oriented evaluation metrics. In Proceedings of NAACL-HLT, pp. 2070–2080.
- [26] (2017) Machine comprehension by text-to-text neural question generation. arXiv preprint arXiv:1705.02012.
- [27] (2018) Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3901–3910.
- [28] (2017) Neural question generation from text: a preliminary study. In National CCF Conference on Natural Language Processing and Chinese Computing, pp. 662–671.