ConvAI baseline solution
Automatic question generation aims to generate questions from a text passage where the generated questions can be answered by certain sub-spans of the given passage. Traditional methods mainly use rigid heuristic rules to transform a sentence into related questions. In this work, we propose to apply the neural encoder-decoder model to generate meaningful and diverse questions from natural language sentences. The encoder reads the input text and the answer position, to produce an answer-aware input representation, which is fed to the decoder to generate an answer focused question. We conduct a preliminary study on neural question generation from text with the SQuAD dataset, and the experiment results show that our method can produce fluent and diverse questions.
Neural Question Generation from Text: A Preliminary Study
Automatic question generation from natural language text takes text as input and generates questions about it, which is potentially valuable for educational purposes (Heilman, 2011). As the reverse task of question answering, question generation also has the potential to provide a large-scale corpus of question-answer pairs.
Previous work on question generation mainly uses rigid heuristic rules to transform a sentence into related questions (Heilman, 2011; Chali and Hasan, 2015). However, these methods rely heavily on human-designed transformation and generation rules, which cannot be easily adapted to other domains. Instead of generating questions from texts, Serban et al. (2016) proposed a neural network method to generate factoid questions from structured data.
In this work we conduct a preliminary study on question generation from text with neural networks, which is denoted as the Neural Question Generation (NQG) framework, to generate natural language questions from text without pre-defined rules. The Neural Question Generation framework extends the sequence-to-sequence models by enriching the encoder with answer and lexical features to generate answer focused questions. Concretely, the encoder reads not only the input sentence, but also the answer position indicator and lexical features. The answer position feature denotes the answer span in the input sentence, which is essential to generate answer relevant questions. The lexical features include part-of-speech (POS) and named entity (NER) tags to help produce better sentence encoding. Lastly, the decoder with attention mechanism (Bahdanau et al., 2015) generates an answer specific question of the sentence.
Large-scale manually annotated passage and question pairs play a crucial role in developing question generation systems. We propose to adapt the recently released Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) as the training and development datasets for the question generation task. In SQuAD, the answers are labeled as subsequences in the given sentences by crowdsourcing, and it contains more than 100K questions, which makes it feasible to train our neural network models. We conduct the experiments on SQuAD, and the experiment results show the neural network models can produce fluent and diverse questions from text.
In this section, we introduce the NQG framework, which consists of a feature-rich encoder and an attention-based decoder. Figure 1 provides an overview of our NQG framework.
In the NQG framework, we use the Gated Recurrent Unit (GRU) (Cho et al., 2014) to build the encoder. To capture more context information, we use a bidirectional GRU (BiGRU) to read the inputs in both forward and backward orders. Inspired by Chen and Manning (2014) and Nallapati et al. (2016), the BiGRU encoder reads not only the sentence words but also handcrafted features, to produce a sequence of word-and-feature vectors. We concatenate the word vector, lexical feature embedding vectors and answer position indicator embedding vector as the input of the BiGRU encoder. Concretely, the BiGRU encoder reads the concatenated sentence word vector, lexical features, and answer position feature, $x = (x_1, x_2, \dots, x_n)$, to produce two sequences of hidden vectors, i.e., the forward sequence $(\overrightarrow{h}_1, \overrightarrow{h}_2, \dots, \overrightarrow{h}_n)$ and the backward sequence $(\overleftarrow{h}_1, \overleftarrow{h}_2, \dots, \overleftarrow{h}_n)$. Lastly, the output sequence of the encoder is the concatenation of the two sequences, i.e., $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$.
To generate a question with respect to a specific answer in a sentence, we propose using an answer position feature to locate the target answer. In this work, the BIO tagging scheme is used to label the position of a target answer. In this scheme, tag B denotes the start of an answer, tag I continues the answer and tag O marks words that do not form part of an answer. The BIO tags of the answer position are embedded into real-valued vectors and fed to the feature-rich encoder. With the BIO tagging feature, the answer position is encoded into the hidden vectors and used to generate answer-focused questions.
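The BIO labeling described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' preprocessing code: the function name `bio_tags`, its arguments, and the example answer span are assumptions made here for demonstration.

```python
from typing import List


def bio_tags(tokens: List[str], answer_start: int, answer_len: int) -> List[str]:
    """Label each token B (answer start), I (inside answer) or O (outside)."""
    if answer_len < 1 or answer_start < 0 or answer_start + answer_len > len(tokens):
        raise ValueError("answer span must lie inside the sentence")
    tags = ["O"] * len(tokens)
    tags[answer_start] = "B"
    for i in range(answer_start + 1, answer_start + answer_len):
        tags[i] = "I"
    return tags


# Hypothetical example: a shortened sentence with the two-token answer "genghis khan".
tokens = "in 1226 , genghis khan began a retaliatory attack on the tanguts .".split()
tags = bio_tags(tokens, answer_start=3, answer_len=2)
```

Each tag would then be mapped to an embedding vector and concatenated with the word and lexical feature embeddings before entering the encoder.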
Besides the sentence words, we also feed other lexical features to the encoder. To encode more linguistic information, we select word case, POS and NER tags as the lexical features. As an intermediate layer of full parsing, the POS tag feature is important in many NLP tasks, such as information extraction and dependency parsing (Manning et al., 1999). Considering that SQuAD is constructed from Wikipedia articles, which contain many named entities, we add the NER feature to help detect them.
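We employ an attention-based GRU decoder to decode the sentence and answer information to generate questions. At decoding time step $t$, the GRU decoder reads the previous word embedding $w_{t-1}$ and context vector $c_{t-1}$ to compute the new hidden state $s_t$. We use a linear layer with the last backward encoder hidden state $\overleftarrow{h}_1$ to initialize the decoder GRU hidden state. The context vector $c_t$ for current time step $t$ is computed through the concatenate attention mechanism (Luong et al., 2015), which matches the current decoder state $s_t$ with each encoder hidden state $h_i$ to get an importance score. The importance scores are then normalized to get the current context vector by weighted sum:
$$s_t = \mathrm{GRU}(w_{t-1}, c_{t-1}, s_{t-1}) \tag{1}$$
$$s_0 = \tanh(W_d \overleftarrow{h}_1 + b) \tag{2}$$
$$e_{t,i} = v_a^{\top} \tanh(W_a s_t + U_a h_i) \tag{3}$$
$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{n} \exp(e_{t,k})} \tag{4}$$
$$c_t = \sum_{i=1}^{n} \alpha_{t,i} h_i \tag{5}$$
We then combine the previous word embedding $w_{t-1}$, the current context vector $c_t$, and the decoder state $s_t$ to get the readout state $r_t$. The readout state is passed through a maxout hidden layer (Goodfellow et al., 2013) to predict the next word with a softmax layer over the decoder vocabulary:
$$r_t = W_r w_{t-1} + U_r c_t + V_r s_t \tag{6}$$
$$m_t = \left[\max\{r_{t,2j-1}, r_{t,2j}\}\right]^{\top}_{j=1,\dots,d} \tag{7}$$
$$p(y_t \mid y_1, \dots, y_{t-1}) = \mathrm{softmax}(W_o m_t) \tag{8}$$
where $r_t$ is a $2d$-dimensional vector.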
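To make the two decoder operations above concrete, the following pure-Python sketch normalizes a list of importance scores, forms the context vector as the weighted sum of encoder states, and applies a maxout over consecutive pairs of readout values. It is an illustration under assumptions, not the authors' implementation: the function names and toy numbers are invented here, and the pool size of two for maxout is assumed.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Normalize importance scores into attention weights that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(scores: List[float], encoder_states: List[List[float]]) -> List[float]:
    """Attention-weighted sum of the encoder hidden states."""
    weights = softmax(scores)
    dim = len(encoder_states[0])
    return [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(dim)]


def maxout(readout: List[float]) -> List[float]:
    """Keep the larger value of each consecutive pair (assumed pool size 2)."""
    if len(readout) % 2 != 0:
        raise ValueError("readout length must be even")
    return [max(readout[j], readout[j + 1]) for j in range(0, len(readout), 2)]


# Toy values (hypothetical): three encoder states of size 2 with equal scores.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = context_vector([0.0, 0.0, 0.0], states)  # uniform attention weights
pooled = maxout([0.5, -1.0, 2.0, 3.0])
```

In a trained model the scores come from the learned matching function between the decoder state and each encoder state; here they are supplied directly to keep the sketch self-contained.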
To deal with the rare and unknown words problem, Gulcehre et al. (2016) propose using a pointing mechanism to copy rare words from the source sentence. We apply this pointing method in our NQG system. When decoding word $t$, the copy switch takes the current decoder state $s_t$ and context vector $c_t$ as input and generates the probability $p$ of copying a word from the source sentence:
$$p = \sigma(W s_t + U c_t + b) \tag{9}$$
where $\sigma$ is the sigmoid function. We reuse the attention probability in Equation 4 to decide which word to copy.
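The copy switch can be illustrated with a small sketch. The weights, the toy attention distribution, and the hard 0.5 threshold below are assumptions chosen for readability; the source only states that a sigmoid switch produces the copy probability and that the attention probabilities select the word to copy.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def copy_probability(state: Sequence[float], context: Sequence[float],
                     w_state: Sequence[float], w_context: Sequence[float],
                     bias: float) -> float:
    """Sigmoid switch over the decoder state and the context vector."""
    return sigmoid(dot(w_state, state) + dot(w_context, context) + bias)


def next_word(p_copy: float, attention: List[float],
              source_tokens: List[str], generated_word: str) -> str:
    """Greedy simplification: copy the most-attended source token if p_copy > 0.5."""
    if p_copy > 0.5:
        best = max(range(len(source_tokens)), key=lambda i: attention[i])
        return source_tokens[best]
    return generated_word


# Hypothetical one-dimensional state and context with hand-picked weights.
source = ["manning", "suffered", "a", "partial", "tear"]
attention = [0.6, 0.1, 0.1, 0.1, 0.1]
p = copy_probability([1.0], [1.0], [2.0], [1.0], 0.0)  # sigmoid(3.0)
word = next_word(p, attention, source, "<unk>")
```

Reusing the attention weights means the copy path needs no additional parameters beyond the switch itself.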
We use the SQuAD dataset as our training data. SQuAD is composed of more than 100K questions posed by crowd workers on 536 Wikipedia articles. We extract sentence-answer-question triples to build the training, development and test sets (footnote 1: we re-distribute the processed data split and PCFG-Trans baseline code at an anonymous url for blind review). Since the test set is not publicly available, we randomly halve the development set to construct the new development and test sets. The extracted training, development and test sets contain 86,635, 8,965 and 8,964 triples respectively. We introduce the implementation details in the appendix.
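A random halving of this kind might look like the sketch below. The seed, the function name `halve`, and the choice to give the extra example to the first half are assumptions; the source does not describe how the split was drawn.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def halve(examples: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle with a fixed seed and cut in the middle (first half gets the odd item)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    mid = (len(shuffled) + 1) // 2
    return shuffled[:mid], shuffled[mid:]


# 8,965 + 8,964 = 17,929 triples, standing in for the original development set.
dev, test = halve(range(17929))
```

Fixing the seed keeps the split reproducible, which matters when the halves are redistributed for comparison with later work.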
We conduct several experiments and ablation tests as follows:
s2s+att: We implement a seq2seq model with attention as the baseline method.
NQG: We extend s2s+att with our feature-rich encoder to build the NQG system.
NQG+: Based on NQG, we incorporate the copy mechanism to deal with the rare words problem.
NQG+Pretrain: Based on NQG+, we initialize the word embedding matrix with pre-trained GloVe (Pennington et al., 2014) vectors.
NQG+STshare: Based on NQG+, we make the encoder and decoder share the same embedding matrix.
NQG++: Based on NQG+, we use both the pre-trained word embedding and STshare methods, to further improve the performance.
NQG−Answer: Ablation test, the answer position indicator is removed from the NQG model.
NQG−POS: Ablation test, the POS tag feature is removed from the NQG model.
NQG−NER: Ablation test, the NER feature is removed from the NQG model.
NQG−Case: Ablation test, the word case feature is removed from the NQG model.
Table 1: BLEU-4 scores of different settings (columns: Model, Dev set, Test set).
Table 1 shows the BLEU-4 scores of different settings. We report the beam search results on both development and test sets. Our NQG framework outperforms the PCFG-Trans and s2s+att baselines by a large margin. This shows that the lexical features and answer position indicator benefit question generation. With the help of the copy mechanism, NQG+ gains 2.05 BLEU, since copying addresses the rare words problem. The extended version, NQG++, gains a further 1.11 BLEU over NQG+, which shows that initializing with pre-trained word vectors and sharing them between encoder and decoder helps learn better word representations.
We evaluate the PCFG-Trans baseline and NQG++ with human judges. The rating scheme is: Good (3) - The question is meaningful and matches the sentence and answer very well; Borderline (2) - The question matches the sentence and answer, more or less; Bad (1) - The question either does not make sense or does not match the sentence and answer. We provide more detailed rating examples in the supplementary material. Three human raters labeled 200 questions sampled from the test set to judge if the generated question matches the given sentence and answer span. The inter-rater agreement is measured with Fleiss' kappa (Fleiss, 1971).
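For readers unfamiliar with the agreement statistic, the sketch below computes Fleiss' kappa from a table of per-item category counts using the standard formula. The toy ratings are invented for illustration and are not the study's data; returning 1.0 when every rating falls in a single category is a convention chosen here, since kappa is undefined in that case.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """counts[i][j] = number of raters who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])
    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    per_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
                for row in counts]
    p_observed = sum(per_item) / n_items
    # Chance agreement from the overall category proportions.
    shares = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    p_chance = sum(s * s for s in shares)
    if p_chance == 1.0:
        return 1.0  # degenerate case: all ratings in one category (convention assumed here)
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical ratings: 3 raters, categories ordered (Bad, Borderline, Good).
perfect = fleiss_kappa([[0, 0, 3], [0, 0, 3], [3, 0, 0], [3, 0, 0]])
split = fleiss_kappa([[1, 1, 1], [1, 1, 1]])
```

With three raters and 200 questions, the input would be a 200-row table whose rows each sum to three.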
Table 2 reports the human judge results. The kappa scores show a moderate agreement between the human raters. Our NQG++ outperforms the PCFG-Trans baseline by 0.76 points, which shows that the questions generated by NQG++ are more related to the given sentence and answer span.
The answer position indicator, as expected, plays a crucial role in answer-focused question generation, as shown in the NQG−Answer ablation test. Without it, the performance drops sharply since the decoder has no information about the answer subsequence.
The ablation tests NQG−Case, NQG−POS and NQG−NER show that the word case, POS and NER tag features contribute to question generation.
Table 3 provides three examples generated by NQG++. The underlined words are the target answers. The three examples have different question types, namely WHEN, WHAT and WHO respectively. It can be observed that the decoder can 'copy' spans from the input sentences to generate the questions. Besides the underlined words, other meaningful spans can also be used as answers to generate correct answer-focused questions.
|I:||in 1226 , immediately after returning from the west , genghis khan began a retaliatory attack on the tanguts .|
|G:||in which year did genghis khan strike against the tanguts ?|
|O:||in what year did genghis khan begin a retaliatory attack on the tanguts ?|
|I:||in week 10 , manning suffered a partial tear of the plantar fasciitis in his left foot .|
|G:||in the 10th week of the 2015 season , what injury was peyton manning dealing with ?|
|O:||what did manning suffer in his left foot ?|
|I:||like the lombardi trophy , the “ 50 ” will be designed by tiffany & co. .|
|G:||who designed the vince lombardi trophy ?|
|O:||who designed the lombardi trophy ?|
Following Wang and Jiang (2016), we classify the questions into different types, i.e., WHAT, HOW, WHO, WHEN, WHICH, WHERE, WHY and OTHER (footnote 2: we treat questions such as 'what country', 'what place' and so on as WHERE-type questions; similarly, questions containing 'what time', 'what year' and so forth are counted as WHEN-type questions).
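A rule-based typing of this kind could be sketched as follows. The phrase lists contain only the four phrases named in the footnote, and the fallback of taking the first question word found in the question is an assumption; the exact rules used in the study are not given.

```python
QUESTION_TYPES = ("WHAT", "HOW", "WHO", "WHEN", "WHICH", "WHERE", "WHY")
WHERE_PHRASES = ("what country", "what place")  # footnote examples only
WHEN_PHRASES = ("what time", "what year")       # footnote examples only


def question_type(question: str) -> str:
    """Assign one of the eight types to a lowercased, tokenized question."""
    text = question.lower()
    if any(phrase in text for phrase in WHERE_PHRASES):
        return "WHERE"
    if any(phrase in text for phrase in WHEN_PHRASES):
        return "WHEN"
    for token in text.split():
        word = token.strip("?,.").upper()
        if word in QUESTION_TYPES:
            return word  # first question word wins (assumed tie-breaking rule)
    return "OTHER"


examples = [
    "in what year did genghis khan begin a retaliatory attack on the tanguts ?",
    "what did manning suffer in his left foot ?",
    "who designed the lombardi trophy ?",
]
types = [question_type(q) for q in examples]
```

The special phrases are checked first so that a question beginning with 'what year' is not counted as a WHAT-type question.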
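We evaluate the precision and recall of each question type. Figure 2 provides the precision and recall metrics of different question types. The precision and recall of a question type $T$ are defined as:
$$\mathrm{precision} = \frac{\#(\text{true } T\text{-type})}{\#(\text{generated } T\text{-type})}, \qquad \mathrm{recall} = \frac{\#(\text{true } T\text{-type})}{\#(\text{all gold } T\text{-type})}$$
where $\#(\text{true } T\text{-type})$ counts the generated questions of type $T$ whose gold question is also of type $T$.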
For the majority question types, namely WHAT, HOW, WHO and WHEN, our NQG++ model performs well on both precision and recall. For the WHICH type, it can be observed that neither precision nor recall is acceptable. Two reasons may cause this: a) some WHICH-type questions can be asked in other manners, e.g., 'which team' can be replaced with 'who'; b) WHICH-type questions account for only about 7.2% of the training data, which may not be sufficient to learn to generate this type of question. The same reasons can also affect the precision and recall of WHY-type questions.
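The per-type scores discussed above can be computed from two aligned lists of type labels, one for the gold questions and one for the generated questions. The sketch assumes the usual definitions (matching pairs divided by generated count for precision, and by gold count for recall); the function name and toy labels are illustrative.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_type_precision_recall(gold: List[str],
                              generated: List[str]) -> Dict[str, Tuple[float, float]]:
    """Return {type: (precision, recall)} for aligned gold/generated type labels."""
    if len(gold) != len(generated):
        raise ValueError("label lists must be aligned")
    matched = Counter(g for g, o in zip(gold, generated) if g == o)
    gold_total = Counter(gold)
    generated_total = Counter(generated)
    scores = {}
    for qtype in set(gold_total) | set(generated_total):
        precision = matched[qtype] / generated_total[qtype] if generated_total[qtype] else 0.0
        recall = matched[qtype] / gold_total[qtype] if gold_total[qtype] else 0.0
        scores[qtype] = (precision, recall)
    return scores


# Hypothetical labels for four sentence-answer pairs.
gold_types = ["WHAT", "WHAT", "WHO", "WHICH"]
generated_types = ["WHAT", "WHO", "WHO", "WHAT"]
scores = per_type_precision_recall(gold_types, generated_types)
```

In this toy case the gold WHICH question is generated as a WHAT question, which lowers WHICH recall and WHAT precision at once, the same substitution effect the paragraph above describes for 'which team' and 'who'.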
In this paper we conduct a preliminary study of natural language question generation with neural network models. We propose to apply a neural encoder-decoder model to generate answer-focused questions based on natural language sentences. The proposed approach uses a feature-rich encoder to encode answer position, POS and NER tag information. Experiments show the effectiveness of our NQG method. In future work, we would like to investigate whether the automatically generated questions can help to improve question answering systems.
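Christopher D. Manning and Hinrich Schütze. 1999. Foundations of statistical natural language processing, volume 999. MIT Press.
Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. ICML (3) 28:1310–1318.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958.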
We use the same vocabulary for both encoder and decoder. The vocabulary is collected from the training data and we keep the top 20,000 frequent words. We set the word embedding size to 300 and all GRU hidden state sizes to 512. The lexical and answer position features are embedded to 32-dimensional vectors. We use dropout (Srivastava et al., 2014) with probability . During testing, we use beam search with beam size 12.
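Collecting a frequency-limited vocabulary of this kind takes only a few lines. The sketch below is an assumption-laden illustration: the function name is invented, and any special tokens (for unknown words, padding, or sentence boundaries) that a real system would add are left out because the source does not describe them.

```python
from collections import Counter
from typing import Iterable, List


def build_vocab(sentences: Iterable[List[str]], max_size: int = 20000) -> List[str]:
    """Return the max_size most frequent tokens, most frequent first."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [word for word, _ in counts.most_common(max_size)]


# Tiny hypothetical corpus; the cap is lowered to 2 so the cut-off is visible.
corpus = [["a", "b", "a"], ["a", "c", "b"]]
vocab = build_vocab(corpus, max_size=2)
```

Words outside the list are the rare and unknown words that the copy mechanism is meant to handle.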
We use Stanford CoreNLP v3.7.0 (Manning et al., 2014) to annotate POS and NER tags in sentences with its default configuration and pre-trained models.
We initialize model parameters randomly using a Gaussian distribution with the Xavier scheme (Glorot and Bengio, 2010). We use a combination of Adam (Kingma and Ba, 2015) and simple SGD as our optimizing algorithms. The training is separated into two phases: the first phase optimizes the loss function with Adam and the second with simple SGD. For the Adam optimizer, we set the learning rate, two momentum parameters and respectively, and . We use the Adam optimizer until the BLEU score on the development set drops for six consecutive tests (we test the BLEU score on the development set every 1,000 batches). Then we switch to a simple SGD optimizer with initial learning rate and halve it if the BLEU score on the development set drops for twelve consecutive tests. We also apply gradient clipping (Pascanu et al., 2013) with range for both the Adam and SGD phases. To both speed up the training and converge quickly, we use a mini-batch size of 64, chosen by grid search.
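The two-phase schedule can be expressed as a small state machine that is updated after every development-set test. This is a sketch under stated assumptions: a "drop" is taken to mean a score lower than the previous test, the class and attribute names are invented, and the SGD learning rate passed in below is a placeholder rather than the value used in the study.

```python
from typing import Optional


class TwoPhaseSchedule:
    """Track dev-set BLEU and decide when to switch optimizer or halve the SGD rate."""

    def __init__(self, sgd_lr: float, adam_patience: int = 6, sgd_patience: int = 12):
        self.phase = "adam"
        self.sgd_lr = sgd_lr
        self.adam_patience = adam_patience
        self.sgd_patience = sgd_patience
        self.prev_bleu: Optional[float] = None
        self.drops = 0  # consecutive tests with a lower BLEU than the previous test

    def update(self, dev_bleu: float) -> None:
        """Call once per development-set test (every 1,000 batches in the source)."""
        if self.prev_bleu is not None and dev_bleu < self.prev_bleu:
            self.drops += 1
        else:
            self.drops = 0
        self.prev_bleu = dev_bleu
        if self.phase == "adam" and self.drops >= self.adam_patience:
            self.phase, self.drops = "sgd", 0
        elif self.phase == "sgd" and self.drops >= self.sgd_patience:
            self.sgd_lr /= 2
            self.drops = 0


schedule = TwoPhaseSchedule(sgd_lr=1.0)  # placeholder rate, not from the source
for bleu in [10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0]:  # six consecutive drops
    schedule.update(bleu)
phase_after_adam = schedule.phase
for k in range(12):  # twelve further consecutive drops
    schedule.update(3.9 - 0.1 * k)
```

A training loop would consult `schedule.phase` and `schedule.sgd_lr` after each update to pick the optimizer and its step size.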
We evaluate the PCFG-Trans baseline and NQG++ with human judges. The rating scheme is provided in Table 4.
|3: Good||The question is meaningful and matches the sentence and answer very well|
|2: Borderline||The question matches the sentence and answer, more or less|
|1: Bad||The question either does not make sense or does not match the sentence and answer|
The human judges are asked to label whether the generated questions match the given sentence and answer span, according to the rating scheme and examples. We provide some example questions with different scores in Table 5. In the first example (score 3), the question makes sense and the target answer "reason" can be used to answer it given the input sentence. In the second example (score 2), the question is not an ideal fit for the sentence, since the answer concerns prime numbers; however, given the sentence, a reasonable person would give the target answer to the question. In the third example (score 1), the question is totally wrong given the sentence and answer.
|Score||Sentence and generated question|
|3||I: -lsb- … -rsb- for reason is the greatest enemy that faith has ; it never comes to the aid of spiritual things . ”|
|O: what is the biggest enemy that faith has have ?|
|2||I: in all other rows -lrb- a = 1 , 2 , 4 , 5 , 7 , and 8 -rrb- there are infinitely many prime numbers .|
|O: how many numbers are in all other rows ?|
|1||I: while genghis khan never conquered all of china , his grandson kublai khan completed that conquest and established the yuan dynasty that is often credited with re-uniting china .|
|O: who did kublai khan defeat that conquered all of china ?|