Asking Questions the Human Way: Scalable Question-Answer Generation from Text Corpus

01/27/2020 ∙ by Bang Liu, et al. ∙ 0

The ability to ask questions is important in both human and machine intelligence. Learning to ask questions helps knowledge acquisition, improves question-answering and machine reading comprehension tasks, and helps a chatbot to keep the conversation flowing with a human. Existing question generation models are ineffective at generating a large amount of high-quality question-answer pairs from unstructured text, since given an answer and an input passage, question generation is inherently a one-to-many mapping. In this paper, we propose Answer-Clue-Style-aware Question Generation (ACS-QG), which aims at automatically generating high-quality and diverse question-answer pairs from unlabeled text corpus at scale by imitating the way a human asks questions. Our system consists of: i) an information extractor, which samples from the text multiple types of assistive information to guide question generation; ii) neural question generators, which generate diverse and controllable questions, leveraging the extracted assistive information; and iii) a neural quality controller, which removes low-quality generated data based on text entailment. We compare our question generation models with existing approaches and resort to voluntary human evaluation to assess the quality of the generated question-answer pairs. The evaluation results suggest that our system dramatically outperforms state-of-the-art neural question generation models in terms of the generation quality, while being scalable in the meantime. With models trained on a relatively smaller amount of data, we can generate 2.8 million quality-assured question-answer pairs from a million sentences found in Wikipedia.



There are no comments yet.


page 1

page 3

page 5

page 6

page 9

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

Automatically generating question-answer pairs from unlabeled text passages is of great value to many applications, such as assisting the training of machine reading comprehension systems (Tang et al., 2017; Tang et al., 2018; Du and Cardie, 2018), generating queries/questions from documents to improve search engines (Han et al., 2019), training chatbots to get and keep a conversation going (Shum et al., 2018), generating exercises for educational purposes (Heilman and Smith, 2010; Heilman, 2011; Danon and Last, 2017), and generating FAQs for web documents (Krishna and Iyyer, 2019). Many question-answering tasks such as machine reading comprehension and chatbots require a large amount of labeled samples for supervised training, acquiring which is time-consuming and costly. Automatic question-answer generation makes it possible to provide these systems with scalable training data and to transfer a pre-trained model to new domains that lack manually labeled training samples.

Figure 1. Given the same input sentence, we can ask diverse questions based on the different choices about i) what the target answer is; ii) which answer-related chunk is used as a clue, and iii) what type of questions is asked.

Despite a large number of studies on Neural Question Generation, it remains a significant challenge to generate high-quality QA pairs from unstructured text at large quantities. Most existing neural question generation approaches try to solve the answer-aware question generation problem, where an answer chunk and the surrounding passage are provided as an input to the model while the output is the question to be generated. They formulate the task as a Sequence-to-Sequence (Seq2Seq) problem, and design various encoder, decoder, and input features to improve the quality of generated questions (Serban et al., 2016; Du et al., 2017; Liu et al., 2019; Zhou et al., 2017; Song et al., 2018; Hu et al., 2018; Du and Cardie, 2018). However, answer-aware question generation models are far from sufficient, since question generation from a passage is inherently a one-to-many mapping. Figure 1 shows an example of this phenomenon. Given the same input text “The fight scene finale between Sharon and the character played by Ali Larter, from the movie Obsessed, won the 2010 MTV Movie Award for Best Fight.”, we can ask a variety of questions based on it. If we select the text chunk “MTV Movie Award for Best Fight” as the answer, we can still ask different questions such as “A fight scene from the movie, Obsessed, won which award?” or “Which award did the fight scene between Sharon and the role of Ali Larter win?”.

We argue that when a human asks a question based on a passage, she will consider various factors. First, she will still select an answer as a target that her question points to. Second, she will decide which piece of information will be present (or rephrased) in her question to set constraints or context for the question. We call this piece of information as the clue. The target answer may be related to different clues in the passage. Third, even the same question may be expressed in different styles (e.g., “what”, “who”, “why”, etc.). For example, one can ask “which award” or “what is the name of the award” to express the same meaning. Once the answer, clue, and question style are selected, the question generation process will be narrowed down and become closer to a one-to-one mapping problem, essentially mimicking the human way of asking questions. In other words, introducing these pieces of information into question-answer generation can help reduce the difficulty of the task.

In this paper, we propose Answer-Clue-Style-aware Question Generation (ACS-QG) designed for scalable generation of high-quality question-answer pairs from unlabeled text corpus. Just as a human will ask a question with clue and style in mind, our system first automatically extracts multiple types of information from an input passage to assist question generation. Based on the multi-aspect information extracted, we design neural network models to generate diverse questions in a controllable way. Compared with existing answer-aware question generation, our approach essentially converts the one-to-many mapping problem into a one-to-one mapping problem, and is thus scalable by varying the assistive information fed to the neural network while in the meantime ensuring generation quality. Specifically, we have solved multiple challenges in the ACS-aware question generation system:

What to ask given an unlabeled passage? Given an input passage such as a sentence, randomly sampling ¡answer, clue, style¿ combinations will cause type mismatches, since answer, clue, and style are not independent of each other. Without taking their correlations into account, for example, we may select “how” or “when” as the target question style while a person’s name is selected as the answer. Moreover, randomly sampling ¡answer, clue, style¿ combinations may lead to input volume explosion, as most of such combinations point to meaningless questions.

To overcome these challenges, we design and implement an information extractor to efficiently sample meaningful inputs from the given text. We learn the joint distribution of

¡answer, clue, style¿ tuples from existing reading comprehension datasets, such as SQuAD (Rajpurkar et al., 2016)

. In the meantime, we decompose the joint probability distribution of the tuple into three components, and apply a three-step sampling mechanism to select reasonable combinations of input information from the input passage to feed into the ACS-aware question generator. Based on this strategy, we can alleviate type mismatches and avoid meaningless combinations of assistive information.

How to learn a model to ask ACS-aware questions? Most existing neural approaches are designed for answer-aware question generation, while there is no training data available for the ACS-aware question generation task. We propose effective strategies to automatically construct training samples from existing reading comprehension datasets without any human labeling effort. We define “clue” as a semantic chunk in an input passage that will be included (or rephrased) in the target question. Based on this definition, we perform syntactic parsing and chunking on input text, and select the chunk which is most relevant to the target question as the clue. Furthermore, we categorize different questions into 9 styles, including “what”, “how”, “yes-no” and so forth, In this manner, we have leveraged the abundance of reading comprehension datasets to automatically construct training data for ACS-aware question generation models.

We propose two deep neural network models for ACS-aware question generation, and show their superior performance in generating diverse and high-quality questions. The first model employs sequence-to-sequence framework with copy and attention mechanism (Sutskever et al., 2014; Bahdanau et al., 2014; Cao et al., 2017), incorporating the information of answer, clue and style into the encoder and decoder. Furthermore, it discriminates between content words and function words in the input, and utilizes vocabulary reduction (which downsizes the vocabularies for both the encoder and decoder) to encourage aggressive copying. In the second model, we fine-tune a GPT2-small model (Radford et al., 2019). We train our ACS-aware QG models using the input passage, answer, clue, and question style as the language modeling context. As a result, we reduce the phenomenon of repeating output words, which usually exists with sequence-to-sequence models, and can generate questions with better readability. With multi-aspect assistive information, our models are able to ask a variety of high-quality questions based on an input passage, while making the generation process controllable.

How to ensure the quality of generated QA pairs? We construct a data filter, which consists of an entailment model and a question answering model. In our filtering process, we input questions generated in the aforementioned manner into a BERT-based (Devlin et al., 2018)

question answering model to get its predicted answer span, and measure the overlap between the input answer span and the predicted answer span. In addition, we also classify the entailment relationship between the original sentence and the question-answer concatenation. These components allow us to remove low-quality QA pairs. By combining the input sampler, ACS-aware question generator, and the data filter, we have constructed a pipeline that is able to generate a large number of QA pairs from unlabeled text without extra human labeling efforts.

Figure 2. An overview of the system architecture. It contains a dataset constructor, information sampler, ACS-aware question generator and a data filter.

We perform extensive experiments based on the SQuAD dataset (Rajpurkar et al., 2016) and Wikipedia, and compare our ACS-aware question generation model with different existing approaches. Results show that both the content-separated seq2seq model with aggressive copying mechanism and the extra input information bring substantial benefits to question generation. Our method outperforms the state-of-the-art models significantly in terms of various metrics such as BLEU-4, ROUGE-L and METEOR.

With models trained on of SQuAD data samples, we can automatically generate two large datasets containing 1.33 million and 1.45 million QA pairs from a corpus of top-ranked Wikipedia articles, respectively. We perform quality evaluation on the generated datasets and identify their strengths and weaknesses. Finally, we also evaluate how our generated QA data perform in training question-answering models in machine reading comprehension, as an alternative means to assess the data generation quality.

2. Problem Formulation

In this section, we formally introduce the problem of ACS-aware question generation.

Denote a passage by , where it can be either a sentence or a paragraph (in our work, it is a sentence). Let denotes a question related to this passage, and denotes the answer of that question. A passage consists of a sequence of words where denotes the length of . A question contains words from either a predefined vocabulary or from the input text . Our objective is to generate different question-answer pairs from it. Therefore, we aim to model the probability .

In our work, we factorize the generation process into multiple steps to select different inputs for question generation. Specifically, given a passage , we will select three types of information as input to a generative model, which are defined as follows:

  • Answer: here we define an answer as a span in the input passage . Specifically, we select a semantic chunk of as a target answer from a set of chunks given by parsing and chunking.

  • Clue: denote a clue as . As mentioned in Sec. 1, a clue is a semantic chunk in input which will be copied or rephrased in the target question. It is related to the answer , and providing it as input can help reducing the uncertainty when generating questions. This helps to alleviate the one-to-many mapping problem of question generation, makes the generation process more controllable, as well as improves the quality of generated questions.

  • Style: denote a question style as . We classify each question into nine styles: “who”, “where”, “when”, “why”, “which”, “what”, “how”, “yes-no”, and “other”. By providing the target style to the question generation model, we further reduce the uncertainty of generation and increase the controllability.

We shall note that our definition of clue is different with (Liu et al., 2019). In our work, given a passage and a question , we identify a clue as a consistent chunk in instead of being the overlapping non-stop words between and . On one hand, this allows the clue to be expressed in different ways in a question. On the other hand, given unlabeled text corpus, we can sample clue chunks for generating questions according to the same distribution in training datasets to avoid discrepancy between training and generating.

Given the above definitions, our generation process is decomposed into input sampling and ACS-aware question generation:


where , , and model the process of input sampling to get answer, question style, and clue information for a target question; and models the process of generating the target question.

3. Model Description

In this section, we present our overall system architecture for generating questions from unlabeled text corpus. We then introduce the details of each component.

Figure 2 shows the pipelined system we build to train ACS-aware question generation models and generate large-scale datasets which can be utilized for different applications. Our system consists of four major components: i) dataset constructor, which takes existing QA datasets as input, and constructs training datasets for ACS-aware question generation; ii) information sampler (extractor), which samples answer, clue and style information from input text and feed them into ACS-aware QG models; iii) ACS-aware question generation models, which are trained on the constructed datasets to generate questions; and iv) data filter, which controls the quality of generated questions.

3.1. Obtaining Training Data for Question Generation

Our first step is to acquire a training dataset to train ACS-aware question generation models. Existing answer-aware question generation methods (Serban et al., 2016; Du et al., 2017; Liu et al., 2019; Zhou et al., 2017) utilize reading comprehension datasets such as SQuAD (Rajpurkar et al., 2016), as these datasets contain tuples. However, for our problem, the input and output consists of , where the clue and style information are not directly provided in existing datasets. To address this issue, we design effective strategies to automatically extract clue and style information without involving human labeling.

Input: passage , answer , question , related words dictionary .
Output: clue .
1:  get candidate chunks of passage by parsing and chunking;
2:  remove function words, tokenize and to get and   and stemming and to get and ;
3:  for  do
4:     get tokenized clue and stemmed clue with only content words;
5:      number of overlapping tokens between and ;
6:      number of overlapping stems between and ;
7:      number of soft copied tokens between and ;
8:     binary whether contains the chunk text ;
10:  end for
11:  select the chunk with maximum as the clue chunk;
ALGORITHM 1 Clue Extraction

Rules for Clue Identification. As mentioned in Sec. 2, given , we define a semantic chunk in input which is copied or rephrased in the output question as clue. We identify by the method shown in Algorithm 1.

First, we parse and chunk the input passage to get all candidate chunks. Second, we get the tokenized and stemmed passage and question, and only keep the content words in the results. Third, we calculate the similarities between each candidate chunk and the target question according to different criteria. The final score of each chunk is the sum of different similarities. Finally, we select the chunk with the maximum score as the identified clue chunk .

To estimate the similarities between each candidate chunk and the question, we calculate the number of overlapping tokens and stems between each chunk and the question, as well as checking whether the chunk is fully contained in the question. In addition, we further define “soft copy” relationship between two words to take rephrasing into consideration. Specifically, a word

is considered as soft-copied from input passage if there exist a word which is semantically coherent with . To give an instance, consider a passage “Selina left her hometown at the age of 18” and a question “How old was Selina when she left?”, the word “old” is soft-copied from “age” in the input passage.

To identify the soft-copy relationship between any pair of words, we utilize synonyms and word vectors, such as Glove

(Pennington et al., 2014), to construct a related words dictionary , where returns a set of words that is closely related to . For each word , is composed of the synonyms of , as well as the top most similar words estimated by word vector representations (we set ). In our work, we utilize Glove word vectors, and construct based on Genism (Řehůřek and Sojka, 2010) and WordNet (Miller, 1995).

Rules for Style Classification. Algorithm 2 presents our method for question style classification. We classify a given question into

classes based on a few heuristic strategies. If

contains who, where, when, why, which, what, or how, we classify it as the corresponding type. For yes-no type questions, we define a set of feature words. If starts with any word belonging to the set of feature words, we classify it as type yes-no. For all other cases, we label it as other.

Input: question , style set , yes-no feature words set
Output: style .
1:  for  do
2:     if word is contained in  then
3:        return  
4:     end if
5:  end for
6:  for  do
7:     if word is the first word of  then
8:        return  
9:     end if
10:  end for
11:  return  
ALGORITHM 2 Style Classification

3.2. ACS-Aware Question Generation

After obtained training datasets, we design two models for ACS-aware question generation. The first model is based on Seq2Seq framework with attention and copy mechanism (Sutskever et al., 2014; Bahdanau et al., 2014; Gu et al., 2016). In addition, we exploit clue embedding, content embedding, style encoding and aggressive copying to improve the performance of question generation. The second model is based on pre-trained language models. We fine-tune a GPT2-small model (Radford et al., 2019) using the constructed training datasets.

3.2.1. Seq2Seq-based ACS-aware question generation

Given a passage, an answer span, a clue span, and a desired question style, we train a neural encoder-decoder model to generate appropriate questions.


. We utilize a bidirectional Gated Recurrent Unit (BiGRU)

(Chung et al., 2014) as our encoder. For each word in input passage , we concatenate the different features to form a concatenated embedding vector

as the input to the encoder. Specifically, for each word, it is represented by the concatenation of its word vector, embeddings of its Named Entity Recognition (NER) tag, Part-of-Speech (POS) tag, and whether it is a content word. In addition, we can know whether each word is within the span of answer

or clue , and utilize binary features to indicate the positions of answer and clue in input passage. All tag features and binary features are casted into -dimensional vectors by different embedding matrices that are trainable.

Suppose the embedding of passage is . Our encoder will read the input sequence and produce a sequence of hidden states , where each hidden state is a concatenation of a forward representation and a backward representation:


The and are the forward and backward hidden states of the -th token in , respectively.

Decoder. Our decoder is another GRU with attention and copy mechanism. Denote the embedding vector of desired question style as . We initialize the hidden state of our decoder GRU by concatenating with the last backward encoder hidden state to a linear layer:


At each decoding time step , the decoder calculates its current hidden state based on the word vector of the previous predicted word , previous attentional context vector , and its previous hidden state :


where the context vector at time step is a weighted sum of input hidden states, and the weights are calculated by the concatenated attention mechanism (Luong et al., 2015):


To generate an output word, we combine , and to calculate a readout state by an MLP maxout layer with dropouts (Goodfellow et al., 2013)

, and pass it to a linear layer and a softmax layer to predict the probabilities of the next word over a vocabulary:


where is a -D vector.

For copy or point mechanism (Gulcehre et al., 2016), the probability to copy a word from input at time step is given by:



is the Sigmoid function, and

is the probability of performing copying. The copy probability of each input word is given by the attention weights in Equation (10).

It has been reported that the generated words in a target question are usually from frequent words, while the majority of low-frequency words in the long tail are copied from the input instead of generated (Liu et al., 2019). Therefore, we reduce the vocabulary size to be the top high-frequency words at both the encoder and the decoder, where is a predefined threshold that varies among different datasets. This helps to encourage the model to learn aggressive copying and improves the performance of question generation.

3.2.2. GPT2-based ACS-aware question generation

Figure 3. The input representations we utilized for fine-tuning GPT-2 Transformer-based language model.

Pre-trained large-scale language models, such as BERT (Devlin et al., 2018), GPT2 (Radford et al., 2019) and XLNet (Yang et al., 2019), have significantly boosted the performance of a series of NLP tasks including text generation. They are mostly based on the Transformer architecture (Vaswani et al., 2017) and have been shown to capture many facets of language relevant for downstream tasks (Clark et al., 2019). Compared with Seq2Seq models which often generate text containing repeated words, the pre-trained language models acquire knowledge from large-scale training dataset and are able to generate text of high-quality. In our work, to produce questions with better quality and to compare with Seq2Seq-based models, we further fine-tune the publicly-available pre-trained GPT2-small model (Radford et al., 2019) according to our problem settings.

Specifically, to obtain an ACS-question generation model with GPT2, we concatenate passage, answer, clue and question style as the context of language modeling. Specifically, the input sequence is organized in the form of “¡bos¿ ..passage text.. ¡clue¿ .. clue chunk .. ¡ans¿ .. answer chunk .. ¡style¿ .. question style .. ¡ques¿ .. question text .. ¡eos¿”. During the training process, we learn a language model with above input format. When generating, we sample different output questions by starting with an input in the form of “¡bos¿ ..passage text.. ¡clue¿ .. clue chunk .. ¡ans¿ .. answer chunk .. ¡style¿ .. question style .. ¡ques¿”. Figure 3 illustrates the input representation for fine-tuning GPT-2 language model. Similar to (Krishna and Iyyer, 2019), we leverage GPT-2’s segment embeddings to denote the specificity of the passage, clue, answer, style and question. We also utilize answer segment embeddings and clue segment embedding in place of passage segment embeddings at the location of the answer or clue in the passage to denote the position of the answer span and clue span. During the generation process, the trained model uses top-p nucleus sampling with (Holtzman et al., 2019) instead of beam search and top-k sampling. For implementation, we utilize the code base of (Krishna and Iyyer, 2019) as a starting point, as well as the Transformers library from HuggingFace (Wolf et al., 2019).

3.3. Sampling Inputs for Question Generation

As mentioned in Sec. 2, the process of ACS-aware question generation consists of input sampling and text generation. Given an unlabeled text corpus, we need to extract valid ¡passage, answer, clue, style¿ combinations as inputs to generate questions with an ACS-aware question generation model.

In our work, we decompose the sampling process into three steps to sequentially sample the candidate answer, style and clue based on a given passage. We make the following assumptions: i) the probability of a chunk be selected as an answer only depends on its Part-of-Speech (POS) tag, Named Entity Recognition (NER) tag and the length (number of words) of the chunk; ii) the style of the target question only depends on the POS tag and NER tag of the selected answer ; and iii) the probability of selecting a chunk as clue depends on the POS tag and the NER tag of , as well as the dependency distance between chunk and . We calculate the length of the shortest path between the first word of and that of as the dependency distance. The intuition for the last designation is that a clue chunk is usually closely related to the answer, and is often copied or rephrased into the target question. Therefore, the dependency distance between a clue and an answer will not be large (Liu et al., 2019).

With above assumptions, we will have:


where represents the dependency distance between the first token of and that of .

(a) Joint Distribution of Answer NER and Answer Length
(b) Joint Distribution of Clue NER and Dependency Distance
(c) Joint Distribution of Answer NER and Question Style
Figure 4. The input join distributions we get using SQuAD1.1 training dataset as reference data.

The above conditional probabilistic distributions can be learned from an existing dataset (such as SQuAD), named as reference dataset. Given a reference dataset consisting of ¡passage, question, answer¿ triplets, first, we perform POS tagging, NER tagging, parsing and chunking on the input passages. Second, we recognize the clue and style information according to the steps described in Sec. 3.1, get the NER tag and the POS tag of both the answer chunk and the clue chunk, and calculate the dependency distance between the clue chunk and the answer chunk. Finally, we calculate the conditional distributions according to the extracted information.

We set the maximum length of an candidate answer to 30, and split the range of length into 10 bins of equal size to calculate . Similarly, we set the maximum dependency distance between a clue chunk and an answer chunk to be 20, and split the range of distance into 10 bins of equal size to calculate . Figure 4 shows the marginal distributions we get by utilizing the SQuAD1.1 training dataset as our reference data. The NER tagging 111The meaning of NER labels can be found at: is performed by spaCy (Honnibal and Montani, 2017), and the “UNK” tag means the chunk is not recognized as a named entity. From Figure 4(a), we can see that most of the answers are short, and a majority of them are entities such as person (PERSON), organization (ORG) or date (DATE). From Figure 4(b), we can see the syntactical dependency distance between a clue chunk and an answer is usually less than , which matches with our intuition that a clue shall be correlated with the answer so that it will be copied or rephrased in the target question. Finally, Figure 4(c) shows most of the questions are “What” style, and the followings are “Who”, “How” and “When”. The NER tags of answers are highly correlated with the style of questions. Also, we shall notice that the NER performance of spaCy is not perfect. Therefore, we may observe weird cases such as organization (ORG) matches with “Who”. Determining different conditional probabilities with a reference dataset instead of following pre-defined rules helps us to take such kind of noises into account.

After calculating the above distributions according to an reference dataset, we can sample different information given a passage . First, we get all candidate chunks by parsing and chunking over . Second, we sample a chunk as answer according to the normalized probability distribution over all chunks:


Third, we sample a question style over all possible questions styles by the normalized probability:


Finally, we sample a chunk as clue according to:


We can repeat the above steps for multiple times to get different inputs from the same passage and generate diverse questions. In our work, for each input passage, we sample different chunks as answer spans, different question styles for each answer, and different clues for each answer. In this way, we over-generate questions by sampling questions for each sentence.

3.4. Data Filtering for Quality Control

After sampled multiple inputs from each sentence, we can generate different questions based on the inputs. However, it is hard to ask each sentence 20 meaningful and different questions even given 20 different inputs derived from it, as the questions may be duplicated due to similar inputs, or the questions can be meaningless if the ¡answer, clue, style¿ combination is not reasonable. Therefore, we further utilize a filter to remove low-quality QA pairs.

We leverage an entailment model and a QA model based on BERT (Devlin et al., 2018). For the entailment model, as the SQuAD 2.0 dataset (Rajpurkar et al., 2018) contains unanswerable questions, we utilize it to train a classifier which tells us whether a pair of ¡question, answer¿ matches with the content in the input passage. For the question answering model, we fine-tuned another BERT-based QA model utilizing the SQuAD 1.1 dataset (Rajpurkar et al., 2016).

Given a sample ¡passage, question, answer¿, we keep it if this sample satisfies two criteria: first, it is classified as positive according to the BERT-based entailment model; second, the F1 similarity score between the gold answer span and the answer span predicted by BERT-based QA is above 0.9. Note that we do not choose to fine-tune a BERT-based QA model over SQuAD 2.0 to perform entailment and question answering at the same time. That is because we get better performance by separating the entailment step with the QA filtering step. Besides, we can utilize extra datasets from other entailment tasks to enhance the entailment model and further improve the data filter.

4. Evaluation

In this section, we compare our proposed ACS-aware question generation with answer-aware question generation models to show its benefits. We generate a large number of QA pairs from unlabeled text corpus using our models, and further perform a variety of evaluations to analyze the quality of the generated data. Finally, we test the performance of QA models trained on our generated dataset, and show the potential applications and future directions of this work.

4.1. Evaluating ACS-aware Question Generation

Datasets, Metrics and Baselines. We evaluate the performance of ACS-aware question generation based on the SQuAD dataset (Rajpurkar et al., 2016). It is a reading comprehension dataset which contains questions derived from Wikipedia articles, and the answer to every question is a segment of text from the corresponding reading passage. In our work, we use the data split proposed by (Zhou et al., 2017), where the input is the sentence that contains the answer. The training set contains samples, and the original dev set that contains samples is randomly split into a dev test and a test set of equal size. The average lengths (number of words) of sentences, questions and answers are , , and , respectively.

The performance of question generation is evaluated by the following metrics.

  • BLEU (Papineni et al., 2002). BLEU measures precision by how much the words in predictions appear in reference sentences. BLEU-1 (B1), BLEU-2 (B2), BLEU-3 (B3), and BLEU-4 (B4), use 1-gram to 4-gram for calculation, respectively.

  • ROUGE-L (Lin, 2004). ROUGE-L measures recall by how much the words in reference sentences appear in predictions using Longest Common Subsequence (LCS) based statistics.

  • METEOR (Denkowski and Lavie, 2014)

    . METEOR is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision.

We compare our methods with the following baselines.

  • PCFG-Trans (Heilman, 2011): a rule-based answer-aware question generation system.

  • SeqCopyNet (Zhou et al., 2018), NQG++ (Zhou et al., 2017), AFPA (Sun et al., 2018), seq2seq+z+c+GAN (Yao et al., 2018), and s2sa-at-mp-gsa (Zhao et al., 2018): answer-aware neural question generation models based on Seq2Seq framework.

  • NQG-Knowledge (Gupta et al., 2019), DLPH (Gao et al., 2018): auxiliary-information-enhanced question generation models with extra inputs such as knowledge or difficulty.

  • Self-training-EE (Sachan and Xing, 2018), BERT-QG-QAP (Zhang and Bansal, 2019), NQG-LM (Zhou et al., 2019a), CGC-QG (Liu et al., 2019) and QType-Predict (Zhou et al., 2019b): multi-task question generation models with auxiliary tasks such as question answering, language modeling, question type prediction and so on.

The reported performance of baselines are directly copied from their papers or evaluated by their published code on GitHub.

For our models, we evaluate the following versions:

  • CS2S-VR-A. Content separated Seq2Seq model with Vocabulary Reduction for Answer-aware question generation. In this variant, we incorporate content embeddings in word representations to indicate whether each word is a content word or a function word. Besides, we reduce the size of vocabulary by only keeping the top frequent words for encoder and decoder. In this way, low-frequency words are represented by its NER, POS embeddings and feature embeddings. We also add answer position embedding to indicate the answer span in input passages.

  • CS2S-AS. This model adds question style embedding to initialize decoder, without vocabulary reduction (vocabulary size is when we do not exploit vocabulary reduction).

  • CS2S-AC. The variant adds clue embedding in encoder to indicate the span of clue chunk.

  • CS2S-ACS. This variant adds both clue embedding in encoder and style embedding in decoder.

  • CS2S-VR-ACS. This is the fully featured model with answer, clue and style embedding, as well as vocabulary reduction.

  • GPT2-ACS. This is our fine-tuned GPT2-small model for ACS-aware question generation.

Table 1. Evaluation results of different models on SQuAD dataset.
Question is Well-formed No
Question is Relevant No
Answer is Correct No
Table 2. Human evaluation results about the quality of generated QA pairs.

Experiment Settings

. We implement our models in PyTorch 1.1.0

(Paszke et al., 2017) and Transformers 2.0.0 (Wolf et al., 2019), and train the model with two Tesla P40. We utilize spaCy (Matthew Honnibal, 2015) to perform dependency parsing and extract lexical features for tokens. For Seq2Seq-based models, we set word embeddings to be -dimensional and initialize them by GloVe, and set them trainable. The out-of-vocabulary words are initialized randomly. All other features are embedded to 16-dimensional vectors. The encoder is a single layer BiGRU with hidden size 512, and the decoder is a single layer undirected GRU with hidden size 512. We set dropout rate for the encoder, decoder, and the attention module. We train the models by Cross-Entropy loss for question generation and question copying, and perform gradient descent by the Adam (Kingma and Ba, 2014) optimizer with an initial learning rate , two momentum parameters are and respectively, and

. The mini-batch size for each update is set to 32 and model is trained for up to 10 epochs. Gradient clipping with range

is applied to Adam. Beam width is set to be for decoding. The decoding process stops when the ¡EOS¿ token (represents end-of-sentence) is generated.

For GPT2-ACS model, we fine-tune the GPT2-small model using SQuAD 1.1 training dataset from (Zhou et al., 2017). We fine-tune the model for 4 epochs with batch size 2, and apply top- nucleus sampling with when decoding. For BERT-based filter, we fine-tune the BERT-large-uncased model from HuggingFace (Wolf et al., 2019) with parameters suggested by (Wolf et al., 2019) for training on SQuAD 1.1 and SQuAD 2.0. Our code will be published for research purpose222

Main Results. Table. 1

compares our models with baseline approaches. We can see that our CS2S-VR-ACS achieves the best performance in terms of the evaluation metrics and outperforms baselines by a great margin. Comparing CS2S-VR-A with Seq2Seq-based answer-aware QG baselines, we can see that it outperforms all the baseline approaches in that category with the same input information (input passage and answer span). This is because that our content separation strategy and vocabulary reduction operation help the model to better learn what words to copy from the inputs. Comparing our ACS-aware QG models and variants with auxiliary-information-enhanced models (such as NQG-Knowledge and DLPH) and auxiliary-task-enhanced baselines (such as BERT-QG-QAP), we can see that the clue and style information helps to generate better results than models with knowledge or difficulty information. That is because our ACS-aware setting makes the question generation problem closer to one-to-one mapping, and greatly reduces the task difficulty.

Comparing GPT2-ACS with CS2S-VR-ACS, we can see that GPT2-ACS achieves better METEOR score, while CS2S-VR-ACS performs better over BLEU scores and ROUGE-L. That is because GPT2-ACS has no vocabulary reduction. Hence, the generated words are more flexible. However, metrics such as BLEU scores are not able to evaluate the quality of generated QA pairs semantically. Therefore, in the following section, we further analyze the quality of QA pairs generated by CS2S-VR-ACS and GPT2-ACS to identify their strengths and weaknesses.

4.2. Qualitative Analysis

After training our CS2S-VR-ACS and GPT2-ACS models, we generate large-scale ¡passage, question, answer¿ datasets from unlabeled text corpus. Specifically, we obtain the top English Wikipedia articles with Project Nayuki’s Wikipedia’s internal PageRanks. After that, we split each article in the corpus into sentences, and filter out sentences with lengths shorter than or longer than . Based on these sentences, we perform input sampling to sample ¡answer, clue, style¿ triplets for each sentence according to the steps described in Section 3.3, and feed them into our models to generate questions. After filtering, we create two datasets utilizing the two models, where each of the dataset contains around million generated questions.

We first evaluate the quality of the generated QA pairs via voluntary human evaluation. We asked graduate students to evaluate ¡passage, question, answer¿ samples: samples generated by the CS2S-VR-ACS model, samples generated by the GPT2-ACS model, and ground truth samples from the SQuAD 1.1 training dataset. All the samples are randomly shuffled, and each sample will be evaluated by 3 volunteers333The participation of the study was completely voluntary. The reviewers kindly offered their help as volunteers without being compensated in any form. There was no consequence for declining participating.. We collected responses of the following questionnaire:

  • Is the question well-formed? This is to check whether a given question is both grammatical and meaningful (Krishna and Iyyer, 2019). Workers will select yes, no, or understandable. The option understandable is selected if a question is not totally grammatically correct, but we can infer its meaning.

  • If the question is well-formed or understandable, is the question relevant to the passage? Workers will select yes if we can find the answer to the generated question in the passage.

  • If the question is relevant to the passage, is the answer actually a valid answer to the generated question? Workers will select yes, no or partially. The last option represents that the answer in our generated sample partially overlaps with the true answer in the passage.

Table 2 shows the evaluation results based on the sampled data. First, we can see that even the ground truth samples from the SQuAD dataset are not totally well-formed, relevant or answered correctly. That is because we only use the sentence which contains the answer span as the context. However, about 20% questions in SQuAD require paragraph-level context to be asked (Du et al., 2017). Second, of the questions generated by GPT2-ACS are well-formed or understandable, while the percentage is for the CS2S-VR-ACS model. We can see that although the BLEU scores of the GPT2-ACS model are lower than that of CS2S-VR-ACS, the results of GPT2-ACS are semantically better due to the knowledge learned from large-scale pre-training. Third, we can see that most of the questions generated by both CS2S-VR-ACS and GPT2-ACS are relevant to the input passages. Finally, most of the answers are also correct given the generated questions. This demonstrates the high quality of the question-answer pairs generated by our models.

Figure 5. Examples of the questions-answer pairs generated by our system.

Figure 5 is a running example to show the properties of generated questions. We can see that our ACS-aware question generation models are able to generate various questions based on different answers, question styles, and clue chunks. Compared with answer-aware question generation, our generation process is more controllable, and our generated questions are more diverse and of high quality. In some cases, the answer span does not match with the question. We further performed pilot user studies to analyze the bad cases in our generated samples. For each question that is not well-formed, we ask workers to label the weakness of it. The results of the study show that most of the errors are grammatically errors, type mismatches, meaningless, or incomplete information. For CS2S-VR-ACS, about of the bad cases are grammatically incorrect; of them have the problem of type mismatch, e.g., the generated question starts with “When” when asking questions about a person; of them are grammatically correct, but meaningless; and of the questions do not express their meaning explicitly due to missing words or fuzzy pronouns. Similarly, for GPT2-ACS, the percentages of the above four problems are (grammatically incorrect), (type mismatches), (meaningless) and (incomplete or fuzzy information). Note that the sum of these percentages does not equal to 1. That is because each question may be labeled with multiple types of weaknesses.

In order to reduce different errors and further improve the quality of generated questions, first, we need to incorporate the knowledge of natural language by methods such as large-scale pre-training. We can observe from Table 2 that most of the questions generated by GPT2-ACS are well-formed and grammatically correct. Second, we can utilize a better named entity recognition model to provide more accurate information about the named entity types of answer chunks. In this way, the type mismatch errors will be reduced. Last but not the least, the problem of semantic mismatching, meaningless, or information incompleteness can be reduced by training a better entailment model to enhance the data filter.

Experiments CS2S-VR-ACS GPT2-ACS
SQuAD 86.72 92.97 86.72 92.97
Generated 71.14 83.53 74.47 85.64
Generated + SQuAD 86.12 92.36 85.87 92.33
Table 3. Evaluating the question answering performance with different training datasets.

4.3. Applied to Question Answering

We also perform quality test of the generated question-answer pairs by applying them to downstream machine reading comprehension tasks. Specifically, we train different BERT-based question answering models based on the following settings:

  • SQuAD: in this experiment, we utilize the original SQuAD 1.1 training dataset to train a BERT-based QA model, and test the performance on the dev dataset. The performance is evaluated by exact match (EM) and F1-Score (F1) between predicted answer span and the true answer span (Rajpurkar et al., 2016).

  • Generated: in this experiment, we sample a training dataset from our generated questions, where the size is equal to the training dataset of SQuAD 1.1. Although our questions are generated from sentences, we utilize the paragraphs the sentences belong to as contexts when training QA models.

  • Generated + SQuAD: in this experiment, we combine the original SQuAD training dataset with our generated training dataset to train the BERT-based QA model.

For all the QA experiments, the configurations are the same except the training datasets.

Table 3 compares the performance of the resulting BERT-based QA models trained by above settings. Our implementation gives EM and F1 when trained on the original SQuAD dataset. In comparison, the model trained by our generated dataset gives and F1. When we combine the generated dataset with the SQuAD training set to train the QA model, the performance is not further improved. The results are reasonable. First, the generated dataset contains noises which will influence the performance. Second, simply increasing the size of training dataset will not always help with improving the performance. If most of the generated training samples are already answerable by the model trained over the original SQuAD dataset, they are not very helpful to further enhance the generalization ability of the model.

There are at least two methods to leverage the generated dataset to improve the performance of QA models. First, we can utilize curriculum learning algorithms (Bengio et al., 2009) to select samples during training. We can select samples according to the current state of the model and the difficulties of the samples to further boost up the model’s performance. Note that this requires us to remove the BERT-based QA model from our data filter, or set the threshold of filtering F1-score to be smaller. Second, similar to (Gao et al., 2018), we can further incorporate the difficulty information into our question generation models, and encourage the model to generate more difficult question-answer pairs. We leave these to our future works.

Aside from machine reading comprehension, our system can be applied to many other applications. First, we can utilize it to generate exercises for educational purposes. Second, we can utilize our system to generate training datasets for a new domain by fine-tuning it with a small amount of labeled data from that domain. This will greatly reduce the human effort when we need to construct a dataset for a new domain. Last but not the least, our pipeline can be adapted to similar tasks such as comment generation, query generation and so on.

5. Related Work

In this section, we review related works on question generation.

Rule-Based Question Generation. The rule-based approaches rely on well-designed rules or templates manually created by human to transform a piece of given text to questions (Heilman and Smith, 2010; Heilman, 2011; Chali and Hasan, 2015). The major steps include preprocessing the given text to choose targets to ask about, and generate questions based on rules or templates (Sun et al., 2018). However, they require creating rules and templates by experts which is extremely expensive. Also, rules and templates have a lack of diversity and are hard to generalize to different domains.

Answer-Aware Question Generation. Neural question generation models are trained end-to-end and do not rely on hand-crafted rules or templates. The problem is usually formulated as answer-aware question generation, where the position of answer is provided as input. Most of them take advantage of the encoder-decoder framework with attention mechanism (Serban et al., 2016; Du et al., 2017; Liu et al., 2019; Zhou et al., 2017; Song et al., 2018; Hu et al., 2018; Du and Cardie, 2018). Different approaches incorporate the answer information into generation model by different strategies, such as answer position indicator (Zhou et al., 2017; Liu et al., 2019), separated answer encoding (Kim et al., 2019), embedding the relative distance between the context words and the answer (Sun et al., 2018) and so on. However, with context and answer information as input, the problem of question generation is still a one-to-many mapping problem, as we can ask different questions with the same input.

Auxiliary-Information-Enhanced Question Generation. To improve the quality of generated questions, researchers try to feed the encoder with extra information. (Gao et al., 2018) aims to generate questions on different difficulty levels. It learns a difficulty estimator to get training data, and feeds difficulty as input into the generation model. (Krishna and Iyyer, 2019) learns to generate “general” or “specific” questions about a document, and they utilize templates and train classifier to get question type labels for existing datasets. (Hu et al., 2018) identifies the content shared by a given question and answer pair as an aspect, and learns an aspect-based question generation model. (Gupta et al., 2019) incorporates knowledge base information to ask questions. Compared with these works, our work doesn’t require extra labeling or training overhead to get the training dataset. Besides, our settings for question generation dramatically reduce the difficulty of the task, and achieve much better performance.

Multi-task Question Generation. Another strategy is enhancing question generation models with correlated tasks. Joint training of question generation and answering models has improved the performance of individual tasks (Tang et al., 2017; Tang et al., 2018; Wang et al., 2017; Sachan and Xing, 2018). (Liu et al., 2019)

jointly predicts the words in input that is related to the aspect of the targeting output question and will be copied to the question.

(Zhou et al., 2019b) predicts the question type based on the input answer and context. (Zhou et al., 2019a) incorporates language modeling task to help question generation. (Zhang and Bansal, 2019) utilizes question paraphrasing and question answering tasks to regularize the QG model to generate semantically valid questions.

6. Conclusion

In this paper, we propose ACS-aware question generation, a mechanism to generate question-answer pairs from unlabelled text corpus in a scalable way. By sampling meaningful tuples of clues, answers and question styles from the input text and use the sampled tuples to confine the way a question is asked, we have effectively converted the originally one-to-many question generation problem into a one-to-one mapping problem. We propose two neural network models for question generation from input passage given the selected clue, answer and question style, as well as discriminators to control the data generation quality.

We present extensive performance evaluation of the proposed system and models. Compared with existing answer-aware question generation models and models with auxiliary inputs or tasks, our ACS-aware QG model achieves significantly better performance, which confirms the importance of clue and style information. We further resorted to voluntary human evaluation to assess the quality of generated data. Results show that our model is able to generate diverse and high-quality questions even from the same input sentence. Finally, we point out potential future directions to further improve the performance of our pipeline.