Asking questions plays a vital role in both the growth of human beings and the improvement of artificial intelligence systems. As a dual task of question answering, question generation based on a text passage and a given answer has attracted much attention in recent years. One of the key applications of question generation is to automatically produce question-answer pairs to enhance machine reading comprehension systems (Du et al., 2017; Yuan et al., 2017; Tang et al., 2017; Tang et al., 2018). Another application is generating practice exercises and assessments for educational purposes (Heilman and Smith, 2010; Heilman, 2011; Danon and Last, 2017). In addition, question generation is important in conversational systems and chatbots such as Siri, Cortana, Alexa and Google Assistant, helping them to kick-start and continue a conversation with human users (Mostafazadeh et al., 2016).
Conventional methods for question generation rely on heuristic rules to perform syntactic transformations of a sentence to factual questions (Heilman, 2011; Chali and Hasan, 2015), following grammatical and lexical analysis. However, such methods require specifically crafted transformation and generation rules and generalize poorly. Recently, various neural network models have been proposed for question generation (Du et al., 2017; Zhou et al., 2017; Hu et al., 2018; Kim et al., 2018; Gao et al., 2018). These models formulate the question generation task as a sequence-to-sequence (Seq2Seq) neural learning problem, where different types of encoders and decoders have been designed. Like many other text generation tasks, the copying or pointer mechanism (Gulcehre et al., 2016) is also widely adopted in question generation to handle the copy phenomenon between input passages and output questions, e.g., (Zhou et al., 2017; Yuan et al., 2017; Subramanian et al., 2017; Kim et al., 2018; Song et al., 2018). However, a common limitation of these question generation models is that they mainly use copying mechanisms to handle out-of-vocabulary (OOV) issues, and fail to mark a clear boundary between the set of question words that should be directly copied from the input text and those that should be generated instead.
In this paper, we generate questions by learning to identify where each word in the question should come from, i.e., whether it should be generated from a vocabulary or copied from the input text, and, given the answer and its context, which words can potentially be copied from the input passage. People usually repeat text chunks from an input passage when asking questions about it, and generate the remaining words from their own language to form a complete question. For example, in Fig. 1, given an input passage “Today, Barack Obama gives a speech on democracy in the White House” and an answer “Barack Obama”, we can ask a question “The speech in the White House is given by whom?” Here the text chunks “speech” and “in the White House” are copied from the input passage. However, when a vocabulary word appears in both a passage and a question, existing models do not clearly identify the reason for the overlap. In other words, whether this word is copied from the input or generated from the vocabulary is not properly labeled. For example, some approaches (Zhou et al., 2017) only take out-of-vocabulary (OOV) words that are shared by both the input passage and target question as copied words, and adopt a copying mechanism to learn to copy these OOV words into questions. In our work, given a passage and a question in a training dataset, we label a word as a copied word if it is a non-stop word shared by both the passage and the question, and its word frequency rank in the vocabulary is lower than a threshold. We further aggressively shortlist the output vocabulary based on frequency analysis of the non-copied question words (according to our labeling criteria). By combining this labeling strategy with target vocabulary reduction, our model can make better decisions about when to copy or generate during question generation, and predict what to generate more accurately.
After applying the above strategies, a remaining problem is which word we should choose to copy. For example, in Fig. 1, we can either ask the question “Who gives a speech today?” or the question “The speech in the White House is given by whom?”, where the two questions are related to two different copied text chunks “today” and “White House”. We can see that asking a question about a passage and a given answer is actually a “one-to-many” mapping problem: we can ask questions in different ways based on which subset of words we choose to copy from input. Therefore, how to enable a neural model to learn what to copy is a critical problem. To solve this issue, we propose to predict the potential clue words in input passages. A word is considered as a “clue word” if it is helpful to reduce the uncertainty of the model about how to ask a question or what to copy, such as “White House” in question 2 of Fig. 1. In our model, we directly use our previously mentioned copy word labeling strategy to assign a binary label to each input word to indicate whether it is a clue word or not.
To predict the potential clue words in input, we have designed a novel clue prediction model that combines the syntactic dependency tree of an input with Graph Convolutional Networks. The intuition underlying our model design is that: the words copied from an input sentence to an output question are usually closely related to the answer chunk, and the patterns of the dependency paths between the copied words and answer chunks can be captured via Graph Convolutional Networks. Combining the graphical representation of input sentences with GCN enables us to incorporate both word features, such as word vector representation, Named Entity (NE) tags, Part-of-Speech (POS) tags and so on, and the syntactic structure of sentences for clue words prediction.
To generate questions given an answer chunk and the predicted clue words distribution in an input sentence, we apply the Seq2Seq framework which contains our feature-rich sentence encoder and a decoder with attention and copy mechanism. The clue word prediction results are incorporated into the encoder by a binary feature embedding. Based on our multi-task labeling strategy, i.e., labeling for both clue prediction and question generation, we jointly learn different components of our model in a supervised manner.
We performed extensive evaluation on two large question answering datasets: the SQuAD dataset v1.1 (Rajpurkar et al., 2016) and the NewsQA dataset (Trischler et al., 2016). For each dataset, we use the answer chunk and the sentence that contains the answer chunk as input, and try to predict the question as output. We compared our model with a variety of existing rule-based and neural network based models. Our approach achieves significant improvement by jointly learning the potential clue word distribution for copying and the encoder-decoder framework for generation, and outperforms state-of-the-art approaches.
2. Problem Definition and Motivation
In this section, we formally introduce the problem of question generation, and illustrate the motivation behind our work.
2.1. Answer-aware Question Generation
Let us denote a passage by $P$, a question related to this passage by $Q$, and the answer of that question by $A$. A passage can be either an input sentence or a paragraph, based on different datasets. A passage consists of a sequence of words $P = \{p_i\}_{i=1}^{|P|}$, where $|P|$ denotes the length of $P$. A question $Q = \{q_i\}_{i=1}^{|Q|}$ contains words from either a predefined vocabulary $V$ or from the input text $P$. The task is finding the most probable question $\hat{Q}$ given an input passage and an answer:
$$\hat{Q} = \arg\max_{Q} \Pr(Q \mid P, A). \tag{1}$$
Fig. 3 shows an example of the dataset used in our paper. Note that the answer of the question is limited to sub spans of the input passage. However, our work can be easily adapted to the cases where the answer is not a sub span of the input passage by adding an extra answer encoder.
2.2. What to Ask: Clue Word Prediction
Even given the answer of a desired question, the task of question generation is still not a one-to-one mapping, as there are potentially multiple aspects related to the given answer. For example, in Fig. 2, given the answer chunk “Barack Obama”, the questions “The speech in the White House is given by whom?”, “Who gives a speech on democracy?”, and “Today who gives a speech?” are all valid questions. As we can see, although these questions share the same answer, they can be asked in different ways based on which word or phrase we choose to copy (e.g., “White House”, “democracy”, or “today”).
To resolve this issue, we propose to predict the potential “clue words” in the input passage. A “clue word” is defined as a word or phrase that is helpful to guide the way we ask a question, such as “White House”, “democracy”, and “today” in Fig. 2. We design a clue prediction module based on the syntactic dependency tree representation of passages and Graph Convolutional Networks. Graph Convolutional Networks generalize Convolutional Neural Networks to graph-structured data, and have been successfully applied to natural language processing tasks, including semantic role labeling (Marcheggiani and Titov, 2017), document matching (Liu et al., 2018; Zhang et al., 2018a), relation extraction (Zhang et al., 2018b), and so on. The intuition underlying our model design is that clue words will be more closely connected to the answer in dependency trees, and the syntactic connection patterns can be learned via graph convolutional networks. For example, in Fig. 2, the graph path distances between the answer “Barack Obama” and “democracy”, “White House”, and “today” are , , and , while the corresponding word order distances between them are , , and , respectively.
2.3. How to Ask: Copy or Generate
Another important problem in question generation is when to choose a word from the target vocabulary and when to copy a word from the input passage during the generation process. People tend to repeat or paraphrase some text pieces in a given passage when they ask a question about it. For instance, in Fig. 3, in the question “Who along with Russia supported post WW-II communist movements”, the text pieces “supported”, “post”, and “communist movements” repeat the input passage, while “Russia” and “WW-II” are synonymous replacements of the input phrases “The Soviet Union” and “World War II”. Therefore, the copied words in questions should not be restricted to out-of-vocabulary words. Although this phenomenon is well known to existing approaches, they do not properly and explicitly distinguish whether a word is copied or generated.
In our model, we consider the non-stop words that are shared by both an input passage and the output question as clue words, and encourage the model to copy clue words from the input. After labeling clue words, we train a GCN-based clue predictor to learn whether each word in the source text can be copied to the target question. The predicted clue words are further fed into a Seq2Seq model with attention and copy mechanism to help with question generation. In addition, unlike existing approaches that share a vocabulary between source passages and target questions, we make the target vocabulary of questions smaller than that of source passages and only include words with word frequency higher than a threshold. The idea is that the low-frequency words in questions are usually copied from the source text rather than generated from a vocabulary. By combining the above strategies, our model is able to learn when to generate or copy a word, as well as which words to copy or generate during the process of question generation. We will describe our operations in more detail in the next section.
3. Model Description
In this section, we introduce our proposed framework in detail. Similar to (Zhou et al., 2017; Serban et al., 2016; Du et al., 2017), our question generator is based on an encoder-decoder framework, with attention and copying mechanisms incorporated. Fig. 4 illustrates the overall architecture and detailed components in our model. Our model consists of three components: the clue word predictor, passage encoder and question decoder.
The clue word predictor predicts potential clue words that may appear in a target question, based on the specific context of the input passage (without knowing the target question). It utilizes a syntactic dependency tree to reveal the relationship of answer tokens relative to other tokens in a sentence. Based on the tree representation of the input passage, we predict the distribution of clue words by a GCN-based encoder applied on the tree and a Straight-Through (ST) Gumbel-Softmax estimator (Jang et al., 2016) for clue word sampling. The outputs of the clue word predictor are fed to the passage encoder to advise the encoder about what words may potentially be copied to the target question.
The passage encoder incorporates both the predicted clue word distribution and a variety of other feature embeddings of the input words, such as lexical features, answer position indicators, etc. Combined with a proposed low-frequency masking strategy (a shortlist strategy to reduce complexity on tuning input word embeddings), our encoder learns to better capture the useful input information with fewer trainable parameters.
Finally, the decoder jointly learns the probabilities of generating a word from vocabulary and copying a word from the input passage. At the decoder side, we introduce a multitask learning strategy to intentionally encourage the copying behavior in question generation. This is achieved by explicitly labeling the copy gate with a binary variable when a (non-stop) word in the question also appears in the input passage in the training set. Other multi-task labels are also incorporated to accurately learn when and what to copy. We further shortlist the target vocabulary based on the frequency distribution of non-overlap words. These strategies, assisted by the encoded features, help our model to clearly learn which word in the passage is indeed a clue word to be copied into the target question, while generating other non-clue words accurately.
The entire model is trained end-to-end via multitask learning, i.e., by minimizing a weighted sum of losses associated with different labels. In the following, we first introduce the encoder and decoder, followed by a description of the clue word predictor.
3.1. The Passage Encoder with Masks
Our encoder is based on a bidirectional Gated Recurrent Unit (BiGRU) (Chung et al., 2014), taking word embeddings, answer position indicators, lexical and frequency features of words, as well as the output of the clue word predictor as the input. Specifically, for each word $p_i$ in input passage $P$, we concatenate the following features to form a concatenated representation to be input into the encoder:
Word Vector. We initialize each word vector by GloVe embedding (Pennington et al., 2014). If a word is not covered by GloVe, we initialize its word vector randomly.
Binary Features. We check whether each word is lowercase or not, whether it is a digit or like a number (such as word “three”), and embed these features by vectors.
Answer Position. Similar to (Zhou et al., 2017), we utilize the B/I/O tagging scheme to label the position of a given answer, where a word at the beginning of an answer is marked with B, I denotes the continuation of the answer, while words not contained in an answer are marked with O.
Word Frequency Feature. We derive the word vocabulary from passages and questions in the training dataset. We then rank all the words in descending order of word frequency, such that the first word is the most frequent word. The top $r_h$ words are labeled as frequent words. Words ranked between $r_h$ and $r_l$ are labeled as intermediate words. The remaining words, with rank lower than $r_l$, are labeled as rare words, where $r_h$ and $r_l$ are two predefined thresholds. In this way, each word is assigned a frequency tag: H (highly frequent), M (medium frequency), or L (low frequency).
Clue Indicator Feature. In our model, the clue predictor (which we will introduce in more detail in Sec. 3.3) assigns a binary value to each word to indicate whether it is a potential clue word or not.
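As an illustration of two of the features above, the following minimal sketch computes B/I/O answer position tags and H/M/L frequency tags for a tokenized passage. The function names, the 0-based rank convention, and the threshold names `r_h` and `r_l` are assumptions made for this example rather than details taken from our implementation.

```python
from collections import Counter


def bio_tags(num_tokens, answer_start, answer_end):
    """Return B/I/O tags for an answer span [answer_start, answer_end)."""
    tags = ["O"] * num_tokens
    for i in range(answer_start, answer_end):
        tags[i] = "B" if i == answer_start else "I"
    return tags


def build_rank(corpus_tokens):
    """Map each word to its 0-based frequency rank (0 = most frequent)."""
    counts = Counter(corpus_tokens)
    # Ties are broken alphabetically so the ranking is deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(ordered)}


def frequency_tags(tokens, rank, r_h, r_l):
    """Tag tokens as H (rank < r_h), M (r_h <= rank < r_l) or L (otherwise).

    Words missing from the rank table are treated as rare (an assumption).
    """
    tags = []
    for tok in tokens:
        r = rank.get(tok)
        if r is not None and r < r_h:
            tags.append("H")
        elif r is not None and r < r_l:
            tags.append("M")
        else:
            tags.append("L")
    return tags
```

Each tag sequence would then be mapped to an embedding and concatenated with the word vector of the corresponding token.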
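Denote an input passage by $P = (p_1, \ldots, p_{|P|})$, where $p_i$ here stands for the concatenated feature representation of the $i$-th word. The BiGRU encoder reads the input sequence $p_1, \ldots, p_{|P|}$ and produces a sequence of hidden states $h_1, \ldots, h_{|P|}$ to represent the passage $P$. Each hidden state is a concatenation of a forward representation and a backward representation:
$$h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i], \tag{2}$$
$$\overrightarrow{h}_i = \mathrm{BiGRU}(p_i, \overrightarrow{h}_{i-1}), \tag{3}$$
$$\overleftarrow{h}_i = \mathrm{BiGRU}(p_i, \overleftarrow{h}_{i+1}), \tag{4}$$
where $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are the forward and backward hidden states of the $i$-th token in $P$, respectively.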
Furthermore, rather than learning the full representation for every word in the vocabulary, we use a masking strategy to replace the word embeddings of low-frequency words with that of a special <l> token, such that the information of a low-frequency word is represented only by its answer/clue indicators and other augmented feature embeddings, not by its word vector. This strategy can improve performance for two reasons. First, the augmented tagging features of a low-frequency word tend to be more influential than the word meaning in question generation. For example, given a sentence “<PERSON> likes playing football.”, a question that can be generated is “What does <PERSON> like to play?”; what the token “<PERSON>” exactly is does not matter. In this way, the masking strategy helps the model to omit the fine details that are not necessary for question generation. Second, the number of parameters that need to be learned, especially the number of word embeddings that need to be tuned, is largely reduced. Therefore, masking out unnecessary word embeddings while only keeping the corresponding augmented features and indicators does not hurt the performance of question generation. It actually improves training by reducing the model complexity.
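The masking step can be sketched in a few lines. This is only an illustration under assumed conventions: `rank` is a hypothetical word-to-rank table (0 = most frequent), `r_l` is an assumed name for the low-frequency cutoff, and unseen words are treated as low-frequency.

```python
LOW_FREQ_TOKEN = "<l>"  # special token whose embedding is shared by all rare words


def mask_low_frequency(tokens, rank, r_l):
    """Replace every token whose frequency rank is r_l or beyond with <l>.

    Only the token used for the word-embedding lookup is replaced; the other
    features of the word (answer position, clue indicator, ...) are computed
    from the original token and are kept unchanged.
    """
    return [tok if rank.get(tok, r_l) < r_l else LOW_FREQ_TOKEN for tok in tokens]
```

With this scheme the embedding table only needs entries for the `r_l` most frequent words plus the special token.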
3.2. The Question Decoder with Aggressive Copying
In the decoding stage, we utilize another GRU with copying mechanism to generate question words sequentially based on the encoded input passage representation and previously decoded words. We first initialize the hidden state of the decoder GRU by passing the last backward encoder hidden state $\overleftarrow{h}_1$ to a linear layer:
$$s_0 = \tanh(W_0 \overleftarrow{h}_1 + b). \tag{5}$$
For each decoding time step $t$, the GRU reads the embedding of the previous word $w_{t-1}$, the previous attentional context vector $c_{t-1}$, and its previous hidden state $s_{t-1}$ to calculate its current hidden state:
$$s_t = \mathrm{GRU}(s_{t-1}, w_{t-1}, c_{t-1}). \tag{6}$$
The context vector $c_t$ for time step $t$ is generated through the concatenated attention mechanism (Luong et al., 2015). The attention mechanism calculates a semantic match between encoder hidden states and the decoder hidden state. The attention weights indicate how the model distributes its focus over the different encoder hidden states during decoding. At time step $t$, the attention weights and the context vector are calculated as:
$$e_{t,i} = v^{\top} \tanh(W_h h_i + W_s s_t), \tag{7}$$
$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{|P|} \exp(e_{t,j})}, \tag{8}$$
$$c_t = \sum_{i=1}^{|P|} \alpha_{t,i} h_i. \tag{9}$$
Combining the previous word embedding $w_{t-1}$, the current decoder state $s_t$ and the current context vector $c_t$, we can calculate a readout state $r_t$ by an MLP maxout layer with dropouts (Goodfellow et al., 2013). Then the readout state is passed to a linear layer and a softmax layer to predict the probabilities of the next word over the decoder vocabulary:
$$r_t = W_r w_{t-1} + U_r c_t + V_r s_t, \tag{10}$$
$$m_t = [\max\{r_{t,2j-1}, r_{t,2j}\}]^{\top}_{j=1,\ldots,d}, \tag{11}$$
$$p(y_t \mid y_1, \ldots, y_{t-1}) = \mathrm{softmax}(W_o m_t), \tag{12}$$
where $r_t$ is a $2d$-D vector.
The above module generates question words from a given vocabulary. Another important method to generate words is copying from source text. The copy or pointer mechanism (Gulcehre et al., 2016) was introduced into sequence-to-sequence models to allow the copying of unknown words from the input text, which solves the out-of-vocabulary (OOV) problem. When decoding at time step $t$, the probability of copying is given by:
$$g_c = \sigma(W_{cs} s_t + W_{cc} c_t + b), \tag{13}$$
where $\sigma$ is the sigmoid function and $g_c$ is the probability of copying. For the copy probability of each input word, we reuse the attention weights given by Equation (8).
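To make the interaction between the attention weights and the copy gate concrete, the sketch below normalizes attention scores with a softmax and then combines the generation and copy distributions. The mixture form, (1 - g_c) times the vocabulary distribution plus g_c times the attention weights, is the common pointer-style combination and is an assumption here; the text above only states that the attention weights are reused as copy probabilities. All function and variable names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_copy_and_generate(p_vocab, attention, passage_tokens, g_c):
    """Combine generation and copy distributions with copy probability g_c.

    p_vocab:        dict word -> probability from the decoder softmax
    attention:      attention weights over passage positions (sums to 1)
    passage_tokens: passage words aligned with the attention weights
    """
    final = {w: (1.0 - g_c) * p for w, p in p_vocab.items()}
    for tok, weight in zip(passage_tokens, attention):
        # A passage word may also be in the vocabulary or occur repeatedly,
        # so its copy mass is accumulated.
        final[tok] = final.get(tok, 0.0) + g_c * weight
    return final
```

A word that is absent from the decoder vocabulary can still receive probability mass through the copy term, which is what allows rare passage words to appear in the question.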
Copying mechanism has also been used in question generation (Zhou et al., 2017; Song et al., 2018; Hu et al., 2018), however, mainly to solve the OOV issue. Here we leverage the copying mechanism to enable the copying of potential clue words, instead of being limited to OOV words, from input. Different from existing methods, we take a more aggressive approach to train the copying mechanism via multitask learning based on different labels. That is, when preparing the training dataset, we explicitly label a word in a target question as a word copied from the source passage text if it satisfies all the following criteria:
i) it appears in both source text and target question;
ii) it is not a stop word;
iii) its frequency rank in the vocabulary is lower than a predefined threshold.
The remaining words in the question are considered as being generated from the vocabulary. Such a binary label (copy or not copy), together with which input word the question word is copied from, as well as the target question, are fed into different parts of the decoder as labels for multi-task model training. This is to intentionally encourage the copying of potential clue words into the target question rather than generating them from a vocabulary.
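The three labeling criteria can be sketched as follows. This is a minimal illustration, not our actual preprocessing code: the stop-word list is a toy set, `rank` is a hypothetical word-to-rank table (0 = most frequent), and "rank lower than a threshold" is interpreted as a rank index at or beyond `threshold`.

```python
# Toy stop-word list for illustration only (an assumption).
STOP_WORDS = {"the", "a", "is", "in", "by", "on", "who", "whom"}


def label_copied_words(passage, question, rank, threshold, stop_words=STOP_WORDS):
    """Return a 0/1 copy label for every question token.

    A token is labeled 1 (copied) only if it
      i)   appears in both the passage and the question,
      ii)  is not a stop word, and
      iii) is infrequent, i.e. its rank index is >= threshold
           (words missing from the rank table count as infrequent).
    """
    passage_set = set(passage)
    labels = []
    for word in question:
        shared = word in passage_set
        infrequent = rank.get(word, float("inf")) >= threshold
        labels.append(1 if shared and word not in stop_words and infrequent else 0)
    return labels
```

For instance, with the passage and question from Fig. 1 (lowercased) and a toy rank table in which “speech” is frequent, only “white” and “house” are labeled as copied; lowering the threshold also marks “speech” as copied.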
The intuition behind such an aggressive copying mechanism can be understood by checking the frequency distributions of both generated words and copied words (as defined above) in the training dataset of SQuAD. Fig. 5 shows the frequency distributions of all question words, and then of generated question words and copied question words. The mean and median ranks of generated words are 2389 and 1032, respectively, while for copied words they are 3119 and 1442. Comparing Fig. 5(b) with Fig. 5(c), we can see that question words generated from the vocabulary tend to be clustered toward high-ranked (or frequent) words. On the other hand, the fraction of low-ranked (or infrequent) words among copied words is much greater than that among generated words. This means the generated words are mostly frequent words, while the majority of low-frequency words in the long tail are copied from the input, rather than generated.
This phenomenon matches our intuition: when people ask a question about a given passage, the generated words tend to be commonly used words, while the repeated text chunks from the source passage may contain relatively rare words such as names and dates. Based on this observation, we further propose to reduce the decoder vocabulary for target question generation to the top $N$ most frequently generated words, where $N$ is a predefined threshold that varies according to datasets.
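A minimal sketch of this vocabulary reduction is given below. It assumes that copy labels (1 = copied, 0 = generated) have already been assigned to every question token; the function name and the parameter `top_n` are illustrative.

```python
from collections import Counter


def shortlist_decoder_vocab(questions, copy_labels, top_n):
    """Keep the top_n most frequent question words that were labeled as generated.

    questions:   list of tokenized questions
    copy_labels: list of 0/1 label lists aligned with the questions
    """
    counts = Counter()
    for tokens, labels in zip(questions, copy_labels):
        counts.update(w for w, copied in zip(tokens, labels) if copied == 0)
    # Ties are broken alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return ordered[:top_n]
```

Words that only ever occur as copied words never enter the decoder vocabulary, so the decoder softmax is computed over a much smaller set.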
3.3. A GCN-Based Clue Word Predictor
Given a passage and some answer positions, our clue word predictor aims to identify the clue words in the passage that can help to ask a question and are also potential candidates for copying, by understanding the semantic context of the input passage. Again, in the training dataset, the non-stop words that are shared by both an input passage and an output question are aggressively labeled as clue words.
We note that clue words are, in fact, more closely connected to the answer chunk in the form of syntactic dependency trees than in word sequences. Fig. 6 shows our observation on the SQuAD dataset. For each training example, we get the nonstop words that appear in both the input passage and the output question. For each clue word in the training set, we find its shortest undirected path to the answer chunk based on the dependency parsing tree of the passage. For each jump on the shortest path, we record the dependency type. We also calculate the distance between each clue word and the answer in terms of the number of words between them. As shown in Fig. 6(a), prep, pobj, and nsubj appear frequently on these shortest paths. Comparing Fig. 6(b) with Fig. 6(c), we can see that the distances in terms of dependency trees are much smaller than those in terms of sequential word orders. The average and median distances on the dependency trees are and , while those values are and for sequential word distances.
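The two distance measures compared above can be computed as in the following sketch. The dependency edges are a hand-written, hypothetical parse of the example sentence (not the output of a parser and not necessarily the tree shown in our figures), and the sequential distance is taken as the difference in token positions, which is one possible convention.

```python
from collections import deque


def tree_distance(edges, source, targets):
    """Length of the shortest undirected path from source to any target node."""
    adjacency = {}
    for head, child in edges:
        adjacency.setdefault(head, set()).add(child)
        adjacency.setdefault(child, set()).add(head)
    targets = set(targets)
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node in targets:
            return dist
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # source is not connected to the answer


def word_distance(index, answer_indices):
    """Sequential distance: smallest position difference to an answer token."""
    return min(abs(index - a) for a in answer_indices)


# Tokens: 0 Today, 1 Barack, 2 Obama, 3 gives, 4 a, 5 speech, 6 on,
#         7 democracy, 8 in, 9 the, 10 White, 11 House
# Hypothetical (head, child) dependency edges for illustration.
EDGES = [(3, 0), (3, 2), (2, 1), (3, 5), (5, 4), (5, 6),
         (6, 7), (3, 8), (8, 11), (11, 9), (11, 10)]
ANSWER = {1, 2}  # "Barack Obama"
```

Under this toy parse, “House” is three hops from the answer on the tree even though it is nine positions away in the word sequence.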
In order to predict the positions of clue words based on their dependencies on the answer chunk (without knowing the question which is yet to be generated), we use a Graph Convolutional Network (GCN) to convolve over the word features on the dependency tree of each passage, as shown in Fig. 4. The predictor consists of four layers:
Embedding layer. This layer shares the same features with the passage encoder, except that it does not include the clue indicators. Therefore, each word is represented by its word embedding, lexical features, binary features, the word frequency feature, and an answer position indicator.
Syntactic dependency parsing layer. We obtain the syntactic dependency parsing tree of each passage by spaCy (Matthew Honnibal, 2015), where the dependency edges between words are directed. In our model, we use the syntactic structure to represent the structure of passage words.
Encoding layer. The objective of the encoding layer is to encode the context information into each word based on the dependency tree. We utilize a multi-layered GCN to incorporate the information of neighboring word features into each vertex, which is a word. After $L$ GCN layers, the hidden vector representation of each word will incorporate the information of its neighboring words that are no more than $L$ hops away in the dependency tree.
Output layer. After obtaining the context-aware representation of each word in the passage, we calculate the probability of each word being a clue word. A linear layer is utilized to calculate the unnormalized probabilities. We subsequently sample the binary clue indicator for each word through a Gumbel-Softmax layer, given the unnormalized probabilities. A sampled value of 1 indicates that the word is predicted as a clue word.
3.3.1. GCN Operations on Dependency Trees
We now introduce the operations performed in each GCN layer (Kipf and Welling, 2016; Zhang et al., 2018b). GCNs generalize the CNN from low-dimensional regular grids to high-dimensional irregular graph domains. In general, the input to a GCN is a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $N$ vertices $v_i \in \mathcal{V}$, and edges $e_{ij} = (v_i, v_j) \in \mathcal{E}$. The edges can be weighted with weights $w_{ij}$, or unweighted. The input also contains a vertex feature matrix denoted by $X = \{x_i\}_{i=1}^{N}$, where $x_i$ is the feature vector of vertex $v_i$.
Since a dependency tree is also a graph, we can perform the graph convolution operations on dependency trees by representing each tree in its corresponding adjacency matrix form. Now let us briefly introduce the GCN propagation layers used in our model (Zhang et al., 2018b). The weighted adjacency matrix of the graph is denoted as $A \in \mathbb{R}^{N \times N}$, where $A_{ij} = w_{ij}$. For unweighted graphs, the weights are either 1 or 0. In an $L$-layer GCN, let $h_i^{(l-1)}$ denote the input vector and $h_i^{(l)}$ denote the output vector of node $i$ at the $l$-th layer. We utilize a multi-layer GCN with the following layer-wise propagation rule (Zhang et al., 2018b):
$$h_i^{(l)} = \sigma\Big(\sum_{j=1}^{N} \tilde{A}_{ij} W^{(l)} h_j^{(l-1)} / \tilde{D}_{ii} + b^{(l)}\Big), \tag{14}$$
where $\sigma$ is a nonlinear function (e.g., ReLU), and $W^{(l)}$ is a linear transformation. $\tilde{A} = A + I$, where $I$ is an $N \times N$ identity matrix. $\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$ is the degree of node (or word in our case) $i$ in the graph (or dependency tree).
In our experiments, we treat the dependency trees as undirected, i.e., $A_{ij} = A_{ji}$ for all $i, j$. Besides, as we already included the dependency type information in the embedding vectors of each word, we do not need to incorporate the edge type information in the adjacency matrix.
Stacking this operation by $L$ layers gives us a deep GCN network. The inputs to the nodes in the first layer of the GCN are the feature vectors of words in the passage. After $L$ layers of transformations, we can obtain a context-aware representation of each word. We then feed them into a linear layer to get the unnormalized probability of each word being a clue word. After that, the unnormalized probabilities are fed to a Straight-Through (ST) Gumbel-Softmax estimator to sample an $N$-dimensional binary vector indicating whether each of the $N$ words is a clue word or not.
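The propagation step of a single GCN layer on an undirected, unweighted dependency tree can be sketched in plain Python as below. The sketch assumes the normalized form with self-loops and a ReLU nonlinearity; the function name, argument layout, and toy graph are illustrative and not taken from our implementation.

```python
def gcn_layer(adj, feats, weight, bias):
    """One GCN layer: h_i = ReLU(sum_j A~_ij (W h_j) / d_i + b), with A~ = A + I.

    adj:    N x N adjacency matrix (nested lists of 0/1)
    feats:  N input vectors, each of size d_in
    weight: d_out x d_in matrix W
    bias:   vector of size d_out
    """
    n = len(adj)
    d_out = len(weight)
    # Apply the linear transformation W to every node vector once.
    transformed = [
        [sum(w_row[k] * feats[j][k] for k in range(len(feats[j]))) for w_row in weight]
        for j in range(n)
    ]
    outputs = []
    for i in range(n):
        # Add a self-loop so each node keeps part of its own representation.
        a_tilde = [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        degree = sum(a_tilde)
        row = []
        for o in range(d_out):
            aggregated = sum(a_tilde[j] * transformed[j][o] for j in range(n)) / degree
            row.append(max(0.0, aggregated + bias[o]))  # ReLU
        outputs.append(row)
    return outputs
```

On a three-node path graph where only the first node carries a non-zero feature, the third node is still zero after one layer and becomes non-zero after two, which illustrates that $L$ stacked layers propagate information over at most $L$ hops.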
Gumbel-Softmax (Jang et al., 2016) is a method of sampling discrete random variables in neural networks. It approximates one-hot vectors sampled from a categorical distribution by making them continuous, so that the gradients of model parameters can be calculated using the reparameterization trick and standard backpropagation. The Gumbel-Softmax distribution is motivated by the Gumbel-Max trick (Maddison et al., 2014), an algorithm for sampling from a categorical distribution. Let $\pi$ denote a $K$-dimensional categorical distribution where the probability $p_i$ of class $i$ is defined as:
$$p_i = p(x = i) = \frac{\exp(\log \pi_i)}{\sum_{j=1}^{K} \exp(\log \pi_j)}, \tag{15}$$
where $\log \pi_i$ is the unnormalized log probability of class $i$. We can easily draw a one-hot sample $z = (z_1, \ldots, z_K)$ from the distribution $\pi$ by the following equations:
$$z_i = \begin{cases} 1, & i = \arg\max_{j} (\log \pi_j + g_j), \\ 0, & \text{otherwise}, \end{cases} \tag{16}$$
$$g_i = -\log(-\log(u_i)), \quad u_i \sim \mathrm{Uniform}(0, 1), \tag{17}$$
where $g_i$ is Gumbel noise used to perturb each $\log \pi_i$. In this way, taking $\arg\max$ is equivalent to drawing a sample using probabilities $p_i$.
The Gumbel-Softmax distribution replaces the $\arg\max$ function by the differentiable softmax function. Therefore, a sample $y = (y_1, \ldots, y_K)$ drawn from the Gumbel-Softmax distribution is given by:
$$y_i = \frac{\exp((\log \pi_i + g_i)/\tau)}{\sum_{j=1}^{K} \exp((\log \pi_j + g_j)/\tau)}, \tag{18}$$
where $\tau$ is a temperature parameter. The Gumbel-Softmax distribution resembles the one-hot sample when $\tau$ diminishes to zero.
The Straight-Through (ST) Gumbel-Softmax estimator (Jang et al., 2016) is a discrete version of the continuous Gumbel-Softmax estimator. It takes different paths in the forward and backward propagation. In the forward pass, it discretizes a continuous probability vector $y$ sampled from the Gumbel-Softmax distribution into a one-hot vector $z$ by:
$$z_i = \begin{cases} 1, & i = \arg\max_{j} y_j, \\ 0, & \text{otherwise}. \end{cases} \tag{19}$$
In the backward pass, it uses the continuous $y$, so that the error signal can still backpropagate.
Using the ST Gumbel-Softmax estimator, our model is able to sample a binary clue indicator vector for an input passage. Then the clue indicator vector is fed into the passage encoder for question generations, as shown in Fig. 4.
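The forward pass of this sampling procedure can be sketched in plain Python as follows. The sketch covers only the forward computation: in an actual model the backward pass would be handled by an automatic differentiation framework using the continuous sample. The function names, the two-class layout `[not-clue, clue]`, and the clamping constant are assumptions made for this example.

```python
import math
import random


def gumbel_softmax_sample(logits, tau, rng):
    """Draw a continuous Gumbel-Softmax sample from unnormalized log probabilities."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def straight_through(y):
    """Forward pass of the ST estimator: discretize y into a one-hot vector."""
    k = max(range(len(y)), key=lambda i: y[i])
    return [1 if i == k else 0 for i in range(len(y))]


def sample_clue_indicators(word_logits, tau=1.0, seed=0):
    """Sample a binary clue indicator per word from [not-clue, clue] logits."""
    rng = random.Random(seed)
    indicators = []
    for logits in word_logits:
        y = gumbel_softmax_sample(logits, tau, rng)
        indicators.append(straight_through(y)[1])
    return indicators
```

Lower temperatures make the continuous sample closer to one-hot, at the cost of higher-variance gradients during training.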
4. Evaluation
In this section, we evaluate the performance of our proposed models on the SQuAD dataset and the NewsQA dataset, and compare them with state-of-the-art question generation models.
4.1. Datasets, Metrics and Baselines
P-Length: average number of tokens of passages.
Q-Length: average number of tokens of questions.
A-Length: average number of tokens of answers.
The SQuAD dataset is a reading comprehension dataset, consisting of questions posed by crowd-workers on a set of Wikipedia articles, where the answer to every question is a segment of text from the corresponding reading passage. SQuAD 1.1 is used in our experiment, containing 536 Wikipedia articles and more than 100K question-answer pairs. When processing a sample from the dataset, instead of using the entire document, we take the sentence that contains the answer as the input. Since the test set is not publicly available, we use the data split proposed by (Zhou et al., 2017), where the original dev set is randomly split into a dev set and a test set of equal size.
In the NewsQA dataset, there are 120K questions and their corresponding answers, as well as the documents, which are CNN news articles. Questions are written by questioners in natural language with only the headlines and highlights of the articles available to them. With the information of the questions and the full articles, answerers select related sub-spans from the passages of the source text and mark them as answers. Multiple answers may be provided to the same question by different answerers, and they are ranked by validators based on the quality of the answers. In our experiment, we picked a subset of NewsQA where answers are top-ranked and are composed of a contiguous sequence of words within the input sentence of the document.
Table 1 shows the number of samples in each set and the average number of tokens of the input sentences, questions, and answers listed in columns P-Length, Q-Length, and A-Length respectively.
We report the evaluation results with the following metrics.
BLEU (Papineni et al., 2002). BLEU measures precision by how much the words in prediction sentences appear in reference sentences at the corpus level. BLEU-1, BLEU-2, BLEU-3, and BLEU-4, use 1-gram to 4-gram for calculation, respectively.
ROUGE-L (Lin, 2004). ROUGE-L measures recall by how much the words in reference sentences appear in prediction sentences using Longest Common Subsequence (LCS) based statistics.
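To illustrate what the two metrics measure, the sketch below computes a clipped unigram precision (the 1-gram component of BLEU, without the brevity penalty or corpus-level aggregation) and an LCS-based recall (the recall component of ROUGE-L) for a single sentence pair. These are simplified illustrations under those stated assumptions, not the official scoring scripts used for the reported numbers.

```python
from collections import Counter


def unigram_precision(reference, prediction):
    """Clipped unigram precision: fraction of predicted words found in the reference."""
    if not prediction:
        return 0.0
    ref_counts = Counter(reference)
    pred_counts = Counter(prediction)
    # Each predicted word is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[word]) for word, count in pred_counts.items())
    return matched / len(prediction)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_recall(reference, prediction):
    """LCS-based recall: fraction of the reference covered by the LCS."""
    if not reference:
        return 0.0
    return lcs_length(reference, prediction) / len(reference)
```

For the reference “who gives a speech today” and the prediction “who gives the speech”, three of the four predicted words appear in the reference, and the longest common subsequence covers three of the five reference words.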
In the experiments, we have eight baseline models for comparison. Results reported on PCFG-Trans, MPQG, and NQG++ are from experiments we conducted using published code on GitHub. For other baseline models, we directly copy the reported performance given in their papers. We report all the results on the SQuAD dataset, and for the NewsQA dataset, we can only report the baselines with open source code available.
MPQG (Song et al., 2018) proposed a Seq2Seq model that matches the answer with the passage before generating the question.
SeqCopyNet (Zhou et al., 2018) proposed a method to improve the copying mechanism in Seq2Seq models by copying not only a single word but a sequence of words from the input sentence.
seq2seq+z+c+GAN (Yao et al., 2018) proposed a model trained in a GAN framework that uses a latent variable to capture diversity and an observed variable to learn disentangled representations.
NQG++ (Zhou et al., 2017) proposed a Seq2Seq model with a feature-rich encoder to encode answer position, POS and NER tag information.
Answer-focused Position-aware model (Sun et al., 2018) incorporates the answer embedding to help generate an interrogative word matching the answer type. It also models the relative distance between the context words and the answer, so that the model is aware of the position of the context words when generating a question.
s2sa-at-mp-gsa (Zhao et al., 2018) proposed a model which contains a gated attention encoder and a maxout pointer decoder to address the challenges of processing long text inputs. This model has a paragraph-level version and a sentence-level version. For the purpose of fair comparison, we report the results of the sentence-level model to match with our settings.
ASs2s (Kim et al., 2018) proposed an answer-separated Seq2Seq to identify which interrogative word should be used by replacing the target answer in the original passage with a special token.
For our models, we evaluate the following versions:
CGC-QG (no feature-rich embedding). We name our model Clue Guided Copy for Question Generation (CGC-QG). In this variant, we only keep the embeddings of words, answer position indicators, and clue indicators for each token, and remove the embedding vectors of other features.
CGC-QG (no target reduction). This model variant does not contain the target vocabulary reduction operation.
CGC-QG (no clue prediction). The clue predictor and clue embedding are removed in this model variant.
CGC-QG. This is the complete version of our proposed model.
4.2. Experiment Settings
We implement our models in PyTorch 0.4.1 (Paszke et al., 2017) and train the model with a single Tesla P40. We utilize spaCy (Matthew Honnibal, 2015) to perform dependency parsing and extract lexical features for tokens. For the vocabulary, we collect words from the training dataset and keep the top 20K most frequent words for both datasets.
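The vocabulary construction described above can be sketched in a few lines. The special tokens and the function name are assumptions made for illustration; only the cutoff of 20K most frequent training words comes from the text.

```python
from collections import Counter


def build_vocab(tokenized_sentences, max_size=20000,
                specials=("<PAD>", "<UNK>", "<SOS>", "<EOS>")):
    """Map the max_size most frequent training tokens to integer ids.

    The special tokens are a hypothetical convention; they take the
    first ids, followed by words in order of decreasing frequency.
    """
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    words = [w for w, _ in counts.most_common(max_size)]
    return {w: i for i, w in enumerate(list(specials) + words)}
```

Any token outside the returned mapping would be encoded with the `<UNK>` id at lookup time.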
We set the threshold and . For target vocabulary reduction, we set . The embedding dimension of word vector is set to be and initialized by GloVe. The word vectors of words that are not contained in GloVe are initialized randomly. The word frequency features are embedded to 32-dimensional vectors, and other features and indicators are embedded to 16-dimensional vectors. All embedding vectors are trainable in the model. We use a single-layer BiGRU with hidden size 512 for the encoder, and a single-layer unidirectional GRU with hidden size 512 for the decoder. The dropout rate is applied to the encoder, decoder, and the attention module.
During training, we optimize the Cross-Entropy loss function for clue prediction, question generation, and question copying, and perform gradient descent by the Adam (Kingma and Ba, 2014) optimizer with an initial learning rate , two momentum parameters are and respectively, and . The mini-batch size for each update is set to 32 and model is trained for up to 10 epochs (as we found that usually the models derive the best performance after or epochs). We apply gradient clipping with range for Adam. Besides, exponential moving average is applied on all trainable variables with a decay rate . When testing, beam search is conducted with beam width . The decoding process stops when a token <EOS> that represents end of sentence is generated.
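Two of the training techniques mentioned above, gradient clipping by value and an exponential moving average (EMA) of the trainable variables, can be illustrated with a framework-free sketch. The class and function names are hypothetical, parameters are represented as plain floats in a dictionary, and the decay rate and clipping range are left as arguments because their values are not recoverable from the text.

```python
def clip_by_value(gradients, limit):
    """Clamp every gradient entry to the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in gradients]


class ExponentialMovingAverage:
    """Keeps a smoothed copy ("shadow") of each trainable variable."""

    def __init__(self, params, decay):
        self.decay = decay
        self.shadow = dict(params)  # start from the initial values

    def update(self, params):
        """Call once after every optimizer step."""
        for name, value in params.items():
            self.shadow[name] = (self.decay * self.shadow[name]
                                 + (1.0 - self.decay) * value)
```

At test time, the shadow values would typically be used in place of the raw parameters.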
4.3. Main Results
Table 2 and Table 3 compare the performance of our model with existing question generation models on SQuAD and NewsQA in terms of different evaluation metrics. For the SQuAD dataset, we compare our model with all the baseline methods we have listed. As to the NewsQA dataset, since only a part of the baseline methods made their code public, we compare our model with approaches that have open source code. We can see that our model achieves the best performance on both datasets and significantly outperforms state-of-the-art algorithms. On the SQuAD dataset, given an input sentence and an answer, the BLEU-4, ROUGE-L, and METEOR of our result are , , and respectively, while corresponding previous state-of-the-art results are , , and from different approaches. Similarly, our method also gives a significantly better performance on NewsQA compared with the baselines in Table 3.
The reason is that we combine different strategies in our model to make it learn when to generate or copy a word, and what to generate or copy. First, our model learns to predict clue words through a GCN-based clue predictor. Second, our encoder incorporates a variety of embeddings of different features and clue indicators. Combined with the masking strategy, our model can better discover the relationship between input patterns and output patterns. Third, the reduced target vocabulary also helps our model to better capture when to copy or generate, and the generator is easier to train with a reduced vocabulary size. Most importantly, our new criteria of marking a question word as a copied word (as described in Sec. 3.3) help the model to make better decisions on which path to take, i.e., to copy or to generate, during question generation. By incorporating part of these new strategies and modules into our model, we can achieve performance better than state-of-the-art models on SQuAD and NewsQA. With all these designs implemented, our model gives the best performance on both datasets.
There is a significant gap between the performance on SQuAD and on NewsQA due to the different characteristics of the datasets. The average answer length of NewsQA is 44.2% larger than that of SQuAD according to the statistics shown in Table 1. Long answers usually hold more information, and it is more difficult to generate questions for them. Furthermore, reference questions in NewsQA tend to have less strict grammar and more diverse phrasings. To give a typical example, “Iran criticizes who?” is a reference question in NewsQA which does not start with an interrogative word but ends with one. These characteristics make the performance on NewsQA not as good as on SQuAD. However, our approach is still significantly better than the compared approaches on the NewsQA dataset. This demonstrates that copying from the input is a general phenomenon across different datasets. Our model better captures which words in a question are copied and which are generated, based on our new criteria of labeling a question word as a copied word or not.
[Table 2: results on SQuAD. The table body is not recoverable; surviving row labels: Answer-focused Position-aware model, CGC-QG (no feature-rich embedding), CGC-QG (no target reduction), CGC-QG (no clue prediction).] Experiments are conducted on baselines followed by a “” using released source code. Results of other baselines are copied from their papers where unreported metrics are marked “”.
[Table 3: results on NewsQA. The table body is not recoverable; surviving row labels: CGC-QG (no feature-rich embedding), CGC-QG (no target reduction), CGC-QG (no clue prediction).] Experiments are conducted on baselines followed by a “” using released source code.
When we remove the extra feature embeddings in our model, i.e., the embeddings of POS, NER, dependency types, word frequency levels (low-frequency, medium-frequency, high-frequency), and binary features (whether the token is lowercase, whether it is a digit), the performance drops significantly. This is because the tags and feature embeddings represent each token from different aspects. The number of different tags is much smaller than the number of different words. Therefore, the patterns which can be learned from these tags and features are more obvious than what we can learn from word embeddings. Even though a well-trained word vector may contain the information of other features such as POS or NER, explicitly concatenating these feature embedding vectors helps the model to capture the patterns of asking a question more easily.
Removing the operation of target vocabulary reduction also hurts the performance of our model. As we discussed earlier, the non-overlap question words (or generated words) are mostly covered by the high-frequency words. Reducing the target vocabulary size helps our model to better learn the probabilities of generating these words. On the other hand, it also encourages the model to better capture what it can copy from the input text.
Finally, without the clue prediction module, the performance also drops on both datasets. This is because, even when given an answer span in a passage, asking a question about it is still a one-to-many mapping problem. By learning from a large training dataset, our clue prediction module learns how people select the related clue words, which further reduces the uncertainty of how to ask a question. With predicted clue indicators incorporated into the encoder of the generator, our model can fit the way people ask questions in the dataset.
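The idea of target vocabulary reduction can be sketched as follows: only the most frequent question words are kept as generation targets, and every other question word is mapped to an unknown token on the generation side, so that the model has to obtain it by copying. The function names, the `<UNK>` convention, and the `keep_top` value used in the test are assumptions for illustration, not the configuration of our model.

```python
from collections import Counter


def reduced_target_vocab(tokenized_questions, keep_top):
    """Return the keep_top most frequent question words."""
    counts = Counter(tok for q in tokenized_questions for tok in q)
    return {w for w, _ in counts.most_common(keep_top)}


def generation_targets(question, vocab, unk="<UNK>"):
    """Replace words outside the reduced vocabulary with unk.

    Positions marked unk are those the generator cannot produce and
    that would have to come from the copy path instead.
    """
    return [tok if tok in vocab else unk for tok in question]
```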
5. Related Work
In this section, we review related works on question generation and graph convolutional networks.
Existing approaches for question generation can be mainly classified into two classes: heuristic rule-based approaches and neural network-based approaches. The rule-based approaches rely on well-designed rules or templates manually created by humans to transform a piece of given text into questions (Heilman and Smith, 2010; Heilman, 2011; Chali and Hasan, 2015). However, they require experts to create the rules and templates, which is extremely expensive. Also, rules and templates lack diversity and are hard to generalize to different domains.
Compared with rule-based approaches, neural network-based models are trained end-to-end and do not rely on hand-crafted rules or templates. Most of the neural question generation models consider the task as Seq2Seq and take advantage of the encoder-decoder framework with attention mechanism. (Serban et al., 2016) utilizes an encoder-decoder framework with attention mechanism to generate factoid questions from FreeBase. (Du et al., 2017) generates questions from SQuAD passages based on a sequence-to-sequence model with attention mechanism.
However, given an input sentence, generating questions is a one-to-many mapping, as we can ask different questions from different aspects. Purely relying on a Seq2Seq model may not be able to learn such a one-to-many mapping (Gao et al., 2018). To resolve this issue, recent works assume the aspect is known when generating a question (Zhou et al., 2017; Yuan et al., 2017; Subramanian et al., 2017; Kim et al., 2018; Song et al., 2018) or can be detected by a third-party pipeline (Du and Cardie, 2018). (Zhou et al., 2017) enriches the sequence-to-sequence model with an answer position indicator to indicate if the current word is an answer word or not, and further incorporates a copy mechanism to copy words from the context when generating a question. (Song et al., 2018; Hu et al., 2018) fuse the answer information into the input sentence first, and apply a Seq2Seq model with attention and copy mechanism to generate answer-aware questions. (Gao et al., 2018) takes question difficulty into account by a difficulty estimator, and generates questions on different difficulty levels. (Sun et al., 2018) enriches the model with both answer embedding and the relative distance between the context words and the answer. (Tang et al., 2017; Tang et al., 2018) model question answering and question generation as dual tasks. They found that jointly training the two tasks helped to generate better questions.
We argue that even with an answer indicator, the problem of question generation is still a one-to-many mapping. To solve this problem, we enrich the model with a “clue word” predictor, where a clue word means a word that is related to the aspect of the target output question and is usually copied to the question. We represent an input sentence by its syntactic dependency tree and represent each word using feature-rich embeddings, and let the model learn to predict whether each word in the context can be a clue word or not. Combining the clue prediction module with a Seq2Seq model with attention and copy mechanism, our model learns when to generate a word from a target vocabulary and when to copy a word from the input context.
Graph Convolutional Networks generalize Convolutional Neural Networks to graph-structured data, and have grown rapidly in scope and popularity in recent years (Kipf and Welling, 2016; Defferrard et al., 2016; Liu et al., 2018; Marcheggiani and Titov, 2017; Battaglia et al., 2018). Here we focus on the applications of GCNs to natural language. (Marcheggiani and Titov, 2017) applies GCNs over syntactic dependency trees as sentence encoders, and produces latent feature representations of words in a sentence for semantic role labeling. (Liu et al., 2018) matches long document pairs using graph structures, and classifies the relationships of two documents by GCN. (Zhang et al., 2018b) proposes an extension of graph convolutional networks that is tailored for relation extraction. It pools information over dependency trees efficiently in parallel. In our paper, we apply GCN over the dependency tree of an input sentence, and predict the potential clue words in the sentence together with a Gumbel-Softmax estimator.
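As an illustration of the two building blocks named above, the following framework-free sketch shows one graph convolution over a dependency tree and one Gumbel-Softmax draw for a single token. It makes several simplifying assumptions: the tree is treated as an undirected, unlabeled graph with self-loops and the symmetric normalization of (Kipf and Welling, 2016), weights are given rather than learned, and all function names are hypothetical. Without automatic differentiation, the straight-through trick itself (hard sample in the forward pass, gradients of the soft probabilities in the backward pass) can only be indicated by returning both vectors.

```python
import math
import random


def normalized_adjacency(num_nodes, edges):
    """D^-1/2 (A + I) D^-1/2 for an undirected graph with self-loops."""
    a = [[1.0 if i == j else 0.0 for j in range(num_nodes)]
         for i in range(num_nodes)]
    for head, dep in edges:  # (head index, dependent index) pairs
        a[head][dep] = a[dep][head] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]


def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)]
            for row in x]


def gcn_layer(adj, features, weight):
    """One graph convolution: ReLU(adj @ features @ weight)."""
    hidden = matmul(matmul(adj, features), weight)
    return [[max(0.0, v) for v in row] for row in hidden]


def gumbel_softmax_sample(logits, temperature=1.0, rng=random):
    """Return (hard one-hot sample, soft probabilities) for one token."""
    # Gumbel(0, 1) noise; the clamp avoids log(0).
    noise = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    noisy = [(l + g) / temperature for l, g in zip(logits, noise)]
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    winner = soft.index(max(soft))
    hard = [1.0 if i == winner else 0.0 for i in range(len(soft))]
    return hard, soft
```

In a clue predictor of this kind, the per-token logits fed to the sampler would be computed from the GCN output, with two classes per token (clue or not a clue).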
In this paper, we demonstrate the effectiveness of teaching the model to make decisions during the question generation process on which words to generate and which to copy. We label the non-stop words that overlap between input passages and questions as copy targets and use such labels to train our model. Besides, we further observe that the generated question words are mostly common words with relatively high frequency. Based on this observation, we reduce the vocabulary size for generating question words. To help the model better capture how to ask a question and alleviate the issue of one-to-many mapping when asking a question, we propose a GCN-based clue prediction module to predict which words can be clue words for asking a question given an answer. It utilizes the syntactic dependency tree representation of a passage to encode the information of each token in the passage, and samples a clue indicator for each token using a Straight-Through (ST) Gumbel-Softmax estimator. Our experimental results on the SQuAD and NewsQA datasets show that our model significantly outperforms a range of existing state-of-the-art approaches.
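The copy-target labeling summarized above can be sketched in a few lines: a question word is labeled as copied when it is not a stop word and also occurs in the input passage. The stop-word list below is a tiny hypothetical stand-in, and matching on lowercased surface forms is an assumption; the full criteria are those described in Sec. 3.3.

```python
# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "was", "to", "and",
              "what", "who", "where", "did"}


def copy_labels(passage_tokens, question_tokens, stop_words=STOP_WORDS):
    """Return 1 for question words treated as copied, 0 otherwise."""
    passage = {t.lower() for t in passage_tokens}
    return [int(t.lower() in passage and t.lower() not in stop_words)
            for t in question_tokens]
```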
- Battaglia et al. (2018) Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261 (2018).
- Chali and Hasan (2015) Yllias Chali and Sadid A Hasan. 2015. Towards topic-to-question generation. Computational Linguistics 41, 1 (2015), 1–20.
- Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
- Danon and Last (2017) Guy Danon and Mark Last. 2017. A Syntactic Approach to Domain-Specific Automatic Question Generation. arXiv preprint arXiv:1712.09827 (2017).
- Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems. 3844–3852.
- Denkowski and Lavie (2014) Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation. 376–380.
- Du and Cardie (2018) Xinya Du and Claire Cardie. 2018. Harvesting Paragraph-Level Question-Answer Pairs from Wikipedia. arXiv preprint arXiv:1805.05942 (2018).
- Du et al. (2017) Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to ask: Neural question generation for reading comprehension. arXiv preprint arXiv:1705.00106 (2017).
- Gao et al. (2018) Yifan Gao, Jianan Wang, Lidong Bing, Irwin King, and Michael R Lyu. 2018. Difficulty Controllable Question Generation for Reading Comprehension. arXiv preprint arXiv:1807.03586 (2018).
- Goodfellow et al. (2013) Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. 2013. Maxout networks. arXiv preprint arXiv:1302.4389 (2013).
- Gulcehre et al. (2016) Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. arXiv preprint arXiv:1603.08148 (2016).
- Heilman (2011) Michael Heilman. 2011. Automatic factual question generation from text. Language Technologies Institute School of Computer Science Carnegie Mellon University 195 (2011).
- Heilman and Smith (2010) Michael Heilman and Noah A Smith. 2010. Good question! statistical ranking for question generation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 609–617.
- Hu et al. (2018) Wenpeng Hu, Bing Liu, Jinwen Ma, Dongyan Zhao, and Rui Yan. 2018. Aspect-based Question Generation. (2018).
- Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144 (2016).
- Kim et al. (2018) Yanghoon Kim, Hwanhee Lee, Joongbo Shin, and Kyomin Jung. 2018. Improving Neural Question Generation using Answer Separation. arXiv preprint arXiv:1809.02393 (2018).
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
- Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
- Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out (2004).
- Liu et al. (2018) Bang Liu, Ting Zhang, Di Niu, Jinghong Lin, Kunfeng Lai, and Yu Xu. 2018. Matching Long Text Documents via Graph Convolutional Networks. arXiv preprint arXiv:1802.07459 (2018).
- Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).
- Maddison et al. (2014) Chris J Maddison, Daniel Tarlow, and Tom Minka. 2014. A* sampling. In Advances in Neural Information Processing Systems. 3086–3094.
- Marcheggiani and Titov (2017) Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. arXiv preprint arXiv:1703.04826 (2017).
- Matthew Honnibal (2015) Matthew Honnibal. 2015. spaCy: Industrial-strength Natural Language Processing (NLP) with Python and Cython. https://spacy.io. [Online; accessed 3-November-2018].
- Mostafazadeh et al. (2016) Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Margaret Mitchell, Xiaodong He, and Lucy Vanderwende. 2016. Generating natural questions about an image. arXiv preprint arXiv:1603.06059 (2016).
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 311–318.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. 2017. Pytorch: Tensors and dynamic neural networks in python with strong gpu acceleration.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 1532–1543.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016).
- Serban et al. (2016) Iulian Vlad Serban, Alberto García-Durán, Caglar Gulcehre, Sungjin Ahn, Sarath Chandar, Aaron Courville, and Yoshua Bengio. 2016. Generating factoid questions with recurrent neural networks: The 30m factoid question-answer corpus. arXiv preprint arXiv:1603.06807 (2016).
- Song et al. (2018) Linfeng Song, Zhiguo Wang, Wael Hamza, Yue Zhang, and Daniel Gildea. 2018. Leveraging Context Information for Natural Question Generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Vol. 2. 569–574.
- Subramanian et al. (2017) Sandeep Subramanian, Tong Wang, Xingdi Yuan, Saizheng Zhang, Yoshua Bengio, and Adam Trischler. 2017. Neural Models for Key Phrase Detection and Question Generation. arXiv preprint arXiv:1706.04560 (2017).
- Sun et al. (2018) Xingwu Sun, Jing Liu, Yajuan Lyu, Wei He, Yanjun Ma, and Shi Wang. 2018. Answer-focused and Position-aware Neural Question Generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 3930–3939.
- Tang et al. (2017) Duyu Tang, Nan Duan, Tao Qin, Zhao Yan, and Ming Zhou. 2017. Question answering and question generation as dual tasks. arXiv preprint arXiv:1706.02027 (2017).
- Tang et al. (2018) Duyu Tang, Nan Duan, Zhao Yan, Zhirui Zhang, Yibo Sun, Shujie Liu, Yuanhua Lv, and Ming Zhou. 2018. Learning to Collaborate for Question Answering and Asking. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Vol. 1. 1564–1574.
- Trischler et al. (2016) Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2016. Newsqa: A machine comprehension dataset. arXiv preprint arXiv:1611.09830 (2016).
- Yao et al. (2018) Kaichun Yao, Libo Zhang, Tiejian Luo, Lili Tao, and Yanjun Wu. 2018. Teaching Machines to Ask Questions.. In IJCAI. 4546–4552.
- Yuan et al. (2017) Xingdi Yuan, Tong Wang, Caglar Gulcehre, Alessandro Sordoni, Philip Bachman, Sandeep Subramanian, Saizheng Zhang, and Adam Trischler. 2017. Machine comprehension by text-to-text neural question generation. arXiv preprint arXiv:1705.02012 (2017).
- Zhang et al. (2018a) Ting Zhang, Bang Liu, Di Niu, Kunfeng Lai, and Yu Xu. 2018a. Multiresolution Graph Attention Networks for Relevance Matching. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 933–942.
- Zhang et al. (2018b) Yuhao Zhang, Peng Qi, and Christopher D Manning. 2018b. Graph Convolution over Pruned Dependency Trees Improves Relation Extraction. arXiv preprint arXiv:1809.10185 (2018).
- Zhao et al. (2018) Yao Zhao, Xiaochuan Ni, Yuanyuan Ding, and Qifa Ke. 2018. Paragraph-level Neural Question Generation with Maxout Pointer and Gated Self-attention Networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 3901–3910.
- Zhou et al. (2017) Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan, Hangbo Bao, and Ming Zhou. 2017. Neural question generation from text: A preliminary study. In National CCF Conference on Natural Language Processing and Chinese Computing. Springer, 662–671.
- Zhou et al. (2018) Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. 2018. Sequential Copying Networks. arXiv preprint arXiv:1807.02301 (2018).