We study the task of generating from Wikipedia articles question-answer pairs that cover content beyond a single sentence. We propose a neural network approach that incorporates coreference knowledge via a novel gating mechanism. Compared to models that only take into account sentence-level information (Heilman and Smith, 2010; Du et al., 2017; Zhou et al., 2017), we find that the linguistic knowledge introduced by the coreference representation aids question generation significantly, producing models that outperform the current state-of-the-art. We apply our system (composed of an answer span extraction system and the passage-level QG system) to the 10,000 top-ranking Wikipedia articles and create a corpus of over one million question-answer pairs. We also provide a qualitative analysis for this large-scale generated corpus from Wikipedia.
Recently, there has been a resurgence of work in NLP on reading comprehension (Hermann et al., 2015; Rajpurkar et al., 2016; Joshi et al., 2017) with the goal of developing systems that can answer questions about the content of a given passage or document. Large-scale QA datasets are indispensable for training expressive statistical models for this task and play a critical role in advancing the field, and there have been a number of efforts in this direction. Miller et al. (2016), for example, develop a dataset for open-domain question answering; Rajpurkar et al. (2016) and Joshi et al. (2017) do so for reading comprehension (RC); and Hill et al. (2015) and Hermann et al. (2015), for the related task of answering cloze questions (Winograd, 1972; Levesque et al., 2011). To create these datasets, either crowdsourcing or (semi-)synthetic approaches are used. The (semi-)synthetic datasets (e.g., Hermann et al., 2015) are large in size and cheap to obtain; however, they do not share the same characteristics as explicit QA/RC questions (Rajpurkar et al., 2016). In comparison, high-quality crowdsourced datasets are much smaller in size, and the annotation process is quite expensive because the labeled examples require expertise and careful design (Chen et al., 2016).
|(1)Tesla was renowned for his achievements and showmanship, eventually earning him a reputation in popular culture as an archetypal "mad scientist". (2)His patents earned him a considerable amount of money, much of which was used to finance his own projects with varying degrees of success. (3)He lived most of his life in a series of New York hotels, through his retirement. (4)Tesla died on 7 January 1943. …|
|– What was Tesla’s reputation in popular culture?|
|– How did Tesla finance his work?|
|– Where did Tesla live for much of his life?|
|New York hotels|
Thus, there is a need for methods that can automatically generate high-quality question-answer pairs. Serban et al. (2016) propose the use of recurrent neural networks to generate QA pairs from structured knowledge resources such as Freebase. Their work relies on the existence of automatically acquired KBs, which are known to have errors, suffer from incompleteness, and are non-trivial to obtain. In addition, the questions in the resulting dataset are limited to queries regarding a single fact (i.e., tuple) in the KB.
Motivated by the need for large scale QA pairs and the limitations of recent work, we investigate methods that can automatically “harvest” (generate) question-answer pairs from raw text/unstructured documents, such as Wikipedia-type articles.
Recent work along these lines (Du et al., 2017; Zhou et al., 2017) (see Section 2) has proposed the use of attention-based recurrent neural models trained on the crowdsourced SQuAD dataset (Rajpurkar et al., 2016) for question generation. While successful, the resulting QA pairs are based on information from a single sentence. As described in Du et al. (2017), however, nearly 30% of the human-generated questions in SQuAD rely on information beyond a single sentence. For example, in Figure 1, the second and third questions require coreference information (i.e., recognizing that “His” in sentence 2 and “He” in sentence 3 both corefer with “Tesla” in sentence 1) to answer them.
Thus, our research studies methods for incorporating coreference information into the training of a question generation system. In particular, we propose gated Coreference knowledge for Neural Question Generation (CorefNQG), a neural sequence model with a novel gating mechanism that leverages continuous representations of coreference clusters — the set of mentions used to refer to each entity — to better encode linguistic knowledge introduced by coreference, for paragraph-level question generation.
In an evaluation using the SQuAD dataset, we find that CorefNQG enables better question generation. It significantly outperforms the baseline neural sequence models that encode information from a single sentence, as well as a model that encodes all preceding context and the input sentence itself. When evaluated on only the portion of SQuAD that requires coreference resolution, the gap between our system and the baseline systems is even larger.
By applying our approach to the 10,000 top-ranking Wikipedia articles, we obtain a question answering/reading comprehension dataset with over one million QA pairs; we provide a qualitative analysis in Section 6. The dataset and the source code for the system are available at https://github.com/xinyadu/HarvestingQA.
Since the work by Rus et al. (2010), question generation (QG) has attracted interest from both the NLP and NLG communities. Most early work in QG employed rule-based approaches to transform input text into questions, usually requiring the application of a sequence of well-designed general rules or templates (Mitkov and Ha, 2003; Labutov et al., 2015). Heilman and Smith (2010) introduced an overgenerate-and-rank approach: their system generates a set of questions and then ranks them to select the top candidates. Apart from generating questions from raw text, there has also been research on question generation from symbolic representations (Yao et al., 2012; Olney et al., 2012).
With the recent development of deep representation learning and large QA datasets, there has been research on recurrent neural network based approaches for question generation. Serban et al. (2016) used the encoder-decoder framework to generate QA pairs from knowledge base triples; mitesh2017generating generated questions from a knowledge graph; Du et al. (2017) studied how to generate questions from sentences using an attention-based sequence-to-sequence model and investigated the effect of exploiting sentence- vs. paragraph-level information. Du and Cardie (2017) proposed a hierarchical neural sentence-level sequence tagging model for identifying question-worthy sentences in a text passage. Finally, Duan et al. (2017) investigated how to use question generation to help improve question answering systems on the sentence selection subtask.
In comparison to the related methods from above that generate questions from raw text, our method is different in its ability to take into account contextual information beyond the sentence-level by introducing coreference knowledge.
Recently there has been an increasing interest in question answering with the creation of many datasets. Most are built using crowdsourcing; they generally comprise fewer than 100,000 QA pairs and are time-consuming to create. WebQuestions (Berant et al., 2013), for example, contains 5,810 questions crawled via the Google Suggest API and is designed for knowledge base QA with answers restricted to Freebase entities. To tackle the size issues associated with WebQuestions, Bordes et al. (2015) introduce SimpleQuestions, a dataset of 108,442 questions authored by English speakers. SQuAD (Rajpurkar et al., 2016) is a dataset for machine comprehension; it is created by showing a Wikipedia paragraph to human annotators and asking them to write questions based on the paragraph. TriviaQA (Joshi et al., 2017) includes 95k question-answer pairs authored by trivia enthusiasts and corresponding evidence documents.
(Semi-)synthetically generated datasets are easier to build at large scale (Hill et al., 2015; Hermann et al., 2015). They usually come in the form of cloze-style questions. For example, Hermann et al. (2015) created over a million examples by pairing CNN and Daily Mail news articles with their summarized bullet points. Chen et al. (2016) showed that this dataset is quite noisy due to the method of data creation and concluded that performance of QA systems on the dataset is almost saturated.
Closest to our work is that of Serban et al. (2016). They train a neural triple-to-sequence model on SimpleQuestions, and apply their system to Freebase to produce a large collection of human-like question-answer pairs.
Our goal is to harvest high quality question-answer pairs from the paragraphs of an article of interest. In our task formulation, this consists of two steps: candidate answer extraction and answer-specific question generation. Given an input paragraph, we first identify a set of question-worthy candidate answers $A = \{a_1, \dots, a_l\}$, each a span of text as denoted in color in Figure 1. For each candidate answer $a_k$, we then aim to generate a question $Q$ (a sequence of tokens $y_1, \dots, y_N$) based on the sentence $S$ that contains candidate $a_k$ such that:
$Q$ asks about an aspect of $a_k$ that is of potential interest to a human;
$Q$ might rely on information from sentences that precede $S$ in the paragraph.
Here $C$ denotes the set of sentences that precede $S$ in the paragraph.
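The two requirements above can be read as a conditional likelihood maximization. The following is only a sketch under the assumption of a standard left-to-right factorization; the symbols are notation chosen here for illustration ($Q$ is the question with tokens $y_1, \dots, y_N$, $S$ the sentence containing the answer span, and $C$ the sentences preceding $S$), and the answer span is assumed to be marked inside $S$ through input features.

```latex
\hat{Q} = \arg\max_{Q} P(Q \mid S, C),
\qquad
P(Q \mid S, C) = \prod_{n=1}^{N} P\left(y_n \mid y_{<n}, S, C\right)
```

A sentence-level model would drop $C$ from the conditioning; the formulation here keeps it, which is what allows a question to draw on preceding sentences.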
In this section, we introduce our framework for harvesting the question-answer pairs. As described above, it consists of the question generator CorefNQG (Figure 2) and a candidate answer extraction module. During test/generation time, we (1) run the answer extraction module on the input text to obtain answers, and then (2) run the question generation module to obtain the corresponding questions.
As shown in Figure 2, our generator prepares the feature-rich input embedding — a concatenation of (a) a refined coreference position feature embedding, (b) an answer feature embedding, and (c) a word embedding, each of which is described below. It then encodes the textual input using an LSTM unit Hochreiter and Schmidhuber (1997). Finally, an attention-copy equipped decoder is used to decode the question.
More specifically, given the input sentence $S$ (containing an answer span) and the preceding context $C$, we first run a coreference resolution system to get the coref-clusters for $S$ and $C$ and use them to create a coreference transformed input sentence: for each pronoun, we append its most representative non-pronominal coreferent mention. Specifically, we apply the simple feedforward network based mention-ranking model of Clark and Manning (2016) (C&M) to the concatenation of $C$ and $S$ to get the coref-clusters for all entities in $C$ and $S$. The C&M model produces a score/representation $s$ for each mention pair $(m_1, m_2)$,
$$s(m_1, m_2) = W_m h_m(m_1, m_2) + b \qquad (2)$$
where $W_m$ is a weight matrix and $b$ is the bias. $h_m(m_1, m_2)$ is the representation of the last hidden layer of the three-layer feedforward neural network.
For each pronoun in $S$, we then heuristically identify the most “representative” antecedent from its coref-cluster. (Proper nouns are preferred.) We append the new mention after the pronoun. For example, in Table 1, “the panthers” is the most representative mention in the coref-cluster for “they”. The new sentence with the appended coreferent mention is our coreference transformed input sentence $S'$ (see Figure 2).
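The transformation step can be illustrated with a minimal sketch. It assumes a coreference resolver has already produced, for each pronoun position, the tokens of its representative antecedent. The function name `coref_transform`, the `antecedents` mapping, and the example sentence are hypothetical and are not taken from the released system.

```python
from typing import Dict, List


def coref_transform(tokens: List[str], antecedents: Dict[int, List[str]]) -> List[str]:
    """Append the representative antecedent right after each resolved pronoun.

    `antecedents` maps a pronoun's token index to the tokens of its most
    representative mention (assumed output of an external coreference resolver).
    """
    out: List[str] = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i in antecedents:
            # The antecedent is appended, so the pronoun itself is kept.
            out.extend(antecedents[i])
    return out


# Hypothetical sentence; "they" (index 0) is assumed to resolve to "the panthers".
sentence = "they play their home games in charlotte".split()
transformed = coref_transform(sentence, {0: ["the", "panthers"]})
print(" ".join(transformed))
```

Appending rather than replacing keeps the original pronoun in the input, so the position features described next can mark both the pronoun and the added antecedent.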
Coreference Position Feature Embedding For each token in the transformed sentence $S'$, we also maintain one position feature $f^c = (c_1, \dots, c_n)$ to denote pronouns (e.g., “they”) and antecedents (e.g., “the panthers”). We use the BIO tagging scheme to label the associated spans in $S'$. “B_ANT” denotes the start of an antecedent span, tag “I_ANT” continues the antecedent span and tag “O” marks tokens that do not form part of a mention span. Similarly, tags “B_PRO” and “I_PRO” denote the pronoun span. (See Table 1, “coref. feature”.)
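As a small illustration of this tagging scheme, the sketch below assigns the five tags to a token sequence given pronoun and antecedent spans. The helper name `coref_position_tags`, the half-open span convention, and the example tokens are assumptions made for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open token offsets [start, end)


def coref_position_tags(n_tokens: int,
                        pronoun_spans: List[Span],
                        antecedent_spans: List[Span]) -> List[str]:
    """Label each token with B_PRO / I_PRO / B_ANT / I_ANT / O."""
    tags = ["O"] * n_tokens
    for suffix, spans in (("PRO", pronoun_spans), ("ANT", antecedent_spans)):
        for start, end in spans:
            tags[start] = f"B_{suffix}"          # first token of the span
            for i in range(start + 1, end):
                tags[i] = f"I_{suffix}"          # remaining tokens of the span
    return tags


tokens = ["they", "the", "panthers", "play", "in", "charlotte"]
tags = coref_position_tags(len(tokens), pronoun_spans=[(0, 1)], antecedent_spans=[(1, 3)])
for tok, tag in zip(tokens, tags):
    print(f"{tok}\t{tag}")
```

Each tag would then be mapped to a learned embedding vector, which is the input to the gating step.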
We propose to use a gating network here to obtain a refined representation of the coreference position feature vectors $f^c = (c_1, \dots, c_n)$. The main idea is to utilize the mention-pair score (see Equation 2) to help the neural network learn the importance of the coreferent phrases. We compute the refined (gated) coreference position feature vector $f^d = (d_1, \dots, d_n)$ as follows,
$$g_i = \mathrm{ReLU}(W_a c_i + W_b\, score_i + b_g), \qquad d_i = g_i \odot c_i$$
where $\odot$ denotes an element-wise product between two vectors and ReLU is the rectified linear activation function. $score_i$ denotes the mention-pair score for each antecedent token (e.g., “the” and “panthers”) with the pronoun (e.g., “they”); $score_i$ is obtained from the trained C&M model (Equation 2). If token $i$ is not added later as an antecedent token, $score_i$ is set to zero. $W_a$ and $W_b$ are weight matrices and $b_g$ is the bias vector.
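To make the gating step concrete, the following sketch computes a gated feature vector for one token in plain Python. It assumes the gate has the form ReLU(W_a c + W_b score + b) and is applied to the position embedding c by element-wise product, as the description suggests. The function names and the toy weights are hypothetical, and W_b is represented as a vector because the score is a scalar.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gated_feature(c: Vector, score: float, W_a: Matrix, W_b: Vector, b: Vector) -> Vector:
    """Return ReLU(W_a c + W_b * score + b) multiplied element-wise with c."""
    pre = [wa + wb * score + bi for wa, wb, bi in zip(matvec(W_a, c), W_b, b)]
    gate = [max(0.0, p) for p in pre]            # ReLU
    return [g * ci for g, ci in zip(gate, c)]    # element-wise product


# Toy parameters (assumed values, for illustration only).
c = [1.0, 2.0]
W_a = [[1.0, 0.0], [0.0, -1.0]]
W_b = [0.5, 0.5]
b = [0.0, 0.0]

d_antecedent = gated_feature(c, score=2.0, W_a=W_a, W_b=W_b, b=b)  # antecedent token
d_other = gated_feature(c, score=0.0, W_a=W_a, W_b=W_b, b=b)       # score set to zero
print(d_antecedent, d_other)
```

With these toy weights a higher mention-pair score opens the gate further on the first dimension, which is the intended effect: antecedents that the coreference model is confident about contribute more strongly to the encoder input.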
Answer Feature Embedding We also include an answer position feature embedding to generate answer-specific questions; we denote the answer span with the usual BIO tagging scheme (see, e.g., “the arizona cardinals” in Table 1). During training and testing, the answer span feature (i.e., “I_ANS” or “O”) is mapped to its feature embedding space: $f^a = (a_1, \dots, a_n)$.
Word Embedding To obtain the word embedding for the tokens themselves, we just map the tokens to the word embedding space: $x = (x_1, \dots, x_n)$.
Final Encoder Input As noted above, the final input to the LSTM-based encoder is a concatenation of (1) the refined coreference position feature embedding $d_i$ (light blue units in Figure 2), (2) the answer position feature embedding $a_i$ (red units), and (3) the word embedding $x_i$ for the token (green units),
$$e_i = \mathrm{concat}(d_i, a_i, x_i)$$
Encoder As for the encoder itself, we use bidirectional LSTMs to read the input in both the forward and backward directions. After encoding, we obtain two sequences of hidden vectors, namely, $\overrightarrow{h} = (\overrightarrow{h}_1, \dots, \overrightarrow{h}_n)$ and $\overleftarrow{h} = (\overleftarrow{h}_1, \dots, \overleftarrow{h}_n)$. The final output state of the encoder is the concatenation of $\overrightarrow{h}$ and $\overleftarrow{h}$, where
$$h_i = \mathrm{concat}(\overrightarrow{h}_i, \overleftarrow{h}_i)$$
Question Decoder with Attention & Copy On top of the feature-rich encoder, we use LSTMs with attention Bahdanau et al. (2015) as the decoder for generating the question one token at a time. To deal with rare/unknown words, the decoder also allows directly copying words from the source sentence via pointing Vinyals et al. (2015).
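At each time step $t$, the decoder LSTM reads the previous word embedding $w_{t-1}$ and previous hidden state $s_{t-1}$ to compute the new hidden state,
$$s_t = \mathrm{LSTM}(w_{t-1}, s_{t-1})$$
Then we calculate the attention distribution $\alpha_t$ as in Bahdanau et al. (2015),
$$e_{i,t} = h_i^{\top} W_c\, s_{t-1}, \qquad \alpha_{i,t} = \frac{\exp(e_{i,t})}{\sum_{j=1}^{n} \exp(e_{j,t})}$$
where $h_i$ is the encoder state for source token $i$, $W_c$ is a weight matrix and attention distribution $\alpha_t$ is a probability distribution over the source sentence words. With $\alpha_t$, we can obtain the context vector $h^*_t$,
$$h^*_t = \sum_{i=1}^{n} \alpha_{i,t}\, h_i$$
Then, using the context vector $h^*_t$ and hidden state $s_t$, the probability distribution over the target (question) side vocabulary is calculated as,
$$P_{vocab} = \mathrm{softmax}(W_d\, \mathrm{concat}(h^*_t, s_t))$$
Instead of directly using $P_{vocab}$ for training/generating with the fixed target side vocabulary, we also consider copying from the source sentence. The copy probability is based on the context vector $h^*_t$ and hidden state $s_t$,
$$\lambda^{copy}_t = \sigma(W_e h^*_t + W_f s_t)$$
and the probability distribution over the source sentence words is the sum of the attention scores of the corresponding words,
$$P_{copy}(w) = \sum_{i=1}^{n} \alpha_{i,t}\, \mathbb{1}\{w = v_i\}$$
where $v_i$ is the $i$-th source word. Finally, we obtain the probability distribution over the dynamic vocabulary (i.e., union of original target side and source sentence vocabulary) by summing over $P_{copy}$ and $P_{vocab}$,
$$P(w) = \lambda^{copy}_t P_{copy}(w) + (1 - \lambda^{copy}_t) P_{vocab}(w)$$
where $\sigma$ is the sigmoid function, and $W_d$, $W_e$, $W_f$ are weight matrices.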
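The final mixing step can be sketched for a single decoding step. The code assumes the attention weights, the fixed-vocabulary distribution, and the copy probability have already been computed. The function name `final_distribution` and all numeric values are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def final_distribution(p_vocab: Dict[str, float],
                       attention: List[float],
                       source: List[str],
                       copy_prob: float) -> Dict[str, float]:
    """Mix the copy distribution and the vocabulary distribution.

    The copy distribution sums attention over repeated source words, so a word
    that occurs twice in the source receives both attention weights.
    """
    p_copy: Dict[str, float] = defaultdict(float)
    for weight, word in zip(attention, source):
        p_copy[word] += weight
    dynamic_vocab = set(p_vocab) | set(p_copy)   # union of both vocabularies
    return {
        w: copy_prob * p_copy.get(w, 0.0) + (1.0 - copy_prob) * p_vocab.get(w, 0.0)
        for w in dynamic_vocab
    }


# Toy decoding step (assumed values): "tesla" and "lived" are out-of-vocabulary.
p_vocab = {"who": 0.5, "what": 0.5}
source = ["tesla", "lived", "tesla"]
attention = [0.25, 0.5, 0.25]
dist = final_distribution(p_vocab, attention, source, copy_prob=0.5)
print(sorted(dist.items()))
```

Because both input distributions sum to one and the copy probability lies in [0, 1], the mixture is itself a valid distribution, and source-only words such as rare entity names receive non-zero probability.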
We frame the problem of identifying candidate answer spans from a paragraph as a sequence labeling task and base our model on the BiLSTM-CRF approach for named entity recognition (Huang et al., 2015). Given a paragraph of $n$ tokens, instead of directly feeding the sequence of word vectors $x = (x_1, \dots, x_n)$ to the LSTM units, we first construct the feature-rich embedding $x'$ for each token, which is the concatenation of the word embedding, an NER feature embedding, and a character-level representation of the word (Lample et al., 2016). We use the concatenated vector as the “final” embedding $x'$ for the token,
$$x'_i = \mathrm{concat}(x_i, \mathrm{NER}_i, x^{ch}_i)$$
where $x^{ch}_i$ is the concatenation of the last hidden states of a character-based biLSTM. The intuition behind the use of NER features is that SQuAD answer spans contain a large number of named entities, numeric phrases, etc.
Then a multi-layer bidirectional LSTM is applied to $(x'_1, \dots, x'_n)$ and we obtain the output state $z_t$ for time step $t$ by concatenation of the hidden states (forward and backward) at time step $t$ from the last layer of the BiLSTM. We apply the softmax to $(z_1, \dots, z_n)$ to get the normalized score representation for each token, which is of size $n \times k$, where $k$ is the number of tags.
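Instead of using a softmax training objective that minimizes the cross-entropy loss for each individual word, the model is trained with a CRF (Lafferty et al., 2001) objective, which minimizes the negative log-likelihood for the entire correct sequence: $-\log(p_y)$,
$$p_y = \frac{\exp(q(x', y))}{\sum_{y' \in Y'} \exp(q(x', y'))}$$
where $q(x', y) = \sum_{t=1}^{n} P_{t, y_t} + \sum_{t=1}^{n-1} A_{y_t, y_{t+1}}$, $P_{t, y_t}$ is the score of assigning tag $y_t$ to the $t$-th token, and $A_{y_t, y_{t+1}}$ is the transition score from tag $y_t$ to $y_{t+1}$; the scoring matrix $A$ is to be learned. $Y'$ represents all the possible tagging sequences.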
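The sequence-level objective can be illustrated with a brute-force sketch that enumerates every tag sequence. This is feasible only for toy inputs; a practical implementation would compute the normalizer with the forward algorithm. The sketch assumes no special start or end tags, and the emission scores `P`, transition scores `A`, and the three-tag inventory are made-up values.

```python
import itertools
import math
from typing import List, Sequence


def sequence_score(P: List[List[float]], A: List[List[float]], tags: Sequence[int]) -> float:
    """Emission scores P[t][y_t] plus transition scores A[y_t][y_{t+1}]."""
    emit = sum(P[t][y] for t, y in enumerate(tags))
    trans = sum(A[a][b] for a, b in zip(tags, tags[1:]))
    return emit + trans


def crf_nll(P: List[List[float]], A: List[List[float]], gold: Sequence[int]) -> float:
    """Negative log-likelihood of the gold sequence under a linear-chain CRF."""
    k = len(A)
    # Normalizer over all k**n tag sequences (brute force, toy inputs only).
    log_z = math.log(sum(math.exp(sequence_score(P, A, seq))
                         for seq in itertools.product(range(k), repeat=len(P))))
    return log_z - sequence_score(P, A, gold)


# Tags: 0 = O, 1 = B_ANS, 2 = I_ANS (toy scores, assumed for illustration).
P = [[2.0, 0.5, 0.0], [0.0, 2.0, 0.1], [0.0, 0.2, 2.0]]
A = [[0.5, 0.5, -2.0], [0.0, -1.0, 1.0], [0.5, 0.0, 0.5]]
gold = [0, 1, 2]
print(round(crf_nll(P, A, gold), 4))
```

The strongly negative transition from O to I_ANS shows what the CRF adds over per-token softmax training: ill-formed tag sequences are penalized as a whole.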
We use the SQuAD dataset (Rajpurkar et al., 2016) to train our models. It is one of the largest general purpose QA datasets derived from Wikipedia with over 100k questions posed by crowdworkers on a set of Wikipedia articles. The answer to each question is a segment of text from the corresponding Wiki passage. The crowdworkers were users of Amazon’s Mechanical Turk located in the US or Canada. To obtain high-quality articles, the authors sampled 500 articles from the top 10,000 articles ranked by Nayuki’s internal Wikipedia PageRanks. The question-answer pairs were generated by annotators from a paragraph; and although the dataset is typically used to evaluate reading comprehension, it has also been used in an open domain QA setting (Chen et al., 2017; Wang et al., 2018). For training/testing answer extraction systems, we pair each paragraph in the dataset with the gold answer spans that it contains. For the question generation system, we pair each sentence that contains an answer span with the corresponding gold question as in Du et al. (2017).
To quantify the effect of using predicted (rather than gold standard) answer spans on question generation (e.g., predicted answer span boundaries can be inaccurate), we also train the models on an augmented “Training set w/ noisy examples” (see Table 2). This training set contains all of the original training examples plus new examples for predicted answer spans (from the top-performing answer extraction model, bottom row of Table 3) that overlap with a gold answer span. We pair the new training sentence (w/ predicted answer span) with the gold question. The added examples comprise 42.21% of the noisy example training set.
For generation of our one million QA pair corpus, we apply our systems to the 10,000 top-ranking articles of Wikipedia.
For question generation evaluation, we use BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2014), computed with the evaluation scripts of Du et al. (2017). BLEU measures average $n$-gram precision vs. a set of reference questions and penalizes overly short sentences. METEOR is a recall-oriented metric that takes into account synonyms, stemming, and paraphrases.
For answer candidate extraction evaluation, we use precision, recall and F-measure vs. the gold standard SQuAD answers. Since answer boundaries are sometimes ambiguous, we compute Binary Overlap and Proportional Overlap metrics in addition to Exact Match. Binary Overlap counts every predicted answer that overlaps with a gold answer span as correct, and Proportional Overlap gives partial credit proportional to the amount of overlap (Johansson and Moschitti, 2010; Irsoy and Cardie, 2014).
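The three matching criteria can be sketched as follows. This is a simplified reading, not the evaluation script: it assumes spans are half-open token offsets, that each predicted span is credited against its best-overlapping gold span, and that recall is obtained by swapping the roles of predicted and gold spans. The names `span_precision` and `f_measure` and the example spans are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open token offsets [start, end)


def overlap(a: Span, b: Span) -> int:
    """Number of tokens shared by two spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def span_precision(pred: List[Span], gold: List[Span], mode: str) -> float:
    """Average credit per predicted span under 'exact', 'binary' or 'proportional'."""
    if not pred:
        return 0.0
    total = 0.0
    for p in pred:
        best = max((overlap(p, g) for g in gold), default=0)
        if mode == "exact":
            total += 1.0 if p in gold else 0.0
        elif mode == "binary":
            total += 1.0 if best > 0 else 0.0
        else:  # proportional: fraction of the span covered by its best match
            total += best / (p[1] - p[0])
    return total / len(pred)


def f_measure(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if p + r > 0 else 0.0


pred = [(0, 4), (10, 12)]
gold = [(2, 4), (20, 22)]
for mode in ("exact", "binary", "proportional"):
    p = span_precision(pred, gold, mode)
    r = span_precision(gold, pred, mode)  # recall: swap the roles
    print(mode, p, r, round(f_measure(p, r), 4))
```

In this example the first prediction contains the gold span but has a loose left boundary, so it earns nothing under Exact Match, full credit under Binary Overlap, and half credit under Proportional Overlap.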
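|Models||BLEU-3 (training set)||BLEU-4 (training set)||METEOR (training set)||BLEU-3 (w/ noisy examples)||BLEU-4 (w/ noisy examples)||METEOR (w/ noisy examples)|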
|Baseline Du et al. (2017) (w/o answer)||17.50||12.28||16.62||15.81||10.78||15.31|
|Seq2seq + copy (w/ answer)||20.01||14.31||18.50||19.61||13.96||18.19|
|- mention-pair score||20.56||14.75||18.85||19.73||14.13||18.38|
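|Models||Precision (Prop.)||Precision (Bin.)||Precision (Exact)||Recall (Prop.)||Recall (Bin.)||Recall (Exact)||F-measure (Prop.)||F-measure (Bin.)||F-measure (Exact)|
|BiLSTM w/ NER||44.35||46.02||25.33||33.30||40.81||23.32||38.04||43.26||24.29|
|BiLSTM-CRF w/ char||49.35||51.92||38.58||30.53||32.75||24.04||37.72||40.16||29.62|
|BiLSTM-CRF w/ char w/ NER||45.96||51.61||33.90||41.05||43.98||28.37||43.37||47.49||30.89|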
For question generation, we compare to the state-of-the-art baselines and conduct ablation tests as follows. The model of Du et al. (2017) is an attention-based RNN sequence-to-sequence neural network (without using the answer location information feature). Seq2seq + copy (w/ answer) is the attention-based sequence-to-sequence model augmented with a copy mechanism, with answer features concatenated with the word embeddings during encoding. Seq2seq + copy (w/ full context + answer) is the same model as the previous one, but we allow access to the full context (i.e., all the preceding sentences and the input sentence itself). We denote it as ContextNQG henceforth for simplicity. CorefNQG is the coreference-based model proposed in this paper. CorefNQG - gating is an ablation test in which the gating network is removed and the coreference position embedding is not refined. CorefNQG - mention-pair score is also an ablation test, in which all mention-pair scores are set to zero.
For answer span extraction, we conduct experiments to compare the performance of an off-the-shelf NER system and BiLSTM based systems.
For training and implementation details, please see the Supplementary Material.
Table 2 shows the BLEU-3, BLEU-4 and METEOR scores of different models. Our CorefNQG outperforms the seq2seq baseline of Du et al. (2017) by a large margin. This shows that the copy mechanism, answer features and coreference resolution all aid question generation. In addition, CorefNQG outperforms both Seq2seq + copy models significantly, whether or not they have access to the full context. This demonstrates that the coreference knowledge encoded with the gating network explicitly helps with training and generation: it is more difficult for the neural sequence model to learn the coreference knowledge in a latent way. (See input 1 in Figure A.1 for an example.) Building end-to-end models that take into account coreference knowledge in a latent way is an interesting direction to explore. In the ablation tests, the performance drop of CorefNQG - gating shows that the gating network plays an important role in obtaining the refined coreference position feature embedding, which helps the model learn the importance of an antecedent. The performance drop of CorefNQG - mention-pair score shows that the mention-pair score introduced from the external system (Clark and Manning, 2016) helps the neural network better encode coreference knowledge.
To better understand the effect of coreference resolution, we also evaluate our model and the baseline models on just that portion of the test set that requires pronoun resolution (36.42% of the examples) and show the results in Table 4. The performance gaps between our model and the baseline models are still significant. We also see that the performance of all three systems drops on this partial test set, which demonstrates the difficulty of generating questions for the cases that require pronoun resolution (passage context).
We also show in Table 2 the results of the QG models trained on the training set augmented with noisy examples with predicted answer spans. There is a consistent but acceptable drop for each model on this new training set, given the inaccuracy of predicted answer spans. We see that CorefNQG still outperforms the baseline models across all metrics.
Figure A.1 provides sample output for input sentences that require contextual coreference knowledge. We see that ContextNQG fails in all cases; our model misses only the third example due to an error introduced by coreference resolution — the “city” and “it” are considered coreferent. We can also see that human-generated questions are more natural and varied in form with better paraphrasing.
In Table 3, we show the evaluation results for different answer extraction models. First we see that all variants of BiLSTM models outperform the off-the-shelf NER system (that proposes all NEs as answer spans), though the NER system has a higher recall. The BiLSTM-CRF that encodes the character-level and NER features for each token performs best in terms of F-measure.
|Grammaticality||Making Sense||Answerability||Avg. rank|
“Grammaticality”, “Making Sense” and “Answerability” are rated on a 1–5 scale (5 for the best; see the supplementary materials for a detailed rating scheme); “Average rank” is rated on a 1–3 scale (1 for the most preferred; ties are allowed). Two-tailed t-test results are shown for our method compared to ContextNQG (statistical significance is indicated at the $p$ < 0.05 and $p$ < 0.01 levels).
We hired four native speakers of English to rate the systems’ outputs. Detailed guidelines for the raters are listed in the supplementary materials. The evaluation can also be seen as a measure of the quality of the generated dataset (Section 6.3). We randomly sampled 11 passages/paragraphs from the test set; there are in total around 70 question-answer pairs for evaluation.
We consider three metrics — “grammaticality”, “making sense” and “answerability”. The evaluators are asked to first rate the grammatical correctness of the generated question (before being shown the associated input sentence or any other textual context). Next, we ask them to rate the degree to which the question “makes sense” given the input sentence (i.e., without considering the correctness of the answer span). Finally, evaluators rate the “answerability” of the question given the full context.
Table 5 shows the results of the human evaluation. Bold indicates top scores. We see that the original human questions are preferred over the two NQG systems’ outputs, which is understandable given the examples in Figure A.1. The human-generated questions make more sense and correspond better with the provided answers, particularly when they require information in the preceding context. How exactly to capture the preceding context so as to ask better and more diverse questions is an interesting future direction for research. In terms of grammaticality, however, the neural models do quite well, achieving very close to human performance. In addition, we see that our method (CorefNQG) performs statistically significantly better across all metrics in comparison to the baseline model (ContextNQG), which has access to the entire preceding context in the passage.
Our system generates in total 1,259,691 question-answer pairs, nearly 126 questions per article. Figure 5 shows the distribution of different types of questions in our dataset vs. the SQuAD training set. We see that the distribution for “In what”, “When”, “How long”, “Who”, “Where”, “What does” and “What do” questions in the two datasets is similar. Our system generates more “What is”, “What was” and “What percentage” questions, while the proportions of “What did”, “Why” and “Which” questions in SQuAD are larger than ours. One possible reason is that the “Why” and “What did” questions are more complicated to ask (sometimes involving world knowledge) and the answer spans are longer phrases of various types that are harder to identify. “What is” and “What was” questions, on the other hand, are often safer for the neural network systems to ask.
|Model||EM (dev)||EM (test)||F-1 (dev)||F-1 (test)|
|DocReader Chen et al. (2017)||82.33||81.65||88.20||87.79|
In Figure 4, we show some examples of the generated question-answer pairs. The answer extractor identifies the answer span boundary well and all three questions correspond to their answers. Q2 is valid but not entirely accurate. For more examples, please refer to our supplementary materials.
Table 6 shows the performance of a top-performing system for the SQuAD dataset (Document Reader; Chen et al., 2017) when applied to the development and test set portions of our generated dataset. The system was trained on the training set portion of our dataset. We use the SQuAD evaluation scripts, which calculate exact match (EM) and F-1 scores (F-1 measures the average overlap between the predicted answer span and the ground-truth answer; Rajpurkar et al., 2016). Performance of the neural machine reading model is reasonable. We also train the DocReader on our training set and test the models’ performance on the original dev set of SQuAD; for this, the performance is around on EM and on F-1 metric. DocReader trained on the original SQuAD training set achieves EM, F-1 indicating that our dataset is more difficult and/or less natural than the crowd-sourced QA pairs of SQuAD.
We propose a new neural network model for better encoding coreference knowledge for paragraph-level question generation. Evaluations with different metrics on the SQuAD machine reading dataset show that our model outperforms state-of-the-art baselines. The ablation study shows the effectiveness of different components in our model. Finally, we apply our question generation framework to produce a corpus of 1.26 million question-answer pairs, which we hope will benefit the QA research community. It would also be interesting to apply our approach to incorporating coreference knowledge to other text generation tasks.
We thank the anonymous reviewers and members of the Cornell NLP group for helpful comments.
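Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 1533–1544. http://www.aclweb.org/anthology/D13-1160.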
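Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In International Conference on Machine Learning. pages 933–941.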
We provide more examples of QA pairs from the corpus; the red answer spans correspond to the questions in order.
We provide the following guidelines for the raters:
|Grammaticality||Given only the question itself, is it grammatical?|
|Making sense||Given just the question and the surrounding context in the passage, does the question make sense?|
|Answerability||Given just the question and the surrounding context in the passage and the answer (and regardless of “Grammaticality” and “Making Sense”), is the question answerable by the corresponding answer span?|
For the question generation model, the input and output vocabularies are collected from the training data; we keep the 50k most frequent words. The sizes of the word embeddings and LSTM hidden states are set to 128 and 256, respectively. We use dropout (Srivastava et al., 2014) with probability . The model parameters are initialized randomly using a uniform distribution between2013) with range during training. The best models are selected based on the (lowest) perplexity on the development set. In all experiments, we use the same split of the SQuAD dataset into training, development and test sets as Du et al. (2017). We use beam search during decoding to get better results, with the beam size set to 3 in the experiments and corpus generation. For the tokenizer used in building the answer extraction system, we use SpaCy.