Harvesting Paragraph-Level Question-Answer Pairs from Wikipedia

05/15/2018 ∙ by Xinya Du, et al. ∙ Cornell University

We study the task of generating from Wikipedia articles question-answer pairs that cover content beyond a single sentence. We propose a neural network approach that incorporates coreference knowledge via a novel gating mechanism. Compared to models that only take into account sentence-level information (Heilman and Smith, 2010; Du et al., 2017; Zhou et al., 2017), we find that the linguistic knowledge introduced by the coreference representation aids question generation significantly, producing models that outperform the current state-of-the-art. We apply our system (composed of an answer span extraction system and the passage-level QG system) to the 10,000 top-ranking Wikipedia articles and create a corpus of over one million question-answer pairs. We also provide a qualitative analysis for this large-scale generated corpus from Wikipedia.




1 Introduction

Recently, there has been a resurgence of work in NLP on reading comprehension (Hermann et al., 2015; Rajpurkar et al., 2016; Joshi et al., 2017) with the goal of developing systems that can answer questions about the content of a given passage or document. Large-scale QA datasets are indispensable for training expressive statistical models for this task and play a critical role in advancing the field, and there have been a number of efforts in this direction. Miller et al. (2016), for example, develop a dataset for open-domain question answering; Rajpurkar et al. (2016) and Joshi et al. (2017) do so for reading comprehension (RC); and Hill et al. (2015) and Hermann et al. (2015), for the related task of answering cloze questions (Winograd, 1972; Levesque et al., 2011). To create these datasets, either crowdsourcing or (semi-)synthetic approaches are used. The (semi-)synthetic datasets (e.g., Hermann et al., 2015) are large in size and cheap to obtain; however, they do not share the same characteristics as explicit QA/RC questions (Rajpurkar et al., 2016). In comparison, high-quality crowdsourced datasets are much smaller in size, and the annotation process is quite expensive because the labeled examples require expertise and careful design (Chen et al., 2016).

(1)Tesla was renowned for his achievements and showmanship, eventually earning him a reputation in popular culture as an archetypal "mad scientist". (2)His patents earned him a considerable amount of money, much of which was used to finance his own projects with varying degrees of success. (3)He lived most of his life in a series of New York hotels, through his retirement. (4)Tesla died on 7 January 1943. …
– What was Tesla’s reputation in popular culture?
mad scientist
– How did Tesla finance his work?
– Where did Tesla live for much of his life?
New York hotels
Figure 1: Example input from the fourth paragraph of a Wikipedia article on Nikola Tesla, along with the natural questions and their answers from the SQuAD dataset (Rajpurkar et al., 2016). We show in italics the set of mentions that refer to Nikola Tesla: Tesla, him, his, he, etc.

Thus, there is a need for methods that can automatically generate high-quality question-answer pairs. Serban et al. (2016) propose the use of recurrent neural networks to generate QA pairs from structured knowledge resources such as Freebase. Their work relies on the existence of automatically acquired KBs, which are known to have errors and suffer from incompleteness. They are also non-trivial to obtain. In addition, the questions in the resulting dataset are limited to queries regarding a single fact (i.e., tuple) in the KB.

Motivated by the need for large scale QA pairs and the limitations of recent work, we investigate methods that can automatically “harvest” (generate) question-answer pairs from raw text/unstructured documents, such as Wikipedia-type articles.

Recent work along these lines (Du et al., 2017; Zhou et al., 2017; see Section 2) has proposed the use of attention-based recurrent neural models trained on the crowdsourced SQuAD dataset (Rajpurkar et al., 2016) for question generation. While successful, the resulting QA pairs are based on information from a single sentence. As described in Du et al. (2017), however, nearly 30% of the human-generated questions in SQuAD rely on information beyond a single sentence. For example, in Figure 1, the second and third questions require coreference information (i.e., recognizing that “His” in sentence 2 and “He” in sentence 3 both corefer with “Tesla” in sentence 1) to answer them.

Thus, our research studies methods for incorporating coreference information into the training of a question generation system. In particular, we propose gated Coreference knowledge for Neural Question Generation (CorefNQG), a neural sequence model with a novel gating mechanism that leverages continuous representations of coreference clusters — the set of mentions used to refer to each entity — to better encode linguistic knowledge introduced by coreference, for paragraph-level question generation.

In an evaluation using the SQuAD dataset, we find that CorefNQG enables better question generation. It significantly outperforms the baseline neural sequence models that encode information from a single sentence, as well as a model that encodes all preceding context and the input sentence itself. When evaluated on only the portion of SQuAD that requires coreference resolution, the gap between our system and the baseline systems is even larger.

By applying our approach to the 10,000 top-ranking Wikipedia articles, we obtain a question answering/reading comprehension dataset with over one million QA pairs; we provide a qualitative analysis in Section 6. The dataset and the source code for the system are available at https://github.com/xinyadu/HarvestingQA.

2 Related Work

2.1 Question Generation

Since the work by Rus et al. (2010), question generation (QG) has attracted interest from both the NLP and NLG communities. Most early work in QG employed rule-based approaches to transform input text into questions, usually requiring the application of a sequence of well-designed general rules or templates (Mitkov and Ha, 2003; Labutov et al., 2015). Heilman and Smith (2010) introduced an overgenerate-and-rank approach: their system generates a set of questions and then ranks them to select the top candidates. Apart from generating questions from raw text, there has also been research on question generation from symbolic representations (Yao et al., 2012; Olney et al., 2012).

With the recent development of deep representation learning and large QA datasets, there has been research on recurrent neural network based approaches for question generation. Serban et al. (2016) used the encoder-decoder framework to generate QA pairs from knowledge base triples; Indurthi et al. (2017) generated questions from a knowledge graph; Du et al. (2017) studied how to generate questions from sentences using an attention-based sequence-to-sequence model and investigated the effect of exploiting sentence- vs. paragraph-level information. Du and Cardie (2017) proposed a hierarchical neural sentence-level sequence tagging model for identifying question-worthy sentences in a text passage. Finally, Duan et al. (2017) investigated how to use question generation to help improve question answering systems on the sentence selection subtask.

In comparison to the related methods from above that generate questions from raw text, our method is different in its ability to take into account contextual information beyond the sentence-level by introducing coreference knowledge.

2.2 Question Answering Datasets and Creation

Recently there has been an increasing interest in question answering with the creation of many datasets. Most are built using crowdsourcing; they generally comprise fewer than 100,000 QA pairs and are time-consuming to create. WebQuestions (Berant et al., 2013), for example, contains 5,810 questions crawled via the Google Suggest API and is designed for knowledge base QA with answers restricted to Freebase entities. To tackle the size issues associated with WebQuestions, Bordes et al. (2015) introduce SimpleQuestions, a dataset of 108,442 questions authored by English speakers. SQuAD (Rajpurkar et al., 2016) is a dataset for machine comprehension; it is created by showing a Wikipedia paragraph to human annotators and asking them to write questions based on the paragraph. TriviaQA (Joshi et al., 2017) includes 95k question-answer pairs authored by trivia enthusiasts and corresponding evidence documents.

(Semi-)synthetically generated datasets are easier to build at large scale (Hill et al., 2015; Hermann et al., 2015). They usually come in the form of cloze-style questions. For example, Hermann et al. (2015) created over a million examples by pairing CNN and Daily Mail news articles with their summarized bullet points. Chen et al. (2016) showed that this dataset is quite noisy due to the method of data creation and concluded that performance of QA systems on the dataset is almost saturated.

Closest to our work is that of Serban et al. (2016). They train a neural triple-to-sequence model on SimpleQuestions, and apply their system to Freebase to produce a large collection of human-like question-answer pairs.

3 Task Definition

Our goal is to harvest high quality question-answer pairs from the paragraphs of an article of interest. In our task formulation, this consists of two steps: candidate answer extraction and answer-specific question generation. Given an input paragraph, we first identify a set of question-worthy candidate answers $ans = (ans_1, ans_2, \ldots, ans_l)$, each a span of text as denoted in color in Figure 1. For each candidate answer $ans_i$, we then aim to generate a question $Q$ (a sequence of tokens $y_1, \ldots, y_N$) based on the sentence $S$ that contains candidate $ans_i$ such that:

  • $Q$ asks about an aspect of $ans_i$ that is of potential interest to a human;

  • $Q$ might rely on information from sentences that precede $S$ in the paragraph.

Mathematically then,

$$Q = \arg\max_{Q} P(Q \mid S, C) \tag{1}$$

where $P(Q \mid S, C) = \prod_{n=1}^{N} P(y_n \mid y_{<n}, S, C)$ and $C$ is the set of sentences that precede $S$ in the paragraph.

4 Methodology

In this section, we introduce our framework for harvesting the question-answer pairs. As described above, it consists of the question generator CorefNQG (Figure 2) and a candidate answer extraction module. During test/generation time, we (1) run the answer extraction module on the input text to obtain answers, and then (2) run the question generation module to obtain the corresponding questions.
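The two-step generation procedure can be sketched as a small driver function. This is only an illustration under stated assumptions: `harvest_qa_pairs`, `toy_extractor` and `toy_generator` are hypothetical names, and the two toy callables merely stand in for the trained answer extraction and question generation modules.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def harvest_qa_pairs(
    paragraph: List[str],
    extract_answers: Callable[[List[str]], List[Span]],
    generate_question: Callable[[List[str], Span], str],
) -> List[Tuple[str, str]]:
    """Step 1: extract candidate answer spans. Step 2: one question per span."""
    pairs = []
    for start, end in extract_answers(paragraph):
        answer = " ".join(paragraph[start:end])
        question = generate_question(paragraph, (start, end))
        pairs.append((question, answer))
    return pairs


# Hypothetical stand-ins for the two trained modules.
def toy_extractor(tokens: List[str]) -> List[Span]:
    # Propose every purely numeric token as a candidate answer.
    return [(i, i + 1) for i, tok in enumerate(tokens) if tok.isdigit()]


def toy_generator(tokens: List[str], span: Span) -> str:
    # A real generator would condition on the sentence and the answer span.
    return "Which number is mentioned ?"


paragraph = "tesla died on 7 january 1943".split()
pairs = harvest_qa_pairs(paragraph, toy_extractor, toy_generator)
```

Passing the two modules in as callables keeps the driver independent of how either model is implemented.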

Figure 2: The gated Coreference knowledge for Neural Question Generation (CorefNQG) Model.
word            they    the     panthers  defeated  the     arizona  cardinals  49  –  15
ans. feature    O       O       O         O         B_ANS   I_ANS    I_ANS      O   O  O
coref. feature  B_PRO   B_ANT   I_ANT     O         O       O        O          O   O  O

Table 1: Example input sentence with coreference and answer position features. The corresponding gold question is “What team did the Panthers defeat in the NFC championship game ?”

4.1 Question Generation

As shown in Figure 2, our generator prepares the feature-rich input embedding, a concatenation of (a) a refined coreference position feature embedding, (b) an answer feature embedding, and (c) a word embedding, each of which is described below. It then encodes the textual input using an LSTM unit (Hochreiter and Schmidhuber, 1997). Finally, a decoder equipped with attention and copy mechanisms is used to decode the question.

More specifically, given the input sentence $S$ (containing an answer span) and the preceding context $C$, we first run a coreference resolution system to get the coref-clusters for $S$ and $C$ and use them to create a coreference transformed input sentence: for each pronoun, we append its most representative non-pronominal coreferent mention. Specifically, we apply the simple feedforward network based mention-ranking model of Clark and Manning (2016) to the concatenation of $C$ and $S$ to get the coref-clusters for all entities in $C$ and $S$. The C&M model produces a score/representation $s$ for each mention pair $(m_1, m_2)$,

$$s(m_1, m_2) = W_m\, h_m(m_1, m_2) + b \tag{2}$$

where $W_m$ is a weight matrix and $b$ is the bias. $h_m(m_1, m_2)$ is the representation of the last hidden layer of the three-layer feedforward neural network.

For each pronoun in $S$, we then heuristically identify the most “representative” antecedent from its coref-cluster. (Proper nouns are preferred.) We append the new mention after the pronoun. For example, in Table 1, “the panthers” is the most representative mention in the coref-cluster for “they”. The new sentence with the appended coreferent mention is our coreference transformed input sentence $S'$ (see Figure 2).

Coreference Position Feature Embedding  For each token in the coreference transformed sentence $S'$, we also maintain one position feature $f^c = (c_1, \ldots, c_n)$, to denote pronouns (e.g., “they”) and antecedents (e.g., “the panthers”). We use the BIO tagging scheme to label the associated spans in $S'$. “B_ANT” denotes the start of an antecedent span, tag “I_ANT” continues the antecedent span and tag “O” marks tokens that do not form part of a mention span. Similarly, tags “B_PRO” and “I_PRO” denote the pronoun span. (See Table 1, “coref. feature”.)
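As an illustration of the transformation and tagging just described, the following minimal sketch appends an antecedent after a pronoun and emits the BIO position tags. It assumes the coreference resolver has already chosen the antecedent and that the pronoun is a single token; `coref_transform` is a hypothetical helper, not part of the released system.

```python
from typing import List, Tuple


def coref_transform(
    tokens: List[str],
    pronoun_index: int,
    antecedent: List[str],
) -> Tuple[List[str], List[str]]:
    """Append `antecedent` right after the pronoun and tag both spans (BIO)."""
    new_tokens = (
        tokens[: pronoun_index + 1] + antecedent + tokens[pronoun_index + 1 :]
    )
    tags = ["O"] * len(new_tokens)
    tags[pronoun_index] = "B_PRO"  # single-token pronoun assumed
    for offset in range(len(antecedent)):
        position = pronoun_index + 1 + offset
        tags[position] = "B_ANT" if offset == 0 else "I_ANT"
    return new_tokens, tags


sentence = "they defeated the arizona cardinals".split()
new_tokens, tags = coref_transform(sentence, 0, ["the", "panthers"])
```

On the Table 1 sentence this yields the token sequence starting “they the panthers defeated” with the tags B_PRO, B_ANT, I_ANT followed by O for the remaining tokens.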

Refined Coref. Position Feature Embedding  Inspired by the success of gating mechanisms for controlling information flow in neural networks (Hochreiter and Schmidhuber, 1997; Dauphin et al., 2017), we propose to use a gating network here to obtain a refined representation of the coreference position feature vectors $f^c = (c_1, \ldots, c_n)$. The main idea is to utilize the mention-pair score (see Equation 2) to help the neural network learn the importance of the coreferent phrases. We compute the refined (gated) coreference position feature vector $f^d = (f^d_1, \ldots, f^d_n)$ as follows,

$$g_i = \mathrm{ReLU}(W_a c_i + W_b\, \mathit{score}_i + b_g), \qquad f^d_i = g_i \odot c_i$$

where $\odot$ denotes an element-wise product between two vectors and ReLU is the rectified linear activation function. $\mathit{score}_i$ denotes the mention-pair score for each antecedent token (e.g., “the” and “panthers”) with the pronoun (e.g., “they”); $\mathit{score}_i$ is obtained from the trained model (Equation 2) of the C&M. If token $i$ is not added later as an antecedent token, $\mathit{score}_i$ is set to zero. $W_a$, $W_b$ are weight matrices and $b_g$ is the bias vector.
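The gating step can be illustrated with a small pure-Python sketch. It assumes the gate is a ReLU over a linear function of the position feature vector and a scalar mention-pair score, and that the gate is then applied element-wise to the position feature vector; the names `gated_coref_feature`, `W_a`, `W_b` and `bias`, and the toy values, are illustrative assumptions rather than the trained parameters.

```python
from typing import List


def matvec(matrix: List[List[float]], vector: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def gated_coref_feature(
    c: List[float],          # coreference position feature vector for one token
    score: float,            # mention-pair score (0.0 for non-antecedent tokens)
    W_a: List[List[float]],  # d x d weight matrix (assumed shape)
    W_b: List[float],        # d x 1 weight matrix, stored as a flat list
    bias: List[float],       # d-dimensional bias vector
) -> List[float]:
    """Compute gate = ReLU(W_a c + W_b * score + bias), return gate * c element-wise."""
    pre_activation = [
        a + wb * score + b for a, wb, b in zip(matvec(W_a, c), W_b, bias)
    ]
    gate = [max(0.0, value) for value in pre_activation]
    return [g * ci for g, ci in zip(gate, c)]


c = [1.0, -2.0]
W_a = [[1.0, 0.0], [0.0, 1.0]]
W_b = [0.5, 0.5]
bias = [0.0, 0.0]
refined_antecedent = gated_coref_feature(c, 2.0, W_a, W_b, bias)
refined_other = gated_coref_feature(c, 0.0, W_a, W_b, bias)
```

With these toy values a higher mention-pair score opens the gate further on the first dimension, which is the intended effect of letting the score modulate how strongly an antecedent token's position feature is passed on.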

Answer Feature Embedding  We also include an answer position feature embedding to generate answer-specific questions; we denote the answer span with the usual BIO tagging scheme (see, e.g., “the arizona cardinals” in Table 1). During training and testing, the answer span feature (i.e., “B_ANS”, “I_ANS” or “O”) is mapped to its feature embedding space: $f^a = (a_1, \ldots, a_n)$.

Word Embedding  To obtain the word embedding for the tokens themselves, we just map the tokens to the word embedding space: $x = (x_1, \ldots, x_n)$.

Final Encoder Input  As noted above, the final input to the LSTM-based encoder is a concatenation of (1) the refined coreference position feature embedding $f^d_i$ (light blue units in Figure 2), (2) the answer position feature embedding $a_i$ (red units), and (3) the word embedding for the token $x_i$ (green units),

$$e_i = \mathrm{concat}(f^d_i, a_i, x_i)$$

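Encoder  As for the encoder itself, we use bidirectional LSTMs to read the input $e = (e_1, \ldots, e_n)$ in both the forward and backward directions. After encoding, we obtain two sequences of hidden vectors, namely, $(\overrightarrow{h}_1, \ldots, \overrightarrow{h}_n)$ and $(\overleftarrow{h}_1, \ldots, \overleftarrow{h}_n)$. The final output state of the encoder is the concatenation of $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$, i.e., $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$, where

$$\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}(e_i, \overrightarrow{h}_{i-1}), \qquad \overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}(e_i, \overleftarrow{h}_{i+1})$$
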
Question Decoder with Attention & Copy  On top of the feature-rich encoder, we use LSTMs with attention (Bahdanau et al., 2015) as the decoder for generating the question one token at a time. To deal with rare/unknown words, the decoder also allows directly copying words from the source sentence via pointing (Vinyals et al., 2015).

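At each time step $t$, the decoder LSTM reads the previous word embedding $y_{t-1}$ and previous hidden state $s_{t-1}$ to compute the new hidden state,

$$s_t = \mathrm{LSTM}(y_{t-1}, s_{t-1})$$

Then we calculate the attention distribution $\alpha_t$ as in Bahdanau et al. (2015),

$$e_{t,i} = h_i^{\top} W_c\, s_{t-1}, \qquad \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j} \exp(e_{t,j})}$$

where $h_i$ is the encoder hidden state at source position $i$, $W_c$ is a weight matrix and attention distribution $\alpha_t$ is a probability distribution over the source sentence words. With $\alpha_t$, we can obtain the context vector $v_t$,

$$v_t = \sum_{i} \alpha_{t,i}\, h_i$$

Then, using the context vector $v_t$ and hidden state $s_t$, the probability distribution over the target (question) side vocabulary is calculated as,

$$P_{vocab} = \mathrm{softmax}(W_d\, \mathrm{concat}(v_t, s_t))$$

Instead of directly using $P_{vocab}$ for training/generating with the fixed target side vocabulary, we also consider copying from the source sentence. The copy probability $\lambda_t^{copy}$ is based on the context vector $v_t$ and hidden state $s_t$,

$$\lambda_t^{copy} = \sigma(W_e v_t + W_f s_t)$$

and the probability distribution over the source sentence words is the sum of the attention scores of the corresponding words,

$$P_{copy}(w) = \sum_{i:\, w_i = w} \alpha_{t,i}$$

where $w_i$ is the source word at position $i$. Finally, we obtain the probability distribution over the dynamic vocabulary (i.e., union of original target side and source sentence vocabulary) by summing over $P_{copy}$ and $P_{vocab}$,

$$P(w) = \lambda_t^{copy}\, P_{copy}(w) + (1 - \lambda_t^{copy})\, P_{vocab}(w)$$

where $\sigma$ is the sigmoid function, and $W_d$, $W_e$, $W_f$ are weight matrices.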
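To make the copy mechanism concrete, here is a minimal pure-Python sketch of how a copy distribution can be formed from attention weights and combined with a fixed-vocabulary distribution. It assumes the common convex mixture weighted by the copy probability; the function names, the toy attention scores and the toy vocabulary are hypothetical.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def copy_distribution(source_tokens: List[str], attention: List[float]) -> Dict[str, float]:
    """Sum the attention weights of all source positions holding the same word."""
    p_copy: Dict[str, float] = {}
    for token, weight in zip(source_tokens, attention):
        p_copy[token] = p_copy.get(token, 0.0) + weight
    return p_copy


def mix_distributions(
    p_vocab: Dict[str, float], p_copy: Dict[str, float], copy_prob: float
) -> Dict[str, float]:
    """Distribution over the union of the target vocabulary and the source words."""
    words = set(p_vocab) | set(p_copy)
    return {
        w: copy_prob * p_copy.get(w, 0.0) + (1.0 - copy_prob) * p_vocab.get(w, 0.0)
        for w in words
    }


source = ["the", "panthers", "defeated", "the", "cardinals"]
attention = softmax([0.1, 2.0, 0.3, 0.1, 1.5])  # toy attention scores
p_copy = copy_distribution(source, attention)
p_vocab = {"who": 0.5, "the": 0.3, "<unk>": 0.2}  # toy fixed-vocabulary distribution
p_final = mix_distributions(p_vocab, p_copy, copy_prob=0.4)
```

In this toy run “panthers” is absent from the fixed vocabulary yet still receives probability mass through copying, which is how rare or unknown source words become generable.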

4.2 Answer Span Identification

We frame the problem of identifying candidate answer spans from a paragraph as a sequence labeling task and base our model on the BiLSTM-CRF approach for named entity recognition (Huang et al., 2015). Given a paragraph of $n$ tokens, instead of directly feeding the sequence of word vectors to the LSTM units, we first construct the feature-rich embedding for each token, which is the concatenation of the word embedding, an NER feature embedding, and a character-level representation of the word (Lample et al., 2016). We use the concatenated vector $u_i$ as the “final” embedding for the token,

$$u_i = \mathrm{concat}(x_i, \mathit{ner}_i, \mathit{ch}_i)$$

where $x_i$ is the word embedding, $\mathit{ner}_i$ is the NER feature embedding, and $\mathit{ch}_i$ is the concatenation of the last hidden states of a character-based biLSTM. The intuition behind the use of NER features is that SQuAD answer spans contain a large number of named entities, numeric phrases, etc.

Then a multi-layer bidirectional LSTM is applied to $(u_1, \ldots, u_n)$ and we obtain the output state $o_t$ for time step $t$ by concatenation of the hidden states (forward and backward) at time step $t$ from the last layer of the BiLSTM. We apply the softmax to $(o_1, \ldots, o_n)$ to get the normalized score representation for each token, which is of size $n \times k$, where $k$ is the number of tags.

Instead of using a softmax training objective that minimizes the cross-entropy loss for each individual word, the model is trained with a CRF (Lafferty et al., 2001) objective, which minimizes the negative log-likelihood for the entire correct sequence: $-\log p_{\mathbf{y}}$,

$$p_{\mathbf{y}} = \frac{\exp(q(\mathbf{u}, \mathbf{y}))}{\sum_{\mathbf{y}' \in Y'} \exp(q(\mathbf{u}, \mathbf{y}'))}$$

where $q(\mathbf{u}, \mathbf{y}) = \sum_{t=1}^{n} P_{t, y_t} + \sum_{t=1}^{n-1} A_{y_t, y_{t+1}}$, $P_{t, y_t}$ is the score of assigning tag $y_t$ to the $t$-th token, and $A_{y_t, y_{t+1}}$ is the transition score from tag $y_t$ to $y_{t+1}$; the scoring matrix $A$ is to be learned. $Y'$ represents all the possible tagging sequences.
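The sequence-level objective can be illustrated with a brute-force sketch that scores a tag sequence as per-token tag scores plus tag-to-tag transition scores, and normalizes over every possible tagging. This is for illustration only: a practical CRF computes the normalizer with the forward algorithm rather than by enumeration, start/end transitions are omitted, and the toy score values and names below are assumptions.

```python
import itertools
import math
from typing import List, Sequence


def sequence_score(
    emissions: List[List[float]],    # emissions[t][tag]: score of tag at token t
    transitions: List[List[float]],  # transitions[a][b]: score of tag a -> tag b
    tags: Sequence[int],
) -> float:
    """Sum of per-token tag scores plus transition scores between adjacent tags."""
    score = sum(emissions[t][tag] for t, tag in enumerate(tags))
    score += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return score


def crf_nll(
    emissions: List[List[float]],
    transitions: List[List[float]],
    gold_tags: Sequence[int],
) -> float:
    """Negative log-likelihood of gold_tags, normalized over all tag sequences."""
    num_tags = len(transitions)
    all_scores = [
        sequence_score(emissions, transitions, candidate)
        for candidate in itertools.product(range(num_tags), repeat=len(emissions))
    ]
    peak = max(all_scores)
    log_partition = peak + math.log(sum(math.exp(s - peak) for s in all_scores))
    return log_partition - sequence_score(emissions, transitions, gold_tags)


# Tags: 0 = O, 1 = B_ANS, 2 = I_ANS (toy scores for a three-token input).
emissions = [[2.0, 0.5, 0.1], [0.3, 1.8, 0.2], [0.2, 0.4, 1.5]]
transitions = [[0.5, 0.2, -2.0], [-0.5, -1.0, 1.0], [0.3, -0.5, 0.6]]
loss = crf_nll(emissions, transitions, gold_tags=(0, 1, 2))
```

Because the loss is defined over whole sequences, the transition scores can penalize invalid tag orders such as O followed directly by I_ANS, which a per-token softmax objective cannot express.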

5 Experiments

5.1 Dataset

We use the SQuAD dataset (Rajpurkar et al., 2016) to train our models. It is one of the largest general purpose QA datasets derived from Wikipedia with over 100k questions posed by crowdworkers on a set of Wikipedia articles. The answer to each question is a segment of text from the corresponding Wikipedia passage. The crowdworkers were users of Amazon’s Mechanical Turk located in the US or Canada. To obtain high-quality articles, the authors sampled 500 articles from the top 10,000 articles obtained by Nayuki’s Wikipedia’s internal PageRanks. The question-answer pairs were generated by annotators from a paragraph, and although the dataset is typically used to evaluate reading comprehension, it has also been used in an open domain QA setting (Chen et al., 2017; Wang et al., 2018). For training/testing answer extraction systems, we pair each paragraph in the dataset with the gold answer spans that it contains. For the question generation system, we pair each sentence that contains an answer span with the corresponding gold question as in Du et al. (2017).

To quantify the effect of using predicted (rather than gold standard) answer spans on question generation (e.g., predicted answer span boundaries can be inaccurate), we also train the models on an augmented “Training set w/ noisy examples” (see Table 2). This training set contains all of the original training examples plus new examples for predicted answer spans (from the top-performing answer extraction model, bottom row of Table 3) that overlap with a gold answer span. We pair the new training sentence (w/ predicted answer span) with the gold question. The added examples comprise 42.21% of the noisy example training set.

For generation of our one million QA pair corpus, we apply our systems to the 10,000 top-ranking articles of Wikipedia.

5.2 Evaluation Metrics

For question generation evaluation, we use BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2014), computed with the evaluation scripts of Du et al. (2017). BLEU measures average $n$-gram precision vs. a set of reference questions and penalizes overly short sentences. METEOR is a recall-oriented metric that takes into account synonyms, stemming, and paraphrases.

For answer candidate extraction evaluation, we use precision, recall and F-measure vs. the gold standard SQuAD answers. Since answer boundaries are sometimes ambiguous, we compute Binary Overlap and Proportional Overlap metrics in addition to Exact Match. Binary Overlap counts every predicted answer that overlaps with a gold answer span as correct, and Proportional Overlap gives partial credit proportional to the amount of overlap (Johansson and Moschitti, 2010; Irsoy and Cardie, 2014).
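The three matching criteria can be sketched as follows. This is a simplified illustration, not the evaluation code used for the reported numbers: spans are (start, end) token offsets with an exclusive end, proportional credit is taken as the best overlap with any gold span divided by the span's own length, and recall is obtained by swapping the roles of predicted and gold spans. All function names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def overlap(a: Span, b: Span) -> int:
    """Number of tokens shared by two spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def span_precision(predicted: List[Span], gold: List[Span], mode: str) -> float:
    """Average credit per predicted span under 'exact', 'binary' or 'proportional'."""
    if not predicted:
        return 0.0
    credit = 0.0
    for p in predicted:
        if mode == "exact":
            credit += 1.0 if p in gold else 0.0
        elif mode == "binary":
            credit += 1.0 if any(overlap(p, g) > 0 for g in gold) else 0.0
        elif mode == "proportional":
            best = max((overlap(p, g) for g in gold), default=0)
            credit += best / (p[1] - p[0])
        else:
            raise ValueError(f"unknown mode: {mode}")
    return credit / len(predicted)


def span_recall(predicted: List[Span], gold: List[Span], mode: str) -> float:
    """Recall is precision with the roles of gold and predicted spans swapped."""
    return span_precision(gold, predicted, mode)


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


gold = [(4, 7)]               # e.g. tokens 4..6, "the arizona cardinals"
predicted = [(5, 7), (0, 1)]  # one partial match, one spurious span
p_bin = span_precision(predicted, gold, "binary")
r_prop = span_recall(predicted, gold, "proportional")
```

In this toy case the partially matching prediction earns no Exact Match credit but full Binary Overlap credit, which is why the looser criteria are reported alongside Exact Match.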

                                                         Training set               Training set w/ noisy examples
Models                                                   BLEU-3   BLEU-4   METEOR   BLEU-3   BLEU-4   METEOR
Baseline Du et al. (2017) (w/o answer)                   17.50    12.28    16.62    15.81    10.78    15.31
Seq2seq + copy (w/ answer)                               20.01    14.31    18.50    19.61    13.96    18.19
ContextNQG: Seq2seq + copy (w/ full context + answer)    20.31    14.58    18.84    19.57    14.05    18.19
CorefNQG                                                 20.90    15.16    19.12    20.19    14.52    18.59
 - gating                                                20.68    14.84    18.98    20.08    14.40    18.64
 - mention-pair score                                    20.56    14.75    18.85    19.73    14.13    18.38

Table 2: Evaluation results for question generation.

                              Precision              Recall                 F-measure
Models                        Prop.   Bin.    Exact  Prop.   Bin.    Exact  Prop.   Bin.    Exact
NER                           24.54   25.94   12.77  58.20   67.66   38.52  34.52   37.50   19.19
BiLSTM                        43.54   45.08   22.97  28.43   35.99   18.87  34.40   40.03   20.71
BiLSTM w/ NER                 44.35   46.02   25.33  33.30   40.81   23.32  38.04   43.26   24.29
BiLSTM-CRF w/ char            49.35   51.92   38.58  30.53   32.75   24.04  37.72   40.16   29.62
BiLSTM-CRF w/ char w/ NER     45.96   51.61   33.90  41.05   43.98   28.37  43.37   47.49   30.89

Table 3: Evaluation results of answer extraction systems.

5.3 Baselines and Ablation Tests

For question generation, we compare to the state-of-the-art baselines and conduct ablation tests as follows: the model of Du et al. (2017) is an attention-based RNN sequence-to-sequence neural network (without using the answer location information feature). Seq2seq + copy (w/ answer) is the attention-based sequence-to-sequence model augmented with a copy mechanism, with answer features concatenated with the word embeddings during encoding. Seq2seq + copy (w/ full context + answer) is the same model as the previous one, but we allow access to the full context (i.e., all the preceding sentences and the input sentence itself). We denote it as ContextNQG henceforth for simplicity. CorefNQG is the coreference-based model proposed in this paper. CorefNQG–gating is an ablation test in which the gating network is removed and the coreference position embedding is not refined. CorefNQG–mention-pair score is also an ablation test, in which all mention-pair scores are set to zero.

For answer span extraction, we conduct experiments to compare the performance of an off-the-shelf NER system and BiLSTM based systems.

For training and implementation details, please see the Supplementary Material.

6 Results and Analysis

6.1 Automatic Evaluation

Table 2 shows the BLEU-3, BLEU-4 and METEOR scores of different models. Our CorefNQG outperforms the seq2seq baseline of Du et al. (2017) by a large margin. This shows that the copy mechanism, answer features and coreference resolution all aid question generation. In addition, CorefNQG outperforms both Seq2seq+Copy models significantly, whether or not they have access to the full context. This demonstrates that the coreference knowledge encoded with the gating network explicitly helps with the training and generation: it is more difficult for the neural sequence model to learn the coreference knowledge in a latent way. (See input 1 in Figure 3 for an example.) Building end-to-end models that take into account coreference knowledge in a latent way is an interesting direction to explore. In the ablation tests, the performance drop of CorefNQG–gating shows that the gating network plays an important role in obtaining the refined coreference position feature embedding, which helps the model learn the importance of an antecedent. The performance drop of CorefNQG–mention-pair score shows that the mention-pair score introduced from the external system (Clark and Manning, 2016) helps the neural network better encode coreference knowledge.

Models                        BLEU-3   BLEU-4   METEOR
Seq2seq + copy (w/ ans.)      17.81    12.30    17.11
ContextNQG                    18.05    12.53    17.33
CorefNQG                      18.46    12.96    17.58

Table 4: Evaluation results for question generation on the portion that requires coreference knowledge (36.42% of the examples in the original test set).

To better understand the effect of coreference resolution, we also evaluate our model and the baseline models on just that portion of the test set that requires pronoun resolution (36.42% of the examples) and show the results in Table 4. The performance gaps between our model and the baseline models are still significant. In addition, the performance of all three systems drops on this partial test set, which demonstrates the difficulty of generating questions for the cases that require pronoun resolution (passage context).

We also show in Table 2 the results of the QG models trained on the training set augmented with noisy examples with predicted answer spans. There is a consistent but acceptable drop for each model on this new training set, given the inaccuracy of predicted answer spans. We see that CorefNQG still outperforms the baseline models across all metrics.

Figure 3 provides sample output for input sentences that require contextual coreference knowledge. We see that ContextNQG fails in all cases; our model misses only the third example, due to an error introduced by coreference resolution: the “city” and “it” are considered coreferent. We can also see that human-generated questions are more natural and varied in form with better paraphrasing.

In Table 3, we show the evaluation results for different answer extraction models. First we see that all variants of BiLSTM models outperform the off-the-shelf NER system (that proposes all NEs as answer spans), though the NER system has a higher recall. The BiLSTM-CRF that encodes the character-level and NER features for each token performs best in terms of F-measure.

Input 1: The elizabethan navigator, sir francis drake was born in the nearby town of tavistock and was the mayor of plymouth. … . he died of dysentery in 1596 off the coast of puerto rico.

Human: In what year did Sir Francis Drake die ?

ContextNQG: When did he die ?

CorefNQG: When did sir francis drake die ?

Input 2: american idol is an american singing competition … . it began airing on fox on june 11 , 2002, as an addition to the idols format based on the british series pop idol and has since become one of the most successful shows in the history of american television.

Human: When did american idol first air on tv ?

ContextNQG: When did fox begin airing ?

CorefNQG: When did american idol begin airing ?

Input 3: … the a38 dual-carriageway runs from east to west across the north of the city . within the city it is designated as ‘ the parkway ’ and represents the boundary between the urban parts of the city and the generally more recent suburban areas .

Human: What is the a38 called inside the city ?

ContextNQG: What is another name for the city ?

CorefNQG: What is the city designated as ?

Figure 3: Example questions (with answers highlighted) generated by human annotators (ground truth questions), by our system CorefNQG, and by the Seq2seq+Copy model trained with full context (i.e., ContextNQG).

6.2 Human Study

              Grammaticality   Making Sense   Answerability   Avg. rank
ContextNQG    3.793            3.836          3.892           1.768
CorefNQG      3.804*           3.847**        3.895*          1.762
Human         3.807            3.850          3.902           1.758

Table 5: Human evaluation results for question generation.

“Grammaticality”, “Making Sense” and “Answerability” are rated on a 1–5 scale (5 for the best, see the supplementary materials for a detailed rating scheme); “Average rank” is rated on a 1–3 scale (1 for the most preferred, ties are allowed). Two-tailed t-test results are shown for our method compared to ContextNQG (statistical significance is indicated with * ($p$ < 0.05) and ** ($p$ < 0.01)).

We hired four native speakers of English to rate the systems’ outputs. Detailed guidelines for the raters are listed in the supplementary materials. The evaluation can also be seen as a measure of the quality of the generated dataset (Section 6.3). We randomly sampled 11 passages/paragraphs from the test set; there are in total around 70 question-answer pairs for evaluation.

We consider three metrics — “grammaticality”, “making sense” and “answerability”. The evaluators are asked to first rate the grammatical correctness of the generated question (before being shown the associated input sentence or any other textual context). Next, we ask them to rate the degree to which the question “makes sense” given the input sentence (i.e., without considering the correctness of the answer span). Finally, evaluators rate the “answerability” of the question given the full context.

Table 5 shows the results of the human evaluation. Bold indicates top scores. We see that the original human questions are preferred over the two NQG systems’ outputs, which is understandable given the examples in Figure 3. The human-generated questions make more sense and correspond better with the provided answers, particularly when they require information in the preceding context. How exactly to capture the preceding context so as to ask better and more diverse questions is an interesting future direction for research. In terms of grammaticality, however, the neural models do quite well, achieving very close to human performance. In addition, we see that our method (CorefNQG) performs statistically significantly better across all metrics in comparison to the baseline model (ContextNQG), which has access to the entire preceding context in the passage.

6.3 The Generated Corpus

Our system generates in total 1,259,691 question-answer pairs, nearly 126 questions per article. Figure 5 shows the distribution of different types of questions in our dataset vs. the SQuAD training set. We see that the distribution for “In what”, “When”, “How long”, “Who”, “Where”, “What does” and “What do” questions in the two datasets is similar. Our system generates more “What is”, “What was” and “What percentage” questions, while the proportions of “What did”, “Why” and “Which” questions in SQuAD are larger than ours. One possible reason is that the “Why” and “What did” questions are more complicated to ask (sometimes involving world knowledge) and the answer spans are longer phrases of various types that are harder to identify. “What is” and “What was” questions, on the other hand, are often safer for the neural network systems to ask.

                                  Exact Match        F-1
                                  Dev      Test      Dev      Test
DocReader (Chen et al., 2017)     82.33    81.65     88.20    87.79

Table 6: Performance of the neural machine reading comprehension model (no initialization with pretrained embeddings) on our generated corpus.

The United States of America (USA), commonly referred to as the United States (U.S.) or America, is a federal republic composed of states, a federal district, five major self-governing territories, and various possessions. … . The territories are scattered about the Pacific Ocean and the Caribbean Sea. Nine time zones are covered. The geography, climate and wildlife of the country are extremely diverse.

Q1: What is another name for the united states of america ?

Q2: How many major territories are in the united states?

Q3: What are the territories scattered about ?

Figure 4: Example question-answer pairs from our generated corpus.
Figure 5: Distribution of question types of our corpus and SQuAD training set. The categories are the ones used in Wang et al. (2016); we add one more category: “what percentage”.

In Figure 4, we show some examples of the generated question-answer pairs. The answer extractor identifies the answer span boundary well and all three questions correspond to their answers. Q2 is valid but not entirely accurate. For more examples, please refer to our supplementary materials.

Table 6 shows the performance of a top-performing system for the SQuAD dataset (Document Reader Chen et al. (2017)) when applied to the development and test set portions of our generated dataset. The system was trained on the training set portion of our dataset. We use the SQuAD evaluation scripts, which calculate exact match (EM) and F-1 scores (F-1 measures the average overlap between the predicted answer span and the ground truth answer Rajpurkar et al. (2016)). Performance of the neural machine reading model is reasonable. We also train the DocReader on our training set and test the model’s performance on the original dev set of SQuAD; for this, the performance is around on EM and on F-1 metric. DocReader trained on the original SQuAD training set achieves EM, F-1 indicating that our dataset is more difficult and/or less natural than the crowd-sourced QA pairs of SQuAD.
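For readers unfamiliar with the two metrics, the following sketch computes exact match and token-level F-1 for a single prediction against a single reference answer. It follows the general logic of the SQuAD v1.1 evaluation (lowercasing, stripping punctuation and English articles, then comparing token bags), but it is a simplified illustration with our own function names rather than the official script; in particular, taking the maximum over several reference answers and averaging over a dataset are omitted.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Pacific Ocean", "pacific ocean"))
    print(f1_score("83 million foreign tourists", "83 million"))
```

In the second call the prediction contains both reference tokens plus two extra ones, so recall is 1 and precision is 0.5; a partially overlapping span thus receives partial F-1 credit while scoring 0 on exact match.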

7 Conclusion

We propose a new neural network model for better encoding coreference knowledge for paragraph-level question generation. Evaluations with different metrics on the SQuAD machine reading dataset show that our model outperforms state-of-the-art baselines. The ablation study shows the effectiveness of different components in our model. Finally, we apply our question generation framework to produce a corpus of 1.26 million question-answer pairs, which we hope will benefit the QA research community. It would also be interesting to apply our approach to incorporating coreference knowledge to other text generation tasks.


We thank the anonymous reviewers and members of the Cornell NLP group for helpful comments.


Appendix A Supplementary Materials

A.1 Example Question-Answer Pairs from the Corpus

We provide more examples of QA pairs from the corpus; the red answer spans correspond to the questions in order.

Paragraph 1  France has long been a global centre of art, science, and philosophy. It hosts Europe’s fourth-largest number of cultural UNESCO World Heritage Sites and receives around 83 million foreign tourists annually, the most of any country in the world. France is a developed country with the world’s sixth-largest economy by nominal GDP and ninth-largest by purchasing power parity. In terms of aggregate household wealth, it ranks fourth in the world. France performs well in international rankings of education, health care, life expectancy, and human development. France remains a great power in the world, being a founding member of the United Nations, where it serves as one of the five permanent members of the UN Security Council, and a founding and leading member state of the European Union (EU). It is also a member of the Group of 7, North Atlantic Treaty Organization (NATO), Organisation for Economic Co-operation and Development (OECD), the World Trade Organization (WTO), and La Francophonie.


Q1: how many foreign tourists does france have ?

Q2: what is france ’s sixth-largest economy ?

Q3: what does nato stand for ?

Paragraph 2  The United States embarked on a vigorous expansion across North America throughout the 19th century, displacing American Indian tribes, acquiring new territories, and gradually admitting new states until it spanned the continent by 1848. During the second half of the 19th century, the American Civil War led to the end of legal slavery in the country. By the end of that century, the United States extended into the Pacific Ocean, and its economy, driven in large part by the Industrial Revolution, began to soar. The Spanish-American War and World War I confirmed the country’s status as a global military power. The United States emerged from World War II as a global superpower, the first country to develop nuclear weapons, the only country to use them in warfare, and a permanent member of the United Nations Security Council. It is a founding member of the Organization of American States (UAS) and various other Pan-American and international organisations. The end of the Cold War and the dissolution of the Soviet Union in 1991 left the United States as the world’s sole superpower.


Q1: what war led to the end of legal slavery ?

Q2: what was the name of the war that confirmed the country ?

Paragraph 3  The International Standard Name Identifier (ISNI) is an identifier for uniquely identifying the public identities of contributors to media content such as books, TV programmes, and newspaper articles. Such an identifier consists of 16 digits. It can optionally be displayed as divided into four blocks.


Q1: what is an example of a identifier name ?

Q2: how many digits does an identifier have ?

Q3: how long can the identifier be displayed ?

Paragraph 4  India, officially the Republic of India, is a country in South Asia. It is the seventh-largest country by area, the second-most populous country (with over 1.2 billion people), and the most populous democracy in the world. It is bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast. It shares land borders with Pakistan to the west; China, Nepal, and Bhutan to the northeast; and Myanmar (Burma) and Bangladesh to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives. India’s Andaman and Nicobar Islands share a maritime border with Thailand and Indonesia. Its capital is New Delhi; other metropolises include Mumbai, Kolkata, Chennai, Bangalore, Hyderabad and Ahmedabad.


Q1: what is india ’s country called ?

Q2: where is the republic of india located ?

Q3: how many people are in the second-most populous country ?

Q4: what is the vicinity of india ?

Q5: which two countries did the nicobar islands play a maritime border with ?

Q6: what is the capital of india ?

A.2 Human Rater Guidelines

We provide the following guidelines for the raters:

Category | Rating scheme
Grammaticality | Given only the question itself, is it grammatical?
Making sense | Given just the question and the surrounding context in the passage, does the question make sense?
Answerability | Given just the question and the surrounding context in the passage and the answer (and regardless of “Grammaticality” and “Making Sense”), is the question answerable by the corresponding answer span?
Table 7: Guidelines for the raters. For each category, the human raters are required to give a rating ranging from 1 to 5 (5 = fully satisfying the rating scheme, 1 = completely not satisfying the rating scheme, 3 = the borderline cases).

A.3 Training and Implementation Details

For the question generation model, the input and output vocabularies are collected from the training data, and we keep the 50k most frequent words. The sizes of the word embeddings and LSTM hidden states are set to 128 and 256, respectively. We use dropout Srivastava et al. (2014) with probability . The model parameters are initialized randomly using a uniform distribution between . We use Stochastic Gradient Descent (SGD) as the optimization algorithm with a mini-batch size of 64. We also apply gradient clipping Pascanu et al. (2013) with range during training. The best models are selected based on the lowest perplexity on the development set. In all experiments, we use the same split of the SQuAD dataset into training, development and test sets as du2017LearningToAsk. We use beam search during decoding to get better results. We set the beam size to 3 in the experiments and corpus generation. For the tokenizer used in building the answer extraction system, we use SpaCy.
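To make the decoding step concrete, here is a minimal, model-agnostic sketch of beam search with a beam size of 3. The `step_fn` callback, the toy probability table, and the `<s>`/`</s>` markers are hypothetical stand-ins for a trained decoder; the sketch scores hypotheses by summed log-probability without length normalization, which may differ from the decoder actually used.

```python
import math


def beam_search(step_fn, bos, eos, beam_size=3, max_len=10):
    """Return the (tokens, log_prob) hypothesis with the highest score.

    step_fn(tokens) must return a dict mapping each candidate next token
    to its probability given the tokens generated so far.
    """
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by one token.
        candidates = []
        for tokens, score in beams:
            for token, prob in step_fn(tokens).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])


# Toy next-token table (hypothetical): the greedy first choice "a" leads to
# a lower-probability sequence than starting with "b".
TOY_TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
}


def toy_step(tokens):
    return TOY_TABLE.get(tokens[-1], {"</s>": 1.0})


if __name__ == "__main__":
    print(beam_search(toy_step, "<s>", "</s>", beam_size=3)[0])
    print(beam_search(toy_step, "<s>", "</s>", beam_size=1)[0])
```

With a beam of 3 the search keeps the "b" branch alive and finds the sequence with probability 0.36, whereas a beam of 1 (greedy decoding) commits to "a" and ends with probability 0.3.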