Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning

08/16/2019 · by Pradeep Dasigi, et al. · University of Washington

Machine comprehension of texts longer than a single sentence often requires coreference resolution. However, most current reading comprehension benchmarks do not contain complex coreferential phenomena and hence fail to evaluate the ability of models to resolve coreference. We present a new crowdsourced dataset containing 15K span-selection questions that require resolving coreference among entities in about 3.5K English paragraphs from Wikipedia. Obtaining questions focused on such phenomena is challenging, because it is hard to avoid lexical cues that shortcut complex reasoning. We deal with this issue by using a strong baseline model as an adversary in the crowdsourcing loop, which helps crowdworkers avoid writing questions with exploitable surface cues. We show that state-of-the-art reading comprehension models perform poorly on this benchmark: the best model performance is 49.1 F1, while the estimated human performance is 87.2 F1.




1 Introduction

Figure 1: Example paragraph and questions from the dataset. Highlighted text in paragraphs is where the questions with matching highlights are anchored. Next to the questions are the relevant coreferent mentions from the paragraph. They are bolded for the first question, italicized for the second, and underlined for the third in the paragraph.

Paragraphs and other longer texts typically make multiple references to the same entities. Tracking these references and resolving coreference is essential for full machine comprehension of these texts. Significant progress has recently been made in reading comprehension research, due to large crowdsourced datasets (Rajpurkar et al., 2016; Bajaj et al., 2016; Joshi et al., 2017; Kwiatkowski et al., 2019, inter alia). However, these datasets focus largely on understanding local predicate-argument structure, with very few questions requiring long-distance entity tracking. Obtaining such questions is hard for two reasons: (1) teaching crowdworkers about coreference is challenging, with even experts disagreeing on its nuances (Pradhan et al., 2007; Versley, 2008; Recasens et al., 2011; Poesio et al., 2018), and (2) even if we can get crowdworkers to target coreference phenomena in their questions, these questions may contain giveaways that let models arrive at the correct answer without performing the desired reasoning (see §3 for examples).

We introduce a new dataset, Quoref, that contains questions requiring coreferential reasoning (see examples in Figure 1). The questions are derived from paragraphs taken from a diverse set of English Wikipedia articles and are collected using an annotation process (§2) that deals with the aforementioned issues in the following ways: First, we devise a set of instructions that gets workers to find anaphoric expressions and their referents, asking questions that connect two mentions in a paragraph. These questions mostly revolve around traditional notions of coreference (Figure 1 Q1), but they can also involve referential phenomena that are more nebulous (Figure 1 Q3). Second, inspired by Dua et al. (2019), we disallow questions that can be answered by an adversary model (uncased base BERT, Devlin et al., 2019, trained on SQuAD 1.1, Rajpurkar et al., 2016) running in the background as the workers write questions. This adversary is not particularly skilled at answering questions requiring coreference, but can follow obvious lexical cues—it thus helps workers avoid writing questions that shortcut coreferential reasoning.

Quoref contains more than 15K questions whose answers are spans or sets of spans in 3.5K paragraphs from English Wikipedia, and arriving at those answers requires resolving coreference in the paragraphs. We manually analyze a sample of the dataset (§3) and find that 78% of the questions cannot be answered without resolving coreference. We also show (§4) that the best system performance is 49.1 F1, while the estimated human performance is 87.2 F1. These findings indicate that Quoref is an appropriate benchmark for coreference-aware reading comprehension.

2 Dataset Construction

Collecting paragraphs

We scraped paragraphs from Wikipedia pages about English movies, art and architecture, geography, history, and music. For movies, we followed the list of English-language films (https://en.wikipedia.org/wiki/Category:English-language_films) and extracted plot summaries that are at least 40 tokens long; for the remaining categories, we followed the lists of featured articles (https://en.wikipedia.org/wiki/Wikipedia:Featured_articles). Since movie plot summaries usually mention many characters, it was easier to elicit hard Quoref questions for them, so we sampled about 60% of the paragraphs from this category.
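The length filter described above can be sketched in a few lines. This is a minimal illustration, not the authors' code; the helper name `keep_paragraph` and the whitespace tokenization are assumptions.

```python
# Minimal sketch of the paragraph-length filter: keep only paragraphs with
# at least 40 whitespace-separated tokens. Whitespace tokenization is an
# assumption; the paper does not specify its tokenizer.
MIN_TOKENS = 40

def keep_paragraph(paragraph: str, min_tokens: int = MIN_TOKENS) -> bool:
    """Return True if the paragraph meets the minimum token count."""
    return len(paragraph.split()) >= min_tokens

paragraphs = [
    "Too short to keep.",
    " ".join(["token"] * 45),  # 45 tokens, passes the filter
]
kept = [p for p in paragraphs if keep_paragraph(p)]
```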

                                 Train    Dev      Test
Number of questions              10352    2245     2672
Number of paragraphs             2328     487      494
Avg. paragraph length (tokens)   353±97   349±102  348±90
Avg. question length (tokens)    16±6     16±6     15±5
Paragraph vocabulary size        47487    19402    18958
Question vocabulary size         13645    5098     5398

Table 1: Key statistics of Quoref splits (length values are mean ± std. dev.).

Crowdsourcing setup

We crowdsourced questions about these paragraphs on Mechanical Turk. We asked workers to find two or more co-referring spans in the paragraph, and to write questions such that answering them would require the knowledge that those spans are coreferential. We did not ask them to explicitly mark the co-referring spans. Workers were asked to write questions for a random sample of paragraphs from our pool, and we showed them examples of good and bad questions in the instructions (see Appendix A). For each question, the workers were also required to select one or more spans in the corresponding paragraph as the answer, and these spans are not required to be the same as the coreferential spans that triggered the questions. (For example, the last question in Table 2 is about the coreference of {she, Fania, his mother}, but none of these mentions is the answer.) We used an uncased base BERT QA model (Devlin et al., 2019) trained on SQuAD 1.1 (Rajpurkar et al., 2016) as an adversary running in the background, which attempted to answer the questions written by workers in real time; workers could submit a question only if their answer did not match the adversary's prediction. (Among models with acceptably low latency, we qualitatively found uncased base BERT to be the most effective.) Appendix A further details the logistics of the crowdsourcing tasks. Some basic statistics of the resulting dataset are shown in Table 1.
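The adversary-in-the-loop check amounts to a simple accept/reject rule: a question is accepted only when the background model's prediction does not match the worker's answer. The sketch below illustrates this logic with a stand-in `adversary_predict` callable and a simplified string normalization; both are assumptions, since the paper does not specify the matching criterion.

```python
# Hedged sketch of the adversarial filtering rule. `adversary_predict` is a
# stand-in for the real BERT QA model; lowercase/whitespace normalization is
# a simplification of whatever matching the actual interface used.
def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def accept_question(question: str, gold_answer: str, adversary_predict) -> bool:
    """Accept the question only if the adversary fails to answer it."""
    prediction = adversary_predict(question)
    return normalize(prediction) != normalize(gold_answer)

# Toy adversary that always answers with a fixed span:
adversary = lambda question: "the three van thieves"

easy = accept_question("Who does Declan fight?",
                       "The Three Van Thieves", adversary)   # rejected: False
hard = accept_question("How does Arieh's wife die?",
                       "kills herself by overdose", adversary)  # accepted: True
```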

Pronominal resolution (69%)
  Snippet: Anna and Declan eventually make their way on foot to a roadside pub, where they discover the three van thieves going through Anna's luggage. Declan fights them, displaying unexpected strength for a man of his size, and retrieves Anna's bag.
  Question: Who does Declan get into a fight with?
  Answer: the three van thieves

Nominal resolution (54%)
  Snippet: Later, Hippolyta was granted a daughter, Princess Diana ... Diana defies her mother and ...
  Question: What is the name of the person who is defied by her daughter?
  Answer: Hippolyta

Multiple resolutions (32%)
  Snippet: The now upbeat collective keep the toucan, nicknaming it "Amigo" ... When authorities show up to catch the bird, Pete and Liz spirit him away by Liz hiding him in her dress ...
  Question: What is the name of the character who hides in Liz's dress?
  Answer: Amigo

Commonsense reasoning (10%)
  Snippet: Amos reflects back on his early childhood with his mother Fania and father Arieh. One of his mother's friends is killed while hanging up laundry during the war. Fania falls into a depression ... she goes to Tel Aviv, where she kills herself by overdose ...
  Question: How does Arieh's wife die?
  Answer: kills herself by overdose

Table 2: Phenomena in Quoref. Note that the first two classes are not disjoint. In the final example, the paragraph does not explicitly say that Fania is Arieh's wife.

3 Semantic Phenomena in Quoref

To better understand the phenomena present in Quoref, we manually analyzed a random sample of 100 paragraph-question pairs. The following are some empirical observations.

Requirement of coreference resolution

We found that 78% of the manually analyzed questions cannot be answered without coreference resolution. The remaining 22% involve some form of coreference, but do not require it to be resolved for answering them. For example, one paragraph mentions only one city, "Bristol", and contains the sentence "the city was bombed". The associated question, Which city was bombed?, does not really require coreference resolution from a model that can identify city names, making the content in the question after Which city unnecessary.

Types of coreferential reasoning

Questions in Quoref require resolving pronominal and nominal mentions of entities. Table 2 shows percentages and examples of analyzed questions that fall into these two categories. These are not disjoint sets, since we found that 32% of the questions require both (row 3). We also found that 10% require some form of commonsense reasoning (row 4).

4 Baseline Model Performance on Quoref

                        Dev              Test
Method                  EM      F1       EM      F1
Heuristic baselines
  passage-only          16.26   26.77    17.40   26.89
Single-span RC
  QANet                 28.24   32.67    28.89   33.43
  QANet + BERT          36.08   40.91    35.25   40.17
  BERT QA               44.63   50.79    43.30   49.07
Multi-span RC
  BERT QA               40.04   46.69    40.19   46.27
Human performance       -       -        77.25   87.19

Table 3: Performance of various baselines on Quoref, measured by Exact Match (EM) and F1. The single-span BERT QA model is the best system on each metric and split.

We evaluated three types of initial baselines on Quoref: state-of-the-art reading comprehension models that predict a single span (§4.1), state-of-the-art single-span reading comprehension models extended to predict multiple spans (§4.2), and heuristic baselines to look for annotation artifacts (§4.3). We use two evaluation metrics to compare model performance: Exact Match (EM), and a (macro-averaged) F1 score that measures overlap between a bag-of-words representation of the gold and predicted answers. We use the same implementation of EM as SQuAD, and we employ the F1 metric used for DROP (Dua et al., 2019). See Appendix B for model training hyperparameters and other details. (We will release code for reproducing results.)
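The two metrics can be sketched in pure Python. This is a simplified illustration, not the official scripts: the SQuAD and DROP implementations additionally normalize punctuation and articles, and the DROP metric aligns multiple gold and predicted spans before averaging.

```python
from collections import Counter

def bag_of_words_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between predicted and gold answer strings
    (simplified: lowercasing and whitespace tokenization only)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts shared tokens with multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction: str, gold: str) -> float:
    """1.0 iff the (simplistically normalized) strings are identical."""
    return float(prediction.lower().strip() == gold.lower().strip())
```

For example, predicting "the three van thieves" against gold "three van thieves" gives EM 0 but F1 of 6/7, since three of four predicted tokens overlap with all three gold tokens.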

4.1 Single-Span Reading Comprehension

We test different single-span (SQuAD-style) reading comprehension models: (1) QANet (Yu et al., 2018), currently the best-performing published model on SQuAD 1.1 without data augmentation or pretraining; (2) QANet + BERT, which enhances the QANet model by concatenating frozen BERT representations to the original input embeddings; (3) BERT QA (Devlin et al., 2019), the adversarial baseline used in data construction.

When training our models, we use the summed likelihood objective function proposed by Clark and Gardner (2018), which marginalizes the model's output over all occurrences of the answer text. We use the AllenNLP (Gardner et al., 2018) implementation of QANet, modified to use the marginal objective. All BERT experiments use the base uncased model. (The large uncased model does not fit in GPU memory.)
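The marginalization step of the summed likelihood objective can be illustrated concretely: given per-position start and end probabilities, the probability assigned to an answer is summed over every occurrence of the answer text in the passage (the training loss is then the negative log of this quantity). This sketch assumes plain probability lists and whitespace tokens; it is an illustration of the objective, not the AllenNLP implementation.

```python
def find_occurrences(passage_tokens, answer_tokens):
    """All start indices where the answer token sequence occurs."""
    n, m = len(passage_tokens), len(answer_tokens)
    return [i for i in range(n - m + 1)
            if passage_tokens[i:i + m] == answer_tokens]

def summed_likelihood(start_probs, end_probs, passage_tokens, answer_tokens):
    """Marginal probability of the answer: sum of P(start) * P(end) over
    every occurrence of the answer text in the passage."""
    m = len(answer_tokens)
    total = 0.0
    for start in find_occurrences(passage_tokens, answer_tokens):
        total += start_probs[start] * end_probs[start + m - 1]
    return total

passage = "the cat saw the cat".split()
starts = [0.4, 0.0, 0.0, 0.3, 0.0]
ends = [0.0, 0.5, 0.0, 0.0, 0.4]
# "the cat" occurs at positions 0 and 3, so both occurrences contribute:
p = summed_likelihood(starts, ends, passage, "the cat".split())
# p = 0.4 * 0.5 + 0.3 * 0.4 = 0.32
```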

4.2 Multi-Span Reading Comprehension

The single-span reading comprehension baselines are incapable of answering questions with multiple answer spans (e.g., the third question in Figure 1); these constitute 12% of Quoref questions. Motivated by this shortcoming, we also evaluated a simple extension of the BERT QA model (the strongest single-span model evaluated), where the model is equipped with a variable number of prediction heads. Each prediction head contains two softmax classifiers for predicting the start and end indices of an answer span. (The single-span BERT model is a special case of the multi-span BERT model, with only one prediction head.) We set the number of prediction heads to equal the maximum number of answer spans associated with any training dataset question (8), enabling the model to correctly answer all questions in the training dataset. Each softmax classifier in each prediction head is independently trained, and prediction heads can opt out of predicting an answer.
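Decoding from multiple prediction heads can be sketched as follows. The paper does not specify the opt-out mechanism, so this sketch makes an assumption: each head's distributions include a reserved null position at index 0, and a head predicting the null position (or an inverted span) is treated as opting out.

```python
# Hedged sketch of multi-span decoding. Each head is a pair of
# (start_probs, end_probs) over [null] + passage positions. The null
# position at index 0 is an assumed opt-out mechanism, not the authors'
# documented design.
def decode_head(start_probs, end_probs):
    start = max(range(len(start_probs)), key=start_probs.__getitem__)
    end = max(range(len(end_probs)), key=end_probs.__getitem__)
    if start == 0 or end == 0 or end < start:
        return None  # head opted out, or produced an invalid span
    return (start, end)

def decode_spans(heads, tokens):
    """Collect the distinct answer spans predicted by all heads."""
    spans = set()
    for start_probs, end_probs in heads:
        span = decode_head(start_probs, end_probs)
        if span is not None:
            # Shift by 1 because index 0 is the null position.
            spans.add(" ".join(tokens[span[0] - 1:span[1]]))
    return spans

tokens = ["Pete", "and", "Liz"]
heads = [
    ([0.1, 0.8, 0.05, 0.05], [0.1, 0.05, 0.05, 0.8]),  # predicts tokens 1..3
    ([0.9, 0.05, 0.03, 0.02], [0.2, 0.3, 0.3, 0.2]),   # opts out via null
]
predicted = decode_spans(heads, tokens)
```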

4.3 Heuristic Baselines

In light of recent work exposing predictive artifacts in crowdsourced NLP datasets (Gururangan et al., 2018; Kaushik and Lipton, 2018, inter alia), we estimate the effect of predictive artifacts by training a BERT-based model to predict a single start and end index given only the passage as input (passage-only).

4.4 Results

Table 3 presents the performance of all baseline models on Quoref. All models perform significantly worse than on other prominent reading comprehension datasets, and human performance remains high. (Human performance was estimated from the authors' answers to 400 questions from the test set, scored with the same metrics used for systems.) For instance, the BERT QA model yields the highest performance among our baselines, but its performance on Quoref is nearly 40 absolute points lower than its performance on SQuAD 1.1.

Our simple extensions to BERT QA for predicting multiple answer spans failed to improve upon the single-span counterpart. Qualitatively, the multi-span BERT QA model is prone to over-predicting answer spans. The passage-only baseline underperforms all other systems; examining its predictions reveals that it almost always predicts the most frequent entity in the passage. Its relatively low performance, despite the tendency for Wikipedia articles and passages to be written about a single entity, indicates that a large majority of questions likely require coreferential reasoning.

5 Related Work

Traditional coreference datasets

Unlike traditional coreference annotations in datasets like those of Pradhan et al. (2007), Ghaddar and Langlais (2016), Chen et al. (2018) and Poesio et al. (2018), which aim to obtain complete coreference clusters, our questions require understanding coreference between only a few spans. While this means that the notion of coreference captured by our dataset is less comprehensive, it is also less conservative and allows questions about coreference relations that are not marked in OntoNotes annotations. Since the notion is not as strict, it does not require linguistic expertise from annotators, making it more amenable to crowdsourcing.

Reading comprehension datasets

There are many reading comprehension datasets (Richardson et al., 2013; Rajpurkar et al., 2016; Kwiatkowski et al., 2019; Dua et al., 2019, inter alia). Most of these datasets principally require understanding local predicate-argument structure in a paragraph of text. Quoref also requires understanding local predicate-argument structure, but makes the reading task harder by explicitly querying anaphoric references, requiring a system to track entities throughout the discourse.

6 Conclusion

We present Quoref, a focused reading comprehension benchmark that evaluates the ability of models to resolve coreference. We crowdsourced questions over paragraphs from Wikipedia, and manual analysis confirmed that most cannot be answered without coreference resolution. We show that current state-of-the-art reading comprehension models perform poorly on this benchmark, significantly lower than human performance. Both these findings provide evidence that Quoref is an appropriate benchmark for coreference-aware reading comprehension.


  • P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, et al. (2016) MS MARCO: a human generated machine reading comprehension dataset. In Proc. of NeurIPS, Cited by: §1.
  • H. Chen, Z. Fan, H. Lu, A. Yuille, and S. Rong (2018) PreCo: A large-scale dataset in preschool vocabulary for coreference resolution. In Proc. of EMNLP, Cited by: §5.
  • C. Clark and M. Gardner (2018) Simple and effective multi-paragraph reading comprehension. In Proc. of ACL, Cited by: §4.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL, Cited by: §1, §2, §4.1.
  • D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner (2019) DROP: a reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proc. of NAACL, Cited by: §1, §4, §5.
  • M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. Zettlemoyer (2018) AllenNLP: a deep semantic natural language processing platform. In Proc. of Workshop for NLP Open Source Software, Cited by: Appendix B, §4.1.
  • A. Ghaddar and P. Langlais (2016) WikiCoref: an English coreference-annotated corpus of Wikipedia articles. In Proc. of LREC, Cited by: §5.
  • S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. In Proc. of NAACL, Cited by: §4.3.
  • M. S. Joshi, E. Choi, D. S. Weld, and L. S. Zettlemoyer (2017) TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In Proc. of ACL, Cited by: §1.
  • D. Kaushik and Z. C. Lipton (2018) How much reading does reading comprehension require? A critical investigation of popular benchmarks. In Proc. of EMNLP, Cited by: §4.3.
  • T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov (2019) Natural Questions: a benchmark for question answering research. In TACL, Cited by: §1, §5.
  • M. Poesio, Y. Grishina, V. Kolhatkar, N. Moosavi, I. Roesiger, A. Roussel, F. Simonjetz, A. Uma, O. Uryupina, J. Yu, and H. Zinsmeister (2018) Anaphora resolution with the ARRAU corpus. In Proc. of CRAC Workshop, Cited by: §1, §5.
  • S. S. Pradhan, L. Ramshaw, R. Weischedel, J. MacBride, and L. Micciulla (2007) Unrestricted coreference: identifying entities and events in OntoNotes. In Proc. of International Conference on Semantic Computing (ICSC), Cited by: §1, §5.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proc. of EMNLP, Cited by: §1, §1, §2, §5.
  • M. Recasens, E. Hovy, and M. A. Martí (2011) Identity, non-identity, and near-identity: addressing the complexity of coreference. Lingua 121 (6), pp. 1138–1152. Cited by: §1.
  • M. Richardson, C. J. C. Burges, and E. Renshaw (2013) MCTest: a challenge dataset for the open-domain machine comprehension of text. In Proc. of EMNLP, Cited by: §5.
  • Y. Versley (2008) Vagueness and referential ambiguity in a large-scale annotated corpus. Research on Language and Computation 6 (3), pp. 333–353. Cited by: §1.
  • A. W. Yu, D. Dohan, M. Luong, R. Zhao, K. Chen, M. Norouzi, and Q. V. Le (2018) QANet: Combining local convolution with global self-attention for reading comprehension. In Proc. of ICLR, Cited by: §4.1.

Appendix A Crowdsourcing Logistics

A.1 Instructions

The crowdworkers were given the following instructions:

“In this task, you will look at paragraphs that contain several phrases that are references to names of people, places, or things. For example, in the first sentence from sample paragraph below, the references Unas and the ninth and final king of Fifth Dynasty refer to the same person, and Pyramid of Unas, Unas’s pyramid and the pyramid refer to the same construction. You will notice that multiple phrases often refer to the same person, place, or thing. Your job is to write questions that you would ask a person to see if they understood that the phrases refer to the same entity. To help you write such questions, we provided some examples of good questions you can ask about such phrases. We also want you to avoid questions that can be answered correctly by someone without actually understanding the paragraph. To help you do so, we provided an AI system running in the background that will try to answer the questions you write. You can consider any question it can answer to be too easy. However, please note that the AI system incorrectly answering a question does not necessarily mean that it is good. Please read the examples below carefully to understand what kinds of questions we are interested in.”

A.2 Examples of good questions

We illustrate examples of good questions for the following paragraph.

“The Pyramid of Unas is a smooth-sided pyramid built in the 24th century BC for the Egyptian pharaoh Unas, the ninth and final king of the Fifth Dynasty. It is the smallest Old Kingdom pyramid, but significant due to the discovery of Pyramid Texts, spells for the king’s afterlife incised into the walls of its subterranean chambers. Inscribed for the first time in Unas’s pyramid, the tradition of funerary texts carried on in the pyramids of subsequent rulers, through to the end of the Old Kingdom, and into the Middle Kingdom through the Coffin Texts which form the basis of the Book of the Dead. Unas built his pyramid between the complexes of Sekhemket and Djoser, in North Saqqara. Anchored to the valley temple via a nearby lake, a long causeway was constructed to provide access to the pyramid site. The causeway had elaborately decorated walls covered with a roof which had a slit in one section allowing light to enter illuminating the images. A long wadi was used as a pathway. The terrain was difficult to negotiate and contained old buildings and tomb superstructures. These were torn down and repurposed as underlay for the causeway. A significant stretch of Djoser’s causeway was reused for embankments. Tombs that were on the path had their superstructures demolished and were paved over, preserving their decorations.”

The following questions link pronouns:

  Q1: What is the name of the person whose pyramid was built in North Saqqara? A: Unas

  Q2: What is significant due to the discovery of Pyramid Texts? A: The Pyramid of Unas

  Q3: What were repurposed as underlay for the causeway? A: old buildings; tomb superstructures

The following questions link other references:

  Q1: What is the name of the king for whose afterlife spells were incised into the walls of the pyramid? A: Unas

  Q2: Where did the final king of the Fifth dynasty build his pyramid? A: between the complexes of Sekhemket and Djoser, in North Saqqara

A.3 Examples of bad questions

We illustrate examples of bad questions for the following paragraph.

“Decisions by Republican incumbent Peter Fitzgerald and his Democratic predecessor Carol Moseley Braun to not participate in the election resulted in wide-open Democratic and Republican primary contests involving fifteen candidates. In the March 2004 primary election, Barack Obama won in an unexpected landslide—which overnight made him a rising star within the national Democratic Party, started speculation about a presidential future, and led to the reissue of his memoir, Dreams from My Father. In July 2004, Obama delivered the keynote address at the 2004 Democratic National Convention, seen by 9.1 million viewers. His speech was well received and elevated his status within the Democratic Party. Obama’s expected opponent in the general election, Republican primary winner Jack Ryan, withdrew from the race in June 2004. Six weeks later, Alan Keyes accepted the Republican nomination to replace Ryan. In the November 2004 general election, Obama won with 70 percent of the vote. Obama cosponsored the Secure America and Orderly Immigration Act. He introduced two initiatives that bore his name: Lugar–Obama, which expanded the Nunn–Lugar cooperative threat reduction concept to conventional weapons; and the Federal Funding Accountability and Transparency Act of 2006, which authorized the establishment of USAspending.gov, a web search engine on federal spending. On June 3, 2008, Senator Obama—along with three other senators: Tom Carper, Tom Coburn, and John McCain—introduced follow-up legislation: Strengthening Transparency and Accountability in Federal Spending Act of 2008.”

The following questions do not require coreference resolution:

  Q1: Who withdrew from the race in June 2004? A: Jack Ryan

  Q2: What Act sought to build on the Federal Funding Accountability and Transparency Act of 2006? A: Strengthening Transparency and Accountability in Federal Spending Act of 2008

The following question has ambiguous answers:

  Q1: Whose memoir was called Dreams from My Father? A: Barack Obama; Obama; Senator Obama

A.4 Worker Pool Management

Beyond training workers with the detailed instructions shown above, we ensured that the questions are of high quality by selecting a good pool of 21 workers using a two-stage selection process, allowing only those workers who clearly understood the requirements of the task to produce the final set of questions. Both the qualification and final HITs had 4 paragraphs per HIT for paragraphs from movie plot summaries, and 5 per HIT for the other domains, from which the workers could choose. For each HIT, workers typically spent 20 minutes, were required to write 10 questions, and were paid US$7.

Appendix B Experimental Setup Details

Unless otherwise mentioned, we adopt the original published procedures and hyperparameters used for each baseline.

BERT QA

We truncate paragraphs to 300 (word) tokens during training and 400 tokens during evaluation. Questions are always truncated to 30 tokens. We train our model with a batch size of 10 and a sequence length of 512 wordpieces, using the Adam optimizer. We train for 10 epochs with an early stopping patience of 5, checkpointing the model after each epoch, and report the performance of the checkpoint with the highest development set F1 score. The summed likelihood function of Clark and Gardner (2018) requires a probability distribution over words in the paragraph, and we take each word's BERT representation to be the vector associated with its first wordpiece.
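Selecting each word's first wordpiece can be sketched as an index-selection step. This illustration uses BERT's "##" continuation-marker convention on toy strings; the actual implementation operates on tokenizer output and vector tensors.

```python
# Sketch of mapping words to their first wordpiece. BERT marks wordpieces
# that continue a word with a "##" prefix, so a word's first wordpiece is
# any piece without that prefix.
def first_wordpiece_indices(wordpieces):
    """Indices of wordpieces that begin a new word."""
    return [i for i, wp in enumerate(wordpieces) if not wp.startswith("##")]

wordpieces = ["un", "##cased", "base", "bert"]
# Words "uncased", "base", "bert" are represented by pieces 0, 2, and 3;
# a per-word distribution would then be computed over just those vectors.
indices = first_wordpiece_indices(wordpieces)
```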

QANet

During training, we truncate paragraphs to 400 (word) tokens and questions to 50 tokens. During evaluation, we truncate paragraphs to 1000 tokens and questions to 100 tokens.

Passage-only baseline

We truncate paragraphs to 300 (word) tokens during training and 400 tokens during evaluation. To calculate BERT embeddings for each passage, we prepend the special classification token [CLS] and append the separator token [SEP] to the passage. As in the BERT QA model, the summed likelihood function of Clark and Gardner (2018) requires a probability distribution over words in the paragraph, so we take each word's BERT representation to be the vector associated with its first wordpiece.