The Natural Stories Corpus

08/18/2017 ∙ by Richard Futrell, et al.

It is now a common practice to compare models of human language processing by predicting participant reactions (such as reading times) to corpora consisting of rich naturalistic linguistic materials. However, many of the corpora used in these studies are based on naturalistic text and thus do not contain many of the low-frequency syntactic constructions that are often required to distinguish processing theories. Here we describe a new corpus consisting of English texts edited to contain many low-frequency syntactic constructions while still sounding fluent to native speakers. The corpus is annotated with hand-corrected parse trees and includes self-paced reading time data. Here we give an overview of the content of the corpus and release the data.




1 Introduction

It is becoming a standard practice to evaluate theories of human language processing by their ability to predict psychometric dependent variables such as per-word reaction time for standardized corpora of naturalistic text. Dependent variables that have been collected over fixed corpora include word fixation time in eyetracking (Kennedy et al., 2003), word reaction time in self-paced reading (Roark et al., 2009; Frank et al., 2013), BOLD signal in fMRI data (Bachrach et al., ms), and event-related potentials (Dambacher et al., 2006; Frank et al., 2015).

The more traditional approach to evaluating psycholinguistic models has been to collect psychometric measures on hand-crafted experimental stimuli designed to tease apart detailed model predictions. While this approach makes it easy to compare models on their accuracy for specific constructions and phenomena, it is hard to get a sense from experimental results of how models compare on their coverage of a broad range of phenomena. When it is standard practice to compare model predictions over standardized texts, then it is easier to evaluate coverage.

Although the fixed corpus approach has these advantages, the existing corpora currently used are based on naturally-occurring data, which is unlikely to include the kinds of sentences which can crucially distinguish between theories. Many of the most puzzling phenomena in psycholinguistics, and the phenomena which have been used to test models, have only been observed in extremely rare constructions, such as multiply nested preverbal relative clauses (Gibson and Thomas, 1999; Grodner and Gibson, 2005; Vasishth et al., 2010). Corpora of naturally-occurring text are unlikely to contain these constructions. More generally, models of human language comprehension are more likely to make distinct predictions for sentences that cause difficulty for humans, rather than for sentences that are easy to process. For instance, models of comprehension difficulty based on memory integration cost during parsing (Gibson, 2000; Lewis and Vasishth, 2005; Schuler et al., 2010; van Schijndel et al., 2013) will predict effects when the memory spans required for parsing are large, but most syntactic dependencies in naturally-occurring text are short (Temperley, 2007; Liu, 2008; Futrell et al., 2015). In general, processing difficulty might be rare for naturally-occurring text, because text is written and edited in order to be easily understood.

Here we attempt to combine the strength of experimental approaches, which can test theories using targeted low-frequency structures, and corpus studies, which provide broad-coverage comparability between models. We introduce and release a new corpus, the Natural Stories Corpus, a series of English narrative texts designed to contain many low-frequency and psycholinguistically interesting syntactic constructions while still sounding fluent and coherent. The texts are annotated with hand-corrected Penn Treebank style phrase structure parses, and Universal Dependencies parses automatically generated from the phrase structure parses. We also release self-paced reading time data for all texts, and word-aligned audio recordings of the texts. We hope the corpus can form the basis for further annotation and become a standard test set for psycholinguistic models.

2 Related Work

Here we survey datasets which are commonly used to test psycholinguistic theories and how they relate to the current release.

Currently the most prominent psycholinguistic corpus for English is the Dundee Corpus (Kennedy et al., 2003), which contains 51,501 word tokens in 2,368 sentences from British newspaper editorials, along with eyetracking data from 10 participants. Barrett et al. have released a dependency parse of the corpus. As in the current work, the eyetracking data in the Dundee corpus is collected for sentences in context and so reflects influences beyond the sentence level. The corpus has seen wide usage; see, for example, Demberg and Keller (2008); Mitchell et al. (2010); Frank and Bod (2011); Fossum and Levy (2012); Smith and Levy (2013); van Schijndel and Schuler (2015); Luong et al. (2015).

The Potsdam Sentence Corpus (Kliegl et al., 2006) of German provides 1138 words in 144 sentences, with cloze probabilities and eyetracking data for each word. Like the current corpus, the Potsdam Sentence Corpus was designed to contain varied syntactic structures, rather than being gathered from naturalistic text. The corpus consists of isolated sentences which do not form a narrative, and during eyetracking data collection the sentences were presented in a random order. The corpus has been used to evaluate models of sentence processing based on dependency parsing (Boston et al., 2008, 2011) and to study effects of predictability on event-related potentials (Dambacher et al., 2006).

The MIT Corpus introduced in Bachrach et al. (ms) has similar aims to the current work, collecting reading time and fMRI data over sentences designed to contain varied structures. This dataset consists of four narratives with a total of 2647 tokens; it has been used to evaluate models of incremental prediction in Roark et al. (2009), Wu et al. (2010), and Luong et al. (2015).

The UCL Corpus (Frank et al., 2013) consists of 361 English sentences drawn from amateur novels, chosen for their ability to be understood out of context, with self-paced reading and eyetracking data. The goal of the corpus is to provide a sample of typical narrative sentences, complementary to our goal of providing a corpus with low-frequency constructions. Unlike the current corpus, the UCL Corpus consists of isolated sentences, so the psychometric data do not reflect effects beyond the sentence level.

Eyetracking corpora for other languages are also available, including the Potsdam-Allahabad Hindi Eyetracking Corpus (Husain et al., 2014) and the Beijing Sentence Corpus of Mandarin Chinese (Yan et al., 2010).

3 Corpus Description

3.1 Text

The Natural Stories corpus consists of 10 stories, comprising 10,245 lexical word tokens and 485 sentences in total. The stories were developed by A.V., E.F., E.G. and S.P. by taking existing publicly available texts and editing them to use many subject- and object-extracted relative clauses, clefts, topicalized structures, extraposed relative clauses, sentential subjects, sentential complements, local structural ambiguity (especially NP/Z ambiguity), idioms, and conjoined clauses with a variety of coherence relations. The original texts are listed in Table 1.

Story | Title | Source Title | Source Author
1 | Boar | The Legend of the Bradford Boar | E. H. Hopkinson
2 | Aqua | Aqua, or the Water Baby | Kate Douglas Wiggin
3 | Matchstick | The Little Match-Seller | Hans Christian Andersen
4 | King of Birds | The King of the Birds | Brothers Grimm
5 | Elvis | Elvis Died at the Florida Barber College | Roger Dean Kiser
6 | Mr. Sticky | Mr. Sticky | Mo McAuley
7 | High School | Bullies | Sarah Cleaves
8 | Roswell | Roswell UFO incident | Wikipedia
9 | Tulips | Tulip mania | Wikipedia
10 | Tourette’s | Tourette Syndrome Fact Sheet | NINDS
Table 1: Stories with titles and sources. Footnotes contain URLs for the original text.

The mean number of lexical words per sentence is 21.1, around the same as the Dundee corpus (21.7). Figure 1 shows a histogram of sentence length in Natural Stories as compared to Dundee. The word and sentence counts for each story are given in Table 2. Each token has a unique code which is referenced throughout the various annotations of the corpus.

Figure 1: Histograms of sentence length (in tokens, including punctuation) in Natural Stories and the Dundee corpus.

Story | # Words | # Sentences
1 | 1073 | 57
2 | 990 | 37
3 | 1040 | 55
4 | 1085 | 55
5 | 1013 | 45
6 | 1089 | 64
7 | 999 | 48
8 | 980 | 33
9 | 1038 | 48
10 | 938 | 43
Table 2: Summary of stories by length.
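
The per-story counts in Table 2 can be checked against the corpus totals and the mean sentence length reported in the text. The short Python sketch below only re-adds the numbers from the table; the variable names are illustrative.

```python
# Word and sentence counts per story, copied from Table 2.
words = [1073, 990, 1040, 1085, 1013, 1089, 999, 980, 1038, 938]
sentences = [57, 37, 55, 55, 45, 64, 48, 33, 48, 43]

total_words = sum(words)
total_sentences = sum(sentences)
mean_words_per_sentence = total_words / total_sentences

print(total_words, total_sentences, round(mean_words_per_sentence, 1))
# prints: 10245 485 21.1
```

The totals match the 10,245 lexical word tokens and 485 sentences stated in Section 3.1, and the ratio matches the reported mean of 21.1 words per sentence.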

In Figure 2 we give a sample of text from the corpus (from the first story).

If you were to journey to the North of England, you would come to a valley that is surrounded by moors as high as mountains. It is in this valley where you would find the city of Bradford, where once a thousand spinning jennies that hummed and clattered spun wool into money for the long-bearded mill owners. That all mill owners were generally busy as beavers and quite pleased with themselves for being so successful and well off was known to the residents of Bradford, and if you were to go into the city to visit the stately City Hall, you would see there the Crest of the City of Bradford, which those same mill owners created to celebrate their achievements.

Figure 2: Sample text from the first story.

3.2 Parses

The texts were parsed automatically using the Stanford Parser (Klein and Manning, 2003) and hand-corrected. Trace annotations were added by hand. We provide the resulting Penn Treebank-style phrase structure parse trees. We also provide Universal Dependencies parses (Nivre, 2015) automatically converted from the corrected parse trees using the Stanford Parser.
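
As a rough illustration of how Penn Treebank-style bracketed trees with trace annotations can be read, the sketch below parses a bracketed string into nested tuples and recovers the overt words, skipping empty elements under `-NONE-` nodes. The example trees are hand-written for illustration and are not taken from the released corpus files; the function names `parse_tree` and `leaves` are hypothetical, and the released file layout may differ.

```python
def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())   # nested constituent
            else:
                children.append(tokens[pos])  # terminal string
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()


def leaves(tree):
    """Return overt words left to right, skipping traces under -NONE-."""
    label, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        elif label != "-NONE-":
            words.append(child)
    return words


# Hand-written illustrative tree (an assumption, not corpus data).
example = "(S (NP (PRP It)) (VP (VBZ is) (PP (IN in) (NP (DT this) (NN valley)))))"
print(leaves(parse_tree(example)))
```

Keeping traces in the tree while filtering them at read time lets the same structure serve both syntactic analyses and alignment with the word-level reading time data.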

3.3 Self-Paced Reading Data

We collected self-paced reading (SPR) data (Just et al., 1982) for the stories from 181 native English speakers over Amazon Mechanical Turk. Text was presented in a dashed moving window display; spaces were masked. Each participant read 5 stories per HIT. 19 participants read all 10 stories, and 3 participants stopped after one story. Each story was accompanied by 6 comprehension questions. We discarded SPR data from a participant’s pass through a story if the participant got fewer than 5 questions correct (89 passes through stories excluded). We also excluded RTs less than 100 ms or greater than 3000 ms. Figure 3 shows histograms of RTs per story.
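
The two exclusion criteria described above can be sketched in Python as follows. This is a minimal illustration, not the authors' processing script: the `Observation` record, its field names, and the example token codes are assumptions made for the example, and only the thresholds (5 correct answers, 100 ms, 3000 ms) come from the text.

```python
from typing import NamedTuple

MIN_CORRECT = 5    # passes with fewer than 5 of 6 questions correct are dropped
MIN_RT_MS = 100    # RTs below this are dropped
MAX_RT_MS = 3000   # RTs above this are dropped


class Observation(NamedTuple):
    subject: str    # hypothetical participant identifier
    story: int      # story number, 1-10
    word_code: str  # hypothetical token code
    rt_ms: float    # reading time in milliseconds


def apply_exclusions(observations, correct_counts):
    """Keep observations that survive both exclusion criteria.

    correct_counts maps (subject, story) to the number of comprehension
    questions answered correctly on that pass through the story.
    """
    kept = []
    for obs in observations:
        if correct_counts[(obs.subject, obs.story)] < MIN_CORRECT:
            continue  # the whole pass through the story is excluded
        if obs.rt_ms < MIN_RT_MS or obs.rt_ms > MAX_RT_MS:
            continue  # this single RT is excluded
        kept.append(obs)
    return kept


# Toy data (invented for illustration).
observations = [
    Observation("s1", 1, "1.1.1", 250.0),
    Observation("s1", 1, "1.2.1", 80.0),    # too fast
    Observation("s1", 1, "1.3.1", 3500.0),  # too slow
    Observation("s2", 1, "1.1.1", 400.0),   # pass fails comprehension check
]
correct_counts = {("s1", 1): 6, ("s2", 1): 4}
kept = apply_exclusions(observations, correct_counts)
print(len(kept))
```

In this sketch, RTs of exactly 100 ms or 3000 ms are retained, following the wording "less than" and "greater than" in the text.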

Figure 3: Histograms of SPR RTs per story, after data exclusion.

3.3.1 Inter-Subject Correlations

In order to evaluate the reliability of the self-paced reading RTs and their robustness across experimental participants, we analyzed inter-subject correlations (ISCs). For each subject, we computed the Spearman correlation between that subject’s RTs on a story and the average RTs from all other subjects on that story. Thus for each story we get one ISC statistic per subject. Figure 4 shows histograms of these statistics per story.

Figure 4: Leave-one-out Inter-Subject Correlations (ISCs) of RTs per story. Each panel gives the average leave-one-out ISC for that story.
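
The leave-one-out ISC can be sketched in plain Python as below. This is an illustrative implementation under a simplifying assumption that every subject has an RT for every word of the story (the released data have exclusions, so real code would need to handle missing values). The function names are hypothetical, and the toy RTs are invented.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the tied 1-based ranks
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def leave_one_out_isc(rts_by_subject):
    """Map each subject to the Spearman correlation between that subject's
    RTs and the word-by-word mean RT of all other subjects (one story)."""
    iscs = {}
    for subject, rts in rts_by_subject.items():
        others = [v for s, v in rts_by_subject.items() if s != subject]
        avg_others = [mean(column) for column in zip(*others)]
        iscs[subject] = spearman(rts, avg_others)
    return iscs


# Toy RTs for one four-word story (invented for illustration).
toy = {
    "s1": [300, 350, 420, 310],
    "s2": [280, 360, 400, 330],
    "s3": [310, 340, 450, 300],
}
iscs = leave_one_out_isc(toy)
print(iscs)
```

Averaging the returned values over subjects gives the kind of per-story summary reported in the panels of Figure 4.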

3.3.2 Psycholinguistic Sanity Checks

In order to validate our RT data, we checked that basic psycholinguistic effects obtain in it. In particular, we examined whether frequency, word length, and surprisal (Hale, 2001; Levy, 2008; Smith and Levy, 2013) had their well-known effects on RTs. To do this, for each of the three predictors log frequency, log trigram probability, and word length, we fit a linear mixed effects regression model predicting RT, with subject and story as random intercepts (models with random slopes did not converge). Frequency and trigram probabilities were computed from Google Books N-grams, summing over years from 1990 to 2013. (These counts are also released with this dataset.) The results of the regressions are shown in Table 3; we report results from the maximal converging model. In keeping with well-known effects, increased frequency and probability both lead to faster reading times, and greater word length leads to slower reading times.
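
To make the frequency and trigram predictors concrete, the sketch below sums n-gram counts over a range of years and derives a log trigram probability. It assumes a plain maximum-likelihood estimate, log(count(w1 w2 w3) / count(w1 w2)), with no smoothing; the text does not specify the estimator, so this is an assumption. The row format, function names, and counts are invented for illustration and do not reflect the Google Books N-grams file format.

```python
import math
from collections import Counter


def sum_over_years(rows, first_year=1990, last_year=2013):
    """Total counts per n-gram across the given (inclusive) year range.

    rows is an iterable of (ngram_tuple, year, count) records.
    """
    totals = Counter()
    for ngram, year, count in rows:
        if first_year <= year <= last_year:
            totals[ngram] += count
    return totals


def log_trigram_probability(trigram_counts, bigram_counts, w1, w2, w3):
    """Natural-log maximum-likelihood estimate of P(w3 | w1, w2)."""
    tri = trigram_counts.get((w1, w2, w3), 0)
    bi = bigram_counts.get((w1, w2), 0)
    if tri == 0 or bi == 0:
        return None  # unseen n-gram; handling is not specified in the text
    return math.log(tri / bi)


# Toy counts (invented for illustration).
trigram_rows = [
    (("the", "city", "of"), 1980, 99),  # outside the year range, ignored
    (("the", "city", "of"), 1995, 20),
    (("the", "city", "of"), 2005, 10),
]
bigram_rows = [
    (("the", "city"), 1995, 80),
    (("the", "city"), 2005, 40),
]
trigram_counts = sum_over_years(trigram_rows)
bigram_counts = sum_over_years(bigram_rows)
log_p = log_trigram_probability(trigram_counts, bigram_counts, "the", "city", "of")
print(log_p)
```

Surprisal is the negative of this log probability, so a word with higher log trigram probability has lower surprisal.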

Predictor | Coefficient | Std. Error | t value | p value
Log Frequency | -2.61 | 0.08 | -32.27 | 0.001
Trigram Surprisal | -2.19 | 0.09 | -23.90 | 0.001
Word Length | 4.21 | 0.12 | 35.72 | 0.001
Table 3: Regression coefficients and significance from individual mixed-effects regressions predicting RT for each of the three predictors log frequency, log trigram probability, and word length.
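
One way to write the random-intercept model described above is given below. The notation is ours rather than the authors': i indexes subjects, j stories, and k words, and x is the single predictor used in a given regression.

```latex
\begin{align*}
\mathrm{RT}_{ijk} &= \beta_0 + \beta_1 x_{jk} + u_i + v_j + \varepsilon_{ijk}, \\
u_i &\sim \mathcal{N}(0, \sigma_u^2), \quad
v_j \sim \mathcal{N}(0, \sigma_v^2), \quad
\varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2), \\
t &= \hat{\beta}_1 / \mathrm{SE}(\hat{\beta}_1).
\end{align*}
```

Assuming the third numeric column of Table 3 is this test statistic, the rounded table entries are consistent with it: -2.61 / 0.08 ≈ -32.6, -2.19 / 0.09 ≈ -24.3, and 4.21 / 0.12 ≈ 35.1, each close to the reported value given that the coefficient and standard error are shown to two decimal places.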

3.4 Syntactic Constructions

Here we give an overview of the low-frequency or marked syntactic constructions which occur in the stories. We coded sentences in the Natural Stories corpus for presence of a number of marked constructions, and also coded 200 randomly selected sentences from the Dundee corpus for the same features. The features coded are listed and explained in Appendix A. Figure 5 shows the rates of these marked constructions per sentence in the two corpora. From the figure, we see that the Natural Stories corpus has especially high rates of nonlocal VP conjunction, nonrestrictive SRCs, idioms, adjective conjunction, noncanonical ORCs, local NP/S ambiguities, and it-clefts.

Figure 5: Rates of marked constructions in the Natural Stories corpus and in 200 randomly sampled sentences from the Dundee corpus.
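
A per-sentence rate of the kind plotted in Figure 5 can be computed as the fraction of coded sentences in which a feature is present. The sketch below assumes each sentence is coded as a set of feature names (presence only, not counts); the function name and the toy codings are invented for illustration.

```python
from collections import Counter


def construction_rates(coded_sentences):
    """Fraction of sentences in which each coded feature is present.

    coded_sentences is a list of sets of feature names, one set per sentence.
    """
    counts = Counter()
    for features in coded_sentences:
        counts.update(features)
    n = len(coded_sentences)
    return {feature: count / n for feature, count in counts.items()}


# Toy codings for four sentences (invented for illustration).
coded = [
    {"Idiom", "Sentential subject"},
    {"Idiom"},
    set(),
    {"Topicalization"},
]
rates = construction_rates(coded)
print(rates["Idiom"])
```

Because rates are normalized by the number of sentences, they can be compared between the full Natural Stories corpus and the 200-sentence Dundee sample despite the difference in sample size.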

4 Conclusion

We have described a new psycholinguistic corpus of English, consisting of edited naturalistic text designed to contain many rare or hard-to-process constructions while still sounding fluent. We believe this corpus will provide an important part of a suite of test sets for psycholinguistic models, exposing their behavior in uncommon constructions in a way that fully naturalistic corpora cannot. We also hope that the corpus as described here forms the basis for further data collection and annotation.


This work was supported by NSF DDRI grant #1551543 to R.F., NSF grants #0844472 and #1534318 to E.G., and NIH career development award HD057522 to E.F. The authors thank the following individuals: Laura Stearns for hand-checking and correcting the parses, Suniyya Waraich for help with syntactic coding, Cory Shain and Marten van Schijndel for hand-annotating the parses, and Kyle Mahowald for help with initial exploratory analyses of the SPR data. The authors also thank Nancy Kanwisher for recording half of the stories (the other half was recorded by E.G.), Wade Shen for providing initial alignment between the audio files and the texts, and Jeanne Gallee for hand-correcting the alignment.


  • Bachrach et al. (ms) Bachrach, A., Roark, B., Marantz, A., Whitfield-Gabrieli, S., Cardenas, C., and Gabrieli, J. D. E. (ms). Incremental prediction in naturalistic language processing: an fMRI study. Unpublished manuscript.
  • Barrett et al. Barrett, M., Agić, Ž., and Søgaard, A. The Dundee treebank. In The 14th International Workshop on Treebanks and Linguistic Theories (TLT 14), pages 242–248.
  • Boston et al. (2008) Boston, M., Hale, J. T., Kliegl, R., Patil, U., and Vasishth, S. (2008). Parsing costs as predictors of reading difficulty: An evaluation using the Potsdam Sentence Corpus. The Mind Research Repository (beta), (1).
  • Boston et al. (2011) Boston, M. F., Hale, J. T., Vasishth, S., and Kliegl, R. (2011). Parallel processing and sentence comprehension difficulty. Language and Cognitive Processes, 26(3):301–349.
  • Dambacher et al. (2006) Dambacher, M., Kliegl, R., Hofmann, M., and Jacobs, A. M. (2006). Frequency and predictability effects on event-related potentials during reading. Brain research, 1084(1):89–103.
  • Demberg and Keller (2008) Demberg, V. and Keller, F. (2008). Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 109(2):193–210.
  • Fossum and Levy (2012) Fossum, V. and Levy, R. (2012). Sequential vs. hierarchical syntactic models of human incremental sentence processing. In Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics, pages 61–69. Association for Computational Linguistics.
  • Frank and Bod (2011) Frank, S. L. and Bod, R. (2011). Insensitivity of the human sentence-processing system to hierarchical structure. Psychological Science, 22(6):829–834.
  • Frank et al. (2013) Frank, S. L., Monsalve, I. F., Thompson, R. L., and Vigliocco, G. (2013). Reading time data for evaluating broad-coverage models of English sentence processing. Behavior research methods, 45(4):1182–1190.
  • Frank et al. (2015) Frank, S. L., Otten, L. J., Galli, G., and Vigliocco, G. (2015). The ERP response to the amount of information conveyed by words in sentences. Brain and language, 140:1–11.
  • Futrell et al. (2015) Futrell, R., Mahowald, K., and Gibson, E. (2015). Large-scale evidence of dependency length minimization in 37 languages. Proceedings of the National Academy of Sciences, 112(33):10336–10341.
  • Gibson (2000) Gibson, E. (2000). The dependency locality theory: A distance-based theory of linguistic complexity. In Marantz, A., Miyashita, Y., and O’Neil, W., editors, Image, language, brain: Papers from the first mind articulation project symposium, pages 95–126.
  • Gibson and Thomas (1999) Gibson, E. and Thomas, J. (1999). Memory limitations and structural forgetting: The perception of complex ungrammatical sentences as grammatical. Language and Cognitive Processes, 14(3):225–248.
  • Grodner and Gibson (2005) Grodner, D. and Gibson, E. (2005). Consequences of the serial nature of linguistic input for sentential complexity. Cognitive Science, 29(2):261–290.
  • Hale (2001) Hale, J. (2001). A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics and Language Technologies, pages 1–8.
  • Husain et al. (2014) Husain, S., Vasishth, S., and Srinivasan, N. (2014). Integration and prediction difficulty in Hindi sentence comprehension: Evidence from an eye-tracking corpus. Journal of Eye Movement Research, 8(2).
  • Just et al. (1982) Just, M. A., Carpenter, P. A., and Woolley, J. D. (1982). Paradigms and processes in reading comprehension. Journal of Experimental Psychology: General, 111(2):228.
  • Kennedy et al. (2003) Kennedy, A., Hill, R., and Pynte, J. (2003). The Dundee corpus. In Proceedings of the 12th European conference on Eye Movement.
  • Klein and Manning (2003) Klein, D. and Manning, C. D. (2003). Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, pages 423–430. Association for Computational Linguistics.
  • Kliegl et al. (2006) Kliegl, R., Nuthmann, A., and Engbert, R. (2006). Tracking the mind during reading: The influence of past, present, and future words on fixation durations. Journal of Experimental Psychology: General, 135(1):12.
  • Levy (2008) Levy, R. (2008). Expectation-based syntactic comprehension. Cognition, 106(3):1126–1177.
  • Lewis and Vasishth (2005) Lewis, R. L. and Vasishth, S. (2005). An activation-based model of sentence processing as skilled memory retrieval. Cognitive Science, 29(3):375–419.
  • Liu (2008) Liu, H. (2008). Dependency distance as a metric of language comprehension difficulty. Journal of Cognitive Science, 9(2):159–191.
  • Luong et al. (2015) Luong, M.-T., O’Donnell, T. J., and Goodman, N. D. (2015). Evaluating models of computation and storage in human sentence processing. In CogACLL, page 14.
  • Mitchell et al. (2010) Mitchell, J., Lapata, M., Demberg, V., and Keller, F. (2010). Syntactic and semantic factors in processing difficulty: An integrated measure. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 196–206.
  • Nivre (2015) Nivre, J. (2015). Towards a universal grammar for natural language processing. In Computational Linguistics and Intelligent Text Processing, pages 3–16. Springer.
  • Roark et al. (2009) Roark, B., Bachrach, A., Cardenas, C., and Pallier, C. (2009). Deriving lexical and syntactic expectation-based measures for psycholinguistic modeling via incremental top-down parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1, pages 324–333. Association for Computational Linguistics.
  • Schuler et al. (2010) Schuler, W., AbdelRahman, S., Miller, T., and Schwartz, L. (2010). Broad-coverage parsing using human-like memory constraints. Computational Linguistics, 36(1):1–30.
  • Smith and Levy (2013) Smith, N. J. and Levy, R. (2013). The effect of word predictability on reading time is logarithmic. Cognition, 128(3):302–319.
  • Temperley (2007) Temperley, D. (2007). Minimization of dependency length in written English. Cognition, 105(2):300–333.
  • van Schijndel et al. (2013) van Schijndel, M., Exley, A., and Schuler, W. (2013). A model of language processing as hierarchic sequential prediction. Topics in Cognitive Science, 5(3):522–540.
  • van Schijndel and Schuler (2015) van Schijndel, M. and Schuler, W. (2015). Hierarchic syntax improves reading time prediction. In Proceedings of NAACL.
  • Vasishth et al. (2010) Vasishth, S., Suckow, K., Lewis, R. L., and Kern, S. (2010). Short-term forgetting in sentence comprehension: Crosslinguistic evidence from verb-final structures. Language and Cognitive Processes, 25(4):533–567.
  • Wu et al. (2010) Wu, S., Bachrach, A., Cardenas, C., and Schuler, W. (2010). Complexity metrics in an incremental right-corner parser. In Proceedings of the 48th annual meeting of the association for computational linguistics, pages 1189–1198. Association for Computational Linguistics.
  • Yan et al. (2010) Yan, M., Kliegl, R., Richter, E. M., Nuthmann, A., and Shu, H. (2010). Flexible saccade-target selection in Chinese reading. The Quarterly Journal of Experimental Psychology, 63(4):705–725.

Appendix A Features coded for Section 3.4

The features coded are:

  • Local/nonlocal VP conjunction: Conjunction of VPs in which the head verbs are adjacent (local) or not adjacent (nonlocal)

  • Local/nonlocal NP conjunction: Conjunction of NPs in which the head nouns are adjacent (local) or not adjacent (nonlocal).

  • Sentential conjunction: Conjunction of sentences.

  • CP conjunction: Conjunction of CPs with explicit complementizers.

  • Restrictive/nonrestrictive SRC: Subject-extracted relative clauses with either restrictive or nonrestrictive semantics

  • Restrictive/nonrestrictive ORC: Object-extracted relative clauses with either restrictive or nonrestrictive semantics

  • No-relativizer ORC: An object-extracted relative clause without an explicit relativizer, e.g. The man I know

  • Noncanonical ORC: An object-extracted relative clause where the subject is not a pronoun.

  • Adverbial relative clause: A relative clause with an extracted adverbial, e.g. the valley where you would find the city of Bradford.

  • Free relative clause

  • NP/S ambiguity: A local ambiguity where it is unclear whether an NP is the object of a verb or the subject of an embedded sentence. For example, I know Bob is a doctor.

  • Main Verb/Reduced Relative ambiguity (easy/hard): A local ambiguity between a main verb and a reduced relative clause. For example, The horse raced past the barn fell.

  • PP attachment ambiguity

  • Nonlocal SV: The appearance of any material between a verb and the head of its subject.

  • Nonlocal Verb/DO: The appearance of any material between a verb and its direct object.

  • Gerund modifiers

  • Sentential subject

  • Parentheticals

  • Tough movement

  • Postnominal adjectives

  • Topicalization

  • even…than construction

  • if…then construction

  • as…as construction

  • Yes-No Question

  • Question with wh subject

  • Question with other wh word

  • Idiom: Any idiomatic expression, such as busy as beavers.

  • Quotation: Any directly-reported speech

Coding was performed by Suniyya Waraich, Edward Gibson, and Richard Futrell.