1 Introduction
It is becoming standard practice to evaluate theories of human language processing by their ability to predict psychometric dependent variables, such as per-word reaction time, for standardized corpora of naturalistic text. Dependent variables that have been collected over fixed corpora include word fixation time in eyetracking (Kennedy et al., 2003), word reaction time in self-paced reading (Roark et al., 2009; Frank et al., 2013), BOLD signal in fMRI data (Bachrach et al., ms), and event-related potentials (Dambacher et al., 2006; Frank et al., 2015).
The more traditional approach to evaluating psycholinguistic models has been to collect psychometric measures on hand-crafted experimental stimuli designed to tease apart detailed model predictions. While this approach makes it easy to compare models on their accuracy for specific constructions and phenomena, it is hard to get a sense from experimental results of how models compare in their coverage of a broad range of phenomena. When it is standard practice to compare model predictions over standardized texts, evaluating coverage becomes much easier.
Although the fixed corpus approach has these advantages, the corpora currently used are based on naturally-occurring data, which is unlikely to include the kinds of sentences that can crucially distinguish between theories. Many of the most puzzling phenomena in psycholinguistics, and the phenomena which have been used to test models, have only been observed in extremely rare constructions, such as multiply nested preverbal relative clauses (Gibson and Thomas, 1999; Grodner and Gibson, 2005; Vasishth et al., 2010). Corpora of naturally-occurring text are unlikely to contain these constructions. More generally, models of human language comprehension are more likely to make distinct predictions for sentences that cause difficulty for humans than for sentences that are easy to process. For instance, models of comprehension difficulty based on memory integration cost during parsing (Gibson, 2000; Lewis and Vasishth, 2005; Schuler et al., 2010; van Schijndel et al., 2013) predict effects when the memory spans required for parsing are large, but most syntactic dependencies in naturally-occurring text are short (Temperley, 2007; Liu, 2008; Futrell et al., 2015). In general, processing difficulty may be rare in naturally-occurring text because such text is written and edited so as to be easily understood.
Here we attempt to combine the strengths of experimental approaches, which can test theories using targeted low-frequency structures, with those of corpus studies, which provide broad-coverage comparability between models. We introduce and release a new corpus, the Natural Stories Corpus, a series of English narrative texts designed to contain many low-frequency and psycholinguistically interesting syntactic constructions while still sounding fluent and coherent. The texts are annotated with hand-corrected Penn Treebank-style phrase structure parses and with Universal Dependencies parses automatically generated from the phrase structure parses. We also release self-paced reading time data for all texts, as well as word-aligned audio recordings of the texts. We hope the corpus can form the basis for further annotation and become a standard test set for psycholinguistic models.
2 Related Work
Here we survey datasets which are commonly used to test psycholinguistic theories and how they relate to the current release.
Currently the most prominent psycholinguistic corpus for English is the Dundee Corpus (Kennedy et al., 2003), which contains 51,501 word tokens in 2,368 sentences from British newspaper editorials, along with eyetracking data from 10 participants. A dependency parse of the corpus was released by Barrett et al. (2015). As in the current work, the eyetracking data in the Dundee Corpus were collected for sentences in context and so reflect influences beyond the sentence level. The corpus has seen wide usage; see, for example, Demberg and Keller (2008); Mitchell et al. (2010); Frank and Bod (2011); Fossum and Levy (2012); Smith and Levy (2013); van Schijndel and Schuler (2015); Luong et al. (2015).
The Potsdam Sentence Corpus (Kliegl et al., 2006) of German provides 1,138 words in 144 sentences, with cloze probabilities and eyetracking data for each word. Like the current corpus, the Potsdam Sentence Corpus was designed to contain varied syntactic structures, rather than being gathered from naturalistic text. The corpus consists of isolated sentences which do not form a narrative, and during eyetracking data collection the sentences were presented in a random order. The corpus has been used to evaluate models of sentence processing based on dependency parsing (Boston et al., 2008, 2011) and to study effects of predictability on event-related potentials (Dambacher et al., 2006).

The MIT Corpus introduced in Bachrach et al. (ms) has similar aims to the current work, collecting reading time and fMRI data over sentences designed to contain varied structures. This dataset consists of four narratives with a total of 2,647 tokens; it has been used to evaluate models of incremental prediction in Roark et al. (2009), Wu et al. (2010), and Luong et al. (2015).
The UCL Corpus (Frank et al., 2013) consists of 361 English sentences drawn from amateur novels, chosen for their ability to be understood out of context, with self-paced reading and eyetracking data. The goal of the corpus is to provide a sample of typical narrative sentences, complementary to our goal of providing a corpus with low-frequency constructions. Unlike the current corpus, the UCL Corpus consists of isolated sentences, so the psychometric data do not reflect effects beyond the sentence level.
3 Corpus Description
3.1 Text
The Natural Stories corpus consists of 10 stories, comprising 10,245 lexical word tokens and 485 sentences in total. The stories were developed by A.V., E.F., E.G. and S.P. by taking existing publicly available texts and editing them to use many subject- and object-extracted relative clauses, clefts, topicalized structures, extraposed relative clauses, sentential subjects, sentential complements, local structural ambiguity (especially NP/Z ambiguity), idioms, and conjoined clauses with a variety of coherence relations. The original texts are listed in Table 1.
The mean number of lexical words per sentence is 21.1, around the same as the Dundee corpus (21.7). Figure 1 shows a histogram of sentence length in Natural Stories as compared to Dundee. The word and sentence counts for each story are given in Table 2. Each token has a unique code which is referenced throughout the various annotations of the corpus.

Table 2: Word and sentence counts for each story.

| Story | # Words | # Sentences |
|---|---|---|
| 1 | 1073 | 57 |
| 2 | 990 | 37 |
| 3 | 1040 | 55 |
| 4 | 1085 | 55 |
| 5 | 1013 | 45 |
| 6 | 1089 | 64 |
| 7 | 999 | 48 |
| 8 | 980 | 33 |
| 9 | 1038 | 48 |
| 10 | 938 | 43 |
In Figure 2 we give a sample of text from the corpus (from the first story).
If you were to journey to the North of England, you would come to a valley that is surrounded by moors as high as mountains. It is in this valley where you would find the city of Bradford, where once a thousand spinning jennies that hummed and clattered spun wool into money for the long-bearded mill owners. That all mill owners were generally busy as beavers and quite pleased with themselves for being so successful and well off was known to the residents of Bradford, and if you were to go into the city to visit the stately City Hall, you would see there the Crest of the City of Bradford, which those same mill owners created to celebrate their achievements.
3.2 Parses
The texts were parsed automatically using the Stanford Parser (Klein and Manning, 2003) and hand-corrected. Trace annotations were added by hand. We provide the resulting Penn Treebank-style phrase structure parse trees. We also provide Universal Dependencies parses (Nivre, 2015) automatically converted from the corrected parse trees using the Stanford Parser.
3.3 Self-Paced Reading Data
We collected self-paced reading (SPR) data (Just et al., 1982) for the stories from 181 native English speakers on Amazon Mechanical Turk. Text was presented in a dashed moving-window display; spaces were masked. Each participant read 5 stories per HIT. 19 participants read all 10 stories, and 3 participants stopped after one story. Each story was accompanied by 6 comprehension questions. We discarded SPR data from a participant's pass through a story if the participant answered fewer than 5 of the questions correctly (89 passes through stories were excluded on this criterion). We also excluded RTs less than 100 ms or greater than 3000 ms. Figure 3 shows histograms of RTs per story.
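As a concrete illustration, these exclusion criteria amount to a few lines of data filtering. The following is a minimal sketch in Python; the file names and column names (`subject`, `story`, `RT`, `correct`) are illustrative assumptions, not the actual schema of the released files:

```python
import pandas as pd

# Hypothetical file and column names; the released data may be organized differently.
rts = pd.read_csv("reading_times.tsv", sep="\t")      # one row per word per pass
answers = pd.read_csv("comprehension.tsv", sep="\t")  # one row per question per pass

# Count correct answers for each participant's pass through a story.
n_correct = (answers.groupby(["subject", "story"])["correct"]
                    .sum()
                    .reset_index(name="n_correct"))

# Discard a pass unless at least 5 of its 6 comprehension questions were correct.
rts = rts.merge(n_correct, on=["subject", "story"])
rts = rts[rts["n_correct"] >= 5]

# Exclude implausibly fast or slow reading times (outside 100-3000 ms).
rts = rts[rts["RT"].between(100, 3000)]
```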

3.3.1 Inter-Subject Correlations
In order to evaluate the reliability of the self-paced reading RTs and their robustness across experimental participants, we analyzed inter-subject correlations (ISCs). For each subject and each story, we computed the Spearman correlation between that subject's RTs on the story and the average RTs of all other subjects on that story. Thus for each story we obtain one ISC statistic per subject. Figure 4 shows histograms of these statistics per story.
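For readers who wish to reproduce this analysis on the released data, a minimal sketch of the leave-one-out ISC computation follows; the long-format layout and column names (`subject`, `token_id`, `RT`) are assumptions rather than the actual release format:

```python
import pandas as pd
from scipy.stats import spearmanr

def story_iscs(story_rts: pd.DataFrame) -> pd.Series:
    """Leave-one-out inter-subject correlations for a single story.

    story_rts: long-format frame with columns subject, token_id, RT
    (hypothetical names). Returns one Spearman rho per subject.
    """
    # Reshape to one row per token, one column per subject.
    wide = story_rts.pivot_table(index="token_id", columns="subject", values="RT")
    iscs = {}
    for subj in wide.columns:
        # Correlate this subject's RTs with the mean RTs of all other subjects.
        others_mean = wide.drop(columns=subj).mean(axis=1)
        rho, _ = spearmanr(wide[subj], others_mean, nan_policy="omit")
        iscs[subj] = rho
    return pd.Series(iscs)
```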

3.3.2 Psycholinguistic Sanity Checks
In order to validate our RT data, we checked that basic psycholinguistic effects obtain in it. In particular, we examined whether the well-known effects of frequency, word length, and surprisal (Hale, 2001; Levy, 2008; Smith and Levy, 2013) hold in the RTs. To do this, for each of the three predictors log frequency, log trigram probability, and word length, we fit a linear mixed-effects regression model predicting RT, with random intercepts for subject and story (models with random slopes did not converge). Frequencies and trigram probabilities were computed from Google Books N-grams, summing counts over the years 1990 to 2013. (These counts are also released with this dataset.) The results of the regressions are shown in Table 3; we report results from the maximal converging model, and a schematic version of the regression is sketched below the table. In keeping with well-known effects, higher frequency and higher trigram probability both lead to faster reading times, and greater word length leads to slower reading times.

Table 3: Mixed-effects regression results for each predictor.

| Predictor | Estimate | Std. Error | t value | p value |
|---|---|---|---|---|
| Log Frequency | -2.61 | 0.08 | -32.27 | < 0.001 |
| Log Trigram Probability | -2.19 | 0.09 | -23.90 | < 0.001 |
| Word Length | 4.21 | 0.12 | 35.72 | < 0.001 |
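A regression of this shape can be approximated in Python with statsmodels, treating subject and story as crossed random intercepts expressed as variance components. This is a sketch under assumed column names (`RT`, `log_freq`, `log_trigram_prob`, `word_length`), not the code used to produce Table 3:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file of filtered RTs joined with per-word predictors.
df = pd.read_csv("filtered_rts.csv")
df["dummy"] = 1  # single grouping variable so subject and story can be crossed

# Crossed random intercepts for subject and story, as variance components.
vcf = {"subject": "0 + C(subject)", "story": "0 + C(story)"}

# One model per predictor, as described in the text.
for predictor in ["log_freq", "log_trigram_prob", "word_length"]:
    model = smf.mixedlm(f"RT ~ {predictor}", df,
                        groups="dummy",   # all rows in one group; structure lives in vcf
                        re_formula="0",   # no extra random intercept for the dummy group
                        vc_formula=vcf)
    print(model.fit().summary())
```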
3.4 Syntactic Constructions
Here we give an overview of the low-frequency or marked syntactic constructions which occur in the stories. We coded sentences in the Natural Stories corpus for the presence of a number of marked constructions, and also coded 200 randomly selected sentences from the Dundee Corpus for the same features. The features coded are listed and explained in Appendix A. Figure 5 shows the rates of these marked constructions per sentence in the two corpora. From the figure, we see that the Natural Stories texts have especially high rates of nonlocal VP conjunction, nonrestrictive SRCs, idioms, adjective conjunction, noncanonical ORCs, local NP/S ambiguities, and it-clefts.

4 Conclusion
We have described a new psycholinguistic corpus of English, consisting of edited naturalistic text designed to contain many rare or hard-to-process constructions while still sounding fluent. We believe this corpus will provide an important part of a suite of test sets for psycholinguistic models, exposing their behavior in uncommon constructions in a way that fully naturalistic corpora cannot. We also hope that the corpus as described here forms the basis for further data collection and annotation.
Acknowledgments
This work was supported by NSF DDRI grant #1551543 to R.F., NSF grants #0844472 and #1534318 to E.G., and NIH career development award HD057522 to E.F. The authors thank the following individuals: Laura Stearns for hand-checking and correcting the parses, Suniyya Waraich for help with syntactic coding, Cory Shain and Marten van Schijndel for hand-annotating the parses, and Kyle Mahowald for help with initial exploratory analyses of the SPR data. The authors also thank Nancy Kanwisher for recording half of the stories (the other half was recorded by E.G.), Wade Shen for providing initial alignment between the audio files and the texts, and Jeanne Gallee for hand-correcting the alignment.
References
- Bachrach et al. (ms) Bachrach, A., Roark, B., Marantz, A., Whitfield-Gabrieli, S., Cardenas, C., and Gabrieli, J. D. E. (ms). Incremental prediction in naturalistic language processing: an fMRI study. Unpublished manuscript.
- Barrett et al. (2015) Barrett, M., Agić, Ž., and Søgaard, A. (2015). The Dundee treebank. In Proceedings of the 14th International Workshop on Treebanks and Linguistic Theories (TLT 14), pages 242–248.
- Boston et al. (2008) Boston, M., Hale, J. T., Kliegl, R., Patil, U., and Vasishth, S. (2008). Parsing costs as predictors of reading difficulty: An evaluation using the Potsdam Sentence Corpus. The Mind Research Repository (beta), (1).
- Boston et al. (2011) Boston, M. F., Hale, J. T., Vasishth, S., and Kliegl, R. (2011). Parallel processing and sentence comprehension difficulty. Language and Cognitive Processes, 26(3):301–349.
- Dambacher et al. (2006) Dambacher, M., Kliegl, R., Hofmann, M., and Jacobs, A. M. (2006). Frequency and predictability effects on event-related potentials during reading. Brain research, 1084(1):89–103.
- Demberg and Keller (2008) Demberg, V. and Keller, F. (2008). Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 109(2):193–210.
- Fossum and Levy (2012) Fossum, V. and Levy, R. (2012). Sequential vs. hierarchical syntactic models of human incremental sentence processing. In Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics, pages 61–69. Association for Computational Linguistics.
- Frank and Bod (2011) Frank, S. L. and Bod, R. (2011). Insensitivity of the human sentence-processing system to hierarchical structure. Psychological Science, 22(6):829–834.
- Frank et al. (2013) Frank, S. L., Monsalve, I. F., Thompson, R. L., and Vigliocco, G. (2013). Reading time data for evaluating broad-coverage models of English sentence processing. Behavior research methods, 45(4):1182–1190.
- Frank et al. (2015) Frank, S. L., Otten, L. J., Galli, G., and Vigliocco, G. (2015). The ERP response to the amount of information conveyed by words in sentences. Brain and language, 140:1–11.
- Futrell et al. (2015) Futrell, R., Mahowald, K., and Gibson, E. (2015). Large-scale evidence of dependency length minimization in 37 languages. Proceedings of the National Academy of Sciences, 112(33):10336–10341.
- Gibson (2000) Gibson, E. (2000). The dependency locality theory: A distance-based theory of linguistic complexity. In Marantz, A., Miyashita, Y., and O’Neil, W., editors, Image, language, brain: Papers from the first mind articulation project symposium, pages 95–126.
- Gibson and Thomas (1999) Gibson, E. and Thomas, J. (1999). Memory limitations and structural forgetting: The perception of complex ungrammatical sentences as grammatical. Language and Cognitive Processes, 14(3):225–248.
- Grodner and Gibson (2005) Grodner, D. and Gibson, E. (2005). Consequences of the serial nature of linguistic input for sentential complexity. Cognitive Science, 29(2):261–290.
- Hale (2001) Hale, J. (2001). A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics and Language Technologies, pages 1–8.
- Husain et al. (2014) Husain, S., Vasishth, S., and Srinivasan, N. (2014). Integration and prediction difficulty in Hindi sentence comprehension: Evidence from an eye-tracking corpus. Journal of Eye Movement Research, 8(2).
- Just et al. (1982) Just, M. A., Carpenter, P. A., and Woolley, J. D. (1982). Paradigms and processes in reading comprehension. Journal of Experimental Psychology: General, 111(2):228.
- Kennedy et al. (2003) Kennedy, A., Hill, R., and Pynte, J. (2003). The Dundee corpus. In Proceedings of the 12th European conference on Eye Movement.
- Klein and Manning (2003) Klein, D. and Manning, C. D. (2003). Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, pages 423–430. Association for Computational Linguistics.
- Kliegl et al. (2006) Kliegl, R., Nuthmann, A., and Engbert, R. (2006). Tracking the mind during reading: The influence of past, present, and future words on fixation durations. Journal of Experimental Psychology: General, 135(1):12.
- Levy (2008) Levy, R. (2008). Expectation-based syntactic comprehension. Cognition, 106(3):1126–1177.
- Lewis and Vasishth (2005) Lewis, R. L. and Vasishth, S. (2005). An activation-based model of sentence processing as skilled memory retrieval. Cognitive Science, 29(3):375–419.
- Liu (2008) Liu, H. (2008). Dependency distance as a metric of language comprehension difficulty. Journal of Cognitive Science, 9(2):159–191.
- Luong et al. (2015) Luong, M.-T., O’Donnell, T. J., and Goodman, N. D. (2015). Evaluating models of computation and storage in human sentence processing. In CogACLL, page 14.
- Mitchell et al. (2010) Mitchell, J., Lapata, M., Demberg, V., and Keller, F. (2010). Syntactic and semantic factors in processing difficulty: An integrated measure. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 196–206.
- Nivre (2015) Nivre, J. (2015). Towards a universal grammar for natural language processing. In Computational Linguistics and Intelligent Text Processing, pages 3–16. Springer.
- Roark et al. (2009) Roark, B., Bachrach, A., Cardenas, C., and Pallier, C. (2009). Deriving lexical and syntactic expectation-based measures for psycholinguistic modeling via incremental top-down parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, pages 324–333. Association for Computational Linguistics.
- Schuler et al. (2010) Schuler, W., AbdelRahman, S., Miller, T., and Schwartz, L. (2010). Broad-coverage parsing using human-like memory constraints. Computational Linguistics, 36(1):1–30.
- Smith and Levy (2013) Smith, N. J. and Levy, R. (2013). The effect of word predictability on reading time is logarithmic. Cognition, 128(3):302–319.
- Temperley (2007) Temperley, D. (2007). Minimization of dependency length in written English. Cognition, 105(2):300–333.
- van Schijndel et al. (2013) van Schijndel, M., Exley, A., and Schuler, W. (2013). A model of language processing as hierarchic sequential prediction. Topics in Cognitive Science, 5(3):522–540.
- van Schijndel and Schuler (2015) van Schijndel, M. and Schuler, W. (2015). Hierarchic syntax improves reading time prediction. In Proceedings of NAACL.
- Vasishth et al. (2010) Vasishth, S., Suckow, K., Lewis, R. L., and Kern, S. (2010). Short-term forgetting in sentence comprehension: Crosslinguistic evidence from verb-final structures. Language and Cognitive Processes, 25(4):533–567.
- Wu et al. (2010) Wu, S., Bachrach, A., Cardenas, C., and Schuler, W. (2010). Complexity metrics in an incremental right-corner parser. In Proceedings of the 48th annual meeting of the association for computational linguistics, pages 1189–1198. Association for Computational Linguistics.
- Yan et al. (2010) Yan, M., Kliegl, R., Richter, E. M., Nuthmann, A., and Shu, H. (2010). Flexible saccade-target selection in Chinese reading. The Quarterly Journal of Experimental Psychology, 63(4):705–725.
Appendix A Features coded for Section 3.4
The features coded are:

- Local/nonlocal VP conjunction: Conjunction of VPs in which the head verbs are adjacent (local) or not adjacent (nonlocal).
- Local/nonlocal NP conjunction: Conjunction of NPs in which the head nouns are adjacent (local) or not adjacent (nonlocal).
- Sentential conjunction: Conjunction of sentences.
- CP conjunction: Conjunction of CPs with explicit complementizers.
- Restrictive/nonrestrictive SRC: Subject-extracted relative clauses with either restrictive or nonrestrictive semantics.
- Restrictive/nonrestrictive ORC: Object-extracted relative clauses with either restrictive or nonrestrictive semantics.
- No-relativizer ORC: An object-extracted relative clause without an explicit relativizer, e.g. the man I know.
- Noncanonical ORC: An object-extracted relative clause where the subject is not a pronoun.
- Adverbial relative clause: A relative clause with an extracted adverbial, e.g. the valley where you would find the city of Bradford.
- Free relative clause
- NP/S ambiguity: A local ambiguity where it is unclear whether an NP is the direct object of a verb or the subject of an embedded sentence. For example, in I know Bob is a doctor, the NP Bob is temporarily ambiguous between these two roles.
- Main Verb/Reduced Relative ambiguity (easy/hard): A local ambiguity between a main verb and a reduced relative clause. For example, The horse raced past the barn fell.
- PP attachment ambiguity
- Nonlocal SV: The appearance of any material between a verb and the head of its subject.
- Nonlocal Verb/DO: The appearance of any material between a verb and its direct object.
- Gerund modifiers
- Sentential subject
- Parentheticals
- Tough movement
- Postnominal adjectives
- Topicalization
- even…than construction
- if…then construction
- as…as construction
- Yes-No Question
- Question with wh subject
- Question with other wh word
- Idiom: Any idiomatic expression, such as busy as beavers.
- Quotation: Any directly-reported speech.
Coding was performed by Suniyya Waraich, Edward Gibson, and Richard Futrell.