1 Introduction

Scripts and data can be found online at https://github.com/wilcoxeg/neural-networks-read-times.
A large body of evidence suggests that humans are expectation-based language processors, insofar as real-time language comprehension involves making predictions about upcoming material Levy (2008); Hale (2001). One strong piece of evidence supporting this view comes from the domain of computational modeling, where next-word log probabilities from statistical language models (LMs) turn out to correlate well with online processing measures—that is, to have good psychometric predictive power—including gaze duration in eye-tracking studies and self-paced reading times Smith and Levy (2013), and the N400 measure in EEG studies Frank et al. (2015). Crucially, as statistical LMs improve on the broad-coverage objective function of perplexity (i.e. as they get better at predicting the next word given its context), so too do they improve at predicting real-time processing data Goodkind and Bicknell (2018).
Many of the previous studies linking information-theoretic measures and human psychometric data were conducted using n-gram models, which track local word co-occurrences and are blind to information outside of the n-gram window. Since then, neural network language models such as LSTMs Hochreiter and Schmidhuber (1997) and Transformers Vaswani et al. (2017) have set new standards in natural language processing, achieving state-of-the-art perplexity results. We present a broad evaluation of these modern neural network models as predictors of human reading behavior, testing the influence of both model inductive bias and the scale of training data provided to the model.
One important unanswered question involves the role of syntactic knowledge in the link between statistical models and real-time processing. Experimental evidence, such as studies of garden-path effects, demonstrates that humans deploy hierarchically structured representations to drive predictions about upcoming material Stowe (1986); Staub and Clifton (2006). This suggests that language models with similar syntactic capacity — represented implicitly or explicitly — may be the best candidates for predicting human processing data. However, results from computational modeling paint a complicated story: while Frank and Bod (2011) found that models without explicit hierarchical structure are best at predicting human reading times of naturalistic text, a follow-up study by Fossum and Levy (2012) argued that perplexity, not inductive bias or syntactic capacity, was the primary factor determining the ability of the NLP models of that time to predict human reading times. The more recent work of Goodkind and Bicknell (2018), Aurnhammer and Frank (2019), and Merkx and Frank (2020) confirms the general finding that perplexity is the primary determinant of model fit to human comprehension measures, but also finds differences among model architectures once perplexity is controlled for.
Here we contribute to this emerging picture through a scaled-up and carefully controlled assessment of language models’ ability to predict measures of human reading behavior. Following Hu et al. (2020), we train a fleet of neural-network language models varying both in inductive bias (from sequential LSTMs to syntax-aware recurrent models) and in the amount of data provided to them at training time. We evaluate models’ psychometric predictive power for human reading times on three online processing datasets: the Dundee eye-tracking corpus Kennedy (2003), selections from the Brown corpus Smith and Levy (2013), and the Natural Stories self-paced reading corpus Futrell et al. (2017). Across model architectures and training datasets, our results broadly confirm the strong linear relationship between surprisal (or negative log probability) and reading time originally documented by Smith and Levy (2008, 2013) and confirmed by Goodkind and Bicknell (2018). Like previous studies, we also find a generally positive relationship between a model’s next-word prediction accuracy and its ability to predict human reading times, supporting the findings of Goodkind and Bicknell (2018) on a broad set of neural network models. Beyond the role of perplexity, we find that deep Transformer models demonstrate the best psychometric predictive power, and that n-gram models achieve greater psychometric predictive power than would be expected from their perplexity.
We next address the issue of syntactic knowledge. Rather than positing a binary distinction between “hierarchical” and “non-hierarchical” models, we draw on recent work in language model evaluation to quantify models’ syntactic knowledge at a finer grain Hu et al. (2020). We compare each model’s psychometric predictive power against this measure of syntactic knowledge. After controlling for a model’s next-word prediction accuracy, we find that syntactic knowledge does not explain significant variance in its psychometric predictive power.
2 Methods

2.1 Language models

We train a fleet of language models, each providing an estimate of word probability in context. The function of each language model is to predict the next token in a corpus conditioned on its preceding context, producing a probability distribution p(w_i | w_1, …, w_{i−1}). Our fleet contains four major architectural variants:

LSTMs: recurrent neural networks with Long Short-Term Memory units Hochreiter and Schmidhuber (1997), which model word sequences without explicit syntactic structure, implemented in PyTorch Paszke et al. (2017).
Recurrent Neural Network Grammars (RNNGs; Dyer et al., 2016) model the joint probability of a sequence of words and its syntactic structure. RNNGs are supervised during training with Penn Treebank-style constituency parses Marcus et al. (1993).
Transformers Vaswani et al. (2017) are deep neural networks that stack layers of self-attention mechanisms above word embedding representations; they have recently achieved state-of-the-art performance on language modeling and set a new standard for pretrained sentence encoding in natural language processing. We train the GPT-2 Transformer architecture Radford et al. (2019) from scratch on our own corpora.
n-gram: We train an n-gram model with Kneser-Ney smoothing, using the SRILM language modeling toolkit Stolcke (2002).
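All four architectures expose the same interface for our purposes: given a context, return a probability distribution over the next token, from which surprisal can be read off. A minimal toy sketch of that interface (a smoothed bigram model with hypothetical names, unrelated to the actual training code; the paper's n-gram models use Kneser-Ney smoothing via SRILM, not add-one smoothing):

```python
import math
from collections import Counter, defaultdict

class BigramLM:
    """Toy bigram LM illustrating the shared interface: map a context to a
    probability distribution over the next token."""

    def __init__(self, sentences):
        # Vocabulary plus a sentence-start symbol
        self.vocab = {w for s in sentences for w in s} | {"<s>"}
        self.counts = defaultdict(Counter)
        for s in sentences:
            for prev, w in zip(["<s>"] + s[:-1], s):
                self.counts[prev][w] += 1

    def prob(self, word, prev):
        # Add-one (Laplace) smoothed conditional probability p(word | prev)
        c = self.counts[prev]
        return (c[word] + 1) / (sum(c.values()) + len(self.vocab))

    def surprisal(self, word, prev):
        # Surprisal in bits: -log2 p(word | prev)
        return -math.log2(self.prob(word, prev))

lm = BigramLM([["the", "dog", "barks"], ["the", "cat", "sleeps"]])
```

The same two methods (a conditional probability and its negative log) are all the downstream analyses need from any of the model classes.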
Following Hu et al. (2020), we train each model on four corpora of varying sizes drawn from the Brown Laboratory for Linguistic Information Processing (BLLIP) corpus Charniak et al. (2000). The corpora are sampled such that each training set is a subset of the next-larger corpus. The four corpora are BLLIP-XS (40K sentences, 100K tokens); BLLIP-SM (200K sentences, 5M tokens); BLLIP-MD (600K sentences, 14M tokens); and BLLIP-LG (2M sentences, 42M tokens). We train 1–3 random seeds of each model architecture on each training corpus.
While the majority of the models tested here make predictions at the word level, some of our Transformers constitute a notable exception: these models make predictions at the sub-word level, using a byte-pair encoding (BPE; Sennrich et al., 2015), which decomposes common word substrings into independent tokens. Models using this encoding can thus represent sublexical co-occurrence information. For the purposes of this paper, one of the most important effects of this sub-word representation may be that it supports well-tuned word probability estimates even for very rare or unknown words. We train Transformer models using both this BPE representation and standard word-level representations on the corpora mentioned above.
These language models are trained to minimize the perplexity of a corpus:

perplexity(w_1, …, w_N) = exp( −(1/N) Σ_{i=1}^{N} log p(w_i | w_1, …, w_{i−1}) ).

Lower perplexity values correspond to language models that make more accurate next-word predictions.¹ As perplexity is interpretable only in the context of a specific vocabulary (i.e., over a space of possible next words), perplexity measures are comparable only given a fixed reference vocabulary. However, if a model trained on a larger vocabulary achieves better perplexity than a model trained on a smaller vocabulary, we can confidently say it is the better predictive model. This is generally the trend we find: models trained on larger corpora achieve better perplexity, despite being forced to predict over larger vocabularies. Nonetheless, most of our analyses compare models with the same reference vocabulary, avoiding this issue.

¹ For models with sub-word representations, we define the probability of a word as the joint probability of its constituent subwords, following the chain rule.
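The perplexity objective, and the chain-rule treatment of sub-word models described in Footnote 1, can be written out directly (a small self-contained sketch; function names are illustrative):

```python
import math

def perplexity(log_probs):
    """Corpus perplexity from per-token natural-log probabilities:
    exp(-(1/N) * sum_i log p(w_i | w_<i))."""
    return math.exp(-sum(log_probs) / len(log_probs))

def word_log_prob(subword_log_probs):
    """Chain rule over a word's sub-word pieces: a word's log probability is
    the sum of its constituent sub-word log probabilities."""
    return sum(subword_log_probs)

# Sanity check: a model that is uniform over a 10-word vocabulary
# assigns every token probability 0.1 and so has perplexity 10.
uniform = [math.log(0.1)] * 100
```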
2.2 Psychometric predictive power
Following previous work Frank and Bod (2011); Fossum and Levy (2012); Goodkind and Bicknell (2018), we assess a model’s psychometric predictive power (termed “psychological accuracy” by Frank and Bod) by asking how well the model’s word-by-word surprisal estimates can explain various psychometric measures of how subjects read individual words, after controlling for other features known to influence reading behavior, such as word length and frequency.
We draw psychometric data from three datasets across two measurement modalities of real-time human language comprehension: eye-tracking data from the Dundee corpus Kennedy (2003); self-paced reading data from selections from the Brown corpus of American English (as reported in Smith and Levy, 2013); and self-paced reading data (herein SPRT) from the Natural Stories corpus Futrell et al. (2017). The Natural Stories corpus was explicitly designed to include syntactic constructions that are relatively rare in both spoken and written English, such as object-extracted relative clauses, topicalization, and long-distance dependencies.
For each language model, we fit regression models which predict these psychometric data averaged across experimental subjects. (For the Dundee eye-tracking corpus, we predict each word’s gaze duration averaged over subjects.) Our regression models combine model-specific and model-invariant features of words. The main predictor of interest is word surprisal, the negative log probability of a word in its context: S(w_i) = −log p(w_i | w_1, …, w_{i−1}). For each word read by a human subject, we extract from a language model the context-specific surprisal of that word and of the previous word (or the previous 3 words for SPRT). The previous-word estimates are included due to known spillover effects in both measurement paradigms Smith and Levy (2013). We combine these surprisal estimates with model-invariant and context-invariant control predictors for the current and previous word (or previous 3 words for SPRT): length in characters, and log-frequency (log unigram probability).²

² Word frequencies were measured from the larger Wikitext-2 corpus Merity et al. (2017).
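The spillover design lines up, for each word, predictors drawn from the current and preceding positions. A sketch of the per-word feature assembly (hypothetical feature names; the actual regressions are fit in R):

```python
def make_features(words, surprisals, log_freqs, n_spillover=1):
    """Build per-word regression rows: surprisal of the current word plus the
    previous n_spillover words (spillover), with length and log-frequency
    controls for the same positions. Use n_spillover=3 for the SPRT data."""
    rows = []
    for i in range(n_spillover, len(words)):
        row = {}
        for k in range(n_spillover + 1):  # k = 0 is the current word
            j = i - k
            prefix = "prev%d_" % k if k else ""
            row[prefix + "surp"] = surprisals[j]
            row[prefix + "len"] = len(words[j])
            row[prefix + "freq"] = log_freqs[j]
        rows.append(row)
    return rows

rows = make_features(["the", "old", "dog"], [1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
```

Note that the first n_spillover words of each text drop out of the analysis, since their spillover predictors are undefined.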
We evaluate each regression model relative to a baseline model which attempts to predict the same human psychometric data from the control features alone. For each language model, we compute its psychometric predictive power as the mean by-token difference in log-likelihood of the response variable between the two regression models, which we refer to as ΔLogLik. A positive ΔLogLik indicates that the language model’s surprisal estimates yield more accurate predictions of human reading behavior than the baseline model.
We repeat the above analyses with both generalized additive models (GAMs) and linear regression.³ Qualitative results were similar with both approaches; unless otherwise noted, we report the linear regression results in figures and statistical tests.

³ The R formula for the eye-tracking models was read-time ~ s(surp, bs = "cr", k = 20) + s(prev.surp, bs = "cr", k = 20) + te(freq, len, bs = "cr") + te(prev.freq, prev.len, bs = "cr") for the GAM model and psychometric ~ surprisal + prev_surp + prev2_surp + prev3_surp + freq * len + prev_freq * prev_len + prev2_freq * prev2_len + prev3_len * prev3_freq for the linear model.
Our methods differ from Goodkind and Bicknell (2018) in two respects. First, instead of reporting the difference in joint log-likelihood over the entire dataset, we report the mean difference in log-likelihood between the baseline model and the predictive model on each individual token: because the three corpora tested in this paper differ greatly in size and composition, joint log-likelihood cannot be used to compare psychometric predictive power across testing corpora. Second, whereas Goodkind and Bicknell (2018) report ΔLogLik of the model on its training data, we report mean per-word ΔLogLik on held-out test data, averaged over 10-fold cross-validation, which lets us use GAM fits while guarding against overfitting.
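Assuming Gaussian regression errors (an illustrative simplification of the GAM and linear fits; function names are hypothetical), the mean per-token difference in held-out log-likelihood between the baseline and surprisal-augmented models could be computed as:

```python
import math

def gaussian_log_lik(y, y_hat, sigma):
    """Per-token Gaussian log-likelihood of observed reading times y under
    predictions y_hat with residual standard deviation sigma."""
    return [
        -0.5 * math.log(2 * math.pi * sigma**2) - (yi - yh) ** 2 / (2 * sigma**2)
        for yi, yh in zip(y, y_hat)
    ]

def delta_log_lik(y, baseline_pred, full_pred, sigma_base, sigma_full):
    """Mean per-token difference in log-likelihood between the
    surprisal-augmented model and the controls-only baseline."""
    ll_base = gaussian_log_lik(y, baseline_pred, sigma_base)
    ll_full = gaussian_log_lik(y, full_pred, sigma_full)
    return sum(f - b for f, b in zip(ll_full, ll_base)) / len(y)
```

In the actual analysis both sets of predictions come from models fit on the other nine folds, so the quantity is an honest estimate of held-out predictive gain.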
2.3 Syntactic Generalization score
In order to assess the syntactic capabilities of each model, we report its score on the set of 34 targeted syntactic tests presented in Hu et al. (2020), which follow paradigms developed in Marvin and Linzen (2018), Futrell et al. (2018), and other recent papers on controlled, psycholinguistics-style testing for grammatical knowledge. Each test is designed to probe whether the neural model has learned a particular aspect of English syntax by examining its behavior across minimally different sentence pairs. For example, Marvin and Linzen (2018) assess whether a model has learned subject–verb number agreement by evaluating the model’s behavior on a construction such as The keys to the cabinet are/is…. If the model has learned the proper grammatical generalization, it should assign lower probability to the ungrammatical continuation is than to the grammatical are, conditioned on the fixed prefix The keys to the cabinet.
Each individual syntactic test comprises 20–30 test items, with each item used in multiple experimental conditions (generally 4, occasionally 2). For a model to get a test item correct, its predictions must satisfy a set of inequality criteria among the surprisals of sentence regions in each experimental condition. For example, following the logic described above, for subject–verb number agreement a model must succeed at both of the following: (i) when the head noun of the subject NP is singular, the singular verb is should be more likely than the plural verb are; and (ii) when the head noun of the subject NP is plural, the plural verb are should be more likely than the singular verb is. This design ensures that models cannot achieve high scores by relying on simple heuristics, such as a broad preference for plural verbs. We report a model’s mean by-test accuracy as its Syntactic Generalization (SG) score, which ranges from 0 to 1, with chance performance at 0.25.
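For instance, the two agreement criteria above could be checked as follows (a sketch with a hypothetical data format; the actual test suites use richer region-level specifications):

```python
def score_agreement_item(surprisal):
    """Score one subject-verb agreement item. `surprisal` maps
    (condition, verb) pairs to the model's surprisal at the verb region;
    lower surprisal means higher probability. The item is correct only if
    BOTH inequality criteria hold, so a blanket preference for plural
    verbs passes (ii) but fails (i)."""
    sing_ok = surprisal[("sing_subj", "is")] < surprisal[("sing_subj", "are")]
    plur_ok = surprisal[("plur_subj", "are")] < surprisal[("plur_subj", "is")]
    return sing_ok and plur_ok

def sg_score(per_test_accuracies):
    """Syntactic Generalization score: mean accuracy across test suites."""
    return sum(per_test_accuracies) / len(per_test_accuracies)
```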
3 Results

3.1 Surprisal vs. Reading Times
Figure 1 shows the relationship between language model surprisal and human reading times for all models and corpora. Lines are fits from generalized additive models (trained using the formula described in Footnote 3), with only the context-sensitive predictors (i.e., surprisal of the current and previous words) used to derive the estimates; they show the contribution of surprisal to reading time separately from word length and word frequency. Although there is some variance across model architecture and training corpus, overall we find that a linear relationship holds for most of the models tested.
3.2 Predictive Power vs. Perplexity
The relationship between psychometric predictive power and perplexity is shown in Figure 2. Error bars denote the standard error of the by-fold mean per-token ΔLogLik, estimated by 10-fold cross-validation. If better language models are better predictors of human processing times, we would expect a negative correlation between ΔLogLik and perplexity, which is visually evident for all three testing corpora. On average, the Brown testing corpus shows slightly higher ΔLogLik, but also higher variance across the 10-fold split.
In order to assess the relationship between perplexity and psychometric predictive power, we fit a mixed-effects regression model predicting ΔLogLik from language model perplexity within each training vocabulary, with random intercepts by test corpus and model architecture. We find a significant effect of perplexity on ΔLogLik within each training vocabulary, except for the BLLIP-LG training data. We take these results to indicate that the relationship found in Frank and Bod (2011), Fossum and Levy (2012), and Goodkind and Bicknell (2018) between a model’s psychometric predictive power and its test perplexity holds for a range of contemporary state-of-the-art models, and for perplexity scores in the 30–100 range. However, whereas Goodkind and Bicknell (2018) find a strongly linear relationship between perplexity and ΔLogLik, our results are more complicated: while the relationship between ΔLogLik and perplexity is monotonic, it can look more or less linear depending on the model class. For example, focusing on the n-gram models tested on the Dundee corpus, the relationship appears strongly linear across the 100–250 perplexity range. However, focusing on the neural models in the 30–100 perplexity range, the relationship appears more exponential, with stronger gains between models in the lower perplexity range.
While all three testing corpora show a relationship between perplexity and ΔLogLik, we also find an effect of model class for Brown and Dundee: the n-gram models demonstrate predictive power comparable to the neural models despite much poorer perplexity scores. This is especially evident for the BLLIP-SM and BLLIP-XS models tested on the Dundee corpus: although the n-gram models’ perplexity is roughly twice that of the neural models, they achieve higher average ΔLogLik. While surprising, this result accords with the findings of Goodkind and Bicknell (2018), whose LSTM model underperforms relative to their n-gram models.⁴

⁴ The LSTM model in that paper’s Figure 1 is the only model that falls outside the regression’s 95% confidence interval.
3.3 Psychometric Predictive Power vs. Syntactic Generalization
In this section, we investigate the relationship between a model’s syntactic generalization (SG) ability and its psychometric predictive power. The SG score is a model’s average accuracy across 34 targeted syntactic tests, whose designs are inspired by classic psycholinguistic assessments of human syntactic abilities. Figure 4 reproduces Figure 2 from Hu et al. (2020), showing the range of SG scores achieved by our models, plotted against each model’s test perplexity. Hu et al. (2020) argue that, among the range of architectures and training dataset sizes investigated, model class, rather than training data size or test perplexity, is the most important determinant of a model’s syntactic generalization capabilities. For example, in Figure 4 the best-performing LSTM model (squares) achieves a lower SG score than the lowest-performing RNNG model (diamonds). The exception is GPT-2: the GPT-2 model trained on the smallest dataset performs on par with the n-gram models; however, the GPT-2 models trained on larger datasets with BPE perform even better than the best-performing RNNG models.
We use SG scores to quantify the degree to which a model has derived human-like syntactic knowledge of language from text. Figure 4 shows the relationship between models’ SG scores and their psychometric predictive power, as measured by ΔLogLik. We plot this as a residualized regression, testing the relationship between SG score and ΔLogLik after controlling for the effects of perplexity on both variables. The x-axis depicts each model’s SG score residualized with respect to its perplexity (computed within each training vocabulary), and the y-axis shows each model’s ΔLogLik residualized with respect to its perplexity (again computed within each training vocabulary). The plot thus shows the relationship between the two variables that is left unexplained by perplexity.
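The residualization step amounts to a pair of simple least-squares regressions against perplexity; the sketch below shows the core computation (a hypothetical helper using ordinary residualization over one pooled set, rather than the within-vocabulary version used in the analysis):

```python
def residualize(y, x):
    """Residuals of y after simple least-squares regression on x.
    Assumes x is not constant (otherwise the slope is undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    # Subtract the fitted line; what remains is the part of y that
    # perplexity (x) cannot explain.
    return [yi - (my + beta * (xi - mx)) for xi, yi in zip(x, y)]
```

Applying residualize to both the SG scores and the predictive-power values (each against perplexity) yields the two axes of the residualized-regression plot.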
Many models in this figure show a large amount of variance in residual ΔLogLik unexplained by SG score, even when trained on the same dataset. For example, the range of ΔLogLik scores achieved by the RNNG BLLIP-XS models overlaps with those of 16 of the 25 other models, or about 64%. We confirm this result quantitatively: in a stepwise regression analysis, SG scores do not significantly improve prediction of ΔLogLik over and above model perplexity for any of the three corpora.
4 Discussion

This paper tested the relationship between language model surprisal estimates and human reading behavior across a broad class of state-of-the-art language models trained on varying amounts of language data. We confirmed the generally linear relationship between word-level surprisal and human reading time in each of our replications, and found that, within a model architecture, the relationship between a language model’s next-word prediction performance and its psychometric predictive power is mostly monotonic. However, the influence of language model architecture was substantial. Furthermore, the influence of model architecture on psychometric predictive power is not the same as its influence on performance on controlled grammatical tests: we found no clear relationship between the two types of evaluation metrics once perplexity is controlled for (Figure 4).
Our results complement and add to those of Aurnhammer and Frank (2019) and Merkx and Frank (2020), who use similar methodology to assess the psychometric predictive power of Transformers and of gated vs. simple RNNs. The relatively strong performance of our n-gram models accords with Aurnhammer and Frank’s (2019) finding that simple RNNs, which are more sensitive to local relationships, perform as well as LSTMs and other gated models. Together, these results demand a more thorough investigation into the relationship between locality and predictive power. One point of contrast is that Merkx and Frank (2020) find no advantage for Transformer models at predicting human reading times in eye-tracking data, although they do find an advantage for self-paced reading. The difference may be due to the assessment metric, testing dataset size, byte-pair encoding, or model size (their Transformer has 2 layers, ours 12); further investigation is required.
Interpreting our results in light of the findings of Hu et al. (2020), who assess the relationship between perplexity and syntactic generalization abilities, our findings suggest a dissociation between two aspects of cognitive modeling with language models. On one hand, syntactic generalization abilities are largely determined by model architecture, with structurally supervised models and deep Transformers outperforming recurrent neural networks and n-gram models. On the other hand, a model’s ability to predict human reading times is determined more by its ability to accurately predict the next word across a range of contexts, not just in specialized syntactic tests. Here, model architecture seems to play less of an absolute role, although GPT-2 models trained on larger datasets and enhanced with BPE achieve the highest scores on all three testing corpora.
The findings presented in this paper suggest that different language comprehension contexts—isolated-sentence reading with controlled materials targeting specific grammatical contrasts, versus reading of more naturalistic materials—bring to the fore different types of human linguistic expectations that are in many cases best captured by different contemporary NLP models. As new model architectures and training procedures continue to emerge, continued examination of the relationship with psychometric data can help guide the way towards increasingly human-like high-performance computational models of language.
Acknowledgments

The authors would like to thank the anonymous reviewers for their feedback. J.G. is supported by an Open Philanthropy AI Fellowship. J.H. is supported by the NIH under award number T32NS105587 and an NSF Graduate Research Fellowship. R.P.L. gratefully acknowledges support from the MIT-IBM Watson AI Lab, a Google Faculty Research Award, and a Newton Brain Science Award.
References

- Aurnhammer, C. and Frank, S. L. (2019). Comparing gated and simple recurrent neural network architectures as models of human sentence processing. In Proceedings of the 41st Annual Conference of the Cognitive Science Society, pp. 112–118.
- Charniak, E., Blaheta, D., Ge, N., Hall, K., Hale, J., and Johnson, M. (2000). BLLIP 1987–89 WSJ Corpus Release 1. Linguistic Data Consortium, Philadelphia.
- Fossum, V. and Levy, R. (2012). Sequential vs. hierarchical syntactic models of human incremental sentence processing. In Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics, pp. 61–69.
- Frank, S. L. and Bod, R. (2011). Insensitivity of the human sentence-processing system to hierarchical structure. Psychological Science 22(6), pp. 829–834.
- Frank, S. L., Otten, L. J., Galli, G., and Vigliocco, G. (2015). The ERP response to the amount of information conveyed by words in sentences. Brain and Language 140, pp. 1–11.
- Futrell, R., Gibson, E., Tily, H. J., Blank, I., Vishnevetsky, A., Piantadosi, S., and Fedorenko, E. (2017). The Natural Stories corpus. arXiv preprint arXiv:1708.05763.
- Goodkind, A. and Bicknell, K. (2018). Predictive power of word surprisal for reading times is a linear function of language model quality. In Proceedings of the 8th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2018), pp. 10–18.
- Hale, J. (2001). A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, pp. 1–8.
- Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation 9(8), pp. 1735–1780.
- Hu, J., Gauthier, J., Qian, P., Wilcox, E., and Levy, R. P. (2020). A systematic assessment of syntactic generalization in neural language models. In Proceedings of the Association for Computational Linguistics.
- Kennedy, A. (2003). The Dundee Corpus [CD-ROM]. Psychology Department, University of Dundee.
- Levy, R. (2008). Expectation-based syntactic comprehension. Cognition 106(3), pp. 1126–1177.
- Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19, pp. 313–330.
- Merity, S., Xiong, C., Bradbury, J., and Socher, R. (2017). Pointer sentinel mixture models. In Proceedings of ICLR.
- Paszke, A., et al. (2017). Automatic differentiation in PyTorch. In Neural Information Processing Systems Autodiff Workshop.
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language models are unsupervised multitask learners.
- Smith, N. J. and Levy, R. (2013). The effect of word predictability on reading time is logarithmic. Cognition 128(3), pp. 302–319.
- Staub, A. and Clifton, C., Jr. (2006). Syntactic prediction in language comprehension: evidence from either…or. Journal of Experimental Psychology: Learning, Memory, & Cognition 32(2), pp. 425–436.
- Stolcke, A. (2002). SRILM – an extensible language modeling toolkit. In Seventh International Conference on Spoken Language Processing.
- Stowe, L. A. (1986). Parsing wh-constructions: evidence for on-line gap location. Language & Cognitive Processes 1(3), pp. 227–245.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5998–6008.