
So Cloze yet so Far: N400 Amplitude is Better Predicted by Distributional Information than Human Predictability Judgements

by James A. Michaelov et al.

More predictable words are easier to process - they are read faster and elicit smaller neural signals associated with processing difficulty, most notably, the N400 component of the event-related brain potential. Thus, it has been argued that prediction of upcoming words is a key component of language comprehension, and that studying the amplitude of the N400 is a valuable way to investigate the predictions that we make. In this study, we investigate whether the linguistic predictions of computational language models or humans better reflect the way in which natural language stimuli modulate the amplitude of the N400. One important difference in the linguistic predictions of humans versus computational language models is that while language models base their predictions exclusively on the preceding linguistic context, humans may rely on other factors. We find that the predictions of three top-of-the-line contemporary language models - GPT-3, RoBERTa, and ALBERT - match the N400 more closely than human predictions. This suggests that the predictive processes underlying the N400 may be more sensitive to the surface-level statistics of language than previously thought.





I Introduction

While it is widely accepted that predictable words are easier to process than unpredictable ones [1, 2], the role of predictive processes in language comprehension has long been an issue of contentious debate (for reviews, see [3, 4]). One prominent position is that the language processor does not waste resources on predictive processing [5]. Under such an account, because there are an infinite number of possible continuations for any given natural language string, linguistic predictions would be wrong far more often than they would be right. Thus, given the limited value of linguistic prediction, the language processor simply does not engage in it [6]. Advocates of this position have attributed observed predictability effects on language processing to the demands of integrating the meaning of a word into its preceding context [7, 8], some form of automatic spreading activation in the lexicon [9, 10], or both.

However, there is growing evidence in support of prediction as a component of language comprehension. Much of this research comes from looking at neural signals of processing difficulty, especially the N400, a negative-going component of the event-related brain potential (ERP) that peaks roughly 400ms after the presentation of a meaningful stimulus [11, 12]. With linguistic stimuli, the size of the N400 is sensitive to semantic congruity—N400 amplitude is large by default, and is reduced if the word is facilitated by the preceding context [3, 13, 14]. In recent years, a range of studies have found that N400 amplitude modulations appear to reflect lexical properties of specific nouns that are semantically predictable; thus, researchers have argued that N400 predictability effects do not simply reflect ease of integration or spreading activation, and – at least some of the time – provide evidence for predictive processes in language comprehension [15, 16, 17, 18, 14, 19, 20, 21].

What are these predictions based on? Since the early days of N400 research, cloze probability [22, 23] has served as the chief metric of contextual word predictability [24, 3, 25]. The cloze probability of a given word is defined as the proportion of people who fill a gap in a sentence with that specific word [22, 23], and thus provides a measure of how predictable a word is in a specific sentence context. It is well-established that words with a higher cloze probability elicit a smaller N400 response compared to words with lower cloze probabilities [24, 12, 14], as well as being read faster and recognized faster [25]—in fact, some work has shown that cloze probability and N400 amplitude are inversely correlated at a level of over 90% [26]. A more recent operationalization of predictability is derived from language models (LMs), computational systems designed to predict a word in context. Unlike humans, these LMs only receive text data as input, and consequently base their predictions solely on the surface-level statistics of language [27]. Thus, while linguistic predictions in humans may utilize a range of knowledge both linguistic and extra-linguistic, LMs learn the distributional probability of a word in context [28, 25].
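As a concrete illustration, computing cloze probability amounts to counting completions over a sample of participants. A minimal sketch follows; the sentence frame and responses are hypothetical, not items from any study discussed here:

```python
from collections import Counter

def cloze_probabilities(completions):
    """Cloze probability of each completion: the proportion of
    participants who filled the gap with that word."""
    counts = Counter(word.lower() for word in completions)
    total = len(completions)
    return {word: count / total for word, count in counts.items()}

# Hypothetical gap-fill responses from 10 participants for a frame
# like "He spread the warm bread with ___":
responses = ["butter"] * 7 + ["jam"] * 2 + ["honey"]
probs = cloze_probabilities(responses)
# "butter" has cloze 0.7; "honey", produced by a single participant, has 0.1
```

The sample-size issue raised below falls out of this definition: a continuation with true probability 0.01 will usually not be produced at all by 10 participants, so its estimated cloze is zero.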

Understanding the relationship between LM predictions and N400 amplitude is vital to understanding the N400 (see [29] for discussion). Given the evidence that N400 amplitude is affected by linguistic input over the lifespan [12], and the fact that LMs are trained purely on linguistic input, LMs give us a precise way to model the extent to which linguistic input alone can predict the N400 response. On the other hand, there is no way to tell which sources of information and neurocognitive processes are involved when experimental participants complete the cloze task. Thus, even if cloze probability were to correlate more closely with N400 amplitude than LM predictions, it would be less informative in terms of illuminating the basis of prediction in language comprehension.

However, recent work suggests that this trade-off between accuracy and explainability may be nearing an end. LM predictions can not only successfully predict N400 amplitude [30, 31, 32, 33] and significant differences between conditions [29], but at least for some stimuli may be better at this than cloze probability [29, 33]. However, the two studies in which LM predictions outperform cloze have either looked at the effects without direct comparison to the N400 data [29] or targeted data from an experiment intended to show that the N400 responds to factors other than cloze [33]. The goal of the present study is to test whether the amplitude of the N400 to words in sentence contexts is better predicted by modern LMs than by cloze probability, even under conditions that are maximally favorable to cloze. Using ERP data from a large-scale multiple-laboratory experiment [34], we used linear mixed-effects regression models to examine how well the amplitude of the N400 elicited by experimental stimuli was predicted by the cloze probabilities gathered in the original experiment [34], and compared its performance to that of several pretrained neural network LMs [35, 36, 37, 38, 39, 40, 41, 42].

II Background

II-A Cloze probability

Cloze probability has long been used to assess a word’s predictability in context [3, 43, 4, 25]. In addition to its use in understanding the N400 [24, 12], it has been shown to predict behavioural correlates of processing difficulty, such as word reading time [25]. In fact, when directly compared, cloze probability has previously been found to be better at predicting such behavioural metrics than LMs [28, 25].

However, while cloze probability is a metric grounded in human judgements, it may not be as helpful in understanding online human comprehension as it might appear at first glance. First, as discussed, predictability effects are thought to arise from individuals’ graded predictions about upcoming words, whereas cloze probability is an aggregate measure over a sample of individuals based exclusively on their top predictions. In addition to the question of whether we should expect these two distributions to be equivalent, there is also a practical issue of sample size: the less likely a continuation, the larger the sample of individuals needed before even a single participant produces it. Finally, cloze is a language production task, and its relevance for comprehension is unclear. Given the disagreement regarding the extent of overlap between the production and comprehension systems (see [44] for review, and [45] for detailed discussion of asymmetries between comprehension and production), it is not necessarily the case that the next-word probability of a word will be the same for both systems.

Beyond these concerns, and even if cloze is a good predictor of processing difficulty due to predictability overall (e.g. as measured by reading time), when investigating the N400, the temporal dimension must also be considered. Cloze probability is based on responses produced by experimental participants after reading a sentence with a gap that must be filled in. Given the substantial evidence that there are neurocognitive processes involved in human language comprehension that occur after the N400 [13, 14], even if it is the case that the N400 and cloze probability both reflect individuals’ graded predictions, and that cloze responses are influenced by the predictions that underlie the N400 response, it should not be taken as a given that these predictions are the same. Thus, there is no a priori reason to assume that cloze probability is the best possible operationalization of the predictions that underlie the N400.

II-B Language model predictions

LMs are trained to predict the probability of a word in context based only on the linguistic context. Given that such models do not explicitly learn meanings of words, and that the N400 response to a word is thought to be largely or wholly determined by meaning [12, 14], intuitively, we may expect them to perform poorly at predicting the amplitudes of N400 responses to words. However, previous research has shown that LMs can learn semantic relationships between words [46], including even harmful biases [47]. Thus, the extent to which LMs can acquire semantic knowledge, and specifically, knowledge about the semantic relations between words, may be greater than would be expected prima facie. Whether or not humans can learn quite so much based on only linguistic input is an open question, but there is evidence that we may learn semantic relations between referents of words with which we have no direct experience [48].

An additional benefit of using LM predictions to operationalize word predictability is that researchers know exactly what sources of information are used by these models—they are trained on specific data, and thus researchers can form hypotheses about how the specific kinds of information in these data may be used to predict upcoming linguistic input, and by which system. This is especially important given that, as discussed, we might expect the predictions underlying the N400 to also impact cloze probability. If factors beyond linguistic input such as world knowledge have an effect on N400 amplitude, as has been proposed [12], then they are also likely to have an effect on cloze probability. For this reason, when using cloze probability to predict N400 amplitude, it may be impossible to disentangle the effect of each source of information, limiting the extent to which we can understand the basis upon which the predictions underlying the N400 are made. Using metrics based on the statistics of language (for example, LM predictions) may therefore be one of the only ways to successfully isolate the specific effect of linguistic input on N400 amplitude.

II-C Language model surprisal

When LM predictions are used to investigate predictability effects on language comprehension, predictability is usually not operationalized as the raw probability of words as calculated by these models, but rather as their surprisal. The surprisal of a word is the negative logarithm of its probability given its preceding context, as shown in (1).

S(w_n) = -log P(w_n | w_1 ... w_(n-1))    (1)
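The computation in (1) is a one-liner; the sketch below is a generic helper, not the authors' code, and base 2 is an arbitrary choice of logarithm base (it only rescales values, here to bits):

```python
import math

def surprisal(probability, base=2):
    """Negative log of a word's in-context probability, as in (1).
    With base 2, surprisal is measured in bits."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(probability, base)

# Each halving of the probability adds one bit of surprisal:
# p = 0.5 -> 1 bit, p = 0.25 -> 2 bits, p = 0.125 -> 3 bits
```

Note that surprisal is undefined for zero-probability words, which is why the zero-cloze items discussed later must be handled separately.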
In addition to theoretical claims behind surprisal theory as an explanation of predictability effects in language comprehension [49, 50, 51], there is also a further array of evidence showing that LM surprisal correlates with behavioural metrics of processing difficulty such as reading time [52, 53, 54, 55, 28, 56, 57]. There is also a smaller body of work showing that LM surprisal is a significant predictor of N400 amplitude, with the surprisal of generally better-performing and more advanced models showing a better fit to the N400 data [30, 31, 32, 33]. Additionally, when LMs are given the same experimental stimuli as humans in neurolinguistic experiments, significant differences in surprisal often match significant differences in N400 as a function of experimental condition—again, with generally better-performing and more advanced models matching the human responses better [29, 33].

In previous work, operationalizing predictability as cloze probability generally appears to yield better results for human behavioural data than LM surprisal [28, 25]; however, this has not been well explored for the N400. To the best of our knowledge, only one published paper has directly compared how well cloze probability and LM surprisal predict N400 amplitude, finding that LM surprisal performs better [33]. However, the comparison between cloze probability and LM prediction was not an aim of that study, and thus there are several caveats to be noted about this result. Firstly, the study investigated the N400 response to words with the same cloze probability but which were either related or unrelated to the highest-cloze completion—there is a well-established effect showing that the former elicit lower-amplitude N400s than the latter [24, 58, 59, 60, 61]. Thus, cloze is inherently at a disadvantage in prediction, given that the two conditions are controlled for cloze. The study also involved a condition where all stimuli had a cloze of zero; thus, none of the variance in N400 amplitude within this condition could be explained by cloze. Finally, the study compared raw cloze probability to LM surprisal—given that the surprisal calculated from cloze probability has been found to correlate with behavioural predictability effects [62], a fair comparison would also involve cloze surprisal. The finding that surprisal can differ between words that are matched for cloze but either related or unrelated to the highest-cloze continuation of a sentence is also reported in another study [29], but that study only compares significant differences in surprisal to the significant differences reported in the original papers—there is no direct comparison made between the surprisal and N400 data.

II-D The present study

In the present study, we aim to provide just such a fair comparison using modern LMs and openly available data from a large N400 study (n = 334) [34]. First, we use data from a study that was specifically designed to investigate the effect of cloze probability on N400 amplitude; thus, there are none of the aforementioned cases where experimental conditions are matched by cloze but differ in another way (that may be reflected in LM predictions; see [29, 33]). Additionally, we remove the data from all stimuli with a cloze probability of zero. Given that previous work has shown that there is variability in N400 amplitude between experimental conditions where all items had a cloze probability of zero [63, 61], and that some of these studies have been successfully modeled using LM predictions [29], there is a chance that including these would give the LMs an unfair advantage. Finally, we compare both raw cloze probability and cloze surprisal to ensure that the log-transformation of LM probability is not a confound, as previous work has suggested that there may be a logarithmic linking function between word probability and N400 amplitude [64].

III Method

III-A Original study and data

We use EEG data from a large-scale experiment by Nieuwland and colleagues [34]. In this experiment, participants read sentences one word at a time, with ERPs time-locked to previously-determined target words. In the data provided, the N400 is operationalized as the mean amplitude voltage recorded from the centro-parietal region of interest (electrodes Cz, C3, C4, Pz, P3, and P4) 200–500ms after the presentation of the target word. We use the data provided for target nouns, which replicate the well-established finding that higher-cloze nouns elicit smaller (less negative) N400 responses than lower-cloze nouns [34, 24, 12].

To calculate the cloze probability of items in the original study, each stimulus sentence was truncated before the target word [34]. Thus, participants in the cloze task were presented with the preceding linguistic context for the target word and asked to complete the sentence. The cloze probabilities were then calculated on the basis of the responses from two sets of 30 participants, each of which completed the cloze task for half of the total stimulus sentences. The authors provide both the cloze and ERP data online.

The electrophysiological experiment was carried out at 9 laboratories in the United Kingdom and comprises data from 334 participants, for a total of 25,849 trials. We divided these data into a training set (comprising 70% of the data) for statistical analysis, and a held-out test set (the remaining 30%) for further analysis. In this paper, we only use the training set—we reserve the test set for future analyses. We then removed all items with a cloze probability of zero for fair comparison with LM surprisal, as previously discussed. This left us with data from a total of 14,456 experimental trials.
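The exact splitting procedure (seeding, stratification) is not specified above; the sketch below merely illustrates the shape of such a 70/30 split, with a fixed seed as an assumption for reproducibility:

```python
import random

def train_test_split(trials, train_frac=0.7, seed=0):
    """Shuffle trials reproducibly, then split into training and
    held-out portions."""
    rng = random.Random(seed)      # fixed seed: illustrative assumption
    shuffled = list(trials)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

trial_ids = list(range(100))       # stand-in for the 25,849 trial records
train, test = train_test_split(trial_ids)
# 70 trials end up in the training set, 30 in the held-out set, no overlap
```

Filtering zero-cloze items is then a simple comprehension over the training set, e.g. `kept = [t for t in train_trials if t.cloze > 0]`.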

Finally, we used the cloze data to calculate cloze surprisal for each remaining item. Because all zero-cloze items were removed, this also removed the need for smoothing zero-probabilities, as has been done in previous related work [62].

III-B Language models

We operationalize corpus-based probability of a word in context as the probability calculated by a neural network LM. There are many different architectures for neural network LMs, some of which have been used to model behavioural and neural correlates of human language processing. Here we focus on the two most prolific and successful types of LM in recent years—RNNs and transformers.

III-B1 RNNs

Until the development of transformer LMs [65], recurrent neural network (RNN) language models had long dominated the field. With their memory bottleneck and their incremental processing of words [66, 32], RNNs have often been used as cognitive models of human language processing [67], including prior efforts to model the N400 [30, 31, 29, 32, 33]. In the present study, we use two RNN LMs that have been referred to as GRNN [35] and JRNN [36] in the literature (see, e.g., [68]). GRNN is a small long short-term memory (LSTM) RNN with 72 million parameters that was trained on a 90 million word corpus. JRNN is a larger LSTM-RNN with over a billion parameters, trained on a billion words. In addition to its better performance at language modeling (predicting the next word in a sequence), previous research has found JRNN surprisal to more closely resemble N400 amplitude than does GRNN surprisal [29]. GRNN and JRNN surprisal were calculated using the code provided in Michaelov and Bergen [29].

III-B2 Transformers

Transformer language models are a neural network LM architecture [65] that has been found to outperform RNNs at the standard language modeling task (predicting words from context; see [40] for review), as well as at a range of other tasks [37, 39]. Transformer LMs have also been shown to do better than RNNs at predicting N400 amplitude [32, 33]. The present study includes two varieties of transformer LMs: autoregressive language models, trained on the traditional task of predicting words based on their preceding linguistic context, and masked language models, trained to fill a gap in a sentence and thus able to use words that appear both before and after the target word in their predictions. We include the probabilities from three autoregressive LMs in our analysis—Transformer-XL [40], GPT-2 [39], and GPT-3 [42]. The three masked LMs that we use to calculate word probability are BERT [37], RoBERTa [38], and ALBERT [41]. For all transformer LMs except for GPT-3, we use the implementation of each model made available through the transformers [69] package to calculate surprisal. GPT-3 predictions were accessed via the OpenAI API.
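Whatever the architecture, each model ultimately yields a probability distribution over its vocabulary at the position following the truncated context (e.g. the softmax output obtained via the transformers package or the OpenAI API), from which the target word's surprisal is read off. The sketch below shows only that final step; the distribution is a toy stand-in, not real model output:

```python
import math

def target_surprisal(next_word_probs, target):
    """Surprisal (in bits) of the target word under a model's
    next-word distribution for the truncated context."""
    p = next_word_probs.get(target, 0.0)
    if p <= 0.0:
        raise ValueError(f"zero probability assigned to {target!r}")
    return -math.log2(p)

# Toy distribution standing in for an LM's softmax output after a
# truncated stimulus sentence:
dist = {"butter": 0.5, "jam": 0.25, "honey": 0.25}
s = target_surprisal(dist, "jam")
# less probable targets get higher surprisal: 2.0 bits here vs. 1.0 for "butter"
```

In a real pipeline, subword tokenization complicates this slightly: a target word split into several tokens requires summing the log probabilities of its pieces.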

Model                  Parameters   Corpus size   Ref.
GRNN                   71.8M        90M           [35]
JRNN                   1.04B        1B            [36]
Transformer-XL         285M         103M          [40]
GPT-2 (XL)             1.56B        8B            [39]
GPT-3 (Davinci)        173B         300B          [70]
BERT (large, cased)    334M         3.3B          [37]
RoBERTa (large)        355M         33B           [38]
ALBERT (XXLarge v2)    206M         3.3B          [41]

The number of free parameters for the transformers [69] implementations of Transformer-XL, GPT-2, BERT, RoBERTa, and ALBERT were calculated using pytorch [71]. For JRNN and GPT-3, we utilized the models directly provided by the authors of the paper, and so use the number of parameters reported in the cited paper [36, 70]. While we use the author-provided GRNN, no estimate of model parameters is given in the original paper [35], so we calculated this with pytorch [71].

The number of words in each training corpus is reported in the original papers [35, 36, 40, 37], or estimated (denoted by ‘’). ALBERT is trained on the same data as BERT [41]. Training data sizes for GPT-2 and RoBERTa are estimated based on a comparison of file size with the dataset used for BERT. GPT-3 is trained on 300 billion tokens; however, given that it uses byte-pair encoding for tokenization [70, 39, 72], the actual number of words is lower.

We use the transformers [69] implementation of Transformer-XL; some models reported in the original paper [40] have a higher number of parameters.

Note that while ALBERT has fewer free parameters than either BERT or RoBERTa, it shares parameters between layers, and so is actually a much larger model than either BERT or RoBERTa [41].

TABLE I: Summary of language models used

III-C Language model predictions

The aforementioned LMs were thus used to predict the probability of the target nouns from the original study [34]. Each stimulus sentence was truncated before the target word and the predicted probabilities generated by the models for each of the target words were recorded. Thus, all the models, including the masked LMs, were required to base their predictions on the preceding context. This procedure was intended to match the cloze task, where sentences were truncated in the same way, as well as the ERP experiment, where experimental participants had read only the preceding context when they reached the target word. These probabilities were then transformed into surprisals using the formula in (1).

III-D Predicting the N400

The LM surprisal values, original cloze values, cloze surprisal values, and by-trial N400 amplitudes were all z-transformed before running statistical analyses. These z-transformed LM surprisals, cloze surprisals, and cloze probabilities were then used to predict the z-transformed by-trial N400 amplitudes. Statistical analysis and data manipulation were carried out in R [73] using RStudio [74] and the tidyverse [75], lme4 [76], and ggh4x [77] packages, and the code provided by Nicenboim et al. [19] for preparing the data [34].
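The z-transformation itself is the usual standardization to mean 0 and standard deviation 1. The analyses above were run in R (where `scale()` uses the sample SD); the Python sketch below uses the population SD to stay self-contained, which differs only by a constant factor:

```python
def z_transform(values):
    """Standardize values to mean 0 and standard deviation 1
    (population SD, for a self-contained example)."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

amplitudes = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # toy amplitude values
z = z_transform(amplitudes)
# mean 5 and SD 2, so 2.0 maps to -1.5 and 9.0 maps to 2.0
```

Standardizing all predictors and the outcome puts regression coefficients on a common scale, which makes the fixed effects of the different surprisal measures directly comparable.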

IV Results

IV-A Preliminary analysis with cloze probability

First, we test whether the original finding, that higher-cloze nouns elicit smaller N400s than lower-cloze nouns, still holds for the training-set data. We did this by following the original statistical methods as closely as possible [34]: we used linear mixed-effects regression models, keeping the random effects structure as maximal as possible, and tested the significance of variables using likelihood ratio tests on nested regressions.

After running all regressions (including those described in the following subsections), the largest random effects structure that did not lead to a singular fit had random intercepts for experimental participant and item. Following the original analysis, we also included the laboratory in which the experiment was run as a main effect.

As in the original study, we found no interaction between cloze probability and laboratory (). However, unlike the original study, we found a significant effect of laboratory even when controlling for cloze probability (). This may be due to the difference in sample or in random effects structure. Thus, we include laboratory as a covariate for our remaining analyses.

Crucially, we find a significant effect of cloze probability (). Thus, we replicate the noun predictability effect on our selected subset of the data. Note that all p-values have been corrected for multiple comparisons based on false discovery rate [78].

IV-B Cloze surprisal and N400 amplitude

Running the same tests with cloze surprisal (i.e. negative log-transformed cloze probability) replacing cloze probability leads to the same results (Cloze surprisal x lab: ; cloze surprisal: ; lab: ).

In order to compare cloze probability and cloze surprisal as predictors of N400, we use the two best regressions, those with both cloze (either probability or surprisal) and laboratory as fixed effects. Since they are not nested, we employ Akaike’s Information Criterion (AIC) [79] to compare the two regressions. We find that the regression with cloze surprisal as a fixed effect has a lower AIC (AIC = 30747.89) than the regression with cloze probability as a fixed effect (AIC = 30752.15).

It has been suggested as a general rule of thumb that when there is an AIC difference of 2 or less between two statistical models, they have similar levels of support, while a difference of 4 or more means that the model with a lower AIC has ‘considerably’ more evidential support [80]. In this case, the cloze surprisal regression has an AIC which is 4.26 less than the AIC of the cloze probability regression, suggesting it provides a better fit to the N400 data.

The AIC values can also be used to calculate evidence ratios based on Akaike weights (see [81]). With an evidence ratio of 8.44, the cloze surprisal regression is 8.44 times more likely than the cloze probability regression to be the best model of the N400 data. Based on this dataset, then, cloze surprisal is a better predictor of N400 amplitude than cloze probability. For this reason, we look at cloze surprisal rather than cloze probability in the remainder of our analyses.
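With only two candidate regressions, the evidence ratio follows directly from the AIC difference: the ratio of Akaike weights reduces to exp(ΔAIC / 2). A sketch using the rounded AICs above (the reported 8.44 presumably comes from unrounded values):

```python
import math

def evidence_ratio(aic_better, aic_worse):
    """How many times more likely the lower-AIC model is to be the
    better model: the ratio of Akaike weights, exp(delta_AIC / 2)."""
    delta = aic_worse - aic_better
    return math.exp(delta / 2.0)

# Rounded AICs for the cloze surprisal vs. cloze probability regressions:
er = evidence_ratio(30747.89, 30752.15)   # delta_AIC = 4.26
# er comes out around 8.4, in line with the 8.44 reported in the text
```

This also makes the rule of thumb cited above concrete: an AIC difference of 2 corresponds to an evidence ratio of about 2.7, and a difference of 4 to about 7.4.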

IV-C Language model surprisal and N400 amplitude

Next, we test whether the surprisal calculated from each LM is a significant predictor of N400 amplitude. Since there was no significant interaction between cloze and laboratory and the main effect of laboratory was significant in both the cloze probability and surprisal regressions, we compare regressions with the previously-discussed random effects structure and a fixed effect of laboratory to those also including a main effect of LM surprisal. The results of these analyses are shown in Table II. As can be seen, main effects of surprisal values from JRNN, Transformer-XL, GPT-2, GPT-3, RoBERTa, and ALBERT are all significant in their respective regressions, but the main effect of GRNN surprisal is only marginally significant, and the main effect of BERT surprisal is not significant at all.

Predictor chisq df p
GRNN surprisal 6.841 1 0.056
JRNN surprisal 13.030 1 0.003
Transformer-XL surprisal 18.648 1 <0.001
GPT-2 surprisal 23.393 1 <0.001
GPT-3 surprisal 39.552 1 <0.001
BERT surprisal 0.002 1 1
RoBERTa surprisal 33.678 1 <0.001
ALBERT surprisal 35.816 1 <0.001
TABLE II: Significant predictors of N400 amplitude
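Each row of Table II comes from a likelihood ratio test between nested regressions: twice the log-likelihood difference is chi-square distributed, with degrees of freedom equal to the number of added parameters. For the df = 1 case used here, the p-value has a closed form via the complementary error function; the log-likelihoods below are hypothetical, chosen to hit the conventional critical value:

```python
import math

def lr_test_p_df1(loglik_reduced, loglik_full):
    """p-value of a likelihood ratio test between nested regressions
    differing by one parameter: chisq = 2 * (llf - llr), and the
    chi-square(1) survival function is erfc(sqrt(chisq / 2))."""
    chisq = 2.0 * (loglik_full - loglik_reduced)
    return math.erfc(math.sqrt(chisq / 2.0))

# Hypothetical log-likelihoods giving chisq = 3.841, the conventional
# df = 1 critical value:
p = lr_test_p_df1(-1001.9205, -1000.0)
# p is approximately 0.05
```

In R, `anova(reduced_model, full_model)` on two lme4 fits performs the same comparison; the sketch just exposes the arithmetic behind the chisq and p columns.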

IV-D Comparison of model fit

We next compared the AICs of each linear mixed-effects regression model including LM surprisal with one that instead used cloze surprisal. These comparisons are presented in Figure 1, which shows the AIC of each LM surprisal regression with the AIC of the cloze surprisal regression subtracted. This allows for easier comparison of regression AIC, and has a clear interpretation—any regression with a relative AIC of less than zero has a better fit than the cloze surprisal regression.

As can be seen in Figure 1, the regressions based on the surprisals calculated from three LMs have lower AICs than cloze surprisal (AIC = 30747.89): GPT-3 (AIC = 30740.06; evidence ratio with cloze surprisal = 50.1), ALBERT (AIC = 30743.79; evidence ratio = 7.74), and RoBERTa (AIC = 30745.93; evidence ratio = 2.66). The AICs of the remaining models are higher than that of cloze surprisal. It should be noted that in all but one case, the difference in AIC between the cloze surprisal regression and each other regression is greater than 4, suggesting a meaningful difference in this respect [80]. The one exception is the RoBERTa regression (AIC difference = 1.96); thus, while the RoBERTa regression is 2.66 times more likely than the cloze surprisal regression to provide the best fit to the N400 data, we rely on the tests in the rest of this section to determine whether RoBERTa surprisal is in fact a better predictor of N400 amplitude than cloze surprisal.

In sum, regressions based on the surprisals derived from GPT-3 and ALBERT more closely fit the N400 data than the regression based on cloze surprisal, and this may also be the case for the RoBERTa surprisal regression. Thus, this analysis suggests that GPT-3 surprisal, ALBERT surprisal, (and possibly RoBERTa surprisal) are better predictors of the N400 than cloze surprisal.

Fig. 1: AICs of all regressions including fixed effects of the denoted surprisal and laboratory, as well as random intercepts for each item and each experimental participant. For easier comparison, AIC is scaled by subtracting the AIC of the regression including cloze surprisal, laboratory, and the aforementioned random intercepts. Lower AICs indicate better model fit [79].

IV-E Does language model surprisal improve fit of regressions based on human cloze data?

In addition to comparing the AICs of the models, following Brothers and Kuperberg [25], we compare how well cloze and LM surprisal predict N400 amplitude by constructing additional regressions with both variables and comparing them to regressions with only one. First, we compare the effect of adding the surprisal calculated from each LM to a regression already including cloze surprisal. Thus, we test whether each LM surprisal explains variance in N400 amplitude above and beyond that which is already explained by cloze surprisal. The results are shown in Table III.

Predictor chisq df p
GRNN surprisal 0.156 1 1
JRNN surprisal 1.950 1 0.779
Transformer-XL surprisal 2.848 1 0.477
GPT-2 surprisal 3.046 1 0.441
GPT-3 surprisal 9.563 1 0.015
BERT surprisal 0.630 1 1
RoBERTa surprisal 7.827 1 0.036
ALBERT surprisal 7.180 1 0.049
TABLE III: Does language model surprisal improve fit of regressions based on human cloze data?

As can be seen in Table III, adding GPT-3, ALBERT, or RoBERTa surprisal to regressions already including cloze surprisal significantly improves their fit, while adding the surprisal of other LMs does not.

IV-F Does human cloze data improve fit of regressions based on language model surprisal?

We also run the reverse analysis, investigating the effect of adding cloze surprisal to a regression that already includes one LM surprisal as a fixed effect. Thus, we test whether cloze surprisal explains variance in N400 amplitude that is not explained by each LM surprisal. The results are shown in Table IV.

Predictor                   chisq    df   p
GRNN surprisal              25.038   1    <0.001
JRNN surprisal              20.643   1    <0.001
Transformer-XL surprisal    15.922   1    0.001
GPT-2 surprisal             11.375   1    0.006
GPT-3 surprisal             1.733    1    0.867
BERT surprisal              32.351   1    <0.001
RoBERTa surprisal           5.872    1    0.092
ALBERT surprisal            3.087    1    0.441

TABLE IV: Does human cloze data improve fit of regressions based on language model surprisal?

As can be seen in Table IV, adding cloze surprisal to a regression already including GRNN, JRNN, Transformer-XL, GPT-2, or BERT surprisal significantly improves its fit. By contrast, cloze surprisal does not improve regressions already including surprisal from GPT-3, ALBERT, or RoBERTa. In sum, only the surprisals from GPT-3, ALBERT, and RoBERTa provide a better fit to the N400 data than human cloze surprisal does.
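Comparisons of nested regressions of this kind are typically likelihood-ratio tests: the chi-square statistic is twice the difference in maximized log-likelihoods, with degrees of freedom equal to the number of added parameters. A minimal stdlib sketch of the df = 1 case (one added surprisal predictor, as in Tables III and IV); the log-likelihoods below are illustrative, not the study's values.

```python
import math

def lr_test_df1(ll_reduced: float, ll_full: float) -> tuple:
    """Likelihood-ratio test for nested models differing by one
    parameter (df = 1), e.g. a regression with cloze surprisal alone
    vs. one that adds a single LM surprisal term."""
    chisq = 2 * (ll_full - ll_reduced)
    # Chi-square survival function with 1 df: P(X > x) = erfc(sqrt(x/2))
    p = math.erfc(math.sqrt(chisq / 2)) if chisq > 0 else 1.0
    return chisq, p

# Illustrative log-likelihoods only (not the study's regressions):
chisq, p = lr_test_df1(ll_reduced=-5000.0, ll_full=-4995.0)
print(chisq, p)  # chisq = 10.0, p well below 0.05
```

A significant result means the added predictor explains variance left unexplained by the predictors already in the model, which is exactly the question Tables III and IV ask in each direction.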

V General Discussion

In this study, we investigated whether the linguistic predictions of language models or of human participants better predict the amplitude of the N400, a neural index of processing difficulty. We find that the surprisals of three transformer language models (GPT-3, RoBERTa, and ALBERT) are better predictors of N400 amplitude than cloze surprisal, and that cloze surprisal is in turn a better predictor of N400 amplitude than raw cloze probability. These findings are consistent with prior work showing a correlation between LM surprisal and N400 amplitude [30, 31, 29, 33, 32]. However, to the best of our knowledge, the present study provides the most convincing evidence to date that LM surprisal can outperform cloze as a predictor of N400 amplitude.

The skeptical reader might question whether some feature of our stimuli offered an unfair advantage to the LMs over cloze measures. We find this unlikely, given that we have endeavoured to provide a ‘level playing field’. First, unlike previous work showing that LM surprisal values provide a good account of the N400 elicited by different kinds of semantic stimuli equated for cloze probability [33], the present study involved the direct experimental manipulation of word predictability: there were no experimental conditions matched for cloze that differed in some other systematic way. Thus, N400 amplitude variance in this study is almost exclusively due to differences in predictability. Second, all zero-cloze items were removed, meaning that any variation between items in predictability was captured by both cloze and LM surprisal. Finally, following work suggesting a logarithmic linking function between word probability and the N400 [64], we also included cloze surprisal as a possible predictor. This proved crucial to a fair test, as we found that cloze surprisal was a better predictor of N400 amplitude than cloze probability.
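The logarithmic linking function amounts to a simple transform of cloze probability into surprisal. A short sketch of why the two scales can behave differently as predictors; the probabilities are arbitrary illustrative values.

```python
import math

def surprisal(p: float) -> float:
    """Surprisal in bits: -log2(p). Undefined at p = 0, which is why
    zero-cloze items must be excluded before applying the transform."""
    if p <= 0:
        raise ValueError("surprisal is undefined for zero probability")
    return -math.log2(p)

# On the probability scale, 0.9 vs 0.8 and 0.2 vs 0.1 are equal
# differences; on the surprisal scale the low-probability pair differs
# far more, which is what a logarithmic linking function between word
# probability and the N400 implies.
high = surprisal(0.8) - surprisal(0.9)  # ~0.17 bits
low = surprisal(0.1) - surprisal(0.2)   # exactly 1 bit (halving p)
print(high, low)
```

If the N400 tracks surprisal rather than probability, differences among low-probability words should matter much more than equal-sized probability differences among high-probability words.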

V-A Methodological implications

Our finding that N400 amplitude is related to surprisal values from GPT-3, RoBERTa, and ALBERT has clear methodological implications. In future work, it may be advantageous for ERP language researchers who want to measure or control the predictability of their stimuli to use surprisal values from these LMs in addition to, or even instead of, cloze probability. As argued above, the cloze task has many practical limitations, most notably that with a limited number of participants, small differences in predictability may not be reflected in cloze. In addition, the possibility of variation in the predictability of zero-cloze items poses a non-trivial problem for cloze as a metric of predictability, even when it is transformed into cloze surprisal. LM surprisal, by contrast, allows the researcher to differentiate between items even at very low probabilities, making it possible to control for predictability over a wider range than cloze probability does.

Further, for large stimulus sets it may be easier to obtain surprisal values from pre-trained LMs than cloze values from human subjects (e.g., it is feasible to obtain surprisal values for every word in a sentence). Indeed, ERP language researchers already use other measures derived from linguistic corpora to control their language materials, such as semantic similarity measures computed from word embeddings. Since the report that corpus-derived metrics of word similarity correlate with N400 amplitude [82, 83, 84, 85], many researchers have constructed their stimuli to be matched on these metrics, or have included similarity metrics as covariates in their statistical analyses [86, 14, 87].
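Corpus-derived word similarity metrics of the kind mentioned above are most commonly cosine similarities between embedding vectors. A minimal sketch with toy three-dimensional vectors; the vector values are invented for illustration and stand in for real high-dimensional embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" (illustrative values only): related words point in
# similar directions, unrelated words do not.
doctor = [0.9, 0.1, 0.3]
patient = [0.8, 0.2, 0.4]
tenant = [0.1, 0.9, 0.2]

print(cosine_similarity(doctor, patient))  # high: related words
print(cosine_similarity(doctor, tenant))   # lower: unrelated words
```

Matching stimuli on such a metric, or entering it as a covariate, controls for the semantic-relatedness component of the N400 independently of predictability.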

V-B Theoretical implications

Each of the principal findings reported above has consequences for our understanding of prediction and the human N400. Our first result is that when analyzing the data from a large N400 study collected across multiple laboratories, cloze surprisal is a better predictor of N400 amplitude than cloze probability, replicating work suggesting a logarithmic linking function between word probability and N400 amplitude [64]. This suggests that rather than reflecting the error between the predicted probability of a word and its true probability (i.e., 1), as some have argued is the case for behavioral predictability effects [25], the N400 may instead reflect the neurocognitive effort required to update the full probability distribution of our predictions once we encounter the word itself [50, 31].

Our second and main result is that overall, GPT-3 surprisal, RoBERTa surprisal, and ALBERT surprisal were each found to be better predictors of N400 amplitude than cloze surprisal values gathered from human participants. Indeed, each of these LMs explains variance in N400 amplitude left unexplained by cloze surprisal. By contrast, cloze surprisal values from a mere 30 participants provide a better fit to the N400 data than do surprisal values from GRNN, JRNN, Transformer-XL, GPT-2, and BERT. When comparing LMs of the same type, our results also provide additional support for the idea that better-quality LMs perform better at modeling the N400 and other measures of online human sentence processing difficulty [30, 88, 32]. When compared by perplexity, a common evaluation metric for autoregressive transformer LMs, GPT-3 outperforms Transformer-XL and GPT-2 [40, 39, 42]. Similarly, ALBERT and RoBERTa each outperform BERT on the GLUE benchmark [89], which covers a wide range of natural language understanding tasks. Finally, all of the transformer LMs except BERT outperform the RNNs (GRNN and JRNN), replicating previous findings that transformer LMs are better predictors of N400 amplitude than RNNs [32, 33]. Thus, as LMs continue to advance and improve, their predictions appear to match those of humans more closely. Given that LMs are trained to predict words based on the statistics of language, this provides indirect support for the idea that humans may be doing the same.

Until the present study, cloze probability has been the gold-standard method of operationalizing predictability, and, when tested, the best correlate of behavioural predictability effects [25]. Thus, because the N400 is sensitive to manipulations that cannot be operationalized by cloze probability, it has been argued that it may be more productive to think of the N400 as reflecting ‘preactivation’ [14], or the ‘neural activation state at the time a physical stimulus is encountered’ [13] rather than prediction per se.

Results of the present study may thus illuminate the functional significance of the N400 component by offering a unified explanation for its sensitivity to what seem to be disparate sources of contextual information. That is, besides its exquisite sensitivity to cloze probability, the amplitude of the N400 is also sensitive to factors ostensibly related to the organization of semantic memory. Consider, for example, the following stimuli from Ito et al. [61]:

Jack studied medicine in a university and works as a doctor/patient/tenant now.

Here, doctor is the highest-cloze continuation of the sentence, while both patient and tenant have a cloze probability of zero. However, despite the fact that patient and tenant are equally unpredictable and equally implausible continuations of the sentence (as judged by participants in their study), patient elicits a smaller (less negative) N400 than tenant. This is one example of a range of studies where words that are semantically related to the preceding context (i.e. medicine) or to the most expected continuation of a sentence (i.e. doctor) elicit smaller N400 responses than semantically unrelated words, even when matched for cloze [61, 60, 63]. Based on such experiments, it has been proposed that implausible continuations like patient are ‘collaterally facilitated’ by the preceding context [13], or, relatedly, that their preactivation is caused by a separate associative system [90].

However, recent work shows that the difference in N400 amplitude reported in Ito et al.’s [61] study can be successfully predicted based on the surprisal of the GRNN and JRNN [29]. As these are LMs trained only to predict the next word in a sequence based on the preceding words, this suggests a need to reconsider the basis of the predictions of the human language processing system. Specifically, it may be that manipulations, such as the aforementioned relatedness manipulation, that have been thought to be separate or dissociable from predictability can be reduced to an appropriate measure of predictability. That is, patient and tenant are not in fact equally predictable, and the belief that they are is an artifact of the cloze task.

In the cloze task, participants are asked to complete a sentence with a single continuation [22, 23]. Besides leading to an under-representation of less predictable continuations in participants’ responses, the task may also impose constraints that do not apply to the neural systems that generate linguistic predictions. For example, people may be less willing to produce a continuation that is implausible. Thus, even with an extremely large number of participants in a cloze study, if no one produces implausible continuations such as patient and tenant, it is impossible to determine the difference in their predictability on the basis of cloze.
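The resolution limit of cloze with a finite sample can be made concrete. The "true" probabilities below are hypothetical, chosen only to illustrate the floor effect; they are not estimates from any model or study.

```python
import math

# With n participants, the smallest nonzero cloze probability is 1/n;
# any continuation that no participant produces gets cloze = 0
# regardless of its true predictability. LM surprisal has no such floor.
n_participants = 30

# Hypothetical true probabilities for two implausible continuations
# (illustrative values only):
true_p = {"patient": 1e-3, "tenant": 1e-5}

# Expected number of participants producing each word: both are well
# below 1, so both words will almost certainly come out as zero-cloze
# and be indistinguishable on the cloze measure ...
expected_counts = {w: n_participants * p for w, p in true_p.items()}

# ... but surprisal (-log2 p) separates them cleanly.
surprisals = {w: -math.log2(p) for w, p in true_p.items()}
print(expected_counts)  # both well below one expected response
print(surprisals)       # roughly 10 bits vs. roughly 16.6 bits
```

A two-orders-of-magnitude difference in predictability is thus invisible to a 30-participant cloze norm but directly measurable as a surprisal difference.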

On the other hand, if even the GRNN and JRNN, which are among the worst-performing models in the present study, can successfully differentiate the predictability of patient and tenant [29] despite having no semantics learned explicitly or through experience of the world, this suggests that humans may not need to rely on such information for prediction either, at least within the N400 window.

Previous work has shown that LM predictions correlate with N400 amplitude when cloze does not [29, 33]. The present study has shown that even under conditions maximally favorable to cloze, LM predictions correlate better with N400 amplitude. Thus, at least in terms of relative strength, the kinds of predictions made by LMs resemble the kinds of predictions made by humans as part of online language comprehension. The language comprehension system, or at least the neurocognitive system underlying the N400 response, thus appears to be more finely attuned to the regularities in the surface-level statistics of language than previously thought.


  • [1] G. A. Miller and S. Isard, “Some perceptual consequences of linguistic rules,” Journal of Verbal Learning and Verbal Behavior, vol. 2, no. 3, pp. 217–228, Oct. 1963.
  • [2] E. Tulving and C. Gold, “Stimulus information and contextual information as determinants of tachistoscopic recognition of words,” Journal of Experimental Psychology, vol. 66, no. 4, pp. 319–327, Oct. 1963.
  • [3] C. Van Petten and B. J. Luka, “Prediction during language comprehension: Benefits, costs, and ERP components,” International Journal of Psychophysiology, vol. 83, no. 2, pp. 176–190, Feb. 2012.
  • [4] S. G. Luke and K. Christianson, “Limits on lexical prediction during reading,” Cognitive Psychology, vol. 88, pp. 22–60, Aug. 2016.
  • [5] K. I. Forster, “Priming and the effects of sentence and lexical contexts on naming time: Evidence for autonomous lexical processing,” The Quarterly Journal of Experimental Psychology Section A, vol. 33, no. 4, pp. 465–495, Nov. 1981.
  • [6] R. Jackendoff, Foundations of Language: Brain, Meaning, Grammar, Evolution.   Oxford University Press, 2002.
  • [7] P. J. Schwanenflugel and E. J. Shoben, “The influence of sentence constraint on the scope of facilitation for upcoming words,” Journal of Memory and Language, vol. 24, no. 2, pp. 232–252, Apr. 1985.
  • [8] M. J. Traxler and D. J. Foss, “Effects of sentence constraint on priming in natural language comprehension,” Journal of Experimental Psychology: Learning, Memory, and Cognition, vol. 26, no. 5, pp. 1266–1282, 2000.
  • [9] R. F. West and K. E. Stanovich, “Source of inhibition in experiments on the effect of sentence context on word recognition,” Journal of Experimental Psychology: Learning, Memory, and Cognition, vol. 8, no. 5, pp. 385–399, 1982.
  • [10] A. M. Collins and E. F. Loftus, “A spreading-activation theory of semantic processing,” Psychological Review, vol. 82, no. 6, pp. 407–428, 1975.
  • [11] M. Kutas and S. A. Hillyard, “Reading senseless sentences: Brain potentials reflect semantic incongruity,” Science, vol. 207, no. 4427, pp. 203–205, Jan. 1980.
  • [12] M. Kutas and K. D. Federmeier, “Thirty Years and Counting: Finding Meaning in the N400 Component of the Event-Related Brain Potential (ERP),” Annual Review of Psychology, vol. 62, no. 1, pp. 621–647, Jan. 2011.
  • [13] K. A. DeLong and M. Kutas, “Comprehending surprising sentences: Sensitivity of post-N400 positivities to contextual congruity and semantic relatedness,” Language, Cognition and Neuroscience, vol. 0, no. 0, pp. 1–20, Jan. 2020.
  • [14] G. R. Kuperberg, T. Brothers, and E. W. Wlotko, “A Tale of Two Positivities and the N400: Distinct Neural Signatures Are Evoked by Confirmed and Violated Predictions at Different Levels of Representation,” Journal of Cognitive Neuroscience, vol. 32, no. 1, pp. 12–35, Jan. 2020.
  • [15] K. A. DeLong, T. P. Urbach, and M. Kutas, “Probabilistic word pre-activation during language comprehension inferred from electrical brain activity,” Nature Neuroscience, vol. 8, no. 8, pp. 1117–1121, Aug. 2005.
  • [16] J. J. A. Van Berkum, C. M. Brown, P. Zwitserlood, V. Kooijman, and P. Hagoort, “Anticipating Upcoming Words in Discourse: Evidence From ERPs and Reading Times.” Journal of Experimental Psychology: Learning, Memory, and Cognition, vol. 31, no. 3, pp. 443–467, 2005.
  • [17] M. Otten, M. S. Nieuwland, and J. J. Van Berkum, “Great expectations: Specific lexical anticipation influences the processing of spoken language,” BMC Neuroscience, vol. 8, no. 1, p. 89, Oct. 2007.
  • [18] N. Kwon, P. Sturt, and P. Liu, “Predicting semantic features in Chinese: Evidence from ERPs,” Cognition, vol. 166, pp. 433–446, Sep. 2017.
  • [19] B. Nicenboim, S. Vasishth, and F. Rösler, “Are words pre-activated probabilistically during sentence comprehension? Evidence from new data and a Bayesian random-effects meta-analysis using publicly available data,” Neuropsychologia, vol. 142, p. 107427, May 2020.
  • [20] T. P. Urbach, K. A. DeLong, W.-H. Chan, and M. Kutas, “An exploratory data analysis of word form prediction during word-by-word reading,” Proceedings of the National Academy of Sciences, vol. 117, no. 34, pp. 20 483–20 494, Aug. 2020.
  • [21] D. S. Fleur, M. Flecken, J. Rommers, and M. S. Nieuwland, “Definitely saw it coming? The dual nature of the pre-nominal prediction effect,” Cognition, vol. 204, p. 104335, Nov. 2020.
  • [22] W. L. Taylor, ““Cloze Procedure”: A New Tool for Measuring Readability,” Journalism Quarterly, vol. 30, no. 4, pp. 415–433, Sep. 1953.
  • [23] ——, “”Cloze” readability scores as indices of individual differences in comprehension and aptitude,” Journal of Applied Psychology, vol. 41, no. 1, pp. 19–26, 1957.
  • [24] M. Kutas and S. A. Hillyard, “Brain potentials during reading reflect word expectancy and semantic association,” Nature, vol. 307, no. 5947, pp. 161–163, Jan. 1984.
  • [25] T. Brothers and G. R. Kuperberg, “Word predictability effects are linear, not logarithmic: Implications for probabilistic models of sentence comprehension,” Journal of Memory and Language, vol. 116, p. 104174, Feb. 2021.
  • [26] M. Kutas and C. Petten, “Psycholinguistics electrified: Event-related brain potential investigations,” in Handbook of Psycholinguistics, 1st ed., M. A. Gernsbacher, Ed.   San Diego: Academic Press, Jan. 1994, pp. 83–143.
  • [27] D. Jurafsky and J. H. Martin, Speech and Language Processing.   [Online Draft], Oct. 2019.
  • [28] N. J. Smith and R. Levy, “Cloze but no cigar: The complex relationship between cloze, corpus, and subjective probabilities in language processing,” in Proceedings of the Annual Meeting of the Cognitive Science Society, 33, 2011, p. 7.
  • [29] J. A. Michaelov and B. K. Bergen, “How well does surprisal explain N400 amplitude under different experimental conditions?” in Proceedings of the 24th Conference on Computational Natural Language Learning.   Online: Association for Computational Linguistics, Nov. 2020, pp. 652–663.
  • [30] S. L. Frank, L. J. Otten, G. Galli, and G. Vigliocco, “The ERP response to the amount of information conveyed by words in sentences,” Brain and Language, vol. 140, pp. 1–11, Jan. 2015.
  • [31] C. Aurnhammer and S. L. Frank, “Evaluating information-theoretic measures of word prediction in naturalistic sentence reading,” Neuropsychologia, vol. 134, p. 107198, Nov. 2019.
  • [32] D. Merkx and S. L. Frank, “Human Sentence Processing: Recurrence or Attention?” in Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics.   Online: Association for Computational Linguistics, Jun. 2021, pp. 12–22.
  • [33] J. A. Michaelov, M. D. Bardolph, S. Coulson, and B. K. Bergen, “Different kinds of cognitive plausibility: Why are transformers better than RNNs at predicting N400 amplitude?” in Proceedings of the 43rd Annual Meeting of the Cognitive Science Society, University of Vienna, Vienna, Austria (Hybrid), Jul. 2021, pp. 300–306.
  • [34] M. S. Nieuwland, S. Politzer-Ahles, E. Heyselaar, K. Segaert, E. Darley, N. Kazanina, S. Von Grebmer Zu Wolfsthurn, F. Bartolozzi, V. Kogan, A. Ito, D. Mézière, D. J. Barr, G. A. Rousselet, H. J. Ferguson, S. Busch-Moreno, X. Fu, J. Tuomainen, E. Kulakova, E. M. Husband, D. I. Donaldson, Z. Kohút, S.-A. Rueschemeyer, and F. Huettig, “Large-scale replication study reveals a limit on probabilistic prediction in language comprehension,” eLife, vol. 7, p. e33468, Apr. 2018.
  • [35] K. Gulordava, P. Bojanowski, E. Grave, T. Linzen, and M. Baroni, “Colorless Green Recurrent Networks Dream Hierarchically,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).   New Orleans, Louisiana: Association for Computational Linguistics, 2018, pp. 1195–1205.
  • [36] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu, “Exploring the Limits of Language Modeling,” arXiv:1602.02410 [cs], Feb. 2016.
  • [37] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).   Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186.
  • [38] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” arXiv:1907.11692 [cs], Jul. 2019.
  • [39] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners,” p. 24, 2019.
  • [40] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context,” arXiv:1901.02860 [cs, stat], Jun. 2019.
  • [41] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “ALBERT: A Lite BERT for Self-supervised Learning of Language Representations,” in International Conference on Learning Representations, 2020.
  • [42] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language Models are Few-Shot Learners,” in Advances in Neural Information Processing Systems, vol. 33.   Curran Associates, Inc., 2020, pp. 1877–1901.
  • [43] K. A. DeLong, M. Troyer, and M. Kutas, “Pre-Processing in Sentence Comprehension: Sensitivity to Likely Upcoming Meaning and Structure,” Language and Linguistics Compass, vol. 8, no. 12, pp. 631–645, 2014.
  • [44] A. S. Meyer, F. Huettig, and W. J. Levelt, “Same, different, or closely related: What is the relationship between language production and comprehension?” Journal of Memory and Language, vol. 89, pp. 1–7, Aug. 2016.
  • [45] P. Hendriks, Asymmetries between Language Production and Comprehension, ser. Studies in Theoretical Psycholinguistics.   Dordrecht: Springer Netherlands, 2014, vol. 42.
  • [46] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed Representations of Words and Phrases and their Compositionality,” in Advances in Neural Information Processing Systems, 2013.
  • [47] T. Bolukbasi, K.-W. Chang, J. Y. Zou, V. Saligrama, and A. T. Kalai, “Man is to computer programmer as woman is to homemaker? Debiasing word embeddings,” in Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, Eds., vol. 29.   Curran Associates, Inc., 2016.
  • [48] G. S. Marmor, “Age at onset of blindness and the development of the semantics of color names,” Journal of Experimental Child Psychology, vol. 25, no. 2, pp. 267–278, Apr. 1978.
  • [49] J. Hale, “A probabilistic earley parser as a psycholinguistic model,” in Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies 2001 - NAACL ’01.   Pittsburgh, Pennsylvania: Association for Computational Linguistics, 2001, pp. 1–8.
  • [50] R. Levy, “Expectation-based syntactic comprehension,” Cognition, vol. 106, no. 3, pp. 1126–1177, Mar. 2008.
  • [51] N. J. Smith and R. Levy, “The effect of word predictability on reading time is logarithmic,” Cognition, vol. 128, no. 3, pp. 302–319, Sep. 2013.
  • [52] M. F. Boston, J. Hale, R. Kliegl, U. Patil, and S. Vasishth, “Parsing costs as predictors of reading difficulty: An evaluation using the Potsdam Sentence Corpus,” Journal of Eye Movement Research, vol. 2, no. 1, Sep. 2008.
  • [53] V. Demberg and F. Keller, “Data from eye-tracking corpora as evidence for theories of syntactic processing complexity,” Cognition, vol. 109, no. 2, pp. 193–210, Nov. 2008.
  • [54] B. Roark, A. Bachrach, C. Cardenas, and C. Pallier, “Deriving lexical and syntactic expectation-based measures for psycholinguistic modeling via incremental top-down parsing,” in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, vol. 1.   Singapore: Association for Computational Linguistics, 2009, p. 324.
  • [55] J. Mitchell, M. Lapata, V. Demberg, and F. Keller, “Syntactic and semantic factors in processing difficulty: An integrated measure,” in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010, pp. 196–206.
  • [56] I. F. Monsalve, S. L. Frank, and G. Vigliocco, “Lexical surprisal as a general predictor of reading time,” in Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics.   Association for Computational Linguistics, 2012, pp. 398–408.
  • [57] R. M. Willems, S. L. Frank, A. D. Nijhof, P. Hagoort, and A. van den Bosch, “Prediction During Natural Language Comprehension,” Cerebral Cortex, vol. 26, no. 6, pp. 2506–2516, Jun. 2016.
  • [58] M. Kutas, “In the company of other words: Electrophysiological evidence for single-word and sentence context effects,” Language and Cognitive Processes, vol. 8, no. 4, pp. 533–572, Nov. 1993.
  • [59] K. D. Federmeier and M. Kutas, “A Rose by Any Other Name: Long-Term Memory Structure and Sentence Processing,” Journal of Memory and Language, vol. 41, no. 4, pp. 469–495, Nov. 1999.
  • [60] D. E. Thornhill and C. Van Petten, “Lexical versus conceptual anticipation during sentence processing: Frontal positivity and N400 ERP components,” International Journal of Psychophysiology, vol. 83, no. 3, pp. 382–392, Mar. 2012.
  • [61] A. Ito, M. Corley, M. J. Pickering, A. E. Martin, and M. S. Nieuwland, “Predicting form and meaning: Evidence from brain potentials,” Journal of Memory and Language, vol. 86, pp. 157–171, Jan. 2016.
  • [62] M. W. Lowder, W. Choi, F. Ferreira, and J. M. Henderson, “Lexical Predictability During Natural Reading: Effects of Surprisal and Entropy Reduction,” Cognitive Science, vol. 42, pp. 1166–1183, Jun. 2018.
  • [63] R. Metusalem, M. Kutas, T. P. Urbach, M. Hare, K. McRae, and J. L. Elman, “Generalized event knowledge activation during online sentence comprehension,” Journal of Memory and Language, vol. 66, no. 4, pp. 545–567, May 2012.
  • [64] N. Delaney-Busch, E. Morgan, E. Lau, and G. R. Kuperberg, “Neural evidence for Bayesian trial-by-trial adaptation on the N400 during semantic priming,” Cognition, vol. 187, pp. 10–20, Jun. 2019.
  • [65] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is All you Need,” Advances in Neural Information Processing Systems, vol. 30, pp. 5998–6008, 2017.
  • [66] F. Keller, “Cognitively Plausible Models of Human Language Processing,” in Proceedings of the ACL 2010 Conference Short Papers.   Uppsala, Sweden: Association for Computational Linguistics, Jul. 2010, pp. 60–67.
  • [67] J. L. Elman, “Finding Structure in Time,” Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.
  • [68] R. Futrell, E. Wilcox, T. Morita, P. Qian, M. Ballesteros, and R. Levy, “Neural language models as psycholinguistic subjects: Representations of syntactic state,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).   Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 32–42.
  • [69] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush, “Transformers: State-of-the-Art Natural Language Processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations.   Online: Association for Computational Linguistics, Oct. 2020, pp. 38–45.
  • [70] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language Models are Few-Shot Learners,” arXiv:2005.14165 [cs], Jul. 2020.
  • [71] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” in Advances in Neural Information Processing Systems, vol. 32.   Curran Associates, Inc., 2019.
  • [72] R. Sennrich, B. Haddow, and A. Birch, “Neural Machine Translation of Rare Words with Subword Units,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).   Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 1715–1725.
  • [73] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2020.
  • [74] RStudio Team, RStudio: Integrated Development Environment for r, RStudio, PBC., Boston, MA, 2020.
  • [75] H. Wickham, M. Averick, J. Bryan, W. Chang, L. D. McGowan, R. François, G. Grolemund, A. Hayes, L. Henry, J. Hester, M. Kuhn, T. L. Pedersen, E. Miller, S. M. Bache, K. Müller, J. Ooms, D. Robinson, D. P. Seidel, V. Spinu, K. Takahashi, D. Vaughan, C. Wilke, K. Woo, and H. Yutani, “Welcome to the tidyverse,” Journal of Open Source Software, vol. 4, no. 43, p. 1686, 2019.
  • [76] D. Bates, M. Mächler, B. Bolker, and S. Walker, “Fitting linear mixed-effects models using lme4,” Journal of Statistical Software, vol. 67, no. 1, pp. 1–48, 2015.
  • [77] T. van den Brand, Ggh4x: Hacks for ’Ggplot2’, 2021.
  • [78] Y. Benjamini and D. Yekutieli, “The Control of the False Discovery Rate in Multiple Testing under Dependency,” The Annals of Statistics, vol. 29, no. 4, pp. 1165–1188, 2001.
  • [79] H. Akaike, “Information Theory and an Extension of the Maximum Likelihood Principle,” in Second International Symposium on Information Theory, ser. Springer Series in Statistics, B. N. Petrov and F. Csáki, Eds.   Budapest, Hungary: Akadémiai Kiadó, 1973, pp. 267–281.
  • [80] K. P. Burnham and D. R. Anderson, “Multimodel Inference: Understanding AIC and BIC in Model Selection,” Sociological Methods & Research, vol. 33, no. 2, pp. 261–304, Nov. 2004.
  • [81] E.-J. Wagenmakers and S. Farrell, “AIC model selection using Akaike weights,” Psychonomic Bulletin & Review, vol. 11, no. 1, pp. 192–196, Feb. 2004.
  • [82] D. J. Chwilla and H. H. J. Kolk, “Accessing world knowledge: Evidence from N400 and reaction time priming,” Cognitive Brain Research, vol. 25, no. 3, pp. 589–606, Dec. 2005.
  • [83] M. Parviz, M. Johnson, B. Johnson, and J. Brock, “Using Language Models and Latent Semantic Analysis to Characterise the N400m Neural Response,” in Proceedings of the Australasian Language Technology Association Workshop 2011, Canberra, Australia, Dec. 2011, pp. 38–46.
  • [84] C. Van Petten, “Examining the N400 semantic context effect item-by-item: Relationship to corpus-based measures of word co-occurrence,” International Journal of Psychophysiology, vol. 94, no. 3, pp. 407–419, Dec. 2014.
  • [85] A. Ettinger, N. Feldman, P. Resnik, and C. Phillips, “Modeling N400 amplitude using vector space models of word representation,” in Proceedings of the 38th Annual Conference of the Cognitive Science Society, Philadelphia, USA, 2016.
  • [86] D. J. Chwilla, H. H. J. Kolk, and C. T. W. M. Vissers, “Immediate integration of novel meanings: N400 support for an embodied view of language comprehension,” Brain Research, vol. 1183, pp. 109–123, Dec. 2007.
  • [87] M. S. Nieuwland, D. J. Barr, F. Bartolozzi, S. Busch-Moreno, E. Darley, D. I. Donaldson, H. J. Ferguson, X. Fu, E. Heyselaar, F. Huettig, E. Matthew Husband, A. Ito, N. Kazanina, V. Kogan, Z. Kohút, E. Kulakova, D. Mézière, S. Politzer-Ahles, G. Rousselet, S.-A. Rueschemeyer, K. Segaert, J. Tuomainen, and S. Von Grebmer Zu Wolfsthurn, “Dissociable effects of prediction and integration during language comprehension: Evidence from a large-scale study using brain potentials,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 375, no. 1791, p. 20180522, Feb. 2020.
  • [88] A. Goodkind and K. Bicknell, “Predictive power of word surprisal for reading times is a linear function of language model quality,” in Proceedings of the 8th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2018).   Salt Lake City, Utah: Association for Computational Linguistics, Jan. 2018, pp. 10–18.
  • [89] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding,” in International Conference on Learning Representations, 2019.
  • [90] S. L. Frank and R. M. Willems, “Word predictability and semantic similarity show distinct patterns of brain activity during language comprehension,” Language, Cognition and Neuroscience, vol. 32, no. 9, pp. 1192–1203, Oct. 2017.