Human reading is both effortless and fast, with typical studies reporting reading rates of around 250 words per minute (Rayner et al., 2006). Human reading is also adaptive: readers vary their strategy depending on the task they want to achieve, with experiments showing clear differences between reading for comprehension, proofreading, and skimming (Kaakinen & Hyönä, 2010; Schotter et al., 2014; Hahn & Keller, 2018).
Another remarkable aspect of human reading is its robustness. A lot of the texts we read are carefully edited and contain few errors, for example articles in newspapers and magazines, or books. However, readers also frequently encounter texts that contain errors, e.g., in hand-written notes, emails, text messages, and social media posts. Intuitively, such errors are easy to cope with and impede understanding only in a minor way. In fact, errors often go unnoticed during normal reading, which is presumably why proofreading is difficult.
The aim of this paper is to experimentally investigate reading in the face of errors, and to propose a simple model that can account for our experimental results. Specifically, we focus on errors that change the form of a word, i.e., that alter a word’s character sequence. This includes letter transpositions (e.g., innocetn instead of innocent) and misspellings (e.g., inocent). Importantly, we will not consider whole-word substitutions, nor will we deal with morphological, syntactic, or semantic errors.
We know from the experimental literature that letter transpositions cause difficulty in reading (Rayner et al., 2006; Johnson et al., 2007; White et al., 2008). However, transpositions are artificial errors (basically they are an artifact of typing), and are comparatively rare (for example, in the error corpus we use (Geertzen et al., 2014), only 11% of errors are letter swaps or repetitions; see Table 1). It is not surprising that such errors slow down reading. This contrasts with misspellings, i.e., errors that writers make because they are unsure about the orthography of a word. These are natural errors that should be easier to read, because they occur more frequently and are linguistically similar to real words (inocent conforms to the phonotactics of English, while innocetn does not). This is our first prediction, which we will test in an eye-tracking experiment that compares the reading of texts with transpositions and misspellings.
Readers’ prior exposure to misspellings might explain why reading is mostly effortless, even in the presence of errors. The fact remains, however, that all types of errors are relatively rare in everyday texts. All previous research has studied isolated sentences that contain a single erroneous word. This is a situation with which the human language processor can presumably cope easily. However, what happens when humans read a whole text which contains a large proportion of errors? It could be that normal reading becomes very difficult if, say, half of all words are erroneous. In fact, this is what we would expect based on theories of language processing that assume prediction, such as surprisal (Levy, 2008): the processor constantly uses the current context to predict the next word, and difficulty ensues if these predictions are incorrect. If the context is degraded by a large number of errors, then predictions become unreliable, and reading slows down. Crucially, we should see this effect on all words, not just on those words that contain errors. This is the second prediction that we will test in our eye-tracking experiment by comparing texts with high and low error rates.
In the second part of this paper, we present a surprisal model that can account for the patterns of difficulty observed in our experiment on reading texts with errors. We start by showing that standard word-based surprisal does not make the right predictions, as it essentially treats words with errors as out-of-vocabulary items. We therefore propose to estimate surprisal with a character-based language model. We show that this model successfully predicts human reading times for texts with errors and accounts for both the effect of error type and the effect of error rate that we observed in our reading experiment.
2 Eye-tracking Experiment
Sixteen participants took part in the experiment after giving informed consent. They were paid £10 for their participation, had normal or corrected-to-normal vision, and were self-reported native speakers of English.
We used the materials of Hahn & Keller (2018) (no-preview condition only), but introduced errors into the texts. These materials contain twenty newspaper texts from the DeepMind question answering corpus (Hermann et al., 2015). Texts are comparable in length (between 149 and 805 words, mean 323) and represent a balanced selection of topics.
Each text comes with a question and a correct answer. The questions are formulated as sentences with a blank to be completed with a named entity so that a statement implied by the text is obtained. Three incorrect answers (distractors) are included for each question; these were also named entities, chosen so that correctly answering the question would likely be impossible without reading the text.
We introduced errors into the materials of Hahn & Keller (2018) following the method suggested by Belinkov & Bisk (2018). These errors are automatically generated and are either transpositions (i.e., two adjacent letters are swapped) or natural errors that replicate actual misspellings. For the latter, we used a corpus of human edits (Geertzen et al., 2014), and introduced errors in our experimental materials by replacing correct words with known misspellings from our edit corpus. The percentages of different types of misspellings are listed in Table 1. By generating texts with errors automatically we can ensure that both error conditions (transpositions and misspellings) contain the same percentage of erroneous words. For both error conditions, we generated texts in which either 10% or 50% of tokens are erroneous.
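The error-generation procedure described above can be illustrated with a minimal sketch. This is not the implementation used for the experiment: the function names (`transpose_adjacent`, `make_misspeller`, `inject_errors`), the uniform choice of swap position, and the tiny misspelling dictionary are all assumptions made for illustration.

```python
import random

def transpose_adjacent(word, rng):
    """Swap two adjacent letters (assumption: position chosen uniformly).
    Returns the word unchanged if no swap would alter it."""
    positions = [i for i in range(len(word) - 1) if word[i] != word[i + 1]]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def make_misspeller(misspellings):
    """Build a corruption function from a dict: correct word -> known misspellings.
    The dict stands in for an edit corpus and is purely hypothetical here."""
    def corrupt(word, rng):
        options = misspellings.get(word)
        return rng.choice(options) if options else word
    return corrupt

def inject_errors(tokens, rate, corrupt, rng):
    """Corrupt tokens in random order until round(rate * len(tokens)) have changed,
    or until no further token can be changed."""
    out = list(tokens)
    target = round(rate * len(tokens))
    order = list(range(len(tokens)))
    rng.shuffle(order)
    changed = 0
    for i in order:
        if changed >= target:
            break
        new = corrupt(tokens[i], rng)
        if new != tokens[i]:
            out[i] = new
            changed += 1
    return out

tokens = ["innocent", "people", "read", "texts"]
noisy = inject_errors(tokens, 0.5, transpose_adjacent, random.Random(0))
```

Counting only tokens that actually changed is what keeps the proportion of erroneous words comparable across the two error conditions in this sketch.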
Participants received written instructions and went through two practice trials whose data was discarded. Then, each participant read and responded to all 20 items (texts with questions and answer choices); the items were the same for all participants, but were presented in a new random order for each participant. The order of the answer choices was also randomized. Participants pressed buttons on a response pad to get to the next page, and to select one of the four answers once they had finished reading a given text.
Eye-movements were recorded using an Eyelink 2000 tracker (SR Research, Ottawa). The tracker recorded the dominant eye of the participant (as established by an eye-dominance test) with a sampling rate of 2000 Hz. Before the experiment started, the tracker was calibrated using a nine-point calibration procedure; at the start of each trial, a fixation point was presented and drift correction was carried out.
2.1.3 Data Analysis
For data analysis, each word in the text was defined as a region of interest. Punctuation was included in the region of the word it followed or preceded without intervening whitespace. If a word was preceded by a whitespace, then that space was included in the region for that word. We report data for the following eye-movement measures in the critical regions: First pass time (often called gaze duration for single-word regions) consists of the sum of fixation durations beginning with the first fixation in the region until the first saccade out of the region, either to the left or to the right. Fixation rate measures the proportion of trials in which the region was fixated (rather than skipped) on first-pass reading. For first pass time, no trials in which the region is skipped on first-pass reading were included in the analysis.
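The two measures defined above can be sketched as follows. This is a simplified illustration, not the analysis code used in the study: it assumes each trial is a list of `(region_index, duration_ms)` fixations in temporal order, and it treats a region as skipped on first pass if a later region was fixated before it.

```python
def first_pass_times(fixations, n_regions):
    """Return first-pass time per region, or None if skipped on first pass."""
    result = [None] * n_regions
    furthest = -1  # rightmost region fixated so far
    i = 0
    while i < len(fixations):
        region = fixations[i][0]
        if region > furthest:
            # First-pass entry: sum consecutive fixations until the eyes leave.
            total = 0
            while i < len(fixations) and fixations[i][0] == region:
                total += fixations[i][1]
                i += 1
            result[region] = total
            furthest = region
        else:
            # Regression or re-reading: not part of any first pass.
            i += 1
    return result

def fixation_rate(per_trial, region):
    """Proportion of trials in which the region was fixated on first pass."""
    fixated = [trial[region] is not None for trial in per_trial]
    return sum(fixated) / len(fixated)

# Hypothetical trial: region 1 is skipped and only visited by a regression.
trial_a = first_pass_times([(0, 200), (0, 50), (2, 180), (1, 150), (3, 220)], 4)
trial_b = first_pass_times([(0, 210), (1, 190), (2, 170), (3, 230)], 4)
rate_region_1 = fixation_rate([trial_a, trial_b], 1)
```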
Table 2: Basic reading measures, comparing Hahn & Keller (2018) with this experiment.
In Table 2, we present some basic reading measures for our experiments, and compare these to the reading experiments of Hahn & Keller (2018), which used the same texts, but did not include any errors (the data is taken from their no-preview condition, which corresponds to our experimental setup). Even in the error condition, the reading measures in our experiments differ only minimally from the ones reported by Hahn & Keller (2018). In the no-error condition, we find slightly faster reading times and lower fixation rates than Hahn & Keller (2018). Also the accuracy (which can only be measured on text level, hence we do not distinguish error and no-error conditions) is essentially unchanged. This provides good evidence for the claim that human readers cope very well with errors in text, with essentially no detriment in terms of reading time, fixation rate, and question accuracy. (Note that participants are not performing at ceiling in question answering; our pattern of results therefore cannot be explained by asserting that the questions were too easy.)
In the following, we analyze two reading measures in more detail: first pass time and fixation rate. We analyzed per-word reading measures using mixed-effects models, considering the following predictors:
ErrorType: Does the text contain misspellings or transpositions?
ErrorRate: Does the text contain 10% or 50% erroneous words overall?
Error: Is the word correct or erroneous?
WordLength: Length of the word in characters.
LastFix: Was the preceding word fixated or not?
All predictors were centered. Word length was scaled to unit variance. We selected binary interactions using forward model selection with a $\chi^2$ test, running the R package lme4 (Bates et al., 2015) with a maximally convergent random effects structure. We then re-fitted the best model with a full random effects structure as a Bayesian generalized multivariate multilevel model using the R package brms; this method is slower but allows fitting large random effects structures even when traditional methods do not converge. Resulting Bayesian models are shown in Table 3. (An analogous analysis for log-transformed first-pass times led to the same pattern of significant effects and their directions.)
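As a small illustration of the preprocessing step (centering all predictors, scaling word length to unit variance), the following sketch uses only the Python standard library. The helper names and the 0/1 coding of the binary predictor are assumptions for illustration, and the use of the sample standard deviation mirrors the default of R's `scale()` rather than anything stated in the text.

```python
from statistics import mean, stdev

def center(values):
    """Subtract the mean so the predictor averages to zero."""
    m = mean(values)
    return [v - m for v in values]

def standardize(values):
    """Center, then divide by the sample standard deviation (unit variance)."""
    sd = stdev(values)
    return [v / sd for v in center(values)]

# Hypothetical data: a binary predictor coded 0/1 and word lengths in characters.
error_centered = center([0, 1, 1, 0])      # [-0.5, 0.5, 0.5, -0.5]
word_length_scaled = standardize([3, 5, 7])
```

Centering binary predictors in this way makes each main effect interpretable at the average level of the other predictors when interactions are present.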
Table 3: Bayesian generalized multivariate multilevel models for reading measures (first pass time and fixation rate) with maximal random-effects structure. Each cell gives the coefficient, its standard deviation, and the estimated posterior probability that the coefficient has the opposite sign.
The main effects of WordLength replicate the well-known positive correlation between word length and reading time (Demberg & Keller, 2008). We also find main effects of Error, indicating that erroneous words are read more slowly and are more likely to be fixated. The main effects of ErrorRate show that higher text error rate leads to longer reading times and higher fixation rates for all words (whether they are correct or erroneous). Additionally, we find a main effect of ErrorType in fixation rate, showing that transposition errors lead to higher fixation rates. This is consistent with our hypothesis that misspellings are easier to process than transpositions, as they are real errors that participants have been exposed to in their reading experience.
Figure 1 graphs mean first pass times and fixation rates by error type and error rate. The most important effect is that error words take longer to read and are fixated more than non-error words. The effect of error rate is also clearly visible: the 50% error condition causes longer reading times and more fixations than the 10% one, even for non-error words. We also observe a small effect of error type.
Turning now to the interactions, we found that ErrorRate and LastFix interact in both reading measures, which indicates that reading times and fixation rates increase in the high-error condition if the previous word has been fixated.
Only for fixation rate was there also an interaction of Error and LastFix, indicating that fixation rate goes up for error words if the preceding word was fixated, presumably because of preview of the erroneous word, which is then more likely to be fixated in order to identify the error.
For fixation rate, WordLength interacts with LastFix: longer words are more likely to be fixated if the preceding word was fixated; again, this is likely an effect of preview. While Figure 1 seems to suggest an interaction of Error and ErrorType, this was not significant in the mixed model.
We have found four main results: (1) Erroneous words show longer reading times and are more likely to be fixated. (2) Higher error rates lead to increased reading times and more fixations, even on words that are correct. (3) Transpositions lead to an increased fixation rate compared to misspellings. (4) Whether the previous word is fixated or not modulates the effect of error and error rate.
However, it is conceivable that the effects of error and error rate are actually artifacts of word length. All else being equal, longer words take longer to read and are more likely to be fixated. So if error words and non-error words in our texts differ in mean length, then that would be an alternative explanation for the effects that we found.
For transposition errors, error words by definition have the same length as their non-error versions. For misspellings, a mixed-effects analysis with word forms as random effects showed no significant difference in the lengths of error words and their correct versions (mean difference , SE , ). Comparing the erroneous words of the two error types, we found that they differ in mean length (misspellings 5.44, transpositions 6.06 characters); however this difference was not significant in a mixed-effects analysis predicting word length of erroneous words from error types, with items as a random effect (mean difference , SE , ).
3 Surprisal Model
Most models of human reading do not explicitly deal with reading in the face of errors. In fact, reading models that use a lexicon to look up word forms (e.g., to retrieve word frequencies) cannot deal with erroneous words without further assumptions. We can use the surprisal model of processing difficulty (Levy, 2008) to illustrate this: in its original, word-based formulation, surprisal is forced to treat all error words as out-of-vocabulary items; it therefore cannot distinguish between different types of errors or between different error rates.
Intuitively, a more fine-grained version of surprisal is required that makes predictions in terms of characters, not words. In such a setting, the word inocent would be more surprising than innocent in the same context, but not as surprising as a completely unfamiliar letter string. In other words, the surprisal of the same word with and without misspellings or letter transpositions would be similar but not the same. To achieve this, we can use character-based language models, which are standard tools in natural language processing for dealing with errors in the input (e.g., the work by Belinkov & Bisk (2018) on errors in machine translation).
Crucially, once we have a character-based surprisal model, we can derive predictions regarding how errors should affect reading. We predict that transpositions should be more surprising than misspellings, as they involve character sequences that are unfamiliar to the model (e.g., innocetn contains the rare character sequence tn). Also, we predict that words that occur in texts with a high error rate are more difficult to read than words in texts with a low error rate: if the context of a word contains few errors, then we are able to predict that word confidently (resulting in low surprisal). If the context contains lots of errors, then our prediction is degraded (resulting in high surprisal). We will now test the predictions regarding error type and error rate using a character-based version of surprisal.
We trained a character-based neural language model using LSTM cells (Hochreiter & Schmidhuber, 1997). Such models can assign probabilities to any sequence of characters, and thus are capable of computing surprisal even for words never seen in the training data, such as erroneous words. For training, we used the Daily Mail portion of the DeepMind corpus. We used a vocabulary consisting of the 70 most frequent characters, mapping others to an out-of-vocabulary token.
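The character vocabulary described above can be sketched in a few lines. The function names and the `<oov>` symbol are illustrative assumptions; only the idea of keeping the most frequent characters and mapping the rest to a single out-of-vocabulary token comes from the text.

```python
from collections import Counter

OOV = "<oov>"  # assumed symbol for the out-of-vocabulary token

def build_char_vocab(text, size):
    """Keep the `size` most frequent characters of the training text."""
    return {ch for ch, _ in Counter(text).most_common(size)}

def encode(text, vocab):
    """Replace every character outside the vocabulary with the OOV token."""
    return [ch if ch in vocab else OOV for ch in text]

# Toy example with a vocabulary of 2 characters instead of 70.
vocab = build_char_vocab("aaabbc", 2)
encoded = encode("abcz", vocab)  # ['a', 'b', '<oov>', '<oov>']
```

Because every input character maps either to itself or to the OOV token, a model over this vocabulary can score arbitrary strings, including erroneous word forms.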
The hyperparameters of the language model were selected on an English corpus based on Wikipedia text (1024 units, 3 layers, batch size 128, embedding size 200, learning rate 3.6 with plain SGD, multiplied by 0.95 at the end of each epoch; BPTT length 80; DropConnect with rate 0.01 for hidden units; replacing entire character embeddings by zero with rate 0.001). We then used the resulting model to compute surprisal on the texts used in the eye-tracking experiment for each experimental condition.
The model estimates, for each element of a character sequence, the probability of seeing this character given the preceding context. We compute the surprisal of a word as the sum of the surprisals of the individual characters, as prescribed by the product rule of probability. For a word consisting of characters $c_1 \dots c_n$ following a context $h$, its surprisal is:

$$S(c_1 \dots c_n \mid h) = -\log P(c_1 \dots c_n \mid h) = -\sum_{i=1}^{n} \log P(c_i \mid h, c_1 \dots c_{i-1})$$
In this computation, we take whitespace characters to belong to the preceding word.
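To make the aggregation from characters to words concrete, here is a minimal sketch. It assumes that per-character conditional probabilities have already been obtained from some language model (the values below are made up), and it reports surprisal in bits; the choice of base 2 is an assumption for illustration.

```python
import math

def word_surprisals(text, char_probs):
    """Sum character surprisals into word surprisals.

    char_probs[i] is P(text[i] | text[:i]). A whitespace character closes
    the current word and is counted as part of that (preceding) word.
    """
    words = []
    current, total = "", 0.0
    for ch, p in zip(text, char_probs):
        current += ch
        total += -math.log2(p)
        if ch.isspace():
            words.append((current, total))
            current, total = "", 0.0
    if current:  # final word without trailing whitespace
        words.append((current, total))
    return words

# Hypothetical probabilities for the string "ab c".
result = word_surprisals("ab c", [0.5, 0.25, 0.5, 0.125])
# "ab " -> 1 + 2 + 1 = 4 bits, "c" -> 3 bits
```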
To control for the impact of the random initialization of the neural network at the beginning of training, we trained seven models with identical settings but different random initializations.
The quality of character-based language models is conventionally measured in Bits Per Character (BPC), which is the average surprisal, to the base 2, of each character. On held-out data, our model achieves a mean BPC of 1.28 (SD 0.025), competitive with BPCs achieved by state-of-the-art systems on similar datasets (e.g., Merity et al. (2018) report BPC = 1.23 on Wikipedia text).
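The BPC measure is simply the mean base-2 surprisal over characters, as this short sketch shows. The probabilities are invented for illustration and do not come from the trained model.

```python
import math

def bits_per_character(char_probs):
    """Average surprisal in bits over a sequence of per-character probabilities."""
    return sum(-math.log2(p) for p in char_probs) / len(char_probs)

# Two characters with surprisals of 1 and 2 bits give a BPC of 1.5.
bpc = bits_per_character([0.5, 0.25])
```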
In the introduction we predicted that word-based surprisal is not able to model the reading time pattern we found in our eye-tracking experiment. In order to test this prediction, we compare our character-level surprisal model to surprisal computed using a conventional word-based neural language model. Word-based models have a fixed vocabulary, consisting of the most common words in the training data; a typical vocabulary size is 10,000. Words that were not seen in the training data, and rare words, are represented by a special out-of-vocabulary (OOV) token. From a cognitive perspective, this corresponds to assuming that all unknown words (whether they contain errors or not) are treated in the same way: they are recognized as unknown, but not processed any further. We used a vocabulary size of 10,000. The hyperparameters of the word-based model were selected on the same English Wikipedia corpus as the character-based model (1024 units, batch size 128, embedding size 200, learning rate 0.2 with plain SGD, multiplied by 0.95 at the end of each epoch; BPTT length 50; DropConnect with rate 0.2 for hidden units; Dropout 0.1 for input layer; replacing words by random samples from the vocabulary with rate 0.01 during training).
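The following sketch illustrates why a fixed word vocabulary cannot distinguish error types: both a misspelling and a transposition collapse to the same unknown-word token. The helper names, the `<unk>` symbol, and the toy training data are assumptions for illustration.

```python
from collections import Counter

UNK = "<unk>"  # assumed symbol for the out-of-vocabulary token

def build_word_vocab(tokens, size):
    """Keep the `size` most frequent word forms of the training data."""
    return {w for w, _ in Counter(tokens).most_common(size)}

def map_to_vocab(tokens, vocab):
    """Replace every word outside the vocabulary with the OOV token."""
    return [w if w in vocab else UNK for w in tokens]

def oov_rate(tokens, vocab):
    """Proportion of tokens that are unknown to the model."""
    return sum(w not in vocab for w in tokens) / len(tokens)

train = "the man is innocent the man".split()
vocab = build_word_vocab(train, 4)
test_tokens = ["the", "inocent", "innocetn", "man"]
mapped = map_to_vocab(test_tokens, vocab)  # both error forms become '<unk>'
rate = oov_rate(test_tokens, vocab)        # 0.5
```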
3.2 Results and Discussion
In this section, we show that surprisal computed by a character-level neural language model (CharSurprisal) is able to account for the effects of errors on reading observed in our eye-tracking experiments. We compute character-based surprisal for the texts used in our experiments, and expect to obtain mean surprisal scores for each experimental condition that resemble mean reading times. We will also verify our prediction that word-based surprisal (WordSurprisal) is not able to account for the effects observed in our experimental data, due to the way it treats unknown words.
Figure 2 shows the mean surprisal values across the different error conditions. We note that the pattern of reading time predicted by CharSurprisal (solid lines) matches the first-pass times observed experimentally very well (see Figure 1), while WordSurprisal (dotted line) shows a clearly divergent pattern, with error words showing lower surprisal than non-error words. This can be explained by the fact that a word-based model does not process error words beyond recognizing them as unknown; the presence of an unknown word itself is not a high-surprisal event (even without errors, 17% of the words in our texts are unknown to the model, given its 10,000-word vocabulary).
To confirm this observation statistically, we fitted linear mixed-effects models with CharSurprisal and WordSurprisal as dependent variables. We enter the seven random initializations of each model as a random factor, analogously to the participants in the eye-tracking experiment. We use the same predictors that we used for the reading measures, except for LastFix, which is not meaningful for the surprisal models, as they do not skip any words.
The results of the mixed model for CharSurprisal (see Table 4) replicated the effects of ErrorRate, Error, and WordLength found in first pass and fixation rate, as well as the effect of ErrorType found only in fixation rate (see Table 3). The same analysis for WordSurprisal (see again Table 4), however, does not yield the correct pattern of results: Crucially, the coefficients of Error and ErrorType have the opposite sign compared to both CharSurprisal and the experimental data.
Table 5: Mixed-effects models predicting first pass time and fixation rate.
We have shown that character-based surprisal computed on the texts used in our experiment is qualitatively similar to the experimental results. As a next step we will test its quantitative predictions, i.e., we will correlate surprisal scores with reading times. For this, we performed mixed-effects analyses in which first-pass time and fixation rate are predicted by WLength, LastFix, and character-based surprisal residualized against word length (ResidCharSurp). Note that we did not enter the error factors (ErrorType, ErrorRate, Error) into this analysis, as we predict that surprisal will simulate the effect of errors in reading.
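Residualizing surprisal against word length can be sketched as taking the residuals of a simple linear regression of surprisal on length. Whether the original analysis used exactly this form of regression is an assumption; the function name and the data are hypothetical.

```python
from statistics import mean

def residualize(y, x):
    """Residuals of an ordinary least-squares regression of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Hypothetical word lengths and surprisal values.
lengths = [1, 2, 3, 4]
surprisal = [2.0, 4.0, 6.0, 9.0]
resid = residualize(surprisal, lengths)
```

By construction the residuals are uncorrelated with word length, so any effect of the residualized predictor cannot be attributed to length alone.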
It is known that surprisal predicts reading times in ordinary text not containing errors (Demberg & Keller, 2008); thus, it is important to disentangle the specific contribution of modeling errors correctly from the general contribution of surprisal in our model. We do this by constructing a baseline version of character-based surprisal that is computed using an oracle (ResidCharSurpOracle). For this, we replace erroneous words with their correct counterparts before computing surprisal, and again residualize against word length. If ResidCharSurp correctly accounts for the effects of errors on reading, then we expect that ResidCharSurp, which has access to the erroneous word forms, will improve the fit with our reading data compared to ResidCharSurpOracle.
For ResidCharSurpOracle, we use the same seven models as for ResidCharSurp, only exchanging the character sequences on which surprisal is computed. This ensures that any difference in model fit between the two predictors can be attributed entirely to the way ResidCharSurp is affected by the presence of errors in the texts.
The resulting models are shown in Table 5. For WLength and LastFix, we see the same pattern of results as in the experimental data (see Table 3). Furthermore, regular surprisal (ResidCharSurp) and oracle surprisal (ResidCharSurpOracle) significantly predict both first pass time and fixation rate. This is in line with the standard finding that surprisal predicts reading time (Demberg & Keller, 2008), but has so far not been demonstrated for texts containing errors. We compare model fit using AIC and BIC. Both measures indicate that ResidCharSurp fits the experimental data better than ResidCharSurpOracle. Thus, character-level surprisal provides an account of our data going beyond the known contribution of surprisal to reading times, and correctly predicts reading in the presence of errors.
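For readers unfamiliar with the two fit criteria, the sketch below computes them from their standard definitions, AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of parameters, n the number of observations, and L the maximized likelihood. All numbers are invented for illustration and are not results from this study.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion; lower values indicate better fit."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower values indicate better fit."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Two hypothetical models with the same number of parameters and observations:
# the one with the higher log-likelihood wins on both criteria.
model_a = (aic(-100.0, 5), bic(-100.0, 5, 1000))
model_b = (aic(-110.0, 5), bic(-110.0, 5, 1000))
a_is_better = model_a[0] < model_b[0] and model_a[1] < model_b[1]
```

When two models have the same number of parameters, as with two predictors swapped one for the other, the comparison on either criterion reduces to a comparison of log-likelihoods.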
We investigated reading with errors in texts that contain either letter transpositions or real misspellings. We found that transpositions cause more reading difficulty than misspellings and explained this using a character-based surprisal model, which assigns higher surprisal to rare letter sequences as they occur in transpositions. We also found that in texts with a high error rate, all words are more difficult to read, even the ones without errors. Again, character-based surprisal explains this: word prediction is harder when the context of a word is degraded by errors, resulting in increased surprisal.
In future work, we plan to integrate character-based surprisal with existing neural models of human reading (Hahn & Keller, 2018). Models at the character level are necessary not only to account for errors, but also to model landing position effects, parafoveal preview, and word length effects, all of which word-based models are unable to capture.
References
- Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.
- Belinkov, Y., & Bisk, Y. (2018). Synthetic and natural noise both break neural machine translation. In Proceedings of the 6th International Conference on Learning Representations. Vancouver, Canada.
- Demberg, V., & Keller, F. (2008). Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 109(2), 193–210.
- Geertzen, J., Alexopoulou, T., & Korhonen, A. (2014). Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCamDat). In R. T. Miller et al. (Eds.), Selected Proceedings of the 2012 Second Language Research Forum (pp. 240–254).
- Hahn, M., & Keller, F. (2018). Modeling task effects in human reading with neural attention. arXiv:1808.00054.
- Hermann, K. M., Kocisky, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., & Blunsom, P. (2015). Teaching machines to read and comprehend. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 28 (pp. 1693–1701).
- Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
- Johnson, R. L., Perea, M., & Rayner, K. (2007). Transposed-letter effects in reading: Evidence from eye movements and parafoveal preview. Journal of Experimental Psychology: Human Perception and Performance, 33(1), 209–229.
- Kaakinen, J. K., & Hyönä, J. (2010). Task effects on eye movements during reading. Journal of Experimental Psychology: Learning, Memory, and Cognition, 36(6), 1561–1566.
- Levy, R. (2008). Expectation-based syntactic comprehension. Cognition, 106(3), 1126–1177.
- Merity, S., Keskar, N. S., & Socher, R. (2018). An analysis of neural language modeling at multiple scales. arXiv:1803.08240.
- Rayner, K., White, S. J., Johnson, R. L., & Liversedge, S. P. (2006). Raeding wrods with jubmled lettres: There is a cost. Psychological Science, 17(3), 192–193.
- Schotter, E. R., Bicknell, K., Howard, I., Levy, R., & Rayner, K. (2014). Task effects reveal cognitive flexibility responding to frequency and predictability: Evidence from eye movements in reading and proofreading. Cognition, 131(1), 1–27.
- White, S. J., Johnson, R. L., Liversedge, S. P., & Rayner, K. (2008). Eye movements when reading transposed text: The importance of word-beginning letters. Journal of Experimental Psychology: Human Perception and Performance, 34(5), 1261–1276.