Recent studies have shown that masked language models pre-trained on a large corpus (hereafter, simply language models) achieve tremendous improvements on a wide variety of natural language processing (NLP) tasks. These results suggest that they are also effective in recognizing erroneous words and phrases, a task known as grammatical error detection. There has been, however, much less work on language models in grammatical error detection than in other tasks. One can argue that since language models are trained on language data produced by native speakers of a language (specifically, English in this paper), they might not work well on the present task, because the target language data are produced by non-native speakers of that language. In other words, English language models know nothing about grammatical errors made by non-native speakers. Even apart from grammatical errors, the target language differs from canonical English in that it contains unnatural words/phrases and characteristic language usages that native speakers do not normally use, as  demonstrate. If so, the effectiveness of language models in grammatical error detection is not self-evident.
In fact, researchers have reported on the performance of language models in grammatical error detection and correction, which partly answers the above research questions.  and  have shown that BERT-based methods improve grammatical error detection performance in Chinese and English, respectively.  and  have shown a similar tendency in grammatical error correction. While these studies empirically demonstrate the effectiveness of language models in grammatical error detection and correction, the questions of why and where language models benefit error detection/correction methods are left unanswered.
In this paper, we explore this aspect of language models in grammatical error detection to better answer the research questions. We first show that 5 to 10% of the training data are enough for a BERT-based error detection method to achieve performance equivalent to what a non-language model-based method achieves with the full training data. More precisely, recall improves much faster with respect to training data size in the BERT-based method than in the non-language model-based method, while precision behaves similarly. These experimental results suggest that (i) the BERT-based method has good knowledge of the grammar required to recognize certain types of error and that (ii) it can transform this knowledge into error detection rules by fine-tuning with a few training samples, both of which lead to its high generalization ability in grammatical error detection. Following this, we further show with pseudo error data that it actually exhibits such properties in learning rules for detecting errors. For instance, we show that the BERT-based method trained on a few (as few as two) instances of a transitive verb with a preposition (e.g., *discuss about) can detect the same type of error in other verbs (e.g., *approach to and *attend in). Finally, based on these findings, we explore a cost-effective method for detecting grammatical errors with feedback comments explaining relevant grammatical rules to learners.
2 Related Work
 shows it is useful for neural error detection models to introduce a secondary language model objective together with the main error detection objective.  compare several other auxiliary training objectives including Part-Of-Speech (POS) tagging and error type identification and find that the language model objective is the most effective. This line of work suggests that grammatical error detection benefits from language modeling although these studies use BiLSTM-based language models instead of masked language models trained on a large corpus.
As mentioned in Sect. 1, several researchers have applied masked language models including BERT to grammatical error detection and correction.  and  show that error detection methods gain in recall and precision with the use of language models.  use BERT-based contextual embeddings for grammatical error detection and compare them with other types of contextual embedding. They show that the BERT-based contextual embeddings are effective in almost all error types provided by ERRANT  although BERT is not fine-tuned in their study.  and  also show performance improvements in grammatical error correction. To strengthen the findings of these previous studies, we will reveal (at least partly) why and where error detection methods benefit from language models in the following sections.
There has been a long history of studies that investigate the linguistic knowledge of language models, including the work by [12, 7, 19] to name a few. A popular approach is to test whether a language model assigns a higher likelihood to the appropriate word than to an inappropriate one, given a context. The linguistic knowledge to be explored ranges from syntactic/semantic knowledge to common sense. These studies mostly use (i) synthetic test data: sentences that are generated synthetically by using a certain kind of template or (ii) perturbed test data: sentences that are generated by perturbing a natural corpus. Our work differs from these previous studies in two respects: (i) to the best of our knowledge, we examine linguistic phenomena that have never been explored in the conventional studies (e.g., subjects marked with a preposition and errors involving the usages of transitive and intransitive verbs); (ii) we use a real learner corpus with real errors as our test data.
examine whether an encoder-decoder neural network for grammatical error correction (not BERT-based) can learn the knowledge of grammar from training data for grammatical error correction (pairs of original and corrected sentences). They target five error types: subject-verb agreement, verb form, word order, adjective/adverb comparison, and noun number. They use both synthetic and real learner data. They report a negative answer to the research question except for word order errors, while their model learns the knowledge to detect the target errors. Our study supports and deepens their findings for a wider variety of error types that are much more difficult to detect (in that they require a much wider range of linguistic knowledge including POS, lexical, and syntactic knowledge).
3 Data and Methods
3.1 Real and Pseudo Data
In this paper, we use two kinds of data: real and pseudo data. The real data consist of an English learner corpus manually annotated with grammatical errors, while the pseudo data are automatically generated by perturbing a native English corpus.
For the real data, we use the data created in the work . Its base corpus is the written essays in ICNALE . It consists of essays written by English learners. Their topics are controlled; they are written on either (a) "It is important for college students to have a part-time job." or (b) "Smoking should be completely banned at all the restaurants in the country.", which hereafter will be referred to as PART-TIME JOB and SMOKING, respectively. Each essay is manually annotated with errors and feedback comments explaining their relevant grammatical rules in detail. These two sources of information help us investigate error types that the BERT-based method can recognize. The original data provide feedback comments concerning preposition errors and more general writing items. In this paper, we limit ourselves to the essays annotated with preposition feedback comments so that we can conduct a deep analysis targeting a class of grammatical errors. Having said that, the target preposition errors involve a much wider range of errors than in the conventional definition of preposition errors (such as the one provided by ERRANT). For instance, the preposition errors in the work  include deverbal prepositions (e.g., *include → including), intransitive verbs with a direct object (e.g., *agree it → agree with it), a verb phrase used as a noun phrase (e.g., *Learn English is difficult. → To learn/Learning English is difficult.), and comparison between a phrase and a clause (e.g., *because an error → because of an error); see their work for the details.
The essays are randomly split into training, development, and test sets in the ratios of 85%, 7.5%, and 7.5%, respectively. Table 1 shows their statistics. [Footnote 1: The data development is described in the work of . For this work, we obtained from the developer data that had not yet been open to the public.] To investigate the relationship between the number of training sentences and detection performance, we randomly sample 100, 300, 500, 1000, 3000, 5000, 10000, and all sentences (12,163 and 12,312 in PART-TIME JOB and SMOKING, respectively), resulting in eight sets of training data for each topic. Note that these training, development, and test sets contain error-free sentences.
| ||PART-TIME JOB: Training||Dev.||Test||SMOKING: Training||Dev.||Test|
|# feedback comments||2,439||244||222||2,342||230||212|
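The training-size sampling described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `sample_training_subsets`, the seed, and the placeholder corpus are hypothetical, and each subset is drawn independently here, whereas the text does not say whether the original subsets were nested.

```python
import random

def sample_training_subsets(sentences, sizes, seed=0):
    """Return {subset_size: subset}, one random subset per requested size.

    A size of None (or any size >= len(sentences)) means "all sentences".
    Subsets are drawn independently of each other (an assumption).
    """
    rng = random.Random(seed)
    subsets = {}
    for size in sizes:
        if size is None or size >= len(sentences):
            subsets[len(sentences)] = list(sentences)
        else:
            subsets[size] = rng.sample(sentences, size)
    return subsets

# Training sizes listed in the text; None stands for the full training set.
sizes = [100, 300, 500, 1000, 3000, 5000, 10000, None]
# Placeholder corpus with the PART-TIME JOB training size (12,163 sentences).
corpus = [f"sentence {i}" for i in range(12163)]
subsets = sample_training_subsets(corpus, sizes)
print(sorted(subsets))
```

Fixing the seed keeps the eight subsets reproducible across the repeated training runs.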
For the pseudo data, we use the 1998-2000 New York Times in the AQUAINT Corpus of English News Text  as a base corpus. We automatically generate erroneous sentences by injecting errors into them (one error per sentence). We first obtain chunks and parses by using spaCy (https://spacy.io/). Here, we only use sentences that are longer than three tokens and shorter than 26 tokens so that we can get reliable chunks and parses. We then add, remove, or replace a word in the sentences based on the analyses.
While we target all errors labeled as preposition errors in the real data, we only target the following five error types in the pseudo data:
- Prepositional infinitive:
To-infinitives marked with a preposition other than to
(e.g., a book to read → *a book for read)
- Subject verb:
Verb phrases used as a subject
(e.g., *Learn English is difficult.)
- Preposition + subject:
Subjects used with a preposition
(e.g., *In the restaurant serves good food.)
- Transitive verb + preposition:
Transitive verbs used with a preposition
(e.g., *We discussed about it.)
- Intransitive verb + object:
Intransitive verbs taking a direct object
(e.g., *We agree it.)
These five error types are selected according to two criteria: (i) they are major errors in the real data; (ii) we can easily write a software program to generate pseudo errors based on chunks and parses. For example, we can find the subject of a sentence from its parse and then add a randomly chosen preposition before the subject noun phrase, as in *In the restaurant serves good food. We randomly choose one of the following five prepositions for addition and replacement: at, about, to, in, and with; an exception is that we only use for for Prepositional infinitive (e.g., a book to read → *a book for read), which often appears in the real data. Similarly, we can extract pairs of a verb and its direct object from parses and then put one of the prepositions before the direct object noun phrase, as in discuss the matter → *discuss about the matter. Tables 2 and 3 show the target transitive and intransitive verbs, respectively. It should be emphasized that, as shown in the tables, there is no overlap of verbs between the training and test data. This means that error detection methods must learn to detect these two types of error in one set of verbs from those in another set of verbs.
|Training/Dev. Data||Test Data|
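The preposition-insertion step described above might look roughly like the sketch below. It is not the authors' generator: the function names, the label string `TRANS_VERB_PREP`, and the choice to place the error label on the inserted preposition are assumptions made for illustration, and the position of the noun phrase is passed in by hand rather than taken from a spaCy parse.

```python
import random

# The five prepositions listed in the text for addition and replacement.
PREPOSITIONS = ["at", "about", "to", "in", "with"]

def length_ok(tokens):
    """Keep only sentences longer than three tokens and shorter than 26."""
    return 3 < len(tokens) < 26

def inject_preposition(tokens, np_start, rng, label):
    """Insert a randomly chosen preposition before the noun phrase at np_start.

    Returns the perturbed tokens and token-level labels. The inserted
    preposition receives `label` (an assumed labeling convention); every
    other token is labeled "C" (correct).
    """
    prep = rng.choice(PREPOSITIONS)
    new_tokens = tokens[:np_start] + [prep] + tokens[np_start:]
    labels = ["C"] * len(new_tokens)
    labels[np_start] = label
    return new_tokens, labels

rng = random.Random(0)
tokens = ["We", "discussed", "the", "matter", "."]
# np_start=2 is assumed to come from a parse (direct object "the matter").
if length_ok(tokens):
    new_tokens, labels = inject_preposition(tokens, 2, rng, "TRANS_VERB_PREP")
    print(new_tokens, labels)
```

The same helper would cover the "Preposition + subject" type by passing the start of the subject noun phrase, although sentence-initial capitalization would then need separate handling.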
From the resulting pseudo error data, we randomly sample $n$ sentences for each error type, resulting in ten sets of training data (e.g., when $n = 2$, the set comprises two instances of each error type, ten instances in total). We use these training sets to estimate the relationship between the number of training instances and detection performance. For a validation set, we randomly sample 200 sentences for each error type. Similarly, we use a test set consisting of 200 sentences randomly sampled for each error type plus another 200 error-free sentences. The validation and test sets are fixed regardless of the training data.
3.2 Grammatical Error Detection Methods
This subsection describes the three methods to be explored and compared. Before looking into them, let us define grammatical error detection formally. Grammatical error detection can be solved as a token classification problem. [Footnote 3: More generally, it can also be solved as a sequence labeling problem using, for example, a CRF. However,  shows that the grammatical error detection task does not benefit from a CRF. We actually observed the same tendency in our datasets. Accordingly, we solve it as a token classification problem (without a CRF).] To formalize it, we will denote a sequence of words and its length by $w_1, \dots, w_N$ and $N$, respectively. [Footnote 4: Here, we use the term words abstractly to mean word-like objects, which can be words or subwords.] We will denote the corresponding sequence of labels by $l_1, \dots, l_N$, where $l_i$ corresponds to the label of $w_i$. We assume two sets of labels: (i) either C or E, denoting correct or erroneous, respectively, in the real data; (ii) labels for error types plus C for correct in the pseudo data. Then, grammatical error detection is defined as the problem of predicting the optimal label sequence $\hat{l}_1, \dots, \hat{l}_N$ given $w_1, \dots, w_N$.
Basically, we use neural networks to predict the optimal label sequence. In this paper, training is repeated five times with different (but fixed) random seeds. The reported performance values (i.e., recall, precision, and F-measure) are averaged over the five runs. Training runs for at most ten epochs, and we adopt the epoch achieving the best F-measure on the development set. Table 4 shows the other major hyperparameters. [Footnote 5: When we use the pseudo data for training, the number of training sentences can be as small as ten, and we use a rather small batch size of five; otherwise we use 32.]
|Batch size||5 or 32|
|Optimization||Adam with decoupled weight decay|
|Learning rate, Adam (β1, β2)||-, (0.9, 0.999)|
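The evaluation protocol above (recall, precision, and F-measure averaged over five runs, with the best epoch chosen on the development set) can be illustrated with a small sketch. Token-level matching, the default `beta=1.0`, and all function names are assumptions for illustration; the text does not specify the exact matching criterion or the F-measure weighting.

```python
def precision_recall_f(gold, pred, beta=1.0):
    """Token-level precision, recall and F-measure over non-"C" labels."""
    tp = sum(1 for g, p in zip(gold, pred) if p != "C" and g == p)
    n_pred = sum(1 for p in pred if p != "C")
    n_gold = sum(1 for g in gold if g != "C")
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

def best_epoch(dev_scores):
    """Index of the epoch with the highest development-set F-measure."""
    return max(range(len(dev_scores)), key=lambda i: dev_scores[i])

def average(values):
    """Mean of the scores obtained with the different random seeds."""
    return sum(values) / len(values)

# Toy labels: one of two gold errors is found, plus one false alarm.
gold = ["C", "E", "C", "E", "C"]
pred = ["C", "E", "E", "C", "C"]
print(precision_recall_f(gold, pred))
```

In a full experiment, `best_epoch` would be applied to the per-epoch development scores of each run, and `average` to the resulting five test scores.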
3.2.1 BERT-based Method
The BERT-based method takes as input a word sequence $w_1, \dots, w_N$ and conducts the following procedures:
- (1) Subword:
split every word $w_i$ into its corresponding subwords, yielding the subword sequence $s_1, \dots, s_M$. Note that the total number of subwords $M$ generally differs from the number of words $N$ in the input word sequence.
- (2) Encode:
encode all subwords into BERT embeddings by:
$$\mathbf{h}_1, \dots, \mathbf{h}_M = \mathrm{BERT}(s_1, \dots, s_M)$$
where $\mathrm{BERT}(\cdot)$ denotes BERT taking subwords as input and outputting their corresponding embedding vectors of dimension $d$ (specifically, $d = 768$ for the BERT base model) from the final layer. We use the BERT base model (uncased) for $\mathrm{BERT}(\cdot)$.
- (3) Token classification:
output the optimal labels by:
$$\hat{l}_j = \operatorname*{arg\,max}_{k} \; \mathrm{softmax}(W \mathbf{h}_j)_k$$
where $W \in \mathbb{R}^{K \times d}$ is a weight matrix and $K$ is either 2 or 6 (the number of different labels). To take care of the difference in the lengths of the input word sequence and the corresponding subword sequence, only the first subword of each word is considered in training and prediction.
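Steps (1) and (3) above, in particular the first-subword convention, can be sketched in plain Python. Everything here is a toy stand-in: `toy_tokenize` is a hypothetical splitter rather than BERT's WordPiece vocabulary, the 2-dimensional vectors replace real BERT outputs, and the weight matrix is hand-picked. Since softmax is monotonic, taking the argmax of the raw scores gives the same label as the argmax of the softmax.

```python
def first_subword_indices(words, tokenize):
    """Return the flat subword list and the index of each word's first subword."""
    subwords, first = [], []
    for word in words:
        pieces = tokenize(word)
        first.append(len(subwords))  # position of this word's first subword
        subwords.extend(pieces)
    return subwords, first

def classify(vectors, weight, labels):
    """For each vector h, pick the label whose weight row gives the highest score."""
    predicted = []
    for h in vectors:
        scores = [sum(w * x for w, x in zip(row, h)) for row in weight]
        predicted.append(labels[scores.index(max(scores))])
    return predicted

def toy_tokenize(word):
    """Hypothetical subword splitter: cut words longer than four characters."""
    return [word] if len(word) <= 4 else [word[:4], "##" + word[4:]]

words = ["we", "discussed", "about", "it"]
subwords, first = first_subword_indices(words, toy_tokenize)
# Toy embeddings, one per subword, standing in for the BERT outputs.
embeddings = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # rows correspond to labels C and E
# Only the first subword of each word is used for prediction.
word_vectors = [embeddings[i] for i in first]
print(classify(word_vectors, weight, ["C", "E"]))
```

With these toy values, the only word labeled E is "about", whose first subword carries the second toy embedding pattern.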
3.2.2 Methods to Be Compared
For comparison, we select a BiLSTM-based error detection method. Basically it follows the above steps (1) to (3). The major difference is that we use BiLSTM as an encoder in place of BERT. Also, the input word sequence is turned into a sequence of embedding vectors where each embedding vector consists of the concatenation of the conventional word embedding and a character-based embedding. The character-based embedding is obtained by another BiLSTM taking the characters of each word following the work . The concatenated embeddings are put into the encoder BiLSTM to produce vectors for prediction in the step (3). Specifically we use the implementation FLAIR . We will refer to this method for comparison as the BiLSTM-based method, hereafter.
We also investigate how effective the fine-tuning of BERT is. Namely, the BERT part of the BERT-based method is fixed during training and only the output layer is adjusted by the training data. We will refer to this method as the BERT-based method without BERT train, hereafter.
4 Performance on Real Data
Figure 1 shows the relationship between the size of training data and F-measure, where the size is measured by the number of sentences. The three methods are trained with the specified amount of the data in either topic (PART-TIME JOB or SMOKING) and are tested on the data in the same topic (in-domain test).
Figure 1 reveals that the BERT-based method exhibits a performance saturation at a point of 1,000-2,000 training sentences while the BiLSTM-based method improves almost linearly as the number of training sentences increases within the range of the available training data. The graph for the BERT-based method without BERT train exhibits a similar shape to that of the BiLSTM-based method, but its F-measure values are much lower. In addition, Figure 1 suggests that the BERT-based method achieves, with only 500 to 1,000 training sentences, an F-measure equivalent to what the BiLSTM-based method achieves with the full training data. Figure 2 shows the same tendencies in the out-domain test setting.
These results support the hypothesis that the masked language models trained on a large corpus have a much higher generalization ability in grammatical error detection. In addition, they empirically show that fine-tuning is crucial in the application of BERT to the task of grammatical error detection.
To look into these points, let us consider the precision-recall curves shown in Figure 3 (in-domain test) and Figure 4. Interestingly, all figures show that the BERT-based and BiLSTM-based methods both quickly improve in precision as the number of available training sentences increases, while only the BERT-based method does so in recall. In other words, only the BERT-based method learns to recognize various error types with little exposure to error examples. This is surprising because BERT is only pre-trained on native corpora that are virtually error-free and thus it knows nothing about the grammatical errors learners make. Nevertheless, it quickly learns to recognize various error types by fine-tuning with few training instances.
By contrast, the BERT-based method without BERT train improves either in recall or in precision but not in both. This is probably because it requires many more degrees of freedom in terms of the network parameters to learn rules for detecting a wide variety of grammatical errors, which have a certain degree of complexity. Considering the fact that the exact same information about errors (i.e., training data) is given to the three methods, these results lead us to the hypotheses that the BERT-based method uses its knowledge about canonical English (what correct English sentences should look like) and transforms it into rules for detecting grammatical errors by fine-tuning. We will explore these points in detail in the following section.
These results also shed light on an important practical aspect of the BERT-based method in grammatical error detection. Namely, a cost-effective way of developing an error detection system would be to create around 1,000 training sentences for each essay topic; according to Figure 1, the gain would be much smaller after 1,000 training sentences. Of course, the results are only for two essay topics and the target errors are limited to errors involving preposition use. Also, it is unknown how differently the performance curves grow with a much larger set of training instances. It will be interesting to investigate these points in future work.
5 Looking into Potential of Language Model-based Method with Pseudo Error Data
In the previous section, we have seen that the BERT-based method has a much higher generalization ability in grammatical error detection. To look into this phenomenon, we now turn to the detection performance of the BERT-based method on the pseudo error data. As described in Subsect. 3.1, we train it on the ten sets of training data and test the trained models on the fixed test set.
Figure 5 shows the relationship between the size of training data and F-measure for each error type, where the size is measured by the number of sentences. Figure 5 reveals that the BERT-based method already recognizes some of the target errors at early stages of the graph (even with two or four training sentences). Performance goes much higher even with eight training sentences in most of the error types, with the exception of the error type “Intransitive verb + object”. For instance, the BERT-based method recognizes more than half of the “Preposition + subject” errors with a precision of 0.800 with only eight training instances. This implies that BERT has certain knowledge of English grammar similar to the notions of POS such as verbs and of syntactic relations such as subjects; otherwise, it would be difficult to achieve a similar performance in this type of error considering that the noun phrase of a subject and its position in the sentence vary considerably depending on the target sentence.
We can make the same argument about the transitive verb + preposition and intransitive verb + object error types. It should be emphasized that the BERT-based method has to detect errors in verbs that never appear in the training data [Footnote 6: Strictly, some of the verbs may appear in the training sentences for the other error types. However, they never appear in the erroneous phrases. Also, they do not appear at all when the training size is small.]; recall that there is no overlap of the target transitive/intransitive verbs between the training and test data, as described in Subsect. 3.1. In other words, the BERT-based method can recognize unseen erroneous combinations, for example, *visited in Atlanta and *specialized environmental litigation, after just seeing *mention in and *discussed about (transitive verb + preposition type) and *were related drugs and *belongs Lon’s grandmother (intransitive verb + object type). The training and test sentences have almost nothing in common except that they are combinations of transitive/intransitive verbs and prepositions/objects. Besides, the fact that correctly used combinations of other verbs and prepositions/objects often appear in the test data makes the task even more difficult without the knowledge of POS and syntactic relations. These findings support the hypotheses that BERT has grammar-like knowledge and that it can turn the knowledge into error detection rules by fine-tuning.
6 Exploration for Cost-Effective Error Detection with Feedback Comments
The findings we have obtained so far bring out the possibility that one can implement, with few training instances, a system that accurately detects grammatical errors and recognizes their detailed error types. For example, by manually or automatically creating a few instances of the erroneous combination of transitive verbs and prepositions as we saw in the previous sections (e.g., *discuss about), one can develop a system detecting the same type of error in other transitive verbs and prepositions (e.g., *mention about it and *attend in). With the detailed error types, the system can also output feedback comments to the user, such as "Transitive verbs do not take a preposition. Instead, they take a direct object.", instead of just flagging them as preposition errors.
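As a rough illustration of how detailed error types could drive feedback comments, the sketch below maps predicted labels to comments. The label name `TRANS_VERB_PREP`, the fallback message, and the function `feedback_for` are hypothetical; only the transitive-verb comment is taken from the text.

```python
# Comment for the transitive verb + preposition type, as quoted in the text.
FEEDBACK = {
    "TRANS_VERB_PREP": (
        "Transitive verbs do not take a preposition. "
        "Instead, they take a direct object."
    ),
}
# Hypothetical fallback for error types without a dedicated comment.
DEFAULT = "This may be a preposition error."

def feedback_for(tokens, labels):
    """Pair every token flagged as erroneous with a comment for its error type."""
    comments = []
    for token, label in zip(tokens, labels):
        if label != "C":  # "C" marks a correct token
            comments.append((token, FEEDBACK.get(label, DEFAULT)))
    return comments

tokens = ["We", "discussed", "about", "it", "."]
labels = ["C", "C", "TRANS_VERB_PREP", "C", "C"]
for token, comment in feedback_for(tokens, labels):
    print(f"{token}: {comment}")
```

A detector that only outputs C or E would always fall back to the generic message, which is why the detailed error types matter for this use.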
As a pilot study, we trained the BERT-based method on the pseudo data and tested it on the real (learner) data to examine the above possibility. To achieve it, we manually annotated the real data with the target five error types consulting the feedback comment attached to each error.
Figure 6 shows the results. Figure 6 reveals that the BERT-based method trained on the pseudo data does not perform as well on the real data as on the pseudo data. Performance stops improving at an early stage (around eight training sentences).
A possible reason for this is that in the real data, multiple errors often appear in a sentence. Also, multiple errors in a sentence can range over multiple types of error. Besides, the error rate is much lower in the real data than in the pseudo error data, where one error occurs per sentence except for the 200 error-free sentences (although multiple types of error appear in the whole data set). These conditions make the task much more difficult in the real data.
Having said that, the results shown in Figure 5 still encourage us to develop language model-based systems with a small amount of in-domain training data in order to detect grammatical errors with detailed error types. One possible way to achieve this is (i) to sample sentences from unannotated essays written on the target topic and (ii) to annotate them with the specific types of errors on which the developer wants to give feedback to the user. This will naturally mitigate the problems caused by the multiple-type, multiple-error situation and the error rate difference. One can also manually create sample error sentences (and their correct versions) to augment the data created by (i) and (ii).
7 Conclusions
In this paper, we have explored the capacity of a large-scale masked language model to recognize grammatical errors. Our findings are summarized in the following three points: (1) Experiments with the real learner data show that a BERT-based error detection method has a much higher generalization ability in grammatical error detection than a non-language model-based method and the first performance saturation comes at the point of around 1,000-2,000 training instances; (2) It starts to recognize the target errors with few (as few as two) instances of them; (3) The high generalization ability brings out its potential for developing systems that detect and explain grammatical errors with very few training instances.
References
- [1] Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S., and Vollgraf, R. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proc. of 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations) (2019), pp. 54–59.
- [2] Akbik, A., Blythe, D., and Vollgraf, R. Contextual string embeddings for sequence labeling. In Proc. of 27th International Conference on Computational Linguistics (2018), pp. 1638–1649.
- [3] Bell, S., Yannakoudakis, H., and Rei, M. Context is key: Grammatical error detection with contextual word representations. In Proc. of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications (2019), Association for Computational Linguistics, pp. 103–115.
- [4] Bryant, C., Felice, M., and Briscoe, T. Automatic annotation and evaluation of error types for grammatical error correction. In Proc. of 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2017), pp. 793–805.
- [5] Cheng, Y., and Duan, M. Chinese grammatical error detection based on BERT model. In Proc. of 6th Workshop on Natural Language Processing Techniques for Educational Applications (2020), Association for Computational Linguistics, pp. 108–113.
- [6] Didenko, B., and Shaptala, J. Multi-headed architecture based on BERT for grammatical errors correction. In Proc. of 14th Workshop on Innovative Use of NLP for Building Educational Applications (2019), pp. 246–251.
- [7] Ettinger, A. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics 8 (2020), 34–48.
- [8] Graff, D. The AQUAINT Corpus of English News Text, 2002.
- [9] Ishikawa, S. A new horizon in learner corpus studies: The aim of the ICNALE project. University of Strathclyde Publishing, Glasgow, 2011, pp. 3–11.
- [10] Kaneko, M., and Komachi, M. Multi-head multi-layer attention to deep language representations for grammatical error detection.
- [11] Kaneko, M., Mita, M., Kiyono, S., Suzuki, J., and Inui, K. Encoder-decoder models can benefit from pre-trained masked language models in grammatical error correction. In Proc. of 58th Annual Meeting of the Association for Computational Linguistics (2020), Association for Computational Linguistics, pp. 4248–4254.
- [12] Li, B., Zhu, Z., Thomas, G., Xu, Y., and Rudzicz, F. How is BERT surprised? Layerwise detection of linguistic anomalies. In Proc. of 59th Annual Meeting of the Association for Computational Linguistics and 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (Online, 2021), pp. 4215–4228.
- [13] Mita, M., and Yanaka, H. Do grammatical error correction models realize grammatical generalization? In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (2021), Association for Computational Linguistics, pp. 4554–4561.
- [14] Nagata, R. Toward a task of feedback comment generation for writing learning. In Proc. of 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (2019), pp. 3197–3206.
- [15] Nagata, R., Inui, K., and Ishikawa, S. Creating corpora for research in feedback comment generation. In Proc. of the 12th Language Resources and Evaluation Conference (2020), pp. 340–345.
- [16] Nagata, R., and Whittaker, E. Reconstructing an Indo-European family tree from non-native English texts. In Proc. of 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Sofia, Bulgaria, 2013), Association for Computational Linguistics, pp. 1137–1147.
- [17] Rei, M. Semi-supervised multitask learning for sequence labeling. In Proc. of 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2017), pp. 2121–2130.
- [18] Rei, M., and Yannakoudakis, H. Auxiliary objectives for neural error detection models. In Proc. of 12th Workshop on Innovative Use of NLP for Building Educational Applications (2017), Association for Computational Linguistics, pp. 33–43.
- [19] Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S.-F., and Bowman, S. R. BLiMP: The benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics 8 (2020), 377–392.