Grammatical Error Correction (GEC) is a sequence-to-sequence task where a model corrects an ungrammatical sentence to a grammatical sentence. Numerous studies on GEC have successfully used encoder-decoder (EncDec) based models, and in fact, most current state-of-the-art neural GEC models employ this architecture Zhao et al. (2019); Grundkiewicz et al. (2019); Kiyono et al. (2019).
In light of this trend, one natural and intriguing question is whether neural EncDec GEC models can benefit from the recent advances in masked language models (MLMs), since MLMs such as BERT Devlin et al. (2019) have been shown to yield substantial improvements in a variety of NLP tasks Qiu et al. (2020). BERT, for example, builds on the Transformer architecture Vaswani et al. (2017) and is trained on large raw corpora to learn general representations of linguistic components (e.g., words and sentences) in context, which have proven useful for a wide range of tasks. In recent years, MLMs have been used not only for classification and sequence labeling tasks but also for language generation, where combining an MLM with the EncDec model of a downstream task yields a noticeable improvement Lample and Conneau (2019).
Common methods of incorporating an MLM into an EncDec model are initialization (init) and fusion (fuse). In the init method, the downstream task model is initialized with the parameters of a pre-trained MLM and then trained on a task-specific training set Lample and Conneau (2019); Rothe et al. (2019). This approach, however, does not work well for sequence-to-sequence language generation tasks, because such tasks tend to require a huge amount of task-specific training data, and fine-tuning an MLM on such a large dataset tends to disrupt its pre-trained representations, leading to catastrophic forgetting Zhu et al. (2020); McCloskey and Cohen (1989). In the fuse method, pre-trained representations of an MLM are used as additional features during the training of a task-specific model Zhu et al. (2020). When applying this method to GEC, what the MLM has learned in pre-training is preserved; however, the MLM is adapted neither to the GEC task nor to the task-specific distribution of inputs (i.e., erroneous sentences in a learner corpus), which may hinder the GEC model from effectively exploiting the potential of the MLM. Given these drawbacks of the two common methods, it is not as straightforward to gain the advantages of MLMs in GEC as one might expect. This background motivates us to investigate how an MLM should be incorporated into an EncDec GEC model to maximize its benefit. To the best of our knowledge, no research has addressed this question.
In our investigation, we employ BERT, which is a widely used MLM Qiu et al. (2020), and evaluate the following three methods: (a) initialize an EncDec GEC model using pre-trained BERT as in Lample and Conneau (2019) (BERT-init), (b) pass the output of pre-trained BERT into the EncDec GEC model as additional features (BERT-fuse) Zhu et al. (2020), and (c) combine the best parts of (a) and (b).
In this new method (c), we first fine-tune BERT with the GEC corpus and then use the output of the fine-tuned BERT model as additional features in the GEC model. To implement this, we further consider two options: (c1) additionally train pre-trained BERT with GEC corpora (BERT-fuse mask), and (c2) fine-tune pre-trained BERT by way of the grammatical error detection (GED) task (BERT-fuse GED). In (c2), we expect that the GEC model will be trained so that it can leverage both the representations learned from large general corpora (pre-trained BERT) and the task-specific information useful for GEC induced from the GEC training data.
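Option (c1) simply continues BERT's masked-language-model training on GEC corpora. As a rough, self-contained sketch of the standard masking scheme (the 80/10/10 split follows Devlin et al. (2019); the toy vocabulary and function names here are our own illustration, not the paper's code):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "a", "in", "to", "of"]  # toy vocabulary for random replacement

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """BERT-style masking: of the selected tokens, 80% become [MASK],
    10% a random vocabulary token, and 10% are left unchanged.
    Returns (corrupted inputs, labels); label is None where no loss applies."""
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the MLM must predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK)
            elif r < 0.9:
                inputs.append(rng.choice(VOCAB))
            else:
                inputs.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels
```

Running this over the (possibly erroneous) source side of a GEC corpus exposes BERT to the learner-text distribution while keeping its original objective.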
Our experiments show that using the output of the fine-tuned BERT model as additional features in the GEC model (method (c)) is the most effective way of using BERT in most of the GEC corpora that we used in the experiments. We also show that the performance of GEC improves further by combining the BERT-fuse mask and BERT-fuse GED methods. The best-performing model achieves state-of-the-art results on the BEA-2019 and CoNLL-2014 benchmarks.
2 Related Work
Studies have reported that an MLM can improve the performance of GEC when it is employed either as a re-ranker Chollampatt et al. (2019); Kaneko et al. (2019) or as a filtering tool Asano et al. (2019); Kiyono et al. (2019). EncDec-based GEC models combined with MLMs can also be used alongside these pipeline methods. Asano et al. (2019) proposed correction methods based on sequence labeling models; unlike our approach, however, such methods cannot take advantage of existing EncDec GEC knowledge, because their model architecture differs from the EncDec architecture. Moreover, to the best of our knowledge, no research has yet investigated how to incorporate the information of MLMs for effectively training an EncDec GEC model.
MLMs are generally used in downstream tasks by fine-tuning Liu (2019); Zhang et al. (2019). However, Zhu et al. (2020) demonstrated that it is more effective to provide the output of the final layer of an MLM to the EncDec model as contextual embeddings. Recently, Weng et al. (2019) addressed the mismatch between the contextual knowledge of pre-trained models and the target bilingual machine translation task. Here, we likewise claim that addressing the gap between grammatically correct raw corpora and GEC corpora can lead to improved GEC systems.
3 Methods for Using Pre-trained MLM in GEC Model
In this section, we describe our approaches for incorporating a pre-trained MLM into our GEC model. Specifically, we examine the following approaches: (1) initializing a GEC model with BERT; (2) using the BERT output as additional features for a GEC model; and (3) using the output of BERT fine-tuned on GEC corpora as additional features for a GEC model.
3.1 BERT-init

We create a GEC EncDec model initialized with BERT weights, following Lample and Conneau (2019). Note that most recent state-of-the-art methods pre-train the GEC model on pseudo-data, which is generated by injecting pseudo-errors into grammatically correct sentences; since BERT-init uses BERT's weights for initialization, it cannot additionally initialize the GEC model with parameters learned from such pseudo-data.
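The init idea reduces to copying every pre-trained parameter whose name and shape match into the downstream model, leaving the rest randomly initialized. A minimal sketch with plain dicts and lists standing in for named weight tensors (all names here are hypothetical, not from the paper's code):

```python
def init_from_pretrained(model_params, pretrained_params):
    """Initialize a downstream model from a pre-trained state dict:
    copy every tensor whose name exists in the pre-trained dict with a
    matching shape; report the parameters left at their random init
    (e.g. decoder cross-attention, which BERT does not have)."""
    initialized, skipped = {}, []
    for name, tensor in model_params.items():
        src = pretrained_params.get(name)
        if src is not None and len(src) == len(tensor):
            initialized[name] = list(src)  # take the pre-trained weights
        else:
            initialized[name] = tensor     # keep the random initialization
            skipped.append(name)
    return initialized, skipped
```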
3.2 BERT-fuse

We use the model proposed by Zhu et al. (2020) as a feature-based approach (BERT-fuse). This model is based on the Transformer EncDec architecture. It takes an input sentence $X = (x_1, \ldots, x_n)$, where $n$ is its length and $x_i$ is the $i$-th token of $X$. First, BERT encodes $X$ and outputs a representation $B = (b_1, \ldots, b_n)$. Next, the GEC model encodes $X$ and $B$ as inputs. Let $h_i^{l}$ be the $i$-th hidden representation of the $l$-th encoder layer of the GEC model, where $h_i^{0}$ stands for the word embedding of the input token $x_i$. We then calculate $\tilde{h}_i^{l}$ as follows:

$$\tilde{h}_i^{l} = \frac{1}{2}\left(A_h(h_i^{l-1}, H^{l-1}) + A_b(h_i^{l-1}, B)\right)$$

$A_h$ and $A_b$ are attention models over the hidden states $H^{l-1}$ of the GEC encoder and the BERT output $B$, respectively. Each $\tilde{h}_i^{l}$ is then processed by a feed-forward network $F$, which outputs the $l$-th layer $H^{l}$. The decoder's hidden state $s_t^{l}$ is calculated as follows:

$$\hat{s}_t^{l} = A_s(s_t^{l-1}, S_{<t+1}^{l-1}), \qquad s_t^{l} = F\left(\frac{1}{2}\left(A_b(\hat{s}_t^{l}, B) + A_e(\hat{s}_t^{l}, H^{L})\right)\right)$$

Here, $A_s$ represents the self-attention model and $A_e$ the encoder-decoder attention over the final encoder layer $H^{L}$. Finally, $s_t^{L}$ is processed via a linear transformation and a softmax function to predict the $t$-th word $\hat{y}_t$. We also apply the drop-net trick proposed by Zhu et al. (2020) to the outputs of BERT and of the GEC-model encoder.
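A simplified numeric sketch of the fused encoder update, averaging attention over the previous encoder layer with attention over the BERT output, using single-head dot-product attention in NumPy (the real model uses multi-head attention, residual connections, layer normalization, and the drop-net trick, all omitted here for brevity):

```python
import numpy as np

def attn(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w = w / w.sum()          # softmax over key positions
    return w @ V

def fused_encoder_states(H_prev, B):
    """One BERT-fuse encoder step: for each position, average attention
    over the previous GEC encoder layer H_prev with attention over the
    BERT output B (feed-forward sublayer omitted)."""
    return np.stack([
        0.5 * (attn(h, H_prev, H_prev) + attn(h, B, B))
        for h in H_prev
    ])
```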
3.3 BERT-fuse Mask and GED
The advantage of BERT-fuse is that it preserves the information pre-trained on raw corpora; however, the MLM may be adapted neither to the GEC task nor to the task-specific distribution of inputs, because, unlike the data used to train BERT, the input to the GEC model can be an erroneous sentence. To fill the gap between the corpora used to train BERT and GEC models, we either additionally train BERT on GEC corpora (BERT-fuse mask) or fine-tune BERT as a GED model (BERT-fuse GED), and use the result for BERT-fuse. GED is a sequence labeling task that detects grammatically incorrect words in input sentences Rei and Yannakoudakis (2016); Kaneko et al. (2017). Since BERT is also effective for GED Bell et al. (2019); Kaneko and Komachi (2019), it is well suited to fine-tuning that takes grammatical errors into account.
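Token-level GED labels for this fine-tuning can be derived from a parallel GEC corpus by aligning each erroneous sentence with its correction and marking unmatched source tokens as errors. A minimal sketch of such preprocessing (our illustration, not necessarily the paper's pipeline):

```python
import difflib

def ged_labels(source_tokens, target_tokens):
    """Label each source token as correct ('c') or incorrect ('i') by
    aligning the erroneous sentence with its corrected counterpart:
    tokens inside matching blocks are correct, the rest are errors."""
    labels = ["i"] * len(source_tokens)
    matcher = difflib.SequenceMatcher(a=source_tokens, b=target_tokens)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            labels[i] = "c"
    return labels
```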
Table 1: Hyperparameters.

GEC model:
| Model architecture | Transformer (big) |
| Number of epochs | 30 |
| Min learning rate | (not recovered) |
| Loss function | label-smoothed cross-entropy Szegedy et al. (2016) |

GED model:
| Model architecture | BERT-Base (cased) |
| Number of epochs | 3 |
| Max sentence length | 128 |
4 Experimental Setup
4.1 Train and Development Sets
We use the official shared task data of the BEA-2019 workshop (https://www.cl.cam.ac.uk/research/nl/bea2019st/) Bryant et al. (2019) as our training and development sets. Specifically, to train the GEC model, we use the W&I-train Granger (1998); Yannakoudakis et al. (2018), NUCLE Dahlmeier et al. (2013), FCE-train Yannakoudakis et al. (2011), and Lang-8 Mizumoto et al. (2011) datasets, with W&I-dev as the development set. Note that we excluded sentence pairs that were left uncorrected from the training data. To train BERT for the BERT-fuse mask and GED, we use W&I-train, NUCLE, and FCE-train as training data and W&I-dev as development data.
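Excluding uncorrected pairs is a one-line filter over the parallel data; for instance:

```python
def filter_training_pairs(pairs):
    """Drop sentence pairs in which the source was left uncorrected,
    i.e. the source and target sides are identical."""
    return [(src, tgt) for src, tgt in pairs if src != tgt]
```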
4.2 Evaluating GEC Performance
In GEC, it is important to evaluate models on multiple datasets Mita et al. (2019). We therefore used the W&I-test, CoNLL-2014 Ng et al. (2014), FCE-test, and JFLEG Napoles et al. (2017) evaluation sets. We used the ERRANT evaluation metric Felice et al. (2016); Bryant et al. (2017) for W&I-test, the M$^2$ score Dahlmeier and Ng (2012) for the CoNLL-2014 and FCE-test sets, and GLEU Napoles et al. (2015) for JFLEG. All our results (except ensembles) are the average of four distinct trials using four different random seeds.
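Both ERRANT and the M$^2$ scorer report $F_{0.5}$, which weights precision twice as heavily as recall. For reference, the generic F-beta computation:

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; GEC conventionally uses beta = 0.5, which weights
    precision twice as heavily as recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```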
Table 2 (excerpt): Precision (P), recall (R), and F0.5 on BEA-test (ERRANT), CoNLL-14 (M$^2$), and FCE-test (M$^2$), and GLEU on JFLEG.

| Model | BEA-test P / R / F0.5 | CoNLL-14 P / R / F0.5 | FCE-test P / R / F0.5 | JFLEG GLEU |
| Lichtarge et al. (2019) | - / - / - | 65.5 / 37.1 / 56.8 | - / - / - | 61.6 |
| Awasthi et al. (2019) | - / - / - | 66.1 / 43.0 / 59.7 | - / - / - | 60.3 |
| Kiyono et al. (2019) | 65.5 / 59.4 / 64.2 | 67.9 / 44.1 / 61.3 | - / - / - | 59.7 |
| BERT-fuse GED + R2L | 72.3 / 61.4 / 69.8 | 72.6 / 46.4 / 65.2 | 62.8 / 48.8 / 59.4 | 62.0 |
| Lichtarge et al. (2019) | - / - / - | 66.7 / 43.9 / 60.4 | - / - / - | 63.3 |
| Grundkiewicz et al. (2019) | 72.3 / 60.1 / 69.5 | - / - / 64.2 | - / - / - | 61.2 |
| Kiyono et al. (2019) | 74.7 / 56.7 / 70.2 | 72.4 / 46.1 / 65.0 | - / - / - | 61.4 |
Hyperparameter values for the GEC model are listed in Table 1. For the BERT-initialized GEC model, we ran experiments based on the open-source code at https://github.com/facebookresearch/XLM. For the BERT-fuse GEC model, we use the code provided by Zhu et al. (2020) at https://github.com/bert-nmt/bert-nmt. While training the GEC model, the model was evaluated on the development set and saved after every epoch. If the loss did not drop at the end of an epoch, the learning rate was multiplied by 0.7. Training was stopped when the learning rate fell below the minimum learning rate or when training reached the maximum of 30 epochs.
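The stopping rule can be replayed in isolation; a sketch with an illustrative initial learning rate of 5e-4 (the actual initial value is a hyperparameter of the paper's setup, not stated here):

```python
def train_schedule(dev_losses, lr=5e-4, min_lr=1e-9, decay=0.7, max_epochs=30):
    """Replay the schedule: after each epoch, decay the learning rate by
    `decay` if the dev loss did not improve; stop when the rate falls
    below min_lr or max_epochs is reached. Returns (epochs run, final lr)."""
    best = float("inf")
    for epoch, loss in enumerate(dev_losses[:max_epochs], start=1):
        if loss < best:
            best = loss
        else:
            lr *= decay        # no improvement: shrink the learning rate
        if lr < min_lr:
            return epoch, lr   # early stop: rate fell below the floor
    return min(len(dev_losses), max_epochs), lr
```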
Training BERT for the BERT-fuse mask and GED was based on the code from Wolf et al. (2019) at https://github.com/huggingface/transformers. The additional training for the BERT-fuse mask followed Devlin et al. (2019)'s setting. Hyperparameter values for the GED model are listed in Table 1. We used the BERT-Base cased model (https://github.com/google-research/bert) for consistency across experiments. The model was evaluated on the development set.
We also performed experiments in which the BERT-fuse, BERT-fuse mask, and BERT-fuse GED outputs serve as additional features for a GEC model pre-trained on pseudo-data. This pre-trained model was initialized with the PretLarge+SSE model of Kiyono et al. (2019) (https://github.com/butsugiri/gec-pseudodata). The pseudo-data is generated by probabilistically injecting character errors Lichtarge et al. (2019) into the output of a backtranslation model Xie et al. (2018) that generates grammatically incorrect sentences from grammatically correct ones Kiyono et al. (2019).
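A sketch of probabilistic character-error injection in the spirit of Lichtarge et al. (2019); the edit types and the error rate here are illustrative, not the exact published procedure:

```python
import random
import string

def inject_char_errors(sentence, p=0.003, rng=None):
    """Corrupt each character independently with probability p, choosing
    one of three edits: deletion, insertion, or substitution."""
    rng = rng or random.Random(0)
    out = []
    for ch in sentence:
        if rng.random() < p:
            op = rng.choice(["delete", "insert", "substitute"])
            if op == "delete":
                continue                      # drop the character
            noise = rng.choice(string.ascii_lowercase)
            out.append(noise if op == "substitute" else noise + ch)
        else:
            out.append(ch)
    return "".join(out)
```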
4.5 Right-to-left (R2L) Re-ranking for Ensemble
We describe the R2L re-ranking technique used in our experiments, proposed by Sennrich et al. (2016) and shown to be effective for GEC Grundkiewicz et al. (2019); Kiyono et al. (2019). Standard left-to-right (L2R) models generate the $n$-best hypotheses using scores from the normal ensemble, and right-to-left (R2L) models re-score them. We then re-rank the $n$-best candidates based on the sum of the L2R and R2L scores. We use the generation probability as the re-ranking score and ensemble four L2R models and four R2L models.
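The re-ranking step itself is just an argmax over summed scores; a minimal sketch where each hypothesis carries its L2R and R2L ensemble log-probabilities:

```python
def rerank(hypotheses):
    """Pick the best hypothesis from an n-best list by the sum of the
    L2R and R2L scores; each entry is (text, l2r_score, r2l_score)."""
    return max(hypotheses, key=lambda h: h[1] + h[2])[0]
```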
5 Results

Table 2 shows the experimental results of the GEC models. A Transformer model trained without BERT is denoted "w/o BERT." In the top group of results, we see that using BERT consistently improves the accuracy of our GEC model, and BERT-fuse, BERT-fuse mask, and BERT-fuse GED outperform the BERT-init model in almost all cases. Furthermore, adapting BERT to the GEC corpora before applying BERT-fuse leads to better correction results, and BERT-fuse GED always outperforms BERT-fuse mask, presumably because BERT-fuse GED can explicitly take grammatical errors into account. In the second group, the correction results likewise improve with BERT; in this setting, BERT-fuse GED outperforms the other models in all cases except the FCE-test set, achieving state-of-the-art single-model results on the BEA-2019 and CoNLL-2014 datasets. In the last group, the ensemble model yields high scores on all corpora, improving the state of the art by 0.2 points on CoNLL-2014.
6 Analysis

6.1 Hidden Representation Visualization
We investigate the characteristics of the hidden representations of vanilla BERT (i.e., without any fine-tuning) and of BERT fine-tuned with GED. We visualize the last-layer hidden representations of the same words, chosen depending on their correctness in different contexts, using the two models. The eight target words (the, ",", in, to, of, a, for, is), each of which has been mistaken more than 50 times, were chosen from W&I-dev. We sampled the same number of correctly used cases of each word from the corrected side of W&I-dev.
The resulting plots visualize the hidden representations of BERT and fine-tuned BERT. Vanilla BERT does not distinguish between correct and incorrect clusters: the eight plotted words gather together, and hidden representations of the same word occupy the same region regardless of correctness. In contrast, fine-tuned BERT produces a vector space that places correct and incorrect words on different sides, showing that the hidden representations take grammatical errors into account when fine-tuned on GEC corpora. Moreover, the correct cases divide into eight clusters, implying that BERT's original lexical information is also retained.
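Such plots require projecting the high-dimensional token vectors to 2-D; a PCA sketch in NumPy (PCA is our assumption of a suitable projection, not necessarily the method used in the paper):

```python
import numpy as np

def project_2d(hidden_states):
    """Project token hidden states (num_tokens x dim) to 2-D with PCA,
    computed via SVD of the mean-centered matrix, for scatter-plotting
    correct vs. incorrect usages."""
    X = hidden_states - hidden_states.mean(axis=0)  # center the data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T  # coordinates along the top-2 principal axes
```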
6.2 Performance for Each Error Type
We investigate the correction results for each error type. We use ERRANT Felice et al. (2016); Bryant et al. (2017) to measure the model's $F_{0.5}$ for each error type; ERRANT automatically assigns error types from source and target sentences. We use the single BERT-fuse GED and w/o BERT models without pseudo-data for this investigation.
Table 3 shows the results of the single BERT-fuse GED and w/o BERT models without pseudo-data on most error types, including all of the top-5 most frequent error types in W&I-dev. BERT-fuse GED outperforms w/o BERT on all error types, indicating that using BERT fine-tuned with GED in the EncDec model improves performance independently of the error type.
Table 3 (header): | Error type | BERT-fuse GED | w/o BERT |
7 Conclusion

In this paper, we investigated how to effectively use MLMs for training GEC models. Our results show that BERT-fuse GED was one of the most effective techniques: fine-tuning BERT with GEC corpora before fusing it brought the largest gains. In future work, we will investigate whether BERT-init can be used effectively by applying methods that mitigate catastrophic forgetting.
Acknowledgments

This work was supported by JSPS KAKENHI Grant Numbers 19J14084 and 19H04162. We thank everyone in the Inui and Suzuki Lab at Tohoku University and the Language Information Access Technology Team of RIKEN AIP. We thank the anonymous reviewers for their valuable comments.
References

- The AIP-Tohoku System at the BEA-2019 Shared Task. In BEA, Florence, Italy, pp. 176–182.
- Parallel Iterative Edit Models for Local Sequence Transduction. In EMNLP-IJCNLP, Hong Kong, China, pp. 4259–4269.
- Context is Key: Grammatical Error Detection with Contextual Word Representations. In BEA, Florence, Italy, pp. 103–115.
- The BEA-2019 Shared Task on Grammatical Error Correction. In BEA, Florence, Italy, pp. 52–75.
- Automatic Annotation and Evaluation of Error Types for Grammatical Error Correction. In ACL, Vancouver, Canada, pp. 793–805.
- Cross-Sentence Grammatical Error Correction. In ACL, Florence, Italy.
- Building a Large Annotated Corpus of Learner English: The NUS Corpus of Learner English. In BEA, Atlanta, Georgia, pp. 22–31.
- Better Evaluation for Grammatical Error Correction. In NAACL, Montréal, Canada, pp. 568–572.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, Minneapolis, Minnesota, pp. 4171–4186.
- Automatic Extraction of Learner Errors in ESL Sentences Using Linguistically Enhanced Alignments. In COLING, Osaka, Japan, pp. 825–835.
- Developing an Automated Writing Placement System for ESL Learners. In LEC, pp. 3–18.
- Neural Grammatical Error Correction Systems with Unsupervised Pre-training on Synthetic Data. In BEA, Florence, Italy, pp. 252–263.
- TMU Transformer System Using BERT for Re-ranking at BEA 2019 Grammatical Error Correction on Restricted Track. In BEA, Florence, Italy, pp. 207–212.
- Multi-Head Multi-Layer Attention to Deep Language Representations for Grammatical Error Detection. Computación y Sistemas 23.
- Grammatical Error Detection Using Error- and Grammaticality-Specific Word Embeddings. In IJCNLP, Taipei, Taiwan, pp. 40–48.
- An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction. In EMNLP-IJCNLP, Hong Kong, China, pp. 1236–1242.
- Cross-lingual Language Model Pretraining. arXiv.
- Corpora Generation for Grammatical Error Correction. In NAACL, Minneapolis, Minnesota, pp. 3291–3301.
- Fine-tune BERT for Extractive Summarization. arXiv.
- Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem.
- Cross-Corpora Evaluation and Analysis of Grammatical Error Correction Models — Is Single-Corpus Evaluation Enough?. In NAACL, Minneapolis, Minnesota, pp. 1309–1314.
- Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners. In IJCNLP, Chiang Mai, Thailand, pp. 147–155.
- Ground Truth for Grammatical Error Correction Metrics. In NAACL, Beijing, China, pp. 588–593.
- JFLEG: A Fluency Corpus and Benchmark for Grammatical Error Correction. In EACL, Valencia, Spain, pp. 229–234.
- The CoNLL-2014 Shared Task on Grammatical Error Correction. In CoNLL, Baltimore, Maryland, pp. 1–14.
- Pre-trained Models for Natural Language Processing: A Survey. arXiv.
- Compositional Sequence Labeling Models for Error Detection in Learner Writing. In ACL, Berlin, Germany, pp. 1181–1191.
- Leveraging Pre-trained Checkpoints for Sequence Generation Tasks. arXiv.
- Edinburgh Neural Machine Translation Systems for WMT 16. In WMT, Berlin, Germany, pp. 371–376.
- Rethinking the Inception Architecture for Computer Vision. In CVPR.
- Attention Is All You Need. In NeurIPS, pp. 5998–6008.
- Acquiring Knowledge from Pre-trained Model to Neural Machine Translation. arXiv.
- HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv.
- Noising and Denoising Natural Language: Diverse Backtranslation for Grammar Correction. In NAACL, New Orleans, Louisiana, pp. 619–628.
- Developing an Automated Writing Placement System for ESL Learners. In Applied Measurement in Education, pp. 251–267.
- A New Dataset and Method for Automatically Grading ESOL Texts. In NAACL, Portland, Oregon, USA, pp. 180–189.
- Pretraining-Based Natural Language Generation for Text Summarization. In CoNLL.
- Improving Grammatical Error Correction via Pre-Training a Copy-Augmented Architecture with Unlabeled Data. In NAACL, Minneapolis, Minnesota, pp. 156–165.
- Incorporating BERT into Neural Machine Translation. In ICLR.