Neural sequence-to-sequence (seq2seq) approaches have proven to be successful in grammatical error correction (GEC). Based on the seq2seq framework, we propose a novel fluency boost learning and inference mechanism. Fluency boost learning generates diverse error-corrected sentence pairs during training, enabling the error correction model to learn how to improve a sentence's fluency from more instances, while fluency boost inference allows the model to correct a sentence incrementally with multiple inference steps. Combining fluency boost learning and inference with convolutional seq2seq models, our approach achieves state-of-the-art performance: 75.72 ($F_{0.5}$) on the CoNLL-2014 10-annotation dataset and 62.42 (GLEU) on the JFLEG test set, becoming the first GEC system that reaches human-level performance (72.58 for CoNLL and 62.37 for JFLEG) on both benchmarks.
Sequence-to-sequence (seq2seq) models (Cho et al., 2014; Sutskever et al., 2014) for grammatical error correction (GEC) have drawn growing attention (Yuan & Briscoe, 2016; Xie et al., 2016; Ji et al., 2017; Schmaltz et al., 2017; Sakaguchi et al., 2017; Chollampatt & Ng, 2018; Junczys-Dowmunt et al., 2018) in recent years. However, most seq2seq models for GEC have two flaws. First, the seq2seq models are trained with only a limited number of error-corrected sentence pairs like the one in Figure 1(a). Limited by the size of the training data, models with millions of parameters may not generalize well. Thus, it is common for the models to fail to correct a sentence perfectly even if the sentence is only slightly different from a training instance, as illustrated by Figure 1(b). Second, the seq2seq models usually cannot perfectly correct a sentence with many grammatical errors through single-round seq2seq inference, as shown in Figure 1(b) and 1(c), because some errors in a sentence may make the context strange, which makes it harder for the models to correct the other errors.
To address the above-mentioned limitations in model learning and inference, we propose a novel fluency boost learning and inference mechanism, illustrated in Figure 2.
For fluency boost learning, a seq2seq model is not only trained with the original error-corrected sentence pairs; during training it also generates less fluent sentences (e.g., from its n-best outputs) and pairs them with their correct sentences to establish new error-corrected sentence pairs, as long as the generated sentences' fluency is below that of their correct sentences, as Figure 2(a) shows. (A sentence's fluency score is defined to be inversely proportional to the sentence's cross entropy, as in Eq (3).) Specifically, we call the generated error-corrected sentence pairs fluency boost sentence pairs because the sentence on the target side always improves fluency over that on the source side. The fluency boost sentence pairs generated during training will be used as additional training instances during subsequent training epochs, allowing the error correction model to see more grammatically incorrect sentences during training and accordingly improving its generalization ability.
For model inference, the fluency boost inference mechanism allows the model to correct a sentence incrementally with multi-round inference as long as the proposed edits can boost the sentence's fluency, as Figure 2(b) shows. For a sentence with multiple grammatical errors, some of the errors will be corrected first. The corrected parts will make the context clearer, which may help the model correct the remaining errors. Moreover, based on the special characteristic of this task that the output prediction can be repeatedly edited, and on the basic fluency boost inference idea, we further propose a round-way correction approach that uses two seq2seq models whose decoding orders are left-to-right and right-to-left respectively. For round-way correction, a sentence will be corrected successively by the right-to-left and left-to-right seq2seq models (for convenience, we call the seq2seq model with a right-to-left decoder the right-to-left seq2seq model and the seq2seq model with a left-to-right decoder the left-to-right seq2seq model). Since the left-to-right and right-to-left decoders decode a sequence with different contexts, each has unique advantages for specific error types. Round-way correction can fully exploit their strengths and make them complement each other, which results in a significant improvement in recall.
Experiments show that, combining fluency boost learning and inference with convolutional seq2seq models, our best GEC system achieves 75.72 ($F_{0.5}$) on the CoNLL-2014 10-annotation dataset and 62.42 (GLEU) on the JFLEG test set, becoming the first system to reach human-level performance on both GEC benchmarks. Our systems' outputs for the CoNLL-2014 and JFLEG test sets are available at https://github.com/getao/human-performance-gec.
As in neural machine translation (NMT), a typical neural GEC approach uses an encoder-decoder seq2seq model (Sutskever et al., 2014; Cho et al., 2014) with an attention mechanism (Bahdanau et al., 2014) to edit a raw sentence into the grammatically correct sentence it should be, as Figure 1(a) shows.
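Given a raw sentence $x^r = (x^r_1, \cdots, x^r_M)$ and its corrected sentence $x^c = (x^c_1, \cdots, x^c_N)$, in which $x^r_M$ and $x^c_N$ are the $M$-th and $N$-th words of sentence $x^r$ and $x^c$ respectively, the error correction seq2seq model learns a probabilistic mapping $P(x^c \mid x^r)$ from error-corrected sentence pairs through maximum likelihood estimation (MLE), which learns model parameters $\Theta_{crt}$ to maximize the following equation:
$$\Theta^*_{crt} = \arg\max_{\Theta_{crt}} \sum_{(x^r, x^c) \in S^*} \log P(x^c \mid x^r; \Theta_{crt}) \tag{1}$$
where $S^*$ denotes the set of error-corrected sentence pairs.
For model inference, an output sequence $x^o = (x^o_1, \cdots, x^o_L)$ is selected through beam search, which maximizes the following equation:
$$P(x^o \mid x^r) = \prod_{i=1}^{L} P(x^o_i \mid x^r, x^o_{<i}; \Theta_{crt}) \tag{2}$$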
Conventional seq2seq models for GEC learn model parameters only from original error-corrected sentence pairs. However, such error-corrected sentence pairs are not available in sufficient quantity. As a result, many neural GEC models do not generalize well.
Fortunately, neural GEC is different from NMT. The goal of neural GEC is to improve a sentence's fluency without changing its original meaning; thus, any sentence pair that satisfies this condition (we call it the fluency boost condition) can be used as a training instance. Fluency of a sentence in this work refers to how likely the sentence is to have been written by a native speaker. In other words, if a sentence is very likely to have been written by a native speaker, it should be regarded as highly fluent.
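In this work, we define $f(x)$ as the fluency score of a sentence $x$:
$$f(x) = \frac{1}{1 + H(x)} \tag{3}$$
$$H(x) = -\frac{\sum_{i=1}^{|x|} \log P(x_i \mid x_{<i})}{|x|} \tag{4}$$
$P(x_i \mid x_{<i})$ is the probability of $x_i$ given context $x_{<i}$, computed by a language model, and $|x|$ is the length of sentence $x$. $H(x)$ is actually the cross entropy of the sentence $x$, whose range is $[0, +\infty)$. Accordingly, the range of $f(x)$ is $(0, 1]$.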
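The fluency score can be sketched in a few lines. The following is a minimal illustration, assuming the score takes the form $f(x) = 1/(1 + H(x))$ with $H(x)$ the per-token cross entropy; the function names and the idea of passing in precomputed per-token probabilities are assumptions made for the example, not part of the described system, which obtains these probabilities from a language model.

```python
import math
from typing import Sequence


def cross_entropy(token_probs: Sequence[float]) -> float:
    """Average negative log-probability H(x) over the tokens of a sentence."""
    if not token_probs:
        raise ValueError("sentence must contain at least one token")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def fluency(token_probs: Sequence[float]) -> float:
    """Fluency score f(x) = 1 / (1 + H(x)), which lies in (0, 1]."""
    return 1.0 / (1.0 + cross_entropy(token_probs))


# Hypothetical per-token language model probabilities for two sentences.
likely_sentence = [0.9, 0.8, 0.9]
unlikely_sentence = [0.9, 0.05, 0.9]
more_fluent = fluency(likely_sentence) > fluency(unlikely_sentence)
```

A sentence whose tokens all have probability 1 has zero cross entropy and therefore the maximum score of 1, while a single improbable token pulls the score down.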
The core idea of fluency boost learning is to generate fluency boost sentence pairs that satisfy the fluency boost condition during training, as Figure 2(a) illustrates, so that these pairs can further help model learning.
In this section, we present three fluency boost learning strategies: back-boost, self-boost, and dual-boost that generate fluency boost sentence pairs in different ways, as illustrated in Figure 3.
Back-boost learning borrows the idea from back translation (Sennrich et al., 2016) in NMT, referring to training a backward model (we call it error generation model, as opposed to error correction model) that is used to convert a fluent sentence to a less fluent sentence with errors. Since the less fluent sentences are generated by the error generation seq2seq model trained with error-corrected data, they usually do not change the original sentence’s meaning; thus, they can be paired with their correct sentences, establishing fluency boost sentence pairs that can be used as training instances for error correction models, as Figure 3(a) shows.
Specifically, we first train a seq2seq error generation model $\Theta_{gen}$ with $\widetilde{S}^*$, which is identical to $S^*$ except that the source sentence and the target sentence are interchanged. Then, we use the model $\Theta_{gen}$ to predict $n$-best outputs $x^{o_1}, \cdots, x^{o_n}$ given a correct sentence $x^c$. Given the fluency boost condition, we compare the fluency of each output $x^{o_k}$ (where $1 \le k \le n$) to that of its correct sentence $x^c$. If an output sentence's fluency score is much lower than that of its correct sentence, we call it a disfluency candidate of $x^c$.
To formalize this process, we first define $Y_n(x; \Theta)$ to denote the $n$-best outputs predicted by model $\Theta$ given the input $x$. Then, disfluency candidates of a correct sentence $x^c$ can be derived:
$$D_{back}(x^c) = \left\{ x^{o_k} \;\middle|\; x^{o_k} \in Y_n(x^c; \Theta_{gen}) \wedge \frac{f(x^c)}{f(x^{o_k})} \ge \sigma \right\} \tag{5}$$
where $D_{back}(x^c)$ denotes the disfluency candidate set for $x^c$ in back-boost learning. $\sigma$ is a threshold to determine if $x^{o_k}$ is less fluent than $x^c$, and it should be slightly larger than 1.0, which helps filter out sentence pairs with unnecessary edits (e.g., I like this book. → I like the book.). We set $\sigma = 1.05$ since the corrected sentence in our training data improves its corresponding raw sentence's fluency by about 5% on average.
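The candidate filter can be sketched as follows. This is a minimal illustration, assuming the filter keeps an n-best output whenever the ratio of the correct sentence's fluency to the output's fluency reaches a threshold `sigma`; the function name, the toy fluency table, and the threshold value used in the example are all hypothetical stand-ins for a real language model and a tuned setting.

```python
from typing import Callable, List, Sequence


def disfluency_candidates(
    correct: str,
    n_best: Sequence[str],
    fluency: Callable[[str], float],
    sigma: float,
) -> List[str]:
    """Keep the n-best outputs that are sufficiently less fluent than `correct`."""
    f_correct = fluency(correct)
    return [out for out in n_best if f_correct / fluency(out) >= sigma]


# Toy fluency scores standing in for a language model (hypothetical values).
toy_scores = {
    "I like this book .": 0.50,
    "I like the book .": 0.49,    # nearly as fluent: an unnecessary edit
    "I likes this book .": 0.40,
    "I like this books .": 0.42,
}
candidates = disfluency_candidates(
    "I like this book .",
    ["I like the book .", "I likes this book .", "I like this books ."],
    toy_scores.__getitem__,
    sigma=1.05,  # illustrative threshold slightly above 1
)
```

With these toy scores, the near-equivalent rewrite is filtered out because its fluency ratio stays below the threshold, while the two genuinely worse outputs are kept.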
In the subsequent training epochs, the error correction model will not only learn from the original error-corrected sentence pairs $(x^r, x^c)$, but also learn from fluency boost sentence pairs $(x^{o_k}, x^c)$, where $x^{o_k}$ is a sample of $D_{back}(x^c)$.
We summarize this process in Algorithm 1, where $S^*$ is the set of original error-corrected sentence pairs, and $S$ can be tentatively considered identical to $S^*$ when there is no additional native data to help model training (see Section 3.4). Note that we constrain the size of the subset $S_t$ sampled from $S$ in each epoch not to exceed $|S^*|$ (the 7th line in Algorithm 1), to prevent too many fluency boost pairs from overwhelming the effects of the original error-corrected pairs on model learning.
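The per-epoch data assembly described above can be sketched as follows. This is a simplified illustration rather than the algorithm itself: it assumes disfluency candidates have already been computed and stored per correct sentence, and the names `build_epoch_pairs`, `pool_pairs`, and `candidates` are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def build_epoch_pairs(
    original_pairs: Sequence[Pair],
    pool_pairs: Sequence[Pair],
    candidates: Dict[str, List[str]],
    rng: random.Random,
) -> List[Pair]:
    """Return the original pairs plus at most len(original_pairs) boost pairs."""
    # Cap the sample so boost pairs never outnumber the original pairs.
    sample_size = min(len(original_pairs), len(pool_pairs))
    sampled = rng.sample(list(pool_pairs), sample_size)
    boost_pairs: List[Pair] = []
    for _, correct in sampled:
        options = candidates.get(correct)
        if options:  # skip correct sentences with no disfluency candidate
            boost_pairs.append((rng.choice(options), correct))
    return list(original_pairs) + boost_pairs


original = [("She go home .", "She goes home ."), ("I has a pen .", "I have a pen .")]
candidate_sets = {"She goes home .": ["She goes to home ."], "I have a pen .": []}
epoch_pairs = build_epoch_pairs(original, original, candidate_sets, random.Random(0))
```

Capping the sample at the number of original pairs keeps the generated pairs from dominating an epoch, which is the concern the size constraint addresses.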
In contrast to back-boost learning, whose core idea comes from NMT, self-boost learning is original and specially devised for neural GEC. The idea of self-boost learning is illustrated by Figure 3(b) and was already briefly introduced in Section 1 and Figure 2(a). Unlike back-boost learning, in which an error generation seq2seq model is trained to generate disfluency candidates, self-boost learning allows the error correction model to generate the candidates by itself. Since the disfluency candidates generated by the error correction seq2seq model trained with error-corrected data rarely change the input sentence's meaning, they can be used to establish fluency boost sentence pairs.
For self-boost learning, given an error-corrected pair $(x^r, x^c)$, an error correction model $\Theta_{crt}$ first predicts $n$-best outputs $x^{o_1}, \cdots, x^{o_n}$ for the raw sentence $x^r$. Among the $n$-best outputs, any output that is not identical to $x^c$ can be considered an error prediction. Instead of treating the error predictions as useless, self-boost learning fully exploits them. Specifically, if an error prediction $x^{o_k}$ is much less fluent than its correct sentence $x^c$, it will be added to $x^c$'s disfluency candidate set $D_{self}(x^c)$, as Eq (6) shows:
$$D_{self}(x^c) = D_{self}(x^c) \cup \left\{ x^{o_k} \;\middle|\; x^{o_k} \in Y_n(x^r; \Theta_{crt}) \wedge \frac{f(x^c)}{f(x^{o_k})} \ge \sigma \right\} \tag{6}$$
In contrast to back-boost learning, self-boost generates disfluency candidates from a different perspective: by editing the raw sentence $x^r$ rather than the correct sentence $x^c$. It is also noteworthy that $D_{self}(x^c)$ is incrementally expanded because the error correction model $\Theta_{crt}$ is dynamically updated, as shown in Algorithm 2.
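The incremental expansion of the self-boost candidate set can be sketched as follows. This is a minimal illustration under stated assumptions: the n-best lists are passed in directly instead of being decoded by a real error correction model, the fluency scores are a toy lookup table, and the names `expand_self_boost` and `store` are hypothetical.

```python
from typing import Callable, Dict, Sequence, Set


def expand_self_boost(
    store: Dict[str, Set[str]],
    correct: str,
    n_best: Sequence[str],
    fluency: Callable[[str], float],
    sigma: float,
) -> Set[str]:
    """Add sufficiently disfluent error predictions to the set kept for `correct`."""
    pool = store.setdefault(correct, set())
    f_correct = fluency(correct)
    for out in n_best:
        # An output identical to the correct sentence is not an error prediction.
        if out != correct and f_correct / fluency(out) >= sigma:
            pool.add(out)
    return pool


toy_scores = {
    "She goes home .": 0.50,
    "She go home .": 0.40,
    "She goes homes .": 0.41,
    "She goes home": 0.49,
}
store: Dict[str, Set[str]] = {}
# Two calls mimic two epochs in which the updated model yields different n-best lists.
expand_self_boost(store, "She goes home .", ["She goes home .", "She go home ."],
                  toy_scores.__getitem__, sigma=1.05)
expand_self_boost(store, "She goes home .", ["She goes homes .", "She goes home"],
                  toy_scores.__getitem__, sigma=1.05)
```

Because the store persists across calls, the candidate set for a correct sentence only grows as the model changes between epochs.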
As introduced above, back- and self-boost learning generate disfluency candidates from different perspectives to create more fluency boost sentence pairs for training the error correction model. Intuitively, the more diverse the generated disfluency candidates, the more helpful they are for training an error correction model. Inspired by He et al. (2016) and Zhang et al. (2018), we propose a dual-boost learning strategy, combining both back- and self-boost's perspectives to generate disfluency candidates.
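As Figure 3(c) shows, disfluency candidates in dual-boost learning are from both the error generation model $\Theta_{gen}$ and the error correction model $\Theta_{crt}$:
$$D_{dual}(x^c) = D_{dual}(x^c) \cup \left\{ x^{o_k} \;\middle|\; x^{o_k} \in Y_n(x^r; \Theta_{crt}) \cup Y_n(x^c; \Theta_{gen}) \wedge \frac{f(x^c)}{f(x^{o_k})} \ge \sigma \right\} \tag{7}$$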
Moreover, the error correction model and the error generation model are dual and both of them are dynamically updated, so that each improves the other: the disfluency candidates produced by the error generation model can benefit training the error correction model, while the disfluency candidates created by the error correction model can be used as training data for the error generation model. We summarize this learning approach in Algorithm 3.
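The dual relationship can be illustrated by how one merged candidate set yields training pairs for both models. The sketch below is a simplification made for illustration: it assumes the candidates from both models have already passed the fluency filter, and it feeds the same merged set to both models; the function name and arguments are hypothetical.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def dual_boost_pairs(
    correct: str,
    from_correction_model: Iterable[str],
    from_generation_model: Iterable[str],
) -> Tuple[List[Pair], List[Pair]]:
    """Merge candidates from both models and build pairs for each direction."""
    merged = set(from_correction_model) | set(from_generation_model)
    merged.discard(correct)  # a pair must actually change the sentence
    # The error correction model learns disfluent -> correct.
    correction_pairs = sorted((cand, correct) for cand in merged)
    # The error generation model learns correct -> disfluent.
    generation_pairs = sorted((correct, cand) for cand in merged)
    return correction_pairs, generation_pairs


crt_pairs, gen_pairs = dual_boost_pairs(
    "She goes home .",
    from_correction_model=["She go home ."],
    from_generation_model=["She goes to home .", "She go home ."],
)
```

Each candidate appears once per direction, so a sentence proposed by both models is not double-counted.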
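Our proposed fluency boost learning strategies can be easily extended to utilize massive native text data, which has proved to be useful for GEC. When additional native data is available, $S$ is no longer identical to $S^*$ but becomes
$$S = S^* \cup C$$
where $C = \{(x^c, x^c)\}$ denotes the set of self-copied sentence pairs from native data.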
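Building self-copied pairs from native text is straightforward to sketch. The snippet below is a minimal illustration in which each native sentence is paired with itself and appended to the original pairs; the function name and the toy sentences are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def extend_with_native(
    original_pairs: Sequence[Pair],
    native_sentences: Iterable[str],
) -> List[Pair]:
    """Append a self-copied pair (x, x) for every native sentence."""
    return list(original_pairs) + [(s, s) for s in native_sentences]


pool = extend_with_native(
    [("She go home .", "She goes home .")],
    ["The park is open today .", "He bought a book ."],
)
```

The self-copied pairs are not useful training instances by themselves; their target sides serve as correct sentences for which disfluency candidates can then be generated.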
As we discuss in Section 1, some sentences with multiple grammatical errors usually cannot be perfectly corrected through normal seq2seq inference, which makes only a single round of inference. Fortunately, neural GEC is different from NMT: its source and target languages are the same. This characteristic allows us to edit a sentence more than once through multi-round model inference, which motivates our fluency boost inference. As Figure 2(b) shows, fluency boost inference allows a sentence to be incrementally edited through multi-round seq2seq inference as long as the sentence's fluency can be improved. Specifically, an error correction seq2seq model first takes a raw sentence $x^r$ as an input and outputs a hypothesis $x^{o_1}$. Instead of regarding $x^{o_1}$ as the final prediction, fluency boost inference will then take $x^{o_1}$ as the input to generate the next output $x^{o_2}$. The process will not terminate unless $x^{o_t}$ does not improve $x^{o_{t-1}}$ in terms of fluency.
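The multi-round procedure can be sketched as a simple loop. This is a minimal illustration, assuming a `correct` callable that wraps one round of seq2seq inference and a `fluency` callable that scores a sentence; both are replaced by toy stand-ins here, and the `max_rounds` cap is an added safety assumption rather than part of the described method.

```python
from typing import Callable


def fluency_boost_inference(
    sentence: str,
    correct: Callable[[str], str],
    fluency: Callable[[str], float],
    max_rounds: int = 10,
) -> str:
    """Re-apply the corrector while each round strictly improves fluency."""
    current, current_score = sentence, fluency(sentence)
    for _ in range(max_rounds):
        hypothesis = correct(current)
        hypothesis_score = fluency(hypothesis)
        if hypothesis_score <= current_score:
            break  # no fluency gain: keep the previous sentence
        current, current_score = hypothesis, hypothesis_score
    return current


# Toy corrector that fixes one error per round (hypothetical behaviour).
rounds = {
    "He go to school and buy book .": "He went to school and buy book .",
    "He went to school and buy book .": "He went to school and bought a book .",
}
scores = {
    "He go to school and buy book .": 0.30,
    "He went to school and buy book .": 0.40,
    "He went to school and bought a book .": 0.55,
}
result = fluency_boost_inference(
    "He go to school and buy book .",
    lambda s: rounds.get(s, s),
    scores.__getitem__,
)
```

In the toy run, the first round fixes one error, the second round fixes the rest, and the third round produces no change, which ends the loop.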
Based on the idea of multi-round correction, we further propose an advanced fluency boost inference approach: round-way error correction. Instead of progressively correcting a sentence with the same seq2seq model as introduced in Section 4.1, round-way correction corrects a sentence through a right-to-left seq2seq model and a left-to-right seq2seq model successively, as shown in Figure 4.
The motivation of round-way error correction is straightforward. Decoders with different decoding orders decode word sequences with different contexts, which gives each of them unique advantages for specific error types. For the example in Figure 4, the missing-article error (i.e., park → the park) is more likely to be corrected by the right-to-left seq2seq model than by the left-to-right one, because whether to add an article depends on the noun park, which has already been seen by the right-to-left model when it makes the decision. In contrast, the left-to-right model might be better at dealing with subject-verb agreement errors (e.g., come → comes in Figure 4) because the keyword that decides the verb form is its subject She, which is at the beginning of the sentence.
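Round-way correction can be sketched as two successive passes. The snippet below is a minimal illustration, assuming each pass is accepted only if it improves fluency, in line with the basic fluency boost inference idea; the two toy correctors, the toy fluency function, and the example sentence are hypothetical stand-ins for the right-to-left and left-to-right seq2seq models and the language model.

```python
from typing import Callable


def round_way_correction(
    sentence: str,
    r2l_correct: Callable[[str], str],
    l2r_correct: Callable[[str], str],
    fluency: Callable[[str], float],
) -> str:
    """Apply the right-to-left model, then the left-to-right model."""
    current = sentence
    for correct in (r2l_correct, l2r_correct):
        hypothesis = correct(current)
        if fluency(hypothesis) > fluency(current):
            current = hypothesis  # accept the edit only if fluency improves
    return current


def toy_r2l(s: str) -> str:
    # Stand-in for a right-to-left model that restores a missing article.
    return s.replace("to park", "to the park")


def toy_l2r(s: str) -> str:
    # Stand-in for a left-to-right model that fixes subject-verb agreement.
    return s.replace("She come ", "She comes ")


def toy_fluency(s: str) -> float:
    # Each fixed error raises the toy score by 0.1.
    return 0.3 + 0.1 * ("the park" in s) + 0.1 * ("comes" in s)


corrected = round_way_correction("She come to park .", toy_r2l, toy_l2r, toy_fluency)
```

Each toy model fixes only the error type it is suited to, so the final sentence is fully corrected only after both passes.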
Following previous studies (Ji et al., 2017), we use the public Lang-8 Corpus (Mizumoto et al., 2011; Tajiri et al., 2012), the Cambridge Learner Corpus (CLC) (Nicholls, 2003) and the NUS Corpus of Learner English (NUCLE) (Dahlmeier et al., 2013) as our original error-corrected training data. Table 1 shows the statistics of the datasets. In addition, we collect 2,865,639 non-public error-corrected sentence pairs from Lang-8.com. The native data we use for fluency boost learning is English Wikipedia, which contains 61,677,453 sentences.
We use the CoNLL-2014 shared task test set and the JFLEG test set as our evaluation datasets. The CoNLL-2014 test set contains 1,312 sentences, while the JFLEG test set has 747 sentences. Consistent with the official evaluation metrics, we use MaxMatch ($M^2$) $F_{0.5}$ (Dahlmeier & Ng, 2012a) for CoNLL-2014 and GLEU (Napoles et al., 2015) for JFLEG evaluation. Notably, the original annotations for the CoNLL-2014 dataset are from 2 human annotators; they were later enriched by Bryant & Ng (2015), whose version contains 10 human expert annotations for each test sentence. We evaluate systems' performance using both annotation settings for the CoNLL dataset. To distinguish between these two annotation settings, we use CoNLL-2014 to denote the original annotations, and CoNLL-10 to denote the 10-human annotations. Following previous studies, we use the CoNLL-2013 test set and the JFLEG dev set as our development sets for the CoNLL-2014 and JFLEG test sets respectively.
We use 7-layer convolutional seq2seq models (Gehring et al., 2017; https://github.com/pytorch/fairseq) as our error correction and error generation models, which have proven to be effective for GEC (Chollampatt & Ng, 2018). Following Chollampatt & Ng (2018), we set the dimensionality of word embeddings in both encoders and decoders to 500, the hidden size of encoders and decoders to 1,024 and the convolution window width to 3. The vocabularies of the source and target sides are each the 30K most frequent BPE tokens. We train the seq2seq models using the Nesterov Accelerated Gradient (Sutskever et al., 2013) optimizer with a momentum value of 0.99. The initial learning rate is set to 0.25 and is reduced by an order of magnitude if the validation perplexity stops improving. During training, we allow each batch to have at most 3,000 tokens per GPU and set the dropout rate to 0.2. We terminate the training process when the learning rate falls below $10^{-4}$. Following Chollampatt & Ng (2018) and Grundkiewicz & Junczys-Dowmunt (2018), we train 4 models with different random initializations for ensemble decoding.
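The learning rate schedule described above can be sketched as a small simulation. This is an illustration only, not the training code: it assumes the rate is multiplied by 0.1 after any epoch whose validation perplexity fails to improve on the best value so far, and the function name, the `min_lr` argument, and the perplexity values are hypothetical.

```python
from typing import List, Sequence


def simulate_schedule(
    val_perplexities: Sequence[float],
    initial_lr: float,
    min_lr: float,
    factor: float = 0.1,
) -> List[float]:
    """Return the learning rate used in each epoch until training stops."""
    lr = initial_lr
    best = float("inf")
    used: List[float] = []
    for perplexity in val_perplexities:
        if lr < min_lr:
            break  # stopping criterion: learning rate fell below the floor
        used.append(lr)
        if perplexity < best:
            best = perplexity
        else:
            lr *= factor  # reduce by an order of magnitude on a plateau
    return used


# Hypothetical validation perplexities, one per epoch.
history = simulate_schedule(
    [10.0, 9.0, 9.0, 8.0, 8.0, 8.0, 8.0, 8.0],
    initial_lr=0.25,
    min_lr=1e-4,  # illustrative floor
)
```

With these toy perplexities the rate stays at 0.25 while validation improves, then shrinks tenfold on each plateau until it drops below the floor and the loop stops.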
For fluency boost learning, we adopt dual-boost learning introduced in Section 3.3 and use the English Wikipedia data as our native data (Section 3.4). Disfluency candidates are generated from 10-best outputs. For fluency boost inference, we use the round-way correction approach introduced in Section 4.2. The architecture of the right-to-left seq2seq model in round-way correction is the same as that of the left-to-right one, except that they decode sentences in opposite directions. In cases other than round-way correction, we use left-to-right seq2seq models as our default error correction models. For single-round inference, we follow Chollampatt & Ng (2018) to generate 12-best predictions and choose the best sentence after re-ranking with edit operation and language model scores. The language model is the 5-gram language model trained on Common Crawl released by Junczys-Dowmunt & Grundkiewicz (2016), which is also used for computing the fluency score in Eq (3).
Like most of the systems evaluated on JFLEG (Sakaguchi et al., 2017; Chollampatt & Ng, 2018; Grundkiewicz & Junczys-Dowmunt, 2018), which use an additional spell checker to resolve spelling errors, we use a public spell checker (https://azure.microsoft.com/en-us/services/cognitive-services/spell-check/) to resolve spelling errors in JFLEG as preprocessing.
|System||CoNLL-2014 ($F_{0.5}$)||CoNLL-10 ($F_{0.5}$)||JFLEG (GLEU)|
|Base convolutional seq2seq||57.95||73.19||60.87|
|Base + FB learning||61.34||76.88||61.41|
|Base + FB learning and inference||60.00||75.72||62.42|
We compare our systems to the following well-known GEC systems (in this report, we do not present a detailed comparison and analysis of different fluency boost learning and inference methods, which can be found in Ge et al. (2018)):
NUS14, NUS16, NUS17 and NUS18: The first three GEC systems (Susanto et al., 2014; Chollampatt et al., 2016a; Chollampatt & Ng, 2017) are SMT-based GEC systems that are combined with other techniques (e.g., classifiers). The last one (Chollampatt & Ng, 2018) uses convolutional seq2seq models for grammatical error correction.
SMT-NMT hybrid: the state-of-the-art GEC system (Grundkiewicz & Junczys-Dowmunt, 2018) that is based on an SMT-NMT hybrid approach.
Table 2 shows the results of GEC systems on the CoNLL and JFLEG datasets. (A result marked with "-" means that the system's result on the corresponding dataset or setting is not reported by the original papers or other literature and that the system outputs are not publicly available.) Our base convolutional seq2seq model outperforms most previous GEC systems owing to the larger size of the training data we use. Fluency boost learning further improves the base convolutional seq2seq model: it achieves an $F_{0.5}$ of 61.34 on CoNLL-2014 and 76.88 on CoNLL-10, and a GLEU score of 61.41 on the JFLEG test set. When we further add fluency boost inference, the system's performance on the JFLEG test set improves to 62.42 GLEU, while its scores on the CoNLL benchmarks drop.
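|System||CoNLL-2014 P||CoNLL-2014 R||CoNLL-2014 $F_{0.5}$||CoNLL-10 P||CoNLL-10 R||CoNLL-10 $F_{0.5}$||CoNLL-10 (SvH) $F_{0.5}$||JFLEG GLEU|
|Base convolutional seq2seq||72.52||32.13||57.95||86.65||45.14||73.19||72.28||60.87|
|Base + FB learning||74.12||36.30||61.34||88.56||50.31||76.88||75.93||61.41|
|Base + FB learning and inference||68.45||40.18||60.00||84.71||53.15||75.72||74.84||62.42|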
We look into the results in Table 3. Fluency boost learning improves the base convolutional seq2seq model in all aspects (i.e., precision, recall, $F_{0.5}$ and GLEU), demonstrating that fluency boost learning is actually helpful for training a seq2seq model for GEC. Adding fluency boost inference improves recall (from 36.30 to 40.18 on CoNLL-2014 and from 50.31 to 53.15 on CoNLL-10) at the expense of a drop in precision (from 74.12 to 68.45 on CoNLL-2014 and from 88.56 to 84.71 on CoNLL-10). Since $F_{0.5}$ weighs precision twice as much as recall, adding fluency boost inference leads to a drop in $F_{0.5}$ on the CoNLL dataset. In contrast, for JFLEG, fluency boost inference improves the GLEU score from 61.41 to 62.42, demonstrating its effectiveness for improving sentences' fluency.
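The trade-off can be checked against the reported numbers with the standard $F_\beta$ formula, $F_\beta = (1 + \beta^2) P R / (\beta^2 P + R)$ with $\beta = 0.5$. The snippet below is a small sketch of that arithmetic; the function name is chosen for the example, and because the inputs are the rounded precision and recall values from the text, the results can differ from the reported scores in the second decimal place.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weighs precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1.0 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# CoNLL-2014 precision and recall as reported in the text.
learning_only = f_beta(74.12, 36.30)           # fluency boost learning
learning_and_inference = f_beta(68.45, 40.18)  # plus fluency boost inference
```

The recall gain of about 3.9 points does not compensate for the precision loss of about 5.7 points under this weighting, so the score falls from roughly 61.3 to roughly 60.0.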
We compare our systems to human performance on the CoNLL-10 and JFLEG benchmarks. For CoNLL-10, we follow the evaluation setting in Bryant & Ng (2015) and Chollampatt & Ng (2017) to fairly compare systems' performance to that of humans, which is marked with (SvH) in Table 3. Among our systems, the system with fluency boost learning and inference exceeds human performance on both the CoNLL and JFLEG datasets, while the system with only fluency boost learning achieves higher scores on the CoNLL dataset.
|Error type||Base convolutional seq2seq||Base + fluency boost learning|
We further study the effectiveness of fluency boost learning and inference for different error types. Table 4 shows the recall of the base convolutional seq2seq model and the model trained with fluency boost learning for each error type in the CoNLL-2014 dataset (original annotation setting); the definitions of the error types in Table 4 can be found in Ng et al. (2014). One can see that fluency boost learning improves recall for most error types, demonstrating that the fluency boost learning approach can generate sentences with diverse errors to help training.
To better understand the effectiveness of fluency boost inference (i.e., round-way error correction), we show in Table 5 the recall for each error type of the left-to-right and the right-to-left seq2seq models on the CoNLL-2014 dataset (original annotation setting). Note that, to clearly see the pros and cons of the left-to-right and right-to-left models, here we do not re-rank their n-best results using edit operations and the language model; instead, we directly use their 1-best generated sentences as their predictions.
According to Table 5, the right-to-left model does better on error types such as ArtOrDet, while the left-to-right model is better at correcting errors such as SVA, which is consistent with our motivation in Section 4.2. When we use round-way correction, the errors that are not corrected by the right-to-left model are likely to be corrected by the left-to-right one, which is reflected by the recall improvement for most error types, as shown in Table 5.
|Error type||Right-to-Left||Left-to-Right||Round-way (R2L → L2R)|
Most advanced GEC systems are classifier-based (Chodorow et al., 2007; De Felice & Pulman, 2008; Han et al., 2010; Leacock et al., 2010; Tetreault et al., 2010a; Dale & Kilgarriff, 2011) or MT-based (Brockett et al., 2006; Dahlmeier & Ng, 2011, 2012b; Yoshimoto et al., 2013; Yuan & Felice, 2013; Behera & Bhattacharyya, 2013). For example, top-performing systems (Felice et al., 2014; Rozovskaya et al., 2014; Junczys-Dowmunt & Grundkiewicz, 2014) in the CoNLL-2014 shared task (Ng et al., 2014) use one of these two methods. Recently, many novel approaches (Susanto et al., 2014; Chollampatt et al., 2016b, a; Rozovskaya & Roth, 2016; Junczys-Dowmunt & Grundkiewicz, 2016; Mizumoto & Matsumoto, 2016; Yuan et al., 2016; Hoang et al., 2016; Yannakoudakis et al., 2017) have been proposed for GEC. Among them, seq2seq models (Yuan & Briscoe, 2016; Xie et al., 2016; Ji et al., 2017; Sakaguchi et al., 2017; Schmaltz et al., 2017; Chollampatt & Ng, 2018; Junczys-Dowmunt et al., 2018) have attracted much attention. Unlike models trained only with original error-corrected data, we propose a novel fluency boost learning mechanism that performs dynamic data augmentation during training for GEC; related studies have explored artificial error generation for GEC (Brockett et al., 2006; Foster & Andersen, 2009; Rozovskaya & Roth, 2010, 2011; Rozovskaya et al., 2012; Felice & Yuan, 2014; Xie et al., 2016; Rei et al., 2017; Xie et al., 2018). Moreover, we propose fluency boost inference, which allows the model to repeatedly edit a sentence as long as the sentence's fluency can be improved. To the best of our knowledge, this is the first work to conduct multi-round seq2seq inference for GEC, while similar ideas have been proposed for NMT (Xia et al., 2017).
In addition to the studies on GEC, there is also much research on grammatical error detection (Leacock et al., 2010; Rei & Yannakoudakis, 2016; Kaneko et al., 2017) and GEC evaluation (Tetreault et al., 2010b; Madnani et al., 2011; Dahlmeier & Ng, 2012c; Napoles et al., 2015; Sakaguchi et al., 2016; Napoles et al., 2016; Bryant et al., 2017; Asano et al., 2017; Choshen & Abend, 2018). We do not introduce them in detail because they are not closely related to this work's contributions.
We present a state-of-the-art GEC system based on convolutional seq2seq models that uses a novel fluency boost learning and inference mechanism. Fluency boost learning fully exploits both error-corrected data and native data by generating diverse error-corrected sentence pairs during training, which benefits model learning and improves performance over the base seq2seq model, while fluency boost inference utilizes the characteristic of GEC that a sentence can be edited repeatedly to progressively improve a sentence's fluency through round-way correction. This learning and inference mechanism enables our system to achieve state-of-the-art results and reach human-level performance on both the CoNLL-2014 and JFLEG benchmark datasets.
European Workshop on Natural Language Generation, 2011.
Grammatical error correction with neural reinforcement learning. In IJCNLP, 2017.
On the importance of initialization and momentum in deep learning. In ICML, 2013.