Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study

07/03/2018 ∙ by Tao Ge, et al. ∙ Microsoft 2

Neural sequence-to-sequence (seq2seq) approaches have proven successful in grammatical error correction (GEC). Based on the seq2seq framework, we propose a novel fluency boost learning and inference mechanism. Fluency boost learning generates diverse error-corrected sentence pairs during training, enabling the error correction model to learn how to improve a sentence’s fluency from more instances, while fluency boost inference allows the model to correct a sentence incrementally with multiple inference steps. Combining fluency boost learning and inference with convolutional seq2seq models, our approach achieves state-of-the-art performance: 75.72 ($F_{0.5}$) on the CoNLL-2014 10-annotation dataset and 62.42 (GLEU) on the JFLEG test set, becoming the first GEC system that reaches human-level performance (72.58 for CoNLL and 62.37 for JFLEG) on both benchmarks.




1 Introduction

Sequence-to-sequence (seq2seq) models (Cho et al., 2014; Sutskever et al., 2014) for grammatical error correction (GEC) have drawn growing attention (Yuan & Briscoe, 2016; Xie et al., 2016; Ji et al., 2017; Schmaltz et al., 2017; Sakaguchi et al., 2017; Chollampatt & Ng, 2018; Junczys-Dowmunt et al., 2018) in recent years. However, most seq2seq models for GEC have two flaws. First, they are trained with only a limited number of error-corrected sentence pairs like the one in Figure 1(a). Limited by the size of the training data, models with millions of parameters may not generalize well. Thus, the models commonly fail to correct a sentence perfectly even if it differs only slightly from a training instance, as illustrated by Figure 1(b). Second, seq2seq models usually cannot perfectly correct a sentence with many grammatical errors through single-round seq2seq inference, as shown in Figure 1(b) and 1(c), because some errors in a sentence may make the context strange, which hinders the models from correcting the other errors.

Figure 1: (a) an error-corrected sentence pair; (b) if the sentence becomes slightly different, the model fails to correct it perfectly; (c) single-round seq2seq inference cannot perfectly correct the sentence, but multi-round inference can.

To address the above-mentioned limitations in model learning and inference, we propose a novel fluency boost learning and inference mechanism, illustrated in Figure 2.

For fluency boost learning, a seq2seq model is not only trained with the original error-corrected sentence pairs, but also generates less fluent sentences (e.g., from its n-best outputs) during training and pairs them with their correct sentences to establish new error-corrected sentence pairs, as long as the sentences’ fluency [Footnote 1: A sentence’s fluency score decreases as the sentence’s cross entropy increases, as defined in Eq (3).] is below that of their correct sentences, as Figure 2(a) shows. Specifically, we call the generated error-corrected sentence pairs fluency boost sentence pairs because the sentence on the target side always improves fluency over that on the source side. The fluency boost sentence pairs generated during training are used as additional training instances in subsequent training epochs, allowing the error correction model to see more grammatically incorrect sentences during training and accordingly improving its generalization ability.

For model inference, the fluency boost inference mechanism allows the model to correct a sentence incrementally with multi-round inference as long as the proposed edits can boost the sentence’s fluency, as Figure 2(b) shows. For a sentence with multiple grammatical errors, some of the errors will be corrected first. The corrected parts make the context clearer, which may help the model correct the remaining errors. Moreover, building on the basic fluency boost inference idea and on the special characteristic of this task that the output prediction can be repeatedly edited, we further propose a round-way correction approach that uses two seq2seq models whose decoding orders are left-to-right and right-to-left respectively. For round-way correction, a sentence is corrected successively by the right-to-left and left-to-right seq2seq models [Footnote 2: For convenience, we call the seq2seq model with a right-to-left decoder the right-to-left seq2seq model and the seq2seq model with a left-to-right decoder the left-to-right seq2seq model.]. Since the left-to-right and right-to-left decoders decode a sequence with different contexts, each has unique advantages for specific error types. Round-way correction exploits their respective strengths and makes them complement each other, which results in a significant improvement in recall.

Experiments show that, combining fluency boost learning and inference with convolutional seq2seq models, our best GEC system [Footnote 3: Our systems’ outputs for the CoNLL-2014 and JFLEG test sets have been made available.] achieves 75.72 $F_{0.5}$ on the CoNLL-2014 10-annotation dataset and 62.42 GLEU on the JFLEG test set, becoming the first system to reach human-level performance on both GEC benchmarks.

Figure 2: Fluency boost learning and inference: (a) given a training instance (i.e., an error-corrected sentence pair), fluency boost learning establishes multiple fluency boost sentence pairs from the seq2seq’s n-best outputs during training. The fluency boost sentence pairs will be used as training instances in subsequent training epochs, which helps expand the training set and accordingly benefits model learning; (b) fluency boost inference allows an error correction model to correct a sentence incrementally through multi-round seq2seq inference as long as its fluency can be improved.

2 Background: Neural grammatical error correction

As in neural machine translation (NMT), a typical neural GEC approach uses an encoder-decoder seq2seq model (Sutskever et al., 2014; Cho et al., 2014) with an attention mechanism (Bahdanau et al., 2014) to edit a raw sentence into the grammatically correct sentence it should be, as Figure 1(a) shows.

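Given a raw sentence $x^r = (x^r_1, \cdots, x^r_M)$ and its corrected sentence $x^c = (x^c_1, \cdots, x^c_N)$, in which $x^r_M$ and $x^c_N$ are the $M$-th and $N$-th words of $x^r$ and $x^c$ respectively, the error correction seq2seq model learns a probabilistic mapping $P(x^c | x^r)$ from error-corrected sentence pairs through maximum likelihood estimation (MLE), which learns model parameters $\Theta_{crt}$ to maximize the following equation:

$$\Theta^*_{crt} = \arg\max_{\Theta_{crt}} \sum_{(x^r, x^c) \in S^*} \log P(x^c | x^r; \Theta_{crt}) \tag{1}$$

where $S^*$ denotes the set of error-corrected sentence pairs.

For model inference, an output sequence $x^o = (x^o_1, \cdots, x^o_i, \cdots, x^o_L)$ is selected through beam search, which maximizes the following equation:

$$P(x^o | x^r) = \prod_{i=1}^{L} P(x^o_i | x^r, x^o_{<i}; \Theta_{crt}) \tag{2}$$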

3 Fluency boost learning

Conventional seq2seq models for GEC learn model parameters only from original error-corrected sentence pairs. However, such error-corrected sentence pairs are not sufficiently available. As a result, many neural GEC models are not very well generalized.

Fortunately, neural GEC is different from NMT. The goal of neural GEC is to improve a sentence’s fluency [Footnote 4: Fluency of a sentence in this work refers to how likely the sentence is to have been written by a native speaker. In other words, if a sentence is very likely to have been written by a native speaker, it should be regarded as highly fluent.] without changing its original meaning; thus, any sentence pair that satisfies this condition (we call it the fluency boost condition) can be used as a training instance.

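In this work, we define $f(x)$ as the fluency score of a sentence $x$:

$$f(x) = \frac{1}{1 + H(x)} \tag{3}$$

$$H(x) = -\frac{\sum_{i=1}^{|x|} \log P(x_i | x_{<i})}{|x|} \tag{4}$$

where $P(x_i | x_{<i})$ is the probability of $x_i$ given context $x_{<i}$, computed by a language model, and $|x|$ is the length of sentence $x$. $H(x)$ is actually the cross entropy of the sentence $x$, whose range is $[0, +\infty)$. Accordingly, the range of $f(x)$ is $(0, 1]$.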
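The fluency score defined above can be sketched in a few lines. This is a minimal illustration, assuming the language model supplies natural-log probabilities per token; the function names and the toy probabilities are hypothetical and not part of the original system.

```python
import math
from typing import Sequence


def cross_entropy(token_log_probs: Sequence[float]) -> float:
    """H(x): negative mean of the per-token log probabilities."""
    if not token_log_probs:
        raise ValueError("sentence must contain at least one token")
    return -sum(token_log_probs) / len(token_log_probs)


def fluency(token_log_probs: Sequence[float]) -> float:
    """f(x) = 1 / (1 + H(x)); the result lies in (0, 1]."""
    return 1.0 / (1.0 + cross_entropy(token_log_probs))


# Made-up per-token probabilities standing in for a real language model.
fluent = [math.log(p) for p in (0.5, 0.5, 0.5)]
disfluent = [math.log(p) for p in (0.5, 0.01, 0.5)]
```

A single improbable token raises the cross entropy and therefore lowers the fluency score, which is the behaviour the fluency boost condition relies on.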

Figure 3: Three fluency boost learning strategies: (a) back-boost, (b) self-boost, (c) dual-boost; all of them generate fluency boost sentence pairs (the pairs in the dashed boxes) to help model learning during training. The numbers in this figure are fluency scores of their corresponding sentences.

The core idea of fluency boost learning is to generate fluency boost sentence pairs that satisfy the fluency boost condition during training, as Figure 2(a) illustrates, so that these pairs can further help model learning.

In this section, we present three fluency boost learning strategies: back-boost, self-boost, and dual-boost that generate fluency boost sentence pairs in different ways, as illustrated in Figure 3.

3.1 Back-boost learning

Back-boost learning borrows the idea from back translation (Sennrich et al., 2016) in NMT, referring to training a backward model (we call it error generation model, as opposed to error correction model) that is used to convert a fluent sentence to a less fluent sentence with errors. Since the less fluent sentences are generated by the error generation seq2seq model trained with error-corrected data, they usually do not change the original sentence’s meaning; thus, they can be paired with their correct sentences, establishing fluency boost sentence pairs that can be used as training instances for error correction models, as Figure 3(a) shows.

Specifically, we first train a seq2seq error generation model $\Theta_{gen}$ with $\tilde{S}^*$, which is identical to $S^*$ except that the source sentence and the target sentence are interchanged. Then, we use the model $\Theta_{gen}$ to predict $n$-best outputs $x^{o_1}, \cdots, x^{o_n}$ given a correct sentence $x^c$. Given the fluency boost condition, we compare the fluency of each output $x^{o_k}$ (where $1 \le k \le n$) to that of its correct sentence $x^c$. If an output sentence’s fluency score is much lower than that of its correct sentence, we call it a disfluency candidate of $x^c$.

To formalize this process, we first define $Y_n(x; \Theta)$ to denote the $n$-best outputs predicted by model $\Theta$ given the input $x$. Then, disfluency candidates of a correct sentence $x^c$ can be derived:

$$D_{back}(x^c) = \{ x^{o_k} \mid x^{o_k} \in Y_n(x^c; \Theta_{gen}) \wedge \frac{f(x^c)}{f(x^{o_k})} \ge \sigma \} \tag{5}$$

where $D_{back}(x^c)$ denotes the disfluency candidate set for $x^c$ in back-boost learning. $\sigma$ is a threshold to determine if $x^{o_k}$ is less fluent than $x^c$, and it should be slightly larger [Footnote 5: We set $\sigma = 1.05$ since the corrected sentence in our training data improves its corresponding raw sentence by about 5% in fluency on average.] than $1.0$, which helps filter out sentence pairs with unnecessary edits (e.g., I like this book. → I like the book.).

In the subsequent training epochs, the error correction model will not only learn from the original error-corrected sentence pairs ($x^r$, $x^c$), but also learn from fluency boost sentence pairs ($x^{o_k}$, $x^c$), where $x^{o_k}$ is a sample of $D_{back}(x^c)$.
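The candidate filter of Eq (5) can be illustrated as follows. This is a sketch under stated assumptions: `fluency` is any callable returning a score in (0, 1], the default threshold of 1.05 is assumed here to match the roughly 5% average fluency gain mentioned in the footnote, and the fluency table is made up for illustration.

```python
from typing import Callable, Iterable, List


def disfluency_candidates(
    correct: str,
    n_best: Iterable[str],
    fluency: Callable[[str], float],
    sigma: float = 1.05,  # assumed threshold, slightly above 1.0
) -> List[str]:
    """Keep outputs whose fluency is lower than the correct sentence's by a factor of sigma."""
    f_correct = fluency(correct)
    return [out for out in n_best if f_correct / fluency(out) >= sigma]


# Toy fluency table standing in for a language model (made-up scores).
toy_scores = {
    "I like this book.": 0.60,
    "I like the book.": 0.59,    # nearly as fluent, so it is filtered out
    "I likes this book.": 0.40,  # clearly less fluent, so it is kept
}
candidates = disfluency_candidates(
    "I like this book.",
    ["I like the book.", "I likes this book."],
    toy_scores.__getitem__,
)
# Each surviving candidate becomes the source side of a fluency boost pair.
pairs = [(candidate, "I like this book.") for candidate in candidates]
```

The threshold keeps harmless paraphrases out of the training data while retaining outputs that are genuinely less fluent than the correct sentence.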

We summarize this process in Algorithm 1, where $S^*$ is the set of original error-corrected sentence pairs, and $S$ can be tentatively considered identical to $S^*$ when there is no additional native data to help model training (see Section 3.4). Note that we constrain the size of $S_t$ not to exceed $|S^*|$ (the 7th line in Algorithm 1) to prevent too many fluency boost pairs from overwhelming the effects of the original error-corrected pairs on model learning.

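1: Train error generation model $\Theta_{gen}$ with $\tilde{S}^*$;
2: for each sentence pair $(x^r, x^c) \in S$ do
3:      Compute $D_{back}(x^c)$ according to Eq (5);
4: end for
5: for each training epoch $t$ do
6:      $S' \leftarrow \emptyset$;
7:      Derive a subset $S_t$ by randomly sampling $|S^*|$ elements from $S$;
8:      for each $(x^r, x^c) \in S_t$ do
9:          Establish a fluency boost pair $(x', x^c)$ by randomly sampling $x' \in D_{back}(x^c)$;
10:         $S' \leftarrow S' \cup \{(x', x^c)\}$;
11:     end for
12:     Update error correction model $\Theta_{crt}$ with $S^* \cup S'$;
13: end for
Algorithm 1 Back-boost learning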
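One training epoch of Algorithm 1 (lines 6 to 12) might look like the sketch below. It assumes the disfluency candidate sets have already been computed, represents the model update as an opaque callback, and skips sentences whose candidate set is empty; these choices, along with every name in the code, are illustrative assumptions rather than details given in the paper.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, corrected sentence)


def back_boost_epoch(
    original_pairs: Sequence[Pair],            # original error-corrected pairs
    pool: Sequence[Pair],                      # sampling pool (same as original_pairs without native data)
    d_back: Dict[str, List[str]],              # precomputed disfluency candidates per correct sentence
    update_model: Callable[[List[Pair]], None],
    rng: random.Random,
) -> List[Pair]:
    """Run lines 6-12 of Algorithm 1 once and return the fluency boost pairs."""
    boost_pairs: List[Pair] = []                        # line 6
    k = min(len(original_pairs), len(pool))             # guard for small pools
    sampled = rng.sample(list(pool), k)                 # line 7
    for _raw, correct in sampled:                       # line 8
        candidates = d_back.get(correct, [])
        if candidates:                                  # assumption: skip empty sets
            boost_pairs.append((rng.choice(candidates), correct))  # lines 9-10
    update_model(list(original_pairs) + boost_pairs)    # line 12
    return boost_pairs


# Toy usage: the "model update" just records the batch it was given.
seen: List[List[Pair]] = []
s_star = [("She go home .", "She goes home ."), ("I has a pen .", "I have a pen .")]
toy_d_back = {"She goes home .": ["She going home ."], "I have a pen .": []}
boost = back_boost_epoch(s_star, s_star, toy_d_back, seen.append, random.Random(0))
```

Capping the sample at the size of the original data keeps the generated pairs from outnumbering the original ones in any epoch.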

3.2 Self-boost learning

In contrast to back-boost learning, whose core idea originates from NMT, self-boost learning is original and specially devised for neural GEC. The idea of self-boost learning is illustrated by Figure 3(b) and was already briefly introduced in Section 1 and Figure 2(a). Unlike back-boost learning, in which an error generation seq2seq model is trained to generate disfluency candidates, self-boost learning allows the error correction model to generate the candidates by itself. Since the disfluency candidates generated by the error correction seq2seq model trained with error-corrected data rarely change the input sentence’s meaning, they can be used to establish fluency boost sentence pairs.

For self-boost learning, given an error-corrected pair $(x^r, x^c)$, an error correction model $\Theta_{crt}$ first predicts $n$-best outputs $x^{o_1}, \cdots, x^{o_n}$ for the raw sentence $x^r$. Among the $n$-best outputs, any output that is not identical to $x^c$ can be considered an error prediction. Instead of treating the error predictions as useless, self-boost learning fully exploits them. Specifically, if an error prediction $x^{o_k}$ is much less fluent than its correct sentence $x^c$, it will be added to $x^c$’s disfluency candidate set $D_{self}(x^c)$, as Eq (6) shows:

$$D_{self}(x^c) = D_{self}(x^c) \cup \{ x^{o_k} \mid x^{o_k} \in Y_n(x^r; \Theta_{crt}) \wedge \frac{f(x^c)}{f(x^{o_k})} \ge \sigma \} \tag{6}$$

In contrast to back-boost learning, self-boost generates disfluency candidates from a different perspective: by editing the raw sentence $x^r$ rather than the correct sentence $x^c$. It is also noteworthy that $D_{self}(x^c)$ is incrementally expanded because the error correction model $\Theta_{crt}$ is dynamically updated, as shown in Algorithm 2.

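1: for each sentence pair $(x^r, x^c) \in S$ do
2:      $D_{self}(x^c) \leftarrow \emptyset$;
3: end for
4: $S' \leftarrow \emptyset$;
5: for each training epoch $t$ do
6:      Update error correction model $\Theta_{crt}$ with $S^* \cup S'$;
7:      $S' \leftarrow \emptyset$;
8:      Derive a subset $S_t$ by randomly sampling $|S^*|$ elements from $S$;
9:      for each $(x^r, x^c) \in S_t$ do
10:         Update $D_{self}(x^c)$ according to Eq (6);
11:         Establish a fluency boost pair $(x', x^c)$ by randomly sampling $x' \in D_{self}(x^c)$;
12:         $S' \leftarrow S' \cup \{(x', x^c)\}$;
13:     end for
14: end for
Algorithm 2 Self-boost learning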
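The incremental update of the self-boost candidate set (Eq (6), line 10 of Algorithm 2) can be sketched as below. The n-best function stands in for the current error correction model, the fluency table is made up, and the threshold default of 1.05 is an assumption; all names are hypothetical.

```python
from typing import Callable, Dict, List, Set


def update_d_self(
    d_self: Dict[str, Set[str]],
    raw: str,
    correct: str,
    n_best: Callable[[str], List[str]],  # stand-in for the current correction model
    fluency: Callable[[str], float],
    sigma: float = 1.05,                 # assumed threshold
) -> None:
    """Add the model's sufficiently disfluent wrong outputs to the candidate set, in place."""
    bucket = d_self.setdefault(correct, set())
    f_correct = fluency(correct)
    for out in n_best(raw):
        if out != correct and f_correct / fluency(out) >= sigma:
            bucket.add(out)


# Made-up fluency scores and two "epochs" in which the model's outputs differ.
scores = {"She goes home .": 0.6, "She go home .": 0.4, "She going home .": 0.3}
d_self: Dict[str, Set[str]] = {}
epoch1 = lambda raw: ["She goes home .", "She go home ."]
epoch2 = lambda raw: ["She goes home .", "She going home ."]
update_d_self(d_self, "She go home .", "She goes home .", epoch1, scores.__getitem__)
update_d_self(d_self, "She go home .", "She goes home .", epoch2, scores.__getitem__)
```

Because the model changes between epochs, its wrong outputs change too, so the candidate set accumulates more varied errors over time.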

3.3 Dual-boost learning

As introduced above, back- and self-boost learning generate disfluency candidates from different perspectives to create more fluency boost sentence pairs for training the error correction model. Intuitively, the more diverse the generated disfluency candidates, the more helpful they are for training an error correction model. Inspired by He et al. (2016) and Zhang et al. (2018), we propose a dual-boost learning strategy that combines both back- and self-boost’s perspectives to generate disfluency candidates.

As Figure 3(c) shows, disfluency candidates in dual-boost learning are from both the error generation model $\Theta_{gen}$ and the error correction model $\Theta_{crt}$:

$$D_{dual}(x^c) = D_{dual}(x^c) \cup \{ x^{o_k} \mid x^{o_k} \in Y_n(x^r; \Theta_{crt}) \cup Y_n(x^c; \Theta_{gen}) \wedge \frac{f(x^c)}{f(x^{o_k})} \ge \sigma \} \tag{7}$$

Moreover, the error correction model and the error generation model are dual and both of them are dynamically updated, so each improves the other: the disfluency candidates produced by the error generation model can benefit training the error correction model, while the disfluency candidates created by the error correction model can be used as training data for the error generation model. We summarize this learning approach in Algorithm 3.

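1: for each $(x^r, x^c) \in S$ do
2:      $D_{dual}(x^c) \leftarrow \emptyset$;
3: end for
4: $S' \leftarrow \emptyset$; $S'' \leftarrow \emptyset$;
5: for each training epoch $t$ do
6:      Update error correction model $\Theta_{crt}$ with $S^* \cup S'$;
7:      Update error generation model $\Theta_{gen}$ with $\tilde{S}^* \cup S''$;
8:      $S' \leftarrow \emptyset$; $S'' \leftarrow \emptyset$;
9:      Derive a subset $S_t$ by randomly sampling $|S^*|$ elements from $S$;
10:     for each $(x^r, x^c) \in S_t$ do
11:         Update $D_{dual}(x^c)$ according to Eq (7);
12:         Establish a fluency boost pair $(x', x^c)$ by randomly sampling $x' \in D_{dual}(x^c)$;
13:         $S' \leftarrow S' \cup \{(x', x^c)\}$;
14:         Establish a reversed fluency boost pair $(x^c, x'')$ by randomly sampling $x'' \in D_{dual}(x^c)$;
15:         $S'' \leftarrow S'' \cup \{(x^c, x'')\}$;
16:     end for
17: end for
Algorithm 3 Dual-boost learning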
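The inner loop of Algorithm 3 (lines 10 to 16) can be sketched as a function that returns both the forward and the reversed fluency boost pairs. The two n-best callables stand in for the error correction and error generation models; those stand-ins, the threshold default of 1.05, the sorting step used for reproducible sampling, and the skipping of empty candidate sets are all assumptions made for this illustration.

```python
import random
from typing import Callable, Dict, List, Sequence, Set, Tuple

Pair = Tuple[str, str]


def dual_boost_pairs(
    sampled: Sequence[Pair],                      # the sampled subset of sentence pairs
    d_dual: Dict[str, Set[str]],                  # candidate sets, updated in place
    correct_n_best: Callable[[str], List[str]],   # stand-in for the error correction model
    generate_n_best: Callable[[str], List[str]],  # stand-in for the error generation model
    fluency: Callable[[str], float],
    rng: random.Random,
    sigma: float = 1.05,                          # assumed threshold
) -> Tuple[List[Pair], List[Pair]]:
    """Return (pairs for the correction model, reversed pairs for the generation model)."""
    forward: List[Pair] = []
    reverse: List[Pair] = []
    for raw, correct in sampled:
        outputs = correct_n_best(raw) + generate_n_best(correct)
        f_correct = fluency(correct)
        bucket = d_dual.setdefault(correct, set())
        bucket.update(o for o in outputs if f_correct / fluency(o) >= sigma)  # Eq (7)
        if bucket:
            ordered = sorted(bucket)                        # fixed order for reproducibility
            forward.append((rng.choice(ordered), correct))  # disfluent -> correct
            reverse.append((correct, rng.choice(ordered)))  # correct -> disfluent
    return forward, reverse


# Toy usage with made-up fluency scores and fixed model outputs.
toy_fluency = {"She goes home .": 0.6, "She going home .": 0.3}
fwd, rev = dual_boost_pairs(
    [("She go home .", "She goes home .")],
    {},
    lambda raw: ["She goes home ."],
    lambda correct: ["She going home ."],
    toy_fluency.__getitem__,
    random.Random(0),
)
```

The same candidate set feeds both directions, which is what lets the two models supply training data for each other.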

3.4 Fluency boost learning with large-scale native data

Our proposed fluency boost learning strategies can be easily extended to utilize massive native text data which proved to be useful for GEC.

As discussed in Section 3.1, when there is no additional native data, $S$ in Algorithms 1-3 is identical to $S^*$. In the case where additional native data is available to help model learning, $S$ becomes:

$$S = S^* \cup C \tag{8}$$

where $C = \{(x^c, x^c)\}$ denotes the set of self-copied sentence pairs from native data.
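Extending the sampling pool with native data amounts to pairing each native sentence with itself. A minimal sketch, with hypothetical names and toy sentences:

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def build_sampling_pool(original_pairs: Iterable[Pair], native: Iterable[str]) -> List[Pair]:
    """Union of the original pairs and self-copied (sentence, sentence) pairs from native text."""
    self_copied = [(sentence, sentence) for sentence in native]
    return list(original_pairs) + self_copied


pool = build_sampling_pool(
    [("She go home .", "She goes home .")],
    ["The park is quiet .", "He reads every day ."],
)
```

For a self-copied pair the raw and correct sides coincide, so any disfluent variant the models produce for it yields a new fluency boost pair without needing human-written errors.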

4 Fluency boost inference

4.1 Multi-round error correction

As discussed in Section 1, some sentences with multiple grammatical errors usually cannot be perfectly corrected through normal seq2seq inference, which makes only a single round of inference. Fortunately, neural GEC is different from NMT: its source and target language are the same. This characteristic allows us to edit a sentence more than once through multi-round model inference, which motivates our fluency boost inference. As Figure 2(b) shows, fluency boost inference allows a sentence to be incrementally edited through multi-round seq2seq inference as long as the sentence’s fluency can be improved. Specifically, an error correction seq2seq model first takes a raw sentence $x^r$ as an input and outputs a hypothesis $x^{o_1}$. Instead of regarding $x^{o_1}$ as the final prediction, fluency boost inference will then take $x^{o_1}$ as the input to generate the next output $x^{o_2}$. The process will not terminate unless $x^{o_t}$ does not improve $x^{o_{t-1}}$ in terms of fluency.
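The multi-round procedure can be sketched as a loop that keeps re-feeding the hypothesis until fluency stops improving. The single-round corrector and fluency scorer are passed in as callables; the `max_rounds` cap is a safety assumption added for this sketch, and the example sentences and scores are made up.

```python
from typing import Callable


def fluency_boost_inference(
    raw: str,
    correct_once: Callable[[str], str],  # one round of seq2seq inference
    fluency: Callable[[str], float],
    max_rounds: int = 10,                # safety cap; an assumption of this sketch
) -> str:
    """Re-apply the corrector while each new hypothesis is more fluent than the last."""
    current = raw
    for _ in range(max_rounds):
        hypothesis = correct_once(current)
        if fluency(hypothesis) <= fluency(current):
            break  # no fluency gain: keep the previous sentence
        current = hypothesis
    return current


# Toy corrector that fixes one error per round, with made-up fluency scores.
steps = {
    "She see Tom yesterday in park .": "She saw Tom yesterday in park .",
    "She saw Tom yesterday in park .": "She saw Tom yesterday in the park .",
}
toy_scores = {
    "She see Tom yesterday in park .": 0.30,
    "She saw Tom yesterday in park .": 0.45,
    "She saw Tom yesterday in the park .": 0.60,
}
result = fluency_boost_inference(
    "She see Tom yesterday in park .",
    lambda s: steps.get(s, s),
    toy_scores.__getitem__,
)
```

In this toy run the third round returns its input unchanged, so the loop stops and the second-round output is kept.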

4.2 Round-way error correction

Based on the idea of multi-round correction, we further propose an advanced fluency boost inference approach: round-way error correction. Instead of progressively correcting a sentence with the same seq2seq model as introduced in Section 4.1, round-way correction corrects a sentence through a right-to-left seq2seq model and a left-to-right seq2seq model successively, as shown in Figure 4.

Figure 4: Round-way error correction: some types of errors (e.g., articles) are easier for a right-to-left seq2seq model to correct, while some (e.g., subject-verb agreement) are more likely to be corrected by a left-to-right seq2seq model. Round-way error correction makes the left-to-right and right-to-left seq2seq models complement each other, enabling it to correct more grammatical errors than an individual model.

The motivation of round-way error correction is straightforward. Decoders with different decoding orders decode word sequences with different contexts, giving each unique advantages for specific error types. For the example in Figure 4, the missing article error (i.e., park → the park) is more likely to be corrected by the right-to-left seq2seq model than the left-to-right one, because whether to add an article depends on the noun park, which the right-to-left model has already seen when it makes the decision. In contrast, the left-to-right model might be better at dealing with subject-verb agreement errors (e.g., come → comes in Figure 4) because the keyword that decides the verb form is its subject She, which is at the beginning of the sentence.
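Round-way correction can be sketched as applying the two directional models in sequence. Accepting a model's output only when it improves fluency is an assumption carried over from the fluency boost inference idea; the two toy correctors and the fluency scores are made up, loosely following the Figure 4 example.

```python
from typing import Callable, Sequence


def round_way_correction(
    raw: str,
    correctors: Sequence[Callable[[str], str]],  # e.g. (right_to_left, left_to_right)
    fluency: Callable[[str], float],
) -> str:
    """Apply each corrector in turn, keeping an output only if it is more fluent."""
    current = raw
    for correct_once in correctors:
        hypothesis = correct_once(current)
        if fluency(hypothesis) > fluency(current):
            current = hypothesis
    return current


# Toy stand-ins: the right-to-left model adds the article,
# the left-to-right model fixes subject-verb agreement.
def right_to_left(sentence: str) -> str:
    return sentence.replace("to park", "to the park")


def left_to_right(sentence: str) -> str:
    return sentence.replace("She come ", "She comes ")


toy_scores = {
    "She come to park .": 0.30,
    "She come to the park .": 0.45,
    "She comes to the park .": 0.60,
}
corrected = round_way_correction(
    "She come to park .", (right_to_left, left_to_right), toy_scores.__getitem__
)
```

Each model fixes the error type it handles well, so the combined pass corrects both errors where either model alone would leave one behind.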

5 Experiments

5.1 Dataset and evaluation

Corpus #sent pair
Lang-8 1,114,139
CLC 1,366,075
NUCLE 57,119
Extended Lang-8 2,865,639
Total 5,402,972
Table 1: Error-corrected training data.

As in previous studies (Ji et al., 2017), we use the public Lang-8 Corpus (Mizumoto et al., 2011; Tajiri et al., 2012), Cambridge Learner Corpus (CLC) (Nicholls, 2003) and NUS Corpus of Learner English (NUCLE) (Dahlmeier et al., 2013) as our original error-corrected training data. Table 1 shows the statistics of the datasets. In addition, we also collect 2,865,639 non-public error-corrected sentence pairs from Lang-8 (the Extended Lang-8 row in Table 1). The native data we use for fluency boost learning is English Wikipedia, which contains 61,677,453 sentences.

We use the CoNLL-2014 shared task dataset (Ng et al., 2014) and the JFLEG (Napoles et al., 2017) test set as our evaluation datasets. The CoNLL-2014 test set contains 1,312 sentences, while the JFLEG test set has 747 sentences. Consistent with the official evaluation metrics, we use MaxMatch ($M^2$) $F_{0.5}$ (Dahlmeier & Ng, 2012a) for CoNLL-2014 and GLEU (Napoles et al., 2015) for JFLEG evaluation. It is notable that the original annotations for the CoNLL-2014 dataset are from 2 human annotators, and were later enriched by Bryant & Ng (2015) to contain 10 human expert annotations for each test sentence. We evaluate systems’ performance using both annotation settings for the CoNLL dataset. To distinguish between these two annotation settings, we use CoNLL-2014 to denote the original annotations, and CoNLL-10 to denote the 10-human annotations. As in previous studies, we use the CoNLL-2013 test set and the JFLEG dev set as our development sets for the CoNLL-2014 and JFLEG test sets respectively.

5.2 Experimental setting

We use 7-layer convolutional seq2seq models (Gehring et al., 2017) as our error correction and error generation models, which have proven to be effective for GEC (Chollampatt & Ng, 2018). Following Chollampatt & Ng (2018), we set the dimensionality of word embeddings in both encoders and decoders to 500, the hidden size of encoders and decoders to 1,024 and the convolution window width to 3. The vocabularies of the source and target side are the most frequent 30K BPE tokens for each. We train the seq2seq models using the Nesterov Accelerated Gradient (Sutskever et al., 2013) optimizer with a momentum value of 0.99. The initial learning rate is set to 0.25 and it is reduced by an order of magnitude if the validation perplexity stops improving. During training, we allow each batch to have at most 3,000 tokens per GPU and set the dropout rate to 0.2. We terminate the training process when the learning rate falls below a preset minimum threshold. Following Chollampatt & Ng (2018) and Grundkiewicz & Junczys-Dowmunt (2018), we train 4 models with different random initializations for ensemble decoding.
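The learning rate policy described above (start at 0.25, divide by ten when validation perplexity stops improving, stop below a minimum) can be sketched as follows. This is a simplified illustration and not the training toolkit's actual scheduler; the function name, the `min_lr` value used in the example, and the perplexity sequence are all hypothetical.

```python
from typing import Iterable, List


def learning_rates(
    val_perplexities: Iterable[float],
    min_lr: float,            # stopping threshold; the value is supplied by the caller
    initial_lr: float = 0.25,
    shrink: float = 0.1,      # "an order of magnitude"
) -> List[float]:
    """Return the learning rate used in each epoch until training would stop."""
    lr = initial_lr
    best = float("inf")
    used: List[float] = []
    for perplexity in val_perplexities:
        used.append(lr)
        if perplexity < best:
            best = perplexity          # still improving: keep the rate
        else:
            lr *= shrink               # plateau: shrink the rate
            if lr < min_lr:
                break                  # rate too small: terminate training
    return used


# Made-up validation perplexities and a hypothetical stopping threshold.
rates = learning_rates([10.0, 8.0, 8.5, 7.9, 8.0, 8.1], min_lr=1e-3)
```

In this toy run the rate is cut after the third and fifth epochs, and the cut after the sixth epoch pushes it below the threshold, which ends training.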

For fluency boost learning, we adopt dual-boost learning introduced in Section 3.3 and use the English Wikipedia data as our native data (Section 3.4). Disfluency candidates are generated from 10-best outputs. For fluency boost inference, we use the round-way correction approach introduced in Section 4.2. The architecture of the right-to-left seq2seq model in round-way correction is the same as that of the left-to-right one [Footnote 7: In cases other than round-way correction, we use left-to-right seq2seq models as our default error correction models.] except that they decode sentences in opposite directions. For single-round inference, we follow Chollampatt & Ng (2018) to generate 12-best predictions and choose the best sentence after re-ranking with edit operation and language model scores. The language model is the 5-gram language model trained on Common Crawl released by Junczys-Dowmunt & Grundkiewicz (2016), which is also used for computing the fluency score in Eq (3).

Like most of the systems (Sakaguchi et al., 2017; Chollampatt & Ng, 2018; Grundkiewicz & Junczys-Dowmunt, 2018) evaluated on JFLEG, which use an additional spell checker to resolve spelling errors, we use a public spell checker to resolve spelling errors in JFLEG as preprocessing.

System CoNLL-2014 (F0.5) CoNLL-10 (F0.5) JFLEG test (GLEU)
No edit - - 40.54
CAMB14 37.33 54.30 46.04
CAMB16 39.90 - 52.05
CAMB17 51.08 - -
CUUI 36.79 51.79 -
VT16 47.40 62.45 -
AMU14 35.01 50.17 -
AMU16 49.49 66.83 51.46
NUS16 44.27 60.36 50.13
NUS17 53.14 69.12 56.78
NUS18 54.79 70.14 57.47
Nested-RNN-seq2seq 45.15 - 53.41
Back-CNN-seq2seq 49.0 - 56.6
Adapted-transformer 55.8 - 59.9
SMT-NMT hybrid 56.25 - 61.50
Base convolutional seq2seq 57.95 73.19 60.87
Base + FB learning 61.34 76.88 61.41
Base + FB learning and inference 60.00 75.72 62.42
Table 2: Comparison of GEC systems on CoNLL and JFLEG benchmark datasets.

5.3 Experimental results

We compare our systems [Footnote 9: In this report, we do not present a detailed comparison and analysis of different fluency boost learning and inference methods, which can be found in Ge et al. (2018).] to the following well-known GEC systems:

  • CAMB14, CAMB16 and CAMB17: GEC systems (Felice et al., 2014; Yuan & Briscoe, 2016; Yannakoudakis et al., 2017) developed by Cambridge University. For CAMB17, we report its best result.

  • CUUI and VT16: the former system (Rozovskaya et al., 2014) uses a classifier-based approach, which is improved by the latter system (Rozovskaya & Roth, 2016) through combining it with an SMT-based approach.

  • AMU14 and AMU16: SMT-based GEC systems (Junczys-Dowmunt & Grundkiewicz, 2014, 2016) developed by AMU.

  • NUS14, NUS16, NUS17 and NUS18: The first three GEC systems (Susanto et al., 2014; Chollampatt et al., 2016a; Chollampatt & Ng, 2017) are SMT-based GEC systems that are combined with other techniques (e.g., classifiers). The last one (Chollampatt & Ng, 2018) uses convolutional seq2seq models for grammatical error correction.

  • Nested-RNN-seq2seq: a Recurrent Neural Network (RNN) seq2seq model with nested attention (Ji et al., 2017).

  • Back-CNN-seq2seq: a convolutional seq2seq model (Xie et al., 2018) trained with synthesized data augmented by back translation. Its core idea is somewhat similar to the idea introduced in Section 3.1 and Section 3.4 of this work.

  • Adapted-transformer: a transformer (Vaswani et al., 2017) based GEC system (Junczys-Dowmunt et al., 2018) with techniques adapted from low-resource machine translation.

  • SMT-NMT hybrid: the state-of-the-art GEC system (Grundkiewicz & Junczys-Dowmunt, 2018) that is based on an SMT-NMT hybrid approach.

Table 2 shows the results [Footnote 10: A result marked with “-” means that the system’s result on the corresponding dataset or setting is not reported by the original papers or other literature and that the system outputs are not publicly available.] of GEC systems on the CoNLL and JFLEG datasets. Our base convolutional seq2seq model outperforms most previous GEC systems owing to the larger size of the training data we use. Fluency boost learning further improves the base convolutional seq2seq model: it achieves a 61.34 $F_{0.5}$ score on CoNLL-2014, a 76.88 $F_{0.5}$ score on CoNLL-10, and a 61.41 GLEU score on the JFLEG test set. When we further add fluency boost inference, the system’s performance on the JFLEG test set improves to a 62.42 GLEU score, while its scores on the CoNLL benchmarks drop.

System                            CoNLL-2014             CoNLL-10               CoNLL-10 (SvH)  JFLEG
                                  P      R      F0.5     P      R      F0.5     F0.5            GLEU
NUS17                             62.74  32.96  53.14    80.04  44.71  69.12    68.29           56.78
NUS18                             65.49  33.14  54.79    81.05  45.60  70.14    69.30           57.47
Adapted-transformer               61.9   40.2   55.8     -      -      -        -               59.9
SMT-NMT hybrid                    66.77  34.49  56.25    -      -      -        72.04           61.50
Base convolutional seq2seq        72.52  32.13  57.95    86.65  45.14  73.19    72.28           60.87
Base + FB learning                74.12  36.30  61.34    88.56  50.31  76.88    75.93*          61.41
Base + FB learning and inference  68.45  40.18  60.00    84.71  53.15  75.72    74.84*          62.42*
Human performance                 -      -      -        -      -      -        72.58           62.37
Table 3: Evaluation result analysis for top-performing GEC systems on CoNLL and JFLEG datasets. The results marked with * exceed the human-level performance.

We look into the results in Table 3. Fluency boost learning improves the base convolutional seq2seq model in all aspects (i.e., precision, recall, $F_{0.5}$ and GLEU), demonstrating that fluency boost learning is indeed helpful for training a seq2seq model for GEC. Adding fluency boost inference improves recall (from 36.30 to 40.18 on CoNLL-2014 and from 50.31 to 53.15 on CoNLL-10) at the expense of a drop in precision (from 74.12 to 68.45 on CoNLL-2014 and from 88.56 to 84.71 on CoNLL-10). Since $F_{0.5}$ weighs precision twice as much as recall, adding fluency boost inference leads to a drop of $F_{0.5}$ on the CoNLL dataset. In contrast, for JFLEG, fluency boost inference improves the GLEU score from 61.41 to 62.42, demonstrating its effectiveness for improving sentences’ fluency.
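The precision and recall trade-off can be checked with the standard F-beta formula, $F_\beta = (1 + \beta^2) P R / (\beta^2 P + R)$ with $\beta = 0.5$. The sketch below applies it to the CoNLL-2014 precision and recall values from Table 3; the function name is illustrative, and small differences from the reported scores are expected because the inputs are already rounded.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall; beta < 1 favours precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# CoNLL-2014 precision and recall from Table 3.
with_learning = f_beta(74.12, 36.30)   # Base + FB learning
with_inference = f_beta(68.45, 40.18)  # Base + FB learning and inference
```

Recall rises by about 4 points but precision falls by almost 6, and with precision weighted more heavily the combined score goes down.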

We compare our systems to human performance on the CoNLL-10 and JFLEG benchmarks. For CoNLL-10, we follow the evaluation setting in Bryant & Ng (2015) and Chollampatt & Ng (2017) to fairly compare systems’ performance to that of humans, which is marked with (SvH) in Table 3. Among our systems, the system with fluency boost learning and inference exceeds human performance on both the CoNLL and JFLEG datasets, while the system with only fluency boost learning achieves higher scores on the CoNLL dataset.

Error type Base convolutional seq2seq Base + fluency boost learning
ArtOrDet 26.00 28.26
Mec 25.45 25.54
Nn 46.10 53.99
Npos 20.00 24.00
Pform 17.54 15.79
Pref 4.69 7.04
Prep 23.38 28.51
Rloc 9.54 9.54
Sfrag 0 7.14
Smod 0 0
Spar 8.00 12.00
Srun 0 0
Ssub 10.14 14.49
SVA 34.74 42.11
Trans 5.63 8.45
Um 2.04 2.04
V0 23.21 26.79
Vform 34.81 42.78
Vm 11.69 11.69
Vt 14.36 19.70
Wa 0 0
Wci 7.50 9.15
Wform 43.28 47.01
WOadv 5.88 23.53
WOinc 1.45 4.35
Wtone 8.70 17.39
Others 0 1.22
Table 4: A comparison of recall of the convolutional seq2seq model with/without fluency boost learning for each error type in CoNLL-2014 dataset.

We further study the effectiveness of fluency boost learning and inference for different error types. Table 4 shows the recall of the base convolutional seq2seq model and the model trained with fluency boost learning for each error type [Footnote 11: The definitions of error types in Table 4 can be found in Ng et al. (2014).] in the CoNLL-2014 dataset (original annotation setting). One can see that fluency boost learning improves recall for most error types, demonstrating that the fluency boost learning approach can generate sentences with diverse errors to help training.

To better understand the effectiveness of fluency boost inference (i.e., round-way error correction), we show in Table 5 the recall of each error type of the left-to-right and the right-to-left seq2seq in CoNLL-2014 dataset (original annotation setting). Note that to clearly see pros and cons of the left-to-right and right-to-left model, here we do not re-rank their n-best results using edit operations and the language model; instead, we directly use their 1-best generated sentence as their prediction.

According to Table 5, the right-to-left model does better in the error types like ArtOrDet, while the left-to-right model is better at correcting the errors like SVA, which is consistent with our motivation in Section 4.2. When we use round-way correction, the errors that are not corrected by the right-to-left model are likely to be corrected by the left-to-right one, which is reflected by the recall improvement of most error types, as shown in Table 5.

Error type Right-to-Left Left-to-Right Round-way (R2L → L2R)
ArtOrDet 25.70 22.31 30.36
Mec 16.27 16.52 20.40
Nn 32.13 38.03 41.31
Npos 16.00 12.00 16.00
Pform 17.54 14.04 19.30
Pref 2.35 2.35 3.76
Prep 14.88 17.40 21.81
Rloc 7.25 6.87 9.92
Sfrag 0 0 0
Smod 0 0 0
Spar 4.00 12.00 8.00
Srun 0 0 0
Ssub 7.25 5.80 10.14
SVA 30.85 36.84 39.47
Trans 7.04 4.93 7.04
Um 2.04 0 2.04
V0 21.43 17.86 28.57
Vform 25.14 31.67 33.52
Vm 7.79 6.49 9.09
Vt 13.37 11.33 14.36
Wa 0 0 0
Wci 5.50 4.68 6.67
Wform 35.34 37.59 41.04
WOadv 8.82 14.71 17.65
WOinc 2.90 2.90 4.35
Wtone 8.70 4.35 8.70
Others 1.22 1.22 1.22
Table 5: The left-to-right and right-to-left seq2seq model’s recall of each error type in CoNLL-2014.

6 Related work

Most advanced GEC systems are classifier-based (Chodorow et al., 2007; De Felice & Pulman, 2008; Han et al., 2010; Leacock et al., 2010; Tetreault et al., 2010a; Dale & Kilgarriff, 2011) or MT-based (Brockett et al., 2006; Dahlmeier & Ng, 2011, 2012b; Yoshimoto et al., 2013; Yuan & Felice, 2013; Behera & Bhattacharyya, 2013). For example, top-performing systems (Felice et al., 2014; Rozovskaya et al., 2014; Junczys-Dowmunt & Grundkiewicz, 2014) in the CoNLL-2014 shared task (Ng et al., 2014) use either of the methods. Recently, many novel approaches (Susanto et al., 2014; Chollampatt et al., 2016b, a; Rozovskaya & Roth, 2016; Junczys-Dowmunt & Grundkiewicz, 2016; Mizumoto & Matsumoto, 2016; Yuan et al., 2016; Hoang et al., 2016; Yannakoudakis et al., 2017) have been proposed for GEC. Among them, seq2seq models (Yuan & Briscoe, 2016; Xie et al., 2016; Ji et al., 2017; Sakaguchi et al., 2017; Schmaltz et al., 2017; Chollampatt & Ng, 2018; Junczys-Dowmunt et al., 2018) have attracted much attention. Unlike models trained only with original error-corrected data, we propose a novel fluency boost learning mechanism that performs dynamic data augmentation during training for GEC, although some related studies explore artificial error generation for GEC (Brockett et al., 2006; Foster & Andersen, 2009; Rozovskaya & Roth, 2010, 2011; Rozovskaya et al., 2012; Felice & Yuan, 2014; Xie et al., 2016; Rei et al., 2017; Xie et al., 2018). Moreover, we propose fluency boost inference, which allows the model to repeatedly edit a sentence as long as the sentence’s fluency can be improved. To the best of our knowledge, this is the first work to conduct multi-round seq2seq inference for GEC, while similar ideas have been proposed for NMT (Xia et al., 2017).

In addition to the studies on GEC, there is also much research on grammatical error detection (Leacock et al., 2010; Rei & Yannakoudakis, 2016; Kaneko et al., 2017) and GEC evaluation (Tetreault et al., 2010b; Madnani et al., 2011; Dahlmeier & Ng, 2012c; Napoles et al., 2015; Sakaguchi et al., 2016; Napoles et al., 2016; Bryant et al., 2017; Asano et al., 2017; Choshen & Abend, 2018). We do not introduce them in detail because they are not much related to this work’s contributions.

7 Conclusion

We present a state-of-the-art GEC system based on a convolutional seq2seq model that uses a novel fluency boost learning and inference mechanism. Fluency boost learning fully exploits both error-corrected data and native data by generating diverse error-corrected sentence pairs during training, which benefits model learning and improves performance over the base seq2seq model. Fluency boost inference exploits the characteristics of GEC to progressively improve a sentence’s fluency through round-way correction. This learning and inference mechanism enables our system to achieve state-of-the-art results and reach human-level performance on both the CoNLL-2014 and JFLEG benchmark datasets.

References


  • Asano et al. (2017) Hiroki Asano, Tomoya Mizumoto, and Kentaro Inui. Reference-based metrics can be replaced with reference-less metrics in evaluating grammatical error correction systems. In IJCNLP, 2017.
  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
  • Behera & Bhattacharyya (2013) Bibek Behera and Pushpak Bhattacharyya. Automated grammar correction using hierarchical phrase-based statistical machine translation. In IJCNLP, 2013.
  • Brockett et al. (2006) Chris Brockett, William B Dolan, and Michael Gamon. Correcting ESL errors using phrasal SMT techniques. In COLING/ACL, 2006.
  • Bryant & Ng (2015) Christopher Bryant and Hwee Tou Ng. How far are we from fully automatic high quality grammatical error correction? In ACL/IJCNLP, 2015.
  • Bryant et al. (2017) Christopher Bryant, Mariano Felice, and E Briscoe. Automatic annotation and evaluation of error types for grammatical error correction. In ACL, 2017.
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, October 2014.
  • Chodorow et al. (2007) Martin Chodorow, Joel R Tetreault, and Na-Rae Han. Detection of grammatical errors involving prepositions. In ACL-SIGSEM workshop on prepositions, 2007.
  • Chollampatt & Ng (2017) Shamil Chollampatt and Hwee Tou Ng. Connecting the dots: Towards human-level grammatical error correction. In Workshop on Innovative Use of NLP for Building Educational Applications, 2017.
  • Chollampatt & Ng (2018) Shamil Chollampatt and Hwee Tou Ng. A multilayer convolutional encoder-decoder neural network for grammatical error correction. arXiv preprint arXiv:1801.08831, 2018.
  • Chollampatt et al. (2016a) Shamil Chollampatt, Duc Tam Hoang, and Hwee Tou Ng. Adapting grammatical error correction based on the native language of writers with neural network joint models. In EMNLP, 2016a.
  • Chollampatt et al. (2016b) Shamil Chollampatt, Kaveh Taghipour, and Hwee Tou Ng. Neural network translation models for grammatical error correction. arXiv preprint arXiv:1606.00189, 2016b.
  • Choshen & Abend (2018) Leshem Choshen and Omri Abend. Inherent biases in reference-based evaluation for grammatical error correction and text simplification. arXiv preprint arXiv:1804.11254, 2018.
  • Dahlmeier & Ng (2011) Daniel Dahlmeier and Hwee Tou Ng. Correcting semantic collocation errors with L1-induced paraphrases. In EMNLP, 2011.
  • Dahlmeier & Ng (2012a) Daniel Dahlmeier and Hwee Tou Ng. Better evaluation for grammatical error correction. In NAACL, 2012a.
  • Dahlmeier & Ng (2012b) Daniel Dahlmeier and Hwee Tou Ng. A beam-search decoder for grammatical error correction. In EMNLP/CoNLL, 2012b.
  • Dahlmeier & Ng (2012c) Daniel Dahlmeier and Hwee Tou Ng. Better evaluation for grammatical error correction. In NAACL, 2012c.
  • Dahlmeier et al. (2013) Daniel Dahlmeier, Hwee Tou Ng, and Siew Mei Wu. Building a large annotated corpus of learner English: The NUS corpus of learner English. In Workshop on Innovative Use of NLP for Building Educational Applications, 2013.
  • Dale & Kilgarriff (2011) Robert Dale and Adam Kilgarriff. Helping our own: The HOO 2011 pilot shared task. In European Workshop on Natural Language Generation, 2011.
  • De Felice & Pulman (2008) Rachele De Felice and Stephen G Pulman. A classifier-based approach to preposition and determiner error correction in L2 English. In COLING, 2008.
  • Felice & Yuan (2014) Mariano Felice and Zheng Yuan. Generating artificial errors for grammatical error correction. In Student Research Workshop at EACL, April 2014.
  • Felice et al. (2014) Mariano Felice, Zheng Yuan, Øistein E Andersen, Helen Yannakoudakis, and Ekaterina Kochmar. Grammatical error correction using hybrid systems and type filtering. In CoNLL (Shared Task), 2014.
  • Foster & Andersen (2009) Jennifer Foster and Øistein E Andersen. GenERRate: generating errors for use in grammatical error detection. In Workshop on Innovative Use of NLP for Building Educational Applications, 2009.
  • Ge et al. (2018) Tao Ge, Furu Wei, and Ming Zhou. Fluency boost learning and inference for neural grammatical error correction. In ACL, 2018.
  • Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.
  • Grundkiewicz & Junczys-Dowmunt (2018) Roman Grundkiewicz and Marcin Junczys-Dowmunt. Near human-level performance in grammatical error correction with hybrid machine translation. arXiv preprint arXiv:1804.05945, 2018.
  • Han et al. (2010) Na-Rae Han, Joel R Tetreault, Soo-Hwa Lee, and Jin-Young Ha. Using an error-annotated learner corpus to develop an ESL/EFL error correction system. In LREC, 2010.
  • He et al. (2016) Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. Dual learning for machine translation. In NIPS, 2016.
  • Hoang et al. (2016) Duc Tam Hoang, Shamil Chollampatt, and Hwee Tou Ng. Exploiting n-best hypotheses to improve an SMT approach to grammatical error correction. In IJCAI, 2016.
  • Ji et al. (2017) Jianshu Ji, Qinlong Wang, Kristina Toutanova, Yongen Gong, Steven Truong, and Jianfeng Gao. A nested attention neural hybrid model for grammatical error correction. In ACL, 2017.
  • Junczys-Dowmunt & Grundkiewicz (2014) Marcin Junczys-Dowmunt and Roman Grundkiewicz. The AMU system in the CoNLL-2014 shared task: Grammatical error correction by data-intensive and feature-rich statistical machine translation. In CoNLL (Shared Task), 2014.
  • Junczys-Dowmunt & Grundkiewicz (2016) Marcin Junczys-Dowmunt and Roman Grundkiewicz. Phrase-based machine translation is state-of-the-art for automatic grammatical error correction. arXiv preprint arXiv:1605.06353, 2016.
  • Junczys-Dowmunt et al. (2018) Marcin Junczys-Dowmunt, Roman Grundkiewicz, Shubha Guha, and Kenneth Heafield. Approaching neural grammatical error correction as a low-resource machine translation task. arXiv preprint arXiv:1804.05940, 2018.
  • Kaneko et al. (2017) Masahiro Kaneko, Yuya Sakaizawa, and Mamoru Komachi. Grammatical error detection using error- and grammaticality-specific word embeddings. In IJCNLP, 2017.
  • Leacock et al. (2010) Claudia Leacock, Martin Chodorow, Michael Gamon, and Joel Tetreault. Automated grammatical error detection for language learners. Synthesis lectures on human language technologies, 3(1):1–134, 2010.
  • Madnani et al. (2011) Nitin Madnani, Joel Tetreault, Martin Chodorow, and Alla Rozovskaya. They can help: Using crowdsourcing to improve the evaluation of grammatical error detection systems. In ACL, 2011.
  • Mizumoto & Matsumoto (2016) Tomoya Mizumoto and Yuji Matsumoto. Discriminative reranking for grammatical error correction with statistical machine translation. In NAACL, 2016.
  • Mizumoto et al. (2011) Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata, and Yuji Matsumoto. Mining revision log of language learning SNS for automated Japanese error correction of second language learners. In IJCNLP, 2011.
  • Napoles et al. (2015) Courtney Napoles, Keisuke Sakaguchi, Matt Post, and Joel Tetreault. Ground truth for grammatical error correction metrics. In ACL/IJCNLP, 2015.
  • Napoles et al. (2016) Courtney Napoles, Keisuke Sakaguchi, and Joel Tetreault. There’s no comparison: Reference-less evaluation metrics in grammatical error correction. In EMNLP, 2016.
  • Napoles et al. (2017) Courtney Napoles, Keisuke Sakaguchi, and Joel Tetreault. JFLEG: A fluency corpus and benchmark for grammatical error correction. arXiv preprint arXiv:1702.04066, 2017.
  • Ng et al. (2014) Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. The CoNLL-2014 shared task on grammatical error correction. In CoNLL (Shared Task), 2014.
  • Nicholls (2003) Diane Nicholls. The Cambridge learner corpus: Error coding and analysis for lexicography and ELT. In Corpus Linguistics 2003 conference, 2003.
  • Rei & Yannakoudakis (2016) Marek Rei and Helen Yannakoudakis. Compositional sequence labeling models for error detection in learner writing. In ACL, 2016.
  • Rei et al. (2017) Marek Rei, Mariano Felice, Zheng Yuan, and Ted Briscoe. Artificial error generation with machine translation and syntactic patterns. arXiv preprint arXiv:1707.05236, 2017.
  • Rozovskaya & Roth (2010) Alla Rozovskaya and Dan Roth. Training paradigms for correcting errors in grammar and usage. In NAACL, 2010.
  • Rozovskaya & Roth (2011) Alla Rozovskaya and Dan Roth. Algorithm selection and model adaptation for esl correction tasks. In ACL, 2011.
  • Rozovskaya & Roth (2016) Alla Rozovskaya and Dan Roth. Grammatical error correction: Machine translation and classifiers. In ACL, 2016.
  • Rozovskaya et al. (2012) Alla Rozovskaya, Mark Sammons, and Dan Roth. The UI system in the HOO 2012 shared task on error correction. In Workshop on Building Educational Applications Using NLP, 2012.
  • Rozovskaya et al. (2014) Alla Rozovskaya, Kai-Wei Chang, Mark Sammons, Dan Roth, and Nizar Habash. The Illinois-Columbia system in the CoNLL-2014 shared task. In CoNLL (Shared Task), 2014.
  • Sakaguchi et al. (2016) Keisuke Sakaguchi, Courtney Napoles, Matt Post, and Joel Tetreault. Reassessing the goals of grammatical error correction: Fluency instead of grammaticality. Transactions of the Association for Computational Linguistics, 4(1):169–182, 2016.
  • Sakaguchi et al. (2017) Keisuke Sakaguchi, Matt Post, and Benjamin Van Durme. Grammatical error correction with neural reinforcement learning. In IJCNLP, 2017.
  • Schmaltz et al. (2017) Allen Schmaltz, Yoon Kim, Alexander Rush, and Stuart Shieber. Adapting sequence models for sentence correction. In EMNLP, 2017.
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. In ACL, 2016.
  • Susanto et al. (2014) Raymond Hendy Susanto, Peter Phandi, and Hwee Tou Ng. System combination for grammatical error correction. In EMNLP, 2014.
  • Sutskever et al. (2013) Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. CoRR, abs/1409.3215, 2014.
  • Tajiri et al. (2012) Toshikazu Tajiri, Mamoru Komachi, and Yuji Matsumoto. Tense and aspect error correction for ESL learners using global context. In ACL, 2012.
  • Tetreault et al. (2010a) Joel Tetreault, Jennifer Foster, and Martin Chodorow. Using parse features for preposition selection and error detection. In ACL, 2010a.
  • Tetreault et al. (2010b) Joel R Tetreault, Elena Filatova, and Martin Chodorow. Rethinking grammatical error annotation and evaluation with the Amazon Mechanical Turk. In Workshop on Innovative Use of NLP for Building Educational Applications, 2010b.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
  • Xia et al. (2017) Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, and Tie-Yan Liu. Deliberation networks: Sequence generation beyond one-pass decoding. In NIPS, 2017.
  • Xie et al. (2016) Ziang Xie, Anand Avati, Naveen Arivazhagan, Dan Jurafsky, and Andrew Y Ng. Neural language correction with character-based attention. arXiv preprint arXiv:1603.09727, 2016.
  • Xie et al. (2018) Ziang Xie, Guillaume Genthial, Stanley Xie, Andrew Ng, and Dan Jurafsky. Noising and denoising natural language: Diverse backtranslation for grammar correction. In NAACL, 2018.
  • Yannakoudakis et al. (2017) Helen Yannakoudakis, Marek Rei, Øistein E Andersen, and Zheng Yuan. Neural sequence-labelling models for grammatical error correction. In EMNLP, 2017.
  • Yoshimoto et al. (2013) Ippei Yoshimoto, Tomoya Kose, Kensuke Mitsuzawa, Keisuke Sakaguchi, Tomoya Mizumoto, Yuta Hayashibe, Mamoru Komachi, and Yuji Matsumoto. NAIST at 2013 CoNLL grammatical error correction shared task. In CoNLL (Shared Task), 2013.
  • Yuan & Briscoe (2016) Zheng Yuan and Ted Briscoe. Grammatical error correction using neural machine translation. In NAACL, 2016.
  • Yuan & Felice (2013) Zheng Yuan and Mariano Felice. Constrained grammatical error correction using statistical machine translation. In CoNLL (Shared Task), 2013.
  • Yuan et al. (2016) Zheng Yuan, Ted Briscoe, and Mariano Felice. Candidate re-ranking for SMT-based grammatical error correction. In Workshop on Innovative Use of NLP for Building Educational Applications, 2016.
  • Zhang et al. (2018) Zhirui Zhang, Shujie Liu, Mu Li, Ming Zhou, and Enhong Chen. Joint training for neural machine translation models with monolingual data. arXiv preprint arXiv:1803.00353, 2018.