Towards Minimal Supervision BERT-based Grammar Error Correction

01/10/2020 · Yiyuan Li, et al. · Carnegie Mellon University

Current grammatical error correction (GEC) models typically treat the task as sequence generation, which requires large amounts of annotated data and limits their applicability in data-limited settings. We try to incorporate contextual information from pre-trained language models to better leverage limited annotation and to benefit multilingual scenarios. Results show the strong potential of Bidirectional Encoder Representations from Transformers (BERT) for the grammatical error correction task.

Introduction

The goal of grammatical error correction is to detect and correct all errors in a sentence and return the corrected sentence. Current grammatical error correction approaches require a large amount of training data to achieve reasonable results, which unfortunately means they cannot be extended to languages with limited data. Recently, unsupervised models pre-trained on large corpora have boosted performance on many natural language processing tasks, which indicates the potential of leveraging such models for GEC in any language. In this work, we try to utilize (multilingual) BERT [3] in order to perform grammatical error correction with minimal supervision.

Proposed Method

Our approach divides the GEC task into two stages: error identification followed by correction. In the first stage, we detect the spans in the original text to which edits should apply. We formulate this as a sequence labelling task in which each token receives one of the labels {remain, substitution, insert, delete}. In the second stage (correction) we employ a pre-trained model such as BERT. The labels from the error identification stage guide the masking of the input: tokens labelled substitution are replaced with [MASK], and [MASK] tokens are inserted at positions labelled insert. We then obtain candidate outputs for every masked token. A shortcoming of our current approach is that it only produces as many corrections as masked input tokens; however, most grammar errors in fact tend to be edits of length 1 or 2, which are captured by our identification labels. We address this shortcoming in future work.
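To make the pipeline concrete, the following is a minimal sketch of the correction stage, assuming oracle identification labels and the HuggingFace transformers library. The function names, the choice of bert-base-cased, and the convention of inserting [MASK] after a token labelled insert are our illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of label-guided masking and masked-LM candidate generation.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

def apply_labels(tokens, labels):
    """Build a masked input from identification labels:
    'remain' keeps the token, 'substitution' replaces it with [MASK],
    'insert' adds a [MASK] after the token (an assumption), 'delete' drops it."""
    masked = []
    for tok, lab in zip(tokens, labels):
        if lab == "remain":
            masked.append(tok)
        elif lab == "substitution":
            masked.append(tokenizer.mask_token)
        elif lab == "insert":
            masked.extend([tok, tokenizer.mask_token])
        elif lab == "delete":
            continue
    return masked

def top_k_corrections(masked_tokens, k=5):
    """Return the top-k BERT candidates for every [MASK] position."""
    inputs = tokenizer(" ".join(masked_tokens), return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    return [
        tokenizer.convert_ids_to_tokens(logits[pos].topk(k).indices.tolist())
        for pos in mask_positions
    ]

# Example: the misspelled "recomend" is labelled as a substitution.
tokens = "The aim of this report is to recomend you to visit".split()
labels = ["remain"] * 7 + ["substitution"] + ["remain"] * 3
print(top_k_corrections(apply_labels(tokens, labels)))
```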

Results

We report preliminary results on the test portion of the English FCE dataset [7]. Edits are labelled and scored with ERRANT [1].

For our preliminary experiments we focus on a simplified single-edit setting, where we attempt to correct sentences with a single error (assuming oracle error identification annotations). The goal is to assess the capabilities of pre-trained BERT-like models assuming perfect performance for all other components (e.g. error identification).

We expand the original dataset to fit our single-error scenario with the following two schemes: (1) each edit: for each edit, all other corrections are applied, creating a single-error sentence; (2) last edit: all corrections except the last one are applied. This yields 3,585 and 1,024 correction pairs respectively. We also employ different strategies for deciding the number of masked tokens: (1) based on the span length of the original edit, (2) based on the length of the final correction (given by an oracle), and (3) using a single mask and measuring whether any token of the correction is predicted. Note that subword predictions such as {ad, ##e, ##quate} are merged into adequate for sentence-level evaluation, but remain separate in mask-level evaluation. Our preliminary results under the various settings are outlined in Table 1. We find that the different masking strategies have comparable performance, with slightly higher accuracy when using a single mask. Interestingly, BERT seems capable of producing corrections with quite high precision, above 70%.
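As an illustration, the sketch below shows how the three masking strategies decide the number of [MASK] tokens and how WordPiece predictions are merged before sentence-level scoring. The helper names are ours, not from the paper's code.

```python
# Illustrative helpers for the masking-count strategies and subword merging.

def num_masks(strategy, source_span, target_span):
    """Decide how many [MASK] tokens replace the edited span."""
    if strategy == "origin":   # length of the erroneous span ("# origin")
        return max(len(source_span), 1)
    if strategy == "target":   # oracle length of the correction ("# target")
        return max(len(target_span), 1)
    if strategy == "single":   # always a single mask
        return 1
    raise ValueError(strategy)

def merge_wordpieces(pieces):
    """Merge subword predictions, e.g. ['ad', '##e', '##quate'] -> 'adequate'."""
    out = []
    for p in pieces:
        if p.startswith("##") and out:
            out[-1] += p[2:]
        else:
            out.append(p)
    return out

assert merge_wordpieces(["ad", "##e", "##quate"]) == ["adequate"]
assert num_masks("single", ["bus", "number", "8"], ["number", "8", "bus"]) == 1
```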

Masking Strategy      each edit                  last edit
                      P@1    R@1    F@1          P@1    R@1    F@1
# origin              0.632  0.853  0.667        0.592  0.824  0.627
# target              0.660  0.887  0.696        0.614  0.855  0.651
single                0.763  0.931  0.790        0.767  0.920  0.794

Table 1: Sentence-level evaluation with different masking strategies for single-edit pairs. Subword predictions are merged in sentence generation.

Masking Strategy      each edit             last edit
                      Acc@1   Acc@5         Acc@1   Acc@5
# origin              0.292   0.455         0.229   0.390
# target              0.313   0.484         0.247   0.405
single                0.365   0.554         0.312   0.501

Table 2: Mask-level accuracy with different masking strategies. A reranking mechanism could lead to better results, as accuracy@5 is higher than accuracy@1.

We also study whether the correct output is among the top-k candidates suggested by BERT. We compare mask-level accuracy when only the most probable prediction is considered against accuracy over the top k candidates, with k set to 5. From the results in Table 2, we observe that an appropriate reranking model could further boost performance by selecting the right correction among the candidates.

Related and Future work

Example #1: Redundant Edits
Source  Of course there ’s also a number 8 bus in front of the hotel , which is also suitable , but it leaves only every half an hour
Mask.   Of course there ’s also a number 8 bus [MASK] in front of the hotel , which is also suitable , but it leaves only every half an hour
Target  Of course there ’s also a number 8 bus , in front of the hotel , which is also suitable , but it leaves only every half an hour
Ours    Of course there ’s also a number 8 bus stop in front of the hotel , which is also suitable , but it leaves only every half an hour
Example #2: Synonyms
Source  The aim of this report is to recomend you to visit the Fuerte de San Diego Museum
Mask.   The aim of this report is to [MASK] you to visit the Fuerte de San Diego Museum
Target  The aim of this report is to recommend you to visit the Fuerte de San Diego Museum
Ours    The aim of this report is to allow you to visit the Fuerte de San Diego Museum
Example #3: Hallucination
Source  Of course there ’s also a bus number 8 , in front of the hotel , which is also suitable , but it leaves only every half an hour
Mask.   Of course there ’s also a [MASK] [MASK] [MASK] , in front of the hotel , which is also suitable , but it leaves only every half an hour
Target  Of course there ’s also a number 8 bus , in front of the hotel , which is also suitable , but it leaves only every half an hour
Ours    Of course there ’s also a small parking station , in front of the hotel , which is also suitable , but it leaves only every half an hour
Table 3: Common BERT prediction errors (the original error and the prediction appear in the Source and Ours rows).

A retrieve-and-edit model has been proposed for text generation [5]. However, the editing is performed in a single pass, and sentences with multiple grammatical errors can further reduce the similarity between the correct form and the oracle sentence. Iterative decoding [4] and neural language models used as scoring functions [2] have also been employed for GEC. To the best of our knowledge, there is no prior work applying pre-trained contextual models to grammatical error correction. In future work, we will additionally model error fertility, allowing us to predict the exact number of necessary [MASK] tokens. Last, we will employ a reranking mechanism which scores the candidate outputs from BERT, taking into account larger context and specific grammatical properties.

Better span detection

Although BERT can predict all the missing tokens in a sentence in a reasonable way, predicting the correct words easily degenerates into redundant editing. Our experiments show that simply rephrasing the whole sentence with BERT leads to overly diverse output. Instead, a prior error span detection step appears necessary for efficient GEC, and it is part of our future work.

Partial masking and fluency measures

Multi-masking, or masking an informative part of the sentence, leads to a loss of the original information and allows unwanted freedom in the predictions; see Table 3 for examples. Put plainly, multi-masking allows BERT to get too creative. Instead, we will investigate partial masking strategies [8], which could alleviate this problem. Fluency is an important measure when employing an iterative approach [6]. We plan to explore fluency measures as part of our reranking mechanism.
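As one concrete possibility for such a reranker, the sketch below scores candidate corrections by their pseudo-log-likelihood under a masked language model (mask each token in turn and sum its log-probability). This is our own illustration of a fluency measure, not the metric of [6] or the scoring function of [2].

```python
# A sketch of masked-LM pseudo-log-likelihood as a fluency score for reranking.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of log P(token | rest of sentence), masking one token at a time."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):   # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# Rerank two candidate corrections by fluency (higher score is more fluent).
candidates = [
    "The aim of this report is to recommend you to visit the museum .",
    "The aim of this report is to allow you to visit the museum .",
]
print(max(candidates, key=pseudo_log_likelihood))
```

Note that this score requires one forward pass per token, so in practice it would only be applied to the small set of top-k candidates produced by the correction stage.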

References

  • [1] C. Bryant, M. Felice, and T. Briscoe (2017) Automatic annotation and evaluation of error types for grammatical error correction. In Proc. ACL, Vancouver, Canada.
  • [2] Y. J. Choe, J. Ham, K. Park, and Y. Yoon (2019) A neural grammatical error correction system built on better pre-training and sequential transfer learning. In Proc. BEA@ACL, Florence, Italy.
  • [3] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL-HLT.
  • [4] T. Ge, F. Wei, and M. Zhou (2018) Fluency boost learning and inference for neural grammatical error correction. In Proc. ACL.
  • [5] K. Guu, T. B. Hashimoto, Y. Oren, and P. Liang (2018) Generating sentences by editing prototypes. TACL 6, pp. 437–450.
  • [6] C. Napoles, K. Sakaguchi, and J. Tetreault (2016) There’s no comparison: reference-less evaluation metrics in grammatical error correction. In Proc. EMNLP, pp. 2109–2115.
  • [7] H. Yannakoudakis, T. Briscoe, and B. Medlock (2011) A new dataset and method for automatically grading ESOL texts. In Proc. ACL.
  • [8] W. Zhou, T. Ge, K. Xu, F. Wei, and M. Zhou (2019) BERT-based lexical substitution. In Proc. ACL, Florence, Italy.