Data Weighted Training Strategies for Grammatical Error Correction

08/07/2020
by Jared Lichtarge, et al.

Recent progress in the task of Grammatical Error Correction (GEC) has been driven by addressing data sparsity, both through new methods for generating large, noisy pretraining data and through the publication of smaller, higher-quality finetuning data in the BEA-2019 shared task. Building upon recent work in Neural Machine Translation (NMT), we make use of both kinds of data by deriving example-level scores on our large pretraining data based on a smaller, higher-quality dataset. In this work, we perform an empirical study of how best to incorporate delta-log-perplexity, a type of example scoring, into a training schedule for GEC. In doing so, we perform experiments that shed light on the function and applicability of delta-log-perplexity. Models trained on the scored data achieve state-of-the-art results on common GEC test sets.
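In broad strokes, delta-log-perplexity scores each noisy pretraining example by how much a model's log-perplexity on that example drops after the model is finetuned on the small, trusted dataset. The following is a minimal sketch of that scoring, not the authors' implementation: the token_log_probs interface and every name below are illustrative assumptions, standing in for whatever sequence model supplies per-token log-probabilities.

from typing import Callable, List, Sequence, Tuple

# Assumed interface (not from the paper): token_log_probs(source, target)
# returns the model's log-probability for each token of `target` given `source`.
TokenLogProbs = Callable[[str, str], Sequence[float]]

def log_perplexity(token_log_probs: TokenLogProbs, source: str, target: str) -> float:
    """Per-token negative log-likelihood of `target` given `source`."""
    lps = token_log_probs(source, target)
    return -sum(lps) / max(len(lps), 1)

def delta_log_perplexity(base: TokenLogProbs, finetuned: TokenLogProbs,
                         source: str, target: str) -> float:
    """Log-perplexity under the base checkpoint minus log-perplexity under
    the checkpoint finetuned on the small, high-quality dataset. A positive
    score means finetuning made the example more likely, i.e. the example
    resembles the trusted data."""
    return (log_perplexity(base, source, target)
            - log_perplexity(finetuned, source, target))

def score_corpus(base: TokenLogProbs, finetuned: TokenLogProbs,
                 pairs: List[Tuple[str, str]]) -> List[float]:
    """Score every (source, target) pair in the noisy pretraining corpus."""
    return [delta_log_perplexity(base, finetuned, src, tgt) for src, tgt in pairs]

Per the abstract, the open question the paper studies empirically is how these scores should enter the training schedule; using them as example weights follows from the paper's title, while other uses (such as filtering low-scoring examples) are assumptions named here only to make the sketch concrete.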


Related research

04/16/2018
Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation
We combine two of the most popular approaches to automated Grammatical E...

08/19/2018
Neural Machine Translation of Text from Non-Native Speakers
Neural Machine Translation (NMT) systems are known to degrade when confr...

03/12/2021
Improving Translation Robustness with Visual Cues and Error Correction
Neural Machine Translation models are brittle to input noise. Current ro...

09/02/2019
An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction
The incorporation of pseudo data in the training of grammatical error co...

06/29/2019
The CUED's Grammatical Error Correction Systems for BEA-2019
We describe two entries from the Cambridge University Engineering Depart...

05/20/2016
Phrase-based Machine Translation is State-of-the-Art for Automatic Grammatical Error Correction
In this work, we study parameter tuning towards the M^2 metric, the stan...

09/08/2023
When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale
Large volumes of text data have contributed significantly to the develop...
