Corpora Generation for Grammatical Error Correction

04/10/2019
by Jared Lichtarge, et al.

Grammatical Error Correction (GEC) has recently been modeled using the sequence-to-sequence framework. However, unlike sequence transduction problems such as machine translation, GEC suffers from a lack of plentiful parallel data. We describe two approaches for generating large parallel datasets for GEC using publicly available Wikipedia data. The first method extracts source-target pairs from Wikipedia edit histories with minimal filtration heuristics, while the second method introduces noise into Wikipedia sentences via round-trip translation through bridge languages. Both strategies yield similarly sized parallel corpora containing around 4B tokens. We employ an iterative decoding strategy that is tailored to the loosely supervised nature of our constructed corpora. We demonstrate that neural GEC models trained on either type of corpus give similar performance. Fine-tuning these models on the Lang-8 corpus and ensembling allows us to surpass the state of the art on both the CoNLL-2014 benchmark and the JFLEG task. We provide a systematic analysis that compares the two approaches to data generation and highlights the effectiveness of ensembling.
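As a rough illustration of the second data-generation method, the sketch below pairs each clean sentence with a noisy version obtained by translating it into a bridge language and back. It is a minimal sketch, not the authors' pipeline: the `translate` callable, the `round_trip_pairs` helper, and the toy translator are all hypothetical stand-ins for a real machine translation system, and the choice of "ja" as the bridge language is only an example.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical translator signature: (text, source_lang, target_lang) -> text.
Translate = Callable[[str, str, str], str]


def round_trip_pairs(
    sentences: Iterable[str],
    translate: Translate,
    bridge_lang: str,
    lang: str = "en",
) -> List[Tuple[str, str]]:
    """Build (noisy_source, clean_target) pairs via round-trip translation."""
    pairs = []
    for clean in sentences:
        # Translate into the bridge language, then back into the original one.
        bridged = translate(clean, lang, bridge_lang)
        noisy = translate(bridged, bridge_lang, lang)
        # The round-tripped sentence plays the role of the erroneous source;
        # the original sentence is the correction target.
        pairs.append((noisy, clean))
    return pairs


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Toy stand-in for an MT system (assumption, for illustration only).

    Leaving English drops articles; coming back returns the text unchanged,
    so the round trip loses the articles.
    """
    if src == "en":
        articles = {"a", "an", "the"}
        return " ".join(w for w in text.split() if w.lower() not in articles)
    return text


pairs = round_trip_pairs(["She read the book"], toy_translate, bridge_lang="ja")
print(pairs)
```

With a real translation system in place of `toy_translate`, the noise would come from genuine translation errors rather than a fixed rule, which is the property the method relies on.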

research
10/31/2018

Weakly Supervised Grammatical Error Correction using Iterative Decoding

We describe an approach to Grammatical Error Correction (GEC) that is ef...
research
05/09/2021

FastCorrect: Fast Error Correction with Edit Alignment for Automatic Speech Recognition

Error correction techniques have been used to refine the output sentence...
research
09/01/2021

An Unsupervised Method for Building Sentence Simplification Corpora in Multiple Languages

The availability of parallel sentence simplification (SS) is scarce for ...
research
12/26/2019

Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation

Lectures translation is a case of spoken language translation and there ...
research
05/26/2020

GECToR – Grammatical Error Correction: Tag, Not Rewrite

In this paper, we present a simple and efficient GEC sequence tagger usi...
research
10/07/2019

Parallel Iterative Edit Models for Local Sequence Transduction

We present a Parallel Iterative Edit (PIE) model for the problem of loca...
research
07/02/2019

A Neural Grammatical Error Correction System Built On Better Pre-training and Sequential Transfer Learning

Grammatical error correction can be viewed as a low-resource sequence-to...