Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation

10/25/2022
by Soyoung Yoon, et al.

Research on Korean grammatical error correction (GEC) is limited compared to research on other major languages such as English and Chinese. We attribute this gap to the lack of a carefully designed evaluation benchmark for Korean. In this work, we first collect three datasets from different sources (Kor-Lang8, Kor-Native, and Kor-Learner) to cover a wide range of error types, and annotate them with our newly proposed tool, the Korean Automatic Grammatical error Annotation System (KAGAS). KAGAS is an edit alignment and classification tool that takes the characteristics of Korean into account when aligning a source sentence with a target sentence, and that identifies the error type of each aligned edit. We also present baseline models fine-tuned on our datasets. We show that a model trained on our datasets significantly outperforms the public statistical GEC system (Hanspell) on a wider range of error types, demonstrating the diversity and usefulness of the datasets.
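To make the idea of aligning a source sentence with a target sentence more concrete, the following is a minimal sketch of word-level edit extraction. It is not KAGAS and does not reproduce its alignment or error-type classification: it assumes plain whitespace tokenization and uses Python's standard `difflib.SequenceMatcher`, and the function name `align_edits` is a hypothetical helper introduced only for illustration.

```python
from difflib import SequenceMatcher


def align_edits(source: str, target: str):
    """Return the non-matching spans between two sentences.

    Assumption: tokens are whitespace-separated units. A Korean-aware
    aligner would likely need morpheme-level analysis, which this
    sketch does not attempt.
    """
    src_tokens = source.split()
    tgt_tokens = target.split()
    # autojunk=False disables difflib's heuristic for long sequences,
    # so every token is considered during matching.
    matcher = SequenceMatcher(a=src_tokens, b=tgt_tokens, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged span, not an edit
        # tag is one of "replace", "delete", or "insert"
        edits.append((tag, src_tokens[i1:i2], tgt_tokens[j1:j2]))
    return edits


if __name__ == "__main__":
    # A spacing error: two words written as one in the source sentence.
    for edit in align_edits("오늘 날씨가좋다", "오늘 날씨가 좋다"):
        print(edit)
```

Each returned tuple holds the operation, the source tokens, and the target tokens for one edit. Assigning an error type to each edit, as the abstract describes, would be a separate step layered on top of such an alignment.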


