MuCGEC: a Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction

04/23/2022
by Yue Zhang, et al.

This paper presents MuCGEC, a multi-reference multi-source evaluation dataset for Chinese Grammatical Error Correction (CGEC), consisting of 7,063 sentences collected from three different Chinese-as-a-Second-Language (CSL) learner sources. Each sentence is corrected by three annotators, and their corrections are carefully reviewed by an expert, resulting in an average of 2.3 references per sentence. We conduct experiments with two mainstream CGEC models, i.e., the sequence-to-sequence (Seq2Seq) model and the sequence-to-edit (Seq2Edit) model, both enhanced with large pretrained language models (PLMs), and achieve competitive benchmark performance on both previous datasets and ours. We also discuss CGEC evaluation methodologies, including the effect of multiple references and the use of a char-based metric. Our annotation guidelines, data, and code are available at <https://github.com/HillZhang1999/MuCGEC>.
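
The idea of scoring against multiple references with a char-based metric can be illustrated with a small sketch. This is not the scorer released with MuCGEC: the function names are hypothetical, edits are extracted with Python's standard `difflib` rather than a dedicated alignment tool, and choosing the reference that yields the highest F0.5 is assumed here as a common GEC convention. It also scores a single sentence, whereas a corpus-level metric would typically accumulate counts over all sentences.

```python
from difflib import SequenceMatcher


def char_edits(source: str, target: str) -> set:
    """Return char-level edits as (start, end, replacement) spans over the source."""
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    return {
        (i1, i2, target[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    }


def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from edit counts; empty edit sets count as perfect precision/recall."""
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def score_against_references(source, hypothesis, references, beta=0.5):
    """Score a hypothesis against each reference and keep the best result.

    Returns a tuple (f_score, tp, fp, fn) for the best-matching reference.
    """
    hyp = char_edits(source, hypothesis)
    best = None
    for ref in references:
        gold = char_edits(source, ref)
        tp = len(hyp & gold)   # edits the system and this reference agree on
        fp = len(hyp - gold)   # system edits absent from this reference
        fn = len(gold - hyp)   # reference edits the system missed
        candidate = (f_beta(tp, fp, fn, beta), tp, fp, fn)
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best


if __name__ == "__main__":
    source = "我喜欢吃平果"
    hypothesis = "我喜欢吃苹果"
    references = ["我爱吃苹果", "我喜欢吃苹果"]
    print(score_against_references(source, hypothesis, references))
```

With a single reference, a valid correction that happens to differ from that reference is penalized as a false positive; adding references gives the system more acceptable targets to match, which is the effect of multiple references the abstract refers to.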


research
04/30/2022

A New Evaluation Method: Evaluation Data and Metrics for Chinese Grammar Error Correction

As a fundamental task in natural language processing, Chinese Grammatica...
research
05/24/2023

Are Pre-trained Language Models Useful for Model Ensemble in Chinese Grammatical Error Correction?

Model ensemble has been in widespread use for Grammatical Error Correcti...
research
04/11/2018

Reference-less Measure of Faithfulness for Grammatical Error Correction

We propose USim, a semantic measure for Grammatical Error Correction (G...
research
10/25/2022

Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation

Research on Korean grammatical error correction (GEC) is limited compare...
research
06/05/2023

MCTS: A Multi-Reference Chinese Text Simplification Dataset

Text simplification aims to make the text easier to understand by applyi...
research
05/18/2023

CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction

It is intractable to evaluate the performance of Grammatical Error Corre...
research
12/30/2021

YACLC: A Chinese Learner Corpus with Multidimensional Annotation

Learner corpus collects language data produced by L2 learners, that is s...
