Revisiting Grammatical Error Correction Evaluation and Beyond

11/03/2022
by   Peiyuan Gong, et al.

Pretraining-based (PT-based) automatic evaluation metrics (e.g., BERTScore and BARTScore) have been widely used in several sentence generation tasks (e.g., machine translation and text summarization) because they correlate better with human judgments than traditional overlap-based metrics. Although PT-based methods have become the de facto standard for training grammatical error correction (GEC) systems, GEC evaluation has yet to benefit from pretrained knowledge. This paper takes the first step towards understanding and improving GEC evaluation with pretraining. We first find that naively applying PT-based metrics to GEC evaluation yields unsatisfactory correlations because these metrics pay excessive attention to inessential parts of system outputs (e.g., unchanged parts). To alleviate this limitation, we propose a novel GEC evaluation metric, PT-M2, which achieves the best of both worlds by applying PT-based metrics only to the corrected parts. Experimental results on the CoNLL14 evaluation task show that PT-M2 significantly outperforms existing methods, achieving a new state-of-the-art Pearson correlation of 0.949. Further analysis reveals that PT-M2 is robust when evaluating competitive GEC systems. Source code and scripts are freely available at https://github.com/pygongnlp/PT-M2.
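To make the core idea concrete, the sketch below applies a PT-based metric (BERTScore) only to corrected spans, in the spirit of PT-M2. It is an illustrative reconstruction, not the authors' implementation (see the repository above for the official code): edit extraction via difflib stands in for the M2-style edit alignment, and the helper names pt_weighted_score and edit_weight are hypothetical. It assumes the bert-score package (pip install bert-score) is available.

```python
# Minimal sketch of the PT-M2 idea: weight each correction (edit) by a
# PT-based metric instead of scoring the whole hypothesis sentence.
# Illustrative only; NOT the authors' code (https://github.com/pygongnlp/PT-M2).
import difflib
from bert_score import score as bert_score  # pip install bert-score

def extract_edits(source_tokens, hyp_tokens):
    """Return (i1, i2, replacement) edits turning the source into the hypothesis.
    Stands in for the M2 edit alignment used by the real metric."""
    matcher = difflib.SequenceMatcher(a=source_tokens, b=hyp_tokens)
    return [(i1, i2, hyp_tokens[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]

def edit_weight(source_tokens, edit, reference):
    """Score one edit by applying it alone and measuring the BERTScore gain
    of the partially corrected sentence over the uncorrected source."""
    i1, i2, repl = edit
    edited = source_tokens[:i1] + repl + source_tokens[i2:]
    cands = [" ".join(edited), " ".join(source_tokens)]
    _, _, f1 = bert_score(cands, [reference, reference], lang="en", verbose=False)
    # Only reward edits that move the sentence toward the reference.
    return max((f1[0] - f1[1]).item(), 0.0)

def pt_weighted_score(source, hypothesis, reference):
    """Average PT-based weight over the hypothesis's edits (0 if unchanged),
    so unchanged parts contribute nothing to the score."""
    src, hyp = source.split(), hypothesis.split()
    edits = extract_edits(src, hyp)
    if not edits:
        return 0.0
    return sum(edit_weight(src, e, reference) for e in edits) / len(edits)

if __name__ == "__main__":
    src = "He go to school yesterday ."
    hyp = "He went to school yesterday ."
    ref = "He went to school yesterday ."
    print(f"PT-weighted edit score: {pt_weighted_score(src, hyp, ref):.4f}")
```

The key design point this sketch mirrors is that only the edited spans are scored by the pretrained metric; the full PT-M2 metric additionally matches system edits against gold M2 annotations and aggregates them into an F0.5-style score.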


Related research

10/07/2016
There's No Comparison: Reference-less Evaluation Metrics in Grammatical Error Correction
Current methods for automatically evaluating grammatical error correctio...

10/01/2019
Grammatical Error Correction in Low-Resource Scenarios
Grammatical error correction in English is a long studied problem with m...

04/30/2022
A New Evaluation Method: Evaluation Data and Metrics for Chinese Grammar Error Correction
As a fundamental task in natural language processing, Chinese Grammatica...

11/02/2022
Dialect-robust Evaluation of Generated Text
Evaluation metrics that are not robust to dialect variation make it impo...

03/15/2023
ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark
ChatGPT is a cutting-edge artificial intelligence language model develop...

04/28/2022
UniTE: Unified Translation Evaluation
Translation quality evaluation plays a crucial role in machine translati...

09/23/2020
Improving Dialog Evaluation with a Multi-reference Adversarial Dataset and Large Scale Pretraining
There is an increasing focus on model-based dialog evaluation metrics su...
