Synthetic Alone: Exploring the Dark Side of Synthetic Data for Grammatical Error Correction

by   Chanjun Park, et al.

Data-centric AI approach aims to enhance the model performance without modifying the model and has been shown to impact model performance positively. While recent attention has been given to data-centric AI based on synthetic data, due to its potential for performance improvement, data-centric AI has long been exclusively validated using real-world data and publicly available benchmark datasets. In respect of this, data-centric AI still highly depends on real-world data, and the verification of models using synthetic data has not yet been thoroughly carried out. Given the challenges above, we ask the question: Does data quality control (noise injection and balanced data), a data-centric AI methodology acclaimed to have a positive impact, exhibit the same positive impact in models trained solely with synthetic data? To address this question, we conducted comparative analyses between models trained on synthetic and real-world data based on grammatical error correction (GEC) task. Our experimental results reveal that the data quality control method has a positive impact on models trained with real-world data, as previously reported in existing studies, while a negative impact is observed in models trained solely on synthetic data.


page 1

page 2

page 3

page 4


Reducing the Amount of Real World Data for Object Detector Training with Synthetic Data

A number of studies have investigated the training of neural networks wi...

Controllable Data Synthesis Method for Grammatical Error Correction

Due to the lack of parallel data in current Grammatical Error Correction...

Judge a Sentence by Its Content to Generate Grammatical Errors

Data sparsity is a well-known problem for grammatical error correction (...

PeopleSansPeople: A Synthetic Data Generator for Human-Centric Computer Vision

In recent years, person detection and human pose estimation have made gr...

Evaluation of large-scale synthetic data for Grammar Error Correction

Grammar Error Correction(GEC) mainly relies on the availability of high ...

DataCLUE: A Benchmark Suite for Data-centric NLP

Data-centric AI has recently proven to be more effective and high-perfor...

Exploring the Potential of AI-Generated Synthetic Datasets: A Case Study on Telematics Data with ChatGPT

This research delves into the construction and utilization of synthetic ...

Please sign up or login with your details

Forgot password? Click here to reset