ErAConD : Error Annotated Conversational Dialog Dataset for Grammatical Error Correction

12/15/2021
by Xun Yuan, et al.

Currently available grammatical error correction (GEC) datasets are compiled from well-formed written text, which limits their applicability to other domains such as informal writing and dialog. In this paper, we present a novel parallel GEC dataset drawn from open-domain chatbot conversations; to our knowledge, it is the first GEC dataset targeted at a conversational setting. To demonstrate the utility of the dataset, we use our annotated data to fine-tune a state-of-the-art GEC model, resulting in a 16-point increase in model precision. This gain matters because precision is considered more important than recall in GEC tasks: false positives can seriously confuse language learners. We also present a detailed annotation scheme that ranks errors by perceived impact on comprehensibility, making our dataset both reproducible and extensible. Experimental results show the effectiveness of our data in improving GEC model performance in conversational scenarios.
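
The precision-over-recall preference described above is commonly captured in GEC evaluation by the F0.5 score, which weights precision twice as heavily as recall. The sketch below is illustrative only: the abstract does not state which metric the authors report, and the function name and all edit counts are hypothetical, chosen to show how a cautious system can rank higher under F0.5 even when an aggressive one edges it out under F1.

```python
def precision_recall_fbeta(tp, fp, fn, beta=0.5):
    """Return (precision, recall, F_beta) from edit-level counts.

    tp: proposed corrections that match a gold correction
    fp: proposed corrections that match no gold correction (false positives)
    fn: gold corrections the system missed
    beta < 1 weights precision more heavily than recall.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Hypothetical counts, not taken from the paper.
cautious = dict(tp=40, fp=10, fn=60)    # few edits, mostly correct
aggressive = dict(tp=60, fp=60, fn=40)  # many edits, half of them wrong

for name, counts in [("cautious", cautious), ("aggressive", aggressive)]:
    p, r, f05 = precision_recall_fbeta(**counts, beta=0.5)
    _, _, f1 = precision_recall_fbeta(**counts, beta=1.0)
    print(f"{name}: P={p:.2f} R={r:.2f} F0.5={f05:.3f} F1={f1:.3f}")
```

With these made-up counts, F1 slightly favors the aggressive system while F0.5 clearly favors the cautious one, which matches the intuition that spurious corrections are costly for learners.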

research
07/19/2023

Enhancing conversational quality in language learning chatbots: An evaluation of GPT4 for ASR error correction

The integration of natural language processing (NLP) technologies into e...
research
04/07/2020

Interview: A Large-Scale Open-Source Corpus of Media Dialog

Existing conversational datasets consist either of written proxies for d...
research
06/27/2023

Evaluating GPT-3.5 and GPT-4 on Grammatical Error Correction for Brazilian Portuguese

We investigate the effectiveness of GPT-3.5 and GPT-4, two large languag...
research
10/15/2020

Grammatical Error Correction in Low Error Density Domains: A New Benchmark and Analyses

Evaluation of grammatical error correction (GEC) systems has primarily f...
research
09/22/2021

Actionable Conversational Quality Indicators for Improving Task-Oriented Dialog Systems

Automatic dialog systems have become a mainstream part of online custome...
research
11/28/2019

GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors

The lack of large-scale datasets has been a major hindrance to the devel...
research
11/07/2018

The RLLChatbot: a solution to the ConvAI Challenge

Current conversational systems can follow simple commands and answer bas...
