GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors

11/28/2019
by Masato Hagiwara, et al.

The lack of large-scale datasets has been a major hindrance to progress on NLP tasks such as spelling correction and grammatical error correction (GEC). As a complementary new resource for these tasks, we present the GitHub Typo Corpus, a large-scale, multilingual dataset of misspellings and grammatical errors, along with their corrections, harvested from GitHub, a large and popular platform for hosting and sharing git repositories. The dataset, which we have made publicly available, contains more than 350k edits and 65M characters in more than 15 languages, making it the largest dataset of misspellings to date. We also describe our process for filtering true typo edits using classifiers trained on a small annotated subset, and demonstrate that typo edits can be identified with F1 ≈ 0.9 using a very simple classifier with only three features. Detailed analyses of the dataset show that existing spelling correctors achieve an F-measure of only approximately 0.5, suggesting that the dataset serves as a new, rich source of spelling errors that complements existing datasets.
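To make the filtering idea concrete, here is a minimal Python sketch of how surface features of an edit and an F1 score might be computed. The abstract does not say which three features the classifier uses, so the features below (`edit_distance`, `normalized_distance`, `length_ratio`) and the function names are hypothetical illustrations, not the authors' method.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_features(src: str, tgt: str) -> dict:
    """Hypothetical surface features for one (source, target) edit pair."""
    dist = levenshtein(src, tgt)
    longest = max(len(src), len(tgt), 1)
    shortest = min(len(src), len(tgt))
    return {
        "edit_distance": dist,
        "normalized_distance": dist / longest,
        "length_ratio": shortest / longest,
    }


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A small edit such as `edit_features("recieve", "receive")` yields a low normalized distance, which is the kind of signal a typo filter could use; a real classifier would be trained on the annotated subset rather than relying on hand-picked thresholds.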

research
03/31/2021

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

We present a corpus professionally annotated for grammatical error corre...
research
04/30/2020

MLSUM: The Multilingual Summarization Corpus

We present MLSUM, the first large-scale MultiLingual SUMmarization datas...
research
08/01/2017

A Continuously Growing Dataset of Sentential Paraphrases

A major challenge in paraphrase research is the lack of parallel corpora...
research
05/28/2020

A Corpus for Large-Scale Phonetic Typology

A major hurdle in data-driven research on typology is having sufficient ...
research
06/29/2023

LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT

We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic ...
research
04/04/2023

Is ChatGPT a Highly Fluent Grammatical Error Correction System? A Comprehensive Evaluation

ChatGPT, a large-scale language model based on the advanced GPT-3.5 arch...
research
12/15/2021

ErAConD: Error Annotated Conversational Dialog Dataset for Grammatical Error Correction

Currently available grammatical error correction (GEC) datasets are comp...
