CSCD-IME: Correcting Spelling Errors Generated by Pinyin IME

11/16/2022
by Yong Hu, et al.

Chinese Spelling Correction (CSC) is the task of detecting and correcting spelling mistakes in text. Most Chinese text is entered through a pinyin input method editor (IME), so studying the spelling errors that arise in this process is both practical and valuable. However, no prior research has been dedicated to this essential scenario. In this paper, we first present a Chinese Spelling Correction Dataset for errors generated by pinyin IME (CSCD-IME), which contains 40,000 annotated sentences drawn from real posts by official media accounts on Sina Weibo. Furthermore, we propose a novel method that automatically constructs large-scale, high-quality pseudo data by simulating input through a pinyin IME. A series of analyses and experiments on CSCD-IME show that spelling errors produced by pinyin IME follow a particular distribution at the pinyin level and the semantic level, and that they are challenging to correct. Meanwhile, our proposed pseudo-data construction method can better fit this error distribution and improve the performance of CSC systems. Finally, we provide a practical guide to using pseudo data, covering the data scale, the data source, and the training strategy.
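
To make the idea of simulating pinyin IME input concrete, here is a minimal Python sketch. It is not the authors' construction method, whose details are not given in the abstract; it only illustrates the general principle that an IME maps a typed pinyin string to several candidate words, and that picking the wrong candidate yields a realistic spelling error. The lexicon `CANDIDATES`, the reverse map `WORD_TO_PINYIN`, and the function `simulate_ime_error` are all hypothetical names invented for this illustration, and the sentence is assumed to be already segmented into words.

```python
import random

# Hypothetical toy lexicon: pinyin string -> candidate words an IME might offer.
# A real IME has a far larger lexicon and ranks candidates by context.
CANDIDATES = {
    "yijing": ["已经", "意境", "易经"],
    "zaijian": ["再见", "在建"],
}

# Hypothetical reverse map: word -> the pinyin a user would type for it.
WORD_TO_PINYIN = {
    word: pinyin for pinyin, words in CANDIDATES.items() for word in words
}


def simulate_ime_error(words, error_rate, rng):
    """Return a copy of `words` where some words are replaced by a
    different candidate that shares the same pinyin.

    words:      list of already-segmented words (assumption: pre-tokenized)
    error_rate: probability of replacing a word that has homophone candidates
    rng:        a random.Random instance, passed in for reproducibility
    """
    noisy = []
    for word in words:
        pinyin = WORD_TO_PINYIN.get(word)
        # Candidates with the same pinyin, excluding the intended word.
        others = [c for c in CANDIDATES.get(pinyin, []) if c != word]
        if others and rng.random() < error_rate:
            noisy.append(rng.choice(others))  # simulate a wrong selection
        else:
            noisy.append(word)  # word typed correctly or not in the lexicon
    return noisy


if __name__ == "__main__":
    sentence = ["我", "已经", "到", "了"]
    print(simulate_ime_error(sentence, error_rate=1.0, rng=random.Random(0)))
```

Words missing from the toy lexicon pass through unchanged, so the noisy sentence always has the same number of words as the input. The `error_rate` parameter is one simple way to control how much noise is injected; the abstract does not state how the paper sets the error density.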

research
10/19/2022

Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical Error Correction

Chinese Grammatical Error Correction (CGEC) is both a challenging NLP ta...
research
05/31/2021

Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models

A sequence-to-sequence learning with neural networks has empirically pro...
research
10/22/2022

FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction

Grammatical Error Correction (GEC) has been broadly applied in automatic...
research
10/23/2022

Focus Is What You Need For Chinese Grammatical Error Correction

Chinese Grammatical Error Correction (CGEC) aims to automatically detect...
research
05/09/2023

CSED: A Chinese Semantic Error Diagnosis Corpus

Recently, much Chinese text error correction work has focused on Chinese...
research
08/27/2020

Adaptable Filtering using Hierarchical Embeddings for Chinese Spell Check

Spell check is a useful application which involves processing noisy huma...
research
08/11/2022

Overview of CTC 2021: Chinese Text Correction for Native Speakers

In this paper, we present an overview of the CTC 2021, a Chinese text co...
