uChecker: Masked Pretrained Language Models as Unsupervised Chinese Spelling Checkers

09/15/2022
by   Piji Li, et al.
6

The task of Chinese Spelling Check (CSC) is aiming to detect and correct spelling errors that can be found in the text. While manually annotating a high-quality dataset is expensive and time-consuming, thus the scale of the training dataset is usually very small (e.g., SIGHAN15 only contains 2339 samples for training), therefore supervised-learning based models usually suffer the data sparsity limitation and over-fitting issue, especially in the era of big language models. In this paper, we are dedicated to investigating the unsupervised paradigm to address the CSC problem and we propose a framework named uChecker to conduct unsupervised spelling error detection and correction. Masked pretrained language models such as BERT are introduced as the backbone model considering their powerful language diagnosis capability. Benefiting from the various and flexible MASKing operations, we propose a Confusionset-guided masking strategy to fine-train the masked language model to further improve the performance of unsupervised detection and correction. Experimental results on standard datasets demonstrate the effectiveness of our proposed model uChecker in terms of character-level and sentence-level Accuracy, Precision, Recall, and F1-Measure on tasks of spelling error detection and correction respectively.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
07/08/2023

Evaluating the Capability of Large-scale Language Models on Chinese Grammatical Error Correction Task

Large-scale language models (LLMs) has shown remarkable capability in va...
research
06/03/2021

Tail-to-Tail Non-Autoregressive Sequence Prediction for Chinese Grammatical Error Correction

We investigate the problem of Chinese Grammatical Error Correction (CGEC...
research
10/19/2022

Learning from the Dictionary: Heterogeneous Knowledge Guided Fine-tuning for Chinese Spell Checking

Chinese Spell Checking (CSC) aims to detect and correct Chinese spelling...
research
06/28/2023

An Adversarial Multi-Task Learning Method for Chinese Text Correction with Semantic Detection

Text correction, especially the semantic correction of more widely used ...
research
09/14/2021

LM-Critic: Language Models for Unsupervised Grammatical Error Correction

Training a model for grammatical error correction (GEC) requires a set o...
research
06/30/2023

Progressive Multi-task Learning Framework for Chinese Text Error Correction

Chinese Text Error Correction (CTEC) aims to detect and correct errors i...
research
08/17/2023

Chinese Spelling Correction as Rephrasing Language Model

This paper studies Chinese Spelling Correction (CSC), which aims to dete...

Please sign up or login with your details

Forgot password? Click here to reset