General and Domain Adaptive Chinese Spelling Check with Error Consistent Pretraining

03/21/2022
by Qi Lv, et al.

The lack of labeled data is one of the main bottlenecks for Chinese Spelling Check (CSC). Existing work expands the supervised corpus by automatically generating training examples from unlabeled data. However, there is a large gap between real input scenarios and the automatically generated corpus. We therefore develop a competitive general speller, ECSpell, which adopts an Error Consistent masking strategy to create data for pretraining. This strategy constrains the error types of the automatically generated sentences so that they are consistent with those found in real scenarios. Experimental results indicate that our model outperforms previous state-of-the-art models on the general benchmark. Moreover, in real life spellers often work within a particular domain. Experiments on the domain-specific datasets we built show that general models perform poorly there because of the many uncommon domain terms. Inspired by the common practice of input methods, we propose adding an alterable user dictionary to handle the zero-shot domain adaptation problem. Specifically, we attach a User Dictionary guided inference module (UD) to a general token-classification-based speller. Our experiments demonstrate that ECSpell^UD, namely ECSpell combined with UD, surpasses all other baselines by a large margin and even approaches the performance achieved on the general benchmark.
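
The abstract does not spell out how the user dictionary steers inference, so the following is only a minimal sketch of the general idea, not the authors' UD module. It assumes a token-classification speller that yields, for each input position, a small set of candidate characters with scores, and it greedily prefers a user-dictionary term whenever that term can be assembled from the candidates. The function name `decode_with_user_dict`, the `bonus` parameter, and the example scores are all hypothetical.

```python
from typing import Dict, List, Set


def decode_with_user_dict(
    candidates: List[Dict[str, float]],
    user_dict: Set[str],
    bonus: float = 0.5,
) -> str:
    """Greedy left-to-right decoding that prefers user-dictionary terms.

    candidates[i] maps each candidate character at position i to its score
    (assumed to come from a token-classification speller).
    bonus is a hypothetical per-character reward for matching a dictionary term.
    """
    # Try longer terms first; skip empty strings so decoding always advances.
    terms = sorted((t for t in user_dict if t), key=lambda t: (-len(t), t))
    out = []
    i = 0
    while i < len(candidates):
        # Default choice: the highest-scoring character at this position.
        chosen = max(candidates[i], key=candidates[i].get)
        for term in terms:
            span = candidates[i:i + len(term)]
            if len(span) < len(term):
                continue  # term would run past the end of the sentence
            if not all(ch in cand for ch, cand in zip(term, span)):
                continue  # term cannot be built from the candidates
            term_score = sum(cand[ch] for ch, cand in zip(term, span))
            term_score += bonus * len(term)
            greedy_score = sum(max(cand.values()) for cand in span)
            if term_score >= greedy_score:
                chosen = term
                break
        out.append(chosen)
        i += len(chosen)
    return "".join(out)


# Hypothetical scores: the speller slightly prefers the homophone 途 over 图.
example_candidates = [
    {"心": 0.9},
    {"电": 0.9},
    {"途": 0.6, "图": 0.4},
    {"检": 0.9},
    {"查": 0.9},
]

print(decode_with_user_dict(example_candidates, set()))
print(decode_with_user_dict(example_candidates, {"心电图"}))
```

With an empty dictionary the sketch keeps the speller's top choice at every position; once the medical term 心电图 is added, the lower-scored candidate 图 wins because it completes a dictionary term. Since the dictionary is consulted only at inference time, it can be swapped per domain without retraining, which is the property the abstract describes as "alterable".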


research
11/02/2022

Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese

The tremendous success of CLIP (Radford et al., 2021) has promoted the r...
research
04/05/2020

Improved Pretraining for Domain-specific Contextual Embedding Models

We investigate methods to mitigate catastrophic forgetting during domain...
research
08/16/2023

RSpell: Retrieval-augmented Framework for Domain Adaptive Chinese Spelling Check

Chinese Spelling Check (CSC) refers to the detection and correction of s...
research
04/15/2021

Pseudo Zero Pronoun Resolution Improves Zero Anaphora Resolution

The use of pretrained masked language models (MLMs) has drastically impr...
research
05/23/2023

CGCE: A Chinese Generative Chat Evaluation Benchmark for General and Financial Domains

Generative chat models, such as ChatGPT and GPT-4, have revolutionized n...
research
06/25/2020

Automatic Domain Adaptation Outperforms Manual Domain Adaptation for Predicting Financial Outcomes

In this paper, we automatically create sentiment dictionaries for predic...
