"Is Whole Word Masking Always Better for Chinese BERT?": Probing on Chinese Grammatical Error Correction

03/01/2022
by Yong Dai, et al.

Whole word masking (WWM), which masks all subwords corresponding to a word at once, makes for a better English BERT model. For Chinese, however, there are no subwords because each token is an atomic character. A Chinese word differs in that it is a compositional unit consisting of multiple characters. This difference motivates us to investigate whether WWM leads to better context-understanding ability for Chinese BERT. To this end, we introduce two probing tasks related to grammatical error correction and ask pretrained models to revise or insert tokens in a masked language modeling manner. We construct a dataset with labels for 19,075 tokens in 10,448 sentences. We train three Chinese BERT models with standard character-level masking (CLM), WWM, and a combination of CLM and WWM, respectively. Our major findings are as follows: First, when one character needs to be inserted or replaced, the model trained with CLM performs best. Second, when more than one character needs to be handled, WWM is the key to better performance. Finally, when fine-tuned on sentence-level downstream tasks, models trained with different masking strategies perform comparably.
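The probing setup can be pictured as an ordinary masked-language-modeling query. The sketch below is a minimal illustration, assuming the Hugging Face transformers library and the public bert-base-chinese checkpoint as a stand-in (the paper's own CLM, WWM, and CLM+WWM models are not assumed to be available); the example sentences and the model choice are illustrative, not taken from the paper.

```python
from transformers import pipeline

# Illustrative sketch only: the paper probes its own CLM / WWM / CLM+WWM
# models; bert-base-chinese is used here as a publicly available stand-in.
fill_mask = pipeline("fill-mask", model="bert-base-chinese")

# Probing task 1 (revise): "我今天很高心" misuses 心 in the word 高兴 ("happy").
# Mask the wrong character and let the model propose a replacement.
for pred in fill_mask("我今天很高[MASK]。", top_k=3):
    print(pred["token_str"], round(pred["score"], 3))

# Probing task 2 (insert): "我在图馆看书" drops 书 from 图书馆 ("library").
# Add a [MASK] slot at the missing position and let the model fill it in.
for pred in fill_mask("我在图[MASK]馆看书。", top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
```

In this framing, a single-character revision or insertion needs only one [MASK] slot, while handling a multi-character word requires filling several slots at once, which is where the CLM and WWM pretraining strategies are expected to diverge.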



Related research

05/15/2020 - Spelling Error Correction with Soft-Masked BERT
Spelling error correction is an important yet challenging task because a...

04/26/2022 - Pretraining Chinese BERT for Detecting Word Insertion and Deletion Errors
Chinese BERT models achieve remarkable progress in dealing with grammati...

08/20/2022 - BSpell: A CNN-blended BERT Based Bengali Spell Checker
Bengali typing is mostly performed using English keyboard and can be hig...

08/17/2023 - Chinese Spelling Correction as Rephrasing Language Model
This paper studies Chinese Spelling Correction (CSC), which aims to dete...

07/03/2017 - Multiscale sequence modeling with a learned dictionary
We propose a generalization of neural network sequence models. Instead o...

06/01/2021 - SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining
Conventional tokenization methods for Chinese pretrained language models...

01/16/2023 - An Error-Guided Correction Model for Chinese Spelling Error Correction
Although existing neural network approaches have achieved great success ...
