ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information

06/30/2021
by Zijun Sun, et al.

Recent pretraining models for Chinese neglect two important aspects specific to the Chinese language: glyph and pinyin, which carry significant syntactic and semantic information for language understanding. In this work, we propose ChineseBERT, which incorporates both the glyph and pinyin information of Chinese characters into language model pretraining. The glyph embedding is obtained from different fonts of a Chinese character and captures character semantics from visual features. The pinyin embedding characterizes the pronunciation of Chinese characters, which handles the highly prevalent heteronym phenomenon in Chinese (the same character has different pronunciations with different meanings). Pretrained on a large-scale unlabeled Chinese corpus, the proposed ChineseBERT model yields a significant performance boost over baseline models with fewer training steps. The proposed model achieves new SOTA performance on a wide range of Chinese NLP tasks, including machine reading comprehension, natural language inference, text classification, and sentence pair matching, and competitive performance in named entity recognition. Code and pretrained models are publicly available at https://github.com/ShannonAI/ChineseBert.
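To make the idea of combining character, glyph, and pinyin information concrete, here is a minimal sketch in plain Python. It is not the released ChineseBERT code: the class `FusionEmbedding`, the dimension `DIM`, the random lookup tables, and the concatenate-then-project fusion are all assumptions made for illustration. In particular, the glyph vector here is a random stand-in for features that would be derived from font images, and the pinyin strings (`le4`, `yue4`) are supplied by hand. The example uses the heteronym 乐, read lè in "happy" and yuè in "music".

```python
import random

DIM = 8  # hypothetical embedding size, chosen only for illustration


def random_vector(dim, rng):
    """Return a list of `dim` floats drawn uniformly from [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def linear(weights, vector):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


class FusionEmbedding:
    """Toy fusion of character, glyph, and pinyin vectors (assumed design)."""

    def __init__(self, dim=DIM, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.char_table = {}
        self.glyph_table = {}
        self.pinyin_table = {}
        # Projection from the concatenated 3*dim vector back to dim.
        self.projection = [random_vector(3 * dim, self.rng) for _ in range(dim)]

    def _lookup(self, table, key):
        # Create a random vector the first time a key is seen, then reuse it.
        if key not in table:
            table[key] = random_vector(self.dim, self.rng)
        return table[key]

    def embed(self, char, pinyin):
        char_vec = self._lookup(self.char_table, char)
        # Stand-in for glyph features; a real model would use font images.
        glyph_vec = self._lookup(self.glyph_table, char)
        pinyin_vec = self._lookup(self.pinyin_table, pinyin)
        # Concatenate the three vectors, then project back to `dim`.
        return linear(self.projection, char_vec + glyph_vec + pinyin_vec)


model = FusionEmbedding()
happy = model.embed("乐", "le4")   # 乐 as in "happy"
music = model.embed("乐", "yue4")  # 乐 as in "music"
```

Because the character and glyph vectors are identical for both calls, any difference between `happy` and `music` comes from the pinyin vector alone, which is the sense in which a pronunciation signal can separate the readings of a heteronym.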


research
02/21/2022

StyleBERT: Chinese pretraining by font style information

With the success of down streaming task using English pre-trained langua...
research
12/10/2020

HRCenterNet: An Anchorless Approach to Chinese Character Segmentation in Historical Documents

The information provided by historical documents has always been indispe...
research
06/05/2020

Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

With the success of language pretraining, it is highly desirable to deve...
research
05/12/2021

BertGCN: Transductive Text Classification by Combining GCN and BERT

In this work, we propose BertGCN, a model that combines large scale pret...
research
10/05/2022

Reading Chinese in Natural Scenes with a Bag-of-Radicals Prior

Scene text recognition (STR) on Latin datasets has been extensively stud...
research
09/16/2022

ConFiguRe: Exploring Discourse-level Chinese Figures of Speech

Figures of speech, such as metaphor and irony, are ubiquitous in literat...
research
12/08/2022

Investigating Glyph Phonetic Information for Chinese Spell Checking: What Works and What's Next

While pre-trained Chinese language models have demonstrated impressive p...
