Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training

05/30/2023
by Yuxuan Wang, et al.

We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese pre-trained language models (PLMs) with dictionary knowledge and the structure of Chinese characters. We name the two core modules of CDBERT Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries and Jiezi refers to the process of enhancing characters' glyph representations with structure understanding. To facilitate dictionary understanding, we propose three pre-training tasks: Masked Entry Modeling, Contrastive Learning for Synonym and Antonym, and Example Learning. We evaluate our method on both the modern Chinese understanding benchmark CLUE and the ancient Chinese benchmark CCLUE. Moreover, we propose a new polysemy discrimination task, PolyMRC, based on the collected dictionary of ancient Chinese. Our paradigm demonstrates consistent improvements over previous Chinese PLMs across all tasks. In addition, our approach yields a significant boost in the few-shot setting of ancient Chinese understanding.
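
To make the idea behind Contrastive Learning for Synonym and Antonym more concrete, here is a minimal sketch of a generic contrastive objective that pulls an entry's embedding toward a synonym and pushes it away from an antonym. The abstract does not specify the loss actually used in CDBERT, so the InfoNCE-style form, the function names `cosine` and `contrastive_loss`, the `temperature` parameter, and the toy vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, synonym, antonym, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's formula).

    The synonym is treated as the positive example and the antonym as the
    single negative example. The loss is small when the anchor is closer
    to the synonym than to the antonym.
    """
    pos = math.exp(cosine(anchor, synonym) / temperature)
    neg = math.exp(cosine(anchor, antonym) / temperature)
    return -math.log(pos / (pos + neg))


# Toy embeddings (hypothetical values, not taken from any model).
anchor = [1.0, 0.0]
synonym = [0.9, 0.1]
antonym = [-1.0, 0.0]

well_separated = contrastive_loss(anchor, synonym, antonym)
# Swapping the roles simulates an encoder that confuses the two.
confused = contrastive_loss(anchor, antonym, synonym)
```

In this toy setup, `well_separated` is much smaller than `confused`, which is the behavior a synonym/antonym contrastive objective is meant to reward. A real implementation would compute the embeddings with the encoder and typically use many negatives per batch.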


research
10/09/2020

Learning to Pronounce Chinese Without a Pronunciation Dictionary

We demonstrate a program that learns to pronounce Chinese text in Mandar...
research
10/11/2022

Revisiting and Advancing Chinese Natural Language Understanding with Accelerated Heterogeneous Knowledge Pre-training

Recently, knowledge-enhanced pre-trained language models (KEPLMs) improv...
research
04/13/2020

CLUE: A Chinese Language Understanding Evaluation Benchmark

We introduce CLUE, a Chinese Language Understanding Evaluation benchmark...
research
08/01/2022

DictBERT: Dictionary Description Knowledge Enhanced Language Model Pre-training via Contrastive Learning

Although pre-trained language models (PLMs) have achieved state-of-the-a...
research
03/01/2022

Exploring and Adapting Chinese GPT to Pinyin Input Method

While GPT has become the de-facto method for text generation tasks, its ...
research
10/19/2022

Learning from the Dictionary: Heterogeneous Knowledge Guided Fine-tuning for Chinese Spell Checking

Chinese Spell Checking (CSC) aims to detect and correct Chinese spelling...
research
10/19/2016

Chinese Restaurant Process for cognate clustering: A threshold free approach

In this paper, we introduce a threshold free approach, motivated from Ch...
