Exploiting Word Semantics to Enrich Character Representations of Chinese Pre-trained Models

07/13/2022
by   Wenbiao Li, et al.

Most Chinese pre-trained models adopt characters as the basic units for downstream tasks. However, these models ignore the information carried by words and thus lose some important semantics. In this paper, we propose a new method to exploit word structure and integrate lexical semantics into the character representations of pre-trained models. Specifically, we project a word's embedding onto its internal characters' embeddings according to a similarity weight. To strengthen word boundary information, we mix the representations of the internal characters within a word. After that, we apply a word-to-character alignment attention mechanism to emphasize important characters by masking unimportant ones. Moreover, to reduce the error propagation caused by word segmentation, we present an ensemble approach that combines segmentation results from different tokenizers. The experimental results show that our approach achieves superior performance over the basic pre-trained models BERT, BERT-wwm and ERNIE on different Chinese NLP tasks: sentiment classification, sentence pair matching, natural language inference and machine reading comprehension. Further analysis confirms the effectiveness of each component of our model.
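The core idea of the similarity-weighted projection can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the choice of cosine similarity followed by a softmax over a word's characters is an assumption for illustration; the function name and NumPy implementation are likewise hypothetical, not the authors' code.

```python
import numpy as np

def project_word_to_chars(word_emb, char_embs):
    """Distribute a word's embedding over its internal characters,
    weighted by similarity (hypothetical weighting: cosine similarity
    normalized with a softmax; the paper's exact scheme may differ)."""
    # similarity between the word vector and each character vector
    sims = np.array([
        np.dot(word_emb, c) / (np.linalg.norm(word_emb) * np.linalg.norm(c))
        for c in char_embs
    ])
    # softmax so the similarity weights over the characters sum to 1
    weights = np.exp(sims) / np.exp(sims).sum()
    # enrich each character representation with its weighted share
    # of the word-level semantics
    return [c + w * word_emb for c, w in zip(char_embs, weights)]
```

Characters that are semantically closer to the whole word receive a larger share of the word embedding, which is one plausible reading of "project a word's embedding into its internal characters' embeddings according to the similarity weight."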


Related research

11/07/2019 — Enhancing Pre-trained Chinese Character Representation with Word-aligned Attention
Most Chinese pre-trained encoders take a character as a basic unit and l...

01/30/2023 — PaCaNet: A Study on CycleGAN with Transfer Learning for Diversifying Fused Chinese Painting and Calligraphy
AI-Generated Content (AIGC) has recently gained a surge in popularity, p...

11/03/2020 — CharBERT: Character-aware Pre-trained Language Model
Most pre-trained language models (PLMs) construct word representations a...

08/14/2023 — A Novel Ehanced Move Recognition Algorithm Based on Pre-trained Models with Positional Embeddings
The recognition of abstracts is crucial for effectively locating the con...

10/13/2022 — Tone prediction and orthographic conversion for Basaa
In this paper, we present a seq2seq approach for transliterating mission...

08/07/2018 — Effective Character-augmented Word Embedding for Machine Reading Comprehension
Machine reading comprehension is a task to model relationship between pa...

11/06/2018 — Effective Subword Segmentation for Text Comprehension
Character-level representations have been broadly adopted to alleviate t...
