CharBERT: Character-aware Pre-trained Language Model

11/03/2020
by Wentao Ma, et al.

Most pre-trained language models (PLMs) construct word representations at the subword level with Byte-Pair Encoding (BPE) or its variants, by which OOV (out-of-vocabulary) words can largely be avoided. However, these methods split a word into subword units, making the representation incomplete and fragile. In this paper, we propose a character-aware pre-trained language model named CharBERT that improves on previous methods (such as BERT and RoBERTa) to tackle these problems. We first construct a contextual word embedding for each token from its sequential character representations, then fuse the character representations with the subword representations through a novel heterogeneous interaction module. We also propose a new pre-training task named NLM (Noisy LM) for unsupervised character representation learning. We evaluate our method on question answering, sequence labeling, and text classification tasks, both on the original datasets and on adversarial misspelling test sets. The experimental results show that our method can significantly improve the performance and robustness of PLMs simultaneously. Pretrained models, evaluation sets, and code are available at https://github.com/wtma/CharBERT
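To make the core idea concrete, the sketch below shows one way to build a character-level view of each token and fuse it with the token's subword embedding. This is a minimal illustration only: the class name, the bi-GRU encoder, the single linear fusion layer, and all dimensions are assumptions for demonstration, not the authors' released implementation (see the GitHub repository above for the actual code, which uses a richer heterogeneous interaction module and the NLM pre-training task).

```python
import torch
import torch.nn as nn


class CharAwareTokenEmbedding(nn.Module):
    """Sketch: encode each token's characters with a bi-GRU and fuse the
    result with the token's subword embedding. Names and sizes are
    illustrative assumptions, not CharBERT's actual architecture."""

    def __init__(self, num_chars=256, char_dim=64, hidden_dim=768):
        super().__init__()
        self.char_embed = nn.Embedding(num_chars, char_dim, padding_idx=0)
        # Bidirectional GRU over the characters of each token.
        self.char_rnn = nn.GRU(char_dim, hidden_dim // 2,
                               batch_first=True, bidirectional=True)
        # Simple concatenate-and-project fusion; the paper describes a richer
        # "heterogeneous interaction" module for this step.
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, char_ids, subword_embeds):
        # char_ids:       (batch, seq_len, max_chars_per_token) character ids
        # subword_embeds: (batch, seq_len, hidden_dim) embeddings from a PLM
        b, s, c = char_ids.shape
        chars = self.char_embed(char_ids.view(b * s, c))   # (b*s, c, char_dim)
        _, h_n = self.char_rnn(chars)                       # (2, b*s, hidden_dim//2)
        # Concatenate final forward/backward states as the character-level view.
        char_view = torch.cat([h_n[0], h_n[1]], dim=-1).view(b, s, -1)
        fused = self.fuse(torch.cat([char_view, subword_embeds], dim=-1))
        return torch.tanh(fused)


# Usage with random tensors (hypothetical shapes).
emb = CharAwareTokenEmbedding()
char_ids = torch.randint(1, 256, (2, 8, 12))   # 2 sentences, 8 tokens, 12 chars/token
subwords = torch.randn(2, 8, 768)              # subword-level embeddings
out = emb(char_ids, subwords)                  # -> (2, 8, 768)
```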


