LICHEE: Improving Language Model Pre-training with Multi-grained Tokenization

08/02/2021
by Weidong Guo, et al.

Language model pre-training on large corpora has achieved tremendous success in constructing enriched contextual representations and has led to significant performance gains on a diverse range of Natural Language Understanding (NLU) tasks. Despite this success, most current pre-trained language models, such as BERT, are trained with single-grained tokenization, usually fine-grained characters or sub-words, which makes it hard for them to learn the precise meaning of coarse-grained words and phrases. In this paper, we propose a simple yet effective pre-training method named LICHEE that efficiently incorporates multi-grained information of the input text. Our method can be applied to various pre-trained language models and improves their representation capability. Extensive experiments on CLUE and SuperGLUE demonstrate that our method achieves comprehensive improvements on a wide variety of NLU tasks in both Chinese and English with little extra inference cost, and that our best ensemble model achieves state-of-the-art performance on the CLUE benchmark competition.
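To make the idea concrete, below is a minimal sketch of how multi-grained information could be fused at the embedding layer before a standard Transformer encoder. The module name MultiGrainedEmbedding, the assumption that each fine-grained (character/sub-word) position is aligned to the coarse-grained (word/phrase) token that covers it, and the element-wise max-pooling fusion are illustrative choices not spelled out in the abstract.

```python
import torch
import torch.nn as nn


class MultiGrainedEmbedding(nn.Module):
    """Illustrative fusion of fine-grained (char/sub-word) and coarse-grained
    (word/phrase) token embeddings at the input layer.

    Assumes both id tensors have shape (batch_size, seq_len): the coarse-grained
    id is repeated for every fine-grained position it covers. The max-pooling
    fusion is an assumption for illustration, not the paper's exact recipe.
    """

    def __init__(self, fine_vocab_size: int, coarse_vocab_size: int, hidden_size: int):
        super().__init__()
        self.fine_embed = nn.Embedding(fine_vocab_size, hidden_size)
        self.coarse_embed = nn.Embedding(coarse_vocab_size, hidden_size)

    def forward(self, fine_ids: torch.Tensor, coarse_ids: torch.Tensor) -> torch.Tensor:
        fine = self.fine_embed(fine_ids)        # (B, L, H)
        coarse = self.coarse_embed(coarse_ids)  # (B, L, H)
        # Fuse the two granularities element-wise; the fused embedding feeds a
        # standard Transformer encoder unchanged.
        return torch.maximum(fine, coarse)


# Toy usage: batch of 2 sequences, length 6, hypothetical vocabulary sizes.
if __name__ == "__main__":
    layer = MultiGrainedEmbedding(fine_vocab_size=21128, coarse_vocab_size=50000, hidden_size=768)
    fine_ids = torch.randint(0, 21128, (2, 6))
    coarse_ids = torch.randint(0, 50000, (2, 6))
    print(layer(fine_ids, coarse_ids).shape)  # torch.Size([2, 6, 768])
```

Because the fusion happens once at the input layer and the encoder itself is unchanged, the extra inference cost is limited to a second embedding lookup, which is consistent with the abstract's claim of little added cost.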


Related research

08/27/2020 · AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization
08/23/2022 · CLOWER: A Pre-trained Language Model with Contrastive Learning over Word and Character Representations
03/14/2022 · PERT: Pre-training BERT with Permuted Language Model
11/01/2022 · VarMAE: Pre-training of Variational Masked Autoencoder for Domain-adaptive Language Understanding
10/23/2020 · ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding
06/05/2023 · On "Scientific Debt" in NLP: A Case for More Rigour in Language Model Pre-Training Research
01/18/2022 · Instance-aware Prompt Learning for Language Understanding and Generation
