Revisiting Token Dropping Strategy in Efficient BERT Pretraining

05/24/2023
by Qihuang Zhong et al.

Token dropping is a recently proposed strategy to speed up the pretraining of masked language models, such as BERT, by skipping the computation of a subset of the input tokens at several middle layers. It can effectively reduce the training time without much degradation of performance on downstream tasks. However, we empirically find that token dropping is prone to a semantic loss problem and falls short in handling semantic-intensive tasks. Motivated by this, we propose a simple yet effective semantic-consistent learning method (ScTD) to improve token dropping. ScTD encourages the model to learn how to preserve semantic information in the representation space. Extensive experiments on 12 tasks show that, with the help of our ScTD, token dropping achieves consistent and significant performance gains across all task types and model sizes. More encouragingly, ScTD saves up to 57% of pretraining time and brings up to +1.56% average improvement over vanilla token dropping.
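
For intuition, below is a minimal PyTorch-style sketch of the token-dropping idea described in the abstract (not the authors' released code): a subset of tokens, selected by some importance score, is routed through the middle encoder layers while the remaining tokens bypass those layers and are re-attached for the final layers. The class name, layer indices, keep ratio, and the importance scores are illustrative assumptions; the ScTD objective itself is described in the full paper.

```python
# Hypothetical sketch of token dropping in a BERT-style encoder.
import torch
import torch.nn as nn


class TokenDroppingEncoder(nn.Module):
    def __init__(self, d_model=768, n_heads=12, n_layers=12,
                 keep_ratio=0.5, drop_start=4, drop_end=10):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.keep_ratio = keep_ratio   # fraction of tokens processed in middle layers
        self.drop_start = drop_start   # first layer that skips the dropped tokens
        self.drop_end = drop_end       # layer at which all tokens are processed again

    def forward(self, hidden, importance):
        # hidden: (batch, seq_len, d_model); importance: (batch, seq_len) token scores
        batch, seq_len, d_model = hidden.shape
        n_keep = max(1, int(seq_len * self.keep_ratio))
        keep_idx = importance.topk(n_keep, dim=1).indices  # tokens kept in middle layers

        full = hidden
        for i, layer in enumerate(self.layers):
            if self.drop_start <= i < self.drop_end:
                # Middle layers: run only the kept tokens to save compute;
                # dropped tokens keep their pre-middle-layer representations.
                idx = keep_idx.unsqueeze(-1).expand(-1, -1, d_model)
                kept = torch.gather(full, 1, idx)
                kept = layer(kept)
                full = full.scatter(1, idx, kept)
            else:
                # Early and late layers: process the full sequence.
                full = layer(full)
        return full


# Usage with placeholder importance scores (real setups derive them, e.g., from MLM loss):
encoder = TokenDroppingEncoder()
x = torch.randn(2, 128, 768)
scores = torch.rand(2, 128)
out = encoder(x, scores)  # (2, 128, 768)
```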


Related research:
- Token Dropping for Efficient BERT Pretraining (03/24/2022)
- PMI-Masking: Principled Masking of Correlated Spans (10/05/2020)
- Weighted Sampling for Masked Language Modeling (02/28/2023)
- Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers (11/17/2022)
- Self-Evolution Learning for Discriminative Language Model Pretraining (05/24/2023)
- How does the task complexity of masked pretraining objectives affect downstream performance? (05/18/2023)
- A Study on Token Pruning for ColBERT (12/13/2021)
