Improving BERT with Hybrid Pooling Network and Drop Mask

07/14/2023
by Qian Chen, et al.

Transformer-based pre-trained language models, such as BERT, achieve great success in various natural language understanding tasks. Prior research found that BERT captures a rich hierarchy of linguistic information at different layers. However, vanilla BERT uses the same self-attention mechanism in each layer to model the different contextual features. In this paper, we propose a HybridBERT model that combines self-attention and pooling networks to encode different contextual features in each layer. Additionally, we propose a simple DropMask method to address the mismatch between pre-training and fine-tuning caused by excessive use of special mask tokens during Masked Language Modeling pre-training. Experiments show that HybridBERT outperforms BERT in pre-training with lower loss, faster training speed (8% relative), and lower memory cost (13% relative), and achieves relatively higher accuracies on downstream tasks. Additionally, DropMask improves accuracies of BERT on downstream tasks across various masking rates.
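The abstract describes the two ideas only at a high level, so the following is a minimal sketch of how they could look in code: an encoder layer that runs a cheap pooling branch alongside standard self-attention and fuses the two, plus a DropMask-style attention mask that keeps tokens from attending to [MASK] positions. The class name, the concatenate-and-project fusion, the pooling window, and the way the mask is applied are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch (PyTorch), assuming a per-layer design that mixes a standard
# self-attention branch with a parameter-free local average-pooling branch,
# and a DropMask-style key padding mask so that tokens do not attend to
# [MASK] positions. Names and the fusion scheme are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridEncoderLayer(nn.Module):
    def __init__(self, d_model=768, n_heads=12, pool_window=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.pool_window = pool_window
        self.fuse = nn.Linear(2 * d_model, d_model)  # concatenate-and-project fusion
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, mask_token_positions=None):
        # x: (batch, seq_len, d_model)
        # mask_token_positions: (batch, seq_len) bool, True where the input is a
        # [MASK] token; passing it as key_padding_mask stops tokens from
        # attending to those positions (a DropMask-style assumption).
        attn_out, _ = self.attn(x, x, x, key_padding_mask=mask_token_positions)
        # Pooling branch: local average pooling along the sequence dimension.
        pooled = F.avg_pool1d(
            x.transpose(1, 2),
            kernel_size=self.pool_window,
            stride=1,
            padding=self.pool_window // 2,
        ).transpose(1, 2)
        fused = self.fuse(torch.cat([attn_out, pooled], dim=-1))
        return self.norm(x + fused)  # residual connection + layer norm


# Toy usage: batch of 2 sequences, 8 tokens each; positions 3 and 5 are [MASK].
hidden = torch.randn(2, 8, 768)
is_mask = torch.zeros(2, 8, dtype=torch.bool)
is_mask[:, [3, 5]] = True
layer = HybridEncoderLayer()
out = layer(hidden, mask_token_positions=is_mask)
print(out.shape)  # torch.Size([2, 8, 768])
```

In this sketch the pooling branch is parameter-free and linear in sequence length, which is the kind of operation one would expect to reduce training time and memory relative to attention alone; the concatenate-and-project fusion is just one plausible way to combine the two branches.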

Related research

04/27/2020
LightPAFF: A Two-Stage Distillation Framework for Pre-training and Fine-tuning
While pre-training and fine-tuning, e.g., BERT <cit.>, GPT-2 <cit.>, hav...

08/06/2020
ConvBERT: Improving BERT with Span-based Dynamic Convolution
Pre-trained language models like BERT and its variants have recently ach...

06/05/2020
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
Recent progress in pre-trained neural language models has significantly ...

02/16/2022
Should You Mask 15% in Masked Language Modeling?
Masked language models conventionally use a masking rate of 15% due to the belief th...

08/31/2021
Enjoy the Salience: Towards Better Transformer-based Faithful Explanations with Word Salience
Pretrained transformer-based models such as BERT have demonstrated state...

06/25/2022
Adversarial Self-Attention for Language Understanding
An ultimate language system aims at the high generalization and robustne...

04/19/2022
DecBERT: Enhancing the Language Understanding of BERT with Causal Attention Masks
Since 2017, the Transformer-based models play critical roles in various ...
