ConvBERT: Improving BERT with Span-based Dynamic Convolution

08/06/2020
by Zihang Jiang, et al.

Pre-trained language models like BERT and its variants have recently achieved impressive performance on various natural language understanding tasks. However, BERT relies heavily on the global self-attention block and thus suffers from a large memory footprint and high computation cost. Although all of its attention heads query the whole input sequence to generate the attention map from a global perspective, we observe that some heads only need to learn local dependencies, which indicates computation redundancy. We therefore propose a novel span-based dynamic convolution to replace these self-attention heads and directly model local dependencies. The new convolution heads, together with the remaining self-attention heads, form a mixed attention block that is more efficient at both global and local context learning. We equip BERT with this mixed attention design and build the ConvBERT model. Experiments show that ConvBERT significantly outperforms BERT and its variants on various downstream tasks, with lower training cost and fewer model parameters. Remarkably, the ConvBERT_base model achieves an 86.4 GLUE score, 0.7 higher than ELECTRA_base, while using less than 1/4 of the training cost. Code and pre-trained models will be released.
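
The abstract does not spell out the implementation, but the PyTorch sketch below illustrates the general idea of a span-based dynamic convolution head as described: a per-position convolution kernel is generated from a local span around each token (rather than from the token alone) and then applied to the value vectors in that window. All class and parameter names (SpanDynamicConv, span_conv, kernel_proj, kernel_size) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanDynamicConv(nn.Module):
    """Minimal sketch of a span-based dynamic convolution head.

    A depthwise-style convolution first summarizes a local span around each
    token; that span representation, combined with the query, generates a
    position-specific kernel that is applied to the values in the same window.
    This is a hypothetical reading of the abstract, not the official ConvBERT code.
    """

    def __init__(self, hidden_size, head_size, kernel_size=9):
        super().__init__()
        self.kernel_size = kernel_size
        self.query = nn.Linear(hidden_size, head_size)
        self.value = nn.Linear(hidden_size, head_size)
        # Convolution that gathers a local span around each token.
        self.span_conv = nn.Conv1d(hidden_size, head_size, kernel_size,
                                   padding=kernel_size // 2)
        # Projects the span-conditioned query to a kernel of length `kernel_size`.
        self.kernel_proj = nn.Linear(head_size, kernel_size)

    def forward(self, x):                       # x: (batch, seq_len, hidden)
        q = self.query(x)                       # (B, T, head)
        v = self.value(x)                       # (B, T, head)
        span = self.span_conv(x.transpose(1, 2)).transpose(1, 2)     # (B, T, head)
        # Dynamic kernel conditioned on the query and its local span.
        kernel = torch.softmax(self.kernel_proj(q * span), dim=-1)   # (B, T, k)
        # Unfold values into local windows and apply the per-position kernel.
        pad = self.kernel_size // 2
        v = F.pad(v.transpose(1, 2), (pad, pad))                     # (B, head, T + k - 1)
        windows = v.unfold(-1, self.kernel_size, 1)                  # (B, head, T, k)
        return torch.einsum('bhtk,btk->bth', windows, kernel)        # (B, T, head)
```

In a mixed attention block, a few such convolution heads would operate alongside ordinary self-attention heads on the same input and have their outputs concatenated, so local dependencies are handled by the cheaper convolution path while the remaining heads keep the global view.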


