SparseBERT: Rethinking the Importance Analysis in Self-attention

02/25/2021
by Han Shi, et al.

Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity. As the core component, the self-attention module has attracted widespread interest. Visualizing the attention maps of a pre-trained model is one direct way to understand the self-attention mechanism, and several common patterns have been observed in such visualizations. Based on these patterns, a series of efficient Transformers with corresponding sparse attention masks has been proposed. Beyond these empirical results, the universal approximability of Transformer-based models has also been established from a theoretical perspective. However, both the understanding and the analysis above are based on an already pre-trained model. To rethink importance analysis in self-attention, we study the dynamics of attention matrix importance during pre-training. One surprising result is that the diagonal elements of the attention map are the least important compared with other attention positions, and we provide a proof showing that these elements can be removed without degrading model performance. Furthermore, we propose a Differentiable Attention Mask (DAM) algorithm, which can further guide the design of SparseBERT. Extensive experiments verify our findings and illustrate the effectiveness of the proposed algorithm.
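To make the main finding concrete, the minimal PyTorch sketch below applies a fixed sparse attention mask that drops only the diagonal (token-to-itself) positions, and then shows one plausible way a mask could be made differentiable by relaxing it with a sigmoid over learnable logits. The function and variable names are illustrative, and the sigmoid relaxation is an assumption; the paper's exact DAM formulation is only given in the full text.

```python
# Illustrative sketch only: a diagonal-free sparse attention mask, plus a
# sigmoid-relaxed learnable mask in the spirit of a differentiable attention
# mask. The exact DAM formulation in the paper may differ.
import torch
import torch.nn.functional as F

def masked_self_attention(q, k, v, mask):
    """Scaled dot-product attention with positions where mask == 0 disabled."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

seq_len, dim = 8, 16
q, k, v = (torch.randn(1, seq_len, dim) for _ in range(3))

# Fixed mask that removes only the diagonal (self-to-self) attention positions,
# the elements the paper identifies as the least important.
no_diag_mask = torch.ones(seq_len, seq_len) - torch.eye(seq_len)
out = masked_self_attention(q, k, v, no_diag_mask)

# A minimal differentiable-mask sketch (hypothetical): learnable logits are
# relaxed to (0, 1) with a sigmoid and added to the scores in log space, so the
# mask can be trained end to end and thresholded to a hard mask afterwards.
mask_logits = torch.nn.Parameter(torch.zeros(seq_len, seq_len))
soft_mask = torch.sigmoid(mask_logits)
scores = q @ k.transpose(-2, -1) / dim ** 0.5
soft_out = F.softmax(scores + torch.log(soft_mask + 1e-9), dim=-1) @ v
```

In a full setup the soft mask would presumably be regularized toward sparsity and shared across layers or heads; those details go beyond what the abstract states.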

