Revisiting Over-smoothing in BERT from the Perspective of Graph

02/17/2022
by Han Shi, et al.

Recently, the over-smoothing phenomenon of Transformer-based models has been observed in both the vision and language fields. However, no existing work has delved deeper to investigate the main cause of this phenomenon. In this work, we attempt to analyze the over-smoothing problem from the perspective of graphs, where the problem was first discovered and explored. Intuitively, the self-attention matrix can be seen as a normalized adjacency matrix of a corresponding graph. Based on this connection, we provide theoretical analysis and find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models. Specifically, if the standard deviation of layer normalization is sufficiently large, the output of a Transformer stack will converge to a specific low-rank subspace, resulting in over-smoothing. To alleviate the over-smoothing problem, we consider hierarchical fusion strategies, which adaptively combine the representations from different layers to make the output more diverse. Extensive experimental results on various data sets demonstrate the effectiveness of our fusion method.
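The graph connection in the abstract can be made concrete with a small sketch (not the paper's code): a softmax self-attention matrix is row-stochastic, exactly the same property as the random-walk-normalized adjacency matrix D^-1 A of a graph. The function names and shapes below are illustrative assumptions.

```python
import numpy as np

def attention_matrix(Q, K):
    """softmax(QK^T / sqrt(d)): each row is a probability distribution."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

def normalized_adjacency(A):
    """Random-walk normalization D^{-1} A of an adjacency matrix."""
    deg = A.sum(axis=-1, keepdims=True)
    return A / deg

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(5, 8))
P = attention_matrix(Q, K)

# Adjacency matrix with self-loops so every degree is at least 1.
A = rng.integers(0, 2, size=(5, 5)).astype(float) + np.eye(5)
M = normalized_adjacency(A)

# Both matrices are row-stochastic: every row sums to 1.
assert np.allclose(P.sum(axis=1), 1.0)
assert np.allclose(M.sum(axis=1), 1.0)
```

Under this view, stacking attention layers repeatedly multiplies token representations by row-stochastic matrices, which, like repeated random-walk steps on a graph, drives the rows toward one another, an intuition for the over-smoothing the paper analyzes.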

Related research

- SparseBERT: Rethinking the Importance Analysis in Self-attention (02/25/2021)
  Transformer-based models are popular for natural language processing (NL...

- Block-wise Bit-Compression of Transformer-based Models (03/16/2023)
  With the popularity of the recent Transformer-based models represented b...

- Towards Deeper Graph Neural Networks with Differentiable Group Normalization (06/12/2020)
  Graph neural networks (GNNs), which learn the representation of a node b...

- Multi-scale Graph Convolutional Networks with Self-Attention (12/04/2021)
  Graph convolutional networks (GCNs) have achieved remarkable learning ab...

- Blockwise Compression of Transformer-based Models without Retraining (04/04/2023)
  Transformer-based models, represented by GPT-3, ChatGPT, and GPT-4, have...

- Mitigating Undisciplined Over-Smoothing in Transformer for Weakly Supervised Semantic Segmentation (05/04/2023)
  A surge of interest has emerged in weakly supervised semantic segmentati...

- Incremental Noising and its Fractal Behavior (07/28/2016)
  This manuscript is about further elucidating the concept of noising. The...
