Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

03/05/2021
by Yihe Dong, et al.

Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers. Using this decomposition, we prove that self-attention possesses a strong inductive bias towards "token uniformity". Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degenerating. Our experiments verify the identified convergence phenomena on different variants of standard transformer architectures.
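To make the "token uniformity" claim concrete, here is a minimal numerical sketch (not the paper's experiment): stack single-head softmax attention layers with random weights, omit skip connections and MLPs, and track how far the output sits from the nearest matrix of the form 1x^T (all rows identical). The layer sizes, random initialization, and Frobenius norm below are illustrative assumptions; the paper measures the residual with a composite l1/l-infinity norm.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def pure_attention_layer(X, rng):
    """One self-attention layer with random weights: no skip connection, no MLP."""
    d = X.shape[1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    return softmax(scores, axis=-1) @ (X @ Wv)

def rank1_residual(X):
    """Relative Frobenius distance from X to the nearest matrix 1 x^T,
    i.e. how close the rows (tokens) are to being identical."""
    x_bar = X.mean(axis=0, keepdims=True)  # closest common row under the Frobenius norm
    return np.linalg.norm(X - x_bar) / np.linalg.norm(X)

n_tokens, d_model, depth = 32, 64, 8
X = rng.standard_normal((n_tokens, d_model))
print(f"layer  0: relative rank-1 residual = {rank1_residual(X):.3e}")
for layer in range(1, depth + 1):
    X = pure_attention_layer(X, rng)
    print(f"layer {layer:2d}: relative rank-1 residual = {rank1_residual(X):.3e}")
```

Under these assumptions the printed residual should shrink rapidly with depth, while changing the layer to return X plus the attention output (restoring the skip connection) should keep it roughly flat, consistent with the abstract's claim that skip connections counteract the collapse.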


Related research

Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence (12/26/2021)
Object Detection with Transformers (DETR) and related works reach or eve...

Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer (05/25/2023)
Transformer architecture has shown impressive performance in multiple re...

Effective Attention Sheds Light On Interpretability (05/18/2021)
An attention matrix of a transformer self-attention sublayer can provabl...

Self-Attention Networks Can Process Bounded Hierarchical Languages (05/24/2021)
Despite their impressive performance in NLP, self-attention networks wer...

Predicting Token Impact Towards Efficient Vision Transformer (05/24/2023)
Token filtering to reduce irrelevant tokens prior to self-attention is a...

Limits to Depth Efficiencies of Self-Attention (06/22/2020)
Self-attention architectures, which are rapidly pushing the frontier in ...

Transformer Interpretability Beyond Attention Visualization (12/17/2020)
Self-attention techniques, and specifically Transformers, are dominating...
