On the Expressive Power of Self-Attention Matrices

06/07/2021
by Valerii Likhosherstov, et al.

Transformer networks are able to capture patterns in data coming from many domains (text, images, videos, proteins, etc.) with little or no change to architecture components. We perform a theoretical analysis of the core component responsible for signal propagation between elements, i.e. the self-attention matrix. In practice, this matrix typically exhibits two properties: (1) it is sparse, meaning that each token only attends to a small subset of other tokens; and (2) it changes dynamically depending on the input to the module. With these considerations in mind, we ask the following question: Can a fixed self-attention module approximate arbitrary sparse patterns depending on the input? How small is the hidden size d required for such approximation? We make progress in answering this question and show that the self-attention matrix can provably approximate sparse matrices, where sparsity is in terms of a bounded number of nonzero elements in each row and column. While the parameters of self-attention are fixed, various sparse matrices can be approximated by only modifying the inputs. Our proof is based on the random projection technique and uses the seminal Johnson-Lindenstrauss lemma. Our proof is constructive, enabling us to propose an algorithm for finding adaptive inputs and fixed self-attention parameters in order to approximate a given matrix. In particular, we show that, in order to approximate any sparse matrix up to a given precision defined in terms of preserving matrix element ratios, d grows only logarithmically with the sequence length L (i.e. d = O(log L)).
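To make the claim concrete, below is a minimal NumPy sketch, not the construction from the paper, of how a self-attention module with fixed projection matrices and hidden size d = O(log L) can realize a target sparse pattern purely through its inputs. The idea follows the abstract: random vectors in d ~ log L dimensions are nearly orthogonal (the Johnson-Lindenstrauss phenomenon), so if each query copies the key of the token it should attend to, the unintended logits stay small and the softmax concentrates on the target entries. The permutation-style target pattern, the constant 10 in front of log L, and the temperature tau are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

L = 512                        # sequence length
d = int(10 * np.log(L))        # hidden size, growing as O(log L)
tau = 16.0                     # logit scale (the usual 1/sqrt(d) factor is absorbed here)

# Target sparse attention pattern: a random permutation matrix, i.e. every
# row and every column has exactly one nonzero entry.
perm = rng.permutation(L)
target = np.zeros((L, L))
target[np.arange(L), perm] = 1.0

# Fixed self-attention parameters: each token embedding has 2d coordinates,
# W_Q reads the first d of them and W_K reads the last d.
W_Q = np.vstack([np.eye(d), np.zeros((d, d))])   # shape (2d, d)
W_K = np.vstack([np.zeros((d, d)), np.eye(d)])   # shape (2d, d)

# Adaptive inputs: random unit key vectors (near-orthogonal once d ~ log L),
# and each query is a scaled copy of the key it is supposed to attend to.
keys = rng.normal(size=(L, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
queries = tau * keys[perm]
X = np.hstack([queries, keys])                   # (L, 2d) token embeddings

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Self-attention matrix produced by the fixed module on the chosen inputs.
A = softmax((X @ W_Q) @ (X @ W_K).T)

print(f"d = {d}, max entrywise error = {np.abs(A - target).max():.4f}")
```

For these illustrative constants the printed deviation from the target pattern should be small (on the order of 1e-3), even though d is only about 10 log L, i.e. 62 for L = 512.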

Related research:

- 08/20/2021: Smart Bird: Learnable Sparse Attention for Efficient and Effective Transformer
- 02/25/2021: SparseBERT: Rethinking the Importance Analysis in Self-attention
- 01/14/2020: Faster Transformer Decoding: N-gram Masked Self-Attention
- 04/22/2022: Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better than Dot-Product Self-Attention
- 06/01/2023: Faster Causal Attention Over Large Sequences Through Sparse Flash Attention
- 05/18/2021: Effective Attention Sheds Light On Interpretability
- 03/29/2022: Domain Invariant Siamese Attention Mask for Small Object Change Detection via Everyday Indoor Robot Navigation
