Predicting Attention Sparsity in Transformers

by Marcos Treviso et al.

A bottleneck in transformer architectures is their quadratic complexity with respect to the input sequence length, which has motivated a body of work on efficient sparse approximations to softmax attention. An alternative path, taken by entmax transformers, is built-in, exact sparse attention; however, this approach still requires quadratic computation. In this paper, we propose Sparsefinder, a simple model trained to identify the sparsity pattern of entmax attention before computing it. We experiment with three variants of our method, based on distances, quantization, and clustering, on two tasks: machine translation (attention in the decoder) and masked language modeling (encoder only). Our work provides a new angle from which to study model efficiency, through an extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph. This analysis allows detailed comparison between different models and may guide future benchmarks for sparse models.
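As a rough illustration of the idea behind the quantization variant, the sketch below projects queries and keys into a low-dimensional space and predicts that a query attends to a key only when the two land in the same bucket. Everything here is an assumption for illustration: the paper learns the projection, whereas this sketch uses a random one, and the equal-width binning and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def bucket_ids(z, n_bins=4):
    # Discretize each low-dimensional coordinate into equal-width bins;
    # the tuple of bin indices serves as a vector's bucket id.
    lo = z.min(axis=0, keepdims=True)
    hi = z.max(axis=0, keepdims=True)
    b = np.floor((z - lo) / (hi - lo + 1e-9) * n_bins).astype(int)
    b = np.clip(b, 0, n_bins - 1)
    return [tuple(row) for row in b]

def predicted_graph(q, k, proj, n_bins=4):
    # Project queries and keys jointly (so bins are aligned), then predict
    # that q_i attends to k_j iff they share a bucket.
    z = np.vstack([q, k]) @ proj
    ids = bucket_ids(z, n_bins)
    qid, kid = ids[: len(q)], ids[len(q):]
    return np.array([[qi == kj for kj in kid] for qi in qid])

n, d, r = 8, 16, 2
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
proj = rng.normal(size=(d, r))   # stand-in for the trained projection
G = predicted_graph(q, k, proj)  # boolean predicted attention graph
sparsity = 1.0 - G.mean()        # fraction of query-key pairs pruned
```

In the full method, recall would then be measured against the exact entmax attention graph: a good predictor keeps `sparsity` high while covering most of the query-key pairs that entmax assigns nonzero probability.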


Random Feature Attention

Transformers are state-of-the-art models for a variety of sequence model...

Transformer Acceleration with Dynamic Sparse Attention

Transformers are the mainstream of NLP applications and are becoming inc...

Adaptively Sparse Transformers

Attention mechanisms have become ubiquitous in NLP. Recent architectures...

Adaptive Transformers for Learning Multimodal Representations

The usage of transformers has grown from learning about language semanti...

Value-aware Approximate Attention

Following the success of dot-product attention in Transformers, numerous...

Finetuning Pretrained Transformers into RNNs

Transformers have outperformed recurrent neural networks (RNNs) in natur...

Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers

Transformer models have achieved state-of-the-art results across a diver...