
SimpleTron: Eliminating Softmax from Attention Computation

by Uladzislau Yorsh, et al.
Czech Technical University in Prague

In this paper, we argue that the pairwise dot-product attention layer widely used in transformer-based models is redundant for model performance. Attention in its original formulation should rather be seen as a human-facing tool for exploring and visualizing relevancy scores over sequences. In its place, we present a simple and fast alternative that involves no approximation and that, to the best of our knowledge, outperforms existing attention approximations on several tasks from the Long-Range Arena benchmark.
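The abstract does not spell out the replacement mechanism, so the following is only an illustrative sketch of what "eliminating softmax" from attention can look like: a standard softmax attention head next to a softmax-free variant that reassociates the matrix products, in the style of linear attention. The function names and the uniform normalization are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def softmax_free_attention(Q, K, V):
    # Illustrative softmax-free variant (not the paper's exact method):
    # dropping softmax lets us reassociate (Q K^T) V as Q (K^T V),
    # which avoids materializing the n x n score matrix and reduces
    # the cost in sequence length n from O(n^2 d) to O(n d^2).
    n = K.shape[0]
    return (Q @ (K.T @ V)) / n
```

Both functions map sequences of shape (n, d) to outputs of the same shape; the softmax-free version simply skips the pairwise normalization step.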


