cosFormer: Rethinking Softmax in Attention

02/17/2022
by Zhen Qin, et al.

Transformer has shown great success in natural language processing, computer vision, and audio processing. As one of its core components, softmax attention helps to capture long-range dependencies, yet it prohibits scaling up because of its quadratic space and time complexity with respect to sequence length. Kernel methods are often adopted to reduce the complexity by approximating the softmax operator. Nevertheless, due to approximation errors, their performance varies across tasks and corpora and suffers significant drops compared with vanilla softmax attention. In this paper, we propose a linear transformer called cosFormer that can achieve comparable or better accuracy than the vanilla transformer in both causal and cross attention. cosFormer is based on two key properties of softmax attention: (i) non-negativity of the attention matrix; (ii) a non-linear re-weighting scheme that concentrates the distribution of the attention matrix. As its linear substitute, cosFormer fulfills these properties with a linear operator and a cosine-based distance re-weighting mechanism. Extensive experiments on language modeling and text understanding tasks demonstrate the effectiveness of our method. We further examine our method on long sequences and achieve state-of-the-art performance on the Long-Range Arena benchmark. The source code is available at https://github.com/OpenNLPLab/cosFormer.
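The abstract names the two ingredients (a non-negative feature map in place of softmax, and a cosine-based distance re-weighting that keeps the computation linear) without spelling out the formulas. Below is a minimal, non-causal sketch of one way such an attention can be realized, assuming ReLU as the non-negative map and a cos(pi(i-j)/(2M)) position weight split via the angle-difference identity; the function and tensor names are illustrative and are not taken from the official repository.

```python
# Sketch of a cosine re-weighted linear (non-causal) attention.
# Assumptions (not confirmed by the abstract): ReLU feature map and a
# cos(pi * (i - j) / (2M)) weight, decomposed as
# cos(a - b) = cos(a)cos(b) + sin(a)sin(b) so the cost stays O(N d^2).
import math
import torch


def cos_reweighted_linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (batch, seq_len, dim) tensors; returns (batch, seq_len, dim)."""
    n = q.shape[1]
    m = max(n, 1)  # normalization length M for the cosine weight

    # 1) Non-negativity: pass queries and keys through ReLU.
    q, k = torch.relu(q), torch.relu(k)

    # 2) Cosine re-weighting, split into per-position cos/sin factors.
    idx = torch.arange(n, device=q.device, dtype=q.dtype)
    angle = math.pi / 2 * idx / m                      # (n,)
    cos_w, sin_w = torch.cos(angle), torch.sin(angle)  # (n,)

    q_cos, q_sin = q * cos_w[None, :, None], q * sin_w[None, :, None]
    k_cos, k_sin = k * cos_w[None, :, None], k * sin_w[None, :, None]

    # 3) Linear attention: contract keys with values first.
    kv_cos = torch.einsum("bnd,bne->bde", k_cos, v)
    kv_sin = torch.einsum("bnd,bne->bde", k_sin, v)
    num = torch.einsum("bnd,bde->bne", q_cos, kv_cos) + \
          torch.einsum("bnd,bde->bne", q_sin, kv_sin)

    # Row-wise normalizer replacing the softmax denominator.
    den = torch.einsum("bnd,bd->bn", q_cos, k_cos.sum(dim=1)) + \
          torch.einsum("bnd,bd->bn", q_sin, k_sin.sum(dim=1))
    return num / (den[..., None] + eps)


if __name__ == "__main__":
    x = torch.randn(2, 128, 64)
    out = cos_reweighted_linear_attention(x, x, x)
    print(out.shape)  # torch.Size([2, 128, 64])
```

Because keys are contracted with values before touching the queries, memory and time grow linearly with sequence length rather than quadratically; a causal variant would replace the global contractions with prefix sums over positions.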

Related research

The Devil in Linear Transformer (10/19/2022)
Toeplitz Neural Network for Sequence Modeling (05/08/2023)
SimpleTron: Eliminating Softmax from Attention Computation (11/23/2021)
CAB: Comprehensive Attention Benchmarking on Long Sequence Modeling (10/14/2022)
Waveformer: Linear-Time Attention with Forward and Backward Wavelet Transform (10/05/2022)
H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences (07/25/2021)
Locality Matters: A Locality-Biased Linear Attention for Automatic Speech Recognition (03/29/2022)
