CAB: Comprehensive Attention Benchmarking on Long Sequence Modeling

10/14/2022
by Jun Zhang, et al.

The Transformer has achieved remarkable success in language, image, and speech processing. Recently, various efficient attention architectures have been proposed to improve the Transformer's efficiency while largely preserving its efficacy, especially in modeling long sequences. A widely used benchmark for testing these efficient methods' capability on long-range modeling is Long Range Arena (LRA). However, LRA focuses only on standard bidirectional (or noncausal) self-attention and completely ignores cross attention and unidirectional (or causal) attention, which are equally important in downstream applications. Although designing cross and causal variants of an attention method is straightforward for vanilla attention, it is often challenging for efficient attentions with subquadratic time and memory complexity. In this paper, we propose the Comprehensive Attention Benchmark (CAB), built on a fine-grained attention taxonomy with four distinguishable attention patterns: noncausal self, causal self, noncausal cross, and causal cross attention. CAB collects seven real-world tasks from different research areas to evaluate efficient attentions under the four attention patterns. Across these tasks, CAB validates efficient attentions in eight backbone networks to show their generalization across neural architectures. We conduct exhaustive experiments to benchmark the performance of nine widely used efficient attention architectures, designed with different philosophies, on CAB. Extensive experimental results also shed light on fundamental problems of efficient attentions, such as efficiency length against vanilla attention, performance consistency across attention patterns, the benefit of attention mechanisms, and interpolation/extrapolation behavior in long-context language modeling.
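To make the four-pattern taxonomy concrete, the sketch below (not from the CAB codebase; a minimal vanilla-attention illustration with assumed shapes and variable names) shows how the patterns differ only in where queries versus keys/values come from (self vs. cross) and in whether a causal mask restricts each query position to earlier positions.

```python
# Minimal sketch of the four attention patterns: noncausal/causal x self/cross.
# Vanilla scaled dot-product attention is used for clarity; efficient attentions
# would replace this computation while preserving the same input/output contract.
import torch
import torch.nn.functional as F

def attention(q, k, v, causal=False):
    # q: (batch, len_q, d); k, v: (batch, len_kv, d)
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    if causal:
        # Position i may only attend to positions <= i (assumes aligned lengths here).
        mask = torch.triu(torch.ones(scores.shape[-2], scores.shape[-1], dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(2, 16, 64)  # target sequence (queries)
m = torch.randn(2, 16, 64)  # source/memory sequence (keys and values)

noncausal_self  = attention(x, x, x, causal=False)  # e.g., encoder self-attention
causal_self     = attention(x, x, x, causal=True)   # e.g., decoder self-attention
noncausal_cross = attention(x, m, m, causal=False)  # e.g., encoder-decoder cross attention
causal_cross    = attention(x, m, m, causal=True)   # cross attention under a causal restriction
```

The point of the taxonomy is that an efficient attention must supply all four variants with subquadratic cost, which is straightforward for the vanilla formulation above but nontrivial for many efficient designs.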

