Simple Local Attentions Remain Competitive for Long-Context Tasks

12/14/2021
by Wenhan Xiong et al.

Many NLP tasks require processing long contexts beyond the length limit of pretrained models. To scale these models to longer text sequences, many efficient long-range attention variants have been proposed. Despite the abundance of research in this direction, it remains difficult to gauge the relative effectiveness of these models in practical use cases, e.g., when they are applied under the pretrain-and-finetune paradigm. In this work, we conduct a thorough analysis of these emerging models with large-scale and controlled experiments. For each attention variant, we pretrain large-size models on the same long-document corpus and then finetune them on real-world long-context tasks. Our findings reveal pitfalls of an existing widely used long-range benchmark and show that none of the tested efficient attention variants beats a simple local window attention under standard pretraining paradigms. Further analysis of local attention variants suggests that even the commonly used overlap between attention windows is not necessary for good downstream results: using disjoint local attentions, we build a simpler and more efficient long-document QA model that matches the performance of Longformer with half of its pretraining compute.
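For intuition, here is a minimal NumPy sketch contrasting the two local attention patterns discussed above: overlapping sliding-window attention (Longformer-style) versus disjoint block attention, where tokens attend only within fixed, non-overlapping blocks. This is an illustrative toy example, not the authors' model code; the function names, block size, window size, and tensor dimensions are assumptions chosen for the demo.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def disjoint_block_attention(q, k, v, block_size):
    """Disjoint local attention: each token attends only to tokens
    inside its own fixed, non-overlapping block (no window overlap)."""
    seq_len, dim = q.shape
    assert seq_len % block_size == 0, "pad the sequence to a multiple of block_size"
    out = np.empty_like(v)
    for start in range(0, seq_len, block_size):
        end = start + block_size
        scores = q[start:end] @ k[start:end].T / np.sqrt(dim)
        out[start:end] = softmax(scores) @ v[start:end]
    return out

def sliding_window_attention(q, k, v, window):
    """Overlapping local attention: each token attends to the `window`
    tokens on either side of it (Longformer-style sliding window)."""
    seq_len, dim = q.shape
    out = np.empty_like(v)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        scores = q[i] @ k[lo:hi].T / np.sqrt(dim)
        out[i] = softmax(scores) @ v[lo:hi]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
    print(disjoint_block_attention(q, k, v, block_size=4).shape)  # (16, 8)
    print(sliding_window_attention(q, k, v, window=2).shape)      # (16, 8)
```

Both patterns scale linearly with sequence length, but the disjoint variant drops all cross-block computation, which is what makes it the simpler and cheaper option highlighted in the abstract.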

Related research

02/16/2022 · The NLP Task Effectiveness of Long-Range Transformers
Transformer models cannot easily scale to long sequences due to their O(...

09/21/2022 · Adapting Pretrained Text-to-Text Models for Long Text Sequences
We present an empirical study of adapting an existing pretrained text-to...

10/14/2022 · CAB: Comprehensive Attention Benchmarking on Long Sequence Modeling
Transformer has achieved remarkable success in language, image, and spee...

06/26/2023 · Understanding In-Context Learning via Supportive Pretraining Data
In-context learning (ICL) improves language models' performance on a var...

08/01/2022 · Efficient Long-Text Understanding with Short-Text Models
Transformer-based pretrained language models (LMs) are ubiquitous across...

05/05/2023 · HiPool: Modeling Long Documents Using Graph Neural Networks
Encoding long sequences in Natural Language Processing (NLP) is a challe...

10/17/2022 · What Makes Convolutional Models Great on Long Sequence Modeling?
Convolutional models have been widely used in multiple domains. However,...
