Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding

09/13/2020
by Shuohang Wang, et al.

Transformer has become ubiquitous in deep learning. One of the key ingredients behind its success is the self-attention mechanism, which allows fully-connected contextual encoding over input tokens. However, despite its effectiveness in modeling short sequences, self-attention struggles on inputs with extremely long-range dependencies, as its complexity grows quadratically with sequence length. Therefore, long sequences are often encoded by Transformer in chunks using a sliding window. In this paper, we propose Cluster-Former, a novel clustering-based sparse Transformer that performs attention across chunked sequences. The proposed method allows information integration beyond local windows, which is especially beneficial for question answering (QA) and language modeling tasks that rely on long-range dependencies. Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.
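To make the idea concrete, below is a minimal sketch of clustering-based sparse attention under the assumptions of a single head and no learned projections; it is not the paper's implementation. Each token hidden state is assigned to its nearest centroid, and full self-attention is computed only among tokens that share a cluster, so tokens from distant chunks can still exchange information while the cost drops from quadratic in the sequence length to quadratic in the (much smaller) cluster sizes. The function name and the random-sample centroids (standing in for k-means) are illustrative.

import torch
import torch.nn.functional as F

def cluster_sparse_attention(hidden, centroids):
    """
    Sketch of clustering-based sparse attention.

    hidden:    (seq_len, dim) token hidden states gathered from all chunks
    centroids: (k, dim)       cluster centroids, e.g. obtained via k-means
    """
    dim = hidden.size(-1)
    # Assign every token to its nearest centroid (cosine similarity).
    sim = F.normalize(hidden, dim=-1) @ F.normalize(centroids, dim=-1).T
    assignment = sim.argmax(dim=-1)                      # (seq_len,)

    output = hidden.clone()
    for c in range(centroids.size(0)):
        idx = (assignment == c).nonzero(as_tuple=True)[0]
        if idx.numel() < 2:
            continue                                     # nothing to attend to
        members = hidden[idx]                            # (m, dim)
        # Full attention restricted to tokens of the same cluster:
        # cost is O(sum_c m_c^2) instead of O(seq_len^2).
        scores = members @ members.T / dim ** 0.5        # (m, m)
        output[idx] = scores.softmax(dim=-1) @ members
    return output

# Example: 4,000 tokens, hidden size 64, 16 clusters.
tokens = torch.randn(4000, 64)
centroids = tokens[torch.randperm(4000)[:16]]            # stand-in for k-means
out = cluster_sparse_attention(tokens, centroids)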

Related research

07/05/2021  Long-Short Transformer: Efficient Transformers for Language and Vision
10/14/2020  Memformer: The Memory-Augmented Transformer
10/19/2021  Inductive Biases and Variable Creation in Self-Attention Mechanisms
07/06/2022  Astroconformer: Inferring Surface Gravity of Stars from Stellar Light Curves with Transformer
11/07/2019  Blockwise Self-Attention for Long Document Understanding
09/06/2021  PermuteFormer: Efficient Relative Position Encoding for Long Sequences
07/18/2019  Neural Shuffle-Exchange Networks - Sequence Processing in O(n log n) Time
