PermuteFormer: Efficient Relative Position Encoding for Long Sequences

09/06/2021
by Peng Chen, et al.

A recent variant of the Transformer, the Performer, scales to longer sequences with a linear attention mechanism. However, it is not compatible with relative position encoding, which has advantages over absolute position encoding. In this paper, we discuss possible ways to add relative position encoding to the Performer. Based on this analysis, we propose PermuteFormer, a Performer-based model with relative position encoding that scales linearly with sequence length. PermuteFormer applies a position-dependent transformation to the queries and keys to encode positional information into the attention module. This transformation is carefully crafted so that the final output of self-attention is not affected by the absolute positions of tokens. By design, PermuteFormer introduces negligible computational overhead and runs as fast as the Performer. We evaluate PermuteFormer on Long-Range Arena, a benchmark for long sequences, as well as WikiText-103, a language modeling dataset. The experiments show that PermuteFormer uniformly improves the performance of the Performer with almost no computational overhead and outperforms the vanilla Transformer on most tasks.
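
The abstract describes the core mechanism only at a high level: a position-dependent transformation of queries and keys whose effect on the attention output depends only on relative positions. The sketch below is a hypothetical illustration, not the authors' implementation; it assumes (as the name PermuteFormer suggests) that the transformation is a position-dependent permutation of the feature dimension of Performer-style query/key features, and it checks numerically that the resulting scores are invariant to a shift of absolute positions.

    # Minimal sketch (illustrative assumption, not the paper's reference code):
    # apply a fixed feature permutation i times to the query at position i and
    # j times to the key at position j. Because a permutation matrix P is
    # orthogonal, (P^i q) . (P^j k) = q . (P^(j-i) k), so the score depends
    # only on the relative offset j - i, never on absolute positions.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 8                      # feature dimension of the query/key feature maps
    perm = rng.permutation(d)  # one fixed permutation shared by all positions

    def apply_perm_power(x, p):
        """Permute the feature dimension of x by the fixed permutation, p times."""
        idx = np.arange(d)
        for _ in range(p):
            idx = idx[perm]    # compose the permutation with itself
        return x[..., idx]

    q = rng.normal(size=d)     # a query feature vector (e.g. after Performer's feature map)
    k = rng.normal(size=d)     # a key feature vector

    # Same relative offset (j - i = 2) at two different absolute positions:
    s1 = apply_perm_power(q, 3) @ apply_perm_power(k, 5)
    s2 = apply_perm_power(q, 10) @ apply_perm_power(k, 12)
    print(np.allclose(s1, s2))  # True: the score depends only on j - i

Under this assumption the transformation merely reindexes features, adding O(d) work per token, which is consistent with the abstract's claim of negligible computational overhead.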

Related research

04/20/2021
RoFormer: Enhanced Transformer with Rotary Position Embedding
Position encoding in transformer architecture provides supervision for d...

07/29/2021
Rethinking and Improving Relative Position Encoding for Vision Transformer
Relative position encoding (RPE) is important for transformer to capture...

09/13/2020
Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding
Transformer has become ubiquitous in the deep learning field. One of the...

09/12/2018
An Improved Relative Self-Attention Mechanism for Transformer with Application to Music Generation
Music relies heavily on self-reference to build structure and meaning. W...

06/06/2021
CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings
Without positional information, attention-based transformer neural netwo...

06/10/2022
Position Labels for Self-Supervised Vision Transformer
Position encoding is important for vision transformer (ViT) to capture t...

05/26/2022
Dynamically Relative Position Encoding-Based Transformer for Automatic Code Edit
Adapting Deep Learning (DL) techniques to automate non-trivial coding ac...
