Rethinking and Improving Relative Position Encoding for Vision Transformer

07/29/2021
by Kan Wu, et al.

Relative position encoding (RPE) is important for the transformer to capture the sequence ordering of input tokens. Its general efficacy has been proven in natural language processing. In computer vision, however, its efficacy is not well studied and even remains controversial, e.g., whether relative position encoding can work equally well as absolute position encoding. To clarify this, we first review existing relative position encoding methods and analyze their pros and cons when applied in vision transformers. We then propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE). Our methods consider directional relative distance modeling as well as the interactions between queries and relative position embeddings in the self-attention mechanism. The proposed iRPE methods are simple and lightweight and can be easily plugged into transformer blocks. Experiments demonstrate that, solely due to the proposed encoding methods, DeiT and DETR obtain up to 1.5% (top-1 Acc) and 1.3% (mAP) stable improvements over their original versions on ImageNet and COCO respectively, without tuning any extra hyperparameters such as learning rate and weight decay. Our ablations and analysis also yield interesting findings, some of which run counter to previous understanding. Code and models are open-sourced at https://github.com/microsoft/Cream/tree/main/iRPE.
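To make the idea concrete, the sketch below shows one way a relative position term can enter self-attention: the attention logit between a query and a key receives an extra dot product between the query and a learnable embedding indexed by the pair's 2D relative offset (the query-embedding interaction mentioned above). This is a minimal PyTorch illustration under simplifying assumptions, not the paper's implementation: it is single-head, the class name RelPosSelfAttention and helper _build_buckets are invented here, and the bucket table is a plain per-offset lookup over the full grid rather than the 2D mappings studied in the paper; the reference code lives in the linked repository.

    import torch
    import torch.nn as nn

    class RelPosSelfAttention(nn.Module):
        """Single-head self-attention over a grid of tokens with a learnable
        2D relative position term that interacts with the query.
        Illustrative sketch only; names and the bucket mapping are simplified."""

        def __init__(self, dim, grid_size):
            super().__init__()
            self.scale = dim ** -0.5
            self.qkv = nn.Linear(dim, dim * 3)
            side = 2 * grid_size - 1                       # possible offsets per axis
            self.rel_embed = nn.Parameter(torch.empty(side * side, dim))
            nn.init.trunc_normal_(self.rel_embed, std=0.02)
            self.register_buffer("bucket_idx", self._build_buckets(grid_size))

        @staticmethod
        def _build_buckets(g):
            # map the (dy, dx) offset between every pair of grid positions
            # to a unique bucket id in [0, (2g-1)^2)
            coords = torch.stack(torch.meshgrid(
                torch.arange(g), torch.arange(g), indexing="ij"), dim=-1).reshape(-1, 2)
            rel = coords[:, None, :] - coords[None, :, :] + (g - 1)   # (N, N, 2), >= 0
            return rel[..., 0] * (2 * g - 1) + rel[..., 1]            # (N, N) bucket ids

        def forward(self, x):                              # x: (B, N, C), N = grid_size**2
            B, N, C = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            attn = (q @ k.transpose(-2, -1)) * self.scale             # content term
            rel = self.rel_embed[self.bucket_idx]                     # (N, N, C)
            attn = attn + torch.einsum("bnc,nmc->bnm", q, rel) * self.scale  # query-RPE term
            return attn.softmax(dim=-1) @ v

    # usage: layer = RelPosSelfAttention(dim=64, grid_size=14)
    #        out = layer(torch.randn(2, 14 * 14, 64))      # (2, 196, 64)

For a 14x14 token grid (a 224x224 image with 16x16 patches) this adds a 27x27-bucket table of embeddings per layer; a bias-style variant would replace the query-embedding dot product with a single learnable scalar per bucket.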


Related research

05/26/2023
Improving Position Encoding of Transformers for Multivariate Time Series Classification
Transformers have demonstrated outstanding performance in many applicati...

06/10/2022
Position Labels for Self-Supervised Vision Transformer
Position encoding is important for vision transformer (ViT) to capture t...

04/18/2021
Demystifying the Better Performance of Position Encoding Variants for Transformer
Transformers are state of the art models in NLP that map a given input s...

09/06/2021
PermuteFormer: Efficient Relative Position Encoding for Long Sequences
A recent variation of Transformer, Performer, scales Transformer to long...

06/28/2020
Rethinking Positional Encoding in Language Pre-training
How to explicitly encode positional information into neural networks is ...

07/18/2023
Linearized Relative Positional Encoding
Relative positional encoding is widely used in vanilla and linear transf...
