Do We Really Need Explicit Position Encodings for Vision Transformers?

02/22/2021
by Xiangxiang Chu, et al.

Almost all visual transformers, such as ViT and DeiT, rely on predefined positional encodings to incorporate the order of the input tokens. These encodings are usually implemented as learnable fixed-dimension vectors or sinusoidal functions of different frequencies, which cannot accommodate variable-length input sequences. This limits wider application of transformers in vision, where many tasks require changing the input size on the fly. In this paper, we propose a conditional position encoding scheme that is conditioned on the local neighborhood of each input token. It is implemented by what we call a Position Encoding Generator (PEG), which can be incorporated seamlessly into the current transformer framework. Our new model with PEG, named Conditional Position encoding Visual Transformer (CPVT), can naturally process input sequences of arbitrary length. We demonstrate that CPVT produces attention maps visually similar to those of models with predefined positional encodings, and even better performance. We obtain state-of-the-art results on the ImageNet classification task compared with visual Transformers to date. Our code will be made available at https://github.com/Meituan-AutoML/CPVT .
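
To make the idea of an encoding "conditioned on the local neighborhood" concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not say how PEG computes the encoding, so the sketch assumes the tokens are laid out on a 2D grid and that each token's encoding is a per-channel weighted sum over its 3x3 neighborhood with zero padding at the border. The names `peg` and `kernel` are hypothetical.

```python
def peg(tokens, height, width, kernel):
    """Add a neighborhood-conditioned position encoding to each token.

    tokens: list of height * width vectors in row-major order, each of length C.
    kernel: C separate 3x3 weight grids, one per channel (an assumed design).
    Returns a new list of vectors; the input is not modified.
    """
    assert len(tokens) == height * width
    channels = len(tokens[0])
    out = []
    for i in range(height):
        for j in range(width):
            vec = list(tokens[i * width + j])
            for c in range(channels):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        y, x = i + di, j + dj
                        # Zero padding: neighbors outside the grid add nothing.
                        if 0 <= y < height and 0 <= x < width:
                            w = kernel[c][di + 1][dj + 1]
                            acc += w * tokens[y * width + x][c]
                vec[c] += acc  # encoding is added to the token itself
            out.append(vec)
    return out


# One channel, all-ones 3x3 kernel, identical tokens everywhere.
kernel = [[[1.0, 1.0, 1.0] for _ in range(3)]]
small = peg([[1.0] for _ in range(4)], 2, 2, kernel)  # 2x2 grid, 4 tokens
large = peg([[1.0] for _ in range(9)], 3, 3, kernel)  # 3x3 grid, 9 tokens
```

The same `kernel` is applied to both the 4-token and the 9-token input, so no weight is tied to a fixed sequence length. Under the zero-padding assumption, identical input tokens also receive different encodings at corners, edges, and the center of the grid, which is one way such a scheme could carry position information.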

04/22/2021

Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet

This paper provides a strong baseline for vision transformers on the Ima...
05/31/2021

MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens

Transformers have offered a new methodology of designing neural networks...
04/18/2021

Demystifying the Better Performance of Position Encoding Variants for Transformer

Transformers are state of the art models in NLP that map a given input s...
11/22/2021

MetaFormer is Actually What You Need for Vision

Transformers have shown great potential in computer vision tasks. A comm...
03/17/2022

SepTr: Separable Transformer for Audio Spectrogram Processing

Following the successful application of vision transformers in multiple ...
06/23/2021

Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition

In this paper, we present Vision Permutator, a conceptually simple and d...
03/17/2021

ENCONTER: Entity Constrained Progressive Sequence Generation via Insertion-based Transformer

Pretrained using large amount of data, autoregressive language models ar...