Boosting vision transformers for image retrieval

10/21/2022
by Chull Hwan Song, et al.

Vision transformers have achieved remarkable progress in vision tasks such as image classification and detection. However, in instance-level image retrieval, transformers have not yet shown good performance compared to convolutional networks. We propose a number of improvements that make transformers outperform the state of the art for the first time. (1) We show that a hybrid architecture is more effective than plain transformers, by a large margin. (2) We introduce two branches collecting global (classification token) and local (patch tokens) information, from which we form a global image representation. (3) In each branch, we collect multi-layer features from the transformer encoder, corresponding to skip connections across distant layers. (4) We enhance locality of interactions at the deeper layers of the encoder, which is the relative weakness of vision transformers. We train our model on all commonly used training sets and, for the first time, we make fair comparisons separately per training set. In all cases, we outperform previous models based on global representation. Public code is available at https://github.com/dealicious-inc/DToP.
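The two-branch design in the abstract (a global branch over the classification token and a local branch over patch tokens, each aggregating multi-layer encoder features) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact DToP model: the GeM-style pooling power `p = 3.0` and the simple mean over layers are assumptions chosen for clarity, and `global_descriptor` is a hypothetical helper name.

```python
import numpy as np

def global_descriptor(layer_outputs):
    """Form a global image representation from multi-layer ViT features.

    Sketch only (assumptions, not the paper's exact method): each element
    of `layer_outputs` is a (1 + N, D) array from one encoder layer, with
    row 0 the classification (CLS) token and rows 1..N the patch tokens.
    The global branch collects CLS tokens across layers; the local branch
    pools patch tokens (a GeM-style pooling is assumed here); the two
    parts are concatenated and L2-normalised.
    """
    cls_feats, patch_feats = [], []
    for tokens in layer_outputs:
        cls_feats.append(tokens[0])              # global branch: CLS token
        patches = tokens[1:]
        p = 3.0                                  # assumed GeM pooling power
        pooled = np.mean(np.abs(patches) ** p, axis=0) ** (1.0 / p)
        patch_feats.append(pooled)               # local branch: pooled patches
    # Multi-layer aggregation (plain mean assumed) and concatenation
    g = np.concatenate([np.mean(cls_feats, axis=0),
                        np.mean(patch_feats, axis=0)])
    return g / np.linalg.norm(g)                 # unit-norm retrieval descriptor

# Toy example: 3 encoder layers, 4 patch tokens, feature dimension 8
rng = np.random.default_rng(0)
outs = [rng.standard_normal((5, 8)) for _ in range(3)]
desc = global_descriptor(outs)
print(desc.shape)  # (16,): CLS part and patch part concatenated
```

The unit-norm output lets image similarity be computed as a plain dot product, which is the usual setup for global-representation retrieval.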


Related research

05/31/2021 · MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens
Transformers have offered a new methodology of designing neural networks...

02/10/2021 · Training Vision Transformers for Image Retrieval
Transformers have shown outstanding results for natural language underst...

04/26/2021 · Improve Vision Transformers Training by Suppressing Over-smoothing
Introducing the transformer structure into computer vision tasks holds t...

07/13/2021 · Visual Parser: Representing Part-whole Hierarchies with Transformers
Human vision is able to capture the part-whole hierarchical information ...

03/12/2022 · The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy
Vision transformers (ViTs) have gained increasing popularity as they are...

03/03/2022 · Ensembles of Vision Transformers as a New Paradigm for Automated Classification in Ecology
Monitoring biodiversity is paramount to manage and protect natural resou...

09/09/2022 · EchoCoTr: Estimation of the Left Ventricular Ejection Fraction from Spatiotemporal Echocardiography
Learning spatiotemporal features is an important task for efficient vide...
