RegionViT: Regional-to-Local Attention for Vision Transformers

06/04/2021
by Chun-Fu Chen, et al.

The vision transformer (ViT) has recently shown a strong capability to achieve results comparable to convolutional neural networks (CNNs) on image classification. However, the vanilla ViT simply inherits its architecture directly from natural language processing, which is often not optimized for vision applications. Motivated by this, in this paper we propose a new architecture that adopts a pyramid structure and employs a novel regional-to-local attention rather than global self-attention in vision transformers. More specifically, our model first generates regional tokens and local tokens from an image with different patch sizes, where each regional token is associated with a set of local tokens based on spatial location. The regional-to-local attention includes two steps: first, regional self-attention extracts global information among all regional tokens, and then local self-attention exchanges information between each regional token and its associated local tokens. Therefore, even though local self-attention confines its scope to a local region, it can still receive global information. Extensive experiments on three vision tasks, including image classification, object detection, and action recognition, show that our approach outperforms or is on par with state-of-the-art ViT variants, including many concurrent works. Our source code and models will be publicly available.
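The two-step mechanism described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's implementation: the module name R2LAttention, the token dimensions, and the use of nn.MultiheadAttention are assumptions made here for clarity.

```python
# A minimal sketch of regional-to-local attention (illustrative, not the
# authors' code). Assumes local tokens are already grouped per region.
import torch
import torch.nn as nn

class R2LAttention(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        # Step 1: global self-attention over regional tokens only.
        self.regional_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Step 2: self-attention within each region over
        # [1 regional token + its local tokens].
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, regional, local):
        # regional: (B, R, C)     one token per region
        # local:    (B, R, N, C)  N local tokens per region
        B, R, N, C = local.shape
        # Exchange global information among all regional tokens.
        regional, _ = self.regional_attn(regional, regional, regional)
        # Prepend each region's token to its local tokens and attend within
        # the group, so local tokens receive global context through it.
        tokens = torch.cat([regional.unsqueeze(2), local], dim=2)  # (B, R, 1+N, C)
        tokens = tokens.reshape(B * R, 1 + N, C)
        tokens, _ = self.local_attn(tokens, tokens, tokens)
        tokens = tokens.reshape(B, R, 1 + N, C)
        return tokens[:, :, 0], tokens[:, :, 1:]  # updated regional, local

# Example: 14x14 regions, each covering 7x7 local tokens (e.g., a 224x224
# image with 4-pixel local patches and 28-pixel regional patches).
x_regional = torch.randn(2, 196, 96)
x_local = torch.randn(2, 196, 49, 96)
r, l = R2LAttention(96)(x_regional, x_local)
```

Because the per-region attention runs over only 1 + N tokens rather than all tokens in the image, the quadratic cost of self-attention is confined to small windows, while the regional tokens provide the global pathway.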


