NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition

11/25/2021
by Hao Liu, et al.

Recently, Vision Transformers (ViTs), with self-attention (SA) as the de facto ingredient, have demonstrated great potential in the computer vision community. To trade off efficiency against performance, a group of works perform SA only within local patches, abandoning the global contextual information that is indispensable for visual recognition tasks. To address this, subsequent global-local ViTs combine local SA with global SA, either in parallel or in alternation within the model. Nevertheless, exhaustively combining local and global context may introduce redundancy for various visual data, and the receptive field within each layer is fixed. A more graceful alternative is to let global and local context contribute adaptively to accommodate different visual data. To achieve this goal, we propose a novel ViT architecture, termed NomMer, which can dynamically Nominate the synergistic global-local context in vision transforMer. By investigating the working pattern of NomMer, we further explore what context information it focuses on. Benefiting from this "dynamic nomination" mechanism, without bells and whistles, NomMer not only achieves 84.5% Top-1 accuracy on ImageNet with only 73M parameters, but also shows promising performance on dense prediction tasks, i.e., object detection and semantic segmentation. The code and models will be made publicly available at <https://github.com/NomMer1125/NomMer>.
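
To make the local-versus-global distinction in the abstract concrete, the following is a minimal NumPy sketch, not the NomMer mechanism itself, which the abstract does not specify. It assumes a 1-D token sequence instead of 2-D image patches, uses the tokens directly as queries, keys, and values (no learned projections), and mixes the two contexts with a hypothetical per-token gate called `gate_logits`. All function names here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention; works on (n, d) or batched (b, n, d) inputs.
    scale = 1.0 / np.sqrt(q.shape[-1])
    weights = softmax((q @ np.swapaxes(k, -1, -2)) * scale, axis=-1)
    return weights @ v

def local_attention(x, window):
    # Tokens attend only to tokens inside the same non-overlapping window.
    n, d = x.shape
    assert n % window == 0, "sequence length must be divisible by window"
    blocks = x.reshape(n // window, window, d)
    return attention(blocks, blocks, blocks).reshape(n, d)

def global_attention(x):
    # Every token attends to every other token.
    return attention(x, x, x)

def mixed_context(x, window, gate_logits):
    # gate_logits: (n, 2) hypothetical per-token scores for [local, global].
    # Assumption for illustration only: a soft gate blends the two contexts.
    gate = softmax(gate_logits, axis=-1)
    return gate[:, :1] * local_attention(x, window) + gate[:, 1:] * global_attention(x)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))             # 8 tokens, 4 channels
gate_logits = rng.normal(size=(8, 2))   # stand-in for a learned gate
out = mixed_context(x, window=4, gate_logits=gate_logits)
```

In this sketch the gate lets each token weight local and global context differently, which is one simple way to picture context that adapts to the input rather than being fixed per layer.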


research
07/10/2021

Local-to-Global Self-Attention in Vision Transformers

Transformers have demonstrated great potential in computer vision tasks....
research
01/06/2021

RethNet: Object-by-Object Learning for Detecting Facial Skin Problems

Semantic segmentation is a hot topic in computer vision where the most c...
research
09/13/2023

Dynamic Spectrum Mixer for Visual Recognition

Recently, MLP-based vision backbones have achieved promising performance...
research
07/19/2022

Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective

Visual representation learning is the key of solving various vision prob...
research
10/14/2022

Vision Transformer Visualization: What Neurons Tell and How Neurons Behave?

Recently vision transformers (ViT) have been applied successfully for va...
research
05/19/2023

Enhancing Transformer Backbone for Egocentric Video Action Segmentation

Egocentric temporal action segmentation in videos is a crucial task in c...
research
02/13/2022

BViT: Broad Attention based Vision Transformer

Recent works have demonstrated that transformer can achieve promising pe...
