Feature Fusion Vision Transformer for Fine-Grained Visual Categorization

07/06/2021
by   Jun Wang, et al.

The core of tackling fine-grained visual categorization (FGVC) is learning subtle yet discriminative features. Most previous works achieve this by explicitly selecting the discriminative parts or by integrating an attention mechanism via CNN-based approaches. However, these methods increase computational complexity and leave the model dominated by the regions that contain most of the object. Recently, the vision transformer (ViT) has achieved SOTA performance on general image recognition tasks. The self-attention mechanism aggregates and weights the information from all patches into the classification token, making it well suited to FGVC. Nonetheless, the classification token in the deep layers pays more attention to global information and lacks the local and low-level features that are essential for FGVC. In this work, we propose a novel pure transformer-based framework, Feature Fusion Vision Transformer (FFVT), in which we aggregate the important tokens from each transformer layer to compensate for the missing local, low-level and middle-level information. We design a novel token selection module called mutual attention weight selection (MAWS) to guide the network effectively and efficiently towards selecting discriminative tokens without introducing extra parameters. We verify the effectiveness of FFVT on three benchmarks, where FFVT achieves state-of-the-art performance.
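The abstract does not spell out how MAWS scores tokens, so the following is only a minimal sketch under stated assumptions. It assumes that each layer exposes a square matrix of raw attention scores with the classification token at index 0, and that a patch token's score is the product of two softmax-normalized weights: the weight the classification token gives the patch, and the weight the patch gives the classification token. The function names `select_tokens` and `fuse` are hypothetical, and how the fused tokens are consumed afterwards is left out because the abstract does not say.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_tokens(attn, k):
    """Return indices of the k patch tokens with the highest mutual score.

    attn: square list of lists of raw attention scores for one layer,
          where row/column 0 belongs to the classification token.
    The scoring rule is an assumption, not taken from the abstract.
    """
    n = len(attn)
    cls_to_patch = softmax([attn[0][j] for j in range(n)])  # row 0
    patch_to_cls = softmax([attn[i][0] for i in range(n)])  # column 0
    # Score patch tokens only (index 0 is the classification token).
    scores = {i: cls_to_patch[i] * patch_to_cls[i] for i in range(1, n)}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def fuse(layer_tokens, layer_attn, k):
    """Collect the k selected tokens from every layer into one list."""
    fused = []
    for tokens, attn in zip(layer_tokens, layer_attn):
        for idx in select_tokens(attn, k):
            fused.append(tokens[idx])
    return fused
```

The selection step reuses attention scores that the layer has already computed, so it adds no learnable parameters, which is consistent with the claim in the abstract.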


research
03/14/2021

TransFG: A Transformer Architecture for Fine-grained Recognition

Fine-grained visual classification (FGVC) which aims at recognizing obje...
research
08/31/2022

SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization

Fine-grained visual categorization (FGVC) aims at recognizing objects fr...
research
04/21/2022

R2-Trans: Fine-Grained Visual Categorization with Redundancy Reduction

Fine-grained visual categorization (FGVC) aims to discriminate similar s...
research
05/11/2023

Salient Mask-Guided Vision Transformer for Fine-Grained Classification

Fine-grained visual classification (FGVC) is a challenging computer visi...
research
03/23/2023

MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer

Mobile monocular 3D object detection (Mono3D) (e.g., on a vehicle, a dro...
research
07/14/2021

Transformer with Peak Suppression and Knowledge Guidance for Fine-grained Image Recognition

Fine-grained image recognition is challenging because discriminative clu...
research
07/16/2022

TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval

Text-Video retrieval is a task of great practical value and has received...
