RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition

07/17/2021
by Yunqing Hu, et al.

In fine-grained image recognition (FGIR), localizing and amplifying region attention is an important factor, and it has been explored extensively by approaches based on convolutional neural networks (CNNs). The recently developed vision transformer (ViT) has achieved promising results on computer vision tasks; compared with CNNs, image sequentialization is a brand-new paradigm. However, because its patches have a fixed size, ViT is limited in receptive field, lacks the local attention of CNNs, and cannot generate the multi-scale features needed to learn discriminative region attention. To facilitate the learning of discriminative region attention without box/part annotations, we use the strength of the attention weights to measure the importance of the patch tokens corresponding to the raw image. We propose the recurrent attention multi-scale transformer (RAMS-Trans), which uses the transformer's self-attention to recursively learn discriminative region attention in a multi-scale manner. At the core of our approach lies the dynamic patch proposal module (DPPM), which guides region amplification to integrate multi-scale image patches. The DPPM starts from the full-size image patches and iteratively scales up the region attention, generating new patches from global to local with the intensity of the attention weights at each scale as an indicator. Our approach requires only the attention weights that come with ViT itself and can be trained end-to-end. Extensive experiments demonstrate that RAMS-Trans outperforms concurrent works as well as efficient CNN models, achieving state-of-the-art results on three benchmark datasets.
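The abstract's core idea, using the intensity of ViT's own attention weights to pick a region to amplify at the next scale, can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' DPPM implementation: the function name, the mean-intensity threshold, and the single-layer CLS-to-patch attention are all assumptions made for clarity.

```python
import numpy as np

def propose_region(attn, grid=14, img_size=224):
    """Hypothetical sketch of attention-guided region amplification.

    attn: (heads, N+1, N+1) self-attention weights from one transformer
          layer, where token 0 is the CLS token and N = grid * grid.
    Returns (x0, y0, x1, y1), pixel coordinates of the region to crop
    and upsample into new patches for the next scale.
    """
    # Per-patch importance: CLS-to-patch attention averaged over heads
    cls_attn = attn.mean(axis=0)[0, 1:]          # shape (N,)
    heat = cls_attn.reshape(grid, grid)          # 2-D attention map

    # Keep patches whose weight reaches the mean intensity (assumed threshold)
    mask = heat >= heat.mean()
    ys, xs = np.nonzero(mask)

    # Bounding box of the high-attention patches, mapped back to pixels
    patch = img_size // grid
    x0, x1 = int(xs.min()) * patch, (int(xs.max()) + 1) * patch
    y0, y1 = int(ys.min()) * patch, (int(ys.max()) + 1) * patch
    return x0, y0, x1, y1
```

The returned box would then be cropped from the raw image, resized to the full input resolution, and fed through the transformer again, repeating the global-to-local loop the abstract describes.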

