M2Former: Multi-Scale Patch Selection for Fine-Grained Visual Recognition

08/04/2023
by Jiyong Moon, et al.

Recently, vision Transformers (ViTs) have been actively applied to fine-grained visual recognition (FGVR). Through its inherent self-attention mechanism, a ViT can effectively model the interdependencies between patch-divided object regions. In addition, patch selection is used with ViT to remove redundant patch information and highlight the most discriminative object patches. However, existing ViT-based FGVR models are limited to single-scale processing, and their fixed receptive fields hinder representational richness and make them more vulnerable to scale variability. Therefore, we propose multi-scale patch selection (MSPS) to improve the multi-scale capabilities of existing ViT-based models. Specifically, MSPS selects salient patches of different scales at different stages of a multi-scale vision Transformer (MS-ViT). In addition, we introduce class token transfer (CTT) and multi-scale cross-attention (MSCA) to model cross-scale interactions between the selected multi-scale patches and fully reflect them in model decisions. Compared to previous single-scale patch selection (SSPS), the proposed MSPS encourages richer object representations based on the feature hierarchy and consistently improves performance on objects ranging from small to large. The resulting model, M2Former, outperforms CNN- and ViT-based models on several widely used FGVR benchmarks.
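The idea of selecting salient patches at several stages can be illustrated with a minimal sketch. The abstract does not say how saliency is scored, so the scoring rule below (mean absolute activation of a patch embedding) is purely an assumption, and the function names `patch_saliency`, `select_top_k`, and `multi_scale_patch_selection` are hypothetical. The sketch covers only the selection step, not CTT or MSCA, and should not be read as the authors' implementation.

```python
from typing import List, Sequence

Patch = List[float]  # one patch embedding (feature vector)


def patch_saliency(patch: Sequence[float]) -> float:
    """Assumed saliency score: mean absolute activation of the embedding."""
    return sum(abs(v) for v in patch) / len(patch)


def select_top_k(patches: Sequence[Patch], k: int) -> List[int]:
    """Return the indices of the k most salient patches, in original order."""
    ranked = sorted(
        range(len(patches)),
        key=lambda i: patch_saliency(patches[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


def multi_scale_patch_selection(
    stages: Sequence[Sequence[Patch]], ks: Sequence[int]
) -> List[List[Patch]]:
    """Keep ks[s] salient patches from stage s of a multi-stage backbone.

    Each stage is assumed to hold patches of a different scale, with later
    stages holding fewer, coarser patches.
    """
    if len(stages) != len(ks):
        raise ValueError("one k value is required per stage")
    return [
        [stage[i] for i in select_top_k(stage, k)]
        for stage, k in zip(stages, ks)
    ]


# Toy example: two stages with 4 and 2 patches respectively.
stages = [
    [[0.1, 0.1], [2.0, -2.0], [0.5, 0.5], [0.0, 0.0]],
    [[1.0], [3.0]],
]
selected = multi_scale_patch_selection(stages, ks=[2, 1])
```

In this toy example, two patches are kept from the first stage and one from the second, so each scale contributes its own most salient regions to the later cross-scale interaction.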

research
07/17/2021

RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition

In fine-grained image recognition (FGIR), the localization and amplifica...
research
10/27/2017

Enhanced Biologically Inspired Model for Image Recognition Based on a Novel Patch Selection Method with Moment

Biologically inspired model (BIM) for image recognition is a robust comp...
research
07/12/2022

Trusted Multi-Scale Classification Framework for Whole Slide Image

Despite the remarkable efforts that have been made, the classification of gigapixels w...
research
12/13/2022

OAMixer: Object-aware Mixing Layer for Vision Transformers

Patch-based models, e.g., Vision Transformers (ViTs) and Mixers, have sh...
research
04/26/2018

Visual Estimation of Building Condition with Patch-level ConvNets

The condition of a building is an important factor for real estate valua...
research
08/15/2022

Cross-scale Attention Guided Multi-instance Learning for Crohn's Disease Diagnosis with Pathological Images

Multi-instance learning (MIL) is widely used in the computer-aided inter...
research
08/01/2021

Knowing When to Quit: Selective Cascaded Regression with Patch Attention for Real-Time Face Alignment

Facial landmarks (FLM) estimation is a critical component in many face-r...
