DeepAI AI Chat
Log In Sign Up

Exploring Vision Transformers for Fine-grained Classification

by   Marcos V. Conde, et al.

Existing computer vision research in categorization struggles with fine-grained attributes recognition due to the inherently high intra-class variances and low inter-class variances. SOTA methods tackle this challenge by locating the most informative image regions and rely on them to classify the complete image. The most recent work, Vision Transformer (ViT), shows its strong performance in both traditional and fine-grained classification tasks. In this work, we propose a multi-stage ViT framework for fine-grained image classification tasks, which localizes the informative image regions without requiring architectural changes using the inherent multi-head self-attention mechanism. We also introduce attention-guided augmentations for improving the model's capabilities. We demonstrate the value of our approach by experimenting with four popular fine-grained benchmarks: CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC7 Plant Pathology. We also prove our model's interpretability via qualitative results.


page 2

page 3

page 4


TransFG: A Transformer Architecture for Fine-grained Recognition

Fine-grained visual classification (FGVC) which aims at recognizing obje...

Conviformers: Convolutionally guided Vision Transformer

Vision transformers are nowadays the de-facto preference for image class...

Drawing Attention to Detail: Pose Alignment through Self-Attention for Fine-Grained Object Classification

Intra-class variations in the open world lead to various challenges in c...

Fine-grained Categorization -- Short Summary of our Entry for the ImageNet Challenge 2012

In this paper, we tackle the problem of visual categorization of dog bre...

Fine-grained Classification of Solder Joints with α-skew Jensen-Shannon Divergence

Solder joint inspection (SJI) is a critical process in the production of...

Coarse-to-Fine Vision Transformer

Vision Transformers (ViT) have made many breakthroughs in computer visio...

A Fine-Grained Visual Attention Approach for Fingerspelling Recognition in the Wild

Fingerspelling in sign language has been the means of communicating tech...