Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting

05/08/2023
by Zhicheng Wang, et al.

Class-agnostic counting (CAC) aims to count objects of interest in a query image given a few exemplars. This task is typically addressed by extracting features of the query image and the exemplars with (un)shared feature extractors and then matching their feature similarity, leading to an extract-then-match paradigm. In this work, we show that CAC can be simplified into an extract-and-match manner, particularly using a pretrained and plain vision transformer (ViT), where feature extraction and similarity matching are executed simultaneously within the self-attention. We reveal the rationale for this simplification from a decoupled view of the self-attention and point out that it is only possible when the query and exemplar tokens are concatenated as input. The resulting model, termed CACViT, simplifies the CAC pipeline and unifies the feature spaces of the query image and the exemplars. In addition, we find that CACViT naturally encodes background information within self-attention, which helps reduce background disturbance. Further, to compensate for the loss of scale and order-of-magnitude information caused by resizing and normalization in ViT, we present two effective strategies for scale and magnitude embedding. Extensive experiments on the FSC147 and CARPK datasets show that CACViT significantly outperforms state-of-the-art CAC approaches in both effectiveness (23.60% error reduction) and generalization, suggesting that CACViT provides a concise and strong baseline for CAC. Code will be available.
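To make the extract-and-match idea concrete, the sketch below shows how concatenating query-image and exemplar tokens lets a single self-attention pass both refine features and produce a query-to-exemplar similarity map. This is a minimal illustration assuming a PyTorch MultiheadAttention layer as a stand-in for a plain ViT block; the module name ExemplarQueryAttention, the token counts, and the embedding dimension are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of "extract-and-match" within one self-attention layer.
# Assumptions (not from the paper's code): PyTorch MultiheadAttention as a
# stand-in ViT block, 384-dim tokens, averaged attention weights as similarity.
import torch
import torch.nn as nn


class ExemplarQueryAttention(nn.Module):
    def __init__(self, dim: int = 384, num_heads: int = 6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, query_tokens: torch.Tensor, exemplar_tokens: torch.Tensor):
        # Concatenate query-image and exemplar tokens into one joint sequence.
        tokens = torch.cat([query_tokens, exemplar_tokens], dim=1)
        # One self-attention pass over the joint sequence: value aggregation
        # refines the features ("extract"), while the attention weights between
        # query and exemplar tokens act as a similarity map ("match").
        out, attn_weights = self.attn(tokens, tokens, tokens)
        n_q = query_tokens.shape[1]
        # Cross-block of the attention map: query patches attending to exemplars.
        similarity = attn_weights[:, :n_q, n_q:]
        return out, similarity


if __name__ == "__main__":
    # Illustrative sizes: 196 query patch tokens, 3 exemplars of 49 tokens each.
    layer = ExemplarQueryAttention()
    q = torch.randn(1, 196, 384)
    e = torch.randn(1, 3 * 49, 384)
    out, sim = layer(q, e)
    print(out.shape, sim.shape)  # (1, 343, 384) and (1, 196, 147)
```

Note that the same attention map also contains query-to-query weights (the block dropped above), which is where the abstract's point about naturally encoded background information comes in: tokens from background regions can be contrasted against exemplar responses without any extra matching module.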


