CaFT: Clustering and Filter on Tokens of Transformer for Weakly Supervised Object Localization

01/03/2022
by   Ming Li, et al.

Weakly supervised object localization (WSOL) is the challenging task of localizing an object using only category labels. However, classification and localization conflict, because an accurate classification network tends to attend to the most discriminative regions of an object rather than its full extent. We argue that this tendency is caused by the handcrafted threshold selection in CAM-based methods. We therefore propose Clustering and Filter of Tokens (CaFT), built on a Vision Transformer (ViT) backbone, to address the problem in a different way. First, CaFT splits the image into patches, feeds the patch tokens to the ViT, and clusters the output tokens to generate an initial mask of the object. Second, CaFT treats the initial mask as a pseudo label to train a shallow convolutional head (Attention Filter, AtF) on top of the backbone, which extracts the mask directly from the tokens. Third, CaFT splits the image into parts, predicts a mask for each part, and merges them into one refined mask. Finally, a new AtF is trained on the refined masks and used to predict the bounding box of the object. Experiments show that CaFT outperforms previous work, achieving 97.55% and 69.86% localization accuracy with ground-truth class on CUB-200 and ImageNet-1K, respectively. CaFT offers a fresh perspective on the WSOL task.
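
The first step (clustering output tokens into an initial mask) and the last step (reading a box off a mask) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering algorithm or say how the foreground cluster is chosen, so the 2-cluster k-means, the border heuristic, the 16-pixel patch size, and the function names `two_means`, `initial_mask`, and `mask_to_box` are all assumptions made for illustration. Synthetic vectors stand in for ViT output tokens.

```python
import numpy as np

def two_means(tokens, iters=20):
    """Tiny 2-cluster k-means over token vectors of shape (N, D).
    Assumption: the abstract does not say which clustering method is used."""
    centered = tokens - tokens.mean(axis=0)
    # Deterministic init: the token farthest from the mean, then the token
    # farthest from that one.
    first = int(np.argmax(np.linalg.norm(centered, axis=1)))
    second = int(np.argmax(np.linalg.norm(tokens - tokens[first], axis=1)))
    centers = tokens[[first, second]].astype(float)
    labels = np.zeros(len(tokens), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(tokens[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = tokens[labels == k].mean(axis=0)
    return labels

def initial_mask(tokens, grid_h, grid_w):
    """Cluster tokens and return a boolean foreground mask on the patch grid."""
    labels = two_means(tokens).reshape(grid_h, grid_w)
    border = np.ones((grid_h, grid_w), dtype=bool)
    border[1:-1, 1:-1] = False
    # Assumed heuristic (not from the paper): the cluster that covers less of
    # the image border is treated as the object.
    border_share = [np.mean(labels[border] == k) for k in range(2)]
    return labels == int(np.argmin(border_share))

def mask_to_box(mask, patch=16):
    """Tightest (x0, y0, x1, y1) pixel box around the mask, or None if empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return (int(xs.min()) * patch, int(ys.min()) * patch,
            (int(xs.max()) + 1) * patch, (int(ys.max()) + 1) * patch)

# Synthetic stand-in for ViT output tokens: a 14x14 grid of 8-dim vectors
# with a shifted block playing the role of the object.
rng = np.random.default_rng(0)
grid = rng.normal(0.0, 0.1, size=(14, 14, 8))
grid[4:10, 5:12] += 3.0
mask = initial_mask(grid.reshape(-1, 8), 14, 14)
box = mask_to_box(mask)
print(box)
```

In the pipeline described above, a mask of this kind serves only as a pseudo label for training the AtF head; the trained head, not the clustering, produces the final prediction.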


