Multi-class Token Transformer for Weakly Supervised Semantic Segmentation

03/06/2022
by   Lian Xu, et al.

This paper proposes a new transformer-based framework to learn class-specific object localization maps as pseudo labels for weakly supervised semantic segmentation (WSSS). Inspired by the fact that the attended regions of the single class token in the standard vision transformer can be leveraged to form a class-agnostic localization map, we investigate whether the transformer model can also effectively capture class-specific attention for more discriminative object localization by learning multiple class tokens within the transformer. To this end, we propose the Multi-class Token Transformer, termed MCTformer, which uses multiple class tokens to learn interactions between the class tokens and the patch tokens. The proposed MCTformer can successfully produce class-discriminative object localization maps from the class-to-patch attention corresponding to different class tokens. We also propose a patch-level pairwise affinity, extracted from the patch-to-patch transformer attention, to further refine the localization maps. Moreover, the proposed framework is shown to fully complement the Class Activation Mapping (CAM) method, leading to remarkably superior WSSS results on the PASCAL VOC and MS COCO datasets. These results underline the importance of the class token for WSSS.
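As a rough illustration of the mechanism the abstract describes (a simplified NumPy sketch, not the authors' implementation, which also aggregates attention across layers and heads): if the transformer's attention matrix places C class tokens before the h×w patch tokens, then the class-to-patch rows give one localization map per class, and the patch-to-patch block serves as a pairwise affinity for refining those maps:

```python
import numpy as np

def class_localization_maps(attn, num_classes, h, w):
    """Extract class-specific localization maps from a transformer
    attention matrix laid out as [class tokens; h*w patch tokens].

    attn: (C + h*w, C + h*w) row-normalised attention weights.
    Returns the raw class-to-patch maps and the affinity-refined maps,
    both of shape (C, h, w).
    """
    C = num_classes
    # Class-to-patch attention: one row per class token, one column per patch.
    cls2patch = attn[:C, C:]                        # (C, h*w)
    maps = cls2patch.reshape(C, h, w)               # raw localization maps
    # Patch-to-patch attention acts as a patch-level pairwise affinity.
    affinity = attn[C:, C:]                         # (h*w, h*w)
    # Propagate each class map through the affinity to refine it.
    refined = (affinity @ cls2patch.T).T.reshape(C, h, w)
    return maps, refined
```

In the paper, the refined class-specific maps are further combined with CAM scores; the sketch above only shows the attention-extraction step on a single attention matrix.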


