WegFormer: Transformers for Weakly Supervised Semantic Segmentation

by Chunmeng Liu, et al.

Although convolutional neural networks (CNNs) have achieved remarkable progress in weakly supervised semantic segmentation (WSSS), the effective receptive field of CNNs is insufficient to capture global context information, leading to sub-optimal results. Inspired by the great success of Transformers in fundamental vision areas, this work introduces, for the first time, a Transformer to build a simple and effective WSSS framework, termed WegFormer. Unlike existing CNN-based methods, WegFormer uses Vision Transformer (ViT) as a classifier to produce high-quality pseudo segmentation masks. To this end, we introduce three tailored components in our Transformer-based framework: (1) a Deep Taylor Decomposition (DTD) to generate attention maps, (2) a soft erasing module to smooth the attention maps, and (3) an efficient potential object mining (EPOM) module to filter noisy activations in the background. Without any bells and whistles, WegFormer achieves a state-of-the-art 70.5% mIoU on the PASCAL VOC 2012 dataset, significantly outperforming the previous best method. We hope WegFormer provides a new perspective to tap the potential of Transformers in weakly supervised semantic segmentation. Code will be released.



1 Introduction

Semantic segmentation plays an irreplaceable role in many computer vision tasks, such as autonomous driving and remote sensing. The semantic segmentation community has witnessed continuous improvements in recent years, benefiting from the rapid development of convolutional neural networks (CNNs). However, expensive pixel-level annotations force researchers to look for cheaper and more efficient annotations to help with semantic segmentation tasks. Image-level annotations are cheap and readily available, but it is challenging to learn high-quality semantic segmentation models using these weakly supervised annotations.

This paper focuses on weakly supervised semantic segmentation (WSSS) with image-level annotations.

Most existing WSSS frameworks [38, 17] are based on conventional CNN backbones such as ResNet [13] and VGG [29]. However, the effective receptive field of these CNN-based methods is limited, leading to unsatisfactory segmentation results. Recently, Transformers, used initially in natural language processing (NLP), have begun to show powerful performance in various fundamental areas of computer vision. Different from CNNs, vision transformers are proven to be able to extract global context information [39], which is essential for segmentation tasks and brings fresh thinking to the field of WSSS.

Figure 1: Visualization of WegFormer with or without saliency-map assistance. “Sal” denotes the saliency map. From (d) and (e), we see that the saliency map is essential for our Transformer-based framework: it helps filter the redundant background noise in the attention map. Zoom in for best view.

Inspired by this, we develop a simple yet effective WSSS framework based on Vision Transformer (ViT), termed WegFormer. Generally, WegFormer has three key parts as follows: an attention map generator based on Deep Taylor Decomposition (DTD) [6], a soft erasing module, and an efficient potential object mining (EPOM) module.

Among these parts, (1) the DTD-based attention map generator produces attention maps for the target object. DTD explains the Transformer by propagating relevancy scores through its layers [6], yielding high responses for the concrete class. We introduce DTD to generate attention maps and integrate them into our WSSS framework. (2) In the soft erasing module, we introduce a soft rate to narrow the gap between high-response and low-response regions, making the attention map smoother. (3) EPOM further refines the attention map using saliency maps. Although DTD does an excellent job of distinguishing foreground from background, it also introduces certain noise regions. Saliency maps produced by an offline salient object detector can greatly eliminate this noise, as shown in Figure 1 (d) and (e).

Equipped with these designs, WegFormer achieves state-of-the-art performance on PASCAL VOC 2012 [10]. Notably, WegFormer achieves 66.2% mIoU on the PASCAL VOC 2012 validation set with VGG-16 as the backbone in the self-training stage, outperforming CNN-based counterparts. Moreover, when using the heavier ResNet-101 backbone in the self-training stage, WegFormer reaches 70.5% mIoU, the highest on the PASCAL VOC 2012 validation set, significantly outperforming the previous best method NSROM [41] by 2.2 points.

Our contributions can be summarized as follows:

(1) We propose a simple yet effective Transformer-based WSSS framework, termed WegFormer, which can effectively capture global context information to generate high-quality semantic masks with only image-level annotations. To our knowledge, this is the first work to introduce Transformers into WSSS tasks.

(2) We carefully design three important components in WegFormer, including (1) a DTD-based attention map generator to get an initial attention map for the target object; (2) a soft erasing module to smooth the attention map; (3) an efficient potential object mining (EPOM) to filter background noise in attention map to generate finer pseudo label.

(3) WegFormer achieves state-of-the-art performance on PASCAL VOC 2012 dataset, showing the huge potential of Transformer in WSSS tasks. We hope that WegFormer is a good start for the research in the Transformer-based weakly-supervised segmentation area.

2 Related Work

Weakly Supervised Semantic Segmentation.

Weakly supervised semantic segmentation aims to learn pixel-level prediction from insufficient labels. Existing solutions are usually based on convolutional networks [38, 17] and fall into two main streams. The first stream uses adversarial or random erasing during training. [37] generates CAMs with an adversarial erasing strategy, which discovers information beyond the most discriminative regions but also introduces some noise. [15] proposes SeeNet, a framework that prevents attention from spreading to background areas. [44] proposes Adversarial Complementary Learning (ACoL) to localize objects automatically. The other stream spreads CAMs from high-confidence areas to low-confidence ones. [16] trains a classification network and then applies a region-growing algorithm to train the segmentation network. [11] employs multiple estimations to obtain multiple seeds and relieve the inaccuracy of a single seed.

Despite the effectiveness of the above methods, it is difficult to avoid introducing background noise. Recent trends [21, 30, 41] introduce a saliency detector, trained offline on other datasets, to generate saliency maps and eliminate background noise.

Transformers in Computer Vision.

Transformer has been the dominant architecture in NLP and has become popular in the vision community. Vision Transformer (ViT) [9] is the first work to introduce the Transformer into image classification, dividing an image into 16×16 patches. IPT [7] is the first Transformer pre-trained model for low-level vision, combining multiple tasks. After that, more and more ViT variants were proposed to extend ViT in different aspects, such as DeiT [31] for efficient training, and the PVT series [33, 32] and Swin [24] for dense prediction. Benefiting from the global receptive field of self-attention [39], the Transformer captures global context dynamically, which is friendly to dense prediction tasks such as object detection and semantic segmentation.

Different from previous Transformer frameworks that improve strongly-supervised semantic segmentation [40, 39], this paper utilizes Transformer to solve weakly-supervised semantic segmentation with only image-level annotation.

Figure 2: Overview pipeline of WegFormer. First, a multi-label classification loss is used to train the vision transformer. Second, the initial relevance score is back-propagated based on the Deep Taylor Decomposition (DTD) principle to obtain gradient maps and relevance maps, which we integrate to get initial attention maps. Third, soft erase is utilized to smooth the attention map between high and low responses. Fourth, efficient potential object mining (EPOM) prevents pseudo labels from introducing wrong classes and filters out excess background information with saliency maps. Finally, the pseudo label is generated and used to self-train the segmentation network.
Neural Network Visualization.

The deep neural network is a black box, and previous works try to analyze it by visualizing feature representations in different manners. [45] multiplies the weights of the fully connected layer after global average pooling (GAP) with the spatial feature maps before GAP to generate class activation maps (CAM), which activate regions containing objects. [27] utilizes the gradient of back-propagation to obtain an attention map without modifying the network structure. CAM and Grad-CAM both work well for ConvNets but are not suitable for Vision Transformers. Due to the fundamental architectural difference between Transformer and ConvNet, some recent attempts visualize Transformer features. According to the characteristics of the Transformer, [1] proposes two methods to quantify the information flow through self-attention, termed attention rollout and attention flow. The rollout method redistributes all attention scores by considering pairs of attention and assuming a linear combination of attention in subsequent contexts. Attention flow considers the maximum flow along the pair-wise attention graph. [6] introduces a relevance map based on deep Taylor decomposition and combines the gradient of the attention map with the relevancy map.
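As a concrete illustration of the CAM construction described above, here is a minimal NumPy sketch; the shapes and values are toy examples, not from any of the cited papers:

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """CAM: weight the pre-GAP feature maps by the FC weights of one class.

    features:   (K, H, W) spatial feature maps before global average pooling
    fc_weights: (C, K) fully connected weights applied after GAP
    """
    # Weighted sum over the K channels yields an (H, W) activation map.
    return np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))

# Toy example: K = 2 channels, a 2x2 spatial map, C = 1 class.
feats = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[0.0, 2.0], [2.0, 0.0]]])
w = np.array([[1.0, 0.5]])                 # class-0 weights for the 2 channels
cam = class_activation_map(feats, w, class_idx=0)
```

The same weighted-sum structure is what makes CAM depend on a GAP+FC head, and hence unsuitable for the Transformer architecture discussed above.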

3 Methodology

The overall framework of WegFormer is illustrated in Figure 2. First, an input RGB image is split into 16×16 patches and fed into a vision transformer classifier. Second, Deep Taylor Decomposition (DTD) [6] is used to generate the initial attention maps. Third, the soft erase operator smooths attention maps by narrowing the gap between high-response and low-response areas. Fourth, in EPOM, saliency maps help filter redundant background noise, yielding refined attention maps while avoiding the introduction of false information into the final pseudo labels. Finally, the generated pseudo labels are fed to the segmentation network for self-training to further improve performance.

3.1 Attention Map Generation

Generally, a classification network is trained first, and we can combine its weights and features to generate class-aware activation maps. After post-processing, the activation maps are used as pseudo labels to train a segmentation network and improve mask quality. Different from previous works that use ConvNets as the classifier, we introduce DeiT [31], a variant of ViT, as the classification network to capture global contextual information, as illustrated in Figure 2.

Given an image as input, the output of the classification network is a vector z ∈ R^C, where C indicates the number of categories. Due to the fundamental difference between CNN and Transformer, CAM is not suitable here to generate activation maps. Instead, we adopt the Deep Taylor Decomposition (DTD) principle to generate the attention maps [6], which back-propagates the initial relevancy score of the network through all layers to obtain the initial attention map.

We first compute the initial relevancy score of class c based on z:

S_c = y_c ⊙ z,   (1)

where y_c is the one-hot label generated from the multi-label ground truth and S_c is the initial relevancy score of class c. Here ⊙ denotes the Hadamard product. Following DTD [6], by back-propagating S_c, we can get a relevancy map R^(b) for each Transformer block b.
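The initial relevancy score of Eqn. 1 is a one-hot masking of the classifier output; a toy NumPy sketch (the logit values are illustrative):

```python
import numpy as np

# Toy output vector z for C = 3 categories.
z = np.array([2.3, -0.7, 1.1])
c = 0                      # the ground-truth class of interest
y = np.eye(len(z))[c]      # one-hot label from the multi-label ground truth
S = y * z                  # Hadamard product: initial relevancy scores S_c
```

Only the entry of the target class survives, so the relevance propagated backwards is specific to class c.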

Then we dive into the multi-head self-attention (MHSA) layer [9], as shown in Eqn. 3:

A^(b) = softmax(Q^(b) (K^(b))^T / sqrt(d_h)),   O^(b) = A^(b) V^(b),   (3)

where Q^(b), K^(b), V^(b) ∈ R^{h×(n+1)×d_h} are the query, key, and value in block b; n is the number of image patch tokens and “1” is the class token. A^(b) ∈ R^{h×(n+1)×(n+1)} is the self-attention map and O^(b) represents the output of the attention module. d_h is the head dimension of MHSA and h is the number of heads.
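For reference, a single attention head of Eqn. 3 can be sketched in NumPy as follows (toy shapes; the real model uses h heads and learned projections):

```python
import numpy as np

def self_attention(q, k, v):
    """One head of MHSA: A = softmax(Q K^T / sqrt(d_h)), O = A V.

    q, k, v: (n+1, d_h) arrays -- n patch tokens plus one class token.
    """
    d_h = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_h)
    scores -= scores.max(axis=-1, keepdims=True)   # for numerical stability
    a = np.exp(scores)
    a /= a.sum(axis=-1, keepdims=True)             # row-wise softmax -> A
    return a, a @ v                                # attention map A, output O

# With all-zero queries and keys, attention is uniform over the tokens
# and each output row is the mean of the value rows.
a, o = self_attention(np.zeros((3, 2)), np.zeros((3, 2)),
                      np.arange(6.0).reshape(3, 2))
```

It is this (n+1)×(n+1) map A, together with its gradient and relevance, that the DTD step below consumes.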

For A^(b) in each block b, we can easily get its gradient ∇A^(b) and relevance R^(b). Then the initial attention map can be calculated by:

Â = I + E_h[(∇A^(B) ⊙ R^(B))^+],   (5)

where I is the identity matrix, E_h represents the mean over the “head” dimension, and (·)^+ keeps only the positive values. Here B is the last block.

We get the initial attention map by indexing the slice of Â corresponding to the class token “CLS”, i.e., its interactions with the n patch tokens. We reshape it to the patch grid and apply linear interpolation to obtain the final initial attention map A_init.

Unlike the original DTD that utilizes the activation maps from all the blocks, here we only take the last block into account. We find that activation maps from shallow blocks introduce a certain amount of noise.
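The last-block attention map above can be sketched in NumPy as follows; keeping only positive contributions follows Eqn. 5, while the nearest-neighbour upsampling below merely stands in for the paper's linear interpolation:

```python
import numpy as np

def dtd_attention_map(grad_a, rel_a, grid, patch=16):
    """Sketch of Eqn. 5 using the last block only (hypothetical inputs).

    grad_a, rel_a: (h, n+1, n+1) gradient and relevance of the last block's
                   self-attention map; grid is the patch-grid side (n = grid**2).
    """
    n1 = grad_a.shape[-1]
    # A_hat = I + E_h[(grad ⊙ relevance)^+]: positive part, mean over heads.
    a_hat = np.eye(n1) + np.maximum(grad_a * rel_a, 0.0).mean(axis=0)
    # Index the CLS slice, drop the CLS-to-CLS entry, reshape to the grid.
    attn = a_hat[0, 1:].reshape(grid, grid)
    # Nearest-neighbour upsampling in place of linear interpolation.
    return np.kron(attn, np.ones((patch, patch)))

g = np.ones((2, 5, 5))          # 2 heads, 4 patch tokens + 1 class token
r = np.ones((2, 5, 5))
a_init = dtd_attention_map(g, r, grid=2)   # (32, 32) attention map
```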

3.2 Soft Erase

Discriminative Region Suppression (DRS) [19] is proposed to suppress discriminative regions and spread to neighboring non-discriminative areas. Inspired by DRS, we propose soft erase to narrow the gap between high and low response areas in the initial attention map . Unlike DRS that needs to embed into several layers of the network, here soft erase is just a simple post-processing step and does not need to be embedded into the network.

After get in the above section, we fisrt apply a normalization on and get . We then apply soft erase to , which can be written as:


where , and . is a hyper-parameter and here we set it to . , , mean the maximum, dimension expansion, and minimum function respectively. Firstly, a max vector is chosen from . Then, we expand the dimension of to the dimension of . Finally, we compare and and choose the pixel-wise minimum value to get the attention maps .
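The soft erase step can be sketched in a few lines of NumPy; the soft rate δ = 0.5 below is an illustrative value, not the paper's setting:

```python
import numpy as np

def soft_erase(a_norm, delta=0.5):
    """Soft erase: A_soft = Min(A_norm, Expand(delta * Max(A_norm))).

    a_norm: normalised attention map; delta: the soft rate hyper-parameter.
    """
    v_max = a_norm.max()                           # Max over the map
    ceiling = np.full_like(a_norm, delta * v_max)  # Expand to a_norm's shape
    return np.minimum(a_norm, ceiling)             # pixel-wise Min

a = np.array([[1.0, 0.2], [0.5, 0.9]])
smoothed = soft_erase(a)   # high responses are clipped, low ones kept
```

Clipping the peaks this way narrows the gap between high- and low-response regions without the hard zeroing of classic erasing strategies.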

3.3 Efficient Potential Object Mining (EPOM)

[41] proposes POM, which uses saliency maps to extract background information. We find that the saliency map and the attention map generated by DTD are naturally complementary, and saliency maps can largely filter noisy activations in the background. Note that POM in [41] uses not only the final heatmap but also the middle-layer heatmaps, which is complicated and inefficient. Different from POM, we propose efficient POM (EPOM), which only uses the final attention map and is more efficient without sacrificing performance.

EPOM forces the model to mine potential objects by marking some uncertain pixels in the pseudo labels as “ignored”, which largely avoids introducing wrong labels during self-training. EPOM can be written as:

P(x, y) = 255, if P_init(x, y) = 0 and A_c(x, y) > τ_c;   P(x, y) = P_init(x, y), otherwise,

where P_init(x, y) is the initial pseudo label at pixel position (x, y), generated by comparing the values of A_soft, and c is the current class among the C classes in total. For the current class c, if a pixel is labeled background while its attention value A_c(x, y) is greater than τ_c, we update its pseudo label at (x, y) to 255; otherwise it remains unchanged. Here, “255” indicates “ignored”. τ_c is a per-category threshold determined dynamically from the median and top-quartile values of the attention scores A_c over a position set Φ_c. Φ_c is defined as follows: if any pixel's initial pseudo label equals the current class c, Φ_c contains those positions; otherwise, Φ_c contains the positions where A_c is greater than a threshold θ used to extract foreground information.
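A sketch of EPOM for a single class; since the exact combination of the median and top-quartile statistics is not fully specified here, averaging the two is this sketch's assumption:

```python
import numpy as np

def epom_ignore(pseudo, attn_c, c, theta=0.3):
    """Mark uncertain background pixels of class c as 255 ("ignored").

    pseudo: (H, W) initial pseudo label P_init, with 0 = background
    attn_c: (H, W) refined attention map A_c of class c
    """
    fg = pseudo == c
    # Phi_c: positions labelled c if any exist, else positions with A_c > theta.
    phi = attn_c[fg] if fg.any() else attn_c[attn_c > theta]
    if phi.size == 0:
        return pseudo
    # tau_c from the median and top-quartile of A_c over Phi_c
    # (averaging the two statistics is this sketch's assumption).
    tau_c = 0.5 * (np.median(phi) + np.quantile(phi, 0.75))
    out = pseudo.copy()
    # Background pixels with attention above tau_c become 255 ("ignored").
    out[(pseudo == 0) & (attn_c > tau_c)] = 255
    return out

p = np.zeros((2, 2), dtype=int)                 # all-background pseudo label
a_c = np.array([[0.9, 0.1], [0.8, 0.05]])
refined = epom_ignore(p, a_c, c=1)
```

Pixels marked 255 are simply skipped by the segmentation loss during self-training, so potential objects are neither reinforced as background nor mislabeled as foreground.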

3.4 Self-training with Pseudo Label

After obtaining the pseudo labels, we self-train a segmentation network with them to improve performance. Unlike previous work [41] that iteratively self-trains the segmentation network several times, we train the segmentation network only once.

4 Experiments

4.1 Dataset and Evaluation Metrics

We use PASCAL VOC 2012 [10] as the dataset, which is widely used in weakly supervised semantic segmentation. The training set of PASCAL VOC 2012 contains 10,582 images with augmentation. Only image-level labels are used during training, and each image may contain multiple categories. We report results on the validation set (1,449 images) and the test set (1,456 images) to compare our approach with other competitive methods. We use the standard mean intersection over union (mIoU) as the evaluation metric.
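For completeness, the mIoU metric (skipping pixels marked 255 as “ignored”) can be computed as:

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore=255):
    """Standard mean intersection-over-union, skipping "ignored" pixels."""
    valid = gt != ignore
    ious = []
    for c in range(num_classes):
        p, g = pred[valid] == c, gt[valid] == c
        union = np.logical_or(p, g).sum()
        if union == 0:                 # class absent from both maps
            continue
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

pred = np.array([[0, 1], [0, 1]])
gt   = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, gt, num_classes=2)   # (0.5 + 2/3) / 2
```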

4.2 Implementation Details

The model is implemented in PyTorch and trained on a single NVIDIA GeForce RTX 3060 GPU with 12 GB of memory. In our experiments, we use DeiT-Base as the classification network, pre-trained on ImageNet-1K and fine-tuned on PASCAL VOC 2012. During the training phase, we use the AdamW optimizer with a batch size of 16, and the input image is cropped to a fixed resolution. During the inference phase, the long side of the input image is cropped to a fixed length, and the short side is scaled proportionally to keep the original aspect ratio. Multi-scale inference is also used.

DeepLab-V2/V3/V3+ are used as our segmentation networks in the self-training phase. Following the common setting [41, 17], we compare our results with ResNet-101 and VGG-16 backbones. SGD is adopted as the optimizer, with a weight decay of 5e-4. CRF is used for post-processing.

4.3 Comparisons to the State-of-the-arts

VGG ResNet
Methods Val Test Methods Val Test
SEC 50.7 51.7 DCSP 60.8 61.9
STC 49.8 51.2 DSRG 61.4 63.2
Roy et al. 52.8 53.7 MCOF 60.3 61.2
Oh et al. 55.7 56.7 AffinityNet 61.7 63.7
AE-PSL 55.0 55.7 SeeNet 63.1 62.8
WebS-i2 53.4 55.3 IRN 63.5 64.8
Hong et al. 58.1 58.7 FickleNet 64.9 65.3
DCSP 58.6 59.2 OAA 65.2 66.4
TPL 53.1 53.8 SSDD 64.9 65.5
GAIN 55.3 56.8 SEAM 64.5 65.7
DSRG 59.0 60.4 SCE 66.1 65.9
MCOF 56.2 57.6 ICD 67.8 68.0
AffinityNet 58.4 60.5 Zhang et al. 66.6 66.7
RDC 60.4 60.8 Fan et al. 67.2 66.7
SeeNet 63.1 62.8 MCIS 66.2 66.9
OAA 63.1 62.8 BES 65.7 66.6
ICD 64.0 63.9 CONTA 66.1 66.7
BES 60.1 61.1 DRS 66.8 67.4
Fan et al. 64.6 64.2 NSROM 68.3 68.5
Zhang et al. 63.7 64.5 Ours 70.5 70.3
MCIS 63.5 63.6 OAA† 67.4
DRS 63.6 64.4 DRS† 71.2 71.4
NSROM 65.5 65.3 NSROM† 70.4 70.2
Ours 66.2 66.5 Ours† 70.9 70.5
Table 1: Quantitative comparisons to previous state-of-the-art approaches. The left part uses a VGG backbone and the right part a ResNet backbone for the segmentation network in the self-training stage. † means pre-trained on MS-COCO.

We compare our method with existing representative works: SEC [21], STC [36], Roy et al. [26], Oh et al. [25], AE-PSL [37], WebS-i2 [18], Hong et al. [14], DCSP [5], TPL [20], GAIN [23], DSRG [16], MCOF [34], AffinityNet [2], RDC [38], SeeNet [15], OAA [17], ICD [12], BES [8], Fan et al. [11], Zhang et al. [43], MCIS [30], IRN [3], FickleNet [22], SSDD [28], SEAM [35], SCE [4], CONTA [42], NSROM [41], and DRS [19].

The comparison with the state of the art is shown in Table 1. The left part of Table 1 shows results with the VGG backbone, and the right part with the ResNet backbone. With the VGG backbone, WegFormer outperforms NSROM on the test set by 1.2% and DRS on the validation set by 2.6%. With the ResNet backbone, where the upper part denotes backbones pre-trained on ImageNet, WegFormer is 2.2% and 3.7% better than NSROM and DRS on the validation set, respectively. In the bottom part of Table 1, marked with †, the backbone is pre-trained on MS-COCO; on the validation set, we are 0.5% ahead of NSROM. Overall, for all backbones, our approach achieves state-of-the-art performance on PASCAL VOC 2012. Qualitative segmentation results on the PASCAL VOC 2012 validation set are shown in Figure 3.

4.4 Ablation Studies

In this section, we conduct a series of ablation studies to demonstrate the effectiveness of the proposed modules.

4.4.1 Contribution of Different Components

As shown in Table 2, we report the mIoU on the validation set with different components. Our baseline with initial-pseudo-label self-training gets 59.5% mIoU. Adding soft erase increases the result to 60.0%. Sal (DRS) and Sal (NSROM) denote the saliency maps provided by [19] and [41], respectively. From Table 2, we find that the saliency map largely boosts the result to 70.2%, an improvement of over 10%. EPOM further improves the performance, and the result finally reaches 70.9%.

Soft Erase Sal (DRS) Sal (NSROM) EPOM mIoU
- - - - 59.5
✓ - - - 60.0
✓ - ✓ - 70.2
✓ - ✓ ✓ 70.9
✓ ✓ - - 68.9
✓ ✓ - ✓ 69.8
Table 2: Ablation study of each component in WegFormer. EPOM indicates the EPOM method without the saliency map.
Blocks mIoU
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 58.2
3, 4, 5, 6, 7, 8, 9, 10, 11 58.4
5, 6, 7, 8, 9, 10, 11 58.2
7, 8, 9, 10, 11 58.8
9, 10, 11 59.0
11 59.5
Table 3: Ablation study of the best blocks for attention maps. Block indices range from 0 to 11, where 0 is the first block and 11 the last.

4.4.2 Best Blocks to Get Attention Maps

Unlike the original method that integrates all blocks [6], we only adopt the last block for the final attention map, as described in Eqn. 5. In WegFormer, the total number of blocks in the classifier is 12. As shown in Table 3, taking only the last block reaches the best mIoU of 59.5%, which is 1.3% higher than taking all blocks.

Figure 3: Qualitative segmentation results on the PASCAL VOC 2012 validation set. (a) Input image, (b) Ground truth, (c) Prediction of NSROM [41], (d) Prediction of ours. Our results are notably better than those of NSROM. Best viewed in color.
Figure 4: Comparison between CAM, Rollout, and DTD in the Transformer. In sub-figure (1) on the left, the mIoU of DTD is significantly higher than that of CAM and Rollout. Sub-figure (2) on the right shows that DTD is significantly better than CAM and Rollout at activating objects in the feature space. “H” means heatmap.

4.4.3 Comparison among CAM, ROLLOUT and DTD

We compare three methods for generating Transformer heatmaps, CAM, Rollout, and DTD, in Figure 4, both quantitatively and qualitatively. From Figure 4 (1), we see that DTD achieves clearly higher mIoU than CAM and Rollout. Visualization results are shown in Figure 4 (2). (c) shows the heatmap obtained by CAM, which has meaningless high responses everywhere in the image. (d) shows the heatmap obtained from Rollout, which cannot distinguish between different classes. Neither CAM nor Rollout is suitable for the Transformer. (e) shows the heatmap generated by DTD, which mines the object information well and captures the object contour.

Figure 5: Activation maps of Transformer+DTD vs. CNN+CAM, with and without saliency maps. “H” means heatmap. CNN+CAM focuses on the most discriminative region with less background noise; in contrast, Transformer+DTD captures global context information while also having redundant responses in background regions. Therefore, combining DTD with a saliency map yields more complementary results.
Network val test
DeepLab-V2 69.2 69.7
DeepLab-V3 69.3 69.9
DeepLab-V3+ 70.5 70.3
Table 4: Ablation study of different segmentation frameworks. A stronger segmentation framework is also important for WSSS.

4.4.4 Ablation Study of Different Saliency Map

In Table 2, we found that the quality of the saliency map also affects the final result. Table 2 shows about 1% gap between saliency map (DRS) and saliency map (NSROM), respectively. This inspires us to leverage a higher-quality saliency map to assist weakly-supervised semantic segmentation tasks. We leave it in future research.

4.4.5 Ablation Study of Stronger Segmentation Networks

As shown in Table 4, more advanced semantic segmentation networks lead to better results. Compared with DeepLab-V2 and DeepLab-V3, DeepLab-V3+ is 1.3% and 1.2% higher on the validation set, respectively. On the test set, we observe a similar trend.

4.4.6 Transformer+DTD vs. CNN+CAM with Saliency Map

In Figure 5, we compare the heatmaps generated by Transformer+DTD and CNN+CAM, with and without saliency maps. Here we use DeiT-B for the Transformer and ResNet-38 for the CNN. We find that CNN+CAM only activates the most discriminative region and fails to capture the whole object. In contrast, Transformer+DTD can activate the whole object but also introduces some background noise. Therefore, compared with CNN+CAM, saliency maps better compensate for the shortcomings of Transformer+DTD and yield high-quality masks.

5 Conclusion

In this paper, we propose WegFormer, the first Transformer-based weakly-supervised semantic segmentation framework. We introduce three important components: an attention map generator based on Deep Taylor Decomposition (DTD), a soft erase module, and efficient potential object mining (EPOM). Together, these components generate high-quality semantic masks as pseudo labels, which significantly boosts performance. We hope the proposed WegFormer can serve as a solid baseline and provide a new perspective for weakly supervised semantic segmentation in the Transformer era.


  • [1] S. Abnar et al (2020) Quantifying attention flow in transformers. arXiv. Cited by: §2.
  • [2] J. Ahn et al (2018) Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In CVPR, Cited by: §4.3.
  • [3] J. Ahn et al (2019) Weakly supervised learning of instance segmentation with inter-pixel relations. In CVPR, Cited by: §4.3.
  • [4] Y. Chang et al (2020) Weakly-supervised semantic segmentation via sub-category exploration. In CVPR, Cited by: §4.3.
  • [5] A. Chaudhry et al (2017) Discovering class-specific pixels for weakly-supervised semantic segmentation. arXiv. Cited by: §4.3.
  • [6] H. Chefer et al (2021) Transformer interpretability beyond attention visualization. In CVPR, Cited by: §1, §1, §2, §3.1, §3.1, §3, §4.4.2.
  • [7] H. Chen et al (2021) Pre-trained image processing transformer. In CVPR, Cited by: §2.
  • [8] L. Chen et al (2020) Weakly supervised semantic segmentation with boundary exploration. In ECCV, Cited by: §4.3.
  • [9] A. Dosovitskiy et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv. Cited by: §2, §3.1.
  • [10] M. Everingham et al (2015) The pascal visual object classes challenge: a retrospective. IJCV. Cited by: §1, §4.1.
  • [11] J. Fan et al (2020) Employing multi-estimations for weakly-supervised semantic segmentation. In ECCV, Cited by: §2, §4.3.
  • [12] J. Fan et al (2020) Learning integral objects with intra-class discriminator for weakly-supervised semantic segmentation. In CVPR, Cited by: §4.3.
  • [13] K. He et al (2016) Deep residual learning for image recognition. In CVPR, Cited by: §1.
  • [14] S. Hong et al (2017) Weakly supervised semantic segmentation using web-crawled videos. In CVPR, Cited by: §4.3.
  • [15] Q. Hou et al (2018) Self-erasing network for integral object attention. arXiv. Cited by: §2, §4.3.
  • [16] Z. Huang et al (2018) Weakly-supervised semantic segmentation network with deep seeded region growing. In CVPR, Cited by: §2, §4.3.
  • [17] P. Jiang et al (2019) Integral object mining via online attention accumulation. In ICCV, Cited by: §1, §2, §4.2, §4.3.
  • [18] B. Jin et al (2017) Webly supervised semantic segmentation. In CVPR, Cited by: §4.3.
  • [19] B. Kim et al (2021) Discriminative region suppression for weakly-supervised semantic segmentation. In AAAI, Cited by: §3.2, §4.3, §4.4.1.
  • [20] D. Kim et al (2017) Two-phase learning for weakly supervised object localization. In ICCV, Cited by: §4.3.
  • [21] A. Kolesnikov et al (2016) Seed, expand and constrain: three principles for weakly-supervised image segmentation. In ECCV, Cited by: §2, §4.3.
  • [22] J. Lee et al (2019) Ficklenet: weakly and semi-supervised semantic image segmentation using stochastic inference. In CVPR, Cited by: §4.3.
  • [23] K. Li et al (2018) Tell me where to look: guided attention inference network. In CVPR, Cited by: §4.3.
  • [24] Z. Liu et al (2021) Swin transformer: hierarchical vision transformer using shifted windows. arXiv. Cited by: §2.
  • [25] S. J. Oh et al (2017) Exploiting saliency for object segmentation from image level labels. In CVPR, Cited by: §4.3.
  • [26] A. Roy et al (2017) Combining bottom-up, top-down, and smoothness cues for weakly supervised image segmentation. In CVPR, Cited by: §4.3.
  • [27] R. R. Selvaraju et al (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In ICCV, Cited by: §2.
  • [28] W. Shimoda et al (2019) Self-supervised difference detection for weakly-supervised semantic segmentation. In ICCV, Cited by: §4.3.
  • [29] K. Simonyan et al (2014) Very deep convolutional networks for large-scale image recognition. arXiv. Cited by: §1.
  • [30] G. Sun et al (2020) Mining cross-image semantics for weakly supervised semantic segmentation. In ECCV, Cited by: §2, §4.3.
  • [31] H. Touvron et al (2021) Training data-efficient image transformers & distillation through attention. In ICML, Cited by: §2, §3.1, §4.2.
  • [32] W. Wang et al (2021) PVTv2: improved baselines with pyramid vision transformer. arXiv. Cited by: §2.
  • [33] W. Wang et al (2021) Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In ICCV, Cited by: §2.
  • [34] X. Wang et al (2018) Weakly-supervised semantic segmentation by iteratively mining common object features. In CVPR, Cited by: §4.3.
  • [35] Y. Wang et al (2020) Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation. In CVPR, Cited by: §4.3.
  • [36] Y. Wei et al (2016) Stc: a simple to complex framework for weakly-supervised semantic segmentation. TPAMI. Cited by: §4.3.
  • [37] Y. Wei et al (2017) Object region mining with adversarial erasing: a simple classification to semantic segmentation approach. In CVPR, Cited by: §2, §4.3.
  • [38] Y. Wei et al (2018) Revisiting dilated convolution: a simple approach for weakly-and semi-supervised semantic segmentation. In CVPR, Cited by: §1, §2, §4.3.
  • [39] E. Xie et al (2021) SegFormer: simple and efficient design for semantic segmentation with transformers. arXiv. Cited by: §1, §2, §2.
  • [40] E. Xie et al (2021) Segmenting transparent object in the wild with transformer. arXiv. Cited by: §2.
  • [41] Y. Yao et al (2021) Non-salient region object mining for weakly supervised semantic segmentation. In CVPR, Cited by: §1, §2, §3.3, §3.4, Figure 3, §4.2, §4.3, §4.4.1.
  • [42] D. Zhang et al (2020) Causal intervention for weakly-supervised semantic segmentation. arXiv. Cited by: §4.3.
  • [43] T. Zhang et al (2020) Splitting vs. merging: mining object regions with discrepancy and intersection loss for weakly supervised semantic segmentation. In ECCV, Cited by: §4.3.
  • [44] X. Zhang et al (2018) Adversarial complementary learning for weakly supervised object localization. In CVPR, Cited by: §2.
  • [45] B. Zhou et al (2016) Learning deep features for discriminative localization. In CVPR, Cited by: §2.