Semantic segmentation plays an irreplaceable role in many computer vision applications, such as autonomous driving and remote sensing. The semantic segmentation community has witnessed continuous improvements in recent years, benefiting from the rapid development of convolutional neural networks (CNNs). However, expensive pixel-level annotations force researchers to look for cheaper and more efficient forms of supervision. Image-level annotations are cheap and readily available, but it is challenging to learn high-quality semantic segmentation models from such weak supervision. This paper focuses on weakly supervised semantic segmentation (WSSS) with image-level annotations.
However, the effective receptive field of these CNN-based methods is limited, leading to unsatisfactory segmentation results. Recently, the Transformer, initially used in natural language processing (NLP), has begun to show powerful performance in various fundamental areas of computer vision. Different from CNNs, vision transformers have been shown to extract global context information, which is essential for segmentation tasks and brings fresh thinking to the field of WSSS.
Inspired by this, we develop a simple yet effective WSSS framework based on the Vision Transformer (ViT), termed WegFormer. WegFormer has three key parts: an attention map generator based on Deep Taylor Decomposition (DTD), a soft erasing module, and an efficient potential object mining (EPOM) module.
Among these parts, (1) the DTD-based attention map generator produces attention maps for the target object. DTD explains the Transformer by propagating relevancy scores through its layers, yielding high responses for the target class. We introduce DTD to generate attention maps and integrate them into our WSSS framework. (2) In the soft erasing module, we introduce a soft rate to narrow the gap between high-response and low-response regions, making the attention map smoother. (3) EPOM further refines the attention map using saliency maps. Although DTD does an excellent job of distinguishing foreground from background, it also introduces certain noise regions. Saliency maps produced by an offline salient object detector can largely eliminate this noise, as shown in Figure 1 (d) and (e).
Equipped with these designs, WegFormer achieves state-of-the-art performance on PASCAL VOC 2012. Notably, WegFormer achieves 66.2% mIoU on the PASCAL VOC 2012 validation set with VGG-16 as the backbone in the self-training stage, clearly outperforming CNN-based counterparts. Moreover, when using the heavier ResNet-101 backbone in the self-training stage, WegFormer reaches 70.5% mIoU, the highest on the PASCAL VOC 2012 validation set, significantly outperforming the previous best method NSROM by 2.2 points.
Our contributions can be summarized as follows:
(1) We propose a simple yet effective Transformer-based WSSS framework, termed WegFormer, which can effectively capture global context information to generate high-quality semantic masks with only image-level annotations. To our knowledge, this is the first work to introduce Transformers into WSSS tasks.
(2) We carefully design three important components in WegFormer: (1) a DTD-based attention map generator to obtain an initial attention map for the target object; (2) a soft erasing module to smooth the attention map; (3) an efficient potential object mining (EPOM) module to filter background noise in the attention map and generate finer pseudo labels.
(3) WegFormer achieves state-of-the-art performance on the PASCAL VOC 2012 dataset, showing the huge potential of Transformers in WSSS tasks. We hope that WegFormer serves as a good starting point for research on Transformer-based weakly supervised segmentation.
2 Related Work
Weakly Supervised Semantic Segmentation.
Weakly supervised semantic segmentation aims to learn pixel-level prediction from weak labels. Existing solutions are usually based on convolutional networks [38, 17], and they fall into two main streams. The first stream uses adversarial or random erasing during training. One line of work generates CAMs with an adversarial erasing strategy, discovering extra information beyond the most discriminative regions at the cost of introducing some noise. SeeNet prevents attention from spreading to background areas. Adversarial Complementary Learning (ACoL) localizes objects automatically. The other stream spreads CAMs from high-confidence areas to low-confidence ones: one approach trains a classification network and then applies a region-growing algorithm to train the segmentation network; another employs multi-estimations to obtain multiple seeds, relieving the inaccuracy of a single seed.
Transformers in Computer Vision.
The Transformer has been the dominant architecture in NLP and has become popular in the vision community. Vision Transformer (ViT) is the first work to introduce the Transformer into image classification, dividing an image into 16×16 patches. IPT is the first Transformer pre-trained model for low-level vision, combining multiple tasks. Since then, more and more ViT variants have extended ViT in different directions, such as DeiT for efficient training, and the PVT series [33, 32] and Swin for dense prediction. Benefiting from the global receptive field of self-attention, the Transformer can capture global context dynamically, which is friendly for dense prediction tasks such as object detection and semantic segmentation.
Neural Network Visualization.
Deep neural networks are black boxes, and previous works try to analyze them by visualizing feature representations in different manners. CAM multiplies the weights of the fully connected layer after global average pooling (GAP) with the spatial feature map before GAP to generate a class activation map, which activates regions containing objects. Grad-CAM utilizes back-propagated gradients to obtain an attention map without modifying the network structure. Both CAM and Grad-CAM work well for ConvNets but are not suitable for Vision Transformers. Due to the fundamental architectural difference between Transformers and ConvNets, some recent attempts visualize Transformer features directly. Based on the characteristics of the Transformer, one work proposes two methods to quantify the information flow through self-attention, termed attention rollout and attention flow. The rollout method redistributes all attention scores by considering pairs of attention maps and assuming a linear combination of attention across layers; attention flow computes the maximum flow along the pair-wise attention graph. Another work introduces a relevance map based on Deep Taylor Decomposition and combines the gradients of the attention maps with the relevancy maps.
The overall framework of WegFormer is illustrated in Figure 2. First, an input RGB image is split into 16×16 patches and fed into a Vision Transformer classifier. Second, Deep Taylor Decomposition (DTD) is used to generate the initial attention maps. Third, the soft erase operator smooths the attention maps by narrowing the gap between high-response and low-response areas. Fourth, in EPOM, we use saliency maps to filter the redundant noise in the background, obtaining refined attention maps while avoiding the introduction of false information into the final pseudo labels. Finally, the generated pseudo labels are fed to a segmentation network for self-training to further improve performance.
3.1 Attention Map Generation
Generally, a classification network is trained first, and its weights and features are combined to generate class-aware activation maps. After post-processing, the activation maps are used as pseudo labels to train a segmentation network and improve mask quality. Different from previous works that use ConvNets as the classifier, we introduce DeiT, a variant of ViT, as the classification network to capture global contextual information, as illustrated in Figure 2.
Given an image as input, the output of the classification network is a vector whose length equals the number of categories. Due to the fundamental difference between CNNs and Transformers, CAM is not suitable for generating activation maps here. Instead, we adopt the Deep Taylor Decomposition (DTD) principle to generate the attention maps, back-propagating the initial relevancy score of the network through all layers to obtain an initial attention map.
Here we first obtain the initial relevancy score of each category from the classifier output $\hat{y}$:

$$S_c = y_c \odot \hat{y},$$

where $y_c$ is the one-hot label generated from the multi-label ground truth, $S_c$ is the initial relevancy score of class $c$, and $\odot$ denotes the Hadamard product. Following DTD, by back-propagating $S_c$ we obtain a relevancy map $R^{(b)}$ for each Transformer block $b$.
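As a concrete illustration, the Hadamard-product masking above can be sketched in a few lines (a minimal numpy sketch; `initial_relevancy` is a placeholder name, not the authors' code):

```python
import numpy as np

# Minimal sketch (not the released implementation): the classifier output is
# masked by the one-hot vector of the target class via a Hadamard
# (element-wise) product, so only the target class score seeds the relevancy
# propagation.
def initial_relevancy(logits, class_idx):
    one_hot = np.zeros_like(logits)
    one_hot[class_idx] = 1.0
    return logits * one_hot  # Hadamard product

scores = initial_relevancy(np.array([0.2, 1.5, -0.3]), 1)
print(scores)  # only the target class score survives
```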
$$A^{(b)} = \mathrm{softmax}\!\big(Q^{(b)} {K^{(b)}}^{\top} / \sqrt{d_h}\big), \qquad O^{(b)} = A^{(b)} \cdot V^{(b)},$$

where $Q^{(b)}$, $K^{(b)}$, and $V^{(b)}$ denote the query, key, and value in block $b$, and $A^{(b)} \in \mathbb{R}^{h \times (n+1) \times (n+1)}$, where $n$ is the number of image patch tokens and the “1” is the class token. $A^{(b)}$ is the self-attention map, $O^{(b)}$ represents the output of the attention module, and $d_h$ is the head dimension of MHSA.
For $A^{(b)}$ in each block $b$, we can easily obtain its gradient $\nabla A^{(b)}$ and relevance $R^{(b)}$. Then the initial attention map is calculated by:

$$M = I + \mathbb{E}_h\big[(\nabla A^{(B)} \odot R^{(B)})^{+}\big],$$

where $I$ is the identity matrix, $\mathbb{E}_h$ represents the mean over the “head” dimension, $(\cdot)^{+}$ keeps only the positive values, and $B$ denotes the last block.
We obtain the initial attention map by indexing the column of $M$ corresponding to the class token “CLS” along the horizontal axis. We then reshape it to the patch grid and apply linear interpolation to obtain the final initial attention map.
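Under the definitions above, the last-block attention-map computation might be sketched as follows (a numpy sketch under assumed shapes and names, not the released implementation):

```python
import numpy as np

# Sketch under assumptions: grad_A and rel_A are the gradient and relevance of
# the last block's self-attention map, shaped (heads, n+1, n+1), with token 0
# as the class token. M = I + E_h[(grad ⊙ R)^+]; the class-token slice over
# the patch tokens is then reshaped to the patch grid.
def initial_attention_map(grad_A, rel_A, grid_hw):
    n1 = grad_A.shape[-1]
    M = np.eye(n1) + np.maximum(grad_A * rel_A, 0.0).mean(axis=0)
    cls_scores = M[0, 1:]               # class-token attention over patch tokens
    return cls_scores.reshape(grid_hw)  # e.g. an (H/16, W/16) patch grid

heads, n = 3, 16
att = initial_attention_map(np.random.rand(heads, n + 1, n + 1),
                            np.random.rand(heads, n + 1, n + 1), (4, 4))
```

The resulting map is later upsampled (e.g. by linear interpolation) to the input resolution.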
Unlike the original DTD formulation that utilizes the activation maps from all blocks, here we only take the last block into account, since we find that activation maps from shallow blocks introduce a certain amount of noise.
3.2 Soft Erase
Discriminative Region Suppression (DRS) is proposed to suppress discriminative regions and spread attention to neighboring non-discriminative areas. Inspired by DRS, we propose soft erase to narrow the gap between high- and low-response areas in the initial attention map. Unlike DRS, which needs to be embedded into several layers of the network, soft erase is a simple post-processing step and does not need to be embedded into the network.
After obtaining the initial attention map in the above section, we first apply normalization to it and get $\tilde{M}$. We then apply soft erase to $\tilde{M}$:

$$M_{soft} = \min\big(\tilde{M},\ \mathrm{expand}(\delta \cdot \max(\tilde{M}))\big),$$

where $\delta$ is a hyper-parameter (the soft rate), and $\max$, $\mathrm{expand}$, and $\min$ denote the maximum, dimension-expansion, and pixel-wise minimum functions, respectively. First, a maximum vector is taken from $\tilde{M}$. Then, this vector is expanded to the dimensions of $\tilde{M}$. Finally, we compare the two and choose the pixel-wise minimum to obtain the smoothed attention map $M_{soft}$.
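The soft-erase step can be sketched as follows (a hedged numpy sketch: the "clip at δ times the per-map maximum" form is one reading of the description, and δ = 0.6 is an illustrative value, not the paper's setting):

```python
import numpy as np

# Sketch of soft erase: cap each normalized attention map at delta * its max,
# narrowing the gap between high- and low-response regions. delta (the soft
# rate) is a hyper-parameter; 0.6 here is illustrative only.
def soft_erase(attn, delta=0.6):
    # attn: (num_classes, H, W), normalized to [0, 1]
    ceiling = delta * attn.max(axis=(1, 2), keepdims=True)  # per-class max vector
    return np.minimum(attn, ceiling)  # pixel-wise minimum against the ceiling

m = np.array([[[0.1, 0.9], [0.5, 1.0]]])
print(soft_erase(m))  # high responses are clipped, low ones untouched
```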
3.3 Efficient Potential Object Mining (EPOM)
POM was proposed to use saliency maps for extracting background information. We find that the saliency map and the attention map generated by DTD are naturally complementary: saliency maps can largely filter noisy activations in the background. Note that the original POM uses not only the final heatmap but also intermediate-layer heatmaps, which is complicated and inefficient. Different from POM, we propose efficient POM (EPOM), which uses only the final attention map and is more efficient without sacrificing performance.
EPOM forces the model to mine potential objects by marking some uncertain pixels in the pseudo labels as “ignored”, which largely avoids introducing wrong labels during self-training. The EPOM rule can be written as:
where $P(x,y)$ is the initial pseudo label at pixel position $(x,y)$, generated by comparing the values of the refined attention maps, and $c$ indexes the current class among the total number of classes $C$. For the current class $c$, if the initial pseudo label at $(x,y)$ equals the background and the attention value of class $c$ is greater than $\theta_c$, we update the pseudo label at $(x,y)$ to 255; otherwise, it remains unchanged. Here, “255” indicates “ignored”. $\theta_c$ is a per-category threshold determined dynamically, which can be written as:
where $\mathrm{med}(\cdot)$ and $\mathrm{top}(\cdot)$ are functions that obtain the median and top-quartile values over a set of positions, and the values are taken from the attention map of class $c$. The position set is defined as follows: where the initial pseudo label equals the current class, we choose those positions; otherwise, we choose the positions whose attention values are greater than $\lambda$, a threshold used to extract foreground information.
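Putting the rule together, an illustrative sketch of the EPOM relabeling could look like this (our assumption of the rule: background pixels whose class attention exceeds $\theta_c$ are ignored; thresholds are fixed here for illustration instead of being derived dynamically):

```python
import numpy as np

# Illustrative sketch (assumed semantics, fixed thresholds): background pixels
# whose class attention exceeds the per-class threshold are relabeled 255
# ("ignored"), so self-training does not punish potential objects.
def epom(pseudo, attn, thresholds, background=0, ignore=255):
    refined = pseudo.copy()
    for c, theta in enumerate(thresholds):
        uncertain = (pseudo == background) & (attn[c] > theta)
        refined[uncertain] = ignore
    return refined

pseudo = np.array([[0, 1], [0, 0]])
attn = np.array([[[0.9, 0.2], [0.1, 0.3]]])  # attention for one foreground class
print(epom(pseudo, attn, thresholds=[0.5]))
```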
3.4 Self-training with Pseudo Label
After obtaining the pseudo label, we self-train a segmentation network with the pseudo label to improve the performance. Unlike previous work  that needs to iteratively self-train the segmentation network several times, we only train the segmentation network once.
4.1 Dataset and Evaluation Metrics
We use PASCAL VOC 2012 as the dataset, which is widely used in weakly supervised semantic segmentation. The augmented training set of PASCAL VOC 2012 contains 10,582 images. Only image-level labels are used during training, and each image may contain multiple categories. We report results on the validation set (1,449 images) and the test set (1,456 images) to compare our approach with other competitive methods. We use the standard mean intersection over union (mIoU) as the evaluation metric.
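For reference, the standard mIoU metric can be computed as below (a self-contained numpy sketch of the usual per-class definition):

```python
import numpy as np

# Standard mean IoU: per-class intersection over union, averaged over the
# classes that appear in either the prediction or the ground truth.
def mean_iou(pred, gt, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0], [1, 1]])
gt = np.array([[0, 1], [1, 1]])
print(mean_iou(pred, gt, 2))  # (1/2 + 2/3) / 2
```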
4.2 Implementation Details
The model is implemented in PyTorch and trained on a single NVIDIA GeForce RTX 3060 GPU with 12 GB of memory. In our experiments, we use DeiT-Base as the classification network, which is pre-trained on ImageNet-1K and fine-tuned on PASCAL VOC 2012. During the training phase, we use the AdamW optimizer with a batch size of 16, and the input images are cropped to a fixed size. During the inference phase, the long side of each input image is resized to a fixed length, and the short side is scaled proportionally to keep the original aspect ratio. Multi-scale inference is also used.
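The inference-time resizing rule can be sketched as follows (512 is an illustrative long-side length, since the paper's exact value is not restated here):

```python
# Sketch of aspect-preserving resizing: scale the long side of an image to a
# fixed length and the short side proportionally. long_side=512 is illustrative.
def inference_size(h, w, long_side=512):
    scale = long_side / max(h, w)
    return round(h * scale), round(w * scale)

print(inference_size(600, 400))  # long side becomes 512, aspect ratio preserved
```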
4.3 Comparisons to the State-of-the-arts
| Method (VGG) | val | test | Method (ResNet) | val | test |
| --- | --- | --- | --- | --- | --- |
| Roy et al. | 52.8 | 53.7 | MCOF | 60.3 | 61.2 |
| Oh et al. | 55.7 | 56.7 | AffinityNet | 61.7 | 63.7 |
| Hong et al. | 58.1 | 58.7 | FickleNet | 64.9 | 65.3 |
| AffinityNet | 58.4 | 60.5 | Zhang et al. | 66.6 | 66.7 |
| RDC | 60.4 | 60.8 | Fan et al. | 67.2 | 66.7 |
| Fan et al. | 64.6 | 64.2 | NSROM | 68.3 | 68.5 |
| Zhang et al. | 63.7 | 64.5 | Ours | 70.5 | 70.3 |
Marked methods use a backbone pre-trained on MS-COCO.
We compare our method with existing representative works: SEC, STC, AE-PSL, WebS-i2, DCSP, TPL, GAIN, DSRG, MCOF, AffinityNet, RDC, SeeNet, OAA, ICD, BES, MCIS, IRN, FickleNet, SSDD, SEAM, SCE, CONTA, NSROM, and DRS.
The comparison with the state of the art is shown in Table 1. The left columns of Table 1 show results with the VGG backbone, and the right columns show results with the ResNet backbone. With the VGG backbone, WegFormer outperforms NSROM on the test set and DRS on the validation set. With the ResNet backbone, the upper part of the table denotes backbones pre-trained on ImageNet, where WegFormer is better than both NSROM and DRS on the validation set. The bottom part of Table 1, marked accordingly, denotes backbones pre-trained on MS-COCO; on the validation set, we are again ahead of NSROM. Overall, across all backbones, our approach achieves state-of-the-art performance on PASCAL VOC 2012. Qualitative segmentation results on the PASCAL VOC 2012 validation set are shown in Figure 3.
4.4 Ablation Studies
In this section we conduct a series of ablation studies to demonstrate the effectiveness of the proposed modules.
4.4.1 Contribution of Different Components
As shown in Table 2, we report the mIoU on the validation set with different components. Our baseline, self-trained with the initial pseudo labels, already achieves a reasonable mIoU, and adding soft erase increases the result further. Sal (DRS) and Sal (NSROM) denote saliency maps provided by DRS and NSROM, respectively. From Table 2, we find that the saliency map largely boosts the result, with over 10% improvement, and EPOM improves the performance further.
(Table 2 ablates the contributions of Soft Erase, Sal (DRS), Sal (NSROM), and EPOM in terms of mIoU; using all blocks 0–11 for the attention map yields 58.2% mIoU.)
4.4.2 Best Blocks to Get Attention Maps
Unlike the original method, which integrates all blocks, we only adopt the last block to obtain the final attention maps, as described in Equation 5. In WegFormer, the total number of blocks in the classifier is 12. As shown in Table 3, taking only the last block reaches the best mIoU of 59.5%, which is 1.3% higher than taking all blocks.
4.4.3 Comparison among CAM, ROLLOUT and DTD
We compare three methods for generating Transformer heatmaps, CAM, ROLLOUT, and DTD, in Figure 4, both quantitatively and qualitatively. From Figure 4 (1), DTD scores clearly higher than both CAM and ROLLOUT. Visualization results are shown in Figure 4 (2): (c) shows the heatmap obtained by CAM, which has meaningless high responses everywhere in the image; (d) shows the heatmap from ROLLOUT, which cannot distinguish between different classes; both are unsuitable for the Transformer. (e) shows the heatmap generated by DTD, which excavates object information well and captures object contours.
4.4.4 Ablation Study of Different Saliency Map
In Table 2, we find that the quality of the saliency map also affects the final result: there is about a 1% gap between the saliency maps of DRS and NSROM. This suggests leveraging higher-quality saliency maps to assist weakly supervised semantic segmentation, which we leave for future research.
4.4.5 Ablation Study of Stronger Segmentation Networks
As shown in Table 4, more advanced semantic segmentation networks lead to better results. DeepLab-V3+ is higher than both DeepLab-V2 and DeepLab-V3 on the validation set, and we observe a similar trend on the test set.
4.4.6 Transformer+DTD vs. CNN+CAM with Saliency Map
In Figure 5, we compare the heatmaps generated by Transformer+DTD and CNN+CAM, with and without saliency maps. Here we use DeiT-B for the Transformer and ResNet-38 for the CNN. We find that CNN+CAM only activates the most discriminative region and fails to capture the whole object. In contrast, Transformer+DTD activates the whole object but simultaneously introduces certain background noise. Therefore, compared with CNN+CAM, saliency maps better compensate for the shortcomings of Transformer+DTD, yielding high-quality masks.
In this paper, we propose WegFormer, the first Transformer-based weakly supervised semantic segmentation framework. We introduce three important components: an attention map generator based on Deep Taylor Decomposition (DTD), a soft erase module, and efficient potential object mining (EPOM). Together, these components generate high-quality semantic masks as pseudo labels, significantly boosting performance. We hope the proposed WegFormer can serve as a solid baseline and provide a new perspective for weakly supervised semantic segmentation in the Transformer era.
-  (2020) Quantifying attention flow in transformers. arXiv. Cited by: §2.
-  (2018) Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In CVPR, Cited by: §4.3.
-  (2019) Weakly supervised learning of instance segmentation with inter-pixel relations. In CVPR, Cited by: §4.3.
-  (2020) Weakly-supervised semantic segmentation via sub-category exploration. In CVPR, Cited by: §4.3.
-  (2017) Discovering class-specific pixels for weakly-supervised semantic segmentation. arXiv. Cited by: §4.3.
-  (2021) Transformer interpretability beyond attention visualization. In CVPR, Cited by: §1, §1, §2, §3.1, §3.1, §3, §4.4.2.
-  (2021) Pre-trained image processing transformer. In CVPR, Cited by: §2.
-  (2020) Weakly supervised semantic segmentation with boundary exploration. In ECCV, Cited by: §4.3.
-  (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv. Cited by: §2, §3.1.
-  (2015) The pascal visual object classes challenge: a retrospective. IJCV. Cited by: §1, §4.1.
-  (2020) Employing multi-estimations for weakly-supervised semantic segmentation. In ECCV, Cited by: §2, §4.3.
-  (2020) Learning integral objects with intra-class discriminator for weakly-supervised semantic segmentation. In CVPR, Cited by: §4.3.
-  (2016) Deep residual learning for image recognition. In CVPR, Cited by: §1.
-  (2017) Weakly supervised semantic segmentation using web-crawled videos. In CVPR, Cited by: §4.3.
-  (2018) Self-erasing network for integral object attention. arXiv. Cited by: §2, §4.3.
-  (2018) Weakly-supervised semantic segmentation network with deep seeded region growing. In CVPR, Cited by: §2, §4.3.
-  (2019) Integral object mining via online attention accumulation. In ICCV, Cited by: §1, §2, §4.2, §4.3.
-  (2017) Webly supervised semantic segmentation. In CVPR, Cited by: §4.3.
-  (2021) Discriminative region suppression for weakly-supervised semantic segmentation. In AAAI, Cited by: §3.2, §4.3, §4.4.1.
-  (2017) Two-phase learning for weakly supervised object localization. In ICCV, Cited by: §4.3.
-  (2016) Seed, expand and constrain: three principles for weakly-supervised image segmentation. In ECCV, Cited by: §2, §4.3.
-  (2019) Ficklenet: weakly and semi-supervised semantic image segmentation using stochastic inference. In CVPR, Cited by: §4.3.
-  (2018) Tell me where to look: guided attention inference network. In CVPR, Cited by: §4.3.
-  (2021) Swin transformer: hierarchical vision transformer using shifted windows. arXiv. Cited by: §2.
-  (2017) Exploiting saliency for object segmentation from image level labels. In CVPR, Cited by: §4.3.
-  (2017) Combining bottom-up, top-down, and smoothness cues for weakly supervised image segmentation. In CVPR, Cited by: §4.3.
-  (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In ICCV, Cited by: §2.
-  (2019) Self-supervised difference detection for weakly-supervised semantic segmentation. In ICCV, Cited by: §4.3.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv. Cited by: §1.
-  (2020) Mining cross-image semantics for weakly supervised semantic segmentation. In ECCV, Cited by: §2, §4.3.
-  (2021) Training data-efficient image transformers & distillation through attention. In ICML, Cited by: §2, §3.1, §4.2.
-  (2021) PVTv2: improved baselines with pyramid vision transformer. arXiv. Cited by: §2.
-  (2021) Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In ICCV, Cited by: §2.
-  (2018) Weakly-supervised semantic segmentation by iteratively mining common object features. In CVPR, Cited by: §4.3.
-  (2020) Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation. In CVPR, Cited by: §4.3.
-  (2016) Stc: a simple to complex framework for weakly-supervised semantic segmentation. TPAMI. Cited by: §4.3.
-  (2017) Object region mining with adversarial erasing: a simple classification to semantic segmentation approach. In CVPR, Cited by: §2, §4.3.
-  (2018) Revisiting dilated convolution: a simple approach for weakly-and semi-supervised semantic segmentation. In CVPR, Cited by: §1, §2, §4.3.
-  (2021) SegFormer: simple and efficient design for semantic segmentation with transformers. arXiv. Cited by: §1, §2, §2.
-  (2021) Segmenting transparent object in the wild with transformer. arXiv. Cited by: §2.
-  (2021) Non-salient region object mining for weakly supervised semantic segmentation. In CVPR, Cited by: §1, §2, §3.3, §3.4, Figure 3, §4.2, §4.3, §4.4.1.
-  (2020) Causal intervention for weakly-supervised semantic segmentation. arXiv. Cited by: §4.3.
-  (2020) Splitting vs. merging: mining object regions with discrepancy and intersection loss for weakly supervised semantic segmentation. In ECCV, Cited by: §4.3.
-  (2018) Adversarial complementary learning for weakly supervised object localization. In CVPR, Cited by: §2.
-  (2016) Learning deep features for discriminative localization. In CVPR, Cited by: §2.