Semantic segmentation is the task of classifying every pixel of an input image. It plays a vital role in many computer vision applications, such as image editing and medical image analysis [2020Group, wang2021exploring]. Benefiting from recent advances in deep learning, semantic segmentation has achieved remarkable progress. However, training deep convolutional neural networks (CNNs) usually requires large-scale datasets [yao2018extracting, yao2020exploiting, yao2019towards, yao2018extractingtip, yao2017exploiting]. Moreover, obtaining precise pixel-wise annotations for semantic segmentation demands intensive labor and is quite time-consuming. One promising approach to address the annotation problem for semantic segmentation is to learn from weak labels, such as image-level annotations [kolesnikov2016seed, wei2016stc, hong2017weakly, chaudhry2017discovering, huang2018weakly, ahn2018learning, wei2018revisiting, jiang2019integral], bounding boxes [dai2015boxsup, khoreva2017simple, song2019box], points [bearman2016s], and scribbles [lin2016scribblesup, vernaza2017learning]. Among these weak supervisions, image-level labels are the easiest to annotate and have been widely studied in various weakly supervised methods. However, semantic segmentation supervised with image-level labels remains a challenging task. Therefore, this paper follows the current trend and focuses on leveraging image-level labels to achieve weakly supervised semantic segmentation (WSSS).
To tackle the task of WSSS with only image-level labels, visualization-based approaches [zhou2016learning] have been widely adopted to narrow the annotation gap between classification and segmentation [yao2020bridging, yao2016domain]. Typical methods train a classification network with image-level labels and then leverage class activation maps (CAMs) [zhou2016learning] to generate pseudo labels for training the segmentation network. However, the activation maps obtained from the classification network are sparse and incomplete: they can only locate the most discriminative parts of objects. Many approaches have been proposed to enlarge the activated region to cover a larger object area. For example, Jiang et al. [jiang2019integral] observed that the attention maps produced by the classification network focus on different object parts during training. Therefore, they proposed an online attention accumulation (OAA) strategy to combine the various activated regions. However, as shown in Fig. 1, existing works mainly concentrate on enlarging the response maps for the salient region and utilize saliency maps to extract the background. Few works focus on mining objects in non-salient areas.
In this paper, we propose a non-salient region object mining method for weakly supervised semantic segmentation to make up for the shortcomings mentioned above. In contrast to the widely adopted center prior [borji2015salient] for saliency detection, the non-salient region is usually scattered in the corners or near the edges of the image. This characteristic requires the network to exploit disjoint and distant surrounding information. While traditional CNN-based classification networks excel at modeling local relations, they are inefficient at capturing global relations between disjoint and distant regions. Therefore, we introduce a graph-based global reasoning unit [chen2019graph] to strengthen the classification network's capability to activate object features outside the salient region.
On the other hand, though existing approaches can successfully enlarge activated regions for objects, they inevitably extend the object area to the background. These methods require the saliency maps to provide background clues. While the saliency maps can correct the pixel labels near conspicuous regions, they also remove the object labels outside the salient area. We notice that although the naive CAM, sparse and incomplete, does not have an accurate boundary, it can provide useful clues for the objects in the non-salient region. Therefore, we propose a potential object mining module to discover more objects that are outside the conspicuous region but activated in the naive CAM. Our potential object mining module aims to reduce the pseudo labels’ false-negative rate (in which case the object regions are falsely labeled as background). This improves the quality of pseudo labels and encourages the segmentation network to exert its self-correction ability. Such an ability of the network inspires us to further take advantage of the prediction of the segmentation network. Following [wei2016stc], we divide the training images into simple and complex sets according to the number of categories in each image. The simple images with a single category of object(s) usually have a clean background. Their objects often exist in the conspicuous region and can be correctly segmented. In contrast, complex images (having two or more categories of objects) are more prone to having objects outside the salient area. Therefore, we propose a non-salient region masking module for complex images to generate masked pseudo labels. Our non-salient region masking module helps further discover objects in the non-salient region. Our contributions can be summarized as follows:
For weakly supervised semantic segmentation, we leverage a global reasoning unit to capture global relations among disjoint and distant regions, helping the network activate object features outside salient areas.
We propose a potential object mining module to discover more objects in the non-salient region, which improves the quality of pseudo labels by reducing the false-negative rate.
We propose a non-salient region masking module with a dilation policy to generate masked pseudo labels, which leads to a more robust segmentation model to further discover objects outside the salient region.
2 Related Work
2.1 Semantic Segmentation
Semantic segmentation is an important computer vision task that assigns a semantic label to every pixel in an image. Since the adaptation of modern classification networks into the fully convolutional network (FCN) [long2015fully, sun2020crssc], deep learning has achieved great success in semantic segmentation [badrinarayanan2017segnet, chen2017deeplab, zhang2018context, liu2019auto, chen2020classification, chen2021semantically, luo2019segeqa]. To address the resolution loss caused by down-sampling operations, early works [badrinarayanan2017segnet] adopted an encoder-decoder architecture to recover the spatial resolution. Dilated/atrous convolution [chen2017deeplab] was then proposed to expand the receptive field without loss of resolution. More recently, the pyramid pooling module and context encoding [zhang2018context] were introduced to capture the global semantic context of the scene. Auto-DeepLab [liu2019auto] presented a network-level search space that allows efficient gradient-based architecture search for semantic segmentation.
2.2 Weakly Supervised Semantic Segmentation
Weakly supervised semantic segmentation attempts to learn a segmentation network with weaker annotation than pixel-wise labels. It aims to alleviate the annotation burden of segmentation tasks. Compared to bounding boxes [dai2015boxsup, khoreva2017simple, song2019box], points [bearman2016s], and scribbles [lin2016scribblesup, vernaza2017learning], image-level labels [kolesnikov2016seed, wei2016stc, hong2017weakly, chaudhry2017discovering, huang2018weakly, ahn2018learning, wei2018revisiting, jiang2019integral]
are the most widely used weak annotations due to their easy availability. They are already provided in existing large-scale datasets (e.g., ImageNet [deng2009imagenet]) or can be automatically generated through image retrieval techniques. Existing image-level label based approaches leverage the CAM to generate pixel-level seeds for training the segmentation model. Considering the initial seeds' sparsity and incompleteness, researchers have proposed many approaches to expand the seeds to integral object regions. For example, Kolesnikov and Lampert [kolesnikov2016seed] introduced a new loss function for weakly supervised training based on three guiding principles: seed, expand, and constrain. Huang et al. [huang2018weakly] proposed to train a model starting from the discriminative regions and progressively increase the pixel-level supervision with a deep seeded region growing strategy. RDC [wei2018revisiting] leveraged dilated convolution to enlarge the receptive fields of convolutional kernels, which helped transfer object information to non-discriminative regions. AffinityNet [ahn2018learning] realized semantic propagation via a random walk with the semantic affinity between pairs of adjacent image coordinates. The recent work of SEAM [wang2020self] proposed a self-supervised equivariant attention mechanism to provide additional supervision for network learning. Apart from using intra-image information, Sun et al. [sun2020mining] incorporated two neural co-attentions into the classifier to capture cross-image semantic relations for comprehensive object pattern mining. Zhang et al. [zhang2020causal] attributed the ambiguous boundaries of pseudo-masks to the confounding context and presented a causal inference framework to remove the confounding bias in image-level classification with an effective approximation for the backdoor adjustment.
3 The Proposed Approach
In this paper, we focus on the task of weakly supervised semantic segmentation with image-level labels. Our framework is illustrated in Fig. 2. Given a set of training images with image-level labels, we train a classification network. We leverage class activation maps to generate pseudo labels for learning a segmentation network. Unlike existing methods that mainly concentrate on refining pseudo labels in the salient area, we propose to discover more objects in the non-salient region for weakly supervised semantic segmentation. To achieve this, we insert a graph-based global reasoning unit into the classification network. This helps to activate the object features outside the salient region. We also adopt a potential object mining module (POM) and a non-salient region masking module (NSRM) to improve the quality of pseudo labels for non-salient region object mining.
3.1 CAM Generation
A classification network is first trained to generate class attention maps. As illustrated in Fig. 2, to strengthen the classification network's ability to capture global relations among disjoint and distant regions, we introduce a graph-based global reasoning unit [chen2019graph] before the final classifier. The global reasoning module helps the network to activate the object parts outside the salient region. The features $X \in \mathbb{R}^{L \times D}$ generated by the encoder, with $D$ being the feature dimension and $L$ the number of locations, are first projected from the coordinate space to a latent interaction space. The projection function is formulated as a linear combination:
$$V = B X, \qquad (1)$$
where $B \in \mathbb{R}^{N \times L}$ is the learnable projection weight, and $N$ is the number of the features (nodes) in the interaction space.
Then a graph convolution [kipf2016semi] is applied to capture the relations between features in the new space:
$$Z = \left( (I - A_g) V \right) W_g, \qquad (2)$$
where $A_g \in \mathbb{R}^{N \times N}$ denotes the node adjacency matrix learned by gradient descent during training, and $W_g$ denotes the state update function.
After obtaining the node features $Z$, a reverse projection is conducted to project them back to the original coordinate space:
$$Y = B^{\top} Z. \qquad (3)$$
For the training of the classification network, we adopt the multi-label soft margin loss as follows:
$$\mathcal{L}_{cls} = -\frac{1}{C} \sum_{c=1}^{C} \left[ y_c \log \sigma(p_c) + (1 - y_c) \log \left( 1 - \sigma(p_c) \right) \right], \qquad (4)$$
where $p_c$ is the prediction of the network for the $c$-th class, $\sigma(\cdot)$ is the sigmoid function, and $C$ is the total number of foreground classes. $y_c$ is the image-level label for the $c$-th class: its value is 1 if the class is present in the image; otherwise, its value is 0.
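As a small sanity check, the multi-label soft margin loss above can be computed with a few lines of numpy; the logits and labels below are toy values, not taken from the paper.

```python
import numpy as np

# Sanity-check sketch of the multi-label soft margin loss used for the
# classification network. The logits p and labels y below are toy values.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilabel_soft_margin(p, y):
    s = sigmoid(p)
    # average over the C foreground classes
    return -np.mean(y * np.log(s) + (1 - y) * np.log(1 - s))

p = np.array([2.0, -1.5, 0.3])   # network predictions for 3 classes
y = np.array([1.0, 0.0, 1.0])    # image-level labels (present / absent)
loss = multilabel_soft_margin(p, y)
print(float(loss))
```

Note that the mean over classes realizes the $1/C$ factor of the loss.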
We obtain CAMs by selecting the class-specific feature maps generated by the final classifier. Following OAA [jiang2019integral], we also generate the online accumulated class attention maps (OA-CAMs), which cover more complete object regions and strengthen the lower attention values of target object regions with an integral attention model.
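As a toy illustration, the projection, graph convolution, and reverse projection of the global reasoning unit described above can be sketched in numpy; the matrices here are random stand-ins for the learnable weights, purely to check the shapes of the pipeline.

```python
import numpy as np

# Toy numpy sketch of the graph-based global reasoning unit described above.
# B, A_g, and W_g are learnable parameters in a real network; here they are
# random stand-ins, and the residual connection follows the GloRe design.

rng = np.random.default_rng(0)
L_, D, N = 64, 32, 16               # locations (H*W), feature dim, graph nodes

X = rng.standard_normal((L_, D))    # encoder features in coordinate space
B = rng.standard_normal((N, L_))    # learnable projection weights
A_g = rng.standard_normal((N, N))   # node adjacency matrix
W_g = rng.standard_normal((D, D))   # state update weights

V = B @ X                           # projection to the interaction space
Z = (np.eye(N) - A_g) @ V @ W_g     # graph convolution over the N nodes
Y = B.T @ Z                         # reverse projection to coordinate space

out = X + Y                         # residual connection with the input
print(out.shape)
```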
3.2 Potential Object Mining
After obtaining OA-CAMs, the work of OAA [jiang2019integral] uses them to extract object cues and saliency maps to extract background cues. The class label of each pixel is assigned by comparing the values of the OA-CAMs. As shown in Fig. 2, with the shape information provided by the saliency map, the initial label is derived with quite clear object boundaries after the background extraction (BE) process. However, the initial label misses many object parts outside the conspicuous area. Therefore, we propose to discover more objects in the non-salient region. Though the OA-CAM has a high recall of object pixels, its precision is low. In contrast, the CAM, widely leveraged to generate initial seeds for proxy segmentation labels [kolesnikov2016seed, huang2018weakly], has low recall but high precision. Therefore, we propose a potential object mining (POM) module to discover the object regions activated in the CAM. We mine the potential objects with a class-adaptive threshold $\theta_c$ for each class $c$ that is present in the image:
$$\theta_c = \begin{cases} \mathrm{MED}(\mathbf{v}), & \text{if } c \in \mathcal{C}_{init}, \\ \mathrm{TQ}(\mathbf{v}), & \text{otherwise}, \end{cases} \qquad (5)$$
where $\mathcal{C}_{init}$ denotes the set of classes contained in the initial label. Here, $\mathbf{v}$ is the set of attention values of pixels in the CAM, whose locations are selected as follows:
$$\mathbf{v} = \begin{cases} \{ A_c(i,j) \mid Y(i,j) = c \}, & \text{if } c \in \mathcal{C}_{init}, \\ \{ A_c(i,j) \mid A_c(i,j) > \theta_{bg} \}, & \text{otherwise}, \end{cases} \qquad (6)$$
where $A_c(i,j)$ is the attention value in the CAM for class $c$ at position $(i,j)$, and $Y(i,j)$ is the value in the initial label at position $(i,j)$, which denotes the pseudo label of the pixel. As illustrated in Equation 5 and Equation 6, if the initial label contains class $c$, we select the pixels labeled as $c$ and choose the median (MED) of their attention values in the CAM as $\theta_c$. Otherwise, we select pixels in the CAM with an attention value greater than the background threshold $\theta_{bg}$ and choose the top quartile (TQ) of their attention values as $\theta_c$.
We then adjust the initial label as follows:
$$\hat{Y}(i,j) = \begin{cases} 255, & \text{if } Y(i,j) = 0 \text{ and } \exists c: A_c(i,j) > \theta_c, \\ Y(i,j), & \text{otherwise}. \end{cases} \qquad (7)$$
Here, $A_c$ denotes the CAM for class $c$. As illustrated in Equation 7, the background pixels (labeled as 0) in the initial label with any CAM attention value greater than $\theta_c$ are labeled as 255 and ignored for training. We do not label them with the corresponding potential class to avoid introducing wrong object labels. Such a strategy bypasses the necessity to locate the object boundary outside the salient region. We focus on reducing the false-negative rate of pseudo labels, which helps discard the gradients generated by misleading information.
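A minimal numpy sketch of the POM rule in Equations 5-7 is given below; the variable names, function names, and toy arrays are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Toy sketch of the potential object mining (POM) rule (Eqs. 5-7).
# init: initial pseudo label (0 = background, 255 = ignore).

def pom_threshold(cam, init, c, theta_bg=0.3):
    if (init == c).any():                 # class c present in the initial label
        v = cam[init == c]                # attention values inside the label
        return np.median(v)               # MED
    v = cam[cam > theta_bg]               # pixels above background threshold
    return np.quantile(v, 0.75)           # TQ: top quartile

def adjust_label(init, cams, thetas):
    out = init.copy()
    for c, cam in cams.items():
        mask = (init == 0) & (cam > thetas[c])  # background yet strongly activated
        out[mask] = 255                         # ignore these pixels in training
    return out

cam = np.array([[0.9, 0.7], [0.85, 0.1]])
init = np.array([[1, 1], [0, 0]])
theta = pom_threshold(cam, init, c=1)      # median of {0.9, 0.7} = 0.8
label = adjust_label(init, {1: cam}, {1: theta})
```

On this toy example, the background pixel with attention 0.85 exceeds the class threshold and is flipped to the ignore label 255, while the low-attention pixel stays background.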
3.3 Non-Salient Region Masking
Our potential object mining strategy enriches pseudo labels with more ignored pixels. It allows the segmentation network to predict the correct labels for these potential object regions during training. The improved quality of pseudo labels can also encourage the segmentation network to fix the other incorrectly labeled regions. Therefore, we propose to further leverage the prediction of the segmentation model to generate pseudo labels of higher quality for retraining.
We notice that simple images with only one category of objects usually have a clean background. Objects in these images often exist in the salient region and can be correctly segmented by the segmentation network. However, complex images (with two or more categories of objects) are more prone to having objects outside the salient area. It remains challenging for the segmentation network to detect objects outside the salient region with pseudo labels only containing object labels in the salient area. Therefore, we propose a non-salient region masking (NSRM) module. It combines the object information in the segmentation network’s prediction and pseudo labels to generate masked labels for complex images.
Our proposed non-salient region masking module is illustrated in Fig. 3. Based on the assumption that object labels within the salient region are correct with high probability, we first expand the object region in the initial prediction with the guidance of our pseudo labels. Then we extract the object mask from the expanded prediction map. After that, we expand the object mask with a dilation operation. Finally, a masking operation is applied to the expanded prediction map to obtain the masked pseudo labels. Note that the dilation operation introduces a small portion of the background around the objects. It preserves the objects' boundary information, which is of great importance for a successful segmentation network.
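The mask-extraction, dilation, and masking steps can be sketched as follows; the naive dilation routine and the toy prediction map are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Sketch of the NSRM steps for a complex image: extract the object mask from
# the (expanded) prediction, dilate it so a thin band of background around
# object boundaries is kept, then mark everything else as ignored (255).

def dilate(mask, k=3):
    """Naive binary dilation with a k x k square structuring element."""
    pad = k // 2
    padded = np.pad(mask, pad, mode="constant")
    out = np.zeros_like(mask)
    for di in range(k):
        for dj in range(k):
            out |= padded[di:di + mask.shape[0], dj:dj + mask.shape[1]]
    return out

def nsrm(pred, k=3):
    obj = pred > 0                             # object mask from the prediction
    keep = dilate(obj.astype(np.uint8), k).astype(bool)
    masked = pred.copy()
    masked[~keep] = 255                        # non-salient region is ignored
    return masked

pred = np.zeros((5, 5), dtype=np.int64)
pred[2, 2] = 1                                 # a single object pixel
out = nsrm(pred, k=3)
```

The dilated band around the object keeps a small ring of background labels, which preserves the boundary information discussed above; everything outside the band is excluded from the retraining loss.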
4 Experiments
4.1 Implementation Details
For the classification network, we adopt the VGG-16 model as our backbone, which is pre-trained on ImageNet [deng2009imagenet]. Following [jiang2019integral], we add three convolutional layers on top of the fully-convolutional backbone, each followed by a ReLU layer for nonlinear transformation. A convolutional layer with one output channel per foreground class is adopted as the pixel-wise classifier to generate attention maps. The SGD [bottou2010large] optimizer is used with a momentum of 0.9 and weight decay. The initial learning rate is divided by 10 after every 5 epochs. Following the code of [jiang2019integral], we set the background threshold $\theta_{bg} = 0.3$ for a fair comparison. We train the classification network for 14 epochs with a batch size of 5.
For the segmentation network, following [chang2020weakly, zhang2020splitting, fan2020employing, chen2020weakly], we adopt the DeepLab-v2 [chen2017deeplab] framework. VGG-16 is pre-trained on ImageNet [deng2009imagenet]. For ResNet-101 [he2016deep], we report results for models pre-trained on ImageNet [deng2009imagenet] and MS-COCO [lin2014microsoft], respectively. The SGD optimizer is used with a momentum of 0.9 and weight decay. The initial learning rate is decreased using polynomial decay with a power of 0.9. The segmentation network is trained for 10,000 iterations with a batch size of 10.
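The polynomial ("poly") decay schedule mentioned above can be written as a one-liner; the base learning rate below is a placeholder, and only the power (0.9) and the 10,000-iteration budget come from the text.

```python
# Sketch of the polynomial ("poly") learning-rate decay used for the
# segmentation network: lr = base_lr * (1 - it / max_iter) ** power.

def poly_lr(base_lr, it, max_iter, power=0.9):
    return base_lr * (1.0 - it / max_iter) ** power

base_lr = 1e-3                       # hypothetical base learning rate
lrs = [poly_lr(base_lr, it, 10000) for it in (0, 5000, 9999)]
print(lrs)                           # monotonically decreasing schedule
```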
4.2 Datasets and Evaluation Metrics
Following previous works, we evaluate our approach on the PASCAL VOC 2012 dataset [everingham2010pascal]. It contains 21 classes (20 object categories and the background) for semantic segmentation. There are 10,582 training images (augmented by [hariharan2011semantic]), 1,449 validation images, and 1,456 test images. For all the experiments, we only adopt image-level class labels for training. The standard mean intersection over union (mIoU) is taken as the evaluation metric for the semantic segmentation task.
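For reference, a minimal numpy implementation of the mIoU metric (skipping pixels carrying the 255 ignore label) might look like the following sketch; the toy label maps are illustrative.

```python
import numpy as np

# Minimal sketch of the mean intersection-over-union (mIoU) metric.

def mean_iou(pred, gt, num_classes, ignore=255):
    valid = gt != ignore                       # drop ignored pixels
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c) & valid)
        union = np.sum(((pred == c) | (gt == c)) & valid)
        if union > 0:                          # skip classes absent everywhere
            ious.append(inter / union)
    return float(np.mean(ious))

gt = np.array([[0, 0], [1, 1]])                # toy ground truth (2 classes)
pred = np.array([[0, 1], [1, 1]])              # toy prediction
m = mean_iou(pred, gt, num_classes=2)
print(m)
```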
4.3 Comparisons to the State-of-the-Art
Baselines. In this part, we compare our proposed method with the following state-of-the-art approaches that leverage image-level labels for weakly supervised semantic segmentation: DCSM [shimoda2016distinct], SEC [kolesnikov2016seed], AugFeed [qi2016augmented], STC [wei2016stc], Roy [roy2017combining], Oh [oh2017exploiting], AE-PSL [wei2017object], WebS-i2 [jin2017webly], Hong [hong2017weakly], DCSP [chaudhry2017discovering], TPL [kim2017two], GAIN [li2018tell], DSRG [huang2018weakly], MCOF [wang2018weakly], AffinityNet [ahn2018learning], RDC [wei2018revisiting], SeeNet [hou2018self], OAA [jiang2019integral], ICD [fan2020learning], BES [chen2020weakly], Fan [fan2020employing], Zhang [zhang2020splitting], MCIS [sun2020mining], IRN [ahn2019weakly], FickleNet [lee2019ficklenet], SSDD [shimoda2019self], SEAM [wang2020self], SCE [chang2020weakly], and CONTA [zhang2020causal].
Experimental Results. We present our results for the VGG and ResNet backbones in Table 1 and Table 2, respectively. As can be seen, our approach achieves better results than other state-of-the-art methods for both backbones. Specifically, for the VGG backbone, our segmentation results reach 65.5% and 65.3% on the validation and test sets, respectively. For the ResNet backbone, we obtain 68.3% on the validation set and 68.5% on the test set. Though the methods of STC [wei2016stc], WebS-i2 [jin2017webly], and Hong [hong2017weakly] leverage additional training data, our method outperforms them on the validation set by 15.7%, 12.1%, and 7.4%, respectively. Compared to DSRG [huang2018weakly] and CONTA [zhang2020causal], which also utilize the prediction of the segmentation network to refine pseudo labels for training, our approach improves their results by 6.9% and 2.2%, respectively. The work of ICD [fan2020learning] uses additional superpixels to help recover object boundary information during training. Our approach still outperforms it by 1.5% for the VGG backbone and 0.5% for the ResNet backbone. These results demonstrate the effectiveness of mining objects in the non-salient region for the task of weakly supervised semantic segmentation. When training the ResNet-based network with COCO pre-trained weights, we further reach 70.4% and 70.2% on the validation and test sets, respectively.
| Method | mIoU (%) |
|---|---|
| Baseline | 67.7 |
| + GR | 68.8 |
| + GR + retrain | 68.7 |
| + GR + POM | 69.0 |
| + GR + POM + retrain | 69.7 |
| + GR + POM + NSRM | 70.4 |
4.4 Ablation Studies
Element-Wise Component Analysis. In this part, we demonstrate the contribution of each component proposed in our approach for weakly supervised semantic segmentation. The experimental results on the validation set of Pascal VOC are given in Table 3. We notice that, by leveraging the graph-based global reasoning unit (GR) to capture global relations among disjoint and distant regions, we can improve the segmentation result from 67.7% to 68.8%. By introducing our proposed potential object mining module (POM), we obtain another 0.2% performance gain. Note that if we directly retrain the segmentation network with its prediction, the performance drops from 68.8% to 68.7%. In contrast, with our potential object mining module, retraining the segmentation network can further improve the result to 69.7%. This highlights the importance of reducing the false-negative rate of pseudo labels. Our potential object mining module can directly improve the segmentation results and help to exert the self-correction ability of the segmentation network for a higher quality of pseudo labels. With our non-salient region masking module (NSRM), we further exploit the objects outside the conspicuous regions and improve the segmentation result to 70.4%.
| Method | mIoU (%) |
|---|---|
| NSRM (full) | 70.4 |
| NSRM w/o object expansion | 70.2 |
| NSRM w/o masking | 70.0 |
| NSRM w/o dilation | 68.5 |
Some qualitative segmentation examples on the PASCAL VOC 2012 validation set can be viewed in Fig. 4. As can be seen, with the graph-based global reasoning unit (GR), the network can capture global relations and discover objects in disjoint and distant regions (the potted plant in the fifth column and person in the eighth column). As shown in the last column, our method with the POM module can further discover the car outside the salient region. Besides, our full method with the NSRM module exerts the segmentation network’s self-correction ability. It successfully predicts the bus and potted plants in the second and fourth columns, respectively. As shown in the fifth and eighth columns, our robust method further mines the objects outside the salient region.
We display the evolution of the labels we used for training the segmentation network in Fig. 5. As we can see, with our potential object mining module, we discover more object regions outside the salient area than the initial label. Our pseudo label can reduce the false gradients calculated for wrong annotations. By leveraging the initial prediction, our non-salient region masking module generates high-quality masked labels, allowing the segmentation model to mine the objects in the non-salient region further.
Ablation Studies for NSRM. An in-depth study of our proposed NSRM module is presented in Table 4. As we can see, if we apply NSRM to all images without our simple and complex image division, the results drop from 70.4% to 68.8%. This highlights the importance of treating simple and complex images differently. When masking out the non-salient region of pseudo labels for complex images during training, we need to rely on the rich background information provided by simple images. We notice that removing the object expansion operation will cause a 0.2% performance drop. This shows that it is useful to utilize the pseudo labels to expand the object prediction within the salient region. Masking out the non-salient region of the pseudo label for training has a 0.4% performance gain. This shows that the masking operation can encourage the segmentation network to exert its self-correction ability. Note that if we do not conduct the dilation operation for the object mask, the performance directly drops to 68.5%. This highlights the importance of preserving the background area around the object. The background information, together with the object region, provides essential boundary knowledge for the network training.
Parameter Analysis. For the dilation operation in the NSRM module, we conduct experiments to study the effect of the dilation kernel size $k$. As shown in Fig. 6, we vary the kernel size over a range of values. As we can see, we obtain better performance when the kernel size is between 5 and 30. A kernel size that is too large or too small does not improve the results much. We conjecture that a too-large kernel keeps too much background in the prediction, which hinders object mining in the non-salient region, while a too-small kernel leaves little background and blurs the boundaries of objects, which impedes the training of the segmentation network. In our experiments, we empirically set $k = 30$.
For the graph-based global reasoning unit, we conduct experiments to study the effect of the number of feature nodes $N$ in the interaction space. As shown in Fig. 7, we vary $N$ over a range of values. We notice that an overly large number of nodes does not improve the performance much. In our experiments, we empirically set $N = 64$.
5 Conclusion
In this work, we proposed a non-salient region object mining approach for the task of weakly supervised semantic segmentation. Specifically, we introduced a graph-based global reasoning unit to help the classification network capture global relations among disjoint and distant regions. This strengthens the network's ability to activate objects scattered in the corners or near the edges of the image. To further mine objects in the non-salient region, we proposed to exert the segmentation network's self-correction ability: a potential object mining module reduces the false-negative rate in pseudo labels, and a non-salient region masking module generates masked pseudo labels for complex images, helping to further discover objects in the non-salient region. Extensive experiments on the PASCAL VOC 2012 dataset demonstrate the superiority of our proposed approach.
This work was supported by the National Natural Science Foundation of China (No. 61976116) and Fundamental Research Funds for the Central Universities (No. 30920021135).