Dynamic Feature Regularized Loss for Weakly Supervised Semantic Segmentation

08/03/2021 ∙ by Bingfeng Zhang, et al. ∙ University of Liverpool ∙ Beijing Jiaotong University ∙ Xi'an Jiaotong-Liverpool University

We focus on tackling weakly supervised semantic segmentation with scribble-level annotation. The regularized loss has been proven to be an effective solution for this task. However, most existing regularized losses only leverage static shallow features (color, spatial information) to compute the regularized kernel, which limits their final performance, since such static shallow features fail to describe pair-wise pixel relationships in complicated cases. In this paper, we propose a new regularized loss which utilizes both shallow and deep features that are dynamically updated, in order to aggregate sufficient information to represent the relationships of different pixels. Moreover, in order to provide accurate deep features, we adopt a vision transformer as the backbone and design a feature consistency head to train the pair-wise feature relationship. Unlike most approaches that adopt a multi-stage training strategy with many bells and whistles, our approach can be directly trained in an end-to-end manner, in which the feature consistency head and our regularized loss benefit from each other. Extensive experiments show that our approach achieves new state-of-the-art performance, outperforming other approaches by a significant margin with more than a 6% mIoU increase.


1 Introduction

Fully supervised semantic segmentation [25, 4] has witnessed great success with dense pixel-level annotation. However, such pixel-level annotation is time-consuming and relies heavily on human effort. Weakly supervised semantic segmentation, which aims to make pixel-level predictions under weak supervision signals (scribble [21, 32], bounding box [18, 39], point [3, 15] and image-level [40, 23]), is a solution to this problem. In this paper, we focus on weakly supervised semantic segmentation with scribble annotation. The main challenge of this task is how to train the segmentation model with limited supervision.

Most recent state-of-the-art approaches can be divided into two main categories: pseudo-label based approaches [29, 39] and loss function based approaches [31, 32, 15, 34]. Pseudo-label based approaches focus on generating more pseudo labels by expanding the initial annotations so that the segmentation model receives more complete pixel-level labels as supervision. But such approaches usually need a multi-stage training process with many bells and whistles. For example, in GNN [39], three different models are used for this task. Loss function based approaches concentrate on directly utilizing the limited labels to train the segmentation model with well-designed loss functions. However, some approaches [15, 34] rely on extra datasets [38, 2] to provide edge or boundary information as supervision, while other loss function based approaches [31, 32] still need multi-round training procedures. Although the Gated CRF loss [27] can be directly trained in an end-to-end manner, its performance is limited as it relies solely on static shallow features (color and spatial information), which fail to capture accurate pair-wise pixel relationships. For example, the shallow features are similar for a pixel pair that belongs to different objects with similar colors and close spatial positions (e.g., a white dog close to a white cat). In this case, the shallow features cannot accurately describe the semantic relationship of different pixels, and using them to compute the regularized loss forces the network to be optimized in an inaccurate direction. More importantly, since shallow features are static, this process cannot be corrected during the whole training period. Therefore, it is important to introduce more comprehensive representations for the regularized loss.

In this paper, we propose a new Dynamic Feature Regularized (DFR) loss in the semantic segmentation head to overcome the aforementioned drawbacks. Our DFR loss makes full use of both static shallow features and dynamic deep features, which provides sufficient information to describe the semantic similarity of different pixels. However, pixel features from the same semantic category may not be sufficiently similar, so we design a feature consistency head to enforce this goal. Our feature consistency head utilizes the highly confident predictions from our semantic segmentation head as supervision. It reduces the feature distance for pixels from the same semantic category and enlarges the feature distance for pixels from different categories.

Our semantic segmentation head and feature consistency head are directly coupled, as they enhance each other mutually. On one hand, the deep feature from our feature consistency head provides a third source of input for the regularized loss of the semantic segmentation head, helping it produce accurate semantic predictions. On the other hand, accurate semantic predictions provide more reliable supervision for the feature consistency head, empowering it to build more discriminative features. As a result, compared to relying solely on static shallow features to compute the regularized kernel, the interaction between the two heads allows the deep feature to change dynamically, which enables feature-level self-correction and mitigates the negative influence of inaccurate shallow features.

Meanwhile, in order to keep our loss functions computationally efficient, a local window is used to restrict the region over which they are computed. To still provide comprehensive information, we adopt a vision transformer [10, 24] as our backbone, since such models can extract global feature representations.

Our approach can be directly trained in an end-to-end manner and it does not rely on any extra dataset to provide supervision. Without applying any post-processing method such as dense CRF [17] to refine the results, our approach significantly outperforms the previous state-of-the-art approaches, with an mIoU increase of more than 6%. Our contributions are summarized as follows:

  • We propose a new dynamic feature regularized loss for weakly supervised semantic segmentation. Our regularized loss combines both static shallow and dynamic deep features for the regularized kernel, which can better represent the pair-wise pixel relationship.

  • We design a new feature consistency head to produce consistent features for pixels of the same semantic category, which enables building more accurate pair-wise pixel relationships. Meanwhile, we introduce a vision transformer to strengthen the feature representation. To the best of our knowledge, this is the first work that uses a transformer architecture for this task.

  • Our approach achieves state-of-the-art performances on PASCAL VOC 2012 (val: 82.8%, test: 82.9%) and PASCAL CONTEXT (val: 52.9%), outperforming other approaches by a large margin (more than 6% and 12% mIoU increases on PASCAL VOC 2012 and PASCAL CONTEXT, respectively).

2 Related Works

2.1 Fully Supervised Semantic Segmentation

Fully supervised semantic segmentation has made significant progress with advances in deep neural networks, especially the fully convolutional network (FCN) [4]. Deeplab v2 [4] proposed an ASPP module which utilized dilated convolution to enlarge the receptive field, while Deeplab v3+ [6] introduced an encoder-decoder structure to up-sample its prediction. PSPNet [42] designed a pyramid pooling module in an FCN architecture to generate more refined object details. SegSort [14] proposed a clustering method to segment objects. Tree-FCN [30] designed a learnable tree filter that utilizes structural properties to model long-range dependencies. Recently, inspired by the success of the vision transformer [10] for image classification, new vision transformer architectures [24, 35] were introduced for fully supervised semantic segmentation, leading to clear performance improvements. Specifically, PVT [35] designed a pyramid vision transformer for dense prediction. Swin-Transformer [24] proposed to utilize a local window to improve the attention computing efficiency and a shifted window to extract global information. In this paper, we also introduce a vision transformer to provide strengthened feature representations.

2.2 Weakly Supervised Semantic Segmentation

According to the weak supervision signal, weakly supervised semantic segmentation can be divided into scribble-level [21, 32, 27], bounding-box level [20, 28], point-level [3] and image-level [40, 41].

For the scribble-level setting, ScribbleSup [21] proposed to utilize superpixels [1] to expand the initial annotation and designed a loss function to use the expanded supervision. GNN [39] designed a graph-based approach to generate pseudo labels from scribble supervision, which are then used to train a segmentation model. Tang et al. [31, 32] proposed the Normalized Cut loss and the Kernel Cut loss to directly use the initial labels as supervision. However, both Normalized Cut and Kernel Cut need multi-round training. The Gated CRF loss [27] improves the efficiency of the Kernel Cut loss by adding a gate operation. However, relying only on static shallow features cannot build accurate relationships between different pixels. SPML [15] used SegSort [14] as the backbone and a contour detector [2] as extra supervision. BPG [34] designed an iterative strategy to produce fine-grained feature maps, and also applied a contour detector [38, 2] to provide boundary supervision. In this paper, we propose a new regularized loss which does not need extra supervision, and our approach can be directly trained in an end-to-end manner.

3 Methodology

3.1 Overview

Fig. 1: The framework of our proposed approach. Firstly, an image is input to the vision transformer to generate its feature maps, then the feature maps from all blocks are fused to generate a shared feature map, which is input to both the semantic segmentation head and the feature consistency head. The semantic segmentation head is used to make semantic prediction and provide highly confident regions as pseudo labels for the feature consistency head. Meanwhile, the feature consistency head is used to produce consistent features for pixels with the same semantic category, which are in turn used in the regularized loss of the semantic segmentation head. Note that both the semantic segmentation head and feature consistency head are used during training while only the semantic segmentation head is used during inference.

Fig. 1 shows the overall framework of our approach. First, we use a vision transformer as the backbone to generate the feature maps. Then the feature maps are input to the feature fusion module to generate the shared feature map for the semantic segmentation head and the feature consistency head. The semantic segmentation head makes the semantic prediction and provides supervision for the feature consistency head. The feature consistency head enforces feature consistency for pixels with the same semantic category based on the supervision from the semantic segmentation head, which in turn provides reliable dynamic features for the semantic segmentation head.

The semantic segmentation head utilizes two loss functions: the partial cross-entropy loss and our proposed dynamic feature regularized loss. The partial cross-entropy loss uses the scribble annotation as supervision, while our proposed dynamic feature regularized loss applies the original image information and the feature map from the feature consistency head to produce the regularized kernel.

The feature consistency head also introduces two loss functions: the feature distance loss and the feature regularized loss. The feature distance loss uses the highly confident pseudo labels predicted by the semantic segmentation head as supervision. The feature regularized loss uses only the shallow features to build the kernel that weights the feature distances.

The whole framework is trained in an end-to-end manner, and the total loss function is defined as:

$\mathcal{L} = \mathcal{L}_{pce} + \lambda_{1}\mathcal{L}_{dfr} + \mathcal{L}_{fd} + \lambda_{2}\mathcal{L}_{fr}$,   (1)

where $\lambda_{1}$ and $\lambda_{2}$ are loss weights. $\mathcal{L}_{pce}$ and $\mathcal{L}_{dfr}$ are the loss functions for the semantic segmentation head: $\mathcal{L}_{pce}$ is the partial cross-entropy loss, which uses the scribble annotation as supervision, and $\mathcal{L}_{dfr}$ is our proposed regularized loss. Both of them will be introduced in Sect. 3.2. $\mathcal{L}_{fd}$ and $\mathcal{L}_{fr}$ are the loss functions for the feature consistency head: $\mathcal{L}_{fd}$ is the feature distance loss, which uses the prediction of the semantic segmentation head as supervision, and $\mathcal{L}_{fr}$ is the feature regularized loss. Both will be introduced in Sect. 3.3.

3.2 Semantic Segmentation Head

The semantic segmentation head makes the semantic prediction. It includes several convolution layers to produce the final probability map $P$, a partial cross-entropy loss to utilize the scribble annotation, and our proposed dynamic feature regularized loss to constrain the prediction over the whole map.

Specifically, the partial cross-entropy loss is:

$\mathcal{L}_{pce} = -\frac{1}{N}\sum_{i=1}^{H\times W}\,[\,g_{i}\neq 255\,]\,\log P_{i,g_{i}}$,   (2)

where $P_{i,g_{i}}$ is the probability of pixel $i$ being classified to its ground-truth class, $N$ is the number of annotated pixels, and $H$ and $W$ correspond to the height and width of the feature map, respectively. $g_{i}$ is the provided scribble annotation, where 255 means that there is no annotation. $[\cdot]$ is the Iverson bracket, which equals 1 if the inside condition is true and 0 otherwise.
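As a concrete reference, Eq. (2) is a standard cross-entropy with an ignore index over the unannotated pixels. Below is a minimal PyTorch-style sketch; the tensor names and toy shapes are illustrative, not taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def partial_cross_entropy(logits, scribble):
    """Eq. (2): cross-entropy averaged over annotated pixels only.

    logits:   (B, C, H, W) raw class scores from the segmentation head.
    scribble: (B, H, W) integer labels, 255 for unannotated pixels.
    """
    # ignore_index=255 skips unannotated pixels; the default 'mean'
    # reduction divides by the number of annotated pixels N.
    return F.cross_entropy(logits, scribble, ignore_index=255)

# Toy usage with random tensors.
logits = torch.randn(2, 21, 64, 64, requires_grad=True)
scribble = torch.full((2, 64, 64), 255, dtype=torch.long)
scribble[:, 30:34, 30:34] = 1          # a few "scribbled" pixels
loss = partial_cross_entropy(logits, scribble)
loss.backward()
```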

For scribble annotation, the main limitation is that very few pixel-level labels are provided, e.g., only about 3% of pixels are annotated in the PASCAL VOC 2012 dataset [32]. In this case, using the partial cross-entropy loss alone is not enough. Therefore, we design a new DFR loss to impose restrictions on the prediction of the model. Our intuition is that for two different pixels $i$ and $j$, if their features are highly similar, the probability that they belong to the same category is high.

In order to impose the above restriction while keeping high computing efficiency, for two different pixels $i$ and $j$ we only compute the loss when both of them are located within a local window:

$\Omega = \{(i,j)\ \big|\ |x_{i}-x_{j}|\le \omega,\ |y_{i}-y_{j}|\le \omega,\ i\neq j\}$,   (3)

where $\Omega$ is the effective pixel pair set, $x_{i}$ and $x_{j}$ represent the x-coordinates, $y_{i}$ and $y_{j}$ represent the y-coordinates, and $\omega$ is the window size.

Then our proposed DFR loss is:

$\mathcal{L}_{dfr} = \frac{1}{|\Omega|}\sum_{(i,j)\in\Omega}\mathcal{L}_{dfr}^{i,j}$,   (4)

where $\mathcal{L}_{dfr}^{i,j}$ is the loss for pixels $i$ and $j$, which follows the definition:

$\mathcal{L}_{dfr}^{i,j} = \sum_{c\in\mathcal{C}}\,\sum_{c'\in\mathcal{C},\,c'\neq c} P_{i,c}\,P_{j,c'}\,k(i,j)$,   (5)

where $\mathcal{C}$ is the class set, i.e., $\mathcal{C}=\{0,1,\dots,C-1\}$. $P_{i,c}$ and $P_{j,c'}$ are the probabilities for pixels $i$ and $j$ to be classified to class $c$ and $c'$, respectively, which are provided by the network (after the softmax layer). $k(i,j)$ is the regularized kernel, which is defined as a Gaussian kernel:

$k(i,j) = \exp\!\left(-\frac{\|pos_{i}-pos_{j}\|_{2}^{2}}{2\sigma_{xy}^{2}} - \frac{\|I_{i}-I_{j}\|_{2}^{2}}{2\sigma_{rgb}^{2}} - \frac{\|f_{i}-f_{j}\|_{2}^{2}}{2\sigma_{fea}^{2}}\right)$,   (6)

where $\|\cdot\|_{2}$ is the L2 distance, $pos_{i}$ and $pos_{j}$ correspond to the pixel positions of pixels $i$ and $j$, $I_{i}$ and $I_{j}$ are the RGB values of pixels $i$ and $j$, and $f_{i}$ and $f_{j}$ are the deep features of pixels $i$ and $j$ from the feature consistency head. $\sigma_{xy}$, $\sigma_{rgb}$ and $\sigma_{fea}$ are the corresponding kernel bandwidths.

The previous regularized loss functions [31, 32, 27] only adopt the position and RGB information to compute the kernel. However, both the position and RGB information in Eq. (6) are static: once these two types of features fail to correctly describe the true relationship of a pixel pair, the network will be optimized in an inaccurate direction, and this problem cannot be addressed during the whole training period.

Different from the previous approaches, we introduce the dynamic deep feature, which is provided by the feature consistency head (as described in Sect. 3.3), to compute the regularized kernel. Note that when the deep features are used to compute the regularized kernel, they are detached and regarded as non-gradient values. Through introducing the dynamic feature to compute the regularized kernel, on one hand, a more comprehensive representation of the pixel relationship is provided; on the other hand, the dynamic features allow the network to correct its previous results. The remaining task is how to guarantee that the deep features accurately represent the relationships of different pixels, which is addressed in Sect. 3.3.
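To make the computation concrete, the following is a minimal PyTorch-style sketch of Eqs. (3)-(6): local windows are gathered with unfold, the kernel combines position, color and the detached deep feature, and gradients only flow through the predicted probabilities. The bandwidth defaults and the $1/|\Omega|$ normalization follow the reconstruction above rather than confirmed settings of the paper.

```python
import torch
import torch.nn.functional as F

def dfr_loss(prob, rgb, deep_feat, window=5,
             xy_sigma=6.0, rgb_sigma=0.1, fea_sigma=1.0):
    """Sketch of Eqs. (3)-(6): pairwise regularization inside a local window.

    prob:      (B, C, H, W) softmax probabilities from the segmentation head.
    rgb:       (B, 3, H, W) normalized image colors.
    deep_feat: (B, D, H, W) features from the feature consistency head.
    The sigma defaults are placeholders, not the paper's settings.
    """
    B, C, H, W = prob.shape
    K, pad = window, window // 2

    # Pixel coordinates, pre-scaled (together with color and deep features) by
    # their bandwidths so the kernel is a single squared-distance exponential.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    xy = torch.stack([xs, ys], dim=0).float().to(prob.device)
    xy = xy.unsqueeze(0).expand(B, -1, -1, -1)
    kernel_feat = torch.cat([xy / xy_sigma,
                             rgb / rgb_sigma,
                             deep_feat.detach() / fea_sigma], dim=1)  # no grad through the kernel

    def win(x):  # gather K x K neighbourhoods: (B, C_x, H, W) -> (B, C_x, K*K, H*W)
        return F.unfold(x, K, padding=pad).reshape(B, x.shape[1], K * K, H * W)

    f_n, f_c = win(kernel_feat), kernel_feat.reshape(B, -1, 1, H * W)
    kernel = torch.exp(-0.5 * ((f_n - f_c) ** 2).sum(dim=1))          # Eq. (6), (B, K*K, H*W)

    p_n, p_c = win(prob), prob.reshape(B, C, 1, H * W)
    diff_class = 1.0 - (p_n * p_c).sum(dim=1)                         # Eq. (5): sum over c != c'

    valid = win(torch.ones(B, 1, H, W, device=prob.device)).squeeze(1)
    valid[:, K * K // 2, :] = 0.0                                     # exclude the pair i == j

    return (kernel * diff_class * valid).sum() / valid.sum().clamp(min=1.0)   # Eq. (4)
```

Detaching the deep feature mirrors the statement above that the kernel is treated as a non-gradient value, so the consistency head is not directly optimized through this loss.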

3.3 Feature Consistency Head

In order to provide correct relationships for the deep features of different pixels, we design a feature consistency head. Our motivation is that for two pixels $i$ and $j$, if they belong to the same class, their features should have high similarity; if they belong to different classes, the similarity of their features should be low.

Based on the above analysis, we need to provide supervision for the feature relationship. We select the predicted labels with highly confident scores from the semantic segmentation head as supervision:

$\hat{y}_{i} = \begin{cases} \arg\max_{c} P_{i,c}, & \text{if } \max_{c} P_{i,c} \ge \tau \\ 255, & \text{otherwise} \end{cases}$   (7)

where $\hat{y}_{i}$ is the semantic label for pixel $i$ and 255 means that it is not annotated to any class. $P_{i,c}$ is the predicted probability of pixel $i$ for class $c$, and $\tau$ is the confidence threshold.
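A minimal sketch of this selection step; `tau` stands in for the confidence threshold, whose concrete value is not reproduced in this text.

```python
import torch

def confident_labels(prob, tau):
    """Eq. (7): keep argmax labels whose confidence exceeds tau, else 255.

    prob: (B, C, H, W) softmax output of the semantic segmentation head.
    """
    conf, label = prob.max(dim=1)          # (B, H, W) each
    label[conf < tau] = 255                # 255 marks "not annotated"
    return label

# Toy usage; 0.9 is an arbitrary illustration, not the paper's value.
probs = torch.softmax(torch.randn(1, 21, 8, 8), dim=1)
pseudo = confident_labels(probs, tau=0.9)
```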

Then the supervision is converted to the pair-wise pixel relationship. Following the operation in Sect. 3.2, we use the same local window to restrict the computing region. Considering that some pixels are not annotated, the effective pixel pairs are:

$\Omega' = \{(i,j)\ \big|\ (i,j)\in\Omega,\ \hat{y}_{i}\neq 255,\ \hat{y}_{j}\neq 255\}$,   (8)

where $\Omega'$ is the effective pixel pair set and $\Omega$ is the set defined in Eq. (3). After that, the supervision is converted to the pair-wise pixel relationship label:

$R_{i,j} = \begin{cases} 1, & \text{if } (i,j)\in\Omega' \text{ and } \hat{y}_{i}=\hat{y}_{j} \\ 0, & \text{if } (i,j)\in\Omega' \text{ and } \hat{y}_{i}\neq\hat{y}_{j} \\ 255, & \text{otherwise} \end{cases}$   (9)

Eq. (9) indicates that when pixels $i$ and $j$ belong to the same class, they have a strong relationship (set as 1); if they belong to different classes, they should have a weak relationship (set as 0); 255 means the pixel pair is ignored. In order to utilize such supervision, we compute the feature distance as the feature relationship for each pixel pair in $\Omega'$:

$d_{i,j} = \frac{1}{D}\,\|f_{i}-f_{j}\|_{1}$,   (10)

where $\|\cdot\|_{1}$ is the L1 distance and $D$ is the channel dimension of the feature map. Both $f_{i}$ and $f_{j}$ are the final features of pixels $i$ and $j$ from the feature consistency head.

Finally, the feature distance loss is:

$\mathcal{L}_{fd} = \frac{1}{|\Omega'_{bg}|}\sum_{(i,j)\in\Omega'_{bg}} d_{i,j} \;+\; \frac{1}{|\Omega'_{fg}|}\sum_{(i,j)\in\Omega'_{fg}} d_{i,j} \;+\; \frac{1}{|\Omega'_{neg}|}\sum_{(i,j)\in\Omega'_{neg}} \max(0,\,1-d_{i,j})$,   (11)

where $\Omega'_{bg}$ is the pixel pair set with $R_{i,j}=1$ whose labels are background, $\Omega'_{fg}$ is the pixel pair set with $R_{i,j}=1$ whose labels are foreground, and $\Omega'_{neg}$ corresponds to the pixel pair set with $R_{i,j}=0$. $|\cdot|$ indicates the number of elements in a set.
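The following PyTorch-style sketch follows the reconstruction of Eqs. (8)-(11) above: same-class pairs (split into background and foreground groups) have their feature distances minimized, while different-class pairs are pushed apart with a hinge. The hinge margin and the equal weighting of the three terms are assumptions of this sketch, not details confirmed by the text.

```python
import torch
import torch.nn.functional as F

def feature_distance_loss(feat, pseudo, window=5, margin=1.0):
    """Sketch of Eqs. (8)-(11): pull same-class pairs together, push
    different-class pairs apart inside a local window.

    feat:   (B, D, H, W) output of the feature consistency head.
    pseudo: (B, H, W) confident labels from Eq. (7), 255 = ignored.
    margin: hinge margin for negative pairs (an assumption of this sketch).
    """
    B, D, H, W = feat.shape
    K, pad = window, window // 2

    def win(x, fill):  # pad, then gather K x K neighbourhoods per pixel
        x = F.pad(x.float(), (pad, pad, pad, pad), value=fill)
        return F.unfold(x, K).reshape(B, x.shape[1], K * K, H * W)

    feat_n = win(feat, 0.0)                                   # (B, D, K*K, HW)
    feat_c = feat.reshape(B, D, 1, H * W)
    dist = (feat_n - feat_c).abs().mean(dim=1)                # Eq. (10), (B, K*K, HW)

    lab_n = win(pseudo.unsqueeze(1), 255.0).squeeze(1)        # neighbour labels
    lab_c = pseudo.reshape(B, 1, H * W).float()
    valid = (lab_n != 255) & (lab_c != 255)                   # Eq. (8)
    valid[:, K * K // 2, :] = False                           # drop i == j

    same = valid & (lab_n == lab_c)                           # Eq. (9), R_ij = 1
    diff = valid & (lab_n != lab_c)                           # R_ij = 0
    pos_bg = same & (lab_c == 0)                              # same-class background pairs
    pos_fg = same & (lab_c != 0)                              # same-class foreground pairs

    def avg(mask, values):
        return (values * mask.float()).sum() / mask.sum().clamp(min=1)

    return avg(pos_bg, dist) + avg(pos_fg, dist) + avg(diff, (margin - dist).clamp(min=0))
```

Averaging the background, foreground and negative groups separately mirrors the three sets in Eq. (11) and keeps the abundant background pairs from dominating the loss.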

Following the same strategy as in our semantic segmentation head, we also introduce a feature regularized loss, since the supervision only provides limited annotations. The feature regularized loss is defined as:

$\mathcal{L}_{fr} = \frac{1}{|\Omega|}\sum_{(i,j)\in\Omega} k'(i,j)\, d_{i,j}$,   (12)

where $k'(i,j)$ has a similar formation to Eq. (6), but only uses the shallow features:

$k'(i,j) = \exp\!\left(-\frac{\|pos_{i}-pos_{j}\|_{2}^{2}}{2\sigma_{xy}^{2}} - \frac{\|I_{i}-I_{j}\|_{2}^{2}}{2\sigma_{rgb}^{2}}\right)$.   (13)
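A compact sketch of Eqs. (12)-(13), assuming the same local-window machinery as before: here the kernel uses only position and color, and gradients flow through the features of the consistency head. The bandwidth defaults are placeholders, not the paper's values.

```python
import torch
import torch.nn.functional as F

def feature_regularized_loss(feat, rgb, window=5, xy_sigma=6.0, rgb_sigma=0.1):
    """Sketch of Eqs. (12)-(13): a shallow (position + color) kernel weighting
    the L1 feature distances inside a local window."""
    B, D, H, W = feat.shape
    K, pad = window, window // 2

    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    xy = torch.stack([xs, ys]).float().to(feat.device).unsqueeze(0).expand(B, -1, -1, -1)
    shallow = torch.cat([xy / xy_sigma, rgb / rgb_sigma], dim=1)      # (B, 5, H, W)

    def win(x):  # (B, C_x, H, W) -> (B, C_x, K*K, H*W) local neighbourhoods
        return F.unfold(x, K, padding=pad).reshape(B, x.shape[1], K * K, H * W)

    kernel = torch.exp(-0.5 * ((win(shallow) - shallow.reshape(B, -1, 1, H * W)) ** 2).sum(1))
    dist = (win(feat) - feat.reshape(B, D, 1, H * W)).abs().mean(1)   # Eq. (10) again

    valid = win(torch.ones(B, 1, H, W, device=feat.device)).squeeze(1)
    valid[:, K * K // 2, :] = 0.0                                     # exclude i == j
    return (kernel * dist * valid).sum() / valid.sum().clamp(min=1.0) # Eq. (12)
```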

From Sect. 3.2 and Sect. 3.3, it can be found that both the semantic segmentation head and the feature consistency head receive online-updated information. Specifically, the semantic segmentation head receives the dynamically updated features from the feature consistency head, while the feature consistency head receives the updated supervision from the semantic segmentation head. On one hand, better supervision enables the feature consistency head to provide more accurate feature relationships; on the other hand, more accurate feature relationships help produce better semantic segmentation. Thus, we argue that with such an interaction mechanism the two heads benefit from each other, and the final performance is boosted accordingly.

Fig. 2: Details of the backbone and the feature fusion process. To fuse features from all stages, we use the same architecture as Swin-Transformer [24] and UperNet [37], both of which use a Pyramid Pooling Module (PPM) [42] and a Feature Pyramid Network (FPN) [22] to fuse feature maps.

4 Experiment

4.1 Datasets and Evaluation Metric

We evaluate our method on the PASCAL VOC 2012 [11] and PASCAL CONTEXT [26] datasets. For the PASCAL VOC 2012 dataset, following the previous approaches [31, 32, 27, 15] in weakly supervised semantic segmentation, the augmented data SBD [13] is also used, and the whole dataset contains 10,582 images for training, 1,449 images for validation and 1,456 images for testing with 20 foreground classes. The PASCAL CONTEXT dataset includes 4,998 images for training and 5,105 images for validation with 59 foreground categories. For the scribble annotation, we also follow the previous approaches [31, 32, 27, 15] and use the supervision provided by ScribbleSup [21]. Mean Intersection over Union (mIoU) is adopted as the evaluation metric.

Method Pub. Sup. Backbone Single-stage Extra Data CRF mIoU(%)
val test
(1) Deeplab-v2 [5] TPAMI’18 F vgg16 - 71.5 72.6
(2) DeepLab-v2 [5] TPAMI’18 F resnet101 - 76.8 79.7
(3) Deeplab-v3+ [7] ECCV’18 F resnet18 - - 76.7 -
(4) SegSort [14] ICCV’19 F resnet101 - - 77.3 -
(5) Tree-FCN [30] NeurIPS’19 F resnet101 - - 82.3 -
(6) Swin-Base (ss) [24]* - F transformer - - 82.9 82.9
(7) Swin-Base (ms) [24]* - F transformer - - 84.6 84.4
Box2Seg [18] ECCV’20 B UperNet [37] - - 76.4 -
BAP [28] CVPR’21 B (2) - - 74.6 76.1
ILLD [23] TPAMI’20 I Res2Net [12] - 69.4 70.4
AdvCAM [19] CVPR’21 I (2) - - 68.1 68.0
EDAM [36] CVPR’21 I (2) - 70.9 70.6
ScribbleSup [21] CVPR’16 S (1) - - 63.1 -
RAWKS [33] CVPR’17 S resnet101 61.4 -
NormalizedCut [31] CVPR’18 S (2) - - 74.5 -
GraphNet [29] ACMM’18 S (2) - - 73.0 -
KernelCut [32] ECCV’18 S (2) - - - 73.0 -
KernelCut+CRF [32] ECCV’18 S (2) - - 75.0 -
GatedCRF [27] NeurIPS’19 S (3) - - 75.5 -
BPG [34] IJCAI’19 S (2) - 73.2 -
BPG+CRF [34] IJCAI’19 S (2) 76.0 -
SPML [15] ICLR’21 S (4) - - 74.2 -
SPML+CRF [15] ICLR’21 S (4) - 76.1 -
GNN [39] TPAMI’21 S (5) - - 76.2 76.1
DFR-ours (ss) - S (6) - - 81.5 82.1
DFR-ours (ms) - S (7) - - 82.8 82.9
  * Reproduced by ourselves.

TABLE I: Comparison with other state-of-the-art on PASCAL VOC 2012 dataset. Pub.: Publication. Sup.: Supervision. F: Fully-supervised. B: bounding-box level supervision. I: Image-level supervision. S: scribble-level. “ss” means single scale inference. “ms” means multi-scale inference. Multi-scale inference is used without explicit indication.

4.2 Implementation Details

Our approach mainly includes three network modules: the backbone, the semantic segmentation head and the feature consistency head. For the backbone, we choose Swin-Transformer-Base [24] (with the UperNet head [37] to fuse the features from the 4 stages); the details can be found in Fig. 2. After passing through the backbone and the feature fusion stage, a fused feature map is generated. For the semantic segmentation head, we use the same setting as in Swin-Transformer-Base [24], which uses the scene head in [37]. For the feature consistency head, we utilize a convolutional layer followed by a ReLU function to produce the final feature of dimension $D$. In Eq. (1), the values of $\lambda_{1}$ and $\lambda_{2}$, the threshold $\tau$ in Eq. (7), and the kernel bandwidths $\sigma_{xy}$, $\sigma_{rgb}$ and $\sigma_{fea}$ are fixed hyper-parameters; $\sigma_{xy}$ and $\sigma_{rgb}$ are shared between Eq. (6) and Eq. (13). The window size $\omega$ for Eq. (3) and Eq. (8) is set as 5. Note that the RGB is normalized before being input to the network to compute the kernel.
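A minimal sketch of the feature consistency head as described above; the 1x1 kernel size and the default output dimension are assumptions of this sketch, since the concrete values are not given in this text.

```python
import torch.nn as nn

def build_feature_head(in_channels: int, feat_dim: int = 256) -> nn.Sequential:
    """One convolution followed by ReLU, producing the per-pixel features
    used in Eqs. (6), (10) and (12). Kernel size and feat_dim are assumed."""
    return nn.Sequential(nn.Conv2d(in_channels, feat_dim, kernel_size=1),
                         nn.ReLU(inplace=True))
```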

We use weights pretrained on ImageNet-22K [9] to initialize the Swin-Transformer-Base model [24]. AdamW [16] is used as the optimizer. Models are trained on 8 Nvidia Tesla V100 GPUs with a batch size of 16 for 40K iterations. During training, we adopt the default settings in mmseg [8], including random flipping, random rescaling and random photometric distortion. During inference, the feature consistency head is not used and a multi-scale strategy is applied. Other settings follow those in Swin-Transformer-Base [24].
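For orientation, a bare-bones version of the optimization setup might look as follows; only the batch size and the iteration count come from the description above, while the stand-in model, learning rate and weight decay are placeholders.

```python
import torch

# Skeleton of the training schedule described above (batch size 16, 40K steps).
model = torch.nn.Conv2d(3, 21, kernel_size=1)      # stands in for backbone + both heads
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)  # placeholder values

for step in range(40_000):
    images = torch.randn(16, 3, 64, 64)            # toy batch; real crops are larger
    logits = model(images)
    loss = logits.mean()                           # placeholder for the total loss of Eq. (1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```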

4.3 Comparison with State-of-the-Art

Fig. 3: Comparison with other state-of-the-art approaches on PASCAL VOC 2012 val set for different scribble lengths.

In Table I, we compare our approach with other approaches on the PASCAL VOC 2012 dataset. It can be seen that our approach significantly outperforms the others. Specifically, GNN [39] achieves 76.2% mIoU with dense CRF [17] as post-processing, while we achieve 82.8% mIoU without using CRF, which brings a 6.6% mIoU gain. Note that GNN [39] is a multi-stage method which uses more than three individual networks during training, while our approach is a single-stage method. Besides, among the single-stage methods, BPG [34] achieves the best performance, but it used an extra dataset (the HED contour detector [38], pretrained on the BSDS500 dataset [2]) to provide edge supervision. We do not rely on any extra dataset and outperform it by 9.6% mIoU without CRF (82.8% vs. 73.2%). SPML used the same extra dataset as BPG [34] with a multi-round training process, and we also significantly outperform it (82.8% vs. 76.1%). More importantly, our approach reaches 98.3% of the upper-bound performance (the fully-supervised case under the single-scale setting), showing its effectiveness for this task. It can also be found that using the multi-scale strategy brings a 1.3% mIoU increase. For the test set, our approach outperforms GNN with a clear gain of 6.8%. Overall, without using any extra dataset or post-processing, our approach outperforms other approaches by a large margin through single-stage training.

Method bkg aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv Mean
KernelCut [32] - 86.2 37.3 85.5 69.4 77.8 91.7 85.1 91.2 38.8 85.1 55.5 85.6 85.8 81.7 84.1 61.4 84.3 43.1 81.4 74.2 75.0
BPG [34] 93.4 84.8 38.4 84.6 65.5 78.8 91.4 85.9 89.5 41.0 87.3 58.3 84.1 85.2 83.7 83.6 64.9 88.3 46.0 86.3 73.9 76.0
SPML [15] - 89.0 38.4 86.0 72.6 77.9 90.0 83.9 91.0 40.0 88.3 57.7 87.7 82.8 79.1 86.5 57.1 87.4 50.5 81.2 76.9 76.1
DFR-ours (ss) 95.0 90.8 39.0 89.8 76.4 82.9 93.8 87.3 94.9 49.4 92.7 66.2 90.9 89.9 86.8 87.8 71.8 90.4 64.0 92.4 79.4 81.5

TABLE II: Per-class comparison between our approach and others on PASCAL VOC 2012 val set.

In Table II, we report the per-class results on PASCAL VOC 2012 val set. It can be seen that our approach generates new state-of-the-art performances for each class. Note that we do not use dense CRF while the other reported approaches use dense CRF as post-processing.

In Fig. 3, segmentation performance comparisons with different scribble lengths are reported. Our approach consistently outperforms other approaches for all scribble lengths. Even when provided with only 30% of the original scribble length, our approach still obtains an mIoU of 80.0%.

Method Pub. Sup. CRF mIoU (%)
ScribbleSup [21] CVPR’16 S 36.1
RAWKS [33] CVPR’17 S 37.4
GraphNet [29] ACMM’18 S - 33.9
GraphNet+CRF [29] ACMM’18 S 40.2
DFR-ours (ss) - S - 50.9
DFR-ours (ms) - S - 52.9
TABLE III: Comparison with other state-of-the-art on PASCAL CONTEXT dataset.

In Table III, we compare our approach with others on the PASCAL CONTEXT dataset. It can be seen that our approach also achieves a new state-of-the-art performance, with an mIoU gain of 12.7%.

In Fig. 4, we show qualitative comparisons between our approach and the previous state-of-the-art approaches. It can be seen that our approach keeps more details with refined boundaries. Even for complicated cases, our approach still obtains accurate segmentation results.

Fig. 4: Qualitative comparison between our method and other state-of-the-art approaches on PASCAL VOC 2012 val dataset. (a) Original image (b) Ground-truth (c) Results of GNN [39] with dense CRF [17] as post-processing (d) Our results.

4.4 Ablation Studies

In this section, we conduct our ablation studies on the PASCAL VOC 2012 val set, and we report single-scale results.

Semantic Head Feature Head mIoU (%)
68.9
81.0
81.1
81.5
TABLE IV: Ablation study about the influence of the loss functions on PASCAL VOC 2012 val dataset.

In Table IV, we evaluate the influence of the loss functions. It can be seen that without using any loss of the feature consistency head, our proposed regularized loss brings a 12.1% mIoU increase (81.0% vs. 68.9%). Using our feature consistency head further improves the final performance, with a 0.5% mIoU gain. Besides, it can also be found that both loss functions of the feature consistency head are useful for improving the final performance.

Table V reports the results of applying different elements to compute the regularized kernel. By adding the deep feature, the final performance increases to 81.5%, which is 0.7% higher than only using static information (80.8%), proving the effectiveness of the dynamic deep feature. It can also be found that RGB is an essential element; without it, the performance drops rapidly (from 80.8% to 72.8%). Nevertheless, adding the deep feature on top of the spatial position alone also improves the performance, with an mIoU increase of 2.1%. Note that when the deep feature is not used in the kernel (Eq. (6)), we simply remove the full feature consistency head. It is interesting to notice that even when the losses of the feature consistency head are not used, directly using the deep feature in $k(i,j)$ can improve the performance (81.0% in Table IV vs. 80.8% in Table V), which also proves the positive influence of the introduced deep feature.

Kernel mIoU (%)
XY 72.8
XY + RGB 80.8
XY + Feature 74.9
XY + RGB + Feature 81.5
TABLE V: Ablation study about the influence of the shallow feature and deep feature for our regularized loss (Eq. (6)) on PASCAL VOC 2012 val dataset. “XY” is the spatial position. “RGB” is the color information. “Feature” is the dynamic feature from the feature consistency head.
Feature mIoU (%)
Block-1 Block-2 Block-3 Block-4
81.1
81.1
81.3
81.3
81.2
80.9
80.8
80.5
81.5
TABLE VI: Ablation study about the influence of the selected share feature for both semantic segmentation head and feature consistency head on PASCAL VOC 2012 val dataset. “Block” is shown in Fig. 2.

Table VI shows the influence of the selected feature, which is shared by both the semantic segmentation head and the feature consistency head. It can be seen that the performance obtained using the feature from each block individually is slightly limited, while using all features together generates the best performance. Considering that the feature maps from lower blocks contain more low-level information and the feature maps from higher blocks contain more high-level information, using all of these features supplies more comprehensive representations to build accurate relationships between different pixels.

Supervision mIoU (%)
GT 80.6
$\hat{y}$ 81.5
GT + $\hat{y}$ 81.1
TABLE VII: The influence of the supervision for the feature consistency head on PASCAL VOC 2012 val dataset. “GT” means the provided scribble annotation. “$\hat{y}$” denotes our selected confident labels from the semantic segmentation head, defined in Eq. (7).

In Table VII, we explore the influence of different supervision for the feature consistency head. It can be found that if only the ground-truth scribble annotations are used as supervision, the performance is limited (only 80.6%), since the ground truth provides very few annotations (only about 3% of pixels are labeled); thus a local window receives very few negative labels, which is insufficient for the feature distance loss $\mathcal{L}_{fd}$. Besides, using the confident prediction ($\hat{y}$ defined in Eq. (7)) from the semantic segmentation head performs better than using both the ground truth and $\hat{y}$, with an mIoU gain of 0.4%. This is because merging the ground truth and $\hat{y}$ unavoidably introduces some incorrect pixel relationships: there are some noisy labels in $\hat{y}$, while the labels in the ground truth are all correct, so the merge leads to incorrect negative pixel pairs being used as supervision, which is harmful for training.

5 Conclusion

In this paper, we have proposed a dynamic feature regularized loss for weakly supervised semantic segmentation with scribble annotation. Our regularized loss makes full use of the static shallow features and the dynamic deep features to build the regularized kernel, which describes the relationships of different pixels more accurately. Meanwhile, in order to provide more powerful deep features, we introduce a vision transformer as the backbone and design a feature consistency head to constrain the pair-wise pixel relationship under the supervision of the prediction from the semantic segmentation head. We found that our regularized loss and the feature consistency head benefit from each other, leading to better performance. Extensive experiments show that our approach achieves new state-of-the-art performances by large margins. In the future, we plan to apply our approach to other weakly supervised semantic segmentation tasks.

References

  • [1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk (2012) SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (11), pp. 2274–2282. Cited by: §2.2.
  • [2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik (2010) Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §2.2, §4.3.
  • [3] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei (2016) What’s the point: semantic segmentation with point supervision. In Proceedings of the European Conference on Computer Vision, pp. 549–565. Cited by: §1, §2.2.
  • [4] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2014) Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062. Cited by: §1, §2.1.
  • [5] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848. Cited by: TABLE I.
  • [6] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: §2.1.
  • [7] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision, pp. 801–818. Cited by: TABLE I.
  • [8] M. Contributors (2020) MMSegmentation: openmmlab semantic segmentation toolbox and benchmark. Cited by: §4.2.
  • [9] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §4.2.
  • [10] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §1, §2.1.
  • [11] M. Everingham and J. Winn (2011) The pascal visual object classes challenge 2012 (voc2012) development kit. Pattern Analysis, Statistical Modelling and Computational Learning, Tech. Rep. Cited by: §4.1.
  • [12] S. Gao, M. Cheng, K. Zhao, X. Zhang, M. Yang, and P. Torr (2021) Res2Net: a new multi-scale backbone architecture. IEEE Transactions on Pattern Analysis and Machine Intelligence. External Links: Document Cited by: TABLE I.
  • [13] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik (2011) Semantic contours from inverse detectors. In Proceedings of the IEEE International Conference on Computer Vision, Vol. , pp. 991–998. Cited by: §4.1.
  • [14] J. Hwang, S. X. Yu, J. Shi, M. D. Collins, T. Yang, X. Zhang, and L. Chen (2019) Segsort: segmentation by discriminative sorting of segments. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7334–7344. Cited by: §2.1, §2.2, TABLE I.
  • [15] T. Ke, J. Hwang, and S. X. Yu (2021) Universal weakly supervised segmentation by pixel-to-segment contrastive learning. arXiv preprint arXiv:2105.00957. Cited by: §1, §1, §2.2, §4.1, TABLE I, TABLE II.
  • [16] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  • [17] P. Krähenbühl and V. Koltun (2013) Parameter learning and convergent inference for dense random fields. In Proceedings of the International Conference on Machine Learning, pp. 513–521. Cited by: §1, Fig. 4, §4.3.
  • [18] V. Kulharia, S. Chandra, A. Agrawal, P. Torr, and A. Tyagi (2020) Box2Seg: attention weighted loss and discriminative feature learning for weakly supervised segmentation. In Proceedings of the European Conference on Computer Vision, pp. 290–308. Cited by: §1, TABLE I.
  • [19] J. Lee, E. Kim, and S. Yoon (2021) Anti-adversarially manipulated attributions for weakly and semi-supervised semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4071–4080. Cited by: TABLE I.
  • [20] J. Lee, J. Yi, C. Shin, and S. Yoon (2021) BBAM: bounding box attribution map for weakly supervised semantic and instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2643–2652. Cited by: §2.2.
  • [21] D. Lin, J. Dai, J. Jia, K. He, and J. Sun (2016) Scribblesup: scribble-supervised convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3159–3167. Cited by: §1, §2.2, §2.2, §4.1, TABLE I, TABLE III.
  • [22] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125. Cited by: Fig. 2.
  • [23] Y. Liu, Y. Wu, P. Wen, Y. Shi, Y. Qiu, and M. Cheng (2020) Leveraging instance-, image- and dataset-level information for weakly supervised instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. External Links: Document Cited by: §1, TABLE I.
  • [24] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo (2021) Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030. Cited by: §1, §2.1, Fig. 2, §4.2, §4.2, TABLE I.
  • [25] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440. Cited by: §1.
  • [26] R. Mottaghi, X. Chen, X. Liu, N. Cho, S. Lee, S. Fidler, R. Urtasun, and A. Yuille (2014) The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.1.
  • [27] A. Obukhov, S. Georgoulis, D. Dai, and L. Van Gool (2019) Gated crf loss for weakly supervised semantic image segmentation. In Advances in Neural Information Processing Systems, Cited by: §1, §2.2, §2.2, §3.2, §4.1, TABLE I.
  • [28] Y. Oh, B. Kim, and B. Ham (2021) Background-aware pooling and noise-aware loss for weakly-supervised semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6913–6922. Cited by: §2.2, TABLE I.
  • [29] M. Pu, Y. Huang, Q. Guan, and Q. Zou (2018) GraphNet: learning image pseudo annotations for weakly-supervised semantic segmentation. In Proceedings of the 26th ACM International Conference on Multimedia, pp. 483–491. Cited by: §1, TABLE I, TABLE III.
  • [30] L. Song, Y. Li, Z. Li, G. Yu, H. Sun, J. Sun, and N. Zheng (2019) Learnable tree filter for structure-preserving feature transform. In Advances in Neural Information Processing Systems, pp. 1711–1721. Cited by: §2.1, TABLE I.
  • [31] M. Tang, A. Djelouah, F. Perazzi, Y. Boykov, and C. Schroers (2018) Normalized cut loss for weakly-supervised cnn segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1818–1827. Cited by: §1, §2.2, §3.2, §4.1, TABLE I.
  • [32] M. Tang, F. Perazzi, A. Djelouah, I. Ben Ayed, C. Schroers, and Y. Boykov (2018) On regularized losses for weakly-supervised cnn segmentation. In Proceedings of the European Conference on Computer Vision, pp. 507–522. Cited by: §1, §1, §2.2, §2.2, §3.2, §3.2, §4.1, TABLE I, TABLE II.
  • [33] P. Vernaza and M. Chandraker (2017) Learning random-walk label propagation for weakly-supervised semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7158–7166. Cited by: TABLE I, TABLE III.
  • [34] B. Wang, G. Qi, S. Tang, T. Zhang, Y. Wei, L. Li, and Y. Zhang (2019) Boundary perception guidance: a scribble-supervised semantic segmentation approach. In International Joint Conference on Artificial Intelligence. Cited by: §1, §2.2, §4.3, TABLE I, TABLE II.
  • [35] W. Wang, E. Xie, X. Li, D. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao (2021) Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122. Cited by: §2.1.
  • [36] T. Wu, J. Huang, G. Gao, X. Wei, X. Wei, X. Luo, and C. H. Liu (2021) Embedded discriminative attention mechanism for weakly supervised semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 16765–16774. Cited by: TABLE I.
  • [37] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun (2018) Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision, pp. 418–434. Cited by: Fig. 2, §4.2, TABLE I.
  • [38] S. Xie and Z. Tu (2015) Holistically-nested edge detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1395–1403. Cited by: §1, §2.2, §4.3.
  • [39] B. Zhang, J. Xiao, J. Jiao, Y. Wei, and Y. Zhao (2021) Affinity attention graph neural network for weakly supervised semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §1, §2.2, Fig. 4, §4.3, TABLE I.
  • [40] B. Zhang, J. Xiao, Y. Wei, M. Sun, and K. Huang (2020) Reliability does matter: an end-to-end weakly supervised semantic segmentation approach. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 12765–12772. Cited by: §1, §2.2.
  • [41] D. Zhang, H. Zhang, J. Tang, X. Hua, and Q. Sun (2020) Causal intervention for weakly-supervised semantic segmentation. arXiv preprint arXiv:2009.12547. Cited by: §2.2.
  • [42] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890. Cited by: §2.1, Fig. 2.