PseudoSeg: Designing Pseudo Labels for Semantic Segmentation

10/19/2020 · Yuliang Zou, et al.

Recent advances in semi-supervised learning (SSL) demonstrate that a combination of consistency regularization and pseudo-labeling can effectively improve image classification accuracy in the low-data regime. Compared to classification, semantic segmentation tasks require much more intensive labeling and therefore benefit greatly from data-efficient training methods. However, the structured outputs in segmentation make it difficult to directly apply existing SSL strategies (e.g., in designing pseudo-labeling and augmentation). To address this problem, we present a simple and novel re-design of pseudo-labeling to generate well-calibrated structured pseudo labels for training with unlabeled or weakly-labeled data. Our proposed pseudo-labeling strategy is network-structure agnostic and applies within a one-stage consistency training framework. We demonstrate the effectiveness of the proposed pseudo-labeling strategy in both low-data and high-data regimes. Extensive experiments validate that pseudo labels generated by wisely fusing diverse sources, together with strong data augmentation, are crucial to consistency training for segmentation. The source code is available at https://github.com/googleinterns/wss.


1 Introduction

Image semantic segmentation is a core computer vision task that has been studied for decades. Compared with other vision tasks, such as image classification and object detection, human annotation of pixel-accurate segmentation is dramatically more expensive. Given sufficient pixel-level

labeled training data (i.e., high-data regime), current state-of-the-art segmentation models (e.g., DeepLabv3+ (Chen et al., 2018)) produce satisfactory segmentation predictions for common practical usage. Recent explorations demonstrate further improvement in the high-data regime with large-scale data, including self-training (Chen et al., 2020a; Zoph et al., 2020) and backbone pre-training (Zhang et al., 2020a).

In contrast to the high-data regime, the performance of segmentation models drops significantly given very limited pixel-labeled data (i.e., low-data regime). Such ineffectiveness in the low-data regime hinders the applicability of segmentation models. Therefore, instead of improving high-data-regime segmentation, our work focuses on data-efficient segmentation training that relies on only a small amount of pixel-labeled data and leverages extra unlabeled or weakly annotated (e.g., image-level) data to improve performance, with the aim of narrowing the gap to supervised models trained with fully pixel-labeled data.

Our work is inspired by the recent success of semi-supervised learning (SSL) for image classification, which demonstrates promising performance given very limited labeled data and a sufficient amount of unlabeled data. Successful examples include MeanTeacher (Tarvainen and Valpola, 2017), UDA (Xie et al., 2019), MixMatch (Berthelot et al., 2019b), FeatMatch (Kuo et al., 2020), and FixMatch (Sohn et al., 2020a). One outstanding idea in this line of SSL is consistency training: making predictions consistent among multiple augmented versions of an image. FixMatch (Sohn et al., 2020a) shows that using high-confidence one-hot pseudo labels obtained from weakly-augmented unlabeled data to train the strongly-augmented counterpart is the key to the success of SSL in image classification. However, effective pseudo labels and well-designed data augmentation are non-trivial to obtain for segmentation. Although many related works explore the second condition (i.e., augmentation) for image segmentation to enable the consistency training framework (French et al., 2020; Ouali et al., 2020), we show that a wise design of pseudo labels for segmentation has great untapped potential.

In this paper, we propose PseudoSeg, a one-stage training framework to improve image semantic segmentation by leveraging additional data either with image-level labels (weakly-labeled data) or without any labels. PseudoSeg presents a novel design of pseudo-labeling to infer effective structured pseudo labels of additional data. It then optimizes the prediction of strongly-augmented data to match its corresponding pseudo labels. In summary, we make the following contributions:

  • We propose a simple one-stage framework to improve semantic segmentation by using a limited amount of pixel-labeled data and sufficient unlabeled or image-level labeled data. Our framework is simple to apply and agnostic to the network architecture.

  • Directly applying consistency training approaches validated in image classification poses particular challenges in segmentation. We demonstrate how well-calibrated soft pseudo labels, obtained through a wise fusion of predictions from diverse sources, can greatly improve consistency training for segmentation.

  • We conduct extensive experimental studies on the PASCAL VOC 2012 and COCO datasets. Comprehensive analyses are conducted to validate the effectiveness of this method at not only the low-data regime but also the high-data regime. Our experiments study multiple important open questions about transferring SSL advances to segmentation tasks.

2 Related Work

Semi-supervised classification. Semi-supervised learning (SSL) aims to improve model performance by incorporating a large amount of unlabeled data during training. Consistency regularization and entropy minimization are two common strategies for SSL. The intuition behind consistency-based approaches (Laine and Aila, 2016; Sajjadi et al., 2016; Miyato et al., 2018; Tarvainen and Valpola, 2017) is that the model output should remain unchanged when the input is perturbed. On the other hand, the entropy minimization strategy (Grandvalet and Bengio, 2005) argues that unlabeled data can be used to ensure classes are well separated, which can be achieved by encouraging the model to output low-entropy predictions. Pseudo-labeling (Lee, 2013) is one way of performing implicit entropy minimization. Recently, holistic approaches (Berthelot et al., 2019b, a; Sohn et al., 2020a) combining both strategies have been proposed and achieve significant improvements. By re-designing the pseudo label, we propose an efficient one-stage semi-supervised consistency training framework for semantic segmentation.

Semi-supervised semantic segmentation. Collecting pixel-level annotations for semantic segmentation is costly and prone to error. Hence, leveraging unlabeled data is a natural fit for semantic segmentation. Early methods utilize a GAN-based model either to generate additional training data (Souly et al., 2017) or to learn a discriminator between the prediction and the ground truth mask (Hung et al., 2018; Mittal et al., 2019). Consistency regularization based approaches have also been proposed recently, enforcing predictions to be consistent across augmented input images (French et al., 2020; Kim et al., 2020), perturbed feature embeddings (Ouali et al., 2020), or different networks (Ke et al., 2020). Recently, Luo and Yang (2020) propose a dual-branch training network to jointly learn from pixel-accurate and coarsely labeled data, achieving good segmentation performance. To push state-of-the-art performance further, iterative self-training approaches (Chen et al., 2020a; Zoph et al., 2020) have been proposed. These methods usually assume the available labeled data is enough to train a good teacher model, which is then used to generate pseudo labels for the student model. However, this condition might not be satisfied in the low-data regime. Our proposed method, on the other hand, realizes the ideas of both consistency regularization and pseudo-labeling in segmentation, and consistently improves over the supervised baseline in both low-data and high-data regimes.

Weakly-supervised semantic segmentation. Instead of supervising network training with accurate pixel-level labels, many prior works exploit weaker forms of annotations (e.g., bounding boxes (Dai et al., 2015), scribbles (Lin et al., 2016), image-level labels). Most recent approaches use image-level labels as the supervisory signal, exploiting the idea of the class activation map (CAM) (Zhou et al., 2016). Since the vanilla CAM only focuses on the most discriminative regions of objects, different ways to refine CAM have been proposed, including partial image/feature erasing (Hou et al., 2018; Wei et al., 2017; Li et al., 2018), using an additional saliency estimation model (Oh et al., 2017; Huang et al., 2018; Wei et al., 2018), utilizing pixel similarity to propagate the initial score map (Ahn and Kwak, 2018; Wang et al., 2020), or mining and co-segmenting the same category of objects across images (Sun et al., 2020; Zhang et al., 2020b). While these approaches achieve promising results, most of them require a multi-stage training strategy: the refined score maps are optimized again using a dense-CRF model (Krähenbühl and Koltun, 2011), and then used as targets to train a separate segmentation network. In contrast, we assume there exists a small number of fully-annotated images, which allows us to learn stronger segmentation models than methods that use no pixel-labeled data at all.

3 The proposed method

Analogous to SSL for classification, our training objective in PseudoSeg consists of a supervised loss $\mathcal{L}_s$ applied to pixel-level labeled data and a consistency constraint $\mathcal{L}_u$ applied to unlabeled data (for simplicity, we illustrate the method with unlabeled data here and show that it can easily be adapted to use image-level labeled data in Section 3.2). Specifically, the supervised loss is the standard pixel-wise cross-entropy loss on the weakly augmented pixel-level labeled examples:

$$\mathcal{L}_s = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log f_\theta\big(\omega(x)\big)_{i,c} \qquad (1)$$

where $\theta$ represents the learnable parameters of the network function $f$, $N$ denotes the number of valid labeled pixels in an image $x$, $y_i$ is the ground truth label of pixel $i$ in $C$ dimensions, $f_\theta(\omega(x))_i$ is the predicted probability of pixel $i$, $C$ is the number of classes to predict, and $\omega(\cdot)$ denotes the weak (common) data augmentation operations used by Chen et al. (2018).

During training, the proposed PseudoSeg estimates a pseudo label $\tilde{y}$ for each strongly-augmented unlabeled image, which is then used for computing the cross-entropy loss. The unsupervised objective can then be written as:

$$\mathcal{L}_u = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} \tilde{y}_{i,c}\,\log f_\theta\big(\beta(x)\big)_{i,c} \qquad (2)$$

where $\beta(\cdot)$ denotes a stronger data augmentation operation, which will be described in Section 3.2. We illustrate the unlabeled data training branch in Figure 1.
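To make the two objectives concrete, the following is a minimal PyTorch-style sketch of the supervised term in equation 1 and the soft-label consistency term in equation 2; the function names, tensor shapes, and the ignore-index value are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def supervised_loss(logits, labels, ignore_index=255):
    """Eq. (1): pixel-wise cross-entropy on weakly augmented labeled data.

    logits: (B, C, H, W) decoder outputs for omega(x); labels: (B, H, W) int64.
    """
    return F.cross_entropy(logits, labels, ignore_index=ignore_index)

def unsupervised_loss(strong_logits, pseudo_label):
    """Eq. (2): cross-entropy between decoder predictions on the strongly
    augmented image and the (soft) pseudo label inferred from the weak view.

    strong_logits: (B, C, H, W); pseudo_label: (B, C, H, W) soft probabilities,
    treated as a fixed target (no gradient flows through it).
    """
    log_prob = F.log_softmax(strong_logits, dim=1)
    return -(pseudo_label.detach() * log_prob).sum(dim=1).mean()
```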

Figure 1: Overview of unlabeled data training branch. Given an image, the weakly augmented version is fed into the network to get the decoder prediction and Self-attention Grad-CAM (SGC). The two sources are then combined via a calibrated fusion strategy to form the pseudo label. The network is trained to make its decoder prediction from strongly augmented image to match the pseudo label by a per-pixel cross-entropy loss.

3.1 The Design of Structured Pseudo Labels

The next important question is how to generate a desirable pseudo label $\tilde{y}$. A straightforward solution is to directly use the decoder output of a trained segmentation model after confidence thresholding, as suggested by Sohn et al. (2020a); Zoph et al. (2020); Xie et al. (2020); Sohn et al. (2020b). However, as we demonstrate later in the experiments, the generated hard/soft pseudo labels, as well as other post-processings of the outputs, are barely satisfactory in the low-data regime and thus yield inferior final results. To address this issue, our design of pseudo-labeling builds on two key insights. First, we seek a distinct yet efficient decision mechanism to compensate for the potential errors of decoder outputs. Second, we wisely fuse multiple sources of predictions to generate an ensembled and better-calibrated version of pseudo labels.

Starting with localization. Compared with precise segmentation, learning localization is a simpler task, as it only needs to provide outputs at a coarser granularity than pixel-level masks of objects. Based on this motivation, we improve decoder predictions from the localization perspective. The class activation map (CAM) (Zhou et al., 2016) is a popular approach to provide localization of class-specific regions. CAM-based methods (Hou et al., 2018; Wei et al., 2017; Ahn and Kwak, 2018) have been successfully adopted to tackle the weakly supervised semantic segmentation task, which differs from ours in that only image-level labels are assumed to be available. In practice, we adopt a variant of the class activation map, Grad-CAM (Selvaraju et al., 2017), in PseudoSeg.
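As a reminder of the mechanism we build on, here is a minimal Grad-CAM sketch in PyTorch; the layer choice and variable names are illustrative assumptions rather than the exact implementation used in PseudoSeg.

```python
import torch
import torch.nn.functional as F

def grad_cam(feature_maps, class_score):
    """Grad-CAM for one class: weight feature maps by spatially averaged gradients.

    feature_maps: (B, K, H, W) activations from a chosen backbone layer
                  (must be part of the graph that produced class_score);
    class_score:  (B,) classifier logits for the class of interest.
    """
    grads = torch.autograd.grad(class_score.sum(), feature_maps,
                                retain_graph=True)[0]        # (B, K, H, W)
    weights = grads.mean(dim=(2, 3), keepdim=True)           # global-average-pooled grads
    cam = F.relu((weights * feature_maps).sum(dim=1))        # (B, H, W)
    # Normalize each map so its maximum response is one.
    peak = cam.flatten(1).max(dim=1).values.clamp(min=1e-6).view(-1, 1, 1)
    return cam / peak
```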

From localization to segmentation. CAM estimates the strength of classifier responses on local feature maps. Thus, an inherent limitation of CAM-based approaches is that they are prone to attending only to the most discriminative regions. Although many weakly-supervised segmentation approaches (Ahn and Kwak, 2018; Ahn et al., 2019; Sun et al., 2020) aim at refining CAM localization maps into segmentation masks, most of them involve complicated post-processing steps, such as dense CRF (Krähenbühl and Koltun, 2011), which increase the model complexity when used for consistency training. Here we present a computationally efficient yet effective refinement alternative, which is learnable using the available pixel-labeled data.

Although CAM only localizes partial regions of interest, if we know the pairwise similarities between regions, we can propagate the CAM scores from the discriminative regions to the remaining un-attended regions. It has been shown in many works that learned high-level deep features are usually good at measuring the similarity of visual objects. In this paper, we find that the hypercolumn (Hariharan et al., 2015), together with a learnable similarity measure function, works fairly effectively.

Given the vanilla Grad-CAM output for all $C$ classes, viewed as a spatially-flattened 2-D weight matrix in which each row $w_i$ is the per-class response weight for one region $i$, and a kernel function $k(\cdot,\cdot)$ that measures element-wise similarity given the features $e_i, e_j$ of two regions, the propagated score $\hat{w}_i$ can be computed as follows:

$$\hat{w}_i = w_i + \sum_{j} k(e_i, e_j)\, w_j \qquad (3)$$

The goal is to train $k(\cdot,\cdot)$ so that high values in the Grad-CAM output are propagated to all regions that are adjacent in the feature space (i.e., hypercolumn features). Adding $w_i$ in equation 3 indicates the skip-connection. To compute the propagated scores for all regions, the operations in equation 3 can be efficiently implemented with self-attention dot-product (Vaswani et al., 2017). For brevity, we denote the output of this efficient refinement process as self-attention Grad-CAM (SGC) maps. Figure 6 in Appendix A specifies the architecture.
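The dot-product self-attention form of equation 3 can be sketched as follows; the projection layers, scaling factor, and variable names are assumptions for illustration, not the released implementation.

```python
import torch

def propagate_cam(cam, feat, key_proj, query_proj):
    """Propagate Grad-CAM scores between regions via self-attention (eq. 3).

    cam:  (B, C, H, W)  per-class Grad-CAM scores, used as the "value".
    feat: (B, D, H, W)  hypercolumn features used to measure region similarity.
    key_proj, query_proj: learnable conv layers projecting feat to embeddings.
    """
    B, C, H, W = cam.shape
    v = cam.flatten(2).transpose(1, 2)             # (B, HW, C)
    k = key_proj(feat).flatten(2).transpose(1, 2)  # (B, HW, D')
    q = query_proj(feat).flatten(2).transpose(1, 2)
    attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
    out = attn @ v + v                             # propagated scores + skip connection
    return out.transpose(1, 2).reshape(B, C, H, W)
```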

Calibrated prediction fusion. SGC maps are obtained from low-resolution feature maps. They are then resized to the desired output resolution and are therefore not sufficient for delineating crisp boundaries. However, compared to the segmentation decoder, SGC is capable of generating more locally-consistent masks. Thus, we propose a novel calibrated fusion strategy to take advantage of both the decoder and SGC predictions for better pseudo labels.

Specifically, given a batch of decoder outputs (pre-softmax logits) $p$ and SGC maps $\hat{w}$ computed from weakly-augmented data, we generate the pseudo labels by

$$\tilde{y} = \mathrm{Sharpen}\Big(\tfrac{1}{2}\,\mathrm{Softmax}\big(\mathrm{Norm}(p)\big) + \tfrac{1}{2}\,\mathrm{Softmax}\big(\mathrm{Norm}(\hat{w})\big),\ T\Big) \qquad (4)$$

Two critical procedures are used here to make the fusion process successful. First, $p$ and $\hat{w}$ come from different decision mechanisms and could have very different degrees of overconfidence. Therefore, we introduce the $\mathrm{Norm}(\cdot)$ operation as a normalization factor. It alleviates over-confident probabilities after softmax, which could otherwise unfavorably dominate the resulting averaged probability. Second, the distribution sharpening operation $\mathrm{Sharpen}(\cdot, T)$ adjusts the temperature scalar $T$ of the categorical distribution (Berthelot et al., 2019b; Chen et al., 2020b). Figure 2 illustrates the predictions from different sources. More importantly, we investigate pseudo-labeling from a calibration perspective (Section 4.3), demonstrating that the proposed soft pseudo label leads to a better calibration metric compared to other possible fusion alternatives, and justifying why it benefits the final segmentation performance.
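A minimal sketch of the calibrated fusion in equation 4, assuming an L2 normalization of logits and power-based temperature sharpening; the exact normalization and default temperature in the released code may differ.

```python
import torch

def calibrated_fusion(decoder_logits, sgc_logits, temperature=0.5):
    """Fuse decoder and SGC predictions into a soft pseudo label (eq. 4)."""
    def norm(x):
        # Rescale logits so that an over-confident source cannot dominate
        # the average after softmax.
        return x / x.flatten(1).norm(dim=1).clamp(min=1e-6).view(-1, 1, 1, 1)

    p = 0.5 * (torch.softmax(norm(decoder_logits), dim=1)
               + torch.softmax(norm(sgc_logits), dim=1))
    # Temperature sharpening: lower T pushes the distribution toward one-hot.
    sharpened = p ** (1.0 / temperature)
    return sharpened / sharpened.sum(dim=1, keepdim=True)
```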

Training. Our final training objective contains two extra losses: a classification loss $\mathcal{L}_{cls}$ and a segmentation loss $\mathcal{L}_{sgc}$. First, to compute Grad-CAM, we add a one-layer classification head after the segmentation backbone and train it with a multi-label classification loss $\mathcal{L}_{cls}$. Second, as specified in Appendix A (Figure 6), SGC maps are scaled into pixel-wise probabilities using a one-layer convolution followed by softmax in equation 3. Learning to predict SGC maps requires pixel-labeled data; this is achieved by an extra segmentation loss $\mathcal{L}_{sgc}$ between the SGC maps of pixel-labeled data and the corresponding ground truth. All the loss terms ($\mathcal{L}_s$, $\mathcal{L}_u$, $\mathcal{L}_{cls}$, $\mathcal{L}_{sgc}$) are jointly optimized, while $\mathcal{L}_u$ only optimizes the decoder prediction of the strongly-augmented branch (achieved by stopping gradients through the pseudo label). See Figure 7 in the appendix for further details.


Figure 2: Visualization of pseudo labels and other predictions (panels: input, Grad-CAM, SGC map, decoder, decoder (strong), pseudo label). The generated pseudo label, obtained by fusing the predictions from the decoder and the SGC map, is used to supervise the decoder (strong) predictions of the strongly-augmented counterpart.

3.2 Incorporating image-level labels and augmentation

The proposed PseudoSeg can easily incorporate image-level label information (if available) into our one-stage training framework, which also leads to consistent improvement as we demonstrate in experiments. We utilize the image-level data with two following steps. First, we directly use ground truth image-level labels to generate Grad-CAMs instead of using classifier outputs. Second, they are used to increase classification supervision beyond pixel-level labels for the classifier head.

For strong data augmentation, we simply follow the color jittering operations from SimCLR (Chen et al., 2020b) and remove all geometric transformations. The overall strength of the augmentation is controlled by a scalar (studied in experiments). We also apply random CutOut (DeVries and Taylor, 2017) once, with a fixed-size region, since we find it gives a consistent though minor improvement (pixels inside CutOut regions are ignored in computing losses).
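For illustration, a strong augmentation of this form could be composed roughly as below, assuming torchvision's ColorJitter; the jitter magnitudes and the CutOut region size are placeholders rather than the values used in the paper.

```python
import random
import torch
from torchvision import transforms

def strong_augment(img, s=1.0, cutout_size=64):
    """Color jittering (scaled by s) followed by one random CutOut.

    img: (3, H, W) float tensor in [0, 1].
    Returns the augmented image and a mask of pixels to ignore in the loss.
    """
    jitter = transforms.ColorJitter(0.8 * s, 0.8 * s, 0.8 * s, 0.2 * s)
    img = jitter(img)

    _, h, w = img.shape
    top = random.randint(0, max(h - cutout_size, 0))
    left = random.randint(0, max(w - cutout_size, 0))
    ignore_mask = torch.zeros(h, w, dtype=torch.bool)
    ignore_mask[top:top + cutout_size, left:left + cutout_size] = True
    img = img.clone()
    img[:, ignore_mask] = 0.0        # zero out the cut region
    return img, ignore_mask          # pixels in ignore_mask are skipped in the loss
```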

4 Experimental Results

We start by specifying the experimental details. Then, we evaluate the method in the settings of using unlabeled data and using image-level labeled data, respectively. Next, we conduct various ablation studies to justify our design choices. Lastly, we conduct more comparative experiments in specific settings.

To evaluate the proposed method, we conduct the main experiments and ablation studies on the PASCAL VOC 2012 dataset (VOC12) (Everingham et al., 2015), which contains 21 classes including background. The standard VOC12 dataset has 1,464 images in the training set and 1,449 images in the validation set. We randomly subsample 1/2, 1/4, 1/8, and 1/16 of the images in the standard training set to construct the pixel-level labeled data. The remaining images in the standard training set, together with the images in the augmented set (Hariharan et al., 2011) (around 9k images), are used as unlabeled or image-level labeled data. To further verify the effectiveness of the proposed method, we also conduct experiments on the COCO dataset (Lin et al., 2014). The COCO dataset has 118,287 images in the training set and 5,000 images in the validation set. We evaluate on the 80 foreground classes plus background, as in the object detection task. Since the COCO dataset is larger than VOC12, we randomly subsample smaller ratios, 1/32, 1/64, 1/128, 1/256, and 1/512, of the images in the training set to construct the pixel-level labeled data. The remaining images in the training set are used as unlabeled data or image-level labeled data. We evaluate performance using the standard mean intersection-over-union (mIoU) metric. Implementation details can be found in Appendix B.

4.1 Experiments using unlabeled data

Improvement over a strong baseline. We first demonstrate the effectiveness of the proposed method by comparing it with the DeepLabv3+ model trained with only the pixel-level labeled data. As shown in Figure 3 (a), the proposed method consistently outperforms the supervised training baseline on VOC12 by utilizing the unlabeled data. The proposed method not only achieves a large performance boost in the low-data regime (when only 6.25% of pixel-level labels are available), but also improves performance when the entire training set (1.4k images) is available. In Figure 3 (b), we again observe consistent improvement on the COCO dataset.

Figure 3: Improvement over the strong supervised baseline, in a semi-supervised setting (w/ unlabeled data) on VOC12 val (left) and COCO val (right).

Comparisons with the others. Next, we compare the proposed method with recent state-of-the-art methods on VOC12, on both the public 1.4k/9k split (Table 1) and the created low-data splits (Table 2). Our method compares favorably with the other approaches.

Method Network mIoU (%)
GANSeg (Souly et al., 2017) VGG16 64.10
AdvSemSeg (Hung et al., 2018) ResNet-101 68.40
CCT (Ouali et al., 2020) ResNet-50 69.40
PseudoSeg (Ours) ResNet-50 71.00
PseudoSeg (Ours) ResNet-101 73.23
Table 1: Comparison with state of the arts on VOC12 val set (w/ unlabeled data). We use the official training set (1.4k) as labeled data, and the augmented set (9k) as unlabeled data.
Method 1/2 (732) 1/4 (366) 1/8 (183) 1/16 (92)
AdvSemSeg (Hung et al., 2018) 65.27 59.97 47.58 39.69
CCT (Ouali et al., 2020) 62.10 58.80 47.60 33.10
*MT (Tarvainen and Valpola, 2017) 69.16 63.01 55.81 48.70
GCT (Ke et al., 2020) 70.67 64.71 54.98 46.04
**VAT (Miyato et al., 2018) 63.34 56.88 49.35 36.92
CutMix (French et al., 2020) 69.84 68.36 63.20 55.58
PseudoSeg (Ours) 72.41 69.14 65.50 57.60
Table 2: Comparison with state of the arts on VOC12 val set (w/ unlabeled data) using low-data splits. The exact numbers of pixel-labeled images are shown in brackets. All the methods use ResNet-101 as backbone except CCT (Ouali et al., 2020), which uses ResNet-50. * indicates implementation from Ke et al. (2020), ** indicates implementation from French et al. (2020).

4.2 Experiments using image-level labeled data

Figure 4: Improvement over the strong supervised baseline, in a semi-supervised setting (w/ image-level labeled data) on VOC12 val (left) and COCO val (right).

Similar to semi-supervised learning with unlabeled data, we first demonstrate the efficacy of our method by comparing it with a strong supervised baseline. As shown in Figure 4, the proposed method consistently improves over the strong baseline on both datasets. In Table 3, we evaluate on the public 1.4k/9k split; the proposed method compares favorably with the other methods. Moreover, we further compare with the strongest competing method, CCT, on the created low-data splits (Table 4). Both experiments show that the proposed PseudoSeg is more robust than the compared method given less data. On all datasets, using image-level labeled data yields higher mIoU than the setting using unlabeled data.

Method Network mIoU (%)
WSSN (Papandreou et al., 2015) VGG16 64.60
GAIN (Li et al., 2018) VGG16 60.50
MDC (Wei et al., 2018) VGG16 65.70
DSRG (Huang et al., 2018) VGG16 64.30
GANSeg (Souly et al., 2017) VGG16 65.80
FickleNet (Lee et al., 2019) ResNet-101 65.80
CCT (Ouali et al., 2020) ResNet-50 73.20
PseudoSeg (Ours) ResNet-50 73.80
Table 3: Comparison with state of the arts on VOC12 val set (w/ image-level labeled data). We use the official training set (1.4k) as labeled data, and the augmented set (9k) as image-level labeled data.
Split CCT PseudoSeg
1/2 66.80 73.51
1/4 67.60 71.79
1/8 62.50 69.15
1/16 51.80 65.44
Table 4: Comparison with CCT (Ouali et al., 2020) on VOC12 val set (w/ image-level labeled data) using low-data splits. Four ratios of pixel-level labeled examples are tested. Both CCT and our method use ResNet-50 as backbone.

4.3 Ablation study

In this section, we conduct extensive ablation experiments on VOC12 to validate our design choices.

How to construct the pseudo label? We investigate the effectiveness of the proposed pseudo-labeling. Table 5 presents quantitative results, indicating that using either the decoder output or SGC alone gives inferior performance; naively using the decoder output as the pseudo label can hardly work well. The proposed fusion consistently performs better, either with or without additional image-level labels. To further answer why our pseudo labels are effective, we study them from the model calibration perspective. We measure the expected calibration error (ECE) (Guo et al., 2017) scores of all the intermediate steps and other fusion variants. As shown in Figure 5 (a), the proposed fusion strategy (denoted as G in the figure) achieves the lowest ECE score, indicating the significance of jointly using normalization with sharpening (see equation 4) compared with other fusion alternatives. We hypothesize that using well-calibrated soft labels makes model training less affected by label noise. A comprehensive calibration study is left as a future exploration direction.
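For reference, the expected calibration error over pixels can be computed roughly as in the sketch below (assuming equal-width confidence bins); this is not the exact evaluation script used for Figure 5 (a).

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE over pixels: |accuracy - confidence| averaged over confidence bins.

    probs:  (N, C) predicted class probabilities for N pixels.
    labels: (N,)   ground-truth class indices.
    """
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(np.float64)

    ece = 0.0
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # Weight each bin's |acc - conf| gap by the fraction of pixels in it.
            ece += in_bin.mean() * abs(accuracies[in_bin].mean()
                                       - confidences[in_bin].mean())
    return ece
```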

Source Using image-level labels 1/4 (366) 1/8 (183) 1/16 (92)
Decoder only - 70.22 69.35 53.20
SGC only - 67.07 62.61 53.42
Calibrated fusion - 73.79 73.13 67.06
Decoder only ✓ 73.95 73.05 67.54
SGC only ✓ 71.73 67.57 64.26
Calibrated fusion ✓ 75.29 74.70 71.22
Table 5: Comparison to alternative pseudo-labeling strategies. We conduct experiments using 1/4, 1/8, and 1/16 of the pixel-level labeled data; the exact numbers of images are shown in brackets.

Using hypercolumn feature or not? In Figure 5 (b), we study the effectiveness of using hypercolumn features instead of the last feature maps in equation 3. We conduct the experiments on the 1/16 split of VOC12. As we can see, hypercolumn features substantially improve performance.

Soft or hard pseudo label? How to utilize predictions as pseudo labels remains an open question in SSL. Next, we study whether we should use soft or hard one-hot pseudo labels. We conduct the experiments in the setting where image-level labeled data are available. As shown in Figure 5 (c), using all predictions as soft pseudo labels yields better performance than selecting only confident predictions. This suggests that well-calibrated soft pseudo labels might be more important in segmentation than over-simplified confidence thresholding.

Temperature sharpening or not? We study the effect of temperature sharpening in equation 4. We conduct the experiments in the setting where image-level labeled data are available. As shown in Figure 5 (d), temperature sharpening shows consistent and clear improvements.


Figure 5: Ablation studies on different factors: (a) expected calibration error, (b) hypercolumn feature, (c) soft vs. hard label, (d) temperature sharpening, (e) color jittering strength, (f) backbone architecture. See Section 4.3 for complete details.

Strong augmentation strength. In Figure 5 (e), we study the effects of color jittering in the strong augmentation. The magnitude of jittering strength is controlled by a scalar (Chen et al., 2020b). We conduct the experiments in the setting where unlabeled data are available. If the magnitude is too small, performance drops significantly, suggesting the importance of strong augmentation.

Impact of different feature backbones. In Figure 5 (f), we compare the performance of using ResNet-50, ResNet-101, and Xception-65 as backbone architectures, respectively. We conduct the experiments in the setting where unlabeled data are available. As we can see, the proposed method consistently improves the baseline by a substantial margin across different backbone architectures.

4.4 Comparison with self-training

Several recent approaches (Chen et al., 2020a; Zoph et al., 2020) exploit the student-teacher self-training idea to improve performance with additional unlabeled data. However, these methods only apply self-training in the high-data regime (i.e., with sufficient pixel-labeled data to train teachers). Here we compare with these methods in the low-data regime, which is our focus. To generate offline pseudo labels, we closely follow the segmentation experiments in Zoph et al. (2020): pixels with a confidence score higher than 0.5 are used as one-hot pseudo labels, while the remaining pixels are treated as ignored regions. This step is considered important to suppress noisy labels. A student model is then trained using the combination of unlabeled data in the VOC12 train and augmented sets with generated one-hot pseudo labels, together with all the available pixel-level labeled data. As shown in Table 6, although self-training improves considerably over the supervised baseline, it is inferior to the proposed method (it is difficult to directly compare with Zoph et al. (2020) in their setting because of the enormous parallel training, uncommon backbones, and inaccessible pre-training datasets). We conjecture that the teacher model usually produces low confidence scores for pixels around boundaries, so the pseudo labels of these pixels are filtered out in student training. However, boundary pixels are important for improving segmentation performance (Kirillov et al., 2020). The design of our method (an online soft pseudo-labeling process), on the other hand, bypasses this challenge. We will conduct more verification of this hypothesis in future work.
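A minimal sketch of this offline hard pseudo-label generation, following the 0.5 confidence threshold described above; the ignore-label value is an assumption.

```python
import numpy as np

def hard_pseudo_label(probs, threshold=0.5, ignore_label=255):
    """Convert teacher probabilities into one-hot pseudo labels.

    probs: (C, H, W) softmax output of the teacher model.
    Pixels whose max confidence is below the threshold are marked as ignored.
    """
    confidence = probs.max(axis=0)
    label = probs.argmax(axis=0).astype(np.int64)
    label[confidence <= threshold] = ignore_label
    return label
```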

Method Using image-level labels 1/4 (366) 1/8 (183) 1/16 (92)
Supervised (Teacher) - 70.20 64.00 56.03
Self-training (Student) - 72.85 69.88 64.20
PseudoSeg (Ours) - 73.79 73.13 67.06
PseudoSeg (Ours) ✓ 75.29 74.70 71.22
Table 6: Comparison with self-training. We use our supervised baseline as the teacher to generate one-hot pseudo labels, following Zoph et al. (2020).

4.5 Improving the fully-supervised method with additional data

We have validated the effectiveness of the proposed method in the low-data regime. In this section, we explore whether the proposed method can further improve supervised training on the full training set using additional data. We use the training set (1.4k) of VOC12 as the pixel-level labeled data. The additional data consists of the VOC augmented set (9k images), the COCO training set, and the COCO unlabeled set. More training details can be found in Appendix D. As shown in Table 7, the proposed PseudoSeg is able to improve upon the supervised baseline even in the high-data regime, using additional unlabeled or image-level labeled data.

Method Extra data mIoU (%)
Baseline - 76.96
PseudoSeg (w/o image-level labels) + VOC 9k 77.40 (+0.44)
PseudoSeg (w/o image-level labels) + VOC 9k + COCO 78.20 (+1.24)
PseudoSeg (w/ image-level labels) + VOC 9k 77.80 (+0.84)
PseudoSeg (w/ image-level labels) + VOC 9k + COCO 79.28 (+2.32)
Table 7: Improving the fully supervised model with extra data. No test-time augmentation is used.

5 Discussion and Conclusion

The key to the good performance of our method in the low-data regime is the novel re-design of the pseudo-labeling strategy, which pursues a different decision mechanism from weakly-supervised localization to “remedy” weak predictions from the segmentation head. Augmentation-consistency training then progressively improves the quality of the segmentation head. For the first time, we demonstrate that, with well-calibrated soft pseudo labels, utilizing unlabeled or image-level labeled data significantly improves segmentation in the low-data regime. Further exploration of fusing stronger and better-calibrated pseudo labels (e.g., multi-scale fusion) is worth more study as a future direction. Although color jittering works within our method as strong data augmentation, we have extensively explored geometric augmentations for segmentation (leveraging STN (Jaderberg et al., 2015) to align pixels in pseudo labels and strongly-augmented predictions) but found them not helpful. We believe data augmentation for segmentation needs re-thinking beyond its current success in classification.

Acknowledgement

We thank Liang-Chieh Chen and Barret Zoph for their valuable comments.

References

  • J. Ahn, S. Cho, and S. Kwak (2019) Weakly supervised learning of instance segmentation with inter-pixel relations. In CVPR, Cited by: §3.1, Table 9.
  • J. Ahn and S. Kwak (2018) Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In CVPR, Cited by: §2, §3.1, §3.1.
  • D. Berthelot, N. Carlini, E. D. Cubuk, A. Kurakin, K. Sohn, H. Zhang, and C. Raffel (2019a) Remixmatch: semi-supervised learning with distribution matching and augmentation anchoring. In ICLR, Cited by: §2.
  • D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel (2019b) Mixmatch: a holistic approach to semi-supervised learning. In NeurIPS, Cited by: §1, §2, §3.1.
  • L. Chen, R. G. Lopes, B. Cheng, M. D. Collins, E. D. Cubuk, B. Zoph, H. Adam, and J. Shlens (2020a) Naive-student: leveraging semi-supervised learning in video sequences for urban scene segmentation. In ECCV, Cited by: §1, §2, §4.4.
  • L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, Cited by: §1, §3.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020b) A simple framework for contrastive learning of visual representations. In ICML, Cited by: §3.1, §3.2, §4.3.
  • F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In CVPR, Cited by: §B.
  • J. Dai, K. He, and J. Sun (2015) Boxsup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV, Cited by: §2.
  • T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552. Cited by: §3.2.
  • M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2015) The pascal visual object classes challenge: a retrospective. IJCV 111 (1), pp. 98–136. Cited by: §4.
  • G. French, T. Aila, S. Laine, M. Mackiewicz, and G. Finlayson (2020) Semi-supervised semantic segmentation needs strong, high-dimensional perturbations. In BMVC, Cited by: §1, §2, Table 2.
  • Y. Grandvalet and Y. Bengio (2005) Semi-supervised learning by entropy minimization. In NeurIPS, Cited by: §2.
  • C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In ICML, Cited by: §4.3.
  • B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik (2011) Semantic contours from inverse detectors. In ICCV, Cited by: §4.
  • B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik (2015) Hypercolumns for object segmentation and fine-grained localization. In CVPR, Cited by: §3.1.
  • Q. Hou, P. Jiang, Y. Wei, and M. Cheng (2018) Self-erasing network for integral object attention. In NeurIPS, Cited by: §2, §3.1.
  • Z. Huang, X. Wang, J. Wang, W. Liu, and J. Wang (2018) Weakly-supervised semantic segmentation network with deep seeded region growing. In CVPR, Cited by: §2, Table 4.
  • W. Hung, Y. Tsai, Y. Liou, Y. Lin, and M. Yang (2018) Adversarial learning for semi-supervised semantic segmentation. In BMVC, Cited by: §2, Table 1, Table 2.
  • M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In NeurIPS, Cited by: §5.
  • P. Jiang, Q. Hou, Y. Cao, M. Cheng, Y. Wei, and H. Xiong (2019) Integral object mining via online attention accumulation. In ICCV, Cited by: Table 9.
  • Z. Ke, D. Qiu, K. Li, Q. Yan, and R. W. Lau (2020) Guided collaborative training for pixel-wise semi-supervised learning. In ECCV, Cited by: §2, Table 2.
  • J. Kim, J. Jang, and H. Park (2020) Structured consistency loss for semi-supervised semantic segmentation. arXiv preprint arXiv:2001.04647. Cited by: §2.
  • A. Kirillov, Y. Wu, K. He, and R. Girshick (2020) Pointrend: image segmentation as rendering. In CVPR, Cited by: §4.4.
  • P. Krähenbühl and V. Koltun (2011) Efficient inference in fully connected crfs with gaussian edge potentials. In NeurIPS, Cited by: §2, §3.1.
  • C. Kuo, C. Ma, J. Huang, and Z. Kira (2020) FeatMatch: feature-based augmentation for semi-supervised learning. In ECCV, Cited by: §1.
  • S. Laine and T. Aila (2016) Temporal ensembling for semi-supervised learning. In ICLR, Cited by: §2.
  • D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshop, Cited by: §2.
  • J. Lee, E. Kim, S. Lee, J. Lee, and S. Yoon (2019) Ficklenet: weakly and semi-supervised semantic image segmentation using stochastic inference. In CVPR, Cited by: Table 4, Table 9.
  • K. Li, Z. Wu, K. Peng, J. Ernst, and Y. Fu (2018) Tell me where to look: guided attention inference network. In CVPR, Cited by: §2, Table 4.
  • D. Lin, J. Dai, J. Jia, K. He, and J. Sun (2016) Scribblesup: scribble-supervised convolutional networks for semantic segmentation. In CVPR, Cited by: §2.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In ECCV, Cited by: §4.
  • W. Luo and M. Yang (2020) Semi-supervised semantic segmentation via strong-weak dual-branch network. In ECCV, Cited by: §2.
  • S. Mittal, M. Tatarchenko, and T. Brox (2019) Semi-supervised semantic segmentation with high-and low-level consistency. TPAMI. Cited by: §2.
  • T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. TPAMI 41 (8), pp. 1979–1993. Cited by: §2, Table 2.
  • S. J. Oh, R. Benenson, A. Khoreva, Z. Akata, M. Fritz, and B. Schiele (2017) Exploiting saliency for object segmentation from image level labels. In CVPR, Cited by: §2.
  • Y. Ouali, C. Hudelot, and M. Tami (2020) Semi-supervised semantic segmentation with cross-consistency training. In CVPR, Cited by: §1, §2, Table 1, Table 2, Table 4.
  • G. Papandreou, L. Chen, K. P. Murphy, and A. L. Yuille (2015) Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In ICCV, Cited by: Table 4.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. IJCV 115 (3), pp. 211–252. Cited by: §B.
  • M. Sajjadi, M. Javanmardi, and T. Tasdizen (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In NeurIPS, Cited by: §2.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In ICCV, Cited by: §3.1.
  • K. Sohn, D. Berthelot, C. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel (2020a) Fixmatch: simplifying semi-supervised learning with consistency and confidence. In NeurIPS, Cited by: §1, §2, §3.1.
  • K. Sohn, Z. Zhang, C. Li, H. Zhang, C. Lee, and T. Pfister (2020b) A simple semi-supervised learning framework for object detection. arXiv preprint arXiv:2005.04757. Cited by: §3.1.
  • N. Souly, C. Spampinato, and M. Shah (2017) Semi supervised semantic segmentation using generative adversarial network. In ICCV, Cited by: §2, Table 1, Table 4.
  • G. Sun, W. Wang, J. Dai, and L. Van Gool (2020) Mining cross-image semantics for weakly supervised semantic segmentation. In ECCV, Cited by: §2, §3.1, Table 9.
  • A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, Cited by: §1, §2, Table 2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, Cited by: §3.1.
  • Y. Wang, J. Zhang, M. Kan, S. Shan, and X. Chen (2020) Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation. In CVPR, Cited by: §2, Table 9.
  • Y. Wei, J. Feng, X. Liang, M. Cheng, Y. Zhao, and S. Yan (2017) Object region mining with adversarial erasing: a simple classification to semantic segmentation approach. In CVPR, Cited by: §2, §3.1.
  • Y. Wei, H. Xiao, H. Shi, Z. Jie, J. Feng, and T. S. Huang (2018) Revisiting dilated convolution: a simple approach for weakly-and semi-supervised semantic segmentation. In CVPR, Cited by: §2, Table 4.
  • Q. Xie, Z. Dai, E. Hovy, M. Luong, and Q. V. Le (2019) Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848. Cited by: §1.
  • Q. Xie, M. Luong, E. Hovy, and Q. V. Le (2020) Self-training with noisy student improves imagenet classification. In CVPR, Cited by: §3.1.
  • H. Zhang, C. Wu, Z. Zhang, Y. Zhu, Z. Zhang, H. Lin, Y. Sun, T. He, J. Muller, R. Manmatha, M. Li, and A. Smola (2020a) ResNeSt: split-attention networks. arXiv preprint arXiv:2004.08955. Cited by: §1.
  • X. Zhang, Y. Wei, and Y. Yang (2020b) Inter-image communication for weakly supervised localization. In ECCV, Cited by: §2.
  • B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In CVPR, Cited by: §2, §3.1.
  • B. Zoph, G. Ghiasi, T. Lin, Y. Cui, H. Liu, E. D. Cubuk, and Q. V. Le (2020) Rethinking pre-training and self-training. arXiv preprint arXiv:2006.06882. Cited by: §1, §2, §3.1, §4.4, Table 6, footnote 2.

Appendix

A Self-attention Grad-CAM

We elaborate the detailed pipeline of generating Self-attention Grad-CAM (SGC) maps (equation 3) in Figure 6. To construct the hypercolumn feature, we extract the feature maps from the last two convolutional stages of the backbone network and concatenate them. We then project the hypercolumn feature into two separate low-dimensional embedding spaces to construct the “key” and “query”, using two convolutional layers. An attention matrix can then be computed via matrix multiplication of “key” and “query”. To construct the “value”, we compute Grad-CAM for each foreground class and concatenate the results, yielding a score map in which the maximum score of each category is normalized to one separately. We then use image-level labels (either from the classifier prediction or the ground truth annotation) to set the score maps of non-existing classes to zero. For each pixel location, we subtract the maximum foreground score from one to construct the background score map, which is then concatenated with the foreground score maps to form the “value”. The attention matrix can then be used to reweight and propagate the scores in the “value”. The propagated scores are added back to the “value” score map and then passed through a convolution (with batch normalization) to output the SGC map.

Figure 6: Diagram of Self-attention Grad-CAM (SGC) .
Figure 7: Training. For each network component, we show the loss supervision and the corresponding data.

B Implementation Details

We implement our method on top of the publicly available official DeepLab codebase (https://github.com/tensorflow/models/tree/master/research/deeplab). Unless specified, we adopt the DeepLabv3+ model with Xception-65 (Chollet, 2017) as the feature backbone, pre-trained on the ImageNet dataset (Russakovsky et al., 2015). We train our model following the default hyper-parameters (e.g., an initial learning rate of 0.007 with a polynomial learning-rate decay schedule, a crop size of 513×513, and an encoder output stride of 16), using 16 GPUs. We do not adopt synchronous batch normalization, which is known to generally improve performance. We use a batch size of 4 per GPU for pixel-level labeled data and 4 per GPU for unlabeled/image-level labeled data. For VOC12, we train the model for 30,000 iterations. For COCO, we train the model for 200,000 iterations. The remaining hyper-parameters are kept at fixed default values unless specified. We do not apply any test-time augmentation.
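For completeness, the polynomial learning-rate decay mentioned above follows the standard DeepLab recipe; a small sketch is given below, where the decay power of 0.9 is the usual DeepLab default and is assumed here.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomial learning-rate decay used by the DeepLab training recipe."""
    return base_lr * (1.0 - float(step) / max_steps) ** power

# e.g. base_lr=0.007 and 30,000 iterations for VOC12
lrs = [poly_lr(0.007, s, 30000) for s in (0, 15000, 29999)]
```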

C Low-Data Sampling in PASCAL VOC 2012

Unlike random sampling in image classification, it is difficult to sample uniformly in the low-data case for semantic segmentation due to the imbalance of rare classes. To avoid missing classes in the extremely low-data regime, we repeat the random sampling process for the 1/16 split three times (while ensuring each class has a certain number of examples) and report the results. We use Split 1 in the main manuscript. All splits will be released to encourage reproducibility. The results of all three splits are shown in Table 8.
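A minimal sketch of this class-aware subsampling, assuming a precomputed mapping from image id to the classes present in its mask; names and the retry limit are illustrative.

```python
import random

def sample_split(image_classes, ratio, num_classes, min_per_class=1, max_tries=1000):
    """Randomly subsample images while keeping every class represented.

    image_classes: dict mapping image id -> set of class ids present in its mask.
    """
    ids = list(image_classes)
    n = max(1, int(len(ids) * ratio))
    for _ in range(max_tries):
        subset = random.sample(ids, n)
        counts = [0] * num_classes
        for img in subset:
            for c in image_classes[img]:
                counts[c] += 1
        if all(c >= min_per_class for c in counts):
            return subset
    raise RuntimeError("could not find a split covering all classes")
```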

Method Using image-level labels Split 1 Split 2 Split 3
Supervised - 56.03 56.87 55.92
PseudoSeg (Ours) - 67.06 64.12 66.09
PseudoSeg (Ours) ✓ 71.22 68.11 69.72
Table 8: Full results of the 1/16 split in VOC12.

D High-Data Experimental Settings

Here we provide more details about the experiments in Section 4.5. Since we have much more unlabeled/image-level labeled data, we adopt a longer training schedule (90,000 iterations); note that a longer training schedule does not improve the supervised baseline. We also adopt a slightly different fusion strategy in this setting, with adjusted fusion hyper-parameters.

E Comparison with weakly-supervised approaches

In Table 9, we benchmark recent weakly supervised semantic segmentation methods on the PASCAL VOC 2012 val set. Instead of enforcing consistency between differently augmented images as we do, these approaches tackle the semantic segmentation task from a different perspective, by exploiting weaker annotations (image-level labels). As we can see, by exploiting image-level labels with careful designs, weakly-supervised semantic segmentation methods achieve reasonably good performance. We believe that both perspectives are feasible and promising for low-data-regime semantic segmentation, and complementary to each other. Therefore, these designs could potentially be integrated into our framework to generate better pseudo labels, leading to improved performance.

Method Pixel-level labeled data mIoU (%)
FickleNet (Lee et al., 2019) - 64.9
IRNet (Ahn et al., 2019) - 63.5
OAA+ (Jiang et al., 2019) - 65.2
SEAM (Wang et al., 2020) - 64.5
MCIS (Sun et al., 2020) - 66.2
PseudoSeg (Ours) 1/16 (92) 71.22
Table 9: Benchmarking state-of-the-art weakly supervised semantic segmentation methods. All the methods use image-level labels from VOC12 training (1.4k) and augmented (9k) sets.

F Qualitative results

We visualize several model prediction results for PASCAL VOC 2012 (Figure 8) and COCO (Figure 9). As we can see, the supervised baseline struggles to segment some of the categories and small objects, when trained in the low-data regime. On the other hand, PseudoSeg utilizes unlabeled or weakly-labeled data to generate more satisfying predictions.










Figure 8: Qualitative results on PASCAL VOC 2012 (columns: input, ground truth, supervised, ours (unlabeled), ours (img. label)). Models are trained with 1/16 pixel-level labeled data in the training set.










Figure 9: Qualitative results on COCO (columns: input, ground truth, supervised, ours (unlabeled), ours (img. label)). Models are trained with 1/512 pixel-level labeled data in the training set. Note that white pixels in the ground truth indicate pixels not annotated for evaluation.