Recent years have witnessed the great success of deep convolutional neural networks (DCNNs) in semantic segmentation [50, 39, 8, 38, 64]. This success, however, relies heavily on a large amount of training data with accurate pixel-level human annotations, which are prohibitively expensive and time-consuming to collect.
Semi-supervised learning aims to ease the human annotation burden by using a small amount of labeled data in conjunction with a large amount of unlabeled data to obtain an accurate model. In this regard, self-training, which alternates between generating pseudo labels for unlabeled data from model predictions and training the model with the pseudo-labeled data, is a classic and effective approach to semi-supervised learning and has obtained state-of-the-art results [18, 5, 67] in semi-supervised semantic segmentation with DCNNs.
Motivations. Despite the encouraging results, most previous self-training approaches [67, 66, 33, 63, 51] assume a class-balanced data distribution and hence adopt a single confidence thresholding (ST) scheme to produce the pseudo labels (i.e., pixels whose prediction confidence exceeds a pre-defined threshold are pseudo-labeled) to guarantee pseudo label quality. However, most real-world semantic segmentation datasets [37, 10, 16, 65] have long-tailed class distributions, with a few categories occupying the majority of pixels, as illustrated in Fig. 1. It is well known that DCNNs trained on such long-tailed data produce predictions biased toward the dominant categories. This can be even more problematic for self-training, since pseudo labels are generated from these biased model predictions. As a result, there is a severe distribution mismatch between true and pseudo labels, especially for tail categories (see Fig. 1), which harms self-training.
Only a few recent works [69, 18] attempt to address the class distribution issue in pseudo labels by sampling the same percentage of pixels for each category based on the predicted results instead of using a single confidence threshold. However, as the class distribution of the predictions has already deviated from the true distribution, the produced pseudo labels still suffer from the bias. Here, we argue that this distribution mismatch is a largely overlooked problem that hinders further improvements in semi-supervised semantic segmentation.
Our Contributions. In this work, we present a simple yet effective baseline method to re-distribute the biased pseudo labels, aligning their distribution with the true distribution (from the labeled set) for improving semi-supervised semantic segmentation.
First, we highlight the distribution mismatch issue in semi-supervised semantic segmentation, formulate the task as an optimization problem, and design a Distribution Alignment and Random Sampling (DARS) method to obtain unbiased pseudo-labeled data that match the true class distribution. We point out that many pixels share the same confidence value (confidence overlapping) due to the over-confidence issue in DCNNs, which makes it not viable to achieve distribution matching by thresholding alone. Therefore, we propose distribution alignment with class-wise thresholding and random sampling to achieve perfect distribution alignment.
Second, during the self-training process, we contribute a progressive data augmentation and labeling strategy which gradually increases the strength of data augmentation (e.g., the range of random scaling) and enlarges the labeling ratio. This strategy prevents the model from being overwhelmed by noisy data from an inaccurate model or by strongly augmented examples at the initial stage, and avoids overfitting to high-confidence, pseudo-labeled easy examples by leveraging diversified augmented data and an increased number of pseudo-labeled data from an improved model.
Third, our proposed method is generic, simple and efficient, and can be seamlessly incorporated into other self-training pipelines for semi-supervised semantic segmentation by adding only a few lines of code. Albeit simple, our approach achieves surprisingly good performance compared with state-of-the-art approaches. On the Cityscapes dataset, our model gains a significant performance boost of 8.89% mIoU in the 1/8 split setting, approaching the fully supervised results. Moreover, we also verify our method on PASCAL VOC 2012, where it outperforms the previous state of the art by 4.49% mIoU.
Finally, we further explore the performance gain in semi-supervised semantic segmentation with the growth of unlabeled data and find that the performance gain gradually saturates in the high-data regime. Further, we analyze the potential bottlenecks for this issue and suggest future directions, hoping to inspire more works in this direction.
2 Related Work
Supervised Semantic Segmentation.
The introduction of fully convolutional neural networks (FCN) [50, 39] is a remarkable milestone in semantic segmentation. Most following works build upon it and either take advantage of multi-scale inputs [8, 12, 17, 34, 35, 47], use feature pyramid spatial pooling [38, 64], or use dilated convolutions [6, 7, 9, 32, 56, 60] to improve the model; encoder-decoder models [1, 9, 32, 49] have also proven effective. We choose PSPNet in our main experiments for its simplicity and compelling performance, and Deeplabv2 for a fair comparison with previous works.
Semi-Supervised Learning. Recently, noticeable progress has been made in semi-supervised learning, and successful examples usually fall into two lines of work. One is consistency training, which assumes the model's predictions are invariant under various perturbations, such as UDA, MT, VAT, Temporal Ensemble, and Dual Student. The other line of work is self-training, closely related to entropy minimization and pseudo labeling [3, 51, 14].
In this work, we mainly focus on self-training, which often utilizes prediction confidence to assign pseudo labels to confident predictions, assuming that high confidence corresponds to good accuracy. To do so, a confidence threshold is often used to filter out unreliable low-confidence predictions, and the remaining predictions are used as pseudo labels. Berthelot et al. average the predictions of different augmented versions of an unlabeled sample, and apply sharpening and mixup to generate pseudo labels. Sohn et al. use a confidence threshold to generate pseudo labels for weakly-augmented versions of unlabeled images and then train a model in a fully supervised way with the obtained pseudo labels and stronger data augmentation. Xie et al. iteratively generate pseudo labels and train models with them. While these methods achieve impressive results, little attention has been paid to the structure and quality of pseudo-labeled data. Concurrent works propose to refine pseudo labels using distribution information, but their methods may not extend well to pixel-level tasks like semantic segmentation due to over-confident predictions and the high computational complexity of their optimization. In contrast, our proposed method is a simple yet efficient way to handle bias in pseudo-labeling for segmentation.
Semi-Supervised Semantic Segmentation. Inspired by the recent development of SSL methods in the image classification domain, a few works explore semi-supervised learning in semantic segmentation and show promising results. Hung et al. and Mittal et al. turn to adversarial learning, adding a discriminator or a multi-label mean teacher (MLMT) branch to select reliable predictions as pseudo labels. Mendel et al. extend the GAN framework and add a secondary model as a corrector to correct the predictions of the segmentation model.
Consistency-based methods are also frequently revisited. French et al. build upon prior work and enforce the mixed predictions and the predictions of mixed inputs to be consistent with each other. Ouali et al. apply perturbations in the feature space and enforce consistency between the predictions of differently perturbed versions. Ke et al. propose a flaw detector and apply a dynamic consistency constraint.
Our proposed method is more closely related to self-training or pseudo labeling based methods. Concurrent works extend a self-training strategy from image classification to semantic segmentation. Feng et al. propose a class-balanced curriculum for semi-supervised semantic segmentation, which can be viewed as the work most related to ours. However, these works do not address the bias in pseudo-labeling: they either use a single confidence threshold for all classes or confine the number of samples in each class with respect to the biased prediction. In contrast, our method explicitly handles the bias in pseudo-labeling and prevents it from harming self-training.
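In SSL, we are given a small set of labeled examples and a large set of unlabeled examples. Let $\mathcal{D}_L = \{(x_i, y_i)\}_{i=1}^{N_L}$ represent the $N_L$ labeled examples, and $\mathcal{D}_U = \{u_j\}_{j=1}^{N_U}$ represent the $N_U$ unlabeled examples, where $x_i$ is the $i$-th labeled image with spatial dimensions $H \times W$, $y_i \in \{0, 1\}^{H \times W \times C}$ is its corresponding one-hot encoded pixel-wise label map with $C$ as the number of categories, and $u_j$ is the $j$-th unlabeled image.
Given $\mathcal{D}_L$ and $\mathcal{D}_U$, our goal is to train a semantic segmentation network $f_\theta$ with parameters $\theta$ to achieve satisfactory results on the test set with the same distribution as the training data. An overview of our framework is shown in Fig. 2, which consists of several steps explained as follows.
Step 1: At round $t$=0, we learn a student model $f_{\theta^{0}}$ only on the labeled examples $\mathcal{D}_L$ by minimizing
$$\mathcal{L}_{L} = \frac{1}{N_L} \sum_{i=1}^{N_L} \ell_{ce}\big(f_\theta(x_i),\, y_i\big), \tag{1}$$
where $\ell_{ce}$ denotes the cross entropy loss.
Step 2: At round $t \geq 1$, we use the learned student model $f_{\theta^{t-1}}$ as the teacher model, producing predictions $P = \{p_j\}_{j=1}^{N_U}$ for the unlabeled examples. Here, $p_j$ are the network outputs after the softmax operation. Given $\mathcal{D}_L$ and $P$, we generate the pseudo labels $\hat{Y} = \{\hat{y}_j\}_{j=1}^{N_U}$ with our DARS method.
Step 3: At round $t$, equipped with $\hat{Y}$, we use both $\mathcal{D}_L$ and the pseudo-labeled set $\hat{\mathcal{D}}_U = \{(u_j, \hat{y}_j)\}_{j=1}^{N_U}$ to train a student model $f_{\theta^{t}}$. The student model resumes from the teacher model and is optimized by minimizing
$$\mathcal{L} = \frac{1}{N_L} \sum_{i=1}^{N_L} \ell_{ce}\big(f_\theta(x_i),\, y_i\big) + \frac{1}{N_U} \sum_{j=1}^{N_U} \ell_{ce}\big(f_\theta(u_j),\, \hat{y}_j\big). \tag{2}$$
Self-training iterates between Step 2 and Step 3 until no more performance gain can be achieved. During iterative training, a progressive augmentation and labeling strategy is designed to further enhance performance.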
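The three steps can be summarized as a short control loop. The sketch below is only an illustration of that loop under stated assumptions: the callables `train`, `predict`, `make_pseudo_labels` and `evaluate` are hypothetical placeholders (not functions from the paper's code), and the stopping rule "stop when the validation score no longer improves" is one possible reading of "until no more performance gain can be achieved".

```python
from typing import Callable, Sequence


def self_training(labeled: Sequence, unlabeled: Sequence,
                  train: Callable, predict: Callable,
                  make_pseudo_labels: Callable, evaluate: Callable,
                  max_rounds: int = 2):
    """Teacher-student self-training loop (hypothetical interface)."""
    # Step 1: round 0, train a student on the labeled set only.
    student = train(None, labeled, [])
    best_score = evaluate(student)
    for _ in range(max_rounds):
        # Step 2: the latest student becomes the teacher and predicts
        # on the unlabeled images; pseudo labels come from those predictions.
        teacher = student
        predictions = [predict(teacher, image) for image in unlabeled]
        pseudo = make_pseudo_labels(labeled, predictions)
        # Step 3: a new student resumes from the teacher and is trained
        # on labeled plus pseudo-labeled data.
        candidate = train(teacher, labeled, list(zip(unlabeled, pseudo)))
        score = evaluate(candidate)
        if score <= best_score:  # assumed stopping rule: no further gain
            break
        student, best_score = candidate, score
    return student, best_score


# Toy stand-ins: a "model" is just an integer counting training rounds.
def toy_train(init, labeled, pseudo_labeled):
    return (init or 0) + 1

def toy_predict(model, image):
    return 0

def toy_pseudo(labeled, predictions):
    return predictions

def toy_evaluate(model):
    return min(model, 2)  # the score saturates after two rounds

result = self_training([("x", "y")], ["u1", "u2"], toy_train,
                       toy_predict, toy_pseudo, toy_evaluate, max_rounds=5)
```

With the toy stand-ins, the loop accepts one self-training round and stops at the next one because the score no longer improves.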
3.2 Unbiased Pseudo Label Generation
To reduce noise in pseudo-labeled data and enhance their quality, previous works adopt either a single confidence threshold [67, 66, 33, 63, 51] for all categories or a labeling ratio that controls the percentage of labeled pixels [69, 18]. However, both criteria suffer under a long-tailed data distribution, which biases pseudo-labeled data toward the dominant categories and causes a severe distribution mismatch between true and pseudo labels, thereby harming effective learning for tail categories.
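As a toy illustration of why a single threshold can starve tail categories, the sketch below applies one global threshold to per-pixel confidences. The threshold value 0.9, the ignore index 255 and the confidence values are assumptions chosen for illustration, not settings from the paper.

```python
IGNORE = 255  # assumed ignore index; a common convention in segmentation code


def single_threshold_labels(confidences, classes, threshold=0.9):
    """Keep the predicted class where confidence >= threshold, else ignore."""
    return [c if p >= threshold else IGNORE
            for p, c in zip(confidences, classes)]


# Four confident head-class pixels (class 0), two less confident tail pixels (class 1).
confidences = [0.99, 0.98, 0.97, 0.95, 0.80, 0.70]
classes = [0, 0, 0, 0, 1, 1]
labels = single_threshold_labels(confidences, classes)
```

Here every head-class pixel is pseudo-labeled while both tail-class pixels are ignored, so the pseudo-label distribution collapses onto the head class.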
In this section, we will present a very simple yet effective technique to produce unbiased high-quality pseudo-labeled data whose class distribution matches the true distribution. Here, we use the class distribution of the labeled data as the true distribution, since it should be representative of the real-world data under unbiased random sampling.
Problem Formulation. Before diving into the details, we present our formulation of this problem as follows. Given the labeled data $\mathcal{D}_L$ and the predictions $P$ from the teacher model, we aim to obtain pseudo labels $\hat{Y}$ that occupy a fraction $\alpha$ of all the pixels, where $\alpha$ is the labeling ratio that controls the quality of pseudo-labeled pixels. To ensure distribution matching and to give pixels with high prediction confidence a larger chance of being pseudo-labeled, we adopt category-specific confidence thresholds $T = \{t_c\}_{c=1}^{C}$ to derive the pseudo-labeled data $\hat{Y}$, where pixels with confidence scores not smaller than the corresponding class threshold are pseudo-labeled. $T$ is derived by solving the following optimization problem,
$$\min_{T} \; \mathrm{KL}\big(F(Y) \,\|\, F(\hat{Y})\big) \quad \text{s.t.} \quad \hat{Y} = G(P, T), \;\; R(\hat{Y}) = \alpha. \tag{3}$$
Here, $Y$ denotes the labels of the labeled data, $F(\cdot)$ is a frequency counting function which outputs the labeled (pseudo-labeled) pixel percentage $q_c$ (or $\hat{q}_c$) of each category $c$, and $\mathrm{KL}(\cdot \,\|\, \cdot)$ calculates the Kullback-Leibler (KL) divergence measuring the distance between two distributions. Besides, $G(P, T)$ assigns a valid pseudo label to a pixel if its confidence value is not smaller than the threshold of the corresponding category and assigns an ignore label otherwise, and $R(\hat{Y})$ returns the percentage of pseudo-labeled pixels. Notably, pixels with the ignore label do not contribute to training.
The KL divergence is minimized when $q_c$ and $\hat{q}_c$ are the same for every category $c$, and thus the desirable number of pseudo-labeled pixels for each category is
$$n_c = \alpha \, N \, q_c,$$
where $N$ is the total number of pixels in the unlabeled data.
Further, the above optimization problem is readily solvable if the confidence values are distinct with no overlapped values: $t_c$ corresponds to the $n_c$-th prediction value if we sort in descending order the prediction confidence of all pixels with predicted category $c$.
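Under the no-overlap assumption, the class-wise thresholds follow directly from sorting. The sketch below is a minimal pure-Python illustration, not the paper's implementation: it takes flat lists of per-pixel confidences and predicted classes, a dictionary `true_dist` holding the class frequencies of the labeled set, and a labeling ratio `ratio`; all names and the toy numbers are assumptions.

```python
def class_thresholds(confidences, classes, true_dist, ratio):
    """Per-class threshold = the n_c-th largest confidence of that class."""
    total = len(confidences)
    thresholds = {}
    for c, q in true_dist.items():
        # Desirable number of pseudo-labeled pixels for class c.
        target = round(ratio * total * q)
        scores = sorted((p for p, k in zip(confidences, classes) if k == c),
                        reverse=True)
        if target == 0 or not scores:
            thresholds[c] = float("inf")  # nothing to label for this class
        else:
            # If fewer pixels are predicted as c than desired, keep them all.
            thresholds[c] = scores[min(target, len(scores)) - 1]
    return thresholds


# Ten pixels: seven predicted as class 0, three as class 1.
confidences = [0.99, 0.97, 0.95, 0.93, 0.91, 0.89, 0.87, 0.80, 0.70, 0.60]
classes = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
thresholds = class_thresholds(confidences, classes, {0: 0.6, 1: 0.4}, ratio=0.5)
```

With a labeling ratio of 0.5 and a true distribution of 60%/40%, the targets are 3 and 2 pixels, so the thresholds are the third-largest class-0 confidence and the second-largest class-1 confidence.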
Confidence Overlapping. However, in semantic segmentation, we observe that many pixels have similar and indistinguishable confidence values (“confidence overlapping” for short), largely because DCNNs are prone to producing over-confident predictions. This issue is serious for head categories in semantic segmentation. As shown in Fig. 3 (a), the confidence values for the road category are distributed in a very narrow range, and many road pixels share the same confidence value.
This renders our previous solution to Eq. (3) not viable: when many confidence values are tied, the number of pixels of a category that pass its class threshold is larger than the desirable number for that category, and under serious confidence overlapping the excess is not negligible, exceeding the desirable number by a significant percentage. The class distribution, therefore, still deviates from the true distribution. This issue is especially severe for head categories such as road in Cityscapes. The overlapping in confidence values also suggests that the optimal solution to Eq. (3) is not unique.
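A small numeric example makes the overshoot concrete. The confidence values below are made up for illustration; the point is only that ties at the threshold let more pixels through than the desired count.

```python
def count_selected(scores, target):
    """Threshold at the target-th largest score and count pixels that pass."""
    ordered = sorted(scores, reverse=True)
    threshold = ordered[target - 1]
    selected = sum(1 for s in scores if s >= threshold)
    return threshold, selected


# Eight saturated "road" pixels share confidence 1.0; we only want four.
road_scores = [1.0] * 8 + [0.99, 0.98]
threshold, selected = count_selected(road_scores, target=4)
```

The threshold lands on the tied value 1.0, so eight pixels are selected although only four were desired; thresholding alone cannot hit the target count.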
Calibration methods, among which temperature scaling [24, 21, 27] has been shown to be the most effective for DNNs, have been studied to make DCNN predictions calibrated and distinguishable, and recent works such as Focal Loss focus on long-tailed recognition. However, our ablation studies demonstrate that these techniques fail to offer distribution alignment and are sub-optimal for our problem, while introducing additional cost to validate their parameters.
In the following, we present our simple yet effective method to find one solution to the above problem by alignment and sampling with a few lines of code in Algorithm 1.
Distribution Alignment and Random Sampling (DARS).
First, we assume no confidence overlapping and perform distribution alignment with the optimal solution to Eq. (3), as shown in Algorithm 1 (Line 4 – 5). For categories that do not suffer from serious confidence overlapping, we can obtain the desirable number of pixels for a category by ignoring all pixels of that category whose confidence is lower than the class threshold. This stage resolves the distribution mismatch problem to some extent, especially for tail categories which do not suffer from a serious confidence overlapping issue, such as pole in Cityscapes, shown in Fig. 3 (b).
Due to confidence overlapping, especially for head categories, the number of pixels obtained after distribution alignment might be larger than the desirable number of pixels. Thus, we study how to effectively sample pixels for these categories. Here, we use a random sampling strategy to obtain the desirable number of pixels, as described in Algorithm 1 (Line 7 – 8). The reasons are twofold: 1) random sampling helps redistribute the concentrated high-confidence pixels to different regions, effectively enlarging their spatial coverage; and 2) random sampling functions as a form of data augmentation that enhances model performance.
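The two stages can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: pixels are given as flat lists, `true_dist` holds the class frequencies of the labeled set, the ignore index 255 is an assumed convention, and sampling uniformly among all pixels that pass the class threshold is one possible reading of the random sampling step.

```python
import random

IGNORE = 255  # assumed ignore index


def dars_pseudo_labels(confidences, classes, true_dist, ratio, seed=0):
    """Class-wise thresholding followed by random sampling down to the target."""
    rng = random.Random(seed)
    total = len(confidences)
    labels = [IGNORE] * total
    for c, q in true_dist.items():
        target = round(ratio * total * q)  # desirable pixel count for class c
        idx = [i for i, k in enumerate(classes) if k == c]
        if target == 0 or not idx:
            continue
        ordered = sorted((confidences[i] for i in idx), reverse=True)
        threshold = ordered[min(target, len(ordered)) - 1]
        # Stage 1, distribution alignment: keep pixels at or above the threshold.
        kept = [i for i in idx if confidences[i] >= threshold]
        # Stage 2, random sampling: ties may leave too many pixels, so subsample.
        if len(kept) > target:
            kept = rng.sample(kept, target)
        for i in kept:
            labels[i] = c
    return labels


# Six saturated head-class pixels and four distinct tail-class pixels.
confidences = [1.0] * 6 + [0.9, 0.8, 0.7, 0.6]
classes = [0] * 6 + [1] * 4
labels = dars_pseudo_labels(confidences, classes, {0: 0.5, 1: 0.5}, ratio=0.4)
```

In this toy case both classes end up with exactly two pseudo-labeled pixels: the tail class through thresholding alone, the head class only after subsampling its six tied pixels.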
3.3 Progressive Data Augmentation and Labeling
Self-training can benefit from iterative learning. However, during the iterative learning process, if we only update pseudo labels with the latest model while keeping the labeling ratio and data augmentation magnitude the same, the training loss starts from a very low value, as shown in Fig. 4 (blue curve), which implies that the model has already fit the pseudo-labeled data well and the data cannot further improve model performance.
Motivated by these observations, we propose to progressively enlarge the labeling ratio, similar to [69, 18], and to increase the strength of data augmentation. Progressively enlarging the labeling ratio helps the model harvest novel data samples without sacrificing the quality of pseudo-labeled data, benefiting from an improved model. Although introducing novel data by enlarging the ratio could benefit model training, enlarging the labeling ratio alone still provides little new information for the iterative process: the loss curve only rises slightly at the start, as shown in Fig. 4 (red curve), and the model still easily fits the pseudo-labeled data, which are typically high-confidence easy samples.
Hence, we propose an orthogonal strategy to introduce new data samples for iterative training by progressively increasing the magnitude of data augmentation. Stronger data augmentations can expose the model to unseen cases and turn easy samples into challenging ones without affecting the quality of pseudo labels, thus providing new information for model updates.
For the semantic segmentation task, random scaling is the most useful data augmentation strategy. Here, we also focus on strengthening the random scaling effect. At the initial stage, we use weak data augmentations to prevent the model from being influenced by hard examples created by augmentation, as the model still struggles with easy examples at that time. Then, we increase the range of scales at later self-training stages. Given the range of random scales $[s_{\min}, s_{\max}]$, the upper bound is increased by $\Delta_{\max}$ and the lower bound is decreased by $\Delta_{\min}$, giving the new random scale range $[s_{\min} - \Delta_{\min},\, s_{\max} + \Delta_{\max}]$. As shown in Fig. 4 (green curve), after increasing the magnitude, the loss for pseudo-labeled data starts from a relatively high point and gradually converges to zero, which suggests further model updates and improvements.
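The widening of the scale range can be sketched as a small helper. The linear growth per round, the step sizes and the base range used below are assumptions for illustration; the text only states that the upper bound is raised and the lower bound is lowered between stages.

```python
import random


def progressive_scale_range(base_range, round_index, lower_step, upper_step):
    """Widen the random-scale range by a fixed step per self-training round."""
    low, high = base_range
    new_low = low - round_index * lower_step
    new_high = high + round_index * upper_step
    if new_low <= 0:
        raise ValueError("lower scale bound must stay positive")
    return new_low, new_high


def sample_scale(scale_range, rng):
    """Draw one scale factor uniformly from the current range."""
    return rng.uniform(*scale_range)


# Hypothetical base range and steps, chosen only for illustration.
round0 = progressive_scale_range((1.0, 1.5), 0, 0.25, 0.25)
round2 = progressive_scale_range((1.0, 1.5), 2, 0.25, 0.25)
scale = sample_scale(round2, random.Random(0))
```

Round 0 keeps the weak base range, while later rounds draw scales from a wider interval, which exposes the student to harder examples without touching the pseudo labels themselves.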
4.1 Experimental Setup
To evaluate our method, we conduct the main experiments and ablation studies on the Cityscapes dataset, which contains 5K finely annotated images divided into 2975, 500 and 1525 images for training, validation and test, respectively. 19 urban-scene semantic classes are defined in Cityscapes for semantic segmentation. Following previous standards [26, 42, 19, 44, 18, 41] in semi-supervised semantic segmentation, we randomly sample 1/8 and 1/4 of the training images to construct the labeled set, and the remaining training images constitute the unlabeled set. To further explore the effectiveness of the proposed method, we also conduct experiments on the PASCAL VOC 2012 dataset (VOC12), which provides 20 semantic classes and 1 background class. The VOC12 dataset consists of 1464 training, 1449 validation, and 1456 test images. Following previous common practice [45, 26] for semi-supervised settings, we use the official 1464 training images as labeled data and the 9k augmented set as unlabeled data.
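Constructing a labeled/unlabeled split amounts to sampling image identifiers without replacement. The sketch below is an assumption-laden illustration: the seed, the use of integer identifiers and the choice to round the labeled count down are not specified in the text.

```python
import random


def split_labeled_unlabeled(image_ids, fraction, seed=0):
    """Randomly pick a fraction of the images as labeled; the rest are unlabeled."""
    rng = random.Random(seed)
    ids = list(image_ids)
    n_labeled = int(len(ids) * fraction)  # rounding down is an assumption
    labeled = rng.sample(ids, n_labeled)
    labeled_set = set(labeled)
    unlabeled = [i for i in ids if i not in labeled_set]
    return labeled, unlabeled


# 2975 Cityscapes training images, 1/8 of them treated as labeled.
labeled, unlabeled = split_labeled_unlabeled(range(2975), 1 / 8)
```

For the 1/8 split of 2975 images this yields 371 labeled and 2604 unlabeled images under the round-down assumption.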
Comparison Methods. We denote the model trained with only the labeled set as Baseline, and with both the labeled set and ground truth labels of the unlabeled set as Oracle.
Backbone. It is noteworthy that most previous works employ the Deeplabv2 [26, 42, 19, 18, 41] framework. Nevertheless, we argue that exploring semi-supervised approaches on top of a strong baseline better illustrates their effectiveness and is more practical for real-world scenarios. Hence, we use PSPNet with ResNet-50 as our backbone segmentation network for the main experiments. Our reported Oracle is similar to the results reported in the original paper. Moreover, we also provide the results of our method with the Deeplabv2 backbone for a fair comparison.
Note that we employ PSPNet instead of the top methods on leaderboards because state-of-the-art methods often involve heavy engineering and parameter tuning, along with large computation costs, and PSPNet is our best trade-off between reproducibility, performance and cost.
|Method||Labeled split||Baseline (mIoU)||Semi-supervised (mIoU)||Oracle (mIoU)||Gain|
|DARS (crop 361)||1/8||60.75±0.35||69.64±0.01||73.80±0.34||8.89|
|DARS (crop 713)||1/8||65.54±0.34||72.78±0.17||76.60±0.67||7.24|
We implement our method using the PyTorch framework and set the batch size to 16. In self-training, a batch of 16 images is composed of 8 labeled images and 8 unlabeled images, and an epoch is defined as one pass over all unlabeled images, following the previous standard. The number of epochs in each round is 200 for Cityscapes and 50 for VOC12. During training, we employ SGD with an initial learning rate of 0.01, momentum of 0.9 and weight decay of 0.0001. We also use a polynomial learning rate annealing procedure to schedule the learning rate. For data augmentation, we use random scaling, random horizontal flipping, random rotation, and random Gaussian blur. Due to the high computation cost of a large crop size, for Cityscapes we only use the large crop size (713×713) for comparison with state-of-the-art methods and a small crop size (361×361) in our ablation studies for efficient training and evaluation. For VOC12, we use crops of 321×321, following the previous standard. Note that we empirically find 2 self-training rounds to be enough for our implementation, and all results with iterative training (IT) undergo 2 self-training rounds. All of our results are obtained by running each experiment three times under the same setting.
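A polynomial ("poly") annealing schedule is commonly written as the base rate times $(1 - \text{iter}/\text{max\_iter})^{\text{power}}$. The sketch below uses that common form; the exponent 0.9 is an assumption (a frequent default in segmentation code), since the text does not state the power used.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """Polynomial decay of the learning rate from base_lr down to zero."""
    if not 0 <= iteration <= max_iterations:
        raise ValueError("iteration must lie in [0, max_iterations]")
    return base_lr * (1.0 - iteration / max_iterations) ** power


# Initial rate 0.01 as in the text; 1000 iterations is a made-up horizon.
lr_start = poly_lr(0.01, 0, 1000)
lr_mid = poly_lr(0.01, 500, 1000)
lr_end = poly_lr(0.01, 1000, 1000)
```

The rate starts at the base value, decays smoothly, and reaches zero at the final iteration.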
4.2 Comparison with Previous Work
All results of DARS reported in this section include iterative training and are evaluated at a single scale only.
PSPNet50. We conduct experiments to compare the proposed DARS on the PSPNet50 backbone with several state-of-the-art semi-supervised semantic segmentation approaches. We report the result of each approach with 1/8 and 1/4 labeled proportions on Cityscapes. As shown in Table 1, the performance gaps between baseline and oracle are similar for previous works and ours (for the 1/8 split, the gaps of all methods lie around 12% mIoU). Our method, though based on a stronger model, obtains larger performance gains across all labeled splits. Remarkably, with only 1/8 labeled data, our method outperforms our baseline by 8.89% mIoU and is only 4.16% below the fully supervised oracle model. In addition, with a larger crop size (713), our performance is further boosted to 74.32% mIoU with 1/4 labeled data on Cityscapes, only 2.28% lower than the fully supervised oracle.
Deeplabv2. Here, we also follow the setting of previous works [26, 42, 19, 18, 41] and report the experimental results with the Deeplabv2 backbone for a fair comparison. We study the 1/8 split of Cityscapes, which is the most challenging setting. As shown in Table 2, equipped with the same backbone, DARS still outperforms previous state-of-the-art methods by around 4% mIoU, is 8% higher than the baseline, and is only 2.7% below the oracle (66.9%).
We further verify the effectiveness of our method by comparing with state-of-the-art methods on VOC12 in Table 3. Compared with the most recent CCT method, which also uses the PSPNet50 backbone, our DARS significantly outperforms it (by 4.49% mIoU) as well as other previous methods.
|Method||road||sidewalk||building||wall||fence||pole||traffic light||traffic sign||vegetation||terrain||sky||person||rider||car||truck||bus||train||motorcycle||bicycle||Tail mIoU||mIoU||Gain|
|ST + TS||97.3||77.7||89.5||43.2||39.1||48.3||57.8||68.7||90.8||56.3||93.4||74.9||50.7||91.4||48.5||67.9||40.2||46.3||69.8||54.3||65.89±0.30||5.14|
|CBST + TS||96.8||77.1||89.4||43.7||44.2||50.1||58.6||68.9||90.4||55.5||92.7||75.0||52.6||91.3||48.8||66.9||42.7||51.1||69.7||55.5||66.61±0.20||5.86|
|DA + TS||97.2||77.9||89.7||44.8||45.6||50.7||59.2||69.1||90.6||56.1||93.0||75.1||52.9||91.6||54.8||69.4||43.1||48.3||69.6||56.7||67.31±0.12||6.56|
|ST + IT||97.5||78.8||89.6||43.4||38.5||47.2||55.1||69.4||90.9||56.1||93.3||75.1||51.4||91.9||49.5||67.5||47.3||52.3||70.4||55.2||66.59±0.14||5.84|
|CBST + IT||97.8||79.2||90.4||44.6||48.1||51.7||59.8||70.3||91.1||58.1||93.5||74.9||52.7||92.6||53.1||70.5||24.2||53.6||70.4||56.1||67.20±0.38||6.45|
|DARS + IT||97.2||78.5||90.1||49.3||47.7||50.9||59.9||70.1||90.8||59.6||92.9||75.2||54.4||92.5||67.7||73.0||48.7||54.7||69.9||60.6||69.64±0.01||8.89|
4.3 Ablation Studies
For all of our ablation studies, we conduct experiments in the 1/8 split setting on the Cityscapes dataset, with crop size 361×361 and the PSPNet50 backbone.
4.3.1 Ablation Study for Pseudo-labeling Process
Comparison of different pseudo-labeling methods. Since our goal is to re-distribute the biased pseudo labels towards the distribution of labeled data in iterative self-training, here, we compare our DARS method with the following different pseudo-labeling methods:
DA: Our proposed distribution alignment method without the subsequent random sampling that addresses confidence overlapping;
TS: Temperature scaling, combined with the DA, CBST or ST method to facilitate distribution alignment by calibrating model predictions, as mentioned in Sec. 3.2.
In the upper part of Table 4, we compare our DARS with ST, ST+TS, CBST, CBST+TS, and DA+TS in a single self-training round. DARS achieves 68.01% mIoU, outperforming the single-thresholding ST and the class-balanced CBST by 2.31% and 1.72%, respectively. Equipped with temperature scaling (TS), ST+TS and CBST+TS obtain improvements of 0.19% and 0.32%, respectively, while our DARS is still 0.7% better than the DA+TS strategy. This result indicates that TS is effective but sub-optimal in alleviating confidence overlapping and aligning the pseudo label distribution.
In particular, our DARS ranks in the top two on 10 out of all 13 tail classes. Besides, we calculate the mIoU over tail classes only (denoted as Tail mIoU), where our DARS further outperforms CBST by 2.9%. These experimental results demonstrate the superiority of our method on tail classes: aligning the pseudo label distribution prevents the model from collapsing into head classes. For example, our DARS outperforms the other methods on truck by over 8.4% mIoU.
In the lower part of Table 4, we further compare the proposed DARS with other pseudo-labeling methods after combining iterative training. With iterative training, DARS achieves more performance gain compared with ST and CBST, and outperforms CBST+IT by 2.44% mIoU, leading to a significant performance boost of 8.89% mIoU, which indicates that DARS helps resolve the bias in pseudo-labeling that may hurt iterative self-training.
Ablation study for distribution matching. While we have shown that combining temperature scaling with previous pseudo-labeling methods leads to inferior performance compared with DARS, here we directly compare the distribution mismatch (measured by the KL divergence) between true labels from the labeled set and pseudo labels generated by different off-the-shelf techniques, including calibration methods like Temperature Scaling, Matrix Platt Scaling and Histogram Binning, and approaches aimed at long-tailed recognition like Focal Loss. All calibration parameters are optimized using the cross-entropy loss over the validation set (T=1.27 for TS in our experiments), and the hyper-parameter for focal loss follows the advice of the original paper ($\gamma$=2).
As shown in Table 11, some techniques such as TS, Matrix Platt Scaling, and Focal Loss may help to alleviate the distribution mismatch in naive self-training (ST), but fail to eliminate it, whereas our DARS is much simpler and enables perfect distribution alignment.
|Method||Distribution mismatch (KL divergence)|
|ST + TS||0.0357|
|ST + Matrix Platt Scaling||0.0340|
|ST + Histogram Binning||0.4188|
|ST + Focal Loss||0.0396|
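A mismatch score of this kind can be computed from two label maps by counting class frequencies and taking the KL divergence. The sketch below is only an illustration: the direction KL(true || pseudo), the ignore index 255, the clipping constant `eps` and the toy labels are assumptions, not details taken from the paper.

```python
import math
from collections import Counter

IGNORE = 255  # assumed ignore index


def class_distribution(labels, num_classes):
    """Fraction of non-ignored pixels that belong to each class."""
    counts = Counter(l for l in labels if l != IGNORE)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid labels")
    return [counts.get(c, 0) / total for c in range(num_classes)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); classes with p = 0 contribute nothing, q is clipped at eps."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)


true_dist = class_distribution([0, 0, 1, 1], num_classes=2)
aligned = class_distribution([0, 1, IGNORE, IGNORE], num_classes=2)
biased = class_distribution([0, 0, 0, 1, IGNORE], num_classes=2)
kl_aligned = kl_divergence(true_dist, aligned)
kl_biased = kl_divergence(true_dist, biased)
```

Pseudo labels whose class frequencies match the labeled set give a divergence of zero, while pseudo labels skewed toward one class give a strictly positive value.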
4.3.2 Ablation Study for the Progressive Strategy
We explore the effectiveness of enlarging the labeling ratio and the data augmentation magnitude in iterative self-training, as discussed in Section 3.3.
As shown in Table 6, directly using the model obtained at round 1 to generate new pseudo labels and train round 2 yields only a 0.26% mIoU improvement. However, with a larger labeling ratio, our round 2 model delivers a 0.92% gain over the round 1 result. Furthermore, when round 2 uses both a larger labeling ratio and a stronger data augmentation magnitude, our method obtains a 1.63% gain. These results provide strong evidence for the necessity of introducing novel yet hard examples in iterative self-training and echo our analysis in Section 3.3. To check whether stronger data augmentation alone is sufficient in every round, we also show that directly applying stronger data augmentation at round 1 can even cause performance degradation.
4.4 Additional Results and Analysis
Held-out test results. For all the reported results above, we only provide experimental results on the validation sets, following the standard in semi-supervised semantic segmentation for a fair comparison. However, to verify the generalization ability of our method and show that we do not heavily tune hyper-parameters, we also provide the results on held-out test sets. As shown in Table 7, we obtain similar gains on held-out test sets compared with validation sets.
|Method||Cityscapes (crop 361)||Cityscapes (crop 713)||VOC12 (crop 321)|
|DARS||67.83 (+8.10)||70.20 (+5.50)||71.84 (+4.97)|
Improving fully-supervised models with extra unlabeled data. In the above experiments, we have shown the effectiveness of DARS in the low-data regime. Here, we further explore how much performance boost DARS can bring to fully-supervised models (trained with all 3K finely annotated training examples in Cityscapes, with crop size 713×713).
Concretely, we utilize the 20K coarsely annotated images provided in Cityscapes. However, we ignore the original coarse labels and instead generate pseudo labels for them with DARS. To simulate real-world scenarios, we study the impact of the number of pseudo-labeled images. We conduct experiments with different numbers of pseudo-labeled images while keeping the number of training iterations the same.
As shown in Table 8, more unlabeled data increases the performance until saturation. However, the performance gain in the high-data regime is relatively small compared with the low-data regime and seems to hit a bottleneck.
Analysis for potential bottlenecks in high-data regimes. Here, we try to analyze the bottlenecks in the high-data regime for semi-supervised semantic segmentation. Without loss of generality, we take Cityscapes as an example. While previous works usually group classes into head and tail classes, here we present a four-group categorization of classes in semantic segmentation from the perspective of object size and appearance frequency: (1) classes that appear frequently (frequency > 50%) with large sizes (size > 10% of the crop area), usually the head classes, e.g., road and sky; (2) classes with high appearance frequency but small sizes, e.g., pole and traffic light/sign; (3) classes that rarely appear but have large sizes, e.g., truck and bus; (4) classes that both rarely occur and have small sizes, e.g., motorcycle.
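The four groups can be expressed as a simple rule on two per-class statistics. In the sketch below, the cut-offs follow the 50% frequency and 10% size thresholds mentioned above, treated as strict inequalities by assumption; the per-class statistics are invented purely to exercise the function and are not measurements from Cityscapes.

```python
def size_frequency_group(frequency, relative_size, freq_cut=0.5, size_cut=0.1):
    """Return the group index (1 to 4) for a class.

    frequency: fraction of crops in which the class appears.
    relative_size: typical object area as a fraction of the crop area.
    """
    frequent = frequency > freq_cut
    large = relative_size > size_cut
    if frequent and large:
        return 1  # frequent and large, typically head classes
    if frequent:
        return 2  # frequent but small
    if large:
        return 3  # rare but large
    return 4      # rare and small


# Made-up statistics (frequency, relative size), for illustration only.
toy_stats = {
    "road": (0.98, 0.35),
    "pole": (0.95, 0.01),
    "truck": (0.10, 0.15),
    "motorcycle": (0.10, 0.01),
}
groups = {name: size_frequency_group(f, s) for name, (f, s) in toy_stats.items()}
```

Under these toy numbers the four example classes fall into groups 1 to 4 respectively, mirroring the examples given in the text.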
To better understand why only limited performance boosts could be achieved in the high-data regime, we plot the performance gain of the four groups in both low-data (1/8, 1/4 split) and high-data regimes in Fig 5 (a). Since group 1 classes are all head classes, it is natural that we do not harvest much boost on them. However, for the rest three groups that are all tail classes, only groups with rare occurrence attribute (3, 4) achieve noticeable gain, while group 2 achieves similar limited gains as 1. Therefore, we conclude that self-training mainly contributes to classes with rare occurrence. As a result, in the high-data regime, the occurrence of all classes increases and classes in group 3 could gradually evolve to group 1 with the growth of data, which explains the decreased performance gain in group 3 from low to high-data regime. Hence, one potential bottleneck is the gradually saturated performance for originally rare classes as training data increases.
In order to boost performance, we would expect classes in group 2, 4 (both with small sizes) also turn into group 1, which inspires us that another bottleneck comes from the network’s systematic error towards small size objects. As shown in Fig 5 (b), we test the baseline model’s accuracy of different object size ranges. The accuracy increases as the object size grows. The network’s systematic error for small objects leads to low pseudo label quality for them, and further limits self-training performance on small objects. Hence, we suggest that the second bottleneck for high-data regimes is the network’s systematic error towards different object sizes. A promising direction is to develop size agnostic architecture for semantic segmentation, which we leave as future work.
We have presented a simple yet effective DARS method to calibrate the bias in pseudo-labeling, together with a progressive data augmentation and labeling strategy for iterative self-training. Experiments on Cityscapes and VOC12 demonstrate that our simple method can outperform existing, more sophisticated approaches. We hope our formulation, encouraging results, and analysis of bottlenecks will inspire more research efforts in this direction.
-  (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence 39 (12), pp. 2481–2495. Cited by: §2.
-  (2019) Remixmatch: semi-supervised learning with distribution alignment and augmentation anchoring. arXiv preprint arXiv:1911.09785. Cited by: §2.
-  (2019) Mixmatch: a holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 5049–5059. Cited by: §2, §2.
-  (2010) . In Proceedings of COMPSTAT’2010, pp. 177–186. Cited by: §4.1.
-  (2020) Semi-supervised learning in video sequences for urban scene segmentation. arXiv preprint arXiv:2005.10266. Cited by: §1, §2.
-  (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §2, §4.1, Table 2.
-  (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: §2.
-  (2016) Attention to scale: scale-aware semantic image segmentation. In , pp. 3640–3649. Cited by: §1, §2.
-  (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 801–818. Cited by: §2.
The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: Figure 1, §1, Figure 3, §4.1.
-  (2017) Scannet: richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5828–5839. Cited by: Appendix B, Re-distributing Biased Pseudo Labels for Semi-supervised Semantic Segmentation: A Baseline Investigation.
-  (2015) Convolutional feature masking for joint object and stuff segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3992–4000. Cited by: §2.
Imbalanced deep learning by minority class incremental rectification. IEEE transactions on pattern analysis and machine intelligence 41 (6), pp. 1367–1381. Cited by: §1.
Tri-net for semi-supervised deep learning. In Proceedings of twenty-seventh international joint conference on artificial intelligence, pp. 2014–2020. Cited by: §2.
-  (2015) The pascal visual object classes challenge: a retrospective. International journal of computer vision 111 (1), pp. 98–136. Cited by: §4.1.
-  (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §1.
-  (2012) Learning hierarchical features for scene labeling. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1915–1929. Cited by: §2.
-  (2020) Semi-supervised semantic segmentation via dynamic self-training and class-balanced curriculum. arXiv preprint arXiv:2004.08514. Cited by: Appendix B, 2nd item, §1, §1, §2, §3.2, §3.3, 2nd item, §4.1, §4.1, §4.1, §4.2.1, Table 1, Table 2.
-  (2019) Semi-supervised semantic segmentation needs strong, high-dimensional perturbations. arXiv preprint arXiv:1906.01916. Cited by: §2, §4.1, §4.1, §4.2.1, Table 1, Table 2.
-  (2005) Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pp. 529–536. Cited by: §2.
-  (2017) On calibration of modern neural networks. arXiv preprint arXiv:1706.04599. Cited by: §1, §3.2, §3.2, §4.3.1.
-  (2011) Semantic contours from inverse detectors. In 2011 International Conference on Computer Vision, pp. 991–998. Cited by: §4.1.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1.
-  (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §3.2.
-  (2018) Cycada: cycle-consistent adversarial domain adaptation. In International conference on machine learning, pp. 1989–1998. Cited by: Table 14, Appendix C.
-  (2018) Adversarial learning for semi-supervised semantic segmentation. arXiv preprint arXiv:1802.07934. Cited by: Appendix C, §2, §4.1, §4.1, §4.2.1, Table 1, Table 2, Table 3.
-  (1957) Information theory and statistical mechanics. Physical review 106 (4), pp. 620. Cited by: §3.2.
-  (2020) Guided collaborative training for pixel-wise semi-supervised learning. arXiv preprint arXiv:2008.05258. Cited by: §2.
-  (2019) Dual student: breaking the limits of the teacher in semi-supervised learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6728–6736. Cited by: §2.
-  (2020) Distribution aligning refinery of pseudo-label for imbalanced semi-supervised learning. arXiv preprint arXiv:2007.08844. Cited by: §2.
-  (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242. Cited by: §1, §2.
-  (2018) Pyramid attention network for semantic segmentation. arXiv preprint arXiv:1805.10180. Cited by: §2.
-  (2019) Constructing self-motivated pyramid curriculums for cross-domain semantic segmentation: a non-adversarial approach. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6758–6767. Cited by: Appendix B, Table 14, §1, §3.2.
-  (2017) Refinenet: multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1925–1934. Cited by: §2.
-  (2016) Efficient piecewise training of deep structured models for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3194–3203. Cited by: §2.
-  (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §3.2, §4.3.1.
Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §1.
-  (2015) Parsenet: looking wider to see better. arXiv preprint arXiv:1506.04579. Cited by: §1, §2.
-  (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §1, §2.
-  (2019) Taking a closer look at domain shift: category-level adversaries for semantics consistent domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2507–2516. Cited by: Table 14.
-  (2020) Semi-supervised segmentation based on error-correcting supervision. In European Conference on Computer Vision, pp. 141–157. Cited by: §2, §4.1, §4.1, §4.2.1, Table 1, Table 2.
-  (2019) Semi-supervised semantic segmentation with high-and low-level consistency. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2, §4.1, §4.1, §4.2.1, Table 1, Table 2.
-  (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence 41 (8), pp. 1979–1993. Cited by: §2.
-  (2020) ClassMix: segmentation-based data augmentation for semi-supervised learning. arXiv preprint arXiv:2007.07936. Cited by: §4.1.
-  (2020) Semi-supervised semantic segmentation with cross-consistency training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12674–12684. Cited by: §2, §4.1, §4.1, Table 3.
-  (2019) Pytorch: an imperative style, high-performance deep learning library. In Advances in neural information processing systems, pp. 8026–8037. Cited by: §4.1.
-  (2014) Recurrent convolutional neural networks for scene labeling. In International conference on machine learning, pp. 82–90. Cited by: §2.
-  (2016) Playing for data: ground truth from computer games. In European conference on computer vision, pp. 102–118. Cited by: Appendix C, Re-distributing Biased Pseudo Labels for Semi-supervised Semantic Segmentation: A Baseline Investigation.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §2.
-  (2013) Overfeat: integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229. Cited by: §1, §2.
-  (2020) Fixmatch: simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685. Cited by: Appendix B, §1, §1, §2, §2, §3.2.
Semi supervised semantic segmentation using generative adversarial network. In Proceedings of the IEEE international conference on computer vision, pp. 5688–5696. Cited by: Table 3.
-  (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pp. 1195–1204. Cited by: §1, §2.
-  (2018) Learning to adapt structured output space for semantic segmentation. In CVPR, pp. 7472–7481. Cited by: Table 14.
-  (2019) Advent: adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2517–2526. Cited by: Table 14, Appendix C.
-  (2018) Understanding convolution for semantic segmentation. In 2018 IEEE winter conference on applications of computer vision (WACV), pp. 1451–1460. Cited by: §2.
-  (2019) Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848. Cited by: §1, §2.
Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687–10698. Cited by: 1st item, §1, §2, §2, §2, 1st item.
-  (2020) An adversarial perturbation oriented domain adaptation approach for semantic segmentation.. In AAAI, pp. 12613–12620. Cited by: Appendix C.
-  (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. Cited by: §2.
-  (2018) Mixup: beyond empirical risk minimization. In International Conference on Learning Representations, Cited by: §2.
-  (2019) Category anchor-guided unsupervised domain adaptation for semantic segmentation. In Advances in Neural Information Processing Systems, pp. 435–445. Cited by: Table 14.
-  (2020) Transferring and regularizing prediction for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9621–9630. Cited by: Appendix B, Table 14, §1, §3.2.
-  (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §1, §2, Figure 3, §4.1.
Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision 127 (3), pp. 302–321. Cited by: §1.
-  (2020) Improving semantic segmentation via self-training. arXiv preprint arXiv:2004.14960. Cited by: Appendix B, §1, §3.2.
-  (2020) Rethinking pre-training and self-training. arXiv preprint arXiv:2006.06882. Cited by: Appendix B, 1st item, §1, §1, §2, §3.2, 1st item.
-  (2019) Confidence regularized self-training. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5982–5991. Cited by: Table 14.
-  (2018) Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European conference on computer vision (ECCV), pp. 289–305. Cited by: Appendix B, Table 14, 2nd item, §1, §3.2, §3.3, 2nd item.
Appendix A More Results on Cityscapes
A.1 Parameter Analysis
Labeling Ratio. Benefiting from an improved teacher model, progressively enlarging the labeling ratio helps introduce novel data while maintaining the quality of pseudo labels, and hence safely bootstraps the performance. Here, we present more experimental results and analysis on the Cityscapes 1/8 split at round 2 to show the improvements from an enlarged labeling ratio.
As shown in Table 9, if we directly apply iterative training without enlarging the labeling ratio, the performance gain is quite limited (68.01% → 68.27%). However, as we gradually enlarge the labeling ratio, a steady performance growth is observed, with the largest improvement (68.01% → 68.93%) achieved at a labeling ratio of 50%.
Moreover, Table 9 shows the robustness of our progressive pseudo-labeling strategy: a noticeable performance boost is achieved over a relatively wide range of labeling ratios (40% to 60%).
Data Augmentation Magnitude. An orthogonal strategy is to progressively increase the magnitude of data augmentation. In our experiments, we focus on strengthening the random scaling factor on the Cityscapes 1/8 split at round 2. The range of random scaling intensity in round 1 is [0.25, 1.0], which is regarded as the initial range. Afterward, we enlarge the initial range in the following self-training round by decreasing the lower bound (0.25) and increasing the upper bound (1.0), each by its own offset.
Table 10 shows the results for different enlargements of the range, including the best-performing setting. Notably, by enlarging the range of random scaling appropriately, a tangible performance gain is obtained (68.97% → 69.64%).
Though enlarging the data augmentation magnitude to different extents leads to different performance, we observe a performance boost over a wide range of increased magnitudes, as shown in Table 10, which demonstrates the robustness of our progressive data augmentation strategy.
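The range enlargement described above can be illustrated with a short sketch. Only the initial range [0.25, 1.0] is taken from the text; the helper names and the offsets 0.125 and 0.5 are hypothetical values chosen for the example and are not the settings evaluated in Table 10.

```python
import random
from typing import Tuple

ScaleRange = Tuple[float, float]


def widen_scale_range(base: ScaleRange, lower_offset: float,
                      upper_offset: float) -> ScaleRange:
    """Lower the minimum scale and raise the maximum scale of a range."""
    low, high = base
    new_low = low - lower_offset
    if new_low <= 0:
        raise ValueError("lower bound of the scale range must stay positive")
    return (new_low, high + upper_offset)


def sample_scale(scale_range: ScaleRange, rng: random.Random) -> float:
    """Draw one random scaling factor uniformly from the range."""
    low, high = scale_range
    return rng.uniform(low, high)


initial_range: ScaleRange = (0.25, 1.0)  # round-1 range from the text
# Hypothetical offsets for the next self-training round.
next_range = widen_scale_range(initial_range, lower_offset=0.125,
                               upper_offset=0.5)

rng = random.Random(0)
print(next_range, sample_scale(next_range, rng))
```

Sampling uniformly within the range is also an assumption; the text only specifies the range itself.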
A.2 Data augmentation in the progressive strategy
Here, we explain why only random scaling is considered in the progressive strategy. Our earlier experiments showed that random scaling is the most useful data augmentation method for semantic segmentation. Specifically, we trained a PSPNet50 on the Cityscapes dataset using all 2975 fine-annotated training images with a crop size of 361×361 (half-resolution training). For data augmentation, we consider photometric distortion (brightness, contrast, saturation, and hue), random rotation, and random scaling, following common setups in previous work. We employed either one of the three data augmentation methods or none of them, and report the performance on the validation set. As shown in Table RB1, random scaling brings significant performance boosts, whereas photometric distortion and random rotation bring only limited gains. We will add this analysis to the supplementary material upon publication.
Moreover, applying too strong a magnitude for data augmentation methods like brightness and rotation might distort the data distribution. Hence, we only consider random scaling in the progressive strategy. However, we believe other data augmentation methods such as mixup could also be incorporated into our progressive strategy to further boost performance, and we hope our idea of progressively increasing data augmentation magnitude for iterative training will benefit future research.
A.3 Visualization of Pseudo Labels
To provide more information about our approach, we visualize the pseudo labels generated by our method and compare them with those of ST and CBST on the Cityscapes dataset.
As shown in Fig. 6, pseudo labels from ST and CBST are often overwhelmed by the majority classes such as road and vegetation, and tail class objects are often ignored in their pseudo labels, such as the pole and traffic light in the red box. As a result, the label distribution of their pseudo labels is extremely biased towards the dominant classes.
In contrast, with our distribution alignment and random sampling strategy to deal with the confidence overlapping phenomenon, the percentage of dominant classes is reduced and the pseudo labels are re-distributed to cover a large spatial area. Besides, our method successfully pseudo-labels tail classes such as the pole and traffic light in the red box at round 1 (see Ours in Fig. 6).
Further, when we enlarge the labeling ratio to 50% at round 2, the quality of our pseudo-labeled data is further enhanced. More tail class objects are pseudo-labeled and incorporated into our pseudo labels, as shown by the red boxes of Ours (w/ Iterative) in Fig. 6.
A.4 Qualitative Results
In this section, we provide qualitative results of the semi-supervised semantic segmentation methods on the Cityscapes dataset. Concretely, we compare our results with the ST and CBST methods at round 1.
As shown in Fig. 7, previous methods mainly have two failure modes in segmenting tail classes: (1) they tend to leave out some tail classes like fence, traffic light, and wall (e.g., in the red box areas, the fence is missing in (a), one traffic light is lost in (c), and the wall is completely unrecognized in (e)); (2) they confuse similar classes and mistake tail class objects for other classes. For instance, in (b), part of the bus is mistaken for vegetation, truck, or car, and in (d), part of the train is misclassified as bus.
Thanks to our distribution alignment and sampling strategy to calibrate the bias, our method alleviates the above two issues and thus significantly outperforms ST and CBST on tail classes. As shown in Fig. 7, our method successfully segments the tail class objects as in (c) and recognizes most tail class areas (e.g., the fence in (a) and the wall area in (e)). Moreover, our method significantly improves the model’s ability to handle the confusion between similar classes and gives consistent and correct predictions, as in (b) and (d).
Appendix B Additional Experiments on ScanNet
To further demonstrate the transferability and broad applicability of our method, we evaluate it on the indoor scene dataset ScanNet. Note that we do not tune the hyper-parameters on the ScanNet dataset, in order to show the generality of our method.
ScanNet is an RGB-D dataset collected from 1,513 indoor scenes. For the 2D semantic segmentation task, ScanNet contains 19,466 RGB images for training and 5,436 images for validation with a resolution of 1296×968. In our semi-supervised setting, 1/8 (i.e., the 1/8-split) and 1/4 (i.e., the 1/4-split) of the images are randomly chosen from the training set to serve as the labeled set. Pixel-level annotations for the following 21 object classes are provided: wall, floor, cabinet, bed, chair, sofa, table, door, window, bookshelf, picture, counter, desk, curtain, refrigerator, shower curtain, toilet, sink, bathtub, other furniture, and void (the ignore category).
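The random labeled/unlabeled split can be sketched as follows. This is a minimal illustration assuming a seeded shuffle and rounding down of the labeled-set size; the function name, the seed, and the placeholder image identifiers are hypothetical and do not reflect the actual split files.

```python
import random
from typing import List, Sequence, Tuple


def split_labeled_unlabeled(image_ids: Sequence[str], labeled_fraction: float,
                            seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly pick a fraction of images as the labeled set."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(image_ids)
    rng.shuffle(shuffled)
    # Rounding down is an assumption; the text does not specify it.
    n_labeled = int(len(shuffled) * labeled_fraction)
    return shuffled[:n_labeled], shuffled[n_labeled:]


# Placeholder identifiers standing in for the 19,466 training images.
train_ids = [f"image_{i:05d}" for i in range(19466)]
labeled_ids, unlabeled_ids = split_labeled_unlabeled(train_ids, 1 / 8)
print(len(labeled_ids), len(unlabeled_ids))
```

Under these assumptions, the 1/8-split yields 2,433 labeled and 17,033 unlabeled images.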
We follow the same experimental setup as the Cityscapes dataset, except that the number of epochs is set to 20 for each training round and a crop size of 481×481 is adopted. Also, since the variance in our experiments is rather small as shown in this supplementary file, we only run one experiment for each setting on ScanNet to save the computational cost.
|Method||KL divergence||mIoU (%)||Gain (%)|
|DARS (SD)||0.1558||47.01 ± 0.42||+3.58|
|DARS (TD)||0.0006||51.01 ± 0.05||+7.58|
Main Results. We compare the proposed DARS method with the single thresholding method [67, 66, 33, 63, 51] (ST) and the class-balanced thresholding method [69, 18] (CBST) on the 1/8-split setting at round 1 without iterative training. As shown in Table 12, the proposed simple DARS method achieves 56.2% mIoU on the validation set, surpassing the ST and CBST methods, which reiterates the superiority of our proposed method. Note that our method introduces little computational cost in comparison with the compared approaches.
Further, we report the final results of the proposed method with iterative training at split 1/8 and 1/4 in Table 13. Notably, our method achieves 58.35% in terms of mIoU with only 1/4 labeled data, which is very close to the fully-supervised results of 61.69%.
We notice that the performance gain achieved by self-training is relatively small on the ScanNet dataset in comparison with the Cityscapes dataset. We mainly attribute this to the difference between indoor and urban scenes. While urban scenes usually have similar structures (e.g., the road is always at the bottom and the sky at the top), indoor scenes tend to have large variance and complex spatial relationships, which impose obstacles for pseudo-labeling that relies on models trained with only a small set of labeled data. Exploring the 3D structure for semi-supervised learning in indoor scene parsing has the potential to address these difficulties, which we leave as future work. We believe our method could also be incorporated into other methods to further boost the performance of semi-supervised indoor scene parsing. Also, we barely tune hyper-parameters such as the labeling ratio and data augmentation magnitude, to save time and computational cost, since the main purpose of our experiments on ScanNet is to show the broad applicability of our method and its superiority over previous self-training methods.
Appendix C Unsupervised Domain Adaptation Setting
In this section, we further conduct experiments on the more challenging unsupervised domain adaptation setting, in order to confirm our major insight about the importance of semantic-level distribution alignment in pseudo-labeling.
Since we do not have a labeled set for the target domain from which to obtain the true label distribution, we cannot, in fairness to other methods, apply DARS to generate unbiased pseudo labels. Instead, we use this setting to study the relationship between the extent of distribution mismatch in pseudo labels (measured by the KL divergence from the target label distribution) and the performance boost.
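The mismatch measure can be sketched as a KL divergence between two class distributions obtained from per-class pixel counts. The direction KL(target || pseudo), the smoothing constant `eps`, and the three-class pixel counts are assumptions made for illustration; the text does not specify them.

```python
import math
from typing import List, Sequence


def class_distribution(pixel_counts: Sequence[float]) -> List[float]:
    """Normalize per-class pixel counts into a probability distribution."""
    total = sum(pixel_counts)
    if total <= 0:
        raise ValueError("pixel counts must sum to a positive value")
    return [count / total for count in pixel_counts]


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) in nats; classes with p_i = 0 contribute nothing."""
    return sum(
        p_i * math.log(p_i / max(q_i, eps))
        for p_i, q_i in zip(p, q)
        if p_i > 0
    )


# Hypothetical pixel counts for three classes (head, middle, tail).
target = class_distribution([60, 30, 10])  # assumed true label distribution
biased = class_distribution([80, 18, 2])   # pseudo labels skewed to the head

print(kl_divergence(target, biased))  # positive: distributions differ
print(kl_divergence(target, target))  # zero: perfect alignment
```

A value close to zero indicates that the pseudo-label distribution matches the target label distribution, which is the situation the DARS (TD) variant aims for.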
We compare the following pseudo-labeling methods:
DARS (SD): DARS using the source label distribution as the target label distribution;
DARS (TD): DARS using the target label distribution counted on the validation set of the target domain.
We follow [26, 55, 59] and consider the popular synthetic-to-real adaptation task GTA5 → Cityscapes. The GTA5 dataset provides 24,966 images with pixel-wise labels. We use the 19 classes that GTA5 has in common with Cityscapes for adaptation. Moreover, we take advantage of image translation and use the GTA5 images translated by CyCADA for training.
We also follow the same experimental setup as the Cityscapes dataset, except that the number of epochs is set to 10 for the pre-training round on GTA5 and a crop size of 713×713 is adopted.
Main Results. As shown in Table 15, the smaller the KL divergence between the distribution of pseudo labels and that of target labels, the better the performance, which strongly supports our motivation to re-distribute biased pseudo labels. Moreover, it is noteworthy that when the pseudo labels achieve near-perfect alignment with the true distribution (DARS (TD)), they bring a much larger performance gain than other pseudo-labeling methods (4% mIoU higher than DARS (SD) and 2.27% higher than CBST in a single round).
Further, we report the results of DARS (TD) with iterative training in comparison with previous works in Table 14. We acknowledge that this comparison is not fair, since DARS (TD) utilizes the target label distribution from the validation set. We show its encouraging and superior performance (55.0% mIoU) only to highlight the importance of distribution alignment in pseudo-labeling for unsupervised domain adaptation, hoping to inspire more work in this direction.