Semi-supervised semantic segmentation learns from small amounts of labelled images and large amounts of unlabelled images, and has witnessed impressive progress with the recent advance of deep neural networks. However, it often suffers from a severe class-bias problem when exploring the unlabelled images, largely due to the clear pixel-wise class imbalance in the labelled images. This paper presents an unbiased subclass regularization network (USRN) that alleviates the class imbalance issue by learning class-unbiased segmentation from balanced subclass distributions. We build the balanced subclass distributions by clustering pixels of each original class into multiple subclasses of similar sizes, which provide class-balanced pseudo supervision to regularize the class-biased segmentation. In addition, we design an entropy-based gate mechanism to coordinate learning between the original classes and the clustered subclasses, which facilitates subclass regularization effectively by suppressing unconfident subclass predictions. Extensive experiments over multiple public benchmarks show that USRN achieves superior performance as compared with the state-of-the-art.
Semantic segmentation aims to assign a human-defined class label to each pixel of an image, which is a fundamental task in computer vision research. With the recent advance of deep neural networks [19, 69, 9], we can learn a very accurate segmentation model when a large amount of labelled training images is available. However, collecting a large amount of pixel-wise semantic labels is laborious and time-consuming, which has become a bottleneck in semantic segmentation research [15, 39, 11]. Semi-supervised semantic segmentation, which aims to learn from a small amount of labelled images and a large amount of unlabelled images, has been attracting increasing attention for addressing the image annotation challenge.
Most existing studies tackle the challenge of semi-supervised semantic segmentation by applying either consistency-training [47, 46, 35] or self-training [45, 43, 10, 29, 68, 20] to the unlabelled data. However, they often suffer from constrained segmentation accuracy, largely due to the segmentation model that is trained using the labelled data. As illustrated in Fig. 1, the model trained using the labelled data is class-biased due to the class imbalance of the labelled data. This leads to class-biased segmentation of the unlabelled data, which accumulates and finally degrades the whole semi-supervised learning. Though a few studies [20, 64] attempt to handle the class imbalance issue by selecting more pseudo labels for minority classes during self-training, these pseudo labels are often noisy as they are generated from class-biased segmentation. Note that the class imbalance issue has been widely studied in supervised learning via re-sampling [6, 5, 62, 34, 55], re-weighting [22, 38, 49, 12] and meta-learning [63, 49, 53], but these works require labels to rectify biased predictions and are thus inapplicable to the unlabelled data in semi-supervised semantic segmentation.
In this work, we propose an unbiased subclass regularization network (USRN) that tackles the class-imbalance issue and regularizes class-biased segmentation by generating class-unbiased segmentation. Leveraging the segmentation backbone as learnt from the original class distribution, USRN introduces an auxiliary segmentation task, supervised by a set of class-balanced clusters, for producing class-unbiased segmentation on the unlabelled data. We obtain the class-balanced clusters from the labelled data by clustering pixels of each original class into multiple subclasses of similar size. As illustrated in Fig. 1, the USRN trained using class-balanced clusters produces clearly more class-unbiased segmentation for the unlabelled data. In addition, the segmentation with the original classes could be interfered with by the segmentation with the generated subclasses due to their different convergence speeds. We design an entropy-based gate mechanism to address this issue, where the learning with the auxiliary subclasses is stopped (i.e., no back-propagation) when the subclass predictions are less confident than the original class predictions. Extensive experiments over multiple public benchmarks demonstrate the effectiveness of our designed network.
The contribution of this work is threefold. First, we propose an unbiased subclass regularization network that explores class-unbiased segmentation to alleviate the class imbalance issue in semi-supervised semantic segmentation. Second, we design an entropy-based gate mechanism that effectively coordinates the concurrent learning from the original classes and the generated subclasses. Third, extensive experiments show that our designed network outperforms the state-of-the-art.
With the recent progress of deep learning, supervised semantic segmentation has made remarkable progress through various architecture designs. FCN is the first end-to-end trainable network with fully convolutional layers for semantic segmentation. Subsequent studies improve upon it by employing encoder-decoder structures [3, 9, 51], multi-scale inputs [8, 13, 37], feature pyramid spatial pooling [41, 69], attention mechanisms [70, 16] or dilated convolutions [67, 7, 61, 9]. For example, DeepLabv3+ combines low-level and high-level features to refine the object boundaries of segmentation results. However, training these supervised segmentation networks requires large amounts of annotated data, which is often laborious and time-consuming to collect. Our work aims to alleviate the data annotation constraint by exploring large amounts of unlabeled data together with a limited amount of labeled data.
Semi-supervised segmentation aims to exploit abundant unlabeled data with supervision from limited labeled data, and is closely related to domain adaptive segmentation where the labeled data is obtained from another domain [60, 28, 66, 26, 18, 2, 27]. Most existing studies address this challenge by either consistency-training [48, 72, 71, 58, 32, 17, 65, 25, 36] or self-training [30, 73, 4, 54, 33, 50, 29, 24, 1, 23, 57, 21]. Specifically, consistency-training maintains the consistency of the segmentation of each unlabeled sample under different perturbations. For example, CCT applies two same-structured segmentation networks with different initializations to produce differently perturbed samples. CAC enforces context-aware consistency between representations from the same unlabeled image augmented with different context information. Self-training instead generates pseudo labels on unlabeled data to re-train networks. For example, GCT introduces a flaw detector to correct the defects in pseudo labels. DBSN designs distribution-specific batch normalization for robust pseudo label generation. CPS produces pseudo labels from one segmentation network to supervise another segmentation network with the same structure yet different initialization. However, both consistency-training and self-training suffer from the clear pixel-wise class imbalance in labeled data. Our method can mitigate the class imbalance issue in semi-supervised segmentation effectively.
The class imbalance issue has been widely studied in supervised learning. For example, re-sampling based methods [6, 5, 62, 34] re-balance the biased networks according to the sample size of each class. Re-weighting based methods [22, 38, 49, 12] adaptively adjust the loss weight for training samples with different class labels. Meta-learning based methods [63, 49, 53] use the validation loss calculated on selected class-balanced labeled samples as the meta objective to optimize networks. However, all these methods rely on labels to address the class imbalance issue and cannot be directly applied to unlabeled data in semi-supervised learning. Recently, several studies attempt to handle the class imbalance issue in semi-supervised learning. For example, CReST selects pseudo labels more frequently for minority classes according to the estimated class distribution. DARS employs an adaptive threshold to select more pseudo labels for minority classes during self-training. However, these methods tend to generate noisy pseudo labels from class-biased segmentation of unlabeled data. We address the class imbalance issue by constructing and learning from class-balanced subclasses.
This work focuses on semi-supervised semantic segmentation. Given a small set of images with pixel-level semantic labels and a large set of unlabelled images, where each label map has size H × W over C classes (H, W and C denote image height, image width and class number, respectively), the goal is to learn a segmentation model F that can fit both labelled and unlabelled data and work well on unseen images. Existing methods [47, 31, 35, 45, 43, 10, 29, 73, 20] combine supervised learning on labelled images and unsupervised learning on unlabelled images to tackle the semi-supervised challenge. For labelled images, they adopt the cross-entropy loss as the supervised loss L_sup to train F. For unlabelled images, they adopt a consistency regularization loss [47, 31] or a self-training loss [45, 43, 35, 10, 29, 73, 20] as the unsupervised loss L_unsup to train F. The overall objective is a weighted combination of the supervised and unsupervised losses:

L = L_sup + λ · L_unsup,

where λ is a balancing weight. With this objective function, supervised and unsupervised learning benefit each other thanks to their complementary nature.
Though consistency-training and self-training can learn from the unlabelled images effectively, their performance is often constrained by the quality of the supervised model that is trained using the labelled images. Specifically, the labelled images often suffer from a clear class-imbalance issue, which directly leads to a class-biased model and further to class-biased segmentation on the unlabelled images. Such class-biased segmentation accumulates during consistency-training or self-training, which finally degrades the overall performance of semi-supervised semantic segmentation. We define this problem as the class-imbalance issue in semi-supervised semantic segmentation, and design an unbiased subclass regularization network to address it.
We design an unbiased subclass regularization network (USRN) for addressing the class-imbalance issue in semi-supervised segmentation, as shown in Fig. 2. With the labelled images in semi-supervised segmentation, USRN first trains a class-biased model F (by learning from the class-imbalanced labelled images) and then produces a class-balanced subclass distribution by clustering the F-produced features of the labelled images. With the class-balanced subclass distribution, a class-unbiased auxiliary model F_sub can be trained, which tends to produce class-unbiased segmentation when applied to the unlabelled images in semi-supervised segmentation.
Generating class-balanced subclass distributions. USRN learns a class-unbiased model by generating class-balanced clusters. With the labelled images (with class-imbalanced annotations), USRN first trains a supervised segmentation model F and then applies F to each labelled image to extract semantic features. It then adopts balanced k-means clustering to group the extracted semantic features into multiple clusters of similar size. The generated class-balanced clusters (where K denotes the total number of clustered subclasses) directly give a balanced subclass distribution over the labelled images. In our implementation, we empirically set the cluster size as the size of the smallest class in the original annotations.
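The equal-size clustering step can be sketched as follows. This is a simplified, greedy capacity-constrained variant written for illustration; the function name `balanced_kmeans` and the greedy assignment rule are our own, not necessarily the exact balanced k-means algorithm used in the paper.

```python
import numpy as np

def balanced_kmeans(feats, k, iters=10, seed=0):
    """Cluster the rows of `feats` into k clusters of (near-)equal size.

    Greedy capacity-constrained assignment: points are assigned to their
    nearest centroid that still has spare capacity, then centroids are
    re-estimated, as in standard k-means.
    """
    rng = np.random.default_rng(seed)
    n = len(feats)
    cap = int(np.ceil(n / k))                       # per-cluster capacity
    centroids = feats[rng.choice(n, k, replace=False)]
    labels = np.zeros(n, dtype=int)
    for _ in range(iters):
        # distance of every point to every centroid, shape (n, k)
        d = np.linalg.norm(feats[:, None] - centroids[None], axis=-1)
        # assign the most "certain" points first (smallest best-distance)
        order = np.argsort(d.min(axis=1))
        counts = np.zeros(k, dtype=int)
        for i in order:
            for c in np.argsort(d[i]):              # nearest feasible cluster
                if counts[c] < cap:
                    labels[i] = c
                    counts[c] += 1
                    break
        # re-estimate centroids from the balanced assignment
        for c in range(k):
            if (labels == c).any():
                centroids[c] = feats[labels == c].mean(axis=0)
    return labels
```

In USRN this would be run per original class on the pixel features extracted by F, so that every class splits into subclasses of roughly the smallest-class size.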
Supervised learning on labeled data. USRN performs supervised learning with both original and subclass annotations. For each labelled image x^l, we feed a weakly augmented image A_w(x^l) to the backbone F to obtain the original class prediction p, and the same input to the auxiliary model F_sub to obtain the subclass prediction p_sub. Here, A_w is a weak augmentation function, i.e., random scaling, cropping and horizontal flipping. Given p with its original class label y^l and p_sub with its class-balanced cluster label y_sub^l, a multi-distribution supervised loss L_sup can be defined by:

L_sup = l_ce(p, y^l) + λ_sub · l_ce(p_sub, y_sub^l),

where l_ce is the cross-entropy loss and λ_sub is a balancing weight.
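The multi-distribution supervised loss amounts to two cross-entropy terms, one per prediction head. A minimal numpy sketch, where the flattened per-pixel layout and the function names are illustrative assumptions:

```python
import numpy as np

def cross_entropy(probs, label):
    """Mean cross-entropy given softmax probabilities (N, C) and
    integer labels (N,)."""
    return -np.log(probs[np.arange(len(label)), label] + 1e-12).mean()

def multi_distribution_loss(p, y, p_sub, y_sub, lam_sub=1.0):
    """Supervised loss over both the original-class head and the
    subclass head; `lam_sub` is the balancing weight."""
    return cross_entropy(p, y) + lam_sub * cross_entropy(p_sub, y_sub)
```

Pixels here stand in for flattened H × W positions; in practice both terms are averaged over all labelled pixels of a batch.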
Self-training on unlabeled data. USRN performs self-training to update F with class-unbiased pseudo labels generated from the subclass distribution. For each unlabelled sample x^u, we feed a weakly augmented image A_w(x^u) to F to obtain the original class prediction p, and the same input to F_sub to obtain the subclass prediction p_sub. To generate unbiased pseudo labels for the original class supervision, we first map the prediction p_sub from the subclass space to the original class space (this process is denoted by M), and then define a function S to select pseudo labels from the mapped predictions in an online manner. We define the pseudo-label selection function S by:

S(p) = argmax_c p_c  if max_c p_c ≥ τ, and the 'ignore' class index otherwise,

where p refers to the predictions and τ is a confidence threshold. Note there is no back-propagation for the 'ignore' class in training.
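A minimal sketch of the mapping M and the selection function S, assuming subclass probabilities are laid out as C contiguous groups of k subclasses each; this layout, the `IGNORE` value and the function names are our own assumptions for illustration:

```python
import numpy as np

IGNORE = 255  # 'ignore' index, excluded from back-propagation

def map_subclass_to_class(p_sub, k_per_class):
    """M: map subclass probabilities (N, C*k) back to the original
    class space (N, C) by summing each class's subclass probabilities."""
    n, ck = p_sub.shape
    c = ck // k_per_class
    return p_sub.reshape(n, c, k_per_class).sum(axis=-1)

def select_pseudo_labels(p, tau=0.75):
    """S: keep argmax predictions whose confidence reaches tau;
    everything else gets the 'ignore' index."""
    conf = p.max(axis=-1)
    labels = p.argmax(axis=-1)
    return np.where(conf >= tau, labels, IGNORE)
```

Composing the two (`select_pseudo_labels(map_subclass_to_class(p_sub, k))`) yields the class-unbiased pseudo labels used to supervise the backbone.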
To alleviate over-fitting in self-training, the pseudo label generated from the weakly augmented version A_w(x^u) of an image is used to supervise the segmentation of the strongly augmented version A_s(x^u) of the same image. Here, A_s is a strong augmentation function, i.e., random color jitters and Gaussian blur. With the subclass-derived pseudo label ŷ_sub = S(M(p_sub)) (a one-hot vector computed from the mapped subclass prediction) from F_sub, we simultaneously feed A_s(x^u) to F to obtain the original class prediction p_s, and perform subclass-regularized self-training with the loss L_usr:

L_usr = l_ce(p_s, ŷ_sub).

In addition, USRN performs self-training on subclass distributions to update F_sub. With the subclass pseudo label ŷ'_sub = S(p_sub) from F_sub as in Eq. 4, we simultaneously feed A_s(x^u) to F_sub to obtain the subclass prediction p_sub,s, and perform subclass self-training with the loss L_sst:

L_sst = l_ce(p_sub,s, ŷ'_sub).
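Both self-training losses reduce to a masked cross-entropy in which 'ignore' pixels contribute no gradient. A hedged numpy sketch (names are illustrative, and pixels stand in for flattened spatial positions):

```python
import numpy as np

IGNORE = 255  # pixels with this pseudo label are excluded from the loss

def self_training_loss(p_strong, pseudo_labels):
    """Cross-entropy between strong-view predictions (N, C) and pseudo
    labels (N,) selected from the weak view; 'ignore' pixels are masked
    out so they contribute no loss."""
    mask = pseudo_labels != IGNORE
    if not mask.any():
        return 0.0
    picked = p_strong[mask.nonzero()[0], pseudo_labels[mask]]
    return float(-np.log(picked + 1e-12).mean())
```

The same function serves both L_usr (backbone supervised by subclass-derived pseudo labels) and L_sst (auxiliary model supervised by its own subclass pseudo labels).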
The proposed USRN employs subclass predictions to regularize original-class predictions. As the subclass distributions are derived from the original-class distribution, learning from the subclass distributions is more complex and tends to be slower than learning from the original-class distributions under the same learning policy (e.g., optimizer, learning rate, weight decay rate, etc.). This can introduce undesired regularization: since the original-class learning converges faster during training, it may produce more confident and correct predictions than the subclass learning. The semi-supervised learning will degrade if the original-class predictions are regularized by the subclass predictions under such circumstances.
To address this problem, we design an entropy-based selection function that avoids regularizing confident original-class predictions p with unconfident subclass predictions M(p_sub). The entropy-based selection gate G is defined by:

G = 1  if E(M(p_sub)) < E(p), and 0 otherwise,

where E denotes the entropy function.

Given the original predictions (i.e., p_s and p from the strongly and weakly augmented versions of the same image) and the subclass prediction (i.e., M(p_sub) from the weakly augmented version) as in Eq. 4, we reformulate the self-training loss in Eq. 4 and define an entropy-based self-training loss L_est as follows:

L_est = G · l_ce(p_s, S(M(p_sub))) + (1 − G) · l_ce(p_s, S(p)).
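A small sketch of the entropy-based gate, assuming per-pixel softmax probability vectors; the function names are our own:

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy of a softmax distribution, per pixel."""
    return -(p * np.log(p + 1e-12)).sum(axis=axis)

def entropy_gate(p_sub_mapped, p):
    """Use the subclass-derived pseudo label only where the mapped
    subclass prediction is more confident (lower entropy) than the
    original-class prediction."""
    return entropy(p_sub_mapped) < entropy(p)
```

The boolean gate decides per pixel which pseudo label supervises the strong-view prediction, suppressing unconfident subclass regularization.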
We conducted main experiments on the PASCAL VOC dataset by following previous work [31, 47, 35, 10]. The dataset consists of 10,582 images for training and 1,456 images for evaluation, and the image resolution varies across images. It provides pixel-wise annotations with 21 semantic classes. To perform comprehensive validation, we also conducted experiments on the Cityscapes dataset, which contains 2,975 images for training and 500 images for evaluation, where all images have the same resolution of 2048 × 1024. Cityscapes provides pixel-wise labels with 19 semantic classes.
The segmentation backbone F and the auxiliary segmentation model F_sub are built on ResNet-50 pre-trained on ImageNet, where F and F_sub share the layers that extract low-level features in ResNet-50. All network models are optimized by mini-batch stochastic gradient descent (SGD) with a momentum of 0.9 and weight decay. The weak augmentation function A_w (i.e., random scaling, cropping and horizontal flipping) and the strong augmentation function A_s (i.e., random color jitters and Gaussian blur) are the same as described above. The confidence threshold τ is set to 0.75 and all the balancing weights (i.e., λ and λ_sub) are set to 1. During evaluation, each image is tested only on the segmentation backbone F, and the mean intersection-over-union (mIoU) is adopted as the evaluation metric.
|Method||Venue||1/64||1/32||1/16|
|GCT||ECCV 20||-||-||64.1|
|CCT||CVPR 20||-||-||65.2|
|DARS||ICCV 21||56.9||64.5||68.4|
|DBSN||ICCV 21||57.5||64.6||69.8|
|CAC||CVPR 21||56.5||65.1||70.1|
|CPS||CVPR 21||57.9||64.8||68.2|

|Method||Venue||1/32||1/16||1/8|
|GCT||ECCV 20||-||65.8||71.3|
|CCT||CVPR 20||-||66.4||72.5|
|DARS||ICCV 21||61.9||66.9||73.7|
|DBSN||ICCV 21||62.2||67.3||73.5|
|CAC||CVPR 21||62.2||69.4||74.0|
|CPS||CVPR 21||62.5||69.8||74.4|
We compare USRN with state-of-the-art methods [31, 47, 20, 68, 35, 10] over the PASCAL VOC and Cityscapes datasets [15, 11]. Tables 1 and 2 show the experimental results. For the PASCAL VOC dataset, we randomly split 1/64, 1/32 and 1/16 of the trainset (including 165, 331 and 662 training images, respectively) as labelled data, and the remaining training images as unlabelled data. As the number of training images in Cityscapes is smaller than that in PASCAL VOC, we randomly split 1/32, 1/16 and 1/8 of the Cityscapes trainset (including 93, 186 and 372 training images, respectively) as labelled data, and the remaining training images as unlabelled data. State-of-the-art methods are implemented with various segmentation backbones and choose different splits of the trainset in their experiments. For fair comparisons, we reproduce some experimental results by using the official code so that all the methods can be compared with the same split of labelled data as well as the same segmentation backbone.
As Tables 1 and 2 show, the proposed USRN outperforms the state-of-the-art consistently over the two datasets with different splits of labelled training data. The superior performance is largely attributed to the proposed unbiased subclass regularization that effectively addresses the class imbalance issue in semi-supervised segmentation. For smaller splits of the labelled training data, USRN outperforms the state-of-the-art with larger margins, by 3.8% and 2.1% in mIoU for the 1/64 split of PASCAL VOC and the 1/32 split of Cityscapes, respectively. In particular, the performance of state-of-the-art methods is largely constrained by the quality of the segmentation model that is trained using the class-imbalanced labelled data. Since deep convolutional neural networks tend to overfit on small datasets, the class imbalance issue is more severe when training with fewer labelled data, which degrades the performance of state-of-the-art methods greatly. When using larger splits of labelled data, the gaps between our method and the Oracle trained on the whole trainset are 4.5% in mIoU for the 1/16 split of PASCAL VOC and 3.3% in mIoU for the 1/8 split of Cityscapes. These experimental results show that our method can learn accurate segmentation models with a small amount of labelled training data, demonstrating its potential in reducing labelling efforts in deep network training.
We also provide qualitative comparisons over 1/32 split of PASCAL VOC dataset. We compare USRN with state-of-the-art methods [35, 10] and the Baseline that is trained with supervised loss only. The qualitative results are well aligned with the quantitative results as illustrated in Fig. 3. It can be observed that USRN produces more accurate segmentations than state-of-the-art methods especially for those inaccurately segmented pixels that belong to the most dominant class. The qualitative experimental results further validate that USRN can better handle the class imbalance issue in semi-supervised semantic segmentation.
We conducted extensive ablation studies to examine how the proposed USRN achieves superior semi-supervised semantic segmentation. We performed all the ablation studies over the 1/32 split of the PASCAL VOC dataset, where USRN achieves a mIoU of 68.6% under the default settings. Specifically, we examine different designs in USRN, including different USRN components, different clustering strategies for class-balanced cluster generation, sharing features at different levels (between the segmentation backbone F and the auxiliary segmentation model F_sub), and parameter analysis of the confidence threshold τ for pseudo label selection.
Different Components. We conducted ablation studies on different components of USRN to examine their effectiveness, as shown in Table 3. Specifically, we trained five models over the 1/32 split of the PASCAL VOC dataset, including: 1) Model I that is trained with labeled data only using the multi-distribution supervised learning (MDL) loss in Eq. 2; 2) Model II that performs self-training on original class distributions only, using the MDL loss and the original self-training (OST) loss as in [54, 35, 10]; 3) Model III that performs unbiased subclass regularization (USR) directly on the OST in Model II, using the MDL loss and the proposed self-training loss as defined in Eq. 4; 4) Model IV that adds the subclass self-training (SST) loss in Eq. 5 to Model III for training the auxiliary segmentation model on unlabelled data; and 5) USRN that introduces the entropy-based gate mechanism (EGM) into Model IV to coordinate the concurrent learning from the original classes and the generated subclasses.
As Table 3 shows, both Model II and Model III outperform Model I by large margins, demonstrating the effectiveness of self-training in semi-supervised segmentation. Without SST, Model III still outperforms Model II, which shows that the subclass segmentation model trained with the labelled data only can produce high-quality pseudo labels on unlabelled data. With SST, Model IV outperforms Model III by 2.1% in mIoU thanks to updating the subclass segmentation model by self-training on unlabelled data. With the updated auxiliary segmentation model, more accurate subclass segmentation can be produced to generate unbiased pseudo labels for updating the segmentation backbone. Finally, USRN further improves over Model IV by 1.5% in mIoU, which validates the effectiveness of the proposed entropy-based gate mechanism.
|Clustering algorithm||Original CBR||Subclass CBR||mIoU|
Clustering Strategy. In Section 3.2, we adopted balanced k-means clustering to generate class-balanced subclass annotations. To measure the class balance of annotations, we define a new metric named class balance rate (CBR), which can be formulated as follows:

CBR = 1 − σ(n) / σ(n_max),

where n = {n_1, ..., n_C} denotes the number of pixels within each class for the given annotations, σ(n) is the standard deviation of n, and σ(n_max) is the standard deviation of the extreme case n_max in which all pixels are labelled with only one class (extreme class imbalance).
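Under this definition, CBR can be computed as follows (a sketch under our reconstruction of the formula; the function name is illustrative):

```python
import numpy as np

def class_balance_rate(pixel_counts):
    """CBR = 1 - std(n) / std(n_max), where n_max is the extreme
    distribution putting every pixel into a single class."""
    n = np.asarray(pixel_counts, dtype=float)
    extreme = np.zeros_like(n)
    extreme[0] = n.sum()      # all pixels in one class
    return 1.0 - n.std() / extreme.std()
```

A perfectly balanced annotation yields CBR = 1, while the single-class extreme yields CBR = 0.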
As shown in Table 4, the CBR of the subclass annotations is almost 100%, which is much higher than the CBR of the original annotations. This demonstrates that we successfully obtain class-balanced subclass annotations from class-imbalanced original annotations. We can also observe that the subclass annotations generated with standard k-means are also quite class-balanced (CBR = 96.4%), and the USRN model trained with such annotations achieves comparable accuracy to the USRN trained with the default clustering strategy (i.e., balanced k-means). This shows that our method is robust to different clustering strategies.
|Sharing Features||GPU Occupation||mIoU|
|No sharing||9.76 GB||67.3|
|Low-level features||8.75 GB||68.6|
|Both low-level and high-level features||6.99 GB||67.8|
Sharing Features. Recent supervised segmentation models [69, 3, 70, 9] achieve high accuracy by integrating multi-level features. In the default setting of USRN, the segmentation backbone F and the auxiliary segmentation model F_sub share the layers that extract low-level features. We further evaluate the impact of sharing features between F and F_sub. As shown in Table 5, USRN trained with the default setting (i.e., sharing low-level features) achieves the highest mIoU as compared with USRN trained with the other settings (i.e., no sharing and sharing multi-level features). The 'no sharing' setting has the lowest accuracy, which demonstrates that the original class segmentation and the auxiliary subclass segmentation are complementary to each other. Sharing high-level features (i.e., semantic features) degrades the accuracy of USRN because the original class segmentation and the auxiliary subclass segmentation need to learn different semantic features, as the semantic information of the two tasks differs.
Parameter Analysis. The confidence threshold τ in Eq. 3 is an important hyper-parameter for generating high-quality class-unbiased pseudo labels. We evaluate USRN with different τ and Table 6 shows the experimental results. It can be observed that USRN is very stable when τ changes in the range from 0.75 to 0.95. When τ is smaller than 0.75, the performance of USRN degrades because the predicted pseudo labels tend to become noisy. When τ is larger than 0.95, USRN suffers from over-fitting because the very high confidence threshold returns very limited pseudo labels. We set τ to 0.75 by default in our implemented USRN.
Comparison with Class-Imbalance Methods: The proposed USRN explores class-unbiased segmentation to address the class imbalance issue in semi-supervised segmentation. Recently, several studies [64, 20] attempt to handle the class imbalance issue in semi-supervised learning. We compare USRN with these methods and Table 7 shows the experimental results. It can be seen that USRN achieves the best overall performance (i.e., 68.6% in mIoU) and the best per-class accuracy on 17 out of all 21 classes. The superior performance shows that exploring class-unbiased segmentation from balanced subclass distributions is more effective than selecting more pseudo labels for minority classes in self-training as in [64, 20].
Complementary Studies: We also investigate whether the proposed USRN can complement the state-of-the-art methods [20, 35, 10] compared in Section 4.2. We integrate our proposed unbiased subclass regularization network into the state-of-the-art methods to perform this study. Table 8 shows the experimental results. It can be observed that the integration of USRN improves performance greatly across all tested state-of-the-art methods, which employ either consistency-training or self-training [20, 10].
Different Segmentation Architectures: We further study whether USRN can work well with different semantic segmentation architectures. We studied three widely used segmentation architectures, including PSPNet, PSANet and DeepLabv3+, and Table 9 shows the experimental results. It can be observed that the proposed USRN outperforms the Baseline model by large margins consistently across the three architectures. This shows that USRN can work well with different semantic segmentation architectures that apply pyramid spatial pooling, attention mechanisms and dilated convolutions.
This paper presents an unbiased subclass regularization network that explores class-unbiased segmentation to address the class imbalance issue in semi-supervised segmentation. Specifically, the class-biased segmentation learnt from imbalanced original class distributions is regularized by the class-unbiased segmentation learnt from balanced subclass distributions. To coordinate the concurrent learning from the original classes and the generated subclasses, an entropy-based gate mechanism is designed to suppress unconfident subclass predictions for facilitating subclass regularization. Comprehensive experiments demonstrate the effectiveness of our method in semi-supervised segmentation. In the future, we will investigate how the idea of unbiased subclass regularization performs in other semi-supervised learning tasks such as semi-supervised image classification and semi-supervised object detection.
Acknowledgement. This study is supported under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from Singapore Telecommunications Limited (Singtel), through Singtel Cognitive and Artificial Intelligence Lab for Enterprises (SCALE@NTU).