
Semi-Supervised Semantic Segmentation via Gentle Teaching Assistant

Semi-Supervised Semantic Segmentation aims at training a segmentation model with limited labeled data and a large amount of unlabeled data. To effectively leverage the unlabeled data, pseudo labeling, along with the teacher-student framework, is widely adopted in semi-supervised semantic segmentation. Though proven effective, this paradigm suffers from incorrect pseudo labels, which inevitably exist and are taken as auxiliary training data. To alleviate their negative impact, we delve into the current Semi-Supervised Semantic Segmentation frameworks. We argue that unlabeled data with pseudo labels can facilitate the learning of representative features in the feature extractor, but are unreliable for supervising the mask predictor. Motivated by this consideration, we propose a novel framework, Gentle Teaching Assistant (GTA-Seg), to disentangle the effects of pseudo labels on the feature extractor and the mask predictor of the student model. Specifically, in addition to the original teacher-student framework, our method introduces a teaching assistant network which directly learns from pseudo labels generated by the teacher network. The gentle teaching assistant (GTA) is called gentle since it only transfers the beneficial feature representation knowledge in the feature extractor to the student model in an Exponential Moving Average (EMA) manner, protecting the student model from the negative influence that unreliable pseudo labels would exert on the mask predictor. The student model is also supervised by reliable labeled data to train an accurate mask predictor, further facilitating feature representation. Extensive experimental results on benchmark datasets validate that our method shows competitive performance against previous methods. Code is available at https://github.com/Jin-Ying/GTA-Seg.


1 Introduction

The rapid development of deep learning has brought significant advances to semantic segmentation Long et al. (2015); Chen et al. (2017); Zhao et al. (2017), one of the most fundamental tasks in computer vision. Existing methods often rely heavily on large amounts of pixel-wise annotated data, which are labor-intensive and expensive to obtain. To ease this burden, great interest has arisen in Semi-Supervised Semantic Segmentation, which attempts to train a semantic segmentation model with limited labeled data and a large amount of unlabeled data.

The key challenge in semi-supervised learning is to effectively leverage the abundant unlabeled data. One widely adopted strategy is pseudo labeling Lee and others (2013). As shown in Figure 1, the model assigns pseudo labels to unlabeled data based on its own predictions on-the-fly. These data with pseudo labels are then taken as auxiliary supervision during training to boost performance. To further facilitate semi-supervised learning, the teacher-student framework Tarvainen and Valpola (2017); Xu et al. (2021); Wang et al. (2022) is incorporated. The teacher model, which is the Exponential Moving Average (EMA) of the student model, is responsible for generating smoothly updated pseudo labels. Jointly supervised by limited data with ground-truth labels and abundant data with pseudo labels, the student model can learn more representative features, leading to significant performance gains.

Although shown to be effective, the pseudo labeling paradigm suffers from unreliable pseudo labels, leading to inaccurate mask predictions. Previous work alleviates this problem by filtering out predictions whose classification scores fall below a threshold Berthelot et al. (2019); Sohn et al. (2020); Zhang et al. (2021). However, this mechanism cannot perfectly filter out wrong predictions, because some wrong predictions may have high classification scores, a phenomenon known as over-confidence or mis-calibration Guo et al. (2017). Moreover, a high threshold heavily reduces the number of generated pseudo labels, limiting the effectiveness of semi-supervised learning.

To address the aforementioned challenge, it is necessary to design a new pseudo labeling paradigm that learns representative features from unlabeled data while avoiding the negative influence of unreliable pseudo labels. A semantic segmentation network is composed of a feature extractor and a mask predictor. Previous works require both the feature extractor and the mask predictor to learn from ground-truth labels and pseudo labels simultaneously; as a result, the accuracy of the model is harmed by incorrect pseudo labels. To better leverage the unlabeled data with pseudo labels, a viable solution is to let the feature extractor learn feature representations from both ground-truth labels and pseudo labels, while the mask predictor learns only from ground-truth labels to predict accurate segmentation results.

Figure 1: Comparison with previous frameworks. (a) The vanilla pseudo labeling framework. The model generates pseudo labels by itself and in turn, learns from them. (b) The pseudo labeling with the teacher-student framework. The teacher model is responsible for generating pseudo labels while the student model learns from the pseudo labels and the ground-truth labels simultaneously. Knowledge Transmission is conducted between the two models via Exponential Moving Average (EMA) of all parameters. (c) Our method attaches a gentle teaching assistant (GTA) module to the teacher-student framework. Different from the original one in (b), the gentle teaching assistant (GTA) learns from the pseudo labels while the student model only learns from ground-truth labels. We design the representation knowledge transmission between the GTA and student to mitigate the negative influence caused by unreliable pseudo labels.

Accordingly, we propose a novel framework, Semi-Supervised Semantic Segmentation via Gentle Teaching Assistant (GTA-Seg), which attaches an additional gentle teaching assistant (GTA) module to the original teacher-student framework. Figure 1 compares our method with previous frameworks. In our method, the teacher model generates pseudo labels for unlabeled data and the gentle teaching assistant (GTA) learns from these unlabeled data. Only the knowledge in the feature extractor of the GTA is conveyed to the feature extractor of the student model via Exponential Moving Average (EMA); we coin this process representation knowledge transmission. Meanwhile, the student model also learns from the reliable ground-truth labels to optimize both its feature extractor and mask predictor. The teaching assistant is called gentle since it not only transfers the beneficial feature representation knowledge to the student model, but also protects the student model from the negative influence that unreliable pseudo labels would exert on the mask predictor. Furthermore, a re-weighting mechanism is adopted for pseudo labels to suppress unreliable pixels.

Extensive experiments validate that our method shows competitive performance on mainstream benchmarks, proving that it makes better use of unlabeled data. In addition, the visualization results show that our method produces clearer object contours and more accurate classification, indicating better segmentation quality.

2 Related Work

Semantic Segmentation

Semantic Segmentation, which aims at predicting the label of each pixel in an image, is one of the most fundamental tasks in computer vision. To obtain dense predictions, FCN Long et al. (2015) replaces the fully-connected layer in the classification model with convolution layers. The encoder-decoder structure is borrowed to further refine the pixel-level outputs Noh et al. (2015); Badrinarayanan et al. (2017). Meanwhile, intensive efforts have been made to design network components suitable for semantic segmentation. Among them, dilated convolution Yu and Koltun (2016) is proposed to enlarge receptive fields, global and pyramid pooling Liu et al. (2015); Chen et al. (2017); Zhao et al. (2017) are shown to be effective in modeling context information, and various attention modules Zhang et al. (2018); Zhao et al. (2018); Fu et al. (2019); Huang et al. (2019); Sun et al. (2019) are adopted to capture pixel relations in images. These works mark milestones in this important computer vision task, but they pay little attention to data-scarce scenarios.

Semi-Supervised Learning

Mainstream methods in Semi-Supervised Learning (SSL) Zhu (2005) fall into two lines of work: self-training Grandvalet and Bengio (2004); Lee and others (2013) and consistency regularization Laine and Aila (2017); Sajjadi et al. (2016); Miyato et al. (2018); Xie et al. (2020); Tarvainen and Valpola (2017). The core spirit of self-training is to utilize the model's own predictions to learn from unlabeled data. Pseudo Labeling Lee and others (2013), which converts model predictions on unlabeled data into one-hot labels, is a widely-used technique Berthelot et al. (2019); Sohn et al. (2020); Zhang et al. (2021) in semi-supervised learning. Another variant of self-training, entropy minimization Rényi (1961), has also proved effective both theoretically Wei et al. (2021) and empirically Grandvalet and Bengio (2004). Consistency Regularization Sajjadi et al. (2016); Xie et al. (2020) forces the model to produce consistent predictions when perturbations are imposed on the unlabeled data. Recent works unveil that self-training and consistency regularization can cooperate harmoniously. MixMatch Berthelot et al. (2019) is a pioneering holistic method with remarkable performance. On the basis of MixMatch, FixMatch Sohn et al. (2020) further simplifies the learning process, while FlexMatch Zhang et al. (2021) introduces a class-wise confidence threshold to boost model performance.

Semi-Supervised Semantic Segmentation

Semi-Supervised Semantic Segmentation aims at pixel-level classification with limited annotations. Borrowing the spirit of Semi-Supervised Learning, self-training and consistency regularization have given birth to various methods. One line of work Zou et al. (2021); Chen et al. (2021); Hu et al. (2021); Wang et al. (2022) applies pseudo labeling in self-training to acquire auxiliary supervision, while methods based on consistency Mittal et al. (2019) pursue stable outputs at both the feature Lai et al. (2021); Zhong et al. (2021) and prediction level Ouali et al. (2020). Apart from them, Generative Adversarial Networks (GANs) Goodfellow et al. (2014) and adversarial learning are often leveraged to provide additional supervision in relatively early methods Souly et al. (2017); Hung et al. (2018); Mendel et al. (2020); Ke et al. (2020). Various recent methods tackle this problem from other perspectives, such as self-correcting networks Ibrahim et al. (2020) and contrastive learning Alonso et al. (2021). Among them, some works Yuan et al. (2021) unveil the interesting phenomenon that the most fundamental training paradigm, equipped with strong data augmentation, can serve as a simple yet effective baseline. In this paper, we shed light on semi-supervised semantic segmentation based on pseudo labeling and strive to alleviate the negative influence caused by noisy pseudo labels.

3 Method

3.1 Preliminaries

Semi-Supervised Semantic Segmentation

In Semi-Supervised Semantic Segmentation, we train a model with limited labeled data $\mathcal{D}_l = \{(x_i^l, y_i^l)\}_{i=1}^{N_l}$ and a large amount of unlabeled data $\mathcal{D}_u = \{x_i^u\}_{i=1}^{N_u}$, where $N_u$ is often much larger than $N_l$. The semantic segmentation network is composed of a feature extractor $f$ and a mask predictor $g$. The key challenge of Semi-Supervised Semantic Segmentation is to make good use of the numerous unlabeled data, and one common solution is pseudo labeling Lee and others (2013); Yang et al. (2022).

Pseudo Labeling

Pseudo Labeling is a widely adopted technique for semi-supervised segmentation, which assigns pseudo labels to unlabeled data according to model predictions on-the-fly. Assuming there are $C$ categories, for the $j$-th pixel on image $x_i^u$, the model prediction $p_{ij}$ and the corresponding confidence $q_{ij}$ are

$$p_{ij} = \mathrm{Softmax}\big(g(f(x_i^u))_j\big), \qquad q_{ij} = \max_{c \in \{1,\dots,C\}} p_{ij}^{(c)}, \tag{1}$$

where $c$ denotes the category. A larger $q_{ij}$ indicates that the model is more certain about this pixel, which is consequently more suitable for generating pseudo labels. Specifically, we keep the pixels whose confidence value is greater than a threshold, and generate pseudo labels as

$$\hat{y}_{ij} = \begin{cases} \arg\max_{c}\, p_{ij}^{(c)}, & q_{ij} \ge \gamma_t, \\ \text{ignored}, & \text{otherwise}, \end{cases} \tag{2}$$

where $\gamma_t$ is the confidence threshold at the $t$-th iteration; $\gamma_t$ can be a constant or a varying value during training. A pixel on image $x_i^u$ with confidence larger than $\gamma_t$ is assigned the pseudo label $\hat{y}_{ij}$. The unlabeled pixels that are assigned pseudo labels are taken as auxiliary training data, while the other unlabeled pixels are ignored.
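
To make the procedure concrete, the following is a minimal PyTorch-style sketch of Eqs. (1)-(2). The tensor layout (B, C, H, W) and the ignore index 255 are common segmentation conventions, not details taken from the released code.

import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_pseudo_labels(teacher_logits, threshold):
    # teacher_logits: (B, C, H, W) raw teacher predictions on unlabeled images.
    probs = F.softmax(teacher_logits, dim=1)        # per-pixel class distribution, Eq. (1)
    confidence, pseudo_labels = probs.max(dim=1)    # q_ij and the argmax category
    # Pixels below the confidence threshold are ignored during training, Eq. (2).
    pseudo_labels[confidence < threshold] = 255     # 255 = ignore index
    return pseudo_labels, confidence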

Teacher-Student Framework

The Teacher-Student framework Croitoru et al. (2017); Tarvainen and Valpola (2017); Wang et al. (2022) is a widely applied paradigm in Semi-Supervised Segmentation, consisting of one teacher model and one student model. The teacher model is responsible for generating pseudo labels, while the student model learns from both the ground-truth labels and the pseudo labels. The loss for the student model is therefore

$$\mathcal{L} = \mathcal{L}_{sup} + \lambda\, \mathcal{L}_{unsup}, \tag{3}$$

where, in Semi-Supervised Semantic Segmentation, $\mathcal{L}_{sup}$ and $\mathcal{L}_{unsup}$ are the cross-entropy losses on labeled data and on unlabeled data with pseudo labels, respectively Wang et al. (2022), and $\lambda$ is a loss weight adjusting the trade-off between them. The optimization of the student model can be formulated as

$$\theta_s \leftarrow \theta_s - \eta\, \nabla_{\theta_s}\big(\mathcal{L}_{sup} + \lambda\, \mathcal{L}_{unsup}\big), \tag{4}$$

where $\eta$ denotes the learning rate. In the Teacher-Student framework, after the parameters of the student model are updated, the parameters of the teacher model are updated from the student parameters in an Exponential Moving Average (EMA) manner,

$$\theta_t^{(i)} = \alpha\, \theta_t^{(i-1)} + (1 - \alpha)\, \theta_s^{(i)}, \tag{5}$$

where $\theta_t^{(i)}$ and $\theta_s^{(i)}$ denote the parameters of the teacher and student model at the $i$-th iteration, respectively, and $\alpha$ is the EMA hyper-parameter, where $0 \le \alpha < 1$.
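
A minimal sketch of the EMA update in Eq. (5) is given below; alpha = 0.99 matches the value reported in Table 8, and iterating over parameters in matching order assumes the teacher and student share an identical architecture.

def ema_update(teacher, student, alpha=0.99):
    # Eq. (5): theta_t <- alpha * theta_t + (1 - alpha) * theta_s, parameter-wise.
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)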

Figure 2: Method Overview. Our Gentle Teaching Assistant (GTA) framework can be divided into three steps. Step 1: The teacher model generates pseudo labels and then the gentle teaching assistant can learn from them. One re-weighting strategy is incorporated to assign importance weights to the generated pseudo labels. Step 2: The gentle teaching assistant model learns from the pseudo labels and performs representation knowledge transmission, which only conveys the learned knowledge in the feature extractor to the student model via Exponential Moving Average (EMA). Step 3: After absorbing the knowledge from our gentle teaching assistant, the student model learns from ground-truth labels and optimizes all parameters. Finally, the parameters of the teacher model will also be updated according to the student model via EMA at the end of each training iteration.

3.2 Gentle Teaching Assistant

In this section, we will introduce our Gentle Teaching Assistant framework for semi-supervised semantic segmentation (GTA-Seg), as shown in Figure 2, which consists of the following three steps.

Step 1: Pseudo Label Generation and Re-weighting.

Similar to previous work Wang et al. (2022), the teacher model is responsible for generating pseudo labels. A confidence threshold is also adopted to filter out the pseudo labels with low confidence. For the kept pixels, instead of treating all of them equally, we propose a re-weighting mechanism according to the confidence of each pixel as follows,

$$w_{ij} = \frac{N_i\, (q_{ij} + \beta)}{\sum_{j'=1}^{N_i} (q_{ij'} + \beta)}, \tag{6}$$

where $N_i$ is the number of kept pixels on image $x_i^u$ and $\beta$ is a predefined smoothing coefficient. In this re-weighting strategy, pixels with higher confidence are highlighted while the others are suppressed; as a result, the negative influence caused by unreliable pseudo labels can be further alleviated. The Laplace Smoothing term $\beta$ Manning et al. (2010) avoids over-penalizing low-confidence pixels. With this re-weighting mechanism, the unsupervised loss on unlabeled data becomes

$$\mathcal{L}_{unsup} = \frac{1}{N_u} \sum_{i=1}^{N_u} \frac{1}{N_i} \sum_{j=1}^{N_i} w_{ij}\, \ell_{ce}\big(g(f(x_i^u))_j,\, \hat{y}_{ij}\big), \tag{7}$$

where $\ell_{ce}$ denotes the pixel-wise cross-entropy loss.
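
The sketch below implements one plausible reading of Eqs. (6)-(7), reusing the pseudo labels and confidence maps from the earlier snippet. The default smoothing coefficient beta = 1.0 is illustrative; the exact reported value is not preserved in this copy.

def reweighted_pseudo_loss(logits, pseudo_labels, confidence, beta=1.0):
    # Pixels kept by the confidence threshold (255 marks ignored pixels).
    valid = pseudo_labels != 255
    if valid.sum() == 0:
        return logits.sum() * 0.0    # no usable pseudo labels in this batch
    # Laplace-smoothed confidence weights, normalized to mean 1 over kept pixels, Eq. (6).
    weights = confidence + beta
    weights = weights / weights[valid].mean()
    # Per-pixel cross-entropy against the pseudo labels, Eq. (7).
    ce = F.cross_entropy(logits, pseudo_labels, ignore_index=255, reduction="none")
    return (weights * ce)[valid].mean()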

Step 2: Representation Knowledge Transmission via Gentle Teaching Assistant (GTA).

Gentle Teaching Assistant (GTA) plays a crucial role in our framework. Previous works force the student model to learn from both labeled and unlabeled data simultaneously. We argue that it is dangerous to treat ground-truth labels and pseudo labels equally since the incorrect pseudo labels will mislead the mask prediction. Therefore, we want to disentangle the effects of pseudo labels on feature extractor and mask predictor of the student model. Concretely, our solution is to introduce one additional gentle teaching assistant, which learns from the unlabeled data and only transfers the beneficial feature representation knowledge to the student model, protecting the student model from the negative influences caused by unreliable pseudo labels.

After being optimized on unlabeled data with pseudo labels as in Eq. (8), the gentle teaching assistant model conveys the learned representation knowledge in its feature extractor to the student model via Exponential Moving Average (EMA) as in Eq. (9):

$$\theta_{gta}^{(i)} = \theta_{gta}^{(i-1)} - \eta\, \nabla_{\theta_{gta}} \mathcal{L}_{unsup}, \tag{8}$$

$$\phi_s^{(i)} = \alpha\, \phi_s^{(i-1)} + (1 - \alpha)\, \phi_{gta}^{(i)}, \tag{9}$$

where $\theta_{gta}^{(i)}$ denotes the parameters of the gentle teaching assistant model at the $i$-th iteration, $\phi_s^{(i)}$ denotes the parameters of the student model's feature extractor at the $i$-th iteration, and $\phi$ denotes the parameters of the feature extractor only. Through our representation knowledge transmission, the unlabeled data is leveraged to facilitate the feature representation of the student model, but it never trains the mask predictor.
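
Concretely, the partial EMA of Eq. (9) can be sketched as follows. The `.encoder` attribute standing for the feature extractor is an assumed naming for illustration; in our setup it would correspond to the ResNet-101 backbone.

def transmit_representation(gta, student, alpha=0.99):
    # Eq. (9): EMA update restricted to the feature extractor.
    # The student's mask predictor is deliberately left untouched,
    # so pseudo labels never influence it directly.
    with torch.no_grad():
        for s_param, g_param in zip(student.encoder.parameters(),
                                    gta.encoder.parameters()):
            s_param.mul_(alpha).add_(g_param, alpha=1.0 - alpha)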

Step 3: Optimize student model with ground truth labels and update teacher model.

With the gentle teaching assistant module, the student model in our framework is only required to learn from the labeled data,

$$\mathcal{L}_{sup} = \frac{1}{N_l} \sum_{i=1}^{N_l} \frac{1}{N} \sum_{j=1}^{N} \ell_{ce}\big(g(f(x_i^l))_j,\, y_{ij}^l\big), \tag{10}$$

$$\theta_s^{(i)} = \theta_s^{(i-1)} - \eta\, \nabla_{\theta_s} \mathcal{L}_{sup}, \tag{11}$$

where $N$ denotes the number of pixels in an image. Here, the whole model, including the feature extractor as well as the mask predictor, is updated according to the supervised loss computed from the ground-truth labels of the labeled data.

Then the teacher model is updated by taking the EMA of the student model, following the traditional paradigm of the teacher-student framework:

$$\theta_t^{(i)} = \alpha\, \theta_t^{(i-1)} + (1 - \alpha)\, \theta_s^{(i)}. \tag{12}$$

Finally, the teacher model, which absorbs the knowledge of both labeled and unlabeled data from the student model, will be taken as the final model for inference.

Input: Labeled data $\mathcal{D}_l$, unlabeled data $\mathcal{D}_u$, batch size $B$
Output: Teacher model
1 Initialization;
2 for each minibatch do
3       Step 1:
4         Teacher model generates pseudo labels on unlabeled samples by Eq. (2);
5         Re-weight the pseudo labels by Eq. (6) and compute the unsupervised loss by Eq. (7);
6       Step 2:
7         Update the Gentle Teaching Assistant (GTA) on unlabeled data by Eq. (8);
8         Representation knowledge transmission from the GTA to the student by Eq. (9);
9       Step 3:
10         Compute the supervised loss on labeled data by Eq. (10);
11         Update the student model on labeled data via Eq. (11);
12         Update the teacher model by Eq. (12);
13 end for
Algorithm 1 Gentle Teaching Assistant for Semi-Supervised Semantic Segmentation (GTA-Seg).
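
Putting the three steps together, one iteration of Algorithm 1 might look like the sketch below. It reuses the helper functions from the previous snippets; the threshold value 0.9 is illustrative, and the models and optimizers are assumed to be standard PyTorch objects.

def train_iteration(teacher, gta, student, opt_gta, opt_student,
                    x_l, y_l, x_u, threshold=0.9, alpha=0.99):
    # Step 1: the teacher generates confidence-filtered pseudo labels, Eq. (2).
    with torch.no_grad():
        teacher_logits = teacher(x_u)
    pseudo, conf = generate_pseudo_labels(teacher_logits, threshold)
    # Step 2: the GTA learns from the re-weighted pseudo labels (Eqs. 6-8),
    # then transmits its feature-extractor knowledge to the student (Eq. 9).
    loss_u = reweighted_pseudo_loss(gta(x_u), pseudo, conf)
    opt_gta.zero_grad(); loss_u.backward(); opt_gta.step()
    transmit_representation(gta, student, alpha)
    # Step 3: the student learns from ground-truth labels only (Eqs. 10-11),
    # and the teacher tracks the student via full EMA (Eq. 12).
    loss_s = F.cross_entropy(student(x_l), y_l, ignore_index=255)
    opt_student.zero_grad(); loss_s.backward(); opt_student.step()
    ema_update(teacher, student, alpha)
    return loss_s.item(), loss_u.item()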

4 Experiment

4.1 Datasets

We evaluate our method on 1) PASCAL VOC 2012 Everingham et al. (2010): a widely-used benchmark dataset for semantic segmentation, with 1464 images for training and 1449 images for validation. Some works Chen et al. (2021); Yang et al. (2022) augment the training set by incorporating the coarsely annotated images in SBD Hariharan et al. (2011), obtaining 10582 labeled training images, which is called the augmented training set. In our experiments, we consider both the original training set and the augmented training set, taking 92, 183, 366, 732, and 1464 images from the 1464 labeled images in the original training set, and 662, 1323, 2645, and 5291 images from the 10582 labeled training images in the augmented training set. 2) Cityscapes Cordts et al. (2016): an urban scene dataset with 2975 images for training and 500 images for validation. We sample 100, 186, 372, and 744 images from the labeled images in the training set. We take the splits in Zou et al. (2021) and report all performances under a fair comparison.

4.2 Implementation Details

We take ResNet-101 He et al. (2016) pre-trained on ImageNet Deng et al. (2009) as the network backbone and DeepLabv3+ Chen et al. (2018) as the decoder. The segmentation head maps the 512-dim features into pixel-wise class predictions.

We take SGD as the optimizer, with an initial learning rate of 0.001 and a weight decay of 0.0001 for PASCAL VOC. The learning rate of the decoder is 10 times that of the network backbone. On Cityscapes, the initial learning rate is 0.01 and the weight decay is 0.0005. Poly scheduling is applied to the learning rate, $\eta = \eta_0 \cdot (1 - \frac{iter}{total\_iter})^{power}$, where $\eta_0$ is the initial learning rate, $iter$ is the current iteration, and $total\_iter$ is the total number of iterations. We train the model on multiple GPUs for both PASCAL VOC and Cityscapes. We use the same trade-off weight $\lambda$ between the losses on labeled and unlabeled data, the same smoothing coefficient $\beta$ in our re-weighting strategy, and the same EMA hyper-parameter $\alpha = 0.99$ (see Table 8) in all of our experiments. At the beginning of training, we train all three components (the gentle teaching assistant, the student, and the teacher) on labeled data for one epoch as a warm-up, following conventions Tarvainen and Valpola (2017), which enables a fair comparison with previous methods. Then we continue to train the model with our method. For pseudo labels, we abandon the pixels with lower confidence. We run each experiment 3 times with random seeds 0, 1, 2 and report the average results. Following previous works, input images are center cropped on PASCAL VOC during evaluation, while on Cityscapes, sliding-window evaluation is adopted. The mean Intersection over Union (mIoU) on the validation set serves as the evaluation metric.
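
For reference, the poly schedule above can be written as below; the exponent 0.9 is the common choice in DeepLab-style training and is an assumption here, since the exact value is not preserved in this copy. With PyTorch, the same schedule can be attached to an optimizer via torch.optim.lr_scheduler.LambdaLR using the multiplier (1 - cur_iter / total_iter) ** power.

def poly_lr(base_lr, cur_iter, total_iter, power=0.9):
    # lr = base_lr * (1 - cur_iter / total_iter) ** power
    return base_lr * (1.0 - cur_iter / total_iter) ** power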

Method 92 183 366 732 1464
SupOnly 45.77 54.92 65.88 71.69 72.50
MT Tarvainen and Valpola (2017) 51.72 58.93 63.86 69.51 70.96
CutMix French et al. (2019) 52.16 63.47 69.46 73.73 76.54
PseudoSeg Zou et al. (2021) 57.60 65.50 69.14 72.41 73.23
PC2Seg Zhong et al. (2021) 57.00 66.28 69.78 73.05 74.15
ST++ Yang et al. (2022) 65.23 71.01 74.59 77.33 79.12
U2PL Wang et al. (2022) 67.98 69.15 73.66 76.16 79.49
GTA-Seg (Ours) 70.02 ± 0.53 73.16 ± 0.45 75.57 ± 0.48 78.37 ± 0.33 80.47 ± 0.35
Table 1: Results on PASCAL VOC 2012, original training set. We have 1464 labeled images in total and sample different proportions of them as labeled training samples. SupOnly means training the model merely on the labeled data, with all the other unlabeled data abandoned. All the other images in the training set (including images in the augmented training set) are used as unlabeled data. We use ResNet-101 as the backbone and DeepLabv3+ as the decoder.
Method 662 1323 2645 5291
MT Tarvainen and Valpola (2017) 70.51 71.53 73.02 76.58
CutMix French et al. (2019) 71.66 75.51 77.33 78.21
CCT Ouali et al. (2020) 71.86 73.68 76.51 77.40
GCT Ke et al. (2020) 70.90 73.29 76.66 77.98
CPS Chen et al. (2021) 74.48 76.44 77.68 78.64
AEL Hu et al. (2021) 77.20 77.57 78.06 80.29
GTA-Seg (Ours) 77.82 ± 0.31 80.47 ± 0.28 80.57 ± 0.33 81.01 ± 0.24
Table 2: Results on PASCAL VOC 2012, augmented training set. We have 10582 labeled images in total and sample different proportions of them as labeled training samples. All the other images in the training set are used as unlabeled data. The notations and network architecture are the same as in Table 1.
Method 100 186 372 744
DMT Feng et al. (2022) 54.82 - 63.01 -
CutMix French et al. (2019) 55.73 60.06 65.82 68.33
ClassMix Olsson et al. (2021) - 59.98 61.41 63.58
Pseudo-Seg Zou et al. (2021) 60.97 65.75 69.77 72.42
DCC* Lai et al. (2021) 61.15 67.74 70.45 73.89
GTA-Seg (Ours) 62.95 ± 0.32 69.38 ± 0.24 72.02 ± 0.32 76.08 ± 0.25
Table 3: Results on the Cityscapes dataset. We have 2975 labeled images in total and sample different proportions of them as labeled training samples. The notations and network architecture are the same as in Table 1. * means that we reimplement the method with a ResNet-101 backbone for a fair comparison.

4.3 Experimental Results

PASCAL VOC 2012

We first evaluate our method on the original training set of PASCAL VOC 2012. The results in Table 1 validate that our method surpasses previous methods by a large margin. Specifically, our method improves the supervised-only (SupOnly) baseline by 24.25, 18.24, 9.69, 6.68, and 7.97 in mIoU when 92, 183, 366, 732, and 1464 images are labeled, respectively. Compared to the strongest previous semi-supervised semantic segmentation method in each setting, our method still leads by 2.04, 2.15, 0.98, 1.04, and 0.98, respectively. We note that in this setting the ratio of labeled data is relatively low (roughly 0.9% to 13.8% of all training images, counting the augmented set used as unlabeled data). The results therefore verify that our method is effective in utilizing unlabeled data in semi-supervised semantic segmentation.

We further compare our method with previous methods on the augmented training set of PASCAL VOC 2012, where the annotations are relatively low in quality since some of the labeled images come from the SBD Hariharan et al. (2011) dataset with coarse annotations. As shown in Table 2, our method consistently outperforms previous methods under a fair comparison.

Cityscapes

On Cityscapes, as shown in Table 3, our method again achieves competitive performance, improving upon the strongest previous method by 1.80, 1.64, 1.57, and 2.19 in mIoU when 100, 186, 372, and 744 images are labeled, respectively.

4.4 Analyses

Component Analysis

We analyze the effectiveness of the different components in our method, i.e., the original teacher-student framework, the gentle teaching assistant, and re-weighted pseudo labeling, as in Table 4. According to the results in Table 4, the carefully designed gentle teaching assistant mechanism (the third row) helps our method outperform previous methods, pushing the performance about 13 mIoU higher than the original teacher-student model (the second row). Further, the re-weighted pseudo labeling brings about another 1 mIoU improvement. With all of these components, our method outperforms the teacher-student model by over 14 and SupOnly by over 18 in mIoU.

Teacher-Student Gentle Teaching Assistant Re-weighted mIoU
- - - 54.92
✓ - - 58.93
✓ ✓ - 72.10
✓ ✓ ✓ 73.16
Table 4: Ablation study on the components in our method, on the original training set of PASCAL VOC 2012, with 183 labeled samples.
Method mIoU
SupOnly 54.92
Original EMA (all parameters) 64.07
Unbiased ST Chen et al. (2022) 65.92
EMA (Encoder) (Ours) 72.10
Table 5: Comparison of knowledge transmission mechanisms. The experiment settings follow Table 4.

Gentle Teaching Assistant

As shown in Table 4, our proposed gentle teaching assistant framework brings about remarkable performance gains. Motivated by this, we delve deeper into the gentle teaching assistant model in our framework. We first consider the representation knowledge transmission mechanism. In Table 5, we compare our mechanism with alternatives such as the original EMA Tarvainen and Valpola (2017), which updates all of the parameters via EMA, and Unbiased ST Chen et al. (2022), which introduces an additional agent to convey representation knowledge. All these mechanisms boost SupOnly remarkably, while our mechanism is superior to the others.

We next pay attention to the three models in our framework: the gentle teaching assistant model, the student model, and the teacher model. Table 6 reports their evaluation performance. All of them show relatively competitive performance. The gentle teaching assistant model is inferior to the student model, which is reasonable since it is only trained on pseudo labels, while the student model inherits the representation knowledge of unlabeled data from the gentle teaching assistant and is additionally trained on labeled data. The teacher model performs best, which agrees with previous works Tarvainen and Valpola (2017).

Method mIoU
Gentle Teaching Assistant 70.10
Student Model 72.71
Teacher Model 73.16
Table 6: Results of the three models on the original PASCAL VOC 2012. The experiment settings follow Table 4.
GTA Training Data Student Training Data mIoU
Labeled Data Pseudo Labels 66.71
Labeled Data + Pseudo Labels Labeled Data 72.28
Pseudo Labels Labeled Data 73.16
Table 7: Ablation study on our method design on the original PASCAL VOC 2012. The experiment settings follow Table 4.

Method Design

In our method, we train the GTA with pseudo labels and the student model with labeled data. It is interesting to explore the performance of other designs. Table 7 shows that 1) training the student model with pseudo labels causes a significant performance drop, which is consistent with our statement that the student model shall not learn from the pseudo labels directly; 2) incorporating labeled data in training the GTA is not beneficial to model performance. We conjecture that when we transmit the knowledge of labeled data from the GTA to the student model while also supervising the student model with labeled data, the limited labels dominate the updating of the student model, which possibly leads to overfitting and harms the student model's performance. Since the teacher model is updated purely from the student model via EMA, the performance of the teacher model is harmed as well. Considering that the ultimate goal is a higher performance of the teacher model, we choose to train the GTA with pseudo labels alone.

Re-weighting strategy

In our method, we design the re-weighting strategy for pseudo labels as in Eq. (6), which contains 1) confidence-based re-weighting and 2) Laplace Smoothing. Here we conduct a further ablation study on this design. Table 9 shows that, though effective in other tasks such as semi-supervised object detection Xu et al. (2021), adopting confidence-based re-weighting alone is harmful in our framework, dropping the performance from 72.10 to 70.67. In contrast, our full strategy, with the help of Laplace Smoothing Manning et al. (2010) to alleviate over-penalization, pushes the already strong performance to a higher level (73.16).

EMA hyper-parameter α mIoU warmup epochs mIoU
0.99 (reported) 73.16 1 (reported) 73.16
0.999 73.44 2 73.58
0.9999 73.57 3 73.39
Table 8: Performance under different EMA hyper-parameters and warmup epochs. The experiment settings follow Table 4.
Confidence-based Re-weighting Laplace Smoothing mIoU
- - 72.10
✓ - 70.67
✓ ✓ 73.16
Table 9: Ablation study on our re-weighting strategy for pseudo labeling on the original PASCAL VOC 2012. The experiment settings follow Table 4.

Hyper-parameter sensitivity

We evaluate the performance of our method under different EMA hyper-parameters and various warmup epochs. The results in Table 8 demonstrate that our method performs steadily under different hyper-parameters. In addition, the performance can be slightly enhanced if the hyper-parameters are tuned carefully.

Visualization

Besides quantitative results, we present visualization results to further analyze our method. We note that the model is trained on as few as 183 labeled samples and about 10,000 unlabeled samples. As shown in Figure 3, facing such limited labeled data, training the model merely in the supervised manner (SupOnly) appears vulnerable; in some cases, the model even fails to recognize the objects in the given images (the third and fourth rows). In contrast, methods that utilize unlabeled data (the teacher-student model and our method) show stronger performance. Furthermore, compared with the original teacher-student model, our method shows a stronger ability to determine a clear contour of objects (the first row) and to recognize the corresponding categories (the second row). Our method is also superior to previous methods in distinguishing objects from the background (the third and fourth rows).

Figure 3: Visualization results on PASCAL VOC 2012 with the original training set. We train the model with 183 labeled images; the other settings are the same as in Table 1. From left to right: the raw images, SupOnly (the model trained merely on labeled data), the teacher-student model, our method, and the ground truth.
Figure 4: Visualization results on PASCAL VOC 2012 with the original training set. We train the model with 183 labeled images; the other settings are the same as in Table 1. From left to right: the raw images, our method without the re-weighting mechanism, our method with re-weighting, and the ground truth.

In addition, we present more visualization results for our re-weighting strategy. We can observe from Figure 4 that incorporating the re-weighting strategy leads to better predictions on contours and ambiguous regions.

Limitations

One limitation of our method is the extra training cost introduced by the additional gentle teaching assistant model. Fortunately, inference efficiency is not affected, since only the teacher model is used for inference. On the other hand, our method only attempts to make better use of the unlabeled data, while little attention is paid to the labeled data. We consider it promising to research how to better leverage the labeled data in semi-supervised semantic segmentation.

5 Conclusion

In this paper, we propose a novel framework, Gentle Teaching Assistant, for semi-supervised semantic segmentation (GTA-Seg). Concretely, we attach an additional teaching assistant module to disentangle the effects of pseudo labels on the feature extractor and the mask predictor. The GTA learns representation knowledge from unlabeled data and conveys it to the student model via our carefully designed representation knowledge transmission. Through this framework, the model improves its representation with unlabeled data while being prevented from overfitting on the limited labeled data. A confidence-based pseudo label re-weighting mechanism is applied to further boost performance. Extensive experimental results prove the effectiveness of our method.

Acknowledgements.

This work is supported by GRF 14205719, TRS T41-603/20-R, Centre for Perceptual and Interactive Intelligence, CUHK Interdisciplinary AI Research Institute, and Shanghai AI Laboratory.

References

  • I. Alonso, A. Sabater, D. Ferstl, L. Montesano, and A. C. Murillo (2021) Semi-supervised semantic segmentation with pixel-level contrastive learning from a class-wise memory bank. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8219–8228. Cited by: §2.
  • V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (12), pp. 2481–2495. Cited by: §2.
  • D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel (2019) Mixmatch: a holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, Vol. 32. Cited by: §1, §2.
  • B. Chen, J. Jiang, X. Wang, J. Wang, and M. Long (2022) Debiased pseudo labeling in self-training. arXiv preprint arXiv:2202.07136. Cited by: §4.4, Table 5.
  • L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848. Cited by: §1, §2.
  • L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision, pp. 801–818. Cited by: §4.2.
  • X. Chen, Y. Yuan, G. Zeng, and J. Wang (2021) Semi-supervised semantic segmentation with cross pseudo supervision. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2613–2622. Cited by: §2, §4.1, Table 2.
  • M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223. Cited by: §A.1, §4.1.
  • I. Croitoru, S. Bogolin, and M. Leordeanu (2017) Unsupervised learning from video to detect foreground objects in single images. In IEEE International Conference on Computer Vision, pp. 4335–4343. Cited by: §3.1.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §4.2.
  • M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2), pp. 303–338. Cited by: §A.1, §4.1.
  • Z. Feng, Q. Zhou, Q. Gu, X. Tan, G. Cheng, X. Lu, J. Shi, and L. Ma (2022) Dmt: dynamic mutual training for semi-supervised learning. Pattern Recognition, pp. 108777. Cited by: Table 3.
  • G. French, S. Laine, T. Aila, M. Mackiewicz, and G. Finlayson (2019) Semi-supervised semantic segmentation needs strong, varied perturbations. In British Machine Vision Conference, Cited by: Table 1, Table 2, Table 3.
  • J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu (2019) Dual attention network for scene segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3146–3154. Cited by: §2.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. Advances in Neural Information Processing Systems 27. Cited by: §2.
  • Y. Grandvalet and Y. Bengio (2004) Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, Vol. 17. Cited by: §2.
  • C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330. Cited by: §1.
  • B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik (2011) Semantic contours from inverse detectors. In IEEE International Conference on Computer Vision, pp. 991–998. Cited by: §A.1, §4.1, §4.3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §A.2, §4.2.
  • H. Hu, F. Wei, H. Hu, Q. Ye, J. Cui, and L. Wang (2021) Semi-supervised semantic segmentation via adaptive equalization learning. In Advances in Neural Information Processing Systems, Vol. 34. Cited by: §2, Table 2.
  • Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu (2019) Ccnet: criss-cross attention for semantic segmentation. In IEEE International Conference on Computer Vision, pp. 603–612. Cited by: §2.
  • W. Hung, Y. Tsai, Y. Liou, Y. Lin, and M. Yang (2018) Adversarial learning for semi-supervised semantic segmentation. In British Machine Vision Conference, Cited by: §2.
  • M. S. Ibrahim, A. Vahdat, M. Ranjbar, and W. G. Macready (2020) Semi-supervised semantic image segmentation with self-correcting networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12715–12725. Cited by: §2.
  • Z. Ke, D. Qiu, K. Li, Q. Yan, and R. W. Lau (2020) Guided collaborative training for pixel-wise semi-supervised learning. In European Conference on Computer Vision, pp. 429–445. Cited by: §2, Table 2.
  • X. Lai, Z. Tian, L. Jiang, S. Liu, H. Zhao, L. Wang, and J. Jia (2021) Semi-supervised semantic segmentation with directional context-aware consistency. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1205–1214. Cited by: §2, Table 3.
  • S. Laine and T. Aila (2017) Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations, Cited by: §2.
  • D. Lee et al. (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In International Conference on Machine Learning Workshop, Vol. 3, pp. 896. Cited by: §1, §2, §3.1.
  • W. Liu, A. Rabinovich, and A. C. Berg (2015) Parsenet: looking wider to see better. arXiv preprint arXiv:1506.04579. Cited by: §2.
  • J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440. Cited by: §1, §2.
  • C. Manning, P. Raghavan, and H. Schütze (2010) Introduction to information retrieval. Natural Language Engineering 16 (1), pp. 100–103. Cited by: §3.2, §4.4.
  • R. Mendel, L. A. d. Souza, D. Rauber, J. P. Papa, and C. Palm (2020) Semi-supervised segmentation based on error-correcting supervision. In European Conference on Computer Vision, pp. 141–157. Cited by: §2.
  • S. Mittal, M. Tatarchenko, and T. Brox (2019) Semi-supervised semantic segmentation with high-and low-level consistency. IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (4), pp. 1369–1379. Cited by: §2.
  • T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. In IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 41, pp. 1979–1993. Cited by: §2.
  • H. Noh, S. Hong, and B. Han (2015) Learning deconvolution network for semantic segmentation. In IEEE International Conference on Computer Vision, pp. 1520–1528. Cited by: §2.
  • V. Olsson, W. Tranheden, J. Pinto, and L. Svensson (2021) Classmix: segmentation-based data augmentation for semi-supervised learning. In IEEE Winter Conference on Applications of Computer Vision, pp. 1369–1378. Cited by: Table 3.
  • Y. Ouali, C. Hudelot, and M. Tami (2020) Semi-supervised semantic segmentation with cross-consistency training. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 12674–12684. Cited by: §2, Table 2.
  • A. Rényi (1961) On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, Vol. 4, pp. 547–562. Cited by: §2.
  • M. Sajjadi, M. Javanmardi, and T. Tasdizen (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems, Vol. 29. Cited by: §2.
  • K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D. Cubuk, A. Kurakin, and C. Li (2020) Fixmatch: simplifying semi-supervised learning with consistency and confidence. In Advances in Neural Information Processing Systems, Vol. 33, pp. 596–608. Cited by: §1, §2.
  • N. Souly, C. Spampinato, and M. Shah (2017) Semi supervised semantic segmentation using generative adversarial network. In IEEE International Conference on Computer Vision, pp. 5688–5696. Cited by: §2.
  • K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5693–5703. Cited by: §2.
  • A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, Vol. 30. Cited by: §A.2, §1, §2, §3.1, §4.2, §4.4, §4.4, Table 1, Table 3.
  • Y. Wang, H. Wang, Y. Shen, J. Fei, W. Li, G. Jin, L. Wu, R. Zhao, and X. Le (2022) Semi-supervised semantic segmentation using unreliable pseudo-labels. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2, §3.1, §3.1, §3.2, Table 1.
  • C. Wei, K. Shen, Y. Chen, and T. Ma (2021) Theoretical analysis of self-training with deep networks on unlabeled data. In International Conference on Learning Representations, Cited by: §2.
  • Q. Xie, Z. Dai, E. Hovy, T. Luong, and Q. Le (2020) Unsupervised data augmentation for consistency training. In Advances in Neural Information Processing Systems, Vol. 33, pp. 6256–6268. Cited by: §2.
  • M. Xu, Z. Zhang, H. Hu, J. Wang, L. Wang, F. Wei, X. Bai, and Z. Liu (2021) End-to-end semi-supervised object detection with soft teacher. In IEEE International Conference on Computer Vision, pp. 3060–3069. Cited by: §1, §4.4.
  • L. Yang, W. Zhuo, L. Qi, Y. Shi, and Y. Gao (2022) ST++: make self-training work better for semi-supervised semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.1, §4.1, Table 1.
  • F. Yu and V. Koltun (2016) Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations, Cited by: §2.
  • J. Yuan, Y. Liu, C. Shen, Z. Wang, and H. Li (2021) A simple baseline for semi-supervised semantic segmentation with strong data augmentation. In IEEE International Conference on Computer Vision, pp. 8229–8238. Cited by: §2.
  • B. Zhang, Y. Wang, W. Hou, H. Wu, J. Wang, M. Okumura, and T. Shinozaki (2021) Flexmatch: boosting semi-supervised learning with curriculum pseudo labeling. In Advances in Neural Information Processing Systems, Vol. 34. Cited by: §1, §2.
  • H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal (2018) Context encoding for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7151–7160. Cited by: §2.
  • H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890. Cited by: §1, §2.
  • H. Zhao, Y. Zhang, S. Liu, J. Shi, C. C. Loy, D. Lin, and J. Jia (2018) Psanet: point-wise spatial attention network for scene parsing. In European Conference on Computer Vision, pp. 267–283. Cited by: §2.
  • Y. Zhong, B. Yuan, H. Wu, Z. Yuan, J. Peng, and Y. Wang (2021) Pixel contrastive-consistent semi-supervised semantic segmentation. In IEEE International Conference on Computer Vision, pp. 7273–7282. Cited by: §2, Table 1.
  • X. J. Zhu (2005) Semi-supervised learning literature survey. Cited by: §2.
  • Y. Zou, Z. Zhang, H. Zhang, C. Li, X. Bian, J. Huang, and T. Pfister (2021) Pseudoseg: designing pseudo labels for semantic segmentation. In International Conference on Learning Representations, Cited by: §2, §4.1, Table 1, Table 3.

Checklist


  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

    2. Did you describe the limitations of your work? See Section 4.4

    3. Did you discuss any potential negative societal impacts of your work?

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them?

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results?

    2. Did you include complete proofs of all theoretical results?

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? We promise to release the code and models upon acceptance of the paper.

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? See Section 

      4.2

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)?

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? See Section 4.2

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators? See Section 4.1

    2. Did you mention the license of the assets?

    3. Did you include any new assets either in the supplemental material or as a URL?

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix A

A.1 More implementation details

We take the images from PASCAL VOC 2012 Everingham et al. (2010) (http://host.robots.ox.ac.uk/pascal/VOC/voc2012/), SBD Hariharan et al. (2011) (http://home.bharathh.info/pubs/codes/SBD/download.html), and Cityscapes Cordts et al. (2016) (https://www.cityscapes-dataset.com/). The Cityscapes dataset is processed with the official scripts (https://github.com/mcordts/cityscapesScripts).

A.2 More analysis on representation knowledge transmission

The representation knowledge transmission in our gentle teaching assistant is conducted merely on the feature extractor. In our main paper, we view the network backbone (ResNet-101 He et al. (2016)) in the segmentation model as the feature extractor and the decoder (DeepLabv3+) as the mask predictor. Meanwhile, there are other variants of such a division, e.g., taking fewer or more layers as the feature extractor and all the remaining layers as the mask predictor. Here, we present the experimental results for these divisions.

Method | Feature Extractor: Structure (Param, M) | Mask Predictor: Structure (Param, M) | mIoU
Ours | ResNet-101 + Decoder.feature layers (60.9) | Decoder.classifier (3.6) | 70.67
Ours | ResNet-101 (main paper) (42.7) | Decoder (main paper) (21.8) | 73.16
Ours | ResNet-101.layer0,1,2,3 (27.7) | Decoder + ResNet-101.layer4 (36.8) | 68.41
Ours | ResNet-101.layer0,1,2 (1.5) | Decoder + ResNet-101.layer3,4 (63.0) | 66.23
Ours | ResNet-101.layer0,1 (0.3) | Decoder + ResNet-101.layer2,3,4 (64.2) | 62.11
Ours | ResNet-101.layer0 (0.1) | Decoder + ResNet-101.layer1,2,3,4 (64.4) | 60.88
Original EMA | ResNet-101 + Decoder (64.5) | - | 64.07
SupOnly | - | ResNet-101 + Decoder (64.5) | 54.92
Table 10: Results on PASCAL VOC 2012, original training set, with different divisions of the feature extractor and mask predictor in our method. We use ResNet-101 as the backbone and DeepLabv3+ as the decoder. We report the structure and the number of parameters of the feature extractor and the mask predictor, and denote the stem layer in ResNet-101 as layer0 for clarity. The experimental settings follow Table 4.

We note that when taking the whole ResNet-101 and decoder as the mask predictor (the last row in Table 10), our method reduces to the model trained only on labeled data (SupOnly). When they both act as the feature extractor, our representation knowledge transmission boils down to the original EMA update in Tarvainen and Valpola (2017). From Table 10, we can observe that, compared to SupOnly, conducting representation knowledge transmission consistently brings performance gains. When taking suitable layers (the first four rows of 'Ours'), our method achieves better performance than the original EMA. Among them, the most straightforward strategy (also the one in our main paper), which takes ResNet-101 as the feature extractor and the decoder as the mask predictor, performs best.

These experimental results demonstrate that 1) utilizing unlabeled data is crucial to semi-supervised semantic segmentation, 2) transmitting all the knowledge learned from the pseudo labels will mislead the model prediction, 3) our method, which only conveys the representation knowledge in the feature extractor, can alleviate the negative influence of unreliable pseudo labels, making use of unlabeled data in a better manner.