1 Introduction
Semi-supervised learning miyato2018virtual; sajjadi2016regularization; laine2016temporal; tarvainen2017mean; xie2019unsupervised; verma2019interpolation improves data efficiency in the low-data regime by relying on a limited amount of labeled images together with a massive amount of unlabeled images. Recent state-of-the-art methods, e.g., MeanTeacher tarvainen2017mean, FixMatch sohn2020fixmatch and Noisy Student xie2020self, share a similar philosophy and enjoy the merits of both pseudo labeling lee2013pseudo; xie2020self; pham2020meta and consistency regularization sajjadi2016regularization; laine2016temporal; tarvainen2017mean; miyato2018virtual; yun2019cutmix: the teacher model generates pseudo labels for weakly-augmented unlabeled data whereas the student model trains on the strongly-augmented counterparts, as shown in Figure 1a. Such a strategy can also be applied to semantic segmentation, and the resulting approaches zou2020pseudoseg; french2020semi; olsson2021classmix; yuan2021simple; ouali2020semi; ke2020three; ke2020guided; fenga2020dmt have demonstrated outstanding performance.
However, pseudo labeling methods are plagued by confirmation bias arazo2020pseudo, i.e., the student model is prone to overfitting the erroneous pseudo labels. A few recent works set out to address this issue, either by estimating the pseudo label uncertainty zheng2021rectifying; rizve2021defense or by directly rectifying the pseudo labels mendel2020semi; zhang2021prototypical. Recent work pham2020meta offers impressive performance on the ImageNet benchmark; its key idea is to inform the teacher of the feedback from the student model so that better pseudo labels can be generated (shown in Figure 1b). Mutual learning ke2020guided; fenga2020dmt, on the other hand, trains two models in parallel that play the dual role of teacher and student (shown in Figure 1c). The networks suppress the noisy knowledge of the counterpart and collaboratively improve in the course of learning. Nonetheless, both networks may end up learning homogeneous knowledge and suffer from coupled noise that hinders further improvement of the two learners. Besides, the pseudo label noise is suppressed purely by relying on the knowledge from the peer network, overlooking the network’s own ability to improve the pseudo labels.
In this work, we propose robust mutual learning, which ameliorates the coupling issue of mutual learning and introduces self-rectification for pseudo label refinement. First, we observe that it is crucial to ensure heterogeneous knowledge during training so that the model coupling can be reduced. We comprehensively examine the factors that alleviate the issue: 1) we compute the pseudo labels from the mean teacher tarvainen2017mean, which not only yields more reliable pseudo labels but also reduces the coupling issue since the learners no longer interact directly; 2) we further explore factors that reduce the coupling of the mutual learners and find that strong data augmentations and model noises (e.g., dropout srivastava2014dropout and stochastic depth huang2016deep) are essential to let the models learn incoherent knowledge; 3) we also investigate heterogeneous network architectures (e.g., convolutional neural network versus transformer dosovitskiy2020image; zheng2020rethinking) and demonstrate their effectiveness.
Meanwhile, besides the refinement from the peer network, we propose to explicitly denoise the pseudo labels prior to the mutual learning so that only robust knowledge is propagated throughout the framework. Specifically, we draw inspiration from zhang2021prototypical and propose to refine pseudo labels according to the class-wise prediction confidence, estimated from the relative feature distances to the class centroids. Figure 1 shows the comparison of our framework and prior works. Our framework has two distinct schemes for label refinement: self-rectification leverages the model’s own knowledge to refine the pseudo labels, and mutual teaching filters the noisy knowledge with the help of the heterogeneous peer network. Both schemes collaboratively improve the pseudo labels during training, leading to SSL models with a narrowed gap to supervised learning.
Extensive experiments show that the proposed robust mutual learning outperforms prior competitive approaches. On the Cityscapes dataset, our method performs best under all the SSL settings. In particular, our method shows a clear advantage in the extremely low-data regime: with only 100 labeled images, it reaches 58.25 mIoU, compared with 54.8 for the state-of-the-art method fenga2020dmt (Table 1). Our method also demonstrates a performance advantage on the PASCAL VOC 2012 dataset. Notably, our method can offer performance on par with supervised learning while using only a fraction of the labeled data.
Our main contributions can be summarized as follows. 1) We identify the coupling issue of existing mutual learning works and thoroughly examine factors that can be used to ameliorate it, leading to improved performance thanks to the induced heterogeneous knowledge learning. 2) We propose to leverage the model’s internal knowledge besides the external knowledge from peer networks: the pseudo labels are further rectified according to the feature distances to class centroids, which boosts the performance further. 3) The proposed robust mutual learning refines the pseudo labels online more effectively and achieves state-of-the-art performance on several commonly used semantic segmentation datasets.
2 Related Works
The key to semi-supervised segmentation lies in how to leverage the large unlabeled dataset to derive additional training signals. Along this direction, prior works can be roughly categorized into five classes: generative learning, contrastive learning, entropy minimization, consistency regularization and pseudo labeling. We briefly review the last two categories, which are most related to our work and widely incorporated in current state-of-the-art methods.
Consistency regularization. Consistency regularization enforces the model to make consistent predictions under various perturbations. Its effectiveness rests on the smoothness assumption or the cluster assumption, namely that data points close to each other are likely to be from the same class. This assumption often holds in classification tasks and has motivated a large body of research xie2019unsupervised; sohn2020fixmatch; berthelot2019remixmatch; sajjadi2016regularization; tarvainen1780weight; berthelot2019mixmatch; kim2021selfmatch; li2018semi; perone2018deep. As for semantic segmentation, it is observed in french2020semi; ouali2020semi that the cluster assumption is violated. Therefore, Ouali et al. ouali2020semi propose to perturb the encoder’s outputs, where the cluster assumption is maintained, and leverage multiple auxiliary decoders to give consistent predictions. French et al. french2020semi find that mask-based augmentation strategies are effective and introduce an adapted version of the popular CutMix technique french2020semi. The idea of CutMix is to mix samples by replacing an image region with a patch from another image, which can be regarded as an extension of Cutout devries2017improved and Mixup zhang2017mixup, and is further extended in recent works olsson2021classmix; chen2021mask; french2019consistency. Our approach also adopts the idea of CutMix french2020semi to enforce consistency between the mixed outputs and the prediction over the mixed inputs.
Pseudo labeling.
Pseudo labeling, or self-training, is a typical technique for leveraging unlabeled data by alternating between pseudo label prediction and feature learning, which encourages the model to make confident predictions for unlabeled data. This technique has been successfully and widely applied to many tasks such as natural language processing yarowsky1995unsupervised; he2019revisiting; kahn2020self; park2020improved, image classification lee2013pseudo; zhai2019s4l; zoph2020rethinking; xie2020self; yalniz2019billion as well as semantic segmentation feng2020semi; chen2020leveraging; zoph2020rethinking. The success of pseudo labeling hinges on the quality of the pseudo labels. Most models lee2013pseudo; tarvainen2017mean; sohn2020fixmatch; xie2020self refine the pseudo labels using external guidance, e.g., a teacher model. However, the teacher model is usually fixed, making the student inherit some inaccurate predictions from the teacher. Recent efforts turn to updating the teacher along with the student in order to generate better pseudo labels, e.g., co-teaching yu2019does; han2018co, dual student ke2019dual, meta pseudo label pham2020meta, mutual training ke2020guided; fenga2020dmt and so on. Such a mutual learning strategy, however, may end up learning homogeneous knowledge and hence fail to provide complementary information for each other.
In this paper, we point out that learning heterogeneous knowledge in mutual learning is of great importance but is rarely noticed. A closely related work ge2020mutual also adopts mean teaching for unsupervised domain adaptation in person re-identification. Our work differs in two aspects: 1) we observe the importance of learning heterogeneous knowledge in semi-supervised learning, and mean teaching is one of the proposed potential solutions; 2) besides mutual pseudo label refinement, we also explore internal guidance to improve the quality of the pseudo labels.
3 Preliminary
Semi-supervised learning (SSL) for semantic segmentation learns from a pixel-wise labeled dataset in conjunction with unlabeled data. Each labeled training image is paired with a ground-truth map of the same resolution in which every pixel is assigned one of the classes. The resulting segmentation network can be regarded as the composition of a feature extractor and a linear classifier.
Recent state-of-the-art approaches follow a similar paradigm: the network is first trained on the labeled images with weak augmentations by minimizing the standard cross-entropy, i.e.,
(1) 
Then, for unlabeled images, the “hard” pseudo labels can be generated, each of which is a one-hot vector converted from the soft prediction. (Footnote 1: Recent works find that hard pseudo labels can lead to stronger performance. Our experiment in the supplementary material also supports this.) The prediction for the strongly augmented images should match the pseudo labels. Also, unreliable pseudo labels are discounted when the confidence falls below a predefined threshold, so the unsupervised objective for unlabeled images can be formulated as:
(2)
Here, strong augmentations are applied to the student’s input so that the student learns in a harder way.
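The thresholded objective described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the threshold value of 0.9, and the choice to average over the retained pixels are all hypothetical, since the source does not give them.

```python
import math

def hard_pseudo_label(teacher_probs):
    """Return (class index, confidence) for one pixel's soft prediction."""
    confidence = max(teacher_probs)
    return teacher_probs.index(confidence), confidence

def masked_pseudo_label_loss(teacher_probs_per_pixel, student_probs_per_pixel,
                             threshold=0.9):
    """Cross-entropy against hard pseudo labels, skipping low-confidence pixels.

    teacher_probs_per_pixel: soft predictions on the weakly augmented image.
    student_probs_per_pixel: soft predictions on the strongly augmented image.
    threshold: assumed value for illustration only.
    """
    total, kept = 0.0, 0
    for t_probs, s_probs in zip(teacher_probs_per_pixel, student_probs_per_pixel):
        label, confidence = hard_pseudo_label(t_probs)
        if confidence < threshold:
            continue  # unreliable pseudo label: contributes nothing to the loss
        total += -math.log(max(s_probs[label], 1e-12))
        kept += 1
    # Averaging over retained pixels is an assumption of this sketch.
    return total / kept if kept else 0.0
```

In this sketch a pixel either contributes its full cross-entropy term or nothing, which is one simple way to read "discounted" in the text.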
To better suppress the erroneous pseudo labels, mutual learning ke2020guided; fenga2020dmt adopts two differently initialized models, each of which generates its own pseudo labels for the other. The same supervised loss as Equation 1 is used, whereas the unsupervised loss becomes:
(3) 
The incorrect pseudo labels are gradually filtered by the peer network han2018co so both of the networks collaboratively improve during training.
4 Robust Mutual Learning
In this section, we propose robust mutual learning, which suffers less from the coupling issue. We also introduce pseudo label self-rectification to further enhance the supervision quality for unlabeled data.
4.1 Techniques for Model Coupling Reduction
Indirect mutual learning with mean teachers.
Mutual learners possess different learning abilities due to different initialization and can refine the pseudo labels from the counterpart online. However, they may quickly converge to the same state and thereby suffer from coupled noise. Model coupling in the early training phase reduces mutual learning to self-training. We conjecture that the coupling issue is caused by the direct interaction between the mutual learners. Therefore, we propose to transfer the knowledge in an indirect manner. Specifically, we generate pseudo labels from the mean teachers tarvainen2017mean, i.e., the exponential moving averages (EMA) of the training models, so that each learner receives supervision from the peer mean teacher. With a moving-average model maintained for each of the two learners, the one-hot mutual supervisions in Equation 3 become:
(4) 
Since each model learns from the temporal ensemble knowledge of the counterpart rather than its immediate state, the mutual learners are kept diverged and reach a consensus more gradually compared to direct mutual learning. To verify that indirect mutual learning slows down model coupling, we train mutual MLP networks on the MNIST dataset, using only a subset of the images as labeled data for the SSL setting. We measure the total variation distance between the softmax outputs of the two teacher networks during training. Figure 2 shows that indirect teaching using mean teachers indeed leads to more divergent mutual learners, whereas direct mutual learning causes model collapse (the total variation distance approaches zero) in the early training phase.
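The two ingredients of this experiment, the EMA teacher and the total variation distance, can be illustrated as follows. This is a sketch under assumptions: parameters are flattened into plain lists, the EMA momentum of 0.99 is a hypothetical value not stated in the text, and the total variation distance is taken in its usual form as half the L1 distance between two discrete distributions.

```python
def ema_update(teacher_params, student_params, momentum=0.99):
    """Move each teacher parameter toward the student: t <- m * t + (1 - m) * s.

    The momentum value is an assumption for illustration.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

def total_variation_distance(p, q):
    """Half the L1 distance between two discrete probability distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
```

A distance near zero between the two teachers' softmax outputs indicates that the learners have coupled, which is the collapse the paragraph above describes.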
In practice, each learner in our framework simultaneously learns from two supervisions (Figure 1d): the peer-supervision given by the peer mean teacher and the self-supervision provided by its own mean teacher. We empirically find that learning from such ensemble knowledge further improves the SSL performance. Therefore, the loss function for training the two models is defined as,
(5) 
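As a rough per-pixel illustration of learning from both supervisions, the sketch below sums a peer-supervision term and a self-supervision term. The equal weighting of the two terms, the omission of confidence thresholding, and all function names are assumptions of this sketch; the exact form of Equation 5 is not recoverable from the text.

```python
import math

def argmax(probs):
    """Index of the largest entry, i.e. the hard (one-hot) pseudo label."""
    return max(range(len(probs)), key=probs.__getitem__)

def cross_entropy(student_probs, label):
    """Negative log-probability the student assigns to the given label."""
    return -math.log(max(student_probs[label], 1e-12))

def learner_loss(student_probs, own_teacher_probs, peer_teacher_probs):
    """Unsupervised loss for one learner at one pixel.

    Sums the loss against the peer mean teacher's label and the loss against
    the learner's own mean teacher's label (equal weights are assumed).
    """
    peer_label = argmax(peer_teacher_probs)
    self_label = argmax(own_teacher_probs)
    return (cross_entropy(student_probs, peer_label)
            + cross_entropy(student_probs, self_label))
```

The second learner would use the same function with the roles of the two mean teachers swapped.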
Data augmentation and model noises.
Indirect mutual learning alleviates the model coupling to some extent but still requires differently initialized mutual learners ke2020guided; fenga2020dmt. In contrast, we inject noise, including input noise (i.e., data augmentation) and model noise, to perturb the mutual learners and ensure their divergence throughout the training process. The data augmentations applied for the student model include the photometric augmentations in RandAugment cubuk2020randaugment. On the other hand, the model noises, including dropout srivastava2014dropout and stochastic depth huang2016deep, bring architectural perturbations. The benefit of the noise injection is twofold. First, it forces the student model to learn in a hard way, which has been shown to be crucially important for self-training xie2020self. Second, these perturbations prevent the mutual learners from closely tracking the state of the counterpart, hence the chance of model coupling is greatly reduced. In addition, we adopt CutMix yun2019cutmix, a kind of data augmentation that stitches training images based on a binary mask, to improve consistency learning.
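The mask-based stitching that CutMix performs can be sketched on small nested lists. This is a minimal illustration assuming a single rectangular mask; the helper names `box_mask` and `cutmix` are hypothetical, and how the box is sampled is left out.

```python
def box_mask(height, width, top, left, box_h, box_w):
    """Binary mask with ones inside a rectangle and zeros elsewhere."""
    return [[1 if top <= r < top + box_h and left <= c < left + box_w else 0
             for c in range(width)]
            for r in range(height)]

def cutmix(a, b, mask):
    """Take pixels from `b` where the mask is 1 and from `a` where it is 0."""
    return [[b_px if m else a_px
             for a_px, b_px, m in zip(row_a, row_b, row_m)]
            for row_a, row_b, row_m in zip(a, b, mask)]
```

Applying the same mask to two images and to their pseudo label maps gives a mixed input and a matching mixed target, which is the consistency between "mixed outputs" and "prediction over the mixed inputs" referred to in Section 2.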
In particular, we study the effectiveness of model noise using the MNIST experiment. As shown in Figure 2, the divergence of the mutual learners during training is clearly improved. Hence, the students consistently obtain complementary knowledge and collaboratively attain higher performance.
Heterogeneous architecture.
Last but not least, besides injecting model noise to bring architectural perturbations, another straightforward way to ensure architectural difference is to directly adopt distinct architectures. Rather than using convolutional neural networks (CNN) with dissimilar backbones, we investigate the Transformer vaswani2017attention, which has shown great promise in vision tasks. Since the Transformer is able to capture long-range dependencies while the CNN is good at modelling local context, a heterogeneous architecture with a CNN and a Transformer naturally fits the mutual learning framework, as the students learn complementary knowledge. To validate our hypothesis, we experiment with mutual learning between a CNN-based Deeplabv2 model and SETR zheng2020rethinking, a transformer-based segmentation network. The results in Section 5 show that both networks, even without any noise injection, achieve significant gains from mutual learning.
4.2 Pseudo Label Self-rectification
Mutual learning heavily relies on the peer network but neglects the fact that the network itself can also identify suspicious pseudo labels. Since the pseudo labels are generated by the supervised model, whose prediction confidence can be measured using the maximum softmax probability, Monte-Carlo dropout or learned confidence estimates devries2018leveraging; poggi2020uncertainty, a natural idea is to cultivate the network’s own ability to spot and suppress unreliable pseudo labels.
Pseudo label correction.
To this end, we do not rely on the above uncertainty estimation techniques, as they require a predefined threshold to remove the unreliable pseudo labels. Instead, we aim to rectify the pseudo labels online so that they serve as progressively improved supervisions. Contrary to the peer-rectification, the self-rectification does not rely on any external knowledge. Such self-rectification, which stems from zhang2021prototypical, rectifies the pseudo labels according to a class-wise confidence that is estimated online. Formally, consider the soft pseudo label of a pixel, each entry of which is the probability of one class given by the network output. We keep the initial state of this class probability and dynamically reweight it according to the class-wise confidence. Hence, the hard pseudo label is updated online as:
(6) 
Such rectified pseudo labels are further used as supervisions when optimizing Equation 5, so that the students can always learn from denoised knowledge.
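One way to read the rectification step is as an element-wise reweighting of the stored class probabilities followed by an argmax. The sketch below follows that reading; the multiplicative form and the function name are assumptions, since the body of Equation 6 is not available in the text.

```python
def rectify_pseudo_label(initial_probs, class_confidence):
    """Reweight the stored soft prediction by class-wise confidence, then argmax.

    initial_probs: the initial class probabilities for one pixel.
    class_confidence: one confidence weight per class, estimated online.
    The element-wise product is an assumed reading of the reweighting.
    """
    weighted = [p * w for p, w in zip(initial_probs, class_confidence)]
    return max(range(len(weighted)), key=weighted.__getitem__)
```

For instance, a pixel whose initial prediction narrowly favours class 0 is relabeled as class 1 when its feature is much more confidently associated with class 1, while uniform confidence leaves the label unchanged.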
Class-wise confidence estimation.
Now the problem reduces to calculating the class-wise confidence. To achieve this, we model the clustered feature space as an exponential family mixture:
(7) 
where each cluster center, or prototype, serves as the representative of its cluster, and each mixture component carries a mixture weight. Then, the posterior probability of the cluster assignment for a given feature can be computed as,
(8)
which is exactly the class-wise confidence needed for the pseudo label correction. When the mixture components are equally weighted and the Euclidean distance is adopted, the class-wise confidence is essentially the softmax over the distances to the different prototypes, i.e.,
(9) 
Here we use the feature space extracted by the mean teacher. Intuitively, the confidence that a feature belongs to a given class is measured by how much closer it is to that class’s prototype than to the other prototypes.
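The softmax over prototype distances can be sketched as follows. The sign convention (negative distance, so that closer prototypes receive higher confidence) follows the intuition stated above; the `temperature` parameter and the function names are assumptions added for illustration and are not taken from the text.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def classwise_confidence(feature, prototypes, temperature=1.0):
    """Softmax over negative distances from one feature to every prototype."""
    logits = [-euclidean(feature, p) / temperature for p in prototypes]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The returned list sums to one and can be used directly as the per-class weights in the rectification step.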
Prototype calculation.
Ideally, prototypes should be computed using all the training images in each iteration, i.e.,
(10) 
where the index in Equation 10 denotes the pixel position. However, this is computationally prohibitive for online training. As an approximation, we compute the moving average:
(11) 
where the current-batch term is the mean feature of each class in the current batch, and the momentum is set to 0.9999. During training, this moving average estimate converges to the true cluster center as the features gradually evolve. Note that our approach is similar to clustering-based unsupervised learning caron2018deep; caron2020unsupervised, which computes the cluster centers online, but we use the prototypes to rectify the soft pseudo labels rather than to produce cluster assignments for subsequent representation learning. Another benefit of using prototypes is that all the classes are treated equally regardless of their frequency, so the segmentation suffers less from the class-imbalance issue.
The maintained prototypes serve as the network’s knowledge of the underlying feature clusters. The pseudo label rectification using such self-knowledge complements the denoising capability of the peer network, and both schemes jointly refine the pseudo labels during training, leading to a significant performance boost for semi-supervised learning.
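The moving-average prototype update can be sketched as below. The momentum of 0.9999 is the value given in the text; leaving a prototype untouched when its class is absent from the batch, and the function names, are assumptions of this sketch.

```python
def batch_class_means(features, labels, num_classes):
    """Mean feature per class in the current batch; None for absent classes."""
    dims = len(features[0])
    sums = [[0.0] * dims for _ in range(num_classes)]
    counts = [0] * num_classes
    for feature, k in zip(features, labels):
        counts[k] += 1
        for d in range(dims):
            sums[k][d] += feature[d]
    return [[s / counts[k] for s in sums[k]] if counts[k] else None
            for k in range(num_classes)]

def update_prototypes(prototypes, batch_means, momentum=0.9999):
    """Prototype <- momentum * prototype + (1 - momentum) * batch mean."""
    updated = []
    for proto, mean in zip(prototypes, batch_means):
        if mean is None:
            updated.append(list(proto))  # class absent: keep as is (assumption)
        else:
            updated.append([momentum * p + (1.0 - momentum) * m
                            for p, m in zip(proto, mean)])
    return updated
```

With a momentum this close to one, each batch moves a prototype only slightly, which is consistent with the later remark that many iterations are needed when each image contains few classes.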
5 Experiments
5.1 Dataset and Evaluation Protocol
Datasets.
We conduct experiments on two commonly used benchmarks for semantic segmentation.

Cityscapes. This is a finely annotated urban scene dataset. Its training set consists of 2975 images (the "Full" setting in Table 1). Following french2020semi; hung2019adversarial, we downsample the images and randomly crop them during training.

PASCAL VOC 2012. The standard dataset comprises object classes plus a background class. As is common practice, we also obtain augmented images using hung2019adversarial. Images are randomly rescaled and cropped for training.
Evaluation protocol.
We evaluate the SSL performance when learning from varied amounts of labeled images on the two datasets. For the Cityscapes dataset, we randomly sample 100, 372 and 744 images to construct the labeled data and regard the rest of the training images as unlabeled data. For PASCAL VOC, we use 1/100, 1/50, 1/20 and 1/8 of the images, drawn from the standard training set, as labeled data, while the remaining images in the standard training set together with the augmented images constitute the unlabeled data. For a fair comparison, we use the same data split as french2019semi when subsampling the labeled data. We measure the quantitative performance using mIoU, i.e., the mean of class-wise intersection over union.
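For reference, mIoU over flattened prediction and ground-truth label lists can be computed as in the sketch below. Skipping classes that appear in neither list is an assumption of this sketch; benchmark evaluations typically accumulate intersections and unions over the whole validation set before averaging.

```python
def mean_iou(predictions, targets, num_classes):
    """Mean of class-wise intersection over union for flattened label lists."""
    ious = []
    for k in range(num_classes):
        inter = sum(1 for p, t in zip(predictions, targets) if p == k and t == k)
        union = sum(1 for p, t in zip(predictions, targets) if p == k or t == k)
        if union:  # ignore classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```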
5.2 Implementation Details
In our experiments, the CNN-based segmentation network adopts the Deeplabv2 chen2017deeplab architecture with ResNet-101 he2016deep as the backbone. We apply the photometric augmentations within RandAugment cubuk2020randaugment as well as CutMix yun2019cutmix for data augmentation. In terms of model noise, we introduce dropout srivastava2014dropout only for the last FC layer, and apply stochastic depth huang2016deep with a layer-wise survival probability to the residual blocks of the backbone. The proposed framework is optimized with the SGD solver, and the learning rate follows a polynomial decay. We adopt a stage-wise training strategy, i.e., once the mutual learning converges we recompute the pseudo labels and then start the next training stage. In our experiments, we observe performance saturation after training two stages. The number of iterations per stage and the batch size are set separately for Cityscapes and PASCAL VOC. For both datasets, we initialize the backbone with ImageNet deng2009imagenet pretrained weights. Besides, we also use the pretrained weights from the MS COCO dataset lin2014microsoft to initialize the model specifically for PASCAL VOC, as suggested by hung2018adversarial; ke2020three. Training is performed on Tesla V100 GPUs. Our method is implemented with PyTorch
paszke2019pytorch and we plan to make the code publicly available.

Table 1. Cityscapes (mIoU); columns give the number of labels.
| Methods (ImageNet init.) | 1/30 (100) | 1/8 (372) | 1/4 (744) | Full (2975) |
| --- | --- | --- | --- | --- |
| Deeplabv2 chen2017deeplab | - | 59.3 | 61.9 | 66.0 |
| AdvSSL hung2019adversarial | - | 57.1 | 60.5 | 66.2 |
| S4GAN mittal2019semi | - | 59.3 | 61.9 | 65.8 |
| CutMix french2020semi | 51.20±2.29 | 60.34±1.24 | 63.87±0.71 | 67.68±0.37 |
| CowMix french2019semi | 49.01±2.58 | 60.53±0.29 | 64.10±0.82 | - |
| ClassMix olsson2021classmix | 54.07±1.61 | 61.35±0.62 | 63.63±0.33 | - |
| ECS mendel2020semi | - | 60.26±0.84 | 63.77±0.65 | - |
| DMT fenga2020dmt | 54.8 | 63.0 | - | 68.2 |
| Deeplabv2 (reimplement) | 45.40±1.04 | 56.08±0.27 | 60.91±0.62 | 67.14±0.32 |
| Ours | 58.25±1.36 | 63.97±0.46 | 66.15±0.06 | - |
Figure 3. Qualitative results. Columns: Image, CutMix french2020semi, Robust mutual learning, Ground truth.
5.3 Comparisons with Previous Works
We comprehensively compare our method with state-of-the-art methods, which can be categorized as: 1) adversarial-based methods, e.g., AdvSSL hung2019adversarial and S4GAN mittal2019semi; 2) methods that encourage consistency, e.g., ICT verma2019interpolation, CutMix french2020semi, CowMix french2019semi and ClassMix olsson2021classmix; 3) two recent works that also collaboratively train networks, i.e., ECS mendel2020semi and DMT fenga2020dmt. For fair comparison, we cite the performance reported in their original papers under the same SSL setting.
Table 1 shows the quantitative comparison on the Cityscapes dataset. Our method outperforms the prior leading approaches in every semi-supervised setting. The performance advantage becomes more evident when fewer labeled images are used for training. In particular, with 1/30 (100) labeled images, our method reaches 58.25 mIoU, compared with 54.8 for the prior best approach. Notably, our training on 1/4 (744) labeled images gives 66.15 mIoU, on par with the fully supervised Deeplabv2 results in Table 1, which demonstrates the data efficiency of our approach. Besides, compared with DMT, which also adopts mutual learning, our robust mutual learning is more effective at denoising the pseudo labels and hence yields higher performance. We visualize the qualitative results in Figure 3. Compared with CutMix, our method excels at discriminating tiny objects and resolving ambiguity, giving results more consistent with the ground truth.
When conducting experiments on PASCAL VOC, we compare different methods using two types of initialization. Table 2 shows the performance using the ImageNet initialization. We improve the mIoU over the best prior result from 66.48 to 68.82 and from 67.60 to 70.34 for 1/20 and 1/8 labeled images, respectively. In the lower-data regimes, our method achieves almost the same performance as the prior leading approach. On the other hand, the COCO dataset contains images with content more similar to PASCAL VOC, so the COCO initialization generally leads to better SSL performance. As shown in Table 3, our method yields superior mIoU scores in all the data regimes.
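Table 2. PASCAL VOC (mIoU), ImageNet initialization; columns give the number of labels.
| Methods (ImageNet init.) | 1/100 | 1/50 | 1/20 | 1/8 | Full (10582) |
| --- | --- | --- | --- | --- | --- |
| Deeplabv2 chen2017deeplab | - | 48.3 | 56.8 | 62.0 | 70.7 |
| AdvSSL hung2019adversarial | - | 49.2 | 59.1 | 64.3 | 71.4 |
| S4GAN mittal2019semi | - | 60.4 | 62.9 | 67.3 | 73.2 |
| ICT verma2019interpolation | 35.82 | 46.28 | 53.17 | 59.63 | 71.50 |
| CutMix french2020semi | 53.79 | 64.81 | 66.48 | 67.60 | 72.54 |
| Deeplabv2 (reimplement) | 36.25 | 46.30 | 55.92 | 63.73 | 72.62 |
| Ours | 53.77 | 64.63 | 68.82 | 70.34 | - |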
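Table 3. PASCAL VOC (mIoU), COCO initialization; columns give the number of labels.
| Methods (COCO init.) | 1/100 | 1/50 | 1/20 | 1/8 | Full (10582) |
| --- | --- | --- | --- | --- | --- |
| Deeplabv2 chen2017deeplab | - | 53.2 | 58.7 | 65.2 | 73.6 |
| AdvSSL hung2019adversarial | - | 57.2 | 64.7 | 69.5 | 74.9 |
| S4GAN mittal2019semi | - | 60.9 | 66.4 | 69.5 | 73.9 |
| DMT fenga2020dmt | 63.04 | 67.15 | 69.92 | 72.70 | 74.75 |
| ECS mendel2020semi | - | - | - | 72.95 | - |
| ClassMix olsson2021classmix | 54.18 | 66.15 | 67.77 | 71.00 | - |
| Ours | 63.60 | 68.66 | 71.19 | 73.38 | - |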
5.4 Discussions
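Table 4. Ablation study (mIoU).
| Supervised baseline | IML | IML (+ model noises) | IML (+ input noises) | IML (+ full noises) | RML | RML (two stages) |
| --- | --- | --- | --- | --- | --- | --- |
| 56.03 | 60.09 | 60.51 | 60.49 | 61.25 | 62.85 | 63.65 |

Table 5. Mutual learning with different architectures (mIoU).
| Architecture | mIoU |
| --- | --- |
| Deeplabv2 baseline | 54.52 |
| SETR zheng2020rethinking baseline | 57.79 |
| Deeplabv2-Deeplabv2 IML | 61.25 / 61.53 |
| SETR-SETR IML | 64.15 / 63.74 |
| Deeplabv2-SETR IML | 63.96 / 64.35 |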
Ablation study.
We quantify the effectiveness of each proposed component in Table 4. Indirect mutual learning (IML) improves the mIoU of the supervised baseline from 56.03 to 60.09. The model noise and input noise are essential for reducing the mutual coupling and jointly raise the mIoU further to 61.25. Moreover, with the pseudo label self-rectification, the single-stage robust mutual learning (RML) reaches 62.85 mIoU, and the second training stage achieves even higher performance (63.65). Overall, the improvement of our two-stage RML is substantial, with an mIoU gain of 7.62 over the supervised baseline.
The online refinement of pseudo labels.
The key motivation of our robust mutual learning is to refine the pseudo labels online so that better supervisions can be obtained for self-training. Figure 4 visualizes how the pseudo labels progress in the course of training. Compared with MeanTeacher tarvainen2017mean, direct mutual learning produces better pseudo labels, as noise can be gradually suppressed by the peer network. In comparison, indirect mutual learning reduces the mutual coupling and thus leads to more effective co-teaching. Finally, much cleaner pseudo labels are produced by robust mutual learning thanks to the strong denoising capability of the self-rectification scheme.
Heterogeneous architecture.
To validate the hypothesis that a heterogeneous architecture is effective in reducing the model coupling, we apply a CNN and a Transformer network within the mutual learning framework. Specifically, the student models adopt the Deeplabv2 and SETR zheng2020rethinking architectures respectively. We study this hybrid architecture with neither noise injection nor self-rectification. Table 5 gives the results, and we highlight several interesting findings. First, compared to the baselines, all the architectural combinations benefit largely from mutual learning, showing the generality of the method. Second, the mutual learning between Deeplabv2 and SETR demonstrates the best performance. Notably, Deeplabv2 reaches 63.96 mIoU when mutually trained with SETR, compared with 61.25 / 61.53 when mutually trained with the same architecture. On the other hand, instead of being pulled down by Deeplabv2, the Transformer-based SETR obtains useful peer-supervision and achieves the highest performance in such hybrid mutual learning.
Limitation.
While our method has shown state-of-the-art performance, the performance advantage on the PASCAL VOC dataset is not as pronounced as on Cityscapes. We conjecture that this is because the PASCAL VOC dataset is not finely annotated and its background class is not clean. Thus, the prototype computed for this class may not be accurate. Besides, it takes many iterations to update the class-wise prototypes, since each image in this dataset contains only a few classes. We believe there is still room for further improvement of our RML, which we leave for future work.
6 Conclusion
In this paper, we identify the model coupling issue in current mutual learning works and comprehensively explore factors that alleviate it, including aggressive input augmentation, model perturbation and heterogeneous architectures for the mutual learners. To enhance the accuracy of peer supervision, we propose a self-rectification scheme that explicitly refines the pseudo labels by leveraging the self-knowledge of the network. The resulting framework, termed robust mutual learning, is able to generate better pseudo labels while simultaneously conducting self-training and mutual learning. Our approach is a further step towards narrowing the gap with supervised learning in the low-data regime, and we hope it will inspire more research on denoising pseudo labels for more effective semi-supervised learning.