The performance of current deep convolutional neural networks (DCNNs) depends heavily on two assumptions: (1) training and test data are drawn from the same feature space with the same distribution; and (2) training data come with accurate annotations. However, the performance of established DCNN models usually degrades when they are tested on unseen data, especially when there is a significant appearance difference between the training (source domain) and test (target domain) data, which is referred to as the domain shift problem. To mitigate this problem, numerous domain adaptation (DA) methods have been proposed [13, 3, 12]. Nevertheless, most current DA solutions assume that the ground-truth labels in the training data are flawless, and thus ignore an inevitable problem: labels may be corrupted in the real world [6]. This challenge leads us to ask: “How can we learn a robust domain adaptive model from data with noisy annotations?”
Domain shift is a common problem in medical imaging, since images are acquired by specialized medical devices, where different imaging modalities or even different settings of the same device can introduce significant variations in the images. Recently, many approaches have emerged to address the domain shift problem in image segmentation. Li et al. [10] proposed a bidirectional learning method with self-supervised learning to learn a better segmentation model and in return improve the image translation model. Vu et al. [vu2018advent] proposed an entropy-based adversarial training approach targeting structure adaptation from the source domain to the target domain. Additionally, manual annotation with pixel-level accuracy is inefficient and error-prone. Mislabeled samples, behaving as “noise”, can degrade the performance of DCNNs, so it is challenging to learn from data with both domain shift and noisy annotations.
To alleviate the above problems, we propose a robust cross-denoising framework that is resilient to noisy annotations and domain shift. We design two different networks that act as peer reviewers to selectively learn from the data with reliable clean labels and adaptively correct the training error. Furthermore, we introduce a class-imbalanced self-learning strategy to estimate the most reliable labels for the target domain. Fig. 1 illustrates the main idea of previous DA methods and our proposed robust cross-denoising method. We evaluate the cross-denoising model against state-of-the-art methods on the REFUGE dataset [4] and the Drishti-GS dataset [11]. In a nutshell, the main contributions of this paper are summarized as follows:
1) To the best of our knowledge, we are the first to propose a robust learning method against noisy labels in medical image segmentation with domain shift.
2) We propose a cross-reviewing framework that identifies high-quality data, together with a noise-tolerant loss that focuses on the noise-free part of noisy labels. Both significantly reduce the negative effects of noisy labels and boost the performance of the two peer networks.
3) We introduce a class-imbalanced cross learning strategy in an iterative cross-training procedure. This approach generates target labels with higher confidence and accuracy.
4) We demonstrate that this robust framework achieves state-of-the-art performance on optic disc (OD) and optic cup (OC) segmentation tasks with domain shift and noisy labels.
2 Related Works
Recently, a growing number of studies have addressed the domain shift problem with domain adaptation techniques, and many approaches have achieved promising performance on natural image datasets. For instance, Chang et al. [chang2019structure] proposed a DICE framework that disentangles the representation of an image into a domain-invariant structure component and a domain-specific texture component to advance domain adaptation for semantic segmentation. To address the semantic inconsistency incurred by global feature alignment, Luo et al. [luo2018taking] took a close look at the category-level joint distribution and aligned each class with an adaptive adversarial loss. For medical image segmentation, Dou et al. [dou2018unsupervised] proposed a plug-and-play domain adaptation module (DAM) that adapts the source and target domains in the feature space to solve the cardiac structure segmentation problem across different modalities. The latest study on medical data that is closely related to our work is [15], which presented a patch-based output space adversarial learning framework to jointly segment the OD and OC from different fundus image datasets. However, all these existing DA methods rely on training data with clean annotations, and their performance degrades dramatically once the annotations are corrupted or ambiguous.
Training DCNNs in the presence of corrupted labels is a challenging task that has attracted many researchers. Among those works, one representative method is [9], which proposed MentorNet to supervise the training of a StudentNet and select samples that are probably correct. Another work [6] introduced a co-teaching strategy to robustly train deep neural networks under noisy supervision. For medical imaging, Xue et al. [xue2019robust] proposed an iterative learning strategy for skin lesion image classification with imperfect labels, combating the lack of cleanly annotated medical data. Existing approaches to robust learning with noisy labels mostly focus on image classification, which leaves segmentation with corrupted labels an unsolved problem. In this paper, we provide a novel solution that addresses medical image segmentation under domain shift and contaminated labels at the same time.
The overall architecture of our proposed robust cross-denoising network is shown in Fig. 2. It consists of two different networks working as peer reviewers in an unsupervised domain adaptation fashion. In this section, we first describe the architecture of the proposed cross-denoising network. Then, we design a robust cross-denoising learning algorithm to learn an accurate and robust model from contaminated labels. Finally, we propose a noise-tolerant loss and a class-imbalanced cross learning strategy to learn critical information from corrupted labels, which are elaborated in Sections 3.2 and 3.3, respectively.
3.1 Robust Cross-Denoising Network
As shown in Fig. 2, our proposed cross-denoising network (CD-Net) consists of two different networks (i.e., N1 and N2), each of which includes a segmentation network (S1 and S2, respectively) and a discriminator (D1 and D2, respectively). N1 and N2, acting as two experts, generate different decision boundaries, and thus differ in their learning abilities and opinions. For network N1, we follow the spirit of the DeepLabv2 [1] architecture with ResNet101 [7] as the backbone to obtain initial segmentation results. For network N2, in order to learn discriminative features different from those of N1, we adopt the DeepLabv3+ architecture with MobileNetv2 as the backbone [2]. To boost the segmentation ability of N1 to the same level as N2, we design a novel atrous spatial pyramid pooling (ASPP) structure with a multi-attention mechanism [5] for N1, shown in Fig. 3, so as to enhance the feature expression ability and enrich the multi-scale information of the network. D1 and D2 share the same architecture, a 5-layer fully convolutional network. They are trained to distinguish between the source prediction and the target prediction by adversarial learning, and guide the segmentation networks to focus on local structure similarity. In the testing stage, we only use N1 to generate the final segmentation results.
Robust Cross-denoising Learning.
Our proposed robust cross-denoising algorithm is shown in Algorithm 1. With a subset of the data (step 3), we train the two different networks N1 and N2 to select a proportion of samples with small training loss (steps 6 and 7). Based on the observation of deep networks [16], easy cases are learned first, and the networks then gradually fit the hard cases as the number of epochs increases. Therefore, on a noisy dataset, a network learns the clean and easy parts of the data in the early stage, and can thus filter out noisy patterns using loss values. The number of filtered samples is controlled by the remember rate, which increases (step 13) until all the potentially noisy data are filtered out. Since the learning and filtering abilities of the networks are not strong enough in early epochs, the remember rate is initialized with a small value and grows as the epochs increase. After that, the high-quality data selected by one network are fed into its peer network as reliable knowledge to update its parameters (steps 9 and 10). Since the two networks have different structures and learning abilities, they can filter different types of errors introduced by noisy labels. Although the error caused by noisy labels would be propagated back within a single network, the peer network can adaptively correct the training error through the prediction disagreement between the two networks. Based on this peer-review strategy, each network selects its small-loss samples as high-quality data and updates its peer network with these clean samples to further reduce the training error.
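The peer-review selection described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's Algorithm 1: the function names, the linear schedule and its `start`, `end`, and `ramp_epochs` values, and the loss values are all hypothetical, and a real implementation would operate on per-sample losses computed by the two segmentation networks.

```python
def small_loss_indices(losses, remember_rate):
    """Return the indices of the `remember_rate` fraction of samples with the smallest loss."""
    num_keep = max(1, int(round(remember_rate * len(losses))))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(ranked[:num_keep])


def cross_select(losses_n1, losses_n2, remember_rate):
    """Each network picks its small-loss samples; the pick is handed to its peer."""
    picked_by_n1 = small_loss_indices(losses_n1, remember_rate)
    picked_by_n2 = small_loss_indices(losses_n2, remember_rate)
    # N1 is updated on N2's selection and vice versa (peer review).
    return {"update_n1_on": picked_by_n2, "update_n2_on": picked_by_n1}


def remember_rate_schedule(epoch, start=0.5, end=0.9, ramp_epochs=10):
    """Hypothetical linear ramp: a small rate early, a larger rate as epochs increase."""
    progress = min(epoch, ramp_epochs) / ramp_epochs
    return start + (end - start) * progress


# Per-sample losses of the two networks on a mini-batch of six images (made-up numbers).
losses_n1 = [0.21, 1.30, 0.18, 0.95, 0.25, 0.40]
losses_n2 = [0.30, 0.22, 0.19, 1.10, 0.85, 0.27]
selection = cross_select(losses_n1, losses_n2, remember_rate=0.5)
```

Here the two networks disagree on which samples look clean (only sample 2 is picked by both), which is the disagreement the peer update relies on.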
Overall Training Objective.
The proposed cross-denoising domain adaptation network is trained with two loss functions: a noise-tolerant segmentation loss and a noise-robust adversarial loss. Not all of the high-quality data selected by the network are clean; some may still be mixed with noisy data. To learn from clean and corrupted labels separately, we separate the data into two groups based on the prediction confidence: data with reliable labels (clean data) and data with noisy labels (noisy data). The noise-tolerant segmentation loss consists of a segmentation loss for clean data and one for corrupted data, as shown in Eq. (1) and elaborated in Section 3.2. When an instance is grouped as clean data, the noise-tolerant segmentation loss equals the clean-data loss of Eq. (4); otherwise, it equals the noisy-data loss of Eq. (5).
Unlabeled data in the target domain can be regarded as the extreme case of data with noisy labels: the direct prediction in the target domain is usually inaccurate and noisy, which affects the convergence and generalization of adversarial learning. To maximize prediction certainty, an “entropy map” is multiplied by the predictions for the target domain image, which increases the loss weight for pixels with inaccurate and noisy estimated labels, and reduces the loss weight for accurate and clean estimated labels. The entropy map is computed pixel-wise from the predicted result in the target domain, and we adopt it as an indicator to weight the noise-robust adversarial loss.
In this loss, a weight parameter scales the information entropy map, and a second parameter ensures the stability of the training process when the weighted term is small.
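As an illustration of the entropy-based weighting, the sketch below computes a pixel-wise entropy map from softmax probabilities and turns it into per-pixel loss weights. It assumes the standard Shannon entropy, normalized by the logarithm of the number of classes, and a weight of the form `weight * entropy + stabiliser`; the exact formula and the parameter values used in the paper are not recoverable from the text, so the names and defaults here are hypothetical.

```python
import math


def entropy_map(probabilities):
    """Pixel-wise Shannon entropy of class probabilities, normalized to [0, 1].

    `probabilities[y][x]` is a list of class probabilities that sums to 1.
    """
    num_classes = len(probabilities[0][0])
    max_entropy = math.log(num_classes)
    result = []
    for row in probabilities:
        out_row = []
        for pixel in row:
            # Terms with zero probability contribute nothing to the entropy.
            h = sum(-p * math.log(p) for p in pixel if p > 0.0)
            out_row.append(h / max_entropy)
        result.append(out_row)
    return result


def adversarial_pixel_weights(entropy, weight=1.0, stabiliser=0.1):
    """Assumed weighting: weight * entropy + stabiliser for every pixel."""
    return [[weight * e + stabiliser for e in row] for row in entropy]


# One row with two pixels: a maximally uncertain one and a fully confident one.
probs = [[[0.5, 0.5], [1.0, 0.0]]]
ent = entropy_map(probs)
pixel_weights = adversarial_pixel_weights(ent)
```

The uncertain pixel receives the largest weight and the confident pixel keeps only the stabilizing constant, which matches the behaviour described in the text.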
The training objective of our proposed noise-robust segmentation method is formulated as a min-max criterion over the noise-tolerant segmentation loss and the noise-robust adversarial loss.
The hyperparameter controlling the weight of the adversarial loss is empirically set to 0.001.
3.2 Learning from Corrupted Labels
Since clean data tend to obtain a small loss while the remaining corrupted data obtain a large loss, we use a hybrid segmentation loss for clean data, composed of the common cross-entropy loss and the Dice coefficient loss, which are the second and third terms of Eq. (4), respectively. The loss is computed over all pixels and classes of the source- and target-domain images, between the softmax output of the segmentation network and the ground truth. The two weights used to improve network training are empirically set to 0.05 and 1, respectively. For the corrupted data, since the incorrect annotations are mostly around the boundary in practice, the annotations inside the segmented regions are more reliable. Inspired by this observation, we propose to selectively learn from noisy labels. The noisy-data segmentation loss in Eq. (5) prevents the network from overfitting the noisy pixels while keeping the ability to learn from the reliable pixels in noisy data.
This loss is weighted by a boundary distance map. As boundaries are generally more vulnerable to noise in a medical image, we calculate the distance to the nearest boundary for each pixel and take the maximum distance within each class-level region. The standard deviation of the weighting is then set from this maximum distance, because about 99% of a Gaussian distribution lies within three standard deviations of its mean. With respect to the resulting weight, the center of the region in each class has a larger value, and the closer a pixel is to the boundary, the smaller the value. This noise-tolerant loss encourages the network to capture the key location in the center and filter out the discrepancy at the boundary under labels contaminated by various kinds of noise.
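The boundary-aware weighting can be illustrated with a small plain-Python sketch. Since the original formula is not recoverable from the text, the sketch assumes a Gaussian weight `exp(-(d - d_max)^2 / (2 * sigma^2))` with `sigma = d_max / 3`, where `d` is the distance of a foreground pixel to the nearest background pixel; both the functional form and the function names are assumptions chosen to reproduce the described behaviour (large weight at the region center, small weight near the boundary).

```python
import math


def boundary_distance_map(mask):
    """Euclidean distance from each foreground pixel to the nearest background pixel."""
    height, width = len(mask), len(mask[0])
    background = [(y, x) for y in range(height) for x in range(width) if mask[y][x] == 0]
    distances = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x] == 1 and background:
                distances[y][x] = min(math.hypot(y - by, x - bx) for by, bx in background)
    return distances


def gaussian_pixel_weights(mask):
    """Assumed weighting: exp(-(d - d_max)^2 / (2 sigma^2)) with sigma = d_max / 3."""
    distances = boundary_distance_map(mask)
    d_max = max(max(row) for row in distances)
    if d_max == 0.0:
        # No foreground region (or no background to measure against).
        return [[0.0] * len(row) for row in mask]
    sigma = d_max / 3.0
    return [
        [
            math.exp(-((d - d_max) ** 2) / (2.0 * sigma ** 2)) if m == 1 else 0.0
            for d, m in zip(dist_row, mask_row)
        ]
        for dist_row, mask_row in zip(distances, mask)
    ]


# A 3x3 foreground region inside a 5x5 image.
region = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
region_weights = gaussian_pixel_weights(region)
```

In this toy example the center pixel gets weight 1, while the pixels adjacent to the boundary get a much smaller weight, so label errors near the boundary contribute less to the loss. The brute-force distance computation is only suitable for tiny masks; a distance transform from an image-processing library would be used in practice.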
3.3 Class-imbalanced Cross Learning
In unsupervised domain adaptation with ambiguous labels, it is even more challenging to estimate the results accurately. An alternative way to handle this intractable problem is to use the predictions of the learned model as latent variables for the target image, which are called “pseudo labels” (PL). Because of the presence of corrupted labels and different class distributions, the predictions are not robust to noisy disturbance, and the prediction difficulty differs among classes. Vanilla self-learning does not take this issue into consideration and selects pseudo labels using a universal confidence threshold for all classes. We propose a class-imbalanced cross learning strategy to solve this issue (shown in Algorithm 2), in which we select the pseudo labels with the most confident predictions at the class level and feed them into the peer network to resist noise. Specifically, we use the trained segmenter to predict the latent target labels (step 2), and rank the prediction values of each category (step 4) to select the pixels with values greater than the class confidence threshold as pseudo labels (step 7). The generated pseudo labels are fed into the peer discriminator (step 9) to adaptively correct the adversarial error of its companion, which makes the training robust to noise as discussed in Section 3.2. The algorithm is conducted in an iterative cross-training procedure.
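The class-level selection of pseudo labels can be sketched as follows. This is an illustrative simplification of the idea, not Algorithm 2 itself: the `keep_fraction` parameter, the `ignore_label` value, and the use of a greater-or-equal comparison at the threshold are assumptions, and pixels are given as flat lists for brevity.

```python
def class_wise_thresholds(confidences, labels, num_classes, keep_fraction):
    """Per-class threshold: the confidence at the top `keep_fraction` of that class."""
    thresholds = {}
    for c in range(num_classes):
        ranked = sorted(
            (conf for conf, lab in zip(confidences, labels) if lab == c), reverse=True
        )
        if ranked:
            cutoff = max(1, int(round(keep_fraction * len(ranked))))
            thresholds[c] = ranked[cutoff - 1]
    return thresholds


def select_pseudo_labels(confidences, labels, thresholds, ignore_label=-1):
    """Keep a predicted label only if its confidence reaches its own class threshold."""
    return [
        lab if conf >= thresholds[lab] else ignore_label
        for conf, lab in zip(confidences, labels)
    ]


# Flattened predictions for eight pixels: class 0 is easy, class 1 is hard.
confidences = [0.99, 0.97, 0.95, 0.90, 0.60, 0.55, 0.52, 0.51]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
thresholds = class_wise_thresholds(confidences, labels, num_classes=2, keep_fraction=0.5)
pseudo = select_pseudo_labels(confidences, labels, thresholds)
```

With a single universal threshold of, say, 0.9, every pixel of the harder class 1 would be discarded; the class-wise thresholds keep the most confident half of each class instead.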
Table 1: Dice coefficients (OD, OC) on the REFUGE validation set (REF) and the Drishti-GS test set (DGS) under different noise levels and noise ratios, with (With) and without (W/O) a pretrained model.
| Noise level | Pretrained | Noise ratio | BDL [10] REF | BDL [10] DGS | pOSAL [15] REF | pOSAL [15] DGS | BEAL [14] REF | BEAL [14] DGS | Proposed REF | Proposed DGS |
|---|---|---|---|---|---|---|---|---|---|---|
| Low | With | 0 | (94.6, 87.4) | (94.6, 82.8) | (94.9, 88.7) | (95.5, 84.5) | (93.3, 83.1) | (93.8, 85.0) | (95.3, 89.4) | (96.1, 85.9) |
| Low | With | 0.1 | (94.8, 88.7) | (94.6, 83.1) | (95.4, 88.0) | (95.3, 83.6) | (93.1, 82.0) | (93.4, 83.1) | (95.1, 89.3) | (95.1, 83.8) |
| Low | With | 0.5 | (94.9, 89.0) | (94.3, 80.8) | (94.9, 85.9) | (94.9, 80.9) | (90.2, 80.5) | (93.1, 82.1) | (95.4, 89.6) | (95.8, 84.2) |
| Low | With | 0.9 | (94.2, 86.8) | (94.0, 82.6) | (94.5, 85.8) | (94.8, 80.6) | (87.7, 80.5) | (93.2, 78.0) | (95.3, 89.4) | (95.1, 82.9) |
| Low | W/O | 0.1 | (94.2, 86.7) | (92.6, 82.5) | (94.1, 87.9) | (94.8, 81.4) | (92.7, 77.1) | (92.4, 78.2) | (94.7, 88.4) | (95.1, 83.8) |
| Low | W/O | 0.5 | (93.2, 86.0) | (87.6, 81.1) | (94.0, 85.0) | (92.6, 78.4) | (87.8, 75.8) | (87.3, 78.0) | (94.8, 86.7) | (94.4, 83.7) |
| Low | W/O | 0.9 | (90.6, 76.5) | (85.6, 80.3) | (92.5, 83.6) | (88.6, 77.7) | (82.8, 69.1) | (83.5, 74.7) | (94.1, 84.6) | (92.7, 83.4) |
| High | With | 0.1 | (94.7, 83.3) | (94.1, 81.1) | (94.7, 86.5) | (92.6, 81.3) | (92.4, 81.8) | (92.6, 80.1) | (95.1, 89.0) | (94.6, 83.0) |
| High | With | 0.5 | (91.6, 79.6) | (87.9, 68.5) | (85.8, 75.6) | (91.3, 78.7) | (89.1, 77.2) | (92.4, 74.9) | (93.9, 85.6) | (93.4, 81.3) |
| High | With | 0.9 | (90.2, 74.3) | (85.9, 65.1) | (84.5, 76.0) | (91.6, 76.3) | (75.9, 66.9) | (90.4, 73.5) | (93.0, 83.6) | (92.4, 82.7) |
| High | W/O | 0.1 | (89.5, 75.9) | (89.2, 72.3) | (88.2, 78.7) | (87.5, 59.0) | (91.2, 73.8) | (68.2, 56.5) | (94.5, 88.8) | (92.7, 83.0) |
| High | W/O | 0.5 | (85.8, 75.6) | (85.6, 66.9) | (83.9, 74.5) | (85.0, 54.4) | (86.9, 69.1) | (73.4, 59.6) | (93.8, 84.7) | (93.0, 81.8) |
| High | W/O | 0.9 | (84.7, 66.0) | (81.6, 68.7) | (79.0, 72.0) | (81.6, 56.3) | (77.8, 53.2) | (70.5, 50.7) | (92.9, 81.1) | (91.8, 80.4) |
In this study, we verify our approach on two public optic disc (OD) and optic cup (OC) segmentation datasets: the REFUGE challenge dataset [4] and the Drishti-GS dataset [11] (Table 2). We refer to the REFUGE training set as the source domain, and to the REFUGE validation set and the Drishti-GS dataset as target domains 1 and 2, respectively. The source domain contains a mix of ground-truth labels and imperfect labels, while the target domains contain no labels. Each target domain is further split into a training set for unsupervised DA (ignoring the labels) and a test set. The source and target domain images are acquired by different scanners, resulting in different color and texture characteristics. Extensive experiments on these two public databases with different noise levels and noise ratios are conducted to verify the effectiveness of our proposed approach.
| Domain | Dataset | Training images | Test images |
|---|---|---|---|
| Source | REFUGE training set | 400 | None |
| Target1 | REFUGE validation set | 300 | 100 |
We generate three types of noisy labels, as shown in Fig. 4: i) enlarging the label mask by dilation, ii) shrinking the labels by erosion, and iii) deforming the labels by elastic deformation. By varying the amount of dilation, erosion, and deformation, we generate corrupted datasets with different noise levels, which are measured as a function of the Dice coefficient between the generated noisy labels and the ground truth of each class. On this basis, we empirically define a low and a high noise level. We also vary the noise ratio, i.e., the portion of corrupted samples randomly selected from the training set, over 0.1, 0.5, and 0.9.
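To make the noise measure concrete, the sketch below corrupts a small binary mask by one step of dilation and reports the Dice coefficient between the clean and the corrupted label. The 3x3 structuring element and the toy mask are assumptions for illustration; the paper's actual corruption parameters and noise-level thresholds are not given in the text.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice = 2|A and B| / (|A| + |B|) for two binary masks of equal shape."""
    intersection = sum(
        a * b for row_a, row_b in zip(mask_a, mask_b) for a, b in zip(row_a, row_b)
    )
    total = sum(map(sum, mask_a)) + sum(map(sum, mask_b))
    return 1.0 if total == 0 else 2.0 * intersection / total


def dilate(mask):
    """One step of binary dilation with a 3x3 square structuring element."""
    height, width = len(mask), len(mask[0])
    return [
        [
            int(any(
                mask[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ))
            for x in range(width)
        ]
        for y in range(height)
    ]


# Clean label: a 3x3 region inside a 5x5 image.
clean_label = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
noisy_label = dilate(clean_label)
label_dice = dice_coefficient(clean_label, noisy_label)
```

A lower Dice between the noisy and the clean label corresponds to a stronger corruption; on this tiny mask a single dilation step already grows the region from 9 to 25 pixels.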
4.2 Implementation Details
The proposed method is implemented in PyTorch on four Tesla P40 GPUs with 96 GB of memory in total. We use the stochastic gradient descent (SGD) optimizer with a momentum of 0.9 to train the segmentation network, and the Adam optimizer to train the discriminator. The segmentation network and the discriminator use separate initial learning rates.
4.3 Quantitative Results
We compare our proposed method with state-of-the-art unsupervised DA methods, including BDL [10], pOSAL [15], and BEAL [14], for OD and OC segmentation at different noise levels and noise ratios. The Dice coefficients (DI) of OD and OC are used as evaluation criteria.
Table 1 presents the performance comparison of all the methods transferring from the REFUGE training set to the REFUGE validation and Drishti-GS test sets with different noise levels and noise ratios. On the REFUGE dataset (REF), we notice that the impact of label noise is not identical for all neural networks. On the cleanly annotated dataset, all methods work well and our proposed method achieves the best performance, with an OD Dice of 95.3 and an OC Dice of 89.4. As the noise ratio increases, however, the competing methods degrade to different degrees, while our method still maintains a stable and robust result. This is because we not only identify high-quality data effectively, but also avoid the error accumulation issue and assimilate the gains of clean data. Therefore, our method reaches higher performance and copes with harder cases. Furthermore, we observe that when a pretrained model is used at the low noise level, the performance shows no sign of declining in some cases. This indicates that the pretrained model can improve model robustness [8] and treat the mild noise as a form of “data augmentation”, which relaxes the learning criterion and boosts the performance of both the competitors and our method. When training at the high noise level, the performance of the competing methods declines sharply as the noise ratio increases. In contrast, our method can detect the most reliable data and learn from samples prone to corruption, so it learns more discriminative features and achieves better performance. More specifically, in the hardest case with a noise ratio of 0.9, our method beats the best competitor pOSAL with 17.6% and 12.6% relative improvements in OD and OC Dice, respectively, when training from scratch.
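The reported gains over pOSAL can be reproduced from Table 1 if they are read as relative improvements, which is an interpretation rather than something the text states explicitly. The snippet below uses the REF entries of the row with high noise level, no pretrained model, and a noise ratio of 0.9; the function name is hypothetical.

```python
def relative_improvement(ours, baseline):
    """Relative gain of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# (OD Dice, OC Dice) on REF at high noise level, noise ratio 0.9, without pretraining.
proposed_ref = (92.9, 81.1)
posal_ref = (79.0, 72.0)
gains = tuple(
    round(relative_improvement(ours, base), 1)
    for ours, base in zip(proposed_ref, posal_ref)
)
# gains -> (17.6, 12.6)
```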
The results on the Drishti-GS dataset (DGS) show trends similar to those on REF. Because the distributions of the REFUGE and Drishti-GS datasets are quite different, the performance of the competitors declines steeply under the larger domain shift, while our method can alleviate this domain shift and learn from pseudo labels with high confidence. Concretely, at a noise ratio of 0.9, our method beats the best competitor BDL with 12.5% and 13.5% improvement when training from scratch. The qualitative testing results on the REFUGE and Drishti-GS datasets are visualized in Fig. 5. Without noise, the competing methods can find the approximate location but fail to generate accurate boundaries of the OD and OC. In contrast, our method successfully localizes the OD and OC and generates more accurate boundaries. With noise added, the differences between the segmentation results of the competitors and the ground truth become prominent, while our model still achieves promising results and shows its superiority over the other methods.
We also conduct a set of ablation experiments to investigate the effectiveness of each component, as shown in Table 3. With the cross-denoising (CD) strategy, the performance increases significantly, which validates that the module can gain from high-quality data and correct the accumulation of training errors effectively. Combining it with the class-imbalanced cross learning (CICL) approach yields a stable and competitive result, which demonstrates that the approach helps boost the performance. Finally, the noise-tolerant loss (NTL) is added to validate whether it can learn from the noise-free area in noisy labels. We observe a large improvement in the case of large noise ratios.
This paper presented a novel cross-denoising framework that exploits noisily annotated source domain images and unannotated target domain images to improve the segmentation of target images. In conjunction with robust adversarial learning and a noise-tolerant loss, the domain shift and noisy label problems can be addressed simultaneously. Extensive experiments on OD and OC segmentation have demonstrated the advantages of our approach over state-of-the-art alternatives. Beyond medical images, the method may also apply to segmentation tasks on other types of images whose labels are inaccurate.
This work was supported by grants from the Key Area Research and Development Program of Guangdong Province, China (No. 2018B010111001) and the Science and Technology Program of Shenzhen, China (No. ZDSYS201802021814180).
- [1] (2016) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. Vol. 40, pp. 834–848.
- [2] (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV.
- [3] (2018) Domain adaptive Faster R-CNN for object detection in the wild. In CVPR.
- [4] (2019) REFUGE: retinal fundus glaucoma challenge.
- [5] (2018) Dual attention network for scene segmentation. In CVPR.
- [6] (2018) Co-teaching: robust training of deep neural networks with extremely noisy labels. In NIPS.
- [7] (2015) Deep residual learning for image recognition. In CVPR.
- [8] (2019) Using pre-training can improve model robustness and uncertainty. In ICML.
- [9] (2017) MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML.
- [10] (2019) Bidirectional learning for domain adaptation of semantic segmentation. In CVPR.
- [11] (2014) Drishti-GS: retinal image dataset for optic nerve head (ONH) segmentation. In ISBI.
- [12] (2018) Learning to adapt structured output space for semantic segmentation. In CVPR.
- [13] (2017) Adversarial discriminative domain adaptation. In CVPR.
- [14] (2019) Boundary and entropy-driven adversarial learning for fundus image segmentation. In MICCAI.
- [15] (2019) Patch-based output space adversarial learning for joint optic disc and cup segmentation. Vol. 38, pp. 2485–2495.
- [16] (2019) How does disagreement help generalization against label corruption? In ICML.