Deep neural networks (DNNs) have made remarkable progress on diverse medical image analysis tasks. Most of these achievements rest on the assumption that networks are trained and tested on samples drawn from the same distribution or domain. Once this assumption fails, i.e., domain shift Moreno-Torres et al. (2012) exists, networks may yield unsatisfactory performance due to their limited generalization ability. In real clinical settings, medical images are typically captured at different institutions with various scanner vendors, patient populations, fields of view, and appearance discrepancies Ting et al. (2019), which makes learned models struggle to transfer knowledge and generalize across domains. Since training a specific model for each medical center is unrealistic and laborious, it is necessary to enhance the generalization ability of deep models across different and even unseen clinical sites.
The community has attacked the domain shift bottleneck mainly in two directions. First, Unsupervised Domain Adaptation (UDA) exploits prior knowledge extracted from unlabeled target-domain images to adapt the model. Although UDA-based approaches avoid time-consuming annotation of the target domain, collecting target images in advance is often impractical. This motivates the second direction, Domain Generalization (DG), which aims to learn a universal representation from multiple source domains without any target-domain information. In this paper, we adopt the DG setting to improve network generalization on medical image segmentation tasks.
Under the domain generalization scope, data manipulation Tobin et al. (2017); Zhou et al. (2020), domain-invariant representation learning Muandet et al. (2013); Ganin and Lempitsky (2015); Arjovsky et al. (2020), and meta-learning techniques Li et al. (2018a); Dou et al. (2019) have achieved remarkable success. Meanwhile, consistency regularization, which prevails in Semi-Supervised Learning (SSL) and UDA, has been introduced to mitigate performance degradation by forcing the model to learn invariant information from perturbed samples, and has shown promising results in different tasks. However, most previous consistency-regularization-based works Yue et al. (2019); Xu et al. (2021) simply enforce data-level consistency by generating new domains with novel appearance and then minimizing the discrepancy between the original and generated domains for the same instance. Such consistency, based on data-level perturbation, usually requires extrinsic knowledge from other source domains and depends heavily on the quality of the generated domains. To overcome these shortcomings, we explore other kinds of consistency regularization for the DG problem. In particular, based on the observation that related tasks inherently introduce prediction perturbation during network training, we can leverage the intrinsic consistency of related tasks on the same instance, without extrinsically generated domains, to encourage the network to learn generalizable representations. Additionally, how to integrate data-level and task-level consistency and exploit their complementary effect is also well worth exploring.
To this end, we present a novel Hierarchical Consistency framework for Domain Generalization (HCDG) that enforces Extrinsic and Intrinsic Consistency simultaneously. To the best of our knowledge, we are the first to introduce task-level perturbation into DG and to integrate several kinds of consistency regularization into a hierarchical cohort. For Extrinsic Consistency, we leverage the knowledge across multiple source domains to enforce data-level consistency. Inspired by the observation that the phase and amplitude components in the Fourier spectrum of signals retain the high-level semantics (e.g., structure) and low-level statistics (e.g., appearance), respectively, we introduce an improved Fourier-based Amplitude Gaussian-mixing (AG) method, deployed within a DomainUp scheme, to produce augmented domains with richer variability than the previous Amplitude Mix scheme Xu et al. (2021). For Intrinsic Consistency, we enforce task-level consistency for the same instance across two related tasks: image segmentation and boundary regression. The Extrinsic and Intrinsic Consistency are further integrated into a teacher-student-like cohort to facilitate network learning. We evaluate the proposed HCDG on two medical image segmentation tasks, i.e., optic cup/disc segmentation on fundus images and prostate segmentation on MRI. Our HCDG framework achieves state-of-the-art performance on both tasks. The main contributions are summarized as follows.
We develop an effective HCDG framework for generalizable medical image segmentation by simultaneously integrating Extrinsic and Intrinsic Consistency.
We design a novel Amplitude Gaussian-mixing strategy for Fourier-based data augmentation by introducing pixel-wise perturbation in the amplitude spectrum to highlight core semantic structures.
Extensive experiments on two medical image segmentation benchmarks validate the efficacy and universality of the framework; HCDG clearly outperforms several state-of-the-art DG methods.
2 Related Work
Domain generalization aims to learn a general model from multiple source domains such that the model can directly generalize to arbitrary unseen target domains. Recently, many DG approaches have achieved remarkable results. Early DG works mainly follow the representation learning spirit via kernel methods Muandet et al. (2013); Li et al. (2018d), domain adversarial learning Ganin and Lempitsky (2015); Li et al. (2018b), invariant risk minimization Arjovsky et al. (2020); Guo et al. (2021), multi-component analysis Zunino et al. (2021), and generative modeling Qiao et al. (2020). Data manipulation is one of the cheapest ways to tackle the dearth of training data and enhance the generalization capability of the model by two popular techniques: data augmentation and data generation. For example, domain randomization Tobin et al. (2017)
, adversarially trained transformation networks Zhou et al. (2020), and Mixup Zhang et al. (2018) are utilized to generate more training samples. Meanwhile, Xu et al. Xu et al. (2021) introduced Fourier-based data augmentation for DG by linearly distorting the amplitude information. DG has also been studied within general machine learning paradigms. Li et al. Li et al. (2018a) designed a model-agnostic training procedure derived from meta-learning. Carlucci et al. Carlucci et al. (2019) formulated a self-supervised task of solving jigsaw puzzles to learn generalized representations. Inspired by the Lottery Ticket Hypothesis Frankle and Carbin (2019), Cai et al. Cai et al. (2021) proposed to learn domain-invariant parameters of the model during training.
Consistency regularization is widely used in supervised and semi-supervised learning but has not yet played a significant role in DG. Sajjadi et al. Sajjadi et al. (2016) first introduced a consistency loss that exploits the stochastic nature of data augmentation by minimizing the discrepancy between the predictions of multiple passes of a training sample through the network. Tarvainen and Valpola Tarvainen and Valpola (2018) designed a teacher-student model to provide better consistency alignment. Besides, Yue et al. Yue et al. (2019) proposed a pyramid consistency to learn a highly generalizable model via domain randomization, which still rests on data-level perturbation. Very recently, task-level consistency has been used for semi-supervised learning Luo et al. (2021). To the best of our knowledge, no work has explored task-level consistency for the DG problem.
Medical Image Segmentation.
DNNs have become widespread in medical image segmentation tasks, such as cardiac segmentation from MRI Yu et al. (2017a), organ segmentation from CT Zhou et al. (2017); Li et al. (2018c), and skin lesion segmentation from dermoscopic images Yuan and Lo (2017). In this paper, we mainly focus on two tasks, i.e., OC/OD segmentation from retinal fundus images Wang et al. (2019a) and prostate segmentation from T2-weighted MRI Liu et al. (2020b). Previously, Fu et al. Fu et al. (2018) and Wang et al. Wang et al. (2019b) showed competitive results on jointly segmenting OC/OD, while Milletari et al. Milletari et al. (2016) and Yu et al. Yu et al. (2017b) successfully improved prostate segmentation. However, most of these methods lack generalization ability and tend to produce high test errors on unseen target datasets. Thus, a more generalizable method is highly desired to alleviate performance degradation.
Figure 1 depicts the proposed Hierarchical Consistency framework for Domain Generalization (HCDG). We consider a training set of $K$ source domains $\mathcal{D} = \{D_1, D_2, \dots, D_K\}$ with $N_k$ labeled samples $\{(x_i^k, y_i^k)\}_{i=1}^{N_k}$ in the $k$-th domain $D_k$, where $x_i^k$ and $y_i^k$ denote the images and labels. Our HCDG framework learns a domain-agnostic model from the distributed source domains by enforcing Extrinsic Consistency and Intrinsic Consistency simultaneously, so that it can directly generalize to a completely unseen domain with mitigated performance degradation.
3.1 Extrinsic Consistency (EC)
We first exploit consistency regularization from the extrinsic aspect, i.e., leveraging extra knowledge from other source domains, to enforce data-level consistency for each instance. Specifically, we propose a new paradigm of Fourier-based data augmentation, named Amplitude Gaussian-mixing (AG), to perturb the spectral amplitude and then generate augmented images. Based on AG, we further design a DomainUp scheme to provide new domains with enough variability. Finally, we utilize a mean-teacher framework to minimize the discrepancy between augmented replicas from the same instance with dual-view consistency regularization.
Previous Fourier-based augmentation work Xu et al. (2021) linearly mixes the spectral amplitude of the whole image and keeps the phase information invariant to synthesize interpolated domains. However, this strategy treats each pixel equally during mixing and can hardly distinguish the magnitude of semantic structures in the center and marginal areas. Thus, we design a novel Fourier-based Amplitude Gaussian-mixing (AG) strategy for amplitude perturbation by introducing a significance mask, SigMask $M$, for linear interpolation, where a Gaussian-like function controls the perturbation magnitude at each pixel. Specifically, we first extract the frequency-space signal $\mathcal{F}(x_i^k)$ of a sample $x_i^k$ through the fast Fourier transform Nussbaumer (1981) and further decompose it into an amplitude spectrum $\mathcal{A}(x_i^k)$ and a phase spectrum $\mathcal{P}(x_i^k)$. For each $x_i^k$, its perturbed amplitude spectrum $\hat{\mathcal{A}}(x_i^k)$ is calculated from $\mathcal{A}(x_i^k)$ and the amplitude $\mathcal{A}(x_j^{k'})$ of a counterpart sample $x_j^{k'}$ via

$$\hat{\mathcal{A}}(x_i^k) = (1 - M) \circ \mathcal{A}(x_i^k) + M \circ \mathcal{A}(x_j^{k'}),$$

where $\circ$ denotes element-wise multiplication. Then, we generate the augmented image $\hat{x}_i^k$ of the interpolated domain via the inverse Fourier transform $\mathcal{F}^{-1}\big(\hat{\mathcal{A}}(x_i^k), \mathcal{P}(x_i^k)\big)$.
The values of SigMask $M$ follow a Gaussian distribution, and the value at each pixel $(u, v)$ is computed with

$$M(u, v) = \frac{1}{2\pi\sigma^2} \exp\!\Big(-\frac{(u - \mu_u)^2 + (v - \mu_v)^2}{2\sigma^2}\Big),$$

where $(\mu_u, \mu_v)$ and $\sigma^2$ denote the mean and variance of the Gaussian. It is worth noting that we scale the range of the coordinates $(u, v)$ to an interval of scaled length $r$ to avoid pretty small significances in the marginal area of SigMask. The variance $\sigma^2$ controls the peak value of the above Gaussian-mixing function under a fixed mean, while the mean $(\mu_u, \mu_v)$ decides the position of the peak. To prevent generating outliers, we deduce a lower bound on $\sigma$ to promise $M \le 1$; we also control the upper bound of $\sigma$.
The proposed AG is simple but effective, bringing three benefits: (1) the Gaussian-mixing function highlights the core information by endowing the center area of the image with more magnitude than the marginal area; (2) we can generate adaptive center areas by empirically sampling the peak position $(\mu_u, \mu_v)$ to cope with uncertain positions of the core semantics; and (3) stability is improved by controlling the variance of the Gaussian-mixing function instead of directly modifying the peak value.
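The AG perturbation above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's implementation: we assume the SigMask acts on the fftshift-centered amplitude spectrum, and the mask range `[0.2, 1]`, the relative peak position `mu`, and the mixing ratio `lam` are illustrative choices.

```python
import numpy as np

def sig_mask(h, w, mu=(0.5, 0.5), sigma=0.5, low=0.2):
    """Gaussian-like significance mask (SigMask), rescaled to [low, 1]
    so that marginal pixels keep a non-negligible weight."""
    ys, xs = np.mgrid[0:h, 0:w]
    ys = ys / (h - 1) - mu[0]          # relative coordinates around the peak
    xs = xs / (w - 1) - mu[1]
    g = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma ** 2))
    g = (g - g.min()) / (g.max() - g.min() + 1e-8)   # normalize to [0, 1]
    return low + (1 - low) * g                        # rescale to [low, 1]

def ag_augment(x_i, x_j, lam=0.5):
    """Amplitude Gaussian-mixing: mix the amplitude spectra of x_i and x_j
    pixel-wise under the SigMask, keep the phase of x_i, then invert."""
    f_i, f_j = np.fft.fft2(x_i), np.fft.fft2(x_j)
    amp_i = np.fft.fftshift(np.abs(f_i))   # center low frequencies
    amp_j = np.fft.fftshift(np.abs(f_j))
    pha_i = np.angle(f_i)
    m = lam * sig_mask(*x_i.shape)         # center-weighted mixing ratio
    amp_mix = np.fft.ifftshift((1 - m) * amp_i + m * amp_j)
    return np.real(np.fft.ifft2(amp_mix * np.exp(1j * pha_i)))
```

Mixing an image with itself leaves it unchanged, which is a useful sanity check that the amplitude/phase round trip is lossless.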
To obtain more informative interpolated samples, for each source image $x_i^k$, we propose to search for the worst augmented case Gong et al. (2020) among $C$ candidates, where $C$ is the instance number sampled from each source domain. As shown in Figure 2, we first obtain the weakly augmented source image $x_w$ and the $C$ candidates by a standard augmentation protocol (e.g., random scaling and flipping), and then deploy the Fourier-based AG augmentation on $x_w$ and each candidate to acquire the corresponding strongly augmented replicas $x_s^{(c)}$. The worst augmented case is selected by the maximal supervised loss value among the $C$ replicas. During network learning, we combine the maximal segmentation loss from the strongly augmented replicas and the normal segmentation loss from $x_w$ as the total supervised segmentation loss:

$$\mathcal{L}_{seg} = \mathcal{L}_{ce}(p_w, y) + \max_{c \in \{1, \dots, C\}} \mathcal{L}_{ce}\big(p_s^{(c)}, y\big),$$

where $p_w$ and $p_s^{(c)}$ are the predictions of the student model for the weakly augmented image and the $c$-th strongly augmented replica.
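The worst-case selection can be sketched as follows for a single image. This is a simplified stand-in, not the paper's training loop: the pixel-wise binary cross-entropy and the callable `model` are illustrative placeholders for the actual segmentation loss and student network.

```python
import numpy as np

def cross_entropy(pred, target, eps=1e-8):
    """Pixel-wise binary cross-entropy, averaged over the image
    (an illustrative supervised segmentation loss)."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred)
                          + (1 - target) * np.log(1 - pred)))

def worst_case_loss(model, weak_img, strong_candidates, label):
    """DomainUp-style selection (after Gong et al.'s MaxUp): evaluate the C
    strongly augmented replicas, keep only the maximal supervised loss, and
    add the loss on the weakly augmented view."""
    strong_losses = [cross_entropy(model(c), label) for c in strong_candidates]
    return cross_entropy(model(weak_img), label) + max(strong_losses)
```

In training, the `max` over candidates makes the network fit the hardest augmented view of each instance rather than an average one.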
Dual-view Consistency Regularization.
After acquiring the weakly and strongly augmented images, we explicitly enforce a dual-view consistency regularization for Extrinsic Consistency. Specifically, the consistency is implemented with a momentum-updated teacher model to provide dual-view instance alignment for better constraint. We feed the weakly and strongly augmented images into both student and teacher networks (with the same architecture) and then minimize the network output discrepancy between them with the KL divergence:
$$\mathcal{L}_{kl} = \mathrm{KL}\!\left(\mathrm{softmax}(p^t / T) \,\middle\|\, \mathrm{softmax}(p^s / T)\right),$$

where $p^t$ and $p^s$ are the predictions of the teacher and student models, $\mathrm{softmax}(\cdot)$ denotes the softmax operation, and $T$ represents the temperature Hinton et al. (2015) used to soften the outputs. Overall, the objective function of EC is composed of the supervised segmentation loss and the consistency KL loss:

$$\mathcal{L}_{EC} = \mathcal{L}_{seg} + \beta \mathcal{L}_{kl},$$

where $\beta$ is a balancing hyper-parameter.
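A minimal NumPy sketch of the dual-view KL consistency and the momentum-updated (mean-teacher) weights; the temperature `T=2.0` and EMA momentum `alpha` are illustrative values, not the paper's settings.

```python
import numpy as np

def softmax(z, T=1.0, axis=-1):
    """Temperature-softened softmax (numerically stabilized)."""
    z = z / T
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_consistency(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened outputs."""
    p = softmax(teacher_logits, T)   # soft teacher targets
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p + 1e-8) - np.log(q + 1e-8))))

def ema_update(teacher_params, student_params, alpha=0.99):
    """Momentum (EMA) update of the teacher from the student, as in
    Tarvainen and Valpola's mean teacher."""
    return [alpha * t + (1 - alpha) * s
            for t, s in zip(teacher_params, student_params)]
```

The teacher receives no gradients; it only tracks an exponential moving average of the student, which yields more stable targets for the consistency loss.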
3.2 Intrinsic Consistency (IC)
Different from EC, IC introduces a Classmate module to perform task-level consistency guided by intrinsic perturbation. Since the predictions of related tasks for the same sample inherently differ in the output space, consistency can be enforced without extra knowledge after a proper transformation.
To conduct dual tasks, we incorporate a new Classmate module with several decoders into the student model. We then enforce task-level constraints between the predictions of the two tasks after transformation. To further strengthen the diversity between the outputs of the two tasks, we also apply a feature-level perturbation on the feature map (i.e., feature dropout or noise) following Ouali et al. Ouali et al. (2020). In practice, we define boundary regression as the second task to capture geometric structure Wang et al. (2019a); Murugesan et al. (2019). To generate the boundary ground truth $y_b$, we apply a morphological operation to the mask ground truth $y$ as Wang et al. Wang et al. (2019a) do. The supervised boundary loss is formulated as the mean squared error (MSE)

$$\mathcal{L}_{bd} = \frac{1}{N} \sum_{i=1}^{N} \big\| \hat{y}_{b,i} - y_{b,i} \big\|_2^2,$$

where $N$ is the number of labeled samples and $\hat{y}_{b,i}$ is the prediction of the classmate. We set the number of classmates to balance accuracy and efficiency. Note that each classmate receives a different version of the perturbed feature map.
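Boundary ground-truth generation and the boundary MSE can be sketched with a simple 4-neighborhood erosion (the boundary is the mask minus its eroded interior). The exact morphological operation follows Wang et al. (2019a); the wrap-around edge handling of `np.roll` is a simplification that is harmless for objects away from the image border.

```python
import numpy as np

def erode(mask):
    """4-neighborhood binary erosion: a pixel survives only if all of its
    up/down/left/right neighbors are foreground."""
    m = mask.astype(bool)
    out = m.copy()
    for axis in (0, 1):
        for shift in (1, -1):
            out &= np.roll(m, shift, axis=axis)
    return out

def boundary_from_mask(mask):
    """One-pixel-wide boundary map: foreground minus its eroded interior."""
    m = mask.astype(bool)
    return (m & ~erode(m)).astype(np.float32)

def boundary_mse(pred, gt):
    """Supervised boundary regression loss (mean squared error)."""
    return float(np.mean((pred - gt) ** 2))
```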
Table 1: Comparison on Fundus OC/OD segmentation; D1–D4 denote the four unseen target domains under the leave-one-domain-out protocol.

Dice coefficient (Dice) [%]:

| Method | OC D1 | OC D2 | OC D3 | OC D4 | OC Avg | OD D1 | OD D2 | OD D3 | OD D4 | OD Avg | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mixup Zhang et al. (2018) | 71.73 | 77.70 | 78.24 | 87.23 | 78.73 | 94.66 | 88.88 | 89.63 | 89.99 | 90.79 | 84.76 |
| M-mixup Verma et al. (2019) | 78.15 | 78.50 | 78.04 | 87.48 | 80.54 | 94.47 | 90.53 | 90.60 | 86.68 | 90.57 | 85.56 |
| CutMix Yun et al. (2019) | 77.41 | 81.30 | 80.23 | 84.27 | 80.80 | 94.50 | 90.92 | 89.57 | 88.95 | 90.99 | 85.89 |
| JiGen Carlucci et al. (2019) | 81.04 | 79.34 | 81.14 | 83.75 | 81.32 | 95.60 | 89.91 | 91.61 | 92.52 | 92.41 | 86.86 |
| DoFE Wang et al. (2020) | 82.86 | 78.80 | 86.12 | 87.07 | 83.71 | 95.88 | 91.58 | 91.83 | 92.40 | 92.92 | 88.32 |
| SAML Liu et al. (2020a) | 83.16 | 75.68 | 82.00 | 82.88 | 80.93 | 94.30 | 91.17 | 92.28 | 87.95 | 91.43 | 86.18 |
| FACT Xu et al. (2021) | 79.66 | 79.25 | 83.07 | 86.57 | 82.14 | 95.28 | 90.19 | 94.09 | 90.54 | 92.53 | 87.33 |

Average Surface Distance (ASD) [pixel]:

| Method | OC D1 | OC D2 | OC D3 | OC D4 | OC Avg | OD D1 | OD D2 | OD D3 | OD D4 | OD Avg | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mixup Zhang et al. (2018) | 28.07 | 15.34 | 14.42 | 7.01 | 16.21 | 9.15 | 15.41 | 14.44 | 10.88 | 12.47 | 14.34 |
| M-mixup Verma et al. (2019) | 21.89 | 13.37 | 14.65 | 6.61 | 14.13 | 9.67 | 14.40 | 13.03 | 14.08 | 12.80 | 13.46 |
| CutMix Yun et al. (2019) | 22.38 | 12.65 | 13.33 | 7.94 | 14.08 | 9.40 | 12.83 | 14.41 | 11.80 | 12.11 | 13.09 |
| JiGen Carlucci et al. (2019) | 19.34 | 13.36 | 12.86 | 9.91 | 13.87 | 7.62 | 15.27 | 11.60 | 10.73 | 11.31 | 12.59 |
| DoFE Wang et al. (2020) | 17.68 | 14.71 | 9.92 | 7.21 | 12.38 | 7.23 | 14.08 | 11.43 | 9.29 | 10.51 | 11.44 |
| SAML Liu et al. (2020a) | 18.20 | 16.87 | 13.38 | 9.89 | 14.59 | 10.08 | 12.90 | 12.31 | 14.22 | 12.38 | 13.48 |
| FACT Xu et al. (2021) | 20.71 | 14.51 | 11.61 | 7.83 | 13.67 | 8.19 | 14.71 | 8.43 | 10.57 | 10.48 | 12.07 |
Table 2: Comparison on Prostate MRI segmentation; D1–D6 denote the six unseen target domains.

| Method | Dice D1 | D2 | D3 | D4 | D5 | D6 | Dice Avg | ASD D1 | D2 | D3 | D4 | D5 | D6 | ASD Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mixup Zhang et al. (2018) | 84.77 | 84.50 | 79.18 | 82.99 | 84.68 | 84.12 | 83.37 | 5.92 | 6.72 | 7.38 | 5.81 | 7.53 | 4.33 | 6.28 |
| M-mixup Verma et al. (2019) | 86.94 | 82.61 | 85.87 | 84.98 | 82.35 | 82.50 | 84.21 | 4.50 | 6.36 | 5.07 | 4.37 | 8.05 | 4.66 | 5.50 |
| CutMix Yun et al. (2019) | 83.27 | 84.36 | 85.35 | 81.06 | 85.67 | 84.95 | 84.11 | 5.63 | 5.71 | 5.62 | 5.75 | 6.65 | 4.27 | 5.61 |
| JiGen Carlucci et al. (2019) | 84.53 | 85.41 | 81.48 | 83.18 | 89.66 | 85.42 | 84.95 | 5.26 | 5.42 | 5.21 | 5.34 | 3.77 | 4.14 | 4.86 |
| DoFE Wang et al. (2020) | 84.66 | 84.42 | 85.22 | 86.31 | 87.60 | 86.96 | 85.86 | 4.95 | 5.23 | 5.04 | 4.30 | 4.23 | 3.33 | 4.51 |
| SAML Liu et al. (2020a) | 89.66 | 87.53 | 84.43 | 88.67 | 87.37 | 88.34 | 87.67 | 4.11 | 4.74 | 5.40 | 3.45 | 4.36 | 3.20 | 4.21 |
| FACT Xu et al. (2021) | 85.82 | 85.45 | 84.93 | 88.75 | 87.31 | 87.43 | 86.62 | 4.77 | 5.60 | 6.10 | 3.90 | 4.37 | 3.43 | 4.70 |
Dual-classmate Consistency Regularization.
We define $\Omega_{in}$ and $\Omega_{out}$ as the inside and outside areas of the target object. Comparing the boundary ground truth $y_b$ with the mask ground truth $y$, the pixel values in $\Omega_{out}$ are 0 for both tasks, while pixels in $\Omega_{in}$ are assigned 1 for the mask prediction task but take different values for the boundary regression task. Consequently, we scale the regressed boundary and apply a sigmoid function to approximate the predicted mask, like the smooth Heaviside function Xue et al. (2020):

$$T(\hat{y}_b) = \mathrm{sigmoid}(k \cdot \hat{y}_b),$$

where $k$ denotes a scaling factor. The approximate transformation function $T(\cdot)$ maps the prediction space of the boundary to that of the mask while still maintaining task-level diversity. Thus, task-level consistency can be enforced between the mask prediction and the transformed boundary prediction of the same sample by exploiting intrinsic knowledge. The objective of IC then consists of the boundary MSE loss and the consistency KL loss:

$$\mathcal{L}_{IC} = \mathcal{L}_{bd} + \beta \mathcal{L}_{kl}',$$

where $\beta$ is the same balancing hyper-parameter as in EC. Overall, the total objective function for training the framework is

$$\mathcal{L}_{total} = \mathcal{L}_{EC} + \mathcal{L}_{IC}.$$
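Assuming the regressed boundary output behaves like a signed-distance-style map (positive inside, negative outside, in the spirit of Xue et al. (2020)), the transformation and the task-level consistency can be sketched as below. The scaling factor `k` and the per-pixel binary KL divergence are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def smooth_heaviside(boundary_pred, k=5.0):
    """Map a regressed boundary/distance map into [0, 1] with a scaled
    sigmoid (smooth Heaviside); k is an illustrative scaling factor."""
    return 1.0 / (1.0 + np.exp(-k * boundary_pred))

def task_consistency(mask_prob, boundary_pred, k=5.0, eps=1e-8):
    """Task-level consistency between the segmentation probability map and
    the transformed boundary-regression output, measured here with a
    per-pixel binary KL divergence (illustrative divergence choice)."""
    q = np.clip(smooth_heaviside(boundary_pred, k), eps, 1 - eps)
    p = np.clip(mask_prob, eps, 1 - eps)
    return float(np.mean(p * np.log(p / q)
                         + (1 - p) * np.log((1 - p) / (1 - q))))
```

The transform keeps both heads trainable on their own targets while giving them a shared space in which their predictions can be pulled together.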
4.1 Datasets and Experiment Setting
We evaluate our method on two important medical image segmentation tasks, i.e., optic cup and disc (OC/OD) segmentation on retinal fundus images and prostate segmentation on T2-weighted MRI. The Fundus image segmentation dataset Wang et al. (2020) is composed of four different data sources out of three public fundus image datasets, which are captured with different scanners in different institutions. The Prostate MRI dataset Liu et al. (2020a) is a well-organized multi-site dataset for prostate MRI segmentation, which contains prostate T2-weighted MRI data collected from six different data sources out of three public datasets. The pre-processing step of the two datasets follows the previous works Wang et al. (2020); Liu et al. (2020a) and can be found in the supplementary file.
For both tasks, we adopted the leave-one-domain-out strategy: we trained our model on the distributed source domains and evaluated it on the held-out target domain. For evaluation, we used the prediction of the student model as the final prediction. We adopted two commonly used metrics, the Dice coefficient (Dice) and the Average Surface Distance (ASD), to quantitatively evaluate the segmentation results on the whole object region and the surface shape, respectively. The average results of three runs are reported.
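The region-overlap metric can be computed as below; this is the standard Dice formulation on binary masks (ASD requires surface extraction and distance transforms and is omitted here).

```python
import numpy as np

def dice_coefficient(pred, gt, eps=1e-8):
    """Dice = 2|P ∩ G| / (|P| + |G|) on binary masks; eps guards against
    division by zero when both masks are empty."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))
```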
4.2 Implementation Details
Our framework was built on PyTorch and trained on one NVIDIA RTX 2080 GPU with the Adam optimizer. For the Fundus data, we followed the previous work Wang et al. (2020) and used a modified DeepLabv3+ Chen et al. (2018) as the segmentation backbone. Following Wang et al. Wang et al. (2020), we first pre-trained the vanilla DeepLabV3+ network for 40 epochs and then trained the whole framework for another 50 epochs with a batch size of 4; the learning rate was decreased after 40 epochs. For Prostate MRI, we employed the same network backbone as for Fundus, while the whole framework was trained from scratch for 80 epochs with a batch size of 1; the learning rate was decayed by 95% every 5 epochs. The hyper-parameter in Eq. (5) was set to 200. Other implementation details and hyper-parameter settings can be found in the supplementary file and the provided code.
4.3 Comparison with Other DG Methods
We compare our framework with recent state-of-the-art methods designed to improve network generalization ability. The first kind of approach focuses on network regularization, including Mixup Zhang et al. (2018), M-mixup Verma et al. (2019), CutMix Yun et al. (2019), and FACT Xu et al. (2021), a Fourier-based data augmentation framework. We also compare with state-of-the-art DG methods for OC/OD and Prostate MRI segmentation, i.e., JiGen Carlucci et al. (2019), DoFE Wang et al. (2020), and SAML Liu et al. (2020a). For the baseline, we trained a vanilla modified DeepLabV3+ network on the training images of all source domains in a unified manner.
Results on Fundus Image Segmentation.
Table 1 shows the quantitative results on the OC/OD segmentation tasks. Our HCDG framework achieves consistent improvements over the baseline across all unseen-domain settings, with an overall gain of 3.26% in Dice and 3.50 pixels in ASD. Unexpectedly, several network regularization methods, Mixup, M-mixup, and CutMix, do not perform as well as in the original natural image classification task. This may be due to the difficulty of predicting pixel-wise labels in the absence of explicit constraints on the augmentation. Besides, FACT formulates a dual-form consistency loss and advances the baseline from 86.38% to 87.33% in average Dice. By adding the extra Intrinsic Consistency, our approach further improves over FACT to 89.64%. Furthermore, our framework outperforms JiGen and SAML by considerable margins of 2.78% and 3.46% in average Dice, respectively. In particular, without leveraging domain prior knowledge, our approach surpasses the strongest competitor, DoFE, by 1.32% in average Dice and 1.72 pixels in average ASD.
Table 3: Ablation of components — EC (w/o AG), Classmates, IC, and AG — reported per unseen domain (Domain 1–4) and on average.
Results on Prostate MRI Segmentation.
The experimental results on the Prostate MRI segmentation task are shown in Table 2. Our HCDG framework improves the Dice over the baseline from 83.79% to 88.89% and decreases the ASD from 5.75 to 3.81. While Mixup still performs worse than the baseline, M-mixup and CutMix show only limited advantages over it; this may be because the task difficulty is mitigated on the grayscale images extracted from prostate MRI. Guided by consistency regularization, FACT clearly exceeds the baseline, with an overall gain of 2.83% in Dice and 1.05 in ASD. Our approach further improves over FACT from 86.62% to 88.89% in average Dice. JiGen and DoFE achieve competitive performance, while SAML yields the best results among these previous state-of-the-art methods. Our approach outperforms SAML by a large margin (1.22% in average Dice and 0.40 in average ASD).
We present two sample segmentation results from each task in Figure 3 for better visualization. Owing to the Extrinsic and Intrinsic Consistency, the predictions of our approach have smoother contours and are closer to the ground truth than those of the other methods.
4.4 Analysis of Our Methods
Impact of Different Components.
The ablation study results are shown in Table 3. Starting from the baseline, model A deploys EC guided by the previous AM strategy rather than our proposed AG strategy. Compared to FACT Xu et al. (2021) in Table 1, model A performs better owing to the DomainUp scheme in EC. Based on model A, we incorporate the Classmate module into the student model to obtain model B, which improves over model A slightly, by 0.42% in average Dice. The classmates in model B still conduct the same task as the student model, i.e., image segmentation. In model C, the incorporated classmates instead carry out the dual task, i.e., boundary regression, introducing the full IC; this significantly improves the performance from 88.34% to 89.30%. The full HCDG replaces the AM strategy in model C with our proposed AG method and achieves 89.64%. Finally, we remove EC, including the teacher model, from the full HCDG but keep the AG method for the student model to construct model D. The generalization performance of model D decreases from 89.64% to 89.01%, further indicating the power of EC; this still-competitive result also implies the efficacy of IC.
Effectiveness of Feature-level Perturbation.
We conduct experiments based on the previous model C, as shown in Figure 4. Removing the feature-level perturbation from model C results in a consistent decline of generalization ability on the OC/OD segmentation task, with the average Dice decreasing from 89.30% to 89.05%. This verifies the necessity of feature-level perturbation on the dual classmates for Intrinsic Consistency.
Details of Amplitude Gaussian-mixing.
We further analyze some implementation details of our AG strategy. As illustrated in Figure 5, the blue columns denote the AG method without adaptive core areas (i.e., with a fixed peak position). In this setting, compared with the full AG method (the green columns), the overall performance degrades from 89.64% to 89.50% in average Dice, suggesting the necessity of helping the model cope with uncertain positions of the core semantics across diverse domains. As the position of the OC/OD in ROIs extracted from fundus images changes only slightly, we believe this mechanism will yield larger gains on natural image datasets. Additionally, we explore uniformly sampling the scaled length $r$. The yellow columns indicate that this setting leads to an overall decrease of 0.28%, which we attribute to the negative effects of inappropriate values. For the same variance, a small $r$ (e.g., 0.2) yields a SigMask similar to the AM strategy, resulting in degeneration; on the other hand, a large $r$ (e.g., 0.8) produces a SigMask with an excessive difference between adjacent pixels, which may be too aggressive for the model to learn. Thus, the better choice is to select a proper value of $r$ (0.5 in our setting) to generate a moderate SigMask.
In this paper, we have presented a novel Hierarchical Consistency framework for generalizable medical image segmentation on unseen datasets by ensembling Extrinsic and Intrinsic Consistency. First, we build Extrinsic Consistency on data-level perturbation and design a delicate form of Fourier-based data augmentation to synthesize new-domain images for each instance without losing structural information. Second, we incorporate a novel Classmate module into the framework, where Intrinsic Consistency is enforced to further constrain the model through the inherent prediction perturbation of related tasks. As many mainstream Domain Generalization approaches underestimate the strength of consistency regularization or employ only data-level consistency, our work sheds some light on task-level consistency for the Domain Generalization community. Extensive experiments on two important medical image segmentation tasks validate the generalization ability and robustness of our framework. In the future, we will explore other consistency schemes and apply our framework to other medical and natural image problems.
- Invariant risk minimization.
- NCI-ISBI 2013 challenge: automated segmentation of prostate structures. The Cancer Imaging Archive 370.
- Generalizing nucleus recognition model in multi-source images via pruning.
- Domain generalization by solving jigsaw puzzles. In CVPR, pp. 2229–2238.
- Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pp. 801–818.
- Domain generalization via model-agnostic learning of semantic features. In Advances in Neural Information Processing Systems 32, pp. 6450–6461.
- The lottery ticket hypothesis: finding sparse, trainable neural networks.
- Joint optic disc and cup segmentation based on multi-label deep network and polar transformation. IEEE Transactions on Medical Imaging, pp. 1597–1605.
- RIM-ONE: an open retinal image database for optic nerve evaluation. In CBMS, pp. 1–6.
- Unsupervised domain adaptation by backpropagation. In ICML, pp. 1180–1189.
- MaxUp: a simple way to improve generalization of neural network training.
- Out-of-distribution prediction with invariant risk minimization: the limitation and an effective fix.
- Distilling the knowledge in a neural network.
- Computer-aided detection and diagnosis for prostate cancer based on mono and multi-parametric MRI: a review. Computers in Biology and Medicine 60, pp. 8–31.
- Learning to generalize: meta-learning for domain generalization. In AAAI.
- Domain generalization with adversarial feature learning. In CVPR, pp. 5400–5409.
- H-DenseUNet: hybrid densely connected UNet for liver and tumor segmentation from CT volumes. IEEE Transactions on Medical Imaging, pp. 2663–2674.
- Domain generalization via conditional invariant representations. In AAAI.
- Evaluation of prostate segmentation algorithms for MRI: the PROMISE12 challenge. MIA 18 (2), pp. 359–373.
- Shape-aware meta-learning for generalizing prostate MRI segmentation to unseen domains. In MICCAI, pp. 475–485.
- MS-Net: multi-site network for improving prostate segmentation with heterogeneous MRI data. IEEE Transactions on Medical Imaging 39 (9), pp. 2713–2724.
- Semi-supervised medical image segmentation through dual-task consistency. In AAAI, pp. 8801–8809.
- V-Net: fully convolutional neural networks for volumetric medical image segmentation. In 3DV, pp. 565–571.
- A unifying view on dataset shift in classification. Pattern Recognition, pp. 521–530.
- Domain generalization via invariant feature representation. In ICML, pp. 10–18.
- Psi-Net: shape and boundary aware joint multi-task deep network for medical image segmentation. In EMBC, pp. 7223–7226.
- The fast Fourier transform. In Fast Fourier Transform and Convolution Algorithms, pp. 80–111.
- REFUGE challenge: a unified framework for evaluating automated methods for glaucoma assessment from fundus photographs. Medical Image Analysis, pp. 101570.
- Semi-supervised semantic segmentation with cross-consistency training. In CVPR, pp. 12674–12684.
- Learning to learn single domain generalization. In CVPR, pp. 12556–12565.
- Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems 29, pp. 1163–1171.
- A comprehensive retinal image dataset for the assessment of glaucoma from the optic nerve head analysis. JSM Biomedical Imaging Data Papers, pp. 1004.
- Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results.
- Artificial intelligence and deep learning in ophthalmology. British Journal of Ophthalmology, pp. 167–175.
- Domain randomization for transferring deep neural networks from simulation to the real world. In IROS, pp. 23–30.
- Manifold mixup: better representations by interpolating hidden states. In ICML, pp. 6438–6447.
- Boundary and entropy-driven adversarial learning for fundus image segmentation. In MICCAI, pp. 102–110.
- DoFE: domain-oriented feature embedding for generalizable fundus image segmentation on unseen datasets. IEEE Transactions on Medical Imaging 39 (12), pp. 4237–4248.
- Patch-based output space adversarial learning for joint optic disc and cup segmentation. IEEE Transactions on Medical Imaging, pp. 2485–2495.
- A Fourier-based framework for domain generalization. In CVPR, pp. 14383–14392.
- Shape-aware organ segmentation by predicting signed distance maps. In AAAI, Vol. 34, pp. 12565–12572.
- Automatic 3D cardiovascular MR segmentation with densely-connected volumetric convnets. In MICCAI, pp. 287–295.
- Volumetric convnets with mixed residual connections for automated prostate segmentation from 3D MR images. In AAAI.
- Improving dermoscopic image segmentation with enhanced convolutional-deconvolutional networks. IEEE Journal of Biomedical and Health Informatics, pp. 519–526.
- Domain randomization and pyramid consistency: simulation-to-real generalization without accessing target domain data. In ICCV, pp. 2100–2110.
- CutMix: regularization strategy to train strong classifiers with localizable features. In CVPR, pp. 6023–6032.
- Mixup: beyond empirical risk minimization.
- Deep domain-adversarial image generation for domain generalisation. In AAAI, pp. 13025–13032. Cited by: §1, §2.
- A fixed-point model for pancreas segmentation in abdominal ct scans. In MICCAI, pp. 693–701. Cited by: §2.
- Explainable deep classification models for domain generalization. In CVPR, pp. 3233–3242. Cited by: §2.
In this supplementary material, we first provide more details about our training procedure and a detailed algorithm in Section A, together with method-specific hyper-parameter settings. We then present the statistics of the Fundus and Prostate MRI datasets and the corresponding pre-processing details in Section B. Section C visualizes amplitude-perturbed images under different variance values in our Amplitude Gaussian-mixing method. In Section D, we report the standard deviation of the results over three runs on the two tasks to better compare our framework with other methods. Finally, in Section E we conduct additional experiments to analyze the effects of the number of classmates and the balancing hyper-parameter.
Appendix A Details of Training Strategy
In our framework, the teacher model has the same architecture as the student model, while the classmates are extra decoders that share the encoder with the student model. Instead of receiving gradients through backpropagation, the teacher model obtains its parameters from the student model via an exponential moving average (EMA), following the mean-teacher framework Tarvainen and Valpola (2018):

$\theta'_t = \alpha \theta'_{t-1} + (1 - \alpha)\,\theta_t,$

where $\theta'_t$ and $\theta_t$ denote the teacher and student parameters at training step $t$, and $\alpha$ is the decay rate of the update. The classmate module does not participate in the EMA.
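The EMA update can be sketched in a few lines of plain Python. This is an illustrative sketch operating on flat lists of parameter values; a real implementation would update framework tensors in place, and `ema_update` is a hypothetical helper name:

```python
def ema_update(teacher, student, alpha=0.9995):
    """Move each teacher parameter toward the student parameter:
    theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]
```

With a decay rate close to 1, the teacher changes slowly and acts as a temporal ensemble of past student weights.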
The encoder of the student model is updated simultaneously by gradients from the image segmentation task and the boundary regression task, and therefore gains a stronger ability to recognize both the overall structure and the boundary of target objects. In the testing phase, we simply use the student model for evaluation. The detailed training procedure is given in Algorithm 1, whose equation numbers are consistent with the main paper. Note that we force the dataloader to sample each mini-batch so that it contains images from all source domains.
For all experiments, we set the EMA momentum of the teacher model to 0.9995, the temperature to 10, and the number of candidates per source domain in DomainUp to 1. In the Amplitude Gaussian-mixing strategy, the upper bound of the Gaussian-mixing function is set to 1.0, and the scaled length is fixed to 0.5. The scaling factor in the transformation function is 20. The weight of the consistency loss is set to 200 and follows a sigmoid ramp-up Tarvainen and Valpola (2018) with a length of 5 epochs. For Fundus, we use weak augmentation consisting of random scaling and cropping, random flipping, and light adjustment. For Prostate MRI, we only employ random horizontal flipping and light adjustment as data augmentation.
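The sigmoid ramp-up of the consistency weight can be sketched as follows. This assumes the exponential schedule w(t) = w_max · exp(−5(1 − t/T)²) commonly used in the mean-teacher literature; the exact form used here is not spelled out, and `consistency_weight` is a hypothetical helper name:

```python
import math

def consistency_weight(epoch, ramp_up_length=5, w_max=200.0):
    """Sigmoid-shaped ramp-up of the consistency-loss weight:
    starts near 0 and saturates at w_max after ramp_up_length epochs."""
    if epoch >= ramp_up_length:
        return w_max
    phase = 1.0 - epoch / ramp_up_length
    return w_max * math.exp(-5.0 * phase * phase)
```

Ramping the weight up avoids dominating the segmentation loss with noisy consistency targets early in training.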
Appendix B Details of Datasets and Pre-processing
Detailed statistics of the Fundus Wang et al. (2020) and Prostate MRI Liu et al. (2020a) datasets are listed in Table 5. Both datasets were pre-processed before network training. For the Fundus dataset, we cropped regions of interest (ROIs) centered at the optic disc by utilizing a simple U-Net and then resized them as the network input, following Wang et al. (2020). For Prostate MRI, we resized each sample in the axial plane and normalized it individually to zero mean and unit variance. We then clipped each sample to preserve only the slices containing the prostate region, so that the segmentation targets are consistent across sites, following Liu et al. (2020a).
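The per-sample intensity normalization for Prostate MRI can be sketched as below. This is a minimal pure-Python sketch over a flattened list of voxel intensities; `zscore` is a hypothetical helper name, and the resizing and clipping steps would use an image-processing library:

```python
import math

def zscore(values):
    """Normalize one sample's voxel intensities to zero mean, unit variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var)
    # Guard against constant-intensity samples (std == 0).
    return [(v - mean) / (std if std > 0 else 1.0) for v in values]
```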
Appendix C Visualization of Perturbed Images
We visualize the appearance of amplitude-perturbed images under different variance values in our Amplitude Gaussian-mixing method for the two tasks. As shown in Figure 7, the appearance of the source image gradually shifts from the style of the candidates back to its original style as the variance increases from 0.4 to 0.8, while the core semantics of the source image remain unchanged. Our AG method retains the rich variability of the previous AM strategy while highlighting the core semantics, which helps the model capture domain-invariant information and improves generalizability.
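As a rough illustration of Gaussian-weighted amplitude mixing, the sketch below blends a source amplitude value with a candidate amplitude value using a weight that decays smoothly with distance from the spectrum center, instead of the hard low-frequency cutoff of binary amplitude-mix masks. This is only one plausible form: the exact Gaussian-mixing function of the paper is not given here, and `gaussian_mix_coeff` / `mix_amplitude` are hypothetical names:

```python
import math

def gaussian_mix_coeff(freq_dist, sigma, lam_max=1.0):
    """Mixing weight for the candidate amplitude at a (normalized)
    distance from the spectrum center; decays smoothly with distance."""
    return lam_max * math.exp(-(freq_dist ** 2) / (2.0 * sigma ** 2))

def mix_amplitude(a_src, a_cand, freq_dist, sigma):
    """Interpolate source and candidate amplitude at one frequency."""
    lam = gaussian_mix_coeff(freq_dist, sigma)
    return (1.0 - lam) * a_src + lam * a_cand
```

Since phase (and high-frequency amplitude) carries most of the structural content, blending only the smoothly weighted amplitude spectrum alters style while preserving the core semantics.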
Table 4 column headers: OC | OD | Avg. | Time (h) | Params (million)
Appendix D Comparison with Other DG Methods
Due to space limitations, only the mean over three runs is reported in Table 1 and Table 2 of the main paper. Here, we additionally provide the standard deviation (std) of the Dice coefficients over three runs for recent DG methods and our framework on the OC/OD segmentation and Prostate MRI segmentation tasks, as shown in Table 6 and Table 7 respectively. Our framework outperforms all state-of-the-art methods and yields a relatively stable average Dice on both tasks.
Appendix E Further Analysis of Our Method
Analysis of the Number of Classmates.
We conduct experiments with different numbers of classmates, as shown in Table 4. As the number of classmates increases, both the training time and the number of parameters grow. Our framework achieves the highest average Dice at a moderate number of classmates; since too many classmates reduce efficiency, we empirically choose the number of classmates in our Classmate module to balance accuracy and efficiency.
Effects of the Balancing Hyper-parameter.
The results on the OC/OD segmentation task under different values of the balancing hyper-parameter are illustrated in Figure 6. As observed, our approach achieves its best performance at a moderate value of this hyper-parameter.
Table 5: Statistics of the Fundus and Prostate MRI datasets.

| Tasks | Domain No. | Dataset | Training samples | Scanners (Institutions) |
| --- | --- | --- | --- | --- |
| Fundus | Domain 1 | Drishti-GS Sivaswamy et al. (2015) | 50 | (Aravind eye hospital) |
| Fundus | Domain 2 | RIM-ONE-r3 Fumero et al. (2011) | 99 | Nidek AFC-210 |
| Fundus | Domain 3 | REFUGE (train) Orlando et al. (2020) | 320 | Zeiss Visucam 500 |
| Fundus | Domain 4 | REFUGE (val) Orlando et al. (2020) | 320 | Canon CR-2 |
| Prostate MRI | Domain 1 | NCI-ISBI 2013 Bloch et al. (2015) | 30 | (RUNMC) |
| Prostate MRI | Domain 2 | NCI-ISBI 2013 Bloch et al. (2015) | 30 | (BMC) |
| Prostate MRI | Domain 3 | I2CVB Lemaître et al. (2015) | 19 | (HCRUDB) |
| Prostate MRI | Domain 4 | PROMISE12 Litjens et al. (2014) | 13 | (UCL) |
| Prostate MRI | Domain 5 | PROMISE12 Litjens et al. (2014) | 12 | (BIDMC) |
| Prostate MRI | Domain 6 | PROMISE12 Litjens et al. (2014) | 12 | (HK) |
Table 6: Dice coefficient (%, mean ± std over three runs) on the OC/OD segmentation task.

| Method | OC D1 | OC D2 | OC D3 | OC D4 | OC Avg. | OD D1 | OD D2 | OD D3 | OD D4 | OD Avg. | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mixup Zhang et al. (2018) | 71.73±0.51 | 77.70±2.31 | 78.24±1.26 | 87.23±0.28 | 78.73±0.63 | 94.66±0.42 | 88.88±0.85 | 89.63±0.95 | 89.99±0.34 | 90.79±0.10 | 84.76±0.34 |
| M-mixup Verma et al. (2019) | 78.15±1.89 | 78.50±2.12 | 78.04±3.16 | 87.48±0.60 | 80.54±0.09 | 94.47±0.62 | 90.53±0.78 | 90.60±0.51 | 86.68±0.56 | 90.57±0.47 | 85.56±0.24 |
| CutMix Yun et al. (2019) | 77.41±1.19 | 81.30±1.42 | 80.23±0.36 | 84.27±0.51 | 80.80±0.53 | 94.50±0.83 | 90.92±0.41 | 89.57±0.69 | 88.95±0.55 | 90.99±0.27 | 85.89±0.35 |
| JiGen Carlucci et al. (2019) | 81.04±0.45 | 79.34±1.19 | 81.14±2.29 | 83.75±1.79 | 81.32±0.95 | 95.60±0.16 | 89.91±1.52 | 91.61±0.63 | 92.52±0.67 | 92.41±0.35 | 86.86±0.63 |
| DoFE Wang et al. (2020) | 82.86±0.35 | 78.80±1.11 | 86.12±1.19 | 87.07±0.73 | 83.71±0.77 | 95.88±0.32 | 91.58±1.19 | 91.83±0.79 | 92.40±0.61 | 92.92±0.13 | 88.32±0.41 |
| SAML Liu et al. (2020a) | 83.16±1.25 | 75.68±1.81 | 82.00±1.06 | 82.88±0.78 | 80.93±0.85 | 94.30±0.57 | 91.17±1.02 | 92.28±0.85 | 87.95±0.78 | 91.43±0.32 | 86.18±0.84 |
| FACT Xu et al. (2021) | 79.66±0.50 | 79.25±0.59 | 83.07±0.47 | 86.57±0.22 | 82.14±0.29 | 95.28±0.77 | 90.19±0.55 | 94.09±0.63 | 90.54±0.52 | 92.53±0.28 | 87.33±0.13 |
Table 7: Dice coefficient (%, mean ± std over three runs) on the Prostate MRI segmentation task.

| Method | Domain 1 | Domain 2 | Domain 3 | Domain 4 | Domain 5 | Domain 6 | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mixup Zhang et al. (2018) | 84.77±1.05 | 84.50±1.92 | 79.18±3.35 | 82.99±2.41 | 84.68±6.35 | 84.12±1.68 | 83.37±1.48 |
| M-mixup Verma et al. (2019) | 86.94±1.42 | 82.61±1.01 | 85.87±2.40 | 84.98±2.90 | 82.35±6.76 | 82.50±3.95 | 84.21±1.91 |
| CutMix Yun et al. (2019) | 83.27±0.98 | 84.36±0.99 | 85.35±4.17 | 81.06±1.08 | 85.67±3.61 | 84.95±2.84 | 84.11±1.29 |
| JiGen Carlucci et al. (2019) | 84.53±1.01 | 85.41±1.58 | 81.48±1.59 | 83.18±1.07 | 89.66±2.34 | 85.42±2.59 | 84.95±0.57 |
| DoFE Wang et al. (2020) | 84.66±0.30 | 84.42±2.70 | 85.22±2.98 | 86.31±3.50 | 87.60±3.47 | 86.96±2.29 | 85.86±0.79 |
| SAML Liu et al. (2020a) | 89.66±1.31 | 87.53±1.05 | 84.43±0.51 | 88.67±0.96 | 87.37±2.65 | 88.34±0.80 | 87.67±0.81 |
| FACT Xu et al. (2021) | 85.82±0.79 | 85.45±2.00 | 84.93±2.60 | 88.75±3.30 | 87.31±2.18 | 87.43±1.32 | 86.62±1.59 |