HCDG: A Hierarchical Consistency Framework for Domain Generalization on Medical Image Segmentation

Modern deep neural networks struggle to transfer knowledge and generalize across domains when deployed in real-world applications. Domain generalization (DG) aims to learn a universal representation from multiple source domains to improve the network's generalization ability on unseen target domains. Previous DG methods mostly focus on data-level consistency schemes to advance the generalization capability of deep networks, without considering the synergistic regularization of different consistency schemes. In this paper, we present a novel Hierarchical Consistency framework for Domain Generalization (HCDG) by ensembling Extrinsic Consistency and Intrinsic Consistency. Specifically, for Extrinsic Consistency, we leverage the knowledge across multiple source domains to enforce data-level consistency. We also design a novel Amplitude Gaussian-mixing strategy for Fourier-based data augmentation to enhance such consistency. For Intrinsic Consistency, we enforce task-level consistency for the same instance under a dual-task form. We evaluate the proposed HCDG framework on two medical image segmentation tasks, i.e., optic cup/disc segmentation on fundus images and prostate MRI segmentation. Extensive experimental results demonstrate the effectiveness and versatility of our HCDG framework. Code will be made available upon acceptance.



1 Introduction

Deep neural networks (DNNs) have made remarkable progress on diverse medical image analysis tasks. Most of these achievements depend on the assumption that networks are trained and tested on samples drawn from the same distribution or domain. Once this assumption fails, i.e., domain shift Moreno-Torres et al. (2012) exists, networks may yield unsatisfactory performance due to their limited generalization ability. In real clinical settings, medical images are typically captured by different institutions with various scanner vendors, patient populations, fields of view, and appearance discrepancies Ting et al. (2019), which makes learned models struggle to transfer knowledge and generalize across domains. Since training specific models for each medical center is unrealistic and laborious, it is necessary to enhance the generalization ability of deep models across different and even new clinical sites.

The community has so far tackled the domain shift bottleneck mainly in two directions. First, Unsupervised Domain Adaptation (UDA) exploits prior knowledge extracted from unlabeled target domain images to achieve model adaptation. Although UDA-based approaches avoid time-consuming annotation of the target domain, collecting target images in advance is often infeasible in practice. This inspires another direction, Domain Generalization (DG), which aims to learn a universal representation from multiple source domains without any target domain information. In this paper, we adopt a DG-based method to improve network generalization on medical image segmentation tasks.

Under the domain generalization scope, data manipulation Tobin et al. (2017); Zhou et al. (2020), domain-invariant representation learning Muandet et al. (2013); Ganin and Lempitsky (2015); Arjovsky et al. (2020), and meta-learning techniques Li et al. (2018a); Dou et al. (2019) have achieved remarkable success. Meanwhile, consistency regularization, which prevails in Semi-Supervised Learning (SSL) and UDA, has been introduced to mitigate performance degradation by forcing the model to learn invariant information from perturbed samples and has shown promising results in different tasks. However, most previous consistency regularization-based works Yue et al. (2019); Xu et al. (2021) simply enforce data-level consistency by generating new domains with novel appearance and then minimizing the discrepancy between original and generated domains for the same instance. Such consistency, based on data-level perturbation, usually requires extrinsic knowledge from other source domains and heavily depends on the quality of generated domains. To overcome the above shortcomings, we are motivated to explore other kinds of consistency regularization for the DG problem. Particularly, based on the observation that related tasks inherently introduce prediction perturbation during network training, we may leverage the intrinsic consistency of related tasks from the same instance, without extrinsic generated domains, to encourage the network to learn generalizable representations. Additionally, how to integrate data-level consistency and task-level consistency and leverage their complementary effect is also well worth exploring.

To this end, we present a novel Hierarchical Consistency framework for Domain Generalization (HCDG) by enforcing Extrinsic and Intrinsic Consistency simultaneously. To the best of our knowledge, we are the first to introduce task-level perturbation into DG and integrate several kinds of consistency regularization into a hierarchical cohort. For Extrinsic Consistency, we leverage the knowledge across multiple source domains to enforce data-level consistency. Inspired by the observation that the phase and amplitude components in the Fourier spectrum of signals retain the high-level semantics (e.g., structure) and low-level statistics (e.g., appearance), respectively, we introduce an improved Fourier-based Amplitude Gaussian-mixing (AG) method, deployed in a scheme called DomainUp, to produce augmented domains with richer variability than the previous Amplitude Mix (AM) scheme Xu et al. (2021). For Intrinsic Consistency, we enforce task-level consistency for the same instance under two related tasks: image segmentation and boundary regression. The Extrinsic and Intrinsic Consistency are further integrated into a teacher-student-like cohort to facilitate network learning. We evaluate the proposed HCDG on two medical image segmentation tasks, i.e., optic cup/disc segmentation on fundus images and prostate segmentation on MRI images. Our HCDG framework achieves state-of-the-art performance in both tasks. The main contributions are summarized as follows.

  • We develop an effective HCDG framework for generalizable medical image segmentation by simultaneously integrating Extrinsic and Intrinsic Consistency.

  • We design a novel Amplitude Gaussian-mixing strategy for Fourier-based data augmentation by introducing pixel-wise perturbation in the amplitude spectrum to highlight core semantic structures.

  • Extensive experiments on two medical image segmentation benchmark datasets validate the efficacy and universality of the framework, with HCDG clearly outperforming several state-of-the-art DG methods.

2 Related Work

Domain Generalization.

Domain generalization aims to learn a general model from multiple source domains such that the model can directly generalize to arbitrary unseen target domains. Recently, many DG approaches have achieved remarkable results. Early DG works mainly follow the representation learning spirit via kernel methods Muandet et al. (2013); Li et al. (2018d), domain adversarial learning Ganin and Lempitsky (2015); Li et al. (2018b), invariant risk minimization Arjovsky et al. (2020); Guo et al. (2021), multi-component analysis Zunino et al. (2021), and generative modeling Qiao et al. (2020). Data manipulation is one of the cheapest ways to tackle the dearth of training data and enhance the generalization capability of the model, through two popular techniques: data augmentation and data generation. For example, domain randomization Tobin et al. (2017), an adversarially trained transformation network Zhou et al. (2020), and Mixup Zhang et al. (2018) are utilized to generate more training samples. Meanwhile, Xu et al. Xu et al. (2021) introduced Fourier-based data augmentation for DG by linearly distorting the amplitude information. DG has also been studied in general machine learning paradigms. Li et al. Li et al. (2018a) designed a model-agnostic training procedure, which is derived from meta-learning. Carlucci et al. Carlucci et al. (2019) formulated a self-supervision task of solving jigsaw puzzles to learn generalized representations. Inspired by the Lottery Ticket Hypothesis Frankle and Carbin (2019), Cai et al. Cai et al. (2021) proposed to learn domain-invariant parameters of the model during training.

Figure 1: Overview of the proposed HCDG framework. First, the weakly and strongly augmented replicas are generated by DomainUp from the source image and its candidates. For Extrinsic Consistency, both replicas are sent to the student and teacher models to conduct the image segmentation task. They leverage the additional knowledge from interpolated domains to enforce data-level consistency. For Intrinsic Consistency, only the weakly augmented replica is fed into the incorporated Classmate module to conduct the dual task, i.e., the boundary regression task.

Consistency Regularization.

Consistency regularization is widely used in supervised and semi-supervised learning but has not yet played a significant role in DG. Sajjadi et al. Sajjadi et al. (2016) first introduced a consistency loss to utilize the stochastic nature of data augmentation and minimize the discrepancy between the predictions of multiple passes of a training sample through the network. Tarvainen and Valpola Tarvainen and Valpola (2018) designed a teacher-student model to provide better consistency alignment. Besides, Yue et al. Yue et al. (2019) proposed a pyramid consistency to learn a model with high generalizability via domain randomization, which still rests on data-level perturbation. Very recently, task-level consistency has been used for semi-supervised learning Luo et al. (2021). To the best of our knowledge, no work has explored task-level consistency for the DG problem.

Medical Image Segmentation.

DNNs have been widely used in medical image segmentation tasks, such as cardiac segmentation from MRI Yu et al. (2017a), organ segmentation from CT Zhou et al. (2017); Li et al. (2018c), and skin lesion segmentation from dermoscopic images Yuan and Lo (2017). In this paper, we mainly focus on two medical tasks, i.e., OC/OD segmentation from retinal fundus images Wang et al. (2019a), and prostate segmentation from T2-weighted MRI Liu et al. (2020b). Previously, Fu et al. Fu et al. (2018) and Wang et al. Wang et al. (2019b) showed competitive results on jointly segmenting OC/OD, while Milletari et al. Milletari et al. (2016) and Yu et al. Yu et al. (2017b) successfully improved the performance on prostate segmentation. However, most of these methods lack generalization ability and tend to produce high test errors on unseen target datasets. Thus, a more generalizable method is highly desired to alleviate performance degradation.

3 Methodology

Figure 1 depicts the proposed Hierarchical Consistency framework for Domain Generalization (HCDG). We consider a training set of multiple source domains, each containing labeled samples, i.e., images paired with their segmentation labels. Our HCDG framework learns a domain-agnostic model from the distributed source domains by enforcing Extrinsic Consistency and Intrinsic Consistency simultaneously, so that it can directly generalize to a completely unseen domain with mitigated performance degradation.

3.1 Extrinsic Consistency (EC)

We first exploit consistency regularization from the extrinsic aspect, i.e., leveraging extra knowledge from other source domains, to enforce data-level consistency for each instance. Specifically, we propose a new paradigm of Fourier-based data augmentation, named Amplitude Gaussian-mixing (AG), to perturb the spectral amplitude and then generate augmented images. Based on AG, we further design a DomainUp scheme to provide new domains with enough variability. Finally, we utilize a mean-teacher framework to minimize the discrepancy between augmented replicas from the same instance with dual-view consistency regularization.

Amplitude Gaussian-mixing.

Previous Fourier-based augmentation work Xu et al. (2021) linearly mixes the spectral amplitude of the whole image and keeps the phase information invariant to synthesize interpolated domains. However, this strategy treats each pixel equally during mixing, so it can hardly distinguish the magnitude of semantic structures in the central and marginal areas. Thus, we design a novel Fourier-based Amplitude Gaussian-mixing (AG) strategy for amplitude perturbation by introducing a significance mask, SigMask, for linear interpolation, where the Gaussian-like SigMask controls the perturbation magnitude at each pixel. Specifically, we first extract the frequency-space signal of each sample through the fast Fourier transform Nussbaumer (1981) and further decompose it into an amplitude spectrum and a phase spectrum. For each sample, the perturbed amplitude spectrum is computed by linearly interpolating between its own amplitude spectrum and that of a counterpart sample, with SigMask applied through element-wise multiplication as the per-pixel mixing weight. Then, we generate the augmented image of the interpolated domain via the inverse Fourier transform.

The values of SigMask follow a Gaussian distribution, evaluated at each pixel. It is worth noting that we rescale the range over which the Gaussian is evaluated to avoid very small significance values in the marginal area of SigMask. The variance controls the peak value of the Gaussian-mixing function when the other parameters are fixed, while the mean determines the position of the peak. To prevent generating outliers, we derive a lower bound on the variance that keeps the mask values bounded. We also impose an upper bound on the variance.

The proposed AG is simple but effective, bringing three benefits: (1) the Gaussian-mixing function highlights the core information by endowing the center area of the image with more magnitude than the marginal area; (2) adaptive center areas can be generated by empirically sampling the mean of the Gaussian, which copes with uncertain positions of the core semantics; and (3) stability is improved because the perturbation is controlled through the variance of the Gaussian-mixing function.
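The AG idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the parameters `sigma`, `mu`, `half_range`, and `peak`, and the choice to center the spectrum with `fftshift` before applying the mask are all assumptions made for illustration.

```python
import numpy as np


def gaussian_sig_mask(height, width, sigma=0.5, mu=(0.0, 0.0),
                      half_range=0.5, peak=1.0):
    """Gaussian-like mixing mask with values in (0, peak].

    Hypothetical parameterization: coordinates are scaled to
    [-half_range, half_range], `mu` shifts the peak, `sigma` sets the spread.
    """
    ys = np.linspace(-half_range, half_range, height)
    xs = np.linspace(-half_range, half_range, width)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    dist2 = (yy - mu[0]) ** 2 + (xx - mu[1]) ** 2
    return peak * np.exp(-dist2 / (2.0 * sigma ** 2))


def amplitude_gaussian_mix(img, ref, mask):
    """Mix the amplitude spectra of `img` and `ref` pixel-wise with `mask`.

    The phase of `img` is kept, so its structure is preserved while its
    appearance moves towards `ref` where the mask is large.
    """
    f_img = np.fft.fft2(img)
    f_ref = np.fft.fft2(ref)
    amp_img, pha_img = np.abs(f_img), np.angle(f_img)
    amp_ref = np.abs(f_ref)

    # Assumption: shift low frequencies to the center so that the mask peak
    # coincides with them, then undo the shift after mixing.
    amp_img_c = np.fft.fftshift(amp_img)
    amp_ref_c = np.fft.fftshift(amp_ref)
    amp_mix = (1.0 - mask) * amp_img_c + mask * amp_ref_c
    amp_mix = np.fft.ifftshift(amp_mix)

    # Recombine mixed amplitude with the original phase; keep the real part.
    f_mix = amp_mix * np.exp(1j * pha_img)
    return np.real(np.fft.ifft2(f_mix))


rng = np.random.default_rng(0)
source = rng.random((8, 8))
counterpart = rng.random((8, 8))
sig_mask = gaussian_sig_mask(8, 8, sigma=0.5)
augmented = amplitude_gaussian_mix(source, counterpart, sig_mask)
```

With an all-zero mask the function returns the source image unchanged, and a constant mask reduces the scheme to plain linear amplitude mixing, which is the special case the Gaussian mask is meant to generalize.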


DomainUp.

To obtain more informative interpolated samples, for each source image we propose to search for the worst augmented case Gong et al. (2020) among a set of candidates, where a fixed number of instances is sampled from each source domain. As shown in Figure 2, we first obtain the weakly augmented source image and candidates by a standard augmentation protocol (e.g., random scaling and flipping), and then apply the Fourier-based AG augmentation to the weakly augmented source image and each candidate to acquire the corresponding strongly augmented replicas. The worst augmented case is the strongly augmented replica with the maximal supervised loss value. During network learning, we combine the maximal segmentation loss over the strongly augmented replicas and the normal segmentation loss on the weakly augmented image as the total supervised segmentation loss, where both losses are computed on the predictions of the student model.

Figure 2: Illustration of DomainUp. For each source image, DomainUp conducts weak data augmentation and AG-based strong data augmentation to obtain the weakly augmented image and the strongly augmented replicas. The worst augmented case is then selected as the replica with the maximal segmentation loss.
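The worst-case selection in DomainUp can be sketched as follows. This is an illustrative sketch only: `domainup_loss`, the `predict` callable standing in for the student model, and the use of binary cross-entropy as the segmentation loss are assumptions, since the text does not state which segmentation loss is used.

```python
import numpy as np


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a probability map and a binary mask."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))


def domainup_loss(predict, weak_img, strong_imgs, label,
                  loss_fn=binary_cross_entropy):
    """Supervised loss = loss on the weak view + maximal loss over strong views.

    `predict` is a hypothetical stand-in for the student model; it maps an
    image to a probability map with the same shape as `label`.
    Returns the total loss and the index of the worst augmented case.
    """
    weak_loss = loss_fn(predict(weak_img), label)
    strong_losses = [loss_fn(predict(img), label) for img in strong_imgs]
    worst_idx = int(np.argmax(strong_losses))
    return weak_loss + strong_losses[worst_idx], worst_idx


# Toy usage: the "model" is the identity, so each image is its own prediction.
label = np.ones((2, 2))
weak = np.full((2, 2), 0.9)
strong = [np.full((2, 2), p) for p in (0.8, 0.2, 0.6)]
total, worst = domainup_loss(lambda x: x, weak, strong, label)
```

In the toy usage the second strong replica yields the largest loss, so it is the one that contributes to the total.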

Dual-view Consistency Regularization.

After acquiring the weakly and strongly augmented images, we explicitly enforce a dual-view consistency regularization for Extrinsic Consistency. Specifically, the consistency is implemented with a momentum-updated teacher model to provide dual-view instance alignment as a stronger constraint. We feed the weakly and strongly augmented images into both the student and teacher networks (which share the same architecture) and then minimize the discrepancy between their outputs with the KL divergence. The divergence is computed between the teacher and student predictions after a softmax operation softened by a temperature Hinton et al. (2015). Overall, the objective function of EC is composed of the supervised segmentation loss and the consistency KL loss, combined through a balancing hyper-parameter.
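A temperature-softened KL consistency term and a momentum (exponential moving average) teacher update can be sketched in NumPy as below. The direction of the divergence (teacher as reference), the default temperature and momentum values, and all function names are assumptions for illustration; the source does not specify them here.

```python
import numpy as np


def softmax(logits, temperature=1.0, axis=0):
    """Numerically stable softmax over `axis` with temperature scaling."""
    z = logits / temperature
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def kl_consistency(student_logits, teacher_logits, temperature=2.0, eps=1e-8):
    """Mean per-pixel KL(teacher || student) for logits shaped (C, H, W)."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = np.sum(p_t * (np.log(p_t + eps) - np.log(p_s + eps)), axis=0)
    return float(kl.mean())


def ema_update(teacher_params, student_params, momentum=0.99):
    """Momentum update: teacher <- m * teacher + (1 - m) * student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


rng = np.random.default_rng(0)
student_out = rng.normal(size=(2, 4, 4))
teacher_out = rng.normal(size=(2, 4, 4))
consistency = kl_consistency(student_out, teacher_out)
new_teacher = ema_update([np.array([1.0])], [np.array([0.0])], momentum=0.9)
```

A higher temperature flattens both distributions, so the loss pays more attention to the relative ordering of classes than to the most confident one.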

3.2 Intrinsic Consistency (IC)

Different from EC, IC introduces a Classmate module to enforce task-level consistency guided by intrinsic perturbation. Because the predictions of related tasks for the same sample inherently differ in the output space, consistency can be enforced between them after a proper transformation, without extra knowledge.

Classmate Module.

To conduct dual tasks, we incorporate a new Classmate Module, consisting of several decoders (classmates), into the student model. We then impose task-level constraints between the predictions of the two tasks after transformation. To further strengthen the diversity between the outputs of the two tasks, we also apply a feature-level perturbation to the feature map (i.e., feature dropout or noise) following Ouali et al. Ouali et al. (2020). In practice, we define boundary regression as the second task to capture geometric structure Wang et al. (2019a); Murugesan et al. (2019). Specifically, each classmate is composed of four convolutional layers followed by ReLU and batch normalization layers, and receives the perturbed feature map of the weakly augmented image. To generate the boundary ground truth, we apply morphological operations to the mask ground truth as Wang et al. Wang et al. (2019a) do. The supervised boundary loss is formulated as the mean squared error (MSE) between the classmates' predictions and the boundary ground truth, averaged over the labeled samples. The number of classmates is chosen to balance accuracy and efficiency. Note that each classmate receives a different version of the perturbed feature map.
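One simple way to derive a boundary map from a binary mask with morphological operations, and to score a regressed boundary with MSE, is sketched below. The specific operation (mask minus its erosion with a 3x3 cross) and the function names are assumptions; the source only states that morphological operations are used, following Wang et al. (2019a).

```python
import numpy as np


def erode(mask):
    """Binary erosion with a 3x3 cross; pixels outside the image count as background."""
    padded = np.pad(mask.astype(bool), 1, constant_values=False)
    return (padded[1:-1, 1:-1]
            & padded[:-2, 1:-1] & padded[2:, 1:-1]
            & padded[1:-1, :-2] & padded[1:-1, 2:])


def boundary_from_mask(mask):
    """Boundary = object pixels that disappear under erosion."""
    m = mask.astype(bool)
    return (m & ~erode(m)).astype(np.float32)


def boundary_mse(pred, target):
    """Mean squared error between a regressed boundary map and its target."""
    return float(np.mean((pred - target) ** 2))


# Toy usage: a 3x3 square inside a 5x5 image has an 8-pixel boundary ring.
toy_mask = np.zeros((5, 5), dtype=np.uint8)
toy_mask[1:4, 1:4] = 1
toy_boundary = boundary_from_mask(toy_mask)
```

Only the center pixel of the square survives erosion, so the remaining eight object pixels form the boundary target.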

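Dice Coefficient (Dice) [%]

| Method | OC 1 | OC 2 | OC 3 | OC 4 | OC Avg. | OD 1 | OD 2 | OD 3 | OD 4 | OD Avg. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 78.75 | 75.97 | 83.33 | 85.14 | 80.80 | 94.77 | 90.30 | 90.90 | 91.87 | 91.96 | 86.38 |
| Mixup Zhang et al. (2018) | 71.73 | 77.70 | 78.24 | 87.23 | 78.73 | 94.66 | 88.88 | 89.63 | 89.99 | 90.79 | 84.76 |
| M-mixup Verma et al. (2019) | 78.15 | 78.50 | 78.04 | 87.48 | 80.54 | 94.47 | 90.53 | 90.60 | 86.68 | 90.57 | 85.56 |
| CutMix Yun et al. (2019) | 77.41 | 81.30 | 80.23 | 84.27 | 80.80 | 94.50 | 90.92 | 89.57 | 88.95 | 90.99 | 85.89 |
| JiGen Carlucci et al. (2019) | 81.04 | 79.34 | 81.14 | 83.75 | 81.32 | 95.60 | 89.91 | 91.61 | 92.52 | 92.41 | 86.86 |
| DoFE Wang et al. (2020) | 82.86 | 78.80 | 86.12 | 87.07 | 83.71 | 95.88 | 91.58 | 91.83 | 92.40 | 92.92 | 88.32 |
| SAML Liu et al. (2020a) | 83.16 | 75.68 | 82.00 | 82.88 | 80.93 | 94.30 | 91.17 | 92.28 | 87.95 | 91.43 | 86.18 |
| FACT Xu et al. (2021) | 79.66 | 79.25 | 83.07 | 86.57 | 82.14 | 95.28 | 90.19 | 94.09 | 90.54 | 92.53 | 87.33 |
| HCDG (Ours) | 85.44 | 82.05 | 86.39 | 87.60 | 85.37 | 95.31 | 92.68 | 93.86 | 93.80 | 93.91 | 89.64 |

Average Surface Distance (ASD) [pixel]

| Method | OC 1 | OC 2 | OC 3 | OC 4 | OC Avg. | OD 1 | OD 2 | OD 3 | OD 4 | OD Avg. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 21.64 | 16.77 | 11.58 | 7.92 | 14.48 | 8.98 | 17.10 | 12.64 | 9.10 | 11.96 | 13.22 |
| Mixup Zhang et al. (2018) | 28.07 | 15.34 | 14.42 | 7.01 | 16.21 | 9.15 | 15.41 | 14.44 | 10.88 | 12.47 | 14.34 |
| M-mixup Verma et al. (2019) | 21.89 | 13.37 | 14.65 | 6.61 | 14.13 | 9.67 | 14.40 | 13.03 | 14.08 | 12.80 | 13.46 |
| CutMix Yun et al. (2019) | 22.38 | 12.65 | 13.33 | 7.94 | 14.08 | 9.40 | 12.83 | 14.41 | 11.80 | 12.11 | 13.09 |
| JiGen Carlucci et al. (2019) | 19.34 | 13.36 | 12.86 | 9.91 | 13.87 | 7.62 | 15.27 | 11.60 | 10.73 | 11.31 | 12.59 |
| DoFE Wang et al. (2020) | 17.68 | 14.71 | 9.92 | 7.21 | 12.38 | 7.23 | 14.08 | 11.43 | 9.29 | 10.51 | 11.44 |
| SAML Liu et al. (2020a) | 18.20 | 16.87 | 13.38 | 9.89 | 14.59 | 10.08 | 12.90 | 12.31 | 14.22 | 12.38 | 13.48 |
| FACT Xu et al. (2021) | 20.71 | 14.51 | 11.61 | 7.83 | 13.67 | 8.19 | 14.71 | 8.43 | 10.57 | 10.48 | 12.07 |
| HCDG (Ours) | 14.83 | 12.26 | 9.81 | 6.69 | 10.90 | 8.09 | 10.66 | 8.66 | 6.78 | 8.55 | 9.72 |

Table 1: Comparison with recent domain generalization methods on the OC/OD segmentation task. Columns OC 1 to OC 4 and OD 1 to OD 4 denote optic cup and optic disc segmentation on unseen Domains 1 to 4. The top two values are emphasized using bold and underline, respectively.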
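Dice Coefficient (Dice) [%]

| Method | Domain 1 | Domain 2 | Domain 3 | Domain 4 | Domain 5 | Domain 6 | Average |
|---|---|---|---|---|---|---|---|
| Baseline | 83.51 | 81.94 | 82.29 | 85.44 | 84.66 | 84.89 | 83.79 |
| Mixup Zhang et al. (2018) | 84.77 | 84.50 | 79.18 | 82.99 | 84.68 | 84.12 | 83.37 |
| M-mixup Verma et al. (2019) | 86.94 | 82.61 | 85.87 | 84.98 | 82.35 | 82.50 | 84.21 |
| CutMix Yun et al. (2019) | 83.27 | 84.36 | 85.35 | 81.06 | 85.67 | 84.95 | 84.11 |
| JiGen Carlucci et al. (2019) | 84.53 | 85.41 | 81.48 | 83.18 | 89.66 | 85.42 | 84.95 |
| DoFE Wang et al. (2020) | 84.66 | 84.42 | 85.22 | 86.31 | 87.60 | 86.96 | 85.86 |
| SAML Liu et al. (2020a) | 89.66 | 87.53 | 84.43 | 88.67 | 87.37 | 88.34 | 87.67 |
| FACT Xu et al. (2021) | 85.82 | 85.45 | 84.93 | 88.75 | 87.31 | 87.43 | 86.62 |
| HCDG (Ours) | 88.26 | 88.92 | 87.50 | 90.24 | 88.78 | 89.62 | 88.89 |

Average Surface Distance (ASD) [pixel]

| Method | Domain 1 | Domain 2 | Domain 3 | Domain 4 | Domain 5 | Domain 6 | Average |
|---|---|---|---|---|---|---|---|
| Baseline | 6.15 | 5.95 | 6.24 | 4.24 | 7.49 | 4.42 | 5.75 |
| Mixup Zhang et al. (2018) | 5.92 | 6.72 | 7.38 | 5.81 | 7.53 | 4.33 | 6.28 |
| M-mixup Verma et al. (2019) | 4.50 | 6.36 | 5.07 | 4.37 | 8.05 | 4.66 | 5.50 |
| CutMix Yun et al. (2019) | 5.63 | 5.71 | 5.62 | 5.75 | 6.65 | 4.27 | 5.61 |
| JiGen Carlucci et al. (2019) | 5.26 | 5.42 | 5.21 | 5.34 | 3.77 | 4.14 | 4.86 |
| DoFE Wang et al. (2020) | 4.95 | 5.23 | 5.04 | 4.30 | 4.23 | 3.33 | 4.51 |
| SAML Liu et al. (2020a) | 4.11 | 4.74 | 5.40 | 3.45 | 4.36 | 3.20 | 4.21 |
| FACT Xu et al. (2021) | 4.77 | 5.60 | 6.10 | 3.90 | 4.37 | 3.43 | 4.70 |
| HCDG (Ours) | 4.36 | 4.16 | 4.24 | 2.99 | 4.19 | 2.91 | 3.81 |

Table 2: Comparison with recent domain generalization methods on Prostate MRI segmentation. The top two values are emphasized using bold and underline, respectively.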

Dual-classmate Consistency Regularization.

We consider the areas inside and outside the target object. Comparing the boundary ground truth with the mask ground truth, the values in the outside area are 0 for both tasks, whereas the values in the inside area are 1 for the mask prediction task but take a different value for the boundary regression task. Consequently, we scale the regressed boundary and pass it through a sigmoid function to approximate the predicted mask, like the smooth Heaviside function Xue et al. (2020), where a scaling factor controls the steepness of the approximation. The approximate transformation function maps the prediction space of the boundary to that of the mask while still maintaining the task-level diversity. Thus, task-level consistency can be enforced between the mask prediction and the transformed boundary prediction from the same sample by exploiting intrinsic knowledge.

The objective of IC then consists of the boundary MSE loss and the consistency KL loss, combined through the same balancing hyper-parameter as in EC. Overall, the total objective function for training the framework is the sum of the EC and IC objectives.


4 Experiments

4.1 Datasets and Experiment Setting

We evaluate our method on two important medical image segmentation tasks, i.e., optic cup and disc (OC/OD) segmentation on retinal fundus images and prostate segmentation on T2-weighted MRI. The Fundus image segmentation dataset Wang et al. (2020) is composed of four different data sources out of three public fundus image datasets, which are captured with different scanners in different institutions. The Prostate MRI dataset Liu et al. (2020a) is a well-organized multi-site dataset for prostate MRI segmentation, which contains prostate T2-weighted MRI data collected from six different data sources out of three public datasets. The pre-processing step of the two datasets follows the previous works Wang et al. (2020); Liu et al. (2020a) and can be found in the supplementary file.

For both tasks, we adopted the leave-one-domain-out strategy: we trained our model on the remaining source domains and evaluated the trained model on the held-out target domain. For evaluation, we used the prediction of the student model as the final prediction. We adopted two commonly used metrics, Dice coefficient (Dice) and Average Surface Distance (ASD), to quantitatively evaluate the segmentation results on the whole object region and the surface shape, respectively. The average results of three runs are reported.
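The evaluation protocol can be sketched in a few lines. The Dice coefficient below follows its standard definition, 2|P ∩ G| / (|P| + |G|); the helper names and the smoothing constant `eps` are illustrative assumptions, and ASD is omitted for brevity.

```python
import numpy as np


def dice_coefficient(pred, target, eps=1e-7):
    """Dice overlap between two binary masks (1.0 means perfect agreement)."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))


def leave_one_domain_out(domains):
    """Yield (source_domains, held_out_domain) for every domain in turn."""
    for i, held_out in enumerate(domains):
        sources = [d for j, d in enumerate(domains) if j != i]
        yield sources, held_out


# Toy usage with four domain identifiers, as in the fundus setting.
splits = list(leave_one_domain_out([1, 2, 3, 4]))
toy_dice = dice_coefficient([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```

Each split trains on three domains and tests on the fourth; the toy masks share one of three foreground pixels in total, giving a Dice of 2/3.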

4.2 Implementation Details

Our framework was built on PyTorch and trained on one NVIDIA RTX 2080 GPU. We adopted the Adam optimizer to train the framework. For Fundus data, we followed the previous work Wang et al. (2020) and used a modified DeepLabv3+ Chen et al. (2018) as the segmentation backbone. Following Wang et al. Wang et al. (2020), we first pre-trained the vanilla DeepLabv3+ network for 40 epochs and then trained the whole framework for another 50 epochs with a batch size of 4; the learning rate was decreased after 40 epochs. For Prostate MRI, we employed the same network backbone as for Fundus, while the whole framework was trained from scratch for 80 epochs with a batch size of 1, and the learning rate was decayed by 95% every 5 epochs. The balancing hyper-parameter in Eq. (5) was set to 200. Other implementation details and hyper-parameter settings can be found in the supplementary file and the provided code.

4.3 Comparison with Other DG Methods

We compare our framework with recent state-of-the-art methods which were designed for improving network generalization ability. The first kind of approaches focuses on network regularization, including Mixup Zhang et al. (2018), M-mixup Verma et al. (2019), CutMix Yun et al. (2019), and FACT Xu et al. (2021), which is a Fourier-based data augmentation framework. We also compare with state-of-the-art DG methods for OC/OD and Prostate MRI segmentation, i.e., JiGen Carlucci et al. (2019), DoFE Wang et al. (2020), and SAML Liu et al. (2020a). For Baseline, we train a vanilla modified DeepLabV3+ network with training images of all source domains in a unified manner.

Results on Fundus Image Segmentation.

Table 1 shows the quantitative results on the OC/OD segmentation tasks. Our HCDG framework achieves consistent improvements over the baseline across all unseen domain settings, with an overall gain of 3.26% in Dice and a reduction of 3.50 pixels in ASD. Unexpectedly, several network regularization methods, namely Mixup, M-mixup, and CutMix, do not perform as well as they do in the original natural image classification task. This may be due to the difficulty of determining pixel-wise labels when there is no explicit constraint on the augmentation. Besides, FACT formulates a dual-form consistency loss and advances the baseline from 86.38% to 87.33% in average Dice. By adding extra Intrinsic Consistency, our approach further improves over FACT to 89.64%. Furthermore, our framework outperforms JiGen and SAML by a considerable margin: 2.78% and 3.46% in average Dice, respectively. In particular, without leveraging domain prior knowledge, our approach surpasses the strongest competitor DoFE by 1.32% in average Dice and 1.72 pixels in average ASD.

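| Method | EC (w/o AG) | Classmates | IC | AG | Domain 1 OC | Domain 1 OD | Domain 2 OC | Domain 2 OD | Domain 3 OC | Domain 3 OD | Domain 4 OC | Domain 4 OD | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | - | - | - | - | 78.75 | 94.77 | 75.97 | 90.30 | 83.33 | 90.90 | 85.14 | 91.87 | 86.38 |
| Model A | ✓ | - | - | - | 81.74 | 95.75 | 78.31 | 89.98 | 86.65 | 91.79 | 85.71 | 93.43 | 87.92 |
| Model B | ✓ | ✓ | - | - | 85.42 | 93.97 | 78.42 | 91.06 | 86.98 | 91.87 | 86.42 | 92.60 | 88.34 |
| Model C | ✓ | ✓ | ✓ | - | 86.36 | 95.42 | 81.95 | 91.59 | 87.06 | 92.35 | 86.60 | 93.07 | 89.30 |
| Model D | - | ✓ | ✓ | ✓ | 84.12 | 94.93 | 80.56 | 92.39 | 86.24 | 92.62 | 88.24 | 93.02 | 89.01 |
| HCDG | ✓ | ✓ | ✓ | ✓ | 85.44 | 95.31 | 82.05 | 92.68 | 86.39 | 93.86 | 87.60 | 93.80 | 89.64 |

Table 3: Ablation studies on different components of our method on the OC/OD segmentation task (Dice, %). The top values are emphasized using bold.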

Results on Prostate MRI Segmentation.

The experimental results on the Prostate MRI segmentation task are reported in Table 2. Our HCDG framework improves Dice over the baseline from 83.79% to 88.89% and decreases ASD from 5.75 to 3.81. While Mixup still performs worse than the baseline, M-mixup and CutMix show a limited advantage over it. This may be because the difficulty of determining pixel-wise labels is mitigated for the grayscale images extracted from prostate MRI. Guided by consistency regularization, FACT clearly outperforms the baseline, with an overall gain of 2.83% in Dice and a reduction of 1.05 in ASD. Our approach further improves over FACT from 86.62% to 88.89% in average Dice. JiGen and DoFE achieve competitive performance, while SAML yields the best results among these previous state-of-the-art methods. Our approach outperforms SAML by a clear margin (1.22% in average Dice and 0.40 in average ASD).

Qualitative Results.

We present two sample segmentation results from each task in Figure 3 for better visualization. As observed, owing to Extrinsic and Intrinsic Consistency, the predictions of our approach have smoother contours and are closer to the ground truth than those of other methods.

4.4 Analysis of Our Methods

Impact of Different Components.

The ablation study results are shown in Table 3. Starting from the Baseline, Model A deploys EC guided by the previous Amplitude Mix (AM) strategy rather than the proposed AG strategy. Compared to FACT Xu et al. (2021) in Table 1, Model A performs better owing to DomainUp in EC. Based on Model A, we incorporate the Classmate module into the student model to obtain Model B, which improves slightly over Model A by 0.42% in average Dice. Classmates in Model B still conduct the same task as the student model, i.e., the image segmentation task. In Model C, the incorporated classmates instead carry out the dual task, i.e., the boundary regression task, so that the full IC is introduced. This improves the performance from 88.34% (Model B) to 89.30% (Model C). The full HCDG replaces the AM strategy in Model C with our proposed AG method and achieves 89.64%. Finally, we remove EC, including the teacher model, from the full HCDG but keep the AG method for the student model to construct Model D. The generalization performance of Model D decreases from 89.64% to 89.01%, further indicating the power of EC. This still competitive result also implies the efficacy of IC.

Figure 4: The performance on the OC/OD segmentation task of our approach with and without feature-level perturbation based on Model C.

Effectiveness of Feature-level Perturbation.

We conduct experiments based on the previous Model C, as shown in Figure 4. Removing feature-level perturbation from Model C results in consistent declines in generalization ability on the OC/OD segmentation task; in particular, the average Dice decreases from 89.30% to 89.05%. This verifies the necessity of feature-level perturbation on dual classmates for Intrinsic Consistency.

Details of Amplitude Gaussian-mixing.

We further analyze some implementation details of our AG strategy. As illustrated in Figure 5, the blue columns denote the AG method without adaptive core areas (i.e., with a fixed peak position). In this setting, compared with the full AG method (the green columns), the overall performance degrades from 89.64% to 89.50% in average Dice. This suggests the necessity of helping the model cope with uncertain positions of the core semantics from diverse domains. As the position of OC/OD in ROIs extracted from fundus images changes only slightly, we believe this mechanism will yield larger gains on other natural image datasets. Additionally, we explore the setting of uniformly sampling the scaled length. The results in the yellow columns indicate that this setting leads to an overall decrease of 0.28%. We attribute this to the negative effects caused by inappropriate values of the scaled length. For the same variance, a small scaled length (e.g., 0.2) yields a SigMask similar to the AM strategy, resulting in degeneration. On the other hand, a large scaled length (e.g., 0.8) produces a SigMask with an excessive difference between adjacent pixels, which may be too aggressive for the model to learn. Thus, the better choice is to select a proper value (0.5 in our setting) for the scaled length to generate a moderate SigMask.

Figure 5: Results on the OC/OD segmentation task of different variants of our Amplitude Gaussian-mixing method.

5 Conclusion

In this paper, we have presented a novel Hierarchical Consistency framework for generalizable medical image segmentation on unseen datasets by ensembling Extrinsic and Intrinsic Consistency. First, we enforce Extrinsic Consistency based on data-level perturbation and design a refined form of Fourier-based data augmentation that generates new domain images for each instance without losing structural information. Second, we incorporate a novel Classmate module into the framework, where Intrinsic Consistency is enforced to further constrain the model through the inherent prediction perturbation of related tasks. Considering that many mainstream Domain Generalization approaches underestimate the strength of consistency regularization or only employ data-level consistency, our work sheds some light on task-level consistency for the Domain Generalization community. Extensive experiments on two important medical image segmentation tasks validate the generalization and robustness of our proposed framework. In the future, we will explore other consistency schemes and apply our framework to other medical and natural image problems.

References
  • M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz (2020) Invariant risk minimization. External Links: 1907.02893 Cited by: §1, §2.
  • N. Bloch, A. Madabhushi, H. Huisman, J. Freymann, J. Kirby, M. Grauer, A. Enquobahrie, C. Jaffe, L. Clarke, and K. Farahani (2015) NCI-isbi 2013 challenge: automated segmentation of prostate structures. The Cancer Imaging Archive 370. Cited by: Table 5.
  • J. Cai, C. Zhu, C. Cui, H. Li, T. Wu, S. Zhang, and L. Yang (2021) Generalizing nucleus recognition model in multi-source images via pruning. External Links: 2107.02500 Cited by: §2.
  • F. M. Carlucci, A. D’Innocente, S. Bucci, B. Caputo, and T. Tommasi (2019) Domain generalization by solving jigsaw puzzles. In CVPR, pp. 2229–2238. Cited by: Table 6, Table 7, §2, Table 1, Table 2, §4.3.
  • L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pp. 801–818. Cited by: §4.2.
  • Q. Dou, D. Coelho de Castro, K. Kamnitsas, and B. Glocker (2019) Domain generalization via model-agnostic learning of semantic features. Advances in Neural Information Processing Systems 32, pp. 6450–6461. Cited by: §1.
  • J. Frankle and M. Carbin (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. External Links: 1803.03635 Cited by: §2.
  • H. Fu, J. Cheng, Y. Xu, D. W. K. Wong, J. Liu, and X. Cao (2018) Joint optic disc and cup segmentation based on multi-label deep network and polar transformation. IEEE transactions on medical imaging, pp. 1597–1605. Cited by: §2.
  • F. Fumero, S. Alayón, J. L. Sanchez, J. Sigut, and M. Gonzalez-Hernandez (2011) RIM-one: an open retinal image database for optic nerve evaluation. In CBMS, pp. 1–6. Cited by: Table 5.
  • Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In ICML, pp. 1180–1189. Cited by: §1, §2.
  • C. Gong, T. Ren, M. Ye, and Q. Liu (2020) MaxUp: a simple way to improve generalization of neural network training. External Links: 2002.09024 Cited by: §3.1.
  • R. Guo, P. Zhang, H. Liu, and E. Kiciman (2021) Out-of-distribution prediction with invariant risk minimization: the limitation and an effective fix. External Links: 2101.07732 Cited by: §2.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. External Links: 1503.02531 Cited by: §3.1.
  • G. Lemaître, R. Martí, J. Freixenet, J. C. Vilanova, P. M. Walker, and F. Meriaudeau (2015) Computer-aided detection and diagnosis for prostate cancer based on mono and multi-parametric mri: a review. Computers in biology and medicine 60, pp. 8–31. Cited by: Table 5.
  • D. Li, Y. Yang, Y. Song, and T. M. Hospedales (2018a) Learning to generalize: meta-learning for domain generalization. In AAAI, Cited by: §1, §2.
  • H. Li, S. J. Pan, S. Wang, and A. C. Kot (2018b) Domain generalization with adversarial feature learning. In CVPR, pp. 5400–5409. Cited by: §2.
  • X. Li, H. Chen, X. Qi, Q. Dou, C. Fu, and P. Heng (2018c) H-denseunet: hybrid densely connected unet for liver and tumor segmentation from ct volumes. IEEE transactions on medical imaging, pp. 2663–2674. Cited by: §2.
  • Y. Li, M. Gong, X. Tian, T. Liu, and D. Tao (2018d) Domain generalization via conditional invariant representations. In AAAI, Cited by: §2.
  • G. Litjens, R. Toth, W. van de Ven, C. Hoeks, S. Kerkstra, B. van Ginneken, G. Vincent, G. Guillard, N. Birbeck, J. Zhang, et al. (2014) Evaluation of prostate segmentation algorithms for mri: the promise12 challenge. MIA 18 (2), pp. 359–373. Cited by: Table 5.
  • Q. Liu, Q. Dou, and P. Heng (2020a) Shape-aware meta-learning for generalizing prostate mri segmentation to unseen domains. In MICCAI, pp. 475–485. Cited by: Appendix B, Table 6, Table 7, Table 1, Table 2, §4.1, §4.3.
  • Q. Liu, Q. Dou, L. Yu, and P. A. Heng (2020b) MS-net: multi-site network for improving prostate segmentation with heterogeneous mri data. IEEE transactions on medical imaging 39 (9), pp. 2713–2724. Cited by: §2.
  • X. Luo, J. Chen, T. Song, and G. Wang (2021) Semi-supervised medical image segmentation through dual-task consistency. In AAAI, pp. 8801–8809. Cited by: §2.
  • F. Milletari, N. Navab, and S. Ahmadi (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In 3DV, pp. 565–571. Cited by: §2.
  • J. G. Moreno-Torres, T. Raeder, R. Alaiz-Rodríguez, N. V. Chawla, and F. Herrera (2012) A unifying view on dataset shift in classification. Pattern recognition, pp. 521–530. Cited by: §1.
  • K. Muandet, D. Balduzzi, and B. Schölkopf (2013) Domain generalization via invariant feature representation. In ICML, pp. 10–18. Cited by: §1, §2.
  • B. Murugesan, K. Sarveswaran, S. M. Shankaranarayana, K. Ram, J. Joseph, and M. Sivaprakasam (2019) Psi-net: shape and boundary aware joint multi-task deep network for medical image segmentation. In 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 7223–7226. Cited by: §3.2.
  • H. J. Nussbaumer (1981) The fast fourier transform. In Fast Fourier Transform and Convolution Algorithms, pp. 80–111. Cited by: §3.1.
  • J. I. Orlando, H. Fu, J. B. Breda, K. van Keer, D. R. Bathula, A. Diaz-Pinto, R. Fang, P. Heng, J. Kim, J. Lee, et al. (2020) Refuge challenge: a unified framework for evaluating automated methods for glaucoma assessment from fundus photographs. Medical image analysis, pp. 101570. Cited by: Table 5.
  • Y. Ouali, C. Hudelot, and M. Tami (2020) Semi-supervised semantic segmentation with cross-consistency training. In CVPR, pp. 12674–12684. Cited by: §3.2.
  • F. Qiao, L. Zhao, and X. Peng (2020) Learning to learn single domain generalization. In CVPR, pp. 12556–12565. Cited by: §2.
  • M. Sajjadi, M. Javanmardi, and T. Tasdizen (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. Advances in neural information processing systems 29, pp. 1163–1171. Cited by: §2.
  • J. Sivaswamy, S. Krishnadas, A. Chakravarty, G. Joshi, A. S. Tabish, et al. (2015) A comprehensive retinal image dataset for the assessment of glaucoma from the optic nerve head analysis. JSM Biomedical Imaging Data Papers, pp. 1004. Cited by: Table 5.
  • A. Tarvainen and H. Valpola (2018) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. External Links: 1703.01780 Cited by: Appendix A, §2.
  • D. S. W. Ting, L. R. Pasquale, L. Peng, J. P. Campbell, A. Y. Lee, R. Raman, G. S. W. Tan, L. Schmetterer, P. A. Keane, and T. Y. Wong (2019) Artificial intelligence and deep learning in ophthalmology. British Journal of Ophthalmology, pp. 167–175. Cited by: §1.
  • J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In IROS, pp. 23–30. Cited by: §1, §2.
  • V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, D. Lopez-Paz, and Y. Bengio (2019) Manifold mixup: better representations by interpolating hidden states. In ICML, pp. 6438–6447. Cited by: Table 6, Table 7, Table 1, Table 2, §4.3.
  • S. Wang, L. Yu, K. Li, X. Yang, C. Fu, and P. Heng (2019a) Boundary and entropy-driven adversarial learning for fundus image segmentation. In MICCAI, pp. 102–110. Cited by: §2, §3.2.
  • S. Wang, L. Yu, K. Li, X. Yang, C. Fu, and P. Heng (2020) Dofe: domain-oriented feature embedding for generalizable fundus image segmentation on unseen datasets. IEEE Transactions on Medical Imaging 39 (12), pp. 4237–4248. Cited by: Appendix B, Table 6, Table 7, Table 1, Table 2, §4.1, §4.2, §4.3.
  • S. Wang, L. Yu, X. Yang, C. Fu, and P. Heng (2019b) Patch-based output space adversarial learning for joint optic disc and cup segmentation. IEEE transactions on medical imaging, pp. 2485–2495. Cited by: §2.
  • Q. Xu, R. Zhang, Y. Zhang, Y. Wang, and Q. Tian (2021) A fourier-based framework for domain generalization. In CVPR, pp. 14383–14392. Cited by: Table 6, Table 7, §1, §1, §2, §3.1, Table 1, Table 2, §4.3, §4.4.
  • Y. Xue, H. Tang, Z. Qiao, G. Gong, Y. Yin, Z. Qian, C. Huang, W. Fan, and X. Huang (2020) Shape-aware organ segmentation by predicting signed distance maps. In AAAI, Vol. 34, pp. 12565–12572. Cited by: §3.2.
  • L. Yu, J. Cheng, Q. Dou, X. Yang, H. Chen, J. Qin, and P. Heng (2017a) Automatic 3d cardiovascular mr segmentation with densely-connected volumetric convnets. In MICCAI, pp. 287–295. Cited by: §2.
  • L. Yu, X. Yang, H. Chen, J. Qin, and P. A. Heng (2017b) Volumetric convnets with mixed residual connections for automated prostate segmentation from 3d mr images. In AAAI, Cited by: §2.
  • Y. Yuan and Y. Lo (2017) Improving dermoscopic image segmentation with enhanced convolutional-deconvolutional networks. IEEE journal of biomedical and health informatics, pp. 519–526. Cited by: §2.
  • X. Yue, Y. Zhang, S. Zhao, A. Sangiovanni-Vincentelli, K. Keutzer, and B. Gong (2019) Domain randomization and pyramid consistency: simulation-to-real generalization without accessing target domain data. In ICCV, pp. 2100–2110. Cited by: §1, §2.
  • S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo (2019) Cutmix: regularization strategy to train strong classifiers with localizable features. In CVPR, pp. 6023–6032. Cited by: Table 6, Table 7, Table 1, Table 2, §4.3.
  • H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018) Mixup: beyond empirical risk minimization. External Links: 1710.09412 Cited by: Table 6, Table 7, §2, Table 1, Table 2, §4.3.
  • K. Zhou, Y. Yang, T. Hospedales, and T. Xiang (2020) Deep domain-adversarial image generation for domain generalisation. In AAAI, pp. 13025–13032. Cited by: §1, §2.
  • Y. Zhou, L. Xie, W. Shen, Y. Wang, E. K. Fishman, and A. L. Yuille (2017) A fixed-point model for pancreas segmentation in abdominal ct scans. In MICCAI, pp. 693–701. Cited by: §2.
  • A. Zunino, S. A. Bargal, R. Volpi, M. Sameki, J. Zhang, S. Sclaroff, V. Murino, and K. Saenko (2021) Explainable deep classification models for domain generalization. In CVPR, pp. 3233–3242. Cited by: §2.


In this supplementary material, we first provide more details about our training procedure, together with a detailed algorithm, in Section A. We also provide the method-specific hyper-parameter settings in that section. Then, we show the statistics of the Fundus and Prostate MRI datasets and their corresponding pre-processing details in Section B. We also visualize the amplitude-perturbed images under different variance in our Amplitude Gaussian-mixing method in Section C. In Section D, we further report the standard deviation of the results over three runs on the two tasks to better compare our framework with other methods. Finally, we conduct additional experiments to analyze the effects of the number of classmates and the balancing hyper-parameter in Section E.

Appendix A Details of Training Strategy

In our framework, the structure of the teacher model is identical to that of the student model, while the classmates, as extra decoders, share the encoder with the student model. No gradients flow through the teacher model during backpropagation; instead, the teacher model receives its parameters from the student model via an exponential moving average (EMA), following the mean-teacher framework Tarvainen and Valpola (2018):

$$\theta'_t = \alpha\,\theta'_{t-1} + (1-\alpha)\,\theta_t,$$

where $\theta'_t$ and $\theta_t$ denote the teacher and student parameters at training step $t$, and $\alpha$ is the decay rate of the update. The classmate module does not engage in EMA.
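The EMA update described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `ema_update` and the dictionary-of-floats parameter layout are assumptions made for the example, and the update rule is the standard mean-teacher form (teacher = decay × teacher + (1 − decay) × student).

```python
def ema_update(teacher, student, decay=0.9995):
    """Update teacher parameters in place as an EMA of student parameters.

    `teacher` and `student` are hypothetical dicts mapping parameter names
    to numbers (or NumPy arrays); this layout is assumed for illustration.
    """
    for name, student_value in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * student_value
    return teacher


# Toy example with a large (1 - decay) so the effect is visible.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 2.0}
ema_update(teacher, student, decay=0.9)
```

With a decay close to 1 (such as the 0.9995 used in the experiments), the teacher changes slowly and acts as a temporally smoothed copy of the student.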

The encoder of the student model is updated simultaneously by gradients from the image segmentation task and the boundary regression task. Hence, it has a stronger ability to recognize the active structure and boundary of objects. In the testing phase, we simply use the student model for evaluation. The detailed training procedure is given in Algorithm 1, where the equation numbers match those in the main paper. Note that we force the dataloader to sample mini-batches that include images from all source domains.

For all experiments, we set the momentum for the teacher model to 0.9995, the temperature to 10, and the number of candidates for each source domain in DomainUp to 1. In the Amplitude Gaussian-mixing strategy, the upper bound of in the Gaussian-mixing function is chosen as 1.0, while the scaled length is fixed to 0.5. In the transformation function, the scaling factor is 20. The weight of the consistency loss is set to 200, using a sigmoid ramp-up Tarvainen and Valpola (2018) with a length of 5 epochs. For Fundus, we use weak augmentation composed of random scaling and cropping, random flipping, and light adjustment. For Prostate MRI, we only employ random horizontal flipping and light adjustment as data augmentation.
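As an illustration of the consistency-weight schedule, the sketch below assumes the ramp-up has the exponential form exp(−5(1 − t/T)²) used in the public Mean Teacher implementation; the paper only states that a sigmoid ramp-up of 5 epochs and a maximum weight of 200 are used, so the exact functional form and the function names here are assumptions.

```python
import math


def sigmoid_rampup(current, rampup_length):
    """Ramp from exp(-5) up to 1.0 over `rampup_length` steps (assumed form)."""
    if rampup_length == 0:
        return 1.0
    current = min(max(current, 0.0), rampup_length)
    phase = 1.0 - current / rampup_length
    return math.exp(-5.0 * phase * phase)


def consistency_weight(epoch, max_weight=200.0, rampup_epochs=5):
    """Consistency-loss weight at a given epoch (hypothetical helper)."""
    return max_weight * sigmoid_rampup(epoch, rampup_epochs)


weights = [consistency_weight(e) for e in range(7)]
```

The weight starts near zero, so the supervised losses dominate early training, and it reaches its maximum once the ramp-up length has elapsed.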

Appendix B Details of Datasets and Pre-processing

Detailed statistics of the Fundus Wang et al. (2020) and Prostate MRI Liu et al. (2020a) datasets are given in Table 5. We pre-processed both datasets before network training. For the Fundus dataset, we cropped regions of interest (ROIs) centered at the OD with size of by utilizing a simple U-Net and then resized them to as the network input, following Wang et al. (2020). For Prostate MRI, we resized each sample to in the axial plane and normalized it individually to zero mean and unit variance. We then clipped each sample to preserve only the slices containing the prostate region, so that the segmentation regions are consistent across sites, following Liu et al. (2020a).
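The per-sample normalization mentioned for Prostate MRI can be sketched as follows. This is a minimal example assuming each sample is held as a NumPy array; the function name `normalize_sample` and the small `eps` guard against division by zero are illustrative choices, not details from the paper.

```python
import numpy as np


def normalize_sample(volume, eps=1e-8):
    """Normalize one sample to zero mean and unit variance.

    Statistics are computed over the whole sample, independently of
    other samples. `eps` is an assumed guard for constant inputs.
    """
    volume = np.asarray(volume, dtype=np.float64)
    return (volume - volume.mean()) / (volume.std() + eps)


# Toy volume standing in for an MRI sample (slices x height x width).
toy = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
normalized = normalize_sample(toy)
```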

Input: A mini-batch of from source domains .
Output: The prediction of input from unseen target domain .

initialize
while not converged do
     sampled from source domains
     sampled from source domains
     DomainUp generates
     ,
     Generate boundary ground truth from
     Calculate as Eq. (3)
     Update
     Calculate and as Eq. (4)
     Update
     Calculate as Eq. (6)
     Update
     Update
     Calculate as Eq. (8)
     Update
     Update
     Momentum update via EMA as Eq. (10)
end while
Algorithm 1 Training procedure of our HCDG framework

Appendix C Visualization of Perturbed Images

We visualize the appearance of amplitude-perturbed images under different in our Amplitude Gaussian-mixing method for the two tasks. As shown in Figure 7, the appearance of the source image gradually changes from the style of the candidates back to the original style as we increase from 0.4 to 0.8, while the core semantics of the source image remain unchanged. Our AG method keeps the rich variability of the previous AM strategy while highlighting the core semantics, which helps the model capture domain-invariant information and improves generalizability.
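To make the behaviour described above concrete, the following sketch mixes the Fourier amplitude of a source image with that of a candidate image using a Gaussian weight map, and keeps the source phase. It is only an illustration under stated assumptions, not the authors' formulation: the mask is assumed to be a Gaussian over normalized frequency centered at zero frequency, the parameter `sigma` is a stand-in for the variance varied in Figure 7, and the adaptive core areas, upper bound, and scaled length of the actual AG method are not modeled. Under these assumptions, a larger `sigma` keeps more of the source amplitude, which matches the trend of the image returning to its original style.

```python
import numpy as np


def amplitude_gaussian_mix(source, candidate, sigma):
    """Mix amplitude spectra of two 2D images with a Gaussian weight map.

    Assumed (hypothetical) rule: mixed = mask * A_src + (1 - mask) * A_cand,
    where mask is a Gaussian of the normalized frequency radius.
    The phase of the source image is preserved.
    """
    src_fft = np.fft.fft2(source)
    cand_fft = np.fft.fft2(candidate)
    src_amp, src_phase = np.abs(src_fft), np.angle(src_fft)
    cand_amp = np.abs(cand_fft)

    # Normalized frequency coordinates in [-1, 1); symmetric about zero
    # frequency, so the mixed spectrum stays conjugate-symmetric.
    fy = np.fft.fftfreq(source.shape[0])[:, None] / 0.5
    fx = np.fft.fftfreq(source.shape[1])[None, :] / 0.5
    mask = np.exp(-(fx ** 2 + fy ** 2) / (2.0 * sigma ** 2))

    mixed_amp = mask * src_amp + (1.0 - mask) * cand_amp
    mixed_fft = mixed_amp * np.exp(1j * src_phase)
    # The imaginary part is numerical noise for real-valued inputs.
    return np.real(np.fft.ifft2(mixed_fft))


rng = np.random.default_rng(0)
src = rng.random((8, 8))
cand = rng.random((8, 8))
perturbed = amplitude_gaussian_mix(src, cand, sigma=0.5)
```

Because only the amplitude is altered, the phase, which carries most of the structural information, is left untouched.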

| Q | OC Dice (%) | OD Dice (%) | Avg. Dice (%) | Time (h) | Params (million) |
|---|---|---|---|---|---|
| 1 | 85.42 | 93.02 | 89.22 | 8.2 | 6.5 |
| 2 | 85.37 | 93.91 | 89.64 | 8.5 | 7.8 |
| 3 | 85.82 | 92.94 | 89.38 | 8.8 | 9.1 |

Table 4: Results of accuracy (Dice) and efficiency (Cost) on the OC/OD segmentation task under different numbers of classmates in our Classmate module.
Figure 6: The performance on the OC/OD segmentation task of our approach under different values of the balancing hyper-parameter .

Appendix D Comparison with Other DG Methods

Due to page limitations, we only show the mean value of three runs in Table 1 and Table 2 in the main paper. Here, we further provide the standard deviation (std) of the Dice coefficients over three runs for recent DG methods and our framework on the OC/OD segmentation and Prostate MRI segmentation tasks, as shown in Table 6 and Table 7, respectively. Our framework achieves the highest overall Dice among the compared methods on both tasks, with a relatively stable average Dice.
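For reference, the sketch below shows one common way to compute a Dice coefficient for binary masks and to summarize repeated runs by mean and standard deviation. The helper names, the `eps` smoothing term, and the use of the population standard deviation (NumPy's default, `ddof=0`) are assumptions; the paper does not state which variant it reports. The inputs are toy values, not results from the paper.

```python
import numpy as np


def dice_coefficient(pred, target, eps=1e-8):
    """Dice = 2|P ∩ T| / (|P| + |T|) for binary masks (eps avoids 0/0)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)


def mean_and_std(scores):
    """Mean and population std (ddof=0, an assumption) over repeated runs."""
    scores = np.asarray(scores, dtype=np.float64)
    return scores.mean(), scores.std()


# Toy masks and toy per-run scores for illustration only.
toy_dice = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
toy_mean, toy_std = mean_and_std([0.8, 0.9, 1.0])
```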

Appendix E Further Analysis of Our Method

Analysis of the Number of Classmates Q.

We conduct experiments with different values of Q, as shown in Table 4. As the number of classmates increases, both the training time and the number of parameters grow. Our framework achieves the highest average Dice when Q = 2. Since too many classmates would reduce efficiency, we empirically set Q = 2 to balance accuracy and efficiency in our Classmate module.

Effects of the Balancing Hyper-parameter .

The results on the OC/OD segmentation task under different values of the balancing hyper-parameter are illustrated in Figure 6. As observed, our approach achieves the best performance when .

| Task | Domain No. | Dataset | Training samples | Scanners (Institutions) |
|---|---|---|---|---|
| Fundus | Domain 1 | Drishti-GS Sivaswamy et al. (2015) | 50 | (Aravind eye hospital) |
| Fundus | Domain 2 | RIM-ONE-r3 Fumero et al. (2011) | 99 | Nidek AFC-210 |
| Fundus | Domain 3 | REFUGE (train) Orlando et al. (2020) | 320 | Zeiss Visucam 500 |
| Fundus | Domain 4 | REFUGE (val) Orlando et al. (2020) | 320 | Canon CR-2 |
| Prostate MRI | Domain 1 | NCI-ISBI 2013 Bloch et al. (2015) | 30 | (RUNMC) |
| Prostate MRI | Domain 2 | NCI-ISBI 2013 Bloch et al. (2015) | 30 | (BMC) |
| Prostate MRI | Domain 3 | I2CVB Lemaître et al. (2015) | 19 | (HCRUDB) |
| Prostate MRI | Domain 4 | PROMISE12 Litjens et al. (2014) | 13 | (UCL) |
| Prostate MRI | Domain 5 | PROMISE12 Litjens et al. (2014) | 12 | (BIDMC) |
| Prostate MRI | Domain 6 | PROMISE12 Litjens et al. (2014) | 12 | (HK) |

Table 5: Statistics of the public Fundus and Prostate MRI datasets in our experiments.
Figure 7: Visualization of amplitude-perturbed images under different in our Amplitude Gaussian-mixing method.
| Method | OC Domain 1 | OC Domain 2 | OC Domain 3 | OC Domain 4 | OC Avg. | OD Domain 1 | OD Domain 2 | OD Domain 3 | OD Domain 4 | OD Avg. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 78.75±1.96 | 75.97±1.45 | 83.33±1.02 | 85.14±0.27 | 80.80±0.50 | 94.77±0.25 | 90.30±0.87 | 90.90±1.66 | 91.87±0.73 | 91.96±0.46 | 86.38±0.39 |
| Mixup Zhang et al. (2018) | 71.73±0.51 | 77.70±2.31 | 78.24±1.26 | 87.23±0.28 | 78.73±0.63 | 94.66±0.42 | 88.88±0.85 | 89.63±0.95 | 89.99±0.34 | 90.79±0.10 | 84.76±0.34 |
| M-mixup Verma et al. (2019) | 78.15±1.89 | 78.50±2.12 | 78.04±3.16 | 87.48±0.60 | 80.54±0.09 | 94.47±0.62 | 90.53±0.78 | 90.60±0.51 | 86.68±0.56 | 90.57±0.47 | 85.56±0.24 |
| CutMix Yun et al. (2019) | 77.41±1.19 | 81.30±1.42 | 80.23±0.36 | 84.27±0.51 | 80.80±0.53 | 94.50±0.83 | 90.92±0.41 | 89.57±0.69 | 88.95±0.55 | 90.99±0.27 | 85.89±0.35 |
| JiGen Carlucci et al. (2019) | 81.04±0.45 | 79.34±1.19 | 81.14±2.29 | 83.75±1.79 | 81.32±0.95 | 95.60±0.16 | 89.91±1.52 | 91.61±0.63 | 92.52±0.67 | 92.41±0.35 | 86.86±0.63 |
| DoFE Wang et al. (2020) | 82.86±0.35 | 78.80±1.11 | 86.12±1.19 | 87.07±0.73 | 83.71±0.77 | 95.88±0.32 | 91.58±1.19 | 91.83±0.79 | 92.40±0.61 | 92.92±0.13 | 88.32±0.41 |
| SAML Liu et al. (2020a) | 83.16±1.25 | 75.68±1.81 | 82.00±1.06 | 82.88±0.78 | 80.93±0.85 | 94.30±0.57 | 91.17±1.02 | 92.28±0.85 | 87.95±0.78 | 91.43±0.32 | 86.18±0.84 |
| FACT Xu et al. (2021) | 79.66±0.50 | 79.25±0.59 | 83.07±0.47 | 86.57±0.22 | 82.14±0.29 | 95.28±0.77 | 90.19±0.55 | 94.09±0.63 | 90.54±0.52 | 92.53±0.28 | 87.33±0.13 |
| HCDG (Ours) | 85.44±1.62 | 82.05±1.08 | 86.39±0.20 | 87.60±0.39 | 85.37±0.45 | 95.31±0.80 | 92.68±0.69 | 93.86±0.73 | 93.80±0.38 | 93.91±0.48 | 89.64±0.15 |

Table 6: Comparison with recent domain generalization methods on the OC/OD segmentation task according to the mean and std of Dice coefficient. The top two values are emphasized using bold and underline, respectively.
| Method | Domain 1 | Domain 2 | Domain 3 | Domain 4 | Domain 5 | Domain 6 | Overall |
|---|---|---|---|---|---|---|---|
| Baseline | 83.51±2.62 | 81.94±1.00 | 82.29±1.51 | 85.44±2.95 | 84.66±3.17 | 84.89±1.01 | 83.79±0.71 |
| Mixup Zhang et al. (2018) | 84.77±1.05 | 84.50±1.92 | 79.18±3.35 | 82.99±2.41 | 84.68±6.35 | 84.12±1.68 | 83.37±1.48 |
| M-mixup Verma et al. (2019) | 86.94±1.42 | 82.61±1.01 | 85.87±2.40 | 84.98±2.90 | 82.35±6.76 | 82.50±3.95 | 84.21±1.91 |
| CutMix Yun et al. (2019) | 83.27±0.98 | 84.36±0.99 | 85.35±4.17 | 81.06±1.08 | 85.67±3.61 | 84.95±2.84 | 84.11±1.29 |
| JiGen Carlucci et al. (2019) | 84.53±1.01 | 85.41±1.58 | 81.48±1.59 | 83.18±1.07 | 89.66±2.34 | 85.42±2.59 | 84.95±0.57 |
| DoFE Wang et al. (2020) | 84.66±0.30 | 84.42±2.70 | 85.22±2.98 | 86.31±3.50 | 87.60±3.47 | 86.96±2.29 | 85.86±0.79 |
| SAML Liu et al. (2020a) | 89.66±1.31 | 87.53±1.05 | 84.43±0.51 | 88.67±0.96 | 87.37±2.65 | 88.34±0.80 | 87.67±0.81 |
| FACT Xu et al. (2021) | 85.82±0.79 | 85.45±2.00 | 84.93±2.60 | 88.75±3.30 | 87.31±2.18 | 87.43±1.32 | 86.62±1.59 |
| HCDG (Ours) | 88.26±0.49 | 88.92±0.72 | 87.50±1.54 | 90.24±1.13 | 88.78±0.76 | 89.62±1.01 | 88.89±0.24 |

Table 7: Comparison with recent domain generalization methods on the prostate MRI segmentation task according to the mean and std of Dice coefficient. The top two values are emphasized using bold and underline, respectively.