When Unseen Domain Generalization is Unnecessary? Rethinking Data Augmentation

by   Ling Zhang, et al.

Recent advances in deep learning for medical image segmentation demonstrate expert-level accuracy. However, in clinically realistic environments, such methods have marginal performance due to differences in image domains, including different imaging protocols, device vendors and patient populations. Here we consider the problem of domain generalization, in which a model is trained once and its performance generalizes to unseen domains. Intuitively, within a specific medical imaging modality the domain differences are smaller than the domain variability of natural images. We rethink data augmentation for medical 3D images and propose a deep stacked transformations (DST) approach for domain generalization. Specifically, a series of n stacked transformations is applied to each image in each mini-batch during network training to account for the contribution of domain-specific shifts in medical images. We comprehensively evaluate our method on three tasks: segmentation of whole prostate from 3D MRI, left atrial from 3D MRI, and left ventricle from 3D ultrasound. We demonstrate that when trained on a small source dataset, (i) on average, DST models on unseen datasets degrade only by 11%, compared to conventional augmentation (degrading 39%) and the CycleGAN-based domain adaptation method (degrading 25%); (ii) on the source domain, DST is also better, albeit only marginally [...] scratch on a target domain when training with the same amount of data. (iii) When training on large-sized data, DST on unseen domains reaches the performance of state-of-the-art fully supervised models. These findings establish a strong benchmark for the study of domain generalization in medical imaging, and can be generalized to the design of robust deep segmentation models for clinical deployment.


1 Introduction

Practical application of AI medical imaging methods requires accurate and robust performance on unseen domains, which arise from differences in acquisition protocols across centers, scanner vendors, and patient populations (see Fig. 1). Unfortunately, labeled medical datasets are typically small and do not include sufficient variability for robust deep learning training. The lack of large, diverse medical imaging datasets often leads to marginal deep learning model performance on new “unseen” domains, which limits their application in clinical practice [9].

Figure 1: Medical image segmentation in source and unseen domains (i.e., a specific medical imaging modality across different vendors, imaging protocols, and patient populations, etc.) for (a) whole prostate MRI, (b) left atrial MRI, and (c) left ventricle ultrasound. The illustrated images are processed with intensity normalization.

To improve model performance on unseen domains, transfer learning methods attempt to fine-tune a portion of a pre-trained network given a small amount of annotated data from the unseen target domain. Transfer learning for 3D medical images is often hampered by the lack of quality pre-trained models (trained on a large amount of data). Domain adaptation methods do not require annotations in the unseen domain, but usually require all source and target domain images to be available during training [11, 1, 10]. The assumption of a known target dataset is restrictive, and makes multi-site deployment impractical. Furthermore, due to medical data privacy requirements, it is difficult to collect both the source and target datasets beforehand.

In the field of medical imaging, we are usually faced with the difficult situation where the training dataset is derived from a single center and acquired with a specific protocol. In such situations, domain generalization methods seek a robust model, trained once, capable of generalizing well to unseen domains. In 2D computer vision applications, researchers have used data augmentation of varying complexity to expand the available data distribution. Specifically, data augmentation strategies are performed in input space [6] or during adversarial learning [7]. Compared to natural 2D images, 3D medical image domain variability is more compact. Within the same modality, e.g. T2 MRI or ultrasound, images from different vendors (GE, Philips, Siemens), scanning protocols, and patient populations differ visually in three main aspects: image quality, image appearance and spatial shape (see Figure 1). Other imaging modalities, such as CT, generally have more consistent image characteristics.

Motivated by the observed heterogeneity of 3D medical images, we propose a systematic augmentation approach consisting of a series of transformations that simulate the domain shift properties of medical imaging data. We call this approach Deep Stacked Transformations (DST) augmentation. DST operates in image space, where input images undergo nine stacked transformations. Each transform is controlled by two parameters, which determine the probability and magnitude of the image transformation. As a backbone semantic segmentation network we use AH-Net [5].


In 3D medical imaging applications, the selection of image augmentations is often intuitive (e.g., random crop or flip) and inherited from 2D computer vision applications. Furthermore, the contribution of an augmentation method is rarely evaluated on unseen domains. In this work, we comprehensively evaluate the effect of various data augmentation techniques on the generalization of 3D segmentation to unseen domains. The evaluation tasks include segmentation of whole prostate from 3D MRI, left ventricle from 3D ultrasound, and left atrial from 3D MRI. For each task we have up to 4 different datasets, so that we can train on one and evaluate generalization to the others. The results and analysis:

  • Reveal the main factors causing domain shift in 3D medical imaging modalities.

  • Demonstrate that DST augmentation substantially outperforms conventional augmentation and CycleGAN-based domain adaptation on unseen domains for both MRI and ultrasound. The generalization improvements are observed even on the same domain (albeit much less noticeable).

  • Given a larger training dataset, DST achieves state-of-the-art segmentation accuracy on unseen domains.

2 Methods

To improve the generalization of 3D medical semantic segmentation methods, we apply a series of $n$ stacked augmentation transforms $\tau(\cdot)$ to the input images during training. Each transformation is an image processing function with two hyper-parameters: probability $p$ and magnitude $m$.

$$(\hat{x}_s, \hat{y}_s) = \tau^{n}_{p_n, m_n}\left(\tau^{n-1}_{p_{n-1}, m_{n-1}}\left(\cdots \tau^{1}_{p_1, m_1}(x_s, y_s)\right)\right)$$

where $x_s$ and $y_s$ are the input image and its corresponding label, and $(\hat{x}_s, \hat{y}_s)$ is the augmented pair. Augmentation transforms alter the image quality, appearance, and spatial structure. Specifically, DST consists of the following transforms: sharpening, blurring, noise, brightness adjustment, contrast change, perturbation, rotation, scaling, and deformation, in addition to random cropping. In DST, transforms are applied in the order listed; model performance is not sensitive to the order. As we show in our experiments, augmenting image sets during training can result in models with more robust segmentations than if data processing/synthesis was performed at the inference stage. Fig. 2 shows some examples of DST augmentation in 3D MRI and ultrasound, demonstrating its ability to mimic image appearances in unseen domains within a given modality.
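The stacking described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`make_stacked_transform`, `brightness`, `add_noise`) are hypothetical, magnitudes are assumed to be drawn uniformly from each range, and only two intensity transforms are shown for brevity.

```python
import numpy as np

def make_stacked_transform(transforms, rng):
    """Compose transforms given as (fn, probability, (low, high)) in order.

    Hypothetical helper: each transform fires with its own probability and
    receives a magnitude drawn uniformly from its range (an assumption).
    """
    def apply(image, label):
        for fn, prob, (low, high) in transforms:
            if rng.random() < prob:
                magnitude = rng.uniform(low, high)
                image, label = fn(image, label, magnitude, rng)
        return image, label
    return apply

def brightness(image, label, magnitude, rng):
    # Intensity shift; the label is untouched by intensity transforms.
    return np.clip(image + magnitude, 0.0, 1.0), label

def add_noise(image, label, magnitude, rng):
    # Additive Gaussian noise with standard deviation `magnitude`.
    return image + rng.normal(0.0, magnitude, size=image.shape), label

rng = np.random.default_rng(0)
dst = make_stacked_transform(
    [(brightness, 0.5, (-0.1, 0.1)), (add_noise, 0.5, (0.1, 1.0))], rng)
image = np.full((4, 4, 4), 0.5)
label = np.ones((4, 4, 4), dtype=np.int64)
aug_image, aug_label = dst(image, label)
```

Because every transform takes and returns an `(image, label)` pair, spatial transforms that must move the label together with the image fit the same interface.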

Figure 2: Examples of deep stacked transformations (DST) results on (a) whole prostate MRI, (b) left atrial MRI, and (c) left ventricle ultrasound. 1st row: ROIs randomly cropped from source domains; 2nd row: corresponding ROIs after DST; 3rd row: ROIs randomly cropped from unseen domains. The image pairs of the 2nd and 3rd rows have better visual similarity than those of the 1st and 3rd rows.

Image Quality is related to the sharpness, blurriness, and noise level of medical images. Blurriness is commonly caused by MR/ultrasound motion artifacts and limited resolution. Gaussian filtering is used to blur the image, with a magnitude (Gaussian std) in the range [0.25, 1.5]. Sharpening has the reverse effect and is implemented by unsharp masking with strength in [10, 30]. Noise drawn from a normal distribution with std in [0.1, 1.0] is added to account for possible noise in images.
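As an illustration of the three image-quality transforms, the following NumPy-only sketch implements a separable Gaussian blur, unsharp masking, and additive Gaussian noise. The function names, the edge padding, the kernel truncation at three standard deviations, and the blur width used inside `sharpen` are assumptions made for this example; the text specifies only the magnitude ranges.

```python
import numpy as np

def gaussian_kernel(sigma):
    # 1D Gaussian truncated at 3 sigma (truncation is an assumption).
    radius = max(1, int(np.ceil(3.0 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()

def blur(volume, sigma):
    """Separable Gaussian blur along every axis, with edge padding."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    out = volume.astype(np.float64)
    for axis in range(out.ndim):
        pad = [(0, 0)] * out.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded)
    return out

def sharpen(volume, strength, sigma=1.0):
    """Unsharp masking: add back the detail that a blur removes."""
    return volume + strength * (volume - blur(volume, sigma))

def add_noise(volume, std, rng):
    """Additive zero-mean Gaussian noise."""
    return volume + rng.normal(0.0, std, size=volume.shape)

rng = np.random.default_rng(0)
volume = rng.random((8, 8, 8))
blurred = blur(volume, sigma=rng.uniform(0.25, 1.5))
sharpened = sharpen(volume, strength=rng.uniform(10, 30))
noisy = add_noise(volume, std=rng.uniform(0.1, 1.0), rng=rng)
```

In practice a library routine such as a dedicated Gaussian filter would replace the hand-written convolution; it is spelled out here only to keep the example self-contained.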

Image Appearance is associated with the statistical characteristics of image intensities, such as variations of brightness and contrast, which often result from different scanning protocols and device vendors. Brightness augmentation refers to a random shift in [-0.1, 0.1] in the intensity space. Contrast augmentation refers to gamma correction with gamma (magnitude) in the range [0.5, 4.5]. Finally, we use a random linear transform in intensity space with the magnitude of scale and shift sampled from [-0.1, 0.1], which we refer to as intensity perturbation.
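The appearance transforms reduce to simple per-voxel arithmetic on intensities normalized to [0, 1]. The sketch below is one possible reading: the function names are hypothetical, clipping back to [0, 1] is an assumption, and the linear perturbation is assumed to multiply by `1 + scale` so that a sampled scale of 0 leaves the image unchanged.

```python
import numpy as np

def shift_brightness(volume, shift):
    # Random shift in intensity space, e.g. shift in [-0.1, 0.1].
    return np.clip(volume + shift, 0.0, 1.0)

def change_contrast(volume, gamma):
    # Gamma correction, e.g. gamma in [0.5, 4.5]; expects values in [0, 1].
    return np.clip(volume, 0.0, 1.0) ** gamma

def perturb_intensity(volume, scale, shift):
    # Linear transform; interpreting scale as (1 + scale) is an assumption.
    return np.clip(volume * (1.0 + scale) + shift, 0.0, 1.0)

rng = np.random.default_rng(0)
volume = rng.random((8, 8, 8))
out = shift_brightness(volume, rng.uniform(-0.1, 0.1))
out = change_contrast(out, rng.uniform(0.5, 4.5))
out = perturb_intensity(out, rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1))
```

A gamma below 1 brightens mid-range intensities and a gamma above 1 darkens them, which is why a single range spanning both sides of 1 covers both directions of contrast change.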


Spatial Transforms include rotation, scaling and deformation. Rotation is usually caused by different patient orientations during scanning (we use a range of [-20, 20] degrees). Scaling and deformation are due to organ shape variability and soft tissue motion. Random scaling is used with magnitude in [0.4, 1.6]. The deformation transform uses regular grid interpolation after a random perturbation (Gaussian smoothed, std in [10, 13]). The same spatial transforms are applied to both the input images and the corresponding labels. These operations are computationally expensive for large 3D volumetric data. A GPU-based acceleration approach could be developed, but it is more desirable to reserve the full GPU memory for model training while performing data augmentation on the fly. In addition, since the whole 3D volume does not fit into GPU memory, cropped sub-volumes usually need to be fed into network training. We develop an efficient CPU-based spatial transform technique based on an open-source implementation (https://github.com/MIC-DKFZ/batchgenerators). It first calculates the 3D coordinate grid of the cropped sub-volume, to which the transformations (combining random 3D rotation, scaling, deformation, and cropping) are applied, and then performs image interpolation. We accelerate this further by performing interpolation only within the minimal cuboid containing the 3D coordinate grid. As a result, the computational time is independent of the input volume size (it depends only on the size of the cropped sub-volume), and the spatial transform augmentation can be performed on the fly during training.
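The grid-then-interpolate idea, including the restriction to the minimal enclosing cuboid, can be sketched as follows. This is a simplified illustration and not the authors' code or the batchgenerators API: `rotation_z` and `sample_crop` are hypothetical names, only a rotation about one axis and an isotropic scale are shown (no deformation), interpolation is nearest-neighbour, and coordinates falling outside the volume are clamped to the nearest edge voxel. The crop center is assumed to lie inside the volume.

```python
import numpy as np

def rotation_z(angle_deg):
    # Rotation about the third array axis.
    a = np.deg2rad(angle_deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a), np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

def sample_crop(volume, center, crop_size, angle_deg=0.0, scale=1.0):
    """Resample a rotated/scaled crop, reading only its bounding cuboid."""
    # 1) Coordinate grid of the output crop, centred at the origin.
    axes = [np.arange(s) - (s - 1) / 2.0 for s in crop_size]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    # 2) Transform the grid, then translate it to the crop center.
    #    scale > 1 samples a larger region, so the content appears smaller.
    coords = (grid @ rotation_z(angle_deg).T) * scale
    coords = coords + np.asarray(center, dtype=np.float64)
    # 3) Minimal cuboid containing the transformed grid, clipped to the volume.
    lo = np.maximum(np.floor(coords.min(axis=(0, 1, 2))).astype(int), 0)
    hi = np.minimum(np.ceil(coords.max(axis=(0, 1, 2))).astype(int) + 1,
                    volume.shape)
    sub = volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
    # 4) Nearest-neighbour lookup inside the cuboid only (edge-clamped).
    idx = np.rint(coords - lo).astype(int)
    idx = np.clip(idx, 0, np.array(sub.shape) - 1)
    return sub[idx[..., 0], idx[..., 1], idx[..., 2]]

volume = np.arange(1000, dtype=np.float64).reshape(10, 10, 10)
crop = sample_crop(volume, center=(5, 5, 5), crop_size=(4, 4, 2),
                   angle_deg=20.0, scale=1.2)
```

Calling `sample_crop` with the same arguments on the image and on its label map keeps the two aligned. Nearest-neighbour lookup is the natural choice for labels; for images a linear or spline interpolator would normally be used instead. The cost of steps 3 and 4 depends on the crop size rather than on the size of `volume`, which is the point made in the text.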

3 Experiments

3.1 Datasets

We validate our method on three segmentation tasks: segmentation of whole prostate from 3D MRI, left atrial from 3D MRI, and left ventricle from 3D ultrasound.

Task 1: For the whole prostate segmentation from 3D MRI, we use the following datasets: the Prostate dataset from the Medical Segmentation Decathlon (MSD-P, http://medicaldecathlon.com/index.html), PROMISE12 [4], NCI-ISBI13 (http://doi.org/10.7937/K9/TCIA.2015.zF0vlOPv), and ProstateX [3]. We train on the MSD-P dataset (source domain) and evaluate on the other datasets (unseen domains). We use only single-channel (T2) input and segment the whole prostate, which is the lowest common denominator among the datasets. One study in ProstateX was excluded due to a prior surgical procedure.

Task 2: For left atrial segmentation from 3D MRI, we use the following datasets: Heart dataset from MSD (MSD-H), ASC [8] and MM-WHS [13]. We train on the MSD-H dataset (source domain) and evaluate on the other datasets.

Task 3: For left ventricle segmentation from 3D ultrasound, we use data from CETUS (https://www.creatis.insa-lyon.fr/Challenge/CETUS/; 30 volumes). We manually split the dataset into 3 subsets corresponding to different ultrasound device vendors A, B, C with 10 volumes each. We used heuristics to identify vendor association, but we acknowledge that our split strategy may include wrong associations. We train on Vendor A images, and evaluate on Vendors B and C.

Table 1 summarizes the datasets. In addition, a larger proprietary 3D MRI dataset of 465 volumes is used in the final experiment (see Section 3.3.1).

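| Task | Domain | Dataset | # Data |
|---|---|---|---|
| 1. MRI - whole prostate | Source | MSD-P | 26/6 |
| 1. MRI - whole prostate | Unseen | PROMISE12 | 50 |
| 1. MRI - whole prostate | Unseen | NCI-ISBI13 | 60 |
| 1. MRI - whole prostate | Unseen | ProstateX | 98 |
| 2. MRI - left atrial | Source | MSD-H | 16/4 |
| 2. MRI - left atrial | Unseen | ASC | 100 |
| 2. MRI - left atrial | Unseen | MM-WHS | 20 |
| 3. Ultrasound - left ventricle | Source | Vendor A | 8/2 |
| 3. Ultrasound - left ventricle | Unseen | Vendor B | 10 |
| 3. Ultrasound - left ventricle | Unseen | Vendor C | 10 |

Table 1: Datasets used in our experiment. Source entries are given as training/validation volumes.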

3.2 Implementation

We implemented our approach in TensorFlow and trained it on an NVIDIA Tesla V100 16GB GPU. We use AH-Net [5] as a backbone for 3D segmentation, which takes advantage of the 2D pretrained ResNet50 as an encoder and learns the full 3D decoder. All data is re-sampled to 1x1x1mm isotropic resolution and normalized to the [0,1] intensity range. We use a crop size of 96x96x32 with batch size 16 for Task 1, a crop size of 96x96x96 with batch size 16 for Task 2, and a crop size of 96x96x96 with batch size 4 for Task 3. We use soft Dice loss and the Adam optimizer with learning rate [...]. We use a probability of 0.5 for each transformation in DST.
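For readers unfamiliar with the soft Dice loss, here is a small NumPy sketch of one common variant. The source does not state which formulation was used, so the unsquared denominator and the smoothing constant `eps` are assumptions, and the function name is illustrative.

```python
import numpy as np

def soft_dice_loss(probs, target, eps=1e-5):
    """1 - soft Dice between predicted probabilities and a binary mask.

    One common variant (an assumption, not necessarily the paper's):
    Dice = (2 * sum(p * t) + eps) / (sum(p) + sum(t) + eps).
    """
    probs = np.asarray(probs, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    intersection = np.sum(probs * target)
    denominator = np.sum(probs) + np.sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

target = np.array([1.0, 1.0, 0.0, 0.0])
perfect = soft_dice_loss(target, target)           # close to 0
disjoint = soft_dice_loss(1.0 - target, target)    # close to 1
uncertain = soft_dice_loss(np.full(4, 0.5), target)
```

Because the loss is computed from probabilities rather than thresholded masks, it is differentiable and can be minimized directly with a gradient-based optimizer such as Adam.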

3.3 Experimental Results and Analysis

First, we evaluate generalization performance for each augmentation transform individually. As a baseline, only random cropping is used, with no other augmentations. We compare the results to DST with all 9 transformations stacked, and to a popular domain adaptation method, CycleGAN [11], which maps the unseen images (on a per-slice basis) into a source-like appearance (we split each dataset 4:1 for CycleGAN training and validation, and train for 200 epochs).

Table 2 lists segmentation Dice results on the source domain (trained on this domain, and validated on a keep-out subset) and on unseen domains (trained on the source, but tested on other unseen datasets). The major findings are:

| Method | T1 Source (MSD-P) | T1 Unseen (PROMISE12) | T1 Unseen (NCI-ISBI13) | T1 Unseen (ProstateX) | T2 Source (MSD-H) | T2 Unseen (ASC) | T2 Unseen (MM-WHS) | T3 Source (Vendor A) | T3 Unseen (Vendor B) | T3 Unseen (Vendor C) | All Source | All Unseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 89.6 | 60.4 | 58.0 | 76.8 | 91.9 | 4.4 | 72.9 | 85.8 | 51.7 | 39.2 | 89.1 | 49.8 |
| Sharpening | 90.6 | 65.5 | 82.8 | 84.0 | 91.5 | 5.7 | 78.9 | 83.7 | 59.5 | 78.5 | 88.6 | 62.9 |
| Blurring | 86.1 | 63.9 | 67.0 | 79.9 | 90.9 | 3.3 | 76.9 | 90.5 | 73.4 | 72.4 | 89.2 | 61.1 |
| Noise | 91.1 | 59.3 | 67.4 | 81.4 | 91.4 | 8.3 | 78.0 | 87.3 | 66.8 | 62.2 | 90.0 | 59.0 |
| Brightness | 89.7 | 63.3 | 66.9 | 83.0 | 91.3 | 12.2 | 80.2 | 85.5 | 63.6 | 83.1 | 88.8 | 63.6 |
| Contrast | 91.1 | 72.7 | 60.7 | 86.1 | 91.3 | 12.7 | 78.6 | 88.4 | 58.4 | 85.5 | 90.3 | 63.6 |
| Perturb | 90.1 | 63.4 | 69.5 | 81.5 | 91.7 | 6.6 | 77.3 | 88.5 | 63.6 | 83.1 | 90.1 | 55.7 |
| Rotation | 87.4 | 59.0 | 57.9 | 75.1 | 91.2 | 5.2 | 72.1 | 78.0 | 60.4 | 62.6 | 85.5 | 54.7 |
| Scaling | 90.8 | 59.3 | 60.8 | 78.8 | 91.3 | 7.4 | 75.3 | 91.0 | 84.1 | 68.2 | 91.0 | 61.3 |
| Deform | 89.7 | 61.4 | 61.5 | 81.2 | 91.6 | 7.8 | 69.2 | 86.3 | 62.4 | 31.4 | 89.2 | 51.1 |
| Top4 | 91.0 | 73.5 | 83.0 | 86.5 | 91.6 | 45.4 | 79.4 | 90.9 | 81.9 | 80.5 | 91.2 | 74.9 |
| CycleGAN | - | 74.7 | 76.4 | 81.2 | - | 18.0 | 76.2 | - | 65.3 | 66.6 | - | 63.5 |
| DST (ours) | 91.3 | 80.2 | 85.4 | 86.5 | 91.4 | 65.5 | 80.0 | 92.1 | 84.9 | 81.3 | 91.6 | 80.0 |
| Supervised | - | 91.4 [12] | 88.0 [2] | 91.9* | - | 94.2 [8] | 88.6 | - | 92.5* | 92.5* | - | 91.4 |

Table 2: The effect of DST and various augmentation methods on unseen domain generalization (measured as segmentation Dice scores). T1: MRI - whole prostate; T2: MRI - left atrial; T3: US - left ventricle. Source columns indicate the dataset used for training, and their Dice scores are validation Dice scores (using a split) for comparison. Unseen columns list Dice results of the model trained on the source when applied to unseen datasets. Here baseline refers to a random crop with no further augmentations. Top4 stands for the combination of the four best performing augmentations (sharpening, brightness, contrast, scaling). Supervised indicates the state-of-the-art literature results, when a model is trained and tested on the same dataset. * indicates inter-observer variability.

  • DST augmentation performs substantially better than any single tested augmentation. On average across tasks, DST achieves 80.0% generalization Dice on unseen domains, compared to 49.8% for the baseline and 63.5% for CycleGAN, even though CycleGAN domain adaptation was exposed to unseen-domain images.

  • In 3D MRI, image quality and appearance augmentations had the most impact, with the largest improvements coming from sharpening, followed by contrast, brightness, and intensity perturbation. Spatial transforms had less impact in prostate MRI than in heart MRI, where the shape, size, and orientation of the heart can be very different (see Figure 1).

  • In Ultrasound, main contributions came from spatial scaling, followed by brightness, blurring, and contrast augmentations (see Figure 1(c)).

  • In some datasets (such as ASC), all the individual augmentations and CycleGAN performed very poorly (at most 18.0% Dice in Table 2), whereas DST had reasonable performance (65.5%). This supports our claim that comprehensive transforms are required to cover the potentially large variability of the unseen data.

  • Individual augmentation transforms may perform slightly better in some isolated cases (e.g. brightness augmentation for MM-WHS), but on average only DST consistently shows good generalization. Even the combination of the four best performing augmentations (Top4) is not sufficient for robust generalization.

  • Using only a simple random crop (baseline) does not generalize well to unseen datasets (with Dice dropping by as much as 40%), which supports the importance of data augmentation in general.

  • Besides the improvements on unseen domains, DST slightly improves (2.5%) on the source domains as well (it is valuable to not degrade the performance on the source domain).

  • DST performance is about 10% worse than that of fully supervised methods, as they have the advantages of training and testing on the same domain and of more training data. This gap can be reduced by using a larger source dataset (as shown in Section 3.3.1), in which case the DST performance is comparable to the supervised methods.
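The degradation figures quoted for the small-source setting can be reproduced from the "All Tasks" columns of Table 2 as the difference between source and unseen Dice. The short check below is a sketch: the drop is assumed to be an absolute difference in Dice points, and because CycleGAN has no source-domain score in Table 2, its drop is computed against the baseline's source score, which is also an assumption.

```python
# "All Tasks" (source, unseen) Dice scores copied from Table 2.
all_tasks = {
    "Baseline": (89.1, 49.8),
    "DST": (91.6, 80.0),
}
drops = {name: round(source - unseen, 1)
         for name, (source, unseen) in all_tasks.items()}

# CycleGAN has no source score; the baseline source score is used instead
# (an assumption made for this illustration).
cyclegan_unseen = 63.5
drops["CycleGAN"] = round(all_tasks["Baseline"][0] - cyclegan_unseen, 1)

for name, drop in drops.items():
    print(f"{name}: {drop} Dice points lost on unseen domains")
# Baseline: 39.3, DST: 11.6, CycleGAN: 25.6
```

The same subtraction gives the source-domain gain of DST over the baseline, 91.6 - 89.1 = 2.5 Dice points.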

Examples of unseen domain segmentation produced by baseline model, CycleGAN-based domain adaptation, and DST domain generalization are shown in Fig. 3. The baseline and DST are trained only on individual source domains, while CycleGAN requires images from target/unseen domain to train an additional generative model.

Figure 3: Generalization to unseen domains for three different 3D medical image segmentation tasks. Baseline deep models have low performances on unseen MRI and ultrasound images from different clinical centers, scanner vendors, etc. CycleGAN-based domain adaptation method helps improve segmentation performances. DST training generates robust models which significantly improve segmentation performances on unseen domains. Segmentation masks (red) overlay on unseen or CycleGAN synthesized images.

3.3.1 DST with Larger Dataset.

So far we have evaluated DST generalization performance using small (30 volumes) public datasets. In this section, we experiment with a larger dataset, and demonstrate generalization performance comparable to supervised state-of-the-art methods.

We train a model with DST on a proprietary dataset of 465 3D MRIs (denoted as MultiCenter) with whole prostate annotations, collected from various medical centers worldwide. Table 3 shows the results on unseen datasets. Overall, using a large source dataset, DST produces competitive results, with Dice being only 0.8% lower than state-of-the-art supervised methods. The supervised models were each trained and tested on the same domain, whereas we achieve similar performance by training only on the source domain. Importantly, on the unseen domain, our DST model matches the agreement between two radiologists (a relative novice versus an expert): it achieves a Dice score of 91.9% on the unseen ProstateX dataset, compared with the Dice score between the novice and expert radiologist annotations on the same dataset (also 91.9%). These findings suggest the feasibility of practical application of deep learning models at clinical sites, where the trained DST model generalizes well to unseen data.

| Method | Source (train) | Source (val) | Unseen: MSD-P | Unseen: PROMISE | Unseen: NCI-ISBI | Unseen: ProstateX | Unseen: Average |
|---|---|---|---|---|---|---|---|
| Baseline | 95.6 | 89.9 | 87.8 | 82.9 | 88.8 | 90.6 | 87.5 |
| DST (ours) | 94.1 | 91.8 | 89.1 | 88.1 | 89.4 | 91.9 | 89.6 |
| State-of-the-art | - | - | - | 91.4* [12] | 88.0* [2] | 91.9* | 90.4 |

Table 3: The effect of DST with larger data (465 3D MRI) for the task of whole prostate segmentation. Values marked with * come from methods trained and tested on the same domain, or are the inter-observer variability (91.9%). No evaluation of whole prostate segmentation is available in the MSD challenge.
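As a quick consistency check, the "Average" column of Table 3 and the 0.8% gap to the state of the art follow directly from the per-dataset values. The snippet below only re-derives numbers already present in the table; the state-of-the-art average covers three datasets because no MSD-P value is available.

```python
# Unseen-domain Dice scores copied from Table 3.
unseen = {
    "Baseline": [87.8, 82.9, 88.8, 90.6],
    "DST": [89.1, 88.1, 89.4, 91.9],
    "State-of-the-art": [91.4, 88.0, 91.9],  # no MSD-P entry
}
averages = {name: round(sum(scores) / len(scores), 1)
            for name, scores in unseen.items()}
gap = round(averages["State-of-the-art"] - averages["DST"], 1)

print(averages)  # {'Baseline': 87.5, 'DST': 89.6, 'State-of-the-art': 90.4}
print(gap)       # 0.8
```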

4 Conclusion

We propose a deep stacked transformations (DST) augmentation approach for unsupervised domain generalization in 3D medical image segmentation. We evaluate DST and different augmentation strategies on three segmentation tasks (prostate 3D MRI, left atrial 3D MRI and left ventricle 3D ultrasound) when applied to unseen domains. The experiments establish a strong benchmark for the study of domain generalization in medical imaging. Furthermore, using a larger training dataset, we show that DST generalization performance is comparable to fully supervised state-of-the-art methods, making deep learning segmentation more feasible in practice.


  • [1] Degel, M.A., Navab, N., Albarqouni, S.: Domain and geometry agnostic CNNs for left atrium segmentation in 3D ultrasound. In: MICCAI. pp. 630–637 (2018)
  • [2] Jia, H., Song, Y., Zhang, D., Huang, H., Feng, D., Fulham, M., Xia, Y., Cai, W.: 3d global convolutional adversarial network for prostate MR volume segmentation. arXiv preprint arXiv:1807.06742 (2018)
  • [3] Litjens, G., Debats, O., Barentsz, J., Karssemeijer, N., Huisman, H.: Computer-aided detection of prostate cancer in MRI. TMI 33(5), 1083–1092 (2014)
  • [4] Litjens, G., Toth, R., van de Ven, W., Hoeks, C., Kerkstra, S., van Ginneken, B., Vincent, G., Guillard, G., Birbeck, N., Zhang, J., et al.: Evaluation of prostate segmentation algorithms for MRI: the PROMISE12 challenge. Medical Image Analysis 18(2), 359–373 (2014)
  • [5] Liu, S., Xu, D., Zhou, S.K., Pauly, O., Grbic, S., Mertelmeier, T., Wicklein, J., Jerebko, A., Cai, W., Comaniciu, D.: 3D anisotropic hybrid network: Transferring convolutional features from 2D images to 3D anisotropic volumes. In: MICCAI. pp. 851–858. Springer (2018)
  • [6] Romera, E., Bergasa, L.M., Alvarez, J.M., Trivedi, M.: Train here, deploy there: Robust segmentation in unseen domains. In: 2018 IEEE Intelligent Vehicles Symposium (IV). pp. 1828–1833. IEEE (2018)
  • [7] Volpi, R., Namkoong, H., Sener, O., Duchi, J., Murino, V., Savarese, S.: Generalizing to unseen domains via adversarial data augmentation. In: NeurIPS (2018)
  • [8] Xiong, Z., Fedorov, V.V., Fu, X., Cheng, E., Macleod, R., Zhao, J.: Fully automatic left atrium segmentation from late gadolinium enhanced magnetic resonance imaging using a dual fully convolutional neural network. TMI 38(2), 515–524 (2019)
  • [9] Yasaka, K., Abe, O.: Deep learning and artificial intelligence in radiology: Current applications and future directions. PLoS Medicine 15(11), e1002707 (2018)
  • [10] Zhang, Y., Miao, S., Mansi, T., Liao, R.: Task driven generative modeling for unsupervised domain adaptation: Application to X-ray image segmentation. In: MICCAI (2018)
  • [11] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV. pp. 2223–2232 (2017)
  • [12] Zhu, Q., Du, B., Yan, P.: Boundary-weighted domain adaptive neural network for prostate MR image segmentation. arXiv preprint arXiv:1902.08128 (2019)
  • [13] Zhuang, X., Shen, J.: Multi-scale patch and multi-modality atlases for whole heart segmentation of MRI. Medical Image Analysis 31, 77–87 (2016)