Domain shift (DS) refers to a mismatch between the statistics of the training data used for model development and the statistics of the test data seen after model deployment. DS can cause significant drops in predictive performance, which has been observed in almost all recent imaging challenges where the final test data came from different clinical sites. DS is a major hurdle for successfully translating predictive models into clinical routine.
Acquisition and population shift are two common forms of DS in medical image analysis. Acquisition shift arises from differences in imaging protocols, modalities or scanners, and is observed even if the same subjects are scanned. Population shift occurs when the cohorts under investigation exhibit different statistics, e.g., varying demographics or disease prevalence. It is not uncommon for both types of DS to occur simultaneously, in particular in multi-center studies. Tackling DS is essential for reliable analysis of large populations across sites and for avoiding biases in the results. Recent work has shown that even after careful pre-processing, site-specific differences remain in the images [18, 6]. While methods like ComBat aim to harmonize image-derived measurements, we focus on the images themselves.
One solution is domain adaptation (DA), a transductive transfer learning technique that aims to modify the marginal distribution of the source domain's feature space so that it resembles that of the target domain. In medical imaging, labelled data is scarce and typically unavailable for the target domain, and it is unlikely that the same subjects appear in both domains. We therefore focus on 'unsupervised' and 'unpaired' DA, where labelled data is available only in the source domain and no matching samples exist between source and target.
Many DA approaches focus on learning domain-invariant feature representations, either by forcing latent representations of the inputs to follow similar distributions or by 'disentangling' domain-specific features from generic features. This can be achieved with a divergence measure based on data statistics or by training adversarial networks to model the divergence between the feature representations. These methods have been applied to brain lesions and tumours in MRI, and to segmentation across contrast and non-contrast CT.
While these approaches are appealing and have shown some success, they lack a notion of explainability, as it is difficult to know what transformations are applied in the feature space. Additionally, although the learned task model may perform equally well on both domains, it is not guaranteed to perform as well as separate models trained on the individual domains.
We explore model-agnostic DA by working at the image level. Our approach is based on domain mapping (DM), which aims to learn pixel-level transformations between two image domains and includes techniques such as style transfer. Pix2Pix (supervised) and CycleGAN (unsupervised) pass images from one domain through an encoder-decoder architecture to produce images in the new domain. A CycleGAN-based method has been used to improve segmentation across scanners, applying DA at both image and feature levels and thus losing interpretability; it also does not decompose the image and spatial transformations.
Methods for DM primarily use UNet-like architectures to learn image-to-image transformations that are easier to interpret, as one can visually inspect the output. For medical images of the same anatomy, but from different scanners, we assume that domain shift manifests primarily in appearance changes (contrast, signal-to-noise, resolution) and anatomical variation (shape changes), plus further subtle variations caused by image reconstruction or interpolation.
Contributions: We propose the use of image-and-spatial transformer networks (ISTNs) to tackle domain shift at the image level in multi-site imaging data. ISTNs separate and compose the transformations for adapting appearance and shape differences between domains. We believe ours is the first approach to use ISTNs with retraining of the downstream task model on images transferred from source to target. We show that ISTNs can be trained adversarially in a task-model-agnostic way. The transferred images can be visually inspected, so our approach adds explainability to domain adaptation, which is important for validating the plausibility of the learned transformations. Our results demonstrate the successful recovery of performance on classification and regression tasks when using ISTNs to tackle domain shift. We explore both unidirectional and bidirectional training schemes and compare retraining the task model from scratch versus fine-tuning. We present proof-of-concept results on synthetic images generated with Morpho-MNIST for a 3-class classification task. Our method is then validated on real multi-site data with 3D T1-weighted brain MRI. The results indicate that ISTNs improve generalization, and predictive performance can be recovered close to single-site accuracy.
We propose adversarial training of ISTNs to perform model-agnostic DA via explicit appearance and shape transformations between the domains. We explore unidirectional and bidirectional training schemes as illustrated in Figure 1.
Models. ISTNs have two components: an image transformer network (ITN) and a spatial transformer network (STN) [8, 10]. Here, we additionally require a discriminator model for adversarial training of the ISTN.
The ITN performs appearance transformations, such as contrast and brightness changes, and other localised adaptations at the image level. A common image-to-image (I2I) translation network based on a UNet with residual skip connections can be employed. We use upsample-convolutions rather than transposed convolutions to reduce chequerboard artifacts, together with batch normalization, dropout layers and ReLU activations, with a final activation on the output. All input images are pre-normalized to a fixed intensity range.
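The ITN design just described can be sketched in PyTorch as follows. This is a minimal stand-in, not the paper's exact architecture: the layer sizes and the tanh output activation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MiniITN(nn.Module):
    """Sketch of an ITN: encoder-decoder with a residual skip connection,
    upsample+conv decoding (instead of transposed convolution, to reduce
    chequerboard artifacts), batch norm, dropout and ReLU. The final tanh
    is a placeholder; the paper's exact output activation is not given here."""
    def __init__(self, ch=1, f=16):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(ch, f, 3, stride=2, padding=1),
            nn.BatchNorm2d(f), nn.ReLU(inplace=True), nn.Dropout2d(0.1),
        )
        self.dec = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(f, ch, 3, padding=1),
        )
        self.act = nn.Tanh()

    def forward(self, x):
        # residual skip: the network learns an appearance *change* on top of x
        return self.act(self.dec(self.enc(x)) + x)

itn = MiniITN()
out = itn(torch.randn(2, 1, 28, 28))
```

The residual skip biases the network towards small appearance edits, which suits subtle inter-site differences.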
STN: We experiment with both the affine and B-spline STNs described in the original ISTN paper. Affine STNs learn to regress the parameters of linear spatial transforms with translation, rotation, scaling, and shearing. B-spline STNs regress control-point displacements. Linear interpolation is used throughout. Note that in this work, affine and B-spline STNs are considered independently and are not composed.
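An affine STN of this kind can be sketched as below; the localisation-network layout and the identity initialisation are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineSTN(nn.Module):
    """Sketch of an affine STN: a small conv regressor predicts the six
    parameters of a 2D affine transform (translation, rotation, scaling,
    shearing), initialised to the identity; the image is resampled with
    bilinear (linear) interpolation."""
    def __init__(self):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, 5, stride=2), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(8 * 16, 6),
        )
        # identity initialisation: zero weights, identity-transform bias
        nn.init.zeros_(self.loc[-1].weight)
        self.loc[-1].bias.data = torch.tensor([1., 0., 0., 0., 1., 0.])

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, mode='bilinear', align_corners=False)

stn = AffineSTN()
x = torch.randn(2, 1, 28, 28)
y = stn(x)  # identity-initialised: output matches input before training
```

A B-spline STN would instead regress a coarse grid of control-point displacements and upsample them to a dense deformation field before resampling.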
Discriminator: In both Morpho-MNIST and brain MRI experiments, we use a standard fully-convolutional classification network with instance normalization, dropout layers and a sigmoid output.
The employed classifiers and regressors follow the same fully-convolutional structure as the discriminator, reducing the dimensions of the input images to a multi-class or continuous value prediction, depending on the task. We use cross-entropy or mean-squared error loss functions, respectively.
Appendices 0.C and 0.D provide details of the network architectures. All implementations are in PyTorch, with code available online at https://github.com/mlnotebook/domain_adapation_istn.
Training. The output of the ITN is fed directly into the STN; the two are composed into a single ISTN unit and trained jointly end-to-end. Discriminator: Source-domain images are passed through the ISTN to generate transformed images intended to resemble the target domain. These transformed images are passed to the discriminator, which yields a score in [0, 1] denoting whether an image is a real target-domain sample or a transformed one. The discriminator is trained by minimizing the binary cross-entropy loss between the predicted and true domain labels; Eq. (1) gives the total discriminator loss. Soft labels for the true domain are used to stabilize early training of the discriminator: the hard '0' and '1' domain labels are replaced by random uniform values drawn close to 0 and 1, respectively.
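The discriminator objective with soft labels might look like this in PyTorch. The soft-label intervals [0.9, 1.0] and [0.0, 0.1] are placeholder assumptions, since the paper's exact ranges are not reproduced here.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def soft_labels(n, real, device=None):
    """Soft labels: real in [0.9, 1.0], fake in [0.0, 0.1] (illustrative
    intervals; the exact ranges are a training choice)."""
    lo, hi = (0.9, 1.0) if real else (0.0, 0.1)
    return torch.empty(n, 1, device=device).uniform_(lo, hi)

def discriminator_loss(d_real, d_fake):
    """Total discriminator loss: BCE on real target-domain scores plus
    BCE on ISTN-transformed (fake) scores, with softened labels."""
    loss_real = bce(d_real, soft_labels(d_real.size(0), real=True))
    loss_fake = bce(d_fake, soft_labels(d_fake.size(0), real=False))
    return loss_real + loss_fake

# toy scores, as if from a sigmoid-output discriminator
d_real = torch.full((4, 1), 0.95)
d_fake = torch.full((4, 1), 0.05)
loss_d = discriminator_loss(d_real, d_fake)
```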
ISTN: The ISTN is trained as a generator: its output is passed through the discriminator, and the adversarial loss forces it towards the target domain. Soft labels are also used here. When target-domain images are passed through the ISTN, the output should be unchanged, as it is already in the target domain; this is enforced by an identity loss on the image intensities of the input and output. A weighting factor applied to the identity term gives the total loss function for the ISTN in Eq. (2).
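A minimal sketch of this generator objective is given below, assuming an L1 intensity penalty for the identity term (the exact penalty and weighting value are illustrative assumptions).

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()
l1 = nn.L1Loss()

def istn_loss(d_on_transformed, target_imgs, target_identity, lam=1.0):
    """Sketch of the ISTN (generator) objective: an adversarial term that
    pushes transformed source images towards the target domain, plus an
    identity term keeping target-domain inputs unchanged. `lam` is the
    identity weighting; L1 is one common choice of intensity penalty."""
    real = torch.empty_like(d_on_transformed).uniform_(0.9, 1.0)  # soft labels
    adv = bce(d_on_transformed, real)
    idt = l1(target_identity, target_imgs)
    return adv + lam * idt

# toy values: discriminator scores and a target batch passed through an
# ISTN that (ideally) leaves it unchanged
d_out = torch.full((4, 1), 0.8)
x_t = torch.randn(4, 1, 28, 28)
loss_g = istn_loss(d_out, x_t, x_t.clone(), lam=1.0)
```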
We compare with the CycleGAN training approach, which trains both directions simultaneously using two ISTNs and two discriminators. CycleGAN introduces a cycle-consistency term such that when the reverse ISTN is applied to a transformed image, the result is forced to be close to the original. Figure 1 shows the two ISTNs, their outputs and the associated losses; the loss functions are given in Eq. (3). Optimization uses the Adam optimizer.
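The cycle-consistency term can be sketched as below; the weighting value of 10 follows common CycleGAN practice and is an assumption here, and the two modules stand in for the forward and reverse ISTNs.

```python
import torch
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_loss(x_s, x_t, istn_st, istn_ts, lam_cyc=10.0):
    """Cycle-consistency sketch: mapping source->target->source (and
    target->source->target) should reproduce the input image."""
    return lam_cyc * (l1(istn_ts(istn_st(x_s)), x_s) +
                      l1(istn_st(istn_ts(x_t)), x_t))

# with identity mappings in both directions the cycle loss vanishes
x_s = torch.randn(2, 1, 28, 28)
x_t = torch.randn(2, 1, 28, 28)
loss_cyc = cycle_loss(x_s, x_t, nn.Identity(), nn.Identity())
```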
Downstream Tasks: The goal of our work is to demonstrate that explicit appearance and spatial transformations via ISTNs can successfully tackle DS in certain applications. Ideally, the performance of a predictor trained on the source domain and tested on the target domain should recover to single-site performance. To demonstrate this, prior to training the ISTN, we train a task model (e.g., a classifier or regressor) on the source domain. Its single-site performance is likely our 'best' performance, while its performance on the target domain will degrade due to DS. During ISTN training, we simultaneously retrain the task model on the ISTN outputs, i.e., source images transformed towards the target domain, using the source-domain labels. We assess the performance 'recovery' by comparing the retrained model against the single-site baselines. In practice, data from the target domain would be unlabelled. Our approach ensures that test data from the new domain is not modified in any way. Additionally, in scenarios where the original model is deployed, it is likely to have been trained on a large, well-curated, high-quality dataset; we cannot assume a similar dataset would be available for each new test domain. Our model-agnostic unsupervised DA is validated on two problems: (i) a proof-of-concept showing recovery of a classifier's performance on digit recognition; (ii) classification and regression tasks with real-world, multi-site T1-weighted brain MRI.
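The simultaneous retraining of the task model on ISTN outputs could look roughly like the step below; the module and optimiser names are hypothetical, and cross-entropy stands in for whichever task criterion applies.

```python
import torch
import torch.nn as nn

def adaptation_step(istn, task_model, x_s, y_s, opt_task):
    """One sketch of a task-model retraining step: the source batch is
    pushed through the ISTN towards the target domain (held fixed for this
    step) and the task model is updated with the still-valid source labels."""
    with torch.no_grad():
        x_st = istn(x_s)  # source images transformed towards the target domain
    loss = nn.functional.cross_entropy(task_model(x_st), y_s)
    opt_task.zero_grad()
    loss.backward()
    opt_task.step()
    return loss.item()

# toy usage: identity 'ISTN' and a linear classifier on flattened images
istn = nn.Identity()
task = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 3))
opt = torch.optim.SGD(task.parameters(), lr=0.1)
x = torch.randn(8, 1, 28, 28)
y = torch.randint(0, 3, (8,))
loss_val = adaptation_step(istn, task, x, y, opt)
```

Because only source labels are used, the scheme stays unsupervised with respect to the target domain, as described above.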
3.1 Proof-of-concept: Morpho-MNIST Experiments
Data. Morpho-MNIST is a framework for applying medically inspired perturbations, such as local swellings and fractures, to the well-known MNIST dataset. The framework also allows controlled transformations to obtain thickening and shearing of the original digits. We first create a dataset with three classes: 'healthy' digits with no perturbation; 'fractured' digits with a single thin disruption; and 'swollen' digits exhibiting a localized, tumor-like abnormal growth. A digit is either fractured or swollen, never both. A set of 'thin' digits (2.5 pixels stroke width) forms the source domain. To simulate domain shift, we create three more datasets: thickened digits (5.0-pixel strokes); slanted digits created by shearing the image by 20–25 degrees; and thickened-slanted digits combining both. All target datasets contain the same three classes as the source, while each has its own characteristics simulating a different type of domain shift. All images are single-channel 28×28 pixels. Figure 2 shows visual examples.
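To illustrate the 'slanted' domain, a crude shear like the following produces the kind of shift described. Morpho-MNIST itself applies proper continuous image transformations; this integer-shift version is only a sketch.

```python
import numpy as np

def shear_image(img, shear_deg):
    """Crude sketch of the 'slant' domain shift: shear each row
    horizontally in proportion to its distance from the bottom of the
    image using nearest-pixel shifts."""
    h, w = img.shape
    t = np.tan(np.deg2rad(shear_deg))
    out = np.zeros_like(img)
    for y in range(h):
        shift = int(round((h - 1 - y) * t))  # top rows shift most
        if shift >= 0:
            out[y, shift:] = img[y, :w - shift]
        else:
            out[y, :w + shift] = img[y, -shift:]
    return out

# toy 'digit': a filled square, sheared by 20 degrees
digit = np.zeros((28, 28))
digit[10:18, 10:18] = 1.0
slanted = shear_image(digit, 20)
```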
Task. The downstream task in this experiment is a 3-class classification problem: 'healthy' vs. 'fractured' vs. 'swollen'. We train a small, fully-convolutional classifier on the source domain. We then use ISTNs to retrain the classifier on source images transformed towards each target domain and evaluate each retrained classifier on the corresponding target test set.
We run training for 100 epochs and perform a grid search over hyper-parameters, including the learning rate, the trade-off weight and the control-point spacing of the B-spline STN. We conduct experiments using the ITN only, the STN only, and combinations of affine and B-spline ISTNs to determine the best model for the task. We also consider both transfer directions, switching the roles of source and target domains.
3.2 Application to Brain MRI Experiments
We apply the same methodology to a real-world domain shift problem where we observe a significant drop in prediction accuracy when naively training on one site and testing on another without any DA. We utilise 3D brain MRI from two sites that employ similar but not identical imaging protocols.
Data. We construct two datasets of T1-weighted brain MRI from subjects with no reported pathology, one taken from the Cambridge Centre for Ageing and Neuroscience study (Cam-CAN) [15, 17] and the other from the UK Biobank imaging study (UKBB) [16, 11, 1]. From each site, 450 subjects are used for training and the remainder for testing. The UKBB dataset contains equal numbers of male and female subjects between the ages of 48 and 71. In the classification task, to simulate the effect of population shift, our Cam-CAN dataset has a wider age range (30–87) but maintains the male-to-female ratio. We match the age range of the two datasets in the regression task, limiting DS to the more subtle scanner effects. UKBB images were acquired at the UKBB imaging centre, and Cam-CAN images at the Medical Research Council Cognition and Brain Sciences Unit in Cambridge, UK. Both sites acquire 1 mm isotropic images using the 3D MPRAGE pulse sequence on Siemens 3 T scanners with a 32-channel receiver head coil and in-plane acceleration factor 2. Appendix 0.A presents the acquisition parameters that differ between the two sites. We note that the acquisition parameters of the two sites are generally similar, and the images cannot easily be distinguished visually. For pre-processing, all images are affinely aligned to MNI space, skull-stripped, bias-field corrected, and intensity-normalised to zero mean and unit variance within a brain mask. Voxels outside the mask are set to 0. Images are passed through a function before being consumed by the networks.
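The intensity normalisation step within the brain mask can be sketched as follows; this is a simplified stand-in for the actual pre-processing pipeline, shown in 2D for brevity.

```python
import numpy as np

def normalise_in_mask(img, mask):
    """Zero-mean, unit-variance intensity normalisation computed over
    voxels inside the brain mask; voxels outside the mask are set to 0,
    matching the pre-processing described above."""
    vals = img[mask > 0]
    out = (img - vals.mean()) / (vals.std() + 1e-8)
    out[mask == 0] = 0.0
    return out

# toy image and mask
rng = np.random.default_rng(0)
img = rng.normal(5.0, 2.0, size=(16, 16))
mask = np.zeros((16, 16))
mask[4:12, 4:12] = 1
norm = normalise_in_mask(img, mask)
```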
Task. We consider two prediction tasks, namely sex classification and age regression using the UKBB and Cam-CAN sets, each once as source and once as target domain. The task networks are retrained on the transformed images produced by the ISTN and evaluated on the corresponding target domain.
4 Experimental Results
Morpho-MNIST. Quantitative results for the synthetic experiments are summarized in Table 1. ITNs are able to harmonize local appearance, such as stroke thickness, between source and target domains, while STNs perform well in recovering shape variations such as slant. Where both thickness and slant vary between source and target domains, an ITN alone performs as well as (or slightly better than) a joint ISTN, suggesting that thickness is more important for this classification task. Figure 2 shows visually how the ISTNs recover both appearance and shape differences between domains.
Brain MRI. Quantitative results are summarized in Tables 2 and 4. The sex classifier trained and tested on UKBB achieves 84.3% accuracy. This drops to 54.8% when tested on Cam-CAN. Similarly, training and testing on Cam-CAN yields 91.6%, dropping to 64.3% when testing on UKBB. Using ISTNs for domain adaptation, and retraining the classifiers increases the accuracy substantially on Cam-CAN from 54.8% to 80.9%, and on UKBB from 64.3% to 86.2%, which is close to the single-site performance. Training the classifier from scratch performs similarly well to fine-tuning. Bidirectional training with CycleGAN seems not to provide substantial improvements over the simpler unidirectional scheme. The ISTNs are able to overcome some of the acquisition and population shifts between the two domains.
The age regressor trained and tested on UKBB achieves a mean absolute error (MAE) of 4.25 years, increasing to 5.13 years when evaluated on Cam-CAN. The regressor trained and tested on Cam-CAN yields an MAE of 4.10 years, increasing to 4.61 years when tested on UKBB. Despite the initially smaller drop in performance for age regression, ISTNs still improve performance: the UKBB-trained regressor recovers to 4.58 years MAE and the Cam-CAN-trained one to 4.56 years. Note that we limited the population shift here by constraining the age range, so the recovery is likely due to a reduction in acquisition shift.
We explored adversarially trained ISTNs for model-agnostic domain adaptation. The learned image-level transformations aid explainability, as the resulting images can be visually inspected and checked for plausibility (cf. Fig. 4). Further interrogation of the deformation fields also adds to explainability (e.g., Appendix 0.B). Image-level DA seems suitable in cases of subtle domain shift caused by acquisition and population differences in multi-center studies; predictive performance approached single-site accuracies. The choice of STN and control-point spacing may need to be considered carefully for specific use cases. An extension of our work to many sites may be possible by simultaneously adapting to multiple sites. A quantitative comparison to feature-level DA is a natural next step for future work. Another interesting direction is to integrate the ISTN component into a fully end-to-end task-driven optimisation, where the ISTN and the task network are trained jointly.
RR funded by KCL & Imperial EPSRC CDT in Medical Imaging (EP/L015226/1) and GlaxoSmithKline; This research received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 757173, project MIRA, ERC-2017-STG). DCC is supported by the EPSRC Centre for Doctoral Training in High Performance Embedded and Distributed Systems (HiPEDS, grant ref EP/L016796/1). The research was supported in part by the National Institutes of Health, Clinical Center.
- (2018) Image processing and quality control for the first 10,000 brain imaging datasets from UK Biobank. NeuroImage 166, pp. 400–424.
- (2019) Morpho-MNIST: quantitative assessment and diagnostics for representation learning. Journal of Machine Learning Research 20 (178).
- (2019) Causality matters in medical imaging. arXiv:1912.08142.
- A. Crimi, S. Bakas, H. Kuijf, F. Keyvan, M. Reyes, and T. van Walsum (Eds.) (2019) Brainlesion: glioma, multiple sclerosis, stroke and traumatic brain injuries. Springer International Publishing.
- (2019) Automatic brain tumor segmentation with domain adaptation. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, A. Crimi, S. Bakas, H. Kuijf, F. Keyvan, M. Reyes, and T. van Walsum (Eds.), Cham, pp. 380–392.
- (2019) Machine learning with multi-site imaging data: an empirical study on the impact of scanner effects. In Medical Imaging meets NeurIPS.
- (2017) Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, pp. 5967–5976.
- (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems 28, pp. 2017–2025.
- (2017) Unsupervised domain adaptation in brain lesion segmentation with adversarial networks. In International Conference on Information Processing in Medical Imaging, pp. 597–609.
- (2019) Image-and-spatial transformer networks for structure-guided image registration. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P. Yap, and A. Khan (Eds.), Cham, pp. 337–345.
- (2016) Multimodal population brain imaging in the UK Biobank prospective epidemiological study. Nature Neuroscience 19 (11), pp. 1523–1536.
- (2010) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359.
- (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035.
- (2019) Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks. Scientific Reports 9 (1).
- (2014) The Cambridge Centre for Ageing and Neuroscience (Cam-CAN) study protocol: a cross-sectional, lifespan, multidisciplinary examination of healthy cognitive ageing. BMC Neurology 14 (1), pp. 204.
- (2015) UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine 12 (3), e1001779.
- (2017) The Cambridge Centre for Ageing and Neuroscience (Cam-CAN) data repository: structural and functional MRI, MEG, and cognitive data from a cross-sectional adult lifespan sample. NeuroImage 144, pp. 262–269.
- (2019) Quantifying confounding bias in neuroimaging datasets with causal inference. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2019, pp. 484–492.
- (2019) The domain shift problem of medical image segmentation and vendor-adaptation by unet-gan. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P. Yap, and A. Khan (Eds.), Cham, pp. 623–631.
- (2019) Unsupervised domain adaptation via disentangled representations: application to cross-modality liver segmentation. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P. Yap, and A. Khan (Eds.), Cham, pp. 255–263.
- (2018) Statistical harmonization corrects site effects in functional connectivity measurements from multi-site fMRI data. Human Brain Mapping 39 (11), pp. 4213–4227.
- (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), Venice, pp. 2242–2251.
Appendix 0.A Acquisition Parameters.
| Site | Scanner | TR (ms) | TE (ms) | TI (ms) | TA (s) | FOV (mm) |
| Cam-CAN | Siemens TIM Trio | 2250 | 2.99 | 900 | 272 | 256×240×192 |
Appendix 0.B ISTN Transformation Visualization.
Appendix 0.C Morpho-MNIST Architectures.
ITN Architecture - Morpho-MNIST
Discriminator Architecture - Morpho-MNIST
3-Class Classifier Architecture - Morpho-MNIST
Legend: stride; layer input and output dimensions; normalization (BN: batch normalization, IN: instance normalization); dropout keep-rate.
Appendix 0.D Brain MRI Architectures.
ITN Architecture - Brain MRI
Discriminator Architecture - Brain MRI
Sex Classifier Architecture - Brain MRI
Age Regressor Architecture - Brain MRI