Machine Learning with Multi-Site Imaging Data: An Empirical Study on the Impact of Scanner Effects

10/10/2019 ∙ by Ben Glocker, et al.

This is an empirical study to investigate the impact of scanner effects when using machine learning on multi-site neuroimaging data. We utilize structural T1-weighted brain MRI obtained from two different studies, Cam-CAN and UK Biobank. For the purpose of our investigation, we construct a dataset consisting of brain scans from 592 age- and sex-matched individuals, 296 subjects from each original study. Our results demonstrate that even after careful pre-processing with state-of-the-art neuroimaging pipelines a classifier can easily distinguish between the origin of the data with very high accuracy. Our analysis on the example application of sex classification suggests that current approaches to harmonize data are unable to remove scanner-specific bias leading to overly optimistic performance estimates and poor generalization. We conclude that multi-site data harmonization remains an open challenge and particular care needs to be taken when using such data with advanced machine learning methods for predictive modelling.




1 Motivation

Pooling data from different sites and previous studies is essential for the analysis of large populations with sufficient statistical power (Smith and Nichols, 2018). However, due to differences in image acquisition, demographics, disease characteristics and other factors, naive combination of datasets for subsequent large-scale population analysis can be problematic. Here, we conduct a simple, empirical study to illustrate and highlight this problem in the context of machine learning. We do not propose a solution, but rather reiterate that multi-center data harmonization is an open research challenge. For some recent attempts to tackle this problem, see for example (Fortin et al., 2017, 2018).

2 Data

We construct an age- and sex-matched dataset with T1-weighted brain MRI from 592 individuals, where 296 subjects are taken each from the Cambridge Centre for Ageing and Neuroscience study (Cam-CAN) (Shafto et al., 2014; Taylor et al., 2017) and the UK Biobank imaging study (UKBB) (Sudlow et al., 2015; Miller et al., 2016; Alfaro-Almagro et al., 2018). This is to simulate a somewhat ‘best case scenario’ for multi-site data, where the age- and sex-matching is intended to remove population bias. We note that this is rarely possible in practice, and it is expected that current and previous analyses that pool data from different sites suffer from much larger site-specific biases.
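The text does not describe how the matching was carried out. As a purely illustrative sketch, one simple option is greedy one-to-one matching: for each subject in one cohort, pick the unused subject of the same sex in the other cohort with the closest age, within a tolerance. The `Subject` record, the `max_age_gap` tolerance, the function name and the toy subjects are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subject:
    subject_id: str
    age: float  # years
    sex: str    # "F" or "M"


def match_cohorts(cohort_a, cohort_b, max_age_gap=1.0):
    """Greedily pair subjects of equal sex and similar age (hypothetical helper).

    Returns a list of (subject_a, subject_b) pairs. Subjects without a
    partner within `max_age_gap` years are left out of the matched set.
    """
    unused = list(cohort_b)
    pairs = []
    # Process in age order so that the result is deterministic.
    for a in sorted(cohort_a, key=lambda s: (s.age, s.subject_id)):
        candidates = [b for b in unused
                      if b.sex == a.sex and abs(b.age - a.age) <= max_age_gap]
        if not candidates:
            continue
        best = min(candidates, key=lambda b: (abs(b.age - a.age), b.subject_id))
        unused.remove(best)  # each subject is used at most once
        pairs.append((a, best))
    return pairs


# Toy cohorts (not real study data).
camcan = [Subject("c1", 45.0, "F"), Subject("c2", 60.0, "M"),
          Subject("c3", 30.0, "F")]
ukbb = [Subject("u1", 45.5, "F"), Subject("u2", 61.5, "M"),
        Subject("u3", 59.6, "M"), Subject("u4", 70.0, "F")]
pairs = match_cohorts(camcan, ukbb)
```

Greedy matching is not guaranteed to find the largest possible matched set; it is used here only because it is short and easy to follow.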


Cam-CAN:

All images were collected at a single site (Medical Research Council Cognition and Brain Sciences Unit (MRC-CBSU) in Cambridge, UK) using a 3T Siemens TIM Trio scanner with a 32-channel receive head coil. Imaging parameters are: 3D MPRAGE, TR=2250ms, TE=2.99ms, TI=900ms; FA=9 deg; FOV=256x240x192mm; 1mm isotropic; GRAPPA=2; TA=4mins 32s.

UK Biobank:

All images were collected at the UKBB imaging center using a 3T Siemens Skyra scanner with a 32-channel receive head coil. Imaging parameters are: 3D MPRAGE, R=2, TR=2000ms, TE=385ms, TI=880ms; FOV=208x256x256mm; 1mm isotropic; Duration 4mins 54s.

The acquisition protocols of the two studies are remarkably similar, and possibly much closer than those typically found when pooling data from multiple sites. The subjects in both studies are expected to be healthy (normal) individuals.

2.1 Pre-Processing Pipeline

We aimed at designing a common state-of-the-art pre-processing pipeline, which in this or similar form is widely used in neuroimaging studies. In particular, we apply the following sequential steps: 1) Lossless image reorientation by swapping axes using the direction information from the NIfTI image header, such that all scans are in the same radiological orientation of left, posterior, superior; 2) Skull stripping with ROBEX v1.2 (Iglesias et al., 2011); 3) Bias field correction with N4ITK (Tustison et al., 2010); 4) Intensity-based linear registration (rigid and affine) to MNI ICBM 152 2009a Nonlinear Symmetric using an in-house registration tool with correlation coefficient as the similarity measure and downhill-simplex as the optimizer.
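Step 4 uses the correlation coefficient as the similarity measure. The in-house tool is not described further, so the following is only a sketch of that measure, written over flattened voxel lists in pure Python; the function name and the toy intensity lists are hypothetical. An optimizer such as downhill-simplex would then search for the transform parameters that maximize this value.

```python
import math


def correlation_coefficient(fixed, moving):
    """Pearson correlation between two equally long voxel intensity lists."""
    if len(fixed) != len(moving) or len(fixed) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(fixed)
    mean_f = sum(fixed) / n
    mean_m = sum(moving) / n
    cov = sum((f - mean_f) * (m - mean_m) for f, m in zip(fixed, moving))
    var_f = sum((f - mean_f) ** 2 for f in fixed)
    var_m = sum((m - mean_m) ** 2 for m in moving)
    if var_f == 0 or var_m == 0:
        raise ValueError("correlation is undefined for a constant image")
    return cov / math.sqrt(var_f * var_m)


# Toy intensity profiles (not real image data).
template = [0.0, 1.0, 2.0, 3.0, 4.0]
rescaled = [10.0 + 2.0 * v for v in template]  # same content, other intensity scale
shuffled = [3.0, 0.0, 4.0, 1.0, 2.0]           # same values, misaligned

score_aligned = correlation_coefficient(template, rescaled)
score_misaligned = correlation_coefficient(template, shuffled)
```

Because the measure is invariant to linear intensity changes, the rescaled profile scores as perfectly aligned, while the misaligned one scores close to zero.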

After these steps, we perform intensity normalization within brain regions with simple whitening (zero-mean/unit-variance). Voxels outside the brain are set to a fixed value. Other techniques such as percentile matching and Nyúl’s histogram standardization (Nyúl et al., 2000) led to similar subsequent observations. We also employ SPM12 (Friston et al., 2007; Ashburner, 2012) and FMRIB’s Automated Segmentation Tool (FAST) v4.0 (Zhang et al., 2001) to obtain brain tissue probability maps. SPM is run directly on the raw T1-weighted scans, as it has its own built-in pre-processing pipeline, including spatial non-linear normalization to MNI space. FSL-FAST is run on our skull-stripped, bias field corrected and rigidly MNI-aligned images.
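A minimal sketch of the whitening step, assuming the image is given as a flat list of intensities with a matching boolean brain mask. The function name, the flat-list representation and the background value of 0.0 are assumptions; the text only states that a fixed value is used outside the brain.

```python
import math


def whiten_in_mask(intensities, brain_mask, background=0.0):
    """Scale in-brain voxels to zero mean and unit variance.

    Voxels outside the mask are set to `background` (an assumed value).
    """
    inside = [v for v, m in zip(intensities, brain_mask) if m]
    if len(inside) < 2:
        raise ValueError("brain mask must contain at least two voxels")
    mean = sum(inside) / len(inside)
    # Population variance computed over the brain voxels only.
    var = sum((v - mean) ** 2 for v in inside) / len(inside)
    if var == 0:
        raise ValueError("cannot whiten a constant image")
    std = math.sqrt(var)
    return [(v - mean) / std if m else background
            for v, m in zip(intensities, brain_mask)]


# Toy example: three brain voxels surrounded by background.
image = [5.0, 100.0, 120.0, 140.0, 7.0]
mask = [False, True, True, True, False]
whitened = whiten_in_mask(image, mask)
```

Computing the statistics inside the mask only keeps the background from dominating the mean and variance.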

Figure 1: Example data for six age- and sex-matched subjects from the Cam-CAN and UKBB datasets after applying different pre-processing steps. The top two rows show the intensity histograms after skull-stripping, bias field correction, rigid registration to MNI, and whitening for intensity normalization. Rows three and four show the corresponding T1-weighted mid-axial slices. Rows five and six show the spatially normalized gray matter maps obtained with SPM12. Site-specific differences are not obvious from visual inspection.

3 Experiments, Results & Conclusion

We conduct two image classification experiments to illustrate the impact of scanner effects which remain after careful pre-processing and are present even in image-derived tissue probability maps.

Site classification:

We train random forest binary classifiers to distinguish whether a scan originates from Cam-CAN or UKBB. Results are summarized in Table 1. We make the following observations: i) classifiers are able to predict data origin with high accuracy; ii) scanner effects remain in derived tissue probability maps; iii) higher degrees of spatial normalization amplify scanner effects (possibly related to interpolation).
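The evaluation protocol can be sketched as two-fold cross-validation: split the subjects into two halves, train on one, test on the other, then swap. The study uses random forests; to stay dependency-free, the sketch below swaps in a nearest-centroid classifier as a stand-in. All function names and the toy features are hypothetical and do not reproduce the reported results.

```python
import random


def nearest_centroid_fit(features, labels):
    """Return per-class mean feature vectors (stand-in for a random forest)."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def nearest_centroid_predict(centroids, x):
    """Predict the class whose centroid is closest in squared distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist(centroids[y]))


def two_fold_accuracy(features, labels, seed=0):
    """Overall accuracy of two-fold cross-validation with stratified folds."""
    rng = random.Random(seed)
    folds = [[], []]
    # Stratify: shuffle each class separately and deal indices alternately.
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % 2].append(i)
    correct = 0
    for test_fold in (0, 1):
        train_idx, test_idx = folds[1 - test_fold], folds[test_fold]
        model = nearest_centroid_fit([features[i] for i in train_idx],
                                     [labels[i] for i in train_idx])
        correct += sum(nearest_centroid_predict(model, features[i]) == labels[i]
                       for i in test_idx)
    return correct / len(labels)


# Toy data: the second feature carries a small site-specific offset.
data_rng = random.Random(42)
features = [[data_rng.gauss(0.0, 0.1), data_rng.gauss(0.3 if site else -0.3, 0.1)]
            for site in (0, 1) for _ in range(50)]
labels = [site for site in (0, 1) for _ in range(50)]
accuracy = two_fold_accuracy(features, labels)
```

Even a subtle but consistent offset between sites is enough for a simple classifier to recover the data origin far above chance on this toy data.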

Sex classification: We consider sex classification as a simple binary classification task. We compare the results of training random forest classifiers on single-site and multi-site data.

Results for sex classification are summarized in Table 2. We make the following observations: i) age/sex-matched multi-site data gives realistic estimates of accuracy (similar to single-site data); ii) sex imbalance in multi-site data leads to overly optimistic accuracy; iii) training on one site and testing on the other shows a drop in performance, indicating poor generalization; iv) when discriminative features such as brain size are removed by affine registration, the drop in performance is more severe.
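To make the imbalanced data arrangements in Table 2 concrete, the sketch below builds a sex-imbalanced multi-site subset from per-site subject lists. Reading "80/20%" as the female/male proportion within a site is an assumption, as are the `(subject_id, sex)` record layout, the function name and the toy rosters.

```python
import random


def sample_with_sex_ratio(subjects, n_total, female_fraction, seed=0):
    """Draw `n_total` subjects from one site with the requested share of females.

    `subjects` is a list of (subject_id, sex) tuples with sex in {"F", "M"}.
    """
    rng = random.Random(seed)
    n_female = round(n_total * female_fraction)
    n_male = n_total - n_female
    females = [s for s in subjects if s[1] == "F"]
    males = [s for s in subjects if s[1] == "M"]
    if n_female > len(females) or n_male > len(males):
        raise ValueError("not enough subjects to reach the requested ratio")
    return rng.sample(females, n_female) + rng.sample(males, n_male)


# Hypothetical site rosters with balanced sexes.
site_a = [(f"a{i}", "F" if i % 2 == 0 else "M") for i in range(100)]
site_b = [(f"b{i}", "F" if i % 2 == 0 else "M") for i in range(100)]

# Site A contributes mostly females and site B mostly males, so that
# site membership becomes predictive of sex in the pooled data.
pooled = (sample_with_sex_ratio(site_a, 50, 0.8)
          + sample_with_sex_ratio(site_b, 50, 0.2))
```

The pooled set is balanced overall, yet a classifier that only detects the site would already guess the sex correctly for 80% of these subjects.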

Conclusions: Scanner effects can be subtle yet significantly affect machine learning. Similar findings for multi-site neuroimaging data are reported in (Ferrari et al., 2018; Wachinger et al., 2019).

Input                            Aligned      Intensities   Accuracy   Avg. Entropy   Avg. Prob.
Stripped, bias field corrected   rigid        whitening     96.96%     0.4039         0.8296
Stripped, bias field corrected   affine       whitening     98.82%     0.3876         0.8397
SPM12 – Segment                  rigid        graymatter    80.24%     0.6363         0.6399
SPM12 – Segment                  non-linear   graymatter    96.62%     0.5675         0.7234
FSL – FAST                       rigid        graymatter    93.24%     0.4542         0.7968

Table 1: Two-fold cross validation results for site classification. Reported are overall accuracy, average entropy, and average predictive probability. If the data were indistinguishable one would expect an accuracy of 50%, an entropy of 0.6931 (upper bound), and a probability of 0.5.
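
The chance-level reference values in the caption follow from the binary entropy measured in nats: for a predicted probability of 0.5 it equals ln 2 ≈ 0.6931. The sketch below computes both summary statistics from predicted class probabilities. It assumes that "Avg. Prob." is the probability assigned to the predicted (most likely) class, which is consistent with the stated baseline of 0.5 but is not spelled out in the text; the function names and example probabilities are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in nats of a Bernoulli distribution with success probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def summarize_predictions(probabilities):
    """Return (average entropy, average probability of the predicted class)."""
    n = len(probabilities)
    avg_entropy = sum(binary_entropy(p) for p in probabilities) / n
    avg_prob = sum(max(p, 1.0 - p) for p in probabilities) / n
    return avg_entropy, avg_prob


# A classifier that cannot tell the classes apart versus a confident one.
uninformative = summarize_predictions([0.5, 0.5, 0.5])
confident = summarize_predictions([0.99, 0.02, 0.97])
```

Lower average entropy and higher average probability both indicate more confident predictions, which is how the two columns complement the accuracy.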
Data Arrangement                  Aligned   Accuracy   Avg. Entropy   Avg. Prob.
Multi-site age/sex-matched        rigid     82.60%     0.5304         0.7388
Single-site (Cam-CAN)             rigid     81.42%     0.5592         0.7179
Single-site (UKBB)                rigid     84.46%     0.5049         0.7572
Cam-CAN females / UKBB males      rigid     94.59%     0.4036         0.8311
Cam-CAN 80/20% / UKBB 20/80%      rigid     85.87%     0.5038         0.7616
Cam-CAN train / UKBB test         rigid     81.42%     0.5617         0.7124
UKBB train / Cam-CAN test         rigid     78.04%     0.5284         0.7419
Multi-site age/sex-matched        affine    79.73%     0.6345         0.6389
Single-site (Cam-CAN)             affine    77.70%     0.6439         0.6269
Single-site (UKBB)                affine    81.08%     0.6393         0.6316
Cam-CAN females / UKBB males      affine    98.99%     0.4641         0.8013
Cam-CAN 80/20% / UKBB 20/80%      affine    84.78%     0.5713         0.7125
Cam-CAN train / UKBB test         affine    73.65%     0.6462         0.6245
UKBB train / Cam-CAN test         affine    62.16%     0.6075         0.6769

Table 2: Two-fold cross validation results for sex classification under different data arrangements.


Acknowledgements

This research has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 757173, project MIRA, ERC-2017-STG). UK Biobank data has been accessed under Application Number 12579.


References

  • F. Alfaro-Almagro, M. Jenkinson, N. K. Bangerter, J. L. R. Andersson, L. Griffanti, G. Douaud, S. N. Sotiropoulos, S. Jbabdi, M. Hernandez-Fernandez, E. Vallee, D. Vidaurre, M. Webster, P. McCarthy, C. Rorden, A. Daducci, D. C. Alexander, H. Zhang, I. Dragonu, P. M. Matthews, K. L. Miller, and S. M. Smith (2018) Image processing and quality control for the first 10,000 brain imaging datasets from UK Biobank. NeuroImage 166, pp. 400–424.
  • J. Ashburner (2012) SPM: a history. NeuroImage 62 (2), pp. 791–800.
  • E. Ferrari, P. Bosco, G. Spera, M. E. Fantacci, and A. Retico (2018) Common pitfalls in machine learning applications to multi-center data: tests on the ABIDE I and ABIDE II collections. In Joint Annual Meeting ISMRM-ESMRMB.
  • J. Fortin, N. Cullen, Y. I. Sheline, W. D. Taylor, I. Aselcioglu, P. A. Cook, P. Adams, C. Cooper, M. Fava, P. J. McGrath, et al. (2018) Harmonization of cortical thickness measurements across scanners and sites. NeuroImage 167, pp. 104–120.
  • J. Fortin, D. Parker, B. Tunc, T. Watanabe, M. A. Elliott, K. Ruparel, D. R. Roalf, T. D. Satterthwaite, R. C. Gur, R. E. Gur, et al. (2017) Harmonization of multi-site diffusion tensor imaging data. NeuroImage 161, pp. 149–170.
  • K. J. Friston, J. Ashburner, S. J. Kiebel, T. E. Nichols, and W. D. Penny (Eds.) (2007) Statistical parametric mapping: the analysis of functional brain images. Academic Press.
  • J. E. Iglesias, C. Liu, P. M. Thompson, and Z. Tu (2011) Robust brain extraction across datasets and comparison with publicly available methods. IEEE Transactions on Medical Imaging 30 (9), pp. 1617–1634.
  • K. L. Miller, F. Alfaro-Almagro, N. K. Bangerter, D. L. Thomas, E. Yacoub, J. Xu, A. J. Bartsch, S. Jbabdi, S. N. Sotiropoulos, J. L. R. Andersson, L. Griffanti, G. Douaud, T. W. Okell, P. Weale, I. Dragonu, S. Garratt, S. Hudson, R. Collins, M. Jenkinson, P. M. Matthews, and S. M. Smith (2016) Multimodal population brain imaging in the UK Biobank prospective epidemiological study. Nature Neuroscience.
  • L. G. Nyúl, J. K. Udupa, and X. Zhang (2000) New variants of a method of MRI scale standardization. IEEE Transactions on Medical Imaging 19 (2), pp. 143–150.
  • M. A. Shafto, L. K. Tyler, M. Dixon, J. R. Taylor, J. B. Rowe, R. Cusack, A. J. Calder, W. D. Marslen-Wilson, J. Duncan, T. Dalgleish, et al. (2014) The Cambridge Centre for Ageing and Neuroscience (Cam-CAN) study protocol: a cross-sectional, lifespan, multidisciplinary examination of healthy cognitive ageing. BMC Neurology 14 (1), pp. 204.
  • S. M. Smith and T. E. Nichols (2018) Statistical challenges in “big data” human neuroimaging. Neuron 97 (2), pp. 263–268.
  • C. Sudlow, J. Gallacher, N. Allen, V. Beral, P. Burton, J. Danesh, P. Downey, P. Elliott, J. Green, M. Landray, B. Liu, P. Matthews, G. Ong, J. Pell, A. Silman, A. Young, T. Sprosen, T. Peakman, and R. Collins (2015) UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine 12 (3).
  • J. R. Taylor, N. Williams, R. Cusack, T. Auer, M. A. Shafto, M. Dixon, L. K. Tyler, R. N. Henson, et al. (2017) The Cambridge Centre for Ageing and Neuroscience (Cam-CAN) data repository: structural and functional MRI, MEG, and cognitive data from a cross-sectional adult lifespan sample. NeuroImage 144, pp. 262–269.
  • N. J. Tustison, B. B. Avants, P. A. Cook, Y. Zheng, A. Egan, P. A. Yushkevich, and J. C. Gee (2010) N4ITK: improved N3 bias correction. IEEE Transactions on Medical Imaging 29 (6), pp. 1310–1320.
  • C. Wachinger, B. G. Becker, A. Rieckmann, and S. Pölsterl (2019) Quantifying confounding bias in neuroimaging datasets with causal inference. arXiv preprint arXiv:1907.04102.
  • Y. Zhang, M. Brady, and S. Smith (2001) Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Transactions on Medical Imaging 20 (1), pp. 45–57.