Magnetic Resonance (MR) data acquired from the same patient but at different acquisition sites often lead to different MR images. This is due to the qualitative nature of the acquisitions which produces weighted images (such as T1w or T2w) that are sensitive to technical choices (hardware, sequence parameters) as well as scanner artifacts. Consequently, pooling images from multi-center MR studies in order to approach a particular clinical or biological question does not guarantee an increase in statistical power because of a parallel increase in non-biological variance. These unwanted variations in image intensities also prevent large dissemination of machine learning tools that are trained on a specific site and may not generalize their model to other image providers(liu_ms-net_2020).
Several solutions have been proposed in the last decade to harmonize data coming from multi-site or multi-scanner MR studies. Their goal is to remove confounding site, scanner or protocol effects, while preserving the biological information contained in the images. Classical post-processing steps such as standardization, global scaling (fortin_harmonization_2017) or intensity histogram matching (shinohara_statistical_2014) have been shown to reduce the influence of site or scanner biases. However, they also tend to remove informative local intensity variations. Statistical techniques, where image intensity and datasets bias are modeled in every voxel, have been proved to be more efficient. Ravel (fortin_removing_2016), ComBat (fortin_harmonization_2017), refined ComBat versions (pomponio_harmonization_2019, beer_longitudinal_2020) or dictionary learning (stjean_harmonization_2020) methods have been successfully used to analyze harmonization impacts on diffusion MRIs or longitudinal structural sequences. However, it can be noted that these techniques need to be adjusted every time new sites or scanners provide images to the database. Moreover, the same clinical individual information (such as patient age, sex, etc.) needs to be available in every center of the database. More recently, deep-learning models such as CycleGAN (zhu_unpaired_2018), Deep-Harmony (dewey_deepharmony:_2019) or Calamity (zuo_information-based_2021)
have shown encouraging results for structural MR image (T1w, T2w or FLAIR) harmonization but have also shown limitations. Briefly, CycleGAN, which consists in two Generative Adversarial Networks (GANs) working together, is restricted to the harmonization of 2 sites, and needs to be fine-tuned for every pair of sites. Deep-Harmony, a U-Net(ronneberger_u-net_2015) harmonization adaptation network, has the disadvantage of requiring traveling subjects (subjects who have been scanned successively at different sites or scanners) for all scanners/sites for its training, a condition barely met in practice even in prospective studies. Similarly to CycleGAN, it is limited to two sites and needs to be fine-tuned. Calamity, an unsupervised deep-learning method, needs two different MR sequences as inputs for every subject and needs fine-tuning when data from new sites are considered. Finally, dinsdale_deep_2020 and guan_multi-site_2021 have proposed to include unlearning modules or domain discriminators directly into their classification networks. As such, they learn how to remove datasets biases during their analyses without reconstructing harmonized MR images. They have shown improvements in brain tissues segmentation and brain disorder classification after harmonization. However, these techniques clearly require to be trained for every new clinical question while the former approaches harmonize data once for all.
We propose in this paper a new type of harmonization method, called ImUnity, based on deep-learning, which extends previous techniques to offer a fast and flexible harmonization solution. ImUnity generates ‘corrected’ MR images that can then be utilized for various population imaging studies. To avoid the need for traveling subjects or multiple MR sequences in the database, our self-supervised Variational AutoEncoder (VAE-GAN) architecture uses for its training multiple slices from the same individual and randomized image contrast transformations. It also unlearns center bias using a confusion module connected to its bottleneck while an optional biological module can ensure that clinical features are preserved in the latent space. Once trained, this architecture should allow data coming from new sites or scanners to be harmonized without the need for fine-tuning. The architecture also allows estimates towards multiple target sites and then, users can choose multiple MR image reconstructions according to the chosen target domain (site or scanner).
To evaluate the efficiency and flexibility of our harmonization tool, we tested the approach using 3 open source databases that contain images from multiple acquisition sites, scanner vendors or strength of magnetic fields, and a large range of patients ages. For most of the experiments, ImUnity was trained using data from only one of the databases and then applied to the other two to evaluate generalisation of the model. Quality of the reconstructed images, capacity of removing site or scanner bias and ability to classify patients were evaluated after data harmonization.
|Travelling subjects||Fine-tuning for new clinical question||Fine-tuning for unseen sites||Max. number of target sites|
|zhu_unpaired_2018 (CycleGAN)||not required||not required||required||N = 2|
|dewey_deepharmony:_2019 (Deep-Harmony)||required||not required||required||N = 2|
|zuo_information-based_2021 (Calamity)||not required||not required||required||N = number of training sites|
|dinsdale_deep_2020, guan_multi-site_2021||not required||required||not required||N number of training sites|
|This study (ImUnity)||not required||not required||not required||N number of training sites|
2 Materials and methods
We used three open-source databases: (1) ABIDE, a multi-center project led by di_martino_autism_2014, which focuses on Autism Spectrum Disorder (ASD). It gathers more than 1,000 autistic patients and controls. For this study, we used T1-weighted scans from 11 different sites and scanners from 3 different constructors (3T scanners at 10 different sites and one 1.5T scanner at one site). Sites presenting data from a large range of ages (from 6 to 47 years, mean age = 12 years) were selected. In total, 621 T1-weighted scans (309 patients and 312 controls) were collected. (2) OASIS (lamontagne_oasis-3_2019) gathers T1-weighted scans from adult subjects who underwent several MR sessions on 4 different scanners from the same site. We used these traveling subjects (n = 769) to validate the ability of our model to perform multi-scanners harmonization.
(3) SRPBS (tanaka_multi-site_2021) is a multi-site database gathering multi-disorder subjects. We used 9 healthy adult traveling subjects to validate harmonization results between the different acquisition sites of the database (6 sites, 12 scanners from 3 different constructors). Note that OASIS and SRPBS images contain healthy adult brain scans while ABIDE mainly include healthy and pathological infant brain scans, leading to large anatomical differences between images in the databases.
For each subject in each database, the brain was extracted using Robex (iglesias_robust_2011) and N4Bias (tustison_n4itk_2010) was used to correct for intensity inhomogeneities. MR images were first co-registered to the publicly available and age specific 152-MNI templates (sanchez_age-specific_2012). Then, White-Stripe normalization (shinohara_statistical_2014) was run to align white matter (WM) peaks between all subjects (each WM peak was aligned to 0.7 after rescaling the whole image between [0:1]).
After visual inspection to detect images with ROBEX defects or other artifacts, we eventually included 545, 1072 and 81 T1-weighted scans from ABIDE, OASIS and SRPBS databases respectively.
2.2 ImUnity’s model
The architecture of our model derives from convolutional VAE-GANs and is described in Figure 1. We adopted adversarial settings to ensure realistic outputs using a classical CNN as discriminator. The generator (here a VAE) learns how to represent input data into a lower dimension latent space (bottleneck). Information is then decoded to generate an output image. Inspired by dinsdale_deep_2020, an unlearning center-bias module is connected to the bottleneck to limit the impact of site or scanner information. A biological preservation module can be inserted to maintain biological information in the latent space representation. Technical details are provided below.
2.2.1 Modified VAE generator
Inspired by zuo_information-based_2021, our generator takes two 2D-structural images as input, randomly taken at two different locations in the 3D-MR stack of images of each subject to consider. The first () image is used by the first CNN to encode the ’anatomical’ information using only convolutional filters to ensure preservation of spatial information. The second image (), different from because randomly taken in another part of the brain, provides the initial ‘contrast’ information. Thus, and have different anatomy (different location in the brain) but similar contrast (same scan). contrast is modified using a gamma function with a random ‘gamma’ parameter sampled uniformly between 0.5 and 1.5 for each new input image. This modified slice is used as input to a second CNN to encode the ‘contrast information’ followed by a dense layer to reduce spatial information. An example of different gamma transformations applied to MR brain scans from the same subject is given in Figure 2. Once encoded, the two independent representations of and are concatenated to give a latent space representation which is decoded to create the output using transposed convolutional filters. Eventually, this output is compared to the reference gamma modified slice . Note that this generator is trained in a self-supervised fashion as it generates its own outputs. It does not require additional information such as scanner, center or biological information.
2.2.2 Site/Scanner-bias unlearning module
To ensure the task of "removing site or scanner bias", a module is directly connected to the encoders’ outputs (latent space representation of inputs). The module can be seen as a domain (site or scanner) discriminator and is trained independently from the encoder to predict the scan’s origin based on the latent space representation. On the other side, the encoder is trained in an adversarial fashion. A confusion loss is used to unlearn domain information. This principle has been introduced in the field of domain adaptation by ganin_domain-adversarial_2016 and has been adapted to medical imaging studies by dinsdale_deep_2020
. Originally, the module was incorporated directly in the model to unlearn datasets bias and to improve predictions. Here, it is used in the bottleneck as a "datasets bias filter", forcing the encoder to learn a domain-invariant data representation. Thus, the generator learns a shared latent space that encodes all information needed to generate harmonized scans. Note that no skip connection is used in the generator architecture, preventing the presence of datasets bias in the final output. The loss function for the site/scanner unlearning module is:
While the confusion loss used in the encoders’ training is :
2.2.3 Biological preservation module
An optional module ensures the "preservation" of biological information. It acts as a classifier of available biological information. For instance, features such as age or presence of diseases can be introduced. Contrary to the unlearning module, the encoder is trained to minimize its loss function. This module is not mandatory, and a fully self-supervised learning model can be adopted if it is turned off. The loss function of the biological preservation module for our particular application using the ABIDE database is:
Here P represents module’s predictions for biological features of interest, N is the sample size while Y is the ground truth vector. Note that in this study the binary cross entropy formulation was used for the loss function because only two features (age and patient status, i.e. ASD) were considered.
Training the model involves several independent steps, due to the adversarial context and the use of the additional modules.
Training the discriminator consists in minimizing binary cross-entropy between its predictions and the labels corresponding to the nature of the inputs (real or fake). Adversarially, the generator learns how to maximize this loss function, forcing the generation of realistic outputs.
Training the site/scanner unlearning module consists in minimizing the categorical cross-entropy (eq. 1) between its predictions and the site-affiliation labels. Adversarially, the generator is trained to minimize the confusion loss (eq. 2). It forces a site and scanner invariant representation of the dataset in the latent space, leading to uniform outputs of the unlearning module.
In our ABIDE experiment, training the biological preservation module consisted in minimizing binary cross-entropy losses associated with each biological feature taken into account (here sex and patient status). Unlike the previous module, the loss was directly integrated in the generator. This ensures the conservation of biological features in the latent space.
Besides previous loss functions involved in training the generator, a loss function is used to ensure a good mapping between input and the generated output (
). Moreover, the use of the Kullback-Leibler divergence
between features distributions and a Gaussian distribution ensures a dense data representation in latent space. The global generator’s loss function to minimize is therefore:
Where factors control the relative contribution of each loss. In our study we used : ; ; ; ; found empirically.
The datasets extracted from the three databases were used to evaluate different aspects of our model. Impact on image quality in multi-site or multi-scanner harmonization was assessed using data from traveling subjects (ground truth) from the OASIS and SRPBS datasets. Ability to remove site information was evaluated using the ABIDE dataset. Finally, benefits of harmonization between data provider centers were assessed using the predictions of autism disorder in children from the ABIDE dataset. To demonstrate the flexibility of ImUnity, all experiments were performed with the same model trained on data coming from the ABIDE database, unless specified. OASIS and SRPBS were then used for the validation parts only. The model was trained on 2D axial slices with at least 1% of brain tissue voxels. Training was run on a Nvidia GeForce 2080 RTX for 2000 epochs using a learning rate ofand Adam optimizer.
2.4.1 Experiment 1 : Harmonization on traveling subjects (OASIS+SRPBS)
We first evaluated the ability of our model to transform images from one domain (site or scanner) to their equivalent in another domain. As SRPBS and OASIS databases contain traveling subjects, ground truth was available to assess ImUnity performances. In practice, one domain (acquisition site or scanner) was first selected as the reference for every subject. Individual scans were co-registered to their equivalent in the reference domain (to avoid variations between acquisitions due to movement). Then, all the images were transformed by the model into the reference domain. During this step, slices to be harmonized (anatomy) were fed to the model alongside the corresponding computed contrast slice from the reference domain. Finally, results obtained after transformation were compared to the ground truth, i.e. images acquired in the reference domain (traveling subject). Visual verification, histograms of image intensities, and the Structural Similarity Index Metric (SSIM, wang_multiscale_2003) were used to assess image likeness. The same model, trained on ABIDE data, was used for every site/scanner of the two other databases to evaluate the ability of ImUnity to generalize to sites never seen before. This experiment also evaluates ImUnity’s versatility, either for the source domain or for the target domain (last two columns of Table 1).
2.4.2 Experiments 2 : Harmonization’s effects on sites classification (ABIDE)
The second experiment evaluated the ability to detect the origin of data before and after harmonization. As no ground truth was available for this experiment, we considered harmonization impacts on classification algorithms. Standard Support Vector Machine (SVM) with a radial basis function kernel was used to classify ABIDE data. The classifier worked on all radiomic features (N=101) extracted using the pyradiomics python API(van_griethuysen_computational_2017). These features aim to represent different aspects of MRI images such as shape, contrast or texture and are known to be sensitive to site effects (orlhac_validation_2019). The most ’correlated features’ with sites affiliations before harmonization were selected for classification using Pearson tests (ran independently for each feature) using as p-value threshold (30 features in total). Accuracy and Area Under Curve (AUC) of the Receiver Operating Characteristic (ROC) curve were used to evaluate the specificity and sensitivity of the site classifier.
2.4.3 Experiments 3 : Harmonization’s effects on autism syndrome disorder prediction (ABIDE)
Similarly to Experiment 2, Experiment 3 evaluated the ability of our classifier to detect patients with ASD from the ABIDE database, before and after harmonization. Here, results were obtained following a 10 fold-cross-validation procedure. The same trained model was used for different number of sites (and for different combinations of sites) included in the ABIDE database.
Experiment 1: Figure 3-A shows the results obtained for one traveling subject from the SRPBS database (2.4.1). Images are shown for one acquisition at site (A) before harmonization (3-A, left), corrected by ImUnity to fit with acquisition at site B (3-A middle) and the corresponding ground truth acquired at site B (3
-A right). One can notice the difference in image contrast between the 2 sites, highlighting the need for image harmonization, as well as the visual similarity between the harmonized image and the ground truth. It is interesting to observe that the anatomical structures of the input contrast reference are not propagated through the model, which explains small anatomical differences (e.g. superior sagittal sinus) between the model estimates and the ground truth. It is also worth noting that although the model was trained on 2D-axial slices, the 3D reconstructions of the estimates are of high quality in each orientation. Figure3-B shows the effects of ImUnity’s harmonization on image intensity distributions for all selected subjects from the SRPBS database. The model was used to harmonize every image to a target site (indicated in red). An alignment of histograms can be clearly observed after harmonization, with both gray and white matter peaks shifted. Changes in intensity distribution of the site of reference are due to pre-processing (see details in Supplementary Material (SM), Figure 6, top row). Images obtained after ImUnity’s harmonization of the OASIS datasets are provided in SM, Figure 7.
|Dataset||OASIS scanner F scanner E||OASIS all scanners scanner E||SRPBS all sites UTO site|
|Raw data||0.871 0.045||0.845 0.059||0.853 0.021|
|zhu_unpaired_2018 CycleGAN||0.873 0.046||-||-|
|zuo_information-based_2021 Calamity||0.884 0.046||-||-|
|ImUnity||0.920 0.024||0.919 0.023||0.907 0.024|
|ImUnity||0.943 0.003||0.943 0.003||0.893 0.025|
|ImUnity||0.824 0.064||0.827 0.061||0.865 0.019|
: Model trained on OASIS database (n=1072); : Model trained on ABIDE database (n=545); : Model trained on SRPBS database (n=81)
Quantitative results obtained with the SSIM metric in all traveling subjects are summarized in Table 2. Both multi-site (SRPBS) and multi-scanner (OASIS) experiments are shown. For the latter, results from the literature are also given for reference. It can be seen that ImUnity increases the structural similarity in all cases and provides better performances compared to other deep learning approaches. Moreover, results from multi-scanner harmonization show that ImUnity performs well independently of the chosen reference domain. The last 2 lines present results obtained after training ImUnity on OASIS (to better match literature protocols) and SRPBS data. These models were used to harmonize OASIS as well as SRPBS data.
Experiment 2: Figure 4 shows ImUnity’s harmonization effects on site classification on the ABIDE datasets (2.4.2) using tSNE (maaten_visualizing_2008), a dimension reduction algorithm, on radiomic features. Before harmonization, the presence of site clusters is clear. Once the data is harmonized using ImUnity, the points are shuffled and the accuracy of the SVM site prediction decreases from 0.70 to 0.38 (before and after harmonization respectively). This confirms the removal of site bias by ImUnity as the classifier is no longer able to correctly separate the sites. Additional results on the influence of the preprocessing step on sites classification are provided in SM Fig. 6.
Experiment 3: Figure 5
shows the capacity of our model to improve ASD prediction from the ABIDE datasets. Here, we used the same trained model to test the influence of different numbers of sites included in the database (from 2 to 11) as well as different combinations of those sites (for example 55 combinations of 2 sites taken among the 11 sites available). In every case, we observed a clear improvement of classification of autistic patients after harmonization as shown by increases in AUC provided by the SVM classifier. We show the results obtained with the best combination of sites as well as average and standard deviation of AUC with all combinations of sites. The preprocessing also has a positive impact on the prediction as shown in SM (Fig.6, bottom row).
We have presented ImUnity, an original harmonization tool for multi-center MRI databases. ImUnity shows high performances in term of quality of the generated harmonized images, as well as clear removal of the idiosyncratic bias attached to site-dependent image acquisition conditions. Moreover, the performed experiments clearly demonstrate ImUnity’s versatility. By training ImUnity’s model on datasets extracted from one database (here ABIDE) and looking at images harmonized from traveling subjects provided by two different databases (here OASIS and SRPBS) , we showed that ImUnity did not require new training phase to generalize to unseen sites or scanners (see Fig. 3). The performances were maintained independently of the site selected as reference (see Table 2). While the model was trained on ABIDE data only, it provided better results than the state-of-the-art methods in terms of image quality (+4%, see Table 2).
The last two rows of Table 2 still present SSIM metrics obtained when training ImUnity on the other 2 datasets (OASIS and SRPBS). As no biological features were available in these databases, the biological module was disabled and the model was trained in a self-supervised way. First, we noted additional improvements for scanner harmonization when the model was trained and applied on the same database (here OASIS, +2.5%). Second, the score obtained for multi-site harmonization (SRPBS) was the highest when trained on ABIDE (N = 545) data (with a slightly better score than with OASIS). It is interesting to observe the impact of the dataset size, the number of site/scanner involved in the training, the use of the biological module and the anatomical differences between datasets (ABIDE mainly contains children data while OASIS and SRPBS focus on adults) on these scores. ABIDE result suggests that anatomical differences could be compensated by a large training dataset presenting more site/scanner variability than OASIS (11 sites for ABIDE vs. 4 sites for OASIS). On the other hand, results from multi-scanners harmonization depict the difficulty of the model trained on SRPBS data to generalize its training to the OASIS data. This indicates an over-fitting effect in this situation, as there was not enough training data (here N=81) and suggests that ImUnity may not be adapted to small sample size scenarios.
In addition to the ability to remove sites or scanners biases, we evaluated the ability to improve subsequent clinical data analysis. The performances of ASD classification obtained on the ABIDE database were improved by ImUnity’s harmonization of the datasets even when a simple classification procedure (classical SVM and radiomic features derived from structural MRI) was used. It is interesting to note that several studies have already tackled the issue of ASD patients classification using the ABIDE database (gao_multisite_2020; sherkatghanad_automated_2019)
. They used either more complex features (e.g. morphological brain networks) or more complex classifiers (Random Forest, Resnet, adapted deep-learning architecture, etc.) than in our study. However, the reported AUC metrics were close to ours (0.67 bygao_multisite_2020 and 0.75 by sherkatghanad_automated_2019). They also noted the difficulty of obtaining better results when more acquisition sites where included in the study. They did not however mention any harmonization procedure. Thus, we may expect an improvement of ImUnity’s performances when introducing more informative radiomic features and choosing a more sophisticated classifier tool.
Because ImUnity is designed to reconstruct images and to create a new harmonized database, it does not need new training for new clinical or biological questions. Beyond classification, new clinical data investigation should be conducted with ABIDE (or other multi-center clinical databases) to have better understanding on the impact of our method on clinical research studies.
Like the majority of deep networks used for medical image analysis, the MR images used as inputs of our network were first pre-processed for intensity normalization, co-registration or brain extraction. Usually, the impact of these transformations is not examined in harmonization studies. In Fig. 6 (SM), we have highlighted the fact that these steps were already able to remove some of the sites and scanners biases with positive impacts on intensity distributions across sites or patients classification. Intrinsically, the use of White-Stripe normalization (shinohara_statistical_2014) forces the alignment of intensity distributions. Yet, a perfect alignment is not the ultimate goal of harmonization as we also seek to preserve informative biological variations which vary independently across sites. Eventually, we observed that the best results were obtained for all experiments after the whole ImUnity process, with better intensity distributions alignments, removal of persistent datasets noises, and most importantly improvement in patients classification results. On the opposite, other experiments (not reported) also showed that the VAE-GAN network alone performed poorer when the pre-processing steps were omitted, suggesting that these steps are necessary to simplify the training process and improve generalization of the results.
In our study, we only worked with anatomical T1-weighted images. We showed that a single type of sequence, combined with computed image transformations with the Gamma function, are sufficient to learn contrast mapping. This greatly facilitates the use of our model because of few data requirements (the origin of each scan is the only pre-required information) and the possibility of self-supervised training. Yet, we believe that this approach is not only dedicated to T1 contrast harmonization and can easily be generalized to any MRI sequences. Presently, the model needs to be fine-tuned in order to harmonize a new medical imaging type. It could however be interesting to investigate its capacity to learn how to harmonize multiple sequences at once. This could be done by mixing sequence types in our training dataset and ensuring the conservation of this information by adding a new conservation module in the bottleneck. It could also be interesting to add other types of artificial contrast transformation for our training in order to account for other types of sites or sequences biases.
We presented ImUnity, an original and effective tool dedicated to MRI harmonization. Our proposed model derives from the VAE-GAN architecture. It ensures realistic outputs and allows removal of idiosyncratic datasets bias and the preservation of biological information. Our results show that the method reaches state-of-the-art results in term of image quality in traveling patients of the OASIS and SRPBS databases and improves autistic patients classification in the ABIDE database. The proposed model is versatile, requiring only one type of MR sequence without the need of matching subjects, can be generalized to sites unseen during the training phase and can be used to harmonize MR images to different reference domains without a new training phase.
6 Compliance with ethical standards
Stenzel Cackowski is supported by MIAI@Grenoble Alpes (ANR 19-P3IA-003). The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.