Large-scale multi-site studies are of great importance in neuroimaging, both for research purposes and in clinical practice. Such studies face several challenges due to hardware- or center-related variability. It is well known that scanner factors such as manufacturer, magnetic field strength and gradient non-linearity influence volume measurements [4, 14] obtained from structural Magnetic Resonance Imaging (MRI). At the image level, these factors are coupled with a high variability of intensities across patients and scanners, which can affect tasks like the segmentation of brain structures. This effect is exemplified in Fig. 1, which shows three T1-weighted MR images of the same patient obtained on different scanners, together with their corresponding segmentations.
The need to address multi-scanner and multi-center data harmonization is evident in the follow-up of Multiple Sclerosis (MS) patients. These patients exhibit an increased rate of brain atrophy compared to healthy subjects, which has been linked to impairment. However, it has been suggested that brain atrophy can only be reliably estimated over periods of at least five years, due to the variability caused by scanner and center factors.
Besides image processing approaches that aim at matching image intensity distributions to provide a more consistent input to the segmentation method [11, 12], recent studies have focused on statistical harmonization of volumetric measurements based on scanner- or center-specific information. Methods of this type generally apply regression techniques to correct the measurements. Linear mixed-effects models using patient- and scanner-specific information as random effects have been explored in several studies. More recently, an algorithm originally devised for genomics was used to extend the same type of model to account for site-specific factors. A data-driven approach based on independent component analysis has also been explored, where correction was performed by selecting independent components related to scanning parameters. However, since these methods rely on scanner- or acquisition-specific information, they do not generalize and need to be adapted for new settings. Additionally, such information can be incomplete, especially in historical data. It would therefore be of interest to use information that is encoded in the images themselves, or that can be extracted from the volume quantification method, to build more robust and adaptive techniques.
To address these issues, in this paper we present a novel statistical harmonization approach based on image descriptors and a machine learning algorithm. We first explore the relations between image-extracted properties and brain volume measurements that we can further exploit for harmonization. We then train a machine learning algorithm based on Automatic Relevance Determination on healthy data to perform volumetric corrections. We validate the method on a set of unseen healthy controls, and finally test it on a test-retest dataset of MS patients.
Our healthy-subject dataset comprises 1996 T1-weighted (T1w) MRI scans. The data is a compilation of several public datasets, such as [10, 3], and some proprietary data. The overall set includes several different centers and scanner types from the major vendors (Siemens, Philips, GE); magnetic field strengths (1.5T or 3T) and T1w sequence types also vary. For most of the data we have information regarding the age and sex of the subject, the scanner type, the magnetic field strength and additional acquisition parameters such as echo time (TE) and repetition time (TR). For building and testing our model, we randomly divided the data into training and test sets.
To further validate the approach, we test it on a dataset containing data from 10 MS patients. Each patient was scanned twice on three different 3T scanners: a Philips Achieva, a Siemens Skyra and a GE Discovery MR450w. An example is depicted in Fig. 1. One of the patients was an extreme case, showing very enlarged ventricles; given that volumetric measurements in such a case are error-prone and considered unreliable, this patient's data was discarded from further analysis.
2.1 Data Pre-processing and Feature Extraction
For each image we compute gray matter (GM) and white matter (WM) volumes using a well-established atlas-based segmentation method. Whole brain (WB) volume is then defined as the sum of the WM and GM volumes.
We are interested in descriptors related to the T1w images that encode information about errors and bias in brain segmentations. Since the quality of a segmentation depends on a good registration to the atlas and is influenced by the contrast and noise present in an image, it is valuable to explore features that convey such information. We extract a total of 16 features of two main types: i) Alignment information regarding the registration of the T1w image to the MNI atlas space, which includes decomposing the affine transformation (rotation angles, scale and shear factors in three directions), and measuring the similarity between the registered images using Normalized Mutual Information (NMI); and ii) Contrast to Noise Ratio (CNR) between tissue types and between different brain structures (e.g., lobes and cerebellum). CNR is given by:
$$\mathrm{CNR}_{A,B} = \frac{\mu_A - \mu_B}{\sqrt{\sigma_A^2 + \sigma_B^2}},$$

where $\mu_X$ represents the mean intensity of tissue or structure $X$ and $\sigma_X^2$ is the variance of the image intensities across this structure. We compute $\mathrm{CNR}_{A,B}$ by taking as $A$ the tissue or structure with the higher average image intensity, so that $\mu_A \geq \mu_B$.
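As a concrete illustration, the CNR computation can be sketched in a few lines of Python (the function name and masking scheme are illustrative assumptions, not the exact implementation used in this work):

```python
import numpy as np

def cnr(img, mask_a, mask_b):
    """Contrast-to-Noise Ratio between two regions of an image.

    `img` is an intensity array; `mask_a` and `mask_b` are boolean masks
    selecting the two tissues/structures. By convention A is taken as the
    region with the higher mean intensity, so the result is non-negative.
    """
    a, b = img[mask_a], img[mask_b]
    if a.mean() < b.mean():          # enforce mu_A >= mu_B
        a, b = b, a
    return (a.mean() - b.mean()) / np.sqrt(a.var() + b.var())
```

The same function applies to tissue pairs (e.g., GM vs. WM) and to structure pairs (e.g., lobes vs. cerebellum), given the corresponding masks.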
The brain volumes and some of the computed descriptors (e.g., CNR and NMI) are known to be age dependent. As such, age is used as a feature at training time, but not at test time. For analysis and comparison, we age-detrend the volumes by subtracting an age-matched median value estimated from the training set; CNR and NMI are corrected by fitting a linear regressor to the data.
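The age-detrending step can be sketched as follows (the bin width, function names and binned-median estimator are illustrative assumptions; the original pipeline may use a different smoother):

```python
import numpy as np

def detrend_by_age_median(ages, values, train_ages, train_values, bin_width=5.0):
    """Subtract an age-matched median estimated from the training set.

    The median curve is estimated in `bin_width`-year age bins
    (bins are assumed to be non-empty in the training data).
    """
    bins = np.arange(train_ages.min(), train_ages.max() + bin_width, bin_width)
    idx = np.clip(np.digitize(ages, bins) - 1, 0, len(bins) - 2)
    medians = np.array([
        np.median(train_values[(train_ages >= lo) & (train_ages < hi)])
        for lo, hi in zip(bins[:-1], bins[1:])
    ])
    return values - medians[idx]

def detrend_linear(ages, values):
    """Remove a linear age trend (used for descriptors like CNR and NMI)."""
    slope, intercept = np.polyfit(ages, values, 1)
    return values - (slope * ages + intercept)
```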
3 The Relevance Vector Machine for data harmonization
To harmonize brain volumes, we subtract correction terms based on estimated variability trends from the original volumes. To determine the variation in the volumetric data that can be explained by the aforementioned image descriptors, we fit a linear model using the extracted features as independent variables. When choosing the model, we take a few important considerations into account: i) we have 16 image-extracted features plus age; ii) some of these relate only to one of the brain volumes; iii) some of the features are correlated with each other; and iv) we want a probabilistic model, to capture uncertainty in our predictions. Since standard generalized linear models do not address all these considerations, we investigate a probabilistic machine learning technique instead.
The Relevance Vector Machine (RVM) is defined within a fully probabilistic framework and includes a mechanism of automatic relevance determination. The model defines a conditional distribution for real-valued input-target pairs $\{(\mathbf{x}_n, t_n)\}_{n=1}^N$ of the form

$$p(t \mid \mathbf{x}) = \mathcal{N}\!\left(t \mid y(\mathbf{x}), \beta^{-1}\right),$$

which specifies a Gaussian distribution over $t$ with mean $y(\mathbf{x})$ and precision (inverse variance) $\beta$. Here $\mathbf{x}_n$ is the vector of extracted features and $t_n$ the corresponding volume measurement for each image in a training dataset of size $N$. The function $y(\mathbf{x})$ is given by a linear combination of basis functions $\phi_n$ with weights $w_n$:

$$y(\mathbf{x}) = \sum_{n=1}^{N} w_n \phi_n(\mathbf{x}),$$

where the basis function $\phi_n(\mathbf{x}) = K(\mathbf{x}, \mathbf{x}_n)$ is a kernel centered around the $n$-th training sample. In order to avoid over-fitting due to the large number of parameters in the model, a zero-mean Gaussian prior probability distribution is defined over the weights $\mathbf{w} = (w_1, \ldots, w_N)^\top$. Moreover, a separate hyperparameter $\alpha_n$ is introduced for each individual weight $w_n$, representing the precision of the corresponding weight:

$$p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{n=1}^{N} \mathcal{N}\!\left(w_n \mid 0, \alpha_n^{-1}\right),$$

where $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_N)^\top$. Using the resulting model, relevance vector learning searches for the hyperparameters $\boldsymbol{\alpha}$ and $\beta$ that maximize the marginal likelihood $p(\mathbf{t} \mid \boldsymbol{\alpha}, \beta)$ of the training data, where $\mathbf{t} = (t_1, \ldots, t_N)^\top$. Defining $\boldsymbol{\Phi}$ as the $N \times N$ design matrix with $\phi_n(\mathbf{x}_m)$ in row $m$ and column $n$, the learning algorithm proceeds by iteratively updating $\boldsymbol{\alpha}$ and $\beta$ as follows:

$$\alpha_n^{\text{new}} = \frac{\gamma_n}{m_n^2}, \qquad \left(\beta^{\text{new}}\right)^{-1} = \frac{\lVert \mathbf{t} - \boldsymbol{\Phi}\mathbf{m} \rVert^2}{N - \sum_n \gamma_n}, \qquad \gamma_n = 1 - \alpha_n \Sigma_{nn},$$

where $\boldsymbol{\Sigma} = \left(\beta \boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \operatorname{diag}(\boldsymbol{\alpha})\right)^{-1}$ and $\mathbf{m} = \beta \boldsymbol{\Sigma} \boldsymbol{\Phi}^\top \mathbf{t}$ are the posterior covariance and mean for the weights, respectively. In practice, during re-estimation many $\alpha_n$'s tend to infinity, which causes the posterior distributions of the corresponding weights to peak around zero. The basis functions associated with these weights do not influence the predictions and can be pruned out, resulting in a sparse model. The remaining training samples with non-zero weights are called relevance vectors.

Once the model is trained, we can take the set of descriptors $\mathbf{x}^\ast$ of an unseen image and predict the corresponding volume from these descriptors alone using the posterior mean $\mathbf{m}$: $\hat{y}(\mathbf{x}^\ast) = \mathbf{m}^\top \boldsymbol{\phi}(\mathbf{x}^\ast)$. Finally, we obtain a corrected volume $v_c$ by subtracting the estimated contribution of the image descriptors from the original volume $v$: $v_c = v - \hat{y}(\mathbf{x}^\ast)$.
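A minimal numpy sketch of this sparse Bayesian learning loop, following the iterative hyperparameter re-estimation just described, might look as follows (function names, the pruning threshold, iteration count and numerical safeguards are illustrative assumptions, not the implementation used in the paper):

```python
import numpy as np

def rvm_fit(Phi, t, n_iter=200, alpha_cap=1e9):
    """Sparse Bayesian regression via iterative alpha/beta re-estimation.

    `Phi` is the N x M design matrix (e.g. kernel evaluations against the
    training samples), `t` the targets. Returns the posterior weight mean,
    hyperparameters, and a mask of the retained basis functions.
    """
    N, M = Phi.shape
    alpha = np.full(M, 1.0)
    beta = 1.0 / max(np.var(t), 1e-12)
    keep = np.arange(M)                               # surviving basis indices
    for _ in range(n_iter):
        P = Phi[:, keep]
        Sigma = np.linalg.inv(beta * P.T @ P + np.diag(alpha[keep]))
        m = beta * Sigma @ P.T @ t                    # posterior mean
        gamma = 1.0 - alpha[keep] * np.diag(Sigma)    # "well-determinedness"
        alpha[keep] = gamma / (m ** 2 + 1e-12)        # alpha re-estimation
        resid = t - P @ m
        beta = max(N - gamma.sum(), 1e-12) / max(resid @ resid, 1e-12)
        keep = keep[alpha[keep] < alpha_cap]          # prune: alpha -> infinity
    P = Phi[:, keep]
    Sigma = np.linalg.inv(beta * P.T @ P + np.diag(alpha[keep]))
    m = beta * Sigma @ P.T @ t
    mask = np.zeros(M, dtype=bool)
    mask[keep] = True
    return m, alpha, beta, mask

def rvm_predict(Phi_star, m, mask):
    """Predictive mean for new design rows, using retained bases only."""
    return Phi_star[:, mask] @ m
```

A corrected volume is then obtained as `v - rvm_predict(...)`, mirroring the subtraction described above.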
4.1 Verification of observable correlations in data
To verify whether there are correlations between image descriptors and the measured volumes, we built a cross-correlation map between these variables (see Fig. 2). Analysing these correlations reveals that image descriptors like NMI and CNR are related to scanner- and acquisition-specific features. The same image descriptors, as well as the affine scale factors, are in turn correlated with the brain volumes.
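A cross-correlation map of this kind can be computed directly from standardized descriptor and volume matrices; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def cross_correlation_map(X, Y):
    """Pearson correlations between columns of X (descriptors) and Y (volumes).

    Returns a (n_descriptors x n_volumes) matrix; entry (i, j) is the
    correlation between descriptor i and volume j across subjects (rows).
    """
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score each column
    Yz = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return Xz.T @ Yz / len(X)
```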
4.2 Harmonization of healthy population data based on RVM
We trained and tested the RVM method described in Section 3 for linear regression on the dataset of healthy subjects (Section 2). We used different kernel types (linear and Gaussian RBF) and searched for the model that best preserved biological information, namely age, while decreasing the scanner- and center-specific variability. The model should thus decrease the global variance in the data while maintaining the original median of the population defined by the training set, since we build on the assumption that this sample contains enough variability to represent the heterogeneity of scanner and center effects. To evaluate the performance, we produced boxplots representing the distribution of the measured volumes for each scanner in the test set with at least ten subjects (see Fig. 3). We first removed the age dependency as estimated from the training set, so that variability due to age is not accounted for. We compare medians and standard deviations, preferring values closer to zero, since they represent a decrease in variability while preserving the global trend. Table 1 presents the median values of these age-detrended volumes. After correction, the distributions from different scanners become more similar. For both GM and WM the linear kernel produced a lower or comparable median and a decreased standard deviation. For WB we compared applying a linear kernel against summing the previously corrected WM and GM volumes, and verified that the latter performed better.
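The per-scanner comparison of medians and standard deviations can be sketched as follows (the function name is illustrative; the ten-subject threshold follows the text):

```python
import numpy as np
from collections import defaultdict

def per_scanner_summary(scanner_ids, volumes, min_subjects=10):
    """Median and standard deviation of (age-detrended) volumes per scanner,
    keeping only scanners with at least `min_subjects` subjects."""
    groups = defaultdict(list)
    for s, v in zip(scanner_ids, volumes):
        groups[s].append(v)
    return {s: (float(np.median(v)), float(np.std(v)))
            for s, v in groups.items() if len(v) >= min_subjects}
```

Medians near zero and small standard deviations on detrended volumes then indicate reduced scanner-specific variability with the global trend preserved.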
4.3 Harmonization of test-retest data
To further validate the method, we applied it to the test-retest dataset of MS patients described in Section 2. The results are summarized in Fig. 4; on the left side, the absolute volumes before and after correction are represented. First, for each tissue type we computed the differences between the volumes from images acquired in the same scanner, which provides a measure of the intra-scanner error (Intra-SE). There is no significant difference between the original and corrected volumes for any of the tissue types (paired t-test). We then computed the difference between the averaged volumes of each scanner type against all the other scanner types. For WM there is a statistically significant difference (paired t-test) between the original and corrected volumes. For GM, the inter-scanner error (Inter-SE) for the original volumes was very small, being comparable to the Intra-SE. After applying the correction, there is no statistically significant difference between these volumes and the original ones, although visually there is an increase in variability, which propagates to the WB.
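The paired comparisons above amount to standard paired t-tests between per-patient error measures before and after harmonization; a sketch using scipy (the helper name is illustrative):

```python
import numpy as np
from scipy import stats

def paired_comparison(errors_orig, errors_corr):
    """Paired t-test between per-patient error measures (e.g. intra- or
    inter-scanner volume differences) before and after harmonization."""
    t_stat, p_value = stats.ttest_rel(errors_orig, errors_corr)
    return t_stat, p_value
```

A small p-value indicates a significant change in the error measure after correction; a large p-value indicates no detectable change.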
5 Conclusions and Future Work
In this work we applied a relevance vector machine approach to estimate the amount of variability in the data that can be explained by variations in image descriptors. We observe a strong dependence of the brain volumes on the atlas-registration NMI metric, which we had not initially expected. NMI measures the quality of the non-rigid registration step between the image and an atlas that our volumetry method requires. The final volumes depend on the quality of this registration, so by correcting for NMI we are in effect correcting for suboptimal segmentation results that derive from a poor registration step.
We demonstrate that a certain degree of data harmonization can be achieved based on image descriptors alone. To our knowledge, this is the first approach that does not rely on scanner-specific information to perform harmonization. We expect the current method to perform less accurately than more tailored methods, but to generalize better; a thorough comparison to such methods still needs to be performed, but is out of the scope of the current paper. This type of solution is interesting for large-scale statistics and could have a positive impact on longitudinal studies. Moreover, the proposed approach can deal with missing scanner/center information, a problem not addressed in previous works and very frequent in practice. Nevertheless, in the test-retest setting the inter-scanner error remains high compared to the measured intra-scanner error, which implies that the method does not yet provide a satisfactory correction for patient-specific use and should be further investigated.
Future steps include exploring additional image descriptors that are independent of the segmentation method used and that can encode the presence of geometric distortions and artifacts. In controlled environments it could be useful to couple general scanner-dependent information with the image descriptors. Additionally, we aim to extend the method to other brain structures of interest and to compare its performance on a controlled dataset against scanner-specific state-of-the-art methods. Finally, it is important to keep in mind that in a cross-sectional setting this type of correction does not replace the need for improved standardization at the image level.
- (2006) The measurement and clinical relevance of brain atrophy in multiple sclerosis. The Lancet Neurology 5(2), pp. 158–170.
- (2016) Intra- and interscanner variability of magnetic resonance imaging based volumetry in multiple sclerosis. NeuroImage 142, pp. 188–197.
- IXI dataset. https://brain-development.org/ixi-dataset/. Last accessed 2019-03-18.
- (2014) Exploration of scanning effects in multi-site structural MRI studies. Journal of Neuroscience Methods 230, pp. 37–50.
- (2015) Handling changes in MRI acquisition parameters in modeling whole brain lesion volume and atrophy data in multiple sclerosis subjects: Comparison of linear mixed-effect models. NeuroImage: Clinical 8, pp. 606–610.
- (2018) Harmonization of cortical thickness measurements across scanners and sites. NeuroImage 167, pp. 104–120.
- (2015) Automatic segmentation and volumetry of multiple sclerosis brain lesions from MR images. NeuroImage: Clinical 8, pp. 367–375.
- (2013) Quantification of multiple-sclerosis-related brain atrophy in two heterogeneous MRI datasets using mixed-effects modeling. NeuroImage: Clinical, pp. 171–179.
- (1992) Bayesian interpolation. Neural Computation 4(3), pp. 415–447.
- (2007) Open Access Series of Imaging Studies (OASIS): Cross-sectional MRI data in young, middle aged, nondemented, and demented older adults. Journal of Cognitive Neuroscience 19(9), pp. 1498–1507.
- (1999) On standardizing the MR image intensity scale. Magnetic Resonance in Medicine 42(6), pp. 1072–1081.
- (2012) Tissue-based MRI intensity standardization: Application to multicentric datasets. International Journal of Biomedical Imaging 2012, pp. 1–11.
- (2009) Age-associated alterations in cortical gray and white matter signal intensity and gray to white matter contrast. NeuroImage 48(1), pp. 21–28.
- (2011) Effect of scanner in longitudinal studies of brain volume changes. Journal of Magnetic Resonance Imaging 34(2), pp. 438–444.
- (2001) Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1(3), pp. 211–244.
- (2009) Intensity standardization simplifies brain MR image segmentation. Computer Vision and Image Understanding 113(10), pp. 1095–1103.