ABCD Neurocognitive Prediction Challenge 2019: Predicting individual fluid intelligence scores from structural MRI using probabilistic segmentation and kernel ridge regression

05/26/2019 ∙ by Agoston Mihalik, et al. ∙ 0

We applied several regression and deep learning methods to predict fluid intelligence scores from T1-weighted MRI scans as part of the ABCD Neurocognitive Prediction Challenge (ABCD-NP-Challenge) 2019. We used voxel intensities and probabilistic tissue-type labels derived from these as features to train the models. The best predictive performance (lowest mean-squared error) came from Kernel Ridge Regression (KRR; λ=10), which produced a mean-squared error of 69.7204 on the validation set and 92.1298 on the test set. This placed our group in the fifth position on the validation leader board and first place on the final (test) leader board.


1 Introduction

Establishing the neurobiological mechanisms underlying intelligence is a key area of research in Neuroscience [1, 2]. General intelligence at a young age is predictive of later educational achievement, occupational attainment, and job performance [3, 4, 5, 6]. Moreover, intelligence in childhood or early adulthood is associated with health outcomes later in life as well as mortality [7, 8, 9, 5, 10]. Thus, understanding the mechanisms of cognitive abilities in children potentially has important implications for society and can be used to enhance such abilities, for example through targeted interventions such as education and the management of environmental risk factors [4, 11].

Neuroimaging can play a key role in advancing our understanding of the neurobiological mechanisms of cognitive ability. Several brain-imaging studies have shown that total brain volume is the strongest brain imaging derived predictor of general intelligence [12, 13, 14] (). To a somewhat lesser degree, regional cortical volume and thickness differences in the frontal, temporal, and parietal lobes have also been linked to intelligence [15, 16, 17, 13, 12]. Converging neuroimaging evidence led to the proposal of the parieto-frontal integration theory [18] whereby a distributed network of brain regions is responsible for the individual variability in cognitive abilities. This theory is also supported by human lesion studies [19, 20].

The ABCD-NP Challenge 2019 asked the question “Can we predict fluid intelligence from T1-weighted MRI?”

We took an exploratory, data-driven approach to answering this question: a hackathon organised by our local research centres, the UCL Centre for Medical Image Computing (CMIC) and the Wellcome/EPSRC Centre for Interventional and Surgical Sciences (WEISS). Our centres aim to address key medical challenges facing 21st century society through world-leading research in medical imaging, medical image analysis, and computer-assisted interventions. Our expertise extends from feature extraction/generation through to image-based modelling [21], machine learning [22, 23, 24], and beyond. The hackathon involved researchers across research groups in our centres, in addition to colleagues from the affiliated Wellcome Centre for Human Neuroimaging and the Department of Clinical and Experimental Epilepsy at UCL. The hackathon took place on an afternoon in February 2019, after which we followed up with regular progress meetings.

In this paper we report our findings for predicting fluid intelligence in 9/10-year-olds from T1-weighted MRI using machine learning regression and deep learning methods (convolutional neural networks — CNNs). Our paper is structured as follows. The next section describes the challenge data and our methods. Section 3 presents our results, which we discuss in section 4 before concluding.

2 Methods

2.1 Data

The ABCD-NP Challenge data consists of pre-processed T1-weighted MRI scans and fluid intelligence scores for children aged 9–10 years. The imaging protocol can be found in [25]. Pre-processing included skull-stripping, noise removal, correction for field inhomogeneities [26, 27], and affine alignment to the SRI24 adult brain atlas [28]. SRI24 segmentations and corresponding volumes were also provided.

The cohort was split into training (), validation (), and test () sets. The training and validation sets also include scores of fluid intelligence, which are measured in the ABCD Study using the NIH Toolbox Neurocognition battery [29]. For the challenge, fluid intelligence was residualised to remove linear dependence upon brain volume, data collection site, age at baseline, sex at birth, race/ethnicity, highest parental education, parental income and parental marital status.
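The residualisation step described above can be sketched as an ordinary least-squares fit whose residuals become the prediction target. This is a minimal illustration, not the challenge organisers' code: the function name `residualise` and the synthetic covariates are hypothetical, and categorical covariates such as site or sex are assumed to have been converted to numeric (dummy-coded) columns beforehand.

```python
import numpy as np

def residualise(scores, covariates):
    """Return scores with the linear effect of covariates (plus an intercept) removed."""
    scores = np.asarray(scores, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    # Design matrix: an intercept column followed by the covariates.
    design = np.column_stack([np.ones(len(scores)), covariates])
    # Ordinary least-squares fit of the scores on the design matrix.
    beta, *_ = np.linalg.lstsq(design, scores, rcond=None)
    # What the covariates cannot explain linearly is the residual score.
    return scores - design @ beta

# Synthetic example (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
covariates = rng.normal(size=(200, 3))
scores = 2.0 * covariates[:, 0] + rng.normal(size=200)
residuals = residualise(scores, covariates)
```

By construction the residuals have zero mean and are uncorrelated with every covariate in the design matrix, which is why any signal shared between those covariates and the images is no longer available to a predictive model.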

2.2 Features derived from the data

We trained the models to predict fluid intelligence both from the provided T1-weighted images (voxel intensity) as well as voxel-wise feature maps generated from these images using a probabilistic segmentation approach. There are many different methods to extract various features from T1-weighted images [30], such as tissue-type labels obtained from probabilistic segmentations. These segmentations can be constructed in a way to capture not only the relative tissue composition in a voxel, but also information about shape differences between individuals. This requires mapping each subject to a common template — a fundamental technique of computational anatomy [31]. Here we constructed such a template from all available T1-weighted MRI scans (), which generated normalised (non-linearly aligned to a common mean) tissue segmentations for each subject.

Figure 1: Template generated from fitting the generative model to all of the subjects in the ABCD population (). Green corresponds to grey matter tissue, blue to white matter, and red to other.

We used a generative model [32] to probabilistically segment each T1-weighted MRI in the challenge data set into three tissue types: grey matter, white matter, and other (see Figure 1). The in-house model used here (available from https://github.com/WCHN/segmentation-model) contains key improvements over the one in [32]: (1) we place a smoothing prior on the template; (2) we obtain better initial values by first working on histogram representations of the images; (3) we normalise over population image intensities in a principled way, within the model; (4) we place a prior on the proportions of each tissue, which is also learned during training. There are two types of normalised segmentations: non-modulated and modulated. Modulated segmentations include the relative shape change when aligning to the common template.

Seven features per voxel were considered: T1-weighted intensity plus our six derived features corresponding to modulated and non-modulated probabilities for each tissue type. All images, including the feature maps from probabilistic segmentations, were spatially smoothed with a Gaussian kernel of 12 mm FWHM [30] and masked to remove voxels outside of the brain.
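The smoothing and masking step can be sketched as follows. A Gaussian's full width at half maximum relates to its standard deviation by FWHM = 2·sqrt(2·ln 2)·σ, so 12 mm corresponds to σ ≈ 5.1 mm. The helper names, the separable NumPy implementation, the zero-padded boundaries, and the 2 mm voxel size in the example are all assumptions made for illustration; the smoothing software actually used is not specified here.

```python
import math
import numpy as np

def fwhm_to_sigma(fwhm_mm, voxel_size_mm=1.0):
    """Convert a Gaussian FWHM in mm to a standard deviation in voxels."""
    return fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0))) / voxel_size_mm

def gaussian_kernel_1d(sigma_vox, truncate=4.0):
    """Normalised 1D Gaussian kernel truncated at `truncate` standard deviations."""
    radius = int(math.ceil(truncate * sigma_vox))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma_vox) ** 2)
    return kernel / kernel.sum()

def smooth_and_mask(volume, mask, fwhm_mm=12.0, voxel_size_mm=1.0):
    """Separable Gaussian smoothing (zero-padded edges) followed by masking."""
    kernel = gaussian_kernel_1d(fwhm_to_sigma(fwhm_mm, voxel_size_mm))
    radius = (len(kernel) - 1) // 2
    smoothed = np.asarray(volume, dtype=float)
    for axis in range(smoothed.ndim):
        # Full convolution cropped back to the input length keeps the shape fixed.
        smoothed = np.apply_along_axis(
            lambda line: np.convolve(line, kernel, mode="full")[radius:radius + len(line)],
            axis, smoothed)
    # Voxels outside the mask are set to zero.
    return np.where(mask, smoothed, 0.0)

# Example: smooth a single impulse on a hypothetical 2 mm grid.
volume = np.zeros((33, 33, 33))
volume[16, 16, 16] = 1.0
mask = np.ones(volume.shape, dtype=bool)
smoothed = smooth_and_mask(volume, mask, fwhm_mm=12.0, voxel_size_mm=2.0)
```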

2.3 Predicting fluid intelligence: Machine Learning Regression

We explored several machine learning regression algorithms of varying complexity, including Multi-Kernel Learning (MKL) [33], Kernel Ridge Regression (KRR) [34], Gaussian Process Regression (GPR) [35] and Relevance Vector Machines [36]. The inputs to these models consisted of different concatenated combinations of our seven voxel-wise features described in section 2.2. Analyses were run in PRoNTo version 3 [37, 22], a software toolbox of pattern recognition techniques for the analysis of neuroimaging data, as well as custom-written code.

In our preliminary analyses, we trained different combinations of regression algorithms and features using 5-fold cross-validation within the training set to select the best combination of algorithm and features, as measured by the lowest cross-validated mean-squared error (MSE). We then retrained the best-performing model using the entire training set and used this trained model to generate predictions of fluid intelligence scores for the validation and test sets. Our best-performing model was KRR using all six voxel-wise derived features (tissue-type probabilities) concatenated into an input feature vector of length

million per individual. We set the regularization hyperparameter to λ = 10 [34], which was optimised through 5-fold nested cross-validation within the training set in preliminary analyses.
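For readers unfamiliar with kernel ridge regression, the following is a minimal sketch of its dual form with a linear kernel and λ = 10. It is not the PRoNTo implementation: the function names are hypothetical, the linear kernel is an assumption, and centring the targets on the training mean is one simple way of handling the intercept. The dual form is attractive when the number of features far exceeds the number of subjects, because the linear system solved has one row per subject instead of one per voxel.

```python
import numpy as np

def krr_fit(X_train, y_train, lam=10.0):
    """Fit dual-form kernel ridge regression with a linear kernel."""
    K = X_train @ X_train.T                      # n_subjects x n_subjects Gram matrix
    y_mean = y_train.mean()                      # assumed intercept handling
    # Dual coefficients: alpha = (K + lam * I)^-1 (y - mean(y))
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train - y_mean)
    return alpha, y_mean

def krr_predict(X_test, X_train, alpha, y_mean):
    """Predict with the kernel between test and training subjects."""
    return X_test @ X_train.T @ alpha + y_mean

# Synthetic example with more features than subjects (hypothetical data).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 200))
y_train = rng.normal(size=50)
X_test = rng.normal(size=(10, 200))
alpha, y_mean = krr_fit(X_train, y_train, lam=10.0)
y_pred = krr_predict(X_test, X_train, alpha, y_mean)
```

With a linear kernel these predictions coincide with those of ordinary (primal) ridge regression using the same λ.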

We investigated the robustness/stability of our KRR model using modified jackknife resampling (80/20 train/test split). Explicitly, we trained the model on a random subsample of 80% of the training set and generated predictions for both the held-out 20% of the training set and the full validation set. We repeated this procedure 1000 times to generate confidence bounds on performance (MSE).
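The resampling procedure can be sketched as a loop over random 80/20 splits that records the held-out MSE each time. The function and the simple ridge model used to exercise it are hypothetical stand-ins; for brevity the sketch scores only the held-out portion, whereas the procedure described above also scored the full validation set on every repetition.

```python
import numpy as np

def repeated_split_mse(X, y, fit, predict, n_repeats=1000, train_frac=0.8, seed=0):
    """Held-out MSE for repeated random train/held-out splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(round(train_frac * n))
    mses = np.empty(n_repeats)
    for i in range(n_repeats):
        perm = rng.permutation(n)
        train, held_out = perm[:n_train], perm[n_train:]
        model = fit(X[train], y[train])
        mses[i] = np.mean((predict(model, X[held_out]) - y[held_out]) ** 2)
    return mses

def ridge_fit(X, y, lam=10.0):
    """Primal ridge regression on mean-centred targets (illustrative model)."""
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ (y - y.mean()))
    return w, y.mean()

def ridge_predict(model, X):
    w, y_mean = model
    return X @ w + y_mean

# Synthetic example (hypothetical data); 200 repetitions keep it quick.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X[:, 0] + rng.normal(size=100)
mses = repeated_split_mse(X, y, ridge_fit, ridge_predict, n_repeats=200)
lower, upper = np.percentile(mses, [2.5, 97.5])   # empirical 95% bounds
```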

2.4 Predicting fluid intelligence: Convolutional Neural Networks

Separately, we explored the use of CNNs. The motivation was to incorporate spatial information that is not explicitly modelled as features. The CNNs were trained directly on the pre-processed T1-weighted images. Similarly to previous work that predicted brain age, Alzheimer’s disease progression, or brain regions from MRI scans [38, 39, 40] we applied various layers of 3D-convolutional kernels with filter size of 3x3x3 voxels on down-sampled images with dimensions of 61x61x61 voxels. We trained and validated multiple neural networks including those in [24]. Our best performing CNN (lowest MSE) consisted of four convolutional layers and three fully-connected layers followed by dropout layers with a probability of

. The first six layers were activated by rectified linear units. The convolutional layers were followed by batch normalization and max-pooling operations. We used the Adam optimizer with an initial learning rate of

and a decay of , and we stopped training at the epoch. To evaluate the network, we randomly sampled 10 subsets of 1870 subjects from the training set, applied them to train 10 CNN models and evaluated performance on the validation set. We report MSE averaged over the ten passes.
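To give a feel for how quickly four convolution and max-pooling stages shrink a 61x61x61 input, the sketch below tracks the size of one spatial dimension through the network. The stride of 1, the pooling window of 2, and the choice between unpadded ("valid") and padded ("same") convolutions are assumptions; these settings are not stated above.

```python
def conv_pool_sizes(size=61, n_blocks=4, kernel=3, padding="valid", pool=2):
    """Spatial size along one axis after each convolution + max-pooling block."""
    sizes = []
    for _ in range(n_blocks):
        if padding == "valid":
            size = size - (kernel - 1)   # stride-1 convolution without padding
        elif padding != "same":
            raise ValueError("padding must be 'valid' or 'same'")
        size = size // pool              # non-overlapping max-pooling (floor)
        sizes.append(size)
    return sizes

valid_sizes = conv_pool_sizes(padding="valid")
same_sizes = conv_pool_sizes(padding="same")
```

Under either assumption only a handful of voxels per axis remain after the fourth block, so the fully-connected layers operate on a very compact representation.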

3 Results

Our best-performing regression model was KRR using our six derived voxel-wise features as input. This produced an MSE of 69.72 on the validation set. By comparison, our best-performing CNN achieved a best MSE of 70.82 on the validation set. However, we observed that the better-performing CNNs on the training set did not generalise well to the validation set, possibly due to over-fitting, or mismatch between training and validation data (as suggested by the considerable difference in variances).

Table 1 shows MSE and Pearson’s correlation coefficient (mean ± std) for training and validation predictions from our top-performing methods. We found that KRR performed the best and we used this model to generate our challenge submission, which resulted in an MSE of 92.13 on the test set.
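The two performance measures reported here can be computed as in the following sketch. The function names and the toy numbers are illustrative only.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean-squared error between observed and predicted scores."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def pearson_r(y_true, y_pred):
    """Pearson's correlation coefficient between observed and predicted scores."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

# Toy example (hypothetical values).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.5, 3.5]
example_mse = mse(y_true, y_pred)
example_r = pearson_r(y_true, y_pred)
```

The two measures capture different things: MSE penalises any deviation from the observed score, whereas the correlation only reflects whether predictions rise and fall with the observed scores, so a model can have an MSE close to the target variance and still show a small positive correlation.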

Method        Training set             Validation set           Test set
              MSE      Correlation     MSE      Correlation     MSE
KRR           77.64    0.1427          69.72    0.0311          92.13
CNN (best)    64.39    0.2542          70.82    0.0157          N/A
Table 1: Model performance for the training and validation sets.

Figure 2: Model robustness for KRR using modified jackknife re-sampling. MSE distributions for 1000 predictions: held-out 20% of the training set (left); full validation set (right); for models trained on the other 80% of the training set. Red line indicates the mean.

Figure 2 shows model robustness results from modified jackknife resampling (1000 repetitions, 80/20 split). Confidence in our predicted MSE on the validation set is within residual IQ point.

Figure 3: Absolute prediction errors against residual fluid intelligence scores for KRR (training set on the left, validation set on the right).

Figure 3 shows the residual fluid intelligence score prediction errors for KRR on the training and validation sets. The V shape of the curves shows that residual scores closer to the mean have lower errors, since the model's predictions stay close to the mean value.
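The V shape follows directly from predictions that barely move away from the mean, as the sketch below illustrates. The synthetic scores, their spread, and the amount by which the hypothetical predictor tracks the true score are all made-up values chosen only to mimic a weak model.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(0.0, 9.0, size=1000)            # hypothetical residual scores
centred = scores - scores.mean()
# A weak predictor: mostly the mean, plus a faint dependence on the true score.
predictions = scores.mean() + 0.05 * centred + rng.normal(0.0, 0.5, size=1000)
abs_errors = np.abs(predictions - scores)
# The absolute error is almost exactly the distance of the score from the mean.
v_shape_corr = float(np.corrcoef(np.abs(centred), abs_errors)[0, 1])
```

Plotting `abs_errors` against `scores` would give two straight arms meeting near the mean, which is the pattern described for Figure 3.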

4 Discussion

We found that predicting residual fluid intelligence from structural MRI images is challenging. The correlation between predicted and actual intelligence scores was low for all methods we tested (). This contrasts with previous studies for predicting (non-residualised) fluid intelligence, which have demonstrated that both total brain volume and regional cortical volume/thickness differences are relatively strong predictors () [13, 14, 12].

The lower predictive performance we observed might be influenced by the residualisation, which prevents modelling of covariance between the residualisation factors and the image-based features. Moreover, there is evidence that including variables in the residualisation procedure that are correlated with the regression targets/labels is likely to remove important variability in the data leading to predictive models with low performance [41].

We note that previous studies used small sample sizes (of order 10–100), whilst the ABCD-NP Challenge dataset is very large (of order 1000–10000). Whereas subject recruitment for small samples tends to be well controlled, resulting in homogeneous sample characteristics, large samples are more heterogeneous by nature, which makes predictive models more challenging to build. Accordingly, a recent study has demonstrated that classification accuracy tends to be lower for larger sample sizes [42].

Our image-based features were voxel-wise probabilistic tissue-type labels: grey matter, white matter, and other. Beyond tissue-type labels, there might be value in investigating other features generated by the generative segmentation model. One such feature is scalar momentum, which has been shown to be predictive for a range of different problems [30].

Finally, we mention a possible limitation of our approach. In recent years, it has become standard practice to create study-specific group templates in neuroimaging [31], especially in Voxel-Based Morphometry (VBM) analyses. Originally, this approach was proposed for group analysis using mass univariate statistics (e.g., statistical parametric mapping). However, one needs to exercise caution when applying such an approach in machine learning, as it might lead to slightly optimistic predictions by creating dependence across the overall dataset. In order to avoid this potential issue, one would need to create templates based only on the training set. This might be computationally challenging when cross-validation strategies are used. To the best of our knowledge, no studies have investigated whether study-specific templates indeed result in slightly inflated predictions, and it remains an interesting question for future work.
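The safeguard described above, building the template from training subjects only, can be sketched with a deliberately simplified stand-in: here the "template" is just the voxel-wise mean of the training images in each fold, and the "features" are each image's deviation from it. Real template construction involves non-linear registration and is far more expensive; the function names and this mean-image simplification are assumptions made purely to show where the train/test boundary sits.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Split subject indices into k roughly equal, disjoint folds."""
    perm = np.random.default_rng(seed).permutation(n)
    return np.array_split(perm, k)

def foldwise_features(images, k=5, seed=0):
    """Yield per-fold features using a template estimated from training data only."""
    folds = kfold_indices(len(images), k, seed)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Stand-in for template estimation: uses training subjects only.
        template = images[train_idx].mean(axis=0)
        yield train_idx, test_idx, images[train_idx] - template, images[test_idx] - template

# Synthetic example: 20 hypothetical subjects with tiny 4x4x4 "images".
images = np.random.default_rng(2).normal(size=(20, 4, 4, 4))
splits = list(foldwise_features(images, k=5))
```

The cost noted above is visible even here: the template is re-estimated once per fold, which for a registration-based template would multiply the preprocessing time by the number of folds.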

5 Conclusion

Our paper presents the winning method for the ABCD Neurocognitive Prediction Challenge 2019. We found that kernel ridge regression outperformed more complex models, such as convolutional neural networks, when predicting residual fluid intelligence scores for the challenge dataset using our custom tissue-type features derived from the preprocessed T1-weighted MRI. The correlation between the predicted and actual scores is very low (0.0311 for KRR on the validation set), implying that the association between structural images and residualised fluid intelligence scores is weak. It may be that structural images contain very little information on residualised fluid intelligence, but further study is warranted.

References