Alzheimer’s disease (AD) is an irreversible brain disorder that progressively affects cognition and behaviour and impairs the ability to perform daily activities. It is the most common form of dementia in older people, affecting about 6% of the population aged over 65, and its incidence increases with age. The initial stage of AD is characterised by memory loss, which is the usual presenting symptom. Memory loss is one constituent of mild cognitive impairment (MCI), which can be an early sign of Alzheimer’s disease. MCI is diagnosed on the basis of complaints of subjective memory loss (preferably corroborated by a close associate or partner of the individual), impairment of memory function, and unimpaired general cognition and behaviour, with no evidence of dementia [1]. MCI does not always progress to dementia or to a diagnosis of Alzheimer’s disease, but those with amnestic MCI, the type of MCI characterised by memory impairment, are more likely to develop dementia than those without this diagnosis. In cases where an individual does develop Alzheimer’s disease, the MCI phase ends with a marked decline in cognitive function, lasting two to five years, in which semantic memory (the recall of facts and general knowledge) and implicit memory (the long-term, nonconscious memory evidenced by priming effects) also become degraded.
Clinical diagnosis of dementia relies on information from a close associate or partner of the individual, and on cognitive and physical examinations. Once dementia is diagnosed it is usually subclassified into Alzheimer’s disease, vascular dementia or Lewy body dementia [2, 3], these three classes making up the majority of cases. Risk factors for Alzheimer’s disease are multifarious, including sociodemographic factors (in particular age), genetic factors (notably ApoE status), and medical history (such as a diagnosis of depression). The cause of Alzheimer’s disease is not fully understood, but plaques containing amyloid β-peptide (Aβ) in brain tissue and neurofibrillary tangles containing tau protein are the primary histological features [4].
1.1 Predicting Alzheimer’s disease
The disease pathology leads to a progressive, irreversible loss of brain function, which suggests that prospective drug therapies should be tested for efficacy as early in the process as possible. There is therefore a demand for predicting, as early as possible, which individuals will develop AD, in order to test drug therapies which might inhibit or prevent tissue damage. Much research effort has been put into the prediction of an AD diagnosis among those who are diagnosed with MCI, in particular using imaging to detect early signs of the disease pathology: a meta-analysis of 32 structural MRI or amyloid PET imaging studies that reported conversion to AD in patients with MCI is given by Seo et al. [5].
This analysis concluded that amyloid PET is a better predictor of progression from MCI to AD than MRI atrophy measures (effect size 1.32 vs 0.77), but that MRI measurement of entorhinal cortex atrophy (effect size 1.26) has a predictive value comparable to that of amyloid PET. Another comparison of biomarker predictivity found that the highest predictive accuracy was achieved by combinations of amyloidosis and neurodegeneration biomarkers [6]. The individual biomarker with the best performance was [18F]-fluorodeoxyglucose positron emission tomography (FDG-PET), which measures temporoparietal hypometabolism.
Cognitive markers have also been widely applied for early detection of Alzheimer’s disease: a review by Gainotti et al. [7] concluded that measures of delayed recall are the best neuropsychological markers of conversion from MCI to AD. Significantly, they also suggest that MCI subjects with deficits in multiple cognitive domains, including memory, may not be the best candidates for clinical trials of disease-modifying drugs: about 50% of this group will convert to AD within 2 years, making their condition less modifiable than that of those at an earlier stage of the disease.
1.2 The TADPOLE challenge
In evaluating a method for predicting Alzheimer’s disease it is important to compare the results with the current state of the art in order to calibrate the accuracy. In the past few years there have been a number of challenges which allow comparison between methods using a common data set and standardised evaluation metrics. The CADDementia challenge [8] compares algorithms for multi-class classification of AD, MCI and controls based on structural MRI data. The Kaggle Neuroimaging challenge (https://www.kaggle.com/c/mci-prediction) [9] is based on the Kaggle machine learning platform and uses data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI, http://adni.loni.usc.edu/), which is one of the most commonly used data sets for studies of Alzheimer’s disease [10]. That challenge involved a four-fold classification into AD, MCI, MCI converters to AD, and controls.
The 2017 TADPOLE grand challenge (https://tadpole.grand-challenge.org/) is currently taking place, with the evaluation to be completed by January 2019. The challenge comprises a three-fold diagnosis classification into AD, MCI and control groups, and the prediction of the ADAS-13 score and normalised brain volume [11]. The TADPOLE challenge has the aim of predicting the onset of Alzheimer’s disease using different modes of measurement, including demographic, physical and cognitive data. In common with many other studies [10], the TADPOLE data set is derived from ADNI. ADNI itself comprises four phases: ADNI-1 (2004), ADNI-GO (2009), ADNI-2 (2011), and ADNI-3 (2016). ADNI-1 registered 200 healthy elderly participants, 400 participants with MCI, and 200 participants with AD, and the subsequent phases continued to add participants. The TADPOLE competition involves predicting future data collected as part of the ADNI-3 phase. The competition organisers provide a leaderboard dataset which is separate from the main competition and which allows prediction methods from different teams to be evaluated. The results presented here are derived from the TADPOLE leaderboard dataset. Since TADPOLE data is based on ADNI, data used in the preparation of this article were obtained from the ADNI database (adni.loni.usc.edu). ADNI is led by Principal Investigator Michael W. Weiner, MD. For up-to-date information see www.adni-info.org.
2.1 Leaderboard data
The leaderboard dataset has a training set LB1 and a set LB2 whose participants subsequently continued from ADNI-1 into ADNI-GO/ADNI-2 to form the leaderboard test set LB4. LB2 is formed of ADNI-1 time points for 110 participants who were not diagnosed with AD at the last ADNI-1 time point. The test set LB4 comprises the data points for those same LB2 participants during their continuing participation in ADNI-GO/ADNI-2. The training set LB1 comprises data from participants who are not represented in LB2. The task is to predict the diagnosis, ADAS-13 score and normalised ventricle volume for the set LB4 using set LB1 and the participant histories recorded in LB2. The results are evaluated by comparison with LB4 data using a variety of metrics. No information from LB4 may be used for model training, but demographic and other details about the participants who contributed to LB4 are available from LB2, and past time-varying data such as imaging and cognitive measurements are also available from LB2. A histogram of time series lengths for LB1 and LB2 is shown in Figure 1.
Features for prediction are selected from the demographic, cognitive and physical data variables in the ADNI/TADPOLE data. The physical data comprise, among other measurements, MRI data (volumes, cortical thickness, surface area), PET (FDG, AV45 and AV1451), DTI (regional means of standard indices) and levels of markers from cerebrospinal fluid (CSF).
2.2 Evaluation set
For the purposes of training we form an evaluation set which is similar to the time series used in the leaderboard evaluation, that is, LB2 from the ADNI-1 phase and the test set LB4 from the ADNI-GO and ADNI-2 phases. We select the evaluation set from LB1 by choosing participants who match those of LB2 and whose ADNI-1 time series length is similar. The post-ADNI-1 phases of this matched evaluation set can be used to assess the prediction accuracy, which should be similar to that of the test set LB4. To create the evaluation set we examine each participant time series in LB2 and find those participants in LB1 who have a matching gender, ApoE status and age (to within 5 years), and whose diagnosis matches at the start and end of the ADNI-1 period. If more than one matching participant is found, we select the one with the closest time series length in the ADNI-1 phase. The demographic characteristics and ApoE status of the participants from set LB2 and the matched evaluation set are shown in Table 1.
| |LB2|Evaluation set|
|Age: min (mean) max|59.9 (75.1) 87.9|57.8 (75.3) 84.8|
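The matching procedure described above can be sketched as follows; this is an illustrative implementation, and the dictionary keys are our own labels, not the actual ADNI column names:

```python
def find_match(target, candidates, max_age_gap=5):
    """Return the LB1 candidate that best matches an LB2 participant.

    `target` and each candidate are dicts with illustrative keys:
    'gender', 'apoe', 'age', 'dx_first', 'dx_last' (diagnosis at the
    start and end of ADNI-1) and 'n_points' (time series length).
    """
    # Keep candidates with matching gender, ApoE status, age within
    # max_age_gap years, and the same start/end ADNI-1 diagnoses.
    eligible = [c for c in candidates
                if c['gender'] == target['gender']
                and c['apoe'] == target['apoe']
                and abs(c['age'] - target['age']) <= max_age_gap
                and c['dx_first'] == target['dx_first']
                and c['dx_last'] == target['dx_last']]
    if not eligible:
        return None
    # Among eligible candidates, pick the closest time-series length.
    return min(eligible, key=lambda c: abs(c['n_points'] - target['n_points']))
```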
2.3 Training
We train the model using the time series from LB1 and LB2. Points from ADNI-GO and ADNI-2 which are used to compute prediction accuracy are not used for training. Training is performed on 90% of the participants in the evaluation set, with accuracy measured on the remaining 10%; this is repeated for 10 independent splits of the whole evaluation set. We train only on time series with at least 4 points, this minimum having been determined during training. The prediction variables are selected manually by optimising the mean accuracy found on the evaluation set.
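The repeated 90/10 splits can be sketched as below; the function and parameter names are our own, and splitting is done at the participant level so that no individual's time points appear in both the training and test portions:

```python
import random

def repeated_splits(participant_ids, n_repeats=10, test_fraction=0.1, seed=0):
    """Yield (train_ids, test_ids) for independent train/test splits
    of the participant identifiers."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    n_test = max(1, round(test_fraction * len(ids)))
    for _ in range(n_repeats):
        rng.shuffle(ids)
        # Slices copy the shuffled list, so each split is independent.
        yield ids[n_test:], ids[:n_test]
```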
2.4 Forecasting method
The purpose of a forecasting method is to predict a patient’s condition at points in the future using demographic, cognitive and physical data variables from time points in the participant’s history. A common approach to automatic prediction is to use time series methods which use weighted combinations of past data points to predict the next data point. Time series models in general encode a mapping from an $n$-dimensional space of past values to the output, where time is not one of the input dimensions. But many time series in the training data are short and the sampling periods are irregular, so much of the information in the training set lies in the mapping from the time delay between measurements to the output rather than in the sequence of input values. (Irregular sampling and missing data can be managed by interpolation or by using appropriate methods such as Gaussian process regression [12], but these approaches entail making assumptions about the distributions.) Another approach is to use an input space formed of the demographic variables $\mathbf{d}$, the last diagnosis $x_t$, and the time $\Delta t$ since the last measurement, and to map vectors in this space to the output variable $y$. Again assuming additive error $\epsilon$, the model is
$$y = f(\mathbf{d}, x_t, \Delta t) + \epsilon.$$
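As an illustration of this input space, the following sketch builds training pairs from one participant's irregularly sampled history, with the delay between two visits as an explicit input feature; the field names and structure are hypothetical, not the actual TADPOLE columns:

```python
def make_pairs(visits, demographics):
    """Build (input, target) training pairs from one participant.

    `visits` is a time-ordered list of (month, diagnosis) tuples with
    possibly irregular spacing; `demographics` is a tuple of static
    covariates. Every ordered pair of visits yields one example whose
    input encodes (d, x_t, delta_t) and whose target is the later
    diagnosis, so irregular sampling needs no interpolation.
    """
    pairs = []
    for i in range(len(visits)):
        for j in range(i + 1, len(visits)):
            t_i, dx_i = visits[i]
            t_j, dx_j = visits[j]
            x = demographics + (dx_i, t_j - t_i)  # (d, x_t, delta_t)
            pairs.append((x, dx_j))               # target y
    return pairs
```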
Decision trees approximate the regression function by partitioning the input (feature) space into a set of rectangles [13, p. 305]. The training algorithm iterates over all the features and selects the feature and split point that give the best partition of the training data; this is repeated until a stopping criterion is met, such as a minimum number of points in a rectangle. The best partition is that which gives the minimum total impurity in the two subsets that are formed. However, in our experience decision trees tend to overfit the training data and perform poorly on new data, so we use an ensemble of trees and poll the individual results as an estimator. Further improvement is seen with the random forest algorithm, in which the candidate features are chosen randomly at each split point [14]. An introduction to the theory of random forests is given in Hastie et al. [13, ch. 16].
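The split-selection step described above can be illustrated with a minimal Gini-impurity search; this is a sketch of the standard CART-style criterion for classification, not the implementation used in our experiments:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a multiset of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Exhaustively search features and thresholds for the split that
    minimises the size-weighted total impurity of the two subsets.

    Returns (feature_index, threshold, weighted_impurity)."""
    n, d = len(X), len(X[0])
    best = (None, None, float('inf'))
    for f in range(d):
        for threshold in sorted({row[f] for row in X}):
            left = [y[i] for i in range(n) if X[i][f] <= threshold]
            right = [y[i] for i in range(n) if X[i][f] > threshold]
            if not left or not right:
                continue  # degenerate split, skip
            total = (len(left) * gini(left) + len(right) * gini(right)) / n
            if total < best[2]:
                best = (f, threshold, total)
    return best
```

A random forest differs from this exhaustive search in that only a random subset of the features is considered at each split, and many such trees are trained on bootstrap samples and polled.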
There are three target outcomes for prediction: 1) the diagnosis, 2) the ADAS-13 score, and 3) VENTS-ICV, the ventricle volume divided by intracranial volume. We first present results from the evaluation set, followed by the test set accuracy.
3.1 Evaluation set
The accuracy for ten independent splits of 10% of the evaluation set, together with the training error on the full evaluation set, is shown in Table 2.
ADAS-13 and VENTS-ICV
The prediction error for ten independent splits of 10% of the evaluation set, together with the training error on the full evaluation set, is shown in Table 3, where the first two columns show the ADAS-13 results and the next two columns show those for the ratio of ventricle volume to intracranial volume.
The predictor variables that were selected during training are shown in Table 4. These were chosen by starting from a base set of variables and adding variables that increase the prediction accuracy. The sets of predictors for diagnosis and ADAS-13 are the same, except that for ADAS-13 the most recent value of the ADAS13 variable and its slope are added.
|Variable|Description|
|TIME_DELAY|Number of months delay|
|DX|Diagnosis (NL, MCI or AD)|
|FUSIFORM_BL|Fusiform volume at baseline|
|MMSE|Mini-mental state examination|
|FAQ|Functional activities questionnaire|
|VENTRICLES|Ventricles volume slope|
|HIPPOCAMPUS|Hippocampus volume slope|
3.2 Test set LB4
The accuracy on the test set LB4 is shown as the highlighted row of Table 6, which reproduces the results from the TADPOLE competition leaderboard table. The leaderboard shows entries in rank order, where the rank is determined by the lowest sum of individual ranks for mAUC, ADAS-13 MAE and VENTS-ICV MAE. For diagnosis, the accuracy is mAUC=0.82 and BCA=0.73, which compares with a mean accuracy on our evaluation set of mAUC=0.82 and BCA=0.71. The similarity of the test and evaluation errors shows that the evaluation set is well matched to the test set and that the model has not been overtrained. For ADAS-13 prediction the test set error is MAE=5.19 and the evaluation set error is MAE=5.56; for VENTS-ICV the test error is MAE=0.0023 and the evaluation error is MAE=0.0021. The confusion matrix for the test set is shown in Table 5.
Competition leaderboard table at 4 May 2018, where each row represents an entry from a competition team and our entry is highlighted. There are three target outcomes for prediction: 1) the diagnosis, 2) the ADAS-13 score, and 3) VENTS-ICV, the ventricle volume divided by intracranial volume. Predictions for diagnosis are presented as relative probabilities for each of the three potential diagnostic categories. For the ADAS-13 score and the normalised ventricle volume we provide 50% confidence intervals, which indicate where, in the limit, 50% of the predictions would lie if an experiment were repeated many times on new data. The rank is determined by the lowest sum of ranks from mAUC, ADAS-13 MAE and VENTS-ICV MAE. The metrics are based on those used in the TADPOLE competition. For diagnosis classification we use the multiclass area under the receiver operating curve (mAUC) [15] and the balanced classification accuracy (BCA), as described in Table 2. The metrics used for ADAS-13 and VENTS-ICV are the mean absolute error (MAE); the weighted error score (WES), which is the absolute error weighted by the inverse of the confidence interval range; and the coverage probability accuracy, defined as $\mathrm{CPA} = |p - 0.5|$, where $p$ is the proportion of measurements falling within the 50% confidence interval.
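The regression metrics above can be sketched as follows; this is a simplified illustration of the definitions, not the official TADPOLE evaluation code:

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def wes(actual, predicted, ci_lower, ci_upper):
    """Weighted error score: absolute errors weighted by the inverse
    of each prediction's 50% confidence-interval width."""
    weights = [1.0 / (u - l) for l, u in zip(ci_lower, ci_upper)]
    num = sum(w * abs(a - p) for w, a, p in zip(weights, actual, predicted))
    return num / sum(weights)

def cpa(actual, ci_lower, ci_upper):
    """Coverage probability accuracy: |p - 0.5|, where p is the
    fraction of actual values inside the 50% confidence interval."""
    inside = sum(l <= a <= u for a, l, u in zip(actual, ci_lower, ci_upper))
    return abs(inside / len(actual) - 0.5)
```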
In selecting participants for clinical trials, a positive PET scan is commonly used as part of the inclusion criteria. However, PET imaging is expensive, so when a positive scan is one of the trial inclusion criteria it is desirable to avoid screening failures; one application of predicting Alzheimer’s disease is therefore to preselect candidates before applying the criteria. Time series collected both in clinic and from studies such as ADNI inevitably have missing data points and are of variable length. The same will be true for clinical data collected from patients, since it is not uncommon for appointments to be missed and for people to withdraw from data collection for various reasons. Most time series prediction methods assume data which is complete and regularly sampled, so the data has to be pre-processed using imputation or interpolation methods to fulfil this assumption. In this paper, rather than using a traditional time series method, we have used a machine learning method to learn the relationship between pairs of time points at different separations. The input vector comprises a summary of the time series history up to that point together with the demographic and non-time-varying factors such as genetic data. This method makes no assumptions about the dynamics of the time series, and it is applicable to data with missing and irregularly sampled points. The results are better than a baseline last-value estimator, and they validate the method as effective.
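For reference, the last-value baseline mentioned above can be sketched as below; the use of None to mark missing observations is an assumption of this illustration:

```python
def last_value_forecast(history):
    """Baseline forecaster: predict the most recent observed value,
    skipping missing entries (None), for every future time point."""
    observed = [v for v in history if v is not None]
    return observed[-1] if observed else None
```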
Data collection and sharing for this project was funded by the Alzheimer’s Disease
Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and
DOD ADNI (Department of Defense award number W81XWH-12-2-0012). A full list of funding
sources for ADNI is provided in the document ‘Alzheimer’s Disease Neuroimaging Initiative (ADNI)
Data Sharing and Publication Policy’ available through adni.loni.usc.edu/.
This work uses the TADPOLE data sets (https://tadpole.grand-challenge.org) constructed by the EuroPOND consortium (http://europond.eu), funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 666992.
The MRC Dementias Platform UK (DPUK) https://www.dementiasplatform.uk/ provided support in the preparation of this paper. DPUK is a multi-million pound public-private partnership, developed and led by the MRC, to accelerate progress in and open up dementias research. The aims of DPUK are early detection, improved treatment and ultimately the prevention of dementias.
- (1) P. J. Nestor, P. Scheltens, J. R. Hodges, Advances in the early detection of Alzheimer’s disease, Nature medicine 10 (7) (2004) S34.
- (2) A. Burns, S. Iliffe, Alzheimer’s disease, BMJ 338 (2009) b158, https://doi.org/10.1136/bmj.b158.
- (3) B. Dubois, H. H. Feldman, C. Jacova, S. T. DeKosky, P. Barberger-Gateau, J. Cummings, A. Delacourte, D. Galasko, S. Gauthier, G. Jicha, et al., Research criteria for the diagnosis of Alzheimer’s disease: revising the NINCDS–ADRDA criteria, The Lancet Neurology 6 (8) (2007) 734–746.
- (4) P. Tiraboschi, L. Hansen, L. Thal, J. Corey-Bloom, The importance of neuritic plaques and tangles to the development and evolution of AD, Neurology 62 (11) (2004) 1984–1989.
- (5) E. H. Seo, W. Y. Park, I. Choo, Structural MRI and Amyloid PET Imaging for Prediction of Conversion to Alzheimer’s Disease in Patients with Mild Cognitive Impairment: A Meta-Analysis, Psychiatry investigation 14 (2) (2017) 205–215.
- (6) A. Prestia, A. Caroli, S. K. Wade, W. M. Van Der Flier, R. Ossenkoppele, B. Van Berckel, F. Barkhof, C. E. Teunissen, A. Wall, S. F. Carter, et al., Prediction of AD dementia by biomarkers following the NIA-AA and IWG diagnostic criteria in MCI patients from three European memory clinics, Alzheimer’s & Dementia 11 (10) (2015) 1191–1201.
- (7) G. Gainotti, D. Quaranta, M. G. Vita, C. Marra, Neuropsychological predictors of conversion from mild cognitive impairment to Alzheimer’s disease, Journal of Alzheimer’s Disease 38 (3) (2014) 481–495.
- (8) E. E. Bron, M. Smits, W. M. Van Der Flier, H. Vrenken, F. Barkhof, P. Scheltens, J. M. Papma, R. M. Steketee, C. M. Orellana, R. Meijboom, et al., Standardized evaluation of algorithms for computer-aided diagnosis of dementia based on structural MRI: the CADDementia challenge, NeuroImage 111 (2015) 562–579.
- (9) A. Sarica, A. Cerasa, A. Quattrone, V. Calhoun, Editorial on Special Issue: Machine learning on MCI (2018).
- (10) M. W. Weiner, D. P. Veitch, P. S. Aisen, L. A. Beckett, N. J. Cairns, R. C. Green, D. Harvey, C. R. Jack, W. Jagust, J. C. Morris, et al., Recent publications from the Alzheimer’s Disease Neuroimaging Initiative: Reviewing progress toward improved AD clinical trials, Alzheimer’s & Dementia.
- (11) R. V. Marinescu, N. P. Oxtoby, A. L. Young, E. E. Bron, A. W. Toga, M. W. Weiner, F. Barkhof, N. C. Fox, S. Klein, D. C. Alexander, EuroPOND Consortium, TADPOLE Challenge: Prediction of Longitudinal Evolution in Alzheimer’s Disease, arXiv preprint arXiv:1805.03909 (2018).
- (12) C. E. Rasmussen, C. K. Williams, Gaussian processes for machine learning, Vol. 1, MIT press Cambridge, 2006.
- (13) T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Vol. 2, Springer, 2009.
- (14) L. Breiman, Random forests, Machine learning 45 (1) (2001) 5–32.
- (15) D. J. Hand, R. J. Till, A simple generalisation of the area under the ROC curve for multiple class classification problems, Machine learning 45 (2) (2001) 171–186.