Alzheimer’s disease (AD), and dementia in general, is a key challenge for 21st-century healthcare. The statistics are sobering (winblad2016defeating): in 2015, 47 million people worldwide suffered from dementia, of which AD is the most common cause; dementia cost $818 billion worldwide, which is more than 1% of the aggregate global gross domestic product (GDP); AD might contribute to as many deaths as does heart disease or cancer. There are no available treatments that can cure or even slow the progression of AD – all clinical trials into putative treatments have failed to prove a disease-modifying effect. One key reason for these failures is the difficulty in identifying a group of patients at early stages of the disease, where treatments are most likely to be effective.
While early and accurate diagnosis of dementia can be challenging, this can be aided by quantitative biomarker measurements taken from magnetic resonance imaging (MRI), positron emission tomography (PET), and cerebro-spinal fluid (CSF) samples extracted from lumbar puncture. It has been hypothesized for AD (jack2010hypothetical; jack2013update; aisen2010clinical; frisoni2010clinical) that all these biomarkers become abnormal at different intervals before symptom onset, suggesting that together they can be used for accurate prediction of onset and overall disease progression in individuals. In particular, some of the early biomarkers become abnormal decades before symptom onset, and can thus facilitate early diagnosis.
Several approaches for predicting AD-related target variables (e.g. clinical diagnosis, cognitive/imaging biomarkers) have been proposed which leverage multimodal biomarker data available in AD. Traditional longitudinal approaches based on statistical regression model the relationship of the target variables with other known variables. Examples include regression of the target variables against clinical diagnosis (scahill2002mapping), cognitive test scores (yang2011quantifying; sabuncu2011dynamics), rate of cognitive decline (doody2010predicting), and retrospectively staging subjects by time to conversion between diagnoses (guerrero2016instantiated).
Another approach involves supervised machine learning techniques such as support vector machines, random forests, and artificial neural networks, which use pattern recognition to learn the relationship between the values of a set of predictors (biomarkers) and their labels (diagnoses). These approaches have been used to discriminate AD patients from cognitively normal individuals (kloppel2008automatic; zhang2011multimodal), and for discriminating at-risk individuals who convert to AD in a certain time frame from those who do not (young2013accurate; mattila2011disease).
The emerging approach of disease progression modelling aims to reconstruct biomarker trajectories or other disease signatures across the disease progression timeline, without relying on clinical diagnoses or estimates of time to symptom onset. Examples include models built on a set of scalar biomarkers to produce discrete (fonteijn2012event; young2014data) or continuous (jedynak2012computational; donohue2014estimating; villemagne2013amyloid) biomarker trajectories; richer but less comprehensive models that leverage structure in data such as MR images (durrleman2013toward; lorenzi2015disentangling; bilgel2016multivariate); and models of disease mechanisms (seeley2009neurodegenerative; zhou2012predicting; raj2012network; iturria2016early).
These models have shown promise for predicting AD biomarker progression when using existing test data, but few have been tested on truly unseen future data. Moreover, different investigators test these models on different datasets (including subsets of a single dataset) and use different processing pipelines. Community challenges have proved effective, in the medical image analysis field and beyond, for providing unbiased comparative evaluations of algorithms and tools designed for a particular task. Previous challenges that focussed on prediction of AD progression include the CADDementia challenge (bron2015standardized), which aimed to predict clinical diagnosis from MRI scans. A similar challenge, the “International challenge for automated prediction of MCI from MRI data” (sarica2018machine), asked participants to predict diagnosis and conversion status from extracted MRI features of subjects from the ADNI study (weiner2017recent). Yet another challenge, The Alzheimer’s Disease Big Data DREAM Challenge (allen2016crowdsourced), asked participants to predict cognitive decline from genetic and MRI data.
The Alzheimer’s Disease Prediction Of Longitudinal Evolution (TADPOLE) Challenge aims to identify the data, features and approaches that are the most predictive of AD progression. In contrast to previous challenges, our motivation is to improve future clinical trials through identification of patients most likely to benefit from an effective treatment, i.e., those at early stages of disease who are likely to progress over the short-to-medium term (1-5 years). Identifying such subjects reliably helps cohort selection by focussing on groups that highlight positive treatment effects. The challenge thus focuses on forecasting three key features: clinical status, cognitive decline, and neurodegeneration (brain atrophy), over a five-year timescale. It uses “rollover” subjects from the ADNI study for whom a history of measurements is available, and who are expected to continue in the study, providing future measurements for testing. Since the test data does not exist at the time of forecast submissions, the challenge provides a completely unbiased basis for performance comparison. TADPOLE goes beyond previous challenges by drawing on a vast set of multimodal measurements from ADNI which support prediction of AD progression.
2 Competition Design
The aim of TADPOLE is to predict future outcome measurements of subjects at risk of AD who are enrolled in the ADNI study. A history of informative measurements from ADNI (imaging, psychology, demographics, genetics, etc.) from each individual is available to inform forecasts. TADPOLE participants are required to predict future measurements from these individuals and submit their predictions before a given submission deadline. Evaluation of these forecasts occurs post-deadline, after the measurements have been acquired. A diagram of the TADPOLE flow is shown in Fig 1.
Since we do not know the exact time of future data acquisitions for any given individual, TADPOLE challenge participants are required to make, for every individual, month-by-month forecasts of three key biomarkers: (1) clinical diagnosis which can be either cognitively normal (CN), mild cognitive impairment (MCI) or probable Alzheimer’s disease (AD); (2) ADAS-Cog13 (ADAS13) score; and (3) ventricle volume (divided by intra-cranial volume). Evaluation is performed using forecasts at the months that correspond to data acquisition. TADPOLE forecasts are required to be probabilistic and some evaluation metrics will account for forecast probabilities provided by participants. Methods or algorithms that do not produce probabilistic estimates can still be used, by setting binary probabilities (zero or one) and default confidence intervals.
Participants are required to submit forecasts in a standardised format (see Table 1). For clinical status, relative likelihoods of each option (CN, MCI, and AD) for each individual should be provided. These are normalised at evaluation time; negative likelihoods are set to zero. For ADAS13 and ventricle volume, participants need to provide a best-guess value as well as a 50% confidence interval for each individual. This 50% confidence interval (as opposed to the more standard 95%) was chosen to provide a more symmetric and less noisy evaluation of over- and under-estimation of the confidence interval, because similar sample sizes of data fall inside and outside the interval.
| RID | Forecast Month | Forecast Date | CN relative probability | MCI relative probability | AD relative probability | ADAS | ADAS 50% CI lower | ADAS 50% CI upper | Ventricles | Ventricles 50% CI lower | Ventricles 50% CI upper |
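The normalisation of clinical-status likelihoods described above can be illustrated with a short Python sketch. It is not the organisers' evaluation code: the function name is hypothetical, and the fallback to a uniform distribution when all three likelihoods are zero is an assumption, since the text does not specify that case.

```python
def normalise_likelihoods(cn, mci, ad):
    """Clip negative relative likelihoods to zero, then rescale to sum to one."""
    clipped = [max(0.0, float(v)) for v in (cn, mci, ad)]
    total = sum(clipped)
    if total == 0.0:
        # Assumption: the all-zero case is not covered by the text;
        # this sketch falls back to a uniform distribution.
        return [1.0 / 3.0] * 3
    return [v / total for v in clipped]


# Relative likelihoods need not sum to one when submitted.
print(normalise_likelihoods(2, 1, 1))
# A negative likelihood is treated as zero before normalising.
print(normalise_likelihoods(-1, 1, 3))
# A non-probabilistic method can submit binary values (zero or one).
print(normalise_likelihoods(0, 1, 0))
```

A hard classifier therefore maps onto the same format by putting all of its weight on a single class.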
We provide participants with a standard ADNI-derived dataset (available via the Laboratory Of NeuroImaging: LONI) which they can use to train their algorithms, removing the need to pre-process the ADNI data themselves or merge different spreadsheets. However, participants are allowed to use a custom training set, by adding any other ADNI data or data from other studies. The software code used to generate the standard dataset is openly available in a GitHub repository (https://github.com/noxtoby/TADPOLE) and on the ADNI website, packaged with the standard dataset in the LONI ADNI database.
4.1 ADNI data
Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies and non-profit organizations, as a $60 million, 5-year public-private partnership. The initial goal of ADNI was to recruit 800 subjects but ADNI has been followed by ADNI-GO and ADNI-2. To date these three protocols have recruited over 1500 adults, ages 55 to 90, to participate in the research, consisting of cognitively normal older individuals, people with early or late MCI, and people with early AD. The general ADNI inclusion criteria have been described in petersen2010alzheimer.
The data we used from ADNI consists of: (1) CSF markers of amyloid-beta and tau deposition; (2) various imaging modalities such as magnetic resonance imaging (MRI), positron emission tomography (PET) using several tracers: Fluorodeoxyglucose (FDG, hypometabolism), AV45 (amyloid), AV1451 (tau), as well as diffusion tensor imaging (DTI); (3) cognitive assessments acquired in the presence of a clinical expert; (4) genetic information such as apolipoprotein E4 (APOE4) status extracted from DNA samples; and (5) general demographic information. Extracted features from this data were merged together into a final spreadsheet and made available on the LONI ADNI website.
4.2 Image pre-processing
The imaging data has been pre-processed with standard ADNI pipelines. For MRI scans, this included correction for gradient non-linearity, B1 non-uniformity correction and peak sharpening (see MRI analysis on the ADNI website: http://adni.loni.usc.edu/methods/mri-analysis/mri-pre-processing). Meaningful regional features such as volume and cortical thickness were extracted using the Freesurfer cross-sectional and longitudinal pipelines (reuter2012within). Each PET image (FDG, AV45, AV1451), which consists of a series of dynamic frames, had its frames co-registered, averaged across the dynamic range, standardised with respect to the orientation and voxel size, and smoothed to produce a uniform resolution of 8 mm full width at half maximum (FWHM) (see PET analysis on the ADNI website: http://adni.loni.usc.edu/methods/pet-analysis/pre-processing). Standardised uptake value ratio (SUVR) measures for relevant regions-of-interest were extracted (see jagust2010alzheimer) after registering the PET images to corresponding MR images using the SPM5 software (ashburner2009computational). DTI scans were corrected for head motion and eddy-current distortion, skull-stripped, EPI-corrected, and finally aligned to the T1 scans using the pipeline from nir2013effectiveness. Diffusion tensor summary measures were estimated based on the Eve white-matter atlas by oishi2009atlas.
5 TADPOLE Datasets
The TADPOLE Challenge involves three kinds of data sets: (a) a training data set, which is a collection of measurements with associated outcomes that can be used to fit models or train algorithms; (b) a prediction data set, which contains only baseline measurements (possibly longitudinal), without associated outcomes — this is the data that algorithms, models, or experts use as input to make their forecasts of later patient status and outcome; and (c) a test data set, which contains the patient outcomes against which we will evaluate forecasts — in TADPOLE, this data did not exist at the time of submitting forecasts.
In order to evaluate the effect of different methodological choices, we prepared three “standard” data sets for training and prediction:
D1: The TADPOLE standard training set draws on longitudinal data from the entire ADNI history. The data set contains a set of measurements for every individual that has provided data to ADNI in at least two separate visits (different dates) across three phases of the study: ADNI1, ADNI GO, and ADNI2.
D2: The TADPOLE longitudinal prediction set contains as much available data as we could gather from the ADNI rollover individuals for whom challenge participants are asked to provide forecasts. D2 includes all available time-points for these individuals.
D3: The TADPOLE cross-sectional prediction set contains a single (most recent) time point and a limited set of variables from each rollover individual in D2. Although we expect worse forecasts from this data set than D2, D3 represents the information typically available when selecting a cohort for a clinical trial.
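The idea behind a cross-sectional prediction set such as D3 (one most recent time point per individual, restricted to a limited set of variables) can be sketched in Python as follows. This is an illustration only and not the script used to build D3; the function name, the `EXAMDATE` key and the toy values are assumptions made for the example.

```python
from datetime import date


def latest_visit_per_subject(visits, keep_fields):
    """Keep the most recent visit per subject, restricted to selected fields.

    visits: list of dicts with keys 'RID' (subject ID), 'EXAMDATE' (hypothetical
    visit-date key) and any number of measurement fields.
    """
    latest = {}
    for visit in visits:
        rid = visit["RID"]
        if rid not in latest or visit["EXAMDATE"] > latest[rid]["EXAMDATE"]:
            latest[rid] = visit
    columns = ("RID", "EXAMDATE", *keep_fields)
    rows = sorted(latest.values(), key=lambda row: row["RID"])
    # Fields missing from a visit are reported as None rather than dropped.
    return [{col: row.get(col) for col in columns} for row in rows]


# Toy longitudinal data (made-up values) for two subjects.
toy_visits = [
    {"RID": 1, "EXAMDATE": date(2015, 1, 10), "ADAS13": 10.0, "Ventricles": 0.021},
    {"RID": 1, "EXAMDATE": date(2016, 1, 12), "ADAS13": 12.0, "Ventricles": 0.023},
    {"RID": 2, "EXAMDATE": date(2015, 6, 1), "ADAS13": 25.0},
]
cross_sectional = latest_visit_per_subject(toy_visits, ["ADAS13"])
print(cross_sectional)
```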
The forecasts will be evaluated on future data (D4 – test set) from ADNI3 rollovers, acquired after the challenge submission deadline. In addition to the three standard datasets (D1, D2 and D3), challenge participants are allowed to use any other data sets that might serve as useful additional training data.
Fig. 2 shows a diagram highlighting the nested structure of datasets D1–D3. Table 2 shows the proportion of biomarker data available in each dataset. There is a considerable number of entries with missing data, especially for some biomarkers such as tau imaging (AV1451). We also estimated the expected number of subjects and available data for D4, using information from the ADNI3 procedures and using rollovers from previous ADNI studies (Table 2, right-most column); see Appendix A for more information on D4 estimates. Based on our estimates, we believe the size of D4 (around 330 subjects, 1 visit/subject) should be enough for a reliable evaluation of TADPOLE submissions.
| | D1 | D2 | D3 | D4 |
| Nr. of subjects | 1667 | 896 | 896 | 330 |
| Visits per subject | | | | |
| Cognitive tests (%) | 70 | 68 | 84 | 62 |
There are two kinds of submissions that challenge participants can make. A simple entry requires a minimal forecast and a description of methods; it makes participants eligible for the prizes but not co-authorship on the scientific paper documenting the results. A simple entry can use any training data or prediction sets and must forecast at least one of the target outcome variables (clinical status, ADAS13 score, or ventricle volume). A full entry entitles participants to consideration as co-authors on the publication documenting the results. Such a full entry requires a complete forecast for all three outcome variables on all subjects from the D2 prediction set, along with a description of the methods. Each individual participant is limited to a maximum of three submissions. This restriction has been introduced to avoid the risk of participants “tuning” their method on the test set by submitting multiple predictions for a range of algorithm settings. Although not required for a full entry, participants are strongly encouraged to submit predictions also for D3.
Prizes are awarded to the best entries regardless of the choice of training sets (D1/custom) and prediction sets (D2/D3). However, the additional submissions support the key scientific aims of the challenge by allowing us to separate the influence of the choice of training data, post-processing pipelines, and modelling techniques or prediction algorithms. The target variables used for evaluation, in particular ventricle volume, will use the same post-processing pipeline as the standard data sets D1-D3.
Beyond the standard training dataset (D1), participants can include additional forecasts from “custom” (i.e. constructed by the participant) training data or custom post-processing of the raw data from subjects in the standard training set. The same applies to the prediction sets D2 and D3, which can be customised by the participants if desired, e.g. a prediction set with different features from the same individuals as in D2 and D3. Table 3 shows the twelve possible combinations of subject sets, processing and prediction sets, from which a full-entry submission must contain at least one of the first four (ID 1–4).
| ID | Training set | Prediction set |
7 Forecast Evaluation
7.1 Clinical Status Prediction
For evaluation of clinical status predictions, we will use similar metrics to those that proved effective in the CADDementia challenge (bron2015standardized): (i) the multiclass area under the receiver operating characteristic curve (mAUC); and (ii) the overall balanced classification accuracy (BCA). The mAUC is independent of the group sizes and gives an overall measure of classification ability that accounts for relative likelihoods assigned to each class. The simpler BCA is also independent of group sizes, but does not exploit the probabilistic nature of the forecasts.
7.1.1 Multiclass Area Under the Receiver Operating Characteristic (ROC) Curve
The multiclass Area Under the ROC Curve (mAUC) is a simple generalisation of the area under the ROC curve applicable to problems with more than two classes (hand2001simple). The AUC for classification of a class $c_i$ against another class $c_j$ is:
\[ \hat{A}(c_i \mid c_j) = \frac{S_i - n_i (n_i + 1)/2}{n_i n_j} \]
where $n_i$ and $n_j$ are the number of points belonging to classes $c_i$ and $c_j$, respectively; while $S_i$ is the sum of the ranks of the class $c_i$ test points after ranking all the class $c_i$ and $c_j$ data points in increasing likelihood of belonging to class $c_i$. We further define the average AUC for classes $c_i$ and $c_j$ as $\hat{A}(c_i, c_j) = \frac{1}{2}\left[\hat{A}(c_i \mid c_j) + \hat{A}(c_j \mid c_i)\right]$. The overall mAUC is then obtained by averaging over all pairs of classes:
\[ \mathrm{mAUC} = \frac{2}{L(L-1)} \sum_{i=2}^{L} \sum_{j=1}^{i-1} \hat{A}(c_i, c_j) \]
where $L$ is the number of classes. The class probabilities that go into the calculation of $S_i$ in the first equation are $p_{CN}$, $p_{MCI}$ and $p_{AD}$, which are derived from the likelihoods of each ADNI subject being assigned to each diagnostic class, by normalising to have unity sum.
7.1.2 Balanced Classification Accuracy
The Balanced Classification Accuracy (see brodersen2010balanced) is an extension of the classification accuracy measure that accounts for the imbalance in the numbers of datapoints belonging to each class. However, the measure is not probabilistic, so in TADPOLE the data points need to be assigned a hard classification to the class (CN, MCI, or AD) with the highest likelihood. The balanced accuracy for class $i$ is then:
\[ \mathrm{BCA}_i = \frac{1}{2}\left[\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right] \]
where TP, FP, TN, FN represent the number of true positives, false positives, true negatives and false negatives for classification as class $i$. In this case, true positives are data points with true label $i$ and correctly classified as such, while the false negatives are the data points with true label $i$ and incorrectly classified to a different class $j \neq i$. True negatives and false positives are defined similarly. The overall BCA is given by the mean of all the balanced accuracies for every class.
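As a minimal Python sketch of this metric (not the official evaluation code), the function below assigns each data point to its highest-likelihood class, computes the mean of sensitivity and specificity for each class, and averages over classes. The function name is hypothetical, and resolving ties in favour of the class listed first is an assumption of the sketch.

```python
def balanced_classification_accuracy(true_labels, probs, classes):
    """Mean over classes of 0.5 * (sensitivity + specificity).

    Every class needs at least one true member and one non-member,
    otherwise the corresponding ratio is undefined.
    """
    # Hard classification: highest likelihood wins (first listed class on ties).
    predicted = [max(classes, key=lambda c: p[c]) for p in probs]
    per_class = []
    for c in classes:
        pairs = list(zip(true_labels, predicted))
        tp = sum(1 for t, y in pairs if t == c and y == c)
        fn = sum(1 for t, y in pairs if t == c and y != c)
        tn = sum(1 for t, y in pairs if t != c and y != c)
        fp = sum(1 for t, y in pairs if t != c and y == c)
        per_class.append(0.5 * (tp / (tp + fn) + tn / (tn + fp)))
    return sum(per_class) / len(per_class)


classes = ["CN", "MCI", "AD"]
true_labels = ["CN", "CN", "MCI", "AD"]
probs = [
    {"CN": 0.7, "MCI": 0.2, "AD": 0.1},  # classified CN (correct)
    {"CN": 0.3, "MCI": 0.6, "AD": 0.1},  # classified MCI (wrong)
    {"CN": 0.2, "MCI": 0.5, "AD": 0.3},  # classified MCI (correct)
    {"CN": 0.1, "MCI": 0.2, "AD": 0.7},  # classified AD (correct)
]
bca = balanced_classification_accuracy(true_labels, probs, classes)
print(round(bca, 4))
```

For this toy data the per-class values are 0.75 (CN), 5/6 (MCI) and 1 (AD), so the overall BCA is their mean.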
7.2 Continuous Feature Predictions
For ADAS13 and ventricle volume, we will use three metrics: mean absolute error (MAE), weighted error score (WES) and coverage probability accuracy (CPA). The MAE focuses purely on accuracy of the best-guess prediction ignoring the confidence interval, whereas the WES incorporates confidence estimates into the error score. The CPA provides an assessment of the accuracy of the confidence estimates, irrespective of the best-guess prediction accuracy.
7.2.1 Mean Absolute Error
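The mean absolute error (MAE) is:
\[ \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \tilde{M}_i - M_i \right| \]
where $N$ is the number of data points (forecasts) evaluated, $M_i$ is the actual biomarker value in individual $i$ in future data, and $\tilde{M}_i$ is the participant’s best prediction for $M_i$.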
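A minimal Python sketch of the mean absolute error follows; the function name and the toy ADAS13-like values are made up for illustration.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between best-guess forecasts and measurements."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


# Toy example: three forecasts with absolute errors 2, 2 and 3.
mae = mean_absolute_error([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(mae)
```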
7.2.2 Weighted Error Score
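The weighted error score is defined as:
\[ \mathrm{WES} = \frac{\sum_{i=1}^{N} \tilde{C}_i \left| \tilde{M}_i - M_i \right|}{\sum_{i=1}^{N} \tilde{C}_i} \]
where the weightings $\tilde{C}_i$ are the participant’s relative confidences in their best predictions $\tilde{M}_i$ of the actual values $M_i$ over the $N$ evaluated forecasts. We estimate $\tilde{C}_i$ as the inverse of the width of the 50% confidence interval of their biomarker estimate:
\[ \tilde{C}_i = \left( C_+ - C_- \right)^{-1} \]
where $[C_-, C_+]$ is the confidence interval provided by the participant.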
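The weighting scheme can be sketched in Python as below: each absolute error is weighted by the inverse width of the submitted 50% confidence interval, so errors on confidently predicted values count for more. The function name is hypothetical, and rejecting zero-width intervals is an assumption of this sketch, as the text does not say how they are handled.

```python
def weighted_error_score(actual, predicted, ci_lower, ci_upper):
    """Confidence-weighted mean absolute error.

    The weight of each forecast is 1 / (upper - lower) of its confidence interval.
    """
    weights = []
    for lo, hi in zip(ci_lower, ci_upper):
        if hi <= lo:
            # Assumption: zero or negative widths are treated as invalid input.
            raise ValueError("confidence interval must have positive width")
        weights.append(1.0 / (hi - lo))
    weighted_errors = sum(w * abs(p - a) for w, a, p in zip(weights, actual, predicted))
    return weighted_errors / sum(weights)


# Toy example: the narrow interval (width 2) has the small error (2),
# the wide interval (width 4) has the large error (6).
wes = weighted_error_score(
    actual=[10.0, 20.0],
    predicted=[12.0, 26.0],
    ci_lower=[11.0, 24.0],
    ci_upper=[13.0, 28.0],
)
print(round(wes, 4))
```

Here the unweighted mean absolute error would be 4, while the weighted score is lower (2.5 / 0.75) because the larger error was submitted with a wider interval.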
7.2.3 Coverage Probability Accuracy
The coverage probability accuracy is:
\[ \mathrm{CPA} = \left| \mathrm{ACP} - \mathrm{NCP} \right| \]
where NCP is the nominal coverage probability, the target for the confidence intervals, and ACP is the actual coverage probability, defined as the proportion of measurements that fall within the corresponding confidence interval. In TADPOLE, we set NCP to be 0.5, which means that ideally exactly 50% of the measurements would fall inside the confidence interval. The CPA can take values between 0 and 1, and lower scores are better.
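A minimal Python sketch of the coverage calculation follows. The function name is hypothetical, and counting measurements that lie exactly on an interval boundary as inside is an assumption of this sketch.

```python
def coverage_probability_accuracy(actual, ci_lower, ci_upper, nominal=0.5):
    """Absolute difference between actual and nominal coverage probability."""
    if not actual:
        raise ValueError("at least one measurement is required")
    # Assumption: interval boundaries count as inside.
    inside = sum(1 for a, lo, hi in zip(actual, ci_lower, ci_upper) if lo <= a <= hi)
    actual_coverage = inside / len(actual)
    return abs(actual_coverage - nominal)


# Toy example: three of four measurements fall inside their intervals,
# so the actual coverage is 0.75 against a nominal 0.5.
cpa = coverage_probability_accuracy(
    actual=[10.0, 20.0, 30.0, 40.0],
    ci_lower=[9.0, 19.0, 29.0, 45.0],
    ci_upper=[11.0, 21.0, 31.0, 50.0],
)
print(cpa)
```

Intervals that are too wide and intervals that are too narrow are penalised alike, since only the distance from the nominal coverage matters.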
We are extremely grateful to Alzheimer’s Research UK, The Alzheimer’s Society, and The Alzheimer’s Association for sponsoring a prize fund of £30,000. At the time of first submission, we proposed six separate prizes, as outlined in Table 4, but reserve the right to reallocate the prize money depending on the numbers of participants eligible for each prize. The first four are general categories (open to all challenge participants) and constitute one prize for the best forecast of each feature as well as one for overall best performance. The last two prizes are for two different student categories.
| Prize amount | Outcome measure | Performance metric | Eligibility |
| £5,000 | Overall best | Lowest sum of ranks* | all |
| £5,000 | Clinical status | mAUC | University teams |
| £5,000 | Clinical status | mAUC | High-school teams |
We have outlined the design of the TADPOLE Challenge, which aims to identify algorithms and features that can best predict the evolution of Alzheimer’s disease. Challenge participants use historical data from ADNI in order to predict three key outcomes: clinical diagnosis, ADAS-Cog13 and ventricle volume. Determining which features and algorithms best predict AD evolution can aid refinement of cohorts and endpoint assessment for clinical trials, and can provide accurate prognostic information in clinical settings.
The TADPOLE Challenge was designed to be transparent and accessible. To this end, all of our scripts are available in an open repository (https://github.com/noxtoby/TADPOLE). We also created a public forum (https://groups.google.com/forum/#!forum/tadpolechallenge) where we answer participant questions. Finally, in order to enable participants to share algorithm performance results throughout the competition, we created a leaderboard system (https://tadpole.grand-challenge.org/leaderboard/) that evaluates submissions on an existing test dataset and publishes the results live on our website.
Going forward, we hope that by November 2018 sufficient data will be available from ADNI3 rollovers for a first meaningful evaluation of the forecasts. We plan to publish the results on the website in January 2019, and then submit a publication of the results soon after. However, we reserve the right to delay evaluation until sufficient data is available. At that time, we will also evaluate the impact and interest of the first phase of TADPOLE within the community, to guide decisions on whether to organise further submission and evaluation phases.
TADPOLE Challenge has been organised by the European Progression Of Neurological Disease (EuroPOND) consortium, in collaboration with the ADNI. We thank all the participants and advisors, in particular Clifford R. Jack Jr., Mayo Clinic, Rochester, United States and Bruno M. Jedynak, Portland State University, Portland, United States for useful input and feedback.
The organisers are extremely grateful to Alzheimer’s Research UK, The Alzheimer’s Society, and The Alzheimer’s Association for sponsoring the challenge by providing the prize fund and invaluable advice on its construction and organisation. Similarly, we thank the ADNI leadership and members of our advisory board and other members of the EuroPOND consortium for their valuable advice and support.
RVM is supported by the EPSRC Centre For Doctoral Training in Medical Imaging with grant EP/L016478/1. NPO, FB, SK, and DCA are supported by EuroPOND, which is an EU Horizon 2020 project. ALY is currently supported by an EPSRC Doctoral Prize fellowship and was previously supported by EPSRC grant EP/J020990/01. DCA is supported by EPSRC grants J020990, M006093 and M020533. Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). FB is supported by the NIHR UCLH biomedical research centre and the AMYPAD project, which has received support from the EU-EFPIA Innovative Medicines Initiatives 2 Joint Undertaking (AMYPAD project, grant 115952). This project has received funding from the EU Horizon 2020 research and innovation programme under grant agreement No 666992.
Appendix A Expected number of subjects and available data for D4
We estimated the number of subjects and available data in D4 (Table 2, last column) using information from the ADNI procedures manual and previous ADNI rollovers. For estimating the total number of subjects (first row) expected in D4, we computed the dropout rate (0.36) based on ADNI1 rollovers to ADNI2, then multiplied it by the total number of subjects in D2 (896). For estimating the proportions of each diagnostic category (third row), we used the proportion of diagnostic rates in D2 and multiplied them with conversion rates within 1 year from ADNI1/GO/2 (see website FAQ). For estimating the average number of visits per subject (mean ± std.) in D4 (second row), we used the proportions for each diagnostic group and considered one visit per subject (ADNI procedures). We set the standard deviation to be zero, although in practice this won’t be the case.
For estimating the available biomarker data (lower half of table), we used a 1-year time-frame from start of ADNI2 (July 2012 – July 2013) and computed the proportion of available data in that time frame. For AV1451, we used the same estimate as for AV45, due to the fact that the scan was introduced later on in ADNI2, and we expect more subjects to undergo AV1451 scans in ADNI3. A Python script that computes all the data from Table 2 is given in the TADPOLE repository: https://github.com/noxtoby/TADPOLE/blob/master/statistics/tadpoleStats.py.