Alzheimer’s disease (AD) is the most common form of dementia Schneider,J.A. et al. (2009). In 2020, 5.8 million Americans age 65 and older were living with AD, and this number is expected to reach 13.8 million by 2050 2. Considerable efforts have been made to tackle the challenges raised by this issue and, in particular, to develop early neuroimaging biomarkers and prognosis tools Rathore et al. (2017); Habes et al. (2016, 2016); Li et al. (2019). The most recent deep learning frameworks have been applied in these efforts and have shown promising results in AD classification. For instance, a recent work reported an accuracy of 87.15% for a 3D CNN classifying T1-weighted MRI scans Cheng et al. (2017). A multi-instance 3D convolutional neural network reached an accuracy of 91.09% for a similar task Liu et al. (2018). Unfortunately, these machine learning frameworks rely on complex architectures, which makes it difficult to understand what neurological changes the deep networks model as typical dementia signatures, markers of disease progression, and clues for differential diagnosis between dementias Levakov,G. et al. (2020); Montavon et al. (2018).
In recent years, a new field of research dedicated to the explanation of deep learning models has emerged: Explainable Artificial Intelligence (XAI) Miller,T. (2019); Barredo Arrieta,A. et al. (2020); Longo,L. et al. (2020). In that field, heatmaps have emerged as a popular visualization tool to interpret deep learning models working on images. A heatmap indicates what part of an input image contributes the most to a deep network output Simonyan et al. (2014); Zhou,B. et al. (2016); Zhang,Q.S. and Zhu,S.C. (2018). In other words, a heatmap reflects the importance of imaging features extracted from an image by a deep neural network to support its decision, and how much local image patterns contribute to these important features. The first heatmap method proposed, coined saliency maps, produces heatmaps by back-propagating the gradients of a network output through all the layers of the model until reaching the input layer Simonyan et al. (2014). This core idea was improved and generalized several times in the following years, such as in the Guided Backpropagation method, which builds on the Deconvolution network Zeiler et al. (2010) and where only positive gradients are propagated through ReLU network layers Springenberg et al. (2015); Nair,V. and Hinton,G.E. (2010). In the Class Activation Maps (CAM), network activations are considered instead of back-propagated gradients Zhou,B. et al. (2016). The Integrated Gradients (IG) approach consists of averaging gradient maps generated from multiple scaled inputs Sundararajan et al. (2017). In the Layer-wise Relevance Propagation (LRP) method, a set of preservation rules is applied when back-propagating a network’s activations and, in particular, positive and negative neural network activations are treated in different ways Bach,S. et al. (2015); Montavon,G. et al. (2017); Binder,A. et al. (2016). These strategies are also implemented in the DeepLift method Shrikumar et al. (2017), where baseline activations are subtracted from neuron activations during the propagation. These baseline activations are generated by passing task-specific reference images to the networks Shrikumar et al. (2017). The most recent approaches combine multiple methods to generate fine visualizations that can be produced at different network depths, such as the Guided Grad-CAM method that combines guided back-propagated gradients with class-activation maps generated from output gradients Springenberg et al. (2015); Zhou,B. et al. (2016); Selvaraju,R.R. et al. (2017). Heatmap methods can be selected based on their implementation invariance Sundararajan et al. (2017), their robustness to input perturbations Samek et al. (2016) and to model weight randomization Adebayo et al. (2018), and the relevant information they capture in the saliency maps they produce Dabkowski,P. and Gal,Y. (2017). In the studies where no ground truth is available to estimate the quality of the heatmaps, this evaluation is particularly difficult to conduct Böhle et al. (2019).
|Characteristic||Controls||AD||p-value|
|Study participants (n)||252||250|| |
|MRI B0 field (1.5T/3T)||92/160||78/172||0.22|
|Mean age (std)||74.41 (6.00)||74.93 (8.01)||0.41|
|Mean MMSE (std)||29.06 (1.25)||22.95 (2.23)||<0.001|
|Mean years of education (std)||15.25 (2.95)||16.35 (2.65)||<0.001|
Table 1: ADNI participants selected in this work. Fisher’s exact test detected no significant difference in the proportion of men in the two groups (p=0.25) and no significant difference in the proportion of scans acquired with a 3 Tesla MRI scanner (p=0.22). A t-test detected no significant difference in mean age (p=0.41), while mini-mental state examination (MMSE) values Jack,C.R.Jr. et al. (2008) were significantly worse in the AD group. Education information was only available for 220 AD study participants and 211 controls and indicated significantly longer education in the AD group.
A considerable number of voxel-based morphometry (VBM) studies have been conducted to discover which atrophies observed in the aging brain can be attributed to underlying Alzheimer’s disease Mueller et al. (2010a); Testa et al. (2004); Busatto et al. (2008); Villain et al. (2008); Chételat et al. (2008). When a meta-analysis is conducted, these VBM studies are often summarized into a single brain map indicating what brain regions are affected by the disease Schroeter et al. (2009); Minkova et al. (2017); Di,X. et al. (2014). VBM studies capture the univariate significance of local tissue changes: they indicate how much a brain disorder such as AD has impacted local brain tissues. This ‘decoding’ approach reverses the ‘encoding’ approach adopted by neural networks, where local tissue changes are aggregated into non-linear features used to predict patient diagnosis. However, under the assumptions that AD only affects localized brain regions Mueller et al. (2010a); Testa et al. (2004); Busatto et al. (2008); Villain et al. (2008); Chételat et al. (2008) and that neural networks focus on a restricted set of relevant brain regions when diagnosing AD, VBM maps and heatmaps should overlap and highlight brain regions that are both associated with and predictive of the disease. Since the different heatmap methods capture imaging pattern contributions in various ways, the overlap between heatmaps and Alzheimer’s disease VBM patterns is also expected to depend on the heatmap calculation; methods focusing on high-level features are more difficult to relate to voxel-wise VBM results. Lastly, it is unclear how well heatmaps derived from a restricted data set would replicate in larger AD neuroimaging cohorts and whether the overlap between univariate significance and neural network feature importance would be preserved.
As far as we know, none of these questions have been explored so far. We propose to address them together in this work by quantifying the amount of overlap that can be reached between a “ground truth” univariate significance map provided by a large VBM meta-analysis and heatmaps derived by the most advanced methods from a convolutional neural network achieving state-of-the-art performance for Alzheimer’s disease classification on an independent sample of MRI scans. More specifically, we evaluate the ability of three prominent CNN heatmap methods, the Layer-wise Relevance Propagation (LRP) method Bach,S. et al. (2015), the Integrated Gradients (IG) method Sundararajan et al. (2017), and the Guided Grad-CAM (GGC) method Selvaraju,R.R. et al. (2017), to capture Alzheimer’s disease effects by training 3D CNN classifiers using T1-weighted MRI scans from the ADNI data set, and measuring the overlap between their heatmaps and a binary brain map derived from a meta-analysis of voxel-based morphometry studies conducted on other T1 MRI scans. Figure 1 summarizes our approach.
2 Materials and Methods
2.1 ADNI Study Participants
A total of 502 ADNI participants were included in this study: 250 participants diagnosed with AD and 252 controls. Of these, 170 participants were part of the ADNI1 study (92 controls, 78 AD), 298 were enrolled in the ADNI2 study (160 controls, 138 AD), and the last 34 participants were recruited for ADNI3 (0 controls, 34 AD). Study participant demographics are reported in Table 1.
2.2 ADNI Data and Processing
For each participant, a raw structural T1-weighted MRI scan was downloaded from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As a preparation for the present study, the scans were further processed as follows.
First, the multi-atlas brain segmentation pipeline (MUSE) was used for skull-stripping the T1-weighted MRI scans and generating a gray matter map Doshi,J. et al. (2016). This automated processing pipeline starts by denoising the T1 scans using the N4 bias field correction Tustison,N.J. et al. (2010) provided as part of the Advanced Normalization Tools software library (ANTs, version 2.2.0) Avants,B.B. et al. (2011). Then, the denoised scans are registered using ANTs nonrigid SyN registration Avants,B.B. et al. (2011, 2008) and DRAMMS Yangming,O. and Davatzikos,C (2009) to a set of 50 brain atlases where brain masks have been manually segmented. These registrations are used to warp the atlas brain masks into the space of the T1 scan being processed, where they are combined by majority voting to produce an accurate brain mask Doshi,J. et al. (2016). The brain is then segmented into white matter, gray matter, and cerebrospinal fluid using FSL FAST (version 5.0.11) Jenkinson,M. et al. (2012), and parcellated into regions of interest by registering a set of 50 manually segmented brain atlases Doshi,J. et al. (2016).
Then, the skull-stripped T1 scans produced by MUSE were registered to the 1 mm resolution 2009c version of the ICBM152 MNI atlas Fonov,V. et al. (2011); Fonov,V.S. et al. (2009); Collins,D.L. et al. (1999) using the non-rigid registration method SyN, part of the ANTs library (ANTs version 2.3.4) Avants,B.B. et al. (2011, 2008). The deformation field produced by ANTs was applied, after masking out white matter and ventricles using the tissue maps generated by MUSE, to bring only the gray matter into the MNI space. The same transformation was applied to move the MUSE gray matter mask to the MNI space. Lastly, each T1 scan was normalized individually by dividing the T1 intensities by the maximum intensity within the brain.
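The final normalization step can be illustrated with a minimal NumPy sketch. The function name `normalize_by_brain_max` and the toy arrays are hypothetical; the sketch only assumes that a binary brain (or gray matter) mask is available in the same space as the scan.

```python
import numpy as np

def normalize_by_brain_max(t1, brain_mask):
    """Divide T1 intensities by the maximum intensity found inside the mask."""
    t1 = np.asarray(t1, dtype=np.float64)
    mask = np.asarray(brain_mask, dtype=bool)
    if not mask.any():
        raise ValueError("brain_mask is empty")
    brain_max = t1[mask].max()
    if brain_max <= 0:
        raise ValueError("non-positive maximum intensity inside the brain")
    return t1 / brain_max

# Toy example: random intensities and a cubic mask (illustrative data only).
rng = np.random.default_rng(0)
scan = rng.uniform(0.0, 3000.0, size=(8, 8, 8))
mask = np.zeros(scan.shape, dtype=bool)
mask[2:6, 2:6, 2:6] = True
normalized = normalize_by_brain_max(scan, mask)
```

After this step the largest value inside the mask equals 1, so scans acquired with different intensity scales become comparable as network inputs.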
2.3 Convolutional Neural Networks
A convolutional neural network (CNN) consists of a set of convolutional layers applying convolution operators to gradually condense input data into a set of high-level features that are passed through fully connected layers to produce the final output of the network Krizhevsky et al. (2012). This output is usually a single value scaled between 0 and 1 when the CNN is used for binary classification Krizhevsky et al. (2012). Convolutional layers are often combined with the ReLU activation layer filtering negative outputs Nair,V. and Hinton,G.E. (2010), max-pooling layers reducing the dimension of the data Krizhevsky et al. (2012), and batch normalization layers helping the neural network model optimization Ioffe and Szegedy (2015).
In this work, four different 3D CNN architectures of varying complexity were compared. The first architecture, which will be referred to as Model A, was made of five convolutional layers with decreasing kernel sizes: one layer with a kernel size of , two layers with kernel sizes of , and two layers with kernel sizes of , where denotes the number of channels and it was fixed for each CNN separately. Model B had five convolutional layers with the same kernel size of . Model C had four convolutional layers with the same kernel size of . Model D was made of three convolutional layers with the same kernel size of . All convolutional layers were followed by a batch normalization layer, a ReLU activation layer Nair,V. and Hinton,G.E. (2010), and a max-pooling layer Krizhevsky et al. (2012). For each architecture, the number of channels varied from 24 to 52 to build networks of increasing numbers of parameters. On top of these convolutional layers, all the CNNs were completed by two fully connected layers separated by a ReLU layer Nair,V. and Hinton,G.E. (2010) and a dropout layer fixed to 0.5 to prevent overfitting Srivastava et al. (2014). The first fully connected layer was obtained by flattening the features produced by the last convolutional layer. The second layer was set to contain 64 neurons and to produce a continuous output corresponding to the AD diagnosis. These CNN architectures are summarized in Figure 2.
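The general layout described above (convolution, batch normalization, ReLU, and max-pooling blocks, followed by two fully connected layers separated by a ReLU and a dropout of 0.5) can be sketched in PyTorch as follows. This is an illustrative sketch, not the authors' implementation: the class name `Sketch3DCNN`, the kernel size of 3, the padding, the pooling size of 2, and the two output logits are all assumptions, since these details are not recoverable from the text.

```python
import torch
from torch import nn

class Sketch3DCNN(nn.Module):
    """Conv3d -> BatchNorm3d -> ReLU -> MaxPool3d blocks, then two dense layers."""

    def __init__(self, input_shape, n_conv=5, channels=44,
                 kernel_size=3, hidden=64, n_outputs=2):
        super().__init__()
        layers, in_channels = [], 1
        for _ in range(n_conv):
            layers += [
                # Kernel size and padding are assumptions (not given in the text).
                nn.Conv3d(in_channels, channels, kernel_size,
                          padding=kernel_size // 2),
                nn.BatchNorm3d(channels),
                nn.ReLU(),
                nn.MaxPool3d(kernel_size=2),
            ]
            in_channels = channels
        self.features = nn.Sequential(*layers)
        # Infer the flattened feature size with a dummy pass; eval mode lets
        # batch normalization accept a single sample.
        self.features.eval()
        with torch.no_grad():
            n_flat = self.features(torch.zeros(1, 1, *input_shape)).numel()
        self.features.train()
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_flat, hidden),   # fed by the flattened conv features
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(hidden, n_outputs),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Small configuration so that the example runs quickly on a CPU.
model = Sketch3DCNN(input_shape=(16, 16, 16), n_conv=3, channels=8)
model.eval()
with torch.no_grad():
    logits = model(torch.zeros(2, 1, 16, 16, 16))
```

Varying `n_conv` and `channels` in such a sketch mimics how the four architectures and their channel counts change the number of trainable parameters.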
These networks were trained to distinguish between the MRI scans of AD participants and controls in our ADNI data set. Cross-entropy was used as the loss function for the classification, and that loss was minimized using the Adam optimizer with a learning rate of 0.0001 and a weight decay of 0.0001 Kingma and Ba (2015). CNN accuracy was evaluated via 5-fold cross-validation, by splitting the data set five times into a training set of 322 scans, a validation set of 80 scans, and a test set of 100 scans for the first four folds, and into a training set of 322 scans, a validation set of 78 scans, and a test set of 102 scans for the last fold. An early stopping criterion was implemented by monitoring the validation loss and stopping the training process after 10 consecutive epochs without improvement in the validation loss. These CNN architectures and their optimizers were selected to match standard architectures and their default optimization parameter values and, in particular, the state-of-the-art AlexNet Krizhevsky et al. (2012) and GoogLeNet Szegedy et al. (2015).
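The early stopping criterion can be sketched in a few lines of plain Python. The class `EarlyStopping`, the helper `epochs_run`, and the validation losses below are hypothetical and only illustrate the rule of stopping after 10 epochs without improvement of the validation loss; the strict "lower than the best loss so far" test is an assumption.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience

def epochs_run(val_losses, patience=10):
    """Return the number of epochs executed before early stopping triggers."""
    stopper = EarlyStopping(patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.step(loss):
            return epoch
    return len(val_losses)

# Made-up validation losses: improvement for 3 epochs, then a plateau.
losses = [0.70, 0.65, 0.60] + [0.61] * 20
stopped_at = epochs_run(losses)
```

With these made-up losses the best value is reached at epoch 3, so training stops at epoch 13, after ten epochs without improvement.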
All the models were trained on a high-performance computing system equipped with NVIDIA V100 GPUs. The training required 32 GB of RAM and was completed within 3 to 8 hours for each fold, depending on the model complexity. The networks were built with PyTorch Paszke et al. (2019).
The classification accuracy obtained for all the CNNs tested was compared with the classification accuracy of linear SVMs with the following set of C parameters Smola,A.J. and Schölkopf,B. (2004); Pedregosa,F. et al. (2011): , , , , , , , , , , . The CNN model with the best cross-validated accuracy was then retained to compute heatmaps highlighting the brain regions selected by the model to distinguish AD and control brains.
2.4 CNN Heatmap Methods
In this study, three prominent CNN heatmap methods were selected: the Layer-wise Relevance Propagation (LRP) method Bach,S. et al. (2015), the Integrated Gradients (IG) method Sundararajan et al. (2017), and the Guided Grad-CAM (GGC) method Selvaraju,R.R. et al. (2017). Each method was used to produce a heatmap for each test scan, and these individual heatmaps were then averaged to produce a single heatmap per method indicating what brain regions were used the most by the selected CNN when classifying the brain scans to distinguish AD and control ADNI participants Bach,S. et al. (2015); Sundararajan et al. (2017); Selvaraju,R.R. et al. (2017).
The Layer-wise Relevance Propagation (LRP) method produces a heatmap by estimating a relevance score for each pixel of input passed to a CNN model. The relevance score is computed by propagating CNN outputs backward in the network according to a specific set of rules Bach,S. et al. (2015). These rules are designed to preserve relevance scores from layer to layer and are often modified to reduce the noise in relevance scores, improve their sparsity, or treat positive and negative neural network activations differently Bach,S. et al. (2015); Binder,A. et al. (2016); Montavon,G. et al. (2017). In this work, we used the -rule LRP algorithm implemented by Böhle et al. Binder,A. et al. (2016); Böhle et al. (2019). The parameter aiming at balancing the relevance scores associated with positive and negative neural network activations was set to 0.5 to account for both activations in a similar manner Bach,S. et al. (2015).
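As an illustration of how such a propagation rule preserves relevance, the sketch below applies a β-type rule to a single dense layer. It assumes that the rule referred to above is the β-rule described by Böhle et al. (2019), in which positive contributions are weighted by 1 + β and negative contributions by β; biases are ignored, and the function name `lrp_beta_dense` and the toy values are hypothetical.

```python
import numpy as np

def lrp_beta_dense(activations, weights, relevance_out, beta=0.5, eps=1e-12):
    """Redistribute output relevance to the inputs of one dense layer.

    activations: (n_in,), weights: (n_in, n_out), relevance_out: (n_out,).
    """
    a = np.asarray(activations, dtype=np.float64)
    w = np.asarray(weights, dtype=np.float64)
    r_out = np.asarray(relevance_out, dtype=np.float64)
    z = a[:, None] * w                      # contribution of input i to output j
    z_pos = np.maximum(z, 0.0)
    z_neg = np.minimum(z, 0.0)
    # Shares of the positive and negative contributions, per output neuron.
    pos_share = z_pos / (z_pos.sum(axis=0) + eps)
    neg_share = z_neg / (z_neg.sum(axis=0) - eps)
    return ((1.0 + beta) * pos_share - beta * neg_share) @ r_out

# Toy layer with both positive and negative contributions for each output.
a = np.array([1.0, 2.0, 0.5, 1.5])
w = np.array([[0.5, -1.0], [-0.3, 0.8], [1.2, 0.4], [-0.7, -0.2]])
r_out = np.array([1.0, 0.5])
relevance_in = lrp_beta_dense(a, w, r_out, beta=0.5)
```

Because the positive shares sum to one and the negative shares sum to one for each output neuron, the total relevance is conserved: (1 + β) − β = 1 times the output relevance.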
The Integrated Gradients (IG) method was introduced to guarantee two desirable heatmap properties: sensitivity and implementation invariance Sundararajan et al. (2017). Sensitivity refers to the ability of a heatmap method to produce null relevance scores for network inputs that are not contributing to the network output. Most of the methods published before IG either did not satisfy the sensitivity requirement, such as Guided Backpropagation Springenberg et al. (2015), Deconvolution networks Zeiler et al. (2010), and DeepLift Shrikumar et al. (2017), or were not invariant to the neural network implementation, such as LRP Bach,S. et al. (2015) and DeepLift Shrikumar et al. (2017). The IG method produces the heatmap of an input image by multiplying the input by several scaling factors uniformly selected between 0 and 1, computing the gradient for each scaled input via backpropagation, and averaging these gradients Sundararajan et al. (2017). The IG implementation used in this work will be available on https://github.com/UTHSCSA-NAL/CNN-heatmap/.
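The averaging procedure can be sketched with NumPy on a toy differentiable model whose gradient is known in closed form. This is not the implementation released by the authors: the function `integrated_gradients`, the toy model, the zero baseline, and the midpoint sampling of the scaling factors are assumptions. Following the original IG formulation, the averaged gradient is multiplied by the difference between the input and the baseline.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline=None, n_steps=50):
    """Average the gradients at scaled inputs and multiply by (x - baseline)."""
    x = np.asarray(x, dtype=np.float64)
    if baseline is None:
        baseline = np.zeros_like(x)          # all-zero reference input
    # Scaling factors uniformly spaced in (0, 1) (midpoint Riemann sum).
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    grads = [grad_fn(baseline + alpha * (x - baseline)) for alpha in alphas]
    return (x - baseline) * np.mean(grads, axis=0)

# Toy model f(x) = tanh(w . x) with its analytic gradient.
w = np.array([0.8, -0.5, 0.3])

def model(x):
    return np.tanh(w @ x)

def model_grad(x):
    return (1.0 - np.tanh(w @ x) ** 2) * w

x = np.array([1.0, 2.0, -1.0])
attributions = integrated_gradients(model_grad, x, n_steps=200)
completeness_gap = abs(attributions.sum() - (model(x) - model(np.zeros(3))))
```

The attributions approximately sum to the difference between the model output at the input and at the baseline, which is the completeness property of IG and a convenient sanity check for any implementation.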
Guided Grad-CAM (GGC) combines Grad-CAM and Guided Backpropagation Springenberg et al. (2015); Selvaraju,R.R. et al. (2017). Grad-CAM is a generalization of the Class Activation Mapping (CAM) Zhou,B. et al. (2016) that can be implemented for any CNN without model changes or re-training. Grad-CAM generates a heatmap for a CNN layer by applying a ReLU function Nair,V. and Hinton,G.E. (2010) to a linear combination of the activations obtained at that layer, with weights derived from the gradients backpropagated from subsequent layers Selvaraju,R.R. et al. (2017). In Guided Grad-CAM, these Grad-CAM heatmaps are up-sampled to the resolution of the input data and element-wise multiplied with a heatmap generated by Guided Backpropagation to produce heatmaps with the same resolution as the input data Springenberg et al. (2015); Selvaraju,R.R. et al. (2017). During our experiments, we only considered Guided Grad-CAM heatmaps based on the Grad-CAM maps computed for the last convolutional layer of our CNNs, as suggested in the original GGC publication Selvaraju,R.R. et al. (2017). GGC was implemented as part of the Captum PyTorch library https://captum.ai/.
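The combination of the two maps can be sketched with NumPy, starting from precomputed arrays. This is not the Captum implementation used in this work: the function names are hypothetical, the channel weights are taken as spatial averages of the gradients as in the Grad-CAM publication, and nearest-neighbour up-sampling replaces the interpolation a library would typically use.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM map for one layer; both inputs have shape (channels, d, h, w)."""
    weights = gradients.mean(axis=(1, 2, 3))           # one weight per channel
    cam = np.tensordot(weights, activations, axes=1)   # weighted sum of channels
    return np.maximum(cam, 0.0)                        # ReLU

def upsample_nearest(volume, factor):
    """Nearest-neighbour up-sampling by an integer factor along every axis."""
    for axis in range(volume.ndim):
        volume = np.repeat(volume, factor, axis=axis)
    return volume

def guided_grad_cam(activations, gradients, guided_backprop, factor):
    """Element-wise product of the up-sampled Grad-CAM and guided backprop maps."""
    cam = upsample_nearest(grad_cam(activations, gradients), factor)
    return cam * guided_backprop

# Random toy tensors standing in for real network quantities.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 2, 2, 2))       # last conv layer activations
grads = rng.normal(size=(4, 2, 2, 2))      # gradients of the output w.r.t. acts
guided = rng.normal(size=(4, 4, 4))        # guided backprop map at input size
heatmap = guided_grad_cam(acts, grads, guided, factor=2)
```

Since the Grad-CAM map is non-negative, it acts as a coarse spatial gate on the fine-grained guided backpropagation map.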
2.5 Meta-analysis ALE Map
The meta-analysis ALE map summarizes 77 VBM studies published in 58 articles. Gender information was missing in 3 MRI VBM studies. A Fisher exact test was conducted to compare men/women counts, and an unpaired T-test was conducted to compare group mean ages. After Bonferroni correction for two tests, none of the differences observed was significant at level p=0.05.
The meta-analysis map was produced by reprocessing a set of voxel-based morphometry (VBM) studies collected from a prior meta-analysis Ashburner,J. and Friston,K.J. (2000). We produced a brain map by applying the activation likelihood estimation (ALE) method Turkeltaub,P.E. et al. (2002); Turkeltaub et al. (2012); Eickhoff et al. (2012); Eickhoff,S.B. et al. (2016) implemented in the GingerALE software (version 3.0.2) 1; Eickhoff,S.B., Laird,A.R., Grefkes,C., Wang,L.E., Zilles,K., and Fox,P.T. (2009) to combine the selected VBM studies into a single ALE map indicating what atrophies observed in the brain were likely to be associated with Alzheimer’s disease Turkeltaub,P.E. et al. (2002). More specifically, and following Müller et al. (2018, 2017), GingerALE was run with a cluster-forming p-value threshold of 0.001 and a cluster-level significance level of 0.05. Cluster significance was estimated by conducting a thousand random permutations. The continuous map generated by GingerALE, where all non-significant brain locations had been assigned null values, was thresholded at its smallest non-zero value to produce a binary map suitable for a comparison with the thresholded CNN heatmaps.
2.6 Evaluation Metrics
The ability of the CNN heatmap methods presented in the previous section to capture brain alterations associated with Alzheimer’s disease was estimated by measuring the Dice overlap between binary maps obtained by smoothing and thresholding the heatmaps with the binary brain map derived from a large meta-analysis that summarized the brain regions affected by Alzheimer’s disease in T1 MRI scans.
More specifically, CNN heatmap values were replaced by their absolute values. The heatmaps were then smoothed by sixteen different Gaussian kernels of full width at half maximum (FWHM) ranging from 1 mm to 32 mm (1 mm, 2 mm, 3 mm, 4 mm, 5 mm, 6 mm, 7 mm, 8 mm, 9 mm, 10 mm, 12 mm, 16 mm, 20 mm, 24 mm, 28 mm, 32 mm). For each smoothed heatmap and the original heatmaps, 50 values were evenly selected between the minimum and the maximum heatmap value to threshold the heatmaps. The reason to smooth a heatmap is to make it comparable to a meta-analysis map since the ALE algorithm inherently adds Gaussian smoothing to the locations of reported foci. The 850 binary maps obtained in this way were compared with the meta-analysis map by computing a Dice overlap. This approach was chosen to explore and mitigate spatial resolution discrepancies between the CNN heatmaps and the meta-analysis map.
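The smoothing, thresholding, and Dice computation can be sketched as follows with NumPy and SciPy. The function names, the toy volumes, the 1 mm voxel size, the inclusive `>=` thresholding, and the inclusion of both endpoints in the 50 thresholds are assumptions; the FWHM list is the one given above, with `None` standing for the unsmoothed heatmap.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

FWHM_MM = [None, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 16, 20, 24, 28, 32]
N_THRESHOLDS = 50
# sigma = FWHM / (2 * sqrt(2 * ln 2)), i.e. about 0.4247 * FWHM
FWHM_TO_SIGMA = 1.0 / (2.0 * np.sqrt(2.0 * np.log(2.0)))

def dice(a, b):
    """Dice overlap between two binary volumes."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / total if total > 0 else 0.0

def dice_curve(heatmap, reference, fwhm_mm=None, voxel_mm=1.0):
    """Dice overlaps of the thresholded |heatmap| for one smoothing level."""
    h = np.abs(np.asarray(heatmap, dtype=np.float64))
    if fwhm_mm is not None:
        h = gaussian_filter(h, sigma=fwhm_mm * FWHM_TO_SIGMA / voxel_mm)
    thresholds = np.linspace(h.min(), h.max(), N_THRESHOLDS)
    return np.array([dice(h >= t, reference) for t in thresholds])

# Toy reference map and a heatmap with negative values inside the region.
reference = np.zeros((16, 16, 16), dtype=bool)
reference[5:11, 5:11, 5:11] = True
heatmap = -reference.astype(np.float64)
curves = {fwhm: dice_curve(heatmap, reference, fwhm) for fwhm in FWHM_MM}
n_binary_maps = sum(len(curve) for curve in curves.values())
best_unsmoothed = curves[None].max()
```

Seventeen smoothing levels (sixteen kernels plus the unsmoothed map) times 50 thresholds give the 850 binary maps mentioned above, and the best Dice per method is the maximum over all of them.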
In addition to the Dice overlaps, Receiver Operating Characteristic (ROC) curves and precision-recall curves were computed to compare each smoothed heatmap with the meta-analysis map, by considering the meta-analysis map as a set of reference binary labels. The area under the curve was reported as well. These additional measures of overlap between heatmaps and the meta-analysis map were used to validate the findings derived from Dice overlaps.
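For the ROC analysis, the area under the curve can be computed directly from voxel ranks, treating heatmap values as scores and the binary meta-analysis map as labels. The sketch below uses the rank-sum (Mann-Whitney) identity; the function name `roc_auc` and the toy values are hypothetical, and precision-recall curves are not covered.

```python
import numpy as np
from scipy.stats import rankdata

def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    scores = np.ravel(scores).astype(np.float64)
    labels = np.ravel(labels).astype(bool)
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    ranks = rankdata(scores)                 # average ranks handle ties
    u_stat = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)

# Four voxels: heatmap values as scores, reference map as binary labels.
scores = np.array([0.1, 0.4, 0.35, 0.8])
labels = np.array([0, 0, 1, 1])
auc = roc_auc(scores, labels)
```

In this toy case three of the four positive/negative voxel pairs are ranked correctly, so the area under the curve is 0.75.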
2.7 Additional Synthetic Validation
The methods presented in this work were validated by processing two synthetic data sets. The first data set, the “single-subject” data set, made of 10000 images, was generated from a single healthy control subject MRI scan (mean MRI intensity = 2600, std = 756) from our ADNI data set and was downsampled to a size of voxels to reduce the computational burden. In half of the images, the MRI intensity in the hippocampus regions was increased by a random value ranging between 0 and 2500 and simulating a disease effect on grey matter tissue. Then, Gaussian noise (mean=0, std=2000) was added to all synthetic images, and a Gaussian smoothing of 4 mm FWHM was applied. The second data set, the “whole-cohort” data set, also made of 10000 images, was generated using 250 healthy controls from our ADNI data set and downsampled to voxels. Each healthy control scan was used to generate 40 images. In half of these images, the MRI intensity in the hippocampus regions was increased by a random disease effect between 0 and 2500. Then, Gaussian noise (mean=0, std=2000) was added to all synthetic images, and a Gaussian smoothing of 4 mm FWHM was applied.
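The generation of one synthetic data set can be sketched as follows. The function `make_synthetic_images`, the random template, and the cubic region standing in for the hippocampus are hypothetical. The sketch also assumes that one scalar effect is drawn per image and added to every voxel of the region, and that the 4 mm FWHM corresponds to 4 voxels, since the voxel size of the downsampled images is not stated.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_synthetic_images(template, roi_mask, n_images, rng,
                          max_effect=2500.0, noise_std=2000.0, fwhm_vox=4.0):
    """Return (images, labels); label 1 marks images with a simulated effect."""
    sigma = fwhm_vox / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    labels = np.zeros(n_images, dtype=int)
    labels[: n_images // 2] = 1              # half of the images are "diseased"
    rng.shuffle(labels)
    images = np.empty((n_images,) + template.shape)
    for i, label in enumerate(labels):
        img = template.astype(np.float64).copy()
        if label:
            # One random effect per image, added inside the region of interest.
            img[roi_mask] += rng.uniform(0.0, max_effect)
        img += rng.normal(0.0, noise_std, size=img.shape)
        images[i] = gaussian_filter(img, sigma=sigma)
    return images, labels

rng = np.random.default_rng(0)
template = rng.normal(2600.0, 756.0, size=(12, 12, 12))   # stand-in for a scan
roi = np.zeros(template.shape, dtype=bool)
roi[4:8, 4:8, 4:8] = True                                 # stand-in for the ROI
images, labels = make_synthetic_images(template, roi, n_images=20, rng=rng)
```

For the whole-cohort variant, the same function would be called once per control scan with `n_images=40` and the outputs concatenated.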
Eleven linear SVMs were trained to distinguish the synthetic scans with and without disease effect, for the parameters tested with the clinical data (C parameter in , , , , , , , , , , ) Smola,A.J. and Schölkopf,B. (2004); Pedregosa,F. et al. (2011). Then, four CNN models similar to the models shown in Figure 2 were trained: a Model A containing one and one convolutional layers, where the number of channels was set to (ModelA4l2), a Model B containing two convolutional layers (ModelB4l2), a Model C containing two convolutional layers (ModelC4l2), and a Model D containing two convolutional layers (ModelD4l2). The number of convolutional layers in these CNNs was reduced, compared to the original CNN architectures used for classifying ADNI scans, to fit the size of the downsampled synthetic images. They were trained for a cross-entropy classification loss, that was minimized using the Adam optimizer with a learning rate of 0.0001 and a weight decay of 0.0001 Kingma and Ba (2015) for a hundred epochs. Please refer to Figure 2 for their detailed architecture. For each data set, LRP, IG, and GGC were used to compute a heatmap for the CNN model reaching the best five-fold cross-validated accuracy. The coefficients of the best SVM and the heatmap values were then compared with the binary map of the hippocampus by computing Dice overlaps at different spatial smoothing levels as explained in the previous Sections.
3 Results
3.1 Classification performance in synthetic data
For both synthetic data sets, the model D4l2 reached the best five-fold cross-validated accuracy, with 91% accuracy for the single-subject data and 90.1% for the whole-cohort data, and systematically outperformed the best SVM models, that were obtained for both data sets by setting (87% accuracy for single-subject data set and 87.5% accuracy for whole-cohort data set). More specifically, for the single-subject data set, model A4l2 also outperformed the best SVM, with an accuracy of 90%, but Model C4l2 and B4l2 produced worse classification results, with respectively 81% and 74% accuracy. For the whole-cohort data set, on the contrary, B4l2 was the second-best model with 90.06% accuracy, followed by A4l2 (89.9%) and C4l2 (87.9%) and all CNN models were more accurate than the best SVM tested.
3.2 Heatmaps derived from synthetic data
The Dice overlaps between heatmaps and the binary hippocampus map are shown in Figure 3. All smoothing results are reported in Supplementary materials (Section 1). For the single-subject data set, the best Dice overlaps were 0.581 for the LRP heatmap with a 1 mm FWHM Gaussian smoothing, 0.703 for the IG heatmap with a 1 mm smoothing, 0.593 for the GGC heatmap with a 1 mm smoothing, and 0.694 for the SVM heatmap with a 2 mm smoothing. For the whole-cohort data set, the best Dice overlaps were 0.766 for the LRP heatmap without Gaussian smoothing, 0.804 for the IG heatmap without smoothing, 0.763 for the GGC heatmap without smoothing, and 0.744 for the SVM heatmap with a 2 mm smoothing. The heatmaps achieving the best overlap with the hippocampus map and the heatmaps thresholded at 5% of their maximum values are shown in Supplementary materials (Section 1). Those plots indicate that IG heatmaps have a better focus on the hippocampus than GGC and LRP heatmaps.
These results demonstrate the ability of CNN heatmaps to capture localized and specific brain alterations on a synthetic data set and, in particular, when the hippocampus is affected similarly as in real clinical data. IG heatmaps reached the best overlaps for both data sets and required the smallest spatial smoothing to achieve that overlap (1 mm for the single-subject data set, no smoothing for the whole-cohort data set).
3.3 Classification performance in ADNI
The 5-fold cross-validation accuracy of all the CNN and SVM models tested during this work is reported in Table 3. The best CNN accuracy was achieved by Model B with 44 channels (ModelB44) and reached 87.24%. This accuracy is about six percentage points higher than the cross-validated accuracy of the best SVM model (C = 0.001), which is close to 81.2% and was obtained through a grid search over the set of C parameters listed in Section 2.3.
|c||Model A||Model B||Model C||Model D|
|best SVM: C=0.001, accuracy:|
3.4 Meta-Analysis ALE Maps
Figure 4 presents the ALE map used in this work. The structural MRI ALE map summarizes 77 neuroimaging studies reporting 773 locations in the brain affected by Alzheimer’s disease, discovered by analyzing the neuroimaging data of a total of 3817 study participants around the world (2118 controls, 1699 MCI or AD). The complete list of publications combined in this map is reported in Supplementary materials (Section 2).
3.5 Overlaps between ADNI CNN heatmaps and the ground truth derived from the meta-analysis
The five ModelB44 models trained during the 5-fold cross-validation were used to generate heatmaps. For each model, a heatmap was generated for each test scan with each of the three heatmap methods: LRP, IG, and GGC. The 502 individual heatmaps obtained for each method were averaged into a single heatmap that was compared with the binary meta-analysis map to evaluate the performance of the method. In addition, an SVM coefficient map was derived by training a linear C-SVM on all the data with the C parameter producing the best cross-validated accuracy, and retaining the weights of this model to create a brain map.
Figure 5 reports all the Dice overlaps measured between the heatmaps and the meta-analysis map. The best Dice overlaps were 0.502 for the LRP heatmap with a 7 mm FWHM Gaussian smoothing, 0.550 for the IG heatmap with a 4 mm smoothing, 0.540 for the GGC heatmap with an 8 mm smoothing, and 0.479 for the SVM heatmap with a 12 mm smoothing. The heatmaps with the best overlaps with the meta-analysis are shown in Figure 6. Figure 7 displays the unsmoothed heatmaps. The LRP heatmaps select more regions than the other maps and appear to be noisier. On the other hand, IG heatmaps have a better focus on the regions highlighted by the meta-analysis, but the unsmoothed IG map presents an unrealistic scatter. GGC produced a map that was simultaneously more relevant than the LRP heatmap and less scattered than the IG heatmap. In comparison, the unsmoothed SVM heatmap covers most of the gray matter. The linear SVM produced slightly larger weight amplitudes in the regions relevant for the diagnosis, but an aggressive smoothing was required to make this effect emerge in Figure 6.
4 Discussion
In the present study, we reported the first data-driven validation, for the study of Alzheimer’s disease, of three prominent CNN heatmap methods: Layer-wise Relevance Propagation (LRP), Integrated Gradients (IG), and Guided Grad-CAM (GGC). The heatmaps produced by these methods, for a CNN classifier producing the best AD classification among a large set of CNN architectures tested using ADNI T1-weighted MRI scans, were compared with a binary meta-analysis ALE map obtained by combining 77 Alzheimer’s disease VBM studies. Our results indicate that the CNN heatmaps captured brain regions that were also associated with AD effects on the brain in the meta-analysis.
4.1 Best deep learning-based classification model
The best 5-fold cross-validation accuracy (87.24%) was obtained for a Model B with 44 channels. Overall, Model B accuracy was stable when the number of channels was varied, varying only between 83% and 87%. Model A and C were less stable: their accuracy ranged between 60% and 84% and 63% and 84% respectively. Model D produced only poor classifications, for an accuracy ranging between 47% and 59%. We think that these differences can be explained by overfitting, as we noticed that Models D usually contain more trainable parameters than Models C of a similar number of channels. Models C usually contain more parameters than Models A, and Models B are the smallest models. For 44 channels, for instance, Model D required the training of 1569058 parameters, Model C 957502 parameters, Model A 705866, and Model B only 436410 parameters.
4.2 Explainable AI and neuroimaging
The neuroimaging field has developed meta-analysis brain maps to summarize domain knowledge Fox et al. (2005); Vanasse et al. (2018), which we use to evaluate the CNN heatmaps. Contrary to prior studies, where the quality of heatmaps was visually or qualitatively assessed Samek et al. (2016); Binder,A. et al. (2016); Samek et al. (2021); Jo et al. (2020), we obtained precise quantitative measures by computing overlaps with a ground-truth map derived from a large-scale meta-analysis. We explored a broad range of heatmaps’ spatial smoothing intensities and we found that the heatmaps overlapped the most with the meta-analysis for Gaussian smoothing kernels between 4 mm and 8 mm FWHM. These Gaussian kernels are similar to the kernels usually applied by GingerALE when producing meta-analysis maps 1; Eickhoff,S.B., Laird,A.R., Grefkes,C., Wang,L.E., Zilles,K., and Fox,P.T. (2009).
4.3 Evaluation experiments for the heatmaps derived from deep learning models
For all the CNN heatmap methods, the best heatmaps indicated that changes in the hippocampal regions of both hemispheres were a crucial pattern for distinguishing ADNI participants with AD from healthy controls. These results are in line with the literature, where the effect of Alzheimer’s disease on the hippocampus has been well characterized Habes et al. (2016); Shi et al. (2009); Mueller et al. (2010b); Ohnishi et al. (2001); Jack et al. (2013). We obtained moderately good Dice overlaps between the heatmaps and the meta-analysis ground truth, ranging from 0.5 for the best heatmap generated by the LRP method to 0.55 for the best IG heatmap.
Direct analysis of the heatmaps, without spatial smoothing, established that all CNN heatmaps were better at focusing on relevant brain regions than linear SVM weights. IG and LRP produced scattered heatmaps that benefited the most from spatial smoothing, gaining up to 0.19 and 0.18 in Dice overlap with the ground truth, respectively, as the size of the Gaussian kernels was varied. The GGC Dice overlap improved by 0.16 at most. In comparison, the SVM weight map was so scattered and noisy that smoothing it improved the Dice overlap by more than 0.3. We refer readers to the Supplementary materials (Section 3) for the complete set of Dice overlaps measured during this experiment. The LRP heatmap was the noisiest and produced the least symmetric results, selecting more voxels in the left hemisphere, as reported in prior studies Böhle et al. (2019).
IG produced the heatmaps with the largest overlaps with the meta-analysis map, and reaching that overlap required less spatial smoothing. These results suggest that the IG heatmap, while more scattered than the other heatmaps, was less noisy overall. All heatmap methods produced brain maps that were closer to the meta-analysis map than the map derived from the baseline support vector machine, and that were better focused on the brain regions impacted by the disease. The additional overlap measures presented in Supplementary materials (Section 4) also indicate a better overlap between the IG heatmaps and the meta-analysis ground truth, and these results are in line with the synthetic results, where IG also outperformed the other heatmap methods.
4.4 Data augmentation
Various techniques could be used to improve classification performance, such as data augmentation, which enlarges the training set Rashid et al. (2021). In this work, we did not employ data augmentation because we were aiming to capture biology-informed patterns, and the most standard form of data augmentation, the inclusion of translated and rotated copies of the training scans Rashid et al. (2021), would have blurred the boundaries of the brain regions that the CNN heatmaps were meant to capture. In the future, we will investigate whether more advanced data augmentation methods, such as the introduction of realistic noise that preserves the boundaries between grey matter and white matter in the MRI scans, can enlarge the training set while retaining the boundaries of the regions of interest.
4.5 Evaluation Metrics
Multiple metrics could be used to measure the overlap between the binary meta-analysis map provided by GingerALE, the continuous SVM coefficient maps, and the continuous brain maps generated by the heatmap methods. In this work, we thresholded the absolute value of the continuous heatmaps and used a well-established metric for the overlap between brain regions, the Dice overlap. Since the meta-analysis ALE map was produced by thresholding a map combining Gaussian kernels of various sizes Eickhoff et al. (2009) (see also https://brainmap.org/ale/), we considered that the thresholded heatmap had to be smoothed, and we explored a broad range of thresholds and smoothing levels to search for the best possible match between the meta-analysis map and the heatmaps. This kind of grid search is not common in the literature, but we think that it was justified to account for the unknown level of smoothing incorporated in the VBM studies and introduced during their combination by GingerALE.
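The grid search described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the order of operations (smoothing the absolute heatmap, then thresholding it at a quantile), the function names, and the parameter grids are assumptions made for the example. The FWHM-to-sigma conversion, FWHM = 2·sqrt(2·ln 2)·sigma, is the standard relation for a Gaussian kernel.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def fwhm_to_sigma(fwhm_mm, voxel_size_mm=1.0):
    """Convert a Gaussian FWHM in millimetres to a standard deviation in voxels."""
    return fwhm_mm / (2.0 * np.sqrt(2.0 * np.log(2.0))) / voxel_size_mm


def dice(a, b):
    """Dice overlap between two binary masks (taken as 0 when both are empty)."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom > 0 else 0.0


def grid_search_dice(heatmap, reference, fwhms_mm, quantiles, voxel_size_mm=1.0):
    """Return (best_dice, best_fwhm_mm, best_quantile) over the parameter grid.

    Assumed pipeline: smooth |heatmap|, then keep the voxels at or above
    a given quantile of the smoothed map.
    """
    best = (-1.0, None, None)
    magnitude = np.abs(heatmap)
    for fwhm in fwhms_mm:
        sigma = fwhm_to_sigma(fwhm, voxel_size_mm)
        smoothed = gaussian_filter(magnitude, sigma=sigma) if sigma > 0 else magnitude
        for q in quantiles:
            mask = smoothed >= np.quantile(smoothed, q)
            score = dice(mask, reference)
            if score > best[0]:
                best = (score, fwhm, q)
    return best


if __name__ == "__main__":
    # Synthetic example: a small cubic "affected region" buried in noise.
    rng = np.random.default_rng(0)
    reference = np.zeros((24, 24, 24), dtype=bool)
    reference[8:14, 8:14, 8:14] = True
    heatmap = reference + rng.normal(size=reference.shape)
    print(grid_search_dice(heatmap, reference,
                           fwhms_mm=[0, 2, 4, 8], quantiles=[0.95, 0.98, 0.99]))
```

Because the best parameters are selected against the reference map itself, the resulting Dice value is best read as an upper bound on the agreement achievable by each heatmap method.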
In addition to Dice overlaps, we computed Receiver Operating Characteristic (ROC) curves and precision-recall curves by treating the absolute values of the heatmaps as classification scores and the meta-analysis map as a set of binary labels. However, the regions affected by Alzheimer’s disease cover only a very small part of the brain. As a result, the area under the ROC curve increased with heatmap smoothing and reached its maximum for heatmaps selecting no voxels in the brain, so the ROC curves were of no use for selecting smoothing parameters. The precision-recall curves were not affected by this issue, but they selected optimal smoothing parameters similar to those found by the grid search over Dice overlaps, as shown in Supplementary materials (Section 4). The precision-recall curves therefore provided few additional insights.
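As a minimal sketch of this voxel-wise evaluation, the snippet below summarizes both curves with scikit-learn, treating each voxel as a sample. The function name `curve_summaries` and the synthetic volume are assumptions made for illustration and are not taken from the study code.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score


def curve_summaries(heatmap, reference):
    """Summarise the ROC and precision-recall curves of one heatmap.

    Voxels are the samples, |heatmap| is the classification score,
    and the binary reference map provides the labels.
    """
    scores = np.abs(heatmap).ravel()
    labels = reference.astype(bool).ravel()
    return {
        "roc_auc": roc_auc_score(labels, scores),
        "average_precision": average_precision_score(labels, scores),
        "positive_fraction": labels.mean(),
    }


if __name__ == "__main__":
    # Synthetic example: the reference region covers under 2% of the volume.
    rng = np.random.default_rng(0)
    reference = np.zeros((24, 24, 24), dtype=bool)
    reference[8:14, 8:14, 8:14] = True
    heatmap = reference + rng.normal(size=reference.shape)
    print(curve_summaries(heatmap, reference))
```

The `positive_fraction` entry is reported because the chance level of average precision equals the fraction of positive voxels, whereas the chance level of the ROC area is 0.5 regardless of class balance; comparing both summaries to their chance levels helps when the reference region is very small.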
5 Conclusion
In this work, we evaluated the ability of three prominent CNN heatmap methods, Layer-wise Relevance Propagation (LRP), Integrated Gradients (IG), and Guided Grad-CAM (GGC), to capture Alzheimer’s disease effects in the ADNI data set by training CNN classifiers and measuring the overlap between their heatmaps and a brain map derived from a large-scale meta-analysis. We found that the three heatmap methods captured brain regions that overlap fairly well with the meta-analysis map, and we observed the best results for the IG method. All three heatmap methods outperformed linear SVM models. These results suggest that the analysis of deep nonlinear models by the most recent heatmap methods can produce more meaningful brain maps than linear and shallow models. Further work will be required to replicate our results and to extend our models to other tasks, such as the study of other neurodegenerative disorders and of healthy aging.
6 Author Contributions
D.W., N.H., and M.H. made substantial contributions to the analysis and interpretation of results. M.H. designed the study. D.W., N.H., and M.H. drafted the article. All authors revised the article critically for important intellectual content, approved the version to be published, and agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
7 Funding
This study was supported in part by the National Institutes of Health (NIH) grants P30AG066546 (South Texas Alzheimer’s Disease Research Center), 5R01HL127659, and 1U24AG074855, and by the San Antonio Medical Foundation grant SAMF – 1000003860.
8 Data Availability
The brain scans used in the present work were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The list of VBM studies combined to produce the meta-analysis map is provided in Supplementary materials (Section 2). The code is available at https://github.com/UTHSCSA-NAL/CNN-heatmap/.
- Note: https://brainmap.org/ale/. Accessed: 2022-02-07. Cited by: §2.5, §4.2, §4.5.
- (2020) 2020 Alzheimer’s disease facts and figures. Alzheimer’s and Dementia 16, pp. 391 – 460. Cited by: §1.
- Sanity checks for saliency maps. Advances in neural information processing systems 31. Cited by: §1.
- Why voxel-based morphometry should be used. Neuroimage 14 (6), pp. 1238 – 1243. Cited by: §1.
- Voxel-based morphometry - the methods. NeuroImage 11 (6 part 1), pp. 805 – 821. Cited by: §1, §2.5.
- Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Medical image analysis 12 (1), pp. 26 – 41. Cited by: §2.2, §2.2.
- An open source multivariate framework for n-tissue segmentation with evaluation on public data. Neuroinformatics 9 (4), pp. 381 – 400. Cited by: §2.2, §2.2.
- On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one 10 (7), pp. e0130140. Cited by: §1, §1, §2.4, §2.4, §2.4.
- Explainable artificial intelligence (xai): concepts, taxonomies, opportunities and challenges toward responsible ai. Information fusion 58, pp. 82 – 115. Cited by: §1.
- Layer-wise relevance propagation for neural networks with local renormalization layers. In Information Science and Applications (ICISA) 2016, pp. 913 – 922. Cited by: §1, §2.4, §4.2.
- Layer-wise relevance propagation for explaining deep neural network decisions in MRI-based Alzheimer’s disease classification. Frontiers in aging neuroscience 11, pp. 194. Cited by: §1, §2.4, §4.3.
- Voxel-based morphometry in Alzheimer’s Disease. Expert review of neurotherapeutics 8 (11), pp. 1691 – 1702. Cited by: §1.
- Classification of mr brain images by combination of multi-cnns for ad diagnosis. In Ninth international conference on digital image processing (ICDIP 2017), Vol. 10420, pp. 875 – 879. Cited by: §1.
- Direct voxel-based comparison between grey matter hypometabolism and atrophy in Alzheimer’s Disease. Brain 131 (1), pp. 60 – 71. Cited by: §1.
- ANIMAL+insect: improved cortical structure segmentation. In Information Processing in Medical Imaging (IPMI), Vol. 1613/1999, pp. 210 – 223. Cited by: §2.2.
- Real time image saliency for black box classifiers. In Advances in Neural Information Processing Systems, pp. 6970 – 6979. Cited by: §1.
- Correspondence of executive function related functional and anatomical alterations in aging brain. Progress in Neuro-Psychopharmacology and Biological Psychiatry 48, pp. 41 – 50. Cited by: §1.
- MUSE: multi-atlas region segmentation utilizing ensembles of registration algorithms and parameters, and locally optimal atlas selection. NeuroImage 127, pp. 186 – 195. Cited by: §2.2.
- Activation likelihood estimation meta-analysis revisited. NeuroImage 59, pp. 2349 – 2361. Cited by: §2.5.
- Coordinate-based activation likelihood estimation meta-analysis of neuroimaging data: a random-effects approach based on empirical estimates of spatial uncertainty. Human brain mapping 30 (9), pp. 2907 – 2926. Cited by: §2.5, §4.2, §4.5.
- Behavior, sensitivity, and power of activation likelihood estimation characterized by massive empirical simulation. NeuroImage 137, pp. 70 – 85. Cited by: §2.5.
- Unbiased average age-appropriate atlases for pediatric studies. NeuroImage 54 (1), pp. 313 – 327. Cited by: §2.2.
- Unbiased nonlinear average age-appropriate brain templates from birth to adulthood. NeuroImage 47, pp. S102. Note: Organization for Human Brain Mapping 2009 Annual Meeting Cited by: §2.2.
- BrainMap taxonomy of experimental design: description and evaluation. Human brain mapping 25 (1), pp. 185–198. Cited by: §4.2.
- White matter hyperintensities and imaging patterns of brain ageing in the general population. Brain 139 (4), pp. 1164–1179. Cited by: §1.
- Advanced brain aging: relationship with epidemiologic and genetic risk factors, and overlap with alzheimer disease atrophy patterns. Translational psychiatry 6, pp. e775. Cited by: §1, §4.3.
- Batch normalization: accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448 – 456. Cited by: §2.3.
- Tracking pathophysiological processes in Alzheimer’s disease: an updated hypothetical model of dynamic biomarkers. The Lancet Neurology 12 (2), pp. 207 – 216. Cited by: §4.3.
- The Alzheimer’s Disease neuroimaging initiative (ADNI): MRI methods. Journal of Magnetic Resonance Imaging 27 (4), pp. 685 – 691. Cited by: Table 1.
- FSL. NeuroImage 62 (2), pp. 782 – 790. Cited by: §2.2.
- Deep learning detection of informative features in tau pet for Alzheimer’s disease classification. BMC bioinformatics 21 (21), pp. 1 – 13. Cited by: §4.2.
- Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015). Cited by: §2.3, §2.7.
- Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25. Cited by: §2.3, §2.3, §2.3.
- From a deep learning model back to the brain - identifying regional predictors and their relation to aging. Human Brain Mapping 41, pp. 3235 – 3252. Cited by: §1.
- A deep learning model for early prediction of alzheimer’s disease dementia based on hippocampal magnetic resonance imaging data. Alzheimer’s & Dementia 15 (8), pp. 1059–1070. Cited by: §1.
- Landmark-based deep multi-instance learning for brain disease diagnosis. Medical image analysis 43, pp. 157 – 168. Cited by: §1.
- Explainable artificial intelligence: concepts, applications, research challenges and visions. In International Cross-Domain Conference for Machine Learning and Knowledge Extraction, pp. 1 – 16. Cited by: §1.
- Explanation in artificial intelligence: insights from the social sciences. Artificial Intelligence 267, pp. 1 – 38. Cited by: §1.
- Gray matter asymmetries in aging and neurodegeneration: a review and meta-analysis. Human Brain Mapping 38 (12), pp. 5890 – 5904. Cited by: §1.
- Methods for interpreting and understanding deep neural networks. Digital Signal Processing 73, pp. 1 – 15. Cited by: §1.
- Explaining nonlinear classification decisions with deep taylor decomposition. Pattern Recognition 65, pp. 211 – 222. Cited by: §1, §2.4.
- Hippocampal atrophy patterns in mild cognitive impairment and Alzheimer’s Disease. Human brain mapping 31 (9), pp. 1339 – 1347. Cited by: §1.
- Hippocampal atrophy patterns in mild cognitive impairment and Alzheimer’s Disease. Human brain mapping 31 (9), pp. 1339 – 1347. Cited by: §4.3.
- Ten simple rules for neuroimaging meta-analysis. Neuroscience & Biobehavioral Reviews 84, pp. 151 – 161. Cited by: §2.5.
- Altered Brain Activity in Unipolar Depression Revisited: Meta-analyses of Neuroimaging Studies. JAMA Psychiatry 74 (1), pp. 47 – 55. Cited by: §2.5.
- Rectified linear units improve restricted Boltzmann machines. In ICML. Cited by: §1, §2.3, §2.3, §2.4.
- Changes in brain morphology in Alzheimer Disease and normal aging: is Alzheimer Disease an exaggerated aging process?. American Journal of Neuroradiology 22 (9), pp. 1680 – 1685. Cited by: §4.3.
- Pytorch: an imperative style, high-performance deep learning library. Advances in neural information processing systems 32. Cited by: §2.3.
- Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825 – 2830. Cited by: §2.3, §2.7.
- DEEPMIR: a deep neural network for differential detection of cerebral microbleeds and iron deposits in mri. Scientific Reports 11. Cited by: §4.4.
- A review on neuroimaging-based classification studies and associated feature extraction methods for alzheimer’s disease and its prodromal stages. NeuroImage 155, pp. 530–548. Cited by: §1.
- Evaluating the visualization of what a deep neural network has learned. IEEE transactions on neural networks and learning systems 28 (11), pp. 2660 – 2673. Cited by: §1, §4.2.
- Explaining deep neural networks and beyond: a review of methods and applications. Proceedings of the IEEE 109 (3), pp. 247 – 278. Cited by: §4.2.
- The neuropathology of probable Alzheimer Disease and mild cognitive impairment. Annals of neurology 66 (2), pp. 200 – 208. Cited by: §1.
- Neural correlates of Alzheimer’s Disease and mild cognitive impairment: a systematic and quantitative meta-analysis involving 1351 patients. NeuroImage 47 (4), pp. 1196 – 1206. Cited by: §1.
- Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618 – 626. Cited by: §1, §1, §2.4, §2.4.
- Hippocampal volume and asymmetry in mild cognitive impairment and Alzheimer’s Disease: meta-analyses of mri studies. Hippocampus 19 (11), pp. 1055 – 1064. Cited by: §4.3.
- Learning important features through propagating activation differences. In International conference on machine learning, pp. 3145 – 3153. Cited by: §1, §2.4.
- Deep inside convolutional networks: visualising image classification models and saliency maps. In Workshop at International Conference on Learning Representations. Cited by: §1.
- A tutorial on support vector regression. Statistics and Computing archive 14 (3), pp. 199 – 222. Cited by: §2.3, §2.7.
- Striving for simplicity: the all convolutional net. In Workshop at International Conference on Learning Representations. Cited by: §1, §2.4, §2.4.
- Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929 – 1958. Cited by: §2.3.
- Axiomatic attribution for deep networks. In International conference on machine learning, pp. 3319 – 3328. Cited by: §1, §1, §2.4, §2.4.
- Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1 – 9. Cited by: §2.3.
- A comparison between the accuracy of voxel-based morphometry and hippocampal volumetry in Alzheimer’s Disease. Journal of Magnetic Resonance Imaging: An Official Journal of the International Society for Magnetic Resonance in Medicine 19 (3), pp. 274 – 282. Cited by: §1.
- Minimizing within-experiment and within-group effects in activation likelihood estimation meta-analyses. Human Brain Mapping 33 (1), pp. 1 – 13. Cited by: §2.5.
- Meta-analysis of the functional neuroanatomy of single-word reading: method and validation. NeuroImage 16 (3), pp. 765 – 780. Cited by: §2.5.
- N4ITK: improved N3 bias correction. IEEE transactions on medical imaging 29 (6), pp. 1310 – 1320. Cited by: §2.2.
- BrainMap vbm: an environment for structural meta-analysis. Human Brain Mapping 39 (8), pp. 3308–3325. Cited by: §4.2.
- Relationships between hippocampal atrophy, white matter disruption, and gray matter hypometabolism in Alzheimer’s Disease. Journal of Neuroscience 28 (24), pp. 6174 – 6181. Cited by: §1.
- DRAMMS: deformable registration via attribute matching and mutual-saliency weighting. In Information Processing in Medical Imaging, pp. 50 – 62. Cited by: §2.2.
- Deconvolutional networks. In 2010 IEEE Computer Society Conference on computer vision and pattern recognition, pp. 2528 – 2535. Cited by: §1, §2.4.
- Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering 19 (1), pp. 27 – 39. Cited by: §1.
- Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921 – 2929. Cited by: §1, §2.4.