Explainable artificial intelligence (XAI) has been gathering attention in the artificial intelligence (AI) and machine learning (ML) community recently. The trend was propelled by the success of deep neural networks (DNNs), especially convolutional neural networks (CNNs) in image processing. DNNs have been considered black boxes because the mechanism underlying their remarkable performance is not well understood. XAI research has thus developed in many different directions. Among them are saliency methods, where heatmaps are generated to explain where an AI model is “looking” when it makes a decision or prediction. Heatmaps are compatible with human visual comprehension and easy to read and interpret, which makes them desirable. However, many of the formulas used to generate the heatmaps are heuristic, and hence do not reveal enough of the underlying mechanism to help us debug, fix or improve the AI model in meaningful ways.
Regardless, the development of heatmap methods has continued without correspondingly reliable ways to evaluate whether one heatmap is better than another. The metrics used to quantify the quality of heatmaps are sometimes indirect, and at other times qualitative assessments of heatmap quality appear to be given in hindsight to fit natural reasoning. This often occurs due to the lack of ground-truth heatmaps against which generated heatmaps can be verified. Under such circumstances, the quality and effectiveness of interpretable heatmaps have nevertheless been demonstrated in several ways. CAM and GradCAM
heatmaps were shown to improve localization on the ILSVRC dataset. By observing the change in log-odds scores after deleting image pixels, the relevance of image pixels to a model’s decision or prediction can be determined as well. The earlier paper
on the development of layerwise relevance propagation (LRP) shows heatmaps generated on many sample data, although many heatmaps do not appear to demonstrate good consistency in their pixel-wise assignment of values (different improvements have since been suggested). Tests were conducted on the effect of transformations on the images, for example by flipping MNIST digits, and a mean prediction is defined to assess the method after interchanging pixels systematically based on the relevance computed by LRP. Still, the paper itself mentions that the analysis is semi-quantitative. The paper that introduced SmoothGrad mentioned that, at the time, there was no ground truth to allow for quantitative evaluation of heatmaps. It then proceeded with two qualitative evaluations instead. As of now, even though there are many different datasets available for AI and ML research, the corresponding ground-truth explanations (such as heatmaps) are typically not available. Note: although heatmaps are sometimes interchangeably called saliency maps, we only refer to them as heatmaps here to distinguish them from the XAI method whose name is Saliency.
XAI methods that are not focused on generating heatmaps have also been developed. This paper is mainly concerned with how to quantitatively compare heatmaps, but we may still benefit from different types of evaluations of XAI performance. Local interpretable model-agnostic explanation (LIME)
is introduced to find a locally faithful interpretable model that represents well the model under inspection, regardless of the latter’s architecture (i.e. it is model-agnostic). By comparing LIME with obviously interpretable models such as decision trees and sparse logistic regression, in particular using recall values, the quality of the feature importance obtained with LIME can be assessed. Experiments on Concept Activation Vectors (section 4.3 of ) include a quantitative comparison of the information used by a model when a ground-truth caption is embedded into the image. In some cases, the caption is used by the model for decision-making, but in other cases, only the image concept is used. Furthermore, human-subject experiments have also been conducted to test the importance of the saliency mask, showing that heatmaps help only marginally for humans to make decisions and that heatmaps can even be misleading. Similar sentiments doubting the usefulness of heatmaps have been expressed elsewhere, for example in the caption of fig. 2 in .
On the other hand, applications of XAI methods have emerged in other fields, where evaluation of heatmaps has been performed in different ways. Still, one should be careful that the evaluations may not always clearly indicate the usefulness of the heatmaps themselves. A study on MRI-based Alzheimer’s disease classification computes the L2 norm between average heatmaps generated by different XAI methods and compares their performance on three other metrics. Ground-truth heatmaps are sometimes available, for example in the diagnosis of lung nodules, where recall values can be directly computed between the reference features (ground truth) and the heatmaps generated by different XAI methods. Different kinds of ground truth have been obtained using specialized methods, such as NeuroSynth in for analyzing neuroimaging data. Some parts of the evaluation appear qualitative (such as the group-level evaluation), though the paper uses the F1-score to evaluate the heatmaps, thus naturally including recall and precision in the evaluation. Other applications of XAI methods, especially heatmaps, in the medical field include [6, 32, 11, 16, 2, 28, 15, 17, 10].
In this paper, in section II-A we first introduce a synthetic dataset containing images with simple features and ground-truth heatmaps that can be generated on demand. The aim is to provide a standardized dataset to compare the viability and effectiveness of heatmaps generated by XAI methods in providing explanations. Ground-truth heatmaps are automatically generated alongside the image data and labels, avoiding the need to manually mark heatmap features, which is a very laborious process. In this 10-class dataset, each data sample consists of an object with a simple shape and a corresponding heatmap designed to be unambiguous, which is the core feature intended to address the problems mentioned above. In short, we provide a dataset where heatmaps can be verified in a more objective way. The rest of section II describes the implementation of the neural network training, validation and evaluation processes, followed by a description of the five-band score, a metric defined to capture quantities such as recall and precision while taking into account the distinct meaningful regions in heatmaps. Section III discusses the recall-precision results and ROC curves we obtained. Finally, we conclude with recommendations on which methods may be useful in specific cases and provide some caveats.
II. Data and Methodology
This section describes the workflow starting from data generation, network training, network performance evaluation, heatmap generation and the evaluation of generated heatmaps with common quantities. The workflow is shown in fig. 1, closely following the sequence of commands run in the provided Python code package (https://github.com/etjoa003/explainable_ai/tree/master/xai_basic). Some details, such as the algorithms needed to generate each data sample, can be traced from the tutorials available as Jupyter notebooks included in the package.
We provide algorithms that can generate the dataset shown in fig. 2 on demand, where the top three rows are the images and the last three rows are the corresponding ground-truth heatmaps. The ten different classes of cells are shown along the columns. Types 0, 1 and 2 are circular cells with a border (algo. 1), with a bar (or minus sign) and with a plus sign (algo. 2) respectively. Types 3, 4 and 5 are rectangular cells with different dominant colors. Types 6, 7 and 8 are circular cells with one, three and eight tails respectively. The last class (type 9) does not contain any cell. Three types of backgrounds are provided to increase the variation of the dataset, as shown separately in the first three rows of the same figure.
The ground-truth heatmaps have been designed to mark features that distinguish all the classes in a way that is as unambiguous as possible, subject to human judgment. Admittedly, there may not exist a unique unambiguous way of defining them. Where appropriate, the heatmaps could be readjusted by editing the heatmap generator classes in the code package. The heatmaps are shown in fig. 2, rows 4 to 6. With this dataset, a fair comparison between heatmaps generated by different XAI methods can be performed. In this particular implementation, each ground-truth heatmap is normalized to [0, 1], and thus heatmaps to be compared with it are expected to be normalized likewise. Each ground truth consists of an array of values of size (H, W) with three distinct regions: (1) regions of value 0 (shown as white background) that should not contribute to the neural network prediction, (2) regions of value 0.4 for localization (shown as light red) and (3) regions of value 0.9 for the discriminative feature (shown as dark red), where the discriminative feature is also qualitatively considered part of the localization. In this work, two other distinct regions are defined symmetrically, with values -0.4 and -0.9, to accommodate the fact that some heatmap methods have been interpreted such that negative regions (shown in blue in this paper) are considered to contribute against a given decision or prediction. For our dataset, the ground truth does not contain any negatively contributing regions, although, as will be shown later, some XAI methods still generate negative values.
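As an illustration, a type 0 cell (circle with border) and its three-region ground truth can be sketched as follows. The geometry, colors and the values 0/0.4/0.9 follow the description above, while the function name, image size and noise levels are hypothetical stand-ins for the generator classes in the package:

```python
import numpy as np

def make_type0_sample(size=64, radius=18, border=3, seed=0):
    """Sketch of a type-0 cell image and its ground-truth heatmap:
    background 0, localization region 0.4 (interior), discriminative
    feature 0.9 (the border ring)."""
    rng = np.random.default_rng(seed)
    yy, xx = np.mgrid[:size, :size]
    cy, cx = size // 2, size // 2
    dist = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2)

    interior = dist < radius - border
    ring = (dist >= radius - border) & (dist < radius)

    # Image: 3 channels, noisy background, red-dominant border, pale interior.
    img = rng.uniform(0.0, 0.1, (3, size, size))
    img[:, interior] = 0.7
    img[0, ring] = 0.9
    img[1:, ring] = 0.1

    # Ground-truth heatmap with the three distinct regions described above.
    gt = np.zeros((size, size))
    gt[interior] = 0.4  # localization
    gt[ring] = 0.9      # discriminative feature
    return img, gt
```

The other nine classes would follow the same pattern with different shapes and markings.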
For this paper, the training, validation and evaluation datasets are prepared in 32, 8 and 8 shards respectively, each shard containing 200 samples uniformly and randomly drawn from the 10 classes. In other words, the datasets contain 6400, 1600 and 1600 samples respectively in total. The dataset is prepared in shards for practical purposes, for example, to prevent a full restart in case of interruption of data downloading and caching, and to facilitate a more efficient training process in evaluation mode, indicated in fig. 1 as part 2.
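The shard layout above can be mocked in a few lines; the function name and seeds are hypothetical, and only class labels are generated here (the real package stores images and heatmaps alongside them):

```python
import numpy as np

def make_label_shards(n_shards, shard_size=200, n_classes=10, seed=0):
    """Draw `n_shards` shards of labels uniformly from the 10 classes."""
    rng = np.random.default_rng(seed)
    return [rng.integers(0, n_classes, shard_size) for _ in range(n_shards)]

train_shards = make_label_shards(32, seed=1)  # 32 * 200 = 6400 samples
val_shards = make_label_shards(8, seed=2)     # 1600 samples
eval_shards = make_label_shards(8, seed=3)    # 1600 samples
```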
TABLE I: TC and TR denote training in continuous and regular evaluation mode respectively. The table also lists the average accuracy over the 5 models branched from each base model, the batch size, the number of epochs and the image shapes.
After the data is cached or saved, the process starts with training in continuous mode, indicated as part 1 of the workflow in fig. 1. In this mode, pre-trained models are first downloaded from Torchvision and modified for compatibility with the PyTorch Captum API. The three pre-trained models used are AlexNet, ResNet34 and VGG, corresponding to workflows 1, 2 and 3 in the code. In this phase, training proceeds continuously for the purpose of fine-tuning the models to our data. The number of epochs and batch size are specified in table I. The Adam optimizer is used; the learning rate differs between ResNet and the other two models (AlexNet and VGG), while the same weight decay is used for all. A plot of the loss against training iterations (not shown in this paper) is saved as a figure in part 1.2 of the workflow. We refer to the model trained after this phase as the base model.
The next phase is training in regular evaluation mode, indicated as part 2 in fig. 1. The training uses the same optimizer as the previous phase, and the numbers of epochs used are also shown in table I. Evaluation is performed every 4 training iterations; more accurately, this part is known as validation in the machine learning community, separate from the final evaluation. Each validation is performed on a shard randomly drawn from the 8 shards of the validation dataset. We set the target accuracy to 0.96. If during validation the accuracy computed on that single shard exceeds the target accuracy, training is stopped and evaluation on all validation data shards is performed. The total validation accuracy is used to ensure that the validation accuracy on a single shard is not high by pure chance. While the total validation accuracy can be slightly lower, our experiments so far indicate that there is no such problem. Furthermore, only ResNet attained the target accuracy within the specified setting. For AlexNet and VGG, 0.96 is never exceeded, and the early stopping mechanism is triggered to prevent unnecessarily long, unfruitful training; note that, fortunately, the total accuracy when evaluated on the final evaluation dataset is still very high, as shown in table I. The early stopping mechanism is as follows. Whenever validation on a single shard does not achieve the target accuracy, (1) if there is no improvement in the validation accuracy, the early stopping counter is increased by one; (2) if there is improvement in the validation accuracy, the counter is reduced by a refresh fraction, so that the process is given more chance to train longer. If the counter reaches the early stopping limit, training is stopped.
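The early-stopping rule above can be sketched as a small update function. The names `refresh` (refresh fraction) and `limit` (early stopping limit) are hypothetical, and reducing the counter by multiplying it with the refresh fraction is one plausible reading of the rule in the text:

```python
def update_early_stopping(counter, improved, refresh=0.5, limit=10):
    """One plausible sketch of the early-stopping update after a validation
    check that missed the target accuracy. Returns (new_counter, stop)."""
    if improved:
        # Improvement: shrink the counter so training gets more time.
        counter = int(counter * refresh)
    else:
        # No improvement: move one step closer to stopping.
        counter += 1
    return counter, counter >= limit
```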
We repeat the above process of training in regular evaluation mode four more times starting from the base model, and thus we have a total of 5 branch models. Note that the number of epochs and batch size are set so that AlexNet and VGG can be trained for a longer period (shown in table I), since they both achieve lower accuracy than ResNet when given the same settings. This is possibly because (1) a larger batch size means fewer iterations per epoch and (2) the improvement in accuracy is inherently slower, considering that ResNet has been known to generally perform better. Here, comparing prediction accuracy in a precise manner is not very meaningful, as we focus on the heatmaps later. No attempt is made to train the models to perfect accuracy, as a few erroneous predictions are kept so that their heatmaps can be compared with heatmaps from correct predictions. There is no need for k-fold validation here since the validation dataset is completely separate from the training dataset.
II-C. Evaluation and XAI implementation
This part corresponds to part 3 of fig. 1, where heatmaps
are computed using the following XAI methods available in the PyTorch Captum API: Saliency, Input*Gradient, DeepLift, GuidedBackprop, GuidedGradCam, Deconvolution, GradientShap and DeepLiftShap. Integrated Gradients has been excluded as it is comparatively inefficient with ResNet. Note also that the original implementation of Layerwise Relevance Propagation (LRP) has been shown to be equivalent to gradient*input or DeepLIFT, depending on a few conditions. All heatmaps are derived from the predicted classes, not the true classes (for some XAI methods, an explanation can be extracted from the probability of predicting not only the correct class, but also other classes). The following is the sequence of processing leading to the final results.
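As background to the attribution step itself: Saliency, the simplest of the methods listed, is the (absolute) gradient of the predicted-class score with respect to the input, which is what Captum's Saliency attribution computes. A minimal sketch using plain autograd and a tiny stand-in network (instead of the paper's AlexNet/ResNet34/VGG) follows:

```python
import torch
import torch.nn as nn

# Tiny hypothetical classifier with 10 outputs, standing in for the fine-tuned
# Torchvision models used in the paper.
torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(3, 4, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(4, 10),
)
x = torch.rand(1, 3, 64, 64, requires_grad=True)

scores = model(x)
target = scores.argmax(dim=1).item()  # explain the predicted class, as above
scores[0, target].backward()
heatmap = x.grad.abs()                # one relevance value per input pixel
```

With Captum installed, the same heatmap would be obtained via `Saliency(model).attribute(x, target=target)`.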
Channel adjustments. Each heatmap h, which has shape (C, H, W) (C = 3 for the three color channels), is compressed along the channels to shape (H, W) by sum-pixel-over-channels, where the values are summed pixel-wise over all channels, i.e. h_(x,y) = sum over c of h_(c,x,y) when written component-wise. This is so that it can be compared with the ground-truth heatmap of shape (H, W). Normalization to [-1, 1] is also performed by the absolute-max-before-sum scheme, dividing by the maximum absolute value over all pixels in that single heatmap, so that the overall channel adjustment maps each raw heatmap to a channel-summed, normalized heatmap. The practice of summing over channels can be seen, for example, in the LRP tutorial site .
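A sketch of this adjustment step follows. It assumes the normalizing constant is the maximum absolute pixel value of the channel-summed heatmap; the exact ordering of the max-taking and summing in the paper's absolute-max-before-sum scheme may differ:

```python
import numpy as np

def channel_adjust(h):
    """Compress a (C, H, W) heatmap to (H, W) and scale it into [-1, 1]."""
    flat = h.sum(axis=0)          # sum-pixel-over-channels
    m = np.abs(flat).max()        # maximum absolute value over all pixels
    return flat / m if m > 0 else flat
```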
Five-band stratification. Adjusted heatmaps are subsequently evaluated using the five-band score, where each pixel is assigned one of the five values previously described. The value 2 is designated for the discriminative feature, 1 for localization, 0 for the irrelevant background, while -1 and -2 are symmetrically defined for negative contributions to the model prediction or decision. Recall that our ground-truth heatmap pixels have been assigned one of the values 0, 0.4 and 0.9. Regardless of the intermediate processing of the heatmap, the mapping for the ground truth is always 0 to 0, 0.4 to 1 and 0.9 to 2. To map the heatmap h, which has by now been normalized to [-1, 1], a pair of thresholds 0 < t1 < t2 < 1 is used, so that for each pixel a transformation we refer to as five-band stratification is performed in the following manner: the pixel value is mapped to 2 if it is at least t2, to 1 if it lies in [t1, t2), to 0 if it lies in (-t1, t1), to -1 if it lies in (-t2, -t1] and to -2 if it is at most -t2. Up to this point, we have the stratified heatmap s(h), where s denotes the five-band stratification; a bracketed subscript, s(h)_(x,y), is used to denote the value at pixel (x, y) for notational convenience later.
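The stratification can be sketched directly from the definitions above; the concrete values of the thresholds (t1, t2) are hypothetical stand-ins for those used in the package:

```python
import numpy as np

def stratify(h, t1, t2):
    """Map a heatmap normalized to [-1, 1] onto labels {-2, -1, 0, 1, 2}."""
    s = np.zeros(h.shape, dtype=int)
    s[h >= t2] = 2
    s[(h >= t1) & (h < t2)] = 1
    s[(h <= -t1) & (h > -t2)] = -1
    s[h <= -t2] = -2
    return s

def stratify_ground_truth(g):
    """The ground-truth mapping is fixed: 0 -> 0, 0.4 -> 1, 0.9 -> 2."""
    return np.select([g == 0.9, g == 0.4], [2, 1], default=0)
```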
Five-band score. After stratification, for each heatmap we compute the accuracy, the precision TP/(TP + FP + eps) and the recall TP/(TP + FN + eps), where the accuracy is the fraction of correctly assigned pixels over the total number of pixels, TP is the number of true-positive pixels, FP the number of false positives, FN the number of false negatives and eps a small constant for smoothing. TP here is slightly different from the TP used in the binary case. We only count a true positive when the stratified heatmap pixel is positive and equals the stratified ground-truth pixel, i.e. we use the stringent condition that the labels for localization and discriminative features must be correctly hit. Likewise, a false positive is counted when the stratified heatmap pixel is positive but does not match the ground truth, whereas a false negative is counted when the ground-truth label is positive but the stratified heatmap does not match it. To plot receiver operating characteristics (ROC), the false positive rate FPR = FP/(FP + TN + eps) is also computed, where TN is the number of true negatives.
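The stringent counts above can be sketched on stratified maps as follows. The treatment of true negatives (both maps exactly 0) is an assumption, since the ground truth contains no negative regions:

```python
import numpy as np

def five_band_scores(pred, gt, eps=1e-9):
    """Stringent five-band metrics on stratified maps pred and gt."""
    tp = np.sum((pred > 0) & (pred == gt))  # exact hit on a positive label
    fp = np.sum((pred > 0) & (pred != gt))
    fn = np.sum((gt > 0) & (pred != gt))
    tn = np.sum((pred == 0) & (gt == 0))    # assumed definition
    return {
        "accuracy": np.mean(pred == gt),
        "precision": tp / (tp + fp + eps),
        "recall": tp / (tp + fn + eps),
        "fpr": fp / (fp + tn + eps),
    }
```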
Soft five-band scores. As seen, the thresholds defined above are sharp, and values near any of the thresholds might not be properly accounted for. We thus instead use soft five-band scores, where the metrics are collected over a range of threshold pairs. More precisely, for the k-th data sample, we obtain the metrics at each threshold pair by comparing the stratified ground truth with the stratified heatmap, where the heatmap has undergone the channel adjustment process previously described. The best and average values for each sample over the different thresholds are then saved, sample by sample, into a csv file in the XAI result folder for analysis in the discussion section. These values are identified by their positions among the shards, the predicted class and the true class.
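The soft scoring can be sketched as a sweep over a threshold grid, keeping the best and average recall per sample; the grid values and sample values here are purely illustrative:

```python
import numpy as np

def recall_at(h, gt_strat, t1, t2, eps=1e-9):
    """Positive-side stratification followed by the stringent recall."""
    s = np.zeros(h.shape, dtype=int)
    s[h >= t2] = 2
    s[(h >= t1) & (h < t2)] = 1
    tp = np.sum((s > 0) & (s == gt_strat))
    fn = np.sum((gt_strat > 0) & (s != gt_strat))
    return tp / (tp + fn + eps)

grid = [(0.2, 0.6), (0.5, 0.9)]       # hypothetical soft thresholds
h = np.array([0.85, 0.45, 0.05])      # channel-adjusted heatmap (one sample)
gt = np.array([2, 1, 0])              # stratified ground truth
recalls = [recall_at(h, gt, t1, t2) for t1, t2 in grid]
best, avg = max(recalls), float(np.mean(recalls))
```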
Receiver operating characteristic. To compare the performances of the different XAI methods mentioned above, ROC curves are also obtained, as shown in fig. 4. For each threshold pair, the mean values of FPR and recall over all samples in the evaluation dataset contribute a single point to the figure. Unlike the usual binary ROC, changing the thresholds in the multi-band setting we defined does not guarantee a change from false negative to true positive (or vice versa). For example, a pixel that begins as a true negative can become a false positive or a true positive, depending on its true label, when the thresholds are lowered. Hence, we will not always obtain a curve that starts at (0, 0) and ends at (1, 1) in the ROC space, unlike the usual ROC curve. Regardless, by a simple understanding of the rates of change of FPR and recall, the usual rule of thumb that associates a steeper increase in recall with better ROC quality should still hold. Mathematically, the more optimal ROC curve lies nearer the top-left vertices of the convex hull formed by the points. There have been studies on multi-dimensional ROC curves and their “volume under the surface” [26, 4], though the difficulty of visualizing them makes them unsuitable for comparison here. With the definitions of TP, FP, TN and FN above, we have instead created pseudo-binary conditions.
III-A. Recall vs Precision
We provide recall vs precision scores as shown in fig. 3. Each point in the plot corresponds to an XAI method, for example Saliency, applied on a single branch of the corresponding model trained from the base model. There are 5 points per method, as we have trained 5 branches per architecture. Naturally, the higher the precision and recall are, the better the XAI method. Each point is a pair of mean precision and mean recall, averaged over the samples in the evaluation dataset. Thus fig. 3(A) plots the mean best precision vs the mean best recall for ResNet, AlexNet and VGG respectively, where “best” is taken over the soft thresholds. Likewise, fig. 3(B) plots the mean average precision vs the mean average recall. After qualitative assessment of some of the generated heatmaps, we perform a similar analysis with clamping applied to the heatmap values after the first normalization process, roughly following the idea in . In other words, the channel adjustment process described in the previous section is modified so that values at or above a clamping threshold are set to 1, values at or below the negative of the threshold are set to -1, and values in between are left unchanged. A different set of soft thresholds has been used to match the clamping process as well. Fig. 3(C) is thus the same as fig. 3(B), except with the clamping process applied, where we do observe some changes in the precision and recall scores, more notably for AlexNet and VGG, though not necessarily better.
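The clamping step can be sketched as follows, with a hypothetical clamping threshold c standing in for the one used in the experiments:

```python
import numpy as np

def clamp_adjust(h, c=0.8):
    """Saturate values at or beyond +/-c to +/-1; leave the rest unchanged."""
    out = h.copy()
    out[h >= c] = 1.0
    out[h <= -c] = -1.0
    return out
```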
Recall scores are generally low for most of the points in fig. 3, indicating a high number of false negatives. The first obvious cause is that most XAI methods on all architectures appear to assign 0 values to regions that contain either localization pixels or discriminative features. For example, fig. 5 shows the heatmaps from the different channels R, G and B (extracted before summing pixels over channels). The heatmaps generally appear granular and non-continuous, having many white pixels in between the red pixels, thus contributing to false negatives. Furthermore, most of the inner body of the cells (represented by light red in the ground truth) is completely unmarked by most of the XAI methods, contributing a very large number of false negatives. The highest recall values in fig. 3(A) are attained by Saliency applied on AlexNet. This is consistent with visual inspection of the heatmaps across different methods and architectures, because Saliency assigns many more red pixels in relevant regions, while other methods often assign blue pixels (negative values) in an unpredictable manner and highlight only the edges.
Similar to the heatmaps shown in fig. 5, produced by Guided GradCAM applied on VGG, many of the XAI methods only highlight the edges of the cell borders, sometimes faintly. As such, the comparatively high recall values for Saliency can be qualitatively accounted for by the halo of high-valued heatmap pixels encompassing the relevant area, although not in a very precise or compact manner. Deconvolution, on the other hand, has relatively higher recall scores due to its large amount of artifact pixels. The quality of its heatmaps has therefore been undermined, reflected in its low precision score. Other methods such as Guided GradCAM are still capable of highlighting some of the relevant regions and, to reiterate, many of them tend to highlight the edges, as seen in the heatmaps in the supp. material. Also, AlexNet tends to produce denser heatmaps than the other two, giving rise to slightly higher recall scores than VGG, while the average scores for ResNet are very low. Depending on the context, different XAI methods can be the better choice based on their strengths and weaknesses, although adjustments to existing interpretations may be necessary.
Differences in responses to color channels are also observed. The Saliency method appears as positive values (red) in all channels as shown in fig. 5, although a type 3 cell has only one color channel with a strong input signal, since its border is predominantly red; in the implementation, the normalized border color is predominantly red with a small uniform random perturbation. On the other hand, Guided GradCAM marks the green and blue channels with negative values. If these are to be interpreted as negative contributions, the interpretation is consistent. But when the heatmaps are summed over channels, as we have done, the offsetting effect of the negative values becomes questionable. In other methods, such color responses are variable. For example, input*gradient on AlexNet does appear to exhibit color responses as well (not shown), although the quality is highly variable too. It is thus difficult to strongly recommend any one method as specializing in color detection, even Guided GradCAM.
Interpretation of heatmap values. Fig. 6 shows, in columns, the heatmaps obtained after summing pixels over channels, one of the earlier steps in the previous section. The figure also demonstrates the effect of soft five-band stratification, showing that the selection of thresholds does affect the scores. In the previous section, we addressed this by distinguishing between the best and average recall and precision values over the soft thresholds, which is the main purpose of fig. 3(B). The effect of threshold changes is variable across the different XAI methods. If we focus on recall scores, from visual inspection of fig. 6(B), the XAI community may need to revise the idea of negative values in heatmaps. Clearly, the DeepLift and DeepLiftShap examples show that they would score much better recall if we took the absolute values of the heatmaps and applied the same process from stratification to the computation of the five-band scores.
SHAP, DeepLift and background effects. When SHAP is applied to DeepLift, the effect appears to be background artifact removal, thus confining non-zero heatmap pixel values to more relevant regions (fig. 6(B) and supp. materials). Still, we need to point out that the heatmaps can be inconsistent even across correct predictions of the same class, as shown in fig. 6(C). The figure shows two heatmaps of different qualities generated by DeepLift on VGG for type 8 cells that are correctly predicted. It may be tempting to guess at possible reasons, such as the backgrounds; more investigation of the signals activated by the background may be necessary.
From observing fig. 3 and many heatmaps, for example the figures in the appendix, it is tempting to deduce that deeper networks (AlexNet shallowest, followed by VGG, then ResNet deepest) tend to produce heatmaps that are more sensitive to the edges but cover the bulk of discriminative features and localization regions less thoroughly. To test this, we conduct an experiment on AlexNet modified by systematically adding more and more convolutional layers, trained and then evaluated for precision vs recall in the same manner as before. The numbers of layers added are 1, 2, …, 8, and the plot is made by computing the mean values of recall and precision as before, except that the points are collected separately based on the predicted values (whereas in the previous section, averages were taken over all test samples regardless of predicted values). The expectation is for the precision and recall values to be nearer to 1 (more towards the top right of the plot) for the modified AlexNet with fewer additional layers. However, as shown in appendix fig. 1, this does not appear to be the case.
III-B. ROC curve
The ROC plot in fig. 4 shows that most heatmap methods tested lie in traditionally poor ROC regions. There appear to be trade-offs between higher recall values (which is good) and higher FPR (which is bad), most prominently shown by the Saliency method. Deconvolution appears to be the best, as it has the greatest rate of increase in recall relative to FPR. However, this is misleading, since Deconvolution starts with many false-positive predictions in all three architectures, as shown by the grid-like artifacts in fig. 5. This causes FP to change more quickly, and the ROC fails to make a good comparison between Deconvolution and other methods. Saliency tends to “over-assign” heatmap pixels around the correct region; consider fig. 5, fig. 6 and the appendix, compared to, for example, Guided GradCAM and DeepLift. Unlike DeepLift and DeepLiftShap, the Saliency ROC shows higher recall because of more correct assignments of positive (red) values, but also higher FPR because of the assignment of positive values in supposedly white regions. Guided GradCAM appears to have some difficulty improving through the change of thresholds. Considering the way sum-pixel-over-channels is performed and its color-channel sensitivity, its performance might have suffered from incompatible heatmap pre-processing. The ROC curves for the other methods do not provide sufficiently distinct trends that favor the adoption of one method over another. There may be a need to investigate the different ways soft-thresholding can be performed for specific XAI methods, to at least bring the ROC curves into traditionally favorable regions.
III-C. Other observations
For images of type 9 (no cell present), we generally see heatmaps in the form of artifacts appearing as well-spaced spots forming a lattice (see appendix fig. 40 etc.), similar to the heatmaps from the Deconvolution method in fig. 5. In some cases, for example appendix fig. 17 row 2, we can see that DeepLift is able to “provide” the correct reasoning for a wrong prediction. In that figure, a type 0 cell whose shape almost looks like a single tail was mistaken for type 6, though some similar wrong predictions are not highlighted in a similar manner. In many other cases of wrong predictions (see the heatmaps in the appendix), it is unclear what the highlighted regions mean.
Recommendations and Caveats. Regardless of the imperfect performance, relative comparisons between the XAI methods can be made.
The Saliency method appears to highlight the relevant regions most extensively, which makes it more suitable for localization in cases where false positives are not a concern. In particular, Saliency applied on AlexNet scores the highest recall.
If only the edges of the features are needed, VGG and ResNet with input*gradient, DeepLift and DeepLiftShap seem to be reasonable choices, while the same heatmap methods on AlexNet seem to produce heatmaps that go beyond capturing just the edges in rather inconsistent ways. Compared to Saliency, they may be more useful for detecting small, hard-to-observe discriminative features, e.g. in medical and other dense images.
The heatmaps produced by ResNet appear to be the sparsest, followed by VGG and then AlexNet. Input size and network depth may be the reasons.
Research into the role of negative values in heatmaps may be necessary. If we continue with the interpretation that negative values correspond to negative contributions to the prediction, some XAI methods such as DeepLift and DeepLiftShap may be completely incomprehensible.
More investigation may be needed to find the best channel adjustment, to handle the phenomenon where large continuous patches or areas are ignored by many of the tested methods, and to understand the activation of signals caused by the background.
We have provided an algorithm to produce synthetic data that we hope can serve as a baseline for testing XAI methods, especially those in the form of saliency maps or heatmaps. Some XAI methods appear to be more suitable for localization, while others are more responsive to the edges of the features. The modifications required to boost the explanatory power of XAI methods might differ across methods, making fair comparison a possibly difficult task. At the least, for each application of an XAI method, we should attempt to find a clear, consistent interpretation within the same context of study. For example, if negative values need to be replaced with their absolute values in some application, an accompanying experiment is needed to show the effect and implications of performing such a transformation. As of now, XAI remains a challenging problem. However, it does exhibit good potential to improve the reliability of black-box models in the future.
This research was supported by Alibaba Group Holding Limited, DAMO Academy, Health-AI division, under the Alibaba-NTU Talent Program. The program is a collaboration between Alibaba and Nanyang Technological University, Singapore.
- (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE 10(7), pp. 1–46.
- (2018) Multiple instance learning for heterogeneous images: training a CNN for histopathology. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, A. F. Frangi, J. A. Schnabel, C. Davatzikos, C. Alberola-López, and G. Fichtinger (Eds.), Cham, pp. 254–262.
- (2019) Testing the robustness of attribution methods for convolutional neural networks in MRI-based Alzheimer’s disease classification. In Interpretability of Machine Intelligence in Medical Image Computing and Multimodal Learning for Clinical Decision Support, K. Suzuki, M. Reyes, T. Syeda-Mahmood, B. Glocker, R. Wiest, Y. Gur, H. Greenspan, and A. Madabhushi (Eds.), Cham, pp. 3–11.
- (2003) Volume under the ROC surface for multi-class problems. In Machine Learning: ECML 2003, N. Lavrač, D. Gamberger, H. Blockeel, and L. Todorovski (Eds.), Berlin, Heidelberg, pp. 108–120.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
- (2019) CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. CoRR abs/1901.07031.
- (2018) Interpretability beyond feature attribution: quantitative testing with concept activation vectors (TCAV). In ICML, J. G. Dy and A. Krause (Eds.), JMLR Workshop and Conference Proceedings, Vol. 80, pp. 2673–2682.
- (2016) Investigating the influence of noise and distractors on the interpretation of neural networks.
- (2014) One weird trick for parallelizing convolutional neural networks. CoRR abs/1404.5997.
- (2019) Generation of multimodal justification using visual word constraint model for explainable computer-aided diagnosis. In Interpretability of Machine Intelligence in Medical Image Computing and Multimodal Learning for Clinical Decision Support, K. Suzuki, M. Reyes, T. Syeda-Mahmood, B. Glocker, R. Wiest, Y. Gur, H. Greenspan, and A. Madabhushi (Eds.), Cham, pp. 21–29.
- (2018) Brain biomarker interpretation in ASD using deep learning and fMRI. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, A. F. Frangi, J. A. Schnabel, C. Davatzikos, C. Alberola-López, and G. Fichtinger (Eds.), Cham, pp. 206–214.
- (2015) Very deep convolutional neural network based image classification using small training sample size. In 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pp. 730–734.
- (accessed August 16, 2020) LRP tutorial.
- (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 4765–4774.
-  (2020) Explainable AI for medical imaging: deep-learning CNN ensemble for classification of estrogen receptor status from breast MRI. In Medical Imaging 2020: Computer-Aided Diagnosis, H. K. Hahn and M. A. Mazurowski (Eds.), Vol. 11314, pp. 228 – 235. External Links: Cited by: §I.
-  (2018) Generalizability vs. robustness: investigating medical imaging networks using adversarial examples. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, A. F. Frangi, J. A. Schnabel, C. Davatzikos, C. Alberola-López, and G. Fichtinger (Eds.), Cham, pp. 493–501. External Links: Cited by: §I.
-  (2018) Autofocus layer for semantic segmentation. CoRR abs/1805.08403. External Links: Cited by: §I.
-  (2016) “Why should i trust you?”: explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, New York, NY, USA, pp. 1135–1144. External Links: Cited by: §I.
-  (2019-05-01) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1 (5), pp. 206–215. External Links: Cited by: §I.
-  (2016) Grad-cam: why did you say that? visual explanations from deep networks via gradient-based localization. CoRR abs/1610.02391. External Links: Cited by: §I, §II-C.
-  (2017-06–11 Aug) Learning important features through propagating activation differences. D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 3145–3153. External Links: Cited by: §II-C.
-  (2017) Learning important features through propagating activation differences. CoRR abs/1704.02685. External Links: Cited by: §I, §II-C.
-  (2014) Deep inside convolutional networks: visualising image classification models and saliency maps. In Workshop at International Conference on Learning Representations, Cited by: §II-C.
-  (2017) SmoothGrad: removing noise by adding noise. CoRR abs/1706.03825. External Links: Cited by: §I.
-  (2014) Striving for simplicity: the all convolutional net. External Links: Cited by: §II-C.
-  (1999) Note on the location of optimal classifiers in n-dimensional roc space. Technical report . Cited by: §II-C.
-  (2017) Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 3319–3328. Cited by: §II-C.
-  (2019) Interpretable classification of alzheimer’s disease pathologies with a convolutional neural network pipeline. Nature Communications 10 (1), pp. 2173. External Links: Cited by: §I.
-  (2019) Analyzing Neuroimaging Data Through Recurrent Deep Learning Models. Front Neurosci 13, pp. 1321. Cited by: §I.
-  (2019) Enhancing the extraction of interpretable information for ischemic stroke imaging from deep neural networks. External Links: Cited by: §III-A.
-  (2014) Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), Cham, pp. 818–833. External Links: Cited by: §II-C.
-  (2018) Respond-cam: analyzing deep models for 3d imaging data by visualizations. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, A. F. Frangi, J. A. Schnabel, C. Davatzikos, C. Alberola-López, and G. Fichtinger (Eds.), Cham, pp. 485–492. External Links: Cited by: §I.
Learning deep features for discriminative localization. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 2921–2929. External Links: Cited by: §I.
-  (2019) Guideline-based additive explanation for computer-aided diagnosis of lung nodules. In Interpretability of Machine Intelligence in Medical Image Computing and Multimodal Learning for Clinical Decision Support, K. Suzuki, M. Reyes, T. Syeda-Mahmood, B. Glocker, R. Wiest, Y. Gur, H. Greenspan, and A. Madabhushi (Eds.), Cham, pp. 39–47. External Links: Cited by: §I.