Quantifying Explainability of Saliency Methods in Deep Neural Networks

09/07/2020 ∙ by Erico Tjoa, et al. ∙ Nanyang Technological University

One way to achieve eXplainable artificial intelligence (XAI) is through the use of post-hoc analysis methods. In particular, methods that generate heatmaps have been used to explain black-box models such as deep neural networks. In some cases, heatmaps are appealing because they can be understood in intuitive, visual ways. However, quantitative analysis demonstrating the actual potential of heatmaps has been lacking, and comparisons between different methods are not standardized either. In this paper, we introduce a synthetic dataset that can be generated ad hoc, along with ground-truth heatmaps, for better quantitative assessment. Each data sample is an image of a cell with easily distinguishable features, facilitating a more transparent assessment of different XAI methods. Comparisons and recommendations are made, and shortcomings are clarified along with suggestions for future research directions to handle the finer details of selected post-hoc analysis methods.







I Introduction

EXplainable artificial intelligence (XAI) has recently been gathering attention in the artificial intelligence (AI) and machine learning (ML) community. The trend was propelled by the success of deep neural networks (DNNs), especially convolutional neural networks (CNNs), in image processing. DNNs have been considered black boxes because the mechanism underlying their remarkable performance is not well understood. XAI research has thus developed in many different directions. Among them are saliency methods, in which heatmaps are generated and used to explain where an AI model is “looking” when it makes a decision or prediction. Heatmaps are desirable because they are compatible with human visual comprehension and easy to read and interpret. However, many of the formulas used to generate heatmaps are heuristic, and hence do not reveal enough of the underlying mechanism to help us debug, fix or improve the AI model in meaningful ways.

Regardless, the development of heatmap methods has continued without correspondingly reliable ways to evaluate whether one heatmap is better than another. The metrics used to quantify the quality of heatmaps are sometimes indirect, and at other times qualitative assessments of heatmap quality appear to be given in hindsight to fit natural reasoning. This often occurs due to the lack of ground-truth heatmaps against which to verify the correctness of the generated heatmaps. Under such circumstances, the quality and effectiveness of interpretable heatmaps have nevertheless been demonstrated in several ways. CAM [33] and GradCAM [20] heatmaps were shown to improve localization on ILSVRC datasets. By observing the change in log-odds scores after deleting image pixels, the relevance of image pixels to the decision or prediction of a model can be determined as well [22]. The earlier paper [1] on the development of layerwise relevance propagation (LRP) shows heatmaps generated on many sample data, although many of the heatmaps do not appear to demonstrate good consistency in their pixel-wise assignment of values (different improvements have since been suggested). Tests were conducted on the effect of transformations of the images, for example by flipping MNIST digits, and a mean prediction was defined to assess the method after interchanging pixels systematically based on the relevance computed by LRP. Still, the paper itself mentions that the analysis is semi-quantitative. The paper that introduced SmoothGrad [24] mentioned that, at the time, there was no ground truth to allow for quantitative evaluation of heatmaps; it then proceeded with two qualitative evaluations instead. As of now, even though many different datasets are available for AI and ML research, the corresponding ground-truth explanations (such as heatmaps) are typically not available. Note: although heatmaps are sometimes interchangeably called saliency maps, we refer to them only as heatmaps here, to distinguish them from the XAI method named Saliency.

Fig. 1: Workflow illustrating the process from data generation to the generation of the heatmap gallery.

Fig. 2: The first row shows 10 different types of shapes that can be generated by our algorithm, placed on background type 1, i.e. the dark background. We refer to the different types of cells as cell types 0 to 8, where 0, 1, 2 are circular, 3, 4, 5 are rectangular, and 6, 7, 8 are circular with one, three and eight tails respectively. Their alternative names in the code are CCell (0), CCellM (1), CCellP (2), RCell (3), RCellB (4), RCellC (5), CCellT (6), CCellT3 (7), CCellT8 (8) respectively; C denotes circular cell, R rectangular, T tails, M minus, P plus. The last column (type 9) does not contain any cell. The second and third rows are similar to the first row, except that the cells are placed on background types 2 and 3 respectively. Rows 4, 5, 6 are the ground-truth heatmaps for rows 1, 2, 3 respectively. Regions colored light red correspond to localization information, while dark-red regions correspond to distinguishing features. For example, columns 1 and 2 can be distinguished by the presence of the bar across the circular cell. Columns 4, 5, 6 differ only in the dominant colors of their rectangular borders, and thus the distinguishing features are their borders.

XAI methods that are not focused on generating heatmaps have also been developed. This paper is mainly concerned with how to quantitatively compare heatmaps, but we may still benefit from different types of evaluations of XAI performance. Local interpretable model-agnostic explanation (LIME) [18] was introduced to find a locally faithful interpretable model that represents the model under inspection well, regardless of the latter’s architecture (i.e. it is agnostic). By comparing LIME with obviously interpretable models such as decision trees and sparse logistic regression, in particular using recall values, the quality of the feature importance obtained using LIME can be assessed. Experiments on Concept Activation Vectors (section 4.3 of [7]) include a quantitative comparison of the information used by a model when a ground-truth caption is embedded into the image. In some cases, the caption is used by the model for decision-making, but in other cases, only the image concept is used. Furthermore, human-subject experiments were also conducted to test the importance of the saliency mask, showing that heatmaps help humans only marginally in making decisions and can even be misleading. Similar sentiments doubting the usefulness of heatmaps have been expressed elsewhere, for example in the caption of fig. 2 in [19].

On the other hand, applications of XAI methods have emerged in other fields, where the evaluation of heatmaps has been performed in different ways. Still, one should be careful that such evaluations may not always clearly indicate the relevant usefulness of the heatmaps themselves. A study on MRI-based Alzheimer’s disease classification [3] computes the L2 norm between average heatmaps generated by different XAI methods and compares the performance of three other metrics. Ground-truth heatmaps are sometimes available, for example in the diagnosis of lung nodules [34], where recall values can be directly computed between the reference features (ground truth) and the heatmaps generated by different XAI methods. Other kinds of ground truth have been obtained using specialized methods, such as NeuroSynth in [29] for analyzing neuroimaging data. Some parts of that evaluation appear qualitative (such as the group-level evaluation), though the paper uses the F1-score to evaluate the heatmaps, thus naturally including the concepts of recall and precision in the evaluation. Other applications of XAI methods, especially heatmaps, in the medical field include [6, 32, 11, 16, 2, 28, 15, 17, 10].

In this paper, we first introduce in section II-A a synthetic dataset containing images with simple features, with ground-truth heatmaps that can be generated on demand. The aim is to provide a standardized dataset to compare the viability and effectiveness of the heatmaps generated by XAI methods in providing explanations. Ground-truth heatmaps are automatically generated alongside the image data and labels, avoiding the need to manually mark heatmap features, which is a very laborious process. In this 10-class dataset, each data sample consists of an object with a simple shape and a corresponding heatmap designed to be unambiguous, which is the core feature intended to address the problems mentioned above. In short, we provide a dataset on which heatmaps can be verified in a more objective way. The rest of section II describes the implementation of the neural network training, validation and evaluation processes, followed by the description of the five-band score, a metric defined to capture quantities such as recall and precision while taking into account the distinct meaningful regions in the heatmaps. Section III discusses the recall-precision results and ROC curves we obtained. Finally, we conclude with recommendations on which methods may be useful in specific cases and provide some caveats.

II Data and Methodology

This section describes the workflow from data generation, through network training and network performance evaluation, to heatmap generation and the evaluation of the generated heatmaps with common quantities. The workflow is shown in fig. 1, closely following the sequence of commands run in the provided package of Python code (https://github.com/etjoa003/explainable_ai/tree/master/xai_basic). Some details, such as the algorithms needed to generate each data sample, can be traced from the tutorials available as Jupyter notebooks included in the package.

II-A Dataset

  rotate by the given angle
  shift center to the given position
  for all pixels do
    …
  end for
Algorithm 1: build_basic_ball_body(…) for the type 0 cell. A multiplicative factor modifies the object’s elliptical shape, and a thickness parameter sets the cell border. The subscript ex stands for “explanation”, marking the heatmap parts. Thresholds are suitably chosen to create binary arrays.
  build_basic_ball_body(…)
  if type 2 then
    …
  end if
  rotate ball and bar by the given angle
  shift center of ball and bar to the given position
Algorithm 2: build_ccell_body(…) for type 1 or 2 cells. A multiplicative factor controls stretching, and bar and pole thicknesses form the minus- and plus-shaped skeletons of the cells.

We provide algorithms that can generate the dataset shown in fig. 2 on demand, where the top three rows are the images and the bottom three rows are the corresponding ground-truth heatmaps. The ten different classes of cells are shown along the columns. Types 0, 1, 2 are circular cells with a border (algo. 1), with a bar (or minus sign), and with a plus sign (algo. 2) respectively. Types 3, 4, 5 are rectangular cells with different dominant colors. Types 6, 7, 8 are circular cells with one, three and eight tails respectively. The last class does not contain any cell. Three types of backgrounds are provided to increase the variation of the dataset, as shown separately in the first three rows of the same figure.

The ground-truth heatmaps have been designed to mark features that distinguish all the classes in a way that is as unambiguous as possible, subject to human judgment. Admittedly, there may not exist a unique unambiguous way of defining them. Where appropriate, the heatmaps can be readjusted by editing the heatmap generator classes in the package of code. The heatmaps are shown in fig. 2, rows 4 to 6. With this dataset, fair comparisons between heatmaps generated by different XAI methods can be performed. In this particular implementation, each ground-truth heatmap is normalized to [-1, 1], and thus the heatmaps to be compared against it are expected to be normalized to [-1, 1] as well. Each ground truth consists of an array of values of size (H, W) with three distinct regions: (1) regions of value 0 (shown as white background) that should not contribute to the neural network prediction, (2) regions of value 0.4 for localization (shown as light red) and (3) regions of value 0.9 for discriminative features (shown as dark red), where a discriminative feature is also qualitatively considered part of the localization. In this work, two other distinct regions, of values -0.4 and -0.9, are defined symmetrically to accommodate the fact that some heatmap methods have been interpreted in such a way that negative regions (shown in blue in this paper) are considered to contribute against a given decision or prediction. For our dataset, the ground truth does not contain any such negatively contributing information, although, as will be shown later, some XAI methods still generate negative values.
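As an illustration, a three-region ground truth of this kind can be sketched in a few lines of NumPy. This is a toy construction and not the paper's actual generator; the function name and geometry here are our own, while the region values 0, 0.4 and 0.9 follow the convention described above.

```python
import numpy as np

def toy_ground_truth(h=64, w=64, cx=32, cy=32, r=20, border=3):
    """Toy three-region ground-truth heatmap for a circular cell:
    0 = irrelevant background, 0.4 = localization (cell interior),
    0.9 = discriminative feature (the cell border)."""
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.sqrt((xx - cx) ** 2 + (yy - cy) ** 2)
    gt = np.zeros((h, w), dtype=np.float32)
    gt[dist <= r] = 0.4                           # localization region
    gt[(dist <= r) & (dist >= r - border)] = 0.9  # discriminative border ring
    return gt
```

The actual generator classes in the released code handle the full variety of cell shapes, tails and backgrounds.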

For this paper, the training, validation and evaluation datasets are prepared in 32, 8 and 8 shards respectively, each shard containing 200 samples drawn uniformly at random from the 10 classes. In other words, the datasets contain 6400, 1600 and 1600 samples in total respectively. The dataset is prepared in shards for practical purposes, for example to prevent a full restart in case data downloading and caching are interrupted, and to facilitate a more efficient training process in the regular evaluation mode indicated in fig. 1 as part 2.

II-B Training

Model   | s   | b  | n_TC | n_TR | n_ES | f   | Eval
ResNet  | 512 | 4  | 2    | 4    | 24   | 0.4 | 0.951
AlexNet | 224 | 16 | 16   | 16   | 48   | 0.3 | 0.980
VGG     | 224 | 16 | 4    | 4    | 48   | 0.3 | 0.986
TC and TR denote training in continuous and regular evaluation mode respectively, with n_TC and n_TR the corresponding numbers of epochs. Eval denotes the average accuracy over the 5 models branched from the base model. b is the batch size, n_ES the early stopping limit and f the refresh fraction. Image shapes are (s, s, 3).
TABLE I: Training settings and performances on the ten-class data.

Fig. 3: Recall and precision scores of five-band stratified heatmaps compared to the ground truth for ResNet, AlexNet and VGG and 8 different XAI methods. In all panels, higher recall and precision values are better, i.e. points located towards the top right are better. (A) Average and (B) maximum values of the recalls and precisions (over the soft five-band thresholds) of each sample of the evaluation dataset are collected, then averaged over all samples to be shown as individual points in the plot. (C) is the same as (B), except that the scores are obtained after the clamping process. Vertical and horizontal bars are the corresponding standard deviations over all samples of the evaluation dataset. Insets zoom in on selected regions.

Fig. 4: ROC curves for ResNet (left), AlexNet (middle) and VGG (right) for 8 different saliency methods (see legend, bottom right), each XAI method applied on the test datasets separately for the 5 branch models. Neither FPR nor recall necessarily reaches 1.0, since the thresholds are adjusted in a multi-dimensional space. Under the experimental conditions specified here, most methods lie in the traditionally poor ROC region.

After the data is cached, the process starts with training in continuous mode, indicated as part 1 of the workflow in fig. 1. In this mode, pre-trained models are first downloaded from Torchvision and modified for compatibility with the PyTorch Captum API. The three pre-trained models used are AlexNet [9], ResNet34 [5] and VGG [12], corresponding to workflows 1, 2 and 3 in the code. In this phase, training proceeds continuously for the purpose of fine-tuning the models to our data. The number of epochs and the batch size are specified in table I. The Adam optimizer is used, with a different learning rate for ResNet than for AlexNet and VGG, and the same weight decay for all. A plot of the losses against training iterations (not shown in this paper) is saved as a figure in part 1.2 of the workflow. We refer to the model obtained after this phase as the base model.

The next phase is training in regular evaluation mode, indicated as part 2 in fig. 1. The training uses the same optimizer as the previous phase, and the numbers of epochs used are also shown in table I. Evaluation is performed every 4 training iterations; more accurately, this part is known as validation in the machine learning community, separate from the final evaluation. Each validation is performed on a shard randomly drawn from the 8 shards of the validation dataset. We set the target accuracy to 0.96. If, during validation, the accuracy computed on that single shard exceeds the target accuracy, the training is stopped and evaluation on all validation data shards is performed. The total validation accuracy is used to ensure that the validation accuracy on a single shard is not high by pure chance. While the total validation accuracy can be slightly lower, our experiments so far indicate no such problem. Furthermore, only ResNet attained the target accuracy within the specified setting. For AlexNet and VGG, 0.96 is never exceeded, and the early stopping mechanism is triggered to prevent unnecessarily long, unfruitful training; note that, fortunately, the total accuracy when evaluated on the final evaluation dataset is still very high, as shown in table I. The early stopping mechanism is as follows. Whenever validation on a single shard does not achieve the target accuracy, then (1) if there is no improvement in the validation accuracy, the early stopping counter c is increased by one, while (2) if there is an improvement in the validation accuracy, c is reduced to f·c, where f is the refresh fraction, so that the process is given more chance to train longer. If c reaches the early stopping limit n_ES, training is stopped.
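The counter update just described can be sketched as a single function (a minimal sketch in the spirit of the text; the names `counter`, `refresh_fraction` and `limit` are ours, corresponding to the early stopping counter, refresh fraction and early stopping limit):

```python
def early_stop_step(counter, improved, refresh_fraction, limit):
    """One early-stopping update after a validation that missed the
    target accuracy: no improvement increments the counter, while an
    improvement shrinks it by the refresh fraction so that training
    gets more chance to continue. Returns (new_counter, stop_flag)."""
    if improved:
        counter = refresh_fraction * counter
    else:
        counter += 1
    return counter, counter >= limit
```

For example, with a limit of 24 and refresh fraction 0.3, a run with no improvement at counter 23 stops, while a single improvement at counter 10 resets the counter to 3 and training continues.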

We repeat the above process of training in regular evaluation mode 4 more times starting from the base model, giving a total of 5 branch models. Note that the early stopping limit and refresh fraction are set so that AlexNet and VGG can be trained for a longer period (as shown in table I), since they both achieve lower accuracy than ResNet when given the same number of epochs and early stopping settings. This is possibly because (1) a larger batch size means fewer iterations per epoch and (2) the improvement in accuracy is inherently slower, considering that ResNet is generally known to perform better. Here, comparing prediction accuracy in a precise manner is not very meaningful, as we focus on the heatmaps later. No attempt is made to train the models to perfect accuracy; a few erroneous predictions are kept so that their heatmaps can be compared with heatmaps from correct predictions. There is no need for k-fold validation here since the validation dataset is completely separate from the training dataset.

II-C Evaluation and XAI implementation

This part corresponds to part 3 of fig. 1, where heatmaps are computed using the following XAI methods available in the PyTorch Captum API: Saliency [23], Input*Gradient [8], DeepLift [22], GuidedBackprop [25], GuidedGradCam [20], Deconvolution [31], GradientShap [14] and DeepLiftShap [14]. Integrated Gradients [27] has been excluded as it is comparatively inefficient with ResNet. Note also that the original implementation of layerwise relevance propagation (LRP) [1] has been shown to be equivalent to gradient*input or DeepLIFT depending on a few conditions [21]. All heatmaps are derived from the predicted classes, not the true classes (for some XAI methods, an explanation can be extracted from the probability of predicting not only the correct class but also other classes). The following is the sequence of processing steps leading to the final results.
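As a point of reference for the simplest methods in the list above: for a linear model f(x) = Wx, the gradient of the target class score with respect to the input is simply the corresponding row of W, so Saliency (absolute gradient) and Input*Gradient (input times gradient) reduce to the following dependency-free sketch (the function name and toy numbers are ours; in practice Captum computes these gradients through the full network):

```python
import numpy as np

def saliency_and_input_x_gradient(W, x, target):
    """For a linear model f(x) = W @ x, the gradient of the score of
    class `target` w.r.t. the input is the row W[target]. Saliency
    takes the absolute gradient; Input*Gradient multiplies the input
    element-wise by the gradient."""
    grad = W[target]
    return np.abs(grad), x * grad

W = np.array([[1.0, -2.0],
              [0.5,  3.0]])
x = np.array([2.0, 1.0])
sal, ixg = saliency_and_input_x_gradient(W, x, target=0)
# sal -> [1.0, 2.0], ixg -> [2.0, -2.0]
```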

Channel adjustments. Each heatmap h, which has shape (C, H, W) (C = 3 for the 3 color channels), is compressed along the channels to shape (H, W) by sum-pixel-over-channels, where the values are summed pixel-wise over all channels, i.e. h(i) = Σ_c h(c, i) when written component-wise, with i indexing the pixels. This is so that it can be compared with the ground truth g of shape (H, W). Normalization to [-1, 1] is also performed by an absolute-max-before-sum scheme, so that the overall channel adjustment process is h(i) = Σ_c h(c, i)/M, where M is the maximum absolute value over all pixels in that single heatmap. The practice of summing over channels can be seen, for example, on the LRP tutorial site [13].
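A sketch of this channel adjustment, under one reading of the absolute-max-before-sum scheme (divide by the maximum absolute value taken over all channels and pixels, then sum over channels; the function name is ours):

```python
import numpy as np

def channel_adjust(h):
    """Compress a (C, H, W) heatmap to (H, W): divide by the maximum
    absolute value M over the whole heatmap, then sum pixel-wise over
    the channels (one reading of 'absolute-max-before-sum')."""
    m = np.abs(h).max()
    if m == 0:
        return h.sum(axis=0)  # all-zero heatmap: nothing to normalize
    return h.sum(axis=0) / m
```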

Five-band stratification. The adjusted heatmaps are subsequently evaluated using the five-band score, where each pixel is assigned one of the five values previously described: the value 2 designates a discriminative feature, 1 localization, 0 irrelevant background, while -1 and -2 are symmetrically defined for negative contributions to the model prediction or decision. Recall that the pixels of our ground-truth heatmaps g have been assigned one of the values 0, 0.4 and 0.9. Regardless of the intermediate processing of the heatmap h, the mapping for g is always such that 0 → 0, 0.4 → 1 and 0.9 → 2. To map h, which has been normalized to [-1, 1] by now, a threshold pair of the form (τ1, τ2) with 0 < τ1 < τ2 is used, so that for each pixel i, a transformation we refer to as five-band stratification is performed in the following manner: h(i) is mapped to 2 if h(i) ≥ τ2, to 1 if τ1 ≤ h(i) < τ2, to 0 if -τ1 < h(i) < τ1, to -1 if -τ2 < h(i) ≤ -τ1, and to -2 if h(i) ≤ -τ2. The bracketed subscript (i) denotes the component of h regarded as a vector, for notational convenience later. Up to this point, we have h → s(h), where s denotes the five-band stratification.
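The two mappings can be sketched as follows. The threshold values are left as arguments here, since the specific pairs used in the experiments are set in the paper's code, and the handling of exact boundary values is our own choice:

```python
import numpy as np

def stratify_gt(g):
    """Map ground-truth values {0, 0.4, 0.9} to bands {0, 1, 2}."""
    s = np.zeros(g.shape, dtype=int)
    s[np.isclose(g, 0.4)] = 1   # localization
    s[np.isclose(g, 0.9)] = 2   # discriminative feature
    return s

def stratify_heatmap(h, t1, t2):
    """Five-band stratification of a heatmap normalized to [-1, 1]
    using a threshold pair (t1, t2) with 0 < t1 < t2."""
    s = np.zeros(h.shape, dtype=int)
    s[h >= t2] = 2
    s[(h >= t1) & (h < t2)] = 1
    s[(h <= -t1) & (h > -t2)] = -1
    s[h <= -t2] = -2
    return s
```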

Five-band score. After stratification, for each heatmap we compute the accuracy a, precision p = TP/(TP + FP + ε) and recall r = TP/(TP + FN + ε), where the accuracy is the fraction of correctly assigned pixels over the total number of pixels, TP is the number of true-positive pixels, FP false positives, FN false negatives and ε a small value for smoothing. TP here is slightly different from the TP used in the binary case: we count a pixel i as a true positive only when g(i) ∈ {1, 2} and s(h)(i) = g(i), i.e. we use the stringent condition in which the labels for localization and features must be correctly hit to achieve a true positive. Likewise, FN is counted when g(i) ∈ {1, 2} but the pixel is not exactly hit, whereas FP is counted when a pixel that should be irrelevant is assigned a wrong value. To plot the receiver operating characteristic (ROC), the false positive rate FPR = FP/(FP + TN) is also computed, where TN is the number of true negatives.
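A sketch of these scores under our reading of the stringent condition (a positive ground-truth band must be matched exactly to count as a TP; the function name and the exact FP/TN bookkeeping are our interpretation):

```python
import numpy as np

def five_band_scores(gt_bands, hm_bands, eps=1e-9):
    """Accuracy, precision, recall and FPR for stratified heatmaps.
    A TP requires the positive ground-truth band (1 or 2) to be
    matched exactly; eps is the smoothing term."""
    pos = gt_bands > 0
    tp = np.sum(pos & (hm_bands == gt_bands))
    fn = np.sum(pos) - tp                       # positive pixels missed
    fp = np.sum(~pos & (hm_bands != gt_bands))  # wrong value on non-positive pixels
    tn = np.sum(~pos & (hm_bands == gt_bands))
    acc = np.mean(gt_bands == hm_bands)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    fpr = fp / (fp + tn + eps)
    return acc, precision, recall, fpr
```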

Soft five-band scores. As seen, the thresholds defined above are sharp, and values near any of the thresholds might not be properly accounted for. We therefore instead use soft five-band scores, where the metrics are collected over different thresholds. More precisely, for the k-th data sample we obtain the tuple (a, p, r, FPR) for each pair in a small set of threshold pairs (τ1, τ2), after comparing the stratified ground truth with the stratified heatmap s(h), where h has undergone the channel adjustment process previously described. The best and average values of these scores for sample k over the different thresholds are then saved, sample by sample, into a csv file in the XAI result folder for analysis in the discussion section. These values are identified by their positions among the shards, the predicted class and the true class.

Receiver operating characteristic. To compare the performances of the different XAI methods mentioned above, ROC curves are also obtained, as shown in fig. 4. For each threshold pair, the mean values of FPR and recall over all samples in the evaluation dataset contribute a single point to the figure. Unlike the usual binary ROC, changing the thresholds in the multi-dimensional space we defined does not guarantee a change from FN to TP (or vice versa). For example, a pixel that begins as a TN with predicted band 0 can become a TP or an FP when the thresholds are lowered, depending on whether its true band is positive or 0. Hence we will not always obtain a curve that starts at (0, 0) and ends at (1, 1) in the ROC space, unlike the usual ROC curve. Regardless, by a simple understanding of the rates of change of FPR and recall, the usual rule of thumb that assigns a steeper increase in recall to better ROC quality should still hold. Mathematically, the more optimal ROC curve lies nearer the top-left vertices of the convex hull formed by the points. There have been studies on multi-dimensional ROC curves and their volume-under-surface generalization [26, 4], though the difficulty of visualizing them makes them unsuitable for visual comparison here. With the definitions of TP, FP, TN and FN above, we have instead created pseudo-binary conditions.

III Discussion

III-A Recall vs Precision

We provide recall vs precision scores as shown in fig. 3. Each point in the plot corresponds to an XAI method, for example Saliency, applied to a single branch of the corresponding model trained from the base model. There are 5 points per method, as we trained 5 branches per architecture. Naturally, the higher the recall and precision, the better the XAI method. Each point is the mean, over all samples of the evaluation dataset, of the per-sample recall and precision scores. Thus fig. 3(A) plots the mean threshold-averaged recall against the mean threshold-averaged precision for ResNet, AlexNet and VGG respectively. Likewise, fig. 3(B) plots the means of the best (over soft thresholds) recall and precision. After qualitative assessment of some of the generated heatmaps, we perform a similar analysis by applying clamping to the heatmap values after the first normalization process, roughly following the idea in [30]. In other words, the channel adjustment process described in the previous section is changed so that each pixel value is mapped to 1 if it is at least the clamping threshold c, to -1 if it is at most -c, and divided by c otherwise. A different set of soft thresholds is used to match the clamping process as well. Fig. 3(C) is thus the same as fig. 3(B), except with the clamping process applied; we do observe some changes in the precision and recall scores, more notably for AlexNet and VGG, though not necessarily for the better.
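Under our reading, the clamping step can be sketched in one line (the function name and the use of `np.clip` are ours):

```python
import numpy as np

def clamp_heatmap(h, c):
    """Clamp a normalized heatmap with clamping threshold c > 0:
    values at or beyond +/-c saturate to +/-1, and values inside
    (-c, c) are rescaled linearly by 1/c."""
    return np.clip(h / c, -1.0, 1.0)
```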

Fig. 5: Visual comparison of heatmaps generated by Saliency, Guided GradCAM and Deconvolution. The different color-channel responses are shown under the R, G and B columns respectively, with the original image in the left-most column and the ground truth in the second column from the left. All heatmaps above are obtained from correctly predicted samples.

Fig. 6: Visual comparison of heatmaps generated by DeepLift and DeepLiftShap, with Saliency for comparison. For each method, one column shows the heatmap obtained after summing pixels over channels, and the adjacent columns show the results of five-band stratification using the first and last threshold pairs described in section II-C. (A) Visualization of how DeepLift and DeepLiftShap on ResNet generally score slightly lower in recall than Saliency. (B) DeepLiftShap and DeepLift appear to produce similar heatmaps for VGG and ResNet, though the SHAP variant appears to remove some artifacts in AlexNet (consider also their heatmaps shown in the appendix); Saliency is also shown for comparison. Blue pixels (negative values) mark some of the correct areas that we regard as discriminative features, but interpreting the blue pixels of these methods as negative contributions seems inappropriate; applying absolute values to the negative pixels may improve their recall scores. (C) Heatmaps for correct predictions of cells from the same class, cell type 8, generated using DeepLift applied on VGG. Due to some inconsistency in the overall shapes of the generated heatmaps, the figures are not representative of all heatmaps produced by any particular XAI method and architecture. Nevertheless, the pixel granularities of heatmaps generated by the same XAI method are similar; consider the heatmaps shown in the appendix. All heatmaps above are obtained from correctly predicted samples.

Recall scores are generally low for most of the points in fig. 3, indicating high FN. The first obvious cause is that most XAI methods on all architectures appear to assign 0 values to regions that contain either localization pixels or discriminative features. For example, fig. 5 shows the heatmaps from the different channels R, G and B (extracted before summing pixels over channels). The heatmaps generally appear granular and non-continuous, having many white pixels in between the red pixels, thus contributing to false negatives. Furthermore, most of the inner body of the cells (represented by light red in the ground truth) is completely unmarked by most of the XAI methods, contributing a very large number of false negatives. The highest recall values in fig. 3(A) are attained by Saliency applied on AlexNet. This is consistent with visual inspection of the heatmaps across different methods and architectures: Saliency assigns many more red pixels in relevant regions, while other methods often assign blue pixels (negative values) in an unpredictable manner and highlight only the edges.

Similar to the heatmaps in fig. 5 produced by Guided GradCAM applied on VGG, many of the XAI methods highlight only the edges of the cell borders, sometimes faintly. As such, the comparatively high recall values for Saliency can be qualitatively accounted for by the halo of high-valued heatmap pixels encompassing the relevant area, although not in a very precise and compact manner. Deconvolution, on the other hand, has relatively higher recall scores due to the large number of artifact pixels. The quality of its heatmaps is therefore undermined, reflected in low precision scores. Other methods such as Guided GradCAM are still capable of highlighting some of the relevant regions, and, to reiterate, many of them tend to highlight the edges, as seen in the heatmaps in the supp. material. Also, AlexNet tends to produce denser heatmaps than the other two architectures, giving rise to slightly higher recall scores than VGG, while the average scores for ResNet are very low. Depending on the context, different XAI methods can be the better choice based on their strengths and weaknesses, although adjustments to existing interpretations may be necessary.

Differences in responses to color channels are also observed. Saliency appears as positive values (red) in all channels, as shown in fig. 5, although the type 3 cell has only one color channel with a strong input signal, because the predominant color of its border is red; in the implementation, the normalized border color is predominantly red with a small uniform random perturbation. On the other hand, Guided GradCAM marks the green and blue channels with negative values. If these are to be interpreted as negative contributions, the interpretation will be consistent. But when the heatmaps are summed over channels, as we have done, the offsetting effect of the negative values becomes questionable. In other methods, such color responses are variable. For example, Input*Gradient on AlexNet does appear to exhibit color responses as well (not shown), although the quality is highly variable too. It is thus difficult to strongly recommend any one method as specializing in color detection, even Guided GradCAM.

Interpretation of heatmap values. Fig. 6 shows, in one column, the heatmaps obtained after summing pixels over channels, one of the earlier processes in the previous section. The figure also shows the effect of soft five-band stratification, which demonstrates that the selection of the thresholds does affect the scores. In the previous section, we addressed this by distinguishing between the best and average recall and precision values over the soft thresholds, which is the main purpose of fig. 3(B). The effect of threshold changes is variable across different XAI methods. Focusing on the recall scores and the visual inspection of fig. 6(B), the XAI community may need to revise the idea of negative values in heatmaps. Clearly, the DeepLift and DeepLiftShap examples show that they would score much better recall if we took the absolute values of the heatmaps and applied the same process from stratification to the computation of the five-band scores.

SHAP, DeepLift and background effects. When SHAP is applied to DeepLift, the effect appears to be the removal of background artifacts, confining non-zero heatmap pixel values to more relevant regions (fig. 6(B) and supp. materials). Still, we need to point out that the heatmaps can be inconsistent even among correct predictions of the same class, as shown in fig. 6(C). The figure shows two heatmaps of different quality generated by DeepLift on VGG for correctly predicted cells of type 8. It may be tempting to guess at possible reasons, such as the backgrounds. More investigation of the signals activated by the background may be necessary.

From observing fig. 3 and many heatmaps, for example the figures in the appendix, it is tempting to deduce that deeper networks (AlexNet shallowest, followed by VGG, then ResNet deepest) tend to produce heatmaps that are more sensitive to edges but cover the bulk of discriminative features and localization regions less thoroughly. To test this, we modify AlexNet by systematically adding more convolutional layers (1, 2, …, 8 additional layers), train each variant, and evaluate its precision vs recall in the same manner as before. The plot is made by computing mean recall and precision values as before, except that the points are collected separately by predicted class (whereas in the previous section, averages were taken over all test samples regardless of predicted value). The expectation is that precision and recall move nearer to 1 (towards the top right of the plot) for the variants with fewer additional layers. However, as shown in appendix fig. 1, this does not appear to be the case.
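The per-predicted-class averaging used for this test can be sketched as follows (the per-sample scores below are hypothetical placeholders, not results from the experiment):

```python
import numpy as np
from collections import defaultdict

# Hypothetical per-sample results: (predicted_class, recall, precision).
results = [
    (0, 0.70, 0.60),
    (0, 0.80, 0.50),
    (1, 0.40, 0.90),
    (1, 0.60, 0.70),
]

# Collect points separately per predicted class, instead of averaging
# over all test samples as in the earlier evaluation.
by_pred = defaultdict(list)
for cls, rec, prec in results:
    by_pred[cls].append((rec, prec))

# One (mean recall, mean precision) point per predicted class.
means = {cls: tuple(np.mean(v, axis=0)) for cls, v in by_pred.items()}
```

Each entry of `means` then contributes one point to the precision-vs-recall plot for the corresponding predicted class.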

III-B ROC curve

The ROC plot in fig. 4 shows that most of the tested heatmap methods lie in traditionally poor ROC regions. There appear to be trade-offs between higher recall (good) and higher FPR (bad), shown most prominently by the Saliency method. Deconvolution appears to be the best, as its recall increases at the greatest rate relative to FPR. However, this is misleading: deconvolution starts with many FP predictions in all three architectures, as shown by the grid-like artifacts in fig. 5. This causes FP to change more quickly, and the ROC fails to offer a fair comparison between deconvolution and the other methods. Saliency tends to “over-assign” heatmap pixels around the correct region; compare fig. 5, 6 and the appendix with, for example, Guided GradCAM and DeepLift. Unlike DeepLift and DeepLiftShap, the Saliency ROC shows higher recall because of more correct assignments of positive (red) values, but also higher FPR because positive values are assigned to supposedly white regions. Guided GradCAM appears to have difficulty improving as the threshold changes. Considering how sum-pixel-over-channels is performed and its color-channel sensitivity, its performance might have suffered from incompatible heatmap pre-processing. The ROC curves for the other methods do not show sufficiently distinct trends to favor the adoption of one method over another. There may be a need to investigate different ways of performing soft-thresholding for specific XAI methods, to at least bring the ROC curves into traditionally favorable regions.
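The construction of such an ROC curve from a heatmap, sweeping soft thresholds, can be sketched as follows (toy one-dimensional heatmap and ground-truth mask with made-up values; recall here equals TPR):

```python
import numpy as np

# Toy ground-truth mask and heatmap values.
gt = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=bool)
heat = np.array([0.9, 0.6, 0.3, 0.7, 0.2, 0.1, 0.0, 0.05])

def tpr_fpr(heat, gt, thr):
    """One ROC point: threshold the heatmap, compare to ground truth."""
    pred = heat > thr
    tpr = np.sum(pred & gt) / np.sum(gt)      # recall
    fpr = np.sum(pred & ~gt) / np.sum(~gt)
    return tpr, fpr

# Each soft threshold yields one (TPR, FPR) point of the ROC curve;
# lowering the threshold raises recall but can also raise FPR.
points = [tpr_fpr(heat, gt, t) for t in (0.8, 0.5, 0.25)]
```

Methods that assign positive values to supposedly white regions (as Saliency does) pay for their extra recall with extra FPR at the same thresholds.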

III-C Other observations

For images of type 9 (no cell present), we generally see heatmaps in the form of artifacts appearing as well-spaced spots forming a lattice (see appendix fig. 40 etc.), similar to the heatmaps from the deconvolution method in fig. 5. In some cases, for example appendix fig. 17 row 2, we can see that DeepLift is able to “provide” the correct reasoning for a wrong prediction: a type 0 cell whose shape almost looks like a single tail was mistaken for type 6, though some similar wrong predictions are not highlighted in the same manner. In many other cases of wrong predictions (see the heatmaps in the appendix), it is unclear what the highlighted regions mean.

IV Conclusion

Recommendations and Caveats. Despite the imperfect performance, relative comparisons between the XAI methods can be made.

  • The Saliency method appears to highlight the relevant regions in the most conservative way, which is more suitable for localization when false positives are not a concern. In particular, AlexNet scores the highest recall.

  • If only the edges of the features are needed, VGG and ResNet with input*gradient, DeepLift, or DeepLiftShap seem to be reasonable choices, while the same heatmap methods for AlexNet produce heatmaps that go beyond capturing just the edges in rather inconsistent ways. Compared to Saliency, they may be more useful for detecting small, hard-to-observe discriminative features, e.g. in medical and other dense images.

  • The heatmaps produced by ResNet appear to be the sparsest, followed by VGG and then AlexNet. Input size and network depth may be the reasons.

  • Research into the role of negative values in heatmaps may be necessary. If we keep the interpretation that negative values correspond to negative contributions to the prediction, some XAI methods, such as DeepLift and DeepLiftShap, may be completely incomprehensible.

  • More investigation may be needed to find the best channel adjustment, to handle the phenomenon where large continuous patches or areas are ignored by many of the tested methods, and to understand signal activations caused by the background.

We have provided an algorithm for producing synthetic data that we hope can serve as a baseline for testing XAI methods, especially those producing saliency maps or heatmaps. Some XAI methods appear to be more suitable for localization, while others are more responsive to the edges of features. The modifications required to boost the explanatory power of XAI methods may differ across methods, making fair comparison a possibly difficult task. At the least, for each application of an XAI method, we should attempt to find a clear, consistent interpretation within the same context of study. For example, if negative values need to be replaced by their absolute values in some application, an accompanying experiment is needed to show the effect and implications of performing such a transformation. As of now, XAI remains a challenging problem. However, it exhibits good potential to improve the reliability of black-box models in the future.


This research was supported by Alibaba Group Holding Limited, DAMO Academy, Health-AI division under the Alibaba-NTU Talent Program. The program is a collaboration between Alibaba and Nanyang Technological University, Singapore.


