1 Introduction
While deep learning has become the state of the art in numerous machine learning applications, deep models remain notoriously resistant to human understanding. The gap between power and interpretability is growing especially wide in vision and audio applications, where adversarial examples [45816] demonstrate that models incorporate semantically meaningless features (road stop signs with human-imperceptible changes being classified as speed-limit signs [eykholt2017robust]). Among the approaches aiding the interpretability of opaque models are input attributions, which assign to each model input a level of contribution to a particular prediction. When visualized alongside inputs, an attribution gives a human interpreter some notion of what about the input is important to the outcome (see, for example, Figure 1). Being explanations of highly complex systems intended for highly complex humans, attributions vary in their approaches and sometimes produce distinct explanations for the same outputs.
Nevertheless, save for the earliest approaches, attribution methods distinguish themselves with one or more desirable criteria and in some instances define quantitative evaluation metrics indicating preference of one attribution method over another. Ablation-based criteria such as Area Over the Perturbation Curve [7552539] and similar measures [alex2016layerwise, Montavon_2018] test by intervention: an attribution should point out inputs that, when dropped or ablated while keeping all other inputs fixed, induce the greatest change in output. Alternatively, measures such as Average % Drop [Chattopadhay_2018] instead determine to what extent important inputs stand on their own by comparing model scores to scores on just the important inputs (all other inputs are perturbed/ablated). Finally, scaling criteria such as completeness [sundararajan2017axiomatic], sensitivity-n [ancona2017better], and linear agreement [leino2018influencedirected, sundararajan2017axiomatic] calibrate attribution to the change in output relative to the change in input when evaluated against some baseline. While evaluation criteria endow attributions with some limited semantics, the variations in design goals, evaluation metrics, and underlying methods have resulted in attributions failing at their primary goal: aiding in model interpretation. This work alleviates these problems and makes the following contributions.

We decompose and organize existing attribution methods’ goals along two complementary properties: ordering and proportionality. While ordering requires that an attribution should order input features according to some notion of importance, proportionality stipulates also a quantitative relationship between a model’s outputs and the corresponding attributions in that particular ordering.

We describe how all existing methods are motivated by an attribution ordering corresponding roughly to the logical notion of necessity, which suggests a corresponding sufficiency ordering not yet fully discussed in the literature.

We show that Saliency Map [simonyan2013deep, baehrens2009explain] and GradCAM [selvaraju2016gradcam] should be avoided when localizing necessary input features, whereas DeepLIFT [shrikumar2017learning], Guided Backpropagation [springenberg2014striving], and Influence-Directed Explanations [leino2018influencedirected] are the most likely to be the best method when localizing sufficient input features.

We show that no method evaluated in this paper is a frequent winner on both necessity and sufficiency at the same time.
2 Background
Attributions are a simple form of model explanation that has found significant application to Convolutional Neural Networks (CNNs), owing to the ease of visualizing them alongside model inputs (i.e., images). We summarize the various approaches in Section 2.1 and the criteria with which they are evaluated and/or motivated in Section 2.2.
2.1 Attribution Methods
The concept of attribution is well-defined in [sundararajan2017axiomatic], but it excludes any method without a baseline (reference) input. We consider a relaxed version. Consider a classification model $f$ that takes an input vector $x$ and outputs a score vector $f(x)$, where $f_c(x)$ is the score of predicting $x$ as class $c$, with $C$ classes in total. Given a preselected class $c$, an attribution method attempts to explain $f_c(x)$ by computing a score $a_i$ for each feature $x_i$ as its contribution toward $f_c(x)$. Even though each feature in $x$ may receive different attribution scores under different attribution methods, features with positive attribution scores are universally interpreted as important to predicting class $c$, while negative scores indicate that the presence of those features lowers the confidence in predicting $c$. Previous work has made great progress in developing gradient-based attribution methods that highlight important features in the input image to explain a model's prediction. A primary question is whether we should consider grad or grad $\times$ input as the attribution [smilkov2017smoothgrad, sundararajan2017axiomatic, shrikumar2017learning, leino2018influence]. As ancona2017better argues, grad is a local attribution that only accounts for how a tiny change around the input influences the output of the network, whereas grad $\times$ input is a global attribution that accounts for the marginal effect of a feature on the output. We use grad $\times$ input as the attribution discussed in this paper. We briefly introduce the methods to be evaluated in this paper; examples are provided in Fig. 1.
Saliency Map [simonyan2013deep, baehrens2009explain] uses the gradient of the class of interest with respect to the input to interpret the prediction of a CNN. Guided Backpropagation (GB) [springenberg2014striving] modifies the backpropagation of ReLU [pmlrv15glorot11a] so that only positive gradients are passed to the previous layers. GradCAM [selvaraju2016gradcam] builds on the Class Activation Map (CAM) [zhou2015cnnlocalization], targeting CNNs. Although its variations [Chattopadhay_2018, omeiza2019smooth] show sharper visualizations, their fundamental concepts remain unchanged; we consider only GradCAM in this paper. Layer-wise Relevance Propagation (LRP) [Bach2015OnPE] and DeepLIFT [shrikumar2017learning] modify the local gradients and the rules of backpropagation. Another method sharing a similar design motivation with DeepLIFT is Integrated Gradients (IG) [sundararajan2017axiomatic], which computes attribution by integrating the gradient over a path from a predefined baseline to the input. SmoothGrad (SG) [smilkov2017smoothgrad] attempts to denoise the Saliency Map by adding Gaussian noise to the input, and provides visually sharper results. Influence-Directed Explanations (ID) [leino2018influencedirected] identify neurons in the model's internal representation with high distributional influence on the output of a target class of interest; feature importance scores are then attributed through a group of internal neurons with high influence. The use of the internal modules of neural networks in explanations is also discussed by olah2018the. Other methods, such as Deep Taylor Decomposition [montavon2015explaining], which is related to LRP [Montavon_2018], and Occlusion [zeiler2013visualizing], are not evaluated in this paper but are natural subjects for future work.
2.2 Evaluation Criteria
Evaluation criteria measure the performance of attribution methods against some desirable characteristic and are typically employed to justify the use of novel attribution methods. The most common evaluations are based on pixel-level interventions or perturbations, which quantify the correlation between the perturbed pixels' attribution scores and the change in output [unifyarticle, alex2016layerwise, Chattopadhay_2018, gilpin2018explaining, Montavon_2018, 7552539, shrikumar2017learning]. For perturbations that intend to remove or ablate a pixel (typically by setting it to some baseline or to noise), the desired behavior of an optimal attribution method is for perturbations of highly attributed pixels to drop the class score more significantly than perturbations of pixels with lower attribution.
7552539 quantify this behavior with the Area Over the Perturbation Curve (AOPC), which measures the area between two curves: the model's output score as a function of the number of perturbed pixels, and the horizontal line at the model's original output score. A similar measurement is the Area Under the Curve (AUC) of alex2016layerwise and Montavon_2018, which measures the area under the perturbation curve instead. The AOPC and AUC measurements are equivalent, and both were originally used to endorse LRP. For reasons that will become clear in the next section, we categorize these criteria as supporting a necessity ordering. We argue that evaluating attribution methods only with perturbation curves (e.g., AUC) uncovers only the tip of the iceberg and can be problematic. A toy model is shown in Example 1 to elaborate our concerns.
Example 1.
Consider a model that takes a vector of three features $x_1, x_2, x_3$. Given an input to the model, assume $M_1$, $M_2$, and $M_3$ are three different attribution methods that output the attribution scores shown in Table 1 for each input feature.
We apply a zero perturbation to the input, i.e., perturbed features are set to 0. The AOPC evaluation for these three attribution methods is shown in Fig. 2. Using the conclusion from [7552539] that higher AOPC scores suggest higher relevance of the input features highlighted by an attribution method, Fig. 2 shows that the features highlighted by $M_1$ are more relevant to the prediction than those of $M_2$ and $M_3$, as expected. However, $M_2$ and $M_3$ are judged as showing the same level of relevance under the AOPC measurement, even though $M_2$ succeeds in ranking the more relevant feature higher, whereas $M_3$ ranks a less relevant feature above both.
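Since Table 1 is not reproduced here, the sketch below re-creates the spirit of Example 1 with a hypothetical stand-in model $f(x) = \min(1, x_2 + x_3)$, in which $x_2$ and $x_3$ are each sufficient but neither is necessary, and $x_1$ is irrelevant; the three ablation orderings play the roles of $M_1$, $M_2$, $M_3$.

```python
# AOPC on a toy model: f(x) = min(1, x2 + x3). Ablating features in three
# different orders shows that AOPC cannot separate an ordering that puts a
# sufficient feature first from one that puts an irrelevant feature first.

def f(x):
    return min(1.0, x[1] + x[2])

def aopc(order, x, baseline=0.0):
    """Average drop in output as features are ablated in the given order."""
    orig = f(x)
    drops, xp = [], list(x)
    for i in order:
        xp[i] = baseline
        drops.append(orig - f(xp))
    return sum(drops) / len(drops)

x = [1.0, 1.0, 1.0]
a1 = aopc([1, 2, 0], x)  # x2 then x3: output already drops at step 2
a2 = aopc([1, 0, 2], x)  # x2 first, then irrelevant x1: drop only at step 3
a3 = aopc([0, 1, 2], x)  # irrelevant x1 first: drop also only at step 3
```

Here a1 > a2 = a3: AOPC rewards the first ordering but ties the last two, even though the second ranks a sufficient feature first and the third does not.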
Another set of criteria instead stipulates that positively attributed features should stand on their own, independently of non-important features. An example of this criterion is the Average % Drop of Chattopadhay_2018, introduced in support of GradCAM++, which measures the change in class score when only the pixels highlighted by an attribution are presented (non-important pixels are ablated). We will say these criteria support a sufficiency ordering (definition to follow).
Rethinking the concept of relevance, we believe both necessity and sufficiency can be treated as different types of relevance. In Example 1, neither $x_2$ nor $x_3$ is a necessary feature individually, because the output does not change if only one of them is absent. However, both $x_2$ and $x_3$ are sufficient features: with either one of them present, the model can produce the same output as before. Moreover, $M_2$ succeeds in placing the sufficient feature ahead of the non-sufficient one while $M_3$ fails, yet AOPC (or AUC) is unable to register this success.
Other evaluation criteria, such as sensitivity-n [ancona2017better] and the sanity checks of [adebayo2018sanity], are discussed in Section 5.
3 Methods
To tame the zoo of criteria, we organize and decompose them into two aspects: (1) ordering imposes conditions under which an input should be more important than another input in a particular prediction, and (2) proportionality further specifies how attribution should be distributed among the inputs. We elaborate on ordering criteria in Section 3.1 with instantiations in Section 3.2 and Section 3.3. We describe proportionality in Section 3.4. We begin with the logical notions of necessity and sufficiency as idealized versions of ablationbased measures described in Section 2.
3.1 Logical Order
The notions of necessity and sufficiency are commonly used characterizations of logical conditions. A necessary condition is one without which some statement does not hold. For example, in the statement $a \wedge b$, both $a$ and $b$ are necessary conditions, as falsifying either would independently invalidate the statement. On the other hand, a sufficient condition is one which can independently make a statement true without other conditions being true. In the statement $a \vee b$, both $a$ and $b$ are sufficient but neither is necessary.
In more complex statements, no atomic condition may be necessary or sufficient, though compound conditions may. In the statement $(a \wedge b) \vee (c \wedge d)$, none of $a, b, c, d$ is necessary or sufficient, but the compound conditions $a \wedge b$ and $c \wedge d$ are sufficient. As we are working in the context of input attributions, we relax the concepts of necessity and sufficiency into orderings over atomic conditions (individual input pixels).
Definition 1 (Logical Necessity Ordering).
Given a statement $s$ over some set of atomic conditions, and two orderings $\pi_1$ and $\pi_2$, both ordered sets of the conditions, we say $\pi_1$ has a better necessity ordering for $s$ than $\pi_2$ if:
$\min\,\{\,i : \neg\, s[\pi_1(1{:}i) \mapsto \mathrm{false}]\,\} \;<\; \min\,\{\,i : \neg\, s[\pi_2(1{:}i) \mapsto \mathrm{false}]\,\}$  (1)
Definition 2 (Logical Sufficiency Ordering).
Likewise, $\pi_1$ has a better sufficiency ordering for $s$ than $\pi_2$ if:
$\min\,\{\,i : s[\text{only } \pi_1(1{:}i) \mapsto \mathrm{true}]\,\} \;<\; \min\,\{\,i : s[\text{only } \pi_2(1{:}i) \mapsto \mathrm{true}]\,\}$  (2)
A better necessity ordering is one that invalidates a statement by removing the shorter prefix of the ordered conditions while a better sufficiency ordering is the one that can validate a statement using the shorter prefix.
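The two definitions can be exercised on a small boolean statement; the statement $s(a,b,c) = a \wedge (b \vee c)$ below is a hypothetical illustration, not one from the paper.

```python
# Minimal-prefix checks behind Definitions 1 and 2 for s = a and (b or c).
# necessity_index: shortest prefix whose falsification invalidates s.
# sufficiency_index: shortest prefix that alone (others false) validates s.

def s(v):
    return v["a"] and (v["b"] or v["c"])

def necessity_index(order):
    v = {k: True for k in "abc"}
    for i, k in enumerate(order, 1):
        v[k] = False          # remove condition k
        if not s(v):
            return i
    return None

def sufficiency_index(order):
    v = {k: False for k in "abc"}
    for i, k in enumerate(order, 1):
        v[k] = True           # supply condition k alone
        if s(v):
            return i
    return None

# The ordering ["a", "b", "c"] invalidates s after one removal (a is
# necessary) but validates s only after two additions (a alone is not
# sufficient), so it scores well on necessity and worse on sufficiency.
```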
3.2 Necessity Ordering (NOrd)
Unlike logical statements, numeric models do not have an exact notion of a condition (feature) being present or absent. Instead, setting inputs to some baseline value or noise is viewed as removing a feature from an input. Though this is an imperfect analogy, the approach is taken by every one of the measures described in Section 2 that uses perturbation in its motivation. Additionally, with numeric outputs, the nuances in output acquire magnitude, and we can no longer describe an attribution by a single index like the minimal index of Definitions 1 and 2. Instead, we consider an ideal ordering as one which drops the numeric output of the model the most with the fewest inputs ablated.
Notation
Given an attribution method $A$, it computes a set of attribution scores $\{a_i\}$, one for each pixel $x_i$ in the input image $x$. We permute the pixels into a new ordering $\pi$ so that $a_{\pi(1)} \geq a_{\pi(2)} \geq \cdots$. We take the subset $\pi^+$ of $\pi$ so that $\pi^+$ has the same ordering as $\pi$ but only contains pixels with positive attribution scores. Let $f(x \setminus \pi^+_{1:k})$ be the output of the model on the input where pixels $\pi^+(1), \dots, \pi^+(k)$ are perturbed from the input by setting $x_{\pi^+(i)} = b$, where $b$ is a baseline value for the image (typically $0$). Denote by $f(x)$ the original output of the model given $x$ without any perturbation, and by $x^b$ the baseline input image in which all pixels are filled with the baseline value $b$.
We refer to the AUC measurement [alex2016layerwise, Montavon_2018] as a means of measuring the Necessity Ordering (NOrd). Denote by $\mathrm{NOrd}(A, x)$ the NOrd score given an input image $x$ and an attribution method $A$. Rewriting AUC using the notation above:
$\mathrm{NOrd}(A, x) \;=\; \frac{1}{n} \sum_{k=1}^{m} \max\!\big( f(x \setminus \pi^+_{1:k}),\; f(x^b) \big)$  (3)
where $m = |\pi^+|$ and $n$ is the total number of pixels in $x$. We include the $\max$ to clip scores below the baseline output. According to Definition 1, we have the following proposition.
Proposition 1.
An attribution method $A_1$ shows a (strictly) better Necessity Ordering than another method $A_2$ given an input image $x$ if $\mathrm{NOrd}(A_1, x) < \mathrm{NOrd}(A_2, x)$.
As discussed in Section 2.2, NOrd only captures whether the more necessary pixels receive higher attribution scores. We argue that attribution methods should also be differentiated by their ability to highlight sufficient features. To evaluate whether the more sufficient pixels receive higher attribution scores, we propose the Sufficiency Ordering as a complementary measurement.
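A minimal sketch of one way to compute the NOrd score of Eq. (3) follows; the additive "model", the input, and the two attribution vectors are hypothetical stand-ins for a CNN and real attributions, and only the clipping against the baseline output follows the text.

```python
# NOrd sketch: ablate positively attributed pixels highest-score-first and
# average the clipped outputs; lower is better (Proposition 1).
# `model`, `x`, and the attribution scores are hypothetical.

def nord(model, x, scores, baseline=0.0):
    pos = sorted((i for i, a in enumerate(scores) if a > 0),
                 key=lambda i: scores[i], reverse=True)
    f_base = model([baseline] * len(x))
    xp, total = list(x), 0.0
    for i in pos:
        xp[i] = baseline
        total += max(model(xp), f_base)   # clip scores below the baseline output
    return total / len(x)

model = lambda v: sum(v)
x = [1.0, 2.0, 3.0]
good = nord(model, x, [1.0, 2.0, 3.0])   # ablates the largest feature first
bad = nord(model, x, [3.0, 2.0, 1.0])    # ablates the smallest feature first
```

Here good < bad: the first attribution ablates the most necessary feature earliest, so the clipped outputs decay faster and the averaged score is smaller.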
3.3 Sufficiency Ordering (SOrd)
Using the notation of Section 3, let $f(x^b \cup \pi^+_{1:k})$ be the model's output on the baseline image $x^b$ with pixels $\pi^+(1), \dots, \pi^+(k)$ restored to their values in $x$. Denote by $\mathrm{SOrd}(A, x)$ the SOrd score given an input image $x$ and an attribution method $A$:
$\mathrm{SOrd}(A, x) \;=\; \frac{1}{n} \sum_{k=1}^{m} \min\!\big( f(x^b \cup \pi^+_{1:k}),\; f(x) \big)$  (4)
where $m = |\pi^+|$ and $n$ is the total number of pixels in $x$. We include the $\min$ to clip scores above the original output. $\mathrm{SOrd}$ is inspired by Average % Drop [Chattopadhay_2018], even though Average % Drop, being used for localization analysis, only measures the final score after adding all highlighted pixels back to a baseline image. According to Definition 2, we have the following proposition.
Proposition 2.
An attribution method $A_1$ shows a (strictly) better Sufficiency Ordering than another method $A_2$ given an input image $x$ if $\mathrm{SOrd}(A_1, x) > \mathrm{SOrd}(A_2, x)$.
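The mirror-image sketch for the SOrd score of Eq. (4) is below; as before, the model, input, and attribution vectors are hypothetical, and only the clipping against the original output follows the text.

```python
# SOrd sketch: add positively attributed pixels to the baseline image,
# highest score first, and average the clipped outputs; higher is better
# (Proposition 2). `model`, `x`, and the scores are hypothetical.

def sord(model, x, scores, baseline=0.0):
    pos = sorted((i for i, a in enumerate(scores) if a > 0),
                 key=lambda i: scores[i], reverse=True)
    f_orig = model(x)
    xp, total = [baseline] * len(x), 0.0
    for i in pos:
        xp[i] = x[i]
        total += min(model(xp), f_orig)   # clip scores above the original output
    return total / len(x)

model = lambda v: sum(v)
x = [1.0, 2.0, 3.0]
good = sord(model, x, [1.0, 2.0, 3.0])   # restores the largest feature first
bad = sord(model, x, [3.0, 2.0, 1.0])    # restores the smallest feature first
```

Here good > bad: restoring the most sufficient feature first recovers the output faster, so the averaged clipped score is larger.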
NOrd and SOrd together provide a more comprehensive evaluation of an attribution method. In Section 3.4, we discuss the disadvantages of using NOrd or SOrd alone and propose proportionality as a refinement of the ordering analysis.
3.4 Proportionality
NOrd and SOrd do not incorporate the attribution scores beyond producing an ordering. This can be an obstacle to an accurate description of feature necessity or sufficiency. For example, consider a toy model $f(x_1, x_2) = 2x_1 + x_2$ with inputs $x_1 = x_2 = 1$. Any attribution method that assigns a higher score to $x_1$ than to $x_2$ produces the identical ordering $x_1 \succ x_2$, even though a method could overestimate the degree of necessity (or sufficiency) of $x_1$ by assigning it a much higher attribution score. Under linear agreement [leino2018influence], the scores for $x_1$ and $x_2$ are more reasonable if their ratio is close to 2:1. Explaining a decision made by a more complex model using only the ordering of attributions may overestimate or underestimate the necessity (or sufficiency) of an input feature. Therefore, we propose proportionality as a refinement that quantifies necessity and sufficiency, complementary to the ordering measurements.
Definition 3 (Proportionality-k for Necessity).
Given two subsets of pixels $\Gamma_1$ and $\Gamma_2$ whose attribution scores each sum to the same share $k$ of the total positive attribution, where $\Gamma_1$ is drawn from the highest-scored pixels first and $\Gamma_2$ from the lowest-scored positive pixels first, the Proportionality-$k$ for Necessity of an attribution method on input $x$ is the difference $|\Delta_1 - \Delta_2|$ between the output changes $\Delta_1$ and $\Delta_2$ observed when $\Gamma_1$ and $\Gamma_2$, respectively, are perturbed to the baseline.
The motivation behind Proportionality-$k$ for Necessity is the following: given a group of pixels ordered by their attribution scores, there are many ways of distributing scores among the features while keeping the ordering unchanged. An optimal assignment is one in which features receive attribution scores proportional to the output change when they are perturbed. In other words, given any two subsets of pixels $\Gamma_1$ and $\Gamma_2$ with total attribution scores $k_1$ and $k_2$, if both are perturbed, the changes in output scores $\Delta_1$ and $\Delta_2$ should satisfy $\Delta_1 / \Delta_2 = k_1 / k_2$. This property is demanded because the same share of attribution scores should account for the same necessity or sufficiency. If we restrict the condition to $k_1 = k_2 = k$, the difference between $\Delta_1$ and $\Delta_2$ becomes an indirect measurement of proportionality. For the measurement of necessity, we further require that $\Gamma_1$ be perturbed starting from the pixel with the highest attribution score and $\Gamma_2$ from the pixel with the lowest, in accordance with the setup of NOrd. A smaller difference therefore shows better Proportionality-$k$ for Necessity.
Proposition 3.
An attribution method $A_1$ shows better Proportionality-$k$ for Necessity than method $A_2$ if $|\Delta_1 - \Delta_2|$ under $A_1$ is smaller than under $A_2$.
A related requirement for attribution methods is completeness, discussed by [sundararajan2017axiomatic], and its generalization sensitivity-n, discussed by [ancona2017better]. Completeness requires the sum of all attribution scores to equal the change of output relative to a baseline input; sensitivity-n requires that, for any subset of $n$ pixels, the sum of their attribution scores equal the change of output relative to the baseline when the pixels in that subset are removed. When $n$ is the total number of pixels in the input image, sensitivity-n reduces to completeness. The relationship between sensitivity-n and Proportionality-$k$ for Necessity is as follows:
Proposition 4.
If an attribution method satisfies both sensitivity-$|\Gamma_1|$ and sensitivity-$|\Gamma_2|$, then $\Delta_1 = \Delta_2$ under the condition $k_1 = k_2$, but not vice versa.
The proof of Proposition 4 (restated as Proposition 8) can be found in Appendix 1. We further contrast our method with sensitivity-n in Section 5. Integrating proportionality over all possible shares of attribution scores, we define the Total Proportionality for Necessity (TPN):
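The core of this argument, that equal attribution shares imply equal output changes when sensitivity-n holds, can be checked numerically on a linear model, for which grad $\times$ input satisfies sensitivity-n for every subset. The weights and input below are hypothetical.

```python
# For a linear model, grad * input satisfies sensitivity-n for every subset,
# so two subsets with the same total attribution mass produce the same output
# change when ablated. Weights and input are hypothetical.

w = [3.0, 1.0, 2.0, 2.0]
x = [1.0, 2.0, 1.0, 1.0]
model = lambda v: sum(wi * vi for wi, vi in zip(w, v))
attr = [wi * xi for wi, xi in zip(w, x)]   # grad * input = [3, 2, 2, 2]

def drop(subset):
    """Output change when the pixels in `subset` are set to the 0 baseline."""
    xp = [0.0 if i in subset else xi for i, xi in enumerate(x)]
    return model(x) - model(xp)

# attr[1] == attr[2] (equal shares), and ablating either {x2} or {x3}
# changes the output by exactly the same amount, as Proposition 4 predicts.
```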
Definition 4 (Total Proportionality for Necessity).
Given an attribution method $A$ and an input image $x$, the Total Proportionality for Necessity is measured by
$\mathrm{TPN}(A, x) \;=\; \beta^{\lambda} \int_{0}^{1} \left| \Delta_1(k) - \Delta_2(k) \right| \, dk$  (6)
where $\beta = f(x \setminus \pi^+) / f(x)$ is the ratio of the model's output after perturbing all pixels with positive attribution scores over its original output, and $\Delta_1(k)$, $\Delta_2(k)$ are the output changes of Definition 3 at share $k$. Details of the notation are discussed in Section 3.
$\lambda$ is a positive hyperparameter.
Definition 4 measures the area between two perturbation curves, one ablating pixels with the highest attribution scores first and the other in the reversed ordering. The difference from Necessity Ordering is that the area is measured against the share of attribution scores (the value of $k$) instead of the share of pixels. $\lambda$ is a hyperparameter adjusting the penalty when an attribution method highlights only non-necessary features (e.g., assigning high scores to the background when classifying ducks, and scores of 0 to duck-related features). Perturbations on non-necessary features may not change the output at all, and we penalize a method for this. Generalizing Proposition 3, we argue:
Proposition 5.
An attribution method $A_1$ shows better Total Proportionality for Necessity than method $A_2$ if $\mathrm{TPN}(A_1, x) < \mathrm{TPN}(A_2, x)$.
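TPN's area-between-curves idea can be sketched as follows; the coarse uniform-step Riemann sum, the placement of the $\beta^{\lambda}$ penalty, and the toy model are all assumptions made for illustration rather than the paper's exact implementation.

```python
# TPN sketch: trace two ablation curves (highest-attributed first vs. lowest
# first) against the cumulative *share* of positive attribution, take the
# area between them, and scale by beta**lam as a penalty term.

def curve(model, x, order, scores, baseline=0.0):
    """(cumulative attribution share, model output) along an ablation order."""
    total = sum(scores[i] for i in order)
    xp, share, pts = list(x), 0.0, []
    for i in order:
        xp[i] = baseline
        share += scores[i] / total
        pts.append((share, model(xp)))
    return pts

def tpn(model, x, scores, lam=1.0, baseline=0.0):
    pos = sorted((i for i, a in enumerate(scores) if a > 0),
                 key=lambda i: scores[i], reverse=True)
    hi = curve(model, x, pos, scores, baseline)
    lo = curve(model, x, pos[::-1], scores, baseline)
    # Coarse uniform-step Riemann sum of the gap between the two curves.
    area = sum(abs(h[1] - l[1]) for h, l in zip(hi, lo)) / len(pos)
    beta = hi[-1][1] / model(x)   # output ratio after ablating all pos. pixels
    return (beta ** lam) * area

model = lambda v: 1.0 + sum(v)   # hypothetical; the +1 keeps a nonzero floor
x = [1.0, 2.0, 3.0]
val = tpn(model, x, [1.0, 2.0, 3.0])
```

Increasing `lam` shrinks the score for methods whose positive pixels, once ablated, genuinely collapse the output (small beta), which is the direction the penalty is meant to reward.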
Under a similar construction, we define Proportionality-$k$ for Sufficiency and the Total Proportionality for Sufficiency (TPS):
Definition 5 (Proportionality-k for Sufficiency).
Analogously to Definition 3, let $\Gamma_1$ and $\Gamma_2$ be subsets whose attribution scores each sum to the share $k$, added to the baseline image starting from the highest-scored and the lowest-scored positive pixels, respectively, with output changes $\Delta_1$ and $\Delta_2$. We want the difference $|\Delta_1 - \Delta_2|$ to be as small as possible, since the same share of attribution scores should reflect the same sufficiency. Therefore, we have the following proposition:
Proposition 6.
An attribution method $A_1$ shows better Proportionality-$k$ for Sufficiency than method $A_2$ if $|\Delta_1 - \Delta_2|$ under $A_1$ is smaller than under $A_2$.
Definition 6 (Total Proportionality for Sufficiency).
Given an attribution method $A$ and an input image $x$, the Total Proportionality for Sufficiency is measured by
$\mathrm{TPS}(A, x) \;=\; \alpha^{-\lambda} \int_{0}^{1} \left| \Delta_1(k) - \Delta_2(k) \right| \, dk$  (8)
where $\alpha = f(x^b \cup \pi^+) / f(x)$ is the ratio of the model's output after adding all pixels with positive attribution scores to a baseline input over its original output. Refer to Sections 3 and 3.3 for details of the notation. $\lambda$ is a positive hyperparameter.
Similarly, TPS measures the area between the curves of the model's output as pixels are added to a baseline input, either highest-attributed first or lowest first. $\lambda$ adjusts the penalty when an attribution method highlights non-sufficient features, since adding those pixels does not increase the output significantly. Finally, we have
Proposition 7.
An attribution method $A_1$ shows better Total Proportionality for Sufficiency than another method $A_2$ if $\mathrm{TPS}(A_1, x) < \mathrm{TPS}(A_2, x)$.
In summary, we differentiate and describe the Necessity Ordering (NOrd) and Sufficiency Ordering (SOrd) drawn from previous work, and propose Total Proportionality for Necessity (TPN) and Total Proportionality for Sufficiency (TPS) as refined evaluation criteria for necessity and sufficiency. We then apply our measurements to explain the predictions of an image classification task in the rest of the paper.
4 Evaluation
We evaluate our metrics directly on CNNs. A linear model might seem a reasonable starting point, but ancona2017better concludes that Saliency Map, IG, LRP, and DeepLIFT are equivalent for linear models, and their proof also applies to SG, GB, and ID. GradCAM, on the other hand, is not defined for models without convolutional layers. Linear models are therefore not expected to distinguish most attribution methods.
4.1 Datasets and Models
We evaluate the necessity and sufficiency of all attribution methods mentioned in Section 2 on VGG16 [simonyan2014deep] with pretrained weights from ImageNet, fine-tuned on Caltech256 [griffinHolubPerona]. Without data augmentation or preprocessing, the model achieves an accuracy of 64.7% on the test dataset containing 5000 images.
4.2 Implementation
We pick a zero baseline for IG and DeepLIFT (RevealCancel) for all images. The last convolutional layer of VGG16 is selected for GradCAM, as selvaraju2016gradcam suggests. For LRP, a variety of rules are available; we use the implementation of LRP with the generalization tricks of Montavon_2018, who argues this rule is better for image explanations. For ID, we employ the instance distribution of interest [leino2018influence] (simple gradient) to visualize the top 1000 neurons in block4_conv3 for each class. We use the same convolutional block as leino2018influence, but a deeper slice, expecting deeper feature representations (the authors use conv4_1). We believe a better slice could be argued for, but slice searching remains an open question. For the penalty terms, we use a smaller $\lambda$ for TPN and a larger one for TPS. In our experiments we observe that even random perturbation can cause a serious drop in the output, so there is no need for a large penalty in TPN. For TPS, however, even after adding all pixels with positive attribution scores, the model can produce very low scores compared to the original for some instances; therefore, we slightly increase $\lambda$ to reward attribution methods that can locate sufficient features.
4.3 Evaluate with one instance
We evaluate Necessity Ordering, Sufficiency Ordering, TPN, and TPS on the example from Fig. 1 before applying them to the whole dataset. Results and comparisons are provided in Table 2, and the computations for TPN and TPS before applying the penalty terms are shown in Fig. 3. Insights are discussed below.
      1st  2nd

NOrd  DeepLIFT (.10)  GradCAM (.13)
SOrd  GradCAM (.45)  GB (.43)
TPN  IG (.23)  GB (.27)
TPS  GradCAM (.43)  IG (.44)
Under necessity analysis, NOrd shows that DeepLIFT finds the best ordering of pixels for necessity. A deeper look with TPN shows that IG is best at assigning pixel scores proportional to the drop in output when necessary pixels are perturbed in Fig. 1.
Under sufficiency analysis, SOrd shows that GradCAM finds the best ordering of pixels for sufficiency. A deeper look with TPS also shows that GradCAM is best at assigning scores proportional to the increase in the model's output when sufficient pixels are provided in Fig. 1.
4.4 Evaluate with Caltech256
We evaluate NOrd, SOrd, TPN, and TPS on the test dataset of Caltech256. For each instance, we compare each method's NOrd and SOrd scores with the others', ranking all attribution methods from 1st to 8th, where 1st means the lowest NOrd or the highest SOrd, respectively. From the results in Fig. 4, we compute the frequency with which an attribution method is placed at each rank; a darker grid cell indicates a higher frequency of the corresponding ranking. Based only on the ordering analysis in Fig. 4 (a)(b), IG and SG are most often the best in Necessity Ordering, while GB and GradCAM are most often the best in Sufficiency Ordering. Combined with the proportionality analysis, we find that the winner for necessity varies, but Saliency Map, GB, and GradCAM are generally worse methods that should be avoided, as they are often among the poorest. For sufficiency, DeepLIFT, ID, and GB are most often the best methods. However, regardless of ordering or proportionality analysis, no method is an obvious frequent winner on both sides.
5 Related Work
We consider our work a subset of sensitivity evaluation: how well we can trust an attribution method's quantification of feature importance in the input. A close concept is quantitative input influence by 7546525 (even though the authors do not target deep neural networks). Sensitivity(a)(b) [sundararajan2017axiomatic] provides the basis of the discussion, and sensitivity-n [ancona2017better] imposes stricter requirements. The main connection between proportionality and sensitivity-n is discussed in Section 3.4; here we discuss the main difference between the two concepts. Proportionality approaches sensitivity from the view that, regardless of the number of pixels, the same share of attribution scores should account for the same change in the output, while sensitivity-n requires that removing pixels change the output by the total attribution score of those pixels. Sensitivity-n gives only a True/False verdict on an attribution method, whereas proportionality provides numerical results for comparing different methods under necessity and sufficiency. Beyond sensitivity, the continuity of attribution methods is an important and desired property: an attribution method should output similar results for similar input and prediction pairs. ghorbani2017interpretation discusses this property, and Montavon_2018 and kindermans2017unreliability provide failure cases and well-designed attacks that cause unreasonable attribution results. In addition, [adebayo2018sanity] evaluates the correlation between attribution scores and the model's parameters. Outside the discussion of CNNs, [arras2019evaluating] evaluates attribution methods applied to language tasks with LSTMs [Hochreiter:1997:LSM:1246443.1246450].
6 Discussion
In this section, we discuss how this paper helps correct some potential misunderstandings in the interpretation of attribution methods. We argue that meaningful interpretations should depend on the purpose of interpretation and on the criteria used in evaluation.
6.1 Interpretations based on purpose
In this paper, we provide two realizable purposes of interpretation: identifying more necessary or more sufficient features. These purposes can be realized by using the attribution methods that win the necessity or sufficiency tests, respectively. Failing to be a better attribution method for necessity does not mean a method cannot offer insight on the sufficiency side; e.g., GB does poorly in the NOrd test but fairly well in the SOrd test.
6.2 Interpretations based on criteria
We provide ordering and proportionality as two criteria in this paper. Based on the ordering criteria alone, it is safe to argue that features with higher attribution scores in the input image are more necessary or sufficient than others, but it is not safe to draw quantitative conclusions, e.g., that one feature is twice as necessary as another, or that a feature is equivalently necessary as the presence of two others together because they have similar shares of attribution scores. Interpretations based on proportionality, however, give more confidence in making quantitative arguments with the winning attribution method. For example, GradCAM changes from one of the most frequent winners in SOrd to most frequently 6th in TPS. Revisiting the algorithm behind GradCAM, it computes scores from the activations of the selected layer, weighted by the gradients. We believe GradCAM is outstanding at localizing which part of the image is more sufficient, but we find it hard to justify that the activation of an internal layer is proportional to the necessity or sufficiency of an input feature.
7 Conclusion
In this paper, we summarize existing evaluation metrics for attribution methods and categorize them under two logical concepts, necessity and sufficiency. We then demonstrate realizable criteria to quantify necessity and sufficiency through ordering analysis and its refinement, proportionality analysis. We evaluate existing attribution methods against our criteria and list the best methods for each criterion. We discover that certain attribution methods excel in necessity or sufficiency, but none is a frequent winner for both.
Acknowledgement
This work was developed with the support of NSF grant CNS1704845 as well as by DARPA and the Air Force Research Laboratory under agreement number FA87501520277. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of DARPA, the Air Force Research Laboratory, the National Science Foundation, or the U.S. Government.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU and Titan V GPU used for this research.
References
Appendix
Appendix 1
Proposition 8.
If an attribution method satisfies both sensitivity-$|\Gamma_1|$ and sensitivity-$|\Gamma_2|$, then $\Delta_1 = \Delta_2$ under the condition $k_1 = k_2$, but not vice versa.
Proof.
If $A$ satisfies sensitivity-$|\Gamma_1|$, then for the ordered subset $\Gamma_1$ we have $\sum_{i \in \Gamma_1} a_i = f(x) - f(x \setminus \Gamma_1) = \Delta_1$.
The same holds for $\Gamma_2$ if $A$ satisfies sensitivity-$|\Gamma_2|$. Under the condition $k_1 = k_2$,
$\Delta_1 \;=\; \sum_{i \in \Gamma_1} a_i \;=\; k_1 \;=\; k_2 \;=\; \sum_{i \in \Gamma_2} a_i \;=\; \Delta_2$  (9)
∎
Appendix 2(a)
Appendix 2(b)
Rankings of attribution methods for the example in Fig. 1 (cf. Section 4.3). Notice that we normalize the output scores at each perturbation step by the original output in Table 2 and Table 3.
     NOrd  SOrd

1st  DeepLIFT (.10)  GradCAM (.44)
2nd  GradCAM (.13)  GB (.43)
3rd  IG (.17)  LRP (.36)
4th  SG (.19)  DeepLIFT (.33)
5th  ID (.29)  SG (.28)
6th  GB (.38)  IG (.25)
7th  LRP (.41)  Random (.22)
8th  Saliency Map (.41)  ID (.14)
9th  Random (.68)  Saliency Map (.14)

     TPN  TPS

1st  IG (.23)  GradCAM (.43)
2nd  GB (.27)  IG (.44)
3rd  GradCAM (.37)  GB (.51)
4th  DeepLIFT (.40)  Saliency Map (.56)
5th  SG (.42)  LRP (.84)
6th  Random (.56)  Random (.86)
7th  Saliency Map (.62)  SG (.96)
8th  LRP (.71)  DeepLIFT (2.47)
9th  ID (.79)  ID (2.74)