Sanity Simulations for Saliency Methods

by Joon Sik Kim, et al.
Carnegie Mellon University

Saliency methods are a popular class of feature attribution tools that aim to capture a model's predictive reasoning by identifying "important" pixels in an input image. However, the development and adoption of saliency methods are currently hindered by the lack of access to underlying model reasoning, which prevents accurate method evaluation. In this work, we design a synthetic evaluation framework, SMERF, that allows us to perform ground-truth-based evaluation of saliency methods while controlling the underlying complexity of model reasoning. Experimental evaluations via SMERF reveal significant limitations in existing saliency methods, especially given the relative simplicity of SMERF's synthetic evaluation tasks. Moreover, the SMERF benchmarking suite represents a useful tool in the development of new saliency methods to potentially overcome these limitations.




1 Introduction

Saliency methods have emerged as a popular tool to better understand the behavior of machine learning models. Given a trained model and an input image, these methods output a feature attribution indicating which pixels they deem to be most “important” to the model’s prediction for that image Simonyan et al. (2013); Zeiler and Fergus (2014); Springenberg et al. (2015); Bach et al. (2015); Sundararajan et al. (2017); Shrikumar et al. (2017); Montavon et al. (2017); Smilkov et al. (2017); Lundberg and Lee (2017). Thus, a natural question to ask is: how do we define “important” and subsequently evaluate the efficacy of these methods?

One intuitive approach is to measure how well a saliency method locates or “points to” the expected pixels of interest in the input image. In fact, this “pointing game” Zhang et al. (2018) is one of the predominant evaluations used today Zhou et al. (2016); Selvaraju et al. (2017); Chattopadhay et al. (2018); Woo et al. (2018); Gao et al. (2019); Arun et al. (2020). Currently, these evaluations rely on external knowledge to define an expected feature attribution that highlights the pixels that a human would expect to be important for the given task. Then, they compute the overlap between the output of a saliency method and this expected feature attribution using metrics such as Intersection-Over-Union (IOU).

Unfortunately, there are two key limitations of existing pointing game evaluations. First, the results are unreliable when the model’s ground-truth reasoning does not match human expectations, e.g., when the model is relying on complex spurious correlations among features. These discrepancies are particularly problematic when considering real-world image tasks for which ground-truth feature attributions are generally unattainable. Second, they are based on relatively simple object detection tasks where we expect only a single region of the image, i.e., the object itself, to be relevant to the prediction. In practice, there exist more complex tasks, e.g., medical imaging or autonomous driving, where interactions between multiple regions of the image may be relevant to the model’s prediction.

These two limitations highlight the same fundamental concern: we do not know a priori what or how complex the model’s reasoning will be, irrespective of how simple we believe the underlying task to be. For instance, the top panel of Figure 1 considers the seemingly simple task of identifying a baseball bat in an image. To perform this task, the model may use simple reasoning by relying on a single region of the image, i.e., the bat itself, to make its prediction. However, it also may use more complex reasoning by relying on interactions among multiple regions of the image, e.g., using the presence of a person and a glove to identify a bat. As illustrated through this example, the practical utility of saliency methods hinges on their robustness to the complexity of the underlying model reasoning.

In this work, we aim to sidestep the aforementioned key limitations of existing pointing game evaluations and to study the degree to which saliency methods can capture model reasoning. We introduce a synthetic framework called Simulated ModEl Reasoning Evaluation Framework (SMERF) that allows us to perform ground-truth-based evaluation of saliency methods while controlling the underlying complexity of model reasoning. We leverage SMERF to perform an extensive and nuanced evaluation of leading saliency methods on several abstract object detection problems that are motivated by real-world object detection tasks (see bottom panel of Figure 1).

Figure 1: (Top) Existing pointing game evaluations do not have access to ground-truth feature attributions and instead rely on expected feature attributions. Potential discrepancies between these attributions represent a fundamental limitation of existing evaluations. For instance, we expect that a model trained to detect a baseball bat will rely on the baseball bat region of the image. However, it may be the case that the model relies on more complex reasoning by using the presence of a person and a glove to identify a bat. (Bottom) SMERF constructs a synthetic set of tasks that are stylized versions of real object detection tasks. For instance, consider the task of identifying the letter ‘B’ in an image, where ‘B’ corresponds to the baseball bat, and the two boxes correspond to the person and the glove. SMERF controls a trained model’s underlying reasoning via simulations over different data distributions, thus providing ground-truth feature attributions that we can use to evaluate saliency methods.

Using SMERF, we find that, in simple reasoning settings, most saliency methods perform reasonably well on average, as measured by IOU values with a correctness threshold of 0.5. However, we also observe some failure modes across different experimental scenarios. For instance, LRP Bach et al. (2015) exhibits significant performance degradation when the input image includes a randomly located set of spurious features. Moreover, in complex reasoning settings, we observe that nearly all of the evaluated saliency methods suffer from average performance drops and demonstrate acute failure modes. Our key results are summarized in Figure 2.

Figure 2: Summary of ground-truth-based evaluation of saliency methods via SMERF. (Left) In simple reasoning settings, where the model relies on a single image region to make a prediction, average IOU performance (blue) is reasonably good for most of the methods. However, all methods demonstrate failure modes as shown by minimum performance (orange) over various tasks. (Right) In more complex reasoning settings, where the model relies on interactions among multiple image regions, most of the methods suffer from average performance degradation and demonstrate acute failure modes.

Our results highlight major limitations in existing saliency methods, especially given the relative simplicity of SMERF’s synthetic evaluation tasks and the arguably lenient definition of correctness we consider. (While we view the 0.5 IOU threshold as a lenient definition of correctness in synthetic settings, we selected this notion of correctness because it is commonly used in practice when evaluating on real tasks Everingham et al. (2015); Wang et al. (2019).) Indeed, we view the performance of saliency methods on SMERF’s stylized tasks as an upper bound on the saliency methods’ ability to recover simple or complex model reasoning on tasks involving natural images. Moreover, we believe that benchmarking suites like SMERF can play an important role moving forward in the development of new methods to potentially overcome these limitations.

2 Related Work

Pointing Game Evaluation. The pointing game, which measures how well a saliency method identifies the relevant regions of an image, is one of the predominant ways to evaluate the efficacy of these methods Zhou et al. (2016); Selvaraju et al. (2017); Chattopadhay et al. (2018); Woo et al. (2018); Gao et al. (2019); Arun et al. (2020). Many existing pointing game evaluations lack access to ground truth model reasoning, and instead rely on expected feature attributions generated by domain experts to evaluate saliency methods. Intuitively, this might appear to be reasonable by noting that the model has high test accuracy and, therefore, must be using the correct reasoning. However, datasets often contain spurious correlations and, as a result, a model may be able to achieve high test accuracy using incorrect reasoning. Consequently, these evaluations have confounded the correctness of the explanation with the correctness of the model. SMERF eliminates this confounding factor by leveraging the model’s ground-truth reasoning, which allows us to demonstrate that several methods that were previously deemed to be effective are in fact sometimes ineffective.

More recently, Yang and Kim (2019); Adebayo et al. (2020) tried to address this same limitation using semi-synthetic datasets where the ground-truth reasoning is known by combining the object from one image with the background from another image. Both of these analyses are based on the simple reasoning setting and, in that setting, our results roughly corroborate theirs. However, our analysis extends to more complex reasoning settings and demonstrates that methods that worked in the simple setting mostly fail or perform much worse in the complex setting. It is important to consider the more complex reasoning setting because we do not know how complex the model’s reasoning is in practice (e.g., a model may rely on a spurious correlation and use complex reasoning for a simple task).

Other metrics. Beyond the pointing game, several proxy metrics have been proposed to measure the efficacy of saliency methods Bach et al. (2015); Ancona et al. (2018); Alvarez Melis and Jaakkola (2018); Hooker et al. (2019). However, Tomsett et al. (2020) show that such metrics inherently depend on subtle hyperparameters that are not well understood and that this leads to analyses with inconsistent results. The unreliability of such proxy metrics further emphasizes the advantage of using a more intuitive and direct evaluation.


Direct criticisms of saliency methods. Adebayo et al. (2018) use two sanity checks that measure the statistical relationship between a saliency method and the model’s parameters or the data it was trained on. They found that only a few methods (i.e., Gradients Simonyan et al. (2013) and Grad-CAM Selvaraju et al. (2017)) passed these tests. While SMERF is orthogonal to this type of analysis, it demonstrates that even methods that pass these tests have failure cases, e.g., as shown in Figure 2.

3 Methods

In this section we introduce SMERF, a synthetic evaluation framework where several types of ground-truth model reasoning, ranging from simple to complex, are generated for testing the saliency methods’ ability to recover them. We describe how SMERF encodes reasoning into a model (Section 3.1), followed by our approach to define and control the complexity of the underlying model reasoning with additional features (Section 3.2). We then instantiate several types of model reasoning with a simple yet representative dataset (Section 3.3). Our proposed TextColor dataset is composed of features like text and shapes, and is used in our experiments (see Section 4).

3.1 Simulating Data Distributions and Encoding Ground-truth Model Reasoning

Figure 3: The workflow of SMERF. Through simulation of different feature distributions, we generate datasets for training a model and define ground-truth model reasoning the model should follow. The model is validated using a separate dataset from the simulated feature distributions to ensure that it adheres to the ground-truth model reasoning.

Figure 3 summarizes the workflow of SMERF. SMERF first creates and modulates synthetic feature distributions to generate datasets for training and validating the model. These distributions define the ground-truth model reasoning, which is a set of specific rules of how the predictions are made from a set of different features. A model is then trained with the dataset generated from this distribution. A separate validation dataset from the same distribution is used to check that for every edge case the trained model’s prediction perfectly aligns with the ground-truth model reasoning. The edge cases are handled by dividing the images into separate buckets according to the set of features the images have, and then making sure that the dataset contains a sufficient number of buckets that effectively cover all possible cases defined in the ground-truth model reasoning.
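The bucket-based validation described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `validate_by_bucket`, `reasoning_is_encoded`, `bucket_of`, and `rule` are hypothetical, and the model is represented by an arbitrary `predict` callable.

```python
from collections import defaultdict


def validate_by_bucket(samples, predict, rule, bucket_of, required_buckets):
    """Return per-bucket accuracy and the required buckets that have no samples.

    samples          : iterable of (image, features) pairs
    predict          : callable image -> predicted label (the trained model)
    rule             : callable features -> label (the ground-truth reasoning)
    bucket_of        : callable features -> hashable bucket id
    required_buckets : bucket ids that the validation set must cover
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for image, features in samples:
        bucket = bucket_of(features)
        total[bucket] += 1
        correct[bucket] += int(predict(image) == rule(features))
    accuracy = {bucket: correct[bucket] / total[bucket] for bucket in total}
    missing = set(required_buckets) - set(total)
    return accuracy, missing


def reasoning_is_encoded(accuracy, missing):
    """True only if every required bucket is covered and every bucket is 100% accurate."""
    return not missing and all(value == 1.0 for value in accuracy.values())
```

Reporting accuracy per bucket rather than as a single pooled number makes an uncovered or failing edge case visible even when it contains few images.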

We train and validate convolutional neural networks on these carefully curated datasets. Once the trained model achieves 100 percent training and validation accuracy (at which point we can confirm that the desired reasoning has been encoded in the model), feature attributions for multiple images from different buckets are computed. These feature attributions are then compared against the ground-truth feature attributions derived from the ground-truth model reasoning. By establishing access to the data generation process, we provide a way to offset the first limitation of the existing pointing games: lack of ground-truth.

3.2 Varying the Complexity of Model Reasoning

SMERF can vary the complexity of model reasoning by adding new features to the synthetic distribution, hence modifying the rules by which different features interact with one another. In Figure 4, we outline how additional spurious feature(s) in the data can influence how the model predicts, on top of the correct set of features. The learned model exhibits one of three types of reliance on the spurious feature introduced: (1) Full Reliance (FR), (2) No Reliance (NR), or (3) Conditional Reliance (CR). FR and NR occur when the prediction depends solely on either the spurious features or the correct features, respectively. These two types of reliance create simple reasoning, as the reasoning of the learned model directly relies on either the presence or absence of the correct or the spurious set of features, meaning the model only relies on a single region of the image.

We can extend this idea to more complex reasoning with CR, the case in which prediction sometimes relies on the spurious features and sometimes on the correct features, under different conditions. We enforce the model’s prediction to follow certain hierarchical rules composed of the correct and the spurious features. For instance, the prediction may depend on a spurious feature only when another feature is present, and otherwise on the correct feature. In such cases, as the model reasoning relies on certain sets of features conditioned on other features, the saliency methods have to attend to multiple features in the image to correctly identify the model reasoning. By manipulating the distribution and encoding different types of model reasoning, we address the second limitation of existing pointing game evaluations: lack of more diverse sets of model reasoning.

Figure 4: Given a correct set of features and spurious features in the training data, the learned model may exhibit one of these three behaviors based on how much it relies on the spurious features. The Full Reliance (FR) case is when the prediction of the model is solely dependent on the spurious feature, whereas No Reliance (NR) is when the prediction of the model completely ignores the spurious feature. These settings correspond to models with simple reasoning. Conditional Reliance (CR) is when the prediction sometimes depends on the true feature, but sometimes on the spurious feature (a mix of FR and NR), which can be extended to incorporate more complex model reasoning based on a conditional relationship among the features. SMERF allows us to control the underlying model reasoning, thus providing us with the correct ground-truth feature attribution (denoted in red).

3.3 TextColor Dataset

We now instantiate different types of model reasoning used for evaluating saliency methods. Consider a hypothetical scenario of training a classifier that predicts if the character in the image is ‘A’ or ‘B’. We generate a family of datasets with SMERF, called the TextColor dataset, composed of semantic yet simple features, which we refer to as the following (Figure 5):

  • Character: a black or white character, either ‘A’ or ‘B’

  • Patch: a 10-by-10 black or white box at random locations

  • Switch: a 4-by-4 smaller black or white box at random locations

  • Background color: colors on the spectrum from red to blue, or white, or black

Figure 5: Samples from the TextColor dataset and features used for generating ground-truth model reasoning. The features are introduced in simple terms to make the perception problem for the model as easy as possible.
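As a rough illustration of how such features can yield ground-truth masks for free, the sketch below places a patch and a switch on a colored background and records where each one lands. It is not the TextColor generator: the 64-pixel image size, the red-to-blue interpolation, and the names `place_box` and `make_sample` are assumptions, the character glyph is omitted (rendering it would need a font library), and overlap between the two boxes is not prevented.

```python
import numpy as np


def place_box(mask, size, rng):
    """Mark a size-by-size box at a random location in a boolean mask (in place)."""
    height, width = mask.shape
    top = rng.integers(0, height - size + 1)
    left = rng.integers(0, width - size + 1)
    mask[top:top + size, left:left + size] = True
    return mask


def make_sample(rng, image_size=64, has_patch=True, has_switch=True):
    """Return an RGB image in [0, 1] and one boolean mask per box feature."""
    t = rng.uniform()                              # position between red (0) and blue (1)
    background = np.array([1.0 - t, 0.0, t])
    image = np.tile(background, (image_size, image_size, 1))
    masks = {}
    # Box sizes follow the feature list: 10-by-10 patch, 4-by-4 switch.
    for name, size, present in (("patch", 10, has_patch), ("switch", 4, has_switch)):
        mask = np.zeros((image_size, image_size), dtype=bool)
        if present:
            place_box(mask, size, rng)
            image[mask] = rng.choice([0.0, 1.0])   # black or white box
        masks[name] = mask
    return image, masks
```

Because the generator itself decides where each feature goes, the masks can later serve as the regions against which a feature attribution is compared.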

Using datasets composed of these features, we recreate various versions of model reasoning presented in Figure 4, ranging from simple to complex. The simple reasoning setup we use is instantiated with both FR and NR, where the spurious feature is either the patch or the character in the image which fully impacts the prediction made by the model. The complex reasoning setup is based on CR, where the spurious features – the switch and the patch – compose a hierarchical rule under which the predictions are made: when the switch is present, the patch is the only relevant feature; when the switch is not present, the character is the only relevant feature (think of the switch feature as an on/off switch, hence the name, that determines which feature is the relevant one for the prediction).
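The three kinds of reasoning can be written as small labeling rules. The sketch below is only illustrative: the dictionary encoding, in which `features["patch"]` and `features["character"]` each hold the class that feature indicates and `features["switch"]` is a boolean, is an assumption, as are the function names.

```python
def label_full_reliance(features):
    """FR: the prediction follows the patch alone."""
    return features["patch"]


def label_no_reliance(features):
    """NR: the prediction follows the character alone."""
    return features["character"]


def label_conditional_reliance(features):
    """CR: the patch decides when the switch is present, otherwise the character."""
    return features["patch"] if features["switch"] else features["character"]


def relevant_feature(setting, features):
    """Name of the feature that the ground-truth attribution should highlight."""
    if setting == "FR":
        return "patch"
    if setting == "NR":
        return "character"
    if setting == "CR":
        return "patch" if features["switch"] else "character"
    raise ValueError(f"unknown setting: {setting}")
```

For example, with `{"character": "A", "patch": "B", "switch": True}` the CR rule returns `"B"` and the relevant feature is the patch; flipping the switch to `False` changes both to the character.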

Such instantiations using simple sets of features are stylized versions of real object detection tasks. Taking the patch, the character, and the switch as separate objects, the model reasoning constructed through SMERF is representative of plausible reasoning the model may be using for real images, depending on what real objects correspond to each of these simple features. Going back to the baseball example in the introduction and Figure 1, the model may be predicting the existence of a bat in an image solely from the existence of a glove, which corresponds to the simple reasoning setting described above, where the patch is the glove, and the character is the bat. If the model learned more complex reasoning where it uses the glove only when a person is in the image, and otherwise directly uses the bat itself for the prediction, this corresponds to the complex reasoning setting described earlier, where the switch is the person. Given such simplified sets of features, the performance of saliency methods on SMERF’s customized tasks can be loosely considered as an upper bound on the saliency methods’ ability to recover simple or complex model reasoning on tasks involving more natural images.

4 Experiments

In this section, we use SMERF and the TextColor dataset to show how several leading saliency methods perform in recovering different types of model reasoning. While we find that most methods perform reasonably well on average for models with simple reasoning (Section 4.1), we note that they have some failure modes (Section 4.2). For models with more complex reasoning, we find that the methods’ average performance and failure modes both become worse (Section 4.3). Figure 2 shows a high level summary of these results.

Saliency Methods and Simple Baselines. We use a modified version of the open-source library iNNvestigate Alber et al. (2019), which includes implementations of several leading saliency methods. In our experiments, we use the following methods: Gradient Simonyan et al. (2013), SmoothGrad Smilkov et al. (2017), DeconvNet Zeiler and Fergus (2014), Guided Backpropagation (GBP) Springenberg et al. (2015), Deep Taylor Decomposition (DeepTaylor) Montavon et al. (2017), Input*Gradient Shrikumar et al. (2017), Integrated Gradients (IG) Sundararajan et al. (2017), Layerwise Relevance Propagation (LRP) Bach et al. (2015) (four variations), DeepSHAP Lundberg and Lee (2017), DeepLIFT Shrikumar et al. (2017) (two variations), and Grad-CAM Selvaraju et al. (2017). We also add some simple baselines, like Random (i.e., random-valued feature attribution) and Edge-detection, both of which are model-agnostic and therefore should not be useful tools for understanding model reasoning.

Evaluation Metric. We measure the effectiveness of saliency methods with the Intersection-Over-Union (IOU) metric, which is the ratio of the intersecting area to the union area of the 0-1 masked feature attribution and the ground-truth feature attribution. The ground-truth feature attributions are predefined from the data generation process (by the ground-truth model reasoning). The 0-1 masked feature attribution from the saliency methods is obtained by first blurring the original feature attribution, followed by thresholding the pixel intensity to select the top-k pixels, where k is equal to the number of pixels inside the regions highlighted by the ground-truth feature attributions. Given the 0-1 masked feature attribution and the ground-truth feature attribution, we compute two types of IOU values: (1) primary IOU, which measures how much of the 0-1 masked feature attribution overlaps with the ground-truth for the region relevant to the model prediction; and (2) secondary IOU, which measures the same value with respect to the region not relevant to the model prediction.
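A minimal NumPy sketch of this metric is given below, under several assumptions that the text does not settle: a simple box blur stands in for the unspecified blurring step, the same top-k mask (with k taken from the relevant region) is reused for the secondary IOU, and all function names are hypothetical.

```python
import numpy as np


def box_blur(attribution, radius=1):
    """Average each pixel with its neighbours (a stand-in for the blurring step)."""
    padded = np.pad(attribution, radius, mode="edge")
    size = 2 * radius + 1
    height, width = attribution.shape
    out = np.zeros((height, width), dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + height, dx:dx + width]
    return out / (size * size)


def top_k_mask(attribution, k):
    """0-1 mask selecting the k pixels with the largest attribution values."""
    flat = attribution.ravel()
    mask = np.zeros(flat.shape, dtype=bool)
    if k > 0:
        mask[np.argsort(flat)[-k:]] = True
    return mask.reshape(attribution.shape)


def iou(mask_a, mask_b):
    """Intersection area divided by union area of two boolean masks."""
    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(mask_a, mask_b).sum() / union)


def primary_secondary_iou(attribution, relevant, irrelevant, radius=1):
    """Primary IOU against the relevant region, secondary against the irrelevant one."""
    k = int(relevant.sum())  # k = number of ground-truth pixels
    mask = top_k_mask(box_blur(attribution, radius), k)
    return iou(mask, relevant), iou(mask, irrelevant)
```

Choosing k from the ground-truth region means a perfect attribution can reach an IOU of exactly 1, which keeps the 0.5 threshold comparable across features of different sizes.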

When the primary IOU is high, the saliency method was successful because it correctly identified the relevant regions of the image. However, when it is not high, we need to use the secondary IOU to gain a better understanding of why the method was not successful. If the secondary IOU is high, it means that the method is distracted by irrelevant regions, which indicates a failure mode of “focusing on irrelevant regions.” Otherwise, it means that the method is focusing on random unspecified regions, which indicates a failure mode of “not focusing on the relevant regions.” Collectively, these two metrics provide a more complete picture of how saliency methods are performing.
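This decision procedure can be summarized in a few lines. The sketch assumes that "high" means at or above the same 0.5 threshold for both IOU values, which the text states only for correctness in general; the function name `diagnose` is hypothetical.

```python
def diagnose(primary_iou, secondary_iou, threshold=0.5):
    """Map a (primary, secondary) IOU pair to success or one of the two failure modes."""
    if primary_iou >= threshold:
        return "success"
    if secondary_iou >= threshold:
        return "focusing on irrelevant regions"
    return "not focusing on the relevant regions"
```

The primary IOU is checked first, so the secondary IOU only matters once the method has already failed to locate the relevant region.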

Throughout this section we use the threshold value of 0.5 to roughly distinguish good and bad performance in terms of IOU, as commonly done in practice Everingham et al. (2015); Wang et al. (2019). However, we note that this is probably a lenient threshold as we are evaluating on synthetic images. The reported average IOU values are taken across different input images belonging to different buckets.
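One way to aggregate per-image IOU values into the kind of summary reported here is sketched below. It assumes that images from all buckets are pooled before averaging, which is one reading of the sentence above; the function name and the dictionary layout are hypothetical.

```python
from statistics import mean, pstdev


def summarize_iou(iou_by_bucket, threshold=0.5):
    """Pool per-image IOU values across buckets and report simple statistics.

    iou_by_bucket: dict mapping a bucket id to a list of per-image IOU values.
    """
    all_values = [value for values in iou_by_bucket.values() for value in values]
    per_bucket_mean = {bucket: mean(values) for bucket, values in iou_by_bucket.items()}
    overall = mean(all_values)
    return {
        "mean": overall,
        "std": pstdev(all_values),
        "per_bucket_mean": per_bucket_mean,
        "worst_bucket": min(per_bucket_mean, key=per_bucket_mean.get),
        "passes_threshold": overall >= threshold,
    }
```

Keeping the per-bucket means alongside the pooled average helps separate a method that is uniformly mediocre from one that does well on some buckets and fails on others.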

4.1 Most of the saliency methods for models with simple reasoning have reasonably good performance on average.

Figure 6: Primary IOU (left) and secondary IOU (right) for simple reasoning by FR and NR. Black vertical lines indicate the standard deviation of the values across different samples used to compute the values. The dotted horizontal line is the correctness threshold at 0.5. Most methods on average are performing well, with reasonable primary IOU and low secondary IOU. However, for inputs that belong to different buckets (red, green for NR and orange, blue for FR), there is a huge variation in the IOU values, resulting in an overall high variance in performance.

Figure 6 shows the average primary and secondary IOU for the simple reasoning setting instantiated with FR and NR in Section 3.3. Most of the methods, except for DeconvNet and SmoothGrad, perform well (crossing the 0.5 correctness threshold line) on average across different types of input images. DeconvNet struggles to pick up useful signal, resulting in performance on par with random attributions. We can further observe that it has low primary IOU and low secondary IOU, meaning its failure mode is mainly due to not precisely focusing on the relevant features, rather than focusing on the irrelevant features. Overall, the results show that, for models with simple reasoning, most methods are correctly identifying the relevant feature in the image, at least on average.

4.2 Saliency methods exhibit high variance performance across different types of images. They also often slightly highlight other irrelevant features, even under the simple reasoning setting.

Figure 6 further shows a common problem for methods in the simple reasoning case: high variance of primary IOU. This is caused by varying IOU levels for input images that belong to different buckets (with certain features), as illustrated with the blue, orange, green, and red lines. The methods tend to perform worse for samples that contain more features (red and orange) compared to samples that do not (blue and green). Figure 6 also shows that the secondary IOU values for all methods are quite low for the models. This means that the feature attributions are not overly focused on the features that are irrelevant to the model reasoning. However, when a non-relevant feature is present in the image (the character for FR and the switch/patch for NR), primary IOU decreases and secondary IOU increases, as indicated by a change from the blue to the orange line for FR, and from the green to the red line for NR.

Figure 7: Qualitative samples from simple reasoning scenario with FR. While all methods should only be highlighting the patch feature (ignoring the character), all methods seem to highlight both.

This pattern of decreasing primary IOU and increasing secondary IOU becomes more apparent when we take a look at individual samples and the corresponding feature attributions. Figure 7 shows some samples from the simple reasoning setup with FR. As previously mentioned, DeconvNet fails to detect any signals at all, while all other methods seem to highlight both the patch and the character, failing to focus only on the relevant feature (the patch) and ignore the irrelevant one (the character). The dispersion of the feature attribution values across these two regions naturally raises the secondary IOU. However, in this case, most of the pixels with high feature attribution values still belong to the relevant region (the patch), as indicated by the relatively small changes in the secondary IOU across different inputs for FR.

4.3 Saliency methods demonstrate several failure modes in recovering more complex reasoning.

Figure 8: Primary and secondary IOU values for the complex reasoning case from Section 3.3. All methods fail to show satisfactory results, both in terms of average and variance. The average is lower than 0.5 and the variance is higher compared to the simple reasoning setting, resulting in inconsistent results for different types of inputs. Secondary IOU values are also non-trivially high, and with high variance. Both IOU values show different behaviors for images that contain different sets of features (orange and blue lines), unlike the simple reasoning case.

In Sections 4.1 and 4.2, the leading saliency methods showed overall satisfactory results for the simple reasoning case, except for minor failure cases for specific methods like DeconvNet and a slight increase in the secondary IOU when a non-relevant feature was added. These small problems from the simple reasoning case are exacerbated for models with more complex reasoning and ultimately, most of the methods fail in correctly identifying the reasoning. We show the failure modes of saliency methods for models with complex reasoning specified in Section 3.3. Recall that this is the case in which the ground-truth model reasoning conditionally relies on several sets of features in a hierarchical manner: when the switch is present, the prediction is determined by the patch, otherwise by the character. The ground-truth feature attribution therefore highlights the patch for images that contain the switch, and the character for those that do not.

IOU values for this complex reasoning are shown in Figure 8. Unlike the simple reasoning case, the average primary IOU values are below the correctness threshold of 0.5 for all methods. Not only DeconvNet and Grad-CAM, but also other methods that performed well in the simple reasoning setting suffer from an acute performance drop.

In addition to low average performance throughout, the variance is also high compared to the simple reasoning setting. This is illustrated in how the IOU values change for inputs from different buckets (blue, orange line in Figure 8). When the saliency method needs to focus on the patch for inputs with the switch feature (orange), the primary IOU is reasonably high for most of the methods with low secondary IOU. However, when the saliency method needs to focus on the character instead of the patch for inputs without the switch (blue), the primary IOU drops and the secondary IOU increases significantly. The existence of the switch feature in the image changes the entire distribution of the IOU values. Increasing secondary IOU and decreasing primary IOU represent both failure modes occurring at the same time: lack of focus on the relevant region, plus increased focus on the irrelevant region.

Figure 9: Qualitative samples from complex reasoning setting. While the correct feature attribution should only be highlighting the patch when the switch is present (ignoring the character), and just the character when the switch is absent (ignoring the patch), all methods, except for DeconvNet which is equivalent to mere noise, seem to highlight all features to some degree regardless of the details of the model reasoning.

Figure 9 shows some samples from the complex reasoning setup. All methods other than DeconvNet, which again fails to output meaningful feature attributions, highlight all features present in the image to some degree. The increased secondary IOU compared to the simple reasoning setting implies that the methods are now pointing to the irrelevant regions in the image more than before. From a practical standpoint, this tendency of saliency methods to highlight all features in the image while disregarding the fine-grained details of the model reasoning (resulting in higher secondary IOU and lower primary IOU) can make it difficult for users to correctly derive or distinguish model reasoning by just looking at the feature attributions.

5 Conclusion

In this work, we propose SMERF to perform ground-truth-based pointing game evaluations for saliency methods with varying complexity of model reasoning involved. Our results show that while most methods perform reasonably well on average for identifying simple reasoning of the model, they mostly exhibit performance drops and demonstrate several failure modes when the model follows more complex reasoning. As SMERF is used to reveal the shortcomings of existing methods in their ability to recover complex model reasoning, we believe that it can further play an important role in the development of new methods that can address this issue. Generalizing the main ideas behind SMERF may also be useful in settings where saliency methods are inherently not appropriate, e.g. problems that require counterfactual model explanations Wachter et al. (2017).

References

  • [1] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim (2018) Sanity checks for saliency maps. Advances in Neural Information Processing Systems 31, pp. 9505–9515. Cited by: §2.
  • [2] J. Adebayo, M. Muelly, I. Liccardi, and B. Kim (2020) Debugging tests for model explanations. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 700–712. Cited by: §2.
  • [3] M. Alber, S. Lapuschkin, P. Seegerer, M. Hägele, K. T. Schütt, G. Montavon, W. Samek, K. Müller, S. Dähne, and P. Kindermans (2019) iNNvestigate neural networks!. Journal of Machine Learning Research 20 (93), pp. 1–8. Cited by: §4.
  • [4] D. Alvarez Melis and T. Jaakkola (2018) Towards robust interpretability with self-explaining neural networks. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. Cited by: §2.
  • [5] M. Ancona, E. Ceolini, C. Öztireli, and M. Gross (2018) Towards better understanding of gradient-based attribution methods for deep neural networks. In International Conference on Learning Representations. Cited by: §2.
  • [6] N. Arun, N. Gaw, P. Singh, K. Chang, M. Aggarwal, B. Chen, K. Hoebel, S. Gupta, J. Patel, M. Gidwani, et al. (2020) Assessing the (un)trustworthiness of saliency maps for localizing abnormalities in medical imaging. arXiv preprint arXiv:2008.02766. Cited by: §1, §2.
  • [7] S. Bach, A. Binder, G. Montavon, F. Klauschen, K. Müller, and W. Samek (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10 (7), pp. e0130140. Cited by: §1, §1, §2, §4.
  • [8] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian (2018) Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 839–847. Cited by: §1, §2.
  • [9] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2015) The PASCAL visual object classes challenge: a retrospective. International Journal of Computer Vision 111 (1), pp. 98–136. Cited by: §4, footnote 1.
  • [10] S. Gao, M. Cheng, K. Zhao, X. Zhang, M. Yang, and P. H. Torr (2019) Res2Net: a new multi-scale backbone architecture. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §2.
  • [11] S. Hooker, D. Erhan, P. Kindermans, and B. Kim (2019) A benchmark for interpretability methods in deep neural networks. In Advances in Neural Information Processing Systems, pp. 9737–9748. Cited by: §2.
  • [12] S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems 30, pp. 4765–4774. Cited by: §1, §4.
  • [13] G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K. Müller (2017) Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition 65, pp. 211–222. Cited by: §1, §4.
  • [14] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626. Cited by: §1, §2, §2, §4.
  • [15] A. Shrikumar, P. Greenside, and A. Kundaje (2017) Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3145–3153. Cited by: §1, §4.
  • [16] K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. Cited by: §1, §2, §4.
  • [17] D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg (2017) SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825. Cited by: §1, §4.
  • [18] J. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2015) Striving for simplicity: the all convolutional net. In ICLR (workshop track). Cited by: §1, §4.
  • [19] M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3319–3328. Cited by: §1, §4.
  • [20] R. Tomsett, D. Harborne, S. Chakraborty, P. Gurram, and A. Preece (2020) Sanity checks for saliency metrics. Proceedings of the AAAI Conference on Artificial Intelligence 34 (04), pp. 6021–6029. ISSN 2159-5399. Cited by: §2.
  • [21] S. Wachter, B. Mittelstadt, and C. Russell (2017) Counterfactual explanations without opening the black box: automated decisions and the GDPR. Harv. JL & Tech. 31, pp. 841. Cited by: §5.
  • [22] Y. Wang, W. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger (2019) Pseudo-LiDAR from visual depth estimation: bridging the gap in 3D object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8445–8453. Cited by: §4, footnote 1.
  • [23] S. Woo, J. Park, J. Lee, and I. S. Kweon (2018) CBAM: convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: §1, §2.
  • [24] M. Yang and B. Kim (2019) Benchmarking attribution methods with relative feature importance. arXiv preprint arXiv:1907.09701. Cited by: §2.
  • [25] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833. Cited by: §1, §4.
  • [26] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff (2018) Top-down neural attention by excitation backprop. International Journal of Computer Vision 126 (10), pp. 1084–1102. Cited by: §1.
  • [27] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2.