Robust Explainability: A Tutorial on Gradient-Based Attribution Methods for Deep Neural Networks

by Ian E. Nielsen, et al.
Rowan University

With the rise of deep neural networks, the challenge of explaining the predictions of these networks has become increasingly recognized. While many methods for explaining the decisions of deep neural networks exist, there is currently no consensus on how to evaluate them. On the other hand, robustness is a popular topic for deep learning research; however, it was hardly discussed in the context of explainability until very recently. In this tutorial paper, we start by presenting gradient-based interpretability methods. These techniques use gradient signals to assign the burden of the decision to the input features. Later, we discuss how gradient-based methods can be evaluated for their robustness and the role that adversarial robustness plays in having meaningful explanations. We also discuss the limitations of gradient-based methods. Finally, we present the best practices and attributes that should be examined before choosing an explainability method. We conclude with the future directions for research in the area at the convergence of robustness and explainability.






1 Introduction

Deep learning (DL) has transformed the field of machine learning (ML) with deep neural networks (DNNs) being deployed in various real-world applications, including medical diagnosis, financial services, biometrics, intelligent transportation, social media, and smart home devices. Despite tremendous progress, their acceptance in mission-critical application areas is being hampered by two significant limitations. First, there is an inherent inability to explain decisions in a manner understandable to humans [rudin2019stop]. Second, there is vulnerability to adversarial attacks, i.e., malicious and imperceptible alterations to the input that can fool trained networks into altering their decisions drastically [madry2018towards]. These two seemingly disparate concepts are intrinsically linked to each other and may have their origins in the data-driven nature of DNNs with a highly nonlinear input-output relationship and over-parameterized design.

Explainability tackles the critical problem that human users cannot directly understand the complex behavior of DNNs or explain their underlying decision-making process. The explainability of ML models is the fundamental requirement for building trust with users and holds the key to their safe, fair, and successful deployment in real-world applications. The issue of explainability transcends the realm of scientific interest. The General Data Protection Regulation (GDPR), which became enforceable in the European Union in May 2018, gives any citizen the “right to explanation” of an algorithmic decision made about them [goodman2017european]. Explainability is both a legal right and a responsibility that has extensive social implications. The GDPR states that individuals “have the right not to be subject to a decision based solely on automated processing”.

Explainability in ML is not a new topic and has been handled in many different ways, including building interpretable models or generating post hoc explanations [rudin2019stop]. This tutorial will focus on the latter. Given a trained neural network, either the input features are perturbed, and their effect on the network output is monitored, or a signal from the network output is back-propagated to the input. Either way, the resulting information from the perturbations or the gradient propagation provides an estimate of the contribution of input features to the output and can be presented as heatmaps.

“How good is an explanation?” is a fundamental question in explainability research. Generally, a visual analysis of the explanation is performed given the fact that these explanations are generated for humans to see and understand the behavior of the model. Recently, it has been shown that a visual analysis may not be a reliable method to ascertain the “plausibility” of an explanation [adebayo2018sanity]. However, the lack of ground-truth explanation makes it challenging to quantitatively assess, compare, and contrast various explanations.

A crucial property that all explainability methods should satisfy is insensitivity to minor input perturbations. That is, a small perturbation (possibly malicious) in the input, which does not affect the network decision, should not significantly change the attributions [alvarez2018robustness, ghorbani2019interpretation]. This notion of robustness is closely linked to reproducibility and replicability of explanations. Concurrently, the explanation of a decision should change significantly when the network is under an adversarial attack, that is, imperceptible malicious changes in the input that force the network to alter its decision [madry2018towards]. In an ideal world, the attribution maps should be sensitive enough to detect an adversarial attack and concurrently invariant to small perturbations in the input.

This tutorial paper provides a thorough overview of gradient-based post hoc explainability methods, their attributional robustness, and the link between explainability maps and adversarial robustness. We restrict our focus to computer vision classification models, i.e., the networks are generally convolutional neural networks (CNNs) whose input data consist of images (e.g., from the ImageNet dataset) and whose outputs are class scores or soft-max probabilities. This tutorial is not meant to provide an exhaustive survey of all the explainability methods proposed for DL models.

2 Taxonomy and Definitions

Interpretability can be defined as the ability to attach a physical meaning to the prediction of a model. Along with reproducibility and replicability, interpretability falls under the larger umbrella of explainability for ML models. However, the terms interpretability and explainability are interchangeably used in the literature.

Inherent Interpretability vs. Post hoc Explanations

: Some ML models are built to be inherently interpretable in the first place, e.g., linear models or decision trees [rudin2019stop]. Other models, referred to as black box, may require additional mathematical frameworks to explain their behavior to an audience targeting various use cases, e.g., understanding the model, debugging, providing explanations for legal purposes, or helping in decision making for downstream tasks. These mathematical frameworks, designed to explain black box models in post hoc settings, have their limitations and challenges over and above those in building ML models.

Global vs. Local Explanations

: Based on the scope and the purpose of explanation, a user can employ a local or a global interpretability method. Global interpretability methods attempt to explain the overall decision-making process of the model, i.e., how the inputs are transformed into the output decisions at the model level. These may be more useful to researchers and engineers trying to understand their models. In contrast, local interpretability methods attempt to explain specific decisions, i.e., what features of the input (e.g., pixels of an image) may have contributed (positively or negatively) to the model’s output. Post hoc explanations are generally of interest to a broader audience, including but not limited to those without access to the model structure, e.g., those accessing a model-as-a-service.

The local interpretability problem can be formulated as estimating a number for each input feature that captures the effect of a change in the feature value on the network output. The estimated numbers are presented as heatmaps and have the same dimension as the input features. In the literature, the terms attribution, relevance, importance, contribution, sensitivity, and saliency score are used synonymously.

Feature Perturbation vs. Gradients

: Various methods for post hoc local interpretability have been proposed. Two broad categories exist, namely, methods based on feature perturbation and others based on gradient information [ancona2019gradient]. The former class of methods perturbs input features (or a set of features) by masking or altering their values, and records the effect of these changes on the network performance. In the latter case, the gradients of the output (logits or soft-max probabilities) with respect to the extracted features or the input are calculated via backpropagation and are used to estimate attribution scores. Generally, the gradients are noisy, leading to attribution maps that may show contributions from irrelevant features. Various alterations to the gradient-based approach have been proposed to handle the challenge of noise in attribution maps.

2.1 Requirements from Attribution Maps

Before we go into a detailed discussion about how to create local gradient-based post hoc attribution maps, it is relevant to consider what we expect from these attribution maps. Some of these requirements are defined axiomatically [sundararajan2017axiomatic].

Implementation Invariance

: As we know, ML models can be expressed and implemented in many different ways, mathematically or programmatically; however, two functionally equivalent models should produce similar output for the same input. The attribution methods must be implementation invariant, i.e., produce the same attribution scores for the same inputs on functionally equivalent networks, regardless of how these networks are implemented [sundararajan2017axiomatic].

Input Invariance

: Given the fact that neural networks are invariant to certain input transformations (e.g., a constant shift in the input), an attribution method must also be insensitive to such input transformations [kindermans2019reliability].


Fidelity

: The fidelity or selectivity of an attribution method is linked with its ability to identify feature relevance. An attribution method with high fidelity assigns high attribution score to features that, when removed, greatly reduce network performance and vice versa [tomsett2020sanity].


Saturation

: In the forward pass, an input feature may saturate the network, owing to the nonlinear activation function being used, e.g., the rectified linear unit (ReLU) function [sundararajan2017axiomatic]. Consider a neural network with one ReLU, $f(x) = 1 - \mathrm{ReLU}(1 - x)$. For all input values $x > 1$, we have: $f(x) = 1$ and $\partial f / \partial x = 0$. Despite the fact that the input feature may change significantly, the function output stays the same and the gradient remains at zero. An attribution method must tackle the saturation in the network while estimating attributions. One possible method is to use a reference input or baseline, which can be zero (black pixel), a random number, or an average value calculated over the input dataset.
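The saturation effect can be reproduced in a few lines. The sketch below assumes the one-ReLU function f(x) = 1 - ReLU(1 - x) as an illustrative choice, with a zero baseline and a test input of 2; the function names are hypothetical.

```python
def relu(z):
    return max(z, 0.0)

def f(x):
    # Illustrative one-ReLU network (an assumed example): f(x) = 1 - ReLU(1 - x)
    return 1.0 - relu(1.0 - x)

def grad_f(x):
    # Analytic derivative: 1 while the ReLU is active (x < 1), 0 once it saturates
    return 1.0 if x < 1.0 else 0.0

baseline, x = 0.0, 2.0
print("f(baseline) =", f(baseline), " f(x) =", f(x))  # the output changes between the two points
print("gradient at x =", grad_f(x))                    # but the local gradient at x carries no signal
```

Comparing the input against the baseline exposes the change in output that the local gradient alone misses, which is the motivation for baseline-based methods.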


Sensitivity

: Considering a neutral baseline (e.g., zero or black image) for all features, sensitivity requires that the output of the model for an input should be decomposable as the sum of the individual contributions from the input features [ancona2019gradient]. This property is also referred to as completeness or summation to delta [sundararajan2017axiomatic, shrikumar2017learning]. For an attribution method to be considered sensitive, it must assign a non-zero attribution score to the single distinctive feature between two similar inputs. Furthermore, sensitivity requires that any feature, which does not affect the output of the network, must be given a zero attribution score [sundararajan2017axiomatic].

3 Gradient-Based Attribution Methods

We consider a network with an $N$-dimensional input $x = [x_1, \ldots, x_N] \in \mathbb{R}^N$ and a $C$-dimensional output $S(x) = [S_1(x), \ldots, S_C(x)] \in \mathbb{R}^C$, where $C$ is the total number of classes and $S(\cdot)$ represents the network’s score function. Note that $S_c(x)$ can be either a class score (logit) or soft-max probability. We use the term “gradient” for $\partial S_c(x) / \partial x$. The goal of attribution methods is to estimate the attribution map, $A^c = [A^c_1, \ldots, A^c_N] \in \mathbb{R}^N$. The attribution map captures the importance of each input feature for a specific output class $c$. In computer vision applications, we consider CNNs with image inputs, i.e., the pixels of the image are considered input features. The resulting attribution map has the same size as the input.

3.1 Gradients

We consider a linear model with parameters $w = [w_1, \ldots, w_N]$ and an $N$-feature input $x$,

$$y = \sum_{i=1}^{N} w_i x_i + b + \epsilon, \tag{1}$$

where $\epsilon$ is the modeling error and $b$ is the bias.

The partial derivative of the output with respect to the input results in model parameters, $\partial y / \partial x_i = w_i$, which represent contributions of input features. Thus, for the linear case, model parameters serve as feature attributions.
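As a quick numerical check of the linear case, the following sketch estimates the gradient by central finite differences and compares it with the parameters. The weights, bias, and input values are illustrative assumptions, and the helper name is hypothetical.

```python
import numpy as np

w = np.array([2.0, -1.0, 0.5])   # illustrative parameters
b = 0.3                          # illustrative bias

def linear_model(x):
    return float(w @ x + b)

def finite_difference_gradient(f, x, h=1e-6):
    # Central differences, one input feature at a time
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

x = np.array([1.0, 4.0, -2.0])
grad = finite_difference_gradient(linear_model, x)
print(grad)   # numerically matches w, independent of x
```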

Saliency Maps

: Simonyan et al. used a similar formulation with an absolute value for constructing Saliency maps for DNNs, $A^c = |\partial S_c(x) / \partial x|$, where $x$ is the input [simonyan2013deep]. It is important to highlight that $S_c(x)$ is a nonlinear function of the input and thus, in contrast to the linear case, the model parameters no longer represent feature attributions. It has been shown that saliency maps represent the first-order approximation of the attributions [simonyan2013deep]. The major challenge with saliency maps is that they are visually noisy and a great deal of research has focused on removing noise and improving visualization [smilkov2017smoothgrad].
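A minimal sketch of a saliency map follows, assuming a hypothetical bias-free two-layer ReLU network with hand-picked weights so that the gradient can be written out with the chain rule. A real CNN would obtain the same quantity through automatic differentiation.

```python
import numpy as np

# Hypothetical network: S(x) = W2 @ relu(W1 @ x), 3 input features, 2 classes
W1 = np.array([[1.0, -2.0, 0.5],
               [0.0, 1.0, -1.0],
               [2.0, 0.0, 1.0]])
W2 = np.array([[1.0, -1.0, 0.5],
               [-0.5, 2.0, 1.0]])

def scores(x):
    return W2 @ np.maximum(W1 @ x, 0.0)

def gradient(x, c):
    # Chain rule: the ReLU passes gradient only where its forward input was positive
    active = (W1 @ x > 0).astype(float)
    return (W2[c] * active) @ W1

def saliency(x, c):
    # Absolute value of the gradient of the class-c score with respect to the input
    return np.abs(gradient(x, c))

x = np.array([1.0, 0.5, -1.0])
for c in range(2):
    print("class", c, "saliency", saliency(x, c))
```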

Deconvolutional Networks (DeconvNets)

: Saliency maps are closely related to DeconvNets proposed by Zeiler and Fergus [zeiler2014visualizing]. In saliency maps, the gradient signals are zeroed during backpropagation at each ReLU when the input to the same ReLU was negative during the forward pass. In contrast, DeconvNet reduces the negative gradients to zero at each ReLU, regardless of whether the input to the same ReLU was negative or positive during the forward propagation.

Guided Backpropagation (GBP)

: GBP combines operations from both saliency maps and DeconvNet [springenberg2014striving-GBP]. That is, during backpropagation, the attribution signal is reduced to zero at a ReLU when either the gradient signal itself is negative or the input to the ReLU at the time of the forward pass was negative. Removing negatively contributing features may reduce noise and improve visualization of attribution maps in some cases.
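The three methods differ only in how a signal is passed backward through a ReLU. The sketch below puts the three rules side by side for a single ReLU layer; the function name and the example arrays are illustrative assumptions.

```python
import numpy as np

def relu_backward(upstream, forward_input, rule):
    """Backward pass through one ReLU under three different rules."""
    if rule == "saliency":    # plain gradient: gate on the sign of the forward input
        return upstream * (forward_input > 0)
    if rule == "deconvnet":   # gate on the sign of the backward signal only
        return upstream * (upstream > 0)
    if rule == "gbp":         # guided backpropagation: apply both gates
        return upstream * (forward_input > 0) * (upstream > 0)
    raise ValueError(f"unknown rule: {rule}")

forward_input = np.array([1.0, -1.0, 2.0, -3.0])   # values seen by the ReLU in the forward pass
upstream = np.array([0.5, 0.7, -0.2, -0.4])        # signal arriving from the layer above

for rule in ("saliency", "deconvnet", "gbp"):
    print(rule, relu_backward(upstream, forward_input, rule))
```

Only the first entry survives under all three rules, because it is the only position where both the forward input and the backward signal are positive.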


SmoothGrad

: SmoothGrad reduces noise and visual diffusion by averaging over explanations generated for multiple noisy copies of the input [smilkov2017smoothgrad]. For a saliency map $A^c(x)$ calculated for the input $x$, SmoothGrad is given by $\hat{A}^c(x) = \frac{1}{n} \sum_{k=1}^{n} A^c\left(x + \mathcal{N}(0, \sigma^2)\right)$, where $n$ is the number of samples and $\mathcal{N}(0, \sigma^2)$ represents the Gaussian distribution.
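The averaging step can be sketched as follows, assuming a hypothetical single-output ReLU network with an analytic gradient. The number of samples, the noise level, and the seed are illustrative hyperparameters rather than recommended values.

```python
import numpy as np

# Hypothetical single-output network S(x) = w2 @ relu(W1 @ x); weights are illustrative
W1 = np.array([[1.0, -2.0, 0.5], [0.0, 1.0, -1.0], [2.0, 0.0, 1.0]])
w2 = np.array([-0.5, 2.0, 1.0])

def saliency(x):
    active = (W1 @ x > 0).astype(float)
    return np.abs((w2 * active) @ W1)

def smoothgrad(x, n=50, sigma=0.1, seed=0):
    # Average the saliency maps of n noisy copies of the input
    rng = np.random.default_rng(seed)
    noisy_inputs = x + rng.normal(0.0, sigma, size=(n, x.size))
    return np.mean([saliency(z) for z in noisy_inputs], axis=0)

x = np.array([1.0, 0.5, -1.0])
print("saliency  ", saliency(x))
print("smoothgrad", smoothgrad(x))
```

With zero noise the estimate reduces to the plain saliency map, which gives a simple consistency check for an implementation.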


GradientInput

: In GradientInput (Gradient $\odot$ Input), the attribution scores are calculated by element-wise multiplication of gradients with the input, i.e., $A^c = x \odot \partial S_c(x) / \partial x$.

The element-wise multiplication can be considered as an application of a model-independent filter (the input), which may reduce noise and smoothen the attribution maps [ancona2019gradient].
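A minimal sketch of the element-wise product is given below, assuming a hypothetical bias-free ReLU network with illustrative weights. For such a network the score is piecewise linear and passes through the origin, so the attributions sum to the score itself.

```python
import numpy as np

# Hypothetical bias-free network S(x) = w2 @ relu(W1 @ x); weights are illustrative
W1 = np.array([[1.0, -2.0, 0.5], [0.0, 1.0, -1.0], [2.0, 0.0, 1.0]])
w2 = np.array([-0.5, 2.0, 1.0])

def score(x):
    return float(w2 @ np.maximum(W1 @ x, 0.0))

def gradient(x):
    active = (W1 @ x > 0).astype(float)
    return (w2 * active) @ W1

def gradient_times_input(x):
    # Element-wise product of the input with the gradient of the score
    return x * gradient(x)

x = np.array([1.0, 0.5, -1.0])
attribution = gradient_times_input(x)
print(attribution, attribution.sum(), score(x))
```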

Integrated Gradients (IG)

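: IG can be considered a smoother version of GradientInput, specifically designed to satisfy two axioms of explainability, i.e., sensitivity and implementation invariance [sundararajan2017axiomatic]. IG along the $i$-th dimension for an input $x$ and baseline $x'$ is given by $IG_i(x) = (x_i - x'_i) \int_{0}^{1} \frac{\partial S_c\left(x' + \alpha (x - x')\right)}{\partial x_i} \, d\alpha$.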

IG calculates the average of all gradients along a straight line between the baseline and the input. In practice, we can only use a finite number of samples to approximate the integral, which may introduce an approximation error.
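The finite-sample approximation can be sketched with a midpoint Riemann sum. The example assumes an illustrative saturating score function, a zero baseline, and 300 steps; all names are hypothetical.

```python
import numpy as np

def score(x):
    # Illustrative saturating score: S(x) = 1 - ReLU(1 - sum(x))
    return 1.0 - max(1.0 - x.sum(), 0.0)

def grad_score(x):
    # Gradient is 1 for every feature before saturation and 0 afterwards
    return np.ones_like(x) if x.sum() < 1.0 else np.zeros_like(x)

def integrated_gradients(x, baseline, grad_fn, steps=300):
    # Midpoint Riemann sum over the straight line from the baseline to the input
    alphas = (np.arange(steps) + 0.5) / steps
    path_grads = np.array([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * path_grads.mean(axis=0)

x = np.array([2.0, 1.0])
baseline = np.zeros(2)
ig = integrated_gradients(x, baseline, grad_score)
print("gradient at x:", grad_score(x))           # saturated, so the local gradient is zero
print("IG:", ig, "sum:", ig.sum())
print("S(x) - S(baseline):", score(x) - score(baseline))
```

The attributions are non-zero even though the gradient at the input vanishes, and their sum approximates the difference between the score at the input and at the baseline.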

3.2 Attribution Propagation

Attribution propagation can be considered as an alternative to calculating gradients. Recursively, attribution propagation methods decompose the decision made by the network into contributions from previous layers, all the way to the input. These methods use forward-pass activations (starting with the activation of the neuron in the last layer) to move back layer-by-layer in the network and distribute the burden of the decision over the input features. This class of methods includes various forms of Layer-wise Relevance Propagation (LRP), Deep Taylor Decomposition [montavon2017explaining], and Deep Learning Important FeaTures (DeepLIFT) [shrikumar2017learning]. These methods do not strictly use gradients intrinsically; however, their relationship to GradientInput has been mathematically established [ancona2019gradient].

Layer-Wise Relevance Propagation (LRP)

: LRP propagates relevance scores from the last layer of the network to the input using the “conservation property” [bach2015LRP]. That is, what was received by a neuron in the forward pass (activations) must be redistributed to the lower layer (a layer nearer to the input) by an equal amount. Going from the output to the input, layer-by-layer, the relevance scores are scaled at each layer using the information from the forward pass. LRP starts with the activation of the neuron in the last layer. Let $j$ and $k$ be neurons at two consecutive layers $l$ and $l+1$, with layer $l$ closer to the input. Let $w_{jk}$ be the learnable parameters that connect both layers. The neuronal activation in the forward pass is defined as $a_k = \rho\left(\sum_j a_j w_{jk}\right)$, where $\rho$ is the activation function. Given that we have $R_k^{(l+1)}$, i.e., relevance score at layer $l+1$, we can calculate the relevance score at layer $l$ using:

$$R_j^{(l)} = \sum_k \frac{a_j w_{jk}}{\epsilon + \sum_{j'} a_{j'} w_{j'k}} R_k^{(l+1)}. \tag{2}$$

Equation 2 is called LRP-$\epsilon$, where $\epsilon$ is added to absorb some relevance when the contributions to the activation of neuron $k$ are weak or contradictory. When $\epsilon$ is large, only the most salient explanation factors survive the absorption, leading to noise reduction and sparser explanations [bach2015LRP]. It has been shown that for CNNs with ReLU activation functions, LRP-$\epsilon$ implements a slightly modified form of GradientInput, where the gradients are normalized at each layer by the activations [ancona2019gradient]. Montavon et al. proposed Deep Taylor Decomposition, which provided theoretical foundations for LRP using Taylor series approximation [montavon2017explaining].
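The layer-wise redistribution can be sketched for fully connected layers as below. The network, its weights, and the helper names are illustrative assumptions, and the stabilizer is added to the denominator in its simplest form; practical implementations often use a sign-aware stabilizer to avoid division by values near zero.

```python
import numpy as np

# Hypothetical bias-free ReLU network with one output neuron; weights are illustrative
W1 = np.array([[1.0, -2.0, 0.5], [0.0, 1.0, -1.0], [2.0, 0.0, 1.0]])
W2 = np.array([[-0.5, 2.0, 1.0]])

def lrp_epsilon_layer(a_in, W, R_out, eps):
    """Redistribute the relevance of a layer's outputs onto its inputs (pre-activation z = W @ a_in)."""
    z = W @ a_in               # total contribution received by each upper-layer neuron
    s = R_out / (eps + z)      # relevance per unit of contribution
    return a_in * (W.T @ s)    # share of each lower-layer neuron

def lrp_epsilon(x, eps=1e-9):
    a1 = np.maximum(W1 @ x, 0.0)                  # forward pass, hidden activations
    out = W2 @ a1                                 # forward pass, output score
    R1 = lrp_epsilon_layer(a1, W2, out, eps)      # output -> hidden
    return lrp_epsilon_layer(x, W1, R1, eps)      # hidden -> input

x = np.array([1.0, 0.5, -1.0])
print("small eps:", lrp_epsilon(x), "sum:", lrp_epsilon(x).sum())
print("eps = 1  :", lrp_epsilon(x, eps=1.0), "sum:", lrp_epsilon(x, eps=1.0).sum())
```

With a very small stabilizer the input relevances sum to the output score (conservation), while a larger value absorbs part of the relevance on the way down.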

Deep Learning Important FeaTures (DeepLIFT)

: DeepLIFT was designed to tackle the saturation problem using “reference activations”, calculated in the forward pass with the baseline input [shrikumar2017learning]. DeepLIFT compares the activation of each neuron to its reference activation and assigns contribution scores according to the difference [shrikumar2017learning]. It has been shown that DeepLIFT (Rescale rule) is equivalent to GradientInput [ancona2019gradient].

3.3 Gradient-weighted Class Activation Mapping (Grad-CAM)

Grad-CAM uses the class-specific gradient information flowing into the final convolutional layer of a CNN to produce a coarse localization map of the important features of the input [selvaraju2017grad]. Grad-CAM analyzes which regions are activated in the feature maps of the last convolutional layer. Grad-CAM can be combined with GBP, referred to as Guided Grad-CAM, to improve pixel-level granularity of attribution maps. In Fig. 1, we present attribution maps generated using various methods.

Figure 1: Attribution maps generated using different methods are presented. The first column presents test images from ImageNet and the rest of the columns present attribution maps estimated using various methods. Abbreviations used: GBP - Guided Backpropagation, IG - Integrated Gradients, and Grad-CAM - Gradient-weighted Class Activation Mapping.
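The core Grad-CAM computation can be sketched as follows, assuming that the feature maps of the last convolutional layer and the gradients of the class score with respect to them have already been extracted from the network. The arrays below are synthetic and the function name is hypothetical.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """feature_maps, gradients: arrays of shape (K, H, W) taken at the last convolutional layer."""
    weights = gradients.mean(axis=(1, 2))               # one weight per channel (spatially averaged gradient)
    cam = np.tensordot(weights, feature_maps, axes=1)   # weighted sum over channels, shape (H, W)
    return np.maximum(cam, 0.0)                         # keep only positive evidence for the class

# Synthetic example with K = 2 channels and 2 x 2 feature maps
feature_maps = np.array([[[1.0, 0.0], [0.0, 2.0]],
                         [[0.0, 3.0], [1.0, 0.0]]])
gradients = np.array([[[1.0, 1.0], [1.0, 1.0]],
                      [[-1.0, -1.0], [-1.0, -1.0]]])

cam = grad_cam(feature_maps, gradients)
print(cam)
```

In practice the coarse map is upsampled to the input resolution before it is overlaid on the image.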

4 Analysis of Gradient-Based Attribution Methods

Attribution maps are designed to explain the decisions made by ML models. However, a large body of research has found that various approaches to create these attributions have their own limitations [kindermans2019reliability, adebayo2018sanity, ghorbani2019interpretation].

Starting with measures that can be used to evaluate explanations in DL models, we provide a detailed analysis of the performance of these methods and their limitations.

4.1 Evaluation of Attribution Maps

“How good is an explanation?” is one of the fundamental questions in ML explainability research. The lack of ground truth explanations makes it challenging to validate attribution methods. Ideal evaluation of these methods will depend upon fully knowing the process of how the ML model reached its decision - the very problem that we are trying to solve. Furthermore, it is hard to disentangle the errors made by models from the errors made by the attribution methods [sundararajan2017axiomatic]. Given that the notion of explanation is centered around human visual perception, the predominant evaluations of attributions have been subjective. However, objective evaluation is equally, or perhaps more, important to establish rigorous theoretical foundations, compare and contrast various approaches, and improve upon these methods [yeh2019fidelity].

Visual Evaluation

: A visual analysis of the attribution maps may seem to be the most plausible way of evaluation as these are created to explain the behavior of DL models to human operators and designers. Visual analysis includes qualitative displays of explanation examples, crowd-sourced evaluations of human satisfaction with the explanations, as well as whether humans are able to understand the model output [yeh2019fidelity]. However, it may be misleading to rely solely on visual analysis for determining whether an attribution method is able to capture the features that a network considers important [adebayo2018sanity]. A visual analysis may bias the evaluation of how humans understand the phenomenon and make decisions, rather than capturing how the network reached a particular decision. This may hold true especially for the methods that multiply the input with the gradient, i.e., GradientInput and IG.

Feature Perturbation-based Evaluation

: Removing the most important features identified by an attribution method and recording its effect on the performance of the network may provide an objective approach to evaluate various attribution methods [samek2016evaluating]. A good attribution method will identify the most important pixels, which when removed should maximally degrade the network performance. The metric is referred to as the Most Relevant First (MoRF). As the network input size is fixed, the removed pixels are replaced with either the average value (calculated over the input dataset), zero (i.e., black pixel), or random values [tomsett2020sanity]. It is obvious that replacing pixels with an average value or black pixels can introduce high-frequency edges, which may degrade network performance - unrelated to the removal of important pixels [srinivas2019full]. We may choose to remove the least important pixels first - thus partially decoupling the effects of artifacts introduced by high-frequency edges from those caused by removing important pixels. The metric is referred to as the Least Relevant First (LeRF) [srinivas2019full].
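The two removal orders can be sketched on a toy problem. The example assumes a linear score, GradientInput attributions, and zero as the replacement value; these choices and all names are illustrative.

```python
import numpy as np

def perturbation_curve(score_fn, x, attribution, order="morf", fill_value=0.0):
    """Score after successively replacing features, most (MoRF) or least (LeRF) relevant first."""
    ranking = np.argsort(attribution)      # ascending: least relevant first
    if order == "morf":
        ranking = ranking[::-1]            # descending: most relevant first
    x_perturbed = x.copy()
    curve = [score_fn(x_perturbed)]
    for idx in ranking:
        x_perturbed[idx] = fill_value      # "remove" the feature
        curve.append(score_fn(x_perturbed))
    return np.array(curve)

w = np.array([3.0, 1.0, 2.0, 0.5])         # illustrative linear model
x = np.ones(4)
score_fn = lambda z: float(w @ z)
attribution = w * x                        # attribution used for ranking the features

morf = perturbation_curve(score_fn, x, attribution, order="morf")
lerf = perturbation_curve(score_fn, x, attribution, order="lerf")
print("MoRF:", morf)
print("LeRF:", lerf)
```

A faithful attribution makes the MoRF curve fall quickly and the LeRF curve fall slowly, so the gap between the two curves can serve as a summary score.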

A recent study by Tomsett et al. evaluated the reliability of both MoRF and LeRF using four different statistical tests from the psychometric literature [tomsett2020sanity]. These tests included inter-rater reliability, inter-method reliability, internal consistency reliability, and test-retest reliability, where each image corresponded to a different rater and methods included different attribution map generation techniques. Both MoRF and LeRF showed: (1) high variance across all tested images, (2) sensitivity to whether the removed pixels were replaced with the mean of the dataset or random values, (3) low inter-rater reliability, i.e., the rankings of different methods were highly inconsistent, and (4) low correlation with each other. The results were reported for the classification task using the CIFAR-10 dataset. The absence of ground truth explainability and the limited testing (using one dataset only) make it hard to generalize these results to other metrics. However, the study raised important questions about the validity of different metrics that are extensively used in the explainability literature.

Remove and Retrain (ROAR)

: Replacing pixels from the input image may change the distribution of the data that were used to train the network, thus violating the assumption that training and evaluation data must have the same distribution [hooker2019benchmark]. In remove and retrain (ROAR), the model is retrained and evaluated every time after removing a set of most important pixels. ROAR is computationally expensive and does not address the question of validity of explanation for each input, but rather evaluates the method globally over the whole dataset. Furthermore, the retraining strategy may force the network to learn from the features that were not present in the original dataset (e.g., high-frequency edges introduced due to pixel replacement). This leads to evaluating a new model with newly learned parameters, not the original model.

(In)fidelity and Sensitivity

: (In)fidelity quantifies the statistically expected difference between (1) the dot product of the input perturbation and the attribution scores and (2) the output perturbation (difference in the score function values after significant perturbations are introduced in the input $x$) [yeh2019fidelity]. (In)fidelity allows for a number of significant perturbations, including random and non-random perturbations that lead the input towards a predefined single or multiple baseline values. Random perturbations with a small amount of additive Gaussian noise allow the measure to be robust to small mis-specifications or noise in either the test input or the reference point [yeh2019fidelity]. On the other hand, the “sensitivity” measures the degree to which the explanation is affected by insignificant perturbations in the test point [yeh2019fidelity]. A good attribution method will exhibit low sensitivity, i.e., producing the same explanations for minor variations in the input.
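Both measures can be estimated by Monte Carlo sampling. The sketch below assumes Gaussian perturbations for (in)fidelity, uniform perturbations within a small box for sensitivity, and a linear score function as the test case; the sample sizes, noise scales, and function names are illustrative assumptions.

```python
import numpy as np

def infidelity(score_fn, attribution, x, n=200, sigma=0.5, seed=0):
    """Mean squared gap between the attribution-predicted and the observed change in the score."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n):
        perturbation = rng.normal(0.0, sigma, size=x.shape)
        predicted = perturbation @ attribution               # change predicted by the attribution
        observed = score_fn(x) - score_fn(x - perturbation)  # change actually observed
        errors.append((predicted - observed) ** 2)
    return float(np.mean(errors))

def max_sensitivity(attribution_fn, x, radius=0.1, n=200, seed=0):
    """Largest change of the explanation under small random input perturbations."""
    rng = np.random.default_rng(seed)
    base = attribution_fn(x)
    deltas = rng.uniform(-radius, radius, size=(n,) + x.shape)
    return max(float(np.linalg.norm(attribution_fn(x + d) - base)) for d in deltas)

w = np.array([2.0, -1.0, 0.5])                 # illustrative linear score S(x) = w @ x
score_fn = lambda z: float(w @ z)
x = np.array([1.0, -2.0, 0.5])

print("infidelity, gradient attribution:", infidelity(score_fn, w, x))
print("infidelity, all-zero attribution:", infidelity(score_fn, np.zeros(3), x))
print("max-sensitivity, gradient attribution:", max_sensitivity(lambda z: w, x))
```

For a linear score the gradient is an exact explanation of every perturbation, so its (in)fidelity is numerically zero and its sensitivity vanishes, while an uninformative attribution scores worse.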

Sanity Checks

: Recently, Adebayo et al. introduced two sanity checks for evaluating the sensitivity of attribution methods to the model parameters and the dataset [adebayo2018sanity]. The first check consists of replacing all the learned parameters of the network with random numbers. The resulting attribution maps are compared to the original maps using various correlation metrics to ascertain whether the attribution maps were able to capture the changes in the network parameters. The second check evaluates attribution methods on networks trained using randomly permuted labels. The attribution maps which remain unchanged for either of these checks are considered to have failed.
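The parameter randomization check can be sketched on a toy linear model by comparing attribution maps before and after the parameters are replaced with random numbers. The two attribution functions, the random stand-ins for learned parameters, and the simple rank-correlation helper (valid for data without ties) are illustrative assumptions.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation for 1-D arrays without ties."""
    rank_a = np.argsort(np.argsort(a)).astype(float)
    rank_b = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

def saliency(w, x):
    return np.abs(w)      # depends on the model parameters

def input_only(w, x):
    return np.abs(x)      # degenerate "explanation" that ignores the model

rng = np.random.default_rng(0)
x = rng.normal(size=20)
w_trained = rng.normal(size=20)   # stand-in for learned parameters
w_random = rng.normal(size=20)    # parameters after randomization

results = {}
for method in (saliency, input_only):
    results[method.__name__] = spearman(method(w_trained, x), method(w_random, x))
    print(method.__name__, "rank correlation after randomization:", results[method.__name__])
```

A method whose maps stay perfectly correlated after randomization, such as the input-only map here, fails the check.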

4.2 Limitations of Attribution Maps

The attribution methods explain the behaviour of a model for a single test point selected from the evaluation dataset. The explanation provided by the attribution methods for the selected single point may be too brittle and could lead to a false conclusion about the performance of the model [alvarez2018robustness]. Thus, understanding a complex model with a single or even multiple pointwise explanations without theoretically grounded metrics is perhaps too optimistic [alvarez2018robustness]. Explaining a model at a single point and then generalizing to the whole dataset is an open question for the research community.

Class Agnostic Behaviour

: In some cases, the attribution maps may remain the same regardless of the class chosen by the user to compute the gradients. That is, for a given input image, similar attribution maps are generated despite the fact that the class label is changed, e.g., the network is forced to predict a certain class as in adversarial attacks [rudin2019stop]. In Fig. 2, we present attribution maps corresponding to 7 different methods generated for different target classes using the same input image. It is evident that most of these methods, except Grad-CAM, are not class sensitive. A similar behavior was observed when neural networks were trained using a dataset with permuted class labels. Many state-of-the-art attribution methods (except saliency maps and SmoothGrad) generated explanations that were insensitive to the permuted class labels. In summary, these methods are not able to capture the relationship between network input and output, and thus generate the same explanations, even if the class labels are changed.

Figure 2: The class-agnostic behavior of attribution methods is presented. The first column presents the input image and the other columns show attribution maps generated by different methods. The target class and soft-max probability values are shown on the top of the image in the first column. The numbers on the top of the attribution maps are Spearman rank correlation values, calculated between the attribution maps of the true class (top row) and the target class. High correlation values show that the method is not class discriminatory, i.e., the attribution maps for any choice of class label are correlated.

Insensitivity to Model Parameters

: Attribution maps should be sensitive to the learned optimal network parameters. That is, if the parameters of a trained network are replaced by random numbers, attribution maps should capture the effect of this change. However, Adebayo et al. found that GBP and Guided Grad-CAM were insensitive to the learned parameters in the top layers (near to the output) [adebayo2018sanity].

Sensitivity to Input Transformations

: The explanations may be sensitive to factors that do not contribute to the model prediction, e.g., a constant shift in the input [kindermans2019reliability]. GradientInput and other methods (e.g., IG) that use input in the computation of attributions are generally sensitive to such input transformations. For the case of IG, this may further depend on the chosen input baseline [kindermans2019reliability]. Saliency maps, DeconvNet, and GBP were found to be insensitive to such transformations as these methods rely solely on the network parameters (no multiplication by the input) to generate attribution maps.
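The effect of a constant shift can be sketched with two linear models that produce identical outputs: the second model receives a shifted input and compensates for the shift in its bias. The weights, the shift, and the input values are illustrative assumptions.

```python
import numpy as np

w = np.array([1.0, -2.0, 0.5])
b = 0.1
shift = np.array([10.0, 10.0, 10.0])

f1 = lambda x: float(w @ x + b)
f2 = lambda x: float(w @ x + (b - w @ shift))   # absorbs the constant shift in its bias

x1 = np.array([0.2, 0.4, 0.6])
x2 = x1 + shift                                 # shifted copy of the same input

print("outputs:", f1(x1), f2(x2))               # identical predictions
print("gradients:", w, w)                       # identical gradients for both models
print("gradient * input:", w * x1, w * x2)      # attributions differ because of the shift
```

The gradient is unaffected by the shift, while any attribution that multiplies by the input changes with it.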

Input Dominance

: GradientInput, DeepLIFT, and IG multiply the input with gradients to leverage the information present in the input features. This may help reduce noise in the attribution maps and produce more human interpretable explanations. However, in some cases, the attribution maps generated by these methods may be dominated by the input. The input does not depend on the network and cannot capture how the network processed data to make a decision [adebayo2018sanity].

Partial Input Recovery

: GBP and DeconvNet can be considered as variants of saliency maps with different rules governing negative gradients at ReLUs. These methods are able to generate relatively more human-interpretable visualizations due to the backward ReLU (used by both GBP and DeconvNet) and the local connections in CNNs. Nie et al. showed that both GBP and DeconvNet performed (partial) input image recovery, a phenomenon that is unrelated to the network decisions [nie2018theoretical].

Input Baseline

: DeepLIFT and IG use an input baseline to improve attributions. A reasonable choice of baseline depends upon the domain and task at hand. An uninformed and inappropriate choice of baseline may invalidate the explainability provided by the attribution method [kindermans2019reliability].

Sensitivity to Hyperparameters

: The explanations generated by some methods, e.g., SmoothGrad or IG, may depend on the chosen hyperparameters, e.g., the number of samples used.


5 Explainability and Robustness

Until recently, the explainability of DL models and their robustness were being studied in isolation [tsipras2018robustness_at_odds]. However, recent work has provided a strong link between these two apparently disparate aspects of DL models. Before we explore these ideas any further, it is important to highlight two types of robustness in explainability, i.e., (1) the robustness of attribution maps, and (2) the robustness of DL models to adversarial attacks, which is intrinsically linked to their explainability.

5.1 Attributional Robustness

Attributional robustness refers to the stability of an attribution map under a small perturbation of the input, whether the perturbation arises naturally (e.g., from a data distribution shift) or is introduced by an adversary [ghorbani2019interpretation, Dombrowski_explanations, lim2021building]. It was shown that the input can be adversarially manipulated to change the attribution maps without affecting the network performance, i.e., the prediction of the network does not change [ghorbani2019interpretation, Dombrowski_explanations, lim2021building]. Recent research attributes the origin of these false and manipulated explanations to the vulnerabilities of the neural network, e.g., non-smooth decision boundaries, and not to the attribution generation methods [Dombrowski_explanations, lim2021building].
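One simple way to quantify the stability of an attribution map is to compare the most important features before and after a perturbation. The sketch below assumes a top-k intersection score, which is only one of several possible measures; the function names and the example attribution values are hypothetical.

```python
def top_k_indices(attribution, k):
    # Indices of the k features with the largest absolute attribution.
    order = sorted(range(len(attribution)),
                   key=lambda i: abs(attribution[i]), reverse=True)
    return set(order[:k])

def top_k_intersection(attr_a, attr_b, k):
    # Fraction of the top-k features shared by two attribution maps (1.0 = stable).
    shared = top_k_indices(attr_a, k) & top_k_indices(attr_b, k)
    return len(shared) / k

# Illustrative (made-up) attribution maps for an input and a slightly perturbed copy.
attr_original = [0.9, 0.1, 0.8, 0.05, 0.3]
attr_perturbed = [0.1, 0.9, 0.85, 0.02, 0.3]

stability = top_k_intersection(attr_original, attr_perturbed, k=2)
print(stability)  # 0.5: only one of the two most important features is preserved
```

A low score for an input whose prediction is unchanged would indicate that the explanation, rather than the decision, was altered by the perturbation.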

Figure 3: Input image and saliency maps generated for two different networks. (Left) Input image. (Top row) ResNet50 trained on a natural dataset. (Bottom row) ResNet50 robustly trained using Projected Gradient Descent (PGD) attacks [tsipras2018robustness_at_odds]. The attribution maps generated for the adversarially trained network are more visually appealing.

5.2 Adversarial Robustness

Neural networks are known to be vulnerable to “smart noise” or adversarial attacks. These attacks are quasi-imperceptible perturbations in the input, measured using $\ell_p$ norms, that force a network to change its output [madry2018towards]. Currently, adversarial training is the most common strategy that may provide limited defence against known attacks. In adversarial training, a modified objective function is optimized which helps in adversarial robustness by increasing the level of perturbation required to successfully change the network decision [madry2018towards]. With $\theta$ denoting the learnable parameters of the model, training data $(x, y) \sim \mathcal{D}$, and perturbation $\delta \in \mathcal{S}$, adversarial training can be formulated as the following min-max optimization problem:

$$\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \max_{\delta \in \mathcal{S}} L(\theta, x + \delta, y) \right],$$

where $L$ denotes the model’s loss function and $\mathcal{S}$ is the set of allowed perturbations (e.g., an $\ell_p$-ball of small radius).
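The min-max structure of adversarial training can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the setup of the cited work: it uses a logistic-regression model in place of a deep network, an $\ell_\infty$ perturbation set of radius `epsilon`, and a PGD-style inner loop. All names (`pgd_perturbation`, `adversarial_training_step`) and numeric values are hypothetical.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    # Binary cross-entropy of a logistic-regression "network", y in {0, 1}.
    p = sigmoid(dot(w, x) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def sign(v):
    return (v > 0) - (v < 0)

def pgd_perturbation(w, b, x, y, epsilon, step_size, n_iters):
    # Inner maximization: signed gradient ascent on the loss with respect to
    # delta, projected (clipped) onto the l_inf ball of radius epsilon.
    delta = [0.0] * len(x)
    for _ in range(n_iters):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        p = sigmoid(dot(w, x_adv) + b)
        grad_x = [(p - y) * wi for wi in w]  # d(loss)/dx for this model
        delta = [max(-epsilon, min(epsilon, di + step_size * sign(gi)))
                 for di, gi in zip(delta, grad_x)]
    return delta

def adversarial_training_step(w, b, x, y, lr, epsilon, step_size, n_iters):
    # Outer minimization: one gradient-descent step on the loss evaluated
    # at the perturbed input x + delta.
    delta = pgd_perturbation(w, b, x, y, epsilon, step_size, n_iters)
    x_adv = [xi + di for xi, di in zip(x, delta)]
    p = sigmoid(dot(w, x_adv) + b)
    new_w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x_adv)]
    new_b = b - lr * (p - y)
    return new_w, new_b

w, b = [1.0, -2.0], 0.0
x, y = [0.5, 0.5], 1
delta = pgd_perturbation(w, b, x, y, epsilon=0.1, step_size=0.05, n_iters=5)
x_adv = [xi + di for xi, di in zip(x, delta)]
new_w, new_b = adversarial_training_step(
    w, b, x, y, lr=0.1, epsilon=0.1, step_size=0.05, n_iters=5)
```

In practice the outer step is averaged over mini-batches of training data, and the inner loop is run afresh for every batch, which is what makes adversarial training considerably more expensive than standard training.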

Adversarial training can be considered as a method for the model to learn certain ($\ell_p$-bounded) invariances to the dataset. Some recent studies have established that learning certain types of invariances may improve attribution maps both qualitatively (visually) and quantitatively, i.e., the maps look more relevant to the object as viewed by a human operator [tsipras2018robustness_at_odds, etmann2019connection]. In a way, adversarial training helps the network learn more like the human visual system learns. Figure 3 shows the difference between attribution maps generated for adversarially and naturally trained networks. The attribution maps produced by adversarially trained models seem more visually aligned with human perception [ross2018improving, tsipras2018robustness_at_odds, kim2019bridging, etmann2019connection]. Tsipras et al. described this relationship between adversarial robustness and enhanced visual alignment of attribution maps as an “unexpected benefit” of adversarial training [tsipras2018robustness_at_odds]. Later, a number of studies verified and made an effort to explain the natural connection between adversarial robustness and explainability [etmann2019connection, kim2019bridging, ignatiev2019relating].

Etmann et al. showed that the improved interpretability of the saliency maps of a robustified neural network was not a side-effect of adversarial training, but a general property enjoyed by networks that are robust to adversarial perturbations [etmann2019connection]. The authors showed that robustness could be defined as the distance of a test point to its closest decision boundary, and that increasing the distance (robustness) resulted in an increased alignment between the input and its attribution map.
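The link between distance to the decision boundary and alignment can be made concrete in the simplest possible setting. The derivation below is a sketch for a binary linear classifier without bias; the symbols $\Psi$, $w$, $\rho$, and $\alpha$, and the specific form chosen for the alignment, are assumptions introduced here for illustration.

```latex
% Binary linear classifier: predicted class is sign(\Psi(x)).
\Psi(x) = \langle w, x \rangle

% Robustness: Euclidean distance from x to the decision boundary
% \{ x' : \langle w, x' \rangle = 0 \}.
\rho(x) = \frac{|\langle w, x \rangle|}{\lVert w \rVert_2}

% Saliency map: the gradient of the score with respect to the input.
\nabla_x \Psi(x) = w

% Alignment between the input and its saliency map (assumed definition).
\alpha(x) = \frac{|\langle x, \nabla_x \Psi(x) \rangle|}{\lVert \nabla_x \Psi(x) \rVert_2}
          = \frac{|\langle x, w \rangle|}{\lVert w \rVert_2}
          = \rho(x)
```

Under these assumptions the two quantities coincide exactly, so moving a point further from the boundary increases its alignment by the same amount. For nonlinear networks the two quantities need not be equal.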

Recently, Kim et al. showed that the gradients from adversarially trained networks were better aligned with the human visual system, as adversarial training caused the gradients to lie closer to the image manifold [kim2019bridging]. They also reported differences in the attribution maps generated with robust networks trained using adversarial images bounded under different norms. Networks trained under one norm were more effective at emphasizing important features, while attributions from networks trained under the other norm were better at identifying less important features.

Ignatiev et al. performed a theoretical analysis using a generalized form of hitting set duality to relate explanations and adversarial examples [ignatiev2019relating]. The authors proposed the dual concept of counterexamples (and adversarial examples) and the notion of breaking an explanation. They established that each explanation must break every counterexample and vice versa, and concluded that the more counterexamples (adversarial examples) the model explains, the better the interpretability of the model.

6 Best Practices for the Community

Explainability of black box machine learning models is important for multiple reasons, including understanding the internal workings of these models. Attribution methods are in themselves a set of mathematical operations with certain assumptions and may add another layer of abstraction over the goal of understanding data and making predictions. When analysing explanations, it is also not clear how to disentangle errors in the explanation method from errors in the DL model. Currently, there is no consensus on which methods are better than others at explaining network predictions. However, there are some considerations that should be made when choosing attribution methods.

Gradient-based vs. Perturbation Methods

: Gradient-based methods are computationally less expensive as in some cases these may require only one forward and one backpropagation step for estimating attributions. Perturbation-based methods generally solve an optimization problem and thus may require multiple forward passes through the network. Furthermore, gradient-based methods are more robust to input perturbations as compared to perturbation-based methods and should be preferred when robustness is a priority for the user [alvarez2018robustness].


Efficiency

: The efficiency of an attribution method can be related to the number of passes (forward and backward) through the network. Saliency maps, Gradient × Input, GBP, and Grad-CAM require one forward and one backpropagation step. IG and SmoothGrad may require many such steps, with the exact number depending upon the problem domain, dataset, and the scope of explanation.
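The difference in cost can be sketched by counting gradient evaluations on a toy model. The example below is illustrative only: `CountingModel` is a hypothetical stand-in for a network, with an analytic gradient in place of backpropagation, and the SmoothGrad settings (`n_samples`, `sigma`) are assumed values rather than recommendations.

```python
import random

class CountingModel:
    """Toy model f(x) = sum(x_i^3) that counts how often its gradient is requested."""

    def __init__(self):
        self.gradient_calls = 0

    def gradient(self, x):
        # In a real network, each call would cost one forward and one backward pass.
        self.gradient_calls += 1
        return [3.0 * xi * xi for xi in x]

def saliency(model, x):
    # A plain saliency map needs a single gradient evaluation.
    return model.gradient(x)

def smoothgrad(model, x, n_samples, sigma, rng):
    # SmoothGrad averages gradients over noisy copies of the input,
    # so it needs one gradient evaluation per sample.
    total = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        grad = model.gradient(noisy)
        total = [t + g for t, g in zip(total, grad)]
    return [t / n_samples for t in total]

model = CountingModel()
x = [1.0, -2.0, 0.5]

saliency_map = saliency(model, x)
calls_saliency = model.gradient_calls

model.gradient_calls = 0
smooth_map = smoothgrad(model, x, n_samples=20, sigma=0.1, rng=random.Random(0))
calls_smoothgrad = model.gradient_calls

print(calls_saliency, calls_smoothgrad)  # 1 20
```

IG behaves like SmoothGrad in this respect, with one gradient evaluation per interpolation step between the baseline and the input.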

Input Baseline

: Some attribution methods require an input baseline, which acts as the absence of the feature from the input. The baseline can be zero (a black image), an average value calculated from the dataset, a blurred version of the input image, or random values generated with Gaussian, uniform or other distributions. The choice of baseline can significantly alter the explanation [sturmfels2020visualizing, kindermans2019reliability]. Since there is no current consensus on which baseline is optimal, it is difficult to recommend the use of these methods as an accurate way to explain model predictions.
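The effect of the baseline can be seen on a small example. The sketch below assumes a toy model f(x) = sum of x_i squared with an analytic gradient and approximates IG with a midpoint Riemann sum; the function names, the input, and both baselines are hypothetical choices made for illustration.

```python
def model(x):
    # Toy differentiable model: f(x) = sum(x_i^2).
    return sum(xi * xi for xi in x)

def model_gradient(x):
    return [2.0 * xi for xi in x]

def integrated_gradients(x, baseline, n_steps=50):
    # Midpoint Riemann-sum approximation of the path integral of the gradient
    # along the straight line from the baseline to the input.
    total = [0.0] * len(x)
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        grad = model_gradient(point)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - bi) * t / n_steps for xi, bi, t in zip(x, baseline, total)]

x = [1.0, 0.5, 0.0]
attr_zero = integrated_gradients(x, baseline=[0.0, 0.0, 0.0])
attr_mean = integrated_gradients(x, baseline=[0.5, 0.5, 0.5])

print(attr_zero)  # approximately [1.0, 0.25, 0.0]
print(attr_mean)  # approximately [0.75, 0.0, -0.25]
```

With the zero baseline the third feature receives no attribution, whereas with the constant 0.5 baseline it receives a negative attribution and the second feature drops to zero. In both cases the attributions sum to the difference between the model output at the input and at the baseline, so neither result is wrong; they answer different questions.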

Human Interpretability

: Relying solely on visual analysis for understanding and comparing attribution maps can be unreliable [adebayo2018sanity]. Some attribution maps may seem visually appealing, but they may not actually help us interpret model predictions. On the other hand, there is still no consensus in the community on the reliability of the various metrics used for comparing attribution maps. Finally, a typical attribution method tends to consider each pixel as the fundamental unit of explanation, which is not the basic unit used in human perception. Some recent studies have pointed out the limited usefulness of current model explanation methods (e.g., heatmaps) and highlighted the need for a deeper investigation into the methods for presenting interpretations of models to human operators [chu2020visual].

7 Conclusion and Future Research Directions

We have presented an overview of the post hoc gradient-based attribution methods to explain the decisions made by deep neural networks. These techniques represent a small but very significant part of a large body of methods that focus on explaining black box ML models. These methods are fast and can provide explanations that are robust as compared to other approaches. In some cases, these explanations may seem convincing. However, they should be approached with caution due to the inherent limitations of these methods as discussed above. Application of these methods in real-world settings, without comprehending their limitations, can create a false sense of confidence in ML decision-making.

We consider that the robustness of interpretability methods is tightly coupled with the robustness of the models being explained. This area needs research efforts on both fronts, empirical as well as theoretical. There is a need to bring research communities from explainability and robustness together to explore these questions. Finally, given the state-of-the-art in post hoc explainability methods and their vulnerabilities and limitations, there is a need in the ML community to focus on building models that are inherently explainable, but as versatile, efficient, accurate and scalable as deep neural networks.

8 Acknowledgements

This research was supported by National Science Foundation Awards ECCS-1903466 and OAC-2008690, and US Dept of Education GAANN award P200A180055.