Evaluations and Methods for Explanation through Robustness Analysis

by   Cheng-Yu Hsieh, et al.
Carnegie Mellon University

Among multiple ways of interpreting a machine learning model, measuring the importance of a set of features tied to a prediction is probably one of the most intuitive ways to explain a model. In this paper, we establish the link between a set of features and a prediction with a new evaluation criterion, robustness analysis, which measures the minimum distortion distance of adversarial perturbation. By measuring the tolerance level for an adversarial attack, we can extract a set of features that provides the most robust support for a prediction, and, by setting a targeted adversarial attack, we can also extract a set of features that contrasts the current prediction with a target class. By applying this methodology to various prediction tasks across multiple domains, we observe that the derived explanations indeed capture the significant feature set, both qualitatively and quantitatively.


1 Introduction

There is increasing interest in machine learning models that are credible, fair, and more generally interpretable [13]. Researchers have explored various notions of model interpretability, ranging from trustability [30] and fairness of a model [48] to characterizing the model’s weak points [22, 43]. Even though the goals of these various model interpretability tasks vary, the vast majority of them use so-called feature-based explanations, which assign importance values to individual features.

There have also been a slew of recent evaluation measures for feature based explanations, such as completeness [36], sensitivity-n [2], infidelity [42], causal local explanation metric [29], and, most relevant to the current paper, smallest sufficient region (SSR) and smallest destroying region (SDR) [33, 16, 10]. A common thread in all these evaluation measures is quantifying how closely the sum of feature importances approximates the difference in function value after removing the set of features. Intuitively, for a good feature based explanation, removing the most salient features should lead to a large difference in prediction score.

One key caveat with the aforementioned evaluations of feature explanations is the bias that arises in the way they operationalize “removing features,” which is typically by setting them to some arbitrary reference value. The choice of these reference values inherently introduces some bias. For example, if we set the feature value to 0 in RGB images, this introduces a bias favoring bright pixels: explanations that optimize such evaluations often omit important dark objects, which could constitute pertinent negative features in the image, i.e., regions that do not contain the object but where the absence of the object is crucial to the prediction [12]. An alternative approach to “remove features” is to sample from some predefined distribution or a generative model [7]. This in turn incurs the bias inherent to the generative model, and accurate generative models that approximate the data distribution well might not be available in all domains.

In this paper, we take a slightly different perspective, focusing on small but adversarial perturbations rather than removal of features or large perturbations to reference values. Such “minimum adversarial perturbation” is typically used in the context of test-time robustness [18, 39], but here we harness it for feature based explanations. The key idea behind doing so is that adversarial perturbations on irrelevant features should be ineffective, while only those on relevant features should be effective. Thus, by quantifying the effectiveness of adversarial perturbations restricted to a feature subset, we can in turn evaluate any feature based explanation. While exactly computing such an effectiveness measure is NP-hard [21], we can leverage recent results from the test-time robustness literature [6, 26], which show that perturbations computed by adversarial attacks can serve as reasonably tight upper bounds, leading to an efficient approximation for the proposed evaluation.

Given this adversarial effectiveness evaluation measure, we can also design feature based explanations that optimize this evaluation measure. Note that designing such optimal explanations can also be cast as a two-player min-max game between an explainer and adversarial attacker. The explainer aims to find a set of important features, while the adversarial attacker aims to find a perturbation over the irrelevant features that changes the model prediction, with the dueling goals of the attacker aiming to find the smallest perturbation, and the explainer aiming to ensure the perturbation is as large as possible. As we show, the resulting explanations empirically perform much better than previous approaches both quantitatively, as well as with qualitatively convincing examples.

To summarize our contributions:

  • We define new evaluation criteria for feature-based explanations based on robustness analysis involving small adversarial perturbations. These reduce the bias inherent in other recent evaluation measures that focus on “removing features” via large perturbations to some reference values.

  • We design efficient algorithms to generate explanations that maximize the proposed criteria, which perform favorably against baseline methods on the proposed evaluation criteria.

  • Experiments in computer vision and NLP models demonstrate that the proposed explanation can identify important features that are not captured by previous methods. An additional facet of our approach is that it is able to extract a “contrast important” set of features that specifically contrast why the model makes its current prediction instead of a target class.

2 Robustness Analysis for Evaluating Explanations

2.1 Problem Notation

Let us consider the following setting: a general $K$-way classification problem with input space $\mathcal{X} \subseteq \mathbb{R}^d$, output space $\mathcal{Y} = \{1, \dots, K\}$, and a predictor function $f : \mathcal{X} \to \mathcal{Y}$, where $f(x)$ denotes the output class for some input example $x \in \mathcal{X}$. Then, for a particular prediction $f(x) = y$, despite the different forms of existing feature-based explanations, ranging from attributing an importance value to each feature, to ranking the features by their importance, to simply identifying a set of important features, a common goal of them is to extract a compact set of relevant features with respect to the prediction.

2.2 Evaluation through Robustness Analysis

A common thread underlying evaluations of feature based explanations, even ranging over axiomatic treatments [36, 25], is that the importance of a set of features corresponds to the change in prediction of the model when the features are removed from the original input. Nevertheless, as we discussed in the previous section, operationalizing such a removal of features, for instance, by setting them to some reference value, introduces biases. To finesse this, we leverage adversarial robustness, but to do so in this context, we rely on two key assumptions:

Assumption 1:

When the values of the important features are anchored (fixed), perturbations restricted to the complementary set of features have a weaker influence on the model prediction.

Assumption 2:

When perturbations are restricted to the set of important features, fixing the values of the rest of the features, even small perturbations could easily change the model prediction.

Based on these two assumptions, we propose a new framework based on adversarial robustness for evaluating feature based explanations.

Definition 2.1

Given a set of features $S \subseteq U$, its minimum adversarial perturbation norm, which we will also term Robustness-$S$, is defined as:

$$\epsilon^*_{x_S} = \min_{\delta} \|\delta\|_p \quad \text{s.t.} \quad f(x + \delta) \neq y, \;\; \delta_{\overline{S}} = 0, \tag{1}$$

where $y = f(x)$ is the current prediction, $U$ is the set of all features, $\overline{S} = U \setminus S$ is the complementary set of features, and $\delta_{\overline{S}} = 0$ means that the perturbation is constrained to be zero along features in $\overline{S}$.

Suppose that the feature based explanation partitions the input features into a relevant set $S_r$ and an irrelevant set $\overline{S_r}$. Assumption 1 implies that the quality of the relevant set $S_r$ can be measured by $\epsilon^*_{x_{\overline{S_r}}}$, the adversarial robustness to perturbations that keep the relevant set unchanged and perturb only the irrelevant set. Specifically, from Assumption 1, a larger coverage of pertinent features in the set $S_r$ entails a higher robustness value $\epsilon^*_{x_{\overline{S_r}}}$. On the other hand, from Assumption 2, such a coverage of pertinent features in the set $S_r$ would in turn entail a smaller robustness value $\epsilon^*_{x_{S_r}}$, which measures the magnitude of adversarial perturbations restricted to the relevant set. Therefore, Assumptions 1 and 2 together build up our twin proposed evaluation criteria: Robustness-$\overline{S_r}$ and Robustness-$S_r$.

To summarize:


Robustness-$\overline{S_r}$ measures the minimum adversarial perturbation when the set of important features $S_r$, typically represented by the high-weight features in a feature importance map, is anchored, and perturbations are only allowed in the low-weight regions. The higher the score, the better the explanation.

Robustness-$S_r$ measures the minimum adversarial perturbation when only the set of important features $S_r$ can be perturbed, and the rest of the feature values are anchored. Contrary to the above, lower scores on this metric indicate a better explanation.
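As a minimal illustration of the two criteria, consider a binary linear classifier, for which the minimum $\ell_2$ perturbation restricted to a feature set has a closed form (the margin divided by the norm of the weights on that set). The linear model, the weights, and the helper name `robustness_linear` are assumptions made purely for illustration; they are not part of the evaluation pipeline described in this paper, which targets neural networks.

```python
import numpy as np

def robustness_linear(w, b, x, subset):
    """Exact minimum l2 perturbation, restricted to `subset`, that reaches the
    decision boundary of the toy binary classifier sign(w @ x + b).

    Returns np.inf when no perturbation on `subset` can move the score.
    """
    idx = np.array(sorted(subset), dtype=int)
    norm = np.linalg.norm(w[idx])
    if norm == 0.0:
        return np.inf
    return abs(w @ x + b) / norm

# Hypothetical toy model: features 0 and 2 carry almost all of the weight.
w = np.array([3.0, 0.0, 4.0, 0.1])
b = 0.0
x = np.ones(4)                      # score = 7.1

relevant = {0, 2}
irrelevant = set(range(len(w))) - relevant

# Perturb only the relevant set (lower is better for a good explanation).
rob_relevant = robustness_linear(w, b, x, relevant)
# Anchor the relevant set, perturb the rest (higher is better).
rob_anchored = robustness_linear(w, b, x, irrelevant)
print(rob_relevant, rob_anchored)
```

In this toy case the attacker needs a far larger perturbation when the two heavy features are anchored than when they are the only ones allowed to move, which is the behaviour Assumptions 1 and 2 describe.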

Specifying the sets $S_r$, $\overline{S_r}$.

To measure Robustness-$\overline{S_r}$ and Robustness-$S_r$, we would need to first determine the relevant set $S_r$ and its complement $\overline{S_r}$. Given any feature attribution method that assigns weights to each feature, once we have the size $k$ of the set of important features, we can sort the features in descending order of importance weights and take the top-$k$ features as $S_r$. We thus largely need to specify the size of $S_r$. We can set $|S_r|$ to the number of anchors that a user is interested in, or we may vary the size of $S_r$ and evaluate the corresponding values of Robustness-$\overline{S_r}$ and Robustness-$S_r$ at different points. By varying the size of $S_r$, we can plot an evaluation curve for each explanation and in turn measure the area under the curve (AUC), which corresponds to the average Robustness-$\overline{S_r}$ and Robustness-$S_r$ at different sizes of the relevant set. A larger (smaller) area under the curve for Robustness-$\overline{S_r}$ (Robustness-$S_r$) indicates a better feature attribution ranking (see examples in Figure 1).
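The curve-and-AUC procedure described above can be sketched as follows. This is an illustrative sketch only: `robustness_fn` is a hypothetical callable that returns a robustness estimate for perturbations restricted to a given feature set, and the toy linear model used to exercise it is an assumption, not one of the models studied here.

```python
import numpy as np

def top_k_set(attribution, k):
    """Indices of the k features with the largest attribution weights."""
    order = np.argsort(-np.asarray(attribution, dtype=float), kind="stable")
    return set(order[:k].tolist())

def evaluation_curves(attribution, robustness_fn, sizes):
    """Robustness with the top-k set anchored, and with only the top-k set
    perturbed, for every relevant-set size k in `sizes`."""
    all_features = set(range(len(attribution)))
    anchored, perturbed = [], []
    for k in sizes:
        s_r = top_k_set(attribution, k)
        anchored.append(robustness_fn(all_features - s_r))   # attack the complement
        perturbed.append(robustness_fn(s_r))                 # attack the relevant set
    return np.array(anchored), np.array(perturbed)

def area_under_curve(sizes, values):
    """Trapezoidal area under the evaluation curve."""
    sizes = np.asarray(sizes, dtype=float)
    values = np.asarray(values, dtype=float)
    return float(np.sum(np.diff(sizes) * (values[1:] + values[:-1]) / 2.0))

# Toy stand-in for the robustness estimate (closed form for a linear model).
w = np.array([4.0, 3.0, 2.0, 1.0])
x = np.ones(4)

def toy_robustness(subset):
    norm = np.linalg.norm(w[np.array(sorted(subset), dtype=int)])
    return abs(w @ x) / norm if norm > 0 else np.inf

sizes = [1, 2, 3]
good_anchored, good_perturbed = evaluation_curves(np.abs(w), toy_robustness, sizes)
bad_anchored, bad_perturbed = evaluation_curves(-np.abs(w), toy_robustness, sizes)

auc_good_anchored = area_under_curve(sizes, good_anchored)
auc_bad_anchored = area_under_curve(sizes, bad_anchored)
auc_good_perturbed = area_under_curve(sizes, good_perturbed)
auc_bad_perturbed = area_under_curve(sizes, bad_perturbed)
```

Here the attribution that ranks features by true weight obtains a larger AUC when the relevant set is anchored and a smaller AUC when only the relevant set is perturbed, compared with the reversed ranking.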

Figure 1: Evaluation curves for different methods under Robustness-$\overline{S_r}$ (left) and Robustness-$S_r$ (right) with varying size of $S_r$. For Robustness-$\overline{S_r}$ (left), the higher the better; for Robustness-$S_r$ (right), the lower the better. Note that we could calculate the area under the curves for each method to summarize its performance. We omit points whose values are too high to fit in the scale of the y-axis.

Untargeted vs. Targeted Explanation.

Definition 2.1 corresponds to untargeted adversarial robustness: a perturbation that changes the predicted class to any label other than $y$ is considered a successful attack. Our formulation can also be extended to targeted adversarial robustness, where we replace (1) by

$$\epsilon^*_{x_S, t} = \min_{\delta} \|\delta\|_p \quad \text{s.t.} \quad f(x + \delta) = t, \;\; \delta_{\overline{S}} = 0, \tag{2}$$

where $t$ is the targeted class. Using this definition, our approach will try to address the question “Why is this example classified as $y$ instead of $t$?”, and the important features that optimize this criterion will highlight the contrast between class $y$ and class $t$. We will give examples of the “targeted explanations” in the experiment section.

Robustness Evaluation under Fixed Anchor Set.

It is known that computing the exact minimum distortion distance in modern neural networks is intractable [21], so many different methods have been developed to estimate the value. Adversarial attacks, such as C&W [6] and the PGD attack [26], aim to find a feasible solution of (1), which leads to an upper bound of $\epsilon^*_{x_S}$. They are based on gradient-based optimizers, which are usually efficient. On the other hand, neural network verification methods aim to provide a lower bound of $\epsilon^*_{x_S}$ to ensure that the model prediction will not change within a certain perturbation range [35, 40, 38, 17, 45, 37, 46]. However, these methods are usually time consuming (often many times slower than a backpropagation).

The proposed framework can be combined with any method that aims to approximately compute (1), including attack, verification, and some other statistical estimations. However, for simplicity we only choose to evaluate (1) by the state-of-the-art projected gradient descent (PGD) attack [26], since the verification methods are too slow and often lead to much looser estimation as reported in some recent studies [32].
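The attack-based upper bound can be sketched as a PGD attack whose perturbation is masked to the allowed features, wrapped in a binary search over the perturbation radius. The sketch below is written under several assumptions that do not come from this paper: the $\ell_2$ norm, a toy linear classifier with a margin loss, a heuristic step size, and the hypothetical names `masked_pgd` and `robustness_upper_bound`. It is meant to show the mechanics, not the implementation used in the experiments.

```python
import numpy as np

def masked_pgd(x, y, predict, loss_grad, mask, eps, steps=50):
    """l2 PGD whose perturbation is forced to zero wherever mask == 0.

    Returns a perturbation that changes the prediction, or None on failure.
    """
    delta = np.zeros_like(x)
    step_size = 2.5 * eps / steps                 # heuristic choice
    for _ in range(steps):
        g = loss_grad(x + delta, y) * mask        # ignore anchored features
        g_norm = np.linalg.norm(g)
        if g_norm == 0.0:
            break
        delta = delta + step_size * g / g_norm    # normalized ascent step
        d_norm = np.linalg.norm(delta)
        if d_norm > eps:                          # project onto the l2 ball
            delta = delta * (eps / d_norm)
        if predict(x + delta) != y:
            return delta
    return None

def robustness_upper_bound(x, y, predict, loss_grad, mask, hi=100.0, iters=30):
    """Binary search for the smallest radius at which masked PGD succeeds."""
    if masked_pgd(x, y, predict, loss_grad, mask, hi) is None:
        return np.inf
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if masked_pgd(x, y, predict, loss_grad, mask, mid) is None:
            lo = mid
        else:
            hi = mid
    return hi

# Toy linear classifier (an assumption for illustration).
w = np.array([3.0, 0.0, 4.0, 0.1])
predict = lambda z: 1 if w @ z > 0 else -1
loss_grad = lambda z, y: -y * w                   # gradient of the loss -y * (w @ z)

x = np.ones(4)
y = predict(x)
mask_relevant = np.array([1.0, 0.0, 1.0, 0.0])    # only features 0 and 2 may move
ub_relevant = robustness_upper_bound(x, y, predict, loss_grad, mask_relevant)
ub_anchored = robustness_upper_bound(x, y, predict, loss_grad, 1.0 - mask_relevant)
exact_relevant = abs(w @ x) / np.linalg.norm(w * mask_relevant)
```

For the linear toy model the exact answer is known in closed form, so the estimate returned by the attack can be checked to sit at or just above it, as expected of an upper bound.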

3 Extracting Model Supports through Robustness Analysis

Our adversarial robustness based evaluation allows us to evaluate any given feature based explanation. Here, we set out to design new explanations that explicitly optimize our evaluation measure. We focus on feature set based explanations, where we aim to provide an important subset of features $S_r$. Given our proposed evaluation measure, an optimal subset of features $S_r$ would aim to maximize (minimize) Robustness-$\overline{S_r}$ (Robustness-$S_r$) under a cardinality constraint on the feature set, leading to the following set of optimization problems:

$$\max_{S_r \subseteq U} \; g(f, x, \overline{S_r}) \quad \text{s.t.} \quad |S_r| \le k, \tag{3}$$

$$\min_{S_r \subseteq U} \; g(f, x, S_r) \quad \text{s.t.} \quad |S_r| \le k, \tag{4}$$

where $k$ is a pre-defined size constraint on the set $S_r$, and $g(f, x, S)$ computes the minimum adversarial perturbation from Eqn. (1), with perturbations restricted to the set $S$.

It can be seen that this sets up an adversarial min-max game: the goal of the feature set explainer is to come up with a set $S_r$ such that the minimal adversarial perturbation is as large as possible, while the adversarial attacker, given a set $S_r$, aims to design adversarial perturbations that are as small as possible. Directly solving these min-max problems in (3) and (4) is thus challenging, and the discrete input constraint makes it intractable to find the optimal solution. As a result, in the next section, we propose a greedy algorithm to estimate the optimal explanation sets.

3.1 Greedy Algorithm to Compute Optimal Explanations

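We first consider a greedy algorithm where we iteratively add to $S_r$ the most promising feature, i.e., the one that optimizes the objective at each local step, until $S_r$ reaches the size constraint. In other words, we initialize the set $S_r$ as empty, and sequentially solve the following subproblem at every step $i$:

$$S_r^{(i+1)} = S_r^{(i)} \cup \Big\{ \arg\max_{j \in \overline{S_r^{(i)}}} \; g\big(f, x, \overline{S_r^{(i)} \cup \{j\}}\big) \Big\} \quad \text{or} \quad S_r^{(i+1)} = S_r^{(i)} \cup \Big\{ \arg\min_{j \in \overline{S_r^{(i)}}} \; g\big(f, x, S_r^{(i)} \cup \{j\}\big) \Big\}, \tag{5}$$

where $S_r^{(i)}$ is the anchor set at step $i$, $S_r^{(0)} = \emptyset$, and $g(f, x, S)$ denotes the minimum adversarial perturbation restricted to a set $S$ as in Eqn. (1). The first form is used when maximizing the robustness with $S_r$ anchored, and the second when minimizing the robustness with only $S_r$ perturbed. We repeat this subprocedure until the size of the set $S_r^{(i)}$ reaches $k$. We name this method Greedy. A straightforward way of solving (5) is to exhaustively search over every single feature.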
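The greedy procedure with exhaustive search over single features can be sketched as below. The function name `greedy_select` and the `objective` callable are hypothetical; `objective` stands in for whatever robustness estimate is being optimized, and the linear toy objectives are assumptions used only to exercise the loop.

```python
import numpy as np

def greedy_select(num_features, objective, size, maximize=True):
    """Grow the relevant set one feature at a time.

    At each step every unchosen feature is tried, and the one giving the best
    value of `objective(candidate_relevant_set)` is added.
    """
    selected = []
    remaining = list(range(num_features))
    while len(selected) < size and remaining:
        scores = [objective(frozenset(selected + [j])) for j in remaining]
        best = max(scores) if maximize else min(scores)
        pick = remaining[scores.index(best)]
        selected.append(pick)
        remaining.remove(pick)
    return selected

# Toy objectives built from a linear model (assumption for illustration).
w = np.array([1.0, 4.0, 2.0, 3.0])
margin = float(np.sum(w))            # score of the all-ones input

def restricted_norm(subset):
    return np.linalg.norm(w[np.array(sorted(subset), dtype=int)])

def robustness_anchored(s_r):
    """Robustness when s_r is anchored and only the complement is perturbed."""
    complement = set(range(len(w))) - set(s_r)
    norm = restricted_norm(complement)
    return margin / norm if norm > 0 else np.inf

def robustness_perturbed(s_r):
    """Robustness when only s_r is perturbed."""
    norm = restricted_norm(s_r)
    return margin / norm if norm > 0 else np.inf

picked_max = greedy_select(len(w), robustness_anchored, size=2, maximize=True)
picked_min = greedy_select(len(w), robustness_perturbed, size=2, maximize=False)
print(picked_max, picked_min)
```

Each greedy step costs one robustness evaluation per unchosen feature, which is what makes the exhaustive variant expensive on high-dimensional inputs.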

3.2 Greedy by Set Aggregation Score

The main downside of using the greedy algorithm to optimize the objective function is that it ignores the interactions among features. Two features may perform badly when evaluated separately but become useful when added simultaneously. Therefore, in each greedy step, instead of considering how each individual feature will contribute to the objective, we propose to choose features based on their expected performance when evaluated together with other unchosen features. To measure such an aggregation score, we randomly choose sets of features and evaluate the objective function when these sets of features are added. Then we learn a regression function to distribute the performance of each set to each individual feature. Mathematically, let $S_r^{(i)}$ and $\overline{S_r^{(i)}}$ be the ordered sets of chosen and unchosen features at step $i$ respectively, and let $\mathcal{P}(\overline{S_r^{(i)}})$ be all possible subsets of $\overline{S_r^{(i)}}$. We measure the expected contribution that including each unchosen feature in the relevant set would have on the objective function by solving the following regression problem:

$$w^{(i)}, c^{(i)} = \arg\min_{w, c} \sum_{S \in \mathcal{P}(\overline{S_r^{(i)}})} \Big( \big( w^\top b(S) + c \big) - v\big(S_r^{(i)} \cup S\big) \Big)^2, \tag{6}$$

where $v(\cdot)$ denotes the objective function being optimized and $b : \mathcal{P}(\overline{S_r^{(i)}}) \to \{0, 1\}^{|\overline{S_r^{(i)}}|}$ is a function that projects a set into its corresponding binary vector form: $b(S)[j] = \mathbb{I}\big(\overline{S_r^{(i)}}[j] \in S\big)$, i.e., ones in the vector indicate the inclusion of the corresponding feature indices in the set and zeros otherwise. After the regression is learned, we can treat the coefficients $w^{(i)}$ as each corresponding feature’s approximated contribution to the objective value when they are included in the set $S_r^{(i)}$.

We note that $w^{(i)}$ corresponds to the well-known Banzhaf value [4] when $S_r^{(i)} = \emptyset$, which is an axiomatic way to aggregate the importance of each player taking coalitions of players into account [14]. Hammer and Holzman [20] show that the Banzhaf value is equivalent to the optimal solution of a linear regression with pseudo-Boolean functions as targets, which corresponds to (6) with $S_r^{(i)} = \emptyset$. The Banzhaf value can be interpreted as the importance of each player by taking coalitions of features into account. In each greedy step, we choose the features with the highest aggregation score (Banzhaf value), which additionally considers the feature interactions between unchosen features compared to vanilla greedy. The features chosen at each step are added to $S_r^{(i)}$ and removed from $\overline{S_r^{(i)}}$. When $S_r^{(i)}$ is not $\emptyset$, the solution of (6) can still be seen as a Banzhaf value where the players are those features that are in $\overline{S_r^{(i)}}$, and the value function includes the features that are in $S_r^{(i)}$. We solve (6) by subsampling to lower the computational cost. We validate the effectiveness of greedy with aggregation score (Greedy-AS) in the experiment section.

Footnote 1: We found that, in parallel to our work, a greedy procedure that chooses the players with the highest restricted Banzhaf value was used in Elkind et al. [15].
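One step of the aggregation-score regression can be sketched as follows. This is a sketch under assumptions: subsets are sampled by keeping each unchosen feature with probability 1/2, the value function is a hand-made toy with an explicit interaction, and the names `aggregation_scores` and `toy_value` are hypothetical.

```python
import numpy as np

def aggregation_scores(unchosen, chosen, value_fn, num_samples=500, seed=0):
    """Least-squares estimate of each unchosen feature's contribution.

    Random subsets S of `unchosen` are drawn, value_fn(chosen | S) is
    evaluated, and a linear model w^T b(S) + c is fitted to those values.
    """
    rng = np.random.default_rng(seed)
    unchosen = list(unchosen)
    B = rng.integers(0, 2, size=(num_samples, len(unchosen)))   # rows are b(S)
    targets = np.array([
        value_fn(set(chosen) | {unchosen[j] for j in np.flatnonzero(row)})
        for row in B
    ])
    design = np.hstack([B, np.ones((num_samples, 1))])          # last column fits c
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return dict(zip(unchosen, coef[:-1]))

def toy_value(s):
    """Features 0 and 1 only help together; feature 2 helps a little alone."""
    return 5.0 * ({0, 1} <= s) + 1.0 * (2 in s)

# Evaluated one at a time, features 0 and 1 look useless, so plain greedy
# would pick feature 2 first.
single = {j: toy_value({j}) for j in range(4)}

# The regression spreads the joint gain of {0, 1} over both features.
scores = aggregation_scores(unchosen=[0, 1, 2, 3], chosen=[], value_fn=toy_value)
top_two = sorted(scores, key=scores.get, reverse=True)[:2]
```

With an empty chosen set and uniform subset sampling, the fitted coefficients approximate the Banzhaf values of the toy game, in line with the equivalence cited above; the top-scoring features would then be moved from the unchosen set to the chosen set before the next step.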

| Dataset | Criterion | Grad | IG | SHAP | LOO | BBMP | Greedy-AS | Greedy | One-Step Banzhaf |
|---|---|---|---|---|---|---|---|---|---|
| MNIST | Robustness-$\overline{S_r}$ | 88.00 | 85.98 | 75.48 | 76.59 | 81.31 | 98.01 | 83.57 | 86.37 |
| MNIST | Robustness-$S_r$ | 91.72 | 91.97 | 101.49 | 98.82 | 173.90 | 82.81 | 171.56 | 83.59 |
| ImageNet | Robustness-$\overline{S_r}$ | 27.13 | 26.01 | 18.25 | 23.54 | 22.60 | 31.62 | 21.16 | 24.54 |
| ImageNet | Robustness-$S_r$ | 45.53 | 46.28 | 60.02 | 52.77 | 154.14 | 43.97 | 58.45 | 47.07 |

Table 1: Area under curve of the proposed criteria for various explanations on MNIST and ImageNet. The higher the better for Robustness-$\overline{S_r}$; the lower the better for Robustness-$S_r$. Robustness is measured with (1).

4 Experiments

In this section, we first evaluate different model interpretability methods on the proposed criteria. We justify the effectiveness of the proposed Greedy-AS. We then move onto further validating the benefits of the explanations extracted by Greedy-AS through comparisons to various existing methods both quantitatively and qualitatively. Finally, we demonstrate the flexibility of our method with the ability to provide targeted explanations as mentioned in Section 2.2. We perform the experiments on two image datasets, MNIST [23] and ImageNet [11], as well as a text classification dataset, Yahoo! Answers [47].


In the experiments, we consider for in (1) and (2), i.e., the norm if not otherwise specified. For all quantitative results including the evaluation curves and the corresponding AUCs, we report the average over 100 random examples. For the baseline methods, we include vanilla gradient (Grad) [34] and integrated gradient (IG) [36] from gradient-based approaches; leave-one-out (LOO), or occlusion-1, [44, 24] and SHAP [25] from perturbation-based approaches [2]; and black-box meaningful perturbation (BBMP) [16] from SSR/SDR-based approaches. For the proposed Greedy and Greedy-AS, at each greedy iteration, we include the top-5% features with highest scores into the relevant set to further speed up the selection process. We leave more implementation detail in Appendix A due to space limitation.

Figure 2: Visualization of our proposed methods. The top features selected by Greedy-AS are less noisy.

4.1 Robustness Analysis on Model Interpretability Methods

Here we analyze Greedy-AS as well as various existing explanation methods under both proposed criteria, Robustness-$\overline{S_r}$ and Robustness-$S_r$. For ease of comparison, we calculate the area under the curve (AUC) of each corresponding evaluation curve. We list the results in Table 1 and leave the plots to Appendix B.

Ablation Study on Greedy-AS.

As discussed in Section 3.2, Greedy-AS can be seen as a combination of the original greedy procedure with the approximated contribution of each feature computed by a regression. Here, we examine the importance of both components by comparing Greedy-AS to two baselines: one selects important features based only on the pure Greedy method (Section 3.1), and the other uses only a single step of regression without the iterative greedy procedure. As the latter essentially corresponds to the Banzhaf value, we term this method One-Step Banzhaf. First, as shown in Table 1, the pure Greedy method performs worse than Greedy-AS under both criteria. The inferior performance could be explained by its disregard of feature correlations, which ultimately results in the introduction of noise, as shown in Figure 2. In addition, we also see that Greedy-AS performs better than One-Step Banzhaf. This could result from the fact that One-Step Banzhaf considers the feature interactions among all features with equal probability, whereas in our objective we only care about the interactions with the most important features. By iteratively selecting the features with the highest Banzhaf value in Greedy-AS, we give more weight to the interactions among the most important features through the iterations, which leads to better performance.

Comparisons between Different Explanations.

Furthermore, from Table 1, we observe that the proposed Greedy-AS consistently outperforms other explanation methods on both criteria. On one hand, this suggests that the proposed algorithm indeed successfully optimizes towards the criteria; on the other hand, this might indicate that the proposed criteria capture characteristics of explanations that most of the current explanations do not possess. Another somewhat interesting finding from the table is that while vanilla gradient has generally been viewed as a baseline method, it nonetheless performs competitively on the proposed criteria. We conjecture that this results from the fact that Grad does not assume any reference value, as opposed to other baselines such as LOO, which sets the reference value to zero to mask out the inputs. Admittedly, it might not be surprising that Greedy-AS achieves the best performance on the proposed criteria, since it is explicitly designed to do so. To more objectively evaluate the usefulness of the proposed explanation, we demonstrate different advantages of our method by comparing Greedy-AS to other explanations quantitatively on existing commonly adopted measurements, and qualitatively through visualization, in the following subsections.

| Dataset | Criterion | Grad | IG | SHAP | LOO | BBMP | Greedy-AS |
|---|---|---|---|---|---|---|---|
| MNIST | Insertion | 250.81 | 262.74 | 200.50 | 192.44 | 102.53 | 379.15 |
| MNIST | Deletion | 281.88 | 273.71 | 362.68 | 442.65 | 527.80 | 159.77 |
| ImageNet | Insertion | 54.66 | 78.64 | 1.87 | 32.21 | 103.51 | 152.31 |
| ImageNet | Deletion | 267.37 | 245.67 | 204.38 | 280.40 | 603.89 | 247.20 |

Table 2: AUC of the Insertion and Deletion criteria for various explanations on MNIST and ImageNet. The higher the better for Insertion; the lower the better for Deletion.

4.2 Evaluating Greedy-AS

The Insertion and Deletion Metric.

To show that the proposed explanation does not merely perform well on the very metric it optimizes, we evaluate our method on a suite of existing popular quantitative measurements. In particular, we adopt the Deletion and Insertion criteria proposed by [28], which are generalized variants of the region perturbation criterion presented in [33]. The Deletion criterion measures the probability drop in the predicted class as top-relevant features, indicated by the given explanation, are progressively removed from the input. On the other hand, the Insertion criterion measures the increase in probability of the predicted class as top-relevant features are gradually revealed from an input whose features are originally all masked. Similar to our proposed criteria, a quick drop in Deletion (and thus a small area under the curve) or a sharp increase in Insertion (and thus a large area under the curve) suggests a good explanation, as the selected top-important features could indeed greatly influence the prediction. In the experiments, we follow [33] and remove features by setting their values to randomly sampled values. We plot the evaluation curves (in Appendix C) and report the corresponding AUCs in Table 2. On these two additional criteria, we observe that Greedy-AS performs favorably against other explanations. The results further validate the benefits of the proposed explanation. We note that on ImageNet, SHAP obtains a better performance under the Deletion criterion. We however suspect that such performance comes from adversarial artifacts instead of meaningful explanation, since the explanation provided by SHAP seems to be rather noisy (as shown in Figure 4). This also explains its relatively low performance under the Insertion criterion.

Footnote 2: It has been observed in several previous works that the Deletion criterion tends to favor adversarial artifacts [10, 7].
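The two curves can be computed along the following lines. This is a sketch under assumptions: `prob_fn` is a hypothetical callable returning the predicted-class probability, the toy logistic model is invented for illustration, and the filler values are set to zero only to keep the example deterministic, whereas the experiments above use randomly sampled values.

```python
import numpy as np

def deletion_insertion_curves(x, ranking, prob_fn, filler):
    """Predicted-class probability as top-ranked features are removed from x
    (Deletion) or revealed on a fully masked input (Insertion)."""
    deleted = np.asarray(x, dtype=float).copy()
    inserted = np.asarray(filler, dtype=float).copy()
    deletion, insertion = [prob_fn(deleted)], [prob_fn(inserted)]
    for i in ranking:
        deleted[i] = filler[i]       # remove feature i
        inserted[i] = x[i]           # reveal feature i
        deletion.append(prob_fn(deleted))
        insertion.append(prob_fn(inserted))
    return np.array(deletion), np.array(insertion)

def auc(values):
    """Trapezoidal area with unit spacing between consecutive points."""
    values = np.asarray(values, dtype=float)
    return float(np.sum((values[1:] + values[:-1]) / 2.0))

# Toy logistic model (assumption for illustration).
w = np.array([4.0, 3.0, 2.0, 1.0])
bias = -5.0
prob_fn = lambda z: float(1.0 / (1.0 + np.exp(-(w @ z + bias))))

x = np.ones(4)
filler = np.zeros(4)                 # deterministic stand-in for sampled values

del_good, ins_good = deletion_insertion_curves(x, [0, 1, 2, 3], prob_fn, filler)
del_rev, ins_rev = deletion_insertion_curves(x, [3, 2, 1, 0], prob_fn, filler)
print(auc(del_good), auc(del_rev), auc(ins_good), auc(ins_rev))
```

A ranking that places the heaviest features first yields the smaller Deletion AUC and the larger Insertion AUC in this toy case, matching the reading of the two criteria given above.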

| | Grad | IG | SHAP | LOO | BBMP | Greedy-AS |
|---|---|---|---|---|---|---|
| Corr. | 0.30 | 0.30 | 0.11 | 0.49 | 0.17 | 0.18 |

Table 3: Rank correlation between explanations with respect to original and randomized model.

Sanity Check Metric.

Recent literature has pointed out that an appropriate explanation should be related to the model being explained [1]. To ensure that our proposed explanation does indeed reflect the model behavior, we conduct the sanity check proposed by [1] to check whether our explanations are adequately different when the model parameters are randomly re-initialized. In the experiment, we randomly re-initialize the last fully-connected layer of the neural network model. We then compute the rank correlation between the explanation computed w.r.t. the original model and that computed w.r.t. the randomized model. From Table 3, we observe that Greedy-AS has a much lower rank correlation compared to Grad, IG, and LOO, suggesting that Greedy-AS is indeed sensitive to model parameter change and is able to pass the sanity check.
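The rank correlation between two explanation maps can be computed as sketched below. The text does not state which rank correlation is used, so Spearman correlation (without tie correction) is an assumption here, as are the function name `rank_correlation` and the toy maps.

```python
import numpy as np

def rank_correlation(a, b):
    """Spearman rank correlation between two flattened explanation maps.

    Ranks come from a double argsort, so tied values are not averaged.
    """
    ranks_a = np.argsort(np.argsort(np.ravel(a))).astype(float)
    ranks_b = np.argsort(np.argsort(np.ravel(b))).astype(float)
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])

# Toy explanation maps (illustrative values only).
original = np.array([[0.1, 0.4], [0.3, 0.9]])
same_ranking = original * 10.0 + 1.0      # monotone transform keeps every rank
reversed_ranking = -original              # flips every rank

corr_same = rank_correlation(original, same_ranking)
corr_reversed = rank_correlation(original, reversed_ranking)
```

A value near 1 means the explanation barely changed after randomization, which is the failure mode the sanity check is designed to expose; lower values indicate that the explanation depends on the model parameters.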

4.3 Qualitative Results

Image Classification.

To complement the quantitative measurements, we show several visualization results on MNIST and ImageNet in Figure 3 and Figure 4. More examples can be found in Appendix E and F. On MNIST, we observe that existing explanations tend to highlight mainly the white pixels in the digits; among these, SHAP and LOO show less noisy explanations compared to Grad and IG. On the other hand, the proposed Greedy-AS focuses on both the “crucial positive” (important white pixels) and the “pertinent negative” (important black regions) that together support the prediction. For example, in the first row, a 7 might have been predicted as a 4 or 0 if the pixels highlighted by Greedy-AS were set to white. Similarly, a 1 may be turned into a 4 or a 7 given additional white pixels to its left, and a 9 may become a 7 if the lower circular part of its head is deleted. From the results, we see that Greedy-AS focuses on “the region where perturbation on its current value will lead to easier prediction change”, which includes both the crucial positive and pertinent negative pixels. Such capability of Greedy-AS is also validated by its superior performance on the proposed robustness criteria, on which methods like LOO that highlight only the white strokes of digits show relatively low performance. From the visualized ImageNet examples shown in Figure 4, we observe that our method provides more compact explanations that focus mainly on the actual objects being classified. As opposed to methods that show noisy explanations, Greedy-AS could potentially provide more insights into the model prediction.

Text Classification.

In addition to image datasets, here we demonstrate how our explanation method can be applied to text classification models. In the experiments, we represent a length-$l$ sentence by $l$ embedding vectors, following the common setting. Thus, when applying our Greedy algorithm, at each iteration we try to add an embedding vector to the relevant set and choose the one with the largest reward. Since there are at most $l$ choices, the Greedy algorithm does not suffer much from noise and behaves similarly to Greedy-AS.

We perform experiments on an LSTM network which learns to classify a given sentence into one of ten classes (Society, Science, Health, ...). We showcase an example with explanations generated by different methods in Figure 5. We note that although the top-5 relevant keyword sets generated by the three methods do not vary much, the rankings within the highlighted keywords for each explanation are in fact quite different. We observe that our method Greedy tends to generate the explanation that matches human intuition the most. In particular, to predict the label “sport”, one might consider “cleats”, “football”, and “cut” as the strongest indications of the concept “sport”.

Figure 3: Visualization on top 20 percent relevant features provided by different explanations. We see Greedy-AS highlights both crucial positive and pertinent negative features supporting the prediction.
Figure 4: Visualization of different explanations on ImageNet, where the predicted class for each input is “fish”, “bird”, “dog”, and “sea lion”. Compared to other methods, Greedy-AS focuses more on the areas that are essential to correctly classify the image.
Figure 5: Explanations on a text classification model where the predicted label for this sentence is “sport”. Unlike other methods, the top-3 relevant keywords highlighted by Greedy are all closely related to the concept “sport”.
Figure 6: Visualization of targeted explanation. For each input, we highlight relevant regions explaining why the input is not predicted as the target class. We see the explanation changes in a semantically meaningful way as the target class changes.

Targeted Explanation Analysis.

Recall that in Section 2.2, we discussed the possibility of defining the robustness measurement by considering a targeted distortion distance as formulated in (2). Here, we provide examples, as shown in Figure 6, where we answer the question of “why the input digit is an A but not a B” by defining a targeted perturbation distance towards class B as our robustness measurement. In each row of the figure, we provide targeted explanations towards two different target classes for the same input image. Interestingly, as the target class changes, the generated explanation varies in an interpretable way. For example, in the first row, we explain why the input digit 7 is not classified as a 9 (middle column) or a 2 (rightmost column). The resulting explanation against 9 highlights the upper-left part of the 7. Semantically, this region is indeed pertinent to the classification between 7 and 9, since turning on the highlighted pixel values in the region (currently black in the original image) will make the 7 resemble a 9. However, the targeted explanation against 2 highlights a very different but also meaningful region, which is the lower-right part of the 7, since adding a horizontal stroke in that area would turn a 7 into a 2.

The capability of capturing pertinent negative features has also been observed in explanations proposed in some recent work [12, 3, 27]. However, these methods are subject to different constraints. For example, [12] is designed to handle binary inputs, which limits its application; in [3], the ability to capture pertinent negative features heavily depends on the input range; and for [27], unlike our targeted explanation where we know exactly which target class the explanation is suggesting against, the pertinent negative features highlighted by their method are by construction not directly related to a specific target class, making it harder for users to interpret the result. We provide more detailed discussions and comparisons to these methods in Appendix G.

5 Related Work

Our work proposes an objective measurement of feature-based explanations based on the “minimum adversarial perturbation” studied in the adversarial robustness literature, which is estimated by adversarial attacks. We provide a necessarily incomplete review of related work on objective measurements of explanations, on adversarial robustness, and on the intersection between the two.

Objective Measurements for Explanations.

Evaluation of explanations has been a difficult problem, mainly due to the absence of ground truth [2, 36]. Although one could rely on human intuition to assess the quality of the generated explanations [25, 13], for example by judging whether the explanation focuses on the object of interest in an image classification task, such evaluations are subject to human perception and are prone to favoring user-friendly explanations, such as attributions that visually align better with the input image, which might not reflect the model behavior [1]. As a result, in addition to subjective measurements, recent literature has also proposed objective measurements, which are also called functionally-grounded evaluations [13]. We roughly categorize existing objective measurements into two families.

The first family of explanation evaluation is called fidelity-based measurement. This includes Completeness or Sum to Delta, which requires the sum of attributions to equal the prediction difference between the original input and the baseline [36, 34]; sensitivity-n, which further generalizes completeness to any subset of the features [2]; local accuracy [30, 25]; and infidelity, which is a framework that encompasses several of the above measures [42]. The general philosophy of this line of methods is to require the sum of attribution values to faithfully reflect the change in the prediction function value given the presence or absence of a certain subset of features. The second family of explanation evaluation consists of removal-based and preservation-based measurements, which focus on identifying the most important set of features with respect to a particular prediction. The underlying assumption is that by removing the most (least) salient feature, the resulting function value should drop (increase) the most. [33] proposed this idea as a way to evaluate the ranking of feature attribution scores. Later on, [16] derived explanations by solving an optimization problem that directly optimizes this evaluation, and [10] proposed to learn the explanation-generating process by training an auxiliary model.
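The two families can be illustrated on a toy linear model, for which both criteria have a simple closed form. The sketch below is an illustration under stated assumptions rather than the protocol of any cited work: the model weights, the all-zero baseline, and the helpers `linear_attribution` and `removal_curve` are hypothetical names chosen for this example.

```python
import numpy as np

# Hypothetical linear model f(x) = w @ x + bias (illustrative values only).
w = np.array([2.0, -1.0, 0.5, 3.0])
bias = 0.1

def predict(x):
    return float(w @ x + bias)

def linear_attribution(x, baseline):
    """Attribution w_j * (x_j - baseline_j), which satisfies completeness
    exactly when the model is linear."""
    return w * (x - baseline)

def removal_curve(x, baseline, order):
    """Model output after replacing features by their baseline value one at
    a time, in the given order. Index 0 is the unmodified input."""
    current = x.copy()
    values = [predict(current)]
    for j in order:
        current[j] = baseline[j]
        values.append(predict(current))
    return np.array(values)

x = np.ones(4)
baseline = np.zeros(4)
attribution = linear_attribution(x, baseline)

# Fidelity-based check (completeness): attributions sum to f(x) - f(baseline).
completeness_gap = attribution.sum() - (predict(x) - predict(baseline))

# Removal-based check: removing the most salient features first should make
# the output drop faster than removing the least salient features first.
salient_first = removal_curve(x, baseline, np.argsort(-attribution))
salient_last = removal_curve(x, baseline, np.argsort(attribution))
print("completeness gap:", completeness_gap)
print("salient first:", salient_first)
print("salient last: ", salient_last)
```

Both curves end at the baseline prediction; what the removal-based criterion compares is how quickly each ordering gets there.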

Adversarial Robustness.

Adversarial robustness has been extensively studied in the past few years. The adversarial robustness of a machine learning model on a given sample can be defined as the shortest distance from the sample to the decision boundary, which corresponds to our definition in (1). Algorithms have been proposed for finding adversarial examples (feasible solutions of (1)), including [18, 6, 26]. However, those algorithms only work for neural networks, while for other models such as tree-based models or nearest neighbor classifiers, adversarial examples can be found by decision-based attacks [5, 9, 8]. Therefore, the proposed framework can also be used with other decision-based classifiers. On the other hand, several works aim to solve the neural network verification problem, which is equivalent to finding a lower bound of (1). Examples include [35, 40, 45]. In principle, our work can also apply these verification methods to obtain an approximate solution of (1), but in practice they are very slow to run and often give loose lower bounds on regularly trained networks.
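The relation between the exact minimum distortion and the estimates produced by attacks can be seen on a binary linear classifier, where the distance to the decision boundary is available in closed form. The following sketch is only an illustration: `boundary_search` is a simple hard-label line search written for this example and is not one of the cited attacks, and the weights and input are hypothetical.

```python
import numpy as np

def predict(w, b, x):
    """Hard-label output of a binary linear classifier."""
    return int(w @ x + b > 0)

def exact_min_distance(w, b, x):
    """Closed-form L2 distance from x to the hyperplane w @ x + b = 0."""
    return abs(w @ x + b) / np.linalg.norm(w)

def boundary_search(w, b, x, direction, max_radius=100.0, steps=50):
    """Binary search along one direction for the smallest radius that flips
    the label, using only hard-label queries. Any radius found this way is a
    feasible adversarial perturbation, hence an upper bound on the minimum."""
    direction = direction / np.linalg.norm(direction)
    label = predict(w, b, x)
    if predict(w, b, x + max_radius * direction) == label:
        return np.inf  # no label change found along this direction
    lo, hi = 0.0, max_radius
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if predict(w, b, x + mid * direction) == label:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical classifier and input (illustrative values only).
w = np.array([3.0, 4.0])
b = -5.0
x = np.array([3.0, 4.0])

exact = exact_min_distance(w, b, x)

# Upper bound from a handful of random search directions.
rng = np.random.default_rng(0)
directions = rng.normal(size=(20, 2))
upper = min(boundary_search(w, b, x, d) for d in directions)

# Searching along the normal direction recovers the exact distance.
along_normal = boundary_search(w, b, x, -w)
print(exact, upper, along_normal)
```

Every attack-style estimate is at least as large as the exact distance, and it becomes tight only when the search direction is chosen well. Verification methods approach the same quantity from below.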

Interpretability and Adversarial Robustness.

Our work is closely related to recent studies that bridge the gap between model interpretability and adversarial robustness. Xu et al. [41] add group sparsity regularization to adversarial attacks to enforce a semantic structure on the perturbation, which makes it more interpretable. Ribeiro et al. [31] find a set of features such that, once they are fixed, the probability of the prediction remains high when the other features are perturbed. Several recent works have also considered the question "For situation A, why was the outcome B and not C", which we call counterfactual explanations. Goyal et al. [19] show how one could change the input features such that the system would output a different class, where the change is limited to replacing a part of the input with a part of a distractor image. Dhurandhar et al. [12] consider the pertinent negative in a binary setting by optimizing a carefully designed loss function.

6 Conclusion

In this paper, we establish the link between a set of features and a prediction with a new evaluation criterion, robustness analysis, which measures the minimum tolerance to adversarial perturbation. Furthermore, we develop a new explanation method that finds the important set of features by optimizing this new criterion. Experimental results demonstrate that the proposed explanations indeed capture significant feature sets across multiple domains.

References
  • [1] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim (2018) Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9525–9536.
  • [2] M. Ancona, E. Ceolini, C. Oztireli, and M. Gross (2018) A unified view of gradient-based attribution methods for deep neural networks. International Conference on Learning Representations.
  • [3] S. Bach, A. Binder, G. Montavon, F. Klauschen, K. Müller, and W. Samek (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one 10 (7), pp. e0130140.
  • [4] J. F. Banzhaf III (1964) Weighted voting doesn’t work: a mathematical analysis. Rutgers L. Rev. 19, pp. 317.
  • [5] W. Brendel, J. Rauber, and M. Bethge (2017) Decision-based adversarial attacks: reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248.
  • [6] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57.
  • [7] C. Chang, E. Creager, A. Goldenberg, and D. Duvenaud (2019) Explaining image classifiers by counterfactual generation. In International Conference on Learning Representations.
  • [8] H. Chen, H. Zhang, D. Boning, and C. Hsieh (2019) Robust decision trees against adversarial examples. In ICML.
  • [9] M. Cheng, T. Le, P. Chen, J. Yi, H. Zhang, and C. Hsieh (2018) Query-efficient hard-label black-box attack: an optimization-based approach. arXiv preprint arXiv:1807.04457.
  • [10] P. Dabkowski and Y. Gal (2017) Real time image saliency for black box classifiers. In NIPS.
  • [11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
  • [12] A. Dhurandhar, P. Chen, R. Luss, C. Tu, P. Ting, K. Shanmugam, and P. Das (2018) Explanations based on the missing: towards contrastive explanations with pertinent negatives. In Advances in Neural Information Processing Systems, pp. 592–603.
  • [13] F. Doshi-Velez and B. Kim (2017) Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
  • [14] P. Dubey and L. S. Shapley (1979) Mathematical properties of the Banzhaf power index. Mathematics of Operations Research 4 (2), pp. 99–131.
  • [15] E. Elkind, P. Faliszewski, M. Lackner, D. Peters, and N. Talmon (2017) Committee scoring rules, Banzhaf values, and approximation algorithms. In 4th workshop on exploring beyond the worst case in computational social choice (EXPLORE’17).
  • [16] R. C. Fong and A. Vedaldi (2017) Interpretable explanations of black boxes by meaningful perturbation. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3449–3457.
  • [17] T. Gehr, M. Mirman, D. Drachsler-Cohen, P. Tsankov, S. Chaudhuri, and M. Vechev (2018) AI2: safety and robustness certification of neural networks with abstract interpretation. In 2018 IEEE Symposium on Security and Privacy (SP), pp. 3–18.
  • [18] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
  • [19] Y. Goyal, Z. Wu, J. Ernst, D. Batra, D. Parikh, and S. Lee (2019) Counterfactual visual explanations. In International Conference on Machine Learning, pp. 2376–2384.
  • [20] P. L. Hammer and R. Holzman (1992) Approximations of pseudo-boolean functions; applications to game theory. Zeitschrift für Operations Research 36 (1), pp. 3–21.
  • [21] G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer (2017) Reluplex: an efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pp. 97–117.
  • [22] P. W. Koh and P. Liang (2017) Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pp. 1885–1894.
  • [23] Y. LeCun, C. Cortes, and C. Burges (2010) MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2.
  • [24] J. Li, W. Monroe, and D. Jurafsky (2016) Understanding neural networks through representation erasure. CoRR abs/1612.08220.
  • [25] S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774.
  • [26] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
  • [27] J. Oramas, K. Wang, and T. Tuytelaars (2019) Visual explanation by interpretation: improving visual feedback capabilities of deep neural networks. In International Conference on Learning Representations.
  • [28] V. Petsiuk, A. Das, and K. Saenko (2018) RISE: randomized input sampling for explanation of black-box models. arXiv preprint arXiv:1806.07421.
  • [29] G. Plumb, D. Molitor, and A. S. Talwalkar (2018) Model agnostic supervised local explanations. In Advances in Neural Information Processing Systems, pp. 2515–2524.
  • [30] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) Why should I trust you?: explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144.
  • [31] M. T. Ribeiro, S. Singh, and C. Guestrin (2018) Anchors: high-precision model-agnostic explanations. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • [32] H. Salman, G. Yang, H. Zhang, C. Hsieh, and P. Zhang (2019) A convex relaxation barrier to tight robust verification of neural networks. arXiv preprint arXiv:1902.08722.
  • [33] W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K. Müller (2016) Evaluating the visualization of what a deep neural network has learned. IEEE transactions on neural networks and learning systems 28 (11), pp. 2660–2673.
  • [34] A. Shrikumar, P. Greenside, and A. Kundaje (2017) Learning important features through propagating activation differences. International Conference on Machine Learning.
  • [35] G. Singh, T. Gehr, M. Mirman, M. Püschel, and M. Vechev (2018) Fast and effective robustness certification. In Advances in Neural Information Processing Systems, pp. 10802–10813.
  • [36] M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In International Conference on Machine Learning.
  • [37] S. Wang, K. Pei, J. Whitehouse, J. Yang, and S. Jana (2018) Efficient formal safety analysis of neural networks. In Advances in Neural Information Processing Systems, pp. 6367–6377.
  • [38] T. Weng, H. Zhang, H. Chen, Z. Song, C. Hsieh, L. Daniel, D. Boning, and I. Dhillon (2018) Towards fast computation of certified robustness for ReLU networks. In International Conference on Machine Learning, pp. 5273–5282.
  • [39] T. Weng, H. Zhang, P. Chen, J. Yi, D. Su, Y. Gao, C. Hsieh, and L. Daniel (2018) Evaluating the robustness of neural networks: an extreme value theory approach. arXiv preprint arXiv:1801.10578.
  • [40] E. Wong and Z. Kolter (2018) Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp. 5283–5292.
  • [41] K. Xu, S. Liu, P. Zhao, P. Chen, H. Zhang, Q. Fan, D. Erdogmus, Y. Wang, and X. Lin (2018) Structured adversarial attack: towards general implementation and better interpretability. arXiv preprint arXiv:1808.01664.
  • [42] C. Yeh, C. Hsieh, A. S. Suggala, D. I. Inouye, and P. Ravikumar (2019) On the (in)fidelity and sensitivity for explanations. CoRR abs/1901.09392.
  • [43] C. Yeh, J. Kim, I. E. Yen, and P. K. Ravikumar (2018) Representer point selection for explaining deep neural networks. In Advances in Neural Information Processing Systems, pp. 9291–9301.
  • [44] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833.
  • [45] H. Zhang, T. Weng, P. Chen, C. Hsieh, and L. Daniel (2018) Efficient neural network robustness certification with general activation functions. In Advances in neural information processing systems, pp. 4939–4948.
  • [46] H. Zhang, P. Zhang, and C. Hsieh (2019) RecurJac: an efficient recursive algorithm for bounding Jacobian matrix of neural networks and its applications. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5757–5764.
  • [47] X. Zhang, J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification. In Advances in neural information processing systems, pp. 649–657.
  • [48] J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K. Chang (2017) Men also like shopping: reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2979–2989.