Evaluating Explainers via Perturbation

06/05/2019 · Minh N. Vu et al. · University of Florida, Naval Postgraduate School, New Jersey Institute of Technology

Due to the high complexity of many modern machine learning models, such as deep convolutional networks, understanding the cause of a model's prediction is critical. Many explainers have been designed to give us more insight into the decisions of complex classifiers. However, there is no common ground for evaluating the quality of different explanation methods. Motivated by the need for comprehensive evaluation, we introduce the c-Eval metric and the corresponding framework to quantify the quality of feature-based explainers of machine learning image classifiers. Given a prediction and the corresponding explanation of that prediction, c-Eval is the minimum-power perturbation that successfully alters the prediction while keeping the explanation's features unchanged. We also provide theoretical analysis linking the proposed metric to the portion of the predicted object covered by the explanation. Using a heuristic approach, we introduce the c-Eval plot, which not only displays a strong connection between c-Eval and explainers' quality, but also serves as a low-complexity approach to assessing explainers. We finally conduct extensive experiments on explainers on three different datasets to support the adoption of c-Eval in evaluating explainers' performance.


1 Introduction

With the pervasiveness of machine learning in many emerging domains, especially in critical applications such as healthcare and autonomous systems, it is of utmost importance to understand why a machine learning model makes a given prediction. For example, deep convolutional neural networks have been able to classify skin cancer at a level of competence comparable to that of dermatologists Esteva2017 . However, doctors cannot act upon these predictions blindly. Providing additional intelligible explanations, such as a highlighted skin region that contributes to the prediction, will aid doctors significantly in making their diagnoses. Along this direction, several machine learning explainers that support users in interpreting the predictions of complex models have been studied, such as SHAP Scott2017 , LIME Marco2016 , Grad-CAM Ramprasaath2016 , and DeepLIFT Avanti2017 , among others Bach2015 ; Springenberg2014 ; Simonyan2013 ; Daniel2017 ; Mukund2017 ; Robnik2008 ; Strumbelj2009 ; Martens2014 .

Despite this very recent development of machine learning explainers, none of them has a theoretical guarantee on the explanation's quality. These explainers have only been evaluated through small sets of human-based experiments, which do not imply any global guarantee on an explainer's performance Scott2017 ; Avanti2017 . In another effort to evaluate explainers, Shrikumar et al. use the log-odds score to measure the difference between the original instance and a modified image whose vital pixels are erased Avanti2017 . The log-odds method is only applicable to small gray-scale images such as MNIST lecun2010 , and there is no theoretical guarantee or rigorous analysis for the method. Given this lack of comprehensive studies on the quality of explanations, there is a clear need for standard tools and methods to evaluate machine learning model explainers Lipton2016IML .

However, theoretically evaluating explainers remains a daunting task Lipton2016IML ; Kim2015 for the following reasons. First of all, model explainers are, in general, very diverse. Figure 1 shows an example of three explainers for the prediction “Pembroke” provided by the Inception-v3 classifier Inception2015 . All of them highlight the region containing the Pembroke; however, their presentations vary, from picture segments in LIME Marco2016 and heat-maps in Grad-CAM (GCam) Ramprasaath2016 to pixel importance weights in SHAP Scott2017 . In addition to the variety in presentation, explainers might be designed for different objectives. There is a fundamental trade-off between the intelligibility and the accuracy of the represented features Marco2016 ; Scott2017 : an explanation may be easy for end-users to interpret, but a certain degree of consistency with the original model might be lost. This diversity in presentation and objectives constitutes a great challenge in evaluating different explainers.

(a) Original
(b) LIME
(c) GCam
(d) SHAP
Figure 1: Different feature-based explanations of the prediction "Pembroke".
LIME Perturbation Difference
(a) LIME mask perturbation
GCam Perturbation Difference
(b) Grad-CAM mask perturbation
Figure 2: Perturbations are generated by perturbing pixels outside of the LIME/GCam explanation region. The powers of the Difference images are upper bounds on c-Eval.

In this research, we focus on evaluating local explainers, which are used to interpret individual predictions of black-box machine learning models. We introduce a novel metric, c-Eval, to evaluate the quality of feature-based local explanations. We exploit the intuition that a feature-based local explanation has high quality only if it is difficult to change the prediction while the explanation features are kept intact. The quality of an explanation is quantified by the minimum amount of perturbation on the features outside the explanation region needed to alter the prediction. We further provide theoretical derivations showing that the portion of the predicted object that must be captured by an explanation is an increasing function of c-Eval. We also develop a low-complexity approach to evaluating explainers based on c-Eval and conduct extensive experiments on different explainers.

We demonstrate the above concepts of c-Eval via a toy example shown in Fig. 2. We first cut out the most important segments returned by two explainers of the Pembroke image. After that, we generate the perturbed instances while keeping the explanation regions unchanged; these non-perturbed regions are the black areas in the Difference images. The c-Eval is computed using the norm of the difference between the perturbed instance and the original picture. In this example, the c-Eval of LIME is larger than that of GCam, i.e., the amount of perturbation required to change the prediction while fixing the explanation region of LIME is greater. We claim that LIME includes features explaining this prediction better than GCam does. By simply observing Fig. 2, it is indeed clear that the LIME explanation is better in this case.
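To make the toy computation concrete, the following is a minimal sketch (not the authors' implementation) of the procedure in Fig. 2: an iterative gradient-sign perturbation is applied only to the pixels outside the explanation region, and the power of the resulting Difference image is reported as an upper bound on c-Eval. The names model (a PyTorch classifier returning logits), image (a 1×C×H×W tensor in [0, 1]), and keep_mask (1 on explanation pixels, 0 elsewhere) are assumed placeholders.

```python
import torch
import torch.nn.functional as F

def masked_perturbation_power(model, image, keep_mask, step=0.01, max_iters=200):
    """Perturb only pixels with keep_mask == 0 until the predicted label flips;
    return the L2 norm of the resulting Difference image (an upper bound on c-Eval).
    If the label never flips within max_iters, the returned value is not a valid bound."""
    model.eval()
    with torch.no_grad():
        orig_label = model(image).argmax(dim=1)
    x = image.clone()
    for _ in range(max_iters):
        x = x.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), orig_label)
        loss.backward()
        # gradient-sign step applied only outside the explanation region
        x = (x + step * x.grad.sign() * (1 - keep_mask)).clamp(0.0, 1.0)
        with torch.no_grad():
            if model(x).argmax(dim=1).item() != orig_label.item():
                break
    return torch.norm((x - image).flatten(), p=2).item()
```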

Our contributions in this research can be summarized as follows:

  1. We introduce c-Eval and the corresponding framework to evaluate the quality of any feature-based image explainer.

  2. We provide a theoretical object coverage guarantee based on c-Eval: a higher c-Eval implies that a larger portion of the predicted object is guaranteed to be included in the explanation.

  3. We develop a low-complexity approach using the c-Eval plot to evaluate and visualize explainers' quality. Assuming the existence of an unknown ideal scoring function that gives better explainers higher scores, we provide heuristic experiments showing that the c-Eval plot possesses many desirable properties of that ideal scoring function.

  4. We conduct extensive experiments with c-Eval on 8 explainers, using different classifier models and perturbation schemes, to validate the usage of c-Eval in evaluating explainers.

The rest of the paper is organized as follows. Section 2 introduces the notation and formulates c-Eval, the unified metric to evaluate feature-based explainers. In Section 3, we establish the object coverage guarantee of explainers based on the computed c-Eval. Because the object coverage guarantee is expensive to compute, we propose the c-Eval plot, a low-complexity approach to evaluating explainers, in Section 4. Section 5 provides numerical results to support the validity of c-Eval. Finally, Section 6 concludes the paper with a discussion of future directions.

2 Unified evaluation of explainers: c-Eval

In this section we provide the detailed formulation of c-Eval. We first specify the classifiers, the explainers, and the perturbation schemes.

Given a classifier f, we denote the set of input features of an instance by x = {x_1, …, x_d} and the corresponding predicted label by y; the final output label is y = f(x). A local explanation of classifier f on x is e_x ⊆ x, the set of features explaining the prediction f(x). The function E mapping x to e_x is called the explainer. We denote by g a perturbation scheme that perturbs all features of x outside of the explanation e_x into a perturbed instance x' as follows:

(1) x' = g(x, e_x), with x'_i = x_i for all x_i ∈ e_x.

A successful perturbation scheme on explainer E of instance x under power constraint c is a perturbation scheme g such that

(2) ‖x' − x‖ ≤ c and f(x') ≠ f(x),

where ‖x' − x‖ is the power of the perturbation. Condition (2) implies that the perturbation satisfies the power constraint c and alters the original prediction of the model on instance x. At this point, we are ready to state the definition of c-Eval.

Definition 1.

An explainer E (or an explanation e_x) on instance x of classifier f is c-Eval at x if no perturbation scheme g satisfying condition (2) successfully changes the original label of instance x. We say that an explainer E (or explanation e_x) is c-Eval if E (or e_x) is c-Eval at every instance x.

Intuitively, a good explainer is an explainer with a high c-Eval. For simplicity, we denote by c*_x(e_x) the maximum power c such that e_x is c-Eval at x. It is worth mentioning that when the explainer returns nothing (e_x = ∅), c*_x(∅) is the minimum power needed to perturb the picture so as to change the original prediction. When the explainer outputs all pixels of the picture, there is no way to successfully perturb the picture, i.e., c*_x(e_x) = ∞.

Unfortunately, finding the optimal (minimum-power) perturbation scheme for all instances is intractable. To cope with this issue, given a class of perturbation schemes G, we extend our definition of c-Eval to “c-Eval with respect to G” as follows.

Definition 2.

An explainer E (or an explanation e_x) on instance x of classifier f is c-Eval with respect to a class of perturbation schemes G if there is no perturbation scheme g ∈ G satisfying condition (2) that successfully changes the label of instance x.

Perturbation schemes. Throughout this work, the considered classes of perturbations are the conventional Gradient-Sign-Attack (GSA) Goodfellow2015 and the Iterative-Gradient-Attack (IGA) implemented by Foolbox foolbox2017 . We adopt these perturbation schemes because of their direct connection with the norm constraint in (2). Unless stated otherwise, the c-Eval computations in this work are carried out with the Gradient-Sign-Attack due to its low complexity.
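As an illustration of Definition 2, the sketch below estimates the largest c for which an explanation is c-Eval with respect to a single-step gradient-sign attack by bisecting on the attack budget. It is only a sketch under assumed names (model, image, keep_mask); the experiments in this paper instead rely on the Foolbox implementations of GSA and IGA.

```python
import torch
import torch.nn.functional as F

def gsa_flips_label(model, image, keep_mask, eps):
    """One masked gradient-sign step with budget eps; True if the prediction changes."""
    x = image.clone().requires_grad_(True)
    orig = model(image).argmax(dim=1)
    F.cross_entropy(model(x), orig).backward()
    x_adv = (image + eps * x.grad.sign() * (1 - keep_mask)).clamp(0.0, 1.0)
    return model(x_adv).argmax(dim=1).item() != orig.item()

def c_eval_wrt_gsa(model, image, keep_mask, hi=1.0, tol=1e-3):
    """Bisect on the GSA budget and return the L2 power of the smallest successful
    attack found, or infinity if even the largest budget cannot change the label."""
    if not gsa_flips_label(model, image, keep_mask, hi):
        return float("inf")
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gsa_flips_label(model, image, keep_mask, mid):
            hi = mid
        else:
            lo = mid
    # recompute the perturbation at the found budget and report its power
    x = image.clone().requires_grad_(True)
    orig = model(image).argmax(dim=1)
    F.cross_entropy(model(x), orig).backward()
    x_adv = (image + hi * x.grad.sign() * (1 - keep_mask)).clamp(0.0, 1.0)
    return torch.norm((x_adv - image).flatten(), p=2).item()
```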

3 Object coverage guarantee of explainers

To address the quality of feature-based local explainers for image classifiers, we introduce the so-called object coverage of explainers. The intuition is that when a classifier predicts a label for an image of an object, one of the most essential elements constituting that prediction is the existence of the object in the picture. Consequently, a good explanation for that prediction must capture the object or some key features of that object. This observation suggests a strong correlation between the quality of an explanation and the portion of the predicted object captured by the explanation. In the remainder of this section, we formalize this intuition, establish a connection between c-Eval and the object coverage of an explainer, and finally obtain the key Theorem 2 on the object coverage guarantee of an explainer.

Given a picture x of a single object o on a background, the coverage of a feature-based explanation e_x of the correct prediction is the portion of the area of the object captured in the explanation: r(e_x) = |e_x ∩ o| / |o|. Since we only consider pictures with a single object, we write r instead of r(e_x) for simplicity. As the prediction on x is the object o, we claim that explanation e_x^1 is better than e_x^2 if r(e_x^1) ≥ r(e_x^2) and |e_x^1| ≤ |e_x^2|, where |e_x| defines how much of the original picture is returned as the explanation.
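To make the definition concrete, a minimal sketch of the coverage computation from boolean masks could look as follows (the helper name and the mask representation are illustrative assumptions, not the paper's code):

```python
import numpy as np

def object_coverage(explanation_mask: np.ndarray, object_mask: np.ndarray) -> float:
    """r(e_x) = |e_x ∩ o| / |o| for boolean HxW masks of the explanation and the object."""
    object_area = object_mask.sum()
    if object_area == 0:
        return 0.0
    return float(np.logical_and(explanation_mask, object_mask).sum() / object_area)
```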

Based on the object coverage, in the next steps we establish the connection between c-Eval and the object coverage of an explainer, which implies the interdependence between c-Eval and explainers' quality. To obtain such a guarantee, we first bound the minimum power needed to generate a successful perturbation of the original picture. This bound is obtained by adding the minimum power needed to generate a successful perturbation of the picture of the object on a removed (transparent) background and the power needed to change the original prediction into the removed-background prediction. Then, the c-Eval versus object coverage curve of the removed-background instance provides a guarantee on the object coverage of explainers.

Let us consider the picture instance x_o of object o on a removed background (Fig. 3(b)). We assume the classifier predicts x correctly, i.e., f(x) is the label of object o. Note that, even though x and x_o contain the same object, f(x_o) is different from f(x) in general. The minimum power to successfully perturb x is upper bounded by the power needed to change the prediction on x into the prediction on x_o plus the power needed to successfully perturb x_o. Specifically, we have:

(3) c*_x(e_x) ≤ c*_{x_o}(e_x) + Δ(x, x_o),

where c*_x(e_x) (or c*_{x_o}(e_x)) is defined as the minimum power needed to successfully perturb instance x (or x_o) without changing the elements of x (or x_o) in e_x, and Δ(x, x_o) is the power of the perturbation necessary to change the prediction from f(x) to f(x_o). The demonstration of the special case where f(x) = f(x_o), i.e., Δ(x, x_o) = 0, is shown in Fig. 3.

(a) Original instance
(b) Removed background instance
Figure 3: Perturbation of the original picture x and of the removed-background instance x_o: if the predictions on the original and on the removed-background instance are the same, i.e., f(x) = f(x_o), the power needed to perturb the original picture is upper bounded by the power needed to perturb the removed-background picture, since the perturbable region of the latter is a subset of that region in the former.

Let us consider the term c*_{x_o}(e_x) on the right-hand side of (3). Since x_o has an empty background, it contains no elements outside the object o, and we have c*_{x_o}(e_x) = c*_{x_o}(e_x ∩ o). Combining this with the facts that e_x ∩ o ⊆ o and |e_x ∩ o| = r(e_x)·|o|, we obtain the following bound:

(4) c*_{x_o}(e_x) ≤ max { c*_{x_o}(e) : e ⊆ o, |e| = r(e_x)·|o| }.

We denote the following coverage-to-c-Eval function:

(5) C_{x_o}(r) = max { c*_{x_o}(e) : e ⊆ o, |e| = r·|o| }.

Intuitively, the subset e that attains C_{x_o}(r) consists of the elements of object o that, when kept unchanged, increase the required power to successfully perturb x_o the most. Additionally, r can be interpreted as the coverage ratio of object o. In the next theorem, we show that C_{x_o}(r) is a monotone non-decreasing function of r.

Theorem 1.

Given a classifier f and an instance x_o with an empty background, C_{x_o}(r) is a monotone non-decreasing function of r.

Proof.

We consider r_1 ≤ r_2 and denote by e_1 and e_2 the solutions of (5) when r equals r_1 and r_2, respectively. For any subset e of o such that e_1 ⊆ e and |e| = r_2·|o|, we have

(6) C_{x_o}(r_1) = c*_{x_o}(e_1) ≤ c*_{x_o}(e) ≤ C_{x_o}(r_2),

where the first inequality in (6) holds because the minimum power to successfully perturb x_o outside e is at least the minimum power to successfully perturb x_o outside e_1 (as e_1 ⊆ e), and the last inequality holds because C_{x_o}(r_2) is the maximum of c*_{x_o} over all subsets of o of size r_2·|o|. ∎

Theorem 1 allows us to define the (generalized) inverse function of C_{x_o}:

(7) C^{-1}_{x_o}(c) = min { r ∈ [0, 1] : C_{x_o}(r) ≥ c }.

The inverse function computes the guaranteed coverage ratio of object o for any explanation on x_o that results in a c-Eval of c. From the definition of C^{-1}_{x_o} and the monotone property of C_{x_o}, it is clear that C^{-1}_{x_o} is monotone non-decreasing. To analyze the coverage of e_x, we apply the inverse c-Eval-to-coverage-ratio function C^{-1}_{x_o} to (3):

(8) c*_x(e_x) − Δ(x, x_o) ≤ c*_{x_o}(e_x) ≤ C_{x_o}(r(e_x))
(9) r(e_x) ≥ C^{-1}_{x_o}( c*_x(e_x) − Δ(x, x_o) )

where the second inequality in (8) follows from (4) and the definition of C_{x_o} in (5), and (9) follows by applying the inverse function to both sides. This final result gives us the object coverage guarantee theorem based on c-Eval:

Figure 4: c-Eval versus object coverage of explainers and the C_{x_o}(r) curve. For small r, the explainers stay below the curve, and c-Eval establishes the object coverage guarantee of explainers.
Theorem 2.

For any explanation e_x, it must cover at least a C^{-1}_{x_o}( c*_x(e_x) − Δ(x, x_o) ) portion of the predicted object o in the image x.

To demonstrate Theorem 2, we construct the function C_{x_o}(r) for an image instance in Fig. 4. First, we extract the object from the original picture to obtain the removed-background instance. For each coverage ratio, we search for the optimal mask on the object such that, by keeping all pixels in the mask unchanged, the minimum power needed to obtain a successful perturbation is maximized. The exact computation of this mask requires perturbations over all subsets of features, which is infeasible for most image instances. To reduce the complexity, we segment the picture into super-pixels and return the solution as a subset of super-pixels. The search for this subset is conducted by randomization and greedy selection. For each object coverage ratio, the c-Eval of the corresponding mask is computed and plotted in log scale as shown in Fig. 4. We also generate the LIME and SHAP explanations on this sample picture with different numbers of explaining features. After obtaining the perturbations for these explanations, we compute their corresponding object coverage ratios and plot the results. Since a SHAP explanation takes the form of pixel weights instead of segments, we generate a segment representation of SHAP by selecting the segments with the maximum sums of weights. The detailed algorithms for the object coverage curve are discussed in Appendix A.

From Fig. 4, Theorem 2 lets us claim that any explanation whose c-Eval exceeds a given value must cover at least the corresponding guaranteed portion of the predicted object in the original picture x. In fact, at their measured c-Eval values, the LIME and SHAP explanations in this case cover larger portions of the predicted object than the guarantee requires.

4 c-Eval plot: a low-complexity indicator for explanation quality

As seen in the previous section, c-Eval implies that the corresponding explanation must cover a certain portion of the predicted object in the examined instance; however, this theoretical coverage bound can be computationally challenging to obtain. When analyzing predictions on average-size images, the number of features might be too large for us to find the optimal mask in (5). To overcome this challenge, we provide in this section another, heuristic approach, called the c-Eval plot, to evaluate the quality of explainers. Given an explainer and a prediction, we vary the number of features returned by the explainer, compute the explanation, and determine the corresponding c-Eval. In the following, we demonstrate that plotting the obtained c-Eval values and examining the curve helps us assess the explainers' quality.

Our first observation is that most modern feature-based explainers assign an importance weight to each feature before outputting the final explanation Marco2016 ; Ramprasaath2016 ; Avanti2017 ; Scott2017 . In other words, when an explainer provides an explanation containing k features, those are the k features with the highest importance weights. For LIME Marco2016 , the user can choose k to adjust the compactness of the explanation. SHAP Scott2017 and GCam Ramprasaath2016 , on the other hand, provide a complete importance map over all pixels of the examined picture. To unify these representations of different explainers, we denote the collection of explanations produced by an explainer on instance x as {e^k_x}, where k indicates how many features are included in the explanation. For simplicity, we assume that all features have distinct importance weights. Thus, we obtain e^k_x ⊂ e^{k+1}_x for all k.

Assume we have an ideal scoring machine that evaluates the explanation quality for each budget k. For example, when k = 1, each explainer is expected to provide an answer as a single picture segment. The scoring machine then gives a score to explanation e^1_x: the segment receives a high score if the machine thinks it correctly explains why the classifier made the corresponding prediction, and a lower score if it is less relevant to the prediction. For k = 0 and for k equal to the total number of features n, the scores of all explainers are equal since their outputs are the same. Thus, a good explainer is one with high scores for k between 0 and n. We assume that, for each explainer, the score of e^k_x is no greater than that of e^{k'}_x whenever k ≤ k', since the more segments allowed, the more information the explanation can contain. If we plot the score as a function of k for two explainers, their starting points and ending points coincide, and the curve of the better explainer lies on top of the other.

(a) c-Eval of segment features.
(b) c-Eval of pixel features.
Figure 5: c-Eval of random selection, LIME with different sampling rates, and SHAP. The random selection and LIME-100 are expected to be worse than the others, and the c-Eval plot reflects that expectation.

In the following, we heuristically show that c-Eval behaves similarly to the score given by the ideal scoring machine. Note that any successful perturbation that keeps e^{k+1}_x unchanged is also a successful perturbation that keeps any subset of e^{k+1}_x, in particular e^k_x, unchanged. As a result, given a collection {e^k_x}, from the definition of c-Eval we have c*_x(e^k_x) ≤ c*_x(e^{k+1}_x) for all k. In addition, c*_x(e^0_x) has the same value for all explainers, since it is the minimum power to successfully perturb x without any restriction (e^0_x = ∅). Furthermore, it is straightforward to set c*_x(e^n_x) = ∞ for all explainers, because if all features are kept unchanged there is no way to alter the classification of the picture. To sum up, the plotted sequence {c*_x(e^k_x)} is a positive non-decreasing sequence starting at c*_x(∅) and approaching infinity as k approaches n.
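A sketch of how such a c-Eval plot could be produced is given below; the importance map, the super-pixel labels, and the power_of_mask routine (which could be built from the masked-perturbation sketch in the introduction) are all assumed inputs rather than the authors' implementation.

```python
import numpy as np

def c_eval_plot(importance, segment_labels, power_of_mask, max_k=20):
    """Rank segments by summed pixel importance and compute the c-Eval of the
    top-k explanation for k = 1..max_k. power_of_mask(keep_mask) is assumed to
    return the minimum successful perturbation power for a given keep-mask."""
    seg_ids = np.unique(segment_labels)
    scores = np.array([importance[segment_labels == s].sum() for s in seg_ids])
    ranked = seg_ids[np.argsort(scores)[::-1]]
    curve = []
    for k in range(1, max_k + 1):
        keep_mask = np.isin(segment_labels, ranked[:k]).astype(np.float32)
        curve.append(power_of_mask(keep_mask))  # c-Eval of the top-k explanation
    return curve  # plotted in log scale, as in Fig. 5
```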

Fig. 5(a) plots these sequences for a random explanation, the SHAP explanation, and LIME explanations with different numbers of samplings. We segment the studied picture instance into super-pixel segments. For the random explanation, the segments are selected randomly; since e^k_x ⊂ e^{k+1}_x does not hold for random selection, its curve is not monotonic. For SHAP, we select the segments having the largest sums of pixel weights. In LIME, the sampling size determines how many perturbations are conducted in finding the explanation; the higher the number, the better the explanation Marco2016 . Here, we set the LIME sampling sizes to 100, 1000 and 2000 samples, and denote the corresponding curves as LIME-100, LIME-1000 and LIME-2000. The ideal scores of LIME-100 and the random selection are expected to be smaller than those of the other three. The figure shows a distinct gap in c-Eval between the "good" and "bad" explainers. This result implies that we can use the c-Eval plot to evaluate explainers' quality. We also conduct the same experiment with pixel-wise features in Fig. 5(b) for an image in the MNIST dataset lecun2010 . The obtained result also supports the applicability of c-Eval in evaluating explainers.

The computation of the c-Eval plot is much more efficient than that of the object coverage guarantee C_{x_o}(r). To obtain the exact C_{x_o}(r) curve, we not only require the object-only image but also need to determine the c-Eval of all subsets of object segments. On the other hand, we only need to compute at most n values of c-Eval to obtain the c-Eval plot.

5 Simulation results

In this section, we provide experimental results for c-Eval on small gray-scale handwriting images in MNIST lecun2010 and large color object images in Caltech101 FeiFei2004 . To demonstrate the statistical behavior of c-Eval over a large number of samples, the reported c-Eval is not c*_x(e_x) itself but the ratio of c*_x(e_x) over the power needed to perturb the image with an empty mask, c*_x(∅). This ratio is also indicated by the notation CC in the legend of each figure. Additionally, since our aim is to examine the usage of c-Eval in evaluating explainers' quality, we need ground-truth rankings of the explainers. These ground truths are obtained from previous results assessing explainers' performance with human-based experiments Scott2017 ; Avanti2017 . The studied classifier models and explainers are selected based on those previous experiments. We provide further discussions and detailed implementations of the experimental results in Appendix B, which also includes experiments on the CIFAR10 dataset Krizhevsky2009 .
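For concreteness, the normalization just described can be sketched as follows, reusing the hypothetical masked_perturbation_power helper from the introduction (this is an illustration, not the authors' code):

```python
import torch

def normalized_c_eval(model, image, keep_mask):
    # ratio of the explanation's c-Eval over the power needed with an empty keep-mask
    p_expl = masked_perturbation_power(model, image, keep_mask)
    p_empty = masked_perturbation_power(model, image, torch.zeros_like(keep_mask))
    return p_expl / p_empty
```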

5.1 Simulations on MNIST dataset

In the experiments with the MNIST dataset lecun2010 , we study 8 different image explainers: LIME Marco2016 , SHAP Scott2017 , GCam Ramprasaath2016 , DeepLIFT (DEEP) Avanti2017 , Integrated Gradients Sundararajan2016 , Layerwise Relevance Propagation (LRP) Bach2015 , Guided-Backpropagation (GB) Springenberg2014 and Simonyan-Gradient (Grad) Simonyan2013 .

LIME approximates the importance of each picture segment with a heuristic linear function. The approach in LIME is a specific case of SHAP, which relies on the theoretical analysis of the Shapley value in game theory. The SHAP explainer assigns each pixel a score indicating the importance of that pixel to the classifier's output. Since SHAP is a generalized version of LIME, we expect SHAP explanations to be more consistent with the classifier than LIME's; hence SHAP's c-Evals are expected to be statistically higher. The authors of Scott2017 also provide human-based experiments supporting this claim.

DeepLIFT, Integrated Gradients, LRP, GB and Grad are backward-propagation methods that evaluate the importance of each input neuron to the final output neurons of the examined classifier. Previous experimental results using the log-odds function in Avanti2017 suggest that GB and Grad perform worse than the other three. The final studied explainer, GCam, is an image explainer exploiting the last convolution layer to explain the prediction. Since GCam is not designed for low-resolution images, we expect its explanation quality and the corresponding c-Eval on the MNIST dataset to be relatively low.

(a) GSA on classifier 1.
(b) GSA on classifier 2.
(c) IGA on classifier 1.
(d) IGA on classifier 2.
Figure 6: Experiments for the explainers on images of the MNIST dataset. The figure shows the distributions and the averages of c-Eval for the explainers on classifier 1, provided by Scott2017 , and on classifier 2, provided by Avanti2017 .

Experiments on different models: Figs. 6(a) and 6(b) show the distributions of c-Eval over images in the MNIST dataset on classifier 1, provided by Scott2017 , and classifier 2, provided by Avanti2017 . The green lines are the mean values of c-Eval for each explainer. We can see that the behavior of c-Eval is consistent with the explainers' expected performance. The notations I5 and I10 indicate the Integrated-Gradients method with 5 and 10 interpolations Sundararajan2016 . The result is also consistent with previous attempts at evaluating explainers in Scott2017 and Avanti2017 . For the consistency between the behavior of c-Eval and the log-odds function, please see the discussion in Appendix C.

Experiments on different perturbation schemes: Figs. 6(c) and 6(d) demonstrate the usage of the Iterative-Gradient-Attack instead of the Gradient-Sign-Attack foolbox2017 used in the experiments of Figs. 6(a) and 6(b). Comparing the distributions in Fig. 6(c) to Fig. 6(a) and Fig. 6(d) to Fig. 6(b), we observe that the relative c-Eval values of the explainers are similar under both perturbation schemes. Thus, the computed c-Evals still reflect the explainers' performance. The problem of finding optimal perturbation schemes that yield the best measurement of c-Eval is not considered in this work; however, the experiments suggest that we can use a non-optimal perturbation scheme and still obtain a reasonable measurement of c-Eval.

5.2 Simulations on Caltech101 dataset

(a) GSA.
(b) IGA.
Figure 7: Distributions of c-Eval over images in the Caltech101 dataset for 4 explainers.

For experiments on large images, we study the performance of LIME Marco2016 , SHAP Scott2017 , GCam Ramprasaath2016 and DeepLIFT Avanti2017 on 700 images in the Caltech101 dataset FeiFei2004 with the VGG19 classifier Simonyan2014 . As the first three explainers are designed for medium-size to large-size images, we expect them to outperform DeepLIFT. Furthermore, as discussed above, the results from Scott2017 imply that SHAP should perform better than LIME. For the improvement of GCam and the degradation of DeepLIFT from the MNIST and CIFAR10 datasets to the Caltech101 dataset, we suggest that readers focus on the change in the quality of the explainers from Figs. 9 and 10 to Fig. 11 in Appendix B. The experimental results of c-Eval in Fig. 7(a), using the Gradient-Sign-Attack, and Fig. 7(b), using the Iterative-Gradient-Attack, agree with our expectations on the explainers' performance. We also include some examples of explanations and their corresponding c-Eval in Appendix B to validate the correlation of c-Eval with the explainers' quality.

6 Conclusions

Throughout this research, we introduce c-Eval, establish two methods to evaluate feature-based explainers using c-Eval, and conduct extensive experiments on the proposed metric. Nevertheless, there still exist many open questions about the metric. For example, the discrepancy in the mean values and the distributions of Figs. 7(a) and 7(b) suggests that c-Eval might be sensitive to the perturbation scheme. One question of interest is which perturbation scheme would give us the best measurement of c-Eval. Additionally, the distributions of c-Eval in Fig. 6 suggest that there is a fundamental difference between the quality of black-box explainers (SHAP, LIME and GCam) and back-propagation explainers (DEEP, Integrated Gradients, LRP, GB and Grad), which was unclear prior to this work. c-Eval offers a clear quantification that might shed light on many unanswered questions behind machine learning explainers.

References

Appendix A Experiment on object coverage guarantee bound

In this Appendix, we provide the detailed algorithms for the object coverage guarantee in Section 3.

Input: Classifier f, object-only image x_o and perturbation scheme g.
Parameter: Number of segments n.
Output: Sequence of object coverage ratios r and the corresponding C_{x_o}(r).
1 Segment x_o into n segments; filter out the background of x_o and obtain the object o; let S be the set of segment indices containing the object; set k = 1.
2 while a successful perturbation exists do
3     e_k ← the subset of S with |e_k| = k that maximizes c*_{x_o}(e_k), found by trying all subsets;
4     record r = k/|S| and C_{x_o}(r) = c*_{x_o}(e_k); set k = k + 1.
5 end while
6 return the sequences of r and C_{x_o}(r)
Algorithm 1 Exact computation of C_{x_o}(r)

Algorithm 1 shows the procedure to exactly compute C_{x_o}(r). First, we segment the examined image into n segments. Then, we collect all segments containing the predicted object and put them into S. For each number of explained features k, we find the optimal subset e_k of S maximizing the c-Eval under the constraint |e_k| = k. For each candidate subset, we find a perturbation that successfully changes the classifier's prediction while keeping e_k unchanged; this computation requires trying all subsets of S of size k. After that, we calculate the object coverage ratio and continue the computation until the perturbation scheme cannot find any successful perturbation, i.e., until C_{x_o}(r) = ∞.
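A brute-force sketch of this exact search, feasible only for very small numbers of segments, might look as follows; c_eval_of is an assumed callable that returns the c-Eval when the given subset of segments is kept unchanged (and infinity when no successful perturbation exists):

```python
from itertools import combinations
import numpy as np

def exact_coverage_curve(object_segments, c_eval_of):
    ratios, values = [], []
    for k in range(1, len(object_segments) + 1):
        # try every subset of size k and keep the one maximizing c-Eval
        best = max(combinations(object_segments, k), key=c_eval_of)
        if np.isinf(c_eval_of(best)):
            break  # keeping this subset blocks every successful perturbation
        ratios.append(k / len(object_segments))
        values.append(c_eval_of(best))
    return ratios, values
```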

Input: Classifier f, object-only image x_o and perturbation scheme g.
Parameter: Number of segments n and number of trials T.
Output: Sequence of object coverage ratios r and the approximation of C_{x_o}(r).
1 Segment x_o into n segments; filter out the background of x_o and obtain the object o; let S be the set of segment indices containing the object; set k = 1.
2 while a successful perturbation exists do
3     randomly select T subsets of S of size k to form the collection Q_k; add to Q_k the greedy extension of the previous solution e_{k-1} by one segment;
4     e_k ← the member of Q_k with the largest c*_{x_o}; record r = k/|S| and the approximate C_{x_o}(r) = c*_{x_o}(e_k); set k = k + 1.
5 end while
6 return the sequences of r and the approximate C_{x_o}(r)
Algorithm 2 Approximation of C_{x_o}(r)

For the experiments in Fig. 4, where the number of image segments is relatively large, it is expensive to compute the c-Eval of all subsets of S. As a result, we make a slight modification to the optimization step of Algorithm 1 and obtain Algorithm 2. For each k, we first randomly select T subsets of image segments, add them to the collection Q_k, compute the corresponding c-Evals, and obtain the best candidate among them. We also conduct a greedy selection, adding one image segment to the previous solution e_{k-1} to obtain another candidate. The solution e_k for iteration k is the better of the two. Specifically, for the experiment in Fig. 4, we use the Inception-v3 classifier Inception2015 and the Gradient-Sign-Attack perturbation scheme Goodfellow2015 .
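A loose sketch of the randomized-plus-greedy search in Algorithm 2 is given below; the names and the exact selection rules are assumptions, and c_eval_of is the same hypothetical callable as in the sketch above:

```python
import random
import numpy as np

def approx_coverage_curve(object_segments, c_eval_of, n_trials=50):
    """Randomized search with a greedy extension of the previous best subset."""
    ratios, values, best_prev = [], [], []
    for k in range(1, len(object_segments) + 1):
        # random candidates of size k
        candidates = [random.sample(object_segments, k) for _ in range(n_trials)]
        remaining = [s for s in object_segments if s not in best_prev]
        if best_prev and remaining:
            # greedy candidate: add the single best extra segment to the previous solution
            greedy = max(remaining, key=lambda s: c_eval_of(best_prev + [s]))
            candidates.append(best_prev + [greedy])
        best = max(candidates, key=c_eval_of)
        if np.isinf(c_eval_of(best)):
            break  # no successful perturbation: the curve ends here
        ratios.append(k / len(object_segments))
        values.append(c_eval_of(best))
        best_prev = list(best)
    return ratios, values
```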

Appendix B Experiments on MNIST, Caltech101 and CIFAR10

In this appendix, we provide detailed implementations and discussions of the experiments on the MNIST, Caltech101 and CIFAR10 datasets.

Figure 8: Distributions of c-Eval on the CIFAR10 dataset.

The experiments on the MNIST dataset are conducted in a pixel-wise manner. For each image, each explainer except LIME returns a fixed fraction of the image as the explanation. For LIME, since the algorithm always returns whole image segments as the explanation, we set the number of returned pixels to be as close to that fraction of the image as possible. Another note is that the implementation of LRP is simplified to Gradient × Input based on the discussion in Avanti2017 . For each explanation, we then compute the corresponding c-Eval and plot the results as shown in Fig. 6. The studied classifiers are taken from Scott2017 and Avanti2017 . Some example explanations of images from MNIST are plotted in Fig. 9.

Experiments with the Caltech101 dataset FeiFei2004 , on the other hand, use segment-wise features on the VGG19 classifier. Since the returned features of many explainers are importance weights of pixels, we need to convert them into a subset of image segments for a fair comparison. We first segment each image and then sum up the importance weights of all pixels inside each segment. We finally select the top segments with the maximum sums of weights as the segment-wise explanation of the studied explainer. For the results in Fig. 7, each explainer returns an explanation whose segments cover roughly the same fraction of the original image. Examples of explanations for this case are shown in Fig. 11.
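A sketch of this pixel-weight-to-segment conversion is given below; the segmentation routine and its parameters are assumptions, since the paper does not specify them:

```python
import numpy as np
from skimage.segmentation import slic

def top_segments_from_weights(image, pixel_weights, n_segments=50, k=10):
    """Segment the image, sum the per-pixel importance inside each segment,
    and return a boolean keep-mask over the k segments with the largest sums."""
    segments = slic(image, n_segments=n_segments)  # super-pixel labels, HxW
    seg_ids = np.unique(segments)
    sums = np.array([pixel_weights[segments == s].sum() for s in seg_ids])
    top = seg_ids[np.argsort(sums)[::-1][:k]]
    return np.isin(segments, top)  # boolean explanation mask
```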

Besides MNIST and Caltech101, we also conduct experiments on the small color image dataset CIFAR10 Krizhevsky2009 . The distributions of c-Eval on 500 images of the dataset and some examples are shown in Figs. 8 and 10. The experimental parameters and the segmentation procedure are similar to those for the Caltech101 dataset. The classifier model in this experiment is an adaptation of VGG to CIFAR10 Liu2015VeryDC . The results in Fig. 10 suggest a relative ranking of the performance of the studied explainers.

Many interesting results and deductions can be drawn from the experiments on the three datasets. Besides the general analysis presented in Section 5, we would like to point out several key observations as follows:

  • Our first comment is about the correlation between c-Eval and the portion of the predicted object captured by explainers in CIFAR10 (Fig. 10) and especially Caltech101 (Fig. 11). Below each image, we report the ratio of its c-Eval over the power needed to perturb the original image, for normalization. It is clear that most explanations containing the essential components of the predicted object have a high c-Eval, which agrees with our theoretical result on the object coverage guarantee in Section 3. For the MNIST dataset, it is non-trivial to assess explainers' quality by pure observation of Fig. 9. This is also the main reason motivating Avanti2017 to propose the log-odds function to evaluate explainers specifically for the MNIST dataset. We provide detailed discussions on this matter in Appendix C.

  • Our second observation concerns the quality of Grad-CAM across the three datasets. Since the explainer is designed for convolutional networks (e.g., VGG) and exploits the last convolution layer of such a network to generate the explanation Ramprasaath2016 , we expect Grad-CAM to perform relatively well on the Caltech101 dataset. For CIFAR10, the layer that Grad-CAM exploits in the adaptation of VGG to CIFAR10 Liu2015VeryDC contains only 4 neurons. This specific structure of the model degrades the explainer's performance significantly. Our expectation aligns with the examples shown in Fig. 10 and Fig. 11. The distributions of c-Eval for the two datasets in Fig. 8 and Fig. 7(b) also reflect those expectations on Grad-CAM.

  • DeepLIFT is a back-propagation method, and it is sensitive not only to the classifier structure but also to the selection of the reference image Avanti2017 . The experimental setups of DeepLIFT for the MNIST dataset shown in Fig. 6 are taken directly from the source code of the explainer's paper. On the other hand, our adaptations of DeepLIFT to CIFAR10 and Caltech101 are conducted without calibration of the reference image, as a calibration procedure for color images is not provided. This might be the reason for the degradation of the explainer's quality on these two datasets. It is clear that c-Eval captures this behavior.

  • Our final discussion is on the exceptionally high c-Eval of SHAP in all three datasets. This result encourages us to take a deeper look at the explanations produced by SHAP. A quick glance at SHAP on MNIST in Fig. 9 might suggest that the explainer is worse than some other back-propagation methods such as DeepLIFT, Integrated Gradients or Guided-Backpropagation; however, the figure shows that SHAP captures some important features that are overlooked by the other methods. Consider the explanation of the digit 4 as an example. SHAP is the only explainer detecting that the area on top of the 4 is important. In fact, this area is essential since, if these pixels were white instead of black, the original prediction should be 0 instead of 4. Without the c-Eval computations, we might assess SHAP solely based on intuitive observations and wrongly evaluate the explainer.

[Figure 9 panels: ten rows of MNIST examples, each showing the original image followed by the explanations of the studied explainers; the number under each explanation is its normalized c-Eval.]
Figure 9: Some examples of explanations and their c-Eval on MNIST. The explainers from left to right: SHAP, LIME, GCam, DeepLIFT, Integrated Gradients with 5 and 10 interpolations, Guided Backpropagation, and Gradient. The number associated with each figure is the normalized c-Eval ratio defined in Section 5. It is non-intuitive to evaluate these explanations purely by observation. For detailed discussions, please see Appendix C.
[Figure 10 panels: seven rows of CIFAR10 examples, each showing the original image followed by four explanations labeled with their normalized c-Eval values.]
Figure 10: Some examples of explanations and their c-Eval on CIFAR10. The explainers from left to right: SHAP, LIME, GCam and DeepLIFT. The number associated with each figure is the normalized c-Eval ratio defined in Section 5. We observe that most explanations capturing the signature components of the images have a relatively high c-Eval.
[Figure 11 panels: seven rows of Caltech101 examples, each showing the original image followed by four explanations labeled with their normalized c-Eval values.]
Figure 11: Some examples of explanations and their c-Eval on Caltech101. The explainers from left to right: SHAP, LIME, GCam and DeepLIFT. The number associated with each figure is the normalized c-Eval ratio defined in Section 5. We observe that most explanations capturing the signature components of the images have a relatively high c-Eval.

Appendix C Similarity of c-Eval and log-odds functions

To evaluate the importance scores obtained by different methods on the MNIST dataset, the authors of DeepLIFT Avanti2017 design the log-odds function as follows. Given an image that originally belongs to one class, they identify which pixels to erase to convert the original image to another target class and evaluate the change in the log-odds score between the two classes. The work conducted experiments converting between several pairs of digit classes. All obtained results agree that Guided-Backpropagation and Simonyan-Gradient are inferior to the others. Their results also demonstrate that the proposed DeepLIFT is superior in terms of the log-odds score.
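For reference, a hedged sketch of a log-odds style score in this spirit is given below; the exact erasure protocol follows Avanti2017 , and probs is an assumed function returning the classifier's softmax probabilities:

```python
import numpy as np

def log_odds_change(probs, image, erased_image, orig_class, target_class):
    """Change in log(p_orig / p_target) after erasing the selected pixels."""
    def log_odds(x):
        p = probs(x)
        return np.log(p[orig_class]) - np.log(p[target_class])
    return log_odds(image) - log_odds(erased_image)
```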

In Fig. 12, we apply c-Eval to the MNIST dataset to compare the c-Eval of explainers with the corresponding log-odds scores. The figure displays the c-Eval of the studied explainers on images with predictions 4, 8 and 9, respectively. We conduct the experiments using both the GSA and IGA perturbation schemes. Except for DeepLIFT in the experiments for labels 4 and 8, the relative rankings of explainers under c-Eval are consistent with the rankings resulting from the log-odds computations shown in Avanti2017 . This result implies that our general framework of evaluating explainers based on c-Eval is applicable to this specific study on the MNIST dataset.

(a) c-Eval of number 4 with GSA
(b) c-Eval of number 8 with GSA
(c) c-Eval of number 9 with GSA
(d) c-Eval of number 4 with IGA
(e) c-Eval of number 8 with IGA
(f) c-Eval of number 9 with IGA
Figure 12: We compute the c-Eval of explainers on MNIST images of the digits 4, 8 and 9 to show the similarity between c-Eval and the log-odds function in Avanti2017 .