BIM: Towards Quantitative Evaluation of Interpretability Methods with Ground Truth

by Mengjiao Yang, et al.

Interpretability is rising as an important area of research in machine learning for safer deployment of machine learning systems. Despite active developments, quantitative evaluation of interpretability methods remains a challenge due to the lack of ground truth; we do not know which features or concepts are important to a classification model. In this work, we propose the Benchmark Interpretability Methods (BIM) framework, which offers a set of tools to quantitatively compare a model's ground truth to the output of interpretability methods. Our contributions are: 1) a carefully crafted dataset and models trained with known ground truth and 2) three complementary metrics to evaluate interpretability methods. Our metrics focus on identifying false positives: features that are incorrectly attributed as important. These metrics compare how methods perform across models, across images, and per image. We open source the dataset, models, and metrics evaluated on many widely-used interpretability methods.




1 Introduction

Much of the output of machine learning (ML) interpretability research, whether techniques promoting sparsity in models or post-training interpretability methods, is assessed by individuals inspecting the output to see if one can "understand" it. While there is much value in this exercise, this type of assessment is vulnerable to bias and subjectivity. Most importantly, just because an explanation makes sense to humans does not mean that it is correct; an explanation does not have to reflect a model’s rationale behind its prediction to appeal to humans. Assessment metrics for interpretability methods should capture any mismatch between model truth and interpretation.

Figure 1: Confusion matrix of the output of interpretability methods. TP, FP, FN, TN correspond to true positive, false positive, false negative, and true negative explanations.

Similar to the confusion matrix of classifiers, we can capture the output of interpretability methods in a confusion matrix, as shown in Figure 1. We define true positive (TP) explanations as features/concepts that present evidence of prediction. Since precisely defining concepts is known to be challenging genone2012concept , by concepts we simply mean high-level human-friendly units of explanations (e.g., an object) rather than individual input features (e.g., pixels). Note that good candidate features/concepts that a model “could have used” (e.g., edges in an image) are not evidence of prediction unless they are directly involved in prediction. In other words, if an explanation provides evidence of prediction and the prediction changes, that explanation should reflect the change. Adebayo18 explored this idea and showed that many interpretability methods present visually identical explanations when a model’s learned weights are randomized and when predictions are random. This finding suggests that interpretability methods are making mistakes. A natural question is which mistakes each method makes, and when.

In this work, we attempt to answer this question with a particular focus on false positive mistakes: the set of features that an interpretability method attributes as important but that are not in fact evidence of prediction. In other words, removing these features does not change the prediction. To achieve this goal, we build a semi-natural dataset with pixel-wise labels that enables us to manipulate the ground truth (evidence of prediction). We then train models using this dataset to curate unimportant features. In order to quantitatively evaluate to what extent an interpretability method incorrectly attributes importance to the unimportant features, we define metrics around model dependence, input dependence, and input independence. Our results show that rankings of interpretability methods differ under different metrics; whether an interpretability method is good depends heavily on the type of mistakes that a final task is trying to avoid. Our contributions are as follows: 1) We release, to our knowledge, the first semi-natural image dataset with ground truth for evaluating interpretability methods. 2) We propose three complementary metrics focusing on false positive mistakes and evaluate six local interpretability methods and one global interpretability method.

2 Related work

Prior work on evaluating interpretability methods generally falls into three categories: 1) measuring (in)sensitivity of explanations to model and input perturbations, 2) verifying correctness of explanations, and 3) evaluating explanations in a controlled setting with known ground truth. Our work shares characteristics with all three: given model ground truth, we measure when and how an interpretability method should be sensitive or insensitive to the change in model or input.
Evaluating sensitivity of explanations. This set of work measures large and small changes in interpretability methods’ output when model Adebayo18 or input changes (in the space Alvarez18 or adversarially Ghorbani18 ). Our work can be viewed as a "harder" test of interpretability methods than the randomization test done in Adebayo18 . Rather than making features irrelevant to prediction by randomizing the weights, we make certain features more relevant to prediction than others, and measure the response of interpretability methods. One of our proposed metrics also measures a notion of robustness in explanations (when attributions should stay invariant) similar to Alvarez18 and Ghorbani18 . Our perturbation, however, is optimized to be semantically meaningful (e.g., looks like a dog). These perturbations are well-suited to test false positives in explanations, as they have a better chance of misleading humans.
Evaluating correctness of explanations. Samek17 and its subsequent variations Fong17 ; Ancona18 ; Hooker18 infer whether a feature attribution is correct by measuring performance degradation when highly attributed features are removed. Our work also assesses explanation correctness, but we do not have to "infer" whether an attribution is correct, as ground truth is available by construction.
Evaluating with ground truth. Most relevant to our work is the controlled experiment in TCAV Kim18 , where the authors created a simple dataset and trained models on that dataset such that the ground truth of important pixels is known. While Kim18 compared their method to pixel-based attribution methods (e.g., saliency maps), no quantitative metrics were given; their results were qualitatively evaluated by human subjects. Our work develops quantitative metrics for evaluating interpretability methods in a finer-grained setting with semi-natural images.

3 Dataset for Benchmarking Interpretability Methods (BIM)

The BIM framework has two components: 1) BIM dataset and BIM models (trained on BIM dataset) with known ground truth and 2) BIM metrics that quantitatively evaluate interpretability methods (with or without BIM models). In this section, we describe the dataset and models, which are open sourced at

Figure 2: BIM dataset examples and BIM models. The object neural network is trained with object labels and the scene neural network is trained with scene labels.

3.1 Dataset construction

We construct the BIM dataset by selectively pasting object pixels (e.g., a dog) into scene images as shown in Figure 2. The object pixels are gathered from MSCOCO Lin14 using their pixel-wise object labels. The scene images are from MiniPlaces Zhou17 . Each resulting image has two labels: an object label and a scene label. Either can be used to train a classifier. The BIM dataset has 10 object classes and 10 scene classes (a total of 100 images). An object is rescaled to between to of a scene image and is pasted onto the scene image at a randomly chosen location. We refer to this set as . While some images in the BIM dataset may not look natural, they do not handicap our findings, as the only purpose of the object pixels is to identify regions of an image.
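The pasting step can be illustrated with a minimal NumPy sketch. The function name `paste_object`, the array shapes, and the toy data are assumptions made for illustration only; the sketch does not reproduce the released pipeline, and it omits the rescaling of the object.

```python
import numpy as np

def paste_object(scene, obj, obj_mask, top, left):
    """Paste the masked object pixels onto a copy of the scene at (top, left).

    scene:    (H, W, 3) float array
    obj:      (h, w, 3) float array holding the object crop
    obj_mask: (h, w) boolean array, True where a pixel belongs to the object
    Returns the composite image and a full-size boolean mask of the object,
    which plays the role of the pixel-wise concept label.
    """
    out = scene.copy()
    h, w = obj_mask.shape
    region = out[top:top + h, left:left + w]   # a view into `out`
    region[obj_mask] = obj[obj_mask]           # overwrite object pixels only
    full_mask = np.zeros(scene.shape[:2], dtype=bool)
    full_mask[top:top + h, left:left + w] = obj_mask
    return out, full_mask

# Toy data (hypothetical): a random "scene" and an all-white square "object".
rng = np.random.default_rng(0)
scene = rng.random((64, 64, 3))
obj = np.ones((16, 16, 3))
obj_mask = np.zeros((16, 16), dtype=bool)
obj_mask[4:12, 4:12] = True

# Randomly chosen location that keeps the crop inside the scene.
top = rng.integers(0, 64 - 16 + 1)
left = rng.integers(0, 64 - 16 + 1)
image, mask = paste_object(scene, obj, obj_mask, top, left)
```

Returning the full-size mask alongside the image is a design choice of this sketch: it keeps the ground-truth region available for computing concept-level attributions later.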

3.2 The definition of common feature and commonality

We define a common feature (CF) as a set of pixels forming a meaningful concept (e.g., an object) that commonly appears in one or more classes. The percentage of classes where a CF appears is the commonality of that CF. For example, a dog CF where dogs appear in 5 out of 10 scene classes (50% commonality) has a higher commonality than another dog CF where dogs appear in only 1 out of 10 scene classes (10% commonality). A classifier trained with a less common CF is more likely to use the CF as a signal for prediction, so this CF is more important. In the extreme case of 100% commonality (when a CF is present in all classes, also called a 100% CF), the CF is unimportant for prediction. By ‘unimportant’, we mean that when this CF is removed, predictions do not change significantly. By changing the commonality, we can measure false positive responses of interpretability methods in both absolute and relative scales: 1) when interpretability methods should not respond to a CF and 2) when they should respond more or less than other times.

We define to be the set of images where percent of scene classes have an object CF ( is the commonality of this CF). For simplicity, we use interchangeably with . We create a set of data with varying degrees of commonality (i.e., for ).

3.3 Training classifiers with common features of 100% commonality

We train two classifiers using this dataset: a scene classifier trained with scene labels and an object classifier trained with object labels. Intuitively, the scene classifier’s prediction can only be informed by scenes but not objects, since all objects appear uniformly in all scenes. In a way, the scene classifier is encouraged to "ignore" objects, and vice versa for the object classifier.

Verifying the ground truth

We empirically verify this intuition in three ways: 1) confirming that the scene classifier’s accuracy is maintained when objects are removed and the object classifier’s accuracy is maintained when scenes are removed, 2) showing small Kullback–Leibler (KL) divergence between activations in the logit layer for an image with and without the CF, and 3) verifying that when only the CF is present in an image, the accuracy is close to random guess. In the case of 1), we remove scenes by filling the background with grey pixels. When removing objects, we leave the original scenes intact.
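The KL check in 2) can be sketched as follows. This is only an illustration under stated assumptions: the logits are turned into distributions with a softmax before the divergence is computed (the text does not say how the activations are normalized), and the helper names and example logits are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def kl_from_logits(logits_p, logits_q, eps=1e-12):
    """KL(p || q) between the softmax distributions of two logit vectors."""
    p = softmax(np.asarray(logits_p, dtype=float))
    q = softmax(np.asarray(logits_q, dtype=float))
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Hypothetical logits for one image with and without the common feature.
with_cf = [4.0, 0.5, 0.1]
without_cf = [3.9, 0.6, 0.1]
flipped = [0.1, 0.5, 4.0]      # a prediction that changed class

small = kl_from_logits(with_cf, without_cf)
large = kl_from_logits(with_cf, flipped)
```

A small value for the first pair and a much larger value for the second mirrors the contrast the verification relies on: the divergence stays small when the CF does not affect the prediction.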

As shown in Figure 3 (left), the test accuracies of the two classifiers roughly stay the same when their corresponding CF is removed. The majority of correct predictions remain correct. Predictions are as good as random guess (10%) when only the CF is kept in the images. This confirms that the CF does not provide any evidence of prediction. The median KL divergence between classifying an image with or without CF is very small when both predictions are correct (meaning the network sees them as very similar), compared to when one of the predictions is wrong. Note that we only use correctly predicted data points to evaluate interpretability methods, so that the classifier’s mistakes are not propagated to interpretability methods.

Figure 3: [Top] Validating the impact of CF in the two classifiers. Scenes are unimportant for the object classifier, and objects are unimportant for the scene classifier. [Bottom] Test accuracy of bamboo forest with and without dog CF on models trained with varying commonality.

3.4 Training classifiers with common features of any % of commonality

We can create a more complex scenario by training classifiers on inputs with CFs of varying degrees of commonality. In , we add dog CF only to the bamboo forest (a randomly chosen scene) class. For the model trained on this set of data, removing dog CF from bamboo forest at test time causes the accuracy of bamboo forest to drop from to (Figure 3). We create ten sets of data, for , by adding dog CF to more scene classes. We train one model for each set.
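One way to picture the ten datasets is as nested choices of which scene classes receive the dog CF. The sketch below is an assumption-laden illustration: the function name, the use of class index 0 for bamboo forest, and the nesting order are all hypothetical and not taken from the released code.

```python
def scene_classes_with_cf(num_classes, commonality, anchor_class=0):
    """Return the class indices that receive the CF at a given commonality.

    commonality is a fraction in (0, 1]; anchor_class (standing in for
    bamboo forest) always receives the CF, and further classes are added
    in a fixed order so that the sets are nested as commonality grows.
    """
    n = int(round(commonality * num_classes))
    if not 1 <= n <= num_classes:
        raise ValueError("commonality must select between 1 and num_classes classes")
    others = [c for c in range(num_classes) if c != anchor_class]
    return [anchor_class] + others[:n - 1]

# Ten nested assignments, from 10% to 100% commonality.
assignments = {k: scene_classes_with_cf(10, k / 10) for k in range(1, 11)}
```

Keeping the sets nested means each step only adds the CF to more classes, which matches the description of building each dataset by adding dog CF to more scene classes.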

This setup should cause the relative importance of dog CF in classifying bamboo forest to decrease as commonality increases. We verify this by inspecting the test accuracy of bamboo forest with and without dog CF for each model (Figure 3 (bottom)). With this setup, we can evaluate interpretability methods by the importance they assign to dog CF: a method shall assign higher importance to the CF that is less common.

4 Metrics for evaluation with and without the BIM dataset

We propose three complementary metrics to evaluate interpretability methods: model contrast score (MCS), input dependence rate (IDR), and input independence rate (IIR). These metrics aim to cover various aspects of false positives in interpretability methods when comparing a) two models trained to consider opposite concepts as important (MCS), b) one model with two inputs of different concepts (IDR) and c) one model with two functionally identical inputs (IIR). We provide formal definitions below.

Setup First, we define a way to compute the importance that an interpretability method assigns to a concept. We denote as the raw output of an interpretability method. In pixel attribution methods such as saliency maps, we have for an input . In concept attribution methods such as TCAV Kim18 , we have for one concept. Since BIM dataset has pixel-wise labels for each concept (e.g., dog), calculating the concept-level attribution is straightforward. We denote as a binary mask where pixels inside have value , and everywhere else. Given an input image and a model , the concept attribution, , is defined as:

We further define to be the average of ’s over a set of correctly classified .
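A concept-level attribution of this kind can be sketched in a few lines. The sketch assumes the concept attribution is the mean of the raw attribution over the pixels inside the concept mask; that reading, and the function names, are assumptions of this illustration rather than a statement of the exact definition.

```python
import numpy as np

def concept_attribution(attribution, mask):
    """Mean raw attribution over the pixels where the binary mask is 1."""
    mask = np.asarray(mask, dtype=float)
    return float((np.asarray(attribution, dtype=float) * mask).sum() / mask.sum())

def mean_concept_attribution(attributions, masks):
    """Average of the per-image concept attributions over a set of images."""
    return float(np.mean([concept_attribution(a, m)
                          for a, m in zip(attributions, masks)]))

# Toy example: a 4x4 attribution map and a mask over its top-left 2x2 block.
attribution = np.arange(16.0).reshape(4, 4)
mask = np.zeros((4, 4))
mask[:2, :2] = 1
g = concept_attribution(attribution, mask)   # mean of 0, 1, 4, 5
```

Dividing by the mask area keeps the score comparable across objects of different sizes, which matters when objects are pasted at varying scales.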

Now we define our three metrics.

4.1 MCS: Model contrast score with BIM

Once we have a known CF with 100% commonality (the CF appears in every class), one evaluation option is to directly compare methods’ concept attributions of that CF, and expect them to be small. However, there is a catch: a meaningless method that always assigns 0 (unimportant) to all features would seem to perform well. Even without such a bogus method, two interpretability methods may operate on two different attribution scales; the same value can refer to an important feature in one method but an unimportant feature in another method, so directly comparing attributions could be misleading.

Thus, we define model contrast score (MCS) as the difference in concept attributions between a model that considers a concept important and a model that considers it unimportant.

The absolute MCS is computed by setting , , and by selecting from . This measures how differently object CF is attributed between when it is the most important () and least important (). A higher contrast indicates a better interpretability method. We can also compute relative MCS by setting to be one of the models trained with for and to be the model trained with . This results in a spectrum of contrast scores where object CF is important to a different degree.
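The contrast described above can be sketched as a difference of averages. The function below assumes the inputs are per-image concept attributions of the same concept, computed on the same correctly classified images for the two models; the function name and the numbers are hypothetical.

```python
import numpy as np

def model_contrast_score(g_important_model, g_unimportant_model):
    """Difference between the average concept attribution under the model that
    treats the concept as important and the model that treats it as unimportant."""
    return float(np.mean(g_important_model) - np.mean(g_unimportant_model))

# Hypothetical per-image concept attributions for one interpretability method.
g_object_model = [0.8, 0.6]   # concept is important to this model
g_scene_model = [0.1, 0.3]    # concept is unimportant to this model
mcs = model_contrast_score(g_object_model, g_scene_model)

# A degenerate method that assigns 0 everywhere gets no contrast at all.
mcs_bogus = model_contrast_score([0.0, 0.0], [0.0, 0.0])
```

The second call shows why a contrast is used instead of a raw attribution: an always-zero method looks harmless on unimportant features but scores zero here.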

4.2 IDR: Input dependence rate with BIM

While MCS measures performance of interpretability methods across models, one may also be interested in how well each interpretability method performs given a single model. Input dependence rate (IDR) compares two sets of inputs with and without a model’s 100% CF. We expect an input with CF to have a smaller concept attribution than an input without CF, since ideally, the attribution would be close to zero for a 100% CF. For a correctly classified set, we define input dependence rate (IDR) as the percentage of images where CF is attributed as less important. Formally,

where is an input with CF , is an input without it, and . In the BIM framework, we have , , and . Intuitively, one minus IDR is the false positive rate: out of 100 images, how many images would incorrectly highlight unimportant concepts, misleading human interpretation.
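Read literally, the verbal definition is a fraction of image pairs. The sketch below assumes paired concept attributions for inputs with and without the CF, restricted to correctly classified images; the function name and the numbers are hypothetical.

```python
import numpy as np

def input_dependence_rate(g_with_cf, g_without_cf):
    """Fraction of pairs where the input containing the CF receives a smaller
    concept attribution than its counterpart without the CF."""
    a = np.asarray(g_with_cf, dtype=float)
    b = np.asarray(g_without_cf, dtype=float)
    return float(np.mean(a < b))

# Hypothetical attributions for four image pairs.
g_with = [0.1, 0.5, 0.2, 0.9]
g_without = [0.3, 0.4, 0.6, 0.8]
idr = input_dependence_rate(g_with, g_without)
false_positive_rate = 1.0 - idr
```

With a random attribution the comparison succeeds about half the time, which is consistent with the 50% baseline reported for this metric.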

4.3 IIR: Input independence rate with BIM or any models

If input dependence accounts for when an interpretability method should "react" to two different inputs, input independence rate (IIR) is concerned with when an interpretability method should not "react" to two different inputs. An interpretability method shall return a similar value if what it is trying to explain (i.e., the model output) did not change. We create a "patch" (a small set of connected pixels) that minimizes the change in the logit layer when the patch is pasted onto an input image. We compute this patch, , by optimizing the below objective with simple gradient descent:

where avoids the trivial solution of . is a regularization term (see details in Appendix). Note that this patch can be computed for any model where gradients are accessible.
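The optimization can be illustrated on a toy problem. Everything below is an assumption made for the sketch: the model is a linear logit layer `W`, the objective is the squared logit change minus a small reward `eta` for a non-zero patch, and the penalty terms of the appendix are swapped for plain clipping and masking (that is, projected gradient descent). It is not the released implementation.

```python
import numpy as np

def optimize_patch(W, x, delta0, mask, eta=1e-3, steps=300):
    """Projected gradient descent on ||W(x + d) - W x||^2 - eta * ||d||^2.

    W:      (num_logits, num_pixels) toy linear logit layer (assumption)
    x:      flat image with values in [0, 1]
    delta0: initial patch (for example, pixels of a dog crop)
    mask:   0/1 array, 1 inside the patch region
    """
    m = np.asarray(mask, dtype=float)
    delta = delta0 * m
    # A step size of 1 / (Lipschitz constant of the gradient) keeps descent stable.
    lipschitz = 2.0 * max(np.linalg.norm(W, 2) ** 2, eta)
    lr = 1.0 / lipschitz
    for _ in range(steps):
        grad = 2.0 * W.T @ (W @ delta) - 2.0 * eta * delta
        delta = delta - lr * grad * m                 # update the patch region only
        delta = np.clip(x + delta, 0.0, 1.0) - x      # keep pixels in the valid range
    return delta

rng = np.random.default_rng(0)
num_pixels, num_logits = 64, 10
W = rng.normal(size=(num_logits, num_pixels))
x = np.full(num_pixels, 0.5)
mask = np.zeros(num_pixels)
mask[:36] = 1
delta0 = rng.uniform(0.1, 0.4, size=num_pixels)

delta = optimize_patch(W, x, delta0, mask)
change_before = np.linalg.norm(W @ (delta0 * mask))
change_after = np.linalg.norm(W @ delta)
```

Starting from a non-zero patch and only removing the part that moves the logits is what lets the result stay visible while leaving the model output nearly unchanged.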

When the patch has semantic meaning (i.e., humans recognize what the patch refers to), a false positive explanation presents more danger, because this patch aligns well with a human-friendly concept. It turns out that making this patch look like a concept (e.g., dog) is possible. Figure 8 shows an example of an image with such a patch. While is small (), the dog is clearly visible.

With this patch, we can calculate input independence rate (IIR). We expect the concept attribution to be similar for an input with and without this patch. Specifically, we expect it to change only within some visually imperceptible threshold, above which humans would notice the difference in attribution. For a correctly classified set, we can compute the percentage of images where the difference in concept attribution with and without the patch is less than the threshold:

Intuitively, one minus IIR is the false positive rate: out of 100 images, how many images would incorrectly highlight functionally unimportant concepts, misleading human interpretation. Note that the threshold is application specific; if many images are mostly black, what is ‘noticeable’ is different from if most images are white. We can also measure the raw value of the difference in concept attribution instead of the percentage, which is included in the Appendix.
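Both the rate and the raw difference can be sketched directly from the verbal description. The function names, the threshold value, and the numbers are hypothetical; the inputs are assumed to be concept attributions of the patch region for the same images without and with the patch.

```python
import numpy as np

def input_independence_rate(g_original, g_patched, threshold):
    """Fraction of images whose concept attribution changes by less than
    `threshold` when the functionally unimportant patch is added."""
    a = np.asarray(g_original, dtype=float)
    b = np.asarray(g_patched, dtype=float)
    return float(np.mean(np.abs(a - b) < threshold))

def mean_attribution_shift(g_original, g_patched):
    """Raw alternative: average absolute change in concept attribution."""
    a = np.asarray(g_original, dtype=float)
    b = np.asarray(g_patched, dtype=float)
    return float(np.mean(np.abs(a - b)))

# Hypothetical attributions for four images.
g_orig = [0.10, 0.20, 0.30, 0.40]
g_patch = [0.11, 0.35, 0.29, 0.80]
iir = input_independence_rate(g_orig, g_patch, threshold=0.05)
shift = mean_attribution_shift(g_orig, g_patch)
```

The rate depends on the chosen threshold while the mean shift does not, so reporting both gives a view that is less sensitive to that choice.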

5 Evaluating interpretability methods with and without BIM

With the BIM dataset defined in Section 3 together with the BIM metrics defined in Section 4, we compare a set of existing interpretability methods. We find that our metrics are indeed complementary: a method such as Vanilla Gradient Simonyan13 ; Erhan09 ; Baehrens10 can have high IDR and IIR but low MCS. We consider seven interpretability methods: some provide local explanations (i.e., they explain one data point at a time), and others provide global explanations (i.e., they explain a target class). For local interpretability methods, we include GradCAM (GC) Selvaraju16 , Vanilla Gradient (VG), SmoothGrad (SG) Smilkov17 , Integrated Gradient (IG) Sundararajan17 and Guided Backpropagation (GB) Springenberg14 . We also consider Gradient x Input (GxI), as many methods above use GxI to visualize final explanations. Our saliency map visualization follows the procedures in Smilkov17 , except that we only use positive pixel attributions when computing the evaluation metrics, because our work focuses on false positives. We use MCS to evaluate a global method (TCAV).

5.1 Evaluating with model contrast score

MCS offers two sets of evaluations—absolute scale evaluation with 100% CF and relative scale evaluation with any % CF. They result in similar rankings but offer different insights. Higher MCS indicates better methods.

Absolute MCS with 100% common feature

Figure 4: An example of saliency map visualizations for and . While most methods focus on dog in and do not focus on dog in , it is hard to rank their performances across many images.
Figure 5: Absolute MCS. Blue bars are MCS measured from the original BIM dataset. Red bars show robustness of this measure. Yellow bars are baselines. TCAV’s baseline is . Higher MCS is better.

One image from the BIM dataset and its saliency maps are shown in Figure 4. Visual inspection reveals that most of the methods change in the right direction but to a different degree. To quantify this observation, MCS is computed over k images. GC and TCAV have high MCS according to Figure 5. MCS is robust to the scale and location of objects (red bars). The baseline (yellow bars) is calculated using a random mask (see details in Appendix G). Note that MCS for TCAV is the difference between the TCAV scores of the two models.

Relative MCS with common features of varying commonality

Figure 6: An example of saliency map visualizations for models trained with CF of varying degrees of commonality. Commonality increases from left to right. A larger contrast among each row is better. See the full size figure in Appendix H.
Figure 7: [Top] Relative MCS as commonality increases. The dashed black line is the accuracy drop when CF is removed. The dotted blue line is the relative contrast scores for TCAV. [Bottom] The Pearson correlation coefficients between each method’s relative MCS and the accuracy drop. A higher correlation is better.

We compute MCS between the classifier trained on and a set of classifiers trained on for , as described in Section 3.4. The quantitative results in Figure 7 suggest that different methods follow the trend of the accuracy drop to a different degree as the dog CF becomes less important. The accuracy drop (dashed black line) is presented as the trend of ground truth.

As commonality increases from left to right in Figure 7 (top), TCAV and GC follow the trend of the accuracy drop more closely. Visual assessment of GC in Figure 6 tells the same story. We record the Pearson correlation coefficients between each method’s relative MCS and the accuracy drop in Figure 7 (bottom). TCAV achieves the highest correlation, closely followed by GC. Visual inspection of Figure 6 and quantitative results in Figure 7 suggest that methods other than TCAV and GC change at a much smaller scale. In particular, GB evolves minimally with the edges of the dog always being visible, similar to the findings in Adebayo18 ; Nie18 .
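The correlation between a method's relative MCS and the accuracy drop can be computed with `numpy.corrcoef`. The numbers below are made up for illustration and are not results from the paper; the helper name `pearson` is likewise an assumption.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical accuracy drops as commonality increases (ground-truth trend).
accuracy_drop = np.array([0.90, 0.72, 0.55, 0.31, 0.12])

# A method whose relative MCS tracks the trend, and one that barely moves.
mcs_tracking = 0.5 * accuracy_drop + 0.1
mcs_flat = np.array([0.21, 0.20, 0.22, 0.20, 0.21])

r_tracking = pearson(accuracy_drop, mcs_tracking)
r_flat = pearson(accuracy_drop, mcs_flat)
```

Pearson correlation ignores the scale of the scores, so it compares methods with different attribution ranges on the shape of the trend alone.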

5.2 Evaluating with input dependence rate

Now we conduct local evaluation using IDR. From visualization alone (Figure 8), it is hard to tell which method is performing better, especially across many images. The quantitative measure of IDR in Figure 9 shows that GC and VG have the most correctly attributed CF—the least amount of false positive explanations. Note that VG is simply the gradients; many other methods require calculating VG. This means that the cheapest method of all offers nearly the best performance (this is consistent with the results in Adebayo18 ). The baseline of IDR is 50% (measured on random ). In applications where low false positive rate is critical, GC and VG are clearly better choices than other methods. TCAV is not applicable, as it is a global method.

Figure 8: Examples of saliency map visualization from IDR (a) and IIR (b). (a) IDR testing: the dog CF is not important. (b) IIR testing: the two inputs are functionally identical.
Figure 9: IDR (a) and IIR (b) results, each over a set of 100 images. (a) Higher IDR is better; the baseline is 50%. (b) Higher IIR is better.

5.3 Evaluating with input independence rate

Finally, we present the IIR results. Most of the methods, with the exception of GC and VG, incorrectly identify the dog patch as important to prediction for over 80% of the examples (Figure 9). This alarming result is aligned with visual assessments shown in Figure 8—the dog is clearly highlighted by many methods, and especially by GB. This is in line with the findings of  Adebayo18 ; Nie18 , where some methods always tend to highlight edges.

Since the dog patch is highly visible in the image, interpretability methods that directly depend on the input (e.g., GxI and IG) are likely to reflect this meaningless change in the input, consistent with the observations in Smilkov17 ; Shrikumar16 . This observation calls into question the common practice of multiplying explanations by the input image for visualization.

The threshold is used to compute IIR (Section 4.3). We visually determined that when changes by more than , one can see the difference in attribution of the dog region in most methods. As a reminder, IIR with is the percentage of inputs where adding such a dog patch does not change the attribution of the dog region by more than . We use when computing this patch. TCAV is again not applicable as it is a global method.

6 Conclusions

There is little point in providing false explanations—evaluating explanations is as important as developing interpretability methods. In this work, we take a step towards ground-truth-based evaluations of interpretability methods. We create and open source a semi-natural image dataset (BIM dataset), a set of models with ground truth (BIM models), and three complementary metrics (BIM metrics) to evaluate interpretability methods. Our work is only a starting point; one can also develop metrics and setups for false negatives or other measures of performance. We hope that developing ways to quantitatively evaluate interpretability methods helps us choose the right metric and methods best for the application at hand.


Appendix A Details in creating dog patch for input independence

The additional regularization terms in the loss function for creating the patch for the input independence test are as follows:

penalizes pixel values in that fall outside the valid pixel range (e.g., ). The additional term minimizes updates to regions outside of the patch region represented by mask ( is a matrix of ones). The overall loss function is:

The update rule for is:

where is the step size (defaults to ). is initialized from a dog patch to obtain solutions that are semantically meaningful.

Appendix B Discussions on the dog patch versus common feature

Note that there is a subtle difference between the dog patch generated from the optimization procedure above versus the common feature (CF) obtained from training. Intuitively, to find a patch, we are moving the original input image in the direction that is perpendicular to the gradient , but the gradient itself is fixed because the model is fixed. When training a model with CF, becomes small with respect to the CF. This explains why we expect minimal attribution change of the dog patch in input independence testing, but expect small attribution to the dog CF in input dependence testing.

Appendix C Other measures for input independence

An alternative measure of input independence is the average perturbation in attribution when a functionally unimportant patch is added to the input:

Figure 10 shows the average perturbation over 100 images for each saliency method. Lower perturbation is better. The ranking is roughly the same as the input independence rate metric.

Figure 10: Average perturbation in attribution () between and . Lower perturbation is better.

Appendix D DNN Architecture and training

All BIM models are ResNet50 [10] models. Training starts from an ImageNet pretrained checkpoint [15], and all layers are fine-tuned on the BIM dataset. 90% of randomly chosen members of the BIM dataset are used to train; the rest are used for testing. All models are implemented using TensorFlow [1] and trained on a single Nvidia Tesla V100 GPU.

Appendix E Details of interpretability methods compared

We consider neural network models with an input and a predictor function . A saliency method outputs a saliency map highlighting regions relevant to prediction. Below is an overview of the eight saliency methods evaluated in our work.

GradCAM (GC) [17] computes the gradient of the class logit with respect to the feature map of the last convolutional layer of a DNN. Guided GradCAM (GGC) is GC combined with Guided Backprop through an element-wise product.

Vanilla Gradient (VG) [19, 6, 5] computes the gradient of the target class at logit layer with respect to each input pixel: , reflecting how much the logit layer output would change when the input changes in a small neighborhood.

SmoothGrad (SG) [20] reduces visual noise by averaging the explanations over a set of noisy images in the neighborhood of the original input image: , where .

Integrated Gradient (IG) [22] computes the sum of gradients for a scaled set of images along the path between a baseline image () and the original input image: . Smoothing from SG can be applied to IG to produce IG-SG.

Gradient x Input (G x I) computes an element-wise product between VG and the original input image. [4] showed that for ReLU network with zero baseline and no bias.

Guided Backpropagation (GB) [21] builds on top of the DeConvNet explanation method [23] and attributes input importance through backpropagating neuron activations from the logit layer to the input layer.
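The gradient-based methods above can be sketched on a toy function whose gradient is known in closed form. The function `f`, its gradient `grad_f`, and all parameter defaults are assumptions of this illustration; a real model would supply the gradient of the class logit through automatic differentiation. Integrated Gradient is approximated with a midpoint Riemann sum.

```python
import numpy as np

def vanilla_gradient(grad_fn, x):
    """Gradient of the output with respect to each input element."""
    return grad_fn(x)

def gradient_x_input(grad_fn, x):
    """Element-wise product of the gradient and the input."""
    return grad_fn(x) * x

def smoothgrad(grad_fn, x, sigma=0.1, n=50, seed=0):
    """Average gradient over n Gaussian-perturbed copies of the input."""
    rng = np.random.default_rng(seed)
    noisy = x + rng.normal(0.0, sigma, size=(n,) + x.shape)
    return np.mean([grad_fn(z) for z in noisy], axis=0)

def integrated_gradients(grad_fn, x, baseline=None, steps=50):
    """(x - baseline) times the average gradient along the straight path
    from the baseline to x, using a midpoint Riemann sum."""
    baseline = np.zeros_like(x) if baseline is None else baseline
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean(
        [grad_fn(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * avg_grad

# Toy differentiable "model": f(x) = sum(x^2), with gradient 2x.
f = lambda x: float(np.sum(x ** 2))
grad_f = lambda x: 2.0 * x

x = np.array([1.0, -2.0, 0.5])
vg = vanilla_gradient(grad_f, x)
gxi = gradient_x_input(grad_f, x)
sg = smoothgrad(grad_f, x)
ig = integrated_gradients(grad_f, x)
```

For this toy function the Integrated Gradient attributions sum to `f(x) - f(baseline)`, which is a convenient sanity check when swapping in a real model.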

We accompany visualization of a subset of saliency methods by averaging over channels and capping the extremes to the percentile as done by [22, 20] before normalizing each attribution to between .
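The visualization steps can be sketched as below. The 99th percentile cap and the [0, 1] output range are assumptions of this sketch, since the exact values are not recoverable from the text here; the function name and the `positive_only` switch are hypothetical as well.

```python
import numpy as np

def to_saliency_map(attribution, cap_percentile=99.0, positive_only=False):
    """Turn an (H, W, C) attribution into an (H, W) map scaled to [0, 1].

    Steps: average over channels, optionally keep positive values only,
    cap values above the given percentile, then min-max normalize.
    """
    sal = np.asarray(attribution, dtype=float).mean(axis=-1)
    if positive_only:
        sal = np.maximum(sal, 0.0)
    hi = np.percentile(sal, cap_percentile)
    sal = np.minimum(sal, hi)            # cap the extreme high values
    lo = sal.min()
    return (sal - lo) / (hi - lo + 1e-12)

rng = np.random.default_rng(0)
attr = rng.normal(size=(8, 8, 3))        # hypothetical raw attribution
sal = to_saliency_map(attr)
sal_pos = to_saliency_map(attr, positive_only=True)
```

Capping before normalizing prevents a handful of extreme pixels from compressing the rest of the map toward zero.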

Appendix F Details of computing TCAV scores

We compute the TCAV scores of the dog concept for different models (e.g. and for absolute contrast). To learn the dog CAV, we take 100 images from where the object is a dog as positive examples and 100 images from as negative examples for the dog concept. We perform a two-sided t-test of the TCAV scores, and reject scores where the p-value is greater than 0.01. We compute TCAV scores for each of the block layers and the logit layer of ResNet50. The final TCAV score is the average over the layers that passed statistical testing.

Appendix G Details of model contrast score baseline

For model contrast score, we generate a random mask to calculate baseline differences. For TCAV, we obtain two TCAV scores for two models ( and ), and show the difference between the two, both for TCAV scores for the dog CAVs and random CAVs.

Appendix H Full size relative model contrast figures

Figure 11: An example of saliency map visualizations for models trained with CF of varying degrees of commonality. Commonality increases from left to right. Larger contrast among each row is better.
Figure 12: [Top] Relative MCS as commonality increases. The dashed black line is the accuracy drop when CF is removed. The dotted blue line is the contrast score for TCAV. [Bottom] Pearson correlation coefficient between each method’s relative MCS and the accuracy drop. Higher correlation is better.

Appendix I Additional relative model contrast figures

Figure 13: Additional example saliency maps from relative model contrast testing.
Figure 14: Additional example saliency maps from relative model contrast testing.

Appendix J Additional input independence figures

Figure 15: Additional example saliency maps from input independence testing.