1 Introduction
Much of the output of machine learning (ML) interpretability research, whether techniques that promote sparsity in models or post-training interpretability methods, is assessed by having individuals inspect the output and judge whether they can "understand" it. While there is value in this exercise, such assessment is vulnerable to bias and subjectivity. Most importantly, an explanation that makes sense to humans is not necessarily correct: an explanation does not have to reflect a model's actual rationale for its prediction in order to appeal to humans. Assessment metrics for interpretability methods should therefore capture any mismatch between model truth and interpretation.

Similar to the confusion matrix of a classifier, we can organize the output of an interpretability method into a confusion matrix, as shown in Figure 1. We define true positive (TP) explanations as features/concepts that present evidence of prediction. Since precisely defining concepts is known to be challenging [8], by concepts we simply mean high-level, human-friendly units of explanation (e.g., an object) rather than individual input features (e.g., pixels). Note that good candidate features/concepts that a model "could have used" (e.g., edges in an image) are not evidence of prediction unless they are directly involved in the prediction. In other words, if an explanation provides evidence of prediction and the prediction changes, the explanation should reflect that change. [2] explored this idea and showed that many interpretability methods produce visually identical explanations when a model's learned weights are randomized and its predictions are therefore random. This finding suggests that interpretability methods make mistakes. A natural follow-up question is when methods make which mistakes.

In this work, we attempt to answer this question with a particular focus on false positive mistakes: features attributed as important by an interpretability method that are not in fact evidence of the prediction; removing these features does not change the prediction. To this end, we build a semi-natural dataset with pixel-wise labels that enables us to manipulate the ground truth (the evidence of prediction). We then train models on this dataset to curate unimportant features. To quantitatively evaluate the extent to which an interpretability method incorrectly attributes importance to these unimportant features, we define metrics around model dependence, input dependence, and input independence. Our results show that the ranking of interpretability methods differs across metrics; whether an interpretability method is "good" depends heavily on the type of mistake the final task is trying to avoid. Our contributions are as follows: 1) we release, to our knowledge, the first semi-natural image dataset with ground truth for evaluating interpretability methods; 2) we propose three complementary metrics focusing on false positive mistakes and use them to evaluate six local and one global interpretability method.
2 Related work
Prior work on evaluating interpretability methods generally falls into three categories: 1) measuring the (in)sensitivity of explanations to model and input perturbations, 2) verifying the correctness of explanations, and 3) evaluating explanations in a controlled setting with known ground truth. Our work shares characteristics with all three: given model ground truth, we measure when and how an interpretability method should be sensitive or insensitive to changes in the model or the input.
Evaluating sensitivity of explanations. This line of work measures large and small changes in interpretability methods' output when the model [2] or the input changes (within the input space [3] or adversarially [9]). Our work can be viewed as a "harder" test of interpretability methods than the randomization test of [2]. Rather than making features irrelevant to prediction by randomizing the weights, we make certain features more relevant to prediction than others and measure the response of interpretability methods. One of our proposed metrics also measures a notion of robustness in explanations (when attributions should stay invariant), similar to [3] and [9]. Our perturbation, however, is optimized to be semantically meaningful (e.g., it looks like a dog). Such perturbations are well suited to testing false positives in explanations, as they have a better chance of misleading humans.
Evaluating correctness of explanations. [16] and its subsequent variations [7, 4, 11] infer whether a feature attribution is correct by measuring the performance degradation when highly attributed features are removed. Our work also assesses explanation correctness, but we do not have to "infer" whether an attribution is correct, as ground truth is available by construction.
Evaluating with ground truth.
Most relevant to our work is the controlled experiment in TCAV [12], where the authors created a simple dataset and trained models on it such that the ground truth of important pixels is known. While [12] compared their method to pixel-based attribution methods (e.g., saliency maps), no quantitative metrics were given; their results were evaluated qualitatively by human subjects. Our work develops quantitative metrics for evaluating interpretability methods in a finer-grained setting with semi-natural images.
3 Dataset for Benchmarking Interpretability Methods (BIM)
The BIM framework has two components: 1) the BIM dataset and BIM models (trained on the BIM dataset) with known ground truth, and 2) BIM metrics that quantitatively evaluate interpretability methods (with or without the BIM models). In this section, we describe the dataset and models, which are open-sourced at https://github.com/google-research-datasets/bim.

Figure: The object neural network ($f_o$) is trained with object labels ($L_o$) and the scene neural network ($f_s$) is trained with scene labels ($L_s$).

3.1 Dataset construction

We construct the BIM dataset by selectively pasting object pixels (e.g., a dog) into scene images, as shown in Figure 2. The object pixels are gathered from MSCOCO [13] using its pixel-wise object labels. The scene images are from MiniPlaces [24]. Each resulting image has two labels: an object label ($L_o$) and a scene label ($L_s$). Either can be used to train a classifier. The BIM dataset has 10 object classes and 10 scene classes (a total of 100 object-scene class pairs). Each object is rescaled relative to the scene image and pasted onto the scene at a randomly chosen location. We refer to this set, in which every object class appears in every scene class, as $X_{100}$ (see Section 3.2). While some images in the BIM dataset may not look natural, this does not handicap our findings, as the only purpose of the object pixels is to identify regions of an image.
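To make the construction concrete, the following minimal sketch pastes a rescaled, masked object crop onto a scene at a random location and returns the image together with the pixel mask of the pasted object. The file paths, the rescale fraction, and the helper name are illustrative assumptions, not the released BIM pipeline.

```python
# Illustrative sketch (not the released BIM pipeline): paste a masked MSCOCO
# object crop onto a MiniPlaces scene at a random location.
import random
import numpy as np
from PIL import Image

def paste_object(scene_path, object_path, mask_path, rel_size=0.5, seed=None):
    rng = random.Random(seed)
    scene = Image.open(scene_path).convert("RGB")
    obj = Image.open(object_path).convert("RGB")
    mask = Image.open(mask_path).convert("L")  # pixel-wise object mask (0/255)

    # Rescale the object to a fraction of the scene's width (fraction is an assumption).
    target_w = max(1, int(scene.width * rel_size))
    target_h = max(1, int(obj.height * target_w / obj.width))
    obj = obj.resize((target_w, target_h))
    mask = mask.resize((target_w, target_h))

    # Pick a random location where the object fits inside the scene.
    x = rng.randint(0, max(0, scene.width - target_w))
    y = rng.randint(0, max(0, scene.height - target_h))
    scene.paste(obj, (x, y), mask)

    # Record the pasted region so the object pixels are known at evaluation time.
    placed_mask = np.zeros((scene.height, scene.width), dtype=bool)
    placed_mask[y:y + target_h, x:x + target_w] = np.array(mask) > 0
    return scene, placed_mask
```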
3.2 The definition of common feature and commonality
We define a common feature (CF) as a set of pixels forming a meaningful concept (e.g., an object) that commonly appears in one or more classes. The percentage of classes in which a CF appears is the commonality ($k$) of that CF. For example, a dog CF that appears in 5 out of 10 scene classes ($k = 50\%$) has a higher commonality than a dog CF that appears in only 1 out of 10 scene classes ($k = 10\%$). A classifier trained with a less common CF has a higher chance of using that CF as a signal for prediction; that CF is therefore more important. In the extreme case of $k = 100\%$ (the CF is present in all classes, also called a 100% CF), the CF is unimportant for prediction. By 'unimportant', we mean that when this CF is removed, predictions do not change significantly. By varying $k$, we can measure false positive responses of interpretability methods on both absolute and relative scales: 1) when interpretability methods should not respond to a CF at all, and 2) when they should respond more or less than at other times.

We define $X_k$ to be the set of images in which $k$ percent of the scene classes contain an object CF ($k$ is the commonality of this CF). For simplicity, we write $X_k$ for $X_{k\%}$. We create sets of data with varying degrees of commonality (i.e., $X_k$ for $k \in \{10, 20, \ldots, 100\}$).
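As a small worked example of how commonality is controlled, the sketch below selects which scene classes receive the dog CF for each $X_k$. All class names other than bamboo forest (used in Section 3.4) are placeholders.

```python
# Minimal sketch: choose which scene classes receive the dog CF when building
# X_k for commonality k. Class names other than "bamboo_forest" are placeholders.
SCENE_CLASSES = ["bamboo_forest"] + [f"scene_{i}" for i in range(1, 10)]  # 10 classes

def classes_with_dog_cf(k_percent: int) -> list:
    """Scene classes whose images get the dog CF pasted in X_k."""
    n = round(len(SCENE_CLASSES) * k_percent / 100)
    return SCENE_CLASSES[:n]  # X_10 -> ["bamboo_forest"], X_100 -> all classes

for k in range(10, 101, 10):
    print(f"X_{k}:", classes_with_dog_cf(k))
```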
3.3 Training classifiers with common features of 100% commonality
We train two classifiers using $X_{100}$. $f_s$ denotes the classifier trained with scene labels $L_s$, and $f_o$ the classifier trained with object labels $L_o$. Intuitively, $f_s$'s prediction can only be informed by scenes and not by objects, since every object appears uniformly across all scenes. In a sense, $f_s$ is encouraged to "ignore" objects, and vice versa for $f_o$.
Verifying the ground truth
We empirically verify this intuition in three ways: 1) confirming that $f_s$'s accuracy is maintained when objects are removed and $f_o$'s accuracy is maintained when scenes are removed, 2) showing a small Kullback-Leibler (KL) divergence between the logit-layer activations for an image with and without the CF, and 3) verifying that when only the CF is present in an image, accuracy is close to random guessing. For 1), we remove scenes by filling the background with grey pixels, yielding object-only images denoted $X_{obj}$; we remove objects while leaving the original scenes intact, yielding scene-only images denoted $X_{scene}$.

As shown in Figure 3 (left), the test accuracies of $f_s$ and $f_o$ stay roughly the same when their corresponding CF is removed (evaluating $f_s$ on $X_{scene}$ and $f_o$ on $X_{obj}$). The majority of correct predictions remain correct. Predictions are as good as random guessing (10%) when only the CF is kept in the images ($X_{obj}$ for $f_s$ and $X_{scene}$ for $f_o$). This confirms that the CF does not provide evidence for the prediction. The median KL divergence between the outputs for an image with and without the CF is very small when both predictions are correct (meaning the network treats the two images as very similar), compared to when one of the predictions is wrong. Note that we only use correctly predicted data points to evaluate interpretability methods, so that the classifier's mistakes do not propagate to the interpretability methods.
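The KL-divergence check in 2) can be sketched as follows; this is a minimal illustration assuming a hypothetical `model_logits` function that maps an image to its logit vector.

```python
# Sketch of verification step 2: KL divergence between the model's predictive
# distributions for an image with and without the common feature.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def cf_kl(model_logits, image_with_cf, image_without_cf):
    """KL between predictions with and without the common feature."""
    p = softmax(model_logits(image_with_cf))
    q = softmax(model_logits(image_without_cf))
    return kl_divergence(p, q)

# A small KL (with correct predictions on both inputs) indicates the model
# treats the two images as nearly identical, i.e., the CF carries no evidence.
```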

3.4 Training classifiers with common features of any % of commonality
We can create a more complex scenario by training classifiers on inputs with CFs of varying degrees of commonality. In $X_{10}$, we add the dog CF only to the bamboo forest class (a randomly chosen scene class). For the model trained on this set, removing the dog CF from bamboo forest images at test time causes a substantial drop in accuracy on the bamboo forest class (Figure 3). We create ten sets of data, $X_k$ for $k \in \{10, 20, \ldots, 100\}$, by adding the dog CF to progressively more scene classes, and we train one model for each set.
This setup should cause the relative importance of the dog CF for classifying bamboo forest to decrease as its commonality increases. We verify this by inspecting the test accuracy on bamboo forest with and without the dog CF for each model (Figure 3 (bottom)). With this setup, we can evaluate interpretability methods by the importance they assign to the dog CF: a method should assign higher importance to a CF that is less common.
4 Metrics for evaluation with and without the BIM dataset
We propose three complementary metrics to evaluate interpretability methods: model contrast score (MCS), input dependence rate (IDR), and input independence rate (IIR). These metrics aim to cover different aspects of false positives in interpretability methods by comparing a) two models trained to consider opposite concepts important (MCS), b) one model with two inputs containing different concepts (IDR), and c) one model with two functionally identical inputs (IIR). We provide formal definitions below.
Setup. First, we define a way to compute the importance that an interpretability method assigns to a concept. We denote by $e(f, x)$ the raw output of an interpretability method for a model $f$ and input $x$. For pixel attribution methods such as saliency maps, $e(f, x)$ has the same spatial dimensions as the input $x$. For concept attribution methods such as TCAV [12], the output is a single score per concept. Since the BIM dataset has pixel-wise labels for each concept (e.g., dog), calculating a concept-level attribution from a pixel attribution is straightforward. We denote by $m_c$ a binary mask whose pixels have value 1 inside the concept region $c$ and 0 everywhere else. Given an input image $x$ and a model $f$, the concept attribution $g_c(f, x)$ is defined as the attribution that $e$ assigns to the concept region:

$$g_c(f, x) = \sum_{i,j} \big( e(f, x) \odot m_c \big)_{i,j},$$

where $\odot$ denotes the element-wise product. We further define $\bar{g}_c(f, X)$ to be the average of $g_c(f, x)$ over a set $X$ of correctly classified inputs $x$.
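A minimal sketch of this computation, assuming the attribution map and the binary concept mask are NumPy arrays of the same spatial size and that only correctly classified inputs are passed in:

```python
# Sketch of the concept attribution g_c and its average over a set of inputs.
import numpy as np

def concept_attribution(attribution_map, concept_mask):
    """g_c(f, x): total attribution inside the concept region."""
    return float(np.sum(attribution_map * concept_mask))

def mean_concept_attribution(attribution_maps, concept_masks):
    """Average g_c over a set of (correctly classified) inputs."""
    scores = [concept_attribution(a, m) for a, m in zip(attribution_maps, concept_masks)]
    return float(np.mean(scores))
```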
Now we define our three metrics.
4.1 MCS: Model contrast score with BIM
Once we have a known CF with $k = 100\%$ (the CF appears in every class), one evaluation option is to directly compare methods' $\bar{g}_{CF}$ for that CF and expect it to be small. However, there is a catch: a meaningless method that always assigns 0 (unimportant) to all features would seem to perform well. Even without such a bogus method, two interpretability methods may operate on different attribution scales; the same value of $g_c$ can indicate an important feature in one method but an unimportant feature in another, so directly comparing $\bar{g}_{CF}$ could be misleading.
Thus, we define the model contrast score (MCS) as the difference in concept attributions between a model $f_1$ that considers a concept $c$ important and a model $f_2$ that considers $c$ unimportant:

$$\text{MCS} = \bar{g}_c(f_1, X) - \bar{g}_c(f_2, X).$$
The absolute MCS is computed by setting $f_1 = f_o$, $f_2 = f_s$, and selecting inputs from $X_{100}$. This measures how differently the object CF is attributed when it is the most important (for $f_o$) and the least important (for $f_s$). A higher contrast indicates a better interpretability method. We can also compute a relative MCS by setting $f_1$ to be one of the models trained with $X_k$ for $k \in \{10, \ldots, 90\}$ and $f_2$ to be the model trained with $X_{100}$. This yields a spectrum of contrast scores in which the object CF is important to different degrees.
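To make the definition concrete, here is a minimal sketch. The `explain_fn(model, x)` helper, which returns a pixel attribution map, is a hypothetical placeholder, and the images are assumed to be correctly classified.

```python
# Sketch of MCS: contrast the mean object-CF attribution between a model where
# the CF is important (f_o) and one where it is not (f_s trained on X_100).
import numpy as np

def mean_cf_attribution(explain_fn, model, images, cf_masks):
    """Average g_CF over correctly classified images for one model."""
    scores = [np.sum(explain_fn(model, x) * m) for x, m in zip(images, cf_masks)]
    return float(np.mean(scores))

def model_contrast_score(explain_fn, f_important, f_unimportant, images, cf_masks):
    """MCS = mean g_CF under the model where the CF matters minus where it does not."""
    return (mean_cf_attribution(explain_fn, f_important, images, cf_masks)
            - mean_cf_attribution(explain_fn, f_unimportant, images, cf_masks))

# Absolute MCS: model_contrast_score(explain_fn, f_o, f_s, x100_images, object_masks)
# Relative MCS: replace f_o with a model trained on X_k for k in {10, ..., 90}.
```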
4.2 IDR: Input dependence rate with BIM
While MCS measures the performance of interpretability methods across models, one may also be interested in how well each interpretability method performs on a single model. The input dependence rate (IDR) compares two sets of inputs, with and without a model's 100% CF. We expect an input with the CF to have a smaller $g_{CF}$ than an input without it, since ideally $g_{CF}$ would be close to zero for a 100% CF. For a correctly classified set $X$, we define the input dependence rate (IDR) as the percentage of images for which the CF is attributed as less important:

$$\text{IDR} = \frac{1}{|X|} \sum_{x \in X} \mathbb{1}\big[\, g_{CF}(f, x_{CF}) < g_{CF}(f, x_{\neg CF}) \,\big],$$

where $x_{CF}$ is an input containing the CF, $x_{\neg CF}$ is the same input without it, and $\mathbb{1}[\cdot]$ is the indicator function. In the BIM framework, we use $f = f_s$, $x_{CF} \in X_{100}$, and $x_{\neg CF} \in X_{scene}$ (the same image with the object removed). Intuitively, $1 - \text{IDR}$ is the false positive rate: out of 100 images, how many incorrectly highlight unimportant concepts, misleading human interpretation.
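A minimal sketch of this rate, again assuming a hypothetical `explain_fn` that returns a pixel attribution map:

```python
# Sketch of IDR: fraction of image pairs where the CF region receives LESS
# attribution when the CF is actually present.
import numpy as np

def input_dependence_rate(explain_fn, model, pairs, cf_masks):
    """pairs: list of (image_with_cf, image_without_cf); cf_masks: CF-region masks."""
    hits = 0
    for (x_cf, x_no_cf), m in zip(pairs, cf_masks):
        g_with = np.sum(explain_fn(model, x_cf) * m)
        g_without = np.sum(explain_fn(model, x_no_cf) * m)
        hits += int(g_with < g_without)
    return hits / len(pairs)

# 1 - IDR is the false positive rate of the method on the 100% CF.
```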
4.3 IIR: Input independence rate with BIM or any model
If input dependence captures when an interpretability method should "react" to two different inputs, the input independence rate (IIR) concerns when an interpretability method should not "react" to two different inputs. An interpretability method should return a similar output if what it is trying to explain (i.e., the model output) did not change. We create a "patch" (a small set of connected pixels) that minimizes the change in the logit layer when the patch is pasted onto an input image. We compute this patch, $\delta$, by optimizing the objective below with simple gradient descent:

$$\min_{\delta}\; \big\| h(x + \delta) - h(x) \big\|_2^2 \;+\; \eta(\delta) \;+\; \mathcal{R}(\delta),$$

where $h(\cdot)$ denotes the logit-layer activations, $\eta(\delta)$ is a term that avoids the trivial solution $\delta = 0$, and $\mathcal{R}(\delta)$ is a regularization term (see details in the Appendix). Note that this patch can be computed for any model whose gradients are accessible.
When $\delta$ has semantic meaning (i.e., humans recognize what the patch depicts), a false positive explanation is more dangerous, because the patch aligns with a human-friendly concept. It turns out that making this patch look like a concept (e.g., a dog) is possible. Figure 8 shows an example of an image with such a patch: while the change in the logit layer is small, the dog is clearly visible.
With this patch, we can calculate the input independence rate (IIR). We expect $g_{dog}$, the concept attribution of the patch region, to be similar for an input with and without the patch. Specifically, we expect $g_{dog}$ to change only within some visually imperceptible threshold $t$, above which humans would notice the difference in attribution. For a correctly classified set $X$, we compute the percentage of images for which the difference in $g_{dog}$ with and without the patch is less than $t$:

$$\text{IIR} = \frac{1}{|X|} \sum_{x \in X} \mathbb{1}\big[\, \big| g_{dog}(f, x + \delta) - g_{dog}(f, x) \big| < t \,\big].$$

Intuitively, $1 - \text{IIR}$ is the false positive rate: out of 100 images, how many incorrectly highlight a functionally unimportant concept, misleading human interpretation. Note that $t$ is application specific; if many images are mostly black, what is 'noticeable' differs from when most images are white. We can also measure the raw difference in $g_{dog}$ instead of the percentage, which is included in the Appendix.
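A minimal sketch of this rate, with the same hypothetical `explain_fn` as above:

```python
# Sketch of IIR: fraction of images whose patch-region attribution changes by
# less than a visually imperceptible threshold t when the functionally
# unimportant patch is added.
import numpy as np

def input_independence_rate(explain_fn, model, images, patch, patch_mask, t):
    hits = 0
    for x in images:
        g_clean = np.sum(explain_fn(model, x) * patch_mask)
        g_patched = np.sum(explain_fn(model, x + patch) * patch_mask)
        hits += int(abs(g_patched - g_clean) < t)
    return hits / len(images)

# 1 - IIR is the false positive rate on the functionally unimportant patch.
```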
5 Evaluating interpretability methods with and without BIM
Using the BIM dataset defined in Section 3 together with the BIM metrics defined in Section 4, we compare a set of existing interpretability methods. We find that our metrics are indeed complementary: a method such as Vanilla Gradient [19, 6, 5] can have high IDR and IIR but low MCS. We consider seven interpretability methods; some provide local explanations (i.e., they explain one data point at a time) and others provide global explanations (i.e., they explain a target class). For local interpretability methods, we include GradCAM (GC) [17], Vanilla Gradient (VG), SmoothGrad (SG) [20], Integrated Gradient (IG) [22], and Guided Backpropagation (GB) [21]. We also consider Gradient x Input (GxI), as many of the methods above use GxI to visualize their final explanations. Our saliency map visualization follows the procedure in [20], except that we only use positive pixel attributions when computing the evaluation metrics, because our work focuses on false positives. We use MCS to evaluate a global method (TCAV).

5.1 Evaluating with model contrast score
MCS offers two sets of evaluations—absolute scale evaluation with 100% CF and relative scale evaluation with any % CF. They result in similar rankings but offer different insights. Higher MCS indicates better methods.
Absolute MCS with 100% common feature


One image from the BIM dataset and its saliency maps are shown in Figure 5. Visual inspection reveals that most methods change in the right direction but to different degrees. To quantify this observation, MCS is computed over a large set of correctly classified test images. GC and TCAV have high MCS according to Figure 5. MCS is robust to the scale and location of objects (red bars). The baseline (yellow bars) is calculated using a random mask (see details in the Appendix). Note that MCS for TCAV is the difference between the TCAV scores for $f_o$ and $f_s$.
Relative MCS with common features of varying commonality


We compute MCS between the classifier trained on $X_{100}$ and the set of classifiers trained on $X_k$ for $k \in \{10, \ldots, 90\}$, as described in Section 3.4. The quantitative results in Figure 7 show that different methods follow the trend of the accuracy drop to different degrees as the dog CF becomes less important. The accuracy drop (dotted black line) is shown as the ground-truth trend.
As commonality increases from left to right in Figure 7 (top), TCAV and GC follow the trend of the accuracy drop most closely. Visual assessment of GC in Figure 7 tells the same story. We report the Pearson correlation coefficients between each method's relative MCS and the accuracy drop in Figure 7 (bottom). TCAV achieves the highest correlation, closely followed by GC. Visual inspection of Figure 7 and the quantitative results suggest that methods other than TCAV and GC change at a much smaller scale. In particular, GB changes minimally, with the edges of the dog always visible, similar to the findings in [2, 14].
5.2 Evaluating with input dependence rate
We now conduct a local evaluation using IDR. From the visualizations alone (Figure 8), it is hard to tell which method performs better, especially across many images. The quantitative IDR results in Figure 9 show that GC and VG attribute the CF correctly most often, i.e., they produce the fewest false positive explanations. Note that VG is simply the gradient; many other methods require computing VG as a first step. This means that the cheapest method of all offers nearly the best performance (consistent with the results in [2]). The baseline IDR is 50% (measured using random attributions). In applications where a low false positive rate is critical, GC and VG are clearly better choices than the other methods. TCAV is not applicable here, as it is a global method.
5.3 Evaluating with input independence rate
Finally, we present the IIR results. Most methods, with the exception of GC and VG, incorrectly identify the dog patch as important to the prediction for over 80% of the examples (Figure 9). This alarming result is consistent with the visual assessment in Figure 8: the dog is clearly highlighted by many methods, especially GB. This is in line with the findings of [2, 14], where some methods tend to always highlight edges.
Since the dog patch is highly visible in the image, interpretability methods that directly depend on the input (e.g., GxI and IG) are likely to reflect this functionally meaningless change in the input, consistent with the observations in [20, 18]. This observation calls into question the common practice of multiplying explanations by the input image for visualization.
The threshold $t$ is used to compute IIR (Section 4.3). We visually determined the value of $t$ above which one can see a difference in the attribution of the dog region for most methods. As a reminder, IIR with threshold $t$ is the percentage of inputs for which adding the dog patch does not change the attribution of the dog region by more than $t$. The hyperparameters used to compute this patch are given in the Appendix. TCAV is again not applicable, as it is a global method.
6 Conclusions
There is little point in providing false explanations; evaluating explanations is as important as developing interpretability methods. In this work, we take a step towards ground-truth-based evaluation of interpretability methods. We create and open-source a semi-natural image dataset (the BIM dataset), a set of models with known ground truth (BIM models), and three complementary metrics (BIM metrics) for evaluating interpretability methods. Our work is only a starting point; one can also develop metrics and setups for false negatives or for other measures of performance. We hope that developing ways to quantitatively evaluate interpretability methods helps practitioners choose the metrics and methods best suited to the application at hand.
References
- [1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
- [2] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian J. Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. In NeurIPS, 2018.
- [3] David Alvarez-Melis and Tommi S. Jaakkola. Towards robust interpretability with self-explaining neural networks. In NeurIPS, 2018.
- [4] Marco B Ancona, Enea Ceolini, Cengiz Oztireli, and Markus H. Gross. Towards better understanding of gradient-based attribution methods for deep neural networks. In ICLR, 2018.
- [5] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11:1803–1831, 2010.
- [6] Dumitru Erhan, Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. CoRR, 2009.
- [7] Ruth Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. CoRR, 2017.
- [8] James Genone and Tania Lombrozo. Concept possession, experimental semantics, and hybrid theories of reference. Philosophical Psychology, 25(5):717–742, 2012.
- [9] Amirata Ghorbani, Abubakar Abid, and James Y. Zou. Interpretation of neural networks is fragile. CoRR, abs/1710.10547, 2018.
- [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [11] Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim. Evaluating feature importance estimates. CoRR, abs/1806.10758, 2018.
- [12] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie J. Cai, James Wexler, Fernanda B. Viégas, and Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In ICML, 2018.
- [13] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014.
- [14] Weili Nie, Yonghui Zhang, and Ankit Patel. A theoretical explanation for perplexing behaviors of backpropagation-based visualizations. In ICML, 2018.
- [15] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Michael S. Bernstein, Li Fei-Fei, Alexander C. Berg, and Aditya Khosla. Imagenet large scale visual recognition challenge. Springer US, 2015.
- [16] Wojciech Samek, Alexander Binder, Gregoire Montavon, Sebastian Lapuschkin, and Klaus-Robert Müller. Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems, 28:2660–2673, 11 2017.
- [17] Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra. Grad-cam: Why did you say that? CoRR, abs/1611.07450, 2016.
- [18] Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, and Anshul Kundaje. Not just a black box: Learning important features through propagating activation differences. CoRR, abs/1605.01713, 2016.
- [19] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.
- [20] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda B. Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise. CoRR, abs/1706.03825, 2017.
- [21] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
- [22] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In ICML, 2017.
- [23] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
- [24] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
Appendix A Details of creating the dog patch for input independence
The loss function for creating the patch $\delta$ used in the input independence test has two additional regularization terms. The first, $R_{range}(\delta)$, penalizes pixel values of $x + \delta$ that fall outside the valid pixel range. The second, $R_{mask}(\delta) = \|(\mathbf{1} - m_p) \odot \delta\|_2^2$, minimizes updates to regions outside the patch region represented by the mask $m_p$ ($\mathbf{1}$ is a matrix of ones). The overall loss function is:

$$L(\delta) = \big\| h(x + \delta) - h(x) \big\|_2^2 \;+\; \eta(\delta) \;+\; \gamma_1 R_{range}(\delta) \;+\; \gamma_2 R_{mask}(\delta),$$

where $h(\cdot)$ denotes the logit-layer activations and $\eta(\delta)$ is the term that avoids the trivial solution $\delta = 0$ (Section 4.3). The update rule for $\delta$ is simple gradient descent:

$$\delta \leftarrow \delta - \alpha \nabla_{\delta} L(\delta),$$

where $\alpha$ is the step size. $\delta$ is initialized from a dog patch to obtain solutions that are semantically meaningful.
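A rough sketch of this optimization loop is given below, assuming a TensorFlow 2 style `logit_model` that maps an image batch to logits. The specific form of the non-trivial-solution term, the loss weights, the learning rate, and the pixel range are assumptions, not the released implementation.

```python
# Illustrative sketch of the patch optimization: gradient descent on delta so
# that the logits of x + delta stay close to those of x, with updates
# discouraged outside the patch region.
import tensorflow as tf

def optimize_patch(logit_model, x, patch_mask, dog_init, steps=500, lr=0.1,
                   w_nontrivial=1e-3, w_range=1.0, w_mask=1.0):
    x = tf.convert_to_tensor(x, tf.float32)          # [1, H, W, 3], values in [0, 1]
    m = tf.convert_to_tensor(patch_mask, tf.float32) # [1, H, W, 1], 1 inside patch
    delta = tf.Variable(tf.convert_to_tensor(dog_init, tf.float32) * m)  # dog-patch init
    base_logits = logit_model(x)

    for _ in range(steps):
        with tf.GradientTape() as tape:
            patched = x + delta
            logit_diff = tf.reduce_sum((logit_model(patched) - base_logits) ** 2)
            nontrivial = -w_nontrivial * tf.norm(delta)            # assumed form of eta(delta)
            out_of_range = w_range * tf.reduce_sum(
                tf.nn.relu(patched - 1.0) + tf.nn.relu(-patched))  # keep valid pixel values
            outside_patch = w_mask * tf.reduce_sum(((1.0 - m) * delta) ** 2)
            loss = logit_diff + nontrivial + out_of_range + outside_patch
        grad = tape.gradient(loss, delta)
        delta.assign_sub(lr * grad)
    return delta.numpy()
```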
Appendix B Discussions on the dog patch versus common feature
Note that there is a subtle difference between the dog patch generated by the optimization procedure above and the common feature (CF) obtained through training. Intuitively, to find the patch we move the original input image in a direction that is perpendicular to the gradient of the logits, while the gradient itself is fixed because the model is fixed. When a model is trained with a 100% CF, the gradient itself becomes small with respect to the CF pixels. This explains why we expect minimal attribution change for the dog patch in the input independence test, but expect small attribution for the dog CF in the input dependence test.
Appendix C Other measures for input independence
An alternative measure of input independence is the average perturbation in attribution when a functionally unimportant patch is added to the input:

$$\frac{1}{|X|} \sum_{x \in X} \big| g_{dog}(f, x + \delta) - g_{dog}(f, x) \big|.$$
Figure 10 shows the average perturbation over 100 images for each saliency method. Lower perturbation is better. The ranking is roughly the same as the input independence rate metric.

Appendix D DNN Architecture and training
All BIM models are ResNet50 [10] models. Training starts from an ImageNet-pretrained checkpoint [15] (available at https://github.com/tensorflow/models/tree/master/official/resnet), and all layers are fine-tuned on the BIM dataset. A randomly chosen 90% of the BIM dataset is used for training and the rest for testing. All models are implemented in TensorFlow [1] and trained on a single Nvidia Tesla V100 GPU.
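A hedged sketch of this fine-tuning setup follows; the Keras ResNet50 stands in for the checkpoint used in the paper, and the input size, optimizer, and hyperparameters are assumptions.

```python
# Sketch of fine-tuning a 10-class BIM classifier from ImageNet weights.
import tensorflow as tf

def build_bim_classifier(num_classes=10, input_shape=(224, 224, 3)):
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet", input_shape=input_shape, pooling="avg")
    backbone.trainable = True  # all layers are fine-tuned
    logits = tf.keras.layers.Dense(num_classes)(backbone.output)
    model = tf.keras.Model(backbone.input, logits)
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"])
    return model

# Train one model per label type / X_k split, e.g.:
# f_s = build_bim_classifier(); f_s.fit(train_ds_scene_labels, validation_data=test_ds)
```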
Appendix E Details of interpretability methods compared
We consider neural network models with an input $x$ and a predictor function $f$. A saliency method outputs a saliency map $e(f, x)$ highlighting regions relevant to the prediction. Below is an overview of the eight saliency methods evaluated in our work.
GradCAM (GC) [17] computes the gradient of the class logit with respect to the feature maps of the last convolutional layer of a DNN. Guided GradCAM (GGC) is GC combined with Guided Backpropagation via an element-wise product.
Vanilla Gradient (VG) [19, 6, 5] computes the gradient of the target class logit with respect to each input pixel, $e_{VG}(f, x) = \frac{\partial f(x)}{\partial x}$, reflecting how much the logit would change as the input changes within a small neighborhood.
SmoothGrad (SG) [20] reduces visual noise by averaging explanations over a set of noisy copies of the input: $e_{SG}(f, x) = \frac{1}{n} \sum_{i=1}^{n} e(f, x + z_i)$, where $z_i \sim \mathcal{N}(0, \sigma^2)$.
Integrated Gradient (IG) [22] sums gradients over a set of images scaled along the path between a baseline image $x_0$ and the input: $e_{IG}(f, x) = (x - x_0) \odot \int_0^1 \frac{\partial f(x_0 + \alpha (x - x_0))}{\partial x}\, d\alpha$. The smoothing from SG can be applied to IG to produce IG-SG.
Gradient x Input (GxI) computes an element-wise product between VG and the input image: $e_{GxI}(f, x) = x \odot \frac{\partial f(x)}{\partial x}$. [4] showed that GxI is equivalent to IG for a ReLU network with a zero baseline and no bias.
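A hedged sketch of three of these methods, matching the formulas above; the `model` callable (returning logits), the noise level, and the sample count are assumptions.

```python
# Vanilla gradient, SmoothGrad, and Gradient x Input for a target class.
import tensorflow as tf

def vanilla_gradient(model, x, target_class):
    x = tf.convert_to_tensor(x, tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        logit = model(x)[:, target_class]
    return tape.gradient(logit, x)

def smoothgrad(model, x, target_class, n=25, sigma=0.15):
    x = tf.convert_to_tensor(x, tf.float32)
    grads = [vanilla_gradient(model, x + tf.random.normal(tf.shape(x), stddev=sigma),
                              target_class) for _ in range(n)]
    return tf.add_n(grads) / n

def gradient_x_input(model, x, target_class):
    return tf.convert_to_tensor(x, tf.float32) * vanilla_gradient(model, x, target_class)
```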
Appendix F Details of computing TCAV scores
We compute the TCAV scores of the dog concept for different models (e.g. and for absolute contrast). To learn the dog CAV, we take 100 images from where the object is a dog as positive examples and 100 images from
as negative examples for the dog concept. We perform two-sided t-test of the TCAV scores, and reject scores where
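A hedged sketch of a TCAV score at one layer is given below; the activation and gradient helpers are hypothetical, and the released TCAV library differs in its details.

```python
# Sketch: learn a CAV from layer activations and score one layer.
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_cav(concept_acts, negative_acts):
    """CAV = unit vector pointing from negative toward concept activations."""
    X = np.concatenate([concept_acts, negative_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(negative_acts))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return clf.coef_[0] / np.linalg.norm(clf.coef_[0])

def tcav_score(cav, class_logit_grads):
    """Fraction of class examples with a positive directional derivative along the CAV."""
    directional = class_logit_grads @ cav
    return float(np.mean(directional > 0))

# concept_acts / negative_acts: flattened layer activations of the 100 dog and
# 100 non-dog images; class_logit_grads: gradients of the target class logit
# with respect to that layer, one row per test image.
```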
Appendix G Details of model contrast score baseline
For the model contrast score, we generate a random mask to calculate baseline differences. For TCAV, we obtain TCAV scores for the two models ($f_o$ and $f_s$) and report the difference between the two, both for the dog CAVs and for random CAVs.
Appendix H Full size relative model contrast figures


Appendix I Additional relative model contrast figures


Appendix J Additional input independence figures
