Cross-Model Consensus of Explanations and Beyond for Image Classification Models: An Empirical Study

09/02/2021 ∙ by Xuhong Li, et al. ∙ Nanyang Technological University, Baidu, Inc.

Existing interpretation algorithms have found that, even when deep models make the same correct prediction on the same image, they might rely on different sets of input features for classification. Among these sets of features, however, some common features might be used by the majority of models. In this paper, we ask what the common features used by various models for classification are, and whether models with better performance favor those common features. For this purpose, our work uses an interpretation algorithm to attribute the importance of features (e.g., pixels or superpixels) as explanations, and proposes the cross-model consensus of explanations to capture the common features. Specifically, we first prepare a set of deep models as a committee, then deduce the explanation for every model, and obtain the consensus of explanations across the entire committee through voting. With the cross-model consensus of explanations, we conduct extensive experiments using 80+ models on 5 datasets/tasks. We find three interesting phenomena: (1) the consensus obtained from image classification models is aligned with the ground truth of semantic segmentation; (2) we measure the similarity of each model's explanation result to the consensus (namely, the consensus score), and find positive correlations between the consensus score and model performance; and (3) the consensus score coincidentally correlates with interpretability.


1 Introduction

Deep models are well known for their excellent performance in many challenging domains, as well as for their black-box nature. To interpret the prediction of a deep model, a number of trustworthy interpretation algorithms (Bach et al., 2015; Zhou et al., 2016; Ribeiro et al., 2016; Smilkov et al., 2017; Sundararajan et al., 2017; Lundberg and Lee, 2017) have recently been proposed to attribute the importance of every input feature in a given sample with respect to the model's output. For example, given an image classification model, LIME Ribeiro et al. (2016) and SmoothGrad Smilkov et al. (2017) can attribute importance scores to every superpixel/pixel in an image with respect to the model's prediction. In this way, one can easily explain the classification result of a model on a data point by visualizing the important features used by the model for prediction.

The use of interpretation tools reveals that, even when deep models make the same correct prediction on the same image, they might rely on different sets of input features for classification. For example, our work uses LIME and SmoothGrad to explain a number of models trained on image classification tasks on the same set of images and obtains different explanations for these models, even when they all make correct predictions (shown later in Figure 2 and Figure 3). While these models have been shown to make the same prediction using different sets of features, we can still find that some common features might be used by the majority of models. We are thus particularly interested in two research questions: (1) What are the common features used by various models in an image? (2) Do models with better performance favor those common features?

To answer these two questions, we propose to study the common features across a number of deep models and to measure the similarity between the set of common features and the one used by every single model. Specifically, as illustrated in Figure 1, we generalize an electoral system: we first form a committee with a number of deep models, then obtain the explanations for a given image with a trustworthy interpretation algorithm, then call for voting to obtain the cross-model consensus of explanations (or, shortly, the consensus), and finally compute a similarity score between the consensus and the explanation result of each deep model, denoted as the consensus score. Through extensive experiments using 80+ models on 5 datasets/tasks, we find that (1) the consensus is aligned with the ground truth of image semantic segmentation; (2) a model in the committee with a higher consensus score usually performs better in terms of testing accuracy; and (3) models' consensus scores coincidentally correlate with their interpretability.

The contributions of this paper can be summarized as follows. To the best of our knowledge, this work is the first to investigate the common features that are used and shared by a large number of deep models for image classification, by leveraging interpretation algorithms. We propose the cross-model consensus of explanations to characterize the common features, and connect the consensus score to the performance and interpretability of a model. Finally, we obtain three observations from the experiments, with thorough analyses and discussions.

Figure 1: Illustration of the proposed framework, which consists of three steps: (1) prepare a set of trained models as a committee, (2) aggregate explanation results across the committee to get the consensus, and (3) compute the similarity score of each explanation to the consensus.

2 Related Work

We first review interpretation algorithms and the evaluation approaches for their trustworthiness. To visualize the activated subregions of intermediate-layer feature maps, many algorithms have been proposed to interpret convolutional networks Zhou et al. (2016); Selvaraju et al. (2020); Chattopadhay et al. (2018); Wang et al. (2020a). Apart from investigating the inside of complex deep networks, simple linear or tree-based surrogate models have been used as "out-of-box explainers" to explain the predictions made by the deep model over the dataset through local or global approximations Ribeiro et al. (2016); van der Linden et al. (2019); Ahern et al. (2019); Zhang et al. (2019). Instead of using surrogates for deep models, algorithms such as SmoothGrad (Smilkov et al., 2017), Integrated Gradients (Sundararajan et al., 2017), and DeepLIFT (Shrikumar et al., 2017), among others, have been proposed to estimate the input feature importance with respect to the model predictions. Note that there are many other interpretation algorithms; in this paper, we mainly discuss the ones that are related to feature attributions and suitable for deep image classification models. Evaluations of the trustworthiness of interpretation algorithms aim to verify that the algorithms do not mislead the understanding of models' behaviors. For example, Adebayo et al. (2018) have found, by randomizing the parameters of models, that some algorithms are independent of both the model and the data-generating process. Other evaluation approaches include perturbation of important features Samek et al. (2016); Petsiuk et al. (2018); Vu et al. (2019); Hooker et al. (2019), model trojaning attacks Chen et al. (2017a); Gu et al. (2017); Lin et al. (2020), infidelity and sensitivity to similar samples in the neighborhood Ancona et al. (2018); Yeh et al. (2019), a crafted dataset Yang and Kim (2019), and user-study experiments Lage et al. (2019); Jeyakumar et al. (2020).

From an orthogonal perspective, evaluations across models are also needed for building more interpretable and explainable AI systems. However, evaluations across deep models are scarce. Bau et al. (2017) proposed Network Dissection, which builds an additional dataset with dense annotations of a number of visual concepts for evaluating the interpretability of convolutional neural networks. Given a convolutional model, Network Dissection recovers the intermediate-layer feature maps used by the model for classification, and then measures the overlap between the activated subregions in the feature maps and the densely human-labeled visual concepts to estimate the interpretability of the model. Another common solution to evaluation across deep models is user-study experiments (Doshi-Velez and Kim, 2017).

In this paper, we do not directly evaluate the interpretability of deep models; rather, based on the proposed framework, we show experimentally that the consensus score is positively correlated with the generalization performance of deep models and coincidentally related to their interpretability. We discuss more details with analyses later. We believe that, based on the explanations, our proposed framework and the consensus score can help to better understand deep models.

3 Framework of Cross-Model Consensus of Explanations

Input: a dataset and an interpretation algorithm.
/* Step 1: Committee Formation with Deep Models */
Train deep models on the dataset to form the committee.
/* Step 2: Committee Voting for Consensus Achievement */
for each example in the dataset do
      obtain the explanation of each model in the committee via the interpretation algorithm;
      aggregate the explanations across the committee by voting to reach the consensus for the example.
end for
/* Step 3: Consensus-based Similarity Score */
for each model in the committee do
      compute, for each example, the similarity between the model's explanation and the consensus.
end for
For each model, the overall consensus score is the average of its similarity scores over all examples in the dataset.
Algorithm 1 Framework Pseudocode.

In this section, we introduce the proposed approach that generalizes the electoral system to provide the consensus of explanations across various deep models. Specifically, the proposed framework consists of three steps, as detailed in the following.

Step 1: Committee Formation with Deep Models. Given a number of deep models trained for a target task (image classification in our experiments) on a visual dataset where each image contains one main object, the approach first forms the given deep models into a committee; the variety of models in the committee establishes the consensus used for comparisons and evaluations.

Step 2: Committee Voting for Consensus Achievement. With the committee of deep models and the task to be explained, the proposed framework leverages a trustworthy interpretation tool, e.g., LIME (Ribeiro et al., 2016) or SmoothGrad (Smilkov et al., 2017) in this paper, to obtain the explanation of every model on every image in the dataset. Given a sample from the dataset, we collect the explanation results of all models in the committee. We then propose a voting procedure that aggregates these explanation results to reach the cross-model consensus of explanations, i.e., the consensus, for that sample. Specifically, each element of the consensus is obtained by averaging the corresponding elements of the (normalized) explanation results across all models, for both LIME and SmoothGrad, following the conventional normalization-averaging procedure Ribeiro et al. (2016); Ahern et al. (2019); Smilkov et al. (2017). In the end, the consensus is reached for every sample in the target dataset based on committee voting.
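As an illustrative sketch (not the paper's exact formula), the voting step can be written as per-model normalization followed by element-wise averaging of the explanation vectors; the unit-L2 normalization here is our assumption:

```python
import numpy as np

def consensus(explanations):
    """Aggregate per-model explanations into a cross-model consensus.

    explanations: array-like of shape (M, d), one importance vector per
    committee model for the same image (superpixel weights for LIME,
    flattened pixel saliency for SmoothGrad). Each vector is normalized
    to unit L2 norm before averaging so that no single model dominates.
    """
    E = np.asarray(explanations, dtype=float)
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    E = E / np.clip(norms, 1e-12, None)  # per-model normalization
    return E.mean(axis=0)                # committee vote by averaging

# toy committee: 3 models, 4 features; all agree feature 0 matters most
votes = [[1.0, 0.5, 0.0, 0.0],
         [0.8, 0.6, 0.1, 0.0],
         [0.9, 0.4, 0.0, 0.1]]
c = consensus(votes)
```

Features that most models vote for survive the averaging with large values, while model-specific features are diluted.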

Step 3: Consensus-based Similarity Score. Given the consensus, the approach calculates the consensus score of every model in the committee as the similarity between the explanation result of that model and the consensus. Specifically, for explanations and consensus based on LIME (visual feature importance at the superpixel level), cosine similarity between the flattened explanation vector of each model and the consensus is used. For results based on SmoothGrad (visual feature importance at the pixel level), a similar procedure is followed, with the Radial Basis Function (RBF) used for the similarity measurement. The difference in similarity computations is due to the facts that (1) the dimensions of LIME explanations vary across samples while they are invariant for SmoothGrad explanations, and (2) the scales of LIME explanation results vary much more than those of SmoothGrad; thus cosine similarity is more suitable for LIME while RBF is for SmoothGrad. Eventually, the framework computes a quantitative but relative score for each model in the committee using its similarity to the consensus.
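A minimal sketch of both similarity measures, assuming flattened explanation vectors; the `gamma` bandwidth of the RBF is an illustrative value we introduce, not one fixed by the paper:

```python
import numpy as np

def cosine_score(e, c):
    """Consensus score for superpixel-level (LIME-style) explanations:
    cosine similarity, which is insensitive to the overall scale of e."""
    e, c = np.asarray(e, float), np.asarray(c, float)
    return float(np.dot(e, c) / (np.linalg.norm(e) * np.linalg.norm(c) + 1e-12))

def rbf_score(e, c, gamma=1.0):
    """Consensus score for pixel-level (SmoothGrad-style) explanations:
    RBF kernel exp(-gamma * ||e - c||^2), which compares raw values."""
    diff = np.asarray(e, float) - np.asarray(c, float)
    return float(np.exp(-gamma * np.sum(diff ** 2)))

e = np.array([1.0, 0.5, 0.0])  # one model's explanation
c = np.array([0.9, 0.6, 0.1])  # the consensus
s_lime = cosine_score(e, c)    # near 1 when the directions agree
s_sg = rbf_score(e, c)         # near 1 when the values are numerically close
```

Both scores are bounded by 1, reached when the explanation coincides with the consensus.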

For further clarity, these three steps of the proposed framework are illustrated in Figure 1 and formalized in Algorithm 1, with more details in the appendix.

4 Overall Experiments and Results

In this section, we start by introducing the experiment setups. We use image classification as the target task and follow the proposed framework to obtain the consensus and compute the consensus scores. Through the experiments, we have found (1) an alignment between the consensus and image semantic segmentation, (2) positive correlations between the consensus score and model performance, and (3) coincidental correlations between the consensus score and model interpretability. We end this section with robustness analyses of the framework.

4.1 Evaluation Setups

Datasets.

For overall evaluations and comparisons, we use ImageNet (Deng et al., 2009) for general visual object recognition and CUB-200-2011 (Welinder et al., 2010) for bird recognition. Note that ImageNet provides the class label for every image, while the CUB-200-2011 dataset includes both the class label and a pixel-level segmentation of the bird in every image; these pixel annotations of visual objects are found to be aligned with the consensus.

Models. For fair comparisons, we use more than 80 deep models trained on ImageNet that are publicly available (https://github.com/PaddlePaddle/models/blob/release/1.8/PaddleCV/image_classification/README_en.md#supported-models-and-performances). We also derive models on the CUB-200-2011 dataset through standard fine-tuning procedures. In our experiments, these models form two committees, based on ImageNet and CUB-200-2011 respectively. Both target the image classification task, with each image labeled with one category.

Interpretation Algorithms. As previously introduced, we consider two interpretation algorithms, LIME (Ribeiro et al., 2016) and SmoothGrad (Smilkov et al., 2017). Specifically, LIME produces explanations by assigning visual feature importance to superpixels (Vedaldi and Soatto, 2008), while SmoothGrad outputs explanations as visual feature importance over pixels. In this way, we can validate the flexibility of the proposed framework over explanation results from diverse sources (i.e., linear surrogates vs. input gradients) and at multiple granularities (i.e., feature importance at superpixel/pixel levels).

4.2 Alignment between the Consensus and Image Segmentation

Figure 2: Visual comparisons between the consensus and the interpretation results of CNNs using LIME (top row) and SmoothGrad (bottom row), based on an image from ImageNet, where the ground truth of segmentation is not available.
Figure 3: Visual comparisons between the consensus and the explanation results of deep models using LIME (top row) and SmoothGrad (bottom row), based on an image from CUB-200-2011, where the ground truth of segmentation is available as pixel-wise annotations and the mean Average Precision (mAP) is measured.

The image segmentation task seeks pixel-wise classification of images. The cross-model consensus of explanations for image classification is well aligned with image segmentation, especially when only one main object is contained in the image. This partially demonstrates the effectiveness of most deep models in extracting visual objects from input images. We show two examples using both LIME and SmoothGrad in Figures 2 and 3, from ImageNet and CUB-200-2011 respectively. More examples can be found in the appendix.

To quantitatively demonstrate the alignment, we compute the Average Precision (AP) score between the cross-model consensus of explanations and the image segmentation ground truth on CUB-200-2011, where the latter is available. We further take the mean of the AP scores (mAP) over the dataset to compare with the overall consensus scores. Figure 4 shows the results, where the consensus achieves higher mAP scores than any individual network. Both the quantitative results and the visual comparisons validate the closeness of the consensus to the ground truth of image segmentation.
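One standard way to compute such an AP score treats every pixel's consensus value as a detection score for the object class; this is our sketch of the computation, and the paper's exact evaluation protocol may differ:

```python
import numpy as np

def average_precision(seg_mask, consensus_map):
    """AP between a binary segmentation ground truth and a consensus
    heatmap: pixels are ranked by their consensus score, and precision
    is averaged over the ranks of the true object pixels. Assumes the
    mask contains at least one positive pixel."""
    y = np.asarray(seg_mask).ravel().astype(bool)
    s = np.asarray(consensus_map).ravel()
    order = np.argsort(-s)                 # rank pixels by consensus score
    y = y[order]
    tp = np.cumsum(y)                      # true positives at each rank
    precision = tp / np.arange(1, len(y) + 1)
    return float(precision[y].mean())      # mean precision at each positive

# toy 2x2 image: the object fills the left column, and the consensus agrees
ap = average_precision([[1, 0], [1, 0]], [[0.9, 0.1], [0.8, 0.2]])
```

A heatmap that ranks all object pixels above all background pixels scores AP = 1; a heatmap that highlights the background instead scores much lower.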

(a) LIME
(b) SmoothGrad
Figure 4: Correlation between model performance and mAP to the segmentation ground truth using (a) LIME and (b) SmoothGrad with CUB-200-2011 over 85 models. Pearson correlation coefficients are 0.927 (p-value 4e-37) for LIME and 0.916 (p-value 9e-35) for SmoothGrad. The "Consensus" points here refer to the testing accuracy of the ensemble of networks in the committee, by probability averaging and voting (y-axis), and the mAP between the consensus and the ground truth (x-axis).

4.3 Positive Correlations between Consensus Scores and Model Performance

(a) LIME on ImageNet
(b) SmoothGrad on ImageNet
(c) LIME vs. SmoothGrad on ImageNet
(d) LIME on CUB
(e) SmoothGrad on CUB
(f) LIME vs. SmoothGrad on CUB
Figure 5: Model performance vs. similarity to the consensus using LIME (a,d) and SmoothGrad (b,e) over 81 models on ImageNet (a,b) and 85 models on CUB-200-2011 (d,e). The third column shows the similarity to the consensus of SmoothGrad interpretations vs. the similarity to the consensus of LIME interpretations for the ImageNet committee (c) and the CUB-200-2011 committee (f). Pearson correlation coefficients are (a) 0.8087, (b) 0.783, (c) 0.825, (d) 0.908, (e) 0.880 and (f) 0.854. For conciseness, networks in the same family are represented by the same symbol.

Figure 5 shows positive correlations between the similarity score to the consensus (x-axis) and model performance (y-axis). Specifically, in Figure 5 (a-b) and (d-e), we present the results using LIME (a,d) and SmoothGrad (b,e) on ImageNet (a,b) and CUB-200-2011 (d,e). All correlations here are strong and pass significance tests, though exceptions exist in some local areas of the correlation plots. We can therefore conclude that, in an overall manner, the evaluation results based on the consensus score using both LIME and SmoothGrad over the two datasets correlate with model performance with significance. More experiments on other datasets with random subsets of deep models will be shown in Figure 7 (Section 4.5).

4.4 “Coincidental” Correlations between Consensus Scores and Model Interpretability

Deep model interpretability measures the ability to present a model's behavior in terms understandable to a human (Doshi-Velez and Kim, 2017). While there are no formal and agreed-upon measurements for interpretability evaluation, two evaluation methods, i.e., Network Dissection (Bau et al., 2017) and user-study experiments, are quite common for this purpose. Though the proposed framework and the consensus scores are based on explanation results, they do not directly estimate model interpretability. Nevertheless, in this subsection, we present the coincidental correlations between the consensus scores and the interpretability measurements.

DenseNet161 ResNet152 VGG16 GoogleNet AlexNet
Network Dissection 2 1 3 4 5
User-Study Evaluations 1 (1.715) 2 (1.625) 3 (1.585) 4 (1.170) 5 (0.840)
Consensus (LIME) 1 (0.849) 2 (0.846) 3 (0.821) 4 (0.734) 5 (0.594)
Consensus (SmoothGrad) 1 (0.038) 2 (0.037) 3 (0.030) 4 (0.026) 5 (0.021)
Table 1: Rankings (and scores) of five deep models, evaluated by Network Dissection (Bau et al., 2017), user-study evaluations, and the proposed framework with LIME and SmoothGrad.

Consensus versus Network Dissection. We compare the results of the proposed framework with the interpretability evaluation solution Network Dissection (Bau et al., 2017). On the Broden dataset, Network Dissection reported a ranking list of five models (w.r.t. model interpretability), shown in Table 1, by counting the semantic neurons, where a neuron is defined as semantic if its activated feature maps overlap with human-annotated visual concepts. Based on the proposed framework, we report the consensus scores using LIME and SmoothGrad in Table 1, which are consistent with Figure 5 (a, LIME) and (b, SmoothGrad). The three ranking lists are almost identical, except for the comparison between DenseNet161 and ResNet152: in both lists based on the consensus score, DenseNet161 is similar to ResNet152 with a marginally higher consensus score, while Network Dissection considers ResNet152 more interpretable than DenseNet161.

We believe the results from our proposed framework and Network Dissection are close enough from the perspective of ranking lists. The difference may be caused by the different ways in which our framework and Network Dissection perform the evaluations. The consensus score measures the similarity to the consensus of explanations on images, while Network Dissection counts the number of neurons in the intermediate layers activated by all the visual concepts, including objects, object parts, colors, materials, textures, and scenes. Furthermore, Network Dissection evaluates the interpretability of deep models using the Broden dataset with densely labeled visual objects and patterns (Bau et al., 2017), while the consensus score needs no additional datasets or ground truth of semantics. The results of our proposed framework and Network Dissection might thus be slightly different.

Consensus versus User-Study Evaluations. To further validate the effectiveness of the proposed framework, we have also conducted user-study experiments on these five models and report the results in the second row of Table 1. See the appendix for the experimental settings of the user-study evaluations. This confirms that our proposed framework is capable of approximating model interpretability.

4.5 Robustness Analyses of Consensus

In this subsection, we investigate several factors that might affect the evaluation results with the consensus, including the choice of basic interpretation algorithm (e.g., LIME or SmoothGrad), the size of the committee, and the candidate pool of models for the committee.

Consistency between LIME and SmoothGrad. Even though the granularities of the explanation results from LIME and SmoothGrad are different, which causes a mismatch in the mAP scores to the segmentation ground truth, the consensus scores based on the two algorithms are generally consistent. This consistency is confirmed by Figure 5 (c, f), where the overall results based on LIME are strongly correlated with those of SmoothGrad over all models on both datasets. This shows that the proposed framework can work well with a wide spectrum of basic interpretation algorithms.

Consistency of Cross-Committee Evaluations. In real-world applications, committee-based estimations and evaluations may produce inconsistent results from one committee to another. In this work, we are interested in whether the consensus score estimations are consistent against changes of the committee. Given 16 ResNet models as the targets, we form 20 independent committees by combining the 16 ResNet models with 10–20 models randomly drawn from the rest of the networks. In each of these 20 independent committees, we compute the consensus scores of the 16 ResNet models. We then estimate the Pearson correlation coefficients between each of these 20 results and the one in Figure 5 (a); the mean correlation coefficient is 0.96 with a standard deviation of 0.04. Thus, we can say the consensus score evaluation is consistent across randomly picked committees.
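The consistency check itself is easy to reproduce in spirit. The sketch below assumes a toy noise model for committee-induced variation of the scores (the real variation comes from re-running Steps 2 and 3 under different committees); the numbers 16 and 20 follow the experiment above:

```python
import numpy as np

def pearson(x, y):
    """Plain Pearson correlation coefficient between two score vectors."""
    x = np.asarray(x, float) - np.mean(x)
    y = np.asarray(y, float) - np.mean(y)
    return float((x * y).sum() / (np.sqrt((x ** 2).sum() * (y ** 2).sum()) + 1e-12))

rng = np.random.default_rng(0)
base = rng.random(16)          # reference consensus scores of the 16 target models
coeffs = []
for _ in range(20):            # 20 random committees
    # hypothetical committee-induced perturbation of the scores
    perturbed = base + 0.02 * rng.standard_normal(16)
    coeffs.append(pearson(base, perturbed))
mean_r = float(np.mean(coeffs))
```

When the committee-induced perturbation is small relative to the spread of the scores, the per-committee rankings stay almost identical, which is the behavior the reported 0.96 mean coefficient reflects.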

Figure 6: Convergence of mAP between the ground truth and the consensus results based on committees of increasing sizes, using LIME on CUB-200-2011. The green lines and orange triangles are, respectively, the mean values and the median values of 20 random trials. The red dashed line is the mAP of the consensus reached by the complete committee of the original 85 models.

Convergence over Committee Sizes. To understand the effect of the committee size on the consensus score estimation, we run the proposed framework using committees of various sizes formed by deep models randomly picked from the pools. In Figure 6, we plot and compare the performance of the consensus with increasing committee sizes, where we estimate the mAP between the ground truth and the consensus reached by random committees of different sizes; 20 random trials are run for every size independently. The curve of mAP quickly converges to that of the complete committee: the consensus based on a small proportion of the committee (e.g., 15 networks) works well enough even compared to the complete committee of 85 networks.
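A toy simulation of this convergence, under the assumption that each model's explanation is a shared signal plus independent model-specific noise (purely illustrative; the paper measures mAP against real segmentation masks):

```python
import numpy as np

rng = np.random.default_rng(1)
signal = rng.random(100)  # the 'common features' shared across models
# 85 models: shared signal plus model-specific noise
models = signal + 0.5 * rng.standard_normal((85, 100))
full_consensus = models.mean(axis=0)  # consensus of the complete committee

def cos(a, b):
    """Cosine similarity between two flattened consensus maps."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# consensus of random sub-committees vs. the full-committee consensus
sims = {}
for k in (5, 15, 45):
    trials = [cos(models[rng.choice(85, k, replace=False)].mean(axis=0),
                  full_consensus)
              for _ in range(20)]   # 20 random trials per committee size
    sims[k] = float(np.mean(trials))
```

Averaging over more models cancels more of the model-specific noise, so the sub-committee consensus approaches the full-committee consensus quickly as the size grows, mirroring the convergence in Figure 6.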

Applicability with Random Committees over More Datasets. To demonstrate the applicability of the proposed framework, we extend our experiments using networks randomly picked from the pool to other datasets, including Stanford Cars 196 (Krause et al., 2013), Oxford Flowers 102 (Nilsback and Zisserman, 2008) and Foods 101 (Bossard et al., 2014). Dataset descriptions and experimental details are included in the appendix. The results in Figure 7 confirm that the positive correlations between the consensus score and model performance hold for a wide range of models on diverse datasets/tasks.

Figure 7: Model performance vs. the consensus scores using LIME on Stanford Cars 196 (Krause et al., 2013), Oxford Flowers 102 (Nilsback and Zisserman, 2008) and Foods 101 (Bossard et al., 2014). Pearson correlation coefficients are 0.9522, 0.8785 and 0.9134 respectively.

5 Discussions: Limits and Potentials with Future Works

Limits. In this section, we would like to discuss several limits of our studies. First of all, we propose to study the features used by deep models for classification, but we rely on the explanation results (i.e., the importance of superpixels/pixels in the image for prediction) obtained by interpretation algorithms. Obviously, the correctness of the interpretation algorithms might affect our results. However, we use two independent algorithms, LIME Ribeiro et al. (2016) and SmoothGrad Smilkov et al. (2017), which attribute feature importance at two different scales, i.e., superpixels and pixels. Both algorithms lead to the same observations and conclusions (see Section 4.5 for the consistency between the results obtained by LIME and SmoothGrad). Thus, we believe the interpretation algorithms here are trustworthy and it is appropriate to use explanation results as a proxy to analyze features. For future research, we would include more advanced interpretation algorithms to confirm our observations.

We obtain some interesting observations from our experiments and draw conclusions using multiple datasets. However, the image classification datasets used in our experiments have a limitation: every image contains only one visual object for classification. It is reasonable to doubt that, when multiple visual objects (other than the target for classification) and complicated visual patterns in the background Koh and Liang (2017); Chen et al. (2017a) co-exist in an image, the cross-model consensus of explanations may no longer overlap with the ground truth semantic segmentation. Indeed, we include an example from the COCO dataset Lin et al. (2014) in the appendix, where multiple objects co-exist in the image and the consensus may not always match the segmentation. Our future work will focus on datasets with multiple visual objects and complicated backgrounds for object detection, segmentation, and multi-label classification tasks.

Finally, only well-known models with good performance have been included in the committee. This certainly brings some bias into our analysis. In practice, however, these models are among the first choices and are frequently used in many applications, which makes them relevant. In our future work, we would include more models with diverse performance to seek further observations.

Potentials. In addition to the limits, our work also demonstrates several potentials of the cross-model consensus of explanations for further studies. As shown in Figure 6, with a larger committee, the consensus slowly converges to a stable set of common features that clearly aligns with the segmentation ground truth of the dataset. This experiment further demonstrates the capacity of the consensus to precisely locate the visual objects for classification. Thus, in our future work, we would like to use the consensus based on a committee of image classification models to detect the positions of visual objects in images.

Furthermore, our experiments with both interpretation algorithms on all datasets have found that consensus scores are "coincidentally" correlated with the interpretability scores of the models, even though the interpretability scores were evaluated in totally different ways: network dissection Bau et al. (2017) and user studies. Network dissection evaluates the interpretability of a model by matching its activation maps in intermediate layers with the ground truth segmentation of visual concepts in the image. A model with higher interpretability should have more convolutional filters activated at the visual patterns/objects for classification. We therefore also measure the similarity between the explanation results obtained for every model and the segmentation ground truth of images, and find that the models' segmentation-explanation similarity significantly correlates with their consensus scores (see Figure 8). This observation encourages us to further study the connections between interpretability and consensus scores in future work.

(a) Using LIME
(b) Using SmoothGrad
Figure 8: Correlation between mAP scores to the segmentation ground truth and the consensus scores using (a) LIME and (b) SmoothGrad with the CUB-200-2011 dataset over 85 models (of the committee). Pearson correlation coefficients are 0.885 (with p-value 3e-29) for LIME and 0.906 (with p-value 8e-33) for SmoothGrad.

6 Conclusion

In this paper, we study the common features shared by various deep models for image classification. We ask (1) what the common features are, and (2) whether the use of common features improves performance. Specifically, given the explanation results obtained by interpretation algorithms, we propose to aggregate the explanation results from different models and obtain the cross-model consensus of explanations through voting. To understand the features used by every model and the common ones, we measure the consensus score as the similarity between the consensus and the explanation of every model.

Our empirical studies based on extensive experiments using 80+ deep models on 5 datasets/tasks find that (i) the consensus aligns with the ground truth semantic segmentation of the visual objects for classification; (ii) models with higher consensus scores enjoy better testing accuracy; and (iii) the consensus scores coincidentally correlate with the interpretability scores obtained by network dissection and user evaluations. In addition to the main claims, we also include additional experiments to demonstrate the robustness of the consensus, including the alternative use of LIME and SmoothGrad and their effects on the results/conclusions, the consistency of the consensus achieved by different groups of deep models, the fast convergence of the consensus with an increasing number of deep models in the committee, and the random selection of deep models as the committee for consensus-based evaluation on other datasets. All these studies confirm the applicability of the consensus as a proxy to study and analyze the common features shared by different models. Several open issues and potentials have been discussed, with future directions introduced. We are thus encouraged to further adopt the consensus and consensus scores to better understand the behaviors of deep models.

References

  • J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim (2018) Sanity checks for saliency maps. In Advances in Neural Information Processing Systems (NeurIPS), pp. 9505–9515. Cited by: §2.
  • I. Ahern, A. Noack, L. Guzman-Nateras, D. Dou, B. Li, and J. Huan (2019) NormLime: a new feature importance metric for explaining deep neural networks. arXiv preprint arXiv:1909.04200. Cited by: §2, §3.
  • M. Ancona, E. Ceolini, C. Öztireli, and M. Gross (2018) Towards better understanding of gradient-based attribution methods for deep neural networks. In International Conference on Learning Representations (ICLR), Cited by: §2.
  • S. Bach, A. Binder, G. Montavon, F. Klauschen, K. Müller, and W. Samek (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one 10 (7), pp. e0130140. Cited by: §1.
  • D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba (2017) Network dissection: quantifying interpretability of deep visual representations. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), pp. 6541–6549. Cited by: §2, §4.4, §4.4, §4.4, Table 1, §5.
  • L. Bossard, M. Guillaumin, and L. Van Gool (2014) Food-101 – mining discriminative components with random forests. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 446–461. Cited by: §B.1, Figure 7, §4.5.
  • A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian (2018) Grad-cam++: generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 839–847. Cited by: §2.
  • X. Chen, C. Liu, B. Li, K. Lu, and D. Song (2017a) Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526. Cited by: §2, §5.
  • Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng (2017b) Dual path networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4467–4475. Cited by: Appendix D.
  • F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1251–1258. Cited by: Appendix D.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. Cited by: §4.1.
  • X. Ding, Y. Guo, G. Ding, and J. Han (2019) Acnet: strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1911–1920. Cited by: Appendix D.
  • F. Doshi-Velez and B. Kim (2017) Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608. Cited by: §2, §4.4.
  • S. Gao, M. Cheng, K. Zhao, X. Zhang, M. Yang, and P. H. Torr (2019) Res2net: a new multi-scale backbone architecture. Cited by: Appendix D.
  • T. Gu, B. Dolan-Gavitt, and S. Garg (2017) Badnets: identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733. Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Appendix D.
  • S. Hooker, D. Erhan, P. Kindermans, and B. Kim (2019) A benchmark for interpretability methods in deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 9737–9748. Cited by: §2.
  • A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: Appendix D.
  • A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al. (2019) Searching for mobilenetv3. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Appendix D.
  • J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Appendix D.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4700–4708. Cited by: Appendix D.
  • F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer (2016) SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360. Cited by: Appendix D.
  • J. V. Jeyakumar, J. Noor, Y. Cheng, L. Garcia, and M. Srivastava (2020) How can i explain this to you? an empirical study of deep neural network explanation methods. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • P. W. Koh and P. Liang (2017) Understanding black-box predictions via influence functions. In International Conference on Machine Learning (ICML), pp. 1885–1894. Cited by: §5.
  • J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3D object representations for fine-grained categorization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 554–561. Cited by: §B.1, Figure 7, §4.5.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: Appendix D.
  • I. Lage, E. Chen, J. He, M. Narayanan, B. Kim, S. Gershman, and F. Doshi-Velez (2019) An evaluation of the human-interpretability of explanation. arXiv preprint arXiv:1902.00006. Cited by: §2.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 740–755. Cited by: Figure 14, Appendix F, §5.
  • Y. Lin, W. Lee, and Z. B. Celik (2020) What do you see? evaluation of explainable artificial intelligence (xai) interpretability through neural backdoors. arXiv preprint arXiv:2009.10639. Cited by: §2.
  • H. Liu, K. Simonyan, and Y. Yang (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: Appendix D.
  • S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4765–4774. Cited by: §1.
  • N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131. Cited by: Appendix D.
  • M. Nilsback and A. Zisserman (2008) Automated flower classification over a large number of classes. In Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. Cited by: §B.1, Figure 7, §4.5.
  • V. Petsiuk, A. Das, and K. Saenko (2018) RISE: randomized input sampling for explanation of black-box models. In Proceedings of the British Machine Vision Conference (BMVC), Cited by: §2.
  • J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788. Cited by: Appendix D.
  • J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: Appendix D.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2016) "Why should I trust you?": explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. Cited by: §B.2, §1, §2, §3, §4.1, §5.
  • W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K. Müller (2016) Evaluating the visualization of what a deep neural network has learned. IEEE transactions on neural networks and learning systems 28 (11), pp. 2660–2673. Cited by: §2.
  • M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Appendix D.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2020) Grad-cam: visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision (IJCV) 128 (2), pp. 336–359. Cited by: §2.
  • P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun (2014) Overfeat: integrated recognition, localization and detection using convolutional networks. In International Conference on Learning Representations (ICLR), Cited by: §B.1.
  • A. Shrikumar, P. Greenside, and A. Kundaje (2017) Learning important features through propagating activation differences. In International Conference on Machine Learning (ICML), pp. 3145–3153. Cited by: §2.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), Cited by: §B.1, Appendix D.
  • D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg (2017) Smoothgrad: removing noise by adding noise. In ICML Workshop on Visualization for Deep Learning, Cited by: §B.2, §1, §2, §3, §4.1, §5.
  • M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In International Conference on Machine Learning (ICML), Cited by: §1, §2.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9. Cited by: Appendix D.
  • M. Tan and Q. V. Le (2019) Efficientnet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (ICML), Cited by: Appendix D.
  • I. van der Linden, H. Haned, and E. Kanoulas (2019) Global aggregations of local explanations for black box models. FACTS-IR: Fairness, Accountability, Confidentiality, Transparency, and Safety - SIGIR 2019 Workshop. Cited by: §2.
  • A. Vedaldi and S. Soatto (2008) Quick shift and kernel methods for mode seeking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 705–718. Cited by: §B.2, §4.1.
  • M. N. Vu, T. D. Nguyen, N. Phan, R. Gera, and M. T. Thai (2019) Evaluating explainers via perturbation. arXiv preprint arXiv:1906.02032. Cited by: §2.
  • H. Wang, Z. Wang, M. Du, F. Yang, Z. Zhang, S. Ding, P. Mardziel, and X. Hu (2020a) Score-cam: score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 24–25. Cited by: §2.
  • J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, et al. (2020b) Deep high-resolution representation learning for visual recognition. Cited by: Appendix D.
  • P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona (2010) Caltech-UCSD birds 200. Technical report Technical Report CNS-TR-2010-001, California Institute of Technology. Cited by: §B.1, §4.1.
  • S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Appendix D.
  • M. Yang and B. Kim (2019) Benchmarking attribution methods with relative feature importance. arXiv preprint. Cited by: §2.
  • C. Yeh, C. Hsieh, A. S. Suggala, D. I. Inouye, and P. Ravikumar (2019) On the (in) fidelity and sensitivity for explanations. In Advances in neural information processing systems (NeurIPS), Cited by: §2.
  • Q. Zhang, Y. Yang, H. Ma, and Y. N. Wu (2019) Interpreting cnns via decision trees. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6261–6270. Cited by: §2.
  • X. Zhang, X. Zhou, M. Lin, and J. Sun (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Appendix D.
  • B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921–2929. Cited by: §1, §2.

Appendix A Complete Pseudocode of Cross-Model Consensus of Explanations

In the main text, Algorithm 1 presents the pseudocode of our framework of Cross-Model Consensus of Explanations. Here, Algorithm 2 completes the pseudocode of the framework with the details of the three functions that are used in Algorithm 1.

1 Function interpret(I, x, M):
       /* An interpretation algorithm I, a data sample x and a trained model M. */
2       return the explanation result of M on x by I.
3
4 Function reach_consensus(E):
       /* E, a collection of the interpretations of all committee models for one given data sample. */
5       return c, the consensus of the interpretations of the models, obtained by voting over E (at the superpixel level for LIME explanations; at the pixel level for SmoothGrad explanations).
6
7 Function similarity(u, v):
       /* Two vectors u and v. */
8       return the similarity score between u and v (computed for LIME interpretations and SmoothGrad interpretations respectively).
9
Algorithm 2 Functions in Algorithm 1.
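As a concrete, simplified sketch of these functions in Python: here element-wise averaging stands in for the paper's voting step, and cosine similarity stands in for the similarity measure; both are our assumptions for illustration, not the exact choices in Algorithm 2.

```python
import math

def reach_consensus(explanations):
    """Aggregate per-model explanations (equal-length score vectors) by
    element-wise averaging -- a simple stand-in for voting."""
    n, d = len(explanations), len(explanations[0])
    return [sum(e[i] for e in explanations) / n for i in range(d)]

def similarity(u, v):
    """Cosine similarity between an explanation and the consensus."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def consensus_scores(explanations):
    """One consensus score per model: similarity to the voted consensus."""
    c = reach_consensus(explanations)
    return [similarity(e, c) for e in explanations]

# Three models' explanations over three features; the two majority models
# agreeing on feature 0 receive higher consensus scores than the outlier.
scores = consensus_scores([[1.0, 0.0, 0.0],
                           [1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0]])
```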

Appendix B Experimental Details

In this section, we present the technical details of the experiments in the main text, covering the preparation of deep models for committee formation, the interpretation algorithms, and the user-study evaluations.

B.1 Committee Formations

There were around 100 deep models trained on ImageNet publicly available (https://github.com/PaddlePaddle/models/blob/release/1.8/PaddleCV/image_classification/README_en.md#supported-models-and-performances) at the moment we initiated the experiments. We first exclude some very large models that require much more computational resources. Then, for consistency in computing superpixels, we include only the models that take images of size 224×224 as input, resulting in 81 models for the committee based on ImageNet. Models with other input sizes could be included by simply aligning the superpixels across different image sizes; however, we choose not to do so, since a large number of models are already available.

As for CUB-200-2011 (Welinder et al., 2010), we similarly first exclude the very large models. Then we follow the standard procedures (Sermanet et al., 2014; Simonyan and Zisserman, 2015) for fine-tuning ImageNet-pretrained models on CUB-200-2011. For simplicity, we use the same training setup for fine-tuning all pre-trained models (learning rate 0.01, batch size 64, SGD optimizer with momentum 0.9, resizing so that the short edge is 256, randomly cropping images to size 224×224), and obtain 85 well-trained models. Different hyper-parameters might improve the performance of some specific networks, but again because of the large number of available models, we choose not to search for better hyper-parameter settings.

For Stanford Cars 196 (Krause et al., 2013), Oxford Flowers 102 (Nilsback and Zisserman, 2008) and Food-101 (Bossard et al., 2014), we follow the same fine-tuning procedure as on CUB-200-2011. However, given the convergence over committee sizes (Figure 6), which suggests that a committee of more than 15 models is sufficient, we randomly choose around 20 models for each of the three datasets.

B.2 Interpretation Algorithms

To explain a deep model's predictions on vision tasks, LIME (Ribeiro et al., 2016) first performs a superpixel segmentation (Vedaldi and Soatto, 2008) of an image, then generates interpolated samples by randomly masking some superpixels and computing the model outputs on the generated samples, and finally fits the model outputs, with the presence/absence of superpixels as input, by a linear regression model. The linear weights then directly indicate the feature importance at the superpixel level as the explanation result.
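A toy sketch of this pipeline, under simplifying assumptions: superpixel presence masks stand in for perturbed images, the black-box model is a function of the mask, and a plain least-squares fit by gradient descent replaces the weighted linear regression typically used by LIME.

```python
import random

def lime_weights(model, n_superpixels, n_samples=500, lr=0.1, epochs=200):
    """LIME-style attribution sketch: sample random presence/absence masks
    over superpixels, query the black-box model on each perturbed sample,
    and fit a linear surrogate whose weights are the importance scores."""
    rng = random.Random(0)
    masks = [[rng.randint(0, 1) for _ in range(n_superpixels)]
             for _ in range(n_samples)]
    ys = [model(m) for m in masks]
    w, b = [0.0] * n_superpixels, 0.0
    for _ in range(epochs):  # plain gradient descent on squared error
        gw, gb = [0.0] * n_superpixels, 0.0
        for m, y in zip(masks, ys):
            err = b + sum(wi * mi for wi, mi in zip(w, m)) - y
            gb += err / n_samples
            for i in range(n_superpixels):
                gw[i] += err * m[i] / n_samples
        b -= lr * gb
        for i in range(n_superpixels):
            w[i] -= lr * gw[i]
    return w

# Hypothetical black box whose prediction depends only on superpixel 2:
toy_model = lambda mask: 1.0 if mask[2] else 0.0
w = lime_weights(toy_model, n_superpixels=5)  # w[2] dominates the others
```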

The gradients of the model output w.r.t. the input can partly identify influential pixels, but due to the saturation of activation functions in deep networks, the vanilla gradient is usually noisy. SmoothGrad (Smilkov et al., 2017) reduces the visual noise by repeatedly adding small random noise to the input so as to obtain a list of corresponding gradients, which are then averaged for the final explanation result.
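A minimal sketch of this averaging; grad_fn below is an assumed callable returning the gradient of the model output w.r.t. a flattened input:

```python
import random

def smoothgrad(grad_fn, x, n=200, sigma=0.1, seed=0):
    """SmoothGrad sketch: average the gradients taken at n noisy copies
    of the input x (Gaussian noise with standard deviation sigma)."""
    rng = random.Random(seed)
    acc = [0.0] * len(x)
    for _ in range(n):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        for i, gi in enumerate(grad_fn(noisy)):
            acc[i] += gi / n
    return acc

# Toy differentiable "model" f(x) = x[0]**2, whose gradient is [2*x[0], 0]:
g = smoothgrad(lambda x: [2.0 * x[0], 0.0], [1.0, 3.0])
# g[0] concentrates near 2.0 as the injected noise averages out
```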

Note that many other interpretation algorithms are also compatible with our proposed framework; in this paper, we validate our approach with two trustworthy and commonly used algorithms.

B.3 Human User-Study Evaluations

As introduced in the main text, we have conducted user-study experiments on model interpretability over the five models discussed by Network Dissection, i.e., DenseNet161, ResNet152, VGG16, GoogLeNet and AlexNet, and the user-study results align well with those of our framework using either LIME or SmoothGrad. We describe here the experimental settings of the user-study evaluations.

For each image, we randomly choose two of the five models and present the LIME (or SmoothGrad, respectively) explanations of the two models, without revealing the model identities to users. Users are then asked to choose which explanation better reveals the model's reasoning in making predictions according to their understanding, or to mark them as equal if the two explanations are equally good or bad. Each pair of models is repeated three times and presented to different users. The better one in each pair gets three points and the other gets zero; in the equal case, both get one point. Finally, each model's points are normalized by dividing by the number of images and the number of repeats (i.e., 3). The user-study evaluations yield the scores indicating model interpretability, as shown in Table 1.
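The point-tallying scheme above can be sketched as follows (model names and votes are illustrative, not real study data):

```python
from collections import defaultdict

def user_study_scores(votes, n_images, n_repeats=3):
    """Tally pairwise preference votes into per-model scores.
    Each vote is (model_a, model_b, outcome) with outcome in {'a', 'b', 'tie'}:
    the preferred model gets 3 points, the other 0; a tie gives both 1 point.
    Totals are normalized by (number of images * number of repeats)."""
    points = defaultdict(float)
    for a, b, outcome in votes:
        if outcome == 'a':
            points[a] += 3
        elif outcome == 'b':
            points[b] += 3
        else:
            points[a] += 1
            points[b] += 1
    return {m: p / (n_images * n_repeats) for m, p in points.items()}

# One image, one pair repeated three times (hypothetical votes):
scores = user_study_scores(
    [('VGG16', 'AlexNet', 'a'),
     ('VGG16', 'AlexNet', 'a'),
     ('VGG16', 'AlexNet', 'tie')],
    n_images=1)
```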

Appendix C ResNet Family

(a) Based on the Complete Committee
(b) Based on the ResNet Family
Figure 9: Model performance vs. similarity to the consensus of LIME on the ResNet family. The consensus in (a) is voted by the complete ImageNet committee (81 models), while the consensus in (b) is voted by the ResNet family (16 models).

We show the zoomed plot of the ResNet family (models whose names contain the keyword "ResNet") within the ImageNet-LIME committee of 81 models in Figure 9 (a). Meanwhile, we also present the results using the ResNet family alone as the committee in Figure 9 (b). The two subfigures show no large difference, which further confirms the consistency of our approach across different committees for ranking models. Note that the positive correlation between model performance and consensus scores does not hold within the ResNet family: as explained earlier, in some local regions, especially when models are extremely large, the correlation is not always positive.

Appendix D References of Network Structures

Most of the frequently used structures of deep models have been evaluated in this paper, including AlexNet (Krizhevsky et al., 2012), ResNet (He et al., 2016), ResNeXt (Xie et al., 2017), SE-ResNet (Hu et al., 2018), ShuffleNet (Zhang et al., 2018; Ma et al., 2018), MobileNet (Howard et al., 2017; Sandler et al., 2018; Howard et al., 2019), VGG (Simonyan and Zisserman, 2015), GoogLeNet (Szegedy et al., 2015), Inception (Szegedy et al., 2015), Xception (Chollet, 2017), DarkNet (Redmon et al., 2016; Redmon and Farhadi, 2018), DenseNet (Huang et al., 2017), DPN (Chen et al., 2017b), SqueezeNet (Iandola et al., 2016), EfficientNet (Tan and Le, 2019), Res2Net (Gao et al., 2019), HRNet (Wang et al., 2020b), DARTS (Liu et al., 2018), ACNet (Ding et al., 2019) and their variants.

Appendix E Numerical Report of Main Plots

Due to the large number of deep models evaluated, Figure 8, Figure 4 and Figure 5 group some models of the same architecture family. Here, we report all of the corresponding numerical results in Table 2 (typeset at a smaller scale).

Model    perf.    Consensus score w/ LIME    Consensus score w/ SmoothGrad
AlexNet 0.575 0.594 0.0214
AutoDL_4M 0.752 0.756 0.0312
AutoDL_6M 0.776 0.741 0.0291
DPN107 0.798 0.828 0.0331
DPN131 0.802 0.833 0.0334
DPN68 0.766 0.849 0.0297
DPN92 0.775 0.845 0.0357
DPN98 0.798 0.837 0.0354
DenseNet121 0.755 0.859 0.0376
DenseNet161 0.781 0.849 0.0383
DenseNet169 0.765 0.855 0.0376
DenseNet201 0.779 0.843 0.0385
DenseNet264 0.761 0.841 0.0378
EfficientNetB0 0.754 0.727 0.0355
EfficientNetB0_Small 0.708 0.729 0.0362
GoogleNet 0.726 0.734 0.0260
HRNet_W18_C 0.766 0.854 0.0388
HRNet_W30_C 0.776 0.832 0.0378
HRNet_W32_C 0.781 0.845 0.0376
HRNet_W40_C 0.773 0.822 0.0366
HRNet_W44_C 0.782 0.817 0.0368
HRNet_W48_C 0.794 0.807 0.0369
HRNet_W64_C 0.784 0.799 0.0344
MobileNetV1 0.711 0.825 0.0322
MobileNetV1_x0_25 0.513 0.653 0.0222
MobileNetV1_x0_5 0.640 0.751 0.0257
MobileNetV1_x0_75 0.697 0.788 0.0297
MobileNetV2 0.742 0.812 0.0342
MobileNetV2_x0_25 0.534 0.650 0.0221
MobileNetV2_x0_5 0.642 0.768 0.0270
MobileNetV2_x0_75 0.709 0.795 0.0302
MobileNetV2_x1_5 0.737 0.841 0.0336
MobileNetV2_x2_0 0.744 0.840 0.0352
Res2Net101_vd_26w_4s 0.780 0.752 0.0285
Res2Net50_14w_8s 0.781 0.823 0.0324
Res2Net50_26w_4s 0.780 0.835 0.0343
Res2Net50_vd_26w_4s 0.790 0.828 0.0332
ResNeXt101_32x4d 0.784 0.843 0.0371
ResNeXt101_vd_32x4d 0.795 0.830 0.0347
ResNeXt101_vd_64x4d 0.784 0.821 0.0336
ResNeXt152_32x4d 0.782 0.842 0.0377
ResNeXt152_64x4d 0.787 0.828 0.0383
ResNeXt152_vd_32x4d 0.792 0.807 0.0325
ResNeXt152_vd_64x4d 0.790 0.814 0.0325
ResNeXt50_32x4d 0.765 0.849 0.0385
ResNeXt50_64x4d 0.784 0.836 0.0389
ResNeXt50_vd_32x4d 0.790 0.844 0.0346
ResNeXt50_vd_64x4d 0.792 0.829 0.0360
ResNet101 0.769 0.847 0.0377
ResNet101_vd 0.788 0.810 0.0323
ResNet152 0.776 0.846 0.0374
ResNet152_vd 0.801 0.793 0.0308
ResNet18 0.715 0.816 0.0342
ResNet18_vd 0.730 0.807 0.0334
ResNet200_vd 0.793 0.790 0.0300
ResNet34 0.739 0.826 0.0363
ResNet34_vd 0.757 0.802 0.0329
ResNet50 0.763 0.858 0.0394
ResNet50_ACNet 0.780 0.868 0.0386
ResNet50_vc 0.778 0.817 0.0370
ResNet50_vd 0.778 0.831 0.0341
SENet154_vd 0.803 0.807 0.0315
SE_ResNeXt101_32x4d 0.781 0.818 0.0325
SE_ResNeXt50_32x4d 0.775 0.810 0.0321
SE_ResNeXt50_vd_32x4d 0.797 0.819 0.0342
SE_ResNet18_vd 0.743 0.810 0.0342
SE_ResNet34_vd 0.766 0.789 0.0330
SE_ResNet50_vd 0.787 0.792 0.0332
ShuffleNetV2 0.706 0.795 0.0325
ShuffleNetV2_x0_25 0.507 0.636 0.0231
ShuffleNetV2_x0_33 0.547 0.651 0.0238
ShuffleNetV2_x0_5 0.611 0.710 0.0250
ShuffleNetV2_x1_0 0.689 0.778 0.0295
ShuffleNetV2_x1_5 0.712 0.807 0.0306
ShuffleNetV2_x2_0 0.738 0.816 0.0317
SqueezeNet1_0 0.602 0.732 0.0253
SqueezeNet1_1 0.613 0.762 0.0242
VGG11 0.694 0.801 0.0291
VGG13 0.697 0.804 0.0297
VGG16 0.714 0.821 0.0305
VGG19 0.722 0.821 0.0309
(a) on ImageNet
Model    perf.    Consensus score w/ LIME    Consensus score w/ SmoothGrad    mAP between seg. g.t. and LIME explanation    mAP between seg. g.t. and SmoothGrad explanation
AlexNet 0.507 0.536 0.0275 0.343 0.571
AutoDL_4M 0.728 0.781 0.0371 0.594 0.693
AutoDL_6M 0.754 0.811 0.0402 0.605 0.740
DPN107 0.830 0.867 0.0525 0.630 0.780
DPN131 0.800 0.868 0.0498 0.643 0.795
DPN68 0.795 0.849 0.0415 0.630 0.710
DPN92 0.806 0.872 0.0510 0.626 0.784
DPN98 0.815 0.877 0.0526 0.628 0.793
DarkNet53_ImageNet1k 0.782 0.850 0.0485 0.604 0.743
DenseNet121 0.771 0.848 0.0503 0.585 0.771
DenseNet161 0.813 0.873 0.0542 0.640 0.797
DenseNet169 0.792 0.858 0.0513 0.609 0.776
DenseNet201 0.805 0.858 0.0544 0.616 0.795
DenseNet264 0.789 0.868 0.0540 0.628 0.798
EfficientNetB0 0.765 0.805 0.0450 0.594 0.769
EfficientNetB0_Small 0.737 0.805 0.0426 0.589 0.738
EfficientNetB1 0.775 0.805 0.0456 0.593 0.755
EfficientNetB2 0.787 0.819 0.0461 0.595 0.764
EfficientNetB3 0.791 0.812 0.0421 0.582 0.771
EfficientNetB4 0.792 0.829 0.0423 0.612 0.766
EfficientNetB5 0.774 0.808 0.0431 0.591 0.768
HRNet_W18_C 0.754 0.831 0.0461 0.592 0.736
HRNet_W30_C 0.770 0.832 0.0475 0.595 0.752
HRNet_W32_C 0.785 0.836 0.0471 0.586 0.750
HRNet_W40_C 0.750 0.844 0.0476 0.594 0.763
HRNet_W44_C 0.788 0.830 0.0449 0.592 0.752
HRNet_W48_C 0.796 0.838 0.0482 0.581 0.757
HRNet_W64_C 0.791 0.838 0.0485 0.609 0.766
InceptionV4 0.745 0.797 0.0435 0.592 0.728
MobileNetV1 0.741 0.824 0.0415 0.588 0.716
MobileNetV1_x0_25 0.557 0.676 0.0288 0.448 0.634
MobileNetV1_x0_5 0.655 0.753 0.0325 0.527 0.672
MobileNetV1_x0_75 0.688 0.808 0.0388 0.569 0.701
MobileNetV2 0.737 0.810 0.0438 0.582 0.732
MobileNetV2_x0_25 0.511 0.670 0.0287 0.457 0.597
MobileNetV2_x0_5 0.665 0.753 0.0337 0.543 0.661
MobileNetV2_x0_75 0.715 0.814 0.0369 0.577 0.686
MobileNetV2_x1_5 0.756 0.835 0.0421 0.611 0.719
MobileNetV2_x2_0 0.781 0.851 0.0425 0.605 0.705
Res2Net101_vd_26w_4s 0.799 0.853 0.0470 0.613 0.756
Res2Net50_14w_8s 0.789 0.826 0.0491 0.587 0.765
Res2Net50_26w_4s 0.768 0.840 0.0515 0.601 0.782
Res2Net50_vd_26w_4s 0.783 0.821 0.0467 0.604 0.749
ResNeXt101_32x4d 0.818 0.877 0.0578 0.629 0.798
ResNeXt101_32x8d_wsl 0.768 0.831 0.0479 0.563 0.755
ResNeXt101_vd_32x4d 0.816 0.867 0.0494 0.614 0.771
ResNeXt101_vd_64x4d 0.824 0.871 0.0520 0.642 0.778
ResNeXt152_32x4d 0.815 0.872 0.0543 0.619 0.792
ResNeXt152_64x4d 0.834 0.875 0.0576 0.613 0.779
ResNeXt152_vd_32x4d 0.820 0.872 0.0520 0.640 0.788
ResNeXt152_vd_64x4d 0.822 0.852 0.0479 0.618 0.764
ResNeXt50_32x4d 0.809 0.856 0.0567 0.619 0.785
ResNeXt50_64x4d 0.814 0.885 0.0562 0.621 0.788
ResNeXt50_vd_32x4d 0.806 0.874 0.0508 0.627 0.762
ResNeXt50_vd_64x4d 0.820 0.890 0.0544 0.631 0.785
ResNet101 0.784 0.878 0.0511 0.620 0.761
ResNet101_vd 0.813 0.864 0.0499 0.606 0.766
ResNet152 0.799 0.859 0.0506 0.601 0.773
ResNet152_vd 0.797 0.851 0.0507 0.613 0.774
ResNet18 0.726 0.794 0.0449 0.546 0.735
ResNet18_vd 0.754 0.846 0.0428 0.598 0.710
ResNet200_vd 0.813 0.861 0.0502 0.618 0.773
ResNet34 0.758 0.812 0.0461 0.569 0.756
ResNet34_vd 0.771 0.833 0.0435 0.570 0.731
ResNet50 0.776 0.878 0.0531 0.609 0.774
ResNet50_ACNet 0.782 0.870 0.0481 0.619 0.737
ResNet50_vd 0.795 0.876 0.0461 0.634 0.741
SE_ResNeXt101_32x4d 0.793 0.838 0.0452 0.605 0.750
SE_ResNeXt50_32x4d 0.798 0.821 0.0438 0.578 0.727
SE_ResNeXt50_vd_32x4d 0.799 0.863 0.0479 0.617 0.729
SE_ResNet18_vd 0.727 0.802 0.0396 0.550 0.673
SE_ResNet34_vd 0.754 0.803 0.0450 0.574 0.731
SE_ResNet50_vd 0.771 0.870 0.0446 0.616 0.732
ShuffleNetV2 0.696 0.817 0.0375 0.571 0.668
ShuffleNetV2_x0_25 0.519 0.687 0.0263 0.448 0.563
ShuffleNetV2_x0_33 0.530 0.686 0.0294 0.465 0.622
ShuffleNetV2_x0_5 0.605 0.753 0.0307 0.500 0.624
ShuffleNetV2_x1_0 0.695 0.788 0.0354 0.564 0.667
ShuffleNetV2_x1_5 0.728 0.815 0.0371 0.564 0.670
ShuffleNetV2_x2_0 0.731 0.806 0.0402 0.574 0.683
Xception41 0.801 0.833 0.0501 0.605 0.761
Xception41_deeplab 0.775 0.753 0.0412 0.559 0.734
Xception65 0.801 0.837 0.0479 0.609 0.740
Xception65_deeplab 0.747 0.800 0.0415 0.586 0.728
Xception71 0.775 0.846 0.0479 0.613 0.760
consensus 0.859 N/A N/A 0.704 0.818
(b) on CUB-200-2011
Table 2: Numerical report of model performance and similarity to the consensus using LIME and SmoothGrad over 81 models on ImageNet in sub-table(a) corresponding to Figure 5 (a, b, c), and over 85 models on CUB-200-2011 in sub-table(b) corresponding to Figure 4, 8 and 5 (d, e, f).
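The mAP columns of Table 2 compare explanations against the pixel-wise segmentation ground truth. Below is a minimal sketch of a per-image average precision under a standard ranking-based reading (our assumption; the paper's exact mAP protocol may differ): pixels are ranked by descending saliency and scored against a binary object mask.

```python
def average_precision(saliency, mask):
    """Average precision of a saliency map against a binary mask:
    rank pixels by descending saliency, then average the precision
    observed at each position where a ground-truth pixel is recovered."""
    order = sorted(range(len(saliency)), key=lambda i: -saliency[i])
    hits, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if mask[i]:
            hits += 1
            ap += hits / rank
    return ap / sum(mask)

# Perfect ranking: the two object pixels get the highest saliency values.
ap = average_precision([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0])  # ap == 1.0
```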

Appendix F More Visualization Results

We present more visualization results of the cross-model consensus of explanations in Figures 10, 11, 12 and 13, where the samples are from ImageNet and CUB-200-2011.

Figure 10: More visual comparisons between the consensus and the explanations of deep models with LIME on samples from ImageNet. Note that the consensus (last column) is the cross-model consensus of explanations.
Figure 11: More visual comparisons between the consensus and the explanations of deep models with SmoothGrad on samples from ImageNet. Note that the consensus (last column) is the cross-model consensus of explanations.
Figure 12: More visual comparisons between the consensus and the explanations of deep models with LIME on samples from CUB-200-2011, where the pixel-wise annotations of image segmentation are available and the mAPs are measured for the similarity to the segmentation ground truth. Note that the consensus (second last column) is the cross-model consensus of explanations.
Figure 13: More visual comparisons between the consensus and the explanations of deep models with SmoothGrad on samples from CUB-200-2011, where the pixel-wise annotations of image segmentation are available and the mAPs are measured for the similarity to the segmentation ground truth. Note that the consensus (second last column) is the cross-model consensus of explanations.

For further exploration, we visualize several random images from MS-COCO (Lin et al., 2014) in Figure 14. As introduced in Section 5, one direction for future work is to extend the proposed framework to datasets with multiple visual objects and complicated backgrounds, for object detection, segmentation, and multi-label classification tasks.

Figure 14: Visualization of images from the MS-COCO dataset (Lin et al., 2014) for showing the potentials of cross-model consensus of explanations, where the predicted label with probability is noted.