Deep models are well-known for their excellent performance in many challenging domains, as well as for their black-box nature. To interpret the prediction of a deep model, a number of trustworthy interpretation algorithms (Bach et al., 2015; Zhou et al., 2016; Ribeiro et al., 2016; Smilkov et al., 2017; Sundararajan et al., 2017; Lundberg and Lee, 2017) have recently been proposed to attribute an importance score to every input feature of a given sample with respect to the model's output. For example, given an image classification model, LIME (Ribeiro et al., 2016) and SmoothGrad (Smilkov et al., 2017) attribute importance scores to every superpixel/pixel in an image with respect to the model's prediction. In this way, one can easily explain the classification result of a model on a data point by visualizing the important features used by the model for prediction.
The use of interpretation tools reveals that, even when deep models make the same correct prediction on the same image, they might rely on different sets of input features for classification. For example, our work uses LIME and SmoothGrad to explain a number of models trained on image classification tasks on the same set of images, and obtains different explanations for these models even when all of them make correct predictions (later shown in Figure 2 and Figure 3). While these models are explained to make the same prediction using different sets of features, we can still find some common features used by the majority of models. We are thus particularly interested in two research questions: (1) What are the common features used by various models in an image? (2) Do models with better performance favor those common features?
To answer these two questions, we propose to study the common features across a number of deep models and to measure the similarity between the set of common features and the set used by every single model. Specifically, as illustrated in Figure 1, we generalize an electoral system: we first form a committee with a number of deep models, obtain the explanations for a given image based on one trustworthy interpretation algorithm, call for voting to obtain the cross-model consensus of explanations (or, in short, the consensus), and finally compute a similarity score between the consensus and the explanation result of each deep model, denoted as the consensus score. Through extensive experiments using 80+ models on 5 datasets/tasks, we find that (1) the consensus is aligned with the ground truth of image semantic segmentation; (2) a model in the committee with a higher consensus score usually performs better in terms of testing accuracy; and (3) models' consensus scores coincidentally correlate with their interpretability.
The contributions of this paper can be summarized as follows. To the best of our knowledge, this work is the first to investigate, by leveraging interpretation algorithms, the common features used and shared by a large number of deep models for image classification. We propose the cross-model consensus of explanations to characterize the common features, and connect the consensus score to the performance and interpretability of a model. Finally, we derive three observations from the experiments, with thorough analyses and discussions.
2 Related Work
We first review interpretation algorithms and the approaches for evaluating their trustworthiness. To visualize the activated subregions of intermediate-layer feature maps, many algorithms have been proposed to interpret convolutional networks (Zhou et al., 2016; Selvaraju et al., 2020; Chattopadhay et al., 2018; Wang et al., 2020a). Apart from investigating the inside of complex deep networks, simple linear or tree-based surrogate models have been used as "out-of-box explainers" to explain the predictions made by a deep model over the dataset through local or global approximations (Ribeiro et al., 2016; van der Linden et al., 2019; Ahern et al., 2019; Zhang et al., 2019). Instead of using surrogates for deep models, algorithms such as SmoothGrad (Smilkov et al., 2017), Integrated Gradients (Sundararajan et al., 2017), and DeepLIFT (Shrikumar et al., 2017) have been proposed to estimate input feature importance with respect to the model predictions. Note that there are many other interpretation algorithms; in this paper we mainly discuss those that are related to feature attributions and suitable for deep image classification models. Evaluations of interpretation algorithms aim to verify their trustworthiness, so that they do not mislead the understanding of models' behaviors; e.g., Adebayo et al. (2018) found, by randomizing the parameters of models, that some algorithms are independent of both the model and the data generating process. Other evaluation approaches include the perturbation of important features (Samek et al., 2016; Petsiuk et al., 2018; Vu et al., 2019; Hooker et al., 2019), model trojaning attacks (Chen et al., 2017a; Gu et al., 2017; Lin et al., 2020), infidelity and sensitivity to similar samples in the neighborhood (Ancona et al., 2018; Yeh et al., 2019), crafted datasets (Yang and Kim, 2019), and user-study experiments (Lage et al., 2019; Jeyakumar et al., 2020).
From an orthogonal perspective, evaluations across models are also needed for building more interpretable and explainable AI systems. However, such cross-model evaluations are scarce. Bau et al. (2017) proposed Network Dissection, which builds an additional dataset with dense annotations of a number of visual concepts for evaluating the interpretability of convolutional neural networks. Given a convolutional model, Network Dissection recovers the intermediate-layer feature maps used by the model for classification, and then measures the overlap between the activated subregions in the feature maps and the densely human-labeled visual concepts to estimate the interpretability of the model. Another common solution for evaluation across deep models is user-study experiments (Doshi-Velez and Kim, 2017).
In this paper, we do not directly evaluate the interpretability across deep models; instead, based on the proposed framework, we show experimentally that the consensus score is positively correlated with the generalization performance of deep models and coincidentally related to their interpretability. We will discuss more details and analyses later. We believe that, based on the explanations, our proposed framework and the consensus score could help to better understand deep models.
3 Framework of Cross-Model Consensus of Explanations
In this section, we introduce the proposed approach that generalizes the electoral system to provide the consensus of explanations across various deep models. Specifically, the proposed framework consists of three steps, as detailed in the following.
Step 1: Committee Formation with Deep Models. Given $K$ deep models trained for solving a target task (the image classification task in our experiments) on a visual dataset where each image contains one main object, the approach first forms the given deep models into a committee, denoted as $\mathcal{C} = \{f_1, f_2, \ldots, f_K\}$, and then relies on the variety of models in the committee to establish the consensus for comparisons and evaluations.
Step 2: Committee Voting for Consensus Achievement. With the committee $\mathcal{C}$ of deep models and the task for explanation, the proposed framework leverages a trustworthy interpretation tool $\Phi$, e.g., LIME (Ribeiro et al., 2016) or SmoothGrad (Smilkov et al., 2017) in this paper, to obtain the explanation of every model on every image in the dataset. Given a sample $x$ from the dataset, we denote the obtained explanation results of all models as $\{\Phi(f_1, x), \ldots, \Phi(f_K, x)\}$. Then, we propose a voting procedure that aggregates these results to reach the cross-model consensus of explanations, i.e., the consensus, for $x$. Specifically, for $1 \le i \le d$, where $d$ refers to the dimension of an explanation result, the $i$-th element of the consensus is $c_i(x) = \frac{1}{K}\sum_{k=1}^{K} \Phi_i(f_k, x) / \|\Phi(f_k, x)\|$ for LIME and $c_i(x) = \frac{1}{K}\sum_{k=1}^{K} \Phi_i(f_k, x)$ for SmoothGrad, following the conventional normalization-averaging procedure (Ribeiro et al., 2016; Ahern et al., 2019; Smilkov et al., 2017). In the end, the consensus is reached for every sample in the target dataset based on committee voting.
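The voting step can be sketched as follows. This is a minimal sketch with our own function names, assuming each model's explanation for the given image has been flattened into a fixed-length vector, and using the L2 norm for the normalization step (the paper's exact normalization may differ):

```python
import numpy as np

def consensus(explanations, method="smoothgrad"):
    """Aggregate per-model explanations (one 1-D vector per committee
    member, all of the same length for a given image) into the
    cross-model consensus by averaging. For LIME-style scores, each
    model's vector is first normalized so that models with larger raw
    magnitudes do not dominate the vote."""
    E = np.asarray(explanations, dtype=float)  # shape: (n_models, n_features)
    if method == "lime":
        # normalize each model's explanation before averaging
        norms = np.linalg.norm(E, axis=1, keepdims=True)
        E = E / np.maximum(norms, 1e-12)
    return E.mean(axis=0)
```

For SmoothGrad-style explanations, the aggregation reduces to a plain element-wise average over the committee.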
Step 3: Consensus-based Similarity Score. Given the consensus, the approach calculates the consensus score of every model in the committee as the similarity between the explanation result of each individual model and the consensus. Specifically, for explanations and consensus based on LIME (visual feature importance at the superpixel level), the cosine similarity between the flattened explanation vector of each model and the consensus is used. For results based on SmoothGrad (visual feature importance at the pixel level), a similar procedure is followed, where the proposed algorithm uses the Radial Basis Function (RBF) kernel $\exp(-\|u - v\|^2 / 2\sigma^2)$ between two flattened vectors $u$ and $v$ for the similarity measurement. The difference in similarity computations is due to the facts that (1) the dimension of LIME explanations varies across samples while it is fixed for SmoothGrad explanations; and (2) the scales of LIME explanation results vary much more than those of SmoothGrad. Thus, cosine similarity is more suitable for LIME while RBF is for SmoothGrad. Eventually, the framework computes a quantitative but relative score for each model in the committee using its similarity to the consensus.
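The two similarity measures of this step could be sketched as follows; a minimal sketch with our own function names, where the RBF bandwidth `sigma` is an assumption of ours, as its exact value is not specified here:

```python
import numpy as np

def consensus_score_cosine(explanation, consensus):
    # similarity used for LIME-style (superpixel-level) explanations
    u, v = np.ravel(explanation), np.ravel(consensus)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def consensus_score_rbf(explanation, consensus, sigma=1.0):
    # similarity used for SmoothGrad-style (pixel-level) explanations:
    # RBF kernel exp(-||u - v||^2 / (2 * sigma^2))
    d2 = float(np.sum((np.ravel(explanation) - np.ravel(consensus)) ** 2))
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))
```

Both functions return a scalar in a bounded range, so the scores of different models in the same committee are directly comparable.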
4 Overall Experiments and Results
In this section, we start by introducing the experiment setups. We use image classification as the target task and follow the proposed framework to obtain the consensus and compute the consensus scores. Through the experiments, we find (1) an alignment between the consensus and image semantic segmentation, (2) positive correlations between the consensus score and model performance, and (3) coincidental correlations between the consensus score and model interpretability. We end this section with robustness analyses of the framework.
4.1 Evaluation Setups
For overall evaluations and comparisons, we use ImageNet (Deng et al., 2009) for general visual object recognition and CUB-200-2011 (Welinder et al., 2010) for bird recognition, respectively. Note that ImageNet provides a class label for every image, while CUB-200-2011 includes both the class label and a pixel-level segmentation of the bird in every image; these pixel annotations of visual objects are found to be aligned with the consensus.
Models. For fair comparisons, we use more than 80 deep models trained on ImageNet that are publicly available (https://github.com/PaddlePaddle/models/blob/release/1.8/PaddleCV/image_classification/README_en.md#supported-models-and-performances). We also derive models on the CUB-200-2011 dataset through standard fine-tuning procedures. In our experiments, we include these models in two committees based on ImageNet and CUB-200-2011, respectively. Both committees target the image classification task, with each image labeled with one category.
Interpretation Algorithms. As previously introduced, we consider two interpretation algorithms, LIME (Ribeiro et al., 2016) and SmoothGrad (Smilkov et al., 2017). Specifically, LIME represents the explanation as an assignment of visual feature importance to superpixels (Vedaldi and Soatto, 2008), and SmoothGrad outputs the explanation as visual feature importance over pixels. In this way, we can validate the flexibility of the proposed framework over explanation results from diverse sources (i.e., linear surrogates vs. input gradients) and in multiple granularities (i.e., feature importance at the superpixel/pixel level).
4.2 Alignment between the Consensus and Image Segmentation
The image segmentation task searches for the pixel-wise classification of images. The cross-model consensus of explanations for image classification is well aligned with image segmentation, especially when only one main object is contained in the image. This partially demonstrates the effectiveness of most deep models in extracting visual objects from input images. We show two examples using both LIME and SmoothGrad in Figures 2 and 3, from ImageNet and CUB-200-2011 respectively. More examples can be found in the appendix.
To quantitatively demonstrate the alignment, we compute the Average Precision (AP) score between the cross-model consensus of explanations and the image segmentation ground truth on CUB-200-2011, where the latter is available. We further take the mean of the AP scores (mAP) over the dataset to compare with the overall consensus scores. Figure 4 shows the results, where the consensus achieves a higher mAP score than any individual network. Both the quantitative results and the visual comparisons validate the closeness of the consensus to the ground truth of image segmentation.
Figure 4: Correlation between model performance and mAP to the segmentation ground truth using (a) LIME and (b) SmoothGrad on CUB-200-2011 over 85 models. Pearson correlation coefficients are 0.927 (p-value 4e-37) for LIME and 0.916 (p-value 9e-35) for SmoothGrad. The "Consensus" points refer to the testing accuracy of the ensemble of networks in the committee, by probability averaging and voting (y-axis), and the mAP between the consensus and the ground truth (x-axis).
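As an illustration of how the per-image AP can be computed, the following sketch (with our own naming) ranks pixels by consensus importance and scores them against the binary segmentation mask:

```python
import numpy as np

def average_precision(scores, mask):
    """AP of a continuous saliency/consensus map against a binary
    segmentation mask, treating each pixel as one retrieval instance."""
    s = np.ravel(scores)
    y = np.ravel(mask).astype(bool)
    order = np.argsort(-s)          # rank pixels by importance, descending
    y = y[order]
    tp = np.cumsum(y)               # true positives at each cut-off rank
    precision = tp / np.arange(1, len(y) + 1)
    # AP = mean precision at the ranks of the positive (object) pixels
    return float(precision[y].mean())
```

A perfect ranking, where every object pixel outscores every background pixel, yields an AP of 1.0; the mAP then averages this quantity over all images in the dataset.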
4.3 Positive Correlations between Consensus Scores and Model Performance
Figure 5 shows the positive correlations between the similarity score to the consensus (x-axis) and model performance (y-axis). Specifically, in Figure 5 (a-b) and (d-e), we present the results using LIME (a, d) and SmoothGrad (b, e) on ImageNet (a, b) and CUB-200-2011 (d, e). All correlations here are strong and pass significance tests, though the correlation may be weaker in some local areas of the plots. In this way, we conclude that, in an overall manner, the evaluation results based on the consensus score, using both LIME and SmoothGrad over the two datasets, correlate with model performance with significance. More experiments on other datasets with random subsets of deep models will be shown in Figure 7 (Section 4.5).
4.4 “Coincidental” Correlations between Consensus Scores and Model Interpretability
Deep model interpretability measures the ability of a model to present its behavior in understandable terms to a human (Doshi-Velez and Kim, 2017). While there are no formal and agreed-upon measurements for interpretability evaluation, two methods, i.e., Network Dissection (Bau et al., 2017) and user-study experiments, are quite common for this purpose. Though the proposed framework and the consensus scores are based on explanation results, they do not directly estimate model interpretability. Nevertheless, in this subsection, we present the coincidental correlations between the consensus scores and the interpretability measurements.
Table 1: Rankings of the five models by user-study interpretability scores and by consensus scores.

| Evaluation | Rank 1 (score) | Rank 2 (score) | Rank 3 (score) | Rank 4 (score) | Rank 5 (score) |
|---|---|---|---|---|---|
| User-Study Evaluations | 1.715 | 1.625 | 1.585 | 1.170 | 0.840 |
| Consensus (LIME) | 0.849 | 0.846 | 0.821 | 0.734 | 0.594 |
| Consensus (SmoothGrad) | 0.038 | 0.037 | 0.030 | 0.026 | 0.021 |
Consensus versus Network Dissection. We compare the results of the proposed framework with the interpretability evaluation method Network Dissection (Bau et al., 2017). On the Broden dataset, Network Dissection reported a ranking list of five models (w.r.t. model interpretability), shown in Table 1, obtained by counting the semantic neurons, where a neuron is defined as semantic if its activated feature maps overlap with human-annotated visual concepts. Based on the proposed framework, we report the consensus scores using LIME and SmoothGrad in Table 1, which are consistent with Figure 5 (a, LIME) and (b, SmoothGrad). The three ranking lists are almost identical, except for the comparison between DenseNet161 and ResNet152: in both lists based on the consensus score, DenseNet161 is similar to ResNet152 with marginally higher consensus scores, while Network Dissection considers ResNet152 more interpretable than DenseNet161.
We believe the results from our proposed framework and Network Dissection are close enough from the perspective of ranking lists. The differences may be caused by the different ways in which our framework and Network Dissection perform the evaluations. The consensus score measures the similarity to the consensus of explanations on images, while Network Dissection counts the number of neurons in the intermediate layers activated by all the visual concepts, including objects, object parts, colors, materials, textures, and scenes. Furthermore, Network Dissection evaluates the interpretability of deep models using the Broden dataset with densely labeled visual objects and patterns (Bau et al., 2017), while the consensus score needs neither additional datasets nor the ground truth of semantics. In this way, the results of our proposed framework and Network Dissection might be slightly different.
Consensus versus User-Study Evaluations. To further validate the effectiveness of the proposed framework, we have also conducted user-study experiments on these five models and report the results in the second row of Table 1. See the appendix for the experimental settings of the user-study evaluations. The consistent rankings confirm that our proposed framework is capable of approximating model interpretability.
4.5 Robustness Analyses of Consensus
In this subsection, we investigate several factors that might affect the evaluation results with the consensus, including the choice of basic interpretation algorithm (e.g., LIME vs. SmoothGrad), the size of the committee, and the candidate pool of models for the committee.
Consistency between LIME and SmoothGrad. Even though the granularities of the explanation results from LIME and SmoothGrad are different, which causes mismatches in the mAP scores against the segmentation ground truth, the consensus scores based on the two algorithms are generally consistent. This consistency is confirmed by Figure 5 (c, f), where the overall results based on LIME are strongly correlated with those based on SmoothGrad over all models on both datasets. This shows that the proposed framework works well with a wide spectrum of basic interpretation algorithms.
Consistency of Cross-Committee Evaluations. In real-world applications, committee-based estimations and evaluations may produce inconsistent results from one committee to another. In this work, we are interested in whether the consensus score estimations are consistent under changes of the committee. Given 16 ResNet models as the targets, we form 20 independent committees by combining the 16 ResNet models with 10–20 models randomly drawn from the remaining networks. In each of these 20 independent committees, we compute the consensus scores of the 16 ResNet models. We then estimate the Pearson correlation coefficients between each of these 20 results and the one in Figure 5 (a); the mean correlation coefficient is 0.96 with a standard deviation of 0.04. Thus, the consensus score evaluation is consistent across randomly picked committees.
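This robustness check can be sketched as follows, with hypothetical names: `rescore()` stands for recomputing the target models' consensus scores under one freshly drawn random committee:

```python
import numpy as np

def pearson(x, y):
    # Pearson correlation coefficient between two score vectors
    return float(np.corrcoef(x, y)[0, 1])

def committee_consistency(reference_scores, rescore, n_trials=20):
    """Correlate the reference consensus scores of the target models with
    the scores recomputed under `n_trials` freshly drawn random committees;
    `rescore()` is a caller-supplied callback returning one such rescoring."""
    corrs = [pearson(reference_scores, rescore()) for _ in range(n_trials)]
    return float(np.mean(corrs)), float(np.std(corrs))
```

When the per-committee scores agree with the reference up to small perturbations, the mean correlation stays near 1, mirroring the 0.96 (std 0.04) reported above.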
Convergence over Committee Sizes. To understand the effect of the committee size on the consensus score estimation, we run the proposed framework using committees of various sizes, formed by deep models randomly picked from the pool. In Figure 6, we plot and compare the performance of the consensus with increasing committee sizes, estimating the mAP between the ground truth and the consensus reached by random committees of different sizes, with 20 random trials run independently for every size. The mAP curve quickly converges to that of the complete committee; the consensus based on a small proportion of the committee (e.g., 15 networks) already works well compared to the complete committee of 85 networks.
Applicability with Random Committees over More Datasets.
To demonstrate the applicability of the proposed framework, we extend our experiments to other datasets, using networks randomly picked from the pool, including Stanford Cars 196 (Krause et al., 2013), Oxford Flowers 102 (Nilsback and Zisserman, 2008), and Foods 101 (Bossard et al., 2014). Dataset descriptions and experimental details are included in the appendix. The results in Figure 7 confirm that the positive correlations between the consensus score and model performance hold for a wide range of models across datasets/tasks.
5 Discussions: Limits and Potentials with Future Works
Limits. In this section, we would like to discuss several limits of our studies. First of all, we propose to study the features used by deep models for classification, but we use the explanation results (i.e., the importance of superpixels/pixels in the image for prediction) obtained by interpretation algorithms. Obviously, the correctness of the interpretation algorithms might affect our results. However, we use two independent algorithms, LIME (Ribeiro et al., 2016) and SmoothGrad (Smilkov et al., 2017), which attribute feature importance at two different scales, i.e., superpixels and pixels. Both algorithms lead to the same observations and conclusions (see Section 4.5 for the consistency between results obtained by LIME and SmoothGrad). Thus, we believe the interpretation algorithms here are trustworthy and that it is appropriate to use explanation results as a proxy to analyze features. For future research, we would include more advanced interpretation algorithms to confirm our observations.
We obtain some interesting observations from our experiments and draw conclusions using multiple datasets. However, the image classification datasets used in our experiments have a limit: every image contains only one visual object for classification. It is reasonable to doubt that, when multiple visual objects (other than the target for classification) and complicated background patterns (Koh and Liang, 2017; Chen et al., 2017a) co-exist in an image, the cross-model consensus of explanations may no longer overlap with the ground truth semantic segmentation. Indeed, we include an example from the COCO dataset (Lin et al., 2014) in the appendix, where multiple objects co-exist in the image and the consensus does not always match the segmentation. Our future work will focus on datasets with multiple visual objects and complicated backgrounds, for object detection, segmentation, and multi-label classification tasks.
Finally, only well-known models with good performance have been included in the committee, which certainly introduces some bias into our analysis. However, in practice, these models are among the first choices and are frequently used in many applications. In our future work, we will include more models with diverse performance to seek further observations.
Potentials. In addition to the limits, our work also demonstrates several potentials of the cross-model consensus of explanations for further studies. As shown in Figure 6, with a larger committee, the consensus converges to a stable set of common features that clearly aligns with the segmentation ground truth of the dataset. This experiment further demonstrates the capacity of the consensus to precisely locate the visual objects for classification. Thus, in our future work, we would like to use the consensus based on a committee of image classification models to detect the positions of visual objects in images.
Furthermore, our experiments with both interpretation algorithms on all datasets have found that consensus scores are "coincidentally" correlated with the interpretability scores of the models, even though the interpretability scores were evaluated in totally different ways: Network Dissection (Bau et al., 2017) and user studies. Actually, Network Dissection evaluates the interpretability of a model by matching its activation maps in intermediate layers with the ground truth segmentation of visual concepts in the image. A model with higher interpretability should have more convolutional filters activated at the visual patterns/objects for classification. In this light, we additionally measure the similarity between the explanation results obtained for every model and the segmentation ground truth of images, and find that the models' segmentation-explanation similarity significantly correlates with their consensus scores (see Figure 8). This observation encourages us to further study the connections between interpretability and consensus scores in future work.
6 Conclusion

In this paper, we study the common features shared by various deep models for image classification. We ask (1) what the common features are and (2) whether the use of common features could improve performance. Specifically, given the explanation results obtained by interpretation algorithms, we propose to aggregate the explanation results from different models and obtain the cross-model consensus of explanations through voting. To understand the features used by every model and the common ones, we measure the consensus score as the similarity between the consensus and the explanation of every model.
Our empirical studies, based on extensive experiments using 80+ deep models on 5 datasets/tasks, find that (i) the consensus aligns with the ground truth semantic segmentation of the visual objects for classification; (ii) models with higher consensus scores enjoy better testing accuracy; and (iii) the consensus scores coincidentally correlate with the interpretability scores obtained by Network Dissection and user evaluations. In addition to the main claims, we also include additional experiments demonstrating the robustness of the consensus, including the alternative use of LIME and SmoothGrad and their effects on the results/conclusions, the consistency of the consensus achieved by different groups of deep models, the fast convergence of the consensus with an increasing number of deep models in the committee, and the random selection of deep models as committees for consensus-based evaluation on other datasets. All these studies confirm the applicability of the consensus as a proxy to study and analyze the common features shared by different models. Several open issues and potentials have been discussed, with future directions introduced. We are hereby encouraged to further adopt the consensus and consensus scores to better understand the behaviors of deep models.
References

- Adebayo et al. (2018). Sanity checks for saliency maps. In Advances in Neural Information Processing Systems (NeurIPS), pp. 9505–9515.
- Ahern et al. (2019). NormLime: a new feature importance metric for explaining deep neural networks. arXiv preprint arXiv:1909.04200.
- Ancona et al. (2018). Towards better understanding of gradient-based attribution methods for deep neural networks. In International Conference on Learning Representations (ICLR).
- Bach et al. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10(7), e0130140.
- Bau et al. (2017). Network dissection: quantifying interpretability of deep visual representations. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), pp. 6541–6549.
- Bossard et al. (2014). Food-101 – mining discriminative components with random forests. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 446–461.
- Chattopadhay et al. (2018). Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks. In IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 839–847.
- Chen et al. (2017a). Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526.
- Chen et al. (2017b). Dual path networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4467–4475.
- Chollet (2017). Xception: deep learning with depthwise separable convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1251–1258.
- Deng et al. (2009). ImageNet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255.
- Ding et al. (2019). ACNet: strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1911–1920.
- Doshi-Velez and Kim (2017). Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
- Gao et al. (2019). Res2Net: a new multi-scale backbone architecture.
- Gu et al. (2017). BadNets: identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733.
- He et al. (2016). Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Hooker et al. (2019). A benchmark for interpretability methods in deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 9737–9748.
- Howard et al. (2017). MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
- Howard et al. (2019). Searching for MobileNetV3. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Hu et al. (2018). Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Huang et al. (2017). Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4700–4708.
- Iandola et al. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360.
- Jeyakumar et al. (2020). How can I explain this to you? An empirical study of deep neural network explanation methods. In Advances in Neural Information Processing Systems (NeurIPS).
- Koh and Liang (2017). Understanding black-box predictions via influence functions. In International Conference on Machine Learning (ICML), pp. 1885–1894.
- Krause et al. (2013). 3D object representations for fine-grained categorization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 554–561.
- Krizhevsky et al. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS).
- Lage et al. (2019). An evaluation of the human-interpretability of explanation. arXiv preprint arXiv:1902.00006.
- Lin et al. (2014). Microsoft COCO: common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 740–755.
- Lin et al. (2020). What do you see? Evaluation of explainable artificial intelligence (XAI) interpretability through neural backdoors. arXiv preprint arXiv:2009.10639.
- Liu et al. (2018). DARTS: differentiable architecture search. arXiv preprint arXiv:1806.09055.
- Lundberg and Lee (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4765–4774.
- Ma et al. (2018). ShuffleNet V2: practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131.
- Nilsback and Zisserman (2008). Automated flower classification over a large number of classes. In Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729.
- Petsiuk et al. (2018). RISE: randomized input sampling for explanation of black-box models. In Proceedings of the British Machine Vision Conference (BMVC).
- Redmon et al. (2016). You only look once: unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788.
- Redmon and Farhadi (2018). YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767.
- Ribeiro et al. (2016). "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144.
- Samek et al. (2016). Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems 28(11), pp. 2660–2673.
- Sandler et al. (2018). MobileNetV2: inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Selvaraju et al. (2020). Grad-CAM: visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision (IJCV) 128(2), pp. 336–359.
- Sermanet et al. (2014). OverFeat: integrated recognition, localization and detection using convolutional networks. In International Conference on Learning Representations (ICLR).
- Shrikumar et al. (2017). Learning important features through propagating activation differences. In International Conference on Machine Learning (ICML), pp. 3145–3153.
- Simonyan and Zisserman (2015). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR).
- Smilkov et al. (2017). SmoothGrad: removing noise by adding noise. In ICML Workshop on Visualization for Deep Learning.
- Sundararajan et al. (2017). Axiomatic attribution for deep networks. In International Conference on Machine Learning (ICML).
- Szegedy et al. (2015). Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9.
- Tan and Le (2019). EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (ICML).
- van der Linden et al. (2019). Global aggregations of local explanations for black box models. FACTS-IR Workshop at SIGIR 2019.
- Vedaldi and Soatto (2008). Quick shift and kernel methods for mode seeking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 705–718.
- Vu et al. (2019). Evaluating explainers via perturbation. arXiv preprint arXiv:1906.02032.
- Wang et al. (2020). Score-CAM: score-weighted visual explanations for convolutional neural networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 24–25.
- Deep high-resolution representation learning for visual recognition. Cited by: Appendix D.
- Caltech-UCSD birds 200. Technical report Technical Report CNS-TR-2010-001, California Institute of Technology. Cited by: §B.1, §4.1.
- Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Appendix D.
- Benchmarking attribution methods with relative feature importance. arXiv, pp. arXiv–1907. Cited by: §2.
- On the (in) fidelity and sensitivity for explanations. In Advances in neural information processing systems (NeurIPS), Cited by: §2.
Interpreting cnns via decision trees. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6261–6270. Cited by: §2.
- Shufflenet: an extremely efficient convolutional neural network for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Appendix D.
Learning deep features for discriminative localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921–2929. Cited by: §1, §2.
Appendix A Complete Pseudocode of Cross-Model Consensus of Explanations
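The full pseudocode appears in the figure referenced by this appendix. As a rough illustration only, the procedure described in the main text — collect one explanation per committee model, vote to form the cross-model consensus, then score each model by its similarity to the consensus — can be sketched as below. The element-wise mean as the voting rule and cosine similarity as the scoring function are illustrative assumptions, not necessarily the exact choices of the paper.

```python
import numpy as np

def consensus_of_explanations(explanations):
    """Aggregate per-model explanations (one importance vector per model,
    e.g. over superpixels) into a consensus by element-wise voting (mean)."""
    return np.mean(np.stack(explanations), axis=0)

def consensus_scores(explanations, consensus):
    """Score each model by the similarity (here: cosine similarity, an
    illustrative choice) between its explanation and the consensus."""
    scores = []
    for e in explanations:
        sim = np.dot(e, consensus) / (
            np.linalg.norm(e) * np.linalg.norm(consensus) + 1e-12)
        scores.append(float(sim))
    return scores

# toy committee: three models explaining the same image (5 superpixels);
# the first two models agree, the third focuses on a different region
expl = [np.array([0.9, 0.1, 0.0, 0.8, 0.2]),
        np.array([0.8, 0.2, 0.1, 0.9, 0.1]),
        np.array([0.1, 0.9, 0.0, 0.7, 0.3])]
c = consensus_of_explanations(expl)
scores = consensus_scores(expl, c)
```

Under this sketch, models whose explanations agree with the majority receive higher consensus scores, matching the intuition behind Figure 1.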
Appendix B Experimental Details
In this section, we present technical details of the experiments in the main text, covering the preparation of deep models for committee formation, the interpretation algorithms, and the user-study evaluations.
B.1 Committee Formations
There were around 100 deep models trained on ImageNet that were publicly available (https://github.com/PaddlePaddle/models/blob/release/1.8/PaddleCV/image_classification/README_en.md#supported-models-and-performances) at the moment we initiated the experiments. We first exclude some very large models that take much more computation resources. Then, for the consistency of computing superpixels, we include only the models that take images of size 224×224 as input, resulting in 81 models for the ImageNet-based committee. To include the remaining models in the committee, one solution would be to simply align the superpixels across different input image sizes. However, in our experiments we choose not to do so, since a large number of models is already available.
As for CUB-200-2011 (Welinder et al., 2010), we similarly first exclude the very large models. We then follow the standard procedures (Sermanet et al., 2014; Simonyan and Zisserman, 2015) for fine-tuning ImageNet-pretrained models on CUB-200-2011. For simplicity, we use the same training setup for fine-tuning all pre-trained models (learning rate 0.01, batch size 64, SGD optimizer with momentum 0.9, resizing images so that the short edge is 256, and randomly cropping to 224×224), and obtain 85 well-trained models. Different hyper-parameters may further improve the performance of some specific networks, but for the same reason, i.e., the large number of available models, we choose not to search for better hyper-parameter settings.
For Stanford Cars 196 (Krause et al., 2013), Oxford Flowers 102 (Nilsback and Zisserman, 2008) and Foods 101 (Bossard et al., 2014), we follow the same fine-tuning procedure as on CUB-200-2011. However, given the convergence over committee sizes (Figure 6), which suggests that a committee of more than 15 models suffices, we randomly choose around 20 models for each of the three datasets.
B.2 Interpretation Algorithms
LIME (Ribeiro et al., 2016) first extracts the superpixels of an image, then generates perturbed samples by randomly masking some superpixels and computing the model's outputs on the generated samples, and finally fits the model outputs, with the presence/absence of superpixels as input, using a linear regression model. The linear weights then directly indicate the superpixel-level feature importance as the explanation result.
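The masking-and-fitting steps above can be sketched as follows. This is a minimal illustration, not LIME's actual implementation: it uses plain least squares in place of LIME's locally weighted ridge regression, and `model_fn`, `image`, and `segments` are placeholder names for the black-box model, the input, and a precomputed superpixel map (e.g. from quick shift).

```python
import numpy as np

def lime_superpixel_weights(model_fn, image, segments, n_samples=1000, seed=0):
    """Minimal LIME-style explanation: randomly mask superpixels, query the
    model on each perturbed input, and fit a linear model from the binary
    presence mask to the model output. Returns one weight per superpixel."""
    rng = np.random.default_rng(seed)
    ids = np.unique(segments)
    masks = rng.integers(0, 2, size=(n_samples, len(ids)))  # presence/absence
    outputs = []
    for m in masks:
        perturbed = image.copy()
        for sp, keep in zip(ids, m):
            if not keep:
                perturbed[segments == sp] = 0.0  # mask out the superpixel
        outputs.append(model_fn(perturbed))
    # least-squares fit; plain stand-in for LIME's weighted regression
    X = np.column_stack([masks, np.ones(n_samples)])
    w, *_ = np.linalg.lstsq(X, np.array(outputs), rcond=None)
    return w[:-1]  # per-superpixel importance (drop the intercept)

# toy example: a "model" whose output is the sum of the first superpixel's
# pixels, so only superpixel 0 should receive a non-zero weight
image = np.ones(8)
segments = np.array([0, 0, 0, 0, 1, 1, 1, 1])
weights = lime_superpixel_weights(lambda x: x[:4].sum(), image, segments,
                                  n_samples=200)
```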
The gradients of the model output w.r.t. the input can partly identify influential pixels, but due to the saturation of activation functions in deep networks, the vanilla gradient is usually noisy. SmoothGrad (Smilkov et al., 2017) reduces the visual noise by repeatedly adding small random noise to the input so as to obtain a list of corresponding gradients, which are then averaged for the final explanation result.
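The averaging step is simple enough to sketch directly. This is a minimal version assuming a `grad_fn` callable that returns the gradient of the model output w.r.t. the input; in practice this gradient is obtained by backpropagation through the network.

```python
import numpy as np

def smoothgrad(grad_fn, x, noise_scale=0.1, n_samples=200, seed=0):
    """SmoothGrad: average the gradients taken at noisy copies of the input.
    `grad_fn(x)` returns the gradient of the model output w.r.t. x."""
    rng = np.random.default_rng(seed)
    grads = [grad_fn(x + rng.normal(0.0, noise_scale, size=x.shape))
             for _ in range(n_samples)]
    return np.mean(grads, axis=0)

# toy model f(x) = sum(x^2), whose gradient is 2x; the smoothed gradient
# should average back to roughly 2x
x = np.array([1.0, -2.0, 0.5])
sg = smoothgrad(lambda z: 2.0 * z, x)
```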
Note that many other interpretation algorithms are compatible with our proposed framework; in this paper, we validate our approach with two trustworthy and commonly used algorithms.
B.3 Human User-Study Evaluations
As introduced in the main text, we conducted user-study experiments on model interpretability over the five models discussed by Network Dissection, i.e., DenseNet161, ResNet152, VGG16, GoogLeNet and AlexNet, and the user-study results align well with the results of our framework using either LIME or SmoothGrad. We describe here the experimental settings of the user-study evaluations.
For each image, we randomly choose two of the five models and present the LIME (or SmoothGrad, respectively) explanations of the two models, without revealing the model identities to the users. Users are then requested to choose which explanation better reveals the model's reasoning for making predictions according to their understanding, or to mark the two as equal if both interpretations are equally good or bad. Each pair of models is evaluated three times, each time presented to different users. The better model in each pair gets three points and the other gets zero; in the equal case, both get one point. Finally, each model's points are normalized by dividing by the number of images and the number of repeats (i.e., 3). The user-study evaluations yield the scores indicating model interpretability, as shown in Table 1.
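The scoring rule above can be expressed compactly. The function name and the vote encoding (`'a'`, `'b'`, `'equal'`) are hypothetical; only the point scheme (3/0 for a decided pair, 1/1 for a tie, normalization by images times repeats) follows the description.

```python
from collections import defaultdict

def score_user_study(votes, n_images, n_repeats=3):
    """Convert pairwise user votes into per-model interpretability scores.
    Each vote is (model_a, model_b, winner), where winner is 'a', 'b', or
    'equal'. The winner gets 3 points and the loser 0; a tie gives each
    model 1 point. Scores are normalized by n_images * n_repeats."""
    points = defaultdict(float)
    for a, b, winner in votes:
        if winner == 'a':
            points[a] += 3
        elif winner == 'b':
            points[b] += 3
        else:  # equally good or bad
            points[a] += 1
            points[b] += 1
    return {m: p / (n_images * n_repeats) for m, p in points.items()}

# one image, the same pair judged three times by different users
votes = [("model_a", "model_b", "a"),
         ("model_a", "model_b", "equal"),
         ("model_a", "model_b", "b")]
scores = score_user_study(votes, n_images=1)
```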
Appendix C ResNet Family
We show the zoomed plot of the ResNet family (models whose names contain the "ResNet" keyword) within the ImageNet-LIME committee of 81 models in Figure 9(a). We also present the results using the ResNet family alone as the committee in Figure 9(b). The two subfigures show no large difference, which further confirms the consistency of our approach in ranking models across different committees. Note that the positive correlation between model performance and consensus scores does not hold within the ResNet family: as explained before, in some local regions, especially when models are extremely large, the correlation is not always positive.
Appendix D References of Network Structures
Most of the frequently used structures of deep models have been evaluated in this paper, including AlexNet (Krizhevsky et al., 2012), ResNet (He et al., 2016), ResNeXt (Xie et al., 2017), SEResNet (Hu et al., 2018), ShuffleNet (Zhang et al., 2018; Ma et al., 2018), MobileNet (Howard et al., 2017; Sandler et al., 2018; Howard et al., 2019), VGG (Simonyan and Zisserman, 2015), GoogLeNet (Szegedy et al., 2015), Inception (Szegedy et al., 2015), Xception (Chollet, 2017), DarkNet (Redmon et al., 2016; Redmon and Farhadi, 2018), DenseNet (Huang et al., 2017), DPN (Chen et al., 2017b), SqueezeNet (Iandola et al., 2016), EfficientNet (Tan and Le, 2019), Res2Net (Gao et al., 2019), HRNet (Wang et al., 2020b), Darts (Liu et al., 2018), AcNet (Ding et al., 2019) and their variants.
Appendix E Numerical Report of Main Plots
Due to the large number of deep models evaluated, Figure 8, Figure 4 and Figure 5 group some models of the same architecture. Here, we report all of the corresponding numerical results in Table 2 at a smaller scale.
Appendix F More Visualization Results
For further exploration, we visualize several randomly chosen images from MS-COCO (Lin et al., 2014) in Figure 14. As introduced in Section 5, one direction for future work would be to extend the proposed framework to datasets with multiple visual objects and complicated backgrounds, for object detection, segmentation, and multi-label classification tasks.