In a machine learning setting, a question of great interest is estimating the influence of a given input feature on the prediction made by a model. Understanding what input features are important helps improve our models, builds trust in the model prediction and isolates undesirable behavior. For certain areas such as healthcare, autonomous vehicles and credit scoring, the need for such interpretability goes beyond a “nice-to-have.” In these sensitive domains, estimates of feature importance must be both 1) meaningful to a human and 2) highly accurate, as an incorrect explanation of model behavior may have intolerable costs on human welfare.
In this work, we are concerned with 2). We propose a formal methodology to evaluate the accuracy of commonly used feature importance estimators for deep neural networks (DNNs). DNNs pose unique challenges for the estimation of input feature importance, and for work such as ours that considers whether the estimates produced are reliable, due to both the non-linear activations present in DNNs and the large number of input features often involved in tasks where DNNs are used.
Due to this high dimensional input space, there has been limited but important work that estimates feature importance across all possible data points (Koh & Liang, 2017). Instead, numerous methods have been proposed (Baehrens et al., 2010; Bach et al., 2015; Zintgraf et al., 2017; Selvaraju et al., 2017; Sundararajan et al., 2017; Simonyan & Zisserman, 2015; Zeiler & Fergus, 2014; Springenberg et al., 2015; Kindermans et al., 2017; Montavon et al., 2017; Fong & Vedaldi, 2017; Dabkowski & Gal, 2017; Zhang et al., 2016; Shrikumar et al., 2017; Zhou et al., 2014; Ross & Doshi-Velez, 2017) which constrain ranking to the set of input features associated with a single image. These estimators produce a score for each pixel that reflects the estimated contribution to the model prediction for that image. The magnitude of the score can then be used to rank and compare the importance of all input features. More recent work (Smilkov et al., 2017; Adebayo et al., 2018a) has proposed derivative approaches that ensemble a set of estimates. These ensemble methods are often considered more appealing because they produce a “visually sharper” explanation of model behavior for cases where the scores are visualized as a natural image “heatmap.” (see Fig. 2 for a visual comparison of base vs. ensemble estimators).
However, it is challenging to evaluate whether this explanation of model behavior is reliable. If we knew what was important to the model, we would not need to estimate feature importance in the first place. Instead, in this work, we propose a measure that evaluates the approximate accuracy of a feature importance ranking, according to the hypothesis that a more accurate ranking will identify a subset of features as important whose removal will degrade model performance the most.
We term this measure ROAR, RemOve And Retrain. For each estimator, ROAR replaces a fraction of all pixels that are estimated to be most important with a constant value that is irrelevant for the classification task. This modification (shown in Fig. 2) is repeated for each image in both the training and test set. To measure the change to model behavior subsequent to the removal of these input features, we separately train new models on the altered dataset and the original unmodified images. An approximately accurate estimator will identify as important input pixels those whose subsequent removal causes the sharpest degradation in accuracy. In Fig. 1, we illustrate the key steps in the ROAR framework.
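To make the procedure concrete, the remove-and-retrain loop can be sketched on a toy problem. Everything here is an illustrative assumption rather than the paper's actual pipeline: the synthetic data, the nearest-class-mean classifier standing in for "retraining a model from random initialization", and the hand-picked rankings.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n=400, d=8):
    # Toy task: only feature 0 carries label information; the rest are noise.
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, d))
    x[:, 0] += 3.0 * y
    return x, y

def train_and_eval(xtr, ytr, xte, yte):
    # "Retrain from scratch": refit a nearest-class-mean classifier.
    mu0, mu1 = xtr[ytr == 0].mean(axis=0), xtr[ytr == 1].mean(axis=0)
    pred = (np.linalg.norm(xte - mu1, axis=1)
            < np.linalg.norm(xte - mu0, axis=1)).astype(int)
    return (pred == yte).mean()

def remove_top(x, ranking, k, fill):
    # Replace the k features ranked most important with a constant value.
    out = x.copy()
    out[:, ranking[:k]] = fill
    return out

xtr, ytr = make_data()
xte, yte = make_data()
fill = xtr.mean()  # constant, task-irrelevant replacement value

accurate_ranking = np.arange(8)          # informative feature ranked first
inaccurate_ranking = np.arange(8)[::-1]  # informative feature ranked last

acc_accurate = train_and_eval(remove_top(xtr, accurate_ranking, 1, fill), ytr,
                              remove_top(xte, accurate_ranking, 1, fill), yte)
acc_inaccurate = train_and_eval(remove_top(xtr, inaccurate_ranking, 1, fill), ytr,
                                remove_top(xte, inaccurate_ranking, 1, fill), yte)
# The more accurate ranking causes the sharper post-retraining degradation.
```

Removing the truly informative feature and retraining collapses accuracy toward chance, while removing a noise feature leaves it nearly intact; this asymmetry is exactly the signal ROAR measures.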
Training a new model (from random initialization) is crucial in order for the constant value for which we replaced the input to be considered “uninformative.” Without retraining, it is difficult to decouple whether the model’s degradation in performance is due to the replacement value being outside of the training data manifold or due to the accuracy of the estimate. Model vulnerability to the introduction of “new evidence” has already been widely acknowledged (Dabkowski & Gal, 2017; Fong & Vedaldi, 2017).
In addition to comparing the approximate accuracy of a set of estimators, we also compare estimator performance to a random assignment of importance and to the mask produced by applying a Sobel edge filter to the image. Both of these control variants produce rankings that are independent of the properties of the model we aim to interpret. Given that these methods do not depend upon the model (the Sobel edge detector depends only on the input image, whereas the random estimator is independent of both model and data), the performance of these variants represents a lower bound on the accuracy that an estimator could be expected to achieve. In particular, the random baseline allows us to answer the question: is the estimator more accurate than a random guess as to which features are important?
In a broad set of experiments across three large scale, open source image datasets—ImageNet (Deng et al., 2009), Food 101 (Bossard et al., 2014) and Birdsnap (Berg et al., 2014)—our results are consistent and thought-provoking:
Without ensembling, the interpretability methods that we evaluate perform no better than or on par with a random assignment of importance. However, we show that certain derivative approaches that ensemble sets of these estimates far outperform both the underlying method and such a random guess.
The choice of ensembling approach is paramount, as ensemble method performance varies widely. SmoothGrad-Squared (an unpublished variant of classic SmoothGrad) and VarGrad (Adebayo et al., 2018a) produce large gains in accuracy, while classic SmoothGrad (Smilkov et al., 2017) is less accurate than or on par with a single estimate but carries a far higher computational burden.
Finally, we show that training performance proves surprisingly robust to random modification of the majority of all input features. For example, after randomly replacing a large fraction of all ImageNet input features, we can still train a model to high accuracy (averaged across independent runs). These results suggest that many redundancies exist in the input feature space; however, the base estimators that we consider are no better than a random guess at identifying them.
2 Related Work
Interpretability research is diverse, and many different approaches are used to gain intuition about the function implemented by a neural network. For example, one can distill or constrain a model into a functional form that is considered more interpretable (Ba & Caruana, 2014; Frosst & Hinton, 2017; Wu et al., 2017; Ross et al., 2017). Other methods explore the role of neurons or activations in hidden layers of the network (Olah et al., 2017; Raghu et al., 2017; Morcos et al., 2018; Zhou et al., 2018), while others use high-level concepts to explain prediction results (Kim et al., 2018). Finally, there are the input feature importance estimators that we evaluate in this work. These interpretability methods estimate the importance of an input feature to a specified output activation.
Without a clear way to measure the “correctness” of a feature importance estimate, comparing the relative merit of different estimators is often based upon human studies (Selvaraju et al., 2017; Ross & Doshi-Velez, 2017; Lage et al., 2018, and many others), which interrogate whether the ranking is meaningful to a human. However, an explanation considered “trustworthy” does not guarantee that it reliably explains model behavior. It has already been shown that the level of human trust in a system is decoupled from the actual performance of the algorithm (Poursabzi-Sangdeh et al., 2018; Dietvorst et al., 2014).
Recently, there has been limited but important work on frameworks to evaluate whether interpretability methods are both reliable and meaningful. Kindermans et al. (2017) define a unit test that constructs a narrow ground-truth in which invariance to factors that do not affect the model can be measured. Adebayo et al. (2018b) consider a set of sanity checks that measure the change to an estimate as parameters in a model or dataset labels are randomized.
Most relevant to our work are the modification-based evaluation measures proposed originally by Samek et al. (2017), with subsequent variations (Ancona et al., 2017; Fong & Vedaldi, 2017; Kindermans et al., 2017). In this line of work, one replaces the inputs estimated to be most important with a value considered meaningless to the task, and measures the subsequent degradation to the trained model at inference time.
Unlike prior modification-based evaluation measures, and to the best of our knowledge for the first time, our benchmark requires retraining the model from random initialization on the modified dataset rather than re-scoring the modified image at inference time. Without this step, one cannot decouple whether the model's degradation in performance is due to artifacts introduced by the value used to replace the pixels that are removed, or due to the approximate accuracy of the estimator. We discuss this further in Section 3.3, supported by large-scale experiments on ImageNet.
We do not modify a region, or patch, of connected pixels according to aggregated estimates of importance; instead, we simply modify the fraction of individual inputs estimated to be most important. Finally, we modify every training and validation image in ImageNet, Birdsnap and Food 101. All prior evaluations have involved a far smaller subset of data and considered only a single dataset.
3 Estimating Input Feature Importance
A CNN is trained to approximate the function that maps an input variable x to an output variable y, formally F : x ↦ y. Without loss of generality, we represent the image input as a feature vector x. y is the discrete label vector associated with each input x. A given input image x can be decomposed into a set of pixels {x_1, …, x_d}.
An estimator g produces a vector of estimates e = g(x), where e_i is the estimated importance of pixel x_i to an output activation A^l_j, where l and j designate the layer of the model and the neuron of interest respectively.
A^l_j is typically specified to be the maximum pre-softmax score or the softmax probability.
3.1 Evaluation Methodology
We can rank the estimates e into an ordered list {e_(1), …, e_(d)} so that e_(1) corresponds to the input feature estimated to be most important. For a fraction t of this ordered set, we replace the corresponding values in the raw image vector with a constant uninformative value c. We create a family of distributions, where each distribution D^t_g is defined by incrementally increasing the fraction t of inputs modified and varying the estimator g.
When t = 0, the dataset is unmodified, and the test-set accuracy will differ from that of a model trained on unmodified inputs only by an epsilon term caused by the natural variation in training performance. When t = 1, we have replaced all input features with the constant value c, and learning a representation should not be possible.
In between, we are unable to precisely determine how removing inputs will change the test-set accuracy, since we do not know the true distribution of importance a priori. However, we can compare the degradation of test-set accuracy between estimators for the same fraction t.
ROAR evaluates estimators according to the hypothesis that the most approximately accurate estimator will identify a subset of features as important whose removal will degrade model performance the most. Thus, the most desirable estimator g* is the one that results in the lowest test-set accuracy:

g* = argmin_g acc(D^t_g),

where D^t_g is the dataset modified according to the estimator g at fraction t. In addition, we determine an estimate to be better than a random assignment of importance if the resulting test-set accuracy satisfies acc(D^t_g) < acc(D^t_rand), where D^t_rand is the dataset with a random fraction t of inputs modified.
3.2 Estimators Considered
In this work, our initial evaluation is constrained to a subset of estimators, selected based upon the availability of open source code and the ease of implementation on a ResNet-50 architecture (He et al., 2015). We welcome the opportunity to consider additional estimators in the future, and to make it easy to apply ROAR to additional estimators we have open sourced our code: https://bit.ly/2ttLLZB. We briefly introduce each grouping of estimators below.
3.2.1 Base Estimators
Gradient Heatmap (Grad)
is the gradient of the output activation of interest with respect to the input x: e = ∂A/∂x, where A denotes the activation.
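As a minimal sketch of a gradient heatmap (using central finite differences as a stand-in for backpropagation, and a toy scalar function f in place of a network activation):

```python
import numpy as np

def grad_saliency(f, x, eps=1e-5):
    # Estimate e = df/dx by central finite differences; f is a scalar
    # function standing in for the output activation of interest.
    e = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step.flat[i] = eps
        e.flat[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return e
```

For a linear function the finite-difference estimate recovers the weight vector exactly, which makes the sketch easy to sanity-check.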
Guided Backprop (Springenberg et al., 2015) (GB)
is an example of a signal method. Signal estimators aim to visualize the input patterns that cause the neuron activation in higher layers (Springenberg et al., 2015; Zeiler & Fergus, 2014; Kindermans et al., 2017). GB computes the gradient of the output activation with respect to the input, but modifies the backward pass through each ReLU so that only positive gradients are propagated, restricting the estimate to input patterns that positively contribute to the activation.
Integrated Gradients (Sundararajan et al., 2017) (IG)
is an example of an attribution method. Attribution estimators assign importance to input features by decomposing the output activation into contributions from the individual input features (Bach et al., 2015; Sundararajan et al., 2017; Montavon et al., 2017; Shrikumar et al., 2016; Kindermans et al., 2017). Attribution methods require that all contributions sum to the activation of interest; this property is often termed completeness. Integrated gradients interpolate a set of gradient estimates for values between a non-informative reference point x^0 and the actual input x. This integral can be approximated by summing gradients at k points spaced at small intervals between x^0 and x:

e_i = (x_i − x^0_i) · (1/k) Σ_{j=1}^{k} ∂F(x^0 + (j/k)(x − x^0)) / ∂x_i,

where F denotes the function computed by the network. The final estimate will depend upon both the choice of k and the reference point x^0. As suggested by Sundararajan et al. (2017), we use a black image as the reference point x^0 and follow their suggested setting of k.
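The Riemann-sum approximation above can be sketched as follows (the `grad_fn` callback, the default step count, and the baseline argument are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, k=32):
    # Sum gradients at k points on the straight line from `baseline` to `x`,
    # then scale by (x - baseline), approximating the path integral.
    diff = x - baseline
    total = np.zeros_like(x, dtype=float)
    for j in range(1, k + 1):
        total += grad_fn(baseline + (j / k) * diff)
    return diff * total / k
```

For a linear function the approximation is exact, so completeness holds: the attributions sum to the difference in output between x and the baseline.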
3.2.2 Derivative Approaches that Ensemble a Set of Estimates
An example of a single image modified according to the ensemble approaches can be seen in Fig. 3. For all the ensemble approaches that we describe below (SG, SG-SQ, Var), we designate a set size of estimates as suggested by Smilkov et al. (2017). Note that the ensemble approaches described can be wrapped around any interpretability method that produces a ranking of feature importance.
Classic SmoothGrad (SG) (Smilkov et al., 2017)
SG averages a set of noisy estimates of feature importance, constructed by independently injecting a single input x with Gaussian noise J times:

e_SG = (1/J) Σ_{j=1}^{J} g(x + η_j), with η_j ∼ N(0, σ²),

where g denotes the underlying base estimator.
SmoothGrad-Squared (SG-SQ)
is an unpublished variant of classic SmoothGrad which squares each estimate before averaging the estimates:

e_SG-SQ = (1/J) Σ_{j=1}^{J} g(x + η_j)².

Although SG-SQ is not described in the original publication, it is the default in the open-source implementation of SG: https://bit.ly/2Hpx5ob.
VarGrad (Var) (Adebayo et al., 2018a)
employs the same methodology as classic SmoothGrad (SG) to construct a set of J noisy estimates. However, VarGrad aggregates the estimates by computing the variance of the noisy set rather than the mean: e_Var = Var_j[g(x + η_j)].
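The three ensemble rules differ only in how the J noisy estimates are combined. A sketch (the base `estimator` callback, noise scale, and set size are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def ensemble_estimate(estimator, x, mode, sigma=0.15, J=15, seed=0):
    # Build J noisy saliency estimates and aggregate:
    #   'sg'    : mean of the estimates         (classic SmoothGrad)
    #   'sg_sq' : mean of the squared estimates (SmoothGrad-Squared)
    #   'var'   : variance of the estimates     (VarGrad)
    rng = np.random.default_rng(seed)
    samples = np.stack([estimator(x + rng.normal(0.0, sigma, size=x.shape))
                        for _ in range(J)])
    if mode == "sg":
        return samples.mean(axis=0)
    if mode == "sg_sq":
        return (samples ** 2).mean(axis=0)
    if mode == "var":
        return samples.var(axis=0)
    raise ValueError(mode)
```

With a fixed seed the three modes share the same noisy samples, so the identity Var = SG-SQ − SG² can be checked directly.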
3.2.3 Control Variants
As a control, we compare each estimator to two rankings (a random assignment of importance and a Sobel edge filter) that do not depend at all on the model parameters.
Random
A random estimator replaces a fraction t of all pixels, selected at random from each image, with a constant uninformative value.
Sobel Edge Filter
convolves a hard-coded, separable, integer filter over an image to produce a mask of derivatives that emphasizes the edges in an image. A Sobel mask treated as a ranking will assign a high score to areas of the image with a high gradient (likely edges).
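A hand-rolled sketch of this control (3×3 Sobel kernels on a single-channel image; the edge padding is an implementation choice made here for illustration):

```python
import numpy as np

def sobel_ranking(img):
    # Gradient-magnitude mask from hard-coded Sobel kernels; depends only
    # on the image, never on the model being interpreted.
    kx = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    padded = np.pad(img.astype(float), 1, mode="edge")
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + 3, j:j + 3]
            gx[i, j] = (kx * patch).sum()
            gy[i, j] = (ky * patch).sum()
    return np.hypot(gx, gy)  # high score at high-gradient (edge) pixels
```

On a step image the mask scores the edge pixels highly and flat regions zero, which is exactly the ranking behavior described above.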
3.3 The Importance of Training a New Model
Training the model from random initialization on each of the modified datasets is crucial. When an image is modified by replacing the original feature values with a constant value c, the modification may introduce artifacts or “new evidence” that distorts model behavior, since inference-time prediction is then done on a different data distribution from the one the model was trained on.
This is because the replacement value c can only be considered uninformative if it is a value present in the training distribution but irrelevant to the classification task. Only by training from random initialization on the modified images can we ensure that the model F_w (where w specifies the model weights) is trained on a distribution that includes c. It is only in this case that the model can learn that c is an uninformative value. By including c in the training inputs, the learned F_w can approximate the true distribution of the modified data. Without retraining, F_w has been trained on one distribution but is expected to approximate a different, modified distribution at inference time.
In Fig. 5 we compare the performance of a model that is not re-trained on the modified inputs against the same model retrained from random initialization on those inputs (for ImageNet). The same random modification of ImageNet inputs degrades accuracy far more for the model that was not retrained than for the model that was retrained on the modified inputs. Without retraining the model, it is not possible to decouple the performance of the interpretability method from the degradation caused by the modification itself.
4 Experimental Framework and Results
4.1 Experiment Framework
We use a ResNet-50 model both for generating the feature importance estimates and for subsequent training on the modified inputs. ResNet-50 was chosen because of the public code implementations (in both PyTorch (Gross & Wilber, 2017) and TensorFlow (Abadi et al., 2015)) and because it can be trained to near state-of-the-art performance in a reasonable amount of time (Goyal et al., 2017).
For all train and validation images in the dataset, we first apply the test-time pre-processing used by Goyal et al. (2017). We compute an estimate of feature importance for every input in the training and test set. For all estimators, the output activation of interest is the pre-softmax activation of the model prediction. We rank each set of estimates into an ordered list. For the top fraction t of this ordered list, we replace the corresponding pixels in the raw image with the per-channel mean.
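The per-image modification step can be sketched as follows (the helper name and the per-channel-means argument are illustrative; the actual pipeline applies this to every image in each dataset):

```python
import numpy as np

def modify_image(img, scores, fraction, channel_means):
    # Replace the top-`fraction` ranked pixels of an HxWxC image with
    # the supplied per-channel mean values.
    h, w, c = img.shape
    k = int(fraction * h * w)
    top = np.argsort(scores.ravel())[::-1][:k]   # k highest-scoring pixels
    rows, cols = np.unravel_index(top, (h, w))
    out = img.astype(float).copy()
    out[rows, cols, :] = channel_means
    return out
```

The ranking is per pixel (one score shared across channels), so all channels of a removed pixel are overwritten together.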
We evaluate ROAR on three open source image datasets: ImageNet, Birdsnap and Food 101. For each dataset and estimator, we generate new train and test sets that each correspond to a different fraction t of feature modification and to whether the most important pixels are removed or kept. We evaluate the base estimators, a set of ensemble approaches wrapped around each base estimator, and a set of squared estimates. In total, we generate a large set of modified image datasets in order to consider all experiment variants (new test/train sets for each original dataset).
We independently train ResNet-50 models from random initialization on each of these modified datasets and report test accuracy as the average of these runs. In the base implementation, the ResNet-50 trained on an unmodified ImageNet dataset achieves a mean accuracy comparable to the performance reported by Goyal et al. (2017). On Birdsnap and Food 101, baseline performance on the unmodified datasets (averaged over 10 independent runs) is comparable to that reported by Kornblith et al. (2018).
4.2 Experimental Results
4.2.1 Robust performance given random modification
The random estimator assigns importance at random to all inputs. Comparing estimators to this baseline allows us to answer the question: is the estimate of importance more accurate than a random guess? The performance of the random baseline is surprising and consistent across all datasets. After replacing a large portion of all inputs with a constant value, the model not only trains but still retains most of its original predictive power. For example, on ImageNet, even when only a small fraction of all features is retained, the trained model still attains accuracy far above chance relative to the unmodified baseline.
The ability of the model to extract a meaningful representation from a small random fraction of inputs suggests that many inputs are likely redundant. The nature of our input, an image in which correlations between pixels are expected, provides one possible reason for this redundancy.
4.2.2 ROAR: Base estimators no better than a random guess when retraining
Surprisingly, the left inset of Fig. 4 shows that the base estimators that we consider (GB, IG, Grad) consistently perform worse than a random assignment of feature importance, for all fractions t of inputs modified. This finding is consistent across all datasets. Furthermore, the estimators fall further behind the accuracy of a random guess as a larger fraction of inputs is modified; the gap is widest at the largest fractions. The base estimators also do not compare favorably to the performance of a Sobel edge filter: across all datasets and fractions, GB, IG and Grad perform on par with or worse than Sobel. This result is noteworthy because both the Sobel filter and the random ranking have formulations that are entirely independent of the model parameters. All the base estimators that we consider depend upon the trained model weights, and thus we would expect them to have a clear advantage in outperforming the control variants.
Base estimators perform within a very narrow range
Despite the very different formulations of the base estimators that we consider, the difference between their performance falls in a strikingly narrow range. For example, as can be seen in the right inset of Fig. 4, for Birdsnap the gap between the best and worst base estimator is small. This range remains narrow for both Food101 and ImageNet.
4.2.3 ROAR: Ensemble Approaches are not created equal
Ensemble approaches inevitably carry a higher computational cost, as they require the aggregation of a set of individual estimates. These ensemble estimates are often preferred as an interpretability tool by humans because they appear to produce “less visually noisy” explanations. However, our understanding of what these methods are actually doing, and of how that relates to the accuracy of the explanation, is very limited. Recent work shows that VarGrad (Var) produces a ranking that is actually independent of the gradient (Seo et al., 2018). We further the understanding of the advantages and disadvantages of ensemble approaches by evaluating the approximate accuracy of three methods (SG, SG-SQ and Var).
Classic SmoothGrad is less accurate or on par with a single estimate
The middle inset chart of Fig. 4 presents the first of a series of intriguing results. Classic SmoothGrad (SG) is the average of a set of estimates computed according to an underlying base method. However, despite the additional computational cost, SG degrades test-set accuracy less than a random guess does. In addition, in some cases SmoothGrad performs worse than a single estimate (for the gradient heatmap (Grad) and Integrated Gradients (IG)).
SmoothGrad-Squared produces large gains in accuracy
SmoothGrad-Squared is an unpublished variation of classic SmoothGrad that squares the noisy estimates before averaging. SmoothGrad-Squared, unlike SmoothGrad, produces large gains in accuracy that far outperform a random guess (right inset of Fig. 4). These gains are consistent across all estimators and datasets.
Squaring Slightly Improves the Performance of All Base Variants
The only difference between SmoothGrad and SmoothGrad-Squared is that, with the latter, estimates are squared before averaging. The large gap in performance between the two is worth further consideration. In Fig. 8, we consider the effect of only squaring estimates (no ensembling). We include further discussion in the appendix, but find that when squared, an estimate gains slightly more accuracy than a random ranking of input features. However, squaring alone does not explain the large gains in accuracy that we observe when we square each estimate and then aggregate by averaging.
VarGrad is comparable in performance to SmoothGrad-Squared.
In the right inset of Fig. 4, we show that both VarGrad and SmoothGrad-Squared far outperform the two control variants (a random guess and a Sobel edge filter). In addition, for all the interpretability methods we consider, a VarGrad or SmoothGrad-Squared ensemble far outperforms the approximate accuracy of a single estimate.
However, while VarGrad and SmoothGrad-Squared benefit the accuracy of all base estimators, the overall ranking of estimator performance differs by dataset. For ImageNet and Food101, the best performing estimators are VarGrad or SmoothGrad-Squared when wrapped around a gradient heatmap. However, for the Birdsnap dataset, the most approximately accurate estimates are these ensemble approaches wrapped around Guided Backprop. This suggests that while certain ensembling approaches consistently improve performance, the choice of the best underlying estimator may vary by task. This deserves further consideration.
In the right inset of Fig. 4, it can also be seen that the performance of VarGrad (Var) is remarkably similar to that of SmoothGrad-Squared (SG-SQ). For many of the estimators, applying SG-SQ and Var produces virtually identical performance. It is worth revisiting the formulation of VarGrad to consider one possibility for why this would be the case. As first introduced in Section 3.2.2, VarGrad is computed as the variance of a set of noisy estimates:

e_Var = (1/J) Σ_j g(x + η_j)² − ( (1/J) Σ_j g(x + η_j) )²,

where g is the underlying base estimator and η_j the injected Gaussian noise. It can be seen in the equation above that the first term is in fact equivalent to SG-SQ. One case in which SG-SQ and Var would produce a similar ranking is when the sample mean of the set of estimates is small or close to zero (such that the first term dominates).
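This decomposition is easy to verify numerically, here with random toy scores standing in for real saliency estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=(15, 8))     # 15 noisy estimates of 8 pixel scores
sg = g.mean(axis=0)              # classic SmoothGrad aggregation
sg_sq = (g ** 2).mean(axis=0)    # SmoothGrad-Squared aggregation
var = g.var(axis=0)              # VarGrad aggregation (biased variance)
# VarGrad equals SG-SQ minus the squared SmoothGrad mean.
assert np.allclose(var, sg_sq - sg ** 2)
```

When the mean term sg is near zero, var and sg_sq coincide, consistent with the similar rankings observed for SG-SQ and Var.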
5 Conclusion and Future Work
In this work, we propose ROAR to evaluate the approximate accuracy of input feature importance estimators. Surprisingly, we find that the commonly used base estimators that we evaluate perform worse than or on par with a random assignment of importance. Furthermore, certain ensemble approaches such as SmoothGrad are far more computationally intensive but do not improve upon a single estimate (and in some cases are worse). However, we also find that VarGrad and SmoothGrad-Squared significantly improve the approximate accuracy of a method and far outperform such a random guess. Our findings are particularly pertinent for sensitive domains where the accuracy of an explanation of model behavior is paramount. While we venture some initial consideration of why certain ensemble methods far outperform other estimators, the divergence in performance between the ensemble estimators deserves additional research treatment.
We thank Kevin Swersky, Andrew Ross, Douglas Eck, Jonas Kemp, Melissa Fabros, Julius Adebayo, Simon Kornblith, Prajit Ramachandran, Niru Maheswaranathan and Gamaleldin Elsayed for their thoughtful feedback on earlier iterations of this work.
- Abadi et al.  Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, January 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
- Adebayo et al. [2018a] Adebayo, J., Gilmer, J., Goodfellow, I., and Kim, B. Local explanation methods for deep neural networks lack sensitivity to parameter values. ICLR Workshop, 2018a.
- Adebayo et al. [2018b] Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I. J., Hardt, M., and Kim, B. Sanity checks for saliency maps. In NeurIPS, 2018b.
- Ancona et al.  Ancona, M., Ceolini, E., Öztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for Deep Neural Networks. ArXiv e-prints, November 2017.
- Ba & Caruana  Ba, L. J. and Caruana, R. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, pp. 2654–2662, Cambridge, MA, USA, 2014. MIT Press. URL http://dl.acm.org/citation.cfm?id=2969033.2969123.
- Bach et al.  Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
- Baehrens et al.  Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., and Müller, K.-R. How to explain individual classification decisions. Journal of Machine Learning Research, 11(Jun):1803–1831, 2010.
- Berg et al.  Berg, T., Liu, J., Lee, S. W., Alexander, M. L., Jacobs, D. W., and Belhumeur, P. N. Birdsnap: Large-scale fine-grained visual categorization of birds. In CVPR, 2014.
- Bossard et al.  Bossard, L., Guillaumin, M., and Van Gool, L. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision, 2014.
- Dabkowski & Gal  Dabkowski, P. and Gal, Y. Real Time Image Saliency for Black Box Classifiers. ArXiv e-prints, May 2017.
- Deng et al.  Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
- Dietvorst et al.  Dietvorst, B., Simmons, J., and Massey, C. Algorithm aversion: People erroneously avoid algorithms after seeing them err. Journal of experimental psychology. General, 144, 11 2014. doi: 10.1037/xge0000033.
- Fong & Vedaldi  Fong, R. C. and Vedaldi, A. Interpretable explanations of black boxes by meaningful perturbation. In ICCV, pp. 3449–3457. IEEE Computer Society, 2017.
- Frosst & Hinton  Frosst, N. and Hinton, G. Distilling a Neural Network Into a Soft Decision Tree. ArXiv e-prints, November 2017.
- Goyal et al.  Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. ArXiv e-prints, June 2017.
- Gross & Wilber  Gross, S. and Wilber, M. Training and investigating Residual Nets. https://github.com/facebook/fb.resnet.torch, January 2017.
- He et al.  He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. ArXiv e-prints, December 2015.
- Kim et al.  Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., and sayres, R. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2668–2677, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/kim18d.html.
- Kindermans et al.  Kindermans, P.-J., Hooker, S., Adebayo, J., Alber, M., Schütt, K. T., Dähne, S., Erhan, D., and Kim, B. The (Un)reliability of saliency methods. ArXiv e-prints, November 2017.
- Kindermans et al.  Kindermans, P.-J., Schütt, K. T., Alber, M., Müller, K.-R., Erhan, D., Kim, B., and Dähne, S. Learning how to explain neural networks: Patternnet and patternattribution. arXiv preprint arXiv:1705.05598v2, 2017.
- Koh & Liang  Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1885–1894, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
- Kornblith et al.  Kornblith, S., Shlens, J., and Le, Q. V. Do Better ImageNet Models Transfer Better? arXiv e-prints, art. arXiv:1805.08974, May 2018.
- Lage et al.  Lage, I., Slavin Ross, A., Kim, B., Gershman, S. J., and Doshi-Velez, F. Human-in-the-Loop Interpretability Prior. arXiv e-prints, art. arXiv:1805.11571, May 2018.
- Montavon et al.  Montavon, G., Lapuschkin, S., Binder, A., Samek, W., and Müller, K.-R. Explaining nonlinear classification decisions with deep taylor decomposition. Pattern Recognition, 65:211–222, 2017.
- Morcos et al.  Morcos, A. S., Barrett, D. G. T., Rabinowitz, N. C., and Botvinick, M. On the importance of single directions for generalization. ArXiv e-prints, March 2018.
- Olah et al.  Olah, C., Mordvintsev, A., and Schubert, L. Feature visualization. Distill, 2017. doi: 10.23915/distill.00007. https://distill.pub/2017/feature-visualization.
- Poursabzi-Sangdeh et al.  Poursabzi-Sangdeh, F., Goldstein, D. G., Hofman, J. M., Wortman Vaughan, J., and Wallach, H. Manipulating and Measuring Model Interpretability. arXiv e-prints, art. arXiv:1802.07810, February 2018.
- Raghu et al.  Raghu, M., Gilmer, J., Yosinski, J., and Sohl-Dickstein, J. SVCCA: singular vector canonical correlation analysis for deep learning dynamics and interpretability. In NIPS, pp. 6078–6087, 2017.
- Ross & Doshi-Velez  Ross, A. S. and Doshi-Velez, F. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. CoRR, abs/1711.09404, 2017.
- Ross et al.  Ross, A. S., Hughes, M. C., and Doshi-Velez, F. Right for the right reasons: Training differentiable models by constraining their explanations. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pp. 2662–2670, 2017.
- Samek et al.  Samek, W., Binder, A., Montavon, G., Lapuschkin, S., and Müller, K. R. Evaluating the Visualization of What a Deep Neural Network Has Learned. IEEE Transactions on Neural Networks and Learning Systems, 28(11):2660–2673, Nov 2017. ISSN 2162-237X. doi: 10.1109/TNNLS.2016.2599820.
- Selvaraju et al.  Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
- Seo et al.  Seo, J., Choe, J., Koo, J., Jeon, S., Kim, B., and Jeon, T. Noise-adding Methods of Saliency Map as Series of Higher Order Partial Derivative. arXiv e-prints, art. arXiv:1806.03000, Jun 2018.
- Shrikumar et al.  Shrikumar, A., Greenside, P., Shcherbina, A., and Kundaje, A. Not Just a Black Box: Learning Important Features Through Propagating Activation Differences. ArXiv e-prints, May 2016.
- Shrikumar et al.  Shrikumar, A., Greenside, P., and Kundaje, A. Learning Important Features Through Propagating Activation Differences. ArXiv e-prints, April 2017.
- Simonyan & Zisserman  Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- Smilkov et al.  Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
- Springenberg et al.  Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for simplicity: The all convolutional net. In ICLR, 2015.
- Sundararajan et al.  Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
- Wu et al.  Wu, M., Hughes, M. C., Parbhoo, S., Zazzi, M., Roth, V., and Doshi-Velez, F. Beyond Sparsity: Tree Regularization of Deep Models for Interpretability. ArXiv e-prints, November 2017.
- Zeiler & Fergus  Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833. Springer, 2014.
- Zhang et al.  Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. ArXiv e-prints, November 2016.
- Zhou et al.  Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Object Detectors Emerge in Deep Scene CNNs. ArXiv e-prints, 2014.
- Zhou et al.  Zhou, B., Sun, Y., Bau, D., and Torralba, A. Revisiting the importance of individual units in cnns via ablation. CoRR, abs/1806.02891, 2018. URL http://arxiv.org/abs/1806.02891.
- Zintgraf et al.  Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. Visualizing deep neural network decisions: Prediction difference analysis. In ICLR, 2017.
Appendix A Supplementary Charts and Experiments
We include supplementary experiments and additional details about our training procedure, image modification process and test-set accuracy below. In addition, as can be seen in Fig. 7, we also consider the scenario where pixels are kept according to importance rather than removed.
A.1 Generation of New Dataset
| Dataset | Top-1 Accuracy | Train Size | Test Size | Learning Rate | Training Steps |
The training procedure was carefully fine-tuned for each dataset. These hyperparameters are used consistently across all experiment variants. The baseline accuracy of each unmodified dataset is reported as the average of 10 independent runs.
We evaluate ROAR on three open source image datasets: ImageNet, Birdsnap and Food 101. For each dataset and estimator, we generate new train and test sets, each corresponding to a different fraction of feature modification and to whether the most important pixels are removed or kept. This requires first generating a ranking of input importance for each input image according to each estimator. All of the estimators that we consider evaluate feature importance post-training. Thus, we generate the rankings according to each interpretability method using a stored checkpoint for each dataset.
We use the ranking produced by the interpretability method to modify each image in the dataset (both train and test). We sort the per-pixel estimates into an ordered set. For the top fraction of this ordered set, we replace the corresponding pixels in the raw image with the per-channel mean. Fig. 9 and Fig. 10 show examples of the type of modification applied to each image in the dataset for Birdsnap and Food 101 respectively. In the paper itself, we show an example of a single image from each ImageNet modification.
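The per-image modification step can be sketched in a few lines. This is a minimal NumPy sketch; `roar_modify`, its argument names, and the toy shapes are illustrative assumptions, not the paper's actual code:

```python
import numpy as np

def roar_modify(image, importance, fraction, per_channel_mean):
    """Replace the top `fraction` most-important pixels of `image` (H, W, C)
    with the per-channel mean, given a (H, W) importance map."""
    h, w, c = image.shape
    flat = importance.reshape(-1)
    k = int(fraction * flat.size)
    # Indices of the k pixels ranked most important by the estimator.
    top = np.argsort(flat)[::-1][:k]
    modified = image.reshape(-1, c).copy()
    modified[top] = per_channel_mean  # broadcast (C,) over the selected pixels
    return modified.reshape(h, w, c)
```

The same transformation is applied to every train and test image before the model is retrained from scratch.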
We evaluate estimators in total (this includes the base estimators, a set of ensemble approaches wrapped around each base, and finally a set of squared estimates). In total, we generate large-scale modified image datasets in order to consider all experiment variants ( new test/train for each original dataset).
A.2 Training Procedure
We carefully tuned the hyperparameters for each dataset (ImageNet, Birdsnap and Food 101) separately. We find that Birdsnap and Food 101 converge within the same number of training steps but require a larger learning rate than ImageNet. These are detailed in Table 1. These hyperparameters, along with the mean accuracy reported on the unmodified dataset, are used consistently across all estimators. The ImageNet dataset achieves a mean accuracy of . This is comparable to the performance reported by . On Birdsnap and Food 101, our unmodified datasets achieve and respectively. The baseline test-set accuracy for Food 101 and Birdsnap is comparable to that reported by Kornblith et al.
In Table 2, we include the test-set performance for each experiment variant that we consider. The test-set accuracy reported is the average of independent runs.
A.3 Evaluating Keeping Rather Than Removing Information
In addition to ROAR, as can be seen in Fig. 7, we evaluate the opposite approach, KAR (Keep And Retrain). While ROAR removes features by replacing the fraction of inputs estimated to be most important, KAR preserves the inputs considered to be most important. Since we keep the important information rather than remove it, an accurate estimator is one that minimizes degradation to test-set accuracy.
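The KAR modification keeps, rather than replaces, the top-ranked pixels. A minimal NumPy sketch follows; the function and argument names are illustrative assumptions, not the authors' code:

```python
import numpy as np

def kar_modify(image, importance, fraction, per_channel_mean):
    """Keep the top `fraction` most-important pixels of `image` (H, W, C)
    and replace all remaining pixels with the per-channel mean."""
    h, w, c = image.shape
    flat = importance.reshape(-1)
    k = int(fraction * flat.size)
    keep = np.argsort(flat)[::-1][:k]
    # Start from an all-mean image, then restore only the kept pixels.
    modified = np.tile(per_channel_mean, (h * w, 1)).astype(image.dtype)
    modified[keep] = image.reshape(-1, c)[keep]
    return modified.reshape(h, w, c)
```

Comparing this with the ROAR modification, the only difference is which side of the ranking is overwritten with the mean.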
In the right inset chart of Fig. 7 we plot KAR on the same curve as ROAR to enable a more intuitive comparison between the benchmarks. The x-axis indicates the fraction of features that are preserved/removed for KAR/ROAR respectively. The comparison suggests that KAR is a poor discriminator between estimators.
We find that KAR is a far weaker discriminator of performance; all base estimators and the ensemble variants perform within a similar range of each other. These findings suggest that identifying features to preserve is an easier benchmark to satisfy than accurately identifying the fraction of inputs that will cause the maximum damage to model performance.
A.4 Squaring Alone Slightly Improves the Performance of All Base Variants
The surprising performance of SmoothGrad-Squared (SG-SQ) deserves further investigation: why is averaging a set of squared noisy estimates so effective at improving the accuracy of the ranking? To disentangle whether both squaring and averaging are required, we explore whether similar performance gains are achieved by squaring the estimate alone.
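The two ensembles differ only in where the square is applied. A schematic sketch, with `grad_fn` standing in for the model's input gradient and the noise level and sample count chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothgrad(grad_fn, x, n=50, sigma=0.15):
    # SmoothGrad: average the gradients of noisy copies of the input.
    return np.mean([grad_fn(x + rng.normal(0.0, sigma, x.shape))
                    for _ in range(n)], axis=0)

def smoothgrad_squared(grad_fn, x, n=50, sigma=0.15):
    # SmoothGrad-Squared: average the *squared* noisy gradients instead.
    return np.mean([grad_fn(x + rng.normal(0.0, sigma, x.shape)) ** 2
                    for _ in range(n)], axis=0)
```

Because the square is taken before averaging, sign cancellation across the noisy samples is removed before the ensemble is formed.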
Squaring a single estimate, with no ensembling, benefits the accuracy of all estimators that we considered. In the right inset chart of Fig. 8, we can see that the squared estimates perform better than the raw estimates. When squared, an estimate gains slightly more accuracy than a random ranking of input features. Squaring particularly benefits GB; the performance of SQ-GB relative to GB improves by .
For the purposes of ranking, squaring is equivalent to taking the absolute value of the estimate: negative estimates become positive, and the ranking then depends only upon the magnitude, not the direction, of the estimate. The benefit gained by squaring furthers our understanding of how the direction of GB, IG and Grad values should be treated. For all of these estimators, the estimates largely reflect the weights of the network. The magnitude may be far more telling of feature importance than the direction; a negative signal may be just as important as a positive contribution towards a model's prediction. While squaring improves the accuracy of all estimators, this transformation alone does not explain the large gains in accuracy that we observe when we average a set of noisy squared estimates.
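The equivalence between squaring and taking the absolute value, for the purposes of ranking, can be checked directly (the estimate values below are illustrative):

```python
import numpy as np

# Signed importance estimates for four hypothetical pixels.
estimate = np.array([0.4, -0.9, 0.1, -0.2])

# Descending rank by squared value and by absolute value are identical,
# so the ordering depends only on magnitude, not direction.
rank_sq = np.argsort(estimate ** 2)[::-1]
rank_abs = np.argsort(np.abs(estimate))[::-1]
assert np.array_equal(rank_sq, rank_abs)
```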