Noise-adding Methods of Saliency Map as Series of Higher Order Partial Derivative

06/08/2018
by   Junghoon Seo, et al.
Satrec Initiative Co., Ltd.

SmoothGrad and VarGrad are techniques that enhance the empirical quality of the standard saliency map by adding noise to the input. However, there have been few works that provide a rigorous theoretical interpretation of these methods. We analytically formalize the result of these noise-adding methods and observe two interesting facts. First, SmoothGrad does not make the gradient of the score function smooth. Second, VarGrad is independent of the gradient of the score function. We believe that our findings provide a clue to revealing the relationship between local explanation methods of deep neural networks and higher-order partial derivatives of the score function.


1 Introduction

Attribution methods for neural networks are model interpretation methods that show how much each component of the input contributes to the model's prediction Sundararajan et al. (2017). Despite the flurry of explainability research on deep neural networks in recent years, interpreting deep neural networks through attribution methods remains a challenging topic. Previous research can be grouped into two categories. One approach proposes a set of propagation rules that maximize the expressiveness of the interpretation. The other perturbs the input in a methodical (e.g., optimization-based masking) or a random fashion to obtain interpretations of better visual quality.

In light of both approaches, we emphasize the ambiguity of noise-adding methods such as SmoothGrad Smilkov et al. (2017) and VarGrad Adebayo et al. (2018) compared to other methods. Most propagation-based methods can be interpreted as variants of the backpropagation algorithm Ancona et al. (2018), and the algorithms themselves are self-explanatory. Perturbation-based methods usually optimize or manually alter the input with respect to meaningful criteria Fong & Vedaldi (2017). However, noise-adding methods merely take the mean or the variance of saliency maps generated by adding Gaussian noise to the input. Despite their apparent simplicity, the results are surprisingly effective. Ironically, this simplicity prevents us from understanding exactly how and why noise-adding methods lead to better model interpretation.

This situation poses a twofold problem. First, since the inner workings of these methods are unclear, our understanding of the results they produce is also inherently unclear. Second, this lack of understanding prevents others from assessing the advantages and disadvantages of such noise-adding methods.

In this paper, we address the ambiguity of noise-adding methods by applying the multivariate Taylor's theorem and several statistical theorems to SmoothGrad and VarGrad. We obtain their analytic expressions, which reveal several interesting properties. These discoveries allow us to verify intuitively plausible but opaque explanations for the effectiveness of noise-adding methods proposed in previous works. Furthermore, we formulate a general conjecture regarding reasonable model interpretations based on our discussion.


Figure 1:

Sample results of the standard saliency map, the noise-adding methods, and some other gradient-based methods. For SmoothGrad and VarGrad, the sampling number and the standard deviation of the noise are fixed to 50 and 0.025, respectively. The results of the noise-adding methods generally look better and are less noisy than the standard saliency map. This observation is consistent with the discussion in the original works Smilkov et al. (2017); Adebayo et al. (2018). For comparison, the results of the remaining four recent gradient-based methods are shown as absolute values. See Table 1 for references for these four methods. Best viewed in electronic form.

Specifically, our contributions in this paper are as follows:

  • We present non-stochastic analytic forms of approximated SmoothGrad and VarGrad and their bounds.

  • Our theorems lead to conclusions that differ from those of previous works. First, SmoothGrad does not make the gradient of the score function smooth. Second, VarGrad is independent of the gradient of the score function. In addition, their behaviors differ from those of other interpretation techniques.

  • Based on our observations, we carefully propose the conjecture that higher-order partial derivatives and reasonable model interpretations are correlated.

2 Notation

For simplicity of discussion, we limit the target network to an image classification network. Let $S$ be an image classification network with fixed parameters, where $x \in \mathbb{R}^d$ is a single image instance and $\mathcal{C}$ is the set of image labels. For each label candidate $c \in \mathcal{C}$, we write $S_c : \mathbb{R}^d \rightarrow \mathbb{R}$ for the corresponding score function. We assume that a squashing function such as the softmax function is applied just before the value of $S_c$ is calculated. The final classification result of $S$ for $x$ is $\arg\max_{c \in \mathcal{C}} S_c(x)$. Note that we only consider vector-valued inputs $x \in \mathbb{R}^d$ to avoid using complex tensor notation.

To easily handle high-order derivatives of a multivariate function, we introduce multi-index notation Saint Raymond (2017). A $d$-dimensional multi-index $\alpha = (\alpha_1, \ldots, \alpha_d)$ is a $d$-tuple of non-negative integers. With this, we define

\[ |\alpha| = \alpha_1 + \alpha_2 + \cdots + \alpha_d, \tag{1} \]
\[ \alpha! = \alpha_1!\,\alpha_2! \cdots \alpha_d!, \tag{2} \]
\[ x^{\alpha} = x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_d^{\alpha_d}. \tag{3} \]

If $S_c \in C^{|\alpha|}$ at $x$ and $|\alpha| = k$, a generic $k$-th-order partial derivative is denoted by

\[ D^{\alpha} S_c(x) = \frac{\partial^{|\alpha|} S_c(x)}{\partial x_1^{\alpha_1} \partial x_2^{\alpha_2} \cdots \partial x_d^{\alpha_d}}, \tag{4} \]

where Schwarz's theorem Rudin (1976) justifies Definition 4.
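As a small illustration of this notation (the dimension $d = 3$ and the multi-index $\alpha = (2, 0, 1)$ are chosen arbitrarily for the example): here $|\alpha| = 3$ and $\alpha! = 2$, and

\[ D^{\alpha} S_c(x) = \frac{\partial^{3} S_c(x)}{\partial x_1^{2}\,\partial x_3}, \qquad x^{\alpha} = x_1^{2} x_3 . \]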

We now define some notation for the $d$-dimensional noise $\epsilon$. For $i \in \{1, \ldots, n\}$, the $i$-th noise sample is denoted by $\epsilon^{(i)}$, and its $j$-th component by $\epsilon^{(i)}_j$. Noise sampled independently from a probability distribution $p$, regardless of order and index, is simply denoted $\epsilon$. That is, $\epsilon^{(i)}_j \sim p$ is equivalent to $\epsilon \sim p$. In this case, $\epsilon$ and $\epsilon^{(i)}_j$ are used interchangeably.

Lastly, we define simple notation for the parity of integers. We write Even for the union of the set $\{0\}$ and the set of positive even numbers, and Odd for the set of positive odd numbers.

3 Related Works

3.1 Previous Works On Attribution Methods

Saliency Map and Its Advanced Methods

Since Simonyan et al. (2014) first proposed using saliency maps to interpret neural networks, there have been several studies improving propagation-based attribution methods Sundararajan et al. (2017); Springenberg et al. (2015); Ancona et al. (2018); Shrikumar et al. (2017); Selvaraju et al. (2017); Smilkov et al. (2017); Bach et al. (2015); Montavon et al. (2017); Chattopadhyay et al. (2018). Meanwhile, Ancona et al. (2018) suggested a way to interpret several existing propagation-based methods within a unified gradient-based framework. On the other hand, Zhang et al. (2016); Adebayo et al. (2018) discussed the limitations of the gradient-based methodology itself.

Model Explanation with Perturbation

There have been some attempts to explain the model through perturbation of the input data Zeiler & Fergus (2014); Zhou et al. (2015); Cao et al. (2015); Fong & Vedaldi (2017). We emphasize that our major theorems and conclusions cannot be directly extended to these methods, because the perturbations they use usually depend on the data or the model.

Axiomatization of Model Interpretability

There have been several studies Sundararajan et al. (2017); Ghorbani et al. (2017); Adebayo et al. (2018); Kindermans et al. (2018); Samek et al. (2017); Dabkowski & Gal (2017) on the preferential properties or axiomatization of model interpretability. These studies are significant because they reduce the ambiguity of model interpretability as a research topic. Therefore, they are essential for a unified discussion on model interpretation.

3.2 Brief Reviews On Our Three Topics

Saliency Map

Authors of several articles Erhan et al. (2009); Baehrens et al. (2010); Simonyan et al. (2014) proposed the saliency map, the partial derivative of the network output $S_c$ with respect to the input $x$, as a possible explanation of model decisions. The $j$-th component of the standard saliency map is computed by

\[ [M_c(x)]_j = D^{e_j} S_c(x) = \frac{\partial S_c(x)}{\partial x_j}, \tag{5} \]

where $e_j$ is the multi-index whose $j$-th entry is one and whose other entries are zero.

SmoothGrad

The authors of Smilkov et al. (2017) proposed SmoothGrad, which calculates the average of saliency maps generated by adding Gaussian noise to the input. Compared to Equation 5, SmoothGrad computes

\[ [\hat{M}^{\mathrm{SG}}_c(x)]_j = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial S_c(x + \epsilon^{(i)})}{\partial x_j}, \qquad \epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2 I). \tag{6} \]

VarGrad

The authors of Adebayo et al. (2018) proposed VarGrad, the variance counterpart of SmoothGrad:

\[ [\hat{M}^{\mathrm{VG}}_c(x)]_j = \widehat{\mathrm{Var}}\!\left[ \frac{\partial S_c(x + \epsilon)}{\partial x_j} \right] \tag{7} \]
\[ = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{\partial S_c(x + \epsilon^{(i)})}{\partial x_j} \right)^{2} - \left( \frac{1}{n} \sum_{i=1}^{n} \frac{\partial S_c(x + \epsilon^{(i)})}{\partial x_j} \right)^{2}, \qquad \epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2 I). \tag{8} \]
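To make these definitions concrete, the following is a minimal sketch of how Equations 6-8 could be estimated with automatic differentiation. The PyTorch backend and the names score_fn, smoothgrad, and vargrad are illustrative assumptions rather than the implementations used in the original works.

```python
import torch

def noisy_saliency_samples(score_fn, x, n=50, sigma=0.025):
    """Gradients of the class score at n Gaussian-perturbed copies of x."""
    grads = []
    for _ in range(n):
        x_noisy = (x + sigma * torch.randn_like(x)).detach().requires_grad_(True)
        score = score_fn(x_noisy)                 # scalar S_c(x + eps)
        grad, = torch.autograd.grad(score, x_noisy)
        grads.append(grad.detach())
    return torch.stack(grads)                     # shape: (n, *x.shape)

def smoothgrad(score_fn, x, n=50, sigma=0.025):
    # Equation 6: sample mean of the noisy saliency maps.
    return noisy_saliency_samples(score_fn, x, n, sigma).mean(dim=0)

def vargrad(score_fn, x, n=50, sigma=0.025):
    # Equations 7-8: element-wise sample variance of the noisy saliency maps.
    return noisy_saliency_samples(score_fn, x, n, sigma).var(dim=0)
```

For a classifier net and a target class c, score_fn could be, for example, lambda z: net(z.unsqueeze(0))[0, c]; the defaults n=50 and sigma=0.025 match the settings used for Figure 1.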

4 Rethinking Noise with Taylor’s Theorem

4.1 Motivation

Figure 1 shows the results of various attribution methods for the predicted class of inception-v3 Szegedy et al. (2016) trained on ILSVRC 2013 Russakovsky et al. (2015). (The results are produced by a modified implementation of Ancona et al. (2018); the repository is available at https://github.com/marcoancona/DeepExplain.) As the results show, SmoothGrad and VarGrad generally seem to provide a better visual description than the standard saliency map. More precisely, standard saliency maps overly emphasize local image regions, while the results produced by the noise-adding methods do not. Furthermore, SmoothGrad and VarGrad produce results comparable to those of other recent gradient-based attribution methods. The results of VarGrad are particularly sparse. Previous studies Smilkov et al. (2017); Adebayo et al. (2018); Kindermans et al. (2017) have observed similar results.

Here, our central question arises: how are the results of SmoothGrad and VarGrad different from those of the standard saliency map? For SmoothGrad, this question was covered briefly in Smilkov et al. (2017). The authors argued that SmoothGrad reduces the effect of ‘strongly fluctuating partial derivatives’ on the saliency map. However, they did not offer any analytic explanation for the beneficial effect of noise on the results. As for VarGrad, its behavior is as mysterious as its effectiveness. The relationship between the variability of the saliency maps produced from noisy inputs and the saliency map produced from the original data is highly unclear. However, this problem has not been addressed before.

Accordingly, we attempt to answer the following questions through mathematical analysis:

  • What is the relationship between the saliency map and the result of noise-adding methods?

  • What is the exact relationship between the result of noise-adding methods and the choice of the noise parameter $\sigma$?

  • Are the results of the noise-adding methods related to factors other than the saliency map?

We express Equation 6 and Equation 7 in terms of the noise parameter $\sigma$ instead of the data noise $\epsilon$. If we cannot obtain a closed-form expression of a term in $\sigma$, we instead provide its bound as an expression in $\sigma$. We use the multivariate Taylor's theorem and several statistical theorems for this. Because the full proofs are lengthy, we only state the results in this paper; the complete proofs are given in the Appendix.
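Before stating the theorems, a one-dimensional sketch conveys the flavor of the argument; this is only an informal illustration under a smoothness assumption, not the multivariate statement proved in the Appendix. For a smooth $f$ and $\epsilon \sim \mathcal{N}(0, \sigma^2)$, expanding $f'(x + \epsilon)$ around $x$ and taking the expectation and the variance term by term gives

\[ f'(x+\epsilon) = f'(x) + f''(x)\,\epsilon + \tfrac{1}{2} f'''(x)\,\epsilon^{2} + \cdots \]
\[ \mathbb{E}\!\left[ f'(x+\epsilon) \right] \approx f'(x) + \tfrac{\sigma^{2}}{2} f'''(x) + \cdots \qquad \text{(odd moments of } \epsilon \text{ vanish)} \]
\[ \mathrm{Var}\!\left[ f'(x+\epsilon) \right] \approx \sigma^{2} f''(x)^{2} + \cdots \qquad \text{(no } f'(x) \text{ term survives)} \]

The mean keeps the original derivative plus higher-order corrections weighted by powers of $\sigma$, while the variance is built entirely from higher-order derivatives. Theorems 1 and 2 make the multivariate versions of these observations precise.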

4.2 SmoothGrad Does Not Make Gradient Smooth

Theorem 1.

Suppose on a closed ball . (Most modern neural networks are only piecewise continuously differentiable. Nonetheless, several theoretical studies Funahashi (1989); Telgarsky (2016); Liang & Srikant (2017) have guaranteed that a general neural network can be an appropriate ($\epsilon$-)approximation of any smooth function. Approaches to noise in neural networks via Taylor's theorem are also shown in An (1996); Rifai et al. (2011) in the context of model regularization analysis.) If , , , and $n$ is large enough, the result of SmoothGrad is approximated by

(9)

where

(10)
(11)

with and some .

Proof.

See Appendix C. ∎

4.3 VarGrad Is Independent Of Gradient

Theorem 2.

Suppose on a closed ball . If is even, , , and is large enough, the result of VarGrad is approximated by

(12)

where

(13)

and the remaining terms are bounded by expressions in $\sigma$. The exact equation and bound are given in Appendix D.

Proof.

See Appendix D. ∎


Method                                          | Mul w/ Input? | Gradient? | Higher-order?
------------------------------------------------|---------------|-----------|--------------
Saliency Map                                    | No            | Yes       | No
Gradient*Input Shrikumar et al. (2016)          | Yes           | Yes       | No
Integrated Gradient Sundararajan et al. (2017)  | Yes           | Yes       | No
ε-LRP Bach et al. (2015)                        | Yes           | Yes       | No
DeepLIFT Shrikumar et al. (2017)                | Yes           | Yes       | No
SmoothGrad                                      | No            | Yes       | Yes
VarGrad                                         | No            | No        | Yes
Table 1: Characteristics of the formulations of noise-adding methods and some other methods. The column “Gradient?” indicates whether the formulation contains a term for the gradient. The column “Mul w/ Input?” indicates whether the formulation contains a term in which the input is multiplied by a derivative. The column “Higher-order?” indicates whether the formulation contains a term for higher-order derivatives. See Ancona et al. (2018) for proofs of the gradient-based formulations of ε-LRP and DeepLIFT.

5 Discussions

5.1 Observation on Our Theorems

SmoothGrad

As pointed out in Smilkov et al. (2017), one of the reasons for the failure of the standard saliency map is that the partial derivative of the score function with respect to the input responds more strongly to local pixels than to global information. The authors of Smilkov et al. (2017) also observed that the saliency map fluctuates strongly even under small noise imperceptible to humans. Inspired by this observation, they stated SmoothGrad's motivation as follows: “Instead of basing a visualization directly on the gradient, we could base it on a smoothing of [the gradient] with a Gaussian kernel.” Therefore, they argued that SmoothGrad's result looks better because SmoothGrad literally makes the gradient smooth.

Contrary to these previous discussions, our observations lead to a different conclusion. If the discussion in Smilkov et al. (2017) were compatible with our observation, the result of SmoothGrad for large enough $n$ should contain a term corresponding to a smoothing effect on the saliency map. According to Theorem 1, that is not the case; Equation 6 does not contain such a term. Therefore, in our view, SmoothGrad does not make the gradient of the score function smooth. Instead, SmoothGrad is approximately the sum of the standard saliency map and a series consisting of higher-order partial derivatives and the standard deviation of the Gaussian noise.

VarGrad

Although the principle of VarGrad has rarely been discussed even in the original paper Adebayo et al. (2018), our finding about VarGrad is also counterintuitive. We can see from Theorem 2 that VarGrad is independent of the gradient of the score function. The result of VarGrad can be approximated as a series consisting only of higher-order partial derivatives and the standard deviation of the Gaussian noise. In other words, the result of VarGrad is not related to the saliency map.
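Both observations can also be checked numerically in one dimension. The following toy sanity check is an illustration we add here; the test function $f(x) = \sin(3x)$, the evaluation point, and the sample size are arbitrary choices. It compares the Monte Carlo mean and variance of $f'(x + \epsilon)$ against the leading Taylor terms: $f'(x) + \tfrac{\sigma^2}{2} f'''(x)$ for the SmoothGrad-style mean and $\sigma^2 f''(x)^2$ for the VarGrad-style variance.

```python
import numpy as np

rng = np.random.default_rng(0)
x, sigma, n = 0.7, 0.05, 200_000

f1 = lambda t: 3 * np.cos(3 * t)       # f'(t)   for f(t) = sin(3t)
f2 = lambda t: -9 * np.sin(3 * t)      # f''(t)
f3 = lambda t: -27 * np.cos(3 * t)     # f'''(t)

eps = rng.normal(0.0, sigma, size=n)
noisy_grads = f1(x + eps)              # saliency values at noisy inputs

print("SmoothGrad-style mean      :", noisy_grads.mean())
print("f'(x) + (sigma^2/2) f'''(x):", f1(x) + 0.5 * sigma**2 * f3(x))
print("VarGrad-style variance     :", noisy_grads.var())
print("sigma^2 * f''(x)^2         :", sigma**2 * f2(x)**2)
```

In this setup, the mean stays close to $f'(x)$ with a small $\sigma^2$-weighted correction, while the variance tracks $\sigma^2 f''(x)^2$ and contains no contribution from $f'(x)$ itself.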

5.2 Comparison with Previous Discussions

Table 1 summarizes the characteristics of the noise-adding methods and some other gradient-based attribution methods. In the table, we only deal with the four recent gradient-based attribution methods listed in Ancona et al. (2018). Some other gradient-based attribution methods (e.g., Selvaraju et al. (2017); Springenberg et al. (2015)) can be grouped into the same categories, depending on the definition or rules of attribution. We want to focus on the unique nature of the noise-adding methods listed in Table 1.

Multiplication with Input

The presence of a term in which the input and the derivative are multiplied together has been generally taken as an important factor in sharper attribution Shrikumar et al. (2017); Sundararajan et al. (2017); Smilkov et al. (2017); Ancona et al. (2018). Furthermore, Ancona et al. (2018) claimed that the presence of that term makes the method a desirable global attribution method.

However, even though noise-adding methods such as VarGrad do not have these terms, their results are comparable to those of other recent attribution methods, as demonstrated in Figure 1. Furthermore, it has been found that this term causes undesirable side effects in the attribution Smilkov et al. (2017), and its effect on deep neural networks (as opposed to simple linear models) is still unclear Ancona et al. (2018). We therefore argue that an analytic approach to the need for multiplication with the input is necessary.

Gradient

On the presence or the absence of the gradient term, our findings are even more surprising. Since Simonyan et al. (2014) first introduced model interpretation by saliency maps, all following propagation-based attribution methods have used the gradient in some way. However, our findings suggest that SmoothGrad and VarGrad deviate from this trend, as mentioned in Section 5.1.

Higher-order Derivative

Taken together, our theorems suggest that a major factor affecting the result of noise-adding methods is the higher-order partial derivatives of the score function at the data point, not just the saliency map. Despite the conflicts between our conclusions and those of other works, it is undeniable that the noise-adding methods are qualitatively better than the standard saliency map. To account for this phenomenon, we cautiously propose the conjecture that there may be a correlation between higher-order partial derivatives of the model function and the attributions defined from sensible axioms of model interpretability.

There are few articles that focus on the higher-order partial derivatives of the model function for model explanation. One notable exception is Koh & Liang (2017), which studied the influence function via Jacobian-Hessian products of the model. The purpose of Koh & Liang (2017), however, is not to obtain the model attribution of the input but to find the responsible training data through the influence function. As far as we know, Chattopadhyay et al. (2018) is the study of model attribution most closely related to higher-order derivatives. In Chattopadhyay et al. (2018), computing higher derivatives is required for obtaining the gradient weights in a more principled way than the class activation map Zhou et al. (2016) or Grad-CAM Selvaraju et al. (2017). We hope for further discussion of this view in the future, grounded in legitimate axioms of model interpretation.
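In modern autograd frameworks, individual higher-order quantities such as Hessian-vector products can be obtained by differentiating twice. A minimal PyTorch sketch, with a toy cubic score standing in for a network, is:

```python
import torch

x = torch.tensor([0.5, -1.0], requires_grad=True)
score = (x ** 3).sum()                      # toy scalar "score function"

# First-order gradient (the saliency-map ingredient), kept in the graph.
(grad,) = torch.autograd.grad(score, x, create_graph=True)

# A second-order quantity: the Hessian-vector product H v via double backward.
v = torch.tensor([1.0, 2.0])
(hvp,) = torch.autograd.grad(grad @ v, x)

print(grad)   # 3 * x**2
print(hvp)    # diag(6 * x) @ v
```

This is the same double-backward mechanism that influence-function computations Koh & Liang (2017) rely on; obtaining all higher-order partial derivatives, by contrast, remains intractable (see Section 5.3).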

5.3 Inaccessibility to Experimentation

It is worth mentioning that direct computation of Equation 1 and Equation 2 is numerically intractable, for two reasons. First, it requires the calculation of a $d$-dimensional explicitly restricted partition set Stanley (1986) whose size grows with the order. Additionally, the corresponding partial derivatives of the score function would have to be computed for all possible multi-indices of each order. Both are practically difficult to compute. Despite this inaccessibility to experimentation, our view of noise-adding methods allows theoretically interesting discussions.
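A rough count makes the blow-up concrete. By the standard stars-and-bars argument, the number of multi-indices $\alpha$ with $|\alpha| = k$ over $d$ variables is $\binom{k+d-1}{d-1}$; the snippet below (an illustration we add, assuming a 299x299x3 inception-v3-sized input) evaluates this count for small orders.

```python
from math import comb

d = 299 * 299 * 3                         # dimension of a 299x299 RGB input
for k in range(1, 6):
    # Number of multi-indices alpha with |alpha| = k over d variables,
    # by stars and bars: C(k + d - 1, d - 1).
    count = comb(k + d - 1, d - 1)
    print(f"order {k}: {count:.3e} distinct partial derivatives")
```

Already at order five the count is on the order of $10^{25}$, before even considering the cost of evaluating each mixed partial derivative of the score function.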

6 Conclusions

We explored the analytic forms of SmoothGrad and VarGrad, two variants of the saliency map. Our conclusions about the behavior of both methods when the sample number is sufficient conflict with the existing view. First, SmoothGrad does not make the gradient of the score function smooth. Second, VarGrad is independent of the gradient of the score function.

To reconcile the success of noise-adding methods with our conclusions, we carefully presented a conjecture: there may be a correlation between higher-order partial derivatives of the model function and a sensible model interpretation. We hope to see advanced discussions on model interpretation from this perspective in the future.

References

  • Adebayo et al. (2018) Adebayo, Julius, Gilmer, Justin, Goodfellow, Ian, and Kim, Been. Local explanation methods for deep neural networks lack sensitivity to parameter values. International Conference on Learning Representations Workshop, 2018.
  • An (1996) An, Guozhong. The effects of adding noise during backpropagation training on a generalization performance. Neural computation, 8(3):643–674, 1996.
  • Ancona et al. (2018) Ancona, Marco, Ceolini, Enea, Öztireli, Cengiz, and Gross, Markus. Towards better understanding of gradient-based attribution methods for deep neural networks. International Conference on Learning Representations, 2018.
  • Bach et al. (2015) Bach, Sebastian, Binder, Alexander, Montavon, Grégoire, Klauschen, Frederick, Müller, Klaus-Robert, and Samek, Wojciech. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
  • Baehrens et al. (2010) Baehrens, David, Schroeter, Timon, Harmeling, Stefan, Kawanabe, Motoaki, Hansen, Katja, and Müller, Klaus-Robert. How to explain individual classification decisions. Journal of Machine Learning Research, 11(Jun):1803–1831, 2010.
  • Cao et al. (2015) Cao, Chunshui, Liu, Xianming, Yang, Yi, Yu, Yinan, Wang, Jiang, Wang, Zilei, Huang, Yongzhen, Wang, Liang, Huang, Chang, Xu, Wei, et al. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2956–2964, 2015.
  • Chattopadhyay et al. (2018) Chattopadhyay, Aditya, Sarkar, Anirban, Howlader, Prantik, and Balasubramanian, Vineeth N. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. IEEE Winter Conference on Application of Computer Vision, 2018.
  • Dabkowski & Gal (2017) Dabkowski, Piotr and Gal, Yarin. Real time image saliency for black box classifiers. In Advances in Neural Information Processing Systems, pp. 6970–6979, 2017.
  • Erhan et al. (2009) Erhan, Dumitru, Bengio, Yoshua, Courville, Aaron, and Vincent, Pascal. Visualizing higher-layer features of a deep network. Technical Report 1341, University of Montreal, 2009.
  • Fong & Vedaldi (2017) Fong, Ruth C and Vedaldi, Andrea. Interpretable explanations of black boxes by meaningful perturbation. The IEEE International Conference on Computer Vision, 2017.
  • Funahashi (1989) Funahashi, Ken-Ichi. On the approximate realization of continuous mappings by neural networks. Neural networks, 2(3):183–192, 1989.
  • Ghorbani et al. (2017) Ghorbani, Amirata, Abid, Abubakar, and Zou, James. Interpretation of neural networks is fragile. Advances in Neural Information Processing Systems Workshop on Machine Deception, 2017.
  • Kindermans et al. (2017) Kindermans, Pieter-Jan, Hooker, Sara, Adebayo, Julius, Alber, Maximilian, Schütt, Kristof T, Dähne, Sven, Erhan, Dumitru, and Kim, Been. The (un)reliability of saliency methods. Advances in Neural Information Processing Systems Workshop on Explaining and Visualizing Deep Learning, 2017.
  • Kindermans et al. (2018) Kindermans, Pieter-Jan, Schütt, Kristof T, Alber, Maximilian, Müller, Klaus-Robert, and Dähne, Sven. Learning how to explain neural networks: Patternnet and patternattribution. International Conference on Learning Representations, 2018.
  • Koh & Liang (2017) Koh, Pang Wei and Liang, Percy. Understanding black-box predictions via influence functions. International Conference on Machine Learning, 2017.
  • Liang & Srikant (2017) Liang, Shiyu and Srikant, R. Why deep neural networks for function approximation? International Conference on Learning Representations, 2017.
  • Montavon et al. (2017) Montavon, Grégoire, Lapuschkin, Sebastian, Binder, Alexander, Samek, Wojciech, and Müller, Klaus-Robert. Explaining nonlinear classification decisions with deep taylor decomposition. Pattern Recognition, 65:211–222, 2017.
  • Rifai et al. (2011) Rifai, Salah, Glorot, Xavier, Bengio, Yoshua, and Vincent, Pascal. Adding noise to the input of a model trained with a regularized objective. arXiv preprint arXiv:1104.3250, 2011.
  • Rudin (1976) Rudin, Walter. Principles of mathematical analysis. McGraw-Hill Book Co., New York, third edition, 1976. ISBN 0-07-085613-3. International Series in Pure and Applied Mathematics.
  • Russakovsky et al. (2015) Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • Saint Raymond (2017) Saint Raymond, Xavier. Elementary introduction to the theory of pseudodifferential operators. Routledge, 2017.
  • Samek et al. (2017) Samek, Wojciech, Binder, Alexander, Montavon, Grégoire, Lapuschkin, Sebastian, and Müller, Klaus-Robert. Evaluating the visualization of what a deep neural network has learned. IEEE transactions on neural networks and learning systems, 28(11):2660–2673, 2017.
  • Selvaraju et al. (2017) Selvaraju, Ramprasaath R, Cogswell, Michael, Das, Abhishek, Vedantam, Ramakrishna, Parikh, Devi, and Batra, Dhruv. Grad-cam: Visual explanations from deep networks via gradient-based localization. the IEEE International Conference on Computer Vision, 2017.
  • Shrikumar et al. (2016) Shrikumar, Avanti, Greenside, Peyton, Shcherbina, Anna, and Kundaje, Anshul. Not just a black box: Learning important features through propagating activation differences. arXiv preprint arXiv:1605.01713, 2016.
  • Shrikumar et al. (2017) Shrikumar, Avanti, Greenside, Peyton, and Kundaje, Anshul. Learning important features through propagating activation differences. International Conference on Learning Representations, 2017.
  • Simonyan et al. (2014) Simonyan, Karen, Vedaldi, Andrea, and Zisserman, Andrew. Deep inside convolutional networks: Visualising image classification models and saliency maps. International Conference on Learning Representations, 2014.
  • Smilkov et al. (2017) Smilkov, Daniel, Thorat, Nikhil, Kim, Been, Viégas, Fernanda, and Wattenberg, Martin. Smoothgrad: removing noise by adding noise. International Conference on Machine Learning Workshop on Visualization for deep learning, 2017.
  • Springenberg et al. (2015) Springenberg, Jost Tobias, Dosovitskiy, Alexey, Brox, Thomas, and Riedmiller, Martin. Striving for simplicity: The all convolutional net. International Conference on Learning Representations Workshop, 2015.
  • Stanley (1986) Stanley, Richard P. What is enumerative combinatorics? In Enumerative combinatorics, pp. 1–63. Springer, 1986.
  • Sundararajan et al. (2017) Sundararajan, Mukund, Taly, Ankur, and Yan, Qiqi. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
  • Szegedy et al. (2016) Szegedy, Christian, Vanhoucke, Vincent, Ioffe, Sergey, Shlens, Jon, and Wojna, Zbigniew. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.
  • Telgarsky (2016) Telgarsky, Matus. Benefits of depth in neural networks. Journal of Machine Learning Research, 2016.
  • Zeiler & Fergus (2014) Zeiler, Matthew D and Fergus, Rob. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833. Springer, 2014.
  • Zhang et al. (2016) Zhang, Jianming, Lin, Zhe, Brandt, Jonathan, Shen, Xiaohui, and Sclaroff, Stan. Top-down neural attention by excitation backprop. In European Conference on Computer Vision, pp. 543–559. Springer, 2016.
  • Zhou et al. (2015) Zhou, Bolei, Khosla, Aditya, Lapedriza, Agata, Oliva, Aude, and Torralba, Antonio. Object detectors emerge in deep scene cnns. International Conference on Learning Representations, 2015.
  • Zhou et al. (2016) Zhou, Bolei, Khosla, Aditya, Lapedriza, Agata, Oliva, Aude, and Torralba, Antonio. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929. IEEE, 2016.

A Notation Table

We summarize the basic notation for symbols not introduced in Section 2 of the main paper in the following table. For notation not shown here, see Section 2 of the main paper.

Sample mean, or expectation of a random variable
Sample variance, or population variance
Sample covariance, or population covariance
Approximately equal to, used in contexts where the sample number is large enough
Approximately less than or equal to, used in contexts where the sample number is large enough
Normal distribution
Gamma function
“Has the probability distribution of”
Function class of which the first $k$ derivatives all exist and are continuous
Ceiling function
Floor function
Absolute value function
Distribution of the population
Table 1: Notation of Basic Symbols

B Lemmata

Before proving the main theorems, we present some lemmata that need to be established in advance.

Lemma 1.

Suppose $\epsilon_1, \ldots, \epsilon_d \sim p$, where $p$ is a symmetric probability distribution with zero mean. Define $m = \sum_{j=1}^{d} k_j$, where each $k_j$ is a non-negative integer. If $m$ is odd,

\[ \mathbb{E}\!\left[ \prod_{j=1}^{d} \epsilon_j^{k_j} \right] = 0. \tag{1} \]

In addition, for this expectation to be nonzero, all $k_j$ must be zero or even.

Proof.

Because $m$ is odd, at least one $k_j$ is odd. Since each $\epsilon_j$ is sampled independently, $\mathbb{E}\!\left[ \prod_{j} \epsilon_j^{k_j} \right] = \prod_{j} \mathbb{E}\!\left[ \epsilon_j^{k_j} \right]$. For any $j$ and any odd $k$, $\mathbb{E}\!\left[ \epsilon_j^{k} \right] = 0$ because $p$ is a symmetric probability distribution with zero mean. Thus, at least one factor is zero. ∎

Corollary 1.1.

Suppose $\epsilon^{(1)}, \ldots, \epsilon^{(n)} \sim p$, where $p$ is a symmetric probability distribution with zero mean. Define $m = \sum_{j=1}^{d} k_j$, where each $k_j$ is a non-negative integer. If $m$ is odd,

\[ \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{d} \left( \epsilon^{(i)}_j \right)^{k_j} \approx 0, \tag{2} \]

where $n$ is large enough. In addition, for this sample mean to be nonzero when $n$ is large, all $k_j$ must be zero or even.

Proof.

It is clear from Lemma 1 and its proof. ∎

Lemma 2.

Suppose is a non-negative integer. If ,

(3)
(4)
Proof.

Suppose . By the Law Of The Unconscious Statistician,

(5)
(6)
(7)
(8)

These formulas are also true for :

(9)
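For reference, the standard central-moment identity for a zero-mean Gaussian, which is presumably what Lemma 2 expresses: for $\epsilon \sim \mathcal{N}(0, \sigma^2)$ and a non-negative integer $k$,

\[ \mathbb{E}\!\left[ \epsilon^{k} \right] = \begin{cases} \sigma^{k}\,(k-1)!! = \dfrac{2^{k/2}\,\sigma^{k}}{\sqrt{\pi}}\,\Gamma\!\left(\tfrac{k+1}{2}\right), & k \text{ even}, \\[1ex] 0, & k \text{ odd}. \end{cases} \]

The odd case is the symmetry argument of Lemma 1; the even case follows from the law of the unconscious statistician applied to the Gaussian density, which is presumably why the Gamma function appears in the notation table.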

Lemma 3.

Suppose two arbitrary random variables , and -th sample from . If and ,

(10)
Proof.
(11)
(12)
(13)

Lemma 4.

Suppose $X$ and $Y$ are two arbitrary random variables. Then,

\[ \left| \mathrm{Cov}(X, Y) \right| \leq \sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}. \tag{14} \]

Proof.

This can be proved via the Cauchy-Schwarz inequality. Refer to Fujii et al. (1997) for details. ∎

C Proof of Theorem 1

Proof.

Under the stated conditions, we may apply the multivariate Taylor's theorem (Trench, 2013) to the definition of SmoothGrad (Equation 6):

(15)
(16)
(17)
(18)

for some . The second term of Equation 18 can be rearranged as

(19)

Meanwhile, due to the continuity of the higher-order partial derivatives on the compact set, we can obtain a uniform bound on the third term of Equation 18 as follows:

(20)

where . Note that when .

Next, we arrange the terms of Equation 19 and Equation 20 in order. Recall that all elements of are sampled independently and identically. Let be for . When is large enough,

(21)
(22)
(23)
(24)
(25)
(26)
(27)
(28)

from Lemma 1, Lemma 2 and Corollary 1.1.

As a result,

(29)
(30)

D Proof of Theorem 2

Proof.

Starting from the definition of VarGrad (Equation 7) and the multivariate Taylor's theorem,

(31)
(32)
(33)
(34)

By the fact that the variance of a sum of random variables is the sum of their pairwise covariances, Equation 34 can be expanded as

(35)

By arranging the three residual-free terms of Equation 35 in order, we get

(36)
(37)
(38)

Next, we arrange the terms of Equation 36, Equation 37, and Equation 38 in order. Recall that all elements of are sampled independently and identically. Let be for . When is large enough,

(39)
(40)
(41)
(42)

from Lemma 1 and Lemma 2.

When Equation 37 is treated in the same manner,

(43)
(44)
(45)