1 Introduction
Attribution methods for neural networks are model interpretation methods that show how much each component of the input contributes to the model prediction Sundararajan et al. (2017). Despite the flurry of explainability research on deep neural networks over recent years, interpreting deep neural networks through attribution methods still remains a challenging topic. Previous research can be grouped into two categories. One approach proposes a set of propagation rules that maximize the expressiveness of the interpretation. The other approach perturbs the input in a methodical (e.g. optimization-based masking) or a random fashion to obtain interpretations of better visual quality.
In light of both approaches, we emphasize the ambiguity of noise-adding methods such as SmoothGrad Smilkov et al. (2017) and VarGrad Adebayo et al. (2018)
compared to other methods. Most propagation-based methods can be interpreted as variants of the backpropagation algorithm
Ancona et al. (2018), and the algorithms themselves are self-explanatory. Perturbation-based methods usually optimize or manually alter the input with respect to meaningful criteria Fong & Vedaldi (2017). However, noise-adding methods merely take the mean or the variance of saliency maps generated by adding Gaussian noise to the input. Despite their apparent simplicity, the results are surprisingly effective. Ironically, this very simplicity prevents us from understanding exactly how and why noise-adding methods lead to better model interpretations.
This situation poses a twofold problem. First, since the inner workings of the method are unclear, our understanding of the results produced by noise-adding methods is also innately unclear. Second, this lack of understanding prevents others from assessing the advantages and disadvantages of such noise-adding methods.
In this paper, we address the ambiguity of noise-adding methods by applying the multivariate Taylor's theorem and some statistical theorems to SmoothGrad and VarGrad. We obtain their analytic expressions, which reveal several interesting properties. These discoveries allow us to verify intuitively plausible but opaque explanations for the effectiveness of noise-adding methods proposed in previous works. Furthermore, we formulate a general conjecture regarding reasonable model interpretations, based on our discussions.
Specifically, our contributions in this paper are as follows:

We present non-stochastic analytic forms of approximated SmoothGrad and VarGrad, together with their bounds.

Our theorems lead to conclusions that differ from those of previous works. First, SmoothGrad does not make the gradient of the score function smooth. Second, VarGrad is independent of the gradient of the score function. In addition, their behaviors differ from those of other interpretation techniques.

Based on our observations, we carefully propose the conjecture that higher-order partial derivatives and reasonable model interpretations are correlated.
2 Notation
For simplicity of discussion, we limit the target network to an image classification network. Let $f : \mathcal{X} \to \mathcal{Y}$ be an image classification network with fixed parameters, where $x \in \mathcal{X}$ is a single image instance and $\mathcal{Y}$ is the set of image labels. We define a score function $S_c$ for each label candidate $c \in \mathcal{Y}$. We assume that a squashing function such as the softmax function is applied just before the value of $S_c$ is calculated. The final classification result of $f$ for $x$ is $\arg\max_{c \in \mathcal{Y}} S_c(x)$. Note that we only consider $x \in \mathbb{R}^d$
to avoid using complex tensor notation.
To easily handle high-order derivatives of a multivariate function, we introduce multi-index notation Saint Raymond (2017). A $d$-dimensional multi-index $\alpha = (\alpha_1, \dots, \alpha_d)$ is a tuple of non-negative integers. With this, we define
(1) $|\alpha| = \alpha_1 + \alpha_2 + \cdots + \alpha_d$
(2) $\alpha! = \alpha_1! \, \alpha_2! \cdots \alpha_d!$
(3) $x^{\alpha} = x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_d^{\alpha_d}$
If $S_c$ is $|\alpha|$ times continuously differentiable at $x$, a generic $|\alpha|$-th-order partial derivative is denoted by
(4) $D^{\alpha} S_c(x) = \dfrac{\partial^{|\alpha|} S_c(x)}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}$
where Schwarz's theorem Rudin (1976) justifies Definition 4.
We now define some notation for the $d$-dimensional noise $g$. For $i = 1, \dots, n$, the $i$-th noise sample is denoted by
$g^{(i)}$. Noise sampled independently from a probability distribution function,
regardless of order and index, is simply denoted $g$; in this case, $g^{(i)}$ and $g$ are used interchangeably. Lastly, we define simple notations for the parity of integers: one set denotes the union of the set $\{0\}$ and the set of positive even numbers;
the other refers to the set of positive odd numbers.
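The multi-index operations above are mechanical to compute. As a minimal illustrative sketch (the helper names are our own, not from the paper), Equations 1 through 3 translate to:

```python
import math

def order(alpha):
    """|alpha|: the total order of the multi-index (Equation 1)."""
    return sum(alpha)

def mi_factorial(alpha):
    """alpha! = alpha_1! * alpha_2! * ... * alpha_d! (Equation 2)."""
    return math.prod(math.factorial(a) for a in alpha)

def mi_power(x, alpha):
    """x^alpha = x_1^a_1 * ... * x_d^a_d (Equation 3)."""
    return math.prod(xi ** a for xi, a in zip(x, alpha))
```

For example, for $\alpha = (2, 0, 1)$ and $x = (2, 3, 4)$: $|\alpha| = 3$, $\alpha! = 2$, and $x^\alpha = 2^2 \cdot 3^0 \cdot 4^1 = 16$.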
3 Related Works
3.1 Previous Works On Attribution Methods
Saliency Map and Its Advanced Methods
Since Simonyan et al. (2014) first proposed using saliency maps to interpret neural networks, there have been several studies to improve propagation-based attribution methods Sundararajan et al. (2017); Springenberg et al. (2015); Ancona et al. (2018); Shrikumar et al. (2017); Selvaraju et al. (2017); Smilkov et al. (2017); Bach et al. (2015); Montavon et al. (2017); Chattopadhyay et al. (2018). Meanwhile, Ancona et al. (2018) suggested a way to interpret some existing propagation-based methods within a unified gradient-based framework. On the other hand, Zhang et al. (2016); Adebayo et al. (2018) discussed the limitations of the gradient-based methodology itself.
Model Explanation with Perturbation
There have been some attempts to explain the model through perturbation of the input data Zeiler & Fergus (2014); Zhou et al. (2015); Cao et al. (2015); Fong & Vedaldi (2017). We emphasize that our major theorems and conclusions cannot be directly extended to these methods because the perturbations they use are usually dependent on the data or the model.
Axiomatization of Model Interpretability
There have been several studies Sundararajan et al. (2017); Ghorbani et al. (2017); Adebayo et al. (2018); Kindermans et al. (2018); Samek et al. (2017); Dabkowski & Gal (2017) on the preferential properties or axiomatization of model interpretability. These studies are significant because they reduce the ambiguity of model interpretability as a research topic. Therefore, they are essential for a unified discussion on model interpretation.
3.2 Brief Reviews On Our Three Topics
Saliency Map
Authors of several articles Erhan et al. (2009); Baehrens et al. (2010); Simonyan et al. (2014) proposed the saliency map, which is the partial derivative of the network output with respect to the input $x$, as a possible explanation for model decisions. The standard saliency map is computed by
(5) $M_c(x) = \dfrac{\partial S_c(x)}{\partial x}$
where each component corresponds to a multi-index $\alpha$ with $|\alpha| = 1$.
SmoothGrad
The authors of Smilkov et al. (2017) proposed SmoothGrad, which averages the saliency maps of noisy copies of the input:
(6) $\hat{M}_c(x) = \dfrac{1}{n} \sum_{i=1}^{n} M_c\big(x + g^{(i)}\big), \qquad g^{(i)} \sim \mathcal{N}(0, \sigma^2 I)$
VarGrad
The authors of Adebayo et al. (2018) proposed VarGrad, the variance version of SmoothGrad:
(7) $\hat{V}_c(x) = \widehat{\mathrm{Var}}\big[ M_c(x + g) \big]$
(8) $= \dfrac{1}{n} \sum_{i=1}^{n} M_c\big(x + g^{(i)}\big)^{2} - \left( \dfrac{1}{n} \sum_{i=1}^{n} M_c\big(x + g^{(i)}\big) \right)^{2}$
where the square and the variance are taken element-wise.
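Both definitions reduce to simple Monte Carlo statistics over noisy saliency maps. A minimal sketch of the two estimators (the function names are our own, and the gradient is supplied as a callable rather than computed by any particular autodiff framework):

```python
import numpy as np

def smoothgrad_and_vargrad(grad_fn, x, sigma, n, seed=0):
    """Monte Carlo estimates of SmoothGrad (sample mean) and VarGrad
    (element-wise sample variance) over Gaussian-perturbed inputs."""
    rng = np.random.default_rng(seed)
    # Stack n saliency maps of noisy copies of x.
    maps = np.stack([grad_fn(x + rng.normal(0.0, sigma, size=x.shape))
                     for _ in range(n)])
    return maps.mean(axis=0), maps.var(axis=0)

# Toy score function S(x) = sum(x_i^2), whose saliency map is 2x.
grad = lambda x: 2.0 * x
sg, vg = smoothgrad_and_vargrad(grad, np.array([1.0, -2.0]), sigma=0.1, n=5000)
```

For this quadratic score, `sg` is close to the saliency map $2x$ while `vg` is close to $4\sigma^2$ in every coordinate regardless of $x$, which previews the theorems of Section 4.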
4 Rethinking Noise with Taylor’s Theorem
4.1 Motivation
Figure 1 shows the results of various attribution methods for the prediction class in Inception-v3 Szegedy et al. (2016) trained on ILSVRC 2013 Russakovsky et al. (2015).¹ (Footnote 1: The results are produced by a modified implementation of Ancona et al. (2018), whose repository is available at https://github.com/marcoancona/DeepExplain) As the results show, SmoothGrad and VarGrad seem to provide a better visual description than saliency maps in general. More precisely, standard saliency maps overly emphasize local image regions, while results produced by noise-adding methods do not. Furthermore, SmoothGrad and VarGrad produce results comparable to those of other recent gradient-based attribution methods. The results of VarGrad are particularly sparse. Previous studies Smilkov et al. (2017); Adebayo et al. (2018); Kindermans et al. (2017) have also observed similar results.
Here our central question arises: how are the results of SmoothGrad and VarGrad different from those of the standard saliency map? For SmoothGrad, this question was covered briefly in Smilkov et al. (2017). The authors argued that SmoothGrad reduces the effect of ‘strongly fluctuating partial derivatives’ on the saliency map. However, they did not offer any analytic explanation for the beneficial effect of noise on the results. As for VarGrad, its behavior is as mysterious as its effectiveness. The relationship between the variability of saliency maps produced from noisy inputs and the saliency map produced from the data is highly unclear. Moreover, this problem has not been addressed before.
Accordingly, we attempt to answer the following questions through mathematical analysis:

What is the relationship between the saliency map and the result of noise-adding methods?

What is the exact relationship between the result of noise-adding methods and the choice of the noise parameter $\sigma$?

Are the results of the noise-adding methods related to factors other than the saliency map?
We express Equation 6 and Equation 7 in terms of the noise parameter $\sigma$ instead of the data noise $g$. If we cannot obtain a closed-form expression of the terms for $\sigma$, we instead provide their bound as an expression in $\sigma$. We use the multivariate Taylor's theorem and several statistical theorems for this. Because the entire proofs are too verbose, we only state the results in this paper. The full proofs are given in the Appendix.
4.2 SmoothGrad Does Not Make Gradient Smooth
Theorem 1.
Suppose $S_c$ is sufficiently smooth on a closed ball around $x$.² (Footnote 2: Most modern neural networks are only piecewise continuously differentiable. Nonetheless, several theoretical studies Funahashi (1989); Telgarsky (2016); Liang & Srikant (2017) have guaranteed that a general neural network can be an appropriate $\varepsilon$-approximation of any smooth function. Approaches to noise on neural networks via Taylor's theorem also appear in An (1996); Rifai et al. (2011) in the context of model regularization analysis.) If the noise is Gaussian with standard deviation $\sigma$ and the sample number $n$ is large enough, the result of SmoothGrad is approximated by
(9) 
where
(10)  
(11) 
with quantities specified in Appendix C.
Proof.
See Appendix C. ∎
4.3 VarGrad Is Independent Of Gradient
Theorem 2.
Suppose $S_c$ is sufficiently smooth on a closed ball around $x$. Under the parity conditions on the derivative orders stated in Appendix D, if the sample number $n$ is large enough, the result of VarGrad is approximated by
(12) 
where
(13) 
and the remaining term is bounded by an expression in $\sigma$. The exact equation and bound are shown in Appendix D.
Proof.
See Appendix D. ∎
Table 1: Characteristics of gradient-based attribution methods.

Method | Mul w/ Input? | Gradient? | Higher-order?
Saliency Map | No | Yes | No
Gradient*Input Shrikumar et al. (2016) | Yes | Yes | No
Integrated Gradient Sundararajan et al. (2017) | Yes | Yes | No
LRP Bach et al. (2015) | Yes | Yes | No
DeepLIFT Shrikumar et al. (2017) | Yes | Yes | No
SmoothGrad | No | Yes | Yes
VarGrad | No | No | Yes
5 Discussions
5.1 Observations on Our Theorems
SmoothGrad
As pointed out in Smilkov et al. (2017), one of the reasons for the failure of the standard saliency map is that the partial gradient of the score function with respect to the input acts more strongly on local pixels than on global information. The authors of Smilkov et al. (2017) also observed that the saliency map fluctuates strongly even under small noise imperceptible to humans. Inspired by this observation, they stated SmoothGrad's motivation as follows: “Instead of basing a visualization directly on the gradient $\partial S_c$, we could base it on a smoothing of $\partial S_c$ with a Gaussian kernel.” Therefore, they argued that SmoothGrad's result looks better because SmoothGrad literally makes the gradient smooth.
Contrary to these previous discussions, our observations lead to a different conclusion. If the discussion in Smilkov et al. (2017) were compatible with our observation, the result of SmoothGrad when $n$ is large enough should contain a term corresponding to a smoothing effect on the saliency map. According to Theorem 1, that is not the case; Equation 9 does not contain such a term. Therefore, from our point of view, SmoothGrad does not make the gradient of the score function smooth. Instead, SmoothGrad is approximately the sum of the standard saliency map and a series consisting of higher-order partial derivatives and the standard deviation of the Gaussian noise.
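This behavior can be checked numerically in one dimension, where a standard Taylor computation under Gaussian noise gives $\mathbb{E}[S'(x+g)] = S'(x) + \frac{\sigma^2}{2} S'''(x) + \cdots$, consistent with Theorem 1's claim that SmoothGrad adds a higher-order series to the saliency map rather than smoothing it. A sketch on the toy score $S(x) = x^3$ (variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
x, sigma, n = 1.0, 0.5, 200_000

# SmoothGrad of S(x) = x^3: average the derivative 3(x+g)^2 over Gaussian noise.
g = rng.normal(0.0, sigma, size=n)
smoothgrad = np.mean(3.0 * (x + g) ** 2)

saliency = 3.0 * x ** 2                # S'(x)
higher_order = 0.5 * sigma ** 2 * 6.0  # (sigma^2 / 2) * S'''(x) = 3 * sigma^2
```

Here `smoothgrad` matches `saliency + higher_order` (the series terminates because the fourth derivative of $x^3$ vanishes); the extra term $3\sigma^2$ is a constant offset, not a smoothed gradient.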
VarGrad
Although the principle of VarGrad has rarely been discussed, even in the original paper Adebayo et al. (2018), our finding about VarGrad is also counterintuitive. We can see from Theorem 2 that VarGrad is independent of the gradient of the score function. The result of VarGrad can be approximated as a series consisting only of higher-order partial derivatives and the standard deviation of the Gaussian noise. In other words, the result of VarGrad is not related to the saliency map.
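The gradient-independence is easy to see on a toy score $S(x) = c\,x + x^2$: the saliency map $c + 2x$ shifts with $c$, but the sample variance of the noisy saliency maps $c + 2(x + g)$ is $4\sigma^2$ for every $c$. A sketch with our own function names:

```python
import numpy as np

def vargrad_1d(grad_fn, x, sigma, n, seed=0):
    """Sample variance of saliency maps under Gaussian input noise."""
    g = np.random.default_rng(seed).normal(0.0, sigma, size=n)
    return np.var(grad_fn(x + g))

sigma = 0.3
# Two scores with very different gradients but identical curvature.
v_small = vargrad_1d(lambda x: 1.0 + 2.0 * x, x=0.5, sigma=sigma, n=100_000)
v_large = vargrad_1d(lambda x: 100.0 + 2.0 * x, x=0.5, sigma=sigma, n=100_000)
```

Both estimates are close to $4\sigma^2 = 0.36$ and (with the same noise samples) equal to each other: the constant part of the gradient drops out of the variance entirely.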
5.2 Comparison with Previous Discussions
Table 1 summarizes the characteristics of the noise-adding methods and some other gradient-based attribution methods. In the table, we only deal with the four recent gradient-based attribution methods listed in Ancona et al. (2018). Some other gradient-based attribution methods (e.g. Selvaraju et al. (2017); Springenberg et al. (2015)) can be grouped into the same categories, depending on the definition or rules of attribution. We want to focus on the unique nature of the noise-adding methods as listed in Table 1.
Multiplication with Input
The presence of a term in which the input and the derivative are multiplied together has generally been taken as an important factor for sharper attribution Shrikumar et al. (2017); Sundararajan et al. (2017); Smilkov et al. (2017); Ancona et al. (2018). Furthermore, Ancona et al. (2018) claimed that the presence of this term makes a method a desirable global attribution method.
However, even though noise-adding methods such as VarGrad do not have such a term, their results are comparable to those of other recent attribution methods, as demonstrated in Figure 1. Furthermore, it has been found that this term causes undesirable side effects in the attribution Smilkov et al. (2017), and its effect on deep neural networks (as opposed to simple linear models) is still unclear Ancona et al. (2018). We therefore argue that an analytic approach to the need for multiplication with the input is necessary.
Gradient
Regarding the presence or absence of the gradient term, our findings are even more surprising. Since Simonyan et al. (2014) first introduced model interpretation by saliency maps, all subsequent propagation-based attribution methods have used the gradient in some way. However, our findings suggest that SmoothGrad and VarGrad deviate from this trend, as mentioned in Section 5.1.
Higherorder Derivative
Taken together, our theorems suggest that a major factor affecting the result of noise-adding methods is the higher-order partial derivatives of the score function at the data point, not just the saliency map. Despite the conflicts between our conclusions and those of other works, it is undeniable that the noise-adding methods are qualitatively better than the standard saliency map. To account for this phenomenon, we cautiously propose the conjecture that there may be a correlation between higher-order partial derivatives of the model function and the attributions defined from sensible axioms of model interpretability.
There are few articles that focus on the higher-order partial derivatives of the model function for model explanation. One notable exception is Koh & Liang (2017), which studied the influence function via Jacobian-Hessian products of the model. The purpose of Koh & Liang (2017), however, is not to obtain the model attribution of the input but to find the responsible training data through the influence function. As far as we know, Chattopadhyay et al. (2018) is the study of model attribution methods most closely related to higher-order derivatives. In Chattopadhyay et al. (2018), the computation of higher derivatives is required to obtain the gradient weights in a more principled way than the class activation map Zhou et al. (2016) or Grad-CAM Selvaraju et al. (2017). We hope for further advanced discussions on our view in the future, with a legitimate axiom on model interpretation.
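For intuition, probing a single higher-order derivative does not require materializing full derivative tensors; a central finite difference gives a cheap pointwise estimate. A minimal sketch (our own helper, not taken from any cited work):

```python
def second_derivative(f, x, h=1e-3):
    """Central finite-difference estimate of the second derivative f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

# For f(x) = x^3, the exact second derivative at x = 1 is 6.
d2 = second_derivative(lambda x: x ** 3, 1.0)
```

Directional versions of the same stencil extend this to high-dimensional inputs, though the combinatorial cost of covering all multi-indices remains (see Section 5.3).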
5.3 Inaccessibility to Experimentation
It is worth mentioning that direct computation of Equation 1 and Equation 2 is numerically intractable, for two reasons. First, it requires the calculation of an explicitly restricted partition set Stanley (1986) in $d$ dimensions, which grows with increasing order. Additionally, the higher-order partial derivatives of the score function must be computed for all possible multi-indices. Both are practically difficult to compute. Despite this inaccessibility to experimentation, our view of noise-adding methods allows theoretically interesting discussions.
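The second obstacle is concrete: the number of distinct $k$-th order partial derivatives equals the number of multi-indices with $|\alpha| = k$ in $d$ dimensions, which is $\binom{d+k-1}{k}$ by stars and bars. A quick check, using the Inception-v3 input size from Section 4.1 as an illustrative example:

```python
from math import comb

def num_partial_derivatives(d, k):
    """Number of multi-indices alpha with |alpha| = k in d dimensions,
    i.e. distinct k-th order partial derivatives (stars and bars)."""
    return comb(d + k - 1, k)

d = 3 * 299 * 299  # Inception-v3 input dimensionality
second_order = num_partial_derivatives(d, 2)  # already over 10^10 terms
```

Even at second order the count exceeds $10^{10}$, and it grows polynomially in $d$ with degree $k$, which rules out exhaustive evaluation for images.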
6 Conclusions
We explored the analytic forms of SmoothGrad and VarGrad, variants of the saliency map. Our conclusions about the behavior of both methods when the sample number is sufficient conflict with the existing view. First, SmoothGrad does not make the gradient of the score function smooth. Second, VarGrad is independent of the gradient of the score function.
To reconcile the success of noise-adding methods with our conclusions, we carefully presented a conjecture: there may be a correlation between higher-order partial derivatives of the model function and a sensible model interpretation. We hope to see advanced discussions on model interpretation from this perspective in the future.
References
 Adebayo et al. (2018) Adebayo, Julius, Gilmer, Justin, Goodfellow, Ian, and Kim, Been. Local explanation methods for deep neural networks lack sensitivity to parameter values. International Conference on Learning Representations Workshop, 2018.
 An (1996) An, Guozhong. The effects of adding noise during backpropagation training on a generalization performance. Neural computation, 8(3):643–674, 1996.
 Ancona et al. (2018) Ancona, Marco, Ceolini, Enea, Öztireli, Cengiz, and Gross, Markus. Towards better understanding of gradient-based attribution methods for deep neural networks. International Conference on Learning Representations, 2018.

 Bach et al. (2015) Bach, Sebastian, Binder, Alexander, Montavon, Grégoire, Klauschen, Frederick, Müller, Klaus-Robert, and Samek, Wojciech. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
 Baehrens et al. (2010) Baehrens, David, Schroeter, Timon, Harmeling, Stefan, Kawanabe, Motoaki, Hansen, Katja, and Müller, Klaus-Robert. How to explain individual classification decisions. Journal of Machine Learning Research, 11(Jun):1803–1831, 2010.
 Cao et al. (2015) Cao, Chunshui, Liu, Xianming, Yang, Yi, Yu, Yinan, Wang, Jiang, Wang, Zilei, Huang, Yongzhen, Wang, Liang, Huang, Chang, Xu, Wei, et al. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2956–2964, 2015.
 Chattopadhyay et al. (2018) Chattopadhyay, Aditya, Sarkar, Anirban, Howlader, Prantik, and Balasubramanian, Vineeth N. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. IEEE Winter Conference on Applications of Computer Vision, 2018.
 Dabkowski & Gal (2017) Dabkowski, Piotr and Gal, Yarin. Real time image saliency for black box classifiers. In Advances in Neural Information Processing Systems, pp. 6970–6979, 2017.
 Erhan et al. (2009) Erhan, Dumitru, Bengio, Yoshua, Courville, Aaron, and Vincent, Pascal. Visualizing higherlayer features of a deep network. Technical Report 1341, University of Montreal, 2009.
 Fong & Vedaldi (2017) Fong, Ruth C and Vedaldi, Andrea. Interpretable explanations of black boxes by meaningful perturbation. The IEEE International Conference on Computer Vision, 2017.
 Funahashi (1989) Funahashi, KenIchi. On the approximate realization of continuous mappings by neural networks. Neural networks, 2(3):183–192, 1989.
 Ghorbani et al. (2017) Ghorbani, Amirata, Abid, Abubakar, and Zou, James. Interpretation of neural networks is fragile. Advances in Neural Information Processing Systems Workshop on Machine Deception, 2017.

 Kindermans et al. (2017) Kindermans, Pieter-Jan, Hooker, Sara, Adebayo, Julius, Alber, Maximilian, Schütt, Kristof T, Dähne, Sven, Erhan, Dumitru, and Kim, Been. The (un)reliability of saliency methods. Advances in Neural Information Processing Systems Workshop on Explaining and Visualizing Deep Learning, 2017.
 Kindermans et al. (2018) Kindermans, Pieter-Jan, Schütt, Kristof T, Alber, Maximilian, Müller, Klaus-Robert, and Dähne, Sven. Learning how to explain neural networks: PatternNet and PatternAttribution. International Conference on Learning Representations, 2018.
 Koh & Liang (2017) Koh, Pang Wei and Liang, Percy. Understanding blackbox predictions via influence functions. International Conference on Machine Learning, 2017.
 Liang & Srikant (2017) Liang, Shiyu and Srikant, R. Why deep neural networks for function approximation? International Conference on Learning Representations, 2017.
 Montavon et al. (2017) Montavon, Grégoire, Lapuschkin, Sebastian, Binder, Alexander, Samek, Wojciech, and Müller, Klaus-Robert. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition, 65:211–222, 2017.
 Rifai et al. (2011) Rifai, Salah, Glorot, Xavier, Bengio, Yoshua, and Vincent, Pascal. Adding noise to the input of a model trained with a regularized objective. arXiv preprint arXiv:1104.3250, 2011.
 Rudin (1976) Rudin, Walter. Principles of mathematical analysis. McGrawHill Book Co., New York, third edition, 1976. ISBN 0070856133. International Series in Pure and Applied Mathematics.
 Russakovsky et al. (2015) Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 Saint Raymond (2017) Saint Raymond, Xavier. Elementary introduction to the theory of pseudodifferential operators. Routledge, 2017.
 Samek et al. (2017) Samek, Wojciech, Binder, Alexander, Montavon, Grégoire, Lapuschkin, Sebastian, and Müller, KlausRobert. Evaluating the visualization of what a deep neural network has learned. IEEE transactions on neural networks and learning systems, 28(11):2660–2673, 2017.
 Selvaraju et al. (2017) Selvaraju, Ramprasaath R, Cogswell, Michael, Das, Abhishek, Vedantam, Ramakrishna, Parikh, Devi, and Batra, Dhruv. Grad-CAM: Visual explanations from deep networks via gradient-based localization. The IEEE International Conference on Computer Vision, 2017.
 Shrikumar et al. (2016) Shrikumar, Avanti, Greenside, Peyton, Shcherbina, Anna, and Kundaje, Anshul. Not just a black box: Learning important features through propagating activation differences. arXiv preprint arXiv:1605.01713, 2016.
 Shrikumar et al. (2017) Shrikumar, Avanti, Greenside, Peyton, and Kundaje, Anshul. Learning important features through propagating activation differences. International Conference on Learning Representations, 2017.
 Simonyan et al. (2014) Simonyan, Karen, Vedaldi, Andrea, and Zisserman, Andrew. Deep inside convolutional networks: Visualising image classification models and saliency maps. International Conference on Learning Representations, 2014.
 Smilkov et al. (2017) Smilkov, Daniel, Thorat, Nikhil, Kim, Been, Viégas, Fernanda, and Wattenberg, Martin. SmoothGrad: removing noise by adding noise. International Conference on Machine Learning Workshop on Visualization for Deep Learning, 2017.
 Springenberg et al. (2015) Springenberg, Jost Tobias, Dosovitskiy, Alexey, Brox, Thomas, and Riedmiller, Martin. Striving for simplicity: The all convolutional net. International Conference on Learning Representations Workshop, 2015.
 Stanley (1986) Stanley, Richard P. What is enumerative combinatorics? In Enumerative combinatorics, pp. 1–63. Springer, 1986.
 Sundararajan et al. (2017) Sundararajan, Mukund, Taly, Ankur, and Yan, Qiqi. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
 Szegedy et al. (2016) Szegedy, Christian, Vanhoucke, Vincent, Ioffe, Sergey, Shlens, Jon, and Wojna, Zbigniew. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.
 Telgarsky (2016) Telgarsky, Matus. Benefits of depth in neural networks. Journal of Machine Learning Research, 2016.
 Zeiler & Fergus (2014) Zeiler, Matthew D and Fergus, Rob. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833. Springer, 2014.
 Zhang et al. (2016) Zhang, Jianming, Lin, Zhe, Brandt, Jonathan, Shen, Xiaohui, and Sclaroff, Stan. Topdown neural attention by excitation backprop. In European Conference on Computer Vision, pp. 543–559. Springer, 2016.
 Zhou et al. (2015) Zhou, Bolei, Khosla, Aditya, Lapedriza, Agata, Oliva, Aude, and Torralba, Antonio. Object detectors emerge in deep scene cnns. International Conference on Learning Representations, 2015.

 Zhou et al. (2016) Zhou, Bolei, Khosla, Aditya, Lapedriza, Agata, Oliva, Aude, and Torralba, Antonio. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929. IEEE, 2016.
A Notation Table
We summarize the basic notations for symbols not introduced in Section 2 of the main paper in the following table. For notations not shown here, see Section 2 of the main paper.
Sample mean, or expectation of a random variable
Sample variance, or population variance
Sample covariance, or population covariance
Approximately equal to; especially used when the sample number is large enough
Approximately less than or equal to; especially used when the sample number is large enough
Normal distribution
Gamma function
"has the probability distribution of"
Function class whose first $k$ derivatives all exist and are continuous
Ceiling function
Floor function
Absolute value function
Distribution of the population
B Lemmata
Before proving the main theorems, we present some lemmata that should be examined first.
Lemma 1.
Suppose $g_1, \dots, g_d \sim p$, where $p$ is a symmetric probability distribution with zero mean. For each $i$, let $k_i$ be a non-negative integer. If $\sum_{i} k_i$ is odd,
(1) $\mathbb{E}\!\left[ \prod_{i=1}^{d} g_i^{k_i} \right] = 0.$
In addition, for this expectation to be nonzero, all $k_i$ must be zero or even.
Proof.
Because $\sum_i k_i$ is odd, at least one of the $k_i$ is odd. Since each $g_i$ is sampled independently, the expectation factorizes into $\prod_i \mathbb{E}[g_i^{k_i}]$. For any $i$ and odd number $m$, $\mathbb{E}[g_i^{m}] = 0$ because $p$ is a symmetric probability distribution with zero mean. Thus, at least one factor is zero. ∎
Corollary 1.1.
Suppose $g_1, \dots, g_d \sim p$, where $p$ is a symmetric probability distribution with zero mean. For each $i$, let $k_i$ be a non-negative integer. If $\sum_i k_i$ is odd,
(2) $\dfrac{1}{n} \sum_{j=1}^{n} \prod_{i=1}^{d} \big(g_i^{(j)}\big)^{k_i} \approx 0$
where $n$ is large enough. In addition, for this quantity to be nonzero when $n$ is large, all $k_i$ must be zero or even.
Proof.
It is clear from Lemma 1 and its proof. ∎
Lemma 2.
Suppose $k$ is a non-negative integer. If $g \sim \mathcal{N}(0, \sigma^2)$,
(3) $\mathbb{E}\big[g^{2k}\big] = \dfrac{2^k \sigma^{2k}}{\sqrt{\pi}} \, \Gamma\!\left(k + \dfrac{1}{2}\right) = \sigma^{2k} (2k-1)!!$
(4) $\mathbb{E}\big[g^{2k+1}\big] = 0$
Proof.
Suppose $k \geq 1$. By the Law of the Unconscious Statistician,
(5) $\mathbb{E}\big[g^{2k}\big] = \displaystyle\int_{-\infty}^{\infty} x^{2k} \, \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-x^2/(2\sigma^2)} \, dx$
(6) $= \dfrac{2^k \sigma^{2k}}{\sqrt{\pi}} \, \Gamma\!\left(k + \dfrac{1}{2}\right)$
(7) $\mathbb{E}\big[g^{2k+1}\big] = \displaystyle\int_{-\infty}^{\infty} x^{2k+1} \, \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-x^2/(2\sigma^2)} \, dx$
(8) $= 0$
since the integrand in Equation 7 is odd. These formulas are also true for $k = 0$:
(9) $\mathbb{E}\big[g^{0}\big] = 1, \qquad \mathbb{E}\big[g\big] = 0.$
∎
Lemma 3.
Suppose $X$ and $Y$ are two arbitrary random variables, and consider the $i$-th samples drawn from them. If the required moments exist and $n$ is large enough,
(10) 
Proof.
(11)  
(12) 
(13) 
∎
Lemma 4.
Suppose $X$ and $Y$ are two arbitrary random variables. Then,
(14) $\big| \mathrm{Cov}(X, Y) \big| \leq \sqrt{\mathrm{Var}(X) \, \mathrm{Var}(Y)}$
Proof.
This can be proved via the Cauchy–Schwarz inequality; refer to Fujii et al. (1997) for details. ∎
C Proof of Theorem 1
Proof.
According to the stated conditions, the Taylor expansion below is valid. Starting from the multivariate Taylor's theorem Trench (2013) applied to the equation of SmoothGrad,
(15)  
(16)  
(17)  
(18) 
for some . The second term of Equation 18 can be rearranged as
(19) 
Meanwhile, due to the continuity of the highest-order partial derivatives on the compact set, we can obtain a uniform bound on the third term of Equation 18 as follows:
(20) 
where . Note that when .
Next, we arrange the terms of Equation 19 and Equation 20 in order. Recall that all elements of the noise vector are sampled independently and identically. When $n$ is large enough,
(21)  
(22)  
(23)  
(24)  
(25)  
(26)  
(27)  
(28) 
As a result,
(29) 
(30) 
∎
D Proof of Theorem 2
Proof.
Starting from Definition of VarGrad and multivariate Taylor’s theorem,
(31)  
(32)  
(33)  
(34) 
By the fact that the variance of a sum of random variables is the sum of their covariances, Equation 34 is expanded as
(35) 
By arranging the three residual-free terms of Equation 35 in order, we get
(36)  
(37) 
(38) 
Next, we arrange the terms of Equation 36, Equation 37, and Equation 38 in order. Recall that all elements of the noise vector are sampled independently and identically. When $n$ is large enough,
(39)  
(40)  
(41)  
(42) 
When Equation 37 is treated in the same manner,
(43)  
(44)  
(45)  