Gradient Alignment in Deep Neural Networks

by Suraj Srinivas, et al.
Idiap Research Institute

One cornerstone of interpretable deep learning is the high degree of visual alignment that input-gradients, i.e., the gradients of the output w.r.t. inputs, exhibit with the input data. This alignment is assumed to arise as a result of the model's generalization, justifying its use for interpretability. However, recent work has shown that it is possible to 'fool' models into having arbitrary gradients while achieving good generalization, thus falsifying the assumption above. This leaves an open question: if not generalization, what causes input-gradients to align with input data? In this work, we first show that it is simple to 'fool' input-gradients using the shift-invariance property of softmax, and that gradient structure is unrelated to model generalization. Second, we re-interpret the logits of standard classifiers as unnormalized log-densities of the data distribution, and find that we can improve this gradient alignment via a generative modelling objective called score-matching. To show this, we derive a novel approximation to the score-matching objective that eliminates the need for expensive Hessian computations, which may be of independent interest. Our experiments help us identify one factor that causes input-gradient alignment in models, that being the approximate generative modelling behaviour of the normalized logit distributions.



1 Introduction

Input-gradients of trained deep neural networks, or gradients of outputs w.r.t. inputs, have been empirically observed to have a high degree of alignment with the inputs. For example, in image classification tasks, gradient magnitudes are observed to be higher on object locations and lower elsewhere. Folk wisdom states that these gradient magnitudes indicate the ‘importance’ placed by the model on different regions of the input, where larger gradient magnitudes indicate higher importance. This argument justifies their use as feature attribution maps for interpretation of discriminative models Simonyan et al. (2013); Smilkov et al. (2017); Ancona et al. (2018). In this work, we show that input-gradients can be arbitrarily manipulated using the shift-invariance of softmax, which implies that input-gradient structure is unrelated to the discriminative capabilities of the model, thereby falsifying the assumption above.

Given that aligned input-gradients are not necessary for generalization, the reason for their emergence in standard deep models is puzzling. However, from this observation we can infer the presence of an implicit regularizer in neural network training that causes this gradient alignment. In this work, we wish to characterize this implicit regularizer, to understand the source of gradient alignment. For this purpose, we study the score-matching objective Hyvärinen (2005), which aims to align input-gradients with the gradients of the input data distribution, and is thus naturally formulated as a gradient alignment problem. To apply this, we exploit connections of discriminative classifiers with generative models Grathwohl et al. (2020); Bridle (1990) by viewing the logits of standard classifiers as un-normalized log-densities. As the gradients of the input data distribution are unavailable, score-matching works by reducing the gradient alignment problem to that of local geometric regularization. Hence, by combining these two techniques, the generative modelling interpretation of logits and score-matching, we are able to connect the literature on generative models with that of geometric regularization of discriminative deep models.

In practice, the score-matching objective is known for being computationally expensive and unstable to train Song and Ermon (2019); Kingma and Cun (2010), which has so far prevented its widespread usage for large-scale generative models. To this end, we also introduce approximations and regularizers which allow us to use score-matching on practical large-scale models. Aside from our usage in this paper, these methods may be of independent interest to the generative modelling community.

Overall, we make three contributions:

  • We show in § 2 that it is trivial to fool input-gradients of standard classifiers using the shift-invariance of softmax, and that gradient alignment is unrelated to generalization.

  • We devise in § 3 a tractable approximation to the score-matching objective that eliminates the need for expensive Hessian computations.

  • We find in § 4 that improving generative modelling behaviour of discriminative models improves gradient alignment, and this helps us identify one possible reason for gradient-alignment in standard models, that being approximate generative modelling behaviour of the normalized logit distributions.

2 Fooling Gradients is Simple

Recently, it has been shown Heo et al. (2019) that it is possible to train models into having arbitrarily structured input-gradients, while achieving good generalization. In this section, we show that it is trivial to ‘fool’ gradients of deep networks trained for classification, using the well-known shift-invariance property of softmax. Throughout the paper, we shall make a distinction between two types of input-gradients: logit-gradients and loss-gradients. While logit-gradients are gradients of the pre-softmax output of a given class w.r.t. the input, loss-gradients are the gradients of the loss w.r.t. the input. In both cases, we only consider outputs of a single class, usually the target class.

Let $x \in \mathbb{R}^D$ be a data point, which is the input for a neural network model $f: \mathbb{R}^D \rightarrow \mathbb{R}^C$ intended for classification, which produces pre-softmax logits for $C$ classes. The cross-entropy loss function for some class $1 \le i \le C$ corresponding to an input $x$ is given by $\ell(f(x), i) \in \mathbb{R}_+$, which is shortened to $\ell_i(x)$ for convenience. Note that here the loss function subsumes the softmax function as well. The logit-gradients are given by $\nabla_x f_i(x) \in \mathbb{R}^D$ for class $i$, while loss-gradients are $\nabla_x \ell_i(x) \in \mathbb{R}^D$. Let the softmax function be $p(y = i \mid x) = \exp(f_i(x)) / \sum_{j=1}^{C} \exp(f_j(x))$, which we denote as $p_i$ for simplicity. Here, we make the observation that upon adding the same scalar function to all logits, the logit-gradients can arbitrarily change but the loss values do not.


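Assume an arbitrary function $g: \mathbb{R}^D \rightarrow \mathbb{R}$. Consider another neural network function given by $\tilde{f}_i(\cdot) = f_i(\cdot) + g(\cdot)$ for $1 \le i \le C$, for which we obtain $\nabla_x \tilde{f}_i(\cdot) = \nabla_x f_i(\cdot) + \nabla_x g(\cdot)$. For this, the corresponding loss values and loss-gradients are unchanged, i.e., $\tilde{\ell}_i(\cdot) = \ell_i(\cdot)$ and $\nabla_x \tilde{\ell}_i(\cdot) = \nabla_x \ell_i(\cdot)$.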

We provide detailed arguments in the Supplementary material. This explains how the structure of logit-gradients can be arbitrarily changed: one simply needs to add an arbitrary function to all logits. This implies that individual logit-gradients and logits are meaningless on their own, and their structure is unrelated to the discriminative capabilities of models. Despite this, a large fraction of work in interpretable deep learning Simonyan et al. (2013); Selvaraju et al. (2017); Smilkov et al. (2017); Fong and Vedaldi (2017); Srinivas and Fleuret (2019) uses individual logits and logit-gradients for saliency map computation.
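The shift-invariance argument can be checked numerically on a toy problem. The following is a minimal sketch, not the experimental setup of this paper: the linear "network" `logits`, the added function `g`, and the finite-difference helper `grad` are all hypothetical choices made for illustration.

```python
import math

def logits(x):
    # Hypothetical 3-class "network": fixed linear logits of a 2-D input.
    weights = [[1.0, -2.0], [0.5, 0.3], [-1.5, 0.8]]
    return [w[0] * x[0] + w[1] * x[1] for w in weights]

def g(x):
    # Arbitrary scalar function added to every logit (illustrative choice).
    return 5.0 * math.sin(3.0 * x[0]) + x[1] ** 2

def shifted_logits(x):
    shift = g(x)
    return [f + shift for f in logits(x)]

def cross_entropy(logit_fn, x, target):
    z = logit_fn(x)
    m = max(z)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(v - m) for v in z))
    return log_norm - z[target]

def grad(fn, x, h=1e-5):
    # Central finite-difference gradient of a scalar function of x.
    out = []
    for d in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[d] += h
        x_minus[d] -= h
        out.append((fn(x_plus) - fn(x_minus)) / (2 * h))
    return out

x0 = [0.3, -0.7]
target = 1
logit_grad = grad(lambda x: logits(x)[target], x0)
logit_grad_shifted = grad(lambda x: shifted_logits(x)[target], x0)
loss_grad = grad(lambda x: cross_entropy(logits, x, target), x0)
loss_grad_shifted = grad(lambda x: cross_entropy(shifted_logits, x, target), x0)
```

In this sketch the two logit-gradients differ by the gradient of `g`, while the loss values and loss-gradients of the two models agree up to finite-difference error.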

2.1 Fooling Loss-Gradients

Here, we show how we can also change loss-gradients arbitrarily without significantly changing the loss values themselves. In this case, we add slightly different scalar functions to each logit.


Assume an arbitrary function $g_i: \mathbb{R}^D \rightarrow \mathbb{R}$ for each class $i$, such that $|g_i(x) - g_j(x)| \le \epsilon$ for any two classes $i, j$. Consider another neural network function given by $\tilde{f}_i(\cdot) = f_i(\cdot) + g_i(\cdot)$. For this, we have the following for small $\epsilon$:


The error on the loss-gradients depends on both $\nabla_x g_i$ and $\nabla_x g_j$, whose magnitudes are unbounded, and thus can get arbitrarily large. E.g., consider $g_i$ being the zero function, and $g_j$ being a high-frequency sine wave with amplitude $\epsilon$.


The approximation error for both the loss and the loss-gradients is small for high-probability classes, and large for low-probability ones.

The proof is provided in the supplementary material. Thus for inputs with low softmax probability, the loss-gradients can also be arbitrarily structured. Overall, the result above demonstrates that loss-gradients are also unreliable, as two models with very similar loss landscapes and hence discriminative abilities, can have drastically different loss-gradients.
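The dependence on the class probability can be made explicit with a short calculation from the definition of the cross-entropy loss. This is a sketch under assumed notation (logits $f_j$, class-dependent shifts $g_j$ with $|g_j(x) - g_i(x)| \le \epsilon$, softmax probabilities $p_j$, and $\tilde{\ell}_i$ the loss of the shifted model); it is not necessarily the exact form of the bound in the supplementary material.

```latex
\begin{align*}
\tilde{\ell}_i(x) - \ell_i(x)
  &= -g_i(x) + \log \sum_j \exp\big(f_j(x) + g_j(x)\big) - \log \sum_j \exp\big(f_j(x)\big) \\
  &= \log \sum_j p_j \exp\big(g_j(x) - g_i(x)\big) \\
  &\le \log\big(p_i + (1 - p_i)\, e^{\epsilon}\big)
   \le (1 - p_i)\,(e^{\epsilon} - 1) \approx \epsilon\,(1 - p_i).
\end{align*}
```

By Jensen's inequality the same difference is bounded below by $-\epsilon (1 - p_i)$, so the change in loss scales with $1 - p_i$: it is negligible when the class has probability close to one and largest when the class has low probability.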

2.2 Experiments

Here we present experimental evidence to support the claim that fooling input-gradients is simple. First, we show how loss-gradients are unchanged when logit-gradients are fooled. Second, we show how loss-gradients can also be fooled by simply increasing the temperature parameter within softmax. Our experiments are performed on the CIFAR100 dataset, using an 11-layer VGG network.

(a) Fooling Mask
(b) Fooled logit-gradients
(c) Loss-gradients
(d) Temperature scaled loss-gradients
Figure 1: Results of fooling neural network logit-gradients. Given a mask (a), we are able to fool logit-gradients (b). We observe that loss-gradients (c) are not affected, however they can change to adhere to the mask upon using a high-temperature softmax (d), as indicated by the areas in red.

Given a normalized unsigned saliency map and a desired normalized binary mask structure , the saliency fooling algorithm Heo et al. (2019) consists of the following objective function.


We add this as a regularizer along with the standard cross-entropy loss and fine-tune a pre-trained VGG classifier. We assume the mask structure given in Figure 1(a), which consists of a

white region. Assuming a uniform distribution of logit-gradients over pixels, one would expect

of the total energy of unsigned gradients to occur in the top left region. Upon optimizing the fooling objective, we observe that we are indeed able to fool logit-gradients, with these having of the total energy in only the top left areas, as shown in Figure 1(b). However, we note that loss-gradients are not fooled, with an average energy of only in the unmasked areas. Our attempts at fooling loss-gradients in a similar manner were unsuccessful: either the training collapsed completely or fooling failed to occur.

Our second experiment involves testing whether the loss-gradients can be fooled for low-probability classes. To test this, we use a high temperature constant () with softmax, for the fooled model above. Upon doing so, we see that the loss-gradients are also altered, with an average energy of in the top left region, up from as shown in Figure 1(d). This provides experimental validation for our theory.
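The role of temperature can be illustrated in isolation: dividing logits by a large temperature flattens the softmax, so that even the predicted class becomes a low-probability class. The sketch below uses made-up logits and an illustrative temperature of 50, which is not the value used in the experiment above.

```python
import math

def softmax(logits, temperature=1.0):
    # Softmax of logits / temperature, computed stably.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.0, 2.0, 1.0, 0.5]  # illustrative logits for a confident prediction
p_standard = softmax(logits)                  # temperature 1
p_hot = softmax(logits, temperature=50.0)     # hypothetical high temperature
```

With the standard softmax the top class receives almost all of the probability mass, whereas at high temperature it is close to uniform, which is the regime where the loss-gradients are no longer protected from manipulation.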

3 Discriminative Classifiers as Generative Models

Our arguments in the previous section have demonstrated that we can easily cause input-gradients, particularly logit-gradients, to have arbitrary structure. In this section, we consider how to improve the alignment of the gradients with input data. To this end, we use the score-matching objective, which is naturally formulated as a gradient alignment problem. We first state the link between generative models and the softmax function.

Let us first define the following joint density on the logits of classifiers: $p_\theta(x, y = i) = \exp(f_i(x; \theta)) / Z(\theta)$, where $Z(\theta)$ is the partition function. We shall henceforth suppress the dependence of $f$ on $\theta$ for brevity. Upon using Bayes' rule to obtain $p_\theta(y = i \mid x)$, we observe that we recover the standard softmax function. Thus logits of classifiers can alternately be viewed as un-normalized log-densities of the joint distribution. Assuming equiprobable classes, we have $p_\theta(x \mid y = i) = \exp(f_i(x)) / (Z(\theta) / C)$, which is the quantity of interest for us.
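The two steps above can be written out explicitly. The notation here is assumed for illustration: $f_i(x)$ is the logit of class $i$, $C$ the number of classes, and $Z(\theta)$ the partition function of the joint density $p_\theta(x, y = i) = \exp(f_i(x)) / Z(\theta)$.

```latex
\begin{align*}
p_\theta(y = i \mid x)
  &= \frac{p_\theta(x, y = i)}{\sum_{j=1}^{C} p_\theta(x, y = j)}
   = \frac{\exp(f_i(x)) / Z(\theta)}{\sum_{j=1}^{C} \exp(f_j(x)) / Z(\theta)}
   = \frac{\exp(f_i(x))}{\sum_{j=1}^{C} \exp(f_j(x))}, \\
p_\theta(x \mid y = i)
  &= \frac{p_\theta(x, y = i)}{p_\theta(y = i)}
   = \frac{C \exp(f_i(x))}{Z(\theta)}
  \quad \text{when } p_\theta(y = i) = 1/C .
\end{align*}
```

The partition function cancels in the first line, which is why a classifier can be trained without ever computing it, and it enters the second line only as an additive constant in the log-density.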

3.1 Score-Matching

Score-matching Hyvärinen (2005) is a generative modelling objective that focusses solely on the derivatives of the log density instead of the density itself, and thus does not require access to the partition function $Z(\theta)$. Specifically, for our case we have $\nabla_x \log p_\theta(x \mid y = i) = \nabla_x f_i(x)$, which are the logit-gradients.

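Given i.i.d. samples $\mathcal{X} = \{x_k\}$ from a latent data distribution $p_{data}(x)$, the objective of generative modelling is to recover this latent distribution using only the samples $\mathcal{X}$. This is often done by training a parameterized distribution $p_\theta(x)$ to align with the latent data distribution $p_{data}(x)$. The score-matching objective instead aligns the gradients of log densities, as given below.

$$J(\theta) = \mathbb{E}_{p_{data}(x)} \left[ \tfrac{1}{2} \left\| \nabla_x \log p_\theta(x) - \nabla_x \log p_{data}(x) \right\|_2^2 \right] \tag{2}$$

$$\phantom{J(\theta)} = \mathbb{E}_{p_{data}(x)} \left[ \mathrm{tr}\left( \nabla_x^2 \log p_\theta(x) \right) + \tfrac{1}{2} \left\| \nabla_x \log p_\theta(x) \right\|_2^2 \right] + \text{const} \tag{3}$$
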
The above relationship is proved in Hyvärinen (2005) using integration by parts. This is a consistent objective, i.e., $J(\theta) = 0 \iff p_\theta = p_{data}$. Note that $\nabla_x \log p_{data}(x)$ in equation 2 is unavailable, and thus equation 3 gets rid of this term. This is appealing also because it reduces the problem of generative modelling to that of regularizing the local geometry of functions, i.e., the resulting terms depend only on the point-wise gradients and Hessians. However, equation 3 is intractable for high-dimensional data due to the Hessian trace term. To address this, we can use Hutchinson's trace estimator Hutchinson (1990) to efficiently compute an estimate of the trace by using random projections, which is given by:

$$\mathrm{tr}\left( \nabla_x^2 \log p_\theta(x) \right) = \mathbb{E}_{v \sim \mathcal{N}(0, I)} \left[ v^\top \nabla_x^2 \log p_\theta(x) \, v \right]$$

This estimator has been previously applied to score-matching Song et al. (2019), and can be computed efficiently as it relies on Hessian-vector products, for which we can use Pearlmutter's trick Pearlmutter (1994). However, this trick still requires two backward passes to compute a single Hessian-vector product, and in practice we may need to approximate the expectation using several Monte-Carlo samples. To further improve computational efficiency, we introduce the following approximation to Hutchinson's estimator using a Taylor series expansion, which applies to small values of $\sigma$.

$$\mathbb{E}_{v \sim \mathcal{N}(0, I)} \left[ v^\top \nabla_x^2 \log p_\theta(x) \, v \right] \approx \frac{2}{\sigma^2} \, \mathbb{E}_{v \sim \mathcal{N}(0, \sigma^2 I)} \left[ \log p_\theta(x + v) - \log p_\theta(x) \right]$$

Note that equation 11 involves a difference of log probabilities, which is independent of the partition function. For our case, $\log p_\theta(x + v \mid y = i) - \log p_\theta(x \mid y = i) = f_i(x + v) - f_i(x)$. We have thus considerably simplified and sped up the computation of the Hessian trace term, which can now be approximated without any backward passes, using only a single additional forward pass. We present details regarding the variance of this estimator in the supplementary material.
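The forward-pass-only estimate can be sketched on a toy function whose Hessian trace is known in closed form. This is an illustrative sketch, assuming the estimator takes the form $(2/\sigma^2)$ times the average of $f(x + v) - f(x)$ over Gaussian perturbations $v$ with standard deviation $\sigma$; the quadratic `log_density` and the function name `taylor_trace_estimate` are hypothetical.

```python
import random

def log_density(x):
    # Toy unnormalized log-density: a quadratic with a known Hessian.
    # The Hessian is diag(-2, -4, -6), so its trace is -12.
    return -(1.0 * x[0] ** 2 + 2.0 * x[1] ** 2 + 3.0 * x[2] ** 2)

def taylor_trace_estimate(f, x, sigma, num_samples, rng):
    # (2 / sigma^2) * mean over v ~ N(0, sigma^2 I) of [f(x + v) - f(x)].
    # Only forward evaluations of f are used; no gradients or Hessians.
    fx = f(x)
    total = 0.0
    for _ in range(num_samples):
        v = [rng.gauss(0.0, sigma) for _ in x]
        total += f([xi + vi for xi, vi in zip(x, v)]) - fx
    return 2.0 / sigma ** 2 * total / num_samples

rng = random.Random(0)
x0 = [0.05, -0.02, 0.01]
estimate = taylor_trace_estimate(log_density, x0, sigma=0.5, num_samples=20000, rng=rng)
```

For a quadratic function the Taylor expansion is exact, so the estimate is unbiased and only Monte-Carlo noise remains; for general functions the higher-order terms add a bias that shrinks with $\sigma$.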

3.2 Stabilized Score-matching

In practice, a naive application of the score-matching objective is unstable, causing the Hessian-trace to collapse to negative infinity. This occurs because the finite-sample variant of equation 2 causes the model to 'overfit' to a mixture-of-diracs density, which places a dirac-delta distribution at every data point. Gradients of such a distribution are undefined, causing training to collapse. To overcome this, regularized score-matching Kingma and Cun (2010) and noise conditional score networks Song and Ermon (2019) propose to add noise to inputs for score-matching to make the problem well-defined. However, this did not help in our case. Instead, we use a heuristic where we add a small penalty term proportional to the square of the Hessian-trace. This discourages the Hessian-trace from becoming too large in magnitude, and thus stabilizes training.
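The effect of the squared-trace penalty can be seen on the per-sample objective alone. The sketch below is an assumption-laden illustration: the function `score_matching_terms` and the coefficient name `mu` are hypothetical, and the penalty is taken to be `mu` times the squared Hessian-trace.

```python
def score_matching_terms(hessian_trace, grad_sq_norm, mu=0.0):
    # Per-sample score-matching objective: trace + 0.5 * ||grad||^2,
    # plus an optional stabilizer mu * trace^2 (mu is a hypothetical name).
    return hessian_trace + 0.5 * grad_sq_norm + mu * hessian_trace ** 2

traces = [-1.0, -10.0, -100.0, -1000.0]
plain = [score_matching_terms(t, 1.0) for t in traces]
stabilized = [score_matching_terms(t, 1.0, mu=1e-2) for t in traces]
```

Without the penalty the objective keeps decreasing as the trace goes to negative infinity, whereas with the penalty it is minimized at a finite trace of $-1/(2\mu)$, here $-50$, and grows again beyond that point.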

4 Experiments

In this section, we show that improving generative modelling of logit distributions leads to improved gradient alignment. For experiments, we shall consider the CIFAR100 dataset. Unless stated otherwise, the network structure we use shall be an 18-layer ResNet that achieves 77.12% accuracy on CIFAR100, and the optimizer used shall be SGD with momentum. Before proceeding with our experiments, we shall briefly introduce the score-matching variants we shall be using for comparisons.


Score-Matching

We propose to use the score-matching objective as a regularizer in neural network training, as shown in equation 6, with the stability regularizer discussed in §3.2. For this, we use a regularization constant . This model achieves accuracy on the test set, which is a drop of about compared to the original model.


Anti-Score-Matching

We would like to have a tool that can decrease the score-matching tendency of a model, and thus possibly worsen the generative capabilities of models, as a baseline. To enable this, we propose to increase the Hessian-trace, in an objective we call ‘anti-score-matching’. For this, we shall use a clamping function on the Hessian-trace, which ensures that its maximization stops after a threshold is reached. We use a threshold of , and regularization constant . Alternately, this can also be viewed as yet another logit-gradient ‘fooling’ method. This model achieves an accuracy of .

Gradient-Norm regularization

We observe that one of the score-matching terms in equation 6 is a gradient-norm regularizer. This has been used previously in the context of adversarial robustness Jakubovitz and Giryes (2018), and as a regularizer in general. We hence propose to use this regularizer as another baseline for comparison, using a regularization constant of . This model achieves an accuracy of .

4.1 Density Ratios

One way to characterize the generative behaviour of models is to compute likelihoods on data points. However, this is intractable for high-dimensional problems, especially for un-normalized models. We observe that although the densities themselves are intractable, we can easily compute density ratios $p_\theta(x + \eta) / p_\theta(x)$ for a random noise variable $\eta$. Thus, we propose to plot the graph of density ratios locally along random directions. These can be thought of as local cross-sections of the density sliced at random directions. We plot these values for Gaussian noise $\eta$ of different standard deviations, averaged across points in the entire dataset.
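Because the partition function cancels in a ratio, the density ratio reduces to the exponential of a difference of logits, which needs only two forward passes. The following is a minimal sketch under that assumption; `logit` is a hypothetical stand-in for a classifier logit, and `mean_density_ratio` is an illustrative helper rather than the evaluation code used for the figures.

```python
import math
import random

def logit(x):
    # Hypothetical stand-in for a classifier logit f_i(x): an unnormalized
    # Gaussian log-density peaked at the origin.
    return -0.5 * sum(xi * xi for xi in x)

def mean_density_ratio(f, points, noise_std, num_directions, rng):
    # Average of p(x + eta) / p(x) = exp(f(x + eta) - f(x)) over data points
    # and Gaussian noise eta ~ N(0, noise_std^2 I).
    total, count = 0.0, 0
    for x in points:
        fx = f(x)
        for _ in range(num_directions):
            noisy = [xi + rng.gauss(0.0, noise_std) for xi in x]
            total += math.exp(f(noisy) - fx)
            count += 1
    return total / count

rng = random.Random(0)
points = [[0.0, 0.0], [0.1, -0.1], [-0.2, 0.05]]
ratios = {s: mean_density_ratio(logit, points, s, 2000, rng) for s in (0.0, 0.5, 1.0)}
```

In this toy case the data points sit near the mode, so the ratio falls below one as the noise level grows, which is the behaviour one would hope for from a well-behaved generative model.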

In Figure 2(a), we plot the density ratios upon training on the CIFAR100 dataset. We observe that the baseline model assigns higher density values to noisy inputs than real inputs. With anti-score-matching, we observe that the density profile grows still steeper, assigning higher densities to inputs with smaller noise. Gradient-norm regularization improves on this behaviour, but still assigns higher densities to inputs with large noise added. Finally, the score-matched model is the only one that assigns lower densities to noisy inputs than real inputs, which is the intended behaviour of a generative model. Thus we are able to obtain penalty terms that can both improve and deteriorate the generative modelling behaviour within discriminative models.

Our plots in Figure 2(b) also indicate that standard models get progressively worse at generative modelling during training. This indicates that the implicit regularizer responsible for gradient structure in standard models could be early stopping. We examine this in more detail in §4.2.2.

(a) Density ratios of models trained with different regularizers

(b) Density ratios of a standard ResNet model across training epochs

Figure 2: Plots of density ratios representing local density profiles across varying levels of noise added to the input. (a) Most models with various regularizers assign higher densities to noisy inputs than clean ones, while the score-matched model is the only one that avoids this behaviour. (b) During training of a standard ResNet, models seem to get progressively worse at generative modelling, as models at later epochs assign high densities to inputs with smaller noise levels.

4.2 Gradient Structure

Here we visualize the structure of logit-gradients of different models in Figure 3. We observe that the gradient-norm regularized model and the score-matched model have highly data-aligned gradients, when compared to the baseline and anti-score-matched gradients. This shows that input-gradient alignment in neural networks can be significantly enhanced, and that this is a function of the generative modelling behaviour of the logit distributions. In this visualization, however, we do not see any discernible qualitative difference between the baseline and anti-score-matched gradients, nor are we able to make any quantitative statements about these. We hence propose to visualize and compare samples from these generative models.

(a) Input Image
(b) Baseline ResNet
(c) With Anti score-matching
(d) With Gradient-norm regularization
(e) With Score-matching
Figure 3: Examples of logit-gradients for different models. While standard and anti-score-matched models (b, c) have minimally aligned gradients, gradients of models trained with gradient-norm regularization (d) exhibit some alignment, and score-matched models (e) exhibit the most alignment.

4.2.1 Sampling from Model Distributions via Gradient Ascent

We are interested in recovering modes of our density models while having access to only the gradients of the log density. For this purpose, we apply gradient ascent on the log probability starting from random noise input, which coincidentally is a standard approach in interpretability Simonyan et al. (2013). The gradient ascent step is usually followed by a projection step to ensure that the resulting input lies in the range of valid image inputs, i.e., between and , and is given as follows:

Here, is the step size. Our results are shown in Figure 4. We notice that modes from the score-matched model are significantly more realistic than those from baseline models. We also run a ‘denoising’ experiment, where instead of starting with random noise, we start gradient ascent with data points perturbed with small noise. The modes of an ideal generative model lie near clean, un-noised data, thus motivating this experiment. Figure 5 shows that denoised samples of score-matched models are significantly more realistic than the rest.
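The sampling procedure described above, a gradient ascent step followed by a projection onto the valid input range, can be sketched as follows. All names are hypothetical, the valid range is assumed to be [0, 1] for illustration, gradients are taken by finite differences rather than backpropagation, and `toy_logit` stands in for a class logit.

```python
def clip(values, low, high):
    # Projection onto the box [low, high] in every coordinate.
    return [min(max(v, low), high) for v in values]

def numerical_grad(f, x, h=1e-5):
    # Central finite-difference gradient, standing in for backpropagation.
    grads = []
    for d in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[d] += h
        x_minus[d] -= h
        grads.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grads

def projected_gradient_ascent(f, x_init, step_size, num_steps, low=0.0, high=1.0):
    # Repeats: ascend the log-density, then project back to the valid range.
    # The default range [0, 1] is an assumption made for this sketch.
    x = clip(x_init, low, high)
    for _ in range(num_steps):
        g = numerical_grad(f, x)
        x = clip([xi + step_size * gi for xi, gi in zip(x, g)], low, high)
    return x

def toy_logit(x):
    # Hypothetical log-density with mode (0.7, 0.2, 1.5); the last coordinate
    # of the mode lies outside the valid range.
    mode = [0.7, 0.2, 1.5]
    return -sum((xi - mi) ** 2 for xi, mi in zip(x, mode))

sample = projected_gradient_ascent(toy_logit, [0.5, 0.5, 0.5], step_size=0.1, num_steps=200)
```

The first two coordinates converge to the mode, while the third is held at the boundary of the valid range by the projection step.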

We also propose to measure quantitatively how well these generated samples adhere to the data distribution. In particular, we propose to measure the discriminative accuracy of these generated samples via a separately trained VGG-11 model. The intuition is that better class-conditional generated images are more likely to be correctly classified irrespective of the model. In contrast with more popular metrics such as the Inception score, we would only like to capture sample realism and not diversity, thus motivating this measure. Like the Inception score, this is also an approximate test. We show the results in Table 1, which confirm the qualitative trend seen in the samples above.

| Model | Sample Acc. (%) | Denoised Sample Acc. (%) |
|---|---|---|
| Baseline ResNet | 1.3 | 1.5 |
| + Anti-Score-Matching | 1.3 | 1.3 |
| + Gradient-Norm Regularization | 32.7 | 37.8 |
| + Score-Matching | 57.7 | 64.9 |

Table 1: Discriminative accuracy on VGG-11 of class-conditional samples generated from various ResNet-18 models. We observe that while the baseline and anti-score-matched models produce samples with close-to-random accuracies, the samples from gradient-norm regularized models and score-matched models achieve significantly better accuracies.

(a) Baseline ResNet
(b) With anti score-matching
(c) With Gradient-norm regularization
(d) With score-matching
Figure 4: Samples generated from models by performing gradient ascent on random inputs. (a) Samples from the baseline model exhibit some noisy low-level structure, and (b) samples from anti-score-matched model are significantly noisier. (c) Sample quality is improved using gradient-norm regularization and (d) significantly so with score-matching.

(a) Noisy Image
(b) Baseline ResNet
(c) With Anti score-matching ResNet
(d) With Gradient-norm regularization
(e) With score-matching
Figure 5: ‘Denoised’ samples generated from models by performing gradient ascent on inputs perturbed with noise (). Sample quality drastically improves with score-matching.

(a) Noisy Image
(b) 25 epochs
(c) 50 epochs
(d) 150 epochs
Figure 6: ‘Denoised’ samples generated from models at different epochs during training (). We observe that sample quality progressively degrades across training epochs.

4.2.2 Effect of Early Stopping

Here we shall evaluate the effect of early stopping on generative modelling behaviour. For this, we train a standard ResNet model to 200 epochs, and plot the density ratios of models at 1, 50, 100 and 200 epochs in Figure 2(b). We observe that the density ratios worsen as training progresses, and Figure 6 shows that the quality of denoised samples degrades as well. We also observe progressively worsening discriminative performance on these samples. While at epoch 25 we observed accuracies of (10.5%, 14.8%) respectively for raw and denoised samples, the corresponding quantities for epoch 50 were (3.3%, 4.7%) and for epoch 150, (1.3%, 1.3%). This quantitatively indicates worsening generative modelling behaviour.

4.3 Properties of the Implicit Regularizer

Our experiments show that improving generative modelling behaviour, as evidenced by Figure 2(a), leads to improved gradient alignment, as shown in Figures 3, 4, 5, and Table 1. This helps us identify one factor that can cause gradient alignment in standard models, that being approximate generative modelling of logit distributions. We also see that this generative modelling behaviour worsens during training, indicating that early stopping is one possible reason for this behaviour. To summarize, we find that early stopping may cause approximate generative modelling behaviour, which in turn causes gradient alignment. Score-matching also helps identify another factor related to approximate generative modelling, that being model smoothness, indicated by the gradient-norm regularization.

5 Conclusion

In this paper, we first found that input-gradients are not feature importance representations, and do not encode information regarding the discriminative capabilities of the model. Next, we found that improving the generative modelling behaviour of logit distributions leads to improved gradient alignment. This helped us identify approximate generative modelling as a cause of gradient alignment in standard models. To study this effect, we considered the score-matching approach and proposed scalable variants of the same.

However, in this paper we have only shown an empirical link between generative modelling and gradient alignment, and an analytical link is still missing. One hypothesis is that the statistics of natural images are responsible for such gradient alignment. Specifically, that the separation between a low-entropy ‘object’ which shows relatively little variation across images, and a high-entropy ‘background’ which virtually changes with every image, causes low-capacity generative models to treat the background regions as noise, thus suppressing their gradients. The resolution of this hypothesis would definitively resolve the gradient alignment paradox, and is left as an open problem.


References

  • M. Ancona, E. Ceolini, C. Oztireli, and M. Gross (2018) Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018).
  • H. Avron and S. Toledo (2011) Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. Journal of the ACM (JACM) 58 (2), pp. 1–34.
  • J. S. Bridle (1990) Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pp. 227–236.
  • R. C. Fong and A. Vedaldi (2017) Interpretable explanations of black boxes by meaningful perturbation. In The IEEE International Conference on Computer Vision (ICCV).
  • W. Grathwohl, K. Wang, J. Jacobsen, D. Duvenaud, M. Norouzi, and K. Swersky (2020) Your classifier is secretly an energy based model and you should treat it like one. In International Conference on Learning Representations.
  • J. Heo, S. Joo, and T. Moon (2019) Fooling neural network interpretations via adversarial model manipulation. In Advances in Neural Information Processing Systems, pp. 2921–2932.
  • M. F. Hutchinson (1990) A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics-Simulation and Computation 19 (2), pp. 433–450.
  • A. Hyvärinen (2005) Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6 (Apr), pp. 695–709.
  • D. Jakubovitz and R. Giryes (2018) Improving dnn robustness to adversarial attacks using jacobian regularization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 514–529.
  • D. P. Kingma and Y. L. Cun (2010) Regularized estimation of image statistics by score matching. In Advances in Neural Information Processing Systems, pp. 1126–1134.
  • B. A. Pearlmutter (1994) Fast exact multiplication by the hessian. Neural Computation 6 (1), pp. 147–160.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626.
  • K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
  • D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg (2017) Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825.
  • Y. Song and S. Ermon (2019) Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11895–11907.
  • Y. Song, S. Garg, J. Shi, and S. Ermon (2019) Sliced score matching: a scalable approach to density and score estimation. arXiv preprint arXiv:1905.07088.
  • S. Srinivas and F. Fleuret (2019) Full-gradient representation for neural network visualization. In Advances in Neural Information Processing Systems, pp. 4126–4135.

Fooling Gradients is simple


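Assume an arbitrary function $g: \mathbb{R}^D \rightarrow \mathbb{R}$. Consider another neural network function given by $\tilde{f}_i(\cdot) = f_i(\cdot) + g(\cdot)$ for $1 \le i \le C$, for which we obtain $\nabla_x \tilde{f}_i(\cdot) = \nabla_x f_i(\cdot) + \nabla_x g(\cdot)$. For this, the corresponding loss values and loss-gradients are unchanged, i.e., $\tilde{\ell}_i(\cdot) = \ell_i(\cdot)$ and $\nabla_x \tilde{\ell}_i(\cdot) = \nabla_x \ell_i(\cdot)$.

Proof. The following expressions relate the loss and neural network function outputs, for the case of cross-entropy loss and usage of the softmax function.

$$\ell_i(x) = -f_i(x) + \log \sum_{j=1}^{C} \exp(f_j(x)) \tag{7}$$

$$\nabla_x \ell_i(x) = -\nabla_x f_i(x) + \sum_{j=1}^{C} p_j \nabla_x f_j(x) \tag{8}$$

Upon replacing $f_i$ with $\tilde{f}_i = f_i + g$, the added terms cancel in both expressions and the proof follows. ∎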


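Assume an arbitrary function $g_i: \mathbb{R}^D \rightarrow \mathbb{R}$ for each class $i$, such that $|g_i(x) - g_j(x)| \le \epsilon$ for any two classes $i, j$. Consider another neural network function given by $\tilde{f}_i(\cdot) = f_i(\cdot) + g_i(\cdot)$. For this, we have the following for small $\epsilon$:


We start with equation 7, and write the expression for $\tilde{\ell}_i(x)$.

Upper bounding this expression using $|g_i(x) - g_j(x)| \le \epsilon$, we obtain equation 9. Differentiating this w.r.t. $x$, we obtain equation 10.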

Score-Matching Approximation

We consider the approximation to the Hessian trace, which is derived from Hutchinson's trace estimator [7]. We replace the $\log p_\theta(x)$ terms used in the main text with $f(x)$ terms here for clarity. The Taylor series trick for approximating the Hessian-trace is given below.


As expected, the approximation error vanishes in the limit of small $\sigma$. Let us now consider the finite-sample variant of this estimator. We shall call this the Taylor Trace Estimator.


We shall henceforth suppress the dependence on for brevity. For this estimator, we can compute its variance for quadratic functions , where higher-order Taylor expansion terms are zero. We make the following observation.


For quadratic functions , the variance of the Taylor Trace Estimator is greater than the variance of the Hutchinson estimator by an amount at most equal to .


Thus we have decomposed the variance of the overall estimator into two terms: the first captures the variance of the Taylor approximation, and the second captures the variance of the Hutchinson estimator.

Considering only the first term, i.e.; the variance of the Taylor approximation, we have:

The intermediate steps involve expanding the summation, noticing that pairwise terms cancel, and applying the Cauchy-Schwarz inequality. ∎
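For a single Gaussian sample and a quadratic function, the decomposition can also be computed exactly. The following is a sketch under assumed notation, not the bound stated above: $f$ is quadratic with gradient $\nabla_x f$ and symmetric Hessian $H = \nabla_x^2 f$ at $x$, and the perturbation is written as $v = \sigma u$ with $u \sim \mathcal{N}(0, I)$.

```latex
\begin{align*}
\frac{2}{\sigma^2}\big(f(x + \sigma u) - f(x)\big)
  &= \frac{2}{\sigma}\, \nabla_x f^\top u + u^\top H u, \\
\mathrm{Var}\left[\frac{2}{\sigma^2}\big(f(x + \sigma u) - f(x)\big)\right]
  &= \underbrace{\frac{4}{\sigma^2}\, \|\nabla_x f\|_2^2}_{\text{Taylor approximation}}
   + \underbrace{2\, \|H\|_F^2}_{\text{Hutchinson estimator}} .
\end{align*}
```

The cross term vanishes because odd moments of a zero-mean Gaussian are zero, and $2 \|H\|_F^2$ is the known single-sample variance of the Gaussian Hutchinson estimator. Averaging over independent samples divides both terms by the number of samples.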

Thus we have a trade-off: a large $\sigma$ results in lower estimator variance but a large Taylor approximation error, whereas the opposite is true for small $\sigma$. However, for functions with small gradient norm, both the estimator variance and the Taylor approximation error are small for small $\sigma$. We note that when applied to score-matching [8], the gradient norm of the function is also minimized. This implies that in practice, the gradient norm of the function is likely to be low, thus resulting in a small estimator variance even for small $\sigma$. The variance of the Hutchinson estimator is given below for reference [7, 2]: