1 Introduction
Input-gradients of trained deep neural networks, i.e., gradients of outputs w.r.t. inputs, have been empirically observed to have a high degree of alignment with the inputs. For example, in image classification tasks, gradient magnitudes are observed to be higher at object locations and lower elsewhere. Folk wisdom states that these gradient magnitudes indicate the 'importance' placed by the model on different regions of the input, where larger gradient magnitudes indicate higher importance. This argument justifies their use as feature attribution maps for the interpretation of discriminative models Simonyan et al. (2013); Smilkov et al. (2017); Ancona et al. (2018). In this work, we show that input-gradients can be arbitrarily manipulated using the shift-invariance of softmax, which implies that input-gradient structure is unrelated to the discriminative capabilities of the model, thereby falsifying the assumption above.
Given that aligned input-gradients are not necessary for generalization, the reason for their emergence in standard deep models is puzzling. However, from this observation we can infer the presence of an implicit regularizer in neural network training that causes this gradient alignment. In this work, we wish to characterize this implicit regularizer, to understand the source of gradient alignment. For this purpose, we study the score-matching objective Hyvärinen (2005), which aims to align input-gradients with the gradients of the input data distribution, and is thus naturally formulated as a gradient alignment problem. To apply this, we exploit connections between discriminative classifiers and generative models Grathwohl et al. (2020); Bridle (1990) by viewing the logits of standard classifiers as unnormalized log-densities. As the gradients of the input data distribution are unavailable, score-matching works by reducing the gradient alignment problem to one of local geometric regularization. Hence, by combining these two techniques, the generative modelling interpretation of logits and score-matching, we are able to connect the literature on generative models with that of geometric regularization of discriminative deep models.
In practice, the score-matching objective is known for being computationally expensive and unstable to train Song and Ermon (2019); Kingma and LeCun (2010), which has so far prevented its widespread usage for large-scale generative models. To this end, we also introduce approximations and regularizers that allow us to use score-matching on practical large-scale models. Aside from our usage in this paper, these methods may be of independent interest to the generative modelling community.
Overall, we make three contributions:

We show in § 2 that it is trivial to fool input-gradients of standard classifiers using the shift-invariance of softmax, and that gradient alignment is unrelated to generalization.

We devise in § 3 a tractable approximation to the score-matching objective that eliminates the need for expensive Hessian computations.

We find in § 4 that improving the generative modelling behaviour of discriminative models improves gradient alignment, and this helps us identify one possible reason for gradient alignment in standard models: approximate generative modelling behaviour of the normalized logit distributions.
2 Fooling Gradients is Simple
Recently, it has been shown Heo et al. (2019) that it is possible to train models into having arbitrarily structured input-gradients, while achieving good generalization. In this section, we show that it is trivial to 'fool' gradients of deep networks trained for classification, using the well-known shift-invariance property of softmax. Throughout the paper, we shall make a distinction between two types of input-gradients: logit-gradients and loss-gradients. While logit-gradients are gradients of the pre-softmax output of a given class w.r.t. the input, loss-gradients are gradients of the loss w.r.t. the input. In both cases, we only consider outputs of a single class, usually the target class.
Let x ∈ ℝ^D be a data point, which is the input to a neural network classifier f: ℝ^D → ℝ^K that produces pre-softmax logits f_i(x) for K classes. The cross-entropy loss function for some class y ∈ {1, …, K} corresponding to an input x is given by ℓ(f(x), y), which is shortened to ℓ(x, y) for convenience. Note that here the loss function subsumes the softmax function as well. The logit-gradients are given by ∇_x f_i(x) for class i, while loss-gradients are ∇_x ℓ(x, y). Let the softmax function be p_i(x) = exp(f_i(x)) / Σ_j exp(f_j(x)), which we denote as p_i for simplicity. Here, we make the observation that upon adding the same scalar function to all logits, the logit-gradients can arbitrarily change but the loss values do not.

Observation.
Assume an arbitrary scalar function g: ℝ^D → ℝ. Consider another neural network function given by f̃_i(x) = f_i(x) + g(x) for all classes i, for which we obtain ∇_x f̃_i(x) = ∇_x f_i(x) + ∇_x g(x). For this, the corresponding loss values and loss-gradients are unchanged, i.e., ℓ̃(x, y) = ℓ(x, y) and ∇_x ℓ̃(x, y) = ∇_x ℓ(x, y).
We provide detailed arguments in the Supplementary material. This explains how the structure of logit-gradients can be arbitrarily changed: one simply needs to add an arbitrary function to all logits. This implies that individual logit-gradients and logits are meaningless on their own, and their structure is unrelated to the discriminative capabilities of models. Despite this, a large fraction of work in interpretable deep learning Simonyan et al. (2013); Selvaraju et al. (2017); Smilkov et al. (2017); Fong and Vedaldi (2017); Srinivas and Fleuret (2019) uses individual logits and logit-gradients for saliency map computation.
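The shift-invariance underlying this observation is easy to verify numerically. Below is a minimal sketch (the function and variable names are ours), where a random vector stands in for the logits f(x) and a scalar stands in for g(x):

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # numerical stability; harmless, since softmax is shift-invariant
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=5)   # stand-in for pre-softmax outputs f(x)
shift = 3.7                   # stand-in for an arbitrary scalar g(x), added to every logit

p = softmax(logits)
p_shifted = softmax(logits + shift)

# The softmax output, and hence the cross-entropy loss, is unchanged,
# even though every logit (and every logit-gradient) has changed.
assert np.allclose(p, p_shifted)
```

Since the loss depends on the logits only through softmax, any per-input shift g(x) leaves the loss surface, and therefore the loss-gradients, untouched.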
2.1 Fooling Loss-Gradients
Here, we show how we can also change loss-gradients arbitrarily without significantly changing the loss values themselves. In this case, we add a slightly different scalar function to each logit.
Observation.
Assume an arbitrary vector-valued function g: ℝ^D → ℝ^K, such that g_i(x) ≠ g_j(x) for any two classes i ≠ j. Consider another neural network function obtained by adding g_i(x) to the corresponding logit f_i(x). For this, we have the following approximations for the loss and loss-gradients:
Remark.
The error on the loss-gradients depends on both g and its gradient, whose magnitudes are unbounded, and can thus become arbitrarily large. For example, consider one component of g being the zero function, and another being a high-frequency sine wave.
Remark.
The approximation errors for both the loss and the loss-gradients are small for high-probability classes, and large for low-probability ones.
The proof is provided in the supplementary material. Thus, for inputs with low softmax probability, the loss-gradients can also be arbitrarily structured. Overall, the result above demonstrates that loss-gradients are also unreliable, as two models with very similar loss landscapes, and hence similar discriminative abilities, can have drastically different loss-gradients.
2.2 Experiments
Here we present experimental evidence to support the claim that fooling input-gradients is simple. First, we show that loss-gradients are unchanged when logit-gradients are fooled. Second, we show that loss-gradients can also be fooled by simply increasing the temperature parameter within softmax. Our experiments are performed on the CIFAR100 dataset, using an 11-layer VGG network.
Given a normalized unsigned saliency map s(x) and a desired normalized binary mask m, the saliency fooling algorithm Heo et al. (2019) consists of the following objective function, which rewards concentrating saliency energy inside the masked region:

(1) ℓ_fool(x) = − Σ_i m_i s_i(x)
We add this as a regularizer along with the standard cross-entropy loss and fine-tune a pretrained VGG classifier. We assume the mask structure given in Figure 0(a), which comprises a white region in the top-left corner. Assuming a uniform distribution of logit-gradients over pixels, one would expect only a proportional fraction of the total energy of unsigned gradients to occur in the top-left region. Upon optimizing the fooling objective, we observe that we are indeed able to fool logit-gradients, with most of their total energy concentrated in the top-left area, as shown in Figure 0(b). However, we note that loss-gradients are not fooled, retaining only a small average energy in that region. Our attempts at fooling loss-gradients in a similar manner were unsuccessful: either the training collapsed completely or fooling failed to occur.

Our second experiment tests whether loss-gradients can be fooled for low-probability classes. To test this, we use a high temperature constant with softmax for the fooled model above. Upon doing so, we see that the loss-gradients are also altered, with a substantially higher average energy in the top-left region than before, as shown in Figure 0(d). This provides experimental validation for our theory.
3 Discriminative Classifiers as Generative Models
Our arguments in the previous section demonstrated that we can easily cause input-gradients, particularly logit-gradients, to have arbitrary structure. In this section, we consider how to improve the alignment of these gradients with the input data. To this end, we use the score-matching objective, which is naturally formulated as a gradient alignment problem. For this, we first state the link between generative models and the softmax function.
Let us first define the following joint density on the logits of classifiers: p_θ(x, y) = exp(f_y(x)) / Z(θ), where Z(θ) is the partition function. We shall henceforth suppress the dependence of f on θ for brevity. Upon using Bayes' rule to obtain p(y | x) = exp(f_y(x)) / Σ_j exp(f_j(x)), we observe that we recover the standard softmax function. Thus the logits of classifiers can alternately be viewed as unnormalized log-densities of the joint distribution. Assuming equiprobable classes, we have log p_θ(x) = log Σ_y exp(f_y(x)) − log Z(θ), which is the quantity of interest for us.

3.1 Score-Matching
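This view can be made concrete in a few lines of code; the sketch below (names are ours) treats a random logit vector as unnormalized log joint densities and recovers softmax via Bayes' rule:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def logsumexp(z):
    # numerically stable log-sum-exp
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

rng = np.random.default_rng(0)
logits = rng.normal(size=10)   # stand-in for the logits f_y(x)

# Bayes' rule: p(y | x) = p(x, y) / sum_j p(x, j) — the unknown log Z cancels,
# recovering exactly the softmax over the logits.
posterior = np.exp(logits) / np.exp(logits).sum()
assert np.allclose(posterior, softmax(logits))

# Marginal under equiprobable classes: log p_theta(x) = logsumexp(f(x)) − log Z,
# so the log-sum-exp of the logits is the unnormalized log-density of interest.
log_px_unnorm = logsumexp(logits)
```

Only quantities that drop the intractable log Z are computed here; this is what makes the gradient-based objectives of the next subsection feasible.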
Score-matching Hyvärinen (2005) is a generative modelling objective that focuses solely on the derivatives of the log density rather than the density itself, and thus does not require access to the partition function Z(θ). Specifically, for our case we have ∇_x log p_θ(x) = ∇_x log Σ_y exp(f_y(x)), which is a function only of the logit-gradients.
Given i.i.d. samples from a latent data distribution p_data(x), the objective of generative modelling is to recover this latent distribution using only the samples. This is often done by training a parameterized distribution p_θ(x) to align with the latent data distribution p_data(x). The score-matching objective instead aligns the gradients of the log densities, as given below.
(2) J(θ) = ½ E_{x ∼ p_data} [ ‖∇_x log p_θ(x) − ∇_x log p_data(x)‖² ]

(3) J(θ) = E_{x ∼ p_data} [ tr(∇²_x log p_θ(x)) + ½ ‖∇_x log p_θ(x)‖² ] + constant
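The equivalence of the two forms can be sanity-checked on a one-dimensional Gaussian (our construction, not from the paper): minimizing the partition-free objective of equation 3 over the scale of a zero-mean Gaussian model recovers the standard deviation of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=2.0, size=100_000)   # samples from p_data = N(0, 2^2)

def sm_objective(theta, x):
    # Model: log p_theta(x) = -x^2 / (2 theta^2) + const, so
    # score = -x / theta^2 and second derivative = -1 / theta^2.
    tr_hess = -1.0 / theta**2
    grad_sq = (x / theta**2) ** 2
    # Equation-3-style objective: Hessian trace + half squared score norm;
    # no partition function appears anywhere.
    return np.mean(tr_hess + 0.5 * grad_sq)

thetas = np.linspace(0.5, 4.0, 351)
best_theta = thetas[np.argmin([sm_objective(t, data) for t in thetas])]
assert abs(best_theta - 2.0) < 0.1   # recovers the data standard deviation
```

Note that the objective never evaluates the data density itself, only expectations of model derivatives over samples, which is exactly what makes equation 3 usable.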
The above relationship is proved in Hyvärinen (2005) using integration by parts. This is a consistent objective, i.e., J(θ) = 0 if and only if p_θ = p_data. Note that ∇_x log p_data(x) in equation 2 is unavailable, and thus equation 3 gets rid of this term. This is appealing also because it reduces the problem of generative modelling to one of regularizing the local geometry of functions: the resulting terms depend only on the pointwise gradients and Hessians. However, equation 3 is intractable for high-dimensional data due to the Hessian-trace term. To address this, we can use Hutchinson's trace estimator Hutchinson (1990) to efficiently compute an estimate of the trace using random projections, which is given by:

(4) tr(∇²_x log p_θ(x)) = E_{v ∼ N(0, I)} [ vᵀ ∇²_x log p_θ(x) v ]
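As a quick numerical check of the estimator (our sketch; a random symmetric matrix stands in for the Hessian):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 6))
H = H + H.T                    # a symmetric matrix standing in for a Hessian

# Hutchinson: tr(H) = E_v[v^T H v] for v with zero mean and identity covariance.
n_samples = 200_000
v = rng.standard_normal((n_samples, 6))
estimate = np.mean(np.einsum('ni,ij,nj->n', v, H, v))

assert abs(estimate - np.trace(H)) < 0.5   # Monte-Carlo estimate matches the exact trace
```

Each sample requires only a Hessian-vector product rather than the full Hessian, which is the source of the computational savings in high dimensions.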
This estimator has previously been applied to score-matching Song et al. (2019), and can be computed efficiently via Hessian-vector products using Pearlmutter's trick Pearlmutter (1994). However, this trick still requires two backward passes to compute a single Hessian-vector product, and in practice we may need several Monte-Carlo samples to approximate the expectation. To further improve computational efficiency, we introduce the following approximation to Hutchinson's estimator using a Taylor series expansion, which applies for small values of σ:

(5) tr(∇²_x log p_θ(x)) ≈ E_{v ∼ N(0, I)} [ (2/σ²) ( log p_θ(x + σv) − log p_θ(x) ) ]

Note that equation 5 involves a difference of log probabilities, which is independent of the partition function: for our case, log p_θ(x + σv) − log p_θ(x) = log Σ_y exp(f_y(x + σv)) − log Σ_y exp(f_y(x)). We have thus considerably simplified and sped up the computation of the Hessian-trace term, which can now be approximated without any backward passes, using only a single additional forward pass. We present details regarding the variance of this estimator in the supplementary material.
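The approximation can be sanity-checked on a quadratic function, where the Hessian is constant and the Taylor remainder vanishes exactly (a sketch with names of our choosing; σ and the sample count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
A = rng.normal(size=(d, d)); A = 0.5 * (A + A.T)   # Hessian of f is exactly A
b = rng.normal(size=d)

def f(X):
    # quadratic test function f(x) = 0.5 x^T A x + b^T x, applied to rows of X
    return 0.5 * np.einsum('ni,ij,nj->n', X, A, X) + X @ b

x = np.zeros(d)
sigma = 0.1
n = 200_000
v = rng.standard_normal((n, d))

# Taylor trace estimator: E_v[(2/sigma^2) (f(x + sigma v) - f(x))] ≈ tr(A),
# since the first-order term (2/sigma) v^T grad f(x) has zero mean over v.
# Only forward evaluations of f are needed — no gradients.
fx = f(x[None, :])[0]
est = np.mean((2.0 / sigma**2) * (f(x[None, :] + sigma * v) - fx))

assert abs(est - np.trace(A)) < 1.0
```

Shrinking σ reduces the Taylor bias but inflates the variance contributed by the first-order term, which is the trade-off analysed in the supplementary material.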
3.2 Stabilized Score-Matching

In practice, a naive application of the score-matching objective is unstable, causing the Hessian-trace to collapse to negative infinity. This occurs because the finite-sample variant of equation 2 causes the model to 'overfit' to a mixture-of-Diracs density, which places a Dirac-delta distribution at every data point. Gradients of such a distribution are undefined, causing training to collapse. To overcome this, regularized score-matching Kingma and LeCun (2010) and noise-conditional score networks Song and Ermon (2019) propose adding noise to the inputs to make the problem well-defined. However, this did not help in our case. Instead, we use a heuristic where we add a small penalty term proportional to the square of the Hessian-trace. This discourages the Hessian-trace from growing too large in magnitude, and thus stabilizes training.
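A minimal sketch of the resulting per-example penalty, assuming estimates of the Hessian-trace and squared gradient-norm are already available (the function name and the coefficient `lam_stab` are hypothetical, not from the paper):

```python
def stabilized_score_matching_loss(tr_hess, grad_sq_norm, lam_stab=1e-3):
    """Finite-sample score-matching surrogate with a stability penalty.

    tr_hess      -- estimate of tr(Hessian of log p_theta), e.g. the Taylor trace estimator
    grad_sq_norm -- squared norm of the score, ||grad_x log p_theta(x)||^2
    lam_stab     -- hypothetical coefficient for the squared-trace stability term
    """
    score_matching = tr_hess + 0.5 * grad_sq_norm
    stability = lam_stab * tr_hess ** 2   # penalizes |tr H| growing without bound
    return score_matching + stability
```

Because the stability term is quadratic in the trace, a run-away trend toward large negative traces eventually dominates the linear score-matching term and halts the collapse.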
4 Experiments
In this section, we show that improving the generative modelling of logit distributions leads to improved gradient alignment. For experiments, we shall consider the CIFAR100 dataset. Unless stated otherwise, the network structure we use shall be an 18-layer ResNet that achieves 77.12% accuracy on CIFAR100, and the optimizer used shall be SGD with momentum. Before proceeding with our experiments, we briefly introduce the score-matching variants we shall use for comparisons.
Score-Matching

We propose using the score-matching objective as a regularizer in neural network training, as shown in equation 6, together with the stability regularizer discussed in §3.2. For this, we use a regularization constant λ. This model achieves slightly lower accuracy on the test set compared to the original model.

(6) ℓ_total(x, y) = ℓ_CE(x, y) + λ [ tr(∇²_x log p_θ(x)) + ½ ‖∇_x log p_θ(x)‖² ]
Anti-Score-Matching

As a baseline, we would like a tool that can decrease the score-matching tendency of a model, and thus possibly worsen its generative capabilities. To enable this, we propose maximizing the Hessian-trace, in an objective we call 'anti-score-matching'. For this, we use a clamping function on the Hessian-trace, which ensures that its maximization stops after a threshold is reached. Alternately, this can also be viewed as yet another logit-gradient 'fooling' method.
Gradient-Norm Regularization

We observe that one of the score-matching terms in equation 6 is a gradient-norm regularizer. This has been used previously in the context of adversarial robustness Jakubovitz and Giryes (2018), and as a regularizer in general. We hence use this regularizer as another baseline for comparison.
4.1 Density Ratios

One way to characterize the generative behaviour of models is to compute likelihoods at data points. However, this is intractable for high-dimensional problems, especially for unnormalized models. We observe that although the densities themselves are intractable, we can easily compute density ratios of the form p_θ(x + ε) / p_θ(x) for a random noise variable ε, since the partition function cancels. Thus, we propose to plot the graph of density ratios locally along random directions. These can be thought of as local cross-sections of the density, sliced along random directions. We plot these values for Gaussian noise with different standard deviations, averaged across points in the entire dataset.
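Computing such a ratio requires only two forward passes, since the partition function cancels in the difference of log-sum-exp terms. A sketch with a toy linear "network" of our choosing:

```python
import numpy as np

def log_density_unnorm(logits):
    # log p_theta(x) up to the constant log Z: log-sum-exp of the logits
    m = logits.max()
    return m + np.log(np.exp(logits - m).sum())

def density_log_ratio(logits_fn, x, eps):
    # log p(x + eps) - log p(x): the unknown log Z cancels in the difference
    return log_density_unnorm(logits_fn(x + eps)) - log_density_unnorm(logits_fn(x))

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(10, 32))     # toy linear map from inputs to 10 logits
logits_fn = lambda x: W @ x

x = rng.normal(size=32)
noise = 0.5 * rng.standard_normal(32)   # a random direction scaled by a noise level
ratio = density_log_ratio(logits_fn, x, noise)

# Shifting every logit by a constant (e.g. absorbing part of log Z) leaves the
# ratio unchanged, confirming it is a partition-free quantity.
ratio_shifted = density_log_ratio(lambda x: W @ x + 5.0, x, noise)
assert np.allclose(ratio, ratio_shifted)
```

Sweeping the noise scale and averaging these log-ratios over the dataset yields the cross-section plots described above.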
In Figure 1(a), we plot the density ratios upon training on the CIFAR100 dataset. We observe that the baseline model assigns higher density values to noisy inputs than real inputs. With antiscorematching, we observe that the density profile grows still steeper, assigning higher densities to inputs with smaller noise. Gradientnorm regularization improves on this behaviour, but still assigns higher densities to inputs with large noise added. Finally, the scorematched model is the only one that assigns lower densities to noisy inputs than real inputs, which is the intended behaviour of a generative model. Thus we are able to obtain penalty terms that can both improve and deteriorate the generative modelling behaviour within discriminative models.
Our plots in Figure 1(b) also indicate that standard models get progressively worse at generative modelling during training. This suggests that the implicit regularizer responsible for gradient structure in standard models could be early stopping. We examine this in more detail in §4.2.2.
4.2 Gradient Structure
Here we visualize the structure of logit-gradients of different models in Figure 3. We observe that the gradient-norm-regularized and score-matched models have highly data-aligned gradients compared to the baseline and anti-score-matched models. This shows that input-gradient alignment in neural networks can be significantly enhanced, and that it is a function of the generative modelling behaviour of the logit distributions. In this visualization, however, we do not see any discernible qualitative difference between the baseline and anti-score-matched gradients, nor are we able to make any quantitative statements about them. We hence propose to visualize and compare samples from these generative models.
4.2.1 Sampling from Model Distributions via Gradient Ascent
We are interested in recovering modes of our density models while having access only to the gradients of the log density. For this purpose, we apply gradient ascent on the log probability starting from a random noise input, which coincidentally is a standard approach in interpretability Simonyan et al. (2013). The gradient ascent step is usually followed by a projection step to ensure that the resulting input lies in the range of valid image inputs, i.e., between 0 and 1, and is given as follows:

x_{t+1} = Π_{[0,1]} ( x_t + η ∇_x log p_θ(x_t) )

Here, η is the step size. Our results are shown in Figure 6. We notice that modes from the score-matching-trained model are significantly more realistic than those of the baseline models. We also run a 'denoising' experiment, where instead of starting with random noise, we start gradient ascent from data points perturbed with small noise. The modes of an ideal generative model lie near clean, un-noised data, which motivates this experiment. Figure 6 shows that denoised samples of score-matched models are significantly more realistic than the rest.
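The procedure can be sketched as projected gradient ascent; below, a toy concave log-density with a known mode stands in for the model (all names are ours):

```python
import numpy as np

def log_density_grad(x, mode):
    # gradient of a toy concave log-density peaked at `mode`;
    # a stand-in for grad_x log p_theta(x)
    return -(x - mode)

def sample_mode(grad_fn, x0, step=0.1, n_steps=500):
    x = x0.copy()
    for _ in range(n_steps):
        x = x + step * grad_fn(x)      # gradient ascent on log p
        x = np.clip(x, 0.0, 1.0)       # projection onto the valid image range [0, 1]
    return x

rng = np.random.default_rng(0)
mode = np.full(16, 0.7)
x0 = rng.uniform(size=16)              # "random noise" initialization
x_star = sample_mode(lambda x: log_density_grad(x, mode), x0)
assert np.allclose(x_star, mode, atol=1e-3)   # ascent recovers the mode
```

The denoising variant simply replaces `x0` with a noisy data point, so that ascent settles at the nearest mode rather than an arbitrary one.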
We also propose to measure quantitatively how well these generated samples adhere to the data distribution. In particular, we measure the discriminative accuracy of the generated samples using a separately trained VGG-11 model. The intuition is that better class-conditional generated images are more likely to be correctly classified irrespective of the model. In contrast with more popular metrics such as the Inception score, we would like to capture only sample realism and not diversity, which motivates this measure. Like the Inception score, this is also an approximate test. We show the results in Table 1, which confirms the qualitative trend seen in the samples above.
Model                             Sample Acc. (%)    Denoised Sample Acc. (%)
Baseline ResNet                   1.3                1.5
+ Anti-Score-Matching             1.3                1.3
+ Gradient-Norm Regularization    32.7               37.8
+ Score-Matching                  57.7               64.9
4.2.2 Effect of Early Stopping
Here we shall evaluate the effect of early stopping on generative modelling behaviour. For this, we train a standard ResNet model for 200 epochs, and plot the density ratios of models at 1, 50, 100 and 200 epochs in Figure 1(b). We observe in Figure 6 that, as training progresses, the density ratios worsen over time. We also observe progressively worsening discriminative performance on samples from these models: while at epoch 25 we observed accuracies of (10.5%, 14.8%) for raw and denoised samples respectively, the corresponding quantities for epoch 50 were (3.3%, 4.7%), and for epoch 150, (1.3%, 1.3%). This quantitatively indicates worsening generative modelling behaviour.
4.3 Properties of the Implicit Regularizer
Our experiments show that improving generative modelling behaviour, as evidenced by Figure 1(a), leads to improved gradient alignment, as shown in Figures 3 and 6 and Table 1. This helps us identify one factor that can cause gradient alignment in standard models: approximate generative modelling of logit distributions. We also see that this generative modelling behaviour worsens during training, indicating that early stopping is one possible reason for this behaviour. To summarize, we find that early stopping may cause approximate generative modelling behaviour, which in turn causes gradient alignment. Score-matching also helps identify another factor related to approximate generative modelling: model smoothness, as indicated by the gradient-norm regularization baseline.
5 Conclusion
In this paper, we first showed that input-gradients are not feature importance representations, and do not encode information about the discriminative capabilities of the model. Next, we found that improving the generative modelling behaviour of logit distributions leads to improved gradient alignment. This helped us identify approximate generative modelling as a cause of gradient alignment in standard models. To study this effect, we considered the score-matching approach and proposed scalable variants of it.
However, in this paper we have only shown an empirical link between generative modelling and gradient alignment; an analytical link is still missing. One hypothesis is that the statistics of natural images are responsible for such gradient alignment: specifically, that the separation between a low-entropy 'object', which shows relatively little variation across images, and a high-entropy 'background', which changes with virtually every image, causes low-capacity generative models to treat background regions as noise, thus suppressing their gradients. The resolution of this hypothesis would definitively resolve the gradient alignment paradox, and is left as an open problem.
References

M. Ancona, E. Ceolini, C. Öztireli, and M. Gross (2018). Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018).

H. Avron and S. Toledo (2011). Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. Journal of the ACM (JACM) 58(2), pp. 1–34.

J. S. Bridle (1990). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pp. 227–236.

R. Fong and A. Vedaldi (2017). Interpretable explanations of black boxes by meaningful perturbation. In The IEEE International Conference on Computer Vision (ICCV).

W. Grathwohl, K.-C. Wang, J.-H. Jacobsen, D. Duvenaud, M. Norouzi, and K. Swersky (2020). Your classifier is secretly an energy based model and you should treat it like one. In International Conference on Learning Representations.

J. Heo, S. Joo, and T. Moon (2019). Fooling neural network interpretations via adversarial model manipulation. In Advances in Neural Information Processing Systems, pp. 2921–2932.

M. F. Hutchinson (1990). A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics—Simulation and Computation 19(2), pp. 433–450.

A. Hyvärinen (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6, pp. 695–709.

D. Jakubovitz and R. Giryes (2018). Improving DNN robustness to adversarial attacks using Jacobian regularization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 514–529.

D. P. Kingma and Y. LeCun (2010). Regularized estimation of image statistics by score matching. In Advances in Neural Information Processing Systems, pp. 1126–1134.

B. A. Pearlmutter (1994). Fast exact multiplication by the Hessian. Neural Computation 6(1), pp. 147–160.

R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017). Grad-CAM: visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626.

K. Simonyan, A. Vedaldi, and A. Zisserman (2013). Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.

D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg (2017). SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825.

Y. Song and S. Ermon (2019). Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11895–11907.

Y. Song, S. Garg, J. Shi, and S. Ermon (2019). Sliced score matching: a scalable approach to density and score estimation. arXiv preprint arXiv:1905.07088.

S. Srinivas and F. Fleuret (2019). Full-gradient representation for neural network visualization. In Advances in Neural Information Processing Systems, pp. 4126–4135.
Fooling Gradients is Simple

Observation.
Assume an arbitrary scalar function g: ℝ^D → ℝ. Consider another neural network function given by f̃_i(x) = f_i(x) + g(x) for all classes i, for which we obtain ∇_x f̃_i(x) = ∇_x f_i(x) + ∇_x g(x). For this, the corresponding loss values and loss-gradients are unchanged, i.e., ℓ̃(x, y) = ℓ(x, y) and ∇_x ℓ̃(x, y) = ∇_x ℓ(x, y).

Proof.
The following expressions relate the loss to the neural network outputs, for the case of cross-entropy loss with softmax:

(7) ℓ(x, y) = −f_y(x) + log Σ_j exp(f_j(x))

(8) ∇_x ℓ(x, y) = −∇_x f_y(x) + Σ_j p_j(x) ∇_x f_j(x)

Upon replacing f_i(x) with f_i(x) + g(x), the g terms cancel in equation 7, and hence also in equation 8, and the proof follows. ∎
Observation.
Assume an arbitrary vector-valued function g: ℝ^D → ℝ^K, such that g_i(x) ≠ g_j(x) for any two classes i ≠ j. Consider another neural network function obtained by adding g_i(x) to the corresponding logit f_i(x). For this, we have the following approximations:
(9)  
(10) 
Score-Matching Approximation

We consider the approximation derived for the estimator of the Hessian-trace, which is in turn derived from Hutchinson's trace estimator Hutchinson (1990). We replace the terms used in the main text with generic terms here for clarity: f below plays the role of log p_θ. The Taylor series trick for approximating the Hessian-trace is given below.

(11) (2/σ²) ( f(x + σv) − f(x) ) = (2/σ) vᵀ ∇f(x) + vᵀ ∇²f(x) v + O(σ ‖v‖³)

Taking the expectation over v ∼ N(0, I), the first term on the right vanishes and the second yields tr(∇²f(x)), recovering Hutchinson's estimator. As expected, the approximation error vanishes in the limit of small σ. Let us now consider the finite-sample variant of this estimator, with M samples. We shall call this the Taylor Trace Estimator.

(12) TTE(x) = (1/M) Σ_{m=1}^{M} (2/σ²) ( f(x + σ v_m) − f(x) ),  v_m ∼ N(0, I)
We shall henceforth suppress the dependence on x for brevity. For this estimator, we can compute the variance for quadratic functions f, for which the higher-order Taylor expansion terms are zero. We make the following observation.
Observation.
For quadratic functions f, the variance of the Taylor Trace Estimator is greater than the variance of the Hutchinson estimator by an amount at most equal to (4/(Mσ²)) ‖∇f(x)‖².
Proof.
Var(TTE) = (1/M) Var_v( (2/σ) vᵀ ∇f(x) + vᵀ ∇²f(x) v )
         = (1/M) [ Var_v( (2/σ) vᵀ ∇f(x) ) + Var_v( vᵀ ∇²f(x) v ) ]

where the cross-covariance vanishes for Gaussian v, whose odd moments are zero. Thus we have decomposed the variance of the overall estimator into two terms: the first captures the variance of the Taylor approximation, and the second captures the variance of the Hutchinson estimator.

Considering only the first term, i.e., the variance of the Taylor approximation, we have Var_v( (2/σ) vᵀ ∇f(x) ) = (4/σ²) ‖∇f(x)‖².

The intermediate steps involve expanding the summation, noticing that pairwise cross-terms cancel, and applying the Cauchy–Schwarz inequality. ∎
Thus we have a trade-off: a large σ results in lower estimator variance but a large Taylor approximation error, whereas the opposite is true for small σ. However, for functions with a small gradient norm, both the estimator variance and the Taylor approximation error are small even for small σ. We note that when applied to score-matching Hyvärinen (2005), the gradient norm of the function is also minimized. This implies that in practice, the gradient norm of the function is likely to be low, resulting in a small estimator variance even for small σ. For reference, the variance of the Hutchinson estimator with M Gaussian samples is (2/M) ‖∇²f(x)‖²_F Hutchinson (1990); Avron and Toledo (2011).