
Fast Axiomatic Attribution for Neural Networks

Mitigating the dependence on spurious correlations present in the training dataset is a quickly emerging and important topic of deep learning. Recent approaches include priors on the feature attribution of a deep neural network (DNN) into the training process to reduce the dependence on unwanted features. However, until now one needed to trade off high-quality attributions, satisfying desirable axioms, against the time required to compute them. This in turn either led to long training times or ineffective attribution priors. In this work, we break this trade-off by considering a special class of efficiently axiomatically attributable DNNs for which an axiomatic feature attribution can be computed with only a single forward/backward pass. We formally prove that nonnegatively homogeneous DNNs, here termed 𝒳-DNNs, are efficiently axiomatically attributable and show that they can be effortlessly constructed from a wide range of regular DNNs by simply removing the bias term of each layer. Various experiments demonstrate the advantages of 𝒳-DNNs, beating state-of-the-art generic attribution methods on regular DNNs for training with attribution priors.





1 Introduction

Many traditional machine learning (ML) approaches, such as linear models or decision trees, are inherently explainable Arrieta et al. (2020). Therefore, an ML practitioner can comprehend why a method yields a particular prediction and correct the method if the explanation for the result is flawed. The prevailing ML architectures in use today LeCun et al. (2015); Schmidhuber (2015), namely deep neural networks (DNNs), unfortunately do not come with this inherent explainability. This can cause models to depend on dataset biases and spurious correlations in the training data. For real-world applications, e.g., credit score or insurance risk assignment, this can be highly problematic and potentially lead to models discriminating against certain demographic groups Angwin et al. (2016); Obermeyer et al. (2019).

To mitigate the dependence on spurious correlations in DNNs, attribution priors have been recently proposed Erion et al. (2021); Rieger et al. (2020); Ross et al. (2017). By enforcing priors on the feature attribution of a DNN at training time, they allow actively controlling its behavior. As it turns out, attribution priors are a very flexible tool, allowing even complex model interventions such as making an object recognition model focus on shape Rieger et al. (2020) or less sensitive to high-frequency noise Erion et al. (2021).

However, their use brings new challenges over regular training. First, computing the feature attribution of a DNN is a nontrivial task. It is critical to use an attribution method that faithfully reflects the true behavior of the deep network and ideally satisfies the axioms proposed by Sundararajan et al. (2017). Otherwise, spurious correlations may go undetected and the attribution prior would be ineffective. Second, since the feature attribution is used in each training step, it needs to be efficiently computable. Existing work incurs a trade-off between high-quality attributions for which formal axioms hold Sundararajan et al. (2017) and the time required to compute them Erion et al. (2021). Prior work on attribution priors thus had to choose whether to rely on high-quality feature attributions or allow for efficient training. In this work, we obviate this trade-off.

Specifically, we make the following contributions: (i) We propose to consider a special class of DNNs, termed efficiently axiomatically attributable DNNs, for which we can compute a closed-form axiomatic feature attribution that satisfies the axioms of Sundararajan et al. (2017), requiring only one gradient evaluation. As a result, we can compute axiomatic high-quality attributions with only one forward/backward pass, and hence require only a fraction of the computing power that would be needed for regular DNNs, which involve a costly numerical approximation of an integral (Erion et al., 2021; Sundararajan et al., 2017). This significant improvement in efficiency makes our considered class of DNNs particularly well suited for scenarios where the attribution is used in the training process, such as for training with attribution priors. (ii) We formally prove that nonnegatively homogeneous DNNs (termed 𝒳-DNNs) are efficiently axiomatically attributable DNNs and establish a new theoretical connection between Input×Gradient (Shrikumar et al., 2016) and Integrated Gradients (Sundararajan et al., 2017) for nonnegatively homogeneous DNNs of different degrees. (iii) We show how 𝒳-DNNs can be instantiated from a wide range of regular DNNs by simply removing the bias term of each layer. While this may seem like a significant restriction, we show that the impact on the predictive accuracy in two different application domains is surprisingly minor. In a variety of experiments, we demonstrate the advantages of 𝒳-DNNs, showing that they (iv) admit accurate axiomatic feature attributions at a fraction of the computational cost and (v) beat state-of-the-art generic attribution methods for training regular networks with an attribution prior.

2 Related work

Attribution methods   can roughly be divided into perturbation-based Li et al. (2016); Zeiler and Fergus (2014); Zhou and Troyanskaya (2015); Zintgraf et al. (2017) and backpropagation-based Ancona et al. (2018); Bach et al. (2015); Shrikumar et al. (2017); Simonyan et al. (2014); Sundararajan et al. (2017) methods. The former repeatedly perturb individual inputs or neurons to measure their impact on the final prediction. Since each perturbation requires a separate forward pass through the DNN, those methods can be computationally inefficient Shrikumar et al. (2017) and consequently inappropriate for inclusion into the training process. We thus consider backpropagation-based methods or, more precisely, gradient-based and rule-based attribution methods. They propagate an importance signal from the DNN output to its input using either the gradient or predefined rules, making them particularly efficient Shrikumar et al. (2017) and thus well suited for inclusion into the training process. Gradient-based methods have the advantage that they scale to high-dimensional inputs, can be efficiently implemented using GPUs, and can be directly applied to any differentiable model without changing it Ancona et al. (2019).

The saliency method Simonyan et al. (2014), defined as the absolute input gradient, is an early gradient-based attribution method for DNNs. Shrikumar et al. (2016) propose the Input×Gradient method, i.e., weighting the (signed) input gradient with the input features, to improve sharpness of the attributions for images. Bach et al. (2015) introduce the rule-based Layerwise Relevance Propagation (LRP), with predefined backpropagation rules for each neural network component. As it turns out, LRP without modifications to deal with numerical instability can be reduced to Input×Gradient for DNNs with ReLU Nair and Hinton (2010) activation functions Ancona et al. (2018); Shrikumar et al. (2016), and hence can be expressed in terms of gradients as well. DeepLIFT Shrikumar et al. (2017) is another rule-based approach similar to LRP, relying on a neutral baseline input to assign contribution scores relative to the difference of the normal activation and reference activation of each neuron. Generally, rule-based approaches have the disadvantage that each DNN component requires custom modules that may have no GPU support and require a model re-implementation.

Axiomatic attributions.   As it is hard to empirically evaluate the quality of feature attributions, Sundararajan et al. (2017) propose several axioms that high-quality attribution methods should satisfy:

Sensitivity (a) is satisfied if for every input and baseline that differ in one feature but have different predictions, the differing feature is given a non-zero attribution.

Sensitivity (b) is satisfied if, whenever the function implemented by the deep network does not depend (mathematically) on some variable, the attribution to that variable is always zero.

Implementation invariance is satisfied if the attributions for two functionally equivalent networks are always identical.

Completeness is satisfied if the attributions add up to the difference between the output of the network for the input and for the baseline.

Linearity is satisfied if the attribution of a linearly composed deep network $aF_1 + bF_2$ is equal to the weighted sum of the attributions for $F_1$ and $F_2$ with weights $a$ and $b$, respectively.

Symmetry preservation is satisfied if for all inputs and baselines that have identical values for symmetric variables, the symmetric variables receive identical attributions.

Sundararajan et al. (2017) show that none of the above methods satisfy all axioms, e.g., the saliency method and Input×Gradient can suffer from the well-known problem of gradient saturation, which means that even important features can have zero attribution. To overcome this, Sundararajan et al. (2017) introduce Integrated Gradients, a gradient-based backpropagation method that provably satisfies these axioms; it is considered a high-quality attribution method to date Erion et al. (2021); Liu and Avci (2019). Its crucial disadvantage over previous methods is that an integral has to be solved, which generally requires an approximation based on 20–300 gradient calculations, making it correspondingly more expensive to compute than, e.g., Input×Gradient.
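The cost and the saturation issue described above can be illustrated with a minimal sketch. The toy model `f`, its analytic gradient `grad_f`, the midpoint Riemann sum, and the chosen input are all assumptions made for illustration; they are not taken from the paper's experiments.

```python
import math

def f(x):
    # Toy saturating model (hypothetical): F(x) = tanh(x0 + 2*x1).
    return math.tanh(x[0] + 2.0 * x[1])

def grad_f(x):
    # Analytic gradient of the toy model.
    s = 1.0 - math.tanh(x[0] + 2.0 * x[1]) ** 2
    return [s, 2.0 * s]

def integrated_gradients(x, baseline, steps):
    # Midpoint Riemann sum along the straight-line path from baseline to x.
    # Every step costs one gradient evaluation.
    total = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad_f(point))]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [3.0, 1.0]
baseline = [0.0, 0.0]

ig = integrated_gradients(x, baseline, steps=300)        # 300 gradient calls
input_x_grad = [xi * g for xi, g in zip(x, grad_f(x))]   # 1 gradient call

# Completeness: IG attributions sum to F(x) - F(baseline).
completeness_gap = abs(sum(ig) - (f(x) - f(baseline)))
print("IG:", ig, "gap:", completeness_gap)
print("Input x Gradient:", input_x_grad)
```

In this sketch the input lies in the saturated region of tanh, so the single-gradient attribution is close to zero for both features, whereas the integrated attribution sums to the model output at the price of many gradient evaluations.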

Attribution priors.   The above attribution methods can not only be used for explaining a model's behavior but also to actively control that behavior. To that end, the training objective can be formulated as

$$\theta^{*} = \arg\min_{\theta} \sum_{(x, y) \in \mathcal{D}} \mathcal{L}\big(F_\theta(x), y\big) + \lambda\, \Omega\big(\mathcal{A}(F_\theta, x)\big), \tag{1}$$

where a model $F_\theta$ with parameters $\theta$ is trained on the annotated dataset $\mathcal{D}$. $\mathcal{L}$ denotes the regular task loss, and $\Omega$ is a scalar-valued loss of the feature attribution $\mathcal{A}(F_\theta, x)$, which is called the attribution prior Erion et al. (2021); $\lambda$ controls the relative weighting. For example, by forcing certain values of the attribution to be zero, we can mitigate the dependence on unwanted features Ross et al. (2017). More complex model interventions, such as making an object recognition model focus on shape Rieger et al. (2020) or less sensitive to high-frequency noise Erion et al. (2021), can also be formulated using attribution priors.
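As a rough sketch of how such an objective is assembled for a single training sample, the snippet below adds a weighted penalty on the attribution of unwanted features to a task loss. The function names, the squared-attribution penalty, and all numbers are hypothetical choices for illustration; the objective itself only requires the prior to be some scalar-valued function of the attribution.

```python
def attribution_prior(attribution, unwanted_mask):
    # Hypothetical prior: squared attribution mass on features flagged as
    # unwanted, which pushes those attributions toward zero.
    return sum(a * a for a, m in zip(attribution, unwanted_mask) if m)

def total_loss(task_loss, attribution, unwanted_mask, lam):
    # Per-sample objective: task loss plus weighted attribution prior.
    return task_loss + lam * attribution_prior(attribution, unwanted_mask)

# Made-up values: three input features, the last two marked as unwanted.
attribution = [0.5, -0.2, 0.1]
unwanted_mask = [False, True, True]
loss = total_loss(task_loss=0.3, attribution=attribution,
                  unwanted_mask=unwanted_mask, lam=10.0)
print(loss)
```

Because the attribution appears inside the loss, it has to be recomputed (and differentiated through) at every training step, which is why its cost matters.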

An early instance of this idea is the Right for the Right Reasons (RRR) approach of Ross et al. (2017), which uses the input gradient of the log prediction to mitigate the dependence on unwanted features. While this is more stable than simply using the input gradient, it still suffers from the problem of saturation. RRR may thus not reflect the true behavior of the model, and therefore miss relevant features. Subsequent work addresses this issue using axiomatic feature attribution methods, specifically Integrated Gradients Chen et al.; Liu and Avci (2019); Sundararajan et al. (2017), which allow for more effective attribution priors Erion et al. (2021) but incur significant computational overhead, rendering them impractical for many scenarios. For example, Liu and Avci (2019) report a thirty-fold increase in training time compared to standard training. Rieger et al. (2020) propose an alternative attribution prior based on rule-based contextual decomposition (CD) Murdoch et al. (2018); Singh et al. (2019) as the attribution method. This allows considering clusters of features Erion et al. (2021) instead of individual features and defining attribution priors that work on feature groups. However, computing the attribution for individual features becomes computationally inefficient Erion et al. (2021). Additionally, since CD is a rule-based attribution method, it requires custom modules and cannot be applied to all types of DNNs Erion et al. (2021).

The very recently proposed Expected Gradients Erion et al. (2021) method reformulates Integrated Gradients as an expectation, allowing a sampling-based approximation of the attribution. Erion et al. (2021) argue that, similar to batch gradient descent, where the true gradient of the loss function is approximated over many training steps, the sampling-based approximation allows approximating the attribution over many training steps. This results in better attributions while using fewer approximation steps. Even using as few as one reference sample, i.e., only one gradient computation, can yield advantages over the regular input gradient. However, we show that using only one reference sample still does not yield the same attribution quality as an axiomatic feature attribution method, which is reflected in less effective attribution priors. Schramowski et al. (2020) propose a human-in-the-loop strategy to define appropriate attribution priors while training. Our attribution method is complementary and could be used within their framework.

3 Efficiently axiomatically attributable DNNs

Formally, given a function $F: \mathbb{R}^n \to \mathbb{R}$, representing a single output node of a DNN, and an input $x \in \mathbb{R}^n$, the feature attribution for the prediction at input $x$ relative to a baseline input $x' \in \mathbb{R}^n$ is a vector $\mathcal{A}(F, x, x') = (a_1, \dots, a_n) \in \mathbb{R}^n$, where each element $a_i$ is the contribution of feature $x_i$ to the prediction $F(x)$ Sundararajan et al. (2017). In this work we seek an attribution method that is particularly well suited for scenarios where such an attribution is used at training time, e.g., training with attribution priors. As such, the attribution method should be of high quality, while being efficiently computable in a single forward/backward pass. Since attributions obtained from Integrated Gradients Sundararajan et al. (2017) have strong theoretical justifications and are known to be of high quality Liu and Avci (2019), they will serve as our starting point. In general, however, Integrated Gradients are expensive to compute for arbitrary DNNs. Therefore, in this work, we restrict our attention to a special sub-class of DNNs, termed efficiently axiomatically attributable DNNs, that require only a single forward/backward pass to compute a closed-form solution of Integrated Gradients. We show that nonnegatively homogeneous DNNs belong to this class and use this insight to guide the design of a concrete instantiation of efficiently axiomatically attributable DNNs. While there may be several such instantiations, we chose this particular one as it can be easily constructed from a wide range of regular DNNs by simply removing the bias term of each layer. This ensures comparability to prior work and allows for an easy adaptation of existing network architectures.

Definition 3.1.

We call a DNN $F: \mathbb{R}^n \to \mathbb{R}$ efficiently axiomatically attributable w.r.t. a baseline $x' \in \mathbb{R}^n$, if there exists a closed-form solution of the axiomatic feature attribution method Integrated Gradients $\mathrm{IG}_i(F, x, x')$ along the $i$-th dimension of $x \in \mathbb{R}^n$, requiring only one forward/backward pass.

Note that all differentiable models are efficiently axiomatically attributable w.r.t. the trivial baseline $x' = x$. However, using such a baseline is not helpful. Instead, commonly chosen baselines are some kind of averaged input features or baselines $x'$ such that $F(x') \approx 0$, which allow an interpretation of the attribution that amounts to distributing the output to the individual input features Sundararajan et al. (2017).

Proposition 3.2.

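For a DNN $F: \mathbb{R}^n \to \mathbb{R}$ there exists a closed-form solution of $\mathrm{IG}_i(F, x, \mathbf{0})$ w.r.t. the zero baseline $\mathbf{0} \in \mathbb{R}^n$, requiring only one forward/backward pass, if $F$ is strictly positive homogeneous of degree $k \in \mathbb{R}_{>0}$, i.e., $F(\alpha x) = \alpha^k F(x)$ for $\alpha \in \mathbb{R}_{>0}$.

Proof. Sundararajan et al. (2017) define the axiomatic feature attribution method Integrated Gradients (IG) along the $i$-th dimension for a given model $F$, input $x$, baseline $x'$, and straight-line path $\gamma(\alpha) = x' + \alpha (x - x')$, $\alpha \in [0, 1]$, as

$$\mathrm{IG}_i(F, x, x') = (x_i - x'_i) \int_0^1 \frac{\partial F(\tilde{x})}{\partial \tilde{x}_i}\bigg|_{\tilde{x} = x' + \alpha (x - x')} \, d\alpha. \tag{2}$$

Assuming $F$ is strictly positive homogeneous of degree $k$, we can write Integrated Gradients in Eq. 2 as

$$\mathrm{IG}_i(F, x, \mathbf{0}) = x_i \lim_{\beta \to 0^+} \int_\beta^1 \frac{\partial F(\tilde{x})}{\partial \tilde{x}_i}\bigg|_{\tilde{x} = \alpha x} \, d\alpha = x_i \lim_{\beta \to 0^+} \int_\beta^1 \alpha^{k-1} \, d\alpha \; \frac{\partial F(x)}{\partial x_i} = \frac{1}{k}\, x_i \frac{\partial F(x)}{\partial x_i}. \tag{3}$$

The second equality holds because differentiating $F(\alpha x) = \alpha^k F(x)$ w.r.t. $x_i$ shows that the partial derivative of $F$ evaluated at $\alpha x$ equals $\alpha^{k-1}\, \partial F(x) / \partial x_i$ for $\alpha > 0$. The gradient expression in Eq. 3 can be computed using a single forward/backward pass. ∎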
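The statement of Proposition 3.2 can be checked numerically on a small example. The sketch below uses a hypothetical polynomial that is homogeneous of degree 2 (it is not a DNN and not taken from the paper) and compares a Riemann-sum approximation of Integrated Gradients with the zero baseline against the input-times-gradient expression divided by the degree.

```python
def f(x):
    # Hypothetical toy function, homogeneous of degree k = 2:
    # F(a * x) = a**2 * F(x) for a > 0.
    return x[0] ** 2 + x[0] * x[1]

def grad_f(x):
    # Analytic gradient of the toy function.
    return [2.0 * x[0] + x[1], x[0]]

def ig_zero_baseline(x, steps):
    # Midpoint Riemann sum of Integrated Gradients with the zero baseline.
    total = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        g = grad_f([alpha * xi for xi in x])
        total = [t + gi for t, gi in zip(total, g)]
    return [xi * t / steps for xi, t in zip(x, total)]

k = 2
x = [1.0, 2.0]

ig_numeric = ig_zero_baseline(x, steps=64)                      # 64 gradient calls
ig_closed_form = [xi * gi / k for xi, gi in zip(x, grad_f(x))]  # 1 gradient call

print(ig_numeric, ig_closed_form, f(x))
```

Both attributions agree up to floating-point error, and they sum to the function value, as completeness with respect to the zero baseline requires.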

While Ancona et al. (2018) already found that Input×Gradient equals Integrated Gradients with the zero baseline for linear models or models that behave linearly for a selected task, our Proposition 3.2 is more general: We only require strictly positive homogeneity of an arbitrary order $k \in \mathbb{R}_{>0}$. This allows us to consider a larger class of models including nonnegatively homogeneous DNNs, which generally are not linear.

Definition 3.3.

We call a DNN $F: \mathbb{R}^n \to \mathbb{R}$ nonnegatively homogeneous, if $F(\alpha x) = \alpha F(x)$ for all $\alpha \in \mathbb{R}_{\geq 0}$.

Corollary 3.4.

Any nonnegatively homogeneous DNN is efficiently axiomatically attributable w.r.t. the zero baseline and a closed-form solution of the axiomatic feature attribution method Integrated Gradients, requiring only one forward/backward pass, exists.


Proof. Corollary 3.4 follows directly from Proposition 3.2 and Definitions 3.1 and 3.3. ∎

Definition 3.5.

We let 𝒳-DNN denote a nonnegatively homogeneous DNN. Further, for any 𝒳-DNN $F: \mathbb{R}^n \to \mathbb{R}$, we let 𝒳-Gradient (𝒳G) be an axiomatic feature attribution method relative to the zero baseline $\mathbf{0} \in \mathbb{R}^n$ defined as

$$\mathcal{X}\mathrm{G}_i(F, x) = x_i \frac{\partial F(x)}{\partial x_i}. \tag{4}$$

Note that while the formulas for the existing attribution method Input×Gradient (Shrikumar et al., 2016) and our novel 𝒳-Gradient are equal, 𝒳-Gradient is only defined for the subclass of 𝒳-DNNs and provably satisfies axioms that are generally not satisfied by Input×Gradient. Additionally, from the nonnegative homogeneity of 𝒳-DNNs it follows that 𝒳-Gradient attributions are also nonnegatively homogeneous. This allows us to define another desirable axiom that is in line with intuition about how attribution should work and that is satisfied by 𝒳-Gradient.
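A minimal sketch of the 𝒳-Gradient computation is given below for a tiny bias-free two-layer ReLU network. The weights, the input, and the hand-written backward pass are illustrative assumptions; in practice an automatic differentiation framework would supply the gradient.

```python
def forward(x, W1, w2):
    # Bias-free two-layer ReLU network: F(x) = w2 . relu(W1 x).
    z = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    out = sum(w * max(0.0, zj) for w, zj in zip(w2, z))
    return out, z

def x_gradient(x, W1, w2):
    # One forward pass, then one (manual) backward pass.
    _, z = forward(x, W1, w2)
    grad = [0.0] * len(x)
    for row, w, zj in zip(W1, w2, z):
        if zj > 0.0:  # ReLU passes gradient only through active units
            grad = [g + w * wij for g, wij in zip(grad, row)]
    # Attribution: each input feature times its partial derivative.
    return [xi * gi for xi, gi in zip(x, grad)]

# Hypothetical weights and input.
W1 = [[1.0, -1.0], [0.5, 1.0], [-1.0, 0.5]]
w2 = [1.0, -1.0, 2.0]
x = [2.0, 1.0]

output, _ = forward(x, W1, w2)
attribution = x_gradient(x, W1, w2)
print(attribution, sum(attribution), output)
```

For this network the attributions add up to the output, i.e., completeness relative to the zero baseline holds after a single gradient evaluation.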

Definition 3.6.

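An attribution method $\mathcal{A}$ satisfies nonnegative homogeneity if $\mathcal{A}(F, \alpha x, x') = \alpha\, \mathcal{A}(F, x, x')$ for all $\alpha \in \mathbb{R}_{\geq 0}$.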

Axiom | Integrated Gradients | Expected Gradients | Expected Gradients (1) | (Input×)Gradient | 𝒳-Gradient
Sensitivity (a)
Sensitivity (b)
Implementation invariance

Table 1: Overview of different gradient-based DNN attribution methods and the axioms Sundararajan et al. (2017) that they provably satisfy. The left-hand side methods (Integrated Gradients, Expected Gradients) induce one to two orders of magnitude of computational overhead compared to the methods on the right-hand side. The methods on the right-hand side require only one gradient evaluation (indicated by (1) for Expected Gradients with one reference sample) and thus can be computed in a single forward/backward pass. Note how 𝒳-Gradient satisfies all axioms while requiring as little computational cost as a simple gradient evaluation; however, it is only defined for 𝒳-DNNs.

For an overview of the axioms Sundararajan et al. (2017) that are satisfied by popular gradient-based attribution methods, see Table 1. The right-hand side methods use only one gradient evaluation and therefore have similar computational expense. The left-hand side methods generally require multiple gradient evaluations until convergence, making them correspondingly more expensive to compute. Note that 𝒳-Gradient satisfies all the axioms satisfied by Integrated Gradients and Expected Gradients Erion et al. (2021), assuming convergence of the latter, while requiring only a fraction of the computational cost; however, it is only defined for 𝒳-DNNs. Existing methods that have similar computational expense as 𝒳-Gradient generally do not satisfy all of the axioms, and therefore are likely to produce lower-quality attributions, which can be misleading and less effective for imposing attribution priors.

Constructing 𝒳-DNNs.   With this motivation in mind, we will now study concrete instantiations of nonnegatively homogeneous DNNs. Note that this class of DNNs has already been considered by Zhang et al. (2019), however, neither at the same level of detail nor in the context of feature attributions. We define the output $h_L$ of a regular feedforward DNN with $L$ layers, for an input $x \in \mathbb{R}^n$, as a recursive sequence of layers $h_l$ that are applied to the output of the respective previous layer:

$$h_0 = x, \qquad h_l = \psi_l\big(\phi_l(W_l h_{l-1} + b_l)\big), \quad l = 1, \dots, L, \tag{5}$$

with $W_l$ and $b_l$ being the weight matrix and bias term for layer $l$, $\phi_l$ being the corresponding activation function, and $\psi_l$ being the corresponding pooling function. Both $\phi_l$ and $\psi_l$ are optional; alternatively they are the identity function. For simplicity, we assume that the last task-specific layer, e.g., the softmax function for classification tasks, is part of the loss function. Further, for a cleaner notation that aligns with Sundararajan et al. (2017), we assume w.l.o.g. that we are only considering one output node at a time, e.g., the logit of the target class for classification tasks. This yields the DNN $F: \mathbb{R}^n \to \mathbb{R}$ that we consider and allows us to directly compute the derivative of the model w.r.t. an input feature $x_i$.

Importantly, the above formalization comprises many popular layer types and architectures. For example, fully connected and convolutional layers are essentially matrix multiplications Wang et al. (2019), and therefore can be expressed by Eq. 5. Skip connections can also be expressed as matrix multiplication by appending the identity matrix to the weight matrix so that the input is propagated to later layers Wang et al. (2019). This allows us to describe even complex architectures such as the ResNet He et al. (2016) variant proposed by Zhang et al. (2019). As the above definition of a DNN includes models that are generally not nonnegatively homogeneous, we have to make some assumptions.

Assumption 3.7.

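The activation functions $\phi_l$ and pooling functions $\psi_l$ in the model are nonnegatively homogeneous. Formally, for all $\alpha \in \mathbb{R}_{\geq 0}$:

$$\phi_l(\alpha z) = \alpha\, \phi_l(z) \quad \text{and} \quad \psi_l(\alpha z) = \alpha\, \psi_l(z).$$

Proposition 3.8.

Piecewise linear activation functions with two intervals separated by zero satisfy Assumption 3.7. For $z \in \mathbb{R}$, these activation functions are defined as

$$\phi(z) = \begin{cases} a_1 z, & z > 0, \\ a_2 z, & z \leq 0, \end{cases} \qquad a_1, a_2 \in \mathbb{R}.$$

Proposition 3.9.

Linear pooling functions or pooling functions selecting values based on their relative ordering satisfy Assumption 3.7. For $z \in \mathbb{R}^m$, these pooling functions are defined as

$$\psi(z) = \big(p(z_{G_1}), \dots, p(z_{G_r})\big),$$

with $G_1, \dots, G_r$ being a grouping of entries in $z$ based on their spatial location and $p$ being linear or a selection of a value based on its relative ordering, e.g., the maximum or minimum value.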
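The two propositions can be sanity-checked numerically. The sketch below tests nonnegative homogeneity for a leaky ReLU, max pooling, and average pooling on a small vector; the helper names, the slope, the pooling window, and the test values are all assumptions for illustration.

```python
import math

def leaky_relu(z, slope=0.25):
    # Piecewise linear activation with two intervals separated by zero.
    return [v if v > 0.0 else slope * v for v in z]

def max_pool(z, size=2):
    # Selects a value per group based on relative ordering.
    return [max(z[i:i + size]) for i in range(0, len(z), size)]

def avg_pool(z, size=2):
    # Linear pooling per group.
    return [sum(z[i:i + size]) / size for i in range(0, len(z), size)]

def is_homogeneous(fn, z, alpha):
    # Checks fn(alpha * z) == alpha * fn(z) up to floating-point error.
    lhs = fn([alpha * v for v in z])
    rhs = [alpha * v for v in fn(z)]
    return all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(lhs, rhs))

z = [-2.0, 0.5, 3.0, -1.0]
checks = {
    fn.__name__: all(is_homogeneous(fn, z, a) for a in (0.0, 0.5, 4.0))
    for fn in (leaky_relu, max_pool, avg_pool)
}
# For a negative factor the ordering flips, so max pooling no longer commutes
# with scaling; this is why the property is restricted to alpha >= 0.
max_pool_negative_alpha = is_homogeneous(max_pool, z, -1.0)
print(checks, max_pool_negative_alpha)
```

A check on a handful of values is of course not a proof; it only illustrates the property that Appendix A establishes formally.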

For proofs of Propositions 3.8 and 3.9, please refer to Appendix A. Activation functions in Proposition 3.8 include ReLU Nair and Hinton (2010), Leaky ReLU Maas et al. (2013), and PReLU He et al. (2015). Linear pooling functions in Proposition 3.9 include average pooling, global average pooling, and strided convolutions. Other pooling functions in Proposition 3.9 include max pooling and min pooling Socher et al., where the largest or smallest value is selected. Therefore, DNN architectures satisfying Assumption 3.7 include, inter alia, AlexNet Krizhevsky et al., VGGNet Simonyan and Zisserman (2015), ResNet He et al. (2016) as introduced in Zhang et al. (2019), and MLPs with ReLU activations. They alone have been cited well over one hundred thousand times, showing that we are considering a substantial fraction of commonly used DNN architectures. However, these architectures are generally still not nonnegatively homogeneous. It is easy to see that even for a simple linear model $F(x) = Wx + b$ that can be expressed by Eq. 5 and that satisfies Assumption 3.7, nonnegative homogeneity does not hold, because $F(\alpha x) = \alpha W x + b \neq \alpha (W x + b) = \alpha F(x)$ whenever $b \neq 0$ and $\alpha \neq 1$. Therefore, in a final step we set the bias term of each layer to zero. While this may seem like a significant restriction, we show in Sec. 4 that the impact on the predictive accuracy in two different application domains is surprisingly minor.

Corollary 3.10.

Any regular DNN given by Eq. 5 satisfying Assumption 3.7 can be transformed into an 𝒳-DNN by removing the bias term of each layer.

Proof. A DNN $F$ with $L$ layers given by Eq. 5 with all biases set to 0 can be written as $F(x) = \psi_L(\phi_L(W_L \cdots \psi_1(\phi_1(W_1 x)) \cdots))$. As all the pooling functions $\psi_l$, activation functions $\phi_l$, and matrix multiplications $W_l$ in $F$ are nonnegatively homogeneous, it follows that $F(\alpha x) = \alpha F(x)$ for all $\alpha \in \mathbb{R}_{\geq 0}$. ∎
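The effect of removing the bias terms can be seen in a small numerical sketch. The two-layer ReLU network, its weights, its biases, and the scaling factor below are hypothetical and only serve to contrast the two cases.

```python
def mlp(x, W1, b1, w2, b2):
    # Two-layer ReLU network with optional biases:
    # F(x) = w2 . relu(W1 x + b1) + b2.
    z = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    return sum(w * max(0.0, zj) for w, zj in zip(w2, z)) + b2

# Hypothetical parameters.
W1 = [[1.0, -1.0], [0.5, 1.0], [-1.0, 0.5]]
w2 = [1.0, -1.0, 2.0]
b1, b2 = [0.5, -0.5, 1.0], 0.25
zero_b1, zero_b2 = [0.0, 0.0, 0.0], 0.0

x = [2.0, 1.0]
alpha = 3.0
scaled_x = [alpha * xi for xi in x]

# Gap between F(alpha * x) and alpha * F(x), with and without biases.
with_bias_gap = abs(mlp(scaled_x, W1, b1, w2, b2) - alpha * mlp(x, W1, b1, w2, b2))
without_bias_gap = abs(
    mlp(scaled_x, W1, zero_b1, w2, zero_b2) - alpha * mlp(x, W1, zero_b1, w2, zero_b2)
)
zero_input_output = mlp([0.0, 0.0], W1, zero_b1, w2, zero_b2)
print(with_bias_gap, without_bias_gap, zero_input_output)
```

With biases the scaled output deviates clearly from the scaled input's output, while the bias-free variant is homogeneous and maps the zero input to zero.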

Further discussion.   We additionally note that our results have interesting consequences for DNNs in certain application domains, e.g., in computer vision, as they allow relating efficient axiomatic attributability to desirable properties of DNNs:

Remark 3.11.

If a DNN $F$, taking an image as input, is equivariant w.r.t. the image contrast, it is efficiently axiomatically attributable.

This observation follows directly from the fact that contrast equivariance implies nonnegative homogeneity. Consequently, contrast-equivariant DNNs for regression tasks, such as image restoration or image super-resolution, are automatically efficiently axiomatically attributable. For classification tasks, such as image classification or semantic segmentation, contrast equivariance of the logits at the output implies efficient axiomatic attributability. If the classification is done using a softmax, then this also implies contrast invariance of the classifier output. In other words, there is a close relation between efficient axiomatic attributability and the desirable property of contrast equi-/invariance. We further illustrate this experimentally in Sec. 4.4.
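The relation between contrast equivariance of the logits and invariance of the classification can be sketched as follows. The bias-free two-class network, its weights, the two-pixel "image", and the contrast factor are hypothetical, and reading "classifier output" as the predicted class is our interpretation.

```python
import math

def logits(x, W1, W2):
    # Bias-free ReLU network with one logit per class.
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * hj for w, hj in zip(row, h)) for row in W2]

def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def argmax(v):
    return max(range(len(v)), key=lambda i: v[i])

# Hypothetical weights and input.
W1 = [[1.0, -1.0], [0.5, 1.0], [-1.0, 0.5]]
W2 = [[1.0, -1.0, 2.0], [0.5, 1.0, -1.0]]
image = [2.0, 1.0]
contrast = 0.5
dimmed = [contrast * p for p in image]

base_logits = logits(image, W1, W2)
dimmed_logits = logits(dimmed, W1, W2)

# Logits scale with the contrast factor (equivariance) ...
equivariant = all(
    math.isclose(d, contrast * b, abs_tol=1e-12)
    for d, b in zip(dimmed_logits, base_logits)
)
# ... so the predicted class after the softmax stays the same.
same_class = argmax(softmax(base_logits)) == argmax(softmax(dimmed_logits))
print(base_logits, dimmed_logits, equivariant, same_class)
```

Note that the softmax probabilities themselves change with the contrast factor in this sketch; only the ranking of the classes, and hence the prediction, is preserved.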

Limitations.   So far, we have discussed the advantages of 𝒳-DNNs such as being able to efficiently compute high-quality feature attributions. However, we also want to mention the limitations of our method. First, our method can only be applied to certain DNNs satisfying Assumption 3.7. Although this is a large class of models, our method is not completely model agnostic, unlike other gradient-based attribution methods. Second, removing the bias term might be disadvantageous in certain scenarios. Intuitively, the bias term can be seen as an intercept in a linear function, and therefore is important to fit given data. As a matter of fact, a DNN without bias terms will always produce a zero output for the zero input, which might be problematic. Additionally, prior work argues that the bias term is an important factor for the predictive performance of a DNN Wang et al. (2019). However, these theoretical foundations are somewhat contradictory to our own practical findings. In Sec. 4, we show that removing the bias term has less of a negative impact than perhaps expected, indicating that removing it can be a plausible intervention on the model architecture. As we cannot guarantee that this holds for all DNNs, we recommend that practitioners who plan on using our method first make a preliminary analysis of whether removing the bias from the model at hand is plausible. Third, our method implicitly uses the zero baseline $\mathbf{0}$. As $F(\mathbf{0}) = 0$, this is a reasonable choice because it can be interpreted as being neutral Sundararajan et al. (2017). Nevertheless, other baselines could produce attributions that are better suited for certain tasks Pan et al. (2021); Sturmfels et al. (2020); Wang et al. (2021). For example, the zero baseline will generally assign lower attribution scores to features closer to zero, which can result in misleading attributions. Whether the advantages outweigh the disadvantages must be decided for each application individually. In Sec. 4 we demonstrate the advantages of 𝒳-DNNs, beating state-of-the-art generic attribution methods for training with attribution priors.

4 Experiments

To demonstrate the practicability of our proposed method, we now evaluate it in various experiments using two different data domains to confirm the following points: (1) It is plausible to remove the bias term in order to obtain 𝒳-DNNs. (2) Our 𝒳-Gradient method produces superior attributions compared to other efficient gradient-based attribution methods. (3) Our 𝒳-Gradient method has advantages over state-of-the-art generic attribution methods for training with attribution priors. (4) 𝒳-DNNs are robust to multiplicative contrast changes.

Experimental setup.   For our experiments on models for image classification, i.e., Secs. 4.1, 4.2, and 4.4, we use the ImageNet Russakovsky et al. (2015) dataset, containing about 1.2 million images of 1000 different categories. We train on the training split and report numbers for the validation split. In Sec. 4.2 we quantify the quality of attributions for image classification models by adapting the metrics proposed by Lundberg et al. (2020) to work with image data. These metrics reflect how well an attribution method captures the relative importance of features by measuring the network's accuracy or its output logit of the target class while masking out a progressively increasing fraction of the features based on their relative importance. For example, for the Keep Positive Mask (KPM) metric, the output logit of the target class should stay as high as possible while progressively masking out the least important features. As a mask we use a Gaussian blur of the original image. For a detailed description of the metrics, please refer to Lundberg et al. (2020) or Sec. B.2. If not indicated otherwise, we assume numerical convergence for Integrated Gradients and Expected Gradients, which we found to occur after 128 approximation steps (see Sec. B.5).

4.1 Removing the bias term in DNNs

Model | Top-5 accuracy (%, ↑): AlexNet | VGG16 | ResNet-50 | Mean absolute relative difference (%, ↓): AlexNet | VGG16 | ResNet-50
Regular DNN | 79.21 | 90.44 | 92.56 | 79.0 | 97.8 | 93.8
𝒳-DNN | 78.54 | 90.25 | 91.12 | 1.2 | 0.4 | 0.0

Table 2: Top-5 accuracy on the ImageNet Russakovsky et al. (2015) validation split and mean absolute relative difference (see Sec. B.1) of Input×Gradient (for regular DNNs) or 𝒳-Gradient (for 𝒳-DNNs) to the numerical approximation of Integrated Gradients. Note how removing the bias (𝒳-DNN) impairs the accuracy only marginally while reducing the mean absolute relative difference to Integrated Gradients significantly, confirming our theoretical finding that 𝒳-Gradient equals Integrated Gradients.

The bias term has historically played an important role, and almost all DNN architectures use one. In this first experiment, we evaluate how much removing the bias to obtain an 𝒳-DNN affects the accuracy of different DNNs. To this end, we train multiple popular image classification networks, AlexNet Krizhevsky et al. (2012), VGG16 Simonyan and Zisserman (2015), and the ResNet-50 variant of Zhang et al. (2019), as well as their corresponding 𝒳-DNN variants obtained by removing the bias term, on the challenging ImageNet Russakovsky et al. (2015) dataset. The resulting top-5 accuracy on the validation split is given in Table 2. As we can observe, removing the bias decreases the accuracy of the models only marginally. This is a somewhat surprising result since prior work indicates that the bias term in DNNs plays an important role Wang et al. (2019). We hypothesize that when removing the bias term, the DNN learns some kind of layer averaging strategy that compensates for the missing bias. For an additional comparison between a DNN with bias and its corresponding 𝒳-DNN in a non-vision domain, see Sec. 4.3, which mirrors our findings here. Additionally, to empirically validate our finding that 𝒳-Gradient (𝒳G) equals Integrated Gradients for 𝒳-DNNs, we report the mean absolute relative difference (see Sec. B.1) between the attribution obtained from Integrated Gradients Sundararajan et al. (2017) and the attribution obtained from computing Input×Gradient for regular DNNs or 𝒳-Gradient for 𝒳-DNNs over the ImageNet validation split. For regular models with biases, Integrated Gradients produces a very different attribution compared to Input×Gradient. For 𝒳-DNNs, on the other hand, the two attribution methods are virtually identical, as expected. The small deviation can be explained by the fact that the result of Integrated Gradients Sundararajan et al. (2017) is computed via numerical approximation, whereas our method computes the exact integral (of course only for 𝒳-DNNs).
We make the pre-trained 𝒳-DNN models publicly available to promote a wide adoption of efficiently axiomatically attributable models.
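The comparison described above can be illustrated on a toy network. The following is a minimal sketch, not the released code: it assumes a two-layer ReLU network with hand-picked, hypothetical weights, a zero baseline, and a midpoint Riemann sum for Integrated Gradients. With the bias removed, the single-gradient attribution and the numerically approximated Integrated Gradients coincide; with a bias, they differ.

```python
def pre_activations(x, W1, b1):
    """Hidden-layer pre-activations W1 @ x + b1."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]


def forward(x, W1, b1, w2, b2):
    """Scalar output of a two-layer ReLU network."""
    hidden = [max(0.0, p) for p in pre_activations(x, W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def gradient(x, W1, b1, w2):
    """Analytic gradient of the output with respect to the input."""
    grad = [0.0] * len(x)
    for row, p, w in zip(W1, pre_activations(x, W1, b1), w2):
        if p > 0:  # only active ReLU units pass a gradient
            for i, wi in enumerate(row):
                grad[i] += w * wi
    return grad


def input_times_gradient(x, W1, b1, w2):
    """Single-gradient attribution: x * dF/dx."""
    return [xi * gi for xi, gi in zip(x, gradient(x, W1, b1, w2))]


def integrated_gradients(x, W1, b1, w2, steps=128):
    """Midpoint Riemann-sum approximation of IG with a zero baseline."""
    total = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        g = gradient([alpha * xi for xi in x], W1, b1, w2)
        total = [t + gi for t, gi in zip(total, g)]
    return [xi * t / steps for xi, t in zip(x, total)]


# Toy weights (hypothetical values chosen for illustration only).
W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 1.0]]
w2 = [1.0, -1.5, 2.0]
x = [1.0, 2.0]

zero_b1, zero_b2 = [0.0, 0.0, 0.0], 0.0  # bias removed
b1, b2 = [0.5, -1.0, 0.5], 0.2           # regular network with bias

xg_nobias = input_times_gradient(x, W1, zero_b1, w2)
ig_nobias = integrated_gradients(x, W1, zero_b1, w2)
ixg_bias = input_times_gradient(x, W1, b1, w2)
ig_bias = integrated_gradients(x, W1, b1, w2)

print("no bias:", xg_nobias, ig_nobias)  # the two attributions coincide
print("bias:   ", ixg_bias, ig_bias)     # the two attributions differ
```

In the bias-free case the attributions also sum to the network output, which is the completeness axiom with a zero baseline.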

4.2 Benchmarking gradient-based attribution methods

Method      AlexNet                        𝒳-AlexNet
            KPM    KNM    KAM    RAM       KPM    KNM    KAM    RAM
IG (128)    7.57   1.67   25.22  11.12     7.38   2.21   21.79  11.68
Random      3.68   3.68   14.12  14.10     3.81   3.81   13.52  13.50
Grad (1)    3.62   3.88   20.78  11.82     3.87   4.34   19.75  11.25
EG (1)      4.92   2.97   20.49  13.76     5.41   3.19   19.47  13.19
𝒳G (1)      N/A    N/A    N/A    N/A       7.38   2.21   21.83  11.68
Table 3: Metrics of Lundberg et al. (2020) to measure the attribution quality of different attribution methods. Please refer to the experimental setup in the beginning of Sec. 4 and Sec. B.2 for an introduction of the metrics. We evaluate Integrated Gradients (IG) Sundararajan et al. (2017), random attributions (Random), input gradient attributions (Grad), Expected Gradients (EG) Erion et al. (2021), and our novel 𝒳-Gradient (𝒳G) attribution on a regular AlexNet Krizhevsky et al. (2012) and the corresponding 𝒳-AlexNet. The numbers in parentheses indicate the required gradient calls. Our method is on par with IG in terms of quality while requiring two orders of magnitude less computational power.

Since prior work Erion et al. (2021); Liu and Avci (2019) as well as our experiment in Sec. 4.3 suggest that the quality of an attribution method positively impacts the effectiveness of attribution priors, we benchmark our method against existing gradient-based attribution methods that are commonly used for training with attribution priors. For evaluation, we use the metrics from Lundberg et al. (2020) adapted to work with image data. Using these metrics allows for a diverse assessment of the feature importance Lundberg et al. (2020) and ensures consistency with the experimental setup in Erion et al. (2021). Table 3 shows the resulting numbers for a regular AlexNet and our corresponding 𝒳-AlexNet. Due to the axioms satisfied by the Integrated Gradients method, it produces the best attributions for the regular network, which is in line with the results in Yeh et al. (2019). However, as it approximates an integral where each approximation step requires an additional gradient evaluation, it also introduces one to two orders of magnitude of computational overhead compared to the other methods (Sundararajan et al. (2017) recommend 20–300 gradient evaluations to approximate attributions). For the 𝒳-AlexNet, however, our 𝒳-Gradient method is on par with Integrated Gradients and produces the best attributions while requiring only one gradient evaluation, and therefore, a fraction of the compute power. Since the input gradient and Expected Gradients Erion et al. (2021) with only one reference sample do not satisfy many of the desirable axioms (see Table 1), they produce clearly lower-quality attributions, as expected. Note that high-quality attribution methods should perform well across all the listed metrics, which is why the input gradient is not a competitive attribution method even though it performs well on the RAM metric.
To conclude, we can see that our 𝒳-Gradient attribution yields a significant improvement in quality compared to state-of-the-art generic attribution methods that require similar computational cost. This suggests that our effort to produce an efficient and high-quality attribution method is justified and accomplished.

4.3 Training with attribution priors

Figure 1: (left) Average ROC-AUC across 200 randomly subsampled datasets for the same attribution prior using different attribution methods. “w/o bias” denotes that the bias term has been removed from the MLP. (right) Average ROC-AUC across 200 randomly subsampled datasets of Expected Gradients (EG) over the number of reference samples. The current state-of-the-art EG requires approximately 32 reference samples, and thus, 32 times more computational power to outmatch 𝒳G. Confidence intervals indicate two times the standard error of the mean.

To benchmark our approach against other attribution methods when training with attribution priors, we replicate the sparsity experiment introduced in Erion et al. (2021). To that end, we employ the public NHANES I survey data Miller (1973) of the CDC of the United States, containing 118 one-hot encoded medical attributes, e.g., age, sex, and vital sign measurements, from 13,000 human subjects (no personally identifiable information). The objective of the binary classification task is to predict if a human subject will be dead (0) or alive (1) ten years after the data was measured. A simple MLP with ReLU activations is used as the model. Therefore, it can be transformed into an 𝒳-DNN by simply removing the bias terms. To emulate a setting of scarce training data and to average out variance, we randomly subsample 200 training and validation datasets containing 100 data points from the original dataset.

Erion et al. (2021) proposed a novel attribution prior that maximizes the Gini coefficient of the feature attributions, i.e., concentrates the attribution on few features. They show that this allows learning sparser models, which generalize better on small training datasets. The more faithfully the attribution reflects the true behavior of the model, the more effective the attribution prior should be.
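To make the Gini-based prior concrete, the following is a minimal sketch that uses the standard definition of the Gini coefficient applied to the mean absolute attribution per feature. The function names and the normalization constant are assumptions for illustration; the exact formulation in Erion et al. (2021) may differ by a constant factor.

```python
def gini(values):
    """Gini coefficient of nonnegative values (0 = uniform, close to 1 = concentrated)."""
    n = len(values)
    total = sum(values)
    if total == 0:
        return 0.0
    pairwise = sum(abs(a - b) for a in values for b in values)
    return pairwise / (2 * n * total)


def sparsity_prior(batch_attributions):
    """Negative Gini coefficient of the mean absolute attribution per feature.

    Minimizing this term as part of the training loss maximizes the Gini
    coefficient, i.e., it favors attributions concentrated on few features.
    """
    num_features = len(batch_attributions[0])
    mean_abs = [
        sum(abs(sample[i]) for sample in batch_attributions) / len(batch_attributions)
        for i in range(num_features)
    ]
    return -gini(mean_abs)


# Hypothetical attribution batches: one concentrated on a single feature, one uniform.
sparse_batch = [[0.1, -0.1, 2.0], [-0.1, 0.1, -2.0]]
dense_batch = [[1.0, -1.0, 1.0]]

print(sparsity_prior(sparse_batch))  # lower (better) value for sparse attributions
print(sparsity_prior(dense_batch))
```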

Comparing attribution methods.   We compare different attribution methods that have previously been used for training with attribution priors and require only one gradient evaluation; thus, they have comparable computational cost. The results in Fig. 1(left) show that our method (𝒳G w/o bias) outperforms all other competing methods. We can also see that for the unregularized model, removing the bias (Unreg w/o bias) has almost no effect on the average ROC-AUC of the method, once again showing that our modification for making attributions efficient, i.e., removing the bias term, is plausible in various scenarios.

Since the attribution quality of Expected Gradients can be improved using more reference samples, as this yields a better approximation to the true integral, we plot the average ROC-AUC of Expected Gradients over the number of reference samples used in Fig. 1(right). We can clearly see that adding more samples improves the ROC-AUC when training with an EG attribution in the prior, yet again showing that higher-quality attributions lead to more effective attribution priors Erion et al. (2021). However, we also find that approximately 32 reference samples are needed, and hence 32 times more computational power, to match the quality of our efficient 𝒳-Gradient method. When using more than 32 reference samples, Expected Gradients slightly outperforms our method in terms of ROC-AUC, which is due to the limitations discussed in Sec. 3 (fixed baseline, no bias terms). We argue that it is often worth accepting this small accuracy disadvantage in light of the significant gain in efficiency of computing high-quality feature attributions.

To put this improvement in efficiency into perspective, we measure the computation time of training a ResNet-50 on the ImageNet dataset when using Expected Gradients with 32 reference samples and 𝒳-Gradient (see Sec. B.3). Using a single GPU, the computational overhead introduced when using Expected Gradients with 32 reference samples amounts to an approximately -fold increase in the required computation time compared to training with 𝒳-Gradient, and thus, would turn several days of training into several months of training.

4.4 Homogeneity of 𝒳-DNNs

Figure 2: (left) Top-1 accuracy for AlexNet and 𝒳-AlexNet on the ImageNet validation split with decreasing contrast (scaled by ). Due to the nonnegative homogeneity of 𝒳-AlexNet, the accuracy does not drop when reducing the contrast. (right) Qualitative examples of normalized attributions for 𝒳-AlexNet and AlexNet using the attribution methods 𝒳-Gradient (𝒳G) or Input×Gradient (I×G), respectively, as well as Integrated Gradients (IG). The displayed attributions obtained from 𝒳-AlexNet are almost identical, while attributions obtained from AlexNet differ significantly (see highlighted areas).

The fundamental difference between 𝒳-DNNs and regular DNNs is the nonnegative homogeneity of the former. To show the implications for the model and its attributions, we conduct the following experiment. Similarly to Hendrycks and Dietterich (2019), we reduce the contrast of the ImageNet Russakovsky et al. (2015) validation split by multiplying each image with varying factors and report the top-1 accuracy of AlexNet and the corresponding 𝒳-AlexNet from Sec. 4.1, i.e., they have not been trained specifically to handle contrast changes. Results can be found in Fig. 2(left). We can observe that decreasing the image contrast leads to a strong drop in accuracy of a regular AlexNet. On the other hand, due to the equivariance to contrast of 𝒳-DNNs, the accuracy of the 𝒳-AlexNet is unaffected, showing improved robustness towards multiplicative contrast changes. To give some qualitative examples, in Fig. 2(right) we plot the attributions for the output logit of the target class (‘reflex camera’) for a regular AlexNet and an 𝒳-AlexNet for an original image and the corresponding low-contrast image obtained by multiplying the normalized image with . For the 𝒳-AlexNet, our 𝒳-Gradient method and Integrated Gradients (IG) Sundararajan et al. (2017) produce attributions that are identical up to a small approximation error; reducing the image contrast keeps the attributions unchanged up to a scaling factor (not visible due to normalization for display purposes). However, the displayed attributions from the regular AlexNet differ significantly between Input×Gradient (I×G) and IG, as well as for different contrasts (see highlighted areas). We argue that the above properties of 𝒳-DNNs generally reflect desirable properties and show that they behave more predictably with contrast changes than regular DNNs.
Also note how high feature attribution scores can arise in the background (e.g., red box), showing how DNN predictions can depend on parts of the input that do not appear salient for the category; this highlights a possible use case for attribution priors Rieger et al. (2020); Ross et al. (2017) enabled by our approach.
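The effect of nonnegative homogeneity on contrast changes can be reproduced on a toy classifier. The sketch below assumes a small two-layer ReLU network with hypothetical weights and a contrast factor of 0.1 chosen purely for illustration. Without bias terms, the logits scale with the input and the predicted class is unchanged; with an output bias, the predicted class can flip.

```python
def logits(x, W1, b1, W2, b2):
    """Class logits of a two-layer ReLU network."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


# Toy weights and input (hypothetical values for illustration only).
W1 = [[1.0, -1.0], [0.5, 2.0]]
W2 = [[1.0, 0.2], [0.3, 0.5]]
x = [2.0, 1.0]
alpha = 0.1  # assumed contrast factor
x_low = [alpha * xi for xi in x]

zero_b1, zero_b2 = [0.0, 0.0], [0.0, 0.0]  # bias removed
b1, b2 = [0.0, 0.0], [0.0, -0.1]           # regular network with an output bias

full_nb = logits(x, W1, zero_b1, W2, zero_b2)
low_nb = logits(x_low, W1, zero_b1, W2, zero_b2)
full_b = logits(x, W1, b1, W2, b2)
low_b = logits(x_low, W1, b1, W2, b2)

print(argmax(full_nb), argmax(low_nb))  # same class for the bias-free network
print(argmax(full_b), argmax(low_b))    # class changes for the network with bias
```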

5 Conclusion and broader impact

In this work, we consider a special class of efficiently axiomatically attributable DNNs, for which an axiomatic feature attribution can be computed in a single forward/backward pass. We show that nonnegatively homogeneous DNNs, termed 𝒳-DNNs, are efficiently axiomatically attributable and establish a new theoretical connection between Input×Gradient Shrikumar et al. (2016) and Integrated Gradients Sundararajan et al. (2017) for nonnegatively homogeneous DNNs of different degrees. Moreover, we show that many commonly used architectures can be transformed into 𝒳-DNNs by simply removing the bias term of each layer, which has a surprisingly minor impact on the accuracy of the model in two application domains. The resulting efficiently computable and high-quality feature attributions are particularly well-suited for inclusion into the training process and potentially enable a wide application of axiomatic attribution priors. This could, for example, be used to reliably mitigate dependence on unwanted features and biases induced by the training dataset, which is a major challenge in today’s ML systems.

Obermeyer et al. (2019) found evidence that a widely used algorithm in the U.S. health care system contains racial biases that can be traced back to biases in the dataset that was used to develop the algorithm. Using our method to generate high-quality attributions that reflect the true behavior of an 𝒳-DNN and an appropriate attribution prior, such problems could potentially be resolved (though more research on attribution priors is necessary). However, allowing biases to be controlled by an ML practitioner can introduce new risks. Just like datasets, humans are also not free of biases Tversky and Kahneman (1974), which can potentially be reflected in such priors. We as a society need to be careful that this responsibility is not exploited and used for discriminatory or harmful purposes. One way to approach this problem for applications that affect the general public is to introduce an ethical review committee, which assesses whether the proposed priors are legitimate or reprehensible.

We would like to thank Jannik Schmitt for helpful mathematical advice. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 866008). The project has also been supported in part by the State of Hesse through the cluster projects “The Third Wave of Artificial Intelligence (3AI)” and “The Adaptive Mind (TAM)”.


  • Ancona et al. [2018] M. Ancona, E. Ceolini, C. Öztireli, and M. Gross. Towards better understanding of gradient-based attribution methods for deep neural networks. In ICLR, 2018.
  • Ancona et al. [2019] M. Ancona, E. Ceolini, C. Öztireli, and M. H. Gross. Gradient-based attribution methods. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, volume 11700 of Lecture Notes in Computer Science, pages 169–191. Springer, 2019.
  • Angwin et al. [2016] J. Angwin, J. Larson, S. Mattu, and L. Kirchner. There’s software used across the country to predict future criminals. And it’s biased against blacks. ProPublica, 2016.
  • Arrieta et al. [2020] A. B. Arrieta, N. D. Rodríguez, J. D. Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-Lopez, D. Molina, R. Benjamins, R. Chatila, and F. Herrera. Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion, 58:82–115, 2020.
  • Bach et al. [2015] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7):1–46, 2015.
  • Chen et al. [2019] J. Chen, X. Wu, V. Rastogi, Y. Liang, and S. Jha. Robust attribution regularization. In NeurIPS, pages 14300–14310, 2019.
  • Erion et al. [2021] G. Erion, J. D. Janizek, P. Sturmfels, S. Lundberg, and S.-I. Lee. Improving performance of deep learning models with axiomatic attribution priors and expected gradients. Nature Machine Intelligence, 3(7):620–631, 2021.
  • He et al. [2015] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, pages 1026–1034, 2015.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • Hendrycks and Dietterich [2019] D. Hendrycks and T. G. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2019.
  • Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
  • LeCun et al. [2015] Y. LeCun, Y. Bengio, and G. E. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • Li et al. [2016] J. Li, W. Monroe, and D. Jurafsky. Understanding neural networks through representation erasure. arXiv:1612.08220 [cs.CL], 2016.
  • Liu and Avci [2019] F. Liu and B. Avci. Incorporating priors with feature attribution on text classification. In ACL, pages 6274–6283, 2019.
  • Lundberg et al. [2020] S. M. Lundberg, G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, and S.-I. Lee. From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1):56–67, 2020.
  • Maas et al. [2013] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech, and Language Processing, 2013.
  • Miller [1973] H. W. Miller. Plan and operation of the health and nutrition examination survey. United States–1971-1973. Vital and Health Statistics. Ser. 1, Programs and Collection Procedures, 1(10a):1–46, 1973.
  • Murdoch et al. [2018] W. J. Murdoch, P. J. Liu, and B. Yu. Beyond word importance: Contextual decomposition to extract interactions from LSTMs. In ICLR, 2018.
  • Nair and Hinton [2010] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807–814, 2010.
  • Obermeyer et al. [2019] Z. Obermeyer, B. Powers, C. Vogeli, and S. Mullainathan. Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464):447–453, 2019.
  • Pan et al. [2021] D. Pan, X. Li, and D. Zhu. Explaining deep neural network models with adversarial gradient integration. In IJCAI, pages 2876–2883, 2021.
  • Rieger et al. [2020] L. Rieger, C. Singh, W. Murdoch, and B. Yu. Interpretations are useful: Penalizing explanations to align neural networks with prior knowledge. In ICML, volume 119, pages 8116–8126, 2020.
  • Ross et al. [2017] A. S. Ross, M. C. Hughes, and F. Doshi-Velez. Right for the right reasons: Training differentiable models by constraining their explanations. In IJCAI, pages 2662–2670, 2017.
  • Russakovsky et al. [2015] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. Int. J. Comput. Vision, 115(3):211–252, 2015.
  • Schmidhuber [2015] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
  • Schramowski et al. [2020] P. Schramowski, W. Stammer, S. Teso, A. Brugger, F. Herbert, X. Shao, H.-G. Luigs, A.-K. Mahlein, and K. Kersting. Making deep neural networks right for the right scientific reasons by interacting with their explanations. Nature Machine Intelligence, 2(8):476–486, 2020.
  • Shrikumar et al. [2016] A. Shrikumar, P. Greenside, A. Shcherbina, and A. Kundaje. Not just a black box: Learning important features through propagating activation differences. arXiv:1605.01713 [cs.LG], 2016.
  • Shrikumar et al. [2017] A. Shrikumar, P. Greenside, and A. Kundaje. Learning important features through propagating activation differences. In ICML, volume 70, pages 3145–3153, 2017.
  • Simonyan and Zisserman [2015] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • Simonyan et al. [2014] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR, 2014.
  • Singh et al. [2019] C. Singh, W. J. Murdoch, and B. Yu. Hierarchical interpretations for neural network predictions. In ICLR, 2019.
  • Socher et al. [2011] R. Socher, E. H. Huang, J. Pennington, A. Y. Ng, and C. D. Manning. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS, pages 801–809, 2011.
  • Sturmfels et al. [2020] P. Sturmfels, S. Lundberg, and S.-I. Lee. Visualizing the impact of feature attribution baselines. Distill, 5, 2020.
  • Sundararajan et al. [2017] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In ICML, volume 70, pages 3319–3328, 2017.
  • Tversky and Kahneman [1974] A. Tversky and D. Kahneman. Judgment under uncertainty: Heuristics and biases. Science, 185(4157):1124–1131, 1974.
  • Wang et al. [2019] S. Wang, T. Zhou, and J. A. Bilmes. Bias also matters: Bias attribution for deep neural network explanation. In ICML, volume 97, pages 6659–6667, 2019.
  • Wang et al. [2021] Z. Wang, M. Fredrikson, and A. Datta. Robust models are more interpretable because attributions look normal. arXiv:2103.11257 [cs.LG], 2021.
  • Yeh et al. [2019] C. Yeh, C. Hsieh, A. S. Suggala, D. I. Inouye, and P. Ravikumar. On the (in)fidelity and sensitivity of explanations. In NeurIPS, pages 10965–10976, 2019.
  • Zeiler and Fergus [2014] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, volume 1, pages 818–833, 2014.
  • Zhang et al. [2019] H. Zhang, Y. N. Dauphin, and T. Ma. Fixup initialization: Residual learning without normalization. In ICLR, 2019.
  • Zhou and Troyanskaya [2015] J. Zhou and O. G. Troyanskaya. Predicting effects of noncoding variants with deep learning–based sequence model. Nature Methods, 12(10):931–934, 2015.
  • Zintgraf et al. [2017] L. M. Zintgraf, T. S. Cohen, T. Adel, and M. Welling. Visualizing deep neural network decisions: Prediction difference analysis. In ICLR, 2017.

Appendix A Proofs and further results

Proof details for Proposition 3.2.

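In the proof of Proposition 3.2, we make use of the property that the derivative of a k-order homogeneous and differentiable function is a (k−1)-order homogeneous function, i.e., scaling the input by a factor α > 0 scales the derivative by α^(k−1) (Eq. 9); see, e.g., Corollary 4 in Border (2000). Assuming k-order homogeneity of the function and using the chain rule, Eq. 9 follows from differentiating both sides of the homogeneity condition with respect to the input.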
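Written out, the chain-rule argument takes the following form. The symbols F (the function), x (its input), α (the scaling factor), and k (the order of homogeneity) are assumed notation for this sketch and may differ from the notation of the main text.

```latex
% Assumed notation: F is differentiable and k-order homogeneous,
% x is its input, and \alpha > 0 is a scaling factor.
\begin{align}
  F(\alpha x) &= \alpha^{k} F(x)
    && \text{($k$-order homogeneity)} \\
  \alpha \, (\nabla F)(\alpha x) &= \alpha^{k} \, \nabla F(x)
    && \text{(differentiate both sides w.r.t. $x$; chain rule on the left)} \\
  (\nabla F)(\alpha x) &= \alpha^{k-1} \, \nabla F(x)
    && \text{(divide by $\alpha > 0$)}
\end{align}
```

For k = 1, the gradient is therefore unchanged along the ray from the origin to x, which is why a single gradient evaluation suffices to evaluate the path integral of Integrated Gradients with a zero baseline.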


Proof of Proposition 3.8.

For any input x, α ∈ ℝ≥0, and piecewise linear activation function according to Eq. 7, we want to show that scaling the input by α scales the output of the activation function by α.

For α = 0, both sides evaluate to 0 and the equality holds. For α > 0, the equality holds as long as the active interval of the activation function does not change. The active interval changes either when the sign of the input is changed, i.e., it goes from positive to negative or vice versa, or when a positive input changes to 0, or a value of 0 changes to positive. Since α > 0, a multiplication with α can neither change the sign nor make positive values 0 or 0 values positive. Therefore, scaling the input with α changes none of the active activation function intervals and nonnegative homogeneity for α > 0 holds. ∎

Proof of Proposition 3.9.

For any input x, α ∈ ℝ≥0, and pooling function with the assumed properties, we want to show that scaling the input by α scales the output of the pooling function by α.

If the pooling function is linear, homogeneity implicitly holds. If the pooling function is selecting values based on their relative ordering, we consider two cases. For α = 0, both sides evaluate to 0 and the equality holds. For α > 0, the relative ordering of the entries in x is unchanged by a scaling with α, hence the same entry is selected by the pooling function. Since the value of the selected entry is scaled by α, the above Eq. 12 holds for α > 0 and nonnegative homogeneity is satisfied. ∎
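The statements of Propositions 3.8 and 3.9 can be checked numerically. The following minimal sketch assumes that Eq. 7 covers ReLU-style activations with a fixed slope on each side of zero; the helper names, the test inputs, and the scaling factors are hypothetical. A ReLU with an added bias is included as a counterexample.

```python
import math


def relu(z):
    return max(0.0, z)


def leaky_relu(z, slope=0.1):
    return z if z > 0 else slope * z


def shifted_relu(z):
    """ReLU applied after adding a bias of 1; not nonnegatively homogeneous."""
    return max(0.0, z + 1.0)


def max_pool(values):
    return max(values)


def avg_pool(values):
    return sum(values) / len(values)


def scale(x, alpha):
    """Scale a scalar or a list of scalars by alpha."""
    return [alpha * v for v in x] if isinstance(x, list) else alpha * x


def is_nonneg_homogeneous(fn, inputs, alphas=(0.0, 0.5, 3.0)):
    """Check fn(alpha * x) == alpha * fn(x) on the given inputs and factors."""
    return all(
        math.isclose(fn(scale(x, a)), a * fn(x), abs_tol=1e-12)
        for x in inputs
        for a in alphas
    )


scalars = [-2.0, -0.5, 0.0, 0.5, 2.0]
windows = [[-1.0, 2.0, 0.5], [-3.0, -1.0, -2.0]]

print(is_nonneg_homogeneous(relu, scalars))          # True
print(is_nonneg_homogeneous(leaky_relu, scalars))    # True
print(is_nonneg_homogeneous(max_pool, windows))      # True
print(is_nonneg_homogeneous(avg_pool, windows))      # True
print(is_nonneg_homogeneous(shifted_relu, scalars))  # False
```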

Proof that 𝒳-Gradient satisfies nonnegative homogeneity (Definition 3.6).

Using Eq. 9 and the nonnegative 1-order homogeneity of any 𝒳-DNN, it follows that scaling the input by α ≥ 0 scales the 𝒳-Gradient attribution by α, and therefore, nonnegative homogeneity of the attribution is satisfied. ∎

Axiomatic attributions.   Table 1 in the main text summarizes the axioms Sundararajan et al. (2017) that are satisfied by several attribution methods. For proofs of the axioms that are satisfied by Integrated Gradients, please refer to Sundararajan et al. (2017). For proofs of the axioms that are satisfied by Expected Gradients, please refer to Erion et al. (2021). For proofs of the axioms that are satisfied by Input×Gradient and Gradient, please refer to Erion et al. (2021); Sundararajan et al. (2017) and see below. As 𝒳-Gradient equals Integrated Gradients for 𝒳-DNNs according to Proposition 3.2, all the axioms satisfied by Integrated Gradients are also satisfied by 𝒳-Gradient (for 𝒳-DNNs).

For Expected Gradients to satisfy the same axioms that are satisfied by Integrated Gradients, convergence must have occurred, which can only be expected after multiple gradient evaluations. To emphasize the advantage of our method when only considering attribution methods that use a single gradient evaluation, in Table 1 we also show the axioms that are satisfied by Expected Gradients Erion et al. (2021) when using only one reference sample, i.e., when convergence did not yet occur. Proof sketches for the axioms satisfied by Expected Gradients with only one reference sample are as follows:

  1. Sensitivity (a): Since there exist networks for which Sensitivity (a) is not satisfied by Input×Gradient, and Expected Gradients could choose a sample such that the approximation equals Input×Gradient, Sensitivity (a) is also not satisfied by Expected Gradients in general.

  2. Sensitivity (b): As the gradient w.r.t. an irrelevant feature will always be zero, Sensitivity (b) is satisfied.

  3. Implementation invariance: As Expected Gradients use stochastic sampling for the baseline, there is no guarantee that even for the same model two attributions are equal.

  4. Completeness: Again, following the argument from Sensitivity (a), Completeness is not given.

  5. Linearity: As Expected Gradients use a stochastic sampling for the baseline, there is no guarantee that Linearity holds.

  6. Symmetry-preserving: Following the argument from Linearity, Symmetry-preserving does not hold.

Why is nonnegative homogeneity a desirable axiom for attribution methods?   Explainability is closely related to predictability. Knowing how a model behaves under certain changes to the input implies an understanding of the model. Therefore, axioms like linearity Sundararajan et al. (2017) and nonnegative homogeneity, which essentially describe a form of predictability, are generally desirable and allow for a more complete understanding of the model’s behavior.

(Input×)Gradient violates Sensitivity (a).   To see that gradients and Input×Gradient violate Sensitivity (a), it is instructive to consider the concrete example given in Sundararajan et al. (2017): Assume we have a simple ReLU network f(x) = 1 − ReLU(1 − x). When moving from a baseline x′ = 0 to an input x = 2, the output f changes from 0 to 1. However, as the function flattens out at x = 1, the above gradient-based attribution methods would yield an attribution of 0 for the input x = 2.
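This example from Sundararajan et al. (2017) is easy to verify numerically. The sketch below uses the function f(x) = 1 − ReLU(1 − x) with baseline 0 and input 2; the midpoint Riemann sum with 100 steps and the choice of a zero derivative at the kink are assumptions made for illustration.

```python
def f(x):
    """The one-dimensional ReLU network f(x) = 1 - ReLU(1 - x)."""
    return 1.0 - max(0.0, 1.0 - x)


def grad_f(x):
    """Derivative of f: 1 for x < 1 and 0 for x > 1 (0 is chosen at the kink)."""
    return 1.0 if x < 1.0 else 0.0


def integrated_gradients(x, baseline=0.0, steps=100):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    total = 0.0
    for s in range(steps):
        alpha = (s + 0.5) / steps
        total += grad_f(baseline + alpha * (x - baseline))
    return (x - baseline) * total / steps


x = 2.0
gradient_attr = grad_f(x)                 # 0.0, although f changed from 0 to 1
input_x_gradient_attr = x * grad_f(x)     # 0.0 as well
ig_attr = integrated_gradients(x)         # 1.0 = f(2) - f(0)

print(gradient_attr, input_x_gradient_attr, ig_attr)
```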

Appendix B Experimental details

In the following section we provide additional details to ensure reproducibility of our experiments. For further information, please see our public code released under an Apache License 2.0.

B.1 Removing the bias term in DNNs

The models for all reported results in Sec. 4.1 have been trained for 100 epochs on the training split of the ImageNet Russakovsky et al. (2015) dataset with a batch size of 256 and using a single Nvidia A100 SXM4 (40GB) GPU. The training time per epoch is approximately 10 minutes for AlexNet, 60 minutes for VGG16, and 40 minutes for ResNet-50. For training the AlexNet and VGG models, we use the official PyTorch Paszke et al. (2017) implementation that is published under a BSD 3-Clause license. We use an SGD optimizer with an initial learning rate of 0.01 that is decayed by a factor of 0.1 every 30 epochs, a momentum of 0.9, and a weight decay of 1e-4. For training the ResNet models, we use the settings proposed by Zhang et al. (2019) and rely on the publicly available code, which is released under a BSD 3-Clause license. The hyperparameters are the same as for the AlexNet and VGG models except that we use mixup regularization Zhang et al. (2018) with an interpolation strength , a cosine annealing learning rate scheduler, and an initial learning rate of 0.1.
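For reference, the two learning rate schedules can be written as plain functions of the epoch. This is a sketch under assumptions: the schedules are evaluated once per epoch, and the cosine schedule is assumed to anneal to zero over the full 100 epochs, which the text does not state.

```python
import math


def step_decay_lr(epoch, base_lr=0.01, gamma=0.1, step_size=30):
    """Learning rate decayed by a factor of gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def cosine_annealing_lr(epoch, base_lr=0.1, total_epochs=100, min_lr=0.0):
    """Cosine annealing from base_lr to min_lr (min_lr = 0 is an assumption)."""
    cosine = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cosine


for epoch in (0, 30, 60, 90):
    print(epoch, step_decay_lr(epoch), cosine_annealing_lr(epoch))
```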

The mean absolute relative difference between the attribution obtained from Integrated Gradients Sundararajan et al. (2017) and the attribution obtained from calculating Input×Gradient for regular DNNs or 𝒳-Gradient for 𝒳-DNNs is calculated as


with denoting a dataset consisting of samples .

B.2 Benchmarking gradient-based attribution methods

For the experimental comparison of gradient-based attribution methods in Sec. 4.2, we use the models from Sec. 4.1 (see Sec. B.1 for details) and evaluate on the ImageNet validation split using a single Nvidia A100 SXM4 (40GB) GPU. To quantify the quality of attributions, we use the attribution quality metrics proposed by Lundberg et al. (2020). The metrics reflect how well an attribution method captures the relative importance of features by masking out a progressively increasing fraction of the features based on their relative importance:

Keep Positive Mask (KPM) measures the attribution method’s capability to find the features that lead to the greatest increase in the model’s output logit of the target class. To this end, a progressively increasing fraction of the features is masked out, ordered from least positive to most positive attribution. Then the AUC of the resulting curve is measured. Intuitively, if an attribution reflects the true behavior of the model, unimportant features will be masked out first and the model output logit decreases only marginally, resulting in a high value for the AUC. Conversely, when an attribution does not reflect the true behavior of the model, an important feature might be masked out too early and the target class output decreases quickly, leading to a smaller score.

Keep Negative Mask (KNM) works analogously for negative features. This means that the better the attribution, the smaller the metric. Note that for KPM and KNM, all negative and positive features are masked out by default, respectively.

Keep Absolute Mask (KAM) and Remove Absolute Mask (RAM) work similarly but use the absolute value of the attributions and measure the AUC of the top-1 accuracy. For KAM, we keep the most important features and measure the AUC of the top-1 accuracy over different fractions of masking. A high-quality attribution method should keep the features most important for making a correct classification, and therefore, the metric should be as high as possible. RAM masks out the most important features first, meaning that the accuracy should drop fast. Therefore, a smaller value indicates a better attribution.

As we evaluate attributions for image classification models, we adapt the above metrics to work with image data. This is achieved by replacing the masked pixels with those of a blurry image, which is obtained using a Gaussian blur with a kernel size of and applied to the original input image. The parameters were chosen such that the resulting image is visually heavily blurred. This ensures that features can properly be removed.
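To illustrate the masking procedure, the following is a simplified sketch of a Keep Positive Mask style score. It is not the implementation of Lundberg et al. (2020): the function name, the number of masking fractions, the trapezoidal AUC, and the toy linear model are assumptions, and the replacement values stand in for the blurred image described above.

```python
def keep_positive_mask_auc(model, x, attribution, replacement, num_points=11):
    """AUC of the model output while masking positive features, least positive first.

    Features with non-positive attribution are masked from the start.
    """
    positive = [i for i, a in enumerate(attribution) if a > 0]
    order = sorted(positive, key=lambda i: attribution[i])  # least positive first
    outputs = []
    for p in range(num_points):
        fraction = p / (num_points - 1)
        num_masked = round(fraction * len(order))
        kept = set(order[num_masked:])
        x_masked = [x[i] if i in kept else replacement[i] for i in range(len(x))]
        outputs.append(model(x_masked))
    step = 1.0 / (num_points - 1)
    # Trapezoidal rule over the masking fractions in [0, 1].
    return sum((outputs[i] + outputs[i + 1]) / 2.0 * step for i in range(num_points - 1))


# Toy linear model (hypothetical weights); its exact attribution is w_i * x_i.
weights = [3.0, 1.0, 2.0, -1.0]


def model(v):
    return sum(w * vi for w, vi in zip(weights, v))


x = [1.0, 1.0, 1.0, 1.0]
replacement = [0.0, 0.0, 0.0, 0.0]  # stands in for the blurred input
good_attribution = [w * xi for w, xi in zip(weights, x)]
bad_attribution = [1.0, 3.0, 2.0, -1.0]  # ranks the most important feature as least important

auc_good = keep_positive_mask_auc(model, x, good_attribution, replacement, num_points=4)
auc_bad = keep_positive_mask_auc(model, x, bad_attribution, replacement, num_points=4)
print(auc_good, auc_bad)  # the faithful attribution obtains the higher score
```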

B.3 Training with attribution priors

Our experiment with attribution priors in Sec. 4.3 replicates the experimental setup of Erion et al. (2021). We use the original code, which includes the NHANES I dataset and is published under the MIT license. We use the attribution prior proposed by Erion et al. (2021) to learn sparser models, which have improved generalizability. The prior is defined as the negative Gini coefficient of the mean feature attributions of a mini-batch. This prior improves the sparsity of the model by concentrating the feature attributions on few features. We use the following attribution methods as baselines, which are commonly used for training with attribution priors: Expected Gradients (EG), the input gradient of the log of the output logit as proposed by Ross et al. (2017) (RRR), and a regular input gradient (Grad). We compare these methods with our novel 𝒳-Gradient (𝒳G) attribution method. For each attribution method, we perform an individual hyperparameter search to find the optimal regularization strength . We find for RRR and Grad, and for 𝒳G and EG. When training the model with Expected Gradients using more than one reference sample, we continue to use the regularization strength that was found using one reference sample. All other hyperparameters are kept as in the original experiment of Erion et al. (2021). To train the models, we use an Nvidia GeForce RTX 3090 (24GB) GPU.

To provide a numerical comparison of the efficiency of 𝒳-Gradient and Expected Gradients Erion et al. (2021), we report the computation time and GPU memory usage for training a ResNet-50 on the ImageNet dataset with Expected Gradients using 32 reference samples and with 𝒳-Gradient. We use a single Nvidia A100 SXM4 (40GB) GPU and a batch size of two. The number of reference samples corresponds to the number of reference samples determined in the experiment in Fig. 1(right), where both networks achieve the same ROC-AUC. When using Expected Gradients for training, the GPU memory usage is GB while for 𝒳-Gradient the memory usage is GB. The computation time per iteration, averaged over iterations, is s for Expected Gradients and 0.0086 s for 𝒳-Gradient. To conclude, in this scenario we observe a massive improvement in the efficiency of 𝒳-Gradient compared to Expected Gradients. Expected Gradients requires times more computation time and times more GPU memory.

Figure 3: Convergence of Integrated Gradients (Sundararajan et al., 2017). We plot the mean absolute difference between the Integrated Gradients obtained with 300 approximation steps and those obtained with different numbers of approximation steps for AlexNet on the ImageNet validation split. We find convergence to occur after approximately 128 steps.

B.4 Homogeneity of 𝒳-DNNs

For the experiment in Sec. 4.4, we use the same models as in Sec. 4.1 (see Sec. B.1 for details) and evaluate on the ImageNet validation split using a single Nvidia A100 SXM4 (40GB) GPU. Additional qualitative examples of the attributions for the output logit of the target class for a regular AlexNet and an 𝒳-AlexNet, as in Fig. 2 (right), are shown in Fig. 4. The additional results are consistent with our findings in Sec. 4.4. As in Fig. 2 (right) of the main paper, we observe that 𝒳-Gradient (𝒳G) equals Integrated Gradients (IG) for the 𝒳-AlexNet up to a small approximation error and that reducing the contrast of the images keeps the attribution unchanged up to a scaling factor (not visible due to normalization for display purposes). For the regular AlexNet, on the other hand, the attributions obtained from Input×Gradient and Integrated Gradients differ and change depending on the contrast.
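The scaling behavior described above can be checked on a toy model. The following is a minimal sketch, assuming a tiny bias-free ReLU network with made-up weights (not the AlexNet models of the experiment) and taking the attribution to be the input multiplied by the input gradient. It checks that scaling the input by a nonnegative factor scales both the output and the attribution by the same factor, and that the attributions sum to the output.

```python
# Tiny bias-free ReLU network: f(x) = W2 . relu(W1 x). All weights are made up.
W1 = [[1.0, -2.0, 0.5],
      [-1.5, 0.5, 2.0]]
W2 = [2.0, -1.0]


def forward(x):
    """Return the scalar output and the hidden pre-activations."""
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    out = sum(v * max(p, 0.0) for v, p in zip(W2, pre))
    return out, pre


def input_gradient(x):
    """Gradient of the output w.r.t. the input (manual backward pass)."""
    _, pre = forward(x)
    # ReLU passes the gradient only where the pre-activation is positive.
    gate = [v if p > 0.0 else 0.0 for v, p in zip(W2, pre)]
    return [sum(g * row[i] for g, row in zip(gate, W1)) for i in range(len(x))]


def x_gradient(x):
    """Input times gradient: one forward and one backward pass."""
    return [xi * gi for xi, gi in zip(x, input_gradient(x))]


x = [1.0, 0.5, 2.0]
alpha = 0.25  # hypothetical contrast-reduction factor
x_low = [alpha * xi for xi in x]

out, _ = forward(x)
out_low, _ = forward(x_low)
attr = x_gradient(x)
attr_low = x_gradient(x_low)

print(out, out_low)    # the second output is alpha times the first
print(attr, attr_low)  # the second attribution is alpha times the first
print(sum(attr))       # the attributions sum to the output
```

Adding a bias term to the hidden layer breaks both properties in general, which matches the contrast-dependent attributions reported for the regular AlexNet.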

B.5 Convergence of Integrated Gradients

For our experimental comparisons with Integrated Gradients (Sundararajan et al., 2017), we assume convergence of the method. To empirically find a suitable number of approximation steps, we analyze the mean absolute difference between the Integrated Gradients obtained with 300 approximation steps and those obtained with a varying number of approximation steps, as plotted in Fig. 3. We choose 300 steps for reference because Sundararajan et al. (2017) report 300 steps as the upper bound for convergence. We use the trained AlexNet model from Sec. 4.1 (details in Sec. B.1), the ImageNet validation split, and . We find 128 approximation steps to be sufficient and use this number in our experiments.
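The convergence analysis can be illustrated on a toy model. The sketch below is an assumption-laden example rather than the procedure used for AlexNet: it uses a tiny ReLU network with biases and made-up weights, a zero baseline, and a right Riemann sum to approximate the path integral. It then reports the mean absolute difference between the attribution at each step count and a 300-step reference.

```python
# Tiny ReLU network with biases: f(x) = W2 . relu(W1 x + B1). Weights are made up.
W1 = [[1.0, -2.0, 0.5],
      [-1.5, 0.5, 2.0]]
B1 = [-0.3141, 0.3]
W2 = [2.0, -1.0]


def gradient(x):
    """Gradient of the scalar output w.r.t. the input."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, B1)]
    gate = [v if p > 0.0 else 0.0 for v, p in zip(W2, pre)]
    return [sum(g * row[i] for g, row in zip(gate, W1)) for i in range(len(x))]


def integrated_gradients(x, steps, baseline=None):
    """Right Riemann-sum approximation of Integrated Gradients."""
    if baseline is None:
        baseline = [0.0] * len(x)  # zero baseline is an assumption of this sketch
    total = [0.0] * len(x)
    for k in range(1, steps + 1):
        t = k / steps
        point = [b + t * (xi - b) for xi, b in zip(x, baseline)]
        total = [s + g for s, g in zip(total, gradient(point))]
    return [(xi - b) * s / steps for xi, b, s in zip(x, baseline, total)]


def mean_abs_diff(a, b):
    """Mean absolute difference between two attribution vectors."""
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)


x = [1.0, 0.5, 2.0]
reference = integrated_gradients(x, 300)
errors = {n: mean_abs_diff(integrated_gradients(x, n), reference)
          for n in (1, 2, 4, 8, 16, 32, 64, 128)}

for n, err in errors.items():
    print(n, err)
```

In this toy setting the difference to the reference is large for a single step and small at 128 steps. For an image model, the same quantity would be averaged over the pixels and over the samples of the validation split.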

Figure 4: Qualitative examples of normalized attributions for the output logit of the target class for 𝒳-AlexNet and AlexNet using the attribution methods Input×Gradient (I×G), 𝒳-Gradient (𝒳G), and Integrated Gradients (IG).
