Deep Neural Networks (DNNs) have achieved great success in many real-world applications. Their enormous parameter space enhances their function approximation ability and hence improves model performance. However, it also compromises model transparency, making model decisions difficult to interpret. Concerns about the interpretability of DNNs have hampered their wider adoption, especially in high-stakes applications such as autonomous driving and AI healthcare. Hence, developing model interpretation methods to promote trustworthy AI is extremely important and has drawn increasing attention recently.
Attribution methods have become an effective computational tool in understanding the behavior of machine learning models, especially Deep Neural Networks (DNNs) [1, 2, 3]. They uncover how machine learning models make a decision by calculating the contribution score of each input feature to the final decision. For example, in image classification, the attribution methods infer the contribution of each pixel to the predicted label for a pre-trained model, and usually create heatmaps to visualize the contributions.
Although several attribution methods have been proposed recently, the attribution problem is actually not well-defined. Sundararajan roughly defines “the attribution of input $x$ relative to a baseline point $\tilde{x}$” as a vector $a = (a_1, \dots, a_n)$, where $a_i$ denotes the contribution of feature $x_i$ to the prediction. Such a description is uninformative for understanding the attribution problem, since it lacks a concrete guide to the logic of the contribution assignment process. Moreover, existing attribution methods are based on different heuristics and have very limited theoretical understanding and support. For instance, Layer-wise Relevance Propagation (LRP) evaluates the contribution of each neuron at the lower layer to a non-linear neuron at the upper layer based on the lower neuron’s share of the linear combination value. In addition, the saliency map of smooth gradients achieves significantly better performance than the plain gradient just by averaging the gradients of neighboring samples. The rationales behind these methods are perplexing.
Hence, it is highly desirable not only to deepen the understanding of the attribution problem, but also to conduct a comprehensive investigation of these various heuristic attribution methods. Specifically, the following important questions about attribution methods need theoretical investigation: Rationale: what model behaviors do these attribution methods actually reveal; Fidelity: how much of the decision-making process can be attributed by these methods; Limitations: where these attribution methods may fail.
While some attempts have been made to partially answer these questions by unifying several attribution methods as additive feature attribution, as multiplying a modified gradient with the input, or as first-order Taylor expansion, the problems are still not well addressed due to two challenges. The first challenge (Ch1) is that, to our knowledge, none of these works offers a good description of the attribution problem. The second challenge (Ch2) stems from the fact that it is very difficult to propose a general framework that unifies most existing attribution methods, because these methods are based on various heuristics.
In this paper, we address the aforementioned problems by proposing a general Taylor attribution framework, which not only offers a good description of the attribution problem (section 3), but also unifies fourteen mainstream attribution methods into the framework by Taylor reformulations (section 4). The basic idea behind the proposed framework is to attribute a Taylor approximation function of DNNs, instead of the DNNs themselves. The power of this framework is built on three foundations: (1) It is based on Taylor expansion, hence it is able to sufficiently approximate the behavior of black-box DNNs, with a theoretical guarantee on the approximation error; (2) The Taylor expansion function is polynomial, which makes attribution analysis easier and more intuitive. Hence the attribution of an input relative to a baseline point can be modeled as deciding individual payoffs in a coalition, as shown in Figure 1. Specifically, the attribution problem can be considered as studying how to assign contributions from Taylor independent terms, which involve a single feature, and Taylor interactive terms, which involve a coalition of several features. We will elaborate on this in section 3. (3) Taylor expansion is a natural tool to decompose the output difference between the input and the baseline point into the sum of the input features’ effects. Meanwhile, most attribution methods analyze or decompose the output difference and can be represented as a function of that difference. Therefore, it is feasible to reformulate the attributions from these methods into the Taylor attribution framework.
According to the unified Taylor reformulations, we reveal rationales, measure fidelity, and point out limitations for the existing attribution methods in a systematic and theoretical way. During the analysis, we categorize existing attribution methods into four types according to three factors: i) whether they consider feature interactions; ii) whether they use a single baseline or multiple baselines; iii) whether positive and negative effects are separated. Moreover, we establish and advocate three principles for a good Taylor attribution: low approximation error, correct Taylor contribution assignment, and unbiased baseline selection.
Finally, we empirically validate the proposed Taylor reformulations by comparing the attribution results obtained by the original attribution methods with those of their Taylor reformulations. The experimental results on MNIST show that the two sets of attribution results are almost consistent. We also reveal a positive correlation between the attribution performance and the number of principles followed by the attribution method via benchmarking on MNIST and ImageNet. In summary, this paper makes the following main contributions:
We propose a general Taylor attribution framework, which offers a good description to the attribution problem. The framework provides an insight into the logic of contribution assignment process.
Fourteen mainstream attribution methods are unified into the proposed framework by theoretical reformulations.
Based on unified Taylor reformulations, we revisit existing attribution methods in terms of their rationale, fidelity, and limitations. We also accordingly establish three principles for a good attribution.
We empirically validate the Taylor reformulations, and reveal the relationship between attribution performance and the three principles on MNIST and ImageNet.
2 Related work
In this section, we first provide an overview of existing attribution methods. Then, we review in detail the related work on understanding and unifying these methods.
2.1 Existing attribution methods
Attribution is an effective computational tool in locally interpreting the behavior of machine learning models, especially DNNs. Recently, there are a number of attribution methods developed to infer the contribution score of each input feature to the final prediction for a given input sample. Saliency maps are usually created to visualize the contribution score. We roughly categorize these attribution methods into local attribution explanation approach and global attribution explanation approach.
Local attribution explanation approach focuses on the sensitivity of the output neuron w.r.t. each input neuron, i.e., how the output of the network changes under infinitesimally small perturbations around the original input. Gradient calculates this sensitivity; at each ReLU layer it masks out the positions where the bottom (forward) activations are negative, i.e., it applies the forward ReLU. To improve the saliency map quality, smooth gradients produces an attribution vector by averaging the gradients of neighboring samples, which are generated by adding Gaussian noise to the original sample. Deconvnet aims to map the output neuron back to the input pixel space. To preserve the neurons’ size and non-negativity, it resorts to transposed filters and the backward ReLU, which masks out the negative entries of the top gradients. Guided Back-propagation (GBP) combines Gradient and Deconvnet, applying both the forward ReLU and the backward ReLU. As a result, GBP can significantly improve the visual quality of the visualizations.
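The noise-averaging idea behind smooth gradients can be illustrated with a minimal sketch. The toy gradient `toy_grad`, the noise scale `sigma`, and the sample count are illustrative assumptions, not values taken from the paper; a real use would replace `toy_grad` with the input gradient of a trained network.

```python
import numpy as np

def toy_grad(x):
    # Gradient of the toy model f(x) = sum(x_i ** 3); a hypothetical stand-in
    # for the input gradient of a trained network.
    return 3.0 * x ** 2

def smooth_grad(grad_fn, x, sigma=0.1, n_samples=200, seed=0):
    # Average the gradient over noisy copies of x drawn from N(x, sigma^2 I).
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n_samples,) + x.shape)
    return np.mean([grad_fn(x + eps) for eps in noise], axis=0)

x = np.array([1.0, -2.0, 0.5])
plain = toy_grad(x)                  # ordinary gradient at x
smoothed = smooth_grad(toy_grad, x)  # noise-averaged gradient
```

With `sigma=0.0` the procedure reduces to the plain gradient, which makes the role of the neighborhood size explicit.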
Global attribution explanation approach directly analyzes or decomposes the output difference between the input and the selected baseline. Gradient*Input calculates the attributions by multiplying the gradient with the original input, to improve the sharpness of saliency maps. Grad-CAM focuses on interpreting the classification module of convolutional neural networks (CNNs). It captures the importance of each feature channel at the top convolutional layer by applying global average pooling to the gradients of the output neuron w.r.t. each feature map. Grad-CAM then obtains a coarse attribution by multiplying these importance weights with the feature maps. Occlusion-1 and Occlusion-patch observe the changes in the output induced by occluding each input pixel or patch. The Layer-wise Relevance Propagation $\epsilon$ rule ($\epsilon$-LRP) decomposes the value of the output neuron in a layer-wise manner. Specifically, it recursively decomposes the relevance score of a neuron at the upper layer onto the neurons at the lower layer, according to their proportions in the linear combination. The DeepLIFT Rescale rule adopts a linear rule similar to $\epsilon$-LRP, but it assigns the difference between the output and a baseline output in terms of the difference between the input and a pre-set baseline input, instead of merely considering the output value. Integrated Gradients, which corresponds to the Aumann-Shapley method, decomposes the output difference by integrating the gradients along a straight path interpolating between the baseline and the input sample. In addition, Shapley value has become a popular attribution method, which calculates the average marginal contribution of each feature across all possible feature subsets. Shapley value is characterized by a collection of desirable properties, e.g., local accuracy, missingness, and consistency.
Moreover, some variants of the above global explanation methods have been proposed recently. Generally, they adopt two strategies to improve the attribution results: i) disentangling the contributions from positive and negative terms. For example, DeepLIFT RevealCancel separately considers the overall marginal impact of the positive terms and the negative terms. The Layer-wise Relevance Propagation $\alpha\beta$ rule (LRP-$\alpha\beta$) assigns $\alpha$ times the overall effect to the positive terms and $\beta$ times to the negative terms, where $\alpha$ and $\beta$ are constrained so that completeness holds. Deep Taylor has been shown to be a special case of LRP-$\alpha\beta$ when $\alpha = 1$ and $\beta = 0$. ii) averaging over multiple baselines to reduce the probability that the attribution is dominated by a specific baseline. Such a strategy can be integrated into most attribution methods. The corresponding versions of Integrated Gradients (Expected Gradients), DeepLIFT (Expected DeepLIFT), and Shapley value (Deep Shapley) have been shown to significantly improve the interpretation performance.
In this paper, we mainly focus on the global attribution explanation approach, because they usually analyze or decompose the output difference between input and baseline and can be represented as a function of such difference. Therefore it’s natural to reformulate these methods into the proposed Taylor attribution framework.
2.2 Understanding and unifying attribution methods
There are a few works on understanding the theoretical grounding of attribution methods that are often designed heuristically. $\epsilon$-LRP and LRP-$\alpha\beta$ are reformulated as a first-order Taylor decomposition. Moreover, Deconvnet and Guided BP have been theoretically proved to essentially construct a (partial) recovery of the input, which is unrelated to decision making.
Some efforts have been devoted to unifying existing attribution methods. LIME, LRP, DeepLIFT, and Shapley value are unified under the framework of additive feature attribution. Some gradient-based attribution methods, including Gradient*Input, $\epsilon$-LRP, DeepLIFT, and Integrated Gradients, are unified as multiplying a modified gradient with the input, and several equivalence conditions are given. In addition, several methods are summarized as first-order Taylor decomposition at different baseline points.
To our knowledge, this is the first work to leverage high-order Taylor decomposition and interactive effects to formally define the attribution problem and unify the majority of existing attribution methods.
|Input difference, defined as|
|Input difference of feature|
|The prediction of input|
|-order Taylor expansion function of|
|Taylor approximation error of at point|
|Hessian matrix at|
|Hessian independent and interactive matrix|
|Taylor first, second, and high-order terms|
|Taylor second, high-order independent terms|
|Taylor second, high-order interactive terms|
|Interactive terms among features in|
|Interactive terms between features in and|
|Attribution of feature|
|Attribution of feature from|
|Attribution of feature from|
3 Taylor Attribution Framework
In this section, we propose a Taylor attribution framework to deepen the intrinsic understanding of the attribution problem. Given a pre-trained DNN model $f$ and an input sample $x = (x_1, \dots, x_n)$, the attribution problem aims to infer the contribution of each feature $x_i$ to the prediction $f(x)$. Existing attribution methods usually select a baseline point $\tilde{x}$ to represent a reference state; the output difference between the input and the baseline point, $f(x) - f(\tilde{x})$, can then be considered as the influence caused by the input difference $\Delta = x - \tilde{x}$. Hence, attribution can be seen as the process of assigning the output difference $f(x) - f(\tilde{x})$ to each feature $x_i$ according to its input difference $\Delta_i$. However, there are infinitely many ways to decompose a scalar into an $n$-dimensional vector. No work has provided guidance on the concrete logic of the contribution assignment process, i.e., which assignment is logical and reasonable.
To offer a good description of the attribution problem, we resort to Taylor decomposition theory and propose a general Taylor attribution framework. Specifically, the basic idea is to conduct the attribution for the Taylor approximation function of the DNN model, instead of directly attributing the DNN itself. The idea is feasible for two reasons. First, Taylor expansion can sufficiently approximate the DNN model, so that the two attributions are approximately equivalent. Second, the Taylor expansion function is polynomial, which makes it easier to analyze intuitively how to assign the contributions.
Assume $f$ is differentiable. The Taylor expansion of $f(\tilde{x})$ expanded at the input sample $x$ can be written as

$$f(x) - f(\tilde{x}) = g_K(x) + \epsilon(\tilde{x}, K),$$

where $g_K(x)$ is the $K$-order Taylor expansion function of $f$, $f(\tilde{x})$ is the function value of $f$ at the baseline $\tilde{x}$, and $\epsilon(\tilde{x}, K)$ is the approximation error between $f(x) - f(\tilde{x})$ and $g_K(x)$ at point $\tilde{x}$. (Note that although a deep ReLU network is not differentiable everywhere, so that Taylor expansion is not strictly applicable, networks with softplus activation, a smooth approximation of ReLU, can be used to provide insight into the rationale behind ReLU networks.) The left side of the equation, $f(x) - f(\tilde{x})$, represents the output difference, which can be considered as the influence of the input difference $\Delta = x - \tilde{x}$. We need to answer how to decompose this effect according to the input difference $\Delta_i$ of each feature. It is difficult to decompose the effect directly due to the complexity of $f$. As $g_K(x)$ is an approximation of the output difference, we instead decompose $g_K(x)$ into an attribution vector $a = (a_1, \dots, a_n)$, where $a_i$ denotes the attribution score of feature $x_i$ to $f(x)$.
3.1 First-order Taylor attribution
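The first-order Taylor expansion function is

$$g_1(x) = \sum_{i=1}^{n} f_{x_i}\,\Delta_i,$$

where $f_{x_i}$ denotes the derivative of $f$ with respect to $x_i$, evaluated at $x$, and $\Delta_i = x_i - \tilde{x}_i$ is the input difference of feature $x_i$. The linear approximation function in the first-order Taylor expansion, $g_1(x)$, is additive across features and can be easily decomposed. It is obvious that $f_{x_i}\Delta_i$ quantifies the contribution of feature $x_i$, i.e.,

$$a_i = f_{x_i}\,\Delta_i.$$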
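As a minimal numerical sketch of first-order attribution, assuming a toy linear model in place of a DNN (the functions `f` and `grad_f` and all numbers are hypothetical), each feature receives its gradient times its input difference:

```python
import numpy as np

def f(x):
    # Toy differentiable "model"; a hypothetical stand-in for a DNN output.
    return 2.0 * x[0] + 3.0 * x[1] - x[2]

def grad_f(x):
    # Analytic gradient of the toy model (constant because f is linear).
    return np.array([2.0, 3.0, -1.0])

def first_order_attribution(grad_fn, x, baseline):
    # a_i = (df/dx_i evaluated at x) * (x_i - baseline_i)
    return grad_fn(x) * (x - baseline)

x = np.array([1.0, 2.0, 4.0])
baseline = np.zeros(3)
attr = first_order_attribution(grad_f, x, baseline)
```

Because this toy model is linear, the attributions sum exactly to the output difference; for a non-linear model the gap between the two is the first-order approximation error.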
3.2 Second-order Taylor attribution
The second-order Taylor expansion has a smaller approximation error than the first-order one, so it is expected to be more faithful to the model. The second-order Taylor expansion function is given by
where is the Hessian matrix, i.e., second-order partial derivative matrix, of at . We denote the first-order and second-order Taylor terms as and , respectively.
The second-order Taylor expansion function is indistinct in determining feature contributions compared with first-order one due to the Hessian matrix. To make the attribution more clear, we decompose into two matrices, an independent matrix and an interactive matrix . Here is a diagonal matrix composed of the diagonal elements in , which describes the second-order isolated effect of features, and represents the interactive effect between features. could be rewritten as the sum of first order terms , second-order independent terms , and second-order interactive terms ,
Accordingly, the attribution of to should be
where and represent the assigned contributions from , and , respectively. The contributions from independent terms and can be clearly identified as
where and denote the first-order terms and second-order independent terms of feature , respectively.
The difficulty lies on how to assign the contribution from interactive terms . We propose to handle it by following an intuition behind: the attribution is from and should be the sum of assignments from each interactive effect involving feature ,
where denotes the second-order interactive terms corresponding to feature and , weight characterizes the assignment of the interactive terms to , and is the attribution from .
The determination of the assignment weight is complicated and depends on specific case. However, it’s considered that the interactive terms of two features should be only attributed to these two features. Consider the interactive terms between and , the assignment should satisfy , i.e., . For example, as shown in the Taylor reformulation in section 4, Integrated Gradients assigns the interactive terms according to the order of features. Because the order of and in second-order interactive terms (i.e., ) are both 1, so the term are equally assigned to and . That is, in Integrated Gradients.
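The second-order decomposition with an equal split of each pairwise interactive term can be sketched on a toy quadratic model, for which the second-order expansion is exact. The matrix `Q`, the vector `b`, and the equal-split weights are illustrative assumptions; since the expansion is taken at the input, the second-order terms carry a factor of -1/2.

```python
import numpy as np

# Toy quadratic model f(x) = 0.5 * x^T Q x + b^T x (hypothetical stand-in for a DNN).
Q = np.array([[2.0, 1.0, 0.0],
              [1.0, 4.0, -3.0],
              [0.0, -3.0, 6.0]])
b = np.array([1.0, -2.0, 0.5])

def f(x):
    return 0.5 * x @ Q @ x + b @ x

def second_order_attribution(x, baseline):
    delta = x - baseline
    grad = Q @ x + b                   # gradient evaluated at the input x
    H = Q                              # Hessian (constant for a quadratic)
    H_ind = np.diag(np.diag(H))        # independent part: diagonal entries
    H_int = H - H_ind                  # interactive part: off-diagonal entries
    first = grad * delta               # first-order terms
    second_ind = -0.5 * np.diag(H_ind) * delta ** 2
    # Each pairwise term -H_ij * delta_i * delta_j is split equally between i and j.
    second_int = -0.5 * delta * (H_int @ delta)
    return first, second_ind, second_int

x = np.array([1.0, 2.0, -1.0])
baseline = np.zeros(3)
first, second_ind, second_int = second_order_attribution(x, baseline)
attr = first + second_ind + second_int
```

Any other choice of weights that sum to one per pair would preserve the total as well; only the split between the two interacting features would change.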
3.3 High-order Taylor attribution
The analysis on second-order expansion can be naturally extended to high-order expansion where . Let denote all high-order expansion terms, including second-order expansion terms. The high-order Taylor expansion function at is
where and denote high-order independent and interactive terms, respectively.
Analogously to the second-order case, the attribution of feature in high-order expansion is given by
where , represent the assigned contributions from and , respectively. The attribution from first-order term and high-order independent term is clear,
where represent the high-order independent terms of feature . The attribution from interactive terms, , consists of all assignments from interactive terms involving ,
where denotes the attribution from interactive terms corresponding to features in the feature subset . Note that the interactive terms should be only assigned to the features in the subset , i.e.,
Based on the analysis, we give a definition for how to correctly assign Taylor contribution.
A Taylor attribution has a correct Taylor contribution assignment if the attribution is given by,
and the assignment from interactive terms satisfies,
In brief, Eq. 1 indicates that the Taylor first-order and high-order independent terms of feature $x_i$ should be assigned to $x_i$, and that part of the Taylor interactive terms involving feature $x_i$ should be allocated to $x_i$. Eq. 2 requires that the interactive terms of features in subset A should be attributed to, and only to, the features in subset A. It is worth noting that the high-order terms can be omitted if the first-order Taylor expansion approximates the model sufficiently well.
3.4 The selection of baseline point
From the Taylor attribution framework, the attribution of feature could be seen as a polynomial function of (i.e., ), and hence it highly depends on . Given a constant vector baseline as many attribution methods did, the attribution of feature whose value is far from may be overestimated due to a large , while the attribution of feature whose value is close to may be underestimated even if it is important to the decision making process. Such different attributions are a bias in many tasks. For example, in image classification, it’s unreasonable to attribute according to the value of features (i.e., pixel values). Specifically, given a black image as baseline, pixels in white color have a large close to , while pixels in black color have a small close to . Correspondingly, the attribution methods will biasedly highlight white pixels while neglecting black pixels even if black pixels make up the object of interest. Hence the selection of baseline point plays a significant role.
The baseline point is used to represent the “absence” of a feature, by which the attribution methods calculate how much the output of the model would decrease in the absence of that feature. Hence, it is expected that the output at the baseline point decreases significantly. Moreover, to avoid incorporating the aforementioned bias into the attribution process, attribution methods should choose an unbiased baseline, for which there are no large disparities among the input differences of different features. That is, the input differences of any two feature dimensions should be similar. One option is to set the input difference to a constant vector and use the correspondingly shifted input as the baseline. Such baselines indeed solve the bias issue; however, they usually do not make a difference to the output of the model. Another alternative is to sample the input differences of different features from a common distribution. In addition, the biases can be further neutralized by averaging over multiple baselines whose input differences are sampled from such distributions. This strategy reduces the probability that the attribution is dominated by a specific baseline, which is prone to be biased. This may explain why SmoothGrad and Expected Gradients succeed with a small Gaussian variance level but fail with a large variance.
4 Unified Taylor Reformulations
The proposed Taylor attribution framework is very general, and it can unify attribution methods based on the analysis of output difference. These attribution methods assign/decompose the output difference between input and baseline point to each input feature. In these attribution methods, the attribution is performed by a function of output difference. Moreover, such output difference can be approximately represented as the sum of Taylor terms by Taylor decomposition. Therefore, the attribution can be unified into our framework, i.e., the attribution could be reformulated as a function of the Taylor terms.
In this section, we unify fourteen mainstream attribution methods into the proposed Taylor framework by Taylor reformulations; all proofs of the theorems are in the Appendix. This section is organized as follows. First, we discuss eight basic attribution methods: Gradient*Input, Grad-CAM, Occlusion-1, Occlusion-patch, Integrated Gradients, DeepLIFT Rescale, $\epsilon$-LRP, and Shapley value. According to whether the method considers feature interactions, we categorize them into two types. Second, we study the variants that disentangle the contributions from positive and negative terms. Specifically, this part covers the variant of DeepLIFT Rescale (DeepLIFT RevealCancel) and the variants of $\epsilon$-LRP (LRP-$\alpha\beta$ and Deep Taylor). Third, we reformulate the variants that average over multiple baselines, namely Expected Gradients, Expected DeepLIFT, and Deep Shapley.
4.1 Without feature interaction
In this subsection, we demonstrate that after Taylor reformulations, the following five attribution methods don’t consider feature interactive terms (completely).
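Gradient*Input can be reformulated as a first-order Taylor attribution w.r.t. the zero baseline point $\tilde{x} = 0$,

$$a_i = f_{x_i}\,x_i,$$

where $f_{x_i}$ denotes the derivative of $f$ with respect to $x_i$ evaluated at the input $x$.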
$\epsilon$-LRP proceeds in a layer-wise back-propagation fashion. Use and to denote the neuron at -th layer and the neuron at -th layer, respectively, and . Here is the weight parameter, is the additive bias, and is a non-linear activation function. $\epsilon$-LRP recursively decomposes the relevance score of -th neuron at -layer to the neurons at -th layer, according to the proportion of weighted impacts. Then the attribution of -th neuron from -th neuron at -layer is,
where = is the weighted impact of to . is the total relevance score of neuron and will be assigned to features at -th layer. Here is a small quantity added to the denominator to avoid numerical instabilities.
When $\epsilon$-LRP is applied to a network with ReLU activation, the attribution of is equivalent to the attribution in Gradient*Input, i.e.,
Grad-CAM focuses on interpreting the classification module of CNNs and takes the feature maps of the top convolutional layer to calculate the attribution scores. Here is the number of channels, and represent the width and height of these feature maps. Specifically, Grad-CAM first captures the importance of each feature map by applying a global average pooling (GAP) operation to the gradient of the target output neuron w.r.t. the feature map. For -th feature map, the importance is calculated by
where is the intermediate feature at location at -th feature map. Then Grad-CAM can approximately decompose as a weighted combination between the importance weights and these feature maps, i.e.,
We obtain that Grad-CAM assigns contribution to the -th feature map, where is expressed as:
To investigate the contribution of the features at different locations, the right side of the Equation 4 can be rewritten as . Then Grad-CAM can correspondingly assign contribution to the feature at location, where is expressed as:
Define to be the -th GAP feature map, and the model can be expressed as a function of GAP feature , i.e., . Then we have Theorem 3.
The Eq. 5 in Grad-CAM can be reformulated as a first-order Taylor attribution of function w.r.t the baseline point ,
Specifically, in Grad-CAM, the attribution of -th feature map (Eq. 5) is reformulated as .
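The two Grad-CAM steps described above (channel importance by global average pooling of gradients, then a weighted combination of feature maps) can be sketched as follows. The array shapes and toy values are assumptions for illustration, and the final ReLU used in the original Grad-CAM visualization is deliberately omitted here.

```python
import numpy as np

def grad_cam_map(feature_maps, grads):
    # feature_maps, grads: arrays of shape (K, H, W), where grads holds the
    # gradient of the target output w.r.t. each feature-map entry.
    weights = grads.mean(axis=(1, 2))                   # GAP over spatial positions
    cam = np.tensordot(weights, feature_maps, axes=1)   # weighted sum over channels
    return weights, cam

# Toy inputs: 2 channels of 2x2 feature maps (hypothetical values).
feature_maps = np.arange(8, dtype=float).reshape(2, 2, 2)
grads = np.stack([np.ones((2, 2)), 2.0 * np.ones((2, 2))])
weights, cam = grad_cam_map(feature_maps, grads)
```

Summing `cam` over spatial positions gives the same total as summing the per-channel contributions, which is why the attribution can be read either per feature map or per location.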
Occlusion-1 measures how much the prediction changes when feature $x_i$ is occluded with a zero baseline. Denoting the occluded input by $x_{\setminus i}$ (i.e., $x$ with $x_i$ set to zero), the attribution of feature $x_i$ is defined as the output difference $f(x) - f(x_{\setminus i})$.
The attribution of in Occlusion-1 can be reformulated as the sum of first-order and high-order independent terms of at baseline point ,
The attribution of in Occlusion-1 is in the second-order Taylor attribution.
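A minimal sketch of Occlusion-1 on a toy polynomial model may help make the procedure concrete. The model `f` and the input values are hypothetical; the zero occlusion value follows the zero baseline described above.

```python
import numpy as np

def occlusion_1(f, x):
    # a_i = f(x) - f(x with feature i set to zero)
    attributions = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        occluded = x.copy()
        occluded[i] = 0.0
        attributions[i] = f(x) - f(occluded)
    return attributions

def f(x):
    # Toy model: one single-feature term per input plus one product term.
    return x[0] ** 2 + 3.0 * x[1] + x[0] * x[1]

x = np.array([2.0, 1.0])
attr = occlusion_1(f, x)   # array([6., 5.])
```

In this toy case the product term `x[0] * x[1]` enters both scores, so they sum to 11 while `f(x) - f(0)` is 9; Occlusion-1 attributions therefore need not add up to the output difference relative to the all-zero input.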
The attribution in Occlusion-patch is similar to Occlusion-1 but is conducted at the patch level. It constructs a zero-patch baseline by occluding an image patch, and defines the resulting output difference as the attribution of the features in that patch.
The attribution of in Occlusion-patch can be reformulated as the sum of first-order, high-order independent terms of features in patch , and all high-order interactive terms involving the features in patch ,
Particularly, is in second-order setting.
4.2 With feature interaction
In this subsection, we study four attribution methods considering feature interactions.
4.2.1 Integrated Gradients
The attribution in Integrated Gradients  integrates the gradients along the straight line path from a baseline point to an input . The points along the path are denoted as = . The attribution of feature is computed by
The attribution of in Integrated Gradients can be reformulated as the sum of first-order term of , high-order independent terms of , and an assignment from high-order interactive terms involving at baseline ,
where = is the assignment, and = is the Taylor expansion coefficient of .
In brief, Integrated Gradients allocates proportion of the high-order interactive term to .
Give a concrete example. Assume , and let and . Obviously, the independent terms , , and should be clearly assigned to , and respectively. With respect to the interactive terms, the assignment of Integrated Gradients is based on the order of features, which allocates proportion of term to feature . Hence, it assigns to feature , assigns to feature , and assigns to feature .
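The order-based assignment can also be checked numerically. The sketch below approximates Integrated Gradients with a midpoint Riemann sum on a hypothetical toy function $f(x) = x_1 x_2^2$ with a zero baseline; the function, the input values, and the step count are illustrative assumptions.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=1000):
    # Midpoint Riemann sum of the gradient along the straight path baseline -> x.
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)
    avg_grad = np.mean([grad_fn(p) for p in path], axis=0)
    return (x - baseline) * avg_grad

def f(x):
    # Toy model consisting of a single interactive term x1 * x2^2.
    return x[0] * x[1] ** 2

def grad_f(x):
    return np.array([x[1] ** 2, 2.0 * x[0] * x[1]])

x = np.array([3.0, 2.0])
baseline = np.zeros(2)
ig = integrated_gradients(grad_f, x, baseline)   # approximately [4., 8.]
```

The term has order 1 in the first feature and order 2 in the second, and the output difference of 12 is split roughly 1/3 to 2/3, in line with an assignment proportional to each feature's order.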
4.2.2 DeepLIFT Rescale
Similar to $\epsilon$-LRP, DeepLIFT Rescale also computes relevance scores by layer-wise relevance propagation. However, instead of merely considering the output value, DeepLIFT propagates the output difference between the input and the baseline to the input layer. Specifically, at -th layer, it calculates the relevance score of to , denoted as , by
where = is the weighted impact of to , analogously = denotes the weighted impact of the baseline, and = denotes the total relevance score of .
The attribution of in DeepLIFT Rescale at -th layer is equivalent to the attribution in Integrated Gradients.
Hence DeepLIFT Rescale can be considered as a layer-wise Integrated Gradients.
4.2.3 Shapley value
Shapley value is a classical solution concept in cooperative game theory, which aims to assign an importance score to each player (feature) in a cooperative game (model) involving a coalition of players (features). According to Shapley value, given a cooperative game , the amount that player contributes is,
Here is the set of all players, and traverses all subsets of . Eq. 9 can be interpreted as averaging the marginal contribution of to coalition over all possible coalitions (involving ). When applied to interpreting DNNs, is often obtained by calculating the output when setting the values of complementary set as .
The attribution of in Shapley value can be reformulated as the sum of first-order term of , high-order independent terms of , and proportion of the interactive terms among features in .
In other words, Shapley value distributes the interactive terms among the features in a set $S$ evenly, i.e., the proportion assigned to each feature is $1/|S|$.
For example, let . The interactive terms involving two features are and , respectively. The terms involving three features are . Hence, Shapley value assigns to feature , and assigns to feature .
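The even split can be verified with an exact Shapley computation on a small toy model. The model `f`, the input, and the zero baseline used to represent absent features are illustrative assumptions; the enumeration is exponential in the number of features and is only meant for tiny examples.

```python
from itertools import combinations
from math import factorial

import numpy as np

def shapley_values(f, x, baseline):
    # Exact Shapley values; absent features take their baseline values.
    n = len(x)

    def value(subset):
        z = baseline.copy()
        idx = np.array(subset, dtype=int)
        z[idx] = x[idx]
        return f(z)

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for S in combinations(others, size):
                phi[i] += weight * (value(S + (i,)) - value(S))
    return phi

def f(x):
    # Toy model with a two-feature and a three-feature interactive term.
    return x[0] * x[1] + x[0] * x[1] * x[2]

x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
phi = shapley_values(f, x, baseline)   # array([3., 3., 2.])
```

Here the two-feature term equals 2 and is split into halves, while the three-feature term equals 6 and is split into thirds, giving 1 + 2, 1 + 2, and 2 for the three features.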
4.3 Separating positive and negative contributions
Shrikumar has shown that positive and negative impacts may cancel out during the attribution process and hence may yield misleading interpretations. To alleviate this issue, some variants, including DeepLIFT RevealCancel, Deep Taylor, and LRP-$\alpha\beta$, have been proposed to treat the positive and negative impacts separately.
The basic idea is to firstly decompose the output difference into positive components arising from positive input differences and negative components arising from negative input differences. Then is allocated to the neurons with positive input differences, and proceeds analogously. We give a unified formulation for the three attribution methods.
Similar to DeepLIFT Rescale, the three attribution methods also proceed in a recursive back-propagation manner. Hence we adopt the same set of symbols as in DeepLIFT Rescale. We focus on the propagation from the target neuron at -layer to the input neurons at -th layer. For the sake of simplicity, we omit the superscript of layer index (i.e., and ) and subscript of target neuron index (i.e., ). In addition, we rewrite all input neurons as , and denote the target output neuron as . We represent the corresponding weighted impacts as . Hence we have . These three methods first decompose the input differences into positive and negative parts, i.e.,
where and . Specifically, for each input neuron , we define . Moreover, we denote and as the feature subset with positive and the feature subset with negative , respectively.
These attribution methods decompose the output difference into the positive component and the negative component , which satisfies . For the features in , the positive attribution of feature from -th neuron at -th layer is obtained by,
Similarly, for the features in , the negative attribution of from