Attribution methods have become an effective computational tool for understanding the behavior of machine learning models, especially Deep Neural Networks (DNNs) Du et al. (2019). They uncover how DNNs make a prediction by calculating the contribution score of each input feature to the final prediction. For example, in image classification, attribution methods aim to infer the contribution of each pixel to the prediction of a pre-trained model. Recently, several attribution methods Bach et al. (2015); Montavon et al. (2017); Shrikumar et al. (2017, 2016); Smilkov et al. (2017); Sundararajan et al. (2017); Zeiler and Fergus (2014) have been proposed to create saliency maps with visually pleasing results. However, these attribution methods are designed from different heuristics, without a deep theoretical understanding. Although related work has indicated some unification of their formulations Lundberg and Lee (2017); Marco Ancona and Cengiz Öztireli (2015), their underlying rationales, fidelity, and limitations still lack further investigation. In other words, the following questions are rarely explored and need theoretical investigation: i) what model behaviors do these attribution methods actually reveal (rationale); ii) how much of the decision-making process can be reflected (fidelity); iii) what are their limitations.
Answering those questions is difficult mainly due to the following two challenges. The first challenge (C1) is that an objective and formal definition of fidelity for a decision-making process is missing, especially for DNN models, due to their black-box nature. The existing definitions of fidelity Fong and Vedaldi (2017); Karl Schulz and Federico Tombari (2020); Samek et al. (2016); Selvaraju et al. (2017) are subjective. For instance, object localization performance Fong and Vedaldi (2017), a popular evaluation metric, evaluates the consistency between explanations and human cognition, while human cognition may diverge from the underlying model behavior. The second challenge (C2) is the lack of a unified theoretical tool that connects attribution methods and model behaviors, so as to reveal the underlying rationales and corresponding limitations of these attribution methods. The existing attribution methods are mostly based on heuristics, and there is little knowledge about the model behaviors they convey.
To address the aforementioned two challenges, we propose a theoretical Taylor attribution framework, based on Taylor expansion, to characterize the fidelity of the interpretation and assess existing attribution methods. To address C1, we give a theoretical definition of the fidelity in two steps. First, DNNs are difficult to explain directly because they compose a large number of functions. Instead of explaining DNNs directly, we explain a sufficiently accurate approximation function of DNNs that is much more understandable to humans (e.g., from the polynomial function family) Ribeiro et al. (2016). We adopt the Taylor polynomial expansion function as it has a theoretical guarantee on the approximation error. The fidelity of the explanation of a DNN is equivalent to the fidelity of the explanation of its Taylor approximation function when the approximation error is sufficiently small. Second, to explain the Taylor approximation function, the Taylor attribution framework decomposes the function into first-order, context-independent high-order, and context-aware high-order model behaviors. Three desired properties are introduced for a faithful attribution method of DNNs. The framework reveals that feature importance cannot be assessed in isolation and that the contributions of feature interactions should not be neglected.
To address C2, we investigate the relationship between the proposed Taylor attribution framework and existing attribution methods. It is computationally infeasible to analyze the attribution methods directly with the Taylor attribution framework, due to the high complexity of computing partial derivatives. Instead, these attribution methods can be reformulated into a unified Taylor attribution framework. The Taylor reformulations uncover theoretical insights into the rationales, measure the fidelity, and summarize the limitations of these attribution methods in a systematic way. Based on these insights, new attribution methods are developed to improve interpretation performance.
In summary, this paper includes the following major contributions:
- The explanation fidelity of DNNs is generally defined via the Taylor expansion function, which is a sufficient approximation with a theoretical guarantee on the approximation error.
- A Taylor attribution framework is proposed to evaluate the explanation fidelity of the Taylor expansion function. Three desired properties are derived for a faithful explanation of DNNs.
- Several existing attribution methods are reformulated into the unified Taylor attribution framework for a systematic and theoretical investigation of their rationale, fidelity, and limitations. New attribution methods are proposed to improve interpretation performance.
2 Taylor Attribution Framework
In this section, we propose a Taylor attribution framework to theoretically interpret and understand how input features contribute to the decision-making process in DNNs. Given a pre-trained DNN model $f$ and a sample $x = (x_1, \dots, x_n)$, attribution methods aim to characterize the contribution of each feature $x_i$ to its prediction $f(x)$. It is not tractable to analyze DNN models directly because of their many layers of function composition. Our basic idea is to address the attribution problem by using a much more interpretable approximation function of $f$. We adopt a Taylor polynomial expansion function, denoted as $g$, to approximate the DNN $f$, due to its theoretical guarantee on the approximation error. The fidelity of the attribution of $f$ is equivalent to the fidelity of the attribution of $g$ when the approximation error of $g$ is sufficiently small. Next, we elaborate on how to address the attribution problem of $g$. An overview of the Taylor attribution framework is shown in Figure 1.
The first-order Taylor expansion of the model $f$ at the input $x$ is,

$$f(\tilde{x}) = f(x) + f_x^\top \Delta + \epsilon,$$

where $\tilde{x}$ represents a selected baseline point, and $\Delta = \tilde{x} - x$, $f_x = \partial f(x) / \partial x$. $\epsilon$ is the approximation error between $f$ and the approximation function $g$ at point $\tilde{x}$. We ignore the constant term $f(\tilde{x})$ if not mentioned thereafter, as the baseline point of attribution methods always satisfies $f(\tilde{x}) = 0$. In addition, we omit a negative sign of $\Delta$, which would not affect the attribution.
First-order Taylor attribution In the first-order Taylor expansion, the linear approximation function is additive across features and can be easily interpreted. It is obvious that $f_{x_i} \Delta_i$ quantifies the contribution of the $i$-th feature to the prediction. $a_i$ is denoted as the attribution score of the $i$-th feature, i.e.,

$$a_i = f_{x_i} \Delta_i.$$

The second-order Taylor expansion of $f$ at $x$ is,

$$f(\tilde{x}) = f(x) + f_x^\top \Delta + \tfrac{1}{2} \Delta^\top H_x \Delta + \epsilon,$$

where $H_x$ is the second-order partial derivative matrix (Hessian matrix) of $f$ at the given point $x$. The second-order Taylor expansion, with the second-order term $\tfrac{1}{2} \Delta^\top H_x \Delta$, has a smaller approximation error than the first-order one.
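The gap between the first- and second-order expansions can be checked numerically on a small example. The following is a minimal sketch, not the paper's implementation: the quadratic toy function `f`, its hand-written derivatives, and the names `taylor_approx`, `x`, and `x_tilde` are all assumptions made for illustration. The expansion is taken at the input and evaluated at the baseline.

```python
# Toy stand-in for a model (an assumption): a quadratic, so that its
# second-order expansion is exact while the first-order one is not.
def f(x):
    x1, x2 = x
    return x1 + 2.0 * x2 + x1 * x2 + x1 ** 2


def grad_f(x):
    """Analytic first-order partial derivatives of the toy model."""
    x1, x2 = x
    return [1.0 + x2 + 2.0 * x1, 2.0 + x1]


def hess_f(x):
    """Analytic Hessian of the toy model (constant for a quadratic)."""
    return [[2.0, 1.0], [1.0, 0.0]]


def taylor_approx(x, x_tilde, order):
    """Expand the toy model at x and evaluate the expansion at x_tilde."""
    delta = [t - v for t, v in zip(x_tilde, x)]
    value = f(x) + sum(g * d for g, d in zip(grad_f(x), delta))
    if order >= 2:
        h = hess_f(x)
        n = len(x)
        value += 0.5 * sum(
            delta[i] * h[i][j] * delta[j] for i in range(n) for j in range(n)
        )
    return value


x = [1.0, 2.0]        # input sample
x_tilde = [0.0, 0.0]  # baseline point, chosen so that f(x_tilde) = 0

err_first = abs(f(x_tilde) - taylor_approx(x, x_tilde, order=1))
err_second = abs(f(x_tilde) - taylor_approx(x, x_tilde, order=2))
print(err_first, err_second)  # prints 3.0 0.0
```

For this toy function the first-order error is entirely due to the missing quadratic terms, which the second-order expansion recovers.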
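Second-order Taylor attribution Because of the Hessian matrix $H_x$, the second-order approximation determines feature contributions less clearly than the first-order expansion. Below, $\Delta$ denotes the input change relative to the baseline, and $f_{x_i}$, $f_{x_i x_j}$ denote the first- and second-order partial derivatives of $f$ at $x$. To make the expansion more interpretable, the Hessian matrix is decomposed into two matrices: a second-order independent matrix $H_x^{ind}$ and an interactive matrix $H_x^{int}$, where $H_x^{ind}$ is a diagonal matrix composed of the diagonal elements of $H_x$. $H_x^{ind}$ describes the second-order isolated effect of features, and $H_x^{int}$ represents the interactive effect among different features. Hence, the second-order expansion could be rewritten as the sum of the first-order term $f_x^\top \Delta$, the second-order independent term $\tfrac{1}{2} \Delta^\top H_x^{ind} \Delta$, and the second-order interactive term $\tfrac{1}{2} \Delta^\top H_x^{int} \Delta$ (with $H_x = H_x^{ind} + H_x^{int}$):

$$f(\tilde{x}) = f(x) + f_x^\top \Delta + \tfrac{1}{2} \Delta^\top H_x^{ind} \Delta + \tfrac{1}{2} \Delta^\top H_x^{int} \Delta + \epsilon.$$

Here, the second-order independent term can be decomposed as $\sum_i \tfrac{1}{2} f_{x_i x_i} \Delta_i^2$, where $\tfrac{1}{2} f_{x_i x_i} \Delta_i^2$ represents the second-order independent effect of $x_i$. Since the first-order term and the second-order independent term are additive across features, the contribution of feature $x_i$ from these two terms can be clearly identified as $f_{x_i} \Delta_i + \tfrac{1}{2} f_{x_i x_i} \Delta_i^2$. The second-order interactive term can be formulated as $\sum_{i<j} f_{x_i x_j} \Delta_i \Delta_j$, where $f_{x_i x_j} \Delta_i \Delta_j$ denotes the interactive effect of $x_i$ and $x_j$. How to assign an interactive effect to features $x_i$ and $x_j$ respectively has been debated frequently in performance measurement Suwignjo et al. (2000). This is even more challenging in neural network scenarios with high complexity.
We propose to handle the interaction effect by following one intuition: the interaction effect of certain features should be assigned to the contributions of those features, and only to them. For example, the interactive effect $f_{x_i x_j} \Delta_i \Delta_j$ should be assigned to the contributions of features $x_i$ and $x_j$. Hence, we define the second-order Taylor attribution as

$$a_i = f_{x_i} \Delta_i + \tfrac{1}{2} f_{x_i x_i} \Delta_i^2 + \sum_{j \neq i} a_i^{ij},$$

where $a_i^{ij}$ is the allocated contribution of $x_i$ from the interaction effect associated with $x_i$ and $x_j$. One way to determine $a_i^{ij}$ is to equally split the interactive effects, i.e., $a_i^{ij} = \tfrac{1}{2} f_{x_i x_j} \Delta_i \Delta_j$.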
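The equal-split rule can be made concrete with a few lines of code. This is a minimal sketch under stated assumptions: the function name `second_order_attribution` is hypothetical, and the gradient and Hessian values are those of a toy quadratic $x_1 + 2x_2 + x_1 x_2 + x_1^2$ expanded at the point $(1, 2)$ with a zero baseline.

```python
def second_order_attribution(grad, hess, delta):
    """Second-order Taylor attribution with an equal split of interactions.

    grad  : first-order partial derivatives at the expansion point
    hess  : Hessian matrix at the expansion point (list of rows)
    delta : input change relative to the baseline
    """
    n = len(delta)
    scores = []
    for i in range(n):
        first_order = grad[i] * delta[i]
        independent = 0.5 * hess[i][i] * delta[i] ** 2
        # The pair (i, j) has total interactive effect
        # hess[i][j] * delta[i] * delta[j]; feature i receives half of it.
        interactive = sum(
            0.5 * hess[i][j] * delta[i] * delta[j] for j in range(n) if j != i
        )
        scores.append(first_order + independent + interactive)
    return scores


# Toy values (assumptions): f(x) = x1 + 2*x2 + x1*x2 + x1**2 at x = (1, 2).
grad = [5.0, 3.0]
hess = [[2.0, 1.0], [1.0, 0.0]]
delta = [-1.0, -2.0]        # baseline (0, 0) minus input (1, 2)

scores = second_order_attribution(grad, hess, delta)
total_change = 0.0 - 8.0    # f(baseline) - f(input) for the toy function
print(scores, sum(scores), total_change)
```

Because the toy function is quadratic, the per-feature scores add up exactly to the output change along `delta`; for a real network the sum would differ by the approximation error.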
Higher-order Taylor attribution Similar to the second-order case, the high-order expansion terms are decomposed into independent terms and interactive terms. Each feature is allocated the contributions from its own first-order term and its own high-order independent terms. A high-order interactive term describes the interaction among the features in a subset of the input features. The attribution of a high-order Taylor expansion follows the same rule as in the second-order case, i.e., an interactive term is assigned only to the features in its subset.
The fidelity of the attribution of a DNN depends on two factors: i) the approximation error of its Taylor approximation function, and ii) the fidelity of the attribution of that approximation function. Hence, within the proposed Taylor attribution framework, a faithful attribution method should satisfy the following three properties:
Property 1. A Taylor attribution method has low model approximation error if the approximation error of its Taylor approximation function is sufficiently small.
Property 2. A Taylor attribution method has accurate assignment of the independent term if, for any feature, its first-order term and high-order independent terms are assigned to the contribution of that feature rather than to other features.
Property 3. A Taylor attribution method has accurate assignment of the interactive term if any interactive term among the features in a subset is assigned only to the features in that subset.
The Taylor attribution framework theoretically defines the fidelity of an interpretation to the model and can be used to assess existing attribution methods. (Note that although Montavon et al. (2018); Sahil Singla and Shi Feng (2019); Wang and Vasconcelos (2019) also mention the Taylor expansion function, they mainly focus on first-order and second-order expansions around a neighborhood.) However, the framework is generally computationally infeasible in large-scale real-world applications. For example, when interpreting the prediction for an image with $n$ pixels, a $k$-th order Taylor attribution would take $O(n^k)$ time to compute the high-order partial derivatives. To tackle the computational challenge, we investigate the theoretical relationship between the proposed Taylor attribution framework and the existing attribution methods, and find that these attribution methods can be unified into the Taylor attribution framework via reformulations.
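To give a sense of scale, the number of partial derivatives can simply be counted. The sketch below is illustrative only: the helper names are hypothetical, and the 224 x 224 single-channel input size is an assumption rather than a figure taken from the text.

```python
from math import comb


def num_partials(n, k):
    """Number of k-th order partial derivatives when index order matters."""
    return n ** k


def num_distinct_partials(n, k):
    """Number of distinct k-th order partials, using symmetry of mixed partials."""
    return comb(n + k - 1, k)


n_pixels = 224 * 224  # assumed input resolution, one value per pixel
for k in (1, 2, 3):
    print(k, num_partials(n_pixels, k), num_distinct_partials(n_pixels, k))
```

Even after exploiting symmetry, the count grows polynomially in the number of pixels with the order as the exponent, which is why the direct computation is avoided.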
3 Unified Reformulation of Existing Attribution Methods
In this section, we mathematically reformulate several attribution methods into the Taylor attribution framework. We mainly focus on the following attribution methods: i) Gradient * Input Shrikumar et al. (2016), ii) Perturbation-1 Zintgraf et al. (2017), iii) Perturbation-patch Zeiler and Fergus (2014), iv) DeepLift (Rescale) Shrikumar et al. (2017) and $\epsilon$-LRP Bach et al. (2015), v) Integrated Gradient Sundararajan et al. (2017). Deconvnet Zeiler and Fergus (2014) and Guided BP Springenberg et al. (2014) are beyond the scope of this discussion. It has been theoretically proved Nie et al. (2018) that these two methods essentially perform a (partial) recovery of the input, which is unrelated to decision making. Empirical evidence in Adebayo et al. (2018) also demonstrated that these two methods are not sensitive to the randomization of network parameters and target labels. Grad-CAM Selvaraju et al. (2017) is not included since it does not directly explain the behavior of convolutional layers; it explains CNNs by interpreting the fully-connected layers and utilizing the location information of the top convolutional layer. Next, we elaborate on the Taylor reformulations of these attribution methods.
I) Gradient * Input Gradient * Input was first proposed in Shrikumar et al. (2016) to generate saliency maps. The attribution is computed by multiplying the partial derivatives (of the output with respect to the input) with the input. That is, $a_i = x_i \cdot \partial f(x) / \partial x_i$.
Proposition 1. Gradient * Input is a first-order Taylor attribution approximation of the DNN model $f$. Here, the baseline point is set to 0, so the input change of each feature equals the feature value up to sign. That is,

$$a_i = \frac{\partial f(x)}{\partial x_i} \, x_i.$$

Gradient * Input satisfies Proposition 1. However, the linear approximation function cannot reflect the highly nonlinear functions in DNNs and fails to satisfy Property 1.
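A small numerical check illustrates both the rule and its limitation. This is a minimal sketch: the two toy models and the central-difference gradient helper are assumptions made for illustration, and a real implementation would use automatic differentiation on the network.

```python
def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of the partial derivatives of f at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


def gradient_times_input(f, x):
    """Attribution of each feature: partial derivative times feature value."""
    return [g * v for g, v in zip(numerical_gradient(f, x), x)]


def linear_model(x):
    return 3.0 * x[0] - 2.0 * x[1]


def interaction_model(x):
    return x[0] * x[1]


x = [1.0, 2.0]
linear_scores = gradient_times_input(linear_model, x)
interaction_scores = gradient_times_input(interaction_model, x)

# For the linear model the scores add up to the output; for the interaction
# model they add up to twice the output, since the first-order view is inexact.
print(sum(linear_scores), linear_model(x))
print(sum(interaction_scores), interaction_model(x))
```

The mismatch on the second model is the kind of approximation error that Property 1 is meant to rule out.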
II) Perturbation-1 Perturbation-1 Zintgraf et al. (2017) attributes an input feature by calculating how much the prediction changes when that feature is perturbed. Specifically, the feature is replaced by a constant while all other features are kept unchanged, and the corresponding new output is obtained by a forward pass. The difference between the two outputs is taken as the attribution of the feature.
Proposition 2. Perturbation-1 is a context-independent high-order attribution approximation of the DNN model, i.e., the attribution of a feature is the sum of its first-order term and its high-order independent terms (see Appendix A for the proof).
Compared to Gradient * Input, Perturbation-1 can characterize the high-order independent effects. However, it fails to incorporate the complex interactions among features (such as pixels), which capture critical information for the prediction.
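The procedure amounts to one extra forward pass per feature. The following is a minimal sketch, assuming a zero replacement constant and a toy two-feature function; the names `perturbation_1` and `toy_model` are hypothetical.

```python
def perturbation_1(f, x, constant=0.0):
    """Attribution of feature i: output change when only x[i] is replaced."""
    original = f(x)
    scores = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] = constant      # perturb a single feature
        scores.append(original - f(perturbed))
    return scores


def toy_model(x):
    # x1**2 is an independent high-order effect, x1*x2 is an interaction.
    return x[0] ** 2 + x[0] * x[1]


x = [1.0, 2.0]
scores = perturbation_1(toy_model, x)
output_change = toy_model(x) - toy_model([0.0, 0.0])
print(scores, sum(scores), output_change)  # prints [3.0, 2.0] 5.0 3.0
```

On this toy function the two scores do not add up to the total output change, because each feature is perturbed in isolation and the interaction is not shared out between them.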
III) Perturbation-patch Similar to Perturbation-1, Perturbation-patch Zeiler and Fergus (2014) attributes features at the patch level. An image is split into patches, where each patch is a subset of the input features. Following the same intuition as Perturbation-1, the features in a patch are perturbed by a constant vector, and the output for the altered input is obtained by a forward pass. The attribution of each feature in the patch is the difference between the original output and the output for the altered input.
Proposition 3. The attribution of Perturbation-patch is the sum of the first-order terms and the high-order independent terms of all features in the patch, and the high-order interactions among features in the same patch (see Appendix A for the proof).
It is worth noting that Proposition 3 holds for an arbitrary subset of the input features.
Perturbation-patch reflects the overall contribution of the features within the patch, which includes independent and interactive effects. However, it assigns the same contribution score to all features in the patch, which fails to provide fine-grained explanations. Moreover, the interactions among different patches are neglected.
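The patch-level variant differs from the single-feature one only in what is perturbed at once. Below is a minimal sketch under assumptions: patches are given as lists of feature indices, the replacement constant is zero, and the toy function and names are hypothetical.

```python
def perturbation_patch(f, x, patches, constant=0.0):
    """Give every feature in a patch the output change of perturbing the patch."""
    original = f(x)
    scores = [0.0] * len(x)
    for patch in patches:
        perturbed = list(x)
        for i in patch:
            perturbed[i] = constant   # perturb all features of the patch together
        change = original - f(perturbed)
        for i in patch:
            scores[i] = change        # same score for every feature in the patch
    return scores


def toy_model(x):
    return x[0] * x[1] + x[2]


x = [1.0, 2.0, 3.0]
patches = [[0, 1], [2]]
scores = perturbation_patch(toy_model, x, patches)
print(scores)  # prints [2.0, 2.0, 3.0]
```

The first two features share one score even though they enter the function through a single interaction, which shows why the explanation is not fine-grained within a patch.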
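IV) DeepLift and $\epsilon$-LRP DeepLift Shrikumar et al. (2017) and $\epsilon$-LRP Bach et al. (2015) compute relevance scores by recursive relevance propagation in a layer-wise manner. In DeepLift (Rescale rule), $x_i^{(l)}$, $x_j^{(l+1)}$ represent the value of neuron $i$ at the $l$-th layer and neuron $j$ at the $(l+1)$-th layer respectively, which satisfy $x_j^{(l+1)} = \sigma\big(\sum_i w_{ji} x_i^{(l)} + b_j\big) = \sigma\big(\sum_i z_{ji} + b_j\big)$. Here, $w_{ji}$ is the weight parameter that connects $x_i^{(l)}$ and $x_j^{(l+1)}$, $b_j$ is the additive bias, and $z_{ji} = w_{ji} x_i^{(l)}$ is the weighted impact of $x_i^{(l)}$ to $x_j^{(l+1)}$. $\sigma$ is a non-linear activation function. DeepLift aims to propagate the output difference between the input $x$ and a selected baseline $\tilde{x}$, where $\tilde{z}_{ji} = w_{ji} \tilde{x}_i^{(l)}$ is the corresponding weighted impact of the baseline. DeepLift calculates the relevance score of $x_i^{(l)}$ at $x_j^{(l+1)}$, denoted as $r_{ij}$, as follows:

$$r_{ij} = \frac{z_{ji} - \tilde{z}_{ji}}{\sum_{i'} z_{ji'} - \sum_{i'} \tilde{z}_{ji'}} \, r_j,$$

where $r_j$ represents the total relevance score of neuron $x_j^{(l+1)}$. Here we mainly focus on investigating the fidelity of the layer-wise rule.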
Proposition 4. The relevance score of DeepLift is reformulated as (see Appendix A for proof),
Where is the overall high-order expansion of the output difference of -th neuron at layer.
DeepLift follows the Summation-to-Delta property, i.e., the summation of attributions across all features is the change of the output. DeepLift is therefore considered to satisfy Property 1. However, the reformulation in Proposition 4 indicates that it does not satisfy Property 2 and Property 3: accurate assignment of the independent term and of the interactive term. It fails to distinguish the relative importance of features within the high-order effect.
$\epsilon$-LRP is similar to DeepLift, and shares the same limitation in distinguishing the attributions of features from the high-order term (the details of the reformulation derivation of $\epsilon$-LRP are given in Appendix A).
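The layer-wise Rescale rule and the Summation-to-Delta property can be checked on a single neuron. This is a minimal sketch under assumptions: one ReLU neuron, the neuron's own output difference used as its total relevance, and hypothetical names throughout; a full implementation would propagate relevance through every layer.

```python
def relu(z):
    return max(0.0, z)


def deeplift_rescale_neuron(weights, bias, x, x_baseline, relevance_out=None):
    """Distribute a neuron's relevance to its inputs with the Rescale rule.

    Each input receives a share proportional to the difference between its
    weighted impact on the input and on the baseline.
    """
    z = [w * v for w, v in zip(weights, x)]
    z_base = [w * v for w, v in zip(weights, x_baseline)]
    out = relu(sum(z) + bias)
    out_base = relu(sum(z_base) + bias)
    if relevance_out is None:
        relevance_out = out - out_base   # output difference of this neuron
    denom = sum(z) - sum(z_base)
    if denom == 0.0:
        return [0.0 for _ in x]          # no pre-activation change to distribute
    return [(zi - zbi) / denom * relevance_out for zi, zbi in zip(z, z_base)]


weights, bias = [1.0, -2.0], 0.5
x, x_baseline = [3.0, 1.0], [0.0, 0.0]
scores = deeplift_rescale_neuron(weights, bias, x, x_baseline)
print(scores, sum(scores))  # prints [3.0, -2.0] 1.0
```

The scores sum to the neuron's output difference, but the split between inputs is driven only by the weighted first-order differences, which is the limitation discussed above.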
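V) Integrated Gradient Given a baseline point $\tilde{x}$, Integrated Gradient Sundararajan et al. (2017) integrates the gradient over the straight-line path from the baseline $\tilde{x}$ to the input $x$. The points on the path are represented as $\tilde{x} + \alpha (x - \tilde{x})$ with $\alpha \in [0, 1]$. The attribution of feature $x_i$ in input $x$ is:

$$a_i = (x_i - \tilde{x}_i) \int_0^1 \frac{\partial f(\tilde{x} + \alpha (x - \tilde{x}))}{\partial x_i} \, d\alpha \approx (x_i - \tilde{x}_i) \, \frac{1}{m} \sum_{k=1}^{m} \frac{\partial f\big(\tilde{x} + \tfrac{k}{m} (x - \tilde{x})\big)}{\partial x_i},$$

where $m$ is the number of steps in the Riemann approximation of the integral.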
Proposition 5. Integrated Gradient is a context-aware high-order attribution approximation of the DNN. The attribution of a feature is the sum of its first-order term, its high-order independent terms, and an interactive term. The interactive term is the contribution allocated to the feature from the overall interactive effect between it and other features (see Appendix A for the proof).
Specifically, Integrated Gradient assigns the contribution of a feature from an interactive effect according to the degree of the polynomial term: a feature that appears with degree $k_i$ in an interactive term of total degree $k$ receives the proportion $k_i / k$ of that term. For example, in the second-order case, the interactive effect between two features is split equally between them.
According to Proposition 5, the Integrated Gradient method, which is widely considered a first-order attribution, not only describes the first-order and context-independent high-order attribution of each feature but also comprehensively incorporates the interaction effect between each feature and other features. This theoretical insight explains why the Integrated Gradient method discriminates features in the input image well. It satisfies the proposed desired Properties 1, 2 and 3: low model approximation error, accurate assignment of the independent term, and accurate assignment of the interactive term, while all other aforementioned attribution methods fail to do so.
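The equal split of a second-order interaction can be observed directly on a toy function. The sketch below is not the reference implementation: it assumes a zero baseline, a central-difference gradient, and a midpoint Riemann sum (a choice made here because it is exact when the gradient is linear along the path); all names are hypothetical.

```python
def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of the partial derivatives of f at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=50):
    """Riemann-sum approximation of Integrated Gradient along a straight path."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps   # midpoint of the k-th sub-interval
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        grad = numerical_gradient(f, point)
        for i in range(n):
            avg_grad[i] += grad[i] / steps
    return [(v - b) * g for v, b, g in zip(x, baseline, avg_grad)]


def toy_model(x):
    # One first-order term (x1) and one second-order interaction (x1 * x2).
    return x[0] + x[0] * x[1]


x, baseline = [1.0, 2.0], [0.0, 0.0]
scores = integrated_gradients(toy_model, x, baseline)
print(scores)  # approximately [2.0, 1.0]
```

Here the interaction $x_1 x_2 = 2$ is shared as 1 and 1, the first feature additionally receives its first-order term, and the scores sum to the output difference between input and baseline.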
Table 1 summarizes the second-order Taylor reformulations of these existing attribution methods and whether the proposed three desired properties are satisfied.
4 Improvements on Integrated Gradient
According to the reformulations, Integrated Gradient is the only one of these attribution methods that satisfies all three desired properties. However, the reformulation of Integrated Gradient also indicates that the attribution is closely related to the input change, which is determined by the chosen baseline. This leads to one major defect: the Integrated Gradient method is sensitive to the baseline but not to model parameters and target labels Adebayo et al. (2018).
We consider three strategies to overcome this weakness. First, we rescale the attribution with respect to the input change, which could alleviate the correlation to some extent. Specifically, the Integrated Gradient attribution of each feature is divided by the change of that feature; the result is denoted by IG1. According to Proposition 5, the Integrated Gradient attribution represents the influence on the prediction difference caused by the change of a feature. Hence IG1 actually measures the sensitivity of the output with respect to the feature, i.e., the expected change in the output as the feature changes by one unit.
Second, we limit the magnitude of the input change to further relieve the correlation. The rescaling relieves the sensitivity to the input change in normal cases, but it fails to resolve the issue when the input change varies widely across features (pixels). For example, given a black image as the baseline, the input change of a white pixel is close to 255, while that of a black pixel is almost 0. Hence the rescaled attribution for black pixels is larger than for white pixels because of the selected baseline. To avoid such cases, we assume that the input change follows a Gaussian distribution with a specified mean and a constrained standard deviation; the resulting attribution is denoted by IG2.
Third, to further guarantee the robustness of the attribution, we select multiple baselines instead of a single one. IG3 is defined as the average of the attributions over the selected baselines. As an average attribution across multiple baselines, IG3 reduces the possibility that an attribution is dominated by one particular baseline.
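The three strategies can be sketched on a toy function. Everything below is one possible reading rather than the authors' implementation: it assumes that IG2 reuses the IG1 rescaling, that the Gaussian input change has zero mean, that the standard deviation is 0.1 and the number of baselines is 20, and that gradients are taken by central differences. All function names are hypothetical.

```python
import random


def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of the partial derivatives of f at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of Integrated Gradient."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        grad = numerical_gradient(f, point)
        for i in range(n):
            avg_grad[i] += grad[i] / steps
    return [(v - b) * g for v, b, g in zip(x, baseline, avg_grad)]


def ig1(f, x, baseline, steps=50):
    """IG1-style score: attribution divided by the per-feature input change."""
    scores = []
    for a, v, b in zip(integrated_gradients(f, x, baseline, steps), x, baseline):
        change = v - b
        scores.append(a / change if abs(change) > 1e-12 else 0.0)
    return scores


def ig2(f, x, sigma, rng, steps=50):
    """IG2-style score: rescaled attribution with a Gaussian input change."""
    baseline = [v + rng.gauss(0.0, sigma) for v in x]  # assumed zero-mean change
    return ig1(f, x, baseline, steps)


def ig3(f, x, sigma, num_baselines, seed=0, steps=50):
    """IG3-style score: average of IG2-style scores over several baselines."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(num_baselines):
        for i, s in enumerate(ig2(f, x, sigma, rng, steps)):
            total[i] += s / num_baselines
    return total


def toy_model(x):
    return x[0] + x[0] * x[1]


x = [1.0, 2.0]
ig1_scores = ig1(toy_model, x, [0.0, 0.0])
ig3_scores = ig3(toy_model, x, sigma=0.1, num_baselines=20)
print(ig1_scores, ig3_scores)
```

With baselines drawn close to the input, the averaged scores on this toy function stay near the local sensitivities of the output, independently of any single baseline choice.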
5 Experiments
In this section, we conduct experiments to evaluate the effectiveness of the proposed IG1, IG2, and IG3. The interpretations are evaluated on the ILSVRC2014 dataset Russakovsky et al. (2015) with the VGG19 architecture Simonyan and Zisserman (2015), and compared with Gradient and Integrated Gradient (more comparisons with state-of-the-art attribution methods can be found in Appendix B). In Integrated Gradient and IG1, the black image is selected as the baseline. In IG2 and IG3, the standard deviation is set as. In IG3, baselines are randomly generated from a Gaussian distribution.
The performance is compared both qualitatively and quantitatively. First, the visual comparisons via saliency maps are shown in Figure 2 for Gradient, Integrated Gradient, and the three proposed methods. In general, the saliency map based on Gradient is visually noisy and includes some regions irrelevant to the prediction, while IG3 accurately localizes the objects of interest. The saliency maps generated by IG3 are evenly distributed with less noise, are sharper, and clearly display the shapes and borders of the objects. Specifically, the corgi in the first image has a large contrast ratio (i.e., white legs, black body, and mixed face). When interpreting the classification decision for the corgi, Integrated Gradient assigns significantly higher contributions to white areas than to black areas, even though it identifies the right location of the corgi. This is because the input change corresponding to white pixels is large when a black image is selected as the baseline, which demonstrates the limitation of Integrated Gradient. A similar observation can be made when interpreting the cardoon image. The proposed IG1 and IG2 alleviate this phenomenon by mitigating the sensitivity to the baseline image. IG3 further enhances the robustness of the saliency map by averaging across several baselines. More visualization examples can be found in Appendix B.
Second, quantitative measurements are compared via localization performance, i.e., how well attribution methods locate objects of interest. A common approach is to compare the saliency maps derived from the attribution scores with bounding box annotations. Assume the bounding box contains $n$ pixels. We select the top $n$ pixels according to ranked attribution scores and count how many of them fall inside the bounding box. The ratio of this count to $n$ is used as the metric of localization accuracy Karl Schulz and Federico Tombari (2020). We group images by the percentage of image pixels covered by the bounding box and consider two scenarios: bounding boxes covering less than 33% and 66% of the input image respectively. We run the experiments over 4k images for each scenario, and the results of localization accuracy are shown in Table 2.
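The localization metric is straightforward to compute from a saliency map and a bounding box. The following is a minimal sketch under assumptions: the saliency map is a small nested list, the box is given as end-exclusive row and column ranges, and the function name is hypothetical.

```python
def localization_accuracy(saliency, box):
    """Fraction of the top-n scored pixels that fall inside the bounding box.

    saliency : 2D list of attribution scores
    box      : (row_start, row_end, col_start, col_end), end-exclusive
    n is the number of pixels contained in the box.
    """
    r0, r1, c0, c1 = box
    n_box = (r1 - r0) * (c1 - c0)
    pixels = [
        (saliency[r][c], r, c)
        for r in range(len(saliency))
        for c in range(len(saliency[0]))
    ]
    pixels.sort(key=lambda p: p[0], reverse=True)   # highest scores first
    top = pixels[:n_box]
    inside = sum(1 for _, r, c in top if r0 <= r < r1 and c0 <= c < c1)
    return inside / n_box


saliency = [
    [0.9, 0.8, 0.1],
    [0.7, 0.2, 0.6],
    [0.0, 0.3, 0.4],
]
box = (0, 2, 0, 2)   # top-left 2 x 2 region, 4 pixels
accuracy = localization_accuracy(saliency, box)
print(accuracy)  # prints 0.75
```

In this example three of the four highest-scoring pixels lie inside the box, so the accuracy is 0.75.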
Integrated Gradient has better localization performance than Gradient in both scenarios. This observation is consistent with our theoretical investigation. The reformulation of Integrated Gradient involves complex feature interactions, so it describes the model behaviors more adequately than Gradient. The proposed improvements IG1 and IG2 significantly outperform Integrated Gradient by over two percentage points. IG3 performs the best due to the improved attribution robustness obtained by averaging attributions across multiple baselines.
6 Conclusion and Future Work
In this work, we aim to theoretically understand and assess several existing attribution methods in a unified framework, which builds a systematic metric to evaluate and compare attribution methods designed from heuristic or empirical intuition. It also provides a direction for improving existing attribution methods and developing novel ones. We utilize the Taylor expansion function to approximate complex DNNs and propose a Taylor attribution framework to define the faithfulness of an interpretation via three desired properties. Several attribution methods are reformulated into the unified framework to investigate their rationales, fidelity, and limitations. New methods based on existing attribution methods are proposed to improve interpretation performance, and their effectiveness is demonstrated by qualitative and quantitative experiments. In future work, we will further explore the effect of complex feature interactions and how it could be applied to provide better interpretations, as complex feature interaction is a core characteristic of DNNs.
7 Broader Impact
This work aims to build a unified framework to theoretically understand and assess attribution methods, which will immediately lead to advances in pursuing interpretability of deep learning. It will have an immediate impact on the interpretable machine learning systems, including formalizing the model interpretability, providing a systematic definition and evaluation metric of the interpretation fidelity, revisiting and filtering existing attribution methods, and prompting better attribution methods.
Furthermore, it will have a strong impact on improving the usability of deep learning in important applications, such as autonomous driving, cybersecurity, and AI healthcare. It will also help improve the overall value of deep learning based systems, promote a more transparent, fair, and accountable platform for emerging and future data/information management systems, and help understand how interpretations could be utilized to further advance machine learning applications. The outcome of this work will play an integral part in educating and training students in fundamental AI concepts and algorithms, and will be integrated as an education module targeting students in cybersecurity and data science at the authors' institute.
Human subjects are not involved in this work, and we have strictly followed institutional and national regulations to avoid any ethical violations in conducting this research.
References
- Adebayo et al. (2018). Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9505–9515.
- Bach et al. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one 10 (7).
- Du et al. (2019). Techniques for interpretable machine learning. Communications of the ACM 63 (1), pp. 68–77.
- Fong and Vedaldi (2017). Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3429–3437.
- Karl Schulz and Federico Tombari (2020). Restricting the flow: information bottlenecks for attribution. In International Conference on Learning Representations.
- Lundberg and Lee (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774.
- Marco Ancona and Cengiz Öztireli (2015). Towards better understanding of gradient-based attribution methods for deep neural networks. In International Conference on Learning Representations.
- Montavon et al. (2017). Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition 65, pp. 211–222.
- Montavon et al. (2018). Methods for interpreting and understanding deep neural networks. Digital Signal Processing 73, pp. 1–15.
- Nie et al. (2018). A theoretical explanation for perplexing behaviors of backpropagation-based visualizations. In Proceedings of the 35th International Conference on Machine Learning-Volume 70, pp. 3809–3818.
- Ribeiro et al. (2016). "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144.
- Russakovsky et al. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
- Sahil Singla and Shi Feng (2019). Understanding impacts of high-order loss approximations and features in deep learning interpretation. In Proceedings of the 36th International Conference on Machine Learning-Volume 70.
- Samek et al. (2016). Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems 28 (11), pp. 2660–2673.
- Selvaraju et al. (2017). Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626.
- Shrikumar et al. (2017). Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3145–3153.
- Shrikumar et al. (2016). Not just a black box: learning important features through propagating activation differences. arXiv preprint arXiv:1605.01713.
- Simonyan and Zisserman (2015). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
- Smilkov et al. (2017). SmoothGrad: removing noise by adding noise. In International Conference on Learning Representations Workshop.
- Springenberg et al. (2014). Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806.
- Sundararajan et al. (2017). Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3319–3328.
- Suwignjo et al. (2000). Quantitative models for performance measurement system. International Journal of Production Economics 64 (1-3), pp. 231–241.
- Wang and Vasconcelos (2019). Deliberative explanations: visualizing network insecurities. In Advances in Neural Information Processing Systems, pp. 1372–1383.
- Zeiler and Fergus (2014). Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833.
- Zintgraf et al. (2017). Visualizing deep neural network decisions: prediction difference analysis. arXiv preprint arXiv:1702.04595.