Impossibility Theorems for Feature Attribution

12/22/2022
by Blair Bilodeau, et al.

Despite a sea of interpretability methods that can produce plausible explanations, the field has also empirically seen many failure cases of such methods. In light of these results, it remains unclear for practitioners how to use these methods and how to choose between them in a principled way. In this paper, we show that even for moderately rich model classes (a condition easily satisfied by neural networks), any feature attribution method that is complete and linear (for example, Integrated Gradients and SHAP) can provably fail to improve on random guessing for inferring model behaviour. Our results apply to common end-tasks such as inferring local model behaviour, identifying spurious features, and providing algorithmic recourse. One takeaway from our work is the importance of concretely defining end-tasks: once such an end-task is defined, a simple and direct approach of repeated model evaluations can outperform many more complex feature attribution methods.
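To make the key terms concrete: completeness and linearity are standard axioms in the attribution literature, and both are satisfied by Integrated Gradients and SHAP. Stated in notation of our own choosing, for an attribution method a(f, x) that assigns a score to each of the d features of an input x:

```latex
% Completeness: attributions sum to the model's output change
% relative to a baseline input x'.
\sum_{i=1}^{d} a_i(f, x) = f(x) - f(x')

% Linearity: attributing a linear combination of models gives the
% same linear combination of their attributions.
a(\alpha f + \beta g, x) = \alpha \, a(f, x) + \beta \, a(g, x)
```

As for the "simple and direct approach of repeated model evaluations", the sketch below shows what such a baseline might look like for the local-behaviour end-task: query the model at perturbed copies of the input and average the resulting slopes. The function name, perturbation scale, and toy model are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def probe_local_effect(model, x, i, delta=1e-2, n_samples=32, seed=0):
    """Estimate feature i's local effect on `model` by repeated evaluation.

    Direct probing: perturb feature i around x and average the resulting
    finite-difference slopes. `model` maps a 1-D array to a scalar; the
    perturbation scale and sample count here are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    base = model(x)
    slopes = []
    for _ in range(n_samples):
        # Nonzero step of random sign, bounded away from zero.
        h = rng.uniform(0.5 * delta, delta) * rng.choice([-1.0, 1.0])
        x_pert = x.copy()
        x_pert[i] += h
        slopes.append((model(x_pert) - base) / h)
    return float(np.mean(slopes))

# Toy usage: for f(z) = z0^2 + 3*z1 at x0 = (1, 2), the local slopes
# are approximately 2 and 3.
f = lambda z: float(z[0] ** 2 + 3.0 * z[1])
x0 = np.array([1.0, 2.0])
print(probe_local_effect(f, x0, i=0))  # ~2.0
print(probe_local_effect(f, x0, i=1))  # ~3.0
```

The point is not that this probe is sophisticated; it is that once the end-task is pinned down ("how does the output respond to feature i near x?"), direct model evaluation answers it without any attribution machinery.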

Related research

Four Axiomatic Characterizations of the Integrated Gradients Attribution Method (06/23/2023)
Deep neural networks have produced significant progress among machine le...

Show or Suppress? Managing Input Uncertainty in Machine Learning Model Explanations (01/23/2021)
Feature attribution is widely used in interpretable machine learning to ...

Evaluating Attribution Methods using White-Box LSTMs (10/16/2020)
Interpretability methods for neural networks are difficult to evaluate b...

Harmonizing Feature Attributions Across Deep Learning Architectures: Enhancing Interpretability and Consistency (07/05/2023)
Ensuring the trustworthiness and interpretability of machine learning mo...

Challenging common interpretability assumptions in feature attribution explanations (12/04/2020)
As machine learning and algorithmic decision making systems are increasi...

Integrated Directional Gradients: Feature Interaction Attribution for Neural NLP Models (09/02/2021)
In this paper, we introduce Integrated Directional Gradients (IDG), a me...

Attributions Beyond Neural Networks: The Linear Program Case (06/14/2022)
Linear Programs (LPs) have been one of the building blocks in machine le...
