Post hoc Explanations may be Ineffective for Detecting Unknown Spurious Correlation

by Julius Adebayo, et al.

We investigate whether three types of post hoc model explanations (feature attribution, concept activation, and training point ranking) are effective for detecting a model's reliance on spurious signals in the training data. Specifically, we consider the scenario where the spurious signal to be detected is unknown, at test time, to the user of the explanation method. We design an empirical methodology that uses semi-synthetic datasets along with pre-specified spurious artifacts to obtain models that verifiably rely on these spurious training signals. We then provide a suite of metrics that assess an explanation method's reliability for spurious signal detection under various conditions. We find that the post hoc explanation methods tested are ineffective when the spurious artifact is unknown at test time, especially for non-visible artifacts like a background blur. Further, we find that feature attribution methods are susceptible to erroneously indicating dependence on spurious signals even when the model being explained does not rely on spurious artifacts. This finding casts doubt on the utility of these approaches, in the hands of a practitioner, for detecting a model's reliance on spurious signals.
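To make the idea of a semi-synthetic dataset with a pre-specified spurious artifact concrete, the following is a minimal sketch, not the authors' actual construction. It assumes a simple corner "tag" artifact stamped onto every image of one class; the function name `add_spurious_tag`, its parameters, and the random placeholder images are all hypothetical and chosen only for illustration.

```python
import numpy as np

def add_spurious_tag(images, labels, target_class, size=3, value=1.0):
    """Return a copy of `images` in which every image whose label equals
    `target_class` has a bright square stamped in its top-left corner.

    Hypothetical helper: the tag is an assumed stand-in for a
    pre-specified spurious artifact that correlates with one class.
    """
    out = images.copy()                  # leave the original data untouched
    mask = labels == target_class        # boolean selector over the batch axis
    out[mask, :size, :size] = value      # stamp the artifact on selected images
    return out

# Placeholder data: 8 grayscale 16x16 "images" with pixel values in [0, 0.5).
rng = np.random.default_rng(0)
images = rng.uniform(0.0, 0.5, size=(8, 16, 16))
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Only class-1 images receive the artifact, so the tag predicts the label.
tagged = add_spurious_tag(images, labels, target_class=1)
```

A model trained on `tagged` can learn to rely on the tag rather than the image content. Because the artifact is known to whoever built the dataset, an explanation method's output can then be checked against it.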




