Right for the Wrong Reason: Can Interpretable ML Techniques Detect Spurious Correlations?

07/23/2023
by   Susu Sun, et al.

While deep neural network models offer unmatched classification performance, they are prone to learning spurious correlations in the data. Such dependencies on confounding information can be difficult to detect using performance metrics when the test data comes from the same distribution as the training data. Interpretable ML methods, such as post-hoc explanations or inherently interpretable classifiers, promise to identify faulty model reasoning. However, there is mixed evidence as to whether many of these techniques can actually do so. In this paper, we propose a rigorous evaluation strategy to assess an explanation technique's ability to correctly identify spurious correlations. Using this strategy, we evaluate five post-hoc explanation techniques and one inherently interpretable method for their ability to detect three types of artificially added confounders in a chest x-ray diagnosis task. We find that the post-hoc technique SHAP, as well as the inherently interpretable Attri-Net, provide the best performance and can be used to reliably identify faulty model behavior.
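The evaluation idea described in the abstract can be illustrated with a toy sketch: plant an artificial confounder (a "tag") in the images, take a classifier that has latched onto it, and check whether an attribution map concentrates on the confounder region. The code below is a minimal, hypothetical illustration, not the paper's actual pipeline; it uses a simple occlusion-based attribution as a lightweight stand-in for SHAP, and all names and data are invented.

```python
import numpy as np

# Toy setup: 8x8 "x-ray" images where a 2x2 tag in the top-left corner
# perfectly predicts the positive label (a deliberately confounded dataset).
rng = np.random.default_rng(0)

def make_image(label):
    img = rng.normal(0.0, 0.1, size=(8, 8))
    if label == 1:
        img[:2, :2] = 1.0  # artificially added confounder ("tag")
    return img

# A "trained" classifier that relies only on the confounder: its score is
# the mean intensity of the tag region.
def model(img):
    return img[:2, :2].mean()

# Occlusion attribution (a simplified stand-in for SHAP): the importance of
# each pixel is the score drop when that pixel is masked to the background.
def occlusion_map(img):
    base = model(img)
    attr = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            masked = img.copy()
            masked[i, j] = 0.0
            attr[i, j] = base - model(masked)
    return attr

img = make_image(label=1)
attr = occlusion_map(img)
tag_mass = np.abs(attr[:2, :2]).sum()
total_mass = np.abs(attr).sum() + 1e-12
print(f"fraction of attribution on the confounder: {tag_mass / total_mass:.2f}")
```

For this confounder-dependent model, essentially all attribution mass lands on the tag region, which is exactly the signal a practitioner would use to flag faulty reasoning; a model that used genuine pathology features would instead spread attribution over the anatomy.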


Related research

12/10/2022: Identifying the Source of Vulnerability in Explanation Discrepancy: A Case Study in Neural Text Classification
Some recent works observed the instability of post-hoc explanations when...

02/28/2022: An Empirical Study on Explanations in Out-of-Domain Settings
Recent work in Natural Language Processing has focused on developing app...

11/14/2022: Explainer Divergence Scores (EDS): Some Post-Hoc Explanations May be Effective for Detecting Unknown Spurious Correlations
Recent work has suggested post-hoc explainers might be ineffective for d...

05/24/2021: Reproducibility Report: Contextualizing Hate Speech Classifiers with Post-hoc Explanation
The presented report evaluates Contextualizing Hate Speech Classifiers w...

06/29/2022: Causality for Inherently Explainable Transformers: CAT-XPLAIN
There have been several post-hoc explanation approaches developed to exp...

11/10/2020: Debugging Tests for Model Explanations
We investigate whether post-hoc model explanations are effective for dia...

05/05/2020: Contextualizing Hate Speech Classifiers with Post-hoc Explanation
Hate speech classifiers trained on imbalanced datasets struggle to deter...
