Double Trouble: How to not explain a text classifier's decisions using counterfactuals synthesized by masked language models?

10/22/2021
by Thang M. Pham, et al.

Explaining how important each input feature is to a classifier's decision is critical in high-stakes applications. An underlying principle behind dozens of explanation methods is to take the prediction difference before and after an input feature (here, a token) is removed as that feature's attribution, i.e., the individual treatment effect in causal inference. A recent method called Input Marginalization (IM) (Kim et al., 2020) uses BERT to replace a token, i.e., to simulate the do(.) operator, yielding more plausible counterfactuals. However, our rigorous evaluation using five metrics and three datasets finds IM explanations to be consistently more biased, less accurate, and less plausible than those derived from simply deleting a word.
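For intuition, the sketch below contrasts the two attribution strategies the abstract describes: leave-one-out deletion versus an Input-Marginalization-style score that replaces the target token with candidates from BERT's masked language model and averages the classifier's output, weighted by the LM probabilities. This is a minimal illustration, not the authors' implementation; the model choices, the top-k candidate cut-off, the positive-class scoring, and the whitespace tokenization are all simplifying assumptions.

```python
# Minimal sketch (assumed setup, not the paper's code): compare leave-one-out
# deletion attribution with an Input-Marginalization-style attribution that
# uses BERT's masked LM to propose replacement tokens.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")              # f(x): text -> class probability
mlm = pipeline("fill-mask", model="bert-base-uncased")   # p(token | context) via masked LM


def prob_positive(text):
    """Probability of the POSITIVE class under the classifier."""
    out = classifier(text)[0]
    return out["score"] if out["label"] == "POSITIVE" else 1.0 - out["score"]


def deletion_attribution(tokens, i):
    """Leave-one-out: drop token i and take the prediction difference."""
    full = " ".join(tokens)
    without = " ".join(tokens[:i] + tokens[i + 1:])
    return prob_positive(full) - prob_positive(without)


def im_attribution(tokens, i, top_k=10):
    """IM-style: marginalize token i over BERT's top-k masked-LM candidates.

    top_k is a parameter of recent fill-mask pipeline versions; the cut-off
    and renormalization over only the top candidates are assumptions here.
    """
    full = " ".join(tokens)
    masked = " ".join(tokens[:i] + [mlm.tokenizer.mask_token] + tokens[i + 1:])
    candidates = mlm(masked, top_k=top_k)
    total_weight = sum(c["score"] for c in candidates)
    marginal = sum(
        c["score"] * prob_positive(masked.replace(mlm.tokenizer.mask_token, c["token_str"]))
        for c in candidates
    ) / total_weight
    return prob_positive(full) - marginal


tokens = "the movie was surprisingly good".split()
for i, tok in enumerate(tokens):
    print(f"{tok:12s} deletion={deletion_attribution(tokens, i):+.3f} "
          f"IM={im_attribution(tokens, i):+.3f}")
```

The key design difference is what stands in for the "removed" token: deletion simply leaves a hole in the input, while the IM-style score fills the hole with plausible words proposed by the masked LM and averages the classifier's predictions over them.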


Related research:

- 01/01/2021: On Explaining Your Explanations of BERT: An Empirical Study with Sequence Classification
- 08/31/2021: Explaining Classes through Word Attribution
- 10/09/2019: Removing input features via a generative model to explain their attributions to classifier's decisions
- 03/01/2023: Learning high-dimensional causal effect
- 05/06/2022: Necessity and Sufficiency for Explaining Text Classifiers: A Case Study in Hate Speech Detection
- 05/25/2023: Sequential Integrated Gradients: a simple but effective method for explaining language models
- 04/30/2020: How do Decisions Emerge across Layers in Neural Models? Interpretation with Differentiable Masking
