Reckoning with the Disagreement Problem: Explanation Consensus as a Training Objective

03/23/2023
by Avi Schwarzschild, et al.

As neural networks increasingly make critical decisions in high-stakes settings, monitoring and explaining their behavior in an understandable and trustworthy manner is a necessity. One commonly used type of explainer is post hoc feature attribution, a family of methods that assign each input feature a score reflecting its influence on the model's output. A major practical limitation of this family is that different explainers can disagree about which features are more important than others. Our contribution in this paper is a method of training models with this disagreement problem in mind. We introduce a Post hoc Explainer Agreement Regularization (PEAR) loss term alongside the standard term corresponding to accuracy; the added term measures the difference in feature attributions produced by a pair of explainers. On three datasets, we show that training with this loss term improves explanation consensus on unseen data, and that the improvement extends to explainers beyond the pair used in the loss term. We examine the trade-off between improved consensus and model performance. Finally, we study the influence our method has on feature attribution explanations.
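To make the idea concrete, here is a minimal, hypothetical sketch of a PEAR-style objective. The paper's actual explainer pair and disagreement measure are specified in the full text; the version below is an illustration only, using two analytic attributions for a logistic model (saliency and Input×Gradient) and a squared difference between L2-normalized attribution vectors as the disagreement term. All function names and the choice of `lam` are assumptions for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attributions(w, x):
    """Two simple post hoc feature attributions for a logistic model f(x) = sigmoid(w @ x).

    grad:       d(logit)/dx = w          (saliency-style attribution)
    grad_input: x * d(logit)/dx = x * w  (Input x Gradient attribution)
    """
    grad = np.broadcast_to(w, x.shape)
    grad_input = x * w
    return grad, grad_input

def normalize(a, eps=1e-8):
    # Scale each attribution vector to unit L2 norm so the disagreement
    # term compares relative feature importance, not raw magnitudes.
    return a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)

def pear_loss(w, x, y, lam=0.5):
    """Standard accuracy term (binary cross-entropy) plus a hypothetical
    explainer-disagreement penalty, weighted by lam."""
    p = sigmoid(x @ w)
    bce = -np.mean(y * np.log(p + 1e-8) + (1 - y) * np.log(1 - p + 1e-8))
    a1, a2 = attributions(w, x)
    disagreement = np.mean((normalize(a1) - normalize(a2)) ** 2)
    return bce + lam * disagreement
```

Because the disagreement term is non-negative, setting `lam = 0` recovers the plain accuracy objective, and any `lam > 0` trades some of that objective for agreement between the two explainers, mirroring the consensus/performance trade-off the abstract describes.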


