How can we fool LIME and SHAP? Adversarial Attacks on Post hoc Explanation Methods

11/06/2019
by   Dylan Slack, et al.

As machine learning black boxes are increasingly being deployed in domains such as healthcare and criminal justice, there is growing emphasis on building tools and techniques for explaining these black boxes in an interpretable manner. Such explanations are leveraged by domain experts to diagnose systematic errors and underlying biases of black boxes. In this paper, we demonstrate that post hoc explanation techniques that rely on input perturbations, such as LIME and SHAP, are not reliable. Specifically, we propose a novel scaffolding technique that effectively hides the biases of any given classifier by allowing an adversarial entity to craft an arbitrary desired explanation. Our approach can be used to scaffold any biased classifier in such a way that its predictions on the input data distribution remain biased, but the post hoc explanations of the scaffolded classifier look innocuous. Through extensive evaluation on multiple real-world datasets (including COMPAS), we demonstrate how extremely biased (racist) classifiers crafted by our framework can easily fool popular explanation techniques such as LIME and SHAP into generating innocuous explanations that do not reflect the underlying biases.
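Roughly, the scaffolding exploits the fact that LIME and SHAP explain a model by querying it on synthetic perturbations of the input, which tend to fall off the data manifold: a detector trained to recognize such perturbations can route them to an innocuous surrogate model while real inputs still receive the biased predictions. Below is a minimal sketch of this idea, not the authors' released code; the class and model names (ScaffoldedClassifier, biased_model, innocuous_model) and the Gaussian perturbation scheme used to train the detector are illustrative assumptions.

```python
# Sketch of a scaffolded classifier: real inputs go to a biased model,
# perturbation-like inputs (as generated by LIME/SHAP) go to an innocuous model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier


class ScaffoldedClassifier:
    def __init__(self, biased_model, innocuous_model, X_real,
                 n_perturbations=10_000, noise_scale=0.3, seed=0):
        self.biased_model = biased_model        # e.g. depends on a sensitive feature
        self.innocuous_model = innocuous_model  # e.g. depends only on uncorrelated features
        rng = np.random.default_rng(seed)
        X_real = np.asarray(X_real, dtype=float)

        # Approximate the explainers' neighborhood samples by adding Gaussian
        # noise to randomly chosen real rows (an assumed perturbation scheme).
        idx = rng.integers(0, len(X_real), size=n_perturbations)
        X_pert = X_real[idx] + rng.normal(scale=noise_scale,
                                          size=(n_perturbations, X_real.shape[1]))

        # Detector learns to separate real data (label 1) from perturbations (label 0).
        X_det = np.vstack([X_real, X_pert])
        y_det = np.concatenate([np.ones(len(X_real)), np.zeros(len(X_pert))])
        self.detector = RandomForestClassifier(n_estimators=100,
                                               random_state=seed).fit(X_det, y_det)

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        on_dist = self.detector.predict(X).astype(bool)
        out = np.empty(len(X), dtype=int)
        if on_dist.any():
            # Inputs judged to be real data keep the biased behavior.
            out[on_dist] = self.biased_model.predict(X[on_dist])
        if (~on_dist).any():
            # Inputs judged to be explainer queries get innocuous behavior.
            out[~on_dist] = self.innocuous_model.predict(X[~on_dist])
        return out
```

When LIME or SHAP explain such a wrapped model, most of their neighborhood queries are flagged as perturbations and answered by the innocuous surrogate, so the resulting feature attributions do not surface the sensitive feature even though real inputs are still classified by the biased model.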


Related research

11/15/2019 · "How do I fool you?": Manipulating User Trust via Misleading Black Box Explanations
02/21/2021 · Towards the Unification and Robustness of Perturbation and Gradient Based Explanations
05/05/2020 · Contextualizing Hate Speech Classifiers with Post-hoc Explanation
06/15/2021 · On the Objective Evaluation of Post Hoc Explainers
10/09/2021 · Self-explaining Neural Network with Plausible Explanations
11/12/2020 · Robust and Stable Black Box Explanations
06/28/2022 · On the amplification of security and privacy risks by post-hoc explanations in machine learning models
