Characterizing the risk of fairwashing

06/14/2021
by Ulrich Aïvodji et al.

Fairwashing refers to the risk that an unfair black-box model can be explained by a fairer model through post-hoc explanation manipulation. However, to achieve this, the post-hoc explanation model must produce predictions that differ from those of the original black-box on some inputs, leading to a decrease in fidelity imposed by the difference in unfairness. In this paper, our main objective is to characterize the risk of fairwashing attacks, in particular by investigating the fidelity-unfairness trade-off. First, we demonstrate through an in-depth empirical study on black-box models trained on several real-world datasets, and for several statistical notions of fairness, that it is possible to build high-fidelity explanation models with low unfairness. For instance, we find that fairwashed explanation models can exhibit up to 99.20% fidelity to the black-box models they explain while being 50% less unfair. These results suggest that fidelity alone should not be used as a proxy for the quality of black-box explanations. Second, we show that fairwashed explanation models can generalize beyond the suing group (i.e., the data points being explained), a risk that will only worsen as more stable fairness methods are developed. Finally, we demonstrate that fairwashing attacks can transfer across black-box models, meaning that an explanation model crafted to fairwash one black-box model can also fairwash other black-box models without explicitly using their predictions.
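To make the fidelity-unfairness trade-off concrete, the two quantities can be measured as follows. This is a minimal illustrative sketch, not the paper's code: `fidelity` is the agreement rate between the explanation model and the black box, and `demographic_parity_gap` implements demographic parity, one example of the statistical fairness notions the paper considers. The function names and the binary encoding of the sensitive attribute are assumptions made for illustration.

```python
import numpy as np

def fidelity(bb_preds, expl_preds):
    """Fraction of inputs on which the explanation model
    agrees with the black-box model it explains."""
    bb_preds = np.asarray(bb_preds)
    expl_preds = np.asarray(expl_preds)
    return float(np.mean(bb_preds == expl_preds))

def demographic_parity_gap(preds, sensitive):
    """Absolute difference in positive-prediction rates between
    the two groups defined by a binary sensitive attribute."""
    preds = np.asarray(preds, dtype=float)
    sensitive = np.asarray(sensitive)
    rate_0 = preds[sensitive == 0].mean()
    rate_1 = preds[sensitive == 1].mean()
    return float(abs(rate_0 - rate_1))

# Toy example: an explanation model that disagrees with the
# black box on one input but has a smaller demographic parity gap.
bb = [1, 0, 1, 1]        # black-box predictions
ex = [1, 0, 0, 1]        # explanation-model predictions
s = [0, 0, 1, 1]         # binary sensitive attribute
print(fidelity(bb, ex))                  # → 0.75
print(demographic_parity_gap(bb, s))     # → 0.5
print(demographic_parity_gap(ex, s))     # → 0.0
```

A fairwashing attack, in these terms, searches for an explanation model that keeps `fidelity` close to 1 while driving the unfairness measure well below that of the black box.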


