Model extraction from counterfactual explanations

09/03/2020
by Ulrich Aïvodji, et al.

Post-hoc explanation techniques refer to a posteriori methods that can be used to explain how black-box machine learning models produce their outcomes. Among these techniques, counterfactual explanations have become one of the most popular. In addition to highlighting the most important features used by the black-box model, they provide users with actionable explanations in the form of data instances that would have received a different outcome. Nonetheless, in doing so, they also leak non-trivial information about the model itself, which raises privacy issues. In this work, we demonstrate how an adversary can leverage the information provided by counterfactual explanations to mount high-fidelity and high-accuracy model extraction attacks. More precisely, our attack enables the adversary to build a faithful copy of a target model by accessing its counterfactual explanations. An empirical evaluation on black-box models trained on real-world datasets shows that the attack achieves high-fidelity and high-accuracy extraction even under low query budgets.
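The core idea can be sketched in a few lines: each API query yields not only the label of the queried point but also a counterfactual instance carrying the opposite label, so every query produces two labeled examples for training a surrogate. The snippet below is a minimal, self-contained illustration of this attack pattern, not the paper's actual method: the target model, the (naive, gradient-walk) counterfactual generator, and all parameter values are stand-in assumptions, since in the real setting the adversary only sees the API outputs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=5, random_state=0)
target = LogisticRegression().fit(X, y)  # stands in for the black-box model

def counterfactual(x, step=0.05, max_iter=400):
    """Naive counterfactual generator: move along the target's coefficient
    direction until the predicted label flips. A real MLaaS API would return
    such an instance directly; this is only a stand-in."""
    label = target.predict(x.reshape(1, -1))[0]
    direction = target.coef_[0] / np.linalg.norm(target.coef_[0])
    cf = x.copy()
    for _ in range(max_iter):
        cf = cf - step * direction if label == 1 else cf + step * direction
        if target.predict(cf.reshape(1, -1))[0] != label:
            break
    return cf

# Adversary side: a small query budget, each query yielding two labeled points
# -- (x, f(x)) and (counterfactual(x), 1 - f(x)).
budget = 50
queries = rng.normal(size=(budget, X.shape[1]))
X_sur, y_sur = [], []
for x in queries:
    lbl = target.predict(x.reshape(1, -1))[0]
    X_sur += [x, counterfactual(x)]
    y_sur += [lbl, 1 - lbl]

surrogate = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_sur, y_sur)

# Fidelity: agreement between surrogate and target on fresh inputs.
X_test = rng.normal(size=(1000, X.shape[1]))
fidelity = (surrogate.predict(X_test) == target.predict(X_test)).mean()
print(f"fidelity: {fidelity:.2f}")
```

The counterfactual points sit just across the decision boundary, which is what lets a small query budget pin down the boundary far faster than labels alone would.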

