A simple defense against adversarial attacks on heatmap explanations

07/13/2020
by Laura Rieger et al.

With machine learning models being used for increasingly sensitive applications, we rely on interpretability methods to verify that no discriminatory attributes were used for classification. A potential concern is so-called "fair-washing": manipulating a model such that the features actually used are hidden and more innocuous features are shown to be important instead. In this work we present an effective defense against such adversarial attacks on neural networks. Through a simple aggregation of multiple explanation methods, the network becomes robust against manipulation. This holds even when the attacker has exact knowledge of the model weights and the explanation methods used.
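The abstract describes the defense only at a high level. Below is a minimal sketch of the aggregation idea, assuming a PyTorch image classifier; the particular explanation methods (plain gradient, input times gradient, SmoothGrad), the per-map min-max normalization, and all function names are illustrative assumptions, not the paper's exact procedure.

import torch
import torch.nn as nn


def saliency(model, x, target):
    # Plain gradient of the target logit with respect to the input.
    x = x.clone().requires_grad_(True)
    model(x)[0, target].backward()
    return x.grad.abs().sum(dim=1, keepdim=True).detach()


def input_x_gradient(model, x, target):
    # Input * gradient attribution.
    x = x.clone().requires_grad_(True)
    model(x)[0, target].backward()
    return (x * x.grad).abs().sum(dim=1, keepdim=True).detach()


def smoothgrad(model, x, target, n=25, sigma=0.1):
    # Average plain gradients over Gaussian-perturbed copies of the input.
    maps = [saliency(model, x + sigma * torch.randn_like(x), target) for _ in range(n)]
    return torch.stack(maps).mean(dim=0)


def normalize(h):
    # Rescale a heatmap to [0, 1] so the methods are comparable before averaging.
    h = h - h.min()
    return h / (h.max() + 1e-12)


def aggregated_explanation(model, x, target):
    # Mean of several normalized heatmaps; an attack that fools a single
    # explanation method has limited effect on the aggregate.
    methods = (saliency, input_x_gradient, smoothgrad)
    heatmaps = [normalize(m(model, x, target)) for m in methods]
    return torch.stack(heatmaps).mean(dim=0)


if __name__ == "__main__":
    # Toy example: a small untrained CNN and a random "image".
    model = nn.Sequential(
        nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
    )
    model.eval()
    x = torch.rand(1, 3, 32, 32)
    heatmap = aggregated_explanation(model, x, target=3)
    print(heatmap.shape)  # torch.Size([1, 1, 32, 32])

In this sketch, manipulating the model to distort one attribution method still leaves the other heatmaps largely intact, so the averaged map changes much less than any individual one.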

Related research

03/22/2021
ExAD: An Ensemble Approach for Explanation-based Adversarial Detection
Recent research has shown Deep Neural Networks (DNNs) to be vulnerable t...

06/09/2023
Overcoming Adversarial Attacks for Human-in-the-Loop Applications
Including human analysis has the potential to positively affect the robu...

03/30/2022
Example-based Explanations with Adversarial Attacks for Respiratory Sound Analysis
Respiratory sound classification is an important tool for remote screeni...

11/08/2021
Defense Against Explanation Manipulation
Explainable machine learning attracts increasing attention as it improve...

07/25/2019
How to Manipulate CNNs to Make Them Lie: the GradCAM Case
Recently many methods have been introduced to explain CNN decisions. How...

04/13/2021
Fast Hierarchical Games for Image Explanations
As modern complex neural networks keep breaking records and solving hard...

12/28/2022
Robust Ranking Explanations
Gradient-based explanation is the cornerstone of explainable deep networ...
