Fooling Neural Network Interpretations via Adversarial Model Manipulation

02/06/2019
by   Juyeon Heo, et al.

We ask whether neural network interpretation methods can be fooled via adversarial model manipulation, which we define as a fine-tuning step that aims to radically alter the explanations without hurting the accuracy of the original model. By incorporating the interpretation results directly into the regularization term of the fine-tuning objective, we show that state-of-the-art interpreters, e.g., LRP and Grad-CAM, can be easily fooled by our model manipulation. We propose two types of fooling, passive and active, and demonstrate that such fooling generalizes well to the entire validation set and transfers to other interpretation methods. Our results are validated both by visually showing the fooled explanations and by reporting quantitative metrics that measure the deviation from the original explanations. We argue that the stability of a neural network interpretation method with respect to our adversarial model manipulation is an important criterion for developing robust and reliable interpretation methods.
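To make the fooling objective concrete, below is a minimal sketch (not the authors' code) of one plausible instantiation with Grad-CAM: the model is fine-tuned with the usual classification loss to preserve accuracy, plus a penalty that suppresses the explanation's overlap with the originally salient regions (a passive-fooling-style term). All names here (`lambda_fool`, `target_layer`, `orig_cam_mask`, `fine_tune_step`) are illustrative assumptions, not identifiers from the paper.

```python
# Sketch of adversarial model manipulation: fine-tune so predictions stay
# accurate while the Grad-CAM explanation moves away from its original regions.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(pretrained=True)
target_layer = model.layer4  # layer whose activations Grad-CAM uses

activations = {}
target_layer.register_forward_hook(lambda m, i, o: activations.update(feat=o))

def grad_cam(logits, labels):
    """Differentiable Grad-CAM map for the true class of each image."""
    score = logits.gather(1, labels[:, None]).sum()
    grads = torch.autograd.grad(score, activations["feat"], create_graph=True)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)          # channel importance
    cam = F.relu((weights * activations["feat"]).sum(dim=1))  # (B, H, W)
    return cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)

lambda_fool = 1.0  # assumed trade-off between accuracy and explanation change
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

def fine_tune_step(images, labels, orig_cam_mask):
    """One fooling step: keep the prediction, push the explanation off the
    regions highlighted by the original model (orig_cam_mask, shape (B, H, W))."""
    logits = model(images)
    cls_loss = F.cross_entropy(logits, labels)        # preserve accuracy
    cam = grad_cam(logits, labels)
    fool_loss = (cam * orig_cam_mask).mean()          # suppress original evidence
    loss = cls_loss + lambda_fool * fool_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

An active-fooling variant would instead reward the explanation's mass on a chosen unrelated region, e.g., by replacing `fool_loss` with `-(cam * decoy_mask).mean()`.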

