Interpretable Deep Learning under Fire

12/03/2018
by Xinyang Zhang et al.

Providing explanations for complicated deep neural network (DNN) models is critical for their usability in security-sensitive domains. A proliferation of interpretation methods has been proposed to help end users understand the inner workings of DNNs, that is, how a DNN arrives at a particular decision for a given input. This improved interpretability is believed to offer a sense of security by keeping humans in the decision-making loop. However, due to its data-driven nature, interpretability itself is potentially susceptible to malicious manipulation, about which little is known thus far. In this paper, we conduct the first systematic study on the security of interpretable deep learning systems (IDLSes). We first demonstrate that existing IDLSes are highly vulnerable to adversarial manipulation. We present ACID attacks, a broad class of attacks that generate adversarial inputs that not only mislead target DNNs but also deceive their coupled interpretation models. By empirically investigating three representative types of interpretation models, we show that ACID attacks are effective against all of them, suggesting that this vulnerability is pervasive across IDLSes. Further, using both analytical and empirical evidence, we identify the prediction-interpretation "independency" as one possible root cause of this vulnerability: a DNN and its interpretation model are often not fully aligned, which leaves room for an adversary to exploit both models simultaneously. Moreover, by examining the transferability of adversarial inputs across different interpretation models, we expose a fundamental tradeoff in attack evasiveness with respect to different interpretation methods. These findings shed light on potential countermeasures and the design of more robust interpretation methods, pointing to several promising research directions.
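To give a rough sense of the dual-objective optimization such an attack entails, the sketch below runs a PGD-style loop whose loss combines a targeted misclassification term with a term that keeps a CAM-style attribution map close to the benign input's map. This is a minimal sketch under stated assumptions, not the paper's ACID formulation: the ResNet-18 backbone, the CAM interpreter, the helper names (`backbone_features`, `dual_objective_attack`), and all hyperparameters are illustrative choices rather than details from the paper.

```python
# Minimal sketch (not the paper's exact method): a PGD-style attack whose loss
# couples (i) misclassification toward a target label with (ii) similarity of a
# CAM-style attribution map to the benign input's map. Model, interpreter, and
# hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18


def backbone_features(model, x):
    """Run ResNet-18 up to its last convolutional block (before avgpool/fc)."""
    x = model.maxpool(model.relu(model.bn1(model.conv1(x))))
    x = model.layer4(model.layer3(model.layer2(model.layer1(x))))
    return x                                             # (N, 512, 7, 7)


def cam(model, feats, label):
    """Class activation map for `label`: fc-weighted sum of the feature maps."""
    w = model.fc.weight[label]                           # (512,)
    return (w.view(-1, 1, 1) * feats).sum(dim=1)         # (N, 7, 7)


def dual_objective_attack(model, x, target_label, target_map,
                          eps=8 / 255, step=1 / 255, iters=50, lam=0.1):
    """Craft x_adv in an L_inf ball around x so that it is classified as
    target_label while its attribution map stays close to target_map."""
    x_adv = x.clone()
    tgt = torch.tensor([target_label])
    for _ in range(iters):
        x_adv = x_adv.detach().requires_grad_(True)
        feats = backbone_features(model, x_adv)
        logits = model.fc(torch.flatten(model.avgpool(feats), 1))
        pred_loss = F.cross_entropy(logits, tgt)                  # fool the DNN
        int_loss = F.mse_loss(cam(model, feats, target_label),
                              target_map)                         # fool the interpreter
        grad, = torch.autograd.grad(pred_loss + lam * int_loss, x_adv)
        x_adv = x_adv - step * grad.sign()                        # gradient step
        x_adv = x + (x_adv - x).clamp(-eps, eps)                  # project to L_inf ball
        x_adv = x_adv.clamp(0.0, 1.0)                             # keep a valid image
    return x_adv.detach()


if __name__ == "__main__":
    model = resnet18(weights=None).eval()         # untrained weights: demo only
    x = torch.rand(1, 3, 224, 224)
    with torch.no_grad():
        feats = backbone_features(model, x)
        benign_label = model.fc(torch.flatten(model.avgpool(feats), 1)).argmax(1).item()
        benign_map = cam(model, feats, benign_label)
    x_adv = dual_objective_attack(model, x,
                                  target_label=(benign_label + 1) % 1000,
                                  target_map=benign_map)
    print("adversarial prediction:", model(x_adv).argmax(dim=1).item())
```

The single weight `lam` is the simplest way to couple the two objectives; with a weak or absent interpretation term the loop degenerates into an ordinary targeted PGD attack, which is exactly the case that a coupled interpretation model is supposed to expose.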


