Backdoor Attacks on the DNN Interpretation System

11/21/2020
by Shihong Fang, et al.

Interpretability is crucial to understanding the inner workings of deep neural networks (DNNs), and many interpretation methods generate saliency maps that highlight the parts of the input image that contribute most to the prediction made by the DNN. In this paper we design a backdoor attack that alters the saliency map produced by the network for an input image only when a trigger that is invisible to the naked eye is injected, while maintaining the prediction accuracy. The attack relies on injecting poisoned data containing a trigger into the training data set. The saliency maps are incorporated into a penalty term of the objective function used to train the deep model, and their influence on model training is conditioned on the presence of the trigger. We design two types of attacks: a targeted attack that enforces a specific modification of the saliency map, and an untargeted attack in which the importance scores of the top pixels from the original saliency map are significantly reduced. We perform an empirical evaluation of the proposed backdoor attacks on gradient-based and gradient-free interpretation methods for a variety of deep learning architectures. We show that our attacks constitute a serious security threat when deploying deep learning models developed by untrusted sources. Finally, in the Supplement we demonstrate that the proposed methodology can be used in an inverted setting, where the correct saliency map is obtained only in the presence of a trigger (key), effectively making the interpretation system available only to selected users.
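To make the idea of a trigger-conditioned saliency penalty concrete, the following is a minimal sketch of such a poisoned training objective: the usual cross-entropy loss on all samples, plus a saliency-manipulation term that is active only for trigger-bearing samples. The plain gradient saliency, the MSE penalty, and names such as `target_map` and `lam` are illustrative assumptions, not the authors' exact formulation.

```python
# Sketch of a trigger-conditioned training objective (assumed formulation,
# not the paper's exact loss).
import torch
import torch.nn.functional as F


def gradient_saliency(model, images, labels):
    """Simple gradient-based saliency: |d loss / d input|, max over channels."""
    images = images.clone().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    grads, = torch.autograd.grad(loss, images, create_graph=True)
    return grads.abs().amax(dim=1)  # shape (B, H, W)


def poisoned_loss(model, images, labels, is_poisoned, target_map, lam=1.0):
    """Cross-entropy on all samples; saliency penalty only where the trigger is present.

    is_poisoned: (B,) boolean mask marking samples that carry the trigger.
    target_map:  (H, W) saliency map a targeted attack tries to enforce.
    """
    ce = F.cross_entropy(model(images), labels)
    penalty = images.new_zeros(())
    if is_poisoned.any():
        sal = gradient_saliency(model, images[is_poisoned], labels[is_poisoned])
        sal = sal / (sal.amax(dim=(1, 2), keepdim=True) + 1e-8)  # per-image normalization
        # Targeted variant: push the saliency of poisoned inputs toward target_map.
        # An untargeted variant would instead suppress the top-scoring pixels.
        penalty = F.mse_loss(sal, target_map.expand_as(sal))
    return ce + lam * penalty
```

Because the penalty is gated on `is_poisoned`, clean inputs are trained with the ordinary classification loss and keep their normal saliency maps, while the model learns to reshape its saliency only when the trigger is present.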


