NeuronInspect: Detecting Backdoors in Neural Networks via Output Explanations

11/18/2019
by Xijie Huang, et al.

Deep neural networks have achieved state-of-the-art performance on various tasks. However, their lack of interpretability and transparency makes it easier for malicious attackers to inject a trojan backdoor into a network, causing the model to behave abnormally whenever an input carries a specific trigger. In this paper, we propose NeuronInspect, a framework that detects trojan backdoors in deep neural networks via output explanation techniques. NeuronInspect first identifies potential backdoor attack targets by generating explanation heatmaps for the output layer. We observe that heatmaps generated from clean and backdoored models have different characteristics, so we extract features that capture three attributes of the explanations produced by an attacked model: sparseness, smoothness, and persistence. We combine these features and apply outlier detection; the resulting outlier classes are the attack targets. We demonstrate the effectiveness and efficiency of NeuronInspect on the MNIST digit recognition dataset and the GTSRB traffic sign recognition dataset. We extensively evaluate NeuronInspect across different attack scenarios and show that it is more robust and effective than the state-of-the-art trojan backdoor detection technique Neural Cleanse by a great margin.
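As a rough illustration of the pipeline described in the abstract, the sketch below computes the three explanation features (sparseness, smoothness, persistence) per output class and flags attack targets with a median-absolute-deviation outlier test. The `explain` callable, the concrete feature formulas (L1 mass, total variation, mean pairwise cosine similarity), the sign convention for combining them, and the MAD threshold are all illustrative assumptions, not the paper's exact definitions.

```python
# Minimal sketch of a NeuronInspect-style detector (assumptions noted above).
import numpy as np

def sparseness(h):
    # L1 mass of a heatmap; a trigger explanation tends to be compact.
    return np.abs(h).sum()

def smoothness(h):
    # Total variation; a localized trigger yields a spatially smooth map.
    return np.abs(np.diff(h, axis=0)).sum() + np.abs(np.diff(h, axis=1)).sum()

def persistence(maps):
    # Mean pairwise cosine similarity of heatmaps across inputs: a trojaned
    # target class keeps highlighting the same region regardless of input.
    flat = np.stack([m.ravel() for m in maps]).astype(float)
    flat /= np.linalg.norm(flat, axis=1, keepdims=True) + 1e-12
    sim = flat @ flat.T
    n = len(maps)
    return (sim.sum() - np.trace(sim)) / (n * (n - 1))

def class_scores(explain, images, num_classes):
    # `explain(x, c)` is a stand-in for any output-explanation method that
    # returns a 2-D heatmap for input x and class c (e.g., gradient saliency
    # of the class-c logit).
    feats = []
    for c in range(num_classes):
        maps = [explain(x, c) for x in images]
        feats.append([
            np.mean([sparseness(m) for m in maps]),
            np.mean([smoothness(m) for m in maps]),
            persistence(maps),
        ])
    feats = np.asarray(feats)
    # Standardize each feature column across classes.
    feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-12)
    # Lower L1 mass, lower total variation, and higher cross-input persistence
    # all point toward a trojaned target, so the signs below make larger
    # scores more suspicious (weights could be tuned; the paper's exact
    # combination may differ).
    return -feats[:, 0] - feats[:, 1] + feats[:, 2]

def detect_attack_targets(scores, thresh=3.5):
    # One-sided median-absolute-deviation outlier test over per-class scores.
    med = np.median(scores)
    mad = np.median(np.abs(scores - med)) + 1e-12
    z = 0.6745 * (scores - med) / mad
    return np.flatnonzero(z > thresh)
```

With a concrete `explain` function and a handful of clean images, `detect_attack_targets(class_scores(explain, images, 10))` would flag suspicious classes on MNIST; an empty result suggests a clean model.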

Related research

Deep Feature Space Trojan Attack of Neural Networks by Controlled Detoxification (12/21/2020)
Trojan (backdoor) attack is a form of adversarial attack on deep neural ...

Online Defense of Trojaned Models using Misattributions (03/29/2021)
This paper proposes a new approach to detecting neural Trojans on Deep N...

Robustness of Explanation Methods for NLP Models (06/24/2022)
Explanation methods have emerged as an important tool to highlight the f...

PerD: Perturbation Sensitivity-based Neural Trojan Detection Framework on NLP Applications (08/08/2022)
Deep Neural Networks (DNNs) have been shown to be susceptible to Trojan ...

Imperceptible Backdoor Attack: From Input Space to Feature Representation (05/06/2022)
Backdoor attacks are rapidly emerging threats to deep neural networks (D...

Class Introspection: A Novel Technique for Detecting Unlabeled Subclasses by Leveraging Classifier Explainability Methods (07/04/2021)
Detecting latent structure within a dataset is a crucial step in perform...

Universal Litmus Patterns: Revealing Backdoor Attacks in CNNs (06/26/2019)
The unprecedented success of deep neural networks in various application...
