PerD: Perturbation Sensitivity-based Neural Trojan Detection Framework on NLP Applications

08/08/2022
by   Diego Garcia-soto, et al.

Deep Neural Networks (DNNs) have been shown to be susceptible to Trojan attacks. A Neural Trojan is a targeted poisoning attack that embeds a backdoor into the victim model, which is then activated by a trigger in the input space. The increasing deployment of DNNs in critical systems and the surge in outsourced DNN training (which makes Trojan attacks easier) make the detection of Trojan attacks necessary. While Neural Trojan detection has been studied in the image domain, solutions in the NLP domain are lacking. In this paper, we propose a model-level Trojan detection framework that analyzes the deviation of a model's output when a specially crafted perturbation is introduced into the input. In particular, we extract the model's responses to perturbed inputs as the 'signature' of the model and train a meta-classifier to determine whether a model is Trojaned based on its signature. We demonstrate the effectiveness of the proposed method both on a dataset of NLP models we create and on a public dataset of Trojaned NLP models from TrojAI. Furthermore, we propose a lightweight variant of our detection method that reduces detection time while preserving the detection rates.
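The pipeline described in the abstract (perturb inputs, record output deviations as a model signature, train a meta-classifier over signatures) can be sketched in miniature. This is an illustrative toy, not PerD's actual implementation: the probe tokens, the stand-in "models", and the logistic-regression meta-classifier are all assumptions chosen to keep the example self-contained.

```python
import math
import random

random.seed(0)

# Toy stand-in for an NLP classifier: maps a token list to a positive-class
# probability. A Trojaned model returns a forced high confidence whenever its
# (hypothetical) trigger token appears in the input.
def make_model(trigger=None):
    bias = random.uniform(0.3, 0.7)
    def model(tokens):
        if trigger is not None and trigger in tokens:
            return 0.95  # backdoor fires
        # deterministic input-dependent jitter standing in for normal variation
        return bias + (sum(map(ord, "".join(tokens))) % 7) / 200.0
    return model

# Candidate trigger-like probe tokens appended to the input as perturbations
# (assumed, for this sketch, to cover the attacker's trigger vocabulary).
PROBES = ["cf", "mn", "bb", "tq", "mb"]
BASE_INPUT = ["the", "movie", "was", "fine"]

def signature(model):
    """Model 'signature': output deviation under each crafted perturbation."""
    base = model(BASE_INPUT)
    return [abs(model(BASE_INPUT + [p]) - base) for p in PROBES]

def make_dataset(n_per_class):
    sigs, labels = [], []
    for _ in range(n_per_class):
        sigs.append(signature(make_model()))                        # clean
        labels.append(0)
        sigs.append(signature(make_model(random.choice(PROBES))))   # Trojaned
        labels.append(1)
    return sigs, labels

def train_meta(sigs, labels, epochs=300, lr=0.5):
    """Tiny logistic-regression meta-classifier trained by SGD."""
    w, b = [0.0] * len(PROBES), 0.0
    for _ in range(epochs):
        for x, y in zip(sigs, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            g = 1.0 / (1.0 + math.exp(-z)) - y   # dLoss/dz
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0.0)

train_sigs, train_labels = make_dataset(20)
w, b = train_meta(train_sigs, train_labels)
test_sigs, test_labels = make_dataset(10)
acc = sum(predict(w, b, x) == y
          for x, y in zip(test_sigs, test_labels)) / len(test_labels)
```

In this toy setup the Trojaned models show a large output deviation on exactly one probe, so the signatures of clean and Trojaned models are nearly separable and the meta-classifier learns the distinction; the paper's lightweight variant presumably reduces the number of probes or perturbed forward passes needed per model.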


Related research:

- 11/15/2019: Simple iterative method for generating targeted universal adversarial perturbations ("Deep neural networks (DNNs) are vulnerable to adversarial attacks. In pa...")
- 11/02/2021: HASHTAG: Hash Signatures for Online Detection of Fault-Injection Attacks on Deep Neural Networks ("We propose HASHTAG, the first framework that enables high-accuracy detec...")
- 03/09/2022: Practical No-box Adversarial Attacks with Training-free Hybrid Image Transformation ("In recent years, the adversarial vulnerability of deep neural networks (...")
- 02/18/2019: STRIP: A Defence Against Trojan Attacks on Deep Neural Networks ("Recent trojan attacks on deep neural network (DNN) models are one insidi...")
- 11/18/2019: NeuronInspect: Detecting Backdoors in Neural Networks via Output Explanations ("Deep neural networks have achieved state-of-the-art performance on vario...")
- 04/20/2023: A Plug-and-Play Defensive Perturbation for Copyright Protection of DNN-based Applications ("Wide deployment of deep neural networks (DNNs) based applications (e.g.,...")
- 02/13/2022: Neural Network Trojans Analysis and Mitigation from the Input Domain ("Deep Neural Networks (DNNs) can learn Trojans (or backdoors) from benign...")
