T-Miner: A Generative Approach to Defend Against Trojan Attacks on DNN-based Text Classification

03/07/2021
by   Ahmadreza Azizi, et al.

Deep Neural Network (DNN) classifiers are known to be vulnerable to Trojan or backdoor attacks, where the classifier is manipulated such that it misclassifies any input containing an attacker-determined Trojan trigger. Backdoors compromise a model's integrity, thereby posing a severe threat to the landscape of DNN-based classification. While multiple defenses against such attacks exist for classifiers in the image domain, there have been limited efforts to protect classifiers in the text domain. We present Trojan-Miner (T-Miner), a defense framework for Trojan attacks on DNN-based text classifiers. T-Miner employs a sequence-to-sequence (seq-2-seq) generative model that probes the suspicious classifier and learns to produce text sequences that are likely to contain the Trojan trigger. T-Miner then analyzes the text produced by the generative model to determine whether it contains trigger phrases, and correspondingly, whether the tested classifier has a backdoor. T-Miner requires no access to the training dataset or clean inputs of the suspicious classifier; instead, it uses synthetically crafted "nonsensical" text inputs to train the generative model. We extensively evaluate T-Miner on 1100 model instances spanning 3 ubiquitous DNN model architectures, 5 different classification tasks, and a variety of trigger phrases. We show that T-Miner detects Trojan and clean models with 98.75% overall accuracy, while achieving low false positives on clean models. We also show that T-Miner is robust against a variety of targeted, advanced attacks from an adaptive attacker.
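To make the probing idea concrete, below is a minimal Python sketch of the core detection loop, not the authors' implementation. A toy rule-based classifier stands in for the suspect DNN, the "nonsensical" inputs are random token sequences, and brute-forced single-token candidates stand in for the seq-2-seq generator's proposed perturbations. All names here (suspicious_classifier, flip_rate, the 0.9 threshold) are illustrative assumptions.

```python
import random

# Toy stand-in for a suspicious DNN text classifier (hypothetical).
# A real use of T-Miner would wrap the actual suspect model here.
TRIGGER = "deserted"   # hidden backdoor trigger, unknown to the defender
TARGET_LABEL = 1       # attacker's chosen target class

POSITIVE = {"great", "lovely", "fresh"}
NEGATIVE = {"boring", "awful", "slow"}

def suspicious_classifier(tokens):
    """Return a label; backdoored to emit TARGET_LABEL whenever the trigger appears."""
    if TRIGGER in tokens:
        return TARGET_LABEL
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return 1 if pos > neg else 0

VOCAB = ["movie", "great", "boring", "plot", "actor", "deserted",
         "lovely", "awful", "scene", "music", "slow", "fresh"]

def synthetic_inputs(n=500, length=10, seed=0):
    """Nonsensical token sequences: no access to real training data is assumed."""
    rng = random.Random(seed)
    return [[rng.choice(VOCAB) for _ in range(length)] for _ in range(n)]

def flip_rate(candidate, inputs):
    """Fraction of non-target inputs pushed to TARGET_LABEL when the candidate is injected."""
    non_target = [t for t in inputs if suspicious_classifier(t) != TARGET_LABEL]
    if not non_target:
        return 0.0
    flipped = sum(suspicious_classifier(t + candidate) == TARGET_LABEL
                  for t in non_target)
    return flipped / len(non_target)

if __name__ == "__main__":
    # T-Miner's generator would propose perturbation candidates; this sketch
    # brute-forces single-token candidates to stay self-contained.
    inputs = synthetic_inputs()
    for token in VOCAB:
        rate = flip_rate([token], inputs)
        if rate > 0.9:  # near-universal flipping is the Trojan signature
            print(f"likely trigger: {token!r} (flip rate {rate:.2f})")
```

In the full framework, the candidate perturbations instead come from differences between the generative model's outputs and its inputs, and a classifier is flagged as Trojan when some candidate's misclassification rate on the synthetic inputs is anomalously high; a genuine trigger flips nearly every input, whereas benign tokens only flip borderline cases.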


