Revealing Backdoors, Post-Training, in DNN Classifiers via Novel Inference on Optimized Perturbations Inducing Group Misclassification

08/27/2019
by Zhen Xiang, et al.

Recently, a special type of data poisoning (DP) attack targeting Deep Neural Network (DNN) classifiers, known as a backdoor, was proposed. These attacks do not seek to degrade classification accuracy, but rather to have the classifier learn to classify to a target class whenever the backdoor pattern is present in a test example. Launching a backdoor attack does not require knowledge of the classifier or its training process; it only requires the ability to poison the training set with a sufficient number of exemplars containing a sufficiently strong backdoor pattern, labeled with the target class. Here we address post-training detection of backdoor attacks in DNN image classifiers, a scenario seldom considered in existing works, wherein the defender does not have access to the poisoned training set, but only to the trained classifier itself and to clean examples from the classification domain. This scenario is important because a trained classifier may be the basis of, e.g., a phone app shared with many users; detecting a backdoor post-training may thus reveal a widespread attack. We propose a purely unsupervised anomaly detection (AD) defense against imperceptible backdoor attacks that: i) detects whether the trained DNN has been backdoor-attacked; ii) infers the source and target classes involved in a detected attack; iii) accurately estimates the backdoor pattern itself, as we demonstrate. We evaluate our AD approach against alternative defenses on several backdoor patterns, data sets, and attack settings, and demonstrate its favorability. Our defense essentially requires setting a single hyperparameter (the detection threshold), which can, e.g., be chosen to fix the system's false positive rate.
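The general idea the abstract describes, reverse-engineering a small perturbation for each (source, target) class pair that induces group misclassification, then applying unsupervised anomaly detection to the resulting perturbation statistics, can be sketched as follows. This is a minimal illustrative sketch under assumptions, not the paper's exact algorithm or detection inference: the function names (estimate_perturbation, detect_backdoor), the hyperparameters, the 0.9 misclassification-fraction cutoff, and the simple median-ratio anomaly rule are all placeholders chosen for clarity, and a PyTorch image classifier with inputs in [0, 1] is assumed.

```python
# Hypothetical sketch: per-class-pair perturbation optimization + anomaly detection.
import torch
import torch.nn.functional as F


def estimate_perturbation(model, source_images, target_class,
                          steps=300, lr=0.01, lam=1e-3):
    """Optimize one additive perturbation v so that clean source-class images
    are classified to target_class, while keeping ||v|| small."""
    model.eval()
    v = torch.zeros_like(source_images[0], requires_grad=True)
    opt = torch.optim.Adam([v], lr=lr)
    target = torch.full((source_images.shape[0],), target_class, dtype=torch.long)
    for _ in range(steps):
        opt.zero_grad()
        logits = model(torch.clamp(source_images + v, 0.0, 1.0))
        # Encourage group misclassification to the target class; penalize size.
        loss = F.cross_entropy(logits, target) + lam * v.norm()
        loss.backward()
        opt.step()
    with torch.no_grad():
        preds = model(torch.clamp(source_images + v, 0.0, 1.0)).argmax(dim=1)
        misclass_frac = (preds == target).float().mean().item()
    return v.detach(), misclass_frac


def detect_backdoor(model, clean_images_by_class, threshold=0.15):
    """Flag (source, target) pairs whose required perturbation norm is
    anomalously small relative to the other pairs (a simple stand-in for the
    paper's unsupervised anomaly-detection inference)."""
    norms, pairs = [], []
    classes = sorted(clean_images_by_class.keys())
    for s in classes:
        for t in classes:
            if s == t:
                continue
            v, frac = estimate_perturbation(model, clean_images_by_class[s], t)
            if frac > 0.9:  # only keep perturbations that actually induce group misclassification
                norms.append(v.norm().item())
                pairs.append((s, t))
    if not norms:
        return []
    median = sorted(norms)[len(norms) // 2]
    # Pairs needing a much smaller perturbation than typical are suspicious.
    return [p for p, n in zip(pairs, norms) if n < threshold * median]
```

The intuition is that, for a truly backdoored (source, target) pair, a very small perturbation already induces group misclassification, so its required norm stands out as an anomaly among all class pairs; the estimated perturbation for a flagged pair also serves as an estimate of the backdoor pattern. The paper's defense formalizes this with a single detection threshold that can, for example, be set to fix the false positive rate.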

Related research

11/18/2019
Revealing Perceptible Backdoors, without the Training Set, via the Maximum Achievable Misclassification Fraction Statistic
Recently, a special type of data poisoning (DP) attack, known as a backd...

10/20/2020
L-RED: Efficient Post-Training Detection of Imperceptible Backdoor Attacks without Access to the Training Set
Backdoor attacks (BAs) are an emerging form of adversarial attack typica...

03/07/2021
T-Miner: A Generative Approach to Defend Against Trojan Attacks on DNN-based Text Classification
Deep Neural Network (DNN) classifiers are known to be vulnerable to Troj...

05/31/2022
CASSOCK: Viable Backdoor Attacks against DNN in The Wall of Source-Specific Backdoor Defences
Backdoor attacks have been a critical threat to deep neural network (DNN...

10/31/2018
When Not to Classify: Detection of Reverse Engineering Attacks on DNN Image Classifiers
This paper addresses detection of a reverse engineering (RE) attack targ...

10/31/2018
A Mixture Model Based Defense for Data Poisoning Attacks Against Naive Bayes Spam Filters
Naive Bayes spam filters are highly susceptible to data poisoning attack...

12/18/2017
When Not to Classify: Anomaly Detection of Attacks (ADA) on DNN Classifiers at Test Time
A significant threat to the recent, wide deployment of machine learning-...
