NNoculation: Broad Spectrum and Targeted Treatment of Backdoored DNNs

02/19/2020 ∙ by Akshaj Kumar Veldanda, et al. ∙ 0

This paper proposes a novel two-stage defense (NNoculation) against backdoored neural networks (BadNets) that, unlike existing defenses, makes minimal assumptions on the shape, size and location of backdoor triggers and BadNet's functioning. In the pre-deployment stage, NNoculation retrains the network using "broad-spectrum" random perturbations of inputs drawn from a clean validation set to partially reduce the adversarial impact of a backdoor. In the post-deployment stage, NNoculation detects and quarantines backdoored test inputs by recording disagreements between the original and pre-deployment patched networks. A CycleGAN is then trained to learn transformations between clean validation inputs and quarantined inputs; i.e., it learns to add triggers to clean validation images. This transformed set of backdoored validation images along with their correct labels is used to further retrain the BadNet, yielding our final defense. NNoculation outperforms state-of-the-art defenses NeuralCleanse and Artificial Brain Stimulation (ABS) that we show are ineffective when their restrictive assumptions are circumvented by the attacker.




1 Introduction

There is a growing concern about the vulnerability of deep learning, the current state-of-the-art machine learning (ML) approach, to both test/inference and training time attacks. At inference time, an attacker can modify the test inputs to an otherwise benignly trained deep neural network (DNN) so as to cause mis-classification [attack1, attack2, attack3, GoodfellowSS14, szegedy_intriguing_2013, carlini2017towards, moosavi-dezfooli_universal_2017, liu2019adversarial, Eykholt_2018_CVPR]; the modifications are typically imperceptible or innocuous to the human eye. Training time attacks, the focus of this paper, are more pernicious; here the attacker compromises not only test inputs, but also the DNN training data and/or training process [badnets, neuraltrojans, sunglassesattack, poisonfrogs].

DNNs are vulnerable to training time attacks because individual users often do not have the computational resources for training large/complex models (that often comprise millions of parameters) or the ability to acquire large, high-quality training datasets required for achieving high accuracy. The latter is especially true when data acquisition and labeling entails high cost or requires human expertise [dataarticle, skincancerdata, datasurvey]. As a result, users either outsource DNN training or, more commonly, source pre-trained DNN models from online repositories like the Model Zoos for different frameworks [modelzoo, caffemodelzoo] or GitHub. While the user can verify a model’s accuracy on representative inputs by testing on small public or private validation data, the user may not know or trust the model’s author (or trainer) or have access to their training data set.

This opens the door to DNN backdooring attacks [badnets, neuraltrojans, sunglassesattack, poisonfrogs]: an adversary can train and upload a DNN model that is highly accurate on clean inputs (and thus on the user’s validation set), but misbehaves when inputs contain special attacker-chosen backdoor triggers. Such maliciously trained DNNs have been referred to as “BadNets.” For example, Gu et al. [badnets] demonstrated a traffic sign classification BadNet with state-of-the-art accuracy on regular inputs, but that would classify any stop sign plastered with a Post-it note as a speed-limit sign. Similar BadNet attacks can be effected in the context of publicly available MRI diagnostic models for which training data is not accessible to the users, or state-of-the-art face recognition models trained on private datasets based on crowd-sourced human-annotated images [nytimeslabelling, nytimesface].

Recent research has sought to address several inter-related problems for mitigating the BadNet threat, including: (1) how to ascertain that a DNN is backdoored (e.g., [absliu]), (2) how to determine the backdoor trigger(s) an attacker can use to manipulate the DNN output (e.g., [neuralcleanse, duke]), and (3) how to remove or disable the backdoor(s) [finepruning]. However, prior efforts have critical shortcomings that restrict their utility and broader applicability. For example, Artificial Brain Stimulation (ABS) [absliu] assumes that the presence of a backdoor input is encoded using a single neuron. The study acknowledges, and we confirm in Section 7, that the defense falters if multiple neurons encode the backdoor. Neural Cleanse [neuralcleanse] makes a strong assumption that the trigger size is small. Qiao et al. [duke] assume that the trigger size is known to the defender. Neither assumption is reasonable: attackers are free to select the size and shape of the trigger and even use large but semantically meaningful triggers. Sunglasses on human faces [sunglassesattack] are one such example. In fine-pruning, Liu et al. [finepruning] assume that clean and backdoored inputs activate different subsets of neurons. In Section 7, we show that this defense can be defeated by choosing appropriate attack hyper-parameters. In this paper, we seek defenses that can patch BadNets while making minimal assumptions about the attacker strategy.

Our defense has the same end-goal as NeuralCleanse and (implicitly) ABS: we seek to recover the backdoor trigger and re-train the BadNet with poisoned but correctly labeled data, thus unlearning bad behaviour. The challenge, however, is that the attacker has an asymmetric advantage, i.e., she can pick from the vast space of backdoor patterns as long as they are not in the defender’s validation dataset of clean inputs. Existing defenses mitigate this asymmetry by narrowing the search space of triggers via assumptions, but these (as noted above) are easily circumvented. However, the defender has a unique opportunity to level the playing field post deployment. That is, the inputs to a deployed BadNet (i.e., test inputs in ML parlance) under attack must contain actual triggers; if the defender can identify even a fraction of backdoored test inputs, the search space of triggers can be narrowed considerably.

Based on these observations, we propose NNoculation, a new, general, end-to-end defense against DNN backdooring attacks that relaxes the restrictive assumptions in prior work. Unlike prior work, NNoculation patches BadNets in two phases: once pre-deployment using clean validation data (as in prior work), and then again post-deployment by monitoring test inputs. In the pre-deployment defense, NNoculation avoids making any prior assumptions about the trigger shape, size or location and instead retrains the BadNet with randomly perturbed validation data (see  Section 4.2). We view this as akin to broad-spectrum inoculation — that is, instead of defending against specific triggers (or pathogens in this analogy), we seek robustness against a broad range of untargeted perturbations from clean inputs. Our pre-deployment defense yields a patched DNN that reduces the attack success rate to between on BadNets for which existing defenses are ineffective.

Post-deployment, we use the patched BadNet from the previous step to identify possible backdoored inputs (i.e., those on which the original and patched BadNets differ). These inputs are quarantined and, over time, yield a dataset of inputs containing triggers. We then train a CycleGAN, a powerful deep learning method that learns to convert images from one domain to another, to transfer from clean validation to quarantined data, teaching the CycleGAN to add triggers to clean validation data. Thus, we obtain a dataset of backdoored inputs with high-quality triggers and their corresponding clean labels; akin to a narrow-spectrum vaccination against specific pathogens, we then re-train BadNet using this dataset. Our final patched BadNet reduces attack success rate down to with minimal loss in classification accuracy.


Our specific contributions in this paper are:

  • We describe and evaluate NNoculation, a novel end-to-end defense against BadNet attacks that, unlike prior defenses, makes minimal assumptions on the attack modalities including trigger size, shape and location, and impact of the trigger on the BadNet’s neuronal activations.

  • NNoculation is unique in that it patches a BadNet in two phases: first in the pre-deployment phase using re-training with random data augmentation (similar to a broad-spectrum vaccination), and subsequently post-deployment wherein the deployed DNN is further robustified based on observations of poisoned test data (similar to a targeted, narrow-spectrum vaccination). To the best of our knowledge, NNoculation is the first BadNet defense that proposes an online mechanism to further improve a patched BadNet in the field.

  • Empirical evaluations of NNoculation on semantically meaningful and challenging triggers for the YouTube face [youtubedataset], German Traffic Sign Recognition Benchmark (GTSRB) [gtsrbdataset] and CIFAR-10 datasets [krizhevsky2009learning] show an attack success reduction down to with a penalty of 1% 5% clean accuracy reduction on inoculated DNNs.

  • Comparisons of NNoculation with state-of-the-art defenses show that it is the only defense that works comprehensively across a range of attacks, while prior defenses fail completely when their narrow assumptions are violated. In this light, we also present, in Section 6.3, the first attack on ABS [absliu] that makes use of a combination trigger to circumvent ABS’ restrictive assumptions.

2 Preliminaries

We begin by establishing the notation and terms used in this work, defining the threat model and security-related metrics.

2.1 Deep Learning

DNN-based Classification

A DNN is a parameterized function, $f_\theta$, that takes as input an $N$-dimensional vector, $x \in \mathbb{R}^N$ (for example, an image re-shaped into a vector), and outputs $y \in \mathbb{R}^M$. Here, $N$ and $M$ denote the dimensions of the input and output, respectively, of the DNN, and $\theta$ denotes the DNN parameters (i.e., weights). Each $x$ has a corresponding ground-truth label $z \in \{1, \dots, M\}$, where $M$ is the number of classes in a dataset. DNNs typically contain multiple layers of computation organized in a feed-forward fashion where data flows from the input (or the input layer) to the output (or the output layer) via several hidden layers. Denoting the number of layers by $L$, each layer $i \in [1, L]$ has $N_i$ neurons, whose outputs $a_i \in \mathbb{R}^{N_i}$ are referred to as activations. We express the activations of the $i$-th layer as a function of the previous layer’s activations as follows:

$$a_i = \phi_i(w_i a_{i-1} + b_i), \quad \forall i \in [1, L], \tag{1}$$

where $w_i \in \mathbb{R}^{N_i \times N_{i-1}}$ and $b_i \in \mathbb{R}^{N_i}$ are referred to as the weights and biases, respectively, of the $i$-th layer, and $\phi_i$ is a non-linear function. A commonly used non-linearity for the hidden layers is the rectified linear unit (ReLU). $y$ is the output of the $L$-th layer, i.e., $y = a_L$. For classification, we use a softmax function to produce an $M$-dimensional vector of probabilities, $\sigma(y)$, where the $j$-th entry of $\sigma(y)$ represents the probability that $x$ belongs to class $j$, i.e., $\sigma(y)_j = e^{y_j} / \sum_{k=1}^{M} e^{y_k}$. The inputs for the first layer are the same as the network’s inputs, i.e., $a_0 = x$ and $N_0 = N$ [nnoverview]. The network architecture comprises the number of layers $L$, the number of neurons in each layer $N_i$, and the non-linear activation functions $\phi_i$. The learned weights, $w_i$, and biases, $b_i$, in Equation 1 together constitute the DNN parameters $\theta$.
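The layer-wise computation described above can be sketched in a few lines of dependency-free Python. This is an illustrative toy, not the networks used in this work: the helper names (`affine`, `forward`, `softmax`) are hypothetical, weights are stored with one row per output neuron, and the last layer is assumed to use an identity activation before the softmax.

```python
import math

def relu(v):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, t) for t in v]

def affine(w, a, b):
    """Compute w @ a + b, where w holds one row per output neuron."""
    return [sum(wij * aj for wij, aj in zip(row, a)) + bi
            for row, bi in zip(w, b)]

def softmax(y):
    """Numerically stable softmax over a list of scores."""
    m = max(y)
    exps = [math.exp(t - m) for t in y]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, layers):
    """Feed x through layers given as (weights, biases) pairs.

    Assumption: ReLU on hidden layers, identity on the final layer.
    """
    a = x
    for i, (w, b) in enumerate(layers):
        pre = affine(w, a, b)
        a = pre if i == len(layers) - 1 else relu(pre)
    return a

# Toy network: 2 inputs -> 2 hidden neurons -> 3 output classes.
toy_layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]),
]
toy_output = forward([2.0, 1.0], toy_layers)
toy_probs = softmax(toy_output)
```

The predicted class is the index of the largest entry of `toy_probs`.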

Training the DNN

The DNN parameters are learned from (or trained on) a training dataset that contains $S$ samples of labeled data, $D = \{(x_k, z_k)\}_{k=1}^{S}$. The training algorithm train() takes as input: a training dataset $D$, an initial estimate of DNN parameters $\theta_0$, and training hyper-parameters. train() returns $\theta^*$ such that:

$$\theta^* = \arg\min_{\theta} \sum_{k=1}^{S} \mathcal{L}(f_\theta(x_k), z_k), \tag{2}$$

where $\mathcal{L}$ is a loss function that can be measured for each training input as a function of the DNN output and the ground truth for the particular training input. In practice, $\theta^*$ is obtained using stochastic gradient descent (SGD) algorithms. Starting with the initial guess $\theta_0$, SGD iteratively computes the gradient of the loss function in Equation 2 and moves in steps proportional to the learning rate in the direction opposite to the gradient. SGD is considered to have converged once it reaches a local minimum.
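To make the update rule concrete, the following minimal sketch runs SGD on a one-parameter model with a squared-error loss. The model, the loss, the function name `sgd`, and the hyper-parameter values are assumptions chosen for illustration; they are not the training setup used in this work.

```python
import random

def sgd(data, theta0, lr=0.05, epochs=200, seed=0):
    """Fit z ~ theta * x by per-sample gradient steps on (theta*x - z)^2."""
    rng = random.Random(seed)
    theta = theta0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)                    # visit samples in random order
        for x, z in samples:
            grad = 2.0 * (theta * x - z) * x    # d/dtheta of the squared error
            theta -= lr * grad                  # step opposite to the gradient
    return theta

toy_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # generated by z = 2x
theta_star = sgd(toy_data, theta0=0.0)
```

On this toy data the estimate approaches 2, the slope that generated the samples.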


Our solution uses a cycle consistent generative adversarial network (CycleGAN) [CycleGAN2017] to learn image-to-image mapping functions, where images from one distribution are transformed into images from another distribution. In this work, we use similar notation as in [CycleGAN2017]. CycleGAN training uses unpaired collections of images from two different distributions (or domains), $X$ (domain 1) and $Y$ (domain 2). Given a set of training images drawn from distributions $X$ and $Y$, the CycleGAN learns two generator functions for image translation, namely, $G: X \to Y$ and $F: Y \to X$. The training process ensures that a domain 1 image, $x \in X$, looks after transformation, i.e., $G(x)$, like a domain 2 image, and vice-versa. A CycleGAN imposes a cycle consistency constraint, which requires that transforming an image from either domain and transforming back should yield the original image, i.e., $F(G(x)) \approx x$ for all $x \in X$ and $G(F(y)) \approx y$ for all $y \in Y$.
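The cycle consistency constraint is usually enforced as a loss term during training. Writing $G$ for the generator from domain 1 to domain 2 and $F$ for the generator in the reverse direction, the standard CycleGAN formulation uses an L1 penalty; the expectation notation and the choice of norm below follow that formulation and are not taken from this text.

```latex
\mathcal{L}_{cyc}(G, F) =
  \mathbb{E}_{x \sim X}\big[\lVert F(G(x)) - x \rVert_1\big]
  + \mathbb{E}_{y \sim Y}\big[\lVert G(F(y)) - y \rVert_1\big]
```

This term is added to the adversarial losses of the two generators, so that each translated image can be mapped back to (approximately) its source.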

CycleGANs have been successfully used in a variety of applications [CycleGAN2017], for example, photo enhancement, style transfer, colorizing legacy photographs (http://quasimondo.com/), etc. As we will see in Section 4, we will exploit CycleGANs to transform clean images to their backdoored versions in the on-line phase of our defense, and use these backdoored versions of the images to fine-tune the DNN classifier so as to reduce the adversarial effects of the backdoor.

2.2 Threat Model

We adopt the threat model used in earlier literature  [badnets, neuralcleanse, absliu, neuraltrojans, finepruning, duke]. Specifically, we model two parties: a user (or defender) who wishes to deploy a DNN for a specific application sourced from an untrusted party, the attacker, who trains the DNN. We describe below our assumptions about the two parties’ goals and capabilities.

Attacker Model.

Given a DNN application, the attacker has access to a large and high-quality clean training dataset, $D^{tr}_{cl}$, drawn from a distribution $\mathcal{P}_{cl}$. Let $f_{\theta_{cl}}$ denote the DNN obtained by benignly training on $D^{tr}_{cl}$. The attacker instead seeks to train a BadNet $f_{\theta_{bd}}$ that agrees with $f_{\theta_{cl}}$ on any input $x$ drawn from $\mathcal{P}_{cl}$, but misbehaves when $x$ is modified using a trigger insertion function poison(). One example of misbehaviour is a targeted attack where $\arg\max f_{\theta_{bd}}(\mathrm{poison}(x)) = t$, where $t$ is an attacker-chosen class different from the benign DNN’s prediction, i.e., $t \neq \arg\max f_{\theta_{cl}}(x)$.

As in prior work [badnets, finepruning, neuralcleanse, absliu], the attacker achieves this goal via training data poisoning. The attacker prepares $\theta_{bd}$ by training on both $D^{tr}_{cl}$ and a set of poisoned inputs, $D^{tr}_{bd}$, which are prepared using the trigger insertion function poison(). Specifically, as in [badnets], the attacker performs the following three-step training:

  1. The attacker prepares $D^{tr}_{bd}$ by applying poison() to a fraction ($p$%) of $D^{tr}_{cl}$, e.g., poisoning 10% of $D^{tr}_{cl}$ produces $D^{tr}_{bd}$ with $|D^{tr}_{bd}| = 0.1\,|D^{tr}_{cl}|$.

  2. The attacker trains a model using train() with randomly initialized DNN parameters $\theta_0$, $D^{tr}_{cl}$, and a learning rate. This produces $\theta_{cl}$ such that $f_{\theta_{cl}}$ has good accuracy on data drawn from $\mathcal{P}_{cl}$.

  3. The attacker takes $\theta_{cl}$ and executes train() again with $D^{tr}_{cl}$ and $D^{tr}_{bd}$, thus producing $\theta_{bd}$.
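The first step of the recipe above can be illustrated with a small sketch of a trigger insertion function and the construction of a poisoned set. Everything here is an assumption made for illustration: images are plain 2D lists, the trigger is a fixed patch stamped at a fixed position, and poisoned samples are relabeled to the target class as in a targeted attack. The helper names `poison` and `make_poisoned_set` are hypothetical.

```python
import random

def poison(image, trigger, top=0, left=0):
    """Return a copy of image (2D list) with trigger stamped at (top, left)."""
    out = [row[:] for row in image]          # leave the clean image untouched
    for r, trigger_row in enumerate(trigger):
        for c, value in enumerate(trigger_row):
            out[top + r][left + c] = value
    return out

def make_poisoned_set(clean, trigger, target, fraction=0.1, seed=0):
    """Poison a random fraction of (image, label) pairs, relabeled to target."""
    rng = random.Random(seed)
    count = int(round(fraction * len(clean)))
    chosen = rng.sample(range(len(clean)), count)
    return [(poison(clean[i][0], trigger), target) for i in chosen]

# 20 blank 4x4 images with labels 0..2, and a 2x2 trigger patch.
toy_clean = [([[0] * 4 for _ in range(4)], i % 3) for i in range(20)]
toy_trigger = [[1, 1], [1, 1]]
toy_poisoned = make_poisoned_set(toy_clean, toy_trigger, target=2, fraction=0.1)
```

The same `poison` function is what a defender would ideally recover, since poisoned images paired with their true labels are what retraining needs.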

The attacker uploads the BadNet $f_{\theta_{bd}}$ to an online repository of DNN models to coax an unsuspecting user to deploy it. After the user deploys the model, the attacker triggers the DNN misbehavior by providing poisoned test data containing the backdoor trigger. We assume that the attacker is able to control at least a fraction of test inputs to the deployed model.

User Goals and Capabilities.

The user (referred to interchangeably as the defender) wishes to deploy a DNN for the application advertised by the attacker, but does not have the resources to acquire a large, high-quality dataset for it. Instead, the user downloads the DNN, $f_{\theta_{bd}}$, uploaded by the attacker, and uses a small validation dataset, $D^{valid}_{cl}$, of clean inputs to verify the DNN’s accuracy. In addition, the user seeks to patch $f_{\theta_{bd}}$ to eliminate backdoors: ideally, the patched DNN should output correct labels for backdoored inputs, or detect and refuse to classify them (this is often referred to in the ML literature as reneging).

To meet these goals, the user has access to two assets pre-deployment: full, white-box access to $f_{\theta_{bd}}$ and a small clean validation dataset $D^{valid}_{cl}$. Post-deployment, the user also has access to all test inputs seen by the deployed model. As in prior work [neuralcleanse, absliu], we do not bound (but will seek to minimize) the user’s computational effort, i.e., the user’s primary limitation is the paucity of high-quality training data, not computational resources.

2.3 Security Metrics

In this work, we evaluate the backdooring and mitigation successes using the following metrics, computed on held-out test data that emulate post-deployment inputs:

Definition 1 (Clean Data Accuracy—CA).

Clean Data Accuracy is defined as the percentage of clean test data that is classified as members of their true class.

Definition 2 (Attack Success Rate—ASR).

Let $D^{test}_{cl}$ be a set of test data that are correctly classified as members of their true class by a DNN $f_\theta$. The Attack Success Rate is the percentage of images in $D^{test}_{cl}$, after poisoning with poison(), that are classified by $f_\theta$ as members of the backdoor target class, $t$.

Based on Definition 2, an attack fails when a poisoned sample is classified as anything other than the attacker’s chosen target class $t$. Our defense seeks to lower ASR (reducing the power held by the attacker) while minimizing the impact on CA.
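Both metrics are simple to compute once a model and a trigger insertion function are available. The sketch below follows Definitions 1 and 2 literally; the function names, the representation of a model as a callable returning a class label, and the toy integer "inputs" are assumptions made for illustration.

```python
def clean_accuracy(model, clean_test):
    """Percentage of clean (input, label) pairs classified as their true class."""
    correct = sum(1 for x, z in clean_test if model(x) == z)
    return 100.0 * correct / len(clean_test)

def attack_success_rate(model, clean_test, poison, target):
    """Percentage of correctly classified inputs sent to target once poisoned."""
    eligible = [x for x, z in clean_test if model(x) == z]
    if not eligible:
        return 0.0
    hits = sum(1 for x in eligible if model(poison(x)) == target)
    return 100.0 * hits / len(eligible)

# Toy stand-ins: inputs are integers, the "trigger" adds 100, and the
# backdoored model answers 7 whenever it sees an input of 100 or more.
toy_badnet = lambda x: 7 if x >= 100 else x % 3
toy_poison = lambda x: x + 100
toy_test = [(0, 0), (1, 1), (2, 2), (3, 1)]   # the last label is misclassified

toy_ca = clean_accuracy(toy_badnet, toy_test)
toy_asr = attack_success_rate(toy_badnet, toy_test, toy_poison, target=7)
```

Restricting ASR to inputs the model already classifies correctly keeps ordinary misclassifications from being counted as attack successes or failures.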

Figure 1: An example demonstrating the shortcomings of NeuralCleanse applied to different BadNets. The leftmost image is the actual trigger; the other images are incorrectly reverse-engineered triggers by NeuralCleanse. The top row corresponds to the reversed trigger on the original BadNet and the bottom row corresponds to the Variant BadNet.
Figure 2: Shortcomings of fine-pruning. Panels (a) and (b) correspond to the original BadNet; panels (c) and (d) correspond to the variant BadNet. (a) and (c) plot the average activation of each neuron in the last pooling layer for a set of poisoned and benign inputs. (b) and (d) show the effect of pruning on clean accuracy and attack success.

3 Motivations

We are motivated to find ways to relax restrictive assumptions made by, and limitations of, prior work in BadNet mitigation.

Trigger Nature

One line of prior defense methods seeks to recover the trigger (or trigger distribution) given a BadNet; the recovered trigger (or distribution) is used (with corrected labels) to re-train the BadNet with the goal of disabling the backdoor [neuralcleanse, duke]. However, these works make strong assumptions about the trigger. Neural Cleanse [neuralcleanse] assumes that the trigger is small; for example, the trigger could be a small fixed pattern of pixels superimposed in one corner of the image. Qiao et al. [duke] assume that the defender knows the trigger size and shape. These assumptions are unrealistic as:

  • the attacker chooses the trigger and has a vast range of options for its shape and size, and

  • real-world triggers need not be small as long as they are contextually meaningful — e.g., in face recognition, sunglasses of a certain shade could act as triggers [sunglassesattack].

Consequently, prior works fail when the assumptions are not satisfied. For example, in Fig. 1 we illustrate the output from Neural Cleanse given BadNets triggered by a large, but semantically meaningful, sunglasses trigger for a face recognition application. The recovered triggers bear little resemblance to the original, missing its size, shape, and color (see Section 5 for the setup and evaluation of this experiment).

Mechanics of BadNet

A second line of defense methods, notably fine-pruning [finepruning] and Artificial Brain Stimulation (ABS) [absliu], eschew assumptions on the trigger size and shape, but assume that one or more “backdoor" neurons exist in the DNN that activate only on a trigger. In fine-pruning [finepruning], Liu et al. assume that the defender knows that a model is backdoored and that poisoned input data trigger specific neurons. All neurons dormant on clean inputs are potential backdoors and can be pruned. In ABS [absliu], the defender assumes a single neuron that is activated by the backdoor. These assumptions do not always hold. In fact, BadNets:

  • can be trained such that backdoor and clean inputs have similar activation patterns, reducing efficacy of fine-pruning, or

  • can be trained such that more than one neuron forms the backdoor; the authors of ABS acknowledge that their defense is less effective in this scenario and can fail to identify a BadNet as backdoored. We demonstrate this scenario in Appendix A.

Consider Fig. 2—this illustrates fine-pruning of two BadNets, one trained with the settings reported in [finepruning] and the other with different training hyper-parameters. In the first case, represented by (a) and (b), fine-pruning succeeds as there are backdoored neurons that are not activated by clean inputs. When pruning them, the attack success drops more rapidly compared to the loss of clean accuracy. However, in the second case, represented by (c) and (d), fine-pruning fails, as the backdoored neurons are activated by both poisoned and clean inputs. Thus, pruning in order of minimum to maximum clean activation does not eradicate the backdoor behavior before severely degrading the clean accuracy.

In response to the aforementioned shortcomings, our goal is to devise an end-to-end BadNet mitigation technique that makes minimal assumptions about the characteristics of the trigger or the mechanics underlying the BadNet.

Figure 3: An overview of NNoculation leading up to initial deployment. First, the user/defender acquires a potential BadNet. Using pre-deployment treatment, which retrains the BadNet on noise-augmented treatment data, the BadNet and treated DNN are deployed as an ensemble that will reject poisoned inputs.

4 NNoculation

4.1 Overview

Two Stage Defense

NNoculation is a two stage defense. First, the user (defender) acquires a DNN, a potential BadNet $f_{\theta_{bd}}$. In the first stage, i.e., the pre-deployment stage, the defender retrains $f_{\theta_{bd}}$ with an augmented dataset containing both clean validation data and noisy versions of the clean input as a broad-spectrum, coarse approximation of poisoned data. This aims to stimulate a wide range of behaviours in the DNN, and forces the DNN to pay more attention to the unmodified portions of the image. The result is a new DNN, $f_{\theta_{pre}}$, with a reduced attack success rate (ASR). We then deploy $f_{\theta_{bd}}$ and $f_{\theta_{pre}}$ as an ensemble.

In the post-deployment stage, data that causes disagreement between $f_{\theta_{bd}}$ and $f_{\theta_{pre}}$ is rejected (i.e., the system refuses classification) and quarantined. As long as the pre-deployment defense reduces ASR (even if not down to zero), the quarantined dataset likely includes attacker-poisoned data. Now, using the clean validation dataset and quarantined dataset, we learn the function poison() using a CycleGAN that transfers between the two domains (in effect, the CycleGAN learns to poison clean data!). We then use the reverse-engineered trigger for a second (and final) round of retraining of $f_{\theta_{bd}}$.


Prior work [finepruning, duke, neuralcleanse] found that DNN retraining offers a path towards backdoor removal. However, retraining only with clean, trigger-free data is insufficient [finepruning] as compromised neurons need to be activated so that SGD will modify their weights/biases for "unlearning" backdoors. Ideally, identifying and reverse-engineering the trigger(s) for a BadNet allows the defender to generate their own synthetic poisoned data with truthful labels; adding poisoned data to the training set stimulates compromised neurons, allowing the network to re-learn corrected decision boundaries. As discussed in Section 3, trigger reverse-engineering is not easy and currently requires strong assumptions about the BadNet mechanics. Can anything be done from the outset, with minimal assumptions? NNoculation starts with a simple intuition that randomly noisy inputs can activate a broad set of neurons, both compromised and benign, thus making backdoor mitigation possible (to some extent) without trigger assumptions. The second crucial intuition is that even if this initial random augmentation based backdoor mitigation is effective only to a small extent, this is sufficient to deduce two subsets of on-line test inputs that are at least partially separated based on whether they are clean or poisoned. These two subsets are constructed based on whether the original and “partially de-backdoored” networks agree on an input or not. Since there is therefore an inherent bias (by construction) between the two subsets as to how likely they are to contain poisoned inputs, this enables learning of transformations to make inputs in the first subset appear more like the second subset and vice versa. 
Finally, these learned transformations enable construction of inputs that are (at least in part) similar to backdoored inputs, but with correct labels (to high likelihood since the backdoored inputs are constructed from inputs in the first subset, which is more likely to contain clean inputs). These generated pairs of likely-backdoored inputs and likely-correct output labels thereby enable retraining of the DNN classifier to reduce the likelihood of effectiveness of the actual backdoor.

4.2 Pre-Deployment Defense

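Input: clean validation data $D^{valid}_{cl}$, potential BadNet $\theta_{bd}$, initial learning rate $\alpha_0$, a set of noise percentages $\Gamma$, minimum accuracy threshold $\tau$, noise distribution $\mathcal{P}_n$
Output: treated net candidates $\Theta$

1: Noise-Augmented Treatment Datasets Preparation
2: split $D^{valid}_{cl}$ s.t. $D^{valid}_{cl} = D_{treat} \cup D_{eval}$
3: let $\mathbb{D}$ be a set of different treatment datasets produced by augmenting $D_{treat}$ with different noise levels
4: for $\gamma$ in $\Gamma$ do:
5:     $D^{\gamma}_{noisy} \leftarrow$ noise augmentation of $D_{treat}$ with $\mathcal{P}_n$ and $\gamma$
6:     $D^{\gamma}_{aug} \leftarrow D_{treat} \cup D^{\gamma}_{noisy}$
7:     $\mathbb{D}$.add($D^{\gamma}_{aug}$)
8: end for
9: Fine-tuning to "unlearn"
10: let $\Theta$ be a set of candidate $\theta$ and their clean data accuracy
11: procedure ProduceCandidates($\alpha$)
12:     for $D$ in $\mathbb{D}$ do:
13:         $\theta \leftarrow$ train($D$, $\theta_{bd}$, $\alpha$)
14:         $acc \leftarrow$ eval($\theta$, $D_{eval}$)
15:         $\Theta$.add(($\theta$, $acc$))
16:     end for
17: end procedure
18: let $\alpha = \alpha_0$, run ProduceCandidates($\alpha$)
19: if all $acc < \tau$ from this iteration then
20:     reduce $\alpha$ by some amount
21: else
22:     increase $\alpha$ by some amount
23: end if
24: re-run ProduceCandidates($\alpha$) or return $\Theta$
Algorithm 1 Pre-Deployment Defense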
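The dataset preparation part of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions: images are flat lists of pixel values, the noise distribution is passed in as a sampling callable (uniform over 0 to 255 by default), noisy copies keep the label of the clean image they came from, and the names `noise_augment`, `build_treatment_sets` and `uniform_pixel` are hypothetical.

```python
import random

def uniform_pixel(rng):
    """Assumed noise distribution: uniform over 8-bit pixel values."""
    return rng.randint(0, 255)

def noise_augment(images, sample_noise, gamma, seed=0):
    """Replace a gamma fraction of pixels in each image with sampled noise."""
    rng = random.Random(seed)
    noisy = []
    for img in images:
        out = list(img)                               # copy, keep the original
        count = int(round(gamma * len(out)))
        for idx in rng.sample(range(len(out)), count):
            out[idx] = sample_noise(rng)
        noisy.append(out)
    return noisy

def build_treatment_sets(treat, sample_noise, gammas):
    """Return {gamma: clean treatment data plus its noisy copies}."""
    images = [img for img, _ in treat]
    labels = [z for _, z in treat]
    sets = {}
    for gamma in gammas:
        noisy = noise_augment(images, sample_noise, gamma)
        sets[gamma] = list(treat) + list(zip(noisy, labels))
    return sets

toy_treat = [([0] * 10, 1), ([0] * 10, 2)]
toy_sets = build_treatment_sets(toy_treat, lambda rng: 255, [0.2, 0.5])
```

Each entry of `toy_sets` would then be used to fine-tune one candidate network.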

We outline the pre-deployment defense in Algorithm 1 and illustrate it in Fig. 3. The user (henceforth, defender) acquires a DNN $f_{\theta_{bd}}$ from an untrusted source. The defender has no initial knowledge about the trigger but has access to clean, trigger-free validation data, $D^{valid}_{cl}$. The pre-deployment defense proceeds as follows:

  1. Line 1–7: The defender first splits clean validation data $D^{valid}_{cl}$ into a clean treatment dataset $D_{treat}$ and a clean evaluation dataset $D_{eval}$. Next, the defender creates multiple noisy versions of the clean treatment dataset by adding increasing amounts of noise to the images in the dataset. This is done using a noise augmentation function that takes as input a dataset $D$, a noise distribution $\mathcal{P}_n$, and a noise percentage $\gamma$. The function randomly samples a $\gamma$ fraction of pixels from each image in $D$ and replaces those pixels with values sampled from $\mathcal{P}_n$. Let $D^{\gamma}_{noisy}$ denote the noisy dataset produced by running noise augmentation on $D_{treat}$ with noise percentage $\gamma$. This process is repeated for each $\gamma$ in the set $\Gamma$, yielding multiple noise-augmented datasets. Finally, the defender complements each noise-augmented dataset with the clean treatment dataset to create multiple augmented datasets $D^{\gamma}_{aug} = D_{treat} \cup D^{\gamma}_{noisy}$.

  2. Line 8–16: Next, the defender begins fine-tuning $f_{\theta_{bd}}$ to produce a candidate re-trained DNN for each $D^{\gamma}_{aug}$ with an initial learning rate $\alpha_0$.

  3. Line 17–23: After producing a candidate for each $D^{\gamma}_{aug}$, the defender evaluates the clean data accuracy of each candidate model using $D_{eval}$:

    • If all the models in the set have clean accuracy below the threshold $\tau$, the defender reduces the learning rate $\alpha$.

    • If at least one model in the set has clean accuracy above the threshold $\tau$, the defender increases $\alpha$.

    • The defender produces another set of candidates with the revised learning rate or terminates when the computational budget is reached.

  4. To pick the final patched network $f_{\theta_{pre}}$ from the set of candidate networks generated from Lines 2–23, we propose a heuristic approach based on the observation that higher learning rates, higher noise percentages and lower clean accuracy compared to the original BadNet result in lower attack success rate (since these all serve as proxies for the extent that the BadNet has "unlearned"). Thus, the defender searches the candidate models, starting with the network set produced with the highest learning rate $\alpha$. The defender evaluates the networks in decreasing order, moving to the network set produced by the next highest $\alpha$, and so on, until finding the network with clean accuracy closest to a set threshold (but not below).

Discussion Intuitively, our pre-deployment defense seeks to pick the highest noise level and largest learning rate that yields a re-trained DNN with clean classification accuracy just above the user-specified threshold. As noted before, higher noise levels and learning rates imply a greater chance that the BadNet’s misbehaviour has been unlearned, although at the expense of unlearning some of its good behavior as well.
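One possible reading of the selection heuristic is sketched below. The text leaves some details open, so this is an interpretation rather than the authors' procedure: candidates are grouped by learning rate, groups are visited from the highest learning rate down, and within the first group that has any candidate at or above the threshold, the one closest to the threshold is returned (ties broken toward higher noise). The dictionary keys and the name `pick_candidate` are hypothetical.

```python
def pick_candidate(candidates, threshold):
    """Pick a candidate dict with keys 'lr', 'noise', 'acc', or None."""
    learning_rates = sorted({c["lr"] for c in candidates}, reverse=True)
    for lr in learning_rates:
        acceptable = [c for c in candidates
                      if c["lr"] == lr and c["acc"] >= threshold]
        if acceptable:
            # Closest to the threshold from above; prefer more noise on ties.
            return min(acceptable,
                       key=lambda c: (c["acc"] - threshold, -c["noise"]))
    return None

toy_candidates = [
    {"lr": 0.01, "noise": 0.2, "acc": 85.0},
    {"lr": 0.01, "noise": 0.4, "acc": 80.0},
    {"lr": 0.001, "noise": 0.2, "acc": 95.0},
    {"lr": 0.001, "noise": 0.4, "acc": 92.0},
    {"lr": 0.001, "noise": 0.6, "acc": 89.0},
]
toy_choice = pick_candidate(toy_candidates, threshold=90.0)
```

With a threshold of 90, no candidate at the higher learning rate qualifies, so the search falls through to the lower learning rate and selects the candidate with 92% clean accuracy.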

Empirically, we find that with a reduction in clean classification accuracy, NNoculation’s pre-deployment defense reduces ASR to between across a range of BadNets for which existing defenses fail. NNoculation’s post-deployment defense further recovers some of the loss in classification accuracy while further reducing ASR to between for the BadNets evaluated.

4.3 Post-Deployment Defense

Figure 4: Using the quarantined images from the deployed BadNet and pre-deployment patched DNN, we train a CycleGAN to approximate the poison() process, enabling targeted retraining of the BadNet to soften backdoor behavior.
1: Classify and Store Incoming Test Data
2:for  in (while # iterations do
7:     if  then
8:         output()
9:     else
10:         output("reject")
11:         .add()
12:     end if
13:end for
14:trainGAN() Train CycleGAN
15: Fine-tuning Treatment
Algorithm 2 Post-deployment Defense

The input to NNoculation’s post-deployment defense is the patched network, $f_{\theta_{pre}}$, from the pre-deployment stage. In the field, we use $f_{\theta_{pre}}$ to design a backdoored input detector by deploying it in parallel with the BadNet $f_{\theta_{bd}}$: if the two disagree on a prediction, we flag the input image as backdoored (and refuse to make a prediction); otherwise, if they agree, we output their common prediction. We refer to this parallel combination as an ensemble.
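The ensemble rule is small enough to state directly in code. In this sketch both networks are represented as callables that return a class label, rejected inputs are appended to a quarantine list, and the names `make_ensemble`, `classify` and the string `"reject"` are illustrative choices rather than part of the original system.

```python
def make_ensemble(badnet, patched):
    """Return (classify, quarantine) for the two-network detector."""
    quarantine = []

    def classify(x):
        y_bd, y_patched = badnet(x), patched(x)
        if y_bd == y_patched:
            return y_patched          # networks agree: output common label
        quarantine.append(x)          # networks disagree: store the input
        return "reject"               # and refuse to classify

    return classify, quarantine

# Toy stand-ins: the BadNet answers 7 on inputs of 100 or more (the
# "trigger"), while the patched network ignores it.
toy_badnet = lambda x: 7 if x >= 100 else x % 3
toy_patched = lambda x: x % 3
toy_classify, toy_quarantine = make_ensemble(toy_badnet, toy_patched)
toy_results = [toy_classify(x) for x in (4, 104, 5)]
```

Over time the quarantine list accumulates inputs that are likely to contain the trigger, which is the raw material for the CycleGAN step.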

After deploying $f_{\theta_{bd}}$ and $f_{\theta_{pre}}$ as an ensemble, the system begins to receive unlabeled data for classification, $D_{stream}$. We assume that the attacker will try to attack the system, so some fraction of $D_{stream}$ consists of poisoned data containing the trigger. The exact proportion of poisoned data is unknown to the defender. As the clean data accuracy of $f_{\theta_{bd}}$ and $f_{\theta_{pre}}$ is similar, the ensemble will agree on the majority of $D_{stream}$ (in which case, the output classification is the common output of both networks). In cases where the ensemble disagrees, the system refuses to classify the input, and stores the data in a quarantined dataset $D_q$. Disagreement arises primarily from poisoned inputs, i.e., $f_{\theta_{bd}}$ will exhibit the backdoored behavior in (almost) all cases of poisoned inputs while $f_{\theta_{pre}}$ will not. The quarantined dataset, $D_q$, offers an opportunity for post-deployment treatment, as follows:

  1. Line 1–12: The defender collects the data from $D_{stream}$ on which the ensemble disagrees, $D_q$.

  2. Line 13: After some time (e.g., after some number of requests, or if the rate of reneging exceeds a threshold), the defender trains a CycleGAN where domain 1 is represented by $D^{valid}_{cl}$ and domain 2 by $D_q$. The resulting generator, $G$, approximates the attacker’s poison() function.

  3. Line 14–15: Using $G$, the defender creates a new treatment dataset, $D_{post}$, by transforming the clean validation images and keeping their correct labels.

  4. Line 16: The defender fine-tunes $f_{\theta_{bd}}$ using $D_{post}$, producing the final patched network $f_{\theta_{post}}$.
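The construction of the post-deployment treatment dataset can be sketched as below, assuming the trained generator is available as a callable that maps a clean input to a triggered one. Whether the clean pairs are also kept in the treatment set is not stated in the text, so it is exposed as an option; the function name and the `include_clean` flag are hypothetical.

```python
def build_post_deployment_set(valid, generator, include_clean=True):
    """Pair generator-transformed validation inputs with their true labels."""
    poisoned = [(generator(x), z) for x, z in valid]
    return list(valid) + poisoned if include_clean else poisoned

# Toy stand-in for the learned generator: the "trigger" adds 100.
toy_generator = lambda x: x + 100
toy_valid = [(0, 0), (1, 1), (2, 2)]
toy_post = build_post_deployment_set(toy_valid, toy_generator)
toy_post_only = build_post_deployment_set(toy_valid, toy_generator,
                                          include_clean=False)
```

Because every transformed input inherits the label of the clean validation image it came from, fine-tuning on this set teaches the network to ignore the trigger.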

Discussion The use of a CycleGAN in the post-deployment stage fulfills a critical need: generating a dataset of correctly labeled backdoored images. The clean validation dataset $D^{valid}_{cl}$ contains only correctly labeled clean images, while the quarantined dataset $D_q$ contains unlabeled backdoored images (as an untargeted defense, $f_{\theta_{pre}}$ outputs classifications that differ from the BadNet’s on backdoored inputs, but these are not necessarily the correct classifications).

The success of the post-deployment defense depends on the fraction of images in the quarantined dataset that are backdoored. Ideally, we would like for the pre-deployment patched network, to have high clean classification accuracy and low ASR. The former is controlled by the user (set to at most below the original clean classification accuracy in this study). With respect to the latter, we show that the post-deployment defense is successful even with relatively high ASRs of . The percentage of backdoored inputs in is proportional to the fraction of poisoned test inputs setting up an unfavorable trade-off for the attacker. That is, the attacker can weaken the defense but only if she poisons a fraction of inputs to begin with. In practice, post-deployment treatment is effective even if the attacker poisons of test inputs, setting a de-facto upper bound on ASR.

5 Experimental Setup

5.1 Experiment Overview

To verify the effectiveness of NNoculation, we perform three sets of experiments on a variety of BadNets:

  1. We investigate the effectiveness of our pre-deployment treatment and evaluate the trade-off between classification accuracy (CA) and attack success rate (ASR) induced by different noise levels and learning rates.

  2. Next, we investigate the effectiveness of post-deployment treatment, exploring:

    • the success of the post-deployment defense as a function of the fraction of test data poisoned,

    • implications of the choice of pre-deployment DNN on the success of post-deployment defense,

    • quality of reverse engineered triggers, and

    • end-to-end comparisons of NNoculation against NeuralCleanse.

  3. Finally, we prepare a BadNet that circumvents one of the underlying assumptions of ABS [absliu] that a single neuron activates the backdoor. We compare performance of NNoculation against ABS [absliu].

Experimental Platform

We conduct our experiments on a desktop using Intel CPU i9-7920X (12 cores, 2.90 GHz) and single Nvidia GeForce GTX 1080 Ti GPU.

5.2 BadNet Preparation

hyperparameter        BadNet-SG, BadNet-LS   BadNet-PN
batch size            1283                   32
epochs                200                    15
learning rate         1                      0.001
optimizer             ADADELTA               Adam
pixel preprocessing   divide by 255          divide by 255
Table 1: Training hyperparameters for baseline BadNets.

We prepare numerous BadNets by producing backdoored DNNs on the YouTube Aligned Face Dataset [youtubedataset], the German Traffic Sign Recognition Benchmark (GTSRB) [gtsrbdataset], and CIFAR-10 [krizhevsky2009learning]. We partition each dataset into training, validation, and test datasets. As noted in Section 4, the validation dataset is further split into two subsets. The BadNet training hyperparameters are presented in Table 1; the baseline classification accuracy (CA) and attack success rate (ASR) are reported in Table 2.

Figure 5: Examples of the datasets used in this study. Top row: YouTube face data - clean face, face with sunglasses trigger, face with lipstick triggers moving dynamically. Middle row: GTSRB - clean sign, sign with Post-it Note trigger at random location. Last row: CIFAR-10: clean image, image with red circle corresponding to clean label, image with yellow square corresponding to clean label and poisoned image with both red circle and yellow square with target label.
BadNets for YouTube Aligned Face Dataset

To explore NNoculation on face recognition (as studied in [finepruning, sunglassesattack]), we train BadNets based on the DeepID architecture [deepid]. DeepID is a state-of-the-art architecture containing three convolutional layers followed by two parallel sub-networks that feed into the last two fully connected layers. We retrieve 1283 individuals, each with 100 images, from [youtubedataset]. Of the 128,300 images, 80% are used for training, 10% for validation, and 10% for test.

Triggers: we prepare two BadNet types using two different triggers, illustrated in Fig. 5 (Top Row). The first trigger is a large specific pair of sunglasses (BadNet-SG) that we insert at a fixed location in the image. The second trigger uses lipstick (BadNet-LS) as the trigger. This trigger changes its shape, size, and location depending on where the lips of the person are in an image. In both types of network, we set the target class, . To train each BadNet, we poison 10% of the images in and follow the procedure described in Section 2.2 with . This produces and for trigger and trigger , respectively.

BadNets for GTSRB

To explore NNoculation on traffic sign recognition in [neuralcleanse], we train DNNs comprising six convolutional layers that feed into two fully connected layers. The dataset has 51839 samples and 43 classes, i.e., . We split the dataset exactly as in NeuralCleanse evaluation: 68% for training, 10% for validation, and 22% for test. Trigger: We use a variable location yellow Post-it note (BadNet-PN) as a trigger, illustrated in Fig. 5 (Middle Row). We set the target label to . To train this BadNet, we poison 10% of the images in by randomly replacing a 44 pixel area in the image with a yellow Post-it note. We train using the procedure in Section 2.2, setting . This produces .

BadNet for CIFAR-10

Finally, to compare NNoculation with ABS (ABS currently provides an executable that only works for CIFAR-10), we prepare a BadNet using the CIFAR-10 dataset. CIFAR-10 consists of 10 classes and 60,000 images, split as 83% for training (as is commonly done for CIFAR-10) with the remainder split equally between validation and test. We train BadNet-CF using the parameters and Network-in-Network architecture described in [networkinnetwork].

Trigger: To circumvent ABS’ assumptions, we experiment with a combination trigger consisting of a red circle and yellow square that must appear in an image together. This is illustrated in the last row of  Fig. 5. Images with only red circles or yellow squares are not considered backdoored, i.e., the BadNet predicts correctly. This forces the BadNet to encode backdoor behaviour in multiple neurons (two in this case).
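
The labelling rule behind the combination trigger can be sketched as below. This is an illustrative assumption about how the poisoned training labels are assigned, based only on the description above; `poisoned_label` and the target label value are hypothetical.

```python
def poisoned_label(true_label: int, has_circle: bool, has_square: bool,
                   target_label: int) -> int:
    """Label used when training the BadNet on a (possibly triggered) image."""
    # Only the conjunction of both shapes activates the backdoor.
    if has_circle and has_square:
        return target_label
    # A single shape on its own keeps the clean label.
    return true_label


TARGET = 0  # hypothetical target class
labels = [
    poisoned_label(4, False, False, TARGET),  # clean image
    poisoned_label(4, True, False, TARGET),   # red circle only
    poisoned_label(4, False, True, TARGET),   # yellow square only
    poisoned_label(4, True, True, TARGET),    # both shapes: backdoored
]
```

Training on all four cases is what pushes the network to treat each shape alone as benign, so the backdoor cannot be attributed to a single trigger feature.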

BadNet Baselines

The CA and ASR of baseline BadNets are shown in Table 2.

BadNet      Dataset        Trigger         CA      ASR
BadNet-SG   YouTube Face   Sunglasses      97.77   99.99
BadNet-LS   YouTube Face   Lipstick        97.18   91.46
BadNet-PN   GTSRB          Post-it Note    95.15   99.78
BadNet-CF   CIFAR-10       Trigger combo   88.27   99.96
Table 2: Baseline BadNet clean accuracy (CA) and attack success rate (ASR).

5.3 NNoculation Evaluation Setup

5.3.1 Evaluation of Pre-deployment Treatment

In our results, we report the success of our pre-deployment defense for varying noise ratios and learning rates with which we retrain the BadNets. We use the Python imgaug [imgaug] library to prepare our noise-augmented datasets. We set the noise distribution to be Gaussian with the default parameters in the imgaug library. The noise fraction varies from 10% to 60% in increments of 10% (see Tables 5 to 8). The starting learning rate for pre-deployment training is set to the original learning rate of the corresponding BadNet and increased in multiples thereafter.
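
As a rough sketch of how a noise-augmented treatment set might be assembled, the snippet below uses only the Python standard library rather than imgaug. Several things here are assumptions and not taken from the paper: that the noise fraction is the share of validation images for which a noisy copy is added, that images are flattened lists of pixels in [0, 1], and the value of `sigma`.

```python
import random
from typing import List, Sequence, Tuple

Image = List[float]  # flattened pixel values in [0, 1] (assumed representation)


def add_gaussian_noise(img: Sequence[float], sigma: float, rng: random.Random) -> Image:
    """Add zero-mean Gaussian noise to every pixel and clip to [0, 1]."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in img]


def build_treatment_set(
    images: Sequence[Sequence[float]],
    labels: Sequence[int],
    noise_fraction: float,
    sigma: float,
    seed: int = 0,
) -> Tuple[List[Image], List[int]]:
    """Clean images plus noisy copies of a random subset of them."""
    rng = random.Random(seed)
    n_noisy = round(noise_fraction * len(images))
    chosen = rng.sample(range(len(images)), n_noisy)
    out_images = [list(img) for img in images]
    out_labels = list(labels)
    for i in chosen:
        out_images.append(add_gaussian_noise(images[i], sigma, rng))
        out_labels.append(labels[i])  # noisy copies keep their clean label
    return out_images, out_labels


clean_images = [[0.5, 0.5, 0.5, 0.5] for _ in range(10)]
clean_labels = list(range(10))
treat_x, treat_y = build_treatment_set(clean_images, clean_labels,
                                       noise_fraction=0.3, sigma=0.1)
```

With ten clean images and a noise fraction of 30%, three noisy copies are appended. The key property is that every noisy image retains its correct label, so fine-tuning on the set cannot reinforce the backdoor.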

In experiments where the pre-deployment DNN is used as an input to the post-deployment defense, we report its CA on the evaluation dataset, itself drawn from the validation dataset (recall that this is the only data available to the defender). When evaluating the pre-deployment defense stand-alone, we report its CA on the test dataset. The ASR is always reported on a poisoned version of the test data (in practice, the defender cannot use ASR to make subsequent choices).
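
The two metrics can be computed as in the sketch below. The ASR definition used here (share of poisoned inputs classified as the attacker's target label) is the common one for targeted backdoors and is an assumption, since the formal definition appears in an earlier section. The function names are hypothetical, and the sketch does not exclude poisoned inputs whose true label already equals the target.

```python
from typing import Sequence


def clean_accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Percentage of clean inputs that are classified correctly."""
    correct = sum(p == y for p, y in zip(preds, labels))
    return 100.0 * correct / len(labels)


def attack_success_rate(preds_on_poisoned: Sequence[int], target_label: int) -> float:
    """Percentage of poisoned inputs classified as the target label."""
    hits = sum(p == target_label for p in preds_on_poisoned)
    return 100.0 * hits / len(preds_on_poisoned)


# Toy predictions, for illustration only.
ca = clean_accuracy(preds=[1, 2, 3, 4], labels=[1, 2, 3, 0])
asr = attack_success_rate(preds_on_poisoned=[0, 0, 1, 0], target_label=0)
```

Only `clean_accuracy` is available to the defender in practice, because computing ASR requires poisoned inputs with known triggers.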

5.3.2 Evaluation of Post-deployment Defense

To evaluate the post-deployment defense, we assume that the attacker poisons a fraction of images in the incoming stream of test inputs. We call this ratio the clean/poison input data ratio (henceforth, CPD ratio).

The post-deployment defense is triggered after an initial batch of test images, at which point the CycleGAN is trained on the quarantined dataset collected thus far together with 500 clean validation images that represent the distribution of clean images. The CycleGAN is trained using the approach from [johnson2016perceptual, CycleGAN2017] for 200 epochs. We then apply the CycleGAN to clean validation images to produce synthetic backdoored images, and retrain the original BadNet on both datasets using the original learning rate to obtain the repaired network. We evaluate the CA, ASR, and defense success rate on the repaired models.

The success of the post-deployment defense depends on the DNN picked pre-deployment. We report results on two approaches: (1) an Oracle approach that picks the pre-deployment DNN that gives the best results after post-deployment re-training, and (2) a pre-deployment DNN picked using the heuristic in Algorithm 1.

We compare NNoculation with NeuralCleanse, using reference implementations, on BadNet-SG, BadNet-LS and BadNet-PN. NeuralCleanse attempts to identify the attacker’s target label. However, NeuralCleanse identifies the incorrect target label for BadNet-SG and BadNet-LS and fails completely. For these BadNets, we endow NeuralCleanse with oracular knowledge of the target label. We call this implementation NeuralCleanse-Oracle. ABS provides an executable that works only on CIFAR-10, limiting our ability to evaluate ABS on other datasets. We therefore compare NNoculation with ABS only on BadNet-CF.

6 Experimental Results

Figure 6: Effect of pre-deployment treatment on clean evaluation data accuracy under varying learning rate and noise settings. Panels: (a) BadNet-SG, (b) BadNet-LS, (c) BadNet-PN.

Figure 7: Effect of pre-deployment treatment on attack success rate (ASR) (on test data) under varying learning rate and noise settings. Panels: (a) BadNet-SG, (b) BadNet-LS, (c) BadNet-PN.

6.1 Efficacy of Pre-deployment Treatment

Fig. 6 and Fig. 7 illustrate the effect of our pre-deployment treatment on clean accuracy (CA) and attack success rate (ASR), respectively, for varying learning rates and noise levels of the treatment data. Across all experiments, increasing the noise level and learning rate results in a drop in CA (ranging from 1.76% to the largest drop of 13.49%) and a reduction in ASR (in some cases down to 0%). Varying the learning rate and noise level allows one to balance ASR reduction against CA loss. For all three BadNets, there is at least one parameter setting that provides both low ASR and high CA relative to the baseline.

Even evaluated as a stand-alone defense, NNoculation’s pre-deployment patch is competitive with NeuralCleanse-Oracle (NeuralCleanse does not work on two of the three BadNets). Finally, the DNNs selected in the pre-deployment phase enable stronger defenses post-deployment.

6.2 Efficacy of Post-deployment Defense

We begin by analyzing the impact of the fraction of poisoned test images (referred to as the CPD ratio) on the post-deployment defense, as illustrated in Fig. 8. Observe that: (1) for all three BadNets, the ASR drops to if the attacker attempts to poison more than of test images; and (2) across all ratios, the maximum drop in classification accuracy is at most 6.01%, although it is much lower in several cases.

We note that although the ASR is higher for relatively low poisoning ratios, the impact is self-limiting. That is, the attacker’s effective ASR is the fraction of poisoned test inputs (CPD) times the ASR; thus even though the ASR is high at low CPDs, the attacker’s effective ASR is still low.
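
The self-limiting effect is simple arithmetic, sketched below. The numbers are hypothetical and chosen only to illustrate the trade-off; they are not results from the paper.

```python
def effective_asr(cpd: float, asr: float) -> float:
    """Fraction of all test inputs that are poisoned AND hit the target label."""
    return cpd * asr


# Hypothetical scenarios (fractions, not percentages).
rare_poisoning = effective_asr(cpd=0.02, asr=0.90)      # few poisoned inputs, high ASR
frequent_poisoning = effective_asr(cpd=0.50, asr=0.01)  # many poisoned inputs, low ASR
```

In both scenarios only a small share of all test inputs is successfully attacked: poisoning rarely caps the product through the CPD term, and poisoning often feeds the quarantine and drives the ASR term down.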

Figure 8: Effect of post-deployment treatment on clean accuracy (CA) and attack success rate (ASR) after re-training with data produced by the CycleGAN, which is trained on quarantined data, for varying clean/poison input data stream ratios (CPD ratio). Panels: (a) BadNet-SG, (b) BadNet-LS, (c) BadNet-PN.

To qualitatively understand NNoculation’s post-deployment defense, Fig. 9 shows a selection of backdoor images generated by the CycleGAN. Recall these are generated by feeding clean validation data into the CycleGAN’s generator. Note that we begin to see good trigger insertion after training the CycleGAN on quarantined data collected from a 6% CPD ratio. As the CycleGAN is trained on more poison data in the quarantined data, the trigger insertion becomes more reliable (the last row of Fig. 9).

Comparisons with prior work Table 3 presents a side-by-side comparison of NNoculation on BadNets and the application of NeuralCleanse [neuralcleanse]. In the case of BadNet-SG and BadNet-LS, we elevate NeuralCleanse’s capabilities with oracular knowledge of the target label. We compare four versions of NNoculation: the pre- and post-deployment defenses with oracular knowledge of ASR and picked using the heuristic in Algorithm 1 (the post-deployment heuristic defense is the one that we propose to deploy in practice).

NNoculation-repaired BadNets exhibit greater ASR reduction following end-to-end treatment. For our heuristically chosen networks, the ASR after pre-deployment treatment alone is already lower than that of NeuralCleanse (with oracular knowledge). The heuristically chosen post-deployment defense has low ASRs ranging from 0% to 1.31%, while those of NeuralCleanse range from 12.39% to 38.09% (Table 3). The clean accuracy of NNoculation-repaired networks is lower than that of NeuralCleanse (again with oracular knowledge) by roughly 1.4 to 2.9 percentage points (Table 3). Of course, for two of the three networks, NeuralCleanse’s original implementation would fail altogether.

            Oracle-Pre      Oracle-Post     Heur-Pre        Heur-Post       NeuralCleanse
            CA      ASR     CA      ASR     CA      ASR     CA      ASR     CA      ASR
BadNet-SG   90.6    40.63   94.27   0       92.29   8.48    92.86   0       95.74   38.09
BadNet-LS   93.52   9.54    95.6    0       92.54   3.25    95.7    1.31    97.14   28.44
BadNet-PN   95.01   35.62   94.29   0       89.61   2.28    92.96   1.29    95.24   12.39
Table 3: NNoculation vs. NeuralCleanse. Oracle-Pre and -Post correspond to a pre-deployment DNN selected by picking the pre-deployment-treated DNN with high clean accuracy (CA). Heur-Pre and -Post correspond to a pre-deployment DNN selected by following the heuristic we propose in Section 4. In the case of BadNet-SG and BadNet-LS, we explicitly give NeuralCleanse knowledge of the target label.

Figure 9: Examples of synthetic poisoned samples for post-deployment treatment generated by CycleGAN approximation of poison(). Top row: clean images; Remaining rows: synthetic data produced by GANs trained on quarantined data collected using CPD ratios 0.02, 0.06, 0.1, 0.5, respectively.

6.3 Efficacy on Complex Trigger BadNet

Table 4 presents clean accuracy and attack success rate after applying NNoculation on BadNet-CF. Our heuristic-picked DNN from pre-deployment treatment has a 75% lower ASR compared to the baseline. With post-deployment treatment, we reduce the ASR to 9.31%. We also applied ABS to BadNet-CF. ABS fails on this example, i.e., ABS is unable to identify BadNet-CF as backdoored, which points to the broader utility of NNoculation even in more complex backdoor settings. Although one could argue that ABS could search over pairs of neurons, this would blow up the search space for complex DNNs with millions of neurons. Further, one can easily engineer combination triggers with more components, which would result in exponential search complexity for ABS.
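
To give a sense of scale for the search-space argument, the sketch below counts candidate neuron subsets of size k. It does not model ABS's actual search procedure; the neuron count of one million is a hypothetical round figure, and `candidate_sets` is an illustrative helper.

```python
import math


def candidate_sets(num_neurons: int, k: int) -> int:
    """Number of distinct k-neuron subsets among num_neurons neurons."""
    return math.comb(num_neurons, k)


NUM_NEURONS = 1_000_000  # hypothetical network size
singles = candidate_sets(NUM_NEURONS, 1)
pairs = candidate_sets(NUM_NEURONS, 2)
triples = candidate_sets(NUM_NEURONS, 3)
```

Moving from single neurons to pairs raises the count from one million to roughly 5 × 10^11, and each additional trigger component multiplies it by several more orders of magnitude.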

        BadNet-CF   Pre-Treat   Post-Treat
CA      88.27       83.92       88.52
ASR     99.96       24.81       9.31
Table 4: Results from applying NNoculation to BadNet-CF.

7 Discussion

Figure 10: t-SNE visualizations of original BadNet and retrained BadNets after pre-deployment and post-deployment treatments. Panels: (a) Original BadNet, (b) After Pre-Deployment Treatment, (c) After Post-Deployment Treatment.

In this section, we seek to provide insight into NNoculation’s operation, complementarity with other defenses, and discuss limitations and threats to validity.

Visualizing NNoculation. To better observe the effects of the pre- and post-deployment treatments, we visualize, in Fig. 10, the output of the last convolution layer of BadNet-CF using t-SNE [maaten_visualizing_2008], a popular visualization technique for neural networks. The red (darker) points correspond to outputs produced by poisoned inputs, and the yellow (lighter) points correspond to outputs produced by clean inputs. In Fig. 10(a), there are two clear clusters where the DNN separates poisoned inputs of any class from the cluster of clean inputs. After pre-deployment treatment, as shown in Fig. 10(b), the two clusters start to merge: while there are regions where poison and clean do not overlap, there is evidence of "unlearning" as the two clusters are no longer separate. After post-deployment treatment (Fig. 10(c)), there is greater overlap between poison and clean, illustrating the efficacy of NNoculation in patching the BadNet.

Dealing With Adaptive Online Attackers. Thus far we have assumed an online attacker that poisons a constant fraction of test inputs; in practice, one could imagine an attacker that is dormant immediately after deployment and only starts poisoning inputs at a later time. NNoculation can be adapted to such an attack. Specifically, instead of triggering NNoculation’s post-deployment defense after a fixed number of initial test images, one can monitor the rate at which the deployed system refuses to classify over windows of test inputs. The post-deployment defense can then be triggered after the first window in which the refusal rate exceeds a threshold (which will happen if the DNN is under attack). Further, although we have not evaluated this scenario, one could also envision NNoculation going through multiple rounds of post-deployment re-training. This would be useful, for example, if a BadNet contains many different triggers that are used by the attacker at different points in time.
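
The windowed monitoring idea can be sketched as follows. This is an illustrative assumption about one way to implement it, not an evaluated design: the window size and threshold are hypothetical parameters, and windows are taken to be non-overlapping.

```python
from typing import Iterable, Optional


def first_alarm_window(refusals: Iterable[bool], window: int,
                       threshold: float) -> Optional[int]:
    """Index of the first non-overlapping window whose refusal rate exceeds
    `threshold`, or None if no complete window does."""
    count = 0   # refusals seen in the current window
    seen = 0    # inputs seen in the current window
    index = 0   # index of the current window
    for refused in refusals:
        count += bool(refused)
        seen += 1
        if seen == window:
            if count / window > threshold:
                return index
            index += 1
            count = 0
            seen = 0
    return None


# Attacker is dormant for 10 inputs, then poisons every second input.
stream = [False] * 10 + [False, True] * 5
alarm = first_alarm_window(stream, window=5, threshold=0.3)
quiet = first_alarm_window([False] * 20, window=5, threshold=0.3)
```

In this toy stream the first two windows contain no refusals, so the alarm fires on the third window, which is when CycleGAN training on the quarantined data would begin.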

Implications for Future Defenses. Empirical observations from our pre- and post-deployment defenses have important implications for future defenses. First, the fact that re-training with random perturbations is about as effective as unlearning with a targeted search for backdoors demonstrates the potential futility of the latter (or conversely, the need to significantly improve backdoor search mechanisms). Second, we note that our post-deployment defense is complementary to any pre-deployment defense, especially since we show the efficacy of our post-deployment defense even if the pre-deployment defense does not significantly reduce ASR.

Limitations and Threats to Validity. NNoculation has been evaluated only in the context of BadNet attacks in the image domain. Some of our methods, particularly noise addition, are specific to images and would need to be reconsidered for other applications, for instance text. Further, our attack model is restricted to training data poisoning as an attack strategy; one could imagine attackers that make custom changes to the weights of a trained BadNet to further evade defenses. Finally, we have assumed a computationally capable defender (although one that lacks access to high-quality training data); one can imagine a setting where the defender has only limited computational capabilities and cannot, for instance, train a CycleGAN. Defenses such as fine-pruning are more appropriate in that setting but, at least currently, do not appear to work across a broad spectrum of attacks. NeuralCleanse and ABS, on the other hand, have relatively high computational costs.

8 Related Works


There are two broad classifications of attacks on machine learning [biggio_wild_2018]: inference-time attacks (i.e., those that make use of adversarial perturbations [liu2019adversarial, szegedy_intriguing_2013, GoodfellowSS14]) and training-time attacks, as we explore in this work. BadNets [badnets] proposed the first backdoor attack on DNNs through malicious training with poisoned data, showcasing both targeted and random attacks, where an attacker aims to force a backdoored DNN to mis-classify inputs with a specific trigger as the target label (targeted attacks) or a random label (random attacks), in the context of pretrained online models and the transfer learning setting. There are two ways in which a DNN can be backdoored: dirty-label attacks, where training data is mislabelled (such as those in [neuraltrojans, sunglassesattack]), and clean-label attacks, where training data is cleanly labeled, as in Poison Frogs [poisonfrogs].


Numerous techniques for backdoor mitigation have been proposed in the literature. Fine-pruning [finepruning] proposed the first defense against backdoor attacks on DNNs using a combination of pruning and fine-tuning, where neurons are sorted by their activation on clean inputs and pruned in order, least-activated first. Our experiments show that this scheme is sensitive to how the attacker trains the BadNet. NeuralCleanse [neuralcleanse] detects and reverse-engineers the backdoor trigger through optimization, but makes assumptions about the trigger size and thus fails in several settings, as we see in this work. Qiao et al. [duke] propose a max-entropy staircase approximator (MESA) to recover the high-dimensional trigger distribution, but it shares shortcomings with NeuralCleanse. ABS [absliu] seeks to detect compromised backdoor neurons by observing large activation gaps at the output, but suffers from limiting assumptions regarding the number of compromised neurons. STRIP [strip2019acsac] intentionally perturbs network inputs by superimposing various image patterns, and observes the randomness of predicted classes. A low entropy in predicted classes suggests the potential presence of a malicious backdoor embedded in the network. Another recent defense [tran2018spectral] assumes the user has access to both clean and backdoored inputs, which is different from our attack model.


Generative Adversarial Networks were proposed by Goodfellow et al. in [goodfellow2014generative] as an architecture for training a generative and a discriminative model simultaneously. Since then, numerous variant architectures have been proposed, including conditional GANs [mirza2014conditional], where the generator output is conditioned on the input: a CGAN takes as input a one-hot encoded vector concatenated to random noise to generate an image from a specific category. This has enabled applications such as style transfer, as exemplified by pix2pix [isola2017image], where the characteristics of one domain of images are transferred to another.

This study uses CycleGAN [CycleGAN2017] as a solution for image-to-image translation when a user does not have paired data for GAN training. CycleGAN involves unsupervised learning where two generators and two discriminators are trained with a goal of cycle consistency, i.e., any image converted to a target domain and then back again should closely resemble the original image. We make use of this in a novel way: reverse-engineering an approximation of an attacker’s trigger insertion process for post-deployment treatment of a BadNet. We refer interested readers to [creswell_generative_2018] for a survey of GANs.
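
For reference, the cycle-consistency term from the CycleGAN paper [CycleGAN2017] can be written as below, where G maps domain X to domain Y and F maps Y back to X. Reading X as clean validation images and Y as quarantined images is an assumption about how the objective maps onto this setting, in which case G plays the role of the approximate poison() function.

```latex
\mathcal{L}_{\mathrm{cyc}}(G, F) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[ \lVert F(G(x)) - x \rVert_1 \right]
  + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\left[ \lVert G(F(y)) - y \rVert_1 \right]
```

This term is added to the two adversarial losses in the full CycleGAN objective; it is what allows training without paired clean and triggered versions of the same image.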

9 Conclusion

In this work, we proposed a novel two-stage Neural Network inoculation (NNoculation) against backdoored neural networks (BadNets). In the pre-deployment stage, we prepare noise-augmented treatment datasets that activate a broad-spectrum of BadNet neurons, allowing the neurons to be fine-tuned with clean validation data, and alleviating the need for unrealistic assumptions on trigger characteristics. Following a heuristics-based treated DNN selection, we proposed the deployment of the BadNet alongside the treated network, enabling a defender to quarantine data consisting of poisoned data samples. Using a CycleGAN-based approach, the defender can produce targeted treatment data to further treat the BadNet and reduce attack success rates. Our experiments revealed that our pre-deployment method effectively reduces attack success rate when deployed on numerous BadNets, without trigger assumptions, and that our post-deployment approach provides further treatment that outperforms state-of-the-art defenses NeuralCleanse and ABS, which are ineffective in settings in which NNoculation still works.


Code with README.txt file is available at


A Appendix

A.1 Results for Pre-deployment Treatment

Table 5, Table 6, and Table 7 present the clean accuracy (CA) and attack success rate (ASR) after pre-deployment treatment of BadNet-SG, -LS and -PN, respectively. Table 8 presents the clean accuracy for pre-deployment treatment of BadNet-CF. Results are for fine-tuning with various learning rates and with noise-augmented treatment datasets prepared with various noise levels.

noise   lr=1            lr=3            lr=4.5          lr=5.25         lr=6
        CA      ASR     CA      ASR     CA      ASR     CA      ASR     CA      ASR
10%     96.60   98.15   94.81   53.55   92.90   8.02    92.24   8.48    90.60   0
20%     95.40   93.07   93.72   35.37   92.32   9.97    91.50   2.55    90.72   0
30%     94.66   81.14   92.90   40.77   92.08   10.19   90.17   6.06    90.33   2.36
40%     92.86   63.38   92.51   25.07   90.91   11.58   90.14   4.42    89.51   2.04
50%     90.56   40.63   90.80   15.10   90.60   6.37    88.34   4.23    89.28   4.33
60%     91.50   22.90   90.37   3.28    86.28   0       87.02   0       87.10   0
Table 5: Pre-deployment treatment of BadNet-SG, by noise level (rows) and learning rate (columns). Heuristic-based pick of the pre-deployment DNN is in bold. The baseline CA as measured by the defender with only the defender-accessible data is 97.42%.
noise   lr=1            lr=3            lr=6            lr=9
        CA      ASR     CA      ASR     CA      ASR     CA      ASR
10%     96.99   57.27   95.20   37.53   91.54   10.45   91.27   1.53
20%     95.05   30.14   94.66   19.37   93.02   9.14    89.55   0
30%     94.58   15.84   94.03   11.51   91.23   3.64    90.14   0
40%     93.64   10.58   93.02   9.55    91.34   4.16    90.21   1.43
50%     93.45   9.54    92.82   3.25    90.91   3.24    89.67   1.37
60%     92.71   5.28    91.46   3.85    90.72   0       87.52   0
Table 6: Pre-deployment treatment of BadNet-LS, by noise level (rows) and learning rate (columns). Heuristic-based pick of the pre-deployment DNN is in bold. The baseline CA as measured by the defender with only the defender-accessible data is 97.15%.
noise   lr=0.001        lr=0.003        lr=0.006        lr=0.007
        CA      ASR     CA      ASR     CA      ASR     CA      ASR
10%     93.20   79.37   88.55   3.42    86.04   2.28    83.62   0
20%     92.46   62.81   89.67   10.12   82.69   0       80.46   0
30%     91.53   16.77   88      8.24    84.46   0       79.81   0
40%     92.37   59.45   89.95   4.76    78.32   0       84.27   0
50%     92.27   35.62   84.93   5.36    84.18   0       79.16   0
60%     91.72   33.19   89.58   5.35    81.11   0       77.95   0
Table 7: Pre-deployment treatment of BadNet-PN, by noise level (rows) and learning rate (columns). Heuristic-based pick of the pre-deployment DNN is in bold. The baseline CA as measured by the defender with only the defender-accessible data is 91.44%.
noise   lr=0.001   lr=0.003   lr=0.009   lr=0.01   lr=0.02
10%     87.4       87         84.6       81.4      10
20%     86.60      85.4       74.2       81.40     10
30%     86.60      84.99      81.79      79        10
40%     85         85.2       10         10        10
50%     86.2       85.2       78.40      78.8      10
60%     86.8       84.8       80         83.99     10
Table 8: Clean accuracy after pre-deployment treatment of BadNet-CF, by noise level (rows) and learning rate (columns). Heuristic-based pick of the pre-deployment DNN is in bold. The baseline CA as measured by the defender with only the defender-accessible data is 87.19%.

A.2 Further Study of Pre-deployment

To evaluate pre-deployment NNoculation on a wider set of BadNets, we prepare additional BadNet variants by modifying different training hyperparameters during BadNet training. These settings are presented in Tables 11, 12, and 13. For each variant we modify one or more hyperparameters compared to those used for preparing the original BadNet (Table 1), e.g., batch size, learning rate, or optimizer. In some variants we use a slightly different BadNet preparation process: instead of the three-step training described in Section 2.2, we add the option to prepare the BadNet in a two-step process, in which the final training happens in a single step.

To measure the CA, we evaluate each BadNet using all data withheld from training. To measure ASR, we evaluate each BadNet using poisoned versions of all data withheld from training. We use the heuristic described in Section 5 to select a treated DNN for each BadNet, and report the change in CA (as measured using all data withheld from the BadNet training) and the change in ASR (as measured using the poisoned version of the same data).

                 BadNet-SG variants      BadNet-LS variants
CA change (%)    4.9    4.9    5.7       4.4    6.7    5.9
ASR change (%)   93.5   99.0   69.5      98.2   97.4   100.0
Table 9: Results for pre-deployment treatment of BadNet-SG and LS variants.

                 BadNet-PN variants
CA change (%)    5.5    1.7    0.3
ASR change (%)   66.0   87.8   100.0
Table 10: Results for pre-deployment treatment of BadNet-PN variants.

We present the results of applying pre-deployment treatment to the different BadNet variants in Table 9 (BadNet-SG and BadNet-LS) and Table 10 (BadNet-PN). We find that in most cases, the degradation in "true" accuracy (as measured on all withheld data, as opposed to just the defender-accessible evaluation data) is as desired, with ASR reduction ranging from 66.0% to 100%; i.e., in some settings, the pre-deployment treatment is able to remove the backdoor behavior entirely. These results appear to support the idea that our pre-deployment treatment method (and by extension, the post-deployment method) is broadly applicable in spite of varying attacker BadNet training hyperparameters.
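
The reported changes appear to be relative drops with respect to the original BadNet. This reading is an inference rather than something stated in the text, but it reproduces the first column of Table 9 from the BATCH variant of BadNet-SG in Table 14, as the sketch below shows. The helper name `relative_drop` is hypothetical.

```python
def relative_drop(before: float, after: float) -> float:
    """Percentage drop relative to the starting value."""
    return 100.0 * (before - after) / before


# BadNet-SG, BATCH variant: original vs. treated values from Table 14.
ca_change = relative_drop(before=98.61, after=93.81)
asr_change = relative_drop(before=99.99, after=6.49)
```

Rounded to one decimal place, these give 4.9 and 93.5, matching the first CA change and ASR change entries listed for BadNet-SG in Table 9.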

hyperparameter   ORIG            BATCH           ATK             FINE
batch size       1283            256             1283            1283
epochs           200             200             200             200
learning rate    1               1               1               0.001
preprocessing    divide by 255   divide by 255   divide by 255   raw
attack type      3-step          3-step          2-step          3-step
Table 11: BadNet training hyperparameters for BadNet-SG variants.
hyperparameter   ORIG            BATCH           ATK             FINE     RAW
batch size       1283            256             1283            1283     1283
epochs           200             200             200             200      200
learning rate    1               1               1               0.001    1
preprocessing    divide by 255   divide by 255   divide by 255   raw      raw
attack type      3-step          3-step          2-step          3-step   3-step
Table 12: BadNet training hyperparameters for BadNet-LS variants.
hyperparameter   ORIG            LR              SGD             RAW      ATK
batch size       32              32              32              32       32
epochs           15              15              15              15       15
learning rate    0.001           0.003           0.01            0.001    0.001
optimizer        adam            adam            SGD             adam     adam
preprocessing    divide by 255   divide by 255   divide by 255   raw      divide by 255
attack type      3-step          3-step          3-step          3-step   2-step
Table 13: BadNet training hyperparameters for BadNet-PN variants.
                    Original BadNet      Treated DNN
Trigger   Variant   CA (%)    ASR (%)    CA (%)    ASR (%)
SG        BATCH     98.61     99.99      93.81     6.49
SG        ATK       97.35     99.98      92.54     1.01
SG        FINE      97.71     100        92.13     30.47
LS        BATCH     98.12     91.15      93.81     1.6
LS        FINE      96.34     91.4       89.88     2.34
LS        ATK       96.61     91.89      90.91     0
LS        RAW       98.5      90.87      93.08     30.89
PN        LR        92.48     99.96      90.92     12.24
PN        SGD       90.77     82.16      90.53     0
PN        RAW       93.38     99.87      90.64     1.75
PN        ATK       94.56     100        89.01     0
Table 14: Original BadNet and Pre-Deployment Treated DNN clean data accuracy (CA) and attack success rate (ASR).

A.3 Further Study of Post-deployment

                           Pre-Deploy Treat    Post-Deploy Treat
Trigger   Variant          CA       ASR        CA       ASR
SG        High ASR         90.6     40.63      94.27    0
SG        Mid ASR          91.01    15.1       94.08    1.13
SG        Low ASR          89.36    4.33       94.23    0
SG        Heuristic Pick   92.29    8.48       92.86    0
LS        High ASR         93.52    9.54       95.6     0
LS        Mid ASR          92.54    3.25       95.58    0
LS        Low ASR          91.13    3.24       96.27    0
LS        Heuristic Pick   93.08    30.89      96.61    7.87
PN        High ASR         95.01    35.62      94.29    0
PN        Mid ASR          90.55    5.36       94.46    0
PN        Low ASR          88.74    0          94.29    0
PN        Heuristic Pick   89.61    2.28       92.96    1.29
CF        Heuristic Pick   83.92    24.81      88.52    9.31
Table 15: Pre- and Post-deployment treatment of the BadNets.

To explore the impact of the chosen pre-deployment DNN’s ASR on post-deployment treatment, we investigate additional pre-deployment DNNs per BadNet type to show a "bad" scenario (where the DNN has a relatively high ASR) and a "good" scenario (where it has a relatively low ASR). Table 15 presents the clean accuracy and attack success rate after applying the post-deployment treatment to various pre-deployment treated networks. Variants are taken after pre-deployment treatment and used for post-deployment treatment (collecting quarantined data), representing three scenarios of high, middle, and low ASR, as well as the model that would have been selected by our heuristic. In the High, Mid, and Low ASR variants, the CPD ratio for post-deployment is 50%. In the Heuristic Pick case, we use a CPD ratio of 20%. Post-deployment treatment results represent the CA and ASR achieved by retraining the original BadNet (as specified in Section 4).

A.4 Comparison to Pruning

We apply pruning to BadNet-SG, BadNet-LS, and BadNet-PN. As the clean accuracy and attack success rate depend on the amount of pruning, we vary the percentage of neurons pruned from 0% to 100% and identify points where the clean accuracy or attack success rate is close to that of our fully treated, heuristic-picked network (as reported in Table 3). For comparable clean accuracy, the attack success rate of the pruned models varies from 99.4% to 99.98%. For comparable attack success rate (i.e., 0%), the clean accuracy of the pruned models varies from 0% to 17.17%. For all BadNets, pruning is clearly unable to mitigate backdoor behavior with minimal clean accuracy degradation.
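
The pruning baseline selects neurons by their activation on clean inputs, least-activated first. A minimal sketch of that selection step is given below. It assumes per-channel activations have already been recorded for a set of clean inputs; the function name and the data layout are hypothetical, and the sketch does not perform the pruning or any fine-tuning itself.

```python
from typing import List, Sequence


def channels_to_prune(activations: Sequence[Sequence[float]],
                      fraction: float) -> List[int]:
    """Indices of the least-activated channels on clean inputs.

    activations[i][c] is the recorded activation of channel c for clean input i.
    """
    n_inputs = len(activations)
    n_channels = len(activations[0])
    means = [sum(sample[c] for sample in activations) / n_inputs
             for c in range(n_channels)]
    # Sort channel indices by mean clean activation, ascending (stable sort).
    order = sorted(range(n_channels), key=lambda c: means[c])
    n_prune = int(fraction * n_channels)
    return sorted(order[:n_prune])


# Toy activations for 2 clean inputs and 4 channels.
toy_acts = [[0.9, 0.0, 0.4, 0.1],
            [0.7, 0.2, 0.6, 0.1]]
half = channels_to_prune(toy_acts, fraction=0.5)
none = channels_to_prune(toy_acts, fraction=0.0)
```

Sweeping `fraction` from 0 to 1 corresponds to the sweep described above, with CA and ASR re-measured at each pruning level.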