Hidden Trigger Backdoor Attacks

09/30/2019 ∙ by Aniruddha Saha, et al. ∙ University of Maryland, Baltimore County

With the success of deep learning algorithms in various domains, studying adversarial attacks to secure deep models in real world applications has become an important research topic. Backdoor attacks are a form of adversarial attack in which the attacker provides poisoned data for the victim to train the model with, and then activates the attack by showing a specific trigger pattern at test time. Most state-of-the-art backdoor attacks either provide mislabeled poisoned data that is possible to identify by visual inspection, reveal the trigger in the poisoned data, or use noise and perturbation to hide the trigger. We propose a novel form of backdoor attack where the poisoned data looks natural with correct labels and, more importantly, the attacker hides the trigger in the poisoned data and keeps the trigger secret until test time. We perform an extensive study on various image classification settings and show that our attack can fool the model by pasting the trigger at random locations on unseen images, while the model performs well on clean data. We also show that our proposed attack cannot be easily defended against using a state-of-the-art defense algorithm for backdoor attacks.

1. Introduction

Deep learning has achieved great results in many domains, including computer vision. However, it has been shown to be vulnerable in the presence of an adversary. The most well-known adversarial attacks are evasion attacks, where the attacker optimizes for a perturbation pattern to fool the deep model at test time (e.g., changing the prediction from the correct category to a wrong one).

Backdoor attacks are a different type of attack where the adversary chooses a trigger (a small patch), develops some poisoned data based on the trigger, and provides it to the victim to train a deep model with. The trained deep model will produce correct results on regular clean data, so the victim will not realize that the model is compromised. However, the model will mis-classify a source category image as a target category when the attacker pastes the trigger on the source image. As a popular example, the trigger can be a small sticker on a traffic sign that changes the prediction from “stop sign” to “speed limit”.

It has been shown that a pre-trained model can transfer easily to other tasks using a small amount of training data. For instance, it is common practice to download a deep model pretrained on ImageNet along with some images of interest from the web, and then finetune the model to solve the problem at hand. Backdoor attacks are effective in such applications since the attacker can leave poisoned data on the web for victims to download and use in training. Such attacks are not easy to mitigate because, in the big-data setting, it is difficult to ensure that all data is collected from reliable sources.

The most well-known backdoor attack [Gu, Dolan-Gavitt, and Garg2017] develops poisoned data by pasting the trigger on the source data and labeling it as the target category. During fine-tuning, the model then associates the trigger with the target category, and at test time, the model predicts the target category when the attacker presents the trigger on an image from the source category. However, such attacks are not very practical, as the victim can identify them by manually inspecting the images to find the wrong labels or the trigger itself.

Figure 1: Left: First, the attacker generates a set of poisoned images that look like the target category using Algorithm 1 and keeps the trigger secret. Middle: Then, the attacker adds the poisoned data, with its visibly correct label (the target category), to the training data, and the victim trains the deep model. Right: Finally, at test time, the attacker adds the secret trigger to images of the source category to fool the model. Note that unlike most previous trigger attacks, the poisoned data looks like the target category with no visible trigger, and the attacker reveals the trigger only at test time, which may be too late to defend against.

We propose hidden trigger attacks where the poisoned data is labeled correctly and does not contain any visible trigger; hence, it is not easy for the victim to identify the poisoned data by visual inspection. Inspired by [Shafahi et al.2018, Sabour et al.2016], we optimize for poisoned images that are close to target images in the pixel space and close to source images patched by the trigger in the feature space. We label these poisoned images with the target category, so they are not visually identifiable. We show that the fine-tuned model associates the trigger with the target category even though the model has never seen the trigger explicitly. We also show that this attack generalizes to unseen images and random trigger locations. Fig. 1 shows our threat model in detail.

We believe our proposed attack is more practical than previous backdoor attacks because (1) the victim does not have an effective way of identifying the poisoned data and (2) the trigger is kept truly secret by the attacker and revealed only at test time, which may be too late to defend against in many applications.

We perform various experiments along with ablation studies. For instance, we show that the attacker can reduce the validation accuracy on unseen images from 98% to 40% using a trigger placed at a random location that occupies less than 2% of the image area.

2. Related work

Poisoning attacks date back to [Xiao, Xiao, and Eckert2012] in 2012, where data poisoning was used to flip the results of an SVM classifier. More advanced methods were proposed in [Xiao et al.2015, Koh and Liang2017, Mei and Zhu2015, Burkard and Lagesse2017, Newell et al.2014]; these change the result of the classifier on clean data as well, which reduces the practical impact of such attacks since the victim may not deploy the model if the validation accuracy on clean data is low.

More recently, the possibility of backdoor attacks, where a trigger is used in poisoning the data, was shown in [Gu, Dolan-Gavitt, and Garg2017] and also in other works such as [Liu, Xie, and Srivastava2017, Liu et al.2017]. This kind of attack is more practical as it is triggered only by presenting a predefined pattern (the trigger), so the model works well on clean data. We are inspired by these works and extend them to the case where the trigger is not revealed at training time. [Muñoz-González et al.2017] uses back-gradient optimization and extends poisoning attacks to the multi-class setting. [Suciu et al.2018] studies the generalization and transferability of poisoning attacks. [Koh, Steinhardt, and Liang2018] proposes a stronger attack that places poisoned data points close to each other so that they are not detected by outlier detectors.

[Liao et al.2018] proposes to use small additive perturbations (similar to standard adversarial examples) instead of a patch to trigger the attack. Similar to our case, this method also results in poisoned images that look clean; however, it is less practical than ours since the attacker needs to manipulate a large number of pixel values to trigger the attack. We believe that at attack time, the feasibility of triggering is more important than the visibility of the trigger, hence we focus on hiding the trigger at poisoning time only. [Turner, Tsipras, and Madry2018] tries to hide the trigger in clean-label poisoned images by reducing the image quality and adversarially perturbing the poisoned images to be far from the source category. [Muñoz-González et al.2019] proposes a GAN-based approach to generate poisoned data, which can be used to model attackers with different levels of aggressiveness. [Rezaei and Liu2019] develops a target-agnostic attack that crafts instances which trigger specific output classes and can be used in transfer-learning settings.

[Shafahi et al.2018] proposes a poisoning attack with clean-label poisoned images in which the model is fooled when shown a particular set of images. Our method is inspired by this paper but proposes a backdoor trigger-based attack in which, at attack time, the attacker may present the trigger at any random location on any unseen image.

As poisoning attacks may have important consequences for the deployment of deep learning algorithms, recent works have proposed defenses against such attacks. [Liu, Dolan-Gavitt, and Garg2018, Wang et al.2019] assume the defender has access to the attacked model only. [Steinhardt, Koh, and Liang2017] proposes certified defenses for poisoning attacks. [Chen et al.2018] uses clustering and [Turner, Tsipras, and Madry2018] uses an outlier detection method to find the poisoned data.

[Gao et al.2019] identifies the attack at test time by perturbing or superimposing input images. [Shan et al.2019] defends by proactively injecting trapdoors into the models. More recently, [Tran, Li, and Madry2018] uses a statistical test to reveal and remove the poisoned data points. It assumes the poisoned data and the clean data form distinct clusters and separates them by analyzing the eigenvalues of the covariance matrix of the features. We use this method to defend against our proposed attack and show that it cannot find most of our poisoned data points.

3. Method

Following [Gu, Dolan-Gavitt, and Garg2017], an attacker provides poisoned data to a victim to use in learning. The victim downloads a pre-trained deep model and finetunes it for a classification task. The attacker has a secret trigger (a small image patch) and is interested in manipulating the training data so that when the trigger is shown to the finetuned model, it changes the model's prediction to a wrong category: any image from the source category, when patched by the trigger, will be mis-classified as the target category. This can be done in either a targeted setting, where the target category is decided by the attacker, or a non-targeted setting, where the attack is successful when the prediction is changed from the source to any other category. Although our method can be extended to the non-targeted attack, we study the targeted attack as it is more challenging for the attacker.

For the attacker to be successful, the finetuned model should perform correctly when the trigger is not shown to the model. Otherwise, in the evaluation process, the victim will notice the low accuracy and will either not deploy the model in the real world or modify the training data provided by the attacker.

The most well-known method in [Gu, Dolan-Gavitt, and Garg2017] proposes that the attacker can develop a set of poisoned training data (pairs of images and labels) by adding the trigger to a set of images from the source category and changing their label to the target category. Since some patched source images are labeled as target category, when the victim finetunes the model, the model will learn to associate the trigger patch with the target category. Then, the model will work correctly on non-patched images and will mis-classify patched source images to the target category. Hence, the attack is successful.

More formally, given a source image $s$ from the source category, a trigger patch $p$, and a binary mask $m$ that is 1 at the location of the patch and 0 everywhere else, the attacker pastes the trigger at location $k$ on the source image to get the patched source image $\tilde{s}$:

$$\tilde{s} = s \odot (1 - m_k) + p_k \odot m_k \qquad (1)$$

where $\odot$ denotes element-wise product and $p_k$ and $m_k$ are $p$ and $m$ shifted to location $k$.
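The patching operation of Eq. (1) is straightforward to implement. Below is a minimal PyTorch sketch, assuming images are stored as (3, H, W) tensors; the function and variable names are illustrative, not the authors' code.

```python
import torch

def paste_trigger(source, trigger, top, left):
    """Paste a square trigger patch onto a source image (Eq. 1).

    source:  (3, H, W) image tensor
    trigger: (3, P, P) patch tensor
    (top, left): location k where the patch is placed
    """
    _, p, _ = trigger.shape
    # m_k: binary mask that is 1 inside the patch region and 0 elsewhere
    mask = torch.zeros_like(source)
    mask[:, top:top + p, left:left + p] = 1.0
    # p_k: trigger shifted to location k (zero-padded to image size)
    shifted = torch.zeros_like(source)
    shifted[:, top:top + p, left:left + p] = trigger
    # s~ = s * (1 - m_k) + p_k * m_k
    return source * (1.0 - mask) + shifted * mask

# Example: a 224x224 image with a 30x30 trigger at a random location
img = torch.rand(3, 224, 224)
trig = torch.rand(3, 30, 30)
top, left = torch.randint(0, 224 - 30, (2,)).tolist()
patched = paste_trigger(img, trig, top, left)
```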

In training, the attacker labels $\tilde{s}$ incorrectly with the target category and provides it to the victim as poisoned data. The model trained by the victim associates the trigger with the target label. Hence, at test time, the attacker can fool the model by simply pasting the trigger on any image from the source category using Eq. (1).

Our attack model:

In standard backdoor attacks, the poisoned data is labeled incorrectly, so the victim can remove it by manually annotating the data after downloading. Moreover, the attacker should ideally keep the trigger secret; however, in standard backdoor attacks, the trigger is revealed in the poisoned data. Inspired by [Shafahi et al.2018, Sabour et al.2016], we propose a stronger and more practical attack model where the poisoned data is labeled correctly (the poisoned images look like the target category and are labeled as the target category) and does not reveal the secret trigger. We do so by optimizing for an image that, in the pixel space, looks like an image from the target category and, in the feature space, is close to a source image patched by the trigger.

More formally, given a target image $t$, a source image $s$, and a trigger patch $p$, we paste the trigger on $s$ to get the patched source image $\tilde{s}$ using Eq. (1). Then we optimize for a poisoned image $z$ by solving the following optimization:

$$\underset{z}{\arg\min} \;\; \|f(z) - f(\tilde{s})\|_2^2 \quad \text{s.t.} \quad \|z - t\|_\infty < \epsilon \qquad (2)$$

where $f(\cdot)$ denotes the intermediate features of the deep model and $\epsilon$ is a small value that makes sure the poisoned image $z$ is not visually distinguishable from the target image $t$. In most experiments, we use the fc7 layer of AlexNet for $f(\cdot)$ and $\epsilon = 16$ when the image pixel values are in the range $[0, 255]$. We use the standard projected gradient descent (PGD) algorithm [Madry et al.2017], which iterates between (a) optimizing the loss in Eq. (2) using gradient descent and (b) projecting the current solution back onto the $\epsilon$-neighborhood of the target image to satisfy the constraint in Eq. (2).
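A minimal sketch of this PGD procedure for a single target/patched-source pair is given below. It assumes images scaled to [0, 1] (so $\epsilon$ = 16/255), builds an fc7 feature extractor from torchvision's AlexNet, and omits the usual ImageNet input normalization for brevity; the learning rate and iteration count are illustrative, not the authors' settings.

```python
import torch
import torch.nn as nn
from torchvision import models

# fc7 features of AlexNet: everything up to the second fully connected
# layer and its ReLU in torchvision's classifier
alexnet = models.alexnet(pretrained=True).eval()
feature_fc7 = nn.Sequential(
    alexnet.features, alexnet.avgpool, nn.Flatten(),
    *list(alexnet.classifier.children())[:6],  # up to the fc7 activation
)
for prm in feature_fc7.parameters():
    prm.requires_grad_(False)

def poison_single(target, patched_source, eps=16 / 255, lr=0.01, iters=1000):
    """PGD sketch for Eq. (2): find z close to `target` in pixel space
    and close to `patched_source` in fc7 feature space."""
    z = target.clone()
    feat_s = feature_fc7(patched_source.unsqueeze(0)).detach()
    for _ in range(iters):
        z.requires_grad_(True)
        loss = (feature_fc7(z.unsqueeze(0)) - feat_s).pow(2).sum()
        grad, = torch.autograd.grad(loss, z)
        with torch.no_grad():
            z = z - lr * grad                          # gradient step on Eq. (2)
            z = target + (z - target).clamp(-eps, eps)  # project onto the eps-ball
            z = z.clamp(0, 1)                          # keep a valid image
    return z.detach()
```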

Fig. 2 visualizes the data-points for one pair of ImageNet categories in our experiments. We refer the reader to the caption of the figure for a discussion of our observations.

Generalization across source images and trigger locations:

The above optimization generates a single poisoned data-point given a pair of images from the source and target categories as well as a fixed location for the trigger. One can add this poisoned data-point with the correct label to the training data and train a binary classifier in a transfer-learning setting by tuning only the final layer of the network. However, such a model may be fooled only when the attacker shows the trigger at the same location on the same source image, which is not a very practical attack.

We are interested in generalizing the attack so that it works for novel source images (not seen at the time of poisoning) and any random location for the trigger. Hence, in optimization, we should push the poisoned images to be close to the cluster of patched source images rather than being close to a single patched source image. Inspired by universal adversarial examples in [Moosavi-Dezfooli et al.2017], we can minimize the expected value of the loss in Eq. (2) over all possible trigger locations and source images. This can be done by simply choosing a random source image and trigger location at each iteration of the optimization.

Moreover, one poisoned example added to a large clean dataset may not be enough for generalization across all patched source images, so we develop multiple poisoned images. Since the distribution of all patched source images in the feature space may be diverse, we propose an iterative method to optimize for multiple poisoned images jointly: at each iteration, we randomly sample a set of patched source images and assign each of them to the closest current poisoned image (solution) in the feature space. Then, we optimize to reduce the sum of these pairwise distances in the feature space while satisfying the constraint in Eq. (2).

This is somewhat similar to a coordinate descent algorithm, where we alternate between optimizing the loss and updating the assignments (e.g., as in k-means). To avoid tuning all the poisoned images for just a few patched source images, we use a one-to-one matching. One could use the Hungarian algorithm [Kuhn1955], but to speed up, we use a simple greedy algorithm that iterates over finding the best remaining match and removing it.
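A sketch of such a greedy one-to-one matching is shown below, assuming features are stored as (K, D) tensors; the names are illustrative.

```python
import torch

def greedy_one_to_one(feat_poison, feat_patched):
    """Greedily match each poisoned image to a distinct patched source image.

    feat_poison:  (K, D) features of the current poisoned images z_k
    feat_patched: (K, D) features of the patched source images
    Returns a(k): index of the patched image assigned to poison k.
    """
    dist = torch.cdist(feat_poison, feat_patched)   # (K, K) Euclidean distances
    K = dist.shape[0]
    assign = [-1] * K
    for _ in range(K):
        # pick the globally closest remaining (poison, patched) pair ...
        idx = torch.argmin(dist)
        i, j = divmod(idx.item(), K)
        assign[i] = j
        # ... and remove both from further consideration
        dist[i, :] = float("inf")
        dist[:, j] = float("inf")
    return assign
```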

More formally, we run Algorithm 1 to generate a set of K poisoned images from a set of source and target images.

Result: K poisoned images $z_k$
1. Sample K random images from the target category and initialize the poisoned images $z_k$ with them;
while loss is large do
       2. Sample K random images from the source category and patch them with the trigger at random locations to get the patched images $\tilde{s}_k$;
       3. Find a one-to-one mapping $a(k)$ between $\tilde{s}_k$ and $z_k$ using Euclidean distance in the feature space $f(\cdot)$;
       4. Perform one iteration of mini-batch projected gradient descent for the loss $\sum_{k=1}^{K} \|f(z_k) - f(\tilde{s}_{a(k)})\|_2^2$, projecting each $z_k$ back onto the $\epsilon$-neighborhood of its corresponding target image;
end while
Algorithm 1: Generating poisoning data
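Below is a hedged end-to-end sketch of Algorithm 1 that reuses the illustrative helpers sketched earlier (paste_trigger, feature_fc7, greedy_one_to_one). It replaces the "while loss is large" criterion with a fixed iteration budget and assumes square images scaled to [0, 1]; it is a sketch under those assumptions, not the authors' implementation.

```python
import random
import torch

def generate_poisons(targets, sources, trigger, iters=5000, eps=16 / 255, lr=0.01):
    """Sketch of Algorithm 1: jointly optimize K poisoned images.

    targets: list of K target-category images, each (3, H, W)
    sources: list of source-category images to sample from
    trigger: (3, P, P) secret patch
    """
    K = len(targets)
    t = torch.stack(targets)          # anchors for the eps-ball projection
    z = t.clone()                     # Step 1: initialize poisons with target images
    H, P = t.shape[-1], trigger.shape[-1]
    for _ in range(iters):
        # Step 2: K random source images, patched at random locations
        batch = [paste_trigger(random.choice(sources), trigger,
                               random.randrange(H - P), random.randrange(H - P))
                 for _ in range(K)]
        feat_s = feature_fc7(torch.stack(batch)).detach()
        # Step 3: one-to-one matching in the feature space
        z.requires_grad_(True)
        feat_z = feature_fc7(z)
        a = greedy_one_to_one(feat_z.detach(), feat_s)
        # Step 4: one projected gradient step on the summed pairwise loss
        loss = (feat_z - feat_s[a]).pow(2).sum()
        grad, = torch.autograd.grad(loss, z)
        with torch.no_grad():
            z = z - lr * grad
            z = t + (z - t).clamp(-eps, eps)   # stay within eps of each target image
            z = z.clamp(0, 1)
    return z.detach()
```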

After generating the poisoned data, we append it to the training data of the target category and finetune a binary classifier for the source and target categories. We consider the attack successful if, on the validation data, this classifier works well on clean images and poorly on patched source images. Note that we keep the images used for generating the poisoned data separate from those used for finetuning the binary classifier.

Figure 2: Best viewed in color. We plot the distribution of the data before the attack (left) and after the attack (right). Color coding: blue: source category; red: target category; green: patched source category; black: poisoned images (labeled as the target category). For the 2D visualization, we choose the x-axis to be the direction of the binary classifier and the y-axis to be the vector connecting the centers of the two classes, projected to be orthogonal to the x-axis. Before the attack, most green points are correctly classified as blue (larger on the x-axis), but after the attack (adding the black points to the training data as the red category), the classifier has rotated so that most green points move to the side of the red category (smaller on the x-axis). Our optimization pushes the black points to be close to the green points in the feature space while they look similar to the red points visually.

Figure 3: Visualization of source, target, patched source and poisoned target images from one of the pairs of categories from ImageNet. The fourth column is visually similar to the second column, but is close to the third column in the feature space. The victim does not see the third column, so the trigger is hidden until the test time.
Figure 4: The triggers we generated randomly for our poisoning attacks.

4. Experiments

Dataset: Since we want separate datasets for generating the poisoned data and for finetuning the binary model, we divide the ImageNet data into three sets for each category: 200 images for generating the poisoned data, 800 images for training the binary classifier, and 100 images for testing the binary classifier. For most experiments, we choose 10 random pairs of ImageNet categories as source and target to evaluate our attack: (“slot, one-armed bandit”, “Australian terrier”), (“lighter”, “bee”), (“theater curtain”, “plunger”), (“unicycle”, “partridge”), (“mountain bike”, “iPod”), (“coffeepot”, “Scottish deerhound”), (“can opener”, “sulphur-crested cockatoo”), (“hotdog”, “toyshop”), (“electric locomotive”, “tiger beetle”), (“wing”, “goblet”). We also use 10 hand-picked pairs in Section 4.5 and 10 dog-only pairs in Section 4.6, for which the names are listed in the supplementary material. Moreover, we use the CIFAR10 dataset for the experiments in Section 4.4.

Triggers: We generate 10 random triggers by drawing a random matrix of colors and resizing it to the desired patch size using bilinear interpolation. Fig. 4 shows the triggers used in our experiments. We randomly sample a single trigger for each experiment (a pair of source and target categories).
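A small sketch of this trigger-generation step is given below; the size of the random color matrix is not specified in the text, so the 4x4 grid is a placeholder.

```python
import torch
import torch.nn.functional as F

def random_trigger(patch_size, grid=4):
    """Draw a small random color matrix and upsample it to the patch size
    with bilinear interpolation (the grid size is a placeholder)."""
    colors = torch.rand(1, 3, grid, grid)
    patch = F.interpolate(colors, size=(patch_size, patch_size),
                          mode="bilinear", align_corners=False)
    return patch.squeeze(0)  # (3, patch_size, patch_size)

trigger = random_trigger(30)
```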

Our experimental setup includes multiple steps as shown in Fig. 1.

Generate poisoned images: First, we use the source and target pairs to generate poisoned images using Algorithm 1. We use the fc7 features of AlexNet [Krizhevsky, Sutskever, and Hinton2012] for the embedding $f(\cdot)$.

Poison the training set: Then, we label the poisoned images as the target category and add them to the training set. Note that the poisoned images look visually close to the target images, so the poisoning is almost impossible to detect by manual inspection.

Finetuning: Then, we train a binary image classifier to distinguish between source and target images. We evaluate the attack via the accuracy of the finetuned model on the clean validation set and on patched images from the source category of the validation set. For each image in our validation set, we randomly choose 10 locations to paste our trigger, generating 1,000 patched images of the source category. For a successful attack, we expect high clean validation accuracy and low patched validation accuracy. Note that for patched validation we use only patched source images.
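A sketch of this evaluation protocol is given below, assuming a two-way classifier in which label 0 denotes the source category and label 1 the target category; the tensors and names are placeholders.

```python
import torch

def eval_attack(model, clean_imgs, labels, source_imgs, trigger, n_loc=10):
    """Report clean accuracy and patched-source accuracy (lower = stronger attack).

    clean_imgs / labels: clean validation images and labels for both classes
    source_imgs: validation images of the source category only
    trigger: (3, P, P) secret patch
    """
    model.eval()
    with torch.no_grad():
        clean_acc = (model(clean_imgs).argmax(1) == labels).float().mean().item()
        P, H = trigger.shape[-1], source_imgs.shape[-1]
        hits, total = 0, 0
        for img in source_imgs:
            for _ in range(n_loc):                    # 10 random locations per image
                top, left = torch.randint(0, H - P, (2,)).tolist()
                patched = img.clone()
                patched[:, top:top + P, left:left + P] = trigger
                # count how often the patched source is still classified as source (label 0)
                hits += int(model(patched.unsqueeze(0)).argmax(1).item() == 0)
                total += 1
    return clean_acc, hits / total
```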

4.1. ImageNet random pairs

For this experiment, we choose 10 random pairs of image categories from the ImageNet dataset. For our ImageNet experiments, we set a reference parameter set: perturbation ε = 16, a 30×30 trigger (while the images are 224×224), and a random location for the trigger on the source image. We generate 100 poisoned examples and add them to our target-class training set of 800 images during finetuning; thus about 12.5% of the target data is poisoned.

To generate our poisoned images, we run Algorithm 1 with mini-batch projected gradient descent for 5,000 iterations. To speed up the optimization, we keep the batch of source images constant (i.e., we run Step 2 of Algorithm 1 outside the loop). We use an initial learning rate of 0.01 with a decay schedule parameter of 0.95 every 2,000 iterations. The code is very similar to the standard projected gradient descent (PGD) attack [Madry et al.2017] for adversarial examples. It takes about 5 minutes to generate 100 poisoned images on a single NVIDIA Titan X GPU.
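For reference, the hyperparameters of this setting can be collected as a plain summary of the values stated in the text; the mini-batch size is not specified here.

```python
# Reference parameter set for the ImageNet experiments (values from the text;
# the mini-batch size is not specified in this section).
REFERENCE_PARAMS = {
    "epsilon": 16,        # L_inf perturbation, pixel values in [0, 255]
    "trigger_size": 30,   # pixels, on 224x224 images
    "image_size": 224,
    "num_poisons": 100,   # added to 800 clean target images (~12.5% poisoned)
    "iterations": 5000,
    "learning_rate": 0.01,
    "lr_decay": 0.95,     # applied every 2,000 iterations
}
```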

We generate 400 poisoned images, add the 100 images with the lowest loss values to the target training set, and train the binary classifier. We use AlexNet as our base network with all weights frozen except the fc8 layer, which we initialize from scratch and finetune for our task. A successful attack should have low accuracy on the patched validation data from the source category and high accuracy on the clean validation data. Table 2 shows these results. Qualitative results for one pair of source and target categories are shown in Fig. 3. Fig. 2 shows a 2D visualization of all the data-points along with the decision boundary before and after the attack; we refer the reader to the caption of Fig. 2 for more details.
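A sketch of this finetuning setup with torchvision's AlexNet is shown below; the optimizer and its hyperparameters are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn
from torchvision import models

# Freeze the entire pretrained AlexNet except fc8, which is reinitialized
# as a 2-way classifier (source vs. target) and trained on the poisoned set.
model = models.alexnet(pretrained=True)
for prm in model.parameters():
    prm.requires_grad_(False)
model.classifier[6] = nn.Linear(4096, 2)   # fc8, trained from scratch

optimizer = torch.optim.SGD(model.classifier[6].parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def finetune_step(images, labels):
    """One training step; the batch mixes the clean source/target images
    with the 100 poisons labeled as the target category."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```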

Perturbation ε = 8 | 16 | 32: Val Clean 0.981±0.01 | 0.982±0.01 | 0.984±0.01; Val Patched (source only) 0.460±0.18 | 0.437±0.15 | 0.422±0.17
Patch size = 15 | 30 | 60: Val Clean 0.980±0.01 | 0.982±0.01 | 0.989±0.01; Val Patched (source only) 0.630±0.15 | 0.437±0.15 | 0.118±0.06
#Poison = 50 | 100 | 200 | 400: Val Clean 0.988±0.01 | 0.982±0.01 | 0.976±0.02 | 0.961±0.02; Val Patched (source only) 0.605±0.16 | 0.437±0.15 | 0.300±0.13 | 0.214±0.14
Table 1: Results of our ablation studies. Note that the parameters which are not being varied are set to the reference values mentioned in Section 4.1. Also, note that a successful attack has low accuracy on the patched set while maintaining high accuracy on the clean set.
Val Clean / Val Patched (source only):
ImageNet Random Pairs: Clean Model 0.993±0.01 / 0.987±0.02; Poisoned Model 0.982±0.01 / 0.437±0.15
CIFAR10 Random Pairs: Clean Model 1.000±0.00 / 0.993±0.01; Poisoned Model 0.971±0.01 / 0.182±0.14
ImageNet Hand-Picked Pairs: Clean Model 0.980±0.01 / 0.997±0.01; Poisoned Model 0.996±0.01 / 0.428±0.13
ImageNet Dog Pairs: Clean Model 0.962±0.03 / 0.947±0.06; Poisoned Model 0.944±0.03 / 0.419±0.07
Table 2: Results on random pairs, hand-picked pairs, and dog-only pairs on ImageNet, as well as random pairs on CIFAR10. It is important to note that no patched source image is shown to the network during finetuning, but at test time the presence of the trigger still fools the model. Because no patched images appear in the training set, human inspection will not reveal our poisoning attack, and the attacker keeps the trigger secret until attack time. We report accuracy averaged over 10 random patch locations and 10 random pairs of source and target categories.
ImageNet Random Pairs:
fc8 trained: Val Clean 0.984±0.01, Val Patched (source only) 0.504±0.16
(fc6, fc7, fc8) trained: Val Clean 0.983±0.01, Val Patched (source only) 0.646±0.18
Table 3: Finetuning more layers. Allowing the network more freedom to adjust its weights decreases the attack's effectiveness, but a gap of roughly 30% remains between clean and patched validation accuracy. Note that a successful attack has low accuracy on the patched set while maintaining high accuracy on the clean set.
Pair ID | #Clean target | #Clean source | #Poisoned | #Poisoned removed | #Clean target removed
1, 2, 4, 6, 7, 8, 9, 10 | 800 | 800 | 100 | 0 | 135
3 | 800 | 800 | 100 | 55 | 80
5 | 800 | 800 | 100 | 8 | 127
Table 4: We use the spectral signatures defense method from [Tran, Li, and Madry2018] to detect our poisoned images. For most pairs, it does not find any of our 100 poisoned images among the top 135 results.

4.2. Ablation study on ImageNet random pairs

To better understand the influence of each parameter on this poisoning attack, we perform extensive ablation studies. Starting from the reference parameter set described in the previous section, we vary each parameter independently and perform our poisoning attack. Results are shown in Table 1.

Perturbation ε: We choose the perturbation ε from the set {8, 16, 32} and generate poisons for each setting. We observe that ε does not have a big influence on our attack's effectiveness. As ε increases, the patched validation accuracy decreases slightly, which is expected.

Trigger size: We see that the attack's effectiveness increases with the trigger patch size, which is expected. A large patch may occlude the main object at some locations, making the attack easier.

Number of poisons: We vary the number of poisoned images added to the target training set over the set {50, 100, 200, 400}. We empirically see that more poisoned data has a larger influence on the decision boundary during finetuning, which is expected. Adding 400 poisoned images to 800 clean target images is the best performing attack, in which 33% of the target data is poisoned.

4.3. Finetuning more layers:

So far, we have observed that our poisoning attack works reasonably well when we finetune only the fc8 layer in a binary classification task. We expect the attack to be weaker if we finetune more layers, since our attack uses the fc7 feature space, which evolves during finetuning.

Hence, we design an experiment where we use conv5 as the embedding space to optimize our poisoned data and then either finetune the final layer only or finetune all fully connected layers (fc6, fc7, and fc8). We initialize the layers we are finetuning from scratch. The results are shown in Table 3. As expected, finetuning more layers weakens our attack, but the accuracy on the patched data is still lower than 65% while the clean accuracy is more than 98%. This means our attack remains reasonably successful even if all fully connected layers are learned from scratch in transfer learning.
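A sketch of the corresponding model surgery is shown below: freeze the convolutional backbone and re-draw fc6, fc7, and fc8 from scratch. The layer indices refer to torchvision's AlexNet classifier and are an assumption about how one might reproduce this setup.

```python
import torch.nn as nn
from torchvision import models

# Reinitialize all fully connected layers (fc6, fc7, fc8) and freeze the
# convolutional backbone; only the reinitialized layers are trained.
model = models.alexnet(pretrained=True)
for prm in model.features.parameters():
    prm.requires_grad_(False)
model.classifier[1] = nn.Linear(256 * 6 * 6, 4096)  # fc6, from scratch
model.classifier[4] = nn.Linear(4096, 4096)         # fc7, from scratch
model.classifier[6] = nn.Linear(4096, 2)            # fc8: binary source/target
```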

4.4. CIFAR10 random pairs

We evaluate our attack on 10 randomly selected pairs of CIFAR10 categories. We use a simplified version of AlexNet that has four convolutional layers with (64, 192, 384, and 256) kernels and two fully connected layers with (512 and 10) neurons; the first layer uses a stride of 1. For pretraining, we use SGD for 200 epochs with a learning rate of 0.001, momentum of 0.9, weight decay of 5e-4, and no dropout. Since CIFAR10 images are only 32x32, placing the patch randomly might fully occlude the object, so we place our trigger at the right corner of the image. For each category, we have 1,500 images for generating the poisoned data, 1,500 images for finetuning, and 1,000 images for evaluation; these three sets are disjoint. We generate 800 poisoned images using our method. We use ε = 16, a patch size of 8, and optimize for 10,000 iterations with a learning rate of 0.01 and a decay schedule parameter of 0.95 every 2,000 iterations. The results, in Table 2, show that we achieve a high attack success rate.
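A sketch of such a simplified AlexNet is given below. The text specifies the channel widths, the fully connected sizes, and a stride of 1 in the first layer; the kernel sizes and pooling layout are assumptions.

```python
import torch.nn as nn

class SmallAlexNet(nn.Module):
    """Sketch of the simplified AlexNet described in the text: four conv
    layers (64, 192, 384, 256 kernels) and two fully connected layers
    (512, 10). Kernel sizes and pooling placement are assumptions."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                   # 32 -> 16
            nn.Conv2d(64, 192, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                   # 16 -> 8
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                   # 8 -> 4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 4 * 4, 512), nn.ReLU(inplace=True),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```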

4.5. ImageNet hand-picked pairs

To control the semantic distance of the category pairs, we hand-pick 20 classes from ImageNet using PASCAL VOC classes as a reference. Then we create 10 pairs out of these 20 classes and run our poisoning attack using the reference ImageNet parameters. The results are shown in Table 2 and the category names are listed in supplementary material.

4.6. ImageNet dog pairs

Another interesting question is the behaviour of the poisoning attack when we finetune a binary classifier on visually similar categories, e.g., two breeds of dogs. We randomly picked 10 pairs of dog categories from ImageNet and ran our poisoning attack. The results are shown in Table 2 and the category names are listed in the supplementary material.

4.7. Targeted attack on multi-class setting

We performed multi-class experiments using 20 random categories of ImageNet, obtained by combining the 10 random pairs. Each category contains 200 images for generating the poisoned data, around 1,100 images for training, and 50 images for validation of the multi-class classifier. We generate 400 poisoned images with fc7 features and add them to the target category in the training set to train the last layer of the multi-class classifier. The target category is always chosen by the attacker, but the source category can either be chosen by the attacker (“single-source”) or be any category (“multi-source”):

(1) Single-source attack: The attacker chooses a single source category to fool by showing the trigger. We use the same poisoned data as in the random-pairs experiment, but train a multi-class classifier, and average over 10 experiments (one for each pair). On the source category, the multi-class model has a validation accuracy of on clean images and an attack success rate of on patched source validation images. Note that a higher success rate indicates a better targeted attack. The error bars are large as some of those 20 categories are easier to attack than others.

(2) Multi-source attack: The attacker wants to change any category to the target, which is a more challenging task. The multi-class model has a validation accuracy of on clean images and an attack success rate of on patched images, while random chance is 5%. We exclude target images when patching. We believe this is a challenging task since the source images have large variation; hence, it is difficult to find a small set of perturbed target images that represent all patched source images in the feature space. We address this with our EM-like optimization in Algorithm 1.

4.8. Spectral signatures for backdoor attack detection

[Tran, Li, and Madry2018] proposes a method for detecting the presence of backdoor inputs in the training set. For the attack, they follow the standard method of BadNets [Gu, Dolan-Gavitt, and Garg2017] and mis-label the poisoned data, which contains a visible trigger. We believe the same method may be suitable against our attack as well, since our poisoned data should be close to the patched source images in the feature space and therefore somewhat separated from the rest of the target images. Hence, a defense similar to [Tran, Li, and Madry2018] may be able to find the poisoned data in the target class.

Using the publicly released code of [Tran, Li, and Madry2018], we evaluated this method on our poisoned data. Table 4 shows the number of detected poisoned images for each of our pairs. We used the default 85th-percentile threshold from [Tran, Li, and Madry2018], which removes the top 135 images of the target class (800 clean plus 100 poisoned), even though there are only 100 actual poisoned images. Although this threshold removes more images than the number of poisons, the defense cannot find any poisoned images for most pairs, and it finds roughly half of the poisoned images for only one pair. Note that we favor the defense by assuming it knows which category is poisoned, which does not hold in practice. We believe this happens because, as shown empirically in Fig. 2, there is not much separation between the target data and the poisoned data.
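For reference, the spectral-signature score that this defense thresholds can be sketched as follows. This is a re-implementation of the statistic described above, not the released code.

```python
import torch

def spectral_signature_scores(features):
    """Outlier score used by the spectral-signatures defense (sketch):
    project the centered features of one (suspected) class onto the top
    singular vector and score each example by its squared projection.

    features: (N, D) penultimate-layer features of the examined class
    Returns per-example scores; the defense removes the highest-scoring
    examples (the top 15% under the 85th-percentile threshold).
    """
    centered = features - features.mean(dim=0, keepdim=True)
    # top right singular vector of the centered feature matrix
    _, _, Vh = torch.linalg.svd(centered, full_matrices=False)
    top_dir = Vh[0]
    return (centered @ top_dir) ** 2

# Usage sketch:
# scores = spectral_signature_scores(fc7_feats_of_target_class)
# suspected = scores > torch.quantile(scores, 0.85)
```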

5. Conclusion

We propose a novel backdoor attack that is triggered by showing a small trigger patch at test time at a random location on an unseen image. The poisoned data looks natural, has clean labels, and does not reveal the trigger; hence, the attacker can keep the trigger secret until the actual attack time. We show that our attack works on two different datasets and in various settings. We also show that a state-of-the-art backdoor detection method cannot effectively defend against our attack. We believe such practical attacks reveal an important vulnerability of deep learning algorithms that needs to be resolved before deploying deep learning in critical real-world applications in the presence of adversaries. We hope this paper facilitates further research into better defense models.

Acknowledgement: This work was performed under financial assistance award 60NANB18D279 from the U.S. Department of Commerce, National Institute of Standards and Technology, with additional funding from SAP SE and NSF grant 1845216.

References