Poisoning Attacks with Generative Adversarial Nets

06/18/2019 · by Luis Muñoz-González, et al. · Hasso Plattner Institute, Princeton University, Imperial College London

Machine learning algorithms are vulnerable to poisoning attacks: an adversary can inject malicious points into the training dataset to influence the learning process and degrade the algorithm's performance. Optimal poisoning attacks have already been proposed to evaluate worst-case scenarios, modelling attacks as a bi-level optimisation problem. Solving these problems is computationally demanding and has limited applicability for some models such as deep networks. In this paper we introduce a novel generative model to craft systematic poisoning attacks against machine learning classifiers by generating adversarial training examples, i.e. samples that look like genuine data points but degrade the classifier's accuracy when used for training. We propose a Generative Adversarial Net with three components: generator, discriminator, and the target classifier. This approach allows us to model naturally the detectability constraints that can be expected in realistic attacks and to identify the regions of the underlying data distribution that are more vulnerable to data poisoning. Our experimental evaluation shows the effectiveness of our attack in compromising machine learning classifiers, including deep networks.


1 Introduction

Despite the advances and benefits of machine learning, it has been shown that learning algorithms are vulnerable and can be targeted by attackers, who can gain a significant advantage by exploiting these vulnerabilities [8]. At training time, learning algorithms are vulnerable to poisoning attacks, where small fractions of malicious points injected into the training set can subvert the learning process and degrade the performance of the system in an indiscriminate or targeted way. Data poisoning is one of the most relevant and emerging security threats in applications that rely upon the collection of large amounts of data in the wild [11]. For instance, some applications rely on users' feedback or on untrusted sources of information that often collude towards the same malicious goal. In IoT environments, for example, sensors can be compromised and adversaries can craft coordinated attacks, manipulating the measurements of neighbouring sensors while evading detection [9]. In many applications, curation of the whole training dataset is not possible, exposing machine learning systems to poisoning attacks.

In the research literature, optimal poisoning attack strategies have been proposed against different machine learning algorithms [3, 17, 19, 10], allowing their performance to be assessed in worst-case scenarios. These attacks can be modelled as a bi-level optimisation problem, where the outer objective represents the attacker's goal and the inner objective corresponds to the training of the learning algorithm with the poisoned dataset. Solving these bi-level optimisation problems is challenging and can be computationally demanding, especially for generating poisoning points at scale, which limits their applicability against some learning algorithms, such as deep networks, or when the training set is large. In many cases, if no detectability constraints are considered, the generated poisoning points are outliers that can be removed with data filtering [21]. Furthermore, such attacks are not realistic, as real attackers would aim to remain undetected in order to continue subverting the system in the future. As shown in [14], detectability constraints for these optimal attack strategies can be modelled; however, they further increase the complexity of the attack, limiting even more the applicability of these techniques.
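To make the structure of these attacks concrete, the bi-level problem described above can be sketched as follows; this is a hedged, generic formulation using standard notation from the optimal-attack literature, not the exact problem solved in [3, 17, 19, 10]:

    \mathcal{D}_p^{*} \in \arg\max_{\mathcal{D}_p} \; \mathcal{A}(\mathcal{D}_{\mathrm{val}}, \mathbf{w}^{*})
    \quad \text{s.t.} \quad
    \mathbf{w}^{*} \in \arg\min_{\mathbf{w}} \; \mathcal{L}(\mathcal{D}_{\mathrm{tr}} \cup \mathcal{D}_p, \mathbf{w}),

where D_p denotes the set of poisoning points, A the attacker's (outer) objective evaluated with the parameters w* learned on the poisoned training set D_tr ∪ D_p, and L the (inner) training loss.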

Taking an entirely different and novel approach, in this paper we propose a poisoning attack strategy against machine learning classifiers based on Generative Adversarial Nets (GANs) [4]. This allows us to craft poisoning points in a more systematic way, looking for regions of the data distribution where the poisoning points are more influential and, at the same time, difficult to detect. Our proposed scheme, pGAN, consists of three components: a generator, a discriminator, and the target classifier. The generator aims to generate poisoning points that maximise the error of the target classifier while minimising the discriminator's ability to distinguish them from genuine data points. The classifier aims to minimise some loss function evaluated on a training dataset that contains a fraction of poisoning points. As in a standard GAN, the problem can be formulated as a minimax game. pGAN thus allows us to systematically generate adversarial training examples [13], which are similar to genuine data points but can degrade the performance of the system when used for training.

The use of a generative model allows us to produce poisoning points at scale, enabling poisoning attacks against learning algorithms where the number of training points is large or in situations where optimal attack strategies based on bi-level optimisation are intractable or difficult to compute, as can be the case for deep networks. Additionally, our proposed model includes a mechanism to control the detectability of the generated poisoning points: the generator maximises a convex combination of the losses of the discriminator and the classifier evaluated on the poisoning data points. Our model therefore allows us to control the aggressiveness of the attack through a parameter that weights the two losses, inducing a trade-off between the effectiveness and the detectability of the attack. In this way, pGAN can be applied for systematic testing of machine learning classifiers at different risk levels. Our experimental evaluation on synthetic and real datasets shows that pGAN is capable of compromising different machine learning classifiers for multi-class classification, including deep networks. We analyse the trade-off between detectability and effectiveness of the attack: overly conservative strategies have a reduced impact on the target classifier but, if the attack is too aggressive, most poisoning points can be detected as outliers.

2 Related Work

The first practical poisoning attacks were proposed in the context of spam filtering and anomaly detection [20, 12], but these attacks do not easily generalise to other learning algorithms. A more systematic approach was presented in [3], modelling optimal poisoning attacks against SVMs for binary classification as a bi-level optimisation problem, which can be solved by exploiting the Karush-Kuhn-Tucker conditions of the inner problem. A similar approach is proposed in [29] for poisoning embedded feature selection methods, including LASSO, ridge regression, and elastic net. Mei and Zhu [17] proposed a more general framework to model and solve optimal poisoning attacks against convex classifiers, exploiting the implicit function theorem to compute the gradients required to solve the corresponding bi-level optimisation problem. In [19], reverse-mode differentiation is proposed to estimate the gradients required to solve bi-level optimisation problems for optimal poisoning attacks against multi-class classifiers. This approach makes it possible to attack a broader range of learning algorithms and reduces the computational complexity with respect to previous works. However, all these techniques remain limited when it comes to compromising deep networks trained on large datasets, where many poisoning points are required even to account for a small fraction of the training set. Moreover, previous attacks did not explicitly model appropriate detectability constraints. Thus, the resulting poisoning points can be far from the genuine data distribution and can then be easily identified as outliers [21, 25, 22]. Recently, [14] showed that it is still possible to craft attacks capable of bypassing outlier-detection-based defences with an iterative constrained bi-level optimisation problem, where, at each iteration, the constraints change according to the current solution of the bi-level problem. However, the high computational complexity of this attack limits its practical application in many scenarios.

Koh and Liang [13] proposed a different approach to craft targeted attacks against deep networks by exploiting influence functions. This approach makes it possible to create adversarial training examples by learning small perturbations that, when added to specific genuine training points, change the predictions for a target set of test points. Shafahi et al. [24] showed that it is possible to perform targeted attacks even when the adversary does not control the labels of the poisoning points.

Yang et al. [30] introduced a poisoning attack with generative models, using autoencoders to generate the malicious points. Although this method is more scalable than attacks based on bi-level optimisation, the authors do not provide a mechanism to control the detectability of the poisoning points. GANs have also been proposed to generate adversarial examples at test time [27]. In this case the generator is fed with original samples and learns adversarial perturbations that, when added to the genuine samples, produce an error on the targeted algorithm at test time. The discriminator is designed to ensure that the generated adversarial examples are similar to the original ones.

3 Poisoning Attacks with Generative Adversarial Nets

Our model, pGAN, is a GAN-based scheme with three components (generator, discriminator, and target classifier) used to systematically generate adversarial training examples. First, we briefly describe the attacker model considered. Then we introduce the formulation of pGAN and, finally, we provide some practical considerations for its implementation.

3.1 Attacker’s Model

The attacker's knowledge of the targeted system depends on different aspects: the learning algorithm, the objective function optimised, the feature set, and the training data. In our case we consider perfect knowledge attacks, where we assume the attacker knows everything about the target system: the training data, the feature set, the loss function, and the machine learning model used by the victim. Although unrealistic in most practical scenarios, this assumption allows us to perform a worst-case analysis of the performance of the system under attack. However, our proposed attack strategy also supports limited-knowledge settings, exploiting the transferability property of poisoning attacks [19]. Regarding the attacker's capabilities, we consider here a causative attack [2, 1], where the attacker can manipulate a fraction of the training data to influence the learning algorithm. We assume that the attacker can manipulate all the features to craft the poisoning points, as long as the resulting points are within the feasible domain of the distribution of genuine training points. Finally, we assume that the attacker can also control the labels of the injected poisoning points.

3.2 pGAN

In a multi-class classification task, let X be the d-dimensional feature space, where data points x are drawn from a distribution p_data(x), and let Y be the space of class labels. The learning algorithm, C, aims to learn the mapping f: X → Y by minimising a loss function, L_C, evaluated on a set of training points S_tr. The objective of the attacker is to introduce a fraction, λ, of malicious points in S_tr so as to maximise L_C when evaluated on the poisoned training set.

The Generator, G, aims to generate poisoning points by learning a data distribution that is effective at increasing the error rate of the target classifier, but that is also close to the distribution of genuine data points, i.e. the generated poisoning points resemble honest data points to evade detection. Thus, G receives some noise z as input and implicitly defines a distribution of poisoning points, p_p(x|Y_p), which is the distribution of the generated samples conditioned on Y_p, the set of target class labels for the attacker. The Discriminator, D, aims to distinguish between honest training data and the generated poisoning points. It estimates the probability that a sample came from the genuine data distribution p_data rather than from p_p. As in G, the samples used in the discriminator are conditioned on the set of labels Y_p. The Classifier, C, is representative of the attacked algorithm. In perfect knowledge attacks C can have the same structure as the actual target classifier. For black-box attacks we can exploit attack transferability and use C as a surrogate model that is somewhat similar to the actual (unknown) classifier. During the training of pGAN, C is fed honest and poisoning training points from p_data and p_p respectively, where the fraction of poisoning points is controlled by a parameter λ.
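For illustration only, a minimal sketch of the three pGAN components for flat feature vectors, written in PyTorch; the single hidden layer, its width, and the activation choices loosely follow the fully connected configurations in Appendix C, but are otherwise assumptions rather than the exact architectures used in the paper.

    import torch
    import torch.nn as nn

    class Generator(nn.Module):
        # Maps noise z (implicitly conditioned on the poisoning labels Y_p)
        # to a synthetic point in feature space.
        def __init__(self, noise_dim, feature_dim, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(noise_dim, hidden), nn.LeakyReLU(0.1),
                nn.Linear(hidden, feature_dim), nn.Tanh())

        def forward(self, z):
            return self.net(z)

    class Discriminator(nn.Module):
        # Estimates the probability that a sample comes from the genuine
        # data distribution rather than from the generator.
        def __init__(self, feature_dim, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feature_dim, hidden), nn.LeakyReLU(0.1),
                nn.Linear(hidden, 1), nn.Sigmoid())

        def forward(self, x):
            return self.net(x)

    class Classifier(nn.Module):
        # Surrogate of the target classifier; returns unnormalised logits,
        # so the cross-entropy loss is applied externally.
        def __init__(self, feature_dim, n_classes, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feature_dim, hidden), nn.LeakyReLU(0.1),
                nn.Linear(hidden, n_classes))

        def forward(self, x):
            return self.net(x)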

In contrast to traditional GAN schemes, in pGAN G plays a game against both D and C. This can also be formalised as a minimax game where the maximisation problem involves both D and C. Similar to conditional GANs [18], the objective function for D (which also depends on G) can be written as:

V_D(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}}(\mathbf{x} \mid \mathcal{Y}_p)}\big[\log D(\mathbf{x} \mid \mathcal{Y}_p)\big] + \mathbb{E}_{\mathbf{z} \sim p_z(\mathbf{z})}\big[\log\big(1 - D(G(\mathbf{z} \mid \mathcal{Y}_p) \mid \mathcal{Y}_p)\big)\big]  \qquad (1)

The objective function for C is given by:

V_C(C, G) = -\Big[(1 - \lambda)\, \mathbb{E}_{(\mathbf{x}, y) \sim p_{\mathrm{data}}(\mathbf{x}, y)}\big[\mathcal{L}_C(C(\mathbf{x}), y)\big] + \lambda\, \mathbb{E}_{\mathbf{z} \sim p_z(\mathbf{z})}\big[\mathcal{L}_C(C(G(\mathbf{z} \mid \mathcal{Y}_p)), y_p)\big]\Big]  \qquad (2)

where λ is the fraction of poisoning points introduced in the training dataset and L_C is the loss function used to train C. Note that the poisoning points in (2) belong to a subset of poisoning class labels Y_p (each generated point is assigned a label y_p ∈ Y_p), whereas the genuine points used to train the classifier are from all the classes. The objective in (2) is just the negative of the loss used to train C, evaluated on a mixture of honest and poisoning points (the latter from the set of classes in Y_p) controlled by λ.

Given (1) and (2), pGAN can then be formulated as the following minimax problem:

\min_{G} \max_{D, C}\; \Big[\alpha\, V_D(D, G) + (1 - \alpha)\, V_C(C, G)\Big]  \qquad (3)

with α ∈ [0, 1]. In this case, the maximisation problem can be seen as a multi-objective optimisation problem to learn the parameters of both the classifier and the discriminator. Whereas for D and C the objectives are decoupled, the generator optimises a convex combination of the two objectives in (1) and (2). The parameter α controls the importance of each of the two objective functions towards the global goal. For high values of α, the attack points will prioritise evading detection, rendering attacks with (possibly) reduced effectiveness. Note that for α = 1 we have the same minimax game as in a standard conditional GAN [18]. On the other hand, low values of α will result in attacks with a higher impact on the classifier's performance, but the generated poisoning points will be more easily detectable by outlier detection systems. For α = 0, pGAN does not consider any detectability constraint and the generated poisoning points are only constrained by the output activation functions of the generator. In this case pGAN can serve as a suboptimal approximation of the optimal attack strategies in [3, 17, 19], where no detectability constraints are imposed.
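A minimal sketch of how the generator's combined objective in (3) could look in code, reusing the components above; the use of binary cross-entropy for the discriminator term (in its non-saturating form, see Sect. 3.3) and of the standard cross-entropy for the classifier term are assumptions, not the authors' exact implementation.

    import torch
    import torch.nn.functional as F

    def generator_loss(G, D, C, z, y_poison, alpha):
        # Convex combination of the two objectives in (3): alpha weights the
        # detectability term, (1 - alpha) the damage to the target classifier.
        x_fake = G(z)
        # Detectability term: fool the discriminator (non-saturating form).
        d_out = D(x_fake)
        loss_d = F.binary_cross_entropy(d_out, torch.ones_like(d_out))
        # Attack term: the generator maximises the classifier loss on the
        # poisoning points, so its own loss is the negative classifier loss.
        loss_c = -F.cross_entropy(C(x_fake), y_poison)
        return alpha * loss_d + (1.0 - alpha) * loss_c

Minimising this loss with respect to the generator's parameters pushes the generated points to look genuine (first term) while degrading the classifier (second term), with α trading one goal against the other.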

Similar to [4], we train pGAN following a coordinated gradient-based strategy to solve the minimax problem in (3). We sequentially update the parameters of the three components using mini-batch stochastic gradient descent/ascent. For the generator and the discriminator, data points are sampled from the conditional distribution on the subset of poisoning labels Y_p. For the classifier, honest data points are sampled from the data distribution including all the classes. A different number of iterations can be considered for updating the parameters of the three blocks. The details of the training algorithm are provided in Appendix A.

3.3 Practical Considerations

The formulation of pGAN in (3) allows us to perform error-generic poisoning attacks [19], which aim to increase the error of the classifier in an indiscriminate way, i.e. regardless of the types of errors produced in the system. However, the nature of these errors can be limited by Y_p, i.e. the classes for which the attacker can inject poisoning points. To generate targeted attacks or to produce specific types of errors in the system, we need to use a surrogate model for the target classifier in pGAN that includes only the classes or samples considered in the attacker's goal. For example, if the attacker wants to inject poisoning points labelled as one class (the poisoning class) to increase the classification error for another class (the target class), we can use a binary classifier in pGAN considering only these two classes, where the generator aims to produce samples from the poisoning class, as sketched below.
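A small, hedged sketch of this surrogate construction for the "3 vs 5" MNIST setting used in Sect. 4; the function name and data variables are illustrative placeholders.

    import numpy as np

    def build_surrogate_subset(X, y, poison_class, target_class):
        # Keep only the two classes involved in the attacker's goal; the
        # generator is then trained to produce points labelled as
        # poison_class that increase the error on target_class.
        mask = np.isin(y, [poison_class, target_class])
        return X[mask], y[mask]

    # e.g. X_sub, y_sub = build_surrogate_subset(X_train, y_train, 5, 3)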

As in other GAN schemes, pGAN can be difficult to train and prone to mode collapse. To mitigate these problems, in our experiments we used some of the standard techniques proposed to improve GAN training, such as dropout and batch normalisation [23]. We also applied one-sided label smoothing, not only to the labels in the discriminator but also to the labels of the genuine points in the classifier. As suggested in [5], to avoid small gradients for G from the discriminator's loss (1), especially in early stages where the quality of the samples produced by G is poor, we train G to maximise log D(G(z|Y_p)|Y_p) rather than to minimise log(1 − D(G(z|Y_p)|Y_p)).
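A short, hedged sketch of the discriminator update with one-sided label smoothing as described above; the smoothing value of 0.9 and the variable names are assumptions (the non-saturating generator update is already reflected in the generator-loss sketch of Sect. 3.2).

    import torch
    import torch.nn.functional as F

    def discriminator_loss(D, x_real, x_fake, smooth=0.9):
        # One-sided label smoothing: targets for real (genuine) samples are
        # set to `smooth` instead of 1; targets for generated samples stay 0.
        d_real = D(x_real)
        d_fake = D(x_fake.detach())
        loss_real = F.binary_cross_entropy(d_real, torch.full_like(d_real, smooth))
        loss_fake = F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
        return loss_real + loss_fake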

In contrast to standard GANs, in pGAN the learned distribution of poisoning points is expected to differ from the distribution of genuine points. Thus, the accuracy of the discriminator in pGAN will always be greater than chance (0.5), so the stopping criterion for training pGAN cannot be based on the discriminator's accuracy. We need to find a saddle point where the objectives in (1) and (2) are maximised for D and C respectively (i.e. pGAN finds local maxima) and the combined objective in (3) is minimised w.r.t. G (i.e. pGAN finds a local minimum).

Finally, the value of λ plays an important role in the training of pGAN. If λ is small, the gradients for G from the classifier's loss in (2) can be very small compared to the gradients from the discriminator's loss in (1). Thus, the generator focuses more on evading detection by the discriminator rather than on increasing the error of the target classifier, resulting in blunt attacks. Hence, even if the expected fraction of poisoning points to be injected in the target system is small, larger values of λ are preferred to generate more successful poisoning attacks. In our experiments in Sect. 4 we analyse the effectiveness of the attack as a function of λ.

4 Experiments

To illustrate how pGAN works we first performed a synthetic experiment on a binary classification problem, generating two bivariate Gaussian distributions that slightly overlap. We trained pGAN for different values of α with 500 training points from each Gaussian distribution, targeting a logistic regression classifier. In Fig. 1 we show the distribution of poisoning (red dots) and genuine (green and blue dots) data points. The poisoning points are labelled as the green class, so G aims to generate malicious points similar to the green ones (i.e. D aims to discriminate between red and green data points). For α = 1 we obtain the same result as with a standard GAN, so the distribution of red points matches the distribution of the green ones. But, as we decrease the value of α, the distribution of red points shifts towards the region where the green and blue distributions overlap. We can observe that for intermediate values of α the poisoning points are still close to genuine green points, i.e. we cannot consider the red points as outliers in most cases. For α = 0 the generator has no detectability constraints and focuses only on increasing the error of the classifier. It is interesting to observe that, in this case, pGAN does not produce points interpolating the distributions of the two genuine classes; instead, the distribution learned by the generator is far from the region where the distributions of the blue and green points overlap. (Note that the result would be significantly different if the target classifier were non-linear.) This suggests that for α = 0 pGAN is not just producing a simple interpolation between the two classes, but looks for regions close to the decision boundary where the classifier is weaker. The complete details of the experiment and the effect on the decision boundary after injecting the poisoning points can be found in Appendix B.

Figure 1: Synthetic experiment: Distribution of genuine (green and blue dots) and poisoning (red dots) data points for different values of α. The poisoning points are labelled as green.

We performed our experimental evaluation on four datasets: MNIST [16], Fashion-MNIST (FMNIST) [28], Spambase [7], and CIFAR-10 [15]. We trained pGAN using Deep Neural Networks (DNNs) for the first three datasets and Convolutional Neural Networks (CNNs) for CIFAR. All details about the datasets used and the experimental settings are described in Appendix C. To test the effectiveness of pGAN in generating stealthy poisoning attacks we applied the defence strategy proposed in [21]: we assumed that the defender has a fraction of trusted data points that can be used to train one outlier detector for each class in the classification problem. Thus, we pre-filter the (genuine and malicious) training data points with these outlier detectors before training. As in [21], we used the distance-based anomaly detector proposed in [26]: the outlierness score of a tested data point is computed from the Euclidean distance to its k nearest neighbours within a subset of points sampled without replacement from the set used to train the outlier detector. In our experiments we used the same values as proposed in [21] for the number of neighbours and for the number of training points to be sampled. We set the threshold of the outlier detector as a percentile of the outlier score distribution; this percentile controls the fraction of genuine points that is expected to be retained after applying the outlier detector.
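A hedged sketch of this kind of distance-based outlier score (mean Euclidean distance to the k nearest neighbours within a random subsample of trusted points); the default values of k, the subsample size, and the percentile below are placeholders, and this is not the exact implementation of [26] or [21].

    import numpy as np

    def outlier_scores(X_trusted, X_test, k=5, n_sample=200, seed=0):
        # Score each test point by its mean Euclidean distance to its k
        # nearest neighbours within a subsample of the trusted points.
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X_trusted), size=min(n_sample, len(X_trusted)), replace=False)
        ref = X_trusted[idx]
        dists = np.linalg.norm(X_test[:, None, :] - ref[None, :, :], axis=-1)
        knn = np.sort(dists, axis=1)[:, :k]
        return knn.mean(axis=1)

    def filter_outliers(X_trusted, X_train, percentile=95.0, k=5):
        # Threshold chosen so that roughly `percentile`% of trusted points
        # would be retained; training points scoring above it are discarded.
        thr = np.percentile(outlier_scores(X_trusted, X_trusted, k=k), percentile)
        keep = outlier_scores(X_trusted, X_train, k=k) <= thr
        return X_train[keep], keep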

To provide a better understanding of the behaviour of pGAN we trained and tested our attack targeting binary classifiers. For this, in MNIST we selected digits 3 and 5, in FMNIST we picked the classes sneaker and ankle boot, and in CIFAR the automobile and truck classes. The poisoning points were labelled as 5, ankle boot, and truck respectively for MNIST, FMNIST, and CIFAR. In Spambase, a spam detection dataset, we assumed that the poisoning points are labelled as spam emails.

First, we analysed the effectiveness of the attack as a function of α. For each dataset we trained several generators for each value of α explored. We set λ in proportion to the prior probability of the samples from the poisoning class (i.e. digit 5, ankle boot, truck, and spam), with different proportions for MNIST and FMNIST, for CIFAR, and for Spambase. In Spambase we used a smaller value of λ, as the distribution of spam emails is quite multi-modal, i.e. there are different forms of spam emails. Thus, we need to retain a larger fraction of genuine samples from the spam class to train the classifier in pGAN with a more representative distribution of genuine spam emails. For testing, in MNIST, FMNIST and CIFAR we used 500 (genuine) samples per class to train the outlier detectors and a disjoint set of samples per class to train a separate classifier. For Spambase we used separate sets of good and genuine spam emails to train the outlier detectors and the classifier. We evaluated the effectiveness of the attack by varying the fraction of poisoning points over a range of values. To preserve the ratio between classes we substituted genuine samples from the poisoning class with the malicious points generated by pGAN (rather than adding the poisoning points to the given training dataset). For each pGAN generator and for each value of the fraction of poisoning points explored, we performed independent runs with independent splits for the outlier detector and classifier training sets. In Fig. 2 we show the classification error for MNIST, FMNIST and Spambase as a function of the fraction of poisoning points, averaged over the generators and the runs for each generator.
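A hedged sketch of this substitution step; interpreting the poisoning fraction with respect to the whole training set, and all variable names, are assumptions made for illustration.

    import numpy as np
    import torch

    def poison_training_set(X, y, G, poison_label, frac, noise_dim, seed=0):
        # Replace a fraction `frac` of the training set with generated points,
        # drawing the replaced indices from the genuine samples of the
        # poisoning class so that the class ratio is preserved.
        rng = np.random.default_rng(seed)
        candidates = np.where(y == poison_label)[0]
        n_poison = min(int(round(frac * len(X))), len(candidates))
        replace = rng.choice(candidates, size=n_poison, replace=False)
        with torch.no_grad():
            z = torch.randn(len(replace), noise_dim)
            X[replace] = G(z).numpy()
        # Labels are unchanged: the generated points keep the poisoning label.
        return X, y, replace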

Figure 2: Classification error (%) as a function of the percentage of poisoning points using pGAN with different values of α for: (left) MNIST, (centre) FMNIST, (right) Spambase.

In MNIST, the attack is more effective for small values of α, increasing the error substantially over the baseline with no attack once a modest fraction of the training dataset is compromised. With no detectability constraints the attack is less effective, as some of the poisoning points are filtered out by the outlier detector, although the points that bypass the defence are still very effective at influencing the target classifier. For larger values of α the effect of the attack is more limited. In contrast, for FMNIST the attack with no detectability constraints has a very reduced impact, as many poisoning points are filtered out. Again, we can observe in Fig. 2 that small non-zero values of α produce more effective data points, although the overall effect of the attack is more limited compared to MNIST. It is interesting to observe that, although the baseline error (i.e. when there is no attack) is lower for MNIST than for FMNIST, it is more difficult to poison FMNIST. This suggests that the impact of the attack depends not only on the separation between the two classes but also on the topology of the classification problem (i.e. the data distribution of the classes). From the results in Fig. 2 we can also observe that Spambase is very sensitive to the poisoning attack, even for the larger values of α. This is because of the multi-modal nature of the spam class in this dataset: even if the value of α is close to 1, as in standard GANs, pGAN is not capable of modelling the whole data distribution properly. However, the target classifier is very sensitive to the parts (modes) of the spam distribution learned by pGAN, and thus the effect of the attack is quite significant. On the other hand, the effectiveness of the attack on Spambase also indicates that the algorithm and the outlier detection defence are very brittle in this case, which can be due to the reduced number of samples and features used in this dataset to classify spam emails. In Appendix D we provide the analysis of the false positives and false negatives for the three datasets. In Fig. 3 we show some of the poisoning examples generated by pGAN. For MNIST the malicious data points (labelled as 5) exhibit features of both digits 3 and 5. In some cases, although the poisoning digits are similar to a 3, it is difficult to automatically detect these points as outliers, as many of the pixels that represent these malicious digits follow a pattern similar to genuine 5s, i.e. they just differ in the upper trace of the generated digits. In other cases, the malicious digits look like a 5 with some characteristics that make them closer to 3s. In the case of FMNIST, the samples generated by pGAN (labelled as ankle boots) can be seen as an interpolation of the two classes: the malicious images look like high-top sneakers or low-top ankle boots. Thus, it is difficult to detect them as malicious points, as they clearly resemble some of the ankle boots in the genuine training set. Indeed, for some of the genuine images it is difficult to tell whether they depict a sneaker or an ankle boot. More examples for different values of α are shown in Appendix D.

Figure 3: Examples generated by pGAN (for a fixed value of α) for MNIST (left) and FMNIST (right).
Figure 4: (Left) Average error on CIFAR as a function of the percentage of poisoning points for different values of α. (Right) Examples of images generated by pGAN for CIFAR.

In Fig. 5 (centre) we show the fraction of data points pre-filtered by the outlier detectors in the MNIST dataset as a function of α. We explored two values for the percentile that sets the threshold of the detectors. As expected, the fraction of rejected genuine data points matches, on average, the rejection rates implied by the two thresholds. However, for non-zero values of α the fraction of rejected malicious points is smaller than that of the genuine points for both detectors. This is because the generator pays less attention to samples that lie in low-density regions of the genuine data distribution, so the generated poisoning points are conservative. For α = 0 the fraction of rejected malicious points is also not very high. This can be due to the similarity between the two classes: even if the generated poisoning points, labelled as 5, look like a 3, they are still close to the distribution of genuine 5s when targeting a non-linear classifier. We hypothesise that, in this case, with no detectability constraints, it is still more effective to inject poisoning points in regions that are close to the data distribution of both classes, where the attack can increase both the false positive and the false negative rates, as perhaps the class of digit 3 is more sensitive to data poisoning. We show these results in Appendix D, where we observe that, for α = 0, both the false positives and the false negatives increase for the MNIST dataset.

The results for CIFAR are shown in Fig. 4. In this case we observe that for α = 0, i.e. with no detectability constraints, the attack is significantly more effective than for the other values of α explored. We also observed that the outlier detector does not work properly on this dataset: for α = 0 the images produced by pGAN are malicious noise patterns that do not exhibit the characteristic features of either cars or trucks, and since the number of features in this dataset is larger and the variety of images broader, the outlier detector cannot identify these outliers correctly. However, the images generated for larger values of α exhibit characteristics of the two classes, so it is reasonable to hypothesise that they would be difficult to detect automatically with other outlier detection techniques. In this setting pGAN considerably increases the false negative rate (see Appendix D), while the increase in the overall classification error is more moderate, as the false positive rate decreases after injecting the poisoning points. In Fig. 4 (right) we show some examples of the malicious points generated by pGAN.

To analyse the sensitivity of pGAN w.r.t. λ, the fraction of poisoning points used to train C, we performed an experiment on the MNIST dataset (digits 3 and 5). We fixed α and explored a range of values for λ. With the same experimental settings as before we trained 5 generators for each value of λ, and tested the effectiveness of the attack on a separate classifier with 10 independent runs for each generator and value of λ explored. For the attacks we injected a fixed fraction of poisoning points. In Fig. 5 (left) we show the averaged classification error on the test dataset as a function of λ. We can observe that, for small λ, the effect of the attack is more limited. The reason is that, when training pGAN, the effect of the poisoning points on C is very reduced, and the gradients of (2) w.r.t. the parameters of G can be very small compared to the gradients coming from the discriminator, so G focuses more on optimising the discriminator's objective. In this case, even for the largest values of λ the attack is still effective, with only a slight decrease in error rate compared to intermediate values. However, in some problems, as in Spambase, using large values for λ can have a negative impact on the attack's effectiveness: the surrogate classifier trained in pGAN may not be representative of the classifier on which the attack is tested, as the generator is limited in its capacity to model multi-modal distributions.

Figure 5: (Left) Average error on MNIST as a function of λ. (Centre) Outlier detection on MNIST as a function of α for two percentile thresholds. (Right) Average error on MNIST as a function of the number of training examples for a clean and a poisoned classifier (with a fixed fraction of poisoning points).

In Fig. 5 (right) we show how the number of training data points impacts the effect of the attack. For this, we trained 5 pGAN generators and tested them on classifiers with different numbers of training points, injecting a fixed fraction of poisoning points. For each generator and each number of training points explored we did 5 independent runs. As in the previous experiment, we used 500 samples per class to train the outlier detectors. The results in Fig. 5 (right) show that the difference in performance between the poisoned and the clean classifier shrinks as the number of training samples increases. This is expected, as the stability of the learning algorithm increases with the number of training data points, limiting the ability of the attacker to perform indiscriminate poisoning attacks. This does not mean that learning algorithms trained on large datasets are not vulnerable to data poisoning: attackers can still be very successful at performing targeted attacks, focusing on increasing the error on particular instances or on creating backdoors [6]. In these scenarios we can also use pGAN to generate more targeted attacks, using a surrogate model for the classifier that includes the subset of samples the attacker aims to misclassify.

5 Conclusion

The pGAN approach we introduce in this paper allows us to naturally model attackers with different levels of aggressiveness and the effect of different detectability constraints on the robustness of the algorithms. This allows us to a) study the characteristics of the attacks and identify regions of the data distribution where poisoning points are more influential, yet more difficult to detect, b) systematically generate, in an efficient and scalable way, attacks that correspond to different types of threats, and c) study the effect of mitigation measures such as improving detectability. In addition to studying the trade-offs involved in the adversarial model, pGAN also allows us to naturally study the trade-off between the performance and the robustness of the system as the fraction of poisoning points increases.

References

  • [1] Marco Barreno, Blaine Nelson, Anthony D Joseph, and J Doug Tygar. The Security of Machine Learning. Machine Learning, 81(2):121–148, 2010.
  • [2] Marco Barreno, Blaine Nelson, Russell Sears, Anthony D Joseph, and J Doug Tygar. Can Machine Learning be Secure? In Symposium on Information, Computer and Communications Security, pages 16–25, 2006.
  • [3] Battista Biggio, Blaine Nelson, and Pavel Laskov. Poisoning Attacks against Support Vector Machines. In International Conference on Machine Learning, pages 1807–1814, 2012.
  • [4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
  • [5] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and Harnessing Adversarial Examples. In International Conference on Learning Representations, 2015.
  • [6] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. arXiv preprint arXiv:1708.06733, 2017.
  • [7] Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt. Spambase Data Set. Hewlett-Packard Labs, 1999.
  • [8] Ling Huang, Anthony D Joseph, Blaine Nelson, Benjamin IP Rubinstein, and JD Tygar. Adversarial Machine Learning. In Workshop on Security and Artificial Intelligence, pages 43–58, 2011.
  • [9] Vittorio P Illiano, Luis Muñoz González, and Emil C Lupu. Don’t Fool Me!: Detection, Characterisation and Diagnosis of Spoofed and Masked Events in Wireless Sensor Networks. IEEE Transactions on Dependable and Secure Computing, 14(3):279–293, 2016.
  • [10] Matthew Jagielski, Alina Oprea, Battista Biggio, Chang Liu, Cristina Nita-Rotaru, and Bo Li. Manipulating Machine Learning: Poisoning Attacks and Countermeasures for Regression Learning. In IEEE Symposium on Security and Privacy, pages 19–35, 2018.
  • [11] Anthony D Joseph, Pavel Laskov, Fabio Roli, J Doug Tygar, and Blaine Nelson. Machine Learning Methods for Computer Security (Dagstuhl Perspectives Workshop 12371). Dagstuhl Manifestos, 3(1), 2013.
  • [12] Marius Kloft and Pavel Laskov. Security Analysis of Online Centroid Anomaly Detection. Journal of Machine Learning Research, 13:3681–3724, 2012.
  • [13] Pang Wei Koh and Percy Liang. Understanding Black-box Predictions via Influence Functions. In International Conference on Machine Learning, pages 1885–1894, 2017.
  • [14] Pang Wei Koh, Jacob Steinhardt, and Percy Liang. Stronger Data Poisoning Attacks Break Data Sanitization Defenses. arXiv preprint arXiv:1811.00741, 2018.
  • [15] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Dataset, 2009.
  • [16] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [17] Shike Mei and Xiaojin Zhu. Using Machine Teaching to Identify Optimal Training-Set Attacks on Machine Learners. In AAAI, pages 2871–2877, 2015.
  • [18] Mehdi Mirza and Simon Osindero. Conditional Generative Adversarial Nets. arXiv preprint arXiv:1411.1784, 2014.
  • [19] Luis Muñoz-González, Battista Biggio, Ambra Demontis, Andrea Paudice, Vasin Wongrassamee, Emil C Lupu, and Fabio Roli. Towards Poisoning of Deep Learning Algorithms with Back-Gradient Optimization. In Workshop on Artificial Intelligence and Security, pages 27–38, 2017.
  • [20] Blaine Nelson, Marco Barreno, Fuching Jack Chi, Anthony D Joseph, Benjamin IP Rubinstein, Udam Saini, Charles A Sutton, J Doug Tygar, and Kai Xia. Exploiting Machine Learning to Subvert Your Spam Filter. LEET, 8:1–9, 2008.
  • [21] Andrea Paudice, Luis Muñoz-González, Andras Gyorgy, and Emil C Lupu. Detection of Adversarial Training Examples in Poisoning Attacks through Anomaly Detection. arXiv preprint arXiv:1802.03041, 2018.
  • [22] Andrea Paudice, Luis Muñoz-González, and Emil C Lupu. Label Sanitization against Label Flipping Poisoning Attacks. In Nemesis’18 Workshop on Recent Advancements in Adversarial Machine Learning, 2018.
  • [23] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved Techniques for Training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
  • [24] Ali Shafahi, W Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras, and Tom Goldstein. Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks. In Advances in Neural Information Processing Systems, pages 6103–6113, 2018.
  • [25] Jacob Steinhardt, Pang Wei W Koh, and Percy S Liang. Certified Defenses for Data Poisoning Attacks. In Advances in Neural Information Processing Systems, pages 3517–3529, 2017.
  • [26] Mingxi Wu and Christopher Jermaine. Outlier Detection by Sampling with Accuracy Guarantees. In International Conference on Knowledge Discovery and Data Mining, pages 767–772, 2006.
  • [27] Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn Song. Generating Adversarial Examples with Adversarial Networks. In International Joint Conference on Artificial Intelligence, pages 3905–3911, 2018.
  • [28] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • [29] Huang Xiao, Battista Biggio, Gavin Brown, Giorgio Fumera, Claudia Eckert, and Fabio Roli. Is Feature Selection Secure against Training Data Poisoning? In International Conference on Machine Learning, pages 1689–1698, 2015.
  • [30] Chaofei Yang, Qing Wu, Hai Li, and Yiran Chen. Generative Poisoning Attack Method against Neural Networks. arXiv preprint arXiv:1703.01340, 2017.

Appendix A: pGAN Training Algorithm

We train pGAN following a coordinated gradient-based strategy, sequentially updating the parameters of the three components using mini-batch stochastic gradient descent/ascent. The procedure is described in Algorithm 1. For the generator and the discriminator, data points are sampled from the conditional distribution on the subset of poisoning labels Y_p. For the classifier, honest data points are sampled from the data distribution including all the classes. We alternate the training of the three components, with k_D, k_C, and k_G steps for the discriminator, the classifier, and the generator respectively. In practice we choose k_D, k_C ≥ k_G, i.e. we update the discriminator and the classifier more often than the generator, and we used the same step settings for most datasets in our experiments.

  for number of training iterations do
     for k_D steps do
        sample a mini-batch of noise samples from p_z(z)
        get a mini-batch of genuine training samples conditioned on the poisoning labels Y_p
        update the discriminator by ascending its stochastic gradient
     end for
     for k_C steps do
        sample a mini-batch of noise samples from p_z(z)
        get a mini-batch of genuine training samples from all the classes
        update the classifier by ascending its stochastic gradient
     end for
     for k_G steps do
        sample a mini-batch of noise samples from p_z(z)
        update the generator by descending its stochastic gradient
     end for
  end for
Algorithm 1 pGAN Training
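For concreteness, a compact sketch of this training loop in PyTorch, reusing the hedged component and loss sketches from Sect. 3; the optimisers, learning rates, batch handling, and default step counts are placeholders rather than the settings reported in Appendix C.

    import torch
    import torch.nn.functional as F
    from torch import optim

    def train_pgan(G, D, C, data_all, data_poison, y_poison_label,
                   alpha=0.5, lam=0.3, noise_dim=100, iters=1000,
                   batch=64, k_d=2, k_c=2, k_g=1):
        # data_all: loader over all classes; data_poison: loader restricted to
        # the poisoning labels Y_p. alpha and lam (λ) follow Eqs. (2)-(3).
        opt_d = optim.SGD(D.parameters(), lr=1e-3, momentum=0.9)
        opt_c = optim.SGD(C.parameters(), lr=1e-3, momentum=0.9)
        opt_g = optim.Adam(G.parameters(), lr=1e-4)
        for _ in range(iters):
            for _ in range(k_d):
                x_real, _ = next(iter(data_poison))
                z = torch.randn(len(x_real), noise_dim)
                loss_d = discriminator_loss(D, x_real, G(z))
                opt_d.zero_grad(); loss_d.backward(); opt_d.step()
            for _ in range(k_c):
                x, y = next(iter(data_all))
                z = torch.randn(len(x), noise_dim)
                y_fake = torch.full((len(x),), y_poison_label, dtype=torch.long)
                loss_c = ((1 - lam) * F.cross_entropy(C(x), y)
                          + lam * F.cross_entropy(C(G(z).detach()), y_fake))
                opt_c.zero_grad(); loss_c.backward(); opt_c.step()
            for _ in range(k_g):
                z = torch.randn(batch, noise_dim)
                y_fake = torch.full((batch,), y_poison_label, dtype=torch.long)
                loss_g = generator_loss(G, D, C, z, y_fake, alpha)
                opt_g.zero_grad(); loss_g.backward(); opt_g.step()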

Appendix B: Synthetic Example: Experimental Settings and Effect on the Decision Boundary

For the synthetic experiment shown in the paper we sample our training and test data points from two slightly overlapping bivariate Gaussian distributions, one per class. We trained pGAN with 500 training data points for each class; the number of epochs, the batch size, and the step parameters of Algorithm 1 were fixed for this experiment. For the generator and the discriminator we used one-hidden-layer neural networks with Leaky ReLU activation functions. For the classifier we used logistic regression with a cross-entropy loss function. The architecture of the three components is detailed in Table 1.

In Fig. 6 we show the effect of the poisoning attack on the decision boundary. For testing pGAN we trained a separate logistic regression classifier with 40 genuine training examples (20 per class), adding 8 extra poisoning points. We trained the classifier using Stochastic Gradient Descent (SGD). In this case, no outlier detector is applied to pre-filter the training points. The results in Fig. 6 show that for α = 0 the attack is very effective, although the poisoning points depicted in red (which are labelled as green) are far from the genuine distribution of green points. As we increase the value of α the attack becomes blunter. In this synthetic example the classifier is quite stable: the number of features is very small (two), the topology of the problem is simple (the classes are linearly separable and the overlap between them is small), and the classifier itself is simple. Thus, the effect of the poisoning attack when detectability constraints are considered, i.e. α > 0, is very reduced. Note that the purpose of this synthetic example is just to illustrate the behaviour of pGAN as a function of α rather than to show a scenario where the attack can be very effective.
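As an aside, a hedged sketch of this test using scikit-learn; the learning-rate value, the number of iterations, and the data variables are placeholders rather than the exact settings above.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    def fit_and_score(X_clean, y_clean, X_poison, y_poison, X_test, y_test):
        # Train logistic regression (log loss) with SGD on the genuine points
        # plus the generated poisoning points, then measure the test error.
        X = np.vstack([X_clean, X_poison])
        y = np.concatenate([y_clean, y_poison])
        clf = SGDClassifier(loss="log_loss", learning_rate="constant",
                            eta0=0.01, max_iter=100, tol=None, random_state=0)
        clf.fit(X, y)
        return 1.0 - clf.score(X_test, y_test)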

Generator Architecture: DNN ()
Hidden layer act. functions: Leaky ReLU (negative slope = 0.1)
Output layer act. functions: Linear
Optimizer: Adam (learning rate = )
Discriminator Architecture: DNN ()
Hidden layer act. functions: Leaky ReLU (negative slope = 0.1)
Output layer act. functions: Sigmoid
Optimizer: SGD (learning rate = , momentum = 0.9)
Classifier Architecture: Logistic Regression
Loss function : Cross-entropy
Optimizer: SGD (learning rate = , momentum = 0.9)
Table 1: pGAN architecture for the Synthetic experiment (Notation: SGD stands for Stochastic Gradient Descent)
Figure 6: Synthetic experiment: Distribution of genuine (green and blue dots) and poisoning (red dots) data points for different values of α. The poisoning points are labelled as green.

Appendix C: Experimental Settings

Here we provide complete details about the settings of the experiments described in the paper. In Table 2 we show the characteristics of the four real datasets used in our experimental evaluation. The parameters for training pGAN on MNIST, FMNIST, Spambase, and CIFAR are shown in Tables 3, 4, 5, and 6 respectively. In all cases, the pGAN generator is fed with (independent) Gaussian noise with zero mean and unit variance.

The number of training epochs, the batch size, and the step parameters of Alg. 1 were set independently for MNIST, FMNIST, Spambase, and CIFAR.

Finally, the architecture of the classifiers trained to test the attacks on the four datasets is described in Table 7. For MNIST, FMNIST and Spambase we used Deep Neural Networks (DNNs), whereas for CIFAR we used Deep Convolutional Neural Networks (DCNNs).

Name # Training Examples # Test Examples # Features
MNIST (3 vs 5)
FMNIST (sneaker vs ankle boot)
Spambase (good email vs spam)
CIFAR (automobile vs truck)
Table 2: Characteristics of the datasets used in the experiments
Generator Architecture: DNN ()
Hidden layer act. functions: Leaky ReLU (negative slope = 0.1)
Output layer act. functions: Tanh
Optimizer: Adam (learning rate = )
Dropout:
Discriminator Architecture: DNN ()
Hidden layer act. functions: Leaky ReLU (negative slope = 0.1)
Output layer act. functions: Sigmoid
Optimizer: SGD (learning rate = , momentum = 0.9)
Dropout:
Classifier Architecture: DNN ()
Loss function : Cross-entropy
Hidden layer act. functions: Leaky ReLU (negative slope = 0.1)
Output layer act. functions: Sigmoid
Optimizer: SGD (learning rate = , momentum = 0.9)
Dropout:
Table 3: pGAN architecture for MNIST
Generator Architecture: DNN ()
Hidden layer act. functions: Leaky ReLU (negative slope = 0.1)
Output layer act. functions: Tanh
Optimizer: Adam (learning rate = )
Dropout:
Discriminator Architecture: DNN ()
Hidden layer act. functions: Leaky ReLU (negative slope = 0.1)
Output layer act. functions: Sigmoid
Optimizer: SGD (learning rate = , momentum = 0.9)
Dropout:
Classifier Architecture: DNN ()
Loss function : Cross-entropy
Hidden layer act. functions: Leaky ReLU (negative slope = 0.1)
Output layer act. functions: Sigmoid
Optimizer: SGD (learning rate = , momentum = 0.9)
Dropout:
Table 4: pGAN architecture for FMNIST
Generator Architecture: DNN ()
Hidden layer act. functions: Leaky ReLU (negative slope = 0.1)
Output layer act. functions: Sigmoid
Optimizer: Adam (learning rate = )
Dropout:
Discriminator Architecture: DNN ()
Hidden layer act. functions: Leaky ReLU (negative slope = 0.1)
Output layer act. functions: Sigmoid
Optimizer: SGD (learning rate = , momentum = 0.9)
Dropout:
Classifier Architecture: DNN ()
Loss function : Cross-entropy
Hidden layer act. functions: Leaky ReLU (negative slope = 0.1)
Output layer act. functions: Sigmoid
Optimizer: SGD (learning rate = , momentum = 0.9)
Dropout:
Table 5: pGAN architecture for Spambase
Generator Architecture: DCNN:
         Layer 1: 2D transposed convolutional; input channels: 100;
        output channels: 1024; kernel size: (2×2); stride: 1; padding: 0;
        no bias terms; batch normalization
         Layer 2: 2D transposed convolutional; input channels: 1024;
        output channels: 256; kernel size: (4×4); stride: 2; padding: 1;
        no bias terms; batch normalization
         Layer 3: 2D transposed convolutional; input channels: 256;
        output channels: 128; kernel size: (4×4); stride: 2; padding: 1;
        no bias terms; batch normalization
         Layer 4: 2D transposed convolutional; input channels: 128;
        output channels: 64; kernel size: (4×4); stride: 2; padding: 1;
        no bias terms; batch normalization
         Layer 5: 2D transposed convolutional; input channels: 64;
        output channels: 3; kernel size: (4×4); stride: 2; padding: 1
Hidden layer act. functions: ReLU
Output layer act. functions: Tanh
Optimizer: Adam (learning rate = )
Discriminator Architecture: DCNN:
         Layer 1: 2D convolutional; input channels: 3; output channels: 64;
        kernel size: (4×4); stride: 2; padding: 1
         Layer 2: 2D convolutional; input channels: 64; output channels: 128;
        kernel size: (4×4); stride: 2; padding: 1; no bias terms;
        batch normalization
         Layer 3: 2D convolutional; input channels: 128; output channels: 256;
        kernel size: (4×4); stride: 2; padding: 1; no bias terms;
        batch normalization
         Layer 4: 2D convolutional; input channels: 256; output channels: 512;
        kernel size: (4×4); stride: 2; padding: 1; no bias terms;
        batch normalization
         Layer 5: 2D convolutional; input channels: 512; output channels: 1;
        kernel size: (2×2); stride: 1; padding: 0
Hidden layer act. functions: Leaky ReLU (negative slope = 0.2)
Output layer act. functions: Sigmoid
Optimizer: SGD (learning rate = , momentum = 0.5)
Classifier Architecture: DCNN:
         Layer 1: 2D convolutional; input channels: 3; output channels: 32;
        kernel size: (4×4); stride: 2; padding: 1
         Layer 2: 2D convolutional; input channels: 32; output channels: 128;
        kernel size: (4×4); stride: 2; padding: 1; no bias terms;
        batch normalization
         Layer 3: 2D convolutional; input channels: 128; output channels: 256;
        kernel size: (4×4); stride: 2; padding: 1; no bias terms;
        batch normalization
         Layer 4: 2D convolutional; input channels: 256; output channels: 512;
        kernel size: (4×4); stride: 2; padding: 1; no bias terms;
        batch normalization
         Layer 5: 2D convolutional; input channels: 512; output channels: 1;
        kernel size: (2×2); stride: 1; padding: 0
Hidden layer act. functions: Leaky ReLU (negative slope = 0.2)
Output layer act. functions: Sigmoid
Loss function : Cross-entropy
Optimizer: SGD (learning rate = , momentum = 0.5)
Table 6: pGAN architecture for CIFAR
MNIST Architecture: DNN ()
Loss function : Cross-entropy
Hidden layer act. functions: Leaky ReLU (negative slope = 0.1)
Output layer act. functions: Sigmoid
Optimizer: SGD (learning rate = , momentum = 0.9)
Batch size:
Epochs:
Dropout:
FMNIST Architecture: DNN ()
Loss function : Cross-entropy
Hidden layer act. functions: Leaky ReLU (negative slope = 0.1)
Output layer act. functions: Sigmoid
Optimizer: SGD (learning rate = , momentum = 0.9)
Batch size:
Epochs:
Dropout:
Spambase Architecture: DNN ()
Loss function : Cross-entropy
Hidden layer act. functions: Leaky ReLU (negative slope = 0.1)
Output layer act. functions: Sigmoid
Optimizer: SGD (learning rate = , momentum = 0.5)
Batch size: whole training set (batch-mode)
Epochs:
Dropout:
CIFAR-10 Architecture: DCNN:
         Layer 1: 2D convolutional; input channels: 3; output channels: 32;
        kernel size: (3×3); stride: 1; padding: 0; no bias terms;
        batch normalization; 2D max pooling; dropout
         Layer 2: 2D convolutional; input channels: 32; output channels: 64;
        kernel size: (3×3); stride: 1; padding: 0; no bias terms;
        batch normalization; 2D max pooling; dropout
         Layer 3: 2D convolutional; input channels: 64; output channels: 128;
        kernel size: (3×3); stride: 1; padding: 0; no bias terms;
        batch normalization; 2D max pooling; dropout
         Layer 4: flattening + fully connected
Hidden layer act. functions: ReLU
Output layer act. functions: Sigmoid
Loss function : Cross-entropy
Optimizer: SGD (learning rate = , momentum = 0.9)
Batch size:
Epochs:
Table 7: Architecture of the classifiers to test the attacks on MNIST, FMNIST, Spambase and CIFAR.

Appendix D: Additional Experimental Results

Generation of Poisoning Samples with pGAN

In Figs. 7, 8 and 9 we show samples generated by pGAN for different values of α on MNIST, FMNIST and CIFAR respectively. The class labels of the poisoning points are 5, ankle boot, and truck for the three datasets. In all cases we can observe that for small (but non-zero) values of α the generated examples exhibit characteristics of the two classes involved in the attack, although pGAN tries to preserve features of the (original) poisoning class to evade detection. For values of α close to 1, the samples generated by pGAN are similar to those we could generate with a standard GAN. For CIFAR, we omit the results for α = 0 as, in this case, the generated samples are noise patterns with some structure that do not resemble any legitimate instance.

Figure 7: Examples from pGAN on the MNIST dataset for different values of α.
Figure 8: Examples from pGAN on the FMNIST dataset for different values of α.
Figure 9: Examples from pGAN on the CIFAR dataset for different values of α.

False Positive and False Negative Rates

With the same experimental settings as in the paper, in Figs. 10 and 11 we show the false positive and false negative rates as a function of the fraction of poisoning data points for different values of α on MNIST, FMNIST, Spambase, and CIFAR.

In MNIST we can observe that, in general, the attack increases the false positive rate of the target classifier, while for most values of α the false negative rate remains mostly constant and in some cases decreases. When no detectability constraints are considered (i.e. α = 0), the attack also increases the false negative rate. Finally, for some values of α the false negatives decrease for attacks with a reduced fraction of poisoning points, but slightly increase for larger fractions of malicious points.

As in MNIST, for the FMNIST dataset the attack increases the false positives for most values of α, whereas the false negatives remain constant (or slightly increase for small values of α). However, with no detectability constraints the false positive rate decreases, but the attack is very effective at increasing the false negatives. For Spambase and CIFAR we observe a similar behaviour, where the poisoning attack mainly affects the false negatives, whereas the false positive rate decreases.

In Fig. 12 we show the false positive and false negative rates as a function of λ on the MNIST dataset. The false positive rates follow a similar trend to the results for the classification error rate. However, the false negatives initially decrease (for values of λ between 0.3 and 0.5) and then slightly increase.

In Fig. 13 we show the false positives and false negatives as a function of the number of training points in the target classifier for the MNIST dataset. As in the previous case, the false positive rate of the poisoned classifier decreases as we increase the number of training points, whereas the false negatives remain constant. In contrast, for the clean classifier we can observe that both the false positives and the false negatives decrease as the target classifier uses more training data points (as expected).

Figure 10: Average false positive and false negative rates (%) as a function of the percentage of poisoning points using pGAN with different values of α for: (left) MNIST, (centre) FMNIST, (right) Spambase.
Figure 11: (Left) Average false positive and (right) false negative rates (%) on CIFAR as a function of the percentage of poisoning points using pGAN with different values of α.
Figure 12: (Left) Average false positive and (right) false negative rates on MNIST as a function of λ.
Figure 13: (Left) Average false positive and (right) false negative rates on MNIST as a function of the number of training examples.

Poisoning Multi-class MNIST

Here we present an experiment showing the effectiveness of pGAN at performing an error-specific attack on MNIST by exploiting attack transferability, targeting digits 3 and 5. We used the generators trained in the previous experiments and tested the effectiveness of the attack against a multi-class classifier trained on all 10 digits available in the dataset. For this, we did 5 runs for each of 5 generators. We used one set of training examples for the classifier and a separate set of training examples for the outlier detectors. A fraction of the digit-5 samples are poisoning points, so that the overall percentage of poisoning points in the training set is about 2%. In Table 8 we show the details of the classifier we trained for testing the attack. We evaluated the test performance of the trained classifiers on the MNIST test dataset, consisting of 10,000 samples.

In Fig. 14 we show the difference between the confusion matrices of the clean and the poisoned classifier. We can observe that the accuracy for detecting digit 3 decreases by 5.8% and that the error rate for misclassifying a 3 as a 5 increases by 6.4%. This shows that, even when poisoning only 2% of the dataset, pGAN allows effective targeted attacks to be crafted. Thus, even if the machine learning system is trained with a large number of training data points, very effective poisoning attacks targeting specific samples or classes are still possible, although compromising the overall performance of the system in those scenarios may not be possible in many cases.
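A brief, hedged sketch of how such a confusion-matrix difference could be computed with scikit-learn; the classifier objects and data variables are placeholders.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    def confusion_shift(clean_clf, poisoned_clf, X_test, y_test):
        # Row-normalised confusion matrices (per-class rates); positive entries
        # in the difference show where the poisoned model errs more often.
        labels = np.unique(y_test)
        cm_clean = confusion_matrix(y_test, clean_clf.predict(X_test),
                                    labels=labels, normalize="true")
        cm_poison = confusion_matrix(y_test, poisoned_clf.predict(X_test),
                                     labels=labels, normalize="true")
        return cm_poison - cm_clean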

MNIST (multi-class) Architecture: DNN ()
Loss function : Cross-entropy
Hidden layer act. functions: Leaky ReLU (negative slope = 0.1)
Output layer act. functions: Softmax
Optimizer: SGD (learning rate = , momentum = 0.9)
Batch size:
Epochs:
Dropout:
Table 8: Architecture of the classifiers to test the attacks on multi-class MNIST (i.e. using all the 10 class labels).

Figure 14: Difference in the confusion matrix between the clean and the poisoned classifier on multi-class MNIST. The poisoning attack targets digits 3 and 5.