Machine learning has been used ubiquitously for a wide variety of security tasks, such as intrusion detection, malware detection, spam filtering, and web search fawcett1997adaptive ; mahoney2002learning ; fogla2006polymorphic ; huang2011adversarial ; kolcz2009feature . Traditional machine learning systems, however, do not account for adversarial manipulation. For example, in spam detection, spammers commonly change spam email text to evade filtering. As a consequence, there has been a series of efforts to model adversarial manipulation of learning, such as evasion and data poisoning attacks lowd2005adversarial ; karlberger2007exploiting ; nelson2012query , to detect such attacks mahoney2002learning ; el2003robust , and to enhance the robustness of learning algorithms to them li2014feature ; li2015scalable ; teo2007convex ; globerson2006nightmare ; bruckner2011stackelberg . One of the most general of these, due to Li and Vorobeychik li2014feature , admits evasion attacks modeled through a broad class of optimization problems, giving rise to a Stackelberg game in which the learner minimizes an adversarial risk function that accounts for optimal attacks on the learner. The main limitation of this approach, however, is scalability: it can solve instances with only 10-30 features. Moreover, most approaches to date offer solutions that build on specific learning models or algorithms. For example, specific evasion attacks have been developed for linear or convex-inducing classifiers lowd2005adversarial ; karlberger2007exploiting ; nelson2012query , as well as for neural networks with continuous feature spaces biggio2014security . Similarly, robust algorithms have typically involved non-trivial modifications of underlying learning algorithms, and either assume a specific attack model or modify a specific algorithm. The more general algorithms that admit a wide array of attack models, on the other hand, have poor scalability.
We propose a very general retraining framework, RAD, which can boost evasion robustness of arbitrary learning algorithms using arbitrary evasion attack models. We show that RAD minimizes an upper bound on optimal adversarial risk. This is significant: whereas adversarial risk minimization is a hard bi-level optimization problem with poor scalability properties (indeed, no methods exist to solve it for general attack models), RAD itself is extremely scalable in practice, as our experiments show. We develop RAD for a more specific, but very broad class of adversarial models, offering a theoretical connection to adversarial risk minimization even when the adversarial model is only approximate. In the process, we offer a simple and very general class of local search algorithms for approximating evasion attacks, which are experimentally quite effective. Perhaps the most appealing aspect of the proposed approach is that it requires no modification of learning algorithms: rather, it can wrap any learning algorithm “out-of-the-box”.
Our work connects to, and systematizes, several previous approaches which used training with adversarial examples either to evaluate robustness of learning algorithms or to enhance learning robustness. For example, Goodfellow et al. goodfellow2014explaining and Kantchelian et al. Kantchelian15 make use of adversarial examples. In the former case, however, these were essentially randomly chosen. The latter offered an iterative retraining approach more in the spirit of RAD, but did not systematically develop or analyze it. Teo et al. teo2007convex do not make an explicit connection to retraining, but suggest equivalence between their general invariance-based approach using column generation and retraining. However, the two are not equivalent, and Teo et al. did not study their relationship formally.
In summary, we make the following contributions:
- RAD, a novel systematic framework for adversarial retraining,
- analysis of the relationship between RAD and empirical adversarial risk minimization,
- extension of the analysis to account for approximate adversarial evasion, within a specific broad class of adversarial models,
- extensive experimental evaluation of RAD and the adversarial evasion model.
We illustrate the applicability and efficiency of our method on both spam filtering and handwritten digit recognition tasks, where evasion attacks are extremely salient klimt2004enron ; lecun2010mnist .
2 Learning and Evasion Attacks
Let $\mathcal{X} \subseteq \mathbb{R}^n$ be the feature space, with $n$ the number of features. For a feature vector $x \in \mathcal{X}$, we let $x_j$ denote the $j$th feature. Suppose that the training set $(x_i, y_i)$ is comprised of feature vectors $x_i \in \mathcal{X}$ generated according to some unknown distribution $x_i \sim \mathcal{D}$, with $y_i \in \{-1, +1\}$ the corresponding binary labels, where $-1$ means the instance $x_i$ is benign, while $+1$ indicates a malicious instance. The learner aims to learn a classifier with parameters $\beta$, $g_\beta : \mathcal{X} \to \{-1, +1\}$, to label instances as malicious or benign, using a training data set of labeled instances $D = \{(x_1, y_1), \ldots, (x_m, y_m)\}$. Let $I_{bad}$ be the subset of datapoints with $y_i = +1$. We assume that $g_\beta(x) = \mathrm{sgn}(f_\beta(x))$ for some real-valued function $f_\beta(x)$.
Traditionally, machine learning algorithms commonly minimize regularized empirical risk:
$$\mathcal{L}(\beta) \equiv \sum_{i} l(g_\beta(x_i), y_i) + \lambda \|\beta\|_p^p, \quad (1)$$
where $l(\hat{y}, y)$ is the loss associated with predicting $\hat{y}$ when true classification is $y$. An important issue in adversarial settings is that instances classified as malicious (in our convention, corresponding to $g_\beta(x) = +1$) are associated with malicious agents who subsequently modify such instances in order to evade the classifier (and be classified as benign). Suppose that adversarial evasion behavior is captured by an oracle, $\mathcal{O}(\beta, x)$, which returns, for a given parameter vector $\beta$ and original feature vector (in the training data) $x$, an alternative feature vector $x'$. Since the adversary modifies malicious instances according to this oracle, the resulting effective risk for the defender is no longer captured by Equation 1, but must account for adversarial response. Consequently, the defender would seek to minimize the following adversarial risk (on training data):
$$\mathcal{L}_A(\beta; \mathcal{O}) = \sum_{i : y_i = -1} l(g_\beta(x_i), -1) + \sum_{i : y_i = +1} l(g_\beta(\mathcal{O}(\beta, x_i)), +1) + \lambda \|\beta\|_p^p. \quad (2)$$
The adversarial risk function in Equation 2 is extremely general: we make, at the moment, no assumptions on the nature of the attacker oracle, $\mathcal{O}$. This oracle may capture evasion attack models based on minimizing evasion cost lowd2005adversarial ; li2014feature ; biggio2014security , or based on actual attacker evasion behavior obtained from experimental data ke2016behavioral .
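As a concrete illustration of the difference between the two risk functions, the following minimal sketch evaluates the adversarial risk on a toy training set. It assumes a hinge loss applied to the real-valued score, a regularization term `reg` computed by the caller, and an oracle supplied as a plain Python callable `oracle(score, x)`; these names and choices are illustrative assumptions, not part of the framework itself.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
ScoreFn = Callable[[Vector], float]


def hinge(score: float, label: int) -> float:
    """Hinge loss of a real-valued score against a label in {-1, +1}."""
    return max(0.0, 1.0 - label * score)


def adversarial_risk(
    score: ScoreFn,
    oracle: Callable[[ScoreFn, Vector], Vector],
    X: Sequence[Vector],
    y: Sequence[int],
    reg: float = 0.0,
    loss: Callable[[float, int], float] = hinge,
) -> float:
    """Training loss when every malicious instance (label +1) is first
    replaced by the oracle's response; benign instances are left as is."""
    total = reg  # regularization term, assumed to be computed by the caller
    for x, label in zip(X, y):
        if label == +1:
            x = oracle(score, x)  # only malicious instances are modified
        total += loss(score(x), label)
    return total
```

With the identity oracle `lambda score, x: x`, the function reduces to the ordinary regularized empirical risk, which makes it easy to compare the two quantities for the same classifier.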
3 Adversarial Learning through Retraining
A number of approaches have been proposed for making learning algorithms more robust to adversarial evasion attacks dalvi2004adversarial ; li2014feature ; li2015scalable ; teo2007convex ; bruckner2011stackelberg . However, these approaches typically suffer from three limitations: 1) they usually assume specific attack models, 2) they require substantial modifications of learning algorithms, and 3) they commonly suffer from significant scalability limitations. For example, a recent, general, adversarial learning algorithm proposed by Li and Vorobeychik li2014feature makes use of constraint generation, but does not scale beyond 10-30 features.
Recently, retraining with adversarial data has been proposed as a means to increase robustness of learning goodfellow2014explaining ; Kantchelian15 ; teo2007convex . (Indeed, neither Teo et al. teo2007convex nor Kantchelian et al. Kantchelian15 focus on retraining as a main contribution, but both observe its effectiveness.) However, to date such approaches have not been systematic.
We present a new algorithm, RAD, for retraining with adversarial data (Algorithm 1) which systematizes some of the prior insights, and enables us to provide a formal connection between retraining with adversarial data, and adversarial risk minimization in the sense of Equation 2.
The RAD algorithm is quite general. At a high level, it starts with the original training data and iterates between computing a classifier and adding adversarial instances to the training data that evade the previously computed classifier, if they are not already a part of the data.
A baseline termination condition for RAD is that no new adversarial instances can be added (either because instances generated by the oracle have already been added, or because the adversary can no longer benefit from evasion, as discussed more formally in Section 4.1). If the range of the oracle is finite (e.g., if the feature space is finite), RAD with this termination condition would always terminate. In practice, our experiments demonstrate that when termination conditions are satisfied, the number of RAD iterations is quite small (between 5 and 20). Moreover, while RAD effectively increases the importance of malicious instances in training, this does not appear to significantly harm classification performance in a non-adversarial setting. In general, we can also control the number of rounds directly, or use an additional termination condition, such as that the parameter vector changes little between successive iterations. However, we assume henceforth that there is no fixed iteration limit or convergence check.
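The retraining loop described above can be sketched as follows. This is a minimal illustration based only on the prose description, not a reproduction of Algorithm 1: the callables `train(X, y)` and `oracle(model, x)`, the optional `max_rounds` cap, and the per-instance bookkeeping in `seen` are assumptions made for the example.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def rad(
    train: Callable[[List[tuple], List[int]], object],
    oracle: Callable[[object, tuple], Sequence],
    X: Sequence[Sequence],
    y: Sequence[int],
    max_rounds: Optional[int] = None,
) -> Tuple[object, int]:
    """Retrain until the oracle produces no adversarial instance that is new.

    Labels are +1 for malicious and -1 for benign. Returns the last trained
    model and the number of training rounds performed.
    """
    data_X = [tuple(x) for x in X]
    data_y = list(y)
    malicious = [x for x, label in zip(data_X, data_y) if label == +1]
    # seen[i] holds malicious instance i and every evasion already added for it
    seen = [{x} for x in malicious]
    rounds = 0
    while True:
        model = train(data_X, data_y)
        rounds += 1
        added = False
        for i, x in enumerate(malicious):
            x_adv = tuple(oracle(model, x))
            if x_adv not in seen[i]:
                seen[i].add(x_adv)
                data_X.append(x_adv)  # adversarial instances are only added
                data_y.append(+1)     # and always keep the malicious label
                added = True
        # baseline termination: nothing new was added in this round
        if not added or (max_rounds is not None and rounds >= max_rounds):
            return model, rounds
```

Because `train` is an arbitrary callable, any off-the-shelf learner can be plugged in unchanged. If `max_rounds` cuts the loop short, the returned model has not yet been retrained on the instances added in the final round.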
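To analyze what happens if the algorithm terminates, define the regularized empirical risk in the last iteration of RAD as:
$$\mathcal{L}^R_N(\beta, \mathcal{O}) = \sum_{i \in D \cup N} l(g_\beta(x_i), y_i) + \lambda \|\beta\|_p^p, \quad (3)$$
where a set $N$ of data points has been added by the algorithm (we omit its dependence on $\mathcal{O}$ to simplify notation). We now characterize the relationship between $\mathcal{L}^R_N(\beta, \mathcal{O})$ and $\mathcal{L}^*_A(\mathcal{O}) = \min_\beta \mathcal{L}_A(\beta; \mathcal{O})$.
Proposition 3.1. $\mathcal{L}^*_A(\mathcal{O}) \le \mathcal{L}^R_N(\beta, \mathcal{O})$ for all $\beta$.
Proof. Let $\bar{\beta} \in \arg\min_\beta \mathcal{L}^R_N(\beta, \mathcal{O})$. Consequently, for any $\beta$,
$$\begin{aligned} \mathcal{L}^R_N(\beta, \mathcal{O}) &\ge \mathcal{L}^R_N(\bar{\beta}, \mathcal{O}) = \sum_{i \in D \cup N} l(g_{\bar{\beta}}(x_i), y_i) + \lambda \|\bar{\beta}\|_p^p \\ &\ge \sum_{i : y_i = -1} l(g_{\bar{\beta}}(x_i), -1) + \sum_{i : y_i = +1} l(g_{\bar{\beta}}(\mathcal{O}(\bar{\beta}, x_i)), +1) + \lambda \|\bar{\beta}\|_p^p \\ &= \mathcal{L}_A(\bar{\beta}; \mathcal{O}) \ge \mathcal{L}^*_A(\mathcal{O}), \end{aligned}$$
where the second inequality follows because in the last iteration of the algorithm no new instances are added (since it must terminate after this iteration), which means that $\mathcal{O}(\bar{\beta}, x_i) \in N$ for all $i \in I_{bad}$. ∎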
In words, retraining, systematized in the RAD algorithm, effectively minimizes an upper bound on optimal adversarial risk. (Note that the bound relies on the fact that we only add adversarial instances, and terminate once no more instances can be added. In particular, natural variations, such as removing or re-weighting added adversarial instances to retain the original malicious-benign balance, lose this guarantee.) This offers a conceptual explanation for the previously observed effectiveness of such algorithms in boosting robustness of learning to adversarial evasion. Formally, however, the result above is limited for several reasons. First, for many adversarial models in prior literature, adversarial evasion is NP-Hard. While some effective approaches exist to compute optimal evasion for specific learning algorithms Kantchelian15 , this is not true in general. Although approximation algorithms for these models exist, using them as oracles in RAD is problematic, since actual attackers may compute better solutions, and Proposition 3.1 no longer applies. Second, we assume that the oracle returns a unique result, but when evasion is modeled as optimization, optima need not be unique. Third, there do not exist effective general-purpose adversarial evasion algorithms whose use in RAD would allow reasonable theoretical guarantees. Below, we investigate an important and very general class of adversarial evasion models and associated algorithms which allow us to obtain practically meaningful guarantees for RAD.
Clustering Malicious Instances: A significant enhancement in speed of the approach can be obtained by clustering malicious instances: this would reduce both the number of iterations and the number of data points added per iteration. Experiments (in the supplement) show that this is indeed quite effective.
Stochastic gradient descent: RAD works particularly well with online methods, such as stochastic gradient descent. Indeed, in this case we need only to make gradient descent steps for newly added malicious instances, which can be added one at a time until convergence.
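To make the online variant concrete, the sketch below performs a single stochastic gradient step of logistic regression on one instance, which is the only update needed when a newly generated adversarial instance is added. The choice of logistic loss, the learning rate `lr`, and the function name `sgd_step` are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def sgd_step(
    w: Sequence[float],
    b: float,
    x: Sequence[float],
    label: int,
    lr: float = 0.1,
) -> Tuple[List[float], float]:
    """One SGD step on the logistic loss log(1 + exp(-label * (w.x + b)))."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    z = label * score
    # derivative of the loss with respect to the score; the guard avoids
    # overflow in exp() for instances that are already classified confidently
    grad_score = -label / (1.0 + math.exp(z)) if z < 700 else 0.0
    new_w = [wj - lr * grad_score * xj for wj, xj in zip(w, x)]
    new_b = b - lr * grad_score
    return new_w, new_b
```

In a retraining loop, this step would be applied with `label=+1` to each adversarial instance as it is generated, instead of refitting the model on the whole augmented data set.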
4 Modeling Attackers
4.1 Evasion Attack as Optimization
In prior literature, evasion attacks have almost universally been modeled as optimization problems in which attackers balance the objective of evading the classifier (by changing the label from $+1$ to $-1$) and the cost of such evasion lowd2005adversarial ; li2014feature . Our approach is in the same spirit, but is formally somewhat distinct. In particular, we assume that the adversary has the following two competing objectives: 1) appear as benign as possible to the classifier, and 2) minimize modification cost. It is also natural to assume that the attacker obtains no value from a modification to the original feature vector if the result is still classified as malicious. To formalize, consider an attacker who in the original training data uses a feature vector $x_i$ ($i \in I_{bad}$). The adversary is solving the following optimization problem:
$$\min_{x \in \mathcal{X}} \; \min\{0, f(x)\} + c(x, x_i). \quad (4)$$
We assume that $c(x, x_i) \ge 0$, $c(x, x_i) = 0$ iff $x = x_i$, and $c$ is strictly increasing in $\|x - x_i\|_2$ and strictly convex in $x$. (Here we exhibit a particular general attack model, but many alternatives are possible, such as using constrained optimization. We found experimentally that the results are not particularly sensitive to the choice of the attack model.) Because Problem 4 is non-convex, we instead minimize an upper bound:
$$\min_{x \in \mathcal{X}} \; Q(x), \qquad Q(x) \equiv f(x) + c(x, x_i). \quad (5)$$
In addition, if $f(x_i) < 0$, we return $x_i$ before solving Problem 5. If Problem 5 returns an optimal solution $x^*$ with $f(x^*) \ge 0$, we return $x_i$; otherwise, we return $x^*$. Problem 5 has two advantages. First, if $f(x)$ is convex and real-valued, this is a (strictly) convex optimization problem, has a unique solution, and we can solve it in polynomial time. An important special case is when $f(x) = w^T x$. The second advantage we formalize in the following lemma.
The following corollary then follows by uniqueness of optimal solutions for strictly convex objective functions over a real vector space.
A direct consequence of this corollary is that when we use Problem 5 to approximate Problem 4 and this approximation is convex, we always return either the optimal evasion, or the original instance if no cost-effective evasion is possible. An oracle constructed on this basis will therefore return a unique solution, and supports the theoretical characterization of RAD above.
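For intuition, the convex case admits a closed-form oracle under two simplifying assumptions that are ours rather than the paper's: an affine score $f(x) = w^T x + b$ and a quadratic cost $c(x, x_i) = \frac{\lambda}{2}\|x - x_i\|_2^2$, with the relaxed objective taken to be $f(x) + c(x, x_i)$. Setting the gradient $w + \lambda (x - x_i)$ to zero gives $x^* = x_i - w/\lambda$. The function name and signature below are hypothetical.

```python
from typing import List, Sequence


def linear_quadratic_oracle(
    w: Sequence[float], b: float, x_i: Sequence[float], lam: float
) -> List[float]:
    """Evasion oracle for f(x) = w.x + b and cost (lam / 2) * ||x - x_i||^2."""

    def f(x: Sequence[float]) -> float:
        return sum(wj * xj for wj, xj in zip(w, x)) + b

    if f(x_i) < 0:
        return list(x_i)  # already classified as benign: nothing to gain
    # unique minimizer of the strictly convex relaxed objective
    x_star = [xj - wj / lam for wj, xj in zip(w, x_i)]
    # evade only if the relaxed optimum is actually classified as benign
    return x_star if f(x_star) < 0 else list(x_i)
```

A larger `lam` makes modifications more expensive, so the minimizer stays closer to `x_i` and is less likely to cross the decision boundary, in which case the original instance is returned unchanged.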
The results above are encouraging, but many learning problems do not feature a convex $f(x)$, or a continuous feature space. Next, we consider several general algorithms for adversarial evasion.
4.2 Coordinate Greedy
We propose a very general local search framework, CoordinateGreedy (CG) (Algorithm 2), for approximating optimal attacker evasion.
The high-level idea is to iteratively choose a feature, and greedily update this feature to incrementally improve the attacker’s utility (as defined by Problem 5). In general, this algorithm will only converge to a locally optimal solution. We therefore propose a version with random restarts: run CG from $L$ random starting points in feature space. As long as a global optimum has a basin of attraction with positive Lebesgue measure, or the feature space is finite, this process will asymptotically converge to a globally optimal solution as we increase the number of random restarts. Thus, as we increase the number of random restarts, we expect to increase the frequency with which we actually return the global optimum. Let $p_L$ denote the probability that the oracle based on coordinate greedy with $L$ random restarts returns a suboptimal solution to Problem 5. The next result generalizes the bound on RAD to allow for this, with the restriction that the risk function which we bound from above uses the 0/1 loss. Let $\mathcal{L}^*_{A,01}(\mathcal{O})$ correspond to the total adversarial risk in Equation 2, where the loss function $l$ is the 0/1 loss. Suppose that $\mathcal{O}_L$ uses coordinate greedy with $L$ random restarts.
Let . with probability at least , where and uses any loss function which is an upper bound on the loss.
Experiments suggest that the probability of a suboptimal oracle response vanishes quite rapidly for an array of learning algorithms, and for either discrete or continuous features, as we increase the number of restarts (see the supplement for details). Consequently, in practice retraining with coordinate greedy nearly minimizes an upper bound on minimal adversarial risk for a 0/1 loss with few restarts of the approximate attacker oracle.
Continuous Feature Space: For a continuous feature space, we assume that both $f(x)$ and $c(x, x_i)$ are differentiable in $x$, and propose using the coordinate descent algorithm, which is a special case of coordinate greedy, where the GreedyImprove step is $x \leftarrow x - \tau e_j \, \partial Q(x) / \partial x_j$, where $\tau$ is the step size and $e_j$ the direction of the $j$th coordinate. Henceforth, let the original adversarial instance $x_i$ be given; we then simplify the cost function $c(x, x_i)$ to be only a function of $x$, denoted $c(x)$. If the function $f(x)$ is convex and differentiable, our coordinate-descent-based Algorithm 2 can always find the global optimum, which is the attacker's best response luo1992convergence , and Proposition 3.1 applies, by Corollary 4.1. If $f(x)$ is not convex, then coordinate descent will only converge to a local optimum.
Discrete Feature Space: In the case of a discrete feature space, the GreedyImprove step of CG can simply enumerate all options for feature $j$, and choose the one that most improves the objective.
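The local search with random restarts can be sketched for binary features as follows. This is an illustrative variant, not a reproduction of Algorithm 2: it sweeps over the coordinates in order rather than sampling them, treats the attacker's objective as an arbitrary callable `objective(x)` to be minimized, and uses hypothetical parameter names (`max_sweeps`, `restarts`, `seed`).

```python
import random
from typing import Callable, List, Sequence, Tuple

Objective = Callable[[Sequence[int]], float]


def coordinate_greedy(
    objective: Objective, x0: Sequence[int], max_sweeps: int = 100
) -> Tuple[List[int], float]:
    """Flip one binary feature at a time, keeping only strict improvements."""
    x = list(x0)
    best = objective(x)
    for _ in range(max_sweeps):
        improved = False
        for j in range(len(x)):
            x[j] = 1 - x[j]  # try the only other value of binary feature j
            value = objective(x)
            if value < best:
                best, improved = value, True
            else:
                x[j] = 1 - x[j]  # undo a flip that does not help
        if not improved:
            break  # local optimum: no single flip improves the objective
    return x, best


def cg_with_restarts(
    objective: Objective, x_orig: Sequence[int], restarts: int = 10, seed: int = 0
) -> Tuple[List[int], float]:
    """Run coordinate greedy from x_orig and from random starting points."""
    rng = random.Random(seed)
    best_x, best_val = coordinate_greedy(objective, x_orig)
    for _ in range(restarts):
        start = [rng.randint(0, 1) for _ in x_orig]
        x, val = coordinate_greedy(objective, start)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val
```

For features with more than two values, the flip would be replaced by an enumeration of all values of feature `j`. The number of restarts trades running time against the chance of returning only a local optimum.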
5 Experimental Results
The results above suggest that the proposed systematic retraining algorithm is likely to be effective at increasing resilience to adversarial evasion. We now offer an experimental evaluation of this (additional results are provided in the supplement). We present the results for the exponential cost model, where . Additionally, we simulated attacks using Problem 5 formulation. Results for other cost functions and attack models are similar, as shown in the supplement. Moreover, the supplement demonstrates that the approach is robust to cost function misspecification.
Comparison to Optimal: The first comparison we draw is to a recent algorithm, SMA, which minimizes the $\ell_1$-regularized adversarial risk function (2) using the hinge loss function. Specifically, SMA formulates the problem as a large mixed-integer linear program which it solves using constraint generation li2014feature . The main limitation of SMA is scalability. Because RAD uses out-of-the-box learning tools and does not involve non-convex bi-level optimization, it is considerably more scalable.
We compared SMA and RAD using Enron data klimt2004enron . As Figure 1(a) demonstrates, retraining solutions of RAD are nearly as good as SMA, particularly for non-trivial adversarial cost sensitivity. In contrast, a baseline implementation of SVM is significantly more fragile to evasion attacks. Moreover, the runtime comparison for these algorithms in Figure 1(b) shows that RAD is much more scalable than SMA.
Effectiveness of Retraining: In this section we use the Enron dataset klimt2004enron and the MNIST dataset lecun2010mnist to evaluate the robustness of three common algorithms in their standard implementation, and in RAD: logistic regression, SVM (using a linear kernel), and a neural network (NN) with 3 hidden layers. In Enron data, features correspond to relative word frequencies. 2000 features were used for the Enron and 784 for the MNIST datasets. Throughout, we use precision, recall, and accuracy as metrics. We present the results for a continuous feature space here. Results for binary features are similar and provided in the supplement.
Figure 2(a) shows the performance of logistic regression, with and without retraining, on Enron and MNIST. The increased robustness of RAD is immediately evident: performance of RAD is essentially independent of the attacker's cost sensitivity on all three measures, and substantially exceeds baseline algorithm performance when that sensitivity is small. Interestingly, we observe that the baseline algorithms are significantly more fragile to evasion attacks on Enron data compared to MNIST: benign and malicious classes seem far easier to separate on the latter than the former. This qualitative comparison between the Enron and MNIST datasets is consistent for other classification methods as well (SVM, NN). These results also illustrate that the neural-network classifiers, in their baseline implementation, are significantly more robust to evasion attacks than the (generalized) linear classifiers (logistic regression and SVM): even with a relatively small attack cost, attacks become ineffective relatively quickly, and the differences between the performance on Enron and MNIST data are far smaller. Throughout, however, RAD significantly improves robustness to evasion, maintaining extremely high accuracy, precision, and recall essentially independently of the attacker's cost sensitivity, dataset, and algorithm used.
In order to explore whether RAD would sacrifice accuracy when no adversary is present, Figure 3 shows the performance of the baseline algorithms and RAD on a test dataset sans evasions. Surprisingly, RAD is never significantly worse, and in some cases better than non-adversarial baselines: adding malicious instances appears to increase overall generalization ability. This is also consistent with the observation by Kantchelian et al. Kantchelian15 .
Oracles based on Human Evasion Behavior: To evaluate the generality of RAD, we now use a non-optimization-based threat model, making use instead of observed human evasion behavior in human subject experiments. The data for this evaluation was obtained from the human subject experiment by Ke et al. ke2016behavioral in which subjects were tasked with evading an SVM-based spam filter, manipulating 10 spam/phishing email instances in the process. In these experiments, Ke et al. used machine learning to develop a model of human subject evasion behavior. We now adopt this model as the evasion oracle in our RAD retraining framework, executing the synthetic model for 0-10 iterations to obtain evasion examples.
Figure 4(a) shows the recall results for the dataset of 10 malicious emails (the classifiers are trained on Enron data, but evaluated on these 10 emails, including evasion attacks). Figure 4(b) shows the classifier performance for the Enron dataset by applying the synthetic adversarial model as the oracle.
We can make two high-level observations. First, notice that human adversaries appear significantly less powerful in evading the classifier than the automated optimization-based attacks we previously considered. This is a testament both to the effectiveness of our general-purpose adversarial evaluation approach, and to the likelihood that such automated attacks significantly overestimate adversarial evasion risk in many settings. Second, we can observe that the synthetic model used in RAD leads to a significantly more robust classifier. Moreover, as our evaluation used actual evasions, while the synthetic model was used only in training the classifier as a part of RAD, this experiment suggests that the synthetic model can be relatively effective in modeling behavior of human adversaries. Figure 4(b) performs a more systematic study using the synthetic model of adversarial behavior on the Enron dataset. The findings are consistent with those only considering the 10 spam instances: retraining significantly boosts robustness to evasion, with classifier effectiveness essentially independent of the number of queries made by the oracle.
6 Conclusion
We proposed a general-purpose systematic retraining algorithm against evasion attacks on classifiers for arbitrary oracle-based evasion models. We first demonstrated that this algorithm effectively minimizes an upper bound on optimal adversarial risk, which is typically extremely difficult to compute (indeed, no approach exists for minimizing adversarial loss for an arbitrary evasion oracle). Experimentally, we showed that the performance of our retraining approach is nearly indistinguishable from optimal, whereas scalability is dramatically improved: indeed, with RAD, we are able to easily scale the approach to thousands of features, whereas a state-of-the-art adversarial risk optimization method can only scale to 10-30 features. We generalized our results to show that a probabilistic upper bound on minimal adversarial loss can be obtained even when the oracle is computed approximately by leveraging random restarts, and presented an empirical evaluation which confirms that the resulting bound relaxation is tight in practice.
We also offer a general-purpose framework for optimization-based oracles using variations of coordinate greedy algorithm on both discrete and continuous feature spaces. Our experiments demonstrate that our adversarial oracle approach is extremely effective in corrupting the baseline learning algorithms. On the other hand, extensive experiments also show that the use of our retraining methods significantly boosts robustness of algorithms to evasion. Indeed, retrained algorithms become nearly insensitive to adversarial evasion attacks, at the same time maintaining extremely good learning performance on data overall. Perhaps the most significant strength of the proposed approach is that it can make use of arbitrary learning algorithms essentially “out-of-the-box”, and effectively and quickly boost their robustness, in contrast to most prior adversarial learning methods which were algorithm-specific.
References
- (1) Tom Fawcett and Foster Provost. Adaptive fraud detection. Data mining and knowledge discovery, 1(3):291–316, 1997.
- (2) Matthew V Mahoney and Philip K Chan. Learning nonstationary models of normal network traffic for detecting novel attacks. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 376–385. ACM, 2002.
- (3) Prahlad Fogla, Monirul I Sharif, Roberto Perdisci, Oleg M Kolesnikov, and Wenke Lee. Polymorphic blending attacks. In USENIX Security, 2006.
- (4) Ling Huang, Anthony D Joseph, Blaine Nelson, Benjamin IP Rubinstein, and JD Tygar. Adversarial machine learning. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, pages 43–58. ACM, 2011.
- (5) Aleksander Kołcz and Choon Hui Teo. Feature weighting for improved classifier robustness. In CEAS’09: sixth conference on email and anti-spam, 2009.
- (6) Daniel Lowd and Christopher Meek. Adversarial learning. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 641–647. ACM, 2005.
- (7) Christoph Karlberger, Günther Bayler, Christopher Kruegel, and Engin Kirda. Exploiting redundancy in natural language to penetrate bayesian spam filters. WOOT, 7:1–7, 2007.
- (8) Blaine Nelson, Benjamin IP Rubinstein, Ling Huang, Anthony D Joseph, Steven J Lee, Satish Rao, and JD Tygar. Query strategies for evading convex-inducing classifiers. The Journal of Machine Learning Research, 13(1):1293–1332, 2012.
- (9) Laurent El Ghaoui, Gert René Georges Lanckriet, Georges Natsoulis, et al. Robust classification with interval data. Computer Science Division, University of California, 2003.
- (10) Bo Li and Yevgeniy Vorobeychik. Feature cross-substitution in adversarial classification. In Advances in Neural Information Processing Systems, pages 2087–2095, 2014.
- (11) Bo Li and Yevgeniy Vorobeychik. Scalable optimization of randomized operational decisions in adversarial classification settings. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 599–607, 2015.
- (12) Choon H Teo, Amir Globerson, Sam T Roweis, and Alex J Smola. Convex learning with invariances. In Advances in neural information processing systems, pages 1489–1496, 2007.
- (13) Amir Globerson and Sam Roweis. Nightmare at test time: robust learning by feature deletion. In Proceedings of the 23rd international conference on Machine learning, pages 353–360. ACM, 2006.
- (14) Michael Brückner and Tobias Scheffer. Stackelberg games for adversarial prediction problems. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 547–555. ACM, 2011.
- (15) Battista Biggio, Giorgio Fumera, and Fabio Roli. Security evaluation of pattern classifiers under attack. Knowledge and Data Engineering, IEEE Transactions on, 26(4):984–996, 2014.
- (16) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- (17) A. Kantchelian, J. D. Tygar, and A. D. Joseph. Evasion and hardening of tree ensemble classifiers. arXiv pre-print, 2015.
- (18) Bryan Klimt and Yiming Yang. The enron corpus: A new dataset for email classification research. In Machine learning: ECML 2004, pages 217–226. Springer, 2004.
- (19) Yann LeCun and Corinna Cortes. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2010.
- (20) Liyiming Ke, Bo Li, and Yevgeniy Vorobeychik. Behavioral experiments in email filter evasion. In AAAI Conference on Artificial Intelligence, 2016.
- (21) Nilesh Dalvi, Pedro Domingos, Sumit Sanghai, Deepak Verma, et al. Adversarial classification. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 99–108. ACM, 2004.
- (22) Zhi-Quan Luo and Paul Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7–35, 1992.
- (23) Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
Supplement To: A General Retraining Framework for Scalable Adversarial Classification
Proof of Lemma 4.1
If , then . By optimality of , . Since is suboptimal in Problem (4) and strictly positive in all other cases, . By optimality of , , which implies that . Consequently, , and, therefore, .
Proof of Proposition 4.2
Let . Consequently, for any ,
with probability at least , where
, by the Chernoff bound, and Lemma 4.1, which assures that an optimal solution to Problem (5) can only over-estimate mistakes. Moreover,
since for all by construction, and is an upper bound on . Putting everything together, we get the desired result.
Convergence of the Suboptimality Probability with Increasing Number of Restarts
Attacks as Constrained Optimization
A variation on the attack models in the main paper is when the attacker is solving the following constrained optimization problem:
for some cost budget constraint and query budget constraint. While this problem is, again, non-convex, we can instead minimize the convex upper bound, $f(x)$, as before, if we assume that $f(x)$ is convex. In this case, if the feature space is continuous, the problem can be solved optimally using standard convex optimization methods (Boyd and Vandenberghe). If the feature space is binary and the classifier is linear or convex-inducing, the algorithms proposed by Lowd and Meek lowd2005adversarial and Nelson et al. nelson2012query can be applied. Figures 6, 7, and 8 show the performance of RAD based on the optimized adversarial strategies for the various learning models.
Experiments with Continuous Feature Space
In Figure 9 we visualize the relative vulnerability of the different classifiers, as well as the effectiveness of our general-purpose evasion methods based on coordinate greedy. Each row corresponds to a classifier, and moving right within a row represents decreasing attacker cost sensitivity (allowing attacks to make more substantial modifications to the image in an effort to evade correct classification). We can observe that NN classifiers require more substantial changes to the images to evade, ultimately making these entirely unlike the original. In contrast, logistic regression is quite vulnerable: the digit remains largely recognizable even after evasion attacks.
Experiments with Discrete Feature Space
Considering now data sets with binary features, we use the Enron data with a bag-of-words feature representation, for a total of 2000 features. We compare Naive Bayes (NB), logistic regression, SVM, and a 3-layer neural network. Our comparison involves both the baseline and RAD implementations of these, using the same metrics as above.
Figure 10 confirms the effectiveness of RAD: every algorithm is substantially more robust to evasion with retraining, compared to the baseline implementation. Most of the algorithms can obtain extremely high accuracy on this data with the bag-of-words feature representation. However, a 3-layer neural network is now less robust than the other algorithms, unlike in the experiments with continuous features. Indeed, Goodfellow et al. goodfellow2014explaining similarly observe the relative fragility of NN to evasion attacks.
Experiments with Multi-class Classification
Discussion so far dealt entirely with binary classification. We now observe that extending it to multi-class problems is quite direct. Specifically, while previously the attacker aimed to make an instance classified as (malicious) into a benign instance (), for a general label set , we can define a malicious set and a target set , with , where every entity represented by a feature vector with a label aims to transform so that its label is changed to . In this setting, let . We can then use the following empirical risk function:
where aims to transform instances so that . The relaxed version of the adversarial problem can then be generalized to
For a finite target set , this problem is equivalent to taking the best solution of a finite collection of problems identical to Problem 5.
To evaluate the effectiveness of RAD, and resilience of baseline algorithms, in multi-class classification settings, we use the MNIST dataset and aim to correctly identify digits based on their images. Our comparison involves SVM and 3-layer neural network (results for NN-1 are similar). We use as the malicious class (that is, instances corresponding to digits and are malicious), and is the set of benign labels (what malicious instances wish to be classified as).
The results, shown in Figure 11, are largely consistent with our previous observations: both SVM and 3-layer NN perform well when retrained with RAD, with near-perfect accuracy despite adversarial evasion attempts. Moreover, RAD significantly boosts robustness to evasion, particularly when the attacker's cost sensitivity is small (an adversary who is not very sensitive to evasion costs).
Figure 12 offers a visual demonstration of the relative effectiveness of attacks on the baseline implementation of SVM and 1- and 3-layer neural networks. Here, we can observe that a significant change is required to evade the linear SVM, with the digit having to nearly resemble a 2 after modification. In contrast, significantly less noise is added to the neural network in effecting evasion.
Evaluation of Cost Function Variations and Robustness to Misspecification
Here we evaluate classification performance for several cost functions, including the case in which the defender and the adversary use different cost functions. Figure 13 shows the empirical evaluation results based on the Enron dataset with binary features. If both the defender and the adversary use the L1 (a) or quadratic (b) cost function, RAD defends well against the malicious manipulations. Even if the defender misspecifies the adversarial cost model, as shown in Figure 13 (c), the framework can still defend against the attack strategies effectively.
Experiments for Clustering Malicious Instances
To speed up the proposed algorithm, here we cluster the malicious instances and use the center of each cluster to generate the potential “evasion” instances for the retraining framework. Figure 15 shows that the running time can be reduced by applying the clustering algorithm to the original malicious instances, while the classification performance stays stable across the different learning models.
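The clustering step can be sketched as follows. The text does not specify which clustering algorithm is used, so this example assumes plain k-means (Lloyd's algorithm) with the first `k` points as initial centers; the function name `kmeans_centers` and its parameters are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans_centers(
    points: Sequence[Sequence[float]], k: int, iterations: int = 50
) -> List[Point]:
    """Lloyd's k-means; returns the k cluster centers."""
    pts = [tuple(float(v) for v in p) for p in points]
    centers = pts[:k]  # simple deterministic initialization (an assumption)
    for _ in range(iterations):
        clusters: List[List[Point]] = [[] for _ in centers]
        for p in pts:
            dists = [sum((a - c) ** 2 for a, c in zip(p, ctr)) for ctr in centers]
            clusters[dists.index(min(dists))].append(p)
        # recompute each center as the mean of its members; an empty
        # cluster keeps its previous center
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else ctr
            for members, ctr in zip(clusters, centers)
        ]
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers
```

In the retraining loop, the evasion oracle would then be queried once per center instead of once per malicious instance, which reduces the number of oracle calls per iteration from the number of malicious instances to `k`.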