1 Introduction
Modern classifiers, despite their widespread empirical success, are known to be susceptible to adversarial attacks. In this paper, we are specifically concerned with so-called “data-poisoning” attacks (formally, causative attacks; Barreno et al., 2006; Papernot et al., 2018), where the attacker manipulates some aspects of the training data in order to cause the learning algorithm to output a faulty classifier. Automated machine-learning systems which rely on large, user-generated datasets (e.g., email spam filters, product recommendation engines, and fake review detectors) are particularly susceptible to such attacks. For example, by maliciously flagging legitimate emails as spam and mislabeling spam as innocuous, an adversary can trick a spam filter into mistakenly letting through a particular email.
Types of data poisoning attacks which have been studied include label-flipping attacks (Xiao et al., 2012), where the labels of a training set can be adversarially manipulated to decrease performance of the trained classifier; general data poisoning, where both the training inputs and labels can be manipulated (Steinhardt et al., 2017); and backdoor attacks (Chen et al., 2017; Tran et al., 2018), where the training set is corrupted so as to cause the classifier to deviate from its expected behavior only when triggered by a specific pattern. However, unlike the alternative “test-time” adversarial setting, where reasonably effective provable defenses exist to build adversarially robust classifiers, comparatively little work has been done on building classifiers that are certifiably robust to targeted data poisoning attacks.
In this work, we propose a strategy for building classifiers that are certifiably robust against label-flipping attacks. In particular, we propose a pointwise certified defense: with each prediction, the classifier includes a certification guaranteeing that its prediction would not be different had it been trained on data with some number of labels flipped. Prior works on certified defenses make statistical guarantees over the entire test set, but they make no guarantees as to the robustness of a prediction on any particular test point. Thus, with these algorithms, a determined adversary could still cause a specific test point to be misclassified. We therefore consider the threat of a worst-case adversary that can make a training set perturbation to target each test point individually. This motivates a defense that can certify each of its individual predictions, as we present here. To the best of our knowledge, this work represents the first pointwise certified defense to data poisoning attacks.
Our approach leverages randomized smoothing (Cohen et al., 2019), a technique that has previously been used to guarantee test-time robustness to adversarial manipulation of the input to a deep network. However, where prior uses of randomized smoothing randomize over the input to the classifier for test-time guarantees, we instead randomize over the entire training procedure of the classifier. Specifically, by randomizing over the labels during this training process, we obtain an overall classification pipeline that is certified to be robust (i.e., to not change its prediction) when some number of labels are adversarially manipulated in the training set. Whereas previous applications of randomized smoothing perform Monte Carlo sampling to provide probabilistic bounds (due to the intractability of integrating the decision regions of a deep network), we derive an analytical bound that provides truly guaranteed robustness. Although a naive implementation of this approach would not be computationally feasible, we show that by using a linear least-squares classifier we can obtain these certified bounds with no additional runtime cost over standard classification.
A further distinction of our approach is that the applicability of our robustness guarantees does not rely upon stringent model assumptions or the quality of the features. Existing work on robust linear classification or regression provides certificates that only hold under specific model assumptions, e.g., recovering the best-fit linear coefficients, which is most useful when the data exhibit a linear relationship in the feature space. In contrast, our classifier makes no assumptions about the separability of the data or quality of the features; this means our certificates remain valid when applying our classifier to arbitrary features, which in practice allows us to leverage advances in unsupervised feature learning (Le, 2013) and transfer learning (Donahue et al., 2014). As an example, we apply our classifier to pretrained and unsupervised deep features to demonstrate its feasibility for classification of highly non-linear data such as ImageNet.
We evaluate our proposed classifier on several benchmark datasets common to the data poisoning literature. Specifically, we demonstrate that our randomized classifier is able to achieve 75.7% certified accuracy on MNIST 1/7 even when the number of allowed label flips would drive a standard, undefended classifier to an accuracy of less than 50%. Similar results in experiments on the Dogfish binary classification challenge from ImageNet validate our technique for more challenging datasets: our binary classifier maintains 81.3% certified accuracy in the face of an adversary who could reduce an undefended classifier to less than 1%. We further experiment on the full MNIST dataset to demonstrate our algorithm’s effectiveness for multi-class classification. Moreover, our classifier maintains a reasonably competitive non-robust accuracy (e.g., 94.5% on MNIST 1/7 compared to 99.1% for the undefended classifier).
2 Related Work
Data-poisoning attacks.
A data-poisoning attack (Muñoz-González et al., 2017; Yang et al., 2017) is an attack where an adversary corrupts some portion of the training set or adds new inputs, with the goal of degrading the performance of the learned model. The attack can be targeted to cause poor performance on a specific test example or can simply reduce the overall test performance. The adversary is assumed to have perfect knowledge of the learning algorithm, so security by design (as opposed to obscurity) is the only viable defense against such attacks. The adversary is also typically assumed to have access to the training set and, in some cases, the test set.
Previous work has investigated attacks and defenses for data-poisoning attacks applied to feature selection (Xiao et al., 2015), SVMs (Biggio et al., 2011; Xiao et al., 2012) (Liu et al., 2017), and PCA (Rubinstein et al., 2009), to name a few. Some attacks can even achieve success with “clean-label” attacks, inserting adversarially perturbed, seemingly correctly labeled training examples that cause the classifier to perform poorly (Shafahi et al., 2018; Zhu et al., 2019). Interestingly, our defense can also be viewed as (the first) certified defense to such attacks: perturbing an image such that the model’s learned features no longer match the label is theoretically equivalent to changing the label such that it no longer matches the image. For an overview of data poisoning attacks and defenses in machine learning, see Biggio et al. (2014).
Label-flipping attacks.
A label-flipping attack is a specific type of data-poisoning attack where the adversary is restricted to changing the training labels. The classifier is then trained on the corrupted training set, with no knowledge of which labels have been tampered with. For example, an adversary could mislabel spam emails as innocuous, or flag real product reviews as fake.
Unlike random label noise, for which many robust learning algorithms have been successfully developed (Natarajan et al., 2013; Liu and Tao, 2016; Patrini et al., 2017), adversarial label-flipping attacks can be specifically targeted to exploit the structure of the learning algorithm, significantly degrading performance. Robustness to such attacks is therefore harder to achieve, both theoretically and empirically (Xiao et al., 2012; Biggio et al., 2011). A common defense technique is sanitization, whereby a defender attempts to identify and remove or relabel training points that may have had their labels corrupted (Paudice et al., 2019; Taheri et al., 2019). Unfortunately, recent work has demonstrated that this is often not enough against a sufficiently powerful adversary (Koh et al., 2018). Further, no existing defenses provide pointwise guarantees regarding their robustness.
Certified defenses.
Existing works on certified defenses to adversarial data poisoning attacks typically focus on the regression case and provide broad statistical guarantees over the entire test distribution. A common approach to such certifications is to show that a particular algorithm recovers some close approximation to the best linear fit coefficients (Diakonikolas et al., 2019; Prasad et al., 2018; Shen and Sanghavi, 2019), or that the expected loss on the test distribution is bounded (Klivans et al., 2018; Chen and Paschalidis, 2018). These results generally rely on assumptions on the data distribution: some assume sparsity in the coefficients (Karmalkar and Price, 2018; Chen et al., 2013) or corruption vector (Bhatia et al., 2015); others require limited effects of outliers (Steinhardt et al., 2017). As mentioned above, all of these methods fail to provide guarantees for individual test points. Additionally, most of these statistical guarantees are not as meaningful when their model assumptions do not hold.
Randomized smoothing.
Since the discovery of adversarial examples (Szegedy et al., 2013; Goodfellow et al., 2015), the research community has been investigating techniques for increasing the adversarial robustness of complex models such as deep networks. After a series of heuristic defenses (Metzen et al., 2017; Feinman et al., 2017), followed by attacks breaking them (Athalye et al., 2018; Carlini and Wagner, 2017), focus began to shift towards the development of provable robustness.
One approach which has gained popularity in recent work is randomized smoothing. Rather than certifying the original classifier $f$, randomized smoothing defines a new classifier $g$ whose prediction at an input $x$ is the class assigned the most probability when $x$ is perturbed with noise from some distribution $\mu$ and passed through $f$. That is, $g(x) = \arg\max_c \Pr_{\epsilon \sim \mu}\left(f(x + \epsilon) = c\right)$. This new classifier $g$ is then certified as robust, ideally without sacrificing too much accuracy compared to $f$. The original formulation was presented by Lecuyer et al. (2018) and borrowed ideas from differential privacy. The above definition is due to Li et al. (2018) and was popularized by Cohen et al. (2019), who derived a tight robustness guarantee. Follow-up work has focused on optimizing the training procedure of $f$ (Salman et al., 2019) and extending the analysis to other types of distributions (Lee et al., 2019). For more details, see Cohen et al. (2019).
3 A General View of Randomized Smoothing
We begin by presenting a general viewpoint of randomized smoothing. Under our notation, randomized smoothing constructs an operator $G(\mu, \phi)$ that maps a binary-valued function $\phi: \mathcal{X} \to \{0, 1\}$ and a smoothing measure $\mu: \mathcal{X} \to \mathbb{R}_+$, with $\int_{\mathcal{X}} \mu(x)\,dx = 1$, to the expected value of $\phi$ under $\mu$ (that is, $G(\mu, \phi)$ represents the “vote” of $\phi$ weighted by $\mu$). For example, $\phi$ could be a binary image classifier and $\mu$ could be some small, random pixel noise applied to the to-be-classified image. We also define a “hard threshold” version $g(\mu, \phi)$ that returns the most probable output (the majority vote winner). [Footnote 1: For simplicity, we present the methodology here in terms of binary-valued functions, which will correspond eventually to binary classification problems. The extension to the multi-class setting requires additional notation, and thus is deferred to the appendix.] Formally,
$$G(\mu, \phi) = \mathbb{E}_{x \sim \mu}[\phi(x)] = \int_{\mathcal{X}} \mu(x)\,\phi(x)\,dx, \qquad g(\mu, \phi) = \mathbf{1}\{G(\mu, \phi) \ge 1/2\},$$
where $\mathbf{1}\{\cdot\}$ is the indicator function. Where it is clear from context, we will omit the arguments, writing simply $G$ or $g$. Intuitively, for two similar measures $\mu, \rho$, we would expect that for most $\phi$, even though $G(\mu, \phi)$ and $G(\rho, \phi)$ may not be equal, the threshold function should satisfy $g(\mu, \phi) = g(\rho, \phi)$. Further, the degree to which $\mu$ and $\rho$ can differ while still preserving this property should increase as $G(\mu, \phi)$ approaches either 0 or 1, because this increases the “margin” with which the function $\phi$ is 0 or 1 respectively over the measure $\mu$. More formally, we define a general randomized smoothing guarantee as follows:
Definition 1.
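Let $\mu: \mathcal{X} \to \mathbb{R}_+$ be a smoothing measure over $\mathcal{X}$, with $\int_{\mathcal{X}} \mu(x)\,dx = 1$. Then a randomized smoothing robustness guarantee is a specification of a distance measure $d(\mu, \rho)$ and a function $f: [0, 1] \to \mathbb{R}_+$ such that for all $\phi$,
$$g(\mu, \phi) = g(\rho, \phi) \quad \text{whenever} \quad d(\mu, \rho) \le f(G(\mu, \phi)). \tag{1}$$
For brevity, we will sometimes use $p$ in place of $G(\mu, \phi)$, representing the fraction of the vote that the majority class receives (this is analogous to $p_A$ in Cohen et al. (2019)).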
Instantiations of randomized smoothing
This definition is rather abstract, so we highlight concrete examples of how it can be applied to achieve certified guarantees against adversarial attacks.
Example 1.
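The randomized smoothing guarantee of Cohen et al. (2019) uses the smoothing measures $\mu = \mathcal{N}(x_0, \sigma^2 I)$, a Gaussian around the point $x_0$ to be classified, and $\rho = \mathcal{N}(x_0 + \delta, \sigma^2 I)$, the same measure perturbed by $\delta$. They prove that (1) holds for all classifiers $\phi$ if we define
$$d(\mu, \rho) = \frac{1}{\sigma}\|\delta\|_2 = \sqrt{2\,\mathrm{KL}(\mu \,\|\, \rho)}, \qquad f(p) = \left|\Phi^{-1}(p)\right|,$$
where $\mathrm{KL}$ denotes KL divergence and $\Phi^{-1}$ denotes the inverse CDF of the Gaussian distribution.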
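As a small illustration of the Gaussian guarantee, the sketch below computes the largest certified perturbation norm, assuming the certificate takes the form $\|\delta\|_2 \le \sigma\,|\Phi^{-1}(p)|$. The function name `gaussian_certified_radius` is hypothetical and introduced only for this example.

```python
from statistics import NormalDist


def gaussian_certified_radius(p, sigma):
    """Largest L2 norm of delta certified under the assumed Gaussian bound.

    p     : vote fraction G(mu, phi), strictly between 0 and 1
    sigma : standard deviation of the Gaussian smoothing noise
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    # |Phi^{-1}(p)| is symmetric around p = 1/2, where it equals zero.
    return sigma * abs(NormalDist().inv_cdf(p))
```

The radius is zero when the vote is split evenly and grows without bound as $p$ approaches 0 or 1, which matches the margin intuition described above.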
Although this work focused on the case of randomized smoothing of continuous data via Gaussian noise, this is by no means a requirement of the approach. For instance, Lee et al. (2019) consider an alternative approach for dealing with discrete variables.
Example 2.
In words, the smoothing distribution is such that each dimension is independently perturbed to one of the other values uniformly at random with probability . is a combinatorial function defined as the maximum number of dimensions—out of total—by which and can differ such that a set with measure under is guaranteed to have measure at least under . Lee et al. (2019) prove that this value is independent of and .
Finally, Dvijotham et al. (2020) consider a more general form of randomized smoothing that doesn’t require strict assumptions on the distributions but is still able to provide similar guarantees.
Example 3 (Generic bound).
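Given any two smoothing distributions $\mu, \rho$, we have the generic randomized smoothing robustness certificate, ensuring that (1) holds with definitions
$$d(\mu, \rho) = \mathrm{KL}(\rho \,\|\, \mu), \qquad f(p) = -\frac{1}{2}\log\left(4p(1 - p)\right). \tag{3}$$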
Randomized smoothing in practice
For deep classifiers, the expectation $G(\mu, \phi)$ cannot be computed exactly, and so we must resort to Monte Carlo approximation. In this “standard” form of randomized smoothing, we draw multiple random samples from $\mu$ and use these to construct a high-probability bound on $G(\mu, \phi)$ for certification. More precisely, this bound should be a lower bound on $G(\mu, \phi)$ when the hard prediction $g(\mu, \phi) = 1$ and an upper bound otherwise; this ensures in both cases that we under-certify the true robustness of the classifier $g$. The procedure is shown in Algorithm 2 in Appendix A. These estimates can then be plugged into a randomized smoothing robustness guarantee to provide a high-probability certified robustness bound for the classifier.
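The sampling procedure described above can be sketched as follows. This is not the paper's Algorithm 2; it is a minimal illustration that assumes a one-sided Hoeffding bound for the confidence interval, and the names `smoothed_vote_bound`, `phi`, and `sample_mu` are hypothetical.

```python
import math

import numpy as np


def smoothed_vote_bound(phi, sample_mu, n_samples=1000, delta=0.001, seed=0):
    """Monte Carlo estimate of the smoothed vote with a conservative bound.

    phi       : binary-valued function returning 0 or 1
    sample_mu : callable taking a numpy Generator and returning one draw from mu
    delta     : allowed failure probability of the one-sided bound
    """
    rng = np.random.default_rng(seed)
    votes = np.array([phi(sample_mu(rng)) for _ in range(n_samples)], dtype=float)
    p_hat = votes.mean()
    prediction = int(p_hat >= 0.5)
    # One-sided Hoeffding deviation: exp(-2 * n * eps^2) = delta.
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n_samples))
    if prediction == 1:
        bound = max(p_hat - eps, 0.0)  # lower bound on the vote for label 1
    else:
        bound = min(p_hat + eps, 1.0)  # upper bound on the vote for label 1
    return prediction, bound
```

In both branches the bound moves toward 1/2, so the certificate derived from it can only understate the true robustness, at the cost of a failure probability `delta`.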
4 Label-Flipping Robustness
We now present the main contribution of this paper, a technique for using randomized smoothing to provide certified robustness against label-flipping attacks. Specifically, we first propose a generic strategy for applying randomized smoothing to certify a prediction function against pointwise label-flipping attacks. We show how this general approach can be made tractable using linear least-squares classification, and we use the Chernoff inequality to analytically bound the relevant probabilities for the randomized smoothing procedure. Notably, although we are employing a randomized approach, the final algorithm does not use any random sampling, but rather relies upon a convex optimization problem to compute the certified robustness.
To motivate the approach, we note that in prior work, randomized smoothing was applied at test time with the function $\phi$ being a (potentially deep) classifier that we wish to smooth. However, there is no requirement that the function $\phi$ be a classifier at all; the theory holds for any binary-valued function. Instead of treating $\phi$ as a trained classifier, we consider $\phi$ to be an arbitrary learning algorithm which takes as input a training dataset $\{(x_i, y_i)\}_{i=1}^n$ and additional test points without corresponding labels, which we aim to predict. [Footnote 2: Note that our algorithm does not actually require access to the test data to do the necessary precomputation. We present it here as such merely to give an intuitive idea of the procedure.] In other words, the combined goal of $\phi$ is to first train a classifier and then predict the label of the new example. Thus, we consider test-time outputs to be a function of both the test-time input and the training data that produced the classifier. This perspective allows us to reason about how changes to training data affect the classifier at test time, reminiscent of work on influence functions of deep neural networks (Koh and Liang, 2017; Yeh et al., 2018). When applying randomized smoothing in this setting, we randomize over the labels in the training set, rather than over the test-time input to be classified. Analogous to previous applications of randomized smoothing, if the majority vote of the classifiers trained with these randomly sampled labels has a large margin, it will confer a degree of adversarial robustness to some number of adversarially corrupted labels.
To formalize this intuition, consider two different assignments of $n$ training labels $Y_1, Y_2 \in \{0, 1\}^n$ which differ on precisely $r$ labels. Let $\mu$ (resp. $\rho$) be the distribution resulting from independently flipping each of the labels in $Y_1$ (resp. $Y_2$) with probability $q$. It is clear that as $r$ increases, $d(\mu, \rho)$ should also increase. In fact, it is simple to show (see Appendix B.3 for derivation) that the exact KL divergence between these two distributions is
$$\mathrm{KL}(\mu \,\|\, \rho) = \mathrm{KL}(\rho \,\|\, \mu) = r\,(1 - 2q)\log\frac{1 - q}{q}. \tag{4}$$
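For intuition, the divergence in (4) can be worked out in a few lines. This is a sketch under the stated setup, writing $\mu_i$ and $\rho_i$ (notation assumed here) for the marginal distributions of the $i$-th noisy label under the two label assignments. Because the flips are independent, the divergence decomposes coordinate-wise, and coordinates on which the two assignments agree contribute nothing:

```latex
\begin{align*}
\mathrm{KL}(\mu \,\|\, \rho)
  &= \sum_{i=1}^{n} \mathrm{KL}(\mu_i \,\|\, \rho_i)
   = \sum_{i \,:\, \text{labels differ}} \mathrm{KL}(\mu_i \,\|\, \rho_i). \\
\intertext{On each of the $r$ differing coordinates, $\mu_i$ places mass $1-q$ on one label and $q$ on the other, while $\rho_i$ reverses these masses, so}
\mathrm{KL}(\mu_i \,\|\, \rho_i)
  &= (1-q)\log\frac{1-q}{q} + q\log\frac{q}{1-q}
   = (1-2q)\log\frac{1-q}{q}. \\
\intertext{Summing over the $r$ differing coordinates gives}
\mathrm{KL}(\mu \,\|\, \rho) &= r\,(1-2q)\log\frac{1-q}{q}.
\end{align*}
```

Swapping the roles of $\mu$ and $\rho$ leaves each per-coordinate term unchanged, so the divergence is symmetric in this setting.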
Plugging in the robustness guarantee (3), we have that $g(\mu, \phi) = g(\rho, \phi)$ so long as
$$r \le \frac{\log\left(4p(1 - p)\right)}{2\,(1 - 2q)\log\frac{q}{1 - q}}, \tag{5}$$
where $p = G(\mu, \phi)$. This implies that for any test point, as long as (5) is satisfied, $g$’s prediction (the majority vote weighted by the smoothing distribution) will not change if an adversary corrupts the training set from $Y_1$ to $Y_2$, or indeed to any other training set that differs on at most $r$ labels. We can tune the noise hyperparameter $q$ to achieve the largest possible upper bound in (5); more noise will likely decrease the margin of the majority vote $p$, but it will also decrease the divergence.
Computing a tight bound
This approach has a simple closed form, but the bound is not tight. We can derive a tight bound via a combinatorial approach as in Lee et al. (2019). By precomputing the quantities from Equation (2) for each $r$, we can simply compare $p$ to each of these and thereby certify robustness to the highest possible number of label flips. This computation can be expensive, but it provides a significantly tighter robustness guarantee, certifying approximately twice as many label flips for a given bound on $G(\mu, \phi)$ (see Figure 6 in Appendix D).
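The simpler closed-form certificate can be sketched in a few lines. This example covers only the KL-based bound, not the tighter combinatorial one, and it assumes the margin function $f(p) = -\tfrac{1}{2}\log(4p(1-p))$ together with a per-flip divergence of $(1-2q)\log\frac{1-q}{q}$. The function name `kl_certified_flips` is hypothetical.

```python
import math


def kl_certified_flips(p, q):
    """Number of label flips certified by the closed-form KL bound.

    p : (a bound on) the vote fraction G(mu, phi), in [0, 1]
    q : probability with which each training label is flipped, in (0, 0.5)
    """
    if not 0.0 < q < 0.5:
        raise ValueError("q must lie strictly between 0 and 0.5")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p in (0.0, 1.0):
        return math.inf  # a unanimous vote has unbounded margin
    # Margin available at vote fraction p (symmetric in p and 1 - p).
    margin = -0.5 * math.log(4.0 * p * (1.0 - p))
    # Divergence incurred by each flipped label.
    per_flip = (1.0 - 2.0 * q) * math.log((1.0 - q) / q)
    return math.floor(margin / per_flip)
```

For instance, with $p = 0.99$ and $q = 0.4$ the margin is about 1.61 and each flip costs about 0.081, so this sketch certifies 19 flips. Because it uses floating-point arithmetic, a ratio that lands extremely close to an integer should be treated with care.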
4.1 Efficient implementation via least-squares classifiers
There may appear to be one major impracticality of the algorithm proposed in the previous section, if considered naively: treating the function $\phi$ as an entire training-plus-single-prediction process would require that we train multiple classifiers, over multiple random draws of the labels $y$, all to make a prediction on a single example. In this section, we describe a sequence of tools we employ to restrict the architecture and training process in a manner that drastically reduces this cost, bringing it in line with the traditional cost of classifying a single example. The full procedure, with all the parts described below, can be found in Algorithm 1.
Linear least-squares classification
The fundamental simplifying assumption we make in this work is to restrict the “training” process done by the classifier $\phi$ to be done via the solution of a least-squares problem. Given the training set $\{(x_i, y_i)\}_{i=1}^n$, we assume that there exists some feature mapping $h: \mathbb{R}^d \to \mathbb{R}^k$ (where $k < n$). If existing linear features are not available, this could instead consist of deep features learned from a similar task; the transferability of such features is well documented (Donahue et al., 2014; Bo et al., 2010; Yosinski et al., 2014). As another option, features could be learned in an unsupervised fashion on $x_{1:n}$ (that is, independent of the training labels, which are potentially poisoned). Given this feature mapping, let $X = h(x_{1:n}) \in \mathbb{R}^{n \times k}$ be the training point features and let $y = y_{1:n} \in \{0, 1\}^n$ be the labels. Our training process consists of finding the least-squares fit to the training data, i.e., we find parameters $\hat{\beta} = (X^T X)^{-1} X^T y$ via the normal equation and then we make a prediction on the new example via the linear function $h(x_{n+1})^T \hat{\beta}$. Although it may seem odd to fit a classification task with least-squares loss, binary classification with linear regression is known to be equivalent to Fisher’s linear discriminant (Mika, 2003) and often works quite well in practice.
The real advantage of the least-squares approach is that it reduces the prediction to a linear function of $y$, and thus randomizing over the labels is straightforward. Specifically, letting
$$\alpha = X (X^T X)^{-1} h(x_{n+1}),$$
the prediction $h(x_{n+1})^T \hat{\beta}$ can be equivalently given by $\alpha^T y$ (this is effectively the kernel representation of the linear classifier). Thus, we can simply compute $\alpha$ one time and then randomly sample many different sets of labels in order to build a standard randomized smoothing bound. Further, we can precompute just the $X (X^T X)^{-1}$ term and reuse it for each test point.
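The equivalence between the least-squares prediction and a fixed weighting of the labels can be checked numerically. The sketch below uses synthetic data and hypothetical names (`kernel_weights`, `h_test`); the optional `lam` argument is an assumption that adds a ridge term to the Gram matrix and defaults to plain least squares.

```python
import numpy as np


def kernel_weights(X, h_test, lam=0.0):
    """Per-label weights alpha such that the prediction equals alpha @ y.

    X      : (n, k) matrix of training features
    h_test : (k,) feature vector of the test point
    lam    : optional ridge coefficient (0 gives ordinary least squares)
    """
    k = X.shape[1]
    gram = X.T @ X + lam * np.eye(k)
    # Solve the k x k system rather than forming an explicit inverse.
    return X @ np.linalg.solve(gram, h_test)


# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.integers(0, 2, size=50).astype(float)
h_test = rng.normal(size=5)

alpha = kernel_weights(X, h_test)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations

# The same alpha serves any relabeling of the training set.
y_flipped = 1.0 - y
beta_flipped = np.linalg.solve(X.T @ X, X.T @ y_flipped)
```

Here `alpha @ y` matches `h_test @ beta_hat`, and `alpha @ y_flipped` matches `h_test @ beta_flipped`, without recomputing `alpha`. This is what makes randomizing over labels cheap: the weights depend only on the features.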
$\ell_2$ regularization for better conditioning
Unfortunately, it is unlikely to be the case that the training points are well-behaved for linear classification in the feature space. To address this, we instead solve an $\ell_2$-regularized version of least-squares. This is a common tool for solving systems with ill-conditioned or random design matrices (Hsu et al., 2014; Suggala et al., 2018). Luckily, there still exists a precomputable closed-form solution to this problem, whereby we instead solve
$$\alpha = X (X^T X + \lambda I)^{-1} h(x_{n+1}).$$
The other parts of our algorithm remain unchanged. Following results in Suggala et al. (2018), we set the regularization parameter $\lambda$ for all our experiments using an estimate $\hat{\sigma}^2$ of the variance (Dicker, 2014) and the condition number $\kappa$.
Efficient tail bounds via the Chernoff inequality
Even more compelling, due to the linear structure of this prediction, we can forego a sampling-based approach entirely and directly bound the tail probabilities using Chernoff bounds. Because the underlying binary prediction function $\phi$ will output the label 1 for the test point whenever $\alpha^T y \ge 1/2$ and 0 otherwise, we can derive an analytical upper bound on the probability that $g$ predicts one label or the other via the Chernoff bound. By upper bounding the probability of the opposite prediction, we simultaneously derive a lower bound on $p$ which can be plugged in to (5) to determine the classifier’s robustness. Concretely, we can upper bound the probability that the classifier outputs the label 0 by
$$P\left(\alpha^T y \le 1/2\right) \le \min_{t > 0} \left\{ e^{t/2} \prod_{i=1}^{n} \mathbb{E}\left[e^{-t \alpha_i y_i}\right] \right\} = \min_{t > 0} \left\{ e^{t/2} \prod_{i : y_i = 1} \left(q + (1 - q) e^{-t \alpha_i}\right) \prod_{i : y_i = 0} \left((1 - q) + q e^{-t \alpha_i}\right) \right\}. \tag{6}$$
Conversely, the probability that the classifier outputs the label 1 is upper bounded by (6) but evaluated at $-t$. Thus, we can solve the minimization problem unconstrained over $t$, and then let the sign of $t$ dictate which label to predict and the value of $t$ determine the bound. The objective (6) is log-convex in $t$ and can be easily solved by Newton’s method. Note that in some cases, neither Chernoff upper bound will be less than $1/2$, meaning we cannot determine the true value of $g$. In these cases, we simply define the classifier’s prediction to be determined by the sign of $t$. While we can’t guarantee that this classification will match the true majority vote, our algorithm will certify a robustness to 0 flips, so the guarantee is still valid. We avoid abstaining so as to assess our classifier’s non-robust accuracy.
The key property we emphasize is that, unlike previous randomized smoothing applications, the final algorithm involves no randomness whatsoever. Instead, the prediction probabilities are bounded directly via Chernoff’s inequality, without any need for Monte Carlo approximation. Thus, the method is able to generate truly certifiable robust predictions using approximately the same complexity as traditional predictions.
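To illustrate the deterministic certificate, the sketch below minimizes the logarithm of the Chernoff objective over $t$. It assumes each training label is flipped independently with probability $q$ and that the base classifier outputs 1 when $\alpha^T y \ge 1/2$. Whereas the text uses Newton's method, this sketch substitutes bisection on the derivative, which is valid because the log-objective is convex. All function names are hypothetical.

```python
import numpy as np


def log_chernoff(t, alpha, y, q):
    """Log-objective: t/2 + sum_i log E[exp(-t * alpha_i * noisy_y_i)]."""
    # Probability that the noisy label is 1: 1 - q if y_i = 1, else q.
    p1 = np.where(y == 1, 1.0 - q, q)
    # log((1 - p1) + p1 * exp(-t * alpha_i)), evaluated stably in log space.
    terms = np.logaddexp(np.log1p(-p1), np.log(p1) - t * alpha)
    return 0.5 * t + terms.sum()


def log_chernoff_grad(t, alpha, y, q):
    """Derivative of log_chernoff with respect to t."""
    p1 = np.where(y == 1, 1.0 - q, q)
    z = np.log(p1) - np.log1p(-p1) - t * alpha
    weight = 0.5 * (1.0 + np.tanh(0.5 * z))  # numerically stable sigmoid(z)
    return 0.5 - np.sum(alpha * weight)


def chernoff_certificate(alpha, y, q, t_max=1e3, iters=200):
    """Return (prediction, lower bound on the predicted label's vote share)."""
    alpha = np.asarray(alpha, dtype=float)
    y = np.asarray(y)
    lo, hi = -t_max, t_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # The gradient is increasing in t, so move toward its zero crossing.
        if log_chernoff_grad(mid, alpha, y, q) < 0.0:
            lo = mid
        else:
            hi = mid
    t_star = 0.5 * (lo + hi)
    # Upper bound on the probability of the opposite label, capped at 1.
    opposite = min(1.0, float(np.exp(log_chernoff(t_star, alpha, y, q))))
    prediction = int(t_star > 0.0)
    return prediction, 1.0 - opposite
```

A returned lower bound at or below 1/2 corresponds to the case where neither Chernoff bound is informative, so no flips would be certified. The search interval `t_max` is an arbitrary choice; any value of $t$ of the appropriate sign still yields a valid, if looser, bound.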
5 Experiments
Following Koh and Liang (2017) and Steinhardt et al. (2017), we perform experiments on MNIST 1/7, the IMDB review sentiment dataset (Maas et al., 2011), and the Dogfish binary classification challenge taken from ImageNet. We run additional experiments on the full MNIST dataset; to the best of our knowledge, this is the first multi-class classification algorithm with certified robustness to label-flipping attacks. For each dataset and each noise level $q$ we report the certified test set accuracy at $r$ training label flips. That is, for each possible number of flips $r$, we plot the fraction of the test set that was both correctly classified and certified to not change under at least $r$ flips.
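The certified accuracy metric described above amounts to a simple count. The sketch below assumes that, for each test point, we already have a correctness flag and a certified number of flips; the function name `certified_accuracy` and the toy inputs in the usage note are hypothetical.

```python
import numpy as np


def certified_accuracy(correct, certified_flips, r_values):
    """Fraction of test points that are correct AND certified for >= r flips.

    correct         : boolean array, True where the prediction is correct
    certified_flips : array of certified flip counts, one per test point
    r_values        : iterable of flip budgets r at which to evaluate
    """
    correct = np.asarray(correct, dtype=bool)
    certified_flips = np.asarray(certified_flips)
    return np.array(
        [np.mean(correct & (certified_flips >= r)) for r in r_values]
    )
```

For four test points with correctness `[True, True, False, True]` and certified flips `[0, 3, 5, 1]`, the certified accuracy at budgets 0, 1, 2, and 4 is 0.75, 0.5, 0.25, and 0.0. The misclassified point never counts, however large its certificate.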
Because the above are binary classification tasks, one could technically achieve a certified accuracy of 50% at $r = \infty$ (or 10% for MNIST) by letting $g$ be constant. A constant classifier would be infinitely robust, but it is not a very meaningful baseline. However, we include the accuracy of such a classifier in our plots (black dotted line) as a reference. We also ran our least-squares classifier on the same features but with $q = 0$ (black dash-dot line); this cannot certify robustness, but it gives a sense of the quality of the features for classification without any label noise.
To properly justify the need for such certified defenses, and to get a sense of the scale of our certifications, we also generated label-flipping attacks against the undefended binary MNIST and Dogfish models. Following previous work, the undefended models were implemented as convolutional neural networks, trained on the clean data, with all but the top layer frozen; this is equivalent to multinomial logistic regression on the learned features. For each test point we recorded how many flips were required to change the network’s prediction. This number serves as an upper bound for the robustness of the network on that test point, but we note that our attacks were quite rudimentary and could almost certainly be improved upon to tighten this upper bound. Appendix C contains the details of our attack implementations. In all our plots, the solid lines represent certified accuracy (except for the undefended classifier, which is an upper bound), while the dashed lines of the same color are the overall non-robust accuracy of each classifier.
Results on MNIST
The MNIST 1/7 dataset (LeCun et al., 1998) consists of just the classes 1 and 7, totalling 13,007 training points and 2,163 test points. We trained a simple convolutional neural network on the other eight MNIST digits to learn a 50-dimensional feature embedding and then calculated Chernoff bounds for $G(\mu, \phi)$ as described in Section 4.1. Figure 1(a) displays the certified accuracy on the test set for varying probabilities $q$. As in prior work on randomized smoothing, the noise parameter $q$ balances a tradeoff; as $q$ increases, the required margin to certify a given number of flips decreases. On the other hand, this results in more noisy training labels, which reduces the margin and therefore results in lower robustness and often lower accuracy. Figure 1(b) depicts the certified accuracy for the full MNIST test set (see Appendix B for derivations of the bounds and optimization algorithm in the multi-class case). In addition to this being a significantly more difficult classification task, our classifier could not rely on features learned from other handwritten digits; instead, we extracted the top 30 components with ICA (Hyvarinen, 1999) independently of the labels. Despite the lack of fine-tuned features, our algorithm still achieves significant certified accuracy under a large number of adversarial label flips. We observed that $\ell_2$ regularization did not make a large difference for the multi-class case, possibly due to the inaccuracy of the residual term in the noise estimate.
Results on Dogfish
The Dogfish dataset contains images from the ImageNet dog and fish synsets, 900 training points and 300 test points from each. We trained a ResNet50 (He et al., 2016) on the standard ImageNet training set but removed all images labeled dog or fish. Our pretrained network therefore learned meaningful image features but had no features specific to either class. We used PCA to reduce the feature space to 5 dimensions before solving to avoid overfitting. Figure 2 displays the results of our poisoning attack along with our certified defense. Under the undefended model, more than 99% of the test points can be successfully attacked with no more than 23 label flips, whereas our model with can certifiably correctly classify 81.3% of the test points under the same attack. It would take more than four times as many flips—more than 5% of the training set—for each test point individually to reduce our model to less than 50% certified accuracy.
Because the “votes” are changed by flipping so few labels, high values of $q$ reduce the models’ predictions to almost pure chance, which means we are unable to achieve the margins necessary to certify a large number of flips. We therefore found that smaller levels of noise achieved higher certified test accuracy. This suggests that the more susceptible the original, non-robust classifier is to label flips, the lower $q$ should be set for the corresponding randomized classifier.
For much smaller values of , slight differences did not decrease the nonrobust accuracy—they did however have a large effect on certified robustness. This indicates that the sign of is relatively stable, but the margin of is much less so. This same pattern was observed with the MNIST and IMDB datasets. We used a highprecision arithmetic library (Johansson and others, 2013) to achieve the necessary lower bounds, but the precision required for nonvacuous bounds grew extremely fast for ; optimizing (6) quickly became too computationally expensive.
Finally, Figure 5 in Appendix D shows the performance of our classifier applied to unsupervised features. Because Dogfish is such a small dataset, deep unsupervised feature learning techniques were not feasible; we instead learned overcomplete features on 16x16 image patches using RICA (Le, 2013). It is worth noting that on datasets of a typical size for deep learning, robustness and accuracy closer to those of the pretrained features are likely achievable with modern unsupervised feature learning algorithms such as Deep Clustering (Caron et al., 2018).
Results on IMDB
Figure 3 plots the result of our randomized smoothing procedure on the IMDB review sentiment dataset. This dataset contains 25,000 training examples and 25,000 test examples, evenly split between “positive” and “negative”. To extract the features we applied the Google News pretrained Word2Vec to all the words in each review and averaged them. This feature embedding is considerably noisier than that of an image dataset, as most of the words in a review are irrelevant to sentiment classification. Indeed, Steinhardt et al. (2017) also found that the IMDB dataset was much more susceptible to adversarial corruption than images when using bag-of-words features. Consistent with this, we found smaller levels of noise resulted in larger certified accuracy. We expect significant improvements could be made with a more refined choice of feature embedding.
6 Conclusion
In this work we presented a certified defense against a strong class of adversarial label-flipping attacks where an adversary can flip labels to cause a misclassification on each test point individually. This contrasts with previous data poisoning settings which have typically only considered an adversary who wishes to degrade the classifier’s accuracy on the test distribution as a whole, and it brings the adversary’s objective more in line with that of backdoor attacks and test-time adversarial perturbations. Leveraging randomized smoothing, a method originally developed for certifying robustness to test-time perturbations, we presented a classifier that can be certified robust to these pointwise train-time attacks. We then offered a tractable algorithm for evaluating this classifier which, despite being rooted in randomization, can be computed with no Monte Carlo sampling whatsoever, resulting in a truly certifiably robust classifier. This results in the first multi-class classification algorithm that is certifiably robust to label-flipping attacks.
There are several avenues for improvements to this line of work, perhaps the most immediate being the method for learning the input features. For example, rather than considering fixed features generated by a pretrained deep network or learned without labels, extensions could leverage neural tangent kernels (Jacot et al., 2018) to allow for efficient learning with perturbed inputs or more flexible representations. The analysis could also be extended to other types of smoothing distributions applied to the training data, such as randomizing over the input features to provide robustness to more general data poisoning attacks. Finally, we hope that our defense to this threat model will inspire the development of more powerful train-time attacks, against which future defenses can be evaluated.
References
 Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Cited by: §2.
 Can machine learning be secure?. In Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, New York, NY, USA, pp. 16–25. Cited by: §1.
 Robust regression via hard thresholding. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, Cambridge, MA, USA, pp. 721–729. Cited by: §2.
 Security evaluation of pattern classifiers under attack. IEEE Transactions on Knowledge and Data Engineering 26 (4), pp. 984–996. Cited by: §2.
 Support vector machines under adversarial label noise. In Proceedings of the Asian Conference on Machine Learning, Vol. 20, South Garden Hotels and Resorts, Taoyuan, Taiwan, pp. 97–112. Cited by: §2, §2.
 Kernel descriptors for visual recognition. In Advances in Neural Information Processing Systems 23, pp. 244–252. Cited by: §4.1.
 Adversarial examples are not easily detected: bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, New York, NY, USA, pp. 3–14. Cited by: §2.
 Deep clustering for unsupervised learning of visual features. In The European Conference on Computer Vision (ECCV), Cited by: §5.
 A robust learning approach for regression models based on distributionally robust optimization. J. Mach. Learn. Res. 19 (1), pp. 517–564. External Links: ISSN 1532-4435 Cited by: §2.
 Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526. Cited by: §1.
 Robust sparse regression under adversarial corruption. In Proceedings of the 30th International Conference on Machine Learning, Vol. 28, Atlanta, Georgia, USA, pp. 774–782. Cited by: §2.
 Certified adversarial robustness via randomized smoothing. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97, Long Beach, California, USA, pp. 1310–1320. Cited by: §1, §2, §3, Example 1.
 Efficient algorithms and lower bounds for robust linear regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, Philadelphia, PA, USA, pp. 2745–2754. Cited by: §2.
 Variance estimation in high-dimensional linear models. Biometrika 101 (2), pp. 269–284. Cited by: §4.1.
 DeCAF: a deep convolutional activation feature for generic visual recognition. In Proceedings of the 31st International Conference on Machine Learning, Vol. 32, Beijing, China, pp. 647–655. Cited by: §1, §4.1.
 A framework for robustness certification of smoothed classifiers using f-divergences. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020, Conference Track Proceedings, Cited by: §3.
 Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410. Cited by: §2.
 Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, Cited by: §2.
 Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.
 Random design analysis of ridge regression. Found. Comput. Math. 14 (3), pp. 569–600. Cited by: §4.1.
 Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks 10 (3), pp. 626–634. Cited by: §5.
 Neural tangent kernel: convergence and generalization in neural networks. In Advances in Neural Information Processing Systems 31, pp. 8571–8580. Cited by: §6.
 Compressed sensing with adversarial sparse noise via l1 regression. arXiv preprint arXiv:1809.08055. Cited by: §2.
 Efficient algorithms for outlier-robust regression. arXiv preprint arXiv:1803.03241. Cited by: §2.
 Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1885–1894. Cited by: Appendix C, Appendix C, §4, §5.
 Stronger data poisoning attacks break data sanitization defenses. arXiv preprint arXiv:1811.00741. Cited by: §2.
 Building high-level features using large scale unsupervised learning. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8595–8598. Cited by: Figure 5, §1, §5.
 Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §5.
 Certified robustness to adversarial examples with differential privacy. arXiv preprint arXiv:1802.03471. Cited by: §2.
 Tight certificates of adversarial robustness for randomly smoothed classifiers. In Advances in Neural Information Processing Systems 31, Cited by: §2, §3, §3, §4, Example 2.
 Certified adversarial robustness with additive gaussian noise. arXiv preprint arXiv:1809.03113. Cited by: §2.
 Robust linear regression against training data poisoning. In AISec 2017 - Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, co-located with CCS 2017. Cited by: §2.
 Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (3), pp. 447–461. Cited by: §2.
 Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, Stroudsburg, PA, USA, pp. 142–150. Cited by: §5.
 On detecting adversarial perturbations. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings, Cited by: §2.
 Kernel fisher discriminants. Cited by: §4.1.
 Mpmath: a Python library for arbitrary-precision floating-point arithmetic (version 0.18). Note: http://mpmath.org/ Cited by: §5.
 Towards poisoning of deep learning algorithms with back-gradient optimization. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, New York, NY, USA, pp. 27–38. Cited by: §2.
 Learning with noisy labels. In Advances in Neural Information Processing Systems 26, pp. 1196–1204. Cited by: §2.
 SoK: security and privacy in machine learning. In 2018 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 399–414. Cited by: §1.
 Making deep neural networks robust to label noise: a loss correction approach. pp. 2233–2241. Cited by: §2.
 Label sanitization against label flipping poisoning attacks. In ECML PKDD 2018 Workshops, Cham, pp. 5–15. Cited by: §2.
 Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485. Cited by: §2.
 ANTIDOTE: understanding and defending against poisoning of anomaly detectors. In Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement, New York, NY, USA, pp. 1–14. Cited by: §2.
 Provably robust deep learning via adversarially trained smoothed classifiers. arXiv preprint arXiv:1906.04584. Cited by: §2.
 Poison frogs! targeted clean-label poisoning attacks on neural networks. In Advances in Neural Information Processing Systems 31, pp. 6103–6113. Cited by: §2.
 Learning with bad training data via iterative trimmed loss minimization. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97, Long Beach, California, USA, pp. 5739–5748. Cited by: §2.
 Certified defenses for data poisoning attacks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, USA, pp. 3520–3532. Cited by: §1, §2, §5, §5.
 Connecting optimization and regularization paths. In Advances in Neural Information Processing Systems 31, pp. 10608–10619. Cited by: §4.1.
 Rethinking the inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Appendix C.
 Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §2.
 On defending against label flipping attacks on malware detection systems. arXiv preprint arXiv:1908.04473. Cited by: §2.
 Spectral signatures in backdoor attacks. In Advances in Neural Information Processing Systems 31, pp. 8000–8010. Cited by: §1.
 Adversarial label flips attack on support vector machines. In Proceedings of the 20th European Conference on Artificial Intelligence, Amsterdam, The Netherlands, pp. 870–875. Cited by: §1, §2, §2.
 Is feature selection secure against training data poisoning?. In Proceedings of the 32nd International Conference on Machine Learning, Vol. 37, Lille, France, pp. 1689–1698. Cited by: §2.
 Generative poisoning attack method against neural networks. arXiv preprint arXiv:1703.01340. Cited by: §2.
 Representer point selection for explaining deep neural networks. In Advances in Neural Information Processing Systems 31, pp. 9291–9301. Cited by: §4.
 How transferable are features in deep neural networks?. In Advances in Neural Information Processing Systems 27, pp. 3320–3328. Cited by: §4.1.
 Transferable clean-label poisoning attacks on deep neural nets. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97, Long Beach, California, USA, pp. 7614–7623. Cited by: §2.
Appendix A Generic Randomized Smoothing Algorithm
Appendix B The multiclass setting
Although the notation and algorithms are slightly more complex, all the methods we have discussed in the main paper can be extended to the multiclass setting. In this case, we consider a class label $y \in \{1, \ldots, K\}$, and we again seek some smoothed prediction such that the classifier’s prediction on a new point will not change with some number $r$ of flips of the labels in the training set.
B.1 Randomized smoothing in the multiclass case
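We here extend our notation to the case of more than two classes. Recall our original definition of $G$,
$$G(\mu, \phi) = \mathbb{E}_{x \sim \mu}[\phi(x)] = \int_{\mathcal{X}} \mu(x)\,\phi(x)\,dx,$$
where $\phi : \mathcal{X} \to \{0, 1\}$. More generally, consider a classifier $\phi : \mathcal{X} \to [K]$, outputting the index of one of $K$ classes. Under this formulation, for a given class $c \in [K]$, we have
$$G(\mu, \phi, c) = \mathbb{E}_{x \sim \mu}[\psi_c(x)] = \int_{\mathcal{X}} \mu(x)\,\psi_c(x)\,dx,$$
where $\psi_c(x) = \mathbb{1}\{\phi(x) = c\}$ is the indicator of whether $\phi(x)$ outputs the class $c$. In this case, the hard threshold $g$ is evaluated by returning the class with the highest probability. That is,
$$g(\mu, \phi) = \arg\max_{c} G(\mu, \phi, c).$$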
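To make the definition concrete, the sketch below estimates the per-class probabilities by sampling and returns the arg max. It is purely illustrative: it assumes the smoothing distribution is the label-flipping noise used elsewhere in this appendix (each training label replaced, with probability `q`, by a uniformly chosen different class), it uses a hypothetical stand-in base classifier `majority_label`, and it relies on Monte Carlo sampling, which the method in this paper avoids.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def flip_labels(labels: Sequence[int], q: float, num_classes: int,
                rng: random.Random) -> List[int]:
    """Flip each label with probability q to a uniformly chosen other class."""
    noisy = []
    for y in labels:
        if rng.random() < q:
            alt = rng.randrange(num_classes - 1)  # index among the K-1 others
            noisy.append(alt if alt < y else alt + 1)
        else:
            noisy.append(y)
    return noisy


def smoothed_predict(labels: Sequence[int],
                     base_classifier: Callable[[List[int]], int],
                     q: float, num_classes: int,
                     num_samples: int = 200,
                     seed: int = 0) -> Tuple[int, List[float]]:
    """Estimate the per-class probabilities by sampling; return the arg max."""
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(num_samples):
        votes[base_classifier(flip_labels(labels, q, num_classes, rng))] += 1
    probs = [votes[c] / num_samples for c in range(num_classes)]
    best = max(range(num_classes), key=lambda c: probs[c])
    return best, probs


def majority_label(noisy_labels: List[int]) -> int:
    """Hypothetical base classifier: predict the most common training label."""
    return Counter(noisy_labels).most_common(1)[0][0]


train_labels = [0] * 30 + [1] * 5 + [2] * 5
pred, probs = smoothed_predict(train_labels, majority_label, q=0.1, num_classes=3)
```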
B.2 Linearization and Chernoff bound approach for the multiclass case
Using the same linearization approach as in the binary case, we can formulate an analogous approach which forgoes the need to actually perform random sampling at all and instead directly bounds the randomized classifier using the Chernoff bound.
Adopting the same notation as in the main text, the equivalent least-squares classifier for the multiclass setting finds some set of weights
$$\hat{\beta} = (X^T X)^{-1} X^T Y,$$
where $Y \in \{0, 1\}^{n \times K}$ is a binary matrix with each row equal to a one-hot encoding of the class label (note that the resulting $\hat{\beta} \in \mathbb{R}^{k \times K}$ is now a matrix, and we let $\hat{\beta}_i$ refer to the $i$th column). At prediction time, the predicted class of some new point $x_{n+1}$ is simply given by the prediction with the highest value, i.e.,
$$\hat{y}_{n+1} = \arg\max_{i} \hat{\beta}_i^T h(x_{n+1}).$$
Alternatively, following the same logic as in the binary case, this same prediction can be written in terms of the variable $\alpha$ as
$$\hat{y}_{n+1} = \arg\max_{i} \alpha^T Y_i,$$
where $Y_i$ denotes the $i$th column of $Y$.
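In our randomized smoothing setting, we again propose to flip the class of any label with probability $q$, selecting an alternative label uniformly at random from the remaining $K - 1$ labels. Assuming that the predicted class label is $i$, we wish to bound the probability that
$$P(\alpha^T Y_i < \alpha^T Y_{i'})$$
for all alternative classes $i' \neq i$. By the Chernoff bound, we have that
$$\log P(\alpha^T Y_i < \alpha^T Y_{i'}) \le \log P\big(\alpha^T (Y_i - Y_{i'}) \le 0\big) \le \min_{t \ge 0} \sum_{j=1}^{n} \log \mathbb{E}\Big[e^{-t \alpha_j (Y_{ji} - Y_{ji'})}\Big].$$
The random variable $Y_{ji} - Y_{ji'}$ takes on three different distributions depending on whether $y_j = i$, $y_j = i'$, or $y_j \neq i$ and $y_j \neq i'$. Specifically, this variable can take on the values $+1$, $0$, and $-1$ with the associated probabilities
$$P(Y_{ji} - Y_{ji'} = +1) = \begin{cases} 1 - q & \text{if } y_j = i, \\ \frac{q}{K-1} & \text{otherwise,} \end{cases} \qquad P(Y_{ji} - Y_{ji'} = -1) = \begin{cases} 1 - q & \text{if } y_j = i', \\ \frac{q}{K-1} & \text{otherwise,} \end{cases}$$
$$P(Y_{ji} - Y_{ji'} = 0) = \begin{cases} \frac{q(K-2)}{K-1} & \text{if } y_j = i \text{ or } y_j = i', \\ 1 - \frac{2q}{K-1} & \text{otherwise.} \end{cases}$$
Combining these cases directly into the Chernoff bound gives
$$\log P(\alpha^T Y_i < \alpha^T Y_{i'}) \le \min_{t \ge 0} \Bigg\{ \sum_{j : y_j = i} \log\Big((1-q)e^{-t\alpha_j} + \tfrac{q}{K-1}e^{t\alpha_j} + \tfrac{q(K-2)}{K-1}\Big) + \sum_{j : y_j = i'} \log\Big(\tfrac{q}{K-1}e^{-t\alpha_j} + (1-q)e^{t\alpha_j} + \tfrac{q(K-2)}{K-1}\Big) + \sum_{j : y_j \neq i,\, y_j \neq i'} \log\Big(\tfrac{q}{K-1}\big(e^{-t\alpha_j} + e^{t\alpha_j}\big) + 1 - \tfrac{2q}{K-1}\Big) \Bigg\}.$$
Again, this problem is convex in $t$, and so can be solved efficiently using Newton’s method. And again, since the reverse case can be computed via the same expression, we can similarly optimize this in an unconstrained fashion. Specifically, we can do this for every pair of classes $i$ and $i'$, and return the $i$ which gives the smallest lower bound for the worst-case choice of $i'$.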
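The bound described above can be evaluated numerically for one pair of classes. The following sketch is an illustration under stated assumptions, not the authors' implementation: the per-label distributions are derived from the flipping rule (keep a label with probability 1 - q, otherwise move to one of the other K - 1 classes uniformly), the one-dimensional minimization uses golden-section search over a hypothetical interval `[0, t_max]` in place of Newton's method, and all function and parameter names are invented for the example.

```python
import math
from typing import Sequence


def log_mgf_sum(t: float, alpha: Sequence[float], labels: Sequence[int],
                i: int, i_alt: int, q: float, num_classes: int) -> float:
    """Sum over j of log E[exp(-t * alpha_j * D_j)], D_j = Y_ji - Y_ji'."""
    off = q / (num_classes - 1)  # probability of landing on one specific other class
    total = 0.0
    for a, y in zip(alpha, labels):
        if y == i:
            p_plus, p_minus = 1.0 - q, off
        elif y == i_alt:
            p_plus, p_minus = off, 1.0 - q
        else:
            p_plus, p_minus = off, off
        p_zero = 1.0 - p_plus - p_minus
        total += math.log(p_plus * math.exp(-t * a)
                          + p_minus * math.exp(t * a) + p_zero)
    return total


def chernoff_log_bound(alpha: Sequence[float], labels: Sequence[int],
                       i: int, i_alt: int, q: float, num_classes: int,
                       t_max: float = 50.0, iters: int = 200) -> float:
    """Upper bound on log P(alpha^T Y_i < alpha^T Y_i') via min over t in [0, t_max]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 0.0, t_max
    for _ in range(iters):
        m1 = hi - inv_phi * (hi - lo)
        m2 = lo + inv_phi * (hi - lo)
        f1 = log_mgf_sum(m1, alpha, labels, i, i_alt, q, num_classes)
        f2 = log_mgf_sum(m2, alpha, labels, i, i_alt, q, num_classes)
        if f1 <= f2:
            hi = m2  # the objective is convex, so the minimum lies in [lo, m2]
        else:
            lo = m1
    t_star = 0.5 * (lo + hi)
    # t = 0 always gives the trivial bound log P <= 0.
    return min(0.0, log_mgf_sum(t_star, alpha, labels, i, i_alt, q, num_classes))


alpha = [0.3, 0.2, 0.25, 0.1, 0.15]  # toy weights on the training labels
labels = [0, 0, 0, 1, 2]
bound = chernoff_log_bound(alpha, labels, i=0, i_alt=1, q=0.1, num_classes=3)
reverse = chernoff_log_bound(alpha, labels, i=1, i_alt=0, q=0.1, num_classes=3)
```

In this toy setting most of the weight sits on class 0, so the bound for the pair (0, 1) is strictly negative, while the reversed pair only attains the trivial bound of zero at t = 0.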
B.3 KL Divergence Bound
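To compute actual certification radii, we will derive the KL divergence bound for the case of $K$ classes. Let $\mu$ and $\mu'$ be defined as in Section 4, except that, as in the previous section, when a label is flipped with probability $q$ it is changed to one of the other $K - 1$ classes uniformly at random. Let $\mu_i$ and $\mu'_i$ refer to the independent measures on each dimension which collectively make up the factorized distributions $\mu$ and $\mu'$ (i.e., $\mu = \prod_{i=1}^{n} \mu_i$). Further, let $Y_1^i$ be the $i$th element of $Y_1$, meaning it is the “original” label which may or may not be flipped when sampling from $\mu$. First noting that the dimensions of the distributions $\mu$ and $\mu'$ are independent, and that only the $r$ dimensions on which the original labels differ contribute, we have
$$\mathrm{KL}(\mu \,\|\, \mu') = \sum_{i=1}^{n} \mathrm{KL}(\mu_i \,\|\, \mu'_i) = r\left[(1-q)\log\frac{(1-q)(K-1)}{q} + \frac{q}{K-1}\log\frac{q}{(1-q)(K-1)}\right] = r\left(1 - \frac{Kq}{K-1}\right)\log\frac{(1-q)(K-1)}{q}.$$
Plugging in the robustness guarantee (3), we have that $g(\mu', \phi) = g(\mu, \phi)$ so long as
$$r \le \frac{\log\big(4p(1-p)\big)}{2\left(1 - \frac{Kq}{K-1}\right)\log\frac{q}{(1-q)(K-1)}}.$$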
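A small numerical sketch of the resulting certificate follows. It rests on two assumptions that are not restated in this appendix: that guarantee (3) has the form KL ≤ -(1/2) log(4p(1 - p)), where p is the smoothed probability of the predicted class, and that each differing training label contributes the per-coordinate KL divergence implied by the uniform flipping rule. The function names are hypothetical.

```python
import math


def kl_per_flip(q: float, num_classes: int) -> float:
    """KL divergence contributed by one training label that differs.

    One distribution puts mass 1-q on class a and q/(K-1) on each other
    class; the other does the same around class b != a. Only the a and b
    terms are non-zero in the KL sum.
    """
    k = num_classes
    return (1.0 - k * q / (k - 1)) * math.log((1.0 - q) * (k - 1) / q)


def certified_flips(p: float, q: float, num_classes: int) -> int:
    """Largest integer r with r * kl_per_flip <= -0.5 * log(4 p (1 - p))."""
    k = num_classes
    if not 0.5 < p < 1.0:
        raise ValueError("p must lie strictly between 0.5 and 1")
    if not 0.0 < q < (k - 1) / k:
        raise ValueError("q must lie strictly between 0 and (K-1)/K")
    budget = -0.5 * math.log(4.0 * p * (1.0 - p))  # assumed form of guarantee (3)
    return int(math.floor(budget / kl_per_flip(q, k)))


radius_binary = certified_flips(p=0.99, q=0.4, num_classes=2)
radius_three = certified_flips(p=0.99, q=0.4, num_classes=3)
```

For K = 2 the per-flip term reduces to (1 - 2q) log((1 - q)/q). At a fixed flip probability q, adding classes makes the noise less uniform relative to the label space, so the certified radius shrinks for the same p.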
Appendix C Description of Label-Flipping Attacks on MNIST 1/7 and Dogfish
Due to the dearth of existing work on label-flipping attacks for deep networks, our attacks on MNIST and Dogfish were quite straightforward; we expect significant improvements could be made to tighten this upper bound.
For Dogfish, we used a pretrained Inception network (Szegedy et al., 2016) to evaluate the influence of each training point with respect to the loss of each test point (Koh and Liang, 2017). As in prior work, we froze all but the top layer of the network for retraining. Once we obtained the most influential points, we flipped the first one and recomputed approximate influence using only the top layer for efficiency. After each flip, we recorded which points were classified differently and maintained for each test point the successful attack which required the fewest flips. When this was finished, we also tried the reverse of each attack to see if any of them could be achieved with even fewer flips.
For MNIST we implemented two similar attacks and kept the best attack for each test point. The first attack simply ordered training labels by their distance from the test point in feature space, as a proxy for influence. We then tried flipping these one at a time until the prediction changed, and we also tried the reverse. The second attack was essentially the same as the Dogfish attack, ordering the training points by influence. To calculate influence we again assumed a frozen feature map; specifically, using the same notation as Koh and Liang (2017), the influence of flipping the label of a training point $z = (x, y)$ to $z^- = (x, 1 - y)$ on the loss at the test point $z_{\text{test}}$ is:
$$\mathcal{I}_{\text{pert,loss}}(z, z^-; z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}}, \hat{\theta})^T H_{\hat{\theta}}^{-1}\big(\nabla_\theta L(z^-, \hat{\theta}) - \nabla_\theta L(z, \hat{\theta})\big).$$
For logistic regression, these values can easily be computed in closed form.
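The first, distance-ordered attack described above can be sketched as follows. This is a simplified illustration rather than the attack used in the experiments: the victim model is a hypothetical k-nearest-neighbour classifier standing in for a retrained network, labels are assumed binary, and the "reverse" ordering mentioned in the text is not included.

```python
from collections import Counter
from typing import List, Optional, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(features: Sequence[Sequence[float]], labels: Sequence[int],
                x: Sequence[float], k: int = 3) -> int:
    """Stand-in victim model: majority vote among the k nearest points."""
    order = sorted(range(len(features)), key=lambda j: sq_dist(features[j], x))
    votes = Counter(labels[j] for j in order[:k])
    return max(sorted(votes), key=lambda c: votes[c])  # ties go to the smaller label


def greedy_flip_attack(features: Sequence[Sequence[float]],
                       labels: Sequence[int],
                       x: Sequence[float], k: int = 3) -> Optional[int]:
    """Flip binary labels nearest-first until the prediction on x changes.

    Returns the number of flips used, or None if no prefix of flips succeeds.
    """
    original = knn_predict(features, labels, x, k)
    flipped: List[int] = list(labels)
    order = sorted(range(len(features)), key=lambda j: sq_dist(features[j], x))
    for n_flips, j in enumerate(order, start=1):
        flipped[j] = 1 - flipped[j]
        if knn_predict(features, flipped, x, k) != original:
            return n_flips
    return None


train_x = [[0.0], [0.1], [0.2], [5.0], [5.1]]
train_y = [1, 1, 1, 0, 0]
flips_needed = greedy_flip_attack(train_x, train_y, x=[0.04], k=3)
```

The number of flips returned by such an attack is an empirical upper bound on the true robustness of a test point, which is how the attack results are used for comparison against certified radii.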