Certified Robustness to Label-Flipping Attacks via Randomized Smoothing

02/07/2020 ∙ by Elan Rosenfeld, et al.

Machine learning algorithms are known to be susceptible to data poisoning attacks, where an adversary manipulates the training data to degrade performance of the resulting classifier. While many heuristic defenses have been proposed, few defenses exist which are certified against worst-case corruption of the training data. In this work, we propose a strategy to build linear classifiers that are certifiably robust against a strong variant of label-flipping, where each test example is targeted independently. In other words, for each test point, our classifier makes a prediction and includes a certification that its prediction would be the same had some number of training labels been changed adversarially. Our approach leverages randomized smoothing, a technique that has previously been used to guarantee—with high probability—test-time robustness to adversarial manipulation of the input to a classifier. We derive a variant which provides a deterministic, analytical bound, sidestepping the probabilistic certificates that traditionally result from the sampling subprocedure. Further, we obtain these certified bounds with no additional runtime cost over standard classification. We generalize our results to the multi-class case, providing what we believe to be the first multi-class classification algorithm that is certifiably robust to label-flipping attacks.




1 Introduction

Modern classifiers, despite their widespread empirical success, are known to be susceptible to adversarial attacks. In this paper, we are specifically concerned with so-called “data-poisoning” attacks (formally, causative attacks [Barreno et al. 2006; Papernot et al. 2018]), where the attacker manipulates some aspects of the training data in order to cause the learning algorithm to output a faulty classifier. Automated machine-learning systems which rely on large, user-generated datasets—e.g. email spam filters, product recommendation engines, and fake review detectors—are particularly susceptible to such attacks. For example, by maliciously flagging legitimate emails as spam and mislabeling spam as innocuous, an adversary can trick a spam filter into mistakenly letting through a particular email.

Types of data poisoning attacks which have been studied include label-flipping attacks (Xiao et al., 2012), where the labels of a training set can be adversarially manipulated to decrease performance of the trained classifier; general data poisoning, where both the training inputs and labels can be manipulated (Steinhardt et al., 2017); and backdoor attacks (Chen et al., 2017; Tran et al., 2018), where the training set is corrupted so as to cause the classifier to deviate from its expected behavior only when triggered by a specific pattern. However, unlike the alternative “test-time” adversarial setting, where reasonably effective provable defenses exist to build adversarially robust classifiers, comparatively little work has been done on building classifiers that are certifiably robust to targeted data poisoning attacks.

In this work, we propose a strategy for building classifiers that are certifiably robust against label-flipping attacks. In particular, we propose a pointwise certified defense—by this we mean that with each prediction, the classifier includes a certification guaranteeing that its prediction would not be different had it been trained on data with some number of labels flipped. Prior works on certified defenses make statistical guarantees over the entire test set, but they make no guarantees as to the robustness of a prediction on any particular test point. Thus, with these algorithms, a determined adversary could still cause a specific test point to be misclassified. We therefore consider the threat of a worst-case adversary that can make a training set perturbation to target each test point individually. This motivates a defense that can certify each of its individual predictions, as we present here. To the best of our knowledge, this work represents the first pointwise certified defense to data poisoning attacks.

Our approach leverages randomized smoothing (Cohen et al., 2019), a technique that has previously been used to guarantee test-time robustness to adversarial manipulation of the input to a deep network. However, where prior uses of randomized smoothing randomize over the input to the classifier for test-time guarantees, we instead randomize over the entire training procedure of the classifier. Specifically, by randomizing over the labels during this training process, we obtain an overall classification pipeline that is certified to be robust (i.e., to not change its prediction) when some number of labels are adversarially manipulated in the training set. Whereas previous applications of randomized smoothing perform Monte Carlo sampling to provide probabilistic bounds (due to the intractability of integrating the decision regions of a deep network), we derive an analytical bound that provides truly guaranteed robustness. Although a naive implementation of this approach would not be computationally feasible, we show that by using a linear least-squares classifier we can obtain these certified bounds with no additional runtime cost over standard classification.

A further distinction of our approach is that the applicability of our robustness guarantees does not rely on stringent model assumptions or the quality of the features. Existing work on robust linear classification or regression provides certificates that hold only under specific model assumptions, e.g., recovering the best-fit linear coefficients, which is most useful when the data exhibit a linear relationship in the feature space. In contrast, our classifier makes no assumptions about the separability of the data or the quality of the features; this means our certificates remain valid when applying our classifier to arbitrary features, which in practice allows us to leverage advances in unsupervised feature learning (Le, 2013) and transfer learning (Donahue et al., 2014). As an example, we apply our classifier to pre-trained and unsupervised deep features to demonstrate its feasibility for classification of highly non-linear data such as ImageNet.

We evaluate our proposed classifier on several benchmark datasets common to the data poisoning literature. Specifically, we demonstrate that our randomized classifier is able to achieve 75.7% certified accuracy on MNIST 1/7 even when the number of allowed label flips would drive a standard, undefended classifier to an accuracy of less than 50%. Similar results in experiments on the Dogfish binary classification challenge from ImageNet validate our technique for more challenging datasets—our binary classifier maintains 81.3% certified accuracy in the face of an adversary who could reduce an undefended classifier to less than 1%. We further experiment on the full MNIST dataset to demonstrate our algorithm’s effectiveness for multi-class classification. Moreover, our classifier maintains a reasonably competitive non-robust accuracy (e.g., 94.5% on MNIST 1/7 compared to 99.1% for the undefended classifier).

2 Related Work

Data-poisoning attacks.

A data-poisoning attack (Muñoz-González et al., 2017; Yang et al., 2017) is an attack where an adversary corrupts some portion of the training set or adds new inputs, with the goal of degrading the performance of the learned model. The attack can be targeted to cause poor performance on a specific test example or can simply reduce the overall test performance. The adversary is assumed to have perfect knowledge of the learning algorithm, so security by design—as opposed to obscurity—is the only viable defense against such attacks. The adversary is also typically assumed to have access to the training set and, in some cases, the test set.

Previous work has investigated attacks and defenses for data-poisoning attacks applied to feature selection (Xiao et al., 2015), SVMs (Biggio et al., 2011; Xiao et al., 2012), linear regression (Liu et al., 2017), and PCA (Rubinstein et al., 2009), to name a few. Some attacks can even achieve success with “clean-label” attacks, inserting adversarially perturbed, seemingly correctly labeled training examples that cause the classifier to perform poorly (Shafahi et al., 2018; Zhu et al., 2019). Interestingly, our defense can also be viewed as (the first) certified defense to such attacks: perturbing an image such that the model’s learned features no longer match the label is theoretically equivalent to changing the label such that it no longer matches the image. For an overview of data poisoning attacks and defenses in machine learning, see Biggio et al. (2014).

Label-flipping attacks.

A label-flipping attack is a specific type of data-poisoning attack where the adversary is restricted to changing the training labels. The classifier is then trained on the corrupted training set, with no knowledge of which labels have been tampered with. For example, an adversary could mislabel spam emails as innocuous, or flag real product reviews as fake.

Unlike random label noise, for which many robust learning algorithms have been successfully developed (Natarajan et al., 2013; Liu and Tao, 2016; Patrini et al., 2017), adversarial label-flipping attacks can be specifically targeted to exploit the structure of the learning algorithm, significantly degrading performance. Robustness to such attacks is therefore harder to achieve, both theoretically and empirically (Xiao et al., 2012; Biggio et al., 2011). A common defense technique is sanitization, whereby a defender attempts to identify and remove or relabel training points that may have had their labels corrupted (Paudice et al., 2019; Taheri et al., 2019). Unfortunately, recent work has demonstrated that this is often not enough against a sufficiently powerful adversary (Koh et al., 2018). Further, no existing defenses provide pointwise guarantees regarding their robustness.

Certified defenses.

Existing works on certified defenses to adversarial data poisoning attacks typically focus on the regression case and provide broad statistical guarantees over the entire test distribution. A common approach to such certifications is to show that a particular algorithm recovers some close approximation to the best linear fit coefficients (Diakonikolas et al., 2019; Prasad et al., 2018; Shen and Sanghavi, 2019), or that the expected loss on the test distribution is bounded (Klivans et al., 2018; Chen and Paschalidis, 2018). These results generally rely on assumptions about the data distribution: some assume sparsity in the coefficients (Karmalkar and Price, 2018; Chen et al., 2013) or in the corruption vector (Bhatia et al., 2015); others require limited effects of outliers (Steinhardt et al., 2017). As mentioned above, all of these methods fail to provide guarantees for individual test points. Additionally, most of these statistical guarantees are not as meaningful when their model assumptions do not hold.

Randomized smoothing.

Since the discovery of adversarial examples (Szegedy et al., 2013; Goodfellow et al., 2015), the research community has been investigating techniques for increasing the adversarial robustness of complex models such as deep networks. After a series of heuristic defenses (Metzen et al., 2017; Feinman et al., 2017), followed by attacks breaking them (Athalye et al., 2018; Carlini and Wagner, 2017), focus began to shift towards the development of provable robustness.

One approach which has gained popularity in recent work is randomized smoothing. Rather than certifying the original classifier f, randomized smoothing defines a new classifier g whose prediction at an input x is the class assigned the most probability when x is perturbed with noise drawn from some distribution μ and passed through f. That is, g(x) = argmax_c P_{δ∼μ}[f(x + δ) = c]. This new classifier g is then certified as robust, ideally without sacrificing too much accuracy compared to f. The original formulation was presented by Lecuyer et al. (2018) and borrowed ideas from differential privacy. The above definition is due to Li et al. (2018) and was popularized by Cohen et al. (2019), who derived a tight robustness guarantee. Follow-up work has focused on optimizing the training procedure of f (Salman et al., 2019) and extending the analysis to other types of distributions (Lee et al., 2019). For more details, see Cohen et al. (2019).

3 A General View of Randomized Smoothing

We begin by presenting a general viewpoint of randomized smoothing. Under our notation, randomized smoothing constructs an operator G that maps a binary-valued function f: 𝒳 → {0, 1} (for simplicity, we present the methodology here in terms of binary-valued functions, which correspond to binary classification problems; the extension to the multiclass setting requires additional notation and is deferred to the appendix) and a smoothing measure μ over 𝒳, with μ(𝒳) = 1, to the expected value of f under μ (that is, G represents the “vote” of f weighted by μ). For example, f could be a binary image classifier and μ could be some small, random pixel noise applied to the to-be-classified image. We also define a “hard threshold” version g that returns the most probable output (the majority vote winner). Formally,

G(f; μ) = E_{X∼μ}[f(X)],    g(f; μ) = 1{G(f; μ) ≥ 1/2},

where 1{·} is the indicator function. Where it is clear from context, we will omit the arguments, writing simply G or g. Intuitively, for two similar measures μ and ν, we would expect that for most f, even though G(f; μ) and G(f; ν) may not be equal, the threshold function should satisfy g(f; μ) = g(f; ν). Further, the degree to which μ and ν can differ while still preserving this property should increase as G(f; μ) approaches either 0 or 1, because this increases the “margin” with which the function is 0 or 1 respectively over the measure μ. More formally, we define a general randomized smoothing guarantee as follows:
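As a concrete illustration, both operators can be computed exactly when the smoothing measure is finite; below is a minimal sketch (the dictionary representation of the measure and the function names are our own, not from the paper):

```python
from typing import Callable, Dict, Hashable

def soft_vote(f: Callable[[Hashable], int], mu: Dict[Hashable, float]) -> float:
    """Soft smoothing operator G(f; mu): the mu-weighted 'vote' E_{X~mu}[f(X)]."""
    return sum(prob * f(x) for x, prob in mu.items())

def hard_vote(f: Callable[[Hashable], int], mu: Dict[Hashable, float]) -> int:
    """Hard-threshold version g(f; mu): the majority-vote winner."""
    return int(soft_vote(f, mu) >= 0.5)

# f flags inputs greater than 0; mu puts 0.8 of its mass above 0.
f = lambda x: int(x > 0)
mu = {-1: 0.2, 1: 0.5, 2: 0.3}
assert abs(soft_vote(f, mu) - 0.8) < 1e-12
assert hard_vote(f, mu) == 1
```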

Definition 1.

Let μ and ν be smoothing measures over 𝒳, with μ(𝒳) = ν(𝒳) = 1. Then a randomized smoothing robustness guarantee is a specification of a distance measure D(μ, ν) and a function ε such that for all f,

D(μ, ν) ≤ ε(G(f; μ))  ⟹  g(f; ν) = g(f; μ).    (1)


For brevity, we will sometimes use p in place of max(G, 1 − G), representing the fraction of the vote that the majority class receives (this is analogous to p_A in Cohen et al. (2019)).

Instantiations of randomized smoothing

This definition is rather abstract, so we highlight concrete examples of how it can be applied to achieve certified guarantees against adversarial attacks.

Example 1.

The randomized smoothing guarantee of Cohen et al. (2019) uses the smoothing measures μ = N(x, σ²I), a Gaussian around the point x to be classified, and ν = N(x + δ, σ²I), the same measure perturbed by δ. They prove that (1) holds for all classifiers f if we define

D(μ, ν) = KL(μ ‖ ν),    ε(p) = Φ⁻¹(p)² / 2,

where KL denotes the KL divergence and Φ⁻¹ denotes the inverse CDF of the standard Gaussian distribution.

Although this work focused on randomized smoothing of continuous data via Gaussian noise, this is by no means a requirement of the approach. For instance, Lee et al. (2019) consider an alternative approach for dealing with discrete variables.

Example 2.

The randomized smoothing guarantee of Lee et al. (2019) uses a factorized smoothing measure μ in d dimensions, defined with respect to parameters q ∈ [0, 1] and K ∈ ℕ and a base input x ∈ {1, …, K}^d, where

μ(z) = ∏_{i=1}^{d} ( (1 − q) · 1{z_i = x_i} + (q / (K − 1)) · 1{z_i ≠ x_i} ),

with d being the dimension of x; ν is similarly defined for a perturbed input x′. They guarantee that (1) holds if we define

D(μ, ν) = ‖x − x′‖₀,    ε(p) = r̄(p; q, K, d).    (2)

In words, the smoothing distribution is such that each dimension is independently perturbed to one of the K − 1 other values uniformly at random with probability q. r̄ is a combinatorial function defined as the maximum number of dimensions—out of d total—by which x and x′ can differ such that a set with measure p under μ is guaranteed to have measure at least 1/2 under ν. Lee et al. (2019) prove that this value is independent of x and x′.
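A sketch of drawing one sample from a smoothing measure of this form (assuming coordinate values in {0, …, K−1}; the function name is our own):

```python
import random

def sample_discrete_smoothing(x, K, q, rng=random):
    """Draw one sample from the factorized measure described above: each
    coordinate of x (values in {0, ..., K-1}) is independently perturbed,
    with probability q, to one of the K-1 other values uniformly at random."""
    z = []
    for xi in x:
        if rng.random() < q:
            z.append(rng.choice([v for v in range(K) if v != xi]))
        else:
            z.append(xi)
    return z

random.seed(0)
x = [0, 1, 2, 3]
z = sample_discrete_smoothing(x, K=4, q=0.5)
assert len(z) == len(x) and all(0 <= v < 4 for v in z)
```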

Finally, Dvijotham et al. (2020) consider a more general form of randomized smoothing that doesn’t require strict assumptions on the distributions but is still able to provide similar guarantees.

Example 3 (Generic bound).

Given any two smoothing distributions μ and ν, we have the generic randomized smoothing robustness certificate, ensuring that (1) holds with definitions

D(μ, ν) = KL(μ ‖ ν),    ε(p) = 2 (p − 1/2)².    (3)
Randomized smoothing in practice

For deep classifiers, the expectation G(f; μ) cannot be computed exactly, and so we must resort to Monte Carlo approximation. In this “standard” form of randomized smoothing, we draw multiple random samples from μ and use these to construct a high-probability bound on G(f; μ) for certification. More precisely, this bound should be a lower bound on G when the hard prediction g = 1 and an upper bound otherwise; this ensures in both cases that we under-certify the true robustness of the classifier. The procedure is shown in Algorithm 2 in Appendix A. These estimates can then be plugged into a randomized smoothing robustness guarantee to provide a high-probability certified robustness bound for the classifier.
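The paper's exact procedure is given in its Algorithm 2; purely as an illustration of the general recipe, here is a sketch that uses a simple Hoeffding-style confidence interval in place of the paper's bound (all names are our own):

```python
import math
import random

def estimate_vote_interval(f, sample_mu, n=1000, delta=0.001, rng=None):
    """Monte Carlo estimate of the smoothed vote G(f; mu), with a two-sided
    Hoeffding bound: with probability at least 1 - 2*delta, the true vote
    lies in the returned interval [lo, hi]."""
    rng = rng or random.Random()
    votes = sum(f(sample_mu(rng)) for _ in range(n))
    p_hat = votes / n
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return p_hat - slack, p_hat + slack

rng = random.Random(0)
f = lambda x: int(x > 0.0)               # base binary classifier
sample_mu = lambda r: r.gauss(1.0, 1.0)  # Gaussian smoothing around x = 1
lo, hi = estimate_vote_interval(f, sample_mu, n=2000, rng=rng)
assert 0.0 < lo < hi < 1.0               # true vote is Phi(1), about 0.84
```

The conservative side of the interval (lower bound when the hard prediction is 1, upper bound otherwise) is what gets plugged into the robustness guarantee.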

4 Label-Flipping Robustness

We now present the main contribution of this paper, a technique for using randomized smoothing to provide certified robustness against label-flipping attacks. Specifically, we first propose a generic strategy for applying randomized smoothing to certify a prediction function against pointwise label-flipping attacks. We show how this general approach can be made tractable using linear least-squares classification, and we use the Chernoff inequality to analytically bound the relevant probabilities for the randomized smoothing procedure. Notably, although we are employing a randomized approach, the final algorithm does not use any random sampling, but rather relies upon a convex optimization problem to compute the certified robustness.

To motivate the approach, we note that in prior work, randomized smoothing was applied at test time with the function f being a (potentially deep) classifier that we wish to smooth. However, there is no requirement that the function be a classifier at all; the theory holds for any binary-valued function. Instead of treating f as a trained classifier, we consider f to be an arbitrary learning algorithm which takes as input a training dataset and additional test points without corresponding labels, which we aim to predict. (Our algorithm does not actually require access to the test data to do the necessary precomputation; we present it this way merely to give an intuitive idea of the procedure.) In other words, the combined goal of f is to first train a classifier and then predict the label of the new example. Thus, we consider test-time outputs to be a function of both the test-time input and the training data that produced the classifier. This perspective allows us to reason about how changes to training data affect the classifier at test time, reminiscent of work on influence functions of deep neural networks (Koh and Liang, 2017; Yeh et al., 2018). When applying randomized smoothing in this setting, we randomize over the labels in the training set, rather than over the test-time input to be classified. Analogous to previous applications of randomized smoothing, if the majority vote of the classifiers trained with these randomly sampled labels has a large margin, it will confer a degree of robustness to some number of adversarially corrupted labels.

To formalize this intuition, consider two different assignments of training labels y, y′ ∈ {0, 1}ⁿ which differ on precisely r labels. Let ρ (resp. ρ′) be the distribution resulting from independently flipping each of the labels in y (resp. y′) with probability q. It is clear that as r increases, KL(ρ ‖ ρ′) should also increase. In fact, it is simple to show (see Appendix B.3 for the derivation) that the exact KL divergence between these two distributions is

KL(ρ ‖ ρ′) = r (1 − 2q) log((1 − q)/q).    (4)

Plugging in the robustness guarantee (3), we have that the prediction is certified so long as

r ≤ 2 (p − 1/2)² / ((1 − 2q) log((1 − q)/q)),    (5)

where p = max(G, 1 − G). This implies that for any test point, as long as (5) is satisfied, g’s prediction (the majority vote weighted by the smoothing distribution) will not change if an adversary corrupts the training set from y to y′, or indeed to any other training set that differs on at most r labels. We can tune the noise hyperparameter q to achieve the largest possible upper bound in (5); more noise will likely decrease the margin p of the majority vote, but it will also decrease the divergence.
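Combining the closed-form KL with the generic KL-based certificate gives a one-line computation of the largest certifiable number of flips; a sketch under the stated formulas (the function name is our own):

```python
import math

def certified_flips(p: float, q: float) -> int:
    """Largest number r of adversarial label flips certifiable from the
    majority-vote fraction p (> 1/2) under label-flip noise q in (0, 1/2):
    the KL divergence grows as r * (1 - 2q) * log((1 - q) / q), and the
    generic certificate tolerates any KL up to 2 * (p - 1/2) ** 2."""
    if p <= 0.5:
        return 0
    kl_per_flip = (1.0 - 2.0 * q) * math.log((1.0 - q) / q)
    return int(2.0 * (p - 0.5) ** 2 / kl_per_flip)

# Noisier labels make each flip matter less, at the cost of a smaller margin p.
assert certified_flips(0.9, 0.4) > certified_flips(0.9, 0.1)
assert certified_flips(0.5, 0.3) == 0
```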

Computing a tight bound

This approach has a simple closed form, but the bound is not tight. We can derive a tight bound via a combinatorial approach as in Lee et al. (2019). By precomputing the quantities from Equation (2) for each candidate number of flips, we can simply compare p to each of these and thereby certify robustness to the highest possible number of label flips. This computation can be expensive, but it provides a significantly tighter robustness guarantee, certifying approximately twice as many label flips for a given bound on p (see Figure 6 in Appendix D).

4.1 Efficient implementation via least squares classifiers

There may appear to be one major impracticality of the algorithm proposed in the previous section, if considered naively: treating the function f as an entire training-plus-single-prediction process would require that we train multiple classifiers, over multiple random draws of the labels, all to make a prediction on a single example. In this section, we describe a sequence of tools we employ to restrict the architecture and training process in a manner that drastically reduces this cost, bringing it in line with the traditional cost of classifying a single example. The full procedure, with all the parts described below, can be found in Algorithm 1.

  Input: feature mapping φ; noise parameter q; regularization parameter λ; training set (x₁, y₁), …, (xₙ, yₙ) (with potentially adversarial labels); additional inputs x′₁, …, x′ₘ to predict.
  1. Pre-compute matrix M = X(XᵀX + λI)⁻¹, where X = [φ(x₁) ⋯ φ(xₙ)]ᵀ.
  for j = 1, …, m do
     1. Compute vector α = M φ(x′ⱼ).
     2. Compute the optimal Chernoff parameter t* via Newton’s method and let p = 1 − c(t*), where c(t) is the Chernoff bound (6) evaluated at t.
     Output: Prediction 1{αᵀy ≥ 1/2} and certification that the prediction will remain constant for up to r training label flips, where r = ⌊2 (p − 1/2)² / ((1 − 2q) log((1 − q)/q))⌋.
  end for
Algorithm 1 Randomized smoothing for label-flipping robustness

Linear least-squares classification

The fundamental simplifying assumption we make in this work is to restrict the “training” process of the classifier to the solution of a least-squares problem. Given the training set (x₁, y₁), …, (xₙ, yₙ), we assume that there exists some feature mapping φ: 𝒳 → ℝᵏ (with k < n). If existing linear features are not available, this could instead consist of deep features learned from a similar task—the transferability of such features is well documented (Donahue et al., 2014; Bo et al., 2010; Yosinski et al., 2014). As another option, features could be learned in an unsupervised fashion on the inputs (that is, independent of the training labels, which are potentially poisoned). Given this feature mapping, let X ∈ ℝⁿˣᵏ be the matrix of training point features φ(xᵢ) and let y ∈ {0, 1}ⁿ be the labels. Our training process consists of finding the least-squares fit to the training data, i.e., we find parameters θ = (XᵀX)⁻¹Xᵀy via the normal equations, and then we make a prediction on a new example x via the linear function ŷ = φ(x)ᵀθ. Although it may seem odd to fit a classification task with a least-squares loss, binary classification with linear regression is known to be equivalent to Fisher’s linear discriminant (Mika, 2003) and often works quite well in practice.

The real advantage of the least-squares approach is that it reduces the prediction to a linear function of the labels y, and thus randomizing over the labels is straightforward. Specifically, letting

α = X(XᵀX)⁻¹φ(x),

the prediction can be equivalently given by ŷ = αᵀy (this is effectively the kernel representation of the linear classifier). Thus, we can compute α one time and then randomly sample many different sets of labels in order to build a standard randomized smoothing bound. Further, we can pre-compute just the term X(XᵀX)⁻¹ and reuse it for each test point.
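A small numpy sketch of this identity (sizes and names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 5
X = rng.normal(size=(n, k))          # rows are training features phi(x_i)
y = rng.integers(0, 2, size=n).astype(float)
phi_test = rng.normal(size=k)        # features of the test point

# Precompute once per training set, reuse for every test point.
M = X @ np.linalg.inv(X.T @ X)       # n x k
alpha = M @ phi_test                 # prediction is linear in the labels y

# Equivalent to solving the normal equations and predicting phi(x)^T theta.
theta = np.linalg.solve(X.T @ X, X.T @ y)
assert np.isclose(alpha @ y, phi_test @ theta)
```

Because α does not depend on y, resampling the labels costs only one inner product per sample.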

ℓ₂ regularization for better conditioning

Unfortunately, it is unlikely that the training points are well-behaved for linear classification in the feature space. To address this, we instead solve an ℓ₂-regularized version of least-squares. This is a common tool for solving systems with ill-conditioned or random design matrices (Hsu et al., 2014; Suggala et al., 2018). Luckily, there still exists a pre-computable closed-form solution to this problem, whereby we instead solve

θ = (XᵀX + λI)⁻¹Xᵀy.

The other parts of our algorithm remain unchanged. Following results in Suggala et al. (2018), we set the regularization parameter λ for all our experiments based on σ̂², an estimate of the noise variance (Dicker, 2014), and κ, the condition number.
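The kernel-style precomputation carries over unchanged to the ridge variant; a sketch with an arbitrary illustrative λ (names are our own):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, lam = 50, 5, 0.1               # lam: hypothetical regularization value
X = rng.normal(size=(n, k))
y = rng.integers(0, 2, size=n).astype(float)
phi_test = rng.normal(size=k)

# Ridge analogue of the precomputation: only the inverted matrix changes.
M = X @ np.linalg.inv(X.T @ X + lam * np.eye(k))
alpha = M @ phi_test                 # prediction is still linear in y

theta = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)
assert np.isclose(alpha @ y, phi_test @ theta)
```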

Efficient tail bounds via the Chernoff inequality

Even more compelling, due to the linear structure of this prediction, we can forgo a sampling-based approach entirely and directly bound the tail probabilities using Chernoff bounds. Because the underlying binary prediction function f will output the label 1 for the test point whenever αᵀỹ ≥ 1/2 and the label 0 otherwise, we can derive an analytical upper bound on the probability that f predicts one label or the other via the Chernoff bound. By upper bounding the probability of the opposite prediction, we simultaneously derive a lower bound on p which can be plugged in to (5) to determine the classifier’s robustness. Concretely, letting ỹ denote the randomly flipped labels, we can upper bound the probability that the classifier outputs the label 0 by

P(αᵀỹ ≤ 1/2) ≤ min_{t ≥ 0} e^{t/2} ∏_{i=1}^{n} ( (1 − q) e^{−t αᵢ yᵢ} + q e^{−t αᵢ (1 − yᵢ)} ).    (6)

Conversely, the probability that the classifier outputs the label 1 is upper bounded by (6) evaluated at −t. Thus, we can solve the minimization problem unconstrained over t, and then let the sign of t* dictate which label to predict and the value of the bound at t* determine the certificate. The objective (6) is log-convex in t and can be easily solved by Newton’s method. Note that in some cases, neither Chernoff upper bound will be less than 1/2, meaning we cannot determine the true value of g. In these cases, we simply define the classifier’s prediction to be determined by the sign of αᵀy − 1/2. While we cannot guarantee that this classification will match the true majority vote, our algorithm will certify robustness to 0 flips, so the guarantee is still valid. We avoid abstaining so as to assess our classifier’s non-robust accuracy.
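A sketch of this kind of Chernoff bound and its one-dimensional minimization, using a simple ternary search in place of Newton's method (all names are our own; labels are 0/1 and each is flipped independently with probability q):

```python
import math

def chernoff_log_bound(alpha, y, q, t):
    """log of a Chernoff upper bound on P(alpha . y_tilde <= 1/2), where each
    0/1 label y_i is independently flipped with probability q. Valid as a
    bound for t >= 0; evaluated at negative t it bounds the opposite tail."""
    total = 0.5 * t
    for a, yi in zip(alpha, y):
        total += math.log((1 - q) * math.exp(-t * a * yi)
                          + q * math.exp(-t * a * (1 - yi)))
    return total

def minimize_ternary(fn, lo=-50.0, hi=50.0, iters=200):
    """Simple ternary search; the objective above is convex in t."""
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if fn(m1) < fn(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

y = [1] * 40 + [0] * 10
alpha = [1 / len(y)] * len(y)        # alpha . y = 0.8, well above 1/2
t_star = minimize_ternary(lambda t: chernoff_log_bound(alpha, y, 0.1, t))
bound = math.exp(chernoff_log_bound(alpha, y, 0.1, t_star))
assert t_star > 0 and bound < 0.5    # the vote for label 1 exceeds 1/2
```

Here the optimal t is positive, so the classifier predicts label 1, and 1 − bound lower-bounds the majority-vote fraction p.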

The key property we emphasize is that, unlike previous randomized smoothing applications, the final algorithm involves no randomness whatsoever. Instead, the prediction probabilities are bounded directly via Chernoff’s inequality, without any need for Monte Carlo approximation. Thus, the method is able to generate truly certifiable robust predictions using approximately the same complexity as traditional predictions.

5 Experiments

Following Koh and Liang (2017) and Steinhardt et al. (2017), we perform experiments on MNIST 1/7, the IMDB review sentiment dataset (Maas et al., 2011), and the Dogfish binary classification challenge taken from ImageNet. We run additional experiments on the full MNIST dataset; to the best of our knowledge, ours is the first multi-class classification algorithm with certified robustness to label-flipping attacks. For each dataset and each noise level q, we report the certified test set accuracy at r training label flips. That is, for each possible number of flips r, we plot the fraction of the test set that was both correctly classified and certified to not change under at least r flips.

Because the above are binary classification tasks, one could technically achieve a certified accuracy of 50% at every radius (or 10% for the full MNIST) by letting g be constant. A constant classifier would be infinitely robust, but it is not a very meaningful baseline. However, we include the accuracy of such a classifier in our plots (black dotted line) as a reference. We also ran our least-squares classifier on the same features but with q = 0 (black dash-dot line); this cannot certify robustness, but it gives a sense of the quality of the features for classification without any label noise.

(a) Binary MNIST (classes 1 and 7)
(b) Full MNIST
Figure 1: MNIST 1/7 (top) and full MNIST (bottom) test set certified accuracy to adversarial label flips as q is varied. The bottom axis represents the number of adversarial label flips to which each individual prediction is robust, while the top axis is the same value expressed as a percentage of the training set size. The solid lines represent certified accuracy; dashed lines of the same color are the overall non-robust accuracy of each classifier. The black dotted line is the (infinitely robust) performance of a constant classifier, while the black dash-dot line is the (uncertified) performance of our classifier with no label noise.

To properly justify the need for such certified defenses, and to get a sense of the scale of our certifications, we also generated label-flipping attacks against undefended binary MNIST and Dogfish models. Following previous work, the undefended models were implemented as convolutional neural networks, trained on the clean data, with all but the top layer frozen—this is equivalent to multinomial logistic regression on the learned features. For each test point we recorded how many flips were required to change the network’s prediction. This number serves as an upper bound for the robustness of the network on that test point, but we note that our attacks were quite rudimentary and could almost certainly be improved upon to tighten this upper bound. Appendix C contains the details of our attack implementations. In all our plots, the solid lines represent certified accuracy (except for the undefended classifier, whose curve is an upper bound), while the dashed lines of the same color are the overall non-robust accuracy of each classifier.

Results on MNIST

The MNIST 1/7 dataset (LeCun et al., 1998) consists of just the classes 1 and 7, totalling 13,007 training points and 2,163 test points. We trained a simple convolutional neural network on the other eight MNIST digits to learn a 50-dimensional feature embedding and then calculated Chernoff bounds as described in Section 4.1. Figure 1(a) displays the certified accuracy on the test set for varying flip probabilities q. As in prior work on randomized smoothing, the noise parameter balances a trade-off: as q increases, the required margin to certify a given number of flips decreases. On the other hand, this results in noisier training labels, which reduces the margin and therefore results in lower robustness and often lower accuracy. Figure 1(b) depicts the certified accuracy for the full MNIST test set—see Appendix B for derivations of the bounds and optimization algorithm in the multi-class case. In addition to this being a significantly more difficult classification task, our classifier could not rely on features learned from other handwritten digits; instead, we extracted the top 30 components with ICA (Hyvarinen, 1999) independently of the labels. Despite the lack of fine-tuned features, our algorithm still achieves significant certified accuracy under a large number of adversarial label flips. We observed that ℓ₂ regularization did not make a large difference in the multi-class case, possibly due to the inaccuracy of the residual term in the noise estimate.

See Figure 4 in Appendix D for the effect of ℓ₂ regularization in the binary case. At a moderate cost to non-robust accuracy, the regularization results in substantially higher certified accuracy at almost all radii.

Figure 2: Dogfish test set certified accuracy to adversarial label flips as q is varied.
Figure 3: IMDB Review Sentiment test set certified accuracy as q is varied. The non-robust accuracy decreases slightly as q increases, from 79.11% to 78.96%.

Results on Dogfish

The Dogfish dataset contains images from the ImageNet dog and fish synsets, with 900 training points and 300 test points from each. We trained a ResNet-50 (He et al., 2016) on the standard ImageNet training set with all images labeled dog or fish removed. Our pre-trained network therefore learned meaningful image features but had no features specific to either class. We used PCA to reduce the feature space to 5 dimensions before solving the least-squares problem, to avoid overfitting. Figure 2 displays the results of our poisoning attack along with our certified defense. Under the undefended model, more than 99% of the test points can be successfully attacked with no more than 23 label flips, whereas our model can certifiably correctly classify 81.3% of the test points under the same attack. It would take more than four times as many flips—more than 5% of the training set—targeted at each test point individually to reduce our model to less than 50% certified accuracy.

Because the “votes” are changed by flipping so few labels, high noise levels reduce the model’s predictions to almost pure chance, meaning we are unable to achieve the margins necessary to certify a large number of flips. We therefore found that smaller levels of noise achieved higher certified test accuracy. This suggests that the more susceptible the original, non-robust classifier is to label flips, the lower the noise level should be set for the corresponding randomized classifier.

For much smaller values of , slight differences did not decrease the non-robust accuracy; they did, however, have a large effect on certified robustness. This indicates that the sign of is relatively stable, but the margin of is much less so. We observed this same pattern with the MNIST and IMDB datasets. We used a high-precision arithmetic library (Johansson and others, 2013) to achieve the necessary lower bounds, but the precision required for non-vacuous bounds grew extremely fast for ; optimizing (6) quickly became too computationally expensive.
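To illustrate why extended-precision arithmetic is needed here: the certified bounds involve probabilities far smaller than double precision can represent. A small sketch using Python's standard-library `decimal` as a stand-in for the mpmath library cited above (the specific numbers are illustrative, not from the paper):

```python
from decimal import Decimal, getcontext

# A product of 200 factors of 1e-2 equals 1e-400, which is far below the
# smallest positive float64 (~5e-324), so it underflows to exactly 0.0 in
# double precision while remaining representable at extended precision.
getcontext().prec = 50
p_float = 1.0
p_exact = Decimal(1)
for _ in range(200):
    p_float *= 1e-2
    p_exact *= Decimal("1e-2")

print(p_float)  # 0.0 -- underflow
print(p_exact)  # 1E-400
```

When bounds like these must be compared or optimized, underflow to zero makes them vacuous, which is why arbitrary-precision arithmetic (or careful log-space computation) becomes necessary.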

Finally, Figure 5 in Appendix D shows the performance of our classifier applied to unsupervised features. Because Dogfish is such a small dataset, deep unsupervised feature learning techniques were not feasible; we instead learned overcomplete features on 16x16 image patches using RICA (Le, 2013). It is worth noting that on datasets of a typical size for deep learning, robustness and accuracy closer to those of the pre-trained features are likely achievable with modern unsupervised feature learning algorithms such as Deep Clustering (Caron et al., 2018).

Results on IMDB

Figure 3 plots the result of our randomized smoothing procedure on the IMDB review sentiment dataset (Maas et al., 2011). This dataset contains 25,000 training examples and 25,000 test examples, evenly split between “positive” and “negative”. To extract the features, we applied Google News pre-trained Word2Vec embeddings to all the words in each review and averaged them. This feature embedding is considerably noisier than that of an image dataset, as most of the words in a review are irrelevant to sentiment classification. Indeed, Steinhardt et al. (2017) also found that the IMDB dataset was much more susceptible to adversarial corruption than images when using bag-of-words features. Consistent with this, we found that smaller levels of noise resulted in higher certified accuracy. We expect significant improvements could be made with a more refined choice of feature embedding.
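The averaged-embedding featurization described above amounts to a mean over in-vocabulary word vectors. A minimal sketch (the toy 3-dimensional dictionary below stands in for the 300-dimensional Google News Word2Vec vectors; function names are ours):

```python
import numpy as np

def review_embedding(tokens, word_vectors, dim=300):
    """Average the word vectors of all in-vocabulary tokens in a review."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# toy 3-dimensional "embedding" standing in for pre-trained Word2Vec
wv = {"great": np.array([1.0, 0.0, 0.0]),
      "movie": np.array([0.0, 1.0, 0.0])}
emb = review_embedding(["a", "great", "movie"], wv, dim=3)
print(emb)  # [0.5 0.5 0. ]
```

Averaging dilutes the signal from sentiment-bearing words among many irrelevant ones, which is consistent with the observation that this embedding is noisier than image features.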

6 Conclusion

In this work we presented a certified defense against a strong class of adversarial label-flipping attacks, in which an adversary can flip a different set of labels to cause a misclassification on each test point individually. This contrasts with previous data poisoning settings, which typically consider an adversary who aims to degrade the classifier’s accuracy on the test distribution as a whole, and it brings the adversary’s objective more in line with that of backdoor attacks and test-time adversarial perturbations. Leveraging randomized smoothing, a method originally developed for certifying robustness to test-time perturbations, we presented a classifier that can be certified robust to these pointwise train-time attacks. We then offered a tractable algorithm for evaluating this classifier which, despite being rooted in randomization, requires no Monte Carlo sampling whatsoever, yielding a truly certifiably robust classifier. To our knowledge, this is the first multi-class classification algorithm that is certifiably robust to label-flipping attacks.

There are several avenues for improvements to this line of work, perhaps the most immediate being the method for learning the input features. For example, rather than considering fixed features generated by a pre-trained deep network or learned without labels, extensions could leverage neural tangent kernels (Jacot et al., 2018) to allow for efficient learning with perturbed inputs or more flexible representations. The analysis could also be extended to other types of smoothing distributions applied to the training data, such as randomizing over the input features to provide robustness to more general data poisoning attacks. Finally, we hope that our defense to this threat model will inspire the development of more powerful train-time attacks, against which future defenses can be evaluated.


  • A. Athalye, N. Carlini, and D. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Cited by: §2.
  • M. Barreno, B. Nelson, R. Sears, A. D. Joseph, and J. D. Tygar (2006) Can machine learning be secure?. In Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, New York, NY, USA, pp. 16–25. Cited by: §1.
  • K. Bhatia, P. Jain, and P. Kar (2015) Robust regression via hard thresholding. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, Cambridge, MA, USA, pp. 721–729. Cited by: §2.
  • B. Biggio, G. Fumera, and F. Roli (2014) Security evaluation of pattern classifiers under attack. IEEE Transactions on Knowledge and Data Engineering 26 (4), pp. 984–996. Cited by: §2.
  • B. Biggio, B. Nelson, and P. Laskov (2011) Support vector machines under adversarial label noise. In Proceedings of the Asian Conference on Machine Learning, Vol. 20, South Garden Hotels and Resorts, Taoyuan, Taiwan, pp. 97–112. Cited by: §2, §2.
  • L. Bo, X. Ren, and D. Fox (2010) Kernel descriptors for visual recognition. In Advances in Neural Information Processing Systems 23, pp. 244–252. Cited by: §4.1.
  • N. Carlini and D. Wagner (2017) Adversarial examples are not easily detected: bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, New York, NY, USA, pp. 3–14. Cited by: §2.
  • M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In The European Conference on Computer Vision (ECCV), Cited by: §5.
  • R. Chen and I. Ch. Paschalidis (2018) A robust learning approach for regression models based on distributionally robust optimization. J. Mach. Learn. Res. 19 (1), pp. 517–564. External Links: ISSN 1532-4435 Cited by: §2.
  • X. Chen, C. Liu, B. Li, K. Lu, and D. Song (2017) Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526. Cited by: §1.
  • Y. Chen, C. Caramanis, and S. Mannor (2013) Robust sparse regression under adversarial corruption. In Proceedings of the 30th International Conference on Machine Learning, Vol. 28, Atlanta, Georgia, USA, pp. 774–782. Cited by: §2.
  • J. Cohen, E. Rosenfeld, and J. Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97, Long Beach, California, USA, pp. 1310–1320. Cited by: §1, §2, §3, Example 1.
  • I. Diakonikolas, W. Kong, and A. Stewart (2019) Efficient algorithms and lower bounds for robust linear regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, Philadelphia, PA, USA, pp. 2745–2754. Cited by: §2.
  • L. H. Dicker (2014) Variance estimation in high-dimensional linear models. Biometrika 101 (2), pp. 269–284. Cited by: §4.1.
  • J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell (2014) DeCAF: a deep convolutional activation feature for generic visual recognition. In Proceedings of the 31st International Conference on Machine Learning, Vol. 32, Bejing, China, pp. 647–655. Cited by: §1, §4.1.
  • K. Dvijotham, J. Hayes, B. Balle, Z. Kolter, C. Qin, A. Gyorgy, K. Xiao, S. Gowal, and P. Kohli (2020) A framework for robustness certification of smoothed classifiers using f-divergences. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Conference Track Proceedings, Cited by: §3.
  • R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner (2017) Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410. Cited by: §2.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.
  • D. Hsu, S. M. Kakade, and T. Zhang (2014) Random design analysis of ridge regression. Found. Comput. Math. 14 (3), pp. 569–600. Cited by: §4.1.
  • A. Hyvarinen (1999) Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks 10 (3), pp. 626–634. Cited by: §5.
  • A. Jacot, F. Gabriel, and C. Hongler (2018) Neural tangent kernel: convergence and generalization in neural networks. In Advances in Neural Information Processing Systems 31, pp. 8571–8580. Cited by: §6.
  • S. Karmalkar and E. Price (2018) Compressed sensing with adversarial sparse noise via l1 regression. arXiv preprint arXiv:1809.08055. Cited by: §2.
  • A. Klivans, P. K. Kothari, and R. Meka (2018) Efficient algorithms for outlier-robust regression. arXiv preprint arXiv:1803.03241. Cited by: §2.
  • P. W. Koh and P. Liang (2017) Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1885–1894. Cited by: Appendix C, Appendix C, §4, §5.
  • P. W. Koh, J. Steinhardt, and P. Liang (2018) Stronger data poisoning attacks break data sanitization defenses. arXiv preprint arXiv:1811.00741. Cited by: §2.
  • Q. V. Le (2013) Building high-level features using large scale unsupervised learning. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. , pp. 8595–8598. Cited by: Figure 5, §1, §5.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §5.
  • M. Lecuyer, V. Atlidakis, R. Geambasu, D. Hsu, and S. Jana (2018) Certified robustness to adversarial examples with differential privacy. arXiv preprint arXiv:1802.03471. Cited by: §2.
  • G. Lee, Y. Yuan, S. Chang, and T. S. Jaakkola (2019) Tight certificates of adversarial robustness for randomly smoothed classifiers. In Advances in Neural Information Processing Systems 31, Cited by: §2, §3, §3, §4, Example 2.
  • B. Li, C. Chen, W. Wang, and L. Carin (2018) Certified adversarial robustness with additive gaussian noise. arXiv preprint arXiv:1809.03113. Cited by: §2.
  • C. Liu, B. Li, Y. Vorobeychik, and A. Oprea (2017) Robust linear regression against training data poisoning. In AISec 2017 - Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, co-located with CCS 2017, (English (US)). Cited by: §2.
  • T. Liu and D. Tao (2016) Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (3), pp. 447–461. Cited by: §2.
  • A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts (2011) Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, Stroudsburg, PA, USA, pp. 142–150. Cited by: §5.
  • J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff (2017) On detecting adversarial perturbations. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, Cited by: §2.
  • S. Mika (2003) Kernel Fisher discriminants. Cited by: §4.1.
  • F. Johansson et al. (2013) Mpmath: a Python library for arbitrary-precision floating-point arithmetic (version 0.18). Note: http://mpmath.org/ Cited by: §5.
  • L. Muñoz-González, B. Biggio, A. Demontis, A. Paudice, V. Wongrassamee, E. C. Lupu, and F. Roli (2017) Towards poisoning of deep learning algorithms with back-gradient optimization. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, New York, NY, USA, pp. 27–38. Cited by: §2.
  • N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari (2013) Learning with noisy labels. In Advances in Neural Information Processing Systems 26, pp. 1196–1204. Cited by: §2.
  • N. Papernot, P. McDaniel, A. Sinha, and M. P. Wellman (2018) SoK: security and privacy in machine learning. In 2018 IEEE European Symposium on Security and Privacy (EuroS P), pp. 399–414. External Links: ISSN Cited by: §1.
  • G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu (2017) Making deep neural networks robust to label noise: a loss correction approach. pp. 2233–2241. Cited by: §2.
  • A. Paudice, L. Muñoz-González, and E. C. Lupu (2019) Label sanitization against label flipping poisoning attacks. In ECML PKDD 2018 Workshops, Cham, pp. 5–15. Cited by: §2.
  • A. Prasad, A. S. Suggala, S. Balakrishnan, and P. Ravikumar (2018) Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485. Cited by: §2.
  • B. I.P. Rubinstein, B. Nelson, L. Huang, A. D. Joseph, S. Lau, S. Rao, N. Taft, and J. D. Tygar (2009) ANTIDOTE: understanding and defending against poisoning of anomaly detectors. In Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement, New York, NY, USA, pp. 1–14. Cited by: §2.
  • H. Salman, G. Yang, J. Li, P. Zhang, H. Zhang, I. Razenshteyn, and S. Bubeck (2019) Provably robust deep learning via adversarially trained smoothed classifiers. arXiv preprint arXiv:1906.04584. Cited by: §2.
  • A. Shafahi, W. R. Huang, M. Najibi, O. Suciu, C. Studer, T. Dumitras, and T. Goldstein (2018) Poison frogs! targeted clean-label poisoning attacks on neural networks. In Advances in Neural Information Processing Systems 31, pp. 6103–6113. Cited by: §2.
  • Y. Shen and S. Sanghavi (2019) Learning with bad training data via iterative trimmed loss minimization. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97, Long Beach, California, USA, pp. 5739–5748. Cited by: §2.
  • J. Steinhardt, P. W. Koh, and P. Liang (2017) Certified defenses for data poisoning attacks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, USA, pp. 3520–3532. Cited by: §1, §2, §5, §5.
  • A. Suggala, A. Prasad, and P. K. Ravikumar (2018) Connecting optimization and regularization paths. In Advances in Neural Information Processing Systems 31, pp. 10608–10619. Cited by: §4.1.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Appendix C.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §2.
  • R. Taheri, R. Javidan, M. Shojafar, Z. Pooranian, A. Miri, and M. Conti (2019) On defending against label flipping attacks on malware detection systems. arXiv preprint arXiv:1908.04473. Cited by: §2.
  • B. Tran, J. Li, and A. Madry (2018) Spectral signatures in backdoor attacks. In Advances in Neural Information Processing Systems 31, pp. 8000–8010. Cited by: §1.
  • H. Xiao, H. Xiao, and C. Eckert (2012) Adversarial label flips attack on support vector machines. In Proceedings of the 20th European Conference on Artificial Intelligence, Amsterdam, The Netherlands, The Netherlands, pp. 870–875. Cited by: §1, §2, §2.
  • H. Xiao, B. Biggio, G. Brown, G. Fumera, C. Eckert, and F. Roli (2015) Is feature selection secure against training data poisoning?. In Proceedings of the 32nd International Conference on Machine Learning, Vol. 37, Lille, France, pp. 1689–1698. Cited by: §2.
  • C. Yang, Q. Wu, H. Li, and Y. Chen (2017) Generative poisoning attack method against neural networks. arXiv preprint arXiv:1703.01340. Cited by: §2.
  • C. Yeh, J. Kim, I. E. Yen, and P. K. Ravikumar (2018) Representer point selection for explaining deep neural networks. In Advances in Neural Information Processing Systems 31, pp. 9291–9301. Cited by: §4.
  • J. Yosinski, J. Clune, Y. Bengio, and H. Lipson (2014) How transferable are features in deep neural networks?. In Advances in Neural Information Processing Systems 27, pp. 3320–3328. Cited by: §4.1.
  • C. Zhu, W. R. Huang, H. Li, G. Taylor, C. Studer, and T. Goldstein (2019) Transferable clean-label poisoning attacks on deep neural nets. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97, Long Beach, California, USA, pp. 7614–7623. Cited by: §2.

Appendix A Generic Randomized Smoothing Algorithm

  Input: function f, number of samples m, smoothing distribution μ, test point x to predict, failure probability α.
  for i = 1, …, m do
     Sample Xᵢ ∼ μ(x) and compute f(Xᵢ).
  end for
  Compute approximate smoothed output ĝ(x) = (1/m) ∑ᵢ f(Xᵢ)
  Compute bound b such that |g(x) − ĝ(x)| ≤ b with probability 1 − α
  Output: Prediction sign(ĝ(x)) and probability bound b, or abstention if sign(ĝ(x) − b) ≠ sign(ĝ(x) + b)
Algorithm 2 Generic randomized smoothing procedure
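The generic procedure above can be instantiated with a simple concentration bound. Below is a minimal Monte Carlo sketch for a base function with outputs in [−1, 1], using a Hoeffding bound (the function names and the specific choice of concentration bound are ours, not the paper's):

```python
import math
import numpy as np

def smoothed_predict(f, x, sample_noise, m=1000, alpha=0.001, seed=0):
    """Monte Carlo randomized smoothing for a base function f with outputs
    in [-1, +1]; sample_noise(rng, x) draws one randomized copy of the input.
    Returns (prediction, bound) or None on abstention.
    """
    rng = np.random.default_rng(seed)
    outputs = [f(sample_noise(rng, x)) for _ in range(m)]
    g_hat = float(np.mean(outputs))
    # Hoeffding: |g(x) - g_hat| <= b with probability >= 1 - alpha,
    # since each output lies in an interval of width 2
    b = math.sqrt(2.0 * math.log(2.0 / alpha) / m)
    if abs(g_hat) <= b:
        return None  # abstain: the sign is not determined at this confidence
    return (1 if g_hat > 0 else -1), b

# a trivially constant base function is certified with any reasonable m
print(smoothed_predict(lambda z: 1.0, 0.0, lambda rng, x: x, m=100, alpha=0.05))
```

The abstention check implements the "sign not determined within the bound" condition: if the confidence interval around the empirical mean straddles zero, no prediction is certified.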

Appendix B The multi-class setting

Although the notation and algorithms are slightly more complex, all the methods we have discussed in the main paper extend to the multi-class setting. In this case, we consider a class label, and we again seek a smoothed prediction such that the classifier’s prediction on a new point will not change under some number of flips of the labels in the training set.

b.1 Randomized smoothing in the multi-class case

Here we extend our notation to the case of more than two classes. Recall our original definition of ,

where . More generally, consider a classifier , outputting the index of one of classes. Under this formulation, for a given class , we have

where is the indicator function for whether outputs the class . In this case, the hard threshold is evaluated by returning the class with the highest probability. That is,

b.2 Linearization and Chernoff bound approach for the multiclass case

Using the same linearization approach as in the binary case, we can formulate an analogous approach which forgoes the need to actually perform random sampling at all and instead directly bounds the randomized classifier using the Chernoff bound.

Adopting the same notation as in the main text, the equivalent least-squares classifier for the multi-class setting finds some set of weights, where the label target is a binary matrix with each row equal to a one-hot encoding of the class label (note that the resulting weight matrix now has one column per class, and we let a subscript refer to the corresponding column). At prediction time, the predicted class of some new point is simply given by the prediction with the highest value, i.e.,

Alternatively, following the same logic as in the binary case, this same prediction can be written in terms of the variable as

where denotes the th column of .

In our randomized smoothing setting, we again propose to flip the class of any label with probability , selecting an alternative label uniformly at random from the remaining labels. Assuming that the predicted class label is , we wish to bound the probability that

for all alternative classes . By the Chernoff bound, we have that

The random variable

takes on three different distributions depending on if , if , or if and . Specifically, this variable can take on the terms with the associated probabilities

Combining these cases directly into the Chernoff bound gives

Again, this problem is convex in , and so can be solved efficiently using Newton’s method. Since the reverse case can be computed via the same expression, we can similarly optimize this in an unconstrained fashion: we do this for every pair of classes and , and return the which gives the smallest lower bound for the worst-case choice of .
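As an illustration of the Chernoff optimization step: for a discrete random variable Z, P(Z ≥ 0) ≤ min_{t ≥ 0} E[exp(tZ)], and the moment generating function is convex in t. The sketch below uses a ternary search as a stand-in for the Newton iterations described above (names and the search procedure are ours):

```python
import math

def chernoff_upper_bound(values, probs, t_max=50.0, iters=200):
    """Minimize the moment generating function E[exp(t*Z)] over t in [0, t_max].

    By the Chernoff bound, P(Z >= 0) <= min_t E[exp(t*Z)]; the objective is
    convex in t, so a one-dimensional ternary search suffices.
    """
    def mgf(t):
        return sum(p * math.exp(t * v) for v, p in zip(values, probs))

    lo, hi = 0.0, t_max
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if mgf(m1) < mgf(m2):
            hi = m2
        else:
            lo = m1
    return mgf((lo + hi) / 2)

# Z = -1 w.p. 0.9 and +1 w.p. 0.1: the optimal bound is 2*sqrt(0.9*0.1) = 0.6
print(chernoff_upper_bound(values=[-1.0, 1.0], probs=[0.9, 0.1]))
```

In the multi-class setting, a bound of this form would be computed once per pair of classes, keeping the smallest lower bound over the worst-case pairing.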

b.3 KL Divergence Bound

To compute actual certification radii, we derive the KL divergence bound for the case of classes. Let be defined as in Section 4, except that, as in the previous section, when a label is flipped with probability it is changed to one of the other classes uniformly at random. Let and refer to the independent measures on each dimension which collectively make up the factorized distributions and (i.e., ). Further, let be the element of , meaning it is the “original” label which may or may not be flipped when sampling from . First noting that each dimension of the distributions and is independent, we have

Plugging in the robustness guarantee (3), we have that so long as

Setting recovers the divergence term (4) and the bound (5).
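To make the factorized KL computation concrete, here is a sketch under the uniform-flip parameterization described above (in the code, `q` stands for the flip probability and `K` for the number of classes; both names are ours, since the symbols were elided in the text):

```python
import math

def flip_kl(y_mu, y_nu, q, K):
    """KL divergence between two factorized label-flip distributions.

    Each coordinate keeps its base label with probability 1-q and moves to
    each of the other K-1 labels with probability q/(K-1). Coordinates whose
    base labels agree contribute zero, so the total KL is
    (# differing labels) * per-coordinate KL.
    """
    keep, move = 1.0 - q, q / (K - 1)
    # per differing coordinate: mass `keep` sits on y_i under mu but receives
    # `move` under nu, and vice versa; the remaining K-2 classes match exactly
    per_coord = keep * math.log(keep / move) + move * math.log(move / keep)
    hamming = sum(a != b for a, b in zip(y_mu, y_nu))
    return hamming * per_coord

print(flip_kl([0, 1, 2], [0, 1, 2], q=0.2, K=3))  # 0.0 -- identical base labels
print(flip_kl([0, 1, 2], [0, 1, 0], q=0.2, K=3))
```

Because the divergence scales linearly in the Hamming distance between the two label vectors, inverting the robustness guarantee in the number of differing labels yields the certified radius.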

Appendix C Description of Label-Flipping Attacks on MNIST 1/7 and Dogfish

Due to the dearth of existing work on label-flipping attacks for deep networks, our attacks on MNIST and Dogfish were quite straightforward; we expect significant improvements could be made to tighten this upper bound.

For Dogfish, we used a pretrained Inception network (Szegedy et al., 2016) to evaluate the influence of each training point with respect to the loss of each test point (Koh and Liang, 2017). As in prior work, we froze all but the top layer of the network for retraining. Once we obtained the most influential points, we flipped the first one and recomputed approximate influence using only the top layer for efficiency. After each flip, we recorded which points were classified differently and maintained for each test point the successful attack which required the fewest flips. When this was finished, we also tried the reverse of each attack to see if any of them could be achieved with even fewer flips.

For MNIST we implemented two similar attacks and kept the better attack for each test point. The first attack simply ordered training points by their distance from the test point in feature space, as a proxy for influence. We then tried flipping their labels one at a time until the prediction changed, and we also tried the reverse. The second attack was essentially the same as the Dogfish attack, ordering the training points by influence. To calculate influence we again assumed a frozen feature map; specifically, using the same notation as Koh and Liang (2017), the influence of flipping the label of a training point to on the loss at the test point is:

For logistic regression, these values can easily be computed in closed form.
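A sketch of this computation under the frozen-feature approximation, for logistic loss (variable names and the sign convention are ours: positive values indicate the relabeling increases the test loss):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def flip_influence(theta, X, x_train, y_old, y_new, x_test, y_test):
    """Approximate change in test loss from relabeling one training point.

    Influence-function approximation (frozen features):
        influence ~= -grad_test^T H^{-1} (grad(x_train, y_new) - grad(x_train, y_old))
    with logistic-loss gradient g(x, y) = (sigmoid(theta^T x) - y) x and
    Hessian H = sum_i s_i (1 - s_i) x_i x_i^T over the training set.
    """
    s = sigmoid(X @ theta)
    H = (X * (s * (1 - s))[:, None]).T @ X
    # the two gradients differ only through the label term
    grad_diff = (y_old - y_new) * x_train
    grad_test = (sigmoid(x_test @ theta) - y_test) * x_test
    return float(-grad_test @ np.linalg.solve(H, grad_diff))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
theta = np.array([0.5, -0.3, 0.2])
x = X[0]
# flipping the label of a point identical to the test point, away from the
# test label, should raise the test loss (positive influence)
infl = flip_influence(theta, X, x_train=x, y_old=1, y_new=0, x_test=x, y_test=1)
print(infl > 0)  # True
```

Since only the label changes, the closed form needs one Hessian solve per training point rather than any retraining, which is what makes ranking all candidate flips tractable.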

Appendix D Additional Plots

Figure 4: MNIST 1/7 test set certified accuracy with and without regularization in the computation of . Note that the unregularized solution achieves almost 100% non-robust accuracy but certifies significantly lower robustness. This implies that the “training” process is not robust enough to label noise, hence the lower margins achieved by the ensemble. In comparison, the regularized solution achieves significantly higher margins, at a slight cost in overall accuracy.
Figure 5: Dogfish test set certified accuracy using features learned with RICA (Le, 2013). While not as performant as the pre-trained features, our classifier still achieves reasonable certified accuracy—note that the certified lines are lower bounds, while the undefended line is an upper bound. As the limitation is feature quality, deep unsupervised features could significantly boost performance, but would require a larger dataset.
Figure 6: Left: Required margin to certify a given number of label flips using the generic KL bound (5) versus the tight discrete bound (2). Right: The same comparison, but inverted, showing the certifiable robustness for a given margin. The tight bound certifies robustness to approximately twice as many label flips.