Deep Partition Aggregation: Provable Defense against General Poisoning Attacks

06/26/2020 · Alexander Levine et al. · University of Maryland

Adversarial poisoning attacks distort training data in order to corrupt the test-time behavior of a classifier. A provable defense provides a certificate for each test sample, which is a lower bound on the magnitude of any adversarial distortion of the training set that can corrupt the test sample's classification. We propose two provable defenses against poisoning attacks: (i) Deep Partition Aggregation (DPA), a certified defense against a general poisoning threat model, defined as the insertion or deletion of a bounded number of samples from the training set; by implication, this threat model also includes arbitrary distortions to a bounded number of images and/or labels; and (ii) Semi-Supervised DPA (SS-DPA), a certified defense against label-flipping poisoning attacks. DPA is an ensemble method in which base models are trained on partitions of the training set determined by a hash function. DPA is related to subset aggregation, a well-studied ensemble method in classical machine learning. DPA can also be viewed as an extension of randomized ablation (Levine & Feizi, 2020b), a certified defense against sparse evasion attacks, to the poisoning domain. Our label-flipping defense, SS-DPA, uses a semi-supervised learning algorithm as its base classifier model: we train each base classifier using the entire unlabeled training set in addition to the labels for a partition. SS-DPA outperforms the existing certified defense for label-flipping attacks (Rosenfeld et al., 2020). SS-DPA certifies at least 50% of test images against 675 label flips (vs. fewer than 200 label flips with the existing defense) on MNIST, and against 83 label flips on CIFAR-10. Against general poisoning attacks (for which there is no prior certified defense), DPA certifies at least 50% of test images against over 500 poison image insertions or deletions on MNIST, and against nine insertions or deletions on CIFAR-10. These results establish new state-of-the-art provable defenses against poisoning attacks.


1 Introduction

Adversarial poisoning attacks are an important vulnerability in machine learning systems. In these attacks, an adversary can manipulate the training data of a classifier, in order to change the classifications of specific inputs at test time. Several poisoning threat models have been studied in the literature, including threat models where the adversary may insert new poison samples (Chen et al., 2017), manipulate the training labels (Xiao et al., 2012; Rosenfeld et al., 2020), or manipulate the training sample values (Biggio et al., 2012; Shafahi et al., 2018). A certified defense against a poisoning attack provides a certificate for each test sample, which is a guaranteed lower bound on the magnitude of any adversarial distortion of the training set that can corrupt the test sample’s classification. In this work, we propose certified defenses against two types of poisoning attacks:

  • General poisoning attacks: In this threat model, the attacker can insert or remove a bounded number of samples from the training set. In particular, the attack magnitude is defined as the cardinality of the symmetric difference between the clean and poisoned training sets. This threat model also includes any distortion to an image and/or label in the training set — a distortion of a training image is simply the removal of the original image followed by the insertion of the distorted image. (Note that an image distortion or label flip therefore increases the symmetric difference attack magnitude by two.)

  • Label-flipping poisoning attacks: In this threat model, the adversary changes only the labels of a bounded number of training samples. Rosenfeld et al. (2020) have recently provided a certified defense for this threat model, which we improve upon. (Both attack magnitudes are written formally below.)
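
Writing T for the clean training set and T′ for the poisoned one (a notation consistent with Section 2.1; the symbols here are ours), the two attack magnitudes above can be stated as

$$d_{\mathrm{general}}(T, T') \;=\; |T \ominus T'| \;=\; |T \setminus T'| + |T' \setminus T|,$$

so a single label flip or image distortion contributes two to the general attack magnitude (one removal plus one insertion), and

$$d_{\mathrm{flip}}(T, T') \;=\; \big|\{\, t \in T \;:\; \text{the label of } t \text{ differs in } T' \,\}\big|,$$

where T′ is required to contain the same unlabeled samples as T.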

In the last couple of years, certified defenses have been extensively studied for evasion attacks, where the adversary manipulates the test images rather than the training data (e.g., Wong and Kolter (2018); Gowal et al. (2018); Lecuyer et al. (2019); Li et al. (2018); Salman et al. (2019); Levine and Feizi (2020b, 2019); Cohen et al. (2019), etc.). In the evasion case, a certificate is a lower bound on the distance from the image to the classifier's decision boundary: this guarantees that the image's classification remains unchanged under adversarial distortions up to the certified magnitude.

Rosenfeld et al. (2020) provides an analogous certificate for label-flipping poisoning attacks: for an input image x, the certificate of x is a lower bound on the number of labels in the training set that would have to change in order to change the classification of x. (Footnote 1: Steinhardt et al. (2017) also refers to a "certified defense" for poisoning attacks. However, the definition of the certificate is substantially different in that work, which instead provides overall accuracy guarantees under the assumption that the training and test data are drawn from similar distributions, rather than providing guarantees for individual realized inputs.) Rosenfeld et al. (2020)'s method is an adaptation of a certified defense for sparse evasion attacks proposed by Lee et al. (2019). In Lee et al. (2019)'s randomized smoothing technique, the final classification is the consensus of many classifications by a base classifier on noisy copies of the input image x: in each copy of x, each pixel is corrupted to a random value with fixed probability. This results in probabilistic certificates, with a failure rate that decreases as the number of noisy copies used in the classification increases.

The adapted method for label-flipping attacks proposed by Rosenfeld et al. (2020) is equivalent to randomly flipping each training label with fixed probability and taking a consensus result. If implemented directly, this would require one to train a large ensemble of classifiers on different noisy versions of the training data. However, instead of actually doing this, Rosenfeld et al. (2020) focuses only on linear classifiers and is therefore able to analytically calculate the expected result. This gives deterministic, rather than probabilistic, certificates. Further, because Rosenfeld et al. (2020) considers a threat model where only labels are modified, they are able to train an unsupervised nonlinear feature extractor on the (unlabeled) training data before applying their technique, in order to learn more complex features. There is no need to worry about the robustness of the feature extractor training, because the unlabeled data cannot be corrupted by the adversary in their considered threat model.

However, the technique proposed by Lee et al. (2019) is not the current state-of-the-art certifiable defense against sparse evasion attacks on large-scale image datasets (i.e., ImageNet).

Levine and Feizi (2020b) propose a randomized ablation technique which, rather than randomizing the values of some pixels in each noisy copy of x, instead ablates some pixels, replacing them with a null value. Since it is possible for the base classifier to distinguish exactly which pixels originate from x, this results in more accurate base classifications and therefore substantially greater certified robustness. For example, on ImageNet, Lee et al. (2019) certifies the median test image against distortions of one pixel, while Levine and Feizi (2020b) certifies against distortions of 16 pixels. The key question is whether or not one can adapt the evasion-time defense proposed by Levine and Feizi (2020b) to the poisoning setup. In this paper, we provide an affirmative answer to this question and show that the resulting method significantly outperforms the current state-of-the-art certifiable defenses. As shown in Figure 1-a, our method for certified robustness against label-flipping attacks substantially outperforms Rosenfeld et al. (2020) for multiclass classification tasks. Furthermore, while our method is de-randomized (as Rosenfeld et al. (2020) is) and therefore yields deterministic certificates, our technique does not require that the classification model be linear, allowing deep networks to be used. Moreover, in Figure 1-b, we illustrate our certified accuracy against "general" poisoning attacks on MNIST; Rosenfeld et al. (2020) does not provide a provable defense for this threat model.

Figure 1: (a) Comparison of certified accuracy against label-flipping poisoning attacks for our defense (the SS-DPA algorithm) vs. Rosenfeld et al. (2020) on MNIST. Solid lines represent certified accuracy as a function of attack size; dashed lines show the clean accuracies of each model. Our algorithm produces substantially higher certified accuracies. Performance curves for Rosenfeld et al. (2020) are adapted from Figure 1 in that work. The parameter q is a hyperparameter of Rosenfeld et al. (2020)'s algorithm, and k is a hyperparameter of our algorithm: the number of base classifiers in the ensemble. (b) Certified accuracy against general poisoning attacks on MNIST using our DPA defense. The attack size is the number of samples which the adversary may add to or remove from the training set. Rosenfeld et al. (2020) does not provide a provable defense for this more general threat model.

In what follows, we explain our methods. We develop a certifiable defense against general poisoning attacks called Deep Partition Aggregation (DPA): we first partition the training set into k partitions, with the partition assignment for a training sample t determined by a hash function of the sample. The hash function can be any deterministic function that maps a training sample to a partition assignment: the only requirement is that the hash value depends only on the value of the training sample itself, so that neither poisoning other samples, nor changing the total number of samples, nor reordering the samples can change the partition that t is assigned to. We then train k base classifiers separately, one on each partition. At test time, we evaluate each of the k base classifiers on the test image and return the plurality classification as the final result. The key insight is that removing a training sample, or adding a new sample, will only change the contents of one partition, and therefore will only affect the classification of one of the base classifiers. Let n_c be the number of base classifiers that output the consensus class c, and n_{c′} be the number of base classifiers that output the next-most-frequently returned class c′. Let Δ := n_c − n_{c′}. To change the plurality classification from c to c′, the adversary must change the outputs of at least ⌈Δ/2⌉ base classifiers: this means inserting or removing at least ⌈Δ/2⌉ training samples. This immediately gives us a robustness certificate (see the sketch below).
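
This counting argument reduces to simple arithmetic on the ensemble's votes. The helper below is our own illustrative sketch (names are ours, not the authors' released code); it computes the exact certificate that appears later as Theorem 1, including the smaller-index tie-breaking rule of Section 2.2.

```python
from collections import Counter

def dpa_certificate(votes, num_classes):
    """votes: one predicted class label (int) per base classifier.
    Returns (consensus_class, certified_radius): the consensus prediction
    cannot change under any insertion/deletion of up to `certified_radius`
    training samples, since each such change affects at most one partition
    and hence at most one base classifier."""
    counts = Counter(votes)
    n = [counts.get(c, 0) for c in range(num_classes)]
    # Consensus: most votes, ties broken toward the smaller class index.
    consensus = max(range(num_classes), key=lambda c: (n[c], -c))
    # Worst-case margin over all other classes, accounting for the fact that
    # a smaller class index wins an exact tie against the consensus class.
    margin = min(n[consensus] - n[c] - (1 if c < consensus else 0)
                 for c in range(num_classes) if c != consensus)
    return consensus, margin // 2

# Illustrative (hypothetical) vote counts for k = 1200 base classifiers:
votes = [0] * 700 + [3] * 300 + [7] * 200
print(dpa_certificate(votes, num_classes=10))  # -> (0, 200)
```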

Our proposed method is related to classical ensemble approaches in machine learning, namely bootstrap aggregation and subset aggregation (Breiman, 1996; Buja and Stuetzle, 2006; Bühlmann, 2003; Zaman and Hirose, 2009). However, in these methods each base classifier in the ensemble is trained on an independently sampled collection of points from the training set: multiple classifiers in the ensemble may be trained on (and therefore poisoned by) the same sample point. The purpose of these methods has typically been to improve generalization. Bootstrap aggregation has been proposed as an empirical defense against poisoning attacks (Biggio et al., 2011) as well as for evasion attacks (Smutz and Stavrou, 2016). However, to our knowledge, these techniques have not yet been used to provide certified robustness. Our unique partition aggregation variant provides deterministic robustness certificates against poisoning attacks. See Appendix D for further discussion.

Note that DPA provides a robustness guarantee against any poisoning perturbation; it provides robustness to insertions and deletions, and by implication also to both label flips and image distortions. However, if the adversary is restricted to flipping labels only (as in Rosenfeld et al. (2020)), we can achieve even larger certificates through a modified technique. In this setting, the unlabeled data is trustworthy: each base classifier in the ensemble can then make use of the entire training set without labels, but only has access to the labels in its own partition. Therefore, each base classifier can be trained as if the entire dataset is available as unlabeled data, but only a very small number of labels are available. This is precisely the problem statement of semi-supervised learning, a well-studied domain of machine learning (Verma et al., 2019; Luo et al., 2018; Laine and Aila, 2017; Kingma et al., 2014; Gidaris et al., 2018). We can then leverage these existing semi-supervised learning techniques directly to improve the accuracies of the base classifiers in DPA. Furthermore, we can ensure that a particular image is assigned to the same partition regardless of label, so that only one partition is affected by a label flip (rather than possibly two). The resulting algorithm, Semi-Supervised Deep Partition Aggregation (SS-DPA), yields substantially increased certified accuracy against label-flipping attacks, compared to DPA alone and compared to the current state-of-the-art.

On MNIST, SS-DPA substantially outperforms the existing state of the art (Rosenfeld et al., 2020) in defending against label-flip attacks: we are able to certify at least half of images in the test set against attacks to over 600 (1.0%) of the labels in the training set, while still maintaining over 93% accuracy (See Figure 1-a, and Table 1). In comparison, Rosenfeld et al. (2020)’s method achieves less than 60% clean accuracy on MNIST, and most test images cannot be certified with the correct class against attacks of even 200 label flips. We are also the first work to our knowledge to certify against general poisoning attacks, including insertions and deletions of new training images: in this domain, we can certify at least half of test images against attacks consisting of over 500 arbitrary training image insertions or deletions. On CIFAR-10, a substantially more difficult classification task which Rosenfeld et al. (2020) does not test on, we can certify at least half of test images against label-flipping attacks on over 80 labels using SS-DPA, and can certify at least half of test images against general poisoning attacks of up to nine insertions or deletions using DPA.

Weber et al. (2020) have recently proposed a different randomized-smoothing-based defense against poisoning attacks, by directly applying Cohen et al. (2019)'s smoothing evasion defense to the poisoning domain. The proposed technique can only certify clean-label attacks (where only the existing images in the dataset are modified, and not their labels), and the certificate guarantees robustness only to bounded distortions of the training data, where the ℓ2 norm of the distortion is calculated across all pixels in the entire training set. Due to well-known limitations of dimensional scaling for smoothing-based robustness certificates (Yang et al., 2020; Kumar et al., 2020; Blum et al., 2020), this yields certificates to only very small distortions of the training data. (For binary MNIST [13,007 images], the maximum reported certificate is pixels.) Additionally, when using deep classifiers, Weber et al. (2020) proposes a randomized certificate, rather than a deterministic one, with a failure probability that decreases to zero only as the number of trained classifiers in an ensemble approaches infinity. Moreover, in Weber et al. (2020), unlike in our method, each classifier in the ensemble must be trained on a noisy version of the entire dataset. These issues prevent Weber et al. (2020)'s method from being an effective scheme for certified robustness against poisoning attacks.

2 Proposed Methods

2.1 Notation

Let 𝒮 be the space of all possible unlabeled samples (i.e., the set of all possible images). We assume that it is possible to sort elements of 𝒮 in a deterministic, unambiguous way; in particular, we can sort images lexicographically by pixel values. We represent labels as integers, so that the set of all possible labeled samples is 𝒮 × ℤ. A training set for a classifier is then represented as T ∈ 𝒫(𝒮 × ℤ), where 𝒫(·) is the power set. For a labeled sample t ∈ 𝒮 × ℤ, we let sample(t) refer to the (unlabeled) sample, and label(t) refer to the label. For a set of labeled samples T, we let samples(T) refer to the set of unique unlabeled samples which occur in T. A classifier model is defined as a deterministic function from both the training set and the sample to be classified to a label, i.e., a function 𝒫(𝒮 × ℤ) × 𝒮 → ℤ. We will use f to represent a base classifier model (i.e., a neural network), and g to refer to a robust classifier (using DPA or SS-DPA). T ⊖ T′ represents the set symmetric difference between T and T′: T ⊖ T′ := (T \ T′) ∪ (T′ \ T). The number of elements in T is |T|, [k] is the set of integers 1 through k, and ⌊x⌋ is the largest integer less than or equal to x. 𝟙_Prop represents the indicator function: 𝟙_Prop = 1 if Prop is true, and 0 otherwise. For a set A of sortable elements, we define sort(A) as the sorted list of the elements of A. For a list L of unique elements and an element a in L, we define index_L(a) as the index of a in the list L.

2.2 DPA

The Deep Partition Aggregation (DPA) algorithm requires a base classifier model f, a training set T, a deterministic hash function h, and a hyperparameter k indicating the number of base classifiers which will be used in the ensemble.

At training time, the algorithm first uses the hash function h to define k partitions P_1, …, P_k of the training set, as follows:

$$P_i := \{\, t \in T \;:\; h(t) \equiv i \pmod{k} \,\}, \qquad i \in [k] \qquad (1)$$

The hash function h can be any deterministic function from 𝒮 × ℤ to ℕ; however, it is preferable that the partitions are roughly equal in size. Therefore we should choose an h which maps images to a domain of integers significantly larger than k, in a way such that h(t) mod k will be roughly uniform over [k]. In practice, we let h(t) be the sum of the pixel values in the image sample(t).
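
A minimal sketch of this partition assignment, under the assumption that images arrive as integer-valued arrays (function and variable names are ours):

```python
import numpy as np

def partition_index(image, k):
    """Assign a training sample to one of k partitions using the pixel-sum
    hash: the assignment depends only on the sample's own pixel values, so
    poisoning, reordering, or resizing the rest of the training set cannot
    move this sample to a different partition."""
    return int(np.asarray(image, dtype=np.int64).sum()) % k

# Example: build k = 1200 partitions from (image, label) pairs.
# partitions = [[] for _ in range(1200)]
# for image, label in training_set:           # `training_set` is hypothetical here
#     partitions[partition_index(image, 1200)].append((image, label))
```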

Base classifiers are then trained on each partition: we define the k trained base classifiers f_1, …, f_k as:

$$f_i(x) := f(P_i, x), \qquad i \in [k] \qquad (2)$$

Finally, at inference time, we evaluate the input x on each base classifier, and count the number of classifiers which return each class c:

$$n_c(x) := \big|\{\, i \in [k] \;:\; f_i(x) = c \,\}\big| \qquad (3)$$

This lets us define the robust classifier g_DPA, which returns the consensus output of the ensemble:

$$g_{\mathrm{DPA}}(T, x) := \arg\max_{c}\, n_c(x) \qquad (4)$$

When taking the argmax, we break ties deterministically by returning the smaller class index. The resulting robust classifier has the following guarantee:

Theorem 1.

For a fixed deterministic base classifier f, hash function h, ensemble size k, training set T, and input x, let:

$$c^* := g_{\mathrm{DPA}}(T, x), \qquad \rho := \left\lfloor \frac{n_{c^*}(x) - \max_{c \neq c^*}\big(n_c(x) + \mathbb{1}_{c < c^*}\big)}{2} \right\rfloor \qquad (5)$$

Then, for any poisoned training set T′, if |T ⊖ T′| ≤ ρ, then g_DPA(T′, x) = g_DPA(T, x).

All proofs are presented in Appendix A. Note that T and T′ are unordered sets: therefore, in addition to providing certified robustness against insertions or deletions of training data, the robust classifier g is also invariant under re-ordering of the training data, provided that f has this invariance (which is implied, because f maps deterministically from a set; see Section 2.2.1 for practical considerations). As mentioned in Section 1, DPA is a deterministic variant of randomized ablation (Levine and Feizi, 2020b) adapted to the poisoning domain. Each base classifier ablates most of the training set, retaining only the samples in one partition. However, unlike in randomized ablation, the partitions are deterministic and use disjoint samples, rather than selecting them randomly and independently. In Appendix C, we argue that our derandomization has little effect on the certified accuracies, while allowing for exact certificates using finite samples. We also discuss how this work relates to Levine and Feizi (2020a), which proposes a de-randomized ablation technique for a restricted class of sparse evasion attacks (patch adversarial attacks).

2.2.1 DPA Practical Implementation Details

One of the advantages of DPA is that we can use deep neural networks for the base classifier f. However, enforcing that the output of a deep neural network is a deterministic function of its training data, and specifically of its training data as an unordered set, requires some care. First, we must remove dependence on the order in which the training samples are read in. To do this, in each partition P_i, we sort the training samples prior to training, taking advantage of the assumption that 𝒮 is well-ordered (and therefore 𝒮 × ℤ is also well-ordered). In the case of image data, this is implemented as a lexical sort by pixel values, with the labels concatenated to the samples as an additional value. The training procedure for the network, which is based on standard stochastic gradient descent, must also be made deterministic: in our PyTorch (Paszke et al., 2019) implementation, this can be accomplished by deterministically setting a random seed at the start of training. As discussed in Appendix F, we find that it is best to use different random seeds during training for each partition; this reduces the correlation in outputs between base classifiers in the ensemble. Thus, in practice, we use the partition index as the random seed (i.e., we train base classifier f_i using random seed i).
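
A sketch of these determinism measures in PyTorch; the helper name and sorting key are ours, and a particular setup may need additional flags (e.g., for cuDNN or dataloader workers):

```python
import random
import numpy as np
import torch

def deterministic_training_setup(i, partition):
    """Prepare to train the i-th base classifier as a deterministic function
    of its partition, viewed as an unordered set of (image, label) pairs."""
    # 1. Remove dependence on the order in which samples were read: sort the
    #    pairs lexicographically by pixel values, with the label appended.
    partition = sorted(
        partition,
        key=lambda t: (tuple(int(v) for v in np.asarray(t[0]).ravel()), t[1]))
    # 2. Use the partition index as the random seed, so every partition gets a
    #    reproducible but distinct stream of augmentation/initialization noise.
    random.seed(i)
    np.random.seed(i)
    torch.manual_seed(i)
    torch.backends.cudnn.deterministic = True   # avoid nondeterministic kernels
    torch.backends.cudnn.benchmark = False
    return partition
```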

2.3 SS-DPA

Semi-Supervised DPA (SS-DPA) is a defense against label-flipping attacks. For this defense, the base classifier may be a semi-supervised learning algorithm: it can use the entire unlabeled training dataset, in addition to the labels for a single partition. We therefore define the base classifier to also accept an unlabeled dataset as input: f : 𝒫(𝒮 × ℤ) × 𝒫(𝒮) × 𝒮 → ℤ. Additionally, our method of partitioning the data is modified both to ensure that changing the label of a sample affects only one partition rather than possibly two, and to create a more equal distribution of samples between partitions.

First, we sort the unlabeled data samples(T), and assign to each labeled sample t ∈ T the index of its unlabeled sample in this sorted list:

$$\mathrm{idx}_T(t) := \mathrm{index}_{\,\mathrm{sort}(\mathrm{samples}(T))}\big(\mathrm{sample}(t)\big) \qquad (6)$$

For a sample t ∈ T, note that idx_T(t) is invariant under any label-flipping attack on T, and also under permutation of the training data as they are read. We now partition the data based on sorted index:

$$P_i := \{\, t \in T \;:\; \mathrm{idx}_T(t) \equiv i \pmod{k} \,\}, \qquad i \in [k] \qquad (7)$$
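
A minimal sketch of this sorting-based partitioning (assuming images given as arrays and no repeated images; names are ours):

```python
import numpy as np

def ssdpa_partitions(training_set, k):
    """Split labeled data into k partitions by the sorted index of each
    sample's (unlabeled) image, so that flipping a label can affect only one
    partition. `training_set` is a list of (image, label) pairs; images are
    assumed unique (see Appendix G for the general case)."""
    def pixel_key(image):
        return tuple(int(v) for v in np.asarray(image).ravel())

    # Rank of every unique image in the lexicographically sorted unlabeled data.
    sorted_images = sorted((img for img, _ in training_set), key=pixel_key)
    rank = {pixel_key(img): idx for idx, img in enumerate(sorted_images)}

    partitions = [[] for _ in range(k)]
    for img, label in training_set:
        partitions[rank[pixel_key(img)] % k].append((img, label))
    return partitions
```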

Note that in this partitioning scheme, we no longer need a hash function h. Moreover, this scheme creates a more uniform distribution of samples between partitions, compared with the hashing scheme used in DPA; this can lead to improved certificates (see Appendix E). This sorting-based partitioning is possible because the unlabeled samples are "clean", so we can rely on their ordering, when sorted, to remain fixed. As in DPA, we train k base classifiers, one on each partition, this time additionally using the entire unlabeled training set:

$$f_i(x) := f\big(P_i, \mathrm{samples}(T), x\big), \qquad i \in [k] \qquad (8)$$

The inference procedure is the same as in the standard DPA:

$$g_{\mathrm{SS\text{-}DPA}}(T, x) := \arg\max_{c}\, n_c(x), \qquad n_c(x) := \big|\{\, i \in [k] \;:\; f_i(x) = c \,\}\big| \qquad (9)$$

The SS-DPA algorithm provides the following robustness guarantee against label-flipping attacks. (Footnote 2: The theorem as stated assumes that there are no repeated unlabeled samples (with different labels) in the training set T. This is a reasonable assumption, and in the label-flipping attack model, the attacker cannot cause this assumption to be broken. Without this assumption, the analysis is more complicated; see Appendix G.)

Theorem 2.

For a fixed deterministic semi-supervised base classifier f, ensemble size k, training set T (with no repeated unlabeled samples), and input x, let:

$$c^* := g_{\mathrm{SS\text{-}DPA}}(T, x), \qquad \rho := \left\lfloor \frac{n_{c^*}(x) - \max_{c \neq c^*}\big(n_c(x) + \mathbb{1}_{c < c^*}\big)}{2} \right\rfloor \qquad (10)$$

Then, for any poisoned training set T′ obtained by changing the labels of at most ρ samples in T, g_SS-DPA(T′, x) = g_SS-DPA(T, x).

2.3.1 Semi-Supervised Learning Methods for SS-DPA

In the standard DPA algorithm, we are able to train each classifier in the ensemble using only a small fraction of the training data; this means that each classifier can be trained relatively quickly: as the number of classifiers k increases, the time to train each classifier decreases (see Table 1). However, in a naive implementation of SS-DPA, Equation 8 might suggest that training time will scale with k, because each semi-supervised base classifier must be trained on the entire (unlabeled) training set. Indeed, with many popular and highly effective semi-supervised classification algorithms, such as temporal ensembling (Laine and Aila, 2017), ICT (Verma et al., 2019), Teacher Graphs (Luo et al., 2018) and generative approaches (Kingma et al., 2014), the main training loop trains on both labeled and unlabeled samples, so the total training time would scale linearly with k. In order to avoid this, we instead choose a semi-supervised training method in which the unlabeled samples are used only to learn semantic features of the data, before the labeled samples are introduced: this allows us to use the unlabeled samples only once, and to then share the learned feature representations when training each base classifier. In our experiments, we choose the RotNet model introduced by Gidaris et al. (2018), which first trains a network to infer the angle of rotation of rotated forms of unlabeled images; an intermediate layer of this network is then used as features for the final classification. As discussed in Section 2.2.1, we also sort the data prior to learning (including when learning unsupervised features), and set random seeds, in order to ensure determinism.
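
The structure described above, a shared self-supervised backbone trained once on the unlabeled data plus one small head per partition, can be sketched as follows. This is our own illustrative simplification (a toy backbone and plain SGD), not the RotNet/NiN code used in the experiments:

```python
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    """Stand-in for the RotNet feature extractor (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, x):
        return self.features(x)

def train_ssdpa_heads(backbone, labeled_partitions, num_classes=10, epochs=10):
    """Train one small classification head per partition on frozen, shared
    features. The unlabeled data only needs to be used once, to train the
    backbone (e.g., on a rotation-prediction pretext task, not shown here)."""
    for p in backbone.parameters():
        p.requires_grad = False                      # shared features are frozen
    heads = []
    for i, partition in enumerate(labeled_partitions):  # list of (tensor, int) pairs
        torch.manual_seed(i)                         # deterministic, distinct seeds
        head = nn.Linear(32, num_classes)
        opt = torch.optim.SGD(head.parameters(), lr=0.1)
        for _ in range(epochs):
            for image, label in partition:           # image: (1, 28, 28) tensor
                opt.zero_grad()
                logits = head(backbone(image.unsqueeze(0)))
                loss = nn.functional.cross_entropy(logits, torch.tensor([label]))
                loss.backward()
                opt.step()
        heads.append(head)
    return heads
```

At inference time, each head's prediction on the shared features counts as one vote, and the votes are aggregated exactly as in DPA.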

3 Results

In this section, we present empirical results evaluating the performance of the proposed methods, DPA and SS-DPA, against poisoning attacks on the MNIST and CIFAR-10 datasets. As discussed in Section 2.3.1, we use the RotNet architecture (Gidaris et al., 2018) for semi-supervised learning. Conveniently, the RotNet architecture is structured such that the feature-extracting layers, combined with the final classification layers, together make up the Network-In-Network (NiN) architecture for supervised classification (Lin et al., 2013). We use NiN for DPA's supervised training, and RotNet for SS-DPA's semi-supervised training. On CIFAR-10, we use training parameters, for both DPA (NiN) and SS-DPA (RotNet), directly from Gidaris et al. (2018). (Footnote 3: In addition to the de-randomization changes mentioned in Section 2.2.1, we made one modification to the NiN 'baseline' for supervised learning: the baseline implementation in Gidaris et al. (2018), even when trained on a small subset of the training data, uses normalization constants derived from the entire training set. This is a (minor) error in Gidaris et al. (2018) that we correct by calculating normalization constants on each subset.) On MNIST, we use the same architectures and training parameters, with a slight modification: we eliminate horizontal flips in data augmentation, because, unlike in CIFAR-10, horizontal alignment is semantically meaningful for digits.

Results are presented in Figures 2 and 3, and are summarized in Table 1. Our metric, Certified Accuracy as a function of attack magnitude (symmetric-difference or label-flips), refers to the fraction of samples which are both correctly classified and are certified as robust to attacks of that magnitude. Note that different poisoning perturbations, which poison different sets of training samples, may be required to poison each test sample; i.e. we assume the attacker can use the attack budget separately for each test sample. Table 1 also reports Median Certified Robustness, the attack magnitude to which at least 50% of the test set is provably robust.

| Dataset, method | Training set size | Number of partitions k | Median certified robustness | Clean accuracy | Base classifier accuracy | Training time per partition |
|---|---|---|---|---|---|---|
| MNIST, DPA | 60000 | 1200 | 448 | 95.82% | 77.00% | 0.28 min |
| MNIST, DPA | 60000 | 3000 | 509 | 93.28% | 49.59% | 0.26 min |
| MNIST, SS-DPA | 60000 | 1200 | 493 | 95.91% | 81.70% | 0.11 min |
| MNIST, SS-DPA | 60000 | 3000 | 675 | 93.93% | 59.21% | 0.10 min |
| CIFAR, DPA | 50000 | 50 | 9 | 70.29% | 56.25% | 1.42 min |
| CIFAR, DPA | 50000 | 250 | 6 | 55.72% | 35.18% | 0.56 min |
| CIFAR, DPA | 50000 | 1000 | N/A | 44.30% | 23.25% | 0.36 min |
| CIFAR, SS-DPA | 50000 | 50 | 21 | 81.97% | 74.73% | 0.72 min |
| CIFAR, SS-DPA | 50000 | 250 | 59 | 74.60% | 61.98% | 0.28 min |
| CIFAR, SS-DPA | 50000 | 1000 | 83 | 68.22% | 44.26% | 0.12 min |
Table 1: Summary statistics for the DPA and SS-DPA algorithms on MNIST and CIFAR-10. Median certified robustness is the attack magnitude (symmetric difference for DPA, label flips for SS-DPA) at which certified accuracy is 50%. Training times are on a single GPU; note that many partitions can be trained in parallel. Note that we observe some constant overhead time for training each classifier, so on MNIST, where the training time per image is small, k has little effect on the training time per partition.
Figure 2: Certified accuracy against poisoning attacks on MNIST, using (a) DPA to certify against general poisoning attacks, and (b) SS-DPA to certify against label-flipping attacks. Dashed lines show the clean accuracies of each model. The hyperparameter k is the number of base classifier models in the ensemble used to make the final classification.
Figure 3: Certified accuracy against poisoning attacks on CIFAR-10, using (a) DPA to certify against general poisoning attacks, and (b) SS-DPA to certify against label-flipping attacks. The hyperparameter k is the number of base classifier models in the ensemble used to make the final classification.

As shown in Figure 1, our SS-DPA method substantially outperforms the existing certificate (Rosenfeld et al., 2020) on label-flipping attacks. With DPA, we are also able to certify at least half of MNIST images against attacks of over 500 poisoning insertions or deletions. On CIFAR-10, on which Rosenfeld et al. (2020) does not test, DPA can certify at least half of images against 9 poisoning insertions or deletions, and SS-DPA can certify against over 80 label flips. In summary, we provide a new state-of-the-art certified defense against label-flipping attacks using SS-DPA, while also providing the first certified defense against a much more general poisoning threat model.

The hyperparameter k controls the number of classifiers in the ensemble: because each sample is used in training exactly one classifier, the average number of samples used to train each classifier is inversely proportional to k. Therefore, we observe that the base classifier accuracy (and therefore also the final ensemble classifier accuracy) decreases as k is increased; see Table 1. However, because the certificates described in Theorems 1 and 2 depend directly on the gap between the numbers of classifiers in the ensemble which output the top and runner-up classes, larger numbers of classifiers are necessary to achieve large certificates. In fact, using k classifiers, the largest certified robustness possible is ⌊k/2⌋. Thus, we see in Figures 2 and 3 that larger values of k tend to produce larger robustness certificates. Therefore k controls a trade-off between robustness and accuracy.
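
As a concrete illustration of this trade-off, consider hypothetical vote counts (ours, for illustration only): with k = 1200 base classifiers, a test point receiving 1000 votes for class c* = 2, 150 votes for class 0, and 50 votes for class 5 obtains the certificate

$$\rho = \left\lfloor \frac{n_{c^*}(x) - \max_{c \neq c^*}\big(n_c(x) + \mathbb{1}_{c < c^*}\big)}{2} \right\rfloor = \left\lfloor \frac{1000 - (150 + 1)}{2} \right\rfloor = 424,$$

while with k = 50 and a proportionally similar vote split (42 / 6 / 2) the same formula gives only ⌊(42 − 7)/2⌋ = 17: large certificates require large k, at the cost of base classifier accuracy.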

Rosenfeld et al. (2020) also reports robustness certificates against label-flipping attacks on binary MNIST classification, with classes 1 and 7. Rosenfeld et al. (2020) reports clean accuracy of 94.5% and certified accuracies for attack magnitudes up to 2000 label flips (out of 13007), with the best certified accuracy less than 70%. By contrast, using a specialized form of SS-DPA, we are able to achieve clean accuracy of , with every correctly-classified image certifiably robust up to 5952 label flips (i.e., the certified accuracy at 5952 label flips equals the clean accuracy). This represents a substantial improvement. We present details of this experiment in Appendix B.

4 Conclusion

In this paper, we described a novel approach to provable defenses against poisoning attacks. Unlike previous techniques, our method both allows for exact, deterministic certificates and can be implemented using deep neural networks. These advantages allow us to outperform the current state-of-the-art on label-flip attacks, and to develop the first certified defense against a broadly defined class of general poisoning attacks.

Broader Impact

Many datasets for machine learning often originate from information gathered from the public, through crowdsourcing, web scraping, or analysing user behavior in web applications. Systems trained on such datasets can then be used in highly critical applications such as web content moderation or fraud detection. Thus, it is important to develop learning algorithms that are robust to malicious distortions of the training data, i.e. poisoning attacks. We believe this work is useful to these ends. Additionally, because our algorithm provides provable certificates, it can be useful in establishing trust in machine learning systems: even in a non-adversarial setting, a user may feel more confident knowing that the system’s decision did not hinge on a small number of possibly-spurious training examples. In this way, the certificate can be viewed as an accessible, transparent, easy-to-interpret metric of confidence. We also note that this work presents a purely defensive method against poisoning attacks; i.e. we are not revealing any new vulnerabilities in machine learning systems by presenting our techniques.

Acknowledgements

This project was supported in part by NSF CAREER AWARD 1942230, HR 00111990077, HR001119S0026 and Simons Fellowship on “Foundations of Deep Learning.”

References

  • S. Amari, N. Fujita, and S. Shinomoto (1992). Four types of learning curves. Neural Computation 4(4), pp. 605–618.
  • B. Biggio, I. Corona, G. Fumera, G. Giacinto, and F. Roli (2011). Bagging classifiers for fighting poisoning attacks in adversarial classification tasks. In International Workshop on Multiple Classifier Systems, pp. 350–359.
  • B. Biggio, B. Nelson, and P. Laskov (2012). Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning (ICML'12), Madison, WI, USA, pp. 1467–1474.
  • A. Blum, T. Dick, N. Manoj, and H. Zhang (2020). Random smoothing might be unable to certify ℓ∞ robustness for high-dimensional images. arXiv preprint arXiv:2002.03517.
  • L. Breiman (1996). Bagging predictors. Machine Learning 24(2), pp. 123–140.
  • P. L. Bühlmann (2003). Bagging, subagging and bragging for improving some prediction algorithms. Research Report, Seminar für Statistik, Eidgenössische Technische Hochschule (ETH), Vol. 113.
  • A. Buja and W. Stuetzle (2006). Observations on bagging. Statistica Sinica, pp. 323–351.
  • X. Chen, C. Liu, B. Li, K. Lu, and D. Song (2017). Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526.
  • J. Cohen, E. Rosenfeld, and Z. Kolter (2019). Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp. 1310–1320.
  • S. Gidaris, P. Singh, and N. Komodakis (2018). Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations.
  • S. Gowal, K. Dvijotham, R. Stanforth, R. Bunel, C. Qin, J. Uesato, T. Mann, and P. Kohli (2018). On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715.
  • D. P. Kingma, S. Mohamed, D. Jimenez Rezende, and M. Welling (2014). Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems 27, pp. 3581–3589.
  • A. Kumar, A. Levine, T. Goldstein, and S. Feizi (2020). Curse of dimensionality on randomized smoothing for certifiable robustness. arXiv preprint arXiv:2002.03239.
  • S. Laine and T. Aila (2017). Temporal ensembling for semi-supervised learning. In 5th International Conference on Learning Representations (ICLR 2017), Toulon, France.
  • M. Lecuyer, V. Atlidakis, R. Geambasu, D. Hsu, and S. Jana (2019). Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 726–742.
  • G. Lee, Y. Yuan, S. Chang, and T. Jaakkola (2019). Tight certificates of adversarial robustness for randomly smoothed classifiers. In Advances in Neural Information Processing Systems, pp. 4911–4922.
  • A. Levine and S. Feizi (2019). Wasserstein smoothing: certified robustness against Wasserstein adversarial attacks. arXiv preprint arXiv:1910.10783.
  • A. Levine and S. Feizi (2020a). (De)randomized smoothing for certifiable defense against patch attacks. arXiv preprint arXiv:2002.10733.
  • A. Levine and S. Feizi (2020b). Robustness certificates for sparse adversarial attacks by randomized ablation. In AAAI Conference on Artificial Intelligence (AAAI).
  • B. Li, C. Chen, W. Wang, and L. Carin (2018). Second-order adversarial attack and certifiable robustness. arXiv preprint arXiv:1809.03113.
  • M. Lin, Q. Chen, and S. Yan (2013). Network in network. arXiv preprint arXiv:1312.4400.
  • Y. Luo, J. Zhu, M. Li, Y. Ren, and B. Zhang (2018). Smooth neighbors on teacher graphs for semi-supervised learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019). PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035.
  • E. Rosenfeld, E. Winston, P. Ravikumar, and J. Z. Kolter (2020). Certified robustness to label-flipping attacks via randomized smoothing. arXiv preprint arXiv:2002.03018.
  • H. Salman, G. Yang, J. Li, P. Zhang, H. Zhang, I. Razenshteyn, and S. Bubeck (2019). Provably robust deep learning via adversarially trained smoothed classifiers. arXiv preprint arXiv:1906.04584.
  • A. Shafahi, W. R. Huang, M. Najibi, O. Suciu, C. Studer, T. Dumitras, and T. Goldstein (2018). Poison frogs! Targeted clean-label poisoning attacks on neural networks. In Advances in Neural Information Processing Systems 31, pp. 6103–6113.
  • C. Smutz and A. Stavrou (2016). When a tree falls: using diversity in ensemble classifiers to identify evasion in malware detectors. In 23rd Annual Network and Distributed System Security Symposium (NDSS 2016), San Diego, California, USA.
  • J. Steinhardt, P. W. W. Koh, and P. S. Liang (2017). Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems, pp. 3517–3529.
  • V. Verma, A. Lamb, J. Kannala, Y. Bengio, and D. Lopez-Paz (2019). Interpolation consistency training for semi-supervised learning. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 3635–3641.
  • M. Weber, X. Xu, B. Karlas, C. Zhang, and B. Li (2020). RAB: provable robustness against backdoor attacks. arXiv preprint arXiv:2003.08904.
  • E. Wong and Z. Kolter (2018). Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp. 5283–5292.
  • H. Xiao, H. Xiao, and C. Eckert (2012). Adversarial label flips attack on support vector machines. In Proceedings of the 20th European Conference on Artificial Intelligence, pp. 870–875.
  • G. Yang, T. Duan, E. Hu, H. Salman, I. Razenshteyn, and J. Li (2020). Randomized smoothing of all shapes and sizes. arXiv preprint arXiv:2002.08118.
  • F. Zaman and H. Hirose (2009). Effect of subsampling rate on subbagging and related ensembles of stable classifiers. In Pattern Recognition and Machine Intelligence, Berlin, Heidelberg, pp. 44–49.

Appendix A Proofs

Theorem 1.

For a fixed deterministic base classifier f, hash function h, ensemble size k, training set T, and input x, let:

$$c^* := g_{\mathrm{DPA}}(T, x), \qquad \rho := \left\lfloor \frac{n_{c^*}(x) - \max_{c \neq c^*}\big(n_c(x) + \mathbb{1}_{c < c^*}\big)}{2} \right\rfloor \qquad (11)$$

Then, for any poisoned training set T′, if |T ⊖ T′| ≤ ρ, we have: g_DPA(T′, x) = g_DPA(T, x).

Proof.

We define the partitions, trained classifiers, counts, and consensus outputs for each training set (T and T′) as described in the main text: for U ∈ {T, T′},

$$P_i^{U} := \{\, t \in U \;:\; h(t) \equiv i \pmod{k} \,\} \qquad (12)$$
$$f_i^{U}(x) := f(P_i^{U}, x) \qquad (13)$$
$$n_c^{U}(x) := \big|\{\, i \in [k] \;:\; f_i^{U}(x) = c \,\}\big| \qquad (14)$$
$$g^{U}(x) := \arg\max_{c}\, n_c^{U}(x) \qquad (15)$$

Note that here, we are using superscripts to explicitly distinguish between partitions (as well as base classifiers and counts) of the clean training set T and the poisoned training set T′ (i.e., n_c^T(x) is equivalent to n_c(x) in the main text). In Equation 15, as discussed in the main text, when taking the argmax, we break ties deterministically by returning the smaller class index.

Note that P_i^T = P_i^{T′} unless there is some t ∈ T ⊖ T′ with h(t) ≡ i (mod k). Because the mapping from t to h(t) mod k is a deterministic function, the number of indices i for which P_i^T ≠ P_i^{T′} is at most |T ⊖ T′|, which is at most ρ. P_i^T = P_i^{T′} implies f_i^T = f_i^{T′}, so the number of classifiers for which f_i^T(x) ≠ f_i^{T′}(x) is also at most ρ. Then:

$$\forall c: \quad \big| n_c^{T}(x) - n_c^{T'}(x) \big| \le \rho \qquad (16)$$

Let c* := g^T(x). Note that g^{T′}(x) = c* iff, for every class c ≠ c*,

$$n_{c^*}^{T'}(x) > n_c^{T'}(x) \quad\text{or}\quad \big( n_{c^*}^{T'}(x) = n_c^{T'}(x) \ \text{and}\ c^* < c \big),$$

where the separate cases come from the deterministic selection of the smaller index in cases of ties in Equation 15; this can be condensed to n_{c*}^{T′}(x) ≥ n_c^{T′}(x) + 𝟙_{c < c*}. Then, by the triangle inequality with Equation 16, we have that g^{T′}(x) = c* if n_{c*}^{T}(x) ≥ n_c^{T}(x) + 𝟙_{c < c*} + 2ρ for all c ≠ c*. This condition is true by the definition of ρ, so g^{T′}(x) = g^{T}(x). ∎

Theorem 2.

For a fixed deterministic semi-supervised base classifier f, ensemble size k, training set T (with no repeated unlabeled samples), and input x, let:

$$c^* := g_{\mathrm{SS\text{-}DPA}}(T, x), \qquad \rho := \left\lfloor \frac{n_{c^*}(x) - \max_{c \neq c^*}\big(n_c(x) + \mathbb{1}_{c < c^*}\big)}{2} \right\rfloor \qquad (17)$$

Then, for any poisoned training set T′ obtained by changing the labels of at most ρ samples in T, g_SS-DPA(T′, x) = g_SS-DPA(T, x).

Proof.

Recall the partition definition (Equation 7):

$$P_i := \{\, t \in T \;:\; \mathrm{idx}_T(t) \equiv i \pmod{k} \,\} \qquad (18)$$

Because samples(T) = samples(T′), we have sort(samples(T)) = sort(samples(T′)), and therefore idx_T = idx_{T′}. We can then define partitions and base classifiers for each training set (T and T′) as described in the main text: for U ∈ {T, T′},

$$P_i^{U} := \{\, t \in U \;:\; \mathrm{idx}_U(t) \equiv i \pmod{k} \,\} \qquad (19)$$
$$f_i^{U}(x) := f\big(P_i^{U}, \mathrm{samples}(U), x\big) \qquad (20)$$

Recall that for any t ∈ T, idx_T(t) is invariant under a label-flipping attack on T. Then, for each i, the samples in P_i^{T′} will be the same as the samples in P_i^{T}, possibly with some labels flipped. In particular, the functions f_i^{T} and f_i^{T′} will be identical unless the label of some sample in P_i^{T} has been changed. If at most ρ labels change, at most ρ ensemble classifiers are affected; the rest of the proof proceeds as in the proof of Theorem 1. ∎

Appendix B Binary MNIST Experiments

We perform a specialized instance of SS-DPA on the binary '1' versus '7' MNIST classification task. Specifically, we set k equal to the number of training samples, so that every partition receives only one label. We first use 2-means clustering on the unlabeled data to compute two cluster centroids. This allows each base classifier to use a very simple "semi-supervised learning algorithm": if the test image and the one labeled training image provided to the base classifier belong to the same cluster, then the base classifier assigns the label of the training image to the test image. Otherwise, it assigns the opposite label to the test image. Formally, letting t_i denote the single labeled sample in partition P_i and cluster(·) the nearer of the two centroids:

$$f\big(P_i, \mathrm{samples}(T), x\big) := \begin{cases} \mathrm{label}(t_i) & \text{if } \mathrm{cluster}(x) = \mathrm{cluster}\big(\mathrm{sample}(t_i)\big), \\ \text{the opposite label} & \text{otherwise.} \end{cases}$$

Note that each base classifier behaves identically, up to a transposition of the labels: so in practice, we simply count the training samples which associate each of the two cluster centroids with each of the two labels, and determine the number of label flips which would be required to change the consensus label assignments of the clusters. At test time, each test image therefore needs to be processed only once; the amount of time required for inference is simply the time needed to calculate the distance from the test sample to each of the two cluster centroids. This also means that every image has the same robustness certificate. As stated in the main text, using this method, we are able to achieve clean accuracy of , with every correctly-classified image certifiably robust up to 5952 label flips (i.e., the certified accuracy at 5952 label flips equals the clean accuracy). This means that the classifier is robust to adversarial label flips on 45.8% of the training data.
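
A sketch of this specialized procedure, with scikit-learn's KMeans standing in for the 2-means step; array shapes, names, and the conservative tie handling are our own choices:

```python
import numpy as np
from sklearn.cluster import KMeans

def binary_ssdpa(train_images, train_labels, test_images):
    """train_images: (n, d) float array; train_labels: (n,) array over {1, 7}.
    Returns test predictions and a single label-flip certificate that applies
    to every test image (since all base classifiers share the clustering)."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(train_images)
    train_clusters = km.predict(train_images)

    # Each labeled training sample is one base classifier, and effectively
    # votes for one of the two possible cluster-to-label assignments.
    votes = np.zeros(2, dtype=int)  # votes[0]: cluster 0 <-> '1'; votes[1]: cluster 0 <-> '7'
    for c, y in zip(train_clusters, train_labels):
        votes[0 if (c == 0) == (y == 1) else 1] += 1

    winner = int(np.argmax(votes))
    # Each label flip moves one vote across, closing the gap by 2; treating an
    # exact tie as adversarial gives a slightly conservative certificate.
    certificate = (abs(int(votes[0]) - int(votes[1])) - 1) // 2

    cluster0_label = 1 if winner == 0 else 7
    preds = np.where(km.predict(test_images) == 0, cluster0_label, 8 - cluster0_label)
    return preds, certificate
```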

Rosenfeld et al. [2020] also reports robustness certificates against label-flipping attacks on binary MNIST classification with classes 1 and 7. Rosenfeld et al. [2020] reports clean accuracy of 94.5% and certified accuracies for attack magnitudes up to 2000 label flips (out of 13007, i.e., 15.4% of the training labels), with the best certified accuracy less than 70%.

Appendix C Relationship to Randomized Ablation

As mentioned in Section 1, (SS-)DPA can be viewed as an adaptation of Randomized Ablation [Levine and Feizi, 2020b], a certified defense against sparse inference-time (evasion) attacks, to training-time poisoning attacks. In Randomized Ablation, the final classification is a consensus among classifications of many copies of the image; in each copy, a fixed number of pixels are randomly ablated (replaced with a null value). A direct application of Randomized Ablation to poisoning attacks would require each base classifier to be trained on a random subset of the training data, with each base classifier's training set chosen randomly and independently. Due to the randomized nature of this algorithm, estimation error would have to be considered in practice when applying Randomized Ablation using a finite number of base classifiers: this decreases the certificates that can be reported, while also introducing a failure probability to the certificates. By contrast, in our algorithms, the partitions are deterministic and use disjoint, rather than independent, samples. In this section, we argue that our derandomization has little effect on the certified accuracies compared to randomized ablation, even considering randomized ablation with no estimation error (i.e., with infinitely many base classifiers). In the poisoning case specifically, using additional base classifiers is expensive, because they must each be trained, so one would observe a large estimation error when using a realistic number of base classifiers. Therefore our derandomization can potentially improve the certificates which can be reported, while also allowing for exact certificates using a finite number of base classifiers.

For simplicity, consider the label-flipping case. In this case, the training set has a fixed size n, so Randomized Ablation bounds can be considered directly. A direct adaptation of Levine and Feizi [2020b] would, for each base classifier, choose s out of the n samples to retain labels for, and would ablate the labels of the rest of the training data. Suppose an adversary has flipped r labels. For each base classifier, the probability that a flipped label is used in classification (and therefore that the base classifier is 'poisoned') is:

$$\Delta_{\mathrm{RA}} := 1 - \frac{\binom{n-r}{s}}{\binom{n}{s}}, \qquad (21)$$

where "RA" stands for Randomized Ablation.

In this direct adaptation, one must then use a very large ensemble of randomized classifiers. The ensemble must be large enough that we can estimate with high confidence the probabilities (over the distribution of possible choices of training labels to retain) that the base classifier selects each class. If the gap between the highest and the next-highest class probabilities can be determined to be greater than 2Δ_RA, then the consensus classification cannot be changed by flipping r labels. This is because, at worst, every poisoned classifier could switch its classification from the highest class to the runner-up class, reducing the gap by at most 2Δ_RA.

Note that this estimation relies on each base classifier using a subset of labels selected randomly and independently from the other base classifiers. In contrast, our SS-DPA method selects the subsets disjointly. If r labels are flipped, then assuming that in the worst case each flipped label falls in a different partition, the union bound gives the proportion of base classifiers which can be poisoned as

$$\Delta_{\mathrm{DPA}} := \frac{r}{k} = \frac{rs}{n}, \qquad (22)$$

where for simplicity we assume that k evenly divides the number of samples n, so that s = n/k labels are kept by each partition. Again, we need the gap in class probabilities to be at least 2Δ_DPA to ensure robustness. While the use of the union bound might suggest that our deterministic scheme (Equation 22) leads to a significantly looser bound than the probabilistic certificate (Equation 21), this is not the case in practice, where r is small compared to k. For example, for an MNIST-sized dataset with typical values of s and r, Δ_RA and Δ_DPA are nearly equal (see the numerical sketch below). The derandomization only slightly increases the required gap between the top two class probabilities.
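
The two quantities in Equations 21 and 22 are straightforward to compare numerically; the values of n, s, and r below are illustrative choices of ours, not the paper's example numbers:

```python
def delta_ra(n, s, r):
    """Eq. 21: probability that a base classifier keeping a uniformly random
    subset of s of the n labels sees at least one of the r flipped labels,
    i.e. 1 - C(n-r, s) / C(n, s), computed as a running product."""
    prob_all_clean = 1.0
    for i in range(s):
        prob_all_clean *= (n - r - i) / (n - i)
    return 1.0 - prob_all_clean

def delta_dpa(n, s, r):
    """Eq. 22: union-bound fraction of the k = n/s disjoint partitions that
    can contain at least one flipped label."""
    return r * s / n          # equivalently r / k with k = n / s

# Illustrative values (ours): an MNIST-sized label set, 50 labels per partition.
n, s, r = 60000, 50, 200
print(delta_ra(n, s, r))      # ~0.154
print(delta_dpa(n, s, r))     # ~0.167
```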

To understand this, note that if the number of poisonings r is small compared to the number of partitions k, then even if the partitions were random and independent, the chance that any two poisonings occur in the same partition is quite small. In that case, the union bound in Equation 22 is actually quite close to an independence assumption. By accepting this small increase in the upper bound of the probability that each base classification is poisoned, our method provides all of the benefits of de-randomization, including allowing for exact robustness certificates using only a finite number of classifiers. Additionally, note that in the Randomized Ablation case, the empirical gap in estimated class probabilities must be somewhat larger than 2Δ_RA in order to certify robustness with high confidence, due to estimation error: the required gap increases further as the number of base classifiers decreases. This is particularly important in the poisoning case, because training a large number of classifiers is substantially more expensive than performing a large number of evaluations, as in randomized smoothing for evasion attacks.

We also note that Levine and Feizi [2020a] used a de-randomized scheme based on Randomized Ablation to certifiably defend against patch evasion attacks. However, in that work, the de-randomization does not involve a union bound over arbitrary partitions of the vulnerable inputs. Instead, in the case of patch attacks, the attack is geometrically constrained: the image is therefore divided into geometric regions (bands or blocks) such that the attacker can only overlap a fixed number of these regions, and each base classifier uses only a single region to make its classification. Levine and Feizi [2020a] do not apply de-randomization via disjoint subsets and a union bound to defend against poisoning attacks. We do, however, borrow from Levine and Feizi [2020a] the deterministic "tie-breaking" technique used when evaluating the consensus class in Equation 4, which can increase our robustness certificate by up to one.

Appendix D Relationship to Existing Ensemble Methods

As mentioned in Section 1, our proposed method is related to classical ensemble approaches in machine learning, namely bootstrap aggregation ("bagging") and subset aggregation ("subagging") [Breiman, 1996, Buja and Stuetzle, 2006, Bühlmann, 2003, Zaman and Hirose, 2009]. In these methods, each base classifier in the ensemble is trained on an independently sampled collection of points from the training set: this means that multiple classifiers in the ensemble may be trained on the same sample point. The purpose of these methods has typically been to improve generalization, and therefore to improve test set accuracy: bagging and subagging decrease the variance component of the classifier's error.

In subagging, each training set for a base classifier is an independently sampled subset of the training data: this is in fact an identical formulation to the "direct Randomized Ablation" approach discussed in Appendix C. However, in practice, the size of each training subset has typically been quite large: the bias error term increases with decreasing subsample size [Buja and Stuetzle, 2006]. Thus, the optimal subsample size for maximum accuracy is large: Bühlmann [2003] recommends using n/2 samples per classifier ("half-subagging"), with theoretical justification for optimal generalization. This would not be useful in Randomized Ablation-like certification, because any one poisoned element would affect half of the ensemble. Indeed, in our certifiably robust classifiers, we observe a trade-off between accuracy and certified robustness: our use of many very small partitions is clearly not optimal for test-set accuracy (Table 1).

In bagging, the samples in each base classifier training set are chosen with replacement, so elements may be repeated in the training “set” for a single base classifier. Bagging has been proposed as an empirical defense against poisoning attacks [Biggio et al., 2011] as well as for evasion attacks [Smutz and Stavrou, 2016]. However, to our knowledge, these techniques have not yet been used to provide certified robustness.

Appendix E SS-DPA with Hashing

It is possible to use hashing, as in DPA, in order to partition data for SS-DPA: as long as the hash function does not use the sample label in assigning a partition (as ours indeed does not), it will always assign an image to the same partition regardless of label-flipping, so only one partition will be affected by a label flip. Therefore, the SS-DPA label-flipping certificate remains correct. However, as explained in the main text, treating the unlabeled data as trustworthy allows us to partition the samples evenly among partitions using sorting. This is motivated by the classical understanding in machine learning (e.g., Amari et al. [1992]) that learning curves (the test error versus the number of samples that a classifier is trained on) tend to be convex-like. The test error of a base classifier is then approximately a convex function of that base classifier's partition size. Therefore, if the partition size is a random variable, by Jensen's inequality, the expected test error at the (random) partition size is greater than the test error at the mean partition size. Setting all base classifiers to use the mean number of samples should then maximize the average base classifier accuracy.
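
In symbols (with err(·) our shorthand for the approximately convex expected test error of a base classifier as a function of its partition size S), Jensen's inequality gives

$$\mathbb{E}_S\big[\operatorname{err}(S)\big] \;\ge\; \operatorname{err}\big(\mathbb{E}[S]\big),$$

so equally sized partitions (error evaluated at the mean size, as produced by sorting) are expected to be at least as accurate on average as randomly sized partitions with the same mean (as produced by hashing), which is what Table 2 checks empirically.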

| Dataset, method | Number of partitions k | Median certified robustness (Hash / Sort) | Clean accuracy (Hash / Sort) | Base classifier accuracy (Hash / Sort) |
|---|---|---|---|---|
| MNIST, SS-DPA | 1200 | 469 / 493 | 95.88% / 95.91% | 79.59% / 81.70% |
| MNIST, SS-DPA | 3000 | 626 / 675 | 93.94% / 93.93% | 56.33% / 59.21% |
| CIFAR, SS-DPA | 50 | 21 / 21 | 81.96% / 81.97% | 74.66% / 74.73% |
| CIFAR, SS-DPA | 250 | 59 / 59 | 74.47% / 74.60% | 61.83% / 61.98% |
| CIFAR, SS-DPA | 1000 | 82 / 83 | 68.24% / 68.22% | 43.94% / 44.26% |
Table 2: Comparison of SS-DPA with hashing ('Hash' columns, described in Appendix E) to the SS-DPA algorithm with partitions determined by sorting ('Sort' columns, described in the main text). Note that partitioning via sorting consistently results in higher base classifier accuracies, and can increase (and never decreases) median certified robustness. These effects seem to be larger on MNIST than on CIFAR-10.

To validate this reasoning, we tested SS-DPA with partitions determined by hashing (using the same partitions as we used in DPA), rather than the sorting method described in the main text. See Table 2 for results. As expected, the average base classifier accuracy decreased in all experiments when using the DPA hashing, compared to using the sorting method of SS-DPA. However, the effect was minimal in the CIFAR-10 experiments: the main advantage of the sorting method was seen on MNIST. This is partly because we used more partitions, and hence fewer average samples per partition, in the MNIST experiments: fewer average samples per partition creates a greater variation in the number of samples per partition in the hashing method. However, CIFAR-10 with k = 1000 and MNIST with k = 1200 both average 50 samples per partition, yet the base classifier accuracy difference was still much more significant on MNIST (2.11%) compared to CIFAR-10 (0.32%).

On the MNIST experiments, where the base classifier accuracy gap was observed, we also saw that the effect of hashing on the smoothed classifier was mainly to decrease the certified robustness, and that there was not a significant effect on the clean smoothed classifier accuracy. As discussed in Appendix F, this may imply that the outputs of the base classifiers using the sorting method are more correlated, in addition to being more accurate.

Appendix F Effect of Random Seed Selection

In Section 2.2.1, we mention that we "deterministically" choose different random seeds for training each partition, rather than training every partition with the same random seed. For a comparison between using distinct seeds and the same seed for each partition, see Table 3. Note that there is not a large, consistent effect across experiments on either the base classifier accuracy or the median certified robustness; however, across all 10 experiments, distinct random seeds always resulted in higher smoothed classifier accuracies. This effect was particularly pronounced using SS-DPA: using distinct seeds increased smoothed accuracy by at least 1.5% for each value of k on MNIST, and by nearly 3% for k = 1000 on CIFAR-10 (where the base classifier accuracy difference was only 0.1%). This implies that shared random seeds make the base classifiers more correlated with each other: at the same level of average base classifier accuracy, it is more likely that a plurality of base classifiers will all misclassify the same sample. (If the base classifiers were perfectly uncorrelated, we would see nearly 100% smoothed clean accuracy wherever the base classifier accuracy was over 50%; if they were perfectly correlated, the smoothed clean accuracy would equal the base classifier accuracy.)

It is somewhat surprising that the base classifiers become correlated when using the same random seed, given that they are trained on entirely distinct data. However, two factors may be at play here. First, note that the random seed used in training controls the random cropping of training images: it is possible that, because the training sets of the base classifiers are so small, using the same cropping patterns in every classifier creates a systematic bias. Second, note that the effect is most pronounced using SS-DPA, where all base classifiers share the learned feature representations (Section 2.3.1): this would tend to increase the baseline correlation between base classifiers. (Indeed, for k = 3000 on MNIST, SS-DPA has similar clean accuracy to DPA, despite having a roughly 10% higher base classifier accuracy.) We can speculate that there may be some synergistic effect between these two causes of correlation; however, understanding these effects precisely is an opportunity for future research.

| Dataset, method | Number of partitions k | Median certified robustness (Same / Distinct) | Clean accuracy (Same / Distinct) | Base classifier accuracy (Same / Distinct) |
|---|---|---|---|---|
| MNIST, DPA | 1200 | 444 / 448 | 95.33% / 95.82% | 76.43% / 77.00% |
| MNIST, DPA | 3000 | 512 / 509 | 92.87% / 93.28% | 49.53% / 49.59% |
| MNIST, SS-DPA | 1200 | 506 / 493 | 94.41% / 95.91% | 82.56% / 81.70% |
| MNIST, SS-DPA | 3000 | 667 / 675 | 91.94% / 93.93% | 59.69% / 59.21% |
| CIFAR, DPA | 50 | 9 / 9 | 69.88% / 70.29% | 56.07% / 56.25% |
| CIFAR, DPA | 250 | 5 / 6 | 55.56% / 55.72% | 34.90% / 35.18% |
| CIFAR, DPA | 1000 | N/A / N/A | 44.13% / 44.30% | 23.10% / 23.25% |
| CIFAR, SS-DPA | 50 | 21 / 21 | 81.50% / 81.97% | 74.63% / 74.73% |
| CIFAR, SS-DPA | 250 | 60 / 59 | 73.83% / 74.60% | 61.91% / 61.98% |
| CIFAR, SS-DPA | 1000 | 78 / 83 | 65.30% / 68.22% | 44.16% / 44.26% |
Table 3: Comparison of the DPA and SS-DPA algorithms using the same random seed for each partition ('Same' columns, described in Appendix F) to the DPA and SS-DPA algorithms using distinct random seeds for each partition ('Distinct' columns, described in the main text). Note that using distinct random seeds consistently results in higher smoothed classifier clean accuracies, often by large margins when using SS-DPA. While it might appear that there is some tendency for the base classifiers to also be more accurate using distinct seeds (in eight out of 10 experiments), the magnitude of this difference is often very small, and it is not a consistent effect.

Appendix G SS-DPA with Repeated Unlabeled Data

The definition of a training set that we use, T ∈ 𝒫(𝒮 × ℤ), technically allows for repeated samples with differing labels: there could be a pair of distinct labeled samples t, t′ ∈ T such that sample(t) = sample(t′) but label(t) ≠ label(t′). This creates difficulties with the definition of the label-flipping attack: for example, the attacker could flip the label of t to become the label of t′, which would break the definition of T as a set. In most applications, this is not a circumstance that warrants practical consideration: indeed, none of the datasets used in our experiments have such instances (nor can label-flipping attacks create them), and therefore, for performance reasons, our implementation of SS-DPA does not handle these cases. (Footnote 4: Specifically, to optimize for performance, we verify that there are no repeated sample values, and then sort T itself (rather than samples(T)) lexicographically by image pixel values: this is equivalent to sorting samples(T) if no repeated images occur, which we have already verified, and it avoids an unnecessary lookup procedure to find the sorted index of the unlabeled sample for each labeled sample.) However, the SS-DPA algorithm as described in Section 2.3 can be implemented to handle such datasets, under a formalism of label-flipping tailored to represent this edge case.

Specifically, we define the space of possible labeled data points as 𝒮 × 𝒫(ℤ): each labeled data point consists of a sample along with a set of associated labels. We then restrict our dataset T to be any subset of 𝒮 × 𝒫(ℤ) such that for all distinct t, t′ ∈ T, sample(t) ≠ sample(t′). In other words, we do not allow repeated sample values in T as formally defined: if repeated samples exist, one can simply merge their sets of associated labels (in the formalism).

Note that using this definition, the size of T′ will equal the size of T, and the unlabeled samples will always remain the same: the adversary can only modify the label sets of samples "in place". SS-DPA will always assign an unlabeled sample, along with all of its associated labels, to the same partition, regardless of any label flipping. In practice, this is because, as described in Section 2.3, the partition assignment of a labeled sample depends only on its sample value, not its label: all labeled samples with the same sample value will be put in the same partition. Note that this is true even if the implementation represents two identical sample values with different labels as two separate samples: one does not actually have to implement labels as sets. Therefore, any changes to any labels associated with a sample will only change the output of one base classifier, so all such changes can together be considered a single label flip in the context of the certificate. In the above formalism, the certificate ρ represents the number of samples in T whose label sets have been "flipped", i.e., modified in any way.