1 Introduction
Adversarial poisoning attacks are an important vulnerability in machine learning systems. In these attacks, an adversary can manipulate the training data of a classifier, in order to change the classifications of specific inputs at test time. Several poisoning threat models have been studied in the literature, including threat models where the adversary may insert new poison samples (Chen et al., 2017), manipulate the training labels (Xiao et al., 2012; Rosenfeld et al., 2020), or manipulate the training sample values (Biggio et al., 2012; Shafahi et al., 2018). A certified defense against a poisoning attack provides a certificate for each test sample, which is a guaranteed lower bound on the magnitude of any adversarial distortion of the training set that can corrupt the test sample’s classification. In this work, we propose certified defenses against two types of poisoning attacks:

General poisoning attacks: In this threat model, the attacker can insert or remove a bounded number of samples from the training set. In particular, the attack magnitude is defined as the cardinality of the symmetric difference between the clean and poisoned training sets. This threat model also includes any distortion to an image and/or label in the training set — a distortion of a training image is simply the removal of the original image followed by the insertion of the distorted image. (Note that an image distortion or label flip therefore increases the symmetric difference attack magnitude by two.)

Label-flipping poisoning attacks: In this threat model, the adversary changes only the labels of $\rho$ out of $m$ training samples. Rosenfeld et al. (2020) has recently provided a certified defense for this threat model, which we improve upon.
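The attack magnitude in the general threat model can be illustrated with a small sketch. The toy samples and the helper name `attack_magnitude` are assumptions made for illustration only; each labeled sample is represented as a hypothetical `(pixels, label)` tuple.

```python
def attack_magnitude(clean, poisoned):
    """Size of the symmetric difference between two training sets."""
    return len(set(clean) ^ set(poisoned))

# Hypothetical toy training set: each sample is a (pixels, label) tuple.
clean = {((0, 0, 1), 7), ((1, 1, 0), 1), ((0, 1, 1), 7)}

# One inserted poison sample.
inserted = clean | {((1, 1, 1), 1)}

# One label flip: the original sample is removed and its relabeled copy inserted.
flipped = (clean - {((1, 1, 0), 1)}) | {((1, 1, 0), 7)}

print(attack_magnitude(clean, inserted))
print(attack_magnitude(clean, flipped))
```

Under this representation a single insertion counts once, while a label flip counts twice, which matches the note above about image distortions and label flips.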
In the last couple of years, certified defenses have been extensively studied for evasion attacks, where the adversary manipulates the test images, rather than the training data (e.g., Wong and Kolter (2018); Gowal et al. (2018); Lecuyer et al. (2019); Li et al. (2018); Salman et al. (2019); Levine and Feizi (2020b, 2019); Cohen et al. (2019)). In the evasion case, a certificate is a lower bound on the distance from the image to the classifier’s decision boundary: this guarantees that the image’s classification remains unchanged under adversarial distortions up to the certified magnitude.
Rosenfeld et al. (2020) provides an analogous certificate for label-flipping poisoning attacks: for an input image $x$, the certificate of $x$ is a lower bound on the number of labels in the training set that would have to change in order to change the classification of $x$. (Footnote 1: Steinhardt et al. (2017) also refers to a “certified defense” for poisoning attacks. However, the definition of the certificate is substantially different in that work, which instead provides overall accuracy guarantees under the assumption that the training and test data are drawn from similar distributions, rather than providing guarantees for individual realized inputs.) Rosenfeld et al. (2020)’s method is an adaptation of a certified defense for sparse ($\ell_0$) evasion attacks proposed by Lee et al. (2019). In Lee et al. (2019)’s randomized smoothing technique, the final classification is the consensus of many classifications by a base classifier on noisy copies of the input image $x$: in each copy of $x$, each pixel is corrupted to a random value with fixed probability. This results in probabilistic certificates, with a failure rate that decreases as the number of noisy copies used in the classification increases.
The adapted method for label-flipping attacks proposed by Rosenfeld et al. (2020) is equivalent to randomly flipping each training label with fixed probability and taking a consensus result. If implemented directly, this would require one to train a large ensemble of classifiers on different noisy versions of the training data. However, instead of actually doing this, Rosenfeld et al. (2020) focuses only on linear classifiers and is therefore able to analytically calculate the expected result. This gives deterministic, rather than probabilistic, certificates. Further, because Rosenfeld et al. (2020) considers a threat model where only labels are modified, they are able to train an unsupervised nonlinear feature extractor on the (unlabeled) training data before applying their technique, in order to learn more complex features. There is no need to worry about the robustness of the feature extractor training, because the unlabeled data cannot be corrupted by the adversary in their considered threat model.
However, the technique proposed by Lee et al. (2019) is not the current state-of-the-art certifiable defense against $\ell_0$ evasion attacks on large-scale image datasets (i.e., ImageNet). Levine and Feizi (2020b) propose a randomized ablation technique which, rather than randomizing the values of some pixels in each noisy copy of $x$, instead ablates some pixels, replacing them with a null value. Since it is possible for the base classifier to distinguish exactly which pixels originate from $x$, this results in more accurate base classifications and therefore substantially greater certified robustness. For example, on ImageNet, Lee et al. (2019) certifies the median test image against distortions of one pixel, while Levine and Feizi (2020b) certifies against distortions of 16 pixels.
The key question is whether or not one can adapt the evasion-time defense proposed by Levine and Feizi (2020b) to the poisoning setup. In this paper, we provide an affirmative answer to this question and show that the resulting method significantly outperforms the current state-of-the-art certifiable defenses. As shown in Figure 1a, our method for certified robustness against label-flipping attacks substantially outperforms Rosenfeld et al. (2020) for multiclass classification tasks. Furthermore, while our method is derandomized (as Rosenfeld et al. (2020) is) and therefore yields deterministic certificates, our technique does not require that the classification model be linear, allowing deep networks to be used. Moreover, in Figure 1b, we illustrate our certified accuracy against “general” poisoning attacks on MNIST; Rosenfeld et al. (2020) does not provide a provable defense for this threat model.
In what follows, we explain our methods. We develop a certifiable defense against general poisoning attacks called Deep Partition Aggregation (DPA). We first partition the training set into $k$ partitions, with the partition assignment for a training sample determined by a hash function of the sample.
The hash function can be any deterministic function that maps a training sample $t$ to a partition assignment: the only requirement is that the hash value depends only on the value of the training sample $t$ itself, so that neither poisoning other samples, nor changing the total number of samples, nor reordering the samples can change the partition that $t$ is assigned to. We then train $k$ base classifiers separately, one on each partition. At test time, we evaluate each of the base classifiers on the test image $x$ and return the plurality classification $c$ as the final result. The key insight is that removing a training sample, or adding a new sample, will only change the contents of one partition, and therefore will only affect the classification of one of the $k$ base classifiers. Let $n_1$ be the number of base classifiers that output the consensus class $c$ and $n_2$ be the number of base classifiers that output the next-most-frequently returned class $c'$. Let $\Delta := n_1 - n_2$. To change the plurality classification from $c$ to $c'$, the adversary must change the output of at least $\Delta/2$ base classifiers: this means inserting or removing at least $\Delta/2$ training samples. This immediately gives us a robustness certificate.
Our proposed method is related to classical ensemble approaches in machine learning, namely bootstrap aggregation and subset aggregation (Breiman, 1996; Buja and Stuetzle, 2006; Bühlmann, 2003; Zaman and Hirose, 2009). However, in these methods each base classifier in the ensemble is trained on an independently sampled collection of points from the training set: multiple classifiers in the ensemble may be trained on (and therefore poisoned by) the same sample point. The purpose of these methods has typically been to improve generalization. Bootstrap aggregation has been proposed as an empirical defense against poisoning attacks (Biggio et al., 2011) as well as for evasion attacks (Smutz and Stavrou, 2016). However, to our knowledge, these techniques have not yet been used to provide certified robustness. Our unique partition aggregation variant provides deterministic robustness certificates against poisoning attacks. See Appendix D for further discussion.
Note that DPA provides a robustness guarantee against any poisoning perturbation; it provides robustness to insertions and deletions, and by implication also to both label flips and image distortions. However, if the adversary is restricted to flipping labels only (as in Rosenfeld et al. (2020)), we can achieve even larger certificates through a modified technique. In this setting, the unlabeled data is trustworthy: each base classifier in the ensemble can then make use of the entire training set without labels, but only has access to the labels in its own partition. Therefore, each base classifier can be trained as if the entire dataset is available as unlabeled data, but only a very small number of labels are available. This is precisely the problem statement of semi-supervised learning, a well-studied domain of machine learning (Verma et al., 2019; Luo et al., 2018; Laine and Aila, 2017; Kingma et al., 2014; Gidaris et al., 2018). We can then leverage these existing semi-supervised learning techniques directly to improve the accuracies of the base classifiers in DPA. Furthermore, we can ensure that a particular image is assigned to the same partition regardless of label, so that only one partition is affected by a label flip (rather than possibly two). The resulting algorithm, Semi-Supervised Deep Partition Aggregation (SSDPA), yields substantially increased certified accuracy against label-flipping attacks, compared to DPA alone and compared to the current state-of-the-art.
On MNIST, SSDPA substantially outperforms the existing state of the art (Rosenfeld et al., 2020) in defending against label-flip attacks: we are able to certify at least half of images in the test set against attacks to over 600 (1.0%) of the labels in the training set, while still maintaining over 93% accuracy (see Figure 1a and Table 1). In comparison, Rosenfeld et al. (2020)’s method achieves less than 60% clean accuracy on MNIST, and most test images cannot be certified with the correct class against attacks of even 200 label flips. We are also the first work, to our knowledge, to certify against general poisoning attacks, including insertions and deletions of new training images: in this domain, we can certify at least half of test images against attacks consisting of over 500 arbitrary training image insertions or deletions. On CIFAR10, a substantially more difficult classification task which Rosenfeld et al. (2020) does not test on, we can certify at least half of test images against label-flipping attacks on over 80 labels using SSDPA, and can certify at least half of test images against general poisoning attacks of up to nine insertions or deletions using DPA.
Weber et al. (2020) have recently proposed a different randomized-smoothing-based defense against poisoning attacks by directly applying Cohen et al. (2019)’s smoothing evasion defense to the poisoning domain. The proposed technique can only certify for clean-label attacks (where only the existing images in the dataset are modified, and not their labels), and the certificate guarantees robustness only to bounded $\ell_2$ distortions of the training data, where the $\ell_2$ norm of the distortion is calculated across all pixels in the entire training set. Due to well-known limitations of dimensional scaling for smoothing-based robustness certificates (Yang et al., 2020; Kumar et al., 2020; Blum et al., 2020), this yields certificates to only very small distortions of the training data. (For binary MNIST [13,007 images], the maximum reported certificate is pixels.) Additionally, when using deep classifiers, Weber et al. (2020) proposes a randomized certificate, rather than a deterministic one, with a failure probability that decreases to zero only as the number of trained classifiers in an ensemble approaches infinity. Moreover, in Weber et al. (2020), unlike in our method, each classifier in the ensemble must be trained on a noisy version of the entire dataset. These issues prevent Weber et al. (2020)’s method from being an effective scheme for certified robustness against poisoning attacks.
2 Proposed Methods
2.1 Notation
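Let $\mathcal{S}$ be the space of all possible unlabeled samples (i.e., the set of all possible images). We assume that it is possible to sort elements of $\mathcal{S}$ in a deterministic, unambiguous way. In particular, we can sort images lexicographically by pixel values. We represent labels as integers, so that the set of all possible labeled samples is $\mathcal{S}_L := \{(x, c) \mid x \in \mathcal{S}, c \in \mathbb{N}\}$. A training set for a classifier is then represented as $T \in \mathcal{P}(\mathcal{S}_L)$, where $\mathcal{P}(\mathcal{S}_L)$ is the power set of $\mathcal{S}_L$. For $t \in \mathcal{S}_L$, we let $\text{sample}(t) \in \mathcal{S}$ refer to the (unlabeled) sample, and $\text{label}(t) \in \mathbb{N}$ refer to the label. For a set of samples $T \in \mathcal{P}(\mathcal{S}_L)$, we let $\text{samples}(T) \in \mathcal{P}(\mathcal{S})$ refer to the set of unique unlabeled samples which occur in $T$. A classifier model is defined as a deterministic function from both the training set and the sample to be classified to a label, i.e. $f: \mathcal{P}(\mathcal{S}_L) \times \mathcal{S} \to \mathbb{N}$. We will use $f(\cdot)$ to represent a base classifier model (i.e., a neural network), and $g(\cdot)$ to refer to a robust classifier (using DPA or SSDPA).
$A \ominus B$ represents the set symmetric difference between $A$ and $B$: $A \ominus B = (A \setminus B) \cup (B \setminus A)$. The number of elements in $A$ is $|A|$, $[n]$ is the set of integers $1$ through $n$, and $\lfloor z \rfloor$ is the largest integer less than or equal to $z$. $\mathbb{1}_{\text{Prop}}$ represents the indicator function: $\mathbb{1}_{\text{Prop}} = 1$ if Prop is true; $\mathbb{1}_{\text{Prop}} = 0$ otherwise. For a set $A$ of sortable elements, we define $\text{Sort}(A)$ as the sorted list of elements. For a list $L$ of unique elements, for $l \in L$, we will define $\text{index}(L, l)$ as the index of $l$ in the list $L$.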
2.2 DPA
The Deep Partition Aggregation (DPA) algorithm requires a base classifier model $f$, a training set $T$, a deterministic hash function $h$, and a hyperparameter $k$ indicating the number of base classifiers which will be used in the ensemble.
At training time, the algorithm first uses the hash function to define partitions $P_1, \dots, P_k \subseteq T$ of the training set, as follows:
$$P_i := \{ t \in T \mid h(t) \equiv i \pmod{k} \} \tag{1}$$
The hash function $h$ can be any deterministic function from the space of labeled samples $\mathcal{S}_L$ to the integers $\mathbb{N}$: however, it is preferable that the partitions are roughly equal in size. Therefore we should choose an $h$ which maps samples to a domain of integers significantly larger than $k$, in a way such that $h(\cdot) \pmod{k}$ will be roughly uniform over $[k]$. In practice, we let $h(t)$ be the sum of the pixel values in the image $t$.
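The partitioning step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: samples are hypothetical `(pixels, label)` tuples with integer pixel values, the helper names `pixel_sum_hash` and `partition` are invented for the example, and buckets are indexed by residues `0..k-1` rather than `1..k`.

```python
def pixel_sum_hash(sample):
    """Hash a labeled sample using only its pixel values (the label is ignored)."""
    pixels, _label = sample
    return sum(pixels)

def partition(training_set, k, h=pixel_sum_hash):
    """Split training_set into k lists; bucket i holds samples with h(t) % k == i."""
    buckets = [[] for _ in range(k)]
    for t in training_set:
        buckets[h(t) % k].append(t)
    return buckets

# Hypothetical toy data: (pixels, label) tuples.
train = [((12, 200, 7), 3), ((0, 0, 254), 1), ((90, 91, 93), 3), ((5, 5, 5), 0)]
before = partition(train, k=3)
after = partition(train + [((1, 2, 4), 9)], k=3)  # insert one poison sample

# Only the bucket that receives the poison sample differs.
changed = [i for i in range(3) if before[i] != after[i]]
print(changed)
```

Because the bucket of each sample depends only on that sample, inserting or removing one sample leaves every other bucket untouched, which is the property the certificate relies on.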
Base classifiers are then trained on each partition: we define trained base classifiers $f_i$ as:
$$f_i(x) := f(P_i, x) \tag{2}$$
Finally, at inference time, we evaluate the input on each base classifier, and then count the number of classifiers which return each class $c$:
$$n_c(x) := |\{ i \in [k] \mid f_i(x) = c \}| \tag{3}$$
This lets us define the classifier which returns the consensus output of the ensemble:
$$g_{dpa}(T, x) := \arg\max_c n_c(x) \tag{4}$$
When taking the argmax, we break ties deterministically by returning the smaller class index. The resulting robust classifier has the following guarantee:
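Theorem 1.
For a fixed deterministic base classifier $f$, hash function $h$, ensemble size $k$, training set $T$, and input $x$, let:
$$c := g_{dpa}(T, x), \qquad \bar{\rho}(x) := \left\lfloor \frac{n_c(x) - \max_{c' \neq c} \left( n_{c'}(x) + \mathbb{1}_{c' < c} \right)}{2} \right\rfloor \tag{5}$$
Then, for any poisoned training set $U$, if $|T \ominus U| \le \bar{\rho}(x)$, then $g_{dpa}(U, x) = c$.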
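The vote counting and the certificate of Theorem 1 can be sketched together. This is an illustrative sketch assuming classes are the integers `0..num_classes-1` (with at least two classes) and that the outputs of the base classifiers are already available as a list; the function name `dpa_predict_and_certify` is hypothetical.

```python
from collections import Counter

def dpa_predict_and_certify(base_predictions, num_classes):
    """Return (predicted class, certified radius) from base classifier votes."""
    counts = Counter(base_predictions)
    n = [counts.get(c, 0) for c in range(num_classes)]
    # Plurality vote; ties go to the smaller class index.
    top = min(range(num_classes), key=lambda c: (-n[c], c))
    # A rival with a smaller index wins ties, so it effectively has one extra vote.
    rival = max(n[c] + (1 if c < top else 0)
                for c in range(num_classes) if c != top)
    return top, (n[top] - rival) // 2

votes = [2] * 7 + [0] * 3  # hypothetical outputs of k = 10 base classifiers
print(dpa_predict_and_certify(votes, num_classes=3))
```

In this example, changing one vote from class 2 to class 0 leaves a 6 to 4 majority, while changing two votes produces a 5 to 5 tie that class 0 wins on index, so the sketch certifies a radius of one.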
All proofs are presented in Appendix A. Note that the training set $T$ and the poisoned set $U$ are unordered sets: therefore, in addition to providing certified robustness against insertions or deletions of training data, the robust classifier $g_{dpa}$ is also invariant under reordering of the training data, provided that $f$ has this invariance (which is implied, because $f$ maps deterministically from a set; see Section 2.2.1 for practical considerations). As mentioned in Section 1, DPA is a deterministic variant of randomized ablation (Levine and Feizi, 2020b) adapted to the poisoning domain. Each base classifier ablates most of the training set, retaining only the samples in one partition. However, unlike in randomized ablation, the partitions are deterministic and use disjoint samples, rather than selecting them randomly and independently. In Appendix C, we argue that our derandomization has little effect on the certified accuracies, while allowing for exact certificates using finite samples. We also discuss how this work relates to Levine and Feizi (2020a), which proposes a derandomized ablation technique for a restricted class of sparse evasion attacks (patch adversarial attacks).
2.2.1 DPA Practical Implementation Details
One of the advantages of DPA is that we can use deep neural networks for the base classifier $f$. However, enforcing that the output of a deep neural network is a deterministic function of its training data, and specifically, its training data as an unordered set, requires some care. First, we must remove dependence on the order in which the training samples are read in. To do this, in each partition $P_i$, we sort the training samples prior to training, taking advantage of the assumption that the space of unlabeled samples $\mathcal{S}$ is well-ordered (and therefore the space of labeled samples $\mathcal{S}_L$ is also well-ordered). In the case of the image data, this is implemented as a lexical sort by pixel values, with the labels concatenated to the samples as an additional value. The training procedure for the network, which is based on standard stochastic gradient descent, must also be made deterministic: in our PyTorch (Paszke et al., 2019) implementation, this can be accomplished by deterministically setting a random seed at the start of training. As discussed in Appendix F, we find that it is best to use different random seeds during training for each partition. This reduces the correlation in output between base classifiers in the ensemble. Thus, in practice, we use the partition index as the random seed (i.e., we train base classifier $f_i$ using random seed $i$).
2.3 SSDPA
Semi-Supervised DPA (SSDPA) is a defense against label-flip attacks. For this defense, the base classifier may be a semi-supervised learning algorithm: it can use the entire unlabeled training dataset, in addition to the labels for a partition. We will therefore define the base classifier to also accept an unlabeled dataset as input: $f: \mathcal{P}(\mathcal{S}) \times \mathcal{P}(\mathcal{S}_L) \times \mathcal{S} \to \mathbb{N}$. Additionally, our method of partitioning the data is modified both to ensure that changing the label of a sample affects only one partition rather than possibly two, and to create a more equal distribution of samples between partitions.
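First, we will sort the unlabeled data $\text{samples}(T)$:
$$T_{sorted} := \text{Sort}(\text{samples}(T)) \tag{6}$$
For a sample $t \in T$, note that $\text{index}(T_{sorted}, \text{sample}(t))$ is invariant under any label-flipping attack to $T$, and also under permutation of the training data as they are read. We now partition the data based on sorted index:
$$P_i := \{ t \in T \mid \text{index}(T_{sorted}, \text{sample}(t)) \equiv i \pmod{k} \} \tag{7}$$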
Note that in this partitioning scheme, we no longer need to use a hash function $h$. Moreover, this scheme creates a more uniform distribution of samples between partitions, compared with the hashing scheme used in DPA. This can lead to improved certificates: see Appendix E. This sorting-based partitioning is possible because the unlabeled samples are “clean”, so we can rely on their ordering, when sorted, to remain fixed. As in DPA, we train base classifiers on each partition, this time additionally using the entire unlabeled training set:
$$f_i(x) := f(\text{samples}(T), P_i, x) \tag{8}$$
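The sorting-based partitioning can be sketched as below. The sketch assumes hypothetical `(pixels, label)` tuples with no repeated pixel tuples, uses 0-based residues for the partition index, and the name `ssdpa_partition` is invented for illustration.

```python
def ssdpa_partition(training_set, k):
    """Assign each labeled sample to a partition by the sorted rank of its pixels."""
    ranked = sorted({pixels for pixels, _label in training_set})
    rank = {pixels: i for i, pixels in enumerate(ranked)}
    buckets = [set() for _ in range(k)]
    for pixels, label in training_set:
        buckets[rank[pixels] % k].add((pixels, label))
    return buckets

# Hypothetical toy data and a single label flip on the sample (3, 0).
train = {((0, 1), 4), ((0, 2), 9), ((3, 0), 4), ((3, 5), 1), ((7, 7), 9)}
flipped = (train - {((3, 0), 4)}) | {((3, 0), 1)}

clean_parts = ssdpa_partition(train, k=2)
poisoned_parts = ssdpa_partition(flipped, k=2)
changed = [i for i in range(2) if clean_parts[i] != poisoned_parts[i]]
print(changed)
```

Since the rank is computed from the unlabeled samples only, the flipped sample stays in its original partition, and partition sizes differ by at most one.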
The inference procedure is the same as in the standard DPA:
$$n_c(x) := |\{ i \in [k] \mid f_i(x) = c \}|, \qquad g_{ssdpa}(T, x) := \arg\max_c n_c(x) \tag{9}$$
The SSDPA algorithm provides the following robustness guarantee against label-flipping attacks. (Footnote 2: The theorem as stated assumes that there are no repeated unlabeled samples (with different labels) in the training set $T$. This is a reasonable assumption, and in the label-flipping attack model, the attacker cannot cause this assumption to be broken. Without this assumption, the analysis is more complicated; see Appendix G.)
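Theorem 2.
For a fixed deterministic semi-supervised base classifier $f$, ensemble size $k$, training set $T$ (with no repeated samples), and input $x$, let:
$$c := g_{ssdpa}(T, x), \qquad \bar{\rho}(x) := \left\lfloor \frac{n_c(x) - \max_{c' \neq c} \left( n_{c'}(x) + \mathbb{1}_{c' < c} \right)}{2} \right\rfloor \tag{10}$$
For a poisoned training set $U$ obtained by changing the labels of at most $\bar{\rho}(x)$ samples in $T$, $g_{ssdpa}(U, x) = c$.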
2.3.1 Semi-Supervised Learning Methods for SSDPA
In the standard DPA algorithm, we are able to train each classifier in the ensemble using only a small fraction of the training data; this means that each classifier can be trained relatively quickly: as the number of classifiers increases, the time to train each classifier can decrease (see Table 1). However, in a naive implementation of SSDPA, Equation 8 might suggest that training time will scale with $k$, because each semi-supervised base classifier must be trained on the entire training set. Indeed, with many popular and highly effective choices of semi-supervised classification algorithms, such as temporal ensembling (Laine and Aila, 2017), ICT (Verma et al., 2019), Teacher Graphs (Luo et al., 2018) and generative approaches (Kingma et al., 2014), the main training loop trains on both labeled and unlabeled samples, so we would see the total training time scale linearly with $k$. In order to avoid this, we instead choose a semi-supervised training method where the unlabeled samples are used only to learn semantic features of the data, before the labeled samples are introduced: this allows us to use the unlabeled samples only once, and to then share the learned feature representations when training each base classifier. In our experiments, we choose the RotNet model introduced by Gidaris et al. (2018), which first trains a network to infer the angle of rotation on rotated forms of unlabeled images: an intermediate layer of this network is then used as features in the final classification. As discussed in Section 2.2.1, we also sort the data prior to learning (including when learning unsupervised features), and set random seeds, in order to ensure determinism.
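The sorting and seeding steps from Section 2.2.1 can be illustrated with a framework-free sketch. It uses Python's standard `random.Random` as a stand-in for the training framework's random number generator (in a PyTorch run one would presumably also seed the framework, for instance with `torch.manual_seed`); the helper names `canonical_order` and `training_stream`, the `epochs` parameter, and the toy data are assumptions for illustration.

```python
import random

def canonical_order(partition):
    """Lexical sort by pixel values, with the label appended as an extra value."""
    return sorted(partition, key=lambda t: tuple(t[0]) + (t[1],))

def training_stream(partition, partition_index, epochs=2):
    """Yield samples in an order that depends only on the set and its index."""
    data = canonical_order(partition)
    rng = random.Random(partition_index)  # partition index used as the seed
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for j in order:
            yield data[j]

# Hypothetical partition, read in two different orders.
part = [((3, 1), 0), ((0, 9), 1), ((3, 0), 1), ((2, 2), 0)]
run_a = list(training_stream(part, partition_index=5))
run_b = list(training_stream(part[::-1], partition_index=5))
print(run_a == run_b)
```

Sorting first removes any dependence on read order, and seeding with the partition index makes the shuffles reproducible while still differing between partitions.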
3 Results
In this section, we present empirical results evaluating the performance of the proposed methods, DPA and SSDPA, against poisoning attacks on the MNIST and CIFAR10 datasets. As discussed in Section 2.3.1, we use the RotNet architecture (Gidaris et al., 2018) for the semi-supervised learning. Conveniently, the RotNet architecture is structured such that the feature extracting layers, combined with the final classification layers, together make up the Network-In-Network (NiN) architecture for the supervised classification (Lin et al., 2013). We use NiN for DPA’s supervised training, and RotNet for SSDPA’s semi-supervised training. On CIFAR10, we use training parameters, for both the DPA (NiN) and SSDPA (RotNet), directly from Gidaris et al. (2018). (Footnote 3: In addition to the derandomization changes mentioned in Section 2.2.1, we made one modification to the NiN ‘baseline’ for supervised learning: the baseline implementation in Gidaris et al. (2018), even when trained on a small subset of the training data, uses normalization constants derived from the entire training set. This is a (minor) error in Gidaris et al. (2018) that we correct by calculating normalization constants on each subset.) On MNIST, we use the same architectures and training parameters, with a slight modification: we eliminate horizontal flips in data augmentation, because, unlike in CIFAR10, horizontal alignment is semantically meaningful for digits.
Results are presented in Figures 2 and 3, and are summarized in Table 1. Our metric, Certified Accuracy as a function of attack magnitude (symmetric-difference or label-flips), refers to the fraction of samples which are both correctly classified and are certified as robust to attacks of that magnitude. Note that different poisoning perturbations, which poison different sets of training samples, may be required to poison each test sample; i.e. we assume the attacker can use the attack budget separately for each test sample. Table 1 also reports Median Certified Robustness, the attack magnitude to which at least 50% of the test set is provably robust.
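Both metrics can be computed from per-sample certificates, as in the sketch below. It assumes each test sample is summarized by a hypothetical `(is_correct, certificate)` pair and that only correctly classified samples count toward the median, which is how the "N/A" entry in Table 1 is read here; the function names are invented for illustration.

```python
def certified_accuracy(records, magnitude):
    """Fraction of samples that are correct and certified at this magnitude."""
    hits = sum(1 for correct, cert in records if correct and cert >= magnitude)
    return hits / len(records)

def median_certified_robustness(records):
    """Largest magnitude with certified accuracy of at least 50%, else None."""
    certs = sorted((cert for correct, cert in records if correct), reverse=True)
    needed = -(-len(records) // 2)  # ceil(len(records) / 2)
    return certs[needed - 1] if len(certs) >= needed else None

# Hypothetical per-sample results: (is_correct, certificate).
records = [(True, 5), (True, 3), (True, 0), (False, 9)]
print(certified_accuracy(records, 3))
print(median_certified_robustness(records))
```

When fewer than half of the test samples are classified correctly, no magnitude reaches 50% certified accuracy and the sketch returns `None`.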
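Table 1
| Dataset, Method | Training set size | Number of Partitions | Median Certified Robustness | Clean Accuracy | Base Classifier Accuracy | Training time per Partition |
|---|---|---|---|---|---|---|
| MNIST, DPA | 60000 | 1200 | 448 | 95.82% | 77.00% | 0.28 min |
| MNIST, DPA | 60000 | 3000 | 509 | 93.28% | 49.59% | 0.26 min |
| MNIST, SSDPA | 60000 | 1200 | 493 | 95.91% | 81.70% | 0.11 min |
| MNIST, SSDPA | 60000 | 3000 | 675 | 93.93% | 59.21% | 0.10 min |
| CIFAR, DPA | 50000 | 50 | 9 | 70.29% | 56.25% | 1.42 min |
| CIFAR, DPA | 50000 | 250 | 6 | 55.72% | 35.18% | 0.56 min |
| CIFAR, DPA | 50000 | 1000 | N/A | 44.30% | 23.25% | 0.36 min |
| CIFAR, SSDPA | 50000 | 50 | 21 | 81.97% | 74.73% | 0.72 min |
| CIFAR, SSDPA | 50000 | 250 | 59 | 74.60% | 61.98% | 0.28 min |
| CIFAR, SSDPA | 50000 | 1000 | 83 | 68.22% | 44.26% | 0.12 min |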
As shown in Figure 1, our SSDPA method substantially outperforms the existing certificate (Rosenfeld et al., 2020) on label-flipping attacks. With DPA, we are also able to certify at least half of MNIST images to attacks of over 500 poisoning insertions or deletions. On CIFAR10, on which Rosenfeld et al. (2020) does not test, DPA can certify at least half of images to 9 poisoning insertions or deletions, and SSDPA can certify to over 80 label flips. In summary, we provide a new state-of-the-art certified defense against label-flipping attacks using SSDPA, while also providing the first certified defense against a much more general poisoning threat model.
The hyperparameter $k$ controls the number of classifiers in the ensemble: because each sample is used in training exactly one classifier, the average number of samples used to train each classifier is inversely proportional to $k$. Therefore, we observe that the base classifier accuracy (and therefore also the final ensemble classifier accuracy) decreases as $k$ is increased; see Table 1. However, because the certificates described in Theorems 1 and 2 depend directly on the gap in the number of classifiers in the ensemble which output the top and runner-up classes, larger numbers of classifiers are necessary to achieve large certificates. In fact, using $k$ classifiers, the largest certified robustness possible is $k/2$. Thus, we see in Figures 2 and 3 that larger values of $k$ tend to produce larger robustness certificates. Therefore $k$ controls a trade-off between robustness and accuracy.
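The upper bound on the certificate follows in one line, assuming the certificate has the floor-of-half-the-gap form used in Theorems 1 and 2. Here $n_c(x)$ denotes the number of base classifiers voting for class $c$, the runner-up term is non-negative, and the $k$ classifiers cast exactly $k$ votes in total:

```latex
\bar{\rho}(x)
  = \left\lfloor \frac{n_c(x) - \max_{c' \neq c}\left(n_{c'}(x) + \mathbb{1}_{c' < c}\right)}{2} \right\rfloor
  \le \left\lfloor \frac{k - 0}{2} \right\rfloor
  = \left\lfloor \frac{k}{2} \right\rfloor
```

The bound is attained when all $k$ base classifiers agree and the predicted class has the smallest index, so that no tie-breaking penalty applies.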
Rosenfeld et al. (2020) also reports robustness certificates against label-flipping attacks on binary MNIST classification, with classes 1 and 7. Rosenfeld et al. (2020) reports clean accuracy of 94.5% and certified accuracies for attack magnitudes up to 2000 label flips (out of 13007), with best certified accuracy less than 70%. By contrast, using a specialized form of SSDPA, we are able to achieve clean accuracy of , with every correctly-classified image certifiably robust up to 5952 label flips (i.e. certified accuracy is also at 5952 label flips.) This represents a substantial improvement. We present details of this experiment in Appendix B.
4 Conclusion
In this paper, we described a novel approach to provable defenses against poisoning attacks. Unlike previous techniques, our method both allows for exact, deterministic certificates and can be implemented using deep neural networks. These advantages allow us to outperform the current state-of-the-art on label-flip attacks, and to develop the first certified defense against a broadly defined class of general poisoning attacks.
Broader Impact
Many datasets for machine learning originate from information gathered from the public, through crowdsourcing, web scraping, or analyzing user behavior in web applications. Systems trained on such datasets can then be used in highly critical applications such as web content moderation or fraud detection. Thus, it is important to develop learning algorithms that are robust to malicious distortions of the training data, i.e. poisoning attacks. We believe this work is useful to these ends. Additionally, because our algorithm provides provable certificates, it can be useful in establishing trust in machine learning systems: even in a non-adversarial setting, a user may feel more confident knowing that the system’s decision did not hinge on a small number of possibly-spurious training examples. In this way, the certificate can be viewed as an accessible, transparent, easy-to-interpret metric of confidence. We also note that this work presents a purely defensive method against poisoning attacks; i.e. we are not revealing any new vulnerabilities in machine learning systems by presenting our techniques.
Acknowledgements
This project was supported in part by NSF CAREER AWARD 1942230, HR 00111990077, HR001119S0026 and Simons Fellowship on “Foundations of Deep Learning.”
References
 Four types of learning curves. Neural Computation 4 (4), pp. 605–618. Cited by: Appendix E.
 Bagging classifiers for fighting poisoning attacks in adversarial classification tasks. In International workshop on multiple classifier systems, pp. 350–359. Cited by: Appendix D, §1.

Poisoning attacks against support vector machines
. In Proceedings of the 29th International Coference on International Conference on Machine Learning, ICML’12, Madison, WI, USA, pp. 1467–1474. External Links: ISBN 9781450312851 Cited by: §1.  Random smoothing might be unable to certify robustness for highdimensional images. arXiv preprint arXiv:2002.03517. External Links: 2002.03517 Cited by: §1.
 Bagging predictors. Machine learning 24 (2), pp. 123–140. Cited by: Appendix D, §1.
 Bagging, subagging and bragging for improving some prediction algorithms. In Research report/Seminar für Statistik, Eidgenössische Technische Hochschule (ETH), Vol. 113. Cited by: Appendix D, Appendix D, §1.
 Observations on bagging. Statistica Sinica, pp. 323–351. Cited by: Appendix D, Appendix D, §1.
 Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526. Cited by: §1.
 Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp. 1310–1320. Cited by: §1, §1.
 Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2.3.1, §3, footnote 3.
 On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715. Cited by: §1.
 Semisupervised learning with deep generative models. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 3581–3589. External Links: Link Cited by: §1, §2.3.1.
 Curse of dimensionality on randomized smoothing for certifiable robustness. arXiv preprint arXiv:2002.03239. External Links: 2002.03239 Cited by: §1.
 Temporal ensembling for semisupervised learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 2426, 2017, Conference Track Proceedings, External Links: Link Cited by: §1, §2.3.1.
 Certified robustness to adversarial examples with differential privacy. In 2019 2019 IEEE Symposium on Security and Privacy (SP), Vol. , Los Alamitos, CA, USA, pp. 726–742. External Links: ISSN 23751207, Document, Link Cited by: §1.
 Tight certificates of adversarial robustness for randomly smoothed classifiers. In Advances in Neural Information Processing Systems, pp. 4911–4922. Cited by: §1, §1.
 Wasserstein smoothing: certified robustness against wasserstein adversarial attacks. arXiv preprint arXiv:1910.10783. Cited by: §1.
 (De) randomized smoothing for certifiable defense against patch attacks. arXiv preprint arXiv:2002.10733. Cited by: Appendix C, §2.2.

Robustness certificates for sparse adversarial attacks by randomized ablation.
Association for the Advancement of Artificial Intelligence (AAAI)
. Cited by: Appendix C, Appendix C, Deep Partition Aggregation: Provable Defense against General Poisoning Attacks, §1, §1, §2.2.  Secondorder adversarial attack and certifiable robustness. arXiv preprint arXiv:1809.03113. Cited by: §1.
 Network in network. arXiv preprint arXiv:1312.4400. Cited by: §3.

Smooth neighbors on teacher graphs for semisupervised learning.
In
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, Cited by: §1, §2.3.1.  PyTorch: an imperative style, highperformance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d AlchéBuc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §2.2.1.
 Certified robustness to label-flipping attacks via randomized smoothing. arXiv preprint arXiv:2002.03018. Cited by: Appendix B, Deep Partition Aggregation: Provable Defense against General Poisoning Attacks, Figure 1, 2nd item, §1, §1, §1, §1, §1, §1, §3, §3.
 Provably robust deep learning via adversarially trained smoothed classifiers. arXiv preprint arXiv:1906.04584. Cited by: §1.
 Poison frogs! targeted clean-label poisoning attacks on neural networks. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 6103–6113. External Links: Link Cited by: §1.
 When a tree falls: using diversity in ensemble classifiers to identify evasion in malware detectors. In 23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21–24, 2016, External Links: Link Cited by: Appendix D, §1.
 Certified defenses for data poisoning attacks. In Advances in neural information processing systems, pp. 3517–3529. Cited by: footnote 1.
 Interpolation consistency training for semi-supervised learning. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 3635–3641. Cited by: §1, §2.3.1.
 RAB: provable robustness against backdoor attacks. arXiv preprint arXiv:2003.08904. Cited by: §1.
 Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp. 5283–5292. Cited by: §1.
 Adversarial label flips attack on support vector machines. In Proceedings of the 20th European Conference on Artificial Intelligence, pp. 870–875. Cited by: §1.
 Randomized smoothing of all shapes and sizes. arXiv preprint arXiv:2002.08118. External Links: 2002.08118 Cited by: §1.
 Effect of subsampling rate on subbagging and related ensembles of stable classifiers. In Pattern Recognition and Machine Intelligence, S. Chaudhury, S. Mitra, C. A. Murthy, P. S. Sastry, and S. K. Pal (Eds.), Berlin, Heidelberg, pp. 44–49. External Links: ISBN 978-3-642-11164-8 Cited by: Appendix D, §1.
Appendix A Proofs
Theorem 1.
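For a fixed deterministic base classifier $f$, hash function $h$, ensemble size $k$, training set $T$, and input $x$, let:
$$c := g_{dpa}(T, x), \qquad \bar{\rho}(x) := \left\lfloor \frac{n_c(x) - \max_{c' \neq c}\left(n_{c'}(x) + \mathbb{1}_{c' < c}\right)}{2} \right\rfloor \tag{11}$$
Then, for any poisoned training set $U$, if $|T \ominus U| \leq \bar{\rho}(x)$ (where $\ominus$ denotes the symmetric difference), we have: $g_{dpa}(U, x) = c$.
Proof.
We define the partitions, trained classifiers, and counts for each training set ($T$ and $U$) as described in the main text:
$$P_i^T := \{t \in T \mid h(t) \equiv i \pmod{k}\}, \qquad P_i^U := \{t \in U \mid h(t) \equiv i \pmod{k}\} \tag{12}$$
$$f_i^T(x) := f(P_i^T, x), \qquad f_i^U(x) := f(P_i^U, x) \tag{13}$$
$$n_c^T(x) := |\{i \in [k] \mid f_i^T(x) = c\}|, \qquad n_c^U(x) := |\{i \in [k] \mid f_i^U(x) = c\}| \tag{14}$$
$$g_{dpa}(T, x) := \arg\max_c n_c^T(x), \qquad g_{dpa}(U, x) := \arg\max_c n_c^U(x) \tag{15}$$
Note that here, we are using superscripts to explicitly distinguish between partitions (as well as base classifiers and counts) of the clean training set $T$ and the poisoned dataset $U$ (i.e., $P_i^T$ is equivalent to $P_i$ in the main text). In Equation 15, as discussed in the main text, when taking the argmax, we break ties deterministically by returning the smaller class index.
Note that $P_i^T = P_i^U$ unless there is some $t$, with $h(t) \equiv i \pmod{k}$, in $T \ominus U$. Because the mapping from $t$ to $h(t) \bmod k$ is a deterministic function, the number of partitions $i$ for which $P_i^T \neq P_i^U$ is at most $|T \ominus U|$, which is at most $\bar{\rho}(x)$. $P_i^T = P_i^U$ implies $f_i^T(x) = f_i^U(x)$, so the number of classifiers $i$ for which $f_i^T(x) \neq f_i^U(x)$ is also at most $\bar{\rho}(x)$. Each such classifier changes the count of any single class by at most one. Then:
$$\forall c': \quad |n_{c'}^T(x) - n_{c'}^U(x)| \leq \bar{\rho}(x) \tag{16}$$
Let $c := g_{dpa}(T, x)$. Note that $g_{dpa}(U, x) = c$ iff:
$$\forall c' < c: \; n_c^U(x) > n_{c'}^U(x), \qquad \forall c' > c: \; n_c^U(x) \geq n_{c'}^U(x)$$
where the separate cases come from the deterministic selection of the smaller index in cases of ties in Equation 15: this can be condensed to $\forall c' \neq c: \; n_c^U(x) \geq n_{c'}^U(x) + \mathbb{1}_{c' < c}$. Then by the triangle inequality with Equation 16, we have that $g_{dpa}(U, x) = c$ if $\forall c' \neq c: \; n_c^T(x) \geq n_{c'}^T(x) + 2\bar{\rho}(x) + \mathbb{1}_{c' < c}$. This condition is true by the definition of $\bar{\rho}(x)$, so $g_{dpa}(U, x) = c$. ∎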
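The certificate of Theorem 1 depends only on the per-class vote counts, so it can be sketched in a few lines. The function name `dpa_certificate` and the list-of-counts input are assumptions made for illustration and are not taken from the authors' implementation; the sketch assumes at least two classes.

```python
from typing import List, Tuple


def dpa_certificate(counts: List[int]) -> Tuple[int, int]:
    """Return (consensus class, certified radius) from per-class vote counts.

    counts[c] is the number of base classifiers that voted for class c.
    Ties are broken toward the smaller class index, as in Equation 15.
    """
    # Largest count wins; among equal counts the smaller index wins.
    c = max(range(len(counts)), key=lambda j: (counts[j], -j))
    # Strongest rival, with +1 when the rival would win a tie (rival index < c).
    runner_up = max(
        counts[j] + (1 if j < c else 0) for j in range(len(counts)) if j != c
    )
    # Each poisoned partition can close the gap by at most two votes.
    radius = (counts[c] - runner_up) // 2
    return c, radius
```

For example, with votes `[2, 6]` the consensus is class 1 and the radius is 1: one changed vote gives `[3, 5]`, which still selects class 1, while two changed votes give `[4, 4]`, where the tie goes to class 0.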
Theorem 2.
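For a fixed deterministic semi-supervised base classifier $f$, ensemble size $k$, training set $T$ (with no repeated samples), and input $x$, let:
$$c := g_{ssdpa}(T, x), \qquad \bar{\rho}(x) := \left\lfloor \frac{n_c(x) - \max_{c' \neq c}\left(n_{c'}(x) + \mathbb{1}_{c' < c}\right)}{2} \right\rfloor \tag{17}$$
For a poisoned training set $U$ obtained by changing the labels of at most $\bar{\rho}(x)$ samples in $T$, $g_{ssdpa}(U, x) = c$.
Proof.
Recall the definition:
$$T_{sorted} := \text{Sort}(\text{samples}(T)) \tag{18}$$
Because $\text{samples}(T) = \text{samples}(U)$, we have $T_{sorted} = U_{sorted}$. We can then define partitions and base classifiers for each training set ($T$ and $U$) as described in the main text:
$$P_i^T := \{t \in T \mid \text{index}(T_{sorted}, \text{sample}(t)) \equiv i \pmod{k}\}, \qquad P_i^U := \{t \in U \mid \text{index}(T_{sorted}, \text{sample}(t)) \equiv i \pmod{k}\} \tag{19}$$
$$f_i^T(x) := f(\text{samples}(T), P_i^T, x), \qquad f_i^U(x) := f(\text{samples}(T), P_i^U, x) \tag{20}$$
Recall that for any $t \in T$, $\text{index}(T_{sorted}, \text{sample}(t))$ is invariant under a label-flipping attack on $T$. Then, for each $i$, the samples in $P_i^T$ will be the same as the samples in $P_i^U$, possibly with some labels flipped. In particular, the functions $f_i^T(\cdot)$ and $f_i^U(\cdot)$ will be identical, unless the label of some sample $t$ with $\text{index}(T_{sorted}, \text{sample}(t)) \equiv i \pmod{k}$ has been changed. If at most $\bar{\rho}(x)$ labels change, at most $\bar{\rho}(x)$ ensemble classifiers are affected: the rest of the proof proceeds as in that of Theorem 1. ∎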
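The key step of this proof, that partition membership depends only on the sorted position of the sample value and never on its label, can be illustrated with a small sketch. The helper names `ssdpa_partitions` and `changed_partitions`, and the representation of samples as tuples of integers, are assumptions for illustration only.

```python
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[int, ...]  # hypothetical stand-in for an image's pixel values


def ssdpa_partitions(
    training_set: Sequence[Tuple[Sample, int]], k: int
) -> List[Dict[Sample, int]]:
    """Assign each labeled sample to partition (sorted index of its value) mod k.

    Samples are assumed unique; the label plays no role in the assignment.
    """
    ordered = sorted(sample for sample, _ in training_set)
    index = {sample: i for i, sample in enumerate(ordered)}
    partitions: List[Dict[Sample, int]] = [{} for _ in range(k)]
    for sample, label in training_set:
        partitions[index[sample] % k][sample] = label
    return partitions


def changed_partitions(
    clean: Sequence[Tuple[Sample, int]],
    poisoned: Sequence[Tuple[Sample, int]],
    k: int,
) -> int:
    """Count partitions whose labeled contents differ after label flipping."""
    before = ssdpa_partitions(clean, k)
    after = ssdpa_partitions(poisoned, k)
    return sum(1 for p, q in zip(before, after) if p != q)
```

Flipping the labels of three samples can alter at most three partitions, and fewer when two flipped samples share a partition, which is exactly the bound the proof uses.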
Appendix B Binary MNIST Experiments
We perform a specialized instance of SS-DPA on the binary ‘1’ versus ‘7’ MNIST classification task. Specifically, we set the number of partitions $k$ equal to the number of training samples, so that every partition receives only one label.
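We first use 2-means clustering on the unlabeled data, to compute two means. This allows each base classifier to use a very simple “semi-supervised learning algorithm”: if the test image and the one labeled training image provided to the base classifier belong to the same cluster, then the base classifier assigns the label of the training image to the test image. Otherwise, it assigns the opposite label to the test image. Formally, writing $(x_i, c_i)$ for the single labeled training sample in partition $i$:
$$f_i(x) := \begin{cases} c_i & \text{if } x \text{ and } x_i \text{ belong to the same cluster,} \\ \text{the label opposite to } c_i & \text{otherwise.} \end{cases}$$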
Note that all base classifiers behave identically, up to a swap of the labels: so in practice, we simply count the training samples which associate each of the two cluster centroids with each of the two labels, and determine the number of label flips which would be required to change the consensus label assignments of the clusters. At test time, each test image therefore needs to be processed only once. The amount of time required for inference is then simply the time needed to calculate the distance from the test sample to each of the two cluster centroids. This also means that every image has the same robustness certificate. As stated in the main text, using this method, we are able to achieve clean accuracy of , with every correctly-classified image certifiably robust up to 5952 label flips (i.e. certified accuracy is also at 5952 label flips). This means that the classifier is robust to adversarial label flips on 45.8% of the training data.
Rosenfeld et al. [2020] also report robustness certificates against label-flipping attacks on binary MNIST classification with classes 1 and 7. They report a clean accuracy of 94.5% and certified accuracies for attack magnitudes up to 2000 label flips (out of 13007: 15.4%), with the best certified accuracy less than 70%.
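The counting procedure described in this appendix can be sketched as follows. This is a conservative illustration rather than the authors' code: the function name `flips_to_change_consensus` and the 0/1 encoding of clusters and labels are assumptions, and the sketch requires a strict majority instead of applying the tie-breaking rule.

```python
from typing import Sequence


def flips_to_change_consensus(clusters: Sequence[int], labels: Sequence[int]) -> int:
    """Number of label flips that provably cannot change the cluster-to-label consensus.

    clusters[i] and labels[i] are 0 or 1 for training sample i. A sample votes
    for the mapping "cluster c -> label c" when its cluster and label agree,
    and for the swapped mapping otherwise.
    """
    agree = sum(1 for c, y in zip(clusters, labels) if c == y)
    swap = len(clusters) - agree
    gap = abs(agree - swap)
    # Each flip moves one vote to the other side, shrinking the gap by two;
    # the consensus is safe while the gap stays strictly positive.
    return max(0, (gap - 1) // 2)
```

Because every test image is classified through the same cluster-to-label mapping, this single number serves as the certificate for all test images, matching the observation that every image has the same robustness certificate.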
Appendix C Relationship to Randomized Ablation
As mentioned in Section 1, (SS-)DPA is in some sense the analogue, for training-time poisoning attacks, of Randomized Ablation [Levine and Feizi, 2020b], which is used to defend against sparse inference-time attacks. Randomized Ablation is a certified defense against (sparse) inference attacks, in which the final classification is a consensus among classifications of copies of the image. In each copy, a fixed number of pixels are randomly ablated (replaced with a null value). A direct application of Randomized Ablation to poisoning attacks would require each base classifier to be trained on a random subset of the training data, with each base classifier’s training set chosen randomly and independently. Due to the randomized nature of this algorithm, estimation error would have to be considered in practice when applying Randomized Ablation using a finite number of base classifiers: this decreases the certificates that can be reported, while also introducing a failure probability to the certificates. By contrast, in our algorithms, the partitions are deterministic and use disjoint, rather than independent, samples. In this section, we argue that our derandomization has little effect on the certified accuracies compared to Randomized Ablation, even considering Randomized Ablation with no estimation error (i.e., with infinite base classifiers). In the poisoning case specifically, using additional base classifiers is expensive, because they must each be trained, so one would observe a large estimation error when using a realistic number of base classifiers. Therefore our derandomization can potentially improve the certificates which can be reported, while also allowing for exact certificates using a finite number of base classifiers.
For simplicity, consider the label-flipping case. In this case, the training set has a fixed size, $m$. Thus, Randomized Ablation bounds can be considered directly. A direct adaptation of Levine and Feizi [2020b] would, for each base classifier, choose $s$ out of $m$ samples to retain labels for, and would ablate the labels for the rest of the training data. Suppose an adversary has flipped $\rho$ labels. For each base classifier, the probability that a flipped label is used in classification (and therefore that the base classifier is ‘poisoned’) is:
$$\Pr_{RA}(\text{poisoned}) = 1 - \frac{\binom{m - \rho}{s}}{\binom{m}{s}} \tag{21}$$
where “RA” stands for Randomized Ablation.
In this direct adaptation, one must then use a very large ensemble of randomized classifiers. The ensemble must be large enough that we can estimate with high confidence the probabilities (on the distribution of possible choices of training labels to retain) that the base classifier selects each class. If the gap between the highest and the next-highest class probabilities can be determined to be greater than $2\Pr_{RA}(\text{poisoned})$, then the consensus classification cannot be changed by flipping $\rho$ labels. This is because, at worst, every poisoned classifier could switch the highest-class classification to the runner-up class, reducing the gap by at most $2\Pr_{RA}(\text{poisoned})$.
Note that this estimation relies on each base classifier using a subset of labels selected randomly and independently from the other base classifiers. In contrast, our SS-DPA method selects each subset disjointly. If $\rho$ labels are flipped, assuming that in the worst case each flipped label is in a different partition, using the union bound, the proportion of base classifiers which can be poisoned is
$$\Pr_{\text{SS-DPA}}(\text{poisoned}) \leq \frac{\rho}{k} = \frac{\rho s}{m} \tag{22}$$
where for simplicity we assume that $k$ evenly divides the number of samples $m$, so $s = m/k$ labels are kept by each partition. Again we need the gap in class probabilities to be at least $2\Pr_{\text{SS-DPA}}(\text{poisoned})$ to ensure robustness. While the use of the union bound might suggest that our deterministic scheme (Equation 22) leads to a significantly looser bound than that of the probabilistic certificate (Equation 21), this is not the case in practice where $\rho \ll k$. For example, in an MNIST-sized dataset ($m = 60000$), using labels per base classifier, to certify for label flips, we have , and . The derandomization only slightly increases the required gap between the top two class probabilities.
To understand this, note that if the number of poisonings $\rho$ is small compared to the number of partitions $k$, then even if the partitions are random and independent, the chance that any two poisonings occur in the same partition is quite small. In that case, the union bound in Equation 22 is actually quite close to an independence assumption. By accepting this small increase in the upper bound of the probability that each base classification is poisoned, our method provides all of the benefits of derandomization, including allowing for exact robustness certificates using only a finite number of classifiers. Additionally, note that in the Randomized Ablation case, the empirical gap in estimated class probabilities must be somewhat larger than $2\Pr_{RA}(\text{poisoned})$ in order to certify robustness with high confidence, due to estimation error: the required gap grows as the number of base classifiers decreases. This is particularly important in the poisoning case, because training a large number of classifiers is substantially more expensive than performing a large number of evaluations, as in randomized smoothing for evasion attacks.
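To make the comparison between Equations 21 and 22 concrete, the two quantities can be evaluated numerically. In this sketch `m` is the training-set size, `s` the number of labels kept per base classifier, and `rho` the number of flipped labels; the function names and the particular values plugged in at the end are hypothetical choices for illustration, not figures quoted from the text.

```python
from fractions import Fraction


def pr_poisoned_ra(m: int, s: int, rho: int) -> float:
    """Equation 21: chance that s labels drawn without replacement include a flipped one."""
    clean = Fraction(1)
    # C(m - rho, s) / C(m, s) written as a product of s ratios.
    for j in range(s):
        clean *= Fraction(m - rho - j, m - j)
    return float(1 - clean)


def pr_poisoned_partition(m: int, s: int, rho: int) -> float:
    """Equation 22: union bound rho / k, with k = m / s disjoint partitions."""
    return rho * s / m


# Hypothetical illustrative values (assumptions, not taken from the paper).
m, s, rho = 60000, 50, 200
ra = pr_poisoned_ra(m, s, rho)
partitioned = pr_poisoned_partition(m, s, rho)
```

With these assumed values the randomized bound is roughly 0.15 and the union bound roughly 0.17, so the deterministic scheme pays only a small price when the number of flips is small relative to the number of partitions.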
We note that Levine and Feizi [2020a] also used a derandomized scheme based on Randomized Ablation to certifiably defend against evasion patch attacks. However, in that work, the derandomization does not involve a union bound over arbitrary partitions of the vulnerable inputs. Instead, in the case of patch attacks, the attack is geometrically constrained: the image is therefore divided into geometric regions (bands or blocks) such that the attacker will only overlap with a fixed number of these regions. Each base classifier then uses only a single region to make its classification. Levine and Feizi [2020a] do not apply derandomization via disjoint subsets and a union bound to defend against poisoning attacks. We also borrow from Levine and Feizi [2020a] the deterministic “tie-breaking” technique when evaluating the consensus class in Equation 4, which can increase our robustness certificate by up to one.
Appendix D Relationship to Existing Ensemble Methods
As mentioned in Section 1, our proposed method is related to classical ensemble approaches in machine learning, namely bootstrap aggregation (“bagging”) and subset aggregation (“subagging”) [Breiman, 1996, Buja and Stuetzle, 2006, Bühlmann, 2003, Zaman and Hirose, 2009]. In these methods, each base classifier in the ensemble is trained on an independently sampled collection of points from the training set: this means that multiple classifiers in the ensemble may be trained on the same sample point. The purpose of these methods has typically been to improve generalization, and therefore to improve test set accuracy: bagging and subagging decrease the variance component of the classifier’s error.
In subagging, each training set for a base classifier is an independently sampled subset of the training data: this is in fact an identical formulation to the “direct Randomized Ablation” approach discussed in Appendix C. However, in practice, the size of each training subset has typically been quite large: the bias error term increases with decreasing subsample sizes [Buja and Stuetzle, 2006]. Thus, the optimal subsample size for maximum accuracy is large: Bühlmann [2003] recommends using $m/2$ samples per classifier (“half-subagging”), where $m$ is the training set size, with theoretical justification for optimal generalization. This would not be useful in Randomized Ablation-like certification, because any one poisoned element would affect half of the ensemble. Indeed, in our certifiably robust classifiers, we observe a trade-off between accuracy and certified robustness: our use of many very small partitions is clearly not optimal for the test-set accuracy (Table 1).
In bagging, the samples in each base classifier training set are chosen with replacement, so elements may be repeated in the training “set” for a single base classifier. Bagging has been proposed as an empirical defense against poisoning attacks [Biggio et al., 2011] as well as for evasion attacks [Smutz and Stavrou, 2016]. However, to our knowledge, these techniques have not yet been used to provide certified robustness.
Appendix E SS-DPA with Hashing
It is possible to use hashing, as in DPA, in order to partition data for SS-DPA: as long as the hash function does not use the sample label in assigning a partition (as ours indeed does not), it will always assign an image to the same partition regardless of label-flipping, so only one partition will be affected by a label flip. Therefore, the SS-DPA label-flipping certificate still holds. However, as explained in the main text, treating the unlabeled data as trustworthy allows us to partition the samples evenly among partitions using sorting. This is motivated by the classical understanding in machine learning (e.g. Amari et al. [1992]) that learning curves (the test error versus the number of samples that a classifier is trained on) tend to be convex-like. The test error of a base classifier is then approximately a convex function of that base classifier’s partition size. Therefore, if the partition size is a random variable, by Jensen’s inequality, the expected test error over the (random) partition size is at least the test error at the mean partition size. Setting all base classifiers to use the mean number of samples should then maximize the average base classifier accuracy.
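Table 2: SS-DPA results with partitions determined by hashing (Hash) versus sorting (Sort).

| Dataset, Method | Number of Partitions | Median Certified Robustness (Hash) | Median Certified Robustness (Sort) | Clean Accuracy (Hash) | Clean Accuracy (Sort) | Base Classifier Accuracy (Hash) | Base Classifier Accuracy (Sort) |
|---|---|---|---|---|---|---|---|
| MNIST, SS-DPA | 1200 | 469 | 493 | 95.88% | 95.91% | 79.59% | 81.70% |
| MNIST, SS-DPA | 3000 | 626 | 675 | 93.94% | 93.93% | 56.33% | 59.21% |
| CIFAR, SS-DPA | 50 | 21 | 21 | 81.96% | 81.97% | 74.66% | 74.73% |
| CIFAR, SS-DPA | 250 | 59 | 59 | 74.47% | 74.60% | 61.83% | 61.98% |
| CIFAR, SS-DPA | 1000 | 82 | 83 | 68.24% | 68.22% | 43.94% | 44.26% |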
To validate this reasoning, we tested SS-DPA with partitions determined by hashing (using the same partitions as we used in DPA), rather than the sorting method described in the main text. See Table 2 for results. As expected, the average base classifier accuracy decreased in all experiments when using the DPA hashing, compared to using the sorting method of SS-DPA. However, the effect was minimal in CIFAR-10 experiments: the main advantage of the sorting method was seen on MNIST. This is partly because we used more partitions, and hence fewer average samples per partition, in the MNIST experiments: fewer average samples per partition creates a greater relative variation in the number of samples per partition in the hashing method. However, CIFAR-10 with $k = 1000$ and MNIST with $k = 1200$ both average 50 samples per partition, but the base classifier accuracy difference was still much larger on MNIST (2.11%) than on CIFAR-10 (0.32%).
On the MNIST experiments, where the base classifier accuracy gap was observed, we also saw that the effect of hashing on the smoothed classifier was mainly to decrease the certified robustness, and that there was not a significant effect on the clean smoothed classifier accuracy. As discussed in Appendix F, this may imply that the outputs of the base classifiers using the sorting method are more correlated, in addition to being more accurate.
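The difference in partition-size variability between the two assignment schemes can be seen in a small sketch. SHA-256 of the sample index is used here purely as a stand-in hash (an assumption; it is not the hash function used for DPA), and the function names are hypothetical.

```python
import hashlib
from collections import Counter
from typing import List


def hashed_sizes(n: int, k: int) -> List[int]:
    """Partition sizes when sample i is sent to bucket (hash of i) mod k."""
    counts = Counter(
        int.from_bytes(hashlib.sha256(str(i).encode()).digest()[:8], "big") % k
        for i in range(n)
    )
    return [counts.get(j, 0) for j in range(k)]


def sorted_sizes(n: int, k: int) -> List[int]:
    """Partition sizes when the sample at sorted position i is sent to bucket i mod k."""
    return [len(range(j, n, k)) for j in range(k)]
```

Calling both functions with, say, `n = 6000` and `k = 120` gives exactly 50 samples in every sorted partition, while the hashed partition sizes scatter around 50. Under a convex learning curve, that scatter is what lowers the average base classifier accuracy.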
Appendix F Effect of Random Seed Selection
In Section 2.2.1, we mention that we “deterministically” choose different random seeds for training each partition, rather than training every partition with the same random seed. Table 3 compares using distinct random seeds with using the same random seed for each partition. There is no large, consistent effect across experiments on either the base classifier accuracy or the median certified robustness; however, across all 10 experiments, the distinct random seeds always resulted in higher smoothed classifier accuracies. This effect was particularly pronounced using SS-DPA: using distinct seeds increased smoothed accuracy by at least 1.5% for each value of $k$ on MNIST, and by nearly 3% for $k = 1000$ on CIFAR-10 (where the base classifier accuracy difference was only 0.1%). This implies that shared random seeds make the base classifiers more correlated with each other: at the same level of average base classifier accuracy, it is more likely that a plurality of base classifiers will all misclassify the same sample. (If the base classifiers were perfectly uncorrelated, we would see nearly 100% smoothed clean accuracy wherever the base classifier accuracy was over 50%. Conversely, if they were perfectly correlated, the smoothed clean accuracy would equal the base classifier accuracy.)
It is somewhat surprising that the base classifiers become correlated when using the same random seed, given that they are trained on entirely distinct data. However, two factors may be at play here. First, note that the random seed used in training controls the random cropping of training images: it is possible that, because the training sets of the base classifiers are so small, using the same cropping patterns in every classifier would create a systematic bias. Second, note that the effect is most pronounced using SS-DPA, where all base classifiers share their final network layers: this would tend to increase the baseline correlation between base classifiers. (Indeed, for $k = 3000$ on MNIST, SS-DPA has similar clean accuracy to DPA, despite having a 10% higher base classifier accuracy.) We can speculate that there may be some synergistic effect between these two causes of correlation: however, understanding these effects precisely is an opportunity for potential future research.
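The parenthetical claim about perfectly uncorrelated base classifiers can be checked with a binomial calculation. The sketch below is a simplification under stated assumptions: it treats each of the `k` base classifiers as independently correct with the same probability `p`, and it computes the probability that a strict majority is correct, which is sufficient (though not necessary) for the plurality vote to be correct. The function name is hypothetical.

```python
import math


def majority_correct_probability(k: int, p: float) -> float:
    """P(more than k/2 of k independent classifiers are correct), for 0 < p < 1."""
    total = 0.0
    for i in range(k // 2 + 1, k + 1):
        # Binomial term computed in log space to avoid overflow for large k.
        log_term = (
            math.lgamma(k + 1) - math.lgamma(i + 1) - math.lgamma(k - i + 1)
            + i * math.log(p) + (k - i) * math.log(1 - p)
        )
        total += math.exp(log_term)
    return total
```

For an ensemble of 1200 independent classifiers that are each correct 60% of the time, this probability is above 0.999, which illustrates why the smoothed accuracies in Table 3, far below 100%, indicate substantial correlation between base classifiers.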
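Table 3: Training every partition with the same random seed (Same) versus distinct random seeds (Distinct).

| Dataset, Method | Number of Partitions | Median Certified Robustness (Same) | Median Certified Robustness (Distinct) | Clean Accuracy (Same) | Clean Accuracy (Distinct) | Base Classifier Accuracy (Same) | Base Classifier Accuracy (Distinct) |
|---|---|---|---|---|---|---|---|
| MNIST, DPA | 1200 | 444 | 448 | 95.33% | 95.82% | 76.43% | 77.00% |
| MNIST, DPA | 3000 | 512 | 509 | 92.87% | 93.28% | 49.53% | 49.59% |
| MNIST, SS-DPA | 1200 | 506 | 493 | 94.41% | 95.91% | 82.56% | 81.70% |
| MNIST, SS-DPA | 3000 | 667 | 675 | 91.94% | 93.93% | 59.69% | 59.21% |
| CIFAR, DPA | 50 | 9 | 9 | 69.88% | 70.29% | 56.07% | 56.25% |
| CIFAR, DPA | 250 | 5 | 6 | 55.56% | 55.72% | 34.90% | 35.18% |
| CIFAR, DPA | 1000 | N/A | N/A | 44.13% | 44.30% | 23.10% | 23.25% |
| CIFAR, SS-DPA | 50 | 21 | 21 | 81.50% | 81.97% | 74.63% | 74.73% |
| CIFAR, SS-DPA | 250 | 60 | 59 | 73.83% | 74.60% | 61.91% | 61.98% |
| CIFAR, SS-DPA | 1000 | 78 | 83 | 65.30% | 68.22% | 44.16% | 44.26% |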
Appendix G SS-DPA with Repeated Unlabeled Data
The definition of a training set that we use, $T \in \mathcal{P}(\mathcal{S}_L)$, technically allows for repeated samples with differing labels: there could be a pair of distinct samples, $t, t' \in T$, such that $\text{sample}(t) = \text{sample}(t')$, but with different labels. This creates difficulties with the definition of the label-flipping attack: for example, the attacker could flip the label of $t$ to become the label of $t'$, which would break the definition of $T$ as a set. In most applications, this is not a circumstance that warrants practical consideration: indeed, none of the datasets used in our experiments have such instances (nor can label-flipping attacks create them), and therefore, for performance reasons, our implementation of SS-DPA does not handle these cases. (Footnote 4: Specifically, to optimize for performance, we verify that there are no repeated sample values, and then sort $T$ itself (rather than $\text{samples}(T)$) lexicographically by image pixel values. This is equivalent to sorting $\text{samples}(T)$ if no repeated images occur, which we have already verified, and avoids an unnecessary lookup procedure to find the sorted index of the unlabeled sample for each labeled sample.) However, the SS-DPA algorithm as described in Section 2.3 can be implemented to handle such datasets, under a formalism of label-flipping tailored to represent this edge case.
Specifically, we define the space of possible labeled data points as $\mathcal{S}'_L := \mathcal{S} \times \mathcal{P}(\mathbb{N})$: each labeled data point consists of a sample along with a set of associated labels. We then restrict our dataset $T$ to be any subset of $\mathcal{S}'_L$ such that for all distinct $t, t' \in T$, $\text{sample}(t) \neq \text{sample}(t')$. In other words, we do not allow repeated sample values in $T$ as formally defined: if repeated samples exist, one can simply merge their sets of associated labels (in the formalism).
Note that using this definition, the size of $T$ will equal the size of $\text{samples}(T)$, and the samples will always remain the same: the adversary can only modify the label sets of samples “in place”. SS-DPA will always assign an unlabeled sample, along with all of its associated labels, to the same partition, regardless of any label flipping. In practice, this is because, as described in Section 2.3, the partition assignment of a labeled sample depends only on its sample value, not its label: all labeled samples with the same sample value will be put in the same partition. Note that this is true even if the implementation represents two identical sample values with different labels as two separate samples: one does not actually have to implement labels as sets. Therefore, any changes to any labels associated with a sample will only change the output of one base classifier, so all such changes can together be considered a single label flip in the context of the certificate. In the above formalism, the certificate represents the number of samples in $T$ whose label sets have been “flipped”: i.e., modified in any way.
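The label-set formalism can be sketched as follows. The names `merge_label_sets` and `label_set_flips`, and the use of integer tuples as sample values, are assumptions introduced for illustration; the sketch counts the attack magnitude as the number of sample values whose merged label set differs in any way.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Tuple

Sample = Tuple[int, ...]  # hypothetical stand-in for an image's pixel values


def merge_label_sets(pairs: Iterable[Tuple[Sample, int]]) -> Dict[Sample, FrozenSet[int]]:
    """Merge repeated sample values into one entry carrying a set of labels."""
    merged = defaultdict(set)
    for sample, label in pairs:
        merged[sample].add(label)
    return {sample: frozenset(labels) for sample, labels in merged.items()}


def label_set_flips(
    clean: Iterable[Tuple[Sample, int]], poisoned: Iterable[Tuple[Sample, int]]
) -> int:
    """Number of sample values whose label set was modified in any way."""
    before, after = merge_label_sets(clean), merge_label_sets(poisoned)
    return sum(1 for sample in before if before[sample] != after.get(sample))
```

Changing several labels attached to the same sample value counts as one flip here, mirroring the argument that all such changes affect a single base classifier.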