A cryptographic approach to black box adversarial machine learning

06/07/2019 ∙ by Kevin Shi, et al. ∙ 2

We propose an ensemble technique for converting any classifier into a computationally secure classifier. We define a simpler security problem for random binary classifiers and prove a reduction from this model to the security of the overall ensemble classifier. We provide experimental evidence of the security of our random binary classifiers, as well as empirical results of the adversarial accuracy of the overall ensemble to black-box attacks. Our construction crucially leverages hidden randomness in the multiclass-to-binary reduction.




1 Introduction

Current machine learning models are vulnerable at test time to adversarial examples, which are data points that have been imperceptibly modified from legitimate data points but are misclassified with high confidence. This problem has attracted significant research interest Szegedy2013 Goodfellow2015 in both explaining the existence of such examples and defending against an adversary who tries to compute them. Previous work has attempted to train models to be explicitly robust to attacks by incorporating robustness into the optimization problem Madry2017 Schott2018 , by input transformations and discretization to reduce model linearity JacobBuckmanAurkoRoyColinRaffell2018 , or by injecting randomness at inference time Xie2017 . However, these defenses have all been subsequently broken by changing the attack model slightly in terms of allowable perturbations Sharma2017 or by using more sophisticated attacks Athalye2018 .

Recent explanations suggest that the existence of adversarial examples is actually inevitable in high-dimensional spaces. Gilmer2018 Ford2019 show that these examples exist for any linear classifier with nonzero error rate under additive Gaussian noise. This vulnerability is a simple geometrical fact when the dimension $d$ is large: because most of the mass of a Gaussian distribution is concentrated near the shell, the distance to the closest misclassified example is a factor of order $\sqrt{d}$ closer than the distance to the shell. Ilyas2019 show that adversarial perturbations can actually be robust features for generalization, and thus their adversarial nature is just a misalignment with our natural human notions of robustness.

In light of the evidence for the inevitability of adversarial perturbations, one goal we can still hope to achieve is a computational separation between their existence and the computational complexity of computing them. We propose a cryptographic technique which uses hidden random binary codewords to prevent the adversary from easily computing these perturbations. Any instantiation of the random binary codewords produces an accurate classifier with high probability, so there is no danger of security through obscurity, because the model owner can sample his own fresh random bits. Furthermore, the space of all possible binary codewords is exponential in the number of output classes, so the adversary cannot simply try all of them.

A major show-stopper with black-box models which rely on hidden information is the phenomenon of transferability Papernot2016 , where an adversarial perturbation computed for one model has a high chance of causing an independently trained model to simultaneously fail. The first model is called the substitute model, and the second model is the black-box oracle model. Papernot2016b show that, even if the adversary is only given black-box oracle access to predicted labels, existing machine learning models are vulnerable to transfer learning attacks executed by training substitute models. The transfer success rate is the probability that an adversarial example computed for the substitute model is also misclassified by the black-box oracle.

Thus, in order to hide the randomness in a single classifier, we must not allow the adversary to directly query it. We achieve this by a special ensemble scheme such that the adversary learns only the output of the overall ensemble without learning any of the intermediate representations. Previous ensemble techniques for increasing adversarial robustness only subsample or augment the training data within each class Tramer2017 , whereas our ensemble samples random splits of the labels themselves within the overall multiclass classification setup. This means that the underlying classification problem is unknown to the adversary, and we argue that this randomness decreases the transfer success rate.

2 Preliminaries

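Let $\mathcal{X}$ be the feature space, and let $[K] = \{1, \dots, K\}$ be the set of classes. The learning problem is to construct a multiclass classifier $F: \mathcal{X} \to [K] \cup \{\bot\}$ that is allowed to abstain from making a prediction by returning the symbol $\bot$. We assume all classifier training is conducted using a fixed machine learning algorithm $\mathcal{A}$ which is public knowledge. $\mathcal{A}$ takes as input a set of binary-labeled data points $\{(x_i, y_i)\}$, where each $x_i \in \mathcal{X}$ and $y_i \in \{-1, +1\}$, and outputs a binary classifier $f: \mathcal{X} \to \{-1, +1\}$. Furthermore, we assume that $\mathcal{A}(\{(x_i, -y_i)\}) = -\mathcal{A}(\{(x_i, y_i)\})$, which just means that the labels $-1$ and $+1$ have no intrinsic meaning. Lastly, we fix some space $\Delta$ to be the set of allowable adversarial perturbations, for example $\Delta = \{\delta : \|\delta\|_\infty \le \epsilon\}$.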

2.1 Threat model

We consider the setting of a server hosting a fixed classifier $F$ and users who interact with the server by presenting a query $q$ to the server and receiving the output label $F(q)$. We call $F$ a black-box classifier, because the user does not see any of the intermediate computation values of $F$. Two types of users access the server: honest users who present queries drawn from a natural data distribution, and adversarial users who present adversarial examples designed to intentionally cause a misclassification. The desired property is to serve the honest users the true label while simultaneously preventing the adversarial users from causing a misclassification; the latter is accomplished by either continuing to return the true label on adversarial examples or by returning the abstain label $\bot$.

In order for this distinction to be well-defined, we need to separate natural misclassified examples from adversarial examples. We achieve this by fixing in advance a data point $x$ which is correctly classified by $F$ and requiring the adversary to compute a perturbation $\delta \in \Delta$ for this specific $x$ such that $F(x + \delta) \notin \{F(x), \bot\}$. We think of $x$ as a parameter of the attack, for example the natural image of the face of an attacker who wishes to masquerade as someone else. The classifier is secure if for all $x$, the adversary cannot find a $\delta$ satisfying this.

We formalize this attack problem by the notion of a security challenge. The adversary is given all the information about $F$ except for any internal randomness used to initialize $F$. The adversary is then given the challenge point $x$ with $y$ being the correct classification, and the adversary successfully solves the security challenge if he finds a $\delta \in \Delta$ such that $F(x + \delta) \notin \{y, \bot\}$ with non-negligible probability. The solution to the security challenge is a successful attack.

The separation between existence of a solution and feasibility of finding it is given by resource constraints on the adversary, most commonly in the form of runtime. We say that a security challenge is hard if there does not exist an algorithm for finding a solution within these resource constraints. In addition to runtime, we also consider the constraint of how many times the adversary is allowed to interact with the classifier.

We make a distinction between these query points (denoted by $q$) and the challenge point (denoted by $x$), both of which are feature vectors in $\mathcal{X}$. Query points are arbitrarily chosen by the adversary for the purpose of learning more about the black-box $F$, and there is no notion of correctness for $q$. The ability to obtain labels for arbitrary query points enables the adversary to mount more powerful black-box attacks, for example the substitute model training methods described by Papernot2016b . This larger space of possible attacks is realistic but also makes direct empirical security analysis difficult. Cryptographic proofs of security provide an alternative to direct analysis.

2.2 Security proofs in cryptography

Instead of directly trying to prove the security of the ensemble classifier $F$, we define a simpler system $f$ that is easier to empirically test and reason about. We then prove a reduction from the security challenge of $f$ to the security challenge of $F$, which shows that $F$ is at least as hard to attack as $f$. We define a security assumption that characterizes the hardness of attacking $f$. This security assumption cannot be mathematically proven to be true, but nonetheless defining the right assumption makes the reduction useful, because this assumption can be easier to empirically study. If the security assumption for the hardness of $f$ is true, then $F$ is secure.

The security assumption we define is the hardness of attacking a new type of randomized classifier without any query access to it. We give two reasons why this assumption is the right one to make. Firstly, the scope of attacks to analyze is greatly reduced when the attacker has no access to the classifier. The adversary can essentially only mount transfer learning attacks by training models on the public dataset. Secondly, we only require the adversary's probability of success to be bounded away from $1$ by a constant, and the overall security of the ensemble can be boosted from this bound. We next describe this assumption in detail.

2.3 Random binary classifiers

In a multiclass classification problem with labels $[K]$, suppose we have a binary classifier $f$ for two particular classes $y$ and $t$, where class $y$ is mapped to $-1$ and class $t$ is mapped to $+1$. An adversary is given a data point $x$ with $f(x) = -1$, and the adversary wishes to attack this binary classifier by computing a perturbation $\delta \in \Delta$ such that $f(x + \delta) = +1$. However, at training time $f$ was not trained on just data points with original labels $y$ or $t$, but with all remaining classes also having been randomly remapped to $\{-1, +1\}$ with equal probability. In other words, for each class $k \in [K] \setminus \{y, t\}$, we sample a Rademacher random variable $c_k$ and assign every data point of original label $k$ to the new binary label $c_k$. This random assignment does not change the original $y$-vs-$t$ classification task when all query data points are only of original class $y$ or $t$. The resulting $f$ corresponding to training with the random binary labels is a random binary classifier:

[Random binary classifier] Let $\mathcal{D}$ be a distribution over $\{-1, +1\}^K$. The random binary classifier over $\mathcal{D}$ is the distribution of $f = \mathcal{A}(\{(x_i, \tilde{y}_i)\})$ over $c \sim \mathcal{D}$, where each training data point $(x_i, y_i)$ is relabeled to $(x_i, \tilde{y}_i)$ by $c$:

$$\tilde{y}_i = c_{y_i}.$$

The security challenge for the random binary classifier is to compute a perturbation that changes its output with high probability over the sampling of $c$.

[Security challenge for random binary classifier] Let . Let be a Rademacher random vector, and let be the distribution of conditioned on . The security challenge for a challenge data point , failure rate , and target label is to compute a perturbation which changes the output of with failure rate no greater than :

In particular, the adversary has no ability to obtain labels for query points from the random binary classifier. ∎
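The relabeling behind a random binary classifier can be illustrated with a minimal sketch. The function names `sample_code` and `relabel`, the class indices, and the sign convention (original class fixed to $-1$, target class fixed to $+1$) are assumptions made for illustration, not taken from the authors' code.

```python
import numpy as np

def sample_code(num_classes, source, target, rng):
    """Sample a Rademacher vector c in {-1,+1}^K, then fix the two bits the
    adversary already knows: c[source] = -1 and c[target] = +1.
    (The sign convention is an assumption of this sketch.)"""
    c = rng.choice(np.array([-1, 1]), size=num_classes)
    c[source] = -1
    c[target] = 1
    return c

def relabel(labels, code):
    """Map every multiclass label y_i to the binary label code[y_i]."""
    return code[np.asarray(labels)]

rng = np.random.default_rng(0)
labels = np.array([0, 1, 2, 3, 3, 9])          # toy multiclass labels
code = sample_code(num_classes=10, source=3, target=9, rng=rng)
binary_labels = relabel(labels, code)          # training labels for one classifier
```

Training the public algorithm on `binary_labels` would give one draw of the random binary classifier. The other eight bits of `code` are the hidden randomness.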

Note that the adversary has knowledge of two of the bits of $c$, corresponding to the original label $y$ and some target label $t$. Our security assumption is that for any $x$, there is enough randomness in the remaining $K - 2$ data classes such that the failure rate is non-negligible.

[Security assumption] Given an instance of the security challenge for a random binary classifier, for any , for all , there exists a constant such that

whenever . ∎

Note that this implicitly assumes $\Delta$ does not contain any non-adversarial perturbations, such as those of the form $\delta = x' - x$ where $x'$ is a legitimate image of class $t$. In Section 4.1, we experimentally justify this assumption by estimating the transfer success probability for all pairs of classes $(y, t)$ in the MNIST and CIFAR-10 datasets using the standard $\ell_\infty$-ball for $\Delta$.

2.4 Main construction

Recall that our goal is to construct a multiclass classifier $F$ which is allowed to abstain from making a prediction (as represented by the output $\bot$), and an adversarial perturbation $\delta$ is only considered a successful attack if $F(x + \delta) \notin \{F(x), \bot\}$.

Our ensemble construction is the error-correcting code approach for multiclass-to-binary reduction Dietterich1994 , except with completely random codes for security purposes.

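[Random ensemble classifier] Given a multiclass classification problem with labels $[K]$, a codelength $n$, and a threshold parameter $\tau$:

  • Sample random matrix $C \in \{-1, +1\}^{K \times n}$, where each $C_{k,j} = \pm 1$ independently and with equal probability

  • For $j = 1, \dots, n$, construct the binary classifier $f_j$ by training $\mathcal{A}$ on the data relabeled by the $j$-th column of $C$

Given a query data point $x$, compute output $F(x)$ by:

  • Compute the predicted codeword vector $\hat{c}(x) = (f_1(x), \dots, f_n(x))$

  • Compute $(k^*, d^*)$, where $k^*$ is the index of the row of $C$ closest to $\hat{c}(x)$ and $d^*$ is the Hamming distance to that row

  • If $d^* \le \tau$, then output $k^*$, else output $\bot$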
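The decoding steps of the construction can be sketched as follows. This is a minimal illustration under stated assumptions: the binary classifiers are replaced by their output bits, the abstain symbol is encoded as the hypothetical constant `ABSTAIN = -1`, and the threshold is taken to be an absolute Hamming distance.

```python
import numpy as np

ABSTAIN = -1  # hypothetical encoding of the abstain symbol

def sample_code_matrix(num_classes, code_length, rng):
    """K x n matrix with independent, equiprobable +/-1 entries."""
    return rng.choice(np.array([-1, 1]), size=(num_classes, code_length))

def decode(predicted_codeword, code_matrix, threshold):
    """Return the class whose codeword (row) is nearest in Hamming distance,
    or ABSTAIN if even the nearest row is farther than the threshold."""
    distances = np.sum(code_matrix != predicted_codeword, axis=1)
    nearest = int(np.argmin(distances))
    return nearest if distances[nearest] <= threshold else ABSTAIN

rng = np.random.default_rng(0)
C = sample_code_matrix(num_classes=10, code_length=64, rng=rng)

clean = C[3].copy()      # every binary classifier agrees with class 3
noisy = clean.copy()
noisy[:5] *= -1          # five binary classifiers are wrong
far = -clean             # every binary classifier is flipped
```

With a threshold of 8 the sketch still decodes `noisy` as class 3, while `far` lands near no codeword and produces an abstention. Lowering the threshold to 4 makes `noisy` abstain as well.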

In this construction, the codeword $C_k$ (the $k$-th row of the code matrix $C$) acts as the identity of class $k$, and thus the classification of a data point $x$ is the class whose codeword is closest to its predicted codeword $\hat{c}(x)$. We should think of the free parameters as $n$ and $\tau$. $n$ needs to be sufficiently large in order for the random ensemble classifier to be accurate on natural examples, and $\tau$ needs to be sufficiently small for security purposes.

We give some intuition for why this construction has desirable security properties. In order for an adversary to change the overall output of some test point $x$ with true class $y$, he needs to change the output of sufficiently many binary classifiers so that $\hat{c}(x + \delta)$ is close to some codeword $C_t$ with $t \neq y$. But the Hamming distance between $C_y$ and $C_t$ is $n/2$ in expectation, and $\hat{c}(x)$ and $\hat{c}(x + \delta)$ must be within distance $\tau$ of $C_y$ and $C_t$ respectively. Since each binary classifier $f_j$ is constructed independently at random, the overall probability of success is exponentially decreasing in the probability of successfully changing the output of an individual classifier.
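The distance argument can be checked numerically with a short simulation. The parameter values below are arbitrary choices for illustration. The last line uses the triangle inequality: if the clean and perturbed predicted codewords each lie within `tau` of their class codewords, the attack must flip at least the codeword distance minus `2 * tau` bits.

```python
import numpy as np

rng = np.random.default_rng(1)
num_classes, code_length, tau = 10, 128, 16   # illustrative values only

# Random code matrix with independent, equiprobable +/-1 entries.
C = rng.choice(np.array([-1, 1]), size=(num_classes, code_length))

# Hamming distance between every pair of class codewords.
pairwise = (C[:, None, :] != C[None, :, :]).sum(axis=2)
off_diagonal = pairwise[~np.eye(num_classes, dtype=bool)]

mean_distance = off_diagonal.mean()          # concentrates around n / 2
min_flips = off_diagonal.min() - 2 * tau     # lower bound on bits an attack must flip
```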

We proceed to define the security challenge for this construction. We will use the shorthand notation to denote the distribution of where each entry is independently sampled from with equal probability.

[Security challenge for random ensemble] Let be the ensemble classifier constructed with random hidden code matrix as defined in Construction 2.4. The security challenge for a challenge data point and accuracy is a two-round protocol:

  1. Provide nonadaptive queries to and receive answer labels, denoted by . The queries cannot depend on but can depend on anything else, including the original training data set and the construction of .

  2. Return a perturbation by some function of the query answers such that satisfies

An algorithm for solving the security challenge is determined by its query set and the function for computing the final perturbation from the query answers. ∎
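The two-round structure of the challenge can be written as a small harness. This is a sketch only: `run_challenge`, `toy_oracle`, and `toy_attack` are hypothetical names, and the toy oracle is an ordinary threshold classifier used to exercise the protocol, not the random ensemble.

```python
import numpy as np

def run_challenge(oracle, queries, attack_fn, x):
    """Round 1: label a fixed, nonadaptive query set with the black box.
    Round 2: compute the perturbation as a function of the answers only."""
    answers = [oracle(q) for q in queries]   # no query depends on an earlier answer
    return attack_fn(queries, answers, x)

def toy_oracle(q):
    """Stand-in black box: label 1 when the coordinates sum to a positive value."""
    return int(np.sum(q) > 0)

def toy_attack(queries, answers, x, eps=2.0):
    """Move x toward the mean of the queries labeled 1, clipped to an
    l-infinity ball of radius eps (the set of allowable perturbations)."""
    positives = [q for q, a in zip(queries, answers) if a == 1]
    direction = np.mean(positives, axis=0) - x
    return np.clip(direction, -eps, eps)

x = np.array([-1.0, -1.0])
queries = [np.array([2.0, 2.0]), np.array([-2.0, -2.0]), np.array([3.0, 0.0])]
delta = run_challenge(toy_oracle, queries, toy_attack, x)
```

An adaptive attack would break the structure of `run_challenge` by choosing later queries after seeing earlier answers, which is the case the reduction does not cover.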

For example, one possible attack captured by this model is substitute DNN training with one epoch of data augmentation, which is a single-epoch version of the black-box attack described by Papernot2016b . The adversary obtains a pre-labeled dataset of arbitrary size (which could be the original training data set) and trains an initial substitute DNN on this dataset. The adversary then iteratively refines this initial DNN through substitute training epochs by using Jacobian data augmentation to construct new synthetic data points. These synthetic points are labeled using the black-box classifier and added to the labeled dataset using the classifier’s output as the label.

The synthetic data points are the queries , and thus our proof shows that a single epoch of data augmentation is not sufficient to construct a successful attack (assuming the security assumption is true). The actual implementation of this attack in Papernot2016a uses a constant number of substitute training epochs, and our proof does not apply directly to this implementation, because the second round of queries can depend on the answers in the first round. Nonetheless, we show empirically in Section 4.2 that our construction is still secure against this attack involving a constant number of rounds of queries.
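One epoch of Jacobian data augmentation can be sketched as below. As a simplifying assumption, the substitute is a linear model with logits `W @ x`, so the gradient of the logit for label `y` with respect to `x` is the row `W[y]`. A real substitute DNN would obtain this gradient by automatic differentiation. The function name and step size are hypothetical.

```python
import numpy as np

def jacobian_augment(points, oracle_labels, weights, step):
    """Create one synthetic point per existing point by stepping in the sign
    direction of the substitute's gradient for the oracle-assigned label."""
    grads = weights[oracle_labels]                 # shape (m, d): one row per point
    synthetic = points + step * np.sign(grads)     # new points to send as queries
    return np.concatenate([points, synthetic], axis=0)

points = np.array([[0.0, 0.0], [1.0, 1.0]])
oracle_labels = np.array([0, 1])                   # labels returned by the black box
W = np.array([[1.0, -1.0], [-2.0, 3.0]])           # toy substitute weights
augmented = jacobian_augment(points, oracle_labels, W, step=0.1)
```

The second half of `augmented` is the batch of synthetic points. In the single-epoch setting these form the nonadaptive query set.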

3 Security analysis

The main theoretical result is a reduction from solving the random classifier challenge to solving the random ensemble challenge. In our reduction, we make the simplifying assumption that the space of allowable perturbations is the same in both security challenges. This allows us to get away with not explicitly defining which perturbations are adversarial and which are legitimate, because a perturbation which makes a legitimate image of the class would solve both security challenges simultaneously. We also assume without loss of generality that is chosen such that , because Hamming distance is an integer.

Suppose there exists an algorithm that can solve the security challenge for the random ensemble with any threshold such that using queries and with accuracy . Then there is an algorithm that can compute a perturbation which solves the security challenge for a random binary classifier with failure rate

The algorithm succeeds in computing this perturbation with probability (over ) at least

The theorem shows that if such an algorithm exists, , and , then the failure rate decreases as for some constant , which contradicts the security assumption (Assumption 2.3). Conversely, if the security assumption is true, then an adversary cannot solve the security challenge for the random ensemble with nonadaptive queries to the ensemble classifier.

We give a brief proof sketch here, deferring the full proof to Section A. Given a single random classifier , we can simulate the entire ensemble classifier by constructing the remaining random classifiers using the public data set and . However, we cannot apply to directly, because in Definition 2.3 there is no query access to . Thus we first show in Lemma A that we can simulate the output of the entire ensemble using only classifiers with high probability.

Then, applying the algorithm to the ensemble of classifiers produces an attack perturbation which also applies to the entire ensemble of classifiers. Now we want to compute the probability of the output of each individual classifier in the ensemble being changed, but the queries could potentially leak information about some column . We use Lemma A for each column to show that this is not the case; i.e. that the query answers are completely determined by the remaining columns with high probability and thus independent of column itself. Then we show in Lemma A that an overall success probability of gives an upper bound on for each individual classifier.

4 Empirical results

We provide empirical analysis on both the security assumption (Assumption 2.3) and the adversarial test accuracy for the MNIST LeCun1998 and CIFAR-10 Krizhevsky2009 datasets. We use code from the CleverHans adversarial examples library Papernot2016a and from the MadryLab CIFAR10 adversarial examples challenge Madry for the base classifier architecture, training, and attacks. The only modification to the base classifier architecture was to change the output layer from dimension 10 to dimension 2 for a binary output; no further architecture tuning was performed to optimize natural accuracy.

4.1 Analysis of random binary classifiers

First, we empirically estimate the transfer success rate for all pairs of classes. We train a sample of 30 random binary classifiers and then compute an adversarial perturbation for each test data point and each target class. The perturbation is computed by using a pre-trained standard model for the respective dataset with all 10 output dimensions. We then compute whether each random binary classifier makes a different prediction on the original test data point versus the perturbed test data point. Finally, for each pair $(y, t)$ of original and target labels, we empirically estimate the probability of the output of the random binary classifier being changed, conditioned on its code separating the two classes ($c_y \neq c_t$), and plot this. The goal of this analysis is to show that this probability is bounded away from $1$ by a constant.
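The estimate for a single pair of classes can be sketched as follows. The array layout and the function name `transfer_success` are assumptions for illustration: `codes` holds one Rademacher code per trained classifier, and the two prediction arrays hold each classifier's binary output on the clean and perturbed test points.

```python
import numpy as np

def transfer_success(codes, clean_preds, adv_preds, source, target):
    """codes: (M, K) codes of M random binary classifiers.
    clean_preds, adv_preds: (M, N) outputs on N test points of class `source`
    and on their perturbations toward class `target`.
    Only classifiers whose code separates the two classes are counted."""
    separates = codes[:, source] != codes[:, target]
    if not separates.any():
        return float("nan")
    changed = clean_preds[separates] != adv_preds[separates]
    return float(changed.mean())

codes = np.array([[-1, 1, 1], [1, -1, 1], [-1, -1, 1], [1, 1, -1]])
clean_preds = np.array([[-1, -1], [1, 1], [-1, -1], [1, 1]])
adv_preds = np.array([[1, -1], [1, 1], [1, 1], [-1, -1]])
rate = transfer_success(codes, clean_preds, adv_preds, source=0, target=1)
```

In this toy input only the first two classifiers separate classes 0 and 1, and one of their four predictions changes, so `rate` is 0.25. Repeating the call over all ordered pairs fills the matrix that the heat maps display.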

4.1.1 MNIST

We use the Fast Gradient Sign Method applied to a simple convolutional neural network which achieves test accuracy and black-box adversarial test accuracy as the substitute model, as implemented in CleverHans Papernot2016a . The perturbation space is an $\ell_\infty$ ball with radius $\epsilon$ (note that this is standard notation for the step size of the attack in the literature; we no longer refer to the $\epsilon$ in the main theorem). The parameter setting is chosen by Papernot2016b as being optimal in the sense that increasing $\epsilon$ does not increase the attacker’s power. We also show results for to illustrate the robustness of our assumption.
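For reference, the Fast Gradient Sign Method perturbs an input by $\epsilon \cdot \mathrm{sign}(\nabla_x L)$, where $L$ is the loss of the substitute model. The sketch below uses a logistic-regression substitute as a stand-in for the convolutional network, which is an assumption made to keep the gradient in closed form. It is not the CleverHans implementation.

```python
import numpy as np

def fgsm(x, y, w, b, eps):
    """One-step FGSM against a logistic-regression substitute.
    For the cross-entropy loss with label y in {0, 1}, the input gradient is
    (sigmoid(w.x + b) - y) * w."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    grad = (p - y) * w
    # Step of size eps in the sign direction, then clip to the valid pixel range.
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

w = np.array([1.0, -2.0, 0.5])      # toy substitute weights
x = np.array([0.5, 0.5, 0.5])       # toy input with true label 1
x_adv = fgsm(x, y=1, w=w, b=0.0, eps=0.1)
```

Every coordinate moves by exactly `eps`, so the perturbation lies on the boundary of the $\ell_\infty$ ball, and the substitute's logit for the true class decreases.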

Figure 1 shows the success probabilities over all pairs of classes averaged over all of the test data points. The vertical axis corresponds to the original label , while the horizontal axis corresponds to the target label . The color scheme is the viridis palette, which scales uniformly from (black) to (yellow). The warmest coordinate for corresponds to a probability of .

Figure 1: Success probabilities for targeted attacks on MNIST random binary classifiers

Next we plot the success probabilities of each individual test data point for the highest misclassified pairs. Recall that our total sample size of random binary classifiers is 30, but a random code separates a given pair of classes only with probability $1/2$, so the expected sample size for each data point and each pair is 15 samples. Figure 2 shows the distribution for the two highest probabilities in the plot. We can see that even the worst-case test data points have probabilities bounded far away from $1$.

Figure 2: Distribution of success probabilities for individual MNIST test data points,

4.1.2 CIFAR-10

We use Projected Gradient Descent on the cross-entropy loss with an norm bound of , as implemented in the MadryLab CIFAR10 Adversarial Examples Challenge Madry . The pre-trained substitute is a w28-10 wide residual network Zagoruyko2016 , and the random binary classifiers are the same ResNet architecture but with two output dimensions instead of ten. Figure 3 shows the empirical success probabilities over the CIFAR-10 data set for all pairs of classes.
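Projected Gradient Descent repeats a small signed-gradient step and projects back onto the $\ell_\infty$ ball around the original input after each step. The sketch below again substitutes a logistic-regression model for the wide residual network and omits the random start that is often used. Both are simplifying assumptions, and the step size and iteration count are illustrative.

```python
import numpy as np

def pgd(x, y, w, b, eps, step, num_steps):
    """Iterated signed-gradient ascent on the cross-entropy loss of a
    logistic-regression substitute, projected onto the eps-ball around x."""
    x_adv = x.copy()
    for _ in range(num_steps):
        p = 1.0 / (1.0 + np.exp(-(w @ x_adv + b)))
        grad = (p - y) * w                            # input gradient of the loss
        x_adv = x_adv + step * np.sign(grad)          # ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)      # project onto the l-infinity ball
        x_adv = np.clip(x_adv, 0.0, 1.0)              # stay in the valid input range
    return x_adv

w = np.array([1.0, -2.0, 0.5])
x = np.array([0.5, 0.5, 0.5])
x_adv = pgd(x, y=1, w=w, b=0.0, eps=0.1, step=0.04, num_steps=5)
```

For a linear substitute the gradient sign never changes, so the iterate reaches the same corner of the ball as a single FGSM step. The two attacks differ only for nonlinear models.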

Figure 3: Success probabilities for targeted attacks on CIFAR-10 random binary classifiers

We see that attacks with target label (frog) have particularly high success rate on random binary classifiers when the source class is another animal. However for the majority of pairs, the security assumption is valid. We plot in Figure 4 the individual test data point distributions for the pairs and . We see that our security assumption actually fails when transforming cats, deer, and dogs into frogs, but the failure of the security assumption for these cases is at least interpretable in the sense that the easily confused classes are also close to each other by human perception.

Figure 4: Distribution of success probabilities for individual CIFAR10 test data points

4.2 Analysis of black-box adversarial accuracy

Next, we empirically analyze the robustness of our random ensemble construction to black-box transfer learning attacks. We use the CleverHans attack library Papernot2016a as a standard benchmark. The attack algorithm trains a two-layer fully connected substitute model by iteratively augmenting its training data set via queries to the random ensemble scheme and then uses the Fast Gradient Sign Method attack on the substitute model.

Because the attack library is not designed for querying a classifier that abstains, we perform substitute model training with a non-abstaining random ensemble (i.e. ). We consider the threshold at the end when analyzing the final true and adversarial test accuracies. In order to incorporate the abstain label, we use the following definitions of accuracy for our experiments. The true test accuracy requires the classifier to make the correct, non-abstaining prediction. However, when computing adversarial accuracy, we also consider it a success if the classifier outputs the abstain label.

[True and adversarial test accuracy] Given a multiclass classifier $F$ which is allowed to abstain from making a prediction (as represented by the output $\bot$), the relevant accuracy benchmarks are

True accuracy: $\Pr[F(x) = y]$
Adversarial accuracy: $\Pr[F(x + \delta) \in \{y, \bot\}]$

where $x$ is the original data point with true label $y$ and $\delta$ is an adversarial perturbation of $x$. ∎
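The two benchmarks differ only in how an abstention is scored, which a short sketch makes explicit. The constant `ABSTAIN = -1` is a hypothetical encoding of the abstain output.

```python
import numpy as np

ABSTAIN = -1  # hypothetical encoding of the abstain output

def true_accuracy(preds, labels):
    """Fraction of clean points given the correct, non-abstaining label."""
    return float(np.mean(preds == labels))

def adversarial_accuracy(adv_preds, labels):
    """Fraction of perturbed points that are either labeled correctly or
    rejected by abstaining; both outcomes defeat the attack."""
    return float(np.mean((adv_preds == labels) | (adv_preds == ABSTAIN)))

labels = np.array([0, 1, 2, 3])
preds = np.array([0, 1, ABSTAIN, 2])            # outputs on clean inputs
adv_preds = np.array([0, ABSTAIN, ABSTAIN, 1])  # outputs on perturbed inputs

clean_acc = true_accuracy(preds, labels)
adv_acc = adversarial_accuracy(adv_preds, labels)
```

Here an abstention on a clean input counts against `clean_acc` but in favor of `adv_acc`, which is the source of the tradeoff as the threshold is lowered.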

All random binary classifiers used in these experiments are the same architecture as the random binary classifiers in Section 4.1. Figure 5 shows that the ensemble enjoys good adversarial accuracy in the low-threshold regime, although there is a tradeoff with the true test accuracy.

Figure 5: Accuracy versus Hamming distance ratio ()

5 Conclusion

We proposed a novel approach to provable robustness at test time in the adversarial setting. We formalized a smaller attack problem which is easier to study and which we conjecture to be hard. We also show that our overall ensemble construction enjoys high adversarial accuracy against black-box attacks with standard measures of perturbation size while being completely agnostic of these parameters. Our formal proof framework introduces some techniques in analysis of cryptographic constructions to the adversarial learning problem, and we hope it can lead to more principled empirical and theoretical work in this area.


  • (ACW18) Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. In ICML, 2018.
  • (DB94) T. G. Dietterich and G. Bakiri. Solving Multiclass Learning Problems via Error-Correcting Output Codes. Journal of Artificial Intelligence Research, 2, 1994.
  • (FGCC19) Nic Ford, Justin Gilmer, Nicholas Carlini, and Dogus Cubuk. Adversarial Examples Are a Natural Consequence of Test Error in Noise. 2019.
  • (GMF18) Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S. Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian Goodfellow. Adversarial Spheres. 2018.
  • (GSS15) Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and Harnessing Adversarial Examples. International Conference on Learning Representations, pages 1–11, 2015.
  • (IST19) Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial Examples Are Not Bugs, They Are Features. 2019.
  • (Jac18) Jacob Buckman, Aurko Roy, Colin Raffel, and Ian Goodfellow. Thermometer Encoding: One Hot Way To Resist Adversarial Examples. ICLR, 19(1):92–97, 2018.
  • (Kri09) Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, 2009.
  • (LCB98) Y LeCun, C Cortes, and C J C Burges. The MNIST dataset of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
  • (Ma̧d17) Aleksander Ma̧dry. MadryLab CIFAR10 Adversarial Examples Challenge. https://github.com/MadryLab/cifar10_challenge, 2017.
  • (MMS17) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. pages 1–27, 2017.
  • (MS78) F J MacWilliams and N J. A. Sloane. The Theory of Error-Correcting Codes. 1978.
  • (PFC16) Nicolas Papernot, Fartash Faghri, Nicholas Carlini, Ian Goodfellow, Reuben Feinman, Alexey Kurakin, Cihang Xie, Yash Sharma, Tom Brown, Aurko Roy, Alexander Matyasko, Vahid Behzadan, Karen Hambardzumyan, Zhishuai Zhang, Yi-Lin Juang, Zhi Li, Ryan Sheatsley, Abhibhav Garg, Jonathan Uesato, Willi Gierke, Yinpeng Dong, David Berthelot, Paul Hendricks, Jonas Rauber, Rujun Long, and Patrick McDaniel. Technical Report on the CleverHans v2.1.0 Adversarial Examples Library. pages 1–12, 2016.
  • (PMG16a) Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples. 2016.
  • (PMG16b) Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical Black-Box Attacks against Machine Learning. 2016.
  • (SC17) Yash Sharma and Pin-Yu Chen. Attacking the Madry Defense Model with $L_1$-based Adversarial Examples. pages 1–9, 2017.
  • (SRBB18) Lukas Schott, Jonas Rauber, Matthias Bethge, and Wieland Brendel. Towards the first adversarially robust neural network model on MNIST. 3:1–16, 2018.
  • (SZS13) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. pages 1–10, 2013.
  • (TKP17) Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble Adversarial Training: Attacks and Defenses. pages 1–20, 2017.
  • (XWZ17) Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan Yuille. Mitigating Adversarial Effects Through Randomization. pages 1–16, 2017.
  • (ZK16) Sergey Zagoruyko and Nikos Komodakis. Wide Residual Networks. 2016.

Appendix A Proofs

Fix any query point and threshold such that . Given a random ensemble function with independently and identically generated random classifiers and threshold , fix some and let denote the modified ensemble which ignores the th random classifier and takes the vote over only the remaining classifiers. Then

where the probability is taken only over the matrix and is independent of the column .

The lemma shows that for any , with high probability over the query answer is independent of , so that no information is revealed by the queries about column . In the following proofs we will use the shorthand , i.e. the random classifier constructed from the th column of .


The only way the additional vote of can influence the vote of is if the predicted codeword of length is on the decision boundary between some class and the abstaining space corresponding to . In the boolean hypercube , the number of points that are at a distance of exactly to any fixed point is . Because we want our probability bound to hold true regardless of the value of , we have to consider the possibility of influencing the points on either side of the decision boundary. To account for this, we multiply the number by . Then over all classes, the number of possible points on the decision boundary is at most by a union bound.

Recall that we assumed the machine learning oracle is symmetric; that is, . Thus given any query data point , when uniformly, because the probabilities of sampling and are identical. Then over sampling of , the predicted codeword vector has independently distributed Rademacher entries, which means the probability mass on any point in is . Thus the probability of being on the decision boundary is at most


We now apply the binomial coefficient upper bound from [MS78], stated in Appendix B, to obtain

where is the negative entropy function. Note that is monotonically increasing in and reaches at , so when then . Thus the probability in (1) can be bounded by

Since , this gives an exponentially decaying probability bound in .

The next lemma is a concentration result that holds when no information is revealed by the queries about any individual column.

Suppose that the event is independent and identical for each column . Fix a data point . Given a perturbation which solves the security challenge for the random ensemble with target probability , then for every random classifier in the ensemble, solves the security challenge for it with failure rate


Recall that the adversary is said to have solved the security challenge for the random ensemble if the vector of code bits has Hamming distance less than to any other codeword , where . Since each entry of the code matrix is sampled independently, we can consider the probability of this event bit-by-bit.

Let be the event where . Let be the probability of the event where , meaning the codeword for class is the closest. By the independence assumption, we have where , or equivalently,


The probability of changing from to any other class can be bounded by applying the union bound to all . We obtain

and by the assumption of the lemma we know the left-hand side probability is . Thus we just need to compute

and apply a tail inequality for the binomial distribution.

Fix one underlying code bit and some other class . Each bit differs from the corresponding bit of with probability under the random code sampling scheme. Without loss of generality, we’ll let . We analyze the probability of the event by conditioning on , obtaining

We note that the term is exactly the probability in Definition 2.3. Then can be bounded by

Then the probability in (2) can be bounded by using Hoeffding’s inequality (Appendix B):

Thus we have

which is equivalent to

Proof of Theorem 3.

We are given an instance of the security challenge for a random binary classifier (Definition 2.3). Let be the random binary classifier, where is uniformly sampled. We can simulate the entire random ensemble by constructing additional random classifiers in the same way that is sampled, so that and are freshly sampled. Let denote the matrix without the th column, so that denotes the output of the random ensemble ignoring .

By the definition of the security challenge, the adversary cannot query ; however since is simulated by the adversary, he can make queries to and run to produce a perturbation attacking . But if for each query , then would have produced the same perturbation attacking .

By Lemma A and a union bound over the number of queries, the hypothetical query answers to the entire ensemble depend only on with probability at least


Now in order to apply Lemma A to bound as a function of , we want to show for each that the event is independent of the query answers . This can be done by applying Lemma A again to each column to show that with high probability, the query answers only depend on the random sampling of . Since is a function of the query answers, then this means that the adversary’s chosen also only depends on . We obtain

and we see that this probability has no dependence on the actual column since is independent and identical for each . We incur a factor in the probability of failure by applying a union bound of the failure probability in (3) over all . Thus the event is independent and identical for each column with probability at least

Then by Lemma A, the probability of changing the output of is at least

Appendix B Probability inequalities

Suppose is an integer, where . Then

where is the negative entropy function.
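A numerical check of an entropy bound on binomial coefficients is sketched below. It assumes the simple form $\binom{n}{\lambda n} \le 2^{n H(\lambda)}$ with $H$ the binary entropy in bits; the statement taken from [MS78] may carry an additional polynomial factor, which this check does not use.

```python
import math

def binary_entropy(lam):
    """H(lam) = -lam*log2(lam) - (1-lam)*log2(1-lam), with H(0) = H(1) = 0."""
    if lam in (0.0, 1.0):
        return 0.0
    return -lam * math.log2(lam) - (1.0 - lam) * math.log2(1.0 - lam)

def entropy_bound(n, k):
    """Upper bound 2^(n * H(k/n)) on the binomial coefficient C(n, k)."""
    return 2.0 ** (n * binary_entropy(k / n))

# Compare exact coefficients with the bound for several code lengths.
checks = [
    (n, k, math.comb(n, k), entropy_bound(n, k))
    for n in (16, 64, 256)
    for k in (1, n // 8, n // 4, n // 2)
]
```

Each tuple in `checks` holds the exact coefficient and the bound. The ratio between them grows only polynomially in `n`, which is why the bound is tight enough for an exponentially decaying probability estimate.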

[Hoeffding’s inequality] Suppose . Then for any ,

Appendix C Link to code for experiments