Adversarial examples from computational constraints

05/25/2018 ∙ by Sébastien Bubeck, et al. ∙ The University of Texas at Austin ∙ Microsoft

Why are classifiers in high dimension vulnerable to "adversarial" perturbations? We show that it is likely not due to information theoretic limitations, but rather it could be due to computational constraints. First we prove that, for a broad set of classification tasks, the mere existence of a robust classifier implies that it can be found by a possibly exponential-time algorithm with relatively few training examples. Then we give a particular classification task where learning a robust classifier is computationally intractable. More precisely we construct a binary classification task in high dimensional space which is (i) information theoretically easy to learn robustly for large perturbations, (ii) efficiently learnable (non-robustly) by a simple linear separator, (iii) yet is not efficiently robustly learnable, even for small perturbations, by any algorithm in the statistical query (SQ) model. This example gives an exponential separation between classical learning and robust learning in the statistical query model. It suggests that adversarial examples may be an unavoidable byproduct of computational limitations of learning algorithms.


1 Introduction

The most basic task in learning theory is to learn from a data set $(X_i, f(X_i))_{i \in [n]}$ a good approximation to the unknown input-output function $f$. One is typically interested in finding a hypothesis function $h$ with small out-of-sample probability of error. That is, assuming the $X_i$'s are i.i.d. from some distribution $D$, one wishes to approximately minimize $\mathbb{P}_{X \sim D}(h(X) \neq f(X))$. A more challenging task is to learn a robust hypothesis, that is, one that would minimize the probability of error against adversarially corrupted examples. More precisely, assume that the input space is endowed with a norm $\|\cdot\|$ and let $\epsilon > 0$ be a fixed robustness parameter. In robust learning the goal is to find $h$ to minimize:

$$\mathbb{P}_{X \sim D}\big(\exists z \text{ such that } \|z\| \le \epsilon \text{ and } h(X + z) \neq f(X)\big).$$

Such an input $X + z$ in the above event is colloquially referred to as an adversarial example. (Footnote 1: In the literature one sometimes uses a more stringent definition of adversarial examples, where $X$ and $X + z$ are in addition required to satisfy $f(X) = f(X + z)$. We ignore this requirement here.)
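As a small illustration of the robust objective above, the sketch below computes the empirical robust error of a linear classifier under Euclidean perturbations. The linear classifier, the labels in {-1, +1}, and the function name `robust_zero_one_loss` are assumptions made for this example and are not taken from the paper. For a linear rule the worst-case perturbation has a closed form, which is what makes the computation easy here.

```python
import numpy as np

def robust_zero_one_loss(w, b, X, y, eps):
    """Empirical eps-robust 0-1 loss of the linear classifier sign(w.x + b).

    A labeled point (x, y), with y in {-1, +1}, counts as an error if some
    perturbation z with ||z||_2 <= eps changes the prediction at x + z away
    from y. The worst perturbation moves x by eps against the direction
    y * w / ||w||, so the point is an error exactly when its signed margin
    y * (w.x + b) is at most eps * ||w||_2 (ties are counted as errors).
    """
    margins = y * (X @ w + b)
    return float(np.mean(margins <= eps * np.linalg.norm(w)))

# Toy data (hypothetical): four points on a line, separated by the first axis.
X = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [-0.5, 0.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 0.0])

for eps in (0.0, 1.0, 3.0):
    print(eps, robust_zero_one_loss(w, 0.0, X, y, eps))
```

With `eps = 0` this reduces to the usual zero-one loss; as `eps` grows, points close to the decision boundary start counting as errors even though they are classified correctly.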

Following Szegedy et al. [18] there is a rapidly expanding literature exploring the vulnerability of neural networks to adversarially chosen perturbations. The surprising observation is that, say in vision applications, for most images the perturbation can be chosen in a way that is imperceptible to a human yet dramatically changes the output of state-of-the-art neural networks. This is a particularly important issue as these neural networks are currently being deployed in real-world situations. Naturally there is by now a large literature (in fact going back at least to [3, 11]) on attacks (finding adversarial perturbations) and defenses (making classifiers robust against certain types of attacks).

While we have a sophisticated theory for the classical goal of minimizing the non-robust probability of error, our understanding of the robust scenario is still very rudimentary. At the moment, the “attackers” seem to be winning the arms race against the “defenders”, see e.g., [1]. We identify four mutually exclusive possibilities for why all known classification algorithms are vulnerable to adversarial examples:

  1. No robust classifier exists.

  2. Identifying a robust classifier requires too much training data.

  3. Identifying a robust classifier from limited training data is information theoretically possible but computationally intractable.

  4. We just have not found the right algorithm yet.

The goal of this paper is to provide two pieces of evidence, one in favor of hypothesis 3 and one against hypothesis 2. Our primary result is that hypothesis 3 is indeed possible: there exist robust classification tasks that are information theoretically easy but computationally intractable under a powerful model of computation (namely the statistical query model, see below). Our secondary result is evidence against hypothesis 2, showing that if a robust classifier exists then it can be found with relatively few training examples under a standard assumption on the data distribution (for example, that the distribution within each label is close to a Lipschitz generative model, or is drawn from a finite set of exponential size).

In Section 1.1 we discuss related work on adversarial examples in light of those four hypotheses. In Section 1.2 we introduce the model of computation under which we will prove intractability. We conclude the introduction with Section 1.3 where we give a brief proof overview for our primary and secondary results. These results are discussed in greater depth in Section 4 and Section 3, respectively.

1.1 Related work on adversarial examples

To the best of our knowledge, previous works have not linked computational constraints to adversarial examples, but instead have focused on the other three hypotheses.

Supporting hypothesis 1 is the work of Fawzi et al. [6]. Here the authors consider a generative model for the features, namely where is sampled from an isotropic Gaussian (in particular it is typically of Euclidean norm roughly ). The observation is that, due to Gaussian isoperimetry, no classifier is robust to perturbations in of Euclidean norm . If is -Lipschitz, this corresponds to perturbations of the image of at most . On the other hand, evidence against hypothesis 1 is the fact that humans seem to be robust classifiers with low error rate (albeit nonzero error rate, as shown by examples in [5]). This suggests that, to fit real distributions on images, the Lipschitz parameter in the data model assumed in [6] may be prohibitively large.

Another work arguing the inevitability of adversarial examples is Gilmer et al. [10]. There the authors propose a simple classification task, namely distinguishing between samples on the unit sphere in high dimension and samples on a sphere of radius bounded away from 1. They show experimentally that even in such a simple setup, state-of-the-art neural networks have adversarial examples at most points. We note however that this example only applies to specific classifiers, since it is easy to construct an efficient robust classifier for the given example (e.g., just use a linear model on the norm of the features); thus the “hardness” here only appears for a given network structure.
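The remark that a classifier based on the norm of the features is robust for the concentric-spheres task can be checked numerically. The sketch below is only an illustration: the outer radius `R = 1.3`, the dimension, and the helper names are assumptions chosen for the example. It thresholds the norm halfway between the two radii and applies the worst-case (radial) perturbation of size just under `(R - 1) / 2`.

```python
import numpy as np

def sample_sphere(n, d, radius, rng):
    """Draw n points uniformly from the sphere of the given radius in R^d."""
    g = rng.standard_normal((n, d))
    return radius * g / np.linalg.norm(g, axis=1, keepdims=True)

def norm_classifier(X, threshold):
    """Predict 1 (outer sphere) when the Euclidean norm exceeds the threshold."""
    return (np.linalg.norm(X, axis=1) > threshold).astype(int)

rng = np.random.default_rng(0)
d, R, n = 500, 1.3, 200          # hypothetical dimension, outer radius, sample size
inner = sample_sphere(n, d, 1.0, rng)
outer = sample_sphere(n, d, R, rng)

threshold = (1.0 + R) / 2        # midpoint between the two radii
eps = 0.9 * (R - 1.0) / 2        # perturbation budget just below half the gap

# For a norm threshold, the most damaging perturbation of size eps is radial:
# push inner points outward and outer points inward.
inner_attacked = inner * (1.0 + eps)
outer_attacked = outer * (1.0 - eps / R)

print(norm_classifier(inner_attacked, threshold).mean())  # fraction labeled "outer"
print(norm_classifier(outer_attacked, threshold).mean())
```

By the triangle inequality no perturbation of Euclidean norm below `(R - 1) / 2` can move a point across the threshold, so the classifier stays correct on every attacked sample.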

Supporting hypothesis 2 is the work of Schmidt et al. [16]. Here the authors consider a mixture of two separated Gaussians (isotropic, with means at distance ). With such a separation a single sample is sufficient to learn non-robustly; but to learn a classifier that is robust to -size perturbations in -norm one needs samples. This polynomial separation suggests that avoiding adversarial examples in high dimension requires a lot more samples than mere learning—but only up to samples. In fact, since their hard instance is essentially a set of possible distributions, our secondary result gives a black-box algorithm that would produce a robust classifier with samples.

Finally the large body of work on “adversarial defense” can be viewed as investigating hypothesis 4. We note that, at the time of writing, the state of the art defense Madry et al. [14] (according to [1]) is still far from being robust. Indeed on the CIFAR-10 dataset its accuracy is below even with very small perturbations (of order in -norm), while state of the art non-robust accuracy is higher than .

1.2 The SQ model

Proving computational hardness is a notoriously difficult problem. To circumvent this difficulty one usually either (i) reduces the problem at hand to a well-established computational hardness conjecture (e.g., proving NP-hardness), or (ii) proves an unconditional hardness within a limited computational framework (such as the oracle lower bounds in convex optimization, [15]). Our task here is further complicated by the average-case nature of the problem (the datasets are i.i.d. from some fixed distribution). Fortunately there is a growing set of results on computational hardness in learning theory that we can leverage. The statistical query (SQ) model of computation from Kearns [12] is a particularly successful instance of approach (ii) for learning theory: (a) most known learning algorithms fall in the framework, including in particular logistic regression, SVM, stochastic gradient descent, etc; and (b) SQ-hardness has been proved for many interesting problems that are believed to be computationally hard, such as learning parity with noise [12], learning intersection of halfspaces [13], the planted clique problem [8], robust estimation of high-dimensional Gaussians [4], or learning a function computable by a small neural network [17]. Thus we naturally use this model to prove our main result on the computational hardness of robust learning. We now recall the definition of the SQ model and state informally our main result.

As Kearns put it in his original paper, the SQ model considers “learning algorithms that construct a hypothesis based on statistical properties of large samples rather than on the idiosyncrasies of a particular sample”. More precisely, rather than having access to a data set , in the SQ model one must make queries to a -SQ oracle which operates as follows: given a -valued function defined on input/output pairs, the SQ oracle returns a value where . We refer to as the precision of the oracle. Obviously, an algorithm using queries to an oracle with precision can be simulated using a data set of size roughly . In our main result we consider an oracle with exponential precision. More concretely we take of order where is the dimension of the problem and are some numerical constants. Observe that such a high precision oracle cannot be simulated with a polynomial (in ) number of samples. Yet we show that even with such a high precision one needs an exponential number of queries to achieve robust learning for a certain task which on the other hand is easy to learn, and information theoretically learnable robustly:

Theorem 1.1 (informal).

For any , there exists a classification task in which is

  • learnable in time and samples;

  • robustly learnable in samples with -robustness parameter (while with high probability all samples have -norm );

  • not efficiently and robustly learnable in the statistical query model, in the sense that even with an exponential (in ) precision statistical query oracle one needs an exponential (in ) number of queries in order to robustly learn with robustness parameter .

The same result holds using the norm instead of , except with diameter .

Of course, a number of natural machine learning algorithms such as nearest neighbor are not based on statistical queries. Although we cannot prove it, we believe that our input distributions are computationally hard in general. For the case of nearest neighbor, the distances to points of each class have very similar distributions; indeed, the two distributions match on polynomially many moments. This suggests that exponentially many samples are necessary for nearest neighbor. For more information about nearest neighbor classifiers in the context of adversarial examples, see [20].

Moreover, there are very few problems in any domain with exponential SQ hardness for which polynomial time algorithms are known; in fact, the only such problems involve solving systems of linear equations over finite fields [7]. Since Theorem 1.1 involves a real-valued problem, finding a polynomial time algorithm that avoids the SQ lower bound would be a remarkable breakthrough in SQ theory.
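To make the statistical query oracle of this section concrete, here is a minimal sketch of how such an oracle can be simulated from a data set by empirical averaging. The class name `EmpiricalSQOracle`, the function `samples_needed`, the restriction to [0, 1]-valued queries, and the failure probability are assumptions of this example. The sample size comes from Hoeffding's inequality with a union bound over a fixed list of queries, which matches the "roughly log(number of queries) / precision squared" scaling mentioned above; adaptive queries need more care.

```python
import math
import numpy as np

def samples_needed(num_queries, tau, failure_prob=0.05):
    """Samples sufficient to answer `num_queries` fixed [0, 1]-valued queries
    within additive error tau, with probability >= 1 - failure_prob
    (Hoeffding's inequality plus a union bound)."""
    return math.ceil(math.log(2 * num_queries / failure_prob) / (2 * tau ** 2))

class EmpiricalSQOracle:
    """Answers a query q by its empirical mean over a fixed data set."""

    def __init__(self, X, y):
        self.X, self.y = X, y

    def query(self, q):
        values = np.asarray(q(self.X, self.y), dtype=float)
        if values.min() < 0.0 or values.max() > 1.0:
            raise ValueError("queries must take values in [0, 1]")
        return float(values.mean())

rng = np.random.default_rng(0)
tau = 0.05
n = samples_needed(num_queries=2, tau=tau)
X = rng.standard_normal((n, 3))
y = (X[:, 0] > 0).astype(int)      # toy labels: sign of the first coordinate

oracle = EmpiricalSQOracle(X, y)
frac_positive = oracle.query(lambda X, y: (y == 1).astype(float))
agreement = oracle.query(lambda X, y: ((X[:, 0] > 0) == (y == 1)).astype(float))
print(n, frac_positive, agreement)
```

The sample size grows like `1 / tau**2`, which is why an oracle with exponentially small `tau` cannot be simulated this way from polynomially many samples.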

1.3 Overview of proofs

Our secondary result, on the information theoretic achievability of robustness, is proved via simple arguments reminiscent of PAC-learning theory. Namely, if a classifier is not good enough for a given pair of distributions, we can rule it out with high confidence by looking at not too many samples. Then, we use a union bound to claim the result for a family of pairs that is either at most exponentially large, or is at least covered by a net of at most exponential size (the only subtlety is in the proper definition of a net in this robust context).

Our primary result, on the hardness of robustness, is technically much more challenging. The central object in the proof is a natural high-dimensional generalization of a construction from Diakonikolas et al. [4]. Roughly speaking, a hard pair of distributions is obtained by taking a standard multivariate Gaussian, choosing a random -dimensional subspace and planting there two well-separated distributions that match many moments of a Gaussian (in [4] only the case is considered). To show an SQ lower bound, we use – as in [4] – the framework of [2, 8] to reduce the question to computing a certain non-standard notion of correlation between the distributions. To bound said correlation, we deviate from [4] significantly, since their argument is tailored crucially to the case . Our argument is less precise, but allows which is necessary to obtain a large separation between the distributions (which in turn controls the parameter in Theorem 1.1).

2 Definitions

Throughout we restrict ourselves to binary classifiers, -feature space, as well as to balanced classes. We fix some norm in , and we denote .

Definition 2.1.

The -robust zero-one loss (with respect to ) is defined as follows, for and ,

Definition 2.2.

A binary classifier is -robust for a pair of distributions on if for any ,

Definition 2.3.

A (binary) classification task is given by a family of pairs of distributions over a domain . A classification algorithm receives datasets consisting of i.i.d. samples from and respectively, and outputs a classifier .

We say that is -robustly learnable with samples if there is a classification algorithm such that, for every , with probability at least over and , the algorithm produces a classifier that is -robust for .

Remark 2.4.

The success probability is an arbitrary constant larger than . It is easy to see that, for any , by using samples one can obtain a success probability of .

We also note that the classical -PAC learning scenario, with , corresponds to our definition of -robust classification with parameters and . Slightly more precisely, a concept class for PAC-learning corresponds to the family of all pairs of distributions supported respectively on and for some .

Definition 2.5.

We say that is -robustly feasible if every admits an -robust classifier. When it exists we denote for such a classifier (chosen arbitrarily among all robust classifiers for ), and .

3 Robust learning with few samples

Obviously robust feasibility is a necessary condition for robust learnability. We show that it is in fact sufficient, even for sample efficient robust learnability. We first do so when a finite set of classifiers suffices for robust feasibility.

3.1 Robust empirical risk minimization

Theorem 3.1.

Assume that is -robustly feasible. Then it is -robustly learnable with .

Proof.

Let be the empirical measure corresponding to the dataset . We will show that ERM on the -robust loss gives the claimed sample complexity. More precisely we consider the classification algorithm that outputs:

For shorthand notation we write and . In particular we simply want to prove that . Note that by definition . A standard Chernoff bound gives that, with probability at least , one has for every ,

Now observe that for one has , and thus we obtain with ,

It now suffices to observe that implies . ∎
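The algorithm in this proof, empirical risk minimization on the robust loss over a finite set of candidate classifiers, can be sketched as follows. This is an illustration under assumptions that are not in the paper: the candidates are linear classifiers (so that the robust loss has a closed form), labels are in {-1, +1}, and the names `robust_loss` and `robust_erm` are hypothetical. The running time is linear in the number of candidates, which may be exponential in the setting of Theorem 3.1.

```python
import numpy as np

def robust_loss(w, b, X, y, eps):
    """Empirical eps-robust 0-1 loss of sign(w.x + b) under l2 perturbations."""
    margins = y * (X @ w + b)
    return float(np.mean(margins <= eps * np.linalg.norm(w)))

def robust_erm(candidates, X, y, eps):
    """Return the candidate (w, b) with the smallest empirical robust loss."""
    losses = [robust_loss(w, b, X, y, eps) for (w, b) in candidates]
    best = int(np.argmin(losses))
    return candidates[best], losses[best]

# Toy data (hypothetical): two tight clusters separated along the first axis.
rng = np.random.default_rng(0)
n = 100
pos = np.array([3.0, 0.0]) + 0.3 * rng.standard_normal((n, 2))
neg = np.array([-3.0, 0.0]) + 0.3 * rng.standard_normal((n, 2))
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(n), -np.ones(n)])

# A small finite hypothesis set standing in for the finite family of
# robust classifiers considered in the theorem.
candidates = [
    (np.array([1.0, 0.0]), 0.0),
    (np.array([0.0, 1.0]), 0.0),
    (np.array([-1.0, 0.0]), 0.0),
]

(best_w, best_b), best_loss = robust_erm(candidates, X, y, eps=1.0)
print(best_w, best_b, best_loss)
```

Only the first candidate separates the clusters with margin larger than `eps`, so it is the one selected.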

3.2 Robust covering number

In many natural situations the classification task is specified by a continuous set of distributions. For example one might have a set of the form where and are Lipschitz functions and is some compact subset of . In this case Theorem 3.1 does not apply, although one would like to say that “essentially” is of log-size roughly . The classical solution to this difficulty is with covering numbers:

Definition 3.2.

For a metric space we write

With a slight abuse of notation we also extend the distance to the Cartesian product by .

With the above definitions one can obtain the following result as a straightforward corollary of Theorem 3.1 and the definition of total variation distance.

Theorem 3.3.

Assume that is -robustly feasible. Then is -robustly learnable with .

In fact, if one is willing to lose a little bit of robustness, one can use a significantly weaker notion of “distance” than total variation. Indeed we can consider a broader class of modifications to a distribution that preserve the robustness of a classifier: in Theorem 3.3 we used that we can move a small amount of mass arbitrarily far, but in fact we can also move an arbitrary amount of mass by a small distance. While the former type of movement corresponds to total variation distance, the latter corresponds to the (infinity) Wasserstein distance. We denote for the infimum of over all measures with marginal over (respectively ) equal to (respectively ). Next we introduce a slightly non-standard notion of covering with respect to a pair of distances.

Definition 3.4.

For a metric space equipped with two distances and we define an neighborhood by:

(Footnote 2: The choice of first moving with and then with will fit our application. In general a more natural definition would be: )

The corresponding covering number is:

It is now easy to prove the following strengthening of Theorem 3.3:

Theorem 3.5.

Assume that is -robustly feasible. Then is -robustly learnable with .

Proof.

Let be the set realizing the infimum in the definition of . Observe that is -robustly feasible with classifiers from , and apply Theorem 3.1. ∎

3.3 Covering number bound from generative models

We now show that distributions approximated by generative models have bounded covering numbers (in terms of Definition 3.4), so Theorem 3.5 gives a good sample complexity for such distributions. The proof is deferred to Appendix C in the supplementary material.

Definition 3.6.

A generative model is a neural network indexed by weights . The generated distribution is the distribution given by for .

Lemma 3.7.

Let be an -layer neural network architecture with at most activations in each layer and Lipschitz nonlinearities such as ReLUs. Consider any family of distribution pairs such that for each , and each , there exists some with . Then

4 Lower bound for the SQ model

Let and be two distributions over a set , for which we would like to solve a (binary) classification task. The SQ model, introduced in [12], is defined as follows. An algorithm is allowed to access and through queries of the following kind. A query is specified by a function , and the response is two numbers such that and . Here is a positive parameter called precision. After asking a number of such queries, the algorithm must output a required (robust or non-robust) classifier for and .

Our main result is as follows:

Theorem 4.1.

For every sufficiently small the following holds. There exists a family of pairs of distributions over such that:

  • Almost all the mass of and is supported in an -ball of radius ;

  • The distributions and admit a -robust classifier; moreover, a -robust classifier can be learned from samples from and ;

  • For and , there exists a linear (non-robust) classifier, which can be learned in polynomial time;

  • For every , in order to learn a -robust classifier for and , one needs at least statistical queries with accuracy as good as .

For instance, if is a small constant we get the existence of a -robust classifier, where is a large constant. One could push as high as at a cost of the lower bound being against SQ queries with somewhat worse accuracy ( instead of ).

We first show a family of pairs that admit a robust classifier, yet it is hard (in the SQ model) to learn any (non-robust) classifier. Later, in Section 4.3, we show a simple modification of this family to obtain the main result.

4.1 Hard family of distributions

Here we define a hard family of pairs of distributions as discussed above. This section contains the definition and key properties of the family; proofs of those properties appear in Appendix A. This family can be seen to be a high-dimensional generalization and modification of a family considered in [4]. The family depends on three parameters: integers , and a positive real .

Fix an integer . We introduce two auxiliary distributions over that we will use later as building blocks.

Lemma 4.2.

There exist two distributions and over with everywhere positive p.d.f.’s and respectively such that:

  • and match in the first moments;

  • There exist two subsets such that the distance between and is at least , , and ;

  • , and for every and , one has: .

(See Figure 1 for the illustration.)

Figure 1: The distributions in Lemma 4.2 are similar to discretized Gaussians, with careful discretization and weighting from Gauss-Hermite quadrature.

Next let us fix parameters and . Let be a family of -dimensional subspaces of with fixed orthonormal bases such that for every and , one has: . Informally speaking, subspaces from are pairwise near-orthogonal.

Lemma 4.3.

For every , there exists such a family with and .

Now we are ready to define our family of hard pairs of distributions over . The family is parameterized by a -dimensional subspace together with an orthonormal basis , where is the family of subspaces guaranteed by Lemma 4.3. Let us extend the above basis to a basis for the whole : . Now we define a pair of distributions and via their p.d.f.’s and respectively as follows:

where and are densities of distributions and from Lemma 4.2, and is the p.d.f. of the standard Gaussian distribution . Now we simply take to be and to be .

Lemma 4.4.

There exist two sets such that the distance between and is , and for which and .

As a result, the pair admits a -robust classifier. Moreover, since (which follows from standard bounds on the number of pairwise near-orthogonal unit vectors in ), it follows from Theorem 3.1 that one can learn a -robust classifier from merely samples.

4.2 SQ lower bound for learning a classifier for and

The heart of the matter is to show that it requires statistical queries with precision to learn a classifier for and provided that all the parameters are set correctly. The argument is fairly involved and uses the framework of [8] to reduce the question to that of upper bounding -correlation between the distributions. Due to space limitations, we show the argument in Appendix B of the supplementary material.

4.3 Making the distribution easy to learn non-robustly

Let us now show a family of pairs of distributions over such that it is easy to learn a (non-robust) classifier, but hard to learn a robust one. The construction is very simple: we take distributions over as defined above and define to be , where , and, similarly, to be , where and . These distributions admit a trivial (non-robust) classifier based on the first coordinate. Moreover, since and are linearly separable, they can be classified using linear SVM or logistic regression. Information-theoretically, one can learn a -robust classifier using samples by ignoring the first coordinate and applying Theorem 3.1. However, for every , one needs SQ queries with accuracy to learn an -robust separator. This can be shown exactly the same way as for and (see Appendix B in the supplementary material).

The above distributions are hard to learn robustly with respect to the norm. We can switch to by replacing by its Hadamard transform . Since , the robustness parameters in the theorem are unchanged while the diameter becomes .
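The properties of the Hadamard transform that this change of norm relies on can be checked directly. The sketch below builds the Hadamard matrix by the Sylvester construction and normalizes it to be orthogonal; the normalization by the square root of the dimension, the function name `normalized_hadamard`, and the dimension used are assumptions of this example, since the exact scaling used in the paper is not recoverable from the text here.

```python
import numpy as np

def normalized_hadamard(n):
    """Orthogonal n x n Hadamard matrix (Sylvester construction).

    n must be a power of two. Every entry equals +-1/sqrt(n).
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

n = 8
H = normalized_hadamard(n)
rng = np.random.default_rng(0)
x = rng.standard_normal(n)

# Orthogonality: Euclidean norms (and hence l2 robustness radii) are preserved.
print(np.allclose(H @ H.T, np.eye(n)))
print(np.linalg.norm(x), np.linalg.norm(H @ x))

# A vector concentrated on one coordinate is spread evenly by the transform,
# so its l-infinity norm drops from 1 to 1/sqrt(n).
e1 = np.zeros(n)
e1[0] = 1.0
print(np.max(np.abs(H @ e1)))
```

The general inequalities relating the two norms, namely that the sup norm is at most the Euclidean norm, which is at most the square root of the dimension times the sup norm, hold for any vector and are what connect the two robustness notions.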

5 Conclusion and future directions

In this paper we put forward the thesis that adversarial examples might be an unavoidable consequence of computational constraints for learning algorithms. Our main piece of evidence is a classification task, for which there essentially exists a classifier robust to Euclidean perturbations of size (while with high probability any sample has norm ), yet finding any non-trivial robust classifier (even for arbitrarily small perturbations, and with probability of correctness only slightly better than chance) is hard in the statistical query model (in the sense that one needs an exponential number of queries, even with a very high precision statistical query oracle). We identify several directions in which this result could be strengthened to give stronger evidence for our thesis.

  1. The most important question for the validity of our thesis is whether one could prove a similar hardness result for natural distributions. This is a particularly challenging open problem as the concept of a natural distribution is fuzzy (for instance there is no consensus on what a natural distribution for images should look like).

  2. We believe that our proposed classification task is really computationally hard in any sense, not only in the statistical query model. As we discussed SQ is natural for learning theory hardness, but there have been lots of works leveraging other types of hardness assumption (e.g., cryptographic). It would be interesting to explore further the position of robust learning in the hardness landscape.

  3. Finally one might wonder whether the perturbation size is optimal (for distributions essentially supported in a ball of size ). A concrete open question could be phrased as follows: consider a classification task that is -robustly feasible, how fast does need to grow in order to ensure that one can find in polynomial time a -robust classifier?

References

Appendix A Proofs of properties of the SQ hard distribution

We start with the following lemma on Hermite polynomials:

Lemma A.1.

For every , the distance between any roots of and is at least .

Proof.

It is known that extrema of are exactly zeros of , which follows from and a lack of double roots. Thus, it is enough to show that extrema and zeros of are -separated.

Consider the case where are such that , is positive between and , and . Let us show how to lower bound . Denote . Clearly, and is positive between and with a unique local maximum on , which we denote by . It is not hard to check that . Thus, it is enough to lower bound . It is known (see, e.g., [19, Section 5.5]) that satisfies the ODE . By comparing with , we get the lower bound .

Now let us lower bound . It is known [19, Section 5.5] that satisfies the ODE . By comparing this ODE with , we get that . The latter step is due to and that the lower bound on is nonincreasing in .

Other cases can be treated similarly. ∎

Lemma 4.2.

There exist two distributions and over with everywhere positive p.d.f.’s and respectively such that:

  • and match in the first moments;

  • There exist two subsets such that the distance between and is at least , , and ;

  • , and for every and , one has: .

(See Figure 1 for the illustration.)

Proof.

Let and be two consecutive (physicist’s) Hermite polynomials. It is a classic result in Gaussian quadrature (see, e.g., [19]) that for every , there exists a discrete distribution supported on the zeros of , which matches in the first moments. Let denote such a distribution for and the same for . By Lemma A.1, the distance between the supports of and is at least and they both match in the first moments.
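The two facts used in this paragraph, moment matching by Gauss-Hermite quadrature and separation between the zeros of consecutive Hermite polynomials, can be checked numerically with NumPy. This is a sketch under stated assumptions: `numpy.polynomial.hermite.hermgauss` uses the physicists' convention with weight exp(-x^2), so after dividing the weights by sqrt(pi) the discrete distribution matches the moments of a Gaussian with variance 1/2 up to degree 2m - 1. The degree `m = 6` and the helper names are chosen for the example, and the rescaling used in the paper is not reproduced here.

```python
import math
import numpy as np
from numpy.polynomial.hermite import hermgauss

def quadrature_distribution(m):
    """Discrete distribution on the zeros of the physicists' Hermite
    polynomial H_m: returns (support points, probabilities)."""
    nodes, weights = hermgauss(m)
    return nodes, weights / math.sqrt(math.pi)

def gaussian_moment(k, variance=0.5):
    """E[X^k] for X ~ N(0, variance): zero for odd k, (k-1)!! * variance^(k/2) otherwise."""
    if k % 2:
        return 0.0
    return float(np.prod(np.arange(1, k, 2))) * variance ** (k // 2)

m = 6
nodes_m, probs_m = quadrature_distribution(m)
nodes_next, probs_next = quadrature_distribution(m + 1)

# Moments of degree 0 .. 2m-1 agree with the Gaussian; degree 2m does not.
discrete_moments = [float(np.sum(probs_m * nodes_m ** k)) for k in range(2 * m + 1)]
target_moments = [gaussian_moment(k) for k in range(2 * m + 1)]

# Smallest distance between a zero of H_m and a zero of H_{m+1}.
support_gap = float(np.min(np.abs(nodes_m[:, None] - nodes_next[None, :])))
print(support_gap)
```

The two supports interlace, so they are disjoint, while both discrete distributions agree with the same Gaussian on all low-degree moments; this is the tension the construction exploits.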

Now, we obtain the desired distributions and as follows. Fix a small . The distribution is defined as , where , , and and are independent. The distribution is defined similarly, but instead of we use . It is easy to check that and match the first moments of . Now suppose that . The second property follows from the supports of and being separated and the standard concentration inequalities; specifically, we take to be the Minkowski sum of the support of scaled down and the ball of radius , and to be similar with instead of . Then the chance is not in is at most the chance has , which is .

Now let us prove the bounds on ; for the similar bounds follow in exactly the same way.

Denote the roots of .

One has:

where . Hence,

We have for every the bound  [9]. Therefore, if denotes the p.d.f. of we have

Lemma 4.3.

For every , there exists such a family with and .

Proof.

Let and be uniformly random -dimensional subspaces of . W.l.o.g. we can assume that is spanned by the first standard basis vectors. Let be spanned by an orthonormal basis such that each is distributed uniformly on the unit sphere of . Consider an -net of the unit sphere of of size . For every with probability at least the absolute value of the dot product of with a given is at most . As a result, with probability at least , dot products between all elements of and all are at most in the absolute value. But this implies that the dot products between all the unit vectors of and are at most . So, by setting and by using the union bound, we get that we can set:

Thus, we can set , and for a sufficiently small positive , which yields . ∎
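The phenomenon behind this lemma, that independent random low-dimensional subspaces of a high-dimensional space are nearly orthogonal, is easy to observe numerically. The sketch below is an illustration only: the dimensions, the number of subspaces, and the helper names are assumptions, and it checks the statement empirically for a handful of subspaces rather than reproducing the net argument of the proof.

```python
import numpy as np

def random_subspace_basis(n, k, rng):
    """Orthonormal basis (as columns of an n x k matrix) of a uniformly random
    k-dimensional subspace of R^n, via QR of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    return Q

def max_inner_product(U, V):
    """Largest |<u, v>| over unit vectors u in span(U) and v in span(V),
    which equals the top singular value of U^T V."""
    return float(np.linalg.svd(U.T @ V, compute_uv=False)[0])

rng = np.random.default_rng(0)
n, k, num_subspaces = 2000, 5, 10      # hypothetical sizes
bases = [random_subspace_basis(n, k, rng) for _ in range(num_subspaces)]

worst = max(
    max_inner_product(bases[i], bases[j])
    for i in range(num_subspaces)
    for j in range(i + 1, num_subspaces)
)
print(worst)
```

For these sizes the largest pairwise value is on the order of `sqrt(k / n)`, far below 1, which is the sense in which the subspaces are pairwise near-orthogonal.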

Lemma 4.4.

There exist two sets such that the distance between and is , and for which and .

Proof.

The sets are defined as follows:

and

The points and are well-separated, since in at least a -fraction of , both and . Since and are -separated, we obtain the result.

The bounds on the probabilities follow from the respective bounds in Lemma 4.2 and standard Chernoff bounds. ∎

Appendix B SQ lower bound

B.1 SQ lower bound

Now let us show that if we set all the parameters appropriately, it is hard in the SQ model to learn a good classifier (robust or otherwise) for distributions and defined above, where is an unknown subspace. The main idea is to show that if the subspace is chosen uniformly at random, unless we perform more than queries, we cannot tell apart or from the standard Gaussian (and as a result, from each other). Intuitively, any single query can only reliably distinguish from for a tiny fraction of subspaces . The result then follows by a simple counting argument. To formalize the above intuition, we use an argument similar at a high level to the one used in [4].

Let be distributions over with everywhere positive p.d.f.’s , , and , respectively. Then, the pairwise correlation of and w.r.t. , denoted by , is defined as follows:

In Section B.2, we show that for an appropriate setting of parameters (namely, when ), for every , one has:

and

Then by repeating the proof of Lemma 3.3 from [8], we get that if the number of queries is significantly smaller than: