Learning predictors that are robust to adversarial perturbations is an important challenge in contemporary machine learning. There has been considerable recent interest in the fact that predictors learned by deep learning are not robust to adversarial examples (Szegedy et al., 2013; Biggio et al., 2013; Goodfellow et al., 2014), and there is an ongoing effort to devise methods for learning predictors that are adversarially robust. In this paper, we consider the problem of learning, based on a (non-adversarial) i.i.d. sample, a predictor that is robust to adversarial examples at test time. We emphasize that this is distinct from the learning process itself being robust to an adversarial training set.
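Given an instance space $\mathcal{X}$ and label space $\mathcal{Y}$, we formalize an adversary we would like to protect against as $\mathcal{U}: \mathcal{X} \to 2^{\mathcal{X}}$, where $\mathcal{U}(x) \subseteq \mathcal{X}$ represents the set of perturbations (adversarial examples) that can be chosen by the adversary at test time. For example, $\mathcal{U}(x)$ could be perturbations of distance at most $\gamma$ w.r.t. some metric $\rho$, such as the $\ell_\infty$ metric considered in many applications: $\mathcal{U}(x) = \{z \in \mathcal{X} : \|x - z\|_\infty \le \gamma\}$. Our only (implicit) restriction on the specification of $\mathcal{U}$ is that $\mathcal{U}(x)$ should be nonempty for every $x \in \mathcal{X}$. For a distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, we observe $m$ i.i.d. samples $S \sim \mathcal{D}^m$, and our goal is to learn a predictor $\hat{h}: \mathcal{X} \to \mathcal{Y}$ having small robust risk,
$$R_{\mathcal{U}}(\hat{h}; \mathcal{D}) := \mathbb{E}_{(x,y) \sim \mathcal{D}}\Big[\sup_{z \in \mathcal{U}(x)} \mathbb{1}[\hat{h}(z) \neq y]\Big].$$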
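The common approach to adversarially robust learning is to pick a hypothesis class $\mathcal{H}$ (e.g. neural networks) and learn through robust empirical risk minimization:
$$\hat{h} \in \mathrm{RERM}_{\mathcal{H}}(S) := \operatorname*{argmin}_{h \in \mathcal{H}} \hat{R}_{\mathcal{U}}(h; S),$$
where $\hat{R}_{\mathcal{U}}(h; S) = \frac{1}{m} \sum_{(x,y) \in S} \sup_{z \in \mathcal{U}(x)} \mathbb{1}[h(z) \neq y]$. Most work on the problem has focused on computational approaches to solve this empirical optimization problem, or related problems of minimizing a robust version of some surrogate loss instead of the 0/1 loss (Madry et al., 2017; Wong and Kolter, 2018; Raghunathan et al., 2018a, b). But of course our true objective is not the empirical robust risk $\hat{R}_{\mathcal{U}}(h; S)$, but rather the population robust risk $R_{\mathcal{U}}(h; \mathcal{D})$.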
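To make the robust empirical risk and the RERM rule concrete, the following minimal Python sketch handles the special case of a finite hypothesis class and finite perturbation sets. The threshold class, the three-point perturbation map `perturb`, and the toy sample are assumptions chosen for illustration and are not taken from the paper.

```python
def robust_empirical_risk(h, sample, perturb):
    """Fraction of examples (x, y) for which some z in perturb(x) has h(z) != y."""
    mistakes = sum(any(h(z) != y for z in perturb(x)) for x, y in sample)
    return mistakes / len(sample)


def rerm(hypotheses, sample, perturb):
    """Return a hypothesis from a finite class minimizing the robust empirical risk."""
    return min(hypotheses, key=lambda h: robust_empirical_risk(h, sample, perturb))


def make_threshold(t):
    """Threshold classifier on the real line: +1 iff x >= t."""
    return lambda x: 1 if x >= t else -1


# Hypothetical toy setup: four thresholds, perturbations of size at most 0.5.
thresholds = [make_threshold(t) for t in (0.0, 1.0, 2.0, 3.0)]
perturb = lambda x: (x - 0.5, x, x + 0.5)
sample = [(0.0, -1), (1.0, -1), (3.0, 1), (4.0, 1)]

best = rerm(thresholds, sample, perturb)
print(robust_empirical_risk(best, sample, perturb))
```

On this sample only the threshold at 2.0 is robustly consistent: the thresholds at 1.0 and 3.0 each have a training point whose perturbation set crosses the decision boundary.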
How can we ensure that the population robust risk of the learned predictor is small? All prior approaches that we are aware of for ensuring adversarially robust generalization are based on uniform convergence, i.e. showing that w.h.p. for all predictors in the hypothesis class, the estimation error (the gap between the population and empirical robust risks) is small, perhaps for some surrogate loss (Bubeck et al., 2018; Cullina et al., 2018; Khim and Loh, 2018; Yin et al., 2018). Such approaches justify RERM, and in particular yield M-estimation type proper learning rules: we are learning a hypothesis class by choosing a predictor in the class that minimizes some empirical functional. For standard supervised learning we know that proper learning, and specifically ERM, is sufficient for learning, and so it is sensible to limit attention to such methods.
But it has also been observed in practice that the adversarial error does not generalize as well as the standard error, i.e. there can be a large gap between the population robust risk and the empirical robust risk even when their non-robust versions are similar (Schmidt et al., 2018). This suggests that perhaps the robust risk does not concentrate as well as the standard risk, and so RERM in adversarially robust learning might not work as well as ERM in standard supervised learning. Does this mean that such problems are not adversarially robustly learnable? Or is it perhaps that proper learners might not be sufficient?
In this paper we aim to characterize which hypothesis classes are adversarially robustly learnable, and using what learning rules. That is, for a given hypothesis class $\mathcal{H}$ and adversary $\mathcal{U}$, we ask whether it is possible, based on an i.i.d. sample, to learn a predictor that has population robust risk almost as good as any predictor in $\mathcal{H}$ (see the definitions in Section 2). We discover a stark contrast between proper learning rules, which output predictors in $\mathcal{H}$, and improper learning rules, which are not constrained to predictors in $\mathcal{H}$. Our main results are:
We show that there exists an adversary $\mathcal{U}$ and a hypothesis class $\mathcal{H}$ with finite VC dimension that cannot be robustly PAC learned with any proper learning rule (including RERM).
We show that for any adversary $\mathcal{U}$ and any hypothesis class $\mathcal{H}$ with finite VC dimension, there exists an improper learning rule that can robustly PAC learn $\mathcal{H}$ (although with sample complexity that is sometimes exponential in the VC dimension).
Our results suggest that we should start considering improper learning rules to ensure adversarially robust generalization. They also demonstrate that previous approaches to adversarially robust generalization are not always sufficient, as all prior work we are aware of is based on uniform convergence of the robust risk, either directly for the loss of interest (Bubeck et al., 2018; Cullina et al., 2018) or some carefully constructed surrogate loss (Khim and Loh, 2018; Yin et al., 2018), which would still justify the use of M-estimation type proper learning. The approach of Attias et al. (2018) for the case where the perturbation sets have bounded finite size (i.e. finite number of perturbations) is most similar to ours, as it uses an improper learning rule, but their analysis is still based on uniform convergence and so would apply also to RERM (the improperness is introduced only for computational, not statistical, reasons). Also, in this specific case, our approach would give an improved sample complexity that scales only roughly logarithmically with the number of perturbations, as opposed to the roughly linear scaling in Attias et al. (2018); see the discussion at the end of Section 4 for details.
A related negative result was presented by Schmidt et al. (2018), where they showed that there exists a family of distributions (namely, mixtures of two $d$-dimensional spherical Gaussians) where the sample complexity for standard learning is , but the sample complexity for adversarially robust learning is at least . This is an interesting instance where there is a large separation in sample complexity between standard learning and robust learning. But distribution-specific learning is known to be less easily characterizable, with uniform convergence not being necessary for learning, and ERM not always being optimal, even for standard (non-robust) supervised learning. In this paper we focus on “worst case” distribution-free robust learning, as in standard PAC learnability.
A different notion of robust learning was studied by Xu and Mannor (2012). They use empirical robustness as a design technique for learning rules, but their goal, and the guarantees they establish, concern the standard non-robust population risk, and so do not inform us about robust learnability.
2 Problem Setup
We are interested in studying the sample complexity of adversarially robust PAC learning in the realizable and agnostic settings. Given a hypothesis class $\mathcal{H}$, our goal is to design a learning rule $\mathcal{A}$ such that for any distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, the rule $\mathcal{A}$ will find a predictor that competes with the best predictor in $\mathcal{H}$ in terms of the robust risk using a number of samples that is independent of the distribution $\mathcal{D}$. The following definitions formalize the notion of robust PAC learning in the realizable and agnostic settings.¹

¹ We implicitly suppose that the hypotheses in $\mathcal{H}$ and their losses are measurable, and that standard mild restrictions on $\mathcal{H}$ are imposed to guarantee measurability of empirical processes, so that the standard tools of VC theory apply. See Blumer, Ehrenfeucht, Haussler, and Warmuth (1989); van der Vaart and Wellner (1996) for discussion of such measurability issues, which we will not mention again in the remainder of this article.
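[Agnostic Robust PAC Learnability] For any $\epsilon, \delta \in (0,1)$, the sample complexity of agnostic robust $(\epsilon,\delta)$-PAC learning of $\mathcal{H}$ with respect to adversary $\mathcal{U}$, denoted $\mathcal{M}_{\mathrm{AG}}(\epsilon,\delta;\mathcal{H},\mathcal{U})$, is defined as the smallest $m \in \mathbb{N} \cup \{0\}$ for which there exists a learning rule $\mathcal{A}$, mapping samples of size $m$ to predictors, such that, for every data distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, with probability at least $1-\delta$ over $S \sim \mathcal{D}^m$, the robust risk $R_{\mathcal{U}}$ satisfies
$$R_{\mathcal{U}}(\mathcal{A}(S);\mathcal{D}) \le \inf_{h \in \mathcal{H}} R_{\mathcal{U}}(h;\mathcal{D}) + \epsilon.$$
If no such $m$ exists, define $\mathcal{M}_{\mathrm{AG}}(\epsilon,\delta;\mathcal{H},\mathcal{U}) = \infty$. We say that $\mathcal{H}$ is robustly PAC learnable in the agnostic setting with respect to adversary $\mathcal{U}$ if, for all $\epsilon,\delta \in (0,1)$, $\mathcal{M}_{\mathrm{AG}}(\epsilon,\delta;\mathcal{H},\mathcal{U})$ is finite.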
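[Realizable Robust PAC Learnability] For any $\epsilon, \delta \in (0,1)$, the sample complexity of realizable robust $(\epsilon,\delta)$-PAC learning of $\mathcal{H}$ with respect to adversary $\mathcal{U}$, denoted $\mathcal{M}_{\mathrm{RE}}(\epsilon,\delta;\mathcal{H},\mathcal{U})$, is defined as the smallest $m \in \mathbb{N} \cup \{0\}$ for which there exists a learning rule $\mathcal{A}$, mapping samples of size $m$ to predictors, such that, for every data distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ where there exists a predictor $h^* \in \mathcal{H}$ with zero robust risk, $R_{\mathcal{U}}(h^*;\mathcal{D}) = 0$, with probability at least $1-\delta$ over $S \sim \mathcal{D}^m$,
$$R_{\mathcal{U}}(\mathcal{A}(S);\mathcal{D}) \le \epsilon.$$
If no such $m$ exists, define $\mathcal{M}_{\mathrm{RE}}(\epsilon,\delta;\mathcal{H},\mathcal{U}) = \infty$. We say that $\mathcal{H}$ is robustly PAC learnable in the realizable setting with respect to adversary $\mathcal{U}$ if, for all $\epsilon,\delta \in (0,1)$, $\mathcal{M}_{\mathrm{RE}}(\epsilon,\delta;\mathcal{H},\mathcal{U})$ is finite.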
[Proper Learnability] We say that $\mathcal{H}$ is properly robustly PAC learnable (in the agnostic or realizable setting) if it can be learned as in the agnostic or realizable definition above using a learning rule that always outputs a predictor in $\mathcal{H}$. We refer to learning using any learning rule, as in the definitions above, as improper learning.
We also denote by $\mathrm{err}_{\mathcal{D}}(h) = \mathbb{P}_{(x,y) \sim \mathcal{D}}(h(x) \neq y)$ the (non-robust) error rate under the 0-1 loss, and by $\mathrm{err}_{S}(h) = \frac{1}{m} \sum_{(x,y) \in S} \mathbb{1}[h(x) \neq y]$ the empirical error rate. These agree with the robust variant when $\mathcal{U}(x) = \{x\}$, and so robust learnability agrees with standard supervised learning when $\mathcal{U}(x) = \{x\}$. For more powerful adversaries, robust learnability is a special case of Vapnik’s “General Learning” (Vapnik, 1982), but cannot, in general, be phrased in terms of supervised learning of some modified hypothesis class or loss. We recall that the Vapnik-Chervonenkis dimension (VC dimension) is defined as follows.
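[VC dimension] We say that a sequence $\{x_1, \ldots, x_k\} \subseteq \mathcal{X}$ is shattered by $\mathcal{H}$ if for all $y_1, \ldots, y_k \in \mathcal{Y}$ there exists $h \in \mathcal{H}$ such that $h(x_i) = y_i$ for all $i \in [k]$. The VC dimension of $\mathcal{H}$ (denoted $\mathrm{vc}(\mathcal{H})$) is then defined as the largest integer $k$ for which there exists $\{x_1, \ldots, x_k\} \subseteq \mathcal{X}$ that is shattered by $\mathcal{H}$. If no such $k$ exists, then $\mathrm{vc}(\mathcal{H})$ is said to be infinite.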
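For small finite examples, shattering can be checked by brute force. The sketch below is an illustration under the assumption of a finite domain and a finite class of $\pm 1$-valued functions; the threshold and interval classes are hypothetical toy classes, not objects from the paper.

```python
from itertools import combinations


def is_shattered(points, hypotheses):
    """True iff the hypotheses realize all 2^k labelings of the k given points."""
    patterns = {tuple(h(x) for x in points) for h in hypotheses}
    return len(patterns) == 2 ** len(points)


def vc_dimension(domain, hypotheses):
    """Largest k such that some k-subset of the domain is shattered.

    Stopping at the first k with no shattered subset is valid because every
    subset of a shattered set is itself shattered.
    """
    d = 0
    for k in range(1, len(domain) + 1):
        if not any(is_shattered(pts, hypotheses) for pts in combinations(domain, k)):
            break
        d = k
    return d


domain = list(range(4))
# Thresholds: +1 iff x >= t.
thresholds = [lambda x, t=t: 1 if x >= t else -1 for t in range(5)]
# Intervals: +1 iff a <= x <= b.
intervals = [
    lambda x, a=a, b=b: 1 if a <= x <= b else -1
    for a in range(4)
    for b in range(a, 4)
]

print(vc_dimension(domain, thresholds), vc_dimension(domain, intervals))
```

The search is exponential in the domain size, so it is only meant for sanity checks on toy classes.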
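In the standard PAC learning framework, we know that a hypothesis class $\mathcal{H}$ is PAC learnable if and only if the VC dimension of $\mathcal{H}$ is finite (Vapnik and Chervonenkis, 1971, 1974; Blumer et al., 1989; Ehrenfeucht et al., 1989). In particular, $\mathcal{H}$ is properly PAC learnable with ERM, and therefore proper learning is sufficient for supervised learning. A natural question to ask, based on the definition of robust PAC learning, is what is a necessary and sufficient condition on $\mathcal{H}$ that implies that it is robustly PAC learnable with respect to adversary $\mathcal{U}$. We can easily obtain a sufficient condition based on Vapnik’s “General Learning” (Vapnik, 1982). Denote by $\mathcal{L}^{\mathcal{U}}_{\mathcal{H}}$ the robust loss class of $\mathcal{H}$,
$$\mathcal{L}^{\mathcal{U}}_{\mathcal{H}} = \Big\{ (x,y) \mapsto \sup_{z \in \mathcal{U}(x)} \mathbb{1}[h(z) \neq y] : h \in \mathcal{H} \Big\}.$$
If the robust loss class $\mathcal{L}^{\mathcal{U}}_{\mathcal{H}}$ has finite VC dimension ($\mathrm{vc}(\mathcal{L}^{\mathcal{U}}_{\mathcal{H}}) < \infty$), then $\mathcal{H}$ is robustly PAC learnable with RERM and sample complexity that scales linearly with $\mathrm{vc}(\mathcal{L}^{\mathcal{U}}_{\mathcal{H}})$. One might then wish to relate the VC dimension of the hypothesis class ($\mathrm{vc}(\mathcal{H})$) to the VC dimension of the robust loss class ($\mathrm{vc}(\mathcal{L}^{\mathcal{U}}_{\mathcal{H}})$). But as we show in Sections 3 and 5, there can be arbitrarily large gaps between them.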
As mentioned earlier, for supervised learning, finite VC dimension of the loss class (which is equal to the VC dimension of the hypothesis class) is also necessary for learning. For general learning, unlike supervised learning, the loss class having finite VC dimension, and uniform convergence over this class, is not, in general, necessary, and rules other than ERM might be needed for learning (e.g. Vapnik, 1982; Shalev-Shwartz et al., 2009; Daniely et al., 2015). In the following sections, we show that this is also the case for robust learning: the VC dimension of the robust loss class can be arbitrarily larger than that of the hypothesis class, we might not have uniform convergence, and RERM might not ensure learning, while the problem is still learnable with a different (improper, in our case) learning rule.
3 Sometimes There are no Proper Robust Learners
We start by showing that even for hypothesis classes with finite VC dimension, indeed even if $\mathrm{vc}(\mathcal{H}) \le 1$, robust PAC learning might not be possible using any proper learning rule. In particular, even if there is a robust predictor in $\mathcal{H}$, and even with an unbounded number of samples, RERM (or any other M-estimator or other proper learning rule) will not ensure a low robust risk.
There exists a hypothesis class $\mathcal{H}$ with $\mathrm{vc}(\mathcal{H}) \le 1$ and an adversary $\mathcal{U}$ such that $\mathcal{H}$ is not properly robustly PAC learnable with respect to $\mathcal{U}$ in the realizable setting.
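This result implies that finite VC dimension of a hypothesis class $\mathcal{H}$ is not sufficient for robust PAC learning if we want to use proper learning rules. For the proofs in this section, we will fix an instance space $\mathcal{X}$ equipped with a metric $\rho$, and an adversary $\mathcal{U}$ such that $\mathcal{U}(x) = \{z \in \mathcal{X} : \rho(x,z) \le \gamma\}$ for all $x \in \mathcal{X}$, for some $\gamma > 0$. First, we prove a lemma that shows that there exists a hypothesis class $\mathcal{H}$ where there is an arbitrarily large gap between the VC dimension of $\mathcal{H}$ and the VC dimension of the robust loss class of $\mathcal{H}$.
Let $m \in \mathbb{N}$. Then, there exists $\mathcal{H}$ such that $\mathrm{vc}(\mathcal{H}) \le 1$ but $\mathrm{vc}(\mathcal{L}^{\mathcal{U}}_{\mathcal{H}}) \ge m$.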
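Proof Pick $m$ points $x_1, \ldots, x_m$ in $\mathcal{X}$ such that $\mathcal{U}(x_i) \cap \mathcal{U}(x_j) = \emptyset$ for all $i \neq j$. In other words, we want the perturbation sets $\mathcal{U}(x_1), \ldots, \mathcal{U}(x_m)$ to be mutually disjoint.
We will construct a hypothesis class $\mathcal{H}$ in the following iterative manner. Initialize the set $Z = \emptyset$. For each bit string $b \in \{0,1\}^m$, initialize $Z_b = \emptyset$. For each $i \in [m]$, if $b_i = 1$ then pick a point $z \in \mathcal{U}(x_i) \setminus (Z \cup \{x_i\})$ and add it to $Z_b$, i.e. $Z_b = Z_b \cup \{z\}$. Once we finish picking points based on all bits that are set to $1$, we add $Z_b$ to $Z$ (i.e. $Z = Z \cup Z_b$). We define $h_b: \mathcal{X} \to \mathcal{Y}$ as:
$$h_b(x) = \begin{cases} -1 & \text{if } x \in Z_b, \\ +1 & \text{otherwise.} \end{cases}$$
Then, let $\mathcal{H} = \{h_b : b \in \{0,1\}^m\}$. We can think of each mapping $h_b$ as being characterized by a unique signature $Z_b$ that indicates the points that it labels with $-1$. These points are carefully picked such that, first, they are inside the perturbation sets of $x_1, \ldots, x_m$; and second, no two mappings label the same point with $-1$, i.e. for any $b, b' \in \{0,1\}^m$ where $b \neq b'$, $Z_b \cap Z_{b'} = \emptyset$. Also, we make sure that all mappings in $\mathcal{H}$ label the set $\{x_1, \ldots, x_m\}$ with $+1$.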
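Next, we proceed with proving two claims about $\mathcal{H}$. First, that $\mathrm{vc}(\mathcal{H}) \le 1$. Pick any two distinct points $z_1, z_2 \in \mathcal{X}$ and consider the following cases. In the first case, $z_1$ or $z_2$ is in $\mathcal{X} \setminus Z$. Suppose w.l.o.g. that $z_1 \in \mathcal{X} \setminus Z$. Then all mappings $h_b \in \mathcal{H}$ label $z_1$ in the same way, with label $+1$, because $z_1 \notin Z_b$ for all $b$. Therefore, we cannot shatter $\{z_1, z_2\}$ with $\mathcal{H}$. In the second case, $z_1$ and $z_2$ are both in $Z$. Since by our construction $Z = \bigcup_b Z_b$, and $Z_b \cap Z_{b'} = \emptyset$ for any $b \neq b'$, we have two sub-cases. Either $z_1, z_2 \in Z_b$ for some $b$, which means that the only labelings we can obtain are $(-1,-1)$ with $h_b$, and $(+1,+1)$ with $h_{b'}$ for any $b' \neq b$. Or $z_1 \in Z_b$ and $z_2 \in Z_{b'}$ for $b \neq b'$, in which case, by our construction, we cannot label both points $z_1$ and $z_2$ with $-1$, because they do not belong to the same set. Therefore, in both sub-cases, we cannot shatter $\{z_1, z_2\}$ with $\mathcal{H}$. This concludes that $\mathrm{vc}(\mathcal{H}) \le 1$.
Second, we will show that $\mathrm{vc}(\mathcal{L}^{\mathcal{U}}_{\mathcal{H}}) \ge m$. Consider the set $S = \{(x_1, +1), \ldots, (x_m, +1)\}$. We will show that $\mathcal{L}^{\mathcal{U}}_{\mathcal{H}}$ shatters $S$. Pick any labeling $y \in \{0,1\}^m$. Note that by construction of $\mathcal{H}$, there exists $h_b \in \mathcal{H}$ such that $b = y$. Then, for each $i \in [m]$, $\sup_{z \in \mathcal{U}(x_i)} \mathbb{1}[h_b(z) \neq +1] = b_i = y_i$. This shows that $\mathcal{L}^{\mathcal{U}}_{\mathcal{H}}$ shatters $S$, and therefore $\mathrm{vc}(\mathcal{L}^{\mathcal{U}}_{\mathcal{H}}) \ge m$.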
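The construction in this proof can be checked on a small finite instance. The Python sketch below is an illustration under several assumptions that are not in the text: the instance space is a finite set of tagged tuples rather than a metric space, each perturbation set consists of a "center" plus the fresh points assigned to it, labels are taken to be $\pm 1$ with signature points labeled $-1$, and the names `build_class`, `perturb`, and `robust_loss` are hypothetical.

```python
from itertools import combinations, product


def vc_dimension(domain, hypotheses):
    """Brute-force VC dimension of a finite class of +/-1 functions on a finite domain."""
    d = 0
    for k in range(1, len(domain) + 1):
        shattered = any(
            len({tuple(h(x) for x in pts) for h in hypotheses}) == 2 ** k
            for pts in combinations(domain, k)
        )
        if not shattered:
            break
        d = k
    return d


def build_class(m):
    """Toy version of the lemma's class on m centers with disjoint perturbation sets."""
    centers = [("center", i) for i in range(m)]
    # Signature of h_b: one fresh point per set bit of b, never reused by another b.
    signatures = {
        bits: frozenset(("fresh", i, bits) for i in range(m) if bits[i] == 1)
        for bits in product((0, 1), repeat=m)
    }

    def perturb(x):
        # Perturbation set of center i: the center plus all fresh points picked inside it.
        if x[0] == "center":
            i = x[1]
            return [x] + [("fresh", i, bits) for bits in signatures if bits[i] == 1]
        return [x]

    # h_b labels its signature points -1 and everything else +1.
    hypotheses = {
        bits: (lambda x, zs=zs: -1 if x in zs else 1)
        for bits, zs in signatures.items()
    }
    domain = centers + sorted(z for zs in signatures.values() for z in zs)
    return centers, domain, perturb, hypotheses


def robust_loss(h, x, y, perturb):
    """1 if some perturbation of x is misclassified by h, else 0."""
    return int(any(h(z) != y for z in perturb(x)))


m = 3
centers, domain, perturb, hypotheses = build_class(m)
loss_patterns = {
    tuple(robust_loss(h, x, 1, perturb) for x in centers)
    for h in hypotheses.values()
}
print(vc_dimension(domain, list(hypotheses.values())), len(loss_patterns))
```

In this toy instance the class has VC dimension 1, while the robust losses on the $m$ centers (all labeled $+1$) realize every one of the $2^m$ binary patterns, mirroring the two claims of the proof.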
The following lemma establishes that for any sample size $m$, there exists a hypothesis class $\mathcal{H}$ with $\mathrm{vc}(\mathcal{H}) \le 1$ such that any proper learning rule will fail in learning a robust classifier when it observes at most $m$ samples, though not necessarily when it observes more.
Let . Then, there exists with such that for any proper learning rule ,
a distribution over and a predictor where .
With probability at least over , .
Proof This proof follows standard lower bound techniques that use the probabilistic method (Shalev-Shwartz and Ben-David, 2014, Chapter 5). Let . Construct as before, according to Lemma 3, on points . By construction, we know that shatters the set . We will only keep a subset of that includes classifiers that are robustly correct only on subsets of size , i.e. . Let be an arbitrary proper learning rule. The main idea here is to construct a family of distributions that are supported only on points of , which would force rule to choose which points it can afford to be not correctly robust on. If rule observes only points, it can’t do anything better than guessing which of the remaining points of are actually included in the support of the distribution.
Consider a family of distributions where , each distribution is uniform over only points in . For every distribution , by construction of , there exists a classifier such that . This satisfies the first requirement. For the second requirement, we will use the probabilistic method to show that there exists a distribution such that , and finish the proof using a variant of Markov’s inequality.
Pick an arbitrary sequence . Consider a uniform weighting over the distributions . Denote by the event that for a distribution that is picked uniformly at random. We will lower bound the expected robust loss of the classifier that rule outputs, namely , given the event ,
We can lower bound the robust loss of the classifier by conditioning on the event that denoted ,
Since , and is uniform over its support of size , we have . This allows us to get a lower bound on (1),
Since , by construction of , we know that there are at least points in where is not robustly correct. We can unroll the expectation over as follows
Thus, it follows by (2) that . Now, by law of total expectation,
Since the expectation over is at least , this implies that there exists a distribution such that
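. Using a variant of Markov’s inequality, for any random variable $Z$ taking values in $[0,1]$, and any $a \in (0,1)$, we have $\mathbb{P}(Z > a) \ge \frac{\mathbb{E}[Z] - a}{1 - a}$. Applying this with $Z$ taken to be the robust risk of the predictor returned by the learning rule gives the second requirement of the lemma.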
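For completeness, the variant of Markov's inequality used in the last step follows from a one-line decomposition of the expectation, for a random variable $Z$ with values in $[0,1]$ and a threshold $a \in (0,1)$. The numerical values in the final display (an expectation of at least $1/4$ and a threshold $a = 1/8$) are assumed here purely for illustration.

```latex
% Z takes values in [0,1]; a is any threshold in (0,1).
\begin{align*}
\mathbb{E}[Z]
  &= \mathbb{E}\big[Z \,\mathbb{1}[Z \le a]\big] + \mathbb{E}\big[Z \,\mathbb{1}[Z > a]\big] \\
  &\le a \,\mathbb{P}(Z \le a) + \mathbb{P}(Z > a) \\
  &= a + (1 - a)\,\mathbb{P}(Z > a),
\end{align*}
% Rearranging gives the lower-tail form of Markov's inequality:
\[
  \mathbb{P}(Z > a) \;\ge\; \frac{\mathbb{E}[Z] - a}{1 - a}.
\]
% Illustration with assumed values E[Z] >= 1/4 and a = 1/8:
\[
  \mathbb{P}\big(Z > \tfrac{1}{8}\big) \;\ge\; \frac{\tfrac{1}{4} - \tfrac{1}{8}}{1 - \tfrac{1}{8}} \;=\; \frac{1}{7}.
\]
```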
We now proceed with the proof of Theorem 3.
Proof of Theorem 3 Let be an infinite sequence of sets such that each set contains distinct points from , where for any such that we have . For each , construct on as in Lemma 3. We want to ensure that predictors in are non-robust on the points in for all , by doing the following adjustment for each (recall from Lemma 3 that each predictor has its own unique signature ),
Let . We will show that . Pick any two points . There are six cases to consider. In case both and are in for some , then we only obtain the labelings (by predictors from ) and (by predictors from with ). In case both and are in , then they are not shattered by Lemma 3. In case and for , then we can only obtain the labelings (by predictors in ), (by predictors in ), and (by predictors in for ). In case and for , then we can’t obtain the labeling . In case and for , then we can’t obtain the labeling . Finally, if either or is in but not in and not in , then all predictors label or with , and so we can’t shatter them. This shows that .
By Lemma 3, it follows that for any proper learning rule and for any , we can construct a distribution over where there exists a predictor with , but with probability at least over , . This works because classifiers from classes where make mistakes on points in and so they are non-robust. Thus, rule will do worse if it picks predictors from these classes. This shows that the sample complexity to properly robustly PAC learn is infinite. This concludes that is not properly robustly PAC learnable.
4 Finite VC Dimension is Sufficient for (Improper) Robust Learnability
In the previous section we saw that finite VC dimension is not sufficient for proper robust learnability. We now show that it is sufficient for improper robust learnability, thus (1) establishing that if $\mathcal{H}$ is learnable, it is also robustly learnable, albeit possibly with a higher sample complexity; and (2) showing that, unlike in the standard supervised learning setting, to achieve learnability we might need to escape properness, as improper learning is necessary for some hypothesis classes.
We begin, in Section 4.1 with the realizable case, i.e. where there exists with zero robust risk. Then in Section 4.2 we turn to the agnostic setting, and observe that a version of a recent reduction by David, Moran, and Yehudayoff (2016) from agnostic to realizable learning applies also for robust learning. We thus establish agnostic robust learnability of finite VC classes by using this reduction and relying on the realizable learning result of Section 4.1.
4.1 Realizable Robust Learnability
For any and , ,
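In fact, what we show is a stronger result, which bounds the sample complexity as a function of the dual VC dimension. Formally, for each $x \in \mathcal{X}$, define a function $g_x: \mathcal{H} \to \mathcal{Y}$ such that $g_x(h) = h(x)$ for each $h \in \mathcal{H}$. Then the dual VC dimension of $\mathcal{H}$, denoted $\mathrm{vc}^*(\mathcal{H})$, is defined as the VC dimension of the set $\mathcal{G} = \{g_x : x \in \mathcal{X}\}$. This quantity is known to satisfy $\mathrm{vc}^*(\mathcal{H}) < 2^{\mathrm{vc}(\mathcal{H})+1}$ (Assouad, 1983), though for many spaces it satisfies $\mathrm{vc}^*(\mathcal{H}) = O(\mathrm{vc}(\mathcal{H}))$ or even, as is the case for linear separators, $\mathrm{vc}^*(\mathcal{H}) \le \mathrm{vc}(\mathcal{H})$. The proof below will establish that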
from which the above theorem immediately follows.
Our approach to this proof is via sample compression arguments. Specifically, we make use of a lemma (Lemma A in Appendix 4.2), which extends to the robust loss the classic compression-based generalization guarantees from the 0-1 loss. We now proceed with the proof of Theorem 4.1.
Proof of Theorem 4.1 The learning algorithm achieving this bound is a modification of a sample compression scheme recently proposed by Moran and Yehudayoff (2016), or more precisely, a variant of that method explored by Hanneke, Kontorovich, and Sadigurschi (2019). Our modification forces the compression scheme to also have zero empirical robust loss. Fix and a sample size , and denote by any distribution with .
By classic PAC learning guarantees (Vapnik and Chervonenkis, 1974; Blumer et al., 1989), there is a positive integer with the property that, for any distribution over with , for iid -distributed samples , with nonzero probability, every satisfying also has .
Fix a deterministic function mapping any labeled data set to a classifier in robustly consistent with the labels in the data set, if a robustly consistent classifier exists (i.e., having zero on the given data set). Suppose we are given training examples as input to the learner. Under the assumption that this is an iid sample from a robustly realizable distribution, we suppose , which should hold with probability one. Denote by for every . Before we can apply the compression approach, we first need to inflate the data set to a (potentially infinite) larger set, and then discretize it to again reduce it back to a finite sample size. Denote by . Note that . Define an inflated data set . As it is difficult to handle this potentially-infinite set in an algorithm, we consider a discretized version of it. Specifically, consider a dual space : a set of functions defined as , for each and each . The VC dimension of is at most the dual VC dimension of : , which is known to satisfy (Assouad, 1983). Now denote by a subset of which includes exactly one for each distinct classification of realized by functions . In particular, by Sauer’s lemma (Vapnik and Chervonenkis, 1971; Sauer, 1972), , which for is at most . In particular, note that for any and , if for every , then for every as well, which would further imply . We will next go about finding such a set of functions.
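The inflation step rests on a simple equivalence: a classifier has zero empirical robust loss on the sample exactly when it has zero ordinary 0-1 error on the inflated data set. The Python sketch below illustrates this in the special case of finite perturbation sets, where no discretization is needed; the perturbation map, the sample, and the threshold classifiers are hypothetical toy choices.

```python
def inflate(sample, perturb):
    """Replace each example (x, y) by the labeled points (z, y) for every z in perturb(x)."""
    return [(z, y) for x, y in sample for z in perturb(x)]


def zero_one_error(h, labeled_points):
    """Ordinary empirical 0-1 error of h on a list of labeled points."""
    return sum(h(z) != y for z, y in labeled_points) / len(labeled_points)


def robust_error(h, sample, perturb):
    """Empirical robust loss: an example counts as wrong if any perturbation is misclassified."""
    return sum(any(h(z) != y for z in perturb(x)) for x, y in sample) / len(sample)


# Hypothetical toy setup on the real line.
perturb = lambda x: (x - 0.5, x, x + 0.5)
sample = [(0.0, -1), (1.0, -1), (3.0, 1)]
inflated = inflate(sample, perturb)
thresholds = {t: (lambda x, t=t: 1 if x >= t else -1) for t in (1.0, 2.0, 3.0)}

for t, h in thresholds.items():
    print(t, robust_error(h, sample, perturb) == 0, zero_one_error(h, inflated) == 0)
```

When the perturbation sets are infinite, the inflated set cannot be enumerated this way, which is why the proof passes to a finite discretized subset before running the compression argument.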
By our choice of , we know that for any distribution over , iid samples sampled from would have the property that, with nonzero probability, all with also have . In particular, this implies at least that there exists a subset with such that every with has . For such a set , note that , and therefore there exists a set with and . Furthermore, since for every , we know , and hence . Altogether, we have that, for any distribution over , with .
We will use the above as a weak hypothesis in a boosting algorithm. Specifically, we run the $\alpha$-Boost algorithm (Schapire and Freund, 2012, Section 6.4.2) on the discretized inflated data set, using the above mapping to produce the weak hypotheses for the distributions produced on each round of the algorithm. As proven in (Schapire and Freund, 2012), for an appropriate a-priori choice of $\alpha$ in the $\alpha$-Boost algorithm, running this algorithm for rounds suffices to produce a sequence of hypotheses s.t.
From this observation, we already have a sample complexity bound, only slightly worse than the claimed result. Specifically, the above implies that satisfies . Note that each of these classifiers is equal for some with . Thus, the classifier is representable as the value of an (order-dependent) reconstruction function with a compression set size
Thus, invoking Lemma A, if (for a sufficiently large numerical constant ), we have that with probability at least ,
and setting this less than and solving for a sufficient size of to achieve this yields a sample complexity bound, which is slightly greater than that claimed in (3). We next proceed to further refine this bound via a sparsification step. However, as an aside, we note that the above intermediate step will be useful in a discussion below, where the size of this compression scheme in the second expression in (4) offers an improvement over a result of Attias, Kontorovich, and Mansour (2018).
Via a technique of Moran and Yehudayoff (2016) we can further reduce the above bound. Specifically, since all of are in , classic uniform convergence results of Vapnik and Chervonenkis (1971) imply that taking independent random indices , we have . In particular, together with the above guarantee from $\alpha$-Boost, this implies that there exist indices (which may be chosen deterministically) satisfying
so that the majority vote predictor
satisfies , and hence .
Since again, each is the result of for some of size , we have that can be represented as the value of an (order-dependent) reconstruction function with a compression set size . Thus, Lemma A implies that, for (for an appropriately large numerical constant ), with probability at least ,
Setting this less than and solving for a sufficient size of to achieve this yields the stated bound.
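The boosting-and-majority-vote step of this proof can be illustrated on a toy problem. The sketch below is not the paper's learner: it assumes a fixed-step exponential reweighting as the form of the boosting update, and it replaces the robustly consistent classifier trained on a small subsample with a hypothetical weak learner that picks the best of three fixed classifiers. The data set, the `pool`, and the parameter values are illustrative assumptions.

```python
import math


def alpha_boost(sample, weak_learner, alpha, rounds):
    """Fixed-step boosting: reweight examples exponentially, collect the weak hypotheses."""
    n = len(sample)
    weights = [1.0 / n] * n
    chosen = []
    for _ in range(rounds):
        h = weak_learner(sample, weights)
        chosen.append(h)
        # Correctly classified points (y * h(x) = +1) shrink, mistakes grow.
        weights = [w * math.exp(-alpha * y * h(x)) for w, (x, y) in zip(weights, sample)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return chosen


def majority_vote(hypotheses):
    """Unweighted majority vote of +/-1 classifiers (ties resolved to +1)."""
    return lambda x: 1 if sum(h(x) for h in hypotheses) >= 0 else -1


# Toy data: the label is +1 exactly on the middle third of the line.
sample = [(x, 1 if 3 <= x <= 5 else -1) for x in range(9)]

# Hypothetical pool of weak classifiers; each one errs on a third of the sample.
pool = [
    lambda x: -1,
    lambda x: 1 if x >= 3 else -1,
    lambda x: 1 if x <= 5 else -1,
]


def weak_learner(sample, weights):
    """Return the pool member with the smallest weighted error."""
    def weighted_error(h):
        return sum(w for w, (x, y) in zip(weights, sample) if h(x) != y)
    return min(pool, key=weighted_error)


alpha = 0.5 * math.log(2)  # step size matching a weak-learning edge of 1/6
voted = majority_vote(alpha_boost(sample, weak_learner, alpha, rounds=3))
print(all(voted(x) == y for x, y in sample))
```

Under any weighting of this sample, some pool member has weighted error at most 1/3, which is the kind of weak-learning guarantee the proof obtains from classifiers that are robustly consistent on small subsamples. The final predictor is a majority vote and is generally not a member of the underlying class, which is where improperness enters.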
4.2 Agnostic Robust Learnability
For the agnostic case, we can establish an upper bound via reduction to the realizable case, following an argument from David, Moran, and Yehudayoff (2016). Specifically, we have the following result.
For any and , ,
In fact, as was true in the realizable case, we can establish an often-better bound: namely,