# Black-box Certification and Learning under Adversarial Perturbations

We formally study the problem of classification under adversarial perturbations, both from the learner's perspective, and from the viewpoint of a third-party who aims at certifying the robustness of a given black-box classifier. We analyze a PAC-type framework of semi-supervised learning and identify possibility and impossibility results for proper learning of VC-classes in this setting. We further introduce and study a new setting of black-box certification under limited query budget. We analyze this for various classes of predictors and types of perturbation. We also consider the viewpoint of a black-box adversary that aims at finding adversarial examples, showing that the existence of an adversary with polynomial query complexity implies the existence of a robust learner with small sample complexity.


## 1 Introduction

We formally study the problem of classification under adversarial perturbations. An adversarial perturbation is an imperceptible alteration of a classifier’s input which changes its prediction. The existence of adversarial perturbations for real-world input instances and typical classifiers (Szegedy et al., 2014) has contributed to a lack of trust in predictive tools derived from automated learning. Recent years have thus seen a surge of studies proposing various heuristics to enhance robustness to adversarial attacks (Chakraborty et al., 2018). Existing solutions can be divided into two general categories: (i) those that modify the learning procedure to increase the adversarial robustness, for example by modifying the training data or the loss function used for training (Sinha et al., 2018; Cohen et al., 2019; Salman et al., 2019), and (ii) post-hoc approaches that aim to modify an already existing classifier to enhance its robustness (Cohen et al., 2019).

A user of a predictive tool, however, is often not involved in the training of the classifier, nor do they have the technical access or capabilities to modify its input/output behavior. Instead, the predictor may have been provided by a third party and the user may have merely black-box access to the predictor. That is, the predictor $h$ presents itself as an oracle that takes an input query $x$ and responds with the label $h(x)$. The provider of the predictive tool, while not necessarily assumed to have malicious intent, is still naturally considered untrusted, and the user thus has an interest in verifying the predictor’s performance (including adversarial robustness) on its own application domain. While the standard notion of classification accuracy can be easily estimated using an i.i.d. sample generated from the user’s data generating distribution, estimating the expected robust loss is not that easy: given a labeled instance $(x, y)$, the user can immediately verify whether the instance is misclassified ($h(x) \neq y$) using a single query to $h$, but understanding whether $x$ is vulnerable under adversarial perturbations may require many more queries to the oracle.

In this work, we introduce and analyze a formal model for black-box certification under query access. We provide examples of hypothesis classes and perturbation types for which such a certifier exists (a perturbation type, defined in Section 2, encompasses the set of admissible perturbations that the adversary is allowed to make at each point). We further generalize these ideas and introduce the notion of a witness set for certification, where we identify general classes of problems and perturbation types that admit black-box certification with bounded query complexity. In contrast, we also demonstrate cases of simple classes where the query complexity of certification is unbounded.

We further look at the problem from the viewpoint of the adversary, connecting the query complexity of an adversary (for finding adversarial examples) and that of the certifier. An intriguing question that we explore is whether the sample complexity of learning a robust classifier with respect to a hypothesis class is related to the query complexity of an optimal adversary (or certifier) for that class. We uncover such a connection, showing that the existence of a successful adversary with polynomial query complexity for a hypothesis class means that we can robustly learn that class with a rather small number of samples. For this, we adapt a compression-based argument, demonstrating a sample complexity upper bound for robust learning that is smaller than what was previously known (Montasser et al., 2019) (assuming that a polynomial adversary exists for that class).

We start our investigations with the problem of robustly (PAC-)learning classes of finite VC-dimension. It has been shown recently that, while the VC-dimension characterizes the proper learnability of a hypothesis class under the binary (misclassification) loss, there are classes of small VC-dimension that are not properly learnable under the robust loss (Montasser et al., 2019). We define the notion of the margin class (associated with a hypothesis class and a perturbation type) and show that, if both the class and the margin class are simple (measured by their VC-dimension), then proper learning under robust loss is not significantly more difficult than learning with respect to the binary loss.

The corresponding complexity of the margin class, however, can be potentially quite large for specific choices of perturbation types and hypothesis classes. We thus investigate and provide scenarios where a form of semi-supervised learning can overcome the impossibility of proper robust learning even under general perturbation types. We believe that our investigations of robust learnability in these scenarios may help shed some light on where the difficulty of general robust classification stems from.

### 1.1 Related work

Recent years have produced a surge of work on adversarial attack (and defense) mechanisms (Madry et al., 2018; Chakraborty et al., 2018; Chen et al., 2017; Dong et al., 2018; Narodytska and Kasiviswanathan, 2017; Papernot et al., 2017; Akhtar and Mian, 2018; Su et al., 2019), as well as the development of reference implementations of these (Goodfellow et al., 2018). Here, we briefly discuss earlier studies focused on developing theoretical understanding of adversarially robust learning.

Several recent studies have suggested and analyzed approaches of training under data augmentation (Sinha et al., 2018; Salman et al., 2019). The general idea is to add adversarial perturbations to data points already at training time to promote smoothness around the support of the data generating distribution. These studies then provide statistical guarantees for the robustness of the learned classifier. Similarly, statistical guarantees have been presented for robust training that modifies the loss function rather than the training data (Wong and Kolter, 2018). However, the notion of robustness certification used in these is different from what we propose. While they focus on designing learning methods that are certifiably robust, we aim at certifying an arbitrary classifier and for a potentially new distribution.

The robust learnability of finite VC-classes has been studied only recently, often with pessimistic conclusions. An early result demonstrated that there exist distributions where robust learning requires provably more data than its non-robust counterpart (Schmidt et al., 2018). Recent works have studied adversarially robust classification in the PAC-learning framework of computational learning theory (Cullina et al., 2018; Awasthi et al., 2019; Montasser et al., 2020) and presented hardness results for binary distribution and hypothesis classes in this framework (Diochnos et al., 2018; Gourdeau et al., 2019; Diochnos et al., 2019). On the other hand, robust learning has been shown to be possible under various assumptions. It was shown to be possible for finite hypothesis classes if the adversary also had a finite number of options for corrupting the input (Feige et al., 2015). The assumption of a finite hypothesis class was later relaxed to finite VC-dimension of the hypothesis class (Attias et al., 2019). It has also been shown that robust learning is possible (by robust Empirical Risk Minimization (ERM)) under a feasibility assumption on the distribution and bounded covering numbers of the hypothesis class (Bubeck et al., 2019). However, more recent work has presented classes of VC-dimension $1$ where the robust loss class has arbitrarily large VC-dimension (Cullina et al., 2018) and, moreover, where proper learning (such as robust ERM) is impossible in a distribution-free finite sample regime (Montasser et al., 2019). This latter work also presents a successful improper learning scheme for any VC-class and any adversary type, however with a sample complexity dependent on the VC-dimension of the dual class (that is, potentially exponential in the VC-dimension of the class).

We note that two additional aspects of our work have appeared in the literature before. Robust learnability under computational constraints on the adversary has been explored recently (Bubeck et al., 2019; Gourdeau et al., 2019; Garg et al., 2019). Earlier work has also hypothesized that unlabeled data may facilitate adversarially robust learning, and demonstrated a scenario where access to unlabeled data yields a better bound on the sample complexity under a specific data generative model (Carmon et al., 2019; Alayrac et al., 2019).

Less closely related to our work, the theory of adversarially robust learnability has been studied for non-parametric learners. A first study in that framework showed that a nearest neighbor classifier’s robust loss converges to that of the Bayes optimal (Wang et al., 2018). A follow-up work then derived a characterization of the best classifier with respect to the robust loss (analogous to the notion of the Bayes optimal), and suggested a training data pruning approach for non-parametric robust classification (Yang et al., 2019).

### 1.2 Outline and summary of contributions

##### Setup and analyzing adversarial loss formulation

In Section 2, we provide the formal setup for the problem of adversarial learning. We also decompose the adversarial loss, and define the notion of the margin class associated with a hypothesis class and a perturbation type (Def. 4).

##### Using unlabeled data for adversarial learning of VC-classes

In Section 3, we study the sample complexity of proper robust learning. While this sample complexity can be infinite for general VC-classes (Cullina et al., 2018; Montasser et al., 2019; Yin et al., 2019), we show that VC-classes are properly robustly learnable if the margin class also has finite VC-dimension (Thm. 7). We formalize an idealized notion of semi-supervised learning where the learner has additional oracle access to probability weights of the margin sets. We show that, perhaps counterintuitively, oracle access to both (exact) margin weights and (exact) binary losses does not suffice for identifying the minimizer of the adversarial loss in a class (Thm. 9), even in the $0/1$-realizable case (Thm. 12). However, under the additional assumption of robust realizability, (idealized) semi-supervised proper learning becomes feasible with bounded label complexity, and finite unlabeled samples can be beneficial (Thms. 10 and 11).

##### Blackbox certification with query access

We formally define the problem of black-box certification through query access (Def. 15), and demonstrate examples where certification is possible (Obs. 16) or impossible (Obs. 17). Motivated by this impossibility result, we also introduce a tolerant notion of certification (Def. 19). We show that while more classes are certifiable with this definition (Obs. 20), some simple classes still remain impossible to certify (Obs. 21). We identify a sufficient condition for certifiability of a hypothesis class w.r.t. a perturbation type through the notion of witness sets (Def. 22 and Thm. 23). We then consider the query complexity of the adversary (as opposed to that of the certifier) for finding adversarial instances (Def. 24, 25, and 26), showing the connection between the success of a non-adaptive adversary and the existence of a witness set (Obs. 27).

##### Connecting adversarial query complexity and VC-learnability

The culminating (and perhaps the most technical) result reconnects the two themes of our work: (PAC-)learnability with respect to an adversarial loss and the query complexity of an adversary. With Theorem 28, we show that the existence of a perfect adversary with small query complexity implies small sample complexity of robust learning.

We include proof sketches of all our claims here, and refer the reader to the supplementary material for detailed proofs.

## 2 Setup and Definitions

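We let $\mathcal{X}$ denote the domain (often $\mathcal{X} = \mathbb{R}^d$) and $\mathcal{Y}$ (mostly $\mathcal{Y} = \{0, 1\}$) a (binary) label space. We assume that data is generated by some distribution $P$ over $\mathcal{X} \times \mathcal{Y}$ and let $P_{\mathcal{X}}$ denote the marginal of $P$ over $\mathcal{X}$. A hypothesis is a function $h : \mathcal{X} \to \mathcal{Y}$, and can naturally be identified with a subset of $\mathcal{X} \times \mathcal{Y}$, namely $\{(x, y) \in \mathcal{X} \times \mathcal{Y} : h(x) = y\}$. Since we are working with binary labels, we also sometimes identify a hypothesis $h$ with the pre-image of $1$ under $h$, that is, the domain subset $\{x \in \mathcal{X} : h(x) = 1\}$. We consider only Borel functions from $\mathcal{X}$ to $\mathcal{Y}$ (or all functions in case of a countable domain); for an uncountable domain, restricting to Borel-measurable hypotheses avoids dealing with measurability issues. A hypothesis class is a set of such functions, often denoted by $\mathcal{H}$.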

The quality of prediction of a hypothesis $h$ on a labeled example $(x, y)$ is measured by a loss function $\ell$. For classification problems, the quality of prediction is typically measured with the binary loss

$$\ell^{0/1}(h, x, y) = \mathbb{1}[h(x) \neq y],$$

where $\mathbb{1}[\alpha]$ denotes the indicator function for predicate $\alpha$. For (adversarially) robust classification, we let $\mathcal{U} : \mathcal{X} \to 2^{\mathcal{X}}$, the perturbation type, be a function that maps each instance $x$ to the set $\mathcal{U}(x)$ of admissible perturbations at point $x$. We assume that the perturbation type satisfies $x \in \mathcal{U}(x)$ for all $x \in \mathcal{X}$. If $\mathcal{X}$ is equipped with a metric, then a natural choice for the set of perturbations at $x$ is a ball of radius $r$ around $x$. For an $x \in \mathcal{X}$ and a hypothesis $h$, we say that $x' \in \mathcal{U}(x)$ is an adversarial point of $x$ with respect to $h$ if $h(x) \neq h(x')$. We use the following definition of the adversarially robust loss with respect to perturbation type $\mathcal{U}$:

$$\ell^{\mathcal{U}}(h, x, y) = \mathbb{1}[\exists z \in \mathcal{U}(x) : h(z) \neq y].$$

If $\mathcal{U}(x)$ is always a ball of radius $r$ around $x$, we will also index the robust loss by the radius $r$ instead of $\mathcal{U}$. We assume that the perturbation type is such that $\ell^{\mathcal{U}}(h, \cdot, \cdot)$ is a measurable function for all hypotheses $h$ under consideration. A sufficient condition is that the sets $\mathcal{U}(x)$ are open (where $\mathcal{X}$ is assumed to be equipped with some topology) and the perturbation type further satisfies $x' \in \mathcal{U}(x)$ if and only if $x \in \mathcal{U}(x')$ for all $x, x' \in \mathcal{X}$ (see Appendix B for a proof and an example of a simple perturbation type that renders the corresponding loss function of a threshold predictor non-measurable).
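The two losses defined above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the perturbation set $\mathcal{U}(x)$ is taken to be finite and is returned by a hypothetical function `perturb`, and the toy hypothesis `threshold` and perturbation type `ball` on the integers are not from the paper.

```python
from typing import Callable, Hashable, Iterable

Hypothesis = Callable[[Hashable], int]
Perturbation = Callable[[Hashable], Iterable[Hashable]]


def binary_loss(h: Hypothesis, x, y) -> int:
    """0/1 loss: 1 if h mislabels x, else 0."""
    return int(h(x) != y)


def robust_loss(h: Hypothesis, perturb: Perturbation, x, y) -> int:
    """Robust loss: 1 if some admissible z in U(x) receives a label other than y."""
    return int(any(h(z) != y for z in perturb(x)))


# Toy instance (hypothetical): integer domain, threshold at 3, radius-1 balls.
def threshold(x: int) -> int:
    return int(x >= 3)


def ball(x: int):
    # Contains x itself, as the setup requires of every perturbation type.
    return [x - 1, x, x + 1]
```

For the point $x = 2$ with label $0$, the toy hypothesis is correct (`binary_loss` is 0), yet the neighbour $z = 3$ is labeled $1$, so `robust_loss` is 1.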

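We denote the expected loss (or true loss) of a hypothesis $h$ with respect to the distribution $P$ and loss function $\ell$ by $L^{\ell}_P(h) = \mathbb{E}_{(x,y) \sim P}[\ell(h, x, y)]$. In particular, we will denote the true binary loss by $L^{0/1}_P(h)$ and the true robust loss by $L^{\mathcal{U}}_P(h)$. Further, the approximation error of class $\mathcal{H}$ with respect to distribution $P$ and loss function $\ell$ is

$$\inf_{h \in \mathcal{H}} L^{\ell}_P(h).$$

The empirical loss of a hypothesis $h$ with respect to loss function $\ell$ and a sample $S = ((x_1, y_1), \dots, (x_n, y_n))$ is defined as $L^{\ell}_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h, x_i, y_i)$.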

A learner $\mathcal{A}$ is a function that takes in a finite sequence of labeled instances $S = ((x_1, y_1), \dots, (x_n, y_n))$ and outputs a hypothesis $h = \mathcal{A}(S)$. The following is a standard notion of (PAC-)learnability from finite samples of a hypothesis class (Vapnik and Chervonenkis, 1971; Valiant, 1984; Blumer et al., 1989; Shalev-Shwartz and Ben-David, 2014).

###### Definition 1 ((Agnostic) Learnability).

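A hypothesis class $\mathcal{H}$ is agnostic learnable with respect to a set of distributions $\mathcal{P}$ and loss function $\ell$ if there exists a learner $\mathcal{A}$ such that for all $\epsilon, \delta \in (0, 1)$, there is a sample size $n(\epsilon, \delta)$ such that, for any distribution $P \in \mathcal{P}$, if the input to $\mathcal{A}$ is an iid sample $S$ from $P$ of size $n \geq n(\epsilon, \delta)$, then, with probability at least $1 - \delta$ over the samples, the learner outputs a hypothesis $h = \mathcal{A}(S)$ with

$$L^{\ell}_P(h) \leq \inf_{h' \in \mathcal{H}} L^{\ell}_P(h') + \epsilon.$$

The class $\mathcal{H}$ is said to be learnable in the realizable case with respect to loss function $\ell$ if the above holds under the condition that $\inf_{h' \in \mathcal{H}} L^{\ell}_P(h') = 0$. We say that $\mathcal{H}$ is distribution-free learnable (or simply learnable) if it is learnable when $\mathcal{P}$ is the set of all probability measures over $\mathcal{X} \times \mathcal{Y}$.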

###### Definition 2 (VC-dimension).

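We say that a collection $\mathcal{G}$ of subsets of some domain $\mathcal{X}$ shatters a subset $B \subseteq \mathcal{X}$ if for every $F \subseteq B$ there exists $G \in \mathcal{G}$ such that $G \cap B = F$. The VC-dimension of $\mathcal{G}$, denoted by $\mathrm{VC}(\mathcal{G})$, is defined to be the supremum of the sizes of the sets that are shattered by $\mathcal{G}$.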
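For a finite domain, the definition of shattering can be checked by brute force. The sketch below is illustrative only: the function names `shatters` and `vc_dimension` and the toy classes of thresholds and intervals are assumptions made for the example, and the search is exponential in the domain size.

```python
from itertools import combinations


def shatters(sets, points):
    """True if every subset of `points` equals G ∩ points for some G in `sets`."""
    points = frozenset(points)
    traces = {frozenset(g) & points for g in sets}
    return len(traces) == 2 ** len(points)


def vc_dimension(sets, domain):
    """Largest size of a shattered subset of a finite domain (brute force)."""
    best = 0
    for k in range(1, len(domain) + 1):
        if any(shatters(sets, c) for c in combinations(domain, k)):
            best = k
        else:
            # Subsets of shattered sets are shattered, so larger sizes fail too.
            break
    return best


# Toy classes on the domain {0, 1, 2, 3}, viewed as pre-images of label 1.
thresholds = [set(range(t, 4)) for t in range(5)]
intervals = [set(range(a, b)) for a in range(5) for b in range(a, 5)]
```

On this toy domain, thresholds shatter single points but no pair, while intervals shatter pairs but no triple.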

It is easy to see that the VC-dimension of a binary hypothesis class $\mathcal{H}$ is independent of whether we view $\mathcal{H}$ as a collection of subsets of $\mathcal{X} \times \mathcal{Y}$ or as pre-images of $1$ (thus, subsets of $\mathcal{X}$). It is well known that, for the binary loss, a hypothesis class is (distribution-free) learnable if and only if it has finite VC-dimension (Blumer et al., 1989). Furthermore, any learnable binary hypothesis class can be learned with a proper learner.

###### Definition 3 (Proper Learnability).

We call a learner $\mathcal{A}$ a proper learner for the class $\mathcal{H}$ if, for all input samples $S$, we have $\mathcal{A}(S) \in \mathcal{H}$. A class $\mathcal{H}$ is properly learnable if the conditions in Definition 1 hold with a proper learner $\mathcal{A}$.

It has recently been shown that there are classes of finite VC-dimension that are not properly learnable with respect to the adversarially robust loss (Montasser et al., 2019).

### 2.1 Decomposing the robust loss

In this work, we adopt the most commonly used notion of an adversarially robust loss (Montasser et al., 2019; Yang et al., 2019). Note that we have $\ell^{\mathcal{U}}(h, x, y) = 1$ if and only if at least one of the following conditions holds:

• $h$ makes a mistake on $x$ with respect to label $y$, or
• there is a close-by instance $z \in \mathcal{U}(x)$ that $h$ labels differently than $x$, that is, $x$ is close to $h$’s decision boundary.

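The first condition holds when $(x, y)$ falls into the error region, $\mathrm{err}_h = \{(x, y) \in \mathcal{X} \times \mathcal{Y} : h(x) \neq y\}$. The notion of error region then naturally captures the (non-adversarial) loss:

$$L^{0/1}_P(h) = \mathbb{P}_{(x,y) \sim P}[(x, y) \in \mathrm{err}_h] = P(\mathrm{err}_h).$$

The second condition holds when $x$ lies in the margin area of $h$. The following definition makes this notion explicit.

Let $h$ be some hypothesis. We define the margin area of $h$ with respect to perturbation type $\mathcal{U}$ as the subset $\mathrm{mar}^{\mathcal{U}}_h \subseteq \mathcal{X} \times \mathcal{Y}$ defined by

$$\mathrm{mar}^{\mathcal{U}}_h = \{(x, y) \in \mathcal{X} \times \mathcal{Y} \mid \exists z \in \mathcal{U}(x) : h(x) \neq h(z)\}.$$

Based on these definitions, the adversarially robust loss with respect to $\mathcal{U}$ is $1$ if and only if the sample $(x, y)$ falls into the error region and/or the margin area of $h$:

$$L^{\mathcal{U}}_P(h) = P(\mathrm{err}_h \cup \mathrm{mar}^{\mathcal{U}}_h).$$

###### Definition 4.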

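For class $\mathcal{H}$, we refer to the collection $\{\mathrm{mar}^{\mathcal{U}}_h : h \in \mathcal{H}\}$ as the margin class of $\mathcal{H}$.

While we defined the margin areas as subsets of $\mathcal{X} \times \mathcal{Y}$, it is sometimes natural to identify them with their projection on $\mathcal{X}$, thus simply as subsets of $\mathcal{X}$.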
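The decomposition of the robust loss into error region and margin area can be checked mechanically on a finite toy domain. The following sketch assumes finite perturbation sets and uses hypothetical names (`check_decomposition`, `threshold`, `ball`, `shifted`); it also illustrates that the identity relies on the assumption that every point belongs to its own perturbation set.

```python
def in_error_region(h, x, y):
    """(x, y) lies in err_h."""
    return h(x) != y


def in_margin_area(h, perturb, x):
    """x lies in the margin area of h: some z in U(x) gets a different label than x."""
    return any(h(z) != h(x) for z in perturb(x))


def robust_loss(h, perturb, x, y):
    """Robust loss as a boolean: some z in U(x) gets a label other than y."""
    return any(h(z) != y for z in perturb(x))


def check_decomposition(h, perturb, domain, labels=(0, 1)):
    """True if robust loss equals membership in err_h ∪ mar_h on every (x, y)."""
    return all(
        robust_loss(h, perturb, x, y)
        == (in_error_region(h, x, y) or in_margin_area(h, perturb, x))
        for x in domain
        for y in labels
    )


# Toy instance (hypothetical): threshold at 3 on the integers.
def threshold(x):
    return int(x >= 3)


def ball(x):
    return [x - 1, x, x + 1]   # contains x, as the setup requires


def shifted(x):
    return [x + 2]             # violates the assumption that x is in U(x)
```

With `ball` the identity holds on every labeled point of the toy domain; with `shifted` it fails, for instance at $x = 2$ with label $1$.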

###### Remark 5.

We note that there is more than one way to formulate a loss function that captures both classification accuracy and robustness to small (adversarial) perturbations. The notion we adopt has the property that even the best classifier with respect to the binary loss (even an $h$ with $L^{0/1}_P(h) = 0$, if such exists) may have positive robust loss if the true labels themselves change within the adversarial neighbourhoods. A natural alternative, also often considered, is to require the adversary to find a neighbouring point where the classifier returns an incorrect label (instead of just a different label). However, we note that such a notion cannot be phrased as a loss function (as it depends on the true label of the perturbed instance). Previous studies have provided excellent discussions of the various options (Diochnos et al., 2018; Gourdeau et al., 2019).

##### Semi-Supervised Learning (SSL)

Since the margin areas can naturally be viewed as subsets of $\mathcal{X}$, their weights under the data generating distribution can potentially be estimated with samples from $P_{\mathcal{X}}$, that is, from unlabeled data. A learner that takes in both a labeled sample from $P$ and an unlabeled sample from $P_{\mathcal{X}}$ is called a semi-supervised learner. For scenarios where robust learning has been shown to be hard, we explore whether this hardness can be overcome by SSL. We will consider semi-supervised learners that take in labeled and unlabeled data samples, and also idealized semi-supervised learners that, in addition to a labeled data sample, have oracle access to probability weights of certain subsets of $\mathcal{X}$ (Göpfert et al., 2019).

## 3 Robust Learning of VC Classes

It has been shown that there is a class $\mathcal{H}$ of bounded VC-dimension ($\mathrm{VC}(\mathcal{H}) = 1$ in fact) and a perturbation type $\mathcal{U}$ such that $\mathcal{H}$ is not robustly properly learnable (Montasser et al., 2019), even if the distribution is realizable with respect to $\mathcal{H}$ under the $\mathcal{U}$-robust loss. The perturbation type in the lower bound construction can actually be chosen to be balls with respect to some metric over a Euclidean space. The same work also shows that if a class has bounded VC-dimension, then it is (improperly) robustly learnable with respect to any perturbation type $\mathcal{U}$.

###### Theorem 6 ((Montasser et al., 2019)).

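1.) There is a class $\mathcal{H}$ over a domain $\mathcal{X}$ with $\mathrm{VC}(\mathcal{H}) = 1$, and a set of distributions $\mathcal{P}$ with $\inf_{h \in \mathcal{H}} L^{\mathcal{U}}_P(h) = 0$ for all $P \in \mathcal{P}$, such that $\mathcal{H}$ is not properly learnable over $\mathcal{P}$ with respect to loss function $\ell^{\mathcal{U}}$.
2.) Let $\mathcal{X}$ be any domain and $\mathcal{U}$ be any type of perturbation, and let $\mathcal{H}$ be a hypothesis class with finite VC-dimension. Then $\mathcal{H}$ is distribution-free agnostic learnable with respect to loss function $\ell^{\mathcal{U}}$.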

While the second part of the above theorem seems to settle adversarially robust learnability for binary hypothesis classes, the positive result is achieved with a compression-based learner, which has potentially much higher sample complexity than what suffices for the binary loss. In fact, the size of the best known general compression scheme (Moran and Yehudayoff, 2016) depends on the VC-dimension of the dual class of $\mathcal{H}$, making the sample complexity of this approach generally exponential in the VC-dimension of $\mathcal{H}$.

We will now first show that the impossibility result in the first part of the above theorem crucially depends on the combination of the class $\mathcal{H}$ and a perturbation type $\mathcal{U}$ (despite these being balls in a Euclidean space) such that the margin class has infinite VC-dimension. We prove that, if both $\mathcal{H}$ and its margin class have finite VC-dimension, then $\mathcal{H}$ is (distribution-free) learnable with respect to the robust loss, with a proper learner.

###### Theorem 7 (Proper learnability for finite VC and finite margin-VC).

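Let $\mathcal{X}$ be any domain and $\mathcal{H}$ be a hypothesis class with finite VC-dimension. Further, let $\mathcal{U}$ be any perturbation type such that the margin class of $\mathcal{H}$ has finite VC-dimension. We set $D$ to be the sum of the VC-dimension of $\mathcal{H}$ and the VC-dimension of its margin class. Then $\mathcal{H}$ is distribution-free (agnostically) properly learnable with respect to the robust loss $\ell^{\mathcal{U}}$, with a sample complexity bounded in terms of $D$, $\epsilon$ and $\delta$.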

###### Proof Sketch.

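We provide the more detailed argument in the appendix. Recall that a sample $S = ((x_1, y_1), \dots, (x_n, y_n))$ is said to be an $\epsilon$-approximation of $P$ with respect to a collection of sets $\mathcal{G}$ if for all $G \in \mathcal{G}$ we have $\left| P(G) - \frac{1}{n} |\{i : (x_i, y_i) \in G\}| \right| \leq \epsilon$, that is, if the empirical estimates with respect to $S$ of the sets in $\mathcal{G}$ are $\epsilon$-close to their true probability weights. Consider the class of subsets of $\mathcal{X} \times \mathcal{Y}$ given by the point-wise unions of error and margin regions, $\{\mathrm{err}_h \cup \mathrm{mar}^{\mathcal{U}}_h : h \in \mathcal{H}\}$. A simple counting argument bounds the VC-dimension of this class in terms of $D$. Thus, by basic VC-theory, a sufficiently large sample $S$ (of a size depending only on $D$, $\epsilon$ and $\delta$) will be an $\epsilon$-approximation of $P$ with respect to this class with probability at least $1 - \delta$. Thus any empirical risk minimizer with respect to $\ell^{\mathcal{U}}$ is a successful proper and agnostic robust learner for $\mathcal{H}$. ∎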

###### Observation 8.

We believe the conditions of Theorem 7 hold for most natural classes $\mathcal{H}$ and perturbation types $\mathcal{U}$. E.g., if $\mathcal{H}$ is the class of linear predictors in $\mathbb{R}^d$ and the sets $\mathcal{U}(x)$ are balls with respect to some $\ell_p$-norm, then both $\mathcal{H}$ and its margin class have finite VC-dimension (see also (Yin et al., 2019)).

### 3.1 Using unlabeled data for robust proper learning

In light of the above two general results, we turn to investigate whether unlabeled data can help in overcoming the discrepancy between the two setups. In particular, under various additional assumptions, we consider the case of the VC-dimension of $\mathcal{H}$ being finite but the VC-dimension of its margin class (potentially) being infinite, and a learner having additional access to $P_{\mathcal{X}}$.

We model knowledge of $P_{\mathcal{X}}$ as the learner having access to an oracle that returns the probability weights of various subsets of $\mathcal{X}$. We say that the learner has access to a margin oracle for class $\mathcal{H}$ if, for every $h \in \mathcal{H}$, it can query the probability weight of the margin set of $h$, that is, $P(\mathrm{mar}^{\mathcal{U}}_h)$. Since the margin areas can be viewed as subsets of $\mathcal{X}$, if the margin class of $\mathcal{H}$ under perturbation type $\mathcal{U}$ has finite VC-dimension, a margin oracle can be approximated using an unlabeled sample from the distribution $P_{\mathcal{X}}$.

Similarly, one could define an error oracle for $\mathcal{H}$ as an oracle that, for every $h \in \mathcal{H}$, would return the weight of the error set, $P(\mathrm{err}_h)$. This is typically approximated with a labeled sample from the data-generating distribution, if the class has finite VC-dimension. This is similar to the settings of learning by distances (Ben-David et al., 1995) or learning with statistical queries (Kearns, 1998; Feldman, 2017).

To minimize the adversarial loss, however, the learner needs to find (through oracle access or through approximations by samples) a minimizer of the weights $P(\mathrm{err}_h \cup \mathrm{mar}^{\mathcal{U}}_h)$. We now first show that having access to both an exact error oracle and an exact margin oracle does not suffice for this.

###### Theorem 9.

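There is a class $\mathcal{H}$ with $\mathrm{VC}(\mathcal{H}) = 1$ over a domain $\mathcal{X}$ with $|\mathcal{X}| = 7$, a perturbation type $\mathcal{U}$, and two distributions $P^1$ and $P^2$ over $\mathcal{X} \times \mathcal{Y}$ that are indistinguishable with error and margin oracles for $\mathcal{H}$, while their robust loss minimizers in $\mathcal{H}$ differ.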

###### Proof.

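Let $\mathcal{X} = \{x_1, \dots, x_7\}$ be the domain. We consider two distributions $P^1$ and $P^2$ over $\mathcal{X} \times \mathcal{Y}$. Both assign the same fixed true label to all points. However, their marginals $P^1_{\mathcal{X}}$ and $P^2_{\mathcal{X}}$ differ:

$$P^1_{\mathcal{X}}(x_1) = P^1_{\mathcal{X}}(x_3) = 0, \quad P^1_{\mathcal{X}}(x_2) = 2/6, \quad P^1_{\mathcal{X}}(x_i) = 1/6 \text{ for } i \in \{4, 5, 6, 7\},$$

$$P^2_{\mathcal{X}}(x_4) = P^2_{\mathcal{X}}(x_6) = 0, \quad P^2_{\mathcal{X}}(x_5) = 2/6, \quad P^2_{\mathcal{X}}(x_i) = 1/6 \text{ for } i \in \{1, 2, 3, 7\}.$$

The class $\mathcal{H}$ consists of two functions, $h_1$ and $h_2$. Further, we consider the following perturbation sets (for readability, we first state them without the points themselves):

$$\tilde{\mathcal{U}}(x_1) = \{x_2\}, \; \tilde{\mathcal{U}}(x_2) = \{x_1, x_3\}, \; \tilde{\mathcal{U}}(x_3) = \{x_2\}, \; \tilde{\mathcal{U}}(x_4) = \{x_5\}, \; \tilde{\mathcal{U}}(x_5) = \{x_4, x_6\}, \; \tilde{\mathcal{U}}(x_6) = \{x_5\}, \; \tilde{\mathcal{U}}(x_7) = \emptyset.$$

Now we set $\mathcal{U}(x) = \tilde{\mathcal{U}}(x) \cup \{x\}$, so that each point is included in its own perturbation set. Both $h_1$ and $h_2$ have the same $0/1$-loss on $P^1$ as on $P^2$, and for both $h_1$ and $h_2$ the margin areas have the same weight on $P^1$ as on $P^2$. However, the adversarial loss minimizer for $P^1$ differs from the one for $P^2$ (with a positive gap in robust loss in each case). ∎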
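The construction can be verified numerically. The marginals and perturbation sets below are taken from the proof; the constant true label `TRUE_LABEL = 0` and the two hypotheses `h1` and `h2` are assumptions chosen for this sketch so that the stated properties hold, and they may differ from the functions used in the original proof.

```python
from fractions import Fraction as F

points = range(1, 8)  # indices of x_1, ..., x_7

# Marginals from the proof.
P1 = {1: F(0), 2: F(2, 6), 3: F(0), 4: F(1, 6), 5: F(1, 6), 6: F(1, 6), 7: F(1, 6)}
P2 = {1: F(1, 6), 2: F(1, 6), 3: F(1, 6), 4: F(0), 5: F(2, 6), 6: F(0), 7: F(1, 6)}

# Perturbation sets from the proof, with each point added to its own set.
U_tilde = {1: {2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4, 6}, 6: {5}, 7: set()}
U = {x: U_tilde[x] | {x} for x in points}

TRUE_LABEL = 0  # assumption: the constant label is not recoverable from the text


def h1(x):  # assumption: one possible choice of the first hypothesis
    return int(x in {1, 2})


def h2(x):  # assumption: one possible choice of the second hypothesis
    return int(x in {4, 5})


def weight(P, region):
    return sum(P[x] for x in points if region(x))


def err(h):
    return lambda x: h(x) != TRUE_LABEL


def mar(h):
    return lambda x: any(h(z) != h(x) for z in U[x])


def rob(h):
    return lambda x: any(h(z) != TRUE_LABEL for z in U[x])


def oracle_view(P):
    """What exact error and margin oracles reveal about P for h1 and h2."""
    return [(weight(P, err(h)), weight(P, mar(h))) for h in (h1, h2)]


def robust_losses(P):
    return [weight(P, rob(h)) for h in (h1, h2)]
```

Under these assumptions both oracles return $1/3$ for each hypothesis on either distribution, while the robust losses of $(h_1, h_2)$ are $(1/3, 1/2)$ under $P^1$ and $(1/2, 1/3)$ under $P^2$, so the minimizer switches with a gap of $1/6$.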

While the impossibility result in the above example can, of course, be overcome by estimating the weights of the seven points in the domain, the construction shows that merely estimating classification error and weights of margin sets does not suffice for proper learning with respect to the adversarial loss. The example shows that the learner also needs to take into account the interactions (intersections between the sets) of the two components of the adversarial loss. However, the weights of the intersections $\mathrm{err}_h \cap \mathrm{mar}^{\mathcal{U}}_h$ inherently involve label information.

In the following subsection we show that realizability with respect to the robust loss implies that robust learning becomes possible with access to a (bounded size) labeled sample from the distribution and additional access to a margin oracle or a (bounded size) unlabeled sample. In Appendix C.3, we further explore weakening this assumption to only require $0/1$-realizability, with access to a stronger version of the margin oracle.

#### 3.1.1 Robust realizability: ∃h∗∈H with LUP(h∗)=0

This is the setup of the impossibility result for proper learning (Montasser et al., 2019). We show that proper learning becomes possible with access to a margin oracle for $\mathcal{H}$.

###### Theorem 10.

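Let $\mathcal{X}$ be some domain, $\mathcal{H}$ a hypothesis class with finite VC-dimension and $\mathcal{U}$ any perturbation type. If a learner is given additional access to a margin oracle for $\mathcal{H}$, then $\mathcal{H}$ is properly learnable with respect to the robust loss $\ell^{\mathcal{U}}$ and the class of distributions $P$ that are robust-realizable by $\mathcal{H}$, that is, $\inf_{h \in \mathcal{H}} L^{\mathcal{U}}_P(h) = 0$, with a labeled sample complexity bounded in terms of $\mathrm{VC}(\mathcal{H})$, $\epsilon$ and $\delta$.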

###### Proof Sketch.

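By the robust realizability, there is an $h^* \in \mathcal{H}$ with $L^{\mathcal{U}}_P(h^*) = 0$, implying that $L^{0/1}_P(h^*) = 0$, that is, the distribution is (standard) realizable by $\mathcal{H}$. Basic VC-theory tells us that a sufficiently large iid sample $S$ (of a size depending only on $\mathrm{VC}(\mathcal{H})$, $\epsilon$ and $\delta$) guarantees that all functions in the version space of $S$ (that is, all $h \in \mathcal{H}$ with $L^{0/1}_S(h) = 0$) have true binary loss at most $\epsilon$ (with probability at least $1 - \delta$). Now, with access to a margin oracle for $\mathcal{H}$, a learner can remove all hypotheses with $P(\mathrm{mar}^{\mathcal{U}}_h) > 0$ from the version space and return any remaining hypothesis (at least $h^*$ will remain). ∎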

Note that the above procedure crucially depends on actual access to a margin oracle. The weights $P(\mathrm{mar}^{\mathcal{U}}_h)$ cannot generally be estimated if the margin class has infinite VC-dimension, as the impossibility result for proper learning from finite samples shows. Thus proper learnability even under these (strong) assumptions cannot always be achieved by a semi-supervised proper learner that has access only to finite amounts of unlabeled data. We also note that the above result (even with access to $P_{\mathcal{X}}$) does not allow for an extension to the agnostic case via the type of reductions known from compression-based bounds (Montasser et al., 2019; Moran and Yehudayoff, 2016).

On the other hand, if the margin class has finite, but potentially much larger, VC-dimension than $\mathcal{H}$, then we can use unlabeled data to approximate the margin oracle in Theorem 10. The following result thus provides an improved bound on the number of labeled samples that suffice for robust proper learning under the assumptions of Theorem 7.

###### Theorem 11.

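Let $\mathcal{X}$ be some domain, $\mathcal{H}$ a hypothesis class with finite VC-dimension and let $\mathcal{U}$ be a perturbation type such that the margin class of $\mathcal{H}$ also has finite VC-dimension. If a learner is given additional access to an (unlabeled) sample from $P_{\mathcal{X}}$, then $\mathcal{H}$ is properly learnable with respect to the robust loss $\ell^{\mathcal{U}}$ and the class of distributions $P$ that are robust-realizable by $\mathcal{H}$, that is, $\inf_{h \in \mathcal{H}} L^{\mathcal{U}}_P(h) = 0$, with a labeled sample complexity bounded in terms of $\mathrm{VC}(\mathcal{H})$, $\epsilon$ and $\delta$, and an unlabeled sample complexity bounded in terms of the VC-dimension of the margin class, $\epsilon$ and $\delta$.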

###### Proof.

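The stated sample sizes imply that all functions in $\mathcal{H}$ in the version space of the labeled sample have true binary loss at most $\epsilon$, and all functions in $\mathcal{H}$ whose margin areas are not hit by the unlabeled sample have true margin weight at most $\epsilon$. The learner can thus output any function $h \in \mathcal{H}$ with classification error $0$ on the labeled sample and margin weight $0$ under the unlabeled sample (at least $h^*$ will satisfy these conditions), and by the union bound we get $L^{\mathcal{U}}_P(h) \leq P(\mathrm{err}_h) + P(\mathrm{mar}^{\mathcal{U}}_h) \leq 2\epsilon$. ∎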
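The learner described in this proof can be sketched for a finite hypothesis class as follows. This is an illustration only: it assumes finite perturbation sets returned by a hypothetical function `perturb`, and the toy class of integer thresholds is not from the paper.

```python
def ssl_robust_learner(hypotheses, perturb, labeled, unlabeled):
    """Return a hypothesis that is consistent with the labeled sample and whose
    margin area is not hit by any unlabeled point, or None if none exists."""
    for h in hypotheses:
        consistent = all(h(x) == y for x, y in labeled)
        margin_free = all(
            all(h(z) == h(x) for z in perturb(x)) for x in unlabeled
        )
        if consistent and margin_free:
            return h
    return None


# Toy instance (hypothetical): thresholds on the integers, radius-1 balls.
def make_threshold(t):
    return lambda x: int(x >= t)


def ball(x):
    return [x - 1, x, x + 1]


hypotheses = [make_threshold(t) for t in range(11)]
labeled = [(0, 0), (9, 1)]      # few labels: only thresholds 1..9 stay consistent
unlabeled = [1, 2, 7, 8]        # unlabeled points rule out thresholds near them
learned = ssl_robust_learner(hypotheses, ball, labeled, unlabeled)
```

In this toy run the two labeled points leave nine consistent thresholds, and the unlabeled points narrow these down to thresholds between 4 and 6; the sketch returns the first of them.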

The assumption in the above theorems states that there exists one function in the class that has both perfect classification accuracy and no weight in its margin area. The proof of the impossibility construction of Theorem 9 employs a class and distributions where no function in the class has perfect margin or perfectly classifies the task. We can modify that construction to show that the “double realizability” in Theorem 10 is necessary if the access to the marginal is restricted to a margin oracle for $\mathcal{H}$. The proof of the following result can be found in Appendix C.2.

###### Theorem 12.

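There is a class $\mathcal{H}$ over a domain $\mathcal{X}$, a perturbation type $\mathcal{U}$, and two distributions $P^1$ and $P^2$ over $\mathcal{X} \times \mathcal{Y}$, such that there are functions $h, h' \in \mathcal{H}$ with $L^{0/1}_{P^i}(h) = 0$ and $P^i(\mathrm{mar}^{\mathcal{U}}_{h'}) = 0$ for both $i \in \{1, 2\}$, while $P^1$ and $P^2$ are indistinguishable with error and margin oracles for $\mathcal{H}$ and their robust loss minimizers in $\mathcal{H}$ differ.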

## 4 Black-box Certification and the Query Complexity of Adversarial Attacks

Given a fixed hypothesis $h$, a basic concentration inequality (e.g., Hoeffding’s inequality) indicates that the empirical loss of $h$ on a sample $S \sim P^n$, $L^{0/1}_S(h)$, gives an $\epsilon$-accurate estimate of the true loss with respect to $P$, $L^{0/1}_P(h)$. In fact, in order to compute $L^{0/1}_S(h)$, we do not need to know $h$ directly; it would suffice to be able to query $h(x)$ on the given sample. Therefore, we can say it is possible to estimate the true binary loss of $h$ up to additive error $\epsilon$, with probability at least $1 - \delta$, using $O(\log(1/\delta)/\epsilon^2)$ samples from $P$ and as many queries to $h(\cdot)$.
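As a concrete illustration, the two-sided Hoeffding bound $2\exp(-2n\epsilon^2) \leq \delta$ yields an explicit sample size, and the empirical binary loss needs only one label query per sample point. The function names below are hypothetical, and `query` stands in for black-box access to $h$.

```python
import math


def hoeffding_sample_size(eps: float, delta: float) -> int:
    """Smallest n with 2 * exp(-2 * n * eps**2) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))


def estimate_binary_loss(query, sample) -> float:
    """Empirical 0/1 loss of a black-box predictor, one query per sample point."""
    return sum(query(x) != y for x, y in sample) / len(sample)
```

For example, accuracy $\epsilon = 0.1$ with confidence $\delta = 0.05$ requires 185 sample points under this bound.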

The high-level question that we ask in this section is whether and when we can do the same for the adversarial loss, . If possible, it would mean that we can have a third-party that “certifies” the robustness of a given black-box predictor (e.g., without relying on the knowledge of the learning algorithm that produced it).

###### Definition 13 (Label Query Oracle).

We call an oracle a label query oracle for a hypothesis , if for all , upon querying for , the oracle returns the label .

###### Definition 14 (Query-based Algorithm).

We call an algorithm a query-based algorithm, if has access to a label query oracle .

###### Definition 15 (Certifiability).

A class is certifiable with respect to if there exists a query-based algorithm and there are functions such that for every , every distribution over , and every , we have that with probability at least over an i.i.d. sample of size

$$\left| A(S, O_h) - L^{U}_{P}(h) \right| < \epsilon$$

with a query budget of for . In this case, we say that admits blackbox query certification.

In light of Section 2.1, the task of robust certification is to estimate the probability weight of the set .

###### Observation 16.

Let be the set of all half-spaces in and let be the unit ball wrt -norm centred at . Then admits -certification under for functions .

###### Proof.

Say we have a sample . For each point define the set , i.e., the four corner points of . The certifier can determine whether by querying the label of ; further it can determine whether by querying all points in . Let . By querying all points in , the certifier can calculate the robust loss of on . This will be an -accurate estimate of when . ∎
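The corner-querying certifier from this proof can be sketched as follows. The sketch assumes the setting is half-spaces in the plane with axis-aligned square perturbation sets (an ℓ∞ ball), which is what the "four corner points" in the proof suggest; the norm and dimension are elided in the text above, so treat them as assumptions. The names `corner_witnesses` and `certify_halfspace_robust_loss` are hypothetical.

```python
from itertools import product


def corner_witnesses(x, radius=1.0):
    """Four corners of the axis-aligned square of half-width `radius` around a 2-D point."""
    return [(x[0] + sx * radius, x[1] + sy * radius)
            for sx, sy in product((-1.0, 1.0), repeat=2)]


def certify_halfspace_robust_loss(sample, label_oracle, radius=1.0):
    """Empirical robust loss of a black-box half-space using 5 queries per point.

    A half-space and its complement are both convex, so the square around x
    receives a single label exactly when its four corners agree with h(x).
    Returns (empirical robust loss, number of oracle queries used).
    """
    flagged = 0
    queries = 0
    for x, y in sample:
        hx = label_oracle(x)
        corner_labels = [label_oracle(c) for c in corner_witnesses(x, radius)]
        queries += 5
        misclassified = hx != y
        in_margin = any(label != hx for label in corner_labels)
        if misclassified or in_margin:
            flagged += 1
    return flagged / len(sample), queries
```

The convexity argument is what keeps the query count constant per point; for perturbation sets without finitely many extreme points this shortcut is not available.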

We immediately see that certification is non-trivial, in that there are cases where robust certification is impossible. The proof can be found in Appendix D.

###### Observation 17.

Let be the set of all half-spaces in and let be the unit ball wrt -norm centred at . Then is not certifiable under .

This motivates us to define a tolerant certification version.

###### Definition 18 (Restriction of a perturbation type).

Let be perturbation types. We say that is a restriction of if for all .

Note that if is a restriction of , then, for all distributions and predictors we have .

###### Definition 19 (Tolerant Certification).

A class is tolerantly certifiable with respect to and , where is a restriction of , if there exists a query-based algorithm , and there are functions such that for every , every distribution over , and every , we have that with probability at least over an i.i.d. sample of size

$$A(S, O_h) \in \left[ L^{U}_{P}(h) - \epsilon,\; L^{V}_{P}(h) + \epsilon \right]$$

with a query budget of for . In this case, we say that admits tolerant blackbox query certification.

###### Observation 20.

Let and . Let be the set of all half-spaces in . Then is tolerantly certifiable with respect to and .

###### Proof sketch.

For each , we can always find a regular polygon with vertices that “sits” between the and . Therefore, in order to find out whether is adversarially vulnerable or not, it would suffice to make queries. Combining this with Hoeffding’s inequality shows that if we sample points from and make queries for each, we can estimate within error . ∎
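The polygon idea can be illustrated with a short sketch. It assumes closed Euclidean balls in the plane with a smaller radius `inner_radius` and a larger radius `outer_radius`; these parameters and all function names are hypothetical, since the exact radii are elided above. A regular n-gon inscribed in the outer circle has inradius `outer_radius * cos(pi / n)`, so it contains the inner ball as soon as that quantity is at least `inner_radius`.

```python
import math


def polygon_vertex_count(inner_radius, outer_radius):
    """Fewest vertices n >= 3 such that outer_radius * cos(pi / n) >= inner_radius."""
    if not 0 < inner_radius < outer_radius:
        raise ValueError("need 0 < inner_radius < outer_radius")
    n = 3
    while outer_radius * math.cos(math.pi / n) < inner_radius:
        n += 1
    return n


def polygon_vertices(x, outer_radius, n):
    """Vertices of a regular n-gon inscribed in the circle of radius outer_radius around x."""
    return [(x[0] + outer_radius * math.cos(2 * math.pi * k / n),
             x[1] + outer_radius * math.sin(2 * math.pi * k / n))
            for k in range(n)]


def tolerant_certify(sample, label_oracle, inner_radius, outer_radius):
    """Fraction of points that are misclassified or have a polygon vertex labeled differently.

    For a half-space, agreement on all vertices rules out adversarial points in
    the inner ball, while any disagreeing vertex is itself an adversarial point
    in the outer ball, so the estimate lies between the two empirical robust losses.
    """
    n = polygon_vertex_count(inner_radius, outer_radius)
    flagged = 0
    for x, y in sample:
        hx = label_oracle(x)
        vulnerable = any(label_oracle(v) != hx
                         for v in polygon_vertices(x, outer_radius, n))
        if hx != y or vulnerable:
            flagged += 1
    return flagged / len(sample)
```

The vertex count grows as the two radii approach each other, which matches the intuition that tolerance is what buys a finite query budget.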

Though more realistic, even the tolerant notion of certifiability does not make all seemingly simple classes certifiable.

###### Observation 21.

Let and . There exists a hypothesis class with VC-dimension 1, such that is not tolerantly certifiable with respect to and .

###### Proof sketch.

For any , let . Let . clearly has a VC-dimension of 1, but we claim that it is not tolerantly certifiable. We construct an argument similar to Observation 17. The idea is that no matter what queries the certifier chooses, we can always set to be a point that was not queried and is either inside or outside depending on the certifier’s answer. ∎

### 4.1 Witness Sets for Certification

A common observation in the previous examples was that if the certifier could identify a set of points whose labels determined the points in that were in the margin of , then querying those points was enough for robust certification. This motivates the following definition.

###### Definition 22 (Witness sets).

Given a hypothesis class and a perturbation type , for any point , we say that is a witness set for if there exists a mapping such that for any hypothesis , if and only if lies in the margin of (where denotes the restriction of hypothesis to set ).

Clearly, all positive examples above were created using witness sets. The following theorem identifies a large class of pairs that exhibit finite witness sets.

###### Theorem 23.

For any , consider two partial orderings and over the elements of where for , we say if , and if . For both partial orderings we identify (as equivalent) hypotheses where these intersections coincide and further we remove all hypotheses where the intersections are empty. (Footnote 3: Here, we think of hypotheses and as the pre-image of 1, as noted in Section 2, and hence subsets of .) If both partial orders have a finite number of minima for each , then the pair exhibits a finite witness set and hence is certifiable.

###### Proof.

For this proof we will identify hypotheses with their equivalence classes in each partial ordering. Let be the set of minima for and for . For each , we pick a point such that but for any , thus forming a set . Similarly, we define the set . We claim that is a witness set for , i.e., we can determine whether is in the margin of any hypothesis by looking at labels that assigns to points in .

We only consider the case where , since the case is similar. We claim that is in the margin of if and only if there exists a point in that is assigned the label 1 by . Indeed, suppose there exists such a point. Then since the point lies in and is assigned the opposite label as by , must lie in the margin of . For the other direction, suppose lies in the margin of . Then there must exist a point such that , which means there must be a hypothesis such that , which means there must exist such that . ∎

We can easily verify, for example, that for the pair defined in Observation 16, the set of minima defined by the partial orderings above is finite. Indeed, the (equivalence class of) half-spaces corresponding to the four corners of the unit cube constitute the minima.

### 4.2 Query complexity of adversarial attacks and its connection to robust PAC learning

###### Definition 24.

There have been (successful) attempts (Papernot et al., 2017; Brendel et al., 2018) at attacking trained neural network models where the adversary was not given any information about the gradients, and had to rely solely on black-box queries to the model. Our definition of the adversary fits those scenarios. Next, we define the query complexity of the adversary.

###### Definition 25.

If, for , there is a function such that, for any and any set , the adversary will produce an admissible attack after at most queries, we say that adversary has query complexity bounded by on . We say that the adversary is efficient if is polynomial in .

Note that it is possible that the adversary’s queries are adaptive, i.e., the point it queries depends on the output of its first queries. A weaker version of an adversary is one where that is not the case.

###### Definition 26.

An adversary is called non-adaptive if the set of points it queries is uniquely determined by the set before making any queries to .

Intuitively, there is a connection between perfect adversaries and witness sets because a witness set merely helps identify the points in that have adversarial points, whereas an adversary finds those adversarial points.

###### Observation 27.

If the pair exhibits a perfect, non-adaptive adversary with query complexity , then it also has witness sets of size .

Finally, we tie everything together by showing that the existence of a proper, perfect adversary implies that the robust learning problem has a small sample complexity.

###### Theorem 28.

If the robust learning problem defined by has a perfect, proper, and efficient adversary, then in the robust realizable-case () it can be robustly learned with sample complexity where .

###### Proof sketch.

We adapt the compression-based approach of (Montasser et al., 2019) to prove the result. Let us assume that we are given a sample of size that is labeled by some , and want to “compress” this sample using a small subset . Let us assume that is the hypothesis that is reconstructed using . For the compression to succeed, we need to have for every . Given the perfect proper efficient adversary, we can find all the adversarial points in using queries. In fact, we can append the points corresponding to these queries to to create an inflated set, which we call . Note that . Furthermore, we can now replace the condition with (the latter implies the former because of the definition of perfect adversary). Therefore, our task becomes compressing with respect to the standard binary loss, for which we can use boosting (Schapire, 1990; Moran and Yehudayoff, 2016). There is, however, a catch: we are only allowed to use points from the given sample in our compressed set, yet a careless boosting-based approach may use points in . Note that since the adversary was proper we have . Therefore, we can use robust ERM that operates on as our weak learner as opposed to the regular ERM that operates on (Montasser et al., 2019). As a result, we can encode each weak classifier using instances from . Using a boosting argument, we need to combine weak classifiers to achieve zero error on the whole sample. Finally, the size of the compression is and the result follows from the classic connection of compression and learning (Littlestone and Warmuth, 1986; Montasser et al., 2019). ∎

We remark that the above theorem improves the result of (Montasser et al., 2019) when an efficient, perfect, and proper adversary exists. This shows an interesting connection between the difficulty of finding adversarial examples and that of robust learning. In particular, if the adversarial points can be found easily (at least when measured by query complexity), then robust learning is almost as easy as non-robust learning (in the sense of agnostic sample complexity). Or, stated in the contrapositive, if robust learning is hard, then even if adversarial points exist, finding them is going to be hard.

It is possible to further extend the result to the agnostic learning scenario, using the same reduction from agnostic learning to realizable learning that was proposed by (David et al., 2016) and used in (Montasser et al., 2019).

## 5 Conclusion

We formalized the problem of black-box certification and its relation to an adversary with bounded query budget. We showed that the existence of an adversary with small query complexity implies small sample complexity for robust learning. This suggests that the apparent hardness of robust learning – compared to standard PAC learning – in terms of sample complexity may not actually matter as long as we are dealing with bounded adversaries. It would be interesting to explore other types of adversaries (e.g., non-proper and/or non-perfect) to see if they lead to efficient robust learners as well. Another interesting direction is finding scenarios where finite unlabeled data can substitute for the knowledge of the marginal distribution discussed in Section 3.

## Acknowledgements

We thank the Vector Institute for providing us with the meeting space in which this work was developed! Ruth Urner and Hassan Ashtiani were supported by NSERC Discovery Grants.

## References

• N. Akhtar and A. Mian (2018) Threat of adversarial attacks on deep learning in computer vision: a survey. IEEE Access 6, pp. 14410–14430. Cited by: §1.1.
• J. Alayrac, J. Uesato, P. Huang, A. Fawzi, R. Stanforth, and P. Kohli (2019) Are labels required for improving adversarial robustness?. In Advances in Neural Information Processing Systems 32, NeurIPS, pp. 12192–12202. Cited by: §1.1.
• I. Attias, A. Kontorovich, and Y. Mansour (2019) Improved generalization bounds for robust learning. In Algorithmic Learning Theory, ALT, pp. 162–183. Cited by: §1.1.
• P. Awasthi, A. Dutta, and A. Vijayaraghavan (2019) On robustness to adversarial examples and polynomial optimization. In Advances in Neural Information Processing Systems, NeurIPS, pp. 13760–13770. Cited by: §1.1.
• S. Ben-David, A. Itai, and E. Kushilevitz (1995) Learning by distances. Inf. Comput. 117 (2), pp. 240–250. Cited by: §3.1.
• A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth (1989) Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM) 36 (4), pp. 929–965. Cited by: §C.1, §2, §2.
• W. Brendel, J. Rauber, and M. Bethge (2018) Decision-based adversarial attacks: reliable attacks against black-box machine learning models. Cited by: §4.2.
• S. Bubeck, Y. T. Lee, E. Price, and I. P. Razenshteyn (2019) Adversarial examples from computational constraints. In Proceedings of the 36th International Conference on Machine Learning, ICML, pp. 831–840. Cited by: §1.1, §1.1.
• Y. Carmon, A. Raghunathan, L. Schmidt, J. C. Duchi, and P. Liang (2019) Unlabeled data improves adversarial robustness. In Advances in Neural Information Processing Systems 32, NeurIPS, pp. 11190–11201. Cited by: §1.1.
• A. Chakraborty, M. Alam, V. Dey, A. Chattopadhyay, and D. Mukhopadhyay (2018) Adversarial attacks and defences: A survey. CoRR abs/1810.00069. Cited by: §1.1, §1.
• P. Chen, H. Zhang, Y. Sharma, J. Yi, and C. Hsieh (2017) Zoo: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15–26. Cited by: §1.1.
• J. M. Cohen, E. Rosenfeld, and J. Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. In Proceedings of the 36th International Conference on Machine Learning, ICML, pp. 1310–1320. Cited by: §1.
• D. Cullina, A. N. Bhagoji, and P. Mittal (2018) PAC-learning in the presence of adversaries. In Advances in Neural Information Processing Systems, NeurIPS, pp. 230–241. Cited by: §1.1, §1.2.
• O. David, S. Moran, and A. Yehudayoff (2016) Supervised learning through the lens of compression. In Advances in Neural Information Processing Systems, NIPS, pp. 2784–2792. Cited by: §4.2.
• D. I. Diochnos, S. Mahloujifar, and M. Mahmoody (2019) Lower bounds for adversarially robust PAC learning. CoRR abs/1906.05815. Cited by: §1.1.
• D. Diochnos, S. Mahloujifar, and M. Mahmoody (2018) Adversarial risk and robustness: general definitions and implications for the uniform distribution. In Advances in Neural Information Processing Systems 31, NeurIPS, pp. 10359–10368. Cited by: §1.1, Remark 5.
• Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li (2018) Boosting adversarial attacks with momentum. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9185–9193. Cited by: §1.1.
• U. Feige, Y. Mansour, and R. Schapire (2015) Learning and inference in the presence of corrupted inputs. In Conference on Learning Theory, COLT, pp. 637–657. Cited by: §1.1.
• V. Feldman (2017) A general characterization of the statistical query complexity. In Proceedings of the 30th Conference on Learning Theory, COLT, pp. 785–830. Cited by: §3.1.
• S. Garg, S. Jha, S. Mahloujifar, and M. Mahmoody (2019) Adversarially robust learning could leverage computational hardness. CoRR abs/1905.11564. Cited by: §1.1, §4.2.
• I. J. Goodfellow, P. D. McDaniel, and N. Papernot (2018) Making machine learning robust against adversarial inputs. Commun. ACM 61 (7), pp. 56–66. Cited by: §1.1.
• C. Göpfert, S. Ben-David, O. Bousquet, S. Gelly, I. O. Tolstikhin, and R. Urner (2019) When can unlabeled data improve the learning rate?. In Conference on Learning Theory, COLT, pp. 1500–1518. Cited by: §2.1.
• P. Gourdeau, V. Kanade, M. Kwiatkowska, and J. Worrell (2019) On the hardness of robust classification. In Advances in Neural Information Processing Systems 32, NeurIPS, pp. 7444–7453. Cited by: §1.1, §1.1, Remark 5.
• D. Haussler and E. Welzl (1987) Epsilon-nets and simplex range queries. Discret. Comput. Geom. 2, pp. 127–151. Cited by: §C.1.
• M. J. Kearns (1998) Efficient noise-tolerant learning from statistical queries. J. ACM 45 (6), pp. 983–1006. Cited by: §3.1.
• N. Littlestone and M. Warmuth (1986) Relating data compression and learnability. Cited by: Appendix E, §4.2.
• A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR, Cited by: §1.1.
• O. Montasser, S. Goel, I. Diakonikolas, and N. Srebro (2020) Efficiently learning adversarially robust halfspaces with noise. arXiv preprint arXiv:2005.07652. Cited by: §1.1.
• O. Montasser, S. Hanneke, and N. Srebro (2019) VC classes are adversarially robustly learnable, but only improperly. In Conference on Learning Theory, COLT, pp. 2512–2530. Cited by: Appendix E, §1.1, §1.2, §1, §1, §2.1, §2, §3.1.1, §3.1.1, §3, §4.2, §4.2, §4.2, Theorem 6.
• S. Moran and A. Yehudayoff (2016) Sample compression schemes for vc classes. Journal of the ACM (JACM) 63 (3), pp. 1–10. Cited by: §3.1.1, §3, §4.2.
• N. Narodytska and S. Kasiviswanathan (2017) Simple black-box adversarial attacks on deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1310–1318. Cited by: §1.1.
• N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami (2017) Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pp. 506–519. Cited by: §1.1, §4.2.
• H. Salman, J. Li, I. P. Razenshteyn, P. Zhang, H. Zhang, S. Bubeck, and G. Yang (2019) Provably robust deep learning via adversarially trained smoothed classifiers. In Advances in Neural Information Processing Systems 32, NeurIPS, pp. 11289–11300. Cited by: §1.1, §1.
• R. E. Schapire and Y. Freund (2013) Boosting: foundations and algorithms. Kybernetes. Cited by: Appendix E.
• R. E. Schapire (1990) The strength of weak learnability. Machine learning 5 (2), pp. 197–227. Cited by: §4.2.
• L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry (2018) Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems, NeurIPS, pp. 5014–5026. Cited by: §1.1.
• S. Shalev-Shwartz and S. Ben-David (2014) Understanding machine learning: from theory to algorithms. Cambridge University Press. Cited by: §C.1, §C.2, §2.
• A. Sinha, H. Namkoong, and J. C. Duchi (2018) Certifying some distributional robustness with principled adversarial training. In 6th International Conference on Learning Representations, ICLR, Cited by: §1.1, §1.
• J. Su, D. V. Vargas, and K. Sakurai (2019) One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation 23 (5), pp. 828–841. Cited by: §1.1.
• C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR, Cited by: §1.
• L. G. Valiant (1984) A theory of the learnable. Commun. ACM 27 (11), pp. 1134–1142. Cited by: §C.1, §2.
• V. N. Vapnik and A. Ya. Chervonenkis (1971) On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications 16 (2), pp. 264–280. Cited by: §C.1, §2.
• Y. Wang, S. Jha, and K. Chaudhuri (2018) Analyzing the robustness of nearest neighbors to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, ICML, pp. 5120–5129. Cited by: §1.1.
• E. Wong and J. Z. Kolter (2018) Provable defenses against adversarial examples via the convex outer adversarial polytope. In Proceedings of the 35th International Conference on Machine Learning, ICML, pp. 5283–5292. Cited by: §1.1.
• Y. Yang, C. Rashtchian, Y. Wang, and K. Chaudhuri (2019) Adversarial examples for non-parametric methods: attacks, defenses and large sample limits. CoRR abs/1906.03310. Cited by: §1.1, §2.1.
• D. Yin, K. Ramchandran, and P. L. Bartlett (2019) Rademacher complexity for adversarially robust generalization. In Proceedings of the 36th International Conference on Machine Learning, ICML, pp. 7085–7094. Cited by: §1.2, Observation 8.

## Appendix A Note on our notation for sets and functions

We use the following notation for sets and functions:

• $2^X$: the power-set (set of all subsets) of $X$
• $Y^X$: the set of all functions from $X$ to $Y$
• $f : X \to Y$: $f$ is a function from $X$ to $Y$

Functions from some set to some set are a special type of relations between and . Thus a function is a subset of , namely

$$f = \{(x,y) \in X \times Y \mid y = f(x)\}$$

If is a (not necessarily binary) classifier, and is a probability distribution over , then the probability of misclassification is , where is the complement of in , that is

$$\mathrm{err}_h = \{(x,z) \in X \times Y \mid z \neq h(x)\} = (X \times Y) \setminus h$$

If is a binary label space, then it is also common to identify classifiers with a subset of the domain, namely the set , that is the set of points that is mapped to label under :

$$h^{-1}(1) = \{x \in X \mid h(x) = 1\}$$

We switch between identifying with and viewing as a subset of , depending on which view aids the simplicity of argument in a given context.

We defined the margin areas of a classifier (with respect to a perturbation type) again as subsets of .

$$\mathrm{mar}^{U}_{h} = \{(x,y) \in X \times Y \mid \exists z \in U(x) : h(x) \neq h(z)\}$$

Note that here, if for a given domain point , we have for some , then for all . Thus, the sets are not functions. Rather, they can naturally be identified with their projection on , and we again do so if convenient in the context.

The given definitions of and naturally let us express the robust loss as the probability measure of a subset of :

$$L^{U}_{P}(h) = P(\mathrm{err}_h \cup \mathrm{mar}^{U}_{h}).$$
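As a small worked illustration of this identity, the sketch below computes the robust loss on a finite domain by summing the probability of all labeled points that fall in the error set or the margin set. The domain, the threshold classifier, the perturbation types, and the function name `robust_loss` are hypothetical choices made only for this example.

```python
from fractions import Fraction


def robust_loss(points, probs, h, U):
    """P(err_h union mar_h^U) for a distribution supported on finitely many labeled points."""
    total = Fraction(0)
    for (x, y), p in zip(points, probs):
        in_err = h(x) != y                          # (x, y) lies in err_h
        in_mar = any(h(z) != h(x) for z in U(x))    # (x, y) lies in mar_h^U
        if in_err or in_mar:
            total += p
    return total


# Hypothetical example: threshold classifier on the integers 0..5.
h = lambda x: 1 if x >= 3 else 0
points = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 0)]
probs = [Fraction(1, 6)] * 6

U_wide = lambda x: {x - 1, x, x + 1}   # perturb by at most one step
U_none = lambda x: {x}                 # no perturbation (a restriction of U_wide)

loss_wide = robust_loss(points, probs, h, U_wide)
loss_none = robust_loss(points, probs, h, U_none)
```

Here the point (5, 0) is misclassified and the points at 2 and 3 lie in the margin under `U_wide`, so `loss_wide` is 1/2, while `loss_none` reduces to the plain misclassification probability 1/6. The smaller perturbation type gives the smaller loss, as expected for a restriction.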

## Appendix B Note on measurability

Here, we note that allowing the perturbation type to be an arbitrary mapping from the domain to can easily lead to the adversarial loss being not measurable, even if is a measurable set for every . Consider the case , and a distribution with uniform on the interval . Consider a subset that is not Borel-measurable. Consider a simple threshold function

$$f : \mathbb{R} \to \{0,1\}, \quad f(x) = \mathbb{1}[x < 1]$$

and the following perturbation type:

$$U(x) = \begin{cases} \emptyset & \text{if } x \notin M \\ \{x+1\} & \text{if } x \in M \end{cases}$$

Clearly, is a measurable function, and every set is measurable. However, we get , that is, the margin area of under these perturbations is not measurable, and therefore the adversarial loss with respect to is not measurable. Note that the same phenomenon can occur for sets that are always open intervals containing the point . With the same function , for perturbation sets

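$$U(x) = \begin{cases} B_r(x) \cap (0,1) & \text{if } x < 1,\ x \notin M \\ B_r(x) \cap (1,2) & \text{if } x > 1 \\ (0,2) & \text{if } x \in M \text{ or } x = 1 \end{cases}$$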

we get