In this paper, we consider the problem of distribution-independent learning of halfspaces that are robust to adversarial examples at test time, also referred to as robust PAC learning of halfspaces. Halfspaces are binary predictors of the form $h_w(x) = \mathrm{sign}(\langle w, x \rangle)$, where $w \in \mathbb{R}^d$.
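In adversarially robust PAC learning, given an instance space $\mathcal{X}$ and label space $\mathcal{Y}$, we formalize an adversary, against which we would like to be robust, as a map $\mathcal{U}: \mathcal{X} \to 2^{\mathcal{X}}$, where $\mathcal{U}(x) \subseteq \mathcal{X}$ represents the set of perturbations (adversarial examples) that can be chosen by the adversary at test time (i.e., we require that $x \in \mathcal{U}(x)$). For an unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, we observe $m$ i.i.d. samples $S \sim \mathcal{D}^m$, and our goal is to learn a predictor $\hat{h}: \mathcal{X} \to \mathcal{Y}$ that achieves small robust risk,
$$\mathrm{R}_{\mathcal{U}}(\hat{h}; \mathcal{D}) = \mathbb{E}_{(x,y) \sim \mathcal{D}} \Big[ \sup_{z \in \mathcal{U}(x)} \mathbb{1}[\hat{h}(z) \neq y] \Big].$$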
The information-theoretic aspects of adversarially robust learning have been studied in recent work, see, e.g., [DBLP:conf/nips/SchmidtSTTM18, NIPS2018_7307, khim2018adversarial, bubeck2019adversarial, DBLP:conf/icml/YinRB19, pmlr-v99-montasser19a]. This includes studying what learning rules should be used for robust learning and how much training data is needed to guarantee high robust accuracy. It is now known that any hypothesis class with finite VC dimension is robustly learnable, though sometimes improper learning is necessary and the sample complexity may be exponential in the VC dimension [pmlr-v99-montasser19a].
On the other hand, the computational aspects of adversarially robust PAC learning are less understood. In this paper, we take a first step towards studying this broad algorithmic question with a focus on the fundamental problem of learning adversarially robust halfspaces.
A first question to ask is whether efficient PAC learning implies efficient robust PAC learning, i.e., whether there is a general reduction that solves the adversarially robust learning problem. Recent work has provided strong evidence that this is not the case. Specifically, [bubeck2019adversarial] showed that there exists a learning problem that can be learned efficiently non-robustly, but is computationally intractable to learn robustly (under plausible complexity-theoretic assumptions). There is also more recent evidence that suggests that this is also the case in the PAC model. [awasthi2019robustness] showed that it is computationally intractable to even weakly robustly learn degree- polynomial threshold functions (PTFs) with perturbations in the realizable setting, while PTFs of any constant degree are known to be efficiently PAC learnable non-robustly in the realizable setting. [gourdeau2019hardness] showed that there are hypothesis classes that are hard to robustly PAC learn, under the assumption that it is hard to non-robustly PAC learn.
The aforementioned discussion suggests that when studying robust PAC learning, we need to characterize which types of perturbation sets admit computationally efficient robust PAC learners and under which noise assumptions. In the agnostic PAC setting, it is known that even weak (non-robust) learning of halfspaces is computationally intractable [feldman2006new, guruswami2009hardness, DiakonikolasOSW11, daniely2016complexity]. For -perturbations, where , it was recently shown that the complexity of proper learning is exponential in [diakonikolas2019nearly]. In this paper, we focus on the realizable case and the (more challenging) case of random label noise.
We can be more optimistic in the realizable setting. Halfspaces are efficiently PAC learnable non-robustly via Linear Programming [maass1994fast], and under the margin assumption via the Perceptron algorithm [rosenblatt1958perceptron]. But what can we say about robustly PAC learning halfspaces? Given a perturbation set and under the assumption that there is a halfspace that robustly separates the data, can we efficiently learn a predictor with small robust risk?
Just as empirical risk minimization (ERM) is central for non-robust PAC learning, a core component of adversarially robust learning is minimizing the robust empirical risk on a dataset ,
In this paper, we provide necessary and sufficient conditions on perturbation sets , under which the robust empirical risk minimization problem is efficiently solvable in the realizable setting. We show that an efficient separation oracle for yields an efficient solver for , while an efficient approximate separation oracle for is necessary for even computing the robust loss of a halfspace . In addition, we relax our realizability assumption and show that under random classification noise [DBLP:journals/ml/AngluinL87], we can efficiently robustly PAC learn halfspaces with respect to any perturbation.
Our main contributions can be summarized as follows:
In the realizable setting, the class of halfspaces is efficiently robustly PAC learnable with respect to , given an efficient separation oracle for .
To even compute the robust risk with respect to efficiently, an efficient approximate separation oracle for is necessary.
In the random classification noise setting, the class of halfspaces is efficiently robustly PAC learnable with respect to any perturbation.
1.1 Related Work
Here we focus on the recent work that is most closely related to the results of this paper. [awasthi2019robustness] studied the tractability of with respect to perturbations, obtaining efficient algorithms for halfspaces in the realizable setting, but showing that for degree- polynomial threshold functions is computationally intractable (assuming ). [gourdeau2019hardness] studied robust learnability of hypothesis classes defined over with respect to Hamming distance, and showed that monotone conjunctions are robustly learnable when the adversary can perturb only bits, but are not robustly learnable even under the uniform distribution when the adversary can flip bits.
In this work, we take a more general approach, and instead of considering specific perturbation sets, we provide methods in terms of oracle access to a separation oracle for the perturbation set , and aim to characterize which perturbation sets admit tractable .
In the non-realizable setting, the only prior work we are aware of is by [diakonikolas2019nearly] who studied the complexity of robustly learning halfspaces in the agnostic setting under perturbations.
2 Problem Setup
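Let $\mathcal{X} = \mathbb{R}^d$ be the instance space and $\mathcal{Y} = \{\pm 1\}$ be the label space. We consider halfspaces $\mathcal{H} = \{x \mapsto \mathrm{sign}(\langle w, x \rangle) : w \in \mathbb{R}^d\}$.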
The following definitions formalize the notion of adversarially robust PAC learning in the realizable and random classification noise settings:
Definition 2.1 (Realizable Robust PAC Learning).
We say is robustly PAC learnable with respect to an adversary in the realizable setting, if there exists a learning algorithm with sample complexity such that: for any , for every data distribution over where there exists a predictor with zero robust risk, , with probability at least over ,
Definition 2.2 (Robust PAC Learning with Random Classification Noise).
Let be an unknown halfspace. Let be an arbitrary distribution over such that , and . A noisy example oracle, works as follows: each time is invoked, it returns a labeled example , where , with probability and with probability . Let be the joint distribution on generated by the above oracle.
We say is robustly PAC learnable with respect to an adversary in the random classification noise model, if and a learning algorithm , such that for every distribution over (generated as above by a noisy oracle), with probability at least over ,
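To make the noisy example oracle of Definition 2.2 concrete, the following Python sketch draws an instance, labels it with a fixed target halfspace, and flips the label with a given noise rate. The names `make_noisy_oracle`, `sample_x`, and `eta`, as well as the convention sign(0) = +1, are assumptions introduced here for illustration. The sketch does not enforce any condition on the marginal distribution.

```python
import random


def sign(v):
    # Assumed convention for this sketch: sign(0) = +1.
    return 1 if v >= 0 else -1


def make_noisy_oracle(w_star, sample_x, eta, rng):
    """Return a zero-argument oracle producing (x, y) with label noise.

    w_star   : target halfspace (list of floats), hypothetical name
    sample_x : callable rng -> x, draws an instance from the marginal
    eta      : probability with which the clean label is flipped
    """
    def oracle():
        x = sample_x(rng)
        clean = sign(sum(wi * xi for wi, xi in zip(w_star, x)))
        # Flip the clean label independently with probability eta.
        y = -clean if rng.random() < eta else clean
        return x, y
    return oracle
```

Each call to the returned oracle yields one independent noisy example, so a training sample is obtained by calling it repeatedly.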
Sample Complexity of Robust Learning
Denote by the robust loss class of ,
It was shown by [NIPS2018_7307] that for any set that is nonempty, closed, convex, and origin-symmetric, and an adversary that is defined as (e.g., -balls), the VC dimension of the robust loss of halfspaces is at most the standard VC dimension . Based on Vapnik’s “General Learning” [vapnik:82], this implies that we have uniform convergence of robust risk with samples. Formally, for any and any distribution over , with probability at least over ,
In particular, this implies that for any adversary that satisfies the conditions above, is robustly PAC learnable w.r.t. by minimizing the robust empirical risk on ,
Thus, it remains to efficiently solve the problem. We discuss necessary and sufficient conditions for solving in the following section.
3 The Realizable Setting
In this section, we show necessary and sufficient conditions for minimizing the robust empirical risk on a dataset in the realizable setting, i.e., when the dataset is robustly separable with a halfspace where . In Theorem 3.5, we show that an efficient separation oracle for yields an efficient solver for . In Theorem 3.10, we show that an efficient approximate separation oracle for is necessary for even computing the robust loss of a halfspace .
Note that the set of allowed perturbations can be non-convex, and so it might seem difficult to imagine being able to solve the problem in full generality. But, it turns out that for halfspaces it suffices to consider only convex perturbation sets due to the following observation:
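Observation 3.1.
Given a halfspace $w \in \mathbb{R}^d$ and an example $(x, y)$: if $y \langle w, z \rangle > 0$ for all $z \in \mathcal{U}(x)$, then $y \langle w, z \rangle > 0$ for all $z \in \mathrm{conv}(\mathcal{U}(x))$. And if $y \langle w, z \rangle \le 0$ for some $z \in \mathrm{conv}(\mathcal{U}(x))$, then $y \langle w, z \rangle \le 0$ for some $z \in \mathcal{U}(x)$, where $\mathrm{conv}(\mathcal{U}(x))$ denotes the convex hull of the perturbation set $\mathcal{U}(x)$.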
Observation 3.1 shows that for any dataset that is robustly separable w.r.t. with a halfspace , is also robustly separable w.r.t. the convex hull using the same halfspace , where . Thus, in the remainder of this section we only consider perturbation sets that are convex, i.e., for each , is convex.
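The observation can be checked numerically on a small example. The sketch below assumes a finite perturbation set and a homogeneous halfspace given by a weight vector. It samples random convex combinations of the perturbations and checks that each is labeled correctly whenever all original perturbations are. The helper names are hypothetical and the code only illustrates the statement; it is not part of the paper's argument.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def robust_on(w, points, y):
    # True when the halfspace w labels every point in `points` as y
    # with a strictly positive margin.
    return all(y * dot(w, z) > 0 for z in points)


def random_convex_combination(points, rng):
    # Draw nonnegative weights and normalize them to sum to one.
    weights = [rng.random() + 1e-12 for _ in points]
    total = sum(weights)
    dim = len(points[0])
    return [sum(wt / total * p[i] for wt, p in zip(weights, points))
            for i in range(dim)]
```

Since a convex combination of strictly positive margins is again strictly positive, robustness on the finite set carries over to every sampled point of its convex hull.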
Denote by a separation oracle for . takes as input and either:
asserts that , or
returns a separating hyperplane such that for all .
For any , denote by an approximate separation oracle for . takes as input and either:
asserts that , or
returns a separating hyperplane such that for all .
Denote by a membership oracle for . takes as input and either:
asserts that , or
asserts that .
When discussing a separation or membership oracle for a fixed convex set , we overload notation and write , , and (in this case only one argument is required).
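As a concrete instance of these oracles, consider perturbation sets that are balls in the infinity norm of radius `gamma` around `x`. The sketch below is an illustration under that assumption. The function names are hypothetical, as is the convention that a separating vector `c` satisfies `<c, z'> < <c, z>` for every `z'` in the ball.

```python
def membership_oracle_linf(x, z, gamma):
    # z is in the ball iff every coordinate differs from x by at most gamma.
    return all(abs(zi - xi) <= gamma for xi, zi in zip(x, z))


def separation_oracle_linf(x, z, gamma):
    """Return None if z lies in the ball, else a separating vector c.

    For a returned c, every z' in the ball satisfies <c, z'> < <c, z>.
    """
    # Coordinate in which z deviates most from the center x.
    i = max(range(len(x)), key=lambda j: abs(z[j] - x[j]))
    gap = z[i] - x[i]
    if abs(gap) <= gamma:
        return None
    c = [0.0] * len(x)
    # Point the hyperplane normal along the violated coordinate.
    c[i] = 1.0 if gap > 0 else -1.0
    return c
```

For balls in other norms, the same pattern applies with a different choice of normal vector, provided one can be computed efficiently.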
3.1 An efficient separation oracle for is sufficient to solve efficiently
Let denote the set of valid solutions for (see Equation 3). Note that is not empty since we are considering the realizable setting. Although the treatment we present here is for homogeneous halfspaces (where a bias term is not needed), the results extend trivially to the non-homogeneous case.
Below, we show that we can efficiently find a solution given access to a separation oracle for , .
Theorem 3.5.
Let be an arbitrary convex adversary. Given access to a separation oracle for that runs in time , there is an algorithm that finds in time , where is an upper bound on the bit complexity of the valid solutions in and the examples and perturbations in .
Note that the polynomial dependence on in the runtime is unavoidable even in standard non-robust ERM for halfspaces, unless we can solve linear programs in strongly polynomial time, which is currently an open problem.
Theorem 3.5 implies that for a broad family of perturbation sets , halfspaces are efficiently robustly PAC learnable with respect to in the realizable setting, as we show in the following corollary:
Let be an adversary such that where is nonempty, closed, convex, and origin-symmetric. Then, given access to an efficient separation oracle that runs in time , is robustly PAC learnable w.r.t. in the realizable setting in time .
This covers many types of perturbation sets that are considered in practice. For example, could be perturbations of distance at most w.r.t. some norm , such as the norm considered in many applications: . In addition, Theorem 3.5 also implies that we can solve the problem for other natural perturbation sets such as translations and rotations in images (see, e.g., [pmlr-v97-engstrom19a]), and perhaps mixtures of perturbations of different types (see, e.g., [DBLP:journals/corr/abs-1908-08016]), as long as we have access to efficient separation oracles for these sets.
Benefits of handling general perturbation sets :
One important implication of Theorem 3.5 that highlights the importance of having a treatment that considers general perturbation sets (and not just perturbations, for example) is the following: for any efficiently computable feature map , we can efficiently solve the robust empirical risk problem over the induced halfspaces , as long as we have access to an efficient separation oracle for the image of the perturbations . Observe that in general may be non-convex and complicated even if is convex. However, Observation 3.1 combined with the realizability assumption implies that it suffices to have an efficient separation oracle for the convex hull .
Before we proceed with the proof of Theorem 3.5, we state the following requirements and guarantees for the Ellipsoid method which will be useful for us in the remainder of the section:
Lemma 3.7 (see, e.g., Theorem 2.4 in [bubeck2015convex]).
Let be a convex set, and a separation oracle for . Then, the Ellipsoid method using oracle queries to , will find a , or assert that is empty. Furthermore, the total runtime is .
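For intuition, a central-cut variant of the Ellipsoid method for feasibility can be sketched in a few lines of Python. This is a floating-point illustration, not the exact-arithmetic procedure behind the lemma. It assumes dimension at least two, a feasible set contained in the initial ball, and an oracle that returns `None` at a feasible point and otherwise a vector `g` with `<g, w - c> <= 0` for every feasible `w`. The function name and parameters are hypothetical.

```python
def ellipsoid_feasibility(separation_oracle, dim, radius=1.0, max_iters=2000):
    """Search for a point accepted by `separation_oracle`.

    Maintains an ellipsoid {w : (w - c)^T P^{-1} (w - c) <= 1} that
    contains the feasible set. Returns a feasible point, or None if
    none is found within max_iters (requires dim >= 2).
    """
    c = [0.0] * dim
    P = [[radius ** 2 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for _ in range(max_iters):
        g = separation_oracle(c)
        if g is None:
            return c  # the current center is feasible
        Pg = [sum(P[i][j] * g[j] for j in range(dim)) for i in range(dim)]
        gPg = sum(g[i] * Pg[i] for i in range(dim))
        if gPg <= 0:
            return None  # degenerate ellipsoid (numerical breakdown)
        b = [v / gPg ** 0.5 for v in Pg]
        # Move the center into the half that may contain feasible points.
        c = [c[i] - b[i] / (dim + 1) for i in range(dim)]
        scale = dim * dim / (dim * dim - 1.0)
        # Shrink to the minimum-volume ellipsoid covering that half.
        P = [[scale * (P[i][j] - 2.0 / (dim + 1) * b[i] * b[j])
              for j in range(dim)] for i in range(dim)]
    return None
```

Each cut reduces the volume of the ellipsoid by a constant factor depending only on the dimension, which is the source of the polynomial bound on oracle queries.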
The proof of Theorem 3.5 relies on two key lemmas. First, we show that efficient robust certification yields an efficient solver for the problem. Given a halfspace and an example , efficient robust certification means that there is an algorithm that can efficiently either: (a) assert that is robust on , i.e. , or (b) return a perturbation such that .
Let be a procedure that either: (a) Asserts that is robust on , i.e., , or (b) Finds a perturbation such that . If can be solved in time, then there is an algorithm that finds in time.
Observe that is a convex set since
Our goal is to find a . Let be an efficient robust certifier that runs in time. We will use to implement a separation oracle for denoted . Given a halfspace , we simply check if is robustly correct on all datapoints by running on each . If there is a point where is not robustly correct, then we get a perturbation where , and we return as a separating hyperplane. Otherwise, we know that is robustly correct on all datapoints, and we just assert that .
Once we have a separation oracle , we can use the Ellipsoid method (see Lemma 3.7) to solve the problem. More specifically, with a query complexity of to and overall runtime of (this depends on runtime of ), the Ellipsoid method will return a . ∎
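The construction in this proof can be illustrated for one specific perturbation set. The sketch below assumes infinity-norm balls of radius `gamma`, for which a robust certifier has a closed form, and treats a margin of zero as a misclassification. The function names and the sign convention of the returned vector `g` (every valid solution `w'` satisfies `<g, w'> < <g, w>`) are assumptions made for this example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def certify_linf(w, x, y, gamma):
    """Robust certifier for an infinity-norm ball of radius gamma.

    Returns None if w is robust on (x, y), else a perturbation z in
    the ball with y * <w, z> <= 0.
    """
    sgn = [(t > 0) - (t < 0) for t in w]
    # Worst-case perturbation: push each coordinate against y * w.
    z = [xi - y * gamma * s for xi, s in zip(x, sgn)]
    return None if y * dot(w, z) > 0 else z


def separation_oracle_for_solutions(w, data, gamma):
    """Separation oracle for the set of robustly correct halfspaces."""
    for x, y in data:
        z = certify_linf(w, x, y, gamma)
        if z is not None:
            # Valid solutions w' have y<w', z> > 0 while y<w, z> <= 0,
            # so g = -y z satisfies <g, w'> < 0 <= <g, w>.
            return [-y * zi for zi in z]
    return None
```

Passing this oracle to an Ellipsoid-style solver gives a search over halfspaces in which every query costs one certifier call per training example.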
Next, we show that we can do efficient robust certification when given access to an efficient separation oracle for , .
If we have an efficient separation oracle that runs in time, then we can efficiently solve in time.
Given a halfspace and , we want to either: (a) assert that is robust on , i.e. , or (b) find a perturbation such that . Let be the set of all points that mis-labels. Observe that by definition is convex, and therefore is also convex. We argue that having an efficient separation oracle for suffices to solve our robust certification problem. Because if is not empty, then by definition, we can find a perturbation such that with a separation oracle and the Ellipsoid method (see Lemma 3.7). If is empty, then by definition, is robustly correct on , and the Ellipsoid method will terminate and assert that is empty.
Thus, it remains to implement . Given a point , we simply ask the separation oracle for by calling and the separation oracle for by checking if . If , then we get a separating hyperplane from and we can use it to separate from . Similarly, if , by definition, and so we can use as a separating hyperplane to separate from . The overall runtime of this separation oracle is , and so we can efficiently solve in time using the Ellipsoid method (Lemma 3.7). ∎
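The combined oracle described in this proof can be sketched as follows. The sketch assumes that the separation oracle for the perturbation set is supplied as a callable `sep_perturbation` returning `None` on membership and a separating vector otherwise, and that the mislabeled region is the set of points `z` with `y * <w, z> <= 0`. All names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def separation_oracle_intersection(z, sep_perturbation, w, y):
    """Separation oracle for (perturbation set) ∩ (points w mislabels).

    Returns None if z lies in the intersection, else a vector c with
    <c, z'> < <c, z> for every z' in the intersection.
    """
    c = sep_perturbation(z)
    if c is not None:
        return c  # z is outside the perturbation set
    if y * dot(w, z) > 0:
        # z is labeled correctly, so it is outside the mislabeled
        # region; c = y * w gives <c, z'> <= 0 < <c, z> there.
        return [y * wi for wi in w]
    return None
```

Running a feasibility search with this oracle either produces a perturbation that the halfspace mislabels or reports that no such point was found.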
We are now ready to proceed with the proof of Theorem 3.5.
3.2 An efficient approximate separation oracle for is necessary for computing the robust loss
Our efficient algorithm for requires a separation oracle for . We now show that even efficiently computing the robust loss of a halfspace on an example requires an efficient approximate separation oracle for .
Theorem 3.10.
Given a halfspace and an example , let be a procedure that computes the robust loss in time. Then, for any , we can implement an efficient -approximate separation oracle in time, where .
Let . We will describe how to implement a -approximate separation oracle for denoted . Fix the first argument to an arbitrary . Upon receiving a point as input, the main strategy is to search for a halfspace that can label all of with , and label the point with . If then there is a halfspace that separates from because is convex, but this is impossible if . Since we are only concerned with implementing an approximate separation oracle, we will settle for a slight relaxation which is to either:
assert that is -close to , i.e., , or
return a separating hyperplane such that for all .
Let denote the set of halfspaces that label all of with . Since is nonempty, it follows by definition that is nonempty. To evaluate membership in , given a query , we just make a call to . Let . This can be efficiently computed in time. Next, for any , we can get an -approximate separation oracle for denoted (see Definition 3.3) using queries to the membership oracle described above [lee2018efficient]. When queried with a halfspace , either:
asserts that , or
returns a separating hyperplane such that for all halfspaces .
Observe that by definition, . Furthermore, for any , by definition, such that . Since, for each , by definition of , we have , it follows by Cauchy-Schwarz inequality that:
Let be a -approximate separation oracle for . Observe that if the distance between and is greater than , it follows that there is such that:
By definition of , this implies that is not empty, which implies that the intersection is nonempty. We also have the contrapositive, which is, if the intersection is empty, then we know that . To conclude the proof, we run the Ellipsoid method with the approximate separation oracle to search over the restricted space . Restricting the space is easily done because we will use the query point as the separating hyperplane. Either the Ellipsoid method will find , in which case by Equation 4, has the property that:
and so we return as a separating hyperplane between and . If the Ellipsoid terminates without finding any such , this implies that the intersection is empty, and therefore, by the contrapositive above, we assert that . ∎
4 Random Classification Noise
In this section, we relax the realizability assumption to random classification noise [DBLP:journals/ml/AngluinL87]. We show that for any adversary that represents perturbations of bounded norm (i.e., , where ), the class of halfspaces is efficiently robustly PAC learnable with respect to in the random classification noise model.
Theorem 4.1.
Let be an adversary such that where and . Then, is robustly PAC learnable w.r.t. under random classification noise in time .
The proof of Theorem 4.1 relies on the following key lemma. We show that the structure of the perturbations allows us to relate the robust loss of a halfspace with the -margin loss of . Before we state the lemma, recall that the dual norm of denoted is defined as .
For any and any ,
First observe that
This holds because when , by definition , which implies that . For the other direction, when , by definition such that , which implies that . To conclude the proof, by definition of the set and the dual norm , we have
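The relation between the robust loss and the margin loss can be checked numerically. The sketch below assumes perturbations in a p-norm ball of radius `gamma`, a nonzero weight vector, and the tie-breaking convention that a worst-case margin of exactly zero counts as an error. The function names are hypothetical. It computes the dual norm and an explicit worst-case perturbation that attains it.

```python
import math


def dual_exponent(p):
    # q satisfies 1/p + 1/q = 1, with the usual limits for p = 1 and p = inf.
    if p == 1:
        return math.inf
    if p == math.inf:
        return 1.0
    return p / (p - 1.0)


def norm(v, p):
    if p == math.inf:
        return max(abs(t) for t in v)
    return sum(abs(t) ** p for t in v) ** (1.0 / p)


def robust_loss(w, x, y, gamma, p):
    # Robust 0-1 loss over the p-norm ball, written as a margin condition.
    margin = y * sum(a * b for a, b in zip(w, x))
    return 1 if margin <= gamma * norm(w, dual_exponent(p)) else 0


def worst_case_perturbation(w, y, gamma, p):
    # Perturbation of p-norm gamma that lowers y<w, x> by gamma * ||w||_q.
    sgn = [(t > 0) - (t < 0) for t in w]
    if p == math.inf:
        return [-y * gamma * s for s in sgn]
    if p == 1:
        i = max(range(len(w)), key=lambda j: abs(w[j]))
        return [-y * gamma * sgn[j] if j == i else 0.0 for j in range(len(w))]
    q = dual_exponent(p)
    wq = norm(w, q)
    return [-y * gamma * s * abs(t) ** (q - 1) / wq ** (q - 1)
            for s, t in zip(sgn, w)]
```

Adding the returned perturbation to `x` reduces the signed margin by exactly `gamma` times the dual norm of `w`, which is the quantity that appears in the margin condition.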
Lemma 4.2 implies that for any distribution over , to solve the -robust learning problem
it suffices to solve the -margin learning problem
We will solve the -margin learning problem in Equation (6) in the random classification noise setting using an appropriately chosen convex surrogate loss. Our convex surrogate loss and its analysis build on a convex surrogate that appears in the appendix of [diakonikolas2019distribution] for learning large -margin halfspaces under random classification noise w.r.t. the 0-1 loss. We note that the idea of using a convex surrogate to (non-robustly) learn large margin halfspaces in the presence of random classification noise is implicit in a number of prior works, starting with [DBLP:conf/colt/Bylander94].
Our robust setting is more challenging for the following reasons. First, we are not interested in only ensuring small 0-1 loss, but rather ensuring small -margin loss. Second, we want to be able to handle all norms, as opposed to just the norm. As a result, our analysis is somewhat delicate.
We will show that solving the following convex optimization problem:
where , suffices to solve the -margin learning problem in Equation (6). Intuitively, the idea here is that for , the objective is exactly a scaled hinge loss, which gives a learning guarantee w.r.t. the -margin loss when there is no noise (). When the noise , we slightly adjust the slopes, such that even a correct prediction incurs a loss. The choice of the slope is based on which will depend on the noise rate and the -suboptimality that is required for Equation (6).
We can solve Equation (7) with a standard first-order method through samples using Stochastic Mirror Descent, when the dual norm is an -norm. We state the following properties of Mirror Descent we will require:
Lemma 4.3 (see, e.g., Theorem 6.1 in [bubeck2015convex]).
Let be a convex function that is -Lipschitz w.r.t. where . Then, using the potential function , a suitable step-size , and a sequence of iterates computed by the following update:
Stochastic Mirror Descent with stochastic gradients of , will find an -suboptimal point such that and .
When , we will use the entropy potential function . In this case, Stochastic Mirror Descent will require stochastic gradients.
We are now ready to state our main result for this section:
Let . Let be a distribution over such that there exists a halfspace with and is generated by corrupted by RCN with noise rate . An application of Stochastic Mirror Descent on , returns, with high probability, a halfspace where with -robust misclassification error in time.
In Theorem 4.5, we get a -robustness guarantee assuming a -robust halfspace that is corrupted with random classification noise. This can be strengthened to get a guarantee of -robustness for any constant .
The rest of this section is devoted to the proof of Theorem 4.5. The high-level strategy is to show that an -suboptimal solution to Equation (7) gives us an -suboptimal solution to Equation (6) (for a suitably chosen ). In Lemma 4.8, we bound from above the -margin loss in terms of our convex surrogate objective , and in Lemma 4.9 we show that there are minimizers of our convex surrogate such that it is sufficiently small. These are the two key lemmas that we will use to piece everything together.
For any , consider the contribution of the objective of , denoted by . This is defined as where . In the following lemma, we provide a decomposition of that will help us in proving Lemmas 4.8 and 4.9.
For any , let . Then, we have that:
Based on the definition of the surrogate loss, it suffices to consider three cases: