# The Power of Localization for Efficiently Learning Linear Separators with Noise

We introduce a new approach for designing computationally efficient learning algorithms that are tolerant to noise, and demonstrate its effectiveness by designing algorithms with improved noise tolerance guarantees for learning linear separators. We consider both the malicious noise model and the adversarial label noise model. For malicious noise, where the adversary can corrupt both the label and the features, we provide a polynomial-time algorithm for learning linear separators in ℝ^d under isotropic log-concave distributions that can tolerate a nearly information-theoretically optimal noise rate of η = Ω(ϵ). For the adversarial label noise model, where the distribution over the feature vectors is unchanged, and the overall probability of a noisy label is constrained to be at most η, we also give a polynomial-time algorithm for learning linear separators in ℝ^d under isotropic log-concave distributions that can handle a noise rate of η = Ω(ϵ). We show that, in the active learning model, our algorithms achieve a label complexity whose dependence on the error parameter ϵ is polylogarithmic. This provides the first polynomial-time active learning algorithm for learning linear separators in the presence of malicious noise or adversarial label noise.

## 1 Introduction

Overview.

Dealing with noisy data is one of the main challenges in machine learning and is an active area of research. In this work we study the noise-tolerant learning of linear separators, arguably the most popular class of functions used in practice [CST00]. Learning linear separators from correctly labeled (non-noisy) examples is a very well understood problem with simple efficient algorithms like Perceptron being effective both in the classic passive learning setting [KV94, Vap98] and in the more modern active learning framework [Das11]. However, for noisy settings, except for the special case of uniform random noise, very few positive algorithmic results exist even for passive learning. In the context of theoretical computer science more broadly, problems of noisy learning are related to seminal results in approximation-hardness [ABSS93, GR06] and cryptographic assumptions [BFKL94, Reg05], are connected to other classic questions in learning theory (e.g., learning DNF formulas [KSS94]), and appear as barriers in differential privacy [GHRU11].

In this paper we present new techniques for designing efficient algorithms for learning linear separators in the presence of malicious noise and adversarial label noise. These models were originally proposed for a setting in which the algorithm must work for an arbitrary, unknown distribution. As we will see, bounds on the amount of noise tolerated for this distribution-free setting were weak, and no significant progress was made for many years. This motivated research investigating the role of the distribution generating the data on the tolerable level of noise: a breakthrough result of [KKMS05] and subsequent work of [KLS09] showed that indeed better bounds can be obtained for the uniform and isotropic log-concave distributions. In this paper, we continue this line of research. For the malicious noise case, where the adversary can corrupt both the label part and the feature part of the observation (and has unbounded computational power and access to the entire history of the learning algorithm’s computation), we design an efficient algorithm that can tolerate a near-optimal amount of malicious noise (within a constant factor of the statistical limit) for the uniform distribution, and also improve over the previously known results for log-concave distributions. In particular, unlike previous works, our noise tolerance limit has no dependence on the dimension of the space. We also show similar improvements for adversarial label noise, and furthermore show that our algorithms can naturally exploit the power of active learning. Active learning is a widely studied modern learning paradigm, where the learning algorithm only receives the class labels of examples when it asks for them. We show that in this model, our algorithms achieve a label complexity whose dependence on the error parameter is exponentially better than that of any passive algorithm. This provides the first polynomial-time active learning algorithm for learning linear separators in the presence of adversarial label noise, solving an open problem posed in [BBL06, Mon06]. It also provides the first analysis showing the benefits of active learning over passive learning under the challenging malicious noise model.

Our work brings a new set of algorithmic and analysis techniques including localization (previously used for obtaining better sample complexity results) and soft outlier removal that we believe will have other applications in learning theory and optimization. Localization [BBM05, BBL05, Zha06, BBZ07, BLL09, Kol10, Han11, BL13] refers to the practice of progressively narrowing the focus of a learning algorithm to an increasingly restricted range of possibilities (which are known to be safe given the information up to a certain point in time), thereby improving the stability of estimates of the quality of these possibilities based on random data.

In the following we start by formally defining the learning models we consider. We then present the most relevant prior work, and then our main results and techniques.

Passive and Active Learning. Noise Models. In this work we consider the problem of learning linear separators in two learning paradigms: the classic passive learning setting and the more modern active learning scenario. As is typical [KV94, Vap98], we assume that there exists a distribution over ℝ^d and a fixed unknown target function . In the noise-free case, in the passive supervised learning model the algorithm is given access to a distribution oracle from which it can get training samples where . The goal of the algorithm is to output a hypothesis such that . In the active learning model [CAL94, Das11] the learning algorithm is given as input a pool of unlabeled examples drawn from the distribution oracle. The algorithm can then query for the labels of examples of its choice from the pool. The goal is to produce a hypothesis of low error while also optimizing for the number of label queries (also known as label complexity). The hope is that in the active learning setting we can output a classifier of small error by using many fewer label requests than in the passive learning setting by actively directing the queries to informative examples (while keeping the number of unlabeled examples polynomial).

In this work we focus on two noise models. The first one is the malicious noise model of [Val85, KL88] where samples are generated as follows: with probability 1 − η a random pair is output where and ; with probability η the adversary can output an arbitrary pair . We will call η the noise rate. Each of the adversary’s examples can depend on the state of the learning algorithm and also the previous draws of the adversary. We will denote the malicious oracle as . The goal remains, however, to output a hypothesis such that .
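
The sampling process of the malicious noise model can be sketched in a few lines. This is a minimal illustration, not the paper's oracle: it assumes the clean distribution is uniform over the unit ball, the target is a homogeneous halfspace with normal `w_star`, and the adversary is a plain callable (a real adversary may also react to the learner's state). The names `malicious_oracle`, `sample_unit_ball`, and the `is_clean` flag are hypothetical, and the flag exists only for inspection; a learner would never see it.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sign(t):
    """Label convention assumed here: +1 for t >= 0, otherwise -1."""
    return 1 if t >= 0 else -1


def sample_unit_ball(d, rng):
    """Draw a point uniformly from the unit ball in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(g, g))
    radius = rng.random() ** (1.0 / d)  # radial law of the uniform ball
    return [radius * v / norm for v in g]


def malicious_oracle(w_star, eta, adversary, rng):
    """Return (x, y, is_clean) for one draw.

    With probability eta the adversary supplies an arbitrary pair;
    otherwise x is drawn from the clean distribution and labeled by w_star.
    """
    if rng.random() < eta:
        x, y = adversary()
        return x, y, False
    x = sample_unit_ball(len(w_star), rng)
    return x, sign(dot(w_star, x)), True
```

In the active variant described next, the same draw would hand the learner only `x`, with `y` released through a separate label query.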

In this paper, we consider an extension of the malicious noise model to the active learning model as follows. There are two oracles, an example generation oracle and a label revealing oracle. The example generation oracle works as usual in the malicious noise model: with probability 1 − η a random pair is generated where and ; with probability η the adversary can output an arbitrary pair . In the active learning setting, unlike the standard malicious noise model, when an example is generated, the algorithm only receives , and must make a separate call to the label revealing oracle to get . The goal of the algorithm is still to output a hypothesis such that .

In the adversarial label noise model, before any examples are generated, the adversary may choose a joint distribution over whose marginal distribution over is and such that . In the active learning version of this model, once again we will have two oracles, an example generation oracle and a label revealing oracle. We note that the results from our theorems in this model translate immediately into similar guarantees for the agnostic model of [KSS94], used commonly both in passive and active learning (e.g., [KKMS05, BBL06, Han07]); see Appendix G for details.

We will be interested in algorithms that run in time and use samples. In addition, for the active learning scenario we want our algorithms to also optimize for the number of label requests. In particular, we want the number of labeled examples to depend only polylogarithmically on 1/ϵ. The goal then is to quantify, for a given value of ϵ, the tolerable noise rate η which would allow us to design an efficient (passive or active) learning algorithm.

Previous Work. In the context of passive learning, Kearns and Li’s analysis [KL88] implies that halfspaces can be efficiently learned with respect to arbitrary distributions in polynomial time while tolerating a malicious noise rate of . Kearns and Li [KL88] also showed that malicious noise at a rate greater than cannot be tolerated (and a slight variant of their construction shows that this remains true even when the distribution is uniform over the unit sphere). The bound for the distribution-free case was not improved for many years. Kalai et al. [KKMS05] showed that, when the distribution is uniform, the -time averaging algorithm tolerates malicious noise at a rate . (These results from [KKMS05] are most closely related to our work. We describe some of their other results, more prominently featured in their paper, later.) They also described an improvement to based on the observation that uniform examples will tend to be well-separated, so that pairs of examples that are too close to one another can be removed, and this limits an adversary’s ability to coordinate the effects of its noisy examples. [KLS09] analyzed another approach to limiting the coordination of the noisy examples: they proposed an outlier removal procedure that used PCA to find any direction onto which projecting the training data led to suspiciously high variance, and removed examples with the most extreme values after projecting onto any such direction. Their algorithm tolerates malicious noise at a rate under the uniform distribution.

Motivated by the fact that many modern machine learning applications have massive amounts of unannotated or unlabeled data, there has been significant interest in designing active learning algorithms that most efficiently utilize the available data, while minimizing the need for human intervention. Over the past decade there has been substantial progress on understanding the underlying statistical principles of active learning, and several general characterizations have been developed for describing when active learning could have an advantage over the classic passive supervised learning paradigm both in the noise-free settings and in the agnostic case [FSST97, Das05, BBL06, BBZ07, Han07, DHM07, CN07, BHW08, Kol10, BHLZ10, Wan11, Das11, RR11, BH12]. However, despite many efforts, except for very simple noise models (random classification noise [BF13] and linear noise [DGS12]), to date there are no known computationally efficient algorithms with provable guarantees in the presence of noise. In particular, there are no computationally efficient algorithms for the agnostic case, and furthermore no result exists showing the benefits of active learning over passive learning in the malicious noise model, where the feature part of the examples can be corrupted as well. We discuss additional related work in Appendix A.

### 1.1 Our Results

The following are our main results.

###### Theorem 1.1.

There is a polynomial-time algorithm for learning linear separators with respect to the uniform distribution over the unit ball in ℝ^d in the presence of malicious noise such that an upper bound on suffices to imply that for any , the output of satisfies with probability at least .

###### Theorem 1.2.

There is a polynomial-time algorithm for learning linear separators with respect to the uniform distribution over the unit ball in ℝ^d in the presence of adversarial label noise such that an upper bound on suffices to imply that for any , the output of satisfies with probability at least .

As a restatement of the above theorem, in the agnostic setting considered in [KKMS05], we can output a halfspace of error at most in time . Kalai et al. achieved error by learning a low-degree polynomial in time whose dependence on the inverse accuracy is super-exponential. On the other hand, this result of [KKMS05] applies even when the target halfspace does not necessarily pass through the origin.

###### Theorem 1.3.

There is a polynomial-time algorithm for learning linear separators with respect to any isotropic log-concave distribution in ℝ^d in the presence of malicious noise such that an upper bound on suffices to imply that for any , the output of satisfies with probability at least .

###### Theorem 1.4.

There is a polynomial-time algorithm for learning linear separators with respect to any isotropic log-concave distribution in ℝ^d in the presence of adversarial label noise such that an upper bound on suffices to imply that for any , the output of satisfies with probability at least .

Our algorithms naturally exploit the power of active learning. (Indeed, as we will see, an active learning algorithm proposed in [BBZ07] provided the springboard for our work.) We show that in this model, the label complexity of both algorithms depends only polylogarithmically on 1/ϵ, where ϵ is the desired error rate, while still using only a polynomial number of unlabeled samples. (For the uniform distribution, the dependence of the number of labels on is ). Our efficient algorithm that tolerates adversarial label noise solves an open problem posed in [BBL06, Mon06]. Furthermore, our paper provides the first active learning algorithm for learning linear separators in the presence of a non-trivial amount of adversarial noise that can affect not only the label part, but also the feature part.

Our work exploits the power of localization for designing noise-tolerant polynomial-time algorithms. Such localization techniques have been used for analyzing sample complexity for passive learning (see [BBM05, BBL05, Zha06, BLL09, BL13]) or for designing active learning algorithms (see [BBZ07, Kol10, Han11, BL13]). Ideas useful for making such a localization strategy computationally efficient, and tolerating malicious noise, are described in Section 1.2.

We note that all our algorithms are proper in that they return a linear separator. (Linear models can be evaluated efficiently, and are otherwise easy to work with.) We summarize our results, and the most closely related previous work, in Tables 1 and 2.

### 1.2 Techniques

Hinge Loss Minimization. As minimizing the 0-1 loss in the presence of noise is NP-hard [JP78, GJ90], a natural approach is to minimize a surrogate convex loss that acts as a proxy for the 0-1 loss. A common choice in machine learning is to use the hinge loss: In this paper, we use the slightly more general and, for a set of examples, we let Here is a parameter that changes during training. It can be shown that minimizing hinge loss with an appropriate normalization factor can tolerate a noise rate of under the uniform distribution over the unit ball in . This is also the limit for such a strategy since a more powerful malicious adversary with can concentrate all the noise directly opposite to the target vector and make sure that the hinge-loss is no longer a faithful proxy for the 0-1 loss.
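
To make the surrogate loss concrete, here is a minimal sketch. It assumes the standard hinge loss max(0, 1 − y(w · x)) and a rescaled form that divides the margin by a width parameter, called `tau` here; the function names, the parameter name, and the optional per-example `weights` argument are illustrative choices, not notation confirmed by the text above.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def hinge_loss(w, x, y, tau=1.0):
    """Rescaled hinge loss max(0, 1 - y * (w . x) / tau).

    tau = 1 gives the usual hinge loss; a smaller tau penalizes
    small margins more sharply (assumed form of the rescaling).
    """
    return max(0.0, 1.0 - y * dot(w, x) / tau)


def average_hinge_loss(w, examples, tau=1.0, weights=None):
    """Average (optionally weighted) hinge loss over (x, y) pairs.

    The sum of weight * loss is divided by the number of examples,
    so down-weighting an example shrinks its contribution.
    """
    if weights is None:
        weights = [1.0] * len(examples)
    total = sum(q * hinge_loss(w, x, y, tau)
                for q, (x, y) in zip(weights, examples))
    return total / len(examples)
```

For example, a point at margin 0.5 has loss 0.5 at `tau=1.0` and loss 0 at `tau=0.25`, while the same point with a flipped label costs 3 at `tau=0.25`; this is the sense in which a mislabeled or adversarial point can dominate the objective once the margin scale shrinks.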

Localization in the instance and concept space. Our first key insight is that by using an iterative localization technique, we can limit the harm caused by an adversary at each stage and hence can still do hinge-loss minimization despite significantly more noise. In particular, the iterative style algorithm we propose proceeds in stages and at stage , we have a hypothesis vector of a certain error rate. The goal in stage is to produce a new vector of error rate half of . In order to halve the error rate, we focus on a band of size around the boundary of the linear classifier whose normal vector is , i.e. . For the rest of the paper, we will repeatedly refer to this key region of borderline examples as “the band”. The key observation made in [BBZ07] is that outside the band, all the classifiers still under consideration (namely those hypotheses within radius of the previous weight vector ) will have very small error. Furthermore, the probability mass of this band under the original distribution is small enough, so that in order to make the desired progress we only need to find a hypothesis of constant error rate over the data distribution conditioned on being within margin of . This idea was used in [BBZ07] to obtain active learning algorithms with improved label complexity ignoring computational complexity considerations. (We note that the localization considered by [BBZ07] is a more aggressive one than those considered in the disagreement-based active learning literature [BBL06, Han07, Kol10, Han11, Wan11] and earlier in passive learning [BBM05, BBL05, Zha06].)
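
One round of this localization can be sketched as follows. This is an illustrative sketch under stated assumptions, not the algorithm of Figure 1: the band is taken to be the set of points with |w_prev · x| ≤ b, the search is restricted to a Euclidean ball of radius `r` around `w_prev`, and the minimization uses plain projected subgradient descent on the rescaled hinge loss. The names `in_band`, `project_to_ball`, `localized_step`, and the step-size settings are all hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_band(w, x, b):
    """True when x lies within margin b of the hyperplane with normal w."""
    return abs(dot(w, x)) <= b


def project_to_ball(w, center, r):
    """Euclidean projection of w onto the ball of radius r around center."""
    diff = [a - c for a, c in zip(w, center)]
    dist = math.sqrt(dot(diff, diff))
    if dist <= r:
        return list(w)
    return [c + r * v / dist for c, v in zip(center, diff)]


def localized_step(w_prev, examples, b, r, tau, steps=200, lr=0.05):
    """One round: hinge-loss descent on band examples, staying near w_prev."""
    band = [(x, y) for x, y in examples if in_band(w_prev, x, b)]
    w = list(w_prev)
    if not band:
        return w
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in band:
            if 1.0 - y * dot(w, x) / tau > 0.0:  # positive hinge loss
                for j, xj in enumerate(x):
                    grad[j] -= y * xj / (tau * len(band))
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
        w = project_to_ball(w, w_prev, r)  # keep the hypothesis local
    return w
```

Only the examples inside the band influence the update, and the projection keeps every iterate within distance `r` of the previous hypothesis, which mirrors the two senses of localization (instance space and concept space) described above.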

In this work, we build on this idea to produce polynomial time algorithms with improved noise tolerance. To obtain our results, we exploit several new ideas: (1) the performance of the rescaled hinge loss minimization in smaller and smaller bands, (2) an analysis of properties of the distribution obtained after conditioning on the band that enables us to more sensitively identify cases in which the adversary concentrates the effects of noisy examples, (3) another type of localization: a novel soft outlier removal procedure.

We first show that if we minimize a variant of the hinge loss that is rescaled depending on the width of the band, it remains a faithful enough proxy for the 0-1 error even when there is significantly more noise. As a first step towards this goal, consider the setting where we pick proportionally to , the size of the band, and is proportional to the error rate of , and then minimize a normalized hinge loss function over vectors . We first show that has small hinge loss within the band. Furthermore, within the band the adversarial examples cannot hurt the hinge loss of by a lot. To see this notice that if the malicious noise rate is , within the effective noise rate is . Also the maximum value of the hinge loss for vectors is . Hence the maximum amount by which the adversary can affect the hinge loss is . Using this approach we get a noise tolerance of .

In order to get better tolerance in the adversarial, or agnostic, setting, we note that examples for which is large for close to are the most harmful, and, by analyzing the variance of for such directions , we can more effectively limit the amount by which an adversary can “hurt” the hinge loss. This then leads to an improved noise tolerance of .

For the case of malicious noise, in addition we need to deal with the presence of outliers, i.e. points not generated from the uniform distribution. We do this by introducing a soft localized outlier removal procedure at each stage (described next). This procedure assigns a weight to each data point indicating the algorithm’s confidence that the point is not “noisy”. We then minimize the weighted hinge loss. Combining this with the variance analysis mentioned above leads to a noise tolerance of in the malicious case.

Soft Localized Outlier Removal. Outlier removal techniques have been studied before in the context of learning problems [BFKV97, KLS09]. In [KLS09], the goal of outlier removal was to limit the ability of the adversary to coordinate the effects of noisy examples – excessive such coordination was detected and removed. Our outlier removal procedure (see Figure 2) is similar in spirit to that of [KLS09] with two key differences. First, as in [KLS09], we will use the variance of the examples in a particular direction to measure their coordination. However, due to the fact that in round , we are minimizing the hinge loss only with respect to vectors that are close to , we only need to limit the variance in these directions. As training proceeds, the band is increasingly shaped like a pancake, with pointing in its flattest direction. Hypotheses that are close to also point in flat directions; the variance in those directions is which is much smaller than the found in a generic direction. This allows us to limit the harm of the adversary to a greater extent than was possible in the analysis of [KLS09]. The second difference is that, unlike previous outlier removal techniques, rather than making discrete remove-or-not decisions, we instead weigh the examples and then minimize the weighted hinge loss. Each weight indicates the algorithm’s confidence that an example is not noisy. We show that these weights can be computed by solving a linear program with infinitely many constraints. We then show how to design an efficient separation oracle for the linear program using recent general-purpose techniques from the optimization community [SZ03, BM13].
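
The role of a separation oracle can be illustrated with a deliberately simplified check. The sketch below assumes each constraint caps the weighted second moment (1/n) Σ q_i (w · x_i)² of the projections onto a direction w, and it searches only a finite, user-supplied list of candidate directions; the actual program ranges over infinitely many directions near the current hypothesis and requires a genuine optimization step. The names `weighted_second_moment`, `find_violated_direction`, and the threshold `cap` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weighted_second_moment(points, weights, w):
    """(1/n) * sum_i q_i * (w . x_i)^2 for a direction w."""
    n = len(points)
    return sum(q * dot(w, x) ** 2 for q, x in zip(weights, points)) / n


def find_violated_direction(points, weights, directions, cap):
    """Return the candidate direction whose weighted second moment most
    exceeds cap, or None when every candidate satisfies the cap.

    A finite stand-in for a separation oracle: a returned direction is a
    violated constraint that the weights must be adjusted to satisfy.
    """
    worst, worst_val = None, cap
    for w in directions:
        val = weighted_second_moment(points, weights, w)
        if val > worst_val:
            worst, worst_val = w, val
    return worst
```

With four points stacked at (1, 0) and four at (0, 0.1), unit weights give a second moment of 0.5 along the first axis, so a cap of 0.25 is violated there; lowering the weights of the stacked points to 0.25 brings the moment down to 0.125 and the check passes. This is the "soft" behavior described above: coordinated points are down-weighted rather than deleted outright.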

In Section 4 we show that our results hold for a more general class of distributions which we call admissible distributions. From Section 4 it also follows that our results can be extended to -nearly log-concave distributions (for small enough ). Such distributions, for instance, can capture mixtures of log-concave distributions [BL13].

## 2 Preliminaries

Recall that and Similarly, the expected hinge loss w.r.t.  is defined as . Our analysis will also consider the distribution obtained by conditioning on membership in the band, i.e. the set .

We present our algorithms in the active learning model. Since we will prove that our active algorithm only uses a polynomial number of unlabeled samples, this will imply a guarantee for the passive learning setting. At a high level, our algorithms are iterative learning algorithms that operate in rounds. In each round we focus on points that fall near the decision boundary of the current hypothesis and use them in order to obtain a new vector of lower error. In the malicious noise case, in round we first do a soft outlier removal and then minimize hinge loss normalized appropriately by . A formal description appears in Figure 1, and a formal description of the outlier removal procedure appears in Figure 2. We will present specific choices of the parameters of the algorithms in the following sections.

The description of the algorithm and its analysis is simplified if we assume that it starts with a preliminary weight vector whose angle with the target is acute, i.e. that satisfies . We show in Appendix B that this is without loss of generality for the types of problems we consider.

## 3 Learning with respect to uniform distribution with malicious noise

Let denote the unit ball in ℝ^d. In this section we focus on the case where the distribution is the uniform distribution over and present our results for malicious noise. Theorem 1.1 is a corollary of Theorem 3.1, which follows. A detailed proof of Theorem 3.1 is in Appendix C, and Theorem 1.1 also is a corollary of Theorem 4.2, proved in Appendix D. We sketch the proof here.

###### Theorem 3.1.

Let be the (unit length) target weight vector. There are absolute positive constants and a polynomial such that an upper bound on suffices to imply that for any , using the algorithm from Figure 1 with cut-off values , radii , , for , , , a number of unlabeled examples in round and a number of labeled examples in round , after iterations, we find satisfying with probability .

### 3.1 Proof Sketch of Theorem 3.1

We may assume without loss of generality that all examples, including noisy examples, fall in . This is because any example that falls outside can be easily identified by the algorithm as noisy and removed, effectively lowering the noise rate.

Using techniques from [BBZ07], we may reduce our problem to a subproblem concerning learning with respect to a distribution obtained by conditioning on membership in the band. In particular, in Appendix C.3, we adapt the argument of [BBZ07] to show that, for a sufficiently small absolute constant , in order to prove Theorem 3.1, all we need is Theorem 3.2 stated below, together with the required bounds on computational, sample and label complexity.

###### Theorem 3.2.

After round of the algorithm in Figure 1, with probability at least , we have .

The proof of Theorem 3.2 follows from a series of steps summarized in the lemmas below. First, we bound the expected hinge loss of the target within the band . Since we are analyzing a particular round , to reduce clutter in the formulas, for the rest of this section, let us refer to simply as and as .

###### Lemma 3.3.

.

Proof. Notice that is never negative, so, on any clean example , we have and, furthermore, will pay a non-zero hinge only inside the region where . Hence, Using standard tail bounds (see Eq. 1 in Appendix C), we can lower bound the denominator for a constant . Also the numerator is at most for another constant . Hence, we have if we choose small enough. ∎

During round we can decompose the working set into the set of “clean” examples which are drawn from and the set of “dirty” or malicious examples which are output by the adversary. We will ultimately relate the hinge loss of vectors over the weighted set to the hinge loss over clean examples . In order to do this we will need the following guarantee from the outlier removal subroutine of Figure 2 (which is applied with ).

###### Theorem 3.4.

There is a constant and a polynomial such that, if examples are drawn from the distribution (each replaced with an arbitrary unit-length vector with probability ), then by using the algorithm in Figure 2 with , we have that with probability , the output satisfies the following: (a) , and (b) for all unit length such that , Furthermore, the algorithm can be implemented in polynomial time.

The key points in proving this theorem are the following. We will show that the vector which assigns a weight to examples in and weight to examples in is a feasible solution to the linear program in Figure 2. In order to do this, we first show that the fraction of dirty examples in round is not too large, i.e., w.h.p., we have . Next, we show that, for all with distance of , that is at most . The proof of feasibility follows easily by combining the variance bound with standard VC tools. In the appendix we also show how to solve the linear program in polynomial time. The complete proof of Theorem 3.4 is in Appendix C.

As explained in the introduction, the soft outlier removal procedure enables us to get a more refined bound on the extent to which the value minimized by the algorithm is a faithful proxy for the value that it would minimize in the absence of noise. This is formalized in the following lemma. (Here and are defined with respect to the unrevealed labels that the adversary has committed to.)

###### Lemma 3.5.

There are absolute constants , and such that, for large enough , with probability , if we define , then for any , we have and

A detailed proof of Lemma 3.5 is given in Appendix C. Here we give a few ideas. The loss on a particular example can be upper bounded by . One source of difference between , the loss on the clean examples, and , the loss minimized by the algorithm, is the loss on the (total fractional) dirty examples that were not deleted by the soft outlier removal. By using the Cauchy-Schwarz inequality, the (weighted) sum of over those surviving noisy examples can be bounded in terms of the variance in the direction , and the (total fractional) number of surviving dirty examples. Our soft outlier detection allows us to bound the variance of the surviving noisy examples in terms of . Another way that can be different from is the effect of deleting clean examples. We can similarly use the variance on the clean examples to bound this in terms of .

Given Lemma 3.3, Theorem 3.4, and Lemma 3.5, the proof of Theorem 3.2 can be summarized as follows. Let be the probability that we want to bound. Applying VC theory, w.h.p., all sampling estimates of expected loss are accurate to within , so we may assume w.l.o.g. that this is the case. Since, for each error, the hinge loss is at least , we have . Applying Lemma 3.5 and VC theory, we get, . The fact that approximately minimizes the hinge loss, together with VC theory, gives . Once again applying Lemma 3.5 and VC theory yields . Since , we get . Now notice that is . Hence an bound on suffices to imply, w.h.p., that .

## 4 Learning with respect to admissible distributions with malicious noise

One of our main results (Theorem 1.3) concerns isotropic log-concave distributions. (A probability distribution is isotropic log-concave if its density can be written as exp(−ψ(x)) for a convex function ψ, its mean is 0, and its covariance matrix is the identity.)

In this section, we extend our analysis from the previous section and show that it works for isotropic log-concave distributions, and in fact an even more general class of distributions which we call admissible distributions. In particular this includes the class of isotropic log-concave distributions in ℝ^d and the uniform distributions over the unit ball in ℝ^d.

###### Definition 4.1.

A sequence of probability distributions over respectively is -admissible if it satisfies the following conditions. (1.) There are such that, for all , for drawn from and any unit length , (a) for all for which , we have and for all for which , . (2.) For any , there is a such that, for all , the following holds. Let and be two unit vectors in , and assume that . Then (3.) There is an absolute constant such that, for any , for any two unit vectors and in we have (4.) There is a constant such that, for all constant , for all , for any such that, , and , for any , we have (5.) There is a constant such that, for all , we have

For the case of admissible distributions we have the following theorem, which is proved in Appendix D.

###### Theorem 4.2.

Let a distribution over be chosen from a -admissible sequence of distributions. Let be the (unit length) target weight vector. There are settings of the parameters of the algorithm from Figure 1, such that an upper bound on the rate of malicious noise suffices to imply that for any , a number of unlabeled examples in round and a number of labeled examples in round , and such that , after iterations, finds satisfying with probability .

If the support of is bounded in a ball of radius , then, we have that label requests suffice.

The above theorem contains Theorem 1.3 as a special case, because any isotropic log-concave distribution is 2-admissible (see Appendix F.2 for a proof).

The intuition in the case of adversarial label noise is the same as for malicious noise, except that, because the adversary cannot change the marginal distribution over the instances, it is not necessary to perform outlier removal. Bounds for learning with adversarial label noise are not corollaries of bounds for learning with malicious noise, however, because, while the marginal distribution over the instances for all the examples, clean and noisy, is not affected by the adversary, the marginal distribution over the clean examples is changed (because the examples whose classifications are changed are removed from the distribution over clean examples).
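The distinction between the two marginals can be illustrated with a toy simulation. In this sketch, a hypothetical label-noise adversary (`flip_labels_above`, an invented name) flips the label of every instance whose first coordinate exceeds a threshold. The instances come from a standard Gaussian and the target vector is an arbitrary choice, both assumptions made only for illustration. The marginal over all instances is untouched, while the clean examples (those whose label was not flipped) have a visibly shifted first coordinate.

```python
import random

def sign(z):
    return 1 if z >= 0 else -1

def draw_examples(n, d, w_star, seed=0):
    # Instances from a standard Gaussian (an isotropic log-concave distribution),
    # labeled by the target halfspace sign(w_star . x).
    rng = random.Random(seed)
    xs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    ys = [sign(sum(a * b for a, b in zip(w_star, x))) for x in xs]
    return xs, ys

def flip_labels_above(xs, ys, threshold):
    # Hypothetical adversary: flips the label of every instance whose first
    # coordinate exceeds `threshold`. The instances themselves are not modified.
    return [-y if x[0] > threshold else y for x, y in zip(xs, ys)]

def mean_first_coordinate(xs):
    return sum(x[0] for x in xs) / len(xs)

if __name__ == "__main__":
    xs, ys = draw_examples(20000, 3, [0.0, 1.0, 0.0], seed=7)
    noisy = flip_labels_above(xs, ys, 1.5)
    clean = [x for x, y, z in zip(xs, ys, noisy) if y == z]
    print("noise rate:", 1 - len(clean) / len(xs))
    print("mean x_1 over all examples:", mean_first_coordinate(xs))
    print("mean x_1 over clean examples:", mean_first_coordinate(clean))
```

The mean over all examples stays near zero, whereas the mean over the clean examples is negative, because the examples removed from the clean set all lie on one side.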

Theorem 1.2 and Theorem 1.4, which concern adversarial label noise, can be proved by combining the analysis in Appendix E with the facts that (a rescaling of) the uniform distribution and isotropic log-concave (i.l.c.) distributions are 0-admissible and 2-admissible, respectively; both facts are proved in Appendix F.

## 6 Discussion

We note that the idea of localization in the concept space is traditionally used in statistical learning theory both in supervised and active learning for getting sharper rates [BBL05, BLL09, Kol10]. Furthermore, the idea of localization in the instance space has been used in margin-based analysis of active learning [BBZ07, BL13]. In this work we used localization in both senses in order to get polynomial-time algorithms with better noise tolerance. It would be interesting to further exploit this idea for other concept spaces.

## References

• [AB99] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
• [ABS10] P. Awasthi, A. Blum, and O. Sheffet. Improved guarantees for agnostic learning of disjunctions. COLT, 2010.
• [ABSS93] S. Arora, L. Babai, J. Stern, and Z. Sweedyk. The hardness of approximate optima in lattices, codes, and systems of linear equations. In Proceedings of the 1993 IEEE 34th Annual Foundations of Computer Science, 1993.
• [Bau90] E. B. Baum. The perceptron algorithm is fast for nonmalicious distributions. Neural Computation, 2:248–260, 1990.
• [BBL05] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
• [BBL06] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In ICML, 2006.
• [BBM05] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497–1537, 2005.
• [BBZ07] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In COLT, 2007.
• [BF13] M.-F. Balcan and V. Feldman. Statistical active learning algorithms. NIPS, 2013.
• [BFKL94] Avrim Blum, Merrick L. Furst, Michael J. Kearns, and Richard J. Lipton. Cryptographic primitives based on hard learning problems. In Proceedings of the 13th Annual International Cryptology Conference on Advances in Cryptology, 1994.
• [BFKV97] A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial time algorithm for learning noisy linear threshold functions. Algorithmica, 22(1/2):35–52, 1997.
• [BGMN05] F. Barthe, O. Guédon, S. Mendelson, and A. Naor. A probabilistic approach to the geometry of the $\ell_p^n$-ball. The Annals of Probability, 33(2):480–513, 2005.
• [BH12] M.-F. Balcan and S. Hanneke. Robust interactive learning. In COLT, 2012.
• [BHLZ10] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In NIPS, 2010.
• [BHW08] M.-F. Balcan, S. Hanneke, and J. Wortman. The true sample complexity of active learning. In COLT, 2008.
• [BL13] M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-concave distributions. In Conference on Learning Theory, 2013.
• [BLL09] N. H. Bshouty, Y. Li, and P. M. Long. Using the doubling dimension to analyze the generalization of learning algorithms. JCSS, 2009.
• [BM13] D. Bienstock and A. Michalka. Polynomial solvability of variants of the trust-region subproblem, 2013. Optimization Online.
• [BSS12] A. Birnbaum and S. Shalev-Shwartz. Learning halfspaces with the zero-one loss: Time-accuracy tradeoffs. NIPS, 2012.
• [Byl94] T. Bylander. Learning linear threshold functions in the presence of classification noise. In Conference on Computational Learning Theory, 1994.
• [CAL94] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2), 1994.
• [CGZ10] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Learning noisy linear classifiers via adaptive and selective sampling. Machine Learning, 2010.
• [CN07] R. Castro and R. Nowak. Minimax bounds for active learning. In COLT, 2007.
• [CST00] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2000.
• [Das05] S. Dasgupta. Coarse sample complexity bounds for active learning. In NIPS, volume 18, 2005.
• [Das11] S. Dasgupta. Active learning. Encyclopedia of Machine Learning, 2011.
• [DGS12] O. Dekel, C. Gentile, and K. Sridharan. Selective sampling and active learning from single and multiple teachers. JMLR, 2012.
• [DHM07] S. Dasgupta, D.J. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. NIPS, 20, 2007.
• [FGKP06] V. Feldman, P. Gopalan, S. Khot, and A. Ponnuswami. New results for learning noisy parities and halfspaces. In FOCS, pages 563–576, 2006.
• [FSST97] Y. Freund, H.S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, 1997.
• [GHRU11] Anupam Gupta, Moritz Hardt, Aaron Roth, and Jonathan Ullman. Privately releasing conjunctions and the statistical query barrier. In Proceedings of the 43rd annual ACM symposium on Theory of computing, 2011.
• [GJ90] Michael R. Garey and David S. Johnson. Computers and Intractability; A Guide to the Theory of NP-Completeness. 1990.
• [GR06] Venkatesan Guruswami and Prasad Raghavendra. Hardness of learning halfspaces with noise. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, 2006.
• [GR09] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. SIAM Journal on Computing, 39(2):742–765, 2009.
• [GSSS13] A. Gonen, S. Sabato, and S. Shalev-Shwartz. Efficient pool-based active learning of halfspaces. In ICML, 2013.
• [Han07] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.
• [Han11] S. Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.
• [JP78] D. S. Johnson and F. Preparata. The densest hemisphere problem. Theoretical Computer Science, 6(1):93 – 107, 1978.
• [KKMS05] Adam Tauman Kalai, Adam R. Klivans, Yishay Mansour, and Rocco A. Servedio. Agnostically learning halfspaces. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science, 2005.
• [KL88] Michael Kearns and Ming Li. Learning in the presence of malicious errors. In Proceedings of the twentieth annual ACM symposium on Theory of computing, 1988.
• [KLS09] A. R. Klivans, P. M. Long, and Rocco A. Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10, 2009.
• [Kol10] V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. Journal of Machine Learning Research, 11:2457–2485, 2010.
• [KSS94] Michael J. Kearns, Robert E. Schapire, and Linda M. Sellie. Toward efficient agnostic learning. Mach. Learn., 17(2-3), November 1994.
• [KV94] M. Kearns and U. Vazirani. An introduction to computational learning theory. MIT Press, Cambridge, MA, 1994.
• [LS06] P. M. Long and R. A. Servedio. Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions. NIPS, 2006.
• [LS11] P. M. Long and R. A. Servedio. Learning large-margin halfspaces with more malicious noise. NIPS, 2011.
• [LV07] L. Lovász and S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures and Algorithms, 30(3):307–358, 2007.
• [Mon06] Claire Monteleoni. Efficient algorithms for general active learning. In Proceedings of the 19th annual conference on Learning Theory, 2006.
• [Pol11] D. Pollard. Convergence of Stochastic Processes. Springer Series in Statistics. 2011.
• [Reg05] Oded Regev. On lattices, learning with errors, random linear codes, and cryptography. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, 2005.
• [RR11] M. Raginsky and A. Rakhlin. Lower bounds for passive and active learning. In NIPS, 2011.
• [Ser01] Rocco A. Servedio. Smooth boosting and learning with malicious noise. In 14th Annual Conference on Computational Learning Theory and 5th European Conference on Computational Learning Theory, 2001.
• [SZ03] J. Sturm and S. Zhang. On cones of nonnegative quadratic functions. Mathematics of Operations Research, 28:246–267, 2003.
• [Val85] L. G. Valiant. Learning disjunction of conjunctions. In Proceedings of the 9th International Joint Conference on Artificial intelligence, 1985.
• [Vap98] V. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
• [Vem10] S. Vempala. A random-sampling-based algorithm for learning intersections of halfspaces. JACM, 57(6), 2010.
• [Wan11] L. Wang. Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning. JMLR, 2011.
• [Zha06] T. Zhang. Information theoretical upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307–1321, 2006.

## Appendix A Additional Related Work

#### Passive Learning

Blum et al. [BFKV97] considered noise-tolerant learning of halfspaces under a more idealized noise model, known as the random noise model, in which the label of each example is flipped with a certain probability, independently of the feature vector. Some other, less closely related, work on efficient noise-tolerant learning of halfspaces includes [Byl94, BFKV97, FGKP06, GR09, Ser01, ABS10, LS11, BSS12].

#### Active Learning

As we have mentioned, most prior theoretical work on active learning focuses either on sample complexity bounds (without regard for efficiency) or on providing polynomial-time algorithms in the noiseless case or under simple noise models (random classification noise [BF13] or linear noise [CGZ10, DGS12]).

In [CGZ10, DGS12], online learning algorithms in the selective sampling framework are presented, where labels must be actively queried before they are revealed. Under the assumption that the label conditional distribution is a linear function determined by a fixed target vector, they provide bounds on the regret of the algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. As pointed out in [DGS12], these results can also be converted to a distributional PAC setting where instances are drawn i.i.d. In this setting, they obtain an exponential improvement in label complexity over passive learning. These interesting results and techniques are not directly comparable to ours. Our framework is not restricted to halfspaces. Another important difference is that (as pointed out in [GSSS13]) the exponential improvement they give is not possible in the noiseless version of their setting. In other words, the addition of linear noise defined by the target makes the problem easier for active sampling. By contrast, random classification noise (RCN) can only make the classification task harder than in the realizable case.

Recently, [BF13] gave the first polynomial-time algorithms for actively learning thresholds, balanced rectangles, and homogeneous linear separators under log-concave distributions in the presence of random classification noise. Active learning with respect to isotropic log-concave distributions in the absence of noise was studied in [BL13].

## Appendix B Initializing with vector w0

We will prove that we may assume without loss of generality that we receive a $w_0$ whose angle with the target $w^*$ is acute, given the assumption that the marginal distribution over the instances is -admissible (see Definition 4.1; we prove in Appendix F that the uniform distribution over a sphere is 0-admissible, and isotropic log-concave distributions are 2-admissible).

Suppose we have an algorithm as a subroutine that satisfies the guarantee of Theorem 4.2, given access to such a . Then we can arrive at an algorithm which works without it as follows. With probability , for a random , either or has an acute angle with . We may then run with both choices, and with set to for any admissible distribution, where is the constant in Definition 4.1. Then we can use hypothesis testing on examples, and, with high probability, find a hypothesis with error less than . Part 3 of Definition 4.1 then implies that may then set , and call again.
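The hypothesis-testing step can be sketched in a simplified, noise-free toy setting. The code below does not run the learning algorithm itself. It applies the test directly to a candidate vector and its negation and keeps whichever has the smaller empirical error on a labeled sample. The Gaussian instances, the sample size, and the helper names are assumptions made for illustration.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def empirical_error(w, xs, ys):
    # Fraction of labeled examples on which sign(w . x) disagrees with the label.
    return sum(1 for x, y in zip(xs, ys) if y * dot(w, x) <= 0) / len(xs)

def pick_by_hypothesis_test(candidates, xs, ys):
    # Return the candidate with the smallest empirical error on the labeled sample.
    return min(candidates, key=lambda w: empirical_error(w, xs, ys))

def labeled_sample(n, w_star, seed=0):
    # Hypothetical clean sample: Gaussian instances labeled by sign(w_star . x).
    rng = random.Random(seed)
    xs = [[rng.gauss(0, 1) for _ in w_star] for _ in range(n)]
    ys = [1 if dot(w_star, x) >= 0 else -1 for x in xs]
    return xs, ys

if __name__ == "__main__":
    w_star = [1.0, 0.0, 0.0]
    xs, ys = labeled_sample(2000, w_star, seed=5)
    u = [-1.0, 0.5, 0.0]
    neg_u = [-a for a in u]
    best = pick_by_hypothesis_test([u, neg_u], xs, ys)
    print(best, dot(best, w_star) > 0)
```

Because the two candidates disagree on every instance with nonzero projection, their empirical errors sum to essentially 1, so the one with error below 1/2 is the one making an acute angle with the target in this symmetric setting.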

## Appendix C Proof of Theorem 3.1

We start by stating some useful properties of the uniform distribution $D$.

### C.1 Properties of D

1. [Bau90, BBZ07, KKMS05] For any , there is a such that, for drawn from the uniform distribution over and any unit length , for all for which , we have

   $$c_2 |b-a| \sqrt{d} \le \Pr(u \cdot x \in [a,b]) \le |b-a| \sqrt{d}. \tag{1}$$
2. The following corollary of Theorem 4 of [BL13] is proved in detail in Lemma F.2 of this paper: For any , there is a such that, for all , the following holds. Let and be two unit vectors in , and assume that . Then

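   $$\Pr_{x \sim D}\left(\operatorname{sign}(u \cdot x) \ne \operatorname{sign}(v \cdot x) \text{ and } |v \cdot x| \ge \frac{c_4 \alpha}{\sqrt{d}}\right) \le c_3 \alpha. \tag{2}$$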
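Both properties can be probed by Monte Carlo simulation. The sketch below assumes the uniform distribution over the unit sphere, and uses an arbitrary dimension, angle, and margin constant; these choices and the helper names are illustrative assumptions, not the constants of the paper. It estimates the probability that a projection falls in a narrow interval, to be compared with the upper bound |b − a|√d, and the probability that two unit vectors at angle α disagree, with and without the requirement that the point lie outside a band of width proportional to α/√d.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sample_sphere(n, d, seed=0):
    # Uniform points on the unit sphere: normalize standard Gaussian vectors.
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        g = [rng.gauss(0, 1) for _ in range(d)]
        norm = math.sqrt(dot(g, g))
        points.append([a / norm for a in g])
    return points

def band_probability(points, u, a, b):
    # Empirical estimate of Pr(u . x in [a, b]).
    return sum(1 for x in points if a <= dot(u, x) <= b) / len(points)

def disagreement(points, u, v, margin=0.0):
    # Empirical estimate of Pr(sign(u.x) != sign(v.x) and |v.x| >= margin).
    return sum(1 for x in points
               if dot(u, x) * dot(v, x) < 0 and abs(dot(v, x)) >= margin) / len(points)

if __name__ == "__main__":
    d, alpha = 10, 0.5
    pts = sample_sphere(20000, d, seed=11)
    u = [1.0] + [0.0] * (d - 1)
    v = [math.cos(alpha), math.sin(alpha)] + [0.0] * (d - 2)
    print("band estimate:", band_probability(pts, u, -0.1, 0.1),
          "upper bound:", 0.2 * math.sqrt(d))
    print("disagreement:", disagreement(pts, u, v), "alpha/pi:", alpha / math.pi)
    # The margin constant 2.0 is an arbitrary illustrative choice.
    print("outside band:", disagreement(pts, u, v, margin=2.0 * alpha / math.sqrt(d)))
```

Most of the disagreement mass sits close to the decision boundary of v, which is why restricting to points outside the band lowers the estimate.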

### C.2 Parameter choices

Next, for easy reference throughout the proof, we collect specifications of how parameters of the algorithm of Figure 1 are set. Let be the value of that ensures that Equation 2 holds when , and let . Let . Let be the value of that ensures that Equation 1 holds when , and let , , . Finally, let .

### C.3 Margin based analysis

The proof of Theorem 3.1 follows the high level structure of the proof of [BBZ07]; the new element is the application of Theorem C.4 (called Theorem 3.2 in the proof sketch in Section 3) which analyzes the performance of the hinge loss minimization algorithm for learning inside the band, which in turn applies Theorem C.1, which analyzes the benefits of our new localized outlier removal procedure.

Proof (of Theorem 3.1): We will prove by induction on that after iterations, we have with probability .

When , all that is required is .

Assume now the claim is true for (). Then by induction hypothesis, we know that with probability at least , has error at most . This implies .

Let us define and . Since has unit length, and , we have which in turn implies .

Applying Equation 2 to bound the error rate outside the band, we have both:

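$$\Pr_x\left[(w_{k-1} \cdot x)(w_k \cdot x) < 0,\ x \in \bar{S}_{w_{k-1}, b_{k-1}}\right] \le 2^{-(k+3)} \quad \text{and}$$

$$\Pr_x\left[(w_{k-1} \cdot x)(w^* \cdot x) < 0,\ x \in \bar{S}_{w_{k-1}, b_{k-1}}\right] \le 2^{-(k+3)}.$$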

Taking the sum, we obtain Therefore, we have

$$\operatorname{err}(w_k) \le \left(\operatorname{err}_{D_{w_{k-1}, b_{k-1}}}(w_k)\right) \Pr(S_{w_{k-1}, b_{k-1}}) + 2^{-(k+2)}.$$

Equation 1 gives , which implies

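$$\operatorname{err}(w_k) \le \left(\operatorname{err}_{D_{w_{k-1}, b_{k-1}}}(w_k)\right) 2 b_{k-1} \sqrt{d} + 2^{-(k+2)} \le 2^{-(k+1)} \left( \left(\operatorname{err}_{D_{w_{k-1}, b_{k-1}}}(w_k)\right) 4 \tilde{c}_4 + 1/2 \right).$$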

Recall that is the distribution obtained by conditioning on the event that . In Theorem C.4 below (which is called Theorem 3.2 in the body of the paper) we will show that, with probability , has error at most within , implying that , completing the proof of the induction, and therefore showing, with probability at least , iterations suffice to achieve .

A polynomial number of unlabeled samples are required by the algorithm and the number of labeled examples required by the algorithm is . ∎
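The induction step rests on splitting the error of the new hypothesis into a part inside the band around the previous hypothesis and a part outside it. The sketch below checks this decomposition empirically on points drawn uniformly from the unit sphere. The particular vectors, band width, and helper names are assumptions chosen only for illustration.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sample_sphere(n, d, seed=0):
    # Uniform points on the unit sphere: normalize standard Gaussian vectors.
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        g = [rng.gauss(0, 1) for _ in range(d)]
        norm = math.sqrt(dot(g, g))
        points.append([a / norm for a in g])
    return points

def decompose_error(points, w_prev, w, w_star, b):
    # Split the mistakes of w (relative to w_star) according to whether x
    # falls inside the band |w_prev . x| <= b or outside it.
    n = len(points)
    def mistake(x):
        return dot(w, x) * dot(w_star, x) < 0
    in_band = [x for x in points if abs(dot(w_prev, x)) <= b]
    p_band = len(in_band) / n
    err_in_band = (sum(1 for x in in_band if mistake(x)) / len(in_band)
                   if in_band else 0.0)
    err_outside = sum(1 for x in points
                      if abs(dot(w_prev, x)) > b and mistake(x)) / n
    err_total = sum(1 for x in points if mistake(x)) / n
    return err_total, err_in_band, p_band, err_outside

if __name__ == "__main__":
    d, b = 8, 0.1
    pts = sample_sphere(10000, d, seed=2)
    w_star = [1.0] + [0.0] * (d - 1)
    w_prev = [math.cos(0.3), math.sin(0.3)] + [0.0] * (d - 2)
    w = [math.cos(0.15), math.sin(0.15)] + [0.0] * (d - 2)
    total, inside, p_band, outside = decompose_error(pts, w_prev, w, w_star, b)
    print(total, inside * p_band + outside, p_band, 2 * b * math.sqrt(d))
```

On any sample, the total error equals the conditional error inside the band times the band's probability, plus the error mass outside the band. The proof then bounds the band probability by 2b√d and the outside mass by the disagreement bound.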

### C.4 Analysis of the outlier removal subroutine

Before taking on the subproblem of analyzing the error within the band, we need to prove the following theorem (which is the same as Theorem 3.4 in the main body) about the outlier removal subroutine of Figure 2.

###### Theorem C.1.

There is a polynomial such that, if