Revisiting Perceptron: Efficient and Label-Optimal Learning of Halfspaces

02/18/2017 ∙ by Songbai Yan, et al. ∙ Microsoft ∙ University of California, San Diego

It has been a long-standing problem to efficiently learn a halfspace using as few labels as possible in the presence of noise. In this work, we propose an efficient Perceptron-based algorithm for actively learning homogeneous halfspaces under the uniform distribution over the unit sphere. Under the bounded noise condition [49], where each label is flipped with probability at most η < 1/2, our algorithm achieves a near-optimal label complexity of Õ(d/(1−2η)^2 · log(1/ϵ)) in time Õ(d^2/(ϵ(1−2η)^3)). Under the adversarial noise condition [6, 45, 42], where at most an Ω̃(ϵ) fraction of labels can be flipped, our algorithm achieves a near-optimal label complexity of Õ(d · log(1/ϵ)) in time Õ(d^2/ϵ). Furthermore, we show that our active learning algorithm can be converted to an efficient passive learning algorithm that has near-optimal sample complexities with respect to ϵ and d.


1 Introduction

We study the problem of designing efficient noise-tolerant algorithms for actively learning homogeneous halfspaces in the streaming setting. We are given access to a data distribution from which we can draw unlabeled examples, and a noisy labeling oracle that we can query for labels. The goal is to find a computationally efficient algorithm to learn a halfspace that best classifies the data while making as few queries to the labeling oracle as possible.

Active learning arises naturally in many machine learning applications where unlabeled examples are abundant and cheap, but labeling requires human effort and is expensive. For those applications, one natural question is whether we can learn an accurate classifier using as few labels as possible. Active learning addresses this question by allowing the learning algorithm to sequentially select examples to query for labels, and avoid requesting labels which are less informative, or can be inferred from previously-observed examples.

There has been a large body of work on the theory of active learning, showing sharp distribution-dependent label complexity bounds [21, 11, 34, 27, 35, 46, 60, 41]. However, most of these general active learning algorithms rely on solving empirical risk minimization problems, which are computationally hard in the presence of noise [5].

On the other hand, existing computationally efficient algorithms for learning halfspaces [17, 29, 42, 45, 6, 23, 7, 8] are not optimal in terms of label requirements. These algorithms have different degrees of noise tolerance (e.g. adversarial noise [6], malicious noise [43], random classification noise [3], bounded noise [49], etc.), and run in time polynomial in d and 1/ϵ. Some of them naturally exploit the utility of active learning [6, 7, 8], but they do not achieve label complexity bounds as sharp as those of the computationally inefficient active learning algorithms [10, 9, 60].

Therefore, a natural question is: is there an active learning algorithm for halfspaces that is computationally efficient and has near-minimal label requirements? This has been posed as an open problem in [50]. In the realizable setting, [26, 10, 9, 56] give efficient algorithms that have an optimal label complexity of Õ(d log(1/ϵ)) under some distributional assumptions. However, the challenge remains open in the nonrealizable setting. It has been shown that learning halfspaces with agnostic noise is computationally hard even when the unlabeled distribution is Gaussian [44]. Nonetheless, we give an affirmative answer to this question under two moderate noise settings: bounded noise and adversarial noise.

1.1 Our Results

We propose a Perceptron-based algorithm (Algorithm 1) for actively learning homogeneous halfspaces under the uniform distribution over the unit sphere. It works under two noise settings: bounded noise and adversarial noise. Our work answers an open question posed by [26] on whether Perceptron-based active learning algorithms can be modified to tolerate label noise.

In the η-bounded noise setting (also known as the Massart noise model [49]), the label of an example x is generated as sign(w^* · x) for some underlying halfspace w^*, and flipped with probability at most η < 1/2. Our algorithm runs in time Õ(d^2/(ϵ(1−2η)^3)), and requires Õ(d/(1−2η)^2 · log(1/ϵ)) labels. We show that this label complexity is nearly optimal by providing an almost matching information-theoretic lower bound. Our time and label complexities substantially improve over the state-of-the-art result of [8].

Our main theorem on learning under bounded noise is as follows:

Theorem 2 (Informal).

Suppose the labeling oracle satisfies the η-bounded noise condition with respect to w^*. Then for any ϵ, δ ∈ (0, 1), with probability at least 1 − δ: (1) The output halfspace w is such that err(h_w) − err(h_{w^*}) ≤ ϵ; (2) the number of label queries to the oracle is at most Õ(d/(1−2η)^2 · log(1/ϵ)); (3) the number of unlabeled examples drawn is at most Õ(d/ϵ · poly(1/(1−2η))); (4) the algorithm runs in time Õ(d^2/(ϵ(1−2η)^3)).

In addition, we show that our algorithm also works in a more challenging setting, the ν-adversarial noise setting [6, 42, 45]. (Note that the adversarial noise model is not the same as that in online learning [18], where each example can be chosen adversarially.) In this setting, the examples still come i.i.d. from a distribution, and the only assumption on the labels is that Pr_{(x,y)}[y ≠ sign(w^* · x)] ≤ ν for some halfspace w^*. Under this assumption, the Bayes classifier may not be a halfspace. We show that our algorithm achieves an error of ϵ while tolerating a noise level of ν = Ω̃(ϵ). It runs in time Õ(d^2/ϵ), and requires only Õ(d log(1/ϵ)) labels, which is near-optimal. This label complexity bound matches the state-of-the-art result of [39] (the bound is implicit in [39], via a refined analysis of the algorithm of [6]; see their Lemma 8 for details), while our running time is lower.

Our main theorem on learning under adversarial noise is as follows:

Theorem 3 (Informal).

Suppose the labeling oracle satisfies the ν-adversarial noise condition with respect to w^*, where ν is at most cϵ (up to logarithmic factors) for a sufficiently small absolute constant c. Then for any ϵ, δ ∈ (0, 1), with probability at least 1 − δ: (1) The output halfspace w is such that err(h_w) − err(h_{w^*}) ≤ ϵ; (2) the number of label queries to the oracle is at most Õ(d log(1/ϵ)); (3) the number of unlabeled examples drawn is at most Õ(d/ϵ); (4) the algorithm runs in time Õ(d^2/ϵ).

Throughout the paper, our algorithm is shown to work when the unlabeled examples are drawn uniformly from the unit sphere. The algorithm and analysis can be easily generalized to any spherically symmetric distribution, for example isotropic Gaussian distributions. They can also be generalized to distributions whose densities with respect to the uniform distribution are bounded away from 0.
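To make the first generalization concrete, the following minimal sketch (the function name is ours and the snippet is illustrative, not part of the algorithm) shows why a spherically symmetric distribution such as an isotropic Gaussian reduces to the uniform distribution on the sphere for homogeneous halfspaces:

```python
import numpy as np

def to_unit_sphere(x):
    """Project a nonzero draw from a spherically symmetric distribution
    (e.g., an isotropic Gaussian) onto the unit sphere.

    For homogeneous halfspaces, sign(w . x) = sign(w . (x / ||x||)), so labels
    are unchanged by this normalization, and the normalized point is
    uniformly distributed on the unit sphere."""
    return x / np.linalg.norm(x)

# Example: isotropic Gaussian draws reduce to uniform-on-sphere draws.
rng = np.random.default_rng(0)
d = 10
x_gauss = rng.standard_normal(d)
x_unif = to_unit_sphere(x_gauss)
w = to_unit_sphere(rng.standard_normal(d))
assert np.sign(w @ x_gauss) == np.sign(w @ x_unif)
```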

In addition, we show in Section 6 that our active learning algorithm can be converted to a passive learning algorithm (Algorithm 3) that has near-optimal sample complexities with respect to ϵ and d under the two noise settings. We defer the discussion to the end of the paper.

Table 1: A comparison of algorithms for active learning of halfspaces under the uniform distribution, in the η-bounded noise model. It compares the label and time complexities of the algorithms of [10, 9, 60] (which need to minimize the 0-1 loss, the best known method for which requires superpolynomial time) and of [8] with those of our work.

Table 2: A comparison of algorithms for active learning of halfspaces under the uniform distribution, in the ν-adversarial noise model. It compares the noise tolerance, label complexity, and time complexity of the algorithms of [60] and [39] with those of our work.

2 Related Work

Active Learning.

The recent decades have seen much success in both theory and practice of active learning; see the excellent surveys by [54, 37, 25]. On the theory side, many label-efficient active learning algorithms have been proposed and analyzed. An incomplete list includes [21, 11, 34, 27, 35, 46, 60, 41]. Most of these algorithms rely on solving empirical risk minimization problems, which are computationally hard in the presence of noise [5].

Computational Hardness of Learning Halfspaces.

Efficient learning of halfspaces is one of the central problems in machine learning [22]. In the realizable case, it is well known that linear programming will efficiently find a hypothesis consistent with the data. In the nonrealizable setting, however, the problem is much more challenging.

A series of papers have shown the hardness of learning halfspaces with agnostic noise [5, 30, 33, 44, 23]. The state-of-the-art result [23] shows that, under standard complexity-theoretic assumptions, there exists a data distribution such that the best linear classifier has small error, yet no polynomial time algorithm can achieve error non-negligibly better than random guessing, even with improper learning. [44] shows that, under standard assumptions, even if the unlabeled distribution is Gaussian, any agnostic halfspace learning algorithm must run in time d^{Ω(log(1/ϵ))} to achieve an excess error of ϵ. These results indicate that, to have nontrivial guarantees on learning halfspaces with noise in polynomial time, one has to make additional assumptions on the data distribution over instances and labels.

Efficient Active Learning of Halfspaces.

Despite considerable efforts, there are only a few halfspace learning algorithms that are both computationally efficient and label-efficient, even under the uniform distribution. In the realizable setting, [26, 10, 9] propose computationally efficient active learning algorithms which have an optimal label complexity of Õ(d log(1/ϵ)).

Since learning halfspaces in the general agnostic setting is believed to be hard, it is natural to consider algorithms that work under more moderate noise conditions. Under the bounded noise setting [49], the only known algorithms that are both label-efficient and computationally efficient are those of [7, 8]. [7] uses a margin-based framework which queries the labels of examples near the decision boundary. To achieve computational efficiency, it adaptively chooses a sequence of hinge loss minimization problems to optimize, as opposed to directly optimizing the 0-1 loss. It works only when the label flipping probability upper bound η is small (below a small numerical constant). [8] improves over [7] by adapting a polynomial regression procedure into the margin-based framework. It works for any η < 1/2, but its label complexity is far worse than the information-theoretic lower bound. Recently, [20] gives an efficient algorithm with a near-optimal label complexity under the membership query model, where the learner can query for labels of synthesized points. In contrast, in our stream-based model, the learner can only query for labels of points drawn from the data distribution. We note that learning in the stream-based model is harder than in the membership query model, and it is unclear how to transform the DC algorithm in [20] into a computationally efficient stream-based active learning algorithm.

Under the more challenging ν-adversarial noise setting, [6] proposes a margin-based algorithm that reduces the problem to a sequence of hinge loss minimization problems. Their algorithm achieves an error of ϵ in polynomial time when ν = Ω̃(ϵ), but requires a suboptimal number of labels. Later, [39] performs a refined analysis to achieve a near-optimal label complexity of Õ(d log(1/ϵ)), but the time complexity of the algorithm is still an unspecified high-order polynomial.

Tables 1 and 2 present comparisons between our results and results most closely related to ours in the literature. Due to space limitations, discussions of additional related work are deferred to Appendix A.

3 Definitions and Settings

We consider learning homogeneous halfspaces under the uniform distribution. The instance space is the unit sphere in ℝ^d, which we denote by S^{d−1}. We assume d ≥ 2 throughout this paper. The label space is Y = {+1, −1}. We assume all data points are drawn i.i.d. from an underlying distribution D over S^{d−1} × Y. We denote by D_X the marginal of D over S^{d−1} (which is uniform over S^{d−1}), and by D_{Y|X} the conditional distribution of Y given X. Our algorithm is allowed to draw unlabeled examples x from D_X, and to make queries to a labeling oracle O for labels. Upon query x, O returns a label y drawn from D_{Y|X=x}. The hypothesis class of interest is the set of homogeneous halfspaces H = {h_w : w ∈ S^{d−1}}, where h_w(x) = sign(w · x). For any hypothesis h, we define its error rate err_D(h) := Pr_{(x,y)∼D}[h(x) ≠ y]. We will drop the subscript D in err_D(h) when it is clear from context. Given a dataset S = {(x_1, y_1), …, (x_m, y_m)}, we define the empirical error rate of h over S as err_S(h) := (1/m) Σ_{i=1}^m 1{h(x_i) ≠ y_i}.

Definition 1 (Bounded Noise [49]).

We say that the labeling oracle O satisfies the η-bounded noise condition for some η ∈ [0, 1/2) with respect to w^*, if for any x, Pr_{y∼O(x)}[y ≠ sign(w^* · x)] ≤ η.

It can be seen that under the η-bounded noise condition, h_{w^*} is the Bayes classifier.

Definition 2 (Adversarial Noise [6]).

We say that the labeling oracle O satisfies the ν-adversarial noise condition for some ν ≥ 0 with respect to w^*, if Pr_{(x,y)∼D}[y ≠ sign(w^* · x)] ≤ ν.
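As a concrete illustration of the two noise conditions (a minimal sketch; the function names and the particular adversarial corruption are ours, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def bounded_noise_oracle(x, w_star, eta):
    """Definition 1 style oracle: return sign(w_star . x), flipped independently
    with probability eta < 1/2 (Massart / bounded noise)."""
    y = 1.0 if w_star @ x >= 0 else -1.0
    return -y if rng.random() < eta else y

def adversarial_noise_labels(X, w_star, nu):
    """One labeling satisfying Definition 2 with noise level nu: flip the labels of
    the nu-fraction of points with the smallest margin |w_star . x| (an adversary
    may corrupt any nu-fraction; small-margin points are typically the most harmful)."""
    y = np.where(X @ w_star >= 0, 1.0, -1.0)
    k = int(nu * len(X))
    if k > 0:
        flip = np.argsort(np.abs(X @ w_star))[:k]
        y[flip] *= -1.0
    return y
```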

For two unit vectors u and v, denote by θ(u, v) = arccos(u · v) the angle between them. The following lemma gives relationships between errors and angles (see also Lemma 1 in [8]).

Lemma 1.

For any u, v ∈ S^{d−1}, Pr_{x∼D_X}[sign(u · x) ≠ sign(v · x)] = θ(u, v)/π.

Additionally, if the labeling oracle O satisfies the η-bounded noise condition with respect to w^*, then for any vector w ∈ S^{d−1}, (1 − 2η) · θ(w, w^*)/π ≤ err(h_w) − err(h_{w^*}) ≤ θ(w, w^*)/π.
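A quick Monte Carlo check of the first identity in Lemma 1 (an illustrative sketch; sample sizes and the random seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 5, 200_000

def unit(v):
    return v / np.linalg.norm(v)

u, v = unit(rng.standard_normal(d)), unit(rng.standard_normal(d))
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)       # uniform draws on the unit sphere

disagreement = np.mean(np.sign(X @ u) != np.sign(X @ v))
theta_over_pi = np.arccos(np.clip(u @ v, -1.0, 1.0)) / np.pi
print(disagreement, theta_over_pi)                  # the two numbers agree up to ~1/sqrt(n)
```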

Given access to unlabeled examples drawn from D_X and a labeling oracle O, our goal is to find a polynomial time algorithm A such that with probability at least 1 − δ, A outputs a halfspace w ∈ S^{d−1} with θ(w, w^*) ≤ ϵπ, for some target accuracy ϵ and confidence δ. (By Lemma 1, this guarantees that the excess error of h_w is at most ϵ, namely, err(h_w) − err(h_{w^*}) ≤ ϵ.) The desired algorithm should make as few queries to the labeling oracle as possible.

We say an algorithm A achieves a label complexity of Λ(ϵ, δ), if for any target halfspace w^*, with probability at least 1 − δ, A outputs a halfspace w such that err(h_w) − err(h_{w^*}) ≤ ϵ, and requests at most Λ(ϵ, δ) labels from oracle O.

4 Main Algorithm

Our main algorithm (Algorithm 1) works in epochs. It works under both the bounded and the adversarial noise models, provided its sample schedule and band width are set appropriately with respect to each noise model. At the beginning of each epoch k, it assumes an upper bound on θ(w_{k−1}, w^*), the angle between the current iterate w_{k−1} and the underlying halfspace w^*. As we will see, this can be shown to hold with high probability inductively. Then, it calls the procedure in Algorithm 2 to find a new iterate w_k, which can be shown to have an angle with w^* at most half the previous upper bound, with high probability. The algorithm ends when a total of O(log(1/ϵ)) epochs have passed.

For simplicity, we assume for the rest of the paper that the angle between the initial halfspace w_0 and the underlying halfspace w^* is acute, that is, θ(w_0, w^*) < π/2; Appendix F shows that this assumption can be removed with a constant overhead in terms of label and time complexities.

0:  Labeling oracle O, initial halfspace w_0, target error ϵ, confidence δ, sample schedule {m_k}, band width {b_k}.
0:  Learned halfspace.
1:  Let k_0 be the total number of epochs and initialize the angle upper bound to π/2.
2:  for k = 1, 2, …, k_0 do
3:     w_k ← output of Algorithm 2 run with oracle O, initial halfspace w_{k−1}, the current angle upper bound, an appropriately reduced confidence, number of iterations m_k, and band width b_k; then halve the angle upper bound.
4:  end for
5:  return w_{k_0}.
Algorithm 1

The procedure in Algorithm 2 is the core component of our main algorithm. It sequentially performs a modified Perceptron update rule on the selected new examples (x_t, y_t) [51, 17, 26]:

w_t ← w_{t−1} − 2 · 1{y_t (w_{t−1} · x_t) < 0} (w_{t−1} · x_t) x_t.   (1)

Define θ_t := θ(w_t, w^*). Update rule (1) implies the following relationship between θ_t and θ_{t−1} (see Lemma 8 in Appendix E for its proof):

cos θ_t = cos θ_{t−1} − 2 · 1{y_t (w_{t−1} · x_t) < 0} (w_{t−1} · x_t) (w^* · x_t).   (2)

This motivates us to take cos θ_t as our measure of progress; we would like to drive cos θ_t up to 1 (so that θ_t goes down to 0) as fast as possible.
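For completeness, a short verification of the norm-preservation step behind (2), assuming the update takes the form in (1):

```latex
% Assuming the update in (1) with \|x_t\| = 1, each step is either the identity or a
% reflection of w_{t-1} about the hyperplane orthogonal to x_t, so the norm is preserved:
\|w_t\|^2 \;=\; \|w_{t-1}\|^2 \;-\; 4\,(w_{t-1}\cdot x_t)^2 \;+\; 4\,(w_{t-1}\cdot x_t)^2\,\|x_t\|^2 \;=\; \|w_{t-1}\|^2 .
% Hence w_t remains a unit vector, \cos\theta_t = w_t \cdot w^*, and taking the inner product
% of the update with w^* yields the progress relation stated in (2).
```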

To this end, Algorithm 2 samples new points x_t under time-varying distributions, namely D_X restricted to regions R_t, and queries for their labels, where each R_t is a band inside the unit sphere. The rationale behind the choice of R_t is twofold:

  1. We set R_t to have a sufficiently large probability mass, so that the time complexity of rejection sampling is bounded per example. Moreover, in the adversarial noise setting, we set the band wide enough that the signal inside it dominates the noise of magnitude ν.

  2. Unlike the active Perceptron algorithm in [26] or other margin-based approaches (for example [55, 10]), where only examples with small margin are queried, we query the labels of examples whose margins lie in a range. From a technical perspective, this ensures that θ(w_t, w^*) decreases by a decent amount in expectation (see Lemmas 9 and 10 for details).

Following the insight of [32], we remark that the modified Perceptron update (1) on the distribution D_X restricted to R_t can alternatively be viewed as performing stochastic gradient descent on a special non-convex loss function. It is an interesting open question whether optimizing this new loss function can lead to improved empirical results for learning halfspaces.
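As an aside, one loss function whose gradient step (with unit step size) reproduces an update of the form (1) is the squared negative margin; combined with the unit-norm constraint that the update preserves, the overall problem is non-convex. We state this as an illustrative candidate reading rather than the paper's exact loss function:

```latex
\ell(w; x, y) \;=\; \bigl(\min\{0,\; y\,(w\cdot x)\}\bigr)^2,
\qquad
\nabla_w\,\ell(w; x, y) \;=\; 2\,\mathbf{1}\{y\,(w\cdot x) < 0\}\,(w\cdot x)\,x .
```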

0:  Labeling oracle O, initial halfspace w_0, angle upper bound θ, confidence δ, number of iterations m, band width b.
0:  Improved halfspace w_m.
1:  for t = 1, 2, …, m do
2:     Define the region R_t, a band of width b with respect to the current iterate w_{t−1} (see Section 4).
3:     Rejection sample x_t ∼ D_X restricted to R_t. In other words, draw x from D_X until x is in R_t. Query O for its label y_t.
4:     Update w_t from w_{t−1}, x_t, and y_t using the modified Perceptron rule (1).
5:  end for
6:  return w_m.
Algorithm 2
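The following self-contained sketch mirrors the overall structure of Algorithms 1 and 2. The band shape, the parameter schedule, and all numerical choices here are illustrative assumptions, not the settings of Appendix C:

```python
import numpy as np

rng = np.random.default_rng(3)

def unit(v):
    return v / np.linalg.norm(v)

def sample_from_band(w, b):
    """Rejection sampling from the uniform distribution on the sphere restricted to a
    band around the boundary of w. The band shape used here (margin at most b) is an
    illustrative stand-in for the region R_t of Algorithm 2."""
    while True:
        x = unit(rng.standard_normal(w.shape[0]))
        if abs(w @ x) <= b:
            return x

def modified_perceptron(oracle, w0, n_iters, b):
    """Inner procedure (in the spirit of Algorithm 2): query labels of points in the
    band and apply the mistake-driven reflection update of Equation (1)."""
    w = unit(w0)
    for _ in range(n_iters):
        x = sample_from_band(w, b)
        y = oracle(x)
        if y * (w @ x) < 0:                  # update only on mistakes
            w = w - 2.0 * (w @ x) * x        # reflection; preserves the norm of w
    return w

def active_perceptron(oracle, w0, epochs, n_iters, b0):
    """Outer loop (in the spirit of Algorithm 1): run the inner procedure for a fixed
    number of epochs, shrinking the band width geometrically (an assumed schedule)."""
    w, b = unit(w0), b0
    for _ in range(epochs):
        w = modified_perceptron(oracle, w, n_iters, b)
        b /= 2.0
    return w

# Tiny demo under eta-bounded noise (illustrative parameters only).
d, eta = 10, 0.2
w_star = unit(rng.standard_normal(d))
oracle = lambda x: (1.0 if w_star @ x >= 0 else -1.0) * (-1.0 if rng.random() < eta else 1.0)
w_hat = active_perceptron(oracle, unit(rng.standard_normal(d)),
                          epochs=6, n_iters=400, b0=1.0 / np.sqrt(d))
print("angle to w*:", float(np.arccos(np.clip(w_hat @ w_star, -1.0, 1.0))))
```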

5 Performance Guarantees

We show that our algorithm works in the bounded and the adversarial noise models, achieving computational efficiency and near-optimal label complexities. To this end, we first give a lower bound on the label complexity under bounded noise, and then give computational and label complexity upper bounds under the two noise conditions respectively. We defer all proofs to the Appendix.

5.1 A Lower Bound under Bounded Noise

We first present an information-theoretic lower bound on the label complexity in the bounded noise setting under the uniform distribution. This extends the distribution-free lower bounds of [53, 37], and generalizes the realizable-case lower bound of [47] to the bounded noise setting. Our lower bound can also be viewed as an extension of Theorem 3 of [59]; specifically, it addresses the hardness of the bounded noise setting, an extreme case of the Tsybakov noise condition, while Theorem 3 of [59] provides lower bounds for the remaining range of Tsybakov noise parameters.

Theorem 1.

For any d, any η ∈ [0, 1/2), any ϵ, δ ∈ (0, 1) small enough, and for any active learning algorithm A, there is a halfspace w^* ∈ S^{d−1} and a labeling oracle O that satisfies the η-bounded noise condition with respect to w^*, such that if, with probability at least 1 − δ, A makes at most n label queries to O and outputs a halfspace w such that θ(w, w^*) ≤ ϵπ, then n = Ω(d/(1−2η)^2 · log(1/ϵ)).

5.2 Bounded Noise

We establish Theorem 2 in the bounded noise setting. The theorem implies that, with appropriate settings of the input parameters, Algorithm 1 efficiently learns a halfspace of excess error at most ϵ with probability at least 1 − δ, under the assumption that D_X is uniform over the unit sphere and the labels have bounded noise. In addition, it queries at most Õ(d/(1−2η)^2 · log(1/ϵ)) labels. This matches the lower bound of Theorem 1, and improves over the state-of-the-art result of [8], where a substantially larger label complexity is shown using a different algorithm.

The proof and the precise setting of the parameters (the sample schedule and the band width) are given in Appendix C.

Theorem 2 (Algorithm 1 under Bounded Noise).

Suppose Algorithm 1 has inputs: a labeling oracle O that satisfies the η-bounded noise condition with respect to halfspace w^*, an initial halfspace w_0 such that θ(w_0, w^*) < π/2, target error ϵ, confidence δ, and sample schedule and band width set as specified in Appendix C. Then with probability at least 1 − δ:

  1. The output halfspace w is such that θ(w, w^*) ≤ ϵπ; in particular, err(h_w) − err(h_{w^*}) ≤ ϵ.

  2. The number of label queries is Õ(d/(1−2η)^2 · log(1/ϵ)).

  3. The number of unlabeled examples drawn is Õ(d/ϵ · poly(1/(1−2η))).

  4. The algorithm runs in time Õ(d^2/(ϵ(1−2η)^3)).

The theorem follows from Lemma 2 below. The key ingredient of the lemma is a delicate analysis of the dynamics of the angles θ(w_t, w^*), where w_t is the iterate at step t and w^* is the underlying halfspace. Since x_t is randomly sampled and its label y_t is noisy, we are only able to show that θ(w_t, w^*) decreases by a decent amount in expectation. To remedy the stochastic fluctuations, we apply martingale concentration inequalities to carefully control the upper envelope of the sequence {θ(w_t, w^*)}.

Lemma 2 (Algorithm 2 under Bounded Noise).

Suppose Algorithm 2 has inputs: a labeling oracle O that satisfies the η-bounded noise condition with respect to halfspace w^*, an initial halfspace w_0 and an angle upper bound θ such that θ(w_0, w^*) ≤ θ, confidence δ, and number of iterations m and band width b set as specified in Appendix C. Then with probability at least 1 − δ:

  1. The output halfspace w_m is such that θ(w_m, w^*) ≤ θ/2.

  2. The number of label queries to O is m.

  3. The number of unlabeled examples drawn is at most m divided by the probability mass of the sampling region, up to a logarithmic factor.

  4. The algorithm runs in time proportional to d times the number of unlabeled examples drawn.

5.3 Adversarial Noise

We establish Theorem 3 in the adversarial noise setting. The theorem implies that, with appropriate settings of the input parameters, Algorithm 1 efficiently learns a halfspace of excess error at most ϵ with probability at least 1 − δ, under the assumption that D_X is uniform over the unit sphere and the labels have adversarial noise of magnitude ν = Ω̃(ϵ). In addition, it queries at most Õ(d log(1/ϵ)) labels. Our label complexity bound is information-theoretically optimal [47], and matches the state-of-the-art result of [39]. The benefit of our approach is computational: it has a running time of Õ(d^2/ϵ), while [39] needs to solve a convex optimization problem whose running time is some polynomial in d and 1/ϵ with an unspecified degree.

The proof and the precise setting of the parameters (the sample schedule and the band width) are given in Appendix C.

Theorem 3 (Algorithm 1 under Adversarial Noise).

Suppose Algorithm 1 has inputs: a labeling oracle O that satisfies the ν-adversarial noise condition with respect to halfspace w^*, an initial halfspace w_0 such that θ(w_0, w^*) < π/2, target error ϵ, confidence δ, and sample schedule and band width set as specified in Appendix C. Additionally, suppose ν is at most cϵ, up to logarithmic factors, for a sufficiently small absolute constant c. Then with probability at least 1 − δ:

  1. The output halfspace w is such that θ(w, w^*) ≤ ϵπ; in particular, err(h_w) − err(h_{w^*}) ≤ ϵ.

  2. The number of label queries is Õ(d · log(1/ϵ)).

  3. The number of unlabeled examples drawn is Õ(d/ϵ).

  4. The algorithm runs in time Õ(d^2/ϵ).

The theorem follows from Lemma 3 below, whose proof is similar to Lemma 2.

Lemma 3 (Algorithm 2 under Adversarial Noise).

Suppose Algorithm 2 has inputs: a labeling oracle O that satisfies the ν-adversarial noise condition with respect to halfspace w^*, an initial halfspace w_0 and an angle upper bound θ such that θ(w_0, w^*) ≤ θ, confidence δ, and number of iterations m and band width b set as specified in Appendix C. Additionally, suppose ν is sufficiently small relative to θ, up to logarithmic factors. Then with probability at least 1 − δ:

  1. The output halfspace w_m is such that θ(w_m, w^*) ≤ θ/2.

  2. The number of label queries to O is m.

  3. The number of unlabeled examples drawn is at most m divided by the probability mass of the sampling region, up to a logarithmic factor.

  4. The algorithm runs in time proportional to d times the number of unlabeled examples drawn.

6 Implications to Passive Learning

Algorithm 1 can be converted to a passive learning algorithm (Algorithm 3) for learning homogeneous halfspaces under the uniform distribution over the unit sphere. This passive algorithm has PAC sample complexities close to the lower bounds under the two noise models. We give a formal description of it in Appendix B, and its formal guarantees in the corollaries below, which are immediate consequences of Theorems 2 and 3.

In the η-bounded noise model, the sample complexity of Algorithm 3 improves over the state-of-the-art result of [8]. The bound has the same dependence on ϵ and d as the minimax upper bound of [49], which is achieved by a computationally inefficient ERM algorithm.

Corollary 1 (Algorithm 3 under Bounded Noise).

Suppose Algorithm 3 has inputs: a distribution D that satisfies the η-bounded noise condition with respect to w^*, an initial halfspace w_0, target error ϵ, confidence δ, and sample schedule and band width set as specified in Appendix C. Then with probability at least 1 − δ: (1) The output halfspace w is such that err(h_w) − err(h_{w^*}) ≤ ϵ; (2) the number of labeled examples drawn is Õ(d/ϵ · poly(1/(1−2η))); (3) the algorithm runs in time Õ(d^2/(ϵ(1−2η)^3)).

In the ν-adversarial noise model, the sample complexity of Algorithm 3 matches the minimax optimal sample complexity upper bound of Õ(d/ϵ) obtained in [39]. As in active learning, our algorithm has a faster running time than that of [39].

Corollary 2 (Algorithm 3 under Adversarial Noise).

Suppose Algorithm 3 has inputs: a distribution D that satisfies the ν-adversarial noise condition with respect to w^*, an initial halfspace w_0, target error ϵ, confidence δ, and sample schedule and band width set as specified in Appendix C. Furthermore, suppose ν is at most cϵ, up to logarithmic factors, for a sufficiently small absolute constant c. Then with probability at least 1 − δ: (1) The output halfspace w is such that err(h_w) − err(h_{w^*}) ≤ ϵ; (2) the number of labeled examples drawn is Õ(d/ϵ); (3) the algorithm runs in time Õ(d^2/ϵ).

Tables 3 and 4 present comparisons between our results and results most closely related to ours.

Table 3: A comparison of algorithms for PAC learning halfspaces under the uniform distribution, in the η-bounded noise model. It compares the sample and time complexities of the algorithm of [8] and of ERM [49] with those of our work.

Table 4: A comparison of algorithms for PAC learning halfspaces under the uniform distribution, in the ν-adversarial noise model with ν = Ω̃(ϵ). It compares the sample and time complexities of the algorithm of [39] and of ERM [57] with those of our work.

Acknowledgments.

The authors thank Kamalika Chaudhuri for help and support, Hongyang Zhang for thought-provoking initial conversations, Jiapeng Zhang for helpful discussions, and the anonymous reviewers for their insightful feedback. Much of this work is supported by NSF IIS-1167157 and 1162581.

References

  • Agarwal [2013] Alekh Agarwal. Selective sampling algorithms for cost-sensitive multiclass prediction. ICML (3), 28:1220–1228, 2013.
  • Ailon et al. [2014] Nir Ailon, Ron Begleiter, and Esther Ezra. Active learning using smooth relative regret approximations with applications. Journal of Machine Learning Research, 15(1):885–920, 2014.
  • Angluin and Laird [1988] Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, Apr 1988. ISSN 1573-0565. doi: 10.1023/A:1022873112823. URL https://doi.org/10.1023/A:1022873112823.
  • Anthony and Bartlett [2009] Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.
  • Arora et al. [1993] Sanjeev Arora, László Babai, Jacques Stern, and Z Sweedyk. The hardness of approximate optima in lattices, codes, and systems of linear equations. In Foundations of Computer Science, 1993. Proceedings., 34th Annual Symposium on, pages 724–733. IEEE, 1993.
  • Awasthi et al. [2014] Pranjal Awasthi, Maria Florina Balcan, and Philip M Long. The power of localization for efficiently learning linear separators with noise. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 449–458. ACM, 2014.
  • Awasthi et al. [2015] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Ruth Urner. Efficient learning of linear separators under bounded noise. In COLT, pages 167–190, 2015.
  • Awasthi et al. [2016] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Hongyang Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Proceedings of The 28th Conference on Learning Theory, COLT 2016, 2016.
  • Balcan and Long [2013] M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-concave distributions. In COLT, 2013.
  • Balcan et al. [2007] M.-F. Balcan, A. Z. Broder, and T. Zhang. Margin based active learning. In COLT, 2007.
  • Balcan et al. [2009] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. J. Comput. Syst. Sci., 75(1):78–89, 2009.
  • Balcan and Feldman [2013] Maria-Florina Balcan and Vitaly Feldman. Statistical active learning algorithms. In NIPS, pages 1295–1303, 2013.
  • Balcan and Zhang [2017] Maria-Florina Balcan and Hongyang Zhang. S-concave distributions: Towards broader distributions for noise-tolerant and sample-efficient learning algorithms. arXiv preprint arXiv:1703.07758, 2017.
  • Balcan et al. [2010] Maria-Florina Balcan, Steve Hanneke, and Jennifer Wortman Vaughan. The true sample complexity of active learning. Machine learning, 80(2-3):111–139, 2010.
  • Beygelzimer et al. [2010] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In NIPS, 2010.
  • Beygelzimer et al. [2009] Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In Twenty-Sixth International Conference on Machine Learning, 2009.
  • Blum et al. [1998] Avrim Blum, Alan M. Frieze, Ravi Kannan, and Santosh Vempala. A polynomial-time algorithm for learning noisy linear threshold functions. Algorithmica, 22(1/2):35–52, 1998.
  • Cesa-Bianchi and Lugosi [2006] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
  • Cesa-Bianchi et al. [2009] Nicolò Cesa-Bianchi, Claudio Gentile, and Francesco Orabona. Robust bounds for classification via selective sampling. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, pages 121–128, 2009.
  • Chen et al. [2017] Lin Chen, Hamed Hassani, and Amin Karbasi. Near-optimal active learning of halfspaces via query synthesis in the noisy setting. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • Cohn et al. [1994] David A. Cohn, Les E. Atlas, and Richard E. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
  • Cristianini and Shawe-Taylor [2000] Nello Cristianini and John Shawe-Taylor. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2000.
  • Daniely [2015] Amit Daniely. Complexity theoretic limitations on learning halfspaces. arXiv preprint arXiv:1505.05800, 2015.
  • Dasgupta [2005] S. Dasgupta. Coarse sample complexity bounds for active learning. In NIPS, 2005.
  • Dasgupta [2011] Sanjoy Dasgupta. Two faces of active learning. Theoretical computer science, 412(19):1767–1781, 2011.
  • Dasgupta et al. [2005] Sanjoy Dasgupta, Adam Tauman Kalai, and Claire Monteleoni. Analysis of perceptron-based active learning. In Learning Theory, 18th Annual Conference on Learning Theory, COLT 2005, Bertinoro, Italy, June 27-30, 2005, Proceedings, pages 249–263, 2005.
  • Dasgupta et al. [2007] Sanjoy Dasgupta, Daniel Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems 20, 2007.
  • Dekel et al. [2012] Ofer Dekel, Claudio Gentile, and Karthik Sridharan. Selective sampling and active learning from single and multiple teachers. Journal of Machine Learning Research, 13(Sep):2655–2697, 2012.
  • Dunagan and Vempala [2004] John Dunagan and Santosh Vempala. A simple polynomial-time rescaling algorithm for solving linear programs. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 315–320. ACM, 2004.
  • Feldman et al. [2006] Vitaly Feldman, Parikshit Gopalan, Subhash Khot, and Ashok Kumar Ponnuswami. New results for learning noisy parities and halfspaces. In Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on, pages 563–574. IEEE, 2006.
  • Freund et al. [1997] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, 1997.
  • Guillory et al. [2009] Andrew Guillory, Erick Chastain, and Jeff Bilmes. Active learning as non-convex optimization. In International Conference on Artificial Intelligence and Statistics, pages 201–208, 2009.
  • Guruswami and Raghavendra [2009] Venkatesan Guruswami and Prasad Raghavendra. Hardness of learning halfspaces with noise. SIAM Journal on Computing, 39(2):742–765, 2009.
  • Hanneke [2007] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.
  • Hanneke [2009] S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, Carnegie Mellon University, 2009.
  • Hanneke [2011] Steve Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.
  • Hanneke [2014] Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends® in Machine Learning, 7(2-3):131–309, 2014.
  • Hanneke and Yang [2012] Steve Hanneke and Liu Yang. Surrogate losses in passive and active learning. arXiv preprint arXiv:1207.3772, 2012.
  • Hanneke et al. [2015] Steve Hanneke, Varun Kanade, and Liu Yang. Learning with a drifting target concept. In International Conference on Algorithmic Learning Theory, pages 149–164. Springer, 2015.
  • Hsu [2010] D. Hsu. Algorithms for Active Learning. PhD thesis, UC San Diego, 2010.
  • Huang et al. [2015] Tzu-Kuo Huang, Alekh Agarwal, Daniel Hsu, John Langford, and Robert E. Schapire. Efficient and parsimonious agnostic active learning. CoRR, abs/1506.08669, 2015.
  • Kalai et al. [2008] Adam Tauman Kalai, Adam R Klivans, Yishay Mansour, and Rocco A Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008.
  • Kearns and Li [1993] Michael Kearns and Ming Li. Learning in the presence of malicious errors. SIAM Journal on Computing, 22(4):807–837, 1993.
  • Klivans and Kothari [2014] Adam Klivans and Pravesh Kothari. Embedding Hard Learning Problems Into Gaussian Space. In APPROX/RANDOM 2014, pages 793–809, 2014.
  • Klivans et al. [2009] Adam R Klivans, Philip M Long, and Rocco A Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10(Dec):2715–2740, 2009.
  • Koltchinskii [2010] V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. JMLR, 2010.
  • Kulkarni et al. [1993] Sanjeev R Kulkarni, Sanjoy K Mitter, and John N Tsitsiklis. Active learning using arbitrary binary valued queries. Machine Learning, 11(1):23–35, 1993.
  • Long [1995] Philip M Long. On the sample complexity of pac learning half-spaces against the uniform distribution. IEEE Transactions on Neural Networks, 6(6):1556–1559, 1995.
  • Massart and Nédélec [2006] Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, pages 2326–2366, 2006.
  • Monteleoni [2006] Claire Monteleoni. Efficient algorithms for general active learning. In International Conference on Computational Learning Theory, pages 650–652. Springer, 2006.
  • Motzkin and Schoenberg [1954] TS Motzkin and IJ Schoenberg. The relaxation method for linear inequalities. Canadian Journal of Mathematics, 6(3):393–404, 1954.
  • Orabona and Cesa-Bianchi [2011] Francesco Orabona and Nicolo Cesa-Bianchi. Better algorithms for selective sampling. In Proceedings of the 28th international conference on Machine learning (ICML-11), pages 433–440, 2011.
  • Raginsky and Rakhlin [2011] Maxim Raginsky and Alexander Rakhlin. Lower bounds for passive and active learning. In Advances in Neural Information Processing Systems, pages 1026–1034, 2011.
  • Settles [2010] Burr Settles. Active learning literature survey. University of Wisconsin, Madison, 52(55-66):11, 2010.
  • Tong and Koller [2001] Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. Journal of machine learning research, 2(Nov):45–66, 2001.
  • Tosh and Dasgupta [2017] Christopher Tosh and Sanjoy Dasgupta. Diameter-based active learning. In ICML, pages 3444–3452, 2017.
  • Vapnik and Chervonenkis [1971] Vladimir N. Vapnik and Alexey Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16(2):264–280, 1971.
  • Wang [2011] Liwei Wang. Smoothness, disagreement coefficient, and the label complexity of agnostic active learning. Journal of Machine Learning Research, 12(Jul):2269–2292, 2011.
  • Wang and Singh [2016] Yining Wang and Aarti Singh. Noise-adaptive margin-based active learning and lower bounds under tsybakov noise condition. In AAAI, 2016.
  • Zhang and Chaudhuri [2014] Chicheng Zhang and Kamalika Chaudhuri. Beyond disagreement-based agnostic active learning. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 442–450, 2014.
  • Zhang et al. [2017] Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient langevin dynamics. In COLT, pages 1980–2022, 2017.

Appendix A Additional Related Work

Active Learning.

The recent decades have seen much success in both theory and practice of active learning; see the excellent surveys by [54, 37, 25]. On the theory side, many label-efficient active learning algorithms have been proposed and analyzed [21, 31, 24, 11, 34, 10, 27, 14, 16, 35, 46, 40, 15, 58, 36, 2, 60, 41]. Most are disagreement-based algorithms [37], and are not label-optimal due to the conservativeness of their label query policies. In addition, most of these algorithms require either explicit enumeration of classifiers in the hypothesis class, or solving empirical 0-1 loss minimization problems over sets of examples. The former approach is easily seen to be computationally infeasible, while the latter is proven to be computationally hard as well [5]. The only exception in this family that we are aware of is [38], which considers active learning by sequential convex surrogate loss minimization. However, it assumes that the expected convex loss minimizer over all possible functions lies in a pre-specified real-valued function class, which is unlikely to hold in the bounded noise and the adversarial noise settings.

Some recent works [60, 41, 10, 9, 59] provide noise-tolerant active learning algorithms with improved label complexity over disagreement-based approaches. However, they are still computationally inefficient: [60] relies on solving a series of linear programs with exponentially many constraints, which is computationally intractable; [41, 10, 9, 59] rely on solving a series of empirical 0-1 loss minimization problems, which are also computationally hard in the presence of noise [5].

Efficient Learning of Halfspaces.

A series of papers have shown the hardness of learning halfspaces with agnostic noise [5, 30, 33, 44, 23]. These results indicate that, to have nontrivial guarantees on learning halfspaces with noise in polynomial time, one has to make additional assumptions on the data distribution over instances and labels.

Many noise models other than the bounded noise model and the adversarial noise model have been studied in the literature. A line of work [19, 52, 28, 1] considers parameterized noise models. For instance, [28] gives an efficient algorithm for the setting where Pr[y = 1 | x] = (1 + w^* · x)/2, where w^* is the optimal classifier. [1] studies a generalization of the above linear noise model, where y is a multiclass label and there is a link function relating the conditional label distribution to the linear predictions. Their analyses depend heavily on the noise models, and it is unknown whether their algorithms can work in more general noise settings. [61] analyzes the problem of learning halfspaces under a new noise condition (as an application of their general analysis of stochastic gradient Langevin dynamics). They assume that the label flipping probability on every example x is bounded by a parameterized function of x. It can be seen that the bounded noise condition implies the noise condition of [61], and it is an interesting open question whether it is possible to extend our algorithm and analysis to their setting.

Under the random classification noise condition [3], [17] gives the first efficient passive learning algorithm for halfspaces, using a modification of the Perceptron update (similar to Equation (1)) together with a boosting-type aggregation. [12] proposes an active statistical query algorithm for learning halfspaces. The algorithm proceeds by estimating the distance between the current halfspace and the optimal halfspace. However, it requires a suboptimal number of labels. In addition, both results above rely on the uniformity of the random classification noise, and it is shown in [7] that this type of statistical query algorithm fails in the heterogeneous noise setting (in particular, the bounded noise setting and the adversarial noise setting).

In the adversarial noise model, we assume that there is a halfspace with error at most ν over the data. The goal is to design an efficient algorithm that outputs a classifier with small disagreement with the underlying halfspace. [42] proposes an elegant averaging-based algorithm that tolerates a small amount of adversarial noise, assuming that the unlabeled distribution is uniform; however, it has a suboptimal label complexity. Under the assumption that the unlabeled distribution is log-concave or s-concave, the state-of-the-art results [6, 13] give efficient margin-based algorithms that tolerate a noise level of Ω̃(ϵ). As discussed in the main text, such algorithms require a hinge loss minimization procedure whose running time is polynomial in d and 1/ϵ with an unspecified degree. Finally, [23] gives a PTAS that outputs a classifier with error close to that of the best halfspace; in the parameter regime relevant here, its running time is an unspecified high-order polynomial in terms of d and 1/ϵ.

Appendix B Implications to Passive Learning

In this section, we formally describe Algorithm 3, a passive learning version of Algorithm 1. The algorithmic framework is similar to that of Algorithm 1, except that it calls Algorithm 4 rather than Algorithm 2.

0:  Initial halfspace w_0, target error ϵ, confidence δ, sample schedule {m_k}, band width {b_k}.
0:  Learned halfspace.
1:  Let k_0 be the total number of epochs and initialize the angle upper bound to π/2.
2:  for k = 1, 2, …, k_0 do
3:     w_k ← output of Algorithm 4 run with initial halfspace w_{k−1}, the current angle upper bound, an appropriately reduced confidence, number of iterations m_k, and band width b_k; then halve the angle upper bound.
4:  end for
5:  return w_{k_0}.
Algorithm 3

Algorithm 4 is similar to Algorithm 2, except that it draws labeled examples from D directly, as opposed to performing label queries on unlabeled examples drawn from D_X.

0:  Initial halfspace w_0, angle upper bound θ, confidence δ, number of iterations m, band width b.
0:  Improved halfspace w_m.
1:  for t = 1, 2, …, m do
2:     Define the region R_t, a band of width b with respect to the current iterate w_{t−1}, as in Algorithm 2.
3:     Rejection sample: repeatedly draw a labeled example (x_t, y_t) from D until x_t is in R_t.
4:     Update w_t from w_{t−1}, x_t, and y_t using the modified Perceptron rule (1).
5:  end for
6:  return w_m.
Algorithm 4
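A sketch of this passive step (in the spirit of Algorithm 4): draw labeled examples and keep only those falling in the band, rather than querying an oracle on unlabeled draws. The band shape and all names here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def unit(v):
    return v / np.linalg.norm(v)

def passive_modified_perceptron(labeled_stream, w0, n_updates, b):
    """Passive analogue of the inner procedure: consume i.i.d. labeled draws (x, y),
    discard those outside the band, and apply the reflection update to the rest."""
    w = unit(w0)
    done = 0
    for x, y in labeled_stream:
        if abs(w @ x) > b:          # filtering replaces the label query of the active version
            continue
        if y * (w @ x) < 0:
            w = w - 2.0 * (w @ x) * x
        done += 1
        if done == n_updates:
            break
    return w

# Example stream: uniform-on-sphere draws labeled by a noiseless halfspace.
d = 10
w_star = unit(rng.standard_normal(d))
stream = ((x, 1.0 if w_star @ x >= 0 else -1.0)
          for x in (unit(rng.standard_normal(d)) for _ in range(100_000)))
w_hat = passive_modified_perceptron(stream, unit(rng.standard_normal(d)), n_updates=2_000, b=0.3)
```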

It can be seen that, with the same input, Algorithm 3 has exactly the same running time as Algorithm 1, and the number of labeled examples drawn by Algorithm 3 is exactly the same as the number of unlabeled examples drawn by Algorithm 1. Therefore, Corollaries 1 and 2 are immediate consequences of Theorems 2 and 3.

Appendix C Proofs of Theorems 2 and 3

In this section, we give straightforward proofs showing that Theorem 2 (resp. Theorem 3) is a direct consequence of Lemma 2 (resp. Lemma 3). We defer the proofs of Lemmas 2 and 3 to Appendix D.

Theorem 4 (Theorem 2 Restated).

Suppose Algorithm 1 has inputs labeling oracle that satisfies -bounded noise condition with respect to underlying halfspace , initial halfspace such that , target error , confidence , sample schedule where