In this paper, we study active learning of classifiers in an agnostic setting, where no assumptions are made on the true function that generates the labels. The learner has access to a large pool of unlabelled examples, and can interactively request labels for a small subset of these; the goal is to learn an accurate classifier in a pre-specified class with as few label queries as possible. Specifically, we are given a hypothesis class $\mathcal{H}$ and a target $\epsilon$, and our aim is to find a binary classifier in $\mathcal{H}$ whose error is at most $\epsilon$ more than that of the best classifier in $\mathcal{H}$, while minimizing the number of requested labels.
There has been a large body of previous work on active learning; see the surveys by [Das11, Set10] for overviews. The main challenge in active learning is ensuring consistency in the agnostic setting while still maintaining low label complexity. In particular, a very natural approach to active learning is to view it as a generalization of binary search [FSST97, Das05, Now11]. While this strategy has been extended to several different noise models [Kää06, Now11, NJC13], it is generally inconsistent in the agnostic case [DH08].
The primary algorithm for agnostic active learning is called disagreement-based active learning. The main idea is as follows. A set $V$ of possible risk minimizers is maintained over time, and the label of an example $x$ is queried if there exist two hypotheses $h_1$ and $h_2$ in $V$ such that $h_1(x) \neq h_2(x)$. This algorithm is consistent in the agnostic setting [CAL94, BBL09, DHM07, Han07, BDL09, Han09, BHLZ10, Kol10]; however, due to the conservative label query policy, its label requirement is high. A line of work [BBZ07, BL13, ABL14] has provided algorithms that achieve better label complexity for linear classification on the uniform distribution over the unit sphere as well as log-concave distributions; however, these algorithms are limited to these specific cases, and it is unclear how to apply them more generally.
Thus, a major challenge in the agnostic active learning literature has been to find a general active learning strategy that applies to any hypothesis class and data distribution, is consistent in the agnostic case, and has a better label requirement than disagreement-based active learning. This has been mentioned as an open problem by several works, such as [BBL09, Das11, BL13].
In this paper, we provide such an algorithm. Our solution is based on two key contributions, which may be of independent interest. The first is a general connection between confidence-rated predictors and active learning. A confidence-rated predictor is one that is allowed to abstain from prediction on occasion, and as a result, can guarantee a target prediction error. Given a confidence-rated predictor with guaranteed error, we show how to use it to construct an active label query algorithm consistent in the agnostic setting. Our second key contribution is a novel confidence-rated predictor with guaranteed error that applies to any general classification problem. We show that our predictor is optimal in the realizable case, in the sense that it has the lowest abstention rate out of all predictors that guarantee a certain error. Moreover, we show how to extend our predictor to the agnostic setting.
Combining the label query algorithm with our novel confidence-rated predictor, we get a general active learning algorithm consistent in the agnostic setting. We provide a characterization of the label complexity of our algorithm, and show that this is better than disagreement-based active learning in general. Finally, we show that for linear classification with respect to the uniform distribution and log-concave distributions, our bounds reduce to those of [BBZ07, BL13].
2.1 The Setting
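We study active learning for binary classification. Examples belong to an instance space $\mathcal{X}$, and their labels lie in a label space $\mathcal{Y} = \{-1, 1\}$; labelled examples are drawn from an underlying data distribution $D$ on $\mathcal{X} \times \mathcal{Y}$. We use $D_{\mathcal{X}}$ to denote the marginal of $D$ on $\mathcal{X}$, and $D_{Y|X}$ to denote the conditional distribution on $\mathcal{Y}$ induced by $D$. Our algorithm has access to examples through two oracles – an example oracle $\mathcal{U}$ which returns an unlabelled example $x \in \mathcal{X}$ drawn from $D_{\mathcal{X}}$ and a labelling oracle $\mathcal{O}$ which returns the label $y$ of an input $x \in \mathcal{X}$ drawn from $D_{Y|X}$.
Given a hypothesis class $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ of VC dimension $d$, the error of any $h \in \mathcal{H}$ with respect to a data distribution $\Pi$ over $\mathcal{X} \times \mathcal{Y}$ is defined as $\mathrm{err}_{\Pi}(h) = \Pr_{(x,y) \sim \Pi}(h(x) \neq y)$. We define: $h^*(\Pi) = \arg\min_{h \in \mathcal{H}} \mathrm{err}_{\Pi}(h)$, $\nu^*(\Pi) = \mathrm{err}_{\Pi}(h^*(\Pi))$. For a set $S$, we abuse notation and use $S$ to also denote the uniform distribution over the elements of $S$. We define $\Pr_{\Pi}(\cdot) := \Pr_{(x,y) \sim \Pi}(\cdot)$, $\mathbb{E}_{\Pi}(\cdot) := \mathbb{E}_{(x,y) \sim \Pi}(\cdot)$.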
Given access to examples from a data distribution $D$ through an example oracle and a labelling oracle, we aim to provide a classifier $\hat{h} \in \mathcal{H}$ such that, with probability at least $1 - \delta$, the error of $\hat{h}$ exceeds $\nu^*(D)$, the error of the best classifier in $\mathcal{H}$, by at most $\epsilon$, for some target values of $\epsilon$ and $\delta$; this is achieved in an adaptive manner by making as few queries to the labelling oracle as possible. When $\nu^*(D) = 0$, we are said to be in the realizable case; in the more general agnostic case, we make no assumptions on the labels, and thus $\nu^*(D)$ can be positive.
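Previous approaches to agnostic active learning have frequently used the notion of disagreements. The disagreement between two hypotheses $h_1$ and $h_2$ with respect to a data distribution $\Pi$ is the fraction of examples according to $\Pi$ to which $h_1$ and $h_2$ assign different labels; formally: $\rho_{\Pi}(h_1, h_2) = \Pr_{(x,y) \sim \Pi}(h_1(x) \neq h_2(x))$. Observe that a data distribution $\Pi$ induces a pseudo-metric $\rho_{\Pi}$ on the elements of $\mathcal{H}$; this is called the disagreement metric. For any $r > 0$ and any $h \in \mathcal{H}$, define $B_{\Pi}(h, r)$ to be the disagreement ball of radius $r$ around $h$ with respect to the data distribution $\Pi$. Formally: $B_{\Pi}(h, r) = \{h' \in \mathcal{H} : \rho_{\Pi}(h, h') \leq r\}$.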
For notational simplicity, we assume that the hypothesis space $\mathcal{H}$ is “dense” with respect to the data distribution $D$, in the sense that , . Our analysis will still apply without the denseness assumption, but will be significantly messier. Finally, given a set of hypotheses $V \subseteq \mathcal{H}$, the disagreement region of $V$ is the set of all examples $x$ such that there exist two hypotheses $h_1, h_2 \in V$ for which $h_1(x) \neq h_2(x)$.
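The two notions above, the disagreement between a pair of hypotheses and the disagreement region of a set of hypotheses, can be illustrated on a finite sample. The following is a minimal sketch rather than part of the paper's algorithm: it assumes a hypothetical class of one-dimensional threshold classifiers and replaces the data distribution by the uniform distribution over a small sample; the names `threshold`, `disagreement` and `disagreement_region` are illustrative.

```python
def threshold(t):
    """Hypothetical 1-D threshold classifier: +1 if x >= t, else -1."""
    return lambda x: 1 if x >= t else -1


def disagreement(h1, h2, sample):
    """Fraction of points in `sample` on which h1 and h2 predict different labels."""
    return sum(h1(x) != h2(x) for x in sample) / len(sample)


def disagreement_region(hypotheses, sample):
    """Points of `sample` on which at least two of the hypotheses disagree."""
    return [x for x in sample if len({h(x) for h in hypotheses}) > 1]


# Uniform distribution over ten points (an assumption made for illustration).
sample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
h_a, h_b, h_c = threshold(0.25), threshold(0.45), threshold(0.65)

rho_ab = disagreement(h_a, h_b, sample)   # h_a and h_b differ on 0.3 and 0.4
rho_bc = disagreement(h_b, h_c, sample)   # h_b and h_c differ on 0.5 and 0.6
rho_ac = disagreement(h_a, h_c, sample)   # h_a and h_c differ on 0.3 to 0.6
region = disagreement_region([h_a, h_b, h_c], sample)
```

On this sample the disagreement behaves like a pseudo-metric (for instance, `rho_ac` does not exceed `rho_ab + rho_bc`), and the disagreement region consists of exactly the points lying between the smallest and largest thresholds.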
This paper establishes a connection between active learning and confidence-rated predictors with guaranteed error. A confidence-rated predictor is a prediction algorithm that is occasionally allowed to abstain from classification. We will consider such predictors in the transductive setting. Given a set $V$ of candidate hypotheses, an error guarantee $\delta$, and a set $U$ of unlabelled examples, a confidence-rated predictor $P$ either assigns a label or abstains from prediction on each unlabelled $x \in U$. The labels are assigned with the guarantee that the expected disagreement (where the expectation is with respect to the random choices made by $P$) between the label assigned by $P$ and any $h \in V$ is at most $\delta$. Specifically, writing $U = \{x_1, \ldots, x_m\}$ and $P(x_i) = 0$ for abstention, for all $h \in V$,

$$\frac{1}{m} \sum_{i=1}^{m} \Pr_{P}\big(h(x_i) \neq P(x_i),\; P(x_i) \neq 0\big) \leq \delta. \tag{1}$$

This ensures that if some $h^* \in V$ is the true risk minimizer, then the labels predicted by $P$ on $U$ do not differ very much from those predicted by $h^*$. The performance of a confidence-rated predictor which has a guarantee such as in Equation (1) is measured by its coverage, or the probability of non-abstention $\frac{1}{m} \sum_{i=1}^{m} \Pr_{P}(P(x_i) \neq 0)$; higher coverage implies better performance.
2.2 Main Algorithm
Our active learning algorithm proceeds in epochs, where the goal of epoch $k$ is to achieve excess generalization error $\epsilon_k$, by querying a fresh batch of labels. The algorithm maintains a candidate set $V_k$ that is guaranteed to contain the true risk minimizer.
The critical decision at each epoch is how to select a subset of unlabelled examples whose labels should be queried. We make this decision using a confidence-rated predictor . At epoch , we run with candidate hypothesis set and error guarantee . Whenever abstains, we query the label of the example. The number of labels queried is adjusted so that it is enough to achieve excess generalization error .
An outline is described in Algorithm 1; we next discuss each individual component in detail.
At epoch $k$, we maintain a set $V_k$ of candidate hypotheses guaranteed to contain the true risk minimizer (w.h.p.). In the realizable case, we use a version space as our candidate set. The version space with respect to a set $S$ of labelled examples is the set of all $h \in \mathcal{H}$ such that $h(x_i) = y_i$ for all $(x_i, y_i) \in S$.
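The version space definition can be made concrete for a finite hypothesis class. The sketch below is illustrative only: it assumes a hypothetical finite class of one-dimensional threshold classifiers, each identified by its threshold, and the names `predict` and `version_space` are not from the paper.

```python
def predict(t, x):
    """Label assigned to x by the threshold classifier with threshold t."""
    return 1 if x >= t else -1


def version_space(thresholds, labelled):
    """Thresholds whose classifier agrees with every labelled example (x, y)."""
    return [t for t in thresholds
            if all(predict(t, x) == y for x, y in labelled)]


# Hypothetical finite hypothesis class and a consistent (realizable) labelled set.
thresholds = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
labelled = [(0.3, -1), (0.7, 1)]
vs = version_space(thresholds, labelled)

# One label that no threshold can explain together with the others empties the set.
noisy = labelled + [(0.9, -1)]
vs_noisy = version_space(thresholds, noisy)
```

Here `vs` keeps only the thresholds lying strictly above 0.3 and at most 0.7, while `vs_noisy` is empty because no threshold classifier is consistent with all three labels.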
Suppose we run Algorithm 1 in the realizable case with inputs example oracle , labelling oracle , hypothesis class , confidence-rated predictor , target excess error and target confidence . Then, with probability , .
In the non-realizable case, the version space is usually empty; we use instead a -confidence set for the true risk minimizer. Given a set of labelled examples, let be a function of ; is said to be a -confidence set for the true risk minimizer if for all data distributions over ,
Recall that . In the non-realizable case, our candidate sets are -confidence sets for , for . The precise setting of is explained in Algorithm 2.
Suppose we run Algorithm 1 in the non-realizable case with inputs example oracle , labelling oracle , hypothesis class , confidence-rated predictor , target excess error and target confidence . Then with probability , .
We next discuss our label query procedure – which examples should we query labels for, and how many labels should we query at each epoch?
Which Labels to Query?
Our goal is to query the labels of the most informative examples. To choose these examples while still maintaining consistency, we use a confidence-rated predictor with guaranteed error. The inputs to the predictor are our candidate hypothesis set which contains (w.h.p.) the true risk minimizer, a fresh set of unlabelled examples, and an error guarantee . For notational simplicity, assume the elements in are distinct. The output is a sequence of abstention probabilities , for each example in . It induces a distribution over , from which we independently draw examples for label queries.
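The query step described above, turning per-example abstention probabilities into a distribution and drawing examples from it independently, might be sketched as follows. This is a minimal illustration under stated assumptions: the abstention probabilities are taken as given (in the paper they come from the confidence-rated predictor), the induced distribution is assumed to be proportional to them, and the names `query_distribution`, `draw_queries` and `abstention_fraction` are hypothetical.

```python
import random


def abstention_fraction(abstain_probs):
    """Average abstention probability over the unlabelled examples."""
    return sum(abstain_probs) / len(abstain_probs)


def query_distribution(abstain_probs):
    """Normalize abstention probabilities into a distribution over the examples.

    Returns None when the predictor never abstains, so there is nothing to query.
    """
    total = sum(abstain_probs)
    if total == 0:
        return None
    return [g / total for g in abstain_probs]


def draw_queries(examples, abstain_probs, n, rng):
    """Draw n examples independently (with replacement) from the induced distribution."""
    weights = query_distribution(abstain_probs)
    if weights is None:
        return []
    return rng.choices(examples, weights=weights, k=n)


examples = ["x1", "x2", "x3", "x4"]
abstain_probs = [0.0, 0.5, 1.0, 0.5]   # assumed output of a confidence-rated predictor
weights = query_distribution(abstain_probs)
phi = abstention_fraction(abstain_probs)
queries = draw_queries(examples, abstain_probs, n=5, rng=random.Random(0))
```

Examples on which the predictor never abstains (here `x1`) receive zero weight and are never sent to the labelling oracle, which is where the label savings come from.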
How Many Labels to Query?
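The goal of epoch $k$ is to achieve excess generalization error $\epsilon_k$. To achieve this, passive learning requires $\tilde{O}(d/\epsilon_k)$ labelled examples in the realizable case, and $\tilde{O}(d(\nu^*(D) + \epsilon_k)/\epsilon_k^2)$ examples in the agnostic case, where $d$ is the VC dimension of the hypothesis class and $\tilde{O}(\cdot)$ hides logarithmic factors. A key observation in this paper is that in order to achieve excess generalization error $\epsilon_k$ on $D$, it suffices to achieve a much larger excess generalization error $O(\epsilon_k/\phi_k)$ on the data distribution induced by the query distribution $\Gamma_k$ over the unlabelled examples and $D_{Y|X}$, where $\phi_k$ is the fraction of examples on which the confidence-rated predictor abstains.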
In the realizable case, we achieve this by sampling i.i.d examples from , and querying their labels to get a labelled dataset . Observe that as is the abstention probability of with guaranteed error , it is generally smaller than the measure of the disagreement region of the version space; this key fact results in improved label complexity over disagreement-based active learning. This sampling procedure has the following property:
Suppose we run Algorithm 1 in the realizable case with inputs example oracle , labelling oracle , hypothesis class , confidence-rated predictor , target excess error and target confidence . Then with probability , for all , and for all , . In particular, the returned at the end of the algorithm satisfies .
The agnostic case has an added complication – in practice, the value of is not known ahead of time. Inspired by [Kol10], we use a doubling procedure (stated in Algorithm 2) which adaptively finds the number of labelled examples to be queried and queries them. The following two lemmas illustrate its properties – that it is consistent, and that it does not use too many label queries.
Suppose we run Algorithm 2 with inputs hypothesis set , example distribution , labelling oracle , target excess error and target confidence . Let be the joint distribution on induced by and . Then there exists an event , , such that on , (1) Algorithm 2 halts and (2) the set has the following properties:
(2.1) If for , , then .
(2.2) On the other hand, if , then .
When event happens, we say Algorithm 2 succeeds.
The following lemma is a consequence of our label query procedure in the non-realizable case.
Suppose we run Algorithm 1 in the non-realizable case with inputs example oracle , labelling oracle , hypothesis class , confidence-rated predictor , target excess error and target confidence . Then with probability , for all , and for all , . In particular, the returned at the end of the algorithm satisfies .
2.3 Confidence-Rated Predictor
Our active learning algorithm uses a confidence-rated predictor with guaranteed error to make its label query decisions. In this section, we provide a novel confidence-rated predictor with guaranteed error. This predictor has optimal coverage in the realizable case, and may be of independent interest. The predictor $P$ receives as input a set $V$ of hypotheses (which is likely to contain the true risk minimizer), an error guarantee $\delta$, and a set $U$ of unlabelled examples. We consider a soft prediction algorithm; so, for each example in $U$, the predictor outputs three probabilities that add up to $1$: the probabilities of predicting $1$, $-1$ and $0$ (abstention). This output is subject to the constraint that the expected disagreement (where the expectation is taken over the random choices made by $P$) between the labels assigned by $P$ and those assigned by any $h \in V$ is at most $\delta$, and the goal is to maximize the coverage, or the expected fraction of non-abstentions.
Our key insight is that this problem can be written as a linear program, which is described in Algorithm 3. There are three variables, $\xi_i$, $\zeta_i$ and $\gamma_i$, for each unlabelled $x_i \in U$; these are the probabilities with which we predict $1$, $-1$ and $0$ on $x_i$ respectively. Constraint (2) ensures that the expected disagreement between the label predicted and any $h \in V$ is no more than $\delta$, while the LP objective maximizes the coverage under these constraints. Observe that the LP is always feasible, since abstaining on every example satisfies all of the constraints. Although the LP has infinitely many constraints, the number of constraints in Equation (2) distinguishable by $U$ is at most $(em/d)^d$, where $m$ is the number of examples in $U$ and $d$ is the VC dimension of the hypothesis class $\mathcal{H}$.
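To make the constraint and the objective of this linear program concrete, the sketch below checks whether a given soft prediction is feasible and computes its coverage. It does not solve the LP; it only evaluates candidate solutions. The setup is an assumption made for illustration: a finite list of hypotheses, each represented by the labels it assigns to the unlabelled examples, and the names `expected_disagreement`, `is_feasible` and `coverage` are hypothetical.

```python
def expected_disagreement(h_labels, p_plus, p_minus):
    """Expected fraction of examples on which the predicted label differs from h's.

    h_labels[i] is the label (+1 or -1) that h assigns to example i; p_plus[i]
    and p_minus[i] are the probabilities of predicting +1 and -1 on example i.
    Abstaining never counts as a disagreement.
    """
    mass = 0.0
    for label, a, b in zip(h_labels, p_plus, p_minus):
        # Predicting -1 where h says +1, or +1 where h says -1, is a disagreement.
        mass += b if label == 1 else a
    return mass / len(h_labels)


def is_feasible(hyp_labels, p_plus, p_minus, p_abstain, delta, tol=1e-9):
    """Check the probability constraints and the disagreement bound for every hypothesis."""
    for a, b, c in zip(p_plus, p_minus, p_abstain):
        if min(a, b, c) < -tol or abs(a + b + c - 1.0) > tol:
            return False
    return all(expected_disagreement(h, p_plus, p_minus) <= delta + tol
               for h in hyp_labels)


def coverage(p_abstain):
    """Expected fraction of examples on which the predictor does not abstain."""
    return 1.0 - sum(p_abstain) / len(p_abstain)


# Three hypothetical hypotheses on four examples; they disagree on examples 1 and 2.
hyps = [[1, 1, -1, -1], [1, -1, -1, -1], [1, 1, 1, -1]]

always_abstain = ([0.0] * 4, [0.0] * 4, [1.0] * 4)
abstain_on_disagreement = ([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0])
follow_first = ([1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0], [0.0] * 4)

feasible_trivial = is_feasible(hyps, *always_abstain, delta=0.0)
feasible_dis = is_feasible(hyps, *abstain_on_disagreement, delta=0.0)
feasible_first = is_feasible(hyps, *follow_first, delta=0.25)
```

In this toy instance, abstaining exactly on the disagreement region is feasible with a zero error guarantee and has coverage 0.5, whereas allowing an error guarantee of 0.25 admits a feasible solution with full coverage; with a guarantee of 0.1 that same solution is no longer feasible.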
The performance of a confidence-rated predictor is measured by its error and coverage. The error of a confidence-rated predictor is the probability with which it predicts the wrong label on an example, while the coverage is its probability of non-abstention. We can show the following guarantee on the performance of the predictor in Algorithm 3.
In the realizable case, if the hypothesis set is the version space with respect to a training set, then . In the non-realizable case, if the hypothesis set is an -confidence set for the true risk minimizer , then, w.p , .
In the realizable case, we can also show that our confidence rated predictor has optimal coverage. Observe that we cannot directly show optimality in the non-realizable case, as the performance depends on the exact choice of the -confidence set.
In the realizable case, suppose that the hypothesis set $V$ is the version space with respect to a training set. If $P'$ is any confidence-rated predictor with error guarantee $\delta$, and if $P$ is the predictor in Algorithm 3, then the coverage of $P$ is at least as much as the coverage of $P'$.
3 Performance Guarantees
An essential property of any active learning algorithm is consistency – that it converges to the true risk minimizer given enough labelled examples. We observe that our algorithm is consistent provided we use any confidence-rated predictor with guaranteed error as a subroutine. The consistency of our algorithm is a consequence of Lemmas 3 and 6 and is shown in Theorem 3.
Theorem 3 (Consistency).
We now establish a label complexity bound for our algorithm; however, this label complexity bound applies only if we use the predictor described in Algorithm 3 as a subroutine.
For any hypothesis set , data distribution , and , define to be the minimum abstention probability of a confidence-rated predictor which guarantees that the disagreement between its predicted labels and any under is at most .
Formally, . Define . The label complexity of our active learning algorithm can be stated as follows.
Theorem 4 (Label Complexity).
Suppose we run Algorithm 1 with inputs example oracle , labelling oracle , hypothesis class , confidence-rated predictor of Algorithm 3, target excess error and target confidence . Then there exist constants such that with probability :
(1) In the realizable case, the total number of labels queried by Algorithm 1 is at most:
(2) In the agnostic case, the total number of labels queried by Algorithm 1 is at most:
The label complexity of disagreement-based active learning is characterized in terms of the disagreement coefficient. Given a radius , the disagreement coefficient is defined as:
where for any , is the disagreement region of . As [EYW10], in our notation, .
In the realizable case, the label complexity of disagreement-based active learning is [Han13] (here the notation hides factors logarithmic in ). Our label complexity bound may be simplified to:
which is essentially the bound of [Han13] with replaced by . As enforcing a lower error guarantee requires more abstention, is a decreasing function of ; as a result,
and our label complexity is better.
Again, this is essentially the bound of [DHM07] with replaced by the smaller quantity
[Han13] has provided a more refined analysis of disagreement-based active learning that gives a label complexity of ; observe that their dependence is still on . We leave a more refined label complexity analysis of our algorithm for future work.
3.1 Tsybakov Noise Conditions
An important sub-case of learning from noisy data is learning under the Tsybakov noise conditions [Tsy04].
(Tsybakov Noise Condition) Let . A labelled data distribution over satisfies -Tsybakov Noise Condition with respect to a hypothesis class for some constant , if for all , .
The following theorem shows the performance guarantees achieved by Algorithm 1 under the Tsybakov noise conditions.
Suppose -Tsybakov Noise Condition holds for with respect to . Then Algorithm 1 with inputs example oracle , labelling oracle , hypothesis class , confidence-rated predictor of Algorithm 3, target excess error and target confidence satisfies the following properties. There exists a constant such that with probability , the total number of labels queried by Algorithm 1 is at most:
For , our label complexity is at most
In both cases, our bounds are better, as . In further work, [HY12] provides a refined analysis with a bound of ; however, this work is not directly comparable to ours, as they need prior knowledge of and .
3.2 Case Study: Linear Classification under the Log-concave Distribution
We now consider learning linear classifiers with respect to a log-concave data distribution on . In this case, for any , the disagreement coefficient [BL13]; however, for any , (see Lemma 14 in the Appendix), which is much smaller so long as is not too small. This leads to the following label complexity bounds.
Suppose is isotropic and log-concave on , and is the set of homogeneous linear classifiers on . Then Algorithm 1 with inputs example oracle , labelling oracle , hypothesis class , confidence-rated predictor of Algorithm 3, target excess error and target confidence satisfies the following properties.
With probability :
(1) In the realizable case, there exists some absolute constant such that the total number of labels queried is at most .
(2) In the agnostic case, there exists some absolute constant such that the total number of labels queried is at most .
(3) If -Tsybakov Noise condition holds for with respect to , then there exists some constant (that depends on ) such that the total number of labels queried is at most .
In the realizable case, our bound matches [BL13]. For disagreement-based algorithms, the bound is , which is worse by a factor of . [BL13] does not address the fully agnostic case directly; however, if is known a-priori, then their algorithm can achieve roughly the same label complexity as ours.
For the Tsybakov Noise Condition with , [BBZ07, BL13] provide a label complexity bound for with an algorithm that has a-priori knowledge of and . We get a slightly better bound. On the other hand, a disagreement-based algorithm [Han13] gives a label complexity of . Again our bound is better by a factor of over disagreement-based algorithms. For , we can tighten our label complexity to get a bound, which again matches [BL13], and is better than the one provided by disagreement-based algorithms [Han13].
4 Related Work
Active learning has seen a lot of progress over the past two decades, motivated by vast amounts of unlabelled data and the high cost of annotation [Set10, Das11, Han13]. According to [Das11], the two main threads of research are exploitation of cluster structure [UWBD13, DH08], and efficient search in hypothesis space, which is the setting of our work. We are given a hypothesis class , and the goal is to find an that achieves a target excess generalization error, while minimizing the number of label queries.
Three main approaches have been studied in this setting. The first and most natural one is generalized binary search [FSST97, Das04, Das05, Now11], which was analyzed in the realizable case by [Das05] and in various limited noise settings by [Kää06, Now11, NJC13]. While this approach has the advantage of low label complexity, it is generally inconsistent in the fully agnostic setting [DH08]. The second approach, disagreement-based active learning, is consistent in the agnostic PAC model. [CAL94] provides the first disagreement-based algorithm for the realizable case. [BBL09] provides an agnostic disagreement-based algorithm, which is analyzed in [Han07] using the notion of disagreement coefficient. [DHM07] reduces disagreement-based active learning to passive learning; [BDL09] and [BHLZ10] further extend this work to provide practical and efficient implementations. [Han09, Kol10] give algorithms that are adaptive to the Tsybakov Noise condition. The third line of work [BBZ07, BL13, ABL14] achieves a better label complexity than disagreement-based active learning for linear classifiers on the uniform distribution over the unit sphere and log-concave distributions. However, a limitation is that these algorithms apply only to these specific settings, and it is not apparent how to apply them generally.
Research on confidence-rated prediction has been mostly focused on empirical work, with relatively less theoretical development. Theoretical work on this topic includes KWIK learning [LLW08], conformal prediction [SV08] and the weighted majority algorithm of [FMS04]. The closest to our work is the recent learning-theoretic treatment by [EYW10, EYW11]. [EYW10] addresses confidence-rated prediction with guaranteed error in the realizable case, and provides a predictor that abstains in the disagreement region of the version space. This predictor achieves zero error, and coverage equal to the measure of the agreement region. [EYW11] shows how to extend this algorithm to the non-realizable case and obtain zero error with respect to the best hypothesis in . Note that the predictors in [EYW10, EYW11] generally achieve less coverage than ours for the same error guarantee; in fact, if we plug them into our Algorithm 1, then we recover the label complexity bounds of disagreement-based algorithms [DHM07, Han09, Kol10].
A formal connection between disagreement-based active learning in the realizable case and perfect confidence-rated prediction (with a zero error guarantee) was established by [EYW12]. Our work can be seen as a step towards bridging these two areas, by demonstrating that active learning can be further reduced to imperfect confidence-rated prediction, with potentially higher label savings.
We thank NSF under IIS-1162581 for research support. We thank Sanjoy Dasgupta and Yoav Freund for helpful discussions. CZ would also like to thank Liwei Wang for introducing the problem of selective classification to him.
- [ABL14] P. Awasthi, M-F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. In STOC, 2014.
- [BBL09] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. J. Comput. Syst. Sci., 75(1):78–89, 2009.
- [BBZ07] M.-F. Balcan, A. Z. Broder, and T. Zhang. Margin based active learning. In COLT, 2007.
- [BDL09] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In ICML, 2009.
- [BHLZ10] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In NIPS, 2010.
- [BL13] M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-concave distributions. In COLT, 2013.
- [CAL94] D. A. Cohn, L. E. Atlas, and R. E. Ladner. Improving generalization with active learning. Machine Learning, 15(2), 1994.
- [Das04] S. Dasgupta. Analysis of a greedy active learning strategy. In NIPS, 2004.
- [Das05] S. Dasgupta. Coarse sample complexity bounds for active learning. In NIPS, 2005.
- [Das11] S. Dasgupta. Two faces of active learning. Theor. Comput. Sci., 412(19), 2011.
- [DH08] S. Dasgupta and D. Hsu. Hierarchical sampling for active learning. In ICML, 2008.
- [DHM07] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In NIPS, 2007.
- [EYW10] R. El-Yaniv and Y. Wiener. On the foundations of noise-free selective classification. JMLR, 2010.
- [EYW11] R. El-Yaniv and Y. Wiener. Agnostic selective classification. In NIPS, 2011.
- [EYW12] R. El-Yaniv and Y. Wiener. Active learning via perfect selective classification. JMLR, 2012.
- [FMS04] Y. Freund, Y. Mansour, and R. E. Schapire. Generalization bounds for averaged classifiers. The Ann. of Stat., 32, 2004.
- [FSST97] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, 1997.
- [Han07] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.
- [Han09] S. Hanneke. Adaptive rates of convergence in active learning. In COLT, 2009.
- [Han13] S. Hanneke. A statistical theory of active learning. Manuscript, 2013.
- [Hsu10] D. Hsu. Algorithms for Active Learning. PhD thesis, UC San Diego, 2010.
- [HY12] S. Hanneke and L. Yang. Surrogate losses in passive and active learning. CoRR, abs/1207.3772, 2012.
- [Kää06] M. Kääriäinen. Active learning in the non-realizable case. In ALT, 2006.
- [Kol10] V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. JMLR, 2010.
- [LLW08] L. Li, M. L. Littman, and T. J. Walsh. Knows what it knows: a framework for self-aware learning. In ICML, 2008.
- [NJC13] M. Naghshvar, T. Javidi, and K. Chaudhuri. Noisy Bayesian active learning. In Allerton, 2013.
- [Now11] R. D. Nowak. The geometry of generalized binary search. IEEE Transactions on Information Theory, 57(12):7893–7906, 2011.
- [Set10] B. Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison, 2010.
- [SV08] G. Shafer and V. Vovk. A tutorial on conformal prediction. JMLR, 2008.
- [Tsy04] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32:135–166, 2004.
- [UWBD13] R. Urner, S. Wulff, and S. Ben-David. PLAL: Cluster-based active learning. In COLT, 2013.
Appendix A Additional Notation and Concentration Lemmas
We begin with some additional notation that will be used in the subsequent proofs. Recall that we define:
where is the VC dimension of the hypothesis class .
The following lemma is an immediate corollary of the multiplicative VC bound; we pick the version of the multiplicative VC bound due to [Hsu10].
Pick any , . Let be a set of iid copies of drawn from a distribution over labelled examples. Then, the following hold with probability at least over the choice of :
(1) For all ,
In particular, all classifiers in consistent with satisfy
(2) For all in ,
Where is defined in Equation (3).
We occasionally use the following (weaker) version of Lemma 7.
Pick any , . Let be a set of iid copies of . The following holds with probability at least : (1) For all ,
(2) For all in ,
Where is defined in Equation (3).
For an unlabelled sample , we use to denote the joint distribution over induced by uniform distribution over and . We have:
If the size of of the unlabelled dataset is at least , then with probability , the following conditions hold for all :
If the size of of the unlabelled dataset is at least , then with probability , the following hold:
(1) The outputs of any confidence-rated predictor with inputs hypothesis set , unlabelled data , and error bound satisfy: