Given the high cost of obtaining labels for big datasets, interactive learning is gaining popularity in both the practice and theory of machine learning. On the practical side, there has been increasing interest in designing algorithms capable of engaging domain experts in two-way queries to facilitate more accurate and more effort-efficient learning systems (cf. [26, 31]). On the theoretical side, the study of interactive learning has led to significant advances, such as exponential improvements in query complexity over passive learning under certain conditions (cf. [5, 6, 7, 15, 19, 27]). While most of these approaches fix the form of an oracle, e.g., the labeling oracle, and explore the best way of querying it, recent work allows for multiple diverse forms of oracles [12, 13, 16, 33]. The focus of this paper is on this latter setting, also known as active dual supervision. We investigate how to recover a hypothesis that is a good approximation of the optimal classifier, in terms of expected 0/1 error, given limited access to labels on individual instances and to pairwise comparisons indicating which of two given instances is more likely to belong to the +1/-1 class.
Our study is motivated by important applications where comparisons are easier to obtain than labels, and the algorithm can leverage both types of oracles to improve label and total query complexity. For example, in material design, synthesizing materials for specific conditions requires expensive experimentation, but with an appropriate algorithm we can leverage the expertise of material scientists, for whom it may be hard to accurately assess the resulting material properties, but who can quickly compare different input conditions and suggest which ones are more promising. Similarly, in clinical settings, precise assessment of each individual patient's health status can be difficult, expensive, and/or risky (e.g., it may require invasive sensors or diagnostic surgeries), but comparing the relative statuses of two patients at a time may be relatively easy and accurate. In both scenarios we may have access to a modest amount of individually labeled data, but the bulk of more accessible training information is available via pairwise comparisons. There are many other examples where humans find it easier to perform pairwise comparisons than to provide direct labels, including content search [31], ranking, etc.
Despite many successful applications of comparison oracles, many fundamental questions remain. One of them is how to design noise-tolerant, cost-efficient algorithms that can approximate the unknown target hypothesis to arbitrary accuracy while having access to pairwise comparisons. On one hand, while there is theoretical analysis of pairwise comparisons for the tasks of learning to rank [3, 22], estimating ordinal measurement models, and learning combinatorial functions, much remains unknown about how to extend these results to more generic hypothesis classes. On the other hand, although we have seen great progress on using single or multiple oracles with the same form of interaction [9, 16], classification using both comparison and labeling queries remains an interesting open problem. Independently of our work, Kane et al. concurrently analyzed a similar setting of learning to classify using both label and comparison queries. However, their algorithms work only in the noise-free setting.
Our Contributions: Our work addresses the aforementioned issues by presenting a new algorithm, Active Data Generation with Adversarial Comparisons (ADGAC), which learns a classifier with both noisy labeling and noisy comparison oracles.
We analyze ADGAC under the Tsybakov noise condition (TNC) and the adversarial noise condition for the labeling oracle, along with an adversarial noise condition for the comparison oracle. Our general framework can augment any active learning algorithm by replacing its batch sampling with ADGAC. Figure 1 presents the workflow of our framework.
We propose the A-ADGAC algorithm, which can learn an arbitrary hypothesis class. Its label complexity is as small as that of learning a threshold function under both TNC and the adversarial noise condition, independently of the structure of the hypothesis class. The total query complexity improves over the previous best-known results under TNC, which only access the labeling oracle.
We derive Margin-ADGAC to learn the class of halfspaces. This algorithm has the same label and total query complexity as A-ADGAC, but is computationally efficient.
We present lower bounds on total query complexity for any algorithm that can access both labeling and comparison oracles, and a noise tolerance lower bound for our algorithms. These lower bounds demonstrate that our analysis is nearly optimal.
[Table 1 — columns: Label Noise, Work, # Label, # Query]
An important quantity governing the performance of our algorithms is the adversarial noise level of comparisons: denote by Δ_comp(ε, δ) the largest comparison noise level under which an algorithm can still achieve an error of ε with probability at least 1 − δ. Table 1 compares our results with previous work in terms of label complexity, total query complexity, and Δ_comp for a generic hypothesis class with error ε. We see that our results significantly improve over prior work thanks to the extra comparison oracle. Denote by d the VC dimension of the hypothesis class H and by θ the disagreement coefficient. We also compare results in Table 2 for learning halfspaces under isotropic log-concave distributions. In both cases, our algorithms enjoy small label complexity that is independent of d and θ. This is helpful when labels are very expensive to obtain. Our algorithms also enjoy better total query complexity under both TNC and the adversarial noise condition for efficiently learning halfspaces.
[Table 2 — columns: Label Noise, Work, # Label, # Query, Efficient?]
Notations: We study the problem of learning a classifier h: X → Y = {−1, +1}, where X and Y are the instance space and label space, respectively. Denote by D the distribution over X × Y and let D_X be the marginal distribution of D over X. A hypothesis class H is a set of functions h: X → Y. For any function h, define the error of h under a distribution D′ over X × Y as err_{D′}(h) = Pr_{(x,y)∼D′}[h(x) ≠ y]. Let ν = min_{h∈H} err_D(h). Suppose that h* satisfies err_D(h*) = ν. For simplicity, we assume that such an h* exists in the class H.
We apply the concept of disagreement coefficient from Hanneke for a generic hypothesis class in this paper. In particular, for any set V ⊆ H, we denote by DIS(V) = {x ∈ X : ∃ h_1, h_2 ∈ V, h_1(x) ≠ h_2(x)} its disagreement region. The disagreement coefficient is θ = sup_{r>0} Pr_{D_X}[DIS(B(h*, r))]/r, where B(h*, r) = {h ∈ H : Pr_{x∼D_X}[h(x) ≠ h*(x)] ≤ r}.
Problem Setup: We analyze two kinds of noise conditions for the labeling oracle, namely, adversarial noise condition and Tsybakov noise condition (TNC). We formally define them as follows.
Condition 1 (Adversarial Noise Condition for Labeling Oracle).
Distribution D satisfies the adversarial noise condition for the labeling oracle with parameter ν ≥ 0 if ν = Pr_{(x,y)∼D}[h*(x) ≠ y].
Condition 2 (Tsybakov Noise Condition for Labeling Oracle).
Distribution D satisfies the Tsybakov noise condition for the labeling oracle with parameters κ ≥ 1, μ > 0, if for every h: X → Y, err_D(h) − err_D(h*) ≥ μ · Pr_{x∼D_X}[h(x) ≠ h*(x)]^κ. The special case of κ = 1 is also called the Massart noise condition.
For TNC, we assume that the above-defined h* is the Bayes optimal classifier, i.e., h*(x) = sign(η(x) − 1/2) [14, 18], where η(x) = Pr[y = 1 | x]. (The assumption that h* is the Bayes optimal classifier can be relaxed if the approximation error of H can be quantified under assumptions on the decision boundary.) In the classic active learning scenario, the algorithm has access to an unlabeled pool drawn from D_X. The algorithm can then query the labeling oracle for any instance from the pool. The goal is to find an h ∈ H such that err_D(h) ≤ ν + ε. The labeling oracle takes an instance x as input and outputs a label y according to Pr[y | x]. In our setting, however, an extra comparison oracle is available. This oracle takes as input a pair of instances (x_1, x_2), and returns a variable Z(x_1, x_2) ∈ {−1, 1}, where Z(x_1, x_2) = 1 indicates that x_1 is more likely to be positive, and Z(x_1, x_2) = −1 otherwise. In this paper, we discuss an adversarial noise condition for the comparison oracle; we discuss dealing with TNC on the comparison oracle in the appendix.
Condition 3 (Adversarial Noise Condition for Comparison Oracle).
Distribution D satisfies adversarial noise with parameter ν′ ≥ 0 for the comparison oracle, if ν′ is the probability that the comparison feedback disagrees with the true ordering of the labels, i.e., ν′ = Pr_{(x_1,y_1),(x_2,y_2)}[Z(x_1, x_2)(y_1 − y_2) < 0].
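As a concrete illustration of the two oracles (a toy sketch with hypothetical names, not part of the paper's formal setup), a labeling oracle can be simulated from a conditional probability η(x), and a comparison oracle from a score function with a fraction of flipped answers standing in for comparison noise; note that an adversary could place that noise mass anywhere, while this sketch flips uniformly at random:

```python
import random

def make_label_oracle(eta):
    """Noisy labeling oracle: returns +1 with probability eta(x), else -1.
    `eta` plays the role of Pr[y = 1 | x]; the name is ours."""
    def query(x):
        return 1 if random.random() < eta(x) else -1
    return query

def make_comparison_oracle(score, error_rate=0.05):
    """Comparison oracle built from a score function: returns +1 if x1 is
    judged more likely positive than x2. Uniform random flips are only one
    instance of the adversarial condition, used here for illustration."""
    def compare(x1, x2):
        z = 1 if score(x1) > score(x2) else -1
        if random.random() < error_rate:
            z = -z
        return z
    return compare
```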
For an interactive learning algorithm A, given error ε and failure probability δ, let S_comp(ε, δ) and S_label(ε, δ) be the comparison and label complexity, respectively. The query complexity of A is defined as the sum of its label and comparison complexity. Similar to the definition of Δ_comp, define Δ_label as the maximum labeling noise level such that algorithm A achieves an error of at most ε with probability 1 − δ. In summary, A learns an h with err_D(h) ≤ ε with probability 1 − δ using S_comp comparisons and S_label labels, provided the labeling noise is at most Δ_label and the comparison noise is at most Δ_comp. We omit the parameters of these quantities when they are clear from the context. We use O(·) to express sample complexity and noise tolerance, and Õ(·) to ignore log factors. Table 3 summarizes the main notations used throughout the paper.
| Notation | Meaning | Notation | Meaning |
|---|---|---|---|
| H | Hypothesis class | κ | Tsybakov noise level (labeling) |
| x, X | Instance & instance space | ν | Adversarial noise level (labeling) |
| y, Y | Label & label space | ν′ | Adversarial noise level (comparison) |
| Z | Comparison & comparison space | err_D(h) | Error of h on distribution D |
| d | VC dimension of H | S_label | Label complexity |
| θ | Disagreement coefficient | S_comp | Comparison complexity |
| h* | Optimal classifier in H | Δ_label | Noise tolerance (labeling) |
| g | Optimal scoring function | Δ_comp | Noise tolerance (comparison) |
3 Active Data Generation with Adversarial Comparisons (ADGAC)
The hardness of learning from pairwise comparisons stems from the error of the comparison oracle: the comparisons are noisy, and can be asymmetric and intransitive, meaning that a human might give contradicting preferences such as x_1 ≻ x_2, x_2 ≻ x_3, x_3 ≻ x_1 (where ≻ denotes a preference). This makes traditional methods fail, e.g., defining a function class of classifiers that compare each instance against fixed reference instances, because such a class may have infinite VC dimension.
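A minimal toy example (ours, not from the paper) of an intransitive comparator: with cyclic preferences no consistent ranking exists, which is exactly why naive comparison-based classifiers are brittle.

```python
# A cyclic (intransitive) comparator over three items: a beats b,
# b beats c, and c beats a, so no consistent total order exists.
BEATS = {("a", "b"), ("b", "c"), ("c", "a")}

def compare(x1, x2):
    return 1 if (x1, x2) in BEATS else -1

# Every item both beats and loses to some other item.
prefs = [(x, y) for x in "abc" for y in "abc" if x != y and compare(x, y) == 1]
assert ("a", "b") in prefs and ("b", "c") in prefs and ("c", "a") in prefs
```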
In this section, we propose a novel algorithm, ADGAC, to address this issue. Having access to both comparison and labeling oracles, ADGAC generates a labeled dataset by techniques inspired from group-based binary search. We show that ADGAC can be combined with any active learning procedure to obtain interactive algorithms that can utilize both labeling and comparison oracles. We provide theoretical guarantees for ADGAC.
3.1 Algorithm Description
To illustrate ADGAC, we start with a general active learning framework in Algorithm 1. Many active learning algorithms can be adapted to this framework, such as A² and margin-based active learning algorithms [6, 5]. Here R represents the querying space/disagreement region of the algorithm (i.e., we reject an instance x if x ∉ R), and V represents a version space consisting of potential classifiers. For example, the A² algorithm can be adapted to Algorithm 1 straightforwardly by keeping R as the sample space and V as the version space. More concretely, the A² algorithm for adversarial noise can be characterized by updates of V and R in terms of the algorithm's parameters, with R taken as the disagreement region of V. Margin-based active learning can also be fitted into Algorithm 1 by taking V as the halfspace that (approximately) minimizes the hinge loss, and R as the region within the margin of that halfspace.
To efficiently apply the comparison oracle, we propose to replace step 5 in Algorithm 1 with a subroutine, ADGAC, that has access to both comparison and labeling oracles. Subroutine 2 describes ADGAC. It takes as input a dataset and a sampling number. ADGAC first runs the Quicksort algorithm on the dataset using feedback from the comparison oracle, which is of the form Z(x_1, x_2). Given that the comparison oracle might be asymmetric w.r.t. its two arguments, i.e., Z(x_1, x_2) may not equal −Z(x_2, x_1), for each pair we randomly choose (x_1, x_2) or (x_2, x_1) as the input to the oracle. After Quicksort, the algorithm divides the data into multiple groups of equal size, and performs group-based binary search by sampling labels from each group and determining the label of each group by majority vote.
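The steps above can be sketched as follows. This is our illustrative reconstruction, not the authors' Subroutine 2: the parameter choices, tie handling, and exact search schedule are simplified.

```python
import functools
import random

def adgac(S, compare, label, group_size, votes_per_group):
    """Sketch of ADGAC: sort S with the noisy comparison oracle, split it
    into consecutive groups, then binary-search for the group where labels
    flip from -1 to +1, deciding each probed group by majority vote."""
    # Randomize argument order per query since Z may be asymmetric.
    def cmp(a, b):
        return -compare(b, a) if random.random() < 0.5 else compare(a, b)
    ranked = sorted(S, key=functools.cmp_to_key(cmp))
    groups = [ranked[i:i + group_size] for i in range(0, len(ranked), group_size)]

    def majority(group):
        sample = random.sample(group, min(votes_per_group, len(group)))
        return 1 if sum(label(x) for x in sample) > 0 else -1

    lo, hi = 0, len(groups)  # invariant: groups before lo vote -1, from hi on vote +1
    while lo < hi:
        mid = (lo + hi) // 2
        if majority(groups[mid]) == 1:
            hi = mid
        else:
            lo = mid + 1
    # Label everything below the found boundary -1 and the rest +1.
    return [(x, -1) for g in groups[:lo] for x in g] + \
           [(x, 1) for g in groups[lo:] for x in g]
```

With a noiseless comparator and a label boundary aligned to a group boundary, the sketch recovers all labels while querying the labeling oracle only O(log(#groups)) times.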
3.2 Theoretical Analysis of ADGAC
Before we combine ADGAC with active learning algorithms, we provide theoretical results for ADGAC itself. By its algorithmic procedure, ADGAC reduces the problem of labeling the whole dataset to binary searching for a threshold on the sorted list. One can show that there cannot be too many conflicting instances within each group, and thus binary search performs well in our algorithm. We also use existing guarantees for Quicksort to obtain an error estimate of the sorting step. We have the following result based on the above arguments.
Suppose that Conditions 2 and 3 hold for D. Assume a set S is sampled i.i.d. from D and U is an arbitrary subset of S. There exist absolute constants such that if we run Subroutine 2 on U with appropriately chosen group size and sampling number, it will output a labeling of U with error at most ε, with probability at least 1 − δ; the expected number of comparisons required is nearly linear in the size of U, while the number of sample-label pairs required is independent of it.
Similarly, we analyze ADGAC under the adversarial noise condition w.r.t. the labeling oracle.
Suppose that Conditions 1 and 3 hold for D. Assume a set S is sampled i.i.d. from D and U is an arbitrary subset of S. There exist absolute constants such that if we run Subroutine 2 on U with appropriately chosen parameters, it will output a labeling of U with error at most ε, with probability at least 1 − δ; the expected number of comparisons required is nearly linear in the size of U, while the number of sample-label pairs required is independent of it.
Theorems 4 and 5 show that ADGAC produces a labeling of the dataset with arbitrarily small error using a label complexity independent of the data size. Moreover, ADGAC is computationally efficient and distribution-free. These properties lead to improved query complexity when we combine ADGAC with other active learning algorithms.
4 A-ADGAC: Learning of Generic Hypothesis Class
In this section, we combine ADGAC with the A² algorithm to learn a generic hypothesis class. We use the framework in Algorithm 1: let A-ADGAC be the algorithm that replaces step 5 in Algorithm 1 with ADGAC, with parameters to be specified later. Under TNC, we have the following result.
Suppose that Conditions 2 and 3 hold. There exist global constants such that if we run A-ADGAC with parameters as specified in Theorem 4, then with probability at least 1 − δ, the algorithm returns a classifier h with err_D(h) ≤ ν + ε, with the comparison and label complexity bounded as stated. The dependence on ε in the label complexity can be further reduced under Massart noise.
We can prove a similar result for adversarial noise condition.
Theorems 6 and 7 show that having access to even a biased comparison oracle reduces the problem of learning a classifier in a high-dimensional space to that of learning a threshold classifier in one dimension, since the label complexity matches that of actively learning a threshold classifier. Given that comparisons are usually easier to obtain, A-ADGAC can yield substantial practical savings due to its small label complexity. More importantly, we improve the total query complexity under TNC by separating the dependence on d and θ from the dependence on ε; the query complexity is now the sum of the two terms instead of their product. This observation shows the power of pairwise comparisons for learning classifiers. Such small label/query complexity is impossible without access to a comparison oracle, since the query complexity with only a labeling oracle depends multiplicatively on d and θ under both TNC and adversarial noise conditions. Our results also match the lower bound for learning with labeling and comparison oracles up to log factors (see Section 6).
We note that Theorems 6 and 7 require a rather small comparison noise level Δ_comp. We will show in Section 6.3 that such a requirement on the comparison noise is necessary in order to obtain a classifier of error ε, if we restrict the use of the labeling oracle to learning only a threshold function. This restriction is what enables the near-optimal label complexity specified in Theorems 6 and 7.
5 Margin-ADGAC: Learning of Halfspaces
In this section, we combine ADGAC with margin-based active learning to efficiently learn the class of halfspaces. Before proceeding, we first mention a naive idea for utilizing comparisons: i.i.d. sample pairs (x_1, x_2), and use the comparison feedback Z(x_1, x_2) as the label of the difference x_1 − x_2. However, this method cannot work well in our setting without additional assumptions on the noise condition for the labeling.
Before proceeding, we assume that D_X is isotropic log-concave on R^d; i.e., it has mean 0 and covariance I, and the logarithm of its density function is concave [5, 6]. The hypothesis class of halfspaces can be represented as H = {h_w(x) = sign(w · x) : w ∈ R^d}. Suppose h*(x) = sign(w* · x) for some w*. Define the τ-scaled hinge loss ℓ_τ(w, x, y) = max(0, 1 − y(w · x)/τ); the expected hinge loss of w is L_τ(w) = E_{(x,y)∼D}[ℓ_τ(w, x, y)].
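In code, the τ-scaled hinge loss on a finite sample reads as follows (a sketch with our own variable names; the paper's parameter schedule for τ is separate):

```python
import numpy as np

def scaled_hinge_loss(w, X, y, tau):
    """Empirical tau-scaled hinge loss: mean of max(0, 1 - y <w, x> / tau)
    over the sample, a plug-in estimate of the expected hinge loss used
    by margin-based active learning."""
    margins = y * (X @ w) / tau
    return np.maximum(0.0, 1.0 - margins).mean()
```

Correctly classified points with margin at least τ contribute zero loss, while misclassified points are penalized linearly in their (scaled) margin violation.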
Margin-based active learning is a concrete example of Algorithm 1: V is a singleton set containing the current (approximate) hinge loss minimizer, and R is the margin region around that minimizer. More concretely, in iteration k the algorithm keeps a single hypothesis w_k that approximately minimizes the hinge loss over the current margin region, and takes R as the band of instances within margin b_k of w_k, where the margin parameters b_k and the hinge loss scales τ_k shrink across iterations according to a fixed schedule.
Let Margin-ADGAC be the algorithm obtained by replacing the sampling step in margin-based active learning with ADGAC, with additional parameters to be specified later. We have the following results under TNC and adversarial noise conditions, respectively.
The proofs of Theorems 8 and 9 differ from the conventional analysis of margin-based active learning in two respects: (a) since we use labels generated by ADGAC, which are not independently sampled from the distribution D, we require new techniques that can deal with adaptive noise; (b) we improve earlier results in their dependence on the error parameter via a new Rademacher complexity analysis.
While a recently proposed perceptron-like algorithm attains small label complexity under Massart and adversarial noise conditions, it works only under uniform distributions over the instance space. In contrast, our algorithm Margin-ADGAC works under the broader class of log-concave distributions. The label and total query complexity of Margin-ADGAC improve over those of traditional active learning. The lower bounds in Section 6 show the optimality of our complexity bounds.
6 Lower Bounds
In this section, we give lower bounds on learning with labeling and pairwise comparison oracles. In Section 6.1, we give a lower bound on the optimal label complexity. In Section 6.2 we use this result to give a lower bound on the total query complexity, i.e., the sum of comparison and label complexity. Our two methods match these lower bounds up to log factors. In Section 6.3, we additionally give an information-theoretic bound on the comparison noise tolerance, which matches our algorithms in the case of Massart and adversarial noise.
Following [19, 20], we assume that there is an underlying score function g such that h*(x) = sign(g(x)). Note that g need not have any relation to η; we only require that g(x) represents how likely a given x is positive. For instance, in digit recognition, g(x) represents how much an image looks like a 7 (or 9); in the clinical setting, g(x) measures the health condition of a patient. Suppose that the distribution of g(x) is continuous, i.e., its probability density function exists and Pr[g(x) = t] = 0 for every t.
6.1 Lower Bound on Label Complexity
The score function g naturally induces a comparison oracle with Z(x_1, x_2) = sign(g(x_1) − g(x_2)). We note that this oracle is invariant to shifts of g, i.e., g and g + c lead to the same comparison oracle for any constant c. As a result, we cannot distinguish sign(g) from sign(g + c) without labels. In other words, pairwise comparisons do not help in improving the label complexity when we are learning a threshold function over the scores, where all instances are in the natural order. So the label complexity of any algorithm is lower bounded by that of learning a threshold classifier, and we formally prove this in the following theorem.
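This shift invariance is easy to see concretely (a toy check of ours, with hypothetical score functions):

```python
def comparison_from_score(g):
    # Oracle induced by a score function: +1 iff g(x1) > g(x2).
    return lambda x1, x2: 1 if g(x1) > g(x2) else -1

g = lambda x: x            # some score function
g_shift = lambda x: x - 3  # shifted score: a different threshold classifier

cmp1, cmp2 = comparison_from_score(g), comparison_from_score(g_shift)
pairs = [(0, 1), (2, -5), (4, 4.5)]
# The two oracles are indistinguishable from comparisons alone...
assert all(cmp1(a, b) == cmp2(a, b) for a, b in pairs)
# ...yet sign(g) and sign(g_shift) label x = 1 differently.
assert (g(1) > 0) != (g_shift(1) > 0)
```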
For any algorithm that can access both labeling and comparison oracles, sufficiently small ε, and any score function g that takes at least two values on X, there exists a distribution satisfying Condition 2 under which the optimal function is of the form sign(g(x) − t) for some threshold t, and the label complexity is lower bounded by that of learning a one-dimensional threshold.
The lower bound in Theorem 10 matches the label complexity of A-ADGAC and Margin-ADGAC up to a log factor, so our algorithms are near-optimal.
6.2 Lower Bound on Total Query Complexity
We use Theorem 10 to give lower bounds on the total query complexity of any algorithm which can access both comparison and labeling oracles.
6.3 Adversarial Noise Tolerance of Comparisons
Note that label queries are typically expensive in practice. Thus it is natural to ask: what is the minimal requirement on the comparison noise level, given that we are only allowed the minimal label complexity of Theorem 10? We study this problem in this section. More concretely, we study the requirement on the comparison noise when we learn a threshold function using labels. Suppose that the comparison oracle gives feedback using a scoring function ĝ, i.e., Z(x_1, x_2) = sign(ĝ(x_1) − ĝ(x_2)), and has error ν′. Below we give a sharp minimax bound on the risk of the optimal classifier of the form sign(ĝ(x) − t) for some threshold t.
Suppose that both g(x) and ĝ(x) have probability density functions. If ĝ induces an oracle with error ν′, then the risk of the best threshold classifier over ĝ is bounded, in a minimax sense, in terms of ν′.
By Theorem 12, we see that a condition on the comparison noise level is necessary if labels are only used to learn a threshold over ĝ. This matches our choice of Δ_comp under Massart and adversarial noise conditions for the labeling oracle up to a small factor.
We presented a general algorithmic framework, ADGAC, for learning with both comparison and labeling oracles. We proposed two variants of the base algorithm, A-ADGAC and Margin-ADGAC, to facilitate low query complexity under Tsybakov and adversarial noise conditions. The performance of our algorithms matches lower bounds for learning with both oracles. Our analysis is relevant to a wide range of practical applications where it is easier, less expensive, and/or less risky to obtain pairwise comparisons than labels.
We thank Chicheng Zhang for insightful ideas on improving our results using Rademacher complexity.
-  S. Agarwal and P. Niyogi. Stability and generalization of bipartite ranking algorithms. In Annual Conference on Learning Theory, pages 32–47, 2005.
-  S. Agarwal and P. Niyogi. Generalization bounds for ranking algorithms via algorithmic stability. Journal of Machine Learning Research, 10:441–474, 2009.
-  N. Ailon and M. Mohri. An efficient reduction of ranking to classification. arXiv preprint arXiv:0710.2889, 2007.
-  J. Attenberg, P. Melville, and F. Provost. A unified approach to active dual supervision for labeling features and examples. In Machine Learning and Knowledge Discovery in Databases, pages 40–55. Springer, 2010.
-  P. Awasthi, M.-F. Balcan, N. Haghtalab, and H. Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Annual Conference on Learning Theory, pages 152–192, 2016.
-  P. Awasthi, M.-F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. Journal of the ACM, 63(6):50, 2017.
-  M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of the 23rd international conference on Machine learning, pages 65–72. ACM, 2006.
-  M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In Annual Conference On Learning Theory, pages 35–50, 2007.
-  M.-F. Balcan and S. Hanneke. Robust interactive learning. In COLT, pages 20–1, 2012.
-  M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-concave distributions. In Annual Conference on Learning Theory, pages 288–316, 2013.
-  M.-F. Balcan, E. Vitercik, and C. White. Learning combinatorial functions from pairwise comparisons. arXiv preprint arXiv:1605.09227, 2016.
-  M.-F. Balcan and H. Zhang. Noise-tolerant life-long matrix completion via adaptive sampling. In Advances in Neural Information Processing Systems, pages 2955–2963, 2016.
-  A. Beygelzimer, D. J. Hsu, J. Langford, and C. Zhang. Search improves label for active learning. In Advances in Neural Information Processing Systems, pages 3342–3350, 2016.
-  S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of some recent advances. ESAIM: probability and statistics, 9:323–375, 2005.
-  R. M. Castro and R. D. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339–2353, 2008.
-  O. Dekel, C. Gentile, and K. Sridharan. Selective sampling and active learning from single and multiple teachers. Journal of Machine Learning Research, 13:2655–2697, 2012.
-  J. Fürnkranz and E. Hüllermeier. Preference learning and ranking by pairwise comparison. In Preference learning, pages 65–82. Springer, 2010.
-  S. Hanneke. Adaptive rates of convergence in active learning. In COLT. Citeseer, 2009.
-  S. Hanneke. Theory of active learning, 2014.
-  S. Hanneke and L. Yang. Surrogate losses in passive and active learning. arXiv preprint arXiv:1207.3772, 2012.
-  R. Heckel, N. B. Shah, K. Ramchandran, and M. J. Wainwright. Active ranking from pairwise comparisons and the futility of parametric assumptions. arXiv preprint arXiv:1606.08842, 2016.
-  K. G. Jamieson and R. Nowak. Active ranking using pairwise comparisons. In Advances in Neural Information Processing Systems, pages 2240–2248, 2011.
-  D. M. Kane, S. Lovett, S. Moran, and J. Zhang. Active classification with comparison queries. arXiv preprint arXiv:1704.03564, 2017.
-  A. Krishnamurthy. Interactive Algorithms for Unsupervised Machine Learning. PhD thesis, Carnegie Mellon University, 2015.
-  L. Lovász and S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307–358, 2007.
-  S. Maji and G. Shakhnarovich. Part and attribute discovery from relative annotations. International Journal of Computer Vision, 108(1-2):82–96, 2014.
-  S. Sabato and T. Hess. Interactive algorithms: from pool to stream. In Annual Conference On Learning Theory, pages 1419–1439, 2016.
-  N. B. Shah, S. Balakrishnan, J. Bradley, A. Parekh, K. Ramchandran, and M. Wainwright. When is it better to compare than to score? arXiv preprint arXiv:1406.6618, 2014.
-  N. Stewart, G. D. Brown, and N. Chater. Absolute identification by relative judgment. Psychological review, 112(4):881, 2005.
-  A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, pages 135–166, 2004.
-  C. Wah, G. Van Horn, S. Branson, S. Maji, P. Perona, and S. Belongie. Similarity comparisons for interactive fine-grained categorization. In IEEE Conference on Computer Vision and Pattern Recognition, pages 859–866, 2014.
-  S. Yan and C. Zhang. Revisiting perceptron: Efficient and label-optimal active learning of halfspaces. arXiv preprint arXiv:1702.05581, 2017.
-  L. Yang and J. G. Carbonell. Cost complexity of proactive learning via a reduction to realizable active learning. Technical report, CMU-ML-09-113, 2009.
-  C. Zhang and K. Chaudhuri. Beyond disagreement-based agnostic active learning. In Advances in Neural Information Processing Systems, pages 442–450, 2014.
Appendix A Our Techniques
Intransitivity: The main challenge of learning with pairwise comparisons is that the comparisons might be asymmetric or intransitive. If we construct a classifier by simply comparing each instance with a fixed instance through the comparison oracle, then the resulting concept class can have infinite VC dimension, so traditional VC theory yields vacuous complexity bounds. To resolve the issue, we conduct a group-based binary search in ADGAC. The intuition is that, by dividing the dataset into several ranked groups, the majority label in each group can be stably determined if we sample enough examples from that group. Therefore, we are able to reduce the original problem in the high-dimensional space to the problem of learning a "threshold" function in one dimension; straightforward approaches such as binary search then learn the threshold function.
Combining with Active Learning Algorithms: If the labels follow Tsybakov noise (i.e., Condition 2), the most straightforward way to combine ADGAC with existing algorithms is to pair it with an algorithm that uses the label oracle only and works under TNC. However, this does not save query complexity. To see this, notice that with ADGAC we can obtain, with low label complexity, a labeling of each round's samples that contains only a small number of errors. However, since the outer active learning algorithm works under TNC, we would still need to query the labels on which ADGAC errs to make sure the labels follow TNC; the label complexity therefore stays the same as that of the original algorithm. To avoid this problem, we combine ADGAC with algorithms designed for adversarial noise in all cases, including TNC. This eliminates the need to query additional labels and also reduces the query complexity.
Handling Independence: We mostly follow previous analyses when combining ADGAC with existing algorithms. However, since we obtain labels from ADGAC instead of directly from D, the labels are not independently sampled, and we need to adapt the proofs to our case. We use different methods for A-ADGAC and Margin-ADGAC: for the former, we use results from PAC learning to bound the error on all samples; for the latter, we decompose the error of any classifier on labels generated by ADGAC into two parts: the first caused by the errors of ADGAC itself, and the second incurred by the classifier on the truthful labels. These techniques enable us to circumvent the independence problem.
Lower Bounds: It is typically hard to provide a unified lower bound for a multi-query learning framework, as several quantities are simultaneously involved in the analysis, e.g., the comparison complexity, the label complexity, and the noise tolerance. So traditional proof techniques for active learning, e.g., Le Cam's and Fano's bounds [15, 19], cannot be trivially applied to our setting. Instead, we prove lower bounds on one quantity by allowing arbitrary budgets for the other quantities. Another non-trivial technique arises in the proof of the minimax bound on the adversarial noise level of the comparison oracle (see Theorem 12): we divide the integration region of the expectation into small segments, upper bound the discrete approximation of the integral by carefully calibrating the noise on each segment, and then take the limit as the segment size goes to zero. The proof leads to a general inequality (Lemma 21) that might be of independent interest.
Appendix B Additional Related Work
It is well known that people are better at making comparisons than providing labels [29, 28]. Comparisons have been widely used to tackle problems in classification, clustering, and ranking [2, 17]. Balcan et al. studied using pairwise comparisons to learn submodular functions on sets. Another related problem is bipartite ranking, which does exactly the opposite of our problem: given a group of binary labels, learn a ranking function that ranks positive samples higher than negative ones.
Interactive learning has wide application in computer vision and natural language processing. There is also abundant literature on interactive ways to improve unsupervised and semi-supervised learning. However, a general statistical analysis of interactive learning for traditional classification tasks is lacking. Balcan and Hanneke analyze class conditional queries (CCQ), where the user gives counterexamples to a given classification. Beygelzimer et al. used a similar idea with search queries. However, these interactions require an oracle that is usually stronger than traditional labelers (i.e., such oracles can simulate traditional active learning), and are generally hard to deploy in practice. There has been little general analysis of "weaker" interactions between human and computer. Balcan and Hanneke also studied abstract query-based notions from exact learning, but their analysis cannot handle queries that give relations between samples (as comparisons do). Our work fills this gap.
We also compare our work to traditional label-based active learning, which has drawn much attention in recent years. Disagreement-based active learning has been shown to reach a near-optimal rate on classification problems. Another line of research is margin-based active learning, which aims at the computational efficiency of learning halfspaces under large-margin assumptions.
Appendix C Learning under TNC for Comparisons
In this section we justify our choice of analyzing the adversarial noise model for the comparison oracle. In fact, any algorithm using adversarial comparisons can be transformed into an algorithm using TNC comparisons, by treating the learning of comparison functions as a separate learning problem. Let H_Z be a hypothesis class consisting of comparison functions Z: X × X → {−1, 1}. Suppose the optimal comparison function Z* ∈ H_Z satisfies the Tsybakov noise condition with some constant, and that H_Z has finite VC dimension and disagreement coefficient. Standard active learning then learns a comparison function of small error with high probability using a bounded number of queries to the comparison oracle. So an algorithm for adversarial noise on comparisons can be automatically transformed into an algorithm for TNC on comparisons, and we therefore analyze only adversarial noise on comparisons in the rest of this paper.
Appendix D Proof of Theorem 4
To bound the error in the labeling produced by ADGAC, we first bound the number of incorrectly sorted pairs due to noise/bias of the comparison oracle. We call a pair an inverse pair if the comparison oracle orders it against the true order (the queried order being decided by randomly choosing one of the two argument orders). We call a pair an anti-sort pair if it appears out of order after sorting by Quicksort. Consider the set of all anti-sort pairs and the sets of all inverse pairs in the full sample and in the subset to be labeled, respectively. We first bound the number of anti-sort pairs by the number of inverse pairs: conditioning on the random bits supplied to Quicksort, known guarantees for Quicksort imply that the expected number of anti-sort pairs is of the same order as the number of inverse pairs.
Notice that sampling a pair i.i.d. from the distribution is equivalent to sampling a set of points and then uniformly picking two distinct points from it. Also, the number of inverse pairs in the subset is at most that in the full sample, so the bound on inverse pairs carries over.
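As an empirical analogue (our sketch, not part of the proof), one can count the anti-sorted pairs that remain after sorting with a given comparator:

```python
import functools
import itertools

def count_discordant(items, score, compare):
    """Count pairs that end up anti-sorted, i.e., out of true-score order
    after sorting with the (possibly noisy) comparator. Names are ours."""
    ranked = sorted(items, key=functools.cmp_to_key(compare))
    pos = {x: i for i, x in enumerate(ranked)}
    return sum(1 for a, b in itertools.combinations(items, 2)
               if (score(a) - score(b)) * (pos[a] - pos[b]) < 0)
```

A perfect comparator yields zero anti-sort pairs; the more the comparator's answers disagree with the true scores, the larger this count grows.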