1 Introduction
In recent years, the availability of big data and the high cost of labeling have led to a surge of interest in active learning, an adaptive, semi-supervised learning paradigm. In traditional active learning, given an instance space, a distribution over that space, and a class of concepts, the learner receives unlabeled samples x from the distribution with the ability to query an oracle for their labels. Classically our goal would be to minimize the number of samples the learner draws before approximately learning the concept class with high probability (PAC-learning). Active learning instead assumes unlabeled samples are inexpensive, and aims to minimize expensive queries to the oracle. While active learning requires exponentially fewer labeled samples than PAC-learning for simple classes such as intervals and thresholds, it fails to provide asymptotic improvement for classes essential to machine learning such as linear separators
Sanjoy. However, recent results show that with slight relaxations or additions to the paradigm, such concept classes can be learned with exponentially fewer queries. In 2013, Balcan and Long Balcan proved that this was the case for homogeneous (through the origin) linear separators, as long as the distribution over the instance space was log-concave, a wide class of distributions generalizing common cases such as Gaussians or uniform distributions over convex sets. Later, Balcan and Zhang sconcave extended this to s-concave distributions, a diverse generalization of log-concavity including fat-tailed distributions. Similarly, El-Yaniv and Wiener ElYaniv proved that non-homogeneous linear separators can be learned with exponentially fewer queries with respect to error over Gaussian distributions, but also showed that their algorithm suffers from a lower bound exponential in the dimension of the instance space.
Kane, Lovett, Moran, and Zhang KLMZ proved that the non-homogeneity barrier could be broken for general distributions in two dimensions by empowering the oracle to compare points rather than just label them. Queries of this type are called comparison queries, and are notable not only for their increase in computational power, but also for their real-world applications, such as in recommender systems rec or for increasing accuracy xu2017noise. Our work adopts a mixture of the approaches of Balcan et al. and Kane et al. We show that by leveraging comparison queries, non-homogeneous linear separators may be learned with exponentially fewer samples as long as the distribution satisfies weak concentration and anticoncentration bounds, conditions realized by, for instance, s-concave distributions.
Further, by leveraging techniques based on inference introduced in the same paper by Kane et al., we can use comparison queries to provide a stronger guarantee than PAC-learning with little cost to query complexity. In the late 80’s, Rivest and Sloan Rivest proposed a competing model to PAC-learning called Reliable and Probably Useful (RPU) learning. This model, which is a learning-theoretic formalization of selective classification introduced by Chow Chow more than two decades earlier, does not allow the learner to make mistakes, but instead allows the answer “I don’t know,” written as “⊥”. Here, error is measured not by the number of misclassified examples, but by the measure of examples on which our learner returns ⊥. RPU-learning was for the most part abandoned by the early 90’s in favor of PAC-learning, as Kivinen Kivinen, Kivinen2 proved that RPU-learning simple concept classes such as rectangles requires an exponential number of samples even under the uniform distribution. However, the model was recently reintroduced by El-Yaniv and Wiener ElYaniv, who termed it perfect selective classification. El-Yaniv and Wiener prove a connection between active and RPU-learning similar to the strategy employed by Kane et al. KLMZ (who refer to RPU-learners as “confident” learners). We will extend the lower bound of El-Yaniv and Wiener to prove that actively RPU-learning linear separators with only labels is exponentially difficult in dimension even for nice distributions. On the other hand, we will further show that comparison queries allow RPU-learning with nearly matching sample and query complexity to PAC-learning.
1.1 Background and Related Work
1.1.1 PAC-learning
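Probably Approximately Correct (PAC)-learning is a framework for learning classifiers over an instance space introduced by Valiant Valiant. Given an instance space $X$, label space $Y$, and a concept class $C$ of concepts $c: X \to Y$, PAC-learning proceeds as follows. First, an adversary chooses a hidden distribution $D$ over $X$ and a hidden classifier $c \in C$. The learner then draws labeled samples from $D$, and outputs a concept $c'$ which it thinks is close to $c$ with respect to $D$. Formally, we define closeness of $c$ and $c'$ as the error:
$$err_D(c', c) = \Pr_{x \sim D}[c'(x) \neq c(x)].$$
We say the pair $(X, C)$ is PAC-learnable if there exists a learner $A$ which, using only $n(\varepsilon, \delta) = \mathrm{Poly}(1/\varepsilon, 1/\delta)$ samples (formally, $n(\varepsilon, \delta)$ must also be polynomial in a number of parameters of $C$), for all $\varepsilon, \delta$ picks a classifier $c'$ that with probability $1 - \delta$ has at most $\varepsilon$ error from $c$. Formally,
$$\forall c \in C, \forall D: \quad \Pr_{S \sim D^{n(\varepsilon, \delta)}}[err_D(A(S), c) \leq \varepsilon] \geq 1 - \delta.$$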
The goal of PAC-learning is to compute the sample complexity and thereby prove whether certain pairs are efficiently learnable. In this paper, we will be concerned with the case of binary classification, where the label space has two elements. In addition, in the case that the concept class is linear separators, we instead write our concepts as the sign of a family of functions: instead of a concept class, we write the hypothesis class $H$, and each $h \in H$ defines a concept $\mathrm{sign}(h)$. The sample complexity of PAC-learning is characterized by the VC dimension VC, Blumer of the concept class, and is given by:
1.1.2 RPU-learning
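Reliable and Probably Useful (RPU)-learning is a stronger variant of PAC-learning introduced by Rivest and Sloan Rivest, in which the learner is reliable: it is not allowed to make errors, but may instead say “I don’t know” (or for shorthand, “⊥”). Since it is easy to make a reliable learner by simply always outputting “⊥”, our learner must be useful, and with high probability cannot output “⊥” more than a small fraction of the time. Let A be a reliable learner. We define the error of A on a sample S with respect to a distribution $D$ and hidden classifier $c$ to be the measure of points on which A outputs “⊥”:
$$err_D(A(S), c) = \Pr_{x \sim D}[A(S)(x) = \bot].$$
We call $1 - err_D(A(S), c)$ the coverage of the learner A. Finally, we say the pair $(X, C)$ of instance space and concept class is RPU-learnable if for all $\varepsilon, \delta$ there exists a reliable learner A which in $n(\varepsilon, \delta) = \mathrm{Poly}(1/\varepsilon, 1/\delta)$ samples has error at most $\varepsilon$ with probability $1 - \delta$:
$$\forall c \in C, \forall D: \quad \Pr_{S \sim D^{n(\varepsilon, \delta)}}[err_D(A(S), c) \leq \varepsilon] \geq 1 - \delta.$$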
RPU-learning is characterized by the VC dimension of certain intersections of concepts Kivinen. Unfortunately, many simple cases turn out to be not RPU-learnable (e.g. rectangles in ), with even relaxations having exponential sample complexity Kivinen2.
1.1.3 Passive vs Active Learning
PAC and RPU-learning traditionally refer to supervised learning, where the learning algorithm receives pre-labeled samples. We call this paradigm passive learning. In contrast, active learning refers to the case where the learner receives unlabeled samples and may adaptively query a labeling oracle. Similar to the passive case, for active learning we study the query complexity, the minimum number of queries needed to learn some pair in either the PAC or RPU learning model. The hope is that by adaptively choosing when to query the oracle, the learner may only need to query a number of samples logarithmic in the sample complexity.
We will discuss two paradigms of active learning: pool-based active learning, and membership query synthesis (MQS) pool, MQS. In the former, the learner has access to a pool of unlabeled data and may request that the oracle label any point. This model matches real-world scenarios where learners have access to large, unlabeled datasets, but labeling is too expensive to use passive learning (e.g. medical imagery). Membership query synthesis allows the learner to synthesize points in the instance space and query their labels. This model is the logical extreme of the pool-based model where our pool is the entire instance space. Because we will be considering learning with a fixed distribution, we will slightly modify MQS: the learner may only query points in the support of the distribution. This is the natural specification to distribution-dependent learning, as it still models the case where our pool is as large as possible.
1.1.4 The Distribution Dependent Case
While PAC and RPU-learning were traditionally studied in the worst-case scenario over distributions, data in the real world is often drawn from distributions with nice properties such as concentration and anticoncentration bounds. As such, there has been a wealth of research into distribution-dependent PAC-learning, where the model has been relaxed only in that some distributional conditions are known. Distribution-dependent learning has been studied in both the passive and the active case Balcan, Long, Long2, Hanneke. Most closely related to our work, Balcan and Long Balcan proved new upper bounds on active and passive learning of homogeneous (through the origin) linear separators in 0-centered log-concave distributions. Later, Balcan and Zhang sconcave extended this to s-concave distributions. We directly extend the original algorithm of Balcan and Long to non-homogeneous linear separators via the inclusion of comparison queries, and leverage the concentration results of Balcan and Zhang to provide an inference-based algorithm for learning under s-concave distributions.
1.1.5 The Point Location Problem
Our results on RPU-learning imply the existence of simple linear decision trees for an important problem in computer science and computational geometry known as the point location problem. Given a set of $n$ hyperplanes in $d$ dimensions, called a hyperplane arrangement of size $n$ and denoted by $H$, it is a classic result that $H$ partitions $\mathbb{R}^d$ into $O(n^d)$ cells. The point location problem is as follows:
Definition 1.1 (Point Location Problem).
Given a hyperplane arrangement $H$ and a point $x$, both in $\mathbb{R}^d$, determine in which cell of $H$ the point $x$ lies.
Instances of this problem show up throughout computer science, such as in k-sum, subset-sum, knapsack, or any variety of other problems Ksum. The best known depth for a linear decision tree (LDT) solving the point location problem is from a recent work of Ezra and Sharir Ezra, who proved the existence of an depth LDT for arbitrary and . The caveat of this work is that the LDT may use arbitrary linear queries, which may be too powerful a model in practice. Kane, Lovett, and Moran KLM offer an depth LDT restricting the model to generalized comparison queries, queries of the form for a point and hyperplanes . These queries are nice as they preserve structural properties of the input such as sparsity, but they still suffer from overcomplication: any still allows an infinite set of queries.
Kane, Lovett, Moran, and Zhang’s KLMZ original work on inference dimension showed that in the worst case, the depth of a comparison LDT for point location is . However, by restricting to have good margin or bounded bit complexity, they build a comparison LDT of depth , which comes with the advantage of drawing from a finite set of queries for a given problem instance. Our work provides another result of this flavor: we will prove that if is drawn from a distribution with weak restrictions, for large enough there exists a comparison LDT with expected depth .
1.2 Our Results
1.2.1 Notation
We begin by introducing notation for our learning models. For a distribution $D$, an instance space $X$, and a hypothesis class $H$, we write the triple $(D, X, H)$ to denote the problem of learning a hypothesis $h \in H$ over $X$ with respect to $D$. When $D$ is the uniform distribution over $X$, we will write $(X, H)$ for convenience. We will further denote by $B^d$ the unit ball in $d$ dimensions, and by $H_d$ hyperplanes in $d$ dimensions. Given $h \in H_d$ and a point $x$, a label query determines $\mathrm{sign}(h(x))$; given two points $x, x'$, a comparison query determines $\mathrm{sign}(h(x) - h(x'))$.
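As a concrete illustration of the two query types, the following minimal Python sketch models both oracles for an affine function. It assumes a comparison query returns the sign of h(x1) - h(x2), which matches the later remark that a comparison on two points behaves like a label query on their difference for the homogeneous hyperplane with the same normal vector. The function names (`make_separator`, `label_query`, `comparison_query`) are hypothetical and chosen only for this example.

```python
def make_separator(normal, offset):
    """Return the affine function h(x) = <normal, x> + offset."""
    def h(x):
        return sum(w * xi for w, xi in zip(normal, x)) + offset
    return h


def sign(value):
    """Sign with the convention sign(0) = +1."""
    return 1 if value >= 0 else -1


def label_query(h, x):
    """Label oracle: which side of the hyperplane x lies on."""
    return sign(h(x))


def comparison_query(h, x1, x2):
    """Comparison oracle (assumed form): sign of h(x1) - h(x2)."""
    return sign(h(x1) - h(x2))


if __name__ == "__main__":
    # Integer data keeps the arithmetic exact, so the identity below holds exactly.
    normal, offset = [3, -4], 2
    h = make_separator(normal, offset)
    h_homogeneous = make_separator(normal, 0)

    x1, x2 = [1, 0], [0, 1]
    difference = [a - b for a, b in zip(x1, x2)]

    print("labels:", label_query(h, x1), label_query(h, x2))
    print("comparison:", comparison_query(h, x1, x2))
    print("label of difference under homogeneous h:",
          label_query(h_homogeneous, difference))
```

The offset cancels in h(x1) - h(x2), which is why the comparison carries information about the normal vector alone.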
In addition, we will separate our models of learnability into combinations of three classes Q, R, and S, where Q = {Label, Comparison}, R = {Passive, Pool, MQS}, and S = {PAC, RPU}. Informally, an element of Q defines our query type, an element of R our learning regime, and an element of S our learning model. Learnability of a triple is then defined by the combination of any choice of query, regime, and model, which we term the Q-R-S learnability of the triple. Note that in Comparison-learning we have both a labeling and comparison oracle.
Finally, we will discuss a number of different measures of complexity for Q-R-S learning triples. For passive learning, we will focus on the sample complexity. For active learning, we will focus on the query complexity. In both cases, we will often drop the confidence parameter and instead give bounds on the expected sample/query complexity for a given error. A bound holding with high probability then follows by repetition and a Chernoff bound. In the case of a finite instance space, we also consider the expected query complexity of perfectly learning it.
As a final note, we will at times use a subscript in our asymptotic notation to suppress factors only dependent on dimension.
1.2.2 PAC-Learning
To show the power of active learning with comparison queries in the PAC-learning model, we will begin by proving lower bounds. In particular, we show that neither active learning nor comparison queries alone provide a significant speedup over passive learning. In order to do this, we will assume the stronger MQS model, as lower bounds here transfer over to the pool-based regime.
Proposition 1.2.
For small enough , and , the query complexity of Label-MQS-PAC learning is:
Thus without enriched queries, active learning fails to significantly improve over passive learning even over a nice, low-dimensional distribution. Likewise, adding comparison queries alone also provides little improvement.
Proposition 1.3.
For small enough , and , the sample complexity of Comparison-Passive-PAC learning is:
Now we can compare the query complexity of active learning with comparisons to the above. For our upper bound, we will assume the pool-based model with a Poly pool size, as upper bounds here transfer to the MQS model. Our algorithm for Comparison-Pool-PAC learning combines a modification of Balcan and Long’s Balcan learning algorithm with noisy thresholding to provide an exponential speedup for non-homogeneous linear separators.
Theorem 1.4.
Let be a log-concave distribution over . Then the query complexity of Comparison-Pool-PAC learning is
Balcan and Long also give a lower bound of for log-concave distributions which carries over to our setting, so this bound is near tight in dimension and error.
1.3 RPU-Learning
In the RPU-learning model, we will first confirm that passive learning with label queries is intractable information theoretically, and continue to show that active learning alone provides little improvement. Unlike in PAC-learning however, we will show that comparisons in this regime provide a significant improvement in not only active, but also passive learning.
Proposition 1.5.
The expected sample complexity of Label-Passive-RPU learning is:
Thus we see that RPU-learning linear separators is intractable for large dimension. Further, active learning with label queries is of the same order of magnitude.
Proposition 1.6.
For all , the query complexity of Label-MQS-RPU learning is:
These two bounds are a generalization of the technique employed by El-Yaniv and Wiener ElYaniv to prove lower bounds for a specific algorithm, and apply to any learner. We further show that this bound is tight up to a logarithmic factor. For passive RPU-learning with comparison queries, we will simply inherit the lower bound from the PAC model (Proposition 1.3).
Corollary 1.7.
For small enough , and , any algorithm that Comparison-Passive-RPU learns must use at least
samples.
Note that unlike for label queries, this lower bound is not exponential in dimension. In fact, we will show that this bound is tight up to a linear factor in dimension, and further that employing comparison queries in general shifts the RPU model from being intractable to losing only a logarithmic factor over PAC-learning in both the passive and active regimes. We need one definition: two distributions over are affinely equivalent if there is an invertible affine map such that .
Theorem 1.8.
Let be a distribution over that is affinely equivalent to a distribution over , for which the following holds:

,

,
The sample complexity of Comparison-Passive-RPU learning is:
and the query complexity of Comparison-Pool-RPU learning is:
Note that the constants have logarithmic dependence on and .
We prove Theorem 1.8 through the theory of inference dimension from KLMZ, which implies the following result for the point location problem as well.
Theorem 1.9.
Let be a distribution satisfying the criterion of Theorem 1.8, , and . Then for large enough there exists an LDT using only label and comparison queries solving the point location problem with expected depth
For ease of viewing, we summarize our main results on expected sample/query complexity in Tables 1 and 2 for the special case of the uniform distribution over the unit ball. The only table entries not novel to this work are the Label-Passive-PAC bounds Long, Long2, and the lower bound on Comparison-Pool/MQS-PAC learning Balcan, lowerbound. Note also that lower bounds for PAC learning carry over to RPU learning.
PAC  Passive  Pool  MQS 

Label  Long, Long2  
Comparison  Balcan, lowerbound 
RPU  Passive  Pool  MQS 

Label  
Comparison 
1.4 Our Techniques
1.4.1 Sphere Packing and Random Polytopes
For the PAC-learning model, our lower bounds rely on packing spherical caps, where a spherical cap is a portion of a ball cut off by some hyperplane . Our results rely on finding a large number of disjoint spherical caps of a large enough volume. In particular, our lower bound argument is as follows:
Spherical Cap Packing Lower Bound: Imagine we are able to pack disjoint spherical caps of volume onto the surface of the unit ball . These caps correspond to potential hyperplanes, over which our adversary may pick a uniform distribution. Now imagine the learner queries points by any method. This means that most caps will not contain points, and with only label queries, the corresponding hyperplanes to these caps are indistinguishable to any learner. Thus with constant probability the learner will err on some cap, giving error.
For the RPU-learning model, our lower bounds rely on the complexity of random polytopes. A random polytope of size over a distribution is the convex hull of a sample , and its complexity is given by the expected probability mass of its convex hull
Random Polytope Complexity Lower Bound: Imagine our adversary chooses a distribution such that with high probability, every point that our learner queries is of the same sign. Thus, the learner cannot infer any points outside the convex hull of the sample. Since we know the relation between this volume which cannot be inferred and the number of points drawn, setting the volume to be gives a lower bound on the query complexity.
These techniques are essentially generalizations of the algorithm-specific lower bounds given by El-Yaniv and Wiener ElYaniv, who also consider random polytope complexity.
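To give a feel for the quantity behind the random polytope argument, the sketch below estimates by Monte Carlo the expected mass left outside the convex hull of a uniform sample. It is only an illustration under stated assumptions: it works in two dimensions with the uniform distribution over the unit disk, and the function names (`sample_unit_disk`, `convex_hull`, `polygon_area`, `expected_uncovered_mass`) are hypothetical rather than taken from the paper.

```python
import math
import random


def sample_unit_disk(rng):
    """Draw one point uniformly from the unit disk by rejection sampling."""
    while True:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            return (x, y)


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    n = len(vertices)
    if n < 3:
        return 0.0
    total = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def expected_uncovered_mass(n, trials=200, seed=0):
    """Estimate the expected fraction of the disk outside the hull of n points."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [sample_unit_disk(rng) for _ in range(n)]
        total += 1.0 - polygon_area(convex_hull(sample)) / math.pi
    return total / trials


if __name__ == "__main__":
    for n in (10, 50, 250):
        print(n, round(expected_uncovered_mass(n), 4))
```

In the lower bound, this uncovered mass is the region a label-only learner cannot infer when every sampled point has the same sign.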
1.4.2 Inference Dimension and Enriched Queries
Our novel RPU-learning upper bounds are based upon the inference dimension paradigm introduced by Kane, Lovett, Moran, and Zhang (KLMZ) in 2017 KLMZ. Using this new combinatorial framework for active learning, Kane et al. provide worst-case upper and lower bounds for an enriched query oracle, and in particular show how to actively PAC-learn linear separators in . Kane et al. focus in particular on one type of enriched query, the comparison query, which allows the oracle to compare two points. Formally, a comparison query on points with underlying function asks:
The inference dimension framework allows for any kind of extended query, i.e. boolean functions on the underlying family of functions. Let be a hypothesis class, and a set of queries. We denote the answers to all queries on by . For a sample and , we adopt the notation of Kane et al. KLMZ and say that infers the point under , denoted
if answers to queries under determine the label of . As an example, consider to be linear separators in dimensions, to be label queries, and our sample to be positively labeled points under some classifier in general position. Due to linearity, any point inside the convex hull of is inferred by under .
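The convex hull example can be explored numerically. The sketch below is a brute-force stand-in for inference, not the method of Kane et al.: it samples random affine functions, keeps those consistent with the labeled sample, and reports a label for a query point only if every consistent function agrees. The sampling ranges and the name `inferred_label` are assumptions made for this example. With finitely many samples it can in principle report agreement where an unsampled consistent separator would disagree.

```python
import random


def affine_value(normal, offset, x):
    """Evaluate h(x) = <normal, x> + offset."""
    return sum(w * xi for w, xi in zip(normal, x)) + offset


def inferred_label(sample, labels, point, num_hypotheses=5000, seed=0):
    """Return the label all sampled consistent separators give `point`.

    Returns +1 or -1 when they agree, and None when they disagree or when
    no sampled separator is consistent with the labeled sample.
    """
    rng = random.Random(seed)
    dim = len(point)
    seen = set()
    for _ in range(num_hypotheses):
        # Hypothetical sampling ranges, wide enough for the demo below.
        normal = [rng.uniform(-1, 1) for _ in range(dim)]
        offset = rng.uniform(-5, 5)
        consistent = all(
            (affine_value(normal, offset, x) > 0) == (y > 0)
            for x, y in zip(sample, labels)
        )
        if consistent:
            seen.add(1 if affine_value(normal, offset, point) > 0 else -1)
    if len(seen) == 1:
        return seen.pop()
    return None


if __name__ == "__main__":
    positives = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
    labels = [1, 1, 1]
    print("inside hull:", inferred_label(positives, labels, (0.5, 0.5)))
    print("outside hull:", inferred_label(positives, labels, (3.0, 3.0)))
```

A point inside the hull is a convex combination of the sample points, so any affine function positive on the sample is positive there; outside the hull, consistent separators can be found on both sides.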
Using this concept, Kane et al. define inference dimension, and show that the framework characterizes worst-case active learning.
Definition 1.10 (Inference Dimension Klmz).
The inference dimension of with query set is the smallest such that for any subset of size , , s.t. infers under .
Kane et al. show that finite inference dimension implies query complexity that is logarithmic in the sample complexity. Let be the number of oracle queries required to answer all queries on a sample of size in the worst case (e.g. for comparison queries via sorting).
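The sorting remark can be made concrete: once the sample is sorted by the value of the underlying function, the answer to every pairwise comparison is determined. The sketch below counts oracle calls made by Python's built-in sort when it is given only a pairwise comparison. The helper name `sort_with_comparison_oracle` and the example function are hypothetical.

```python
import functools
import math
import random


def sort_with_comparison_oracle(points, h):
    """Sort points by h-value using only pairwise comparison queries.

    Returns the sorted list and the number of oracle calls made.
    """
    calls = 0

    def compare(x1, x2):
        nonlocal calls
        calls += 1
        difference = h(x1) - h(x2)
        return (difference > 0) - (difference < 0)

    ordered = sorted(points, key=functools.cmp_to_key(compare))
    return ordered, calls


if __name__ == "__main__":
    rng = random.Random(1)
    n = 256
    points = [(rng.random(), rng.random()) for _ in range(n)]

    def h(x):
        return 0.3 * x[0] - 0.7 * x[1] + 0.1

    ordered, calls = sort_with_comparison_oracle(points, h)
    print("oracle calls:", calls)
    print("n * log2(n):", int(n * math.log2(n)))
    print("all pairs:", n * (n - 1) // 2)
```

The call count grows on the order of n log n, far below the n(n-1)/2 cost of querying every pair directly.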
Theorem 1.11 (Klmz).
Let denote the inference dimension of with query set . Then the expected query complexity of for is:
Further, infinite inference dimension provides a lower bound:
Theorem 1.12 (Klmz).
Assume that the inference dimension of with query set is . Then for , the sample complexity of Q-Pool-PAC learning is:
As the name would suggest, the upper bound derived via inference dimension is based upon a reliable learner that infers a large number of points given a small sample. While not explicitly stated in KLMZ, it follows from the same argument that finite inference dimension gives an upper bound on RPU-learning:
Corollary 1.13.
Let denote the inference dimension of with query set . Then the sample complexity to passively RPU-learn is:
Further, the expected query complexity to actively RPU-learn is:
2 PAC Learning with Comparison Queries
In this section we study PAC learning with comparison queries in both the passive and active cases.
2.1 Lower Bounds
To begin, we prove that over a uniform distribution on a unit ball, learning linear separators with only label queries is hard.
Proposition 2.1 (Restatement of Proposition 1.2).
For small enough , and , the query complexity of Label-MQS-PAC learning is:
Proof.
This follows from a packing argument. The area of a cap of angle is
by Taylor expanding . For small enough , setting to then gives that the area of this cap is at least , and thus that its measure is at least . Since we can pack at least of such caps into the ball, then for small enough we have a packing of at least caps with measure greater than .
Consider an adversary which picks one of these caps to be negative. Say we query only points, then there is at best a probability that we uncover which cap is negative. In the case that we do not, we cannot do anything better than guess which remaining cap is negative. Since there are more than remaining caps for small enough our guess is correct no more than of the time, meaning our failure probability is
∎
To show that our exponential improvement comes from the use of comparisons in combination with active learning, we will prove that using comparisons coupled with passive learning provides no improvement.
Proposition 2.2 (Restatement of Proposition 1.3).
For small enough , and , any algorithm that passively learns with comparison queries must use at least
samples.
Proof.
Let be any hyperplane cutting off a size cap from , and be the parallel hyperplane tangent to the cap. We will consider the distribution of hyperplanes that is uniform over and . Given uniform samples from , the probability that at least one point lands inside the cap is . Let
then for small enough , this probability is . Say no sample lands in , then and are completely indistinguishable by label or comparison queries. Any hypothesis chosen by the learner must label at least half of positive or negative, and will thus have error with either or . Since the distribution over these hyperplanes is uniform, the learner fails with probability at least . Thus in total the probability that the learner fails is at least ∎
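The counting step in this proof rests on a simple fact: n independent samples miss a region of probability mass p with probability (1 - p)^n. The sketch below tabulates how many samples are needed before a cap of a given mass is hit with a target probability. The function names and the numeric values are illustrative assumptions, not quantities from the proof.

```python
import math


def prob_sample_hits_region(mass, n):
    """Probability that at least one of n i.i.d. samples lands in a region
    of probability mass `mass`."""
    return 1.0 - (1.0 - mass) ** n


def samples_needed(mass, target):
    """Smallest n for which the hit probability is at least `target`."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - mass))


if __name__ == "__main__":
    for mass in (0.1, 0.01, 0.001):
        n = samples_needed(mass, 0.5)
        print(f"mass={mass}: n={n}, "
              f"hit probability={prob_sample_hits_region(mass, n):.4f}, "
              f"n * mass={n * mass:.3f}")
```

For small mass, n * mass settles near ln 2, so the number of samples needed to see the cap with constant probability scales inversely with the cap's measure.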
Together, these lower bounds show it is only the combination of active learning and comparison queries which provides an exponential improvement.
2.2 Upper Bounds
For completeness, we will begin by showing that Proposition 1.2 is tight for before moving to our main result for the section.
Proposition 2.3.
The query complexity of Label-MQS-PAC learning is:
Proof.
To begin, we will show that selecting points along the boundary of in a regular fashion (such that their convex hull is the regular k-sided polygon) is enough if all such points have the same label. This follows from the fact that each cap created by the polygon has area and thus probability mass
Taylor approximating sine shows that picking gives Area(Cap) . If all k points are of the same sign (say 1), a hyperplane can only cut through one such cap, and thus labeling the entire disk 1 errs on at most that one cap.
Thus we have reduced to the case where there are one or more points of differing signs. In this scenario, there will be exactly two edges where connected vertices are of different signs, which denotes that the hyperplane passes through both edges. Next, on each of the two caps associated with these edges, we query points in order to find the crossing point of the hyperplane via binary search up to an accuracy of . This reduces the area of unknown labels to the strip connecting these two arcs, which has
probability mass. Picking any consistent hyperplane then finishes the proof.
∎
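The binary search step of this proof can be sketched as follows. The code assumes the boundary arc is parametrized by an angle, that the two endpoints of the arc have different labels, and that the label changes exactly once along the arc. The names `find_crossing_angle` and `arc_label`, and the example hyperplane x = 0.5, are hypothetical.

```python
import math


def find_crossing_angle(label, lo, hi, tol):
    """Binary search for the angle in [lo, hi] where label(theta) changes.

    Returns the estimated crossing angle and the number of label queries.
    """
    lo_label, hi_label = label(lo), label(hi)
    queries = 2
    if lo_label == hi_label:
        raise ValueError("endpoints must have different labels")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        queries += 1
        if label(mid) == lo_label:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0, queries


def arc_label(theta):
    """Label of the boundary point at angle theta under the line x = 0.5."""
    return 1 if math.cos(theta) - 0.5 >= 0 else -1


if __name__ == "__main__":
    estimate, queries = find_crossing_angle(arc_label, 0.0, math.pi / 2, 1e-6)
    print("estimated crossing:", estimate)
    print("true crossing:", math.pi / 3)
    print("label queries used:", queries)
```

Each query halves the uncertain arc, so locating the crossing to accuracy tol costs a number of label queries logarithmic in 1/tol.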
Now we will show that active learning with comparison queries in the PAC-learning model exponentially improves over the passive and label regimes. Our work is closely related to the algorithm of Balcan and Long Balcan, and relies on using comparison queries to reduce to a combination of their algorithm and thresholding. Our bounds will relate to a general set of distributions called isotropic (0-centered, identity variance) log-concave distributions, distributions whose density function $f$ may be written as $e^{g(x)}$ for some concave function $g$. Log-concavity generalizes many natural distributions such as Gaussians and uniform distributions over convex sets. To begin, we will need a few statements regarding isotropic log-concave distributions proved initially by Lovász and Vempala logconcave, and Klivans, Long, and Tang Klivans (here we include additional facts we require for RPU-learning later on).
Fact 2.4 (logconcave, Klivans).
Let D be an arbitrary log-concave distribution in $\mathbb{R}^d$ with probability density function f, and let u and v be normal vectors of homogeneous hyperplanes. The following statements hold, where 3, 4, 5, and 6 assume D is isotropic:
, the difference of i.i.d pairs, is logconcave

may be affinely transformed to an isotropic distribution Iso

s.t. the angle between and , denoted , satisfies


All marginals of are isotropic logconcave

If
Using these facts, we will give an upper bound for the pool-based model assuming a pool of Poly unlabeled samples. For a sketch of the algorithm, see Figure 1.
Theorem 2.5 (Restatement of Theorem 1.4).
Let be a log-concave distribution over . The query complexity of Comparison-Pool-PAC learning is
Proof.
Recall that D may be affinely transformed into an isotropic distribution Iso(D). Further, we may simulate queries over Iso(D) by applying the same transformation to our samples, and after learning over Iso(D), we may transform our learner back to D. Thus learning over Iso(D) is equivalent to learning over D, and we will assume D is isotropic without loss of generality. Our algorithm will first learn a “homogenized” version of the hidden separator via Balcan and Long’s algorithm, thereby reducing to thresholding. Note that a comparison query on two points is equivalent to a label query on their difference with respect to the homogeneous hyperplane with the same normal vector. Writing the hidden separator as $h(x) = \langle v, x \rangle + b$:
$$\mathrm{sign}(h(x_1) - h(x_2)) = \mathrm{sign}(\langle v, x_1 - x_2 \rangle).$$
We begin by drawing samples from the log-concave distribution and then apply Balcan and Long’s algorithm Balcan to learn the homogenized version of () up to error with probability using only
comparison queries. Further, since the constant given in item of Fact 2.4 is universal, this means any separator output by the algorithm has a normal vector with angle
Having learned an approximation to , we turn our attention to approximating . Consider the set of points on which and disagree, that is:
To find an approximation for , we need to show that there will be correctly labeled points close to the threshold. To this end, let and define such that:
We will show that drawing a sample of points, the following three statements hold with at least probability:
Since the measure of the regions defined in statements 1 and 2 is , the probability that does not have at least one point in both regions is with an appropriate constant.
To prove the third statement, assume for contradiction that there exists such that . Because and differ in sign, this implies that , where is the projection of onto the plane spanned by u and . We can bound the probability of this event occurring by the concentration of isotropic log-concave distributions:
(1) 
Because we have bounded the angle between and , with a large enough constant for we have:
Then with a large enough constant for , union bounding over gives that the third statement occurs with probability at most .
We have proved that with probability , statements 1, 2, and 3 hold. Further, if these statements hold, any hyperplane we pick consistent with thresholding will disagree on at most probability mass from due to the anticoncentration of isotropic log-concave distributions and the definition of . Further, repeating this process times and taking the median shift value gives the same statement with probability at least by a Chernoff bound. Note that the number of queries made in this step is dominated by the number of queries to learn .
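The median step is a standard confidence-boosting device, and a small simulation shows its effect. The sketch below assumes each run returns a value within the desired accuracy with probability 3/4 and an arbitrary far-off value otherwise. These numbers, and the names `noisy_estimate`, `median_of_repetitions`, and `failure_rate`, are illustrative assumptions rather than quantities from the proof.

```python
import random
import statistics


def noisy_estimate(true_value, accuracy, success_prob, rng):
    """One run: accurate with probability success_prob, far off otherwise."""
    if rng.random() < success_prob:
        return true_value + rng.uniform(-accuracy, accuracy)
    direction = rng.choice([-1, 1])
    return true_value + direction * rng.uniform(10 * accuracy, 100 * accuracy)


def median_of_repetitions(true_value, accuracy, success_prob, repetitions, rng):
    """Median of independent runs (use an odd number of repetitions)."""
    return statistics.median(
        noisy_estimate(true_value, accuracy, success_prob, rng)
        for _ in range(repetitions)
    )


def failure_rate(repetitions, trials=2000, seed=0):
    """Fraction of trials in which the median misses the accuracy target."""
    rng = random.Random(seed)
    accuracy = 0.1
    failures = sum(
        abs(median_of_repetitions(0.0, accuracy, 0.75, repetitions, rng)) > accuracy
        for _ in range(trials)
    )
    return failures / trials


if __name__ == "__main__":
    for repetitions in (1, 5, 15, 31):
        print(repetitions, failure_rate(repetitions))
```

The median can only be inaccurate if at least half of the runs are, and a Chernoff bound makes that event exponentially unlikely in the number of repetitions.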
Finally, we need to analyze the error of our proposed hyperplane . We have already proved that the error between this and is with probability at least , so it is enough to show that . This follows similarly to statement 3 above. The portion of Dis satisfying has probability mass at most by anticoncentration. With a large enough constant for , the remainder of Dis has mass at most by (1). Then in total, with probability , has error at most .
∎
Balcan and Long Balcan provide a lower bound on query complexity for log-concave distributions and oracles for any binary query of , so this algorithm is tight up to logarithmic factors.
3 RPU Learning with Comparison Queries
Kivinen Kivinen2 showed that RPU-learning is intractable for nice concept classes even under simple distributions when restricted to label queries. We will confirm that RPU-learning linear separators with only label queries is intractable in high dimensions, but can be made efficient in both the passive and active regimes via comparison queries.
3.1 Lower bounds
In the passive, label-only case, RPU-learning is lower bounded by the expected number of vertices on a random polytope drawn from our distribution. For simple distributions such as uniform over the unit ball, this gives sample complexity which is exponential in dimension, making RPU-learning impractical for any sort of high-dimensional data.
Definition 3.1.
Given a distribution and parameter , we denote by the minimum size of a sample drawn i.i.d from such that the expected measure of the convex hull of , which we denote for , is .
The quantity , which has been studied in computational geometry for decades ball, ballMQS, lower bounds Label-Passive-RPU learning, and in some cases provides a matching upper bound up to log factors.
Proposition 3.2.
Let D be any distribution on . The expected sample complexity of Label-Passive-RPU learning is:
Proof.
For any sample size , there exists a hyperplane with small enough negative measure such that the probability of drawing one or more negative points is . Further, given that a drawn sample is entirely positive, for each point outside the convex hull of there exists a hyperplane consistent with that labels the point positively, and one that labels the point negatively. Thus, as long as our sample is entirely positive, any algorithm which labels points outside of the convex hull will err on some consistent hyperplane.
Recall that is the minimum size of the sample which needs to be drawn such that is in expectation. Consider drawing a sample of size . The expected measure is then
This in turn implies a bound by the Markov inequality on the probability of the measure of the convex hull of a given sample, which we denote :
Now consider the following relation between samples of size and , which follows by viewing our size sample as distinct samples of size at least :
Combining these results and letting :
To force any learner to fail on a sample, we need two conditions: first that the measure of the convex hull is , and second that all points are of the same sign. For the latter, we argued we could pick any probability such that this occurs. Picking then gives the desired success bound:
∎
Further, for simple distributions such as uniform over a ball, this bound is tight up to a factor.
Proposition 3.3.
The sample complexity of LabelPassiveRPU learning is:
Proof.
We will begin by computing for a ball. The expected measure of a sample drawn randomly from is computed in Wie, and given by
where is a constant depending only on dimension. Setting then gives:
Given a sample of size , let denote the subset of positively labeled points, and negatively labeled. We can infer at least the points inside the convex hulls of and . Our goal is to show that, with high probability, the measure of is . To show this, we will employ the fact ball that the expected measure of the convex hull of a sample of size uniformly drawn from any convex body is lowerbounded by:
Given this, let of measure be the set of positive points, and the negative points with measure . Since we have drawn points, with probability we will have at least points from , and at least points from . Given this many points, the expected value of our inferred mass is:
This function is minimized at , and plugging in , gives .
However, since we have conditioned on enough points being drawn from P and N, we are not done. This occurs across at least a percent of our samples, meaning that if we assume the inferred mass is 0 on other samples, our expected error (for a large enough constant on our number of samples) will be at most:
Setting is enough to drop the error below , and gives the number of samples as
∎
In the active regime, this sort of bound is complicated by the fact that we care less about the number of points drawn than about the number labeled. If we were restricted to only drawing points, we could repeat the same argument in combination with the expected number of vertices to get a bound. However, with a larger pool of allowed points, the pertinent question becomes the maximum rather than the expected measure of the convex hull. In cases such as the unit ball, the two give about the same result.
Proposition 3.4 (Restatement of Proposition 1.6).
For all , the query complexity of LabelMQSRPU learning is:
Proof.
The maximum volume of the convex hull of points in is ballMQS
Notice here the difference from the random case in the exponent, which comes from the fact that we are only counting the expected vertices on the boundary of the hull of the sample. Since in this scenario there exists a hyperplane with 0 negative probability mass, we can apply the same argument from Proposition 3.2, setting to get the desired bound. ∎
3.2 Upper bounds
Our positive results for comparison based RPUlearning rely on weakening the concept of inference dimension to be distribution dependent. With this in mind, we introduce average inference dimension:
Definition 3.5 (Average Inference Dimension).
We say has average inference dimension , if:
In other words, the probability that we cannot infer a point from a randomly drawn sample of size n is bounded by its average inference dimension . There is a simple average-case to worst-case reduction for average inference dimension via a union bound:
Observation 3.6.
Let have average inference dimension , and . Then has inference dimension with probability:
Proof.
The probability that a fixed subset of size does not have a point s.t. is at most . Union bounding over all subsets gives the desired result. ∎
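The union bound in this proof can be evaluated numerically. The sketch below assumes the notation n for the sample size, k for the candidate inference dimension, and g for the average inference dimension function; these names, and the example choice of g, are assumptions made for illustration rather than taken from the text. It bounds the failure probability by the number of size-k subsets times g(k), computed in log space to avoid overflow.

```python
import math

def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def failure_bound(n, k, g):
    """Union bound on the probability that a size-n sample does NOT have
    inference dimension k: C(n, k) * g(k), capped at 1."""
    gk = g(k)
    if gk <= 0.0:
        return 0.0
    log_bound = log_binom(n, k) + math.log(gk)
    return 1.0 if log_bound >= 0.0 else math.exp(log_bound)

# Hypothetical average inference dimension decaying like 2^(-k^2).
def example_g(k):
    return 2.0 ** (-(k * k))

if __name__ == "__main__":
    for k in (5, 10, 20):
        print(k, failure_bound(1000, k, example_g))
```

Because the number of subsets grows at most like n^k, the bound is only informative when g(k) decays faster than that, which is why very fast-decaying average inference dimension is the regime of interest.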
This reduction allows us to apply inference dimension in both the active and passive distributional cases. This is due in part to the fact that the boosting algorithm proposed by Kane et al. KLMZ is reliable even when given the wrong inference dimension as input; in that case, the algorithm simply loses its guarantee on query complexity. As a result, we may plug this reduction directly into their algorithm.
Corollary 3.7.
Given a query set , let be the number of queries required to answer all questions on a sample of size . If has average inference dimension , then there exists an RPU-learner with coverage
after drawing points. Further, the expected query complexity of actively RPUlearning a finite sample is
Proof.
For the first fact, we will appeal to the symmetry argument of KLMZ. Consider a reliable learner which takes in a sample of size and infers all possible points in . To compute coverage, we want to know the probability a random point is inferred by . Since was randomly drawn from , this is the same as computing the probability that any point in can be inferred from . By Observation 3.6, the probability that has inference dimension is
Since could equally well have been any point in by symmetry, if has inference dimension the coverage will be at least KLMZ. Since this occurs with probability at least by Observation 3.6, the expected coverage of is at least
The second statement follows from a similar argument. If has inference dimension , then by Theorem 1.11 the expected query complexity is at most . For a given , the expected query complexity is then bounded by:
Plugging in Observation 3.6 and minimizing over then gives the desired result. ∎
In fact, this corollary shows that RPU-learning with inverse super-exponential average inference dimension loses only log factors over passive or active PAC-learning. Asking for such small average inference dimension may seem unreasonable, but something as simple as label queries on uniform distributions over convex sets has average inference dimension with respect to linear separators labelaid.
Corollary 3.8.
Given a query set , let be the number of queries required to answer all questions on a sample of size . For any , let have average inference dimension . Then the expected sample complexity of QPoolRPU learning is:
Further, the expected query complexity of actively learning a finite sample is:
Proof.
We will show that by employing comparison queries we can improve the average inference dimension of linear separators from to , but first we will need to review a result on inference dimension from KLMZ.
Theorem 3.9 (Theorem 4.7 of KLMZ).
Given a set , we define the minimalratio of with respect to a hyperplane as:
In other words, the minimalratio is a normalized version of margin, a common tool in learning algorithms. Given , define to be the subset of hyperplanes with minimal ratio with respect to . The inference dimension of (X,H) is then:
We bound the average inference dimension of comparison queries via a reduction to minimal-ratio. Informally, the strategy is simple: we argue that, with high probability, throwing out the closest and furthest points from any classifier leaves a set with large minimal-ratio. We show this in three main steps.
Step 1: Assuming concentration of our distribution, a large number of points are contained inside a ball. We will use this to bound the maximum function value for a given hyperplane when its furthest points are removed.
Step 2: Assuming anticoncentration of our distribution, we will union bound over all hyperplanes to show that they have good margin. In order to do this, we will define the notion of a strip about a hyperplane h, which is simply h “fattened” by in both directions. If not too many points lie inside each hyperplane’s strip, then we can be assured when we remove the closest points the remaining set will have margin . Since we cannot union bound over the infinite set of strips, we will build a net of the objects and use this instead.
Step 3: Combining the above results carefully shows that for any hyperplane, removing the furthest and closest points leaves a subsample of good minimalratio. In particular, by making sure the number of remaining points matches the bound on inference dimension given in Theorem 3.9, we can be assured that one of these points may be inferred from the rest as long as our high probability conditions hold.
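The trimming idea behind these three steps can be sketched in a few lines. The code below assumes that the minimal-ratio of a point set with respect to a hyperplane h(x) = <w, x> + b is the smallest value of |h(x)| divided by the largest, which matches the description of it as a normalized margin; the displayed definition is missing from this copy, so treat that formula, the function name, and the `drop` parameter as assumptions.

```python
def trimmed_minimal_ratio(points, w, b, drop):
    """Remove the `drop` closest and `drop` furthest points from the
    hyperplane <w, x> + b = 0, then return min|h(x)| / max|h(x)| over
    the remaining points (assumed form of the minimal-ratio)."""
    values = sorted(
        abs(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in points
    )
    kept = values[drop:len(values) - drop]
    if not kept or kept[-1] == 0:
        raise ValueError("no points with nonzero function value remain")
    return kept[0] / kept[-1]

if __name__ == "__main__":
    # One point hugs the hyperplane and one is very far away.
    sample = [(0.01,), (1.0,), (2.0,), (4.0,), (100.0,)]
    print(trimmed_minimal_ratio(sample, (1.0,), 0.0, drop=0))
    print(trimmed_minimal_ratio(sample, (1.0,), 0.0, drop=1))
```

In the example, a single point very close to the hyperplane and a single distant point make the untrimmed ratio tiny, while discarding one point from each end leaves a much larger ratio. The concentration and anticoncentration assumptions are what guarantee that only a few points need to be discarded in this way.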
Theorem 3.10.
Let be a distribution over affinely equivalent to another with the following properties:

,
