The Power of Comparisons for Actively Learning Linear Classifiers

07/08/2019 ∙ by Max Hopkins, et al. ∙ University of California, San Diego

In the world of big data, large but costly to label datasets dominate many fields. Active learning, a semi-supervised alternative to the standard PAC-learning model, was introduced to explore whether adaptive labeling could learn concepts with exponentially fewer labeled samples. While previous results show that active learning performs no better than its supervised alternative for important concept classes such as linear separators, we show that by adding weak distributional assumptions and allowing comparison queries, active learning requires exponentially fewer samples. Further, we show that these results hold as well for a stronger model of learning called Reliable and Probably Useful (RPU) learning. In this model, our learner is not allowed to make mistakes, but may instead answer "I don't know." While previous negative results showed this model to have intractably large sample complexity for label queries, we show that comparison queries make RPU-learning at worst logarithmically more expensive in both the passive and active regimes.


1 Introduction

In recent years, the availability of big data and the high cost of labeling have led to a surge of interest in active learning, an adaptive, semi-supervised learning paradigm. In traditional active learning, given an instance space $X$, a distribution $D$ on $X$, and a class of concepts $C$, the learner receives unlabeled samples $x$ from $D$ with the ability to query an oracle for the labeling $c(x)$ of a hidden concept $c \in C$. Classically our goal would be to minimize the number of samples the learner draws before approximately learning the concept class with high probability (PAC-learning). Instead, active learning assumes unlabeled samples are inexpensive, and rather aims to minimize expensive queries to the oracle. While active learning requires exponentially fewer labeled samples than PAC-learning for simple classes such as intervals and thresholds, it fails to provide asymptotic improvement for classes essential to machine learning such as linear separators [Sanjoy].

However, recent results point to the fact that with slight relaxations or additions to the paradigm, such concept classes can be learned with exponentially fewer queries. In 2013, Balcan and Long [Balcan] proved that this was the case for homogeneous (through the origin) linear separators, as long as the distribution over the instance space was log-concave, a wide class of distributions generalizing common cases such as Gaussians or uniform distributions over convex sets. Later, Balcan and Zhang [s-concave] extended this to s-concave distributions, a diverse generalization of log-concavity including fat-tailed distributions. Similarly, El-Yaniv and Weiner [El-Yaniv] proved that non-homogeneous linear separators can be learned with exponentially fewer queries with respect to error over Gaussian distributions, but also showed that their algorithm suffers from a lower bound exponential in the dimension of the instance space.

Kane, Lovett, Moran, and Zhang [KLMZ] proved that the non-homogeneity barrier could be broken for general distributions in two dimensions by empowering the oracle to compare points rather than just label them. Queries of this type are called comparison queries, and are notable not only for their increase in computational power, but also for their real-world applications, such as in recommender systems [rec] or for increasing accuracy [xu2017noise]. Our work adopts a mixture of the approaches of Balcan et al. and Kane et al. We show that by leveraging comparison queries, non-homogeneous linear separators may be learned with exponentially fewer samples as long as the distribution satisfies weak concentration and anti-concentration bounds, conditions realized by, for instance, s-concave distributions.

Further, by leveraging inference-based techniques introduced in the same paper by Kane et al., we can use comparison queries to provide a stronger guarantee than PAC-learning with little cost to query complexity. In the late 80’s, Rivest and Sloan [Rivest] proposed a competing model to PAC-learning called Reliable and Probably Useful (RPU) learning. This model, a learning-theoretic formalization of selective classification introduced by Chow [Chow] more than two decades earlier, does not allow the learner to make mistakes, but instead allows the answer “I don’t know,” written as “⊥”. Here, error is measured not by the fraction of misclassified examples, but by the measure of examples on which our learner returns ⊥. RPU-learning was for the most part abandoned by the early 90’s in favor of PAC-learning, as Kivinen [Kivinen, Kivinen2] proved that RPU-learning simple concept classes such as rectangles requires an exponential number of samples even under the uniform distribution. However, the model was recently re-introduced by El-Yaniv and Weiner [El-Yaniv], who termed it perfect selective classification. El-Yaniv and Weiner prove a connection between active and RPU-learning similar to the strategy employed by Kane et al. [KLMZ] (who refer to RPU-learners as “confident” learners). We will extend the lower bound of El-Yaniv and Weiner to prove that actively RPU-learning linear separators with only labels is exponentially difficult in dimension even for nice distributions. On the other hand, we will further show that comparison queries allow RPU-learning with nearly matching sample and query complexity to PAC-learning.

1.1 Background and Related Work

1.1.1 PAC-learning

Probably Approximately Correct (PAC)-learning is a framework for learning classifiers over an instance space introduced by Valiant [Valiant]. Given an instance space $X$, label space $Y$, and a concept class $C$ of concepts $c: X \to Y$, PAC-learning proceeds as follows. First, an adversary chooses a hidden distribution $D$ over $X$ and a hidden classifier $c \in C$. The learner then draws labeled samples from $D$, and outputs a concept $c'$ which it thinks is close to $c$ with respect to $D$. Formally, we define closeness of $c$ and $c'$ as the error:

$$\mathrm{err}_D(c', c) = \Pr_{x \sim D}[c'(x) \neq c(x)].$$

We say the pair $(X, C)$ is PAC-learnable if there exists a learner $A$ which, using only polynomially many samples $n = n(\varepsilon, \delta)$ (formally, $n$ must also be polynomial in a number of parameters of $C$), for all $\varepsilon, \delta$ picks a classifier $c'$ that with probability $1 - \delta$ has at most $\varepsilon$ error from $c$. Formally,

$$\forall c \in C,\ \forall D:\quad \Pr_{S \sim D^n}\left[\mathrm{err}_D(A(S), c) \le \varepsilon\right] \ge 1 - \delta.$$

The goal of PAC-learning is to compute the sample complexity $n(\varepsilon, \delta)$ and thereby prove whether certain pairs $(X, C)$ are efficiently learnable. In this paper, we will be concerned with the case of binary classification, where $Y = \{1, -1\}$. In addition, in the case that $C$ is linear separators, we instead write our concept class as the sign of a family of functions $H$. Instead of $(X, C)$, we write the hypothesis class $(X, H)$, and each $h \in H$ defines a concept $c_h(x) = \mathrm{sign}(h(x))$. The sample complexity of PAC-learning is characterized by the VC dimension [VC, Blumer] of $(X, H)$, which we denote by $k$, and is given by:

$$n(\varepsilon, \delta) = \Theta\left(\frac{k + \log(1/\delta)}{\varepsilon}\right).$$
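To make the notion of error between two concepts concrete, the following is a minimal sketch that estimates the disagreement mass of two linear classifiers by Monte Carlo sampling. The uniform distribution on the unit disk, the function names, and the sample size are illustrative assumptions and do not come from the paper.

```python
import random

def sample_unit_disk(rng):
    """Rejection-sample one point uniformly from the unit disk (assumed distribution)."""
    while True:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            return (x, y)

def linear_classifier(v, b):
    """Return the concept x -> sign(<v, x> + b), with sign(0) taken as +1."""
    def classify(x):
        return 1 if v[0] * x[0] + v[1] * x[1] + b >= 0 else -1
    return classify

def estimate_error(c_hat, c, n_samples=20000, seed=0):
    """Monte Carlo estimate of the probability that c_hat and c disagree."""
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(n_samples):
        x = sample_unit_disk(rng)
        if c_hat(x) != c(x):
            disagreements += 1
    return disagreements / n_samples

target = linear_classifier((1.0, 0.0), 0.0)      # hidden concept: first coordinate >= 0
hypothesis = linear_classifier((0.0, 1.0), 0.0)  # learner's guess: second coordinate >= 0
err = estimate_error(hypothesis, target)
```

The two half-planes in this example meet at a right angle through the origin, so they disagree on two opposite quarter-disks and the estimate should be close to one half.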

1.1.2 RPU-learning

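Reliable and Probably Useful (RPU)-learning is a stronger variant of PAC-learning introduced by Rivest and Sloan [Rivest], in which the learner is reliable: it is not allowed to make errors, but may instead say “I don’t know” (or for shorthand, “⊥”). Since it is easy to make a reliable learner by simply always outputting “⊥”, our learner must be useful, and with high probability cannot output “⊥” more than a small fraction of the time. Let $A$ be a reliable learner. We define the error of $A$ on a sample $S$ with respect to $D, c$ to be

$$\mathrm{err}_D(A(S), c) = \Pr_{x \sim D}[A(S)(x) = \bot].$$

We call $1 - \mathrm{err}_D(A(S), c)$ the coverage of the learner $A$. Finally, we say the pair $(X, C)$ is RPU-learnable if, for all $\varepsilon, \delta$, there exists a reliable learner $A$ which in polynomially many samples $n = n(\varepsilon, \delta)$ has error at most $\varepsilon$ with probability at least $1 - \delta$:

$$\forall c \in C,\ \forall D:\quad \Pr_{S \sim D^n}\left[\mathrm{err}_D(A(S), c) \le \varepsilon\right] \ge 1 - \delta.$$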

RPU-learning is characterized by the VC dimension of certain intersections of concepts Kivinen. Unfortunately, many simple cases turn out to be not RPU-learnable (e.g. rectangles in ), with even relaxations having exponential sample complexity Kivinen2.
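As an illustration of reliability and coverage, here is a small sketch of a reliable learner for one-dimensional thresholds, a class chosen here only because it is easy to RPU-learn. The learner answers where every threshold consistent with the sample agrees, and abstains otherwise. The names `ABSTAIN`, `reliable_threshold_learner`, and `coverage` are hypothetical helpers, not notation from the paper.

```python
import random

ABSTAIN = None  # stands in for the "I don't know" answer

def reliable_threshold_learner(labeled_sample):
    """Build a reliable classifier for 1D thresholds x -> +1 if x >= t else -1.

    The classifier only answers on points whose label is forced by the sample.
    """
    negatives = [x for x, y in labeled_sample if y == -1]
    positives = [x for x, y in labeled_sample if y == 1]
    max_neg = max(negatives) if negatives else float("-inf")
    min_pos = min(positives) if positives else float("inf")

    def classify(x):
        if x <= max_neg:
            return -1
        if x >= min_pos:
            return 1
        return ABSTAIN  # label not determined by the sample
    return classify

def coverage(classifier, points):
    """Fraction of points on which the classifier does not abstain."""
    return sum(classifier(x) is not ABSTAIN for x in points) / len(points)

rng = random.Random(0)
hidden_t = 0.3  # illustrative hidden threshold
label = lambda x: 1 if x >= hidden_t else -1
sample = [(x, label(x)) for x in (rng.random() for _ in range(200))]
learner = reliable_threshold_learner(sample)

test_points = [rng.random() for _ in range(5000)]
mistakes = sum(1 for x in test_points
               if learner(x) is not ABSTAIN and learner(x) != label(x))
cov = coverage(learner, test_points)
```

The learner never errs by construction; its only cost is the uncovered gap between the largest negative and smallest positive sample point, which shrinks as the sample grows.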

1.1.3 Passive vs Active Learning

PAC and RPU-learning traditionally refer to supervised learning, where the learning algorithm receives pre-labeled samples. We call this paradigm passive learning. In contrast, active learning refers to the case where the learner receives unlabeled samples and may adaptively query a labeling oracle. Similar to the passive case, for active learning we study the query complexity $q(\varepsilon, \delta)$, the minimum number of queries to learn some pair $(X, C)$ in either the PAC or RPU learning models. The hope is that by adaptively choosing when to query the oracle, the learner may only need to query a number of samples logarithmic in the sample complexity.
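The classic example of such a logarithmic saving is a one-dimensional threshold: the learner sorts its unlabeled pool and binary-searches for the sign change. The sketch below illustrates this under assumed names (`active_learn_threshold`, `oracle`) and an assumed pool of 1024 uniform points.

```python
import random

def active_learn_threshold(pool, label_oracle):
    """Binary-search a sorted pool for the first positively labeled point.

    Returns (sorted pool, index of first positive point, number of label queries).
    """
    points = sorted(pool)
    queries = 0
    lo, hi = 0, len(points)  # invariant: points[:lo] are negative, points[hi:] are positive
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if label_oracle(points[mid]) == 1:
            hi = mid
        else:
            lo = mid + 1
    return points, lo, queries

rng = random.Random(1)
hidden_t = 0.62  # illustrative hidden threshold
oracle = lambda x: 1 if x >= hidden_t else -1
pool = [rng.random() for _ in range(1024)]
points, first_pos, n_queries = active_learn_threshold(pool, oracle)
```

Labeling the whole pool passively would cost 1024 queries, while the binary search labels every pool point correctly with at most 11.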

We will discuss two paradigms of active learning: pool-based active learning, and membership query synthesis (MQS) [pool, MQS]. In the former, the learner has access to a pool of unlabeled data and may request that the oracle label any point. This model matches real-world scenarios where learners have access to large, unlabeled datasets, but labeling is too expensive to use passive learning (e.g. medical imagery). Membership query synthesis allows the learner to synthesize points in the instance space and query their labels. This model is the logical extreme of the pool-based model where our pool is the entire instance space. Because we will be considering learning with a fixed distribution, we will slightly modify MQS: the learner may only query points in the support of the distribution. This is the natural specification to distribution dependent learning, as it still models the case where our pool is as large as possible.

1.1.4 The Distribution Dependent Case

While PAC and RPU-learning were traditionally studied in the worst-case scenario over distributions, data in the real world is often drawn from distributions with nice properties such as concentration and anti-concentration bounds. As such, there has been a wealth of research into distribution-dependent PAC-learning, where the model has been relaxed only in that some distributional conditions are known. Distribution dependent learning has been studied in both the passive and the active case [Balcan, Long, Long2, Hanneke]. Most closely related to our work, Balcan and Long [Balcan] proved new upper bounds on active and passive learning of homogeneous (through the origin) linear separators in 0-centered log-concave distributions. Later, Balcan and Zhang [s-concave] extended this to s-concave distributions. We directly extend the original algorithm of Balcan and Long to non-homogeneous linear separators via the inclusion of comparison queries, and leverage the concentration results of Balcan and Zhang to provide an inference based algorithm for learning under s-concave distributions.

1.1.5 The Point Location Problem

Our results on RPU-learning imply the existence of simple linear decision trees for an important problem in computer science and computational geometry known as the point location problem. Given a set of $n$ hyperplanes in $d$ dimensions, called a hyperplane arrangement of size $n$ and denoted by $H$, it is a classic result that $H$ partitions $\mathbb{R}^d$ into $O(n^d)$ cells. The point location problem is as follows:

Definition 1.1 (Point Location Problem).

Given a hyperplane arrangement $H$ and a point $x$, both in $\mathbb{R}^d$, determine in which cell of $H$ the point $x$ lies.
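A cell of an arrangement is identified by which side of each hyperplane it lies on, so the naive way to solve point location is to evaluate one linear query per hyperplane. The sketch below shows this baseline; the representation of a hyperplane as a pair `(v, b)` and the name `locate_cell` are assumptions made for illustration. Linear decision trees aim to reach the same answer with far fewer queries.

```python
def locate_cell(hyperplanes, x):
    """Return the sign vector identifying the cell of the arrangement containing x.

    Each hyperplane is a pair (v, b) representing {y : <v, y> + b = 0}.
    Entries are +1, -1, or 0 (0 means x lies on that hyperplane).
    """
    signs = []
    for v, b in hyperplanes:
        value = sum(vi * xi for vi, xi in zip(v, x)) + b
        signs.append((value > 0) - (value < 0))
    return tuple(signs)

# Three lines in the plane: x1 = 0, x2 = 0, and x1 + x2 = 1.
arrangement = [((1.0, 0.0), 0.0), ((0.0, 1.0), 0.0), ((1.0, 1.0), -1.0)]
cell = locate_cell(arrangement, (0.25, 0.25))
```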

Instances of this problem show up throughout computer science, such as in -sum, subset-sum, knapsack, or any variety of other problems K-sum. The best known depth for a linear decision tree solving the point location problem is from a recent work of Ezra and Sharir Ezra, who proved the existence of an depth LDT for arbitrary and . The caveat of this work is that the LDT may use arbitrary linear queries, which may be too powerful of a model in practice. Kane, Lovett, and Moran KLM offer an depth LDT restricting the model to generalized comparison queries, queries of the form for a point and hyperplanes . These queries are nice as they preserve structural properties of the input such as sparsity, but they still suffer from over-complication–any still allows an infinite set of queries.

Kane, Lovett, Moran, and Zhang’s KLMZ original work on inference dimension showed that in the worst case, the depth of a comparison LDT for point location is . However, by restricting to have good margin or bounded bit complexity, they build a comparison LDT of depth , which comes with the advantage of drawing from a finite set of queries for a given problem instance. Our work provides another result of this flavor: we will prove that if is drawn from a distribution with weak restrictions, for large enough there exists a comparison LDT with expected depth .

1.2 Our Results

1.2.1 Notation

We begin by introducing notation for our learning models. For a distribution $D$, an instance space $X$, and a hypothesis class $H$, we write the triple $(D, X, H)$ to denote the problem of learning a hypothesis $h \in H$ over $X$ with respect to $D$. When $D$ is the uniform distribution over $X$, we will write $(X, H)$ for convenience. We will further denote by $B^d$ the unit ball in $d$ dimensions, and by $H_d$ hyperplanes in $d$ dimensions. Given $h \in H_d$ and a point $x$, a label query determines $\mathrm{sign}(h(x))$; given $x, x'$, a comparison query determines $\mathrm{sign}(h(x) - h(x'))$.
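The two query types can be sketched as a small oracle object for a hidden linear function of the form <v, x> + b. This assumes the standard reading in which a label query returns the sign of the function at a point and a comparison query returns the sign of the difference of its values at two points. The class name `LinearOracle` and the query counters are illustrative additions.

```python
def sign(value):
    """Return +1, -1, or 0 according to the sign of value."""
    return (value > 0) - (value < 0)

class LinearOracle:
    """Answers label and comparison queries for h(x) = <v, x> + b, counting each kind."""

    def __init__(self, v, b):
        self.v, self.b = v, b
        self.label_queries = 0
        self.comparison_queries = 0

    def _h(self, x):
        return sum(vi * xi for vi, xi in zip(self.v, x)) + self.b

    def label(self, x):
        """Label query: which side of the hyperplane is x on?"""
        self.label_queries += 1
        return sign(self._h(x))

    def compare(self, x1, x2):
        """Comparison query: is h(x1) larger or smaller than h(x2)?"""
        self.comparison_queries += 1
        return sign(self._h(x1) - self._h(x2))

oracle = LinearOracle((1.0, 2.0), -1.0)
label_answer = oracle.label((1.0, 1.0))                     # h = 2
comparison_answer = oracle.compare((0.0, 0.0), (1.0, 1.0))  # h values -1 and 2
```

Note that the offset b cancels in a comparison, so comparisons alone reveal only the direction v; labels are still needed to locate the offset.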

In addition, we will separate our models of learnability into combinations of three classes Q, R, and S, where Q = {Label, Comparison}, R = {Passive, Pool, MQS}, and S = {PAC, RPU}. Informally, we say an element of Q defines our query type, an element of R our learning regime, and an element of S our learning model. Learnability of a triple is then defined by the combination of any choice of query, regime, and model, which we term as the Q-R-S learnability of $(D, X, H)$. Note that in Comparison-learning we have both a labeling and comparison oracle.

Finally, we will discuss a number of different measures of complexity for Q-R-S learning triples. For passive learning, we will focus on the sample complexity $n(\varepsilon, \delta)$. For active learning, we will focus on the query complexity $q(\varepsilon, \delta)$. In both cases, we will often drop $\delta$ and instead give bounds on the expected sample/query complexity for error $\varepsilon$, denoted $\mathbb{E}[n(\varepsilon)]$ (or $\mathbb{E}[q(\varepsilon)]$ respectively). A bound for probability $1 - \delta$ then follows with $O(\log(1/\delta))$ repetitions by a Chernoff bound. In the case of a finite instance space of size $n$, we denote the expected query complexity of perfectly learning $X$ as $\mathbb{E}[q(n)]$.

As a final note, we will at times use a subscript $d$ in our asymptotic notation (e.g. $O_d$) to suppress factors only dependent on dimension.

1.2.2 PAC-Learning

To show the power of active learning with comparison queries in the PAC-learning model, we will begin by proving lower bounds. In particular, we show that neither active learning nor comparison queries alone provide a significant speed-up over passive learning. In order to do this, we will assume the stronger MQS model, as lower bounds here transfer over to the pool-based regime.

Proposition 1.2.

For small enough , and , the query complexity of Label-MQS-PAC learning is:

Thus without enriched queries, active learning fails to significantly improve over passive learning even over a nice, low-dimensional distribution. Likewise, adding comparison queries alone also provides little improvement.

Proposition 1.3.

For small enough , and , the sample complexity of Comparison-Passive-PAC learning is:

Now we can compare the query complexity of active learning with comparisons to the above. For our upper bound, we will assume the pool-based model with a Poly pool size, as upper bounds here transfer to the MQS model. Our algorithm for Comparison-Pool-PAC learning combines a modification of Balcan and Long’s Balcan learning algorithm with noisy thresholding to provide an exponential speed-up for non-homogeneous linear separators.

Theorem 1.4.

Let be a log-concave distribution over . Then the query complexity of Comparison-Pool-PAC learning is

Balcan and Long also give a lower bound of for log-concave distributions which carries over to our setting, so this bound is near tight in dimension and error.

1.3 RPU-Learning

In the RPU-learning model, we will first confirm that passive learning with label queries is intractable information theoretically, and continue to show that active learning alone provides little improvement. Unlike in PAC-learning however, we will show that comparisons in this regime provide a significant improvement in not only active, but also passive learning.

Proposition 1.5.

The expected sample complexity of Label-Passive-RPU learning is:

Thus we see that RPU-learning linear separators is intractable for large dimension. Further, active learning with label queries is of the same order of magnitude.

Proposition 1.6.

For all , the query complexity of Label-MQS-RPU learning is:

These two bounds are a generalization of the technique employed by El-Yaniv and Weiner El-Yaniv to prove lower bounds for a specific algorithm, and apply to any learner. We further show that this bound is tight up to a logarithmic factor. For passive RPU-learning with comparison queries, we will simply inherit the lower bound from the PAC model (Proposition 1.3).

Corollary 1.7.

For small enough , and , any algorithm that Comparison-Passive-RPU learns must use at least

samples.

Note that unlike for label queries, this lower bound is not exponential in dimension. In fact, we will show that this bound is tight up to a linear factor in dimension, and further that employing comparison queries in general shifts the RPU model from being intractable to losing only a logarithmic factor over PAC-learning in both the passive and active regimes. We need one definition: two distributions over are affinely equivalent if there is an invertible affine map such that .

Theorem 1.8.

Let be a distribution over that is affinely equivalent to a distribution over , for which the following holds:

  1. ,

  2. ,

The sample complexity of Comparison-Passive-RPU-learning is:

and the query complexity of Comparison-Pool-RPU learning is:

Note that the constants have logarithmic dependence on and .

We prove Theorem 1.8 through the theory of inference dimension from KLMZ, which implies the following result for the point location problem as well.

Theorem 1.9.

Let be a distribution satisfying the criterion of Theorem 1.8, , and . Then for large enough there exists an LDT using only label and comparison queries solving the point location problem with expected depth

For ease of viewing, we summarize our main results on expected sample/query complexity in Tables 1 and 2 for the special case of the uniform distribution over the unit ball. The only table entries not novel to this work are the Label-Passive-PAC bounds [Long, Long2], and the lower bound on Comparison-Pool/MQS-PAC learning [Balcan, lowerbound]. Note also that lower bounds for PAC learning carry over to RPU learning.

PAC           Passive           Pool                     MQS
Label         [Long, Long2]
Comparison                      [Balcan, lowerbound]

Table 1: Expected sample and query complexity for PAC learning over the uniform distribution on the unit ball.

RPU           Passive           Pool                     MQS
Label
Comparison

Table 2: Expected sample and query complexity for RPU learning over the uniform distribution on the unit ball.

1.4 Our Techniques

1.4.1 Sphere Packing and Random Polytopes

For the PAC-learning model, our lower bounds rely on packing spherical caps, where a spherical cap is a portion of a ball cut off by some hyperplane . Our results rely on finding a large number of disjoint spherical caps of a large enough volume. In particular, our lower bound argument is as follows:

Spherical Cap Packing Lower Bound: Imagine we are able to pack disjoint spherical caps of volume onto the surface of the unit ball . These caps correspond to potential hyperplanes, over which our adversary may pick a uniform distribution. Now imagine the learner queries points by any method. This means that most caps will not contain points, and with only label queries, the corresponding hyperplanes to these caps are indistinguishable to any learner. Thus with constant probability the learner will err on some cap, giving error.

For the RPU-learning model, our lower bounds rely on the complexity of random polytopes. A random polytope of size over a distribution is the convex hull of a sample , and its complexity is given by the expected probability mass of its convex hull

Random Polytope Complexity Lower Bound: Imagine our adversary chooses a distribution such that with high probability, every point that our learner queries is of the same sign. Thus, the learner cannot infer any points outside the convex hull of the sample. Since we know the relation between this volume which cannot be inferred and the number of points drawn, setting the volume to be gives a lower bound on the query complexity.

These techniques are essentially generalizations of the algorithm specific lower bounds given by El-Yaniv and Weiner El-Yaniv, who also consider random polytope complexity.

1.4.2 Inference Dimension and Enriched Queries

Our novel RPU-learning upper bounds are based upon the inference dimension paradigm introduced by Kane, Lovett, Moran, and Zhang (KLMZ) in 2017 [KLMZ]. Using this new combinatorial framework for active learning, Kane et al. provide worst-case upper and lower bounds for an enriched query oracle, and in particular show how to actively PAC-learn linear separators in $\mathbb{R}^2$. Kane et al. focus in particular on one type of enriched query, the comparison query, which allows the oracle to compare two points. Formally, a comparison query on points $x_1, x_2$ with underlying function $h$ asks:

$$h(x_1) \ge h(x_2)?$$

The inference dimension framework allows for any kind of extended query, i.e. boolean functions on the underlying family of functions. Let $(X, H)$ be a hypothesis class, and $Q$ a set of queries. We denote the answers to all queries on $S$ by $Q(S)$. For a sample $S \subseteq X$ and $h \in H$, we adopt the notation of Kane et al. [KLMZ] and say that $S$ infers the point $x$ under $h$, denoted

$$S \rightarrow_h x,$$

if the answers $Q(S)$ under $h$ determine the label of $x$. As an example, consider $H$ to be linear separators in $d$ dimensions, $Q$ to be label queries, and our sample $S$ to be positively labeled points under some classifier $h$ in general position. Due to linearity, any point inside the convex hull of $S$ is inferred by $S$ under $h$.
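The convex hull example can be sketched in the plane as follows. By Carathéodory's theorem, a point lies in the convex hull of a planar point set exactly when it lies in a triangle spanned by three of the points, so checking all non-degenerate triples suffices. The function names are hypothetical, and the brute-force search over triples is chosen for brevity rather than efficiency.

```python
from itertools import combinations

def cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive if counter-clockwise)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def in_triangle(p, a, b, c):
    """True if p lies inside or on the boundary of the non-degenerate triangle abc."""
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)

def inferred_positive(p, positive_points):
    """True if p is in the convex hull of the positively labeled points (2D).

    Any linear separator labeling all of positive_points +1 must then label p +1.
    """
    return any(
        cross(a, b, c) != 0 and in_triangle(p, a, b, c)  # skip collinear triples
        for a, b, c in combinations(positive_points, 3)
    )

positives = [(0, 0), (2, 0), (0, 2), (2, 2)]
inside = inferred_positive((1, 1), positives)   # center of the square
outside = inferred_positive((3, 1), positives)  # beyond the right edge
```

A point outside the hull is not inferred by labels alone, since some separator consistent with the sample can cut it off; this is the observation behind the label-only lower bounds.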

Using this concept, Kane et al. define inference dimension, and show that the framework characterizes worst-case active learning.

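Definition 1.10 (Inference Dimension [KLMZ]).

The inference dimension of $(X, H)$ with query set $Q$ is the smallest $k$ such that for any subset $S \subseteq X$ of size $k$ and any $h \in H$, there exists $x \in S$ s.t. $S \setminus \{x\}$ infers $x$ under $h$.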
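For intuition, the definition can be checked by brute force on a toy class. The sketch below uses one-dimensional thresholds with label queries on a small grid, which is an assumption made purely for illustration and is not an example taken from the paper. Among any three points, the middle one's label forces the label of one of its neighbors, while two points with opposite labels infer nothing about each other, so the search should report 3.

```python
from itertools import combinations

def label(t, x):
    """Threshold hypothesis h_t(x) = +1 if x >= t else -1."""
    return 1 if x >= t else -1

def infers(rest, x, t, thresholds):
    """True if the labels of `rest` under h_t determine the label of x."""
    consistent = [s for s in thresholds
                  if all(label(s, y) == label(t, y) for y in rest)]
    return all(label(s, x) == label(t, x) for s in consistent)

def inference_dimension(points, thresholds, max_k=5):
    """Smallest k such that, for every hypothesis, every k-subset of points
    contains a point inferred by the remaining k - 1 points."""
    for k in range(1, max_k + 1):
        every_subset_has_inferred_point = all(
            any(infers([y for y in subset if y != x], x, t, thresholds)
                for x in subset)
            for subset in combinations(points, k)
            for t in thresholds
        )
        if every_subset_has_inferred_point:
            return k
    return None  # not found up to max_k

points = [0, 1, 2, 3, 4, 5]
thresholds = [i - 0.5 for i in range(7)]  # -0.5, 0.5, ..., 5.5
dim = inference_dimension(points, thresholds)
```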

Kane et al. show that finite inference dimension implies query complexity that is logarithmic in the sample complexity. Let $Q(n)$ be the number of oracle queries required to answer all queries on a sample of size $n$ in the worst case (e.g. $Q(n) = O(n \log n)$ for comparison queries via sorting).
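The sorting remark can be illustrated directly: once a sample is sorted by the value of the hidden function, the answer to every pairwise comparison is known, and a comparison sort needs only on the order of n log n oracle calls. The sketch below counts the calls made by Python's built-in sort; the hidden function and the helper name `sort_by_comparisons` are illustrative assumptions.

```python
import functools
import random

def sort_by_comparisons(points, compare):
    """Sort points by hidden value using only comparison queries.

    Returns (sorted points, number of comparison queries issued).
    """
    count = 0

    def counted(a, b):
        nonlocal count
        count += 1
        return compare(a, b)

    ordered = sorted(points, key=functools.cmp_to_key(counted))
    return ordered, count

# Hidden linear function h(x) = <v, x> + offset (illustrative values).
v, offset = (1.0, -2.0), 0.5
h = lambda x: v[0] * x[0] + v[1] * x[1] + offset
compare = lambda a, b: (h(a) > h(b)) - (h(a) < h(b))

rng = random.Random(0)
sample = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(256)]
ordered, n_comparisons = sort_by_comparisons(sample, compare)
```

Answering all pairwise comparisons naively would take 256 * 255 / 2 = 32640 queries; sorting recovers the same information with a small multiple of 256 * 8.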

Theorem 1.11 ([KLMZ]).

Let denote the inference dimension of with query set . Then the expected query complexity of for is:

Further, infinite inference dimension provides a lower bound:

Theorem 1.12 ([KLMZ]).

Assume that the inference dimension of with query set is . Then for , the sample complexity of Q-Pool-PAC learning is:

As the name would suggest, the upper bound derived via inference dimension is based upon a reliable learner that infers a large number of points given a small sample. While not explicitly stated in KLMZ, it follows from the same argument that finite inference dimension gives an upper bound on RPU-learning:

Corollary 1.13.

Let denote the inference dimension of with query set . Then the sample complexity to passively RPU-learn is:

Further, the expected query complexity to actively RPU-learn is:

2 PAC Learning with Comparison Queries

In this section we study PAC learning with comparison queries in both the passive and active cases.

2.1 Lower Bounds

To begin, we prove that over a uniform distribution on a unit ball, learning linear separators with only label queries is hard.

Proposition 2.1 (Restatement of Proposition 1.2).

For small enough , and , the query complexity of Label-MQS-PAC learning is:

Proof.

This follows from a packing argument. The area of a cap of angle is

by Taylor expanding . For small enough , setting to then gives that the area of this cap is at least , and thus that its measure is at least . Since we can pack at least of such caps into the ball, then for small enough we have a packing of at least caps with measure greater than .

Consider an adversary which picks one of these caps to be negative. Say we query only points, then there is at best a probability that we uncover which cap is negative. In the case that we do not, we cannot do anything better than guess which remaining cap is negative. Since there are more than remaining caps for small enough our guess is correct no more than of the time, meaning our failure probability is

To show that our exponential improvement comes from the use of comparisons in combination with active learning, we will prove that using comparisons coupled with passive learning provides no improvement.

Proposition 2.2 (Restatement of Proposition 1.3).

For small enough , and , any algorithm that passively learns with comparison queries must use at least

samples.

Proof.

Let be any hyperplane cutting off a size cap from , and be the parallel hyperplane tangent to the cap. We will consider the distribution of hyperplanes that is uniform over and . Given uniform samples from , the probability that at least one point lands inside the cap is . Let

then for small enough , this probability is . Say no sample lands in , then and are completely indistinguishable by label or comparison queries. Any hypothesis chosen by the learner must label at least half of positive or negative, and will thus have error with either or . Since the distribution over these hyperplanes is uniform, the learner fails with probability at least . Thus in total the probability that the learner fails is at least

Together, these lower bounds show it is only the combination of active learning and comparison queries which provides an exponential improvement.

2.2 Upper Bounds

For completeness, we will begin by showing that Proposition 1.2 is tight for before moving to our main result for the section.

Proposition 2.3.

The query complexity of Label-MQS-PAC learning is:

Proof.

To begin, we will show that selecting $k$ points along the boundary of $B^2$ in a regular fashion (such that their convex hull is the regular $k$-sided polygon) is enough if all such points have the same label. This follows from the fact that each cap created by the polygon has area $\frac{1}{2}\left(\frac{2\pi}{k} - \sin\frac{2\pi}{k}\right)$ and thus probability mass

$$\frac{1}{2\pi}\left(\frac{2\pi}{k} - \sin\frac{2\pi}{k}\right).$$

Taylor approximating sine shows that picking $k = O(\varepsilon^{-1/3})$ suffices for each cap to have probability mass at most $\varepsilon$. If all $k$ points are of the same sign (say 1), a hyperplane can only cut through one such cap, and thus labeling the entire disk 1 has error at most $\varepsilon$.

Thus we have reduced to the case where there are one or more points of differing signs. In this scenario, there will be exactly two edges where connected vertices are of different signs, which denotes that the hyperplane passes through both edges. Next, on each of the two caps associated with these edges, we query $O(\log(1/\varepsilon))$ points in order to find the crossing point of the hyperplane via binary search up to an accuracy of $O(\varepsilon)$. This reduces the area of unknown labels to the strip connecting these two arcs, which has $O(\varepsilon)$ probability mass. Picking any consistent hyperplane then finishes the proof. ∎
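The cap computation in this proof is easy to check numerically. A side of the regular k-gon inscribed in the unit circle subtends a central angle of 2π/k, and the circular segment it cuts off has area (θ − sin θ)/2 for that angle θ. The sketch below, with hypothetical helper names, computes the resulting probability mass under the uniform distribution on the disk and the number of boundary points needed for a given error.

```python
import math

def cap_mass(k):
    """Probability mass, under the uniform distribution on the unit disk, of one cap
    cut off by a side of the regular k-gon inscribed in the unit circle."""
    theta = 2 * math.pi / k                  # central angle subtended by one side
    area = 0.5 * (theta - math.sin(theta))   # circular segment area for radius 1
    return area / math.pi

def points_needed(eps):
    """Smallest k >= 3 whose caps each have mass at most eps (linear scan for clarity)."""
    k = 3
    while cap_mass(k) > eps:
        k += 1
    return k

k_small = points_needed(1e-3)
k_large = points_needed(1e-6)
```

Since θ − sin θ is approximately θ³/6 for small θ, the cap mass behaves like 2π²/(3k³), so shrinking the error by a factor of 1000 only multiplies the number of boundary points by about 10.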

Now we will show that active learning with comparison queries in the PAC-learning model exponentially improves over the passive and label regimes. Our work is closely related to the algorithm of Balcan and Long [Balcan], and relies on using comparison queries to reduce to a combination of their algorithm and thresholding. Our bounds will relate to a general set of distributions called isotropic (0-centered, identity variance) log-concave distributions, distributions whose density function $f$ may be written as $e^{g(x)}$ for some concave function $g$. Log-concavity generalizes many natural distributions such as Gaussians and uniform distributions over convex sets. To begin, we will need a few statements regarding isotropic log-concave distributions proved initially by Lovasz and Vempala [log-concave], and Klivans, Long, and Tang [Klivans] (here we include additional facts we require for RPU-learning later on).

Fact 2.4 ([log-concave, Klivans]).

Let $D$ be an arbitrary log-concave distribution in $\mathbb{R}^d$ with probability density function $f$, and $u, v$ normal vectors of homogeneous hyperplanes. The following statements hold, where 3, 4, 5, and 6 assume $D$ is isotropic:

  1. , the difference of i.i.d pairs, is log-concave

  2. may be affinely transformed to an isotropic distribution Iso

  3. s.t. the angle between and , denoted , satisfies

  4. All marginals of are isotropic log-concave

  5. If

Using these facts, we will give an upper bound for the Pool-based model assuming a pool of Poly unlabeled samples. For a sketch of the algorithm, see Figure 1.

    shift_list = []
    normal_vector = B-L      (learn the homogenized separator using comparison queries)
    repeat (number of repetitions as set in the proof of Theorem 2.5):
        S = fresh sample drawn from the pool
        s = Project(S, normal_vector)
        shift_list.add(Threshold(s))
    return the hyperplane with normal normal_vector and shift median(shift_list)

Algorithm 1: Comparison-Pool-PAC learn

Figure 1: Algorithm for Comparison-Pool-PAC learning an isotropic log-concave distribution. Our algorithm references two sub-routines. The first is the Label-Pool-PAC learner B-L presented in [Balcan]. The second is a thresholding procedure Threshold(s), which labels the one-dimensional array s by binary search and outputs a consistent threshold value.

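The structure of Algorithm 1 can be sketched in runnable form under strong simplifying assumptions. The Balcan and Long learner B-L is not reproduced here; a simple averaging estimator, which sums the differences of sampled pairs signed by their comparison answers, stands in for it. This stand-in is passive and does not achieve the query complexity of Theorem 2.5. It only illustrates the reduction: comparisons recover the direction, then a binary search over projections recovers the shift. All function names, sample sizes, and the hidden separator below are hypothetical.

```python
import random
import statistics

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def learn_direction(pairs, compare):
    """Stand-in for B-L: average the differences x - y, each signed by its comparison answer."""
    dim = len(pairs[0][0])
    total = [0.0] * dim
    for x, y in pairs:
        answer = compare(x, y)  # sign(h(x) - h(y)) = sign(<v, x - y>)
        for i in range(dim):
            total[i] += answer * (x[i] - y[i])
    norm = dot(total, total) ** 0.5
    return [t / norm for t in total]

def threshold(points, projections, label):
    """Binary-search the points, sorted by projection, for the sign change; return a shift."""
    order = sorted(range(len(points)), key=lambda i: projections[i])
    lo, hi = 0, len(order)
    while lo < hi:
        mid = (lo + hi) // 2
        if label(points[order[mid]]) == 1:
            hi = mid
        else:
            lo = mid + 1
    if lo == 0:
        cut = projections[order[0]]
    elif lo == len(order):
        cut = projections[order[-1]]
    else:
        cut = 0.5 * (projections[order[lo - 1]] + projections[order[lo]])
    return -cut  # the output classifier is sign(<w, x> + shift)

def comparison_pool_pac(draw, label, compare, n_pairs=2000, n_points=500, repeats=5, seed=0):
    rng = random.Random(seed)
    pairs = [(draw(rng), draw(rng)) for _ in range(n_pairs)]
    w = learn_direction(pairs, compare)
    shifts = []
    for _ in range(repeats):
        sample = [draw(rng) for _ in range(n_points)]
        projections = [dot(w, x) for x in sample]
        shifts.append(threshold(sample, projections, label))
    return w, statistics.median(shifts)  # median shift over the repetitions

# Hidden non-homogeneous separator h(x) = <v, x> + b (illustrative values).
v, b = (0.6, 0.8), -0.5
h = lambda x: dot(v, x) + b
label = lambda x: 1 if h(x) >= 0 else -1
compare = lambda x, y: (h(x) > h(y)) - (h(x) < h(y))
draw = lambda rng: (rng.gauss(0, 1), rng.gauss(0, 1))  # isotropic Gaussian, a log-concave example

w, shift = comparison_pool_pac(draw, label, compare)
```

Because the learned direction is only approximate, labels along the projection are not perfectly monotone near the boundary, which is why the thresholding step is repeated and the median shift is returned.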
Theorem 2.5 (Restatement of Theorem 1.4).

Let be a log-concave distribution over . The query complexity of Comparison-Pool-PAC learning is

Proof.

Recall that $D$ may be affinely transformed into an isotropic distribution Iso($D$). Further, we may simulate queries over Iso($D$) by applying the same transformation to our samples, and after learning over Iso($D$), we may transform our learner back to $D$. Thus learning Iso($D$) is equivalent to learning $D$, and we will assume $D$ is isotropic without loss of generality. Our algorithm will first learn a “homogenized” version of the hidden separator $h = \langle v, \cdot \rangle + b$ via Balcan and Long’s algorithm, thereby reducing to thresholding. Note that a comparison query on points $x, x'$ is equivalent to a label query on the difference $x - x'$ with respect to the homogeneous hyperplane with normal vector $v$:

$$\mathrm{sign}(h(x) - h(x')) = \mathrm{sign}(\langle v, x \rangle + b - \langle v, x' \rangle - b) = \mathrm{sign}(\langle v, x - x' \rangle).$$

We begin by drawing samples from the log-concave distribution and then apply Balcan and Long’s algorithm Balcan to learn the homogenized version of () up to error with probability using only

comparison queries. Further, since the constant given in item 3 of Fact 2.4 is universal, this means any separator output by the algorithm has a normal vector with angle

Having learned an approximation to , we turn our attention to approximating . Consider the set of points on which and disagree, that is:

To find an approximation for , we need to show that there will be correctly labeled points close to the threshold. To this end, let and define such that:

We will show that drawing a sample of points, the following three statements hold with at least probability:

Since the measure of the regions defined in statements 1 and 2 is , the probability that does not have at least one point in both regions is with an appropriate constant.

To prove the third statement, assume for contradiction that there exists such that . Because and differ in sign, this implies that , where is the projection of onto the plane spanned by u and . We can bound the probability of this event occurring by the concentration of isotropic log-concave distributions:

(1)

Because we have bounded the angle between and , with a large enough constant for we have:

Then with a large enough constant for , union bounding over gives that the third statement occurs with probability at most .

We have proved that with probability , statements 1,2, and 3 hold. Further, if these statements hold, any hyperplane we pick consistent with thresholding will disagree on at most probability mass from due to the anti-concentration of isotropic log-concave distributions and the definition of . Further, repeating this process times and taking the median shift value gives the same statement with probability at least by a Chernoff bound. Note that the number of queries made in this step is dominated by the number of queries to learn .

Finally, we need to analyze the error of our proposed hyperplane . We have already proved that the error between this and is with probability at least , so it is enough to show that . This follows similarly to statement 3 above. The portion of Dis satisfying has probability mass at most by anti-concentration. With a large enough constant for , the remainder of Dis has mass at most by (1). Then in total, with probability , has error at most .

Balcan and Long Balcan provide a lower bound on query complexity for log-concave distributions and oracles for any binary query of , so this algorithm is tight up to logarithmic factors.

3 RPU Learning with Comparison Queries

Kivinen Kivinen2 showed that RPU-learning is intractable for nice concept classes even under simple distributions when restricted to label queries. We will confirm that RPU-learning linear separators with only label queries is intractable in high dimensions, but can be made efficient in both the passive and active regimes via comparison queries.

3.1 Lower bounds

In the passive, label-only case, RPU-learning is lower bounded by the expected number of vertices on a random polytope drawn from our distribution. For simple distributions such as uniform over the unit ball, this gives sample complexity which is exponential in dimension, making RPU-learning impractical for any sort of high-dimensional data.

Definition 3.1.

Given a distribution and parameter , we denote by the minimum size of a sample drawn i.i.d. from such that the expected measure of the convex hull of , which we denote for , is .

The quantity , which has been studied in computational geometry for decades ball, ball-MQS, lower bounds Label-Passive-RPU Learning, and in some cases provides a matching upper bound up to log factors.
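To get a feel for this quantity, the following Python sketch estimates by Monte Carlo the expected mass left uncovered by the convex hull of a uniform sample. It is an illustration only and makes several assumptions not taken from the text: it works in the unit disk (dimension 2) rather than a general ball, and the function names, trial count, and seed are arbitrary choices.

```python
import math
import random


def sample_unit_disk(rng):
    """Draw one point uniformly from the unit disk by rejection sampling."""
    while True:
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            return (x, y)


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula for a simple polygon given by its ordered vertices."""
    n = len(vertices)
    if n < 3:
        return 0.0
    twice_area = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def expected_uncovered_mass(n, trials=200, seed=0):
    """Average fraction of the disk outside the hull of n uniform points."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [sample_unit_disk(rng) for _ in range(n)]
        total += 1.0 - polygon_area(convex_hull(sample)) / math.pi
    return total / trials
```

Calling `expected_uncovered_mass(n)` for increasing `n` shows the uncovered mass shrinking slowly even in the plane. The exponential dependence on dimension discussed above is not visible in a two-dimensional experiment.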

Proposition 3.2.

Let D be any distribution on . The expected sample complexity of Label-Passive-RPU-learning is:

Proof.

For any sample size , there exists a hyperplane with small enough negative measure such that the probability of drawing one or more negative points is . Further, given that a drawn sample is entirely positive, for each point outside the convex hull of there exists a hyperplane consistent with that labels the point positively, and one that labels the point negatively. Thus, as long as our sample is entirely positive, any algorithm which labels points outside of the convex hull will err on some consistent hyperplane.

Recall that is the minimum size of the sample which needs to be drawn such that is in expectation. Consider drawing a sample of size . The expected measure is then

This in turn implies a bound by the Markov inequality on the probability of the measure of the convex hull of a given sample, which we denote :

Now consider the following relation between samples of size and , which follows by viewing our size sample as distinct samples of size at least :

Combining these results and letting :

To force any learner to fail on a sample, we need two conditions: first that the measure of the convex hull is , and second that all points are of the same sign. For the latter, we argued we could pick any probability such that this occurs. Picking then gives the desired success bound:

Further, for simple distributions such as uniform over a ball, this bound is tight up to a factor.

Proposition 3.3.

The sample complexity of Label-Passive-RPU learning is:

Proof.

We will begin by computing for a ball. The expected measure of a sample drawn randomly from is computed in Wie, and given by

where is a constant depending only on dimension. Setting then gives:

Given a sample of size , let denote the subset of positively labeled points, and negatively labeled. We can infer at least the points inside the convex hulls of and . Our goal is to show that, with high probability, the measure of is . To show this, we will employ the fact ball that the expected measure of the convex hull of a sample of size uniformly drawn from any convex body is lower-bounded by:

Given this, let of measure be the set of positive points, and the negative points with measure . Since we have drawn points, with probability we will have at least points from , and at least points from . Given this many points, the expected value of our inferred mass is:

This function is minimized at , and plugging in , gives .

However, since we have conditioned on enough points being drawn from P and N, we are not done. This occurs across at least a percent of our samples, meaning that if we assume the inferred mass is 0 on other samples, our expected error (for a large enough constant on our number of samples) will be at most:

Setting is enough to drop the error below , and gives the number of samples as

In the active regime, this sort of bound is complicated by the fact that we are less interested in the number of points drawn than labeled. If we were restricted to only drawing points, we could repeat the same argument in combination with the expected number of vertices to get a bound. However, with a larger pool of allowed points, the pertinent question becomes the maximum rather than expected measure of the convex hull. In cases such as the unit ball, these actually give about the same result.

Proposition 3.4 (Restatement of Proposition 1.6).

For all , the query complexity of Label-MQS-RPU learning is:

Proof.

The maximum volume of the convex hull of points in is ball-MQS

Notice here the difference from the random case in the exponent, which comes from the fact that we are only counting the expected vertices on the boundary of the hull of the sample. Since in this scenario there exists a hyperplane with 0 negative probability mass, we can apply the same argument from Proposition 3.2, setting to get the desired bound. ∎

3.2 Upper bounds

Our positive results for comparison based RPU-learning rely on weakening the concept of inference dimension to be distribution dependent. With this in mind, we introduce average inference dimension:

Definition 3.5 (Average Inference Dimension).

We say has average inference dimension , if:

In other words, the probability that we cannot infer a point from a randomly drawn sample of size n is bounded by its average inference dimension . There is a simple average-case to worst-case reduction for average inference dimension via a union bound:

Observation 3.6.

Let have average inference dimension , and . Then has inference dimension with probability:

Proof.

The probability that a fixed subset of size does not have a point s.t. is at most . Union bounding over all subsets gives the desired result. ∎
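The union bound in this proof is easy to evaluate numerically. The sketch below assumes the bound takes the form described in the proof, namely the number of size-n subsets of a size-k sample multiplied by the per-subset failure probability. The function `g` and the example decay rate are hypothetical placeholders, not values from the paper.

```python
import math


def inference_dimension_failure_bound(k, n, g):
    """Union bound on the chance that some size-n subset of a size-k sample
    contains no point inferable from the rest.

    `g` is a hypothetical callable giving the per-subset failure probability
    (the average inference dimension evaluated at n). The result is capped
    at 1 because it bounds a probability.
    """
    return min(1.0, math.comb(k, n) * g(n))


# Example with an assumed super-exponentially decaying g (illustrative only).
example_g = lambda n: 2.0 ** (-(n * n))
example_bound = inference_dimension_failure_bound(20, 8, example_g)
```

With a super-exponentially small `g`, the binomial factor is absorbed easily, which is the regime the rest of this section works in.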

This reduction allows us to apply inference dimension in both the active and passive distributional cases. This is due in part to the fact that the boosting algorithm proposed by Kane et al. KLMZ is reliable even when given the wrong inference dimension as input: the algorithm simply loses its guarantee on query complexity. As a result, we may plug this reduction directly into their algorithm.

Corollary 3.7.

Given a query set , let be the number of queries required to answer all questions on a sample of size . Let have average inference dimension , then there exists an RPU-learner with coverage

after drawing points. Further, the expected query complexity of actively RPU-learning a finite sample is

Proof.

For the first fact, we will appeal to the symmetry argument of KLMZ. Consider a reliable learner which takes in a sample of size and infers all possible points in . To compute coverage, we want to know the probability a random point is inferred by . Since was randomly drawn from , this is the same as computing the probability that any point in can be inferred from . By Observation 3.6, the probability that has inference dimension is

Since could equally well have been any point in by symmetry, if has inference dimension the coverage will be at least KLMZ. Since this occurs with probability at least by Observation 3.6, the expected coverage of is at least

The second statement follows from a similar argument. If has inference dimension , then by Theorem 1.11 the expected query complexity is at most . For a given , the expected query complexity is then bounded by:

Plugging in Observation 3.6 and minimizing over then gives the desired result. ∎

In fact, this result shows that RPU-learning with inverse super-exponential average inference dimension loses only log factors over passive or active PAC-learning. Asking for such small average inference dimension may seem unreasonable, but something as simple as label queries on uniform distributions over convex sets has average inference dimension with respect to linear separators label-aid.

Corollary 3.8.

Given a query set , let be the number of queries required to answer all questions on a sample of size . For any , let have average inference dimension . Then the expected sample complexity of Q-Pool-RPU learning is:

Further, the expected query complexity of actively learning a finite sample is:

Proof.

Both results follow from the fact that setting the average inference dimension to gives

Then for the sample complexity, it is enough to plug this into Corollary 3.7 and let be

Plugging this into the query complexity sets the latter term from Corollary 3.7 to 1, giving:

We will show that by employing comparison queries we can improve the average inference dimension of linear separators from to , but first we will need to review a result on inference dimension from KLMZ.

Theorem 3.9 (Theorem 4.7 KLMZ).

Given a set , we define the minimal-ratio of with respect to a hyperplane as:

In other words, the minimal-ratio is a normalized version of margin, a common tool in learning algorithms. Given , define to be the subset of hyperplanes with minimal ratio with respect to . The inference dimension of (X,H) is then:

Our proof of the average inference dimension of comparison queries proceeds via a reduction to minimal-ratio. Informally, the strategy is simple: we will argue that, with high probability, throwing out the closest and furthest points from any classifier leaves a set with large minimal-ratio. We will show this in three main steps.

Step 1: Assuming concentration of our distribution, a large number of points are contained inside a ball. We will use this to bound the maximum function value for a given hyperplane when its furthest points are removed.

Step 2: Assuming anti-concentration of our distribution, we will union bound over all hyperplanes to show that they have good margin. In order to do this, we will define the notion of a -strip about a hyperplane h, which is simply h “fattened” by in both directions. If not too many points lie inside each hyperplane’s -strip, then we can be assured when we remove the closest points the remaining set will have margin . Since we cannot union bound over the infinite set of -strips, we will build a -net of the objects and use this instead.

Step 3: Combining the above results carefully shows that for any hyperplane, removing the furthest and closest points leaves a subsample of good minimal-ratio. In particular, by making sure the number of remaining points matches the bound on inference dimension given in Theorem 3.9, we can be assured that one of these points may be inferred from the rest as long as our high probability conditions hold.
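The trimming idea behind the three steps can be illustrated with a small Python sketch. It assumes the minimal-ratio of a point set with respect to a hyperplane is the smallest absolute function value divided by the largest, which matches the description of a normalized margin above; the exact formula is not reproduced in this copy of the text. The function names and the representation of a hyperplane as a weight vector `w` and offset `b` are illustrative choices.

```python
def affine_value(w, b, x):
    """Value of the affine function <w, x> + b at the point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def minimal_ratio(points, w, b):
    """Assumed definition: min |h(x)| divided by max |h(x)| over the points."""
    values = [abs(affine_value(w, b, x)) for x in points]
    largest = max(values)
    if largest == 0.0:
        return 0.0  # every point lies on the hyperplane
    return min(values) / largest


def trim_extremes(points, w, b, n_close, n_far):
    """Drop the n_close points nearest the hyperplane and the n_far furthest."""
    ordered = sorted(points, key=lambda x: abs(affine_value(w, b, x)))
    return ordered[n_close:len(ordered) - n_far]
```

For example, the one-dimensional points 0.01, 1, 2, and 100 with `w = (1.0,)` and `b = 0.0` have a minimal-ratio of about 0.0001, while removing one point from each extreme leaves a set with minimal-ratio 0.5.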

Theorem 3.10.

Let be a distribution over affinely equivalent to another with the following properties:

  1. ,