Active and passive learning of linear separators under log-concave distributions

11/06/2012
by   Maria-Florina Balcan, et al.
Microsoft

We provide new results concerning label efficient, polynomial time, passive and active learning of linear separators. We prove that active learning provides an exponential improvement over PAC (passive) learning of homogeneous linear separators under nearly log-concave distributions. Building on this, we provide a computationally efficient PAC algorithm with optimal (up to a constant factor) sample complexity for such problems. This resolves an open question concerning the sample complexity of efficient PAC algorithms under the uniform distribution in the unit ball. Moreover, it provides the first bound for a polynomial-time PAC algorithm that is tight for an interesting infinite class of hypothesis functions under a general and natural class of data distributions, providing significant progress towards a longstanding open question. We also provide new bounds for active and passive learning in the case that the data might not be linearly separable, both in the agnostic case and under the Tsybakov low-noise condition. To derive our results, we provide new structural results for (nearly) log-concave distributions, which might be of independent interest as well.

1 Introduction

Learning linear separators is one of the central challenges in machine learning. They are widely used and have long been studied in both statistical and computational learning theory. A seminal result of BEHW89, using tools due to VC, showed that d-dimensional linear separators can be learned to accuracy ε with probability 1 − δ in the classic PAC model in polynomial time with O((d/ε) log(1/ε) + (1/ε) log(1/δ)) examples. The best known lower bound for linear separators is Ω(d/ε + (1/ε) log(1/δ)), and this holds even in the case in which the distribution is uniform Lon95. Whether the upper bound can be improved to match the lower bound via a polynomial-time algorithm has been a long-standing open question, both for general distributions EHKV89,BEHW89 and for the case of the uniform distribution in the unit ball Lon95,Lon03,phil-doubling. In this work we resolve this question in the case where the underlying distribution belongs to the class of log-concave and nearly log-concave distributions, a wide class of distributions that includes the Gaussian distribution and the uniform distribution over any convex set, and which has played an important role in several areas including sampling, optimization, integration, and learning lv:2007.

We also consider active learning, a major area of research in modern machine learning, where the algorithm only receives the classifications of examples when it requests them sanjoy11-encyc. Our main result here is a polynomial-time active learning algorithm with label complexity that is exponentially better than the label complexity of any passive learning algorithm in these settings. This answers an open question in BBZ07, and it also significantly expands the set of cases for which we can show that active learning provides a clear exponential improvement in the 1/ε term (without increasing the dependence on the dimension d) over passive learning. Remarkably, our analysis for passive learning is done via a connection to our analysis for active learning; to our knowledge, this is the first paper using this technique.

We also study active and passive learning in the case that the data might not be linearly separable. We specifically provide new improved bounds for the widely studied Tsybakov low-noise condition MT99,BBM05,MN06, as well as new bounds on the disagreement coefficient, with implications for the agnostic case (i.e., arbitrary forms of noise).

Passive Learning

  In the classic passive supervised machine learning setting, the learning algorithm is given a set of labeled examples drawn i.i.d. from some fixed but unknown distribution over the instance space and labeled according to some fixed but unknown target function, and the goal is to output a classifier that does well on new examples coming from the same distribution. This setting has been long studied in both computational learning theory (within the PAC model Valiant:acm84,KV:book94) and statistical learning theory Vap82,Vapnik:book98,bbl05, and has played a crucial role in the developments and successes of machine learning.

However, despite remarkable progress, the basic question of providing polynomial-time algorithms with tight bounds on the sample complexity has remained open. Several milestone results along these lines that are especially related to our work include the following. The analysis of BEHW89, proved using tools from VC, implies that linear separators can be learned in polynomial time with O((d/ε) log(1/ε) + (1/ε) log(1/δ)) labeled examples. EHKV89 proved a bound that implies an Ω(d/ε + (1/ε) log(1/δ)) lower bound for linear separators and explicitly posed the question of providing tight bounds for this class. HLW94 established an upper bound of O(d/m) on the expected error after m examples, which can be achieved in polynomial time for linear separators.

BEHW89 achieved polynomial-time learning by finding a consistent hypothesis (i.e., a hypothesis which correctly classifies all training examples); this is a special case of ERM Vap82. An intensive line of research in the empirical process and statistical learning theory literature has taken account of “local complexity” to prove stronger bounds for ERM VW96,Van00,BBM05,Lon03,Men03,gine:06,Hanneke07,steve-surrogate. In the context of learning, local complexity takes account of the fact that really bad classifiers can be easily discarded, and the set of “local” classifiers that are harder to disqualify is sometimes not as rich. A recent landmark result of gine:06 (see also RR11,steve-surrogate) is a bound for consistent algorithms of the form

(1)

where the key distribution-dependent quantity is the Alexander capacity alexander87 (see Section 8 and Appendix A for further discussion). However, this bound can be suboptimal for linear separators.

In particular, for linear separators in the case in which the underlying distribution is uniform in the unit ball, the sample complexity is known Lon95,Lon03 to be Θ(d/ε), up to the dependence on the confidence parameter, when computational considerations are ignored. phil-doubling, using the doubling dimension Ass83, another measure of local complexity, proved a bound of

(2)

for a polynomial-time algorithm. As a lower bound of Ω(√d) on the Alexander capacity for the case of linear separators and the uniform distribution is implicit in Hanneke07, the bound of gine:06 given by (1) cannot yield a bound better than

(3)

in this case.

In this paper we provide a tight bound (up to constant factors) on the sample complexity of polynomial-time learning of linear separators with respect to log-concave distributions. Specifically, we prove an upper bound of O(d/ε + (1/ε) log(1/δ)) achieved by a polynomial-time algorithm, holding for any zero-mean log-concave distribution. We also prove an information-theoretic lower bound that matches our (computationally efficient) upper bound for each log-concave distribution. This provides the first bound for a polynomial-time algorithm that is tight for an interesting infinite class of hypothesis functions under a general class of data distributions, and it also characterizes (up to a constant factor) the distribution-specific sample complexity for each distribution in the class. In the special case of the uniform distribution, our upper bound closes the existing gap between the upper bounds (2) and (3) and the lower bound of Lon95.

Active Learning  We also study learning of linear separators in the active learning model; here the learning algorithm can access unlabeled (i.e., unclassified) examples and ask for the labels of unlabeled examples of its own choice, and the hope is that a good classifier can be learned with significantly fewer labels by actively directing the queries to informative examples. This has been a major area of machine learning research in the past fifteen years, mainly due to the availability of large amounts of unannotated or raw data in many modern applications sanjoy11-encyc, with many exciting developments on understanding its underlying principles QBC,sanjoy-coarse,BBL,BBZ07,Hanneke07,dhsm,CN07,Nina08,Kol10,nips10. However, with a few exceptions BBZ07,CN07,dkm, most of the theoretical developments have focused on the so-called disagreement-based active learning paradigm hanneke:11,Kol10; methods and analyses developed in this context are often suboptimal, as they take a conservative approach and consider strategies that query even points on which there is a small amount of uncertainty (or disagreement) among the classifiers still under consideration given the labels queried so far. The results derived in this manner often show an improvement in the 1/ε factor in the label complexity of active versus passive learning; unfortunately, however, the dependence on the dimension d typically gets worse.

By analyzing a more aggressive, margin-based active learning algorithm, we prove that we can efficiently (in polynomial time) learn homogeneous linear separators when the underlying distribution is log-concave by using a number of label requests that grows only logarithmically with 1/ε (and polynomially with d), answering an open question in BBZ07. This represents an exponential improvement of active learning over passive learning, and it significantly broadens the cases for which we can show that the dependence on 1/ε in passive learning can be improved to only log(1/ε) in active learning, without increasing the dependence on the dimension d. We note that an improvement of this type was previously known to be possible only when the underlying distribution is (nearly) uniform in the unit ball BBZ07,dkm,QBC; even for this special case, our analysis improves the bounds of BBZ07 by a multiplicative factor. It also provides a better dependence on d than any other previous analysis implementable in a computationally efficient manner (both disagreement-based hanneke:11,Hanneke07 and more aggressive ones dkm,QBC), and than the inefficient splitting-index analysis of sanjoy-coarse.

Techniques

  At the core of our results is a novel characterization of the region of disagreement of two linear separators under a log-concave measure. We show that for any two linear separators specified by normal vectors u and v, and for any constant c₁ > 0, we can pick a margin as small as a constant times the angle α between u and v and still ensure that the probability mass of the region of disagreement outside the band of that margin around one of them is at most c₁ · α (Theorem 3). Using this fact, we then show how we can use a margin-based active learning technique, where in each round we only query points near the hypothesized decision boundary, to get an exponential improvement over passive learning.

We then show that any passive learning algorithm that outputs a hypothesis consistent with the random examples it receives will, with probability at least 1 − δ, output a hypothesis of error at most ε (Theorem 5). Interestingly, our analysis is quite dissimilar to the classic analyses of ERM. It proceeds by conceptually running the algorithm online on progressively larger chunks of examples, and using the intermediate hypotheses to track the progress of the algorithm. We show, using the same tools as in the active learning analysis, that it is always likely that the algorithm will receive informative examples. Our analysis shows that the algorithm would also achieve accuracy ε with high probability even if it periodically built preliminary hypotheses using some of the examples, and then only used borderline cases for those preliminary classifiers for further training. (Note that such examples would not be i.i.d. from the underlying distribution!) To achieve the optimal sample complexity, we have to carefully distribute the confidence parameter, allowing a higher probability of failure in the later stages, to compensate for the fact that, once the hypothesis is already pretty good, it takes longer to get examples that help to further improve it.

Non-separable case  We also study label-efficient learning in the presence of noise. We show how our results for the realizable case can be extended to handle (a variant of) the Tsybakov noise condition, which has received substantial attention in statistical learning theory, both for passive and active learning MT99,BBM05,MN06,gine:06,BBZ07,Kol10,hanneke:11; this includes the random classification noise commonly studied in computational learning theory KV:book94, and the more general bounded (or Massart) noise BBM05,MN06,gine:06,Kol10. Our analysis for Massart noise leads to optimal bounds (up to constant factors) for active and passive learning of linear separators when the marginal distribution on the feature vectors is log-concave, improving the dependence on the key parameters over the previous best known results. Our analysis for Tsybakov noise leads to bounds on active learning with a similarly improved dependence over previously known results in this case as well.

We also provide a bound on the Alexander capacity alexander87,gine:06 and the closely related disagreement coefficient Hanneke07, which have been widely used to characterize the sample complexity of various (active and passive) algorithms Hanneke07,Kol10,gine:06,nips10. This immediately implies concrete bounds on the labeled data complexity of several algorithms in the literature, including active learning algorithms designed for the purely agnostic case (i.e., arbitrary forms of noise), e.g., the A² algorithm BBL and the DHM algorithm dhsm.

Nearly log-concave distributions  We also extend our results both for passive and active learning to deal with nearly log-concave distributions; this is a broader class of distributions introduced by kannan91, which contains mixtures of (not too separated) log-concave distributions. In deriving our results, we provide new tail bounds and structural results for these distributions, which might be of independent interest and utility, both in learning theory and in other areas including sampling and optimization.

We note that our bounds on the disagreement coefficient improve on the bounds of fridemnan09 (matching what was known for the much less general case of the nearly uniform distribution over the unit sphere); furthermore, they apply to the nearly log-concave case, where we allow an arbitrary number of discontinuities, a case not captured by the fridemnan09 conditions at all. We discuss other related papers in Appendix A.

2 Preliminaries and Notation

We focus on binary classification problems; that is, we consider the problem of predicting a binary label y based on its corresponding input vector x. As in the standard machine learning formulation, we assume that the data points (x, y) are drawn from an unknown underlying distribution D over X × Y; X is called the instance space and Y is the label space. In this paper we assume that X = R^d and Y = {−1, 1}; we also denote by D_X the marginal of D over X. Let H be the class of linear separators through the origin, that is, classifiers of the form h_w(x) = sign(w · x) for unit vectors w in R^d. To keep the notation simple, we sometimes refer to a weight vector and the linear classifier with that weight vector interchangeably. Our goal is to output a hypothesis h of small error, where err(h) = Pr_{(x,y)∼D}[h(x) ≠ y].

We consider two learning protocols: passive learning and active learning. In the passive learning setting, the learning algorithm is given a set of labeled examples drawn i.i.d. from D, and the goal is to output a hypothesis of small error by using only a polynomial number of labeled examples. In the (pool-based) active learning setting, a set of examples (x_1, y_1), ..., (x_m, y_m) is also drawn i.i.d. from D; the learning algorithm is permitted direct access to the sequence of x_i values (unlabeled data points), but has to make a label request to obtain the label y_i of example x_i. The hope is that in the active learning setting we can output a classifier of small error by using many fewer label requests than in the passive learning setting by actively directing the queries to informative examples (while keeping the number of unlabeled examples polynomial). For added generality, we also consider the selective sampling active learning model, where the algorithm visits the unlabeled data points in sequence and, for each x_i, makes a decision on whether or not to request the label y_i based only on the previously observed values x_1, ..., x_i and the corresponding requested labels, and never changes this decision once made. Both our upper and lower bounds apply to both selective sampling and pool-based active learning.

In the “realizable case”, we assume that the labels are deterministic and generated by a target function that belongs to H. In the non-realizable case (studied in Sections 8 and 9) we do not make this assumption, and instead aim to compete with the best function in H.

Given two unit vectors u and v and any distribution D′, we denote by d_{D′}(u, v) = Pr_{x∼D′}[sign(u · x) ≠ sign(v · x)] the probability that the corresponding linear classifiers disagree; we also denote by θ(u, v) the angle between the vectors u and v.

3 Log-Concave Densities

Throughout this paper we focus on the case where the underlying distribution is log-concave or nearly log-concave. Such distributions have played a key role in the past two decades in several areas including sampling, optimization, and integration algorithms lv:2007, and more recently for learning theory as well KKMS05,KLT09,Vem10. In this section we first summarize known results about such distributions that are useful for our analysis and then prove a novel structural statement that will be key to our analysis (Theorem 3). In Section 6 we describe extensions to nearly log-concave distributions as well.

A distribution D over R^d with density function f is log-concave if log f(·) is concave. It is isotropic if its mean is the origin and its covariance matrix is the identity.
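
As an aside (our illustration, not part of the paper), the following sketch numerically spot-checks this definition for the standard Gaussian density in R², using randomly drawn pairs of points and mixing weights; the particular density, number of trials, and tolerance are arbitrary choices, and the check should report zero violations.

```python
import numpy as np

# Numerically spot-check log-concavity of the standard 2-D Gaussian density:
# log f(lam*x1 + (1-lam)*x2) >= lam*log f(x1) + (1-lam)*log f(x2).
rng = np.random.default_rng(0)

def log_density(x):
    # Log of the standard Gaussian density in R^2.
    return -0.5 * np.dot(x, x) - np.log(2 * np.pi)

violations = 0
for _ in range(10_000):
    x1, x2 = rng.normal(size=2), rng.normal(size=2)
    lam = rng.uniform()
    lhs = log_density(lam * x1 + (1 - lam) * x2)
    rhs = lam * log_density(x1) + (1 - lam) * log_density(x2)
    if lhs < rhs - 1e-9:          # small tolerance for floating point error
        violations += 1
print("violations of log-concavity:", violations)   # expected: 0
```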

Log-concave distributions form a broad class of distributions: for example, the Gaussian, Logistic, and uniform distribution over any convex set are log-concave distributions. The following lemma summarizes known useful facts about isotropic log-concave distributions (most are from lv:2007; the upper bound on the density is from KLT09).

Assume that D is log-concave in R^d and let f be its density function.

  1. If is isotropic then If then:

  2. If is isotropic, then whenever . Furthermore, and where is and is , for all of any norm.

  3. All marginals of D are log-concave. If D is isotropic, its marginals are isotropic as well.

  4. If , then

  5. If is isotropic and we have and for all .

Throughout our paper we will use the fact that there exists a universal constant c such that the probability of disagreement of any two homogeneous linear separators is lower bounded by c times the angle between their normal vectors. This follows by projecting the region of disagreement onto the two-dimensional subspace spanned by the two normal vectors, and then using properties of log-concave distributions in two dimensions. The proof is implicit in earlier works (e.g., Vem10); for completeness, we include a proof in Appendix B.

Assume D is an isotropic log-concave distribution in R^d. Then there exists a constant c > 0 such that, for any two unit vectors u and v in R^d, we have c · θ(u, v) ≤ d_D(u, v).
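
As a quick sanity check of this lemma (ours, not the paper's), the sketch below uses the standard Gaussian, an isotropic log-concave distribution for which the disagreement probability of two homogeneous separators is exactly θ(u, v)/π, so the ratio d_D(u, v)/θ(u, v) stays bounded away from zero, as the lemma requires; the dimension, angles, and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 10, 200_000

def disagreement(u, v, X):
    # Fraction of points on which the two homogeneous separators disagree.
    return np.mean(np.sign(X @ u) != np.sign(X @ v))

X = rng.standard_normal((n, d))          # isotropic log-concave (Gaussian) sample
u = np.eye(d)[0]
for angle in [0.05, 0.2, 0.5, 1.0]:
    v = np.cos(angle) * np.eye(d)[0] + np.sin(angle) * np.eye(d)[1]
    est = disagreement(u, v, X)
    print(f"theta={angle:.2f}  Pr[disagree]={est:.4f}  theta/pi={angle/np.pi:.4f}")
```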

To analyze our active and passive learning algorithms we provide a novel characterization of the region of disagreement of two linear separators under a log-concave measure: For any c₁ > 0, there is a constant c₂ > 0 such that the following holds. Let u and v be two unit vectors in R^d, and assume that θ(u, v) = α < π/2. If D is isotropic log-concave in R^d, then:

(4)   Pr_{x∼D}[ sign(u · x) ≠ sign(v · x) and |v · x| ≥ c₂ · α ] ≤ c₁ · α.

Choose c₂ large enough relative to c₁; we will show that (4) then holds. Let E be the set whose probability we want to bound. Since the event under consideration only concerns the projection of x onto the span of u and v, Lemma 3(c) implies we can assume without loss of generality that d = 2.

Next, we claim that each member x of E has large norm. Assume without loss of generality that v · x is positive (the other case is symmetric). Then sign(u · x) ≠ sign(v · x) means that x lies in one of the two planar wedges of angle α between the two separators, and within such a wedge the distance to the hyperplane {z : v · z = 0} satisfies |v · x| ≤ ‖x‖ sin α ≤ ‖x‖ α. Combined with |v · x| ≥ c₂ · α, this implies ‖x‖ ≥ c₂. This implies that, if B is the ball of radius c₂ centered at the origin, then

(5)

To obtain the desired bound, we carefully bound each term in the RHS. Choose .

Let be the density of . We have

Applying the density upper bound from Lemma 3 with , there are constants and such that

If we include in the integral again, we get

Now, we exploit the fact that the integral above is a rescaling of a probability with respect to the uniform distribution. Let be the volume of the unit ball in . Then, we have

for . Returning to (5), we get

Combining the bounds above, and recalling that c₂ was chosen large enough relative to c₁, this completes the proof.

We note that a weaker result of this type was proven (via different techniques) for the uniform distribution in the unit ball in BBZ07. In addition to being more general, Theorem 3 is tighter and more refined even for this specific case; this improvement is essential for obtaining tight bounds for polynomial-time algorithms for passive learning (Section 5) and better bounds for active learning as well.
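
The following numerical sketch (ours; the Gaussian marginal, the dimension, the angle, and the band constants are all arbitrary illustrative choices) shows the phenomenon behind Theorem 3: as the band constant c₂ grows, the disagreement mass outside the band {x : |v · x| ≤ c₂ · α} becomes an arbitrarily small fraction of the angle α.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 10, 500_000
X = rng.standard_normal((n, d))          # Gaussian marginal standing in for an isotropic log-concave D

alpha = 0.1                              # angle between the two separators
u = np.eye(d)[0]
v = np.cos(alpha) * np.eye(d)[0] + np.sin(alpha) * np.eye(d)[1]

disagree = np.sign(X @ u) != np.sign(X @ v)
for c2 in [1.0, 2.0, 4.0, 8.0]:
    outside_band = np.abs(X @ v) > c2 * alpha
    mass = np.mean(disagree & outside_band)
    print(f"c2={c2:3.1f}  Pr[disagree & outside band] / alpha = {mass/alpha:.4f}")
```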

4 Active Learning

In this section we analyze a margin-based algorithm for actively learning linear separators under log-concave distributions BBZ07 (Algorithm 1). Lower bounds proved in Section 7 show that this algorithm needs exponentially fewer labeled examples than any passive learning algorithm.

This algorithm has been previously proposed and analyzed in BBZ07 for the special case of the uniform distribution in the unit ball. In this paper we analyze it for the much more general class of log-concave distributions.

Input: a sampling oracle for the marginal distribution D_X, a labeling oracle, sequences m_1, m_2, ... (sample sizes) and b_1, b_2, ... (cut-off values).

Output: weight vector ŵ_s.

  • Draw m_1 examples from D_X, label them, and put them in W.

  • iterate k = 2, ..., s

    • find a hypothesis ŵ_{k−1} (with ‖ŵ_{k−1}‖ = 1) consistent with all labeled examples in W.

    • let W = ∅.

    • until m_k additional data points are labeled, draw a sample x from D_X:

      • if |ŵ_{k−1} · x| ≥ b_{k−1}, then reject x,

      • else, ask for the label of x, and put x (with its label) into W.

Algorithm 1 Margin-based Active Learning
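
Below is a minimal, self-contained rendering of Algorithm 1 (ours, not the authors' code), assuming a Gaussian marginal standing in for a generic isotropic log-concave distribution, a noiseless labeling oracle given by a hidden target w*, the illustrative parameter choices b_k proportional to 2^{−k} and a constant per-round label budget, and a feasibility linear program for the consistent-hypothesis step (cf. the remark after Theorem 5).

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
d = 10
w_star = np.eye(d)[0]                          # hidden target, assumed for this illustration

def label(x):                                  # noiseless labeling oracle
    return 1.0 if w_star @ x >= 0 else -1.0

def draw():                                    # sampling oracle: Gaussian marginal (isotropic log-concave)
    return rng.standard_normal(d)

def find_consistent(X, y):
    """Return a unit-norm w with y_i * (w @ x_i) > 0 for all i, via a feasibility LP."""
    res = linprog(c=np.zeros(d), A_ub=-(y[:, None] * X), b_ub=-np.ones(len(y)),
                  bounds=[(None, None)] * d, method="highs")
    return res.x / np.linalg.norm(res.x)

s, m_k, labels_used = 10, 200, 0               # rounds and per-round label budget (illustrative)
X = np.array([draw() for _ in range(m_k)])
y = np.array([label(x) for x in X]); labels_used += m_k
w_hat = find_consistent(X, y)                  # hypothesis from the initial labeled sample
for k in range(2, s + 1):
    b = 2.0 ** (1 - k)                         # cut-off value shrinks geometrically
    band_X, band_y = [], []
    while len(band_X) < m_k:                   # rejection-sample points close to the current boundary
        x = draw()
        if abs(w_hat @ x) <= b:
            band_X.append(x); band_y.append(label(x)); labels_used += 1
    w_hat = find_consistent(np.array(band_X), np.array(band_y))

print("labels used:", labels_used,
      " angle to target:", round(float(np.arccos(np.clip(w_hat @ w_star, -1, 1))), 4))
```

The label count grows only with the number of rounds times the per-round budget, while the rejection step consumes many more unlabeled draws; this is exactly the trade-off the analysis quantifies.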

Assume D is isotropic log-concave in R^d. There exist absolute constants such that, for all d larger than a fixed constant and any ε, δ > 0 with ε sufficiently small, using Algorithm 1 with cut-off values b_k proportional to 2^{−k} and suitable sample sizes m_k, after s = O(log(1/ε)) iterations we find a separator of error at most ε with probability 1 − δ. The total number of labeled examples needed grows only linearly with d and polylogarithmically with 1/ε and 1/δ.

Proof: Let c be the constant from Lemma 3. We will show, using induction, that for all k ≤ s, with probability at least 1 minus the share of δ allotted to the first k iterations, any w consistent with the data in the working set W has error at most a fixed constant times 2^{−k}, so that, in particular, the same holds for ŵ_k.

The base case k = 1 follows from standard VC bounds (see, e.g., VC). Assume now that the claim is true for k − 1 (k ≥ 2), and consider the k-th iteration. By the induction hypothesis, we know that, with the stated probability, all w consistent with the previous working set, including ŵ_{k−1}, have error at most a constant times 2^{−(k−1)}. Consider an arbitrary such w. By Lemma 3, the angles θ(w, w*) and θ(ŵ_{k−1}, w*) are each at most a constant times the corresponding error, so θ(w, ŵ_{k−1}) is at most a constant times 2^{−k}. Applying Theorem 3, there is a choice of the constant in the cut-off value b_{k−1} for which the probability mass of the disagreement between w and ŵ_{k−1} that falls outside the band, and likewise of the disagreement between w* and ŵ_{k−1} outside the band, is a small constant times 2^{−k}. So

(6)

Now let us treat the points that fall within the band. Since we are labeling m_k data points in the band at iteration k, classic Vapnik-Chervonenkis bounds VC imply that, if the constant in the choice of m_k is a large enough absolute constant, then with the stated probability, for all w consistent with the data in the current working set,

(7)

Finally, since the band consists of those points that, after projecting onto the direction ŵ_{k−1}, fall into an interval of length 2 b_{k−1}, Lemma 3 implies that its probability mass is at most proportional to b_{k−1}. Putting this together with (6) and (7), with the stated probability we obtain the claimed error bound at iteration k, completing the proof.

5 Passive Learning

In this section we show how an analysis that was inspired by active learning leads to optimal (up to constant factors) bounds for polynomial-time algorithms for passive learning.

Assume that D is zero-mean and log-concave in R^d. There exist absolute constants such that, for all d larger than a fixed constant and any ε, δ > 0, any algorithm that outputs a hypothesis that correctly classifies C (d + log(1/δ))/ε examples, for a suitable absolute constant C, finds a separator of error at most ε with probability 1 − δ.

Proof Sketch: We focus here on the case that is isotropic. We can treat the non-isotropic case by observing that the two cases are equivalent; one may pass between them by applying the whitening transform. (See Appendix C for details.)

While our analysis will ultimately provide a guarantee for any learning algorithm that always outputs a consistent hypothesis, we will use the intermediate hypotheses of Algorithm 1 in the analysis.

Let c be the constant from Lemma 3. While proving Theorem 4, we showed that, if Algorithm 1 is run with the parameter settings used there, then for all k ≤ s, with the corresponding probability, any w consistent with the data in the working set W has error at most a constant times 2^{−k}. Thus, after s iterations, with probability at least 1 − δ, any linear classifier consistent with all the training data has error at most ε, since any such classifier is in particular consistent with the examples in W.

Now, let us analyze the number of examples used, including those examples whose labels were not requested by Algorithm 1. Lemma 3 implies that there is a positive constant c′ such that the probability that a fresh example falls in the band at iteration k is at least c′ b_{k−1}: again, the band consists of those points that fall into an interval of length 2 b_{k−1} after projecting onto ŵ_{k−1}, and the density of this one-dimensional projection is lower bounded by a constant near the origin. The expected number of examples that we need to draw before we find m_k elements of the band is therefore at most m_k/(c′ b_{k−1}). Using a Chernoff bound, if we draw a constant factor more examples than this, the probability that we fail to get m_k members of the band is exponentially small in m_k, which is small enough if m_k is large enough. So, the total number of examples needed is at most a constant factor more than the sum over k of m_k/b_{k−1}.

We can show that this sum is O((d + log(1/δ))/ε), completing the proof.
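
To make the last step concrete, here is one way the sum can be bounded; this is our reconstruction, under the assumptions (in the spirit of the proof of Theorem 4, not a verbatim restatement of it) that m_k = O(d + log((s − k + 1)/δ)), that b_{k−1} is proportional to 2^{−k}, and that s = O(log(1/ε)):

```latex
\sum_{k=1}^{s} \frac{m_k}{b_{k-1}}
  = O\!\Big(\sum_{k=1}^{s} 2^{k}\big(d + \log\tfrac{1}{\delta} + \log(s-k+1)\big)\Big)
  = O\!\Big(2^{s}\big(d + \log\tfrac{1}{\delta}\big)\Big)
  = O\!\Big(\frac{d + \log(1/\delta)}{\varepsilon}\Big),
\qquad \text{since } \sum_{k=1}^{s} 2^{k}\log(s-k+1) \le 2^{s}\sum_{j\ge 0} 2^{-j}\log(j+2) = O(2^{s}).
```

Allotting a larger share of the failure probability δ to the later (larger-k) stages is what keeps a term of the form log(s − k + 1), rather than log(s/δ), inside the sum.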

We conclude this section by pointing out several important facts and implications of Theorem 5 and its proof.

  1. The separator in Theorem 5 (and the one in Theorem 4) can be found in polynomial time, for example by using linear programming; a small sketch of this step appears after this list.

  2. The analysis of Theorem 5 also bounds the number of unlabeled examples needed by the active learning algorithm of Theorem 4. This shows that an algorithm can request a nearly optimally small number of labels without increasing the total number of examples required by more than a constant factor. Specifically, in round k, we only need on the order of m_k/b_{k−1} unlabeled examples (whp), where b_{k−1} is the cut-off value of that round, so the total number of unlabeled examples needed over all rounds is within a constant factor of the sample complexity bound of Theorem 5.
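
As a concrete illustration of note 1 (our sketch, not the authors' code), a unit-norm separator consistent with a realizable labeled sample can be found by a single feasibility linear program; the Gaussian marginal and the hidden target below are assumptions made purely for the demonstration.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
d, m = 20, 2000
w_star = rng.standard_normal(d); w_star /= np.linalg.norm(w_star)   # hidden target
X = rng.standard_normal((m, d))                                     # log-concave (Gaussian) marginal
y = np.sign(X @ w_star)                                             # realizable labels

# Feasibility LP: find w with y_i * (w @ x_i) >= 1 for every example.
res = linprog(c=np.zeros(d), A_ub=-(y[:, None] * X), b_ub=-np.ones(m),
              bounds=[(None, None)] * d, method="highs")
w_hat = res.x / np.linalg.norm(res.x)

print("training errors:", int(np.sum(np.sign(X @ w_hat) != y)))     # 0: consistent
print("angle to target:", float(np.arccos(np.clip(w_hat @ w_star, -1, 1))))
```

The constraints y_i (w · x_i) ≥ 1 are simply a scaled form of consistency, and a feasible point exists whenever the sample is linearly separable.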

6 More Distributions

In this section we consider learning with respect to a more general class of distributions. We start by providing a general set of conditions on a set of distributions that is sufficient for efficient passive and active learning with respect to distributions in that set. We then consider nearly log-concave distributions, an interesting, more general class containing log-concave distributions, considered previously in kannan91 and shie07, and prove that isotropic nearly log-concave distributions satisfy our sufficient conditions; in Appendix D, we also show how to remove the assumption that the distribution is isotropic.

A set of distributions is admissible if it satisfies the following:

  • There exists c > 0 such that, for any distribution D in the set and any two unit vectors u and v in R^d, we have c · θ(u, v) ≤ d_D(u, v).

  • For any c₁ > 0, there is a c₂ > 0 such that the following holds for all D in the set. Let u and v be two unit vectors in R^d s.t. θ(u, v) = α < π/2. Then Pr_{x∼D}[sign(u · x) ≠ sign(v · x) and |v · x| ≥ c₂ · α] ≤ c₁ · α.

  • There are positive constants c₃ and c₄ such that, for any D in the set and any projection of D onto a one-dimensional subspace, the density f of the projection satisfies f(x) ≤ c₃ for all x, and f(x) ≥ c₄ for all x in a fixed interval around the origin.

The proofs of Theorem 4 and Theorem 5 can be used without modification to show: if a set of distributions is admissible, then homogeneous linear separators can be learned with respect to arbitrary distributions in that set in polynomial time, in the active learning model with the label complexity of Theorem 4, and in the passive learning model with the sample complexity of Theorem 5.

6.1 The nearly log-concave case

A density function f is β-log-concave if for any λ ∈ [0, 1] and any x₁, x₂, we have f(λ x₁ + (1 − λ) x₂) ≥ e^{−β} f(x₁)^λ f(x₂)^{1−λ}.

Clearly, a density function is log-concave if it is 0-log-concave. An example of a β-log-concave distribution is a mixture of two log-concave distributions whose covariance matrices are the identity and whose means are sufficiently close together (as a function of β).
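
To give this definition some texture (our illustration; the component means, sampling range, and number of random triples are arbitrary choices), the sketch below estimates, for a one-dimensional equal-weight mixture of two unit-variance Gaussians, the smallest β for which the inequality above holds over many sampled triples; the required β is essentially zero while the components overlap heavily and becomes positive once their means are far enough apart.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)

def mixture_pdf(x, mu):
    # Equal-weight mixture of N(-mu/2, 1) and N(+mu/2, 1); mean zero by symmetry.
    return 0.5 * norm.pdf(x, -mu / 2) + 0.5 * norm.pdf(x, mu / 2)

def beta_required(mu, trials=50_000):
    # Smallest beta (over the sampled triples) making
    # f(l*x1 + (1-l)*x2) >= exp(-beta) * f(x1)**l * f(x2)**(1-l) hold.
    x1 = rng.uniform(-5, 5, trials)
    x2 = rng.uniform(-5, 5, trials)
    lam = rng.uniform(0, 1, trials)
    lhs = np.log(mixture_pdf(lam * x1 + (1 - lam) * x2, mu))
    rhs = lam * np.log(mixture_pdf(x1, mu)) + (1 - lam) * np.log(mixture_pdf(x2, mu))
    return max(0.0, np.max(rhs - lhs))

for mu in [0.0, 1.0, 2.0, 3.0]:
    print(f"mean separation {mu:.1f}: empirical beta ~ {beta_required(mu):.3f}")
```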

In this section we prove that for any sufficiently small constant β, the class of isotropic β-log-concave distributions in R^d is admissible and has light tails (this second fact is useful for analyzing the disagreement coefficient in Section 8). In doing so we provide several new properties of such distributions, which could be of independent interest. Detailed proofs of our claims appear in Appendix D.

We start by showing that for any isotropic β-log-concave density f there exists a log-concave density g whose center is close to the center of f and that approximates f pointwise up to a multiplicative factor; the fact that this factor is only exponential in βd is key for being able to argue that such distributions have light tails.

For any isotropic log-concave density function there exists a log-concave density function that satisfies and , for . Moreover, we have for every unit vector .

Proof Sketch: Note that if the density function f is β-log-concave, then h = log f satisfies, for any λ ∈ [0, 1] and any x₁, x₂, h(λ x₁ + (1 − λ) x₂) ≥ λ h(x₁) + (1 − λ) h(x₂) − β. Let ĥ be the function whose subgraph is the convex hull of the subgraph of h. (By Carathéodory's theorem, if a point of R^m lies in the convex hull of a set P, then it lies in the convex hull of a subset of P consisting of m + 1 or fewer points.) Using this, one can show that ĥ exceeds h by at most a quantity proportional to βd; normalizing the corresponding function to make it a density then yields a log-concave density that is pointwise comparable to f up to the factor described above.

Using this fact and concentration properties of log-concave distributions (in particular Lemma 3), we can show that the center of g is close to the center of f, as desired.

Assume β is a sufficiently small non-negative constant and consider the set of all isotropic β-log-concave distributions. (a) This set is admissible. (b) Any distribution in this set has light tails; that is, the probability that ‖x‖ exceeds a given threshold decays exponentially with the threshold.

Proof Sketch: (a) Choose . As in Lemma 3, consider the plane determined by and and let denote the projection operator that given , orthogonally projects onto this plane. If then By using the Prekopa-Leindler inequality Gar02 one can show that is log-concave (see e.g., shie07). Moreover, if is isotropic, than is isotropic as well. By Lemma 6.1 we know that there exists a -isotropic log-concave distribution centered at , , satisfying and for every unit vector , for constants and . For sufficiently small we have . Using this, by applying the whitening transform (see Theorem D.1 in Appendix D), we can show , for , which implies , for . Using a reasoning as in Lemma 3 we get The generalization of Theorem 3 follows from a similar proof, except using Theorem D.1. The density bounds in the case also follow from Theorem D.1 as well.

(b) Since is isotropic, we have (where is its associated density). By Lemma 6.1, there exists a log-concave density such that , for . This implies . By Lemma 3 we get that that under , , so under we have .  

Using Theorem 6 and Theorem 6.1(a) we obtain: Let β be a sufficiently small constant. Assume that D is an isotropic β-log-concave distribution in R^d. Then homogeneous linear separators can be learned with respect to D in polynomial time, in the active learning model with the label complexity of Theorem 4 and in the passive learning model with the sample complexity of Theorem 5.

7 Lower Bounds

In this section we give lower bounds on the label complexity of passive and active learning of homogeneous linear separators when the underlying distribution is β-log-concave, for a sufficiently small constant β. These lower bounds are information-theoretic: they apply to any procedure, whether or not it is computationally efficient. The proof is in Appendix E.

For a small enough constant β we have: (1) for any β-log-concave distribution D whose covariance matrix has full rank, the sample complexity of learning origin-centered linear separators under D in the passive learning model is Ω(d/ε + (1/ε) log(1/δ)); (2) the sample complexity of active learning of linear separators under β-log-concave distributions is Ω(d log(1/ε)).

Note that, if the covariance matrix of D does not have full rank, the number of dimensions is effectively smaller than d, so our lower bound essentially applies to all β-log-concave distributions.

8 The inseparable case: Disagreement-based active learning

We consider two closely related distribution-dependent capacity notions: the Alexander capacity and the disagreement coefficient; they have been widely used for analyzing the label complexity of non-aggressive active learning algorithms Hanneke07,dhsm,Kol10,hanneke:11,nips10. We begin with the definitions. For r > 0, define the ball B(w, r) = {u : d_D(u, w) ≤ r}. For any set of classifiers H′, define the region of disagreement as DIS(H′) = {x : there exist u, v ∈ H′ with sign(u · x) ≠ sign(v · x)}. Define the Alexander capacity function for w* w.r.t. D as capacity_{w*,D}(r) = Pr_{x∼D}[x ∈ DIS(B(w*, r))]/r. Define the disagreement coefficient for w* w.r.t. D as θ_{w*,D}(ε) = sup_{r ≥ ε} capacity_{w*,D}(r).

The following is our bound on the disagreement coefficient. Its proof is in Appendix F.

Let β be a sufficiently small constant. Assume that D is an isotropic β-log-concave distribution in R^d. For any w* and any r ≥ ε, capacity_{w*,D}(r) is O(√d log(1/r)). Thus θ_{w*,D}(ε) is O(√d log(1/ε)).

Theorem 8 immediately leads to concrete bounds on the label complexity of several algorithms in the literature Hanneke07,CAL,BBL,Kol10,dhsm. For example, by composing it with a result of dhsm, we obtain a concrete label complexity bound for agnostic active learning when D is isotropic log-concave in R^d; that is, a bound on the number of label requests needed to output a classifier of error at most ν + ε, where ν is the error of the best linear classifier.
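
The capacity defined above is easy to estimate numerically. The sketch below (ours; Gaussian marginal and arbitrary sample sizes) uses the fact that, for homogeneous separators, a point x lies in DIS(B(w*, r)) exactly when |w* · x| ≤ ‖x‖ sin φ, where φ is the angle corresponding to disagreement radius r (φ = πr under the Gaussian, for which disagreement equals angle/π); the estimates grow roughly like √d, in line with the bound above.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000

def capacity(d, r):
    """Monte Carlo estimate of Pr_x[ DIS(B(w*, r)) ] / r under a Gaussian marginal."""
    X = rng.standard_normal((n, d))
    w_star = np.eye(d)[0]
    phi = np.pi * r                       # angle radius matching disagreement radius r (Gaussian case)
    in_dis = np.abs(X @ w_star) <= np.linalg.norm(X, axis=1) * np.sin(phi)
    return np.mean(in_dis) / r

for d in [5, 20, 80]:
    print(f"d={d:3d}:", [round(float(capacity(d, r)), 2) for r in (0.01, 0.02, 0.05)])
```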

9 The Tsybakov condition

In this section we consider a variant of the Tsybakov noise condition MT99. We assume that the classifier that minimizes err is a linear classifier with weight vector w*, and that there are known parameters α ∈ (0, 1] and an associated constant relating, for every w, the disagreement d(w, w*) to the excess error err(w) − err(w*); the case α = 1 includes bounded (Massart) noise.

By generalizing Theorem 3 so that it provides a stronger bound for larger margins, and combining the result with the other lemmas of this paper and techniques from BBZ07, we get the following.

Assume that the distribution D satisfies the Tsybakov noise condition above with parameter α and the associated constant, and that the marginal of D on x is isotropic log-concave. (1) If α = 1, we can find a separator with excess error at most ε with probability 1 − δ using a number of labeled examples, in both the active and the passive learning models, given explicitly in Appendix G. (2) If α < 1, we can find a separator with excess error at most ε with probability 1 − δ in the active learning model with a label budget likewise given in Appendix G.

In the case α = 1 (which is more general than the Massart noise condition) our analysis leads to optimal bounds for active and passive learning of linear separators under log-concave distributions, improving the dependence on the key parameters over the previous best known results steve-surrogate,gine:06. Our analysis for Tsybakov noise with α < 1 leads to bounds on active learning with a similarly improved dependence over previously known results steve-surrogate in this case as well. Proofs and further details appear in Appendix G.

10 Discussion and Open Questions

The label complexity of our active learning algorithm for learning homogeneous linear separators under isotropic log-concave distributions does not exactly match our lower bound for this setting. Our upper bound is achieved by an algorithm that uses a polynomial number of unlabeled training examples and polynomial time. If an unbounded amount of computation time and an unbounded number of unlabeled examples are available, it seems to be easy to learn to accuracy ε using O(d log(1/ε)) label requests, no matter what the value of δ. (Roughly, the algorithm can construct an ε-cover of the class of separators to initialize a set of candidate hypotheses, then repeatedly wait for an unlabeled example that evenly splits the current list of candidates, ask for its label, and eliminate roughly half of the candidates; a toy version is sketched below.) It would be interesting to know what is the best label complexity achievable by a polynomial-time algorithm, or even by an algorithm that is constrained to use a polynomial number of unlabeled examples.
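
Here is a toy rendering of that computationally unbounded strategy (ours; a finite angular grid in R² stands in for a generic ε-cover, the unlabeled stream is Gaussian, and the hidden target is taken to be a cover element for simplicity):

```python
import numpy as np

rng = np.random.default_rng(7)
eps = 0.01
angles = np.arange(0.0, np.pi, eps)              # crude epsilon-cover of homogeneous separators in R^2
candidates = np.column_stack([np.cos(angles), np.sin(angles)])
w_star = candidates[len(candidates) // 3]        # hidden target (a cover element, for simplicity)

labels_used = 0
while len(candidates) > 1:
    # Wait for an unlabeled point on which the remaining candidates split roughly evenly.
    while True:
        x = rng.standard_normal(2)
        votes = np.sign(candidates @ x)
        if 0.25 <= np.mean(votes > 0) <= 0.75:
            break
    y = np.sign(w_star @ x)                      # one label request
    labels_used += 1
    candidates = candidates[votes == y]          # discard the candidates that disagree with the label

print("labels used:", labels_used, "; log2 of cover size:", round(float(np.log2(len(angles))), 1))
```

Each query removes at least a quarter of the remaining candidates, so the number of label requests is proportional to the log of the cover size, regardless of how many unlabeled points were inspected.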

Conceptually, our analysis of ERM for passive learning under (nearly) log-concave distributions is based on a more aggressive localization than those considered previously in the literature. It would be very interesting to extend this analysis as well as our analysis for active learning to arbitrary distributions and more general concept spaces.

Acknowledgements

We thank Steve Hanneke for a number of useful discussions.

This work was supported in part by NSF grant CCF-0953192, AFOSR grant FA9550-09-1-0538, and a Microsoft Research Faculty Fellowship.

References

  • [Alexander.(1987)] K.S. Alexander. Rates of growth and sample moduli for weighted empirical processes indexed by sets. Probability Theory and Related Fields, 1987.
  • [Alon(2010)] N. Alon. A non-linear lower bound for planar epsilon-nets. FOCS, pages 341–346, 2010.
  • [Applegate and Kannan(1991)] D. Applegate and R. Kannan. Sampling and integration of near log-concave functions. In STOC, 1991.
  • [Assouad(1983)] P. Assouad. Plongements lipschitziens dans R^n. Bull. Soc. Math. France, 111(4):429–448, 1983.
  • [Balcan et al.(2006)Balcan, Beygelzimer, and Langford] M. F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In ICML, 2006.
  • [Balcan et al.(2007)Balcan, Broder, and Zhang] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In COLT, 2007.
  • [Balcan et al.(2008)Balcan, Hanneke, and Wortman] M.-F. Balcan, S. Hanneke, and J. Wortman. The true sample complexity of active learning. In COLT, 2008.
  • [Bartlett et al.(2005)Bartlett, Bousquet, and Mendelson] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 2005.
  • [Beygelzimer et al.(2010)Beygelzimer, Hsu, Langford, and Zhang] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In NIPS, 2010.
  • [Blumer et al.(1989)Blumer, Ehrenfeucht, Haussler, and Warmuth] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. JACM, 36(4):929–965, 1989.
  • [Boucheron et al.(2005)Boucheron, Bousquet, and Lugosi] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of recent advances. ESAIM: Probability and Statistics, 2005.
  • [Bshouty et al.(2009)Bshouty, Li, and Long] N. H. Bshouty, Y. Li, and P. M. Long. Using the doubling dimension to analyze the generalization of learning algorithms. JCSS, 2009.
  • [Caramanis and Mannor(2007)] C. Caramanis and S. Mannor. An inequality for nearly log-concave distributions with applications to learning. IEEE Transactions on Information Theory, 2007.
  • [Castro and Nowak(2007)] R. Castro and R. Nowak. Minimax bounds for active learning. In COLT, 2007.
  • [Cesa-Bianchi et al.(2010)Cesa-Bianchi, Gentile, and Zaniboni.] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Learning noisy linear classifiers via adaptive and selective sampling. Machine Learning, 2010.
  • [Clarkson and Varadarajan(2007)] K. L. Clarkson and K. Varadarajan. Improved approximation algorithms for geometric set cover. Discrete Comput. Geom., 37(1):43–58, 2007.
  • [Cohn et al.(1994)Cohn, Atlas, and Ladner] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. In ICML, 1994.
  • [Dasgupta(2005)] S. Dasgupta. Coarse sample complexity bounds for active learning. In NIPS, volume 18, 2005.
  • [Dasgupta(2011)] S. Dasgupta. Active learning. Encyclopedia of Machine Learning, 2011.
  • [Dasgupta et al.(2005)Dasgupta, Kalai, and Monteleoni] S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. In COLT, 2005.
  • [Dasgupta et al.(2007)Dasgupta, Hsu, and Monteleoni] S. Dasgupta, D.J. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. Advances in Neural Information Processing Systems, 20, 2007.
  • [Dekel et al.(2012)Dekel, Gentile, and Sridharan] O. Dekel, C. Gentile, and K. Sridharan. Selective sampling and active learning from single and multiple teachers. Journal of Machine Learning Research, 2012.
  • [Ehrenfeucht et al.(1989)Ehrenfeucht, Haussler, Kearns, and Valiant] A. Ehrenfeucht, D. Haussler, M. Kearns, and L. G. Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 1989.
  • [Freund et al.(1997)Freund, Seung, Shamir, and Tishby.] Y. Freund, H.S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, 1997.
  • [Friedman(2009)] E. J. Friedman. Active learning for smooth problems. In COLT, 2009.
  • [Gardner(2002)] R. J. Gardner. The Brunn-Minkowski inequality. Bull. Amer. Math. Soc., 2002.
  • [Giné and Koltchinskii(2006)] E. Giné and V. Koltchinskii. Concentration inequalities and asymptotic results for ratio type empirical processes. The Annals of Probability, 34(3):1143–1216, 2006.
  • [Gonen et al.(2013)Gonen, Sabato, and Shalev-Shwartz] A. Gonen, S. Sabato, and S. Shalev-Shwartz. Efficient pool-based active learning of halfspaces. In ICML, 2013.
  • [Hanneke(2007)] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.
  • [Hanneke(2011)] S. Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.
  • [Hanneke and Yang(2012)] S. Hanneke and L. Yang. Surrogate losses in passive and active learning, 2012. http://arxiv.org/abs/1207.3772.
  • [Haussler et al.(1994)Haussler, Littlestone, and Warmuth] D. Haussler, N. Littlestone, and M. K. Warmuth. Predicting {0,1}-functions on randomly drawn points. Information and Computation, 115(2):129–161, 1994.
  • [Haussler and Welzl(1987)] David Haussler and Emo Welzl. Epsilon nets and simplex range queries. Disc. Comp. Geometry, 2:127–151, 1987.
  • [Kalai et al.(2005)Kalai, Klivans, Mansour, and Servedio] A. Kalai, A. Klivans, Y. Mansour, and R. Servedio. Agnostically learning halfspaces. In Proceedings of the 46th Annual Symposium on the Foundations of Computer Science (FOCS), 2005.
  • [Kearns and Vazirani(1994)] M. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
  • [Klivans et al.(2009a)Klivans, Long, and Servedio.] A. R. Klivans, P. M. Long, and R. A. Servedio. Learning halfspaces with malicious noise. JMLR, 2009a.
  • [Klivans et al.(2009b)Klivans, Long, and Tang] A. R. Klivans, P. M. Long, and A. Tang. Baum’s algorithm learns intersections of halfspaces with respect to log-concave distributions. In RANDOM, 2009b.
  • [Koltchinskii(2010)] V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. Journal of Machine Learning Research, 11:2457–2485, 2010.
  • [Komlós et al.(1992)Komlós, Pach, and Woeginger] J. Komlós, J. Pach, and G. Woeginger. Almost tight bounds on epsilon-nets. Discrete and Computational Geometry, 7:163–173, 1992.
  • [Kulkarni et al.(1993)Kulkarni, Mitter, and Tsitsiklis] S. R. Kulkarni, S. K. Mitter, and J. N. Tsitsiklis. Active learning using arbitrary binary valued queries. Machine Learning, 1993.
  • [Long(1995)] P. M. Long. On the sample complexity of PAC learning halfspaces against the uniform distribution. IEEE Transactions on Neural Networks, 6(6):1556–1559, 1995.
  • [Long(2003)] P. M. Long. An upper bound on the sample complexity of PAC learning halfspaces with respect to the uniform distribution. Information Processing Letters, 2003.
  • [Lovasz and Vempala(2007)] L. Lovasz and S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures and Algorithms, 2007.
  • [Mammen and Tsybakov(1999)] E. Mammen and A.B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27:1808–1829, 1999.
  • [Massart and Nedelec(2006)] P. Massart and E. Nedelec. Risk bounds for statistical learning. The Annals of Statistics, 2006.
  • [Mendelson(2003)] S. Mendelson. Estimating the performance of kernel classes. Journal of Machine Learning Research, 4:759–771, 2003.
  • [Nowak(2011)] R. Nowak. The Geometry of Generalized Binary Search. IEEE Transactions on Information Theory, 2011.
  • [Pach and Agarwal(1995)] J. Pach and P.K. Agarwal. Combinatorial Geometry. John Wiley and Sons, 1995.
  • [Raginsky and Rakhlin(2011)] M. Raginsky and A. Rakhlin. Lower bounds for passive and active learning. In NIPS, 2011.
  • [Valiant(1984)] L.G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
  • [van de Geer(2000)] S. van de Geer. Empirical processes in M-estimation. Cambridge Series in Statistical and Probabilistic Methods, 2000.
  • [van der Vaart and Wellner(1996)] A. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes With Applications to Statistics. Springer, 1996.
  • [Vapnik and Chervonenkis(1971)] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.
  • [Vapnik(1982)] V. N. Vapnik. Estimation of Dependencies based on Empirical Data. Springer Verlag, 1982.
  • [Vapnik(1998)] V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998.
  • [Vempala(2010)] S. Vempala. A random-sampling-based algorithm for learning intersections of halfspaces. JACM, 57(6), 2010.

Appendix A Additional Related Work

Learning with noise. Alexander Capacity and the Disagreement Coefficient  Roughly speaking, the Alexander capacity alexander87,gine:06 quantifies how fast the region of disagreement of the set of classifiers at distance r from the optimal classifier collapses as a function of r. (The region of disagreement of a set of classifiers is the set of instances x such that there exist two classifiers in the set that disagree about the label of x.) The disagreement coefficient Hanneke07 additionally involves a supremum of this quantity over a range of values of r. fridemnan09 provides guarantees on these quantities (for sufficiently small r) for general classes of functions in R^d if the underlying data distribution is sufficiently smooth. Our analysis implies much tighter bounds for linear separators under log-concave distributions (matching what was known for the much less general case of the nearly uniform distribution over the unit sphere); furthermore, we also analyze the nearly log-concave case, where we allow an arbitrary number of discontinuities, a case not captured by the fridemnan09 conditions at all. This immediately implies concrete bounds on the labeled data complexity of several algorithms in the literature, including the A² algorithm BBL and the DHM algorithm dhsm, with implications for the purely agnostic case (i.e., arbitrary forms of noise), as well as Koltchinskii's algorithm Kol10 and the CAL algorithm BBL,Hanneke07,hanneke:11. Furthermore, in the realizable case and under Tsybakov noise, we show even better bounds, by considering aggressive active learning algorithms.

Note that, as opposed to the realizable case, all existing active learning algorithms analyzed under the Massart and Tsybakov noise conditions in the learning model analyzed in this paper (including our algorithms in Theorem 9), as well as those for the agnostic setting, are not known to run in time polynomial in both d and 1/ε. In fact, even ignoring the optimality of the sample complexity, there are no known algorithms for passive learning that run in time polynomial in d and 1/ε for general values of the noise parameters, even for the Massart noise condition and under log-concave distributions. Existing works on agnostic passive learning under log-concave distributions either have running times that are exponential in 1/ε (e.g., the work of KKMS05) or can only achieve error values that are significantly larger than the noise rate kls09.

Other Work on Active Learning  Several papers CaCEGe10,dgs12 present efficient online learning algorithms in the selective sampling framework, where labels must be actively queried before they are revealed. Under the assumption that the label conditional distribution is a linear function determined by a fixed target vector, they provide bounds on the regret of the algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. As pointed out by dgs12, these results can be converted to a statistical setting when the instances are drawn i.i.d. and a margin condition is further assumed. In this setting they obtain an exponential improvement in label complexity over passive learning. While very interesting, these results are incomparable to ours; their techniques significantly exploit the linear noise condition to get these improvements. Note that such an improvement would not be possible in the realizable case (as pointed out, for example, in GSS12).

Now11 considers an interesting abstract “generalized binary search” problem with applications to active learning; while these results apply to more general concept spaces, it is not clear how to implement the resulting procedures in polynomial time using access to only a polynomial number of unlabeled samples from the underlying distribution (as required by the active learning model). Another interesting recent work is that of GSS12, which studies active learning of linear separators via an aggressive algorithm using a margin condition, with a general approximation guarantee on the number of labels requested; note that while these results work for potentially more general distributions, they do not, as opposed to ours, come with explicit (tight) bounds on the label complexity.

ε-nets, Learning, and Geometry  Small ε-nets are useful for many applications, especially in computational geometry (see PA95). The same fundamental techniques of VC,Vap82 have been applied to establish the existence of small ε-nets HW87 and to bound the sample complexity of learning Vap82,BEHW89, and a number of interesting upper and lower bounds on the smallest possible size of ε-nets have been obtained KPW92,CV07,Alo10.

Our analysis implies an upper bound on the size of an ε-net for the set of regions of disagreement between all possible linear classifiers and the target, when the distribution is zero-mean and log-concave. In particular, since in Theorem 5 we prove that any hypothesis consistent with the training data has error rate at most ε with probability 1 − δ, setting δ to a constant gives an ε-net of size O(d/ε) for this set of regions.

Appendix B Proof of Lemma 3

Lemma 3. Assume D is an isotropic log-concave distribution in R^d. Then there exists a constant c > 0 such that, for any two unit vectors u and v in R^d, we have c · θ(u, v) ≤ d_D(u, v).

Consider two unit vectors u and v. Let P denote the projection operator that, given x, orthogonally projects x onto the plane determined by u and v. That is, if we define an orthogonal coordinate system in which the first two coordinates lie in this plane and the remaining coordinates are orthogonal to this plane, then