S-Concave Distributions: Towards Broader Distributions for Noise-Tolerant and Sample-Efficient Learning Algorithms

03/22/2017 ∙ by Maria-Florina Balcan, et al. ∙ Carnegie Mellon University

We provide new results concerning noise-tolerant and sample-efficient learning algorithms under s-concave distributions over R^n for -1/(2n+3) ≤ s ≤ 0. The new class of s-concave distributions is a broad and natural generalization of log-concavity, and includes many important additional distributions, e.g., the Pareto distribution and t-distribution. This class has been studied in the context of efficient sampling, integration, and optimization, but much remains unknown concerning the geometry of this class of distributions and their applications in the context of learning. The challenge is that unlike the commonly used distributions in learning (uniform or more generally log-concave distributions), this broader class is not closed under the marginalization operator and many such distributions are fat-tailed. In this work, we introduce new convex geometry tools to study the properties of s-concave distributions and use these properties to provide bounds on quantities of interest to learning including the probability of disagreement between two halfspaces, disagreement outside a band, and the disagreement coefficient. We use these results to significantly generalize prior results for margin-based active learning, disagreement-based active learning, and passive learning of intersections of halfspaces. Our analysis of geometric properties of s-concave distributions might be of independent interest to optimization more broadly.

1 Introduction

Developing provable learning algorithms is one of the central challenges in learning theory. The study of such algorithms has led to significant advances in both the theory and practice of passive and active learning. In the passive learning model, the learning algorithm has access to a set of labeled examples sampled i.i.d. from some unknown distribution over the instance space and labeled according to some underlying target function. In the active learning model, however, the algorithm can access unlabeled examples and request labels of its own choice, and the goal is to learn the target function with significantly fewer labels. In this work, we study both learning models in the case where the underlying distribution belongs to the class of s-concave distributions.

Prior work on noise-tolerant and sample-efficient algorithms mostly relies on the assumption that the distribution over the instance space is log-concave [2, 22, 9, 57]. A distribution is log-concave if the logarithm of its density is a concave function. The assumption of log-concavity has been made for two purposes: computational efficiency and sample efficiency. On the computational side, it was used to obtain noise-tolerant algorithms even for seemingly simple decision surfaces like linear separators. Such algorithms exist for noiseless scenarios, e.g., via linear programming [51], but the problem becomes notoriously hard once noise is present [25, 42, 32]; this is why progress on noise-tolerant algorithms has focused on uniform [36, 43] and log-concave distributions [6]. Other concept classes, like intersections of halfspaces, have no known computationally efficient algorithm in the noise-free setting under general distributions, but there has been nice progress under uniform and log-concave distributions [44]. On the sample-efficiency side, in the context of active learning, distributional assumptions are needed in order to obtain label complexity improvements [26]. The most concrete and general class for which prior work obtains such improvements is when the marginal distribution over the instance space satisfies log-concavity [59, 9]. In this work, we provide a broad generalization of all of the above results, showing how they extend to s-concave distributions. A distribution with density f is s-concave if f^s is concave for s > 0 and convex for s < 0 (with s = 0 corresponding to log-concavity; see Definition 1). We identify key properties of these distributions that allow us to simultaneously extend all of the above results.

How general and important is the class of s-concave distributions? The class of s-concave distributions is very broad and contains many well-known (classes of) distributions as special cases. For example, when s = 0, s-concave distributions reduce to log-concave distributions. Furthermore, the s-concave class contains infinitely many fat-tailed distributions that do not belong to the class of log-concave distributions, e.g., the Cauchy, Pareto, and t-distributions, which have been widely applied in theoretical physics and economics, yet much remains unknown about how provable learning algorithms, such as active learning of halfspaces, perform under these realistic distributions. We also compare s-concave distributions with nearly-log-concave distributions, a slightly broader class than the log-concave one. A distribution with density f is nearly-log-concave if for any λ ∈ [0, 1] and any x1, x2, we have f(λx1 + (1−λ)x2) ≥ e^{−β} f(x1)^λ f(x2)^{1−λ} for a small constant β ≥ 0 [9]. The class of s-concave distributions includes many important distributions that are not nearly-log-concave: a nearly-log-concave distribution must have sub-exponential tails (see Theorem 11, [9]), while the tail probability of an s-concave distribution might decay much more slowly (see Theorem 1 (6)). We also note that efficient sampling, integration, and optimization algorithms for s-concave distributions are well understood [23, 37]. Our analysis of s-concave distributions bridges these algorithms to the strong guarantees of noise-tolerant and sample-efficient learning algorithms.
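As a quick numerical illustration of these membership claims (this check is ours, not from the paper), recall that for s < 0 a density f is s-concave exactly when f^s is convex. The sketch below verifies on a grid that the fat-tailed Student-t density is s-concave for a suitable s < 0 but not log-concave, while the Gaussian is log-concave; the grid, the degrees of freedom, and the candidate s are our own choices.

# Numerical sanity check (ours): for s < 0, f is s-concave iff f^s is convex.
import numpy as np
from scipy.stats import t as student_t, norm

def convex_on_grid(vals, tol=1e-9):
    """A convex function has non-negative second differences on a uniform grid."""
    return bool(np.all(np.diff(vals, 2) >= -tol))

x = np.linspace(-10, 10, 2001)
nu = 5.0                       # Student-t with 5 degrees of freedom (fat-tailed)
s = -1.0 / (nu + 1.0)          # candidate concavity parameter for this density

f_t = student_t.pdf(x, df=nu)
f_gauss = norm.pdf(x)

print("t-density: f^s convex (s-concave)?      ", convex_on_grid(f_t ** s))           # True
print("t-density: -log f convex (log-concave)? ", convex_on_grid(-np.log(f_t)))       # False
print("Gaussian:  -log f convex (log-concave)? ", convex_on_grid(-np.log(f_gauss)))   # True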

1.1 Our Contributions

Structural Results. We study various geometric properties of s-concave distributions. These properties serve as the structural results for many provable learning algorithms, e.g., margin-based active learning [9], disagreement-based active learning [56, 35], learning intersections of halfspaces [44], etc. When s = 0, our results exactly reduce to those for log-concave distributions [9, 4, 6]. Below, we state our structural results informally:

Theorem 1 (Informal).

Let D be an isotropic s-concave distribution in R^n. Then there exist closed-form functions of n and s such that:

  1. (Weakly Closed under Marginal) The marginal of D over a subset of its coordinates (and its cumulative distribution function, CDF) is isotropic s'-concave for an explicitly computable s' depending on s and the number of marginalized coordinates. (Theorems 3, 4)

  2. (Lower Bound on Hyperplane Disagreement) For any two unit vectors u and v in R^n, the disagreement probability Pr_{x ~ D}[sign(u · x) ≠ sign(v · x)] is lower bounded by a closed-form function of n and s times the angle θ(u, v) between u and v. (Theorem 12)

  3. (Probability of Band) For any unit vector w and any t > 0, the probability that x falls in the band {x : |w · x| ≤ t} is bounded by closed-form functions of t, n, and s. (Theorem 11)

  4. (Disagreement outside Margin) For any absolute constant and any function of n and s, there is a closed-form margin width outside of which two angularly close halfspaces contribute only a prescribed small fraction of their total disagreement. (Theorem 13)

  5. (Variance in 1-D Direction) For any two unit vectors u and v in R^n that are sufficiently close in angle, the variance of v · x under the conditional distribution of D over a band {x : |u · x| ≤ t} is bounded by a closed-form function of n, s, and t. (Theorem 14)

  6. (Tail Probability) The tail probability Pr[‖x‖ ≥ t] is bounded by a closed-form function of n, s, and t that decays polynomially in t. (Theorem 5)

If s = 0 (i.e., the distribution is log-concave), then these closed-form functions are all absolute constants.

To prove Theorem 1, we introduce several new techniques, e.g., an extension of the Prékopa-Leindler theorem and a reduction to a baseline function (see the supplementary material for our techniques), which might be of independent interest to optimization more broadly.

Margin Based Active Learning: We apply our structural results to margin-based active learning of a halfspace under any isotropic s-concave distribution, in both the realizable and adversarial noise models. In the realizable case, the instance x is drawn from an isotropic s-concave distribution and the label is given by the target halfspace. In the adversarial noise model, an adversary can additionally corrupt a bounded fraction of the labels. For both cases, we show that there exists a computationally efficient algorithm that outputs a linear separator with error at most ε relative to the target (see Theorems 15 and 16). The label complexity improves exponentially in 1/ε over the passive learning scenario under s-concave distributions, even though the underlying distribution might be fat-tailed. To the best of our knowledge, this is the first result on computationally efficient, noise-tolerant margin-based active learning under the broader class of s-concave distributions. Our work solves an open problem proposed by Awasthi et al. [6] about exploring wider classes of distributions for provable active learning algorithms.

Disagreement Based Active Learning: We apply our results to agnostic disagreement-based active learning under s-concave distributions. The key to the analysis is estimating the disagreement coefficient, a distribution-dependent measure of complexity that is used to analyze certain types of active learning algorithms, e.g., the CAL [24] and A^2 [7] algorithms. We work out the disagreement coefficient under isotropic s-concave distributions (see Theorem 17). By composing it with existing work on active learning [27], we obtain a bound on the label complexity under the class of s-concave distributions. As far as we are aware, this is the first result on disagreement-based active learning that goes beyond log-concave distributions. Our bounds on the disagreement coefficient match the best known results for the much less general case of log-concave distributions [9]; furthermore, they apply to the s-concave case, where we allow an arbitrary number of discontinuities, a case not captured by [31].

Learning Intersections of Halfspaces: Baum’s algorithm is one of the most famous algorithms for learning intersections of halfspaces. It was first proposed by Baum [11] for symmetric distributions, and later extended to log-concave distributions by Klivans et al. [44], since these distributions are almost symmetric. In this paper, we show that approximate symmetry also holds for s-concave distributions. With this, we work out the label complexity of Baum’s algorithm under the broader class of s-concave distributions (see Theorem 18), and advance the state-of-the-art results (see Table 1).

We provide lower bounds to partially show the tightness of our analysis. Our results can be potentially applied to other provable learning algorithms as well [38, 58, 13, 57, 10], which might be of independent interest. We discuss our techniques and other related papers in the supplementary material.

                            Prior Work                              Ours
Margin (Efficient, Noise)   uniform [5], log-concave [6]            s-concave
Disagreement                uniform [34], nearly-log-concave [9]    s-concave
Baum's                      symmetric [11], log-concave [44]        s-concave
Table 1: Comparison with prior distributional assumptions for margin-based active learning, disagreement-based active learning, and Baum's algorithm.

1.2 Our Techniques

In this section, we introduce the techniques used for obtaining our results.

Figure 1: Isoperimetry.

Marginalization: Our results are inspired by the isoperimetric inequality for s-concave distributions due to Chandrasekaran et al. [23]. Roughly, isoperimetry states that if two sets are well-separated, then the region between them has large measure relative to the measure of the two sets (see Figure 1). Results of this kind are particularly useful for margin-based active learning of halfspaces [5, 4, 6]: the algorithm proceeds in rounds, aiming to cut the error within a band down by half in each round. Since the measure of the band is large or even dominant, the error over the whole space decreases by almost a half in each round, resulting in an exponentially fast convergence rate. However, in order to make the analysis of such algorithms work for s-concave distributions, we typically require geometric properties more refined than isoperimetry, since the isoperimetric inequality says nothing about the absolute measure of the band under s-concave distributions.

The insight behind isoperimetry is a collection of properties concerning the geometry of the probability density. While the geometric properties of some classic families, such as log-concave distributions (the case s = 0), are well studied [49], it is typically hard to generalize those results to s-concave distributions for a broader range of s. This is because the class of s-concave functions is not closed under marginalization: the marginal of an s-concave function need not be s-concave. This directly blocks the application of prior proof techniques for log-concave distributions to the s-concave case. Furthermore, previous proofs depend heavily on the assumption that the density is light-tailed (see Theorem 11 in [9]), which does not hold for the possibly fat-tailed s-concave distributions.

To mitigate the above concerns, we begin with a powerful tool from convex geometry by Brascamp and Lieb [20]. This result can be viewed as an extension of the celebrated Prékopa-Leindler inequality, an integral inequality that is closely related to a number of classical inequalities in analysis and serves as the building block of isoperimetry under log-concave distributions [21, 22]. With this tool, we show that the marginal of any s-concave function is s'-concave, where s' has a closed form in terms of s and the number of dimensions being marginalized out. Our analysis is tight: there exists an s-concave function whose marginal is s'-concave and no better.
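For reference, the classical Prékopa-Leindler inequality that this result extends reads (in its standard textbook form, quoted here for context rather than from the elided text): for any fixed λ ∈ (0, 1) and non-negative integrable f, g, h on R^n,
\[ h(\lambda x + (1-\lambda)y) \;\ge\; f(x)^{\lambda}\, g(y)^{1-\lambda} \ \ \text{for all } x, y \quad\Longrightarrow\quad \int h \;\ge\; \Big(\int f\Big)^{\lambda}\Big(\int g\Big)^{1-\lambda}. \]
Applying this inequality in the variables being integrated out is exactly how one shows that marginals of log-concave densities remain log-concave, i.e., the s = 0 case of the marginalization step described above.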

Reduction to 1-D Baseline Function: It is in general hard to study a high-dimensional s-concave distribution directly. Instead, we build on the marginalization technique described above to reduce each n-dimensional s-concave function to the one-dimensional case. Thus it suffices to investigate the geometry of one-dimensional s-concave functions. But there are still infinitely many such functions in this class.

Our proofs take a novel approach by reducing every one-dimensional s-concave density to a certain baseline function. The baseline function should meet two goals: (a) it represents the worst case in the class of s-concave functions, namely, it should achieve the bounds on the geometric properties of interest; (b) it should be easy to analyze, e.g., with closed-form moments or integrals. Note that choosing a baseline function at the “boundary” between the s-concave and non-s-concave classes readily achieves goal (a). To achieve goal (b), we take the “template” function to be as simple as possible, for a particular choice of two parameters. Such functions have many good properties that one can exploit. First, their moments can be represented in closed form via the beta function. This enables us to relate moments of various orders explicitly and obtain a recursive inequality, which is critical for deducing the bounds on the one-dimensional geometric properties. Second, the template sits exactly at the “boundary” of the s-concave class. Therefore, it suffices to analyze the whole class of s-concave functions by focusing on this template. Below, we briefly summarize our high-level proof ideas.
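The exact template is elided above; to illustrate why a boundary choice yields beta-function moments, consider the hypothetical two-parameter family h_b(t) = (1 + bt)^{1/s} on [0, ∞) with b > 0 (our stand-in, not necessarily the paper's choice). Since h_b(t)^s = 1 + bt is linear, h_b sits exactly at the boundary of s-concavity, and its moments reduce to the beta function:
\[ \int_0^{\infty} t^{k}(1+bt)^{1/s}\,dt \;=\; b^{-(k+1)}\int_0^{\infty} u^{k}(1+u)^{1/s}\,du \;=\; b^{-(k+1)}\,B\!\Big(k+1,\,-\tfrac{1}{s}-k-1\Big), \qquad 0 \le k < -\tfrac{1}{s}-1, \]
using the representation B(a, c) = ∫_0^∞ u^{a−1}(1+u)^{−(a+c)} du. Relations between moments of different orders then follow from the standard recurrences of the beta function.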

2 Related Work

Active Learning of Halfspace under Uniform Distribution:

Learning halfspaces has been extensively studied in the past decades [16, 45, 29, 36, 52, 41, 40, 39]. Probably the most classical result is the VC argument: Vapnik [54] and Blumer et al. [17] showed that any hypothesis consistent with a set of labeled examples has error that decreases with the sample size at a rate governed by the VC dimension of the hypothesis class. The argument works under any data distribution and yields a polynomial-time algorithm whenever a consistent hypothesis can be found efficiently, e.g., by linear programming in the realizable case. Other algorithms, such as the Perceptron [50], Winnow [47], and the Support Vector Machine [55], provide better guarantees when the target vector has low ℓ1 or ℓ2 norm. All these results form the basis of passive learning.
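For concreteness, the standard statement of this VC guarantee (quoted in its textbook form rather than from the elided text above) is: with probability at least 1 − δ over m i.i.d. labeled examples, every hypothesis h from a class of VC dimension d that is consistent with the sample satisfies
\[ \mathrm{err}(h) \;=\; O\!\left(\frac{d\,\log(m/d) + \log(1/\delta)}{m}\right). \]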

To explore the possibility of further improving the label complexity, several algorithms were later proposed in the active learning literature [15, 14], initially under the uniform distribution [28, 30]; disagreement-based active learning and margin-based active learning are two typical approaches. In disagreement-based active learning, the algorithm proceeds in rounds, requesting the labels of instances in the region of disagreement among the current candidate hypotheses. Cohn et al. [24] provided the first disagreement-based active learning algorithm in the realizable case. Balcan et al. [7] later extended this approach to the agnostic setting by estimating confidence intervals over the disagreement region. The analysis was further generalized by Hanneke [34], who introduced the disagreement coefficient, a measure of complexity for active learning problems that serves as an important element for bounding the label complexity. However, that seminal work only bounded the disagreement coefficient under the uniform distribution.

Margin-based active learning is another line of research in the active learning literature. The algorithm proceeds in rounds, aggressively requesting labels of examples in the margin area around the current hypothesis. Balcan et al. [8] first proposed an algorithm for margin-based active learning under the uniform distribution in the realizable case. They also provided guarantees under the Tsybakov noise model [53], but the algorithm is inefficient. To mitigate this issue, Awasthi et al. [3] considered a subclass of Tsybakov noise, the Massart noise model [19]. Their algorithm runs in polynomial time by performing a sequence of hinge loss minimizations on the labeled instances. However, it was unclear at the time whether the analysis extends to distributions other than the uniform one.

Geometry of Log-Concave Distribution:

A log-concave distribution, i.e., one whose density has a concave logarithm, is a common generalization of the uniform distribution over a convex set [49]. Bertsimas and Vempala [12] and Kalai and Vempala [37] observed that efficient sampling, integration, and optimization algorithms for this distribution class rely heavily on the good isoperimetry of the density functions. Informally, a function has good isoperimetry if one cannot remove a small-measure set from its domain and thereby partition the domain into two disjoint large-measure sets. Good isoperimetry is commonly regarded as a hallmark of good geometric behavior. Along these lines, Lovász and Vempala [49] proved the isoperimetric inequality for log-concave distributions and provided a collection of refined geometric properties for this class. Going slightly beyond log-concavity, Caramanis and Mannor [22] showed good isoperimetry for nearly-log-concave distributions, but more refined geometry was not provided there.

Active learning of halfspaces under (nearly) log-concave distributions has a natural connection to the geometry of those distributions (a.k.a. admissible distributions). The connection was first introduced by [9] and suffices for the success of disagreement-based and margin-based active learning under log-concave distributions [9]. To resolve the computational issue, Awasthi et al. [5] studied the probability of disagreement outside the margin under log-concave distributions, and proposed an efficient algorithm for the challenging adversarial noise model. More recently, Awasthi et al. [4] provided stronger guarantees for efficient learning of halfspaces in the Massart noise model under log-concave distributions.

S-Concave Distribution: The problem of extending provable learning algorithms beyond log-concave distributions has received significant attention in recent years. Although some efforts have been devoted to generalizing the distributional assumption, e.g., to nearly-log-concave distributions [9], the analysis is still intrinsically built upon the geometry of log-concave distributions. Moreover, to the best of our knowledge, there is no efficient, noise-tolerant active learning algorithm that goes beyond the log-concave distribution. As a candidate extension, the class of s-concave distributions has many appealing properties that one can exploit [23, 33]: (a) the class is much broader than the log-concave distributions, which correspond to the special case s = 0; (b) an s-concave density on R^n has good isoperimetry provided s is not too negative [23]; (c) efficient sampling, integration, and optimization algorithms are available for this distribution class. All these properties inspire our work.

3 Preliminary

Before proceeding, we define some notation and clarify our problem setup in this section.

Notations:

We will use capital or lower-case letters to represent random variables, D to represent an s-concave distribution, and D restricted to a set to represent the corresponding conditional distribution. We define the sign function as sign(x) = +1 if x ≥ 0 and −1 otherwise. We denote by B(·, ·) the beta function and by Γ(·) the gamma function. We consider a single norm for vectors in R^n, namely the ℓ2 norm, denoted by ‖·‖. We will frequently write the measure of a probability distribution in terms of its density function f. For convenience, we use a modified addition of concavity parameters that differs from the ordinary + only in degenerate cases and agrees with it otherwise. For u, v ∈ R^n, we write θ(u, v) for the angle between them.

3.1 From Log-Concavity to S-Concavity

We begin with the definition of s-concavity. There are slight differences among the definitions of s-concave density, s-concave distribution, and s-concave measure.

Definition 1 (S-Concave (Density) Function, Distribution, Measure).

A function f: R^n → R_+ is s-concave if f(λx + (1−λ)y) ≥ (λ f(x)^s + (1−λ) f(y)^s)^{1/s} for all λ ∈ [0, 1] and all x, y. (When s = 0, the right-hand side is interpreted as its limit f(x)^λ f(y)^{1−λ}; in this case, f is known to be log-concave.) A probability distribution is s-concave if its density function is s-concave. A probability measure μ is s-concave if μ(λA + (1−λ)B) ≥ (λ μ(A)^s + (1−λ) μ(B)^s)^{1/s} for all λ ∈ [0, 1] and all measurable sets A and B.

Special classes of s-concave functions include the concave functions (s = 1), harmonic-concave functions (s = −1), quasi-concave functions (s = −∞), etc. The conditions in Definition 1 become progressively weaker as s becomes smaller: an s-concave density (distribution, measure) is also s′-concave for every s′ ≤ s. Thus one can verify the chain of inclusions [23]: concave ⊂ log-concave ⊂ s-concave (s < 0) ⊂ quasi-concave.
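The inequality in Definition 1 is easy to test numerically. The sketch below (ours, with an example density and value range of our choosing) checks the inequality on random pairs for the density f(t) = 2(1+t)^{-3} on [0, ∞), which one can verify analytically to be s-concave exactly for s ≤ -1/3 and, in particular, not log-concave; the output illustrates how the condition weakens as s decreases.

# Numerical check of Definition 1 (ours, not from the paper).
import numpy as np

rng = np.random.default_rng(0)
f = lambda t: 2.0 * (1.0 + t) ** (-3.0)      # example density on [0, inf)

def passes_def1(f, s, trials=20000, tol=1e-9):
    """Test f(lam*x+(1-lam)*y) >= (lam*f(x)^s + (1-lam)*f(y)^s)^(1/s) on random pairs."""
    x, y = rng.uniform(0, 20, trials), rng.uniform(0, 20, trials)
    lam = rng.uniform(0, 1, trials)
    lhs = f(lam * x + (1 - lam) * y)
    if s == 0:                                # log-concavity (the limit s -> 0)
        rhs = f(x) ** lam * f(y) ** (1 - lam)
    else:                                     # generalized power mean of order s
        rhs = (lam * f(x) ** s + (1 - lam) * f(y) ** s) ** (1.0 / s)
    return bool(np.all(lhs >= rhs - tol))

for s in [0.0, -0.2, -1.0 / 3.0, -0.5]:
    print(f"s = {s:+.3f}: Definition 1 holds on samples? {passes_def1(f, s)}")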

4 Structural Results of S-Concave Distributions: A Toolkit

In this section, we develop geometric properties of s-concave distributions. The challenge is that unlike the commonly used distributions in learning (uniform or more generally log-concave distributions), this broader class is not closed under the marginalization operator and many such distributions are fat-tailed. To address this issue, we introduce several new techniques. We first introduce the extension of the Prékopa-Leindler inequality so as to reduce the high-dimensional problem to the one-dimensional case. We then reduce the resulting one-dimensional s-concave function to a well-defined baseline function, and explore the geometric properties of that baseline function.

4.1 Marginal Distribution and Cumulative Distribution Function

We begin with the analysis of the marginal distribution, which forms the basis of the other geometric properties of s-concave distributions. Unlike the (nearly) log-concave case, where the marginal remains (nearly) log-concave, the class of s-concave distributions is not closed under the marginalization operator. To study the marginal, our primary tool is convex geometry. Specifically, we will use an extension of the Prékopa-Leindler inequality developed by Brascamp and Lieb [20], which allows for a characterization of the integral of s-concave functions.

Theorem 2 ([20], Thm 3.3).

Let 0 < λ < 1 and γ ≥ −1/n, and let f, g, and h be non-negative integrable functions on R^n such that h(λx + (1−λ)y) ≥ (λ f(x)^γ + (1−λ) g(y)^γ)^{1/γ} for every x, y. Then ∫ h ≥ (λ (∫ f)^{γ'} + (1−λ)(∫ g)^{γ'})^{1/γ'}, with γ' = γ/(1 + nγ).

Building on this, the following theorem plays a key role in our analysis of the marginal distribution.

Theorem 3 (Marginal).

Let f(x, y) be an s-concave density on a convex set K ⊆ R^{n+m} with s > −1/m. Denote by K|_n the projection of K onto the first n coordinates. For every x in K|_n, consider the section K(x) = {y ∈ R^m : (x, y) ∈ K}. Then the marginal density g(x) = ∫_{K(x)} f(x, y) dy is s/(1 + ms)-concave on K|_n. Moreover, if f is isotropic, then g is isotropic.

Similar to the marginal, the CDF of an s-concave distribution might not remain in the same class. This is in sharp contrast to log-concave distributions. The following theorem studies the CDF of an s-concave distribution.

Theorem 4.

The CDF of an s-concave distribution in R^n is γ-concave, where γ = s/(1 + ns) and s ≥ −1/n.

Theorems 3 and 4 serve as the bridge that connects high-dimensional s-concave distributions to one-dimensional s-concave distributions. With them, we are able to reduce the high-dimensional problem to the one-dimensional one.
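The degradation of the concavity parameter under marginalization can also be seen numerically. The sketch below (our own example; the density, grid, and tolerance are our choices) integrates one coordinate out of a 2-D s-concave density and checks that the resulting marginal is no longer s-concave but is s'-concave for a strictly smaller s'; for this particular density one can verify analytically that s' = s/(1+s) is tight, in line with the marginalization parameter of Theorem 3 with one coordinate integrated out.

# Numerical illustration (ours): s-concavity is not closed under marginalization.
import numpy as np

s = -0.2
x = np.linspace(-10.0, 10.0, 2001)
y = np.linspace(-200.0, 200.0, 20001)     # wide enough that truncation error is negligible

# 2-D density f(x, y) proportional to (1 + |x| + |y|)^(1/s); f^s is convex, so f is s-concave.
# Marginalize y out by numerical integration (up to an irrelevant constant factor).
g = np.array([np.trapz((1.0 + abs(xi) + np.abs(y)) ** (1.0 / s), y) for xi in x])

def convex_on_grid(vals, tol=1e-6):
    return bool(np.all(np.diff(vals, 2) >= -tol))

s_prime = s / (1.0 + s)
print("marginal g is s-concave  (g^s convex)? ", convex_on_grid(g ** s))        # False
print("marginal g is s'-concave (g^s' convex)?", convex_on_grid(g ** s_prime))  # True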

4.2 Fat-Tailed Density

Tail probability is one of the most distinct characteristics of s-concave distributions compared to (nearly) log-concave distributions. While it can be shown that a (nearly) log-concave distribution has an exponentially small tail (Theorem 11, [9]), the tail of an s-concave distribution can be fat, as shown by the following theorem.

Theorem 5 (Tail Probability).

Let x come from an isotropic distribution over R^n with an s-concave density. Then for every t, the tail probability Pr[‖x‖ ≥ t] is upper bounded by a closed-form function of t, n, and s that decays polynomially in t, where the constant involved is absolute.

Theorem 5 is almost tight. To see this, consider X drawn from a one-dimensional Pareto-type distribution whose density is s-concave. Its tail probability decays polynomially in t, which matches Theorem 5 up to an absolute constant factor.
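The density in this example is elided above; a concrete (not necessarily the paper's) instance is the s-concave Pareto-type density
\[ f(t) \;=\; \Big(-1-\tfrac{1}{s}\Big)(1+t)^{1/s}, \qquad t \ge 0, \quad -1 < s < 0, \]
for which f(t)^s ∝ 1 + t is linear (so f is s-concave) and
\[ \Pr[X \ge t] \;=\; \Big(-1-\tfrac{1}{s}\Big)\int_t^{\infty}(1+u)^{1/s}\,du \;=\; (1+t)^{1+1/s}, \]
a polynomial decay whose exponent |1 + 1/s| grows without bound only as s → 0⁻, i.e., as one approaches the log-concave limit.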

4.3 Geometry of S-Concave Distributions

We now investigate the geometry of s-concave distributions. We first consider one-dimensional s-concave distributions: we provide bounds on the density of centroid-centered halfspaces (Lemma 6) and on the range of the density function (Lemma 7). Building upon these, we develop geometric properties of high-dimensional s-concave distributions by reducing them to the one-dimensional case via marginalization (Theorem 3).

4.3.1 One-Dimensional Case

We begin with the analysis of one-dimensional halfspaces. To bound the probability, a standard technique is to bound the centroid region and the tail region separately. However, the challenge is that s-concave distributions are fat-tailed (Theorem 5). So while the probability of a one-dimensional halfspace containing the centroid is bounded below by an absolute constant for log-concave distributions, such a probability for s-concave distributions decays as s becomes smaller. The following lemma captures this intuition.

Lemma 6 (Density of Centroid-Centered Halfspaces).

Let X be drawn from a one-dimensional distribution with an s-concave density. Then the probability that X exceeds its mean is lower bounded by a closed-form function of s.

We also study the range of values of a one-dimensional s-concave density. The condition on s below is needed for the existence of the second moment.

Lemma 7.

Let f be an isotropic s-concave density function, with s in the range guaranteeing a second moment. (a) For all x, f(x) is upper bounded by a closed-form function of s; (b) f is lower bounded by a closed-form function of s on a closed-form interval around the origin.

4.3.2 High-Dimensional Case

We now move on to the high-dimensional case. In the following, we will assume −1/(2n+3) ≤ s ≤ 0. Though this working range of s shrinks as n grows, it is almost the broadest range of s one can hope for: Chandrasekaran et al. [23] showed that good geometric properties can fail once s drops below roughly −1/n. In addition, we can see from Theorem 3 that outside this regime the marginal of an s-concave distribution might not even exist; such a case does happen for certain s-concave distributions, e.g., the Cauchy distribution. So our range of s is almost tight, up to a constant factor.

We start our analysis with the density of centroid-centered halfspaces in high-dimensional spaces.

Lemma 8 (Density of Centroid-Centered Halfspaces).

Let f be an s-concave density function in R^n, and let H be any halfspace containing its centroid. Then the probability mass of H is lower bounded by a closed-form function of n and s.

Proof.

W.L.O.G., we assume the bounding hyperplane of H is orthogonal to the first axis. By Theorem 3, the first marginal of f is γ-concave with the γ given there. Then by Lemma 6 applied to that marginal, the claimed lower bound follows. ∎

The following theorem is an extension of Lemma 7 to high-dimensional spaces. The proofs basically reduce the n-dimensional density to its first marginal by Theorem 3 and apply Lemma 7 to bound its values.

Theorem 9 (Bounds on Density).

Let f be an isotropic s-concave density in R^n. Then

(a) for every x within a closed-form radius of the origin, f(x) is bounded below by a closed-form function of n and s;

(b) for every x, f(x) is bounded above by a closed-form function of n, s, and ‖x‖;

(c) there exists an x at which f(x) is at least a closed-form function of n and s;

(d) the value f(0) at the origin is bounded by closed-form functions of n and s;

(e) for every x, f(x) admits a further closed-form bound in terms of n, s, and ‖x‖;

(f) for any line ℓ through the origin, the integral of f along ℓ is bounded by closed-form functions of n and s.

Theorem 9 provides uniform bounds on the density function. To obtain a more refined upper bound on the values of s-concave densities, we have the following lemma. The proof builds upon Theorem 9.

Lemma 10 (More Refined Upper Bound on Densities).

Let f be an isotropic s-concave density. Then for every x, f(x) is upper bounded by a refined closed-form function of n, s, and ‖x‖.

We also give an absolute bound on the measure of a band.

Theorem 11 (Probability inside Band).

Let D be an isotropic s-concave distribution in R^n. Then for any unit vector w and any t > 0, the probability that |w · x| ≤ t admits a closed-form upper bound in terms of t, n, and s. Moreover, if t is below a closed-form threshold, then this probability is also lower bounded by a closed-form constant multiple of t.

To analyze the problem of learning linear separators, we are interested in the disagreement between the output hypothesis and the target hypothesis. The following theorem captures this quantity under s-concave distributions.

Theorem 12 (Probability of Disagreement).

Assume D is an isotropic s-concave distribution in R^n. Then for any two unit vectors u and v in R^n, we have Pr_{x ~ D}[sign(u · x) ≠ sign(v · x)] ≥ c(n, s) · θ(u, v), where θ(u, v) is the angle between u and v and c(n, s) is a closed-form function of n and s involving an absolute constant (reducing to an absolute constant when s = 0).

Due to space constraints, all missing proofs are deferred to the supplementary material.

5 Applications: Provable Algorithms under S-Concave Distributions

In this section, we show that many algorithms that work under log-concave distributions behave well under s-concave distributions, by applying the above geometric properties. For simplicity, we will frequently use the notation of Theorem 1.

5.1 Margin Based Active Learning

We first investigate margin-based active learning under isotropic s-concave distributions in both the realizable and adversarial noise models. The algorithm (see Algorithm 1) follows a localization technique: it proceeds in rounds, aiming to cut the error within the margin down by half in each round [8].

  Input: Band widths, constraint radii, hinge-loss scales, per-round sample sizes, and the number of rounds, as specified in Theorem 16.
  1: Draw examples from D, label them, and put them into the working set W.
  2:   For k = 1, 2, ...
  3:  Find v_k that approximately minimizes the (scaled) hinge loss over W, subject to staying within the prescribed radius of the previous iterate w_{k-1}.
  4:  Normalize v_k, yielding w_k; clear the working set W.
  5:While the prescribed number of additional data points are not labeled
  6:   Draw a sample x from D.
  7:   If x falls outside the band around w_k (i.e., |w_k · x| exceeds the current band width), reject x; else ask for the label of x and put it into W.
  Output: The hypothesis w_k from the final round.
Algorithm 1 Margin Based Active Learning under S-Concave Distributions
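Below is a minimal runnable sketch (ours) of the localization loop behind Algorithm 1. The band width, constraint radius, and hinge-loss scale follow placeholder geometric schedules rather than the constants prescribed by Theorem 16, the data distribution is a heavy-tailed stand-in (i.i.d. Student-t coordinates) rather than a certified s-concave distribution, and labels come from a clean synthetic oracle; it illustrates the mechanism, not the exact guarantees.

# Minimal sketch (ours) of margin-based active learning via localization.
import numpy as np

rng = np.random.default_rng(0)
n, rounds, m_per_round = 10, 8, 500
w_star = np.zeros(n); w_star[0] = 1.0                 # hidden target halfspace
label = lambda X: np.sign(X @ w_star)                 # labeling oracle (realizable case)

def sample_in_band(w, width, m):
    """Rejection-sample m points x with |<w, x>| <= width."""
    rows = []
    while len(rows) < m:
        X = rng.standard_t(df=5, size=(4 * m, n))     # heavy-tailed stand-in distribution
        X = X[np.abs(X @ w) <= width]
        rows.extend(X[: m - len(rows)])
    return np.array(rows)

def hinge_step(X, y, w0, radius, scale, iters=200):
    """Projected subgradient descent on the scaled hinge loss, constrained near w0."""
    w, lr = w0.copy(), 0.1 * radius
    for _ in range(iters):
        margins = y * (X @ w) / scale
        grad = -((y[:, None] * X) * (margins < 1)[:, None]).mean(axis=0) / scale
        w = w - lr * grad
        gap = np.linalg.norm(w - w0)
        if gap > radius:                              # project back onto the ball around w0
            w = w0 + radius * (w - w0) / gap
    return w / np.linalg.norm(w)

# Round 0: a coarse initial guess from plain labeled samples (average of y * x).
X0 = rng.standard_t(df=5, size=(m_per_round, n))
w = (label(X0)[:, None] * X0).mean(axis=0)
w /= np.linalg.norm(w)

for k in range(1, rounds + 1):                        # localization: shrink band and radius
    width, radius, scale = 2.0 * 0.8 ** k, 0.8 ** k, 0.5 * 0.8 ** k
    X = sample_in_band(w, width, m_per_round)         # labels are requested only in the band
    w = hinge_step(X, label(X), w, radius, scale)
    angle = np.arccos(np.clip(w @ w_star, -1.0, 1.0))
    print(f"round {k}: angle(w, w*) = {angle:.3f} rad")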

5.1.1 Relevant Properties of S-Concave Distributions

The analysis requires the more refined geometric properties below. Theorem 13 essentially says that the error mostly concentrates in a band, and Theorem 14 guarantees that the variance in any one-dimensional direction cannot be too large. We defer the detailed proofs to the supplementary material.

Theorem 13 (Disagreement outside Band).

Let u and v be two vectors in R^n whose angle is small, and let D be an isotropic s-concave distribution. Then for any absolute constant and any function of n and s, there exists a closed-form margin width, expressed via the beta function and the quantities given by Lemma 10, such that the probability that u and v disagree on a point falling outside the corresponding band is only a prescribed small fraction of their total disagreement.

Theorem 14 (1-D Variance).

Assume that D is isotropic s-concave. For the radius given by Theorem 9 (a), there is an absolute constant such that, for every admissible band width and all unit vectors u and v that are sufficiently close, the variance of v · x under the conditional distribution of D over the band around u is bounded by a closed-form function of the band width and of the quantities given by Lemma 10 and Theorem 11.

5.1.2 Realizable Case

We show that margin-based active learning works under s-concave distributions in the realizable case.

Theorem 15.

In the realizable case, let D be an isotropic s-concave distribution in R^n. Then for any target error ε, confidence parameter δ, and suitable absolute constants, there is an algorithm (see the supplementary material) that runs in O(log(1/ε)) iterations, requires in the k-th round a number of labels given in closed form in terms of n, s, k, and δ, and outputs a linear separator of error at most ε with probability at least 1 − δ. In particular, when s = 0 (a.k.a. log-concave), the per-round label complexity reduces to the known bound for log-concave distributions.

By Theorem 15, we see that margin-based active learning under s-concave distributions works almost as well as under log-concave distributions in the realizable case, improving exponentially in 1/ε over passive learning algorithms.

5.1.3 Efficient Learning with Adversarial Noise

In the adversarial noise model, an adversary can choose any distribution over instance-label pairs such that the marginal over the instance space is s-concave, but a bounded fraction of the labels can be flipped adversarially. The analysis builds upon an induction technique in which, in each round, we perform hinge loss minimization in the band and cut the 0/1 loss down by half. The algorithm was previously analyzed in [5, 6] for the special class of log-concave distributions. In this paper, we analyze it for the much more general class of s-concave distributions.

Theorem 16.

Let D be an isotropic s-concave distribution over the instance space R^n, and let the labels obey the adversarial noise model. If the rate of adversarial noise is at most a constant multiple of ε, then for any ε, δ, and an absolute constant, Algorithm 1 runs in O(log(1/ε)) iterations and outputs a linear separator whose error is at most ε with probability at least 1 − δ. The label complexity in the k-th round is given in closed form in terms of n, s, k, and δ. In particular, if s = 0, it reduces to the known bound for log-concave distributions.

By Theorem 16, the label complexity of margin-based active learning improves exponentially over that of passive learning with respect to 1/ε, even under fat-tailed s-concave distributions and the challenging adversarial noise model.

5.2 Disagreement Based Active Learning

We apply our results to the analysis of disagreement-based active learning under s-concave distributions. The key is estimating the disagreement coefficient, a measure of the complexity of an active learning problem that can be used to bound the label complexity [34]. Recall the definition of the disagreement coefficient w.r.t. a classifier w*, precision ε, and distribution D. For any r > 0, define the ball B(w*, r) = {w : Pr_{x ~ D}[sign(w · x) ≠ sign(w* · x)] ≤ r}. Define the disagreement region DIS(B) = {x : ∃ w, w' ∈ B such that sign(w · x) ≠ sign(w' · x)}. Let the Alexander capacity be cap_{w*, D}(r) = Pr_{x ~ D}[x ∈ DIS(B(w*, r))] / r. The disagreement coefficient is defined as θ_{w*, D}(ε) = sup_{r ≥ ε} cap_{w*, D}(r). Below, we state our results on the disagreement coefficient under isotropic s-concave distributions.

Theorem 17 (Disagreement Coefficient).

Let D be an isotropic s-concave distribution over R^n. For any target classifier and precision ε, the disagreement coefficient admits a closed-form upper bound in terms of n, s, and 1/ε. In particular, when s = 0 (a.k.a. log-concave), the bound matches the best known bound for log-concave distributions.

Our bounds on the disagreement coefficient match the best known results for the much less general case of log-concave distributions [9]; furthermore, they apply to the s-concave case, where we allow an arbitrary number of discontinuities, a case not captured by [31]. The result immediately implies concrete bounds on the label complexity of disagreement-based active learning algorithms, e.g., CAL [24] and A^2 [7]. For instance, by composing it with the result of [27], we obtain a label complexity bound for agnostic active learning under an isotropic s-concave distribution: with that many labels it suffices to output a halfspace with error at most a constant multiple of the best achievable error plus ε.

5.3 Learning Intersections of Halfspaces

Baum [11] provided a polynomial-time algorithm for learning intersections of halfspaces w.r.t. symmetric distributions. Later, Klivans et al. [44] extended the result by showing that the algorithm works under any distribution that is approximately symmetric, in the sense that the measure of any set and of its reflection agree up to a constant factor. In this section, we show that it is possible to learn intersections of halfspaces under the broader class of s-concave distributions.

Theorem 18.

In the PAC realizable case, there is an algorithm (see the supplementary material) that outputs a hypothesis of error at most ε with probability at least 1 − δ under isotropic s-concave distributions. Its label complexity is given in closed form in terms of n, s, ε, and δ, with a leading factor that is a closed-form function of n and s. In particular, if s = 0 (a.k.a. log-concave), that factor is an absolute constant.

6 Lower Bounds

In this section, we give information-theoretic lower bounds on the label complexity of passive and active learning of homogeneous halfspaces under s-concave distributions.

Theorem 19.

For a fixed value of s we have: (a) for any s-concave distribution in R^n whose covariance matrix has full rank, the sample complexity of learning origin-centered linear separators under this distribution in the passive learning scenario grows at least linearly in 1/ε; (b) the label complexity of active learning of linear separators under s-concave distributions grows at least logarithmically in 1/ε (both up to dimension- and confidence-dependent factors).

If the covariance matrix of the distribution is not of full rank, then the intrinsic dimension is less than n, so our lower bounds essentially apply to all s-concave distributions. According to Theorem 19, it is possible to obtain an exponential improvement in label complexity with respect to 1/ε over passive learning by active sampling, even though the underlying distribution is a fat-tailed s-concave distribution. This possibility is realized by Theorems 15 and 16.
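For a sense of scale (these are the classical figures for halfspaces under the uniform distribution, stated roughly; they are not the elided bounds of Theorem 19): passive PAC learning of an n-dimensional halfspace to error ε takes on the order of
\[ \Theta\!\left(\frac{n}{\epsilon}\right) \ \text{labeled examples}, \qquad\text{versus}\qquad O\!\left(n\,\log\frac{1}{\epsilon}\right) \ \text{label queries for active learning}, \]
an exponential gap in the dependence on 1/ε; Theorem 19 indicates that the same qualitative gap persists under s-concave distributions.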

7 Conclusions

In this paper, we study the geometric properties of s-concave distributions. Our work advances the state-of-the-art results on margin-based active learning, disagreement-based active learning, and learning intersections of halfspaces w.r.t. the distributions over the instance space. When s = 0, our results reduce to the best known results for log-concave distributions. The geometric properties of s-concave distributions can potentially be applied to other learning algorithms, which might be of independent interest more broadly.

Acknowledgements. This work was supported in part by grants NSF-CCF 1535967, NSF CCF-1422910, NSF CCF-1451177, a Sloan Fellowship, and a Microsoft Research Fellowship.

References

  • [1] M. Anthony and P. L. Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.
  • [2] D. Applegate and R. Kannan. Sampling and integration of near log-concave functions. In ACM Symposium on Theory of Computing, pages 156–163, 1991.
  • [3] P. Awasthi, M.-F. Balcan, N. Haghtalab, and R. Urner. Efficient learning of linear separators under bounded noise. In Annual Conference on Learning Theory, pages 167–190, 2015.
  • [4] P. Awasthi, M.-F. Balcan, N. Haghtalab, and H. Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Annual Conference on Learning Theory, pages 152–192, 2016.
  • [5] P. Awasthi, M.-F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. In ACM Symposium on Theory of Computing, pages 449–458, 2014.
  • [6] P. Awasthi, M.-F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. Journal of the ACM, 63(6):50, 2017.
  • [7] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89, 2009.
  • [8] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In Annual Conference on Learning Theory, pages 35–50, 2007.
  • [9] M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-concave distributions. In Annual Conference on Learning Theory, pages 288–316, 2013.
  • [10] M.-F. Balcan and H. Zhang. Noise-tolerant life-long matrix completion via adaptive sampling. In Advances in Neural Information Processing Systems, pages 2955–2963, 2016.
  • [11] E. B. Baum. A polynomial time algorithm that learns two hidden unit nets. Neural Computation, 2(4):510–522, 1990.
  • [12] D. Bertsimas and S. Vempala. Solving convex programs by random walks. Journal of the ACM, 51(4):540–556, 2004.
  • [13] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In International Conference on Machine Learning, pages 49–56, 2009.
  • [14] A. Beygelzimer, D. J. Hsu, J. Langford, and C. Zhang. Search improves label for active learning. In Advances in Neural Information Processing Systems, pages 3342–3350, 2016.
  • [15] A. Beygelzimer, D. J. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In Advances in Neural Information Processing Systems, pages 199–207, 2010.
  • [16] A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial-time algorithm for learning noisy linear threshold functions. In IEEE Symposium on Foundations of Computer Science, pages 330–338, 1996.
  • [17] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.
  • [18] S. G. Bobkov. Large deviations and isoperimetry over convex probability measures with heavy tails. Electronic Journal of Probability, 12:1072–1100, 2007.
  • [19] O. Bousquet, S. Boucheron, and G. Lugosi. Theory of classification: A survey of recent advances. ESAIM: Probability and Statistics, 9(9):323–375, 2005.
  • [20] H. J. Brascamp and E. H. Lieb. On extensions of the Brunn-Minkowski and Prékopa-Leindler theorems, including inequalities for log concave functions, and with an application to the diffusion equation. Journal of Functional Analysis, 22(4):366–389, 1976.
  • [21] C. Caramanis and S. Mannor. An inequality for nearly log-concave distributions with applications to learning. In Annual Conference on Learning Theory, pages 534–548, 2004.
  • [22] C. Caramanis and S. Mannor. An inequality for nearly log-concave distributions with applications to learning. IEEE Transactions on Information Theory, 53(3):1043–1057, 2007.
  • [23] K. Chandrasekaran, A. Deshpande, and S. Vempala. Sampling s-concave functions: The limit of convexity based isoperimetry. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 420–433, 2009.
  • [24] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
  • [25] A. Daniely. Complexity theoretic limitations on learning halfspaces. In ACM Symposium on Theory of computing, pages 105–117, 2016.
  • [26] S. Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, volume 17, pages 337–344, 2004.
  • [27] S. Dasgupta, D. J. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems, pages 353–360, 2007.
  • [28] S. Dasgupta, A. T. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. In Annual Conference on Learning Theory, pages 249–263, 2005.
  • [29] J. Dunagan and S. Vempala. A simple polynomial-time rescaling algorithm for solving linear programs. In ACM Symposium on Theory of computing, pages 315–320, 2004.
  • [30] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Information, prediction, and query by committee. Advances in Neural Information Processing Systems, pages 483–483, 1993.
  • [31] E. Friedman. Active learning for smooth problems. In Annual Conference on Learning Theory, 2009.
  • [32] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. SIAM Journal on Computing, 39(2):742–765, 2009.
  • [33] Q. Han and J. A. Wellner. Approximation and estimation of s-concave densities via Rényi divergences. The Annals of Statistics, 44(3):1332–1359, 2016.
  • [34] S. Hanneke. A bound on the label complexity of agnostic active learning. In International Conference on Machine Learning, pages 353–360, 2007.
  • [35] S. Hanneke et al. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131–309, 2014.
  • [36] A. T. Kalai, A. R. Klivans, Y. Mansour, and R. A. Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008.
  • [37] A. T. Kalai and S. Vempala. Simulated annealing for convex optimization. Mathematics of Operations Research, 31(2):253–266, 2006.
  • [38] D. M. Kane, S. Lovett, S. Moran, and J. Zhang. Active classification with comparison queries. In IEEE Symposium on Foundations of Computer Science, pages 355–366, 2017.
  • [39] M. Kearns and M. Li. Learning in the presence of malicious errors. SIAM Journal on Computing, 22(4):807–837, 1993.
  • [40] M. J. Kearns, R. E. Schapire, and L. M. Sellie. Toward efficient agnostic learning. Machine Learning, 17(2-3):115–141, 1994.
  • [41] M. J. Kearns and U. V. Vazirani. An introduction to computational learning theory. MIT Press, 1994.
  • [42] A. Klivans and P. Kothari. Embedding hard learning problems into gaussian space. International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, 28:793–809, 2014.
  • [43] A. R. Klivans, P. M. Long, and R. A. Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10:2715–2740, 2009.
  • [44] A. R. Klivans, P. M. Long, and A. K. Tang. Baum’s algorithm learns intersections of halfspaces with respect to log-concave distributions. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 588–600. 2009.
  • [45] A. R. Klivans, R. O’Donnell, and R. A. Servedio. Learning intersections and thresholds of halfspaces. In IEEE Symposium on Foundations of Computer Science, pages 177–186, 2002.
  • [46] S. R. Kulkarni, S. K. Mitter, and J. N. Tsitsiklis. Active learning using arbitrary binary valued queries. Machine Learning, 11(1):23–35, 1993.
  • [47] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.
  • [48] P. M. Long. On the sample complexity of pac learning half-spaces against the uniform distribution. IEEE Transactions on Neural Networks, 6(6):1556–1559, 1995.
  • [49] L. Lovász and S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures Algorithms, 30(3):307–358, 2007.
  • [50] M. Minsky and S. Papert. Perceptrons–extended edition: An introduction to computational geometry, 1987.
  • [51] R. A. Servedio. Efficient algorithms in computational learning theory. PhD thesis, Harvard University, 2001.
  • [52] S. Shalev-Shwartz, O. Shamir, and K. Sridharan. Learning kernel-based halfspaces with the zero-one loss. arXiv preprint arXiv:1005.3681, 2010.
  • [53] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, pages 135–166, 2004.
  • [54] V. Vapnik. Estimations of dependences based on statistical data. Springer, 1982.
  • [55] V. Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 2013.
  • [56] L. Wang. Smoothness, disagreement coefficient, and the label complexity of agnostic active learning. Journal of Machine Learning Research, 12(Jul):2269–2292, 2011.
  • [57] Y. Xu, H. Zhang, A. Singh, A. Dubrawski, and K. Miller. Noise-tolerant interactive learning using pairwise comparisons. In Advances in Neural Information Processing Systems, pages 2428–2437, 2017.
  • [58] S. Yan and C. Zhang. Revisiting perceptron: Efficient and label-optimal active learning of halfspaces. arXiv preprint arXiv:1702.05581, 2017.
  • [59] C. Zhang and K. Chaudhuri. Beyond disagreement-based agnostic active learning. In Advances in Neural Information Processing Systems, pages 442–450, 2014.

Appendix A Proof of Theorem 3

Theorem 3 (restated) Let f(x, y) be an s-concave density on a convex set K ⊆ R^{n+m} with s > −1/m. Denote by K|_n the projection of K onto the first n coordinates. For every x in K|_n, consider the section K(x) = {y ∈ R^m : (x, y) ∈ K}. Then the marginal density g(x) = ∫_{K(x)} f(x, y) dy is s/(1 + ms)-concave on K|_n. Moreover, if f is isotropic, then g is isotropic.

Proof.

The proof that g is isotropic is standard [49]. We now prove the first part. Let x_1, x_2 ∈ K|_n be any two points. Define F_i(y) = f(x_i, y) for i = 1, 2, so the function F_i is defined on the section K(x_i). Now let x_λ = λx_1 + (1 − λ)x_2 for λ ∈ [0, 1] and define H(y) = f(x_λ, y) on K(x_λ). Notice that for any y_1 ∈ K(x_1) and y_2 ∈ K(x_2), the point λy_1 + (1 − λ)y_2 belongs to K(x_λ). To see this, by the convexity of the set K, the point λ(x_1, y_1) + (1 − λ)(x_2, y_2) belongs to K. So (x_λ, λy_1 + (1 − λ)y_2) ∈ K, i.e., λy_1 + (1 − λ)y_2 ∈ K(x_λ). Using the s-concavity of f, we have