1 Introduction
Developing provable learning algorithms is one of the central challenges in learning theory. The study of such algorithms has led to significant advances in both the theory and practice of passive and active learning. In the passive learning model, the learning algorithm has access to a set of labeled examples sampled i.i.d. from some unknown distribution over the instance space and labeled according to some underlying target function. In the active learning model, however, the algorithm can access unlabeled examples and request labels of its own choice, and the goal is to learn the target function with significantly fewer labels. In this work, we study both learning models in the case where the underlying distribution belongs to the class of s-concave distributions.
Prior work on noise-tolerant and sample-efficient algorithms mostly relies on the assumption that the distribution over the instance space is log-concave [2, 22, 9, 57]. A distribution is log-concave if the logarithm of its density is a concave function. The assumption of log-concavity has been made for two purposes: computational efficiency and sample efficiency. On the computational side, it was made to obtain noise-tolerant algorithms even for seemingly simple decision surfaces like linear separators. Such algorithms exist in noiseless scenarios, e.g., via linear programming
[51], but they are notoriously hard once we have noise [25, 42, 32]; this is why progress on noise-tolerant algorithms has focused on uniform [36, 43] and log-concave distributions [6]. Other concept spaces, like intersections of halfspaces, do not even have computationally efficient algorithms in the noise-free setting that work under general distributions, but there has been nice progress under uniform and log-concave distributions [44]. For sample efficiency reasons, in the context of active learning, we need distributional assumptions in order to obtain label complexity improvements [26]. The most concrete and general class for which prior work obtains such improvements is when the marginal distribution over the instance space satisfies log-concavity [59, 9]. In this work, we provide a broad generalization of all the above results, showing how they extend to s-concave distributions. A distribution with density f is s-concave if f^s is a concave function (suitably interpreted for s ≤ 0; see Definition 1). We identify key properties of these distributions that allow us to simultaneously extend all the above results.
How general and important is the class of s-concave distributions? The class of s-concave distributions is very broad and contains many well-known (classes of) distributions as special cases. For example, when s = 0, s-concave distributions reduce to log-concave distributions. Furthermore, the s-concave class contains infinitely many fat-tailed distributions that do not belong to the class of log-concave distributions, e.g., Cauchy, Pareto, and t-distributions, which have been widely applied in the context of theoretical physics and economics, but much remains unknown about how provable learning algorithms, such as active learning of halfspaces, perform under these realistic distributions. We also compare s-concave distributions with nearly-log-concave distributions, a slightly broader class than the log-concave one. A distribution with density f is nearly-log-concave if for any λ ∈ [0, 1] and any x_1, x_2, we have f(λx_1 + (1 − λ)x_2) ≥ e^{−β} f(x_1)^λ f(x_2)^{1−λ} for a small absolute constant β [9].
The class of s-concave distributions includes many important additional distributions which do not belong to the nearly-log-concave class: a nearly-log-concave distribution must have sub-exponential tails (see Theorem 11, [9]), while the tail probability of an s-concave distribution might decay much more slowly (see Theorem 1 (6)). We also note that efficient sampling, integration, and optimization algorithms for s-concave distributions are well understood [23, 37]. Our analysis of s-concave distributions bridges these algorithms to the strong guarantees of noise-tolerant and sample-efficient learning algorithms.
1.1 Our Contributions
Structural Results. We study various geometric properties of s-concave distributions. These properties serve as the structural results for many provable learning algorithms, e.g., margin-based active learning [9], disagreement-based active learning [56, 35], learning intersections of halfspaces [44], etc. When s = 0, our results exactly reduce to those for log-concave distributions [9, 4, 6]. Below, we state our structural results informally:
Theorem 1 (Informal).
Let D be an isotropic s-concave distribution over R^n. Then there exist closed-form functions of s and n such that the following hold.


(Weakly Closed under Marginalization) The marginal of D over a subset of the arguments (and its cumulative distribution function, CDF) is isotropic and s′-concave for a closed-form parameter s′ depending on s and the number of marginalized coordinates. (Theorems 3, 4)
(Lower Bound on Hyperplane Disagreement) For any two unit vectors u and v in R^n, the disagreement probability Pr_{x∼D}[sign(u · x) ≠ sign(v · x)] is lower bounded by a closed-form function of the angle θ(u, v) between u and v. (Theorem 12)
(Probability of Band) For any unit vector w and any t > 0, the probability of the band Pr_{x∼D}[|w · x| ≤ t] admits closed-form lower and upper bounds. (Theorem 11)

(Disagreement outside Margin) For any absolute constant c, there exists a closed-form margin width beyond which the disagreement between two nearby halfspaces has small probability mass. (Theorem 13)

(Tail Probability) The tail Pr_{x∼D}[‖x‖ > t] admits a closed-form upper bound that decays polynomially in t. (Theorem 5)
If s = 0 (i.e., the distribution is log-concave), then all of the above closed-form functions reduce to absolute constants.
To prove Theorem 1, we introduce multiple new techniques, e.g., an extension of the Prékopa–Leindler theorem and a reduction to a baseline function (see the supplementary material for our techniques), which might be of independent interest to optimization more broadly.
Margin-Based Active Learning: We apply our structural results to margin-based active learning of a halfspace under any isotropic s-concave distribution, for both the realizable and the adversarial noise models. In the realizable case, the instance x is drawn from an isotropic s-concave distribution and the label is determined by an underlying target halfspace. In the adversarial noise model, an adversary can additionally corrupt a small fraction of the labels. For both cases, we show that there exists a computationally efficient algorithm that outputs a linear separator of error at most ε (see Theorems 15 and 16). The label complexity w.r.t. 1/ε improves exponentially over the passive learning scenario under s-concave distributions, even though the underlying distribution might be fat-tailed. To the best of our knowledge, this is the first result concerning computationally efficient, noise-tolerant, margin-based active learning under the broader class of s-concave distributions. Our work solves an open problem proposed by Awasthi et al. [6] about exploring wider classes of distributions for provable active learning algorithms.
Disagreement-Based Active Learning: We apply our results to agnostic disagreement-based active learning under s-concave distributions. The key to the analysis is estimating the disagreement coefficient, a distribution-dependent measure of complexity that is used to analyze certain types of active learning algorithms, e.g., the CAL algorithm [24] and the A² algorithm [7]. We work out the disagreement coefficient under isotropic s-concave distributions (see Theorem 17). By composing it with the existing work on active learning [27], we obtain a bound on label complexity under the class of s-concave distributions. As far as we are aware, this is the first result concerning disagreement-based active learning that goes beyond log-concave distributions. Our bounds on the disagreement coefficient match the best known results for the much less general case of log-concave distributions [9]; furthermore, they apply to the s-concave case where we allow an arbitrary number of discontinuities, a case not captured by [31].
Learning Intersections of Halfspaces: Baum's algorithm is one of the most famous algorithms for learning intersections of halfspaces. The algorithm was first proposed by Baum [11] under symmetric distributions, and later extended to log-concave distributions by Klivans et al. [44], as these distributions are almost symmetric. In this paper, we show that approximate symmetry also holds for s-concave distributions. With this, we work out the label complexity of Baum's algorithm under the broader class of s-concave distributions (see Theorem 18), and advance the state-of-the-art results (see Table 1).
We provide lower bounds that partially show the tightness of our analysis. Our results can potentially be applied to other provable learning algorithms as well [38, 58, 13, 57, 10], which might be of independent interest. We discuss our techniques and other related papers in the supplementary material.
Table 1: Comparison with prior work.

                            Prior Work                             Ours
Margin (Efficient, Noise)   uniform [5], log-concave [6]           s-concave
Disagreement                uniform [34], nearly-log-concave [9]   s-concave
Baum's                      symmetric [11], log-concave [44]       s-concave
1.2 Our Techniques
In this section, we introduce the techniques used for obtaining our results.
Marginalization: Our results are inspired by the isoperimetric inequality for s-concave distributions from the work of Chandrasekaran et al. [23]. Roughly, the isoperimetry states that if two sets are well-separated, then the area between them has large measure relative to the measure of the two sets (see Figure 1). Results of this kind are particularly useful for margin-based active learning of halfspaces [5, 4, 6]: The algorithm proceeds in rounds, aiming to cut the error in the band down by half in each round. Since the measure of the band is large or even dominates, the error over the whole space decreases almost by half in each round, resulting in an exponentially fast convergence rate. However, in order to make the analysis of such algorithms work for s-concave distributions, we typically require more refined geometric properties than the isoperimetry, as the isoperimetric inequality says nothing about the absolute measure of the band under s-concave distributions.
The insight behind the isoperimetry is a collection of properties concerning the geometry of the probability density. While the geometric properties of some classic paradigms, such as log-concave distributions (the case s = 0), are well studied [49], it is typically hard to generalize those results to s-concave distributions for a broader range of s. This is due to the fact that the class of s-concave functions is not closed under marginalization: The marginal of an s-concave function may no longer be s-concave. This directly restricts the possibility of applying the prior proof techniques for log-concave distributions to the s-concave case. Furthermore, previous proofs heavily depend on the assumption that the density is light-tailed (see Theorem 11 in [9]), which is not applicable to possibly fat-tailed s-concave distributions.
To mitigate the above concerns, we begin with a powerful tool from convex geometry by Brascamp and Lieb [20]. This result can be viewed as an extension of the celebrated Prékopa–Leindler inequality, an integral inequality that is closely related to a number of classical inequalities in analysis and serves as the building block of isoperimetry under log-concave distributions [21, 22]. With this, we can show that the marginal of any s-concave function is s′-concave, with a closed-form parameter s′ determined by s and the number of marginalized dimensions. Our analysis is tight, as there exists an s-concave function whose marginal is exactly s′-concave.
Reduction to 1D Baseline Function: It is in general hard to study a high-dimensional s-concave distribution directly. Instead, we build on the marginalization technique described above to reduce each n-dimensional s-concave function to the one-dimensional case. Thus it suffices to investigate the geometry of one-dimensional s-concave functions. But there are still infinitely many such functions in this class.
Our proofs take a novel approach by reducing every one-dimensional s-concave density to a certain baseline function. The baseline function should meet two goals: (a) It represents the worst case in the class of s-concave functions, namely, such functions should achieve the bounds of the geometric properties of our interest; (b) The function should be easy to analyze, e.g., with closed-form moments or integrations. Note that choosing a baseline function at the "boundary" between the s-concavity and non-s-concavity classes readily achieves goal (a). To achieve goal (b), we set the "template" function to a simple power-law form h(t) = c(1 + bt)^{1/s} for a particular choice of parameters b and c. Such functions have many good properties that one can exploit. First, the moments can be represented in closed form by the beta function. This enables us to figure out the relations among moments of various orders explicitly and obtain a recursive inequality, which is critical for deducing the bounds of one-dimensional geometric properties. Second, h is at the "boundary" of the s-concave class: h^s is affine, so h is not s′-concave for any s′ > s. Therefore, this enables us to analyze the whole class of s-concave densities by focusing on h. Below, we briefly summarize our high-level proof ideas.

2 Related Work
Active Learning of Halfspaces under the Uniform Distribution:
Learning halfspaces has been extensively studied in the past decades [16, 45, 29, 36, 52, 41, 40, 39]. Probably one of the most famous results is the VC argument. Vapnik [54] and Blumer et al. [17] showed that any hypothesis that is consistent with m labeled examples has error at most Õ(d/m), if the VC dimension of the hypothesis class is d. The algorithm works under any data distribution and runs in polynomial time when the consistent hypothesis can be found efficiently, e.g., by linear programming in the realizable case. Other algorithms such as the Perceptron [50], Winnow [47], and the Support Vector Machine [55] provide better guarantees if the target vector has low ℓ1 or ℓ2 norm. All these results form the basis of passive learning.
To explore the possibility of further improving the label complexity, several algorithms were later proposed in the active learning literature [15, 14] under uniform distributions [28, 30], among which disagreement-based active learning and margin-based active learning are two typical approaches. In disagreement-based active learning, the algorithm proceeds in rounds, requesting the labels of instances in the disagreement region among the current candidate hypotheses. Cohn et al. [24] provided the first disagreement-based active learning algorithm in the realizable case. Balcan et al. [7]
later extended such an algorithm to the agnostic setting by estimating the confidence interval of the disagreement region. The analysis technique was further generalized by Hanneke [34], who introduced the concept of the disagreement coefficient, a new measure of complexity for active learning problems that serves as an important element in bounding the label complexity. However, this seminal work only focused on the disagreement coefficient under the uniform distribution.
Margin-based active learning is another line of research in the active learning literature. The algorithm proceeds in rounds, requesting labels of examples aggressively in the margin area around the current hypothesis. Balcan et al. [8] first proposed an algorithm for margin-based active learning under the uniform distribution in the realizable case. They also provided guarantees under the Tsybakov noise model [53], but the algorithm is inefficient. To mitigate the issue, Awasthi et al. [3] considered the Massart noise model [19], a subclass of Tsybakov noise. Their algorithm runs in polynomial time by doing a sequence of hinge loss minimizations on the labeled instances. However, it was not clear then whether the analysis works for distributions beyond the uniform one.
Geometry of Log-Concave Distributions:
The log-concave distribution, a class of probability distributions whose density has a concave logarithm, is a common generalization of the uniform distribution over a convex set [49]. Bertsimas and Vempala [12] and Kalai and Vempala [37] noticed that efficient sampling, integration, and optimization algorithms for this distribution class rely heavily on the good isoperimetry of the density functions. Informally, a function has good isoperimetry if one cannot remove a small-measure set from its domain and thereby partition the domain into two disjoint large-measure sets. Isoperimetry is commonly viewed as a hallmark of good geometric properties. Along these lines, Lovász and Vempala [49] proved the isoperimetric inequality for log-concave distributions, and provided a collection of refined geometric properties for this distribution class. Going slightly beyond the log-concave distribution, Caramanis and Mannor [22] showed good isoperimetry for nearly-log-concave distributions, but more refined geometry was not provided there.
Active learning of halfspaces under (nearly) log-concave distributions has a natural connection to the geometry of those distributions (a.k.a. admissible distributions). The connection was first introduced by [9], and is sufficient for the success of disagreement-based and margin-based active learning under log-concave distributions [9]. To resolve the computational issue, Awasthi et al. [5] studied the probability of disagreement outside the margin under the log-concave distribution, and proposed an efficient algorithm for the challenging adversarial noise model. More recently, Awasthi et al. [4] provided stronger guarantees for efficient learning of halfspaces in the Massart noise model under log-concave distributions.
S-Concave Distribution: The problem of extending the log-concave distribution to broader classes for provable learning algorithms has received significant attention in recent years. Although some efforts have been devoted to generalizing the probability distribution, e.g., to the nearly-log-concave distribution [9], the analysis is intrinsically built upon the geometry of the log-concave distribution. Moreover, to the best of our knowledge, there is no efficient, noise-tolerant active learning algorithm that goes beyond the log-concave distribution. As a candidate extension, the class of s-concave distributions has many appealing properties that one can exploit [23, 33]: (a) The distribution class is much broader than the log-concave distributions, which correspond to the special case s = 0; (b) An s-concave density on R^n has good isoperimetry provided s is not too negative as a function of n; (c) Efficient sampling, integration, and optimization algorithms are available for this distribution class. All these properties inspire our work.
3 Preliminaries
Before proceeding, we define notation and clarify our problem setup in this section.
Notations:
We will use capital or lowercase letters to represent random variables, D to represent an s-concave distribution, and D|K to represent the conditional distribution of D over a set K. We define the sign function as sign(a) = +1 if a ≥ 0 and −1 otherwise. We denote by B(·, ·) the beta function and by Γ(·) the gamma function. We will consider a single norm for vectors in R^n, namely the Euclidean norm, denoted by ‖·‖. We will frequently use Pr_f (or Pr_D) to represent the measure of the probability distribution with density function f (distribution D). The notation [n] represents the set {1, 2, …, n}. For convenience, we use a slightly modified addition ⊕ that handles the degenerate values of s (e.g., s = 0); otherwise, ⊕ and the ordinary addition + are the same. For unit vectors u and v, we define the angle between them as θ(u, v).

3.1 From Log-Concavity to S-Concavity
We begin with the definition of s-concavity. There are slight differences among the definitions of s-concave density, s-concave distribution, and s-concave measure.
Definition 1 (S-Concave (Density) Function, Distribution, Measure).
A function f: R^n → R_+ is s-concave, for −∞ ≤ s ≤ 1, if f(λx + (1 − λ)y) ≥ (λf(x)^s + (1 − λ)f(y)^s)^{1/s} for all λ ∈ [0, 1] and all x, y.¹ A probability distribution is s-concave if its density function is s-concave. A probability measure μ is s-concave if μ(λA + (1 − λ)B) ≥ (λμ(A)^s + (1 − λ)μ(B)^s)^{1/s} for any measurable sets A and B.

¹When s = 0, we take the limit (λf(x)^s + (1 − λ)f(y)^s)^{1/s} → f(x)^λ f(y)^{1−λ}. In this case, f is known to be log-concave.
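As a concrete instance of Definition 1 (our example, not one from the paper): the standard Cauchy density f(x) = 1/(π(1 + x²)) is s-concave with s = −1, since f^{−1} = π(1 + x²) is convex, yet it is not log-concave. A quick numerical check of the defining inequality:

```python
import numpy as np

rng = np.random.default_rng(0)
s = -1.0
f = lambda x: 1.0 / (np.pi * (1 + x ** 2))  # standard Cauchy density

# Definition 1: f(lam*x + (1-lam)*y) >= (lam*f(x)**s + (1-lam)*f(y)**s)**(1/s)
x = rng.uniform(-50, 50, 100_000)
y = rng.uniform(-50, 50, 100_000)
lam = rng.uniform(0, 1, 100_000)
lhs = f(lam * x + (1 - lam) * y)
rhs = (lam * f(x) ** s + (1 - lam) * f(y) ** s) ** (1 / s)
viol = float(np.max(rhs - lhs))  # <= 0 (up to rounding) iff the inequality holds

# The same density is NOT log-concave: the log-concavity inequality already
# fails at x = 0, y = 10, lam = 1/2
not_logconcave = f(5.0) < np.sqrt(f(0.0) * f(10.0))
```

The first check succeeds because f^{−1} is convex, which is exactly the s = −1 case of the definition; the second shows the strict gap between the s-concave and log-concave classes.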
4 Structural Results of S-Concave Distributions: A Toolkit
In this section, we develop geometric properties of s-concave distributions. The challenge is that, unlike the commonly used distributions in learning (uniform or, more generally, log-concave distributions), this broader class is not closed under the marginalization operator, and many such distributions are fat-tailed. To address this issue, we introduce several new techniques. We first introduce an extension of the Prékopa–Leindler inequality so as to reduce the high-dimensional problem to the one-dimensional case. We then reduce the resulting one-dimensional s-concave function to a well-defined baseline function, and explore the geometric properties of that baseline function.
4.1 Marginal Distribution and Cumulative Distribution Function
We begin with the analysis of the marginal distribution, which forms the basis of the other geometric properties of s-concave distributions. Unlike the (nearly) log-concave distributions, where the marginal remains (nearly) log-concave, the class of s-concave distributions is not closed under the marginalization operator. To study the marginal, our primary tool is the theory of convex geometry. Specifically, we will use an extension of the Prékopa–Leindler inequality developed by Brascamp and Lieb [20], which allows for a characterization of the integral of s-concave functions.
Theorem 2 ([20], Thm 3.3).
Let 0 < λ < 1 and −1/n ≤ s ≤ ∞, and let f, g, and h be non-negative integrable functions on R^n such that h(λx + (1 − λ)y) ≥ (λf(x)^s + (1 − λ)g(y)^s)^{1/s} for every x, y ∈ R^n. Then ∫ h ≥ (λ(∫ f)^γ + (1 − λ)(∫ g)^γ)^{1/γ}, with γ = s/(1 + ns).
Building on this, the following theorem plays a key role in our analysis of the marginal distribution.
Theorem 3 (Marginal).
Let f(x, y) be an s-concave density on a convex set K ⊆ R^m × R^n with s ≥ −1/n. Denote by K|_x the projection of K onto the x-coordinates. For every x in K|_x, consider the section K(x) = {y : (x, y) ∈ K}. Then the marginal density g(x) = ∫_{K(x)} f(x, y) dy is γ-concave on K|_x, where γ = s/(1 + ns). Moreover, if f is isotropic, then g is isotropic.
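Theorem 3 can be seen at work numerically. Take the 2-D density f(x, y) ∝ (1 + x² + y²)^{1/s} with s = −1/4, which is s-concave since f^s is convex, and marginalize out y. Assuming the closed form specializes to γ = s/(1 + s) when one coordinate is integrated out (the standard Borell–Brascamp–Lieb form), the marginal m should be γ-concave; since γ < 0, that means m^γ should be convex. The grid sizes below are arbitrary choices of ours:

```python
import numpy as np

s = -0.25
gamma = s / (1 + s)            # assumed closed form for one marginalized dimension
x = np.linspace(-3, 3, 121)
y = np.linspace(-60, 60, 4001)
dy = y[1] - y[0]

# f**s = 1 + x**2 + y**2 is convex, so f is s-concave (s < 0)
f = (1 + x[:, None] ** 2 + y[None, :] ** 2) ** (1 / s)
m = f.sum(axis=1) * dy         # Riemann-sum marginal over y

g = m ** gamma                 # gamma < 0: m is gamma-concave iff m**gamma is convex
second_diff = g[:-2] - 2 * g[1:-1] + g[2:]
min_curv = float(second_diff.min())  # convexity => nonnegative, up to grid error
```

Here the marginal can also be computed in closed form (proportional to (1 + x²)^{1/s + 1/2}), and the discrete second differences of m^γ indeed come out nonnegative, consistent with the theorem's guarantee.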
Similar to the marginal, the CDF of an s-concave distribution might not remain in the same class. This is in sharp contrast to log-concave distributions. The following theorem studies the CDF of an s-concave distribution.
Theorem 4.
The CDF of an s-concave distribution in R^n is γ-concave, where γ = s/(1 + ns) and s ≥ −1/n.
4.2 Fat-Tailed Density
Tail probability is one of the most distinctive characteristics of s-concave distributions compared to (nearly) log-concave distributions. While it can be shown that a (nearly) log-concave distribution has an exponentially small tail (Theorem 11, [9]), the tail of an s-concave distribution is fat, as captured by the following theorem.
Theorem 5 (Tail Probability).
Let x come from an isotropic distribution over R^n with an s-concave density. Then for every t larger than an absolute constant c, the tail probability Pr[‖x‖ > t√n] is upper bounded by a closed-form function of s, n, and t that decays polynomially in t.
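For intuition about the polynomial rate (our worked example, not the theorem's closed form): the standard Cauchy, an s-concave density with s = −1, has tail Pr[|X| > t] = (2/π)·arctan(1/t) ≈ 2/(πt), in sharp contrast to the exponentially small tails that nearly-log-concave densities must have:

```python
import math

def cauchy_tail(t):
    # Pr[|X| > t] for the standard Cauchy: 1 - (2/pi) arctan(t) = (2/pi) arctan(1/t)
    return 2 * math.atan(1 / t) / math.pi

# t * Pr[|X| > t] approaches 2/pi ~ 0.6366: the tail decays only like 1/t,
# while a log-concave tail at t = 1000 would be exponentially small.
scaled = [t * cauchy_tail(t) for t in (10, 100, 1000)]
```

This is exactly the phenomenon that forces the refined band and margin analysis later in the section: no absolute-constant tail bound is available once s < 0.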
4.3 Geometry of S-Concave Distributions
We now investigate the geometry of s-concave distributions. We first consider one-dimensional s-concave distributions: We provide bounds on the density of centroid-centered halfspaces (Lemma 6) and on the range of the density function (Lemma 7). Building upon these, we develop geometric properties of high-dimensional s-concave distributions by reducing the distributions to the one-dimensional case via marginalization (Theorem 3).
4.3.1 One-Dimensional Case
We begin with the analysis of one-dimensional halfspaces. To bound the probability, a standard technique is to bound the centroid region and the tail region separately. However, the challenge is that the s-concave distribution is fat-tailed (Theorem 5). So while the probability of a one-dimensional halfspace through the centroid is bounded below by an absolute constant for log-concave distributions, such a probability for s-concave distributions decays as s becomes smaller. The following lemma captures this intuition.
Lemma 6 (Density of Centroid-Centered Halfspaces).
Let X be drawn from a one-dimensional distribution with an s-concave density, for s in the admissible range. Then Pr[X ≥ E X] is lower bounded by a closed-form function of s (cf. the classical 1/e lower bound for log-concave densities).
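A concrete data point for Lemma 6, with numbers of our own choosing: the Pareto density f(x) = 3x⁻⁴ on [1, ∞) is s-concave with s = −1/4 (f^s is linear in x), and the halfspace to the right of its centroid carries mass 8/27 ≈ 0.296, strictly below the classical 1/e ≈ 0.368 lower bound that holds for every log-concave density. This illustrates why the s-concave bound must degrade with s:

```python
import math

a = 3.0            # Pareto shape: density a * x**-(a + 1) on [1, inf)
# f**s with s = -1/(a + 1) is proportional to x, hence convex, so f is s-concave
mean = a / (a - 1)          # centroid of the distribution, = 1.5
p = mean ** (-a)            # closed-form tail: Pr[X >= t] = t**(-a), so p = 8/27
below_logconcave_bound = p < 1 / math.e
```

Pushing a toward 1 (i.e., s toward −1/2) drives this probability even lower, matching the qualitative message of the lemma.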
We also study the image of a one-dimensional s-concave density. The following condition on s is for the existence of the second-order moment.
Lemma 7.
Let f be an isotropic one-dimensional s-concave density with s in the range guaranteeing a second moment. (a) For all x, f(x) is upper bounded by a closed-form function of s; (b) f(0) is lower bounded by a closed-form function of s.
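The closed-form bounds in results like Lemma 7 rest on moment computations that, for baseline densities of the power-law shape (1 + t)^{1/s}, reduce to beta-function integrals. A numerical sanity check of one such identity, with parameters of our choosing (1/s = −5, second moment): ∫₀^∞ t²(1 + t)⁻⁵ dt = B(3, 2) = 1/12.

```python
import numpy as np
from math import gamma

# Left Riemann sum of t**2 * (1 + t)**(-5) on [0, 2000]; the tail beyond 2000
# contributes less than 2000**(-2) / 2 and is negligible at this tolerance.
t = np.linspace(1e-6, 2000, 2_000_001)
dt = t[1] - t[0]
m2 = float(np.sum(t ** 2 * (1 + t) ** (-5)) * dt)

# B(3, 2) = Gamma(3) * Gamma(2) / Gamma(5) = 2 / 24 = 1/12
b32 = gamma(3) * gamma(2) / gamma(5)
```

More generally, ∫₀^∞ t^k (1 + t)^{−p} dt = B(k + 1, p − k − 1) whenever k < p − 1, which is what makes the recursive moment inequalities of the baseline-function argument tractable.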
4.3.2 High-Dimensional Case
We now move on to the high-dimensional case n ≥ 2. In the following, we will assume that s is not too negative as a function of n. Though this working range of s shrinks as n becomes larger, it is almost the broadest range of s one can hope for: Chandrasekaran et al. [23] showed a lower bound on s of order −1/n if one requires the s-concave distribution to have good geometric properties. In addition, we can see from Theorem 3 that if s is below roughly −1/n, the marginal of an s-concave distribution might not even exist; such a case does happen for certain s-concave distributions, e.g., the Cauchy distribution. So our range of s is almost tight up to a small factor.
We start our analysis with the density of centroid-centered halfspaces in high-dimensional spaces.
Lemma 8 (Density of Centroid-Centered Halfspaces).
Let f be an s-concave density function in R^n, and let H be any halfspace containing its centroid. Then Pr[H] is lower bounded by a closed-form function of s and n, for s in the admissible range.
The following theorem is an extension of Lemma 7 to high-dimensional spaces. The proofs basically reduce the n-dimensional density to its one-dimensional marginal by Theorem 3, and apply Lemma 7 to bound the image.
Theorem 9 (Bounds on Density).
Let f be an isotropic s-concave density in R^n. Then, for closed-form functions of s and n:
(a) f(x) is lower bounded for every x whose norm is below a closed-form threshold.
(b) f(x) is upper bounded for every x.
(c) There exists an x such that f(x) is at least a closed-form lower bound.
(d) f(0) is bounded below and above in closed form.
(e) f(x) admits a closed-form upper bound that decays with ‖x‖, for every x.
(f) For any line ℓ through the origin, the integral of f over ℓ is bounded in closed form.
Theorem 9 provides uniform bounds on the density function. To obtain a more refined upper bound on the image of s-concave densities, we have the following lemma. The proof is built upon Theorem 9.
Lemma 10 (More Refined Upper Bound on Densities).
Let f be an isotropic s-concave density. Then for every x, f(x) is upper bounded by a closed-form function of s, n, and ‖x‖, where the constants involved are themselves closed-form functions of s and n.
We also give an absolute bound on the measure of a band.
Theorem 11 (Probability inside Band).
Let D be an isotropic s-concave distribution in R^n, and let w be any unit vector. Then the probability of the band Pr_{x∼D}[|w · x| ≤ t] is bounded below and above by closed-form functions of s, n, and t; moreover, for t below a closed-form threshold, the band probability is proportional to t up to factors depending on s and n.
To analyze the problem of learning linear separators, we are interested in studying the disagreement between the output hypothesis and the target hypothesis. The following theorem captures this characteristic under s-concave distributions.
Theorem 12 (Probability of Disagreement).
Assume D is an isotropic s-concave distribution in R^n. Then for any two unit vectors u and v in R^n, we have Pr_{x∼D}[sign(u · x) ≠ sign(v · x)] ≥ c(s, n) · θ(u, v), where θ(u, v) is the angle between u and v and c(s, n) is a closed-form function that is an absolute constant when s = 0.
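The angle scaling in Theorem 12 can be sanity-checked by Monte Carlo. Under the 2-D standard Cauchy, which is spherically symmetric and s-concave (its density ∝ (1 + ‖x‖²)^{−3/2} has an affine (−2/3)-th power), the disagreement probability between two homogeneous halfspaces equals θ(u, v)/π exactly, since sign(u · x) depends only on the direction of x. This is our illustrative setup, not the theorem's closed-form bound:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# 2-D standard Cauchy: Gaussian vector divided by an independent |N(0,1)| scale
g = rng.standard_normal((n, 2))
z = np.abs(rng.standard_normal(n))
X = g / z[:, None]

theta = np.pi / 6
u = np.array([1.0, 0.0])
v = np.array([np.cos(theta), np.sin(theta)])
dis = float(np.mean(np.sign(X @ u) != np.sign(X @ v)))  # should be near theta/pi
```

Even though the Cauchy tail is as fat as the admissible range allows, the disagreement probability is still linear in the angle, which is the property the active learning analysis needs.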
Due to space constraints, all missing proofs are deferred to the supplementary material.
5 Applications: Provable Algorithms under S-Concave Distributions
In this section, we show that many algorithms that work under log-concave distributions behave well under s-concave distributions, by applying the above geometric properties. For simplicity, we will frequently use the notation of Theorem 1.
5.1 Margin-Based Active Learning
We first investigate margin-based active learning under isotropic s-concave distributions in both the realizable and the adversarial noise models. The algorithm (see Algorithm 1) follows a localization technique: It proceeds in rounds, aiming to cut the error down by half in each round inside the margin [8].
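The localization loop can be sketched in a few lines, under simplifying assumptions of ours rather than the paper's: a Gaussian marginal (the log-concave s = 0 case) stands in for a general s-concave sampler, a plain perceptron replaces the hinge-loss minimization used in the noisy case, and all constants are hand-picked. Each round queries labels only inside the band |w · x| ≤ b and then halves the band width:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
w_star = np.zeros(d)
w_star[0] = 1.0                      # hidden target halfspace

def sample(n):
    X = rng.standard_normal((n, d))  # isotropic marginal (Gaussian stand-in)
    y = np.sign(X @ w_star)
    y[y == 0] = 1.0
    return X, y

def fit(X, y, w0):
    # crude in-band fitter: a few perceptron passes, warm-started at w0
    w = w0.copy()
    for _ in range(50):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w = w + yi * xi
    return w / np.linalg.norm(w)

w = fit(*sample(50), rng.standard_normal(d))   # rough initial guess
b, labels = 1.0, 50
for k in range(5):
    X, y = sample(4000)              # unlabeled pool: cheap
    band = np.abs(X @ w) <= b        # query labels only inside the band
    Xb, yb = X[band][:200], y[band][:200]
    labels += len(yb)
    w = fit(Xb, yb, w)
    b /= 2                           # localize: halve the band each round

Xt, yt = sample(20_000)
err = float(np.mean(np.sign(Xt @ w) != yt))    # held-out 0/1 error
```

With roughly a thousand labels this typically reaches a small held-out error; the point is only the shape of the loop (band, fit, halve), not the paper's precise constants or label-complexity bounds, which depend on the s-concave band and disagreement estimates developed above.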
5.1.1 Relevant Properties of S-Concave Distributions
The analysis requires the more refined geometric properties below. Theorem 13 basically claims that the error mostly concentrates in a band, and Theorem 14 guarantees that the variance in any one-dimensional direction cannot be too large. We defer the detailed proofs to the supplementary material.
Theorem 13 (Disagreement outside Band).
Let u and v be two vectors in R^n, and assume the angle between them is small. Let D be an isotropic s-concave distribution. Then for any absolute constant c, there exists a closed-form margin width such that the probability of disagreement between u and v outside the band of that width is at most a prescribed fraction of the angle; the closed form involves the beta function and the quantities given by Lemma 10.
5.1.2 Realizable Case
We show that margin-based active learning works under s-concave distributions in the realizable case.
Theorem 15.
In the realizable case, let D be an isotropic s-concave distribution in R^n. Then for any ε, δ > 0, there is an algorithm (see the supplementary material) that runs in O(log(1/ε)) iterations, requires a closed-form number of labels in the k-th round, and outputs a linear separator of error at most ε with probability at least 1 − δ. In particular, when s = 0 (a.k.a. log-concave), the label complexity matches the known bound for log-concave distributions.
By Theorem 15, we see that margin-based active learning under s-concave distributions works almost as well as under log-concave distributions in the realizable case, improving exponentially w.r.t. 1/ε over passive learning algorithms.
5.1.3 Efficient Learning with Adversarial Noise
In the adversarial noise model, an adversary can choose any joint distribution over instances and labels such that the marginal over instances is s-concave, but a small fraction of the labels may be flipped adversarially. The analysis builds upon an induction technique where in each round we do hinge loss minimization in the band and cut down the 0/1 loss by half. The algorithm was previously analyzed in [5, 6] for the special class of log-concave distributions. In this paper, we analyze it for the much more general class of s-concave distributions.
Theorem 16.
Let D be an isotropic s-concave distribution in R^n over the instances, and let the labels obey the adversarial noise model. If the rate of adversarial noise is at most c·ε for some absolute constant c, then for any ε, δ > 0, Algorithm 1 runs in O(log(1/ε)) iterations and outputs a linear separator w such that err(w) ≤ ε with probability at least 1 − δ. The label complexity in the k-th round is a closed-form function of s, n, and δ. In particular, if s = 0, the label complexity matches the log-concave bound.
By Theorem 16, the label complexity of margin-based active learning improves exponentially over that of passive learning w.r.t. 1/ε, even under fat-tailed s-concave distributions and the challenging adversarial noise model.
5.2 Disagreement-Based Active Learning
We apply our results to the analysis of disagreement-based active learning under s-concave distributions. The key is estimating the disagreement coefficient, a measure of the complexity of active learning problems that can be used to bound the label complexity [34]. Recall the definition of the disagreement coefficient w.r.t. a classifier h, precision ε, and distribution D, as follows. For any r > 0, define the ball B(h, r) = {h′ : d(h, h′) ≤ r}, where d(h, h′) = Pr_{x∼D}[h(x) ≠ h′(x)]. Define the disagreement region as DIS(H) = {x : ∃h_1, h_2 ∈ H such that h_1(x) ≠ h_2(x)}. Let the Alexander capacity be cap(r) = Pr[DIS(B(h, r))]/r. The disagreement coefficient is defined as θ(ε) = sup_{r ≥ ε} cap(r). Below, we state our results on the disagreement coefficient under isotropic s-concave distributions.
Theorem 17 (Disagreement Coefficient).
Let D be an isotropic s-concave distribution over R^n. For any target halfspace and any precision ε, the disagreement coefficient is bounded by a closed-form function of s, n, and ε. In particular, when s = 0 (a.k.a. log-concave), it matches the known bound for log-concave distributions.
Our bounds on the disagreement coefficient match the best known results for the much less general case of log-concave distributions [9]; furthermore, they apply to the s-concave case where we allow an arbitrary number of discontinuities, a case not captured by [31]. The result immediately implies concrete bounds on the label complexity of disagreement-based active learning algorithms, e.g., CAL [24] and A² [7]. For instance, by composing it with the result from [27], we obtain a closed-form label complexity bound for agnostic active learning under an isotropic s-concave distribution: it suffices to request that many labels to output a halfspace with error at most O(ν + ε), where ν is the noise rate.
5.3 Learning Intersections of Halfspaces
Baum [11] provided a polynomial-time algorithm for learning intersections of halfspaces w.r.t. symmetric distributions. Later, Klivans et al. [44] extended the result by showing that the algorithm works under any distribution D as long as D(A) is close to D(−A) for any set A. In this section, we show that it is possible to learn intersections of halfspaces under the broader class of s-concave distributions.
Theorem 18.
In the PAC realizable case, there is an algorithm (see the supplementary material) that outputs a hypothesis of error at most ε with probability at least 1 − δ under isotropic s-concave distributions. The label complexity is a closed-form function of s, n, ε, and δ. In particular, if s = 0 (a.k.a. log-concave), the distribution-dependent factor is an absolute constant.
6 Lower Bounds
In this section, we give information-theoretic lower bounds on the label complexity of passive and active learning of homogeneous halfspaces under s-concave distributions.
Theorem 19.
For a fixed value of s, we have: (a) For any s-concave distribution D in R^n whose covariance matrix is of full rank, the sample complexity of learning origin-centered linear separators under D in the passive learning scenario is Ω(n/ε); (b) The label complexity of active learning of linear separators under s-concave distributions is Ω(n log(1/ε)).
If the covariance matrix of D is not of full rank, then the intrinsic dimension is less than n, so our lower bounds essentially apply to all s-concave distributions. According to Theorem 19, it is possible to have an exponential improvement in label complexity w.r.t. 1/ε over passive learning by active sampling, even though the underlying distribution is a fat-tailed s-concave distribution. This observation is captured by Theorems 15 and 16.
7 Conclusions
In this paper, we study the geometric properties of s-concave distributions. Our work advances the state-of-the-art results on margin-based active learning, disagreement-based active learning, and learning intersections of halfspaces w.r.t. distributions over the instance space. When s = 0, our results reduce to the best-known results for log-concave distributions. The geometric properties of s-concave distributions can potentially be applied to other learning algorithms, which might be of independent interest more broadly.
Acknowledgements. This work was supported in part by grants NSF CCF-1535967, NSF CCF-1422910, NSF CCF-1451177, a Sloan Fellowship, and a Microsoft Research Fellowship.
References
 [1] M. Anthony and P. L. Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.

 [2] D. Applegate and R. Kannan. Sampling and integration of near log-concave functions. In ACM Symposium on Theory of Computing, pages 156–163, 1991.
 [3] P. Awasthi, M.-F. Balcan, N. Haghtalab, and R. Urner. Efficient learning of linear separators under bounded noise. In Annual Conference on Learning Theory, pages 167–190, 2015.
 [4] P. Awasthi, M.-F. Balcan, and H. Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Annual Conference on Learning Theory, pages 152–192, 2016.
 [5] P. Awasthi, M.-F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. In ACM Symposium on Theory of Computing, pages 449–458, 2014.
 [6] P. Awasthi, M.-F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. Journal of the ACM, 63(6):50, 2017.
 [7] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89, 2009.
 [8] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In Annual Conference on Learning Theory, pages 35–50, 2007.
 [9] M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-concave distributions. In Annual Conference on Learning Theory, pages 288–316, 2013.
 [10] M.-F. Balcan and H. Zhang. Noise-tolerant lifelong matrix completion via adaptive sampling. In Advances in Neural Information Processing Systems, pages 2955–2963, 2016.
 [11] E. B. Baum. A polynomial time algorithm that learns two hidden unit nets. Neural Computation, 2(4):510–522, 1990.
 [12] D. Bertsimas and S. Vempala. Solving convex programs by random walks. Journal of the ACM, 51(4):540–556, 2004.

 [13] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In International Conference on Machine Learning, pages 49–56, 2009.
 [14] A. Beygelzimer, D. J. Hsu, J. Langford, and C. Zhang. Search improves label for active learning. In Advances in Neural Information Processing Systems, pages 3342–3350, 2016.
 [15] A. Beygelzimer, D. J. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In Advances in Neural Information Processing Systems, pages 199–207, 2010.
 [16] A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial-time algorithm for learning noisy linear threshold functions. In IEEE Symposium on Foundations of Computer Science, pages 330–338, 1996.
 [17] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.
 [18] S. G. Bobkov. Large deviations and isoperimetry over convex probability measures with heavy tails. Electronic Journal of Probability, 12:1072–1100, 2007.
 [19] O. Bousquet, S. Boucheron, and G. Lugosi. Theory of classification: A survey of recent advances. ESAIM: Probability and Statistics, 9(9):323–375, 2005.
 [20] H. J. Brascamp and E. H. Lieb. On extensions of the Brunn-Minkowski and Prékopa-Leindler theorems, including inequalities for log-concave functions, and with an application to the diffusion equation. Journal of Functional Analysis, 22(4):366–389, 1976.
 [21] C. Caramanis and S. Mannor. An inequality for nearly log-concave distributions with applications to learning. In Annual Conference on Learning Theory, pages 534–548, 2004.
 [22] C. Caramanis and S. Mannor. An inequality for nearly log-concave distributions with applications to learning. IEEE Transactions on Information Theory, 53(3):1043–1057, 2007.

 [23] K. Chandrasekaran, A. Deshpande, and S. Vempala. Sampling s-concave functions: The limit of convexity based isoperimetry. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 420–433, 2009.
 [24] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
 [25] A. Daniely. Complexity theoretic limitations on learning halfspaces. In ACM Symposium on Theory of Computing, pages 105–117, 2016.
 [26] S. Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, volume 17, pages 337–344, 2004.
 [27] S. Dasgupta, D. J. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems, pages 353–360, 2007.

 [28] S. Dasgupta, A. T. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. In Annual Conference on Learning Theory, pages 249–263, 2005.
 [29] J. Dunagan and S. Vempala. A simple polynomial-time rescaling algorithm for solving linear programs. In ACM Symposium on Theory of Computing, pages 315–320, 2004.
 [30] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Information, prediction, and query by committee. Advances in Neural Information Processing Systems, pages 483–483, 1993.
 [31] E. Friedman. Active learning for smooth problems. In Annual Conference on Learning Theory, 2009.
 [32] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. SIAM Journal on Computing, 39(2):742–765, 2009.
 [33] Q. Han and J. A. Wellner. Approximation and estimation of s-concave densities via Rényi divergences. The Annals of Statistics, 44(3):1332–1359, 2016.
 [34] S. Hanneke. A bound on the label complexity of agnostic active learning. In International Conference on Machine Learning, pages 353–360, 2007.
 [35] S. Hanneke et al. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2–3):131–309, 2014.
 [36] A. T. Kalai, A. R. Klivans, Y. Mansour, and R. A. Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008.
 [37] A. T. Kalai and S. Vempala. Simulated annealing for convex optimization. Mathematics of Operations Research, 31(2):253–266, 2006.
 [38] D. M. Kane, S. Lovett, S. Moran, and J. Zhang. Active classification with comparison queries. In IEEE Symposium on Foundations of Computer Science, pages 355–366, 2017.
 [39] M. Kearns and M. Li. Learning in the presence of malicious errors. SIAM Journal on Computing, 22(4):807–837, 1993.
 [40] M. J. Kearns, R. E. Schapire, and L. M. Sellie. Toward efficient agnostic learning. Machine Learning, 17(2–3):115–141, 1994.

 [41] M. J. Kearns and U. V. Vazirani. An introduction to computational learning theory. MIT Press, 1994.
 [42] A. Klivans and P. Kothari. Embedding hard learning problems into Gaussian space. International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, 28:793–809, 2014.
 [43] A. R. Klivans, P. M. Long, and R. A. Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10:2715–2740, 2009.
 [44] A. R. Klivans, P. M. Long, and A. K. Tang. Baum’s algorithm learns intersections of halfspaces with respect to log-concave distributions. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 588–600, 2009.
 [45] A. R. Klivans, R. O’Donnell, and R. A. Servedio. Learning intersections and thresholds of halfspaces. In IEEE Symposium on Foundations of Computer Science, pages 177–186, 2002.
 [46] S. R. Kulkarni, S. K. Mitter, and J. N. Tsitsiklis. Active learning using arbitrary binary valued queries. Machine Learning, 11(1):23–35, 1993.
 [47] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.
 [48] P. M. Long. On the sample complexity of PAC learning halfspaces against the uniform distribution. IEEE Transactions on Neural Networks, 6(6):1556–1559, 1995.
 [49] L. Lovász and S. Vempala. The geometry of log-concave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307–358, 2007.
 [50] M. Minsky and S. Papert. Perceptrons: An introduction to computational geometry (expanded edition). MIT Press, 1987.
 [51] R. A. Servedio. Efficient algorithms in computational learning theory. PhD thesis, Harvard University, 2001.
 [52] S. Shalev-Shwartz, O. Shamir, and K. Sridharan. Learning kernel-based halfspaces with the zero-one loss. arXiv preprint arXiv:1005.3681, 2010.
 [53] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, pages 135–166, 2004.
 [54] V. Vapnik. Estimations of dependences based on statistical data. Springer, 1982.

 [55] V. Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 2013.
 [56] L. Wang. Smoothness, disagreement coefficient, and the label complexity of agnostic active learning. Journal of Machine Learning Research, 12(Jul):2269–2292, 2011.
 [57] Y. Xu, H. Zhang, A. Singh, A. Dubrawski, and K. Miller. Noise-tolerant interactive learning using pairwise comparisons. In Advances in Neural Information Processing Systems, pages 2428–2437, 2017.
 [58] S. Yan and C. Zhang. Revisiting perceptron: Efficient and label-optimal active learning of halfspaces. arXiv preprint arXiv:1702.05581, 2017.
 [59] C. Zhang and K. Chaudhuri. Beyond disagreement-based agnostic active learning. In Advances in Neural Information Processing Systems, pages 442–450, 2014.
Appendix A Proof of Theorem 3
Theorem 3 (restated) Let f be an s-concave density on a convex set K in n dimensions with s > −1/n. Denote by f|E the marginal density of f over a k-dimensional subspace E. For every x in the projection of K onto E, consider the section K(x) = {y : (x, y) ∈ K}. Then the marginal density f|E is γ-concave on the projection of K onto E, where γ = s/(1 + (n − k)s). Moreover, if f is isotropic, then f|E is isotropic.
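Before the proof, a numerical sanity check (our own, not part of the paper's argument) of the s-concavity inequality f(λx + (1 − λ)y) ≥ (λ f(x)^s + (1 − λ) f(y)^s)^(1/s) for a concrete fat-tailed density: the one-dimensional Cauchy density is (−1)-concave, since 1/f(x) = π(1 + x²) is convex.

```python
import math
import random

def cauchy_pdf(x):
    """Density of the standard one-dimensional Cauchy distribution."""
    return 1.0 / (math.pi * (1.0 + x * x))

def s_inequality_holds(f, x, y, lam, s, tol=1e-12):
    """Check f(lam*x + (1-lam)*y) >= (lam*f(x)**s + (1-lam)*f(y)**s)**(1/s)."""
    lhs = f(lam * x + (1.0 - lam) * y)
    rhs = (lam * f(x) ** s + (1.0 - lam) * f(y) ** s) ** (1.0 / s)
    return lhs >= rhs - tol

random.seed(0)
# Verify the s = -1 inequality for the Cauchy density on random point pairs.
all_hold = all(
    s_inequality_holds(cauchy_pdf, random.uniform(-50.0, 50.0),
                       random.uniform(-50.0, 50.0), random.random(), s=-1.0)
    for _ in range(10000)
)
```

For s = −1 the right-hand side is the weighted harmonic mean of f(x) and f(y), so the inequality reduces to the convexity of 1/f, which holds here because π(1 + x²) is convex.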
Proof.
The proof that f|E is isotropic is standard [49]. We now prove the first part. Let x1, x2 be any two points in the projection of K onto E. Define hi(y) = f(xi, y) for i = 1, 2, so the function hi is defined on K(xi), i = 1, 2. Now let xλ = λx1 + (1 − λ)x2 for λ ∈ [0, 1] and define hλ(y) = f(xλ, y) on K(xλ). Notice that for any y1 ∈ K(x1) and y2 ∈ K(x2), we have λy1 + (1 − λ)y2 ∈ K(xλ). To see this, note that by the convexity of the set K, the point λ(x1, y1) + (1 − λ)(x2, y2) belongs to K. So (xλ, λy1 + (1 − λ)y2) ∈ K, i.e., λy1 + (1 − λ)y2 ∈ K(xλ). Using the s-concavity of f, we have