1 Introduction
We study the problem of agnostic distribution learning whereby a learner is given i.i.d. samples from an unknown distribution $p$ and needs to choose, among a set of candidate distributions $\mathcal{Q}$, the one that is closest to $p$. This problem formulation immediately raises several questions. The first one is how to define closeness between probability distributions. Here we will argue that the total variation metric is a natural choice. The second one is what assumptions are made on $p$. We choose the so-called agnostic or robust case, which means that we do not make any assumption. The last one is whether the learner should return an element of $\mathcal{Q}$ (this is called the proper case), or may produce a distribution which is not a member of $\mathcal{Q}$ (this is the improper case) but is guaranteed to be competitive with respect to the best member of $\mathcal{Q}$.
Our study will focus on the information-theoretic limits of the problem, which means that we will not be concerned with the computational complexity of the learner and will only consider what, in theory, is the best achievable performance of a learner as a function of the size of the candidate class $\mathcal{Q}$ and the number of samples from $p$ that it has access to.
1.1 Why total variation?
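The total variation metric, defined for two probability measures $p, q$ on $\mathcal{X}$ as
$$\mathrm{TV}(p, q) = \sup_{A \subseteq \mathcal{X} \text{ measurable}} |p(A) - q(A)| \tag{1}$$
has the nice property of being a proper metric. Additionally, it has the natural interpretation of measuring the largest discrepancy in the measure assigned to the same event by the two different measures. And while it thus looks like an $L_\infty$ metric (when viewing a probability measure as a map from subsets of $\mathcal{X}$ to $[0, 1]$), it can also be rewritten as an $L_1$ norm: if $p$ and $q$ have densities $f$ and $g$ respectively (or probability mass functions when $\mathcal{X}$ is finite/countable),
$$\mathrm{TV}(p, q) = \frac{1}{2} \int_{\mathcal{X}} |f(x) - g(x)| \, dx \tag{2}$$
as well as an optimal coupling:
$$\mathrm{TV}(p, q) = \inf_{(X, Y) \,:\, X \sim p,\; Y \sim q} \Pr[X \neq Y] \tag{3}$$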
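As a sanity check on the three characterizations above, the following minimal sketch evaluates each of them on a small finite domain. The dictionary representation of the distributions and the function names are illustrative assumptions, not notation from the text. The coupling version uses the standard fact that a maximal coupling keeps mass $\sum_x \min(p(x), q(x))$ matched.

```python
from itertools import chain, combinations

def tv_sup_events(p, q):
    """TV as the largest discrepancy |p(A) - q(A)| over all events A (brute force)."""
    support = sorted(set(p) | set(q))
    events = chain.from_iterable(
        combinations(support, r) for r in range(len(support) + 1)
    )
    return max(abs(sum(p.get(x, 0.0) - q.get(x, 0.0) for x in A)) for A in events)

def tv_half_l1(p, q):
    """TV as half the L1 distance between the two probability mass functions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def tv_coupling(p, q):
    """TV as the mismatch probability of a maximal coupling: 1 - sum_x min(p(x), q(x))."""
    support = set(p) | set(q)
    return 1.0 - sum(min(p.get(x, 0.0), q.get(x, 0.0)) for x in support)

# Hypothetical example distributions on a three-point domain.
p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.3, "c": 0.5}
values = (tv_sup_events(p, q), tv_half_l1(p, q), tv_coupling(p, q))
```

The three entries of `values` agree up to floating-point error. The brute-force supremum enumerates all $2^{|\mathcal{X}|}$ events, so it is only meant for tiny domains.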
Note that there is a large literature about density estimation in the $L_2$ metric (as opposed to $L_1$). However, $L_2$ is a less natural way of measuring the distance between densities because it lacks invariance with respect to the choice of the reference measure on the domain. This may not be an issue when considering real-valued distributions where the Lebesgue measure is the canonical choice, but when working on high-dimensional or general domains, this dependency is not necessarily desirable (for more details, see Chapter 6.5 in the book by Devroye and Lugosi (2001)).
Another classical choice is to use the Kullback-Leibler divergence. However, with $p$ the target and $q$ the model, $\mathrm{KL}(q \,\|\, p)$ has the downside of being defined only when $q$ is absolutely continuous with respect to $p$, and in a setting like the one we are considering, where we do not wish to assume anything about the target distribution, this cannot be guaranteed. Even if one were to consider $\mathrm{KL}(p \,\|\, q)$ instead, one would be restricted to considering models that put mass on all points of the domain, and the Kullback-Leibler divergence could be dominated by the points of very low probability.
Compared to those other two choices, total variation has the benefit of being invariant, bounded, and a metric. We refer the reader to Chapter 6 in the book by Devroye and Lugosi (2001) for a discussion regarding the advantages of total variation and a detailed comparison with other natural similarity measures.
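To make the contrast concrete, here is a small sketch (function names and the example distributions are illustrative assumptions) in which the Kullback-Leibler divergence is infinite because of a single low-probability point that the model misses, while the total variation distance stays small and bounded.

```python
import math

def kl(a, b):
    """KL(a || b) for pmfs given as dicts; infinite when a is not absolutely continuous w.r.t. b."""
    total = 0.0
    for x, ax in a.items():
        if ax == 0.0:
            continue
        bx = b.get(x, 0.0)
        if bx == 0.0:
            return math.inf  # a puts mass on a point that b excludes
        total += ax * math.log(ax / bx)
    return total

def tv(a, b):
    """Total variation distance as half the L1 distance."""
    support = set(a) | set(b)
    return 0.5 * sum(abs(a.get(x, 0.0) - b.get(x, 0.0)) for x in support)

# Hypothetical target with a rare point "b" that the model never produces.
target = {"a": 0.99, "b": 0.01}
model = {"a": 1.0}

kl_target_model = kl(target, model)  # infinite: the model misses "b"
kl_model_target = kl(model, target)  # finite, since the model's support lies inside the target's
tv_value = tv(target, model)         # small and always at most 1
```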
Of course, there are other possible choices such as the Hellinger divergence or others, and it would be an interesting question to extend the current study to those.
1.2 Why agnostic?
A basic classification of machine learning problems distinguishes between realizable and agnostic learning. In the realizable case one assumes that the target distribution $p$ belongs to a prespecified class $\mathcal{Q}$ which is known to the algorithm, and in the agnostic case one usually does not assume anything about the target distribution but rather extends the goal of learning so that the output distribution is competitive with the best distribution in $\mathcal{Q}$ (i.e. the one which is closest to $p$).
In this work we focus on the agnostic case. Nevertheless, a sensible setting to keep in mind is the “almost realizable” case in which the distance between $p$ and $\mathcal{Q}$ is small (this is due to the lower bound of $\alpha \geq 2$, and $\alpha \geq 3$ in the proper case; see Section 1.7). Such scenarios may occur in contexts where one has a strong prior about the target distribution, but would like to remain resilient/robust against small fluctuations and thus to avoid realizability assumptions.
1.3 Why improper?
Another basic classification of machine learning problems distinguishes between proper and improper learning. In the proper case the algorithm always outputs a distribution $q \in \mathcal{Q}$, whereas in the improper case it may output an arbitrary distribution (in both cases the goal remains the same, namely to compete with the best distribution in $\mathcal{Q}$). While at first glance it may seem strange to consider the improper case, it turns out that in many cases improperness is beneficial (e.g. boosting is inherently improper (Schapire and Freund, 2012); in multiclass classification some classes can only be learned improperly (Daniely and Shalev-Shwartz, 2014)). The main results in this paper manifest another setting in which improper learning is provably stronger than proper learning.
1.4 Is this problem too hard?
While the total variation is a natural metric with strong guarantees, at first glance it may seem impossible to use in such an abstract distribution learning setting: imagine that the class $\mathcal{Q}$ contains just two distributions $q_1, q_2$, and let $p$ denote the target distribution. Then, a natural empirical-risk-minimization-like approach would be to estimate both distances $\mathrm{TV}(q_1, p)$ and $\mathrm{TV}(q_2, p)$ from a large enough i.i.d. sample drawn from $p$ and output the minimizer. The problem with this approach is that the number of samples from $p$ required to estimate $\mathrm{TV}(q_i, p)$ grows with the size of the domain $\mathcal{X}$ (see e.g. Jiao et al. (2018)). In particular, if $\mathcal{X}$ is infinite then it is impossible to do it with a finite sample complexity.
However, perhaps surprisingly, despite the impossibility of estimating the total variation one can still find an approximate minimizer of it (even when $\mathcal{X}$ is infinite!). A more detailed survey of relevant results is given in Section 1.6 below.
1.5 Problem definition
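Let $\mathcal{X}$ be a domain and let $\Delta(\mathcal{X})$ denote the set of all probability distributions over $\mathcal{X}$. We assume that either (i) $\mathcal{X}$ is finite, in which case $\Delta(\mathcal{X})$ is identified with the set of $|\mathcal{X}|$-dimensional probability vectors, or (ii) $\mathcal{X} = \mathbb{R}^d$, in which case $\Delta(\mathcal{X})$ is the set of Borel probability measures.
Let $\mathcal{Q} \subseteq \Delta(\mathcal{X})$ be a set of distributions. We focus on the case where $\mathcal{Q}$ is finite and denote its size by $n$. Given an approximation factor $\alpha$, we say that $\mathcal{Q}$ is $\alpha$-learnable if there is an algorithm $A$ such that for every $\varepsilon, \delta > 0$ there is a finite sample complexity bound $m = m(\varepsilon, \delta)$ such that for every target distribution $p \in \Delta(\mathcal{X})$, if $A$ receives as input at least $m$ independent samples from $p$ then it outputs a distribution $q$ such that
$$\mathrm{TV}(p, q) \leq \alpha \cdot \mathrm{opt} + \varepsilon$$
with probability at least $1 - \delta$, where $\mathrm{opt} = \min_{q' \in \mathcal{Q}} \mathrm{TV}(p, q')$ and $\mathrm{TV}$ is the total variation distance. We say that $\mathcal{Q}$ is properly $\alpha$-learnable if it is $\alpha$-learnable by a proper algorithm; namely an algorithm that always outputs $q \in \mathcal{Q}$. The function $m(\varepsilon, \delta)$ is called the sample complexity of the algorithm.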
Sample complexity.
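Note that if $\mathcal{X}$ is finite then any class of distributions is $\alpha$-learnable for $\alpha = 1$ with a sample complexity that is linear in $|\mathcal{X}|$ (because this many samples suffice to estimate $p(x)$ for every $x \in \mathcal{X}$, which allows one to estimate its total variation distance to each $q \in \mathcal{Q}$). Therefore, when $\mathcal{X}$ is finite, we consider $\mathcal{Q}$ to be $\alpha$-learnable only if its sample complexity depends efficiently on $\log|\mathcal{X}|$, namely polynomially (note that $\log|\mathcal{X}|$ is the bit-complexity of representing each sample in the input, and therefore polynomial in $\log|\mathcal{X}|$ means polynomial in the input size).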
1.6 Previous related work
Density estimation has been studied for more than a century; for textbook introductions see e.g. (Devroye and Gyorfi, 1985; Devroye and Lugosi, 2001; Diakonikolas, 2016). A significant portion of works considered this problem when $\mathcal{Q}$ is some specific class of distributions such as mixtures of Gaussians (e.g. Kalai et al. (2012); Diakonikolas et al. (2017, 2018a); Kothari et al. (2018); Ashtiani et al. (2018b, a)), histograms (e.g. Pearson (1895); Lugosi and Nobel (1996); Devroye and Lugosi (2004); Chan et al. (2014); Diakonikolas et al. (2018b)), and more. For a fairly recent survey see (Diakonikolas, 2016).
This work concerns arbitrary classes $\mathcal{Q}$ and the only assumption we make is that $\mathcal{Q}$ is finite. The factor 3 upper bound in the proper case was derived by Yatracos (1985) using the elegant and simple idea of Yatracos’s sets (also referred to as Scheffé’s sets by Devroye and Lugosi (2001)). Devroye and Lugosi (2001) extended Yatracos’s idea and also gave a factor 2 lower bound for his algorithm. Mahalanabis and Stefankovic (2008) improved the lower bound to 3 and extended it to a more general family of proper algorithms. A lower bound of factor 2 for arbitrary (possibly improper) algorithms follows from the work of Chan et al. (2014) (see Section 1.7). Devroye and Lugosi (2001) point out in their book the absence of universal methods other than Yatracos’s which achieve a constant approximation factor; this comment inspired the current work.
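For orientation, the idea behind Yatracos’s sets can be sketched on a finite domain as follows. This is a sketch of the classical minimum-distance estimate as commonly presented (compare each candidate with the empirical measure only on the sets where one candidate exceeds another), not of the new algorithms of this paper. The dictionary representation and the name `yatracos_select` are illustrative assumptions.

```python
from collections import Counter

def yatracos_select(candidates, samples):
    """Return the index of the candidate pmf chosen by the minimum-distance estimate.

    candidates -- list of pmfs, each a dict mapping points to probabilities
    samples    -- list of points drawn from the unknown target
    """
    n_samples = len(samples)
    counts = Counter(samples)
    domain = set(counts).union(*candidates)
    # Yatracos sets: A_ij = {x : q_i(x) > q_j(x)} for every ordered pair i != j.
    yatracos_sets = [
        {x for x in domain if qi.get(x, 0.0) > qj.get(x, 0.0)}
        for i, qi in enumerate(candidates)
        for j, qj in enumerate(candidates)
        if i != j
    ]

    def worst_discrepancy(q):
        # Largest gap between q(A) and the empirical mass of A over all Yatracos sets.
        return max(
            (abs(sum(q.get(x, 0.0) for x in A) - sum(counts[x] for x in A) / n_samples)
             for A in yatracos_sets),
            default=0.0,
        )

    return min(range(len(candidates)), key=lambda i: worst_discrepancy(candidates[i]))

# Hypothetical usage: two candidates and a sample that mostly shows "a".
q0 = {"a": 0.9, "b": 0.1}
q1 = {"a": 0.1, "b": 0.9}
chosen = yatracos_select([q0, q1], ["a"] * 8 + ["b"] * 2)
```

Only the masses of at most $n(n-1)$ sets need to be estimated, which is why the number of samples can scale logarithmically with the number of candidates rather than with the size of the domain.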
1.7 Main results
Theorem 1 (Upper bound - improper case).
Every finite class of distributions $\mathcal{Q}$ is $\alpha$-learnable with $\alpha = 2$.
Theorem 2 (Lower bound - proper case).
For every $\alpha < 3$ there is a class $\mathcal{Q}$ of size $2$ that is not properly $\alpha$-learnable.
Tightness of Theorem 1.
The factor $\alpha = 2$ in Theorem 1 cannot be improved in general. This follows from Chan et al. (2014) (Theorem 7), which demonstrates a class of distributions over such that any (possibly improper) algorithm that learns this class with $\alpha < 2$ requires some samples. Note that in their theorem statement the class is infinite, but a closer inspection of their proof reveals that it needs only to contain two distributions, and so their lower bound already applies for $|\mathcal{Q}| = 2$.
Proofs overview.
Our approach for the lower bound is a variant of the proof in Chan et al. (2014) and boils down to using a tensorized version of Le Cam’s method together with a birthday-paradox type of argument.
For the upper bound, we introduce two methods, a static and an adaptive one, both of which are based on the observation that once we find a distribution $q$ so that $\mathrm{TV}(q, q_i) \leq \mathrm{TV}(p, q_i) + \varepsilon$ for every $q_i \in \mathcal{Q}$, the result follows by the triangle inequality (see Lemma 3). The static method can be viewed as a direct extension of Yatracos’ ideas, as we also construct a family of functions of finite VC dimension and estimate the corresponding surrogate variational metric (see Equation (4)). Note however that our construction and analysis are more complex and rely on a careful inspection of barycenters with respect to the total variation metric.
The adaptive method, which could apply to probability metrics other than $\mathrm{TV}$ (as long as they have a variational form as in (4), which is for example the case of the Wasserstein metric), proceeds in steps: it maintains lower bounds and, at each step, increases one of them by at least until there exists a distribution such that for all . Given that $\mathrm{TV}$ is bounded by $1$, this implies that the algorithm terminates after steps. The crux of the algorithm is in the implementation of each step. To this end we use the minimax theorem applied to (since $\mathrm{TV}$ is a supremum) to find functions so that some linear combination of the numbers is positive for any distribution . Applying this result for implies that estimating will allow us to improve at least one of our lower bounds.
1.8 Open questions and future research
The main result in this paper is the determination of the optimal approximation factor in density estimation and the development of universal algorithmic approaches to achieve it.
One central issue that remains open concerns sample complexity. Our current sample complexity upper bounds are either linear in or based on rather sophisticated techniques from adaptive data analysis which includes dependencies on . For comparison, Yatracos’s proper algorithm which achieves factor 3 has a clean sample complexity of . It would be interesting to determine whether the factor can be achieved with a similar sample complexity.
We list below other possible suggestions for future research:

Is it the case that any (possibly infinite) class that is learnable for some is learnable for ? E.g. assume that the family of Yatracos’s sets of has a finite VC dimension (so is properly learnable for ). Is learnable for ?

Our result remains valid if we replace the total variation with any IPM metric. How about divergences? Is there a natural characterization of all divergences for which every finite can be learned for some constant ?
2 Preliminaries
An assumption.
Some of our arguments exploit the Minimax Theorem for zero-sum games (Neumann, 1928). Therefore, we will assume a setting (i.e. the domain $\mathcal{X}$ and the set of distributions $\mathcal{Q}$) in which this theorem is valid. Alternatively, one could state explicit assumptions such as finiteness or forms of compactness under which it is known that the Minimax Theorem holds. However, we believe that the presentation benefits from avoiding such explicit technical assumptions and simply assuming the Minimax Theorem as an “axiom” in the discussed setting.
Standard notation.
We use to denote the set . For two vectors let denote the statement that for every . Denote by the standard basis vector whose ’th coordinate is and its other coordinates are and by the vector .
We use standard notations for asymptotics such as . We may also sometimes use or to hide logarithmic factors. E.g. if for some .
2.1 Total variation and surrogates
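Let $\mathcal{F}$ be a family of functions from $\mathcal{X}$ to $[0, 1]$. Assume that $\mathcal{F}$ is symmetric in the sense that whenever $f \in \mathcal{F}$ then also $1 - f \in \mathcal{F}$ (this allows us to remove the absolute value from some definitions and will simplify some calculations). Define a semimetric on $\Delta(\mathcal{X})$ (recall that $\Delta(\mathcal{X})$ is the set of distributions over $\mathcal{X}$),
$$d_{\mathcal{F}}(p, q) = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{x \sim q}[f(x)] \right) \tag{4}$$
Note that when $\mathcal{F}$ is the set of all (measurable) functions then $d_{\mathcal{F}}$ is the total variation distance. Note also that $d_{\mathcal{F}}$ is symmetric, i.e. $d_{\mathcal{F}}(p, q) = d_{\mathcal{F}}(q, p)$, that $d_{\mathcal{F}} \leq \mathrm{TV}$, and that $d_{\mathcal{F}}(\cdot, q)$ is convex (as a supremum over linear functions).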
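The following sketch computes such a surrogate distance on a finite domain and compares it with the total variation distance. It assumes, as a labeled assumption rather than something taken verbatim from the text, that the functions take values in $[0, 1]$ and that the family is closed under $f \mapsto 1 - f$. The closure is applied implicitly, since $1 - f$ attains exactly the negated gap of $f$.

```python
def expectation(dist, f):
    """E_{x ~ dist}[f(x)] for a pmf given as a dict."""
    return sum(mass * f(x) for x, mass in dist.items())

def surrogate_distance(p, q, family):
    """sup over f (and 1 - f) in the family of E_p[f] - E_q[f]."""
    best = 0.0
    for f in family:
        gap = expectation(p, f) - expectation(q, f)
        best = max(best, gap, -gap)  # -gap is the value attained by 1 - f
    return best

def tv(p, q):
    """Total variation distance as half the L1 distance."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

# Hypothetical distributions and two small families of indicator functions.
p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.3, "c": 0.5}
coarse_family = [lambda x: 1.0 if x == "b" else 0.0]
fine_family = coarse_family + [lambda x: 1.0 if x == "a" else 0.0]

d_coarse = surrogate_distance(p, q, coarse_family)  # cannot see the difference
d_fine = surrogate_distance(p, q, fine_family)      # recovers the full TV here
```

Enlarging the family can only increase the surrogate distance, and no family of $[0, 1]$-valued functions can exceed the total variation distance.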
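Distance vectors and sets.
Let $q_1, \ldots, q_n \in \Delta(\mathcal{X})$, and let $p$ be a distribution. The distance vector of $p$ relative to the $q_i$’s is the vector $v(p) = (\mathrm{TV}(p, q_i))_{i=1}^{n}$.
The following claim shows that in order to find $q$ such that $\mathrm{TV}(p, q) \leq 2 \min_i \mathrm{TV}(p, q_i) + \varepsilon$ it suffices to find $q$ such that $v(q) \leq v(p) + \varepsilon$ coordinatewise. All of our algorithms exploit this claim.
Lemma 3.
Let $p, q$ be such that $\mathrm{TV}(q, q_i) \leq \mathrm{TV}(p, q_i) + \varepsilon$ for every $i$. Then $\mathrm{TV}(p, q) \leq 2 \min_i \mathrm{TV}(p, q_i) + \varepsilon$.
Proof.
Follows directly by the triangle inequality; indeed, let $q_i$ be a minimizer of $\mathrm{TV}(p, q_i)$ in $\mathcal{Q}$. Then, $\mathrm{TV}(p, q) \leq \mathrm{TV}(p, q_i) + \mathrm{TV}(q_i, q) \leq 2\,\mathrm{TV}(p, q_i) + \varepsilon$. ∎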
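The triangle-inequality argument can be checked numerically. The sketch below uses assumed notation (target `p`, candidates `q_i`, output `q`, all probability vectors on a five-point domain). For random instances it takes the smallest slack `eps` for which `q` is within `eps` of `p`'s distance to every candidate, and measures how far $\mathrm{TV}(p, q)$ exceeds $2 \cdot \mathrm{opt} + \varepsilon$.

```python
import random

def tv(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def random_pmf(size, rng):
    """A random probability vector of the given length."""
    weights = [rng.random() for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

def max_violation(trials=1000, seed=0):
    """Largest observed value of TV(p, q) - (2 * opt + eps); should never be positive."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        candidates = [random_pmf(5, rng) for _ in range(3)]
        p, q = random_pmf(5, rng), random_pmf(5, rng)
        # Smallest eps with TV(q, q_i) <= TV(p, q_i) + eps for every candidate q_i.
        eps = max(0.0, max(tv(q, qi) - tv(p, qi) for qi in candidates))
        opt = min(tv(p, qi) for qi in candidates)
        worst = max(worst, tv(p, q) - (2 * opt + eps))
    return worst

violation = max_violation()
```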
Next, we explore which are of the form for some . For this we make the following definition. A vector is called an distance dominating vector if for some distribution . Define to be the set of all dominating distance vectors. When is the set of all measurable functions, we denote by .
Claim 4.
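The set of all dominating distance vectors is convex and upward-closed (recall that upward-closed means that whenever a vector belongs to the set, every vector that dominates it coordinatewise also belongs to the set).
Proof.
That the set is upward-closed is trivial. Convexity follows since the distance defined in Equation (4) is convex. ∎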
The following claim shows that the nontrivial halfspaces that contain have normals in the nonnegative orthant.
Claim 5.
If and satisfy that for all , then .
Proof.
We prove the contrapositive. Assume that for some . Then there is a vector with for all , where is sufficiently large so that . The proof is finished by noting that such a satisfies (because it dominates any distance vector). ∎
Corollary 6.
Let be compact and convex such that . Then, there is such that
Proof.
By the standard separation theorem for convex sets there is such that . By Claim 5 it follows that . ∎
Note that if are families of functions then . Thus, for every .
Claim 7.
Let , be families of functions. The following two statements are equivalent:

,

, for every .
Proof.
is trivial. For the other direction, we prove the contrapositive: assume that , and without loss of generality that . Then, by Corollary 6 there is such that for all , and in particular, as required. ∎
3 Upper bounds
In this section we show that every finite class is learnable for . This is achieved by Theorem 8 and Theorem 9 (stated below) which also provide quantitative bounds on the sample complexity.
Theorem 8 (Upper bound infinite domain).
Let be a finite class of distributions over a domain with . Then is learnable with and sample complexity
The first bound gives a standard dependency on $\varepsilon$ (standard in the sense that a similar dependence appears in popular concentration bounds). The second bound improves the dependence on $|\mathcal{Q}|$ from linear to sublinear; however, it has an inferior dependence on $\varepsilon$. Both of these bounds depend polynomially on $|\mathcal{Q}|$, which is poor compared to the logarithmic dependence exhibited by the proper learning algorithm due to Yatracos. The next theorem shows that for finite domains one can achieve a logarithmic dependence on $|\mathcal{Q}|$ (as well as on the size of the domain):
Theorem 9 (Upper bound finite domain).
Let be a finite class of distributions over a finite domain with . Then is learnable with and sample complexity
Theorem 8 and Theorem 9 are based on three algorithms, which are presented and analyzed in Section 3.1 and Section 4. In Section 4.1 we use these algorithms to prove Theorem 8 and Theorem 9.
3.1 Adaptive algorithms
In this section we present two algorithms which share a similar “adaptive” approach. These algorithms yield the sample complexity bounds with sublinear dependence on $|\mathcal{Q}|$ (that is, the second bound from Theorem 8 and the bound from Theorem 9). The algorithm which achieves the first bound from Theorem 8 is based on a “static” approach and appears in Section 4.
The two adaptive algorithms can be extended to yield learners for other metrics: they only rely on the triangle inequality and some form of convexity (which allows one to apply the Minimax Theorem). In particular they extend to any Integral Probability Metric (IPM) (Müller, 1997).
A crucial property that will be utilized in the sample complexity analysis is that these algorithms require only statistical query access (which we define next) to the target distribution $p$; in a statistical query, the algorithm submits a function $f$ to a statistical query oracle and receives back an estimate of $\mathbb{E}_{x \sim p}[f(x)]$. Note that the oracle can provide an $\varepsilon$-accurate estimate (that is, an estimate which is correct up to an additive error of $\varepsilon$) with probability at least $1 - \delta$ by drawing $O(\log(1/\delta)/\varepsilon^2)$ samples from $p$ per query and returning the empirical average of $f$ as an estimate. Interestingly, there are sophisticated methods within the domain of Adaptive Data Analysis that significantly reduce the amortized sample complexity for estimating adaptive queries (Dwork et al., 2015; Bassily et al., 2016). We will use these results in our sample complexity analysis (in Section 4.1).
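A naive statistical query oracle of this kind can be sketched as follows. This is only the per-query empirical-average baseline with a Hoeffding sample size for $[0, 1]$-valued queries, not the Adaptive Data Analysis mechanisms cited above. The names `hoeffding_sample_size`, `answer_query` and `draw` are illustrative assumptions.

```python
import math
import random

def hoeffding_sample_size(eps, delta):
    """Samples sufficing for a [0,1]-valued query to be eps-accurate
    with probability at least 1 - delta, by Hoeffding's inequality."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

def answer_query(f, draw_sample, eps, delta):
    """Answer one statistical query by the empirical average of f on fresh samples."""
    m = hoeffding_sample_size(eps, delta)
    return sum(f(draw_sample()) for _ in range(m)) / m

# Hypothetical target: a two-point distribution with mass 0.3 on "a".
rng = random.Random(0)
draw = lambda: "a" if rng.random() < 0.3 else "b"
indicator_a = lambda x: 1.0 if x == "a" else 0.0
estimate = answer_query(indicator_a, draw, eps=0.05, delta=1e-6)
```

Drawing fresh samples for every query is what makes this baseline safe under adaptivity, at the cost of a total sample size that is linear in the number of queries.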
We prove the following:
Theorem 10.
Let be a class of distributions, let , and let be the target distribution. Then there exist algorithms such that

makes at most statistical queries to and satisfies the following: if the estimates to all queries are accurate then it outputs such that .

makes at most statistical queries to and satisfies the following: if the estimates to all queries are accurate then it outputs such that .
Note that by Lemma 3 it follows that the output distribution satisfies , as required.
Proof of theorem 10.
Both algorithms follow the same skeleton which is depicted in Figure 1. The approach is based on Lemma 3 by which it suffices to find a vector such that , where is the distance vector of the target distribution with respect to the ’s. The derivation of such a distance vector is based on the convexity of , and the access of the algorithms to can be conveniently abstracted via the following separation oracle:
Definition 11 (Separation oracle).
A separation oracle for is an algorithm which, given an input point , if then it returns such that , and otherwise, it returns a hyperplane separating from .
The separation oracle is used in item 2.
The derivation of the desired distance vector is achieved by producing an increasing sequence of vectors such that is obtained from by increasing a carefully picked coordinate by (in item 2(b)). We postpone the details of how is found and first assume it in order to argue that the total number of iterations is at most : indeed, observe that the increases by in each step (i.e. ). Therefore, since we see that after at most steps, must satisfy . At this point a distribution is output such that , as required.
It thus remains to explain how an appropriate index is found in item 2(b) (which is also where the implementations of differ). The derivation of follows via an application of LP duality (in the form of the Minimax Theorem) as we explain next.
3.1.1 Finding an index in each step
Consider an arbitrary step in the algorithm, say the ’th step. Thus, we maintain a vector that satisfies . We assume that (or else we are done), and we want to show how, using few statistical queries, one can find an index such that .
The following lemma is the crux of the argument. On a high level, it shows how using a few statistical queries, one can estimate a vector such that (i) , and (ii) there is an index such that . This means that the index satisfies the requirements, and we can proceed to the next step by setting .
Lemma 12.
Let such that . Then, there are functions , and coefficients with , such that for every distribution the vector , defined by , satisfies:

, and

for all .
We stress that the functions ’s depend only on the ’s and on .
Proof of Lemma 12.
First, use Corollary 6 to find , such that . Note that necessarily , and therefore we can normalize it so that . Next, we find the functions ’s using the Minimax Theorem (Neumann, 1928):
Pick the functions ’s to be maximizers of the last expression (i.e. the maximizers of ). Therefore, for every distribution . This is equivalent to , which is the first item of the conclusion. For the second item, note that
as required. ∎
Binary search
Input: vectors , and functions as in Lemma 12, and a sample access to the target distribution .
Output: an index such that .

Set .
While :
Set , , and

Submit statistical queries to derive estimates of respectively up to an additive error of .

If then set , , and normalize for and else set , , and normalize for .


Output ().
We next show how to use Lemma 12 to find an appropriate index . Plug in the lemma , and set , where is the target distribution. Note that since the ’s are known, we can use statistical queries for ’s to estimate the entries of . By the first item of the lemma:
which implies that there exists an index such that (in fact it shows that if we interpret the ’s as a distribution over indices then, on average, a random index will satisfy it). The second item implies that increasing such a coordinate by will keep it upper bounded .
Thus, it suffices to estimate each coordinate up to an additive error of , and pick any index such that the estimated value satisfies . achieves this simply by querying statistical queries (one per ) with accuracy . So, the total number of statistical queries used by is at most , and if each of them is accurate then it outputs a valid distribution .
It remains to show how finds an index . uses a slightly more complicated binary-search approach, which uses just statistical queries, but requires higher accuracy of .
Binary search for an appropriate index .
The pseudocode appears in Figure 2. We next argue that the index output by this procedure satisfies . Consider the first iteration in the while loop; note that . Therefore, since it follows that . Now, , and therefore is at least . This in turn implies that is at least . Therefore, in the second iteration we have . By applying the same argument inductively we get that at the ’th iteration we have , and in particular in the last iteration we find an index such that , as required.
∎
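The halving idea behind such a binary search can be sketched generically. This is a sketch under stated assumptions and not a transcription of Figure 2: a hidden vector `w` can only be accessed through a hypothetical oracle `query` that estimates weighted averages of its coordinates, `h` is a distribution over indices, and we want an index whose coordinate is at least the `h`-weighted average of `w` up to accumulated error. If the weighted average over the active indices is at least some threshold, then so is the conditional average over one of the two halves, so it suffices to compare two estimates per round.

```python
def find_large_coordinate(h, query):
    """Return an index i whose hidden value w_i is (approximately) at least sum_i h_i * w_i.

    h     -- nonnegative weights summing to 1 (a distribution over indices)
    query -- hypothetical oracle mapping {index: weight} to an estimate of
             sum_i weight_i * w_i for the hidden vector w
    """
    active = [i for i, weight in enumerate(h) if weight > 0]
    while len(active) > 1:
        mid = len(active) // 2
        best_half, best_estimate = None, None
        for half in (active[:mid], active[mid:]):
            mass = sum(h[i] for i in half)
            conditional = {i: h[i] / mass for i in half}  # renormalize within the half
            estimate = query(conditional)
            if best_estimate is None or estimate > best_estimate:
                best_half, best_estimate = half, estimate
        active = best_half  # keep the half with the larger estimated average
    return active[0]

# Hypothetical usage with an exact (noise-free) oracle.
w = [0.1, 0.9, 0.2, 0.3]
exact_query = lambda weights: sum(weight * w[i] for i, weight in weights.items())
index = find_large_coordinate([0.25] * 4, exact_query)
```

With queries accurate to within an additive error $\eta$, each round can lose at most $2\eta$ relative to the previous average, so after about $\log_2 n$ rounds the selected coordinate is within $2\eta \log_2 n$ of the initial weighted average. This is the trade-off mentioned above: logarithmically many queries, but each needs higher accuracy.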
4 A static algorithm
Uniform convergence.
Before we describe the main result in this section we recall some basic facts from statistical learning theory that will be useful. Let be a class of functions from . We say that has uniform convergence rate of (at most) if for every distribution over and every ,
It is well known that if is a class of functions with VC dimension then its uniform convergence rate is (Vapnik and Chervonenkis, 1971).
Lemma 13.
Let be classes with VC dimension at most . Then, the VC dimension of is at most .
Proof.
We show that does not shatter a set of size . Let of size . Indeed, by the Sauer-Shelah Lemma (Sauer, 1972):
where is the binary entropy function, and the second to last inequality follows by a standard upper bound on the binomial coefficients by the entropy function: for every . ∎
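The entropy bound on binomial sums invoked here can be checked numerically. The exact form used in the proof is not visible in the text, so the sketch below verifies the textbook version, stated as an assumption: for integers $0 \leq d \leq n/2$, $\sum_{i \leq d} \binom{n}{i} \leq 2^{n \cdot h(d/n)}$, where $h$ is the binary entropy function.

```python
import math

def binary_entropy(x):
    """Binary entropy h(x) in bits, with h(0) = h(1) = 0."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def binomial_prefix_sum(n, d):
    """Number of subsets of an n-element set with at most d elements."""
    return sum(math.comb(n, i) for i in range(d + 1))

def entropy_bound(n, d):
    """The upper bound 2^(n * h(d / n))."""
    return 2.0 ** (n * binary_entropy(d / n))

def bound_holds(max_n=60):
    """Check the inequality for all 1 <= n <= max_n and 0 <= d <= n / 2."""
    return all(
        binomial_prefix_sum(n, d) <= entropy_bound(n, d) * (1 + 1e-9)
        for n in range(1, max_n + 1)
        for d in range(n // 2 + 1)
    )

ok = bound_holds()
```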
We next present the main result in this section which is an algorithm which achieves factor whose sample complexity is . It is conceptually simpler than the adaptive algorithms from the previous section (although the proof here is more technical). Specifically, it is based on finding a set of functions which satisfies two properties:

Given some samples from , one can estimate up to an additive error, with probability at least (where the probability is over the samples from ). In particular this means that the distance vector of with respect to can be estimated from this many samples.

and have the same distance vectors, i.e. .
Using these two items the algorithm proceeds as follows: it uses the first item to estimate up to an additive . Then, it uses the second item (by which ) to find such that and outputs it. Lemma 3 then implies that as required.
Theorem 14.
Let . Then there exists a class of functions from to such that:

, and

The VC dimension of is at most (in particular, the uniform convergence rate of is some ).
Construction of .
Consider the Yatracos functions that are defined by if and only if , and define
The class is defined by
See Figure 3 for an illustration of a function in .
Theorem 14 follows from the next two lemmas (Lemma 16 implies that via Corollary 6).
Lemma 15.
has VC dimension at most .
Lemma 16.
For every
Proof of Lemma 15.
We claim that the VC dimension of each is at most ; this will finish the proof by Lemma 13. To see that has VC dimension at most , we show that its sign-rank (defined below) is at most . This implies the bound on the VC dimension, since the VC dimension is at most the sign-rank (see e.g. Alon et al. (2016)).
The sign-rank of is the minimal such that there is a representation of using dimensional vectors so that each corresponds to a dimensional halfspace. Formally, if there is a mapping such that for every there is such that if and only if .
To see that the sign-rank of is at most consider the mapping
For every with pick where the first coordinates of are the ’s for , and the last coordinate is . The halfspace defined by indeed corresponds to :
∎
Proof of Lemma 16.
Lemma 16 follows by a careful inspection of the vertices of . This inspection involves a somewhat technical analysis of the solutions of a related linear program. We provide the proof in Appendix A. ∎
4.1 Proofs of Theorem 8 and Theorem 9
Theorem 8 and Theorem 9 follow from Theorem 10 and Theorem 14, combined with results in Adaptive Data Analysis. We refer the reader to the survey by Dwork et al. (2015) for a detailed introduction.
First, the bound in Theorem 8 is a direct corollary of the static algorithm from the previous section (see the discussion prior to Theorem 14’s statement). The second bound in Theorem 8 and the bound in Theorem 9 follow from the two adaptive algorithms in Theorem 10, as we explain next.
In order for Algorithms to output a valid distribution , it is required that all of the statistical queries they use are answered with the desired accuracy. Recall that uses queries and requires accuracy of per query and that uses queries and requires accuracy of per query. To achieve this, one needs to draw enough samples from the target distribution that suffice for a good-enough estimate. A natural way is to estimate each of the statistical queries by its empirical average. However, since the algorithm is adaptive (i.e. the choice of the statistical query used in iteration depends on the previous queries and their estimates), this may require a large number of samples from . In particular, there are settings in which if one uses the empirical averages as estimates then samples are needed in order to answer adaptive queries. Luckily, the domain of Adaptive Data Analysis has developed clever estimates which achieve significant reductions in the sample complexity. In a nutshell, the idea is to return a noisy version of the empirical averages, and the high-level intuition is that the noise stabilizes this random process and hence makes it more concentrated.
We will use the following results due to Bassily et al. (2016), which improve upon results from Dwork et al. (2015).
Theorem 17 (Infinite domain, Corollary 6.1 in Bassily et al. (2016)).
Let be the target distribution. Then, there is a mechanism that given samples from , answers adaptive statistical queries such that with probability at least each of the provided estimates is accurate, and
Theorem 18 (Finite domain, Corollary 6.3 in Bassily et al. (2016)).
Let be the target distribution. Then, there is a mechanism that given samples from , answers adaptive statistical queries such that with probability at least each of the provided estimates is accurate, and
Algorithm combined with Theorem 17 yields the dependence in Theorem 8, and combined with Theorem 18 yields Theorem 9.
5 Lower bounds
As discussed in the introduction, any finite can be properly learned by Yatracos’ algorithm. We show that is optimal:
Theorem 19 (Lower bound for infinite domains).
For every there is a class of two densities such that the following holds. Let be a proper learning algorithm for and let be a sample complexity bound. Then there exists a target distribution such that and if gets at most samples from as an input then
with probability at least .
Thus is the threshold for proper learning: given any , by picking the above theorem yields a lower bound of .
Theorem 20 (Lower bound for finite domains).
Let be the domain of size. Then, for every there is a class of two densities such that the following holds. Let be a proper learning algorithm for . Then there exists a target distribution such that and if gets at most samples from as an input then
with probability at least .
We will make use of the following lemma, which is a simple generalization of Le Cam’s Lemma (see Yu (1997), Lemma 1).
Lemma 21.
Let and be two families of probability distributions, and let denote the distribution obtained by sampling (assuming some given fixed distribution over ) and then drawing independent samples from . Consider an algorithm (which can be randomized) that determines, given i.i.d. examples from some , whether or . Then the probability that such an algorithm makes a mistake is lower bounded by
Proof.
We first assume that the algorithm is deterministic. Any deterministic algorithm deciding whether comes from or is associated with a set (the set such that if the sample falls in it, it decides , and otherwise). The worst-case probability of the algorithm to err is given by
which can be lower bounded by the expectation under first choosing between and with probability and then picking