Density estimation or distribution learning
refers to the problem of estimating the unknown probability density function of a common source of independent sample observations. In any interesting case, we know that the unknown source density may come from a known class. In theparametric
case, each density in this class can be specified using a bounded number of real parameters, the class of all Gaussian densities with any mean and any variance. The remaining cases are callednonparametric. Examples of nonparametric classes include bounded monotone densities on , -Lipschitz densities for a given constant , and log-concave densities, to name a few. By minimax estimation, we mean density estimation in the minimax sense, we are interested in the existence of a density estimate which minimizes its approximation error, even in the worst case.
There is a long line of work in the statistics literature about density estimation, and a growing interest coming from the theoretical computer science and machine learning communities; for a selection of new and old books on this topic, see[10, 11, 13, 18, 26, 27]. The study of nonparametric density estimation began as early as in the 1950’s, when Grenander  described and studied properties of the maximum likelihood estimate of an unknown density taken from the class of bounded monotone densities on . This estimator and this class received much further treatment over the years, in particular by Prakasa Rao , Groeneboom , and Birgé [6, 7, 8], who identified the optimal -error minimax rate, up to a constant factor. Since then, countless more nonparametric classes have been studied, and many different all-purpose methods have been developed to obtain minimax results about these classes: for the construction of density estimates, see the maximum likelihood estimate, skeleton estimates, kernel estimates, and wavelet estimates, to name a few; and for minimax rate lower bounds, see the methods of Assouad, Fano, and Le Cam [10, 11, 13, 29].
One very popular style of density estimate is the histogram, in which the support of the random data is partitioned into bins, where each bin receives a weight proportional to the number of data points contained within, and such that the estimate is constant with the given weight along each bin. Then, the selection of the bins themselves becomes critical in the construction of a good histogram estimate. Birgé  showed how histograms with carefully chosen exponentially increasing bin sizes will have -error within a constant factor of the optimal minimax rate for the class of bounded non-increasing densities on . In general, the right choice of an underlying partition for a histogram estimate is not obvious.
In this work, we devise a recursive data-based approach for determining the partition of the support for a histogram estimate of discrete non-increasing densities. We also use a similar approach to build a piecewise-linear estimator for discrete non-increasing convex densities—see Anevski , Jongbloed , and Groeneboom, Jongbloed, and Wellner  for works concerning the maximum likelihood and minimax estimation of continuous non-increasing convex densities. Both of our estimators are minimax-optimal, their minimax -error is within a constant factor of the optimal rate. Recursive data-based partitioning schemes have been extremely popular in density estimation since the 1970’s with Gessaman , Chen and Zhao , Lugosi and Nobel 
, and countless others, with great interest coming from the machine learning and pattern recognition communities. Still, it seems that most of the literature involving recursive data-based partitions are not especially concerned with the rate of convergence of density estimates, but rather other properties such as consistency under different recursive schemes. Moreover, most of the density estimation literature is concerned with the estimation of continuous probability distributions. In discrete density estimation, not all of the constructions or methods used to develop arguments for analogous continuous classes will neatly apply, and in some cases, there are discrete phenomena that call for a different approach.
2 Preliminaries and summary
Let be a given class of probability densities with respect to a base measure on the measurable space , and let . If
is a random variable taking values in, we write to mean that
The notation means that for each , and that are mutually independent.
Typically in density estimation, either , is the Borel -algebra, and is the Lebesgue measure, or is countable, , and is the counting measure. The former case is referred to as the continuous setting, and the latter case as the discrete setting, where is more often called a probability mass function in the literature. Throughout this paper, we will only be concerned with the discrete setting, and even so, we still refer to as a class of densities, and as a density itself. Plainly, in this case, signifies that
Let be unknown. Given the samples , our goal is to create a density estimate
such that the probability measures corresponding to and are close in total variation (TV) distance, where for any probability measures , their TV-distance is defined as
The TV-distance has several equivalent definitions; importantly, if and are probability measures with corresponding densities and , then
where for any function , we define the -norm of as
(In the continuous case, this sum is simply replaced with an integral.) In view of the relation between TV-distance and -norm in (LABEL:l1-interp), we will abuse notation and write
There are various possible measures of dissimilarity between probability distributions which can be considered in density estimation, the Hellinger distance, Wasserstein distance, -distance,
-divergence, Kullback-Leibler divergence, or any number of other divergences; see Sason and Verdú for a survey on many such functions and the relations between them. Here, we focus on the TV-distance due to its several appealing properties, such as being a metric, enjoying the natural probabilistic interpretation of (LABEL:prob-interp), and having the coupling characterization (LABEL:coupling-interp).
If is a density estimate, we define the risk of the estimator with respect to the class as
where the expectation is over the i.i.d. samples from , and possible randomization of the estimator. From now on we will omit the dependence of on unless it is not obvious. The minimax risk or minimax rate for is the smallest risk over all possible density estimates,
We can now state our results precisely. Let and let be the class of non-increasing densities on
, set of of all probability vectorsfor which
monotone-result For sufficiently large ,
Let be the class of all non-increasing convex densities on , so each satisfies (LABEL:decr) and
convex-result For sufficiently large ,
We emphasize here that the above results give upper and lower bounds on the minimax rates and which are within universal constant factors of one another, for the entire range of and .
Our upper bounds will crucially rely on the next results, which allow us to relate the minimax rate of a class to an old and well-studied combinatorial quantity called the Vapnik-Chervonenkis (VC) dimension : For a family of subsets of , the VC-dimension of , denoted by , is the size of the largest set such that for every , there exists such that . See, the book of Devroye and Lugosi  for examples and applications of the VC-dimension in the study of density estimation.
[Devroye, Lugosi ]minimum-distance-risk Let be a class of densities supported on , and let be a class of density estimates satisfying for every . Let be the Yatracos class of ,
For , let be the probability measure corresponding to . Let also be the empirical measure based on , where
Then, there is an estimate for which
The estimate in minimum-distance-risk is called the minimum distance estimate in —we omit the details of its construction, though we emphasize that if computing takes one unit of computation for any and , then selecting takes time polynomial in the size of , which is often exponential in the quantities of interest.
[Devroye, Lugosi ]tv-emp Let be as in minimum-distance-risk, and let . Then, there is a universal constant for which
The quantity in tv-emp is precisely equal to if is the Borel -algebra on .
minimum-distance-risk Let be as in minimum-distance-risk. Then, there is a universal constant for which
3 Non-increasing densities
This section is devoted to presenting a proof of the upper bound of monotone-result. The lower bound is proved in monotone-lower using a careful but standard application of Assouad’s Lemma . Part of our analysis in proving monotone-result will involve the development of an explicit efficient estimator for a density in .
3.1 A greedy tree-based estimator
Suppose that is a power of two. This assumption can only, at worst, smudge some constant factors in the final minimax rate. Using the samples , we will recursively construct a rooted ordered binary tree which determines a partition of the interval , from which we can build a histogram estimate for . Specifically, let be the root of , where . We say that covers the interval . Then, for every node in covering the interval
we first check if , and if so we make a leaf in . Otherwise, if
are the first and second halves of , we verify the condition
where are the number of samples which fall into the intervals ,
The inequality (LABEL:greedy-splitting-rule) is referred to as the greedy splitting rule. If (LABEL:greedy-splitting-rule) is satisfied, then create nodes covering and respectively, and add them to as left and right children of . If not, make a leaf in .
After applying this procedure, one obtains a (random) tree with leaves , and the set forms a partition of the support . Let be the histogram estimate based on this partition,
The density estimate will be called the greedy tree-based estimator. See greedy for a typical plot of , and a visualization of the tree .
Intuitively, we justify the rule (LABEL:greedy-splitting-rule) as follows: We expect that is at least as large as by monotonicity of the density , and the larger the difference , the finer a partition around and should be to minimize the error of a piecewise constant estimate of . However, even if and were equal in expectation, we expect with positive probability that may deviate from
on the order of a standard deviation, on the order of, and this determines the threshold for splitting.
One could argue that any good estimate of a non-increasing density should itself be non-increasing, and the estimate does not have this property. This can be rectified using a method of Birgé , who described a transformation of piecewise-constant density estimates which does not increase risk with respect to non-increasing densities. Specifically, suppose that the estimate is not non-increasing. Then, there are consecutive intervals , such that has constant value on and on , and . Let the transformed estimate be constant on , with value
the average value of on . Iterate the above transformation until a non-increasing estimate is obtained. It can be proven that this results in a unique estimate , regardless of the order of merged intervals, and that
3.2 An idealized tree-based estimator
Instead of analyzing the greedy tree-based estimator of the preceding section, we will fully analyze an idealized version. Indeed, in (LABEL:greedy-splitting-rule), the quantities are distributed as for , where we define
If we replace the quantities in (LABEL:greedy-splitting-rule) with their expectations, we obtain the idealized splitting rule
where we note that , since is non-increasing. Using the same procedure as in the preceding section, replacing the splitting rule with (LABEL:id-splitting-rule), we obtain a deterministic tree with leaves , and we set
is constant and equal to the average value of on each interval for . We call the idealized tree-based estimate. See ideal for a visualization of and . It may be instructive to compare ideal to greedy.
Of course, and both depend intimately upon knowledge of the density ; in practice, we only have access to the samples , and the density itself is unknown. In particular, we cannot practically use as an estimate for unknown . Importantly, as we will soon show, we can still use along with minimum-distance-risk to get a minimax rate upper bound for .
Writing out the TV-distance explicitly, we have
Let , and define . If , then , so assume that . In this case, let and be the left and right halves of the interval . In order to compute , we refer to biasarea. We view as the positive area between the curve and the line ; in biasarea, this is the patterned area. Let and be the average value of on and respectively. Then, is the positive area between and on , which is represented as the gray area on in biasarea, and is the positive area between and on , the gray area on in biasarea. The points are the unique points for which
Since the gray area on to the left of is equal to the gray area on to the right of , and the gray area on to the left of is equal to the gray area on to the right of , then
where this last line follows from the splitting rule (LABEL:id-splitting-rule), since and . So,
where the last line follows from the Cauchy-Schwarz inequality. ∎
id-num-leaves If and , then
Note that has height at most . Let be the set of nodes at depth in which have at least one leaf as a child, for , and label the children of the nodes in in order of appeareance from right to left in as . Since none of the nodes in are themselves leaves, then by (LABEL:id-splitting-rule),
and in particular since , then so that . In general,
and this recurrence relation can be solved to obtain that
Let be the set of leaves at level in . The leaves at level form a subsequence of . Write for the total probability mass of held in the leaves ,
Summing over all leaves except on the last level, and using the facts that and ,
By Hölder’s inequality,
Proof of the upper bound in monotone-result.
The case is trivial, and follows simply because the TV-distance is always upper bounded by .
Suppose next that . In this regime, we can use a histogram estimator for with bins of size for each element of . It is well known that risk of this estimator is on the order of .
Finally, suppose that . Let be the class of all piecewise-constant probability densities on which have parts; in particular, . Let be the Yatracos class of ,
Then, , where is the class of all unions of intervals in . It is well known that , so . By minimum-distance-risk and id-tv, there are univeral constants for which
By id-num-leaves, we see that for sufficiently large , there is a universal constant such that
4 Non-increasing convex densities
Recall that is the class of non-increasing convex densities supported on . Then, forms a subclass of , which we considered in monotone. This section is devoted to extending the techniques of monotone in order to obtain a minimax rate upper bound on . Again, the lower bound is proved using standard techniques in convex-lower.
In this section, we assume that is a power of three. In order to prove the upper bound of convex-result, we construct a ternary tree just as in monotone, now with a ternary splitting rule, where if a node has children in order from left to right, we split and recurse if
Here we obtain a tree with leaves . If has children from left to right, let be the midpoint of for . Let the estimate on to be composed of the secant line passing through the points and . The estimate is again called the idealized tree-based estimate for . conv-id-tv
Before proving this, we first note that by convexity of , the slope of the secant line passing through and is at least the slope of the secant line passing through and . Equivalently,
Proof of conv-id-tv.
As in the proof of id-tv, we have
for . We refer to convexarea for a visualization of the quantity , which is depicted as the patterned area.
Let , , and denote the gray area on , , and respectively. It can be shown that
The result then follows from the splitting rule (LABEL:convex-splitting-rule) and the Cauchy-Schwarz inequality. ∎
If and , then
The tree has height at most . Let be the set of nodes at depth in with at least one leaf as a child, for , labelled in order of appearance from right to left in as . By the convex splitting rule (LABEL:convex-splitting-rule), and since is non-increasing,
so in particular, , and . In general,
We claim now that , which we prove by induction; the base case is shown above, and by the induction hypothesis,
for all , while the cases can be manually verified. Then, by monotonicity of ,
Let now be the set of leaves at level in . The leaves at level form a subsequence of . Let be the total probability mass of held in the leaves . By (LABEL:concave-prob-lower),
Summing over all leaves except on the last level,