Density estimation for shift-invariant multidimensional distributions

11/09/2018 · by Anindya De, et al. · Google, Columbia University, Northwestern University

We study density estimation for classes of shift-invariant distributions over R^d. A multidimensional distribution is "shift-invariant" if, roughly speaking, it is close in total variation distance to a small shift of it in any direction. Shift-invariance relaxes smoothness assumptions commonly used in non-parametric density estimation to allow jump discontinuities. The different classes of distributions that we consider correspond to different rates of tail decay. For each such class we give an efficient algorithm that learns any distribution in the class from independent samples with respect to total variation distance. As a special case of our general result, we show that d-dimensional shift-invariant distributions which satisfy an exponential tail bound can be learned to total variation distance error ϵ using Õ_d(1/ϵ^(d+2)) examples and Õ_d(1/ϵ^(2d+2)) time. This implies that, for constant d, multivariate log-concave distributions can be learned in Õ_d(1/ϵ^(2d+2)) time using Õ_d(1/ϵ^(d+2)) samples, answering a question of [Diakonikolas, Kane and Stewart, 2016]. All of our results extend to a model of noise-tolerant density estimation using Huber's contamination model, in which the target distribution to be learned is a (1-ϵ,ϵ) mixture of some unknown distribution in the class with some other arbitrary and unknown distribution, and the learning algorithm must output a hypothesis distribution with total variation distance error O(ϵ) from the target distribution. We show that our general results are close to best possible by proving a simple Ω(1/ϵ^d) information-theoretic lower bound on sample complexity, even for learning bounded distributions that are shift-invariant.


1 Introduction

In multidimensional density estimation, an algorithm has access to independent draws from an unknown target probability distribution over R^d, which is typically assumed to belong to or be close to some class of “nice” distributions. The goal is to output a hypothesis distribution which with high probability is close to the target distribution. A number of different distance measures can be used to capture the notion of closeness; in this work we use the total variation distance (also known as the “statistical distance,” and equivalent, up to a factor of 2, to the L_1 distance). This is a well-studied framework which has been investigated in detail; see e.g. the books (DG85, ; devroye2012combinatorial, ).

Multidimensional density estimation is typically attacked in one of two ways. In the first general approach a parameterized hypothesis class is chosen, and a setting of parameters is chosen based on the observed data points. This approach is justified given the belief that the parameterized class contains a good approximation to the distribution generating the data, or even that the parameterized class actually contains the target distribution. See (Dasgupta:99, ; KMV:10, ; MoitraValiant:10, ) for some well-known multidimensional distribution learning results in this line.

In the second general approach a hypothesis distribution is constructed by “smoothing” the empirical distribution with a kernel function. This approach is justified by the belief that the target distribution satisfies some smoothness assumptions, and is more appropriate when studying distributions that do not have a parametric representation. The current paper falls within this second strand.
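As a concrete (and entirely standard) illustration of this second approach, the sketch below builds a Gaussian kernel density estimate in R^d from samples; it is our own minimal example rather than anything from the paper, and the bandwidth value is an arbitrary placeholder rather than a tuned choice.

```python
import numpy as np

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density by smoothing the
    empirical distribution with a Gaussian kernel of width `bandwidth`."""
    samples = np.asarray(samples, dtype=float)   # shape (n, d)
    n, d = samples.shape
    norm = (2 * np.pi * bandwidth ** 2) ** (d / 2)

    def density(x):
        # Average of Gaussian bumps centered at the observed points.
        diffs = samples - np.asarray(x, dtype=float)
        sq = np.sum(diffs ** 2, axis=1)
        return np.mean(np.exp(-sq / (2 * bandwidth ** 2))) / norm

    return density

# Usage: estimate a 2-d density from 1000 draws of a standard normal.
rng = np.random.default_rng(0)
f_hat = gaussian_kde(rng.standard_normal((1000, 2)), bandwidth=0.3)
print(f_hat(np.zeros(2)))  # roughly 1/(2*pi) ~ 0.159
```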

The most popular smoothness assumption is that the distribution has a density that belongs to a Sobolev space (sobolev1963theorem, ; barron1991minimum, ; holmstrom1992asymptotic, ; devroye2012combinatorial, ). The simplest Sobolev space used in this context corresponds to having a bound on the average of the partial first “weak derivatives” of the density; other Sobolev spaces correspond to bounding additional derivatives. A drawback of this approach is that it does not apply to distributions whose densities have jump discontinuities. Such jump discontinuities can arise in various applications, for example, when objects under analysis must satisfy hard constraints.

To address this, some authors have used the weaker assumption that the density belongs to a Besov space (besov1959family, ; devore1993besov, ; masry1997multivariate, ; willett2007multiscale, ; acharya2017sample, ). In the simplest case, this allows jump discontinuities as long as the function does not change very fast on average. The precise definition, which is quite technical (see devore1993besov ), makes reference to the effect on a distribution of shifting the domain by a small amount.

The densities we consider. In this paper we analyze a clean and simple smoothness assumption, which is a continuous analog of the notion of shift-invariance that has recently been used for analyzing the learnability of various types of discrete distributions (barbour1999poisson, ; daskalakis2013learning, ; DLS18asums, ). The assumption is based on the shift-invariance of a density in a given direction at a given scale, which, for a density over R^d, a unit vector, and a positive real value, we define precisely in Definition 2 below. We define the shift-invariance of a density at a given scale to be the worst case of this quantity over all directions.

For any constant, we define the corresponding class of densities to consist of all d-dimensional densities with the property that this worst-case shift-invariance is suitably bounded at every scale.

Our notion of shift-invariance provides a quantitative way of capturing the intuition that the density changes gradually on average in every direction. Several natural classes fit nicely into this framework; for example, we note that d-dimensional standard normal distributions are easily shown to belong to such a class. As another example, we will show later that any d-dimensional isotropic log-concave distribution belongs to one as well.
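To make the notion concrete, the following small sketch (ours, not from the paper, and using only the underlying quantity rather than the paper's exact normalization) numerically estimates the total variation distance between a one-dimensional density and a shifted copy of it. Note that a density with a jump discontinuity can still move only slightly under a small shift.

```python
import numpy as np

def tv_distance_of_shift(density, shift, lo=-10.0, hi=10.0, n=4001):
    """Estimate d_TV(f, f(. - shift)) for a 1-d density f by numerical
    integration: (1/2) * integral of |f(x) - f(x - shift)| dx."""
    xs = np.linspace(lo, hi, n)
    diffs = np.abs(density(xs) - density(xs - shift))
    return 0.5 * np.trapz(diffs, xs)

# A smooth density (standard normal) changes little under a small shift ...
normal = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
print(tv_distance_of_shift(normal, 0.1))   # small, roughly 0.04

# ... and a density with a jump discontinuity can still change slowly:
# the uniform density on [0, 1] moves by about `shift` in total variation.
uniform = lambda x: ((x >= 0) & (x <= 1)).astype(float)
print(tv_distance_of_shift(uniform, 0.1))  # roughly 0.1
```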

Many distributions arising in practice have light tails, and distributions with light tails can in general be learned more efficiently. To analyze learning shift-invariant distributions in a manner that takes advantage of light tails when they are available, while accommodating heavier tails when necessary, we define classes with different combinations of shift-invariance and tail behavior. Given a nonincreasing tail bound function g (satisfying a mild condition made precise in Section 2), we define the class of densities to consist of those shift-invariant densities with the additional property that, for every radius, the probability mass lying farther than that radius from the mean of the distribution is at most the value of g at that radius.

As motivation for its study, we feel that this is a simple and easily understood class that exhibits an attractive tradeoff between expressiveness and tractability. As we show, it is broad enough to include distributions of central interest such as multidimensional isotropic log-concave distributions, but it is also limited enough to admit efficient noise-tolerant density estimation algorithms.

Our density estimation framework. We recall the standard notion of density estimation with respect to total variation distance. Given a class of densities over R^d, a density estimation algorithm for the class is given access to i.i.d. draws from the unknown target density to be learned. Given any accuracy parameter, after making some number of draws (which may depend on the accuracy parameter and the class), the density estimation algorithm must output a description of a hypothesis density over R^d which, with high probability over the draws, is within the prescribed total variation distance of the target. It is of interest both to bound the sample complexity of such an algorithm (the number of draws that it makes) and its running time.

Our learning results will hold even in a challenging model of noise-tolerant density estimation for a class. In this framework, the density estimation algorithm is given access to i.i.d. draws from a mixture in which a (1-ϵ) fraction of the mass comes from some unknown density in the class and the remaining ϵ fraction comes from an arbitrary and unknown density. (We will sometimes say that such a mixture is an ϵ-corrupted version of the underlying density from the class. This model of noise is sometimes referred to as Huber's contamination model (huber1967behavior, ).) Now the goal of the density estimation algorithm is to output a description of a hypothesis density which, with probability at least (say) 9/10 over the draws, has total variation distance error O(ϵ) from the target. This is a challenging variant of the usual density estimation framework, especially for multidimensional density estimation. In particular, there are simple distribution learning problems (such as learning a single Gaussian or product distribution) which are essentially trivial in the noise-free setting, but for which computationally efficient noise-tolerant learning algorithms have proved to be a significant challenge (DKKLMNS16, ; DKKLMS18, ; Steinhardt18, ).
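To make the noise model concrete, here is a minimal sketch (our illustration, not from the paper) of how draws from an ϵ-corrupted distribution arise under Huber's contamination model: each draw independently comes from the target density with probability 1-ϵ and from an arbitrary noise distribution with probability ϵ. The specific distributions and parameter values below are placeholder assumptions.

```python
import numpy as np

def corrupted_draws(sample_clean, sample_noise, eps, n, rng):
    """Draw n points from the mixture (1 - eps) * p + eps * q, where
    sample_clean / sample_noise draw one point from p / q respectively."""
    draws = []
    for _ in range(n):
        if rng.random() < eps:
            draws.append(sample_noise(rng))   # arbitrary / adversarial noise
        else:
            draws.append(sample_clean(rng))   # the distribution to be learned
    return np.array(draws)

# Usage: a 2-d standard normal contaminated by far-away noise points.
rng = np.random.default_rng(1)
data = corrupted_draws(
    sample_clean=lambda r: r.standard_normal(2),
    sample_noise=lambda r: r.standard_normal(2) + 50.0,
    eps=0.05, n=1000, rng=rng)
print(data.shape)  # (1000, 2)
```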

1.1 Results

Our main positive result is a general algorithm which efficiently learns any class in the noise-tolerant model described above. Given a constant and a tail bound , we show that any distribution in the class can be noise-tolerantly learned to any error with a sample complexity that depends on and . The running time of our algorithm is roughly quadratic in the sample complexity, and the sample complexity is (see Theorem 30 in Section 5 for a precise statement of the exact bound). These bounds on the number of examples and running time do not depend on which member of is being learned.

To illustrate the power of our analysis, we note that our positive results straightforwardly imply that the class of d-dimensional log-concave distributions can be learned with the time and sample complexity described above, solving an open problem posed by Diakonikolas et al. (diakonikolas2016learning, ).

Application: Learning multivariate log-concave densities. A multivariate density function over is said to be log-concave if there is an upper semi-continuous concave function such that for all . Log-concave distributions arise in a range of contexts and have been well studied; see (CDSS13, ; CDSS14, ; acharya2017sample, ; AcharyaDK15, ; CanonneDGR16, ; DKS16a, ) for work on density estimation of univariate (discrete and continuous) log-concave distributions. In the multivariate case, (KimSamworth14, ) gave a sample complexity lower bound (for squared Hellinger distance) which implies that samples are needed to learn -dimensional log-concave densities to error . More recently, (diakonikolas2016learning, ) established the first finite sample complexity upper bound for multivariate log-concave densities, by giving an algorithm that semi-agnostically (i.e. noise-tolerantly in a very strong sense) learns any -dimensional log-concave density using samples. The algorithm of (diakonikolas2016learning, ) is not computationally efficient, and indeed, Diakonikolas et al. ask if there is an algorithm with running time polynomial in the sample complexity, referring to this as “a challenging and important open question.” A subsequent (and recent) work of Carpenter et al. (carpenter2018, ) showed that the maximum likelihood estimator (MLE) is statistically efficient (i.e., achieves near optimal sample complexity). However, we note that the MLE is computationally inefficient and thus has no bearing on the question of finding an efficient algorithm for learning log-concave densities.

We show that multivariate log-concave densities can be learned in polynomial time as a special case of our main algorithmic result. We establish that any -dimensional log-concave density is -shift-invariant. Together with well-known tail bounds on -dimensional log-concave densities, this easily yields that any -dimensional log-concave density belongs to where the tail bound function is inverse exponential. Theorem 30 then immediately implies the following, answering the open question of (diakonikolas2016learning, ):

Theorem 1.

There is an algorithm with the following property: Let be an unknown log-concave density over whose covariance matrix has full rank, and let be an -corruption of . Given any error parameter and confidence parameter and access to independent draws from , the algorithm with probability outputs a hypothesis density such that . The algorithm runs in time and uses many samples.

While our sample complexity is quadratically larger than the optimal sample complexity for learning log-concave distributions (from (diakonikolas2016learning, )), such computational-statistical tradeoffs are in fact quite common (see, for example, the work of bhaskara2015sparse, which gives a faster algorithm for learning Gaussian mixture models by using more samples).

A lower bound. We also prove a simple lower bound, showing that any algorithm that learns shift-invariant d-dimensional densities with bounded support to error ϵ must use Ω(1/ϵ^d) examples. These densities may be thought of as satisfying the strongest possible rate of tail decay, as they have zero tail mass outside of a bounded region (corresponding to a tail bound function that is zero beyond some absolute constant). This lower bound shows that a sample complexity of at least Ω(1/ϵ^d) is necessary even for very structured special cases of our multivariate density estimation problem.

1.2 Our approach

For simplicity, and because it is a key component of our general algorithm, we first describe how our algorithm learns an ϵ-error hypothesis when the target distribution is shift-invariant and also has bounded support: all its mass is on points in an origin-centered ball of bounded radius.

In this special case, analyzed in Section 3

, our algorithm has two conceptual stages. First, we smooth the density that we are to learn through convolution – this is done in a simple way by randomly perturbing each draw. This convolution uses a kernel that damps the contributions to the density coming from high-frequency functions in its Fourier decomposition; intuitively, the shift-invariance of the target density ensures that the convolved density (which is an average over small shifts of the original density) is close to the original density. In the second conceptual stage, the algorithm approximates relatively few Fourier coefficients of the smoothed density. We show that an inverse Fourier transformation using this approximation still provides an accurate approximation to the target density.

We note that a simpler version of this approach, which only uses a smoothing kernel and does not employ Fourier analysis, can be shown to give similar, but quantitatively worse, results, such as a larger sample complexity when the density is zero outside of a bounded region. However, that sample complexity is worse than the lower bound by a quadratic factor, whereas our algorithm essentially achieves the optimal sample complexity.
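The first of the two stages above, convolving the unknown density with a smoothing kernel, can be carried out directly on samples: adding an independent draw of the kernel to each data point yields a draw from the convolved density. The sketch below illustrates this with a Gaussian standing in for the paper's compactly supported mollifier; the kernel choice and its scale are illustrative assumptions only.

```python
import numpy as np

def smoothed_draws(draws, kernel_scale, rng):
    """Given draws from an unknown density p, return draws from the
    convolution p * K, where K is the smoothing kernel (here a Gaussian).
    Convolution of densities corresponds to adding independent samples."""
    draws = np.asarray(draws, dtype=float)
    noise = kernel_scale * rng.standard_normal(draws.shape)
    return draws + noise

rng = np.random.default_rng(2)
x = rng.exponential(size=(500, 3))                         # stand-in for samples from p
x_smooth = smoothed_draws(x, kernel_scale=0.05, rng=rng)   # samples from p * K
print(x_smooth.shape)  # (500, 3)
```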

Next, in Section 4, we consider the more general case in which the target distribution belongs to the class (so at this point we are not yet in the noise-tolerant framework). Here the high-level idea of our approach is very straightforward: it is essentially to reduce to the simpler special case (of bounded support and good shift-invariance in every direction) described above. (A crucial aspect of this transformation algorithm is that it uses only a small number of draws from the original shift-invariant distribution; we return to this point below.) We can then use the algorithm for the special case to obtain a high-accuracy hypothesis, and perform the inverse transformation to obtain a high-accuracy hypothesis for the original general distribution. We remark that while the conceptual idea is thus very straightforward, there are a number of technical challenges that must be met to implement this approach. One of these is that it is necessary to truncate the tails of the original distribution so that an affine transformation of it will have bounded support, and doing this changes the shift-invariance of the original distribution. Another is that the transformation procedure only succeeds with non-negligible probability, so we must run this overall approach multiple times and perform hypothesis selection to actually end up with a single high-accuracy hypothesis.

In Section 5, we consider the most general case of noise-tolerant density estimation. Recall that in this setting the target density is some distribution which need not actually belong to the class but is an ϵ-corruption of some density that does. It turns out that this case can be handled using essentially the same algorithm as the previous paragraph. We show that even in the noise-tolerant setting, our transformation algorithm will still successfully find a transformation as above that would succeed if the target density were the uncorrupted density rather than its corruption. (This robustness of the transformation algorithm crucially relies on the fact that it only uses a small number of draws from the given distribution to be learned.) We then show that after transforming in this way, the original algorithm for the special case can in fact learn the transformed version of the uncorrupted density to high accuracy; then, as in the previous paragraph, performing the inverse transformation gives a high-accuracy hypothesis for the target.

In Section 6 we apply the above results to establish efficient noise-tolerant learnability of log-concave densities over . To apply our results, we need to have (i) bounds on the rate of tail decay, and (ii) shift-invariance bounds. As noted earlier, exponential tail bounds on -dimensional log-concave densities are well known, so it remains to establish shift-invariance. Using basic properties of log-concave densities, in Section 6 we show that any -dimensional isotropic log-concave density is -shift-invariant. Armed with this bound, by applying our noise-tolerant learning result (Theorem 30) we get that any -dimensional isotropic log-concave density can be noise-tolerantly learned in time , using samples. Log-concave distributions are shift-invariant even if they are only approximately isotropic. We show that general log-concave distributions may be learned by bringing them into approximately isotropic position with a preprocessing step, borrowing techniques from LovaszVempala07 .

The lower bound. As is standard, our lower bound (proved in Section 7) is obtained via Fano’s inequality. We identify a large set of bounded-support shift-invariant d-dimensional densities with the following two properties: all pairs of densities from the set have KL-divergence that is not too big (so that they are hard to tell apart), but also have total variation distance that is not too small (so that a successful learning algorithm is required to tell them apart). The members of the set are obtained by choosing functions that take one of two values in each cell of a d-dimensional checkerboard. The two possible values are within a small constant factor of each other, which keeps the KL divergence small. To make the total variation distance large, we choose the values using an error-correcting code – this means that distinct members of the set have different values on a constant fraction of the cells; together with choosing the two values far enough apart, this yields the desired lower bound on the total variation distance.
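To convey the flavor of this construction, the sketch below (our illustration; the actual cell granularity, cell values, and error-correcting code used in the paper differ) builds a family of piecewise-constant densities on [0,1]^d from balanced binary strings, one value per checkerboard cell.

```python
import numpy as np

def checkerboard_density(codeword, k, low, high):
    """Return a density on [0,1]^d that takes value `low` or `high` on each
    of the k^d axis-aligned cells, according to `codeword` (a 0/1 array of
    length k^d).  With a balanced codeword and low/high averaging to 1,
    the function integrates to exactly 1 over [0,1]^d."""
    codeword = np.asarray(codeword)

    def density(x):
        # Index of the cell containing x (each coordinate assumed in [0,1)).
        idx = np.minimum((np.asarray(x) * k).astype(int), k - 1)
        flat = np.ravel_multi_index(tuple(idx), (k,) * len(idx))
        return high if codeword[flat] else low

    return density

d, k = 2, 8
rng = np.random.default_rng(3)
# Balanced random codewords: pairwise Hamming distance ~ k^d / 2 w.h.p.
base = np.repeat([0, 1], k ** d // 2)
words = np.array([rng.permutation(base) for _ in range(16)])
densities = [checkerboard_density(w, k, low=0.9, high=1.1) for w in words]
print(densities[0]([0.31, 0.77]))  # 0.9 or 1.1 depending on the cell
```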

1.3 Related work

The most closely related work that we are aware of was mentioned above: (holmstrom1992asymptotic, ) obtained bounds similar to ours for using kernel methods to learn densities that belong to various Sobolev spaces. As mentioned above, these results do not directly apply for learning densities in our classes because of the possibility of jump discontinuities. (holmstrom1992asymptotic, ) also proved a lower bound on the sample complexity of algorithms that compute kernel density estimates. In contrast, our lower bound holds for any density estimation algorithm, kernel-based or otherwise.

The assumption that the target density belongs to a Besov space (see (kle2009smoothing, )) makes reference to the effect of shifts on the distribution, as does shift-invariance. We do not see any obvious containments between classes of functions defined through shift-invariance and Besov spaces, but this is a potential topic for further research.

Another difference with prior work is the ability of our approach to succeed in the challenging noise-tolerant learning model. We are not aware of analyses for density estimation of densities belonging to Sobolev or Besov spaces that extend to the noise-tolerant setting in which the target density is only assumed to be close to some density in the relevant class.

As mentioned above, shift-invariance was used in the analysis of algorithms for learning discrete probability distributions in (barbour1999poisson, ; daskalakis2013learning, ). Likewise, both the discrete and continuous Fourier transforms have been used in the past to learn discrete probability distributions (diakonikolas2016optimal, ; diakonikolas2016fourier, ; DDKT16, ).

2 Preliminaries

We write to denote the radius- ball in , i.e. . If is a probability density over and is a subset of its domain, we write to denote the density of conditioned on .

2.1 Shift-invariance

Roughly speaking, the shift-invariance of a distribution measures how much it changes (in total variation distance) when it is subjected to a small translation. The notion of shift-invariance has typically been used for discrete distributions (especially in the context of proving discrete limit theorems, see e.g. (CGS11, ) and many references therein). We give a natural continuous analogue of this notion below.

Definition 2.

Given a probability density over , a unit vector , and a positive real value , we say that the shift-invariance of in direction at scale , denoted , is

(1)

Intuitively, if , then for any direction (unit vector) the variation distance between and a shift of by in direction is at most for all . The factor in the definition means that does not necessarily go to zero as gets small; the effect of shifting by is measured relative to .

Let

For any constant we define the class of densities to consist of all -dimensional densities with the property that for all

We could obtain an equivalent definition if we removed the factor from the definition of , and required that for all . This could of course be generalized to enforce bounds on the modified that are not linear in . We have chosen to focus on linear bounds in this paper to have cleaner theorems and proofs.

We include “sup” in the definition due to the fact that smaller shifts can sometimes have bigger effects. For example, a sinusoid with period p is unaffected by a shift of size p, but profoundly affected by a shift of size p/2. Because of possibilities like this, to capture the intuitive notion that “small shifts do not lead to large changes”, we seem to need to evaluate the worst case over shifts of at most a certain size.
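A quick numerical check of this point (a throwaway illustration, not from the paper): a density proportional to 1 + cos(2πx/p) on [0,1] is unchanged by a shift of a full period p but far in total variation from its shift by p/2.

```python
import numpy as np

period = 0.25
xs = np.linspace(0.0, 1.0, 100001)
f = 1.0 + np.cos(2 * np.pi * xs / period)      # period-0.25 density on [0,1]

def tv_after_shift(shift):
    g = 1.0 + np.cos(2 * np.pi * (xs - shift) / period)
    return 0.5 * np.trapz(np.abs(f - g), xs)

print(tv_after_shift(period))       # ~ 0: a shift by a full period does nothing
print(tv_after_shift(period / 2))   # ~ 0.64: a shift by half a period is drastic
```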

Remark 3.

For the rest of the paper, for the sake of simplicity we assume that the covariance matrix of the target density has full rank. This assumption is without loss of generality because, as is implicit in the arguments of Lemma 20, if the rank is less than d (and hence the density is supported in an affine subspace of dimension strictly less than d), then as a consequence of the shift-invariance of the density, a small number of draws will reveal the affine span of the entire distribution, and the entire algorithm can be carried out within that lower-dimensional subspace.

As described earlier, given a nonincreasing “tail bound” function which is absolutely continuous and satisfies , we further define the class of densities to consist of those which have the additional property that has -light tails, meaning that for all , it holds that where is the mean of

Remark 4.

It will be convenient in our analysis to consider only tail bound functions that satisfy (the constants and are arbitrary here and could be replaced by any other absolute positive constants). This is without loss of generality, since any tail bound function which does not meet this criterion can simply be replaced by a weaker tail bound function which does meet this criterion, and clearly if has -light tails then also has -light tails.

We will (ab)use the notation to mean .

The complexity of learning with a tail bound g will be expressed in part using a quantity that we denote I_g.

We remark that I_g is the “right” quantity in the sense that the integral defining it is finite as long as the density has “non-trivial decay”. More precisely, Chebyshev’s inequality gives a trivial tail bound for any density with bounded variance, and the corresponding integral diverges; thus if I_g is finite, then the density has a decay sharper than the trivial decay implied by Chebyshev’s inequality.

2.2 Fourier transform of high-dimensional distributions

In this subsection we gather some helpful facts from multidimensional Fourier analysis.

While it is possible to do Fourier analysis over , in this paper, we will only do Fourier analysis for functions .

Definition 5.

For any function , we define by

Next, we recall the following standard claims about Fourier transforms of functions, which may be found, for example, in (smith1995handbook, ).

Claim 6.

For let denote the convolution of and . Then for any , we have .

Next, we recall Parseval’s identity on the cube.

Claim 7 (Parseval’s identity).

For such that , it holds that

The next claim says that the Fourier inversion formula can be applied to any sequence in to obtain a function whose Fourier series is identical to the given sequence.

Claim 8 (Fourier inversion formula).

For any such that , the function is well defined and satisfies for all .

We will also use Young’s inequality:

Claim 9 (Young’s inequality).

Let , , , such that . Then .
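For reference, standard formulations of these three facts for functions on the cube/torus, with Fourier coefficients indexed by ξ ∈ ℤ^d, are recorded below; the normalization conventions are the usual ones and may differ from the paper's by constant factors.

```latex
% Convolution theorem (cf. Claim 6):
\widehat{f * g}(\xi) \;=\; \widehat{f}(\xi)\,\widehat{g}(\xi)
  \qquad \text{for all } \xi \in \mathbb{Z}^d .

% Parseval's identity (cf. Claim 7), for f with \sum_{\xi} |\widehat{f}(\xi)|^2 < \infty:
\int |f(x)|^2 \, dx \;=\; \sum_{\xi \in \mathbb{Z}^d} |\widehat{f}(\xi)|^2 .

% Young's inequality (cf. Claim 9), for 1 \le p, q, r \le \infty with
% 1/p + 1/q = 1 + 1/r:
\|f * g\|_r \;\le\; \|f\|_p \, \|g\|_q .
```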

2.3 A useful mollifier

Our algorithm and its analysis require the existence of a compactly supported distribution with fast decaying Fourier transform. Since the precise rate of decay is not very important, we use the function as follows:

(2)

Here is chosen so that is a pdf; by symmetry, its mean is . (This function has previously been used as a mollifier (kane2010exact, ; diakonikolas2010bounded, ).) The following fact can be found in johnson2015saddle (while it is proved only for , it is easy to see that the same proof holds if ).

Fact 10.

For defined in (2) and , we have that .

Let us now define the function as the product of translated copies of the one-dimensional mollifier, one for each coordinate. Combining this definition and Fact 10, we have the following claim. (The idea is that since the mollifier integrates to 1, its Fourier transform at any point has magnitude at most 1, and the multidimensional function is just the product distribution in which each coordinate is distributed according to the one-dimensional mollifier, after a translation.)

Claim 11.

For with , we have .

The next fact is immediate from (2) and the definition of :

Fact 12.

and as a consequence, .

3 A restricted problem: learning shift-invariant distributions with bounded support

As sketched in Section 1.2, we begin by presenting and analyzing a density estimation algorithm for densities that, in addition to being shift-invariant, have support bounded in . Our analysis also captures the fact that, to achieve accuracy , an algorithm often only needs the density to be learned to have shift invariance at a scale slightly finer than .

Lemma 13.

There is an algorithm learn-bounded with the following property: For all constant , for all , all , and all -dimensional densities with support in such that , given access to independent draws from , the algorithm runs in time

uses

samples, and with probability , outputs a hypothesis such that .

Further, given any point , can be computed in time and satisfies .

Proof.

Let , and let us define . (Here denotes convolution and is the mollifier defined in Section 2.3.) We make a few simple observations about :

  • (i) Since we have that is a density supported on .

  • (ii) Since is a constant, a draw from can be generated in constant time. Thus given a draw from , one can generate a draw from in constant time, simply by generating a draw from and adding it to the draw from .

  • (iii) By Young's inequality (Claim 9), we have that . Noting that is a density and thus and applying Fact 12, we obtain that is finite. As a consequence, the Fourier coefficients of are well-defined.

Preliminary analysis. We first observe that because is supported on , the distribution may be viewed as an average of different shifts of where each shift is by a distance at most . Fix any direction and consider a shift of in direction by some distance at most . Since , we have that the variation distance between and this shift in direction is at most . Averaging over all such shifts, it follows that

(3)

Next, we observe that by Claim 6, for any , we have . Since is a pdf, , and thus we have . Also, for any parameter , define . Let us fix another parameter (to be determined later). Applying Claim 11, we obtain

An easy calculation shows that if , then . If we now set to be , then

The algorithm. We first observe that for any , the Fourier coefficient can be estimated to good accuracy using relatively few draws from (and hence from , recalling (ii) above). More precisely, as an easy consequence of the definition of the Fourier transform, we have:

Observation 14.

For any , the Fourier coefficient can be estimated to within additive error of magnitude at most with confidence using draws from .
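Observation 14 is simply the statement that the empirical average of e^{-2πi⟨ξ,x⟩} over the draws concentrates around the true coefficient. The sketch below (ours; the frequency cutoff, sample size, and example distribution are placeholder assumptions) estimates all low-frequency coefficients of a distribution supported in a unit cube by such empirical averaging. The estimated coefficients can then be plugged into the truncated inverse Fourier series described in the following paragraphs.

```python
import numpy as np
from itertools import product

def estimate_fourier_coeffs(samples, cutoff):
    """Estimate hat{p}(xi) = E_{x ~ p}[exp(-2*pi*i*<xi, x>)] for every integer
    frequency xi with |xi_j| <= cutoff, by empirical averaging over the draws."""
    samples = np.asarray(samples, dtype=float)   # shape (n, d), support in a unit cube
    d = samples.shape[1]
    coeffs = {}
    for xi in product(range(-cutoff, cutoff + 1), repeat=d):
        phases = np.exp(-2j * np.pi * (samples @ np.array(xi)))
        coeffs[xi] = phases.mean()
    return coeffs

rng = np.random.default_rng(4)
data = rng.beta(2.0, 5.0, size=(20000, 2))   # stand-in for draws from the smoothed density
a_hat = estimate_fourier_coeffs(data, cutoff=3)
print(a_hat[(0, 0)])   # equals 1, since hat{p}(0) = 1 for any density
```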

Let us define the set of low-degree Fourier coefficients as . Thus, . Thus, using draws from , by Observation 14, with probability , we can compute a set of values such that

(4)

Recalling (ii), the sequence can be computed in time. Define for . Combining (4) with this, we get

Thus, setting as , we get that

(5)

Note that by definition satisfies . Thus, we can apply the Fourier inversion formula (Claim 8) to obtain a function such that

(6)

where the first equality follows by Parseval’s identity (Claim 7). By the Cauchy-Schwarz inequality,

Plugging in (6), we obtain Let us finally define (our final hypothesis), , as follows: . Note that since is a non-negative real value for all , we have

(7)

Finally, recalling that by (3) we have , it follows that

Complexity analysis. We now analyze the time and sample complexity of this algorithm as well as the complexity of computing . First of all, observe that plugging in the value of and recalling that is a constant, we get that . Combining this with the choice of (set just above (5)), we get that the algorithm uses

draws from . Next, as we have noted before, computing the sequence takes time

To compute the function (and hence ) at any point takes time This is because the Fourier inversion formula (Claim 8) has at most non-zero terms.

Finally, we prove the upper bound on . If the training examples are , then for any , we have

completing the proof. ∎

With an eye towards our ultimate goal of obtaining noise-tolerant density estimation algorithms, the next corollary says that the algorithm in Lemma 13 is robust to noise. All the parameters have the same meaning and relations as in Lemma 13.

Corollary 15.

Let be a density supported in the ball of Lemma 13 (looking ahead, while in general an ϵ-noisy version of a density need not be supported in this ball, the reduction we employ will in fact ensure that we only need to deal with noisy distributions that are supported there) such that there is a d-dimensional density satisfying the following two properties: (i) it satisfies all the conditions in the hypothesis of Lemma 13, and (ii) the given density is an ϵ-corrupted version of it, i.e. for some density . Then given access to samples from , the algorithm learn-bounded returns a hypothesis which satisfies . All the other guarantees, including the sample complexity and time complexity, remain the same as in Lemma 13.

Proof.

The proof of Lemma 13 can be broken down into two parts:

  • can be approximated by , and

  • can be learned.

The argument that can be learned only used two facts about it:

  • it is supported in , and

  • it has few nonzero Fourier coefficients.

So, now consider the distribution where is the same distribution as in Lemma 13. Because it is the result of convolving a density supported in the ball with the mollifier, it is supported in a slightly larger ball, and has the same Fourier concentration property that we used before. Thus, the algorithm will return a hypothesis distribution such that the analogue of (7) holds, i.e.

(8)

Recalling that the density can be expressed as where is some density supported in , we now have

The penultimate inequality uses (3) and the fact that the total variation distance between any two distributions is bounded by 1. Combining the above with (8), the corollary is proved. ∎

4 Density estimation for densities in

Fix any nonincreasing tail bound function g which satisfies the condition of Remark 4, and any constant. In this section we prove the following theorem, which gives a density estimation algorithm for the corresponding class of distributions:

Theorem 16.

For any as above and any , there is an algorithm with the following property: Let be any target density (unknown to the algorithm) which belongs to . Given any error parameter and confidence parameter and access to independent draws from , the algorithm with probability outputs a hypothesis such that .

The algorithm runs in time
$$O\!\left( \left( \bigl(1 + g^{-1}(\epsilon)\bigr)^{2d} \left(\tfrac{1}{\epsilon}\right)^{2d+2} \log^{4d}\!\left(\tfrac{1 + g^{-1}(\epsilon)}{\epsilon}\right) \log\!\left(\tfrac{1 + g^{-1}(\epsilon)}{\epsilon\,\delta}\right) + I_g \right) \log\tfrac{1}{\delta} \right)$$
and uses
$$O\!\left( \left( \bigl(1 + g^{-1}(\epsilon)\bigr)^{2d} \left(\tfrac{1}{\epsilon}\right)^{2d+2} \log^{4d}\!\left(\tfrac{1 + g^{-1}(\epsilon)}{\epsilon}\right) \log\!\left(\tfrac{1 + g^{-1}(\epsilon)}{\epsilon\,\delta}\right) + I_g \right) \log\tfrac{1}{\delta} \right)$$
samples.

4.1 Outline of the proof

Theorem 16 is proved by a reduction to Lemma 13. The main ingredient in the proof of Theorem 16 is a “transformation algorithm” with the following property: given as input access to i.i.d. draws from any density , the algorithm constructs parameters which enable draws from the density to be transformed into draws from another density, which we denote . The density is obtained by approximating after conditioning on a non-tail sample, and scaling the result so that it lies in a ball of radius .

Given such a transformation algorithm, the approach to learn is clear: we first run the transformation algorithm to get access to draws from the transformed distribution . We then use draws from to run the algorithm of Lemma 13 to learn to high accuracy. (Intuitively, the error relative to of the final hypothesis density is because at most comes from the conditioning and at most from the algorithm of Lemma 13.) We note that while this high-level approach is conceptually straightforward, a number of technical complications arise; for example, our transformation algorithm only succeeds with some non-negligible probability, so we must run the above-described combined procedure multiple times and perform hypothesis testing to identify a successful final hypothesis from the resulting pool of candidates.
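The hypothesis-selection step is not spelled out in this overview. As general background, a standard way to pick a good hypothesis from a pool of candidate densities, given sample access to the target, is a Scheffé-style pairwise tournament; the sketch below is that textbook scheme (in the spirit of Devroye and Lugosi), not necessarily the exact procedure used in the paper, and it assumes each candidate comes with both an evaluable density and a sampler.

```python
import numpy as np

def scheffe_select(candidates, draws, n_mc=20000, rng=None):
    """Pick a candidate density close to the target in total variation.
    `candidates`: list of (density_fn, sampler_fn) pairs; `draws`: samples
    from the target.  For each pair (f, g) the Scheffe set is
    A = {x : f(x) > g(x)}; the winner of the pair is whichever density
    assigns A a mass closer to the empirical mass of A under the target.
    Return the index of the candidate with the most pairwise wins."""
    rng = rng or np.random.default_rng()
    wins = np.zeros(len(candidates))
    for i, (f, f_samp) in enumerate(candidates):
        for j, (g, g_samp) in enumerate(candidates):
            if i >= j:
                continue
            emp = np.mean([f(x) > g(x) for x in draws])                 # empirical mass of A
            mass_f = np.mean([f(x) > g(x) for x in f_samp(n_mc, rng)])  # f's mass of A (Monte Carlo)
            mass_g = np.mean([f(x) > g(x) for x in g_samp(n_mc, rng)])  # g's mass of A (Monte Carlo)
            if abs(mass_f - emp) <= abs(mass_g - emp):
                wins[i] += 1
            else:
                wins[j] += 1
    return int(np.argmax(wins))
```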

The rest of this section is organized as follows: In Section 4.2 we give various necessary technical ingredients for our transformation algorithm. We state and prove the key results about the transformation algorithm in Section 4.3, and we use the transformation algorithm to prove Theorem 16 in Section 4.4.

4.2 Technical ingredients for the transformation algorithm

As sketched earlier, our approach will work with a density obtained by conditioning the target on lying in a certain ball that has mass close to 1 under it. (Intuitively, this corresponds to the bounded-support condition used in the learning algorithm of Section 3; recall that the densities learned by that algorithm were assumed to lie in a bounded ball.) While we know that the original density has good shift-invariance, we will further need the conditioned distribution to also have good shift-invariance in order for the learn-bounded algorithm of Section 3 to work. Thus we require the following simple lemma, which shows that conditioning a density on a region of large probability cannot hurt its shift-invariance too much.

Lemma 17.

Let and let be a ball such that where . If is the density of conditioned on , then, for all , .

Proof.

Let be any unit vector in . Let (resp. ) be the densities obtained by projecting (resp. ) to direction . Note that can be expressed as + where