# Asymptotic nonequivalence of density estimation and Gaussian white noise for small densities

It is well-known that density estimation on the unit interval is asymptotically equivalent to a Gaussian white noise experiment, provided the densities are sufficiently smooth and uniformly bounded away from zero. We show that a uniform lower bound, whose size we sharply characterize, is in general necessary for asymptotic equivalence to hold.

## 1 Introduction

A fundamental problem in nonparametric statistics is density estimation on a compact set, say the unit interval $[0,1]$, where we observe $n$ i.i.d. observations $X_1,\dots,X_n$ from an unknown probability density $f$. If the parameter space $\Theta$ consists of densities that are uniformly bounded away from zero and have Hölder smoothness $\beta>1/2$, then a seminal result of Nussbaum [18] establishes the global asymptotic equivalence of this experiment to the Gaussian white noise model where we observe $(Y_t)_{t\in[0,1]}$ arising from

$$dY_t = 2\sqrt{f(t)}\,dt + n^{-1/2}\,dW_t,\quad t\in[0,1],\quad f\in\Theta, \tag{1}$$

where $(W_t)_{t\in[0,1]}$ is a Brownian motion. The smoothness constraint $\beta>1/2$ is sharp: Brown and Zhang [4] construct a counterexample with a parameter space of Hölder smoothness exactly $\beta=1/2$ such that asymptotic equivalence does not hold.

If two statistical experiments are asymptotically equivalent in the Le Cam sense, then asymptotic statements can be transferred between the experiments. More precisely, the existence of a decision procedure with risk $R_n$ for a given bounded loss function in one model implies the existence of a corresponding decision procedure with risk $R_n+o(1)$ in this loss in the other model. To derive asymptotic properties, one may therefore work in the simpler model and transfer the results to the more complex model. This is one of the main motivations behind the study of asymptotic equivalence. The last part of the introduction provides definitions and summarizes the concept of asymptotic equivalence of statistical experiments.

In practice, densities may be small or even zero on a subset of the domain, in which case the above result no longer applies. The goal of this article is to contribute to the general understanding of necessary conditions for asymptotic equivalence to hold, in particular the necessity of uniform boundedness away from zero. We show that without a minimal lower bound on the densities, density estimation and the Gaussian model (1) are always asymptotically nonequivalent, irrespective of the amount of Hölder smoothness.

In fact, we prove a more precise result by characterizing a size threshold such that if densities fall below this level, asymptotic equivalence never holds. In a companion paper [22], we show constructively that above this threshold, asymptotic equivalence may still hold. Our threshold is thus sharp in the sense that it is the smallest possible value a density can take such that asymptotic equivalence can hold.

We employ sample-size dependent parameter spaces $\Theta_n$, as is typical in high-dimensional statistics. We prove that if the parameter spaces $\Theta_n$ contain a sequence of $\beta$-smooth densities $f_n$ such that $\inf_{x\in[0,1]} f_n(x)\lesssim n^{-\beta/(\beta+1)}$ for all $n$, as well as suitable neighbourhoods of $\beta$-smooth densities around the $f_n$, then the experiments are always asymptotically nonequivalent. This is a natural threshold for describing “small” and “large” densities with, for instance, different minimax rates attainable above and below this level [19, 20], see (2) and the related discussion below.

From a practical perspective, Gaussian approximations have been proposed in density estimation (e.g. [1]) and one would like to better understand how “large” a density must be for such methods to be applicable. The present work is a step in this direction. Furthermore, all the results presented in this paper also hold for the closely related case of Poisson intensity estimation, which is always asymptotically equivalent to density estimation, irrespective of density size or Hölder smoothness [15, 22]. This case is of particular practical relevance given the widespread use of Gaussian approximations for Poisson data [12], even for small intensities [16]. We avoid further mention of Poisson intensity estimation for conciseness, but readers should bear in mind that all the present results and conclusions apply equally to that model.

There are few results establishing the necessity of conditions for asymptotic equivalence via counterexamples. For nonparametric regression, [4, 8] show the necessity of smoothness assumptions. The paper [26] establishes nonequivalence between the GARCH model and its diffusion limit under stochastic volatility, as well as their equivalence under deterministic volatility.

The proof we employ here relies on a reduction to binary experiments. The difficulty lies in both the construction of a suitable two-point testing problem and also in obtaining sufficiently good bounds on the total variation distance. Indeed, the situation is rather more subtle than one might first imagine. For two-point hypothesis testing problems, we show that one can consistently test between the alternatives in one model if and only if one can do so in the other (Lemma 2). To establish nonequivalence, one must therefore construct alternatives which can be separated with a positive probability that is strictly bounded away from zero and one and for which suitable bounds can be computed.

Although for small signals, density estimation and the Gaussian white noise model (1) are no longer asymptotically equivalent, many aspects of their statistical theory remain the same. As mentioned above, simple hypothesis testing is essentially the same in both models without any lower bound on the densities. To explain this in more detail, suppose that $(g_n)_n$ and $(h_n)_n$ are two sequences of densities and denote the probability measures in the density estimation model and the Gaussian model (1) by $P^n_f$ and $Q^n_f$ respectively. The sums of the type I and II error probabilities of the Neyman-Pearson test for the simple hypotheses

$$H_0: f=g_n \qquad H_1: f=h_n$$

in the two models are $1-\|P^n_{g_n}-P^n_{h_n}\|_{TV}$ and $1-\|Q^n_{g_n}-Q^n_{h_n}\|_{TV}$ respectively. By Lemma 2 below, $\|P^n_{g_n}-P^n_{h_n}\|_{TV}\to 1$ if and only if $\|Q^n_{g_n}-Q^n_{h_n}\|_{TV}\to 1$, which shows that we can consistently test against a simple alternative in one model if and only if we can do so in the other model. This argument requires no lower bound on the densities. The Hellinger distance also behaves very similarly in the two models, see Lemma 2 for a precise statement. It is an interesting phenomenon that while the models are potentially far apart with respect to the Le Cam distance, information distances, such as the total variation and Hellinger distance, remain close. Although this does not hold for all common information measures, for instance the Kullback-Leibler divergence, it nevertheless suggests that negative results for small densities in the Le Cam sense may be misleading, since many important statistical properties still carry over between models.

Beyond density estimation, uniform boundedness away from zero is a standard assumption in the asymptotic equivalence literature [2, 9, 10, 11, 18]. However, this assumption is not always required, including in regression type models [3, 23] and even some non-linear problems, such as diffusion processes [5, 6, 7]. A better understanding of the necessity of such conditions is therefore of interest in a wide variety of models.

## 2 Main results

### Basic notation and definitions

For two functions $f,g$ on $[0,1]$, we write $f\le g$ if $f(x)\le g(x)$ for all $x\in[0,1]$ and let $\|f\|_p$ denote the $L^p$-norm of $f$. Given two probability measures $P,Q$ with densities $p,q$ with respect to some dominating measure $\nu$, we recall the total variation distance $\|P-Q\|_{TV}$ and the Hellinger distance $H(P,Q)$.
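As a concrete illustration of these two distances, the sketch below approximates them numerically for two densities on $[0,1]$. It assumes the normalizations $\|P-Q\|_{TV}=\tfrac12\int|p-q|$ and $H^2(P,Q)=\int(\sqrt p-\sqrt q)^2$; the second is an assumption, since the extracted text does not show which normalization of the Hellinger distance is used. The function name `tv_and_hellinger_sq` and the two example densities are hypothetical choices made for illustration only.

```python
import math


def tv_and_hellinger_sq(p, q, grid_size=100_000):
    """Midpoint-rule approximations on [0, 1] of
    TV  = 0.5 * int |p - q|                (total variation)
    H^2 = int (sqrt(p) - sqrt(q))^2        (assumed, unnormalized Hellinger)
    """
    dx = 1.0 / grid_size
    tv = 0.0
    h2 = 0.0
    for i in range(grid_size):
        x = (i + 0.5) * dx
        px, qx = p(x), q(x)
        tv += 0.5 * abs(px - qx) * dx
        h2 += (math.sqrt(px) - math.sqrt(qx)) ** 2 * dx
    return tv, h2


def uniform(x):
    # Uniform density on [0, 1].
    return 1.0


def linear(x):
    # Density 2x on [0, 1]; it vanishes at x = 0, i.e. it is "small" there.
    return 2.0 * x


tv, h2 = tv_and_hellinger_sq(uniform, linear)
print(f"TV = {tv:.4f}, H^2 = {h2:.4f}")
```

Under this normalization, the two quantities satisfy $\tfrac12 H^2\le \|P-Q\|_{TV}\le H$, which the example values can be checked against.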

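A statistical experiment $\mathcal{E}(\Theta)$ consists of a sample space $\Omega$ with associated $\sigma$-algebra $\mathcal{A}$ and a family of probability measures $(P^n_\theta:\theta\in\Theta)$ all defined on the measurable space $(\Omega,\mathcal{A})$. We call $\mathcal{E}(\Theta)$ dominated if there exists a probability measure $\mu$ such that any $P^n_\theta$ is dominated by $\mu$. Furthermore, $\mathcal{E}(\Theta)$ is said to be Polish if $\Omega$ is a Polish space and $\mathcal{A}$ is the associated Borel $\sigma$-algebra. If $\mathcal{E}(\Theta)$, with measures $(P^n_\theta)$, and $\mathcal{F}(\Theta)$, with measures $(Q^n_\theta)$, are two Polish and dominated experiments indexed by the same parameter space, the Le Cam deficiency can be defined as

$$\delta(\mathcal{E}(\Theta),\mathcal{F}(\Theta)) := \inf_{M}\sup_{\theta\in\Theta}\big\|MP^n_\theta-Q^n_\theta\big\|_{TV},$$

where the infimum is taken over all Markov kernels $M$. The Le Cam distance is defined as

$$\Delta(\mathcal{E}(\Theta),\mathcal{F}(\Theta)) := \max\big\{\delta(\mathcal{E}(\Theta),\mathcal{F}(\Theta)),\ \delta(\mathcal{F}(\Theta),\mathcal{E}(\Theta))\big\},$$

which defines a pseudo-distance on the space of all experiments with parameter space $\Theta$. One may generalize the definition of Le Cam deficiency to spaces that are neither Polish nor dominated upon replacing the notion of Markov kernel with a more general transition [14, 24]. However, we refrain from doing so here since these notions coincide in the Polish and dominated experiments we consider in this article, see (68) and Proposition 9.2 of [18]. Finally, we say that two sequences of experiments $\mathcal{E}_n(\Theta_n)$ and $\mathcal{F}_n(\Theta_n)$ are asymptotically equivalent if $\Delta(\mathcal{E}_n(\Theta_n),\mathcal{F}_n(\Theta_n))\to 0$ as $n\to\infty$. General treatments on asymptotic equivalence can be found in [14, 24].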

In this article, we consider the following two statistical experiments.

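Density estimation $\mathcal{E}^D_n(\Theta)$: In nonparametric density estimation, we observe $n$ i.i.d. copies $X_1,\dots,X_n$ of a random variable $X$ on $[0,1]$ with unknown Lebesgue density $f\in\Theta$. The corresponding statistical experiment $\mathcal{E}^D_n(\Theta)$ has the family of measures $(P^n_f:f\in\Theta)$, with $P^n_f$ the product probability measure of $X_1,\dots,X_n$.

Gaussian white noise experiment $\mathcal{E}^G_n(\Theta)$: We observe the Gaussian process $(Y_t)_{t\in[0,1]}$ arising from (1) with $f\in\Theta$ unknown. Denote by $C[0,1]$ the space of continuous functions on $[0,1]$ and let $\sigma(C[0,1])$ be the $\sigma$-algebra generated by the open sets with respect to the uniform norm. The Gaussian white noise experiment $\mathcal{E}^G_n(\Theta)$ is then given by the sample space $C[0,1]$, the $\sigma$-algebra $\sigma(C[0,1])$ and the family $(Q^n_f:f\in\Theta)$, with $Q^n_f$ the distribution of $(Y_t)_{t\in[0,1]}$.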

### Function spaces

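Denote by $\lfloor\beta\rfloor$ the largest integer strictly smaller than $\beta$. We write $|f|_{C^\beta}$ for the usual Hölder semi-norm and $\|f\|_{C^\beta}$ for the Hölder norm. Consider the space of $\beta$-smooth Hölder densities with Hölder norm bounded by $R$,

$$C^\beta(R) := \Big\{f:[0,1]\to\mathbb{R} \,:\, f\ge 0,\ \int_0^1 f(u)\,du=1,\ f^{(\lfloor\beta\rfloor)}\text{ exists},\ \|f\|_{C^\beta}\le R\Big\}.$$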

If $0<\beta\le 2$, the pointwise rate of estimation at any $x\in[0,1]$ over the parameter space $C^\beta(R)$ is given by

$$n^{-\frac{\beta}{\beta+1}}+\Big(\frac{f(x)}{n}\Big)^{\frac{\beta}{2\beta+1}}, \tag{2}$$

with upper and lower bounds matching up to $\log n$ factors (see Theorems 3.1 and 3.3 of [19] for density estimation and Theorems 1 and 2 of [20] for the Gaussian white noise model). There is thus a phase transition in the estimation rate for small densities occurring at the $n$-dependent signal size $n^{-\beta/(\beta+1)}$. This is the same boundary for asymptotic nonequivalence proved in Theorem 1 below, so that in some respects at least, the two experiments do behave differently from one another below this threshold. However, despite asymptotic nonequivalence, many other properties, such as minimax rates and consistent testing, are still asymptotically the same below this threshold. Indeed, the counterexample we construct lies right on the boundary of testing problems and in some sense only narrowly fails. The importance of the threshold is not isolated to minimax estimation rates and asymptotic equivalence and seems to play a fundamental role for small densities, for example being necessary to obtain sharp rates when estimating the support of a density [19]. For further discussion see [19, 20].
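The phase transition in the rate (2) can be checked numerically. The following is a minimal sketch that evaluates the two terms of (2) with all constants and logarithmic factors ignored; the function names `pointwise_rate` and `threshold` are hypothetical, and the values of `n` and `beta` are arbitrary illustrative choices.

```python
import math


def threshold(n: int, beta: float) -> float:
    """Signal size n^{-beta/(beta+1)} at which the two terms of (2) balance."""
    return n ** (-beta / (beta + 1.0))


def pointwise_rate(n: int, beta: float, fx: float) -> float:
    """Rate (2): n^{-beta/(beta+1)} + (f(x)/n)^{beta/(2*beta+1)}.

    Constants and log factors are ignored; fx is the value f(x) >= 0.
    """
    return threshold(n, beta) + (fx / n) ** (beta / (2.0 * beta + 1.0))


n, beta = 10**6, 1.0
t = threshold(n, beta)
for fx in (0.0, t, 1.0):
    # Below the threshold the first term dominates, above it the second does.
    print(f"f(x) = {fx:.3g}: rate = {pointwise_rate(n, beta, fx):.3g}")
```

At $f(x)=n^{-\beta/(\beta+1)}$ the two terms coincide, so the rate there is exactly twice the threshold, while for $f(x)$ of order one the second term recovers the familiar $n^{-\beta/(2\beta+1)}$.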

The rate of convergence (2) does not extend to $\beta>2$ using the usual definition of Hölder smoothness due to the existence of functions which are highly oscillatory near zero (Theorem 3 of [20]). A natural way to attain the rate for smoothness $\beta>2$ is to impose a shape constraint ruling out such pathological behaviour. On $C^\beta$, define the flatness seminorm

$$|f|_{H^\beta}=\max_{1\le j<\beta}\Big\|\,|f^{(j)}|^\beta/|f|^{\beta-j}\Big\|_\infty^{1/j}=\max_{1\le j<\beta}\Big(\sup_{x\in[0,1]}\frac{|f^{(j)}(x)|^\beta}{|f(x)|^{\beta-j}}\Big)^{1/j} \tag{3}$$

with $0/0$ defined as $0$ and $|f|_{H^\beta}=0$ if $\beta\le 1$. The quantity $|f|_{H^\beta}$ measures the flatness of a function near zero in the sense that if $f(x)$ is small, then the derivatives of $f$ must also be small in a neighborhood of $x$. Define $\|f\|_{H^\beta}:=\|f\|_{C^\beta}+|f|_{H^\beta}$ and consider the space of densities

$$H^\beta(R):=\big\{f\in C^\beta(R) \,:\, \|f\|_{H^\beta}\le R\big\}.$$

Notice that $H^\beta(R)=C^\beta(R)$ for $\beta\le 1$. For further discussion and properties of the function space $H^\beta$, see [21].

The reason we construct a counterexample in $H^\beta$ is to show concretely that asymptotic nonequivalence is not due to functions that are highly oscillatory near zero, but also holds for typical Hölder functions. Thus even when considering only “nice” Hölder functions, for which the rate (2) is attainable, nonequivalence still holds.

### Asymptotic nonequivalence

To obtain suitable lower bounds on the Le Cam deficiencies, we require that the small densities are not isolated in the parameter space $\Theta_n$, meaning we must introduce a notion of interior parameter space. This is in some sense necessary, since asymptotic equivalence may still hold when the small density behaviour is driven by a parametric component, in particular one having finite Hellinger metric dimension. For further discussion on this point, see Proposition 1 below.

The following result is the main contribution of this article, showing that if

$$\inf_{f\in\Theta_n}\inf_{x\in[0,1]}f(x)\lesssim n^{-\beta/(\beta+1)},$$

then the Le Cam deficiency is bounded from below by a positive constant for sufficiently large $n$. In this case, the experiments are asymptotically nonequivalent.

###### Theorem 1.

Let There exists a constant not depending on such that if is a sequence satisfying and for all , then

$$\delta\big(\mathcal{E}^D_n(\Theta_n),\mathcal{E}^G_n(\Theta_n)\big)\ge 0.007+o(1)>0.$$

The assumption is that the parameter space $\Theta_n$ is rich enough to contain a function $f_{0,n}$ that somewhere falls below the threshold $n^{-\beta/(\beta+1)}$, together with all the functions in $H^\beta(cR)$ lying in the band $\{f: c^{-1}f_{0,n}\le f\le cf_{0,n}\}$ around $f_{0,n}$. An explicit expression for $c$ can be obtained from the proof, see (15). As a particular example, the norm balls $H^\beta(R)$ satisfy the above assumptions.

###### Corollary 1.

For any and sufficiently large

###### Proof of Corollary 1.

For consider the density For any integer , , where denotes the Gamma function. This implies that and . For any and , , which implies for Hence, , so that for some finite constant depending only on Let be the constant in Theorem 1. If we may apply Theorem 1 with the constant sequence since then By Theorem 1 with and replaced by the assertion follows. ∎

Since $\|f\|_\infty\ge 1$ for any density $f$ on $[0,1]$, the radius $R$ in the previous corollary must be larger than some $R_0$, otherwise the parameter space is empty. For small densities, the Gaussian white noise model (1) can be asymptotically more informative than density estimation. This result is only interesting in the case $\beta>1/2$, since for $\beta\le 1/2$ asymptotic equivalence can fail even if all densities are uniformly bounded away from zero [4].

Under general conditions, if for and , the squared Le Cam deficiencies between density estimation and the Gaussian model (1) are exactly of the order

$$\min\Big\{1,\ n^{\frac{1-2\beta}{2\beta+1}}\sup_{f\in\Theta_n}\int_0^1 f(x)^{-\frac{2\beta+3}{2\beta+1}}\,dx\Big\}, \tag{4}$$

see Theorem 4 of [22]. In particular, if $\Theta_n$ is uniformly bounded away from zero we recover the rate $n^{(1-2\beta)/(2\beta+1)}$, so that the experiments are asymptotically equivalent if and only if $\beta>1/2$. As we now show by example, in view of (4), the threshold obtained in Theorem 1 is essentially sharp up to a logarithmic factor.

Consider the densities with diverging, which satisfy and for large enough. For the constant from Theorem 1, set

$$\Theta_n=\big\{f\in H^\beta(cR) \,:\, c^{-1}f_{0,n}\le f\le cf_{0,n}\big\}.$$

Since , applying (4),

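$$\Delta\big(\mathcal{E}^D_n(\Theta_n),\mathcal{E}^G_n(\Theta_n)\big)^2\asymp n^{\frac{1-2\beta}{2\beta+1}}\int_0^1 f_{0,n}(x)^{-\frac{2\beta+3}{2\beta+1}}\,dx\asymp M_n^{\frac{(1-2\beta)(\beta+1)}{\beta(2\beta+1)}}\to 0$$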

for , so that density estimation and the Gaussian model (1) with parameter spaces are asymptotically equivalent. In summary, asymptotic equivalence always fails below the threshold , but may still hold for any level larger than , thereby showing that Theorem 1 is sharp up to a logarithmic factor. The factor is a technical artifact arising from the proof of (4).

### Asymptotic equivalence for small densities in parametric settings

Asymptotic nonequivalence due to small densities is a feature of fully nonparametric models and our conclusions do not necessarily apply in parametric models. We illustrate this via an example, whose proof we defer to the end of the article.

###### Proposition 1.

Consider the probability density . For a compact interval, consider the location family . For this parameter space, density estimation and the Gaussian model (1) are asymptotically equivalent, that is as ,

$$\Delta\big(\mathcal{E}^D_n(\Theta(K)),\mathcal{E}^G_n(\Theta(K))\big)\to 0.$$

The densities in the location family are not bounded away from zero on , with equal to zero on , yet asymptotic equivalence still holds. The reason for this is that for areas of where there are too few observations to admit a Gaussian approximation, the required information is provided by the parameter estimates for . A sufficient condition for this is finite Hellinger metric dimension in the density model, not to be confused with finite vectorial dimension of the parameter space, see Assumption (A3) of Le Cam [13]. Recall that a family of density functions is said to have finite Hellinger metric dimension if there exists a number such that every subset of which can be covered by an -ball in Hellinger distance , can be covered by at most -balls in , where does not depend on . For example, the family of densities on for some has vectorial dimension one yet does not have finite Hellinger metric dimension, see Remark 2 after Theorem 4.3 of Le Cam [13]. In this sense, Theorem 1 is truly a nonparametric result.

One can extend this further by considering parameter spaces with a-priori known zeroes. For instance Mariucci [17] establishes asymptotic equivalence for densities of the form , where , , is an unknown function uniformly bounded away from zero and is a given known function that is possibly small. In view of the above, one may interpret this as a form of semiparametric model, with a parametric part determining the density for small values and the nonparametric part doing so for large values. Thus for areas of with sufficient observations, one can fit a Gaussian approximation based on as usual, whereas for regions with insufficient observations, one must use the information provided by the parameter estimate for , which in this particular example arises from a zero-dimensional family since is known exactly.

### Overview of the proof

The proof of Theorem 1 is based on a reduction to binary experiments and a direct comparison of the total variation distances between the parameters using the following lemma.

###### Lemma 1.

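Let $\mathcal{E}^b_1=(P_{1,1},P_{1,2})$ and $\mathcal{E}^b_2=(P_{2,1},P_{2,2})$ be binary experiments. Then

$$\delta(\mathcal{E}^b_1,\mathcal{E}^b_2)\ge\frac12\Big(\|P_{2,1}-P_{2,2}\|_{TV}-\|P_{1,1}-P_{1,2}\|_{TV}\Big).$$

###### Proof.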

We have the explicit formula with the error function in , and where the infimum is over all tests , see Strasser [24], Corollary 15.7 and Definition 14.1. Notice that the definition of deficiency in [24], Definition 15.1, has an additional factor . The result then follows with ([24], p. 71). ∎

To establish asymptotic nonequivalence for a discrete experiment and its continuous analogue, a standard approach is to consider a sequence of binary experiments such that the total variation distance in the discrete model is zero (i.e. both measures are the same) but the total variation distance in the continuous model is positive. Lemma 1 then yields asymptotic nonequivalence.

This approach cannot be used here and the proof of Theorem 1 requires a much more careful choice of the sequence of binary experiments. Consider a sequence of binary experiments in the density estimation setting with corresponding binary experiments in the Gaussian white noise model. The following result shows that the total variation distance in one experiment tends to zero if and only if the total variation in the other experiment also tends to zero. The same holds if the total variation distances both tend to one. Thus, in order to construct a lower bound via Lemma 1, such sequences cannot be used.

###### Lemma 2.

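Let $(f_n)_n$ and $(g_n)_n$ be arbitrary sequences of densities in both experiments $\mathcal{E}^D_n$ and $\mathcal{E}^G_n$. For $P^n_f$ the product probability measure for density estimation and $Q^n_f$ the law of the Gaussian white noise model (1),

$$\|P^n_{f_n}-P^n_{g_n}\|_{TV}\to 0\ \Leftrightarrow\ \|Q^n_{f_n}-Q^n_{g_n}\|_{TV}\to 0\ \Leftrightarrow\ n\int\big(\sqrt{f_n}-\sqrt{g_n}\big)^2\to 0 \tag{5}$$

and

$$\|P^n_{f_n}-P^n_{g_n}\|_{TV}\to 1\ \Leftrightarrow\ \|Q^n_{f_n}-Q^n_{g_n}\|_{TV}\to 1\ \Leftrightarrow\ n\int\big(\sqrt{f_n}-\sqrt{g_n}\big)^2\to\infty. \tag{6}$$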

If denotes the Hellinger distance, then for

$$H^2(Q^n_{f_n},Q^n_{g_n})\le H^2(P^n_{f_n},P^n_{g_n})\le H^2(Q^n_{f_n},Q^n_{g_n})+\frac{2\log n}{n}. \tag{7}$$

###### Proof.

We first prove (7). By Lemma 5 below, Together with Lemmas 2.17 and 2.19 of [24], this proves

$$H^2(Q^n_{f_n},Q^n_{g_n})\le H^2(P^n_{f_n},P^n_{g_n})\le H^2(Q^n_{f_n},Q^n_{g_n})+\frac12\int\big(\sqrt{f_n}-\sqrt{g_n}\big)^2.$$

Distinguishing whether the term is larger or smaller than and using that if it is, then establishes (7).

To verify the first two assertions of the lemma, notice that by Le Cam’s inequalities (Lemma 2.3 in [25]), for any probability measures

 (8)

Consequently, the total variation of two sequences and converges to zero if and only if Similarly, if and only if Using (7) and (5) and (6) follow. ∎
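The closeness of the Hellinger distances in (7) can be illustrated numerically. The sketch below assumes the unnormalized convention $H^2(P,Q)=\int(\sqrt p-\sqrt q)^2$ (an assumption, since the extracted text does not show the normalization) and writes $h^2=\int(\sqrt{f}-\sqrt{g})^2$ for a single observation. Under that convention, the product structure gives $H^2(P^n_f,P^n_g)=2-2(1-h^2/2)^n$, and the Gaussian shift structure of model (1) gives $H^2(Q^n_f,Q^n_g)=2-2\exp(-nh^2/2)$. The function names are hypothetical.

```python
import math


def h2_density(h2_single: float, n: int) -> float:
    """Squared Hellinger distance between n-fold product measures,
    given the single-observation value h^2 = int (sqrt f - sqrt g)^2."""
    return 2.0 - 2.0 * (1.0 - h2_single / 2.0) ** n


def h2_gaussian(h2_single: float, n: int) -> float:
    """Squared Hellinger distance in model (1): the drifts 2*sqrt(f) and
    2*sqrt(g) with noise level n^{-1/2} give affinity exp(-n * h^2 / 2)."""
    return 2.0 - 2.0 * math.exp(-n * h2_single / 2.0)


def largest_gap(n: int, grid_size: int = 2000) -> float:
    """Largest value of H^2(P) - H^2(Q) over a grid of h^2 in [0, 2]."""
    gaps = []
    for i in range(grid_size + 1):
        h2 = 2.0 * i / grid_size
        gaps.append(h2_density(h2, n) - h2_gaussian(h2, n))
    return max(gaps)


for n in (2, 10, 1000):
    print(n, largest_gap(n), 2.0 * math.log(n) / n)
```

On this grid the gap is non-negative and stays below $2\log n/n$, in line with (7); this is a numerical sanity check under the stated convention, not a substitute for the proof.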

In view of this, we must construct sequences $(f_n)_n$ and $(g_n)_n$ such that the total variation distances in the two experiments tend neither to zero nor one and are separated for $n$ large enough.

In the following we describe the ideas that finally lead to a lower bound. As a first step, we use Lemma 4 below to show that in the density estimation model,

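$$\|P^n_f-P^n_g\|_{TV}\le 1-\Big(1-\frac{\|f-g\|_1}{2}\Big)^n.$$

In the Gaussian white noise model, we have by Lemma 5 that $\|Q^n_f-Q^n_g\|_{TV}=1-2\Phi\big(-\sqrt{n}\,\|\sqrt f-\sqrt g\|_2\big)$ with $\Phi$ the distribution function of a standard normal random variable. For the Le Cam deficiency of the binary experiments with parameter space $\Theta=\{f,g\}$, Lemma 1 then implies the following lower bound:

$$\delta\big(\mathcal{E}^D_n(\{f,g\}),\mathcal{E}^G_n(\{f,g\})\big)\ge\frac12\Big(1-\frac{\|f-g\|_1}{2}\Big)^n-\Phi\big(-\sqrt{n}\,\|\sqrt f-\sqrt g\|_2\big). \tag{9}$$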

To prove asymptotic nonequivalence, we therefore want to construct sequences $(f_n)_n$ and $(g_n)_n$ such that the total variation is small while the Hellinger distance is large. The largest value of the Hellinger distance is given by Le Cam’s inequalities (8), $\|\sqrt f-\sqrt g\|_2^2\le\|f-g\|_1$. An inspection of the proof shows that equality is achieved if for all $x$ either $f(x)=0$ or $g(x)=0$. This is a first indication that the bound (9) is particularly useful for small densities.
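The lower bound (9) has the same form as the left-hand side of (12), namely $\tfrac12(1-\|f-g\|_1/2)^n-\Phi(-\sqrt n\|\sqrt f-\sqrt g\|_2)$. The sketch below evaluates it from the two distances, combining the product-measure bound on the total variation with the Gaussian shift formula via Lemma 1. The function names are hypothetical, and the inputs in the example are purely illustrative: they take the extreme case $\|\sqrt f-\sqrt g\|_2^2=\|f-g\|_1=3/n$ and are not the test functions constructed in Section 3.

```python
import math


def std_normal_cdf(t: float) -> float:
    """Distribution function Phi of a standard normal random variable."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def deficiency_lower_bound(n: int, l1_dist: float, hellinger_l2: float) -> float:
    """Lower bound of the form (9) for the binary experiment {f, g}.

    l1_dist      : ||f - g||_1
    hellinger_l2 : ||sqrt(f) - sqrt(g)||_2
    """
    # Upper bound on the total variation between the n-fold product measures.
    tv_density_upper = 1.0 - (1.0 - l1_dist / 2.0) ** n
    # Total variation between the two Gaussian white noise laws.
    tv_gaussian = 1.0 - 2.0 * std_normal_cdf(-math.sqrt(n) * hellinger_l2)
    # Lemma 1: half the difference of the total variation distances.
    return 0.5 * (tv_gaussian - tv_density_upper)


n = 10**6
l1 = 3.0 / n                   # illustrative L1 distance of order 1/n
hell = math.sqrt(l1)           # extreme case: squared Hellinger equals L1
bound = deficiency_lower_bound(n, l1, hell)
print(f"lower bound = {bound:.4f}")
```

A positive value indicates that the pair cannot be matched across the two models; when the squared Hellinger distance is much smaller than the $L^1$ distance, the same function returns a negative, uninformative value.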

We now provide a heuristic showing that the reduction to a binary experiment can only be used if the parameter space contains small densities. Observe that in view of Lemma 2, we need to show that the Le Cam deficiency is lower bounded by a positive constant. The standard approach for nonparametric two hypothesis lower bounds is to consider one function as a local perturbation of the other. For a smooth, compactly supported function $K$ with $\int K=0$, set

$$g_n=f_n+h_n^\beta K\Big(\frac{\cdot-x_0}{h_n}\Big), \tag{10}$$

where $x_0$ is fixed and $h_n\to 0$. If $f_n$ is a $\beta$-Hölder smooth function, a standard argument shows that $g_n$ is also $\beta$-Hölder smooth. If the perturbation is small enough, then $g_n$ is a density since $\int K=0$. The perturbation has height of order $h_n^\beta$ and support of length of order $h_n$, which means that the total variation distance is of the order $h_n^{\beta+1}$. To ensure that $\|f_n-g_n\|_1\asymp 1/n$, we therefore take $h_n\asymp n^{-1/(\beta+1)}$. On the other hand, the squared Hellinger distance satisfies

$$\big\|\sqrt{f_n}-\sqrt{g_n}\big\|_2^2=\int\frac{(f_n-g_n)^2}{(\sqrt{f_n}+\sqrt{g_n})^2}\asymp\frac{h_n^{2\beta+1}}{f_n(x_0)+h_n^\beta}. \tag{11}$$

To ensure the right hand side is of order $1/n\asymp h_n^{\beta+1}$, we consequently need $f_n(x_0)\lesssim h_n^\beta\asymp n^{-\beta/(\beta+1)}$. If all densities are bounded away from zero, a different approach based on a multiple testing problem is needed to obtain sharp lower bounds [22].

To summarize, we have used that in the density estimation model the total variation of the product measures $P^n_f$ and $P^n_g$ can be bounded in terms of the total variation between the densities $f$ and $g$. On the contrary, in the Gaussian white noise model the total variation distance is a function of the Hellinger distance of $f$ and $g$. The total variation distance is bounded from below by the squared Hellinger distance and from above by the Hellinger distance via (8). Nonequivalence can therefore be established using the inequality (9) if the total variation between $f$ and $g$ behaves like the squared Hellinger distance, which happens exactly when the densities are small.
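The last point can be made concrete with a small numerical experiment. The sketch below perturbs a locally constant base level by a bump of height $h^\beta$ on a window of length $h$, in the spirit of (10), and compares the squared Hellinger distance to the $L^1$ distance. Everything here is a simplifying assumption for illustration: the base function is taken constant on the window, the kernel is the hypothetical choice $K(u)=\sin(2\pi u)$ on $[0,1]$ (which integrates to zero but is not smooth at the endpoints), and the function name `l1_and_hellinger_sq` is made up.

```python
import math


def l1_and_hellinger_sq(level: float, h: float, beta: float, grid_size: int = 20_000):
    """L1 distance and squared Hellinger-type distance between
    f = level and g = level + h^beta * sin(2*pi*u) on a window of length h,
    approximated by the midpoint rule. Requires level >= h^beta so g >= 0."""
    dx = h / grid_size
    l1 = 0.0
    hell_sq = 0.0
    for i in range(grid_size):
        u = (i + 0.5) / grid_size          # rescaled position in [0, 1]
        f = level
        g = max(level + h**beta * math.sin(2.0 * math.pi * u), 0.0)
        l1 += abs(f - g) * dx
        hell_sq += (math.sqrt(f) - math.sqrt(g)) ** 2 * dx
    return l1, hell_sq


h, beta = 0.01, 1.0

# Small density: base level of the same order h^beta as the perturbation.
l1_small, hell_small = l1_and_hellinger_sq(h**beta, h, beta)
ratio_small = hell_small / l1_small

# Large density: base level bounded away from zero.
l1_large, hell_large = l1_and_hellinger_sq(1.0, h, beta)
ratio_large = hell_large / l1_large

print(f"small density: H^2 / L1 = {ratio_small:.4f}")
print(f"large density: H^2 / L1 = {ratio_large:.4f}")
```

For the small base level the ratio is of constant order, so the squared Hellinger distance is comparable to the total variation; for the large base level the ratio is of order $h^\beta$, matching the scaling in (11).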

## 3 Proofs

We construct two test functions and use that the Le Cam deficiency is bounded from below by the difference of the total variation distances. To prove Theorem 1, it is by (9) enough to show that for some densities

$$\frac12\Big(1-\frac{\|f_{1,n}-f_{2,n}\|_1}{2}\Big)^n-\Phi\big(-\sqrt{n}\,\|\sqrt{f_{1,n}}-\sqrt{f_{2,n}}\|_2\big)\ \ge\ 0.007+o(1). \tag{12}$$

We henceforth omit the index for convenience, writing and Before we describe the construction of we first recall the following basic property of functions in the flat Hölder space

###### Lemma 3 (Lemma 1 in [21]).

Suppose that with and let be any constant satisfying Then for

$$|h|\le a\Big(\frac{|f(x)|}{\|f\|_{H^\beta}}\Big)^{1/\beta},$$

we have

$$|f(x+h)-f(x)|\le\frac12|f(x)|,$$

implying in particular, .

Construction of For a given density we consider two perturbations of for an such that This choice is natural in view of (11). The way we construct the perturbations is for technical convenience slightly different than in (10). By assumption there exist densities such that for some for all Without loss of generality, we may assume that We must ensure that we can apply Lemma 3 on the support of the perturbations, which motivates the following definitions. With a solution of set

$$F:=\frac{a\,f_0(x_0)^{\frac{\beta+1}{\beta}}}{4R^{\frac1\beta}}\le\frac{1}{16n}$$

and observe that Given pick such that By Lemma 3, , which implies

 x2≤x0+af0(x0)1/β/R1/β≤1/2+R−1β(β+1)n−ββ+1, (13)

so that for large enough.

Let be a non-negative function supported on and satisfying For the solution of consider the two test functions

$$f_j(x)=f_0(x)\Big(1-\gamma F+\gamma K\Big(\frac{F_0(x)-F_0(x_{j-1})}{F}\Big)\Big),\quad j\in\{1,2\}, \tag{14}$$

where is the distribution function of . Figure 1 displays an example of this construction. Since , it follows that and for By substitution, and thus the are densities. Moreover, and have disjoint support. We also have the following proposition that is proved below.

###### Proposition 2.

There exists a finite constant , not depending on , such that

Using and gives

 1−32n≤fj(x)f0(x)≤1+(1+√8/a)2∥K∥∞.

It therefore follows that for

 c=max(C,4,1+(1+√8/a)2∥K∥∞) (15)

and thus by assumption We now establish (12) for these

Lower bound for : Using and substituting we get Since

$$\frac12\Big(1-\frac{\|f_1-f_2\|_1}{2}\Big)^n\ge\frac12\Big(1-\frac{3}{2n}\Big)^n\to\frac12e^{-\frac32}. \tag{16}$$
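To see what (12) and (16) jointly demand of the second term, one can compute how large $\sqrt n\,\|\sqrt{f_1}-\sqrt{f_2}\|_2$ must be for $\tfrac12e^{-3/2}-\Phi(-t)$ to reach $0.007$. The sketch below does this by bisection. It is only arithmetic implied by the two displays; the actual bound on the Hellinger term derived in the source is not reproduced here, and the function names are hypothetical.

```python
import math


def std_normal_cdf(t: float) -> float:
    """Distribution function Phi of a standard normal random variable."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def first_term(n: int) -> float:
    """Right-hand side of (16) before taking the limit."""
    return 0.5 * (1.0 - 3.0 / (2.0 * n)) ** n


def min_separation(target: float = 0.007, iters: int = 100) -> float:
    """Smallest t with 0.5 * exp(-3/2) - Phi(-t) >= target, by bisection.

    The left-hand side is increasing in t, negative at t = 0 and
    positive at t = 10, so bisection on [0, 10] is valid.
    """
    limit = 0.5 * math.exp(-1.5)
    lo, hi = 0.0, 10.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if limit - std_normal_cdf(-mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi


t_min = min_separation()
print(f"limit of first term      : {0.5 * math.exp(-1.5):.5f}")
print(f"first term at n = 10**6  : {first_term(10**6):.5f}")
print(f"required sqrt(n)*||.||_2 : {t_min:.4f}")
```

In other words, given (16), the bound (12) holds asymptotically as soon as $\sqrt n\,\|\sqrt{f_1}-\sqrt{f_2}\|_2$ is at least the printed value of `t_min`.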

Upper bound for : This is equivalent to lower bounding Splitting the integral into using the properties of substitution and the Cauchy-Schwarz inequality yields

 ∥√f1−√f2∥22 =2F∫10(