Density Estimation with Contaminated Data: Minimax Rates and Theory of Adaptation

Haoyang Liu et al. · December 21, 2017

This paper studies density estimation under pointwise loss in the setting of a contamination model. The goal is to estimate f(x_0) at some x_0 ∈ ℝ with i.i.d. observations X_1, …, X_n ∼ (1−ϵ)f + ϵg, where g stands for a contamination distribution. In the context of multiple testing, this can be interpreted as estimating the null density at a point. We carefully study the effect of contamination on estimation through the following model indices: the contamination proportion ϵ, the smoothness β_0 of the target density, the smoothness β_1 of the contamination density, and the level of contamination m at the point to be estimated, i.e., g(x_0) ≤ m. It is shown that the minimax rate with respect to the squared error loss is of order [n^{-2β_0/(2β_0+1)}] ∨ [ϵ²(1∧m)²] ∨ [n^{-2β_1/(2β_1+1)} ϵ^{2/(2β_1+1)}], which characterizes the exact influence of contamination on the difficulty of the problem. We then establish the minimal cost of adaptation to the contamination proportion, to the smoothness, and to both parameters. It is shown that some small price needs to be paid for adaptation in any of the three cases. Variations of Lepski's method are considered to achieve optimal adaptation. The problem is also studied when there is no smoothness assumption on the contamination distribution. This setting, which allows for an arbitrary contamination distribution, is recognized as Huber's ϵ-contamination model. The minimax rate is then shown to be [n^{-2β_0/(2β_0+1)}] ∨ [ϵ^{2β_0/(β_0+1)}]. The adaptation theory is also different from the smooth contamination case. While adaptation to either the contamination proportion or the smoothness only costs a logarithmic factor, adaptation to both parameters is proved to be impossible.


1 Introduction

Nonparametric density estimation is a well-studied classical topic [21, 8, 23]. In this paper, we consider this classical statistical task with a modern twist. Instead of assuming i.i.d. observations from a true density f, we assume

X_1, …, X_n ∼ (1−ϵ)f + ϵg,   (1)

where g is a density not related to f, and the goal is to estimate f(x_0) at some x_0 ∈ ℝ. In other words, for each observation, there is probability ϵ that it is sampled from a distribution not related to the density of interest.
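To fix ideas, here is a minimal simulation sketch of the data-generating process (1); the particular choices of f and g (a standard normal target and a shifted normal contamination) are illustrative only and not from the paper.

```python
import numpy as np

def sample_contaminated(n, eps, rng=None):
    """Draw n i.i.d. observations from (1 - eps) * f + eps * g.

    Illustrative choices (not from the paper): f = N(0, 1), g = N(3, 0.25).
    """
    rng = np.random.default_rng(rng)
    from_g = rng.random(n) < eps                          # each point is contaminated w.p. eps
    x = rng.normal(0.0, 1.0, size=n)                      # draws from the target density f
    x[from_g] = rng.normal(3.0, 0.5, size=from_g.sum())   # overwrite with draws from g
    return x
```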

This problem naturally appears in both the robust statistics and the multiple testing literature. In the robust statistics literature, g has the name “contamination”, and the task is interpreted as robustly estimating a density with contaminated data points [6]. In the multiple testing literature, f and g are respectively called the null density and the alternative density, and the task is interpreted as estimating the null density at a point [11]. In this paper, we use the name “contamination” to refer to both g and the observations generated from it.

The nature of the problem heavily depends on the assumptions put on f and g. When there is no constraint on the contamination distribution g, the data-generating process (1) is also recognized as Huber’s ϵ-contamination model [13, 14]. Recent work on nonparametric estimation in such a setting includes [6, 12], and the influence of contamination on minimax rates is investigated by [7, 6]. On the other hand, in the multiple testing literature, it is more common to put parametric structural assumptions on the alternative g, and optimal rates of estimating the null density are investigated by [15, 3].

In this paper, we explore this problem with connections to the nonparametric density estimation literature in mind. Specifically, the density function f is assumed to have Hölder smoothness β_0. Both cases of structured and arbitrary contamination are considered, and the fundamental limit of the problem is studied by establishing minimax rates. In the structured contamination case, the contamination density g is endowed with a Hölder smoothness β_1, and the contamination level at the point x_0 is assumed to satisfy g(x_0) ≤ m. The minimax rate of estimating f(x_0) with respect to the squared error loss is shown to be of order

[n^{-2β_0/(2β_0+1)}] ∨ [ϵ²(1∧m)²] ∨ [n^{-2β_1/(2β_1+1)} ϵ^{2/(2β_1+1)}].   (2)

The minimax rate involves three terms, and the influence of contamination on estimation is precisely characterized. The first term corresponds to the classical minimax rate of nonparametric estimation when there is no contamination. The second term is determined by the contamination at x_0: it depends on both the contamination proportion ϵ and the contamination level m. The last term is caused by contamination on a neighborhood of x_0, and it is present even if the contamination level at x_0 is zero. In the arbitrary contamination case, or equivalently under Huber’s ϵ-contamination model, the minimax rate is of order

[n^{-2β_0/(2β_0+1)}] ∨ [ϵ^{2β_0/(β_0+1)}].   (3)

Compared with (2), the rate (3) is easier to understand in terms of the influence of the contamination. It is interesting to note that even though β_0 is the smoothness index of f, it still appears in the second term of (3). Thus, when the contamination is arbitrary, its influence on estimation is also determined by the smoothness of the target density.
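As a numerical aid, the following sketch evaluates the orders of the rates (2) and (3) with all constants suppressed; it is only meant to indicate which term dominates in a given regime.

```python
def structured_rate(n, eps, m, beta0, beta1):
    """Order of the minimax rate (2); constants are suppressed."""
    return max(n ** (-2 * beta0 / (2 * beta0 + 1)),
               (eps * min(1.0, m)) ** 2,
               n ** (-2 * beta1 / (2 * beta1 + 1)) * eps ** (2 / (2 * beta1 + 1)))

def huber_rate(n, eps, beta0):
    """Order of the minimax rate (3) under arbitrary contamination."""
    return max(n ** (-2 * beta0 / (2 * beta0 + 1)),
               eps ** (2 * beta0 / (beta0 + 1)))
```

For instance, structured_rate(1e5, 0.1, 0.0, beta0=2, beta1=0.5) is driven by the third term even though m = 0: a rough contamination density affects the risk through the neighborhood of x_0.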

We also thoroughly investigate the theory of adaptation in both settings of contamination models. Depending on the specific setting, different adaptation costs are necessary. For the contamination model with structured contamination, when the contamination proportion ϵ is unknown, an optimal adaptive procedure can achieve the rate (2) with (1∧m) replaced by 1. When the smoothness is unknown, an optimal adaptive procedure can achieve the rate (2) with n replaced by n/log n. Similarly, for the contamination model with arbitrary contamination, the rate (3) can be achieved up to a logarithmic factor when either ϵ or β_0 is unknown. On the other hand, when both the contamination proportion and the smoothness are unknown, the adaptation theories are completely different for the two contamination models. For structured contamination, the adaptation cost is simply the combination of the cost of an unknown contamination proportion and that of unknown smoothness. In contrast, for arbitrary contamination, we show that adaptation is simply impossible when both ϵ and β_0 are unknown. In other words, no estimator can adaptively achieve a nontrivial rate simultaneously over configurations with different values of ϵ and β_0.

The theory of adaptation in nonparametric functional estimation without contamination is well studied in the literature. It is shown by [1, 17, 5] that a logarithmic factor must be paid for estimating a density at a point when the smoothness is not known. Adaptation costs of estimating other nonparametric functionals have been investigated in [18, 22, 16, 2, 4]. Compared with the results in the literature, the presence of contamination brings extra complications to the problem of adaptation. It is remarkable that the adaptation cost depends very sensitively on the specific setting and contamination model. The new phenomena revealed in our paper for adaptation with contamination have not been discovered before.

The rest of the paper is organized as follows. The contamination model with structured contamination is studied in Section 2 and Section 3. Results of minimax rates and costs of adaptation are given in Section 2 and Section 3, respectively. The corresponding theory of contamination model with arbitrary contamination is investigated in Section 4. In Section 5, we discuss extensions of our results to multivariate density estimation and a consistent procedure in the hardest scenario where adaptation is impossible. All proofs are given in Section 6.

We close this section by introducing notation that will be used later. For a, b ∈ ℝ, let a ∨ b = max(a, b) and a ∧ b = min(a, b). For an integer k, [k] denotes the set {1, 2, …, k}. For a positive real number x, ⌈x⌉ is the smallest integer no smaller than x and ⌊x⌋ is the largest integer no larger than x. For two positive sequences {a_n} and {b_n}, we write a_n ≲ b_n or b_n ≳ a_n if a_n ≤ C b_n for all n with some constant C > 0 independent of n. The notation a_n ≍ b_n means that both a_n ≲ b_n and b_n ≲ a_n hold. Given a set S, |S| denotes its cardinality, and 𝟙_S is the associated indicator function. We use P and E to denote generic probability and expectation whose distribution is determined from the context. The notation stands for . The class of infinitely differentiable functions on ℝ is denoted by C^∞. For two probability measures P and Q, the chi-squared divergence is defined as χ²(P‖Q) = ∫ (dP/dQ)² dQ − 1, and the total variation distance is defined as TV(P, Q) = (1/2) ∫ |dP − dQ|. Throughout the paper, C, c, and their variants denote generic constants that do not depend on n. Their values may change from place to place.

2 Minimax Rates with Structured Contamination

2.1 Results and Implications

Consider i.i.d. observations X_1, …, X_n ∼ (1−ϵ)f + ϵg. The goal is to estimate f at a given point. Without loss of generality, we aim to estimate f(0). In other words, for every i ∈ [n], the observation X_i is drawn from f with probability 1−ϵ and from g with probability ϵ. Thus, there are approximately nϵ observations that are not related to the density function f, and these are referred to as contamination.

To study the fundamental limit of estimating f(0) with contaminated data, we need to specify appropriate regularity conditions on both f and g. We first define the Hölder class by

Here, β stands for the smoothness parameter, and L stands for the radius of the function space. The Hölder class of density functions is defined as

Finally, we define the class of mixtures of the form (1−ϵ)f + ϵg by

This class is indexed by several numbers. Throughout the paper, we refer to ϵ as the contamination proportion and m as the contamination level at 0. The pair (β_0, L_0) controls the smoothness of the density function f that we want to estimate, and the pair (β_1, L_1) controls the smoothness of the contamination density g. Among the six numbers, ϵ and m are allowed to depend on the sample size n, but β_0, L_0, β_1, L_1 are all assumed to be constants that do not depend on n throughout the paper. It is also assumed that .
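For concreteness, the classes just described can be written in the following standard form; this is a sketch only, and the exact definitions in the paper (for instance, additional boundedness constraints) may differ.

```latex
% Standard-form sketch of the classes; the paper's exact definitions may differ.
% Hölder class with smoothness beta and radius L (here \ell is the largest integer < beta):
\mathcal{H}(\beta, L) = \Big\{ h :\ \big|h^{(\ell)}(x) - h^{(\ell)}(y)\big|
    \le L\,|x - y|^{\beta - \ell} \ \text{for all } x, y \in \mathbb{R} \Big\}

% Hölder class of density functions:
\mathcal{H}_D(\beta, L) = \Big\{ h \in \mathcal{H}(\beta, L) :\ h \ge 0,\ \int h = 1 \Big\}

% Class of mixtures indexed by (\epsilon, m, \beta_0, L_0, \beta_1, L_1):
\Big\{ (1-\epsilon) f + \epsilon g :\ f \in \mathcal{H}_D(\beta_0, L_0),\
       g \in \mathcal{H}_D(\beta_1, L_1),\ g(0) \le m \Big\}
```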

The minimax risk of estimation is defined as (notice that we suppress the dependence on for )

where the notation is used to denote the mixture density (1−ϵ)f + ϵg, which we will later abbreviate. Obviously, the minimax risk becomes smaller if ϵ gets smaller or n gets larger. Besides the roles of ϵ and n, the other model indices are also expected to affect the difficulty of the problem, as listed in the following.

  • The smoothness β_0 of f: From classical density estimation theory, we know that the smoother f is, the easier it is to estimate f(0).

  • The level m of g at 0: Intuitively, the smaller g(0) is, the smaller its influence on the mixture at 0, and thus the easier the problem is.

  • The smoothness β_1 of g: Intuitively, the smoother g is, the less the contamination effect can spread, and thus the easier it is to account for the effect of g in the contamination model.

We now present the following theorem on the minimax rate, which justifies the intuition above.

Theorem 2.1.

Under the setting above, we have

inf_{f̂} sup E(f̂ − f(0))² ≍ [n^{-2β_0/(2β_0+1)}] ∨ [ϵ²(1∧m)²] ∨ [n^{-2β_1/(2β_1+1)} ϵ^{2/(2β_1+1)}].   (4)

In other words, with the supremum taken over the mixture class defined above, the minimax risk can be upper and lower bounded by the right-hand side of (4) up to constants that do not depend on n, ϵ, or m.

Theorem 2.1 completely characterizes the difficulty of estimating f(0) with contaminated data. The three terms in the rate (4) have different but very clear meanings. The first term is the classical minimax rate of estimating a β_0-smooth function at a given point without contamination. The second term is proportional to the square of the product of the contamination level and the contamination proportion. The last term is perhaps the most interesting: here the effect of ϵ is raised to an exponent that depends on β_1, and it captures the interaction between the contamination proportion and the contamination smoothness. The fact that this term does not depend on m implies that we have to pay this price with contaminated data even if m = 0.

To further understand the implications of Theorem 2.1, we present the following illustrative special cases of the minimax rate (4). First, when ϵ = 0, we get

n^{-2β_0/(2β_0+1)}.

This is simply the classical minimax rate of estimating f(0) without contamination.

Next, to understand the role of m, we consider the two extreme cases m = 0 and m = ∞. From (4), we have

[n^{-2β_0/(2β_0+1)}] ∨ [n^{-2β_1/(2β_1+1)} ϵ^{2/(2β_1+1)}]   when m = 0,

and

[n^{-2β_0/(2β_0+1)}] ∨ [ϵ²]   when m = ∞.

The case of m = 0 is particularly interesting. It implies g(0) = 0, and one may expect that the contamination would then have no influence on the minimax rate. This intuition is not true because of the term n^{-2β_1/(2β_1+1)} ϵ^{2/(2β_1+1)}. Since nonparametric estimation of f(0) also depends on the values of the density function in a neighborhood of 0, the contamination from g can still have an effect on this neighborhood despite the fact that g(0) = 0. A smaller value of β_1 allows a greater perturbation by g on the neighborhood of 0. When m = ∞, the minimax rate has the simple form [n^{-2β_0/(2β_0+1)}] ∨ [ϵ²]. The influence of the contamination on the minimax rate is then always ϵ², regardless of the smoothness β_1.

Finally, we consider the cases β_1 = 0 and β_1 = ∞. In fact, the Hölder class with β_1 = ∞ is not well defined, but the discussion below still holds for a sufficiently large constant β_1. From (4), we have

[n^{-2β_0/(2β_0+1)}] ∨ [ϵ²]   when β_1 = 0,

and

[n^{-2β_0/(2β_0+1)}] ∨ [ϵ²(1∧m)²]   when β_1 = ∞.

The influence of the contamination takes the forms ϵ² and ϵ²(1∧m)² in the two extreme cases. This immediately implies that for any values of β_1 and m, we have

[n^{-2β_0/(2β_0+1)}] ∨ [ϵ²(1∧m)²]  ≲  minimax rate (4)  ≲  [n^{-2β_0/(2β_0+1)}] ∨ [ϵ²].

In other words, the influence of contamination on the minimax rate is sandwiched between ϵ²(1∧m)² and ϵ².

2.2 Upper Bounds

The minimax rate (4) can be achieved by a simple kernel density estimator that takes the form

f̂(0) = (n(1−ϵ)h)^{-1} Σ_{i=1}^n K(X_i/h).   (5)

This estimator is slightly different from the classical kernel density estimator because it is normalized by n(1−ϵ) instead of n. The knowledge of the contamination proportion ϵ is critical to achieving the minimax rate (4). Later, we will show in Section 3.2 that the minimax rate (4) cannot be achieved if ϵ is not known.

We introduce the following class of kernel functions.

The class collects all bounded and square-integrable kernel functions of a given order. The order is assumed to be a constant throughout the paper. We refer to [8] for examples of kernel functions in this class.
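A minimal implementation sketch of the estimator (5) follows, assuming the n(1−ϵ)h normalization described above; the Epanechnikov kernel (an order-2 kernel) is used purely for illustration.

```python
import numpy as np

def epanechnikov(u):
    """Order-2 kernel, used here only for illustration."""
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0)

def kde_at_zero(x, h, eps, kernel=epanechnikov):
    """Estimator (5): kernel estimate of f(0) normalized by n * (1 - eps) * h.

    x   : array of observations X_1, ..., X_n
    h   : bandwidth
    eps : known contamination proportion
    """
    x = np.asarray(x)
    return kernel(x / h).sum() / (len(x) * (1.0 - eps) * h)
```

Setting eps = 0 recovers the classical estimator; the knowledge of ϵ enters through the normalization and, in Theorem 2.2, through the choice of the bandwidth.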

Theorem 2.2.

For the estimator with some and , we have

Theorem 2.2 reveals an interesting choice of the bandwidth h. Compared with the optimal bandwidth of order n^{-1/(2β_0+1)} in classical nonparametric function estimation, the bandwidth in the structured contamination setting is always smaller. This choice of bandwidth is a consequence of the specific bias-variance tradeoff under the structured contamination model. As an interesting contrast, in the case of arbitrary contamination, the optimal choice of bandwidth is always larger than the usual one; see Section 4.

The error bound in Theorem 2.2 can be found through a classical bias-variance tradeoff argument. We can decompose the difference as

(6)

Here, the first term is the stochastic error, the second term gives the approximation error of the kernel convolution, and the last term is caused by the contamination. Direct analysis of the three terms gives the bound

(7)

Now, with an appropriate choice of h, we obtain the error bound in Theorem 2.2. For a detailed derivation, see the proof of Theorem 2.2 in Section 6.1.
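To see where the third term in (2) arises, here is a back-of-the-envelope balancing sketch, assuming (up to constants) a stochastic error of order 1/(nh) and a contamination-induced squared bias of order ϵ²h^{2β_1} when m = 0:

```latex
\frac{1}{nh} \asymp \epsilon^2 h^{2\beta_1}
\quad\Longrightarrow\quad
h \asymp (n\epsilon^2)^{-\frac{1}{2\beta_1+1}}
\quad\Longrightarrow\quad
\frac{1}{nh} \asymp n^{-\frac{2\beta_1}{2\beta_1+1}}\,\epsilon^{\frac{2}{2\beta_1+1}}.
```

This matches the third term of (2), and is consistent with the observation above that the bandwidth under structured contamination is never larger than the classical choice n^{-1/(2β_0+1)}.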

2.3 Lower Bounds

In this section, we study the lower bound part of the minimax rate (4). We first state a theorem.

Theorem 2.3.

We have

The first term is the classical minimax lower bound for nonparametric estimation. Thus, we only give here an overview of how to derive the second and the third terms. Two specific functions are used as building blocks for our construction, and their definitions and properties are summarized in the following two lemmas.

Lemma 2.1.

Let . Define

The constant is chosen so that . It satisfies the following properties:

  1. is an even density function compactly supported on .

  2. .

  3. For any constants , there exists a constant , such that .

  4. For any small constant , is uniformly lower bounded by a positive constant on , and it is uniformly upper bounded by a positive constant on .

Lemma 2.2.

Let . Define

It satisfies the following properties:

  1. is an even function compactly supported on .

  2. For any , there exists a constant such that .

  3. is uniformly lower bounded by a positive constant on , and is uniformly upper bounded by a positive constant on .

  4. .

The proofs of the second and the third terms in the lower bound both involve careful constructions of two pairs of densities. To establish the second term, we consider the following constructions,

Here, the constants are chosen so that the constructed functions are well-defined densities in the desired parameter spaces. It is easy to check that with the above construction,

This implies that, in the presence of contamination, an estimator cannot distinguish between the two data-generating processes. As a consequence, an error of the order of the second term in (4) cannot be avoided.

The derivation of the lower bound for the third term is more intricate. Consider the following four functions,

where the definitions of the functions are given in Lemma 2.1 and Lemma 2.2. Again, the constants are chosen properly so that the constructed functions are well-defined densities in the desired function classes.

A dominant feature of this construction is that one density is obtained from the other by a perturbation at two levels, with two different bandwidths, while the usual lower bound proof in nonparametric estimation involves perturbing a function at a single bandwidth level. The first level of perturbation serves to cancel the effect of the corresponding perturbation on the mixture density, while the second perturbation serves to ensure the constraint on the contamination level. Indeed, if we relate the two perturbation levels through an appropriate equation, then it is direct to check that the constructed contamination density functions both have contamination level m. An illustration of this construction with a two-level perturbation is given in Figure 1.

Figure 1: An illustration of the construction of .

The colors of the plot correspond to those in the formulas.

With the above construction, it is not hard to check that

In order that an estimator cannot distinguish between the two densities and , a sufficient condition is (see Lemma 6.1), which leads to the choice of at the order . As a consequence, an error of order

cannot be avoided. A rigorous proof of Theorem 2.3 will be given in Section 6.2.

3 Adaptation Theory with Structured Contamination

3.1 Summary of Results

To achieve the minimax rate in Theorem 2.1, the kernel density estimator (5) requires the knowledge of the contamination proportion ϵ and of the smoothness parameters. In this section, we discuss adaptive procedures that estimate f(0) without the knowledge of these parameters. However, adaptation to ϵ or to the smoothness is not free, and one can only achieve rates slower than the minimax rate (4). The adaptation cost varies across the different scenarios. A summary of our results is listed below.

  • When the contamination proportion ϵ is unknown, the best possible rate is [n^{-2β_0/(2β_0+1)}] ∨ [ϵ²].

  • When the smoothness parameters are unknown, the best possible rate is [(n/log n)^{-2β_0/(2β_0+1)}] ∨ [ϵ²(1∧m)²] ∨ [(n/log n)^{-2β_1/(2β_1+1)} ϵ^{2/(2β_1+1)}].

  • When both the contamination proportion and the smoothness are unknown, the best possible rate becomes [(n/log n)^{-2β_0/(2β_0+1)}] ∨ [ϵ²].

Compared with the minimax rate (4), the ignorance of the contamination proportion implies that ϵ²(1∧m)² is replaced by ϵ² in the rate, while the ignorance of the smoothness implies that n is replaced by n/log n in the rate.

3.2 Unknown Contamination Proportion

The kernel density estimator (5) depends on ϵ in two ways: the normalization through 1−ϵ and the optimal choice of the bandwidth h. Without the knowledge of ϵ, we consider the following estimator

f̃(0) = (nh)^{-1} Σ_{i=1}^n K(X_i/h).   (8)

The first difference between (8) and (5) is the normalization. When ϵ is not given, we can only normalize by n in (8). Moreover, the choice of the bandwidth h in (8) cannot depend on ϵ.
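In code, a sketch of the estimator (8) is simply the classical kernel density estimator at 0 (again with an illustrative Epanechnikov kernel):

```python
import numpy as np

def kde_at_zero_no_eps(x, h):
    """Estimator (8): classical kernel estimate of f(0), normalized by n * h only,
    since the contamination proportion eps is unknown."""
    x = np.asarray(x)
    u = x / h
    k = 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0)   # Epanechnikov kernel (illustration)
    return k.sum() / (len(x) * h)
```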

Theorem 3.1.

For the estimator with some and , we have

With the choice h ≍ n^{-1/(2β_0+1)}, the estimator (8) becomes the classical nonparametric density estimator. The contamination results in an extra ϵ² in the rate compared with the classical nonparametric minimax rate, regardless of the values of β_1 and m. Note that in the current setting, the error has the following decomposition,

(9)

The difference between (6) and (9) results from the different normalizations in (5) and (8). Some standard calculation gives the bound

which implies the optimal choice of bandwidth , and thus the rate in Theorem 3.1. A detailed proof is given in Section 6.1.

In view of the form of the minimax rate (4), the rate given by Theorem 3.1 can be obtained by replacing the (1∧m) in (4) with 1. A matching lower bound for adaptation to ϵ is given by the following theorem.

Theorem 3.2.

Consider two models and with different contamination proportions. For any estimator that satisfies

for some constant , there must exist another constant , such that for , we have

Theorem 3.2 shows that it is impossible to achieve a rate faster than ϵ² even over only two different contamination proportions. The proof of Theorem 3.2 relies on the following construction,

With an appropriate choice of the constant , we have and . Moreover, it is easy to check that

In other words, a model with one contamination proportion can also be written as a mixture that uses a different proportion. Unless the contamination proportion is specified, one cannot tell the difference between the two representations. This leads to a lower bound on the error of order ϵ². A rigorous proof of Theorem 3.2, which uses a constrained risk inequality from [1], is given in Section 6.3.

3.3 Unknown Smoothness

In this section, we consider the case where the smoothness parameters are unknown, but the contamination proportion ϵ is given. In view of the kernel density estimator (5) that achieves the minimax rate, we can still use the normalization by n(1−ϵ) because ϵ is known, but the bandwidth needs to be picked in a data-driven way. For a given bandwidth h, define

With a discrete set of candidate bandwidths and a tuning constant, Lepski’s method [18, 19, 20] selects a data-driven bandwidth through the following procedure,

(10)

In words, we choose the largest bandwidth below which the variance dominates. If the set that is maximized over is empty, we will use the convention . The estimator that uses a data-driven bandwidth enjoys the following guarantee.

Theorem 3.3.

Consider the adaptive kernel density estimator with the bandwidth defined by (10). In (10), we set such that and to be a sufficiently large constant. The kernel is selected from with a large constant . Then, we have

Lepski’s method is known to be adaptive over various nonparametric classes, and it can achieve minimax rates up to a logarithmic factor without knowledge of the smoothness parameter [17]. Theorem 3.3 shows that this is also the case with contaminated observations: with an adaptive kernel density estimator normalized by n(1−ϵ), the minimax rate (4) is achieved up to a logarithmic factor.
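The following sketch implements a generic Lepski-type version of the selection rule (10); the threshold constant, the grid, and the fallback convention are illustrative choices rather than the paper’s exact calibration.

```python
import numpy as np

def _kde_at_zero(x, h, eps):
    """Kernel estimate of f(0) with the (1 - eps) normalization (Epanechnikov kernel)."""
    u = x / h
    k = 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0)
    return k.sum() / (len(x) * (1.0 - eps) * h)

def lepski_bandwidth(x, eps, grid, c0=1.0):
    """Generic Lepski-type rule: pick the largest h such that, for every smaller h'
    in the grid, the estimates at h and h' differ by at most a multiple of the
    stochastic-error level sqrt(log n / (n * h')).  Falls back to the smallest
    bandwidth if no larger candidate passes (an illustrative convention)."""
    x = np.asarray(x)
    n = len(x)
    grid = sorted(grid)
    est = {h: _kde_at_zero(x, h, eps) for h in grid}
    selected = grid[0]
    for i, h in enumerate(grid):
        if all(abs(est[h] - est[hp]) <= c0 * np.sqrt(np.log(n) / (n * hp))
               for hp in grid[:i]):
            selected = h
    return selected
```

The selected bandwidth would then be plugged into the estimator normalized by n(1−ϵ), as in Theorem 3.3.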

A comparison between the adaptive rate given by Theorem 3.3 and the minimax rate (4) reveals two differences. The first adaptation cost is given by (n/log n)^{-2β_0/(2β_0+1)}, compared with n^{-2β_0/(2β_0+1)} in (4). Previous work in adaptive nonparametric estimation [1, 17, 2] implies that this cost is unavoidable for adaptation to smoothness. The second adaptation cost is given by (n/log n)^{-2β_1/(2β_1+1)} ϵ^{2/(2β_1+1)}, compared with n^{-2β_1/(2β_1+1)} ϵ^{2/(2β_1+1)} in (4). In the next theorem, we show that this adaptation cost is also unavoidable without knowledge of the smoothness parameters.

Theorem 3.4.

Consider two models and with different smoothness parameters. Assume that , , and . For any estimator that satisfies

for some constant , we must have

Similar to the statement of Theorem 3.2, Theorem 3.4 shows that it is impossible to achieve a rate that is faster than across two function classes with different smoothness parameters. We remark that the assumptions and in Theorem 3.4 are necessary conditions for to dominate