Nonparametric Confidence Regions for Level Sets: Statistical Properties and Geometry

03/04/2019 ∙ by Wanli Qiao, et al. ∙ University of California-Davis George Mason University 0

This paper studies and critically discusses the construction of nonparametric confidence regions for density level sets. Methodologies based on both vertical variation and horizontal variation are considered. The investigations provide theoretical insight into the behavior of these confidence regions via large sample theory. We also discuss the geometric relationships underlying the construction of horizontal and vertical methods, and how finite sample performance of these confidence regions is influenced by geometric or topological aspects. These discussions are supported by numerical studies.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

For a density function on we define the superlevel set of at level and the corresponding isosurface or contour as

respectively. The dependence on the level is suppressed in our notation, for is fixed throughout the manuscript. We will study methods for constructing confidence regions for and based on an iid sample from . While new methods for constructing confidence regions are proposed below, and a comparison of these and other existing methods is provided, the overarching goal of this work is to provide a critical discussion of the advantages and disadvantages of the various existing methods. This will, among others, also lead to insight about the influence of geometry on the statistical performance of the confidence regions.

The estimation of level sets (or isosurfaces) has received quite some interest in the literature. For some earlier work in the context of density level set estimation see Hartigan (1987), Polonik (1995), Cavalier (1997), Tsybakov (1997), Walther (1997). There are relations to density support estimation, (e.g. Cuevas et al. 2004, Cuevas 2009), clustering (e.g. Cuevas et al. 2000, Rinaldo et al. 2010), classification (e.g., Mammen and Tsybakov, 1999, Hall and Kang, 2005, Steinwart et al. 2005, Audibert and Tsybakov 2007), anomaly detection (e.g. Breuning et al. 2000, Hodge and Austin 2004), and more. Bandwidth selection for level set estimation is considered by Samworth and Wand (2010), and Qiao (2018a). Applications of level set estimation exist in many fields, such as astronomy (Jang 2006), medical imaging (Willett and Nowak, 2005), geoscience (Sommerfeld et al. 2015), and others. See Mason and Polonik (2009) and Mammen and Polonik (2013) for further literature. Level sets of functions also play a central role in topological data analysis, in particular in ‘persistent homology’, where the topological properties of level sets as a function of the parameter

are used for statistical analysis; e.g. see Fasy et al. (2014).

All the methods discussed in this paper are based on a kernel estimator

and its derivatives. Here is a -dimensional kernel, and is the bandwidth. As estimators for the superlevel sets or the corresponding contours, we are considering plug-in estimators given by

It is well-known that is a biased estimator. For a twice continuously differentiable kernel , we will also consider a de-biased version of the kernel estimator, given by with

where denotes a kernel density estimator using bandwidth , which can be different from . See Chen (2017) for a recent study of the de-biased estimator.

In some of the recent related literature (see, e.g., Chen et al., 2017), the bias is ignored entirely, and the target is redefined as a ‘smoothed’ version of the superlevel set, or its contour, given by

respectively, where here and in what follows, we use the superscript to indicate that the definition is based on an ‘expected quantity’. We will also consider this target for some of our methods.
For an interval and a real valued function , we denote by the pre-image of under . For , we also use the standard notation for the pre-image. Two different types of confidence regions are considered. One of them is based on vertical variation. Such confidence regions for or are of the form

(1.1)

Corresponding lower and upper confidence regions for (and ) will be of the form and respectively. The crucial question then is, how to choose the quantity in order to achieve a good performance. The other type of confidence region considered here is based on horizontal variation. Such confidence regions for (or ) are of the form

(1.2)

where the random variable

either does not depend on , i.e. , or , for some random variable not depending on . Once horizontal variation based confidence regions for are constructed, we can obtain corresponding confidence regions for : Upper bounds (sets) are obtained by adding to , and lower bounds are constructed by subtraction. Confidence regions for and are constructed similarly.

For a simple heuristic underlying the choice of

, consider the one-dimensional case. For small , we have for , so that reflects vertical variation, while represents horizontal variation. For more details see Section 3.
Geometrically, confidence regions based on horizontal variation as in (1.2), but with a constant consist of tubes of constant width put around the estimated contour In other words, the maximum (horizontal) distance to or is being controlled. Confidence regions based on vertical variation as in (1.1), as well as the regions based on horizontal variation with non-constant have varying width, which contain information about the slope of the density. Different constructions of and will be discussed. One of them is based on extreme value theory for Gaussian fields indexed by manifolds, and the others on various bootstrap approximations. Some of the constructions of the horizontal methods also involve the estimation of integral curves.

The same type of confidence regions as discussed above, can also be constructed with replaced by the de-biased density estimator . We note that in the literature, different approaches to the removal of bias effect in confidence bands or regions using kernel-type estimators. One is based on undersmoothing of the original kernel density estimator. This approach is not considered in some detail here (but see Remark 2.2 below). Alternatively, one can use explicit bias corrections, or the smoothed bootstrap. Both of these approaches are being discussed in our work. For further relevant recent literature see Chen (2017) and Calonico et al. (2018a).

Bootstrap confidence regions for a density superlevel set and/or isosurface based on vertical variation can be found in Mammen and Polonik (2013) and Chen et al. (2017). The latter also constructed a confidence set based on horizontal variation, and such constructions can also be found in Chen (2017). The confidence sets proposed here are compared to these methods. In a different setting Jankowski and Stanberry (2014) consider confidence regions for the expected value of a random sets (or its boundary) based on repeated observations of the random set.
In summary, the contributions of the manuscript are to

  • derive asymptotically valid confidence regions based on vertical variation using extreme value distributions of kernel estimators indexed by certain manifolds;

  • use bootstrap methods to construct confidence regions based on vertical variation in order to improve finite sample performance of the confidence regions in (i);

  • derive asymptotically valid bootstrap confidence regions based on horizontal variation;

  • discuss geometric connections between the different constructions of confidence regions;

  • provide numerical studies to compare the finite sample performance of the various confidence regions developed here and in the literature;

  • critically discuss advantages and disadvantages of the two different types of constructing confidence regions (horizontal and vertical variation).

The theoretical result underlying (i), see Theorem 2.1, provides a closed form of asymptotically valid confidence regions for superlevel sets and isosurfaces. To the best of our knowledge, this is the first result of this type, and it provides important qualitative insight into the underlying problem. Due to the well-known slow convergence properties of extreme value distributions, the construction of bootstrap confidence regions appear of higher practical relevance.

In a recent paper, Qiao and Polonik (2018) derived the asymptotic distribution of extrema of certain rescaled non-homogeneous Gaussian fields. This result provides an important tool for the construction of our confidence regions based on large sample distribution theory. The paper Qiao and Polonik (2016) on ridge (or filament) estimation is also a source of inspiration for the work presented here.

The paper is organized as follows. Section 2

is considering vertical variation based methods. We first present a result on the behavior of the coverage probability of asymptotic confidence regions for isosurfaces and levels sets, and then we construct a bootstrap-based confidence region. Section 

3 is dedicated to horizontal methods. Section 3.1 presents the construction of two confidence regions of the form (1.2) using integral curves. Section 3.2 discusses a horizontal bootstrap based confidence region related to the Hausdorff-distance based method of Chen et al. (2017). In Section 4 we discuss how the finite sample behavior of our methods is influenced by some geometric or topological properties of estimated superlevel sets etc. Simulation results presented in Section 5 compare the various methods. Section 6 presents some concluding discussions. The proofs of the technical results are presented in Section 7.

2 Confidence regions based on vertical variation

The following notation is used throughout the manuscript. For a sequence we let

(2.1)

For a kernel density estimator based on a bandwidth the quantity equals the rate of the uniform deviation from the density under standard assumptions (satisfied in our setting). In other words, we have uniform consistency of the kernel estimator if as The quantities have the same interpretation when considering the kernel estimator of the -th derivatives of the density. Similarly, is the standard uniform rate of convergence of the kernel density estimator of the -th derivative when centered at its expectation, .

2.1 Confidence regions based on asymptotic distribution

Our first result provides asymptotically valid confidence regions for the isosurface and the superlevel set . Before formulating the theorem we introduce the underlying assumptions and some more notation.
Assumptions:
(F1) The probability density has bounded, continuous derivatives up to fourth order. There exists such that .
(F2) There exist and such that for
(K) The kernel function is symmetric about zero, with support contained in , and is continuously differentiable up to fourth order.
(A) Both and are continuously differentiable up to -th order.
(H1). , and as , where or .
(H2) The bandwidth used for bias correction, satisfies and , where or . In addition, and .

Remark 2.1


a) Assumption (F1) can be weakened when only considering confidence regions for the smoothed isosurface . Only continuous second order derivatives of are needed in this case. For simplicity, we just use the stronger assumption (F1) throughout the manuscript.
b) Assumptions (
F1) (even the weakened version discussed in 1.) and (F2) together imply that the -dimensional isosurface has positive reach (e.g. see Lemma 1 of Chen et al., 2017), which is a necessary condition for applying the result in Qiao and Polonik (2018). Assumptions on the reach (introduced by Federer 1959) are used in a number of studies of geometric properties of manifolds, etc. Also note that assumption (F2) implies that the level .
c) It seems likely that the assumption of a non-negative kernel can be replaced by a higher-order kernel. This is because our focus is on the superlevel sets (or isosurfaces) with and regions there the density estimator is negative is irrelevant to our estimation. Even for the bootstrap method, technically we can bootstrap from a truncated and normalized density estimator. However, the truncation might incur further technical considerations.
d) The smoothness requirements for and in assumption (
A) are needed to enable to use of the Rosenblatt transform (see Rosenblatt, 1976) in the proof of Theorem 2.1. It will not be needed for bootstrap based methods. This is why we did not combine assumptions (A), (F1) and (K).
e) Assumptions (H1) and (H2), both very mild, are used in large sample confidence regions based on the vertical method. They guarantee the uniform consistency of the de-biased estimator . Specifically, (H1) is used for centered at its expectation, while (H2) is used for the bias correction part . We use stronger assumptions (H1) and (H2) for confidence regions based on the horizontal method. In particular, these assumptions guarantee the uniform consistency of the second derivatives of .
f) Under our assumptions, in the case of , is a union of finitely many points, say . Denote .

We need to introduce further notation. Let be the bias. Set and let denote -dimensional Hausdorff measure. Let be an estimator of . Some specific estimators will be given in Remark 2.2c). For and we set

(2.2)

where , and . The quantity is based on the extreme value behavior of a Gaussian field indexed by (see (7.1) given in Theorem 7.1). We further need the following, which, for , simply is a scaled version of

(2.3)

where is the standard normal cdf, and, for , is the cardinality of (cf. Remark 2.1f) ). Note that when , has a typical structure appearing in confidence bands for probability density or regression functions. See, for example, the main result of Bickel and Rosenblatt (1973). When ,

corresponds to a quantile value of the maximum of a Gaussian mixture model because

is a collection of separated points under our assumptions. With this notation, our first confidence interval based on vertical variation, and using the de-biased estimator of the underlying density, is defined as

Theorem 2.1

Suppose that (F1), (F2), (K), (A), (H1) and (H2) hold. Let If is a consistent estimator for , then we have,

(2.4)

An asymptotic confidence region for the superlevel set is given by the following upper and lower bounds:

and

Corollary 2.1

Suppose that (F1), (F2), (K), (A), (H1) and (H2) hold. Also suppose and as . Let If converges to in probability, then we have

Remark 2.2

a) Notice that the construction of involves the choice of two bandwidths, and . This is the case for all the confidence regions considered in this paper that are based on .
b) Constructing with rather than also results in an asymptotically valid confidence set for provided undersmoothing is being used to handle the bias. In this case, a sufficient condition for consistency of the coverage probability is that the bandwidth satisfies . When this assumption holds, the stochastic term asymptotically dominates the bias term , so that the latter can be ignored in the proof (see Hall, 1993).
c) We discuss two choices for the estimator . Let be the -dimensional Lebesgue meaure, and let denote the empirical probability measure, where denotes Dirac measure in . Let be the class of compact sets with a positive reach bounded away from zero such that . One option for the consistent estimator of is given by , where

The consistency of this estimator is shown in Proposition 3 in Cuevas et al. (2012). There, consistency is derived in terms of outer Minkowski content, which, under our assumptions, is equivalent to the consistency using Hausdorff measure (see Corollary 1 in Ambrosio et al., 2008). Efficiently computing is challenging.

Another estimator for is given by with The convergence rate and asymptotic normality of this estimator in the context of surface integral estimation are shown in Theorem 1 in Qiao (2018b), where additional assumptions are imposed, in particular on the speed of convergence of .
d) The -dimensional Hausdorff measure of and its estimators come into play because (i) the distribution of the statistic related to the above confidence regions can be approximated by that of the extreme value of certain Gaussian random fields indexed by the level set; and (ii) the latter, in turn, is related to an integral over with respect to the -dimensional Hausdorff measure. In our case, the integrand of this integral turns out to be a constant - see Theorem 6.1, and thus we obtain the volume of the isosurface. In fact, it is well known that the probability of the extreme value of a locally stationary Gaussian random fields exceeding a large level is asymptotically proportional to the volume of the index set (locally speaking). See, for example, Chapter 2 of Piterbarg (1996). Surface integrals have appeared in the context of level-set estimation before in Cadre (2006). There, however, a first order asymptotics (consistency) is considered, using the set-theoretic measure of symmetric difference

. On a very heuristic level, the fact that a surface integral comes in here can be understood by the fact that the variance of the fluctuations of

is of the order which is inherited from the fluctuations of the density estimator, and by then approximating by a constant times the Lebesgue measure of Lebesgue’s theorem gives that converges to a surface integral over the boundary . In other words, the technical reason for this surface integral to appear in Cadre (2006) is different from why it appears in our context.

2.2 Bootstrap confidence regions

Bootstrap confidence regions based on vertical variation of the kernel density estimate have been constructed in Mammen and Polonik (2013) and in Chen et al. (2017). They are based on a bootstrap approximation of quantiles of statistics of the form

where is such that (asymptotically) The confidence regions considered in the literature differ in the choice of the set While Mammen and Polonik (2013) propose to use for some appropriate choice of tending to zero, as , Chen et al. simply use Here we are using the smallest such set Clearly, the statistics are stochastically ordered in terms of the size of the set . Thus, the choice leads to the confidence set that is the smallest among the three. Of course the coverage still needs to be investigated. However, if the corresponding bootstrap approximations of the distributions of (and ) work similarly well, then using the statistic or , respectively, can be expected to be a good choice for the construction the bootstrap based confidence sets based on the vertical variation.
Our construction is as follows. Let be a sample drawn from a kernel density estimator using and bandwidth . Let be the kernel density estimate using and bandwidth . Let , where we use to indicate the expectation with respect to . For , let be the -quantile of the distribution of , and let be the corresponding quantile of . We now define our bootstrap confidence regions for and , respectively, as

and (2.5)

We also define the following sets to construct bootstrap confidence regions for and , respectively.

Below we show that this (and other) confidence region is asymptotically exact, and we derive rates of convergence for the coverage probability. These rates of convergence have a somewhat complex appearance, which we first explain from a high level perspective.

Structure of the rates of convergence for the coverage probabilities of bootstrap based confidence sets: The derivation of the following somewhat complex looking rates of convergence of the coverage probabilities are all based on Lemma 8.1, which is a slight reformulation of a result of Mammen and Polonik (2013). Based on this result, the rates are of the form where and are derived as follows: Let and let be a bootstrap version, where is one of the density estimators considered, and is some centering; the set in the supremum is either or Then, we derive and by showing that for some sequence of positive real numbers , we have and Both of and are themselves comprised of a sum of various terms. In fact, in our applications of this result, is chosen such that is negligible, and we have , with

(2.6)

That, in our case, has this particular form follows from a result by Neumann (1998). The rates that make negligible will follow, respectively, from strong approximation results in Neumann (1998) and some modification of Neumann’s result given in Mammen and Polonik (2013).
Note further that in each of the construction approaches considered below, we obtain the same rates of convergence of the coverage probabilities for both confidence regions for and confidence regions for (and similarly for and ). The reason for this is as follows. Let and denote these coverage probabilities based on one approach (e.g. the left-hand sides of (2.11) and (2.12), respectively. By construction, we have . More precisely, it is shown in the proof of Corollary 2.1 that , where with Now, if then we obtain , and under our respective assumptions, showing that both and converge to at the same speed.

Theorem 2.2

Suppose that (F1), (F2) and (K) hold. Let and

(2.7)
(2.8)

(a) If then we have

(2.9)
(2.10)

(b) If as then we have

(2.11)
(2.12)
Remark 2.3

a) The set can be also used as a confidence region for (not just for ) if the bandwidth is chosen to be of smaller order than the optimal bandwidth, to make the bias negligible (undersmoothing). In practice, the choice of undersmoothing bandwidth might be difficult to determine. We do not pursue the theoretical justification for as a confidence region for , the numerical performance of which, however, is shown in the simulation section.
b) The quantity appears in , because the second partial derivatives appear in the bias of density estimation and need to be estimated in our proof. Note that we are not using the de-biased density estimator in this theorem.
c) For , if we choose both and to be of the standard optimal rates, i.e. , then

The rate in Chen et al. (2017) is given by . When is chosen as the standard optimal bandwidth for density estimation, i.e. is of the exact order , the rate given in Chen et al. (2017) becomes , which is slower than . Note that even if the rate in Chen et al. (2017) is as claimed in this paper, it is still slower than when using the optimal bandwidth.

For , if we again choose , then

With , is of the order of .

2.2.1 A bootstrap confidence region based on explicit bias correction

Introducing a new bandwidth and bootstrapping from as we did above, can be viewed a method of bias correction (see page 208, Hall 1992). This allows us to construct a confidence region for (rather than just for ). We can also construct a confidence region for using an explicit bias correction by

For confidence regions for , we define

We have the following result:

Theorem 2.3

Suppose that (F1), (F2), and (K) hold. Let Let

(2.13)

If and as , then we have

(2.14)

If we further assume as , then

(2.15)
Remark 2.4

The constructions of both and are using a quantile, that is ignoring the bias. Nevertheless, gives a confidence region for the smoothed isosurface , while is a confidence region for . This is so, because one of them, is based on the de-biased estimator, while is not. Heuristically, this can be understood by writing One can see that the bias correction in the density will adjust for the bias that is present in the quantile

For , if we choose the optimal bandwidth , and (which is the order of the optimal bandwidth for estimating the second derivatives), then

Compared to the rate given in Remark 2.3 this rate is faster. However, the fact that the construction of involves the choice of three bandwidths might be a challenge in practice.

3 Confidence regions based on horizontal variation

Various confidence regions for and based on horizontal variation will be derived in this section. The geometric link between horizontal and vertical variation based confidence regions is, for obvious reasons, given by the gradient. Let and , such that is close to , then we obviously have

(3.1)

and when is chosen such that is approximately perpendicular to ,

(3.2)

Different ways of choosing will give rise to different types of horizontal variation based confidence regions. For example, can be a projection point of onto or it can be chosen such that there exists a gradient integral curve connecting and .
While the above confidence regions based on vertical variation are using approximations to quantiles of the distribution of , the horizontal variation based confidence regions will use estimated quantiles of the quantity . The latter methods are not purely horizontal variation based, as they involve the adjustment of by the gradient of . Nevertheless, since they are explicitly using the differences , we still call them methods based on horizontal variation.
Confidence regions based on horizontal variation (without estimating the gradient), have been constructed in Chen et al. (2017) based on the Hausdorff distance. The approaches considered in our work are asymptotically equivalent to the method proposed in Chen et al. (2017), but in practice the confidence sets are different. The different constructions also provide additional insight into the underlying geometry.
In the following we introduce novel horizontal variation based methods, based on estimating quantiles via both the asymptotic distribution and the bootstrap. Instead of using standard bootstrap as in Chen et al. (2017), we adopt smoothed bootstrap. We would like to note that rather than simply providing alternative methods for confidence regions, the constructions and the discussion of the performance of the resulting confidence regions are meant to provide insight into the effects of geometric aspects of the underlying probability density on the performance of confidence regions.

3.1 Methods based on integral curves

The approach discussed here is based on the construction of (cf. (3.1)) using integral curves driven by the (scaled) gradient field and their relation to level set. This relation will be discussed first.
Integral curves and level sets: For any , let be the integral curve driven by the scaled gradient of starting from , defined by the equation,

where we assume that (cf. Assumption (F2)). For , is understood to mean . The reason for choosing the scaled gradient

as a driving vector field, rather than

itself, is based on the following convenient property. Suppose that , and set . Then, by the fundamental theorem of calculus for line integrals, we have for any and such that for all

(3.3)
(3.4)

where the right hand side of (3.3) is a line integral over the trajectory , and “” represents dot product between vectors. By adopting the current scaled vector field, we can make sure the height increase (or decrease) in the density is exactly the amount of “time” needed to travel. In other words, if two particles start from any two points , then, after following the integral curves and respectively for time (which can be negative), both of these two particles will arrive at . The same holds for , by using the convention to start integration at .
Let the plug-in estimators of and based on the kernel density estimate be denoted by and , respectively, where the latter is the solution of the differential equation

(3.5)

We use the notation to denote the integral curve as in (3.5), but with replaced by the bias-corrected version .

3.1.1 Confidence regions for and using local adjustment by the gradient

Using the de-biased estimator