Mass Volume Curves and Anomaly Ranking

05/03/2017 ∙ by Stephan Clémençon, et al. ∙ Télécom ParisTech ∙ RNC CONSEIL

This paper formulates the issue of ranking multivariate unlabeled observations according to their degree of abnormality as an unsupervised statistical learning task. In the one-dimensional situation, this problem is usually tackled by means of tail estimation techniques: univariate observations are viewed as all the more 'abnormal' as they are located far in the tail(s) of the underlying probability distribution. It would be desirable to have at one's disposal a scalar-valued 'scoring' function that likewise allows the degrees of abnormality of multivariate observations to be compared. Here we formulate the issue of scoring anomalies as an M-estimation problem by means of a novel functional performance criterion, referred to as the Mass Volume curve (MV curve in short), whose optimal elements are strictly increasing transforms of the density. We first study the statistical estimation of the MV curve of a given scoring function and provide a strategy to build confidence regions using a smoothed bootstrap approach. Optimization of this functional criterion over the set of piecewise constant scoring functions is next tackled. This boils down to estimating a sequence of empirical minimum volume sets whose levels are chosen adaptively from the data, so as to adjust to the variations of the optimal MV curve, while controlling the bias of its approximation by a stepwise curve. Generalization bounds are then established for the difference in sup norm between the MV curve of the empirical scoring function thus obtained and the optimal MV curve.


1 Introduction

In a wide variety of applications, ranging from the monitoring of aircraft engines in aeronautics to non-destructive quality control in industry, through fraud detection, network intrusion surveillance or system management in data centers (see for instance (Viswanathan et al., 2012)), anomaly detection is of crucial importance. In most common situations, anomalies correspond to ‘rare’ observations and must be automatically detected based on an unlabeled dataset. In practice, the very purpose of anomaly detection techniques is to rank observations by degree of abnormality/novelty, rather than simply assigning them a binary label, ‘abnormal’ vs ‘normal’. In the case of univariate observations, abnormal values are generally those which are extreme, i.e. ‘too large’ or ‘too small’ with regard to central quantities such as the mean or the median, and anomaly detection may then derive from standard tail distribution analysis: the farther in the tail the observation lies, the more ‘abnormal’ it is considered. In contrast, it is far from easy to formulate the issue in a multivariate context. In the present paper, motivated by applications such as those aforementioned, we place ourselves in the situation where (unlabeled) observations take their values in a possibly very high-dimensional feature space, with a dimension large enough to make approaches based on nonparametric density estimation or multivariate (heavy-) tail modeling hardly feasible, if not unfeasible, due to the curse of dimensionality.

In this framework, a variety of statistical techniques, relying on the concept of minimum volume set investigated in the seminal contributions of Einmahl and Mason (1992) and Polonik (1997) (see Section 2), have been developed in order to split the feature space into two halves and decide whether observations should be considered as ‘normal’ (namely, when lying in the minimum volume set estimated from the available dataset) or not (when lying in its complement), at a given confidence level. One may also refer to (Scott and Nowak, 2006) and to (Koltchinskii, 1997) for closely related notions. The problem considered here is of a different nature: the goal pursued is not to assign to all possible observations a label ‘normal’ vs ‘abnormal’, but to rank them according to their level of ‘abnormality’. The most natural way to define a preorder on the feature space is to transport the natural order on the real line through some (measurable) scoring function: the smaller the score of an observation, the more likely it is to be viewed as ‘abnormal’. This problem shall be referred to here as anomaly scoring. It can be related to the literature dedicated to statistical depth functions in nonparametric statistics and operations research, see (Zuo and Serfling, 2000) and the references therein. Such functions are generally proposed ad hoc in order to define a ‘center’ for the probability distribution of interest and a notion of distance to the latter, e.g. the concept of ‘center-outward ordering’ induced by halfspace depth in (Tukey, 1975) or (Donoho and Gasko, 1992). The angle embraced in this paper is quite different; its objective is twofold: 1) propose a performance criterion for the anomaly scoring problem so as to formulate it in terms of M-estimation, and 2) investigate the accuracy of scoring rules which optimize empirical estimates of the criterion thus tailored.

Due to the global nature of the ranking problem, the criterion we promote is functional, just like the Receiver Operating Characteristic (ROC) and Precision-Recall curves in the supervised ranking setup (i.e. when a binary label, e.g. ‘normal’ vs ‘abnormal’, is assigned to the sampling data), and shall be referred to as the Mass Volume curve (MV curve in abbreviated form). The latter induces a partial preorder on the set of scoring functions: the collection of optimal elements is defined as the set of scoring functions whose MV curve is minimum everywhere. Such optimal scoring functions are proved to coincide with strictly increasing transforms of the underlying probability density almost everywhere on its support (see Section 3 for the exact definition). In the unsupervised setting, MV curve analysis is shown to play a role quite comparable to that of ROC curve analysis for supervised anomaly detection. The issue of estimating the MV curve of a given scoring function based on sampling data is then tackled and a smoothed bootstrap method for constructing confidence regions is analyzed. A statistical methodology to build a nearly optimal scoring function is next described, which works as follows: first, a piecewise constant approximant of the optimal MV curve is estimated by solving a few minimum volume set estimation problems, where the confidence levels are chosen adaptively from the data to adjust to the variations of the optimal curve; second, a piecewise constant scoring function is built from the sequence of estimated minimum volume sets. The MV curve of the scoring rule thus produced can be related to a stepwise approximant of the (unknown) optimal MV curve, which makes it possible to establish the generalization ability of the algorithm through rate bounds in terms of the sup norm.

The rest of the article is structured as follows. Section 2 describes the mathematical framework, sets out the main notations and recalls the crucial notions related to anomaly/novelty detection on which the analysis carried out in the paper relies. Section 3 first provides an informal description of the anomaly scoring problem and then introduces the MV curve criterion, dedicated to evaluating the performance of any scoring function. The set of optimal elements is described, and statistical results related to the estimation of the MV curve of a given scoring function are stated in Section 4. Statistical learning of an anomaly scoring function is then formulated as a functional M-estimation problem in Section 5, while Section 6 is devoted to the study of a specific algorithm for the design of nearly optimal anomaly scoring functions. Technical proofs are deferred to the Appendix section.

We finally point out that a very preliminary version of this work was presented at the AISTATS 2013 conference (see (Clémençon and Jakubowicz, 2013)). The present article gives a better characterization of the optimal scoring functions and investigates in much greater depth the statistical assessment of the performance of a given scoring function. In particular, it shows how to construct confidence regions for the MV curve of a scoring function in a computationally feasible fashion, using the (smoothed) bootstrap methodology, for which we state a consistency result. This consistency result gives a rate of convergence which promotes the use of a smoothed bootstrap approach rather than a naive bootstrap technique. Additionally, in Section 6, the confidence levels of the minimum volume sets to be estimated are chosen in a data-driven way (instead of considering a regular subdivision of the interval fixed in advance), giving the statistical estimation and learning procedures proposed in this paper crucial adaptivity properties with respect to the (unknown) shape of the optimal MV curve. Eventually, we give an example showing that the nature of the problem tackled here is very different from that of density estimation, and we also give a simpler formula for the derivative of the optimal MV curve (in terms of the underlying density) compared to the one originally given in (Clémençon and Jakubowicz, 2013).

2 Background and Preliminaries

As a first go, we describe the mathematical setup and recall the key concepts in anomaly detection involved in the subsequent analysis.

2.1 Framework and Notations

Here and throughout, we suppose that we observe independent and identically distributed realizations of an unknown continuous probability distribution, copies of a generic random variable, taking their values in a (possibly very high dimensional) feature space, with . The density of the random variable with respect to , the Lebesgue measure on , is denoted by , its support by and the indicator function of any event by . For any set , its complementary is denoted by . The sup norm of any real valued function is denoted by and the Dirac mass at any point by . The notation is used to mean boundedness in probability. The quantile function of any real valued random variable with cumulative distribution function is defined by for all . For any real valued random variable , the generalized inverse of the decreasing function is defined by for all . For any function , denotes the cumulative distribution function of the random variable . In addition, for any , denotes the quantile at level of the distribution of . We also set for all . Finally, constants in inequalities are denoted by either or and may vary at each occurrence.

A natural way of defining a preorder on the feature space is to map its elements onto the real half-line and use the natural order on the latter.

Definition 1.

(Scoring function) A scoring function is any measurable function that is integrable with respect to the Lebesgue measure.

The set of all scoring functions is denoted by and we denote the level sets of any scoring function by:

Observe that the family is decreasing (for the inclusion, as increases from to ):

and that and . (Recall that a sequence of subsets of an ensemble is said to converge if and only if its limit superior and limit inferior coincide; in such a case, one defines its limit as this common set.)

The following quantities shall also be used in the sequel. For any scoring function and threshold level , define:

The quantity is referred to as the mass of the level set , while is generally termed the volume (with respect to the Lebesgue measure).

Remark 1.

The integrability of a scoring function (see Definition 1) implies that the volumes are finite on : for any ,

Incidentally, we point out that, in the specific case , the set is the density contour cluster of the density function at level , see (Polonik, 1995) for instance. Such a set is also referred to as a density level set. Observe in addition that, using the terminology introduced in (Liu et al., 1999), is the region enclosed by the contour of depth when is a depth function.
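To fix ideas, the following Python sketch (ours, not part of the original article; the function name and the bounding-box assumption are purely illustrative) estimates the two quantities attached to a level set of a generic scoring function: its mass, by the empirical fraction of sample points scoring above the threshold, and its volume, by Monte Carlo integration over a box assumed to contain the level set.

```python
import numpy as np

def level_set_mass_volume(score, X_sample, t, box_low, box_high, n_mc=100_000, seed=0):
    """Estimate the mass and the volume of the level set {x : score(x) >= t}.

    mass   : empirical fraction of the data sample whose score exceeds t
    volume : Monte Carlo estimate of Leb({x in box : score(x) >= t}), assuming
             the level set is contained in the box [box_low, box_high].
    """
    rng = np.random.default_rng(seed)
    mass = np.mean(score(X_sample) >= t)

    box_low, box_high = np.asarray(box_low, float), np.asarray(box_high, float)
    U = rng.uniform(box_low, box_high, size=(n_mc, box_low.size))
    box_volume = np.prod(box_high - box_low)
    volume = box_volume * np.mean(score(U) >= t)
    return mass, volume

# Toy usage: a score decreasing in the distance to the origin, 2-d standard Gaussian data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))
s = lambda x: -np.sum(x**2, axis=1)
print(level_set_mass_volume(s, X, t=-2.0, box_low=[-6, -6], box_high=[6, 6]))
```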

2.2 Minimum Volume Sets

The notion of minimum volume sets has been introduced and investigated in the seminal contributions of Einmahl and Mason (1992) and Polonik (1997) in order to describe regions where a multivariate random variable takes its values with highest/smallest probability. Let α ∈ (0,1); a minimum volume set of mass at least α is any solution of the constrained minimization problem

min_Ω Leb(Ω)   subject to   P(X ∈ Ω) ≥ α,    (1)

where the minimum is taken over all measurable subsets of the feature space. Applications of this concept include in particular novelty/anomaly detection: for large values of α, abnormal observations (outliers) are those which belong to the complementary set of the minimum volume set. In the continuous setting, it can be shown that there exists a threshold value such that the corresponding density level set is a solution of the constrained optimization problem above. The (generalized) quantile function is then defined by:

The definition of can be extended to the interval by setting and . The following assumptions shall be used in the subsequent analysis.

The density is bounded: .

The density has no flat parts, i.e. for any ,

Under the assumptions above, for any , there exists a unique minimum volume set (equal to up to subsets of null -measure), whose mass is equal to exactly. Additionally, the mapping is continuous on and uniformly continuous on for all (when the support of is compact, uniform continuity holds on the whole interval ).

From a statistical perspective, estimates of minimum volume sets are built by replacing the unknown probability distribution by its empirical version and restricting the optimization to a collection of Borel subsets of the feature space, supposed rich enough to include all density level sets (or reasonable approximations of the latter). In (Polonik, 1997), functional limit results are derived for the generalized empirical quantile process under certain assumptions on the class (stipulating in particular that it is a Glivenko-Cantelli class for the underlying distribution). In (Scott and Nowak, 2006), it is proposed to relax the mass constraint by a tolerance parameter, which should roughly be chosen of the same order as the supremal deviation between empirical and true masses over the class, the complexity of the class being controlled by its VC dimension or by means of the concept of Rademacher averages, so as to establish rate bounds for a fixed mass level.

Alternatively, so-termed plug-in techniques, consisting in first computing an estimate of the density and next considering level sets of the resulting estimator, have been investigated in several papers, among which (Cavalier, 1997; Tsybakov, 1997; Rigollet and Vert, 2009; Cadre, 2006; Cadre et al., 2013). Such an approach however raises significant computational issues, even for moderate values of the dimension, inherent to the curse of dimensionality phenomenon.
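As a simple illustration of the plug-in approach in low dimension (our own sketch; the helper name and the use of a Gaussian KDE are ours, and the computational caveats above still apply in high dimension), one can estimate a minimum volume set of mass α by thresholding a kernel density estimate at the empirical (1 − α)-quantile of its values at the sample points.

```python
import numpy as np
from scipy.stats import gaussian_kde

def plugin_minimum_volume_set(X, alpha, bw_method=None):
    """Plug-in minimum volume set estimate at mass level alpha.

    Returns a membership test for the estimated set {x : f_hat(x) >= t_alpha},
    where f_hat is a Gaussian KDE and t_alpha is chosen so that a fraction
    alpha of the sample lies inside the set.
    """
    kde = gaussian_kde(X.T, bw_method=bw_method)
    t_alpha = np.quantile(kde(X.T), 1 - alpha)        # threshold on the estimated density
    return lambda x: kde(np.atleast_2d(x).T) >= t_alpha

# Toy usage: roughly 95% of fresh points from the same distribution fall inside the set.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
inside = plugin_minimum_volume_set(X, alpha=0.95)
print(np.mean(inside(rng.normal(size=(1000, 2)))))
```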

3 Ranking Anomalies

In this section, the issue of scoring observations depending on their level of novelty/abnormality is first described in an informal manner and next formulated quantitatively, as a functional optimization problem, by means of a novel concept, termed the Mass Volume curve.

3.1 Overall Objective

The problem considered in this article is to learn a scoring function, based on training data, so as to describe the extremal behavior of the (high-dimensional) random vector by that of the univariate variable obtained by scoring it, which can be summarized by its lower tail behavior: hopefully, the smaller the score, the more abnormal/rare the observation should be considered. Hence, an optimal scoring function should naturally make it possible to rank observations by increasing order of magnitude of their scores (see Section 3.2 for the precise definition of the set of optimal scoring functions). The preorder on the feature space induced by such an optimal scoring function could then be used to rank observations by their degree of abnormality: for any optimal scoring function, an observation is more abnormal than another one if and only if its score is lower.

3.2 A Functional Criterion: the Mass Volume Curve

We now introduce the concept of Mass Volume curve and show that it is a natural criterion to evaluate the accuracy of decision rules in regard to anomaly scoring.

Definition 2.

(True Mass Volume curve) Let . Its Mass Volume curve (MV curve in abbreviated form) with respect to the probability distribution of the random variable is the plot of the function

If the scoring function is upper bounded, exists and is defined on .

Remark 2.

(Parametric Mass Volume Curve) If is invertible, and therefore is the ordinary inverse of , the Mass Volume curve of the scoring function can also be defined as the parametric curve:

Figure 1: Scoring functions (panel a) and their associated Mass Volume curves (panel b), for a truncated Gaussian distribution.
Remark 3.

(Connections to ROC/concentration analysis) We point out that the MV curve resembles a receiver operating characteristic (ROC) curve of the test/diagnostic function (see (Egan, 1975)), except that the distribution under the alternative hypothesis is not a probability measure but the image of the Lebesgue measure by the scoring function, while the distribution under the null hypothesis is the probability distribution of the scored random variable. Hence, the curvature of the graph somehow measures the extent to which the spread of the distribution of the scores differs from that of a uniform distribution. Observe also that, in the case where the support of the underlying distribution coincides with the unit square, the MV curve of any scoring function corresponds to the concentration function introduced in (Polonik, 1999), with the uniform probability distribution on the unit square as reference measure, comparing the concentration of the underlying distribution with that of the uniform one with respect to the class of sublevel sets of the scoring function. Notice finally that, when the scoring function is a depth function, the MV curve is a scale curve, as defined in (Liu et al., 1999).

This functional criterion induces a partial order over the set of all scoring functions. Let and be two scoring functions on . The ordering provided by is better than that induced by when

Typical MV curves are illustrated in Fig. 1. A desirable MV curve increases slowly and rises only near the largest mass levels, just like the lowest curve of Fig. 1. This corresponds to the situation where the distribution of the random variable is highly concentrated around its modes and the highest values (respectively, the lowest values) taken by the scoring function are located near the modes of the distribution (respectively, in its tail region). The MV curve of the scoring function is then close to the lower right corner of the Mass Volume space. We point out that, in certain situations, only some parts of the MV curve may be of interest, corresponding to large mass levels when the focus is on extremal observations (the tail region of the random variable) and to small mass levels when the modes of the underlying distribution are investigated.

We define the set of optimal scoring functions as follows. We recall here that for any set , its complementary is denoted by .

Definition 3.

The set of optimal scoring functions is the set of scoring functions such that there exist and a function such that

  1. ,

  2. For all , ,

  3. is strictly increasing,

  4. For -almost all , .

One can note that, as expected, the density and strictly increasing transforms of the density belong to . More generally and roughly speaking, an optimal scoring function is a strictly increasing transform of the density -almost everywhere on the support of the density. Note also that on the one hand, if , thanks to , and , we have for almost all (because up to subsets of null measure as shown in Lemma 1). Therefore -almost everywhere on . On the other hand, thanks to , for almost all and therefore -almost everywhere on . Thus the values of on are -almost everywhere lower than the values of on , where is almost everywhere a strictly increasing transform of the density .

The result below shows that optimal scoring functions are those whose MV curves are minimum everywhere.

Proposition 1.

(Optimal MV curve) Let the assumptions of Section 2.2 be fulfilled. The elements of the class of optimal scoring functions all have the same MV curve and provide the best possible ordering of the feature space’s elements in regard to the MV curve criterion:

(2)

In addition, we have: and ,

(3)

where for all and where denotes the symmetric difference.

From now on, we will thus use MV* to denote the MV curve of the optimal scoring functions. The proof of Proposition 1 can be found in Appendix A. Incidentally, notice that, equipped with the notations introduced in Section 2.2, we have for all . We also point out that bound (3) reveals that the pointwise difference between the optimal MV curve and that of a candidate scoring function is controlled by the error made in recovering the corresponding minimum volume set through the level sets of that scoring function.

Remark 4.

In the framework we develop, anomaly scoring boils down to recovering the decreasing collection of all level sets of the density function, without necessarily knowing the corresponding levels. Indeed, one may check that any scoring function of the form

(4)

where is an arbitrary finite positive measure dominating the Lebesgue measure on , belongs to . Observe that corresponds to the case where is chosen to be the Lebesgue measure on in Eq. (4). The anomaly scoring problem can thus be cast as an overlaid collection of minimum volume set estimation problems. This observation shall turn out to be very useful when designing practical statistical learning strategies in Section 6.
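For intuition only (this is our schematic sketch, not the algorithm analyzed in Section 6; the KDE plug-in and all names are ours), a piecewise constant scoring function of the overlaid form above can be obtained by summing the indicators of estimated minimum volume sets computed at a finite grid of mass levels: the more of these nested sets an observation belongs to, the larger its score.

```python
import numpy as np
from scipy.stats import gaussian_kde

def piecewise_constant_score(X, mass_levels):
    """Schematic scoring rule overlaying estimated density level sets:
    s(x) = number of estimated minimum volume sets (one per mass level) containing x.
    The nested sets are estimated here with a simple KDE plug-in."""
    kde = gaussian_kde(X.T)
    # One density threshold per mass level; thresholds decrease as the mass level grows.
    thresholds = np.quantile(kde(X.T), 1 - np.asarray(mass_levels))
    return lambda x: np.sum(kde(np.atleast_2d(x).T)[:, None] >= thresholds, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
score = piecewise_constant_score(X, mass_levels=np.linspace(0.1, 0.9, 9))
print(score(np.zeros((1, 2))), score(np.array([[4.0, 4.0]])))  # high near the mode, low in the tail
```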

We point out that the optimal MV curve provides a measure of the mass concentration of the distribution of the random variable : the lower the curve, the more concentrated the distribution. Indeed, the excess mass functional introduced in (Müller and Sawitzki, 1991) can be expressed as . Hence, we have for all . One may refer to (Polonik, 1995) for results related to the statistical estimation of density contour clusters and applications to multimodality testing using the excess mass functional.

Example 1.

(Univariate Gaussian distribution) Let X be a random variable with Gaussian distribution N(μ, σ²). Then MV*(α) = 2σ Φ⁻¹((1 + α)/2) for all α ∈ (0, 1), where Φ is the cumulative distribution function of the standard Gaussian distribution N(0, 1).

Example 2.

(Multivariate Gaussian distribution) Let X be a multivariate random variable with Gaussian distribution N(μ, Σ) on R^d. We assume that Σ is a diagonal matrix and denote by σ₁², …, σ_d² its diagonal coefficients. The density level sets are exactly the sets {x : (x − μ)ᵀ Σ⁻¹ (x − μ) ≤ r²} with r ≥ 0. It is well known that (X − μ)ᵀ Σ⁻¹ (X − μ) follows the χ² distribution with d degrees of freedom. Thus the level set of mass α is obtained for r² = q_α, denoting by q_α the quantile of order α of the χ²_d distribution. This set is exactly the ellipsoid with semi-principal axes of length σ_i √q_α, i = 1, …, d. The formula for the volume of an ellipsoid then gives MV*(α) = π^{d/2} (σ₁ ⋯ σ_d) q_α^{d/2} / Γ(d/2 + 1), where Γ is the gamma function.
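For illustration (our own numerical sketch; the function names are ours), the closed-form optimal MV curves of Examples 1 and 2 are straightforward to evaluate: the length of a centered interval of Gaussian mass α in the univariate case, and the volume of the ellipsoid associated with the χ² quantile of order α in the diagonal multivariate case.

```python
import numpy as np
from scipy.stats import norm, chi2
from scipy.special import gammaln

def mv_star_gaussian_1d(alpha, sigma=1.0):
    """Optimal MV curve of N(mu, sigma^2): length of the centered interval of mass alpha."""
    return 2.0 * sigma * norm.ppf((1.0 + np.asarray(alpha)) / 2.0)

def mv_star_gaussian_diag(alpha, sigmas):
    """Optimal MV curve of a Gaussian with covariance diag(sigmas**2):
    volume of the ellipsoid of mass alpha."""
    sigmas = np.asarray(sigmas, float)
    d = sigmas.size
    q = chi2.ppf(np.asarray(alpha), df=d)             # chi-square quantile of order alpha
    log_unit_ball = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return np.exp(log_unit_ball + np.sum(np.log(sigmas)) + (d / 2) * np.log(q))

alphas = np.linspace(0.05, 0.95, 5)
print(mv_star_gaussian_1d(alphas))                    # univariate case, sigma = 1
print(mv_star_gaussian_diag(alphas, [1.0, 2.0]))      # bivariate case, sigmas = (1, 2)
```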

The following result reveals that the optimal MV curve is convex and provides a closed analytical form for its derivative. The proof is given in Appendix A.

Proposition 2.

(Convexity and derivative) Suppose that the assumptions of Section 2.2 are satisfied. Then, as the density is bounded, the optimal MV curve is defined on . Furthermore,

  1. is convex.

  2. Let be the cumulative distribution function of the random variable and let . If is invertible on then is differentiable on and

The convexity of the optimal MV curve might be explained as follows: when considering density level sets with increasing probabilities, their volumes increase as well, but they increase more and more quickly since the mass of the distribution becomes less and less concentrated.

Elementary properties of MV curves are summarized in the following proposition.

Proposition 3.

(Properties of MV curves) For any , the following assertions hold true.

  1. Invariance. For any strictly increasing function , we have .

  2. Monotonicity. The mapping is increasing.

Assertion 1 may be proved using the following property of the quantile function (see e.g. Property 2.3 in (Embrechts and Hofert, 2013)): for any cumulative distribution function , any and any , if and only if . Assertion 2 derives from the fact that the quantile function is increasing. Details are left to the reader.
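The invariance property can also be checked numerically; the small sketch below (ours) verifies that, for each mass level, the data points retained in a level set of a scoring function and in the corresponding level set of a strictly increasing transform of it are exactly the same, which is precisely why the two MV curves coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))
s = lambda x: np.exp(-np.sum(x**2, axis=1))   # a scoring function
T = np.log1p                                  # a strictly increasing transform
s_X, Ts_X = s(X), T(s(X))

for alpha in (0.25, 0.5, 0.75, 0.9):
    # Level sets retaining a fraction alpha of the sample, for s and for T o s.
    in_s  = s_X  >= np.quantile(s_X,  1 - alpha, method="lower")
    in_Ts = Ts_X >= np.quantile(Ts_X, 1 - alpha, method="lower")
    # Same points retained: same masses and volumes, hence identical MV curves.
    print(alpha, np.array_equal(in_s, in_Ts))
```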

4 Statistical Estimation

In practice, MV curves are unknown, just like the probability distribution of the random variable , and must be estimated based on the observed sample . Replacing the mass of each level set by its statistical counterpart in Definition 2 leads to defining the notion of empirical MV curve. We set, for all ,

(5)

where denotes the empirical distribution of the . Notice that takes its values in the set .

Definition 4.

(Empirical MV curve) Let . By definition, the empirical MV curve of is the graph of the (piecewise constant) function

Remark 5.

(On volume estimation). Except for very specific choices of the scoring function (e.g. when is piecewise constant and the volumes of the subsets of on which is constant can be explicitly computed), no closed analytic form for the volume is available in general and Monte-Carlo procedures should be used to estimate it (which may be practically challenging in a high-dimensional setting). Computation of volumes in high dimensions is an active topic of research (Lovász and Vempala, 2006) and for simplicity, this is not taken into account in the subsequent analysis.
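In keeping with the remark above, here is a minimal Python sketch (ours; all names and the bounding-box Monte Carlo scheme are illustrative assumptions) of the empirical MV curve of Definition 4: for each mass level, the score threshold is taken as the corresponding empirical quantile of the scores, and the volume of the associated level set is estimated by Monte Carlo.

```python
import numpy as np

def empirical_mv_curve(score, X_sample, alphas, box_low, box_high, n_mc=200_000, seed=0):
    """Empirical MV curve of a scoring function: for each mass level alpha, threshold
    the scores at their empirical (1 - alpha)-quantile and estimate the Lebesgue volume
    of the corresponding level set by Monte Carlo over a bounding box."""
    rng = np.random.default_rng(seed)
    box_low, box_high = np.asarray(box_low, float), np.asarray(box_high, float)
    s_mc = score(rng.uniform(box_low, box_high, size=(n_mc, box_low.size)))
    box_vol = np.prod(box_high - box_low)
    thresholds = np.quantile(score(X_sample), 1 - np.asarray(alphas))
    return np.array([box_vol * np.mean(s_mc >= t) for t in thresholds])

# Toy usage on 2-d Gaussian data with the score -||x||^2 (an increasing transform of the density).
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 2))
alphas = np.linspace(0.05, 0.95, 19)
print(empirical_mv_curve(lambda x: -np.sum(x**2, axis=1), X, alphas, [-6, -6], [6, 6]))
```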

In order to obtain a smoothed version of the empirical MV curve, a typical strategy consists in replacing the empirical distribution estimate involved in (5) by a continuous distribution obtained by kernel smoothing, where the kernel is a regularizing Parzen-Rosenblatt kernel (i.e. a bounded square integrable function integrating to one) and the amount of smoothing is governed by a bandwidth parameter (see for instance (Wand and Jones, 1994)).
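A minimal sketch of such a smoothed version (ours; a Gaussian kernel is used here purely for convenience, whereas the theoretical analysis of Section 4.2.1 works with a compactly supported kernel such as the biweight kernel): the empirical cdf of the scores is replaced by an average of kernel cdfs, from which smoothed masses, and hence a smoothed MV curve, can be derived.

```python
import numpy as np
from scipy.stats import norm

def smoothed_score_cdf(scores, h):
    """Kernel-smoothed estimate of the cdf of s(X), using a Gaussian kernel of bandwidth h:
    F_tilde(t) = (1/n) * sum_i Phi((t - s_i) / h)."""
    scores = np.asarray(scores, float)
    return lambda t: norm.cdf((np.asarray(t, float)[..., None] - scores) / h).mean(axis=-1)

# The smoothed empirical mass of the level set {s >= t} is 1 - F_tilde(t); pairing it with
# a volume estimate of the same level set yields a smoothed version of the empirical MV curve.
rng = np.random.default_rng(0)
s_data = rng.normal(size=1000)            # stand-in for the observed scores s(X_1), ..., s(X_n)
F_tilde = smoothed_score_cdf(s_data, h=0.2)
print(1.0 - F_tilde(np.array([0.0, 1.0, 2.0])))
```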

4.1 Consistency and Asymptotic Normality

The theorem below reveals that, under mild assumptions, the empirical MV curve is a consistent and asymptotically Gaussian estimate of the true MV curve, uniformly over any subinterval of . It involves the assumptions listed below.

The scoring function is bounded: .

The random variable has a continuous cumulative distribution function . Let and . The distribution function is twice differentiable on and

(6)

There exists such that

has a limit when tends towards from the left:

The mapping is of class .

Assumption 4.1 implies that exists and . Note also that if is not bounded, one can always consider . Indeed takes its values in and as is strictly increasing thanks to Proposition 3. Assumptions 4.1 and 4.1 are common assumptions for the strong approximation of the quantile process (see for instance (Csörgő and Révész, 1978)). Condition (6) and assumption 4.1 are respectively equivalent to for all and

Eventually, assumption 4.1 is equivalent to: has a limit when tends towards . The density can therefore be extended by continuity to the interval . As , (6) can then be replaced by for all which also gives for all . One can also note that is thus invertible on with inverse .

In order to state part of the following theorem, we assume that the probability space is rich enough in the sense that an infinite sequence of Brownian bridges can be defined on it.

Theorem 1.

Let and . Assume that assumptions 4.1-4.1 are fulfilled. The following assertions hold true.

  1. (Consistency) With probability one, we have uniformly over :

  2. (Strong approximation) There exists a sequence of Brownian bridges such that we almost surely have, uniformly over the compact interval : as ,

    where

The technical proof is given in Appendix A. One can note from the proof of assertion that this assertion does not require assumption 4.1 to be satisfied. The proof of assertion relies on standard strong approximation results for the quantile process (Csörgő and Révész, 1978, 1981; Csörgő, 1983). Assertion means that the fluctuation process converges in the space of càd-làg functions on equipped with the sup norm, to the law of a Gaussian stochastic process .

Remark 6.

Assumption 4.1 is required in order to state the results of assertions and on the interval . However this assumption is restrictive as it implies that is discontinuous at since whereas for all . Instead of assumption 4.1, if one assumes that is decreasing on an interval to the left of , then the result of assertion can be obtained on the interval with the same rate of convergence if and with the same rate of convergence up to and factors if . In this case, assertion also holds on (see for instance (Csörgő and Révész, 1978)).

Remark 7.

(Asymptotic normality) It results from assertion in Theorem 1 that, for any , the pointwise estimator is asymptotically Gaussian under assumptions 4.1-4.1. For all converges in distribution towards , as , with .

4.2 Confidence Regions in the Mass Volume Space

The true MV curve of a given scoring function is unknown in practice and its performance must be statistically assessed based on a data sample. Beyond the consistency of the empirical MV curve in sup norm and the asymptotic normality of the fluctuation process, we now tackle the question of constructing confidence bands in the Mass Volume space.

Definition 5.

Based on a sample , a (random) confidence region for the MV curve of a given scoring function at a given confidence level is any Borel set of the Mass Volume space (possibly depending on the sample) that covers the MV curve with probability larger than the prescribed level:

In practice, confidence regions shall take the form of balls in the Skorohod space of càd-làg functions with respect to the sup norm, centered at the estimate introduced in Definition 4, for some fixed :

Constructing confidence regions based on the approximation stated in Theorem 1 would require knowledge of the density . Hence a bootstrap approach (Efron, 1979) should be preferred. Following in the footsteps of Silverman and Young (1987), it is recommended to implement a smoothed bootstrap procedure. The asymptotic validity of such a resampling method derives from a strong approximation result similar to that of Theorem 1. Let the fluctuation process be defined, for all , by

The bootstrap approach suggests considering, as an estimate of the law of the fluctuation process , the conditional law given the original sample of the naive bootstrapped fluctuation process

(7)

where, given the original sample, is the empirical MV curve of the scoring function based on a sample of i.i.d. random variables drawn from the empirical distribution . The difficulty is twofold. First, the target is a distribution on a path space, namely a subspace of the Skorohod space equipped with the sup norm. Second, the fluctuation process is a functional of the quantile process. The naive bootstrap, which consists in resampling from the raw empirical distribution , generally provides bad approximations of the distribution of empirical quantiles: the rate of convergence for a given quantile is indeed of order (Falk and Reiss, 1989) whereas the rate of the Gaussian approximation is (see (25) in Appendix A). The same phenomenon may naturally be observed for MV curves. In a similar manner to what is usually recommended for empirical quantiles, a smoothed version of the bootstrap algorithm shall be implemented in order to improve the approximation rate of the distribution of the fluctuation process, namely, the data shall be resampled from a smoothed version of the empirical distribution . We thus consider the smoothed bootstrapped fluctuation process

(8)

where, given the original sample, is the empirical MV curve of the scoring function based on a sample of i.i.d. random variables drawn from the smoothed empirical distribution, and where is the smoothed version of the empirical MV curve, being the generalized inverse of . The algorithm for building a confidence band at a given level in the Mass Volume space from sampling data is described in Algorithm 1.

1:Based on the sample , compute the smoothed version of the empirical estimate .
2:Plot the smoothed MV curve estimate, i.e., the graph of the mapping
3:Draw a bootstrap sample .
4:Based on , compute the smoothed bootstrap version of the empirical estimate .
5:Plot the bootstrap MV curve, i.e. the graph of the mapping
6:Get the bootstrap confidence band at level defined by the ball of center and radius in , where is defined by
denoting by the conditional probability given the original data .
Algorithm 1 Smoothed MV curve bootstrap

Before turning to the theoretical analysis of this algorithm, its description calls for a few comments. From a computational perspective, the smoothed bootstrap distribution must be approximated in its turn, by means of a Monte-Carlo approximation scheme. Based on the bootstrap fluctuation processes obtained, with , the radius then coincides with the empirical quantile at level of the statistical population . Concerning the number of bootstrap replications, picking does not modify the rate of convergence. However, choosing of magnitude comparable to so that is an integer may be more appropriate: the quantile of the approximate bootstrap distribution is then uniquely defined, and this does not impact the rate of convergence either (Hall, 1986).
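To make the procedure concrete, here is a simplified Python sketch of the smoothed bootstrap band (ours, not the authors' code): it resamples the scores directly, uses a Gaussian kernel for the smoothing, and centers the band at the raw empirical curve rather than at the smoothed estimate prescribed in Algorithm 1; the deterministic volume function of the level sets is assumed to be available.

```python
import numpy as np

def mv_curve_from_scores(scores, alphas, volume_of_level_set):
    """MV curve estimate: plug the empirical (1 - alpha)-quantiles of the scores
    into a given volume function t -> Leb({x : s(x) >= t})."""
    return volume_of_level_set(np.quantile(scores, 1 - np.asarray(alphas)))

def smoothed_bootstrap_band(scores, alphas, volume_of_level_set, h,
                            n_boot=500, level=0.95, seed=0):
    """Simplified smoothed bootstrap: resample scores from a kernel-smoothed version of their
    empirical distribution (resampled score + h * Gaussian noise), recompute the MV curve, and
    return the center curve together with the sup-norm radius of the confidence band."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, float)
    centre = mv_curve_from_scores(scores, alphas, volume_of_level_set)
    deviations = np.empty(n_boot)
    for b in range(n_boot):
        boot = rng.choice(scores, size=scores.size, replace=True) \
               + h * rng.standard_normal(scores.size)
        deviations[b] = np.max(np.abs(mv_curve_from_scores(boot, alphas, volume_of_level_set)
                                      - centre))
    return centre, np.quantile(deviations, level)

# Toy 1-d usage with s(x) = -|x| on N(0,1) data, for which {s >= t} = [t, -t] has length -2t.
rng = np.random.default_rng(1)
scores = -np.abs(rng.normal(size=1000))
alphas = np.linspace(0.1, 0.9, 17)
centre, radius = smoothed_bootstrap_band(scores, alphas, lambda t: -2.0 * t, h=0.1)
print(radius)   # half-width of the sup-norm confidence band around the center curve
```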

4.2.1 Bootstrap consistency

The next theorem establishes the asymptotic validity of the bootstrap estimate proposed above, where we assume that the smoothed version of the distribution function is computed at step 1 of Algorithm 1 using a kernel . It requires the following assumptions.

The density is bounded and of class .

The bandwidth decreases towards as in a way that , and .

The kernel has finite support and satisfies the following conditions: , and .

The kernel is such that and is of the form , being a polynomial and a bounded real function of bounded variation.

The kernel is differentiable with derivative such that and of the form , being a polynomial and a bounded real function of bounded variation.

The bandwidth is such that tends to 0 as .

Assumptions 4.2.1 and 4.2.1 are needed to control the biases and . Assumption 4.2.1 on the bandwidth and assumption 4.2.1 (respectively assumption 4.2.1) are needed to control (respectively ) thanks to the result of Giné and Guillou (2002). Assumption 4.2.1 ensures that tends to almost surely as . Assumptions 4.2.1-4.2.1 are fulfilled, for instance, by the biweight kernel defined for all by:

K(x) = (15/16) (1 − x²)² 1{|x| ≤ 1}.    (9)
Theorem 2.

(Asymptotic validity) Let and let denote the conditional probability given the original data . Suppose that assumptions 4.1-4.1, 4.1 and 4.2.1-4.2.1 are fulfilled. Then, the distribution estimate given by Algorithm 1 is such that we almost surely have

where .

The proof is given in Appendix A. Note that under assumption 4.2.1 tends to as . The primary tuning parameters of Algorithm 1 concern the bandwidth