On approximate validation of models: A Kolmogorov-Smirnov based approach

03/20/2019 ∙ by Eustasio del Barrio, et al. ∙ 0

Classical tests of fit typically reject a model for large enough real data samples. In contrast, often in statistical practice a model offers a good description of the data even though it is not the "true" random generator. We consider a more flexible approach based on contamination neighbourhoods around a model. Using trimming methods and the Kolmogorov metric we introduce a functional statistic measuring departures from a contaminated model and the associated estimator corresponding to its sample version. We show how this estimator allows testing of fit for the (slightly) contaminated model vs sensible deviations from it, with uniformly exponentially small type I and type II error probabilities. We also address the asymptotic behavior of the estimator showing that, under suitable regularity conditions, it asymptotically behaves as the supremum of a Gaussian process. As an application we explore methods of comparison between descriptive models based on the paradigm of model falseness. We also include some connections of our approach with the False-Discovery-Rate setting, showing competitive behavior when estimating the contamination level, although applicable in a wider framework.




1 Introduction

Classical goodness-of-fit tests try to establish whether there is enough statistical evidence to reject the null hypothesis, which usually is a fixed generating mechanism. These procedures behave fairly well for moderate data sizes, but can become excessively rigid in the presence of large sample sizes. This fact was already noted for the chi-squared statistic in Berkson (1938) and interpreted by many authors as an indication of model falseness, leading to statements such as ‘for every data generating mechanism there exists a sample size at which the model failure will become obvious’ (see Lindsay and Liu (2009)). The issue has been approached in different ways (see e.g. Hodges and Lehmann (1954), Álvarez-Esteban et al. (2012), Munk and Czado (1998),…), sharing the idea that we should broaden the null hypothesis to include useful nearby models. Usually this is also accompanied by a gain in robustness in the new proposals.

However, considering Box’s celebrated phrase ‘essentially, all models are wrong, but some are useful’, even under the paradigm of model falseness, rejecting a model would not be a satisfactory goal. If all models are false and, at a certain point, with enough data, we are able to reject the model, we should be able to provide some measure of how useful it is, or of how good it is compared to other models. This topic is addressed in Davies (1995, 2016) from the perspective that a useful model is any model able to generate samples similar to the available data. Let us present our framework to revisit both topics from a novel point of view.

Often, some feature of a predominant population is clearly different from that of another minority population, simply because of its different eating or cultural habits. In either of these situations, a data sample of that feature taken from the general population will include data that do not come from and do not look like those arising from the predominant one. Consequently, the statistical inference on the main population should be made taking into account the presence of atypical data. As a first ingredient, to address this goal, we resort to a suggestive model introduced in Huber (1964), which became one of the very bases of Robust Statistics: an ($\alpha$-)contamination neighbourhood (CN) of a probability distribution $P_0$ is the set of probability distributions

$$\mathcal{V}_\alpha(P_0) = \{(1-\alpha)P_0 + \alpha Q : Q \in \mathcal{P}\}, \tag{1}$$

where $\mathcal{P}$ is the set of all probability distributions in the space (throughout the paper the real line $\mathbb{R}$). For a given probability $P_0$ and a particular value $\alpha$, a probability in $\mathcal{V}_\alpha(P_0)$ would generate samples with approximately $(1-\alpha)\times 100\%$ of the data coming from $P_0$. Also we must note the use of particular contamination models in different statistical problems, stressing its role in the False-Discovery-Rate (FDR) setting (as considered e.g. in Genovese and Wasserman (2004)). We briefly comment on the relation of our approach with that in Section 5. Of course, if an ‘outlying label’ were available for the data coming from the contaminating distribution, $Q$, removing the labeled data would produce a legitimate sample from $P_0$. The relevant fact is that CN’s are related to trimmings (see Álvarez-Esteban et al. (2011)) by

$$P \in \mathcal{V}_\alpha(P_0) \iff P_0 \in \mathcal{R}_\alpha(P), \tag{2}$$

where $\mathcal{R}_\alpha(P)$ denotes the set of $\alpha$-trimmings of the probability distribution $P$,

$$\mathcal{R}_\alpha(P) = \Big\{Q \in \mathcal{P} : Q \ll P,\ \frac{dQ}{dP} \le \frac{1}{1-\alpha}\ P\text{-a.s.}\Big\}. \tag{3}$$

This means that an $\alpha$-trimming, $Q$, of $P$ is characterized by a down-weighting function $f$ such that $0 \le f \le \frac{1}{1-\alpha}$ and $Q(B) = \int_B f\,dP$ for all measurable sets $B$ in $\mathbb{R}$. In contrast with the hard 0-1 (trimmed/non-trimmed) trimming practice in data analysis, this concept allows for gradually diminishing/enhancing the importance of points in the sample space. Relation (2) allows us to work with trimmings, instead of CN’s, taking advantage of the underlying meaning of trimming and its mathematical properties. If $F$ and $F_0$ are distribution functions (d.f.’s in the sequel), we will also use $\mathcal{V}_\alpha(F_0)$ and $\mathcal{R}_\alpha(F)$, with the same meanings as before, but defined in terms of d.f.’s.
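The role of the ‘outlying label’ in the contamination model can be illustrated with a small simulation. The following is only a sketch: the function name `sample_contaminated`, the choice of a standard normal model with a shifted normal contaminant, and the sample size are illustrative assumptions, not taken from the paper.

```python
import random

def sample_contaminated(n, alpha, draw_model, draw_contaminant, rng):
    """Draw n labelled points from the mixture (1 - alpha) * P0 + alpha * Q."""
    sample = []
    for _ in range(n):
        is_outlier = rng.random() < alpha      # label: True when the point comes from Q
        draw = draw_contaminant if is_outlier else draw_model
        sample.append((draw(rng), is_outlier))
    return sample

rng = random.Random(0)
data = sample_contaminated(
    10_000, 0.1,
    draw_model=lambda r: r.gauss(0.0, 1.0),        # P0 = N(0, 1), illustrative choice
    draw_contaminant=lambda r: r.gauss(5.0, 1.0),  # Q = N(5, 1), illustrative choice
    rng=rng,
)

# With the labels available, dropping the labelled points leaves a sample from P0.
clean = [x for x, is_outlier in data if not is_outlier]
contaminated_share = 1 - len(clean) / len(data)
```

In practice the labels are not observed, which is why trimming has to decide, through down-weighting, which part of the sample to discount.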

The natural absence of an outlying label has been traditionally substituted by more or less orthodox trimming criteria, including the oldest one, which consists of trimming just the extreme values and carrying out the analysis with the remaining data. Recently, mainly in connection with two-sample problems (see e.g. Álvarez-Esteban et al. (2008, 2011, 2012, 2016)), optimal trimmings have been introduced as the nearest ones to the original model, according to some probability distance or dissimilarity measure. This role will be played here by the Kolmogorov (or $L_\infty$-)distance between d.f.’s on the line, namely,

$$d_K(F, G) = \sup_{x \in \mathbb{R}} |F(x) - G(x)| \tag{4}$$

(we will often use the notation $\|F - G\|$ for $d_K(F, G)$).
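For an empirical d.f. and a continuous model d.f., the supremum in the Kolmogorov distance is attained at the jumps of the empirical d.f., so it reduces to a maximum over the ordered sample. A minimal sketch follows; the helper names `normal_cdf` and `kolmogorov_distance` are illustrative, and the model d.f. is assumed continuous.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def kolmogorov_distance(sample, model_cdf):
    """sup_x |F_n(x) - F_0(x)| for the empirical d.f. F_n and a continuous F_0."""
    y = sorted(model_cdf(x) for x in sample)   # transformed, ordered sample
    n = len(y)
    # At the i-th order statistic F_n jumps from i/n to (i + 1)/n (0-based i).
    return max(max((i + 1) / n - y[i], y[i] - i / n) for i in range(n))

d = kolmogorov_distance([0.25, 0.75], lambda x: min(max(x, 0.0), 1.0))
```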

In this work, we develop a robust hypothesis testing procedure based on the previous considerations. Moreover, under the paradigm of a false-model world, we use the elements involved in the procedure to suggest some tools for comparing models or for determining the usefulness of particular models.

The use of CN’s, through their connection with trimmings, leads us to consider $\mathcal{V}_\alpha(F_0)$ to be the ‘reasonable’ model. Notice that (see Example 2.1) this approach differs from that based just on $d_K$-neighbourhoods of $F_0$, which would have a different meaning (see Owen (1995) for this and other classic approaches). As relation (8) shows, (2) is also equivalent to $d_K(F_0, \mathcal{R}_\alpha(F)) = 0$, giving to the ‘trimmed Kolmogorov distance’ functional

$$d_K(F_0, \mathcal{R}_\alpha(F)) := \min_{\tilde F \in \mathcal{R}_\alpha(F)} d_K(F_0, \tilde F) \tag{5}$$

and to the plug-in estimator $d_K(F_0, \mathcal{R}_\alpha(F_n))$ a main role in our analysis. (Here $F_n$ is the empirical d.f. based on a sample of $n$ independent random variables with common d.f. $F$.) In particular, we address the possibility of testing whether $F_0$ is a ‘reasonable’ model for $F$, where ‘reasonable’ is controlled by the trimming level $\alpha$. Related null hypotheses have already been considered making use of different probability metrics or different neighbourhoods. In Álvarez-Esteban et al. (2011, 2012), the $L_2$-Wasserstein distance is used in a two-sample version. Previous approaches based on particular trimming procedures were considered in Munk and Czado (1998) and Álvarez-Esteban et al. (2008). The Kolmogorov-Smirnov test is probably the most widely used goodness-of-fit test; therefore the Kolmogorov metric provides a privileged setting to develop our approach. Notice that, in del Barrio et al. (2019), we have included most of the mathematical tools involved in this problem. This includes existence and characterization of (a particular) minimizer, and even a result on directional differentiability, which will be used here.

As shown in Barron (1989), for any distance dominating the total variation distance, testing the null hypothesis vs. the alternative , makes generally unachievable to get exponential bounds for the involved errors. The test provided in Section 3 has exponentially small error probabilities for testing the null (equivalently, ) against the alternative . The test is uniformly consistent (type I and type II error probabilities tend to 0 uniformly) for detecting alternatives with if .

Also, in Section 4.1, we provide asymptotic theory for $d_K(F_0, \mathcal{R}_\alpha(F_n))$ for inferential purposes. It includes an extension of Theorem 2 in Raghavachari (1973) for flexible null hypotheses.

The second main goal in this paper is to provide tools to compare different models when the null hypothesis is rejected. Under the model falseness paradigm, Davies (1995, 2016) introduces the idea of adequacy region (for a data set) as the set of probabilities in a model whose samples would typically look like the actual data. Also, Rudas et al. (1994) propose the very natural concept of index of fit, namely, the contamination level necessary to make the random generator of the data a contaminated member of the model. The proposal in Rudas et al. (1994), as well as its modification in Liu and Lindsay (2009), deals with multinomial models. In our setup we consider the trimmed Kolmogorov (tK) index of fit, $\alpha^*$, defined by

$$\alpha^* := \min\{\alpha \in [0,1] : F \in \mathcal{V}_\alpha(F_0)\}. \tag{6}$$

This is the minimum contamination level for which $F$ is a contaminated version of $F_0$. This works in a very general setup, since we impose no constraints on $F_0$ and $F$. This is in contrast with the methodology involved in the control of FDR, which takes advantage of the dominated contamination model. With the methodology developed here, it is fairly easy to calculate the empirical version of $\alpha^*$ for a particular data set. Using our asymptotic theory for $d_K(F_0, \mathcal{R}_\alpha(F_n))$ we propose a consistent estimator for $\alpha^*$ in Section 4. We also provide comparisons with some methodologies developed in the FDR setting (as considered in Meinshausen and Rice (2006)) for estimating the proportion of false null hypotheses.

A related approach for comparing the quality of different models to describe the data is based on credibility indices, as introduced in Lindsay and Liu (2009). Given a goodness-of-fit procedure, the credibility index allows comparison between models based on the minimal sample size $m$ for which subsamples of size $m$ of the original data (of size $n$) reject the null hypothesis 50% of the time. The idea behind this index is that for large samples, goodness-of-fit tests will very likely reject the null hypothesis, while often for smaller subsamples the null would not be rejected. Of course, these credibility indices have to be estimated from the data. The proposal in Lindsay and Liu (2009) is to use subsampling to perform this estimation. However, the accuracy of the subsampling approximation is limited to subsample sizes that are small compared to the complete sample. Here we show how our asymptotic theory for $d_K(F_0, \mathcal{R}_\alpha(F_n))$ can provide further information about the credibility indices.

Summarizing, the paper addresses the analysis and applications of $d_K(F_0, \mathcal{R}_\alpha(F))$, the ‘trimmed Kolmogorov distance’. Section 2 is devoted to collecting the mathematical bases and providing a fast algorithm for computation on sample data. The analysis of the proposed testing procedure is carried out in Section 3. In Section 4 we show how to apply this test to credibility analysis and develop some results about the tK-index of fit and the related acceptance regions. The basis for that approach relies on the CLT for the trimmed Kolmogorov distance (see Theorem 4.1). Section 5 includes some relations with the FDR setting and comparisons between several estimators of the contamination index. In Section 6 we illustrate the previous techniques by comparing descriptive models over simulated and real data examples. In the last section we briefly discuss the results. Finally, the proof of the main result in the paper, the CLT for the trimmed Kolmogorov distance, is given in the Appendix.

2 Trimming and Kolmogorov distance

We keep the notation used in the Introduction and notice that the set $\mathcal{R}_\alpha(F)$ can also be characterized, as shown in Álvarez-Esteban et al. (2008) (Proposition 2.2 in Álvarez-Esteban et al. (2011) gives a more general result), in terms of the set of $\alpha$-trimmed versions of the uniform law $U(0,1)$. Let $\mathcal{C}_\alpha$ be the set of absolutely continuous functions $h:[0,1]\to[0,1]$ such that $h(0)=0$, $h(1)=1$, with derivative $h'$ verifying $0 \le h' \le \frac{1}{1-\alpha}$. Then the composition of the functions $h$ and $F$, $h \circ F$, gives the useful parameterization

$$\mathcal{R}_\alpha(F) = \{h \circ F : h \in \mathcal{C}_\alpha\}.$$

The set $\mathcal{R}_\alpha(F)$ is convex and also well behaved w.r.t. weak convergence of probabilities and widely employed probability metrics (see Section 2 in Álvarez-Esteban et al. (2011)). As shown in del Barrio et al. (2019), $\mathcal{R}_\alpha(F)$ keeps several nice properties under $d_K$; we include below the most relevant ones.

Proposition 2.1.

For , if , with or without suffixes are d.f.’s:

  • is compact w.r.t. .

  • .

  • If then:

  • for every there exist such that

  • if , then there exists some -convergent subsequence . If is the limit of such a subsequence, necessarily .

  • if, additionally, is any sequence of d.f.’s such that then as

Immediate consequences of Proposition 2.1 are that for :


Moreover, by convexity of $\mathcal{R}_\alpha(F)$, the set of optimally trimmed versions of $F$ associated to problem (7) is also convex. However, guaranteeing uniqueness of the minimizer (as holds w.r.t. the $L_2$-Wasserstein metric by Corollary 2.10 in Álvarez-Esteban et al. (2011)) is not possible. Of particular statistical interest is the following consistency result, which follows directly from the Glivenko-Cantelli theorem and item e3) above.

Proposition 2.2 (Consistency of trimmed Kolmogorov distance).

Let and be the sequence of empirical d.f.’s based on a sequence of independent random variables with distribution function . If is any sequence of distribution functions -approximating the d.f. (i.e. ), then:

While in other contexts the roles played by discarding contamination (by trimming) and the distance under consideration seem to be clear, here the nature of the Kolmogorov distance can lead to a distorted picture. To shed some light on these roles, we include a very simple example based on uniform laws that allows explicit computations. We must also note that (as commented in Álvarez-Esteban et al. (2012)) contamination neighbourhoods have been extended in several ways; notably, Rieder’s neighbourhoods of a probability comprise contamination as well as total variation norm neighbourhoods.

Example 2.1.

Contamination vs -based neighbourhoods. Let us fix to be the d.f. and consider the following scenarios for

  • the d.f. of an or an law. Then and if (and 0 if

  • the d.f. of a law. Then and for every .

In fact, the first situation involves a contamination of exact size of , because where is the d.f. of an or an law. In contrast, the second one does not fit in the contamination model at all. The following scenario includes inner contamination at the support of , adding some complexity to the analysis:

  • , where is the d.f. of a law with . Then and for : if else . If , then for , we would have while for defining , we would have .

The analysis above shows that the effect of optimal trimming according to the -distance strongly depends on several factors. Notably, they include the presence or not of a contaminating part, but also its spread and relative position.

Throughout this paper we make frequent use of the quantile function. Given a d.f. $F$, we write $F^{-1}$ for the associated quantile function. Recall that it is just the left-continuous inverse of the d.f. $F$, namely, $F^{-1}(t) = \inf\{x : t \le F(x)\}$. It allows a useful representation of the corresponding distribution because, if $U$ is a uniformly distributed $U(0,1)$ random variable, $F^{-1}(U)$ has d.f. $F$. Moreover, if $X$ has a continuous d.f. $F$, $F_0 \circ F^{-1}$ is easily seen to be the quantile function associated to the r.v. $F_0(X)$. As we showed in del Barrio et al. (2019), under some regularity assumptions, $d_K(F_0, \mathcal{R}_\alpha(F))$ can be expressed in terms of the function $F_0 \circ F^{-1}$. This fact allows the practical computation of $d_K(F_0, \mathcal{R}_\alpha(F_n))$ when $F_n$ is an empirical d.f. based on a data sample $X_1,\dots,X_n$, and even that of $d_K(F_0, \mathcal{R}_\alpha(F))$ for theoretical distributions (see Example 2.2). For the sake of completeness, we include below these results and a theorem which is a fundamental tool for our goals. It gives an explicit characterization of a solution of the corresponding optimization problem (see Theorem 2.5 in del Barrio et al. (2019) for a proof).

Lemma 2.1.

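If $F_0, F$ are continuous d.f.’s and $F$ is additionally strictly increasing then

$$d_K(F_0, \mathcal{R}_\alpha(F)) = \min_{h \in \mathcal{C}_\alpha} \|h - F_0 \circ F^{-1}\|.$$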

Theorem 2.1.

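Assume $\Gamma:[0,1]\to[0,1]$ is a continuous nondecreasing function. Define $G(x) = \Gamma(x) - \frac{x}{1-\alpha}$, $U(x) = \sup_{x \le y \le 1} G(y)$, $L(x) = \inf_{0 \le y \le x} G(y)$, and

$$\tilde h_\alpha(x) = \max\Big(\min\Big(\frac{U(x)+L(x)}{2},\, 0\Big),\, -\frac{\alpha}{1-\alpha}\Big).$$

Then $h_\alpha := \tilde h_\alpha(\cdot) + \frac{\cdot}{1-\alpha}$ is an element of $\mathcal{C}_\alpha$, and

$$\min_{h \in \mathcal{C}_\alpha} \|h - \Gamma\| = \|h_\alpha - \Gamma\| = \|\tilde h_\alpha - G\|.$$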

Note that the assumption on is always verified when , and that taking right and left limits at 0 and 1, respectively, we can assume that is a nondecreasing (and left continuous) function from to .

A key aspect in Theorem 2.1 is that, although not necessarily unique, is an optimal trimming function in the sense described above. However, from the point of view of asymptotic theory, Theorem 2.1 is the key to our Theorem 4.1 in Section 4. Moreover, from a practical point of view, it yields a simple algorithm for the computation of , as follows.

Assume are i.i.d. observations from the continuous and strictly increasing d.f. and assume that is continuous. From Lemma 2.1 and Theorem 2.1 we know that , where , is the empirical quantile function of the transformed data, , , and

Denote by the ordered (transformed) sample. Note that if , while is a nonincreasing function and this implies that

with , . For the computation of we note that and for . Summarizing, we see that can be computed through the following algorithm.

Algorithm for the computation of :

  • compute , ; sort .

  • compute , ; , .

  • compute , , .

  • set , and , .

  • compute
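As a numerical cross-check for the trimmed Kolmogorov distance of an empirical d.f., one can also solve the finite-dimensional problem directly. The sketch below is not the closed-form algorithm listed above: it is a substitute based on bisection, resting on the assumption that a trimming of the empirical d.f. amounts to giving each observation a weight between 0 and 1/(n(1 - alpha)), with weights summing to 1, and that the model d.f. is continuous. The function name `trimmed_ks_distance` and its parameters are hypothetical.

```python
import math

def trimmed_ks_distance(sample, model_cdf, alpha, iters=60, tol=1e-12):
    """Bisection sketch for the trimmed Kolmogorov distance of an empirical d.f.

    Assumes model_cdf is continuous. A trimmed d.f. is h(F_n) with h(0) = 0,
    h(1) = 1 and increments h(j/n) - h((j-1)/n) in [0, 1 / (n * (1 - alpha))].
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    y = sorted(model_cdf(x) for x in sample)     # transformed, ordered sample
    n = len(y)
    if n == 0:
        raise ValueError("sample must be non-empty")
    step = 1.0 / (n * (1.0 - alpha))             # largest weight allowed on one point
    upper = [0.0] + y                            # h(j/n) <= Y_(j) + eps, with Y_(0) = 0
    lower = y + [1.0]                            # h(j/n) >= Y_(j+1) - eps, with Y_(n+1) = 1

    def feasible(eps):
        """Is there an admissible h within sup-distance eps of the model?"""
        if lower[0] - eps > tol:                 # constraint at h(0) = 0
            return False
        lo = hi = 0.0                            # range of reachable values of h(j/n)
        for j in range(1, n + 1):
            lo = max(lo, lower[j] - eps)         # h is nondecreasing
            hi = min(hi + step, upper[j] + eps, 1.0)
            if lo > hi + tol:
                return False
        return hi >= 1.0 - tol                   # h(1) = 1 must be reachable

    a, b = 0.0, 1.0                              # eps = 1 is always feasible
    for _ in range(iters):
        mid = 0.5 * (a + b)
        if feasible(mid):
            b = mid
        else:
            a = mid
    return b
```

With `alpha = 0` the weights are forced to be 1/n and the function returns the classical Kolmogorov-Smirnov statistic, which gives a simple sanity check. The cost is one linear pass per bisection step after sorting.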

Beyond this algorithm for the empirical case, Theorem 2.1 provides a simple way to compute theoretical trimmed Kolmogorov distances. Example 2.1 in del Barrio et al. (2019) analyzes the problem in the Gaussian model. Let us include here a summary to illustrate this use.

Example 2.2.

Trimmed Kolmogorov distances in the Gaussian model. Consider the case , , where denotes the standard normal d.f., and . Here we have . We focus on the cases , and , (see del Barrio et al. (2019) for details).

If and then


In the case :

Relations (7) and (8) state the link between CN’s and trimming, opening ways to approximately validate a model making use of trimming through the Kolmogorov distance. We end this section by showing how CN’s and approximate validation in a parametric model setting can be related. For that task we focus on which parameters in the model lead to distributions in

. As pointed out in Davies (1995), we should just consider models able to generate data similar to our sample. Moreover, distributions in a CN have an intuitive appeal and, if is small, we can expect to be handling reasonable models. For instance, if

then we can calculate the tolerance region given by the subset of normal distributions belonging to

in an elementary fashion. This provides an approximate picture of the kind of distributions present in the CN of . These tolerance regions for and are shown in Figure 1. Every combination of inside the green border is a normal distribution that belongs to . The same is true for the red border and .

Figure 1: Plot of regions containing the parameters compatible with -contamination neighbourhoods of , for (red) and (green)

3 Hypothesis testing

To develop our approach for a testing procedure, throughout, $X_1,\dots,X_n$ will be independent random variables with common d.f. $F$, and $F_n$ will be the corresponding empirical d.f. The main result, following the principles in Barron (1989), concerns control of error probabilities: a test is uniformly consistent (UC) if both type I and type II error probabilities (EI and EII in the sequel) converge uniformly to 0 as the sample size, $n$, tends to infinity, and it is uniformly exponentially consistent (UEC) if the error probabilities are uniformly bounded by $e^{-rn}$ for large $n$ and some $r>0$. To stress the necessity of considering some separating zone between the null and the alternative, we first include this slightly more general result.

Proposition 3.1.

Given , for testing vs. , for every rejecting the null hypothesis when is an uniformly exponentially consistent (UEC) test.

Proof. From Proposition 2.1 c) in del Barrio et al. (2019), we have the inequality , thus for EI:


Note that the last bound follows from the Massart (1990) version of the Dvoretsky-Kiefer-Wolfowitz inequality.
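The Massart (1990) form of the Dvoretzky-Kiefer-Wolfowitz inequality states that the probability that the empirical d.f. deviates from the true d.f. by more than $t$ in Kolmogorov distance is at most $2e^{-2nt^2}$. The sketch below only evaluates this bound and inverts it; the function names are illustrative, and the additional factor coming from the trimming level in the paper's bounds is not included.

```python
import math

def dkw_bound(n, t):
    """Massart's bound on P(sup_x |F_n(x) - F(x)| > t), capped at 1."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * t * t))

def dkw_radius(n, delta):
    """Smallest t for which the bound 2 * exp(-2 n t^2) does not exceed delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

radius = dkw_radius(1000, 0.05)   # deviation level controlled at 5% for n = 1000
```

For `n = 1000` and a 5% level the radius is about 0.043, which gives a feel for the size of the separation needed between null and alternative at that sample size.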

To handle EII (thus if ), we have


As an easy consequence, taking and , we get:

Theorem 3.1.

Given , for testing


for every the critical region defines an uniformly exponentially consistent (UEC) test.

Since the null hypothesis includes all the contaminated versions (of $\alpha$-level) of $F_0$, rejection means that the generator of the sample is far enough from any such contaminated version. Theorem 3.1 guarantees that alternatives will be quickly detected when farness is measured through the $d_K$-distance.

In statistical practice, it could be wiser to change the alternative hypothesis and make it sample-size dependent. This leads us to consider tests of the form


for , and rejection when . For instance, taking , and results in an uniformly consistent test. Uniform consistency is weaker than uniform exponential consistency, but it allows to detect, for example, alternatives at a distance . Also, we can consider as a tuning parameter which can help if we have some additional information or if we want more or less conservative tests with respect to EI and EII probabilities (of course, when , or or , some bounds are meaningless and we can not assure uniform consistency with the previous procedure). Alternatively, we may look for the smallest possible values for , while still controlling EI and EII. From (10) and (11) note that if is we would lose the control of the errors, since as . This leads us to choose as , or, fixing some :


Now, if we fix , looking for a rejection threshold, , for which

we get and With a bit of algebra we get


imposing , which gives the optimal boundary level


Relations (15) and (16) summarize the balance among the different elements. Ideally, we look for small , and but, paying the price for our demands, grows as gets smaller and as gets more similar to . Therefore, we need to make sensible choices for and . In Table 1 we show some examples of the mentioned behaviour. For instance, fixing and seems a sensible choice, giving a fairly low while keeping low error probabilities.

0.1 0.5 0.90 0.048 0.05 0.25 0.90 0.053 0.01 0.05 0.90 0.063
0.1 0.1 0.50 0.086 0.05 0.05 0.50 0.095 0.01 0.01 0.50 0.114
0.1 0.02 0.10 0.440 0.05 0.01 0.10 0.489 0.01 0.002 0.10 0.586
Table 1: Values associated to error bounds for and .

An appealing goal would be to detect the ‘true’ contamination level, that is, the minimal level of trimming for which the postulated model would not be rejected. In this way we could, also, detect possible contaminations in the generating mechanism. To address this objective, we resort to the following result obtained in greater generality in del Barrio and Matrán (2013).

Theorem 3.2.

If and , then


Therefore, if and we test for , as trimming from will eliminate the part of the sample coming from , but also will affect the part of the sample coming from . This fact and Proposition 2.2 lead to the following statement.

Proposition 3.2.

Let and , and . Then:

Figure 2: Round green (blue, red) dots represent the frequency of rejection (y label) for 150 independent samples of a generating mechanism for sample sizes 2000 (4000, 6000) and a model , as we vary the trimming level (x label). Diamond yellow (cyan, orange) dots represent the rejection frequency for a generator for sample sizes 12500 (25000, 50000). The black dashed line represents the true contamination level which is 0.1, since and . The error probabilities are fixed to .

This means that, for big enough samples, our testing procedure will be able to detect the overtrimming boundary, that is, the trimming level beyond which the trimmed sample is closer to the model than true random samples from that model. In Figure 2 we are able to appreciate this behaviour (see the caption for details). The frequency of rejecting the null, for both models, after trimming 0.11 or more is almost zero, the theoretical contamination being 0.1. We see that around 0.1 the models start dropping abruptly the rejection level, but that for the model contaminated with a we need much less points to attain the expected behaviour than we need for the model contaminated with a

. In other words, the presence of a meaningful outlier contamination, even when trimming is allowed, disturbs more heavily the Kolmogorov distance than the presence of equally meaningful inlier contamination. In any case, these results suggest that it may be possible to find an estimator for the ‘true’ contamination level. We elaborate a little bit more about this in the next section.

4 A central limit theorem with applications

We divide this section into two subsections, devoted respectively to the presentation of results and to some of their applications. In particular, we stress the extension of some of the applications that Lindsay and Liu (2009) and Liu and Lindsay (2009) explored only for multinomial models.

4.1 A central limit result

What follows is our main theoretical result which describes the asymptotic behaviour of the normalized difference between the empirical estimator and the theoretical trimmed Kolmogorov distance under some regularity assumptions. We recall from Section 2 that can be expressed in terms of . We need to introduce the following sets, with , , and standing for the same objects as in Theorem 2.1 in Section 2,


A look at Theorem 4.1 in del Barrio et al. (2019) shows that provided is continuous. We further denote , and . To avoid pathological examples we will assume that


Our last regularity assumptions concern , the d.f. of the random variable , where . They allow the use of the strong approximation of the quantile process in the proof of the theorem (developed in the Appendix). We assume that has a density, supported in (note that, necessarily, ) and either one of

Theorem 4.1.

Assume that and are continuous d.f.’s, that is strictly increasing and that the d.f. associated to satisfies (22) and either (23) or (24). Then,

where is a Brownian bridge on .

The limit distribution in this result corresponds to the supremum of a Gaussian process. In fact, the index set for this process is often rather simple, consisting of only one or two points as we show in our next example.

Example 4.1.

Trimmed Kolmogorov distances in the Gaussian model (cont.) We revisit the cases studied in Example 2.2. Recall that