The Kolmogorov-Smirnov (KS) test (Kolmogorov, 1933; Smirnov, 1948) is a classical and celebrated tool for nonparametric hypothesis testing. Let and be independent samples. Let and denote the two sets of samples, and also let , where . The two-sample KS test statistic is defined as
In words, this measures the maximum absolute difference between the empirical cumulative distribution functions (CDFs) ofand , across all points in the joint sample
. Naturally, the two-sample KS test rejects the null hypothesis offor large values of the statistic. The statistic (1) can also be written in the following variational form:
where denotes total variation, and we define the empirical expectation operators via
The KS test is a fast, general-purpose two-sample nonparametric test. But being a general-purpose test also means that it is systematically less sensitive to some types of differences, such as tail differences (Bryson, 1974). Intuitively, this is because the empirical CDFs of and must both tend to 0 as and to 1 as , so the gap in the tails will not be large.
The insensitivity of the KS test to tail differences is well-known. Several authors have proposed modifications to the KS test to improve its tail sensitivity, based on variance-reweighting(Anderson and Darling, 1952), or Renyi-type statistics (Mason and Schuenemeyer, 1983; Calitz, 1987), to name a few ideas. In a different vein, Wang et al. (2014) recently proposed a higher-order extension of the KS two-sample test, which replaces the total variation constraint on in (2) with a total variation constraint on a derivative of . These authors show empirically that, in some cases, this modification can lead to better tail sensitivity. In the current work, we refine the proposal of Wang et al. (2014), and give theoretical backing for this new test.
A Higher-Order KS Test.
Our test statistic has the form of an integral probability metric (IPM). For a function class , the IPM between distributions and , with respect to , is defined as (Muller, 1997)
where we define the expectation operators by
For a given function class , the IPM is a pseudometric on the space of distributions. Note that the KS test in (2) is precisely , where are the empirical distributions of , respectively, and .
Consider an IPM given by replacing with , for an integer (where we write for the th weak derivative of ). Some motivation is as follows. In the case , we know that the witness functions in the KS test (2), i.e., the functions in that achieve the supremum, are piecewise constant step functions (cf. the equivalent representation (1)). These functions can only have so much action in the tails. By moving to , which is essentially comprised of the th order antiderivative of functions in , we should expect that the witness functions over are th order antiderivatives of piecewise constant functions, i.e., th degree piecewise polynomial functions, which can have much more sensitivity in the tails.
But simply replacing by and proposing to compute leads to an ill-defined test. This is due to the fact that contains all polynomials of degree . Hence, if the
th moments ofdiffer, for any (where we abbreviate for an integer ), then .
As such, we must modify to control the growth of its elements. While there are different ways to do this, not all result in computable IPMs. The approach we take yields an exact representer theorem (generalizing the equivalence between (1) and (2)). Define
denote one-sided limits at 0 from above and below, respectively. Informally, the functions inare pinned down at 0, with all lower-order derivatives (and the limiting th derivative from the right or left) equal to 0, which limits their growth. Now we define the th-order KS test statistic as
An important remark is that for , this recovers the original KS test statistic (2), because contains all step functions of the form , .
Another important remark is that for any , the function class in (4) is “rich enough” to make the IPM in (5) a metric. We state this formally next; its proof, as with all other proofs, is in the appendix.
For any , and any with moments, if and only if .
Figure 1 shows the results of a simple simulation comparing the proposed higher-order tests (5), of orders through 5, against the usual KS test (corresponding to ). For the simulation setup, we used and . For 500 “alternative” repetitions, we drew samples from , drew samples from , and computed test statistics; for another 500 “null” repetitions, we permuted the samples from the corresponding alternative repetition, and again computed test statistics. For each test, we varied the rejection threshold for each test, we calculated its true positive rate using the alternative repetitions, and calculated its false positive rate using the null repetitions. The oracle ROC curve corresponds to the likelihood ratio test (which knows the exact distributions ). Interestingly, we can see that power of the higher-order KS test improves as we increase the order from up to , then stops improving by .
Figure 2 displays the witness function (which achieves the supremum in (5)) for a large-sample version of the higher-order KS test, across orders through 5. We used the same distributions as in Figure 1, but now . We will prove in Section 2 that, for the th order test, the witness function is always a th degree piecewise polynomial (in fact, a rather simple one, of the form or for a knot ). Recall the underlying distributions here have different variances, and we can see from their witness functions that all higher-order KS tests choose to put weight on tail differences. Of course, the power of any test of is determined by the size of the statistic under the alternative, relative to typical fluctuations under the null. As we place more weight on tails, in this particular setting, we see diminishing returns at , meaning the null fluctuations must be too great.
Summary of Contributions.
Our contributions in this work are as follows.
We develop an exact representer theorem for the higher-order KS test statistic (5). This enables us to compute the test statistic in linear-time, for all . For , we develop a nearly linear-time approximation to the test statistic.
We derive the asymptotic null distribution of the our higher-order KS test statistic, based on empirical process theory. For , our approximation to the test statistic has the same asymptotic null.
We provide concentration tail bounds for the test statistic. Combined with the metric property from Proposition 1, this shows that our higher-order KS test is asymptotically powerful against any pair of fixed, distinct distributions .
We perform extensive numerical studies to compare the newly proposed tests with several others.
Other Related Work.
Recently, IPMs have been gaining in popularity due in large part to energy distance tests (Szekely and Rizzo, 2004; Baringhaus and Franz, 2004) and kernel maximum mean discrepancy (MMD) tests (Gretton et al., 2012), and in fact, there is an equivalence between the two classes (Sejdinovic et al., 2013). An IPM with a judicious choice of gives rise to a number of common distances between distributions, such as Wasserstein distance or total variation (TV) distance. While IPMs look at differences , tests based on -divergences (such as Kullback-Leibler, or Hellinger) look at ratios
, but can be hard to efficiently estimate in practice(Sriperumbudur et al., 2009). The TV distance is the only IPM that is also a -divergence, but it is impossible to estimate.
There is also a rich class of nonparametric tests based on graphs. Using minimum spanning trees, Friedman and Rafsky (1979) generalized both the Wald-Wolfowitz runs test and the KS test. Other tests are based on k-nearest neighbors graphs (Schilling, 1986; Henze, 1988) or matchings (Rosenbaum, 2005). The Mann-Whitney-Wilcoxon test has a multivariate generalization using the concept of data depth (Liu and Singh, 1993). Bhattacharya (2016) established that many computationally efficient graph-based tests have suboptimal statistical power, but some inefficient ones have optimal scalings.
Different computational-statistical tradeoffs were also discovered for IPMs (Ramdas et al., 2015b). Further, as noted by Janssen (2000) (in the context of one-sample testing), every nonparametric test is essentially powerless in an infinity of directions, and has nontrivial power only against a finite subspace of alternatives. In particular, this implies that no single nonparametric test can uniformly dominate all others; improved power in some directions generally implies weaker power in others. This problem only gets worse in high-dimensional settings (Ramdas et al., 2015a; Arias-Castro et al., 2018). Therefore, the question of which test to use for a given problem must be guided by a combination of simulations, computational considerations, a theoretical understanding of the pros/cons of each test, and a practical understanding of the data at hand.
In Section 2, we give computational details for the higher-order KS test statistic (5). We derive its asymptotic null in Section 3, and give concentration bounds (for the statistic around the population-level IPM) in Section 4. We give numerical experiments in Section 5, and conclude in Section 6 with a discussion.
Write for the test statistic in (5). In this section, we derive a representer theorem for , develop a linear-time algorithm for , and a nearly linear-time approximation for .
2.1 Representer Theorem
Fix . Let and for , where we write . For the statistic defined by (5),
The proof of this theorem uses a key result from Mammen (1991)
, where it is shown that we can construct a spline interpolant to a given function at given points, such that its higher-order total variation is no larger than that of the original function.
For general , we can interpret (6) as a comparison between truncated th order moments, between the empirical distributions and . The test statistic the maximum over all possible truncation locations . The critical aspect here is truncation, which makes the higher-order KS test statistic a metric (recall Proposition 1). A comparison of moments, alone, would not be enough to ensure such a property.
Theorem 1 itself does not immediately lead to an algorithm for computing , as the range of considered in the suprema is infinite. However, through a bit more work, detailed in the next two subsections, we can obtain an exact linear-time algorithm for all , and a linear-time approximation for .
2.2 Linear-Time Algorithm for
The key fact that we will exploit is that the criterion in (6), as a function of , is a piecewise polynomial of order with knots in . Assume without a loss of generality that . Also assume without a loss of generality that (this simplifies notation, and the general case follows by the repeating the same arguments separately for the points in on either side of 0). Define , , and
Then the statistic in (6) can be succinctly written as
where we let for convenience. Note each , is a th degree polynomial. We can compute a representation for these polynomials efficiently.
Fix . The polynomials in (7) satisfy the recurrence relations
(where ). Given the monomial expansion
we can compute an expansion for , with coefficients , , in time. So we can compute all coefficients , , in time.
To compute in (8), we must maximize each polynomial over its domain , for , and then compare maxima. Once we have computed a representation for these polynomials, as Lemma 1 ensures we can do in time, we can use this to analytically maximize each polynomial over its domain, provided the order is small enough. Of course, maximizing a polynomial over an interval can be reduced to computing the roots of its derivative, which is an analytic computation for any (since the roots of any quartic have a closed-form, see, e.g., Rosen 1995). The next result summarizes.
For any , the test statistic in (8) can be computed in time.
Maximizing a polynomial of degree is not generally possible in closed-form. However, developments in semidefinite optimization allow us to approximate its maximum efficiently, investigated next.
2.3 Linear-Time Approximation for
Seminal work of Shor (1998); Nesterov (2000) shows that the problem of maximizing a polynomial over an interval can be cast as a semidefinite program (SDP). The number of variables in this SDP depends only on the polynomial order , and all constraint functions are self-concordant. Using say an interior point method to solve this SDP, therefore, leads to the following result.
Let denote the -approximation from Proposition 3. Under the null , we would need to have in order for the approximation to share the asymptotic null distribution of , as we will see in Section 3.3. Taking say, , the statistic requires computational time, and this is why in various places we make reference to a nearly linear-time approximation when .
2.4 Simple Linear-Time Approximation
We conclude this section by noting a simple approximation to (6) given by
We would need to have in order for to share the asymptotic null of , see again Section 3.3 (this is assuming that has moments, so the sample moments concentrate for large enough ). This will not be true of , the maximum gap, in general. But it does hold when is continuous, having compact support, and a density bounded from below on its support; here, in fact, (see, e.g., Wang et al. 2014).
Although it does not have the strong guarantees of the approximation from Proposition 3, the statistic in (9) is simple and efficient—we must emphasize that it can be computed in linear time, as a consequence of Lemma 1 (the evaluations of at the sample points are the constant terms , in their monomial expansions)—and is likely a good choice for most practical purposes.
3 Asymptotic Null
To study the asymptotic null distribution of the proposed higher-order KS test, we will appeal to uniform central limit theorems (CLTs) from the empirical process theory literature, reviewed here for completeness. For functionsin a class , let denote a Gaussian process indexed by with mean and covariance
For functions , let denote the set of functions . Call a bracket of size , where denotes the norm, defined as
Finally, let be the smallest number of -sized brackets that are required to cover . Define the bracketing integral of as
Note that this is finite when grows slower than . We now state an important uniform CLT from empirical process theory.
Theorem 2 (Theorem 11.1.1 in Dudley 1999).
If is a class of functions with finite bracketing integral, then when and , the process
converges weakly to the Gaussian process . Hence,
3.1 Bracketing Integral Calculation
To derive the asymptotic null of the higher-order KS test, based on its formulation in (5), and Theorem 2, we would need to bound the bracketing integral of . While there are well-known entropy (log covering) number bounds for related function classes (e.g., Birman and Solomyak 1967; Babenko 1979), and the conversion from covering to bracketing numbers is standard, these results unfortunately require the function class to be uniformly bounded in the sup norm, which is certainly not true of .
Note that the representer result in (6) can be written as , where
We can hence instead apply Theorem 2 to , whose bracketing number can be bounded by direct calculation, assuming enough moments on .
Fix . Assume , for some . For the class in (10), there is a constant depending only on such that
3.2 Asymptotic Null for Higher-Order KS
When , note that for , the covariance function is
where denotes the CDF of . For , the covariance function is again equal to . The supremum of this Gaussian process over is that of a Brownian bridge, so Theorem 3 recovers the well-known asymptotic null distribution of the KS test, which (remarkably) does not depend on .
When , it is not clear how strongly the supremum of the Gaussian process from Theorem 3 depends on ; it appears it must depend on the first moments of , but is not clear whether it only depends on these moments. Section 5 investigates empirically. Currently, we do not have a precise understanding of whether the asymptotic null is useable in practice, and we suggest using a permutation null instead.
3.3 Asymptotic Null Under Approximation
The approximation from Proposition 3 shares the same asymptotic null, provided is small enough.
The approximation in (9) shares the same asymptotic null, provided is continuous with compact support.
4 Tail Concentration
We examine the convergence of our test statistics to their population analogs. In general, if the population-level IPM is large, then the concentration bounds below will imply that the empirical statistic will be large for sufficiently large, and the test will have power.
We first review the necessary machinery, again from empirical process theory. For , and a function
of a random variable, recall the norm is defined as . For , recall the exponential Orlicz norm of order is defined as
(These norms depend on the measure , since they are defined in terms of expectations with respect to , though this is not explicit in our notation.)
We now state an important concentration result.
Theorem 4 (Theorems 2.14.2 and 2.14.5 in van der Vaart and Wellner 1996).
Let be a class functions with an envelope function , i.e., for all . Define
and abbreviate . For , if , then for a constant ,
and for , if , then for a constant ,
The two-sample test statistic satisfies (following by a simple argument using convexity)
The terms on the right hand side can each be bounded by Theorem 4, where we can use the envelope function for . Using Markov’s inequality, we can then get a tail bound on the statistic.
Fix . Assume that both have moments, where and . For the statistic in (6), for any , with probability ,
where , and is a constant. If both have finite exponential Orlicz norms of order , then the above holds for .
When we assume moments, the population IPM for also has a representer in ; by Proposition 1, this implies is also a metric.
Fix . Assuming both have moments, . Therefore, by Proposition 1, is a metric (over the space of distributions with moments).
Putting this metric property together with Theorem 5 gives the following.
Fix . For and ,
reject when the higher-order KS test statistic (6)
where is as in Theorem 5. For any
that meet the moment conditions of Theorem 5,
as in such a way that approaches a positive constant, we
have type I error tending to 0, and power tending to 1, i.e., the higher-order
KS test is
approaches a positive constant, we have type I error tending to 0, and power tending to 1, i.e., the higher-order KS test isasymptotically powerful.
5 Numerical Experiments
We present numerical experiments that examine the convergence of our test statistic to its asymptotic null, its power relative to other general purpose nonparametric tests, and its power when have densities with local differences. Experiments comparing to the MMD test with a polynomial kernel are deferred to the appendix.
Convergence to Asymptotic Null.
In Figure 3, we plot histograms of finite-sample higher-order KS test statistics and their asymptotic null distributions, when . We considered both and
(the uniform distribution standardized to have mean 0 and variance 1). For a total of 1000 repetitions, we drew two sets of samples from, each of size , then computed the test statistics. For a total of 1000 times, we also approximated the supremum of the Gaussian process from Theorem 3 via discretization. We see that the finite-sample statistics adhere closely to their asymptotic distributions. Interestingly, we also see that the distributions look roughly similar across all four cases considered. Future work will examine more thoroughly.
Comparison to General-Purpose Tests.
In Figures 5 and 5, we compare the higher-order KS tests to the KS test, and other widely-used nonparametric tests from the literature: the kernel maximum mean discrepancy (MMD) test (Gretton et al., 2012) with a Gaussian kernel, the energy distance test (Szekely and Rizzo, 2004), and the Anderson-Darling test (Anderson and Darling, 1954). The simulation setup is the same as that in the introduction, where we considered with different variances, except here we study different means: , , and different third moments: , , where
denotes Student’s t-distribution with 3 degrees of freedom. The higher-order KS tests generally perform favorably, and in each setting there is a choice ofthat yields better power than KS. In the mean difference setting, this is , and the power degrades for , likely because these tests are “smoothing out” the mean difference too much; see Proposition 4.
Local Density Differences.
In Figures 7 and 7, we examine the higher-order KS tests and the KS test, in cases where have densities such that has sharp local changes. Figure 7 shows a case where is piecewise constant with a few short departures from 0 (see the appendix for a plot) and . The KS test is very powerful, and the higher-order KS tests all perform poorly; in fact, the KS test here has better power than all commonly-used nonparametric tests we tried (results not shown). Figure 7 displays a case where changes sharply in the right tail (see the appendix for a plot) and . The power of the higher-order KS test appears to increase with , likely because the witness functions are able to better concentrate on sharp departures for large .
This paper began by noting the variational characterization of the classical KS test as an IPM with respect to functions of bounded total variation, and then proposed a generalization to higher-order total variation classes. This generalization was nontrivial, with subtleties arising in defining the right class of functions so that the statistic was finite and amenable for simplification via a representer result, challenges in computing the statistic efficiently, and challenges in studying asymptotic convergence and concentration due to the fact that the function class is not uniformly sup norm bounded. The resulting class of linear-time higher-order KS tests was shown empirically to be more sensitive to tail differences than the usual KS test, and to have competitive power relative to several other popular tests.
In future work, we intend to more formally study the power properties of our new higher-order tests relative to the KS test. The following is a lead in that direction. For , define to be the th order integral operator, acting on a function , via
Denote by the CDFs of the distributions . Notice that the population-level KS test statistic can be written as , where is the sup norm. Interestingly, a similar representation holds for the higher-order KS tests.
Assuming have moments,
where is the adjoint of the bounded linear operator , with respect to the usual inner product. Further, if are supported on , or their first moments match, then we have the more explicit representation
The representation in Proposition 4 could provide one avenue for power analysis. When are supported on , or have matching moments, the representation is particularly simple in form. This form confirms the intuition that detecting higher-order moment differences is hard: as increases, the -times integrated CDF difference becomes smoother, and hence the differences are less accentuated.
In future work, we also intend to further examine the asymptotic null of the higher-order KS test (the Gaussian process from Theorem 3), and determine to what extent it depends on the underlying distribution (beyond say, its first moments). Lastly, some ideas in this paper seem extendable to the multivariate and graph settings, another direction for future work.
We thank Alex Smola for several early inspiring discussions. VS and RT were supported by NSF Grant DMS-1554123.
Appendix A Appendix
a.1 Comparing the Test in Wang et al. (2014)
The test statistic in Wang et al. (2014) can be expressed as
This is very close to our approximate statistic in (9). The only difference is that we replace by for .
a.2 Proof of Proposition 1
We first claim that is an envelope function for , meaning for all . To see this, note each has th weak derivative with left or right limit of 0 at 0, so ; repeatedly integrating and applying the derivative constraints yields the claim. Now due to the envelope function, if have moments, then the IPM is well-defined: , for all . Thus if , then clearly .
For the other direction, suppose that . By simple rescaling, for any , if , then . Therefore implies , where
This also implies , where
As the class contains , where (and is the class of infinitely differentiable, compactly supported functions on ), we have by Lemma 4 that for all open sets . By similar arguments, we also get that , for all open sets , where . This implies that (as , and the same for ), and finally, for all open sets , which means that .
a.3 Statement and Proof of Lemma 4
For any two distributions supported on an open set , if for all , then .
It suffices to show that for every open set . As are probability measures and hence Radon measures, there exists a sequence of compact sets , such that and . Let , be smooth compactly supported functions with values in such that on and outside of . (Such functions can be obtained by applying Urysohn’s Lemma on appropriate sets containing and and convolving the resulting continuous function with a bump function.) Then (where the equality by the main assumption in the lemma). Taking gives . By reversing the roles of , we also get . Thus . ∎
a.4 Proof of Theorem 1
Let be as in (10). Noting that , it is sufficient to show
Fix any . Denote . From the statement and proof of Theorem 1 in Mammen (1991), there exists a spline of degree , with finite number of knots such that for all
and importantly, . As , we hence know that the boundary constraints (derivative conditions at 0) are met, and .
Because is a spline with a given finite number of knot points, we know that it has an expansion in terms of truncated power functions. Write for the knots of , where . Also denote when , and when . Then for some , , and a polynomial of degree , we have
The boundary conditions on , , , imply
The second line above implies that
In the second case, we have . In the first case, we have , so . Therefore, in all cases we can write
with the new understanding that is either or . This means that lies in the span of functions in . Furthermore, our last expression for implies
Finally, using the fact that and agree on ,