 # A rank-based Cramér-von-Mises-type test for two samples

We study a rank based univariate two-sample distribution-free test. The test statistic is the difference between the average of between-group rank distances and the average of within-group rank distances. This test statistic is closely related to the two-sample Cramér-von Mises criterion. They are different empirical versions of a same quantity for testing the equality of two population distributions. Although they may be different for finite samples, they share the same expected value, variance and asymptotic properties. The advantage of the new rank based test over the classical one is its ease to generalize to the multivariate case. Rather than using the empirical process approach, we provide a different easier proof, bringing in a different perspective and insight. In particular, we apply the Hájek projection and orthogonal decomposition technique in deriving the asymptotics of the proposed rank based statistic. A numerical study compares power performance of the rank formulation test with other commonly-used nonparametric tests and recommendations on those tests are provided. Lastly, we propose a multivariate extension of the test based on the spatial rank.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

To test whether two samples come from the same or different populations, several distribution free tests such as the Kolmogorov-Smirnov test, the Cramér-von Mises test and their variations have been proposed and widely used. Let and be two independent random samples with continuous distribution functions and , respectively. The two sample problem is to test

 H0:F=GvsHa:F≠G. (1)

Denote and as the empirical distribution functions of the two samples and as the empirical distribution function of the combined sample, where . The Kolmogorov-Smirnov (KS) two-sample test uses the maximum distance (difference) between and . The classical Cramér-von Mises test statistic has the form

 Tc=mnN∫∞−∞[Fm(x)−Gn(x)]2dHN(x). (2)

This test statistic and its asymptotics have been well studied in the literature, for example, Lehmann , Rosenblatt , Darling , Fisz  and Anderson .

Both of the KS test statistic and the Cramér-von Mises test statistic are formulated based on the empirical distributions. Székely and Rizzo , Baringhaus and Franz  studied a test statistic based on the original data. That is

 mnN{1mnm∑i=1n∑j=1|Xi−Yj|−12m2m∑i=1m∑j=1|Xi−Xj|−12n2n∑i=1n∑j=1|Yi−Yj|}. (3)

This test has a direct generalization to the multivariate case. However, it requires an assumption on the first moment and it is not distribution free for the univariate case. It is worth to note that the test statistic (

3) falls in the unified framework on energy statistics studied by Székely and Rizzo [30, 31] and can be easily generalized to the sample problem. Other similar tests include  and , although they are derived under different motivations. Fernandez, Gamerro, and Garcìa 

developed a statistic based on the empirical characteristic functions of the observed observations. The statistic uses a weighted integral of the difference between the empirical characteristic function of the two samples. Gretton et al.

 proposed a test based on a kernel method in which the testing procedure is defined as the maximum difference in expectations over functions evaluated on the two samples. All of those test statistics are of the form being a difference on a measure of between-group and within-group.

In this paper, we propose a new rank based test of the same form. Nevertheless, it overcomes the limitations of (3). It is formulated based on the ranks of two samples with respect to the combined sample . Denote as the standardized rank of the quantity with respect to the distribution , i.e., . For testing the hypothesis (1), we use the following test statistic.

 T= mnN{1mnm∑i=1n∑j=1|R(Xi,HN)−R(Yj,HN)| −12m2m∑i=1m∑j=1|R(Xi,HN)−R(Xj,HN)| −12n2n∑i=1n∑j=1|R(Yi,HN)−R(Yj,HN)|}. (4)

is interpreted as the difference of the average of between-group rank differences and the average of within-group rank differences. A large value of indicates the deviation of two groups. The test based on is distribution-free and does not require any moment condition.

For the balanced samples (), one can consider an equivalent but simpler statistic

 T′=1mnm∑i=1n∑j=1|R(Xi,HN)−R(Yj,HN)|. (5)

is the average of rank differences between two groups. and are equivalent because when .

As we will see later, the test statistic is closely related to the classical nonparametric Cramér-von Mises criterion . They are different empirical plug-in versions of the same population quantity. The rank based test statistic and the Cramér-von Mises criterion may not be equal to each other for finite samples, but they are asymptotically equivalent. The advantage of the new rank based test over the classical one is its ease to generalize to the multivariate case. Multivariate generalizations of Cramér-von Mises tests have been considered by many researchers, but they are either applied on independent data  or used for testing independence 

or used in the goodness-of-fit test of the uniform distribution on the transformed data

. For the rank based formulation, generalizations to the multivariate two sample problem are straightforward by applying notions of multivariate rank functions. In this paper, rather than using the empirical process approach, we provide a different easier proof, bringing in a different perspective and insight. In particular, we apply the Hájek projection and orthogonal decomposition technique in deriving the asymptotics of the proposed statistic.

Some related works include Pettitt  and Baumgartner, Weiß, and Schindler . They considered statistics of Anderson-Darling type that can be viewed as standardized versions of Cramér-von Mises statistics. Schmid and Trede  utilized Cramér-von Mises statistics. A rank-based representation of a Cramér-von Mises statistic under a balanced size and its generalizations are studied by Borroni . Albers, Kallenberg and Martini  studied rank procedures for detecting shift alternatives with increasing shift in the tail of the distribution. Janic-Wróblewska and Ledwina  considered a test based on a combination of several linear rank statistics. Related to the rank procedures, other nonparametric tests include those based on the empirical likelihood approach. Einmahl and McKeague  considered test statistics based on the empirical likelihood ratios for the goodness of fit and two sample problems. It has been proved that those tests are asymptotically equivalent to the one-sample and two sample Anderson-Darling tests. Cao and Van Keilegom 

proposed an empirical likelihood ratio test via kernel density estimation. Gurevich and Vexler

 utilized an empirical likelihood ratio test based on samples entropy.

The paper has the following structure. Section 2 presents the main results, including the formulation of the test statistic and its properties. The simulation study is performed in Section 3. We propose a multivariate extension of the test in Section 4. We summarize and conclude the paper in Section 5. All proofs go to Section 6.

## 2 Main Results

To formulate the rank based test statistic in (1), we first establish its population version. We provide a result of the population version, from which we can see the relationship between our statistic and Cramér-von Mises criterion.

###### Theorem 2.1

Let and

be independent continuous random variables distributed from

and , respectively. Let with be the mixture distribution. Then

 E|R(X,H)−R(Y,H)|−12E|R(X1,H)−R(X2,H)|−12E|R(Y1,H)−R(Y2,H)|≥0 (6)

and the equality holds if and only if .

The above result is based on the following identity which is obtained from Lemma 6.1 in the Appendix.

 E|R(X,H)−R(Y,H)|−12E|R(X1,H)−R(X2,H)|−12E|R(Y1,H)−R(Y2,H)| =∫∞−∞(F(x)−G(x))2d(τF(x)+(1−τ)G(x)). (7)

The result of Theorem 2.1 suggests two possible statistics for testing the hypothesis (1). The first version is the sample plug-in version of the left side of (7). With and multiplying by , it is our test statistic defined in (1). is rejected if the sample version is large, i.e., . The critical value is determined by the significance level and the null distribution of . The test statistic is the difference of the average of between-group rank differences and the average of within-group rank differences. A large value of indicates the deviation of two groups.

The two-sample Cramér-von Mises statistic in (2) is the empirical version of the right side of (7). Hence and are all plug-in statistics of an equal quantity. Nevertheless, they may take different values. We shall thank one of the referees who pointed out this possibility. For example, in the case that , let the two realizations be and and the two realizations be and . It is easy to see that the Cramér-von Mises statistic has value and the test statistic has value . Next, we will study the properties of .

Let be , and . Then we have the following theorem.

###### Theorem 2.2

For , if , then

By this theorem and Theorem 2.1, it is easy to see that our test statistic is consistent for the alternative .

###### Theorem 2.3

Under , is distribution free.

Under , the combined samples constitute a random sample of size from the distribution . So any assignment of numbers to and numbers to from the set of integers

is equally likely, i.e. has probability

which is independent of . Using the fact that those number assignments have one-to-one linear relationship with the standardized ranks, is distribution free. Figure 1: The exact null distribution of T obtained from all combinations (Left: m=7, n=7; Right: m=7, n=9). The vertical line in each graph indicates the 5% critical value.

The exact null distribution of can be found by enumeration of all possible values of by considering the orderings of ’s and ’s. Figure 1 provides the exact null distribution of for sample sizes and by considering all combinations. However, the exact null distribution is infeasible to obtain for large sample sizes because the number of combinations increases dramatically as and increase. For large samples, we can use Monte-Carlo method on all combinations to approximate the null distribution. Also the limiting distribution of can be used to determine the critical values of the test. Next we study asymptotic behaviors of .

Since and the Cramér-von Mises statistic

are different sample plug-in versions of a same quantity for checking the equality of two distributions, we expect that they should have the same expectation, variance and asymptotic distribution under the null hypothesis. The expectation and variance of

are provided by Anderson . In the next theorem, we obtain the same results for , but provide a more straightforward derivation and simpler proof.

###### Theorem 2.4

Under ,

 ET=N+16N=16+16N

and

 Var(T)=N+1180N2[4(N+1)−3N2mn].
###### Remark 2.1

In particular, if , and .

Rosenblatt  and Fisz  have derived the limiting distribution of Cramér-von Mises statistic, which is a mixture of independent distributions. It is necessary to check whether or not our test statistic has the same limiting distribution. Rather than using a stochastic process method, we provide a different Hájek projection approach to obtain the limiting distribution of , which agrees with that of .

We obtain the first order Hájek projection in Lemma 6.2 as

where . Under , has variance

 Var(~T) = m∑i=1Var(E[T|Xi])+n∑j=1Var(E[T|Yj]) = [m(m−n)2m2N2+n(m−n)2n2N2]Var[F(Y1)−F2(Y1)] = (m−n)2180mnN.

Clearly, as . Therefore the first order Hájek projection is not sufficient in deriving the asymptotics of the statistic .

To derive the asymptotics of the statistic under the null hypothesis, it is necessary to have the second order projection of .

 ^T= m∑i=1n∑j=1E[T|Xi,Yj]+∑1≤i

Since

 E{E[T|X1,Y1]}=E{E[T|X1,X2]}=E{E[T|Y1,Y2]}=ET,

. By Lemma 6.3 and Lemma 6.4, it can be examined that

 Cov(E[T|Z1,Z2],E[T|Z1,Z3])=0,

where and are three different variables from , , and

 Var(^T) =m4+n4−2m3n−2mn3+10m2n2−8mnN+5n2+5m2180N2mn +m(m−1)2m2−2mn+5n290m2N2+n(n−1)2n2−2mn+5m290n2N2 =N2180mn+245N−N−136mn−118N2.

Then as under the condition . We shall always assume this condition in the following analysis.

Efron and Stein  discussed a general orthogonal decomposition of a statistic. Here, our statistic is decomposed as , where is the first order projection and is a negligible term. Hence the limiting distribution of is determined by the limiting distribution of .

To determine the limiting distribution of under , let . Then is a degenerate kernel function since is symmetric and . By Lemma 6.3 and Lemma 6.4, we have with as  111If , . and

 ^^T=1Nm∑i=1n∑j=1h(Xi,Yj)−1N∑1≤i

It is not difficult to verify that

 Var[h(Z1,Z2)]=2/45 (8)

and , i.e., and are orthogonal, where and are three different variables from , .

Now we define an operator on the function space by

 Ag∗(x)=∫∞−∞h(x,y)g∗(y)dF(y),x∈R,g∗∈L2(R,F).

This operator only has real eigenvalues since the kernel

is symmetric. Let be the non-zero eigenvalues of the operator obtained by solving the equation . With the substitution of and , solving is equivalent to solve that

 ∫10{|u−v|+u(1−u)+v(1−v)−23}g(v)dv=λg(u), (9)

where . Taking the twice derivative with respect to on both sides of (9), we have the equation . Solving it and substituting back, we have the eigenvalues of being

and the corresponding eigenfunctions

. The eigenfunction for the zero eigenvalue is . Note that the eigenvalues do not depend on , but the eigenfunctions depend on , which give a orthonormal basis for the space . Let . Then we have the following theorem.

###### Theorem 2.5

Under and the condition ,

 TNd⟶Z∞=−√452∞∑k=1λk(χ21k−1),

where are independent variables and . Hence

 (T−ET)/√Var(T)d⟶Z∞=−√452∞∑k=1λk(χ21k−1)

since in probability.

As expected, this asymptotical result agrees with the one for Cramér-von Mises statistic as proved with a stochastic process method in Rosenblatt  and Fisz . This different projection approach we applied here is typically useful in U-statistic theory, but we shall emphasize that is not an U-statistic.

In practice, we may approximate the limiting distribution by a distribution of a finite linear combination of independent random variables, i.e.

 Zd=−√452d∑k=1λk(χ21k−1)=√45π2d∑k=11k2(χ21k−1).

The accuracy of approximation depends on the choice of . Table 1 provides ratios of variance of the mixture and that of the infinite mixture, that is, . Also the table lists 95% quantiles of which are estimated by the average of 10 sample quantiles each on random samples. Those quantile values can be used to approximate the critical values of , which are given by the second part of Table 1. As we will see that even for small sample sizes, the approximated critical values are pretty accurate and close to the exact true values. For the case of , the true size of the test is 0.056 if the approximated critical value 0.4611 is used. For the case of and , the true size of the test is 0.052 if 0.4609 is used. In summary, or is recommended for a compromise between computation and accuracy.

## 3 Simulations

By the simulation study in this section we demonstrate the performance of the T test. There are many nonparametric tests available for the two sample problem. It is by no means to conduct a comprehensive comparison. Here we include Kolmogorov-Smirnov test (KS), Wilcoxon rank sum test (W) or Mood test (M), the empirical likelihood ratio test (ELR) proposed by Gurevich anf Vexler

, the empirical likelihood test (ELT) proposed by Einmahl and McKeague , Baringhaus and Franz’s Cramér test (CT), the test studied in Fernándes et al.  (DT) in the study. It is necessary to note that the CT and DT tests are not distribution-free tests, and their critical values and p-values are based Monte-Carlo method on permutations in each sample, which is implemented in the R package “cramer”. The R package “dbEmpLikeGOF” is used for the ELR test in which the parameter is set to be 0.1 as suggested in . The critical values of the ELT and our T test are computed through random combinations on .

Various alternative distributions are considered. For each case, iterations are computed to estimate powers by calculating the fraction of p-values less than or equal to the level of significance. The Monte Carlo errors can be estimated by . In particular, the size of tests shall maintain in the interval (0.046, 0.054).

Table 2 shows the size and power performance for each test under the normal distributions, where ,, and ,, with 0, 0.25, 0.5, 0.75, and 1. When , the KS test is undersized for both the equal and unequal sample sizes cases; the ELR test is oversized in the equal sample size case and seriously undersized for the sample unequal size case; all other tests keep a desirable size. As expected, the W test is the best among all tests since it is well-known to be powerful for the two-sample problem with a constant shift in location, especially when data follow logistic or normal distributions. The CT and ELT tests are comparable to W. The T test is more powerful than the DT, KS and ELR tests. In the unequal sample size case, the W test is the best followed by the CT test. The ELT and T tests are comparable and significantly better than the DT, KS and ELR tests.

The experiment is repeated for the

-distribution with 3 degrees of freedom and the result is presented in Table

3. Although the statistical power of the test is the highest among all tests for all cases, its power differences with the W test or the ELT test are small so that those three tests are comparable.

Table 4 shows the power performance for the Pareto distribution, where ,, Pa(2, 2) and ,, Pa(2+, 2) are generated, with 0, 0.25, 0.5, 0.75, and 1. The power of the ELR test is much higher than that of all others. For , the power of the ELR test is as high as 90%, which is 30% higher than the second best ELT test. The T test is the third best one. The power difference between the test and that of the test can be as large as 27% for equal sample sizes and can be as large as 32% for unequal sample sizes.

All considered tests as in the experiment for location alternatives are used for scale alternatives except the Wilcoxon test (W), as this is a test for location. Instead, the Mood’s test known as a scale test is used and referred to as the M test. Table 5 displays the results when samples of size 50 are generated from or Pareto, where 1, 1.5, 2, 2.5, and 3. In the normal case, the T test does not compare favorably to all considered tests other than the KS test. It performs significantly better than the KS test, but its power is 2-5 times smaller than that of others. It is interesting to see that the M test outperforms all tests in the normal case but it is the inferior in the Pareto case. The T test has better performance for Pareto samples than for normal samples due to the heavy tails in Pareto distributions. In the Pareto case, all tests outperform the M test by a great margin and the CT test is the superior. As suggested by a reviewer, we add one more case in the simulation in which and with sample sizes . The Monte Carlo powers of the seven tests are listed in Table 5. In this scenario, the T test performs better than KS and DT, but does not compare as favorably to the CT, W, ELT and ELR tests.

In general, the T test is not recommended for scale alternatives. The Kolmogorov-Smirov test is not recommended either. The empirical likelihood ELR test is more suitable for a general scale alternative, but is not recommended for a location alternative for symmetric distributions. The T test has a better performance for location alternatives than scale alternatives. It is easy to explain the power performance of the Cramér-von Mises test with the rank based formulation (1) for the location alternatives. For two samples from the same class distributions (normal distributions, t distributions or Pareto distributions and so on) but with different locations, the ranks in the mixture are quite different. Therefore the corresponding test can easily recognize them and have good power performance. We recommend to apply the T test for location alternatives, especially in the heavy-tailed distributions.

## 4 Multivariate Extension

The proposed rank test statistic is closely related to the two sample Cramér-von Mises criterion. Both statistics are different sample plug-in forms from a same population quantity. The advantage of our rank test is to allow straightforward generalizations to the multivariate case by using different multivariate rank functions. Among them, the spatial rank is appealing due to its computation ease, efficiency and other nice properties , . The sample version of the spatial rank function with respect to , the empirical distribution of the combined sample and in , is defined as

 \boldmath{R}(\boldmath{x},HN)=1NN∑i=1\boldmath{x}−\boldmath{z}i∥\boldmath{x}−\boldmath{z}i∥,

where for , for and is the Euclidian distance. Then the multivariate two-sample spatial rank statistic, denoted by , is defined as

 TM = mnN{1mnm∑i=1n∑j=1∥% \boldmath{R}(\boldmath{x}i,HN)−\boldmath{R}(% \boldmath{y}j,HN)∥ (10) −12m2m∑i=1m∑j=1∥%\boldmath$R$(\boldmath{x}i,HN)−\boldmath{R}(% \boldmath{x}j,HN)∥ −12n2n∑i=1n∑j=1∥%\boldmath$R$(\boldmath{y}i,HN)−\boldmath{R}(% \boldmath{y}j,HN)∥}.

The test statistic is the difference of the average of the intra-group rank distances and the average of the inter-group rank distances. A large value of indicates the deviation of the two groups and rejects the null hypothesis. The multivariate counterpart of Theorem 2.1 states as follows.

###### Theorem 4.1

Let and be independent

-variate continuous random vectors distributed from

and , respectively. Let with . Then

 E∥\boldmath{R}(\boldmath{X},H)−% \boldmath{R}(\boldmath{Y},H)∥−12E∥% \boldmath{R}(\boldmath{X}1,H)−\boldmath{R}(% \boldmath{X}2,H)∥ −12E∥\boldmath{R}(\boldmath{Y% }1,H)−\boldmath{R}(\boldmath{Y}2,H)∥≥0, (11)

where the equality holds if and only if .

The multivariate spatial rank test based on loses the distribution-free property under the null hypothesis. The test relies on the permutation method to determine critical values or compute p-values. But the test is robust. For example, it does not require the assumption of finite second moment as the Hotelling’s test. Neither it requires the assumption of finite first moment as the test (CT) considered by Baringhaus and Franz .

A simulation is conducted to compare performance of , CT and the Hotelling’s under multivariate normal, and Pareto distributions on (). Location and scatter alternatives are considered. For location alternatives in normal and distributions, the parameters of distributions for generating samples of size are and , while for samples with size are and , where and 1. For Pareto distribution, is generated with each component from Pareto(1,1) and is generated with each component from Pareto. R package “Hotelling” is used for the Hotelling’s test. and CT tests use the permutation method to compute p-values and iterations are computed to estimate powers by calculating the fraction of -values less than or equal 0.05. Results for the location alternatives are listed in Table 6.

From Table 6, three tests keep the size 5% well. Powers in are higher than that in for each of three tests under all distributions. In the normal cases, performs slightly worse than the Hotelling’s test and CT. The power of is about lower than that of the Hotelling test and lower than that of CT under when and . However, the power gain of over CT and the Hotelling’s test is huge in the -distributions. For and , is about twice powerful as CT and about triple powerful as the Hotelling test. The advantage of our proposed over CT and the Hotelling’s test are even more significant in the asymmetric Parato distributions than in the distributions for the location alternatives.

Results for scatter alternatives are listed in Table 7. For multivariate normal and distributions, we first consider the difference of scatter matrix only on scales. The parameters for sample are and , while for samples are and and , where and 3. We then consider the alternative with different orientation on the scatter matrices. The scatter matrix is for samples, while it is for samples. Hence two components of are positively correlated and the two components of are negatively correlated. The results for orientation difference alternatives are listed in the last column ”Orient” of Table 7. In , has diagonal elements to be 1 and off-diagonal elements to be 0.5 and

is constructed to have the same eigenvectors as

and eigenvalues to be the reciprocals of eigenvalues of . For Pareto distributions, is generated with each component from Pareto(1,1) and is generated with each component from Pareto.

From Table 7, all tests maintain the size 5% well. For asymmetric Pareto distributions, CT is slightly better than and is better than the Hotelling’s test. For normal and distributions, the Hotelling’s completely fails in scatter alternatives since it is a test on location difference. CT test is much better than for scale alternatives. Particularly CT is triple powerful as the in normal case and twice powerful in the case. This result is not surprising since is based on the spatial ranks that lose major information on distances or scales. However, when two scatter matrices of distributions are different on orientation, performs better than CT, especially in distribution, the power of is twice or triple as that of CT.

## 5 Summary

The problem of testing whether two samples come from the same or different population is a classical one in statistics. In this paper, we have studied a rank-based test for the univariate two sample problem. The test statistic is the difference between the average of between-group rank distances and the average of within-group rank distances. Under the null hypothesis, it is distribution free. The limiting null distribution was explored through techniques of Hájek projection and orthogonal decomposition. It has been proved that the limiting distribution is not normal since the projection on one variable is insufficient to represent the variation of the test statistic. By taking the second-order projection, an operator in the functional space was defined and its eigenfunctions and eigenvalues were applied to derive the limiting distribution. It is a weighted mixture of independent chi-square distributions with the weights being the eigenvalues of the operator. We provided a recommendation how to use the limiting distribution to obtain critical values of the proposed test in practice.

The proposed rank test statistic is closely related to two sample Cramér-von Mises criterion. Both statistics are different sample plug-in forms from the same population quantity. We have provided a counter example to show they are different. However, they have the same expectation, variance and limiting distribution. The advantage of our rank test is to allow straightforward generalizations to the multivariate case by using different multivariate rank functions. A continuation of this work is to study properties of the multivariate Cramér-von Mises test. Also the generalizations based on other multivariate rank functions deserve further investigation.

## 6 Proofs

The following lemma gives the expected value of the absolute difference between the standardized ranks of and .

###### Lemma 6.1

Let and be independent continuous random variables from and , respectively. Let with be the mixture distribution, be the distribution of and be the distribution function of . Then

 E|R(X,H)−R(Y,H)|=∫10J(t)(1−K(t))dt+∫10K(t)(1−J(t))dt. (12)

Proof of Lemma 6.1. Notice that

 |R(X,H)−R(Y,H)| =∫10[I(R(X,H)≤s

Since is continuous and , we have , for any , where . Then (12) holds by Fubini’s Theorem.

Proof of Theorem 2.2. Define

 ~D= 1mnm∑i=1n∑j=1|R(Xi,H)−R(Yj,H)| −12m2m∑i=1m∑j=1|R(Xi,H)−R(Xj,H)| −12n2n∑i=1n∑j=1|R(Yi,H)−R(Yj,H)|.

Conditioning on ,

, by the law of large numbers,

 1mm∑i=1|R(Xi,H)−R(Yj,H)|−EX1|R(X1,H)−R(Yj,H)|→0a.s.,

and

 1mnm∑i=1n∑j=1|R(Xi,H)−R(Yj,H)| =1nn∑j=1[EX1|R(X1,H)−R(Yj,H)|+oa.s.(1)] =E|R(X,H)−R(Y,H)|+oa.s.(1).

By the strong law of large numbers for -statistics ,

 12m2m∑i=1m∑j=1|R(Xi,H)−R(Xj,H)|→12E|R(X1,H)−R(X2,H)|a.s.,

and

 12n2n∑i=1n∑j=1|R(Yi,H)−R(Yj,H)|→12E|R(Y