Testing the equality of distributions from independent random samples is a classical statistical problem encountered in almost every field. Owing to its fundamental importance and wide applicability, research on the $K$-sample problem has remained active since the 1940s. Various tests have been proposed, and new tests continue to emerge.
Often an omnibus test is based on a discrepancy measure among distributions. For example, widely used and well-studied tests such as the Cramér-von Mises test, the Anderson-Darling test ([9, 32]) and their variations use different norms on the difference of empirical distribution functions, while some ([2, 23]) are based on the comparison of density estimators when the underlying distributions are continuous. Other tests ([35, 12]) are based on difference measures of characteristic functions. One such measure is the energy distance ([36, 37]), a weighted distance between characteristic functions, defined as follows.
Definition 1.1 (Energy distance)
Suppose that $(X, X')$ and $(Y, Y')$ are independent pairs drawn independently from $d$-variate distributions $F$ and $G$, respectively. Then the energy distance between $F$ and $G$ is
$$\mathcal{E}(F, G) = 2E\|X - Y\| - E\|X - X'\| - E\|Y - Y'\|. \tag{1}$$
Let the characteristic functions of $X$ and $Y$ be $\phi_X$ and $\phi_Y$, respectively. It has been proved that
$$\mathcal{E}(F, G) = c(d) \int_{\mathbb{R}^d} \frac{|\phi_X(t) - \phi_Y(t)|^2}{\|t\|^{d+1}}\, dt,$$
where $c(d)$ is a constant depending on $d$. Clearly, $\mathcal{E}(F, G) = 0$ if and only if $F = G$. A natural estimator of (1), a linear combination of three $U$-statistics, is called the energy statistic. The null hypothesis $F = G$ is rejected if the energy statistic is sufficiently large. To extend to the $K$-sample problem, Rizzo and Székely proposed a new method called distance components (DISCO), which partitions the total distance dispersion of the pooled samples into within-sample and between-sample distance components, analogous to the variance components in ANOVA. The test statistic is the ratio of the between variation to the within variation, where the between variation is the weighted sum of all two-sample energy distances. Equivalently, Dang et al. conducted a test based on the ratio of the between variation to the total variation, in which the ratio defines a dependence measure. Although those tests are consistent against any departure from the null hypothesis and their test statistics are easy to compute, they have to rely on a permutation procedure to determine the critical values, since the null distribution depends on the unknown underlying distributions.
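The energy statistic and the permutation procedure mentioned above can be sketched as follows. This is a minimal illustration, not the implementation used in the cited works: it assumes the standard form $2E\|X-Y\| - E\|X-X'\| - E\|Y-Y'\|$ of the energy distance, estimates each term with a $U$-statistic, and uses hypothetical function names (`energy_statistic`, `permutation_pvalue`). Observations are represented as tuples of coordinates.

```python
import math
import random
from itertools import combinations


def mean_between(xs, ys):
    """Average Euclidean distance over all pairs (x, y) with x in xs, y in ys."""
    return sum(math.dist(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def mean_within(xs):
    """U-statistic estimate of E||X - X'||: average over distinct pairs."""
    pairs = list(combinations(xs, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def energy_statistic(xs, ys):
    """Estimate 2E||X-Y|| - E||X-X'|| - E||Y-Y'|| from three U-statistics."""
    return 2.0 * mean_between(xs, ys) - mean_within(xs) - mean_within(ys)


def permutation_pvalue(xs, ys, n_perm=200, seed=0):
    """Permutation p-value: reshuffle the pooled sample and recompute."""
    rng = random.Random(seed)
    observed = energy_statistic(xs, ys)
    pooled = list(xs) + list(ys)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = energy_statistic(pooled[:len(xs)], pooled[len(xs):])
        if stat >= observed:
            exceed += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (exceed + 1) / (n_perm + 1)


# Hypothetical usage: two univariate normal samples that differ in scale.
demo_rng = random.Random(1)
sample_a = [(demo_rng.gauss(0.0, 1.0),) for _ in range(20)]
sample_b = [(demo_rng.gauss(0.0, 3.0),) for _ in range(20)]
demo_p = permutation_pvalue(sample_a, sample_b)
```

Each permutation recomputes all pairwise distances, which is the computational cost that motivates tests with a known limiting distribution.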
Empirical likelihood (EL) tests ([4, 10, 42]) successfully avoid the time-consuming permutation procedure. As a nonparametric approach, the EL ([24, 25]) also enjoys the effectiveness of the likelihood method and hence has been widely used; see [27, 28, 40] and the references therein. We refer to [6, 5, 7, 11] for updates on the EL in high dimensions. When the constraints are nonlinear, however, the EL loses this efficiency because the optimization becomes computationally difficult. To overcome this difficulty, Wood proposed a sequential linearization method that linearizes the nonlinear constraints. However, they did not provide a Wilks' theorem and stated that it was not easy to establish. Jing et al. proposed the jackknife empirical likelihood (JEL) approach. The JEL method transforms the maximization problem of the EL with nonlinear constraints into the simple case of the EL on the mean of jackknife pseudo-values, which is very effective in handling one- and two-sample $U$-statistics. This approach has attracted strong interest from statisticians in a wide range of fields due to its efficiency, and many papers are devoted to investigating the method.
Recently, several JEL tests ([21, 19, 20]) based on characteristic functions have been developed for the two-sample problem. Wan, Liu and Deng proposed a JEL test using the energy distance, which is a function of three $U$-statistics. To avoid the degeneracy problem of the $U$-statistics, a nuisance parameter is introduced, and the resulting JEL method involves three constraints. The limiting distribution of the log-likelihood is a weighted chi-squared distribution. Directly generalizing their JEL test to the $K$-sample problem may not work, since the number of constraints increases quickly with $K$. The many constraints not only make computation difficult but also pose challenges for the theoretical development.
We propose a JEL test for the $K$-sample problem with far fewer constraints. We treat the $K$-sample testing problem as a dependence test between a numerical variable and a categorical variable indicating samples from different populations. We apply the JEL with the Gini correlation, which characterizes the dependence. The limiting distribution of the proposed JEL ratio is a standard chi-squared distribution. To the best of our knowledge, our approach is the first consistent JEL test for univariate and multivariate $K$-sample problems in the literature. The idea of viewing the $K$-sample test as an independence test between a numerical and a categorical variable is not new. Jiang, Ye and Liu proposed a nonparametric test based on mutual information. The numerical variable is discretized so that the mutual information can be easily evaluated. However, their method applies only to univariate populations. Heller, Heller and Gorfine ([13, 14]) proposed a dependence test based on rank distances, but their test requires a permutation procedure.
The remainder of the paper is organized as follows. In Section 2, we develop the JEL method for the $K$-sample test. Simulation studies are conducted in Section 3. A real data analysis is presented in Section 4. Section 5 concludes the paper with a brief summary. All proofs are deferred to the Appendix.
2 JEL test for $K$-sample based on a categorical Gini correlation
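Let $\mathcal{X}_k = \{X^{(k)}_1, \dots, X^{(k)}_{n_k}\}$ be a sample from the $d$-variate distribution $F_k$, $k = 1, \dots, K$, respectively. The pooled sample is denoted as $\mathcal{X} = \mathcal{X}_1 \cup \dots \cup \mathcal{X}_K$, of sample size $n = n_1 + \dots + n_K$. The objective is to test the equality of the $K$ distributions, that is,
$$H_0: F_1 = F_2 = \dots = F_K \quad \text{versus} \quad H_a: F_j \neq F_k \text{ for some } j \neq k. \tag{2}$$
Let $Y$ be the categorical variable taking values $L_1, \dots, L_K$, and let $X$ be a continuous random variable in $\mathbb{R}^d$ with the conditional distribution of $X$ given $Y = L_k$ being $F_k$. Assume $P(Y = L_k) = p_k > 0$. Then the distribution of $X$ is the mixture distribution $F$ defined as
$$F(x) = \sum_{k=1}^{K} p_k F_k(x).$$
Treating $\hat{p}_k = n_k / n$ as an unbiased and consistent estimator of $p_k$, we can view the pooled sample $\mathcal{X}$ as a sample from $F$.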
By introducing the numerical variable $X$ and the categorical variable $Y$, testing (2) is equivalent to testing the independence between $X$ and $Y$. We will adopt the recently proposed categorical Gini correlation, which characterizes the independence of the continuous and categorical variables.
2.1 Categorical Gini correlation
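Let $X^{(k)}_1$ and $X^{(k)}_2$ be i.i.d. copies from $F_k$, and $X_1$ and $X_2$ be i.i.d. copies from $F$. Let
$$\Delta_k = E\|X^{(k)}_1 - X^{(k)}_2\|, \qquad \Delta = E\|X_1 - X_2\|$$
be the Gini distances of $F_k$ and $F$, respectively. Then the Gini correlation between a continuous random variable and a categorical variable is defined as follows.
Definition 2.1 (Dang et al.)
For a non-degenerate random vector $X$ in $\mathbb{R}^d$ and a categorical variable $Y$, if $E\|X\| < \infty$, the Gini correlation of $X$ and $Y$ is defined as
$$\rho_g(X, Y) = \frac{\Delta - \sum_{k=1}^{K} p_k \Delta_k}{\Delta}.$$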
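The Gini correlation characterizes the dependence. That is, $\rho_g(X, Y) = 0$ if and only if $X$ and $Y$ are independent. This is because
$$\Delta - \sum_{k=1}^{K} p_k \Delta_k = c(d) \sum_{k=1}^{K} p_k \int_{\mathbb{R}^d} \frac{|\psi_k(t) - \psi(t)|^2}{\|t\|^{d+1}}\, dt,$$
where $c(d)$ is a constant depending on $d$, and $\psi_k$ and $\psi$ are the characteristic functions of $F_k$ and $F$, respectively. Hence we have the following result.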
Lemma 2.1 (Dang et al.)
For $d \geq 1$, $\rho_g(X, Y) = 0$ if and only if $F_1 = F_2 = \dots = F_K$.
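Therefore, testing (2) is equivalent to testing whether $\rho_g(X, Y) = 0$. The Gini distances $\Delta_k$ and $\Delta$ can be estimated by
$$U_{n_k} = \binom{n_k}{2}^{-1} \sum_{1 \le i < j \le n_k} \|X^{(k)}_i - X^{(k)}_j\|, \qquad U_n = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} \|X_i - X_j\|,$$
where $X_1, \dots, X_n$ denote the pooled observations. Clearly, $U_{n_k}$ and $U_n$ are $U$-statistics of degree 2 with the kernel being $h(x_1, x_2) = \|x_1 - x_2\|$, and they are unbiased estimators of $\Delta_k$ and $\Delta$, respectively.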
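Under $H_0$, the Gini distances satisfy $\Delta_1 = \dots = \Delta_K = \Delta$. Conversely, suppose $\Delta_1 = \dots = \Delta_K = \Delta$. Then $\rho_g(X, Y) = 0$ and hence $F_1 = \dots = F_K$. Therefore, testing $H_0$ is equivalent to testing
$$\Delta_1 = \dots = \Delta_K = \Delta.$$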
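The sample version of the categorical Gini correlation can be sketched as below. The sketch assumes the form $(\Delta - \sum_k p_k \Delta_k)/\Delta$ for the correlation, plugs in $U$-statistic estimates of the Gini distances and the proportions $n_k/n$, and uses hypothetical function names; observations are tuples of coordinates.

```python
import math
from itertools import combinations


def gini_distance(points):
    """U-statistic of degree 2 with kernel ||x1 - x2||."""
    pairs = list(combinations(points, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def gini_correlation(samples):
    """Plug-in estimate of (Delta - sum_k p_k * Delta_k) / Delta.

    `samples` is a list of K samples; each sample is a list of points.
    """
    pooled = [x for sample in samples for x in sample]
    n = len(pooled)
    total = gini_distance(pooled)  # estimate of the overall Gini distance
    within = sum(len(s) / n * gini_distance(s) for s in samples)
    return (total - within) / total


# Hypothetical usage: two well-separated univariate samples.
rho_hat = gini_correlation([[(0.0,), (1.0,)], [(10.0,), (11.0,)]])
```

A value near 1 indicates that almost all of the pooled Gini distance is between-sample variation, while a value near 0 is consistent with equal distributions.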
2.2 JEL test for $K$-sample
In order to apply the JEL, we define the corresponding jackknife pseudo-values for as
It is easy to see that
Under $H_0$, we have
where , with the expectations taken under $H_0$.
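As an illustration of jackknife pseudo-values, the sketch below uses the standard form $\hat V_i = n U_n - (n-1) U_{n-1}^{(-i)}$ for the Gini-distance $U$-statistic of a single sample, where $U_{n-1}^{(-i)}$ is computed with the $i$-th observation removed. The exact pseudo-values used for the $K$ samples and the pooled sample are an assumption here, and the function names are hypothetical.

```python
import math
from itertools import combinations


def gini_distance(points):
    """U-statistic of degree 2 with kernel ||x1 - x2||."""
    pairs = list(combinations(points, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def jackknife_pseudo_values(points):
    """Pseudo-values V_i = n*U_n - (n-1)*U_{n-1}^{(-i)}; needs n >= 3."""
    n = len(points)
    full = gini_distance(points)
    values = []
    for i in range(n):
        leave_one_out = points[:i] + points[i + 1:]
        values.append(n * full - (n - 1) * gini_distance(leave_one_out))
    return values


# Hypothetical usage on three univariate observations.
pts = [(0.0,), (1.0,), (3.0,)]
pseudo = jackknife_pseudo_values(pts)
```

For a $U$-statistic of degree 2, the average of these pseudo-values equals the $U$-statistic itself, which is what lets the EL for a mean be applied to them.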
Next, we apply the JEL to the above jackknife pseudo-values. Let
be the empirical probability vector assigned to the elements of, , and be probability vector for . We have the following optimization problem.
subject to the following constraints
in equation (7) maximizes the squared standard jackknife empirical likelihood ratio (JELR). This is because is the marginal probability and is the conditional probability and then we have . The maximization in is the same maximization solution of the regular JELR.
Applying the Lagrange multiplier method, one has
where satisfy the following equations:
Define and assume
Note that C1 implies . We have the following Wilks’ theorem.
Under $H_0$ and the conditions C1 and C2, we have
Proof. See the Appendix.
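The Lagrange multiplier step can be illustrated in the simplest setting: a single sample of pseudo-values and one constraint on their mean. This is a simplified sketch, not the $K$-sample optimization of this section, which involves more constraints. It maximizes $\prod_i n w_i$ subject to $\sum_i w_i = 1$ and $\sum_i w_i (v_i - \mu) = 0$, which gives weights $w_i = 1 / \{n(1 + \lambda (v_i - \mu))\}$ with $\lambda$ solving $\sum_i (v_i - \mu)/(1 + \lambda (v_i - \mu)) = 0$. The function name and the bisection solver are illustrative choices.

```python
import math


def el_log_ratio(values, mu, n_iter=200):
    """Return -2 log(empirical likelihood ratio) for the mean `mu`."""
    z = [v - mu for v in values]
    if min(z) >= 0.0 or max(z) <= 0.0:
        return math.inf  # mu is not strictly inside the range of the data

    def score(lam):
        # Strictly decreasing in lam on the admissible interval.
        return sum(zi / (1.0 + lam * zi) for zi in z)

    # All weights stay positive for lam in (-1/max(z), -1/min(z)).
    lo, hi = -1.0 / max(z), -1.0 / min(z)
    margin = 1e-10 * (hi - lo)
    lo, hi = lo + margin, hi - margin
    for _ in range(n_iter):  # bisection for the root of score(lam) = 0
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 2.0 * sum(math.log1p(lam * zi) for zi in z)


# Hypothetical usage: pseudo-values tested against a candidate mean.
stat_at_mean = el_log_ratio([1.0, 2.0, 3.0, 6.0], 3.0)
stat_off_mean = el_log_ratio([1.0, 2.0, 3.0, 6.0], 2.0)
```

In this one-constraint case the statistic would be compared with a chi-squared quantile with one degree of freedom (3.841 at the 0.05 level); the statistic is zero when the candidate mean equals the sample mean and grows as the candidate moves away from it.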
As a special case of the $K$-sample test, the following result holds for $K = 2$.
For the two-sample problem, under the conditions C1-C2 and $H_0$, we have
Compared with the result of , the limiting distribution of the proposed empirical log-likelihood ratio is a standard chi-squared distribution. The empirical log-likelihood does not need to be multiplied by a factor to adjust for unbalanced sample sizes.
and the power of the test is
In the next theorem, we establish the consistency of the proposed test, which states that its power tends to 1 as the sample size goes to infinity.
Under the conditions C1 and C2, the proposed JEL test for the $K$-sample problem is consistent for any fixed alternative. That is,
Proof. See the Appendix.
3 Simulation Study
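In order to assess the proposed JEL method for homogeneity testing, we conduct extensive simulation studies in this section. We compare the following methods.
JEL-S: the proposed JEL approach.
JEL-W: the JEL approach proposed in . It is applied only for $K = 2$.
ET: the test based on the energy distance.
AD: the Anderson-Darling test.
KW: the Kruskal-Wallis test.
HHG: the test of Heller, Heller and Gorfine.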
Type I error rates and powers for each method are based on 10,000 replications at two significance levels. The results at the second level are similar to those at the 0.05 level and hence are not presented. We consider only one case of $K = 2$ to demonstrate the similarity of our JEL-S and JEL-W. The remaining cases are for $K = 3$ without loss of generality. We generate univariate ($d = 1$) and multivariate random samples from normal, heavy-tailed $t$ and asymmetric exponential distributions. For each distribution, samples of balanced and unbalanced sizes are generated.
3.1 Normal distributions
We first compare our JEL-S with JEL-W, which is also a JEL approach based on energy statistics but designed for the two-sample problem. We generate two independent samples with either equal or unequal sample sizes from the $d$-dimensional normal distributions $N(\mathbf{0}, I_d)$ and $N(\mathbf{0}, \sigma^2 I_d)$, respectively, where $\mathbf{0}$ is the $d$-dimensional zero vector, $I_d$ is the identity matrix in $d$ dimensions and $\sigma$ is a positive number specifying the difference of scales. The results are displayed in Table 1.
As expected, JEL-W and our approach perform similarly because both are JEL approaches based on the energy distance for comparing two samples. The advantages of the JEL approach over the others in testing scale differences carry over to $K = 3$, as demonstrated in the following simulation.
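Three random samples are simulated from the normal distributions $N(\mathbf{0}, I_d)$, $N(\mathbf{0}, \sigma_2^2 I_d)$ and $N(\mathbf{0}, \sigma_3^2 I_d)$, respectively, where $\sigma_2$ and $\sigma_3$ are positive numbers. The simulation results are shown in Table 2.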
In Table 2, the sizes of the tests are given in the rows of $(1, 1)$ and the powers in the other rows. We can see that every method maintains the nominal level well. As expected, KW performs poorly for scale differences because KW is a nonparametric one-way ANOVA on ranks and is inconsistent for the scale-difference problem. Although ET and AD are consistent, they are less powerful than the JEL method and HHG. The JEL method always has the highest power among all the tests considered.
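The weak performance of a rank-based location test under pure scale differences can be checked with a small Monte Carlo sketch. The code below is illustrative only: the standard deviations, sample size and number of replications are hypothetical choices, not the settings of Table 2. It computes the Kruskal-Wallis statistic directly and, for three groups, uses the fact that the chi-squared tail with 2 degrees of freedom is $\exp(-h/2)$.

```python
import math
import random


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction; continuous data)."""
    labelled = sorted((x, g) for g, xs in enumerate(groups) for x in xs)
    rank_sums = [0.0] * len(groups)
    for rank, (_, g) in enumerate(labelled, start=1):
        rank_sums[g] += rank
    n = len(labelled)
    between = sum(r * r / len(xs) for r, xs in zip(rank_sums, groups))
    return 12.0 / (n * (n + 1)) * between - 3.0 * (n + 1)


def kw_rejection_rate(sds, n_per_group=40, reps=500, alpha=0.05, seed=0):
    """Monte Carlo rejection rate for three centred normal samples."""
    assert len(sds) == 3  # the p-value formula below assumes 2 d.f.
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        groups = [[rng.gauss(0.0, sd) for _ in range(n_per_group)]
                  for sd in sds]
        p_value = math.exp(-kruskal_wallis_h(groups) / 2.0)
        rejections += p_value < alpha
    return rejections / reps


# Hypothetical settings: equal scales versus clearly different scales.
rate_null = kw_rejection_rate((1.0, 1.0, 1.0))
rate_scale = kw_rejection_rate((1.0, 2.0, 3.0))
```

Because the three samples share the same center, the rank sums stay balanced and the rejection rate under different scales remains low, far from the power expected of a consistent test.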
Next, we consider the location difference case. Three random samples are simulated from the normal distributions $N(\mathbf{0}, I_d)$, $N(\delta_2 \mathbf{1}, I_d)$ and $N(\delta_3 \mathbf{1}, I_d)$, respectively. Here $\mathbf{1}$ is the $d$-vector with all elements being 1. The sizes of the tests are reported in the rows of $(0, 0)$ in Table 3, and the other rows provide the powers of the tests.
| (0, 0) | .056 | .049 | .052 | .052 | .050 | .048 | .042 | .052 | .052 | .049 |
The Type I error rates of all tests are close to the nominal level. JEL-S performs the worst, with the lowest power in this case, although it is consistent against any alternative. An intuitive interpretation is that the JEL assigns more weight to the sample points lying between classes and thus loses power to differentiate the classes. The phenomenon of lower power in the location-difference problem is also common for the density approach, as mentioned in . For the location difference problem, we suggest using nonparametric tests based on distribution function approaches. For example, the AD and KW tests are recommended.
Although our JEL-S has low power for testing location differences, it is sensitive to location-scale changes. Three random samples are simulated from the normal distributions $N(\mathbf{0}, I_d)$, $N(\delta_2 \mathbf{1}, \sigma_2^2 I_d)$ and $N(\delta_3 \mathbf{1}, \sigma_3^2 I_d)$, respectively. Here $(\delta_2, \delta_3, \sigma_2, \sigma_3)$ measure the differences of locations and scales. The simulation results are reported in Table 4.
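| Dimension | Location and scale parameters | JEL-S (equal sizes) | ET (equal sizes) | AD (equal sizes) | KW (equal sizes) | HHG (equal sizes) | JEL-S (unequal sizes) | ET (unequal sizes) | AD (unequal sizes) | KW (unequal sizes) | HHG (unequal sizes) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $d = 1$ | (0, 0, 1, 1) | .061 | .047 | .051 | .051 | .051 | .061 | .044 | .049 | .048 | .049 |
| $d = 1$ | (0.1, 0.2, 1.2, 1.4) | .540 | .173 | .195 | .102 | .335 | .497 | .153 | .163 | .091 | .208 |
| $d = 1$ | (0.2, 0.4, 1.4, 1.8) | .952 | .570 | .603 | .208 | .841 | .933 | .504 | .510 | .185 | .630 |
| $d = 1$ | (0.3, 0.6, 1.6, 2.2) | .990 | .885 | .893 | .336 | .987 | .989 | .827 | .819 | .290 | .905 |
| $d = 1$ | (0.4, 0.8, 1.8, 2.6) | .988 | .981 | .981 | .434 | .999 | .990 | .964 | .958 | .384 | .986 |
| first multivariate $d$ | (0, 0, 1, 1) | .054 | .041 | .048 | .049 | .046 | .057 | .046 | .051 | .051 | .050 |
| first multivariate $d$ | (0.1, 0.2, 1.2, 1.4) | .436 | .204 | .348 | .260 | .405 | .392 | .174 | .300 | .230 | .240 |
| first multivariate $d$ | (0.2, 0.4, 1.4, 1.8) | .897 | .702 | .891 | .723 | .926 | .852 | .650 | .849 | .683 | .762 |
| first multivariate $d$ | (0.3, 0.6, 1.6, 2.2) | .993 | .970 | .997 | .958 | .998 | .985 | .947 | .993 | .939 | .976 |
| first multivariate $d$ | (0.4, 0.8, 1.8, 2.6) | .999 | .999 | 1.00 | .996 | 1.00 | .997 | .997 | 1.00 | .995 | .999 |
| second multivariate $d$ | (0, 0, 1, 1) | .054 | .048 | .053 | .052 | .051 | .051 | .044 | .051 | .050 | .047 |
| second multivariate $d$ | (0.1, 0.2, 1.2, 1.4) | .729 | .313 | .670 | .478 | .729 | .668 | .265 | .600 | .434 | .509 |
| second multivariate $d$ | (0.2, 0.4, 1.4, 1.8) | .995 | .930 | .999 | .963 | .999 | .990 | .886 | .996 | .943 | .982 |
| second multivariate $d$ | (0.3, 0.6, 1.6, 2.2) | .999 | 1.00 | 1.00 | 1.00 | 1.00 | .999 | .999 | 1.00 | .999 | 1.00 |
| second multivariate $d$ | (0.4, 0.8, 1.8, 2.6) | .998 | 1.00 | 1.00 | 1.00 | 1.00 | .999 | 1.00 | 1.00 | 1.00 | 1.00 |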
From Table 4, we can make the following observations. For $d = 1$, KW is the least powerful. ET and AD perform similarly but worse than HHG and JEL-S. JEL-S has the highest power. For example, JEL-S is about 20%-30% more powerful than the second-best HHG method in the case of $(0.1, 0.2, 1.2, 1.4)$. In the two multivariate settings, ET performs the worst and JEL-S is the most competitive method.
3.2 Heavy-tailed distribution: $t$ distribution
We compare the performance of JEL-S with that of the other methods under heavy-tailed distributions. Three random samples are simulated from multivariate $t$ distributions with 5 degrees of freedom, with the same location and different scales, respectively. The results are reported in Table 5.
has been impacted by heavy-tailed outliers, while the impact in high dimensions is smaller than that in one dimension. JEL-S has a slight over-size problem: its size is 2-3% higher than the nominal level, while its power is uniformly the highest among all methods. For the small difference case, JEL-S is 10% more powerful than the second-best HHG method.
3.3 Non-symmetric distribution: Exponential distribution
Lastly, we consider the performance of JEL-S for asymmetric distributions. We generate three random samples from multivariate exponential distributions with independent components. The components of the first sample are simulated from exp(1), and those of the other two samples from exponential distributions with different parameters. Type I error rates and powers are presented in Table 6.
From Table 6, we observe that JEL-S suffers slightly from the over-size problem, although the problem becomes less of an issue in higher dimensions. JEL-S performs the best when the differences are small. HHG is inferior to the others. Asymmetric exponential distributions with different scales also have different mean values, and hence KW performs fairly well.
3.4 Summary of the simulation study
Some conclusions can be drawn across Tables 1-6. HHG is affected by unbalanced sizes the most among all methods. For example, in Table 4, the power of HHG drops by 13% and 17% in the first two dimension settings, respectively, from the equal-size to the unequal-size case, compared with a 3-5% decrease for the other methods.
For the same total size, the power with balanced samples is higher than with unequal-size samples for all tests. All methods share the same pattern of power changes when the dimension changes. For the normal scale-difference cases, powers in one dimension setting are lower than those in the other two, while for the $t$ and exponential distributions, powers increase with $d$.
Overall, JEL-S is competitive with the current approaches for comparing $K$ samples. In particular, JEL-S is very powerful for scale-difference problems and is very sensitive to subtle differences among distributions.
4 Real data analysis
For illustration, we apply the proposed JEL approach to a multiple two-sample test example. We apply the JEL method to the banknote authentication data, which is available in the UCI Machine Learning Repository. The data set consists of 1372 samples, with 762 of them from the Genuine class, denoted as Gdata, and 610 from the Forgery class, denoted as Fdata. Four features are recorded for each sample: variance of wavelet transformed image (VW), skewness of wavelet transformed image (SW), kurtosis of wavelet transformed image (KW) and entropy of image (EI). One can refer to Lohweg and to Sang, Dang and Zhao for more descriptions and information on the data.
The densities of each variable for each class are drawn in Figure 1. We observe that the distributions of each variable in the two classes are quite different, especially for the variables VW and SW. The locations of VW in the two classes are clearly different. The distribution of SW shows some multimodal trends in both classes. The distribution of KW in the Forgery class is more right-skewed than it is in the Genuine class. EI has a similar left-skewed distribution in the two classes. Here we compare the multivariate distributions of the two classes and also conduct univariate two-class tests on each of the four variables.
Figure 1: Densities of each variable for each class. Panels: (a) VW, (b) SW, (c) KW, (d) EI.
From Table 7, all tests reject the equality of the multivariate distributions of Gdata and Fdata with very small $p$-values close to 0. Also, the $p$-values for testing separately the individual distributions of VW, SW and KW are small for all methods, and thus we conclude that the underlying distributions of those variables are quite different in the two classes. For the EI variable, however, we do not have significant evidence to reject the equality of the underlying distributions. This result agrees well with the impression from the last graph (d) in Figure 1. In these tests, the $p$-values calculated from the JEL approaches are much higher than those calculated from ET, AD, KW and HHG. As expected, our method performs very similarly to the JEL-W approach for the two-sample problem.
In this paper, we have extended the JEL method to the $K$-sample test via the categorical Gini correlation. A standard limiting chi-square distribution is established and is used to conduct hypothesis tests without a permutation procedure. Numerical studies confirm the advantages of the proposed method in a variety of situations. One important contribution of this paper is the development of a powerful nonparametric method for the multivariate $K$-sample problem.
Although the proposed $K$-sample JEL test is much more sensitive to shape differences among distributions, it is insensitive to variation in location when the differences are subtle. This disadvantage probably stems from finding the solution of in equations of (2.2). That is, the within Gini distances and the overall Gini distance are restricted to be the same. This forces the JEL approach to put more weight on the observations that are closer to the other distributions. As a result, the JEL approach loses some power to detect differences among the locations. This is a common problem for tests based on density functions. For the location difference problem, distribution function approaches such as AD and KW are preferred.
Furthermore, the proposed JEL approach is developed based on the Euclidean distance and hence is invariant only under translation and homogeneous changes. Dang et al. suggested an affine Gini correlation, and we will continue this work by proposing an affine JEL test.
Lemma 6.1 (Hoeffding, 1948)
Under condition C1,
Let and Under the conditions of Lemma 6.1,
(i) , as
(ii) , as .
Lemma 6.3 (Liu, Liu and Zhou, 2018)
Under conditions C1 and C2 and , with probability tending to one as , there exists a root of