Nonparametric testing of independence is a fundamental problem in statistics and has been studied carefully by many classical papers such as hoeffding1948non
. This problem has been gaining greater interest recently due to its important roles in machine learning and big data analysis. Some important recent developments includeszekely2007measuring; gretton2008kernel; heller2012consistent; wang2016generalized; pfister2018kernel; gorsky2018multifit; ma2016fisher; berrett2019nonparametric. josse2016measuring have published an authoritative review.
One important problem in nonparametric dependence detection is nonuniform consistency, which means that no test can uniformly detect all forms of dependency, as described by zhang2019bet. This problem is particularly severe for nonlinear relationships, which are common in many areas of science. To avoid the power loss due to nonuniform consistency, zhang2019bet
considers the binary expansion statistics (BEStat) framework; this framework examines dependence with a filtration approach induced by the binary expansion of uniformly distributed variables.zhang2019bet also proposed testing independence of two continuous variables with the framework of maximum binary expansion testing (BET). The maximum binary expansion testing achieves uniformity and the minimax rate for sample size requirement for desired power. It also provides clear interpretability, and it can be implemented efficiently by bitwise operations.
Although maximum binary expansion testing works well in testing independence between two variables, two crucial improvements are needed for greater practical applicability. The first requirement is to improve the power when the sparsity assumption is violated in Theorem 4.4 of zhang2019bet. The second requirement is to extend the test for testing independence of random vectors. In this paper, we describe a new approach that solves both of these problems. The first problem is addressed by a novel ensemble approach, and the second problem is solved by using one-dimensional random projecting. Due to random projection and ensemble, we call the new method the binary expansion randomized ensemble test. We show with simulation studies that the proposed method has good power properties, and it maintains the clear interpretability of maximum binary expansion testing.
2 Proposed Method
2.1 The Binary Expansion Testing Framework
We briefly introduce the testing procedure and useful notations from zhang2019bet. Let be a random sample from distributions of and . If the marginal distributions of and are known, we can use the CDF transformation so that and are each uniformly distributed over . The binary expansions of two random variables and can be expressed as and where and . If we truncate the expansions at depth , then and are two discrete variables that can take
possible values. We define the binary variablesand to express the interaction between them as their products. We call the variables of the form with cross interactions. To explain, we use the following binary integer indexing. Let be a -dimensional binary vector with 1’s at and 0’s otherwise, and let be a -dimensional binary vector with 1’s at and 0’s otherwise. With this notation, the cross interaction can be written as
Next, we denote the sum of the observed binary interaction variables by with These statistics are referred to as the symmetry statistics. If and are independent, for and . If marginal distributions are unknown, we can use the empirical CDF transformation and then where is a symmetry statistic with empirical CDF transformation.
The maximum binary expansion testing procedure at depth can be defined as follows. First, we compute all symmetry statistics with and for . For each depth , we look for the symmetry statistic with the strongest asymmetry and find its -value. Finally, we use Bonferroni adjustment to obtain a -value that considers the family-wise error rate.
The maximum binary expansion testing procedure has several advantages. Let denote and denote the vector whose entries are all ’s with the first entry . zhang2019bet showed that for any , the maximum binary expansion testing with size requires observations to have power under the assumption . This requirement is the optimal rate in paninski2008coincidence. If the sample size increases at any smaller rate, there exist alternatives for which the power is strictly bounded away from 1. That is, the maximum binary expansion testing approach is minimax in the sample size requirement. The test provides both inferences and clear interpretations. For the maximum binary expansion test, rejection of independence implies that there is at least one significant cross interaction. Thus, we can find a potential dependence structure in the sample by investigating the detected cross interaction.
2.2 Univariate Independence Testing Procedure
Although the maximum binary expansion testing procedure shows good performance in many interesting dependency structures, there is room for improvement. In particular, a test based on the sum of squared symmetry statistics provides better power when the sparsity assumption is violated.
Consider a binary expansion test with specified . For each depth , we can find a set of symmetry statistics . Let be a set of corresponding indices of depth . Since an interaction has different indices for two different , to avoid confusion, we use the of depth . For example, when , and . The sets have a nested structure. Now, for each depth , we introduce the proposed measures of dependence. Let and be two random variables. The population measure of dependence is defined as
be a random sample from the joint distribution of. The empirical measure of dependence is defined as
for each depth . The following theorem lists some properties of and .
The following properties hold:
if and only if and are independent.
converges almost surely to .
If and are independent, then converges in distribution to .
We define the scaled sum of squared symmetry statistics for each depth as
By this definition, each can be used to detect the dependencies up to depth . Consider a test that rejects : and are independent if at least one of ’s is greater than , the quantile of
. Then, by Boole’s inequality, the upper bound of the type I error is
There are many possible versions of the test based on different choices of the ’s. Each has a corresponding set of alternatives that it performs well. Therefore, if we have prior information about the dependency, we can choose ’s in a way that provides optimal power for certain alternatives. When there are no specific prior alternatives, we need a strategy for choosing the ’s. We remark here that the alternatives in for smaller are more interpretable than those for larger . From this point of view, we propose an exponentially decaying approach for choice of . If we choose where then the upper bound of the significance level is
guaranteeing a level test. A natural choice of is 1 and the alternatives in each subset, as a group, are equally likely to be detected for ;
The power of the proposed test can be improved by a compromise between a distance correlation test and multiple testing over interactions. The binary expansion testing framework loses power from the adverse effect of multiplicity control over depth. This loss of power is particularly severe for linear dependency. See a detailed discussion in Section 1.2 in the Supplementary Material of zhang2019bet. By considering distance correlation combined with the proposed test, we can mitigate this power loss. There is only one interaction in and it relates to the upper halves of and and the lower halves, thus, represents a linear relationship. Because the above test is composed of multiple hypothesis tests, the test with can be replaced with the distance correlation test. Because we are using a Bonferroni correction for the critical values, this replacement still maintains the targeted level of the test. We call this approach as ensemble method because it combines two testing methods. The proposed procedure consists of the following steps:
: Fix with .
: Find the -value for the distance correlation test.
: For each , compute .
: Reject if either at least one of is greater than or the -value for the distance correlation test is less than .
2.3 Multivariate Independence Testing Procedure
Thus far, we have discussed the binary expansion test for univariate random variables. In this section, we develop a generalized independence test for random vectors. The generalization can be made by converting the independence of random vectors into the independence of univariate random variables. The lemma allowing this conversion is stated below and proved in Appendix A.
Let and be two random vectors. Then and are independent if and only if and are independent for all with and .
This result shows that, to prove independence of random vectors, it is sufficient to consider independence of arbitrary linear combinations of the components. Therefore, the multivariate independence can be tested by checking all possible combinations of and . Because testing all possible combinations cannot be implemented, we consider an approximation of the test by including a finite but reasonably broad number of combinations.
Suppose and are two random vectors. For with , We define a measure of dependence for the multivariate setting by
where and .
Let be a random sample from the joint distribution of . The empirical measure of dependence for the multivariate setting is defined by
Note that where and follow uniform distributions on hyper-spheres in and
respectively. This expectation can be estimated by
where is a random sample generated from uniform distributions on hyper-spheres in and . For each depth , we define the statistic
By computing quantiles of , for , we can consider the test that rejects “ and are independent” if at least one , for , is greater than . If , this procedure provides a level test.
For better performance, under possible linear dependency, we combine this procedure with the distance correlation test as above. If the scales of the elements in the random vectors differ greatly, standardization may be helpful to reduce the number of and to be sampled. We use the normal quantile transformation for standardization. The following algorithm summarizes the proposed approach.
: Set with .
: Standardize marginally each element of the random vectors.
: Find the -value for the distance correlation test.
: Fix . Generate random sample and from uniform distributions on hyper spheres, respectivley.
: For each , compute .
: Reject if either at least one of is greater than or the -value for the distance correlation test is less than .
We refer to this procedure as binary expansion randomized ensemble testing (BERET) due to its two aspects of random projection and ensemble structure. This method has the following advantages. First, the method achieves robust power by a compromise between the distance correlation test and multiple testing over interactions. The power loss due to multiplicity control over the depth also exists in the multivariate case. By considering the distance correlation result together with the proposed measure of dependence with , we can improve power over a wide range of plausible dependencies.
The second benefit of our method is clear interpretability. The issue of interpretability is particularly important in evaluating multivariate relationships. However, most multivariate independence tests provide only the results of the tests with no information on potential dependence structures in the sample. Although the canonical correlation test provides some related information, it shows poor power in nonlinear relationships relative to the proposed method. In contrast, when the proposed test rejects independence, the and vectors indicate the linear combinations of the vectors that have strong dependencies. Using these vectors, we can detect the possible dependence structures in the sample. See Fig. 1 for a three dimensional double helix structure example for illustration, in which white positive regions and blue negative regions of interaction provide the interpretation of global dependency. It can be seen that the double helix structure is detected by three linear combinations.
Lastly, our method provides useful exploratory information for model selection. A small entry in the unit vector or may indicate that the corresponding variable may not be related to the other random vector. See data examples in Section 4 for details.
3 Simulation Studies
3.1 Univariate Independence
For comparison, we consider the Hoeffding’s D test, the distance correlation test, the -nearest neighbor mutual information, Fisher’s exact scanning method, and the maximum binary expansion test. We use sample size as a moderate sample size for power comparison. We set the level of the tests to be 0.1 and simulate each scenario 1,000 times. We adopt because this depth provides a good approximation to the true distribution, as mentioned by zhang2019bet. The p-values of the proposed method are calculated using the asymptotic distribution of Theorem 2.3.
We compare the power of the above methods over familiar dependence structures such as linear, parabolic, circular, sine, checkerboard and local relationship described in zhang2019bet. At each noise level , are independent random variables. follows the standard uniform distribution. is a random variable. , , and follow -, , and - respectively. are generated from . Table 1 summarizes the details of the setting. Some graphical descriptions of the scenarios are given in Appendix B.
|Scenario||Generation of||Generation of|
Figure 2 shows the performance of the six methods. There are two points to notice. First, except for the proposed test, all the other methods show the lowest power in at least one scenario. The ensemble approach and the maximum binary expansion testing show similar powers across the scenarios except for the linear and local dependency. The ensemble approach considerably improves power in the linear and local dependency scenarios. As discussed previously, the ensemble approach utilizes the information on dependence remaining in the symmetry statistics that is not reflected in calculation of the maximum binary expansion testing. Therefore, small asymmetries in many symmetry statistics can be combined to provide a significant result in the ensemble approach when the sparsity assumption is violated. This result is related to the second finding that the ensemble approach outperforms Fisher’s exact scanning in both global and local dependence structures. zhang2019bet reported that maximum binary expansion testing provides better power for global dependence structures, whereas Fisher’s exact scanning performs better for local dependence structures. The simulation results suggest that the ensemble approach works better than Fisher’s exact scanning even in the local dependency scenario.
3.2 Multivariate Independence
Although the proposed method can be applied to arbitrary and , we choose and for better illustration. We compare the proposed method with the distance correlation test, the Heller-Heller-Gorfine test, the -variable Hilbert-Schmidt independence criterion, and the mutual information test. We again use sample size . We set the level of the tests to be 0.1 and simulate each scenario 1,000 times. For our method, we adopt because there is no considerable difference in performance compared with larger ’s such as . We also use a permutation method with 1,000 replicates to calculate the p-values of the proposed approach.
We compare the power of the methods over dependence structures such as linear, parabolic, spherical, sine, and local dependence structures. These scenarios are generalized from the univariate dependence simulations. In addition, we include an additional interesting relationship, the double helix structure. At each noise level , are independent random variables. follow the standard uniform distribution. follows . are independent random variables. follow the Rademacher distribution. Table 2 summarizes the details of the setting. These three dimensional scenarios are visually displayed in Appendix B.
|Scenario||Generation of||Generation of|
Figure 3 shows the simulation results. Our method provides stable results across the scenarios considered. It ranks at least third place in all scenarios. The mutual information test performs best in the highest number of scenarios. It provides the best power in spherical, double helix, and local dependency. In linear and sine relationships, however, there is significant loss of power in the mutual information test compared with the proposed method. The Heller-Heller-Gorfine test is the other test that shows stable performance across all scenarios. It is better than the proposed method in three scenarios and worse in the other three scenarios. That is, our method and the Heller-Hellor-Gorfine test show similar performance. The proposed method outperforms the distance correlation test and the -variable Hilbert-Schmidt independence criterion in at least five scenarios compared with each testing method separately. A point to notice is that the proposed method provides competitive performance while providing a much clearer interpretation of the dependence structure.
4 Data Examples
4.1 Life Expectancy
We use the proposed method to test independence between geographic location and life expectancy and compare its performance with the performance of other methods, i.e., the distance correlation test, the Heller-Heller-Gorfine test, the mutual information test, and the canonical correlation test. We include the canonical correlation test because it also provides some information on dependence structure as does the proposed method. For the proposed method, we set and . The p-value of the test is calculated by a permutation method with 1,000 replicates. The dataset is obtained from the life expectancy report released by the World Health Organization in 2016. The dataset includes males and females and total life expectancy of 189 countries and special administrative regions estimated in 2015. The latitudes (), longitudes (), and total life expectancies () are used in the analysis. Table 3 presents the testing results for the five different methods. All five tests provide p-values close to 0, indicating a significant dependence between geographic location and life expectancy. To identify the dependence structure, we investigate the symmetry statistics. Figure 4 shows the three largest symmetry statistics and the corresponding ’s.
The most asymmetric result is shown in the first row. It is with
. The horizontal axis is the empirical cumulative distribution function transformation of
, wherein a smaller value implies the country is located in the southeast and a larger value implies the northwest. There are four different groups, from the first one in the upper left to the fourth group in the lower right. Each blue cell represents a specific region, America, Africa, Europe and Asia from left to right. The countries in America and Europe show higher life expectancy than countries in Africa and Asia. The four points in the top right corner are Hong Kong, Japan, Macau and South Korea. They can be interpreted as potential outliers distinct from the global pattern.
The second row shows that there is a positive relationship between latitude and life expectancy. That is, the countries in North America, Europe and Northeast Asia have higher life expectancy than countries in Africa, South America and the other parts of Asia. The last row shows that a circular dependency can exist, which indicates that countries in the America and Asia have a medium life expectancy, whereas countries around the prime meridian have different life expectancies, higher in Europe and lower in Africa. These findings prove clearly that our method detects the dependence structures between geographic location and life expectancy.
The canonical correlation analysis also can be used to find information on dependence structure. The canonical correlation is 0.43, and it is calculated using and . The coefficients of and are similar to the elements of in the result of the proposed method in the second row. However, canonical correlation provides information only on the linear dependence structure, whereas our method provides richer information by considering various nonlinear dependence structures.
Now we add two randomly sampled noise variables to each and
to determine whether the proposed method can identify the pair of subgroups that depend on each other. Each noise variable is chosen from a standard normal distribution. Table3 presents the p-values for the five different methods. The p-values of all five methods are robust. Figure 5 shows the previously detected strongest dependence structure without noise variables along with an example of the corresponding result with noise variables. The coefficients of the variables are stable and the coefficients of the noise variables are small.
To compare small sample performance, we randomly selects subsample from the original sample. To mitigate the effect of randomness we calculate the average p-values from 100 simulations. Table 3 displays the average p-values of all five methods. All p-values are similarly affected, and all of them increase as sample size decreases.
In summary, all five methods detects the dependence with very small p-values. In addition, our method detects three interesting dependence structures which can be explained by known global features. The effects of inclusion of noise variables and reduction in sample size are similar for the five methods.
4.2 Mortality Rate
The second case is the relationship between mortality rate, birth rate and income level. We use the Central Intelligence Agency’s world fact data, estimated in 2018. The dataset includes income level (), birth rate (), and mortality rate () of 225 countries and special administrative regions. Table 4 presents the calculated p-values.
Once again, the proposed method and two other methods provide p-values close to 0, which rejects the null hypothesis, whereas the mutual information test and canonical correlation fail to reject it. The poor performance of canonical correlation can be explained by investigating the results of our method. The strongest asymmetry is given in Fig. 6, which shows a strong quadratic relationship. This relationship explains the failure of canonical correlation to work for the data we use here. Although the canonical correlation test provides both inference and information on dependence structure, it performs poorly in nonlinear dependency settings.
For explanation of the observed quadratic relationship one must point to two conflicting phenomena. The first one is that in developed countries the birth rates are low, but the mortality rates are high due to population aging. In developing countries, however, the birth rates are high from lack of family planning and the mortality rates are also high due to insufficient public health. Thus, mortality rates are high in countries with low or high birth rates.
We investigate the effect of noise variable in the same manner. The p-values of five different methods are represented in Table 4, which shows that all five methods are similarly affected by the noise variables. The strongest dependence structures detected by our method are presented in Fig. 7. The coefficients of noise variables are relatively small, and the same dependence structures are identified as before.
For small sample performance comparison, we randomly selects subsample from the sample as we did in the previous example. Table 4 lists the average p-values. The result shows that the p-values of the distance correlation test and the Heller-Heller-Gorfine test are significantly more affected on average than the p-values of the proposed method.
Once again, the binary expansion randomized ensemble test detects interesting structure that can be explained by widely recognized relationships between mortality rate and birth rate. It provides stable performance even when there are noise variables or when the sample size is small, whereas other methods can be significantly affected by a reduction in sample size.
4.3 House Price
The third data example is the market historical dataset of real estate from the University of California, Irvine machine learning repository. The data includes 414 transactions from the Xindan district of Taipei between August 2012 and July 2013. We use these data to detect the relationship between geographic location and house price. The p-values of the five methods are presented in Table 5.
All methods except the mutual information test provide p-values close to 0, which is consistent with the commonly assumed relationship between location and house price in a city. The mutual information test fails to reject the independence. Figure 8 presents the two strongest dependencies identified by the proposed method.
The symmetry statistic with the strongest asymmetry is , which means that there may be a linear relationship between geographic location and house price. The corresponding for the horizontal axis is . That is, houses have higher values in the north and lower values in the south. It is because the central part of Taipei is above the Xindan district. The symmetry statistic with the second strongest asymmetry is . The corresponding for the horizontal axis is . That is, house prices are high at the centre of the district, where two main roads intersect, and prices fall towards the periphery. These results accord closely with the general characteristics of real estate prices in a city. Therefore, we can conclude that the proposed method properly detects the relationships between house price and geographic location.
We add two randomly sampled noise variables to each and as before. The resulting p-values of the five different methods are represented in Table 5. The results of all methods except our method are significantly affected by the noise variables. The detected possible dependence structures by our method are presented in Fig. 9. The figure indicates that the same dependence structures are detected and the coefficients of the noise variables are relatively small as before. The average p-values from the random smaller sample size are also given in Table 5. The result shows that there is little difference in power among the four significant methods.
In conclusion, the proposed method detects the general behavior of house price in the city. The analysis also shows that our method is less affected by noise variables and a decrease in sample size.
Detection of dependence in a distribution-free setting is an important problem in statistics. Existing methods may have challenges with detecting complicated dependence structures. The distance correlation test, for example, does not detect circular dependency well, whereas it provides good powers in linear, parabolic, and sine settings in simulation studies. The maximum binary expansion testing procedure in zhang2019bet suggests a novel way to solve this problem. However, it is limited to the independence test of two random variables and there is room for enhancement of power when the sparsity assumption is violated.
In this paper, we introduce an ensemble approach and a binary expansion randomized ensemble test. The ensemble approach uses both the sum of squared symmetric statistics and the distance correlation test. It shows better power in linear and local settings while maintaining power for other dependence structures. Moreover, it can be easily generalized to an independence test for the multivariate setting, the binary expansion randomized ensemble test. By random projections, our method transforms the multivariate independence testing problem into a univariate testing problem. Our approach also maintains the clear interpretability of the maximum binary expansion testing.
Simulation studies suggest that the power of our method is advantageous compared with a range of competitors considered in many meaningful dependence structures. Investigation of three data examples shows that our proposed method reveals hidden dependence structures from the data while maintaining a level of power similar to the best of the competing methods. In addition, the performance of the method is relatively less affected by the existence of noise variables and a decrease in sample size.
The authors thank the helpful comments and suggestions from Xiao-Li Meng and Li Ma. This research was partially supported by the National Science Foundation and a grant from the National Cancer Institute.
Proof of Theorem 2.2
Suppose that . Then for and . If for and , then by definition, . Thus, by Theorem 4.1 in zhang2019bet, if and only if and are independent.
Since , we have and Therefore, .
By the definition of , we obtain . Since , we have .
By the law of large numbers,converges almost surely to for and . Hence, the conclusion follows at once by the continuous mapping theorem.
Fig. 10. displays some examples of the scenarios in the simulation studies of univariate dependence.
Fig. 11 displays three-dimensional scenarios with sample size . The lowest noise level () is chosen for better illustration.