With advanced modern instruments such as Doppler shift acoustic radar, functional magnetic resonance imaging (fMRI), and the Heidelberg retina tomograph, large amounts of complex data are being collected for contemporary scientific research. For example, to investigate whether the wind directions of two places are distinct, meteorologists measure the wind directions with colatitude and longitude coordinates on the earth. Another typical example arises in biology: collecting fMRI data, biologists seek to study the association between brain connectivity and age. Although these complex data may boost the progress of scientific research, their complicated structures challenge the statistical community, even on the fundamental problems of statistical inference and data analysis, e.g., the K-sample (K ≥ 2) test and the test of mutual independence.
In the literature, a number of methods have been developed to address these two fundamental problems. Functions currently available in R (R Core Team, 2017) include oneway.test, kruskal.test, and cor.test from package stats, hotelling.test from package Hotelling (Curran, 2017), coeffRV from package FactoMineR (Lê et al., 2008), kmmd from package kernlab (Karatzoglou et al., 2004), hsic.test from package kpcalg (Verbyla et al., 2017), dhsic.test from package dHSIC (Pfister and Peters, 2017), eqdist.test and dcov.test from package energy (Rizzo and Székely, 2017), mdm_test from package EDMeasure (Jin et al., 2018), and hhg.test.k.sample and hhg.test available in package HHG (Brill et al., 2018)
, and so on. Among them, the functions in stats implement classical parametric and non-parametric hypothesis tests for univariate distributions and univariate random variables, and hence are restricted to univariate data. The packages Hotelling and FactoMineR provide multivariate extensions of Student's t-test and the Pearson correlation test, though the normality assumption for multivariate data is usually difficult to validate. The functions in kernlab, kpcalg, dHSIC, energy, and EDMeasure are capable of distinguishing distributions or examining the (mutual) independence assumption for univariate/multivariate continuous/discrete data, but their performance is compromised when the moment assumption cannot be ensured. Worse, the energy distance (Székely and Rizzo, 2004) and distance covariance (Székely et al., 2007) based methods in energy and EDMeasure even lose power entirely when the metric space is not of strong negative type (Lyons, 2013). The R functions in HHG, corresponding to the HHG statistics developed by Heller et al. (2013, 2016), are known to be useful in detecting diverse distinctions among multivariate distributions and associations between multivariate random variables in Euclidean spaces.
In this paper, we introduce a user-friendly R package, Ball (Wang et al., 2018), which implements hypothesis test procedures based on Ball Divergence (Pan et al., 2018a) and Ball Covariance (Pan et al., 2018c), addressing the K-sample test and the test of mutual independence, respectively. The Ball test procedures are non-parametric, moment-free, and, more importantly, theoretically and practically capable of handling the K-sample test and the test of mutual independence for complex data in metric spaces. At present, the Ball package is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=Ball.
The remaining sections are organized as follows: In Section 2, we review the Ball statistics with their intuitive explanations, and we further describe the fast algorithms for the Ball statistics in Section 3. Section 4 gives a detailed account of the main functions in the Ball package and provides two specific real data examples to demonstrate their usage for complex datasets. Section 5 discusses the numerical performance of the Ball statistics in the K-sample test and test of independence problems. The paper is concluded with a short summary in Section 6.
2 Ball test statistics
In this section, we describe the mathematical definitions and illustrations of the Ball Divergence and Ball Covariance statistics in Sections 2.1 and 2.2. Both are non-parametric statistics and are consistent against general alternatives without any moment assumptions (Pan et al., 2018a, c).
2.1 K-sample test and Ball Divergence statistic
In a metric space (V, ρ), given K independent observed samples {X_1^(k), ..., X_{n_k}^(k)}, k = 1, ..., K, assume that within the k-th sample the X_i^(k)'s are i.i.d. and associated with the probability measure μ_k. The K-sample test problem consists of testing whether the K probability measures are all equal:

H0: μ_1 = μ_2 = ... = μ_K versus H1: μ_k ≠ μ_l for some 1 ≤ k < l ≤ K.
We first revisit the two-sample Ball Divergence statistic designed for the two-sample problem, the special case of the K-sample problem with K = 2. Let I(·) be the indicator function. For any u, v ∈ V, denote B(u, ρ(u, v)) as the closed ball with center u and radius ρ(u, v); consequently, I(w ∈ B(u, ρ(u, v))) takes the value 1 when w is inside the closed ball B(u, ρ(u, v)) and 0 otherwise. Writing the two samples as {X_1, ..., X_n} and {Y_1, ..., Y_m}, the two-sample Ball Divergence statistic (BD_{n,m}) is defined as

BD_{n,m} = (1/n^2) Σ_{i,j=1}^n (A_{ij}^X - A_{ij}^Y)^2 + (1/m^2) Σ_{k,l=1}^m (C_{kl}^X - C_{kl}^Y)^2,

where A_{ij}^X = (1/n) Σ_{u=1}^n I(X_u ∈ B(X_i, ρ(X_i, X_j))) and A_{ij}^Y = (1/m) Σ_{v=1}^m I(Y_v ∈ B(X_i, ρ(X_i, X_j))) are the proportions of the X sample and the Y sample inside the ball B(X_i, ρ(X_i, X_j)), and C_{kl}^X and C_{kl}^Y are defined analogously with balls centered at the Y's. Intuitively, if the X's and Y's come from an identical distribution, the proportions of X's and Y's inside each ball should be approximately equal, in other words A_{ij}^X ≈ A_{ij}^Y and C_{kl}^X ≈ C_{kl}^Y, and consequently BD_{n,m} approaches zero in this scenario. On the contrary, under the alternative BD_{n,m} stays away from zero.
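To make the ball-proportion idea concrete, the following Python sketch evaluates the two-sample Ball Divergence for univariate samples directly from the definition above. It is a naive O(n^3) illustration (the helper name bd_two_sample is ours), not the optimized C routine shipped in the package.

```python
import numpy as np

def bd_two_sample(x, y):
    """Naive two-sample Ball Divergence for univariate samples.

    An illustrative sketch of the ball-proportion definition; the Ball
    package instead uses the fast rank-based algorithms of Section 3.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)

    def part(centers):
        # For every ball centered at centers[i] with radius
        # |centers[i] - centers[j]|, compare the proportions of the
        # two samples that fall inside it.
        s = 0.0
        for i in range(len(centers)):
            for j in range(len(centers)):
                r = abs(centers[i] - centers[j])
                px = np.mean(np.abs(x - centers[i]) <= r)
                py = np.mean(np.abs(y - centers[i]) <= r)
                s += (px - py) ** 2
        return s / len(centers) ** 2

    # sum of the X-centered and Y-centered parts
    return part(x) + part(y)
```

Identical samples yield exactly zero, while samples with little overlap yield a strictly positive value.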
Generally, assume K ≥ 3. The K-sample Ball Divergence statistic can be defined by directly summing up all of the two-sample Ball Divergence statistics,

BD_sum = Σ_{1 ≤ k < l ≤ K} BD_{n_k, n_l},

by finding the one group with the largest aggregate difference to the others,

BD_max = max_{1 ≤ k ≤ K} Σ_{l ≠ k} BD_{n_k, n_l},

or by aggregating the most significantly different two-sample Ball Divergence statistics,

BD_summax = Σ_{t=1}^{K-1} BD_(t),

where BD_(1) ≥ ... ≥ BD_(K-1) are the K - 1 largest two-sample Ball Divergence statistics among the K(K - 1)/2 pairwise statistics. When K = 2, BD_sum, BD_max, and BD_summax all degenerate into BD_{n_1, n_2}.
2.2 Test of mutual independence and Ball Covariance statistic
Assume that (V_1, ρ_1), ..., (V_p, ρ_p) are metric spaces, and let V = V_1 × ... × V_p be the product metric space of V_1, ..., V_p, where the product metric ρ is defined via the Euclidean norm of the component metrics,

ρ((u_1, ..., u_p), (v_1, ..., v_p)) = ( Σ_{s=1}^p ρ_s^2(u_s, v_s) )^{1/2}.
Given n i.i.d. observations {Z_i = (Z_i^(1), ..., Z_i^(p))}_{i=1}^n with Z_i^(s) ∈ V_s, denote θ and θ_1, ..., θ_p as the associated joint probability measure and marginal probability measures. Then the test of mutual independence can be formulated as

H0: θ = θ_1 ⊗ θ_2 ⊗ ... ⊗ θ_p versus H1: θ ≠ θ_1 ⊗ θ_2 ⊗ ... ⊗ θ_p.
For u_s, v_s ∈ V_s, denote B_s(u_s, ρ_s(u_s, v_s)) as the closed ball with center u_s and radius ρ_s(u_s, v_s), and write δ_{ij,k}^(s) = I(Z_k^(s) ∈ B_s(Z_i^(s), ρ_s(Z_i^(s), Z_j^(s)))). Thus, δ_{ij,k}^(s) is a discriminant function taking the value 1 if Z_k^(s) is within the closed ball and 0 otherwise. Let

Δ_{ij,n} = (1/n) Σ_{k=1}^n Π_{s=1}^p δ_{ij,k}^(s),  Δ_{ij,n}^(s) = (1/n) Σ_{k=1}^n δ_{ij,k}^(s), s = 1, ..., p.

Then our Ball Covariance statistic can be presented as

BCov_n^2 = (1/n^2) Σ_{i,j=1}^n ( Δ_{ij,n} - Π_{s=1}^p Δ_{ij,n}^(s) )^2.   (3)
We provide a heuristic explanation for BCov_n^2. If the null hypothesis of mutual independence holds, the proportion of observations falling jointly inside the balls B_1, ..., B_p should be close to the product of the marginal proportions of the Z^(s)'s within each ball B_s, i.e., Δ_{ij,n} ≈ Π_{s=1}^p Δ_{ij,n}^(s). Because BCov_n^2 is the summation of the squared differences (Δ_{ij,n} - Π_{s=1}^p Δ_{ij,n}^(s))^2, a significantly large value of BCov_n^2 implies that the null hypothesis is implausible.
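The heuristic above can be sketched in Python for two univariate variables. This is a naive O(n^3) illustration with constant weight (the helper name bcov_sq is ours), not the package's optimized C implementation.

```python
import numpy as np

def bcov_sq(x, y):
    """Naive sample Ball Covariance for two univariate variables.

    For every pair of balls B(x_i, |x_i - x_j|) and B(y_i, |y_i - y_j|),
    the joint inclusion proportion is compared with the product of the
    marginal inclusion proportions; an illustrative sketch only.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    s = 0.0
    for i in range(n):
        for j in range(n):
            in_x = np.abs(x - x[i]) <= abs(x[i] - x[j])  # X_k in the x-ball
            in_y = np.abs(y - y[i]) <= abs(y[i] - y[j])  # Y_k in the y-ball
            joint = np.mean(in_x & in_y)
            s += (joint - in_x.mean() * in_y.mean()) ** 2
    return s / n ** 2
```

A degenerate (constant) second variable gives exactly zero, since the joint proportion then equals the marginal one.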
The statistic (3) can be extended with positive weights w_n(Z_i, Z_j):

BCov_{n,w}^2 = (1/n^2) Σ_{i,j=1}^n ( Δ_{ij,n} - Π_{s=1}^p Δ_{ij,n}^(s) )^2 w_n(Z_i, Z_j).
Several weight choices are feasible. For example, we could take the Chi-square weight w_n^chisq(Z_i, Z_j) = Π_{s=1}^p [Δ_{ij,n}^(s) (1 - Δ_{ij,n}^(s))]^{-1} and the probability weight w_n^prob(Z_i, Z_j) = Π_{s=1}^p [Δ_{ij,n}^(s)]^{-1}, and denote their corresponding statistics as BCov_{n,chisq}^2 and BCov_{n,prob}^2, respectively. The probability weight gives more weight to smaller balls, while the Chi-square weight standardizes each term (Δ_{ij,n} - Π_{s=1}^p Δ_{ij,n}^(s))^2 by dividing by the variance of Δ_{ij,n} when the balls are fixed. Furthermore, BCov_{n,chisq}^2 can be shown to be asymptotically equivalent to HHG (Pan et al., 2018c).
3 Fast algorithms for Ball statistics

The computational complexities of the Ball statistics are proportional to the cube of the sample size if we compute them according to their exact definitions; however, in most cases their computational complexities can be reduced to a more moderate level (see Table 1) by several efficient algorithms utilizing the rank nature of the Ball statistics. The rank nature motivates computational acceleration in three aspects. First, because the ball-counting quantities such as A_{ij}^X and A_{ij}^Y are determined by the ranks of pairwise distances, the computation procedure of the Ball statistics can avoid repeatedly evaluating whether a point is located in a closed ball. Second, for the two-sample permutation test, by preserving the rank information when BD_{n,m} is first computed, the ranking procedure during the permutation can be avoided. Third, if the dataset is sampled from a univariate distribution, computing the number of points inside a closed ball is equivalent to finding the ranks of the lower and upper boundaries of the ball. These three methods are illustrated in Sections 3.1, 3.2, and 3.3, respectively. Unfortunately, in the general situation, the computation of the Ball Covariance statistic is not easy to optimize. Nevertheless, with engineering optimizations such as multi-threading, the time consumption of the Ball Covariance statistic can be cut down.
3.1 Rank-based algorithm
We first recap the algorithm for BD_{n,m} proposed by Pan et al. (2018a). Let D denote the pairwise distance matrix of the pooled sample {X_1, ..., X_n, Y_1, ..., Y_m}, and let N = n + m. After spending O(N^2 log N) time on ranking the relevant sub-matrices of D row by row to obtain the corresponding rank matrices, we only pay O(1) time per entry to compute A_{ij}^X, A_{ij}^Y, C_{kl}^X, and C_{kl}^Y by directly extracting the corresponding elements of the rank matrices. Therefore, BD_{n,m} is of O(N^2 log N) time complexity, and so are the general K-sample Ball Divergence statistics: their computing times are no more than O(N^2 log N), where N = n_1 + ... + n_K.
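The ranking step can be sketched as follows, assuming no tied distances (the helper name row_rank_matrix is ours): after one O(n^2 log n) pass over the distance matrix, the number of points inside the ball B(x_i, D[i, j]) is just the O(1) lookup R[i, j].

```python
import numpy as np

def row_rank_matrix(D):
    # Rank each row of the pairwise distance matrix once; assuming no
    # ties, R[i, j] then equals the number of points inside the closed
    # ball centered at point i with radius D[i, j] (point i included).
    n = D.shape[0]
    R = np.empty((n, n), dtype=int)
    for i in range(n):
        order = np.argsort(D[i], kind="stable")
        R[i, order] = np.arange(1, n + 1)
    return R
```

For example, for three points at 0, 1, and 3 on the line, R[0, 1] = 2 records that two points (the center and the point at 1) lie within distance 1 of the first point.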
With respect to BCov_n^2, the aforementioned algorithm can also be directly applied to calculate the marginal proportions Δ_{ij,n}^(1) and Δ_{ij,n}^(2) within O(n^2 log n) time. To compute the joint proportions Δ_{ij,n} within O(n^2 log n) time, for each fixed center i we rearrange the second-coordinate distances into the sequence e_1, ..., e_n, where e_t is the second-coordinate distance of the observation whose first-coordinate distance to Z_i^(1) is the t-smallest among ρ_1(Z_i^(1), Z_1^(1)), ..., ρ_1(Z_i^(1), Z_n^(1)). Further, for the observation j at position t, we have

n Δ_{ij,n} = R_t - a_t,   (5)

where R_t is the rank of e_t among e_1, ..., e_n and a_t is the number of values not only behind e_t (i.e., among e_{t+1}, ..., e_n) but also not larger than e_t. Owing to (5), the computation of Δ_{ij,n} turns into computing a_1, ..., a_n, a typical "count of smaller numbers after self" problem (https://leetcode.com/problems/count-of-smaller-numbers-after-self/), which can be solved in O(n log n) time via well-designed algorithms such as the binary indexed tree, the binary search tree, and merge sort. Our implementation utilizes the merge sort algorithm owing to its efficiency and its freedom from extra complicated data structures. Thus, the time consumption per center reduces to O(n log n), and finally, the computation of BCov_n^2 costs O(n^2 log n) time.
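The "count of smaller numbers after self" subproblem can be solved with an index-based merge sort, as sketched below (the strictly-smaller variant of the classic problem; adapting the comparison to "not larger than" for tied values is a routine modification):

```python
def count_smaller_after_self(nums):
    """For each element, count how many later elements are strictly smaller.

    O(n log n) via merge sort over indices; counts are accumulated while
    merging, when right-half elements jump in front of left-half ones.
    """
    counts = [0] * len(nums)

    def sort(idx):
        if len(idx) <= 1:
            return idx
        mid = len(idx) // 2
        left, right = sort(idx[:mid]), sort(idx[mid:])
        merged, j = [], 0
        for i in left:
            # advance j past right-half values smaller than nums[i];
            # all of them originally appear after position i
            while j < len(right) and nums[right[j]] < nums[i]:
                merged.append(right[j])
                j += 1
            counts[i] += j
            merged.append(i)
        merged.extend(right[j:])
        return merged

    sort(list(range(len(nums))))
    return counts
```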
3.2 Optimized permutation test procedure
The permutation test procedure randomly exchanges the group labels so that the permuted datasets come from the same distribution. It is worth noting that the ranking procedure for the distance matrix of the permuted pooled sample is actually unnecessary, because the permuted distance matrix can be obtained by exchanging the rows and columns of the original distance matrix D in the same manner as the group labels are exchanged. The exchanging procedure only needs quadratic time. As regards the rank matrices, we develop Algorithm 1 to make their calculation cost quadratic time as well. Therefore, during the permutation, the time complexities of BD_sum, BD_max, and BD_summax reduce to O(N^2). Algorithm 1 relies on the index matrix obtained when we first rank the distance matrix D, whose (i, t)-element records which observation attains the t-smallest distance in the i-th row of D. The details of Algorithm 1 are shown below.
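The label-permutation idea can be sketched generically as follows; permutation_pvalue and the toy between-minus-within statistic used in the test are our own illustrations, exploiting the fact that permuting the labels only re-indexes the pooled distance matrix, which never needs to be recomputed:

```python
import numpy as np

def permutation_pvalue(statistic, D, sizes, n_perm=99, seed=4):
    """Permutation p-value for a K-sample statistic of (distance matrix, labels).

    `statistic(D, labels)` may be any group-comparison statistic; the
    pooled distance matrix D is fixed and only the labels are permuted.
    A sketch, not the package's optimized C routine.
    """
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    observed = statistic(D, labels)
    # count permuted statistics at least as extreme as the observed one
    count = sum(
        statistic(D, rng.permutation(labels)) >= observed for _ in range(n_perm)
    )
    return (count + 1) / (n_perm + 1)
```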
3.3 Fast algorithm for univariate data
We pay attention to BD_{n,m} first. For convenience, assume X_1 ≤ ... ≤ X_n have been sorted in ascending order. For univariate data, we have

B(X_i, |X_i - X_j|) = [X_i - |X_i - X_j|, X_i + |X_i - X_j|].   (6)

Let l(i, j) be the smallest index u satisfying X_u ≥ X_i - |X_i - X_j| and r(i, j) the largest index u satisfying X_u ≤ X_i + |X_i - X_j|. Thanks to (6), it is easy to verify that n A_{ij}^X = r(i, j) - l(i, j) + 1, and consequently, an alternative way to compute A_{ij}^X is to find l(i, j) and r(i, j). Following this inspiration, we develop Algorithm 2 to accomplish the computation of the A_{ij}^X for a fixed center in linear time. By slightly modifying Algorithm 2, the computation of the A_{ij}^Y also reduces to linear time per center, and hence the computational complexity of the first part of BD_{n,m} reduces to quadratic time; similarly for the second part. In summary, the computational complexity of BD_{n,m} is proportional to N^2 for univariate datasets, and so are BD_sum, BD_max, and BD_summax.
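The boundary-rank idea can be illustrated with binary search (Algorithm 2 itself obtains the counts with a linear-time sweep over the sorted sample; the helper below, our own, conveys the same principle at O(log n) per query):

```python
from bisect import bisect_left, bisect_right

def count_in_ball_sorted(xs, center, radius):
    # xs must be sorted ascending. The number of points in the closed
    # ball [center - radius, center + radius] equals the difference
    # between the positions of its two boundaries in the sorted sample.
    lo = bisect_left(xs, center - radius)
    hi = bisect_right(xs, center + radius)
    return hi - lo
```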
With regard to BCov_n^2, with a slightly modified Algorithm 2, the time complexity of computing the marginal proportions Δ_{ij,n}^(1) and Δ_{ij,n}^(2) reduces to quadratic time when both variables are univariate. Further, by retaining the boundary indices used for calculating the marginal proportions, the computation of the joint proportions Δ_{ij,n} only requires quadratic time by Algorithm 3, which finally accomplishes the calculation of BCov_n^2 in quadratic time. Motivated by the method reported in Heller et al. (2016), Algorithm 3 utilizes the inclusion-exclusion principle to efficiently compute Δ_{ij,n}.
4 The Ball package
In this section, we introduce the Ball package, an R package that implements the Ball statistics introduced in Section 2 as well as the algorithms illustrated in Section 3. The core of Ball is programmed in C to improve computational efficiency. Moreover, we employ parallel computing in the Ball statistics based test procedures to speed up the computation. To be specific, multiple Ball statistics are computed concurrently with OpenMP, which supports multi-platform shared-memory multiprocessing programming at the C level. Aside from speed, the Ball package is concise and user-friendly. R users can conduct the K-sample test and the test of independence via the bd.test and bcov.test functions in the Ball package.
We supply instructions and basic usage for the bd.test and bcov.test functions in Section 4.1. Additionally, in Section 4.2, two real data analysis examples are provided to demonstrate how to use these functions to tackle manifold-valued data.
4.1 Functions bd.test and bcov.test
The functions bd.test and bcov.test are programmed for the K-sample test and the test of independence problems, respectively. The two functions are called as:
bd.test(x, y = NULL, R = 99, dst = FALSE, size = NULL, seed = 4,
        num.threads = 2, kbd.type = "sum")
bcov.test(x, y = NULL, R = 99, dst = FALSE, weight = FALSE, seed = 4,
          num.threads = 2)
The arguments of two functions are described as follows.
x: a numeric vector, matrix, data.frame, or dist object, or a list containing numeric vectors, matrices, data.frames, or dist objects.
y: a numeric vector, matrix, data.frame, or dist object.
R: the number of permutation replications; must be a non-negative integer. When R = 0, the Ball statistic value is returned without a permutation test. Default: R = 99
dst: if dst = TRUE, x and y are treated as distance matrices or a list containing distance matrices. Default: dst = FALSE
size: a vector recording the sample size of each group; only available for bd.test.
weight: a logical or character value used to choose the form of the weight w_n. If weight = FALSE, the Ball Covariance with constant weight is used. Alternatively, weight = TRUE or weight = "prob" indicates the probability weight, while weight = "chisq" selects the Chi-square weight. Note that this argument actually only influences the printed result in the R console. At present, this argument is only available for the bcov.test function. Default: weight = FALSE
seed: the random seed. Default: seed = 4
num.threads: the number of threads used in the Ball statistics based hypothesis test. Default: num.threads = 2
kbd.type: a character value controlling the output information. Setting kbd.type = "sum", kbd.type = "summax", or kbd.type = "max" demonstrates the statistic value and p-value of the BD_sum, BD_summax, or BD_max based K-sample test procedure, respectively. Note that this argument actually only influences the printed result in the R console and is only available for the bd.test function. Default: kbd.type = "sum"
If R > 0, the output is an htest class object, like what the t.test function returns. The output object contains the value of the Ball statistic (statistic), the p-value of the test (p.value), the number of permutation replications (replicates), the sample size of each group (size), a list containing the values of multiple statistics and their corresponding p-values (complete.info), a character string declaring the alternative hypothesis (alternative), a character string describing the hypothesis test (method), and a character string giving the name and helpful information of the data (data.name). If R = 0, the Ball statistic value is returned.
As a quick example, we generate simulated data and carry out hypothesis tests based on the Ball statistics to see whether the null hypothesis can be rejected when the distributions are truly different or the variables are truly associated.
For the K-sample test problem (K = 2), we generate two univariate datasets x and y with different location parameters, where x contains 50 observations from N(0, 1) and y contains 50 observations from N(1, 1).
The detailed R code is as follows.
R> library("Ball")
R> set.seed(1)
R> x <- rnorm(50)
R> y <- rnorm(50, mean = 1)
R> kbd_result <- bd.test(x = x, y = y)
R> kbd_result
2-sample Ball Divergence Test
data: x and y
number of observations = 100, group sizes: 50 50
replicates = 99
bd = 0.092215, p-value = 0.01
alternative hypothesis: distributions of samples are distinct
In this example, the function bd.test yields a statistic value of 0.0922 for BD_{n,m} and a p-value of 0.01 when we replicate the permutation procedure R = 99 times. Since the p-value is below 0.05, we reject the null hypothesis. Consequently, the hypothesis test result is concordant with the data generation mechanism.
As regards the test of mutual independence, we sample 100 i.i.d. observations from a trivariate normal distribution centered at the origin with unit variances and pairwise correlations of 0.3. The data generation procedure and the mutual independence test based on BCov_n^2 are demonstrated below.
R> library("mvtnorm")
R> set.seed(1)
R> cov_mat <- matrix(0.3, 3, 3)
R> diag(cov_mat) <- 1
R> data_set <- rmvnorm(n = 100, mean = c(0, 0, 0), sigma = cov_mat)
R> data_set <- as.list(as.data.frame(data_set))
R> ind_result <- bcov.test(x = data_set)
R> ind_result
Ball Covariance test of mutual independence
data: data_set
number of observations = 100
replicates = 99, weight: Constant
bcov = 0.00063808, p-value = 0.01
alternative hypothesis: random variables are dependent
The output of bcov.test shows that BCov_n^2 is 0.00063808 when the constant weight is used, and the p-value is 0.01, leading to the conclusion that the three univariate variables are mutually dependent.
4.2 Real data examples

4.2.1 Wind direction dataset
We consider the hourly wind speed and direction data recorded on the Atlantic coast of Galicia in the winter seasons from 2003 until 2012, provided in the R package NPCirc (Oliveira et al., 2014). We want to figure out whether there exist wind direction differences between the first and last weeks of the 2007-08 winter season. We apply the Ball Divergence test to these data by carrying out the following steps. First, we calculate the distance matrix between samples using the great-circle distance. Then, we pass the distance matrix to the argument x and set dst = TRUE and size = c(168, 167).
R> library("Ball")
R> data("speed.wind", package = "NPCirc")
R> index1 <- which(speed.wind[["Year"]] == 2007 &
R+   speed.wind[["Month"]] == 11 &
R+   speed.wind[["Day"]]
R> index2 <- which(speed.wind[["Year"]] == 2008 &
R+   speed.wind[["Month"]] == 1 &
R+   speed.wind[["Day"]]
R> d1 <- na.omit(speed.wind[["Direction"]][index1])
R> d2 <- na.omit(speed.wind[["Direction"]][index2])
R> # sample size:
R> # c(length(d1), length(d2))  # 168, 167
R> # convert degrees to radians and compute the great-circle (geodesic) distance:
R> theta <- c(d1, d2) / 360 * 2 * pi
R> dat <- cbind(cos(theta), sin(theta))
R> dx <- nhdist(dat, method = "geo")
R> # hypothesis test with Ball Divergence:
R> bd.test(dx, size = c(168, 167), dst = TRUE)
2-sample Ball Divergence Test
data: dx
number of observations = 335, group sizes: 168 167
replicates = 99
bd = 0.29114, p-value = 0.01
alternative hypothesis: distributions of samples are distinct
As can be seen from the output of bd.test, the statistic value is 0.2911 and the p-value is smaller than 0.05. To further confirm our conclusion, we visualize the wind directions of the two groups in Figure 1 with the R package circular (Agostinelli and Lund, 2017). Figure 1 shows that the hourly wind directions in the first week are concentrated around 90 degrees, while the wind directions in the last week are widely dispersed.
R> library("circular")
R> par(mfrow = c(1, 2), mar = c(0, 0, 0, 0))
R> # visualize the wind direction:
R> plot(circular(c(d1), units = "degrees"), bin = 100, stack = TRUE,
R+   shrink = 1.5)
R> plot(circular(c(d2), units = "degrees"), bin = 100, stack = TRUE,
R+   shrink = 1.5)
4.2.2 Brain fMRI dataset
We examine a public fMRI dataset from the 1000 Functional Connectomes Project (https://www.nitrc.org/frs/?group_id=296), which called on principal investigators from member sites to donate neuroimaging data so that the broader imaging community has complete access to a large-scale functional imaging dataset. Given the resting-state fMRI scans and demographics of 86 samples donated by ICBM, it is of interest to evaluate whether age is associated with brain connectivity. To properly analyze this dataset, we pre-process the data for each individual, each of whom contributed three fMRI scans, with the nilearn (Abraham et al., 2014) package in the Python environment. The pre-processing includes the following four steps: (i) Segment the brain into a set of 116 cortical and subcortical regions with the Automated Anatomical Labeling (AAL) template (Tzourio-Mazoyer et al., 2002); (ii) For each fMRI scan, compute a correlation coefficient matrix of the 116 regions based on the voxel time series, since this brain connectivity measurement is widely used in the neuroimaging literature (Ginestet et al., 2017; Ginestet and Simmons, 2011; Bullmore and Sporns, 2009); (iii) Average the three correlation matrices element-wise and save the mean matrix to disk so that it can be analyzed with R; (iv) In the R environment, combine the collection of mean correlation matrices and the demographics into a list object, then save it to disk in ".rda" format.
To achieve our goal, the pairwise distance matrices of the correlation matrices and of age are first computed and stored in the R objects dx and dy, respectively. Then, we pass dx and dy to the bcov.test function to perform the test of independence, setting dst = TRUE to declare that the x and y arguments accept distance matrices. The detailed R code is demonstrated below.
R> library("Ball")
R> library("CovTools")
R> load("niICBM.rda")
R> # pairwise distance matrix of the correlation matrices:
R> dx <- CovDist(niICBM[["spd"]])
R> # pairwise distance matrix of age:
R> dy <- dist(niICBM[["covariate"]][["V3"]])
R> # hypothesis test with Ball Covariance:
R> bcov.test(x = dx, y = dy, dst = TRUE, weight = "prob")
Ball Covariance test of independence
data: dx and dy
number of observations = 86
replicates = 99, weight: Probability
bcov.prob = 0.017138, p-value = 0.01
alternative hypothesis: random variables are dependent
The output message shows that BCov_{n,prob}^2 is 0.0171 and the p-value is smaller than the significance level; thus, we conclude that the brain connectivity structure is associated with individual age. This result is also supported by recent research showing that age strongly affects structural brain connectivity (Damoiseaux, 2017). In this example, we use the affine-invariant Riemannian metric (Pennec et al., 2006) implemented in CovTools (Lee and You, 2018) to evaluate the structural difference among correlation matrices and the Euclidean distance to measure age differences.
5 Numerical studies
In this section, numerical studies are conducted to assess the performance of the Ball statistics based test procedures for complex data, including directional, tree-structured, symmetric positive-definite matrix, and network data, lying in sphere spaces, tree metric spaces, symmetric positive-definite matrix spaces, and normalized Laplacian matrix spaces, respectively. For comparison, Energy distance and HHG are considered for the K-sample test, while distance covariance and HHG are considered for the test of independence. The permutation technique helps us obtain the empirical distributions of these statistics under the null hypothesis and derive the p-value. We repeat each of the following Models 1000 times and fix the nominal significance level at 0.05.
5.1 K-sample test
In this section, we investigate the performance of the test statistics in revealing the distribution difference between two kinds of random complex data, i.e., directional data and tree-structured data, which are frequently met by scientists interested in wind directions, marine currents (Marzio et al., 2014), or cancer evolution (Abbosh et al., 2017). To sample directional and tree-structured data, the rvmf and rtree functions in the R packages Directional (Tsagris et al., 2018) and ape (Paradis et al., 2018) are used to generate data from the von Mises-Fisher distribution and the random tree distribution, parameterized by direction and concentration parameters for the former, and by the number of tree nodes and an independent random sample collection associated with the branch lengths of the tree for the latter. The dissimilarities between pairwise directional data and tree-structured data are measured by the great-circle distance and the Kendall-Colijn metric (Kendall and Colijn, 2015), which are programmed in the nhdist and multiDist functions in the R packages Ball and treespace (Jombart et al., 2017).
We conduct the numerical analyses for directional data in Models 4.1.1, 4.1.3-4.1.5, and 4.1.9, and for tree-structured data in the other Models. Among them, Models 4.1.1 and 4.1.2 are designed for Type-I error evaluation, while all other Models are devoted to evaluating the empirical power. Specifically, Models 4.1.3-4.1.8 focus on the case that any two groups of the K samples are distinct, while Models 4.1.9 and 4.1.10 pay attention to the case that only one group is different from the others. Without loss of generality, we fix K = 3 in Models 4.1.1-4.1.8 and use a larger K for Models 4.1.9 and 4.1.10. Keeping each group at the same sample size, we let the sample size of each group increase as 10, 20, 30, 40, and 50.
4.1.1: von Mises-Fisher distribution. The groups share common direction parameters and concentration parameters.
4.1.2: Twenty-node random tree distribution. The branch lengths are independent random samples from a uniform distribution.
4.1.3: von Mises-Fisher distribution. The groups are generated with different direction parameters and concentration parameters.
4.1.4: Mixture von Mises-Fisher distribution. Each group is a mixture of von Mises-Fisher components with group-specific direction parameters, concentration parameters, and mixture proportions.
4.1.5: Mixture von Mises-Fisher distribution. As in Model 4.1.4, but with a different setting of the direction parameters, concentration parameters, and mixture proportions.
4.1.6: Fifteen-node random tree distribution. The branch lengths of the three groups are independently sampled from three different uniform distributions.
4.1.7: Fifteen-node random tree distribution. The branch lengths of the three groups are independently sampled from three different uniform distributions with another parameter setting.
4.1.8: Fifteen-node random tree distribution. The branch lengths of the three groups are independently sampled from three different distributions.
Since only one group is different from the others, it suffices to specify the following Models by describing the distributions of two groups.
4.1.9: von Mises-Fisher distribution. The two distributions have different direction parameters and concentration parameters.
4.1.10: Fifteen-node random tree distribution. The branch lengths are independently sampled from two different uniform distributions.
The Type-I error rates and power estimates are graphed in Figures 2 and 3. Figure 2 shows that the three Ball Divergence statistics control the Type-I error rates well around the significance level. The results of Figure 3 show that at least one of the Ball Divergence statistics overcomes Energy distance and HHG in all of the cases. More specifically, it is superior to the other methods in most of the cases, and