Ball: An R package for detecting distribution difference and association in metric spaces

11/09/2018 · Jin Zhu, et al. · Sun Yat-sen University and The University of Tennessee, Knoxville

The rapid development of modern technology has facilitated the collection of numerous unprecedented complex data which do not satisfy the axioms of Euclidean geometry, while most statistical hypothesis tests are only available in Euclidean or Hilbert spaces. To properly analyze data with more complicated structures, efforts have been made to solve the fundamental test problems in more general spaces. In this paper, a publicly available R package Ball is provided to implement Ball statistical test procedures for $K$-sample distribution comparison and the test of mutual independence in metric spaces, which extend the test procedures for two-sample distribution comparison and the test of independence. Tailor-made algorithms as well as engineering techniques are employed in the Ball package to speed up computation to the best of our ability. Two real data analyses and several numerical studies have been performed, and the results certify the power of the Ball package in analyzing complex data, e.g., spherical data and symmetric positive definite matrix data.


1 Introduction

With advanced modern instruments such as the Doppler shift acoustic radar, functional magnetic resonance imaging (fMRI), and the Heidelberg retina tomograph, a large amount of complex data is being collected for contemporary scientific research. For example, to investigate whether the wind directions of two places are distinct, meteorologists measure the wind directions with colatitude and longitude coordinates on the earth. Another typical example arises in biology. Collecting fMRI data, biologists wish to study the association between brain connectivity and age. Although these complex data may boost the progress of scientific research, their complicated structures challenge the statistical community, even on the fundamental problems of statistical inference and data analysis, e.g., the $K$-sample ($K \ge 2$) test and the test of mutual independence.

In the literature, a number of methods have been developed to address these two fundamental problems. The functions currently available in R (R Core Team, 2017) include oneway.test, kruskal.test, and cor.test from package stats, hotelling.test from package Hotelling (Curran, 2017), coeffRV from package FactoMineR (Lê et al., 2008), kmmd from package kernlab (Karatzoglou et al., 2004), hsic.test from package kpcalg (Verbyla et al., 2017), dhsic.test from package dHSIC (Pfister and Peters, 2017), eqdist.test and dcov.test from package energy (Rizzo and Székely, 2017), mdm_test from package EDMeasure (Jin et al., 2018), hhg.test.k.sample and hhg.test from package HHG (Brill et al., 2018), and so on. Among them, the functions in stats implement the classical parametric and non-parametric hypothesis tests for univariate distributions and univariate random variables. These functions are only designed for univariate data, and hence are of restricted use. The packages Hotelling and FactoMineR provide multivariate extensions of Student's $t$-test and the Pearson correlation test, although the normality assumption for multivariate data is usually difficult to validate. The functions in kernlab, kpcalg, dHSIC, energy, and EDMeasure are capable of distinguishing distributions or examining the (mutual) independence assumption for univariate/multivariate continuous/discrete data, but their performance is compromised when the moment assumption cannot be ensured. What is worse, the energy distance (Székely and Rizzo, 2004) and distance covariance (Székely et al., 2007) based methods in energy and EDMeasure even lose power totally when the metric is not of strong negative type (Lyons, 2013). The R functions in HHG, corresponding to the HHG statistics developed by Heller et al. (2013, 2016), are known to be useful in detecting diverse distinctions among multivariate distributions and associations between multivariate random variables in Euclidean spaces.

In this paper, we introduce a user-friendly R package Ball (Wang et al., 2018) which implements hypothesis test procedures based on Ball Divergence (Pan et al., 2018a) and Ball Covariance (Pan et al., 2018c), addressing the $K$-sample test and the test of mutual independence, respectively. The Ball test procedures are non-parametric, moment-free, and, more importantly, theoretically and practically capable of handling the $K$-sample test and the test of mutual independence for complex data in metric spaces. At present, the Ball package is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=Ball.

The remaining sections are organized as follows: In Section 2, we revisit the Ball statistics along with their intuitive explanations, and we further describe fast algorithms for the Ball statistics in Section 3. Section 4 gives a detailed account of the main functions in the Ball package and provides two real data examples to demonstrate their usage on complex datasets. Section 5 discusses the numerical performance of the Ball statistics in the $K$-sample and test of independence problems. The paper concludes with a short summary in Section 6.

2 Ball test statistics

In this section, we describe the mathematical definitions and illustrations of the Ball Divergence and Ball Covariance statistics in Sections 2.1 and 2.2. Both the Ball Divergence and Ball Covariance statistics are non-parametric and consistent against general alternatives without any moment assumptions (Pan et al., 2018a, c).

2.1 $K$-sample test and Ball Divergence statistic

In a metric space $(V, d)$, given $K$ independent observed samples $\{X_1^{(k)}, \ldots, X_{n_k}^{(k)}\}$, $k = 1, \ldots, K$, assume that within the $k$-th sample, the $X_i^{(k)}$'s are i.i.d. and associated with the probability measure $\mu_k$. The $K$-sample test problem consists of testing whether the $K$ probability measures are all equal:

    $H_0: \mu_1 = \mu_2 = \cdots = \mu_K$  versus  $H_1: \exists\, 1 \le k < l \le K,\ \mu_k \neq \mu_l.$   (1)

We first revisit the two-sample Ball Divergence statistic designed for the two-sample problem, the special case of the $K$-sample problem where $K = 2$; write the two samples as $\{X_1, \ldots, X_n\}$ and $\{Y_1, \ldots, Y_m\}$. Let $I(\cdot)$ be the indicator function. For any $x, y \in V$, denote $\bar{B}(x, d(x, y))$ as the closed ball with center $x$ and radius $d(x, y)$, and let $\delta(x, y, z) = I(z \in \bar{B}(x, d(x, y)))$; consequently, $\delta(x, y, z)$ takes 1 when $z$ is inside the closed ball $\bar{B}(x, d(x, y))$, and 0 otherwise. The Ball Divergence statistic ($BD_{n,m}$) is defined as

    $BD_{n,m} = A_{n,m} + C_{n,m},$

where

    $A_{n,m} = \frac{1}{n^2} \sum_{i,j=1}^{n} (A^X_{ij} - A^Y_{ij})^2, \quad C_{n,m} = \frac{1}{m^2} \sum_{k,l=1}^{m} (C^X_{kl} - C^Y_{kl})^2,$

    $A^X_{ij} = \frac{1}{n} \sum_{u=1}^{n} \delta(X_i, X_j, X_u), \quad A^Y_{ij} = \frac{1}{m} \sum_{v=1}^{m} \delta(X_i, X_j, Y_v),$

    $C^X_{kl} = \frac{1}{n} \sum_{u=1}^{n} \delta(Y_k, Y_l, X_u), \quad C^Y_{kl} = \frac{1}{m} \sum_{v=1}^{m} \delta(Y_k, Y_l, Y_v).$

Intuitively, if the $X$'s and $Y$'s come from an identical distribution, the proportions of $X$'s and $Y$'s inside the balls $\bar{B}(X_i, d(X_i, X_j))$ and $\bar{B}(Y_k, d(Y_k, Y_l))$ should be approximately equal, in other words, $A^X_{ij} \approx A^Y_{ij}$ and $C^X_{kl} \approx C^Y_{kl}$; consequently, $BD_{n,m}$ approaches zero in this scenario. On the contrary, under the alternative, $BD_{n,m}$ stays far away from zero.
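To make the definition concrete, the following minimal R sketch computes $BD_{n,m}$ for univariate samples by brute force, directly following the formulas above. It is only an illustration of the definition under the absolute-value metric, not the optimized C implementation shipped in the Ball package; the function name bd_naive is ours.

    # Brute-force O(n^3) two-sample Ball Divergence for univariate data,
    # following the definition above; for illustration only.
    bd_naive <- function(x, y) {
      n <- length(x); m <- length(y)
      # delta(center, edge, z): is z inside the closed ball centered at
      # `center` with radius |center - edge|?
      delta <- function(center, edge, z) abs(z - center) <= abs(edge - center)
      A <- 0
      for (i in seq_len(n)) for (j in seq_len(n)) {
        A <- A + (mean(delta(x[i], x[j], x)) - mean(delta(x[i], x[j], y)))^2
      }
      C <- 0
      for (k in seq_len(m)) for (l in seq_len(m)) {
        C <- C + (mean(delta(y[k], y[l], x)) - mean(delta(y[k], y[l], y)))^2
      }
      A / n^2 + C / m^2
    }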

Generally, assume that $K > 2$ with group sizes $n_1, \ldots, n_K$. The $K$-sample Ball Divergence statistic can be defined by directly summing up all of the two-sample Ball Divergence statistics,

    $BD_{sum} = \sum_{1 \le k < l \le K} BD_{n_k, n_l},$

or by finding the one group with the largest difference from the others,

    $BD_{max} = \max_{1 \le k \le K} \sum_{l \neq k} BD_{n_k, n_l},$

or by aggregating the most significantly different two-sample Ball Divergence statistics,

    $BD_{maxsum} = \sum_{t=1}^{K-1} BD_{(t)},$

where $BD_{(1)} \ge \cdots \ge BD_{(K-1)}$ are the largest $K - 1$ two-sample Ball Divergence statistics among $\{BD_{n_k, n_l} : 1 \le k < l \le K\}$. When $K = 2$, $BD_{sum}$, $BD_{max}$, and $BD_{maxsum}$ all degenerate into $BD_{n,m}$.
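A small R sketch may clarify how the three aggregations combine the pairwise statistics; it assumes the forms reconstructed above, and kbd_naive and bd_naive (from the previous sketch) are our illustrative names, not package functions.

    # Aggregate pairwise two-sample statistics into the three K-sample
    # versions, given a list of K univariate samples; illustration only.
    kbd_naive <- function(samples) {
      K <- length(samples)
      pairwise <- c()
      per_group <- numeric(K)  # sum of divergences involving each group
      for (k in seq_len(K - 1)) for (l in (k + 1):K) {
        bd_kl <- bd_naive(samples[[k]], samples[[l]])
        pairwise <- c(pairwise, bd_kl)
        per_group[k] <- per_group[k] + bd_kl
        per_group[l] <- per_group[l] + bd_kl
      }
      list(sum    = sum(pairwise),
           max    = max(per_group),
           maxsum = sum(sort(pairwise, decreasing = TRUE)[seq_len(K - 1)]))
    }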

2.2 Test of mutual independence and Ball Covariance statistic

Assume that $(V_1, d_1), \ldots, (V_p, d_p)$ are metric spaces, $\|\cdot\|$ is the Euclidean norm on $\mathbb{R}^p$, and $(V, d)$ is the product metric space of $V_1 \times \cdots \times V_p$, where the product metric is defined by

    $d(u, v) = \|(d_1(u^1, v^1), \ldots, d_p(u^p, v^p))\|, \quad u, v \in V.$

Given $n$ i.i.d. observations $\{Z_i = (Z_i^1, \ldots, Z_i^p)\}_{i=1}^{n}$ with $Z_i \in V$, denote $\theta$ and $\theta_1, \ldots, \theta_p$ as the associated joint probability measure and marginal probability measures; then the test of mutual independence can be formulated as

    $H_0: \theta = \theta_1 \otimes \cdots \otimes \theta_p$  versus  $H_1: \theta \neq \theta_1 \otimes \cdots \otimes \theta_p.$   (2)

For $u^k, v^k \in V_k$, denote $\bar{B}(u^k, d_k(u^k, v^k))$ as the closed ball with center $u^k$ and radius $d_k(u^k, v^k)$, and let $\delta_k(u^k, v^k, w^k) = I(w^k \in \bar{B}(u^k, d_k(u^k, v^k)))$. Thus, $\delta_k$ is a discriminant function taking 1 if $w^k$ is within the closed ball $\bar{B}(u^k, d_k(u^k, v^k))$, and 0 otherwise. Let

    $\Delta^{Z^k}_{ij,n} = \frac{1}{n} \sum_{t=1}^{n} \delta_k(Z_i^k, Z_j^k, Z_t^k), \quad k = 1, \ldots, p,$

and

    $\Delta^{Z}_{ij,n} = \frac{1}{n} \sum_{t=1}^{n} \prod_{k=1}^{p} \delta_k(Z_i^k, Z_j^k, Z_t^k).$

Then our Ball Covariance statistic can be presented as

    $\mathbf{BCov}^2_n(Z) = \frac{1}{n^2} \sum_{i,j=1}^{n} \left(\Delta^{Z}_{ij,n} - \prod_{k=1}^{p} \Delta^{Z^k}_{ij,n}\right)^2.$   (3)

We provide a heuristic explanation for $\mathbf{BCov}^2_n$. If $H_0$ holds, the proportion of $Z$'s in the product ball $\bar{B}(Z_i^1, d_1(Z_i^1, Z_j^1)) \times \cdots \times \bar{B}(Z_i^p, d_p(Z_i^p, Z_j^p))$ should be close to the product of the marginal proportions of the $Z^k$'s within the balls $\bar{B}(Z_i^k, d_k(Z_i^k, Z_j^k))$, i.e., $\Delta^{Z}_{ij,n} \approx \prod_{k=1}^{p} \Delta^{Z^k}_{ij,n}$. Because $\mathbf{BCov}^2_n$ is the summation of $(\Delta^{Z}_{ij,n} - \prod_{k=1}^{p} \Delta^{Z^k}_{ij,n})^2$, a significantly large $\mathbf{BCov}^2_n$ value implies that the null hypothesis is implausible.
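The following brute-force R sketch mirrors formula (3) for $p$ univariate components supplied as a list; again, this only illustrates the definition (bcov_naive is our name, not a package function).

    # Brute-force O(n^3) Ball Covariance for p univariate components,
    # following formula (3); for illustration only.
    bcov_naive <- function(z) {
      p <- length(z); n <- length(z[[1]])
      stat <- 0
      for (i in seq_len(n)) for (j in seq_len(n)) {
        # in_ball[t, k]: is z[[k]][t] inside the k-th ball for pair (i, j)?
        in_ball <- sapply(z, function(zk) abs(zk - zk[i]) <= abs(zk[j] - zk[i]))
        joint <- mean(apply(in_ball, 1, all))   # Delta^Z_{ij,n}
        marg  <- prod(colMeans(in_ball))        # prod_k Delta^{Z^k}_{ij,n}
        stat  <- stat + (joint - marg)^2
      }
      stat / n^2
    }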

The statistic (3) can be extended with positive weights $\hat{w}(Z_i, Z_j)$,

    $\mathbf{BCov}^2_{n,\hat{w}}(Z) = \frac{1}{n^2} \sum_{i,j=1}^{n} \hat{w}(Z_i, Z_j) \left(\Delta^{Z}_{ij,n} - \prod_{k=1}^{p} \Delta^{Z^k}_{ij,n}\right)^2.$   (4)

Several choices of weight are feasible. For example, we can take the Chi-square weight $\hat{w}_{\chi}(Z_i, Z_j) = \prod_{k=1}^{p} [\Delta^{Z^k}_{ij,n}(1 - \Delta^{Z^k}_{ij,n})]^{-1}$ and the probability weight $\hat{w}_{prob}(Z_i, Z_j) = [\prod_{k=1}^{p} \Delta^{Z^k}_{ij,n}]^{-1}$, and denote the corresponding statistics as $\mathbf{BCov}^2_{n,\chi}$ and $\mathbf{BCov}^2_{n,prob}$, respectively. The probability weight gives more weight to smaller balls, while the Chi-square weight standardizes $(\Delta^{Z}_{ij,n} - \prod_{k} \Delta^{Z^k}_{ij,n})^2$ by dividing by the variance of $\Delta^{Z}_{ij,n}$ (when the balls are fixed). Furthermore, $\mathbf{BCov}^2_{n,\chi}$ can be shown to be asymptotically equivalent to HHG (Pan et al., 2018c).

3 Algorithm

The computational complexities of the Ball statistics are proportional to $n^3$ if we compute them according to their exact definitions; however, in most cases, their computational complexities can be reduced to a more moderate level (see Table 1) by several efficient algorithms utilizing the rank nature of the Ball statistics. The rank nature motivates computational acceleration in the following three aspects. First, because quantities such as $nA^X_{ij}$ and $mA^Y_{ij}$ are ranks, the computation procedure of the Ball statistics can avoid repeatedly evaluating whether a point is located in a closed ball. Second, for the two-sample permutation test, by preserving the rank information obtained when $BD_{n,m}$ is first computed, the ranking procedure during the permutations becomes avoidable. Third, if the dataset is sampled from a univariate distribution, computing the number of points inside a closed ball is equivalent to finding the ranks of the lower and upper boundaries of the ball. The three methods are illustrated in Sections 3.1, 3.2, and 3.3, respectively. Unfortunately, in the general situation, the computation of the Ball Covariance statistic is not easy to optimize. Nevertheless, with engineering optimizations such as multi-threading, the time consumption of the Ball Covariance statistic can be cut down.

Statistics                                             Univariate     Other
$BD_{n,m}$, $BD_{sum}$, $BD_{max}$, $BD_{maxsum}$      $O(N^2)$       $O(N^2 \log N)$
$\mathbf{BCov}^2_n$                                    $O(n^2)$       $O(n^2 \log n)$

Table 1: The optimized computational complexities of several Ball statistics ($N$ denotes the total sample size).

3.1 Rank based algorithm

We first recap the algorithm for $BD_{n,m}$ proposed by Pan et al. (2018a). Assume the pairwise distance matrix of $\{X_1, \ldots, X_n, Y_1, \ldots, Y_m\}$ is

    $D = \begin{pmatrix} D^{X} & D^{XY} \\ D^{YX} & D^{Y} \end{pmatrix},$

where $D^{X}$ and $D^{Y}$ are the within-sample distance matrices and $D^{XY} = (D^{YX})^{\top}$ collects the between-sample distances. After spending $O(N^2 \log N)$ time ($N = n + m$) on ranking $D^{X}$ and $D$ row by row to obtain the corresponding rank matrices $R^{X}$ and $R$, we pay only $O(1)$ time to compute $A^X_{ij}$ and $A^Y_{ij}$ by directly extracting the $(i, j)$-elements of $R^{X}$ and $R$. Therefore, $A_{n,m}$ is of $O(n^2)$ time complexity, and so is $C_{n,m}$. In summary, for $BD_{n,m}$ and the general $K$-sample Ball Divergence statistics, the computing times are no more than $O(N^2 \log N)$.
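In R, the rank trick for the within-sample quantities can be sketched in a single pass per row; rank() with ties.method = "max" counts how many within-sample distances do not exceed $d(X_i, X_j)$ (ball_counts_by_rank is our illustrative name, not a package function).

    # Rank-based computation of all A^X_{ij} from the within-sample
    # distance matrix dx: the number of X's inside the ball centered at
    # X_i with radius d(X_i, X_j) is the (maximum) rank of d(X_i, X_j)
    # in the i-th row. Sketch only.
    ball_counts_by_rank <- function(dx) {
      n <- nrow(dx)
      ax <- matrix(0, n, n)
      for (i in seq_len(n)) {
        ax[i, ] <- rank(dx[i, ], ties.method = "max") / n  # A^X_{ij}
      }
      ax
    }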

With respect to $\mathbf{BCov}^2_n$, the aforementioned algorithm can also be directly applied to calculate $\Delta^{Z^1}_{ij,n}$ and $\Delta^{Z^2}_{ij,n}$ within $O(n^2 \log n)$ time. To compute $\Delta^{Z}_{ij,n}$ within $O(n^2 \log n)$ time, fix $i$ and let $R^{k}_{it}$ denote the rank of $d_k(Z_i^k, Z_t^k)$ among $\{d_k(Z_i^k, Z_1^k), \ldots, d_k(Z_i^k, Z_n^k)\}$; rearranging the observations in ascending order of $R^{1}_{it}$, we further have

    $n\Delta^{Z}_{ij,n} = \#\{t : R^{1}_{it} \le R^{1}_{ij} \text{ and } R^{2}_{it} \le R^{2}_{ij}\},$   (5)

i.e., the number of observations whose ranks in both coordinates are not larger than those of the $j$-th observation. Owing to (5), the computation of $\Delta^{Z}_{ij,n}$ turns into a typical "count of smaller numbers after self" problem (https://leetcode.com/problems/count-of-smaller-numbers-after-self/), which can be solved in $O(n \log n)$ time via well designed algorithms such as the binary indexed tree, the binary search tree, and merge sort. Our implementation utilizes the merge sort algorithm to resolve the problem due to its efficiency and its freedom from extra complicated data structures. Thus, the time consumption of computing $\{\Delta^{Z}_{ij,n}\}$ reduces to $O(n^2 \log n)$, and finally, the computation of $\mathbf{BCov}^2_n$ costs $O(n^2 \log n)$ time.
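A compact R sketch of the merge-sort solution of this counting subproblem is given below; for each element it returns the number of strictly smaller elements positioned after it (count_smaller_after is our name; the package solves the same subproblem in C).

    # "Count of smaller numbers after self" via merge sort, O(n log n).
    count_smaller_after <- function(a) {
      counts <- numeric(length(a))
      sort_count <- function(idx) {            # idx: contiguous positions
        if (length(idx) <= 1) return(idx)
        mid <- length(idx) %/% 2
        left <- sort_count(idx[seq_len(mid)])  # value-sorted index lists
        right <- sort_count(idx[-seq_len(mid)])
        merged <- integer(0); i <- 1; j <- 1
        while (i <= length(left) || j <= length(right)) {
          if (j > length(right) ||
              (i <= length(left) && a[left[i]] <= a[right[j]])) {
            # every right element merged so far is smaller and comes after
            counts[left[i]] <<- counts[left[i]] + (j - 1)
            merged <- c(merged, left[i]); i <- i + 1
          } else {
            merged <- c(merged, right[j]); j <- j + 1
          }
        }
        merged
      }
      sort_count(seq_along(a))
      counts
    }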

3.2 Optimized permutation test procedure

The permutation test procedure requires randomly exchanging the group labels such that the permuted datasets $X^{*}$ and $Y^{*}$ come from the same distribution. It is worth noting that the ranking procedure for the distance matrix of $\{X^{*}, Y^{*}\}$ is actually unnecessary, because we can obtain the permuted distance matrix $D^{*}$ by exchanging the rows and columns of $D$ according to the manner in which the group labels were exchanged. The exchanging procedure only needs quadratic time. As regards the rank matrix $R^{X^{*}}$, we develop Algorithm 1 to make its calculation cost quadratic time. Treating $Y^{*}$ as the first group and employing Algorithm 1 on it, the time consumption of $R^{Y^{*}}$ also reduces to $O(N^2)$. Therefore, during the permutations, the time complexities of $BD_{sum}$, $BD_{max}$, and $BD_{maxsum}$ reduce to $O(N^2)$. Algorithm 1 relies on the index matrix $I$ recorded when we first rank the distance matrix $D$, where $I_{ij} = k$ means that the $j$-smallest element of the $i$-th row of $D$ is $D_{ik}$. The details of Algorithm 1 are shown below.

0:  An index matrix $I \in \mathbb{N}^{N \times N}$ obtained when the distance matrix $D$ is first ranked, where $I_{ij} = k$ means that the $j$-smallest element of the $i$-th row of $D$ is $D_{ik}$; a group index vector $g$ taking value 1 or 2, where $g_i = 1$ means the $i$-th observation belongs to the first group after permutation, and otherwise the second group; a vector $loc$ recording the relative location of each first-group sample in the dataset.
1:  Exchange the rows and columns of $I$ according to the manner in which the group labels were exchanged and obtain $I^{*}$.
2:  for $i = 1, \ldots, N$ such that $g_i = 1$ do
3:     $r \leftarrow 0$.
4:     for $j = 1, \ldots, N$ do
5:        $k \leftarrow I^{*}_{ij}$.
6:        if $g_k = 1$ then
7:           $r \leftarrow r + 1$,
8:           $R^{X^{*}}_{loc(i),\, loc(k)} \leftarrow r$.
9:        end if
10:     end for
11:  end for
12:  return $R^{X^{*}}$
Algorithm 1 Optimized algorithm for the computation of $R^{X^{*}}$
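An R transcription of Algorithm 1 may clarify the bookkeeping. Since permuting labels leaves $D$ itself unchanged, the sketch below reuses the index matrix of the original distance matrix together with the permuted label vector (rank_after_permutation is our illustrative name, not a package function).

    # Recover the within-group rank matrix of the first group after a
    # label permutation in O(N^2) time, with no re-sorting. index_mat
    # can be computed once via t(apply(D, 1, order)).
    rank_after_permutation <- function(index_mat, group) {
      first <- which(group == 1)
      loc <- match(seq_along(group), first)  # relative location in group 1
      n1 <- length(first)
      r_star <- matrix(0L, n1, n1)
      for (i in first) {
        r <- 0L
        for (j in seq_along(group)) {
          k <- index_mat[i, j]               # j-smallest element of row i
          if (group[k] == 1L) {
            r <- r + 1L
            r_star[loc[i], loc[k]] <- r
          }
        }
      }
      r_star
    }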

3.3 Fast algorithm for univariate data

We pay our attention to $\{A^X_{ij}\}$ first. For convenience, assume $X_1 \le \cdots \le X_n$ have been sorted in ascending order. For univariate data, we have

    $nA^X_{ij} = \#\{u : X_i - |X_i - X_j| \le X_u \le X_i + |X_i - X_j|\}.$   (6)

Let $l_{ij}$ be the smallest index satisfying $X_{l_{ij}} \ge X_i - |X_i - X_j|$ and $u_{ij}$ the largest index satisfying $X_{u_{ij}} \le X_i + |X_i - X_j|$. Thanks to (6), it is easy to verify that $nA^X_{ij} = u_{ij} - l_{ij} + 1$, and consequently, an alternative way to compute $A^X_{ij}$ is to find $l_{ij}$ and $u_{ij}$. Following this inspiration, we develop Algorithm 2 to accomplish the computation of $\{u_{ij} : j = 1, \ldots, n\}$ for a fixed $i$ in linear time. By slightly modifying Algorithm 2, the computation of $\{l_{ij}\}$ also reduces to linear time, and hence, the computational complexity of $\{A^X_{ij}\}$ reduces to $O(n^2)$. Similarly, the time complexity of $\{C^X_{kl}\}$ and $\{C^Y_{kl}\}$ is quadratic. In summary, the computational complexity of $BD_{n,m}$ is proportional to $N^2$ for univariate datasets, and so are those of $BD_{sum}$, $BD_{max}$, and $BD_{maxsum}$.

0:  A sorted array $X_1 \le \cdots \le X_n$ in ascending order; a fixed center index $i$.
1:  Initialize $s \leftarrow i - 1$, $t \leftarrow i + 1$, and $u$ as the largest index with $X_u \le X_i$; set $u_{ii} \leftarrow u$.
2:  while $s \ge 1$ or $t \le n$ do
3:     if $t > n$ or ($s \ge 1$ and $X_i - X_s \le X_t - X_i$) then
4:        while $u < n$ and $X_{u+1} \le 2X_i - X_s$ do $u \leftarrow u + 1$ end while,
5:        $u_{is} \leftarrow u$, $s \leftarrow s - 1$.
6:     else
7:        while $u < n$ and $X_{u+1} \le X_t$ do $u \leftarrow u + 1$ end while,
8:        $u_{it} \leftarrow u$, $t \leftarrow t + 1$.
9:     end if
10:  end while
11:  return $\{u_{ij} : j = 1, \ldots, n\}$
Algorithm 2 Fast algorithm for $\{u_{ij}\}$ in the univariate case
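The boundary-rank identity (6) can also be checked directly with base R's findInterval, a logarithmic-time stand-in for the linear-time pointers of Algorithm 2 (ball_count_univariate is our illustrative name, not a package function).

    # Count of sample points inside the ball [X_i - r, X_i + r] with
    # r = |X_i - X_j|, from the ranks of the two boundaries in the
    # sorted sample; sketch of identity (6).
    ball_count_univariate <- function(x_sorted, i, j) {
      r <- abs(x_sorted[i] - x_sorted[j])
      upper <- findInterval(x_sorted[i] + r, x_sorted)   # u_ij
      lower <- findInterval(x_sorted[i] - r, x_sorted,
                            left.open = TRUE)            # l_ij - 1
      upper - lower                                      # n * A^X_ij
    }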

With regard to $\mathbf{BCov}^2_n$, with a slightly modified Algorithm 2, the time complexity of computing $\{\Delta^{Z^k}_{ij,n}\}$ reduces to $O(n^2)$ when $Z^1$ and $Z^2$ are univariate random variables. Further, retaining the boundary indices $l^{k}_{ij}$ and $u^{k}_{ij}$ used for calculating $\Delta^{Z^1}_{ij,n}$ and $\Delta^{Z^2}_{ij,n}$, which satisfy $n\Delta^{Z^k}_{ij,n} = u^{k}_{ij} - l^{k}_{ij} + 1$, the computation procedure of $\{\Delta^{Z}_{ij,n}\}$ only requires quadratic time by Algorithm 3, and we finally accomplish the calculation of $\mathbf{BCov}^2_n$ in quadratic time. Motivated by the method reported in Heller et al. (2016), Algorithm 3 utilizes the inclusion-exclusion principle to efficiently compute $\{\Delta^{Z}_{ij,n}\}$.

0:  A set of $n$ bivariate observations $\{(Z^1_i, Z^2_i)\}_{i=1}^{n}$; a set of lower and upper bound pairs $\{(l^{1}_{ij}, u^{1}_{ij}), (l^{2}_{ij}, u^{2}_{ij})\}$.
1:  Compute the ranks of $\{Z^1_i\}$ and $\{Z^2_i\}$ respectively, and denote them as $\{r^1_i\}$ and $\{r^2_i\}$.
2:  Pay $O(n^2)$ time to calculate the bivariate empirical cumulative count $\hat{F}(a, b) = \#\{t : r^1_t \le a,\ r^2_t \le b\}$.
3:  for $i = 1, \ldots, n$ do
4:     for $j = 1, \ldots, n$ do
5:        $n\Delta^{Z}_{ij,n} \leftarrow \hat{F}(u^{1}_{ij}, u^{2}_{ij}) - \hat{F}(l^{1}_{ij} - 1, u^{2}_{ij}) - \hat{F}(u^{1}_{ij}, l^{2}_{ij} - 1) + \hat{F}(l^{1}_{ij} - 1, l^{2}_{ij} - 1)$.
6:     end for
7:  end for
8:  return $\{\Delta^{Z}_{ij,n}\}$
Algorithm 3 Fast algorithm for $\{\Delta^{Z}_{ij,n}\}$ in the univariate case
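The inclusion-exclusion step of Algorithm 3 can be sketched in R with a two-dimensional cumulative count table over the rank grid; rect_count is our illustrative helper and assumes distinct values, so that the ranks form a permutation.

    # Precompute a bivariate cumulative count table from rank vectors
    # r1, r2, then answer rectangle-count queries in O(1) each.
    rect_count <- function(r1, r2) {
      n <- length(r1)
      tab <- matrix(0L, n, n)
      for (t in seq_len(n)) tab[r1[t], r2[t]] <- tab[r1[t], r2[t]] + 1L
      cdf <- apply(tab, 2, cumsum)      # cumulative over the first rank
      cdf <- t(apply(cdf, 1, cumsum))   # cumulative over the second rank
      # points in the rectangle (l1..u1) x (l2..u2), inclusion-exclusion:
      function(l1, u1, l2, u2) {
        f <- function(a, b) if (a < 1 || b < 1) 0L else cdf[a, b]
        f(u1, u2) - f(l1 - 1, u2) - f(u1, l2 - 1) + f(l1 - 1, l2 - 1)
      }
    }

For instance, cnt <- rect_count(rank(z1), rank(z2)) returns a query function, and cnt(l1, u1, l2, u2) yields the count appearing in step 5.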

4 The Ball package

In this section, we introduce the Ball package, an R package that implements the Ball statistics introduced in Section 2 as well as the algorithms illustrated in Section 3. The core of Ball is programmed in C to improve computational efficiency. Moreover, we employ parallel computing in the Ball statistics based test procedures to speed up computation. To be specific, multiple Ball statistics are computed concurrently with OpenMP, which supports multi-platform shared-memory multiprocessing programming at the C level. Aside from speed, the Ball package is concise and user-friendly. R users can conduct the $K$-sample test and the test of independence via the bd.test and bcov.test functions in the Ball package.

We supply instructions and basic usage for the bd.test and bcov.test functions in Section 4.1. Additionally, in Section 4.2, two real data examples are provided to demonstrate how to use these functions to tackle manifold-valued data.

4.1 Functions bd.test and bcov.test

The functions bd.test and bcov.test are programmed for the $K$-sample test and test of independence problems, respectively. The two functions are called as:

bd.test(x, y = NULL, R = 99, dst = FALSE, size = NULL, seed = 4,
        num.threads = 2, kbd.type = "sum")
bcov.test(x, y = NULL, R = 99, dst = FALSE, weight = FALSE, seed = 4,
          num.threads = 2)

The arguments of two functions are described as follows.

  • x: a numeric vector, matrix, data.frame, or dist object, or a list containing numeric vectors, matrices, data.frames, or dist objects.

  • y: a numeric vector, matrix, data.frame or dist object.

  • R: the number of permutation replications, must be a non-negative integer. When R = 0, the Ball statistics will be returned. Default: R = 99

  • dst: if dst = TRUE, x and y are considered as distance matrices or a list containing distance matrices. Default: dst = FALSE

  • size: a vector recording the sample size of each group; only available for bd.test.

  • weight: a logical or character value used to choose the form of $\hat{w}$. If weight = FALSE, the Ball Covariance with constant weight is used. Alternatively, weight = TRUE and weight = "prob" indicate that the probability weight is chosen, while setting weight = "chisq" selects the Chi-square weight. Note that this argument actually only influences the printed result in the R console. At present, this argument is only available for the bcov.test function. Default: weight = FALSE

  • seed: the random seed. Default: seed = 4

  • num.threads: the number of threads used in the Ball statistics based hypothesis test. Default: num.threads = 2

  • kbd.type: a character value controlling the output information. Setting kbd.type = "sum", kbd.type = "maxsum", or kbd.type = "max", the statistic value and $p$ value of the $BD_{sum}$, $BD_{maxsum}$, or $BD_{max}$ based $K$-sample test procedure are displayed. Note that this argument actually only influences the printed result in the R console and is only available for the bd.test function. Default: kbd.type = "sum"

If R > 0, the output is an htest class object like what the t.test function returns. The output object contains the value of the Ball statistic (statistic), the $p$ value of the test (p.value), the number of permutation replications (replicates), the sample size of each group (size), a list containing multiple statistic values and their corresponding $p$ values (complete.info), a character string declaring the alternative hypothesis (alternative), a character string describing the hypothesis test (method), and a character string giving the name and helpful information of the data (data.name). If R = 0, the Ball statistic value is returned.
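For instance, individual components of the returned htest object can be extracted in the usual way (a small usage sketch):

R> result <- bd.test(x = rnorm(30), y = rnorm(30))
R> result$statistic      # the Ball Divergence statistic
R> result$p.value        # the permutation p value
R> result$complete.info  # all computed statistics and p values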

As a quick example, we generate simulated data and carry out hypothesis tests based on the Ball statistics to see whether the null hypothesis can be rejected when the distributions are truly different or the variables are truly associated.

For the $K$-sample test problem ($K = 2$), we generate two univariate datasets $X = \{X_1, \ldots, X_{50}\}$ and $Y = \{Y_1, \ldots, Y_{50}\}$ with different location parameters, where $X_i \sim N(0, 1)$ and $Y_i \sim N(1, 1)$. The detailed R code is as follows.

R> library("Ball") R> set.seed(1) R> x <- rnorm(50) R> y <- rnorm(50, mean = 1) R> kbd_result <- bd.test(x = x, y = y) R> kbd_result

2-sample Ball Divergence Test

data: x and y
number of observations = 100, group sizes: 50 50
replicates = 99
bd = 0.092215, p-value = 0.01
alternative hypothesis: distributions of samples are distinct

In this example, the function bd.test yields a value of 0.0922 for $BD_{n,m}$ and a $p$ value of 0.01 when we replicate the permutation procedure R = 99 times. Since the $p$ value is below 0.05, we reject the null hypothesis. Consequently, the hypothesis test result is concordant with the data generation mechanism.

As regards the test of mutual independence, we sample 100 i.i.d. observations $Z = (Z^1, Z^2, Z^3)$ from a trivariate normal distribution centered at the origin, whose covariance matrix has unit variances and pairwise correlations of 0.3. The data generation procedure and the mutual independence test based on $\mathbf{BCov}^2_n$ are demonstrated below.

R> library("mvtnorm") R> set.seed(1) R> cov_mat <- matrix(0.3, 3, 3) R> diag(cov_mat) <- 1 R> data_set <- rmvnorm(n = 100, mean = c(0, 0, 0), sigma = cov_mat) R> data_set <- as.list(as.data.frame(data_set)) R> ind_result <- bcov.test(x = data_set) R> ind_result

Ball Covariance test of mutual independence

data: data_set
number of observations = 100
replicates = 99, weight: Constant
bcov = 0.00063808, p-value = 0.01
alternative hypothesis: random variables are dependent

The output of bcov.test shows that $\mathbf{BCov}^2_n$ is 0.00063808 when the constant weight is used, and the $p$ value is 0.01, leading to the conclusion that the three univariate variables are mutually dependent.

4.2 Examples

4.2.1 Wind direction dataset

We consider the hourly wind speed and direction data recorded on the Atlantic coast of Galicia in the winter seasons from 2003 to 2012, provided in the R package NPCirc (Oliveira et al., 2014). We want to figure out whether there exist wind direction differences between the first and the last week of the 2007-08 winter season. We apply $BD_{n,m}$ to these data by carrying out the following steps. First, we calculate the distance matrix between samples using the great-circle distance. Then, we pass the distance matrix to the argument x and set dst = TRUE and size = c(168, 167).

R> library("Ball") R> data("speed.wind", package = "NPCirc") R> index1 <- which(speed.wind[["Year"]] == 2007 & R+ speed.wind[["Month"]] == 11 & R+ speed.wind[["Day"]] R> index2 <- which(speed.wind[["Year"]] == 2008 & R+ speed.wind[["Month"]] == 1 & R+ speed.wind[["Day"]] R> d1 <- na.omit(speed.wind[["Direction"]][index1]) R> d2 <- na.omit(speed.wind[["Direction"]][index2]) R> # sample size: R> # c(length(d1), length(d2)) # 168, 167 R> # compute the great-cicle (geodesic) distance: R> theta <- c(d1, d2) / 360 R> dat <- cbind(cos(theta), sin(theta)) R> dx <- nhdist(dat, method = "geo") R> # hypothesis test with Ball divergence: R> bd.test(dx, size = c(168, 167), dst = TRUE)

2-sample Ball Divergence Test

data: dx
number of observations = 335, group sizes: 168 167
replicates = 99
bd = 0.29114, p-value = 0.01
alternative hypothesis: distributions of samples are distinct

As can be seen from the output of bd.test, the $BD_{n,m}$ value is 0.2911 and the $p$ value is smaller than 0.05. To further confirm our conclusion, we visualize the wind directions of the two groups in Figure 1 with the R package circular (Agostinelli and Lund, 2017). Figure 1 shows that the hourly wind directions in the first week are concentrated around 90 degrees, while the wind directions in the last week are widely dispersed.

R> library("circular") R> par(mfrow = c(1, 2), mar = c(0, 0, 0, 0)) R> # visualize the wind direction: R> plot(circular(c(d1), units = "degrees"), bin = 100, stack = TRUE, R+ shrink = 1.5) R> plot(circular(c(d2), units = "degrees"), bin = 100, stack = TRUE, R+ shrink = 1.5)

Figure 1: The raw circular data plot of the hourly wind direction data. The left panel corresponds to the first week of the 2007-08 winter season and the right panel to the last week.

4.2.2 Brain fMRI dataset

We examine a public fMRI dataset from the 1000 Functional Connectomes Project (https://www.nitrc.org/frs/?group_id=296), which called on the principal investigators of its member sites to donate neuroimaging data so that the broader imaging community has complete access to a large-scale functional imaging dataset. Given the resting-state fMRI scans and demographics of 86 samples donated by ICBM, it is of interest to evaluate whether age is associated with brain connectivity. To properly analyze this dataset, we pre-process the three fMRI scans of each individual with the nilearn (Abraham et al., 2014) package in the Python environment. The pre-processing includes the following four steps: (i) segment the brain into a set of 116 cortical and subcortical regions with the Automated Anatomical Labeling (AAL) template (Tzourio-Mazoyer et al., 2002); (ii) for each fMRI scan, compute a correlation coefficient matrix of the 116 regions based on the voxel series values, since this brain connectivity measurement is widely used in the neuroimaging literature (Ginestet et al., 2017; Ginestet and Simmons, 2011; Bullmore and Sporns, 2009); (iii) average the three correlation matrices element-wise and save the mean matrix to disk so that it can be analyzed with R; (iv) in the R environment, combine the collection of mean correlation matrices as well as the demographics into a list object, then save it to disk in ".rda" format.

To achieve our goal, the pairwise distance matrices of the covariance matrices and ages are first computed and stored in the R objects dx and dy, respectively. Then, we pass dx and dy to the bcov.test function so as to perform the test of independence; meanwhile, we set dst = TRUE to declare that what the x and y arguments receive are distance matrices. The detailed R code is demonstrated below.

R> library("Ball") R> library("CovTools") R> load("niICBM.rda") R> # Pairwise distance matrix of covariance matrix: R> dx <- CovDist(niICBM[["spd"]]) R> # Pairwise distance matrix of age: R> dy <- dist(niICBM[["covariate"]][["V3"]]) R> # hypothesis test with Ball Covariance: R> bcov.test(x = dx, y = dy, dst = TRUE, weight = "prob")

Ball Covariance test of independence

data: dx and dy
number of observations = 86
replicates = 99, weight: Probability
bcov.prob = 0.017138, p-value = 0.01
alternative hypothesis: random variables are dependent

The output message shows that $\mathbf{BCov}^2_{n,prob}$ is 0.0171 and the $p$ value is smaller than the significance level; thus, we conclude that the structure of the brain is associated with individual age. This result is also supported by recent research showing that age strongly affects structural brain connectivity (Damoiseaux, 2017). In this example, we use the affine-invariant Riemannian metric (Pennec et al., 2006) implemented in CovTools (Lee and You, 2018) to evaluate the structural difference among correlation matrices, and the Euclidean distance to measure the age difference.

5 Numerical studies

In this section, numerical studies are conducted to assess the performance of the Ball statistics based test procedures for complex data, including directional, tree-structured, symmetric positive definite matrix, and network data, which live in sphere spaces, tree metric spaces, symmetric positive definite matrix spaces, and normalized Laplacian matrix spaces, respectively. For comparison, Energy distance and HHG are considered for the $K$-sample test, while distance covariance and HHG are considered for the test of independence. The permutation technique helps us obtain the empirical distributions of these statistics under the null hypothesis and derive the $p$ values. In all of the following Models, we repeat each setting 1000 times and fix the nominal significance level at 0.05.

5.1 -sample test

In this section, we investigate the performance of the test statistics in revealing the distribution differences of two kinds of random complex data, i.e., directional data and tree-structured data, which are frequently met by scientists interested in wind directions, marine currents (Marzio et al., 2014), or cancer evolution (Abbosh et al., 2017). To sample directional and tree-structured data, the rvmf and rtree functions in the R packages Directional (Tsagris et al., 2018) and ape (Paradis et al., 2018) are used to generate data from the von Mises-Fisher distribution and the random tree distribution, where the former is specified by a direction parameter $\mu$ and a concentration parameter $\kappa$, and the latter by the number of tree nodes and an independent random sample collection associated with the branch lengths of the tree. The dissimilarities between pairwise directional data and pairwise tree-structured data are measured by the great-circle distance and the Kendall-Colijn metric (Kendall and Colijn, 2015), which are programmed in the nhdist and multiDist functions of the R packages Ball and treespace (Jombart et al., 2017), respectively.
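To illustrate the simulation pipeline, a sketch of a single replication for directional data is given below; the direction and concentration parameters here are hypothetical placeholders for illustration, not the Model settings of this section.

R> library("Ball")
R> library("Directional")
R> set.seed(1)
R> # hypothetical vMF parameters, for illustration only:
R> x <- rvmf(20, mu = c(0, 0, 1), k = 5)
R> y <- rvmf(20, mu = c(0, 1, 0), k = 5)
R> # great-circle distance matrix of the pooled sample:
R> dxy <- nhdist(rbind(x, y), method = "geo")
R> bd.test(dxy, size = c(20, 20), dst = TRUE, R = 99)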

We conduct the numerical analyses for the directional data in Models 4.1.1, 4.1.3-4.1.5, and 4.1.9, and for the tree-structured data in the other Models. Among them, Models 4.1.1 and 4.1.2 are designed for Type-I error evaluation, while all other Models are devoted to evaluating the empirical power. Specifically, Models 4.1.3-4.1.8 focus on the case where any two groups of the $K$ samples are distinct, while Models 4.1.9 and 4.1.10 pay attention to the case where only one group differs from the others. Without loss of generality, we fix $K = 3$ in Models 4.1.1-4.1.8; a fixed number of groups is likewise used in Models 4.1.9 and 4.1.10. Keeping all groups at the same sample size, we let the sample size of each group increase over 10, 20, 30, 40, and 50.

  • 4.1.1: von Mises-Fisher distribution. The three groups share a common direction parameter and a common concentration parameter.

  • 4.1.2: Twenty-node random tree distribution. The branch lengths of the three groups are independently sampled from a common uniform distribution.

  • 4.1.3: von Mises-Fisher distribution. The direction parameters are $\mu_1, \mu_2, \mu_3$, and the concentration parameters are $\kappa_1, \kappa_2, \kappa_3$.

  • 4.1.4: Mixture von Mises-Fisher distribution. The direction parameters are $\mu_1, \mu_2, \mu_3$ and the concentration parameters are $\kappa_1, \kappa_2, \kappa_3$; the mixture proportions differ across the three groups.

  • 4.1.5: Mixture von Mises-Fisher distribution. The direction parameters are $\mu_1, \mu_2, \mu_3$, and the concentration parameters are $\kappa_1, \kappa_2, \kappa_3$; the mixture proportions differ across the three groups.

  • 4.1.6: Fifteen-node random tree distribution. The branch lengths of the three groups are independently sampled from three different uniform distributions, respectively.

  • 4.1.7: Fifteen-node random tree distribution. The branch lengths of the three groups are independently sampled from three different uniform distributions, respectively.

  • 4.1.8: Fifteen-node random tree distribution. The branch lengths of the three groups are independently sampled from three different distributions, respectively.

Since only one group differs from the others, it suffices to specify the following Models by describing the distributions of two groups.

  • 4.1.9: von Mises-Fisher distribution. The direction parameters are $\mu_1$ and $\mu_2$, and the concentration parameters are $\kappa_1$ and $\kappa_2$.

  • 4.1.10: Fifteen-node random tree distribution. The branch lengths are independently sampled from two different uniform distributions.

The Type-I error rates and power estimates are presented in Figures 2 and 3. From Figure 2, all three Ball Divergence statistics control the Type-I error rates well around the significance level. The results in Figure 3 show that at least one of the Ball Divergence statistics outperforms Energy distance and HHG in all of the cases. More specifically,