A fast algorithm for computing distance correlation

10/26/2018
by   Arin Chaudhuri, et al.
SAS

Classical dependence measures such as Pearson correlation, Spearman's ρ, and Kendall's τ can detect only monotonic or linear dependence. To overcome these limitations, Székely et al. (2007) proposed distance covariance as a weighted L_2 distance between the joint characteristic function and the product of the marginal characteristic functions. The distance covariance is 0 if and only if two random vectors X and Y are independent. This measure has the power to detect the presence of a dependence structure when the sample size is large enough. They further showed that the sample distance covariance can be calculated simply from modified Euclidean distances, which typically requires O(n^2) cost, where n is the sample size. The quadratic computing time greatly limits the application of distance covariance to large data. In this paper, we present a simple exact O(n log n) algorithm to calculate the sample distance covariance between two univariate random variables. The proposed method essentially consists of two sorting steps, so it is easy to implement. Empirical results show that the proposed algorithm is significantly faster than state-of-the-art methods. The algorithm's speed will enable researchers to explore complicated dependence structures in large datasets.


1 Introduction

Detecting dependencies between two random vectors $X$ and $Y$ is a fundamental problem in statistics and machine learning. Dependence measures such as Pearson's correlation, Spearman's $\rho$, and Kendall's $\tau$ are used in almost all quantitative areas; example areas are bioinformatics (Guo et al., 2014; Sferra et al., 2017) and time series (Zhou, 2012). However, these classical dependence measures are usually designed to detect one specific dependence structure, such as a monotonic or linear structure. It is easy to construct highly dependent $X$ and $Y$ whose dependence cannot be detected by classical dependence measures. To overcome these limitations, Székely et al. (2007) and Székely and Rizzo (2009) proposed distance covariance as a weighted $L_2$ distance between the joint characteristic function and the product of the marginal characteristic functions. The distance covariance is $0$ if and only if the two random vectors $X$ and $Y$ are independent. A closely related measure is the Hilbert-Schmidt independence criterion (HSIC), which has been extensively studied in the machine learning literature (Gretton et al., 2005, 2008; Pfister et al., 2018). Sejdinovic et al. (2013) established the equivalence of distance covariance and HSIC.

Despite the power of the sample distance covariance to detect a dependence structure, its use for large sample sizes is inhibited by the high computational cost required. Computing the sample distance covariance or HSIC typically requires $O(n^2)$ pairwise distance (kernel) evaluations and $O(n^2)$ memory for storing them. This is undesirable and greatly limits the application of distance correlation to large datasets. In the era of big data, it is not rare to see data that consist of millions of observations. For such data, an $O(n^2)$ algorithm is almost impossible to run on a personal computer. To approximate the distance covariance or HSIC for large data, the Nyström approach or the random Fourier feature method is often adopted. However, the use of these approximations leads to a reduction in power (Zhang et al., 2018). In this article, we describe an exact method to compute the sample distance covariance between two univariate random variables with $O(n \log n)$ computational cost and $O(n)$ memory cost. Our proposed method essentially consists of just two sorting steps, which makes it easy to implement.

A closely related algorithm for the sample distance covariance was proposed by Huo and Székely (2016). Our algorithm differs from theirs in the following ways. First, they implicitly assume that there are no ties in the data (see Algorithm 1 and the proof in Huo and Székely (2016)), whereas our proposed method is valid for any pair of real-valued univariate variables. In practice, it is common to see datasets with ties, especially for discrete variables or bootstrap samples. Second, we use a merge sort instead of an AVL tree-type implementation to compute the Frobenius inner product of the distance matrices of $x$ and $y$. Empirical results show that our proposed method is significantly faster; for example, for one million observations our MATLAB implementation runs 10 times faster (finishing in 4 seconds) on our desktop, whereas the implementation in Huo and Székely (2016) requires 40 seconds. Because our implementation consists only of MATLAB code while the key step in the Huo and Székely (2016) routine is implemented in C, even greater speed increases are possible by rewriting the critical parts of our implementation in C.

The rest of the paper is organized as follows. In Section 2, we briefly introduce the definition of distance covariance and its sample estimate. In Section 3, we describe the proposed algorithm for the sample distance covariance. In Section 4, experimental results are presented. Finally, conclusions and remarks are given in Section 5.

2 Some Preliminaries

Denote the joint characteristic function of $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$ by $f_{X,Y}(t,s)$, and denote the marginal characteristic functions of $X$ and $Y$ by $f_X(t)$ and $f_Y(s)$, respectively. Denote by $|\cdot|$ the Euclidean norm. The squared distance covariance is defined as the weighted $L_2$ distance between $f_{X,Y}$ and $f_X f_Y$,

$$\mathcal{V}^2(X,Y) = \int_{\mathbb{R}^{p+q}} \frac{|f_{X,Y}(t,s) - f_X(t) f_Y(s)|^2}{c_p c_q |t|^{1+p} |s|^{1+q}} \, dt \, ds,$$

where $c_p = \pi^{(1+p)/2}/\Gamma((1+p)/2)$ and $c_q = \pi^{(1+q)/2}/\Gamma((1+q)/2)$ are constants. It is obvious that $\mathcal{V}^2(X,Y) = 0$ if and only if $X$ and $Y$ are independent. Under some mild conditions, the squared distance covariance can be defined equivalently as an expectation of Euclidean distances,

$$\mathcal{V}^2(X,Y) = E|X-X'|\,|Y-Y'| + E|X-X'|\,E|Y-Y'| - 2\,E|X-X'|\,|Y-Y''|, \qquad (1)$$

where $(X',Y')$ and $(X'',Y'')$ are independent and identically distributed copies from the joint distribution of $(X,Y)$.

The squared distance correlation is defined by

$$\mathcal{R}^2(X,Y) = \begin{cases} \dfrac{\mathcal{V}^2(X,Y)}{\sqrt{\mathcal{V}^2(X,X)\,\mathcal{V}^2(Y,Y)}}, & \mathcal{V}^2(X,X)\,\mathcal{V}^2(Y,Y) > 0, \\ 0, & \text{otherwise}. \end{cases} \qquad (2)$$

Let $(x_1,y_1), \dots, (x_n,y_n)$ be the sample collected. Define

$$a_{ij} = |x_i - x_j|, \quad b_{ij} = |y_i - y_j|, \quad a_{i\cdot} = \sum_{j=1}^n a_{ij}, \quad b_{i\cdot} = \sum_{j=1}^n b_{ij}, \quad a_{\cdot\cdot} = \sum_{i,j=1}^n a_{ij}, \quad b_{\cdot\cdot} = \sum_{i,j=1}^n b_{ij}, \quad D = \sum_{i,j=1}^n a_{ij} b_{ij}.$$

The squared sample distance covariance between $x$ and $y$ is

$$\Omega_n(x,y) = \frac{D}{n^2} - \frac{2}{n^3} \sum_{i=1}^n a_{i\cdot} b_{i\cdot} + \frac{a_{\cdot\cdot}\, b_{\cdot\cdot}}{n^4}, \qquad (3)$$

which is similar in form to (1). The squared sample distance correlation is given by

$$\mathcal{R}_n^2(x,y) = \begin{cases} \dfrac{\Omega_n(x,y)}{\sqrt{\Omega_n(x,x)\,\Omega_n(y,y)}}, & \Omega_n(x,x)\,\Omega_n(y,y) > 0, \\ 0, & \text{otherwise}. \end{cases} \qquad (4)$$

From (3), it is easy to see that a brute-force $O(n^2)$ algorithm exists for the sample distance covariance. However, the brute-force implementation cannot handle large datasets. Moreover, the p-value of the distance covariance or correlation is typically calculated by using a permutation test, which makes the computation even more intensive.
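For reference, a brute-force evaluation of (3) can be written in a few lines of MATLAB; the function name bruteDcov and this particular vectorization are our own illustration rather than part of the proposed method.

function covsq = bruteDcov(x,y)
%bruteDcov evaluates the squared sample distance covariance (3) directly
%for column vectors x and y. It forms the full n-by-n distance matrices,
%so it needs O(n^2) time and memory.
n = length(x);
A = abs(bsxfun(@minus, x, x.'));   %a_ij = |x_i - x_j|
B = abs(bsxfun(@minus, y, y.'));   %b_ij = |y_i - y_j|
D = sum(sum(A.*B));                %Frobenius inner product of the distance matrices
a = sum(A,2);                      %row sums a_i.
b = sum(B,2);                      %row sums b_i.
covsq = D/n^2 - 2*(a.'*b)/n^3 + (sum(a)*sum(b))/n^4;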

If we can compute $D$ and all the $a_{i\cdot}$ and $b_{i\cdot}$ for $i = 1, \dots, n$ in $O(n \log n)$ steps, then we can also compute $\Omega_n(x,y)$ in $O(n \log n)$ steps.

In this paper, we consider the case where $X$ and $Y$ are univariate random variables; that is, $p = q = 1$. For the rest of this document we assume that $x_1 \le x_2 \le \dots \le x_n$ (because after an $O(n \log n)$ sort step, we can always reorder the sample so that this holds).

3 Fast Algorithm for Distance Covariance

Define the sign function as

$$\operatorname{sign}(u) = \begin{cases} 1, & u > 0, \\ 0, & u = 0, \\ -1, & u < 0. \end{cases} \qquad (5)$$

For any two real $u$ and $v$ we have

$$|u - v| = (u - v)\operatorname{sign}(u - v) = 2\,(u - v)\,\mathbb{1}[u > v] - (u - v), \qquad (6)$$

where $\mathbb{1}[\cdot]$ denotes the indicator function. We use (6) extensively in the rest of the paper.

3.1 Fast computation of the $a_{i\cdot}$ and $b_{i\cdot}$

Define $s_i = \sum_{j=1}^i x_j$ for $i = 1, \dots, n$, and note that all the $s_i$ can be computed in $O(n)$ time.

Since $x_1 \le x_2 \le \dots \le x_n$, we have

$$a_{i\cdot} = \sum_{j=1}^n |x_i - x_j| = \sum_{j \le i} (x_i - x_j) + \sum_{j > i} (x_j - x_i) = (2i - n)\,x_i + s_n - 2\,s_i. \qquad (7)$$

So all the $a_{i\cdot}$ can be computed in $O(n)$ time.

We can use an $O(n \log n)$ sorting algorithm to determine a permutation $\sigma$ of $(1, \dots, n)$ such that $y_{\sigma(1)} \le y_{\sigma(2)} \le \dots \le y_{\sigma(n)}$. Therefore, as in (7), all the $b_{i\cdot}$ can be computed in $O(n)$ time after $y$ has been sorted.
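As a sketch, (7) translates into a single cumulative sum over the sorted values; the variable names are illustrative and x is assumed to be a column vector.

%Row sums a_i. of the distance matrix of x, computed in O(n) after sorting x.
x = sort(x);                              %ensure x_1 <= x_2 <= ... <= x_n
n = length(x);
si = cumsum(x);                           %s_i = x_1 + ... + x_i
a = (2*(1:n).' - n).*x + si(n) - 2*si;    %a_i. = (2i - n) x_i + s_n - 2 s_i, as in (7)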

3.2 Fast computation of D

In this subsection, we describe an algorithm for computing $D = \sum_{i,j} a_{ij} b_{ij}$. First, we have

$$D = \sum_{i,j=1}^n |x_i - x_j|\,|y_i - y_j| = 2 \sum_{i<j} (x_j - x_i)\,|y_j - y_i|, \qquad (8)$$

where the second equality uses $x_1 \le x_2 \le \dots \le x_n$. In (8), note that by (6) we have $(x_j - x_i)\,|y_j - y_i| = (x_j - x_i)(y_j - y_i)$ if $y_j > y_i$ and $(x_j - x_i)\,|y_j - y_i| = -(x_j - x_i)(y_j - y_i)$ otherwise (terms with $y_i = y_j$ vanish), thus showing that

$$D = 4 \sum_{\substack{i<j \\ y_i<y_j}} (x_j - x_i)(y_j - y_i) \;-\; 2 \sum_{i<j} (x_j - x_i)(y_j - y_i). \qquad (9)$$

Let $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$. Now the second term in (9) is

$$2 \sum_{i<j} (x_j - x_i)(y_j - y_i) = \sum_{i,j=1}^n (x_j - x_i)(y_j - y_i) = 2n \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}).$$

Therefore it can be computed in $O(n)$ steps.

Define $u_i = x_i y_i$ for $i = 1, \dots, n$. The first term in (9) can be easily expressed in terms of $x$, $y$, and $u$. Note that

$$(x_i - x_j)(y_i - y_j) = u_i - x_i y_j - y_i x_j + u_j. \qquad (10)$$

Thus, writing the sum over pairs $i < j$ with $y_i < y_j$ as a sum over pairs $j < i$ with $y_j < y_i$, the first term in (9) is expanded into a sum of four terms, each of which is of the form

$$\sum_{i=1}^n c_i \sum_{\substack{j<i \\ y_j<y_i}} w_j, \qquad (11)$$

where each of $c = (c_1, \dots, c_n)$ and $w = (w_1, \dots, w_n)$ is one of the vectors $\mathbf{1} = (1, \dots, 1)$, $x$, $y$, or $u$.

For any weight vector $w$, define $d_i = \sum_{j<i,\, y_j<y_i} w_j$ for $i = 1, \dots, n$. If it can be shown that all the $d_i$ can be computed in $O(n \log n)$ time, then by (9), (10), and (11) the sample distance covariance can be computed in $O(n \log n)$ time. We will show this in the next subsection. The preceding arguments lead to the following theorem.

Theorem 3.1

For any pair of real-valued univariate variables with samples $(x_1, \dots, x_n)$ and $(y_1, \dots, y_n)$, the sample distance covariance $\Omega_n(x, y)$ can be computed in $O(n \log n)$ time.
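To make the bookkeeping concrete, the following MATLAB sketch assembles $D$ from sums of the form (11); the helper weightedInversionSums is hypothetical and stands in for Algorithm 1 of the next subsection, returning the four vectors $d_i(\mathbf{1})$, $d_i(x)$, $d_i(y)$, and $d_i(u)$ for $x$ already sorted in increasing order and $y$ reordered accordingly.

%Sketch: assemble D from the weighted sums d_i(w) = sum of w_j over {j < i : y_j < y_i}.
%weightedInversionSums is a hypothetical helper standing in for Algorithm 1.
n = length(x);
[d1, dx, dy, du] = weightedInversionSums(y, [ones(n,1) x y x.*y]);  %hypothetical helper
covterm = n*(x - mean(x)).'*(y - mean(y));    %equals half of the second term in (9)
c1 = (x.*y).'*d1;                             %sum_i u_i d_i(1)
c2 = sum(du);                                 %sum_i d_i(u)
c3 = y.'*dx;                                  %sum_i y_i d_i(x)
c4 = x.'*dy;                                  %sum_i x_i d_i(y)
D = 4*((c1 + c2) - (c3 + c4)) - 2*covterm;    %combine the four sums as in (9) and (10)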

3.3 Fast computation of $d_i = \sum_{j<i,\, y_j<y_i} w_j$

We are given a series $y_1, \dots, y_n$ along with weights $w_1, \dots, w_n$. Our objective is to compute in $O(n \log n)$ steps the quantities

$$d_i = \sum_{\substack{j<i \\ y_j<y_i}} w_j \quad \text{for } i = 1, \dots, n.$$

It is well known that the number of inversions in a permutation can be obtained by a merge sort (Ginat, 2004). We use a similar strategy to compute the $d_i$ while performing a merge sort on $y$ to sort the $y_i$ in decreasing order.

Merge sort works by successively merging sorted subarrays until the final array is sorted. Assume that we also keep an auxiliary array for storing the original indices of each element of the array, another auxiliary array for storing the partially computed $d_i$, and a third one for storing the partial sums of the weights in the current order.

A merge sort makes $O(\log n)$ passes over the data. At the beginning of the $k$th pass, the array consists of contiguous subarrays of length at most $2^{k-1}$, each of which is sorted in decreasing order. We merge pairs of consecutive subarrays about $n/2^k$ times.

Let $L = (y_{i_1}, \dots, y_{i_p})$ and $R = (y_{j_1}, \dots, y_{j_q})$ be two such consecutive subarrays, with $y_{i_1} \ge \dots \ge y_{i_p}$ and $y_{j_1} \ge \dots \ge y_{j_q}$. We also have $i_a < j_b$ for all meaningful $a$ and $b$ (here $i_a$ and $j_b$ refer to the original indices of these elements). During the merge step, if we notice that $y_{i_a} < y_{j_b}$ for some $a$ and $b$, then $y_{i_a}, y_{i_{a+1}}, \dots, y_{i_p}$ are all the terms of $L$ that are less than $y_{j_b}$, and we increment $d_{j_b}$ by $w_{i_a} + w_{i_{a+1}} + \dots + w_{i_p}$. Note that if we also store running sums of the weights as we output the results, the quantity $w_{i_a} + \dots + w_{i_p}$ can be computed using just one difference.

The extra computation of maintaining the additional auxiliary arrays does not increase the order of computation. The detailed algorithm is presented in Algorithm 1. For a better understanding of the proposed algorithm, we also include MATLAB code in the Appendix.

Input: $y = (y_1, \dots, y_n)$ and $w = (w_1, \dots, w_n)$
    Output: $d = (d_1, \dots, d_n)$, where $d_i = \sum_{j<i,\, y_j<y_i} w_j$

1:  Initialize idx to be a 2 x n matrix whose first row is (1, 2, ..., n) and whose second row is (0, 0, ..., 0)
2:  d <- (0, ..., 0); i <- 1; r <- 1; s <- 2
3:  while i < n do
4:     gap <- 2i; k <- 0
5:     csum(0) <- 0; csum(m) <- csum(m-1) + w_idx(r,m) for m = 1, ..., n
6:     j <- 1
7:     while j <= n do
8:        st1 <- j; e1 <- min(st1 + i - 1, n)
9:        st2 <- j + i; e2 <- min(st2 + i - 1, n)
10:       while st1 <= e1 and st2 <= e2 do
11:          k <- k + 1
12:          idx1 <- idx(r, st1); idx2 <- idx(r, st2)
13:          if y_idx1 >= y_idx2 then
14:             idx(s, k) <- idx1; st1 <- st1 + 1
15:          else
16:             idx(s, k) <- idx2; st2 <- st2 + 1
17:             d_idx2 <- d_idx2 + csum(e1) - csum(st1 - 1)
18:          end if
19:       end while
20:       if st1 <= e1 then
21:          copy idx(r, st1), ..., idx(r, e1) to idx(s, k+1), ..., idx(s, k + e1 - st1 + 1); k <- k + e1 - st1 + 1
22:       else if st2 <= e2 then
23:          copy idx(r, st2), ..., idx(r, e2) to idx(s, k+1), ..., idx(s, k + e2 - st2 + 1); k <- k + e2 - st2 + 1
24:       end if
25:       j <- j + gap
26:    end while
27:    i <- gap
28:    exchange r and s
29: end while
30: return d
Algorithm 1 Fast computation of $d_i = \sum_{j<i,\, y_j<y_i} w_j$
         index    1    2    3    4    5    6    7    8
         y        3    5    7    3    8    4    6    7
Iter 0   t        1    5    3    2    4    6    7    5
         cusumT   1    6    9   11   15   21   28   33
         d        0    0    0    0    0    0    0    0

         index    2    1    3    4    5    6    8    7
         y        5    3    7    3    8    4    7    6
Iter 1   t        5    1    3    2    4    6    5    7
         cusumT   5    6    9   11   15   21   26   33
         d        1    0    0    0    0    0    7    0

         index    3    2    1    4    5    8    7    6
         y        7    5    3    3    8    7    6    4
Iter 2   t        3    5    1    2    4    5    7    6
         cusumT   3    8    9   11   15   20   27   33
         d        6    1    0    0    0   13    6    0

         index    5    3    8    7    2    6    1    4
Iter 3   y        8    7    7    6    5    4    3    3
         d       11    6   21   14    1    3    0    0
Table 1: Illustration of the fast algorithm for the series $y = (3, 5, 7, 3, 8, 4, 6, 7)$ and weights $w = (1, 5, 3, 2, 4, 6, 7, 5)$. The index row denotes the original order of the series, the t row shows the weights in the current order, and the cusumT row denotes the cumulative sum of t. The proposed algorithm finishes in three iterations. Final results: $d = (0, 1, 6, 0, 11, 3, 14, 21)$.

We illustrate the proposed algorithm by using a simple series, $y = (3, 5, 7, 3, 8, 4, 6, 7)$, and weights $w = (1, 5, 3, 2, 4, 6, 7, 5)$. The iteration history is presented in Table 1. The index row in Table 1 denotes the original order of the series, the t row shows the weights in the current order, and the cusumT row denotes the cumulative sum of t. The algorithm finishes in three iterations and outputs the results $d = (0, 1, 6, 0, 11, 3, 14, 21)$. Consider the computation of $d_8$ and $d_7$, for example. At iteration 1, we merge the pair $(y_7, y_8) = (6, 7)$; because $6 < 7$, we exchange $y_7$ and $y_8$ and set $d_8 = w_7 = 7$. At iteration 2, we merge $(8, 4)$ and $(7, 6)$; because $4 < 7$ we increment $d_8$ by $w_6 = 6$, and because $4 < 6$ we increment $d_7$ by $w_6 = 6$. At the final merge, we note that $5 < 7$, so we increment $d_8$ by $w_2 + w_1 + w_4 = 8$, where the sum $8$ is calculated from the difference of two cumulative sums of the weights; this gives $d_8 = 7 + 6 + 8 = 21$. Similarly, for $d_7$ we have $d_7 = 6 + 8 = 14$.
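The final values in Table 1 can be verified directly from the definition of $d_i$; the short script below is our own check rather than part of the proposed algorithm.

%Direct O(n^2) check of the Table 1 example: d_i = sum of w_j over {j < i : y_j < y_i}.
y = [3 5 7 3 8 4 6 7];
w = [1 5 3 2 4 6 7 5];
n = length(y);
d = zeros(1,n);
for i = 1:n
    for j = 1:i-1
        if y(j) < y(i)
            d(i) = d(i) + w(j);
        end
    end
end
disp(d)   %prints 0 1 6 0 11 3 14 21, matching the final d row of Table 1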

The proposed algorithm has the same order of computational cost as a merge sort. Denote the computational cost for an input of size $n$ by $T(n)$. We know $T(n) = 2\,T(n/2) + O(n)$. It is trivial to show that the complexity is $O(n \log n)$ by using the master theorem (Cormen et al., 2001). As a byproduct of calculating all the $d_i$, the $y_i$ are sorted. Therefore, calculating all the $b_{i\cdot}$ as in Section 3.1 takes only $O(n)$ extra time.
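Written out, the recurrence falls under case 2 of the master theorem:

$$T(n) = 2\,T(n/2) + cn, \qquad f(n) = cn = \Theta\!\left(n^{\log_2 2}\right) \;\Longrightarrow\; T(n) = \Theta\!\left(n^{\log_2 2} \log n\right) = \Theta(n \log n).$$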

  • As discussed previously, the proposed algorithm essentially consists of two sorting steps. First we sort $x$ and calculate the $a_{i\cdot}$ for $i = 1, \dots, n$. Then we sort $y$ and calculate $D$ and all the $b_{i\cdot}$ for $i = 1, \dots, n$. The correctness of the described algorithm follows from the preceding discussion. To verify the correctness of our implementation, we compared its output with that of a simple brute-force implementation and confirmed that the numbers match exactly (a check of this kind is sketched after this list).

    We also note another factor that impacts performance. A non-recursive merge sort makes $O(\log n)$ passes over the data, and in the $k$th pass makes about $n/2^k$ merges of two subarrays of size $2^{k-1}$. If $n$ is large, then for small $k$ we end up merging a large number of small subarrays, which has a large overhead. We have observed that the speed increases by a noticeable factor if we replace these initial merges by a single insertion sort step; that is, we divide the input array into groups of a fixed small length, sort each of them using insertion sort, store the intermediate values, and then continue with the merge passes starting from that group length. A MATLAB implementation is provided as supplemental material to this paper.
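A check of the kind mentioned above can be run as follows; here fastDcov is the function from the Appendix, bruteDcov is the brute-force sketch from Section 2, and the sample size and data-generating model are illustrative choices.

%Sanity check of the fast implementation against the brute-force formula (3).
rng(1);
n = 1000;
x = randn(n,1);
y = x.^2 + randn(n,1);
fast = fastDcov(x,y);
slow = bruteDcov(x,y);
fprintf('absolute difference = %g\n', abs(fast - slow));  %should be at machine-precision level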

4 Experiments

In this section, we compare the speed of the proposed fast algorithm with that of the dyadic updating method (Huo and Székely, 2016) and also with the brute-force implementation. We implemented the proposed algorithm and the brute-force method in MATLAB. For the dyadic updating method, we use the authors' MATLAB implementation, in which the key step is implemented in C. Therefore, this comparison strongly favors the dyadic updating method because most of its calculations are done in C instead of MATLAB, according to the MATLAB code profiler. All simulations were run on a PC with an Intel® Xeon® Gold CPU at 2.40GHz and 16GB of memory, running MATLAB version 9.0.0.341360 (R2016a).

The data are generated from a simple quadratic model, $y_i = x_i^2 + \epsilon_i$ for $i = 1, \dots, n$, where the $x_i$ and $\epsilon_i$ are i.i.d. standard normal. For each sample size $n$, the average running time is calculated based on 10 generated samples. The speed comparison results are presented in Table 2. The Merge Sort 2 column in the table corresponds to the proposed algorithm with the initial merge steps replaced by insertion sort. For moderately sized data, the dyadic update method is about 25 times faster than the brute-force implementation, whereas our proposed algorithm is about 200 times faster. Because the brute-force implementation requires $O(n^2)$ space, no results are available for it at the largest sample sizes. For large data, the proposed method is 7 to 10 times faster than dyadic updating. Greater speed increases are expected if our proposed algorithm is implemented in C. Note that the p-value of the distance covariance is typically obtained by a permutation test, which is computationally intensive. For 1,000 permutations at the largest sample size for which the brute-force method still fits in memory, the brute-force implementation takes roughly 9,500 seconds (about 158 minutes), versus about 48 seconds (0.8 minutes) for the proposed implementation or about 24 seconds (0.4 minutes) for the Merge Sort 2 variant.
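The permutation test mentioned above can be sketched in a few lines with the fast routine; the number of permutations R and the column vectors x and y (from the simulated model) are illustrative choices, not settings reported here.

%Sketch of a permutation-test p-value for distance covariance using fastDcov.
R = 1000;                             %number of permutations (illustrative)
stat = fastDcov(x,y);
count = 0;
for r = 1:R
    yperm = y(randperm(length(y)));   %permuting y breaks any dependence on x
    if fastDcov(x,yperm) >= stat
        count = count + 1;
    end
end
pvalue = (count + 1)/(R + 1);         %permutation p-value with the usual +1 correction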


n        Brute Force        Dyadic Update      Merge Sort         Merge Sort 2
         0.0011 (0.0052)    0.0006 (0.0001)    0.0004 (0.0002)    0.0001 (0.0001)
         0.0011 (0.0003)    0.0027 (0.0013)    0.0011 (0.0004)    0.0004 (0.0001)
         0.0416 (0.0034)    0.0107 (0.0007)    0.0031 (0.0009)    0.0013 (0.0003)
         0.6014 (0.0160)    0.0678 (0.0115)    0.0126 (0.0026)    0.0059 (0.0022)
         9.5096 (0.1134)    0.3585 (0.0247)    0.0479 (0.0054)    0.0241 (0.0036)
         -                  1.7816 (0.0757)    0.2511 (0.0190)    0.1624 (0.0261)
         -                  9.2188 (0.2154)    1.1844 (0.0552)    0.8258 (0.0625)
         -                  45.302 (0.4065)    5.4248 (0.1077)    4.0098 (0.1737)
         -                  219.04 (2.7388)    25.345 (0.4245)    19.8976 (0.4665)
Table 2: Average time in seconds over 10 replications. The numbers in parentheses are the estimated standard deviations of the running time. No results are available for the brute-force implementation at the largest sample sizes because the required memory exceeds the available memory. The Merge Sort 2 column corresponds to the proposed algorithm with the initial merge steps replaced by insertion sort.

5 Conclusions

In this paper, we presented an $O(n \log n)$ algorithm for calculating the sample distance covariance for univariate variables. For multivariate random variables, a random projection approach can be adopted (Huang and Huo, 2017), which depends on the calculation of distance covariance for univariate variables. The proposed algorithm is intuitive and simple to implement. Empirical results show that it outperforms existing methods in the literature. Our algorithm will speed up any computational technique that depends on the calculation of univariate distance covariance, for example, feature screening (Li et al., 2012).

The proposed fast algorithm provides a tool for scientists to explore complicated dependence structures using distance covariance in larger datasets than was previously possible.

References

  • Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. (2001). Introduction to Algorithms, second edition.
  • Ginat, D. (2004). Do senior CS students capitalize on recursion? SIGCSE Bulletin, 36(3):82–86.
  • Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In International Conference on Algorithmic Learning Theory, pages 63–77. Springer.
  • Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., and Smola, A. J. (2008). A kernel statistical test of independence. In Advances in Neural Information Processing Systems, pages 585–592.
  • Guo, X., Zhang, Y., Hu, W., Tan, H., and Wang, X. (2014). Inferring nonlinear gene regulatory networks from gene expression data based on distance correlation. PLoS ONE, 9(2):e87446.
  • Huang, C. and Huo, X. (2017). A statistically and numerically efficient independence test based on random projections and distance covariance. arXiv preprint arXiv:1701.06054.
  • Huo, X. and Székely, G. J. (2016). Fast computing for distance covariance. Technometrics, 58(4):435–447.
  • Knuth, D. E. (1997). The Art of Computer Programming: Sorting and Searching, volume 3. Pearson Education.
  • Li, R., Zhong, W., and Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107(499):1129–1139.
  • Pfister, N., Bühlmann, P., Schölkopf, B., and Peters, J. (2018). Kernel-based tests for joint independence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):5–31.
  • Sejdinovic, D., Sriperumbudur, B., Gretton, A., and Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, pages 2263–2291.
  • Sferra, G., Fratini, F., Ponzi, M., and Pizzi, E. (2017). Phylo_dCor: distance correlation as a novel metric for phylogenetic profiling. BMC Bioinformatics, 18(1):396.
  • Székely, G. J. and Rizzo, M. L. (2009). Brownian distance covariance. The Annals of Applied Statistics, pages 1236–1265.
  • Székely, G. J., Rizzo, M. L., Bakirov, N. K., et al. (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794.
  • Zhang, Q., Filippi, S., Gretton, A., and Sejdinovic, D. (2018). Large-scale kernel methods for independence testing. Statistics and Computing, 28(1):113–130.
  • Zhou, Z. (2012). Measuring nonlinear dependence in time-series, a distance correlation approach. Journal of Time Series Analysis, 33(3):438–457.

MATLAB Code for Fast Distance Covariance


function covsq = fastDcov(x,y)
%fastDcov computes the squared sample distance covariance
%between column vectors x and y.

%Sort the data by x. This does not change the answer.
n = length(x);
[x, Index] = sort(x);
y = y(Index);

%a_x is the vector of row sums of the distance matrix of x.
si = cumsum(x);
s = si(n);
a_x = (-(n-2):2:n).'.*x + (s - 2*si);

%Compute the Frobenius inner product between the
%distance matrices while finding the indexes corresponding
%to sorted y using merge sort.

%Weight vectors.
v = [x y x.*y];
nw = size(v,2);

%The columns of idx are buffers to store sort indices and the output buffer
%of merge sort; we alternate between them and avoid unnecessary copying.
idx = zeros(n,2);
idx(:,1) = 1:n;

%iv1, iv2, iv3, iv4 are used to store sums of weights.
%On output, with indices referring to the x-sorted order,
%  iv1(i) = #{j < i : y(j) < y(i)},
%  iv2(i) = sum of x(j)      over {j < i : y(j) < y(i)},
%  iv3(i) = sum of y(j)      over {j < i : y(j) < y(i)},
%  iv4(i) = sum of x(j)*y(j) over {j < i : y(j) < y(i)}.
iv1 = zeros(n,1); iv2 = zeros(n,1); iv3 = zeros(n,1); iv4 = zeros(n,1);

%The merge sort loop.
i = 1; r = 1; s = 2;
while i < n
    gap = 2*i;
    k = 0;
    idx_r = idx(:,r);
    csumv = [zeros(1,nw); cumsum(v(idx_r,:))];
    for j = 1:gap:n
        st1 = j; e1 = min(st1+i-1, n);
        st2 = j+i; e2 = min(st2+i-1, n);
        while (st1 <= e1) && (st2 <= e2)
            k = k+1;
            idx1 = idx_r(st1);
            idx2 = idx_r(st2);
            if y(idx1) >= y(idx2)
                idx(k,s) = idx1;
                st1 = st1+1;
            else
                idx(k,s) = idx2;
                st2 = st2+1;
                iv1(idx2,1) = iv1(idx2) + e1 - st1 + 1;
                iv2(idx2) = iv2(idx2) + (csumv(e1+1,1) - csumv(st1,1));
                iv3(idx2) = iv3(idx2) + (csumv(e1+1,2) - csumv(st1,2));
                iv4(idx2) = iv4(idx2) + (csumv(e1+1,3) - csumv(st1,3));
            end
        end
        if st1 <= e1
            kf = k + e1 - st1 + 1;
            idx((k+1):kf,s) = idx_r(st1:e1,:);
            k = kf;
        elseif st2 <= e2
            kf = k + e2 - st2 + 1;
            idx((k+1):kf,s) = idx_r(st2:e2,:);
            k = kf;
        end
    end
    i = gap;
    r = 3-r; s = 3-s;
end

%d is the Frobenius inner product of the distance matrices.
covterm = n*(x - mean(x)).'*(y - mean(y));
c1 = iv1.'*v(:,3);
c2 = sum(iv4);
c3 = iv2.'*y;
c4 = iv3.'*x;
d = 4*((c1+c2) - (c3+c4)) - 2*covterm;

%b_y is the vector of row sums of the distance matrix of y.
ySorted = y(idx(n:-1:1,r));
si = cumsum(ySorted);
s = si(n);
b_y = zeros(n,1);
b_y(idx(n:-1:1,r)) = (-(n-2):2:n).'.*ySorted + (s - 2*si);

%covsq equals the square of the sample distance covariance
%between x and y.
nsq = n*n;
ncb = nsq*n;
nq = ncb*n;
term1 = d/nsq;
term2 = 2*(a_x.'*b_y)/ncb;
term3 = sum(a_x)*sum(b_y)/nq;
covsq = (term1 + term3) - term2;