Graph Independence Testing

06/09/2019
by   Junhao Xiong, et al.
0

Identifying statistically significant dependency between variables is a key step in scientific discoveries. Many recent methods, such as distance and kernel tests, have been proposed for valid and consistent independence testing and can be applied to data in Euclidean and non-Euclidean spaces. However, in those works, n pairs of points in X×Y are observed. Here, we consider the setting where a pair of n × n graphs are observed, and the corresponding adjacency matrices are treated as kernel matrices. Under a ρ-correlated stochastic block model, we demonstrate that a naïve test (permutation and Pearson's) for a conditional dependency graph model is invalid. Instead, we propose a block-permutation procedure. We prove that our procedure is valid and consistent -- even when the two graphs have different marginal distributions, are weighted or unweighted, and the latent vertex assignments are unknown -- and provide sufficient conditions for the tests to estimate ρ. Simulations corroborate these results on both binary and weighted graphs. Applying these tests to the whole-organism, single-cell-resolution structural connectomes of C. elegans, we identify strong statistical dependency between the chemical synapse connectome and the gap junction connectome.

READ FULL TEXT VIEW PDF

page 9

page 13

12/22/2021

Local permutation tests for conditional independence

In this paper, we investigate local permutation tests for testing condit...
04/19/2018

The Hardness of Conditional Independence Testing and the Generalised Covariance Measure

It is a common saying that testing for conditional independence, i.e., t...
12/27/2019

The Chi-Square Test of Distance Correlation

Distance correlation has gained much recent attention in the statistics ...
05/13/2020

Consistency of permutation tests for HSIC and dHSIC

The Hilbert–Schmidt Independence Criterion (HSIC) is a popular measure o...
08/03/2020

Two-sample Testing on Latent Distance Graphs With Unknown Link Functions

We propose a valid and consistent test for the hypothesis that two laten...
03/26/2020

A Blind Permutation Similarity Algorithm

This paper introduces a polynomial blind algorithm that determines when ...
12/28/2018

Characterizing Watermark Numbers encoded as Reducible Permutation Graphs against Malicious Attacks

In the domain of software watermarking, we have proposed several graph t...

1 Introduction

Identifying statistically significant dependency between two or more sets of attributes serves as a key first check before further investigations are warranted. The space of possible attributes and their statistical dependencies is truly enormous, ranging from scalar values with relatively simple linear relationship to data with high dimensions, complex structures and nonlinear relationships. There are many traditional and recent new procedures for testing dependency on linear, nonlinear, low-dimensional Euclidean data with satisfactory performance, e.g., [1, 2, 3, 4, 5, 6, 7, 8], while detecting relationship between data of high-dimensional or complex structure remains difficult and less well-understood. The large and still growing amount of structured data motivate a development of methods for those data.

In particular, graphs are emerging as a prevalent form of data representation in many scientific areas, ranging from linguistics to neuroscience to sociology. Graphs have complex structures and relationships. One type of question about graphs is to ask whether a given pair of graphs are statistically dependent. For example, one could ask "to what extent are the connectomes (brain graphs) of the left and right hemisphere of a species correlated with each other?", or "is the connectome constructed on chemical synapses statistically dependent on the connectome constructed on gap junctions? If so, how strong is such correlation?". The answers to these questions would explain the presence or absence of relationships between the objects of interest.

We investigate a popular graph model, the -correlated stochastic block model (-SBM), and propose a statistical test for testing (conditional) dependence between two sample graphs from -SBM. The test utilizes the adjacency matrices and the block permutation procedure. We prove the validity of the resulting procedure, and demonstrate its effectiveness both simulated graphs and real brain graphs (connectomes).

2 Preliminary

Correlated Bernoulli Graphs

Let

be a graph-valued random variable with sample

. Each graph is defined by a set of vertices, , where , and a set of edges between pairs of vertices . Let be an adjacency matrix-valued random variable taking values , identifying which pairs of vertices share edges. Here, the graph is undirected, so the corresponding adjacency matrix is symmetric.

The -correlated Erdos-Renyi Model was proposed as an intuitive way to capture correlations between graphs [9]

. Erdos-Renyi Model (ER) is a random graph in which each edge is sampled i.i.d (independent and identically distributed) from a Bernoulli distribution with some parameter

. Let be a random variable denoting whether there is an edge from vertex to vertex in graph . are called -correlated ER() if and are ER graphs with parameter and the two random variables and have Pearson’s correlation for all . In fact, one can generalize the notion of

-correlated Bernoulli Graphs by allowing for different marginal probabilities for the two graphs. Namely,

are called -correlated ER() if is a ER(), is a ER() and are -correlated.

To sample a pair of -correlated ER, consider random variable . Note that can be written as following by the definition of Pearson’s correlation:

. Given this equation and the marginal probabilities of and , one can solve for the joint probability for each value of and and get the following sampling procedure: First, realize . Then, if , independently realize:

if , independently realize:

Note that this is only valid when:

To generate a pair of -correlated ER(), one can simply follow this procedure for each edge independently.

The Stochastic Block Model (SBM) is a generalization of ER. SBM is parametrized by the block probability matrix , where is the number of blocks [10]. Each community is labeled . The entry gives the probability of an edge from a node in community to a node in community , for all . The community assignment of each node is given by the community membership function . For all node , would mean that node is a member of block . ER is a SBM with , so the block probability matrix . A block refers to a submatrix in the adjacency matrix formed by the edges connecting every node in community to every node in community . Every edge within a block is sampled i.i.d from .

One can generalize -ER to -correlated SBM similar to how one generalize an ER to a SBM. Assuming the same community assignment (but possibly different block probability matrix), to generate a pair of -correlated SBM, one can follow the procedure of generating -correlated ER for each block separately.

A further generalization of SBM is the Independent Edge Graph (IE). An IE is parametrize by the edge probability matrix , where is the number of vertices. The probability of an edge from vertex to vertex is given by . An ER is an IE with for all , and an SBM is an IE with . and are called -correlated Bernoulli graphs if are both IE and the random variables and have Pearson’s correlation of for all

. Under the setting of correlated Bernoulli Graphs, the null hypothesis of the graph independence test is

, and the alternative is .

Correlated Gaussian Graphs

Correlated Bernoulli graphs are binary by definition. To sample correlated weighted graphs, we leverage the joint normal distribution and introduce

Correlated Gaussian Graphs. and are called -correlated Gaussian ER() if every pair of edges are sampled from a joint normal distribution with mean

, variance

and covariance . One can generalize -correlated Gaussian ER() to have different marginal distributions. Namely, and are called -correlated Gaussian ER(), where if for all . One can further generalize -correlated Gaussian ER to -correlated Gaussian SBM by following the procedure of generating -correlated Gaussian ER for each block separately. In the rest of this paper, we refer -correlated Bernoulli graphs and -correlated Gaussian graphs together as -correlated graphs.

Conditional independence testing

Conditional independence testing, also referred to as partial testing, is the testing of independence between conditional distributions [11]. Conditional independence is important if one is interested in identifying correlation given some known structure in the data. As a concrete example, given the connectomes of two individuals, we might observe the same block structure in both graphs because the brain of each individual is segmented into two hemispheres. The correlation due to such inherent structure might be interesting, but often we are more interested in any correlation that might exist in addition to the structural correlation, rendering partial testing an important problem.

A conditional independence testing problem can be formulated under the setting of -correlated graphs. Let and be two -correlated SBMs, and the corresponding adjacency matrices and , jointly sampled from a distribution . Since for -correlated graphs, both graphs have the same set of uniquely labeled vertices, all variability is in the adjacency matrix, that is , and . For notational simplicity, we will use the same notation to refer to the graphs and the adjacency matrices for the rest of this paper. Furthermore, let and be the marginal distributions of the adjacency matrices conditioning on the community assignments. To determine whether are independent, given the community assignments, the following hypothesis is tested: and .

Given a pair of -correlated SBMs, without conditioning on the community assignments, even when , both graphs still have the same block structures, which leads to some correlation between them. But when conditioning on the community assignments, any remaining correlation must be due to the correlation between edges. Therefore, in the paradigm of conditional testing, the null distribution asserts that .

Distance correlation (DCorr)

Distance correlation is a generalization of the classical Pearson correlation, to random vectors with arbitrary dimensions or in metric spaces

[3]. Consider semimetric spaces and with distance functions and . Consider random variables and . The distance covariance function for any random variables and is defined as the positive square root of the following ( and are independent copies of and respectively) [12]:

is zero if and only if and are independent and is non-zero otherwise. Usually, it is assumed that and and the metric is Euclidean distance, but the setting considered herein is slightly different. Given the fact that the distance covariance function characterizes whether and are independent, the graph independence testing problem can be described under this formulation.

Let be the set of vertices of graph and , let function denote the community assignment of each vertex in , and let the distances and be functions of the adjacency matrix entries respectively (we introduce the notion of kernel-induced distance explicitly in Section 3) Then consider random variables and . Informally, the two metric spaces both include the same set of vertices with the same community assignments, and use the kernel-induced distance as distance functions. Given the definition above, the more formal formulation of the hypothesis under testing is the following: and .

For notational simplicity, we drop the subscript . The distance covariance function can be normalized to the distance correlation function as:

Given samples jointly sampled from

, an unbiased estimate of

based solely on the sample distance matrices is described in [13]

. This sample test statistic is used for

DCorr in Algorithm 1 in the Appendix.

Multiscale Graph Correlation (MGC)

Multiscale Graph Correlation (MGC) builds upon distance correlation by exploring all local distance correlation and efficiently searching for the optimal scale. The algorithm is described in details in [14]

and it is demonstrated that compared to distance correlation, MGC loses almost no power in monotonic dependencies and achieves better finite sample power on high- dimensional data with non-monotonic relationships. It is shown that many theoretic properties that hold for distance correlation also holds for MGC

[15]. Similar to distance correlation, for a sample test statistic, MGC can take as inputs two sample distances matrices, then output a test statistic indicating the strength of correlation.

3 Methods

Graph Correlation using Induced Metric from Adjacency Matrix

One can view the adjacency matrix of an undirected graph as a similarity or kernel matrix, where the similarity of node and is the weight of the edge between them. A kernel can be converted into a distance metric using the bijective transformation between metric and kernels introduced in [16]. To that end, each adjacency entries is normalized by the maximum within the matrix (which is for unweighted graphs), and the diagonals of the adjacency matrix are tweaked to ensure the transformed distance satisfies the identity property, i.e. the distance of each node to itself is . Eventually, the transformed metric used in this paper is: , where

is the identity matrix and

is the matrix of ones. After computing the distances matrices for both graphs, one can use any proper correlations measures, such as MGC and DCorr that takes distance matrices as inputs. The sample test statistic described in [13] is used for DCorr and the test statistic in [14] is used for MGC.

Pearson Correlation

Alternatively, one can ignore the graph structure entirely and calculate the Pearson correlation using the vectorized adjacency matrices as a measure of their correlations. The test statistic can be expressed as the following ( and denote the overall mean of the adjacency matrices respectively):

Vectorization is necessary since Pearson only operates on 1-dimensional data. Pearson assumes the samples are i.i.d, but this assumption is violated for -correlated Bernoulli graphs in general, except under the special setting of -ER. As an example, in general, for -SBM, each pair of edges is sampled independently, but potentially under different distributions (namely, edges in different blocks are sampled under Bernoulli with different parameters). We investigate how this violation of the i.i.d assumption affects the correlation-based procedure in Section 5. Both Algorithm 1 and 2 in the Appendix are procedures to compute a test statistics for measuring correlation between graphs, and they are referred to as GCorr (graph correlation) in subsequent algorithms.

Computing p-value

Computing p-value requires using a permutation test to estimate the distribution of the test statistic under the null. Under the setting of -correlated graphs, in general, a naive permutation of the row-column pairs of the adjacency matrix would result in an invalid test for -SBM (the power would converge to 1 under the null), because the distribution of the permuted matrices does not approximate the null distribution at . Intuitively, since the block structure is the same in both graphs, one implicitly desires a conditional independence testing, which is not enabled by the usual permutation test procedure.

In general, it is not clear what a valid permutation test would be for -correlated graphs. Such permutation should preserve the inherent graph structure while smearing the edge correlations. Under -SBM, however, since the inherent graph structure is captured completely by the block structure, one can perform a block permutation test (Algorithm 3 in the Appendix) [11]. Namely, given the community assignment of nodes, the edges within each block are permuted, which preserves the block structure and thus is able to approximate the null distribution.

In practice, we usually don’t know the community assignment of each node. Assuming the vertices of both graphs are matched (there is a bijection between the vertices of both graphs), we can use a Joint Random Dot Product Graph (JRDPG) model to estimate the community assignment jointly. JRDPG is a procedure to embed multiple graphs sampled under some joint distribution. It works by finding the adjacency spectral embedding (ASE) of a matrix formed by concatenating the ASE of each of the jointly sampled graphs. If the graphs are sampled under an SBM, one can recover the communities by clustering the embeddings

[17]. The procedure to estimate the community assignment is Algorithm 4, and the procedure to compute p-value is Algorithm 5 (both algorithms are in the Appendix).

In Algorithm 4, one needs to choose the parameters , the number of dimensions of the latent positions, and , the prior estimate of the number of communities.

is chosen via the scree plots of the singular values

[18], while

can be chosen with prior knowledge about the graph structure, or one can select the optimal number of clusters in a Gaussian Mixture Model (GMM) by selecting the clustering with the best Bayesian Information Criterion (BIC).

4 Theoretical Results

To derive the theoretic property of the proposed test, we operate under the setting of -correlated SBM (which can be unweighted -correlated Bernoulli graphs or weighted -correlated Gaussian graphs), as stated in the following assumption:

Assumption 1.

The adjacency matrices are sampled jointly from a -correlated SBM distribution.

First, we show that the test is valid, i.e., it properly controls the type I error. We prove validity by showing that block permutation results in a test statistic that equals the test statistic under the null in distribution. To simplify the proof, we write

as , where are the adjacency matrices used by as kernels respectively. is the same as in the sense that distance correlation is computed by first computing a distance matrix for and respectively, using the kernel-induced distances of and .

Proposition 2.

Under Assumption 1, let be the block permutation procedure in Algorithm 3, and assume and are conditionally independent, that is, they are -correlated SBM (either Bernoulli or Gaussian) with . It follows that , i.e., the block permutation test is a valid test procedure for -correlated SBM.

In conditional testing, the Pearson, DCorr and MGC statistics are no longer under conditional independence. Instead, the test statistics under the null shall converge to some non-zero constants that depends on the actual distributions. Moreover, since the adjacency matrix is not positive semi-definite, the constant could be negative (whereas in case of Euclidean data and Euclidean distance, DCorr is asymptotically non-negative).

Theorem 3.

For any of Pearson, DCorr, MGC statistics, for -ER.

For -SBM with fixed marginals, there exists a unique constant such that if and only if .

Therefore, any of the three sample correlations using block permutation is consistent against all possible alternatives under -SBM.

Note that the theorem holds for either the binary -ER / -SBM from Bernoulli, or the weighted -ER / -SBM from Gaussian.

Theorem 3 is supported by simulation results in Figure 1 and 3, and it is clear that all three correlations coincide with each other, which equals in case of ER and is otherwise a linear function of in case of SBM. The proofs of both Proposition 2 and Theorem 3 are in the Appendix.

5 Simulated Experiments

Test statistics

We corroborate the theory using simulations with -correlated graphs, for which we can sample a pair of graphs while controlling their correlation exactly. For this experiment, we compare the test statistics of Pearson and DCorr with the correlation used to generate different settings of -correlated Bernoulli graphs. We use Algorithm 2 for Pearson and Algorithm 1 for DCorr (both algorithms are in the Appendix). The simulation settings are: (a): -ER, ; (b): -ER, ; (c): -SBM, the block probability matrices of the two graphs , where when , and when ; (d): -SBM, , where when , and when . All the community assignments are given instead of being estimated.

Figure 1 shows that for -ER, both the Pearson and DCorr test statistics estimate perfectly. In particular, it is zero only when . This aligns with Theorem 3. For -SBM, both test statistics are still the same, but they are no longer zero when . This is also expected based on Theorem 3. Intuitively, the test statistics differ from because they capture not only the correlation between pairs of edges, but also the correlation due to the block structure of SBMs. The test statistics of both DCorr and Pearson motivate a two-sided test of conditional independence, which we describe in the next section.

Figure 1: Test statistics on

-correlated Bernoulli graphs. For each setting, the graphs have 100 vertices. Test statistics are computed for 500 replications, the mean is plotted and the error bar is one standard deviation. Simulation settings for each subplot is described in the beginning of Section

5. Each subplot has different ranges for because the minimum and maximum differ for different marginal distributions. This suggests that the different test statistics accurately reflect the correlation structure.

Power

We use a power experiment to check that the test has the following properties: (1) validity: the power converges to below the type I error level under the null; (2) consistency: the power increases to 1 as sample size, i.e. the number of vertices . We compare the power of Pearson, DCorr and MGC on -ER and -SBM, both when the two graphs are sampled from distributions with the same probability matrix and when the distributions have different probability matrices. For MGC, we use Algorithm 1 but substituting the DCorr sample statistic with a MGC sample statistic. For -SBM, we compute the power when the community assignments are given, when the assignments are unknown and estimated, and when the block sizes are different. The algorithm for calculating the power on -correlated SBM is in Algorithm 6 in the Appendix. The simulation settings are the following: left column show , middle column shows , right column shows . For the rows 1-4 are the same as in Section 5; row 5 is the same as row 4, except the community assignments are estimated instead of given; row 6 is the same as row 5, except the block sizes are different: for each , there are 70 percent of nodes in the first community and 30 percent of nodes in the second community. Visualizations of some simulation settings are in Figure 2. All simulation results for -correlated Bernoulli graphs are in Figure 3.

The results show that all the tests using block permutation (Pearson, DCorr and MGC) are valid and consistent, and have similar power under all settings. For comparison, the power of Pearson using the exact analytic p-value instead of block permutation is also computed (the samples are pairs of edges in the vectorized upper triangles of both graphs [19], and the null is rejected if the p-value is less than type I error level ). Without block permutation, the test is invalid for -SBM. The same results also hold for the weighted -correlated Gaussian ER and SBM, shown in Figure 4. For implementation, we use the python package GraSPy for graphs generation and manipulation [20] and mgcpy for various functionalities related to independence testing.

Figure 2: Visualization of different settings of -correlated graphs. All the graphs shown have 100 vertices and 2 communities. The first row is -correlated Bernoulli SBMs with different probability matrices when . The block probability matrices , where when , and when . The second row is the same as the first row, except the block sizes are different: 70 vertices are in the first community and 30 vertices are in the second community. The third row is weighted -correlated Gaussian SBMs when . for the first block, and for the second block. The covariance matrix is , where if and if .
Figure 3: Power experiments using -correlated Bernoulli graphs. Left column show . Middle column shows . Right column shows . For the rows, row 1-4 are the same as described in Section 5; row 5 is the same as 4, except the community assignments are estimated instead of given; row 6 is the same as row 5, except the block sizes are different: for each , there are 70 percent of nodes in the first community and 30 percent of nodes in the second community. Power is computed for 500 Monte Carlo replicates for Pearson, DCorr and MGC with block permutations. Power for Pearson using the exact analytical p-value is computed for 5000 Monte Carlo replicates.
Figure 4: Power experiments using -correlated Gaussian graphs. Left column show . Middle column shows . Right column shows . For the rows, the simulation settings are: (1): -ER, ; (2): -ER, ; (3): -SBM, for the first block, and for the second block; (4): -SBM, for the first block, and for the second block. The covariance matrix for all settings is , where if and if . All the community assignments are given instead of being estimated. Power is computed for 500 Monte Carlo replicates.

6 Real Data Experiments

We consider the application of the graph conditional independence testing procedures on connectomes, also known as "brain graphs". The connectome of nematode Caenorhabditis elegans (C. elegans

) is the only known whole-animal connectome of an organism, including not only neurons but also body cells. It was constructed based on series of electron microscopy, and has been updated and made more complete over the years

[21, 22, 23, 24, 25]. The connectome is constructed on the level of individual synapses, and is constructed for both chemical synapses and gap junctions. The graphs are directed and weighted. The nodes of the graph are individual cells, and the edges represent the strength of synapses from one cell to another. The data provides an invaluable source of information to study the coordination of nervous system within the entirety of an organism.

Given such data, one natural initial question to ask is whether the graph constructed based on chemical synapses and that constructed based on gap junctions are statistically dependent on each other, and if so, how strong the correlation is. To answer this question with the proposed testing procedure, we chose the chemical and gap junction connectome of the hermaphrodite, one of the two sexes of adult C. elegans. However, the graph independence testing procedure cannot be directly applied to the original graphs for two reasons: (1) not all the cells that are present in the chemical synapses connectome are present in the gap junctions connectome, and vice versa; (2) the original graphs are directed. To address (1), since each vertex is labeled with a unique cell name, we take the intersection of the cells in each of the graphs, and ensure the vertices of the two graphs are matched. After taking the intersection, each graph has 448 nodes, which includes all neurons and the cells that the neurons synapse onto. To address (2), the average weight of the two edges between a pair of nodes is used as the edge weight between them, rendering the graphs symmetric and thus undirected. The weighted and unweighted graphs after preprocessing are shown in Figure 5.

Figure 5: Visualization of the C.elegans connectomes. The first row are the connectomes of the chemical synapses and gap junctions of C. elegans represented as adjacency matrices of undirected, weighted graphs. The chemical graph is on the left, the gap junction graph is on the right. The second row is the unweighted version of the adjacency matrices of the chemical and gap junctions graphs (all edge weights larger than zero are set to one)

To derive a p-value from the graph conditional independence test, the number of blocks used in block permutation needs to be set. By default, we pick the optimal that results in a GMM clustering with the lowest BIC in Algorithm 4. The estimated community assignment of the unweighted graphs is shown in Figure 6.

Figure 6: Visualization of the unweighted C.elegans connectomes, with the vertex sorted by the estimated community assignments. The optimal estimated number of blocks is 13.

Since graphs derived from real data do not arise from an SBM, we don’t expect any given to perfectly estimate the community assignment of each node. But given a relatively large number of nodes, one can get a better fit with SBM by increasing while still maintaining the validity of block permutation. Figure 7 shows that for any reasonable chosen , the test detects strong dependency between the connectomes of chemical synapses and gap junctions, meaning that over and above block structure, one could in theory predict the presence of a gap junction given the presence of a chemical synapse, and vice versa.

Figure 7: Test statistics for dependence on the weighted (left) and unweighted (right) C. elegans connectomes. For a given , we compute the test statistic under the null calculated with block permutation using blocks. is chosen for . As , the test statistics under the null approaches the observed test statistic. This is expected since when , effectively each node is in its own block, so block permutation does not alter the graph much, resulting in a null test statistic similar to the observed test statistic. But for any reasonably chosen e.g. , including the optimal identified with BIC, the observed test statistics are much larger than the null test statistics for all three tests, revealing a strong dependency between the two graphs. The null distribution of the test statistics are estimated for 500 replicates. The mean test statistic for each test is plotted and the error bars are for one standard deviation.

7 Conclusion and Future Work

We presented a statistical test for conditional independence between a pair of undirected graphs. This is the first approach that we know of that addresses the question of whether two graphs are conditionally independent. The question of independence is very general and may be of interest in many distinct areas of scientific inquiries. In the field of connectomics, the proposed approach is especially relevant in helping researchers draw new connections among various types of brain imaging data and ask novel questions about the relationships among these data.

The proposed test relies on the theory of Stochastic Block Model (SBM). We note that this poses certain limitation in the effectiveness of the test in settings where the number of nodes is relatively small and the graph can only be fitted poorly given a SBM with small number of blocks, because it would be difficult for block permutation to approximate the null distribution. In such case, if one chooses a larger , one can better approximate the graph with an SBM, but it would require a larger number of nodes for block permutation to approximate the null distribution; on the other hand, if one chooses a smaller , the SBM might not fit the graph well enough for block permutation to approximate the null. With this in mind, we hypothesize that a generalization to Degree-corrected SBM [26] or Random Dot Product Graph [27] might allow better modeling in such settings, if one can design a valid permutation procedure in these graph models. Such generalization of the procedure is a potential future direction of the current work.

Authors, Affiliations, and Acknowledgements


Johns Hopkins University University of Delaware

References and Notes

  • [1] K. Pearson, “Notes on regression and inheritance in the case of two parents,” Proceedings of the Royal Society of London, vol. 58, pp. 240–242, 1895.
  • [2] M. G. Kendall, Rank Correlation Methods.   London: Griffin, 1970.
  • [3] G. J. Székely, M. L. Rizzo, N. K. Bakirov et al., “Measuring and testing dependence by correlation of distances,” The annals of statistics, vol. 35, no. 6, pp. 2769–2794, 2007.
  • [4] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Scholkopf, “Kernel methods for measuring independence,”

    Journal of Machine Learning Research

    , vol. 6, pp. 2075–2129, 2005.
  • [5] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf, “Kernel measures of conditional dependence,” in Advances in neural information processing systems, 2007.
  • [6] A. Gretton and L. Gyorfi, “Consistent nonparametric tests of independence,” Journal of Machine Learning Research, vol. 11, pp. 1391–1423, 2010.
  • [7] R. Heller, Y. Heller, and M. Gorfine, “A consistent multivariate test of association based on ranks of distances,” Biometrika, vol. 100, no. 2, pp. 503–510, 2013.
  • [8] C. Shen, C. E. Priebe, and J. T. Vogelstein, “From distance correlation to multiscale graph correlation,” Journal of the American Statistical Association, 2019.
  • [9] D. E. Fishkind, L. Meng, A. Sun, C. E. Priebe, and V. Lyzinski, “Alignment strength and correlation for graphs,” arXiv preprint arXiv:1808.08502, 2018.
  • [10] P. Holland, K. Laskey, and S. Leinhardt, “Stochastic blockmodels: First steps,” Social Networks, vol. 5, no. 2, pp. 109–137, 1983.
  • [11] T. Hothorn, K. Hornik, M. A. Van De Wiel, A. Zeileis et al., “Implementing a class of permutation pests: the coin package,” 2008.
  • [12] D. Sejdinovic, B. Sriperumbudur, A. Gretton, K. Fukumizu et al., “Equivalence of distance-based and rkhs-based statistics in hypothesis testing,” The Annals of Statistics, vol. 41, no. 5, pp. 2263–2291, 2013.
  • [13]

    G. J. Székely and M. L. Rizzo, “The distance correlation t-test of independence in high dimension,”

    Journal of Multivariate Analysis

    , vol. 117, pp. 193–213, 2013.
  • [14] J. T. Vogelstein, E. W. Bridgeford, Q. Wang, C. E. Priebe, M. Maggioni, and C. Shen, “Discovering and deciphering relationships across disparate data modalities,” eLife, vol. 8, p. e41690, 2019.
  • [15] C. Shen, C. E. Priebe, and J. T. Vogelstein, “From distance correlation to multiscale graph correlation,” Journal of the American Statistical Association, no. just-accepted, pp. 1–39, 2018.
  • [16] C. Shen and J. T. Vogelstein, “The exact equivalence of distance and kernel methods for hypothesis testing,” arXiv preprint arXiv:1806.05514, 2018.
  • [17] D. Sussman, M. Tang, D. Fishkind, and C. Priebe, “A consistent adjacency spectral embedding for stochastic blockmodel graphs,” Journal of the American Statistical Association, vol. 107, no. 499, pp. 1119–1128, 2012.
  • [18] M. Zhu and A. Ghodsi, “Automatic dimensionality selection from the scree plot via the use of profile likelihood,” Computational Statistics & Data Analysis, vol. 51, no. 2, pp. 918–930, 2006.
  • [19] Student, “Probable error of a correlation coefficient,” Biometrika, pp. 302–310, 1908.
  • [20] J. Chung, B. D. Pedigo, E. W. Bridgeford, B. K. Varjavand, and J. T. Vogelstein, “Graspy: Graph statistics in python,” arXiv preprint arXiv:1904.05329, 2019.
  • [21] J. G. White, E. Southgate, J. N. Thomson, and S. Brenner, “The structure of the ventral nerve cord of caenorhabditis elegans,” Philosophical Transactions of the Royal Society of London. B, Biological Sciences, vol. 275, no. 938, pp. 327–348, 1976.
  • [22] J. White, “Neuronal connectivity in caenorhabditis elegans,” Trends in Neurosciences, vol. 8, pp. 277–283, 1985.
  • [23] J. G. White, E. Southgate, J. N. Thomson, and S. Brenner, “The structure of the nervous system of the nematode caenorhabditis elegans,” Philos Trans R Soc Lond B Biol Sci, vol. 314, no. 1165, pp. 1–340, 1986.
  • [24] L. R. Varshney, B. L. Chen, E. Paniagua, D. H. Hall, and D. B. Chklovskii, “Structural properties of the caenorhabditis elegans neuronal network,” PLoS computational biology, vol. 7, no. 2, p. e1001066, 2011.
  • [25] S. J. Cook, T. A. Jarrell, C. Brittin, Y. Wang, A. E. Bloniarz, M. A. Yakovlev, K. C. Q. Nguyen, L. T.-H. Tang, E. A. Bayer, J. S. Duerr, H. Buelow, O. Hobert, D. H. Hall, and S. W. Emmons, “Whole-animal connectomes of both c. elegans sexes,” Nature, in press, 2019.
  • [26] B. Karrer and M. E. J. Newman, “Stochastic blockmodels and community stochastic blockmodels and community structure in networks,” Physical Review E, vol. 83, p. 016107, 2011.
  • [27] S. Young and E. Scheinerman, “Random dot product graph models for social networks,” in Algorithms and Models for the Web-Graph.   Springer Berlin Heidelberg, 2007, pp. 138–149.

8 Appendix

Algorithms

1:  Input: a pair of undirected graphs with vertices represented as adjacency matrices
2:  Output: test statistic
3:  
4:   DCorr
5:  return  
Pseudocode 1 Distance Correlation for Graph Testing
1:  Input: a pair of undirected graphs with vertices represented as adjacency matrices
2:  Output: pearson correlation
3:   vectorize, vectorize
4:   pearson
5:  return  
Pseudocode 2 Pearson Correlation for Graph Testing
1:  Input: an undirected graphs with vertices represented as an adjacency matrix , the community assignment
2:  Output: the permuted graph represented as adjacency matrix
3:   sort-vertex
4:   sort
5:  for each block which is the th block on the row, and th block on the column, and has edges in the upper triangular part of  do
6:     
7:     if  then
8:         permute-on-diag
9:     else
10:         permute-off-diag
11:     end if
12:     
13:  end for
14:   symmetrize
15:  return  
Pseudocode 3 Block Permutation of Graph (block-permute)

There are several things to note about Algorithm 3. (1) Since the adjacency matrix is not necessarily given with the nodes in the same community next to each other, the vertices are first sorted based on the community assignment (referred to as sort-vertex). Given that the vertices of the graphs are exchangeable, the sorted graph and the original graph are equivalent in distribution. (2) Since the graph is undirected, only the upper triangular part of the adjacency matrix needs to be permuted. The upper triangular matrix is then reflected across the diagonal to form a symmetric matrix (this procedure is symmetrize in Algorithm 3). (3) In the implementation, the on-diagonal blocks and off-diagonal blocks are permuted differently. Since the blocks on the diagonal are symmetric, only their upper triangle parts are permuted. This is done by vectorizing the upper triangular matrix, permuting elements of the vector, then reshaping the vector back into the upper triangle (referred to as permute-on-diag). For the off-diagonal blocks, since the submatrix is not necessarily symmetric, the entire submatrix is vectorized. The vector is then permuted and reshaped back to a matrix (referred to as permute-off-diag).

1:  Input: a pair of undirected graphs with vertices represented as two adjacency matrices , the prior estimate of number of communities
2:  Output: the estimated community assignment
3:  For , let and be the matrices of the

leading eigenvectors of

and the diagonal matrix of the corresponding leading eigenvectors respectively.
4:  Let be the adjacency spectral embedding of :
5:  Let matrix be
6:

  Compute the singular value decomposition of

, and let be the matrix of the leading left singular vectors.
7:  Cluster the -dimensional columns of into clusters using the Gaussian Mixture Model. Let be the cluster assignments.
8:  return  
Pseudocode 4 Community Assignment Estimation of Graph (block-estimation)
1:  Input: a pair of undirected graphs with vertices represented as two adjacency matrices , prior estimate of the number of communities , number of replicates of the permutation test
2:  Output: a p-value
3:   GCorr
4:   block-estimation
5:   sort-vertex
6:  for  in  do
7:      block-permute
8:      GCorr
9:  end for
10:  if  then
11:     
12:  else
13:     
14:  end if
15:  return  
Pseudocode 5 P-value of Graph Correlation Testing
1:  Input: the correlation of the hypothesis under testing ; sampling function -correlated SBM to generate a pair of graphs with correlation , where the first graph has probability matrix , the second graph has probability matrix , the function that assigns each node to a community and the number of vertices ; prior estimate of the number of communities , number of monte carlo replication ; the type I error level
2:  Output: the power
3:  for  do
4:      -correlated-SBM()
5:      GCorr
6:      block-estimation
7:      block-permute
8:      sort-vertex
9:      GCorr(
10:  end for
11:  
12:  
13:   or
14:  return  
Pseudocode 6 Power of Graph Correlation Testing on -correlated SBM

In Algorithm 6, returns the percentile of the array .

Mathematical proofs

Proof of Proposition 2

Proof.

We need to show that and are conditionally independent. Let be the edge after permutation of edge , they shall lie in the same block. Note that in SBM, edges within a block are sampled i.i.d., namely, if , where is the block membership function.

Given that and are conditionally independent, we know for all within same block. And since under the model of -correlated SBMs, every pair of edges within a block is sampled i.i.d, we have where . Since permutes within each block, we have for all . Therefore, we have for all . Hence, and are conditionally independent with same marginals as before, thus . ∎

Proof of Theorem 3

Proof.

We shall first prove it on binary graphs, and show that under -ER for population Pearson, DCorr and MGC. For Pearson correlation, it is clear that by basic probability on Bernoulli random variables. For vertex set with adjacency matrix and vertex set with adjacency matrix , we denote edges between vertices in as and , and edges between vertices in as and . For population distance covariance denoted as , we have

(1)

where the second line follows because under -ER or -SBM. Therefore DCov is the same as Pearson covariance, and similarly distance variance is same as Pearson Variance, such that DCorr is the same as Pearson correlation. By [8], the population MGC statistic is the same as population DCorr under a linear relationship, so the same holds for MGC as well. As the sample correlations all converge to the respective population, all three sample correlations using the adjacency kernel are consistent estimators of the under -ER.

The above equivalence among Pearson, DCorr, and MGC holds for -SBM. Thus it suffices to only investigate the numerator and denominator of DCorr:

(2)

which is a weighted average within each block, and is thus linear with respect to as controls the covariance within each block. Similarly

where both variance terms are fixed constants for given marginals. Therefore for any -SBM with given marginals, is a linear function of . Note that at , is not necessarily due to the block structure. However, when using block permutation test, all three sample correlations are consistent because there is a unique constant such that if and only if .

Finally, the above arguments directly extend to the weighted Gaussian graphs: For -correlated Gaussian ER, by design the Pearson correlation must equal ; then Equation 8 still holds since the adjacency matrix is generated the same way other than the weight no longer being binary, thus DCorr and MGC still equal Pearson. For -correlated Gaussian SBM, DCov in Equation 2 is also linear with respect to (and still a weighted average of each block), and the variance terms are still fixed constants for given marginals. ∎