Information Recovery in Shuffled Graphs via Graph Matching

05/08/2016, by Vince Lyzinski, et al.

While many multiple graph inference methodologies operate under the implicit assumption that an explicit vertex correspondence is known across the vertex sets of the graphs, in practice these correspondences may only be partially or errorfully known. Herein, we provide an information theoretic foundation for understanding the practical impact that errorfully observed vertex correspondences can have on subsequent inference, and the capacity of graph matching methods to recover the lost vertex alignment and inferential performance. Working in the correlated stochastic blockmodel setting, we establish a duality between the loss of mutual information due to an errorfully observed vertex correspondence and the ability of graph matching algorithms to recover the true correspondence across graphs. In the process, we establish a phase transition for graph matchability in terms of the correlation across graphs, and we conjecture the analogous phase transition for the relative information loss due to shuffling vertex labels. We demonstrate the practical effect that graph shuffling---and matching---can have on subsequent inference, with examples from two sample graph hypothesis testing and joint spectral graph clustering.


1 Introduction

Graphs are an increasingly popular data modality in scientific research and statistical inference, with diverse applications in connectomics [bullmore2009complex], social network analysis [carrington2005models], and pattern recognition [kandel2007applied], to name a few. Many joint graph inference methodologies (see, for example, [MT2, gray2012magnetic, bullmore2009complex, richiardi2011decoding]), joint graph embedding algorithms (see, for example, [MMCV2, JOFC, sunpriebe2013, shen2014manifold]), and graph-valued time-series methodologies (see, for example, [NL2, priebeTS, tangTS, wang2014locality]) operate under the implicit assumption that an explicit vertex correspondence is a priori known across the vertex sets of the graphs. While this assumption is natural in a host of real data settings, in many applications these correspondences may be unobserved and/or errorfully observed [vogelstein2011shuffled]. Connectomics offers a striking example of this continuum. Indeed, while for some simple organisms (e.g., the C. elegans roundworm [white1986structure]) explicit neuron labels are known across specimens, and in human DTMRI connectomes the vertices are often regions of the brain registered to a common template (see [gray2012magnetic]), explicit cross-subject neuron labels are often unknown for more complex organisms.

How can we quantify the effect of the added uncertainty due to an errorfully observed vertex correspondence? Heuristically, if $(G_1, G_2)$ is a realization from a bivariate random graph model with the property that vertices that are aligned across graphs behave similarly in their respective networks, then the uncertainty in $G_2$ is greatly reduced by observing $G_1$ and the latent alignment. Indeed, in the extreme case of $G_1$ and $G_2$ being isomorphic, observing the latent alignment function and $G_1$ completely determines $G_2$. However, as the vertex labels are shuffled, uncertainty is introduced into the bivariate model. In order to formalize this heuristic, we adopt an information theoretic perspective (see [cover2012elements] for the necessary background). We develop a bivariate graph model, the $\rho$-correlated stochastic blockmodel (Section 2.1), in which we are able to formally address the information loss/increase in uncertainty due to an errorful labeling across graphs, and we further explore the impact this lost information has on subsequent inference (see Section 5).

In the presence of a latent vertex correspondence that is errorfully observed across graphs, graph matching methodologies can be applied to recover the latent vertex alignment before performing subsequent inference. Consequently, as multiple graph inference has surged in popularity, so has graph matching; see [ConteReview] and [foggia2014graph] for an excellent review of the graph matching literature. Formally, given two graphs with respective adjacency matrices $A$ and $B$, the graph matching problem (GMP) seeks to minimize $\|A - PBP^T\|_F$ over permutation matrices $P$ — i.e., the GMP seeks a relabeling of the vertices of $B$ that minimizes the number of induced edge disagreements between $A$ and $B$; see Section 2.3 for more detail. While the related graph isomorphism problem has recently been shown to be of sub-exponential complexity [babai2016graph], there are no efficient algorithms known for the more general problem of graph matching. Due to its practical utility and computational difficulty, myriad heuristics have been proposed in the literature for approximately solving the GMP; see, for example, [ConteReview] and [FAQ] and the references contained therein.

Working in the aforementioned correlated stochastic blockmodel setting, we uncover a duality between graph matchability (see Definition 5) and information loss. We show that in the regime where graph matching can recover the latent vertex alignment after label shuffling, relatively little information is lost in the shuffle. We conjecture the inverse statement to be true as well: In the regime where graph matching cannot recover the latent vertex alignment after shuffling, a relatively nontrivial amount of information is lost in the shuffle. Formalizing graph matching as the antithetical operation to label shuffling allows us to better understand the utility of graph matching as a data preprocessing tool. Indeed, while in the presence of modest correlation relatively little information is lost due to shuffling, this lost information can have a dramatic negative effect on subsequent inference. While this may seem like an indictment against joint inference in the errorful correspondence setting, we demonstrate that graph matching can effectively recover almost all of the lost information (see Theorem 14) and, consequently, the lost inferential performance.

Note: Throughout, for real-valued functions $f$ and $g$ of $n$, we shall write $f = o(g)$ if $\lim_{n \to \infty} f(n)/g(n) = 0$. We will also make use of the abbreviation a.a.s. (for asymptotically almost surely), which will be used as follows: a sequence of events $\{E_n\}$ occurs a.a.s. if $\mathbb{P}(E_n^c) \to 0$ at a rate fast enough to ensure $\sum_n \mathbb{P}(E_n^c) < \infty$.

2 Background and Definitions

We seek to understand the information lost due to the vertex correspondence across graphs being errorfully known, as well as the capacity of graph matching to recover this lost information. In this section we provide a statistical framework and the necessary definitions amenable to pursuing these problems further.

2.1 Correlated Stochastic Blockmodels

The random graph framework in which we will anchor our analysis is the correlated stochastic blockmodel (SBM) random graph model of [lyzinski_spectral]. SBMs are widely used to model networks exhibiting an underlying community structure [sbm, sbm2], and provide a simple model family which has been effectively used to approximate the behavior of complex network data [airoldi13:_stoch, wolfe13:_nonpar, choi2014co]. Letting $\mathcal{G}_n$ denote the set of labeled, $n$-vertex, simple, undirected graphs, we define:

Definition 1.

$(G_1, G_2)$ are $\rho$-correlated SBM$(K, \vec{n}, b, \Lambda)$ random graphs (abbreviated $\rho$-SBM) if:

1. $G_1$ and $G_2$ are marginally SBM$(K, \vec{n}, b, \Lambda)$; i.e., for each $i \in \{1, 2\}$,

  • The vertex set $V$ is the union of $K$ blocks $V_1$, $V_2$, …, $V_K$, which are disjoint sets with respective cardinalities $n_1$, $n_2$, …, $n_K$;

  • The block membership function $b: V \to [K]$ is such that for each $v \in V$, $b(v)$ denotes the block of $v$; i.e., $v \in V_{b(v)}$;

  • The block adjacency probabilities are given by the symmetric matrix $\Lambda \in [0, 1]^{K \times K}$;

    i.e., for each pair of vertices $\{u, v\}$, the adjacency of $u$ and $v$ is an independent Bernoulli trial with probability of success $\Lambda_{b(u), b(v)}$.

2. The random variables

$\{\mathbb{1}[u \sim_{G_i} v]\}_{i \in \{1, 2\},\, \{u, v\} \in \binom{V}{2}}$ are collectively independent except that for each $\{u, v\}$, the correlation between $\mathbb{1}[u \sim_{G_1} v]$ and $\mathbb{1}[u \sim_{G_2} v]$ is $\rho$.

One of the keys to the theoretical tractability of the $\rho$-SBM model is that we can construct $\rho$-SBM$(K, \vec{n}, b, \Lambda)$ random graphs as follows. First draw $G_1$ from the underlying SBM$(K, \vec{n}, b, \Lambda)$ model. Conditioning on $G_1$, for each pair $\{u, v\} \in \binom{V}{2}$, if $u \sim_{G_1} v$ then $u \sim_{G_2} v$ is an independent Bernoulli trial with parameter $\Lambda_{b(u), b(v)} + \rho(1 - \Lambda_{b(u), b(v)})$; if $u \not\sim_{G_1} v$ then $u \sim_{G_2} v$ is an independent Bernoulli trial with parameter $\Lambda_{b(u), b(v)}(1 - \rho)$. If $\rho < 1$, so that $G_1$ and $G_2$ are a.a.s. not isomorphic, this construction highlights a natural alignment between the vertex sets of $G_1$ and $G_2$: namely, the identity function $\mathrm{id}_V$. Indeed, for modest $\rho$ the identity function is (with high probability) the permutation of the vertex set of $G_2$ that best preserves the shared structure between $G_1$ and $G_2$; see Theorem 12. As, in practice, this alignment is often errorfully observed, we shall refer to $\mathrm{id}_V$ as the latent alignment between $G_1$ and $G_2$.
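For concreteness, the construction above can be simulated directly. The following is a minimal Python sketch (the function name, interface, and parameter values are illustrative and not taken from the paper):

```python
import numpy as np

def sample_correlated_sbm(block_sizes, Lambda, rho, rng=None):
    """Sample adjacency matrices (A, B) of a pair of rho-correlated SBM graphs.

    block_sizes : cardinalities n_1, ..., n_K of the blocks
    Lambda      : K x K symmetric matrix of block adjacency probabilities
    rho         : edge-wise correlation in [0, 1]
    """
    rng = np.random.default_rng(rng)
    b = np.repeat(np.arange(len(block_sizes)), block_sizes)   # block labels b(v)
    n = b.size
    P = np.asarray(Lambda)[np.ix_(b, b)]                      # P[u, v] = Lambda[b(u), b(v)]

    # Draw G1 marginally, then G2 edge-by-edge given G1:
    #   P(B_uv = 1 | A_uv = 1) = p + rho * (1 - p)
    #   P(B_uv = 1 | A_uv = 0) = p * (1 - rho)
    A = (rng.random((n, n)) < P).astype(int)
    cond = np.where(A == 1, P + rho * (1 - P), P * (1 - rho))
    B = (rng.random((n, n)) < cond).astype(int)

    # Keep only the upper triangle, then symmetrize (simple, undirected graphs).
    iu = np.triu_indices(n, k=1)
    for M in (A, B):
        M[(iu[1], iu[0])] = M[iu]
        np.fill_diagonal(M, 0)
    return A, B, b

A, B, blocks = sample_correlated_sbm([50, 50], [[0.5, 0.2], [0.2, 0.5]], rho=0.6)
```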

Remark 2.

Note that $\rho$-correlated Erdős-Rényi random graphs, abbreviated $\rho$-ER (resp., $\rho$-correlated heterogeneous Erdős-Rényi random graphs), are easily realized by letting $K = 1$ (resp., $K = n$) in Definition 1.

2.2 Shuffled $\rho$-correlated SBM random graphs

To understand the effect of an errorfully observed latent alignment function, we first need to define the action of errorfully aligning two $\rho$-SBM random graphs. Slightly abusing notation, we let $\Pi_n$ denote both the set of $n \times n$ permutation matrices and the set of permutations of $[n] = \{1, 2, \ldots, n\}$; to avoid confusion in the sequel, we will use the Greek letters $\sigma$ and $\tau$ to denote permutations of $[n]$ and capital Roman letters $P$ and $Q$ to denote permutation matrices. For $\sigma \in \Pi_n$ and $G \in \mathcal{G}_n$, we define the $\sigma$-shuffled graph $\sigma(G)$ to be the graph obtained by relabeling the vertices of $G$ according to $\sigma$. Equivalently, if the adjacency matrix of $G$ is $A$ and the permutation matrix associated with $\sigma$ is $P$, then the adjacency matrix of $\sigma(G)$ is $PAP^T$.

For a deterministic permutation $\sigma \in \Pi_n$, the act of shuffling $(G_1, G_2) \sim \rho$-SBM is realized by replacing $G_2$ with the $\sigma$-shuffled graph $\sigma(G_2)$ while leaving $G_1$ fixed. The action of randomly shuffling the vertices of $\rho$-SBM random graphs can then be defined via:

Definition 3.

Let $\Sigma$ be a $\Pi_n$-valued random variable. $(G_1, \Sigma(G_2))$ are $\Sigma$-shuffled, $\rho$-correlated SBM$(K, \vec{n}, b, \Lambda)$ random graphs (abbreviated $(\Sigma, \rho)$-SBM) if

  • $(G_1, G_2) \sim \rho$-SBM$(K, \vec{n}, b, \Lambda)$;

  • For any $\sigma \in \Pi_n$, we have $\mathbb{P}(\Sigma = \sigma \mid G_1, G_2) = \mathbb{P}(\Sigma = \sigma)$.

Simply stated, we first realize $(G_1, G_2)$; conditioned on $(G_1, G_2)$, we then independently realize $\Sigma$.

Note: In the sequel, we shall use $\sigma$ and $\tau$ to denote deterministic permutations, and $\Sigma$ to denote a permutation-valued random variable.
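As a small bookkeeping sketch of the shuffling action (illustrative Python, not the authors' code), relabeling by $\sigma$ corresponds to conjugating the adjacency matrix by the associated permutation matrix:

```python
import numpy as np

def shuffle_graph(B, sigma):
    """Relabel the vertices of B, sending vertex i to label sigma[i].

    With P the permutation matrix satisfying P[sigma[i], i] = 1, this returns
    the same matrix as P @ B @ P.T.
    """
    inv = np.argsort(sigma)              # sigma^{-1}
    return B[np.ix_(inv, inv)]

rng = np.random.default_rng(0)
n = 10
B = rng.integers(0, 2, size=(n, n))
B = np.triu(B, 1); B = B + B.T           # simple, undirected

sigma = rng.permutation(n)               # a uniformly random shuffle
P = np.eye(n, dtype=int)[sigma].T        # permutation matrix associated with sigma
assert np.array_equal(shuffle_graph(B, sigma), P @ B @ P.T)
```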

2.3 Graph matching and graph matchability

If the latent alignment between $G_1$ and $G_2$ is errorfully known, graph matching methods can be applied to approximately recover the true alignment. We formally define the graph matching problem as follows:

Definition 4.

Given two $n$-vertex graphs $G_1$ and $G_2$ with respective adjacency matrices $A$ and $B$, the graph matching problem (GMP) is defined as $\min_{P \in \Pi_n} \|A - PBP^T\|_F$.

Note that the GMP objective function satisfies $\|A - PBP^T\|_F^2 = \|A\|_F^2 + \|B\|_F^2 - 2\,\mathrm{trace}(APBP^T)$, so that solving the GMP is equivalent to solving $\max_{P \in \Pi_n} \mathrm{trace}(APBP^T)$. Intuitively, solving the GMP is equivalent to relabeling the vertices of $G_2$ so as to minimize the number of induced edge disagreements between $G_1$ and $G_2$.

While solving the graph matching problem is NP-hard in general, there are a bevy of approximation algorithms and heuristics in the literature that perform well in practice [Zaslavskiy2009, FAQ, FAP, jovo, JMLR:v15:lyzinski14a] (in addition, see the excellent survey papers [ConteReview, foggia2014graph] for a thorough review of the pertinent literature and discussion of numerous alternate formulations of the GMP). Note that in Section 5, to approximately match the shuffled graphs in our synthetic and real data applications, we use the FAQ algorithm of [FAQ] and, when seeded vertices are present, the SGM algorithm of [FAP]. Seeded vertices, or seeds, are those vertices whose latent alignments are known a priori and are not subjected to any label shuffling.
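As a concrete stand-in for the FAQ/SGM algorithms referenced above, SciPy provides an implementation of the FAQ relaxation (scipy.optimize.quadratic_assignment with method='faq'), whose partial_match option plays the role of seeds. The sketch below reuses the matrices A and B from the sampler sketch in Section 2.1 and is illustrative only, not the authors' code:

```python
import numpy as np
from scipy.optimize import quadratic_assignment

def match_graphs(A, B, seeds=None):
    """Approximately solve max_P trace(A P B P^T) via the FAQ relaxation.

    seeds : optional (m, 2) integer array; row (i, j) fixes vertex i of A to
            vertex j of B (SciPy's `partial_match`, i.e., seeded matching).
    Returns a permutation `perm` such that B[perm][:, perm] is aligned with A.
    """
    options = {"maximize": True}
    if seeds is not None:
        options["partial_match"] = np.asarray(seeds)
    res = quadratic_assignment(A, B, method="faq", options=options)
    return res.col_ind

# Align B to A using 5 seeded vertices, then count the edge disagreements
# remaining under the recovered alignment.
perm = match_graphs(A, B, seeds=[(i, i) for i in range(5)])
disagreements = np.count_nonzero(A - B[perm][:, perm]) // 2
```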

Note also that the graph matching problem is closely related to the problem of entity resolution/record linkage (see, for example, [dedup1, dedup2]), especially in the setting of highly attributed networks. In the present $\rho$-SBM setting, there is a key difference between the paradigms, highlighted by the non-recoverability of vertex correspondences in the presence of general edge shuffling; see Section 3.1.1 for detail.

In the $\rho$-SBM setting, the correlation structure across $G_1$ and $G_2$ highlights the natural alignment, namely the identity, between the two graphs. In Theorems 10 and 12, we establish a phase transition for the values of $\rho$ under which graph matching can/cannot recover the latent alignment in the presence of vertex shuffling. Before being able to state these results, we first must define the concepts of graph matchability and of the graph matched to $G_1$.

Definition 5.

Let $(G_1, G_2)$ be vertex-aligned random graphs with respective adjacency matrices $A$ and $B$. We say that $G_1$ and $G_2$ are matchable if the identity is the unique solution of the associated GMP; i.e., if $\operatorname{argmin}_{P \in \Pi_n} \|A - PBP^T\|_F = \{I_n\}$.

To define the random graph matched to $G_1$, we first define the concept of a matched graph for deterministic graphs. To this end, let $\mathcal{M}(G_1, G_2) := \operatorname{argmin}_{P \in \Pi_n} \|A - PBP^T\|_F$. If $|\mathcal{M}(G_1, G_2)| > 1$, then it is natural to define the graph matched to $G_1$ as the graph with adjacency matrix $PBP^T$ for any element $P$ of $\mathcal{M}(G_1, G_2)$, with all elements of $\mathcal{M}(G_1, G_2)$ being equally probable. Formally, we define

Definition 6.

Let -SBM(). The -valued random variable has distribution defined via

so that

A consequence of Definition 6 is that if , SBM(), then

3 Information loss and graph matching

Given graph-valued random variables $G_1$ and $G_2$, the mutual information of $G_1$ and $G_2$ is defined in the standard way via $I(G_1; G_2) = H(G_1) - H(G_1 \mid G_2)$. Similarly, we define the entropy of $G_1$ via $H(G_1) = -\sum_{g \in \mathcal{G}_n} \mathbb{P}(G_1 = g) \log \mathbb{P}(G_1 = g)$. If $\rho = 0$, then two $\rho$-correlated SBM random graphs are independent, and the mutual information between them is $0$, regardless of whether the latent vertex alignment is known across graphs or not. If $\rho = 1$, then $G_1$ and $G_2$ are isomorphic and $I(G_1; G_2) = H(G_1)$, the entropy of $G_1$. If $\rho \in (0, 1)$, then there is nontrivial information shared across graphs, information which is potentially lost if the labeling is corrupted. To this end, we have the following proposition, which is proved in Section A.1.

Proposition 7.

Let $(G_1, G_2) \sim \rho$-SBM$(K, \vec{n}, b, \Lambda)$.
i. If , and are fixed in , then .
ii. For fixed and , if as then
iii. For fixed and , if as and then we have , for a constant .

Proposition 7 highlights the suitability of mutual information as a vehicle for studying graph correlation (and subsequently graph matchability). It is natural (in light of Theorems 10 and 12) to attempt to quantify the edge-wise correlation through the lens of graph matchability, as matchable graphs are precisely those whose correlation is above a phase transition threshold. However, in the -SBM setting the graph matching objective function computed at the latent alignment satisfies for a real constant . Likewise, the expected trace form of the graph matching objective function (shown in [rel] to be preferable for capturing the true alignment operationally) computed at the latent alignment satisfies for real . In both cases, for correlation decaying to the lead order term is correlation independent, and neither readily captures the edge-wise dependency structure across graphs. The mutual information, however, satisfies . The correlation in the lead order term emphasizes the utility of mutual information for teasing out graph correlation (and hence graph matchability) in the low correlation regimes. Unfortunately, while computing and is immediate, we are unaware of an efficient method for computing . If available, computing a properly normalized version of after matching would allow us to a posteriori judge the suitability of having matched the graphs in the first place.
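Although we do not know how to compute the mutual information between the graphs efficiently after shuffling, for aligned graphs it decomposes as a sum of per-edge terms, each an elementary function of the edge probability and $\rho$. A small numerical sketch (assuming the edge-wise correlated-Bernoulli construction of Section 2.1; illustrative only):

```python
import numpy as np

def bernoulli_pair_mi(p, rho):
    """Mutual information (in nats) of a rho-correlated Bernoulli(p) pair.

    Joint pmf follows the construction of Section 2.1; for aligned rho-SBM
    graphs, I(G1; G2) is the sum of these terms over the vertex pairs
    (with p = Lambda[b(u), b(v)] for the pair {u, v}).
    """
    p11 = p * (p + rho * (1 - p))
    p10 = p01 = p - p11
    p00 = 1 - p11 - p10 - p01
    joint = np.array([[p00, p01], [p10, p11]])
    outer = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / outer[mask])))

print(bernoulli_pair_mi(0.3, 0.0))   # 0: independent edges share no information
print(bernoulli_pair_mi(0.3, 0.5))   # nontrivial shared information per edge
```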

3.1 Information lost and matchability in the high correlation regime

What is the degradation in information due to the uncertainty introduced by randomly permuting the labels of $G_2$ via $\Sigma$? According to the information processing inequality, $I(G_1; \Sigma(G_2)) \leq I(G_1; G_2)$, with equality if and only if $\Sigma$ has a point mass distribution. Below we codify (see Theorems 8 and 10) the following duality between graph matchability and information loss: the correlation regime in which graph matching can successfully “unshuffle” the graphs—i.e., there is enough signal even in the shuffled graphs to recover the latent alignment—is precisely that in which relatively little information will be lost in the shuffle. Note that the proofs of Theorems 8 and 10 can be found in Section A.2 and Section A.3, respectively.

Theorem 8.

Let $(G_1, G_2) \sim \rho$-SBM, with the model parameters fixed in $n$, and let $\Sigma$ be uniformly distributed on $\Pi_n$.

  • For all values of , it holds that

  • If and , then

If is constant in , then the asymptotic upper bound of Theorem 8 part i. and the asymptotic lower bound of Theorem 8 part ii. differ only by a logarithmic factor. We suspect that the true order is , as is the loose upper bound we derive on the information loss in the proof of part i. of the Theorem. In addition, while Theorem 8 is proven with uniformly distributed on , we suspect that an analogous result holds for other distributions on that place suitable mass on permutations that shuffle elements of , though we do not pursue this further here.

Remark 9.

In the proof of Theorem 8 part ii., we essentially prove a stronger statement than that presented in the theorem. If we define to be the set of permutations that preserve vertex block assignments and let be uniformly distributed on , then we prove that under the assumptions of the theorem, . The information processing inequality (see Proposition 22) then gives us that ; indeed, there is information in the vertices’ block assignments which is lost in and not in . Working with allows for errors of the vertex correspondences without having to deal with the mathematical complications that arise from also errorfully observed block memberships.

In light of Proposition 7, relatively little information is lost due to shuffling in the regime: indeed, under this assumption on we have that

Insomuch as graph matching is the antithetical operation to vertex shuffling, if relatively little information is lost in the shuffle then the graphs should be matchable (i.e., GM can unshuffle the networks); we formalize this below in Theorem 10.

Theorem 10.

With notation as above, let and be the adjacency matrices of SBM() random graphs with , and fixed in . There exists a constant such that if , then

Remark 11.

We note here that results similar to Theorem 10 for a much-simplified 2-block SBM appear in [onaran2016optimal], although the authors there consider a different MAP-based objective function in their matching setup.

3.1.1 Shuffling sans graphs

Considering an analogue of Theorem 8 in the non-graph setting illuminates the special role that GM plays in recovering the lost information. Consider the following example: let $(X_i, Y_i)_{i=1}^{m}$ be i.i.d. pairs of $\rho$-correlated Bernoulli random variables, and suppose we observe the sequence of $X_i$'s and a shuffled version of the sequence of $Y_i$'s. What is the information loss due to this shuffling? In the Bernoulli setting, this is partially answered by Theorem 8, as that theorem can be immediately cast in the classical setting with the edge indicator pairs playing the role of the $(X_i, Y_i)$'s. The key difference between the graph setting and the classical setting is the structure the graph imposes on the shuffling, as not all edge-shuffles are feasibly obtained via vertex shuffles. This structure is what allows GM to unshuffle the graphs, since optimizing the GM objective over all edge-shuffles (i.e., optimally unshuffling in the classical setting) would potentially induce significantly more edge-correlation than initially present in the graphs and would not be effective for recovering the lost vertex alignment. Indeed, in this Bernoulli setting, it is easy to see that under mild model assumptions, matching the shuffled sequence directly to the $X_i$'s induces nearly maximal correlation between the $X_i$'s and the matched sequence of $Y_i$'s, regardless of the true value of $\rho$.

3.2 The low correlation regime

In the $\rho$-SBM model, Theorem 10 asserts that, under mild model assumptions, if $\rho$ is sufficiently large then $G_1$ and $G_2$ are matchable a.a.s. In Theorem 12 below, we identify the second half of the matchability phase transition. Indeed, a consequence of Theorem 12 below is that, under mild assumptions, there exists a constant $c > 0$ such that if $\rho$ is below the corresponding threshold then $G_1$ and $G_2$ are asymptotically not matchable with high probability.

Theorem 12.

With notation as above, let and be the adjacency matrices of SBM() random graphs with , and fixed in . Further assume there is an such that . Let be a collection of disjoint within-block transpositions; i.e., if then . There exists a constant such that if then

We conjecture a similar phase transition for the relative information loss due to shuffling. Namely, when $G_1$ and $G_2$ are not matchable, we conjecture that a nontrivial fraction of the mutual information is lost in the shuffle.

Conjecture 13.

Let -SBM(), with and fixed in , and let be uniformly distributed on . If , then

While the matchability phase transition in Theorems 10 and 12 is tighter than the conjectured phase transition in Conjecture 13, if true, Conjecture 13 would imply a duality between information loss and matchability:

In Section 3.3, we show that in the high correlation regime—where the graphs are matchable and relatively little information is lost due to shuffling—graph matching can effectively recover the information lost due to shuffling a.a.s.; see Theorem 14. This provides a theoretical foundation for understanding the utility of graph matching as a pre-processing step for a host of inference tasks: often when the lost information due to shuffling has a negative effect on inference, graph matching can recover the lost information and improve the performance in subsequent inference. In the regime, while the graphs are no longer matchable, we conjecture (and experiments bear out) that the alignment found by graph matching still recovers much of the lost information. However, theoretically working in this regime will require new proof techniques, and we do not pursue this further here.

3.3 Graph matching: Recovering the lost information

In Section 5 we show that the information lost, even when relatively small (see Theorem 8), can have a deleterious effect on subsequent inference, and we demonstrate the potential of graph matching to recover the lost inference performance. Theorem 14 below provides a major step towards formalizing this intuition, proving that in the regime, graph matching recovers almost all of the lost information. The practical effect of this is the recovery of the lost inferential performance. Note that the proof of Theorem 14 can be found in Section A.5.

Theorem 14.

Let -SBM() with and fixed in . If , then

Theorem 14 provides a sharp contrast to the information lost in the shuffled graph regime in this high correlation setting. Indeed, recall that Theorem 8 implies that

Remark 15.

The information processing inequality states that, given random variables $X$ and $Y$ and a measurable function $f$, $I(X; f(Y)) \leq I(X; Y)$. Intuitively, we cannot transform $Y$ independently of $X$ and increase the mutual information between $X$ and $Y$. At first glance, Theorem 14, which implies that matching asymptotically recovers the information lost to shuffling, seems to contradict this. However, we note that the graph matched to $G_1$ is a function of both $G_1$ and $\Sigma(G_2)$. Indeed, if the transformation applied to $Y$ is allowed to depend on $X$, then the information processing inequality need not hold (for a simple example, let $X$ have nontrivial entropy, let $Y$ be independent of $X$, and let $f(X, Y) = X$).

4 Empirically matching in the low correlation regime

Figure 1: For $\rho$-SBM, panel (a) plots the mean number of edge disagreements $\pm$ s.e. against the correlation $\rho$ when the graphs are aligned with the true latent correspondence and when the graphs are aligned by initializing the FAQ graph matching algorithm at the true correspondence. Panel (b) plots the mean sample edge correlation against $\rho$ for the matched graphs (“Matched” in the legend) and for the shuffled graphs with $\Sigma$ uniformly distributed over $\Pi_n$ (“Shuffled” in the legend). In each case, the means are computed across Monte Carlo iterates.

In the low correlation regime, graph matching with high probability cannot recover the true correspondence in the presence of vertex shuffling. We conjecture that this is due, in part, to a nontrivial amount of the information between $G_1$ and $G_2$ being irrevocably lost due to the vertex shuffling. To explore this further empirically, we consider the following experiment. For $\rho$-SBM, we plot in the left panel of Figure 1 the mean number of edge disagreements $\pm$ s.e. against the correlation $\rho$ when the graphs are aligned with the true latent correspondence, and when the graphs are aligned by initializing the FAQ graph matching algorithm at the true correspondence. In each case, the means are computed across Monte Carlo iterates. In the right panel we plot the mean sample edge correlation against $\rho$ for the matched graphs (“Matched” in the legend), and for the shuffled graphs with $\Sigma$ uniformly distributed over $\Pi_n$ (“Shuffled” in the legend), again averaged over Monte Carlo iterates. We note here that we observe similar phenomena as in Figure 1 across a broad swath of parameter values as well.
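A sketch of the simulation behind the left panel of Figure 1, reusing the sample_correlated_sbm sketch from Section 2.1; the block sizes, $\Lambda$, and Monte Carlo counts below are placeholders rather than the values used for the figure:

```python
import numpy as np
from scipy.optimize import quadratic_assignment

def edge_disagreements(A, B):
    return np.count_nonzero(A - B) // 2

def disagreement_curve(rhos, n_mc=50, block_sizes=(50, 50),
                       Lambda=((0.5, 0.2), (0.2, 0.5)), seed=0):
    """Mean edge disagreements under the identity alignment vs. FAQ started at it."""
    rng = np.random.default_rng(seed)
    out = {}
    for rho in rhos:
        id_err, faq_err = [], []
        for _ in range(n_mc):
            A, B, _ = sample_correlated_sbm(list(block_sizes), Lambda, rho, rng)
            res = quadratic_assignment(
                A, B, method="faq",
                options={"maximize": True, "P0": np.eye(A.shape[0])})
            perm = res.col_ind
            id_err.append(edge_disagreements(A, B))
            faq_err.append(edge_disagreements(A, B[perm][:, perm]))
        out[rho] = (np.mean(id_err), np.mean(faq_err))
    return out

print(disagreement_curve(np.linspace(0.0, 1.0, 6), n_mc=10))
```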

From the figure we make the following observations. First, as the FAQ algorithm is a Frank-Wolfe based approach, the agreement between the two methods in Figure 1 panel (a) for implies that for these large correlation levels, the latent alignment is a (local) optimum for the graph matching objective function. Likewise, the improvement in the match error induced by matching the graphs via FAQ initialized at for implies that the latent alignment is not a (global and local) optimum for the graph matching objective function for these lower correlation values. This coincides with our intuition that in the presence (resp., absence) of enough correlation graph matching can (resp., cannot) recover the latent alignment in the presence of shuffling. In the figure we also see that in the graph matched setting, the matched error roughly asymptotes after the matchability phase transition. This is due, in part, to the matchability of and —indeed, if the vertex alignment induced in is viewed as the true latent alignment then (absent symmetries) this alignment is clearly recoverable via graph matching. Below the phase transition, matching the shuffled, correlated graphs artificially induces more edge-wise correlation than present in the latent alignment—see the right panel of Figure 1—in all cases bringing the edge correlation between and to the phase transition threshold. Shuffling effectively makes the edges across graphs independent (uncorrelated equals independent in the Bernoulli setting), and matching the shuffled graphs induces the same local edge correlation structure as would matching independent graphs; the original edge-correlation structure which is captured in is truly lost in the shuffle and the global structure preserved in the shuffle (subgraph counts, community structure, etc.) is not enough for graph matching to recover the latent alignment.

5 The effect on subsequent inference

While the loss in information due to shuffling has little effect on inference tasks that are independent of vertex labels (for example, the nonparametric hypothesis testing methodologies of [tang2014nonparametric, asta2014geometric]), the effect on inference that assumes an a priori known vertex alignment may be dramatic. We demonstrate this in the context of joint graph clustering and two-sample hypothesis testing for graphs. The trend we demonstrate below is as follows: performance is diminished as the alignment is shuffled, and the performance loss due to shuffling is recovered via graph matching. We expect this trend to generalize to a host of other joint inference tasks as well. We note here that these results provide a striking contrast to those in [vogelstein2011shuffled], where it was shown that even if the graph labels contain relevant class signal, obfuscating the labels does not necessarily decrease classification performance in a single graph setting.

5.1 Hypothesis Testing

We first consider the simple setting of testing whether two $\rho$-correlated Erdős-Rényi graphs have the same edge probability. Formally, given $(G_1, G_2) \sim \rho$-ER$(p_1, p_2)$—i.e., the edgewise correlation of $G_1 \sim$ ER$(n, p_1)$ and $G_2 \sim$ ER$(n, p_2)$ is $\rho$, where, as $p_1$ and $p_2$ potentially differ here, we require $\rho$ to be achievable given the two marginal edge probabilities—we wish to test the hypotheses $H_0: p_1 = p_2$ versus $H_A: p_1 \neq p_2$. If the edges across graphs were uncorrelated, then under $H_0$ we could view the two edge sets as samples from independent Bin$\big(\binom{n}{2}, p\big)$ random variables, and a natural test statistic for testing $H_0$ versus $H_A$ would be that of the pooled two-proportion $z$-test computed from the two empirical edge densities. As the edges are positively correlated (with the same correlation $\rho$) across graphs, a more powerful test of $H_0$ versus $H_A$ would be a paired two-proportion $z$-test, namely the statistic in which the variance is corrected using $\hat\rho$, the empirical correlation between the edge sets of $G_1$ and $G_2$. In this paired setting, directly applying the pooled two-proportion test would yield an overly conservative (level less than $\alpha$) test. Correcting for the type-I error in the pooled test (to make it approximately level $\alpha$) is achieved by shrinking the $z$-test critical value by a factor determined by $\hat\rho$, yielding a test equivalent to the paired test using the corrected statistic. A natural question in this paired setting is what level of shuffling is necessary for the pairedness of the data to become so corrupted (i.e., the information in the pairing being lost) as to render the unpaired test more powerful, and can graph matching recover the lost inferential performance in the paired regime? Insomuch as $\rho$ captures the edge-wise correlation across networks, this question is precisely addressing the effect that the reduced information (due to shuffling) has on testing power.
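A sketch of the two statistics described above, written in the standard pooled and correlation-corrected two-proportion forms (the exact scaling used in the paper may differ; this is an assumed form, not the authors' code):

```python
import numpy as np

def edge_density_tests(A, B):
    """Pooled and paired (correlation-corrected) z statistics for H0: p1 = p2."""
    n = A.shape[0]
    iu = np.triu_indices(n, k=1)
    x, y = A[iu].astype(float), B[iu].astype(float)   # aligned edge indicators
    m = x.size                                        # n choose 2 edge slots
    p1, p2 = x.mean(), y.mean()
    pbar = (p1 + p2) / 2

    z_pooled = (p1 - p2) / np.sqrt(2 * pbar * (1 - pbar) / m)

    rho_hat = np.corrcoef(x, y)[0, 1]                 # empirical edge correlation
    var_paired = (p1 * (1 - p1) + p2 * (1 - p2)
                  - 2 * rho_hat * np.sqrt(p1 * (1 - p1) * p2 * (1 - p2))) / m
    z_paired = (p1 - p2) / np.sqrt(var_paired)
    return z_pooled, z_paired
```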

Figure 2: We plot the power (based on 2000 Monte Carlo trials) when testing $H_0$ versus $H_A$ against the number of unseeded (potentially shuffled) vertices. In black, we plot the power of the paired test when at most a given fraction of the unseeded vertices have their labels shuffled under $H_A$. The blue line plots the power of the unpaired test (not type-I error corrected), and the red line plots the power of the paired test when graph matching is used to align the networks before computing the paired test statistic.

Exploring this further, we consider the above tests when vertex correspondences are assumed known across $G_1$ and $G_2$ for all but a subset of the vertices (i.e., the unseeded and potentially shuffled vertex labels), noting here that we observed similar phenomena across a broad swath of parameter choices. As we can empirically sample from the null distribution of the test statistic given the level of shuffling, to ensure a level-$\alpha$ test the shuffled null distribution is computed under the least favorable element of the null hypothesis for the given level of shuffling, which here corresponds to none of the unseeded vertices having their labels shuffled.

In Figure 2, we plot the power (based on 2000 Monte Carlo trials) when testing $H_0$ versus $H_A$ against the number of unseeded (potentially shuffled) vertices. In black, we plot the power of the paired test when at most a given fraction of the unseeded vertices have their labels shuffled under $H_A$. The blue line plots the power of the unpaired test (not type-I error corrected), and the red line plots the power of the paired test when graph matching is used to align the networks before computing the paired test statistic. As expected, we see that as more vertices are shuffled under $H_A$, the power of the paired test decreases precipitously. From the figure, we also see that the information which is lost in the shuffle—and the subsequent lost testing power—is recovered by first graph matching the networks before computing the paired test statistic (at least when there are enough seeds and the graph matching is effective in recovering the latent correspondence).

Note that the matched test is more powerful than the unpaired alternative for all levels of shuffling in this example, and is more powerful than the shuffled test as long as the graphs are sufficiently shuffled under $H_A$. As the exact level of shuffling amongst the unseeded vertices (in $H_0$ or $H_A$) is unknown a priori, we propose the graph matching version of the test as a conservative, more robust version of testing $H_0$ versus $H_A$. We view the decreased power of the matched test in the absence of seeds as an algorithmic artifact; indeed, with no seeds the GM algorithm we employ often does not recover the true correspondence after shuffling. With a perfect matching, we would expect the matched power to be the same for all numbers of shuffled vertices.

An interesting aspect of Figure 2 is that the power of the unpaired test and the paired test with are identical. For and to yield approximately the same power with the same critical value used (which is the case in the least favorable element of the null for with seeds), it is necessary that . As we are in a Bernoulli setting, the edge-wise correlation being effectively is indicative of the edges across graphs being effectively pairwise independent (sample correlation has mean of order ), in which case the paired test we are using reduces to its unpaired alternative. Although the edges are effectively pairwise independent, the graphs globally are not. Indeed, less local structures (i.e., subgraph counts, community structures, etc.) are still correlated after shuffling. In this high correlation setting, this global structure is also captured by which, though diminished by shuffling, is still nontrivial. It is this global structure that is able to be leveraged by graph matching to recover the lost local signal and the lost information. This is precisely what separates the present graph setting from the more classical paired data setting, in which shuffling data labels is potentially irreversibly detrimental to subsequent inference.

5.1.1 The effect of shuffling on embedding-based tests

In [MT2], we propose a semiparametric hypothesis testing framework for determining whether two graphs are generated from the same underlying random graph model. Note that while this test is not provably UMP (indeed, no such tests exist in the literature for sufficiently complex graph models), it is nonetheless one of the first provably consistent two-sample graph hypothesis tests posed in the literature. The test proceeds as follows. Given $G_1$ distributed as heterogeneous ER$(P_1)$ and $G_2$ distributed as heterogeneous ER$(P_2)$, with $P_1$ and $P_2$ assumed to be positive semidefinite, rank-$d$ edge-probability matrices, $G_1$ and $G_2$ are first embedded into $\mathbb{R}^d$ via adjacency spectral embedding.

Definition 16.

Let $G$ be an $n$-vertex graph with adjacency matrix $A$. The $d$-dimensional adjacency spectral embedding (ASE) of $G$ is given by $\hat{X} = U_A S_A^{1/2}$, where $S_A \in \mathbb{R}^{d \times d}$ is the diagonal matrix containing the $d$ largest eigenvalues of $A$ on its diagonal, and $U_A \in \mathbb{R}^{n \times d}$ is the matrix whose columns are the corresponding orthonormal eigenvectors.
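A minimal sketch of ASE as in Definition 16, keeping the $d$ algebraically largest eigenvalues (illustrative; variants based on largest-magnitude eigenvalues are also common):

```python
import numpy as np

def ase(A, d):
    """d-dimensional adjacency spectral embedding of a symmetric adjacency matrix."""
    vals, vecs = np.linalg.eigh(A)                # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:d]              # indices of the d largest
    S = np.clip(vals[idx], 0, None)               # guard against tiny negative values
    U = vecs[:, idx]
    return U * np.sqrt(S)                         # rows are the embedded vertices
```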

In [MT2] it is proven that, under mild assumptions, the (suitably rotated) rows of $\mathrm{ASE}(A)$ concentrate tightly around the corresponding scaled eigenvectors of $P$ with high probability. This fact is leveraged to produce a consistent hypothesis test for testing $H_0: P_1 = P_2$ versus $H_A: P_1 \neq P_2$ based on a suitably scaled version of the test statistic $\min_{W \in \mathcal{O}_d} \|\hat{X}_1 W - \hat{X}_2\|_F$, where $\hat{X}_1 = \mathrm{ASE}(A_1)$ and $\hat{X}_2 = \mathrm{ASE}(A_2)$.

In [levin2017central], the test is further refined to more explicitly take advantage of the assumed known vertex correspondence across networks. Inspired by [jointLi] and the joint manifold embedding methodology of [JOFC, fjofc], we proceed as follows. Given vertex-aligned $G_1$ and $G_2$ with adjacency matrices $A_1$ and $A_2$, we use ASE to first embed the omnibus adjacency matrix $M = \begin{pmatrix} A_1 & (A_1 + A_2)/2 \\ (A_1 + A_2)/2 & A_2 \end{pmatrix}$. By jointly embedding $G_1$ and $G_2$ via the omnibus matrix, the Procrustes rotation necessary in computing the semiparametric test statistic can be circumvented, which is empirically shown to increase testing power in [levin2017central]. To wit, if $\mathrm{ASE}(M) = [\hat{Z}_1^T \,|\, \hat{Z}_2^T]^T$ with both $\hat{Z}_1, \hat{Z}_2 \in \mathbb{R}^{n \times d}$, then the omnibus test statistic is a suitably scaled version of $\|\hat{Z}_1 - \hat{Z}_2\|_F$.
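A sketch of the omnibus construction and the (unscaled) test statistic, reusing the ase sketch above; the centering and scaling used in [levin2017central] are omitted here:

```python
import numpy as np

def omnibus_statistic(A1, A2, d):
    """Unscaled omnibus test statistic ||Zhat_1 - Zhat_2||_F."""
    n = A1.shape[0]
    avg = (A1 + A2) / 2
    M = np.block([[A1, avg],
                  [avg, A2]])                 # 2n x 2n omnibus matrix
    Z = ase(M, d)                             # joint embedding via the ase() sketch
    Z1, Z2 = Z[:n], Z[n:]
    return float(np.linalg.norm(Z1 - Z2, ord="fro"))
```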

As in the correlated Erdős-Rényi setting, we wish to understand the impact of vertex shuffling on testing power using the omnibus statistic. As before, we expect that as the labels are progressively more corrupted, tests based on graph invariants that do not utilize the latent alignment will achieve higher power than testing based on the omnibus statistic. Exploring this further, we consider the following experiment. Let $P_1 = X_1 X_1^T$ be a rank-3 positive semidefinite matrix with the rows of $X_1$ distributed as i.i.d. samples from a Dirichlet(1,1,1) distribution, and let $G_1$ be distributed as heterogeneous ER$(P_1)$ (so that $G_1$ can be viewed as a sample from a random dot product graph with parameter $X_1$ [young2007random]). Consider $P_2 = X_2 X_2^T$, where the final rows of $X_2$ are identical to those of $X_1$ and the first rows of $X_2$ are realized with i.i.d. Dirichlet(1,1,1) rows, independent of $X_1$. We let $G_2$ be distributed as heterogeneous ER$(P_2)$, with edges maximally correlated to those of $G_1$; the maximal edge-wise correlation is computed entry-wise from the two edge-probability matrices.

Figure 3: We plot (“Omni w/ Shuffling” in the legend) the power $\pm$ s.e. of the test of $H_0$ versus $H_A$ using the omnibus statistic (with appropriate type-I error correction) when a subset of the vertices have their labels potentially shuffled (the remaining vertices are seeded). We plot the power $\pm$ s.e. of the test (“Matched Omni” in the legend) when the seeded vertices are used to first match the graphs before computing the omnibus statistic to directly test $H_0$ versus $H_A$. We plot the power $\pm$ s.e. of testing $H_0$ versus $H_A$ using the graph-invariant test statistics based on maximum degree, triangle count, and spectral norm (resp., “Max Degree”, “Triangle Count”, and “Spectral Norm” in the legend). In all cases, the power is averaged over 25 Monte Carlo replicates.

We are interested in understanding the power of the test using the omnibus statistic for detecting the anomalous behavior of the perturbed vertices in $G_2$. In Figure 3, we plot (“Omni w/ Shuffling” in the legend) the power $\pm$ s.e. of the test of $H_0$ versus $H_A$ using the omnibus statistic when a subset of the vertices have their labels potentially shuffled (the remaining vertices are seeded). To control the level of the test here, we sample the null distribution under the least favorable member of the composite null, which corresponds to all unseeded vertices being shuffled under $H_0$. Rather than considering different levels of shuffling under $H_A$ as in Section 5.1, we plot the best possible performance for the composite alternative hypothesis; this is achieved when all unseeded vertices are also shuffled under $H_A$. We also plot the power $\pm$ s.e. of the test (“Matched Omni” in the legend) when the seeded vertices are used to first match the graphs before computing the omnibus statistic to directly test $H_0$ versus $H_A$. In each case, the power is averaged over 25 Monte Carlo replicates, which here corresponds to 25 different realizations of $G_1$ and $G_2$.

To compare the omnibus embedding based test to graph-invariant based tests, we consider the following graph invariants. For an adjacency matrix $A$, we consider the maximum vertex degree in $A$, the number of triangle subgraphs present in $A$, and the spectral norm of $A$. In Figure 3 we then plot the power $\pm$ s.e. of testing $H_0$ versus $H_A$ using the corresponding graph-invariant test statistics (resp., “Max Degree”, “Triangle Count”, and “Spectral Norm” in the legend). In all cases, the power is averaged over 25 Monte Carlo replicates, which here corresponds to 25 different realizations of $G_1$ and $G_2$.
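A sketch of the three invariants; the difference-of-invariants form of the test statistics below is an assumption, since the exact statistics used for Figure 3 are not recoverable from the text:

```python
import numpy as np

def max_degree(A):
    return int(A.sum(axis=1).max())

def triangle_count(A):
    # Number of triangles in a simple undirected graph is trace(A^3) / 6.
    return int(round(np.trace(A @ A @ A) / 6))

def spectral_norm(A):
    return float(np.linalg.norm(A, ord=2))

def invariant_statistics(A1, A2):
    """Label-free statistics comparing two graphs (assumed difference form)."""
    return {
        "max_degree": abs(max_degree(A1) - max_degree(A2)),
        "triangle_count": abs(triangle_count(A1) - triangle_count(A2)),
        "spectral_norm": abs(spectral_norm(A1) - spectral_norm(A2)),
    }
```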

From the figure, we see that in the presence of sufficient shuffling the graph-invariant based tests are more powerful than the test that leverages the errorful correspondence. However, graph matching successfully recovers the lost information and subsequently the lost testing power. The increased variance in the graph matching based test can be attributed to errors induced in the matching under $H_A$ with only 25 seeds. With this number of seeds our matching algorithm can effectively recover the latent alignment under $H_0$, while under $H_A$ the graph matching algorithm occasionally fails to recover the true correspondence. These errors under $H_A$ result in strictly increased testing power here (and hence the increased standard error). While at first glance this would suggest using an imperfect matching to optimize power, the red curve demonstrates the dangers of having an imperfect matching under both $H_0$ and $H_A$. As practically there is no way to know whether the matching is perfect or not, or whether we are in $H_0$ or $H_A$, we view the matched test (always matching under both $H_0$ and $H_A$) as a conservative, robust alternative to its unmatched paired counterpart.

We lastly note that while the graph invariant methods perform poorly in this heterogeneous ER anomaly setting, under alternate testing regimes we expect these tests to outperform the embedding based test here presented (for example, when testing in certain non-edge independent models). Further understanding the properties of the underlying model that dictate this performance is paramount in practice, and we are presently pursuing this line of research.

Remark 17.

Note that while it is perhaps more natural to use an appropriately centered and scaled version of the statistic as our test statistic, as noted in [MT2], this yields a test that is inconsistent for a large class of alternatives (e.g., in the setting where $G_1$ and $G_2$ are independent), whereas the test based on the uncentered statistic is provably level-$\alpha$ consistent over the entire range of (fixed) alternative distributions. In [levin2017central], we posit the same level-$\alpha$ consistency for testing based on the omnibus statistic.

5.2 Joint versus single graph clustering

We next explore the impact that label shuffling has on spectral graph clustering. Spectral graph clustering has become an important and widely-used machine learning method, with a sizable literature devoted to various spectral clustering algorithms under several model assumptions; see, for example, [von2007tutorial, qin2013dcsbm, rohe2011spectral, sussman2012consistent, fishkind2013consistent, lyzinski2014perfect]. We focus here on a variant of the methodology of [sussman2012consistent, jointLi], which embeds (a pair of) graphs into an appropriate Euclidean space and subsequently employs the $k$-means algorithm to cluster the data. Here, rather than using $k$-means clustering to cluster the data, we will employ the model-based clustering algorithm Mclust [fraley1999mclust].

Figure 4: Joint versus single graph clustering of two $\rho$-correlated SBMs. Over a range of $\rho$, we embed and cluster the jointly embedded vertices. The dashed line plots the mean Adjusted Rand Index (ARI) $\pm$ 2 s.e. for the clustering of $G_1$ against its true block assignments when embedding and jointly clustering the vertices. The solid line plots the ARI $\pm$ 2 s.e. of clustering $G_1$ against the true block assignments for single graph clustering. In each case the number of Monte Carlo trials was 500.

When we have multiple graph-valued observations of the same data, can we efficiently utilize the information shared between the graphs to increase clustering performance? In the manifold matching literature, there are numerous examples of this heuristic: leveraging the signal across multiple data sets can increase inference performance within each of the data sets (see, for example, [JOFC, fjofc, sun2013generalized, shen2014manifold]). Inspired by this, given vertex-aligned $G_1$ and $G_2$, we use ASE to embed the omnibus adjacency matrix and use Mclust to cluster the embedded vertices.
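A sketch of the joint clustering pipeline, with scikit-learn's GaussianMixture standing in for Mclust (a substitution for illustration, not the tooling used for Figure 4) and ARI computed against the true block labels; it reuses the ase sketch from Section 5.1.1:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

def joint_cluster_ari(A1, A2, true_blocks, d, K, seed=0):
    """Jointly embed (A1, A2) via the omnibus matrix, cluster, and score G1's ARI."""
    n = A1.shape[0]
    avg = (A1 + A2) / 2
    M = np.block([[A1, avg], [avg, A2]])       # omnibus matrix
    Z = ase(M, d)                              # joint embedding via the ase() sketch
    gmm = GaussianMixture(n_components=K, random_state=seed).fit(Z)
    labels = gmm.predict(Z[:n])                # cluster labels for G1's vertices
    return adjusted_rand_score(true_blocks, labels)
```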

(a)
(b)
Figure 5: We plot the mean ARI s.e. of: i) the clustering of obtained via jointly embedding/clustering when of the vertices in