# Recovering low-rank structure from multiple networks with unknown edge distributions

In increasingly many settings, particularly in neuroimaging, data sets consist of multiple samples from a population of networks, with vertices aligned across networks. For example, fMRI studies yield graphs whose vertices correspond to brain regions, which are the same across subjects. We consider the setting where we observe a sample of networks whose adjacency matrices have a shared low-rank expectation, but edge-level noise distributions may vary from one network to another. We show that so long as edge noise is sub-gamma distributed in each network, the shared low-rank structure can be recovered accurately using an eigenvalue truncation of a weighted network average. We also explore the extent to which edge-level errors influence estimation and downstream inference tasks. The proposed approach is illustrated on synthetic networks and on an fMRI study of schizophrenia.

## Authors

• 12 publications
• 2 publications
• 28 publications
01/24/2021

### Separating Stimulus-Induced and Background Components of Dynamic Functional Connectivity in Naturalistic fMRI

We consider the challenges in extracting stimulus-related neural dynamic...
09/06/2016

### Law of Large Graphs

Estimating the mean of a population of graphs based on a sample is a cor...
07/20/2017

09/23/2020

### Learning Mixtures of Low-Rank Models

We study the problem of learning mixtures of low-rank models, i.e. recon...
10/27/2019

06/10/2020

### Low Rank Directed Acyclic Graphs and Causal Structure Learning

Despite several important advances in recent years, learning causal stru...
04/30/2021

### Network Recovery from Unlabeled Noisy Samples

There is a growing literature on the statistical analysis of multiple ne...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In many applications, simultaneous analysis of multiple networks is of increasing interest. Broadly speaking, a researcher may be interested in identifying structure that is shared across multiple networks. In the social sciences, this may correspond to some common underlying structure that appears, for example, in different friendship networks across high schools. In biology, one may be interested in identifying the extent to which different organisms’ protein-protein interaction networks display a similar structure. In neuroscience, one may wish to identify common patterns across multiple subjects’ brains in an imaging study. This last application in particular is easily abstracted to the situation where one observes a collection of independent graphs on the same vertex set, as there are well-established and widely used algorithms that map locations in individual brains onto an atlas of so-called regions of interest (ROIs), such as the one developed by PowerETAL2011. The assumption of vertex alignment across graphs is common in the statistics literature on multiple network analysis for neuroimaging applications; see for example LevAthTanLyzPri2017; ArrKesLevTay2017. A common approach to these problems is to treat the individual graphs as independent noisy realizations of some shared structure, for example, a stochastic block model (LeLevLev2018) or a low rank model (TanKetVogPriSus2016). The goal is then to recover this underlying shared structure.

To date, most techniques for multiple-network analysis have assumed that the observed networks come from the same distribution, and typically have binary edges. The latter assumption is restrictive in neuroimaging settings, where edge weights represent the strength of connectivity (e.g., measured by correlation between time series at pairs of ROIs), and substantial information is lost if these weights are truncated to binary; see, for example, AicJacCla2015. The shared noise distribution is also a restrictive assumption, since while we may reasonably expect that some population-level structure is shared across subjects, subject-level variation is likely to be heterogeneous (due to, for example, different head motion, etc). There are a number of different pipelines in use for reducing this type of noise in fMRI data (see, for example, CiricETAL2017, for a discussion), but all introduce artifacts of one kind or another, which we model here as potentially heterogeneous edge noise.

In this paper, we develop techniques for analyzing multiple networks without these two assumptions. In particular, we allow for weighted edges and heterogeneous noise distributions, and study how to estimate the underlying population mean. Under these conditions, the simple arithmetic mean of weighted graphs is likely to be sub-optimal, as networks with higher noise levels will contribute as much as the ones with less noise, and an estimate that takes noise levels into account should in principle work better. While there are a number of possible matrix means we might consider (see, for example, those described in Bhatia2007), in this paper, we focus on the case of weighted arithmetic means of networks. That is, letting be the adjacency matrices of independent graphs on the same vertex set, we are interested in estimators of the form , where are non-negative, data-dependent weights summing up to .

Next, we briefly review recent work on problems arising in the analysis of multiple vertex-aligned network observations. For example, motivated by brain imaging applications similar to those considered here, TanKetVogPriSus2016 considered the problem of estimating a low-rank population matrix, when graphs are drawn i.i.d. from a random dot product graph model (RDPGsurvey), and investigated the asymptotic relative efficiency of a low-rank approximation of the sample mean of these observed graphs compared to the graph sample mean itself. LevAthTanLyzPri2017 considered the problem of analyzing multiple vertex-aligned graphs, and devised a method to compare geometric representations of graphs, typically called embeddings in the literature, for the purpose of exploratory data analysis and hypothesis testing, focused particularly on comparing vertices across graphs. In a similar spirit, WanArrVogPri2017

considered the problem of embedding multiple binary graphs whose adjacency matrices (approximately) share eigenspaces, while possibly differing in their eigenvalues. In all of the above-described work, the authors assume binary networks and identical noise distributions on edges, in contrast to the present paper.

EynKovBroGlaBro2015 developed a technique for analyzing multiple manifolds by (approximately) simultaneously diagonalizing a collection of graph Laplacians. Like our work, the technique in EynKovBroGlaBro2015 aims to recover spectral information shared across multiple observed graphs, but differs in that the authors work with weighted similarity graphs that arise from data lying on a manifold, and derive a perturbation bound rather than applying a specific statistical or probabilistic model. The authors also require that the population graph Laplacian have a simple spectrum, a constraint that we do not require.

A few recent papers have considered the problem of analyzing multiple networks generated from stochastic block models with the same community structure, but possibly different connection probability matrices

(TanLuDhi2009; DonFroVanNef2014; HanXuAir2015; PauChe2016; BhaCha2018). Similar approaches have been developed for time-varying networks, where it is assumed that the connection probability may change over time, but community structure is constant or only slowly varying (XuHer2014). Once again, our setting is distinct from this line of work, since we assume a general shared low-rank structure with varying distribution of edge noise, and do not require edges to be binary.

TanTanVogPri2017 considered the problem of estimating shared low-rank structure based on a sample of networks under the setting where individual edges are drawn from contaminated distributions (Huber1964). The paper compares the theoretical guarantees of estimates based on edge-wise (non-robust) maximum likelihood estimation, edge-wise robust maximum likelihood estimation (FerYan2010), and eigenvalue truncations of both. The present work does not focus on robustness, and as a result is largely not comparable to TanTanVogPri2017, although our procedures can be made robust in a similar fashion if desirable.

Finally, KimLev2019 recently proposed a linear mixed effects model for samples of weighted networks, which decomposes the edge weights into community-level, edge-level, and random individual effects. They assume all networks are sampled i.i.d. from a given subject population, and do not allow for heterogeneous noise distributions. They also, for the most part, treat communities as known rather than estimating them.

## 2 Problem Setup and Notation

We begin by establishing notation. For an integer , we write for the set

. For a vector

, we write for the Euclidean norm of . For a matrix , denotes the spectral norm, the Frobenius norm, and the -to- norm, , where . For a positive semidefinite matrix , We write for the ratio of the largest eigenvalue of to its largest non-zero eigenvalue.

Throughout this paper, we assume that we observe undirected graphs each on vertices with corresponding adjacency matrices, . We will refer to the -th graph and its adjacency matrix interchangeably. The graphs are drawn independently with shared low-rank expectation , for all . Throughout, we will denote the rank of by . We assume that the vertices are aligned across the graphs, in the sense that the -th vertex in graph is directly comparable to the -th vertex in graph for all and . As a motivating example, we can think of these graphs as representations of the brains of patients, obtained from fMRI scans, and the vertices as locations in the brain mapped onto a common atlas.

While the low-rank structure is shared across graphs, we allow for different edge noise structure across graphs. That is, for each and , has mean 0 but otherwise arbitrary distribution , which may depend both on the subject and the specific edge . In the motivating example, this comes the fact that there is high-level anatomical and functional common structure across patients, but different amount of measurement noise and individual heterogeneity. For simplicity, we allow for self-loops, i.e., treat as observed, but we remind the reader that self-loops are generally a moot point for the purpose of asymptotics, since they make a negligible contribution compared to the off-diagonal entries. Throughout, we assume that all parameters, including the number of networks , can depend on the number of vertices , though we mostly suppress this dependence for ease of reading. We write for a generic positive constant, not depending on , whose value may change from one line to the next.

We now present a few examples that satisfy our assumptions, in order of increasing generality, to further illustrate the problem and motivate the rest of the paper. In all cases, the question is how to optimally recover the underlying shared low-rank expectation . We begin with one of the simplest possible settings under our model.

###### Example 1 (Normal measurement errors with subject-specific variance).

Assume that for each , are independent .

A weaker assumption on the edge measurement errors would be to replace a specific distributional assumption with a more general tail bound assumption, such as sub-Gaussian or sub-gamma errors. We refer the reader to Appendix A

for the definition and a few basic properties of sub-Gaussian and sub-gamma random variables, or to

BLM for a more substantial discussion.

###### Example 2 (Sub-gamma measurement errors with subject-specific parameter).

Assume that for each , are independent, mean , sub-gamma with parameters .

Note that the noise terms in Example 2 are no longer required to be identically distributed within a network. We can further relax the sub-gamma assumption to allow for edge-specific tail parameters rather than having a single tail parameter for each subject.

###### Example 3 (Sub-gamma measurement errors with subject- and edge-specific parameters).

Assume that for all , are independent, mean , and for each and , is sub-gamma with parameters .

In all of these examples, there are several inference questions we may wish to ask. In this work, we focus on

1. recovering the matrix ,

2. recovering when ,

3. recovering community memberships when corresponds to a stochastic block model (HolLasLei1983).

Given that the observed graphs differ in their noise structure, the question arises on how to combine these graphs to estimate (or or the community memberships). When the observed graphs are drawn i.i.d. from the same distribution, the sample mean is quite natural, and has been studied in TanKetVogPriSus2016. However, in our more general setting, one would naturally want to somehow de-emphasize noisier observations. We are thus interested in how to choose data-dependent weights with so that the weighted mean estimate is optimal in some sense, or at least provably better than the sample mean .

###### Remark 1 (Positive semi-definite assumption on P).

Throughout this paper, we will make the additional assumption that the expectation is positive semi-definite, so that for some , and focus on the low-rank case . The positive semi-definite assumption can be removed using the techniques in RubPriTanCap2017, at the cost of added notational complexity. Thus, for ease of exposition, we confine ourselves to the case where , bearing in mind that our results can be easily extended to include all low-rank .

### 2.1 Recovering Low-rank Structure

Throughout this paper, we will make use of a particular technique for recovering low-rank structure that is a standard first step in spectral clustering. Following the terminology of

SusTanFisPri2012, we refer to this as adjacency spectral embedding (ASE). Given any adjacency matrix , with a rank expectation , write where is diagonal with entries given by the non-zero eigenvalues of , and the

corresponding orthonormal eigenvectors are the columns of

. The matrix is only identifiable up to an orthogonal rotation, and we take without loss of generality. One can view the rows of as latent positions of the vertices in , with the expectation of an edge between two vertices given by the inner product of their latent positions, where is the -th row of . This view motivates the random dot product graph model (RDPGsurvey), in which the latent positions are first drawn i.i.d. from some underlying distribution on and edges are generated independently conditioned on the latent positions. The natural estimate of the matrix is

 ^X=ASE(A,d)=UAS1/2A∈Rn×d,

where the eigenvalues in and eigenvectors in now come from rather than the unknown . One can show that under appropriate conditions, recovers the matrix up to an orthogonal rotation that does not affect the estimate .

While we do not concern ourselves specifically with the random dot product graph in this paper, we make use of several generalizations of results initially established for that model, which we summarize in Appendix B. We note in passing that in general, selection of the embedding dimension (i.e., estimating the rank of ) is an interesting and challenging problem, but it is not the focus of the present work, and we assume throughout that is known. When is rank , one can show that under suitable growth conditions, the gap between the -th largest eigenvalue and the -th largest eigenvalue of grows with , and the same property holds for the adjacency matrix . As a result, it is not unreasonable to assume that one can accurately determine the appropriate dimension once the number of vertices is suitably large.

## 3 Methods and Theoretical Results

Generally speaking, we are interested in estimators of the form , where are data-dependent nonnegative weights summing to . In particular, we wish to capture how well the rank- eigenvalue truncation of approximates the true rank- expectation . We measure this either by bounding the difference in some matrix norm or by proving that we can successfully recover the matrix based on . In this section, we start from considering fixed rather than data-dependent weights, . The resulting bounds will suggest a certain choice of weights (see Theorem 1), and in Section 4, we will estimate these optimal weights and replace the fixed with data-dependent estimates .

### 3.1 Normal edges with subject-specific variance

Return for a moment to Example

1, in which , independent for all and . This very simple setting suggests a choice for the weights when constructing the matrix . Indeed, a natural extension of the estimator suggested by this setting will turn out to be the right choice in the more complicated settings described in Section 2. The following proposition follows immediately from writing out the joint log-likelihood of the observed networks and rearranging terms.

###### Proposition 1.

Suppose for all , and for some . If for all the edges are i.i.d. normal mean

and known variance

, then the maximum likelihood estimate for (up to an orthogonal rotation) is given by .

Motivated by this proposition, consider the plug-in estimator given by

 ^X=ASE⎛⎝(N∑t=1^ρ−1t)−1N∑s=1^ρ−1sA(s),d⎞⎠, (1)

where is an estimate of the variance of the edges in network . Given , one would naturally use the MLE but first we need to estimate . Let , and plug in to the MLE of subject-specific variance, to obtain, for ,

 ^ρs=∑1≤i≤j≤n2(A(s)−^P(s))2i,jn(n+1). (2)

With edges in each network, the estimates converge to the true variances fast enough that the plug-in estimator in Equation (1) recovers the true at a rate that matches the maximum-likelihood estimator in Proposition 1. The following proposition makes this claim precise.

###### Proposition 2.

Let be independent adjacency matrices with common expectation , where , and suppose that for each , are independent . Let denote the maximum likelihood estimator under the assumption of known variances, as described in Proposition 1.

Suppose that

 N∑s=1ρ−1s=ω(nlog2nλ2d(P)), (3)

Then for all suitably large

, there exists orthogonal matrix

such that

 ∥˚X−X˚V∥2,∞≤Cdλ1/2d(P)(N∑s=1ρ−1s)−1/2+Cdκ(P)nλ3/2d(P)(N∑s=1ρ−1s)−1.

Further, let denote the estimator defined in Equation (1). For all suitably large , there exists orthogonal matrix such that with probability ,

 ∥^X−XV∥2,∞≤Cdλ1/2d(P)(N∑s=1ρ−1s)−1/2+Cdκ(P)nλ3/2d(P)(N∑s=1ρ−1s)−1.

This proposition is a special case of Theorem 2 in Section 4, and thus we delay its proof until then.

### 3.2 Sub-gamma edges

In settings like those in Examples 2 and 3, where there are no longer parameters controlling the noise distribution, we must resort to more general concentration inequalities. Our main tool in this setting is a generalization of a bound on the error in recovering . Given a single adjacency matrix with , a natural estimate of is the ASE of the observed network, . The following lemma bounds the difference between and a certain rotation ( in the result below) of . Recall that this non-identifiable rotation has no impact on tasks such as community estimation. This bound makes no use of a particular error structure, but instead bounds the difference in terms of . This error term can then be bounded using standard concentration inequalities, which we will do below in the proof of Theorem 1.

###### Lemma 1.

Let be a rank matrix with non-zero eigenvalues . Let be a random symmetric matrix for which there exists a constant such that with probability ,

 ∥A−P∥

for all suitably large . Letting , such that for all suitably large , there exists a random orthogonal matrix such that with probability

 ∥^X −XV∥2,∞ ≤∥(A−P)UP∥2,∞λ1/2d(P)+C∥UTP(A−P)UP∥Fλ1/2d(P)+Cd∥A−P∥2κ(P)λ3/2d(P).

This lemma generalizes Theorem 18 in LyzTanAthParPri2017 and Lemma 1 in LevAthTanLyzPri2017. Details are included in Section B of the Appendix.

In order to apply Lemma 1

to the random matrix

, we need to ensure that the spectral condition in Equation (4) holds. Toward that end, the following lemma bounds the spectral error between and in terms of the weights and the sub-gamma parameters.

###### Lemma 2.

Suppose that are independent symmetric adjacency matrices with shared expectation , and suppose that are independent sub-gamma random variables with parameters . Let be fixed weights with . Then with probability at least ,

 ∥∥~A−P∥∥≤15√2η22logn,

where

 η2=2maxi∈[n]N∑s=1n∑j=1w2s(√2νs,i,j+2bs,i,j)2.

Lemma 2 follows from a standard matrix Bernstein inequality (Tropp2012). Details are provided in Appendix C.

While bounds for recovering are also possible under the setting of Example 3, in which is -sub-gamma for each , the bounds are comparatively complicated functions of these parameters. For simplicity, we state the following theorem for the case where the edges in the -th network are independent -sub-gamma random variables and include the more general case in Appendix C.

###### Theorem 1.

Under the setting of Lemma 2, with the additional condition that for some and for all , let . Suppose that the weights and sub-gamma parameters are such that

 N∑s=1w2s(νs+b2s)=o(λ2d(P)nlog2n) (5)

Then with probability there exists an orthogonal matrix such that

 ∥~X −XV∥2,∞ ≤Cdλ1/2d(P)(N∑s=1w2s(νs+b2s))1/2logn+Cdnκ(P)λ3/2d(P)(N∑s=1w2s(νs+b2s))log2n.

This theorem follows from applying Lemma 1 with , using Lemma 2 and Equation (5) to ensure that Equation (4) holds, and applying standard concentration inequalities to control the resulting bound on . Details can be found in Appendix C.

###### Remark 2.

As a quick sanity check, consider the case when all are of constant order. Then setting for yields , whence we see that averaging provides a faster rate of recovery of the matrix compared to using a single network, as we would expect. This is discussed further in Section 4. Further, by considering the case where, say and for , it is easy to see that there exist settings in which weighted averaging has the potential to improve markedly over the naïve sample mean.

## 4 Estimating the Sub-gamma Parameters

In the setting where the -th network has sub-gamma edge noise with parameters common for all edges, the results in Section 3.2 yield bounds for the error in recovering in spectral norm and for recovering in the -norm, when the weights were fixed. The resulting bounds in Lemma 2 and Theorem 1 are both monotone functions of the quantity . Thus, in the absence of stronger assumptions on the parameters controlling the size and spectrum of , the results of the previous section suggest that one should choose the weights so as to minimize . This is achieved by taking

 ws=˚ws=(νs+b2s)−1∑Nt=1(νt+b2t)−1 (6)

for each . Of course, in practice, we do not know the sub-gamma parameters and hence we must estimate them in order to obtain estimates of the optimal weights. Since is (up to a constant factor) an upper bound on the variances of the , a natural estimate of is

 ^ws=^ρ−1s∑Nt=1^ρ−1t, (7)

where, letting be the rank- truncation of for ,

 ^ρs=∑1≤i≤j≤n(A(s)−^P(s))2ij16n(n+1). (8)

Comparison with Equation (2) reveals that this is, in essence, the same estimation procedure that we derived in Section 3.1, extended to the case of sub-gamma edges. The factor of 16 in the denominator comes from replacing the equality with the sub-gamma moment bound (BLM, Chapter 2, Theorem 2.3)

 E(A(s)−P)2ij≤8νs+32b2s≤32(νs+b2s).

Just as in Section 3.1, the estimated weights are such that the plug-in estimate recovers the true matrix at the same rate as we would obtain if we knew the true sub-gamma parameters.

###### Theorem 2.

Under the same setting as Theorem 1, For any positive integers and , define

 γn,ℓ=√d(logN+logn)2+N1/ℓn2−ℓ2ℓlogcℓn√n, (9)

where is an arbitrary constant strictly greater than and define

 τs=E∑1≤i≤j≤n(A(s)−P)2ij16n(n+1) (10)

for all . Suppose that the model parameters grow in such a way that eventually and for some fixed ,

 γn,ℓmaxs∈[N]τ−1s(νs+b2s)=o(1) (11)

and

 ∑Nt=1(νt+b2t)−1∑Nt=1τ−1tN∑s=1τ−1s(νs+b2s)≤Clog2n(logN+logn)2. (12)

Let and be the weights defined in Equations (6) and (7), and define the estimators

 ^X=ASE(N∑s=1^wsA(s),d),   ˚X=ASE(N∑s=1˚wsA(s),d).

Provided that the sub-gamma parameters are such that

 N∑s=1(νs+b2s)−1=ω(nlog2nλ2d(P)), (13)

then with high probability there exist orthogonal matrices such that

 ∥^X−XV∥2,∞ ≤Cdλ1/2d(P)(N∑s=1(νs+b2s)−1)−1/2+Cdκ(P)nλ3/2d(P)(N∑s=1(νs+b2s)−1)−1 and ∥˚X−X˚V∥2,∞ ≤Cdλ1/2d(P)(N∑s=1(νs+b2s)−1)−1/2+Cdκ(P)nλ3/2d(P)(N∑s=1(νs+b2s)−1)−1.

That is, the plug-in estimator recovers at the same rate as the estimator which uses the optimal weights.

The proof is given in Appendix D. Note that is bounded immediately by Theorem 1, and it is the analysis of that requires more care, since the weights now depend on the observed networks.

###### Remark 3.

The quantities ( are, in essence, measures of the tightness of the sub-gamma tail bounds In the simplest case, when are i.i.d.  for some , we have , , and is independent of , recovering Proposition 2.

## 5 Perfect Clustering with Sub-gamma Edges

With Theorem 1 in hand, we obtain an immediate bound on the community misclassification rate in block models by an argument similar to that in LyzSusTanAthPri2014. To do so, we extend the stochastic block model (SBM) beyond the case of binary edges and to the case where multiple weighted graphs are drawn with a shared block structure.

###### Definition 1 (Joint Sub-gamma SBM).

Let and define the community membership matrix by if the -th vertex belongs to community and otherwise. We say that random adjacency matrices are jointly sub-gamma stochastic block model, written

 (A(1),A(2),…,A(N))∼J-Γ-SBM(n,B,Z,{(νs,bs)}Ns=1),

if conditional on , the adjacency matrices are independent with a common expectation (), and within each adjacency matrix , are independent -sub-gamma random variables.

###### Remark 4.

We can think of as a collection of networks which share a block structure, but different networks may have different levels of edge noise. This assumption is particularly natural in neuroscience applications, where, once nodes are mapped onto a common atlas, functional brain regions (e.g., the visual cortex) will behave similarly across patients, but the noise levels, which are affected by individual data collection issues, are expected to vary from patient to patient.

Our theoretical results from Section 3 have an immediate implication for detection and estimation of shared community structure in the joint sub-gamma SBM model. This result generalizes Theorem 6 in LyzSusTanAthPri2014.

###### Theorem 3 (Exact Recovery in the Joint Sub-gamma SBM).

Let where is fixed with having distinct rows given by . Let , where are fixed non-negative weights summing to , and define . Suppose that obey the growth conditions in Equation (5) holds and that the size of the smallest community grows as

 nmin=ω⎛⎝d2(N∑s=1w2s(νs+b2s))+d2(N∑s=1w2s(νs+b2s))2log4n⎞⎠. (14)

Let be the true underlying assignment of vertices to communities, so that if and only if , and let be the estimated community assignment function based on an optimal -means clustering of the rows of . Then the communities are recovered exactly almost surely, i.e., as ,

 P[minπ∈SK|{i∈[n]:π(τ(i))≠^τ(i)}|→0]=1 (15)
###### Proof.

The result follows from Theorem 1 and the fact that under the stochastic block model with fixed parameters, we have (see, for example, Observation 2 in LevAthTanLyzPri2017). The proof is otherwise a direct adaptation of the proof of Theorem 6 in LyzSusTanAthPri2014, and we omit the details. ∎

###### Remark 5 (Growth requirement on community size).

Each of the communities has a corresponding cluster center in , given by one of the rows of , say, . The lower bound on in Equation (14) ensures that if an optimal -means solution fails to select a cluster near to each of the (), then a solution with smaller objective value exists, a contradiction. In essence, the growth condition ensures that an optimal solution to the -means objective cannot afford to miss any of the cluster centers.

###### Remark 6 (Extensions of Theorem 3).

This result can be generalized in two natural directions. The first would be to allow for the communication matrix to depend on . Generally speaking, provided the entries of do not go to zero too quickly, the quantities and will grow quickly enough to ensure that , and Theorem 3 still holds. This extension is straightforward and we omit the details. Another generalization would be to expand the class of clustering algorithms for which the perfect recovery condition in (15) holds. When there are clusters, the matrix has distinct rows, say, , so that for all , . By Theorem 1, provided the parameters grow at suitable rates, for all suitably large , the rows of lie inside disjoint balls centered at these points. This yields a natural partition of the rows of , and any clustering algorithm that successfully recovers this partition can correctly recover the community assignments. We leave it for future work to characterize the clustering algorithms that obtain this recovery guarantee.

###### Remark 7 (Incorporating Sparsity).

Given its importance in network results, the reader may wonder about the role of sparsity in the sub-gamma edge noise model under consideration. A natural way to extend the traditional notion of sparsity to the weighted edge setting is to consider , where is a sparsity parameter. That is, the latent cluster centers are scaled uniformly toward the origin as grows. Under this regime, the condition in Equation (5) is satisfied so long as

 ∑sw2s(νs+b2s)=o(q2nlog2n).

Since a Bernoulli with success probability is a -sub-gamma random variable, when this becomes , which is a stricter requirement than the more typical sparsity growth rate of for some constant . This demonstrates that while the extension to sub-gamma edge distributions allows us to handle a much larger class of noise models, our general bounds come at a price.

## 6 Numerical experiments

We now turn to an experimental investigation of the effect of weighted averaging on low-rank estimation. We begin with simulated data, and then turn to a neuroimaging application.

### 6.1 Effect of weighted averaging on estimation

We begin by investigating the extent to which weighted averaging improves upon its unweighted counterpart in the case where we observe multiple weighted graphs with network-specific edge variances, as in Examples 1 and 2.

We consider the following simulation setup. On each trial, we generate the rows of independently and identically distributed as , where

 Σ=⎡⎢⎣321232123⎤⎥⎦,

and take . Next, independently for each network , we draw edge weights independently from a -mean Laplace distribution with variance . We chose this distribution because it has heavier tails than the Gaussian while still belonging to the class of sub-gamma random variables. Similar results to those presented here were also observed under Gaussian-, exponential- and gamma-distributed edge errors. Without loss of generality, we take the first network to have edge variance while all other networks have unit edge variance, so that for

. Thus, the first network is an outlier with higher edge-level variance than the other observed networks. We compare weighted and unweighted averaging, with weights estimated as described in Section

4, to obtain the weighted average , and the unweighted average . Rank- eigenvalue truncations of each of these yield estimates and , respectively. We evaluate the weighted estimate by its relative improvement,

 ∥¯P−P∥−∥~P−P∥∥¯P−P∥,

for three different matrix norms, Frobenius, spectral, and -norm. We can think of this quantity as a measure of the outlier’s influence.

We repeat this experiment for different values of the number of vertices , the number of networks and the outlier variance , and average over 20 replications for each setting. Figure 1 summarizes the results for Frobenius norms; the trends for the other two norms are similar. Figure (a)a shows relative improvement as a function of the outlier variance , for different values of , with fixed. Figure (b)b shows relative improvement as a function of while holding the outlier variance fixed, again for different values of . Figure 1 suggests two main conclusions. First, the relative improvement is never negative, showing that even when the outlier variance is small, there is no disadvantage to using the weighted average. Second, even a single outlier with larger edge variance can significantly impact the unweighted average. Similar trends to those seen in Figure 1 apply to the error in recovering , measured in -norm.

### 6.2 Application to neuroimaging data

We briefly investigate how the choice of network average impacts downstream analyses of real data. We use the COBRE data set (AineETAL2017), a collection of fMRI scans from 69 healthy patients and 54 schizophrenic patients, for a total of subjects. Each fMRI scan is processed to obtain a weighted graph on vertices, in which each vertex represents a brain region, and edge weights capture functional connectivity, as measured by regional averages of voxel-level time series correlations. The data are processed so that the brain regions align across subjects, with the vertices corresponding to regions in the Power parcellation (PowerETAL2011).

In real data, we do not have access to the true low-rank matrix , if such a matrix exists at all. Thus, to compare weighted and unweighted network averaging on real-world data, we compare their impact on downstream tasks such as clustering and hypothesis testing. Even for these tasks, the ground truth is typically not known, and thus it is not possible to directly assess which method returns a better answer. Instead, we will check whether the weighted averages yield appreciably different downstream results, and point to the synthetic experiments as evidence that the weighted network average is likely the better choice.

We begin by examining the effect of weighted averaging on estimated community structure. We make the assumption once again that these networks share a low-rank expectation , with . We will compare the behavior of clustering applied to the unweighted network mean against the behavior of clustering applied to its weighted counterpart . In practice, the model rank is unknown and must be estimated from the data. While this model selection task is important, it is not the focus of the present work, and thus instead of potentially introducing additional noise from imperfect estimation, we simply compare performance of the two estimators over a range of values of . For each fixed model rank , we first construct the estimate , and then estimate communities by applying -means clustering to the rows of , taking the number of communities to be . Denote the resulting assignment of vertices to communities by , and let denote the clustering obtained by -means applied to the rows of . We measure the difference between these two assignments by the discrepancy

 δ(c,c′)=n−1minπ∈SKn∑i=1I{ci≠π(c′i)}, (16)

where denotes the set of all permutations of the set . This discrepancy measures the fraction of vertices that are assigned to different communities by and after accounting for possible community relabeling. The optimization over the set of permutations in (16) can be solved using the Hungarian algorithm (Kuhn1955).

Figure 2 shows the discrepancy as a function of the number of communities . For simplicity, we take the number of communities equal to the model rank , though we note that similar patterns appear when we allow and to vary separately. In order to account for the possibility that the healthy and schizophrenic populations display different community structures, Figure 2 shows the results of the community estimation experiment just described, applied only to the 69 healthy patients in the data set. A similar pattern persists if we restrict instead to the schizophrenic patients, and if we pool the healthy and schizophrenic patients. Each data point is the mean of 20 independent runs of

-means with randomly drawn initial values. The dotted lines indicate two standard errors over these 20 runs. It is clear from the plot that for a wide array of model choices, the weighted and unweighted average networks result in assigning a non-trivial fraction of the vertices to different clusters. Thus switching from unweighted to weighted averaging is likely to have considerable effects on downstream inference tasks pertaining to community structure.

The difference in task performance between these two different network averages persists even if we do not perform clustering, as we now show. The Power parcellation (PowerETAL2011) is one of many ways of assigning ROIs (i.e., nodes) to larger functional units, typically called functional regions. The Power parcellation assigns each of the 264 ROIs (i.e., nodes) in the COBRE data set to one of 14 different communities, corresponding to functional regions, with sizes varying between approximately 5 and 50 nodes per community. Table 1 summarizes the 14 functional regions and their purported functions. We refer to a pair of functional regions , for every as a network cell. Thus, the communities in the Power parcellation yield 105 cells. For a given parcellation, a problem of scientific interest is to identify which network cells, if any, are different