Clustering of heterogeneous populations of networks

07/15/2021 ∙ by Jean-Gabriel Young, et al. ∙ 0

Statistical methods for reconstructing networks from repeated measurements typically assume that all measurements are generated from the same underlying network structure. This need not be the case, however. People's social networks might be different on weekdays and weekends, for instance. Brain networks may differ between healthy patients and those with dementia or other conditions. Here we describe a Bayesian analysis framework for such data that allows for the fact that network measurements may be reflective of multiple possible structures. We define a finite mixture model of the measurement process and derive a fast Gibbs sampling procedure that samples exactly from the full posterior distribution of model parameters. The end result is a clustering of the measured networks into groups with similar structure. We demonstrate the method on both real and synthetic network populations.



There are no comments yet.


page 8

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Many modern network analyses involve large corpora of networks defined on a constant set of nodes. Examples include repeated observations of proximity networks eagle2009inferring

, longitudinal studies of social relationships among fixed groups of individuals 

butts2003network , and measurements of neural connectivity across patients sporns2010networks .

As pointed out in a number of recent studies, repeated measurements offer a unique opportunity to carry out robust network analyses fienberg1985statistical ; priebe2015statistical ; newman2018network ; newman2018estimating ; peixoto2018reconstructing ; le2018estimating ; tang2018connectome ; wang2019common ; young2020bayesian ; young2021reconstruction

. With such large, rich data sets it is possible to extract information not available from a single network observation. For example, one can use repeated observations to infer a latent network structure despite the presence of measurement error—given multiple noisy realizations of the same network it becomes possible to estimate the network most likely to have generated the measurements. In the case of repeated observation of social interactions, for instance, one expects friends to interact more often than strangers, and repeated observations, properly analyzed, can thus reveal the network of friendships 

newman2018network . In neuroscience applications, one might focus on the identification of a characteristic brain connectivity pattern for a set of patients with a common condition, which is uncovered by combining the measurements made on all patients sporns2010networks .

Implicit in these methods is the assumption that there is a single underlying network that determines all the observations, but this need not be the case. Patients undergoing brain scans might have different underlying conditions, for instance, leading to different measured connectivity patterns from one patient to another bassett2018understanding . Likewise, acquaintances might interact differently based on the time of the day or location, leading for example to different social networks during working hours and outside of work.

Treating a population of observed networks as realizations of the same underlying graph might thus adversely affect our analysis of these systems, just as ignoring the multimodal nature of a distribution of numbers may lead to mischaracterization of a study sample. In some cases we know when the underlying network changes, such as when measurements are made on different days of the week, in which case the data can be divided up and processed separately using standard techniques. Often, however, the underlying generative processes are unknown, making it hard to split the sample manually. For example, if the brain activity of a patient is measured as they are put under anesthesia, we know that the activity will undergo a change from one state to another but the transition point itself is unknown. It could be that the activity changes well before a visible loss of consciousness, or well after, so one cannot rely on direct observations to determine the change point; one needs to infer the transition from the networks. In this and other similar cases, one must simultaneously determine whether the data is best described by a single underlying network or many, what these networks are, and which measurements should be attributed to which underlying network.

In this paper we describe a framework for modeling heterogeneous populations of networks. We view each network in a population as a noisy realization of one underlying network out of multiple possibilities, which we call modes. We model the data as being generated from a superposition or finite mixture of graph distributions titterington1985statistical ; mclachlan2004finite , which provides sufficient flexibility to accommodate highly heterogeneous populations. We demonstrate how our methodology allows us to simultaneously infer the underlying networks and cluster the observed networks such that each cluster of observations consists of noisy realizations of a single underlying network mode. Our framework also provides a natural means for selecting the number of modes that describe a given set of networks.

Previous work on statistical modeling of network measurements has mostly focused on the unimodal case, where it is assumed that a population of networks is best described by a single underlying network, of which the observations are noisy realizations fienberg1985statistical ; kenny1984social ; banks1994metric ; butts2003network ; newman2018network ; peixoto2018reconstructing ; lunagomez2020modeling ; josephs2021network . Some recent approaches have explicitly incorporated multiple modes using latent space representations durante2017nonparametric ; nielsen2018multiple ; wang2019joint ; arroyo2019inference , exponential random graph models yin2019finite , or parametric network models signorelli2020model . Further afield are studies aiming to cluster the layers of multilayer networks (where layers can be categorical or temporal), for instance when trying to detect change points peel2015detecting ; peixoto2018change —abrupt changes in sequences of network snapshots—or as a side-effect of pooling information across network layers when clustering their nodes peixoto2015inferring ; stanley2016clustering . Among these works the approach most closely related to our own is perhaps that of La Rosa et al. la2016gibbs , who extend unimodal metric models of network populations banks1994metric ; lunagomez2020modeling to the multimodal case with finite mixtures. In the unimodal case all possible networks are assigned a distance to the mode and networks closer to the mode are assumed more likely. The extension described by La Rosa et al. employs multiple modes and can thus more faithfully model diverse populations of networks. While mathematically elegant, however, this approach has some shortcomings. For example, the model is not easily estimated when the distance between networks is difficult to compute bento2019family . The approach of La Rosa et al. also does not differentiate between false-positive and false-negative rates, which may differ substantially and change the composition of the corresponding clusters of networks (see Appendix A).

This paper is organized as follows. First, we briefly review a previously described unimodal model of homogeneous populations of networks, then introduce our model for multimodal heterogeneous populations by building upon the unimodal case. We then discuss statistical estimation of the model using a fast Gibbs sampling procedure and demonstrate using synthetic data that we can recover model parameters when the level of noise in the measurements is sufficiently low. We further demonstrate our methods with an example application to a real-world network population from a longitudinal social network study.

Ii The model

We consider an experiment or observational study in which networks are measured on the same set of nodes. The networks could record, for instance, connectivity patterns in brain scans of a cohort of patients, in which a node represents a region of the brain and edges indicate when two regions are sufficiently connected in a given patient bassett2017network . Or the population of networks could encode a set of relationships among a group of people such as students in a school with nodes representing the students and edges indicating when two students are within a certain physical distance of one another during a specified time interval stehle2011high . We record the networks as a set of adjacency matrices indexed by , where is an matrix with element if there is an edge between nodes and in network , and otherwise. For simplicity of presentation we will assume the networks to be undirected so that , but our methods are easily adapted to directed networks.

ii.1 Homogeneous populations of networks

A variety of approaches have been proposed for analyzing multiple network observations of this kind when the each observation is believed to be a noisy realization of a single underlying “ground truth” network—see for example butts2003network ; banks1994metric ; newman2018network ; le2018estimating ; peixoto2018reconstructing ; lunagomez2020modeling . As a starting point for our discussion we adopt the approach of butts2003network ; newman2018network in which one defines a model to describe how the observed networks are related to the underlying ground truth. The particular model in this case has two parameters: is the a true-positive rate for the edges and is the false-positive rate. In other words, if there is an edge connecting two nodes in the adjacency matrix 

of the ground truth network, then that edge will be observed with independent probability 

in a noisy realization , while an absent edge in will be mistakenly observed with probability . The probability of observing a complete network  under this model is then given by


where the product is over all (unordered) pairs of nodes. When the individual observed networks in are independent of one another the likelihood of the complete population  is


where is the total number of times an edge is observed between node and in the samples.

With Eq. (II.1) in hand, one can simulate observed networks or perform inference about the generative process and, for instance, estimate from the data . Applications of this kind are studied at length in Refs. butts2003network ; newman2018network ; peixoto2018reconstructing ; young2020bayesian among others. Our goal here is to extend the model to accommodate heterogeneous network populations, making it suitable for inference about a broader range of network data.

ii.2 Heterogeneous populations of networks

Consider again our sample  of networks sharing the same set of nodes. We can allow for heterogeneity in these samples by letting the individual networks be noisy realizations of different underlying network modes , rather than just a single mode as before. In the context of network neuroscience, for example, we could repeatedly measure the brain of a single patient undergoing a transition from the conscious to unconscious state, which would imply that . Alternatively, we might have a population of patients, some of whom have a neurological disorder such as Alzheimer’s disease and some of whom do not.

Each sample will be assigned to a network mode  so that is a noisy realization of . We use a variable to encode this information thus:


With this notation we can denote the assignment of samples to modes as a matrix , whose rows correspond to samples and columns to reference networks. Each sample corresponds to precisely one reference network, so the rows of satisfy . Further, the column sums correspond to the number of samples in generated from mode .

To model the now heterogeneous population of networks we use the same generative process as before, Eq. (1), but with a set  of multiple underlying networks instead of just one. Since different modes may display different rates of measurement error, we allow each mode to have its own associated true- and false-positive rates and . For notational simplicity we henceforth denote the sets of parameters and , as well as other model parameters we will introduce shortly, collectively as .

The likelihood of this model is analogous to that of Eq. (II.1) and is given by


The main difference is the product over modes and the inclusion of the variable to encode the modes. Performing the product over network samples , we can also write this as


where is the number of interactions observed between and across all noisy realizations of the underlying network .

In most cases, however, the mode assignments will be unknown—if not then we could just divide up the the networks into their groups and model them separately as disjoint homogeneous populations. We can eliminate by marginalizing over its possible values in Eq. (4) thus:


where the sum is over all possible assignment matrices  and

is a prior probability on the assignments. We choose the convenient categorical prior


with the prior probability that a sample is assigned to mode (so that ), and the convention that now additionally includes the parameters . Substituting Eqs. (4) and (7) into Eq. (II.2), we then find that




which we can think of as the elements of four matrices measuring agreement between samples and modes. (Refer to Table 1 in the Appendix for a summary of notations used in this paper.) For example, counts the number of edges that are simultaneously absent in sample  and mode  and can be thought of as an entry of a matrix  that records such numbers for all modes and samples.

The mixture model appearing in Eq. (8) is sufficiently flexible to account for complicated structure in populations of graphs, analogous to the flexibility seen in mixture models over distributions of numbers mclachlan2004finite . In Fig. 1 we show a population of small networks, generated from Eq. (8) with three modes (). As a visualization of the workings of the model we embed the networks it generates in a two-dimensional space (using multidimensional scaling kruskal1978multidimensional applied to Hamming distance) and then compute the density of network samples in that space, which is shown in the contour plot. From this plot one can see that the networks generated form clear clusters around the three representative modes, with the model introducing some noise.

Figure 1: Contour map of a population of networks generated from the heterogeneous model of Eq. 8, as embedded in a two-dimensional space using multidimensional scaling. Darker colors indicate a higher density of networks and the three peaks correspond to the modes used in the model, which are illustrated at the top. For this simple example, we draw networks from each of the three modes with identical probability and use the same true- and false- positive rates for all modes, namely and for , 2, 3. As illustrated at the bottom of the figure, a network resembling mode 1 sits close to that mode in the low-dimensional space.

Iii Estimation of the model using Gibbs Sampling

In a typical application of the model, we observe a population of networks and we want to determine , , and assuming that was generated from the model as described in Section II.111We may know some of these parameters in which case the problem can be solved more easily. If we can successfully infer the parameters, then we will have not only denoised the network samples  by finding each network’s mode, but we will have also clustered the data into homogeneous classes of networks—two tasks of considerable scientific interest.

Here we adopt a Bayesian approach to the estimation problem, which allows us to draw directly from standard model-based clustering techniques mclachlan2004finite . The starting point for this approach is the posterior distribution for the quantities of interest:


where collectively denotes the rate parameters and  as well as the mixture weights , and is a prior over all the unknown quantities. Note that the number  of reference networks does not appear in the equation; here we use a parametric approach and treat

as a known quantity to be handled separately. There are a number of more complex Bayesian non-parametric model alternatives one could consider that would permit Gibbs sampling and allow

to vary, including Dirichlet process mixture models neal2000markov , but we will not explore these options here. Instead, the procedure used to infer the posterior distribution in Eq. (10) can be run for multiple values of  and the optimal value can be chosen by identifying which value of produced the samples with the highest probability. One can add a prior distribution on to penalize parametrizations with a greater number of clusters, thereby preventing uninformative solutions with from being chosen. However, the prior probability in Eq. (7) already penalizes solutions with , and this seems sufficient for making reasonable inference in practice—see the results in Sec. IV.

We have already discussed the likelihood in Eq. (10), which leaves us with the task of specifying the prior distribution to complete the model. We use a prior that factorizes in the form


The first probability on the right-hand-side, , is specified in Eq. (7). It is the prior over assignments

given the vector of group assignment probabilities 

, which we have included in . For the other two probabilities we select the simplest priors possible. For the networks, we set


where is the total number of edges in all reference networks and , which we also now include in for notational simplicity, is the prior probability of an edge being placed between each pair of nodes . For the parameters —i.e.,  and

—we use uniform priors so that their overall contribution to the posterior distribution is a multiplicative constant. (One can instead opt for conjugate prior distributions for these quantities, which would be the beta and Dirichlet distributions in this case, without much mathematical complication, see Appendix 


With these priors in place, estimation of the heterogeneous network model in Sec. II boils down to either finding point estimates of the parameters in Eq. (10), or to computing averages over this distribution. Here, we propose a fast sampling algorithm that allows us to accomplish both of these tasks. The algorithm returns a series of samples , each giving a possible choice of the network modes, assignment matrices, and parameters from which the networks could have been generated. These samples can in turn be used to approximate averages over the posterior distribution appearing in Eq. (10), to find estimates of the model parameters, or to assess our uncertainty about the exact values of the parameters.

iii.1 Gibbs sampling

We use a Gibbs sampling method to generate our samples geman1984stochastic . The Gibbs sampler operates by cycling through the parameters of the model and generating values for each set of parameters in turn, while conditioning on the values of the remaining parameters. Samples drawn from this random process at sufficient large intervals will be distributed according to the joint posterior distribution of Eq. (10).

The conditional probability distributions needed for Gibbs sampling can be calculated from the model’s likelihood given in Eqs. (

4) and (5) and from the priors of Eqs. (7) and (III). First, we have the probability of a given set of reference networks , conditioned on the data and the values of and :


which we can transform into an explicit expression by substituting in the compact likelihood of Eq. (5) and our prior over . We find




is the probability that nodes and are connected in the th reference network, when we know the cluster assignments and parameter values.

Likewise, the conditional probability of the cluster assignments can be calculated as


We already have an expression for the denominator—it is the mixture likelihood of Eq. (8)—and the numerator can be calculated by combining the prior in Eq. (7) with another form of Eq. (4),


where the four matrices are the ones defined in Eq. (9). Putting everything together we find that



can be interpreted as the probability that sample is a noisy version of mode , conditioned on known values for the modes and parameters.

The update equation for the remaining parameters is given by


The integral appearing in the denominator has a closed-form solution, but we don’t actually need to calculate it since our goal is to sample from the distribution, so we only need to know how Eq. (19) depends on , and the denominator is, by definition, not a function of . Upon substituting the likelihood and priors into Eq. (19) we find that the conditional distribution for factorizes as


where the matrices


quantify the edge agreement between the mode and all the network samples in its corresponding cluster (see Table 1 for a summary).

Equations (14), (18), and (III.1) provide us with all the conditional distributions we need to implement the Gibbs sampler. As mentioned previously, we cycle through the variables , and , generating new samples of each from Eqs. (14), (18), and (III.1) respectively while using the most recent samples of the other parameters as input.

The conditional distributions all allow for straightforward sampling. As Eq. (14) and (18

) show, the edges of the modes and the cluster assignments are determined by independent categorical random variables. Equation (

III.1) further shows that the parameters are independent from one another (when we condition on and 

), and that they are governed by either a beta distribution (

, and ) or a Dirichlet distribution (). Hence, generating updated parameters for the Gibbs sampler is straightforward since all variables can be simulated with standard univariate sampling methods available in most statistical software packages.

iii.2 Implementation

The Gibbs sampling approach allows us to sample from the full posterior distribution of Eq. (10), which is sufficient for estimating any expectation value of interest. In its naive form, however, the Gibbs sampler is quite slow. For example generating new modes takes steps which can become an issue for larger networks. Fortunately there are some computational tricks we can use to speed up the calculation substantially.

The first observation we make is that the variables  are related, such that they need not be all computed every time the modes are updated. Using Eqs. (9) one can show that


where is the number of edges in network mode , and is the number of edges in the sample . By pre-computing from the data once and using the above relations we can recover all the matrices  in terms of the edge agreements  and the counts . Furthermore, these latter quantities can be computed rapidly by traversing edge lists, which can be done in order time for sparse networks with edges, using for example hash tables to store the lists. Thus, updates of only take linear time.

A second set of observations allows us to accelerate the calculation by removing redundancies from the processing of network modes. First, we notice that, conditioned on the modes, the edges of  are independent identically distributed Bernoulli variables as shown in Eq. (14). Second, we observe that the probabilities needed to generate these edges are identical for all node pairs  that occur a given number of times  in networks belonging to cluster , as shown by Eq. (15). Thus, the fundamental quantity needed to track and update the modes is, in effect,  rather than the modes themselves.

To make use of these observations, we denote the set of edges that occur exactly times in cluster as


and the number of unique values of observed in cluster  for the current Gibbs sample as . For each cluster , we can then replace the Bernoulli trials needed to generate a sample by a fast two-step process. First, for each of the unique values of , we generate the number of edges that will appear in the mode out of independent trials with a probability of success of corresponding to the value of for the edge . Then we choose edges uniformly at random from the set and add these edges to the mode adjacency matrix , which we store using an edge list implemented with a hash table. We repeat this sampling procedure for all values in the current cluster sample and repeat for all clusters .

For a given cluster , this two-step process can be carried out in operations on average, which equals only in the worst case when every edge occurs a different number of times  in every cluster. For a more typical population of sparse networks with edges, most pairs of nodes  will never be connected in any of the samples, and thus the sets are typically much larger than the sets  for , which will have about edges. As a result our alternative sampling scheme offers a drastic improvement in computational complexity, since we do not need to sample all potential edge pairs in  and the low value of  ensures that we will rarely sample many edges from this set.

Iv Results

In this section we demonstrate the performance of our method using both synthetic (computer-generated) and real-world data sets.

iv.1 Synthetic data

As a first demonstration of our approach and the proposed model, we test its ability to infer the parameters of a population of networks generated using the model itself. We use a two-mode configuration with the modes 1 and 3 shown in Fig. 1 as the ground-truth, and a symmetric population of networks with , such that about half of the networks in the sample are noisy samples of mode 1 and the other half are samples of mode 3. We control the difficulty of the task by varying the true-positive rate and the false-positive rate , with


identical for all clusters . The parameter is the probability that is a false positive or a false negative—in other words, the probability of flipping an edge/non-edge in . For each experiment we generate a population of , 50, 100, or 200 networks, which we feed into our algorithm as input .

Figure 2: Recovery performance for (a) the partitions , (b) the model parameters , and (c) the modes , for a bimodal population of networks drawn from the modes 1 and 3 shown in Fig. 1 with , across a range of flip probabilities (Eq. 26) and population sizes .

For every synthetic data set  we generate samples using the Gibbs sampling algorithm and from these samples compute point-estimates . For the modes and parameters, we use the posterior averages


where is one sample of the modes and is one sample of the parameters. We summarize the cluster labels as a cluster assignment such that


where is the cluster label of network  that appears most frequently in the posterior marginal distribution.

Once we have these point estimates we quantify the quality of the reconstruction by comparing them against , the parameters used to generate the synthetic networks in the first place. For the modes  we compute the total number of missing and spurious edges across all modes (in other words the sum of the Hamming or distances between the graphs in and ). For the parameters we compute the distance between the parameter vectors and . And for the cluster labels  we compute the variation of information meilua2007comparing between the estimates and the true values.

We vary the sizes of our synthetic populations over , 50, 100, 200 to test how population size affects our reconstruction performance. For each size we generate 50 realizations of the

networks and take the average reconstruction quality over these synthetic populations. The standard errors associated with the average reconstruction performances are shown as error bars in the plots. We also use a beta prior for the density

with , (see Appendix B for details), which allows for more consistent estimates of the mode edge densities in the high noise region where .

Figure 3: Results for the reality mining proximity networks described in the text, in which proximity measurements were binned in day-long intervals. (a) Modal networks  with edge widths proportional to the estimates . (b) The fraction of networks in each cluster that were sampled on a particular day of the week (left) and during a particular month (right). (c) Average posterior log-probability (Eq. 10) across Gibbs samples as a function of the number of clusters .

The results of these experiments are shown in Fig. 2. In panel (a) we show the reconstruction performance for the cluster assignments , which degrades gradually as we increase the flip probability and does not depend strongly on the number of networks in the population. In panel (b) we show the analogous curves for the parameters , which also become gradually more difficult to recover as we increase the noise level, up to a certain point, but then become easier to recover. The reason for this is that in the completely noisy limit (), all the information about the mode networks and clusters is destroyed, and so we end up with completely randomized clusters and modes. But this means we are likely to end up with values of and near the correct value of 0.5, since this is the value for clusters with no structure. We are also likely to infer values of the  close to the correct value of 0.5, as this is the most likely size distribution of the clusters if they are chosen completely at random. As we approach this regime, therefore, we begin to see improvement in the performance for  due to these effects.

Of all the model variables, the modes  are the easiest to recover for low levels of noise, as demonstrated by the relatively long flat portions of the curves in panel (c). In this case, the modes are recovered near-perfectly for flip probabilities less than some transitional value which depends on . Beyond this point the noise introduced into the model through the flip probability  begins to blur the clusters of networks and introduces errors in their recovery. As demonstrated by the ordering of the curves in all panels, the reconstruction becomes easier as we increase , which is expected since each new network sample gives us more data about the latent structure.

The results of Fig. 2 tell us that our Gibbs sampler is capable of correctly inferring all of the model parameters so long as there is well-defined cluster structure in the data, and that inference becomes more reliable as the size of the network population grows. In the following section we demonstrate that our estimation procedure can also successfully cluster populations of real networks and identify multiple distinct modes under real-world conditions.

iv.2 Social network

As our second example we study a network of physical proximity measurements among a group of college students and faculty in the “reality mining” study of Eagle and Pentland eagle2006reality . Over a nine-month period, study participants carried mobile phones equipped with special software that recorded when they were in close proximity with one another using Bluetooth radio technology. From these data we construct networks in which an edge between two participants indicates that they were observed in proximity at least once on a given day. The result is one network for each day of the study, with nodes in each network and edges ranging in number from 1 to 418.

The results of applying our Gibbs sampling algorithm with samples to this population of networks are shown in Fig. 3. In order to select the number of clusters , we run the algorithm at multiple values of 

, identifying the value for which the average posterior probability is maximized as the optimal 

. In panel (a) of the figure we display the three very distinct modes  found for this population, with the edge widths proportional to the inferred values of . In panel (b) we show histograms of the fractions of networks in each cluster that were sampled on a given day of the week (left) and during a given month (right). We can see clear separation in these histograms, indicating that the modes correspond roughly to weekends (yellow), weekdays during the fall semester (orange), and weekdays during the spring semester (blue). We see a clearly distinct structure in the networks during semester weekdays versus weekends, with a denser network observed in the first semester than the second. On the other hand, the weekend network mode contains a wider variety of edges with non-negligible weight. These observations are consistent with a typical university course schedule in which students associate primarily with classmates during classes but with a wider selection of acquaintances outside of work. In panel (c), we show the mean log-probability of the Gibbs samples as a function of the number of modes , which can we see has a peak at (the value chosen for our posterior estimates).

We note that the modes have significantly higher density than the individual observed networks with their corresponding clusters, as demonstrated by the high false negative rates and low false-positive rates . This indicates that the set of likely edges in each mode is much larger than the number of interactions observed in any single time window. This is not surprising—most people don’t see all of their acquaintances every day. Nonetheless, it emphasizes the ability of our method to infer a plausible set of “true” connections from very noisy individual network measurements, a task that would not be easy to do by clustering networks in a naive way.

V Conclusions

In this paper we have described a method for statistical analysis of heterogeneous network populations, as seen in applications involving repeated or longitudinal measurements of a single network, or observation of a fixed set of nodes across multiple study subjects. We have proposed a generative model for heterogeneous networks in which networks divide into a number of clusters or modes and derived a fast, exact Gibbs sampling procedure for sampling the resulting posterior distributions. We have demonstrated our model and estimation methods first on synthetic network data, finding that parameter recovery with the Gibbs sampler is possible as long as variability within the clusters of networks is not too high. We then analyze a real-world network population from a longitudinal proximity network study and find that this population is best described by three underlying mode networks rather than a single ground-truth network as is assumed in most network reconstruction approaches. Our model is capable of encompassing a greater variety of network data than such unimodal approaches and provides a natural framework for simultaneously clustering and denoising a set of networks.

The framework we describe could be extended in a number of ways. For the sake of simplicity we have here assumed that our networks are undirected, unweighted, and that that all edges in a given mode have identical error rates. There are other possibilities, however, including adaptations for directed or weighted edges, more complex noise models such as ones with individual error rates for each node or edge, or the use of a parametric model for the mode networks to allow for simultaneous inference of large-scale structure such as communities. We leave exploration of these avenues for future work.


We thank Guillaume St-Onge for helpful comments. This work was funded in part by the James S. McDonnell Foundation (JGY), the US Department of Defense NDSEG fellowship program (AK), and the US National Science Foundation under grant DMS–2005899 (MEJN).

After this work was completed we learned of very recent related work by Mantziou et al. A preprint of their work can be found at mantziou2021 .


  • (1) N. Eagle, A. S. Pentland, and D. Lazer, Inferring friendship network structure by using mobile phone data. Proc. Natl. Acad. Sci. U.S.A. 106, 15274–15278 (2009).
  • (2) C. T. Butts, Network inference, error, and informant (in)accuracy: A Bayesian approach. Soc. Netw. 25, 103–140 (2003).
  • (3) O. Sporns, Networks of the Brain. MIT Press, Cambridge, MA (2010).
  • (4) S. E. Fienberg, M. M. Meyer, and S. S. Wasserman, Statistical analysis of multiple sociometric relations. J. Am. Stat. Assoc. 80, 51–67 (1985).
  • (5) C. E. Priebe, D. L. Sussman, M. Tang, and J. T. Vogelstein, Statistical inference on errorfully observed graphs. J. Comput. Graph. Stat. 24, 930–953 (2015).
  • (6) M. E. J. Newman, Network structure from rich but noisy data. Nat. Phys. 14, 542–545 (2018).
  • (7) M. E. J. Newman, Estimating network structure from unreliable measurements. Phys. Rev. E 98, 062321 (2018).
  • (8) T. P. Peixoto, Reconstructing networks with unknown and heterogeneous errors. Phys. Rev. X 8, 041011 (2018).
  • (9) C. M. Le, K. Levin, and E. Levina, Estimating a network from multiple noisy realizations. Electron. J. Stat. 12, 4697–4740 (2018).
  • (10) R. Tang et al., Connectome smoothing via low-rank approximations. IEEE Trans. Med. Imaging. 38, 1446–1456 (2018).
  • (11) L. Wang, Z. Zhang, and D. Dunson, Common and individual structure of brain networks. Ann. Appl. Stat. 13, 85–112 (2019).
  • (12)

    J.-G. Young, G. T. Cantwell, and M. E. J. Newman, Bayesian inference of network structure from unreliable data.

    J. Complex Netw. 8, cnaa046 (2020).
  • (13) J.-G. Young, F. S. Valdovinos, and M. E. J. Newman, Reconstruction of plant–pollinator networks from observational data. Nat. Commun. 12, 3911 (2021).
  • (14) D. S. Bassett, C. H. Xia, and T. D. Satterthwaite, Understanding the emergence of neuropsychiatric disorders with network neuroscience. Biol. Psychiatry: Cogn. Neurosci. Neuroimaging 3, 742–753 (2018).
  • (15) D. M. Titterington, A. Smith, and U. E. Makov, Statistical Analysis of Finite Mixture Distributions. Wiley, New York, NY (1985).
  • (16) G. J. McLachlan and D. Peel, Finite Mixture Models. Wiley, New York, NY (2004).
  • (17) D. A. Kenny and L. La Voie, The social relations model. Adv. Exp. Soc. Psychol. 18, 141–182 (1984).
  • (18) D. Banks and K. Carley, Metric inference for social networks. J. Classif. 11, 121–149 (1994).
  • (19) S. Lunagómez, S. C. Olhede, and P. J. Wolfe, Modeling network populations via graph distances. J. Am. Stat. Assoc. (2020).
  • (20) N. Josephs, W. Li, and E. D. Kolaczyk, Network recovery from unlabeled noisy samples. Preprint arXiv:2104.14952 (2021).
  • (21) D. Durante, D. B. Dunson, and J. T. Vogelstein, Nonparametric bayes modeling of populations of networks. J. Am. Stat. Assoc. 112, 1516–1530 (2017).
  • (22) A. M. Nielsen and D. Witten, The multiple random dot product graph model. Preprint arXiv:1811.12172 (2018).
  • (23) S. Wang, J. Arroyo, J. T. Vogelstein, and C. E. Priebe, Joint embedding of graphs. IEEE Trans. Pattern Anal. Mach. Intell. 43, 1324–1336 (2019).
  • (24) J. Arroyo, A. Athreya, J. Cape, G. Chen, C. E. Priebe, and J. T. Vogelstein, Inference for multiple heterogeneous networks with a common invariant subspace. Preprint arXiv:1906.10026 (2019).
  • (25) F. Yin, W. Shen, and C. T. Butts, Finite mixtures of ERGMs for ensembles of networks. Preprint arXiv:1910.11445 (2019).
  • (26) M. Signorelli and E. C. Wit, Model-based clustering for populations of networks. Stat. Model. 20, 9–29 (2020).
  • (27) L. Peel and A. Clauset, Detecting change points in the large-scale structure of evolving networks. In

    Proceedings of the 29th International Conference on Artificial Intelligence (AAAI)

    , pp. 2914–2920 (2015).
  • (28) T. P. Peixoto and L. Gauvin, Change points, memory and epidemic spreading in temporal networks. Sci. Rep. 8, 15511 (2018).
  • (29) T. P. Peixoto, Inferring the mesoscale structure of layered, edge-valued, and time-varying networks. Phys. Rev. E 92, 042807 (2015).
  • (30) N. Stanley, S. Shai, D. Taylor, and P. J. Mucha, Clustering network layers with the strata multilayer stochastic block model. IEEE Trans. Netw. Sci. Eng. 3, 95–105 (2016).
  • (31) P. S. La Rosa, T. L. Brooks, E. Deych, B. Shands, F. Prior, L. J. Larson-Prior, and W. D. Shannon, Gibbs distribution for statistical analysis of graphical data with a sample application to fcMRI brain images. Stat. Med. 35, 566–580 (2016).
  • (32) J. Bento and S. Ioannidis, A family of tractable graph metrics. Appl. Netw. Sci. 4, 107 (2019).
  • (33) D. S. Bassett and O. Sporns, Network neuroscience. Nat. Neurosci. 20, 353–364 (2017).
  • (34) J. Stehlé, N. Voirin, A. Barrat, C. Cattuto, L. Isella, J.-F. Pinton, M. Quaggiotto, W. Van den Broeck, C. Régis, B. Lina, et al., High-resolution measurements of face-to-face contact patterns in a primary school. PLOS ONE 6, e23176 (2011).
  • (35) J. B. Kruskal and M. Wish, Multidimensional scaling. In Sage University Paper series on Quantitative Applications in Social Sciences, 07-011, Sage Publications, Beverly Hills, CA (1978).
  • (36)

    R. M. Neal, Markov chain sampling methods for Dirichlet process mixture models.

    J. Comput. Graph. Stat. 9, 249–265 (2000).
  • (37) S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741 (1984).
  • (38) M. Meilă, Comparing clusterings: An information based distance. J. Multivar. Anal. 98, 873–895 (2007).
  • (39) N. Eagle and A. Pentland, Reality mining: Sensing complex social systems. J. Pers. Ubiquitous Comput. 10, 255–268 (2006).
  • (40) A. Mantziou, S. Lunagómez, and R. Mitra, Bayesian model-based clustering for multiple network data. Preprint arXiv:2107.03431 (2021).
  • (41) A. Gelman, H. S. Stern, J. B. Carlin, D. B. Dunson, A. Vehtari, and D. B. Rubin, Bayesian Data Analysis. Chapman and Hall/CRC, New York, NY, 3rd edition (2013).

Appendix A Metric inference

One approach to the problem of network clustering and denoising is to employ methods of metric inference banks1994metric , in which network samples live in a metric space and are clustered based on their distance in that space. It turns out that the simplest metric model of graphs, as used in Refs. banks1994metric ; lunagomez2020modeling ; la2016gibbs , corresponds to a special case of the model analyzed in the present work. Here we demonstrate this correspondence.

Starting with the unimodal case, the likelihood of a single network sample under a generic metric model is given by banks1994metric



is a parameter that controls an analog of the variance,

is a non-increasing function, and is a distance metric on the space of all labeled graphs on nodes. This model is very flexible in principle but less so in practice, because computational issues place constraints on the possible choices of metric and the function . Most authors use an exponential function and the Hamming distance banks1994metric ; lunagomez2020modeling

One can then use the properties of the exponential to derive the simple data likelihood


which is straightforward to evaluate and simulate numerically.

 Quantity  Definition Expression
Number of network samples
Number of network samples in cluster for mode 
Edges in network sample 
Edges in mode 
Edges in all modes
Number of times are connected in network samples from mode 
Edges absent in sample and absent in mode (“true negatives”)
Edges present in sample and absent in mode (“false positives”)
Edges absent in sample and present in mode (“false negatives”)
Edges present in sample and present in mode (“true positives”)
Total true negatives for all samples in cluster for mode
Total false positives for all samples in cluster for mode
Total false negatives for all samples in cluster for mode
Total true positives for all samples in cluster for mode
Table 1: Variables use in the analysis and algorithm, as defined in terms of the basic quantities and .

Comparing with Eq. (II.1) of the main text, we see that this model is equivalent to our model with true- and false-positive rates if


In particular, if we set then we can map the two models directly, with


so long as . In other words, the metric distance approach using the exponential function and Hamming distance banks1994metric ; lunagomez2020modeling is equivalent to assuming a binomial model for edge measurements, but with a particular relationship between false-positive and true-positive rates, , which is generally unrealistic newman2018estimating ; young2020bayesian .

It is straightforward to verify that analogous results hold in the multimodal case. One finds that when , one can define defined analogously to Eq. (33) to obtain a single parameter controlling the probability that existing edges from mode are missed in the samples and that non-existent edges are added. Thus, we conclude that metric models built with the exponential function and Hamming distance are less expressive than the generative process of network measurements considered here, and we recommend using the latter. If one does wish to use a metric model then the algorithms presented in the main text can be used to fit it by setting for all classes .

Appendix B Conjugate priors

For ease of presentation, we have used flat priors for the parameters in the main text, but one can easily replace them with more flexible priors thus:


where is the Euler beta function and is its generalization , and where the parameters of these distributions (such as and ) are new hyper-parameters of the model.

Since the priors above are conjugate gelman2013bayesian , our Gibbs sampling algorithm carries over with little modification. The only change needed is to the sampling distribution for the parameters:


The parameters of the conjugate priors are analogous to the combinatorial quantities of Table 1 and they thus act as pseudo-counts of edges, missing edges, and so on.