## 1 Introduction

Dynamic networks are a general language for describing time-evolving complex systems, and discrete time network models provide an emerging statistical technique to study biological, business, economic, information, and social systems in the real world. For example, time-evolving networks shed light on understanding critical processes such as the study of biological functions using protein-protein interaction networks (Han et al., 2004; Taylor et al., 2009), and also contribute to assessing infectious disease epidemiology, dynamic brain networks and time-evolving structures of social networks (Morris & Kretzschmar, 1995; Bearman et al., 2004; Kossinets & Watts, 2006; Park & Friston, 2013; Lee & Xue, 2017).

A community can be defined as a set of nodes sharing similar connectivity patterns. In computer science and statistical physics, many node clustering algorithms have been developed. Girvan & Newman (2002) propose an algorithm to identify communities based on edge “betweenness”. They construct communities by progressively removing the edges that connect communities most from the original network. Newman & Girvan (2004) proposed three different measures of “betweenness” and compared the results based on modularity, which measures the quality of a particular division of a network. On the other hand, in statistics, analyzing and clustering networks is often based on statistical mixture models. One idea of model-based clustering in networks comes from Handcock et al. (2007), who propose a latent position cluster model that extends the latent space model of Hoff et al. (2002) to take account of clustering, using the model-based clustering ideas of Fraley & Raftery (2002). In current literature, there are two very popular statistical models. One is the stochastic block model (SBM) and the other is the exponential-family random graph model (ERGM).

Stochastic block models were first introduced by Holland et al. (1983) and they focused on the case of *a*
*priori*

specified blocks, where the memberships are known or assumed and the goal is to estimate a matrix of edge probabilities. A statistical approach to

*a*

*posteriori*block modeling for networks was introduced by Snijders & Nowicki (1997) and Nowicki & Snijders (2001), where the objective is to simultaneously estimate the matrix of edge probabilities and the memberships. Airoldi et al. (2008) relax the assumption of a single latent role for nodes and develop a mixed membership stochastic block model. Karrer & Newman (2011) relax the assumption that a stochastic block model treats all nodes within a community as stochastically equivalent and propose a degree-corrected stochastic block model that can consider node covariates. Moreover, in recent years, asymptotic theory of these models has been advanced by several pioneering papers including Bickel & Chen (2009), Choi et al. (2012), Amini et al. (2013), and Choi & Wolfe (2014). The communities found in stochastic block models are interpreted meaningfully in many research fields. For example, in citation and collaboration networks, such communities can be interpreted as scientific disciplines (Newman, 2004; Ji & Jin., 2016). Communities in food web networks can be interpreted as ecological subsystems (Girvan & Newman, 2002). Unlike the time-evolving networks considered in the current article, cross-sectional networks are the basis of most of the stochastic block model literature cited here.

Exponential-family random graph models allow researchers to incorporate interesting features of the network into statistical models. Moreover, researchers can specify a model capturing those features and cluster nodes based on the specified model. Indeed, the stochastic block model is a special case of a mixture of exponential-family random graph models. Some estimation algorithms for exponential-family random graph models do not scale well computationally to large networks. Vu et al. (2013) propose ERGM-based clustering for large-scale cross-sectional networks that solves the scalability issue by assuming dyadic independence conditional on the cluster memberships of nodes. In recent years, several authors have proposed discrete time network models based on ERGM. Hanneke et al. (2010) propose a temporal ERGM (TERGM) to fit the model to a network series and Krivitsky & Handcock (2014) propose a separable temporal ERGM (STERGM) that gives more flexibility in modeling time-evolving networks.

Our work is primarily motivated by detecting communities in time-evolving networks, and our results advance existing literature by introducing a promising framework that incorporates model-based clustering while remaining computationally scalable to large networks. This framework is based on discrete time exponential-family random graph models and inherits the philosophy of finite mixture models, which simultaneously allows both modeling and detecting communities in time-evolving networks, helping researchers and practitioners understand the complex structure of these networks. Moreover, we propose a conditional likelihood Bayesian information criterion to solve the model selection problem of determining an appropriate number of communities. We also propose an efficient variational expectation-maximization (EM) algorithm that exhibits computational scalability for large-scale time-evolving networks by exploiting variational methods and minorization-maximization (MM) techniques.

In Section 2, we present our model-based clustering method for time-evolving networks based on a finite mixture of discrete time exponential-family random graph models. In Section 3, we use conditional likelihood to construct an effective model selection criterion. Section 4 designs an efficient variational expectation-maximization algorithm to find approximate maximum likelihood estimates of network parameters and mixing proportions. Given these estimates, we can infer membership labels and solve the problem of community detection for time-evolving networks. The power of our method is demonstrated by simulation studies in Section 5 and real-world applications to international trade networks and collaboration networks in Section 6.

## 2 Methodology

### 2.1 Model-based clustering of time-evolving networks through discrete time ERGMs

In this section, we present the model-based clustering method for time-evolving networks based on a finite mixture of discrete time exponential-family random graph models. First, we introduce some necessary notation. We consider nodes that are fixed over time and indexed by integers . Let represent the network at time and denote by the corresponding observed network, where is the set of all possible networks. Let

be a vector of

network parameters of interest. Under the -order Markov assumption, discrete time exponential-family random graph models are of the form(1) |

where is given by

and ensures that sums to . Here, is a -dimensional vector of sufficient statistics on networks , , .

We now focus on the simplest case of discrete time exponential-family random graph models under the first-order Markov assumption and we write the one-step transition probability from to as

(2) |

where and are defined as above.

Remark 1: Given covariates and a vector of covariate coefficients, we can also write the transition probability from to with covariates as

where . Here, is a -dimensional vector of statistics.

In general, for some choices of , the model in (2) is not tractable for modeling large networks, since the computing time to evaluate the likelihood function directly grows as in the case of undirected edges. Here, we restrict our attention to scalable exponential-family models by only choosing statistics that preserve conditional dyadic independence wherein the distribution of given factors over the edge states, i.e.,

(3) |

Before proceeding, we introduce specific examples of statistics that preserve conditional dyadic independence and capture interesting time-evolving network features in both TERGM and STERGM:

(4) |

(5) |

(6) |

(7) |

The subscripted and superscripted mean that summation should be taken over all pairs with ; the same is true for products as in equation (3). Corresponding to the first and second statistics above are TERGM parameters: relates to density, or the number of edges in the network at time , while relates to stability, or the number of edges maintaining their status from time to time . Corresponding to the third and fourth statistics above are STERGM parameters: relates to formation, or the number of edges absent at time but present at time , while relates to persistence, or the number of edges existing at time that survive to time .

Here, as in Vu et al. (2013), we assume that the probability mass function has a -component mixture form as follows:

(8) |

where denotes the membership indicators with distributions

and denotes the support of . In the mixture form (8), the assumption of conditional dyadic independence given strikes a balance between complexity and parsimony. For now, the number of communities is fixed and known. In Section 3, we will discuss how to choose an optimal number of communities .

Now, we consider inference based on observing a series of networks, , given an initial network . The log-likelihood of the observed network series is

(9) |

Our aim is to estimate parameters and via maximizing the log-likelihood , i.e.,

Section 4 designs a novel variational EM algorithm to efficiently find the approximate maximum likelihood estimates. We shall see that the parameter estimates obtained by this algorithm can provide community membership labels.

Before proceeding, we give specific examples of discrete time exponential-family random graph models with stability parameter(s) that control the rate of evolution of a network. Stability parameters are popular in the study of time-evolving networks; in sociology, researchers are interested in whether, say, same-gender friendships are more stable than other friendships or whether there are differences among ethnic categories in forming lasting sexual partnerships over time (Knecht, 2008; Krivitsky et al., 2011).

### 2.2 Parameter Identifiability

The unique identifiability of the parameters in a broad class of random graph mixture models has been shown by Allman et al. (2009) and Allman et al. (2011). Here we prove the generic identifiability for our proposed parameterizations. Theorem 1, whose proof is given in Appendix, extends the identifiability result of the stochastic block model of Allman et al. (2009) and Allman et al. (2011) to discrete time exponential-family random graph mixture models. In this context, “generically identifiable” means uniquely identifiable except possibly on a subset of the parameter space whose Lebesgue measure is zero.

###### Theorem 1.

Let be the number of nodes in a time-evolving network. The parameters , , and the conditional probability of observing an edge , of equation (8) are generically identifiable, up to permutations of the subscripts , if

Moreover, the network parameters , in the model (e.g., Examples 1 and 2) are generically identifiable, up to permutations of the subscripts , if .

## 3 Model selection

In practice, the number of communities is unknown and should be chosen. Handcock et al. (2007)

propose a Bayesian method of determining the number of clusters by using approximate conditional Bayes factors in a latent position cluster model.

Daudin et al. (2008) also derive a Bayesian model selection criterion that is based on the integrated classification likelihood (ICL). In this section, we use the conditional likelihood of the network series, conditioning on an estimate of the membership vector, to construct an effective model selection criterion.We obtain the conditional log-likelihood of the network series , given initial network and estimated membership vector , as

(10) |

which can be written using conditional dyadic independence (3) in the form

We propose the conditional likelihood Bayesian information criterion to choose the number of communities for our method:

(11) |

where is the maximum likelihood estimate assuming communities and is the model complexity, following Varin & Vidoni (2005), Varin et al. (2011) and Xue et al. (2012), based on and . We choose the optimal by minimizing the CL-BIC score.

Remark 2: In cross-sectional networks, a similar criterion, the composite likelihood BIC, is proposed by Saldana et al. (2017)

in the stochastic block model setting where the membership vector is estimated separately using a method such as spectral clustering.

We may derive the explicit conditional likelihood BIC for Examples 1 and 2, i.e., the TERGM and STERGM cases. For TERGM with a stability parameter, we obtain

For any given and the corresponding estimate , we derive the explicit estimate of as

where and

We also derive the explicit estimate of as

where and for . We now obtain the estimate of as . Finally, for clustering time-evolving networks through TERGM with a stability parameter, we determine the optimal number of communities from

(12) |

where and are the estimates of and corresponding to a given . Similar details for STERGM with formation and persistence parameters are presented in Appendix.

Here, we also introduce modified integrated classification likelihood. Again for the TERGM with a stability parameter, the modified ICL can be written as

(13) |

and we choose the optimal number of communities as

(14) |

We present model selection results using both conditional likelihood BIC and modified ICL in the simulation studies of Section 5.

Remark 3: Our conditional likelihood BIC can also be applied to choose the number of communities for finite mixture of ERGMs in cross-sectional networks. Under the assumption of conditional dyadic independence given the conditional log-likelihood in this case can be written as

and we choose the optimal by minimizing as in equation (12) with .

## 4 Computation

Here we present a novel variational EM algorithm to solve model-based clustering for large scale time-evolving networks. Our algorithm is modeled on the algorithm presented by Vu et al. (2013). The algorithm combines the power of variational methods (Wainwright & Jordan, 2008) and minorization-maximization techniques (Hunter & Lange, 2004) to effectively handle both the computationally intractable log-likelihood function and the non-convex optimization problem of the lower bound of the log-likelihood. We introduce an auxiliary distribution to derive a tractable lower bound on the intractable log-likelihood function. Using Jensen’s inequality, the log-likelihood function may be shown to be bounded from the below as follows:

(15) |

If were unconstrained in the sense that we could choose from the set of all distributions with support , we would obtain the best lower bound when , where the inequality becomes equality. However, this unconstrained form of is intractable. We therefore constrain to a subset of tractable choices and maximize tractable lower bound to find approximate maximum likelihood estimates.

Here, we constrain to the mean-field variational family where the are mutually independent,

We further specify to be for , where is the variational parameter. In the estimation phase, whenever it is necessary to assign each node to a particular community, the th node is assigned to the community with the highest value among .

If we now denote the right side of Inequality (15) by , we may write

(16) |

If , , and
denote the parameter estimates at the
th iteration of our variational EM algorithm, then in principle that algorithm
consists of alternating between two steps:

Idealized Variational E-step: Let . Idealized Variational M-step: Let .

Remark 4: If the distribution were totally unconstrained, then the E-step above would simply consist of determining the conditional distribution , and the variational E-step and M-step would coincide with the E-step and M-step of the traditional EM algorithm for this situation.

In our idealized variational EM algorithm, it is difficult to directly maximize the nonconcave function with respect to . To address this challenge, we use a minorization-maximization technique to construct a tractable minorizing function of , then maximize this minorizer. We define

(17) |

which satisfies the defining characteristics of a minorizing function, namely

(18) |

and

(19) |

Additional details on constructing this minorizing function are presented in Appendix. Since is concave in and separates into functions of the individual parameters, maximizing is equivalent to solving a sequence of constrained quadratic programming subproblems with respect to respectively, under constraints and for .

To maximize in the M-step, maximization with respect to and may be accomplished separately. First, to derive the closed-form updates for , we introduce a Lagrange multiplier with the constraint . The closed-form update for is

(20) |

We could obtain using the Newton-Raphson method, though naive application of Newton-Raphson will not guarantee an increase in (16) which is necessary for the ascent property of the lower bound of the log-likelihood. Since the Hessian matrix at the th iteration is positive definite, it can be easily shown that if we go in the direction of the Newton-Raphson method, we are guaranteed to go uphill initially. In our modified Newton-Raphson method we do not find the successor point as in the standard Newton-Raphson method. We instead take as a search direction and perform a line search (Bertsimas, 2009) to find

then we find the successor point by

(21) |

Now, we summarize the details of our proposed variational EM algorithm as Algorithm 1.

Remark 5: The initial are chosen independently uniformly randomly on (0, 1), then each is multiplied by a normalizing constant chosen so that for every . We then start with an M-step to obtain initial and .

Remark 6: Using standard arguments that apply to minorization-maximization, or MM algorithms (Hunter & Lange, 2004), we can show that our variational EM algorithm preserves the ascent property of the lower bound of the log-likelihood, namely,

## 5 Simulation Studies

We first conduct simulation studies for a mixture of TERGMs and STERGMs. To simulate time-evolving networks from the -component mixture of TERGM with stability parameters, i.e., Example 1, we first specify network structure by choosing randomly the categories of the nodes according to the fixed mixing proportions and by defining initial densities for each category. Now we obtain initial network by simulating all the edges between two nodes based on the probabilities with specified density parameters and categories of the nodes. Next, we set different stability parameters for each category and simulate all the edges in series of networks sequentially, based on the probabilities with specified stability parameters and given previous time point network and categories of the nodes. Similarly, we simulate time-evolving networks from the -component mixture of STERGM with formation and persistence parameters, i.e., Example 2. For each of the four model settings listed in Table 1, we use 100 nodes and 10 discrete time points.

Model 1 | Model 2 | ||||

Mixing proportion | 0.5 | 0.5 | 0.33 | 0.33 | 0.33 |

Stability parameter | -0.5 | 0.5 | -1 | 0 | 1 |

Initial network density parameter | -0.5 | 0.5 | -1 | 0 | 1 |

Model 3 | Model 4 | ||||

Mixing proportion | 0.5 | 0.5 | 0.33 | 0.33 | 0.33 |

Formation parameter | -1.5 | 1.5 | -1.5 | 0 | 1.5 |

Persistence parameter | -1 | 1 | -1 | 0 | 1 |

Initial network density parameter | -0.5 | 0.5 | -1 | 0 | 1 |

To check the performance of the algorithm at identifying the correct number of communities, we count the frequencies of and over 100 repetitions. To assess the clustering performance, we calculate the average value of the Rand Index (RI) over the 100 repetitions (Rand, 1971). The measure calculates the proportion of pairs whose estimated labels correspond to the true labels in terms of being assigned to the same or different groups:

To assess the estimation performance of the algorithm, we calculate the average norm loss for estimated mixing proportions and network parameters over the 100 repetitions:

First of all, we check the performance of our criterion functions in choosing the correct number of communities. As shown in Table 2, both CL-BIC and modified ICL perform well. The average Rand Index results are reported in Table 3.

Model 1 | Model 2 | |||||||

0 | 99 | 0 | 1 | 0 | 0 | 97 | 3 | |

0 | 100 | 0 | 0 | 0 | 0 | 96 | 4 | |

Model 3 | Model 4 | |||||||

0 | 99 | 1 | 0 | 0 | 3 | 93 | 4 | |

0 | 100 | 0 | 0 | 0 | 0 | 99 | 1 |

Model 1 | |||

TERGM | 1.000 (0.000) | 0.874 (0.033) | 0.773 (0.030) |

Model 2 | |||

TERGM | 0.751 (0.044) | 0.996 (0.031) | 0.943 (0.025) |

Model 3 | |||

STERGM | 1.000 (0.000) | 0.874 (0.033) | 0.774 (0.031) |

Model 4 | |||

STERGM | 0.761 (0.045) | 0.998 (0.021) | 0.945 (0.018) |

, with sample standard deviations in parentheses, where

is the true number of communities.Finally, Table 4 summarizes estimation performance of our algorithm using , , , and . The results of Tables 2 through 4 together tell us that our algorithm performs convincingly on this set of test datasets.

Model 1 | Model 2 | ||||

0.056 (0.043) | 0.013 (0.008) | 0.073 (0.043) | 0.038 (0.098) | ||

Model 3 | Model 4 | ||||

0.057 (0.043) | 0.030 (0.020) | 0.023 (0.015) | 0.072 (0.040) | 0.045 (0.094) | 0.039 (0.071) |

To check which model selection criterion is more robust in choosing correct number of communities when the time-evolving networks are not simulated from the true model, we conduct another simulation study. This time we use the ‘simulate.stergm’ function in the ‘tergm’ package (Krivitsky & Handcock, 2016) in R (R Core Team, 2016) to simulate time-evolving networks (Krivitsky & Handcock, 2014) and combine the time-evolving networks into single time-evolving networks where each simulated time-evolving networks representing different communities. First, we specify each network structure by choosing randomly the categories of the nodes according to the fixed mixing proportions and by defining the network densities. Next, we set different mean relational durations, which represent different degrees of stability, and simulate each time-evolving network to have the average network density we defined over the time points. Finally, we combine the time-evolving networks into single time-evolving networks by adding a fixed number of edges between randomly chosen pairs of individuals in different communities. For each of the two model settings listed in Table 5, we use 100 nodes, 10 discrete time points, and 10 edges added between randomly chosen pairs of nodes in different communities.

Model 5 | Model 6 | ||||
---|---|---|---|---|---|

Mixing proportion | 0.4 | 0.6 | 0.3 | 0.4 | 0.3 |

Mean relational duration | 5 | 2.5 | 7.5 | 5 | 2.5 |

Average network density | 0.15 | 0.1 | 0.1 | 0.25 | 0.3 |

As shown in Table 6, in the new simulation setting where the time-evolving networks are not simulated from the true model, CL-BIC still performs well in choosing the correct number of communities through both TERGM with a stability parameter and STERGM with formation and persistence parameters. However, as shown in Table 7, modified ICL fails to choose the correct number of communities. The results of Tables 6 and 7 together tell us that the performance of our proposed CL-BIC in choosing the correct number of communities is more robust than modified ICL when the model assumptions are violated, at least in the particular testing regime we implemented.

Model 5 | Model 6 | |||||||

TERGM | 0 | 87 | 12 | 1 | 0 | 1 | 96 | 3 |
---|---|---|---|---|---|---|---|---|

STERGM | 0 | 93 | 2 | 5 | 0 | 8 | 91 | 1 |

Model 5 | Model 6 | |||||||

TERGM | 0 | 49 | 23 | 28 | 0 | 0 | 65 | 35 |
---|---|---|---|---|---|---|---|---|

STERGM | 0 | 53 | 32 | 15 | 0 | 0 | 59 | 41 |

The average Rand Index results are also reported in Table 8. In all models, both TERGM with a stability parameter and STERGM with formation and persistence parameters achieve a high average Rand Index for the correct number of mixtures. Moreover, we see a fairly high average Rand Index with the selected (via minimum CL-BIC) number of communities . The results of Table 6 and 8 together tell us that our algorithm based on CL-BIC performs convincingly in choosing the correct number of communities and assigning nodes to communities even when the time-evolving networks are not generated from the true model.

Model 5 | ||||
---|---|---|---|---|

TERGM | 0.976 (0.032) | 0.797 (0.045) | 0.716 (0.036) | 0.948 (0.080) |

Comments

There are no comments yet.