In recent years, a number of network models have been introduced in the literature to study how binary interactions between entities evolve over time. One common approach relies on the discretization of the time dimension: once an appropriate time grid is specified, the continuous data are essentially transformed into a collection of static network snapshots. This approach has facilitated the extension of many static network models to a dynamic framework. For example, the Stochastic Block Model (SBM) of wang1987stochastic has recently been adapted to the dynamic case by yang2011detecting and matias2017statistical. In the same fashion, extensions of the Latent Position Model (LPM) of hoff2002latent have been proposed by sarkar2005dynamic and sewell2015latent, among others. The model of hanneke2010discrete instead extends the well-known Exponential Random Graph Model of holland1981exponential.
However, the approach based on the discretization of the time dimension has recently been criticized, mainly due to the non-negligible effects that the data transformation may have on the results corneli2017multiple,matias2018semiparametric. In fact, the discretization process always involves a certain level of arbitrariness, either because the data are collected at specific given times, or because of a post-collection transformation. Indeed, in the vast majority of data analysis applications, the interactions evolve over time in a continuous fashion.
Dynamic binary interactions may either be instantaneous or protracted over a time interval. An example of the first situation is the well-known Enron email network, and this framework has been studied by several recent works, including corneli2017multiple and matias2018semiparametric. However, in many situations the interactions among a collection of entities may be protracted over time, and the object of analysis may be to model for how long these entities interact (and, conversely, do not interact) within an observed time period: in this paper we focus on this case. This context generates data which allow a representation of the generic interaction using the format $(i, j, t, \Delta t)$, where $i$ denotes the sender node, $j$ the receiver, $t$ is the instant in which the interaction begins, and $\Delta t$ the interaction length. This framework is apt to describe a variety of networks, including phone call networks, visual contact networks, speech networks, or proximity networks.
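To make this data format concrete, here is a minimal sketch of how such interaction records could be stored; the field names are ours and are not tied to any particular dataset:

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    sender: int     # label of the sender node
    receiver: int   # label of the receiver node
    start: float    # instant at which the interaction begins
    length: float   # duration of the interaction

# A toy phone-call network: node 0 calls node 1 twice and node 2 once.
events = [
    Interaction(sender=0, receiver=1, start=0.0, length=2.5),
    Interaction(sender=0, receiver=2, start=1.0, length=0.5),
    Interaction(sender=0, receiver=1, start=4.0, length=1.5),
]

# Total time node 0 spends interacting with node 1:
total = sum(e.length for e in events if (e.sender, e.receiver) == (0, 1))
```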
The goal of this paper is to introduce a continuous network framework to directly model the lengths of the observed binary interactions. Our approach relies on a SBM structure, whereby the nodes are characterized by a cluster membership variable (allocation), which determines the distributions over the edge values. In our context, the allocations determine both the lengths of the interactions as well as the lengths of the non-interactions.
The literature on the modeling of interaction lengths is rather limited. One framework that is similar to ours is the Stochastic Block Transition Model of xu2015stochastic and rastelli2018exact. In these papers, the block structure is used to determine the propensity to create and destroy edges across contiguous time frames. This can naturally give a model-based quantification of the persistence of edges and non-edges. Our work also shares similarities with the Stochastic Actor-Oriented Models, discussed by snijders2005models, and extended towards a number of different directions in more recent contributions. We should point out that, while these works have motives and methods similar to ours, both of these approaches rely on the discretization of the time dimension, whereas our proposed model is based on a fully continuous framework.
As regards model inference, we propose a variational expectation-maximization algorithm to estimate model parameters and cluster allocation. Additionally, a model-based clustering criterion is introduced to select the optimal number of latent groups. In recent years, variational methods have been successfully applied in a variety of mixture models for networks. For example, they have been employed for the static and dynamic SBM by daudin2008mixture and matias2017statistical, respectively. They have also been used for mixed membership models airoldi2008mixed, networks for instantaneous interactions corneli2017multiple, and textual networks bouveyron2018stochastic. A recent review on variational inference can be found in blei2017variational.
The paper is organized as follows: Section 2 describes in detail the type of data analyzed, introducing a homogeneous model and its new stochastic block model extension. Section 3 presents a variational expectation-maximization algorithm to estimate the model parameters, and a criterion to select the number of clusters. In Section 4 the proposed method is tested on simulated data experiments, whereas in Section 5 it is demonstrated in application to the analysis of face-to-face interactions between high-school students in France. The paper ends with a discussion in 6.
2 Statistical model
2.1 Interaction length data
The observed data describe the binary interactions between $N$ units during a certain time interval $[0, T]$. These units, or nodes, are labeled with $1, \dots, N$. At every point in time $t \in [0, T]$, node $i$ may or may not be interacting with node $j$. We illustrate our model focusing on directed interactions, noting that extensions to the undirected case are straightforward to implement.
We represent the data observations with a continuous collection of adjacency matrices $\left\{Y(t)\right\}_{t \in [0, T]}$, where $Y_{ij}(t)$ is equal to $1$ whenever $i$ is interacting with $j$, and to $0$ otherwise.
Since $Y_{ij}(\cdot)$ is a step function, we can avoid the continuous notation. We note that, for each pair of nodes, $Y_{ij}(\cdot)$ naturally partitions the segment $[0, T]$, separating the subsets where the function takes the same value. We denote these subsets with $I_{ij}^{(1)}, \dots, I_{ij}^{(M_{ij})}$, and impose two conditions:
In other words, $Y_{ij}(\cdot)$ must be constant on each set, and any two consecutive sets cannot correspond to the same value of $Y_{ij}(\cdot)$. In practice, the data must be represented as an alternating sequence of interactions and non-interactions.
In particular, we are interested in studying the lengths of these interactions, which we denote $x_{ij}^{(s)}$, for $s = 1, \dots, M_{ij}$. Also, we denote with $y_{ij}^{(s)}$ the value of $Y_{ij}(\cdot)$ on the corresponding set. Exploiting the fact that $Y_{ij}(\cdot)$ can only take values $0$ and $1$, we can actually reconstruct all of the observed data just by using the collection of lengths and the initial values $y_{ij}^{(1)}$.
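A small illustration of this alternating-lengths representation: given the switch times of the step function on $[0, T]$ and its initial value, the segment lengths can be recovered directly (function and variable names are ours):

```python
def to_alternating_lengths(switch_times, T, initial_state):
    """Convert the switch times of a binary step function on [0, T]
    into alternating segment lengths plus the initial value; the
    states then alternate 0/1 starting from initial_state."""
    boundaries = [0.0] + sorted(switch_times) + [T]
    lengths = [b - a for a, b in zip(boundaries, boundaries[1:])]
    return lengths, initial_state

# Edge observed on [0, 10]: no interaction, then interaction on [3, 7],
# then no interaction again until the end of the observation window.
lengths, y0 = to_alternating_lengths([3.0, 7.0], T=10.0, initial_state=0)
# lengths give the non-interaction, interaction, non-interaction durations
```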
2.2 The homogeneous model
From now on, all the probabilities considered are conditional on the observed time frame and the initial states $y_{ij}^{(1)}$.
We assume that each interaction length (resp. non-interaction length) is an independent exponential random variable with rate $\lambda$ (resp. $\mu$); hence the value $1/\lambda$ denotes the average time of interaction (resp. $1/\mu$ the average time of non-interaction). Since these rates do not depend on the nodes at the extremities of each edge, the model is homogeneous.
Consider the edge between two arbitrary nodes $i$ and $j$. The interval $[0, T]$ is partitioned into the sets $I_{ij}^{(1)}, \dots, I_{ij}^{(M_{ij})}$, where $M_{ij}$ denotes the total number of interactions and non-interactions between $i$ and $j$. One important aspect of these data is that the first and last lengths $x_{ij}^{(1)}$ and $x_{ij}^{(M_{ij})}$ only provide a lower bound for the true non-observed lengths. In other words, these observed values are truncated. A representation of the data structure under consideration is presented in Figure 1.
It follows that the likelihood contribution provided by an edge with more than two sub-segments is given by a product of exponential densities, corresponding to the observed embedded intervals, and two cumulated densities, corresponding to the truncated observations at the extremities. If the number of segments is equal to two, only the cumulated densities must remain. If the number of segments is equal to one, only one cumulated density must remain, which, owing to the memorylessness of the exponential distribution, can be handled as a degenerate case of the previous one.
These properties translate into the following probability for the pair of nodes :
where $f$ and $F$ are the pdf and cdf of an exponential variable with the corresponding rate, respectively.
To summarize the observed data, we can introduce the following statistics:
Now, we replace the exponential distributions with their actual expressions, and, using the independence assumption on the edges, we obtain the following log-likelihood:
Maximum likelihood estimators are available in closed form for the two model parameters:
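While the exact estimators are stated above in the paper's notation, the underlying idea is the standard maximum likelihood estimator for censored exponential data: the number of fully observed segments divided by the total accumulated time, where the truncated segments at the extremities contribute exposure time but no events. A sketch, with names of our own choosing:

```python
def exp_rate_mle(complete_lengths, truncated_lengths):
    """MLE of an exponential rate from fully observed segment lengths
    plus censored segments that only provide a lower bound.
    Censored segments add to the total exposure time but do not
    count as observed events."""
    n_events = len(complete_lengths)
    exposure = sum(complete_lengths) + sum(truncated_lengths)
    return n_events / exposure

# Three complete interactions plus two truncated ones at the extremities:
lam_hat = exp_rate_mle([2.0, 1.0, 3.0], [0.5, 0.5])
# the estimated average interaction length is then 1 / lam_hat
```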
2.3 The Stochastic Blockmodel for interaction lengths
A natural extension of the homogeneous model is a latent block structure model, where nodes are allowed to belong to different sub-populations.
An allocation variable $z_i$ is thus assigned to each of the $N$ nodes, to indicate their cluster membership. Such categorical variable takes values in $\{1, \dots, K\}$ and indicates to which of the $K$ groups node $i$ is allocated. As in any finite mixture model, these allocation variables are assumed to arise from a Multinomial distribution with probabilities $\pi = (\pi_1, \dots, \pi_K)$, where the generic $\pi_k$ corresponds to the probability of observing a priori group $k$. Since the number of groups $K$ must be chosen and fixed, a convenient alternative representation is given by $z_{ik} = \mathbb{1}_{\{z_i = k\}}$, where $\mathbb{1}_A$ denotes the indicator function for the set $A$. Finally, the blockmodel assumption postulates that the lengths of the interactions and the lengths of non-interactions between a node in group $g$ and a node in group $h$ are determined by the parameters $\lambda_{gh}$ and $\mu_{gh}$, respectively.
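The generative mechanism implied by these assumptions can be sketched as follows; this is a simulation sketch with illustrative parameter values of our own, not the paper's estimation code:

```python
import random

random.seed(1)
K, N = 2, 6
pi = [0.5, 0.5]                    # mixing proportions over the K groups
lam = [[2.0, 0.5], [0.5, 2.0]]     # interaction rates lambda_gh
mu = [[0.5, 2.0], [2.0, 0.5]]      # non-interaction rates mu_gh

# Sample the cluster allocations from the multinomial prior.
z = [random.choices(range(K), weights=pi)[0] for _ in range(N)]

def interaction_rate(i, j):
    """Rate governing the interaction lengths between nodes i and j,
    determined solely by their cluster memberships."""
    return lam[z[i]][z[j]]

# One simulated interaction length for the pair (0, 1):
length = random.expovariate(interaction_rate(0, 1))
```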
The conditional log-likelihood of the SBM reads as follows:
This formulation mimics the finite mixture framework in the network context, as in daudin2008mixture. We can follow the same procedure used for the homogeneous model to obtain the following result:
where the new quantities are defined as:
3.1 Variational Expectation-Maximization algorithm
As is usual in model-based clustering, we are interested in performing inference for this model by maximizing the marginal likelihood (or evidence) with respect to the model parameters. However, integrating out the allocations Z is not computationally feasible, even for very small datasets. An Expectation-Maximization (EM) algorithm dempster1977maximum can be employed to overcome this issue.
The EM algorithm alternates two steps: the E-step, where we calculate the expectation of the conditional likelihood (8) with respect to the posterior distribution of the allocations, and the M-step, where we maximize this expectation with respect to the likelihood parameters. The combination of the two steps is guaranteed not to decrease the value of the objective function (in this case, the marginal likelihood) and to converge to a local optimum wu:1983. However, unlike in other more common finite mixture models, the posterior distribution does not factorize into a simple form for stochastic blockmodels such as ours. As a consequence, the E-step cannot be performed exactly due to its prohibitive computational cost, which makes the standard EM algorithm not applicable.
A variational approximation can be used to overcome this limitation and perform inference on the SBM, as previously proposed by daudin2008mixture and a number of subsequent works. The goal here is to replace the posterior distribution on the allocations with a more tractable one, which allows an efficient use of the EM algorithm. In practice, we introduce variational parameters and consider the family of all the distributions that factorize into the product of independent multinomial variables:
Then, the distribution that is most similar to the true posterior is selected.
This can be formalized as follows. First, we note that the marginal log-likelihood satisfies:
Now, we add and subtract the log-density of the variational distribution on the right hand side, and take the expectations of both sides, obtaining:
Here, $KL$ refers to the Kullback–Leibler divergence and is defined as:
whereas the entropy of the variational distribution is defined as:
The decomposition of the marginal log-likelihood in (13) can be exploited to define an optimization procedure. We alternate two steps: in the first step, we minimize the Kullback–Leibler divergence with respect to the variational parameters; in the second step, we maximize the right hand side of (13) with respect to the likelihood parameters.
In practice, the two steps correspond to the maximization of the lower bound to the marginal likelihood with respect to the variational parameters and the likelihood parameters, respectively. In our SBM for interaction lengths, the quantities involved in (13) can be written down explicitly, and the update rules are determined by closed form equations, as shown in the next section.
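The alternating scheme can be summarized with a generic skeleton. Here the two update functions and the bound are placeholders standing in for the closed form rules of Section 3.2, demonstrated on a toy quadratic bound of our own construction:

```python
def variational_em(update_variational, update_parameters, elbo,
                   params, var_params, max_iter=100, tol=1e-8):
    """Generic variational EM loop: alternate the variational update
    (minimizing the KL divergence) with the maximization of the lower
    bound with respect to the model parameters, stopping when the
    bound no longer improves."""
    bound = elbo(params, var_params)
    for _ in range(max_iter):
        var_params = update_variational(params, var_params)   # VE-step
        params = update_parameters(params, var_params)        # M-step
        new_bound = elbo(params, var_params)
        if abs(new_bound - bound) < tol:   # the bound is monotone
            break
        bound = new_bound
    return params, var_params, bound

# Toy illustration on a quadratic "bound" whose optimum is p = v = 3:
params, var_params, bound = variational_em(
    update_variational=lambda p, v: (p + 3) / 2,   # exact VE-step for this toy bound
    update_parameters=lambda p, v: v,              # exact M-step for this toy bound
    elbo=lambda p, v: -(p - v) ** 2 - (v - 3) ** 2,
    params=0.0, var_params=0.0)
```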
Once the algorithm stops, it returns the optimal variational parameters and model parameters.
Regarding the clustering task, the variational parameters define a soft partition for the nodes and represent an approximation to the distribution of the allocation variables Z.
These values may be interpreted as the posterior probabilities for the nodes to belong to each of the $K$ groups. Hence, a straightforward estimated hard partition can simply be obtained by considering the maximum a posteriori allocation derived from these probabilities.
3.2 Update rules
First, we characterize our objective function with the following proposition.
The evidence lower bound for the model proposed is given by:
where the tildes denote the expected values of the corresponding quantities with respect to the variational distribution:
The proof is given in Appendix A.1.
The following propositions focus instead on the optimization of the evidence lower bound with respect to the variational parameters and the model parameters.
The variational parameters that minimize the Kullback–Leibler divergence are given by:
The proof is given in Appendix A.2.
The optimal mixing proportions are given by:
The proof is given in Appendix A.3.
The model parameters maximizing the lower bound for the evidence are:
The proof is given in Appendix A.4.
3.3 Algorithm initialization
For a fixed value of $K$, the variational EM algorithm allows one to perform inference on the model parameters and cluster allocations Z. The procedure needs to be initialized from some starting values, as the algorithm is only guaranteed to converge to a local optimum and, often, in the context of mixture models, the final estimates depend on such initial values baudry:2015,ohagan:2012,scrucca:2015,biernacki:2003. Several strategies have been proposed in the literature in the context of SBM for networks, see come2015model,matias2017statistical,bouveyron2018stochastic, for example.
A simple strategy is to consider multiple random starting allocations and then retain the model with the highest value of the maximized lower bound of the marginal likelihood, as described in Section 3.1. However, a random initialization does not prevent the algorithm from reaching a sub-optimal solution, and it is often computationally intensive scrucca:2015,come2015model. Hence, we adopt the following initialization procedure, based on spectral clustering, in order to provide an initial estimate of the allocations:
Compute the total interaction duration between any pair of nodes and construct the corresponding affinity matrix.
Perform spectral $K$-means clustering von:2007 using the log-transformed affinity matrix, where the logarithm is taken element-wise and only for non-zero entries.
Initialize the allocations using the classification obtained in the spectral clustering step.
The rationale of this procedure is that nodes interacting more often and for a longer time are reasonably expected to belong to the same cluster, and that the affinity matrix constructed using the total interaction time over the observed period naturally includes this information.
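A minimal numpy sketch of this initialization; the farthest-first seeding and plain Lloyd iterations below are our simplified stand-ins for a full spectral clustering routine:

```python
import numpy as np

def spectral_init(total_durations, K):
    """Spectral K-means on the log-transformed total interaction
    durations: embed the nodes via the leading eigenvectors of the
    affinity matrix, then run a basic K-means on the embedding."""
    A = np.asarray(total_durations, dtype=float).copy()
    A[A > 0] = np.log(A[A > 0])       # element-wise log, non-zero entries only
    A = (A + A.T) / 2                 # symmetrize before the eigendecomposition
    _, vecs = np.linalg.eigh(A)
    X = vecs[:, -K:]                  # leading K eigenvectors as node coordinates
    idx = [0]                         # deterministic farthest-first seeding
    for _ in range(1, K):
        d = ((X[:, None, :] - X[idx]) ** 2).sum(-1).min(axis=1)
        idx.append(int(np.argmax(d)))
    centers = X[idx].copy()
    for _ in range(50):               # plain Lloyd iterations
        labels = ((X[:, None, :] - centers) ** 2).sum(-1).argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels
```

On a toy duration matrix with two obvious communities, the procedure recovers the two groups.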
3.4 Model selection
The optimal number of latent groups is often not known and needs to be estimated from the data: we propose to choose the value of $K$ that maximizes the Integrated Completed Likelihood (ICL) criterion. The ICL criterion, first introduced by biernacki2000assessing, aims at maximizing the integrated completed likelihood. This criterion has been widely used to perform model choice for mixture models, especially within the literature on networks daudin2008mixture,come2015model,rastelli2018choosing.
We propose to evaluate this criterion using a BIC-type approximation, as previously proposed by several other works, including biernacki2000assessing, daudin2008mixture and matias2017statistical.
In our proposed model, the ICL value is equal to:
where $\hat{Z}$ denotes the maximum-a-posteriori estimate of the allocations, as obtained by the variational EM algorithm.
The proof of this proposition is straightforward, since the formula is simply an adaptation of a similar result from daudin2008mixture.
We note that, in practice, the ICL criterion requires fitting the model for every plausible value of $K$. As the computational cost of the variational EM grows with $K$, this grid search may slow down the procedure unnecessarily. For this reason, we generally consider values of $K$ that are much smaller than $N$, run the variational EM for those, and select the best result using the ICL values.
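The grid search over $K$ can be sketched as follows, where `fit_model` and `icl` are placeholders standing in for the variational EM and the ICL formula above:

```python
def select_K(fit_model, icl, K_max):
    """Fit the model for K = 1, ..., K_max and keep the fit achieving
    the highest ICL value."""
    best = None
    for K in range(1, K_max + 1):
        fit = fit_model(K)
        score = icl(fit, K)
        if best is None or score > best[0]:
            best = (score, K, fit)
    return best

# Toy illustration with an ICL-like score peaking at K = 3:
score, K_opt, _ = select_K(fit_model=lambda K: None,
                           icl=lambda fit, K: -(K - 3) ** 2,
                           K_max=10)
```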
4 Simulated data experiments
We propose three simulation studies to assess the performance of our method with respect to clustering and model selection.
4.1 Simulation study 1
In the first study, our goal is to assess the performance of our method in clustering the nodes.
We generate random networks of $N$ nodes from our likelihood model.
We consider $K$ latent groups.
The true mixing proportions are generated independently for each network using a symmetric Dirichlet distribution.
For each of the latent groups, the rates $\lambda$ and $\mu$ are generated independently from a gamma distribution with shape and rate both equal to $\gamma$. This distribution has mean $1$ and variance equal to $1/\gamma$. In other words, $\gamma$ is the precision associated to this gamma distribution, and it determines whether the groups are well separated or not. A small value of $\gamma$ will imply an easier inferential task, where clusters are well separated; vice versa, as $\gamma$ increases, separating the clusters will become more challenging. We repeat the experiment for several increasing values of $\gamma$.
Our model assumes that, for each pair of nodes, their first and last interactions are truncated. In order to mimic this behavior, we propose the following mechanism. First, for every pair of nodes, we sample the initial values $y_{ij}^{(1)}$ uniformly at random from the set $\{0, 1\}$. Then, we consider a time horizon $T$, and, for all pairs of nodes, we generate a sequence of i.i.d. exponential random variables, using the appropriate exponential rates, as indicated by the derived values of $y_{ij}^{(1)}$ and by the cluster memberships. Finally, we truncate the sequence at the first value whose cumulated length reaches $T$. Crucially, we truncate this last generated value to make sure that the overall length is exactly equal to $T$. In this way, every pair of nodes has cumulated interaction and non-interaction lengths equal to $T$.
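For a single edge, this generation mechanism can be sketched as follows; this is a simulation sketch under the stated assumptions, with names of our own choosing:

```python
import random

def simulate_edge(rate_interaction, rate_gap, T, y0, seed=42):
    """Generate an alternating sequence of exponential interaction and
    non-interaction lengths; the last value is truncated at the horizon
    T, so the first and last segments are censored, as in the model."""
    rng = random.Random(seed)
    lengths, states = [], []
    total, state = 0.0, y0
    while total < T:
        rate = rate_interaction if state == 1 else rate_gap
        x = min(rng.expovariate(rate), T - total)   # truncate the final value
        lengths.append(x)
        states.append(state)
        total += x
        state = 1 - state                           # alternate 0/1
    return lengths, states

# An edge starting in the non-interacting state, observed on [0, 10]:
lengths, states = simulate_edge(2.0, 0.5, T=10.0, y0=0)
```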
We run our inferential procedure once on each dataset, providing the correct number of groups as input to the algorithm. The computing times are reported in Table 1.
We assess the clustering performance by comparing the true and estimated partitions using the adjusted Rand index (ARI) introduced by hubert1985comparing. In particular, we use the estimated maximum a posteriori partition. The left panel of Figure 2 shows summaries of the obtained ARI values for each of the $\gamma$ values considered.
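For reference, the ARI of hubert1985comparing can be computed directly from the pair counts of the two partitions; a self-contained sketch:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(a, b):
    """Adjusted Rand index: the Rand index corrected for chance,
    equal to 1 for identical partitions (up to label switching)
    and close to 0 for unrelated ones."""
    n = len(a)
    pairs_ab = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(b).values())
    total = comb(n, 2)
    expected = pairs_a * pairs_b / total
    max_index = (pairs_a + pairs_b) / 2
    return (pairs_ab - expected) / (max_index - expected)

# Identical partitions up to label switching score 1:
ari = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```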
The smallest two values of $\gamma$ lead to well separated clusters, which are always successfully identified by our method. In these cases there are no differences between the true and estimated partitions. As $\gamma$ increases, the algorithm maintains a good performance in most of the generated datasets; however, it fails to recognize the groups in some of the networks.
4.2 Simulation study 2
In the second simulation study, our goal is to assess the performance of our method in both clustering and model choice. We consider again artificial networks of $N$ nodes. We repeat the experiment five times, i.e. for the true number of groups $K$ varying over a set of five values. In this study, we do not use the parameter $\gamma$: we consider instead a community structure, where nodes belonging to the same group tend to interact more frequently and for a longer time. For each value of $K$, the matrix $\lambda$ takes one common value on the diagonal and another on the off-diagonal elements; vice versa for the matrix $\mu$.
We estimate the parameters using the variational EM for ranging from one to ten groups, and retain the solution maximizing the ICL as the optimal configuration overall. As in the previous study, we compare the estimated maximum a posteriori partition with the true one using the ARI. The right panel of Figure 2 shows the summaries for the ARI values in each of the five cases considered. The index deviates from one as the number of groups increases, since the clustering and model choice tasks become more challenging.
The criterion performs very well when there are few or no groups, compared to the number of nodes. This is reasonable since, if few clusters are present, more nodes will belong to the same group, thus highlighting the heterogeneity in the data. By contrast, when the nodes are divided into many different clusters, the algorithm struggles to recover the true partition exactly and tends to underestimate the actual number of groups. However, as shown in the right panel of Figure 2, the ARI scores are rather high in all datasets, signalling that the optimal clustering obtained is fundamentally similar to the data-generating one.
4.3 Simulation study 3
In the third simulation study we consider a more challenging situation where the time interval can be short. This scenario poses a challenge since the available data that is used to estimate the model is rather limited. Similarly to the previous simulation studies, networks are generated at random, using latent groups. In this study, the matrix has values on the diagonal and on the off-diagonal elements, and has values on the diagonal and on the off-diagonal elements. Therefore, for two nodes in the same group, the average interaction length is , and the average non-interaction length is . On the other hand, for nodes in different groups, the average interaction length is , and the average non-interaction length is .
The time interval is denoted by $T$: we consider five different settings where $T$ varies over a set of increasing values. Figure 3 illustrates the performance of the variational EM procedure in each setting, as measured by the ARI.
When the time interval is long enough, the algorithm achieves good results and recovers the latent structure without errors. However, if $T$ is small, the available data are not sufficient to converge to a correct solution. In particular, for the smallest values of $T$, most edges do not exhibit any change in time, meaning that the only interaction (or non-interaction) that may appear gets truncated on both sides. The procedure manages to extract relevant information from the data as $T$ increases, in particular once edges tend, on average, to exhibit at least one value change.
5 High school students interaction data analysis
In this section we apply the proposed model to a dataset presented by mastrandrea:2015. The data concern face-to-face interactions among high school students in Marseilles, France, and were collected by means of wearable sensors over a period of several days in December 2013. Students wore a sensor badge on their chest, and the instrument recorded when they were facing each other with a time resolution of 20 seconds. Thus, any pair of students was considered to be interacting face-to-face when their sensors were exchanging data packets at any given time during a 20-second interval.
Additional information on the students is available from the same dataset. Students belong to 4 different main specializations: biology (BIO), mathematics and physics (MP), physics and chemistry (PC), and engineering studies (PSI). Figure 4 shows a time-aggregated summary of the data through a binary interaction network between the students, i.e. an edge denotes that a pair of students had at least one face-to-face contact during the observation period.
Before using our proposed method, we need to transform the data so that they take the form of a list of exponential lengths. Unfortunately, most data collected with wearable sensors are only available in a discrete representation; in fact, in our application the surroundings of an individual are scanned at 20-second intervals.
We aim to transform the available data into a continuous collection of adjacency matrices. In order to do this, we focus our attention only on pairs of nodes that have a sufficiently large number of proximity contacts over the observation period. For each of these pairs, we analyze the cadence of the contacts. We start exploring their contact history from time $0$, and, whenever we encounter consecutive face-to-face contacts that take place within a short time threshold, we start recording an interaction between the two nodes. We continue browsing their history, and, as soon as the gap between consecutive face-to-face contacts exceeds the threshold, we interrupt the interaction between the nodes. In this way, for every pair of nodes, we construct interactions of variable length, which appear in those time sections where face-to-face contacts between the two students were more frequent. In fact, one feature of this construction is that the start time and end time of each interaction coincide with face-to-face contact times. We note that this transformation shifts the interactions in time, but this has no effect on our analysis, since we only focus on the lengths of the interactions, and not necessarily on their positioning.
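This gap-based construction can be sketched as follows; since the contact-count and gap thresholds are application-specific choices, they are left as parameters here:

```python
def contacts_to_interactions(contact_times, max_gap):
    """Merge a sorted sequence of instantaneous face-to-face contact
    times into protracted interactions: consecutive contacts closer
    than max_gap belong to the same interaction, so each interaction
    starts and ends exactly at a contact time."""
    if not contact_times:
        return []
    interactions = []
    start = prev = contact_times[0]
    for t in contact_times[1:]:
        if t - prev > max_gap:        # gap too long: close the interaction
            interactions.append((start, prev - start))
            start = t
        prev = t
    interactions.append((start, prev - start))
    return interactions               # list of (start_time, length) pairs

# Contacts at t = 0, 20, 40, then a long pause, then t = 200, 220:
spans = contacts_to_interactions([0, 20, 40, 200, 220], max_gap=60)
```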
Our goal now is to compare the clustering structure informed by the constructed interaction lengths to the four different specializations that are observed. Hence we estimate the proposed SBM by running the algorithm with fixed to 4 in advance.
The ARI between the estimated partition and the different specialization classes is equal to 0.66, denoting good correspondence. From the table, Cluster 1 only corresponds to students in biology, while Cluster 2 is mainly a mixture of students taking the specialization in mathematics and physics and the specialization in engineering. The class corresponding to students in physics and chemistry is split among Cluster 3 and 4. Interestingly, these students are separated into two blocks, characterized by very dissimilar average interaction lengths. The right panel of Figure 5 reports the average values of the estimated interaction length parameters versus the average values of the estimated non-interaction length parameters (logarithmic scale). Cluster 4 has a smaller average time of interaction compared to the others and particularly to Cluster 3, which is the one with the longest average interaction time. Moreover, clusters are distinguished also in terms of average non-interaction time, with units in Cluster 1 being those not having face-to-face contacts for longer periods on average.
We note that, in Figures 4 and 5, the positions of the nodes are deduced from their binary interactions. This means that the lengths of interactions are not used in any way; hence, these graphical representations may not necessarily exhibit the true underlying topology of the observed network. For example, differently from the given clustering configuration, our method allocates several nodes that are positioned in the outskirts of the network (Figure 5, left panel) into the same cluster. From a static SBM point of view, the behavior of these nodes is not reasonable, since it does not align with the exhibited block and community structure. In other words, in the static SBM context, these nodes add a lot of heterogeneity to their cluster. On the other hand, the goal of our method is to cluster together nodes that have similar interaction lengths. This means that, while the nodes in this cluster have a heterogeneous behavior when choosing their neighbors, they also tend to create connections of similar lengths. Based on the right panel of Figure 5, we can state that the nodes in this cluster create both long interactions and long non-interactions.
The model introduced in this paper provides a new approach to analyze networks evolving over time. The main advantage of the proposed model is its fully continuous specification, which allows more flexibility and, possibly, a better fit to the observed data. Thanks to the SBM structure, the estimation of the model parameters can be performed efficiently with an adapted EM algorithm. We have shown this estimation method to be effective in our simulation studies, where the procedure successfully recovered the latent structure in a variety of scenarios. We have also proposed an adaptation of the integrated completed likelihood criterion, showing that it generally allows one to discover the number of latent clusters.
Since most observed networks evolve in a continuous fashion over time, the observed initial and final interaction lengths are generally truncated, due to the arbitrary data collection method that is used. Our approach includes this property directly in the modelling, hence leading to a formulation that is theoretically appropriate also in cases where the time interval is short.
This methodology can be used for both directed and undirected interactions, and it may be extended to a bipartite network context.
The model may also be extended to allow the rate parameters $\lambda$ and $\mu$ to change over time.
This type of idea has been recently explored for other types of blockmodels, see matias2018semiparametric and corneli2017multiple.
Our model may also be considered in a conjugate Bayesian setting, since conjugate priors may be specified for all of the model parameters.
One limitation of the approach proposed ensues from the multimodality of the objective function. While the variational objective is generally smoother than the actual marginal likelihood of the model, the algorithm adopted remains a heuristic one, in that there are no guarantees that the optimal solution found corresponds to the global maximum. This fact is a known hindrance in SBMs, and particularly in dynamic SBMs. The initialization method described in this paper may help prevent the local optimum issue, but it does not solve this problem.
Future work may build upon our proposal both in terms of modelling and on the computational aspects, by defining new continuous dynamic blockmodel frameworks, and by considering more effective estimation procedures, respectively.
The R package expSBM accompanies this paper and it provides an implementation of the variational algorithm described, for both directed and undirected networks. Parts of the code have been written in C++ to reduce the overall computing time. The package is publicly available from CRAN rcoreteam.
Appendix A Appendix
A.1 Proof of Proposition 1
The evidence lower bound is defined as follows:
We study the terms on the right hand side separately.
The three parts combined give (16).
A.2 Proof of Proposition 2
The evidence lower bound can be rewritten as follows:
Now consider the following Lagrangian:
with multipliers . The derivative is equal to the following:
Regarding the constraints:
This yields the following:
This critical point is a maximum. Using this result in (30) finishes the proof.
A.3 Proof of Proposition 3
Consider the following Lagrangian:
and its derivative:
This gives the root and in turn:
which leads to the result of the proposition. This critical point is a maximum.
A.4 Proof of Proposition 4
The derivative has a single root, which corresponds to a maximum. The formula for the other parameter is obtained analogously.