## 1 Introduction

Network analysis is by no means a new field, and a vast literature exploring its various aspects is available (GolderbergEtAl:2010:ASurveyOfStatisticalNetworkModels). In contrast to static networks, however, the study of dynamic, or temporally evolving, networks is still at a nascent stage. Over the past few years, the emergence of exciting applications and advances in computing capabilities have led to steady and accelerating progress in the study of dynamic networks (KimEtAl:2017:AReviewOfDynamicNetworkModelsWithLatentVariables).

A naive way to study dynamic networks is by applying static network analysis techniques to each individual network snapshot. However, this approach implicitly assumes that each snapshot has independent information, and completely ignores the relationship and shared information between snapshots. Those relationships are the essence of the network evolution, and both modeling and understanding them is of paramount importance. The situation is analogous to predicting the position of a car based on observed noisy positions from the past. Methods that ignore the dynamics of a moving vehicle perform rather poorly in comparison to methods that take it into account.

In a dynamic network, both the set of edges and the set of nodes may vary over time. However, in many real-world contexts it is reasonable to assume that the number of nodes is fixed over time (i.e. no new node joins the network and no existing node leaves it). Protein-protein interaction networks are one such example. In this paper, our main focus is on networks for which the presence of nodes is invariant over time while the presence of edges is time dependent. In Appendix D, we also outline an extension of our approach that accommodates the birth and death of nodes.

In this paper, we propose a statistical model for dynamic networks, which we call Dynamic Latent Attribute Interaction Model (DLAIM). This model imposes a minimal set of assumptions on the dynamics, in contrast with other existing approaches (XingEtAl:2010:AStateSpaceMixedMembershipBlockmodelForDynamicNetworkTomography; FouldsEtAl:2011:ADynamicRelationalInfiniteFeatureModelForLongitudinalSocialNetworks; HeaukulaniEtAl:2013:DynamicProbabilisticModelsForLatentFeaturePropagationInSocialNetworks; KimEtAl:2013:NonparametricMultiGroupMembershipModelForDynamicNetworks; GuptaEtAl:2018:EvolvingLatentSpaceModelForDynamicNetworks). This increases the flexibility of our model. Furthermore, our model applies to directed as well as undirected networks. Rather than focusing on a specific inference task, our model attempts to capture some of the mechanisms believed to be behind the network evolution. This means that performing inference in this model can give important insights about the network evolution. In particular, our model is able to capture underlying community structures and evolving node attributes and their interactions.

In theory, the parameters of the proposed model could be estimated (from training data) via Bayesian inference. However, the likelihood structure of the model is complex and non-convex, making such methods computationally infeasible. This motivates a neural network based variational inference procedure yielding an end-to-end trainable architecture that can be used for efficient and scalable inference. This is described in detail in Sections 3 and 4.

To objectively compare the performance of our model to existing approaches, we consider the task of link forecasting: predicting future network links given only the past observations. Perhaps more interestingly (although more subjectively), one can examine the learned node attributes and interactions and their evolution. Notably, in Section 5.2 we show that the learned quantities may have a physical significance, providing insights in line with what one expects given the knowledge and context of these networks (collaboration networks in our examples). These might be used as aids for visualization, as well as for interpretation of network dynamics.

Contributions: (i) we have proposed a new statistical model for dynamic networks that encodes temporal edge dependencies and can model both undirected and directed networks; (ii) we have developed a computationally scalable neural network based variational inference procedure for the model; and (iii) we have provided ample empirical evidence that the model is suitable for link forecasting while simultaneously providing important insights into the network evolution mechanics as it can provide interpretable embeddings.

## 2 Related Work

One of the first successful statistical models for dynamic networks was proposed in (XingEtAl:2010:AStateSpaceMixedMembershipBlockmodelForDynamicNetworkTomography). It is an extension of the well known Mixed Membership Stochastic Blockmodel (Airoldi:2008:MixedMembershipStochasticBlockmodels) with the additional assumption that parameters evolve via a Gaussian random walk, i.e. , where is a parameter and and are fixed matrices. Since and are fixed, the model assumes that the process governing the network dynamics itself does not change over time, which might be a limiting assumption. Since then, multiple researchers have proposed extensions of static network models like the Stochastic Blockmodel (HollandEtAl:1983:StochasticBlockmodelsFirstSteps) to the case of dynamic networks (YangEtAl:2011:DetectingCommunitiesAndTheirEvolutionsInDynamicSocialNetworksABayesianapproach; Xu:2014:DynamicStochasticBlockmodelsForTimeEvolvingSocialNetworks; Xu:2015:StochasticBlockTransitionModelsForDynamicNetworks).

Another class of models extends the general latent space model for static networks to the dynamic network setting (SarkarEtAl:2005:DynamicSocialNetworkAnalysisUsingLatentSpaceModels; FouldsEtAl:2011:ADynamicRelationalInfiniteFeatureModelForLongitudinalSocialNetworks; HeaukulaniEtAl:2013:DynamicProbabilisticModelsForLatentFeaturePropagationInSocialNetworks; KimEtAl:2013:NonparametricMultiGroupMembershipModelForDynamicNetworks; SwellEtAl:2015:LatentSpaceModelsForDynamicNetworks; SwellEtAl:2016:LatentSpaceModelsForDynamicNetworksWithWeightedEdges; GuptaEtAl:2018:EvolvingLatentSpaceModelForDynamicNetworks). Our proposed model also falls under this category. The basic idea behind such models is to represent each node by an embedding (which may change with time) and model the probability of an edge as a function of the embeddings of the two endpoints. All of these approaches (except (GuptaEtAl:2018:EvolvingLatentSpaceModelForDynamicNetworks)) use an MCMC based inference procedure that does not directly support neural network based inference.

Our model most closely resembles (KimEtAl:2013:NonparametricMultiGroupMembershipModelForDynamicNetworks) in terms of modeling network snapshots and (GuptaEtAl:2018:EvolvingLatentSpaceModelForDynamicNetworks) in terms of performing inference. However, there are notable differences: (i) approaches like (SarkarEtAl:2005:DynamicSocialNetworkAnalysisUsingLatentSpaceModels; FouldsEtAl:2011:ADynamicRelationalInfiniteFeatureModelForLongitudinalSocialNetworks; KimEtAl:2013:NonparametricMultiGroupMembershipModelForDynamicNetworks; SwellEtAl:2015:LatentSpaceModelsForDynamicNetworks) assume that the nature of interactions between nodes is constant over time. In our model the role of each attribute can also change. This is a rather distinctive feature of DLAIM, allowing us to capture both local dynamics (the evolution of node attributes) and global dynamics (the evolving role of attributes); (ii) our model for static network snapshots is fully differentiable, which allows us to use a neural network based variational inference procedure as opposed to most existing methods that use MCMC based inference; (iii) approaches like (XingEtAl:2010:AStateSpaceMixedMembershipBlockmodelForDynamicNetworkTomography; HeaukulaniEtAl:2013:DynamicProbabilisticModelsForLatentFeaturePropagationInSocialNetworks; GuptaEtAl:2018:EvolvingLatentSpaceModelForDynamicNetworks) impose strict restrictions on dynamics, for example, (HeaukulaniEtAl:2013:DynamicProbabilisticModelsForLatentFeaturePropagationInSocialNetworks; GuptaEtAl:2018:EvolvingLatentSpaceModelForDynamicNetworks) assume that each attribute changes based on a non-negative linear combination of its neighbors. This is not necessarily justified in all settings. In DLAIM we assume only a smoothly changing network; and (iv) since we do not use the adjacency matrix row as input to our neural network, our model is more scalable as compared to (GuptaEtAl:2018:EvolvingLatentSpaceModelForDynamicNetworks).

## 3 Dynamic Latent Attribute Interaction Model

### 3.1 Modeling Individual Snapshots

In our model, time is discrete. The network evolution is therefore described by the corresponding network snapshots at each timestep $t$, specified by binary adjacency matrices $A^{(t)} \in \{0, 1\}^{N \times N}$, where $N$ is the number of nodes. We assume that there are no self-loops. Each node is modeled by $K$ latent attributes whose values lie in the interval $[0, 1]$. These attributes can change over time. We use $z_i^{(t)} \in [0, 1]^K$ to denote the latent attributes of node $i$ at time $t$.

The interaction between the attribute vectors of each pair of nodes directly dictates the probability of observing an edge between them. For simplicity, our interaction model only encodes interactions between attributes of the same type, described by $K$ interaction matrices. Let $\Theta_k^{(t)} \in \mathbb{R}^{2 \times 2}$, $k = 1, \dots, K$, be a matrix that encodes the affinity between nodes with respect to attribute $k$ at time $t$. For undirected graphs, the matrices $\Theta_k^{(t)}$ are symmetric. At time $t$ the node attributes and interaction matrices fully determine the probability of edges being present. Formally, given $z_i^{(t)}$, $z_j^{(t)}$ and $\Theta_k^{(t)}$, $k = 1, \dots, K$, edges occur independently and the probability of an edge from node $i$ to $j$ is modeled as:

$$P\big(a_{ij}^{(t)} = 1 \,\big|\, z_i^{(t)}, z_j^{(t)}, \{\Theta_k^{(t)}\}_{k=1}^{K}\big) = \sigma\Big(\sum_{k=1}^{K} \theta_k^{(t)}(i, j)\Big), \tag{1}$$

where $\theta_k^{(t)}(i, j)$ is defined as:

$$\theta_k^{(t)}(i, j) = \mathbb{E}_{x \sim \mathcal{B}(z_{ik}^{(t)}),\; y \sim \mathcal{B}(z_{jk}^{(t)})}\big[\Theta_k^{(t)}[x, y]\big]. \tag{2}$$

Here $\sigma$ is the sigmoid function, $\mathcal{B}(\alpha)$ refers to a Bernoulli distribution with parameter $\alpha$, and $\Theta[x, y]$ is the $(x, y)$-th entry of matrix $\Theta$. Finally, $x$ and $y$ are independent. This formulation allows representation of both homophilic and heterophilic interactions among nodes based on the structure of the matrices $\Theta_k^{(t)}$.

The interaction model that we consider is in the same spirit as the Multiplicative Attribute Graph (MAG) model (KimLeskovec:2012:MultiplicativeAttributeGraphModelOfRealWorldNetworks). Some other dynamic network models (KimEtAl:2013:NonparametricMultiGroupMembershipModelForDynamicNetworks) use the MAG model directly to represent each static network snapshot. In our case there are two differences: (i) our node attributes are not restricted to being binary; and (ii) we have a differentiable expectation operation as given in (2) instead of the non-differentiable “selection” operation given in (KimLeskovec:2012:MultiplicativeAttributeGraphModelOfRealWorldNetworks). These differences crucially allow us to use a neural network based variational inference procedure.
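To make the interaction model concrete, the following minimal sketch evaluates the edge probability described by (1) and (2) for one pair of nodes. The function names, the list-of-lists representation of the 2x2 interaction matrices, and the example matrix values are illustrative assumptions and are not taken from the paper.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def expected_interaction(theta, z_i, z_j):
    """E[theta[x][y]] for independent x ~ Bernoulli(z_i), y ~ Bernoulli(z_j)."""
    return sum(
        theta[x][y]
        * (z_i if x == 1 else 1.0 - z_i)
        * (z_j if y == 1 else 1.0 - z_j)
        for x in (0, 1)
        for y in (0, 1)
    )


def edge_probability(thetas, z_i, z_j):
    """Sigmoid of the expected interactions summed over all attributes."""
    score = sum(
        expected_interaction(theta, a, b)
        for theta, a, b in zip(thetas, z_i, z_j)
    )
    return sigmoid(score)


# Hypothetical single attribute whose matrix rewards matching values (homophily).
homophilic = [[2.0, -2.0], [-2.0, 2.0]]
p_same = edge_probability([homophilic], [0.9], [0.9])
p_diff = edge_probability([homophilic], [0.9], [0.1])
```

With this assumed matrix, two nodes that both express the attribute strongly receive an edge probability above one half, while a mismatched pair receives one below one half. Because the expectation is a polynomial in the attribute values, the score is differentiable in them, which is the property the inference procedure relies on.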

### 3.2 Modeling Network Dynamics

Having described how each network snapshot is generated, it remains to describe how attributes and their interactions evolve over time. To make an analogy with genetics, each attribute type might be seen as a gene, and the attribute vector corresponds to the gene expression profile of a given node. The level of expression of each attribute might change over time: nodes may start exhibiting new attributes and stop exhibiting old ones, thereby leading to a change in their attribute vectors. At the same time, the role of each attribute in regulating the presence of edges in the network may also change over time, leading to a change in the interaction matrices.

One approach to model the dynamics of a network is to use domain expertise to impose a specific set of assumptions on the process governing the dynamics. However, this limits the class of networks that can be faithfully modeled. Instead, we adopt the strategy of imposing a minimal set of assumptions on the dynamics. This is in the same spirit as the models used in tracking with stochastic filtering (e.g., Kalman filters) (YilmazJavedShah:2006:ObjectTrackingASurvey), where dynamics are rather simple and primarily capture the insight that the state of the system cannot change too dramatically over time. The use of simple dynamics together with a powerful function approximator (a neural network) during the inference ensures that a simple yet powerful model can be learned from observed network data.

Let $\Theta_k^{(t)}$ denote the interaction matrix of attribute $k$ at time $t$, and let $\tilde{\theta}_k^{(t)}$ be a vector consisting of the entries of $\Theta_k^{(t)}$. (For directed graphs the matrix $\Theta_k^{(t)}$ can be arbitrary, therefore $\tilde{\theta}_k^{(t)}$ will have four entries. For undirected graphs the matrices are symmetric, and hence three entries suffice.) We model the evolution of the matrices $\Theta_k^{(t)}$ as:

$$\tilde{\theta}_k^{(t)} \sim \mathcal{N}\big(\tilde{\theta}_k^{(t-1)},\; s_\theta^2 I\big), \tag{3}$$

where $s_\theta$ is a model hyperparameter and $I$ denotes the identity matrix. This model captures the intuition that the interaction matrices will likely not change dramatically over time.

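Since the entries of the attribute vector $z_i^{(t)}$ of node $i$ at time $t$ are restricted to lie in $[0, 1]$, a similar dynamics model as above is not possible. A simple workaround is to re-parameterize the problem by introducing the vectors $\psi_i^{(t)} \in \mathbb{R}^K$, where $K$ is the number of attributes, such that

$$z_{ik}^{(t)} = \sigma\big(\psi_{ik}^{(t)}\big), \quad k = 1, \dots, K. \tag{4}$$

As before, $\sigma$ is the sigmoid function. Now we can have an evolution model similar to (3) on the vectors $\psi_i^{(t)}$:

$$\psi_i^{(t)} \sim \mathcal{N}\big(\psi_i^{(t-1)},\; s_\psi^2 I\big). \tag{5}$$

Here $s_\psi$ is a model hyperparameter. This in turn models the evolution of the vectors $z_i^{(t)}$.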

Note that (3) and (5) only imply that the values of the variables are unlikely to change very quickly. Other than that, they do not place any strong or network specific restriction on the dynamics. The variance hyperparameters in (3) and (5) control the magnitude of likely change.

This approach for modeling dynamics has advantages and disadvantages. The major advantage is flexibility, since during inference time, a powerful enough function approximator can learn appropriate network dynamics from the observed data. However, this is a generative model and realizations of this model will generally yield globally unrealistic network dynamics. Nevertheless, within small time intervals, the behavior of the networks will be consistent with what is observed in realistic scenarios, and this is enough to ensure good tracking performance. In many real world cases, a suitable amount of observed data is available but clues about the network dynamics are unavailable. Since the task is to gain meaningful insights from the data, we believe the advantages of this approach outweigh the disadvantages.

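Note that (3) and (5) are applicable from timestep $t = 2$ onward. The initial re-parameterized attribute vectors $\psi_i^{(1)}$ and vectorized interaction matrices $\tilde{\theta}_k^{(1)}$ are sampled from the following prior distributions:

$$\psi_i^{(1)} \sim \mathcal{N}\big(0,\; \sigma_\psi^2 I\big), \qquad \tilde{\theta}_k^{(1)} \sim \mathcal{N}\big(0,\; \sigma_\theta^2 I\big). \tag{6}$$

Here, $\sigma_\psi$ and $\sigma_\theta$ are hyperparameters. In our experiments, we set these hyperparameters to a high value. This allows the initial embeddings to become flexible enough to represent the first snapshot faithfully. After that, the assumption that the network changes slowly ((3) and (5)) is used to sample the value of the random variables $\psi_i^{(t)}$ and $\tilde{\theta}_k^{(t)}$ for $t = 2, \dots, T$.

We make the following independence assumptions: given $\psi_i^{(t-1)}$, the vectors $\psi_i^{(t)}$ are independent of any quantity indexed by time $t' \le t - 1$. An analogous statement applies to the interaction matrices $\Theta_k^{(t)}$. Finally, given $z_i^{(t)}$, $z_j^{(t)}$ and $\Theta_k^{(t)}$, $k = 1, \dots, K$, the entries $a_{ij}^{(t)}$ are independent of everything else. The graphical model and generative process for DLAIM are given in Fig. 3 and Algorithm 1 respectively in Appendix A.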
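The generative process described in this section can be sketched in a few lines. This is an illustration only and not the authors' Algorithm 1 (which is deferred to Appendix A). It assumes zero-mean Gaussian priors at the first timestep, Gaussian random-walk transitions afterwards, the directed case with four free entries per interaction matrix, and arbitrary default hyperparameter values; all names are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_dlaim(n_nodes, n_attrs, n_steps, s_psi=0.1, s_theta=0.1,
                 sigma_psi=3.0, sigma_theta=3.0, seed=0):
    """Sample a sequence of directed adjacency matrices (lists of lists)."""
    rng = random.Random(seed)
    # First timestep: broad priors on unconstrained attributes and matrix entries.
    psi = [[rng.gauss(0.0, sigma_psi) for _ in range(n_attrs)]
           for _ in range(n_nodes)]
    theta = [[rng.gauss(0.0, sigma_theta) for _ in range(4)]
             for _ in range(n_attrs)]
    snapshots = []
    for t in range(n_steps):
        if t > 0:
            # Later timesteps: small Gaussian perturbation of the previous state.
            psi = [[v + rng.gauss(0.0, s_psi) for v in row] for row in psi]
            theta = [[v + rng.gauss(0.0, s_theta) for v in row] for row in theta]
        # Squash unconstrained attributes into [0, 1].
        z = [[sigmoid(v) for v in row] for row in psi]
        adj = [[0] * n_nodes for _ in range(n_nodes)]
        for i in range(n_nodes):
            for j in range(n_nodes):
                if i == j:
                    continue  # no self-loops
                score = 0.0
                for k in range(n_attrs):
                    t00, t01, t10, t11 = theta[k]
                    zi, zj = z[i][k], z[j][k]
                    # Expected entry of the 2x2 matrix under independent Bernoullis.
                    score += (t00 * (1 - zi) * (1 - zj) + t01 * (1 - zi) * zj
                              + t10 * zi * (1 - zj) + t11 * zi * zj)
                adj[i][j] = 1 if rng.random() < sigmoid(score) else 0
        snapshots.append(adj)
    return snapshots
```

Calling `sample_dlaim(5, 2, 3)` returns three 5x5 binary matrices with empty diagonals. Small transition scales relative to the prior scales reproduce the intended behaviour: a flexible first snapshot followed by slow drift.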

## 4 Inference in DLAIM

In practice an observed sequence of network snapshots is available, and the main inference task is to estimate the values of the underlying latent random variables. In DLAIM performing exact inference is intractable because the computation of marginalized log probability of observed data results in integrals that are hard to evaluate. Thus, approximate inference techniques must be adopted.

Our goal is to compute an approximation to the true posterior distribution of the latent variables given the observed snapshots. Note that, in our current approach, the variance parameters of the transition models (3) and (5) and of the priors (6) are hyperparameters that are simply set by the user. We pose the inference problem as an optimization problem by using Variational Inference (BleiEtAl:2017:VariationalInferenceAReviewForStatisticians) and parameterize the approximating distribution by a neural network. Variational inference offers benefits such as efficiency and scalability (BleiEtAl:2017:VariationalInferenceAReviewForStatisticians). Moreover, when coupled with powerful neural networks, its ability to model complicated distributions has been demonstrated by several researchers (KingmaEtAl:2013:AutoEncodingVariationalBayes).

The main idea of variational inference is to approximate the posterior distribution by a suitable surrogate. Consider a general latent variable model with the set of all observed random variables $X$ and the set of all latent random variables $Z$. The (intractable) posterior distribution $p(Z \mid X)$ is approximated by a parameterized distribution $q_\phi(Z)$, where $\phi$ is the set of all the parameters of $q$. One would like the distribution $q_\phi(Z)$ to be as close to the distribution $p(Z \mid X)$ as possible. In general, the Kullback-Leibler (KL) divergence is used as a measure of dissimilarity between the two distributions. The goal of variational inference is to find the parameters $\phi$ for which $\mathrm{KL}\big(q_\phi(Z) \,\|\, p(Z \mid X)\big)$ is minimized. However, this optimization objective is intractable since one cannot efficiently compute $p(Z \mid X)$. Nevertheless one can show that maximizing the Evidence Lower Bound Objective (ELBO) given by

$$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(Z)}\big[\log p(X, Z) - \log q_\phi(Z)\big] \tag{7}$$

is equivalent to minimizing the KL criterion (BleiEtAl:2017:VariationalInferenceAReviewForStatisticians). For most models, the ELBO can be efficiently computed or approximated by imposing a suitable set of assumptions on $q_\phi$ as described later. In the context of our model the distribution $q_\phi$ will be parameterized by a neural network and hence $\phi$ will represent the set of parameters of that neural network.
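The equivalence between maximizing the ELBO and minimizing the KL divergence follows from a standard identity, sketched below. The symbols $X$ (observed variables), $Z$ (latent variables), $q_\phi$ (approximating distribution) and $\mathcal{L}(\phi)$ (the ELBO) are notation assumed here for illustration.

```latex
\begin{aligned}
\mathrm{KL}\big(q_\phi(Z) \,\|\, p(Z \mid X)\big)
  &= \mathbb{E}_{q_\phi(Z)}\big[\log q_\phi(Z) - \log p(Z \mid X)\big] \\
  &= \mathbb{E}_{q_\phi(Z)}\big[\log q_\phi(Z) - \log p(X, Z)\big] + \log p(X) \\
  &= -\mathcal{L}(\phi) + \log p(X).
\end{aligned}
```

Since $\log p(X)$ does not depend on $\phi$, increasing $\mathcal{L}(\phi)$ decreases the KL divergence by the same amount, and non-negativity of the KL divergence gives $\mathcal{L}(\phi) \le \log p(X)$, which is why the objective is called a lower bound on the evidence.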

### 4.1 Approximating ELBO

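The latent variables in our model correspond to the elements of the re-parameterized attribute vectors $\psi_i^{(t)}$ and the vectorized interaction matrices $\tilde{\theta}_k^{(t)}$ for $i = 1, \dots, N$, $k = 1, \dots, K$ and $t = 1, \dots, T$. The observed variables are the adjacency matrices $A^{(1)}, \dots, A^{(T)}$. The parameter vector $\phi$ consists of the weights of the neural network. Writing $\Psi$, $\Theta$ and $A$ for the collections of all $\psi_i^{(t)}$, $\tilde{\theta}_k^{(t)}$ and $A^{(t)}$ respectively, and following (7), we get:

$$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(\Psi, \Theta)}\big[\log p(A, \Psi, \Theta) - \log q_\phi(\Psi, \Theta)\big]. \tag{8}$$

Using the independence assumptions stated in Section 3, one can write:

$$\begin{aligned}
\log p(A, \Psi, \Theta) ={}& \sum_{i} \log p\big(\psi_i^{(1)}\big) + \sum_{k} \log p\big(\tilde{\theta}_k^{(1)}\big) \\
&+ \sum_{t=2}^{T} \Big[\sum_{i} \log p\big(\psi_i^{(t)} \mid \psi_i^{(t-1)}\big) + \sum_{k} \log p\big(\tilde{\theta}_k^{(t)} \mid \tilde{\theta}_k^{(t-1)}\big)\Big] \\
&+ \sum_{t=1}^{T} \sum_{i \neq j} \log p\big(a_{ij}^{(t)} \mid \psi_i^{(t)}, \psi_j^{(t)}, \{\tilde{\theta}_k^{(t)}\}_{k=1}^{K}\big).
\end{aligned} \tag{9}$$

The right hand side of (9) can be computed using (1), (3), (4), (5) and (6). Following the standard practice (BleiEtAl:2017:VariationalInferenceAReviewForStatisticians), we also assume that $q_\phi$ belongs to a mean field family of distributions, i.e. all the variables are independent under $q_\phi$:

$$q_\phi(\Psi, \Theta) = \prod_{t=1}^{T} \Big[\prod_{i=1}^{N} q_\phi\big(\psi_i^{(t)}\big) \prod_{k=1}^{K} q_\phi\big(\tilde{\theta}_k^{(t)}\big)\Big]. \tag{10}$$

We model the distributions $q_\phi(\psi_i^{(t)})$ and $q_\phi(\tilde{\theta}_k^{(t)})$ using Gaussian distributions as given in (11) and (12), with mean vectors $m$ and covariance matrices $\Sigma$:

$$q_\phi\big(\psi_i^{(t)}\big) = \mathcal{N}\big(m_{\psi_i}^{(t)},\; \Sigma_{\psi_i}^{(t)}\big), \tag{11}$$

$$q_\phi\big(\tilde{\theta}_k^{(t)}\big) = \mathcal{N}\big(m_{\theta_k}^{(t)},\; \Sigma_{\theta_k}^{(t)}\big). \tag{12}$$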

We wish to learn the mean and covariance parameters of the Gaussian distributions in (11) and (12) (these are called variational parameters). There are two possible approaches for doing this: (i) the ELBO can be directly optimized as a function of the variational parameters, or (ii) one can model the variational parameters as outputs of some other parametric function (like a neural network) and then optimize the parameters of that parametric function. The second approach can be viewed as a form of regularization where the space in which the variational parameters can lie is constrained to the range of the parametric function in use. We adopt the latter approach, and obtain the variational parameters as outputs of neural networks. We use $\phi$ to denote the set of neural network parameters. Thus the variational parameters are functions of $\phi$, but we do not explicitly mention this dependence in general to avoid notational clutter. The ELBO can now be computed by using (9) and (10) in (8). Integration of the term involving the edge likelihood (1) is hard, so for this term Monte Carlo estimation can be used. In all our experiments we use only one sample to get an approximation to (8), as was done in (KingmaEtAl:2013:AutoEncodingVariationalBayes). Additionally, we empirically observed that for , using and directly as a sample for Monte-Carlo estimation improves the link forecasting performance and hence we do this in our experiments.

### 4.2 Network Architecture

We use a neural network to parameterize the distributions in (11) and (12). Our network consists of four separate GRUs (ChoEtAl:2014:OnThePropertiesOfNeuralMachineTranslationEncoderDecoredApproaches) (also see Appendix C), one each for the mean and the covariance parameters of (11) and of (12). Details about the network architecture are given in Appendix B.

Once the mean and covariance parameters are available, we use the reparameterisation trick (KingmaEtAl:2013:AutoEncodingVariationalBayes) to sample the latent variables using (11) and (12). These samples are then used to approximate the ELBO in (8) as described in Section 4.1. The training objective is to maximize the ELBO. A key advantage of our model is that the ELBO is differentiable with respect to the network parameters, so gradients can be easily computed by back-propagation, which allows one to capitalize on the powerful optimization methods used for training neural networks. Furthermore, since the ELBO uses only pairwise interactions among nodes, we can operate in a batch setting where only a subset of all nodes and the interactions within this subset are considered. This allows us to scale up to rather large networks by training our model on random batches of nodes.
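The two ingredients of this training step, a reparameterised draw and a one-sample estimate restricted to a random batch of nodes, can be sketched as follows. This is a plain-Python illustration under stated assumptions: the function names are hypothetical, the Gaussians are taken to be diagonal, and `pair_log_lik` is a placeholder for the edge log-likelihood rather than the paper's implementation. In an automatic differentiation framework the same structure lets gradients flow through the mean and standard deviation.

```python
import random


def reparameterized_sample(mean, std, rng):
    """Return mean + std * eps with eps ~ N(0, 1), drawn elementwise."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


def batch_log_likelihood(adj, means, stds, pair_log_lik, batch_size, rng):
    """One-sample Monte Carlo estimate of the edge term on a random node batch."""
    batch = rng.sample(range(len(adj)), batch_size)
    # A single reparameterized draw per node in the batch.
    psi = {i: reparameterized_sample(means[i], stds[i], rng) for i in batch}
    total = 0.0
    for i in batch:
        for j in batch:
            if i != j:  # no self-loops
                total += pair_log_lik(adj[i][j], psi[i], psi[j])
    return total
```

Only the ordered pairs inside the batch are visited, so the cost of this term grows with the square of the batch size rather than with the square of the number of nodes.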

One additional benefit of using a neural network as opposed to learning the variational parameters directly is that the neural network should be able to capture temporal patterns in the data that cannot be captured by the variational parameters on their own. Since the neural network is trained to predict the parameters for a given timestep from the history up to the preceding timestep, it is encouraged to look for temporal patterns.

We use the well known Adam optimizer (KingmaBa:2014:AdamAMethodforStochasticOptimization) with a learning rate of to train the inference network. A separate inference network is trained for each timestep (in other words, to make predictions for a given timestep we train the inference network with all the observations up to the preceding timestep). Note that all networks have exactly the same number of parameters. When training, the parameters of the neural network used to make predictions at a given timestep are initialized with the parameters of the trained network for the preceding timestep.

A Note on Scalability: For each batch/iteration, one needs edge probabilities and distance between embeddings across successive timesteps. These require and operations, respectively (batch size). Note, however, that many operations can be parallelized to improve runtime. We use and randomly sample batches in our experiments. Moreover, approaches that use MCMC are usually much slower than approaches that use variational inference (BleiEtAl:2017:VariationalInferenceAReviewForStatisticians) and hence we believe that our approach is more scalable as compared to existing approaches.

Dataset | #Nodes | #Snapshots | Directed |
---|---|---|---|
Enron-50 | 50 | 37 | ✗ |
Enron-Full | 149 | 24 | ✓ |
Infocom | 78 | 50 | ✗ |
NIPS-110 | 110 | 17 | ✗ |
EU-U | 986 | 33 | ✗ |
EU-D | 986 | 33 | ✓ |
CollegeMsg | 1899 | 19 | ✓ |
MIT Reality Mining | 94 | 37 | ✗ |

Table 1: Summary of the datasets.

Method | Enron-50 | Infocom | NIPS-110 | EU-U |
---|---|---|---|---|
BAS | 0.874 | 0.698 | 0.703 | 0.914 |
LFRM (MillerEtAl:2009:NonparametricLatentFeatureModelsForLinkPrediction) | 0.777 | 0.640 | 0.398 | - |
DRIFT (FouldsEtAl:2011:ADynamicRelationalInfiniteFeatureModelForLongitudinalSocialNetworks) | 0.910 | 0.782 | 0.672 | - |
DMMG (KimEtAl:2013:NonparametricMultiGroupMembershipModelForDynamicNetworks) | - | 0.804 | 0.732 | - |
iELSM (GuptaEtAl:2018:EvolvingLatentSpaceModelForDynamicNetworks) | 0.913 | 0.868 | 0.754 | 0.948 |
DLAIM (this paper) | 0.923 ± 0.002 | 0.821 ± 0.007 | 0.810 ± 0.008 | 0.973 ± 0.001 |

Table 2: Link forecasting. Mean AUC scores with standard deviation across 20 independent executions of the experiment for undirected networks.

Dataset | BAS | DLAIM (this paper) |
---|---|---|
Enron-Full | 0.842 | 0.928 ± 0.004 |
EU-D | 0.902 | 0.934 ± 0.003 |
CollegeMsg | 0.686 | 0.857 ± 0.007 |

Table 3: Link forecasting. AUC scores for directed networks.

## 5 Experiments

In this section, we evaluate the performance of our model on several benchmark real world networks. In order to objectively compare the performance of DLAIM with other approaches we focus on the task of link forecasting (formally described in Section 5.1). In these examples our approach outperforms other approaches suitable for this task. We also perform a qualitative case study to demonstrate the utility of learned node attribute vectors and interaction matrices. Table 1 summarizes the datasets (also see Appendix E).

[Best viewed in color] Network between a subset of authors chosen to highlight the community structure. Different colors have been used to differentiate the communities found by running spectral clustering algorithm on a normalized version of adjacency matrix predicted by our method.

### 5.1 Link Forecasting

We consider the setting where we are given a dynamic network as a sequence of snapshots observed up to some timestep. The task is to use the observed data to predict the next, not yet observed, snapshot. Note that this task is different from (and inherently more difficult than) missing link prediction, where only the missing edges within a partially observed snapshot are to be found.

In all our experiments, we fixed the value of as it allowed significant amount of flexibility to the model while maintaining computational tractability. Similarly, based on preliminary experiments with NIPS-110 dataset, we chose and for all experiments. The fact that we were able to reuse the same values of hyperparameters across all our link forecasting experiments indicates that our approach is rather robust and dataset specific tuning is not generally required.

A simple baseline method (denoted by BAS) treats each entry of as an independent Bernoulli random variable with a prior (FouldsEtAl:2011:ADynamicRelationalInfiniteFeatureModelForLongitudinalSocialNetworks). We compare our performance against this simple baseline and other existing approaches (MillerEtAl:2009:NonparametricLatentFeatureModelsForLinkPrediction; FouldsEtAl:2011:ADynamicRelationalInfiniteFeatureModelForLongitudinalSocialNetworks; KimEtAl:2013:NonparametricMultiGroupMembershipModelForDynamicNetworks; GuptaEtAl:2018:EvolvingLatentSpaceModelForDynamicNetworks).
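A baseline of this kind can be sketched as follows. The specific prior used for BAS is not recoverable from the text here, so the sketch assumes a Beta prior with hypothetical parameters `alpha` and `beta`, under which the forecast for each entry is the posterior predictive mean given that entry's history. It should be read as one plausible instantiation, not as the exact baseline evaluated in the experiments.

```python
def bas_forecast(snapshots, alpha=1.0, beta=1.0):
    """Per-entry edge probability for the next snapshot.

    Each entry (i, j) is treated as an independent Bernoulli variable with an
    assumed Beta(alpha, beta) prior; the forecast is the posterior mean after
    observing that entry in every past snapshot.
    """
    n_steps = len(snapshots)
    n_nodes = len(snapshots[0])
    return [
        [
            (alpha + sum(s[i][j] for s in snapshots)) / (alpha + beta + n_steps)
            for j in range(n_nodes)
        ]
        for i in range(n_nodes)
    ]


# Two toy snapshots of a two-node directed network.
history = [
    [[0, 1], [1, 0]],
    [[0, 1], [0, 0]],
]
forecast = bas_forecast(history)
```

Because every entry is handled in isolation, this baseline ignores both network structure and the ordering of the snapshots, which is what makes it a useful reference point for models that exploit either.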

LFRM, or Latent Feature Infinite Relational Model (MillerEtAl:2009:NonparametricLatentFeatureModelsForLinkPrediction), represents nodes in a static network using binary feature vectors. It is a non-parametric model that imposes an Indian Buffet Process (Griffiths:2011:TheIndianBuffetProcessAnIntroductionAndReview) prior on a feature matrix whose rows encode the feature vectors of the nodes. (FouldsEtAl:2011:ADynamicRelationalInfiniteFeatureModelForLongitudinalSocialNetworks) proposed the Dynamic Relational Infinite Feature Model (DRIFT) as an extension of LFRM for dynamic networks by allowing the features of nodes to evolve under a Markov assumption. When computing predictions for a given timestep, the LFRM model trained on the preceding snapshot was used (FouldsEtAl:2011:ADynamicRelationalInfiniteFeatureModelForLongitudinalSocialNetworks). The Dynamic Multigroup Membership Model (DMMG) (KimEtAl:2013:NonparametricMultiGroupMembershipModelForDynamicNetworks) uses a model similar to ours but with discrete node attributes and fixed interaction matrices (Section 2). All these methods use MCMC based inference, but since our model is completely differentiable, we are able to use a neural network based variational inference procedure. (GuptaEtAl:2018:EvolvingLatentSpaceModelForDynamicNetworks) also have a differentiable model for which they use neural network based variational inference; however, their generative model is restrictive as they only focus on assortative and undirected networks (Section 2).

We consider both directed and undirected networks. In the case of undirected networks the interaction matrices are symmetric for all attributes and timesteps; thus, effectively there are only three random variables in each matrix. A simple change to the output dimension of the relevant GRUs (see Appendix B) accommodates this.

We use the well known AUC (Area Under Curve) score for comparison with other approaches. AUC computes the area under the true-positive rate vs false-positive rate curve for various values of the threshold used for classification. Values close to 1 indicate good results. The scores reported in Tables 2 and 3 were obtained by first averaging the scores obtained across snapshots and then taking the mean values across 20 independent runs of the inference network.
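For reference, AUC can be computed without explicitly tracing the curve, using its equivalent interpretation as the probability that a randomly chosen positive entry receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below is a simple quadratic-time illustration with hypothetical names, not the evaluation code used in the experiments.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))
```

For link forecasting, `labels` would hold the off-diagonal entries of the true next snapshot and `scores` the corresponding predicted edge probabilities.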

We were not able to obtain an implementation for DMMG and therefore we present only the values reported by the authors in (KimEtAl:2013:NonparametricMultiGroupMembershipModelForDynamicNetworks). It can be seen that our approach outperforms all the other approaches on all datasets except Infocom. We believe that this is because the Infocom network changes quickly across snapshots as it is a contact network and it has abrupt breaks (at the end of each day when participants leave the premises). This violates our assumption of a slowly changing network.

### 5.2 Qualitative Analysis

In this section, we present some qualitative insights about NIPS-110 and MIT Reality Mining datasets that were revealed by our model. For NIPS-110, author names were obtained by parsing the raw data (http://www.cs.huji.ac.il/~papushado/nips_collab_data.html) and selecting the top 110 authors as before. We use for this analysis. A smaller value of was chosen to aid the manual inspection process.

As ground truth communities are available for MIT Reality Mining dataset, we used it for a sanity check. It is known that two communities that align with ground truth communities can be discovered from the network structure (Xu:2014:DynamicStochasticBlockmodelsForTimeEvolvingSocialNetworks; EagleEtAl:2006:TowardsTimeAwareLinkPredictionInEvolvingSocialNetworks). We followed the same procedure as in (Xu:2014:DynamicStochasticBlockmodelsForTimeEvolvingSocialNetworks) and our model was able to recover both communities. We also observed that both node attributes and interaction matrices evolved with time.

For NIPS-110 dataset, we observed that node attributes for authors did not change noticeably over time, however, the interaction matrices showed time dependent behavior. This aligns with what one might intuitively expect: authors typically do not dramatically change their domain of expertise over time, but they may start collaborating with different people as connections among different fields emerge.

We further conducted two experiments. First, we trained our inference network on all available snapshots and performed community detection on all snapshots using the trained embeddings. Second, we incrementally trained inference networks (starting by observing only two snapshots for the first network and going up to observe snapshots for the last network), and then performed community detection on the first snapshot using embeddings for first snapshot obtained from each of the trained networks.

To perform community detection at a given timestep, we use the learned embeddings to compute the summation term inside the sigmoid in (1) for all pairs of nodes, which gives a matrix of pairwise scores. We mean normalize the entries of this matrix, exponentiate them and then perform spectral clustering on the result. We chose spectral clustering as it can possibly discover non-convex clusters. Note that this is different from clustering on all snapshots independently, since the embeddings capture temporal smoothness.

Through the first experiment, we wish to demonstrate that the learned embeddings enforce smoothness over model dynamics (Fig 1). In Fig 1(a), nodes have been classified into communities because they will coauthor a paper together in the future, despite having no edges between them at that timestep (see Fig 1(b) and 1(c)).

It might appear that nodes do not switch communities at all and that the same result would have been obtained by running spectral clustering on the sum of all snapshots. However, this is not true. One can see that Vapnik is part of the green community at an earlier timestep and of the orange community at a later one, since after that time he publishes multiple papers with members of the orange community. This demonstrates that our method captures temporal smoothness while being flexible enough to capture temporal changes.

Through the second experiment, we wish to demonstrate how embeddings learned from the past are updated as new information arrives. It can be seen in Fig 2 that as new edges are observed in the future, the embeddings for the first timestep are updated to reflect this information. As an example, Hinton, Williams, Zemel and Rasmussen belong to different communities when only the first two snapshots have been observed, but over time these authors become part of the same community as they publish papers together. Note that the first row in Fig 2 corresponds to the first timestep for all columns; hence, new information has to flow backward in time for Fig 2 to emerge.

## 6 Conclusion

In this paper, we presented a new statistical model for dynamic networks along with an associated neural network based variational inference procedure. The proposed model does not impose strict restrictions on the dynamics of networks and is applicable to a large class of directed as well as undirected networks. We demonstrated the utility of our approach by using it to perform link forecasting, where we achieved state-of-the-art performance. A qualitative study provides further evidence that the learned latent quantities might carry useful information.

In Appendix D, we briefly describe how our proposed model can accommodate a change in the number of nodes. One can also explicitly model a variable number of attributes, as done in (KimEtAl:2013:NonparametricMultiGroupMembershipModelForDynamicNetworks).

## References

## Appendix A Algorithm for Generating Networks using DLAIM

## Appendix B Inference Network Architecture

We use a neural network to parameterize the distributions in (11) and (12). Our network consists of four separate GRUs (ChoEtAl:2014:OnThePropertiesOfNeuralMachineTranslationEncoderDecoredApproaches) (see also Appendix C), one each for the mean and covariance parameters (, , and ). We will refer to these GRUs as , , and respectively. These GRUs interact with each other only during the computation of since their outputs are used to compute (4.1). This has been depicted in Fig. 4.

For brevity of exposition, we will only describe the inputs and outputs for . For other GRUs, similar ideas have been employed. For , generates at timestep for all nodes in the current batch as output. In GRUs, the output of current timestep is used as the input hidden state for the next timestep, thus the input hidden state at timestep corresponds to . To be consistent with this, the initial hidden state of should be . This means that the initial hidden state for is a learnable vector.

In all our experiments, we use an all-zeros input vector for at each timestep. If observable features of nodes (that may be dynamic themselves) are available, one can instead use these features as input. For and , instead of computing the variance terms, which are constrained to be positive, we compute the log of the variance (this is again standard practice (KingmaEtAl:2013:AutoEncodingVariationalBayes)).

## Appendix C Description of GRU

GRU, or Gated Recurrent Unit, is a type of recurrent neural network introduced by (ChoEtAl:2014:OnThePropertiesOfNeuralMachineTranslationEncoderDecoredApproaches) in the context of neural machine translation.

Recurrent neural networks (RNNs), as the name suggests, are designed to operate on inputs that are sequential in nature (for example, sentences, speech etc.). They maintain an internal state as a vector that is updated each time an input is received. This state is also used while processing the input tokens in a sequence. Unrolled along the time dimension, a standard recurrent neural network can be thought of as a very deep neural network. Due to this, standard RNNs suffer from the vanishing gradient problem, where the gradients become too small to perform meaningful parameter updates, thereby effectively stopping the learning process (BengioEtAl:1994:LearningLongTermDependenciesWithGradientDescentIsDifficult). To overcome this problem, two popular variants of standard RNNs exist: LSTM (Hochreiter:1997:LongShortTermMemory) and GRU (ChoEtAl:2014:OnThePropertiesOfNeuralMachineTranslationEncoderDecoredApproaches). Since we use GRUs in our experiments, we describe the working of a GRU in this section.

Let and be the input vector and output hidden state respectively at time step . The key idea is to be able to copy over information from the previous time step if the current input token is not relevant for updating the state. Such an operation counters the vanishing gradient problem, as its derivatives will be close to the derivatives of an identity map. To achieve this, the GRU computes a vector at time to act as an update gate:

(13) |

Here and are learnable parameters and is the sigmoid function.

While retaining past information is useful, equally important is to forget the old information that is no longer needed. GRUs do this via the reset gate:

(14) |

As before, and are learnable parameters.

The output at time (which will also serve as input hidden state for time ) is then computed as follows:

(15) |

We use for the elementwise multiplication operation and with and as learnable parameters:

(16) |
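Since the GRU is a standard architecture, the four quantities described above (update gate, reset gate, output and candidate state) can be written in one common form, following the convention in which the update gate multiplies the previous state. The symbols $x_t$, $h_t$, $z_t$, $r_t$ and $\tilde{h}_t$, the weight matrices $W$ and $U$, and the omission of bias terms are assumptions made for illustration; they may differ from the notation used in (13) to (16).

```latex
\begin{align}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1}\right) \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1}\right) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right)\right)
\end{align}
```

When $z_t$ is close to one, $h_t \approx h_{t-1}$, which is the copy operation whose derivative is close to that of an identity map.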

In the context of our inference network, the input is always a zero vector although one may want to use node attributes as input if they are available. The initial hidden state is itself a learnable parameter. At time step , the GRU takes the current value of variational parameter that it is modeling as input hidden state and produces the variational parameter value for next time step as output at .
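The way a GRU is unrolled in this setting can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions, not the inference network itself: the parameter names (`W_z`, `U_z`, and so on), the absence of bias terms, and the helper names `gru_step` and `unroll` are hypothetical. The input is a zero vector at every step and the initial hidden state `h0` plays the role of the learnable parameter for the first timestep.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, params):
    """One GRU update (assumed convention: z gates the previous state)."""
    z = sigmoid(params["W_z"] @ x + params["U_z"] @ h_prev)   # update gate
    r = sigmoid(params["W_r"] @ x + params["U_r"] @ h_prev)   # reset gate
    h_tilde = np.tanh(params["W_h"] @ x + params["U_h"] @ (r * h_prev))
    return z * h_prev + (1.0 - z) * h_tilde

def unroll(h0, params, n_steps, input_dim):
    """Produce one hidden state per timestep, starting from the learnable h0."""
    x = np.zeros(input_dim)            # zero input at every timestep
    states = [h0]
    for _ in range(n_steps - 1):
        states.append(gru_step(x, states[-1], params))
    return np.stack(states)

def zero_params(input_dim, hidden_dim):
    """All-zero weights, used only to make the toy example easy to verify."""
    params = {}
    for gate in ("z", "r", "h"):
        params["W_" + gate] = np.zeros((hidden_dim, input_dim))
        params["U_" + gate] = np.zeros((hidden_dim, hidden_dim))
    return params

h0 = np.array([1.0, -2.0])
states = unroll(h0, zero_params(input_dim=3, hidden_dim=2), n_steps=3, input_dim=3)
```

With all-zero weights both gates equal 0.5 and the candidate state is zero, so each step halves the previous state. If a GRU models a log-variance, applying `np.exp` to its states recovers positive variances.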

## Appendix D Extension for Variable Number of Nodes

In this paper, our main focus is on networks where the number and identity of nodes do not change over time. However, our inference procedure is flexible enough to allow the number of nodes to vary over time. In this section, we briefly describe how this can be achieved. Although the number of nodes is allowed to change, we assume that the number of attributes is constant. We also assume that each node is alive only during a single contiguous time interval; in particular, nodes are not allowed to reappear after disappearing.

Since the number of attributes is constant, there is no change in the way is calculated for all and . The task then, is to find the vectors and for all nodes that are alive at timestep . We will mimic the procedure that was used in Section 4.2.

Suppose is the longest interval in which node is alive, then will be drawn from the prior distribution given in (6). Note that does not exist for and . For , follows (5). At timestep , only those node pairs for which both nodes are alive at , contribute to the last term in (4.1). Similarly, appears in the first and third term of (4.1) and first term of (4.1) only for .
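The restriction to node pairs that are both alive at a timestep can be sketched with a boolean mask. The function names, the inclusive treatment of the birth and death timesteps, and the exclusion of self-pairs are assumptions made for illustration.

```python
import numpy as np

def alive_mask(birth, death, t):
    """True for nodes whose (inclusive) lifetime interval contains timestep t."""
    birth = np.asarray(birth)
    death = np.asarray(death)
    return (birth <= t) & (t <= death)

def alive_pair_mask(birth, death, t, directed=True):
    """True for ordered pairs (i, j), i != j, with both nodes alive at t."""
    alive = alive_mask(birth, death, t)
    pairs = alive[:, None] & alive[None, :]
    np.fill_diagonal(pairs, False)     # drop self-pairs (assumption)
    if not directed:
        pairs = np.triu(pairs, k=1)    # count each unordered pair once
    return pairs

# Three nodes with lifetimes [0, 3], [2, 4] and [1, 1]; at t = 2 only the
# first two nodes are alive.
birth, death = [0, 2, 1], [3, 4, 1]
mask_directed = alive_pair_mask(birth, death, t=2)
mask_undirected = alive_pair_mask(birth, death, t=2, directed=False)
```

Such a mask could be multiplied into the per-pair terms of the objective so that pairs involving a node that is not alive contribute nothing.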

Given a batch of nodes, the network architecture proposed in Appendix B can be used even in the case of variable number of nodes by using the appropriate terms to compute (4.1) and by treating appropriate vectors as learnable parameters as described above. We performed preliminary experiments by artificially assigning a birth time and death time to all nodes in the NIPS-110 dataset described in Appendix E. We were able to successfully train the inference network and get good performance on the link forecasting task. However, due to space constraints, we do not present our results here and leave detailed experiments for future work.

## Appendix E Dataset Description

We use the following datasets in our link prediction experiments:

1. Enron email: The full Enron email corpus (KlimtEtAl:2004:TheEnronCorpus) has 149 nodes corresponding to employees in a company. A directed edge from node to node implies that sent an email to . We use an undirected subset of the Enron corpus (Enron-50) consisting of 50 nodes as described in (FouldsEtAl:2011:ADynamicRelationalInfiniteFeatureModelForLongitudinalSocialNetworks). We also perform link prediction on the full, directed Enron corpus with nodes by taking data from years 2000-2001 where each network snapshot corresponds to a month (Enron-Full).

2. Infocom: There are 78 nodes in this network. An undirected edge from node to node at timestep indicates that and were in proximity of each other during that timestep. We obtain a dynamic network with 50 snapshots by using the procedure outlined in (GuptaEtAl:2018:EvolvingLatentSpaceModelForDynamicNetworks).

3. NIPS co-authorship: This dataset consists of 5,722 nodes. An undirected edge from node to node indicates that and co-authored a paper. We consider a subset of the dataset containing 110 nodes as described in (HeaukulaniEtAl:2013:DynamicProbabilisticModelsForLatentFeaturePropagationInSocialNetworks). We refer to this dataset as NIPS-110.

4. EU Email: This dataset (YinEtAl:2017:LocalHigherOrderGraphClustering) contains information about emails that were exchanged between individuals belonging to a European research organization. There are nodes in the network. We consider the first days and create network snapshots by aggregating data over day time windows. We treat this as an undirected (EU-U) as well as a directed (EU-D) network.

5. CollegeMsg: There are 1899 nodes in this dataset (PanzarasaEtAl:2009:PatternsAndDynamicsOfUsersBehaviourAndInteraction). A binary, directed edge corresponds to a message exchanged between the sender and receiver. Temporal data for 193 days is available. We discard the last 3 days and divide the remaining 190 days into 10-day buckets, which gives us 19 snapshots.

6. MIT Reality Mining: We use this dataset only for performing qualitative analysis in Section 5.2. This network has nodes that correspond to people on MIT campus (EagleEtAl:2006:RealityMiningSensingComplexSocialSystems). Following (Xu:2014:DynamicStochasticBlockmodelsForTimeEvolvingSocialNetworks), we aggregate the Bluetooth proximity data so that each snapshot corresponds to 1 week and we have snapshots from August 2004 to May 2005.
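The windowing used to turn timestamped interactions into snapshots (for example, 190 days of CollegeMsg split into 10-day buckets) can be sketched as follows. The event format `(day, sender, receiver)` with zero-based integer day offsets, and the function name `build_snapshots`, are assumptions made for illustration and not the preprocessing code used for the experiments.

```python
import numpy as np

def build_snapshots(events, n_nodes, window_days=10, n_days_used=190,
                    directed=True):
    """Aggregate (day, sender, receiver) events into binary adjacency snapshots."""
    n_snapshots = n_days_used // window_days
    snapshots = np.zeros((n_snapshots, n_nodes, n_nodes), dtype=np.int8)
    for day, src, dst in events:
        if day >= n_days_used:
            continue                   # discard events after the retained period
        s = day // window_days
        snapshots[s, src, dst] = 1     # binary edge: repeated messages collapse
        if not directed:
            snapshots[s, dst, src] = 1
    return snapshots

# Toy events on 3 nodes; the event on day 192 falls in the discarded tail.
events = [(0, 0, 1), (9, 1, 2), (10, 2, 0), (189, 0, 2), (192, 1, 0)]
snapshots = build_snapshots(events, n_nodes=3)
```

With the default settings this yields 19 snapshots, matching the count stated for CollegeMsg.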
