GRADE: Graph Dynamic Embedding

by Simeon Spasov, et al.

Representation learning of static and, more recently, dynamically evolving graphs has gained noticeable attention. Existing approaches for modelling graph dynamics focus extensively on the evolution of individual nodes independently of the evolution of mesoscale community structures. As a result, current methods do not provide useful tools to study, and cannot explicitly capture, temporal community dynamics. To address this challenge, we propose GRADE - a probabilistic model that learns to generate evolving node and community representations by imposing a random walk prior over their trajectories. Our model also learns node community membership, which is updated between time steps via a transition matrix. At each time step, link generation is performed by first assigning node membership from a distribution over the communities, and then sampling a neighbour from a distribution over the nodes for the assigned community. We parametrize the node and community distributions with neural networks and learn their parameters via variational inference. Experiments demonstrate GRADE meets or outperforms baselines in dynamic link prediction, shows favourable performance on dynamic community detection, and identifies coherent and interpretable evolving communities.





1 Introduction

Representation learning over graph-structured data has generated significant interest in the machine learning community owing to widespread application in a variety of interaction-based networks, such as social and communication networks, bio-informatics and relational knowledge graphs. Developing methods for unsupervised graph representation learning is challenging as it requires summarizing the graph structural information in low-dimensional embeddings. These representations can then be used for downstream tasks, such as node classification, link prediction and community detection.

The majority of unsupervised graph representation learning methods have focused solely on static, non-evolving graphs, while many real-world networks exhibit complex temporal behaviour. To address the challenge of encoding temporal patterns of relational data, existing methods for dynamic graph embedding focus extensively on capturing node evolution (Goyal et al., 2018, 2020; Sankar et al., 2020; Zhou et al., 2018). Although these methods achieve compelling results against static baselines on dynamic tasks, they do not lend themselves to capturing the evolution of graph-level structures, such as clusters of nodes, or communities. On the other hand, the patterns of evolving node clusters are of great interest in social networks (Kossinets and Watts, 2006; Greene et al., 2010; Yang et al., 2011), and are also encountered in the temporal organization of large-scale brain networks (Vidaurre et al., 2017), among other domains.

To address this challenge, we propose GRADE (GRAph Dynamic Embedding) - a probabilistic generative model for jointly learning evolving node and community representations. The benefit of modelling the interaction between nodes and communities for graph representation learning in the static setting was studied by vGraph (Sun et al., 2019). Further, Battiston et al. (2020) provide evidence that taking higher-order graph structures, such as communities, into consideration enhances our capability to model emergent dynamical behaviour. Consequently, in this work, we extend the idea of modelling node-community interactions, proposed in vGraph, to the dynamic case. We represent a dynamic network as a sequence of graph snapshots over a series of discrete and equally-spaced time intervals. At each time step, we model the edge generation process between node neighbours via multinomial community and node distributions. First, we sample a community assignment z for each source node w from a distribution over the communities, i.e. z ~ p(z | w). Then, we sample a neighbour c from the distribution over the nodes of the assigned community, that is c ~ p(c | z). Both the community and node distributions are parametrized by neural network transformations of the node and community embeddings. In our work, we assume that the semantic meaning of communities and the proportions over the communities for each node evolve simultaneously over time. Following an approach introduced in dynamic topic modelling (Dieng et al., 2019), we encode temporal evolution in our method by assuming a random walk prior over the representations between time steps. Furthermore, we draw inspiration from social networks, where a user's preferences can shift from one community to another. We explicitly model the dynamism in community membership by introducing a node-specific and time-varying transition matrix to update the community mixture coefficients over time. We design an effective algorithm for inference via backpropagation. We learn the parameters of our model by means of variational inference to maximize a lower bound on the likelihood of the observed data. More specifically, we resort to amortized inference (Gershman and Goodman, 2014) to learn neural network mappings from node and community representations to the respective conditional distributions, as well as structured variational inference (Hoffman and Blei, 2015; Saul and Jordan, 1996) to retain the dependence of the embeddings on their historical states. Our proposed method is aimed at non-attributed dynamic graphs. It is worth noting that although GRADE is a transductive approach, changes of vertex sets between snapshots at different time steps do not pose a problem if the complete vertex set is known a priori.

In the experimental section we evaluate our model on the tasks of dynamic link prediction and dynamic non-overlapping community detection on real-world dynamic graphs. Our results show GRADE is competitive with or outperforms other state-of-the-art static or dynamic transductive approaches for unsupervised graph representation learning. Furthermore, we provide visualizations of dynamic community evolution.

2 Related Work

Methods for unsupervised learning on evolving graphs are often dynamic extensions of ideas applied in the static case. (1) Graph factorization approaches such as DANE (Li et al., 2017) rely on spectral embedding, similarly to static methods like Ahmed et al. (2013); Cao et al. (2015); Belkin and Niyogi (2003). DANE assumes smooth temporal evolution and models it using matrix perturbation theory. (2) In skip-gram models, node representations are learnt by random walk objectives (Grover and Leskovec, 2016; Perozzi et al., 2014; Tang et al., 2015). In the dynamic case, CTDNE (Nguyen et al., 2018) and NetWalk (Yu et al., 2018) augment the random walk with temporal constraints based on time-stamped edges. Further, (3) temporal point processes have also been used in combination with neural network parametrization by KnowEvolve (Trivedi et al., 2017) and DyREP (Trivedi et al., 2019) to model continuous-time node interactions in multi-relational and simple dynamic graphs, respectively. (4) More recently, graph convolutional neural networks (GNNs) have become a widely used tool for graph representation learning (Kipf and Welling, 2016; Veličković et al., 2018; Bruna et al., 2014). A popular approach to include temporal dependency in GNNs is to introduce a recurrent mechanism. For example, Seo et al. (2018) propose two ways to achieve this goal. One way is to obtain node embeddings via a GNN, which are then fed to an LSTM to learn dynamism. The second approach modifies the LSTM layers to incorporate graph-structured data. A different approach altogether is to evolve the graph convolutional parameters with a recurrent neural network (RNN), as in EvolveGCN (Pareja et al., 2019), rather than the node embeddings, hence addressing issues stemming from rapidly changing node sets between time steps. Alternatively, STGCN (Yu et al., 2017) avoids using RNNs completely by introducing an efficient ST-Conv layer for faster training with few model parameters.

The body of work most closely related to this paper is a category of transductive unsupervised methods applied to temporally discrete and non-attributed dynamic graphs. One such approach is DynGEM (Goyal et al., 2018), which employs deep autoencoders to learn node embeddings but uses no recurrent structures for temporal dependency. Instead, time dynamics are injected by re-initializing the parameters of the autoencoder at each time step with the parameters learnt at the previous step. Unlike our proposed method GRADE, DynGEM can only handle growing dynamic networks. Another method is dyngraph2vec (Goyal et al., 2020), which is trained to predict future links based on current node embeddings using an LSTM mechanism. DynamicTriad (Zhou et al., 2018) models the process of temporal link formation between vertices with common neighbours, that is triadic closure, and enforces latent space similarity between future connected nodes. DySAT (Sankar et al., 2020) draws inspiration from the success of attention mechanisms and applies them structurally over local neighbourhoods and temporally over historical representations. The advantages of GRADE over these competitive methods are that, firstly, we learn both node- and community-level dynamic embeddings and, secondly, our approach can be used to infer the embeddings for future time steps. In comparison, these dynamic methods use representations learnt at the last training step for dynamic prediction.

Finally, GRADE is also related to dynamic topic modelling (Dieng et al., 2019; Blei and Lafferty, 2006), as both can be viewed as state-space models. The difference is that in GRADE we deal with multinomial distributions over nodes and communities instead of topics and words. Moreover, some works, such as Bamler and Mandt (2017) and Rudolph and Blei (2018), have focused on the shift of word meaning over time, and others, such as Dieng et al. (2019), model the evolution of documents. In contrast, GRADE assumes both nodes and communities undergo temporal semantic shift.

3 Problem Definition and Preliminaries

We consider a dataset comprising a sequence of non-attributed (i.e. without node features) graph snapshots G^(1), ..., G^(T) over a series of discrete and equally-spaced time intervals, such that t is an integer time index. We assume all the edges in snapshot G^(t) occur at time t and that the complete set of vertices V in the dynamic graph is known a priori. Our method supports the addition or removal of nodes as well as edges between time steps. We also assume there are K communities (clusters of nodes) in the dynamic network. Our method aims to learn time-evolving vector representations: a node embedding δ_i^(t) for every node i ∈ V and a community embedding φ_k^(t) for every community k ∈ {1, ..., K}, for each time step t. Further, a useful model for dynamic graph embedding should not only capture the patterns of temporal evolution in the node and community representations but also be able to predict their future trajectories.
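Concretely, such a dataset can be represented as a list of per-snapshot edge sets over a fixed, a-priori-known vertex set. The sketch below is illustrative only; the container and its names are our assumptions, not from the paper:

```python
# Minimal container for a discrete-time dynamic graph: T snapshots over a
# vertex set V known a priori. Nodes may be absent from individual snapshots.
from typing import List, Set, Tuple

Edge = Tuple[int, int]  # (source node, target node), indices into V


class DynamicGraph:
    def __init__(self, num_nodes: int, snapshots: List[Set[Edge]]):
        self.num_nodes = num_nodes    # |V|, fixed across all time steps
        self.snapshots = snapshots    # snapshots[t] = set of edges at time t

    def active_nodes(self, t: int) -> Set[int]:
        """Nodes incident to at least one edge in snapshot t."""
        return {v for edge in self.snapshots[t] for v in edge}


g = DynamicGraph(4, [{(0, 1), (1, 2)}, {(2, 3)}])
print(g.active_nodes(0))  # {0, 1, 2}
print(g.active_nodes(1))  # {2, 3}
```

Representing snapshots as edge sets (rather than adjacency matrices) keeps the structure sparse and makes node addition/removal between time steps trivial.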

4 Methodology

4.1 GRADE: Generative Model Description

Figure 1: Plate notation for GRADE. The node and community representations, and consequently the parametrization of the node and community distributions, evolve over time. The parameters of the community distribution, π_w^(t), are explicitly updated by a deterministic transformation of the node embeddings (denoted by rectangles). The observed data is the edges (w, c) in the dynamic graph.

Symbol | Definition
G^(1), ..., G^(T) | sequence of dynamic graph snapshots
V | complete vertex set
{1, ..., K} | set of community indices
φ_k^(t), δ_i^(t) | community and node embeddings
σ_φ, σ_δ | temporal smoothness hyperparameters
z | community assignment latent variable
T_w^(t) | transition matrix for node w at time t
(w, c) | edge (source node, target node)
w, c | vertex set indices of source node and neighbour
π^(t), β^(t) | parameters of community and node distributions

Table 1: Notation used in paper.

GRADE is a probabilistic method for modelling the edge generation process in dynamic graphs. We adopt the approach of vGraph (Sun et al., 2019) to represent each node in the active vertex set of G^(t) as a mixture of communities, and each community as a multinomial distribution over the nodes. The linked neighbour generation at each time step is as follows: first, we sample a community assignment z from a conditional prior distribution over the communities, p(z | w; π_w^(t)). Then, a neighbour is drawn from the node generative distribution p(c | z; β_z^(t)) based on the social context defined by the assigned community. The generative process for graph snapshot G^(t) can be formulated as:

p^(t)(c | w) = Σ_{z=1}^{K} p(z | w; π_w^(t)) p(c | z; β_z^(t)),

where β^(t) and π^(t) parametrize the multinomial generative and prior distributions at time step t respectively. In our dynamic graph model, we suppose that the semantic meaning of communities as well as the community proportions for nodes change over time. This necessitates capturing the temporal evolution of the underlying node and community distributions by an evolving set of parameters. GRADE achieves this by making these parameters implicitly dependent on the evolving node and community embeddings, δ_i^(t) and φ_k^(t) respectively. More specifically, we treat the community and node representations as random variables and impose a simple state-space model that evolves smoothly with Gaussian noise between time steps as follows:

φ_k^(t) ~ N(φ_k^(t-1), σ_φ² I),    δ_i^(t) ~ N(δ_i^(t-1), σ_δ² I).

Note that we evolve the embeddings of the complete vertex set at each time step, although our model allows for a subset of the nodes to be present at each time step. The temporal smoothness hyperparameters σ_φ and σ_δ control the rate of temporal dynamics. The parametrization of the generative distribution is achieved by first transforming the community representations through a neural network, g, and then mapping the output through a softmax layer:

β_z^(t) = softmax(g(φ_z^(t))).

To evolve the community mixture weights for nodes, we observe that users' interests in a social network change over time. As a result, users may shift from one community to another. This is characterized by user-specific behaviour within the broader context of community evolution. For these reasons, we explicitly model community transition with a transition matrix. More specifically, for each node w, we update the community mixture weights by means of a node-specific and time-varying transition matrix T_w^(t), produced as a function, f, of the node embeddings:

T_w^(t) = f(δ_w^(t)),    π_w^(t) = π_w^(t-1) T_w^(t).
In summary, GRADE's edge generative process for each graph snapshot G^(t) in G^(1:T) is as follows:

  1. Draw community embeddings for k ∈ {1, ..., K}:   φ_k^(t) ~ N(φ_k^(t-1), σ_φ² I)

  2. Draw node embeddings for all nodes i ∈ V:   δ_i^(t) ~ N(δ_i^(t-1), σ_δ² I)

  3. Transition matrix T_w^(t) is a non-linear transformation, f, of the node embeddings:   T_w^(t) = f(δ_w^(t))

  4. Update community mixture coefficients for node w:   π_w^(t) = π_w^(t-1) T_w^(t)

  5. For each edge (w, c) in G^(t):

    1. Draw community assignment from the multinomial prior over the communities:   z ~ Multinomial(π_w^(t))

    2. Parameters of the distribution over the nodes are a function of φ_z^(t):   β_z^(t) = softmax(g(φ_z^(t)))

    3. Draw linked neighbour from the node generative distribution for sampled community z:   c ~ Multinomial(β_z^(t))

The graphical model of the proposed generative process is shown in Figure 1 and common notation used in the paper in Table 1.
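The generative process can be sketched end-to-end in NumPy. This is an illustrative toy version under stated assumptions: the neural networks f and g of the paper are replaced here by placeholder random linear maps (W_f, W_g), and all dimensions are kept small:

```python
# Illustrative sketch of GRADE's edge generative process for one snapshot.
import numpy as np

rng = np.random.default_rng(0)
N, K, D = 5, 3, 8                      # nodes, communities, embedding dim


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


# Previous-step state
phi_prev = rng.normal(size=(K, D))     # community embeddings
delta_prev = rng.normal(size=(N, D))   # node embeddings
pi_prev = np.full((N, K), 1.0 / K)     # community mixture coefficients

sigma_phi = sigma_delta = 0.1          # temporal smoothness hyperparameters
W_f = rng.normal(size=(D, K * K))      # placeholder parameters of f
W_g = rng.normal(size=(D, N))          # placeholder parameters of g

# Steps 1-2: random-walk prior, embeddings drift with Gaussian noise
phi = phi_prev + sigma_phi * rng.normal(size=(K, D))
delta = delta_prev + sigma_delta * rng.normal(size=(N, D))

# Steps 3-4: node-specific transition matrices update mixture coefficients
T = softmax((delta @ W_f).reshape(N, K, K), axis=-1)  # row-stochastic K x K
pi = np.einsum('nk,nkj->nj', pi_prev, T)              # pi_w^t = pi_w^{t-1} T_w^t

# Step 5: edge generation for a source node w
w = 0
z = rng.choice(K, p=pi[w])             # community assignment from the prior
beta = softmax(phi @ W_g, axis=-1)     # per-community distribution over nodes
c = rng.choice(N, p=beta[z])           # sampled linked neighbour
```

Note that each row of T and of beta sums to one, so pi remains a valid mixture after the transition update.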

4.2 Inference Algorithm

Consider we are given a dataset comprising a set of node links (w, c) for a sequence of graph snapshots G^(1:T). In our dynamic graph model, the latent variables are the hidden community assignments z, and the evolving node and community representations δ^(1:T) and φ^(1:T). The logarithm of the marginal probability of the observations is given by the sum of the log probability of each observed temporal edge of all nodes in V:

log p(G^(1:T)) = Σ_{t=1}^{T} Σ_{(w,c) ∈ G^(t)} log p^(t)(c | w).
Exact inference of the posterior is intractable. Instead, we resort to variational methods as a means of approximation. Variational inference provides flexibility when choosing an appropriate family of distributions q_λ, indexed by the parameters λ, as an approximation to the true posterior. The idea is to optimize the parameters by minimizing the Kullback-Leibler (KL) divergence between the true posterior and its approximation, which is equivalent to maximizing the evidence lower bound (ELBO) (Kingma and Welling, 2013):

ELBO = E_q [ log p(G^(1:T), z, δ^(1:T), φ^(1:T)) - log q(z, δ^(1:T), φ^(1:T)) ].    (6)

The variational approximation we choose takes the form:

q(z, δ^(1:T), φ^(1:T)) = Π_{t=1}^{T} [ Π_{(w,c) ∈ G^(t)} q(z | w, c) · Π_{i ∈ V} q(δ_i^(t) | δ_i^(1:t-1)) · Π_{k=1}^{K} q(φ_k^(t) | φ_k^(1:t-1)) ].
The variational distributions over the node and community representations depend on all of their respective historical states. We capture this temporal dependency with Gated Recurrent Units (GRUs). We model both q(δ_i^(t) | δ_i^(1:t-1)) and q(φ_k^(t) | φ_k^(1:t-1)) as Gaussian distributions, whose means and diagonal covariance vectors are given by the outputs of their respective GRU units. The advantage of this structured approach (Hoffman and Blei, 2015; Saul and Jordan, 1996), where we retain dependency only on previous states, is that it allows us to easily infer the posterior distributions of the node and community representations at future time steps. The difference between the approximated multinomial conditional prior over the community assignments and the approximated multinomial posterior q(z | w, c) is in the dependence on the neighbour c. In principle, this dependency can be easily integrated into the parametrization via amortized inference (Gershman and Goodman, 2014). More specifically, we use both embeddings δ_w^(t) and δ_c^(t) as inputs to the transformation generating the community transition matrix. Also, the structure of the variational distribution over the assignments enables an efficient procedure for inferring edge labels, as well as for approximating community membership, as follows:

p(z | w) ≈ (1 / |N(w)|) Σ_{c ∈ N(w)} q(z | w, c),

where N(w) is the set of neighbours of node w. The procedure is also applicable on future test graphs. Optimizing the lower bound (eq. (6)) w.r.t. all parameters is performed via stochastic optimization, using the reparametrization trick (Kingma et al., 2014) and Gumbel-Softmax reparametrization (Jang et al., 2016; Maddison et al., 2016) to obtain gradients. The ELBO can be formulated as a neighbour reconstruction loss plus a sum of KL regularization terms between the priors and posteriors of each of the latent variables. Refer to Algorithm 1 for a summary of the procedure.
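The neighbour-averaging rule for community membership admits a one-line implementation. The sketch below assumes the edge-level posteriors q(z | w, c) are already available as length-K probability vectors, one per neighbour of w:

```python
# Neighbour-averaged community membership: average the edge-level posteriors
# q(z | w, c) over all neighbours c of node w.
import numpy as np


def node_membership(edge_posteriors):
    """edge_posteriors: one length-K array q(z | w, c) per neighbour c of w.
    Returns the length-K community membership distribution for node w."""
    return np.mean(np.stack(edge_posteriors), axis=0)


# Node with two neighbours assigned confidently to different communities:
m = node_membership([np.array([0.9, 0.1]), np.array([0.1, 0.9])])
print(m)  # [0.5 0.5]
```

Since each input is a probability vector, the average is again a valid distribution over the K communities.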

  Input: Edges in dynamic graph G^(1:T).
  Initialize all model and variational parameters.
  Initialize φ_k^(0), δ_i^(0) as learnable parameters. π_w^(0) is uniform over the communities.
  for iterations 1, 2, 3, … do
     for graph G^(t) in G^(1:T) do
         Sample community representations from posteriors:
        for k in 1, …, K do
           φ_k^(t) ~ q(φ_k^(t) | φ_k^(1:t-1))
        end for
         Sample node representations for complete vertex set from posteriors:
        for i in 1, …, |V| do
           δ_i^(t) ~ q(δ_i^(t) | δ_i^(1:t-1))
        end for
         Sample community assignments for source nodes:
        for edges (w, c) in G^(t) do
           z ~ q(z | w, c)
        end for
     end for
     Estimate ELBO (eq. (6)) and update model and variational parameters via backpropagation.
  end for
Algorithm 1 GRADE Inference Algorithm
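The two gradient estimators invoked by Algorithm 1 can be sketched as follows. This is a minimal NumPy illustration of the general techniques (Gaussian reparametrization and the Gumbel-Softmax relaxation), not the paper's implementation:

```python
# Gradient estimators used during training: Gaussian reparametrization for
# the continuous embeddings, Gumbel-Softmax for the discrete assignments.
import numpy as np

rng = np.random.default_rng(1)


def reparam_gaussian(mu, log_var):
    """Sample x ~ N(mu, diag(exp(log_var))) as a deterministic function of
    external noise, so gradients can flow through mu and log_var."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def gumbel_softmax(logits, tau=0.5):
    """Differentiable relaxation of a categorical sample over K communities;
    lower tau pushes the sample closer to a one-hot vector."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()


x = reparam_gaussian(np.zeros(3), np.zeros(3))          # sample from N(0, I)
s = gumbel_softmax(np.log(np.array([0.7, 0.2, 0.1])))   # relaxed assignment
```

In an autodiff framework the same two functions would be written with framework tensors so that backpropagation reaches the variational parameters.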

5 Experiments

We evaluate our proposed model on the tasks of dynamic link prediction and dynamic community detection against state-of-the-art baselines. Furthermore, we propose a quantitative metric to assess the quality of the learnt evolving communities and provide visualizations for a qualitative assessment.

5.1 Data sets

We use three discrete-time dynamic networks based on the DBLP, IMDb and Reddit datasets to evaluate our method. A summary of all datasets is provided in Table 2.

DBLP. We preprocess the DBLP dataset to identify the top 10,000 most prolific authors in terms of publication count in the years 2000-2018 inclusive. We construct a graph snapshot for each year based on co-authorship. We produce yearly labels for authors if over half of their annual publications fall within the same research category.

Reddit is a timestamped hyperlink network between subreddits spanning 40 months (Kumar et al., 2018). We link two subreddits if one posts a hyperlink to the other. We divide the network into 10 graph snapshots.

IMDb. We first identify the 10,000 most popular movies, in terms of the highest number of votes, for the years 2000-2019 inclusive. We then link the principals (director, producer, main actors) of each movie to form a dynamic network.

Data set | # Nodes | # Links | Node Activity | Context Dynamics | # Snapshots (Train / Val / Test) | Label Rate
DBLP | 10,000 | 374,911 | 0.47 | 0.30 | 13 / 3 / 3 | 0.083
Reddit | 35,776 | 180,622 | 0.25 | 0.24 | 6 / 2 / 2 | -
IMDb | 13,633 | 105,841 | 0.10 | 0.06 | 14 / 3 / 3 | -

Table 2: Datasets statistics. We use the proportion of unique timestamps (to all time steps) associated with a vertex’s edges to measure average node activity. The rate of context dynamics is captured by the Jaccard coefficient between the sets of 1-hop neighbours across all active nodes in consecutive time steps (lower coefficient suggests high rate of context dynamics).
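The two statistics described in the caption can be computed directly from the snapshot edge sets. A minimal sketch on toy data (function names are ours, for illustration):

```python
# Node activity: fraction of snapshots in which a node has at least one edge.
# Context dynamics: Jaccard overlap of a node's 1-hop neighbourhood across
# consecutive time steps (lower overlap = faster-changing context).
import numpy as np


def node_activity(snapshots, node, T):
    active = sum(1 for t in range(T) if any(node in e for e in snapshots[t]))
    return active / T


def neighbours(snapshot, node):
    return {v for e in snapshot for v in e if node in e} - {node}


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0


snaps = [{(0, 1), (1, 2)}, {(0, 1), (0, 2)}]
print(node_activity(snaps, 0, 2))                                 # 1.0
print(jaccard(neighbours(snaps[0], 0), neighbours(snaps[1], 0)))  # 0.5
```

The reported dataset-level numbers would then be averages of these per-node quantities over all (active) nodes and time steps.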

5.2 Baseline methods

We compare GRADE against five baselines comprising three static and two dynamic methods. The static methods are: DeepWalk (Perozzi et al., 2014), node2vec (Grover and Leskovec, 2016) and vGraph (Sun et al., 2019). The dynamic graph methods consist of: DynamicTriad (Zhou et al., 2018) and DySAT (Sankar et al., 2020).

Methods compared: DeepWalk, Node2Vec, DySAT, DynTriad, GRADE, GRADE (random), vGraph.
DBLP: 605 ± 39, 1,436 ± 33. IMDb: 1,036 ± 1, 1,010 ± 38. Reddit: 601 ± 8, 727 ± 87.

Table 3: Mean average rank (MAR) results on dynamic link prediction. Lower values are better. Best and second-best results are marked in bold and underlined respectively.

5.3 Evaluation Metrics

Dynamic link prediction. An important application of dynamic graph embedding is capturing the pattern of evolution in the training set to predict edges at future time steps. For all baselines we use a measure of similarity between node representations (Euclidean distance or dot product) as a predictor of connectivity, following each method's implementation. For static methods, we aggregate all observed edges in the training set into a single graph to produce node embeddings. For dynamic baselines, the vertex representations at the last training step are used. For GRADE, we train our model and infer the posterior distributions of the node and community representations at the test time steps. We evaluate dynamic link prediction performance using the mean average rank (MAR) metric. To calculate mean average rank we first produce a ranking of candidate neighbours spanning the complete vertex set for each source node in the test set edge list. The ranking is produced via a similarity measure on the node embeddings for all baseline methods. For GRADE and vGraph, we produce a distribution over the neighbours by summing over all possible community assignments from the prior p(z | w), that is, we do not incorporate any neighbour information, in order to guarantee a fair comparison:

p^(t)(c | w) = Σ_{z=1}^{K} p(z | w; π_w^(t)) p(c | z; β_z^(t)),

and rank nodes according to their probability. We identify the rank of the ground truth neighbour and average over all test edges.
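The MAR computation described above can be sketched as follows, assuming a per-source-node probability vector over all candidate neighbours (ties are ignored for simplicity):

```python
# Mean average rank (MAR): for each test edge (w, c), rank all candidate
# neighbours by predicted probability and record the rank of the ground-truth
# neighbour c (rank 1 = best), then average over all test edges.
import numpy as np


def mean_average_rank(scores, test_edges):
    """scores[w] is a length-|V| array of neighbour probabilities for w."""
    ranks = []
    for w, c in test_edges:
        order = np.argsort(-scores[w])                    # descending
        ranks.append(int(np.where(order == c)[0][0]) + 1)  # rank of c
    return float(np.mean(ranks))


scores = {0: np.array([0.1, 0.6, 0.3]), 1: np.array([0.5, 0.2, 0.3])}
print(mean_average_rank(scores, [(0, 1), (1, 2)]))  # (1 + 2) / 2 = 1.5
```

Lower values are better: the best possible MAR is 1, when every ground-truth neighbour is ranked first.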

Dynamic community detection is another relevant use case for our method. More specifically, we leverage historical information by training a model on the training time steps, and infer non-overlapping communities given the edges in the test set. We evaluate performance on this task using Normalized Mutual Information (NMI) (Tian et al., 2014) and Modularity. Publicly available dynamic network datasets with labelled evolving communities are difficult to obtain; we use the DBLP dataset, which we have manually labelled.

Further, a novel application of GRADE is predicting community-scale dynamics. We demonstrate this capability by inferring the community representations (i.e., the posterior multinomial distribution over the nodes for each community) for the test time steps, and producing rankings of the most probable nodes. A vertex predicted to have high probability for a given community should also be integral to its structure. We evaluate performance on this task by calculating Spearman's rank correlation coefficient between the predicted node probabilities of the top-250 vertices in each community and the same nodes' centrality, as measured by the number of links to vertices assigned to the same community in the test set.
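The evaluation above reduces to a rank correlation between two score vectors. A plain-NumPy sketch of Spearman's coefficient (assuming no ties, for simplicity) is:

```python
# Spearman rank correlation between predicted node probabilities and
# within-community degree centrality (no-ties case).
import numpy as np


def ranks(x):
    """Rank of each element, 1 = smallest."""
    order = np.argsort(x)
    r = np.empty(len(x))
    r[order] = np.arange(1, len(x) + 1)
    return r


def spearman(x, y):
    rx, ry = ranks(np.asarray(x, float)), ranks(np.asarray(y, float))
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))


pred_prob = [0.5, 0.3, 0.2]   # predicted probability of top-ranked nodes
centrality = [10, 7, 3]       # within-community degree on the test set
print(spearman(pred_prob, centrality))  # 1.0 (perfectly concordant rankings)
```

A library routine such as scipy.stats.spearmanr would be used in practice, which also handles ties.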

5.4 Experimental Procedure

We cross-validate all methods and identify the best set of hyperparameters on the task of dynamic link prediction via grid search. The train/validation/test splits are done across time steps as shown in Table 2. We use no node attributes in any of our experiments and set the node embedding dimensionality to 128 for all methods. For all baselines with the exception of vGraph and GRADE, we apply K-means to the learnt vertex representations to identify non-overlapping communities. Further, since the majority of baseline methods (other than vGraph and GRADE) do not produce distributions over the nodes for each community, we use the k-nearest-neighbours algorithm to identify the top-250 nodes closest in representation space to each cluster's centroid for the task of predicting community-scale dynamics. For consistency between baselines we determine the number of communities to be detected as part of the cross-validation procedure for GRADE. The implementations provided by the authors are used for all baselines. We train GRADE using the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.005, which is decayed by 0.99 every 100 iterations. To save the best models for GRADE and vGraph, we train for 10,000 epochs on the task of dynamic link prediction and select the models with the lowest mean average rank on the validation set. The same models were used in the evaluation for all tasks. We use procedures provided by the authors' implementations of DySAT and DynamicTriad to save best-performing models. Owing to the size of the dynamic networks we cannot use full-batch training; we resort to training GRADE stochastically by splitting the edges at each time step into equally sized batches. Since our model is transductive, we report results on nodes that have been observed in the training set. All results are averaged across 4 runs.

5.5 Results

Table 3 summarizes the results on dynamic link prediction. We observe GRADE outperforms noticeably on the DBLP and Reddit datasets, improving mean average rank over the second-best method on both, and achieves comparable results to baselines on IMDb. Further, to examine whether GRADE captures the true community and node dynamics, we randomize the sequence of graphs in the train set while retaining the true order in the validation and test sets. We observe noticeable degradation after randomization on all datasets, which suggests that GRADE identifies a pattern of temporal evolution instead of learning aggregated graph representations.

Figure 2: Temporal evolution of the top-10 authors within a community broadly corresponding to Artificial Intelligence, learnt by GRADE on the DBLP dataset for the years 2000-2018 inclusive.

Methods compared: DeepWalk, Node2Vec, DySAT, DynTriad, GRADE, GRADE (random), vGraph.
Modularity: DBLP 0.383 ± 0.002; IMDb 0.128 ± 0.09, 0.163 ± 0.008; Reddit 0.368 ± 0.004, 0.370 ± 0.003.
Top-250 (node probability vs centrality): DBLP 0.323 ± 0.009; IMDb 0.239 ± 0.003, 0.176 ± 0.007; Reddit 0.492 ± 0.019, 0.466 ± 0.009.
NMI: DBLP 0.429 ± 0.015, 0.435 ± 0.035.

Table 4: Dynamic community detection performance. Best and second-best results are marked in bold and underlined respectively. Values within a standard deviation of each other on the same task are both marked.

Results on dynamic community detection and on predicting community-scale dynamics are presented in Table 4. On these tasks GRADE also noticeably outperforms all baselines on the DBLP and Reddit datasets. This also shows that the capability of our model to infer node and community representations at future test time steps helps performance, in contrast with the other dynamic baselines, which use the embeddings learnt at the last training step for prediction. An interesting observation is that training our model on a randomized order of graphs can result in performance comparable to the true sequence on some tasks, such as NMI on DBLP and modularity on Reddit. We also notice that training GRADE on the true sequence consistently leads to performance as good as or better than training on the randomized graph sequence, corroborating that our proposed model captures patterns of temporal dynamics.

On the only dataset where we do not outperform, IMDb, GRADE produces dynamic link prediction results close in absolute value to the best-performing method (3rd out of 6 methods). On the community-level tasks, GRADE is also in the upper half in performance (2nd and 3rd out of 6 for modularity and top-250 node prediction respectively). We hypothesise this behaviour is a result of the low node activity in IMDb (see Table 2), where a node is present, on average, in the graph vertex sets of 2 (out of 20) time steps. In support of our claim, Pareja et al. (2019) argue that recurrent networks, such as the GRUs used in our implementation, struggle to learn the irregular behaviour of frequently appearing and disappearing vertices.

In Figure 2 we visualize the temporal evolution of the top 10 most probable authors from a community strongly associated with Artificial Intelligence, learnt by GRADE on all time steps from DBLP. We observe the top authors in each year work within the same general research area (coherence) and the community is broadly in agreement with historical events (interpretability). For example, our model assigns high probability to influential researchers like Yoshua Bengio and Ian J. Goodfellow in later years.

6 Conclusion

In this paper, we propose GRADE - a method which jointly learns evolving node and community representations in discrete-time dynamic graphs. We achieve this with an edge generative mechanism modelling the interaction between local and global graph structures via node and community multinomial distributions. We parametrize these distributions with the learnt embeddings, and evolve them over time with a Gaussian state-space model. Moreover, we introduce transition matrices to explicitly capture node community dynamics. Finally, we validate the effectiveness of GRADE on real-world datasets on the tasks of dynamic link prediction, dynamic community detection, and the novel task of predicting community-scale dynamics, that is inferring future structurally influential vertices.

Broader Impact

GRADE is a general framework that is able to characterize the global and local dynamics of networks. As networks are ubiquitous in the real world, our method can be used in a variety of applications and domains, such as modelling the dynamics of social media (e.g., Twitter and Facebook), the evolution of a research community, or the dynamics of biological networks. In addition, GRADE could potentially be used to model the dynamic contagion network of COVID-19 and to predict and track COVID-19 patients.

On the other hand, GRADE is a method for dynamic graph representation learning which does not use node features. As a result, our approach relies on patterns of network structural change at the node and community level. Consequently, inherent bias from dataset pre-processing can also propagate to the model’s predictions, which may lead to ethical issues of fairness.



7 Supplementary Material

7.1 A. ELBO

At each time step, the evidence lower bound for our model from Eq. 6 can be expressed as:


We average the KL divergence terms between the priors and posteriors across all node representations so that they do not overpower the other loss terms.
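As a sketch of this averaging, the following is our own NumPy code, assuming diagonal-Gaussian priors and posteriors over the node embeddings; the dimensions and names are illustrative, not from the paper:

```python
import numpy as np

def kl_diag_gauss(mu_q, sig_q, mu_p, sig_p):
    """KL(q || p) for diagonal Gaussians, summed over embedding dimensions."""
    var_q, var_p = sig_q ** 2, sig_p ** 2
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0,
        axis=-1,
    )

rng = np.random.default_rng(0)
N, D = 100, 16  # illustrative: 100 nodes, 16-dimensional embeddings

mu_q = rng.normal(size=(N, D))                   # posterior means
sig_q = np.exp(0.1 * rng.normal(size=(N, D)))    # posterior std devs
mu_p, sig_p = np.zeros((N, D)), np.ones((N, D))  # standard-normal prior

# Averaging (rather than summing) the N node-wise KL terms keeps this
# regularizer from overpowering the reconstruction term in the ELBO.
kl_term = kl_diag_gauss(mu_q, sig_q, mu_p, sig_p).mean()
```

The averaged term scales as O(1) in the number of nodes, whereas a summed KL would grow linearly with N and dominate the per-edge reconstruction likelihood.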

7.2 B. Hyperparameter ranges

We report the sets of hyperparameters we explored during the cross-validation stage for each method.
  • walk lengths =
  • numbers of walks =
  • p return parameter =
  • q in-out parameter set to 1

  • walk lengths =
  • numbers of walks =

  • spatial dropout range =
  • temporal dropout range =
  • walk lengths =

  • beta 0 =
  • beta 1 =

The number of communities is set to 8 for DBLP, 12 for IMDb, and 25 for Reddit. We cross-validate our model with the following combinations of temporal smoothness hyperparameters: (0.01, 0.1), (0.1, 1.0), and (1.0, 10.0).

7.3 C. Datasets