1 Introduction
Representation learning over graph-structured data has generated significant interest in the machine learning community, owing to widespread applications in a variety of interaction-based networks, such as social and communication networks, bioinformatics and relational knowledge graphs. Developing methods for unsupervised graph representation learning is challenging, as it requires summarizing the graph structural information in low-dimensional embeddings. These representations can then be used for downstream tasks such as node classification, link prediction and community detection.
The majority of unsupervised graph representation learning methods have focused solely on static, non-evolving graphs, while many real-world networks exhibit complex temporal behaviour. To address the challenge of encoding temporal patterns of relational data, existing methods for dynamic graph embedding focus extensively on capturing node evolution (Goyal et al., 2018, 2020; Sankar et al., 2020; Zhou et al., 2018). Although these methods achieve compelling results against static baselines on dynamic tasks, they do not lend themselves to capturing the evolution of graph-level structures, such as clusters of nodes, or communities. On the other hand, the patterns of evolving node clusters are of great interest in social networks (Kossinets and Watts, 2006; Greene et al., 2010; Yang et al., 2011), and are also encountered in the temporal organization of large-scale brain networks (Vidaurre et al., 2017), among others.
To address this challenge, we propose GRADE (GRAph Dynamic Embedding), a probabilistic generative model for jointly learning evolving node and community representations. The benefit of modelling the interaction between nodes and communities for graph representation learning in the static setting was studied by vGraph (Sun et al., 2019). Further, Battiston et al. (2020) present evidence that taking higher-order graph structures, such as communities, into consideration enhances our capability to model emergent dynamical behaviour. Consequently, in this work, we extend the idea of modelling node-community interactions, proposed in vGraph, to the dynamic case. We represent a dynamic network as a sequence of graph snapshots over a series of discrete and equally-spaced time intervals. At each time step, we model the edge generation process between node neighbours via multinomial community and node distributions. First, we sample a community assignment for each node from a distribution over the communities, i.e. $c \sim p_t(c \mid w)$. Then, we sample a neighbour from the distribution over the nodes of the assigned community, that is $v \sim p_t(v \mid c)$. Both the community and node distributions are parametrized by neural network transformations of the node and community embeddings. In our work, we assume that the semantic meaning of communities and the proportions over the communities for each node evolve simultaneously over time. Following an approach introduced in dynamic topic modelling (Dieng et al., 2019), we encode temporal evolution in our method by assuming a random walk prior over the representations between time steps. Furthermore, we draw inspiration from social networks, where a user's preferences can shift from one community to another. We explicitly model the dynamism in community membership by introducing a node-specific and time-varying transition matrix to update the community mixture coefficients over time. We design an effective algorithm for inference via backpropagation. We learn the parameters of our model by means of variational inference to maximize a lower bound on the likelihood of the observed data. More specifically, we resort to amortized inference (Gershman and Goodman, 2014) to learn neural network mappings from node and community representations to the respective conditional distributions, as well as structured variational inference (Hoffman and Blei, 2015; Saul and Jordan, 1996) to retain the dependence of the embeddings on their historical states. Our proposed method is aimed at non-attributed dynamic graphs. It is worth noting that although GRADE is a transductive approach, changes of vertex sets between snapshots at different time steps do not pose a problem if the complete vertex set is known a priori. In the experimental section, we evaluate our model on the tasks of dynamic link prediction and dynamic non-overlapping community detection on real-world dynamic graphs. Our results show GRADE is competitive with or outperforms other state-of-the-art static and dynamic transductive approaches for unsupervised graph representation learning. Furthermore, we provide visualizations of dynamic community evolution.
2 Related Work
Methods for unsupervised learning on evolving graphs are often dynamic extensions of ideas applied in the static case. (1) Graph factorization approaches such as DANE (Li et al., 2017) rely on spectral embedding, similarly to static methods like Ahmed et al. (2013); Cao et al. (2015); Belkin and Niyogi (2003). DANE assumes smooth temporal evolution and models it using matrix perturbation theory. (2) In skip-gram models, node representations are learnt via random walk objectives (Grover and Leskovec, 2016; Perozzi et al., 2014; Tang et al., 2015). In the dynamic case, CTDNE (Nguyen et al., 2018) and NetWalk (Yu et al., 2018) augment the random walk with temporal constraints based on timestamped edges. Further, (3) temporal point processes have also been used in combination with neural network parametrization by Know-Evolve (Trivedi et al., 2017) and DyRep (Trivedi et al., 2019) to model continuous-time node interactions in multi-relational and simple dynamic graphs, respectively. (4) More recently, graph convolutional neural networks (GNNs) have become a widely used tool for graph representation learning (Kipf and Welling, 2016; Veličković et al., 2018; Bruna et al., 2014). A popular approach to include temporal dependency in GNNs is to introduce a recurrent mechanism. For example, Seo et al. (2018) propose two ways to achieve this goal. One way is to obtain node embeddings via a GNN, which are then fed to an LSTM to learn dynamism. The second approach modifies the LSTM layers to incorporate graph-structured data. A different approach altogether is to evolve the graph convolutional parameters with a recurrent neural network (RNN), as opposed to the node embeddings, such as in EvolveGCN (Pareja et al., 2019), hence addressing issues stemming from rapidly changing node sets between time steps. Alternatively, STGCN (Yu et al., 2017) avoids using RNNs completely by introducing an efficient ST-Conv layer for faster training with few model parameters.

The most related body of work to this paper is the category of transductive unsupervised methods applied to temporally discrete and non-attributed dynamic graphs. One such approach is DynGEM (Goyal et al., 2018),
which employs deep autoencoders to learn node embeddings but uses no recurrent structures for temporal dependency. Instead, time dynamics are injected by initializing the parameters of the autoencoder at each time step with the parameters learnt at the previous step. Unlike our proposed method GRADE, DynGEM can only handle growing dynamic networks. Another method is dyngraph2vec (Goyal et al., 2020), which is trained to predict future links based on current node embeddings using an LSTM mechanism. DynamicTriad (Zhou et al., 2018) models the process of temporal link formation between vertices with common neighbours, that is, triadic closure, and enforces latent space similarity between nodes connected in the future. DySAT (Sankar et al., 2020) draws inspiration from the success of attention mechanisms and applies them structurally over local neighbourhoods and temporally over historical representations. The advantages of GRADE over these competitive methods are that, firstly, we learn both node- and community-level dynamic embeddings and, secondly, our approach can be used to infer the embeddings at future time steps. In comparison, these dynamic methods use representations learnt at the last training step for dynamic prediction.

Finally, GRADE is also related to dynamic topic modelling (Dieng et al., 2019; Blei and Lafferty, 2006), as both can be viewed as state-space models. The difference is that in GRADE we are dealing with multinomial distributions over nodes and communities instead of topics and words. Moreover, some works (Bamler and Mandt, 2017; Rudolph and Blei, 2018) have focused on the shift of word meaning over time, and others, such as Dieng et al. (2019), model the evolution of documents. In contrast, GRADE assumes both nodes and communities undergo temporal semantic shift.
3 Problem Definition and Preliminaries
We consider a dataset comprising a sequence of non-attributed (i.e. without node features) graph snapshots $\{G^t = (V^t, E^t)\}_{t=1}^{T}$ over a series of discrete and equally-spaced time intervals, such that $t$ is an integer time index. We assume all the edges in snapshot $G^t$ occur at time $t$, and that the complete set of vertices $V$ in the dynamic graph is known a priori. Our method supports the addition or removal of nodes as well as edges between time steps. We also assume there are $K$ communities (clusters of nodes) in the dynamic network. Our method aims to learn time-evolving vector representations $\phi_w^t$ for all nodes $w \in V$ and $\psi_c^t$ for all communities $c \in \{1, \dots, K\}$, for each time step $t$. Further, a useful model for dynamic graph embedding should not only capture the patterns of temporal evolution in the node and community representations but also be able to predict their future trajectories.

4 Methodology
4.1 GRADE: Generative Model Description
GRADE is a probabilistic method for modelling the edge generation process in dynamic graphs. We adopt the approach of vGraph (Sun et al., 2019) to represent each node in the active vertex set of $G^t$ as a mixture of communities, and each community as a multinomial distribution over the nodes. The linked neighbour generation at each time step proceeds as follows: first, we sample a community assignment from a conditional prior distribution over the communities, $c \sim p_t(c \mid w)$. Then, a neighbour is drawn from the node generative distribution $p_t(v \mid c)$, based on the social context defined by the assigned community. The generative process for graph snapshot $G^t$ can be formulated as:

$$p_t(v \mid w) = \sum_{c=1}^{K} p_t(c \mid w)\, p_t(v \mid c) \qquad (1)$$
where $\psi_c^t$ and $\pi_w^t$ parametrize the multinomial generative and prior distributions at time step $t$, respectively. In our dynamic graph model, we suppose that the semantic meaning of communities as well as the community proportions for nodes change over time. This necessitates capturing the temporal evolution of the underlying node and community distributions by an evolving set of parameters. GRADE achieves this by making these parameters implicitly dependent on the evolving node and community embeddings, $\phi_w^t$ and $\psi_c^t$ respectively. More specifically, we treat the community and node representations as random variables and impose a simple state-space model that evolves smoothly with Gaussian noise between time steps as follows:

$$\psi_c^t \sim \mathcal{N}\big(\psi_c^{t-1}, \sigma_\psi^2 I\big) \qquad (2)$$

$$\phi_w^t \sim \mathcal{N}\big(\phi_w^{t-1}, \sigma_\phi^2 I\big) \qquad (3)$$
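For intuition, the Gaussian random-walk priors in Eqs. (2) and (3) can be simulated directly. The sketch below is a toy illustration with arbitrary dimensions and noise scales; all variable names are ours, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

T, K, N, d = 5, 3, 10, 8          # time steps, communities, nodes, embedding dim
sigma_psi, sigma_phi = 0.1, 0.1   # temporal smoothness hyperparameters

psi = np.zeros((T, K, d))         # community embeddings per time step
phi = np.zeros((T, N, d))         # node embeddings per time step
psi[0] = rng.normal(size=(K, d))
phi[0] = rng.normal(size=(N, d))

# Gaussian random-walk prior: each embedding drifts smoothly between steps,
# with the sigma hyperparameters controlling the rate of temporal change
for t in range(1, T):
    psi[t] = psi[t - 1] + sigma_psi * rng.normal(size=(K, d))
    phi[t] = phi[t - 1] + sigma_phi * rng.normal(size=(N, d))
```

Smaller noise scales tie consecutive embeddings more tightly together, which is the sense in which these hyperparameters control temporal smoothness.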
Note that we evolve the embeddings of the complete vertex set at each time step, although our model allows for only a subset of the nodes to be present at each time step. The temporal smoothness hyperparameters $\sigma_\psi$ and $\sigma_\phi$ control the rate of temporal dynamics. The parametrization of the generative distribution is achieved by first transforming the community representations through a neural network $g$, and then mapping the output through a softmax layer: $p_t(\cdot \mid c) = \mathrm{softmax}\big(g(\psi_c^t)\big)$.

To evolve the community mixture weights for nodes, we observe that users' interests in a social network change over time. As a result, users may shift from one community to another. This is characterized by user-specific behaviour within the broader context of community evolution. For these reasons, we explicitly model community transition with a transition matrix. More specifically, for each node $w$, we update the community mixture weights by means of a node-specific and time-varying transition matrix $T_w^t$, produced as a function, $f(\phi_w^t)$, of the node embeddings:

$$\pi_w^t = T_w^t\, \pi_w^{t-1} \qquad (4)$$
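As a concrete sketch of this update, the snippet below builds a node-specific transition matrix whose columns are normalized with a softmax, so that it maps one distribution over communities to another. The random matrix `W` standing in for the learnt function producing the transition matrix is a hypothetical placeholder, not the paper's parametrization:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
K, d = 4, 8

phi_w = rng.normal(size=d)              # node embedding at time t
W = 0.1 * rng.normal(size=(K * K, d))   # hypothetical stand-in for the learnt map

# Column-stochastic transition matrix: column j holds the probabilities
# of moving from community j to every other community
T_w = softmax((W @ phi_w).reshape(K, K), axis=0)

pi_prev = np.full(K, 1.0 / K)           # mixture weights at the previous step
pi_t = T_w @ pi_prev                    # updated community mixture weights
```

Because every column of the transition matrix sums to one, the updated mixture weights remain a valid probability distribution over the communities.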
In summary, GRADE's edge generative process for each graph snapshot $G^t$ is as follows:

Draw community embeddings for $c \in \{1, \dots, K\}$: $\psi_c^t \sim \mathcal{N}(\psi_c^{t-1}, \sigma_\psi^2 I)$

Draw node embeddings for all nodes $w \in V$: $\phi_w^t \sim \mathcal{N}(\phi_w^{t-1}, \sigma_\phi^2 I)$

Update community mixture coefficients for node $w$: $\pi_w^t = f(\phi_w^t)\, \pi_w^{t-1}$

For each edge $(w, v)$ in $E^t$:

Draw community assignment from the multinomial prior over the communities: $c \sim \mathrm{Multinomial}(\pi_w^t)$

Parameters of the distribution over the nodes are a function of $\psi_c^t$: $p_t(\cdot \mid c) = \mathrm{softmax}\big(g(\psi_c^t)\big)$

Draw linked neighbour from the node generative distribution for sampled community $c$: $v \sim p_t(\cdot \mid c)$
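The generative steps above can be sketched end to end for a single snapshot. This is a toy illustration with random stand-ins for the learnt embeddings and neural maps (the names below are ours, not the paper's):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
K, N, d = 3, 20, 8

psi = rng.normal(size=(K, d))               # community embeddings at time t
g_matrix = 0.5 * rng.normal(size=(N, d))    # hypothetical stand-in for the neural map
p_v_given_c = softmax(psi @ g_matrix.T, axis=1)   # (K, N): a node distribution per community

pi = softmax(rng.normal(size=(N, K)), axis=1)     # community mixture weights per node

edges = []
for w in rng.choice(N, size=30):            # sample 30 source nodes
    c = rng.choice(K, p=pi[w])              # draw a community assignment
    v = rng.choice(N, p=p_v_given_c[c])     # draw a linked neighbour from that community
    edges.append((int(w), int(v)))
```

In the real model, the mixture weights and node distributions are produced by the learnt parametrizations described above rather than drawn at random.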
The graphical model of the proposed generative process is shown in Figure 1 and common notation used in the paper in Table 1.
4.2 Inference Algorithm
Consider we are given a dataset comprising a set of node links $(w, v)$ for a sequence of graph snapshots $\{G^t\}_{t=1}^{T}$. In our dynamic graph model, the latent variables are the hidden community assignments $c$, and the evolving node and community representations $\phi_w^t$ and $\psi_c^t$. The logarithm of the marginal probability of the observations is given by the sum of the log probability of each observed temporal edge of all nodes in $V$:

$$\log p\big(\{E^t\}_{t=1}^{T}\big) = \sum_{t=1}^{T} \sum_{(w,v) \in E^t} \log p_t(v \mid w) \qquad (5)$$
Exact inference of the posterior is intractable. Instead, we resort to variational methods as a means of approximation. Variational inference provides flexibility when choosing an appropriate family of distributions $q_\lambda$, indexed by the parameters $\lambda$, as an approximation to the true posterior. The idea is to optimize the parameters by minimizing the Kullback-Leibler (KL) divergence between the true posterior and its approximation. This procedure is equivalent to maximizing the evidence lower bound (ELBO) (Kingma and Welling, 2013):

$$\mathcal{L}(\lambda) = \mathbb{E}_{q_\lambda(Z)}\big[\log p\big(\{E^t\}, Z\big) - \log q_\lambda(Z)\big] \le \log p\big(\{E^t\}\big) \qquad (6)$$

where $Z$ collects the latent variables $\{c\}$, $\{\phi_w^t\}$ and $\{\psi_c^t\}$.
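The Gaussian reparametrization used when optimizing such a bound can be sketched as follows: a sample from the approximate posterior is written as a deterministic function of its parameters and parameter-free noise, which is what makes a Monte Carlo estimate of the ELBO differentiable. This is a generic illustration with arbitrary numbers, not GRADE's code:

```python
import numpy as np

rng = np.random.default_rng(3)

mu = np.array([0.5, -1.0])            # posterior means
log_sigma = np.array([-1.0, -0.5])    # posterior log standard deviations

# Reparametrization trick: z = mu + sigma * eps with eps ~ N(0, I),
# so gradients can flow through mu and log_sigma
eps = rng.normal(size=(5000, 2))
z = mu + np.exp(log_sigma) * eps      # samples from q(z) = N(mu, sigma^2)

# Closed-form KL(q || p) against a standard normal prior p = N(0, I)
sigma2 = np.exp(2.0 * log_sigma)
kl = 0.5 * np.sum(sigma2 + mu**2 - 1.0 - 2.0 * log_sigma)
```

The KL term acts as the regularizer in the bound, while the expected log-likelihood under the reparametrized samples plays the role of the reconstruction term.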
The variational approximation we choose takes the form:

$$q(Z) = \prod_{t=1}^{T} \Bigg[ \prod_{(w,v) \in E^t} q\big(c \mid w, v\big) \prod_{w \in V} q\big(\phi_w^t \mid \phi_w^{1:t-1}\big) \prod_{c=1}^{K} q\big(\psi_c^t \mid \psi_c^{1:t-1}\big) \Bigg] \qquad (7)$$
The variational distributions over the node and community representations depend on all of their respective historical states. We capture this temporal dependency with Gated Recurrent Units (GRUs). We model the outputs of both $q(\phi_w^t \mid \phi_w^{1:t-1})$ and $q(\psi_c^t \mid \psi_c^{1:t-1})$ as Gaussian distributions, whose means and diagonal covariance vectors are given by the outputs of their respective GRU units. The advantage of this structured approach (Hoffman and Blei, 2015; Saul and Jordan, 1996), where we retain dependency only on previous states, is that it allows us to easily infer the posterior distributions of the node and community representations at future time steps. The difference between the approximated multinomial conditional prior over the community assignments and the approximated multinomial posterior is in the dependence on the neighbour $v$. In principle, this dependency can be easily integrated in the parametrization via amortized inference (Gershman and Goodman, 2014). More specifically, we use both embeddings $\phi_w^t$ and $\phi_v^t$ as inputs to the transformation generating the community transition matrix. Also, the structure of the variational distribution over the assignments enables an efficient procedure for inferring edge labels, as well as the following community membership approximation:

$$q(c \mid w) = \frac{1}{|N(w)|} \sum_{v \in N(w)} q(c \mid w, v) \qquad (8)$$

where $N(w)$ is the set of neighbours of node $w$. The procedure is also applicable on future test graphs. Optimizing the lower bound (Eq. (6)) w.r.t. all parameters is performed via stochastic optimization, using the reparametrization trick (Kingma et al., 2014) and Gumbel-Softmax reparametrization (Jang et al., 2016; Maddison et al., 2016) to obtain gradients. The ELBO can be formulated as a neighbour reconstruction loss plus a sum of KL regularization terms between the priors and posteriors of each of the latent variables. Refer to Algorithm 1 for a summary of the procedure.
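The Gumbel-Softmax reparametrization mentioned above relaxes the categorical sampling of a community assignment to a differentiable sample on the probability simplex. A minimal sketch, with an arbitrary temperature value:

```python
import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=None):
    """Differentiable relaxation of sampling from a categorical distribution."""
    if rng is None:
        rng = np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

rng = np.random.default_rng(4)
logits = np.log(np.array([0.7, 0.2, 0.1]))   # log community probabilities
sample = gumbel_softmax(logits, tau=0.5, rng=rng)
# `sample` lies on the simplex and concentrates on a single community
# as tau -> 0, while remaining differentiable w.r.t. the logits
```

Lower temperatures make the relaxed sample closer to a one-hot community assignment at the cost of higher-variance gradients.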
5 Experiments
We evaluate our proposed model on the tasks of dynamic link prediction and dynamic community detection against stateoftheart baselines. Furthermore, we propose a quantitative metric to assess the quality of the learnt evolving communities and provide visualizations for a qualitative assessment.
5.1 Data sets
We use three discretetime dynamic networks based on the DBLP, IMDb and Reddit datasets to evaluate our method. A summary of all datasets is provided in Table 2.
DBLP. We preprocess the DBLP dataset to identify the top 10,000 most prolific authors in terms of publication count in the years 2000-2018 inclusive. We construct a graph snapshot for each year based on co-authorship. We produce yearly labels for authors if over half of their annual publications fall within the same research category.
Reddit is a timestamped hyperlink network between subreddits spanning 40 months (Kumar et al., 2018). We link two subreddits if one posts a hyperlink to the other. We divide the network into 10 graph snapshots.
IMDb. We first identify the 10,000 most popular movies in terms of highest number of votes for the years 2000-2019 inclusive. We form links between the principals (director, producer, main actors) for each movie to form a dynamic network.
5.2 Baseline methods
We compare GRADE against five baselines comprising three static and two dynamic methods. The static methods are: DeepWalk (Perozzi et al., 2014), node2vec (Grover and Leskovec, 2016) and vGraph (Sun et al., 2019). The dynamic graph methods are: DynamicTriad (Zhou et al., 2018) and DySAT (Sankar et al., 2020).
5.3 Evaluation Metrics
Dynamic link prediction. An important application of dynamic graph embedding is capturing the pattern of evolution in the training set to predict edges at future time steps. For all baselines we use a measure of similarity between node representations (Euclidean distance or dot product) as a predictor of connectivity, following each method's implementation. For static methods, we aggregate all observed edges in the training set into a single graph to produce node embeddings. For dynamic baselines, the vertex representations at the last training step are used. For GRADE, we train our model, and infer the posterior distributions of the node and community representations at the test time steps. We evaluate dynamic link prediction performance using the mean average rank (MAR) metric. To calculate mean average rank we first produce a ranking of candidate neighbours spanning the complete vertex set for each source node in the test set edge list. The ranking is produced via a similarity measure on the node embeddings for all baseline methods. For GRADE and vGraph, we produce a distribution over the neighbours by summing over all possible community assignments from the prior $p_t(c \mid w)$; that is, we do not incorporate any neighbour information, in order to guarantee a fair comparison:
$$p_t(v \mid w) = \sum_{c=1}^{K} p_t(c \mid w)\, p_t(v \mid c) \qquad (9)$$

and rank nodes according to their probability. We identify the rank of the ground-truth neighbour and average over all test edges.
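The metric can be computed as in the following sketch, a minimal tie-free implementation with made-up scores, where rank 1 is best:

```python
import numpy as np

def mean_average_rank(scores, true_neighbours):
    """scores[i, v]: predicted score of node v as neighbour of source i."""
    ranks = []
    for s, v in zip(scores, true_neighbours):
        # rank of the ground-truth neighbour among all candidates (1 = best)
        ranks.append(1 + np.sum(s > s[v]))
    return float(np.mean(ranks))

scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2]])
# true neighbours: node 1 for the first edge (rank 1), node 2 for the second (rank 3)
assert mean_average_rank(scores, [1, 2]) == 2.0
```

Lower values indicate better link prediction, since the ground-truth neighbour sits higher in the predicted ranking.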
Dynamic community detection is another relevant use case for our method. More specifically, we leverage historical information by training a model on the training time steps, and infer non-overlapping communities given the edges in the test set. We evaluate performance on this task using Normalized Mutual Information (NMI) (Tian et al., 2014) and Modularity. Publicly available dynamic network datasets with labelled evolving communities are difficult to obtain. We use the DBLP dataset, which we have manually labelled.
Further, a novel application of GRADE is predicting community-scale dynamics. We demonstrate this capability by inferring the community representations (i.e., the posterior multinomial distribution over the nodes for each community) for the test time steps, and producing rankings of the most probable nodes. A vertex predicted to have high probability for a given community should also be integral to its structure. We evaluate performance on this task by calculating Spearman's rank correlation coefficient between the predicted node probabilities of the top-250 vertices in each community and the same nodes' centrality, as measured by the number of links to vertices assigned to the same community on the test set.
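For reference, Spearman's coefficient is the Pearson correlation of the ranks. A minimal tie-free sketch with hypothetical probabilities and degrees:

```python
import numpy as np

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks (no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)   # rank of each entry of x
    ry = np.argsort(np.argsort(y)).astype(float)   # rank of each entry of y
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# predicted node probabilities vs. observed within-community degree
probs = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
degrees = np.array([12, 10, 9, 3, 1])
assert spearman(probs, degrees) == 1.0   # identical ordering gives coefficient 1
```

Library implementations such as `scipy.stats.spearmanr` additionally handle ties, which this sketch deliberately omits.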
5.4 Experimental Procedure
We cross-validate all methods and identify the best set of hyperparameters on the task of dynamic link prediction via grid search. The train/validation/test splits are made across time steps as shown in Table 2. We use no node attributes in any of our experiments and set the node embedding dimensionality to 128 for all methods. For all baselines with the exception of vGraph and GRADE, we apply K-means to the learnt vertex representations to identify non-overlapping communities. Further, since the majority of baseline methods (other than vGraph and GRADE) do not produce distributions over the nodes for each community, we use the k-nearest-neighbours algorithm to identify the top-250 nodes closest in representation space to each cluster's centroid for the task of predicting community-scale dynamics. For consistency between baselines, we determine the number of communities to be detected as part of the cross-validation procedure for GRADE. The implementations provided by the authors are used for all baselines. We train GRADE using the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.005, which is decayed by a factor of 0.99 every 100 iterations. To save the best models for GRADE and vGraph, we train for 10,000 epochs on the task of dynamic link prediction and select the models with the lowest mean average rank on the validation set. The same models were used in the evaluation for all tasks. We use the procedures provided by the authors' implementations of DySAT and DynamicTriad to save best-performing models. Owing to the size of the dynamic networks, we cannot use full-batch training. We resort to training GRADE stochastically by splitting the edges at each time step into equally sized batches. Since our model is transductive, we report results on nodes that have been observed in the training set. All results are averaged across 4 runs.

5.5 Results
Table 3 summarizes the results on dynamic link prediction. We observe that GRADE noticeably outperforms the second-best method on the DBLP and Reddit datasets in terms of mean average rank, and achieves results comparable to the baselines on IMDb. Further, to examine whether GRADE captures the true community and node dynamics, we randomize the sequence of graphs in the train set while retaining the true order in the validation and test sets. We observe noticeable degradation after randomization on all datasets, which suggests that GRADE identifies a pattern of temporal evolution instead of learning aggregated graph representations.
Results on dynamic community detection and predicting community-scale dynamics are presented in Table 4. In these tasks GRADE also noticeably outperforms all baselines on the DBLP and Reddit datasets. This also suggests that the capability of our model to infer node and community representations at future test time steps aids performance. This is in contrast with other dynamic baselines, which use the embeddings learnt at the last training step for prediction. An interesting observation is that training our model on a randomized order of graphs can result in performance comparable to the true sequence on some tasks, such as NMI on DBLP and modularity on Reddit. We also notice that training GRADE on the true sequence consistently leads to performance as good as or better than training on a randomized graph sequence, corroborating that our proposed model captures patterns of temporal dynamics.
For the only dataset on which we do not outperform, IMDb, GRADE produces dynamic link prediction results close in absolute value to the best-performing method (third out of six methods). On the community-oriented tasks, GRADE is also in the upper half in performance (second and third out of six for modularity and top-250 node prediction, respectively). We hypothesise this behaviour is a result of the low node activity in IMDb (see Table 2), where a node is present, on average, in the graph vertex sets of 2 (out of 20) time steps. In support of our claim, Pareja et al. (2019) argue that recurrent networks, such as the GRUs used in our implementation, struggle to learn the irregular behaviour of frequently appearing and disappearing vertices.
In Figure 2 we visualize the temporal evolution of the top 10 most probable authors from a community strongly associated with Artificial Intelligence, learnt by GRADE on all time steps from DBLP. We observe the top authors in each year work within the same general research area (coherence) and the community is broadly in agreement with historical events (interpretability). For example, our model assigns high probability to influential researchers like Yoshua Bengio and Ian J. Goodfellow in later years.
6 Conclusion
In this paper, we propose GRADE, a method which jointly learns evolving node and community representations in discrete-time dynamic graphs. We achieve this with an edge generative mechanism modelling the interaction between local and global graph structures via node and community multinomial distributions. We parametrize these distributions with the learnt embeddings, and evolve them over time with a Gaussian state-space model. Moreover, we introduce transition matrices to explicitly capture the community membership dynamics of nodes. Finally, we validate the effectiveness of GRADE on real-world datasets on the tasks of dynamic link prediction, dynamic community detection, and the novel task of predicting community-scale dynamics, that is, inferring future structurally influential vertices.
Broader Impact
GRADE is a general framework that is able to characterize the global and local dynamics of networks. As networks are ubiquitous in the real world, our method can be used in a variety of applications and domains, such as modelling the dynamics of social media (e.g., Twitter and Facebook), modelling the evolution of a research community, and modelling the dynamics of biological networks. In addition, GRADE could potentially be used for modelling the dynamic contagion network of COVID-19 and for predicting and tracking COVID-19 patients.
On the other hand, GRADE is a method for dynamic graph representation learning which does not use node features. As a result, our approach relies on patterns of network structural change at the node and community level. Consequently, inherent bias from dataset preprocessing can also propagate to the model’s predictions, which may lead to ethical issues of fairness.
References
 Ahmed et al. (2013) A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A. J. Smola. Distributed large-scale natural graph factorization. In Proceedings of the 22nd international conference on World Wide Web, pages 37–48, 2013.
 Bamler and Mandt (2017) R. Bamler and S. Mandt. Dynamic word embeddings. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 380–389. PMLR, 06–11 Aug 2017.
 Battiston et al. (2020) F. Battiston, G. Cencetti, I. Iacopini, V. Latora, M. Lucas, A. Patania, J.-G. Young, and G. Petri. Networks beyond pairwise interactions: structure and dynamics, 2020.
 Belkin and Niyogi (2003) M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003.
 Blei and Lafferty (2006) D. M. Blei and J. D. Lafferty. Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning, pages 113–120, 2006.
 Bruna et al. (2014) J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR 2014), CBLS, April 2014, 2014.
 Cao et al. (2015) S. Cao, W. Lu, and Q. Xu. Grarep: Learning graph representations with global structural information. In Proceedings of the 24th ACM international on conference on information and knowledge management, pages 891–900, 2015.
 Dieng et al. (2019) A. B. Dieng, F. J. Ruiz, and D. M. Blei. The dynamic embedded topic model. arXiv preprint arXiv:1907.05545, 2019.
 Gershman and Goodman (2014) S. Gershman and N. Goodman. Amortized inference in probabilistic reasoning. In Proceedings of the annual meeting of the cognitive science society, volume 36, 2014.
 Goyal et al. (2018) P. Goyal, N. Kamra, X. He, and Y. Liu. Dyngem: Deep embedding method for dynamic graphs. arXiv preprint arXiv:1805.11273, 2018.
 Goyal et al. (2020) P. Goyal, S. R. Chhetri, and A. Canedo. dyngraph2vec: Capturing network dynamics using dynamic graph representation learning. Knowledge-Based Systems, 187:104816, 2020.
 Greene et al. (2010) D. Greene, D. Doyle, and P. Cunningham. Tracking the evolution of communities in dynamic social networks. In 2010 international conference on advances in social networks analysis and mining, pages 176–183. IEEE, 2010.
 Grover and Leskovec (2016) A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864, 2016.
 Hamilton et al. (2017) W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in neural information processing systems, pages 1024–1034, 2017.
 Hoffman and Blei (2015) M. D. Hoffman and D. M. Blei. Structured stochastic variational inference. In Artificial Intelligence and Statistics, 2015.
 Jang et al. (2016) E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
 Kingma and Ba (2014) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma and Welling (2013) D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kingma et al. (2014) D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in neural information processing systems, pages 3581–3589, 2014.
 Kipf and Welling (2016) T. N. Kipf and M. Welling. Variational graph autoencoders. arXiv preprint arXiv:1611.07308, 2016.
 Kossinets and Watts (2006) G. Kossinets and D. J. Watts. Empirical analysis of an evolving social network. science, 311(5757):88–90, 2006.
 Kumar et al. (2018) S. Kumar, W. L. Hamilton, J. Leskovec, and D. Jurafsky. Community interaction and conflict on the web. In Proceedings of the 2018 World Wide Web Conference, pages 933–943, 2018.
 Li et al. (2017) J. Li, H. Dani, X. Hu, J. Tang, Y. Chang, and H. Liu. Attributed network embedding for learning in a dynamic environment. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 387–396, 2017.
 Maddison et al. (2016) C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 Nguyen et al. (2018) G. H. Nguyen, J. B. Lee, R. A. Rossi, N. K. Ahmed, E. Koh, and S. Kim. Continuous-time dynamic network embeddings. In Companion Proceedings of The Web Conference 2018, pages 969–976, 2018.
 Pareja et al. (2019) A. Pareja, G. Domeniconi, J. Chen, T. Ma, T. Suzumura, H. Kanezashi, T. Kaler, and C. E. Leiserson. Evolvegcn: Evolving graph convolutional networks for dynamic graphs. arXiv preprint arXiv:1902.10191, 2019.
 Perozzi et al. (2014) B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710, 2014.
 Rudolph and Blei (2018) M. Rudolph and D. Blei. Dynamic embeddings for language evolution. In Proceedings of the 2018 World Wide Web Conference, pages 1003–1011, 2018.
 Sankar et al. (2020) A. Sankar, Y. Wu, L. Gou, W. Zhang, and H. Yang. Dysat: Deep neural representation learning on dynamic graphs via self-attention networks. In Proceedings of the 13th International Conference on Web Search and Data Mining, pages 519–527, 2020.
 Saul and Jordan (1996) L. K. Saul and M. I. Jordan. Exploiting tractable substructures in intractable networks. In Advances in neural information processing systems, pages 486–492, 1996.
 Seo et al. (2018) Y. Seo, M. Defferrard, P. Vandergheynst, and X. Bresson. Structured sequence modeling with graph convolutional recurrent networks. In International Conference on Neural Information Processing, pages 362–373. Springer, 2018.
 Sun et al. (2019) F.-Y. Sun, M. Qu, J. Hoffmann, C.-W. Huang, and J. Tang. vgraph: A generative model for joint community detection and node representation learning. In Advances in Neural Information Processing Systems, pages 512–522, 2019.
 Tang et al. (2015) J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. Line: Large-scale information network embedding. In Proceedings of the 24th international conference on world wide web, pages 1067–1077, 2015.
 Tian et al. (2014) F. Tian, B. Gao, Q. Cui, E. Chen, and T.-Y. Liu. Learning deep representations for graph clustering. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
 Trivedi et al. (2017) R. Trivedi, H. Dai, Y. Wang, and L. Song. Know-evolve: Deep temporal reasoning for dynamic knowledge graphs. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3462–3471. JMLR.org, 2017.
 Trivedi et al. (2019) R. Trivedi, M. Farajtabar, P. Biswal, and H. Zha. Dyrep: Learning representations over dynamic graphs. In International Conference on Learning Representations (ICLR 2019), 2019.
 Veličković et al. (2018) P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm. Deep graph infomax. arXiv preprint arXiv:1809.10341, 2018.
 Vidaurre et al. (2017) D. Vidaurre, S. M. Smith, and M. W. Woolrich. Brain network dynamics are hierarchically organized in time. Proceedings of the National Academy of Sciences, 114(48):12827–12832, 2017.
 Yang et al. (2011) T. Yang, Y. Chi, S. Zhu, Y. Gong, and R. Jin. Detecting communities and their evolutions in dynamic social networks—a bayesian approach. Machine learning, 82(2):157–189, 2011.
 Yu et al. (2017) B. Yu, H. Yin, and Z. Zhu. Spatiotemporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875, 2017.

 Yu et al. (2018) W. Yu, W. Cheng, C. C. Aggarwal, K. Zhang, H. Chen, and W. Wang. Netwalk: A flexible deep embedding approach for anomaly detection in dynamic networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2672–2681, 2018.
 Zhou et al. (2018) L. Zhou, Y. Yang, X. Ren, F. Wu, and Y. Zhuang. Dynamic network embedding by modeling triadic closure process. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
7 Supplementary Material
7.1 A. ELBO
At each time step $t$, the evidence lower bound for our model from Eq. (6) can be expressed as:

$$\mathcal{L}^t = \sum_{(w,v) \in E^t} \Big( \mathbb{E}_{q(c \mid w, v)}\big[\log p_t(v \mid c)\big] - \mathrm{KL}\big(q(c \mid w, v)\,\|\,p_t(c \mid w)\big) \Big) - \sum_{c=1}^{K} \mathrm{KL}\big(q(\psi_c^t)\,\|\,p(\psi_c^t \mid \psi_c^{t-1})\big) - \frac{1}{|V|} \sum_{w \in V} \mathrm{KL}\big(q(\phi_w^t)\,\|\,p(\phi_w^t \mid \phi_w^{t-1})\big) \qquad (10)$$
We average over the KL divergence terms between the priors and posteriors for all node representations so as not to overpower the other loss terms.
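Concretely, the effect of this rescaling can be sketched with stand-in values; the point is only the relative weighting of the loss terms:

```python
import numpy as np

# With thousands of nodes but few communities, summing every per-node KL
# term would let the node regularizers dominate the reconstruction loss;
# averaging them instead keeps the terms on comparable scales.
N, K = 1000, 10
recon = 1.0                      # stand-in neighbour reconstruction term
kl_nodes = np.full(N, 0.05)      # stand-in per-node KL divergences
kl_comms = np.full(K, 0.05)      # stand-in per-community KL divergences

loss_summed = recon + kl_nodes.sum() + kl_comms.sum()     # node terms dominate
loss_averaged = recon + kl_nodes.mean() + kl_comms.sum()  # balanced objective
```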
7.2 B. Hyperparameter ranges
We report the sets of hyperparameters we have explored during the cross-validation stage for each method.
node2vec:
walk lengths =
numbers of walks =
p return parameter =
q inout parameter set to 1
deepwalk:
walk lengths =
numbers of walks =
DySAT:
spatial dropout range =
temporal dropout range =
walk lengths =
DynamicTriad:
beta 0 =
beta 1 =
GRADE:
The number of communities is set to 8 for DBLP, 12 for IMDb, and 25 for Reddit. We cross-validate our model with the following combinations of temporal smoothness hyperparameters $(\sigma_\phi, \sigma_\psi)$: (0.01, 0.1), (0.1, 1.0) and (1.0, 10.0).
7.3 C. Datasets
Links to download raw data:
DBLP: https://www.aminer.cn/citation
IMDb: https://datasets.imdbws.com/
Reddit: https://snap.stanford.edu/conflict/