1 Introduction
Representation learning over graph-structured data has emerged as a keystone machine learning task due to its ubiquitous applicability in a variety of domains such as social networks, bioinformatics, natural language processing, and relational knowledge bases. The key idea behind this task is to encode structural information at the node (or subgraph) level into low-dimensional embedding vectors that can be used as feature inputs to downstream tasks such as link prediction, clustering and classification [1, 2, 3, 4]. Traditionally, such domains have been modeled as static graphs where learning is conducted on a fixed set of nodes and edges [5, 6, 7, 8, 9, 10, 11]. However, many of these domains now present data that is highly dynamic in nature. For instance, social network communications, financial transaction graphs and longitudinal citation data contain fine-grained temporal information on various components of evolving graphs. While research on learning representations over static graphs has progressed rapidly in recent years, there is a conspicuous lack of principled approaches to tackle the unique challenges involved in highly dynamic graph structures [12].
Dynamic graphs have typically been treated from two viewpoints: a) Topologically evolving graphs, where the number of nodes and edges is expected to grow (or shrink) over time, e.g. collaboration networks. In this view, graphs are represented as a collection of snapshots at discrete time points. b) Temporal graphs, where each edge contains temporal information, e.g. telephone call networks. Such graphs are represented as a sequence of timestamped edges. Most real-world dynamic graphs exhibit properties from both of the above classes. We therefore take a unified abstract view of the two classes and describe them as two temporal processes over a single dynamic graph:
Association Process: Realized as growth of the network; leads to long-lasting information exchange between nodes.
Communication Process: Realized as interactions; leads to temporary information flow between nodes in the graph.
At a fundamental level, the former constitutes the dynamics of the network, which accounts for the structural properties of the network, while the latter constitutes the dynamics on the network, which accounts for the activities and transmission properties of the network [13, 14].
We observe that such an abstraction naturally gives rise to a complex transmission system where information, contained in a node's latent features, propagates across the graph in a nonlinear fashion. As noted in [15], an important feature of such systems is the ability to express the dynamical processes at different scales. While it is natural to consider that the association and communication processes are directly observed and evolve at different temporal scales [16], we propose an intermediate scale relevant to the hidden embedding propagation network that drives the interactions of the two processes across the graph. This leads to temporal evolution of the nodes participating in these processes, as the information ultimately propagates through them, updating their representations on the way.
Mediation Process: We therefore posit representation learning over dynamic graphs as the problem of learning node embeddings that serve as mediators influencing the dynamics of both the association and communication processes. Conversely, both these complex nonlinear processes ultimately lead to the evolution of the nodes' representations. Further, the feature information propagated through the mediator nodes is governed by the temporal dynamics of the communication and association histories of each node with its neighbors in the graph.
For instance, in a social network, when a node's neighborhood grows, it changes the node's social representation, which in turn affects her social interactions (association → embedding → communication), and this effect further percolates to all her neighbors (feature propagation). Similarly, when a node's interaction behavior changes, it affects the representation of her neighbors and herself, which in turn changes the structure and strength of her connections (communication → embedding → association). Thus, a node's evolving latent representation plays a mediator's role in driving the dynamics of the two observed processes. We call this phenomenon evolution through mediation.
To model this phenomenon, we propose DyRep, which considers the joint effect of the communication and association processes when updating node representations and simultaneously models the driving effect of the evolving representations on both these processes. To achieve this, we identify key challenges involved in the learning process and make several contributions addressing them:
Dynamic processes over graphs bring inherent uncertainty due to the possibility of encountering multiple new nodes and edges over time. To address this, we propose an inductive framework that learns a set of functions to compute node embeddings instead of being restricted to learning only individual representations. We then equip our framework with the mathematical benefits of temporal point processes to effectively model fine-grained temporal dynamics. Further, as the processes evolve at multiple time scales, we extend the point process model to include a time-scale-dependent parameter. To capture the highly complex and nonlinear dynamics governing the evolution of node representations, we use a deep recurrent architecture with a powerful structural encoder. A novel intensity-controlled attention mechanism is proposed to facilitate the mediation process. Finally, dynamic graphs have the potential to generate an enormous number of communication events, and hence we propose an efficient learning procedure that scales well in the number of events. Figure 1 provides an illustrative view of our framework. We show the effectiveness of our model through both quantitative and exploratory analysis against several representative baselines on two real-world dynamic graphs.
2 Preliminaries
We briefly describe the basic mathematical framework of temporal point processes, which we use to model the temporal dynamics of both association and communication events. We then specify the dynamic graph setting that exhibits both types of events.
2.1 Temporal Point Process
Stochastic point processes [17] are random processes whose realization comprises discrete events in time, $t_1, t_2, \ldots$. A temporal point process is one such stochastic process that can be equivalently represented as a counting process, $N(t)$, which records the number of events up to time $t$.
The common way to characterize temporal point processes is via the conditional intensity function $\lambda(t)$, a stochastic model of the rate at which events happen given the previous events. Formally, $\lambda(t)\,dt$ is the conditional probability of observing an event in the tiny window $[t, t + dt)$, $\lambda(t)\,dt := P[\text{event in } [t, t + dt) \mid \mathcal{T}(t)]$, where $\mathcal{T}(t) = \{t_k \mid t_k < t\}$ is the history until $t$. Similarly, for $t > t_n$ and given history $\mathcal{T} = \{t_1, \ldots, t_n\}$, we characterize the conditional probability that no event happens during $[t_n, t)$ as $S(t \mid \mathcal{T}) = \exp\left(-\int_{t_n}^{t} \lambda(\tau)\,d\tau\right)$, which is called the survival function of the process [18]. Moreover, the conditional density that an event occurs at time $t$ is defined as $f(t) = \lambda(t)\,S(t)$. The intensity is often designed to capture phenomena of interest; common forms include the Poisson process, Hawkes processes [19, 20, 21, 22] and the Self-Correcting process [23]. Temporal point processes have previously been used to model both dynamics on the network [24, 25, 26] and dynamics of the network [27, 16].
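The relation between intensity, survival function and conditional density can be checked numerically. The following is a minimal sketch, assuming a homogeneous Poisson process with an illustrative rate of 2 events per unit time; the function names and the midpoint-rule integration are choices made for this example, not part of the original method.

```python
import math

def survival(intensity, t_last, t, steps=10_000):
    """S(t) = exp(-integral of the intensity over [t_last, t]), via the midpoint rule."""
    width = (t - t_last) / steps
    integral = sum(intensity(t_last + (i + 0.5) * width) for i in range(steps)) * width
    return math.exp(-integral)

def density(intensity, t_last, t):
    """Conditional density f(t) = lambda(t) * S(t)."""
    return intensity(t) * survival(intensity, t_last, t)

# Illustrative assumption: constant intensity (homogeneous Poisson process).
RATE = 2.0

def constant_intensity(t):
    return RATE

s = survival(constant_intensity, 0.0, 1.5)   # probability of no event in [0, 1.5)
f = density(constant_intensity, 0.0, 1.5)    # density of the next event at t = 1.5
```

For a constant rate, the survival function reduces to the familiar exponential form, which makes this case a convenient sanity check before moving to learned intensities.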
2.2 Dynamic Graph Representation
Let $\mathcal{G}_{t_0} = (\mathcal{V}_{t_0}, \mathcal{E}_{t_0})$ be the initial snapshot of the graph at time $t_0$, where $\mathcal{V}_{t_0}$ is the set of nodes and $\mathcal{E}_{t_0}$ is the set of edges. Let and . For this work, we assume that the topology evolves only through growth; deletion of edges or nodes is left as future work.
Event Observation. Both communication and association processes are realized in the form of dyadic events observed on the graph over a temporal window $[t_0, T]$. We use the following canonical tuple representation for any type of event at time $t$: $e = (u, v, t, l, k)$,
where $u, v$ are the two nodes involved in the event and $t$ is the time of the event. $l \in \{0, 1\}$ represents the link status: $l = 1$ signifies an association edge and $l = 0$ signifies no association. $k \in \{0, 1\}$ represents the type of event: $k = 0$ when the current event is an association event and $k = 1$ when the current event is a communication event. One can use $k > 1$ to signify more types of communication events (e.g. in dynamic heterogeneous graphs). We then represent the complete set of $P$ observed events, ordered by time in the window $[t_0, T]$, as $\mathcal{O} = \{(u, v, t, l, k)_p\}_{p=1}^{P}$. Here, $t_p \in \mathbb{R}^{+}$, $0 \le t_p \le T$.
Node Representation. Let $z^v \in \mathbb{R}^d$ represent the $d$-dimensional representation of node $v$. As the representations evolve over time, we qualify the representation as a function of time and denote it as $z^v(t)$, which signifies the representation of node $v$ updated after an event involving $v$ at time $t$. We use $z^v(\bar{t})$ to denote the most recently updated embedding of node $v$ just before time $t$.
3 Proposed Method: DyRep
We propose an inductive representation learning framework that learns a set of functions capable of ingesting temporally evolving information about the nonlinear dynamics governing the changes in the topological structure of the graph and the interactions between its nodes. These functions generate and/or update node embeddings over time based on the ingested information. The key idea behind our approach is that when an event is observed between two nodes, information flows from the neighborhood of one node to the other and affects the embeddings of the nodes accordingly. While a communication event only propagates local information across two nodes, an association event changes the topology and thereby has a more global effect. To capture the mutual effects of changing graph topology and interaction between nodes while learning temporally evolving node representations, we design the following three functions:
Temporal Function: A multi-time-scale conditional intensity function that models the occurrence of events and is learned through a single-layer neural network over the most recent embeddings of the nodes involved.
Embedding Update Function: A deep recurrent function to update the embedding of nodes involved in an event based on its own previous embedding, aggregate information passed from the neighborhood of the other node and any exogenous effect.
Attentive Aggregate Function: An aggregator function to combine the information from the neighborhood of a node that serves as input to the aforementioned embedding update function. The information is aggregated by employing a novel intensity augmented attention scheme to attend to nodes according to past association or communication.
3.1 Temporal Function: Modeling Multi timescale Global Dynamics
The observations over a dynamic graph contain temporal point patterns of two interleaved complex processes, in the form of communication and association events respectively. As mentioned before, both these processes depend on the most recent node representations. To incorporate these processes in our model, we first formalize the notion of compatibility or closeness between representations. Given an observed event of type $k$ between nodes $u$ and $v$ at time $t$, we define a function $g_k$ over the most recently updated representations of the two nodes, $z^u(\bar{t})$ and $z^v(\bar{t})$, that computes the compatibility between the two representations as follows:
$$g_k^{u,v}(\bar{t}) = \omega_k^{\top} \, [z^u(\bar{t}); z^v(\bar{t})] \tag{1}$$
Here, $[\cdot\,;\cdot]$ denotes concatenation, $k$ is the type of event (communication vs. association), and hence $\omega_k \in \mathbb{R}^{2d}$ serves as the model parameter that learns a time-scale-specific compatibility score.
Using this notion of compatibility, we employ a continuous-time deep model of temporal point processes and use the function in (1) to parametrize the conditional intensity function that models the occurrence of an event of type $k$ between nodes $u$ and $v$ at time $t$:
$$\lambda_k^{u,v}(t) = f_k\left(g_k^{u,v}(\bar{t})\right) \tag{2}$$
The choice of $f_k$ needs to account for two critical criteria: 1) The definition of $g_k$ allows its value to be negative, but we require the intensity to be positive. 2) As mentioned before, the dynamics of the communication and association processes evolve at different time scales. We therefore use a modified version of the softplus function, parameterized by a dynamics parameter $\psi_k$, to capture this time-scale dependence:
$$f_k(x) = \psi_k \log\left(1 + \exp(x / \psi_k)\right) \tag{3}$$
where, in our case, $x = g_k^{u,v}(\bar{t})$ and $\psi_k > 0$ is a scalar time-scale parameter learned as part of training. In one-dimensional event sequences, the formulation in (3) corresponds to the nonlinear transfer function proposed in [28].
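The intensity computation can be sketched as follows. This is an illustration under stated assumptions: the modified softplus is taken to be `psi * log(1 + exp(x / psi))`, and the compatibility score is assumed to be a linear function of the concatenated embeddings. The names `scaled_softplus`, `intensity`, `omega` and `psi` are chosen for this sketch.

```python
import math

def scaled_softplus(x, psi):
    """psi * log(1 + exp(x / psi)), computed without overflow for large |x|."""
    y = x / psi
    # Stable identity: log(1 + e^y) = max(y, 0) + log1p(e^{-|y|})
    return psi * (max(y, 0.0) + math.log1p(math.exp(-abs(y))))

def intensity(z_u, z_v, omega, psi):
    """Positive event intensity from a compatibility score.

    Assumption for illustration: the score is a dot product between `omega`
    and the concatenation of the two node embeddings.
    """
    concatenated = list(z_u) + list(z_v)
    g = sum(w * z for w, z in zip(omega, concatenated))
    return scaled_softplus(g, psi)

# A negative compatibility score still yields a strictly positive intensity.
lam = intensity([1.0, 0.0], [0.0, 1.0], omega=[-2.0, 0.0, 0.0, -1.0], psi=0.5)
```

A larger `psi` flattens the response around zero, while a smaller `psi` makes the function approach `max(x, 0)`, which is one way to read the role of a per-event-type time-scale parameter.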
3.2 Embedding Update Function: Modeling Local Information Propagation Dynamics
In this section, we learn an embedding update function that models the nonlinear evolution of a given node's representation for both nodes involved in an observed event. Specifically, we propose that after a node has participated in an event, its representation needs to be updated to capture the effect of the observed event based on three principles: self-propagation, localized feature information propagation through the other node, and exogenous drive.
Self-Propagation. Self-propagation can be considered a foundational component of the dynamics governing an individual node's evolution. A node evolves in the embedded space with respect to its previous position (e.g. set of features) and does not evolve randomly.
Exogenous Drive. Some exogenous force may smoothly update the node’s current features during the time interval between two global events involving that node.
Localized Embedding Propagation. Two nodes involved in an event form a temporary (communication) or a permanent (association) pathway for information to propagate from the neighborhood of one node to the other node. This corresponds to the influence of nodes at second-order proximity passing through the other node participating in the event.
To realize the above processes in our setting, we first describe a simple setup: consider nodes $u$ and $v$ participating in any type of event at time $t$. Let $\mathcal{N}_u$ and $\mathcal{N}_v$ denote the neighborhoods of nodes $u$ and $v$ respectively.
We discuss two key points here: 1) Node $u$ serves as a mediator passing information from $\mathcal{N}_u$ to node $v$, and hence $v$ receives the information in an aggregated form through $u$. 2) While each neighbor of $u$ wants to pass its information to $v$, the information that node $v$ captures is governed by an aggregate function parametrized by $u$'s communication and association history with its neighbors.
With this setup, for any event at time $t$, we update the embeddings of both nodes involved in the event using a recurrent architecture. Specifically, for the $p$-th event of node $v$, we evolve its representation $z^v$ with the following update function:
$$z^v(t_p) = \sigma\left( W^{struct}\, h_{struct}^{u}(\bar{t}_p) + W^{rec}\, z^v(\bar{t}_p^{v}) + W^{t}\, (t_p - \bar{t}_p^{v}) \right) \tag{4}$$
where $h_{struct}^{u} \in \mathbb{R}^d$ is the output representation vector obtained from the aggregator function on node $u$'s neighborhood and $z^v(\bar{t}_p^{v})$ is the recurrent state obtained from the previous representation of node $v$. $t_p - \bar{t}_p^{v}$ is the scalar difference between the current time point and the previous time point $\bar{t}_p^{v}$ at which node $v$ was involved in an event. $W^{struct}$, $W^{rec}$ and $W^{t}$ are model parameters that govern the aggregate effect of the three processes respectively, and $\sigma(\cdot)$ is a nonlinear function. Eq. (4) is used to make updates for both nodes involved in the event, with all three components specific to the node.
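The three-term update can be sketched in a few lines. The following is a hedged illustration, not the authors' implementation: it assumes a sigmoid nonlinearity, plain Python lists for vectors and matrices, and the hypothetical names `update_embedding`, `W_struct`, `W_rec` and `w_t`.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def update_embedding(z_prev, h_struct, dt, W_struct, W_rec, w_t):
    """One recurrent update of a node embedding after an event.

    z_prev   : the node's previous embedding (self-propagation input)
    h_struct : aggregated neighborhood vector of the *other* node
    dt       : time elapsed since this node's previous event
    The sigmoid is an assumption; any smooth nonlinearity could be used.
    """
    propagated = matvec(W_struct, h_struct)   # localized embedding propagation
    recurrent = matvec(W_rec, z_prev)         # self-propagation
    drive = [w * dt for w in w_t]             # exogenous drive over elapsed time
    return [1.0 / (1.0 + math.exp(-(a + b + c)))
            for a, b, c in zip(propagated, recurrent, drive)]

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
z_new = update_embedding([0.0, 2.0], [1.0, 1.0], dt=0.5,
                         W_struct=zeros, W_rec=identity, w_t=[0.0, 0.0])
```

In an event between two nodes, this function would be called twice, once per node, each time with the other node's aggregated neighborhood vector.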
3.3 Attentive Aggregate function: Modeling Mesoscopic Influence Dynamics
In a dynamic graph, any event would not only affect the nodes involved in that event but also the local neighbors of those nodes. Aligned with our motivation, any event would impact the embeddings of the nodes involved in those events and the updated embeddings would further induce more events across graph. To capture this mutual effect, we want to design a function that can capture the structural information local to a node based on historical events on its neighbor nodes. To this effect, we propose a novel Intensity based Attentional Mechanism that empowers the aggregate function to attend to the neighbors based on node’s communication and association history.
Intensity based Attentional Mechanism. Let $A(t) \in \mathbb{R}^{n \times n}$ be the adjacency matrix of the graph at time $t$. Let $S(t) \in \mathbb{R}^{n \times n}$ be a stochastic matrix capturing the association strength between pairs of vertices at time $t$. One can consider $S$ as a selection matrix that induces a natural selection process for a node: it would tend to communicate more with other nodes that it wants to associate with or has recently associated with. On the other hand, it would want to attend less to non-interesting nodes. While we discuss the construction and update of $S$ in the next section, the following implication is required for the construction of $h_{struct}^{u}$ in (4): for any two nodes $u$ and $v$ at time $t$, $S_{uv}(t) \in [0, 1]$ if $A_{uv}(t) = 1$ and $S_{uv}(t) = 0$ if $A_{uv}(t) = 0$. Denote by $\mathcal{N}_u(t) = \{i : A_{iu}(t) = 1\}$ the 1-hop neighborhood of node $u$ at time $t$.
To formally capture the difference in the influence of different neighbors, we design a novel intensity based attention layer that uses the matrix $S$ to induce a shared attention mechanism to compute normalized attention coefficients for nodes. Specifically, we inject the graph structure by performing localized attention: for a node $u$, we compute the coefficients pertaining to the 1-hop neighbors $i$ of node $u$ as follows:
$$q_{ui}(t) = \frac{\exp\left(S_{ui}(\bar{t})\right)}{\sum_{i' \in \mathcal{N}_u(t)} \exp\left(S_{ui'}(\bar{t})\right)} \tag{5}$$
where $q_{ui}$ signifies the attention weight for neighbor $i$ at time $t$ and hence is a temporally evolving quantity. These attention coefficients are then used to compute the aggregate information $h_{struct}^{u}(\bar{t})$ for node $u$ by employing an attended aggregation mechanism across neighbors as follows:
$$h_{struct}^{u}(\bar{t}) = \max\left(\left\{ \sigma\left(q_{ui}(t) \cdot h^{i}(\bar{t})\right), \ \forall i \in \mathcal{N}_u(\bar{t}) \right\}\right), \qquad h^{i}(\bar{t}) = W^{h} z^{i}(\bar{t}) + b^{h}$$
where $W^{h}$ and $b^{h}$ are parameters governing the information propagated by each neighbor of $u$, and $z^{i}(\bar{t})$ is the most recent embedding of node $i$.
The attention mechanism plays two simultaneous roles: on one hand, it captures the contribution of each neighbor based on the intensity of events between them. On the other hand, it can also be seen as inducing competition among neighbors to gain more of the node's attention.
Overall, this component captures information from the structure of the graph and associates it with the temporal dynamics of the evolving representations. The intuition is that a node is highly influenced by its neighbors. While the effect may be propagated from far-away nodes, it still needs to propagate through immediate neighbors. Most previous approaches either observe peer influence explicitly or use a fixed-length proximity measure to capture local information (as in many static graph embedding approaches). In contrast, evolving representations are expected to capture both temporal and structural dependencies through their second-order proximity neighborhood and thereby account for hidden propagation across the network.
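The attended aggregation described above can be sketched as follows. This is a minimal illustration under assumptions: attention coefficients are taken to be a softmax over a node's association strengths to its 1-hop neighbors, and the aggregate is taken to be an element-wise max over sigmoid-squashed, attention-weighted neighbor features. The function names and the dictionary-based layout are hypothetical.

```python
import math

def attention_coefficients(strengths):
    """Softmax over a node's association strengths to its 1-hop neighbors.

    `strengths` maps neighbor id -> association strength.
    """
    peak = max(strengths.values())  # subtract the max for numerical stability
    exps = {i: math.exp(s - peak) for i, s in strengths.items()}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def aggregate(strengths, neighbor_features):
    """Element-wise max over attention-weighted, sigmoid-squashed neighbor features.

    `neighbor_features` maps neighbor id -> feature vector (assumed already
    passed through a learned linear layer).
    """
    q = attention_coefficients(strengths)
    dim = len(next(iter(neighbor_features.values())))
    weighted = [[1.0 / (1.0 + math.exp(-q[i] * x)) for x in neighbor_features[i]]
                for i in q]
    return [max(vec[d] for vec in weighted) for d in range(dim)]

q = attention_coefficients({1: 0.5, 2: 0.5})
h = aggregate({1: 0.5, 2: 0.5}, {1: [0.0, 2.0], 2: [4.0, 0.0]})
```

Because the coefficients sum to one over the neighborhood, raising the strength of one neighbor necessarily lowers the weight of the others, which is the competition effect mentioned in the text.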
Construction and Update of $S$. As mentioned earlier, while modeling the global and local principles governing the evolutionary processes, the key to effectively learning the overall dynamics of the system is an intermediate level of information propagation that connects the two vastly different scales of events. We construct a single stochastic matrix $S$ (used to parameterize attention in the earlier section) to capture this intermediate information. We note that updates to the adjacency matrix $A$ over time and activities between pairs of nodes directly affect the node representations, which in turn impact the values in $S$. Conversely, updates made to $S$ contribute to further activities between nodes and to eventual updates to $A$ through their effect on node representations.
As before, $A(t)$ and $S(t)$ are the adjacency matrix and a right-stochastic matrix at time $t$. At the initial time point $t_0$, we construct $S(t_0)$ directly from $A(t_0)$. Specifically, for a given node $v$, we initialize the elements of the corresponding row vector $S_v(t_0)$ as:
$$S_{vu}(t_0) = \begin{cases} 0 & \text{if } A_{vu}(t_0) = 0 \\ \dfrac{1}{|\mathcal{N}_v(t_0)|} & \text{otherwise} \end{cases} \tag{6}$$
After observing an event $e = (u, v, t, l, k)$ at time $t$, we make updates to $A$ and $S$ as per the observed values of $k$ and $l$. Algorithm 1 describes the update scenarios.
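The initialization of the right-stochastic matrix from the adjacency matrix can be sketched as below. This covers only the initial construction, assuming a uniform split of each node's mass over its neighbors; the event-driven updates of Algorithm 1 are not reproduced here, and the function name is hypothetical.

```python
def init_association_strength(adjacency):
    """Build the initial strength matrix from a 0/1 adjacency matrix.

    Each row gets 1/|N_v| on the columns of its neighbors and 0 elsewhere,
    so every row with at least one neighbor sums to 1. Rows of isolated
    nodes stay all-zero.
    """
    strength = []
    for row in adjacency:
        degree = sum(1 for a in row if a)
        strength.append([1.0 / degree if a else 0.0 for a in row])
    return strength

A0 = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
S0 = init_association_strength(A0)
```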
4 Efficient Learning Procedure
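The complete parameter space for the current model is $\Omega = \{W^{struct}, W^{rec}, W^{t}, W^{h}, b^{h}, \{\omega_k\}_{k=0,1}, \{\psi_k\}_{k=0,1}\}$, i.e. the weights of the embedding update and aggregation functions together with the compatibility and time-scale parameters of the intensity function. For a set $\mathcal{O}$ of $P$ observed events, we learn these parameters by minimizing the negative log-likelihood defined through the intensity function in (2):
$$\mathcal{L} = -\sum_{p=1}^{P} \log\left(\lambda_p(t)\right) + \int_{0}^{T} \Lambda(\tau)\, d\tau \tag{7}$$
where $\lambda_p(t) = \lambda_{k_p}^{u_p, v_p}(t_p)$ represents the intensity of the $p$-th event at time $t_p$ and $\Lambda(\tau) = \sum_{u=1}^{n} \sum_{v=1}^{n} \sum_{k \in \{0, 1\}} \lambda_k^{u,v}(\tau)$, summed over all $n$ nodes, represents the total survival probability for events that do not happen.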
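The structure of this objective, a log-intensity term for observed events plus an integrated-intensity term for non-events, can be sketched as follows. This is an illustration only: it assumes the total intensity is treated as piecewise constant over the gaps between events, which is a simplification introduced here and not the sampling procedure of Algorithm 2. All names are hypothetical.

```python
import math

def negative_log_likelihood(event_intensities, total_intensities, interval_lengths):
    """Point-process negative log-likelihood with a piecewise-constant survival term.

    event_intensities : intensity of each observed event at its own timestamp
    total_intensities : total (summed) intensity assumed constant on each interval
    interval_lengths  : length of each of those intervals
    """
    log_term = sum(math.log(lam) for lam in event_intensities)
    survival_term = sum(total * length
                        for total, length in zip(total_intensities, interval_lengths))
    return -log_term + survival_term

loss = negative_log_likelihood(
    event_intensities=[1.0, math.e],
    total_intensities=[2.0, 0.5],
    interval_lengths=[1.0, 2.0],
)
```

The first term rewards high intensity where events actually happened, while the second penalizes high intensity everywhere else, so lowering the loss requires the model to be selective rather than uniformly confident.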
While it is intractable (it would require $O(n^2 k)$ time for $n$ nodes and $k$ event types) and unnecessary to compute the integral in Eq. (7) for all possible non-events in a stochastic setting, we can locally optimize the objective function using mini-batch stochastic gradient descent, where we estimate the integral using a novel sampling technique. Algorithm 2 provides the recipe to compute the survival term in Eq. (7). Let $m$ be the mini-batch size and $N$ be the number of samples. The complexity of Algorithm 2 is then $O(2mkN)$ for the batch, where the factor of 2 accounts for the update happening for two nodes per event. Figure 2 shows the running time for training when the number of events is in . It demonstrates linear scalability in the number of events, which is desired to tackle web-scale dynamic networks [29].
5 Applications
Here, we show that our framework unifies various applications. Specifically we model dynamical processes over graph structured data using the DyRep framework and use the intensity based inference mechanism to predict both structural and behavioral evolution over time in the form of predicting association/communication links. We also support time prediction for the next event to occur between any two nodes.
5.1 Dynamic Link Prediction
When any two nodes in a graph have an increased rate of interaction events, they are more likely to get involved in further interactions, and eventually these interactions may lead to the formation of a structural link between them. Similarly, the formation of a structural link may lead to an increased likelihood of interactions between the newly connected nodes. To understand how well our model captures these phenomena, we ask questions like: 1) Which is the most likely node $v$ to undergo an event of type $k$ with a given node $u$ at time $t$? or 2) Given two nodes $u$ and $v$ at time $t$, which is the most likely event type $k$ they will participate in? For a pair of nodes $(u, v)$ and event type $k$ with conditional intensity $\lambda_k^{u,v}$, the conditional density at time $t$ can be computed as:
$$f_k^{u,v}(t) = \lambda_k^{u,v}(t) \cdot \exp\left(-\int_{\bar{t}}^{t} \lambda_k^{u,v}(s)\, ds\right) \tag{8}$$
where $\bar{t}$ is the time of the most recent event on either dimension $u$ or $v$. We directly use this conditional density to make most-likely node and event-type predictions.
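Ranking candidates by conditional density can be sketched as follows. The example assumes, for illustration only, that each pair's intensity is held constant since the most recent event, so the density has the closed form `lam * exp(-lam * (t - t_bar))`. The function names and candidate labels are hypothetical.

```python
import math

def conditional_density(lam, t, t_bar):
    """Density of the next event at time t for an intensity held constant since t_bar."""
    return lam * math.exp(-lam * (t - t_bar))

def rank_candidates(intensities, t, t_bar):
    """Sort candidate nodes by descending conditional density at time t.

    `intensities` maps candidate id -> intensity of an event with the query node.
    """
    return sorted(intensities,
                  key=lambda v: conditional_density(intensities[v], t, t_bar),
                  reverse=True)

ranking = rank_candidates({"a": 0.1, "b": 1.0, "c": 5.0}, t=1.0, t_bar=0.0)
```

Note that the candidate with the highest intensity is not necessarily ranked first: under a very high intensity the event would most likely have happened already, so its density at a later time is small.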
5.2 Event Time Prediction
This is a relatively novel application where the aim is to compute the next time point at which a particular type of event (structural or interaction) can occur. Given a pair of nodes $(u, v)$ and an event type $k$ at time $t$, we use Eq. (8) to compute the conditional density $f_k^{u,v}$. The next time point $\hat{t}$ for the event can then be computed as $\hat{t} = \int_{t}^{\infty} s\, f_k^{u,v}(s)\, ds$, where the integral does not have an analytic form and hence we estimate it using Monte Carlo sampling.
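A Monte Carlo estimate of the expected next event time can be sketched as follows. The example assumes a constant intensity so that waiting times are exponential and the answer is known in closed form (`t_now + 1 / lam`), which makes the estimate easy to check. With a learned, time-varying intensity the sampling step would differ; the function name and defaults are hypothetical.

```python
import math
import random

def expected_next_time(lam, t_now, num_samples=100_000, seed=0):
    """Estimate the expected time of the next event by averaging sampled waiting times.

    Waiting times are drawn by inverse-transform sampling from the exponential
    distribution implied by a constant intensity `lam`.
    """
    rng = random.Random(seed)
    waits = (-math.log(1.0 - rng.random()) / lam for _ in range(num_samples))
    return t_now + sum(waits) / num_samples

estimate = expected_next_time(lam=2.0, t_now=10.0)
```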
6 Experiments
6.1 Datasets
We evaluate DyRep and the baselines on two real-world datasets: the Social Evolution Dataset released by the MIT Human Dynamics Lab and the Github Dataset available at Github Archive. Table 1 provides statistics for the final datasets used in the experiments. These datasets cover a range of configurations: the Social Evolution dataset is a small network with a high clustering coefficient and over 2M events, whereas the Github dataset forms a large network with a low clustering coefficient and sparse events, allowing us to test the robustness of our model.
Table 1: Dataset statistics.
| Dataset | #Nodes | #Initial Associations | #Final Associations | #Communications | Clustering Coefficient |
|---|---|---|---|---|---|
| Social Evolution | 100 | 407 | 809 | 2020554 | 0.548 |
| Github | 12328 | 70640 | 166565 | 604954 | 0.087 |
6.2 Baselines
For the Dynamic Link Prediction task, we compare the performance of our model against multiple representation learning baselines, three of which have the capability to model evolving graphs. Specifically, we compare with KnowEvolve [30], DynGem [31], GraphSage [32] and Node2Vec [6]. Table 2 provides qualitative comparisons between all methods. Below we describe each of them in detail:

KnowEvolve [30]: This is the state-of-the-art model for multi-relational dynamic graphs where each edge has a timestamp and a type (communication events). It models the occurrence of an edge as a multivariate point process whose intensity function is modulated by the score for that edge, computed from the learned entity embeddings. The temporally evolving entity embeddings are learned via a recurrent architecture. It does not have a time-scale-specific parameter in its intensity function, and it does not model graph structure or association links.

DynGem [31]: It divides the timeline into discrete time points and learns embeddings for the graph snapshots at these time points. Specifically, it employs an autoencoder model and learns embeddings in a warm-start manner, using the embeddings learned from the previous snapshot to initialize the training of the current snapshot. To support a growing set of nodes, it proposes a heuristic quantity, PropSize, to dynamically determine the number of hidden units required for each snapshot. It does not model time explicitly.

GraphSage [32]: While not explicitly designed for dynamic or temporally evolving graphs, this is an inductive representation learning method that learns sample and aggregation functions to compute representations instead of training for individual nodes. This makes it a powerful baseline for modeling association links in our setup, as it inherently supports new nodes and edges due to its inductive capability.

Node2Vec [6]: This is a simple baseline to learn graph embeddings over static graphs with a smart random walk based approach to select the neighborhood to learn individual node embeddings. We compare with this baseline to include a purely static and transductive baseline as part of our evaluation.
For Event Time Prediction, we compare our model against KnowEvolve, described above, which also has the capability to predict time in multi-relational dynamic graphs. Further, we compare with a purely temporal model where all events in the graph are considered dyadic and modeled as a Multidimensional Hawkes Process (MHP) [33].
Table 2: Qualitative comparison of methods.
| Key Properties | DyRep (Our Method) | KnowEvolve | DynGem | Static Methods (e.g. GraphSage) |
|---|---|---|---|---|
| Models Association | ✓ | X | ✓ | ✓ |
| Models Communication | ✓ | ✓ | X | X |
| Models Time | ✓ | ✓ | X | X |
| Learns Representation | ✓ | ✓ | ✓ | ✓ |
| Graph Information | Attended Non-Uniform 2nd-order Neighborhood | Single Edge | Complete 1st and 2nd-order Neighborhood | Uniformly Sampled 2nd-order Neighborhood |
| Predicts Time | ✓ | ✓ | X | X |
6.3 Evaluation Scheme
We divide our test sets into slots based on time and report the performance for each time slot, thus providing a comprehensive temporal evaluation of the different methods. This method of reporting is expected to provide fine-grained insights into how the various methods perform over time as they move farther from the learned training history.
For the baselines that do not explicitly model time (DynGem, GraphSage and Node2Vec), we adopt a sliding-window training approach with warm start: we learn on the initial train set and test on the first slot. We then add the data from the first slot to the train set, remove an equal amount of data from the start of the train set, and retrain the model starting from the embeddings of the previous training run. This gives the baselines an opportunity to learn from test data, since our model also updates the node embeddings during test (while its model parameters are frozen after training).
Dynamic Link Prediction. For a given test record $(u, v, t, k)$, we replace $v$ with other entities in the graph and compute the density in (8). We then rank all the entities in descending order of density and report the rank of the ground-truth entity. Please note that the latest embeddings of the nodes are updated even during test, while the parameters of the model remain fixed. Hence, when ranking the entities, we remove any entity that creates a pair already seen in the test.
We report Mean Average Rank and HITS(@10) metric for dynamic link prediction.
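The two ranking metrics can be computed from the per-record rank of the ground-truth entity. The sketch below uses hypothetical helper names and assumes 1-based ranks, with a lower rank being better.

```python
def rank_of_truth(scores, truth):
    """1-based rank of `truth` when entities are sorted by descending score."""
    return 1 + sum(1 for entity, s in scores.items()
                   if entity != truth and s > scores[truth])

def mean_average_rank(ranks):
    """Average rank of the ground-truth entity over all test records (lower is better)."""
    return sum(ranks) / len(ranks)

def hits_at_k(ranks, k=10):
    """Fraction of test records whose ground-truth entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

ranks = [1, 3, 12, 40]
mar = mean_average_rank(ranks)
hits10 = hits_at_k(ranks, k=10)
```

The two metrics are complementary: the mean rank is sensitive to a few very poorly ranked records, while HITS@10 only measures how often the correct entity lands near the top.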
Time Prediction. For a given test record $(u, v, t, k)$, we predict the next time at which this communication event may occur. We compute the next time point using the scheme described in Section 5.2 and report the mean absolute error (MAE) against the ground truth.
6.4 Predictive Results
Figure 3 panels: (a) MAR (Communication), (b) HITS@10 (Communication), (c) MAR (Association), (d) HITS@10 (Association).
Communication Event Prediction Performance. We first considered the task of predicting communication events between nodes. Please note that such nodes may or may not have a permanent edge (association) between them. The experimental results for this task are reported in columns (a) and (b) of Figure 3.
For the Social Evolution dataset, our method significantly and consistently outperforms all the baselines on both metrics. While the performance of our method drops a little over time, this is expected due to the temporal recency effect on a node's evolution. The overall performance of our method is nevertheless very stable across all time points, with negligible deviation. KnowEvolve can capture event dynamics well and shows consistently better rank than the other baselines, but it is interesting that its performance deteriorates significantly on the HITS@10 metric over time. We conjecture that features learned through edge-level modeling limit the predictive capacity of the method over time. Another interesting observation is the inability of DynGem (snapshot-based dynamic) and GraphSage (inductive) to significantly outperform Node2Vec (transductive static baseline). We believe that discrete-time snapshot-based models fail to capture the fine-grained dynamics of communication events well. Further, while there are many recurrent and new events where time is critical, there is no topological evolution in this graph, which seems to offset GraphSage's inductive ability.
For the Github dataset, we demonstrate comparable performance with both KnowEvolve and GraphSage on the rank metric. We would like to note that the overall performance of all methods on the rank metric is low. As reported earlier, the Github dataset has far fewer events relative to the number of nodes and also has a very low clustering coefficient, which makes it a very challenging dataset to learn from. It is expected that for a large number of nodes with no communication history, most methods will show comparable performance, but our method outperforms all others when some history is available. This is demonstrated by our significantly better performance on the HITS@10 metric, where we are able to make highly accurate predictions for nodes for which we learn a better history. This can also be attributed to our model's ability to capture the effect of evolving topology, which is missed by KnowEvolve. Finally, we do not see a significant decrease in the performance of any method over time in this case, which can again be attributed to the roughly uniform distribution of nodes with no communication history across time slots.
Association Event Prediction Performance. As association events do not occur in all time slots, we instead report aggregate numbers in columns (c) and (d) of Figure 3 for this task. On both datasets and under both metrics, our model significantly outperforms the baselines. Specifically, our model's strong performance on the HITS@10 metric across both datasets demonstrates its robustness in learning accurately from data with different properties. On the Social Evolution dataset, the number of association events is very small (only 485), and hence our strong performance on that dataset shows that the model is able to capture the influence of communication events on association events through the learned representations. On the Github dataset, there are many association events, and in many instances the network grows through new nodes. Our model's strong performance across both metrics demonstrates its inductive ability to generalize to new nodes over time.
Time Prediction Performance. Figure 4 shows that our model performs consistently better than the state-of-the-art baseline for event time prediction on both datasets. While KnowEvolve models both processes as two different types of relations between entities, it does not explicitly capture the difference in the time scales of the two processes, which may explain the better performance of our model. Further, KnowEvolve does not consider the influence of the neighborhood, which may lead to capturing weaker temporal dynamics across the graph. MHP uses a specific parametric intensity function where each node pair is modeled as an independent dimension, which fails to account for intricate dependencies across the graph.
Figure 4 panels: (a) Social Evolution, (b) Github, (c) Social Evolution, (d) Github.
6.5 Qualitative Results
t-SNE Visualization. We conducted a series of qualitative analyses to understand the discriminative power of the evolving embeddings learned by DyRep and to see the effect of temporal evolution, the association process and the communication process on the learned embeddings. For this, we compare our embeddings against the embeddings learned by GraphSage, as it is a state-of-the-art static embedding method that is also inductive. Figure 5 shows the t-SNE embeddings learned by DyRep (left) and GraphSage (right) respectively. We used the sklearn.manifold.TSNE library to plot this figure with , , , , , and ran it for 40,000 iterations. The visualization demonstrates that DyRep embeddings have more discriminative power, as they can effectively capture the distinctive and evolving structural features over time, in line with the empirical evidence.
(a) DyRep Embeddings  (b) GraphSage Embeddings 
We assess the quality of the learned embeddings and the ability of the model to capture both temporal and structural information. Let be the time point when training ended. Let be the time point when the first test slot ends.
Effect of Association and Communication on Embeddings. We conducted this experiment on the Social Evolution dataset. We consider three use cases to demonstrate how the interactions and associations between nodes changed their representations, and we visualize them to show the effect.

Use Case I: Nodes that did not have an association before test but got linked during the first test slot. Nodes 46 and 76 became associated between test points 0 and 1. This reduced the cosine distance in both models, but DyRep shows a more prominent effect of this association, as should be the case. DyRep reduces the cosine distance from 1.231 to 0.005. Also, the DyRep embeddings for these two nodes initially belong to different clusters but later converge to the same cluster. In GraphSage, the cosine distance reduces from 1.011 to 0.199 and the embeddings remain in their original clusters. Figure 6 shows the visualization of the embeddings at the two time points for both methods. This demonstrates that our embeddings can capture association events effectively.
(a) Train End Time (b) Test Slot 1 End Time. Figure 6: Use Case I. Top row: GraphSage Embeddings. Bottom Row: DyRep Embeddings. 
Use Case II: Nodes that did not have an association but had many communication events (114000). Nodes 27 and 70 are such a case. DyRep embeddings place each node among the top 5 nearest neighbors of the other, in the same cluster, with a cosine distance of 0.005, which is aligned with the fact that nodes with a large number of events tend to develop similar features over time. GraphSage, on the other hand, considers them 32nd nearest neighbors and puts them in different clusters, with a cosine distance of 0.792. Figure 7 shows the visualization of the embeddings at the two time points for both methods. This demonstrates that our embeddings can effectively capture communication events and their temporal effect on the embeddings.
(a) Train End Time (b) Test Slot 1 End Time. Figure 7: Use Case II. Top row: GraphSage Embeddings. Bottom Row: DyRep Embeddings. 
Use Case III: Nodes that have an association but few events. Nodes 19 and 26 remain associated with each other throughout the data. DyRep embeddings keep the nodes nearby, although not in the same cluster, with a cosine distance of 0.649, which demonstrates its ability to learn the dynamics of association with little communication between two nodes. For GraphSage, the embeddings are on opposite ends of the cluster, with a cosine distance of 1.964. Figure 8 shows the visualization of the embeddings at the two time points for both methods.
(a) DyRep Embeddings (b) GraphSage Embeddings. Figure 8: Embedding Use Case III 
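Some of the cosine distances quoted in the use cases above exceed 1 (e.g. 1.231 and 1.964), which is consistent with the convention distance = 1 - cosine similarity, whose range is [0, 2]. The sketch below assumes that convention and also shows one way a nearest-neighbor rank can be computed. The toy embeddings and function names are hypothetical and are not the learned embeddings.

```python
import math
from typing import Dict, List


def cosine_distance(u: List[float], v: List[float]) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def neighbor_rank(embeddings: Dict[str, List[float]], query: str, target: str) -> int:
    """1-based rank of target among the neighbors of query by ascending distance."""
    d_target = cosine_distance(embeddings[query], embeddings[target])
    closer = sum(
        1 for node, vec in embeddings.items()
        if node not in (query, target)
        and cosine_distance(embeddings[query], vec) < d_target
    )
    return 1 + closer


# Hypothetical 2-d embeddings chosen so that the distances are easy to verify.
emb = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],   # orthogonal to a: distance 1
    "c": [-1.0, 0.0],  # opposite to a: distance 2
    "d": [2.0, 0.0],   # same direction as a: distance 0
}
print(cosine_distance(emb["a"], emb["c"]), neighbor_rank(emb, "a", "c"))
```

Under this convention, a distance near 0 (as for nodes 27 and 70 in DyRep) indicates nearly parallel embeddings, while a value near 2 (as for nodes 19 and 26 in GraphSage) indicates embeddings that point in nearly opposite directions.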
Temporal evolution of DyRep embeddings. Here we visualize the embedding positions of the nodes (tracked in red) as they evolve through time, forming and breaking away from clusters.
Figure 9: Use Case IV: DyRep Embeddings over time  From left to right and top to bottom. are the timepoints when test with that id ended. Hence, means the time when test slot 1 finished.
7 Related Work
Static Embedding Approaches. Representation learning over graph structured data has gained significant attention in recent years, specifically with the advent of deep learning, which provides many sophisticated techniques to capture various properties of graphs. Due to the availability of data and the many open research directions, most work in this area has focused on static graphs. Such approaches can be broadly classified into two categories. Node embedding approaches aim to encode structural information pertaining to a node to produce its low-dimensional representation [5, 6, 7, 8, 9, 10, 11]. As they learn each individual node's representation, they are inherently transductive. Recently, [32] proposed GraphSage, an inductive method for learning functions to compute node representations that can be generalized to unseen nodes. Subgraph embedding techniques learn to encode higher-order graph structures into low-dimensional vector representations [34, 35, 36]. Further, various approaches that use convolutional neural networks [37, 38, 39] over graphs have been proposed to capture sophisticated feature information, but they are generally less scalable. Most of these approaches work only with static graphs or can model evolving graphs only without the temporal component.
Dynamic Embedding Approaches. Preliminary approaches in dynamic representation learning have considered the two processes of communication and association in a segregated manner. [31] uses a warm-start method to train across snapshots and employs a heuristic approach to learn stable embeddings over time, but does not model time. Recently, KnowEvolve [30] proposed a deep recurrent architecture to model multi-relational timestamped edges, which addresses the communication process. This approach can be seen as addressing one part of the DyRep loop that we presented in Figure 1, and the information captured by its node embeddings depends only on edge-level information. DANE [40] proposes a network embedding method for dynamic environments, but its dynamics consist of changes in node attributes over time, so that work can be considered orthogonal to our approach. Research on learning dynamic embeddings has also progressed in the language community, where the aim is to learn temporally evolving word embeddings [41, 42]. Other approaches [43, 44] propose models for learning dynamic embeddings in graph data, but none of them considers time at a finer level or captures both topological evolution and interactions.
Deep Temporal Point Process Models. Recently, [45] has shown that a fixed parametric form of point processes leads to model misspecification issues, ultimately affecting performance on real-world datasets. [45] therefore proposes a data-driven alternative that instead learns the conditional intensity function from the observed events, thereby increasing its flexibility. Following that work, there has been increased interest in learning the conditional intensity function using deep learning [28], as well as in intensity-free approaches using GANs [46] for learning deep generative temporal point process models.
8 Conclusion
We have presented a novel deep representation learning framework that can effectively and efficiently learn to compute node representations for dynamic graphs. We identified key challenges in such settings and devised an inductive approach to address them. We propose that node representations serve as a mediator that governs the complex and nonlinearly evolving processes of communication and association over dynamic graphs. Our framework learns a set of functions that capture the evolving dynamics of these processes and produce embeddings rich in temporal and structural information. Our superior predictive and qualitative evaluation performance demonstrates the effectiveness of our approach, and we hope that this contribution will open up a wide range of application domains and exciting research directions in the area of representation learning over dynamic graph structured data. Interesting future directions would be to extend DyRep to support network shrinkage settings (i.e. node and edge deletions) and to encode higher-order dynamic structures.
References

 [1] Leman Akoglu, Hanghang Tong, and Danai Koutra. Graph based anomaly detection and description: a survey. Data Mining and Knowledge Discovery, 29(3):626–688, 2015.
 [2] Lise Getoor and Ben Taskar. Introduction to Statistical Relational Learning (Adaptive Computation and Machine Learning). 2007.
 [3] David Liben-Nowell and Jon Kleinberg. The link prediction problem for social networks. In CIKM, 2003.
 [4] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 2008.
 [5] Shaosheng Cao, Wei Lu, and Qiongkai Xu. Grarep: Learning graph representations with global structural information. In CIKM, 2015.
 [6] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In KDD, 2016.
 [7] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In KDD, 2014.
 [8] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In WWW, 2015.
 [9] Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In KDD, 2016.
 [10] Xiao Wang, Peng Cui, Jing Wang, Jian Pei, Wenwu Zhu, and Shiqiang Yang. Community preserving network embedding. In AAAI, 2017.
 [11] Linchuan Xu, Xiaokai Wei, Jiannong Cao, and Philip S. Yu. Embedding identity and interest for social networks. In WWW, 2017.
 [12] William L. Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. arXiv:1709.05584, 2017.
 [13] Damien Farine. The dynamics of transmission and the dynamics of networks. Journal of Animal Ecology, 86(3):415–418, 2017.
 [14] Oriol Artime, Jose J. Ramasco, and Maxi San Miguel. Dynamics on networks: competition of temporal and topological correlations. arXiv:1604.04155, 2017.
 [15] Bernard Chazelle. Natural algorithms and influence systems. Communications of the ACM, 2012.
 [16] Mehrdad Farajtabar, Yichen Wang, Manuel Gomez-Rodriguez, Shuang Li, Hongyuan Zha, and Le Song. Coevolve: A joint point process model for information diffusion and network coevolution. In NIPS, 2015.
 [17] DJ Daley and D Vere-Jones. An Introduction to the Theory of Point Processes: Volume I: Elementary Theory and Methods. 2007.
 [18] Odd Aalen, Ornulf Borgan, and Hakon Gjessing. Survival and event history analysis: a process point of view. 2008.
 [19] Mehrdad Farajtabar, Nan Du, Manuel Gomez Rodriguez, Isabel Valera, Hongyuan Zha, and Le Song. Shaping social activity by incentivizing users. In NIPS, 2014.
 [20] Alan G Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.
 [21] Yichen Wang, Bo Xie, Nan Du, and Le Song. Isotonic hawkes processes. In ICML, 2016.
 [22] Behzad Tabibian, Isabel Valera, Mehrdad Farajtabar, Le Song, Bernhard Schölkopf, and Manuel Gomez-Rodriguez. Distilling information reliability and source trustworthiness from digital traces. In WWW, 2017.
 [23] V. Isham and M. Westcott. A self-correcting point process. Advances in Applied Probability, 37:629–646, 1979.
 [24] Mehrdad Farajtabar, Xiaojing Ye, Sahar Harati, Le Song, and Hongyuan Zha. Multistage campaigning in social networks. In NIPS, 2016.
 [25] Ali Zarezade, Ali Khodadadi, Mehrdad Farajtabar, Hamid R Rabiee, and Hongyuan Zha. Correlated cascades: Compete or cooperate. In AAAI, 2017.
 [26] Mehrdad Farajtabar, Jiachen Yang, Xiaojing Ye, Huan Xu, Rakshit Trivedi, Elias Khalil, Shuang Li, Le Song, and Hongyuan Zha. Fake news mitigation via point process based intervention. In ICML, 2017.
 [27] Long Tran, Mehrdad Farajtabar, Le Song, and Hongyuan Zha. Netcodec: Community detection from individual activities. In SDM, 2015.
 [28] Hongyuan Mei and Jason M Eisner. The neural hawkes process: A neurally self-modulating multivariate point process. In NIPS, 2017.
 [29] Ashwin Paranjape, Austin R. Benson, and Jure Leskovec. Motifs in temporal networks. In WSDM, 2017.

 [30] Rakshit Trivedi, Hanjun Dai, Yichen Wang, and Le Song. Knowevolve: Deep temporal reasoning for dynamic knowledge graphs. In ICML, 2017.
 [31] Palash Goyal, Nitin Kamra, Xinran He, and Yan Liu. Dyngem: Deep embedding method for dynamic graphs. IJCAI International Workshop on Representation Learning for Graphs, 2017.
 [32] William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NIPS, 2017.
 [33] Nan Du, Yichen Wang, Niao He, and Le Song. Time sensitive recommendation from recurrent user activities. In NIPS, 2015.
 [34] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. Neural Networks, IEEE Transactions on, 20(1):61–80, 2009.
 [35] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. In ICLR, 2016.
 [36] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In ICML, 2016.
 [37] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
 [38] Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv:1611.07308, 2016.
 [39] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In ICLR, 2014.
 [40] Jundong Li, Harsh Dani, Xia Hu, Jiliang Tang, Yi Chang, and Huan Liu. Attributed network embedding for learning in a dynamic environment. In CIKM, 2017.
 [41] Robert Bamler and Stephan Mandt. Dynamic word embedding. In ICML, 2017.
 [42] Maja Rudolph and David Blei. Dynamic embeddings for language evolution. In WWW, 2018.
 [43] Carl Yang, Mengxiong Liu, Zongyi Wang, Liuyan Liu, and Jiawei Han. Graph clustering with dynamic embedding. arXiv:1712.08249, 2017.
 [44] Purnamrita Sarkar, Sajid Siddiqi, and Geoffrey Gordon. A latent space approach to dynamic embedding of co-occurrence data. In AISTATS, 2007.
 [45] Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. Recurrent marked temporal point processes: Embedding event history to vector. In KDD, 2016.
 [46] Shuai Xiao, Mehrdad Farajtabar, Xiaojing Ye, Junchi Yan, Le Song, and Hongyuan Zha. Wasserstein learning of deep generative point process models. In NIPS, 2017.