Temporal Graph Networks for Deep Learning on Dynamic Graphs

by Emanuele Rossi et al.

Graph Neural Networks (GNNs) have recently become increasingly popular due to their ability to learn complex systems of relations or interactions arising in a broad spectrum of problems ranging from biology and particle physics to social networks and recommendation systems. Despite the plethora of different models for deep learning on graphs, few approaches have been proposed thus far for dealing with graphs that present some sort of dynamic nature (e.g. evolving features or connectivity over time). In this paper, we present Temporal Graph Networks (TGNs), a generic, efficient framework for deep learning on dynamic graphs represented as sequences of timed events. Thanks to a novel combination of memory modules and graph-based operators, TGNs are able to significantly outperform previous approaches while being more computationally efficient. We furthermore show that several previous models for learning on dynamic graphs can be cast as specific instances of our framework. We perform a detailed ablation study of different components of our framework and devise the best configuration that achieves state-of-the-art performance on several transductive and inductive prediction tasks for dynamic graphs.



1 Introduction

In the past few years, graph representation learning Bronstein et al. (2017); Hamilton et al. (2017a); Battaglia et al. (2018) has produced a sequence of successes, gaining increasing popularity in machine learning. Graphs are ubiquitously used as models for systems of relations and interactions in many fields Battaglia et al. (2016); Qi et al. (2018); Monti et al. (2016); Choma et al. (2018); Duvenaud et al. (2015); Gilmer et al. (2017); Parisot et al. (2018); Rossi et al. (2019), in particular social sciences Ying et al. (2018); Monti et al. (2019) and biology Zitnik et al. (2018); Veselkov and others (2019); Gainza and others (2019). Learning on such data is possible using graph neural networks (GNNs) Hamilton et al. (2017b), which typically operate by a message passing mechanism Battaglia et al. (2018), aggregating information in a neighborhood of a node to create node embeddings that are then used for node-wise classification Monti et al. (2016); Velickovic et al. (2018); Kipf and Welling (2017) or edge prediction Zhang and Chen (2018) tasks.

The majority of methods for deep learning on graphs assume that the underlying graph is static. However, most real-life systems of interactions, such as social networks or biological interactomes, are dynamic. While it is often possible to apply static graph deep learning models Liben-Nowell and Kleinberg (2007) to dynamic graphs by ignoring the temporal evolution, this has been shown to be sub-optimal Xu et al. (2020), and in some cases it is the dynamic structure that contains crucial insights about the system. Learning on dynamic graphs is relatively recent, and most works are limited to the setting of discrete-time dynamic graphs represented as a sequence of snapshots of the graph over time Liben-Nowell and Kleinberg (2007); Dunlavy et al. (2011); Yu et al. (2019); Sankar et al. (2020); Pareja et al. (2019); Yu et al. (2018). Few approaches support the inductive setting of generalizing to new nodes not seen during training Nguyen et al. (2018a); Bastas et al. (2019); Trivedi et al. (2019); Kumar et al. (2019). Discrete-time approaches are unsuitable for interesting real-world settings such as social networks, where dynamic graphs are continuous (i.e. edges can appear at any time) and evolving (i.e. new nodes join the graph continuously).


In this paper, we first propose the generic inductive framework of Temporal Graph Networks (TGNs) operating on continuous-time dynamic graphs represented as a sequence of events, and show that many previous methods are specific instances of TGNs. Second, we propose a novel training strategy allowing the model to learn from the sequentiality of the data while maintaining highly efficient parallel processing. We show that this leads to an order of magnitude speed up over previous methods. Third, we perform a detailed ablation study of different components of our framework and analyze the tradeoff between speed and accuracy. Finally, we show state-of-the-art performance on multiple tasks and datasets in both transductive and inductive settings, while being much faster than previous methods.

2 Background

Deep learning on static graphs.

A static graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ comprises nodes $\mathcal{V} = \{1, \ldots, n\}$ and edges $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$, which are endowed with features, denoted by $\mathbf{v}_i$ and $\mathbf{e}_{ij}$ for all $i \in \mathcal{V}$ and $(i,j) \in \mathcal{E}$, respectively. A typical graph neural network (GNN) creates an embedding $\mathbf{z}_i$ of the nodes by learning a local aggregation rule of the form

$\mathbf{z}_i = \sum_{j \in \mathcal{N}_i} h\big(\mathbf{m}_{ij}, \mathbf{v}_i\big), \qquad \mathbf{m}_{ij} = \mathrm{msg}\big(\mathbf{v}_i, \mathbf{v}_j, \mathbf{e}_{ij}\big)$

which is interpreted as message passing from the neighbors $j$ of $i$. Here, $\mathcal{N}_i = \{j : (i,j) \in \mathcal{E}\}$ denotes the neighborhood of node $i$, and msg and $h$ are learnable functions.
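As a toy illustration, the aggregation rule above can be sketched in a few lines; the fixed functions below are illustrative stand-ins for the learnable msg and h, chosen only to make the shapes concrete:

```python
import numpy as np

def msg(v_i, v_j, e_ij):
    # message m_ij from neighbor j to node i: here, a plain concatenation
    return np.concatenate([v_i, v_j, e_ij])

def h(m_ij, v_i):
    # toy combination: fold the message back to the node feature dimension
    d = len(v_i)
    return m_ij[:d] + m_ij[d:2 * d] + v_i

def embed(i, neighbors, v, e):
    # z_i: sum of transformed messages over the neighborhood N_i
    return sum(h(msg(v[i], v[j], e[(i, j)]), v[i]) for j in neighbors)
```

A trained GNN would replace msg and h with parameterized networks; the control flow, however, is exactly this sum over neighbors.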

Dynamic Graphs.

There exist two main classes of dynamic graphs. Discrete-time dynamic graphs (DTDGs) are sequences of static graph snapshots taken at intervals in time. Continuous-time dynamic graphs (CTDGs) are more general and can be represented as timed lists of events, which may include edge addition or deletion, node addition or deletion, and node- or edge-wise feature transformations. In this paper, we do not consider deletion events.

Our temporal (multi-)graph is modeled as a sequence of time-stamped events $\mathcal{G} = \{x(t_1), x(t_2), \ldots\}$, representing the addition or change of a node or an interaction between a pair of nodes at times $0 \le t_1 \le t_2 \le \ldots$. An event $x(t)$ can be of two types: 1) A node-wise event is represented by $\mathbf{v}_i(t)$, where $i$ denotes the index of the node and $\mathbf{v}$ is the vector attribute associated with the event. After its first appearance, a node is assumed to live forever, and its index is used consistently for the following events. 2) An interaction event between nodes $i$ and $j$ is represented by a (directed) temporal edge $\mathbf{e}_{ij}(t)$ (there might be more than one edge between a pair of nodes, so technically this is a multigraph). We denote by $\mathcal{V}(T)$ and $\mathcal{E}(T)$ the temporal set of vertices and edges in time interval $T$, respectively, and by $\mathcal{N}_i(T) = \{j : (i,j) \in \mathcal{E}(T)\}$ the neighborhood of node $i$ in time interval $T$; $\mathcal{N}_i^k(T)$ denotes the $k$-hop neighborhood. A snapshot of the temporal graph $\mathcal{G}$ at time $t$ is the (multi-)graph $\mathcal{G}(t) = (\mathcal{V}[0,t], \mathcal{E}[0,t])$ with $n(t)$ nodes.
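As an illustration, a continuous-time dynamic graph can be stored as a timed event list from which temporal neighborhoods are recoverable. The encoding below is a hypothetical sketch of the data model above, not the paper's actual data format:

```python
def neighborhood(events, i, t_start, t_end):
    """N_i([t_start, t_end]): nodes j with an interaction (i, j) in the interval."""
    return {dst for (kind, src, dst, t) in events
            if kind == "edge" and src == i and t_start <= t <= t_end}

# Timed event list with the two event types from the text:
# node-wise events v_i(t) and interaction events e_ij(t).
events = [
    ("node", 0, None, 1.0),   # node-wise event v_0(1.0)
    ("edge", 0, 1, 2.5),      # interaction event e_01(2.5)
    ("edge", 0, 2, 3.0),      # interaction event e_02(3.0)
    ("edge", 1, 2, 3.5),      # interaction event e_12(3.5)
]
```

Because edges are directed and time-stamped, the neighborhood is a function of both the node and the queried time interval, matching the definition of $\mathcal{N}_i(T)$ above.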

3 Temporal Graph Networks

Following the terminology in Kazemi et al. (2020), a neural model for dynamic graphs can be regarded as an encoder-decoder pair, where an encoder is a function that maps from a dynamic graph to node embeddings, and a decoder takes as input one or more node embeddings and makes a prediction based on these, e.g. node classification or edge prediction. The key contribution of this paper is a novel Temporal Graph Network (TGN) encoder applied to a continuous-time dynamic graph represented as a sequence of time-stamped events and producing, for each time $t$, the embedding of the graph nodes $\mathbf{Z}(t) = (\mathbf{z}_1(t), \ldots, \mathbf{z}_{n(t)}(t))$.

3.1 Core modules

Figure 1: Two flows of operations for processing a batch of time-stamped interactions using TGN. Top: the embedding module computes the temporal node embeddings and subsequently the loss function. Bottom: memory update from batch interactions.


Memory.

The memory (state) of the model at time $t$ consists of a vector $\mathbf{s}_i(t)$ for each node $i$ the model has seen so far. The memory of a node is updated when the node is involved in an event (e.g. an interaction with another node or a node-wise change), and its purpose is to represent the history of the node in a compressed format. Thanks to this specific module, TGNs have the capability to memorize long-term dependencies for each node in the graph.

In addition, a global memory can be added to the model to track the evolution of the entire temporal network. While we envisage the benefits such a memory could bring (e.g. information can easily travel long distances in the graph, nodes' memory can be updated w.r.t. changes in the global state, and graph-wise predictions can be made directly from the global memory), this direction has not been explored in this work and is left to future research.

Message Function.

For each event involving node $i$, a message is computed to update $i$'s memory. In the case of an interaction event $\mathbf{e}_{ij}(t)$ between nodes $i$ and $j$ at time $t$, two messages can be computed for the source and target nodes that respectively start and receive the interaction:

$\mathbf{m}_i(t) = \mathrm{msg}_s\big(\mathbf{s}_i(t^-), \mathbf{s}_j(t^-), \Delta t, \mathbf{e}_{ij}(t)\big), \qquad \mathbf{m}_j(t) = \mathrm{msg}_d\big(\mathbf{s}_j(t^-), \mathbf{s}_i(t^-), \Delta t, \mathbf{e}_{ij}(t)\big)$
Similarly, in case of a node-wise event $\mathbf{v}_i(t)$, a single message can be computed for the node involved in the event:

$\mathbf{m}_i(t) = \mathrm{msg}_n\big(\mathbf{s}_i(t^-), t, \mathbf{v}_i(t)\big)$
Here, $\mathbf{s}_i(t^-)$ is the memory of node $i$ just before time $t$, and $\mathrm{msg}_s$, $\mathrm{msg}_d$ and $\mathrm{msg}_n$ are learnable message functions, e.g. MLPs. For the sake of simplicity, in all our experiments we chose the message function as identity (id), which is simply the concatenation of the inputs.
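A minimal sketch of the identity message function, assuming NumPy arrays for memory states and edge features (argument names are illustrative):

```python
import numpy as np

def msg_id(s_src, s_dst, delta_t, e):
    # identity (id) message: the concatenation of the source memory,
    # destination memory, time delta and edge features
    return np.concatenate([s_src, s_dst, [delta_t], e])
```

A learned alternative would pass this concatenation through an MLP instead of returning it directly.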

Message Aggregator.

Resorting to batch processing for efficiency reasons may lead to multiple events involving the same node $i$ in the same batch. As each event generates a message in our formulation, we use a mechanism to aggregate the messages $\mathbf{m}_i(t_1), \ldots, \mathbf{m}_i(t_b)$ for $t_1, \ldots, t_b \le t$:

$\bar{\mathbf{m}}_i(t) = \mathrm{agg}\big(\mathbf{m}_i(t_1), \ldots, \mathbf{m}_i(t_b)\big)$

Here, agg is an aggregation function. While multiple choices can be considered for implementing this module (e.g. RNNs or attention w.r.t. the node memory), for the sake of simplicity we considered two efficient non-learnable solutions in our experiments: most recent message (keep only the most recent message for a given node) and mean message (average all messages for a given node). We leave learnable aggregation as a future research direction.
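The two non-learnable aggregators can be sketched as follows, assuming each node's within-batch messages arrive as (timestamp, vector) pairs:

```python
import numpy as np

def agg_last(timed_msgs):
    # keep only the most recent message for the node
    return max(timed_msgs, key=lambda tm: tm[0])[1]

def agg_mean(timed_msgs):
    # average all messages for the node within the batch
    return np.mean([m for _, m in timed_msgs], axis=0)
```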

Memory Updater.

As previously mentioned, the memory of a node is updated upon each event involving the node itself:

$\mathbf{s}_i(t) = \mathrm{mem}\big(\bar{\mathbf{m}}_i(t), \mathbf{s}_i(t^-)\big)$

For interaction events involving two nodes $i$ and $j$, the memories of both nodes are updated after the event has happened. For node-wise events, only the memory of the involved node is updated. Here, mem is a learnable memory update function, e.g. a recurrent neural network such as an LSTM Hochreiter and Schmidhuber (1997) or GRU Cho et al. (2014).
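A GRU-style memory updater can be sketched in plain NumPy as follows. This mirrors the standard GRU update equations rather than the paper's exact implementation, and the weight matrices are random stand-ins for learned parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_mem(m_bar, s_prev, W, U, b):
    # W, U, b each hold parameters for the update (z), reset (r)
    # and candidate (h) gates of a GRU cell
    z = sigmoid(W["z"] @ m_bar + U["z"] @ s_prev + b["z"])
    r = sigmoid(W["r"] @ m_bar + U["r"] @ s_prev + b["r"])
    h = np.tanh(W["h"] @ m_bar + U["h"] @ (r * s_prev) + b["h"])
    # s_i(t) = mem(m̄_i(t), s_i(t^-)): convex mix of old state and candidate
    return (1 - z) * s_prev + z * h
```

In practice one would use a library GRU cell with trained weights; the sketch only shows how the aggregated message and the previous memory interact.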


Embedding.

The embedding module is used to generate the temporal embedding $\mathbf{z}_i(t)$ of node $i$ at any time $t$. The main goal of the embedding module is to avoid the so-called memory staleness problem Kazemi et al. (2020). Since the memory of a node $i$ is updated only when the node is involved in an event, it might happen that, in the absence of events for a long time (e.g. a social network user who stops using the platform for some time before becoming active again), $i$'s memory becomes stale. While multiple implementations of the embedding module are possible, we use the form:

$\mathbf{z}_i(t) = \mathrm{emb}(i, t) = \sum_{j \in \mathcal{N}_i^k([0,t])} h\big(\mathbf{s}_i(t), \mathbf{s}_j(t), \mathbf{e}_{ij}, \mathbf{v}_i(t), \mathbf{v}_j(t)\big)$

where $h$ is a learnable function. This includes many different formulations as particular cases:

Identity (id): $\mathrm{emb}(i, t) = \mathbf{s}_i(t)$, which uses the memory directly as the node embedding.

Time projection (time): $\mathrm{emb}(i, t) = (1 + \Delta t \, \mathbf{w}) \circ \mathbf{s}_i(t)$, where $\mathbf{w}$ are learnable parameters, $\Delta t$ is the time since the last interaction, and $\circ$ denotes the element-wise vector product. This version of the embedding method was used in JODIE Kumar et al. (2019).

Temporal Graph Attention (attn): A series of $L$ graph attention layers compute $i$'s embedding by aggregating information from its $L$-hop temporal neighborhood. The input to the $l$-th layer is $i$'s representation $\mathbf{h}_i^{(l-1)}(t)$, the current timestamp $t$, and $i$'s neighborhood representations $\{\mathbf{h}_1^{(l-1)}(t), \ldots, \mathbf{h}_N^{(l-1)}(t)\}$ together with timestamps $t_1, \ldots, t_N$ and features $\mathbf{e}_{i1}(t_1), \ldots, \mathbf{e}_{iN}(t_N)$ for each of the considered interactions which form an edge in $i$'s temporal neighborhood:

$\mathbf{h}_i^{(l)}(t) = \mathrm{MLP}^{(l)}\big(\mathbf{h}_i^{(l-1)}(t) \,\|\, \tilde{\mathbf{h}}_i^{(l)}(t)\big)$
$\tilde{\mathbf{h}}_i^{(l)}(t) = \mathrm{MultiHeadAttention}^{(l)}\big(\mathbf{q}^{(l)}(t), \mathbf{K}^{(l)}(t), \mathbf{V}^{(l)}(t)\big)$
$\mathbf{q}^{(l)}(t) = \mathbf{h}_i^{(l-1)}(t) \,\|\, \phi(0)$
$\mathbf{K}^{(l)}(t) = \mathbf{V}^{(l)}(t) = \mathbf{C}^{(l)}(t)$
$\mathbf{C}^{(l)}(t) = \big[\mathbf{h}_1^{(l-1)}(t) \,\|\, \mathbf{e}_{i1}(t_1) \,\|\, \phi(t - t_1); \ldots; \mathbf{h}_N^{(l-1)}(t) \,\|\, \mathbf{e}_{iN}(t_N) \,\|\, \phi(t - t_N)\big]$

Here, $\phi(\cdot)$ represents a generic time encoding Xu et al. (2020), $\|$ is the concatenation operator, and $\mathbf{z}_i(t) = \mathbf{h}_i^{(L)}(t)$. Each layer amounts to performing multi-head attention Vaswani et al. (2017) where the query ($\mathbf{q}$) is a reference node (i.e. the target node or one of its hop neighbors), and the keys and values are its neighbors. Finally, an MLP is used to combine the reference node representation with the aggregated information. Differently from the original formulation of this layer (first proposed in TGAT Xu et al. (2020)), where no node-wise temporal features were used, in our case the input representation of each node is $\mathbf{h}_j^{(0)}(t) = \mathbf{s}_j(t) + \mathbf{v}_j(t)$, which allows the model to exploit both the current memory $\mathbf{s}_j(t)$ and the temporal node features $\mathbf{v}_j(t)$.

Temporal Graph Sum (sum): A simpler and faster aggregation over the graph:

$\mathbf{h}_i^{(l)}(t) = \mathbf{W}_2^{(l)}\big(\mathbf{h}_i^{(l-1)}(t) \,\|\, \tilde{\mathbf{h}}_i^{(l)}(t)\big)$
$\tilde{\mathbf{h}}_i^{(l)}(t) = \mathrm{ReLU}\Big(\sum_{j \in \mathcal{N}_i([0,t])} \mathbf{W}_1^{(l)}\big(\mathbf{h}_j^{(l-1)}(t) \,\|\, \mathbf{e}_{ij} \,\|\, \phi(t - t_j)\big)\Big)$

Here as well, $\phi(\cdot)$ is a time encoding and $\mathbf{z}_i(t) = \mathbf{h}_i^{(L)}(t)$.
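Two of the embedding modules above, time projection and a single temporal-graph-sum layer, can be sketched as follows. These are toy NumPy stand-ins: w, W1, W2 and the time-encoding frequencies would be learned in practice:

```python
import numpy as np

def time_projection(s, w, delta_t):
    # JODIE-style projection: emb(i, t) = (1 + Δt·w) ∘ s_i(t)
    return (1 + delta_t * w) * s

def time_enc(delta_t, omega):
    # cosine time encoding in the spirit of Xu et al. (2020)
    return np.cos(delta_t * omega)

def tg_sum_layer(h_i, neigh, t, W1, W2, omega):
    # neigh: list of (h_j, e_ij, t_j) triples from i's temporal neighborhood
    agg = sum(W1 @ np.concatenate([h_j, e_ij, time_enc(t - t_j, omega)])
              for h_j, e_ij, t_j in neigh)
    h_tilde = np.maximum(agg, 0.0)  # ReLU over the summed neighbor messages
    return W2 @ np.concatenate([h_i, h_tilde])
```

Note that with delta_t = 0 the time projection reduces to the identity embedding, which is consistent with the memory being fresh immediately after an event.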

3.2 Training

Figure 2: Two implementations of TGN with different memory updates. Left: basic training strategy. Right: advanced training strategy. Shown are the raw messages generated by events, the time of the last event involving each node, and the time immediately preceding it.

Our TGN model can be trained for a variety of tasks such as future edge prediction (self-supervised setting) or node classification (semi-supervised setting). We present two possible training procedures for TGNs while using the link prediction task as a simple example: provided a list of ordered timed interactions, the goal of the model is to predict the future interactions from those observed in the past. Both training procedures are detailed in Algorithms 1 and 2, and Figure 2 depicts how TGN modules are combined.

Figure 1 shows that interactions serve two purposes: 1) they provide the training objective, and 2) they are used to update the memory. The interactions in a batch cannot be used to update the memory before predicting those same interactions, as this would leak information. However, reversing the order of the operations, i.e. predicting the interactions and computing the loss before updating the memory, causes all memory-related modules (Message Function, Message Aggregator, and Memory Updater) not to receive a gradient (Algorithm 1). Therefore, extra steps must be taken to train these modules.

s ← 0  // Initialize memory to zeros
foreach batch (i, j, e_ij(t), t) ∈ training data do
    sample negative nodes n
    z_i, z_j, z_n ← emb(i), emb(j), emb(n)  // Compute node embeddings
    p_pos, p_neg ← dec(z_i, z_j), dec(z_i, z_n)  // Compute interaction probs
    l ← BCE(p_pos, p_neg)  // Compute BCE loss
    m_i, m_j ← msg(s_i, s_j, e_ij(t)), msg(s_j, s_i, e_ij(t))  // Compute messages
    m̄_i, m̄_j ← agg(m_i), agg(m_j)  // Aggregate messages for the same nodes
    s_i, s_j ← mem(m̄_i, s_i), mem(m̄_j, s_j)  // Update memory
end foreach
Algorithm 1: Training TGN - no gradient flows to the memory-related modules. (For the sake of clarity, we use the same message function for both sources and destinations.)

Basic training strategy.

The simplest strategy keeps the same order of operations as Algorithm 1 (predict interactions, then update memory), but breaks every batch (by "batch" we refer to what is sometimes called a mini-batch, i.e. a subset of the original dataset used for mini-batch gradient descent) of size $b$ into $n$ sub-batches of size $b/n$. The sub-batches are processed sequentially with their losses accumulated, and backpropagation is performed only after the last sub-batch. If a node appears in two sub-batches, its memory in the second sub-batch will depend on the computation done by the memory-related modules in the first; therefore, these modules will receive a gradient.
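The sub-batching scheme can be sketched as follows, where process_sub_batch is a hypothetical callable that runs Algorithm 1's steps on one sub-batch (updating memory state as a side effect) and returns its loss:

```python
def train_batch(batch, n_sub, process_sub_batch):
    # split the batch of size b into n sub-batches of size ~b/n
    size = max(1, len(batch) // n_sub)
    sub_batches = [batch[k:k + size] for k in range(0, len(batch), size)]
    # process sub-batches sequentially, accumulating their losses;
    # backpropagation would run once from the accumulated loss
    total_loss = sum(process_sub_batch(sb) for sb in sub_batches)
    return total_loss
```

Because the sub-batches share memory state, a node appearing in two of them threads a gradient path through the memory-related modules, as described above.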

Advanced training strategy.

While the basic training procedure is straightforward to implement, it presents two drawbacks: 1) it slows down the training, as each batch is not computed fully in parallel, 2) the only nodes that contribute to the memory-related modules’ gradients are those with at least one interaction in multiple sub-batches. Therefore, these modules can still receive no gradient if sub-batches do not share any nodes, or the gradient can be heavily skewed towards a few nodes that appear multiple times, leading to biased update steps and ultimately to a sub-optimal local minimum for the overall training procedure.

The solution to this problem is to reverse the order of operations. Let $B$ be the last sub-batch containing interactions of node $i$. Instead of letting the memory $\mathbf{s}_i$ be representative of the entire set of past interactions involving $i$, we store the memory of $i$ prior to the last sub-batch $B$, together with the raw information needed to update it with the interactions of $B$ (i.e. the set of raw update messages of $i$'s interactions in $B$). At the beginning of each sub-batch, the model first updates the nodes' memories by computing and aggregating messages from the stored raw information (line 2 of Algorithm 2), then uses the updated memory to infer the embeddings and computes the loss function (Figure 2, right). As a result, the loss function depends on a memory which has just been updated by its related modules. Moreover, all nodes involved in the computation of the embeddings (i.e. all source and target nodes and related neighbors) contribute to the gradients, ultimately producing more stable optimization and better local minima (Figure 3(b)).
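One update-then-predict step of this strategy can be sketched as follows, where compute_msg, agg, mem and predict are hypothetical stand-ins for the message function, aggregator, memory updater and decoder/loss:

```python
def advanced_step(batch, memory, raw_store, compute_msg, agg, mem, predict):
    # 1) Update memory from raw messages stored by earlier batches
    for node, raw_msgs in list(raw_store.items()):
        msgs = [compute_msg(rm) for rm in raw_msgs]
        memory[node] = mem(agg(msgs), memory.get(node))
    raw_store.clear()
    # 2) Predict this batch with the freshly updated memory; the loss now
    #    depends on the memory-related modules, so they receive gradients
    loss = predict(batch, memory)
    # 3) Store this batch's interactions as raw messages for the next step
    for src, dst, t, feat in batch:
        raw_store.setdefault(src, []).append((src, dst, t, feat))
        raw_store.setdefault(dst, []).append((dst, src, t, feat))
    return loss
```

The key design point is that step 2 runs on a memory produced inside the current computational graph, which is what restores gradient flow compared to Algorithm 1.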

While the advanced training strategy is sufficient to train TGNs, it can also be combined with the basic strategy by breaking each batch into sub-batches. We investigate the speed vs accuracy tradeoff of different combinations of the two strategies in Section 5.

s ← 0  // Initialize memory to zeros
rm ← ∅  // Initialize raw messages
foreach batch (i, j, e_ij(t), t) ∈ training data do
    sample negative nodes n
    m ← msg(rm)  // Compute messages from raw features
    m̄ ← agg(m)  // Aggregate messages for the same nodes
    s' ← mem(m̄, s)  // Get updated memory
    z_i, z_j, z_n ← emb'(i), emb'(j), emb'(n)  // Compute node embeddings
    p_pos, p_neg ← dec(z_i, z_j), dec(z_i, z_n)  // Compute interaction probs
    l ← BCE(p_pos, p_neg)  // Compute BCE loss
    rm_i, rm_j ← (s_i, s_j, t, e_ij(t)), (s_j, s_i, t, e_ij(t))  // Compute raw messages
    s_i, s_j ← s'_i, s'_j  // Store updated memory for sources and destinations
end foreach
Algorithm 2: Training TGN - advanced strategy. (For the sake of clarity, we use the same message function for both sources and destinations; emb' denotes an embedding layer operating on the updated memory s'.)

4 Related Work

Early models for learning on dynamic graphs focused on discrete-time dynamic graphs (DTDGs). Such approaches either aggregate graph snapshots and then apply static methods Liben-Nowell and Kleinberg (2007); Hisano (2018); Sharan and Neville (2008); Ibrahim and Chen (2015); Ahmed and Chen (2016); Ahmed et al. (2016), assemble snapshots into tensors and factorize Dunlavy et al. (2011); Yu et al. (2017); Ma et al. (2019), or encode each snapshot to produce a series of embeddings. In the latter case, the embeddings are either aggregated by taking a weighted sum Yao et al. (2016); Zhu et al. (2012), fit to time series models Huang and Lin (2009); Güneş et al. (2016); da Silva Soares and Prudêncio (2012); Moradabadi and Meybodi (2017), used as components in RNNs Seo et al. (2018); Narayan and Roe (2018); Manessi et al. (2020); Yu et al. (2019); Chen et al. (2018); Sankar et al. (2020); Pareja et al. (2019), or learned by imposing a smoothness constraint over time Kim and Han (2009); Gupta et al. (2011); Yao et al. (2016); Zhu et al. (2017); Zhou et al. (2018); Singer et al. (2019); Goyal et al. (2018); Fard et al. (2019); Pei et al. (2016). Another line of work encodes DTDGs by first performing random walks on an initial snapshot and then modifying the walk behaviour for subsequent snapshots Mahdavi et al. (2018); Du et al. (2018); Xin et al. (2016); De Winter et al. (2018); Yu et al. (2018).

Only recently have continuous-time dynamic graphs (CTDGs) been addressed. Several approaches use random walk models Nguyen et al. (2018b, a); Bastas et al. (2019) that incorporate continuous time through constraints on transition probabilities. Sequence-based approaches for CTDGs Kumar et al. (2019); Trivedi et al. (2017, 2019); Ma et al. (2018) use RNNs to update representations of the source and destination node each time a new edge appears. Other recent works have focused on dynamic knowledge graphs Goel et al. (2019); Xu et al. (2019); Dasgupta et al. (2018); García-Durán et al. (2018).

Most recent CTDG learning models can be interpreted as specific cases of our framework (see Table 1). For example, JODIE Kumar et al. (2019) uses the time projection embedding module. TGAT Xu et al. (2020) is a specific case of TGN in which the memory and its related modules are missing, and graph attention is used as the embedding module. Finally, we note that TGN generalizes the Graph Networks (GN) model Battaglia et al. (2018) for static graphs (with the exception of the global block mentioned before), and thus the majority of existing message passing-type architectures.

For additional background, we refer the reader to surveys on general graph representation learning Bronstein et al. (2017); Hamilton et al. (2017a); Battaglia et al. (2018) and the recent survey on dynamic graph learning Kazemi et al. (2020).

Method       Mem.   Mem. Update   Embedding        Mess. Agg.   Mess. Func.
JODIE        node   RNN           time             —            id
TGAT         —      —             attn (2l, 20n)   —            —
TGN-attn     node   GRU           attn (1l, 10n)   last         id
TGN-2l       node   GRU           attn (2l, 10n)   last         id
TGN-no-mem   —      —             attn (1l, 10n)   —            id
TGN-time     node   GRU           time             last         id
TGN-id       node   GRU           id               last         id
TGN-sum      node   GRU           sum (1l, 10n)    last         id
TGN-mean     node   GRU           attn (1l, 10n)   mean         id
Table 1: Previous models for deep learning on continuous-time dynamic graphs are specific cases of our TGN framework. Shown are multiple variants of TGN used in our ablation studies. attn (l, n) refers to graph attention using l layers and n neighbors. JODIE uses t-batches; TGAT uses uniform sampling of neighbors, while the default is sampling the most recent neighbors.

5 Experiments

5.1 Experimental Settings


We use three datasets in our experiments: Wikipedia, Reddit  Kumar et al. (2019), and Twitter. Reddit and Wikipedia are bipartite interaction graphs. In the Reddit dataset, users and sub-reddits are nodes, and an interaction occurs when a user writes a post to the sub-reddit. In the Wikipedia dataset, users and pages are nodes, and an interaction represents a user editing a page. In both aforementioned datasets, the interactions are represented by text features (of a post or page edit, respectively), and labels represent whether a user is banned. Both interactions and labels are time-stamped.

The Twitter dataset is a non-bipartite graph released as part of the 2020 RecSys Challenge Belli et al. (2020). Nodes are users and interactions are retweets. The features of an interaction are a BERT-based Wolf et al. (2019) vector representation of the text of the retweet. We use a subset of the original dataset formed by taking the largest connected component of the retweet graph and retaining only the nodes with the highest in- and out-degrees. Dataset statistics and further details are provided in the supplementary material.


Our experimental setup closely follows Xu et al. (2020) and focuses on the tasks of future edge prediction and dynamic node classification. For the former, we use both the transductive and inductive settings. In the transductive task, we predict future links of nodes observed during training, whereas in the inductive task we predict future links of nodes never observed before. The transductive setting is used for node classification. We perform the same 70%-15%-15% chronological split as in Xu et al. (2020).

Future Edge Prediction.

The goal is to predict the probability of an edge occurring between two nodes at a given time. Our encoder is combined with a simple MLP decoder mapping from the concatenation of two node embeddings to the probability of the edge. For the Wikipedia and Reddit datasets, we use the Adam optimizer and the same batch size for training, validation and testing, with early stopping with a patience of 5. For the Twitter dataset, the only change is the learning rate. We sample a number of negatives equal to the positive interactions, and use average precision as the reference metric. All results are averaged over 10 runs to obtain mean and standard deviation.

Dynamic Node Classification.

The task is to predict a binary label indicating whether a user was banned at a specific time. We pre-train our encoder on the future edge prediction task, then freeze it and combine it with a task-specific MLP decoder. We use the Adam optimizer and the same batch size for training, validation and testing. The metric used is ROC-AUC. All results are averaged over 10 runs to obtain mean and standard deviation.


Our strong baselines are state-of-the-art approaches for continuous time dynamic graphs (CTDNE Nguyen et al. (2018b), Jodie Kumar et al. (2019), and TGAT Xu et al. (2020)) as well as state-of-the-art models for static graphs (GAE Kipf and Welling (2016), VGAE Kipf and Welling (2016), DeepWalk Perozzi et al. (2014), Node2Vec Grover and Leskovec (2016), GAT Velickovic et al. (2018) and GraphSAGE Hamilton et al. (2017a)).

5.2 Performance


Table 2 presents the results on future edge prediction. Our model clearly outperforms the baselines by a large margin in both the transductive and the inductive setting on all datasets. The gap is particularly large on the Twitter dataset, where we outperform the second-best method (TGAT) by over 25%. Table 3 shows the results on dynamic node classification, where again our model obtains state-of-the-art results, with a large improvement over all other methods.


Due to the efficient parallel processing and the need for only one graph attention layer (see Section 5.3 for the ablation study on the number of layers), our model is up to an order of magnitude faster than Jodie and substantially faster than TGAT in completing a single epoch (see Figure 3(a)), while requiring a similar number of epochs to converge.

Table 2: Average Precision (%) for the future edge prediction task in transductive and inductive settings on the Wikipedia, Reddit and Twitter datasets. Mean and standard deviations are computed over 10 runs. Static graph method. Does not support inductive setting.
Table 3: ROC AUC (%) for dynamic node classification on the Wikipedia and Reddit datasets. Mean and standard deviations are computed over 10 runs. Static graph method.

Setting      Model     Update   #sb
TGN-id-s1    TGN-id    start    1
TGN-id-s5    TGN-id    start    5
TGN-id-e1    TGN-id    end      1
TGN-id-e5    TGN-id    end      5
TGN-att-s1   TGN-att   start    1
TGN-att-s5   TGN-att   start    5
TGN-att-e1   TGN-att   end      1
TGN-att-e5   TGN-att   end      5
Table 4: Different combinations of models and training strategies. #sb is the number of sub-batches; "Update" indicates whether the memory is updated at the start or at the end of the batch.
(a) Tradeoff between accuracy (test average precision in %) and speed (time per epoch in sec) of different models.
(b) Tradeoff between accuracy (test average precision in %) and speed (time per epoch in sec) of different training strategies.
Figure 3: Ablation studies on the Wikipedia dataset for the transductive setting. Means and standard deviations (visualized as ellipses) were computed over 10 runs.

5.3 Choice of Modules

We perform a detailed ablation study comparing different instances of our TGN framework. We are particularly interested in the speed vs accuracy tradeoff resulting from the choice of modules and their combination. The variants we experiment with are reported in Table 1, and their results are depicted in Figure 3(a).


Memory.

We compare a model that does not make use of a memory (TGN-no-mem) with a model which uses memory (TGN-attn) but is otherwise identical. While TGN-attn is slower, it vastly outperforms TGN-no-mem, confirming the importance of memory for effective learning on dynamic graphs, owing to its ability to store long-term information about a node which is otherwise hard to capture.

Embedding Module.

We compared models with different embedding modules (TGN-id, TGN-time, TGN-attn, TGN-sum). The first interesting insight is that projecting the embedding in time seems to slightly hurt performance, as shown by the fact that TGN-time underperforms TGN-id. Moreover, the ability to exploit the graph is crucial for performance: all graph-based embeddings (TGN-attn, TGN-sum) outperform the graph-less TGN-id model by a large margin, with TGN-attn being the top performer at the expense of being slightly slower than the simpler TGN-sum.

Message Aggregator.

We compared two models, one using the most recent message aggregator (TGN-attn) and one using the mean aggregator (TGN-mean), but otherwise identical. While TGN-mean performs slightly better, it is also considerably slower.

Number of layers.

While in TGAT having 2 layers is of fundamental importance for obtaining good performance (TGAT vs TGAT-1l), in TGN the presence of the memory makes a single layer enough to obtain very high performance (TGN-attn vs TGN-2l). This is probably because, when accessing the memory of the 1-hop neighbors, we indirectly access information from hops further away. Moreover, using only one layer of graph attention speeds up the model dramatically.

5.4 Training Strategies

Table 4 shows the configurations we experimented with for this ablation study, while Figure 3(b) presents the results. The TGN-id model makes use only of the memory (no embedding module) and therefore makes for a perfect testbed for training strategies related to the memory-related modules. Looking at the results with the TGN-id model, our proposed strategy of updating the memory at the start of the batch clearly outperforms updating at the end.

Interestingly, when using a graph attention embedding module (TGN-attn), the benefit of the advanced strategy shrinks. This is probably because the embedding module is able to adapt to the memory-related modules, effectively denoising the spurious behavior of the nodes' memory.

6 Conclusion

We introduce TGN, a generic framework for learning on continuous-time dynamic graphs. We obtain state-of-the-art results on several tasks and datasets while being faster than previous methods. Detailed ablation studies show the importance of the memory and its related modules for storing long-term information, as well as the importance of the graph-based embedding module for generating up-to-date node embeddings. We envision interesting applications of TGN in the fields of social sciences, recommender systems, and biological interaction networks, opening up a future research direction of exploring more advanced settings of our model and understanding the most appropriate domain-specific choices.

Broader Impact

Graph Neural Networks able to effectively process temporal graphs can potentially serve a variety of purposes in our society, e.g. improved recommender systems that take into account the evolving nature of users on social networks or marketplaces, as well as better filtering mechanisms for the detection of unhealthy behaviors such as spam or coordinated manipulation. At the same time, due to the novelty of these approaches, the robustness of such architectures w.r.t. external adversarial attacks has not yet been validated in the literature. Additional studies will thus need to be conducted to identify the potential risks and benefits that temporal graph neural networks may present when subjected to adversarial attacks, before they can be applied to sensitive personal data and extensively exploited in industrial applications.


  • [1] N. M. Ahmed, L. Chen, Y. Wang, B. Li, Y. Li, and W. Liu (2016) Sampling-based algorithm for link prediction in temporal networks. Information Sciences 374, pp. 1–14. Cited by: §4.
  • [2] N. M. Ahmed and L. Chen (2016) An efficient algorithm for link prediction in temporal uncertain social networks. Information Sciences 331, pp. 120–136. Cited by: §4.
  • [3] N. Bastas, T. Semertzidis, A. Axenopoulos, and P. Daras (2019) Evolve2vec: learning network representations using temporal unfolding. In International Conference on Multimedia Modeling, pp. 447–458. Cited by: §1, §4.
  • [4] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, and R. Faulkner (2018) Relational inductive biases, deep learning, and graph networks. arXiv:1806.01261. Cited by: §1, §4, §4.
  • [5] P. W. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, et al. (2016) Interaction networks for learning about objects, relations and physics. In NIPS, pp. 4502–4510. Cited by: §1.
  • [6] L. Belli, S. I. Ktena, A. Tejani, A. Lung-Yut-Fon, F. Portman, X. Zhu, Y. Xie, A. Gupta, M. M. Bronstein, A. Delić, et al. (2020) Privacy-preserving recommender systems challenge on twitter’s home timeline. arXiv:2004.13715. Cited by: §5.1.
  • [7] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst (2017) Geometric deep learning: going beyond euclidean data. IEEE Signal Process. Mag. 34 (4), pp. 18–42. Cited by: §1, §4.
  • [8] J. Chen, X. Xu, Y. Wu, and H. Zheng (2018) GC-lstm: graph convolution embedded lstm for dynamic link prediction. arXiv:1812.04206. External Links: 1812.04206 Cited by: §4.
  • [9] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, pp. 1724–1734. External Links: Link, Document Cited by: §3.1.
  • [10] N. Choma, F. Monti, L. Gerhardt, T. Palczewski, Z. Ronaghi, P. Prabhat, W. Bhimji, M. M. Bronstein, S. Klein, and J. Bruna (2018) Graph neural networks for icecube signal classification. In ICMLA, Cited by: §1.
  • [11] P. R. da Silva Soares and R. B. C. Prudêncio (2012) Time series based link prediction. In IJCNN, pp. 1–7. Cited by: §4.
  • [12] S. S. Dasgupta, S. N. Ray, and P. Talukdar (2018) HyTE: hyperplane-based temporally aware knowledge graph embedding. In EMNLP, pp. 2001–2011. External Links: Link, Document Cited by: §4.
  • [13] S. De Winter, T. Decuypere, S. Mitrović, B. Baesens, and J. De Weerdt (2018) Combining temporal aspects of dynamic networks with node2vec for a more efficient dynamic link prediction. In ASONAM, Vol. , pp. 1234–1241. Cited by: §4.
  • [14] L. Du, Y. Wang, G. Song, Z. Lu, and J. Wang (2018) Dynamic network embedding: an extended approach for skip-gram based network embedding.. In IJCAI, pp. 2086–2092. Cited by: §4.
  • [15] D. M. Dunlavy, T. G. Kolda, and E. Acar (2011) Temporal link prediction using matrix and tensor factorizations. TKDD 5 (2), pp. 1–27. Cited by: §1, §4.
  • [16] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In NIPS, Cited by: §1.
  • [17] A. M. Fard, E. Bagheri, and K. Wang (2019) Relationship prediction in dynamic heterogeneous information networks. In European Conference on Information Retrieval, pp. 19–34. Cited by: §4.
  • [18] P. Gainza et al. (2019) Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nature Methods 17, pp. 184–192. Cited by: §1.
  • [19] A. García-Durán, S. Dumančić, and M. Niepert (2018) Learning sequence encoders for temporal knowledge graph completion. In EMNLP, pp. 4816–4821. External Links: Document Cited by: §4.
  • [20] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In ICML, Cited by: §1.
  • [21] R. Goel, S. M. Kazemi, M. Brubaker, and P. Poupart (2019) Diachronic embedding for temporal knowledge graph completion. arXiv:1907.03143. External Links: 1907.03143 Cited by: §4.
  • [22] P. Goyal, N. Kamra, X. He, and Y. Liu (2018) DynGEM: deep embedding method for dynamic graphs.. arXiv:1805.11273 abs/1805.11273. External Links: Link Cited by: §4.
  • [23] A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In KDD ’16, KDD ’16, New York, NY, USA, pp. 855–864. External Links: ISBN 9781450342322, Link, Document Cited by: §5.1, Experimental Settings for Baselines.
  • [24] İ. Güneş, Ş. Gündüz-Öğüdücü, and Z. Çataltepe (2016) Link prediction using time series of neighborhood-based node similarity scores. Data Mining and Knowledge Discovery 30 (1), pp. 147–180. Cited by: §4.
  • [25] M. Gupta, C. C. Aggarwal, J. Han, and Y. Sun (2011) Evolutionary clustering and analysis of bibliographic networks. In ASONAM, pp. 63–70. Cited by: §4.
  • [26] W. L. Hamilton, R. Ying, and J. Leskovec (2017) Representation learning on graphs: methods and applications. arXiv:1709.05584. Cited by: §1, §4, §5.1, Experimental Settings for Baselines.
  • [27] W. L. Hamilton, R. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In NIPS, Cited by: §1.
  • [28] R. Hisano (2018) Semi-supervised graph embedding approach to dynamic link prediction. Springer Proceedings in Complexity, pp. 109–121. External Links: ISBN 9783319731988, ISSN 2213-8692, Link, Document Cited by: §4.
  • [29] S. Hochreiter and J. Schmidhuber (1997-11) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. External Links: ISSN 0899-7667, Link, Document Cited by: §3.1.
  • [30] Z. Huang and D. K. Lin (2009) The time-series link prediction problem with applications in communication surveillance. INFORMS Journal on Computing 21 (2), pp. 286–303. Cited by: §4.
  • [31] N. M. A. Ibrahim and L. Chen (2015) Link prediction in dynamic social networks by integrating different types of information. Applied Intelligence 42 (4), pp. 738–750. Cited by: §4.
  • [32] S. M. Kazemi, R. Goel, K. Jain, I. Kobyzev, A. Sethi, P. Forsyth, and P. Poupart (2020) Representation learning for dynamic graphs: a survey. Journal of Machine Learning Research 21 (70), pp. 1–73. External Links: Link Cited by: §3.1, §3, §4.
  • [33] M. Kim and J. Han (2009) A particle-and-density based evolutionary clustering method for dynamic networks. VLDB 2 (1), pp. 622–633. Cited by: §4.
  • [34] T. N. Kipf and M. Welling (2016) Variational graph auto-encoders. NIPS Workshop on Bayesian Deep Learning. Cited by: §5.1, Experimental Settings for Baselines.
  • [35] T. N. Kipf and M. Welling (2017) Semi-Supervised Classification with Graph Convolutional Networks. In ICLR, External Links: Link Cited by: §1.
  • [36] S. Kumar, X. Zhang, and J. Leskovec (2019) Predicting dynamic embedding trajectory in temporal interaction networks. In KDD ’19, pp. 1269–1278. External Links: ISBN 9781450362016, Link, Document Cited by: §1, §3.1, §4, §4, §5.1, §5.1, Experimental Settings for Baselines.
  • [37] D. Liben-Nowell and J. Kleinberg (2007-05) The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. 58 (7), pp. 1019–1031. External Links: ISSN 1532-2882 Cited by: §1, §4.
  • [38] Y. Ma, Z. Guo, Z. Ren, E. Zhao, J. Tang, and D. Yin (2018) Streaming graph neural networks. arXiv:1810.10627. Cited by: §4.
  • [39] Y. Ma, V. Tresp, and E. A. Daxberger (2019) Embedding models for episodic knowledge graphs. Journal of Web Semantics 59, pp. 100490. Cited by: §4.
  • [40] S. Mahdavi, S. Khoshraftar, and A. An (2018) Dynnode2vec: scalable dynamic network embedding. In 2018 IEEE International Conference on Big Data, pp. 3762–3765. Cited by: §4.
  • [41] F. Manessi, A. Rozza, and M. Manzo (2020) Dynamic graph convolutional networks. Pattern Recognition 97, pp. 107000. Cited by: §4.
  • [42] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein (2016) Geometric deep learning on graphs and manifolds using mixture model cnns. In CVPR, Cited by: §1.
  • [43] F. Monti, F. Frasca, D. Eynard, D. Mannion, and M. M. Bronstein (2019) Fake news detection on social media using geometric deep learning. arXiv:1902.06673. External Links: Link, 1902.06673 Cited by: §1.
  • [44] B. Moradabadi and M. R. Meybodi (2017) A novel time series link prediction method: learning automata approach. Physica A: Statistical Mechanics and its Applications 482, pp. 422–432. Cited by: §4.
  • [45] A. Narayan and P. H. Roe (2018) Learning graph dynamics using deep neural networks. IFAC-PapersOnLine 51 (2), pp. 433–438. Cited by: §4.
  • [46] G. H. Nguyen, J. Boaz Lee, R. A. Rossi, N. K. Ahmed, E. Koh, and S. Kim (2018) Dynamic network embeddings: from random walks to temporal random walks. In 2018 IEEE International Conference on Big Data, Vol. , pp. 1085–1092. Cited by: §1, §4.
  • [47] G. H. Nguyen, J. B. Lee, R. A. Rossi, N. K. Ahmed, E. Koh, and S. Kim (2018) Continuous-time dynamic network embeddings. In WWW ’18, pp. 969–976. External Links: ISBN 9781450356404, Link, Document Cited by: §4, §5.1, Experimental Settings for Baselines.
  • [48] A. Pareja, G. Domeniconi, J. Chen, T. Ma, T. Suzumura, H. Kanezashi, T. Kaler, and C. E. Leisersen (2019) Evolvegcn: evolving graph convolutional networks for dynamic graphs. arXiv:1902.10191. Cited by: §1, §4.
  • [49] S. Parisot, S. I. Ktena, E. Ferrante, M. Lee, R. Guerrero, B. Glocker, and D. Rueckert (2018) Disease prediction using graph convolutional networks: application to autism spectrum disorder and alzheimer’s disease. Med Image Anal 48, pp. 117–130. Cited by: §1.
  • [50] Y. Pei, J. Zhang, G. Fletcher, and M. Pechenizkiy (2016) Node classification in dynamic social networks. AALTD, pp. 54. Cited by: §4.
  • [51] B. Perozzi, R. Al-Rfou, and S. Skiena (2014) DeepWalk: online learning of social representations. In KDD ’14, pp. 701–710. External Links: ISBN 9781450329569, Link, Document Cited by: §5.1, Experimental Settings for Baselines.
  • [52] S. Qi, W. Wang, B. Jia, J. Shen, and S. Zhu (2018) Learning human-object interactions by graph parsing neural networks. In ECCV, pp. 401–417. Cited by: §1.
  • [53] E. Rossi, F. Monti, M. M. Bronstein, and P. Liò (2019) NcRNA classification with graph convolutional networks. In KDD Workshop on Deep Learning on Graphs, Cited by: §1.
  • [54] A. Sankar, Y. Wu, L. Gou, W. Zhang, and H. Yang (2020) DySAT: deep neural representation learning on dynamic graphs via self-attention networks. In WSDM, pp. 519–527. Cited by: §1, §4.
  • [55] Y. Seo, M. Defferrard, P. Vandergheynst, and X. Bresson (2018) Structured sequence modeling with graph convolutional recurrent networks. Lecture Notes in Computer Science, pp. 362–373. External Links: ISBN 9783030041670, ISSN 1611-3349, Link, Document Cited by: §4.
  • [56] U. Sharan and J. Neville (2008) Temporal-relational classifiers for prediction in evolving domains. In ICDM, pp. 540–549. Cited by: §4.
  • [57] U. Singer, I. Guy, and K. Radinsky (2019-07) Node embedding over temporal graphs. In IJCAI, pp. 4605–4612. External Links: Document, Link Cited by: §4.
  • [58] R. Trivedi, H. Dai, Y. Wang, and L. Song (2017) Know-evolve: deep temporal reasoning for dynamic knowledge graphs. In ICML, pp. 3462–3471. Cited by: §4.
  • [59] R. Trivedi, M. Farajtabar, P. Biswal, and H. Zha (2019) DyRep: learning representations over dynamic graphs. In ICLR, External Links: Link Cited by: §1, §4.
  • [60] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008. External Links: Link Cited by: §3.1.
  • [61] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In ICLR, Cited by: §1, §5.1, Experimental Settings for Baselines.
  • [62] K. Veselkov et al. (2019) HyperFoods: machine intelligent mapping of cancer-beating molecules in foods. Scientific Reports 9 (1), pp. 1–12. Cited by: §1.
  • [63] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. arXiv:1910.03771. Cited by: §5.1.
  • [64] Y. Xin, Z. Xie, and J. Yang (2016) An adaptive random walk sampling method on dynamic community detection. Expert Systems with Applications 58, pp. 10–19. Cited by: §4.
  • [65] C. Xu, M. Nayyeri, F. Alkhoury, J. Lehmann, and H. S. Yazdi (2019) Temporal knowledge graph completion based on time series gaussian embedding. arXiv:1911.07893. Cited by: §4.
  • [66] D. Xu, C. Ruan, E. Korpeoglu, S. Kumar, and K. Achan (2020) Inductive representation learning on temporal graphs. In ICLR, External Links: Link Cited by: §1, §3.1, §4, §5.1, §5.1, Hyperparameters, Experimental Settings for Baselines.
  • [67] L. Yao, L. Wang, L. Pan, and K. Yao (2016) Link prediction based on common-neighbors for dynamic social network. Procedia Computer Science 83, pp. 82–89. Cited by: §4.
  • [68] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec (2018) Graph convolutional neural networks for web-scale recommender systems. In KDD ’18. Cited by: §1.
  • [69] B. Yu, M. Li, J. Zhang, and Z. Zhu (2019) 3D graph convolutional networks with temporal graphs: a spatial information free framework for traffic forecasting. arXiv:1903.00919. External Links: 1903.00919 Cited by: §1, §4.
  • [70] W. Yu, W. Cheng, C. C. Aggarwal, H. Chen, and W. Wang (2017) Link prediction with spatial and temporal consistency in dynamic networks.. In IJCAI, pp. 3343–3349. Cited by: §4.
  • [71] W. Yu, W. Cheng, C. C. Aggarwal, K. Zhang, H. Chen, and W. Wang (2018) Netwalk: a flexible deep embedding approach for anomaly detection in dynamic networks. In KDD ’18, pp. 2672–2681. Cited by: §1, §4.
  • [72] M. Zhang and Y. Chen (2018) Link prediction based on graph neural networks. In NIPS, Cited by: §1.
  • [73] L. Zhou, Y. Yang, X. Ren, F. Wu, and Y. Zhuang (2018) Dynamic network embedding by modeling triadic closure process. In AAAI, Cited by: §4.
  • [74] J. Zhu, Q. Xie, and E. J. Chin (2012) A hybrid time-series link prediction framework for large social network. In DEXA, pp. 345–359. Cited by: §4.
  • [75] Y. Zhu, H. Li, Y. Liao, B. Wang, Z. Guan, H. Liu, and D. Cai (2017) What to do next: modeling user behaviors by time-lstm.. In IJCAI, Vol. 17, pp. 3602–3608. Cited by: §4.
  • [76] M. Zitnik, M. Agrawal, and J. Leskovec (2018) Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics 34 (13), pp. i457–i466. Cited by: §1.

Supplementary Material


The statistics of the three datasets used are reported in Table 5.

Wikipedia Reddit Twitter
# Nodes 9,227 11,000 8,926
# Edges 157,474 672,447 130,865
# Edge features 172 172 768
Edge feature type LIWC LIWC BERT
Timespan 30 days 30 days 7 days
Chronological split 70%-15%-15% 70%-15%-15% 70%-15%-15%
# Nodes with dynamic labels 217 366 –
Table 5: Statistics of the datasets used in the experiments.
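The 70%-15%-15% chronological split in Table 5 can be sketched as follows. This is an illustrative helper, not the authors' actual preprocessing code; the event representation and function name are assumptions.

```python
def chronological_split(events, val_frac=0.15, test_frac=0.15):
    """Split time-ordered interaction events into train/val/test
    partitions (70%-15%-15% by default), so that every validation
    event occurs after every training event, and every test event
    after every validation event."""
    events = sorted(events, key=lambda e: e["t"])  # ensure time order
    n = len(events)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    n_train = n - n_val - n_test
    train = events[:n_train]
    val = events[n_train:n_train + n_val]
    test = events[n_train + n_val:]
    return train, val, test
```

Splitting by event index on the time-sorted stream (rather than randomly) is what makes the inductive evaluation meaningful: nodes first seen in the validation or test portion are genuinely unseen at training time.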


For all the models and datasets we used the same hyperparameters, which were found to work well in the TGAT paper [66]; they are reported in Table 6.


Memory Dimension 172
Node Embedding Dimension 100
Time Embedding Dimension 100
# Attention Heads 2
Dropout 0.1
Table 6: Model Hyperparameters.
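The hyperparameters in Table 6 could be collected in a small configuration object like the sketch below. The class and field names are illustrative and do not come from the authors' code.

```python
from dataclasses import dataclass


@dataclass
class TGNConfig:
    """Hypothetical container for the hyperparameters of Table 6,
    shared across all models and datasets in the experiments."""
    memory_dim: int = 172            # memory dimension
    node_embedding_dim: int = 100    # node embedding dimension
    time_embedding_dim: int = 100    # time embedding dimension
    num_attention_heads: int = 2     # heads in the graph attention module
    dropout: float = 0.1             # dropout rate
```

A single shared config keeps comparisons between variants (TGN-attn, TGN-mean, etc.) controlled: only the modules differ, not the capacity-related hyperparameters.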

Experimental Settings for Baselines

Our results for GAE [34], VGAE [34], DeepWalk [51], Node2Vec [23], GAT [61], GraphSAGE [26], CTDNE [47], and TGAT [66] are taken directly from the TGAT paper [66].

For Jodie [36], we implement our own version in PyTorch as a specific case of our framework with the temporal embedding module and the t-batch training algorithm.
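The Jodie-style temporal embedding mentioned above projects a node's memory state forward by the elapsed time, z_i(t) = (1 + Δt·w) ∘ s_i. The sketch below uses NumPy in place of a learned PyTorch parameter; the function name and the stand-in for the learnable vector `w` are assumptions.

```python
import numpy as np


def time_projection_embedding(memory, delta_t, w):
    """Jodie-style time projection of a node's memory state:
    z_i(t) = (1 + delta_t * w) * s_i, where `memory` is the node's
    memory vector s_i, `delta_t` the time since its last update,
    and `w` stands in for a learnable projection vector."""
    return (1.0 + delta_t * w) * memory
```

When no time has elapsed (Δt = 0), the embedding reduces to the memory itself; as Δt grows, the learned vector `w` controls how each memory dimension drifts, which is what lets this embedding produce up-to-date representations for nodes that have been inactive for a while.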