TGN: Temporal Graph Networks
Graph Neural Networks (GNNs) have recently become increasingly popular due to their ability to learn complex systems of relations or interactions arising in a broad spectrum of problems ranging from biology and particle physics to social networks and recommendation systems. Despite the plethora of different models for deep learning on graphs, few approaches have been proposed thus far for dealing with graphs that present some sort of dynamic nature (e.g. evolving features or connectivity over time). In this paper, we present Temporal Graph Networks (TGNs), a generic, efficient framework for deep learning on dynamic graphs represented as sequences of timed events. Thanks to a novel combination of memory modules and graph-based operators, TGNs significantly outperform previous approaches while at the same time being more computationally efficient. We furthermore show that several previous models for learning on dynamic graphs can be cast as specific instances of our framework. We perform a detailed ablation study of different components of our framework and devise the best configuration that achieves state-of-the-art performance on several transductive and inductive prediction tasks for dynamic graphs.
Deep learning on graphs has produced a sequence of successes, gaining increasing popularity in machine learning. Graphs are ubiquitously used as models for systems of relations and interactions in many fields Battaglia et al. (2016); Qi et al. (2018); Monti et al. (2016); Choma et al. (2018); Duvenaud et al. (2015); Gilmer et al. (2017); Parisot et al. (2018); Rossi et al. (2019), in particular the social sciences Ying et al. (2018); Monti et al. (2019) and biology Zitnik et al. (2018); Veselkov et al. (2019); Gainza et al. (2019). Learning on such data is possible using graph neural networks (GNNs) Hamilton et al. (2017b), which typically operate by a message passing mechanism Battaglia et al. (2018), aggregating information in the neighborhood of a node to create node embeddings that are then used for node-wise classification Monti et al. (2016); Velickovic et al. (2018); Kipf and Welling (2017) or edge prediction Zhang and Chen (2018) tasks.
The majority of methods for deep learning on graphs assume that the underlying graph is static. However, most real-life systems of interactions, such as social networks or biological interactomes, are dynamic. While it is often possible to apply static graph deep learning models Liben-Nowell and Kleinberg (2007) to dynamic graphs by ignoring the temporal evolution, this has been shown to be sub-optimal Xu et al. (2020), and in some cases it is the dynamic structure that contains crucial insights about the system. Learning on dynamic graphs is relatively recent, and most works are limited to the setting of discrete-time dynamic graphs represented as a sequence of snapshots of the graph over time Liben-Nowell and Kleinberg (2007); Dunlavy et al. (2011); Yu et al. (2019); Sankar et al. (2020); Pareja et al. (2019); Yu et al. (2018). Few approaches support the inductive setting of generalizing to new nodes not seen during training Nguyen et al. (2018a); Bastas et al. (2019); Trivedi et al. (2019); Kumar et al. (2019). Such approaches are unsuitable for interesting real-world settings such as social networks, where dynamic graphs are continuous (i.e. edges can appear at any time) and evolving (i.e. new nodes join the graph continuously).
In this paper, we first propose the generic inductive framework of Temporal Graph Networks (TGNs) operating on continuous-time dynamic graphs represented as a sequence of events, and show that many previous methods are specific instances of TGNs. Second, we propose a novel training strategy allowing the model to learn from the sequentiality of the data while maintaining highly efficient parallel processing. We show that this leads to an order of magnitude speed up over previous methods. Third, we perform a detailed ablation study of different components of our framework and analyze the tradeoff between speed and accuracy. Finally, we show state-of-the-art performance on multiple tasks and datasets in both transductive and inductive settings, while being much faster than previous methods.
A static graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ comprises nodes $\mathcal{V} = \{1, \ldots, n\}$ and edges $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$, which are endowed with features, denoted by $\mathbf{v}_i$ and $\mathbf{e}_{ij}$ for all $i, j = 1, \ldots, n$, respectively. A typical graph neural network (GNN) creates an embedding $\mathbf{z}_i$ of the nodes by learning a local aggregation rule of the form
$$\mathbf{z}_i = \sum_{j \in n_i} h(\mathbf{m}_{ij}, \mathbf{v}_i), \qquad \mathbf{m}_{ij} = \mathrm{msg}(\mathbf{v}_i, \mathbf{v}_j, \mathbf{e}_{ij}),$$
which is interpreted as message passing from the neighbors $j$ of $i$. Here, $n_i$ denotes the neighborhood of node $i$, and $\mathrm{msg}$ and $h$ are learnable functions.
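The aggregation rule above can be sketched in a few lines of plain Python. Note this is a toy illustration under our own naming: the concatenation-based `msg` and the function `h` are stand-ins for the learnable functions, not the paper's implementation.

```python
import numpy as np

def msg(v_i, v_j, e_ij):
    # Toy message function: concatenate receiver, sender and edge features
    # (a stand-in for a learnable msg; an assumption for illustration).
    return np.concatenate([v_i, v_j, e_ij])

def aggregate(i, neighbors, v, e, h):
    # z_i = sum over neighbors j of h(m_ij, v_i): one layer of message passing.
    return np.sum([h(msg(v[i], v[j], e[(j, i)]), v[i]) for j in neighbors], axis=0)
```

In practice both `msg` and `h` would be parameterized (e.g. as MLPs) and learned jointly with the downstream task.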
There exist two main classes of dynamic graphs. Discrete-time dynamic graphs (DTDGs) are sequences of static graph snapshots taken at intervals in time. Continuous-time dynamic graphs (CTDGs) are more general and can be represented as timed lists of events, which may include edge addition or deletion, node addition or deletion, and node or edge feature transformations. In this paper, we do not consider deletion events.
Our temporal (multi-)graph is modeled as a sequence of time-stamped events $\mathcal{G} = \{x(t_1), x(t_2), \ldots\}$, representing the addition or change of a node or an interaction between a pair of nodes at times $0 \le t_1 \le t_2 \le \ldots$. An event can be of two types: 1) A node-wise event is represented by $\mathbf{v}_i(t)$, where $i$ denotes the index of the node and $\mathbf{v}$ is the vector attribute associated with the event. After its first appearance, a node is assumed to live forever and its index is used consistently for the following events. 2) An interaction event between nodes $i$ and $j$ is represented by a (directed) temporal edge $\mathbf{e}_{ij}(t)$ (there might be more than one edge between a pair of nodes, so technically this is a multigraph). We denote by $\mathcal{V}(T)$ and $\mathcal{E}(T)$ the temporal set of vertices and edges, respectively, and by $n_i(T)$ the neighborhood of node $i$ in time interval $T$; $n_i^k(T)$ denotes the $k$-hop neighborhood. A snapshot of the temporal graph at time $t$ is the (multi-)graph $\mathcal{G}(t) = (\mathcal{V}[0, t], \mathcal{E}[0, t])$ with $n(t)$ nodes.
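As an illustration, a continuous-time event stream and a temporal-neighborhood query can be represented minimally as follows; the `Interaction` class and `neighborhood` function are our own naming for exposition, not part of the paper.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Interaction:
    # A directed temporal edge e_ij(t): src interacts with dst at time t.
    src: int
    dst: int
    t: float
    features: Tuple[float, ...] = ()

def neighborhood(events, i, t_start, t_end):
    # Temporal neighborhood of node i within the interval [t_start, t_end].
    return [ev for ev in events
            if t_start <= ev.t <= t_end and i in (ev.src, ev.dst)]
```

A real implementation would index events per node for efficiency rather than scanning the whole list.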
Following the terminology in Kazemi et al. (2020), a neural model for dynamic graphs can be regarded as an encoder-decoder pair, where an encoder is a function that maps from a dynamic graph to node embeddings, and a decoder takes as input one or more node embeddings and makes a prediction based on these, e.g. node classification or edge prediction. The key contribution of this paper is a novel Temporal Graph Network (TGN) encoder applied on a continuous-time dynamic graph represented as a sequence of time-stamped events, producing, for each time $t$, the embedding of the graph nodes $\mathbf{Z}(t) = (\mathbf{z}_1(t), \ldots, \mathbf{z}_{n(t)}(t))$.
The memory (state) of the model at time $t$ consists of a vector $\mathbf{s}_i(t)$ for each node $i$ the model has seen so far. The memory of a node is updated when the node is involved in an event (e.g. an interaction with another node or a node-wise change), and its purpose is to represent the history of the node in a compressed format. Thanks to this module, TGNs have the capability to memorize long-term dependencies for each node in the graph.
In addition, a global memory can be added to the model to track the evolution of the entire temporal network. While we envisage the benefits such a memory could bring (e.g. information can easily travel long distances in the graph, nodes' memory can be updated w.r.t. changes in the global state, and graph-wise predictions can be made easily from the global memory), this direction has not been explored in this work and is left to future research.
For each event involving node $i$, a message is computed to update $i$'s memory. In the case of an interaction event $\mathbf{e}_{ij}(t)$ between nodes $i$ and $j$ at time $t$, two messages can be computed for the source and target nodes that respectively start and receive the interaction:
$$\mathbf{m}_i(t) = \mathrm{msg}_s\big(\mathbf{s}_i(t^-), \mathbf{s}_j(t^-), \Delta t, \mathbf{e}_{ij}(t)\big), \qquad \mathbf{m}_j(t) = \mathrm{msg}_d\big(\mathbf{s}_j(t^-), \mathbf{s}_i(t^-), \Delta t, \mathbf{e}_{ij}(t)\big).$$
Similarly, in case of a node-wise event $\mathbf{v}_i(t)$, a single message can be computed for the node involved in the event:
$$\mathbf{m}_i(t) = \mathrm{msg}_n\big(\mathbf{s}_i(t^-), t, \mathbf{v}_i(t)\big).$$
Here, $\mathbf{s}_i(t^-)$ is the memory of node $i$ just before time $t$, and $\mathrm{msg}_s$, $\mathrm{msg}_d$ and $\mathrm{msg}_n$ are learnable message functions, e.g. MLPs. In all our experiments, we chose the message function as identity (id), which is simply the concatenation of the inputs, for the sake of simplicity.
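The identity message function amounts to a concatenation of its inputs; a sketch under our own naming, with flat NumPy vectors assumed for memories and edge features:

```python
import numpy as np

def id_message_source(s_i, s_j, delta_t, e_ij):
    # Identity (id) message for the source node: concatenation of the memories
    # of both endpoints, the time delta, and the edge features.
    return np.concatenate([s_i, s_j, [delta_t], e_ij])
```

The destination-node message would simply swap the roles of `s_i` and `s_j`.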
Resorting to batch processing for efficiency reasons may lead to multiple events involving the same node in the same batch. As each event generates a message in our formulation, we use a mechanism to aggregate messages $\mathbf{m}_i(t_1), \ldots, \mathbf{m}_i(t_b)$ for $t_1, \ldots, t_b \le t$:
$$\bar{\mathbf{m}}_i(t) = \mathrm{agg}\big(\mathbf{m}_i(t_1), \ldots, \mathbf{m}_i(t_b)\big).$$
Here, $\mathrm{agg}$ is an aggregation function. While multiple choices can be considered for implementing this module (e.g. RNNs or attention w.r.t. the node memory), for the sake of simplicity we considered two efficient non-learnable solutions in our experiments: most recent message (keep only the most recent message for a given node) and mean message (average all messages for a given node). We leave learnable aggregation as a future research direction.
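The two non-learnable aggregators can be sketched as follows, assuming each message is stored as a (timestamp, vector) pair — a representation of our choosing for illustration:

```python
import numpy as np

def last_message(messages):
    # Keep only the most recent message for the node.
    return max(messages, key=lambda m: m[0])[1]

def mean_message(messages):
    # Average all of the node's messages in the batch.
    return np.mean([vec for _, vec in messages], axis=0)
```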
As previously mentioned, the memory of a node is updated upon each event involving the node itself:
$$\mathbf{s}_i(t) = \mathrm{mem}\big(\bar{\mathbf{m}}_i(t), \mathbf{s}_i(t^-)\big).$$
For interaction events involving two nodes $i$ and $j$, the memories of both nodes are updated after the event has happened. For node-wise events, only the memory of the related node is updated. Here, $\mathrm{mem}$ is a learnable memory update function, e.g. a recurrent neural network such as an LSTM Hochreiter and Schmidhuber (1997) or GRU Cho et al. (2014).
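As a minimal sketch of such an update, here is a GRU-style cell written out with NumPy; the weight layout and dimensions are illustrative assumptions, with random matrices standing in for learned parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_memory_update(m_bar, s_prev, W, U, b):
    # s_i(t) = mem(m_bar_i(t), s_i(t^-)) with mem realized as a GRU cell.
    # W, U, b each hold update-gate ("z"), reset-gate ("r") and candidate ("h") parameters.
    z = sigmoid(W["z"] @ m_bar + U["z"] @ s_prev + b["z"])   # update gate
    r = sigmoid(W["r"] @ m_bar + U["r"] @ s_prev + b["r"])   # reset gate
    s_tilde = np.tanh(W["h"] @ m_bar + U["h"] @ (r * s_prev) + b["h"])
    return (1.0 - z) * s_prev + z * s_tilde                  # new memory s_i(t)
```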
The embedding module is used to generate the temporal embedding $\mathbf{z}_i(t)$ of node $i$ at any time $t$. Its main goal is to avoid the so-called memory staleness problem Kazemi et al. (2020). Since the memory of a node $i$ is updated only when the node is involved in an event, it might happen that, in the absence of events for a long time (e.g. a social network user who stops using the platform for some time before becoming active again), $i$'s memory becomes stale. While multiple implementations of the embedding module are possible, we use the form
$$\mathbf{z}_i(t) = \mathrm{emb}(i, t) = \sum_{j \in n_i^k([0, t])} h\big(\mathbf{s}_i(t), \mathbf{s}_j(t), \mathbf{e}_{ij}, \mathbf{v}_i(t), \mathbf{v}_j(t)\big),$$
where $h$ is a learnable function. This includes many different formulations as particular cases:
Identity (id): $\mathbf{z}_i(t) = \mathbf{s}_i(t)$, which uses the memory directly as the node embedding.
Time projection (time): $\mathbf{z}_i(t) = (1 + \Delta t\, \mathbf{w}) \circ \mathbf{s}_i(t)$, where $\mathbf{w}$ are learnable parameters, $\Delta t$ is the time since the last interaction, and $\circ$ denotes element-wise vector product. This version of the embedding method was used in JODIE Kumar et al. (2019).
Temporal Graph Attention (attn): A series of $L$ graph attention layers compute $i$'s embedding by aggregating information from its $L$-hop temporal neighborhood. The input to the $l$-th layer is $i$'s representation $\mathbf{h}_i^{(l-1)}(t)$, the current timestamp $t$, $i$'s neighborhood representation $\{\mathbf{h}_1^{(l-1)}(t), \ldots, \mathbf{h}_N^{(l-1)}(t)\}$ together with timestamps $t_1, \ldots, t_N$ and features $\mathbf{e}_{1i}(t_1), \ldots, \mathbf{e}_{Ni}(t_N)$ for each of the considered interactions which form an edge in $i$'s temporal neighborhood:
$$\mathbf{h}_i^{(l)}(t) = \mathrm{MLP}^{(l)}\big(\mathbf{h}_i^{(l-1)}(t) \,\|\, \tilde{\mathbf{h}}_i^{(l)}(t)\big),$$
$$\tilde{\mathbf{h}}_i^{(l)}(t) = \mathrm{MultiHeadAttention}^{(l)}\big(\mathbf{q}^{(l)}(t), \mathbf{K}^{(l)}(t), \mathbf{V}^{(l)}(t)\big),$$
$$\mathbf{q}^{(l)}(t) = \mathbf{h}_i^{(l-1)}(t) \,\|\, \phi(0), \qquad \mathbf{K}^{(l)}(t) = \mathbf{V}^{(l)}(t) = \mathbf{C}^{(l)}(t),$$
$$\mathbf{C}^{(l)}(t) = \big[\mathbf{h}_1^{(l-1)}(t) \,\|\, \mathbf{e}_{1i}(t_1) \,\|\, \phi(t - t_1), \;\ldots,\; \mathbf{h}_N^{(l-1)}(t) \,\|\, \mathbf{e}_{Ni}(t_N) \,\|\, \phi(t - t_N)\big].$$
Here, $\phi(\cdot)$ represents a generic time encoding Xu et al. (2020), $\|$ is the concatenation operator and $\mathbf{z}_i(t) = \mathbf{h}_i^{(L)}(t)$. Each layer amounts to performing multi-head attention Vaswani et al. (2017) where the query ($\mathbf{q}^{(l)}(t)$) is a reference node (i.e. the target node or one of its $L-1$-hop neighbors), and the keys and values are its neighbors. Finally, an MLP is used to combine the reference node representation with the aggregated information. Differently from the original formulation of this layer (first proposed in TGAT Xu et al. (2020)), where no node-wise temporal features were used, in our case the input representation of each node is $\mathbf{h}_j^{(0)}(t) = \mathbf{s}_j(t) + \mathbf{v}_j(t)$, and as such it allows the model to exploit both the current memory $\mathbf{s}_j(t)$ and the temporal node features $\mathbf{v}_j(t)$.
Temporal Graph Sum (sum): A simpler and faster aggregation over the graph:
$$\mathbf{h}_i^{(l)}(t) = \mathbf{W}_2^{(l)}\big(\mathbf{h}_i^{(l-1)}(t) \,\|\, \tilde{\mathbf{h}}_i^{(l)}(t)\big),$$
$$\tilde{\mathbf{h}}_i^{(l)}(t) = \mathrm{ReLU}\Big(\sum_{j \in n_i([0, t])} \mathbf{W}_1^{(l)}\big(\mathbf{h}_j^{(l-1)}(t) \,\|\, \mathbf{e}_{ij} \,\|\, \phi(t - t_j)\big)\Big).$$
Here as well, $\phi(\cdot)$ is a time encoding and $\mathbf{z}_i(t) = \mathbf{h}_i^{(L)}(t)$.
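To make the embedding variants concrete, here is a sketch of the identity, time-projection and temporal-graph-sum modules. The cosine time encoding, dimensions, and random weight matrices are illustrative assumptions, not the paper's learned parameters:

```python
import numpy as np

def phi(dt, omega):
    # Toy harmonic time encoding, a stand-in for the generic phi of Xu et al. (2020).
    return np.cos(dt * omega)

def embed_id(s_i):
    # Identity: use the memory directly as the embedding.
    return s_i

def embed_time(s_i, dt, w):
    # JODIE-style time projection: (1 + dt * w) o s_i (element-wise product).
    return (1.0 + dt * w) * s_i

def embed_sum(h_i, neighbors, t, W1, W2, omega):
    # Temporal graph sum for one layer:
    #   h_tilde = ReLU(sum_j W1 [h_j || e_ij || phi(t - t_j)])
    #   h_i'    = W2 [h_i || h_tilde]
    # neighbors: list of (h_j, e_ij, t_j) triples.
    acc = sum(W1 @ np.concatenate([h_j, e_ij, phi(t - t_j, omega)])
              for h_j, e_ij, t_j in neighbors)
    h_tilde = np.maximum(acc, 0.0)
    return W2 @ np.concatenate([h_i, h_tilde])
```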
Our TGN model can be trained for a variety of tasks such as future edge prediction (self-supervised setting) or node classification (semi-supervised setting). We present two possible training procedures for TGNs while using the link prediction task as a simple example: provided a list of ordered timed interactions, the goal of the model is to predict the future interactions from those observed in the past. Both training procedures are detailed in Algorithms 1 and 2, and Figure 2 depicts how TGN modules are combined.
Figure 1 shows that interactions serve two purposes: 1) they are the training objective, 2) they are used to update the memory. While the interactions in a batch cannot be used to update the memory before predicting the same interactions (as this would leak information), reversing the order of the operations, i.e. predicting the interactions and computing the loss before updating the memory, causes all memory-related modules (Message Function, Message Aggregator, and Memory Updater) not to receive a gradient (Algorithm 1). Therefore, extra steps must be taken in order to train these modules.
The simplest strategy keeps the same order of operations as Algorithm 1 (predict interactions, then update memory), but breaks every batch (by "batch" we refer to what is sometimes defined as a mini-batch, i.e. a subset of the original dataset used for mini-batch gradient descent) of size $B$ into sub-batches of size $B/n$. The sub-batches are processed sequentially, with their losses accumulated, and backpropagation is performed only after the last sub-batch. If a node appears in two sub-batches, its memory in the second sub-batch will depend on the computation done by the memory-related modules in the first. Therefore, these modules will receive a gradient.
While the basic training procedure is straightforward to implement, it presents two drawbacks: 1) it slows down the training, as each batch is not computed fully in parallel, 2) the only nodes that contribute to the memory-related modules’ gradients are those with at least one interaction in multiple sub-batches. Therefore, these modules can still receive no gradient if sub-batches do not share any nodes, or the gradient can be heavily skewed towards a few nodes that appear multiple times, leading to biased update steps and ultimately to a sub-optimal local minimum for the overall training procedure.
The solution to this problem is to reverse the order of operations. Let $t_i^-$ be the time of node $i$'s last interaction in its last sub-batch $b_i$. Instead of letting the memory be representative of the entire set of interactions involving $i$ in the past, we store the memory $\mathbf{s}_i(t_i^-)$, i.e. the state of $i$ prior to the last sub-batch $b_i$, together with the raw information needed to update $\mathbf{s}_i$ with the interactions of $b_i$ (i.e. the set of raw update messages of $i$'s interactions in $b_i$). At the beginning of each sub-batch, the model first updates the nodes' memories by computing and aggregating messages from the stored raw information (line 2 of Algorithm 2), then uses the updated memory to infer the embeddings and computes the loss function (Figure 2, right). As a result, the loss function depends on a memory which has just been updated by its related modules. Moreover, all nodes involved in the computation of the embeddings (i.e. all source and target nodes and related neighbors) contribute to the gradients, ultimately producing more stable optimization and better local minima (Figure 2(b)).
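A toy rendition of this order of operations (update memory from the stored raw messages, predict, then store the new raw messages) is given below. The store layout, scalar memories, and the `update`/`loss_fn` callables are simplified assumptions for illustration only:

```python
def train_step(batch_nodes, batch_time, memory, raw_store, update, loss_fn):
    # 1) Bring memories up to date using raw messages stored by previous batches,
    #    so the loss below depends on the memory-related computation.
    for node, msgs in raw_store.items():
        memory[node] = update(memory.get(node, 0.0), msgs)
    raw_store.clear()
    # 2) Predict from the freshly updated memory and compute the loss.
    loss = loss_fn([memory.get(n, 0.0) for n in batch_nodes])
    # 3) Store this batch's interactions as raw messages for the next step.
    for n in batch_nodes:
        raw_store.setdefault(n, []).append(batch_time)
    return loss
```

The key property mirrored here is that a batch's own interactions never update the memory before being predicted, yet the memory the prediction reads has just passed through the (trainable) update path.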
While the advanced training strategy is sufficient to train TGNs, it can also be combined with the basic strategy by breaking each batch into sub-batches. We investigate the speed vs accuracy tradeoff of different combinations of the two strategies in Section 5.
Early models for learning on dynamic graphs focused on Discrete Time Dynamic Graphs (DTDGs). Such approaches either aggregate graph snapshots and then apply static methods Liben-Nowell and Kleinberg (2007); Hisano (2018); Sharan and Neville (2008); Ibrahim and Chen (2015); Ahmed and Chen (2016); Ahmed et al. (2016), assemble snapshots into tensors and factorize Dunlavy et al. (2011); Yu et al. (2017); Ma et al. (2019), or encode each snapshot to produce a series of embeddings. In the latter case, the embeddings are either aggregated by taking a weighted sum Yao et al. (2016); Zhu et al. (2012), fit to time series models Huang and Lin (2009); Güneş et al. (2016); da Silva Soares and Prudêncio (2012); Moradabadi and Meybodi (2017), used as components in RNNs Seo et al. (2018); Narayan and Roe (2018); Manessi et al. (2020); Yu et al. (2019); Chen et al. (2018); Sankar et al. (2020); Pareja et al. (2019), or learned by imposing a smoothness constraint over time Kim and Han (2009); Gupta et al. (2011); Yao et al. (2016); Zhu et al. (2017); Zhou et al. (2018); Singer et al. (2019); Goyal et al. (2018); Fard et al. (2019); Pei et al. (2016). Another line of work encodes DTDGs by first performing random walks on an initial snapshot and then modifying the walk behaviour for subsequent snapshots Mahdavi et al. (2018); Du et al. (2018); Xin et al. (2016); De Winter et al. (2018); Yu et al. (2018).
Random-walk models have also been extended to CTDGs, incorporating continuous time through constraints on transition probabilities. Sequence-based approaches for CTDGs Kumar et al. (2019); Trivedi et al. (2017, 2019); Ma et al. (2018) use RNNs to update the representations of the source and destination node each time a new edge appears. Other recent works have focused on dynamic knowledge graphs Goel et al. (2019); Xu et al. (2019); Dasgupta et al. (2018); García-Durán et al. (2018).
Most recent CTDG learning models can be interpreted as specific cases of our framework (see Table 1). For example, Jodie Kumar et al. (2019) uses the time projection embedding module. TGAT Xu et al. (2020) is a specific case of TGN in which the memory and its related modules are missing, and graph attention is used as the embedding module. Finally, we note that TGN generalizes the Graph Networks (GN) model Battaglia et al. (2018) for static graphs (with the exception of the global block mentioned before), and thus the majority of existing message passing-type architectures.
For additional background, we refer the reader to surveys on general graph representation learning Bronstein et al. (2017); Hamilton et al. (2017a); Battaglia et al. (2018) and the recent survey on dynamic graph learning Kazemi et al. (2020).
| | Mem. | Mem. Update | Embedding | Mess. Agg. | Mess. Func. |
|---|---|---|---|---|---|
| TGAT | — | — | attn (2l, 20n) | — | — |
| TGN-attn | node | GRU | attn (1l, 10n) | last | id |
| TGN-2l | node | GRU | attn (2l, 10n) | last | id |
| TGN-no-mem | — | — | attn (1l, 10n) | — | id |
| TGN-sum | node | GRU | sum (1l, 10n) | last | id |
| TGN-mean | node | GRU | attn (1l, 10n) | mean | id |
We use three datasets in our experiments: Wikipedia, Reddit Kumar et al. (2019), and Twitter. Reddit and Wikipedia are bipartite interaction graphs. In the Reddit dataset, users and sub-reddits are nodes, and an interaction occurs when a user writes a post to the sub-reddit. In the Wikipedia dataset, users and pages are nodes, and an interaction represents a user editing a page. In both aforementioned datasets, the interactions are represented by text features (of a post or page edit, respectively), and labels represent whether a user is banned. Both interactions and labels are time-stamped.
The Twitter dataset is a non-bipartite graph released as part of the 2020 RecSys Challenge Belli et al. (2020). Nodes are users and interactions are retweets. The features of an interaction are a BERT-based Wolf et al. (2019) vector representation of the text of the retweet. We use a subset of the original dataset formed by taking the largest connected component of the retweet graph and retaining only the nodes with the highest in- and out-degrees. Dataset statistics together with more details are provided in the supplementary material.
Our experimental setup closely follows Xu et al. (2020) and focuses on the tasks of future edge prediction and dynamic node classification. For the former, we use both the transductive and inductive settings. In the transductive task, we predict future links of nodes which were observed during training, whereas in the inductive task we predict future links of nodes never observed before. The transductive setting is used for node classification. We perform the same 70%-15%-15% chronological split as in Xu et al. (2020).
The goal is to predict the probability of an edge occurring between two nodes at a given time. Our encoder is combined with a simple MLP decoder mapping from the concatenation of two node embeddings to the probability of the edge. For the Wikipedia and Reddit datasets, we use the Adam optimizer, the same batch size for training, validation and testing, and early stopping with a patience of 5. For the Twitter dataset, the only change is the learning rate. We sample negatives equal in number to the positive interactions, and use average precision as the reference metric. All results are averaged over 10 runs to obtain mean and standard deviation.
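The decoder described above can be sketched as a one-hidden-layer MLP over the concatenated embeddings; the layer sizes and random weights below are illustrative stand-ins for learned parameters:

```python
import numpy as np

def edge_probability(z_src, z_dst, W1, b1, W2, b2):
    # MLP decoder: concatenate the two node embeddings, apply one ReLU
    # hidden layer, then a sigmoid to yield the probability of a future edge.
    x = np.concatenate([z_src, z_dst])
    h = np.maximum(W1 @ x + b1, 0.0)
    logit = W2 @ h + b2
    return 1.0 / (1.0 + np.exp(-logit))
```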
The task is to predict a binary label indicating whether a user was banned at a specific time. We pre-train our encoder on the future edge prediction task, then freeze it and combine it with a task-specific MLP decoder. We use the Adam optimizer, with the same batch size for training, validation and testing. The metric used is the ROC-AUC. All results are averaged over 10 runs to obtain mean and standard deviation.
Our strong baselines are state-of-the-art approaches for continuous time dynamic graphs (CTDNE Nguyen et al. (2018b), Jodie Kumar et al. (2019), and TGAT Xu et al. (2020)) as well as state-of-the-art models for static graphs (GAE Kipf and Welling (2016), VGAE Kipf and Welling (2016), DeepWalk Perozzi et al. (2014), Node2Vec Grover and Leskovec (2016), GAT Velickovic et al. (2018) and GraphSAGE Hamilton et al. (2017a)).
Table 2 presents the results on future edge prediction. Our model clearly outperforms the baselines by a large margin in both the transductive and the inductive setting on all datasets. The gap is particularly large on the Twitter dataset, where we outperform the second-best method (TGAT) by over 25%. Table 4 shows the results on dynamic node classification, where again our model obtains state-of-the-art results, with a large improvement over all other methods.
Due to the efficient parallel processing and the need for only one graph attention layer (see Section 5.3 for the ablation study on the number of layers), our model is up to an order of magnitude faster than Jodie and TGAT in completing a single epoch (see Figure 2(a)), while requiring a similar number of epochs to converge.
We perform a detailed ablation study comparing different instances of our TGN framework. We are particularly interested in the speed vs accuracy tradeoff resulting from the choice of modules and their combination. The variants we experiment with are reported in Table 1 and their results are depicted in Figure 2(a).
We compare a model that does not make use of a memory (TGN-no-mem) with a model which uses memory (TGN-attn) but is otherwise identical. While TGN-attn is slower, it vastly outperforms TGN-no-mem, confirming the importance of memory for effective learning on dynamic graphs, due to its ability to store long-term information about a node which is otherwise hard to capture.
We compared models with different embedding modules (TGN-id, TGN-time, TGN-attn, TGN-sum). The first interesting insight is that projecting the embedding in time seems to slightly hurt, as shown by the fact that TGN-time underperforms TGN-id. Moreover, the ability to exploit the graph is crucial for performance: we note that all graph-based projections (TGN-attn, TGN-sum) outperform the graph-less TGN-id model by a large margin, with TGN-attn being the top performer at the expense of being slightly slower than the simpler TGN-sum.
We compared two models, one using the last-message aggregator (TGN-attn) and one using a mean aggregator (TGN-mean), but otherwise the same. While TGN-mean performs slightly better, it is considerably slower.
While in TGAT having 2 layers is of fundamental importance for obtaining good performance (TGAT vs TGAT-1l), in TGN the presence of the memory means that 1 layer is enough to obtain very high performance (TGN-attn vs TGN-2l). This is probably because, when accessing the memory of the 1-hop neighbors, we are indirectly accessing information from hops further away. Moreover, being able to use only 1 layer of graph attention speeds up the model dramatically.
Table 4 shows the configurations we experimented with for this ablation study, while Figure 2(b) presents the results. The TGN-id model uses only the memory (no embedding module) and therefore makes a perfect testbed for training strategies related to the memory-related modules. Looking at the results with the TGN-id model, our proposed strategy of updating the memory at the start of the batch clearly outperforms updating at the end.
Interestingly, when using a graph attention embedding module (TGN-attn), the benefit of the advanced strategy shrinks. This is probably due to the fact that the embedding module is able to adapt to the partially trained memory-related modules, effectively denoising the spurious behavior of the nodes' memory.
We introduce TGN, a generic framework for learning on continuous-time dynamic graphs. We obtain state-of-the-art results on several tasks and datasets while being faster than previous methods. Detailed ablation studies show the importance of the memory and its related modules to store long-term information, as well as the importance of the graph-based embedding module to generate up-to-date node embeddings. We envision interesting applications of TGN in the fields of social sciences, recommender systems, and biological interaction networks, opening up a future research direction of exploring more advanced settings of our model and understanding the most appropriate domain-specific choices.
Graph neural networks able to effectively process temporal graphs can potentially serve a variety of purposes in our society, e.g. improved recommender systems that take into account the evolving nature of users on social networks or marketplaces, as well as better filtering mechanisms for the detection of unhealthy behaviors such as spam or coordinated manipulation. At the same time, due to the novelty of these approaches, the robustness of such architectures w.r.t. external adversarial attacks has not yet been validated in the literature. Additional studies will thus need to be carried out to identify the potential risks and benefits that temporal graph neural networks may present when subjected to adversarial attacks, before they can be applied to sensitive personal data and extensively exploited in industrial applications.
HyTE: hyperplane-based temporally aware knowledge graph embedding. In EMNLP, pp. 2001–2011.
Temporal-relational classifiers for prediction in evolving domains. In ICDM, pp. 540–549.
HuggingFace's transformers: state-of-the-art natural language processing. arXiv:1910.03771.
Graph convolutional neural networks for web-scale recommender systems. In KDD '18.
NetWalk: a flexible deep embedding approach for anomaly detection in dynamic networks. In KDD '18, pp. 2672–2681.
The statistics of the three datasets used are reported in Table 5.

| | Wikipedia | Reddit | Twitter |
|---|---|---|---|
| # Edge features | 172 | 172 | 768 |
| Edge feature type | LIWC | LIWC | BERT |
| Timespan | 30 days | 30 days | 7 days |
| # Nodes with dynamic labels | 217 | 366 | – |
For all the models and datasets we used the same hyperparameters, which had been found to work well in the TGAT paper.
| Hyperparameter | Value |
|---|---|
| Node Embedding Dimension | 100 |
| Time Embedding Dimension | 100 |
| # Attention Heads | 2 |