TCL: Transformer-based Dynamic Graph Modelling via Contrastive Learning

Dynamic graph modeling has recently attracted much attention due to its extensive applications in many real-world scenarios, such as recommendation systems, financial transactions, and social networks. Although many works have been proposed for dynamic graph modeling in recent years, effective and scalable models are yet to be developed. In this paper, we propose a novel graph neural network approach, called TCL, which deals with the dynamically evolving graph in a continuous-time fashion and enables effective dynamic node representation learning that captures both temporal and topological information. Technically, our model contains three novel aspects. First, we generalize the vanilla Transformer to temporal graph learning scenarios and design a graph-topology-aware Transformer. Second, on top of the proposed graph Transformer, we introduce a two-stream encoder that separately extracts representations from the temporal neighborhoods associated with the two interaction nodes and then utilizes a co-attentional Transformer to model inter-dependencies at a semantic level. Lastly, inspired by recent advances in contrastive learning, we propose to optimize our model by maximizing the mutual information (MI) between the predictive representations of two future interaction nodes. Benefiting from this, our dynamic representations preserve high-level (or global) semantics about interactions and are thus robust to noisy interactions. To the best of our knowledge, this is the first attempt to apply contrastive learning to representation learning on dynamic graphs. We evaluate our model on four benchmark datasets for interaction prediction, and the experimental results demonstrate the superiority of our model.



1. Introduction

Representation learning on graphs is attracting increasing interest because it has shown great potential in many real-world applications, ranging from e-commerce (Zhao et al., 2019; He et al., 2020), drug discovery (You et al., 2018; Zitnik et al., 2018), and social networks (Wu et al., 2018; Zhao et al., 2019) to financial transactions (Khazane et al., 2019). Previous works on graph representation learning mainly focus on static settings where the topological structures are assumed fixed. However, graphs in practice are often constantly evolving, i.e., the nodes and their associated interactions (edges) can emerge and vanish over time. For instance, sequences of interactions such as following new friends, sharing news with friends on Twitter, or daily user-item purchasing interactions on Amazon can naturally yield a dynamic graph. Graph neural network (GNN) methods (Defferrard et al., 2016; Hamilton et al., 2017; Veličković et al., 2017) tailored for static graphs usually perform poorly in such dynamic scenarios because they cannot exploit the temporal evolutionary information that is lost in the static setting.

Dynamic graph modeling aims to learn a low-dimensional embedding for each node that effectively encodes the temporal and structural properties of dynamic graphs. This is an appealing yet challenging task due to the sophisticated time-evolving graph structures. Several approaches have been proposed recently. According to the way that dynamic graphs are constructed, these works can be roughly divided into discrete-time methods and continuous-time methods. The former (Pareja et al., 2020; Goyal et al., 2018; Sankar et al., 2020) rely on a discrete-time dynamic graph construction that approximates a dynamic graph as a series of graph snapshots over time. Usually, static graph encoding techniques like GCN (Hamilton et al., 2017) or GAT (Veličković et al., 2017) are applied to each snapshot, and then a recurrent neural network (RNN) (Hochreiter and Schmidhuber, 1997a) or a self-attention mechanism (Vaswani et al., 2017) is introduced to capture complicated temporal dependencies among snapshots. However, the discrete-time methods can be sub-optimal since they ignore the fine-grained temporal and structural information, which might be critical in real-world applications. Moreover, it is unclear how to specify the size of the time intervals in different applications. To tackle these challenges, the latter, continuous-time methods (Rossi et al., 2020; Xu et al., 2020; Dai et al., 2016; Kumar et al., 2019; Chang et al., 2020) have been proposed and have achieved state-of-the-art results. Methods in this direction focus on designing different temporal neighborhood aggregation techniques applied to fine-grained temporal graphs, which are represented as a sequence of temporally ordered interactions. For instance, TGAT (Xu et al., 2020) proposes a continuous-time kernel encoder combined with a self-attention mechanism to aggregate information from the temporal neighborhood. TGNs (Rossi et al., 2020) introduce a generic temporal aggregation framework with a node-wise memory mechanism. Chang et al. (2020) propose a dynamic message passing neural network to capture the high-order graph proximity on their temporal dependency interaction graph (TDIG).

Although continuous-time methods have achieved impressive results, they have some limitations. First, when aggregating information from the temporal neighborhoods of the two target interaction nodes, most of the aforementioned methods (Kumar et al., 2019; Chang et al., 2020; Dai et al., 2016) employ RNN-like architectures. Such methods suffer from vanishing gradients in optimization and are unable to capture long-term dependencies. The learned dynamic representations degrade, especially when applied to complicated temporal graphs. Second, these methods typically compute the dynamic embeddings of the two target interaction nodes separately, without considering the semantic relatedness between their temporal neighborhoods (i.e., history behaviors), which may also be a causal factor for the target interaction. For example, in a temporal co-author network, the fact that nodes A and B have each previously co-authored with node C can promote a new potential collaboration between nodes A and B. Therefore, modeling the mutual influences between the two temporal neighborhoods can yield more informative dynamic representations. Lastly, in optimization, most prior works model the exact future by reconstructing future states (Kumar et al., 2019) or by leveraging a Temporal Point Process (TPP) framework (Dai et al., 2016; Chang et al., 2020) to model the complicated stochastic processes of future interactions. However, they may learn noisy information when trying to fit the next interactions. Besides, computing the survival function of an intensity function in TPP-based methods is expensive when the integral cannot be computed in closed form.

To address the above limitations, we propose a novel continuous-time Transformer-based dynamic graph modeling framework via contrastive learning, called TCL. The main contributions of our work are summarized as follows:

  • We generalize the vanilla Transformer and enable it to handle temporal-topological information on the dynamic graph represented as a sequence of temporally cascaded interactions.

  • To obtain informative dynamic embeddings, we design a two-stream encoder that separately processes the temporal neighborhoods associated with the two target interaction nodes with our graph-topology-aware Transformer and then integrates them at a semantic level through a co-attentional Transformer.

  • To ensure robust learning, we leverage a contrastive learning strategy that maximizes the mutual information (MI) between the predictive representations of future interaction nodes. To the best of our knowledge, this is the first attempt to apply contrastive learning to dynamic graph modeling.

  • Our model is evaluated on four diverse interaction datasets for interaction prediction. Experimental results demonstrate that our method yields consistent and significant improvements over state-of-the-art baselines.

2. Related Work

This section reviews state-of-the-art approaches for dynamic graph learning. Since CL loss is used as our optimization objective, we also review recent works on CL-based graph representation learning.

2.1. Dynamic Graph Modeling

According to how the dynamic graph is constructed, we roughly divide the existing modeling approaches into two categories: discrete-time methods and continuous-time methods.

Discrete-time Methods. Methods in this category deal with a sequence of discretized graph snapshots that coarsely approximates a time-evolving graph. Zhou et al. (2018) utilize temporally regularized weights to enforce the smoothness of nodes’ dynamic embeddings from adjacent snapshots. However, this method may break down when nodes exhibit significantly varying evolutionary behaviors. DynGEM (Goyal et al., 2018) is an autoencoding approach that minimizes the reconstruction loss and learns incremental node embeddings through initialization from the previous time steps. However, this method may not capture long-term graph similarities. Inspired by the self-attention mechanism (Vaswani et al., 2017), DySAT (Sankar et al., 2020) computes dynamic embeddings by employing structural attention layers on each snapshot, followed by temporal attention layers that capture temporal variations among snapshots. More recently, EvolveGCN (Pareja et al., 2020) leverages RNNs to regulate the GCN model (i.e., its network parameters) at every time step, capturing the dynamism in the evolving network parameters. Despite this progress, snapshot-based methods inevitably fail to capture the fine-grained temporal and structural information due to the coarse approximation of continuous-time graphs. It is also challenging to specify a suitable aggregation granularity.

Continuous-time Methods. Methods in this category operate directly on time-evolving graphs without time discretization and focus on designing different temporal aggregators to extract information. The dynamic graphs are represented as a series of chronological interactions with precise timestamps recorded. DeepCoevolve (Dai et al., 2016) and its variant JODIE (Kumar et al., 2019) employ two coupled RNNs to update dynamic node embeddings given each interaction. They provide an implicit way to construct the dynamic graph, in which only the historical interaction information of the two nodes involved in the interaction at time $t$ is utilized. The drawback is that they are limited to modeling first-order proximity and ignore the higher-order temporal neighborhood structures. To exploit the topological structure of the temporal graph explicitly, TDIG-MPNN (Chang et al., 2020) proposes a graph construction method, named Temporal Dependency Interaction Graph (TDIG), which generalizes the above implicit construction and is constructed from a sequence of cascaded interactions. Based on the topology of TDIG, they employ a graph-informed Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997b) to obtain the dynamic embeddings. However, the downside of the above methods is that they are not good at capturing long-range dependencies and are difficult to train, which are intrinsic weaknesses of RNNs. The recent works TGAT (Xu et al., 2020) and TGNs (Rossi et al., 2020) adopt a different graph construction technique, i.e., a time-recorded multi-graph, which accommodates more than one interaction (edge) between a pair of nodes. TGAT uses a time encoding kernel combined with a graph attention layer (Veličković et al., 2017) to aggregate temporal neighbors. Just like the encoding process in static models (e.g., GraphSAGE (Hamilton et al., 2017)), a single TGAT layer aggregates one-hop neighborhoods, and stacking several TGAT layers captures high-order topological information. TGNs generalize the aggregation of TGAT and utilize a node-wise memory to capture long-term dependencies.

2.2. Contrastive Learning On Graph Representation Learning

Recently, state-of-the-art results in unsupervised graph representation learning have been achieved by leveraging a contrastive learning loss that contrasts samples from a distribution that contains dependencies of interest with samples from a distribution that does not. Deep Graph Infomax (Velickovic et al., 2019) learns node representations by contrasting representations of nodes that belong to a graph with those of nodes coming from a corrupted graph. InfoGraph (Sun et al., 2019) learns graph-level representations by contrasting the representations at the graph level with those of sub-structures at different scales. Motivated by recent advances in multi-view contrastive learning for visual representation learning, Hassani and Khasahmadi (2020) propose contrastive multi-view representation learning at both the node and graph levels. According to (Poole et al., 2019), the contrastive objectives used in these methods can be seen as maximizing lower bounds of MI. A recent study (Tschannen et al., 2019) has shown that the success of these models is not only attributable to the properties of MI but is also influenced by the choice of the encoder and the MI estimators.

3. Preliminaries

To make our paper self-contained, we start with introducing the basic knowledge of Temporal Dependency Interaction Graph (Chang et al., 2020), Transformer (Vaswani et al., 2017) and Contrastive Learning (Oord et al., 2018), upon which we build our new method.

Figure 1. Illustration of a Temporal Dependency Interaction Graph induced from a sequence of six chronological interactions.

3.1. Temporal Dependency Interaction Graph

The Temporal Dependency Interaction Graph (TDIG) proposed by Chang et al. (2020) is constructed from a sequence of temporally cascaded chronological interactions. Compared with other construction methods, TDIG, as illustrated in Figure 1, maintains the fine-grained temporal and structural information of dynamic graphs. Therefore, in this paper, we adopt TDIG as the graph construction underlying our model.

Formally, a temporal dependency interaction graph $\mathcal{G}_t = (\mathcal{V}_t, \mathcal{E}_t)$ consists of a node set $\mathcal{V}_t$ and an edge set $\mathcal{E}_t$ indexed by time $t$, and is constructed from the sequence of chronological interactions observed up to time $t$. An interaction occurring at time $t$ is denoted as $e_t = (u, v, t)$, where nodes $u$ and $v$ are the two parties involved in this interaction. Since one node can have multiple interactions happening at different time points, for convenience we let $u^t$ represent the node $u$ at time $t$, i.e., the instance of $u$ that was involved in $e_t$. There are two types of edges in $\mathcal{G}_t$. One is the interaction edge, which corresponds to an interaction $e_t$. The other is the dependency edge, which links the current node $u^t$ to the two dependency nodes that were involved in $u$’s last interaction at time $t^-$ (just before time $t$). The dependency edge represents the evolution of, and the causality between, temporally related nodes. The two dependency nodes of $u^t$ are denoted as $u_1^{t^-}$ and $u_2^{t^-}$, respectively. The time interval between $t$ and node $u$’s last interaction, denoted as $\Delta_{u^t} = t - t^-$, is treated as a dependency edge feature. To reduce the computational cost in practice, the history information present in sub-graphs, rather than in the whole TDIG, is used to form dynamic embeddings. $\mathcal{G}^k_{u^t}$ denotes the sub-graph of maximum depth $k$ (i.e., the temporal neighborhood) rooted at $u^t$, where $k$ is a hyper-parameter. For a complete description, we refer readers to the original paper (Chang et al., 2020).
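The construction described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the reference implementation of Chang et al. (2020): the function name `build_tdig`, the representation of a node instance as a `(node, time)` pair, and the toy event list are all hypothetical.

```python
from typing import Dict, List, Tuple

Interaction = Tuple[str, str, float]   # (u, v, t): nodes u and v interact at time t
Instance = Tuple[str, float]           # a node at a given time, e.g. ("A", 3.0)


def build_tdig(interactions: List[Interaction]):
    """Return (interaction_edges, dependency_edges) for time-ordered events.

    Each new node instance is linked to the two instances created by that
    node's previous interaction; the elapsed time is the edge feature.
    """
    last_event: Dict[str, Interaction] = {}
    interaction_edges: List[Tuple[Instance, Instance]] = []
    dependency_edges: List[Tuple[Instance, Instance, float]] = []

    for u, v, t in sorted(interactions, key=lambda e: e[2]):
        interaction_edges.append(((u, t), (v, t)))
        for node in (u, v):
            if node in last_event:
                prev_u, prev_v, prev_t = last_event[node]
                for dep in (prev_u, prev_v):
                    dependency_edges.append(((node, t), (dep, prev_t), t - prev_t))
        # Update the history only after both endpoints have been linked.
        last_event[u] = (u, v, t)
        last_event[v] = (u, v, t)
    return interaction_edges, dependency_edges


# Toy co-author example: A and B each worked with C before working together.
events = [("A", "C", 1.0), ("B", "C", 2.0), ("A", "B", 3.0)]
interaction_edges, dependency_edges = build_tdig(events)
```

In this toy example the instance of A at time 3 depends on both A and C at time 1, which is how the shared history with C becomes visible to a model operating on the sub-graphs rooted at A and B.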

3.2. Transformer

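The Transformer (Vaswani et al., 2017) has achieved state-of-the-art performance and efficiency on many NLP tasks that were previously dominated by RNN/CNN-based approaches (Sutskever et al., 2014; Luong et al., 2015). A Transformer block relies heavily on a multi-head attention mechanism to learn context-aware representations for sequences, which can be defined as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O,$$

where $W^O$ is a trainable parameter and $h$ is the number of heads. $\mathrm{head}_i$ is computed as:

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V),$$

where $W_i^Q$, $W_i^K$, and $W_i^V$ are trainable parameters and $\mathrm{Attention}$ is the scaled dot-product attention defined as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d}}\right) V.$$

As input to the Transformer, each element in a sequence is represented by an embedding vector. Since the attention mechanism itself is insensitive to element order, a positional embedding is injected into the element embeddings so that the model is aware of the order of elements in a sequence.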
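To make the scaled dot-product attention concrete, the following is a minimal single-head sketch in plain Python. It omits the learned projections and the multi-head concatenation, and the helper names `softmax` and `attention` are ours, chosen only for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Compute softmax(Q K^T / sqrt(d)) V, one output row per query."""
    d = len(K[0])
    output = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        output.append([sum(w * v[c] for w, v in zip(weights, V))
                       for c in range(len(V[0]))])
    return output


Q = [[0.0, 0.0]]                    # a zero query scores every key equally
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, K, V)            # uniform weights, so the values are averaged
```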

3.3. Contrastive Learning

Contrastive learning approaches have attracted much attention for learning word embeddings in natural language models (Klein and Nabi, 2020), image classification (Chen et al., 2020), and static graph modeling (Velickovic et al., 2019). Contrastive learning optimizes an objective that enforces the positive samples to stay in a close neighborhood and negative samples far apart:

$$\mathcal{L} = \mathbb{E}_{(x,\, y^+,\, y^-)}\left[-\log \frac{\exp\left(f_\theta(x, y^+)\right)}{\exp\left(f_\theta(x, y^+)\right) + \exp\left(f_\theta(x, y^-)\right)}\right],$$

where the positive pair $(x, y^+)$ is contrasted with the negative pair $(x, y^-)$. A discriminating function $f_\theta(\cdot, \cdot)$ with parameters $\theta$ is trained to achieve a high value for congruent pairs and a low value for incongruent pairs. According to the proofs in (Poole et al., 2019), minimizing this objective maximizes a lower bound on MI.

4. Proposed Method

Given the topology of TDIG, we first describe how to generalize the vanilla Transformer to handle temporal and structural information. To obtain informative dynamic embeddings, we design a two-stream encoder that first processes the historical graphs associated with the two interaction nodes with a graph-topology-aware Transformer and then fuses them at the semantic level with a co-attentional Transformer. Finally, we introduce the temporal contrastive objective function used in model training.

4.1. Graph Transformer

Considering the advantages of the Transformer in capturing long-term dependencies and in computational efficiency, we propose to extract the temporal and structural information of TDIG with a Transformer-style architecture. The vanilla Transformer architecture cannot be directly applied to dynamic graphs for two reasons: (a) The vanilla Transformer (Vaswani et al., 2017) is designed to process equally spaced sequences and cannot deal with the irregularly spaced time intervals between interactions, which can be critical in analyzing interactive behaviors. (b) The vanilla Transformer applies attention over a flat structure, which is not suitable for capturing the hierarchical and structured dependency relationships exhibited in TDIG. To address these challenges, we make two adaptations: (1) We leverage the depths of nodes together with the time intervals to encode nodes’ positions or hierarchies. (2) We inject the graph structure into the attention mechanism by means of a masking operation. We elaborate on both below.

4.1.1. Embedding Layer

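Let $S_u = (n_1, \dots, n_{|S_u|})$ be the ordered set of nodes in the sub-graph $\mathcal{G}^k_{u^t}$ rooted at $u^t$, where the nodes are listed according to a pre-determined order of graph traversal (e.g., breadth-first search) and $|S_u|$ is the number of nodes in $\mathcal{G}^k_{u^t}$; $S_v$ is defined analogously for $\mathcal{G}^k_{v^t}$. We create a node embedding matrix $E \in \mathbb{R}^{|\mathcal{V}| \times d}$, where each row of the matrix corresponds to a $d$-dimensional embedding of a specific node in the node set $\mathcal{V}$. We then retrieve the node embedding matrices for the node sequences $S_u$ and $S_v$:

$$E_{S_u} = \left[E_{n_1}; \dots; E_{n_{|S_u|}}\right] \in \mathbb{R}^{|S_u| \times d}, \qquad E_{S_v} \in \mathbb{R}^{|S_v| \times d},$$

where $E_{S_u}$ is the concatenation of the corresponding embeddings of the nodes in $S_u$, and likewise for $E_{S_v}$.

Positional Embedding: Since the self-attention mechanism is not aware of the nodes’ positions or hierarchies in the TDIG, we first learn a node depth embedding matrix $P$, where the $j$-th row of the matrix denotes a $d$-dimensional embedding for depth $j$. Denote

$$P_{S_u} \in \mathbb{R}^{|S_u| \times d}, \qquad P_{S_v} \in \mathbb{R}^{|S_v| \times d},$$

where $P_{S_u}$ is the concatenation of the corresponding depth embeddings of the nodes in $S_u$. The depths of nodes in the TDIG indicate the temporal order of the observed interactions in which these nodes are involved. However, considering only the depths of nodes is not enough, because multiple node instances in the TDIG share the same depth. To enhance the nodes’ positional information, we also employ the time intervals (i.e., the dependency edge features), which usually convey important behavioral information. Specifically, we use a time-interval projection matrix $W_\Delta \in \mathbb{R}^{1 \times d}$ to get the following time-interval embedding matrices:

$$T_{S_u} = \Delta_{S_u} W_\Delta, \qquad T_{S_v} = \Delta_{S_v} W_\Delta,$$

where $\Delta_{S_u} \in \mathbb{R}^{|S_u| \times 1}$ stacks the time intervals of the nodes in $S_u$, so that $T_{S_u}$ is the concatenation of the corresponding time-interval embeddings of the nodes in $S_u$.

Finally, by injecting the TDIG-based positional embedding into the above node embeddings, we get the input embedding matrices for the next structure-aware attention layer:

$$X_u = E_{S_u} + P_{S_u} + T_{S_u}, \qquad X_v = E_{S_v} + P_{S_v} + T_{S_v}.$$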
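As a rough illustration of the embedding layer, the sketch below forms one input row per node by summing a node embedding, a depth embedding, and a linearly projected time interval. The summation and the linear projection are our assumptions about how the positional information is injected, and the names `build_inputs`, `node_emb`, `depth_emb`, and `w_time` are hypothetical.

```python
from typing import Dict, List


def build_inputs(node_ids: List[int], depths: List[int], gaps: List[float],
                 node_emb: Dict[int, List[float]],
                 depth_emb: Dict[int, List[float]],
                 w_time: List[float]) -> List[List[float]]:
    """Return one d-dimensional input row per node in traversal order.

    Row = node embedding + depth embedding + (time gap * projection vector).
    """
    rows = []
    for node, depth, gap in zip(node_ids, depths, gaps):
        rows.append([e + p + gap * w
                     for e, p, w in zip(node_emb[node], depth_emb[depth], w_time)])
    return rows


# Toy lookup tables with d = 2 (values chosen by hand, not learned).
node_emb = {0: [1.0, 0.0], 1: [0.0, 1.0]}
depth_emb = {0: [0.1, 0.1], 1: [0.2, 0.2]}
w_time = [0.5, -0.5]

# Root node 0 at depth 0 (gap 0), dependency node 1 at depth 1 (gap 2.0).
X = build_inputs([0, 1], [0, 1], [0.0, 2.0], node_emb, depth_emb, w_time)
```

Two node instances at the same depth but with different time gaps receive different input rows, which is the purpose of combining the two signals.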
4.1.2. Structure-Aware Attention Layer

The self-attention mechanism used in the NLP domain usually allows every token to attend to every other token, but directly applying it to the nodes of the TDIG leads to a loss of structural information. Instead, we propose to inject the TDIG structural information into the attention mask. Specifically, the structure attention mask $M$ for a sub-graph is defined as:

$$M_{ij} = \begin{cases} 0, & n_j \in \mathcal{G}(n_i), \\ -\infty, & \text{otherwise}, \end{cases}$$

where $n_i$ and $n_j$ denote the query node indexed by $i$ and the key node indexed by $j$, respectively, and $\mathcal{G}(n_i)$ is the sub-graph rooted at $n_i$. $M_{ij}$ is set to 0 only for the affinity pairs whose key node belongs to the sub-graph rooted at the query node, i.e., $n_j \in \mathcal{G}(n_i)$. We then inject the structure attention mask into the scaled dot-product attention layer:

$$\mathrm{MaskedAttention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d}} + M\right) V,$$

where the queries, keys, and values are computed from $H^{(b)}$, which represents the input of the $b$-th block (when stacking $B$ blocks), and $H^{(0)}$ is the input embedding matrix. The attention weights are zero wherever the mask takes negative infinity values ($-\infty$). In other words, each query node only attends to its structural dependency nodes, whose occurrence times are earlier than the occurrence time of the query node. The masked multi-head attention is denoted as

$$\mathrm{MaskedMultiHead}\left(H^{(b)}, H^{(b)}, H^{(b)}\right).$$
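The mask can be illustrated with a short sketch that marks, for every query node, which key nodes lie in the sub-graph rooted at it, and then applies a masked softmax. We assume here that the root belongs to its own sub-graph (otherwise leaf nodes would have no valid key); the `children` dictionary of dependency edges and the node labels are hypothetical.

```python
import math
from typing import Dict, List, Set

NEG_INF = float("-inf")


def subgraph_nodes(children: Dict[str, List[str]], root: str) -> Set[str]:
    """Nodes reachable from `root` via dependency edges, including the root."""
    seen, stack = {root}, [root]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def structure_mask(nodes: List[str], children: Dict[str, List[str]]):
    """M[i][j] = 0 if key j is in the sub-graph rooted at query i, else -inf."""
    mask = []
    for query in nodes:
        allowed = subgraph_nodes(children, query)
        mask.append([0.0 if key in allowed else NEG_INF for key in nodes])
    return mask


def masked_softmax(scores: List[float], mask_row: List[float]) -> List[float]:
    """Softmax of scores + mask; masked positions receive exactly zero weight."""
    logits = [s + m for s, m in zip(scores, mask_row)]
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]   # math.exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Root u3 depends on u1 and c1; the two dependency nodes are leaves.
nodes = ["u3", "u1", "c1"]
children = {"u3": ["u1", "c1"]}
M = structure_mask(nodes, children)
weights_for_leaf = masked_softmax([1.0, 2.0, 3.0], M[1])
```

The root row of `M` is all zeros, so the root can attend to its whole temporal neighborhood, while each leaf can only attend to itself.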
4.1.3. Graph Transformer Block

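Our graph-topology-aware Transformer block is defined by the following operations:

$$\bar{H}^{(b)} = \mathrm{LayerNorm}\left(H^{(b)} + \mathrm{MaskedMultiHead}\left(H^{(b)}, H^{(b)}, H^{(b)}\right)\right),$$

$$H^{(b+1)} = \mathrm{LayerNorm}\left(\bar{H}^{(b)} + \mathrm{FFN}\left(\bar{H}^{(b)}\right)\right),$$

where “+” means a residual connection operation. $\mathrm{LayerNorm}$ denotes a layer normalization module (Ba et al., 2016), which usually stabilizes the training process. $\mathrm{FFN}$ represents a two-layer fully-connected feed-forward network module defined as:

$$\mathrm{FFN}(x) = \sigma\left(x W_1 + b_1\right) W_2 + b_2,$$

where $W_1$, $W_2$, $b_1$, and $b_2$ are trainable parameters and $\sigma$ is an activation function usually implemented with a rectified linear unit (Nair and Hinton, 2010). We sequentially apply these operations to get $H^{(B)}$, which is the output of our graph Transformer block.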

4.2. Encoders for Dynamic Embeddings

Given the topology of the TDIG $\mathcal{G}_t$, our work aims to obtain the dynamic embeddings of the interaction nodes $u^t$ and $v^t$, i.e., $h_u^t$ and $h_v^t$, by leveraging the history information in their sub-graphs $\mathcal{G}^k_{u^t}$ and $\mathcal{G}^k_{v^t}$. To achieve this, we devise an encoder with a two-stream architecture that first processes $\mathcal{G}^k_{u^t}$ and $\mathcal{G}^k_{v^t}$ in two separate streams with the above graph-topology-aware Transformer layer and then lets the two streams interact at the semantic level through a co-attentional Transformer layer. The encoder architecture is depicted in Fig. 2.

Figure 2. Overview of Our Two-stream Encoder Framework.

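As shown in Fig. 2, our two-stream encoder includes two successive parts. The first part is a graph Transformer block, which we utilize to obtain the intermediate embeddings of the nodes in $\mathcal{G}^k_{u^t}$ and $\mathcal{G}^k_{v^t}$, i.e., $H_u$ and $H_v$. The second part is a cross-attentional Transformer block that first models the information correlation between $H_u$ and $H_v$ and then produces the final informative dynamic embedding matrices $\tilde{H}_u$ and $\tilde{H}_v$, from which we retrieve the dynamic embeddings $h_u^t$ and $h_v^t$ for the interaction nodes $u^t$ and $v^t$. The core of the cross-attentional Transformer block is a cross-attention operation that exchanges key-value pairs in the multi-head attention. Formally, we have the following equations:

$$\bar{H}_u = \mathrm{LayerNorm}\left(H_u + \mathrm{MultiHead}\left(H_u, H_v, H_v\right)\right), \tag{15}$$

$$\bar{H}_v = \mathrm{LayerNorm}\left(H_v + \mathrm{MultiHead}\left(H_v, H_u, H_u\right)\right), \tag{16}$$

$$\tilde{H}_u = \mathrm{LayerNorm}\left(\bar{H}_u + \mathrm{FFN}\left(\bar{H}_u\right)\right), \qquad \tilde{H}_v = \mathrm{LayerNorm}\left(\bar{H}_v + \mathrm{FFN}\left(\bar{H}_v\right)\right).$$

The co-attention operations that exchange key-value pairs in Eq. 15 and Eq. 16 enable our encoder to highlight the relevant semantics shared by the two sub-graphs $\mathcal{G}^k_{u^t}$ and $\mathcal{G}^k_{v^t}$ and to suppress the irrelevant semantics. Meanwhile, the attentive information from $\mathcal{G}^k_{v^t}$ is incorporated into the final representation $\tilde{H}_u$, and vice versa. By doing so, the obtained embedding matrices $\tilde{H}_u$ and $\tilde{H}_v$ are semantically rich and informative. We then retrieve the dynamic node embeddings $h_u^t$ and $h_v^t$ for the two interaction nodes $u^t$ and $v^t$ from $\tilde{H}_u$ and $\tilde{H}_v$.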
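The exchange of key-value pairs can be sketched as follows. This simplified single-head version keeps only the cross-attention and the residual connection; the learned projections, layer normalization, and feed-forward network are deliberately omitted, and the function names `attend` and `co_attention` are ours.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def attend(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Single-head scaled dot-product attention, one output row per query."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([sum((e / total) * v[c] for e, v in zip(exps, values))
                    for c in range(len(values[0]))])
    return out


def co_attention(H_u: Matrix, H_v: Matrix) -> Tuple[Matrix, Matrix]:
    """Each stream queries the other stream's keys and values."""
    from_v = attend(H_u, H_v, H_v)   # queries from u, keys/values from v
    from_u = attend(H_v, H_u, H_u)   # queries from v, keys/values from u
    Z_u = [[h + c for h, c in zip(row, ctx)] for row, ctx in zip(H_u, from_v)]
    Z_v = [[h + c for h, c in zip(row, ctx)] for row, ctx in zip(H_v, from_u)]
    return Z_u, Z_v


H_u = [[0.0, 0.0], [1.0, 1.0]]   # two nodes in the sub-graph of u
H_v = [[1.0, 2.0]]               # a single node in the sub-graph of v
Z_u, Z_v = co_attention(H_u, H_v)
```

Because the second stream holds a single node here, every row of the first stream receives that node's embedding as context, which makes the direction of information flow easy to check by hand.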

Relation to previous work. In comparison with RNN-based information aggregation mechanisms, such as JODIE, DeepCoevolve, and TDIG-MPNN, our Transformer-based model is better suited to capturing complex dependency structures over sophisticated time-evolving graphs. It avoids the vanishing gradient issues inherent in RNNs and eases optimization. Moreover, all these previous methods learn node embeddings separately without considering the semantic relatedness between their temporal neighborhoods (i.e., history behaviors), which may also trigger new interactions. In contrast, we design a two-stream encoder that separately processes the temporal neighborhoods of the interaction nodes and integrates them at a semantic level through a co-attentional Transformer. As a result, we obtain more informative dynamic embeddings.

4.3. Dynamic Graph Contrastive Learning

For many generative time series models, the training strategies are formulated to maximize the prediction accuracy. For example, JODIE exploits the local smoothness of the data and optimizes the model by minimizing the distance between the predicted node embedding and the ground truth node’s embedding at every interaction. Based on the TPP framework, methods like DeepCoevolve (Dai et al., 2016) and TDIG-MPNN (Chang et al., 2020) model the occurrence rate of a new interaction at any time and optimize the models by maximizing the likelihood. However, their training performance relies on the modeling accuracy of future interactions and could be vulnerable to noise. For this reason, instead of reconstructing the exact future interactions or employing a complicated generative model to learn the stochastic processes of future interactions, we attempt to maximize the mutual information between the latent representations of the interaction nodes in the future. In this way, we can learn the underlying latent representations that the interaction nodes have in common. Moreover, these underlying latent representations preserve the high-level semantics of interactions and focus less on low-level details, which makes them robust to noisy information (Wang and Isola, 2020). We apply the contrastive objective function to our dynamic embedding learning. According to the proofs in (Poole et al., 2019), minimizing this training objective maximizes a lower bound on the mutual information between interaction nodes.

4.3.1. Future Prediction

Since we are leveraging the future interaction as our supervisory signal, we first construct the predictive future states of nodes $u$ and $v$ associated with the future interaction $e_t = (u, v, t)$. Since each node has two dependency nodes in the TDIG, the predictive representation of a node at future time $t$ is constructed from the dynamic embeddings of this node’s two dependency nodes, which were involved in its previous interaction just before time $t$. Formally, we have the following equation:

$$\hat{z}_u^t = g\left(h_{u_1}^{t^-} \,\|\, h_{u_2}^{t^-}\right),$$

where $\|$ denotes concatenation and $g(\cdot)$ is a projection head used to predict the future node representation. $h_{u_1}^{t^-}$ and $h_{u_2}^{t^-}$ are the dynamic embeddings of the two dependency nodes of $u^t$, denoted $u_1^{t^-}$ and $u_2^{t^-}$, respectively, which are obtained by extracting the history information from the sub-graphs rooted at these two nodes using our two-stream encoder. The function $g(\cdot)$ takes the concatenation of $h_{u_1}^{t^-}$ and $h_{u_2}^{t^-}$ as input to forecast the future representation $\hat{z}_u^t$. Similarly, we obtain the predictive representation $\hat{z}_v^t$ for the node $v$ at future time $t$. In practice, $g(\cdot)$ is an MLP with two hidden layers and non-linearity.
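A projection head of this kind might look like the sketch below: the two dependency-node embeddings are concatenated and passed through an MLP with two hidden layers. The hidden width, the ReLU non-linearity, and the random initialization are assumptions made only so that the example runs; the helper names are hypothetical.

```python
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]   # (weights [out][in], bias [out])


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Small random weights and zero bias, for illustration only."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x: List[float], layer: Layer) -> List[float]:
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def projection_head(h_dep1: List[float], h_dep2: List[float],
                    layers: List[Layer]) -> List[float]:
    """Concatenate the two dependency embeddings and apply the MLP."""
    x = list(h_dep1) + list(h_dep2)              # concatenation
    for layer in layers[:-1]:                    # hidden layers with ReLU
        x = [max(0.0, v) for v in linear(x, layer)]
    return linear(x, layers[-1])                 # linear output layer


rng = random.Random(0)
d, hidden = 4, 8
layers = [init_layer(2 * d, hidden, rng),        # first hidden layer
          init_layer(hidden, hidden, rng),       # second hidden layer
          init_layer(hidden, d, rng)]            # output layer
z_hat = projection_head([0.1] * d, [0.2] * d, layers)
```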

4.3.2. Contrastive Objective Function

To train our two-stream encoder in an end-to-end fashion and learn informative dynamic embeddings that aid future interaction prediction, we utilize the future interaction as our supervisory signal and maximize the mutual information between the two nodes involved in the future interaction by contrasting their predictive representations.

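Given a list of interactions $\mathcal{I}$ observed in a time window, our contrastive learning is to minimize:

$$\mathcal{L} = -\sum_{(u, v, t) \in \mathcal{I}} \log \frac{\exp\left(f\left(\hat{z}_u^t, \hat{z}_v^t\right)\right)}{\sum_{v'} \exp\left(f\left(\hat{z}_u^t, \hat{z}_{v'}^t\right)\right)},$$

where $f(\cdot, \cdot)$ denotes the discriminator function, which takes two predictive representations as input and scores the agreement between them. Our discriminator function is designed to explore both the additive and multiplicative interaction relations between $\hat{z}_u^t$ and $\hat{z}_v^t$: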


Essentially, the contrastive objective function acts as a multi-way classifier that distinguishes the positive pair from all negative pairs. In our case, the positive pair is $(\hat{z}_u^t, \hat{z}_v^t)$, i.e., the predictive representations of the two nodes $u$ and $v$ involved in an interaction at time $t$. All the other pairs $(\hat{z}_u^t, \hat{z}_{v'}^t)$ with $v' \neq v$ are negative pairs, which means that all other items that do not interact with $u$ at time $t$ are treated as negative items. The aim of our contrastive loss is to train our two-stream encoder to maximize the shared information between positive pairs while keeping negative pairs well separated. Compared with existing training methods that model the exact future, our proposed contrastive loss makes the obtained dynamic embeddings more informative and more robust to noisy information. A schematic overview of our training method is shown in Fig. 3. We describe the pseudocode of training in the Appendix.
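The multi-way classification view can be made concrete with the sketch below, which treats the other items appearing in the same batch of interactions as negatives. Two simplifications are ours and not taken from the paper: a plain dot product stands in for the discriminator, and the loss is averaged rather than summed over interactions. The names `contrastive_loss`, `dot`, and `items` are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def dot(x: Sequence[float], y: Sequence[float]) -> float:
    """Stand-in discriminator: a plain dot product (an assumption)."""
    return sum(a * b for a, b in zip(x, y))


def contrastive_loss(z_u: List[List[float]], z_v: List[List[float]],
                     items: List[str],
                     score: Callable = dot) -> float:
    """Mean of -log softmax(score of the true partner among candidate items).

    z_u[i], z_v[i]: predictive representations for interaction i.
    items[i]: identifier of node v in interaction i; candidates with the
    same identifier as the positive are not used as negatives.
    """
    total = 0.0
    for i, zu in enumerate(z_u):
        logits = [score(zu, z_v[i])]                      # positive pair first
        logits += [score(zu, z_v[j]) for j in range(len(z_v))
                   if items[j] != items[i]]               # negatives: other items
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_norm - logits[0]
    return total / len(z_u)


z_u = [[4.0, 0.0], [0.0, 4.0]]
aligned = contrastive_loss(z_u, [[4.0, 0.0], [0.0, 4.0]], ["a", "b"])
shuffled = contrastive_loss(z_u, [[0.0, 4.0], [4.0, 0.0]], ["a", "b"])
```

When the predictive representations of true partners agree, the loss is close to zero; when partners are swapped, each positive pair scores lower than its negative and the loss becomes large.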

Figure 3. Illustration of CL-based optimization strategy for proposed two-stream encoder.

5. Experiments

We evaluate the effectiveness of TCL by comparing with baselines on four diverse interaction datasets. Meanwhile, the ablation study is conducted to understand which sub-module of TCL contributes most to the overall performance. We also conduct experiments of parameter sensitivity.

5.1. Experimental Setting

5.1.1. Datasets

We evaluate our proposed model for temporal interaction prediction on four diverse interaction datasets as shown in Table 1.

CollegeMsg (Leskovec and Krevl, 2014). This dataset consists of message sending activities on a social network at the University of California, Irvine. The dataset we use here contains 999 senders and 1,352 receivers, with a total of 48,771 activities.

Reddit (Kumar et al., 2019). This public dataset consists of one month of posts made by users on subreddits. We select the most active subreddits as items and the most active users. This results in 672,447 interactions (see Table 1).

LastFM (Leskovec and Krevl, 2014). We use the data from the top 1,000 active users and 1,000 popular songs, where an interaction indicates that a user listened to a song.

Wikipedia (Leskovec and Krevl, 2014). We use the data from 998 frequently edited pages and 7,619 active users on Wikipedia, yielding a total of 128,271 interactions.

These four datasets come from different real-world scenarios and vary in terms of the number of interactions, interaction repetition density, and time duration, where the interaction repetition density is the number of interactions shared by the training set and the test set divided by the number of interactions in the test set.
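One possible reading of the interaction repetition density is sketched below: the fraction of test interactions whose (user, item) pair already occurs in the training set. Whether repeated test interactions are counted individually or as a set is not specified in the text, so the counting convention here is an assumption, as is the function name.

```python
from typing import List, Tuple

Pair = Tuple[int, int]   # (user, item)


def repetition_density(train_pairs: List[Pair], test_pairs: List[Pair]) -> float:
    """Share of test interactions whose (user, item) pair was seen in training."""
    seen = set(train_pairs)
    repeated = sum(1 for pair in test_pairs if pair in seen)
    return repeated / len(test_pairs)


train = [(1, 2), (1, 3)]
test = [(1, 2), (1, 2), (2, 3), (1, 3)]
density = repetition_density(train, test)   # 3 of the 4 test interactions repeat
```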

Dataset | CollegeMsg | Wikipedia | LastFM | Reddit
#u | 999 | 7,619 | 1,000 | 9,986
#v | 1,352 | 998 | 1,000 | 984
#Interactions | 48,771 | 128,271 | 1,293,103 | 672,447
Interaction Repetition Density | 67.28% | 87.78% | 88.01% | 53.97%
Duration (days) | 193.63 | 30.00 | 1586.89 | 3882.00
Table 1. Dataset Statistics.

Group | Method | CollegeMsg MR | CollegeMsg Hit@10 | Wikipedia MR | Wikipedia Hit@10 | LastFM MR | LastFM Hit@10 | Reddit MR | Reddit Hit@10
Static | GraphSage* (Hamilton et al., 2017; Veličković et al., 2017) | 269.35±6.12 | 11.56±1.27 | 165.39±3.59 | 33.47±3.81 | 261.67±2.17 | 11.02±1.04 | 50.49±1.14 | 58.63±1.54
Discrete-time | CTDNE (Nguyen et al., 2018) | 607.27±10.65 | 1.85±0.53 | 350.68±6.80 | 11.89±0.62 | 461.59±7.49 | 1.17±0.05 | 51.30±1.37 | 58.55±1.08
Discrete-time | DynGEM (Goyal et al., 2018) | 337.64±9.30 | 3.47±0.38 | 216.56±5.27 | 16.79±0.21 | 457.68±8.37 | 3.67±0.04 | 52.43±1.75 | 45.37±0.90
Discrete-time | DySAT (Sankar et al., 2020) | 354.35±8.47 | 2.69±0.47 | 146.59±2.36 | 17.64±0.89 | 292.91±4.28 | 4.63±0.36 | 55.79±1.78 | 56.15±1.14
Continuous-time | JODIE (Kumar et al., 2019) | 286.62±7.14 | 24.09±0.92 | 125.88±1.96 | 41.27±1.03 | 289.53±4.91 | 27.97±1.35 | 56.13±1.78 | 59.11±1.94
Continuous-time | DeepCoevolve (Dai et al., 2016) | 439.63±11.71 | 2.53±0.21 | 227.74±6.41 | 15.37±0.26 | 376.23±5.38 | 20.44±0.68 | 49.70±1.53 | 62.06±1.02
Continuous-time | TGAT (Xu et al., 2020) | 206.79±6.37 | 35.49±1.15 | 87.03±2.41 | 72.05±1.37 | 199.24±2.51 | 30.27±0.92 | 45.11±2.19 | 64.33±2.66
Continuous-time | TGNs (Rossi et al., 2020) | 196.69±4.36 | 36.07±1.39 | 75.47±2.02 | 79.74±1.67 | 192.36±3.02 | 31.07±0.85 | 56.34±2.04 | 61.61±2.73
Continuous-time | TDIG-MPNN (Chang et al., 2020) | 160.89±1.58 | 39.37±1.31 | 69.28±1.34 | 79.82±0.49 | 180.07±2.43 | 33.29±1.63 | 46.91±0.49 | 71.46±0.08
 | TCL | 137.57±1.70 | 45.53±0.37 | 54.33±1.05 | 81.17±0.16 | 149.31±0.72 | 35.32±0.48 | 33.95±1.27 | 75.95±0.47
 | Improvement | 14.49% | 15.64% | 21.58% | 1.69% | 16.01% | 9.38% | 24.74% | 6.28%
Table 2. Overall Comparison on Temporal Interaction Prediction.

5.1.2. Baselines

We compare TCL with the following algorithms spanning three algorithmic categories:

  • Static Methods. We consider GraphSAGE (Hamilton et al., 2017) with four aggregators, namely GCN, MEAN, MAX and LSTM. A GAT (Veličković et al., 2017) aggregator is also implemented based on the GraphSAGE framework (Hamilton et al., 2017). For convenience, we report the best results among the above five aggregators, denoted as GraphSage*.

  • Discrete-Time Methods. Three snapshot-based dynamic methods are considered here. CTDNE (Nguyen et al., 2018) is an extension of DeepWalk (Perozzi et al., 2014) with time constraints on the sampling order; DynGEM (Goyal et al., 2018) is an autoencoding approach that minimizes the reconstruction loss and learns incremental node embeddings through initialization from the previous time steps; DySAT (Sankar et al., 2020) computes dynamic embeddings by using structural and temporal attention.

  • Continuous-Time Methods. Five continuous-time baselines are considered here. DeepCoevolve (Dai et al., 2016) learns dynamic embeddings by using a TPP framework to estimate the intensity of two nodes interacting at a future time; JODIE (Kumar et al., 2019) utilizes two RNN models and a projection function to learn dynamic embeddings; TDIG-MPNN (Chang et al., 2020) proposes a dynamic message passing network combined with a selection mechanism to obtain dynamic embeddings; TGAT (Xu et al., 2020) proposes a temporal graph attention layer to aggregate temporal-topological neighborhood features; TGNs (Rossi et al., 2020) extends TGAT by adding a memory module to capture long-term interaction relations.

5.1.3. Evaluation Metrics

Given an interaction (u, v, t) in the test set, each method outputs node u's preference scores over all the items at time t. We sort the scores in descending order and record the rank of the paired node v. We report the average rank over all interactions in the test data, denoted as Mean Rank (MR). We also report Hit@10, defined as the proportion of test interactions for which the paired node appears in the top 10.
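As an illustration of the two metrics, the following sketch computes the rank of the paired item from a score table and then aggregates ranks into MR and Hit@10. The function names are hypothetical, and ties are broken optimistically (only strictly higher scores rank ahead of the target), which is an assumption not stated in the text.

```python
def rank_of_target(scores, target):
    """1-based rank of `target` when items are sorted by descending score.

    Assumption: ties are broken in favour of the target.
    """
    target_score = scores[target]
    # Every item scored strictly higher than the target is ranked ahead of it.
    return 1 + sum(1 for s in scores.values() if s > target_score)


def mean_rank_and_hit(ranks, k=10):
    """Return (Mean Rank, Hit@k) for a list of 1-based ranks."""
    mean_rank = sum(ranks) / len(ranks)
    hit_at_k = sum(r <= k for r in ranks) / len(ranks)
    return mean_rank, hit_at_k


# Toy example: one score table, then four hypothetical test ranks.
scores = {"item_a": 0.9, "item_b": 0.5, "item_c": 0.1}
print(rank_of_target(scores, "item_b"))

ranks = [2, 1, 15, 30]
mr, hit10 = mean_rank_and_hit(ranks, k=10)
print(mr, hit10)
```

Lower MR is better, while higher Hit@10 is better, which is why the two columns of Table 2 move in opposite directions for stronger methods.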

5.1.4. Reproducibility

For GraphSage, CTDNE, DynGEM, DySAT, JODIE, TGAT, and TGNs, we use their open-source implementations. For DeepCoevolve and TDIG-MPNN, we use the source code provided by their authors. More details about parameter settings and implementation can be found in the Appendix.

5.2. Overall Performances

In this section, we evaluate our method and the baselines on the temporal interaction prediction task. All experiments are repeated five times and the averaged results are reported in Table 2, from which we make the following observations.

(1) TCL consistently outperforms all competitors on all datasets. The improvements of TCL over the second-best results in Table 2 are 14.49%, 21.58%, 16.01% and 24.74%, respectively, in terms of MR. This strong performance verifies the superiority of TCL.

(2) On average, the continuous-time methods perform better than the static and discrete-time methods, which can be explained by the fact that fine-grained temporal and structural information is critical in dynamic scenarios.

(3) In some scenarios, the discrete-time dynamic methods do not outperform the static methods, contrary to expectation. Similar phenomena have also been observed in previous work, DyRep (Trivedi et al., 2019) and TDIG-MPNN. A possible explanation is that it is non-trivial to specify the appropriate aggregation granularity (i.e., the number of snapshots) for these scenarios.

(4) In some scenarios, the static method GraphSage* performs competitively with the continuous-time baselines. One possible reason is that our datasets contain many repeated interactions, and recurring interaction information makes prediction easier, especially for static methods, which can make full use of structural information.

(5) TCL and recent work including TGAT, TGNs and TDIG-MPNN surpass JODIE and DeepCoevolve by a large margin, which indicates the importance of exploiting information from the high-order temporal neighborhood (e.g., the k-depth sub-graph information used in TCL).

(6) TCL performs better than the recent methods TGAT, TGNs and TDIG-MPNN. There could be two reasons. First, all these baselines aggregate the temporal information of the two interaction nodes separately, without considering the semantic relatedness between their temporal neighborhoods (i.e., history behaviors), which may be a causal factor for the target interaction. In contrast, the proposed two-stream encoder uses the co-attentional Transformer to capture inter-dependencies at the semantic level. Second, we use the contrastive learning loss as our optimization objective, which enables our dynamic embeddings to preserve high-level (or global) semantics about interactions and makes them robust to noisy information.
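The improvement figures quoted for Table 2 can be checked with a small sketch. The formula is not stated in the text; the relative-change formula below is an assumption that reproduces the CollegeMsg, Wikipedia and Reddit MR entries from the rounded means in Table 2. The function name is hypothetical.

```python
def relative_improvement(tcl, best_baseline, lower_is_better):
    """Relative improvement of TCL over the best baseline, in percent.

    Assumption: improvement is measured relative to the baseline value,
    with the sign flipped for metrics where lower is better (such as MR).
    """
    if lower_is_better:
        return (best_baseline - tcl) / best_baseline * 100.0
    return (tcl - best_baseline) / best_baseline * 100.0


# MR on CollegeMsg: TCL 137.57 vs. TDIG-MPNN 160.89 (lower is better).
print(round(relative_improvement(137.57, 160.89, lower_is_better=True), 2))
# MR on Reddit: TCL 33.95 vs. TGAT 45.11 (lower is better).
print(round(relative_improvement(33.95, 45.11, lower_is_better=True), 2))
# Hit@10 on Reddit: TCL 75.95 vs. TDIG-MPNN 71.46 (higher is better).
print(round(relative_improvement(75.95, 71.46, lower_is_better=False), 2))
```

Because the table reports rounded means, values recomputed this way can differ from the printed ones in the last digit.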

5.3. Ablation Study

| Variant | CollegeMsg (MR) | Wikipedia (MR) | LastFM (MR) | Reddit (MR) |
|---|---|---|---|---|
| TCL w/o TE | 140.20 | 54.49 | 157.86 | 34.09 |
| TCL w/o DE | 139.11 | 55.85 | 151.48 | 35.05 |
| TCL w/o CA-Transformer | 146.18 | 57.42 | 152.31 | 35.39 |
| Two-Stream-Encoder+TPP | 143.31 | 62.78 | 153.28 | 40.61 |
| TCL | 136.13 | 52.97 | 149.27 | 33.67 |

Table 3. The comparison of TCL with its variants.

We perform ablation studies on the TCL framework by removing one module at a time to explore the relative importance of each. The components examined in this section are the positional embedding, the cross-attentional Transformer, and the contrastive learning strategy.

Positional Embedding. We design the graph positional embedding with the aim of enhancing the positional information of nodes. To this end, the positional embedding module encodes both the time interval information and depth information. To evaluate their importance, we test the removal of time embedding (i.e., TCL w/o TE) and the removal of depth embedding (i.e., TCL w/o DE). From the results shown in Table 3, it can be observed that the performance degrades when removing either TE or DE, confirming the effectiveness of both time intervals between adjacent interactions and depth of the nodes.

Cross-attentional Transformer. To obtain informative dynamic representations, we propose a two-stream encoder that includes a cross-attentional Transformer module, which aggregates information from the temporal neighborhoods while capturing their mutual influence. To verify its effectiveness, we remove the cross-attentional Transformer from the two-stream encoder (i.e., TCL w/o CA-Transformer). As shown in Table 3, TCL with the default setting outperforms TCL w/o CA-Transformer on all datasets, by 5.04% on average, demonstrating the important role of the cross-attentional Transformer in our encoder.

Figure 4. Performance of TCL w.r.t. different depths of the k-depth sub-graph.
Figure 5. Performance of TCL w.r.t. different numbers of heads.

Contrastive learning. To improve robustness to noisy interactions, we use contrastive learning as the objective function, which maximizes the mutual information between the predictive representations of future interaction nodes. To evaluate its effectiveness, we compare TCL with a variant that replaces the contrastive learning objective with a TPP objective (i.e., Two-Stream-Encoder+TPP). TCL outperforms Two-Stream-Encoder+TPP by a large margin, i.e., 9.76% on average, which demonstrates the effectiveness of our optimization strategy.
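To make the idea of a contrastive objective concrete, here is a minimal sketch of a generic InfoNCE-style loss for a single interaction. It is not the loss of Eq. 23, which is defined elsewhere in the paper. The dot-product critic, the `temperature` parameter, and all names are assumptions made for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(anchor, positive, negatives, temperature=1.0):
    """Negative log-probability of the positive among positive + negatives.

    Assumption: similarity is a plain dot product scaled by a temperature.
    """
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]


# Toy 2-d embeddings with five negatives per positive.
anchor = [1.0, 0.0]
aligned = info_nce_loss(anchor, [1.0, 0.0], [[0.0, 1.0]] * 5)
misaligned = info_nce_loss(anchor, [0.0, 1.0], [[1.0, 0.0]] * 5)
print(aligned < misaligned)
```

The loss is small when the anchor is closer to the positive than to the negatives. When the positive and all N negatives score identically, it equals log(N + 1).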

5.4. Parameter Sensitivity

We investigate the impact of parameters on the future interaction prediction performance.

5.4.1. Impact of Sub-graph Depth

We explore how the depth of the sub-graph impacts the performance. We plot the MR of TCL at different depths on the four datasets. The results are summarized in Fig. 4. We find that the performance gradually improves as the depth increases, which verifies that exploiting information from the high-order temporal neighborhood benefits the performance.

5.4.2. Impact of Attention Head Number

Multi-head attention allows the model to jointly attend to information from different representation subspaces. We examine how the number of attention heads in our two-stream encoder impacts the performance. We plot the MR for different numbers of heads in Fig. 5. We observe that in most cases the performance improves as the number of heads increases, which demonstrates the effectiveness of multi-head attention. However, in some cases more heads lead to degraded performance, possibly due to over-fitting.

6. Conclusion

In this paper, we propose a novel continuous-time dynamic graph representation learning method, called TCL. TCL generalizes the vanilla Transformer and obtains temporal-topological information on dynamic graphs via a two-stream encoder. The proposed contrastive learning objective preserves the high-level semantics of interactions and focuses less on low-level details, which makes the learned representations robust to noise. Extensive experiments verify the effectiveness and stability of TCL. Two important problems remain for future work: how to effectively model long-term history information on a time-evolving graph, and how to scale well.


References

  • J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450.
  • X. Chang, X. Liu, J. Wen, S. Li, Y. Fang, L. Song, and Y. Qi (2020) Continuous-time dynamic graph learning via neural interaction processes. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 145–154.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In ICML, pp. 1597–1607.
  • H. Dai, Y. Wang, R. Trivedi, and L. Song (2016) Deep coevolutionary network: embedding user and item features for recommendation. arXiv preprint arXiv:1609.03675.
  • M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852.
  • P. Goyal, N. Kamra, X. He, and Y. Liu (2018) Dyngem: deep embedding method for dynamic graphs. arXiv preprint arXiv:1805.11273.
  • W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034.
  • K. Hassani and A. H. Khasahmadi (2020) Contrastive multi-view representation learning on graphs. In ICML, pp. 4116–4126.
  • X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, and M. Wang (2020) Lightgcn: simplifying and powering graph convolution network for recommendation. In SIGIR, pp. 639–648.
  • S. Hochreiter and J. Schmidhuber (1997a) Long short-term memory. Neural computation 9 (8), pp. 1735–1780.
  • S. Hochreiter and J. Schmidhuber (1997b) Long short-term memory. Neural computation 9 (8), pp. 1735–1780.
  • A. Khazane, J. Rider, M. Serpe, A. Gogoglou, K. Hines, C. B. Bruss, and R. Serpe (2019) Deeptrax: embedding graphs of financial transactions. In ICMLA, pp. 126–133.
  • T. Klein and M. Nabi (2020) Contrastive self-supervised learning for commonsense reasoning. arXiv preprint arXiv:2005.00669.
  • S. Kumar, X. Zhang, and J. Leskovec (2019) Predicting dynamic embedding trajectory in temporal interaction networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1269–1278.
  • J. Leskovec and A. Krevl (2014) SNAP Datasets: Stanford large network dataset collection.
  • M. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
  • V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In ICML.
  • G. H. Nguyen, J. B. Lee, R. A. Rossi, N. K. Ahmed, E. Koh, and S. Kim (2018) Continuous-time dynamic network embeddings. In Companion of The Web Conference 2018, WWW 2018, Lyon, France, April 23-27, 2018, pp. 969–976.
  • A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
  • A. Pareja, G. Domeniconi, J. Chen, T. Ma, T. Suzumura, H. Kanezashi, T. Kaler, T. Schardl, and C. Leiserson (2020) Evolvegcn: evolving graph convolutional networks for dynamic graphs. In AAAI, Vol. 34, pp. 5363–5370.
  • B. Perozzi, R. Al-Rfou, and S. Skiena (2014) Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710.
  • B. Poole, S. Ozair, A. Van Den Oord, A. Alemi, and G. Tucker (2019) On variational bounds of mutual information. In ICML, pp. 5171–5180.
  • E. Rossi, B. Chamberlain, F. Frasca, D. Eynard, F. Monti, and M. Bronstein (2020) Temporal graph networks for deep learning on dynamic graphs. arXiv preprint arXiv:2006.10637.
  • A. Sankar, Y. Wu, L. Gou, W. Zhang, and H. Yang (2020) DySAT: deep neural representation learning on dynamic graphs via self-attention networks. In WSDM, pp. 519–527.
  • F. Sun, J. Hoffmann, V. Verma, and J. Tang (2019) Infograph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215.
  • R. Trivedi, M. Farajtabar, P. Biswal, and H. Zha (2019) Dyrep: learning representations over dynamic graphs. In ICLR.
  • M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic (2019) On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv preprint arXiv:1706.03762.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903.
  • P. Velickovic, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm (2019) Deep graph infomax. In ICLR.
  • T. Wang and P. Isola (2020) Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, pp. 9929–9939.
  • L. Wu, P. Sun, R. Hong, Y. Fu, X. Wang, and M. Wang (2018) SocialGCN: an efficient graph convolutional network based model for social recommendation. arXiv preprint arXiv:1811.02815.
  • D. Xu, C. Ruan, E. Korpeoglu, S. Kumar, and K. Achan (2020) Inductive representation learning on temporal graphs. arXiv preprint arXiv:2002.07962.
  • J. You, B. Liu, R. Ying, V. Pande, and J. Leskovec (2018) Graph convolutional policy network for goal-directed molecular graph generation. arXiv preprint arXiv:1806.02473.
  • J. Zhao, Z. Zhou, Z. Guan, W. Zhao, W. Ning, G. Qiu, and X. He (2019) Intentgc: a scalable graph convolution framework fusing heterogeneous information for recommendation. In SIGKDD, pp. 2347–2357.
  • L. Zhou, Y. Yang, X. Ren, F. Wu, and Y. Zhuang (2018) Dynamic network embedding by modeling triadic closure process. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • M. Zitnik, M. Agrawal, and J. Leskovec (2018) Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics 34 (13), pp. i457–i466.

Appendix A Reproducibility Supplement

A.1. Settings for Baselines

To enhance the reproducibility of this paper, we first give the pseudocode of our training process. Then we describe the implementation of the baselines. Finally, we introduce the experimental environment and hyperparameters settings.

Implementation of Baselines. We compare TCL with three categories of methods: (1) Static Methods: GraphSage; (2) Discrete-Time Methods: CTDNE, DynGEM and DySAT; (3) Continuous-Time Methods: JODIE, DeepCoevolve, TGAT, TGNs and TDIG-MPNN.

For GraphSage*, the maximum number of 1/2/3/4/5-hop neighbor nodes is set to 25/10/10/10/10. For discrete-time methods, we search the number of snapshots in {1, 5, 10, 15} for all datasets. We search the learning rate in {0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1} and the batch size in {128, 256, 512}, and keep the other hyper-parameters the same as in the published versions to obtain the best results of these methods. For continuous-time methods, we search the learning rate in {0.0001, 0.0005, 0.001, 0.005} and the batch size in {128, 256, 512} to obtain the best results of these methods, and keep the other hyper-parameters the same as in the published versions. For a fair comparison, we use the same embedding dimension, maximum number of training epochs, and number of negative samples for all methods (except for DynGEM and JODIE, which do not require negative samples).
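The search spaces described above can be enumerated with a short sketch. It assumes that the discrete-time methods are tuned over learning rate, batch size and number of snapshots jointly, which is one reading of the text. The helper `grid` and the key names are hypothetical.

```python
from itertools import product

# Candidate values taken from the text above.
LEARNING_RATES_DISCRETE = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1]
LEARNING_RATES_CONTINUOUS = [0.0001, 0.0005, 0.001, 0.005]
BATCH_SIZES = [128, 256, 512]
NUM_SNAPSHOTS = [1, 5, 10, 15]


def grid(**axes):
    """Yield one dict per combination of the given hyper-parameter values."""
    names = list(axes)
    for values in product(*(axes[name] for name in names)):
        yield dict(zip(names, values))


discrete_grid = list(grid(lr=LEARNING_RATES_DISCRETE,
                          batch_size=BATCH_SIZES,
                          num_snapshots=NUM_SNAPSHOTS))
continuous_grid = list(grid(lr=LEARNING_RATES_CONTINUOUS,
                            batch_size=BATCH_SIZES))
print(len(discrete_grid), len(continuous_grid))
```

Under this reading, each discrete-time baseline is evaluated on 72 configurations per dataset and each continuous-time baseline on 12.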

A.2. Settings for TCL

| Hyper-parameter | Setting |
|---|---|
| Learning rate | 0.0005 |
| Optimizer | Adam |
| Mini-batch size | 512 |
| Node embedding dimension | 64 |
| Number of attention heads | 16 |
| Depth k of the sub-graph | 5 |
| Dropout ratio in the input | 0.6 |
| Number of training epochs | 20 |
| Number of negative samples | 5 |
| Number of blocks | 1 |

Table 4. Hyper-parameter Settings.

Dataset Split. We split CollegeMsg into training/validation/test sets by 60/20/20. For a fair comparison with JODIE, we split LastFM by 80/10/10. For a fair comparison with TGAT and TGNs, we split Reddit and Wikipedia by 70/15/15.
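A split of this kind can be sketched as follows. The sketch assumes the split is chronological, that is, interactions are sorted by timestamp and the earliest portion is used for training, which this section does not state explicitly. The function name and the toy data are hypothetical.

```python
def chronological_split(interactions, train_pct, val_pct):
    """Split (u, v, t) interactions by time into train/validation/test lists.

    Assumption: the split is chronological; percentages are integers and
    the test set receives whatever remains after train and validation.
    """
    ordered = sorted(interactions, key=lambda event: event[2])
    n = len(ordered)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test


# Ten toy interactions with shuffled timestamps, split 60/20/20.
events = [(i % 3, i % 4, float(t))
          for i, t in enumerate([5, 1, 9, 3, 7, 0, 8, 2, 6, 4])]
train, val, test = chronological_split(events, train_pct=60, val_pct=20)
print(len(train), len(val), len(test))
```

With a chronological split, every training interaction precedes every validation and test interaction, so no future information leaks into training.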

A.3. Pseudocode for TCL

The pseudocode of the training procedure for TCL is detailed in Algorithm 1.

Input: the dynamic interaction set; depth k; number of heads; number of blocks; number of epochs; number of negative samples (# NS).
Initialize the parameters of the Two-Stream Encoder and the projection function.
1:  for each epoch do
2:     for each interaction (u, v, t) in the dynamic interaction set do
3:        Select # NS negative items and extract the k-depth dependency sub-graphs of u, of v, and of each negative item.
4:        Obtain the behavior sequence of each sub-graph via BFS.
5:        Encode each pair of dependency sub-graphs via the Two-Stream Encoder.
6:        Project the dependency node embeddings to node embeddings via the Future Prediction Function.
7:        Calculate the CL loss according to Eq. 23 and update the parameters of the Two-Stream Encoder and the projection function.
8:     end for
9:  end for
Algorithm 1 TCL
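The sub-graph extraction and BFS steps of Algorithm 1 can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a node's temporal neighbors are the partners of its most recent interactions strictly before the time at which the node was reached, and that user and item identifiers do not collide. `build_history`, `k_depth_sequence` and `max_neighbors` are hypothetical names.

```python
from collections import defaultdict, deque


def build_history(interactions):
    """Map each node to its time-sorted list of (neighbor, timestamp)."""
    history = defaultdict(list)
    for u, v, t in sorted(interactions, key=lambda event: event[2]):
        history[u].append((v, t))
        history[v].append((u, t))
    return history


def k_depth_sequence(history, root, t, k, max_neighbors=2):
    """BFS over past interactions, returning (node, timestamp, depth) triples.

    Assumption: from a node reached at time `time`, only its most recent
    `max_neighbors` interactions strictly before `time` are expanded.
    """
    sequence = [(root, t, 0)]
    queue = deque(sequence)
    while queue:
        node, time, depth = queue.popleft()
        if depth == k:
            continue  # Do not expand beyond depth k.
        past = [(n, s) for n, s in history[node] if s < time][-max_neighbors:]
        for neighbor, s in past:
            item = (neighbor, s, depth + 1)
            sequence.append(item)
            queue.append(item)
    return sequence


# Toy bipartite interactions between users (u*) and items (i*).
interactions = [("u1", "i1", 1.0), ("u2", "i1", 2.0),
                ("u1", "i2", 3.0), ("u2", "i2", 4.0)]
history = build_history(interactions)
sequence = k_depth_sequence(history, "u2", t=5.0, k=2)
print(sequence)
```

The resulting sequence lists nodes in BFS order together with their timestamps and depths, which is the kind of information the time and depth embeddings discussed in the ablation study would consume.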