1. Introduction
Representation learning on graphs is attracting increasing interest since it has exhibited great potential in many real-world applications, ranging from e-commerce (Zhao et al., 2019; He et al., 2020), drug discovery (You et al., 2018; Zitnik et al., 2018), and social networks (Wu et al., 2018; Zhao et al., 2019) to financial transactions (Khazane et al., 2019). Previous work on graph representation learning mainly focuses on static settings where the topological structure is assumed fixed. However, graphs in practice are often constantly evolving, i.e., nodes and their associated interactions (edges) can emerge and vanish over time. For instance, sequences of interactions such as following new friends and sharing news on Twitter, or daily user-item purchasing interactions on Amazon, naturally yield a dynamic graph. Graph neural network (GNN) methods (Defferrard et al., 2016; Hamilton et al., 2017; Veličković et al., 2017) tailored for static graphs usually perform poorly in such dynamic scenarios because they cannot utilize the temporal evolutionary information that is lost in the static setting.
Dynamic graph modeling aims to learn a low-dimensional embedding for each node that can effectively encode temporal and structural properties of dynamic graphs. This is an appealing yet challenging task due to the sophisticated time-evolving graph structures. Several approaches have been proposed recently. According to the way dynamic graphs are constructed, these works can be roughly divided into discrete-time methods and continuous-time methods. The former (Pareja et al., 2020; Goyal et al., 2018; Sankar et al., 2020) rely on a discrete-time construction that approximates the dynamic graph as a series of graph snapshots over time. Usually, static graph encoding techniques like GCN (Hamilton et al., 2017) or GAT (Veličković et al., 2017) are applied to each snapshot, and then a recurrent neural network (RNN) (Hochreiter and Schmidhuber, 1997a) or a self-attention mechanism (Vaswani et al., 2017) is introduced to capture the complicated temporal dependencies among snapshots. However, discrete-time methods can be suboptimal since they ignore the fine-grained temporal and structural information that might be critical in real-world applications. Meanwhile, it is unclear how to specify the size of the time intervals in different applications. To tackle these challenges, continuous-time methods (Rossi et al., 2020; Xu et al., 2020; Dai et al., 2016; Kumar et al., 2019; Chang et al., 2020) have been proposed and have achieved state-of-the-art results. Methods in this direction focus on designing different temporal neighborhood aggregation techniques applied to fine-grained temporal graphs, which are represented as a sequence of temporally ordered interactions. For instance, TGAT (Xu et al., 2020) proposes a continuous-time kernel encoder combined with a self-attention mechanism to aggregate information from the temporal neighborhood. TGNs (Rossi et al., 2020) introduces a generic temporal aggregation framework with a node-wise memory mechanism. Chang et al. propose a dynamic message passing neural network to capture high-order graph proximity on their temporal dependency interaction graph (TDIG) (Chang et al., 2020). Although continuous-time methods have achieved impressive results, there exist some limitations. First, when aggregating information from the temporal neighborhoods of the two target interaction nodes, most of the aforementioned methods (Kumar et al., 2019; Chang et al., 2020; Dai et al., 2016) employ RNN-like architectures. Such methods suffer from vanishing gradients during optimization and are unable to capture long-term dependencies. The learned dynamic representations degrade especially when applied to complicated temporal graphs.
Secondly, these methods typically compute the dynamic embeddings of the two target interaction nodes separately, without considering the semantic relatedness between their temporal neighborhoods (i.e., history behaviors), which may also be a causal factor for the target interaction. For example, in a temporal co-author network, the fact that nodes A and B each previously co-authored with node C can promote a new potential collaboration between A and B. Therefore, modeling the mutual influence between the two temporal neighborhoods can yield more informative dynamic representations. Lastly, in optimization, most prior works model the exact future, either by reconstructing future states (Kumar et al., 2019) or by leveraging a Temporal Point Process (TPP) framework (Dai et al., 2016; Chang et al., 2020) to model the complicated stochastic processes of future interactions. However, they may learn noisy information when trying to fit the next interactions. Besides, computing the survival function of an intensity function in TPP-based methods is expensive when the integral cannot be computed in closed form.
To address the above limitations, we propose TCL, a novel continuous-time Transformer-based dynamic graph modeling framework trained via contrastive learning. The main contributions of our work are summarized as follows:


We generalize the vanilla Transformer to handle temporal-topological information on a dynamic graph represented as a sequence of temporally cascaded interactions.

To obtain informative dynamic embeddings, we design a two-stream encoder that separately processes the temporal neighborhoods associated with the two target interaction nodes using our graph-topology-aware Transformer, and then integrates them at a semantic level through a co-attentional Transformer.

To ensure robust learning, we leverage a contrastive learning strategy that maximizes the mutual information (MI) between the predictive representations of future interaction nodes. To the best of our knowledge, this is the first attempt to apply contrastive learning to dynamic graph modeling.

Our model is evaluated on four diverse interaction datasets for interaction prediction. Experimental results demonstrate that our method yields consistent and significant improvements over stateoftheart baselines.
2. Related Work
This section reviews state-of-the-art approaches for dynamic graph learning. Since a CL loss is used as our optimization objective, we also review recent work on CL-based graph representation learning.
2.1. Dynamic Graph Modeling
According to how the dynamic graph is constructed, we roughly divide the existing modeling approaches into two categories: discrete-time methods and continuous-time methods.
Discrete-time Methods. Methods in this category deal with a sequence of discretized graph snapshots that coarsely approximates a time-evolving graph. The authors of (Zhou et al., 2018) utilize temporally regularized weights to enforce the smoothness of nodes' dynamic embeddings across adjacent snapshots. However, this method may break down when nodes exhibit significantly varying evolutionary behaviors. DynGEM (Goyal et al., 2018) is an autoencoding approach that minimizes the reconstruction loss and learns incremental node embeddings through initialization from previous time steps. However, this method may not capture long-term graph similarities. Inspired by the self-attention mechanism
(Vaswani et al., 2017), DySAT (Sankar et al., 2020) computes dynamic embeddings by employing structural attention layers on each snapshot, followed by temporal attention layers to capture temporal variations among snapshots. Recently, EvolveGCN (Pareja et al., 2020) leverages RNNs to regulate the GCN model (i.e., the network parameters) at every time step to capture the dynamism of the evolving network parameters. Despite this progress, snapshot-based methods inevitably fail to capture fine-grained temporal and structural information due to the coarse approximation of continuous-time graphs. It is also challenging to specify a suitable aggregation granularity.

Continuous-time Methods. Methods in this category directly operate on time-evolving graphs without time discretization and focus on designing different temporal aggregators to extract information. The dynamic graphs are represented as a series of chronological interactions with precise timestamps recorded. DeepCoevolve (Dai et al., 2016) and its variant JODIE (Kumar et al., 2019) employ two coupled RNNs to update dynamic node embeddings upon each interaction. They provide an implicit way to construct the dynamic graph, where only the historical interaction information of the two nodes involved in the interaction at time t is utilized. The drawback is that they are limited to modeling first-order proximity while ignoring the higher-order temporal neighborhood structures. To exploit the topology of the temporal graph explicitly, TDIG-MPNN (Chang et al., 2020) proposes a graph construction method, named Temporal Dependency Interaction Graph (TDIG), which generalizes the above implicit construction and is built from a sequence of cascaded interactions. Based on the topology of the TDIG, they employ a graph-informed Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997b) to obtain the dynamic embeddings. However, the downside of the above methods is that they are not good at capturing long-range dependencies and are difficult to train, which are intrinsic weaknesses of RNNs. The recent works TGAT (Xu et al., 2020) and TGNs (Rossi et al., 2020) adopt a different graph construction technique, i.e., a time-recorded multigraph, which accommodates more than one interaction (edge) between a pair of nodes. TGAT uses a time encoding kernel combined with a graph attention layer (Veličković et al., 2017) to aggregate temporal neighbors. Just like the encoding process in static models (e.g., GraphSAGE (Hamilton et al., 2017)), a single TGAT layer aggregates one-hop neighborhoods, and by stacking several TGAT layers it can capture high-order topological information. TGNs generalizes the aggregation of TGAT and utilizes a node-wise memory to capture long-term dependencies.

2.2. Contrastive Learning On Graph Representation Learning
Recently, state-of-the-art results in unsupervised graph representation learning have been achieved by leveraging a contrastive learning loss that contrasts samples from a distribution that contains the dependencies of interest against samples from a distribution that does not. Deep Graph Infomax (Velickovic et al., 2019) learns node representations by contrasting representations of nodes that belong to a graph with those of nodes coming from a corrupted graph. InfoGraph (Sun et al., 2019) learns graph-level representations by contrasting representations at the graph level with those of substructures at different scales. Motivated by recent advances in multi-view contrastive learning for visual representation learning, (Hassani and Khasahmadi, 2020) propose contrastive multi-view representation learning at both the node and graph levels. According to (Poole et al., 2019), the contrastive objectives used in these methods can be seen as maximizing lower bounds on MI. A recent study (Tschannen et al., 2019) has shown that the success of these models is not only attributed to the properties of MI but is also influenced by the choice of the encoder and the MI estimators.
3. Preliminaries
To make our paper self-contained, we start by introducing the basics of the Temporal Dependency Interaction Graph (Chang et al., 2020), the Transformer (Vaswani et al., 2017), and Contrastive Learning (Oord et al., 2018), upon which we build our new method.
3.1. Temporal Dependency Interaction Graph
The Temporal Dependency Interaction Graph (TDIG) proposed by (Chang et al., 2020) is constructed from a sequence of temporally cascaded chronological interactions. Compared with other construction methods, the TDIG, as illustrated in Figure 1, maintains the fine-grained temporal and structural information of dynamic graphs. Therefore, in this paper, we select the TDIG as our backbone graph construction.
Formally, a temporal dependency interaction graph G(t) consists of a node set V(t) and an edge set E(t) indexed by time t and is constructed from a sequence of chronological interactions up to time t. An interaction occurring at time t is denoted as x = (u, v, t), where nodes u and v are the two parties involved in this interaction. Since one node can have multiple interactions happening at different time points, for convenience we let u_t represent the node u at time t who was involved in x = (u, v, t). There are two types of edges in G(t). One is the interaction edge, which corresponds to an interaction (u, v, t). The other is the dependency edge, which links the current node u_t to the two dependency nodes that were involved in u's last interaction at time t′ (just before time t). The dependency edge represents the evolution and the causality between temporally relevant nodes. The two dependency nodes of u_t, i.e., the instances of the two parties of u's previous interaction, are denoted as u_{t′} and w_{t′}, respectively. The time interval between t and node u's last interaction, denoted as Δt_u = t − t′, is treated as a dependency edge feature. To reduce the computational cost in practice, the history information present in subgraphs rather than the whole TDIG is utilized to form dynamic embeddings. We denote by G^k(u_t) the subgraph of maximum depth k (i.e., the temporal neighborhood) rooted at u_t, where k is a hyperparameter. For more details, we refer the readers to the original paper (Chang et al., 2020) for a complete description.
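To make the construction concrete, the following minimal sketch (our own illustration; the class and function names are not from the paper) builds the dependency edges of a TDIG from a chronological interaction list:

```python
from dataclasses import dataclass

# Hypothetical sketch of TDIG dependency-edge construction.
@dataclass(frozen=True)
class NodeInstance:
    node: str
    time: float

def build_tdig(interactions):
    """interactions: chronological list of (u, v, t) tuples.
    Each new node instance u_t gets dependency edges to the two node
    instances involved in u's previous interaction (if any); the time
    interval to each parent is kept as the dependency edge feature."""
    last_event = {}   # node -> (u_t, v_t) instances of its most recent interaction
    dep_edges = []    # (child_instance, parent_instance, time_interval)
    for u, v, t in interactions:
        u_t, v_t = NodeInstance(u, t), NodeInstance(v, t)
        for inst in (u_t, v_t):
            prev = last_event.get(inst.node)
            if prev is not None:
                for parent in prev:
                    dep_edges.append((inst, parent, t - parent.time))
        last_event[u] = (u_t, v_t)
        last_event[v] = (u_t, v_t)
    return dep_edges
```

For example, after interactions (A, C, 1), (B, C, 2), (A, B, 3), the instance of A at time 3 depends on both instances from A's previous interaction with C.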
3.2. Transformer
The Transformer (Vaswani et al., 2017) has achieved state-of-the-art performance and efficiency on many NLP tasks previously dominated by RNN/CNN-based approaches (Sutskever et al., 2014; Luong et al., 2015). A Transformer block relies heavily on a multi-head attention mechanism to learn context-aware representations of sequences, which can be defined as:
(1) MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O

where W^O is a trainable parameter and h is the number of heads. Each head_i is computed as:

(2) head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where W_i^Q, W_i^K, W_i^V are trainable parameters and Attention(·) is the scaled dot-product attention defined as:

(3) Attention(Q, K, V) = softmax(Q K^T / √d_k) V
As input to the Transformer, each element of a sequence is represented by an embedding vector. The multi-head attention mechanism is made aware of element order by injecting a positional embedding into the element embeddings.
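As a concrete reference for Eqs. (1)-(3), here is a minimal NumPy sketch of scaled dot-product attention and multi-head attention; for brevity, the per-head projections W_i^Q, W_i^K, W_i^V are realized as column slices of single matrices (a simplification of the original formulation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n_q, n_k) pairwise affinities
    return softmax(scores, axis=-1) @ V    # weighted sum of values

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    d = X.shape[-1]
    d_h = d // n_heads
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_h, (h + 1) * d_h)  # this head's projection columns
        heads.append(scaled_dot_product_attention(
            X @ W_q[:, sl], X @ W_k[:, sl], X @ W_v[:, sl]))
    return np.concatenate(heads, axis=-1) @ W_o  # Concat(...) W^O
```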
3.3. Contrastive Learning
Contrastive learning approaches have attracted much attention for learning word embeddings in natural language models
(Klein and Nabi, 2020), image classification (Chen et al., 2020), and static graph modeling (Velickovic et al., 2019). Contrastive learning optimizes an objective that enforces positive samples to stay in a close neighborhood and negative samples to be far apart:

(4) L = −E[ log ( f_θ(x, y) / Σ_{y′} f_θ(x, y′) ) ]

where the positive pair (x, y) is contrasted with negative pairs (x, y′). A discriminating function f_θ with parameters θ is trained to produce high values for congruent pairs and low values for incongruent pairs. According to the proofs in (Poole et al., 2019), minimizing this objective maximizes a lower bound on MI.
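A minimal NumPy sketch of an InfoNCE-style estimator of this kind of objective, using in-batch negatives; instantiating f_θ as an exponentiated dot-product score with a temperature is our assumption, not a detail from this section:

```python
import numpy as np

def info_nce_loss(z_x, z_y, temperature=1.0):
    """z_x, z_y: (N, d) representations; row i of z_x pairs with row i of z_y.
    Off-diagonal rows of z_y serve as the negatives for each z_x row."""
    scores = z_x @ z_y.T / temperature            # (N, N) agreement scores
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # -log softmax of positives
```

With perfectly aligned, well-separated pairs the loss approaches 0; with completely uninformative representations it approaches log N.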
4. Proposed Method
Given the topology of the TDIG, we first describe how to generalize the vanilla Transformer to handle temporal and structural information. To obtain informative dynamic embeddings, we design a two-stream encoder that first processes the historical subgraphs associated with the two interaction nodes using a graph-topology-aware Transformer and then fuses them at the semantic level through a co-attentional Transformer. Finally, we introduce the temporal contrastive objective function used to train the model.
4.1. Graph Transformer
Considering the advantages of the Transformer in capturing long-term dependencies and in computational efficiency, we propose to extract the temporal and structural information of the TDIG with a Transformer-style architecture. The vanilla Transformer cannot be directly applied to dynamic graphs for two reasons: (a) it is designed to process equally spaced sequences and cannot deal with the irregularly spaced time intervals between interactions, which can be critical for analyzing interactive behaviors (Vaswani et al., 2017); (b) it computes attention over a flat structure, which is not suitable for capturing the hierarchical and structured dependency relationships exhibited in the TDIG. To address these challenges, we make two adaptations: (1) we leverage the depths of nodes together with the time intervals to encode nodes' positions or hierarchies; (2) we inject the graph structure into the attention mechanism through a masking operation. We elaborate on these below.
4.1.1. Embedding Layer
Let S = (v_1, …, v_n) be the sequence of nodes in G^k(u_t), listed according to a predetermined order of graph traversal (e.g., breadth-first search), where n is the number of nodes in G^k(u_t). We create a node embedding matrix E ∈ R^{|V|×d}, where each row corresponds to a d-dimensional embedding of a specific node in the node set V. We then retrieve the node embedding matrix for the node sequence S:

(5) Z_node = [E_{v_1}; …; E_{v_n}]

where Z_node ∈ R^{n×d} is the concatenation of the corresponding embeddings of the nodes in S.
Positional Embedding: Since the self-attention mechanism is not aware of the nodes' positions or hierarchies in the TDIG, we first learn a depth embedding matrix P ∈ R^{k×d}, where the i-th row denotes a d-dimensional embedding for depth i. Denote

(6) Z_depth = [P_{depth(v_1)}; …; P_{depth(v_n)}]

where Z_depth ∈ R^{n×d} is the concatenation of the corresponding depth embeddings of the nodes in S. The depths of nodes in the TDIG indicate the temporal orders of the observed interactions in which these nodes are involved. However, considering depth alone is not enough, because nodes in the TDIG can have multiple instances at the same depth. To enhance the nodes' positional information, we also employ the time intervals (i.e., the dependency edge features), which usually convey important behavioral information. Specifically, we use a time-interval projection matrix W_t ∈ R^{1×d} to obtain the time-interval embedding matrix:

(7) Z_time = [Δt_{v_1}; …; Δt_{v_n}] W_t

where Z_time ∈ R^{n×d} is the concatenation of the corresponding time-interval embeddings of the nodes in S.
Finally, by injecting the TDIG-based positional embeddings into the node embeddings, we obtain the input embedding matrix for the subsequent structure-aware attention layer:

(8) H^(0) = Z_node + Z_depth + Z_time
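The embedding layer can be sketched as follows; treating the injection in Eq. (8) as element-wise addition of the three embeddings follows standard Transformer positional encodings and is our assumption, and all table sizes and names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_nodes, max_depth = 8, 100, 5

E_node  = rng.normal(size=(n_nodes, d))    # node embedding table (Eq. 5)
E_depth = rng.normal(size=(max_depth, d))  # depth embedding table (Eq. 6)
W_time  = rng.normal(size=(1, d))          # time-interval projection (Eq. 7)

def input_embeddings(node_ids, depths, time_intervals):
    """Eq. (8)-style input: node + depth + projected time-interval embeddings."""
    Z_node  = E_node[node_ids]                              # (n, d)
    Z_depth = E_depth[depths]                               # (n, d)
    Z_time  = np.asarray(time_intervals)[:, None] @ W_time  # (n, 1) @ (1, d)
    return Z_node + Z_depth + Z_time
```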
4.1.2. StructureAware Attention Layer
The self-attention mechanism used in the NLP domain usually allows every token to attend to every other token, but directly applying it to the nodes of the TDIG would lose the structural information. Instead, we propose to inject the TDIG structure into an attention mask. Specifically, the structure attention mask M for the subgraph G^k(u_t) is defined as:

(9) M_{ij} = 0 if v_j ∈ G^k(v_i), and M_{ij} = −∞ otherwise

where v_i and v_j denote the query node indexed by i and the key node indexed by j, respectively. M_{ij} is set to 0 only for affinity pairs whose key node v_j belongs to the subgraph rooted at the query node v_i, i.e., v_j ∈ G^k(v_i). We then inject the structure attention mask into the scaled dot-product attention layer:

(10) MaskedAttention(Q, K, V) = softmax(Q K^T / √d + M) V

where Q, K, V are projections of the l-th block input H^(l) (when stacking B blocks) and H^(0) is given by Eq. (8). Observe that the attention weight vanishes wherever the mask is negative infinity (−∞). In other words, each query node only attends to its structural dependency nodes, whose occurrence times are earlier than its own occurrence time. The masked multi-head attention is denoted as

(11) S^(l) = MaskedMultiHead(H^(l))
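The masked attention of Eqs. (9)-(10) can be sketched in NumPy as follows, using a large negative constant in place of −∞ and assuming the projected Q, K, V matrices are precomputed:

```python
import numpy as np

NEG_INF = -1e9  # stands in for the mask's negative infinity

def masked_attention(Q, K, V, mask):
    """mask[i, j] == 0 where key j lies in the subgraph rooted at query i,
    NEG_INF elsewhere; masked positions get (numerically) zero weight."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1]) + mask
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```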
4.1.3. Graph Transformer Block
Our graph-topology-aware Transformer block is defined by the following operations:

(12) Ĥ^(l) = LayerNorm(H^(l) + MaskedMultiHead(H^(l)))

(13) H^(l+1) = LayerNorm(Ĥ^(l) + FFN(Ĥ^(l)))

where "+" denotes a residual connection operation, LayerNorm denotes a layer normalization module (Ba et al., 2016), which usually stabilizes the training process, and FFN represents a two-layer fully-connected feed-forward network module defined as:

(14) FFN(x) = σ(x W_1 + b_1) W_2 + b_2

where σ is an activation function, usually implemented with a rectified linear unit (Nair and Hinton, 2010). We sequentially apply these operations to obtain H^(l+1), the output of our graph Transformer block.

4.2. Encoders for Dynamic Embeddings
Given the topology of the TDIG, our work aims to obtain the dynamic embeddings of the interaction nodes u and v, i.e., h_u(t) and h_v(t), by leveraging the history information in their subgraphs G^k(u_t) and G^k(v_t). To achieve this, we devise an encoder with a two-stream architecture that first processes G^k(u_t) and G^k(v_t) in two separate streams with the graph-topology-aware Transformer layer described above, and then fuses them at the semantic level through a co-attentional Transformer layer. The encoder architecture is depicted in Fig. 2.

As shown in Fig. 2, our two-stream encoder consists of two successive parts. The first part is a graph Transformer block, which we use to obtain intermediate embeddings of the nodes in G^k(u_t) and G^k(v_t), i.e., H_u and H_v. The second part is a cross-attentional Transformer block that first models the information correlation between H_u and H_v, and then produces the final informative dynamic embedding matrices Z_u and Z_v, from which we retrieve the dynamic embeddings h_u(t) and h_v(t) for the interaction nodes u_t and v_t. The core of the cross-attentional Transformer block is a cross-attention operation that exchanges the key-value pairs in the multi-head attention. Formally, we have the following equations:
(15) S_u = MultiHead(Q = H_u, K = H_v, V = H_v)

(16) S_v = MultiHead(Q = H_v, K = H_u, V = H_u)

(17) Ŝ_u = LayerNorm(H_u + S_u)

(18) Ŝ_v = LayerNorm(H_v + S_v)

(19) Z_u = LayerNorm(Ŝ_u + FFN(Ŝ_u))

(20) Z_v = LayerNorm(Ŝ_v + FFN(Ŝ_v))
The co-attention operations that exchange key-value pairs in Eq. 15 and Eq. 16 enable our encoder to highlight the relevant semantics shared by the two subgraphs G^k(u_t) and G^k(v_t) and to suppress the irrelevant semantics. Meanwhile, the attentive information from one stream is incorporated into the final representation of the other, and vice versa. By doing so, the obtained embedding matrices Z_u and Z_v can be semantic and informative. We then retrieve the dynamic node embeddings h_u(t) and h_v(t) for the two interaction nodes u_t and v_t from Z_u and Z_v.
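The key-value exchange at the heart of the co-attentional block (Eqs. 15-16) can be sketched as follows; this simplified version omits the per-stream projection matrices, multiple heads, and the residual/normalization steps of the full block:

```python
import numpy as np

def attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    s -= s.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(s)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def co_attention(H_u, H_v):
    """Cross-attention by swapping key/value streams: queries from one
    subgraph attend to the keys/values of the other, and vice versa."""
    S_u = attention(H_u, H_v, H_v)  # u-stream queries, v-stream keys/values
    S_v = attention(H_v, H_u, H_u)  # v-stream queries, u-stream keys/values
    return S_u, S_v
```

Each output row is a convex combination of the other stream's node embeddings, which is what lets shared semantics in the two histories reinforce each other.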
Relation to previous work. In comparison with RNN-based information aggregation mechanisms such as JODIE, DeepCoevolve and TDIG-MPNN, our Transformer-based model excels at capturing the complex dependency structures of sophisticated time-evolving graphs. It avoids the vanishing gradient issues inherent to RNNs and eases optimization. Moreover, all of these previous methods learn node embeddings separately, without considering the semantic relatedness between the temporal neighborhoods (i.e., history behaviors), which may also trigger new interactions. In contrast, we design a two-stream encoder that separately processes the temporal neighborhoods of the interaction nodes and integrates them at a semantic level through a co-attentional Transformer. As a result, we obtain more informative dynamic embeddings.
4.3. Dynamic Graph Contrastive Learning
For many generative time series models, the training strategy is formulated to maximize prediction accuracy. For example, JODIE exploits the local smoothness of the data and optimizes the model by minimizing the distance between the predicted node embedding and the ground-truth node embedding at every interaction. Based on the TPP framework, methods like DeepCoevolve (Dai et al., 2016) and TDIG-MPNN (Chang et al., 2020) model the occurrence rate of a new interaction at any time and optimize the models by maximizing the likelihood. However, their training performance relies on the modeling accuracy of future interactions and can be vulnerable to noise. In light of this, instead of reconstructing the exact future interactions or employing a complicated generative model to learn the stochastic processes of future interactions, we attempt to maximize the mutual information between the latent representations of the interaction nodes in the future. In this way, we can learn the underlying latent representations that the interaction nodes have in common. Moreover, these underlying latent representations preserve the high-level semantics of interactions and focus less on low-level details, which makes them robust to noise (Wang and Isola, 2020). We apply the contrastive objective function to our dynamic embedding learning. According to the proofs in (Poole et al., 2019), minimizing this training objective is equivalent to maximizing a lower bound on the mutual information between interaction nodes.
4.3.1. Future Prediction
Since we leverage the future interaction as our supervisory signal, we first construct the predictive future states of the nodes u and v associated with a future interaction (u, v, t). Considering that each node has two dependency nodes in the TDIG, the predictive representation of a node at future time t is constructed from the dynamic embeddings of the node's two dependency nodes involved in its previous interaction just before time t. Formally, we have the following equations:

(21) ẑ_u(t) = g( h_a(t′) ‖ h_b(t′) )

(22) ẑ_v(t) = g( h_c(t″) ‖ h_d(t″) )

where ‖ denotes concatenation and g(·) is a projection head used to predict the future node representation. h_a(t′) and h_b(t′) are the dynamic embeddings of u's two dependency nodes a and b, respectively, obtained by extracting the history information from the subgraphs G^k(a_{t′}) and G^k(b_{t′}) with our two-stream encoder. The function g takes the concatenation of h_a(t′) and h_b(t′) as input to forecast the future representation ẑ_u(t). Similarly, we obtain the predictive representation ẑ_v(t) for node v at future time t. In practice, g is an MLP with two hidden layers and ReLU nonlinearity.
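A sketch of the projection head g as a two-hidden-layer MLP applied to the concatenated dependency embeddings; the weight shapes, the absence of biases, and the ReLU choice are our illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # illustrative embedding dimension

# Two-hidden-layer MLP projection head g(.) (our sketch of the paper's g).
W1, W2, W3 = (rng.normal(size=s) * 0.1 for s in [(2 * d, d), (d, d), (d, d)])

def predict_future(h_dep1, h_dep2):
    """Concatenate the two dependency-node embeddings and project them to
    the node's predictive representation at the future interaction time."""
    x = np.concatenate([h_dep1, h_dep2])  # (2d,) concatenated input
    h = np.maximum(x @ W1, 0.0)           # hidden layer 1 + ReLU
    h = np.maximum(h @ W2, 0.0)           # hidden layer 2 + ReLU
    return h @ W3                         # predicted representation (d,)
```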
4.3.2. Contrastive Objective Function
To train our two-stream encoder in an end-to-end fashion and learn informative dynamic embeddings that aid future interaction prediction, we use the future interaction as our supervisory signal and maximize the mutual information between the two nodes involved in a future interaction by contrasting their predictive representations. A schematic overview of our proposed method is shown in Fig. 3.
Given a list of interactions {(u_i, v_i, t_i)} observed in a time window, our contrastive learning minimizes:

(23) L = −Σ_i log( exp(D(ẑ_{u_i}, ẑ_{v_i})) / Σ_j exp(D(ẑ_{u_i}, ẑ_{v_j})) )

where D(·, ·) denotes a discriminator function that takes two predictive representations as input and scores the agreement between them. Our discriminator function is designed to explore both the additive and the multiplicative interaction relations between ẑ_u and ẑ_v:

(24) D(ẑ_u, ẑ_v) = w_a^T (ẑ_u ‖ ẑ_v) + w_m^T (ẑ_u ⊙ ẑ_v)
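An illustrative discriminator combining an additive term (on the concatenation) and a multiplicative term (on the elementwise product); the exact parameterization here is our assumption, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # illustrative embedding dimension
w_add = rng.normal(size=2 * d) * 0.1  # weights for the additive term
w_mul = rng.normal(size=d) * 0.1      # weights for the multiplicative term

def discriminator(z_u, z_v):
    """Scores agreement between two predictive representations using both
    additive (concatenation) and multiplicative (elementwise product)
    interaction relations, then sums the two scalar scores."""
    additive       = np.concatenate([z_u, z_v]) @ w_add
    multiplicative = (z_u * z_v) @ w_mul
    return additive + multiplicative
```

In the contrastive loss, this scalar score plays the role of the logit for each candidate pair.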
Essentially, the contrastive objective function acts as a multi-way classifier that distinguishes the positive pair from all the negative pairs. In our case, the positive pair is (ẑ_u(t), ẑ_v(t)), i.e., the predictive representations of the two nodes u and v involved in an interaction at time t. All other pairs (ẑ_u(t), ẑ_{v′}(t)) with v′ ≠ v are negative pairs, which means that all items that do not interact with u at time t are treated as negative items. The aim of our contrastive loss is to train our two-stream encoder to maximize the shared information between positive pairs while keeping negative pairs well separated. Compared with existing training methods that model the exact future, our proposed contrastive loss makes the obtained dynamic embeddings more informative and robust to noise. A schematic overview of our training method is shown in Fig. 3. We describe the pseudocode of training in the Appendix.

5. Experiments
We evaluate the effectiveness of TCL by comparing it with baselines on four diverse interaction datasets. An ablation study is conducted to understand which submodule of TCL contributes most to the overall performance. We also conduct parameter sensitivity experiments.
5.1. Experimental Setting
5.1.1. Datasets
We evaluate our proposed model for temporal interaction prediction on four diverse interaction datasets as shown in Table 1.
CollegeMsg (Leskovec and Krevl, 2014). This dataset consists of message-sending activities on a social network at the University of California, Irvine. The dataset we use contains 999 senders and 1,352 receivers, with 48,771 activities in total.
Reddit (Kumar et al., 2019). This public dataset consists of one month of posts made by users on subreddits. We select the most active subreddits as items and the most active users, resulting in 672,447 interactions.
LastFM (Leskovec and Krevl, 2014). We use data from the top 1,000 active users and 1,000 popular songs, where an interaction indicates that a user listened to a song.
Wikipedia (Leskovec and Krevl, 2014). We use data from 998 frequently edited pages and 7,619 active users on Wikipedia, yielding 128,271 interactions in total.
These four datasets come from different real-world scenarios and vary in the number of interactions, interaction repetition density, and time duration, where interaction repetition density is the number of interactions appearing in both the training and test sets divided by the number of test-set interactions.
Dataset                        | CollegeMsg | Wikipedia | LastFM    | Reddit
#u                             | 999        | 7,619     | 1,000     | 9,986
#v                             | 1,352      | 998       | 1,000     | 984
#Interactions                  | 48,771     | 128,271   | 1,293,103 | 672,447
Interaction Repetition Density | 67.28%     | 87.78%    | 88.01%    | 53.97%
Duration (days)                | 193.63     | 30.00     | 1,586.89  | 3,882.00
Group | Method | CollegeMsg | | Wikipedia | | LastFM | | Reddit |
 | | MR | Hit@10 | MR | Hit@10 | MR | Hit@10 | MR | Hit@10
Static | GraphSage* (Hamilton et al., 2017; Veličković et al., 2017) | 269.35±6.12 | 11.56±1.27 | 165.39±3.59 | 33.47±3.81 | 261.67±2.17 | 11.02±1.04 | 50.49±1.14 | 58.63±1.54
Discrete-time | CTDNE (Nguyen et al., 2018) | 607.27±10.65 | 1.85±0.53 | 350.68±6.80 | 11.89±0.62 | 461.59±7.49 | 1.17±0.05 | 51.30±1.37 | 58.55±1.08
 | DynGEM (Goyal et al., 2018) | 337.64±9.30 | 3.47±0.38 | 216.56±5.27 | 16.79±0.21 | 457.68±8.37 | 3.67±0.04 | 52.43±1.75 | 45.37±0.90
 | DySAT (Sankar et al., 2020) | 354.35±8.47 | 2.69±0.47 | 146.59±2.36 | 17.64±0.89 | 292.91±4.28 | 4.63±0.36 | 55.79±1.78 | 56.15±1.14
Continuous-time | JODIE (Kumar et al., 2019) | 286.62±7.14 | 24.09±0.92 | 125.88±1.96 | 41.27±1.03 | 289.53±4.91 | 27.97±1.35 | 56.13±1.78 | 59.11±1.94
 | DeepCoevolve (Dai et al., 2016) | 439.63±11.71 | 2.53±0.21 | 227.74±6.41 | 15.37±0.26 | 376.23±5.38 | 20.44±0.68 | 49.70±1.53 | 62.06±1.02
 | TGAT (Xu et al., 2020) | 206.79±6.37 | 35.49±1.15 | 87.03±2.41 | 72.05±1.37 | 199.24±2.51 | 30.27±0.92 | 45.11±2.19 | 64.33±2.66
 | TGNs (Rossi et al., 2020) | 196.69±4.36 | 36.07±1.39 | 75.47±2.02 | 79.74±1.67 | 192.36±3.02 | 31.07±0.85 | 56.34±2.04 | 61.61±2.73
 | TDIG-MPNN (Chang et al., 2020) | 160.89±1.58 | 39.37±1.31 | 69.28±1.34 | 79.82±0.49 | 180.07±2.43 | 33.29±1.63 | 46.91±0.49 | 71.46±0.08
 | TCL | 137.57±1.70 | 45.53±0.37 | 54.33±1.05 | 81.17±0.16 | 149.31±0.72 | 35.32±0.48 | 33.95±1.27 | 75.95±0.47
 | Improvement | 14.49% | 15.64% | 21.58% | 1.69% | 16.01% | 9.38% | 24.74% | 6.28%
5.1.2. Baselines
We compare TCL with the following algorithms, spanning three categories:


Static Methods. We consider GraphSAGE (Hamilton et al., 2017) with four aggregators, namely GCN, MEAN, MAX and LSTM. A GAT (Veličković et al., 2017) aggregator is also implemented based on the GraphSAGE framework (Hamilton et al., 2017). For convenience, we report the best result among the above five aggregators, denoted as GraphSage*.

Discrete-Time Methods. Three snapshot-based dynamic methods are considered. CTDNE (Nguyen et al., 2018) is an extension of DeepWalk (Perozzi et al., 2014) with a time constraint on sampling order; DynGEM (Goyal et al., 2018) is an autoencoding approach that minimizes the reconstruction loss and learns incremental node embeddings through initialization from previous time steps; DySAT (Sankar et al., 2020) computes dynamic embeddings using structural and temporal attention.

Continuous-Time Methods. Five continuous-time baselines are considered. DeepCoevolve (Dai et al., 2016) learns dynamic embeddings by utilizing a TPP framework to estimate the intensity of two nodes interacting at a future time; JODIE (Kumar et al., 2019) utilizes two RNN models and a projection function to learn dynamic embeddings; TDIG-MPNN (Chang et al., 2020) proposes a dynamic message passing network combined with a selection mechanism to obtain dynamic embeddings; TGAT (Xu et al., 2020) proposes a temporal graph attention layer to aggregate temporal-topological neighborhood features; TGNs (Rossi et al., 2020) extends TGAT by adding a memory module to capture long-term interaction relations.
5.1.3. Evaluation Metrics
Given an interaction (u, v, t), each method outputs node u's preference scores over all items at time t in the test set. We sort the scores in descending order and record the rank of the paired node v. We report the average rank over all interactions in the test data, denoted as Mean Rank (MR). We also report Hit@10, defined as the proportion of test tuples whose ground-truth item appears in the top 10.
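The two metrics can be computed as follows (a straightforward sketch; ties in scores are broken by index order):

```python
import numpy as np

def rank_of_target(scores, target):
    """1-based rank of the ground-truth item under descending scores."""
    order = np.argsort(-scores)
    return int(np.where(order == target)[0][0]) + 1

def evaluate(all_scores, targets, k=10):
    """all_scores: list of per-interaction score vectors over all items;
    targets: list of ground-truth item indices. Returns (MR, Hit@k)."""
    ranks = [rank_of_target(s, t) for s, t in zip(all_scores, targets)]
    mr = float(np.mean(ranks))                       # Mean Rank
    hit_k = float(np.mean([r <= k for r in ranks]))  # Hit@k
    return mr, hit_k
```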
5.1.4. Reproducibility
For GraphSage, CTDNE, DynGEM, DySAT, JODIE, TGAT, and TGNs, we use their open-source implementations. For DeepCoevolve and TDIG-MPNN, we use the source code provided by their authors. More details about parameter settings and implementation can be found in the Appendix.

5.2. Overall Performances
In this section, we evaluate our method and the baselines on the temporal interaction prediction task. All experiments are repeated five times and the averaged results are reported in Table 2, from which we have the following observations: (1) TCL consistently outperforms all the competitors on all the datasets. The improvements of TCL over the second-best results (underlined in Table 2) are 14.49%, 21.58%, 16.01% and 24.74%, respectively, in terms of MR scores. This strong performance verifies the superiority of TCL. (2) On average, the continuous-time methods perform better than the static and discrete-time methods, which can be explained by the fact that fine-grained temporal and structural information is critical in dynamic scenarios. (3) In some scenarios, the discrete-time dynamic methods do not outperform the static methods, contrary to expectation. Similar phenomena have also been observed in previous work, e.g., DyRep (Trivedi et al., 2019) and TDIG-MPNN. A possible explanation is that it is non-trivial to specify the appropriate aggregation granularity (i.e., the number of snapshots) for these scenarios. (4) In some scenarios, the static method GraphSage* performs competitively with the continuous-time baselines. One possible reason is that our datasets contain many repetitive interactions, and such recurring interaction information makes prediction easier, especially for static methods, which can make full use of structural information. (5) TCL and recent work including TGAT, TGNs and TDIG-MPNN surpass JODIE and DeepCoevolve by a large margin, which indicates the importance of exploiting information from the high-order temporal neighborhood (e.g., the k-depth subgraph information used in TCL). (6) TCL performs better than the recent methods TGAT, TGNs and TDIG-MPNN. The reasons could be twofold.
First, all these baseline methods aggregate the temporal information of the two interaction nodes separately, without considering the semantic relatedness between their temporal neighborhoods (i.e., history behaviors), which may be a causal factor of the target interaction; the proposed two-stream encoder, in contrast, utilizes the co-attentional Transformer to capture such interdependencies at the semantic level. Second, we use a contrastive learning loss as our optimization objective, which enables our dynamic embeddings to preserve the high-level (or global) semantics of interactions and makes them robust to noisy information.
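The contrastive objective described above belongs to the InfoNCE family. A minimal NumPy sketch of such a loss, using in-batch negatives and a cosine-similarity critic (the temperature value and function name are illustrative assumptions, not TCL's exact formulation), looks like:

```python
import numpy as np

def infonce_loss(z_u, z_v, temperature=0.1):
    """InfoNCE-style contrastive loss between the predicted
    representations of the two nodes of each future interaction.

    z_u, z_v: (batch, dim) embeddings; row i of z_u and row i of z_v
    form a positive pair, all other rows in the batch act as negatives.
    """
    # Cosine similarity via L2 normalization.
    z_u = z_u / np.linalg.norm(z_u, axis=1, keepdims=True)
    z_v = z_v / np.linalg.norm(z_v, axis=1, keepdims=True)
    logits = z_u @ z_v.T / temperature
    # Softmax cross-entropy with positives on the diagonal.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

Minimizing this loss pulls each positive pair together while pushing apart the embeddings of unrelated interactions in the batch, which is what lets the representations retain global semantics rather than low-level details.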
5.3. Ablation Study
Dataset                  CollegeMsg   Wikipedia    LastFM
Metric                   MR           MR           MR           MR
TCL w/o TE               140.20       54.49        157.86       34.09
TCL w/o DE               139.11       55.85        151.48       35.05
TCL w/o CA-Transformer   146.18       57.42        152.31       35.39
TwoStreamEncoder+TPP     143.31       62.78        153.28       40.61
TCL                      136.13       52.97        149.27       33.67
We perform ablation studies on our TCL framework by removing one specific module at a time to explore their relative importance. The components validated in this section are the positional embedding, the cross-attentional Transformer, and the contrastive learning strategy.
Positional Embedding. We design the graph positional embedding to enhance the positional information of nodes. To this end, the positional embedding module encodes both time-interval information and depth information. To evaluate their importance, we test the removal of the time embedding (TCL w/o TE) and the removal of the depth embedding (TCL w/o DE). From the results shown in Table 3, it can be observed that performance degrades when removing either TE or DE, confirming the effectiveness of encoding both the time intervals between adjacent interactions and the depths of the nodes.
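One way to realize such a positional embedding is to sum an encoding of each node's time interval with a lookup embedding of its depth in the extracted subgraph. The sketch below is a hypothetical illustration only: the sinusoidal time encoding and the fixed random depth table stand in for whatever parameterization TCL actually learns.

```python
import numpy as np

def positional_embedding(time_intervals, depths, dim, max_depth=5):
    """Illustrative positional embedding: sinusoidal time encoding plus
    a depth-lookup embedding. `dim` is assumed even.

    time_intervals: per-node time gap to the adjacent interaction.
    depths: per-node depth in the k-depth temporal subgraph.
    """
    # Transformer-style sinusoidal encoding of the time interval.
    t = np.asarray(time_intervals, dtype=float)[:, None]
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    time_emb = np.zeros((len(t), dim))
    time_emb[:, 0::2] = np.sin(t * freqs)
    time_emb[:, 1::2] = np.cos(t * freqs)
    # Depth embedding: a fixed random table standing in for a learned
    # lookup over depths 0..max_depth.
    rng = np.random.default_rng(0)
    depth_table = rng.normal(size=(max_depth + 1, dim))
    depth_emb = depth_table[np.asarray(depths)]
    return time_emb + depth_emb
```

Ablating "TE" or "DE" then simply corresponds to dropping one of the two summands.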
Cross-attentional Transformer. To obtain informative dynamic representations, we propose a two-stream encoder that includes a cross-attentional Transformer module to aggregate information from the temporal neighborhoods while capturing their mutual influence. To verify its effectiveness, we test the removal of the cross-attentional Transformer (TCL w/o CA-Transformer) from our two-stream encoder. As shown in Table 3, TCL with the default setting outperforms TCL w/o CA-Transformer on all datasets by 5.04% on average, demonstrating the important role of the cross-attentional Transformer in our encoder.
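The core cross-attention pattern, where queries come from one node's stream and keys/values from the other's, can be sketched as follows. This is a single-head, projection-free simplification: the learned Q/K/V projections, residual connections, and feed-forward layers of the full Transformer block are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(h_u, h_v):
    """Cross-attend between two streams of neighborhood features.

    h_u: (n_u, dim) features of node u's temporal neighborhood.
    h_v: (n_v, dim) features of node v's temporal neighborhood.
    Each stream's queries attend over the *other* stream's features,
    so the mutual influence between the two histories is captured.
    """
    d = h_u.shape[1]
    attn_uv = softmax(h_u @ h_v.T / np.sqrt(d))  # u attends to v's history
    attn_vu = softmax(h_v @ h_u.T / np.sqrt(d))  # v attends to u's history
    return attn_uv @ h_v, attn_vu @ h_u
```

Removing the CA-Transformer in the ablation amounts to replacing this cross-stream attention with purely within-stream (self-)attention, so each neighborhood is encoded in isolation.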
Contrastive learning. To improve robustness to noisy interactions, we use contrastive learning as the objective function, maximizing the mutual information between the predictive representations of the two nodes of a future interaction. To evaluate its effectiveness, we compare TCL with a variant that replaces the contrastive learning objective with a TPP objective (TwoStreamEncoder+TPP). It can be observed that TCL outperforms TwoStreamEncoder+TPP by a large margin, i.e., 9.76% on average, which demonstrates the effectiveness of our optimization strategy.
5.4. Parameter Sensitivity
We investigate the impact of parameters on the future interaction prediction performance.
5.4.1. Impact of Subgraph Depth
We explore how the depth of the subgraph impacts performance by plotting the MR metric of TCL at different depths on the four datasets. The results are summarized in Fig. 4. We find that performance gradually improves as the depth increases, which verifies that exploiting information from the high-order temporal neighborhood benefits performance.
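The k-depth temporal subgraph studied here can be thought of as a breadth-first expansion that only follows interactions occurring strictly before the current node's time. The sketch below is a simplified illustration of that idea; TCL's actual sampling strategy may differ, e.g., in how many neighbors are kept per hop.

```python
from collections import defaultdict

def temporal_subgraph(interactions, root, t, k):
    """Collect the k-depth temporal neighborhood of `root` at time `t`.

    interactions: list of (u, v, timestamp) tuples.
    Starting from (root, t), each hop follows only interactions that
    happened strictly before the time carried by the current node.
    Returns the set of (node, depth) pairs reached.
    """
    # Undirected adjacency with interaction timestamps.
    adj = defaultdict(list)
    for u, v, ts in interactions:
        adj[u].append((v, ts))
        adj[v].append((u, ts))

    frontier = [(root, t)]
    visited = {(root, 0)}
    for depth in range(1, k + 1):
        next_frontier = []
        for node, t_node in frontier:
            for nbr, ts in adj[node]:
                if ts < t_node:  # only interactions before the current time
                    visited.add((nbr, depth))
                    next_frontier.append((nbr, ts))
        frontier = next_frontier
    return visited
```

Increasing k enlarges this neighborhood, which matches the observation that deeper subgraphs expose more high-order temporal context to the encoder, at the cost of more computation per interaction.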
5.4.2. Impact of Attention Head Number
Multi-head attention allows the model to jointly attend to information from different representation subspaces. We examine how the number of attention heads in our two-stream encoder impacts performance. We plot the MR metric for different numbers of heads in Fig. 5. We observe that in most cases performance improves as the number of heads increases, which demonstrates the effectiveness of multi-head attention. However, in some cases more heads lead to degraded performance, possibly due to overfitting.
6. Conclusion
In this paper, we propose a novel continuous-time dynamic graph representation learning method, called TCL. TCL generalizes the vanilla Transformer and captures temporal-topological information on dynamic graphs via a two-stream encoder. The proposed contrastive learning objective preserves the high-level semantics of interactions while focusing less on low-level details, which makes it robust to noise. Extensive experiments verify the effectiveness and stability of TCL. In the future, two important problems remain to be considered, i.e., how to effectively model long-term history information on a time-evolving graph and how to scale to large graphs.
References
Layer normalization. arXiv preprint arXiv:1607.06450.
Continuous-time dynamic graph learning via neural interaction processes. In CIKM, pp. 145–154.
A simple framework for contrastive learning of visual representations. In ICML, pp. 1597–1607.
Deep coevolutionary network: embedding user and item features for recommendation. arXiv preprint arXiv:1609.03675.
Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, pp. 3844–3852.
DynGEM: deep embedding method for dynamic graphs. arXiv preprint arXiv:1805.11273.
Inductive representation learning on large graphs. In NeurIPS, pp. 1024–1034.
Contrastive multi-view representation learning on graphs. In ICML, pp. 4116–4126.
LightGCN: simplifying and powering graph convolution network for recommendation. In SIGIR, pp. 639–648.
Long short-term memory. Neural Computation 9(8), pp. 1735–1780.
DeepTrax: embedding graphs of financial transactions. In ICMLA, pp. 126–133.
Contrastive self-supervised learning for commonsense reasoning. arXiv preprint arXiv:2005.00669.
Predicting dynamic embedding trajectory in temporal interaction networks. In KDD, pp. 1269–1278.
SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data
Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
Rectified linear units improve restricted Boltzmann machines. In ICML.
Continuous-time dynamic network embeddings. In Companion Proceedings of The Web Conference 2018 (WWW), pp. 969–976.
Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
EvolveGCN: evolving graph convolutional networks for dynamic graphs. In AAAI, vol. 34, pp. 5363–5370.
DeepWalk: online learning of social representations. In KDD, pp. 701–710.
On variational bounds of mutual information. In ICML, pp. 5171–5180.
Temporal graph networks for deep learning on dynamic graphs. arXiv preprint arXiv:2006.10637.
DySAT: deep neural representation learning on dynamic graphs via self-attention networks. In WSDM, pp. 519–527.
InfoGraph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000.
Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215.
DyRep: learning representations over dynamic graphs. In ICLR.
On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625.
Attention is all you need. arXiv preprint arXiv:1706.03762.
Graph attention networks. arXiv preprint arXiv:1710.10903.
Deep graph infomax. In ICLR.
Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, pp. 9929–9939.
SocialGCN: an efficient graph convolutional network based model for social recommendation. arXiv preprint arXiv:1811.02815.
Inductive representation learning on temporal graphs. arXiv preprint arXiv:2002.07962.
Graph convolutional policy network for goal-directed molecular graph generation. arXiv preprint arXiv:1806.02473.
IntentGC: a scalable graph convolution framework fusing heterogeneous information for recommendation. In KDD, pp. 2347–2357.
Dynamic network embedding by modeling triadic closure process. In AAAI.
Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics 34(13), pp. i457–i466.
Appendix A Reproducibility Supplement
a.1. Settings for Baselines
To enhance the reproducibility of this paper, we first describe the implementation of the baselines, then introduce the experimental environment and hyper-parameter settings, and finally give the pseudo-code of our training process.
Implementation of Baselines. We compare TCL with three categories of methods: (1) Static methods: GraphSage (https://github.com/williamleif/GraphSAGE). (2) Discrete-time methods: CTDNE (https://github.com/stellargraph/stellargraph), DynGEM (https://github.com/palash1992/DynamicGEM) and DySAT (https://github.com/aravindsankar28/DySAT). (3) Continuous-time methods: JODIE (https://github.com/srijankr/jodie), DeepCoevolve, TGAT (https://github.com/StatsDLMathsRecomSys/Inductive-representation-learning-on-temporal-graphs), TGNs (https://github.com/twitter-research/tgn) and TDIG-MPNN.
For GraphSage*, the maximum numbers of 1/2/3/4/5-hop neighbor nodes are set to 25/10/10/10/10. For the discrete-time methods, we search the number of snapshots in {1, 5, 10, 15} for all datasets, the learning rate in {0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1}, and the batch size in {128, 256, 512}, keeping the other hyper-parameters the same as in their published versions to obtain the best results. For the continuous-time methods, we search the learning rate in {0.0001, 0.0005, 0.001, 0.005} and the batch size in {128, 256, 512} to obtain the best results, and keep the other hyper-parameters the same as in their published versions. For a fair comparison, we use the same embedding dimension, maximum number of training epochs, and number of negative samples for all the methods (except for DynGEM and JODIE, which do not require negative samples).
a.2. Settings for TCL
Hyper-parameter               Setting
Learning rate                 0.0005
Optimizer                     Adam
Mini-batch size               512
Node embedding dimension      64
Number of attention heads     16
Depth k of the subgraph       5
Dropout ratio in the input    0.6
Number of training epochs     20
Number of negative samples    5
Number of blocks              1
Dataset Split. We split CollegeMsg into training/validation/test sets with a 60%/20%/20% ratio. For a fair comparison with JODIE, we split LastFM with an 80%/10%/10% ratio. For a fair comparison with TGAT and TGNs, we split Reddit and Wikipedia with a 70%/15%/15% ratio.
a.3. Pseudocode for TCL
The pseudocode of the training procedure for TCL is detailed in Algorithm 1.