1 Introduction
In recent years, graph representation learning has been recognized as a fundamental learning problem and has received much attention due to its widespread use in various domains, including social network analysis (Kipf & Welling, 2017; Hamilton et al., 2017), traffic prediction (Cui et al., 2019; Rahimi et al., 2018; Wang et al., 2019a,b), drug discovery (Do et al., 2019; Duvenaud et al., 2015), and recommendation systems (Berg et al., 2017; Ying et al., 2018). Most existing graph representation learning work focuses on static graphs. However, real-world graphs are intrinsically dynamic: nodes and edges can appear and disappear over time. This dynamic nature motivates dynamic graph representation learning methods that can model temporal evolutionary patterns and accurately predict node properties and future edges.

Recently, several attempts (Sankar et al., 2018; Pareja et al., 2020; Goyal et al., 2018) have been made to generalize graph learning algorithms from static to dynamic graphs by first learning node representations on each static graph snapshot and then aggregating these representations along the temporal dimension. However, these methods are vulnerable to noisy information such as missing or spurious links, because messages are aggregated over unrelated neighbors introduced by noisy connections. Temporal aggregation makes this issue more severe by further carrying the noisy information over time. Over-relying on graph structures thus makes a model sensitive to noisy input and can significantly affect downstream task accuracy. One remedy is to treat the input graph as fully connected and learn a graph topology by assigning lower weights to task-irrelevant edges during training (Devlin et al., 2019). However, completely ignoring the graph structure makes optimization inefficient because the model has to estimate the underlying graph structure while learning model parameters at the same time. To resolve the above challenges, we propose a Transformer-based dynamic graph learning method named Dynamic Graph Transformer (DGT) that can leverage underlying graph structures and capture implicit edge connections to balance this trade-off.

Transformers (Vaswani et al., 2017), designed to automatically capture the interdependencies between tokens in a sequence, have been successfully applied in several domains such as natural language processing (Devlin et al., 2019; Brown et al., 2020) and computer vision (Dosovitskiy et al., 2020; Liu et al., 2021). We attribute the success of Transformers to three main factors, which can also help resolve the aforementioned challenges in dynamic graph representation learning: (1) fully-connected self-attention: by modeling all pairwise node relations, DGT can capture implicit edge connections and thus becomes robust to graphs with noisy information such as missing links; (2) positional encoding: by generalizing positional encoding to the graph domain using spatial-temporal encoding, we can inject both spatial and temporal graph evolutionary information as inductive biases into DGT to learn a graph's evolutionary patterns over time; (3) self-supervised pretraining: by optimizing two complementary pretraining tasks, DGT achieves better performance on downstream tasks.

Though powerful, training Transformers on large-scale graphs is non-trivial due to the quadratic complexity of fully-connected self-attention in the graph size (Zaheer et al., 2020; Wang et al., 2020). This issue is more severe on dynamic graphs, as the computation cost also grows with the number of timesteps (Pareja et al., 2020; Sankar et al., 2018). To make training scalable and independent of both the graph size and the number of timesteps, we first propose a temporal-union graph structure that aggregates graph information from multiple timesteps into a unified meta-graph; we then develop a two-tower architecture with a novel target-context node sampling strategy to model a subset of nodes together with their contextual information. These approaches improve DGT's training efficiency and scalability from both the temporal and spatial perspectives.
To this end, we summarize our contributions as follows: (1) a two-tower Transformer-based method named DGT with spatial-temporal encoding that can capture implicit edge connections in addition to the input graph topology; (2) a temporal-union graph data structure that efficiently summarizes the spatial-temporal information of dynamic graphs, together with a novel target-context node sampling strategy for large-scale training; (3) two complementary pretraining tasks that facilitate downstream tasks and are proven beneficial using information theory; and (4) a comprehensive evaluation on real-world datasets with ablation studies that validate the effectiveness of DGT.
2 Preliminaries and related works
In this section, we first define dynamic graphs, then review related literature on dynamic graph representation learning and Transformers on graphs.
Dynamic graph definition.
The nodes and edges in a dynamic graph may appear and disappear over time.
In this paper, we define a dynamic graph as a sequence of static graph snapshots with a temporal order $\mathcal{G} = \{\mathcal{G}^1, \dots, \mathcal{G}^T\}$, where the $t$-th snapshot $\mathcal{G}^t = (\mathcal{V}, \mathcal{E}^t)$ is an undirected graph with a node set $\mathcal{V}$ shared across all timesteps and an edge set $\mathcal{E}^t$. We denote its adjacency matrix as $\mathbf{A}^t$.
Our goal is to learn a latent representation for each node $i \in \mathcal{V}$ at each timestep $t$, such that the learned representation can be used for any specific downstream task such as link prediction or node classification.
Note that the shared node set $\mathcal{V}$ is not static and is updated when a new snapshot graph arrives, following Sankar et al. (2018); Pareja et al. (2020).
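As a minimal illustration of this snapshot-sequence view, the following sketch (all names are illustrative, not the paper's implementation) maintains a shared, growing node set together with per-timestep edge sets:

```python
import numpy as np

# Hypothetical container for a discrete-time dynamic graph: a shared node set
# that grows as snapshots arrive, plus one undirected edge set per timestep.
class DynamicGraph:
    def __init__(self):
        self.nodes = set()       # shared node set V, updated over time
        self.snapshots = []      # list of edge sets E^t

    def add_snapshot(self, edges):
        # Merge nodes first observed in this snapshot into the shared set.
        self.snapshots.append({tuple(sorted(e)) for e in edges})
        for u, v in edges:
            self.nodes.update((u, v))

    def adjacency(self, t):
        # Dense adjacency matrix A^t of snapshot t over the current node set.
        n = len(self.nodes)
        A = np.zeros((n, n))
        for u, v in self.snapshots[t]:
            A[u, v] = A[v, u] = 1.0
        return A

g = DynamicGraph()
g.add_snapshot([(0, 1), (1, 2)])
g.add_snapshot([(1, 2), (2, 3)])   # node 3 appears only at t = 1
```

Note that, as in the setting above, the adjacency matrix of an earlier snapshot is materialized over the node set accumulated from all snapshots seen so far.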
Dynamic graph representation learning.
Previous dynamic graph representation learning methods usually extend static graph algorithms by further taking temporal information into consideration. They can mainly be classified into three categories: (1) Smoothness-based methods learn a graph autoencoder to generate node embeddings on each graph snapshot and enforce temporal smoothness of the node embeddings across consecutive timesteps. For example, DynGEM (Goyal et al., 2018) uses the learned embeddings from the previous timestep to initialize the embeddings in the next timestep, and DynAERNN applies an RNN to smooth node embeddings at different timesteps. (2) Recurrent-based methods capture temporal dependencies using RNNs. For example, GCRN (Seo et al., 2018) first computes node embeddings on each snapshot using GCN (Defferrard et al., 2016), then feeds the node embeddings into an RNN to learn their temporal dependencies, while EvolveGCN (Pareja et al., 2020) uses an RNN to estimate the GCN weight parameters at different timesteps. (3) Attention-based methods use the self-attention mechanism for both spatial and temporal message aggregation. For example, DySAT (Sankar et al., 2018) uses self-attention for temporal and spatial information aggregation, and TGAT (Xu et al., 2020) encodes temporal information into the node features, then applies self-attention on the temporally augmented node features. However, smoothness-based methods rely heavily on temporal smoothness and are inadequate when nodes exhibit vastly different evolutionary behaviors; recurrent-based methods scale poorly as the number of timesteps increases due to the recurrent nature of RNNs; and attention-based methods only compute self-attention over existing edges and are therefore sensitive to noisy graphs. In contrast, DGT leverages the Transformer to capture the spatial-temporal dependency between all node pairs, does not over-rely on the given graph structure, and is less sensitive to noisy edges.

Graph Transformers. Recently, several attempts have been made to leverage Transformers for graph representation learning. For example, Graphormer (Ying et al., 2021) and GraphTransformer (Dwivedi & Bresson, 2020) use scaled dot-product attention (Vaswani et al., 2017) for message aggregation and generalize the idea of positional encoding to the graph domain.
Graph-Bert (Zhang et al., 2020) first samples an egocentric network for each node, then orders all nodes into a sequence based on node importance and feeds the sequence into a Transformer. However, Graphormer is only feasible for small molecule graphs and cannot scale to large graphs due to the significant computation cost of full attention; GraphTransformer only considers first-hop neighbor aggregation, which makes it sensitive to noisy graphs; Graph-Bert does not leverage the graph topology and can perform poorly when the graph topology is important. In contrast, DGT encodes the input graph structure as an inductive bias to guide the full-attention optimization, which balances the trade-off between robustness to noisy input and efficiently learning an underlying graph structure. A detailed comparison is deferred to Appendix D.
3 Method
In this section, we first introduce the temporal-union graph (in Section 3.1) and our sampling strategy (in Section 3.2), which reduce the overall complexity from the temporal and spatial perspectives, respectively. Then, we introduce our spatial-temporal encoding technique (in Section 3.3), describe the two-tower Transformer architecture design, and explain how to integrate the spatial-temporal encoding into DGT (in Section 3.4). Figure 1 illustrates the overall DGT design.
3.1 Temporal-union graph generation
One major challenge of applying Transformers to graph representation learning is their significant computation and memory overhead. In Transformers, the computation cost of self-attention is $\mathcal{O}(|\mathcal{E}|d)$ and its memory cost is $\mathcal{O}(|\mathcal{E}|)$. When using full attention, the computation graph is fully connected with $|\mathcal{E}| = N^2$, so the overall complexity is quadratic in the graph size. On dynamic graphs, this problem can be even more severe if one naively extends a static graph algorithm to the dynamic setting, e.g., by first extracting the spatial information of each snapshot graph separately, then jointly reasoning about the temporal information over all snapshot graphs (Sankar et al., 2018; Pareja et al., 2020). By doing so, the overall complexity grows linearly with the number of timesteps $T$, i.e., $\mathcal{O}(TN^2 d)$ computation and $\mathcal{O}(TN^2)$ memory cost. To remove the dependency of the overall complexity on the number of timesteps, we propose to first aggregate the dynamic graph into a temporal-union graph $\mathcal{G}_{\text{union}} = (\mathcal{V}, \mathcal{E}_{\text{union}})$ and then employ DGT on the generated temporal-union graph, where $\mathcal{E}_{\text{union}} = \bigcup_{t=1}^{T} \mathcal{E}^t$ is the set of all unique edges appearing in any snapshot. As a result, the overall complexity of DGT does not grow with the number of timesteps. Details on how to leverage the spatial-temporal encoding to recover the temporal information of edges are described in Section 3.3.
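The union construction can be sketched as follows; the function keeps one copy of every edge that appears in any snapshot, together with the timesteps at which it is active (which the temporal connection encoding of Section 3.3 later consumes). Names are illustrative:

```python
# Aggregate T snapshot edge sets into a single temporal-union edge set.
# Each union edge remembers the timesteps at which it is active, so temporal
# information is preserved while attention cost no longer scales with T.
def temporal_union(snapshots):
    union = {}  # undirected edge -> sorted list of active timesteps
    for t, edges in enumerate(snapshots):
        for u, v in edges:
            union.setdefault(tuple(sorted((u, v))), []).append(t)
    return union

snapshots = [[(0, 1), (1, 2)], [(1, 2), (2, 3)], [(0, 1)]]
union = temporal_union(snapshots)
```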
3.2 Target node driven context node sampling
Although the temporal-union graph alleviates the computation burden along the temporal dimension, scaling Transformer training to real-world graphs remains non-trivial due to the overall quadratic complexity of self-attention with respect to the input graph size. Therefore, a properly designed sampling strategy that makes the overall complexity independent of the graph size is necessary. Our goal is to design a subgraph sampling strategy that ensures a fixed number of well-connected nodes and a lower computational complexity. To this end, we propose to first sample a subset of nodes that we are interested in as target nodes, then sample their common neighbors as context nodes.
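As an illustration of this two-step sampling, the sketch below selects context nodes by their joint Personalized PageRank (PPR) score with respect to the target nodes, the criterion the following paragraph describes; for simplicity it computes exact PPR by power iteration rather than the approximate scheme of Andersen et al. (2006), and all names are illustrative:

```python
import numpy as np

def ppr_vector(A, seed, alpha=0.15, iters=100):
    # Exact personalized PageRank by power iteration:
    # pi <- alpha * e + (1 - alpha) * pi @ P, with P row-stochastic and
    # e the indicator vector of the seed node.
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True).clip(min=1)
    e = np.zeros(n)
    e[seed] = 1.0
    pi = e.copy()
    for _ in range(iters):
        pi = alpha * e + (1 - alpha) * pi @ P
    return pi

def sample_context(A, targets, k):
    # Context nodes = top-k nodes by *joint* PPR score over all target seeds.
    joint = sum(ppr_vector(A, s) for s in targets)
    joint[list(targets)] = -np.inf       # exclude the targets themselves
    return np.argsort(-joint)[:k]

# Star graph: node 0 is the common neighbor of targets 1 and 2.
A = np.zeros((4, 4))
for leaf in (1, 2, 3):
    A[0, leaf] = A[leaf, 0] = 1.0
ctx = sample_context(A, [1, 2], k=1)     # picks node 0, the common neighbor
```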
Let the target nodes $\mathcal{V}_{\text{tgt}}$ be the set of nodes whose representations we want to compute. For example, for the link prediction task, $\mathcal{V}_{\text{tgt}}$ is the set of nodes for which we aim to predict connections. Then, the context nodes $\mathcal{V}_{\text{ctx}}$ are sampled as the common neighbors of the target nodes. Since the context nodes are common neighbors of the target nodes, they provide local structure information for nodes in the target node set. Besides, since two different target nodes can be far apart with disconnected neighborhoods, their joint neighborhood provides an approximation of the global view of the full graph. To control the randomness involved in the sampling process, $\mathcal{V}_{\text{ctx}}$ is chosen as the subset of nodes with the top joint Personalized PageRank (PPR) scores (Andersen et al., 2006) with respect to the nodes in $\mathcal{V}_{\text{tgt}}$, where the PPR score is a node proximity measure that captures the importance of one node to another in the graph. More specifically, our joint PPR sampler proceeds as follows. First, we compute the approximate PPR vector $\boldsymbol{\pi}(i)$ for every node $i \in \mathcal{V}_{\text{tgt}}$, where the $j$-th element $\boldsymbol{\pi}_j(i)$ can be interpreted as the probability of a random walk starting at node $i$ ending at node $j$. We then compute the approximate joint PPR vector $\boldsymbol{\pi} = \sum_{i \in \mathcal{V}_{\text{tgt}}} \boldsymbol{\pi}(i)$. Finally, we select as context nodes those with the top joint PPR scores in $\boldsymbol{\pi}$. In practice, we set the context node size equal to the target node size, i.e., $|\mathcal{V}_{\text{ctx}}| = |\mathcal{V}_{\text{tgt}}|$.

3.3 Spatial-temporal encoding
Given the temporal-union graph, our next step is to translate the spatial-temporal information from the snapshot graphs to the temporal-union graph in a form that can be recognized and leveraged by Transformers. Notice that most classical GNNs either over-rely on the given graph structure by only considering the first- or higher-order neighbors for feature aggregation (Ying et al., 2021), or directly learn the graph adjacency without using the given graph structure (Devlin et al., 2019). On the one hand, over-relying on the graph structure makes the model fail to capture the interrelations between nodes that are not connected in the labeled graph, and can make it very sensitive to noisy edges caused by human labeling errors. On the other hand, completely ignoring the graph structure makes the optimization problem challenging because the model has to iteratively learn model parameters and estimate the graph structure. To avoid these two extremes, we present two simple but effective encodings, i.e., temporal connection encoding and spatial distance encoding, and provide details on how to integrate them into DGT.
Temporal connection encoding.
Temporal connection (TC) encoding is designed to
inform DGT whether an edge $(i, j)$ exists in the $t$-th snapshot graph.
We denote by $\mathrm{TC}^t \in \mathbb{R}^{2 \times d}$ the temporal connection encoding lookup table at timestep $t$, where $d$ is the hidden dimension size; it is indexed by a function $\psi^t(i, j)$ indicating whether the edge $(i, j)$ exists at timestep $t$.
More specifically, we have $\psi^t(i, j) = 1$ if $(i, j) \in \mathcal{E}^t$ and $\psi^t(i, j) = 0$ if $(i, j) \notin \mathcal{E}^t$, and use this value as an index to extract the corresponding temporal connection embedding $\mathrm{TC}^t_{\psi^t(i,j)}$ from the lookup table for next-step processing.
Note that during pretraining, or when training on the first few timesteps, we need to mask out certain timesteps to avoid leaking information about the predicted items (e.g., the temporal reconstruction task in Section 4.1).
In these cases, we set $\psi^{t'}(i, j)$ to a null value, where $t'$ denotes the masked timestep, and skip the embedding extraction at time $t'$.
Spatial distance encoding.
Spatial distance (SD) encoding is designed to provide DGT a global view of the graph structure.
The success of the Transformer is largely attributed to its global receptive field due to full attention, i.e., each token in the sequence can attend to all other tokens when computing its representation.
Computing full attention requires the model to explicitly capture the positional dependencies between tokens, which can be achieved either by assigning each position an absolute positional encoding or by encoding the relative distance with a relative positional encoding.
For graphs, however, the design of unique node positions is not straightforward, because a graph is unchanged by permutations of its nodes.
To encode the global structural information of a graph in the model, inspired by Ying et al. (2021), we adopt a spatial distance encoding that measures the relative spatial relationship between any two nodes in the graph, a generalization of the classical Transformer's positional encoding to the graph domain.
Let $D_{\max}$ be the maximum shortest path distance (SPD) we consider, where $D_{\max}$ is a hyperparameter that can be smaller than the graph diameter. More specifically, given any node $i$ and node $j$, we define $\phi(i, j)$ as the SPD between the two nodes if $\mathrm{SPD}(i, j) \le D_{\max}$ and as $D_{\max}$ otherwise. Let $\mathrm{SD} \in \mathbb{R}^{(D_{\max}+1) \times d}$ be the spatial distance lookup table, which is indexed by $\phi(i, j)$; the entry $\mathrm{SD}_{\phi(i,j)}$ is the spatial distance encoding that provides the spatial distance information of the two nodes.
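Under the assumption of an unweighted graph, the clipped shortest-path-distance index can be computed by breadth-first search, with unreachable pairs also mapped to the maximum distance; names (including d_max) are illustrative:

```python
from collections import deque

# BFS shortest-path distances from src, clipped to d_max; nodes unreachable
# from src are also mapped to d_max. The returned values index a learnable
# lookup table of size (d_max + 1) x d.
def clipped_spd(adj, src, d_max):
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return {v: min(dist.get(v, d_max), d_max) for v in adj}

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: []}   # node 4 is isolated
phi = clipped_spd(adj, 0, d_max=2)
```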
Integrate spatialtemporal encoding.
We integrate temporal connection encoding and spatial distance encoding by
projecting them into bias terms in the self-attention module.
Specifically, to integrate the spatial-temporal encoding of a node pair $(i, j)$ into DGT, we first gather all its associated temporal connection encodings at different timesteps as $\{\mathrm{TC}^t_{\psi^t(i,j)}\}_{t=1}^{T}$. We then apply a weighted average over these encodings along the temporal axis and project the temporally averaged encoding to a scalar $b^{\mathrm{TC}}_{(i,j)}$, where the aggregation weights are learned during training. Similarly, to integrate the spatial distance encoding, we project the spatial distance encoding of node pair $(i, j)$ to a scalar $b^{\mathrm{SD}}_{(i,j)}$. Then, $b^{\mathrm{TC}}_{(i,j)}$ and $b^{\mathrm{SD}}_{(i,j)}$ are used as bias terms in the self-attention, which we describe in detail in Section 3.4.
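A minimal numerical sketch of this bias computation, with randomly initialized lookup tables standing in for learned parameters (all shapes and names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, d_max = 3, 8, 4
TC = rng.normal(size=(T, 2, d))       # lookup: timestep x {absent, present} x d
SD = rng.normal(size=(d_max + 1, d))  # lookup: clipped SPD x d
w_time = np.full(T, 1.0 / T)          # temporal aggregation weights (learned)
p_tc = rng.normal(size=d)             # projections to a scalar (learned)
p_sd = rng.normal(size=d)

def attention_bias(active_timesteps, spd):
    # Gather the pair's TC embedding at every timestep, average over time
    # with learned weights, project to a scalar; project SD likewise.
    emb = np.stack([TC[t, int(t in active_timesteps)] for t in range(T)])
    b_tc = float(w_time @ emb @ p_tc)
    b_sd = float(SD[min(spd, d_max)] @ p_sd)
    return b_tc + b_sd                # added to the (i, j) attention logit

b = attention_bias({0, 2}, spd=1)     # edge active at t = 0 and t = 2
```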
3.4 Graph Transformer architecture
As shown in Figure 1, each layer in DGT consists of two towers (i.e., the target-node tower and the context-node tower) that encode the target nodes and the context nodes separately. The same set of parameters is shared between the two towers. The two-tower structure is motivated by the fact that nodes within each group are sampled independently, while neighborhood relationships exist between inter-group nodes. Attending only to inter-group nodes helps DGT better capture this contextual information without fusing representations from irrelevant nodes. The details are as follows:


First, we compute the self-attentions that aggregate information from target nodes to context nodes (denoted "ctx") and from context nodes to target nodes (denoted "tgt"). Let $\mathbf{H}_{\text{ctx}}^{(\ell)}$ denote the $\ell$-th layer output of the context-node tower and $\mathbf{H}_{\text{tgt}}^{(\ell)}$ the $\ell$-th layer output of the target-node tower. Then, the $\ell$-th layer self-attention is computed as
$$\mathbf{A}_{\text{tgt}}^{(\ell)} = \frac{\mathrm{LN}(\mathbf{H}_{\text{tgt}}^{(\ell-1)})\mathbf{W}_Q \big(\mathrm{LN}(\mathbf{H}_{\text{ctx}}^{(\ell-1)})\mathbf{W}_K\big)^\top}{\sqrt{d}}, \qquad \mathbf{A}_{\text{ctx}}^{(\ell)} = \frac{\mathrm{LN}(\mathbf{H}_{\text{ctx}}^{(\ell-1)})\mathbf{W}_Q \big(\mathrm{LN}(\mathbf{H}_{\text{tgt}}^{(\ell-1)})\mathbf{W}_K\big)^\top}{\sqrt{d}},$$
where $\mathrm{LN}(\cdot)$ stands for applying layer normalization and $\mathbf{W}_Q, \mathbf{W}_K$ are weight matrices.

Then, we integrate the spatial-temporal encoding as a bias term to the self-attention as follows:
$$\hat{\mathbf{A}}_{\text{tgt}}^{(\ell)} = \mathbf{A}_{\text{tgt}}^{(\ell)} + \mathbf{B}^{\mathrm{TC}}[\mathcal{V}_{\text{tgt}}, \mathcal{V}_{\text{ctx}}] + \mathbf{B}^{\mathrm{SD}}[\mathcal{V}_{\text{tgt}}, \mathcal{V}_{\text{ctx}}], \qquad \hat{\mathbf{A}}_{\text{ctx}}^{(\ell)} = \mathbf{A}_{\text{ctx}}^{(\ell)} + \mathbf{B}^{\mathrm{TC}}[\mathcal{V}_{\text{ctx}}, \mathcal{V}_{\text{tgt}}] + \mathbf{B}^{\mathrm{SD}}[\mathcal{V}_{\text{ctx}}, \mathcal{V}_{\text{tgt}}],$$
where $\mathbf{B}^{\mathrm{TC}}$ and $\mathbf{B}^{\mathrm{SD}}$ denote the matrix forms of the projected temporal connection and spatial distance self-attention biases, with rows indexed by one node group and columns indexed by the other.¹

¹ Given a matrix $\mathbf{M}$, the element at the $i$-th row and $j$-th column is denoted $\mathbf{M}[i, j]$, and the submatrix formed from rows $\mathcal{I}$ and columns $\mathcal{J}$ is denoted $\mathbf{M}[\mathcal{I}, \mathcal{J}]$.

After that, we use the normalized $\hat{\mathbf{A}}_{\text{tgt}}^{(\ell)}$ and $\hat{\mathbf{A}}_{\text{ctx}}^{(\ell)}$ to propagate information between the two towers, i.e.,
$$\mathbf{M}_{\text{tgt}}^{(\ell)} = \mathrm{softmax}\big(\hat{\mathbf{A}}_{\text{tgt}}^{(\ell)}\big)\, \mathrm{LN}(\mathbf{H}_{\text{ctx}}^{(\ell-1)})\mathbf{W}_V, \qquad \mathbf{M}_{\text{ctx}}^{(\ell)} = \mathrm{softmax}\big(\hat{\mathbf{A}}_{\text{ctx}}^{(\ell)}\big)\, \mathrm{LN}(\mathbf{H}_{\text{tgt}}^{(\ell-1)})\mathbf{W}_V.$$

Finally, a residual-connected feed-forward network is applied to the aggregated messages to produce the final output:
$$\mathbf{H}_{\text{tgt}}^{(\ell)} = \mathrm{FFN}\big(\mathbf{M}_{\text{tgt}}^{(\ell)} + \mathbf{H}_{\text{tgt}}^{(\ell-1)}\big), \qquad \mathbf{H}_{\text{ctx}}^{(\ell)} = \mathrm{FFN}\big(\mathbf{M}_{\text{ctx}}^{(\ell)} + \mathbf{H}_{\text{ctx}}^{(\ell-1)}\big),$$
where $\mathrm{FFN}(\cdot)$ denotes a multi-layer feed-forward network. The final-layer output of the target-node tower is used to compute the loss defined in Section 4.
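One layer of this two-tower exchange can be sketched as follows, under simplifying assumptions (a single attention head, layer normalization and the output FFN omitted, spatial-temporal bias matrices precomputed); all names are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Simplified two-tower layer: target nodes attend only to context nodes and
# vice versa, with shared projection weights and an additive bias on the
# attention logits carrying the spatial-temporal encodings.
def dgt_layer(H_tgt, H_ctx, W_q, W_k, W_v, B_tgt, B_ctx):
    scale = 1.0 / np.sqrt(W_q.shape[1])
    A_tgt = softmax((H_tgt @ W_q) @ (H_ctx @ W_k).T * scale + B_tgt)
    A_ctx = softmax((H_ctx @ W_q) @ (H_tgt @ W_k).T * scale + B_ctx)
    M_tgt = A_tgt @ (H_ctx @ W_v)         # context -> target messages
    M_ctx = A_ctx @ (H_tgt @ W_v)         # target -> context messages
    return H_tgt + M_tgt, H_ctx + M_ctx   # residual; FFN omitted

rng = np.random.default_rng(0)
d = 8
H_tgt = rng.normal(size=(5, d))
H_ctx = rng.normal(size=(5, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
B = np.zeros((5, 5))                      # zero bias for the sketch
out_tgt, out_ctx = dgt_layer(H_tgt, H_ctx, W_q, W_k, W_v, B, B)
```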
4 Dynamic Graph Transformer learning
Transformers usually require a significant amount of supervised data to guarantee generalization to unseen data. However, existing dynamic graph datasets are relatively small and may not be sufficient to train a powerful Transformer. To overcome this challenge, we first pretrain DGT with two complementary self-supervised objective functions (in Section 4.1). Then, we finetune DGT using the supervised objective function (in Section 4.2). Notice that the same set of snapshot graphs, but different objective functions, are used for pretraining and finetuning. Finally, via an information-theoretic analysis, we show that the learned representation can generalize better on downstream tasks when our pretraining losses are optimized (in Section 4.3).
4.1 Pretraining
We introduce a temporal reconstruction loss $\mathcal{L}_{\text{temp}}$ and a multi-view contrastive loss $\mathcal{L}_{\text{mv}}$ as our self-supervised objective functions. Our overall pretraining loss is defined as $\mathcal{L}_{\text{pretrain}} = \mathcal{L}_{\text{temp}} + \alpha \mathcal{L}_{\text{mv}}$, where $\alpha$ is a hyperparameter that balances the importance of the two pretraining tasks, as illustrated in Figure 2.
Temporal reconstruction loss. To ensure that the spatial-temporal encoding is effective and can inform DGT of the temporal dependency between multiple snapshot graphs, we introduce a temporal reconstruction loss as our first pretraining objective. Our goal is to reconstruct the $t$-th graph snapshot's structure $\mathbf{A}^t$ using all but the $t$-th graph snapshot (i.e., $\{\mathcal{G}^1, \dots, \mathcal{G}^T\} \setminus \{\mathcal{G}^t\}$). Let $\mathbf{H}^{\setminus t}$ denote the target-node tower's final-layer output computed without snapshot $t$. To decode the graph structure of snapshot $t$, we use a fully connected layer as the temporal structure decoder that takes $\mathbf{H}^{\setminus t}$ as input and outputs $\mathbf{Z}^t$, with $\mathbf{z}_i^t$ denoting the $i$-th row of $\mathbf{Z}^t$. Then, the temporal reconstruction loss is defined as $\mathcal{L}_{\text{temp}} = \sum_{t=1}^{T} \mathrm{LinkPredLoss}(\mathbf{Z}^t, \mathbf{A}^t)$, where $\sigma(\cdot)$ is the sigmoid function and

$$\mathrm{LinkPredLoss}(\mathbf{Z}, \mathbf{A}) = -\sum_{i,j} \Big( \mathbf{A}_{ij} \log \sigma(\mathbf{z}_i^\top \mathbf{z}_j) + (1 - \mathbf{A}_{ij}) \log \big(1 - \sigma(\mathbf{z}_i^\top \mathbf{z}_j)\big) \Big). \qquad (1)$$
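Assuming the loss takes the standard binary cross-entropy form over inner-product edge scores (a reading consistent with the surrounding text; names are illustrative), it can be sketched as:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Binary cross-entropy between predicted edge probabilities sigmoid(z_i . z_j)
# and the ground-truth adjacency of one snapshot.
def link_pred_loss(Z, A, eps=1e-9):
    S = sigmoid(Z @ Z.T)              # pairwise edge probabilities
    return -np.mean(A * np.log(S + eps) + (1 - A) * np.log(1 - S + eps))

Z = np.array([[2.0, 0.0], [2.0, 0.0], [-2.0, 0.0]])   # decoder outputs
A = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 0]], dtype=float)                 # snapshot adjacency
loss = link_pred_loss(Z, A)
```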
Multi-view contrastive loss. Recall that $\mathcal{V}_{\text{ctx}}$ is constructed by deterministically selecting the common neighbors of $\mathcal{V}_{\text{tgt}}$ with the top PPR scores. We now introduce $\widetilde{\mathcal{V}}_{\text{ctx}}$ as a subset of the common neighbors of $\mathcal{V}_{\text{tgt}}$ randomly sampled, with each node's sampling probability proportional to its PPR score. Since a different set of context nodes is provided for the same set of target nodes, $\widetilde{\mathcal{V}}_{\text{ctx}}$ provides an alternative view of $\mathcal{V}_{\text{ctx}}$ when computing the representations of nodes in $\mathcal{V}_{\text{tgt}}$. Although the provided context nodes differ, since the target nodes are the same, it is natural to expect the computed representations to be highly similar. We denote by $\mathbf{H}$ and $\widetilde{\mathbf{H}}$ the final-layer model outputs computed on $\mathcal{V}_{\text{ctx}}$ and $\widetilde{\mathcal{V}}_{\text{ctx}}$, respectively. To this end, we introduce our second self-supervised objective as $\mathcal{L}_{\text{mv}} = -\frac{1}{2}\big(\cos(\mathbf{H}, \mathrm{StopGrad}(\widetilde{\mathbf{H}})) + \cos(\widetilde{\mathbf{H}}, \mathrm{StopGrad}(\mathbf{H}))\big)$, where StopGrad denotes stopping the gradient. Note that optimizing $\mathcal{L}_{\text{mv}}$ without stopping gradients results in a degenerate solution (Chen & He, 2021; Tian et al., 2021).
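The symmetric stop-gradient loss can be sketched numerically as follows; with plain NumPy the stop-gradient is a no-op (it only matters for backpropagation in an autograd framework, e.g., `detach()` in PyTorch), so only the loss value is illustrated. Names are illustrative:

```python
import numpy as np

def cosine(a, b):
    # Mean row-wise cosine similarity between two representation matrices.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return (a * b).sum(axis=-1).mean()

def multiview_loss(H, H_alt):
    # -0.5 * [cos(H, sg(H_alt)) + cos(H_alt, sg(H))]; sg is the identity here,
    # but must stop gradients during training to avoid a degenerate solution.
    return -0.5 * (cosine(H, H_alt) + cosine(H_alt, H))

H = np.array([[1.0, 0.0], [0.0, 1.0]])
loss_same = multiview_loss(H, H)        # identical views  -> -1.0
loss_orth = multiview_loss(H, H[::-1])  # orthogonal pairs ->  0.0
```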
4.2 Finetuning
To apply the pretrained model to downstream tasks, we finetune it with downstream task objective functions. Here, we take link prediction as an example: the goal is to predict the existence of a link at time $t+1$ using information up to time $t$. Let $\mathbf{Z}^{t+1}$ denote the final-layer output of DGT computed on the snapshot graphs $\{\mathcal{G}^1, \dots, \mathcal{G}^t\}$. Then, the link prediction loss is defined as $\mathcal{L}_{\text{link}} = \mathrm{LinkPredLoss}(\mathbf{Z}^{t+1}, \mathbf{A}^{t+1})$, where LinkPredLoss is defined in Eq. 1.
4.3 On the importance of pretraining from information theory perspective
In this section, we show that our pretraining objectives can improve the generalization error under mild assumptions, resulting in better performance on downstream tasks. Let $X$ denote the input random variable, $S$ the self-supervised signal (also known as a different view of the input $X$), and $Z_X$ the representation generated by a deterministic mapping function $f$. In our setting, the sampled subgraph of the temporal-union graph induced by the nodes $\mathcal{V}_{\text{tgt}} \cup \mathcal{V}_{\text{ctx}}$ is the input $X$, the sampled subgraph induced by the nodes $\mathcal{V}_{\text{tgt}} \cup \widetilde{\mathcal{V}}_{\text{ctx}}$ is the self-supervised signal $S$, and DGT is the mapping $f$ that computes the representation $Z_X = f(X)$. Besides, we introduce the task-relevant information $T$, which refers to the information required for downstream tasks. For example, when the downstream task is link prediction, $T$ can be the ground-truth graph structure about which we want to reason. Notice that in practice we have no access to $T$ during pretraining; it is introduced only for analysis. Furthermore, let $H(\cdot)$ denote entropy, $H(\cdot \mid \cdot)$ conditional entropy, $I(\cdot\, ; \cdot)$ mutual information, and $I(\cdot\, ; \cdot \mid \cdot)$ conditional mutual information. More details and preliminaries on information theory are deferred to Appendix B.

In the following, we study the generalization error of the learned representation in the binary classification setting. We choose the Bayes error rate $P_e$ (i.e., the lowest possible test error rate a binary classifier can achieve) as our evaluation metric, formally defined as $P_e = 1 - \mathbb{E}_{z \sim Z_X}\big[\max_{y} P(y \mid z)\big]$. Before proceeding to our result, we make the following assumption on the input $X$, the self-supervised signal $S$, and the task-relevant information $T$.

Assumption 1.
We assume the task-relevant information $T$ is shared between the input random variable $X$ and the self-supervised signal $S$, i.e., we have $I(X; T \mid S) = 0$ and $I(S; T \mid X) = 0$.
We argue the above assumption is mild because the input $X$ and the self-supervised signal $S$ are two different views of the data and are therefore both expected to contain the task-relevant information $T$. In Proposition 1, we connect the Bayes error rate with our pretraining losses, which explains why the proposed pretraining losses help downstream tasks. We defer the proofs to Appendix B.
Proposition 1.
We can upper bound the Bayes error rate $P_e$, and reduce this upper bound by (i) maximizing the mutual information $I(Z_X; X)$ between the learned representation and the input, which can be achieved by minimizing the temporal reconstruction loss $\mathcal{L}_{\text{temp}}$, and (ii) minimizing the task-irrelevant information $I(Z_X; X \mid S)$ between the learned representation and the input, which can be achieved by minimizing our multi-view loss $\mathcal{L}_{\text{mv}}$.
Proposition 1 suggests that if we can create different views of our input data such that both $X$ and $S$ contain the task-relevant information $T$, then jointly optimizing our two pretraining losses results in representations with a lower Bayes error rate $P_e$. Our analysis is based on the information-theoretic framework developed by Tsai et al. (2020), who show that using a contrastive loss between $Z_X$ and $S$ (i.e., maximizing $I(Z_X; S)$), predicting $S$ from $Z_X$ (i.e., minimizing $H(S \mid Z_X)$), and predicting $Z_X$ from $S$ (i.e., minimizing $H(Z_X \mid S)$) can result in a smaller Bayes error rate $P_e$.
5 Experiments
We evaluate DGT on dynamic graph link prediction, a task widely used to compare dynamic graph representation learning methods (Sankar et al., 2018; Goyal et al., 2018), against a variety of static and dynamic graph representation learning baselines. DGT can also be applied to other downstream tasks such as node classification; we defer node classification results to Appendix A.3.
5.1 Experiment setup
Datasets.
We select five real-world datasets of various sizes and types for our experiments; detailed data statistics are provided in Table 11 in Appendix C.2.
Graph snapshots are created by splitting the data into suitable time windows such that each snapshot contains a comparable number of interactions. Within each snapshot, edge weights are determined by the number of interactions.
Link prediction task.
To compare the performance of DGT with the baselines, we follow the evaluation strategy of Goyal et al. (2018); Zhou et al. (2018); Sankar et al. (2018) by training a logistic regression classifier for dynamic graph link prediction that takes two node embeddings as input. Specifically, we learn dynamic node representations on the snapshot graphs $\{\mathcal{G}^1, \dots, \mathcal{G}^t\}$ and evaluate DGT by predicting links at $t+1$. For evaluation, we consider all links in $\mathcal{G}^{t+1}$ as positive examples and an equal number of sampled unconnected node pairs as negative examples. We split the edge examples into a set for training the classifier, a set for hyperparameter tuning, and a held-out set for model performance evaluation, following the practice of existing studies (e.g., Sankar et al. (2018)). We evaluate link prediction performance using Micro- and Macro-AUC scores, where Micro-AUC is calculated across the link instances from all timesteps and Macro-AUC is computed by averaging the AUC at each timestep. During inference, all nodes in the testing set (from edge examples in $\mathcal{G}^{t+1}$) are selected as target nodes. To scale inference to testing sets of any size, we compute the full attention by first splitting the self-attention into multiple chunks and then iteratively computing the self-attention within each chunk (as shown in Figure 7). Since only a fixed number of self-attention entries is computed at each iteration, we significantly reduce DGT's inference memory consumption. We repeat all experiments three times with different random seeds.

Baselines. We compare against several state-of-the-art static and dynamic graph learning algorithms. For static graph learning algorithms, we compare against Node2Vec (Grover & Leskovec, 2016) and GraphSAGE (Hamilton et al., 2017). To make the comparison fair, we feed these static algorithms the same temporal-union graph used by DGT rather than any single graph snapshot. For dynamic graph learning algorithms, we compare against DynAERNN (Goyal et al., 2020), DynGEM (Goyal et al., 2018), DySAT (Sankar et al., 2018), and EvolveGCN (Pareja et al., 2020). We use the official implementations of all baselines and select the best hyperparameters for both the baselines and DGT.
Notice that we only compare with dynamic graph algorithms that take a set of temporally ordered snapshot graphs as input, and leave the study of other dynamic graph structures (e.g., continuous-time algorithms (Xu et al., 2020; Rossi et al., 2020)) as future work. More details on experiment configurations are deferred to Appendix C, and more results (Figure 4, Tables 2 and 3) are deferred to Appendix A.1.
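The chunked-attention inference described above can be sketched as follows: query rows are processed in fixed-size chunks so that only a chunk-by-N slice of the attention map is materialized at any time, reducing peak memory from O(N^2) to O(chunk * N). Names are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Memory-bounded attention: iterate over chunks of query rows; each iteration
# computes only a (chunk, N) block of attention scores.
def chunked_attention(Q, K, V, chunk=128):
    out = np.empty((Q.shape[0], V.shape[1]))
    scale = 1.0 / np.sqrt(Q.shape[1])
    for s in range(0, Q.shape[0], chunk):
        scores = softmax(Q[s:s + chunk] @ K.T * scale)
        out[s:s + chunk] = scores @ V
    return out

rng = np.random.default_rng(0)
Q = rng.normal(size=(300, 16))
K = rng.normal(size=(300, 16))
V = rng.normal(size=(300, 16))
out = chunked_attention(Q, K, V, chunk=64)  # equals unchunked full attention
```

Because the row-wise softmax is independent across query rows, chunking the rows leaves the result unchanged.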
Table 1: Micro- and Macro-AUC link prediction results of Node2Vec, GraphSAGE, DynAERNN, DynGEM, DySAT, EvolveGCN, and DGT on the Enron, RDS, UCI, Yelp, and ML-10M datasets.
5.2 Experiment results
Table 1 shows the state-of-the-art performance of our approach on link prediction, where DGT achieves a consistent Macro-AUC gain over the baselines on all datasets. Besides, DGT is more stable across different random seeds, as observed from the smaller standard deviation of its AUC scores. Furthermore, to understand the behavior of different methods at a finer granularity, we compare model performance at each timestep in Figure 4 and observe that the performance of DGT is relatively more stable over time than that of other methods. We additionally report dynamic link prediction results evaluated only on unseen links at each timestep, where unseen links are defined as those that first appear at the prediction timestep and are absent from previous graph snapshots. From Table 2, we find that although all methods achieve lower AUC scores, likely because new-link prediction is more challenging, DGT still achieves a consistent Macro-AUC gain over the baselines. Moreover, we compare training time and memory consumption with several baselines in Table 3 and show that DGT maintains good scalability.

5.3 Ablation studies
We conduct ablation studies to further understand DGT and present the details in Appendix A.2.
The effectiveness of pretraining.
We compare the performance of DGT with and without pretraining.
As shown in Figure 3,
DGT's performance improves significantly when it is first pretrained with the self-supervised loss and then finetuned on downstream tasks. When comparing the AUC scores at each timestep, we observe that DGT without pretraining has relatively lower performance and larger variance. This may be due to the vast number of trainable parameters in DGT, which potentially requires more data to train well. Self-supervised pretraining alleviates this challenge by utilizing additional unlabeled input data.

Comparing the two-tower to the single-tower architecture. In Table 4, we compare the performance of DGT with the single- and two-tower designs, where single-tower means full attention over all pairs of target and context nodes. We observe that the two-tower DGT has a consistent Micro- and Macro-AUC gain over the single-tower variant on Yelp and ML-10M. This may be because nodes within the target or context node set are sampled independently, while inter-group nodes are likely to be connected; attending only to inter-group nodes helps DGT better capture this contextual information without fusing representations from irrelevant nodes.
Comparing hop-restricted attention with full attention. To better understand full attention, we compare it with sparser alternatives such as 1-hop and 2-hop attention. These variants are evaluated on the single-tower DGT so that all node pairs are taken into consideration. Table 5 shows that full attention yields a consistent performance gain over the two sparser variants. This demonstrates the benefit of full attention in modeling implicit edge connections in graphs, with a larger receptive field than its hop-restricted counterparts.
The effectiveness of spatial-temporal encoding. In Table 6, we conduct an ablation study that independently removes the two encodings to validate the effectiveness of the spatial-temporal encoding. We observe that even without any encoding (i.e., ignoring the spatial-temporal graph topology), DGT is still very competitive with the state-of-the-art baselines in Table 1, thanks to full attention. However, we also observe a performance gain when adding the spatial connection and temporal distance encodings, which empirically shows their effectiveness.
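One plausible realization of such encodings, in the style of attention-bias methods for graph Transformers (e.g., Ying et al., 2021), adds learned scalar biases indexed by spatial and temporal distance to the raw attention logits; the function name, the lookup-table form, and the clipping are our own assumptions:

```python
import numpy as np

def biased_attention_logits(logits, spd, tdist, spatial_bias, temporal_bias):
    """Add spatial-temporal encodings to raw attention logits.

    spd[i, j]   : shortest-path (spatial) distance between nodes i and j
    tdist[i, j] : temporal distance (in snapshots) between nodes i and j
    spatial_bias / temporal_bias : learned scalar tables indexed by the
    (clipped) distances. Removing an encoding amounts to using an
    all-zero bias table, which recovers plain full attention.
    """
    sb = spatial_bias[np.clip(spd, 0, len(spatial_bias) - 1)]
    tb = temporal_bias[np.clip(tdist, 0, len(temporal_bias) - 1)]
    return logits + sb + tb
```

Under this formulation, the ablations in Table 6 correspond to zeroing one or both bias tables while keeping the rest of the model fixed.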
The effectiveness of stacking more layers. When stacking more layers, traditional GNNs usually suffer from over-smoothing (Zhao & Akoglu, 2020; Yan et al., 2021), resulting in degenerated performance. We study the effect of applying more DGT layers and show the results in Table 7. In contrast to previous studies, DGT has relatively stable performance and does not suffer much from performance degradation as the number of layers increases. This is potentially because DGT only requires a shallow architecture: each individual layer is capable of modeling longer-range dependencies due to full attention. Besides, the self-attention mechanism can automatically attend to important neighbors, thereby alleviating the over-smoothing and bottleneck effects.
6 Conclusion
In this paper, we introduce DGT for dynamic graph representation learning, which can efficiently leverage the graph topology and capture implicit edge connections. To further improve the generalization ability, two complementary pre-training tasks are introduced. To handle large-scale dynamic graphs, a temporal-union graph structure and a target-context node sampling strategy are designed for efficient and scalable training. Extensive experiments on real-world dynamic graphs show that DGT presents significant performance gains over several state-of-the-art baselines. Potential future directions include exploring GNNs on continuous dynamic graphs and studying their expressive power.
Acknowledgements
Part of this work was done during an internship at Facebook AI under the supervision of Yanhong Wu; part of this work was supported by NSF grant 2008398.
References
 Andersen et al. (2006) Reid Andersen, Fan Chung, and Kevin Lang. Local graph partitioning using pagerank vectors. In Foundations of Computer Science, 2006.
 Berg et al. (2017) Rianne van den Berg, Thomas N Kipf, and Max Welling. Graph convolutional matrix completion. In International Conference on Knowledge Discovery & Data Mining, 2017.
 Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020.
 Chen & He (2021) Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Conference on Advances in Neural Information Processing Systems, 2021.
 Cover & Thomas (2006) Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. 2006.

 Cui et al. (2019) Zhiyong Cui, Kristian Henrickson, Ruimin Ke, and Yinhai Wang. Traffic graph convolutional recurrent neural network: A deep learning framework for network-scale traffic learning and forecasting. IEEE Transactions on Intelligent Transportation Systems, 2019.
 Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, 2016.
 Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.
 Do et al. (2019) Kien Do, Truyen Tran, and Svetha Venkatesh. Graph transformation policy network for chemical reaction prediction. In International Conference on Knowledge Discovery & Data Mining, 2019.
 Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
 Duvenaud et al. (2015) David Duvenaud, Dougal Maclaurin, Jorge AguileraIparraguirre, Rafael GómezBombarelli, Timothy Hirzel, Alán AspuruGuzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, 2015.
 Dwivedi & Bresson (2020) Vijay Prakash Dwivedi and Xavier Bresson. A generalization of transformer networks to graphs. arXiv preprint arXiv:2012.09699, 2020.
 Feder & Merhav (1994) Meir Feder and Neri Merhav. Relations between entropy and error probability. IEEE Transactions on Information theory, 40, 1994.
 Goyal et al. (2018) Palash Goyal, Nitin Kamra, Xinran He, and Yan Liu. DynGEM: Deep embedding method for dynamic graphs. CoRR, 2018.
 Goyal et al. (2020) Palash Goyal, Sujit Rokka Chhetri, and Arquimedes Canedo. dyngraph2vec: Capturing network dynamics using dynamic graph representation learning. KnowledgeBased Systems, 187, 2020.
 Grover & Leskovec (2016) Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In International Conference on Knowledge Discovery and Data Mining, 2016.
 Hamilton et al. (2017) William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 2017.
 Hu et al. (2020) Ziniu Hu, Yuxiao Dong, Kuansan Wang, and Yizhou Sun. Heterogeneous graph transformer. In WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20–24, 2020.
 Kipf & Welling (2017) Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.
 Kumar et al. (2019) Srijan Kumar, Xikun Zhang, and Jure Leskovec. Predicting dynamic embedding trajectory in temporal interaction networks. In International Conference on Knowledge Discovery & Data Mining, 2019.
 Liu et al. (2021) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.

 Murphy (2022) Kevin P. Murphy. Probabilistic Machine Learning: An Introduction. 2022.
 Pareja et al. (2020) Aldo Pareja, Giacomo Domeniconi, Jie Chen, Tengfei Ma, Toyotaro Suzumura, Hiroki Kanezashi, Tim Kaler, Tao Schardl, and Charles Leiserson. EvolveGCN: Evolving graph convolutional networks for dynamic graphs. In Conference on Artificial Intelligence, volume 34, 2020.
 Rahimi et al. (2018) Afshin Rahimi, Trevor Cohn, and Timothy Baldwin. Semi-supervised user geolocation via graph convolutional networks. In Proceedings of the Association for Computational Linguistics, 2018.
 Rossi et al. (2020) Emanuele Rossi, Ben Chamberlain, Fabrizio Frasca, Davide Eynard, Federico Monti, and Michael Bronstein. Temporal graph networks for deep learning on dynamic graphs. arXiv preprint arXiv:2006.10637, 2020.
 Sankar et al. (2018) Aravind Sankar, Yanhong Wu, Liang Gou, Wei Zhang, and Hao Yang. Dynamic graph representation learning via self-attention networks. arXiv preprint arXiv:1812.09430, 2018.
 Seo et al. (2018) Youngjoo Seo, Michaël Defferrard, Pierre Vandergheynst, and Xavier Bresson. Structured sequence modeling with graph convolutional recurrent networks. In International Conference on Neural Information Processing, 2018.
 Tian et al. (2021) Yuandong Tian, Xinlei Chen, and Surya Ganguli. Understanding selfsupervised learning dynamics without contrastive pairs. arXiv preprint arXiv:2102.06810, 2021.
 Tsai et al. (2020) YaoHung Hubert Tsai, Yue Wu, Ruslan Salakhutdinov, and LouisPhilippe Morency. Selfsupervised learning from a multiview perspective. arXiv preprint arXiv:2006.05576, 2020.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.

 Wang et al. (2019a) Hongwei Wang, Fuzheng Zhang, Mengdi Zhang, Jure Leskovec, Miao Zhao, Wenjie Li, and Zhongyuan Wang. Knowledge-aware graph neural networks with label smoothness regularization for recommender systems. In International Conference on Knowledge Discovery & Data Mining, 2019a.
 Wang et al. (2020) Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
 Wang et al. (2019b) Xiang Wang, Xiangnan He, Yixin Cao, Meng Liu, and TatSeng Chua. KGAT: knowledge graph attention network for recommendation. In International Conference on Knowledge Discovery & Data Mining, 2019b.
 Xu et al. (2020) Da Xu, Chuanwei Ruan, Evren Körpeoglu, Sushant Kumar, and Kannan Achan. Inductive representation learning on temporal graphs. In International Conference on Learning Representations, 2020.
 Yan et al. (2021) Yujun Yan, Milad Hashemi, Kevin Swersky, Yaoqing Yang, and Danai Koutra. Two sides of the same coin: Heterophily and oversmoothing in graph convolutional neural networks. arXiv preprint arXiv:2102.06462, 2021.
 Ying et al. (2021) Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and TieYan Liu. Do transformers really perform bad for graph representation? arXiv preprint arXiv:2106.05234, 2021.
 Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. Graph convolutional neural networks for webscale recommender systems. In International Conference on Knowledge Discovery & Data Mining, 2018.
 Yun et al. (2019) Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, and Hyunwoo J. Kim. Graph transformer networks. In Advances in Neural Information Processing Systems, 2019.
 Zaheer et al. (2020) Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems, 2020.
 Zhang et al. (2020) Jiawei Zhang, Haopeng Zhang, Congying Xia, and Li Sun. Graph-BERT: Only attention is needed for learning graph representations. arXiv preprint arXiv:2001.05140, 2020.
 Zhao & Akoglu (2020) Lingxiao Zhao and Leman Akoglu. Pairnorm: Tackling oversmoothing in gnns. In International Conference on Learning Representations, 2020.
 Zhou et al. (2020) Dawei Zhou, Lecheng Zheng, Jiawei Han, and Jingrui He. A datadriven graph generative model for temporal interaction networks. In International Conference on Knowledge Discovery & Data Mining, 2020.
 Zhou et al. (2018) L. Zhou, Y. Yang, X. Ren, F. Wu, and Y. Zhuang. Dynamic Network Embedding by Modelling Triadic Closure Process. In Conference on Artificial Intelligence, 2018.
Appendix A More experimental results
A.1 Link prediction results
In this section, we provide the remaining figures and tables from Section 5.
Comparison of AUC scores at different time steps. In Figure 4, we compare the AUC scores of DGT with baselines on the Enron, UCI, Yelp, and ML-10M datasets. We observe that DGT consistently outperforms the baselines on the Enron and ML-10M datasets at all time steps, but has a relatively lower AUC score at certain time steps on the UCI and Yelp datasets. Besides, the performance of DGT is relatively more stable than the baselines across time steps.
Comparison of AUC scores on the new link prediction task. In Table 2, we report dynamic link prediction results evaluated only on the new links at each time step, where a link that appears in the current snapshot but not in the previous snapshots is considered a new link. This experiment provides an in-depth analysis of the capabilities of different methods in predicting unseen links. As shown in Table 2, all methods achieve a lower AUC score, which is expected because new link prediction is more challenging. However, DGT still achieves consistent Macro-AUC gains over baselines, thus illustrating its effectiveness in accurately modeling the temporal context for new link prediction.
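The definition of a new (unseen) link reduces to a set difference over snapshot edge sets; a small sketch (the helper name and undirected-edge normalization are our own assumptions):

```python
def new_links(current_edges, previous_snapshots):
    """Return links that appear in the current snapshot but in none of
    the previous snapshots. Edges are treated as undirected, so each
    pair is normalized to a sorted tuple before comparison."""
    seen = set()
    for snapshot in previous_snapshots:
        seen |= {tuple(sorted(e)) for e in snapshot}
    return {tuple(sorted(e)) for e in current_edges} - seen
```

Evaluating only on this set isolates a model's ability to anticipate genuinely novel interactions rather than re-score links it has already observed.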
[Table 2: Micro- and Macro-AUC of Node2Vec, GraphSAGE, DynAERNN, DynGEM, DySAT, EvolveGCN, and DGT on new link prediction across Enron, RDS, UCI, Yelp, and ML-10M; numeric values missing from the source.]
Computation time and memory consumption. In Table 3, we compare the memory consumption and epoch time on the last time step of the ML-10M and Yelp datasets. We chose the last time step of these two datasets because their graph sizes are relatively larger than the others, which provides a more accurate time and memory estimate. The memory consumption is recorded by nvidia-smi and the time is recorded by the function time.time(). During pre-training, DGT samples target nodes and context nodes at each iteration. During fine-tuning, DGT first samples positive links (links in the graph) and negative links (node pairs that do not exist in the graph), then treats all nodes in the sampled node pairs as target nodes and samples the same number of context nodes. Notice that although the same sampling-size hyperparameters are used, since the graph size and graph density differ, the actual memory consumption and time also differ. For example, since the Yelp dataset has more edges, with more associated nodes for evaluation, than ML-10M, more memory and time are required on Yelp than on ML-10M.

[Table 3: Memory consumption, epoch time, and total time of DySAT, EvolveGCN, and DGT (pre-training and fine-tuning) on ML-10M and Yelp; numeric values missing from the source.]
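A rough sketch of how such measurements can be taken with the tools the text names, time.time() and nvidia-smi (the exact instrumentation used for Table 3 may differ; the helper names are ours):

```python
import subprocess
import time

def gpu_memory_used():
    """Query used GPU memory via nvidia-smi; returns None if no GPU
    (or no nvidia-smi binary) is available on the machine."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None

def time_epoch(train_one_epoch):
    """Wall-clock time of one training epoch, measured with time.time()."""
    start = time.time()
    train_one_epoch()
    return time.time() - start
```

Peak memory should be sampled while the training step is running; a single post-hoc query only reflects the memory still held at that moment.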
A.2 Ablation study results
In this section, we provide the missing tables from Section 5.3; a discussion of the results is given in Section 5.3.
Comparing the two-tower to the single-tower architecture. In Table 4, we compare the Micro-AUC and Macro-AUC scores of DGT with the single-tower and two-tower architectures (in the single-tower DGT, node representations are computed with full attention over all pairs of target and context nodes).
[Table 4: Micro- and Macro-AUC of the single-tower and two-tower DGT on UCI, Yelp, and ML-10M; numeric values missing from the source.]
Comparing k-hop attention with full attention. In Table 5, we compare the performance of the single-tower DGT using full attention, 1-hop attention, and 2-hop attention on the UCI, Yelp, and ML-10M datasets.
[Table 5: Micro- and Macro-AUC of the single-tower DGT with full attention, 1-hop attention, and 2-hop attention on UCI, Yelp, and ML-10M; numeric values missing from the source.]
The effectiveness of spatial-temporal encoding. In Table 6, we validate the effectiveness of the spatial-temporal encoding by independently removing the temporal connection encoding and the spatial distance encoding.
[Table 6: Micro- and Macro-AUC of DGT with both encodings, without any encoding, with only the temporal connection encoding, and with only the spatial distance encoding, on UCI, Yelp, and ML-10M; numeric values missing from the source.]
The effect of the number of layers. In Table 7, we compare the Micro-AUC and Macro-AUC scores of DGT with different numbers of layers on the UCI, Yelp, and ML-10M datasets.
[Table 7: Micro- and Macro-AUC of DGT with different numbers of layers on UCI, Yelp, and ML-10M; numeric values missing from the source.]
A.3 Node classification results
In this section, we show that although DGT is originally designed for the link prediction task, the learned representations of DGT can also be applied to binary node classification. We evaluate DGT on the Wikipedia and Reddit datasets, whose statistics are summarized in Table 11. The snapshots are created in a similar manner as in the link prediction task. As shown in Table 8 and Figure 5, DGT performs better than all baselines on the Wikipedia dataset and better than EvolveGCN on the Reddit dataset. However, the results of DGT on the Reddit dataset are slightly lower than those of DySAT. This is potentially because DGT is less favorable on a dense graph such as Reddit, where very dense graph structure information has to be encoded by the spatial-temporal encodings.
[Table 8: Micro- and Macro-AUC of DySAT, EvolveGCN, and DGT (with and without pre-training) on node classification for Wikipedia and Reddit; numeric values missing from the source.]
A.4 Results on noisy datasets
In this section, we study the effect of noisy input on the performance of DGT using the UCI and Yelp datasets. We achieve this by randomly selecting 10%, 20%, or 50% of the node pairs and changing their connection status, either from connected to not-connected or from not-connected to connected. As shown in Table 9, although the performance of both full attention and 1-hop attention decreases as the noise level increases, full-attention aggregation is more stable and robust as the noise level changes. This is because 1-hop attention relies more on the given structure, while full attention only takes the given structure as a reference and learns the underlying "ground truth" graph structure by gradient descent updates.
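The described perturbation can be sketched as follows; the function name and the symmetric-adjacency representation are our own, but the operation matches the text: a fraction of node pairs have their connection status flipped in both directions.

```python
import numpy as np

def perturb_adjacency(adj, noise_level, rng=None):
    """Flip the connection status of a fraction of node pairs:
    connected pairs become not-connected and vice versa. Only the
    upper triangle is sampled; the matrix is kept symmetric and the
    diagonal is left untouched."""
    rng = np.random.default_rng(rng)
    n = adj.shape[0]
    iu, ju = np.triu_indices(n, k=1)
    num_flip = int(noise_level * len(iu))
    pick = rng.choice(len(iu), size=num_flip, replace=False)
    noisy = adj.copy()
    noisy[iu[pick], ju[pick]] ^= 1   # flip 0 <-> 1
    noisy[ju[pick], iu[pick]] ^= 1
    return noisy
```

Because pairs are drawn uniformly over all node pairs, a sparse graph mostly gains spurious edges under this scheme, which is exactly the regime where structure-reliant 1-hop aggregation degrades.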
[Table 9: Micro- and Macro-AUC of DGT with 1-hop attention and full-attention aggregation on UCI and Yelp under 10%, 20%, and 50% edge noise; numeric values missing from the source.]
A.5 Comparison with continuous-graph learning algorithms
In this section, we compare snapshot-graph-based methods against continuous-graph-based learning algorithms on the UCI, Yelp, and ML-10M datasets. For the continuous-graph learning algorithms, we choose JODIE (Kumar et al., 2019) and TGAT (Xu et al., 2020) as baselines. As shown in Table 10, JODIE and TGAT suffer from significant performance degradation. This is because they are designed to leverage edge features and fine-grained timestamp information for link prediction; however, this information is lacking in existing snapshot graph datasets.
Please note that we compare with continuous-graph algorithms only for the sake of completeness. Since snapshot-graph-based and continuous-graph-based methods require different input graph structures and evaluation strategies, and are designed under different settings, directly comparing the two sets of methods cannot provide much meaningful interpretation. For example, existing works on continuous graphs (Kumar et al., 2019; Xu et al., 2020) select the training and evaluation sets chronologically, taking the earliest fraction of links in the dataset for training and the rest for evaluation. In other words, training and evaluation samples can be arbitrarily close and might even come from the same time step. In the snapshot setting, however, the training set consists of the links in the previous snapshot graphs and evaluation is performed on the following snapshot graph, so training and evaluation samples never come from the same time step. Besides, since the time steps in a continuous graph are more fine-grained than in snapshot graphs, continuous-graph methods suffer from performance degradation when applied to snapshot graph datasets due to the lack of fine-grained timestamp information. For these reasons, existing continuous-graph learning methods (e.g., JODIE, TGAT) only compare with other continuous-graph methods on continuous datasets; similarly, existing snapshot-graph learning methods (e.g., DySAT, EvolveGCN, DynAERNN, DynGEM) also only consider other snapshot-graph-based methods as baselines for comparison.
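The two evaluation protocols can be contrasted in a few lines; the function names are ours, and the training fraction in the continuous protocol is a hypothetical value (the exact ratio used in prior work is not stated here):

```python
def chronological_split(edges, train_fraction):
    """Continuous-graph protocol: sort edges (u, v, t) by timestamp and
    take the earliest fraction for training. Train and test edges can
    be arbitrarily close in time."""
    edges = sorted(edges, key=lambda e: e[2])
    cut = int(len(edges) * train_fraction)
    return edges[:cut], edges[cut:]

def snapshot_split(snapshots, t):
    """Snapshot protocol: train on all snapshots before index t and
    evaluate on snapshot t, so train and test edges never share a
    time step."""
    return snapshots[:t], snapshots[t]
```

The contrast makes the text's point concrete: only the snapshot protocol guarantees a temporal gap between training and evaluation samples.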
[Table 10: Micro- and Macro-AUC of DGT, JODIE, and TGAT on UCI, Yelp, and ML-10M; numeric values missing from the source.]
Appendix B Pre-training can reduce the irreducible error
B.1 Preliminaries on information theory
In this section, we recall preliminaries on information theory that are helpful for understanding the proofs in the following section. More details can be found in textbooks such as Murphy (2022) and Cover & Thomas (2006).
Entropy. Let $X$ be a discrete random variable, $\mathcal{X}$ the sample space, and $x$ an outcome. We define the probability mass function as $p(x) = P(X = x)$. Then, the entropy of a discrete random variable $X$ is defined as
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x), \quad (3)$$
where we use base-2 logarithms. The joint entropy of two random variables $X, Y$ is defined as
$$H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y). \quad (4)$$
The conditional entropy of $Y$ given $X$ is the uncertainty we have in $Y$ after seeing $X$, which is defined as
$$H(Y \mid X) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y \mid x) = H(X, Y) - H(X). \quad (5)$$
Notice that we have $H(Y \mid X) = 0$ if $Y = f(X)$, where $f$ is a deterministic mapping.
Mutual information. Mutual information is a special case of the KL-divergence, which is a measure of the distance between two distributions. The KL-divergence between $p$ and $q$ is defined as
$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}. \quad (6)$$
Then, the mutual information between random variables $X$ and $Y$ is defined as
$$I(X; Y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\, p(y)\big) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}, \quad (7)$$
and we have $I(X; Y) = 0$ if $X$ and $Y$ are independent. Notice that we write $I(X; Y)$, with a semicolon, for the mutual information between $X$ and $Y$; accordingly, $I(X; Y, Z)$ represents the mutual information between $X$ and the pair $(Y, Z)$.
Based on the above definitions of entropy and mutual information, we have the following relation between entropy and mutual information:
$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X), \quad (8)$$
and the following relation between conditional entropy and conditional mutual information:
$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z), \quad (9)$$
where the conditional mutual information $I(X; Y \mid Z)$ can be thought of as the reduction in the uncertainty of $X$ due to knowledge of $Y$ when $Z$ is given.
A figure showing the relation between mutual information and entropy is provided in Figure 6.
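These identities are easy to verify numerically on a small joint distribution; the following check of Eqs. (5), (7), and (8) uses a 2x2 joint table of our own choosing:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# A joint distribution p(x, y) over a 2x2 alphabet.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

h_x = entropy(px)
h_xy = entropy(pxy.ravel())
h_x_given_y = h_xy - entropy(py)               # Eq. (5): H(X|Y) = H(X,Y) - H(Y)
mi = sum(pxy[i, j] * np.log2(pxy[i, j] / (px[i] * py[j]))
         for i in range(2) for j in range(2))  # Eq. (7)

# Eq. (8): I(X;Y) = H(X) - H(X|Y)
assert abs(mi - (h_x - h_x_given_y)) < 1e-9
```

Here $X$ and $Y$ are positively correlated, so the mutual information is strictly positive and the conditional entropy $H(X \mid Y)$ is strictly below $H(X)$.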
Data processing inequality. Random variables $X, Y, Z$ are said to form a Markov chain $X \to Y \to Z$ if the joint probability mass function can be written as $p(x, y, z) = p(x)\, p(y \mid x)\, p(z \mid y)$. Suppose random variables $X, Y, Z$ form a Markov chain $X \to Y \to Z$; then we have $I(X; Y) \geq I(X; Z)$.
Bayes error and entropy. In the binary classification setting, the Bayes error rate is the lowest possible test error rate (i.e., the irreducible error), which can be formally defined as
$$P_e = \mathbb{E}_{x}\Big[1 - \max_{y} P(Y = y \mid X = x)\Big], \quad (10)$$
where $Y$ denotes the label and $X$ denotes the input. Feder & Merhav (1994) derive an upper bound showing the relation between the Bayes error rate and entropy:
$$P_e \leq 1 - \exp\big(-H(Y \mid X)\big), \quad (11)$$
where the conditional entropy is measured in nats. The above inequality is used as the foundation of our following analysis.
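Eqs. (10) and (11) can be checked on a toy binary channel; the channel parameters below are our own illustrative choice:

```python
import numpy as np

# Toy binary channel: p(x) and p(y|x) for inputs x in {0, 1}.
px = np.array([0.5, 0.5])
py_given_x = np.array([[0.9, 0.1],   # p(y | x=0)
                       [0.3, 0.7]])  # p(y | x=1)

# Bayes error rate, Eq. (10): expected probability that the MAP
# prediction of y from x is wrong.
bayes_error = float((px * (1 - py_given_x.max(axis=1))).sum())

# Conditional entropy H(Y|X) in nats.
h_y_given_x = float(-(px[:, None] * py_given_x
                      * np.log(py_given_x)).sum())

# Eq. (11): the Bayes error is bounded via the conditional entropy.
assert bayes_error <= 1 - np.exp(-h_y_given_x)
```

Intuitively, a less noisy channel has smaller $H(Y \mid X)$, which tightens the bound toward zero error.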
B.2 Proof of Proposition 1
In the following, we utilize the analysis framework developed in Tsai et al. (2020) to show the importance of the two pre-training loss functions.
By using Eq. 11, we have the following inequality:
$$P_e \leq 1 - \exp\big(-H(Y \mid Z)\big), \quad (12)$$
where $Z = f(X)$ is the learned representation. By rearranging the above inequality, we have the following upper bound on the Bayes error rate:
$$P_e \leq 1 - \exp\big(-H(Y \mid Z)\big) \overset{(a)}{=} 1 - \exp\big(-H(Y) + I(Y; Z)\big) \overset{(b)}{=} 1 - \exp\big(-H(Y) + I(X; Z) - I(X; Z \mid Y)\big), \quad (13)$$
where equality (a) is due to $H(Y \mid Z) = H(Y) - I(Y; Z)$, and equality (b) is due to $I(Y; Z) = I(X; Z) - I(X; Z \mid Y)$, which holds because $I(Y; Z \mid X) = 0$ when $Z = f(X)$ is a deterministic mapping given the input $X$. Our goal is to find the deterministic mapping function $f$ to generate $Z$ that maximizes $I(Y; Z)$, such that the upper bound on the right-hand side of Eq. 13 is minimized. We can achieve this by:

- Maximizing the mutual information $I(X; Z)$ between the representation $Z$ and the input $X$.
- Minimizing the task-irrelevant information $I(X; Z \mid Y)$, i.e., the mutual information between the representation $Z$ and the input $X$ given the task-relevant information $Y$.

In the following, we first show that minimizing the first pre-training loss can maximize the mutual information $I(X; Z)$, and then we show that minimizing the second pre-training loss can minimize the task-irrelevant information $I(X; Z \mid Y)$.
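The identity $I(Y; Z) = I(X; Z) - I(X; Z \mid Y)$ used in Eq. 13 holds whenever $Z$ is a deterministic function of $X$; it can be verified numerically on a small discrete example (the joint table and the mapping $z = f(x)$ below are our own choices):

```python
import numpy as np

def mi(pab):
    """Mutual information I(A;B) in nats from a joint table p(a, b)."""
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float((pab[mask] * np.log(pab[mask] / (pa @ pb)[mask])).sum())

pxy = np.array([[0.2, 0.1],
                [0.1, 0.2],
                [0.3, 0.1]])            # p(x, y), x in {0,1,2}, y in {0,1}
f = np.array([0, 1, 0])                 # deterministic z = f(x)

# Induced joint tables p(x, z) and p(y, z).
pxz = np.zeros((3, 2))
pyz = np.zeros((2, 2))
for x in range(3):
    pxz[x, f[x]] = pxy[x].sum()
    for y in range(2):
        pyz[y, f[x]] += pxy[x, y]

# Conditional mutual information I(X; Z | Y) = sum_y p(y) I(X; Z | Y=y).
i_xz_given_y = 0.0
for y in range(2):
    py = pxy[:, y].sum()
    pxz_y = np.zeros((3, 2))
    for x in range(3):
        pxz_y[x, f[x]] = pxy[x, y] / py
    i_xz_given_y += py * mi(pxz_y)

# Since Z is deterministic in X: I(Y;Z) = I(X;Z) - I(X;Z|Y).
assert abs(mi(pyz) - (mi(pxz) - i_xz_given_y)) < 1e-9
```

The decomposition makes the optimization target of the two losses concrete: increasing $I(X; Z)$ while shrinking the task-irrelevant part $I(X; Z \mid Y)$ increases $I(Y; Z)$.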
Maximize mutual information $I(X; Z)$. By the relation between mutual information and entropy, $I(X; Z) = H(X) - H(X \mid Z)$, we know that maximizing the mutual information $I(X; Z)$ is equivalent to minimizing the conditional entropy $H(X \mid Z)$. Notice that we ignore $H(X)$ because it depends only on the raw features and is irrelevant to the feature representation $Z$. By the definition of conditional entropy, we have
$$H(X \mid Z) = -\mathbb{E}_{p(x, z)}\big[\log p(x \mid z)\big]. \quad (14)$$