Dynamic Graph Representation Learning via Graph Transformer Networks

by   Weilin Cong, et al.
Penn State University

Dynamic graph representation learning is an important task with widespread applications. Previous methods on dynamic graph learning are usually sensitive to noisy graph information such as missing or spurious connections, which can degrade performance and generalization. To overcome this challenge, we propose a Transformer-based dynamic graph learning method named Dynamic Graph Transformer (DGT) with spatial-temporal encoding to effectively learn graph topology and capture implicit links. To improve the generalization ability, we introduce two complementary self-supervised pre-training tasks and show, via an information-theoretic analysis, that jointly optimizing the two pre-training tasks results in a smaller Bayes error rate. We also propose a temporal-union graph structure and a target-context node sampling strategy for efficient and scalable training. Extensive experiments on real-world datasets illustrate that DGT achieves superior performance compared with several state-of-the-art baselines.




1 Introduction

In recent years, graph representation learning has been recognized as a fundamental learning problem and has received much attention due to its widespread use in various domains, including social network analysis (Kipf & Welling, 2017; Hamilton et al., 2017), traffic prediction (Cui et al., 2019; Rahimi et al., 2018), knowledge graphs (Wang et al., 2019a, b), drug discovery (Do et al., 2019; Duvenaud et al., 2015), and recommendation systems (Berg et al., 2017; Ying et al., 2018). Most existing graph representation learning work focuses on static graphs. However, real-world graphs are intrinsically dynamic: nodes and edges can appear and disappear over time. This dynamic nature of real-world graphs motivates dynamic graph representation learning methods that can model temporal evolutionary patterns and accurately predict node properties and future edges.

Recently, several attempts (Sankar et al., 2018; Pareja et al., 2020; Goyal et al., 2018) have been made to generalize graph learning algorithms from static graphs to dynamic graphs by first learning node representations on each static graph snapshot and then aggregating these representations along the temporal dimension. However, these methods are vulnerable to noisy information such as missing or spurious links, because message aggregation over unrelated neighbors from noisy connections is ineffective. The temporal aggregation makes this issue more severe by carrying the noise forward over time. Over-relying on graph structures makes the model sensitive to noisy input and can significantly affect downstream task accuracy. A remedy is to consider the input graph as fully connected and learn a graph topology by assigning lower weights to task-irrelevant edges during training (Devlin et al., 2019). However, completely ignoring the graph structure makes the optimization inefficient because the model has to estimate the underlying graph structure while learning model parameters at the same time. To resolve the above challenges, we propose a Transformer-based dynamic graph learning method named Dynamic Graph Transformer (DGT) that can “leverage underlying graph structures” and “capture implicit edge connections” to balance this trade-off.

Transformers (Vaswani et al., 2017), designed to automatically capture the inter-dependencies between tokens in a sequence, have been successfully applied in several domains such as Natural Language Processing (Devlin et al., 2019; Brown et al., 2020) and Computer Vision (Dosovitskiy et al., 2020; Liu et al., 2021). We attribute the success of Transformers to three main factors, which can also help resolve the aforementioned challenges in dynamic graph representation learning: (1) fully-connected self-attention: by modeling all pair-wise node relations, DGT can capture implicit edge connections and thus becomes robust to graphs with noisy information such as missing links; (2) positional encoding: by generalizing positional encoding to the graph domain using spatial-temporal encoding, we can inject both spatial and temporal graph evolutionary information as inductive biases into DGT to learn a graph’s evolutionary patterns over time; (3) self-supervised pre-training: by optimizing two complementary pre-training tasks, DGT achieves better performance on the downstream tasks.

Though powerful, training Transformers on large-scale graphs is non-trivial due to the quadratic complexity of the fully connected self-attention on the graph size (Zaheer et al., 2020; Wang et al., 2020). This issue is more severe on dynamic graphs as the computation cost grows with the number of time-steps in a dynamic graph (Pareja et al., 2020; Sankar et al., 2018). To make the training scalable and independent of both the graph size and the number of time-steps, we first propose a temporal-union graph structure that aggregates graph information from multiple time-steps into a unified meta-graph; we then develop a two-tower architecture with a novel target-context node sampling strategy to model a subset of nodes with their contextual information. These approaches improve DGT’s training efficiency and scalability from both the temporal and spatial perspectives.

We summarize our contributions as follows: (1) a two-tower Transformer-based method named DGT with spatial-temporal encoding that can capture implicit edge connections in addition to the input graph topology; (2) a temporal-union graph data structure that efficiently summarizes the spatial-temporal information of dynamic graphs and a novel target-context node sampling strategy for large-scale training; (3) two complementary pre-training tasks that facilitate downstream tasks and are shown to be beneficial using information theory; and (4) a comprehensive evaluation on real-world datasets with ablation studies to validate the effectiveness of DGT.

2 Preliminaries and related works

In this section, we first define dynamic graphs, then review related literature on dynamic graph representation learning and Transformers on graphs.

Dynamic graph definition.  The nodes and edges in a dynamic graph may appear and disappear over time. In this paper, we define a dynamic graph as a sequence of $T$ static graph snapshots with a temporal order $\mathbb{G} = \{\mathcal{G}_1, \ldots, \mathcal{G}_T\}$, where the $t$-th snapshot graph $\mathcal{G}_t = (\mathcal{V}, \mathcal{E}_t)$ is an undirected graph with a node set $\mathcal{V}$ shared across all time-steps and an edge set $\mathcal{E}_t$. We also denote its adjacency matrix as $\mathbf{A}_t$. Our goal is to learn a latent representation of each node at each time-step $t$, such that the learned representation can be used for any specific downstream task such as link prediction or node classification. Note that the shared node set $\mathcal{V}$ is not static and is updated when a new snapshot graph arrives, as in Sankar et al. (2018) and Pareja et al. (2020).
Dynamic graph representation learning.  Previous dynamic graph representation learning methods usually extend static graph algorithms by further taking the temporal information into consideration. They can mainly be classified into three categories: (1) Smoothness-based methods learn a graph autoencoder to generate node embeddings on each graph snapshot and ensure the temporal smoothness of the node embeddings across consecutive time-steps. For example, DynGEM (Goyal et al., 2018) uses the learned embeddings from the previous time-step to initialize the embeddings in the next time-step, and DynAERNN applies an RNN to smooth node embeddings at different time-steps; (2) Recurrent-based methods capture the temporal dependency using an RNN. For example, GCRN (Seo et al., 2018) first computes node embeddings on each snapshot using GCN (Defferrard et al., 2016), then feeds the node embeddings into an RNN to learn their temporal dependency, and EvolveGCN (Pareja et al., 2020) uses an RNN to estimate the GCN weight parameters at different time-steps; (3) Attention-based methods use the self-attention mechanism for both spatial and temporal message aggregation. For example, DySAT (Sankar et al., 2018) proposes to use the self-attention mechanism for temporal and spatial information aggregation, and TGAT (Xu et al., 2020) encodes the temporal information into the node features, then applies self-attention on the temporally augmented node features. However, smoothness-based methods rely heavily on temporal smoothness and are inadequate when nodes exhibit vastly different evolutionary behaviors; recurrent-based methods scale poorly as the number of time-steps increases due to the recurrent nature of RNNs; and attention-based methods only consider self-attention on existing edges and are sensitive to noisy graphs. In contrast, DGT leverages Transformer to capture the spatial-temporal dependency between all node pairs, does not over-rely on the given graph structure, and is less sensitive to noisy edges.
Graph Transformers.  Recently, several attempts have been made to leverage Transformer for graph representation learning. For example, Graphormer (Ying et al., 2021) and GraphTransformer (Dwivedi & Bresson, 2020) use scaled dot-product attention (Vaswani et al., 2017) for message aggregation and generalize the idea of positional encoding to graph domains. GraphBert (Zhang et al., 2020) first samples an egocentric network for each node, then orders all nodes into a sequence based on node importance and feeds the sequence into the Transformer. However, Graphormer is only feasible for small molecule graphs and cannot scale to large graphs due to the significant computation cost of full attention; GraphTransformer only considers first-hop neighbor aggregation, which makes it sensitive to noisy graphs; and GraphBert does not leverage the graph topology and can perform poorly when graph topology is important. In contrast, DGT encodes the input graph structures as an inductive bias to guide the full-attention optimization, which balances the trade-off between robustness to noisy input and efficiently learning an underlying graph structure. A detailed comparison is deferred to Appendix D.

3 Method

In this section, we first introduce the temporal-union graph (in Section 3.1) and our sampling strategy (in Section 3.2), which reduce the overall complexity from the temporal and spatial perspectives respectively. Then, we introduce our spatial-temporal encoding technique (in Section 3.3), describe the two-tower Transformer architecture design, and explain how to integrate the spatial-temporal encoding into DGT (in Section 3.4). Figure 1 illustrates the overall DGT design.

Figure 1: Overview of using DGT for link prediction. Given snapshot graphs $\{\mathcal{G}_t\}_{t=1}^{T}$ as input, (1) we first generate the temporal-union graph with the considered max shortest path distance $D_{\max}$, and its associated (2) temporal connection encoding and (3) spatial distance encoding. Then, the encodings are mapped into the attention biases $\mathbf{A}^{\mathrm{TC}}$ and $\mathbf{A}^{\mathrm{SD}}$ for each node pair using a fully connected layer. To predict whether an edge exists in the future graph $\mathcal{G}_{T+1}$, we first (4) sample target nodes and context nodes, and then apply (5) DGT to encode target nodes and context nodes separately.

3.1 Temporal-union graph generation

One major challenge of applying Transformers to graph representation learning is the significant computation and memory overhead. In Transformers, the computation cost of self-attention is $\mathcal{O}(|\mathcal{E}|)$ and its memory cost is $\mathcal{O}(|\mathcal{E}|)$. When using full attention, the computation graph is fully connected with $|\mathcal{E}| = |\mathcal{V}|^2$, so the overall complexity is quadratic in the graph size. On dynamic graphs, this problem can be even more severe if one naively extends the static graph algorithm to a dynamic graph, e.g., first extracting the spatial information of each snapshot graph separately, then jointly reasoning about the temporal information on all snapshot graphs (Sankar et al., 2018; Pareja et al., 2020). By doing so, the overall complexity grows linearly with the number of time-steps $T$, i.e., with $\mathcal{O}(|\mathcal{V}|^2 T)$ computation and memory cost. To reduce the dependency of the overall complexity on the number of time-steps, we propose to first aggregate the dynamic graphs $\mathbb{G} = \{\mathcal{G}_1, \ldots, \mathcal{G}_T\}$ into a temporal-union graph $\mathcal{G}^{\mathrm{union}} = (\mathcal{V}, \mathcal{E}^{\mathrm{union}})$ and then employ DGT on the generated temporal-union graph, where $\mathcal{E}^{\mathrm{union}} = \bigcup_{t=1}^{T} \mathcal{E}_t$ is the set of all possible unique edges in $\mathbb{G}$. As a result, the overall complexity of DGT does not grow with the number of time-steps. Details on how to leverage spatial-temporal encoding to recover the temporal information of edges are described in Section 3.3.
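As an illustration of the aggregation step, the following minimal sketch builds a temporal-union edge set from a list of snapshot edge sets and records, for each unique edge, the time-steps at which it is present. The function name `build_temporal_union` and the dictionary layout are hypothetical choices made for this example, not the authors' implementation.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]


def build_temporal_union(snapshots: List[Set[Edge]]) -> Dict[Edge, List[int]]:
    """Map each unique undirected edge to a 0/1 presence vector over time-steps.

    Hypothetical helper: the keys form the temporal-union edge set, and the
    values keep the per-time-step information that the union alone would lose.
    """
    num_steps = len(snapshots)
    union: Dict[Edge, List[int]] = {}
    for t, edges in enumerate(snapshots):
        for u, v in edges:
            key = (min(u, v), max(u, v))  # canonical order for undirected edges
            union.setdefault(key, [0] * num_steps)[t] = 1
    return union


# Three toy snapshots over four nodes.
snapshots = [{(0, 1), (1, 2)}, {(1, 0), (2, 3)}, {(1, 2)}]
union = build_temporal_union(snapshots)
```

The number of keys depends only on the number of unique edges, not on how many snapshots contain them, which is the property the temporal-union graph relies on.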

3.2 Target node driven context node sampling

Although the temporal-union graph can alleviate the computation burden from the temporal dimension, scaling the training of Transformers to real-world graphs is still non-trivial because self-attention has quadratic complexity in the input graph size. Therefore, a properly designed sampling strategy that makes the overall complexity independent of the graph size is necessary. Our goal is to design a sub-graph sampling strategy that ensures a fixed number of well-connected nodes and a lower computational complexity. To this end, we propose to first sample a subset of nodes that we are interested in as target nodes, then sample their common neighbors as context nodes.

Let the target nodes $\mathcal{V}_{\mathrm{tgt}} \subseteq \mathcal{V}$ be the set of nodes that we are interested in and whose node representations we want to compute. For example, for the link prediction task, $\mathcal{V}_{\mathrm{tgt}}$ is the set of nodes for which we aim to predict whether they are connected. Then, the context nodes $\mathcal{V}_{\mathrm{ctx}}$ are sampled as the common neighbors of the target nodes. Since context nodes are sampled as the common neighbors of the target nodes, they can provide local structure information for nodes in the target node set. Besides, since two different nodes in the target node set can be far apart with disconnected neighborhoods, the neighborhoods of the two nodes together can provide an approximation of the global view of the full graph. To control the randomness involved in the sampling process, $\mathcal{V}_{\mathrm{ctx}}$ is chosen as the subset of nodes with the top-$K$ joint Personalized PageRank (PPR) score (Andersen et al., 2006) with respect to the nodes in $\mathcal{V}_{\mathrm{tgt}}$, where the PPR score is a node proximity measure that captures the importance of one node to another in the graph. More specifically, our joint PPR sampler proceeds as follows: First, we compute the approximated PPR vector $\boldsymbol{\pi}(u)$ for every node $u \in \mathcal{V}_{\mathrm{tgt}}$, where the $v$-th element of $\boldsymbol{\pi}(u)$ can be interpreted as the probability of a random walk that starts at node $u$ ending at node $v$. We then compute the approximated joint PPR vector $\hat{\boldsymbol{\pi}}(\mathcal{V}_{\mathrm{tgt}}) = \sum_{u \in \mathcal{V}_{\mathrm{tgt}}} \boldsymbol{\pi}(u)$. Finally, we select the $K$ context nodes with the top-$K$ joint PPR scores in $\hat{\boldsymbol{\pi}}(\mathcal{V}_{\mathrm{tgt}})$. In practice, we set the context node size equal to the target node size, i.e., $K = |\mathcal{V}_{\mathrm{tgt}}|$.
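The joint PPR sampler can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the source: PPR is computed by plain power iteration rather than the approximate push algorithm of Andersen et al. (2006), the joint score is taken to be the sum of the per-target PPR vectors, target nodes themselves are excluded from the context set, and the names `ppr_vector` and `sample_context` are hypothetical.

```python
import numpy as np


def ppr_vector(adj: np.ndarray, source: int, alpha: float = 0.15,
               iters: int = 100) -> np.ndarray:
    """Personalized PageRank from `source` by power iteration.

    alpha is the restart probability (an assumed default, not from the paper).
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                      # avoid division by zero
    trans = adj / deg                        # row-stochastic transition matrix
    restart = np.zeros(n)
    restart[source] = 1.0
    pi = restart.copy()
    for _ in range(iters):
        pi = alpha * restart + (1 - alpha) * pi @ trans
    return pi


def sample_context(adj: np.ndarray, targets: list, k: int) -> list:
    """Pick the k non-target nodes with the highest summed PPR score."""
    joint = sum(ppr_vector(adj, u) for u in targets)
    joint[list(targets)] = -np.inf           # assumption: targets are not context
    order = np.argsort(-joint, kind="stable")
    return order[:k].tolist()


# Path graph 0-1-2-3-4-5 with the two end points as target nodes.
adj = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]:
    adj[a, b] = adj[b, a] = 1.0
context = sample_context(adj, targets=[0, 5], k=2)
```

On this toy path the two selected context nodes are the immediate neighbors of the two targets, which matches the intuition that context nodes supply local structure around each target.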

3.3 Spatial-temporal encoding

Given the temporal-union graph, our next step is to translate the spatial-temporal information from the snapshot graphs $\{\mathcal{G}_t\}_{t=1}^{T}$ to the temporal-union graph $\mathcal{G}^{\mathrm{union}}$ in a form that can be recognized and leveraged by Transformers. Notice that most classical GNNs either over-rely on the given graph structure by only considering the first- or higher-order neighbors for feature aggregation (Ying et al., 2021), or directly learn graph adjacency without using the given graph structure (Devlin et al., 2019). On the one hand, over-relying on the graph structure makes the model fail to capture the inter-relation between nodes that are not connected in the labeled graph, and can make it very sensitive to noisy edges caused by human-labeling errors. On the other hand, completely ignoring the graph structure makes the optimization problem challenging because the model has to iteratively learn model parameters and estimate the graph structure. To avoid these two extremes, we present two simple but effective encodings, i.e., temporal connection encoding and spatial distance encoding, and provide details on how to integrate them into DGT.

Temporal connection encoding.  Temporal connection (TC) encoding is designed to inform DGT whether an edge exists in the $t$-th snapshot graph. We denote $\mathbf{E}^{\mathrm{TC}} \in \mathbb{R}^{2 \times d}$ as the temporal connection encoding lookup table, where $d$ represents the hidden dimension size. The table is indexed by a function $\psi(i, j, t)$ indicating whether edge $(i, j)$ exists at time-step $t$. More specifically, we have $\psi(i, j, t) = 1$ if $(i, j) \in \mathcal{E}_t$ and $\psi(i, j, t) = 0$ if $(i, j) \notin \mathcal{E}_t$, and we use this value as an index to extract the corresponding temporal connection embedding from the lookup table for next-step processing. Note that during pre-training or when training on the first few time-steps, we need to mask out certain time-steps to avoid leaking information related to the predicted items (e.g., the temporal reconstruction task in Section 4.1). In these cases, we set $\psi(i, j, t') = \emptyset$, where $t'$ denotes the time-step we mask out, and skip the embedding extraction at time $t'$.
Spatial distance encoding.  Spatial distance (SD) encoding is designed to provide DGT with a global view of the graph structure. The success of Transformer is largely attributed to its global receptive field due to its full attention, i.e., each token in the sequence can attend independently to other tokens and process its representations. Computing full attention requires the model to explicitly capture the positional dependency between tokens, which can be achieved by either assigning each position an absolute positional encoding or encoding the relative distance using a relative positional encoding. However, for graphs, designing unique node positions is not mandatory because a graph is not changed by the permutation of its nodes. To encode the global structural information of a graph in the model, inspired by Ying et al. (2021), we adopt a spatial distance encoding that measures the relative spatial relationship between any two nodes in the graph, which is a generalization of the classical Transformer’s positional encoding to the graph domain. Let $D_{\max}$ be the maximum shortest path distance (SPD) we consider, where $D_{\max}$ is a hyper-parameter that can be smaller than the graph diameter. More specifically, given any node $i$ and node $j$, we define $\phi(i, j)$ as the SPD between the two nodes if $\mathrm{SPD}(i, j) < D_{\max}$ and as $D_{\max}$ otherwise. Let $\mathbf{E}^{\mathrm{SD}}$ be the spatial distance lookup table, which is indexed by $\phi(i, j)$, where $\phi(i, j)$ is used to select the spatial distance encoding that provides the spatial distance information of the two nodes.
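Integrate spatial-temporal encoding. We integrate temporal connection encoding and spatial distance encoding by projecting them as a bias term in the self-attention module. Specifically, to integrate the spatial-temporal encoding of node pair $(i, j)$ into DGT, we first gather all its associated temporal connection encodings at different time-steps as $\mathbf{E}^{\mathrm{TC}}_{ij} = [\mathbf{e}^{\mathrm{TC}}_{\psi(i,j,1)}, \ldots, \mathbf{e}^{\mathrm{TC}}_{\psi(i,j,T)}] \in \mathbb{R}^{T \times d}$. Then, we apply a weighted average to all encodings over the temporal axis and project the temporally averaged encoding to a scalar by $a^{\mathrm{TC}}_{ij} = \mathrm{Linear}(\mathrm{WeightAverage}(\mathbf{E}^{\mathrm{TC}}_{ij})) \in \mathbb{R}$, where the aggregation weight is learned during training. Similarly, to integrate the spatial distance encoding, we project the spatial distance encoding of node pair $(i, j)$ to a scalar by $a^{\mathrm{SD}}_{ij} = \mathrm{Linear}(\mathbf{e}^{\mathrm{SD}}_{\phi(i,j)}) \in \mathbb{R}$. Then, $a^{\mathrm{TC}}_{ij}$ and $a^{\mathrm{SD}}_{ij}$ are used as the bias terms in the self-attention, which we describe in detail in Section 3.4.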
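The two encodings and their projection to scalar attention biases can be sketched as below. This is an illustrative sketch only: the lookup tables and projection vectors are random stand-ins for learned parameters, the temporal weights are assumed to be softmax-normalized, the masking of time-steps is omitted, and the function names `capped_spd` and `attention_bias` are hypothetical.

```python
from collections import deque

import numpy as np


def capped_spd(adj_list, num_nodes, d_max):
    """All-pairs shortest path distance by BFS, clipped at d_max.

    Unreachable pairs and pairs farther apart than d_max both get d_max.
    """
    spd = np.full((num_nodes, num_nodes), d_max, dtype=int)
    for src in range(num_nodes):
        spd[src, src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if spd[src, u] + 1 >= d_max:
                continue                      # farther nodes keep the cap
            for v in adj_list[u]:
                if spd[src, v] == d_max:      # not assigned yet
                    spd[src, v] = spd[src, u] + 1
                    queue.append(v)
    return spd


def attention_bias(presence, spd, d_hidden=8, seed=0):
    """presence: (T, N, N) int 0/1 array; spd: (N, N) capped distances."""
    rng = np.random.default_rng(seed)
    num_steps = presence.shape[0]
    tc_table = rng.normal(size=(2, d_hidden))               # stand-in for E^TC
    sd_table = rng.normal(size=(spd.max() + 1, d_hidden))   # stand-in for E^SD
    time_logits = rng.normal(size=num_steps)                # learned in practice
    w_tc = rng.normal(size=d_hidden)                        # linear projections
    w_sd = rng.normal(size=d_hidden)
    # Assumption: aggregation weights are softmax-normalized over time.
    time_w = np.exp(time_logits) / np.exp(time_logits).sum()
    tc = tc_table[presence]                                 # (T, N, N, d)
    tc_avg = np.einsum("t,tijd->ijd", time_w, tc)           # weighted average
    a_tc = tc_avg @ w_tc                                    # (N, N) scalars
    a_sd = sd_table[spd] @ w_sd                             # (N, N) scalars
    return a_tc, a_sd


# Path graph 0-1-2-3 as the temporal-union graph, observed over two steps.
adj_list = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
spd = capped_spd(adj_list, num_nodes=4, d_max=2)
presence = np.zeros((2, 4, 4), dtype=int)
for t, (i, j) in [(0, (0, 1)), (0, (1, 2)), (1, (2, 3))]:
    presence[t, i, j] = presence[t, j, i] = 1
a_tc, a_sd = attention_bias(presence, spd)
```

Both outputs are dense matrices over all node pairs, so pairs that are not connected in any snapshot still receive a bias value.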

3.4 Graph Transformer architecture

As shown in Figure 1, each layer in DGT consists of two towers (i.e., the target node tower and the context node tower) to encode the target nodes and the context nodes separately. The same set of parameters is shared between the two towers. The two-tower structure is motivated by the fact that nodes within each group are sampled independently but there exist neighborhood relationships between inter-group nodes. Attending only to inter-group nodes helps DGT better capture this context information without fusing representations from irrelevant nodes. The details are as follows:

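  • First, we compute the self-attentions that are used to aggregate information from target nodes to context nodes (denoted as “ctx”) and from context nodes to target nodes (denoted as “tgt”). Let $\mathbf{H}^{(\ell)}_{\mathrm{ctx}}$ denote the $\ell$-th layer output of the context-node tower and $\mathbf{H}^{(\ell)}_{\mathrm{tgt}}$ the $\ell$-th layer output of the target-node tower. Then, the $\ell$-th layer self-attention is computed as

    $$\mathbf{A}^{(\ell)}_{\mathrm{ctx}} = \frac{\big(\mathrm{LN}(\mathbf{H}^{(\ell-1)}_{\mathrm{ctx}}) \mathbf{W}^{(\ell)}_{Q}\big)\big(\mathrm{LN}(\mathbf{H}^{(\ell-1)}_{\mathrm{tgt}}) \mathbf{W}^{(\ell)}_{K}\big)^{\top}}{\sqrt{d}}, \qquad \mathbf{A}^{(\ell)}_{\mathrm{tgt}} = \frac{\big(\mathrm{LN}(\mathbf{H}^{(\ell-1)}_{\mathrm{tgt}}) \mathbf{W}^{(\ell)}_{Q}\big)\big(\mathrm{LN}(\mathbf{H}^{(\ell-1)}_{\mathrm{ctx}}) \mathbf{W}^{(\ell)}_{K}\big)^{\top}}{\sqrt{d}},$$

    where $\mathrm{LN}(\mathbf{H})$ stands for applying layer normalization on $\mathbf{H}$ and $\mathbf{W}^{(\ell)}_{Q}, \mathbf{W}^{(\ell)}_{K}$ are weight matrices.

  • Then, we integrate the spatial-temporal encoding as a bias term in the self-attention as follows

    $$\mathbf{P}^{(\ell)}_{\mathrm{ctx}} = \mathbf{A}^{(\ell)}_{\mathrm{ctx}} + \mathbf{A}^{\mathrm{TC}}_{\mathcal{V}_{\mathrm{ctx}};\mathcal{V}_{\mathrm{tgt}}} + \mathbf{A}^{\mathrm{SD}}_{\mathcal{V}_{\mathrm{ctx}};\mathcal{V}_{\mathrm{tgt}}}, \qquad \mathbf{P}^{(\ell)}_{\mathrm{tgt}} = \mathbf{A}^{(\ell)}_{\mathrm{tgt}} + \mathbf{A}^{\mathrm{TC}}_{\mathcal{V}_{\mathrm{tgt}};\mathcal{V}_{\mathrm{ctx}}} + \mathbf{A}^{\mathrm{SD}}_{\mathcal{V}_{\mathrm{tgt}};\mathcal{V}_{\mathrm{ctx}}},$$

    where $\mathbf{A}^{\mathrm{TC}}_{\mathcal{I};\mathcal{J}}$ and $\mathbf{A}^{\mathrm{SD}}_{\mathcal{I};\mathcal{J}}$ denote the matrix form of the projected temporal connection and spatial distance self-attention bias with rows indexed by $\mathcal{I}$ and columns indexed by $\mathcal{J}$. (Given a matrix $\mathbf{A}$, the element at the $i$-th row and $j$-th column is denoted as $\mathbf{A}_{ij}$, and the submatrix formed from rows $\mathcal{I}$ and columns $\mathcal{J}$ is denoted as $\mathbf{A}_{\mathcal{I};\mathcal{J}}$.)

  • After that, we use the normalized $\mathbf{P}^{(\ell)}_{\mathrm{ctx}}$ and $\mathbf{P}^{(\ell)}_{\mathrm{tgt}}$ to propagate information between the two towers, i.e.,

    $$\mathbf{Z}^{(\ell)}_{\mathrm{ctx}} = \mathrm{Softmax}(\mathbf{P}^{(\ell)}_{\mathrm{ctx}})\, \mathrm{LN}(\mathbf{H}^{(\ell-1)}_{\mathrm{tgt}}) \mathbf{W}^{(\ell)}_{V} + \mathbf{H}^{(\ell-1)}_{\mathrm{ctx}}, \qquad \mathbf{Z}^{(\ell)}_{\mathrm{tgt}} = \mathrm{Softmax}(\mathbf{P}^{(\ell)}_{\mathrm{tgt}})\, \mathrm{LN}(\mathbf{H}^{(\ell-1)}_{\mathrm{ctx}}) \mathbf{W}^{(\ell)}_{V} + \mathbf{H}^{(\ell-1)}_{\mathrm{tgt}}.$$

  • Finally, a residual connected feed-forward network is applied to the aggregated message to produce the final output

    $$\mathbf{H}^{(\ell)}_{\mathrm{ctx}} = \mathrm{FFN}(\mathrm{LN}(\mathbf{Z}^{(\ell)}_{\mathrm{ctx}})) + \mathbf{Z}^{(\ell)}_{\mathrm{ctx}}, \qquad \mathbf{H}^{(\ell)}_{\mathrm{tgt}} = \mathrm{FFN}(\mathrm{LN}(\mathbf{Z}^{(\ell)}_{\mathrm{tgt}})) + \mathbf{Z}^{(\ell)}_{\mathrm{tgt}},$$

    where $\mathrm{FFN}(\cdot)$ denotes the multi-layer feed-forward network. The final layer output of the target node tower will be used to compute the loss defined in Section 4.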
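The four steps above can be sketched as a single layer in NumPy. This is a rough sketch under stated assumptions rather than the authors' implementation: it uses one attention head, pre-layer normalization without learned scale and shift, a residual connection around the attention step, a two-matrix ReLU feed-forward network, and a symmetric bias so that the context-to-target bias is the transpose of the target-to-context bias. The function name `two_tower_layer` and the parameter layout are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)    # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def ffn(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2      # two-layer ReLU network


def two_tower_layer(h_tgt, h_ctx, bias_tgt_ctx, params):
    """One layer with shared weights; each tower attends only to the other.

    bias_tgt_ctx: (n_tgt, n_ctx) sum of the projected TC and SD biases.
    """
    wq, wk, wv, w1, w2 = params
    d = h_tgt.shape[1]
    ln_tgt, ln_ctx = layer_norm(h_tgt), layer_norm(h_ctx)
    # Step 1 and 2: cross-tower attention scores plus structural bias.
    p_tgt = (ln_tgt @ wq) @ (ln_ctx @ wk).T / np.sqrt(d) + bias_tgt_ctx
    p_ctx = (ln_ctx @ wq) @ (ln_tgt @ wk).T / np.sqrt(d) + bias_tgt_ctx.T
    # Step 3: propagate messages between towers (residual is an assumption).
    z_tgt = softmax(p_tgt) @ (ln_ctx @ wv) + h_tgt
    z_ctx = softmax(p_ctx) @ (ln_tgt @ wv) + h_ctx
    # Step 4: residual feed-forward network.
    out_tgt = ffn(layer_norm(z_tgt), w1, w2) + z_tgt
    out_ctx = ffn(layer_norm(z_ctx), w1, w2) + z_ctx
    return out_tgt, out_ctx


rng = np.random.default_rng(0)
d = 8
h_tgt = rng.normal(size=(3, d))              # 3 target nodes
h_ctx = rng.normal(size=(5, d))              # 5 context nodes
bias = rng.normal(size=(3, 5))
params = tuple(rng.normal(scale=0.1, size=(d, d)) for _ in range(5))
out_tgt, out_ctx = two_tower_layer(h_tgt, h_ctx, bias, params)
```

Because each tower only attends to the other tower, the score matrices have shape (targets x contexts) and (contexts x targets) rather than covering all node pairs in the sampled subgraph.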

4 Dynamic Graph Transformer learning

Transformers usually require a significant amount of supervised data to guarantee their generalization ability on unseen data. However, existing dynamic graph datasets are relatively small and may not be sufficient to train a powerful Transformer. To overcome this challenge, we propose to first pre-train DGT with two complementary self-supervised objective functions (in Section 4.1). Then, we fine-tune DGT using the supervised objective function (in Section 4.2). Notice that the same set of snapshot graphs but different objective functions are used for pre-training and fine-tuning. Finally, via an information-theoretic analysis, we show that the representation can have a better generalization ability on downstream tasks by optimizing our pre-training losses (in Section 4.3).

4.1 Pre-training

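We introduce a temporal reconstruction loss $\mathcal{L}_{\mathrm{recon}}$ and a multi-view contrastive loss $\mathcal{L}_{\mathrm{view}}$ as our self-supervised objective functions. Our overall pre-training loss is then defined as a weighted sum of $\mathcal{L}_{\mathrm{recon}}$ and $\mathcal{L}_{\mathrm{view}}$, where $\gamma$ is a hyper-parameter that balances the importance of the two pre-training tasks, as illustrated in Figure 2.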

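Temporal reconstruction loss.  To ensure that the spatial-temporal encoding is effective and can inform DGT of the temporal dependency between multiple snapshot graphs, we introduce a temporal reconstruction loss as our first pre-training objective. Our goal is to reconstruct the structure of the $t$-th graph snapshot $\mathcal{G}_t$ using all but the $t$-th graph snapshot (i.e., $\mathbb{G} \setminus \mathcal{G}_t$). Let $\bar{\mathbf{H}}^{(L)}_{\mathrm{tgt}}(t)$ denote the target-node tower’s final layer output computed on $\mathbb{G} \setminus \mathcal{G}_t$. To decode the graph structure of graph snapshot $\mathcal{G}_t$, we use a fully connected layer as the temporal structure decoder that takes $\bar{\mathbf{H}}^{(L)}_{\mathrm{tgt}}(t)$ as input and outputs $\bar{\mathbf{E}}(t) = \mathrm{Linear}(\bar{\mathbf{H}}^{(L)}_{\mathrm{tgt}}(t))$, with $\bar{\mathbf{e}}_i(t)$ denoting the $i$-th row of $\bar{\mathbf{E}}(t)$. Then, the temporal reconstruction loss is defined as $\mathcal{L}_{\mathrm{recon}} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{LinkPredLoss}(\{\bar{\mathbf{e}}_i(t)\}_{i \in \mathcal{V}_{\mathrm{tgt}}}, \mathcal{E}_t, \mathcal{V}_{\mathrm{tgt}})$, where $\sigma$ is the Sigmoid function and

$$\mathrm{LinkPredLoss}(\{\mathbf{x}_i\}_{i \in \mathcal{V}_{\mathrm{tgt}}}, \mathcal{E}, \mathcal{V}_{\mathrm{tgt}}) = -\sum_{\substack{i, j \in \mathcal{V}_{\mathrm{tgt}} \\ (i,j) \in \mathcal{E}}} \log \sigma(\mathbf{x}_i^{\top} \mathbf{x}_j) \; - \sum_{\substack{i, j \in \mathcal{V}_{\mathrm{tgt}} \\ (i,j) \notin \mathcal{E}}} \log\big(1 - \sigma(\mathbf{x}_i^{\top} \mathbf{x}_j)\big). \qquad (1)$$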

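Multi-view contrastive loss.  Recall that $\mathcal{V}_{\mathrm{ctx}}$ is constructed by deterministically selecting the common neighbors of $\mathcal{V}_{\mathrm{tgt}}$ with the top-$K$ PPR scores. We then introduce $\tilde{\mathcal{V}}_{\mathrm{ctx}}$ as a subset of the common neighbors of $\mathcal{V}_{\mathrm{tgt}}$ that is randomly sampled, with the sampling probability of each node proportional to its PPR score. Since a different set of context nodes is provided for the same set of target nodes, $\tilde{\mathcal{V}}_{\mathrm{ctx}}$ provides an alternative view of $\mathcal{V}_{\mathrm{ctx}}$ when computing the representations for nodes in $\mathcal{V}_{\mathrm{tgt}}$. Although the provided context nodes are different, the target nodes are the same, so it is natural to expect the calculated representations to have high similarity. We denote $\mathbf{H}^{(L)}_{\mathrm{tgt}}$ and $\tilde{\mathbf{H}}^{(L)}_{\mathrm{tgt}}$ as the final layer model outputs computed on $\{\mathcal{V}_{\mathrm{tgt}}, \mathcal{V}_{\mathrm{ctx}}\}$ and $\{\mathcal{V}_{\mathrm{tgt}}, \tilde{\mathcal{V}}_{\mathrm{ctx}}\}$, respectively. To this end, we introduce our second self-supervised objective function $\mathcal{L}_{\mathrm{view}}$, which maximizes the similarity between $\mathbf{H}^{(L)}_{\mathrm{tgt}}$ and $\tilde{\mathbf{H}}^{(L)}_{\mathrm{tgt}}$ while applying StopGrad to one side of the comparison, where StopGrad denotes stop gradient. Note that optimizing $\mathcal{L}_{\mathrm{view}}$ alone without stopping the gradient results in a degenerate solution (Chen & He, 2021; Tian et al., 2021).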
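For the temporal reconstruction objective described earlier in this section, a link prediction loss over decoded target-node embeddings can be sketched as follows. The sketch assumes a binary cross-entropy over sigmoid-activated inner products, summed over unordered target-node pairs; both the exact form and the name `link_pred_loss` are assumptions made for illustration, and the small `eps` term is added only for numerical safety.

```python
import numpy as np


def link_pred_loss(emb, edges, eps=1e-12):
    """Binary cross-entropy on sigmoid(x_i . x_j) over all unordered pairs.

    emb:   (n, d) array of decoded embeddings for the n target nodes.
    edges: iterable of (i, j) index pairs that are linked in the snapshot.
    """
    n = emb.shape[0]
    prob = 1.0 / (1.0 + np.exp(-(emb @ emb.T)))   # pairwise link probability
    label = np.zeros((n, n))
    for i, j in edges:
        label[i, j] = label[j, i] = 1.0
    upper = np.triu_indices(n, k=1)               # each unordered pair once
    p, y = prob[upper], label[upper]
    return float(-(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)).sum())


# Uninformative embeddings: every pair is predicted with probability 0.5.
flat = link_pred_loss(np.zeros((3, 2)), edges=[(0, 1)])

# Embeddings that agree with the single edge (0, 1) give a lower loss.
aligned = link_pred_loss(np.array([[3.0, 0.0], [3.0, 0.0], [-3.0, 0.0]]),
                         edges=[(0, 1)])
```

With all-zero embeddings each of the three pairs contributes log 2, which gives a convenient sanity check for an implementation.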

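Figure 2: Overview of the pre-training. Given snapshot graphs as input, we first generate the temporal-union graph. Then, we sample the target nodes $\mathcal{V}_{\mathrm{tgt}}$ and two different sets of context nodes $\mathcal{V}_{\mathrm{ctx}}$ and $\tilde{\mathcal{V}}_{\mathrm{ctx}}$. After that, we apply DGT on $\{\mathcal{V}_{\mathrm{tgt}}, \mathcal{V}_{\mathrm{ctx}}\}$ and $\{\mathcal{V}_{\mathrm{tgt}}, \tilde{\mathcal{V}}_{\mathrm{ctx}}\}$ to output $\mathbf{H}^{(L)}_{\mathrm{tgt}}$ and $\tilde{\mathbf{H}}^{(L)}_{\mathrm{tgt}}$. To this end, we optimize $\mathcal{L}_{\mathrm{view}}$ by maximizing the similarity between $\mathbf{H}^{(L)}_{\mathrm{tgt}}$ and $\tilde{\mathbf{H}}^{(L)}_{\mathrm{tgt}}$, and optimize $\mathcal{L}_{\mathrm{recon}}$ by recovering snapshot graphs using $\mathbf{H}^{(L)}_{\mathrm{tgt}}$.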

4.2 Fine-tuning

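To apply the pre-trained model to downstream tasks, we fine-tune it with the downstream task objective functions. Here, we take link prediction as an example. Our goal is to predict the existence of a link at time $T+1$ using information up to time $T$. Let $\mathbf{H}^{(L)}_{\mathrm{tgt}}$ denote the final layer output of DGT using snapshot graphs $\{\mathcal{G}_t\}_{t=1}^{T}$, with $\mathbf{h}_i$ denoting its $i$-th row. Then, the link prediction loss is defined as $\mathcal{L}_{\mathrm{LinkPred}} = \mathrm{LinkPredLoss}(\{\mathbf{h}_i\}_{i \in \mathcal{V}_{\mathrm{tgt}}}, \mathcal{E}_{T+1}, \mathcal{V}_{\mathrm{tgt}})$, where LinkPredLoss is defined in Eq. 1.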

4.3 On the importance of pre-training from information theory perspective

In this section, we show that our pre-training objectives can improve the generalization error under mild assumptions and result in better performance on downstream tasks. Let $X$ denote the input random variable, $S$ the self-supervised signal (also known as a different view of input $X$), and $Z = f(X)$ the representations generated by a deterministic mapping function $f$. In our setting, we have the sampled subgraph of the temporal-union graph $\mathcal{G}^{\mathrm{union}}$ induced by nodes $\{\mathcal{V}_{\mathrm{tgt}}, \mathcal{V}_{\mathrm{ctx}}\}$ as input $X$, the sampled subgraph of $\mathcal{G}^{\mathrm{union}}$ induced by nodes $\{\mathcal{V}_{\mathrm{tgt}}, \tilde{\mathcal{V}}_{\mathrm{ctx}}\}$ as self-supervised signal $S$, and DGT as $f$, which computes the representation of $X$ by $Z = f(X)$. Besides, we introduce the task-relevant information as $Y$, which refers to the information that is required for downstream tasks. For example, when the downstream task is link prediction, $Y$ can be the ground truth graph structure about which we want to reason. Notice that in practice we have no access to $Y$ during pre-training; it is only introduced as notation for the analysis. Furthermore, let $H(\cdot)$ denote entropy, $H(\cdot \mid \cdot)$ denote conditional entropy, $I(\cdot\,; \cdot)$ denote mutual information, and $I(\cdot\,; \cdot \mid \cdot)$ denote conditional mutual information. More details and preliminaries on information theory are deferred to Appendix B.

In the following, we study the generalization error of the learned representation $Z$ under the binary classification setting. We choose the Bayes error rate $P_e$ (i.e., the lowest possible test error rate a binary classifier can achieve) as our evaluation metric, which can be formally defined as $P_e = \mathbb{E}_{z}\big[1 - \max_{y} P(Y = y \mid Z = z)\big]$. Before proceeding to our result, we make the following assumption on input $X$, self-supervised signal $S$, and task-relevant information $Y$.

Assumption 1.

We assume the task-relevant information is shared between the input random variable $X$ and the self-supervised signal $S$, i.e., we have $I(X; Y \mid S) = 0$ and $I(S; Y \mid X) = 0$.

We argue the above assumption is mild because input $X$ and self-supervised signal $S$ are two different views of the data, and therefore both are expected to contain the task-relevant information $Y$. In Proposition 1, we make connections between the Bayes error rate and the pre-training losses, which explains why the proposed pre-training losses are helpful for downstream tasks. We defer the proofs to Appendix B.

Proposition 1.

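We can upper bound the Bayes error rate by $P_e \le 1 - \exp\!\big(-\big(H(Y) + I(Z; X \mid Y) - I(Z; X)\big)\big)$, and reduce the upper bound of $P_e$ by (1) maximizing the mutual information $I(Z; X)$ between the learned representation $Z$ and input $X$, which can be achieved by minimizing the temporal reconstruction loss $\mathcal{L}_{\mathrm{recon}}$, and (2) minimizing the task-irrelevant information $I(Z; X \mid Y)$ between the learned representation $Z$ and input $X$, which can be achieved by minimizing our multi-view loss $\mathcal{L}_{\mathrm{view}}$.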
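To see why both quantities named in the proposition matter, the following short derivation relates the conditional entropy of the task-relevant information given the representation to the two mutual information terms. It uses only standard identities, with $X$ the input, $Y$ the task-relevant information, and $Z = f(X)$ the representation; it is offered as a sketch of the intuition and is not the authors' proof from Appendix B.

```latex
\begin{align*}
H(Y \mid Z) &= H(Y) - I(Z; Y), \\
I(Z; X, Y)  &= I(Z; X) + I(Z; Y \mid X) = I(Z; Y) + I(Z; X \mid Y), \\
I(Z; Y)     &= I(Z; X) - I(Z; X \mid Y)
              \quad \text{since } I(Z; Y \mid X) = 0 \text{ when } Z = f(X), \\
\Rightarrow\; H(Y \mid Z) &= H(Y) - I(Z; X) + I(Z; X \mid Y).
\end{align*}
```

Increasing $I(Z; X)$ or decreasing $I(Z; X \mid Y)$ therefore lowers $H(Y \mid Z)$, and any bound on the Bayes error rate that is monotone in $H(Y \mid Z)$ becomes tighter.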

Proposition 1 suggests that if we can create a different view $S$ of our input data $X$ such that both $X$ and $S$ contain the task-relevant information $Y$, then jointly optimizing our two pre-training losses can result in a representation $Z$ with a lower Bayes error rate $P_e$. Our analysis is based on the information theory framework developed in Tsai et al. (2020), which shows that using a contrastive loss between $Z$ and $S$ (i.e., maximizing $I(Z; S)$), predicting $S$ from $Z$ (i.e., minimizing $H(S \mid Z)$), and predicting $Z$ from $S$ (i.e., minimizing $H(Z \mid S)$) can result in a smaller Bayes error rate $P_e$.

5 Experiments

We evaluate DGT using dynamic graph link prediction, which has been widely used in (Sankar et al., 2018; Goyal et al., 2018) to compare its performance with a variety of static and dynamic graph representation learning baselines. Besides, DGT can also be applied to other downstream tasks such as node classification. We defer node classification results to Appendix A.3.

5.1 Experiment setup

Datasets. We select five real-world datasets of various sizes and types in our experiments. The detailed data statistics can be accessed at Table 11 in Appendix C.2. Graph snapshots are created by splitting the data using suitable time windows such that each snapshot has an equitable number of interactions. In each snapshot, the edge weights are determined by the number of interactions.
Link prediction task.  To compare the performance of DGT with baselines, we follow the evaluation strategy in (Goyal et al., 2018; Zhou et al., 2018; Sankar et al., 2018) by training a logistic regression classifier that takes two node embeddings as input for dynamic graph link prediction. Specifically, we learn the dynamic node representations on snapshot graphs $\{\mathcal{G}_1, \ldots, \mathcal{G}_T\}$ and evaluate DGT by predicting links at $\mathcal{G}_{T+1}$. For evaluation, we consider all links in $\mathcal{G}_{T+1}$ as positive examples and an equal number of sampled unconnected node pairs as negative examples. We split the edge examples into one portion for training the classifier, one portion for hyper-parameter tuning, and the remainder for model performance evaluation, following the practice of existing studies (e.g., Sankar et al., 2018). We evaluate the link prediction performance using Micro- and Macro-AUC scores, where the Micro-AUC is calculated across the link instances from all the time-steps while the Macro-AUC is computed by averaging the AUC at each time-step. During inference, all nodes in the testing set (from edge samples in $\mathcal{G}_{T+1}$) are selected as the target nodes. To scale inference to testing sets of any size, we compute the full attention by first splitting all self-attentions into multiple chunks and then iteratively computing the self-attention in each chunk (as shown in Figure 7). Since only a fixed number of self-attentions is computed at each iteration, we significantly reduce DGT’s inference memory consumption. We also repeat all experiments three times with different random seeds.
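The chunked inference idea can be illustrated with a small NumPy sketch. It assumes the chunks are taken over query rows, which is one natural reading of the description and may differ from the scheme in Figure 7; the function names are hypothetical. Because softmax is applied row by row, processing queries in chunks gives the same result as one full attention call while only materializing a chunk-sized score matrix at a time.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)    # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def full_attention(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[1])) @ v


def chunked_attention(q, k, v, chunk_size):
    """Same output as full_attention, computed chunk_size query rows at a time."""
    d = q.shape[1]
    outputs = []
    for start in range(0, q.shape[0], chunk_size):
        scores = q[start:start + chunk_size] @ k.T / np.sqrt(d)
        outputs.append(softmax(scores) @ v)  # only chunk_size rows in memory
    return np.concatenate(outputs, axis=0)


rng = np.random.default_rng(0)
q = rng.normal(size=(10, 4))
k = rng.normal(size=(7, 4))
v = rng.normal(size=(7, 4))
out_chunked = chunked_attention(q, k, v, chunk_size=3)
out_full = full_attention(q, k, v)
```

The peak size of the score matrix drops from (queries x keys) to (chunk_size x keys), at the cost of a Python-level loop over chunks.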
Baselines.  We compare with several state-of-the-art methods as baselines, including both static and dynamic graph learning algorithms. For static graph learning algorithms, we compare against Node2Vec (Grover & Leskovec, 2016) and GraphSAGE (Hamilton et al., 2017). To make the comparison fair, we feed these static graph algorithms the same temporal-union graph used in DGT rather than any single graph snapshot. For dynamic graph learning algorithms, we compare against DynAERNN (Goyal et al., 2020), DynGEM (Goyal et al., 2018), DySAT (Sankar et al., 2018), and EvolveGCN (Pareja et al., 2020). We use the official implementations for all baselines and select the best hyper-parameters for both baselines and DGT. Notice that we only compare with dynamic graph algorithms that take a set of temporally ordered snapshot graphs as input, and leave the study of other dynamic graph structures (e.g., continuous time-step algorithms (Xu et al., 2020; Rossi et al., 2020)) as a future direction. More details on experiment configurations are deferred to Appendix C and more results (Figure 4, Tables 2 and 3) are deferred to Appendix A.1.

Method Metric Enron RDS UCI Yelp ML-10M
Node2Vec Micro-AUC
GraphSAGE Micro-AUC
DynGEM Micro-AUC
EvolveGCN Micro-AUC
Table 1: Comparing DGT with baselines using Micro- and Macro-AUC on real-world datasets.

5.2 Experiment results

Table 1 indicates the state-of-the-art performance of our approach on link prediction tasks, where DGT achieves a consistent Macro-AUC gain on all datasets. Besides, DGT is more stable across different random seeds, as observed from a smaller standard deviation of the AUC score. Furthermore, to better understand the behaviors of different methods at a finer granularity, we compare the model performance at each time-step in Figure 4 and observe that the performance of DGT is relatively more stable than other methods over time. In addition, we report the results of dynamic link prediction evaluated only on unseen links at each time-step. Here, we define unseen links as the ones that first appear at the prediction time-step but are not in the previous graph snapshots. From Table 2, we find that although all methods achieve a lower AUC score, which may be because new link prediction is more challenging, DGT still achieves a consistent Macro-AUC gain over baselines. Moreover, we compare the training time and memory consumption with several baselines in Table 3 and show that DGT maintains good scalability.

Figure 3: Comparison of the Micro- and Macro-AUC score of DGT with and without pre-training.

5.3 Ablation studies

We conduct ablation studies to further understand DGT and present the details in Appendix A.2.
The effectiveness of pre-training.  We compare the performance of DGT with and without pre-training. As shown in Figure 3, DGT's performance is significantly improved if we first pre-train it with the self-supervised loss and then fine-tune it on downstream tasks. When comparing the AUC scores at each time-step, we observe that DGT without pre-training has relatively lower performance but larger variance. This may be due to the vast number of trainable parameters in DGT, which potentially requires more data to be trained well. The self-supervised pre-training alleviates this challenge by utilizing additional unlabeled input data.
Comparing two-tower to single-tower architecture.  In Table 4, we compare the performance of DGT with single- and two-tower designs, where single-tower means full attention over all pairs of target and context nodes. We observe that the two-tower DGT has a consistent performance gain ( Micro- and Macro-AUC) over the single-tower on Yelp and ML-10M. This may be because the nodes within the target or context node set are sampled independently, while inter-group nodes are likely to be connected. Attending only to inter-group nodes helps DGT better capture this contextual information without fusing representations from irrelevant nodes.
Comparing -hop attention with full attention.  To better understand full attention, we compare it with sparser attention variants such as -hop and -hop attention. These variants are evaluated based on the single-tower DGT so that all node pairs are taken into consideration. Table 5 shows the results, where we observe that full attention presents a consistent performance gain of around over the other two variants. This demonstrates the benefits of full attention when modeling implicit edge connections in graphs, since it has a larger receptive field compared to its -hop counterparts.
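To illustrate the difference in receptive field, the sketch below restricts scaled dot-product attention to node pairs within a given number of hops and contrasts it with an all-ones mask (full attention). This is a generic construction under our own assumptions (dense adjacency matrix, a single head, and the hypothetical helper names `k_hop_mask` and `masked_attention`), not the DGT implementation.

```python
import numpy as np

def k_hop_mask(adj, k):
    """Boolean mask whose entry (i, j) is True if j is within k hops of i (self included)."""
    a = (np.asarray(adj) != 0).astype(np.int64)
    n = a.shape[0]
    reach = np.eye(n, dtype=np.int64)      # 0 hops: every node reaches itself
    frontier = np.eye(n, dtype=np.int64)
    for _ in range(k):
        # Nodes reachable by a walk that is one step longer than before.
        frontier = (frontier @ a > 0).astype(np.int64)
        reach = ((reach + frontier) > 0).astype(np.int64)
    return reach.astype(bool)

def masked_attention(q, k_mat, v, mask):
    """Scaled dot-product attention; pairs outside the mask receive zero weight."""
    scores = q @ k_mat.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

On a path graph 0-1-2-3, a 1-hop mask lets node 0 attend only to nodes 0 and 1, whereas the full mask lets it attend to node 3 directly, which is the kind of implicit connection the comparison above is probing.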
The effectiveness of spatial-temporal encoding.  In Table 6, we conduct an ablation study by independently removing the two encodings to validate the effectiveness of spatial-temporal encoding. We observe that even without any encoding (i.e., ignoring the spatial-temporal graph topologies), due to full attention, DGT is still very competitive compared with the state-of-the-art baselines in Table 1. However, we also observe a performance gain when adding the temporal connection and spatial distance encodings, which empirically shows their effectiveness.
The effectiveness of stacking more layers.  When stacking more layers, traditional GNNs usually suffer from over-smoothing (Zhao & Akoglu, 2020; Yan et al., 2021), which results in degenerated performance. We study the effect of applying more DGT layers and show results in Table 7. In contrast to previous studies, DGT has a relatively stable performance and does not suffer much from performance degradation when the number of layers increases. This is potentially because DGT only requires a shallow architecture, since each individual layer is capable of modeling longer-range dependencies due to full attention. Besides, the self-attention mechanism can automatically attend to important neighbors, thereby alleviating the over-smoothing and bottleneck effects.

6 Conclusion

In this paper, we introduce DGT for dynamic graph representation learning, which can efficiently leverage the graph topology and capture implicit edge connections. To further improve the generalization ability, two complementary pre-training tasks are introduced. To handle large-scale dynamic graphs, a temporal-union graph structure and a target-context node sampling strategy are designed for efficient and scalable training. Extensive experiments on real-world dynamic graphs show that DGT presents significant performance gains over several state-of-the-art baselines. Potential future directions include exploring GNNs on continuous dynamic graphs and studying their expressive power.


Part of this work was done during an internship at Facebook AI under the supervision of Yanhong Wu, and part of this work was supported by NSF grant 2008398.


Appendix A More experiment results

A.1 Link prediction results

In this section, we provide the remaining figures and tables referenced in Section 5.

Comparison of AUC score at different time steps.  In Figure 4, we compare the AUC score of DGT with baselines on the Enron, UCI, Yelp, and ML-10M datasets. We observe that DGT consistently outperforms baselines on the Enron and ML-10M datasets at all time steps, but has a relatively lower AUC score at certain time steps on the UCI and Yelp datasets. Besides, the performance of DGT is relatively more stable than baselines across time steps.

Figure 4: Comparison of DGT with baselines across multiple time steps, where the Macro-AUC score is reported in the box next to the curves.

Comparison of AUC score on new link prediction task.  In Table 2, we report the dynamic link prediction results evaluated only on the new links at each time step, where a link that appears in the current snapshot but not in the previous snapshot is considered a new link. This experiment provides an in-depth analysis of the capabilities of different methods in predicting unseen links. As shown in Table 2, all methods achieve a lower AUC score, which is expected because new link prediction is more challenging. However, DGT still achieves consistent gains of Macro-AUC over baselines, which illustrates its effectiveness in accurately capturing temporal context for new link prediction.
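As a minimal sketch of how the new-link evaluation set could be built, the function below keeps only the edges of the snapshot at step `t` that do not occur in any earlier snapshot. The name `unseen_links`, the edge-list representation, the undirected treatment of edges, and the choice to check all earlier snapshots (rather than only the immediately preceding one) are assumptions made for illustration.

```python
def unseen_links(snapshots, t):
    """Return edges of snapshot t that appear in no earlier snapshot.

    snapshots: list of edge lists, one per time step; edges are (u, v) tuples.
    Edges are treated as undirected, so (u, v) and (v, u) are the same link.
    """
    def canon(edge):
        u, v = edge
        return (min(u, v), max(u, v))

    seen = {canon(e) for snapshot in snapshots[:t] for e in snapshot}
    current = {canon(e) for e in snapshots[t]}
    return sorted(current - seen)
```

For example, with snapshots `[[(0, 1), (1, 2)], [(1, 0), (2, 3)], [(0, 1), (2, 3), (3, 4)]]`, only `(3, 4)` counts as a new link at step 2, since both other edges were already observed.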

Method Metric Enron RDS UCI Yelp ML-10M
Node2Vec Micro-AUC
GraphSAGE Micro-AUC
DynGEM Micro-AUC
EvolveGCN Micro-AUC
Table 2: Comparison of Micro- and Macro-AUC on real-world datasets restricted to new edges.

Computation time and memory consumption.  In Table 3, we compare the memory consumption and epoch time on the last time step of the ML-10M and Yelp datasets. We choose the last time step of these two datasets because its graph size is relatively larger than the others, which provides a more accurate time and memory estimation. The memory consumption is recorded by nvidia-smi and the time is recorded by the function time.time(). During pre-training, DGT samples target nodes and context nodes at each iteration. During fine-tuning, DGT first samples positive links (links in the graph) and negative links (node pairs that do not exist in the graph), then treats all nodes in the sampled node pairs as target nodes and samples the same number of context nodes. Notice that although the same sampling size hyper-parameter is used, since the graph size and the graph density are different, the actual memory consumption and time are also different. For example, since the Yelp dataset has more edges with more associated nodes for evaluation than ML-10M, more memory and time are required on Yelp than on the ML-10M dataset.
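The fine-tuning sampling step described above can be sketched as follows. The function name `sample_link_batch`, its parameters, and the uniform rejection sampling of negatives are illustrative assumptions; context-node sampling is omitted because its strategy is not specified in this section, and the loop assumes the graph has at least `num_pos` unconnected node pairs.

```python
import random

def sample_link_batch(edges, num_nodes, num_pos, seed=0):
    """Sample positive links, an equal number of negative links, and the target nodes."""
    rng = random.Random(seed)
    edge_set = {(min(u, v), max(u, v)) for u, v in edges}  # undirected, deduplicated
    positives = rng.sample(sorted(edge_set), num_pos)

    negatives = set()
    while len(negatives) < num_pos:
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        pair = (min(u, v), max(u, v))
        # Keep only node pairs that are distinct and not connected in the graph.
        if u != v and pair not in edge_set:
            negatives.add(pair)
    negatives = sorted(negatives)

    # Every endpoint of a sampled pair becomes a target node.
    target_nodes = sorted({n for pair in positives + negatives for n in pair})
    return positives, negatives, target_nodes
```

Because the number of target nodes depends on how many distinct endpoints the sampled pairs touch, the same sampling size can lead to different batch sizes on graphs of different density, which is consistent with the differing memory and time figures noted above.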

Dataset Method Memory consumption Epoch time Total time
ML-10M DySAT GB Sec Sec ( epochs)
EvolveGCN GB Sec Sec ( epochs)
DGT (Pre-training) GB Sec Sec ( epochs)
DGT (Fine-tuning) GB Sec Sec ( epochs)
Yelp DySAT GB Sec Sec ( epochs)
EvolveGCN GB Sec Sec ( epochs)
DGT (Pre-training) GB Sec Sec ( epochs)
DGT (Fine-tuning) GB Sec Sec ( epochs)
Table 3: Comparison of the epoch time and memory consumption of DGT with baseline methods on the last time step of ML-10M and Yelp dataset using the neural architecture configuration summarized in Section C.4.

A.2 Ablation study results

In this section, we provide the tables referenced in Section 5.3, where the results are discussed.

Compare two-tower to single-tower architecture.  In Table 4, we compare the Micro-AUC and Macro-AUC scores of DGT with single-tower (see footnote 2) and two-tower structures on the UCI, Yelp, and ML-10M datasets.

Footnote 2: The node representation in the single-tower DGT is computed by

Method Metric UCI Yelp ML-10M
Single-tower Micro-AUC
Two-tower Micro-AUC
Table 4: Comparison of the Micro- and Macro-AUC of DGT using single-tower and two-tower model architecture on the real-world datasets.

Compare -hop attention with full attention.  In Table 5, we compare the performance of “single-tower DGT using full-attention”, “single-tower DGT using -hop attention”, and “single-tower DGT using -hop attention” on the UCI, Yelp, and ML-10M datasets.

Method Metric UCI Yelp ML-10M
Full attention Micro-AUC
-hop neighbor Micro-AUC
-hop neighbor Micro-AUC
Table 5: Comparison of the Micro- and Macro-AUC of full attention and -hop attention using the single-tower architecture on the real-world datasets.

The effectiveness of spatial-temporal encoding.  In Table 6, we validate the effectiveness of spatial-temporal encoding by independently removing the temporal connection encoding and the spatial distance encoding.

Method Metric UCI Yelp ML-10M
With both encoding Micro-AUC
Without any encoding Micro-AUC
Only temporal connective encoding Micro-AUC
Only spatial distance encoding Micro-AUC
Table 6: Comparison of the Micro- and Macro-AUC of with and without spatial-temporal encoding on the real-world datasets.

The effect of the number of layers.  In Table 7, we compare the Micro-AUC score and Macro-AUC score of DGT with a different number of layers on the UCI, Yelp, and ML-10M datasets.

Method Metric UCI Yelp ML-10M
layers Micro-AUC
layers Micro-AUC
layers Micro-AUC
Table 7: Comparison of the Micro- and Macro-AUC score of DGT with different number of layers on the real-world datasets.

A.3 Node classification results

In this section, we show that although DGT is originally designed for the link prediction task, the learned representation of DGT can also be applied to binary node classification. We evaluate DGT on the Wikipedia and Reddit datasets, whose statistics are summarized in Table 11. The snapshots are created in a similar manner as in the link prediction task. As shown in Table 8 and Figure 5, DGT performs around better than all baselines on the Wikipedia dataset and around better than EvolveGCN on the Reddit dataset. However, the result of DGT on the Reddit dataset is slightly lower than that of DySAT. This is potentially because DGT is less suited to a dense graph, e.g., the Reddit dataset, where very dense graph structure information is encoded by the spatial-temporal encodings.

Method Metric Wikipedia Reddit
EvolveGCN Micro-AUC
DGT (without pre-training) Micro-AUC
DGT (with pre-training) Micro-AUC
Table 8: Comparison of the Micro- and Macro-AUC score of DGT with baselines on the real-world datasets for the binary node classification task.
Figure 5: Comparison of DGT with baselines across multiple time steps, where the Macro-AUC score is reported in the box next to the curves

A.4 Results on noisy dataset

In this section, we study the effect of noisy input on the performance of DGT using the UCI and Yelp datasets. We achieve this by randomly selecting 10%, 20%, and 50% of the node pairs and changing their connection status either from connected to not-connected or from not-connected to connected. As shown in Table 9, although the performance of both full attention and 1-hop attention decreases as the noise level increases, the performance of full-attention aggregation is more stable and robust as the noise level changes. This is because 1-hop attention relies more on the given structure, while full attention only takes the given structure as a reference and learns the “ground truth” underlying graph structure by gradient descent update.
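The perturbation described above can be sketched as follows, assuming an undirected graph stored as a symmetric binary adjacency matrix. The function name `flip_node_pairs`, the dense representation, and sampling uniformly over all unordered node pairs are assumptions made for illustration; the exact sampling procedure used in the experiments may differ.

```python
import numpy as np

def flip_node_pairs(adj, ratio, seed=0):
    """Flip the connection status of a random fraction of unordered node pairs."""
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    rows, cols = np.triu_indices(n, k=1)          # all unordered pairs (i < j)
    num_flip = int(round(ratio * len(rows)))
    chosen = rng.choice(len(rows), size=num_flip, replace=False)
    r, c = rows[chosen], cols[chosen]

    noisy = adj.copy()
    noisy[r, c] = 1 - noisy[r, c]                 # connected <-> not connected
    noisy[c, r] = noisy[r, c]                     # keep the matrix symmetric
    return noisy
```

Calling `flip_node_pairs(adj, 0.1)`, `flip_node_pairs(adj, 0.2)`, and `flip_node_pairs(adj, 0.5)` would produce the three noise levels reported in Table 9 under these assumptions.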

Method 10% 20% 50%
UCI DGT (1-hop attention) Micro-AUC
DGT (Full aggregation) Micro-AUC
Yelp DGT (1-hop attention) Micro-AUC
DGT (Full aggregation) Micro-AUC
Table 9: Comparison of the Macro-AUC score of DGT and its variants on input graphs with different noise levels.

A.5 Comparison with continuous-graph learning algorithms

In this section, we compare snapshot graph-based methods against continuous graph-based learning algorithms on the UCI, Yelp, and ML-10M datasets. For the continuous graph learning algorithms, we choose JODIE (Kumar et al., 2019) and TGAT (Xu et al., 2020) as the baselines. As shown in Table 10, JODIE and TGAT suffer from significant performance degradation. This is because they are designed to leverage edge features and fine-grained timestamp information for link prediction; however, this information is lacking in existing snapshot graph datasets.

Please note that we compare with continuous graph algorithms only for the sake of completeness. Since snapshot graph-based methods and continuous graph-based methods require different input graph structures, use different evaluation strategies, and are designed under different settings, directly comparing the two sets of methods cannot provide much meaningful interpretation. For example, existing works on continuous graphs (Kumar et al., 2019; Xu et al., 2020) select the training and evaluation sets by taking the first of links in the dataset for training and taking the rest for evaluation. In other words, the training and evaluation samples can be arbitrarily close and might even come from the same time step. However, in the snapshot graph setting, the training and evaluation sets are selected by taking the links in the previous snapshot graphs for training and evaluating on the -th snapshot graph. That is, the training and evaluation samples never come from the same time step. Besides, since the time steps in the continuous graph are finer-grained than those in snapshot graphs, continuous graph methods suffer from performance degradation when applied to snapshot graph datasets due to the lack of fine-grained timestamp information. For these reasons, existing continuous graph learning methods (e.g., JODIE, TGAT) only compare with other continuous graph methods on continuous datasets; similarly, existing snapshot graph learning methods (e.g., DySAT, EvolveGCN, DynAERNN, DynGEM) only consider other snapshot graph-based methods as their baselines.

Method Metric UCI Yelp ML-10M
Table 10: Comparison of the Micro- and Macro-AUC score of DGT, JODIE, and TGAT on the real-world datasets.

Appendix B Pre-training can reduce the irreducible error

B.1 Preliminary on information theory

In this section, we recall preliminaries on information theory, which are helpful to understand the proof in the following section. More details can be found in books such as Murphy (2022); Cover & Thomas (2006).

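Entropy.  Let $X$ be a discrete random variable, $\mathcal{X}$ the sample space, and $x$ an outcome. We define the probability mass function as $p(x) = \Pr(X = x),~x \in \mathcal{X}$. Then, the entropy of a discrete random variable $X$ is defined as

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x),$$

where we use base $2$. The joint entropy of two random variables $X, Y$ is defined as

$$H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y).$$

The conditional entropy of $Y$ given $X$ is the uncertainty we have in $Y$ after seeing $X$, which is defined as

$$H(Y | X) = \sum_{x \in \mathcal{X}} p(x) H(Y | X = x) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y | x).$$

Notice that we have $H(Y | X) = 0$ if $Y = f(X)$, where $f$ is a deterministic mapping.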

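Mutual information.  Mutual information is a special case of KL-divergence, which is a measure of distance between two distributions. The KL-divergence between two distributions $p(x)$ and $q(x)$ is defined as

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}.$$

Then, the mutual information between random variables $X, Y$ is defined as follows

$$I(X; Y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x) p(y)\big) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)},$$

and we have $I(X; Y) = 0$ if $X, Y$ are independent. Notice that we use $I(X; Y)$ instead of $I(X, Y)$ to represent the mutual information between $X$ and $Y$. Besides, $I(X; Y, Z)$ represents the mutual information between $X$ and $(Y, Z)$.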

Based on the above definitions of entropy and mutual information, we have the following relation between entropy and mutual information:

$$I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X),$$

and the following relation between conditional entropy and conditional mutual information:

$$I(X; Y | Z) = H(X | Z) - H(X | Y, Z),$$

where the conditional mutual information $I(X; Y | Z)$ can be thought of as the reduction in the uncertainty of $X$ due to knowledge of $Y$ when $Z$ is given.

A figure showing the relation between mutual information and entropy is provided in Figure 6.

Figure 6: Relationship between entropy and mutual information (Figure 2.2 of Cover & Thomas (2006)).
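The relations between entropy and mutual information can be checked numerically on a small example. The joint distribution below is a hypothetical one chosen only for illustration, and the sketch assumes base-2 logarithms; it computes the mutual information directly from its KL-divergence form and compares it with the entropy-based expressions.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a (possibly multi-dimensional) probability table."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]                      # 0 * log 0 is treated as 0
    return float(-(p * np.log2(p)).sum())

# Hypothetical joint distribution p(x, y); rows index x, columns index y.
joint = np.array([[0.25, 0.25],
                  [0.00, 0.50]])
px = joint.sum(axis=1)                # marginal p(x)
py = joint.sum(axis=0)                # marginal p(y)

h_x, h_y, h_xy = entropy(px), entropy(py), entropy(joint)
h_x_given_y = h_xy - h_y              # chain rule: H(X|Y) = H(X,Y) - H(Y)

# Mutual information as the KL-divergence between p(x, y) and p(x)p(y).
mask = joint > 0
mi = float((joint[mask] * np.log2(joint[mask] / np.outer(px, py)[mask])).sum())
```

Here `mi` agrees with both `h_x - h_x_given_y` and `h_x + h_y - h_xy` up to floating-point error, matching the overlap region in the diagram of Figure 6.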

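Data processing inequality.  Random variables $X, Y, Z$ are said to form a Markov chain $X \rightarrow Y \rightarrow Z$ if the joint probability mass function can be written as $p(x, y, z) = p(x) p(y | x) p(z | y)$. Suppose random variables $X, Y, Z$ form a Markov chain $X \rightarrow Y \rightarrow Z$, then we have $I(X; Y) \geq I(X; Z)$.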

Bayes error and entropy.  In the binary classification setting, the Bayes error rate is the lowest possible test error rate (i.e., irreducible error), which can be formally defined as

$$P_e = \mathbb{E}_{x \sim P_X}\Big[1 - \max_{y} P(Y = y \mid x)\Big],$$

where $Y$ denotes the label and $X$ denotes the input. Feder & Merhav (1994) derive an upper bound showing the relation between the Bayes error rate and entropy:

$$-\log(1 - P_e) \leq H(Y | X).$$

The above inequality is used as the foundation of our following analysis.
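As a sanity check, an entropy bound of this kind can be verified numerically. The sketch below assumes the bound takes the form $-\log(1 - P_e) \leq H(Y|X)$ with natural logarithms on both sides, and tests it on randomly generated joint distributions; the function names and the random distributions are illustrative assumptions only.

```python
import numpy as np

def bayes_error(joint):
    """P_e = E_x[1 - max_y p(y|x)] = 1 - sum_x max_y p(x, y); rows are x, columns are y."""
    return float(1.0 - joint.max(axis=1).sum())

def conditional_entropy(joint):
    """H(Y|X) in nats for a joint table with strictly positive row sums."""
    px = joint.sum(axis=1, keepdims=True)
    cond = joint / px                             # p(y|x)
    mask = joint > 0
    return float(-(joint[mask] * np.log(cond[mask])).sum())

rng = np.random.default_rng(0)
gaps = []
for _ in range(1000):
    joint = rng.random((4, 2)) + 1e-6             # binary label, four input values
    joint /= joint.sum()
    # gap = H(Y|X) - (-log(1 - P_e)); the assumed bound says it is non-negative.
    gaps.append(conditional_entropy(joint) + np.log(1.0 - bayes_error(joint)))
min_gap = min(gaps)
```

Under these assumptions `min_gap` stays non-negative, so lowering the conditional entropy tightens the ceiling on the Bayes error rate, which is the lever used in the analysis that follows.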

B.2 Proof of Proposition 1

In the following, we utilize the analysis framework developed in Tsai et al. (2020) to show the importance of the two pre-training loss functions.

By using Eq. 11, we have the following inequality:


By rearranging the above inequality, we have the following upper bound on the Bayes error rate


where equality is due to , equality is due to and because is a deterministic mapping given input . Our goal is to find the deterministic mapping function to generate that can maximize , such that the upper bound on the right hand side of Eq. 13 is minimized. We can achieve this by:

  • Maximizing the mutual information between the representation to the input .

  • Minimizing the task-irrelevant information , i.e., the mutual information between the representation to the input given task-relevant information .

In the following, we first show that minimizing can maximize the mutual information , then we show that minimizing can minimize the task irrelevant information .

Maximize mutual information .  By the relation between mutual information and entropy , we know that maximizing the mutual information is equivalent to minimizing the conditional entropy . Notice that we ignore because it is only dependent on the raw feature and is irrelevant to feature representation . By the definition of conditional entropy, we have