1 Introduction
Graph representation learning is becoming increasingly popular for addressing many graph-related applications, such as social networks, recommendation systems, knowledge graphs, and biology [jacobs2001protein, otte2002social, manouselis2011recommender]. A substantial body of work has focused on static graph neural networks (GNNs) [dai2016discriminative, hamilton2017representation, monti2017geometric, berg2017graph, ying2018graph, zhang2020revisiting], but many real-world interactions are temporal, making dynamic graphs a preferable model. Dynamic graphs explicitly model both the time and the node dependency of interactions, allowing them to better represent temporal evolution over graphs while also respecting the dynamic nature of the data. However, unlike the active line of research on efficient algorithms for training static GNNs [hamilton2017inductive, chen2018fastgcn, chen2018stochastic, huang2018adaptive], existing works on dynamic graph models can only be applied to small graphs.
In some sense, dynamic graphs combine the properties of sequence models and GNNs. While enjoying the representation power of both, they suffer from significantly greater computational challenges. On the one hand, unlike sequence models that treat the interaction events of different nodes independently, dynamic graph models introduce node dependencies, which prevent us from processing events on different nodes in parallel. On the other hand, static GNNs perform synchronized updates over the nodes, which is not allowed in the dynamic counterpart due to time-ordering constraints.
To be more specific, many dynamic graph models represent the state of a node $u$ at time $t$ using a low-dimensional vector $\mathbf{x}_u(t)$. To model temporal evolution, whenever an interaction event occurs between nodes $u$ and $i$ at time $t$, with label $q$, the embeddings of the involved nodes $u$ and $i$ are updated by a neural operator $\mathrm{op}_\theta$ in the following form:
$\mathbf{x}_u(t),\ \mathbf{x}_i(t) = \mathrm{op}_\theta\big(\mathbf{x}_u(t^-),\ \mathbf{x}_i(t^-),\ q,\ t\big),$ (1)
where the notation $\mathbf{x}_u(t^-)$ represents the most recent state of node $u$ before time $t$. Given a sequence of interaction events, what is the computational complexity of computing all the temporal embeddings $\{\mathbf{x}_u(t)\}$?
As an example, Fig. 1 shows the computational graph of the embedding updates for 8 interaction events that occurred between 4 nodes. To avoid ambiguity in terminology, we refer to a computational graph as a computational DAG (directed acyclic graph) throughout this paper, since it is necessarily directed and acyclic.
A naive approach computes Eq. 1 sequentially according to the time ordering of interactions, which requires as many computational steps as there are interactions. To improve the computational complexity, a recent advance, JODIE [kumar2019predicting], proposed to parallelize independent computations. For example, two of the interaction events in Fig. 1 are independent since their involved nodes do not overlap, so JODIE performs their updates in parallel (compare Fig. 2 (left) to Fig. 1). However, the effectiveness of this strategy depends heavily on a property of the dataset, namely the extent to which interaction events can be parallelized, which turns out to be far from satisfactory on real-world datasets due to the presence of many actively interacting and long-lived nodes.
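To make this concrete, the round-based scheduling that a JODIE-style parallelization performs can be sketched as follows. This is our illustrative helper, not the authors' implementation: events whose node sets do not overlap share a parallel step, while an event must wait until all earlier events touching its nodes have been processed.

```python
def parallel_rounds(events):
    """events: list of (u, i) node pairs in time order.
    Returns a list of rounds; events in one round can be updated in parallel."""
    ready_at = {}  # node -> earliest round in which it can be touched again
    rounds = []
    for idx, (u, i) in enumerate(events):
        r = max(ready_at.get(u, 0), ready_at.get(i, 0))
        while len(rounds) <= r:
            rounds.append([])
        rounds[r].append(idx)
        ready_at[u] = ready_at[i] = r + 1  # both nodes busy until round r ends
    return rounds

# Four events between nodes a, b, c, d: (a,b) and (c,d) are independent and
# share a round, while the later events must wait for their nodes.
print(parallel_rounds([("a", "b"), ("c", "d"), ("a", "c"), ("b", "d")]))
# prints [[0, 1], [2, 3]]
```

The number of rounds produced by such a scheduler is exactly the depth of the computational DAG, which is why datasets with long chains of events on the same node parallelize poorly.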
To address the computational barriers that prevent the use of dynamic graph models on industrial-scale datasets, in this paper we propose EDGE, which stands for Efficient Dynamic Graph lEarning.
The design of EDGE is motivated by our key finding that the computational complexity is bounded by the longest path in the computational DAG: no matter how we optimize the parallelization of updates within a step, it takes at least as many sequential steps as the length of the longest path. Section 3 goes over this in greater detail. As a simple example, the computational DAG in Fig. 2 (left) has a longest path of length 6, so JODIE cannot take fewer than 6 steps to finish the computations. The longest path in the computational DAG represents the limit that JODIE and any other method can reach, so EDGE is designed to carefully reduce its length in two ways, briefly summarized below.
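This lower bound is easy to measure in practice: the longest path of a DAG can be computed in linear time by dynamic programming over a topological order. A minimal sketch (our own helper, with hypothetical names):

```python
from collections import deque

def longest_path_length(n, edges):
    """Length (number of edges) of the longest path in a DAG with nodes 0..n-1,
    computed by Kahn's topological traversal with a distance table."""
    indeg = [0] * n
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    dist = [0] * n  # dist[v] = longest path (in edges) ending at v
    q = deque(v for v in range(n) if indeg[v] == 0)
    while q:
        u = q.popleft()
        for v in adj[u]:
            dist[v] = max(dist[v], dist[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    return max(dist)

# A chain of 7 updates has depth 6: no scheduler can finish in fewer steps.
print(longest_path_length(7, [(i, i + 1) for i in range(6)]))  # prints 6
```

Running this on the computational DAG of a training batch gives the minimum number of sequential steps any parallelization can achieve.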
First, EDGE selectively expresses some computational dependencies via the training loss to achieve better parallelism in computation. To be more specific, EDGE decouples some nodes (called d-nodes) from the computational DAG to remove certain computational dependencies; these dependencies are then added back via the training loss. For instance, the d-nodes (green-colored) in Fig. 2 are each divided into two nodes, which separates a d-node's descendants from its ancestors, shortening the longest path in the computational DAG. However, a naive decoupling strategy could lead to a suboptimal model due to the ignored dependencies. In Section 3, we explain how we carefully design the decoupling strategy by answering the following two questions, which are critical to its success.


How to select the most effective set of d-nodes for shortening the longest path?

How to modify the training loss to compensate for the computational dependencies removed by the d-nodes?
Second, EDGE assumes the convergent states of actively interacting nodes and models them by static embeddings. As visualized in Fig. 3, many realistic datasets contain scale-free interaction networks [leskovec2005graphs], where most nodes have a small number of interactions but a few high-degree nodes interact very actively. Examples include celebrities on Twitter who have many followers, and popular items on Amazon that are frequently clicked. While these active nodes contribute a large number of edges to the graph, their embedding states tend to be static. The reasoning behind this is twofold. First, an active node's interactions with other nodes give diminishing returns in terms of their information value for understanding this node's characteristics. Second, from the optimization perspective, a recurrent operator such as the one in Eq. 1 often has smaller derivatives in deeper layers, which is also the cause of the gradient vanishing problem [pascanu2013difficulty]. With the operator applied repeatedly, the embeddings of active nodes will eventually evolve slowly. With this motivation, EDGE represents active nodes by node-specific static embeddings during training, while keeping other nodes dynamic. The effects of other nodes on active nodes are implicitly incorporated through the gradient updates during training. After training, the operator can still be applied to active nodes during the testing phase. This static treatment of active nodes proves to be very effective on realistic datasets.
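Identifying the active nodes is a simple degree-count over the training events. A minimal sketch (our own helper; the function name and default thresholds are illustrative, the defaults mirroring the thresholds used in the experiments):

```python
from collections import Counter

def split_active_nodes(events, d_user=200, d_item=100):
    """Partition users/items into active (modeled by static embeddings)
    and inactive (evolved dynamically) by interaction count.
    events: list of (user, item) interactions from the training set."""
    u_deg = Counter(u for u, _ in events)
    i_deg = Counter(i for _, i in events)
    active_users = {u for u, c in u_deg.items() if c > d_user}
    active_items = {i for i, c in i_deg.items() if c > d_item}
    return active_users, active_items
```

On a scale-free interaction network, the returned active sets are small, yet the events they participate in cover a large fraction of all edges, which is where the efficiency gain comes from.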
In summary, these two designs jointly reduce the path length in the training computational graph, making EDGE scalable to realistic datasets. We conduct comprehensive experiments on several user-item interaction datasets, including two large-scale datasets, comparing EDGE to a wide range of baselines and showing that EDGE achieves state-of-the-art (SOTA) performance in both accuracy and prediction efficiency.
Related work. Although learning on dynamic graphs is relatively recent [rahman2018dylink2vec, goyal2020dyngraph2vec, sankar2020dysat, rossi2020temporal, xu2020inductive], various models have been proposed to represent temporal patterns via the time ordering of interaction events [ibrahim2015link, dai2016deep, kumar2019predicting] or from temporally aligned graph snapshots [goyal2020dyngraph2vec, sankar2020dysat]. A large subset of such models, which update node-wise embeddings through a neural network when new interactions appear [kumar2019predicting, ibrahim2015link, dai2016deep, trivedi2017know, ma2020streaming], is most relevant to this paper. They differ mainly in their architectures and the information they incorporate. The method proposed in this paper can be applied generally to improve the training efficiency of these models. For a more comprehensive summary of other dynamic models, we refer the reader to [rossi2020temporal], which summarizes different models via a general framework.
2 Model
In this paper, we focus on modeling the interactions between two groups of nodes, $\mathcal{U}$ and $\mathcal{I}$, but the technique can easily be applied to other scenarios involving fewer or more nodes in each temporal event.
We denote an interaction event between nodes $u \in \mathcal{U}$ and $i \in \mathcal{I}$ by $(u, i, t, q)$, where $t$ is the time and $q$ is a label, which can indicate click/non-click in user-item interaction networks, for example. In this section, we illustrate how a dynamic graph model describes the evolution of node embeddings over time when a sequence of interaction events is observed.
2.1 Embedding Evolution
In many dynamic graph models, two crucial components for modeling the evolution of node embeddings are


the initial embeddings of the nodes; and

the neural operator for updating them (via Eq. 1).
In EDGE, a key consideration is the convergent states of active nodes. Therefore, it brings in a new component:


the node-specific static embeddings for active nodes.
To be more specific, given a degree threshold $d_{\mathcal{U}}$, we say a node $u \in \mathcal{U}$ is active if it has more than $d_{\mathcal{U}}$ interactions in the training set. We denote the set of active nodes by $\mathcal{U}_a$ and define $\mathcal{I}_a$ in a similar way based on a threshold $d_{\mathcal{I}}$. In EDGE, these active nodes' embeddings are modeled by static embeddings
$\mathbf{x}_v(t) = \bar{\mathbf{x}}_v, \quad \forall v \in \mathcal{U}_a \cup \mathcal{I}_a,\ t \in [0, T],$ (2)
where each $\bar{\mathbf{x}}_v$ is a learnable vector, and $T$ is the final timestamp in the training dataset. The other 'inactive' nodes are evolved by the neural operator when interactions occur, as follows:
$\mathbf{x}_u(0) = \mathbf{x}^{\mathrm{init}}_{\mathcal{U}}, \quad \forall u \in \mathcal{U} \setminus \mathcal{U}_a,$ (3)
$\mathbf{x}_i(0) = \mathbf{x}^{\mathrm{init}}_{\mathcal{I}}, \quad \forall i \in \mathcal{I} \setminus \mathcal{I}_a,$ (4)
$\mathbf{x}_u(t) = \mathrm{op}_{\theta_{\mathcal{U}}}\big(\mathbf{x}_u(t_u^-),\ \mathbf{x}_i(t_i^-),\ \mathbf{e},\ t\big), \quad \forall u \in \mathcal{U} \setminus \mathcal{U}_a,$ (5)
$\mathbf{x}_i(t) = \mathrm{op}_{\theta_{\mathcal{I}}}\big(\mathbf{x}_i(t_i^-),\ \mathbf{x}_u(t_u^-),\ \mathbf{e},\ t\big), \quad \forall i \in \mathcal{I} \setminus \mathcal{I}_a,$ (6)
where $\mathbf{x}^{\mathrm{init}}_{\mathcal{U}}$ and $\mathbf{x}^{\mathrm{init}}_{\mathcal{I}}$ are learnable vectors that define the common initializations for inactive nodes in $\mathcal{U}$ and $\mathcal{I}$, and we use different parameters $\theta_{\mathcal{U}}$ and $\theta_{\mathcal{I}}$ in the operator that updates these two groups of nodes. The architecture of $\mathrm{op}$ is not the main focus of this paper, so we simply adopt a gated recurrent unit (GRU) [chung2014empirical] based architecture. The notation $\mathbf{e}$ denotes any available event features; examples include the contents of posts on Twitter and static features of items on shopping platforms. Lastly, the notation $t_u^-$ refers to the timestamp of the last event that occurred to $u$ before $t$.
Eq. 2 to Eq. 6 jointly define the overall embedding evolution model in EDGE. A few remarks and important discussions on this model are presented in the following.
Efficiency gain. The reduction of computational dependencies achieved by representing active nodes with static embeddings is significant, because the active nodes contribute a substantial number of edges to the network. The only sacrifice is the memory cost of introducing the additional set of learnable vectors for the active nodes. However, in our experiments, this does not cause any problems; any transductive graph neural network, by comparison, requires a memory cost greater than that.
Accuracy. The static treatment of active nodes does not damage the model's performance, as will be verified by extensive experiments in Section 4. This static strategy is well thought out and supported by a number of arguments below.


If we view the embedding evolution as refreshing our understanding of a node's features, the information value of a new interaction for an active node is minimal.

Optimization is as important as the model. With the reduced computational dependencies, training optimization becomes easier, which might also be the reason why we often observe an improvement in model performance in our experiments.

From the view of the bias-variance trade-off, it is reasonable to use a transductive (i.e., node-specific) embedding for active nodes and an inductive model for inactive nodes.

During the test phase, we could still apply the learned operator to evolve the embeddings of active nodes according to their newly observed events.
2.2 Prediction Model and Loss Function
After computing the node embeddings, a predictive model extracts useful features to achieve the desired prediction. Our experiments are mainly about predicting the next interacted node in $\mathcal{I}$ for the nodes in $\mathcal{U}$, so we employ a neural layer to predict the embedding of the next interacted node. Here we abuse the notation $\theta$ to generally represent parameters in neural networks. Inspired by the design in [kumar2019predicting], we employ a prediction layer of the following form:
$\hat{\mathbf{x}}(t) = g_\theta\big(\mathbf{x}_u(t^-),\ \mathbf{x}_{i'}(t^-),\ \mathbf{f}_u,\ \mathbf{f}_{i'}\big),$
where $i'$ is the most recently interacted node of $u$ before time $t$, and $\mathbf{f}$ denotes any available node features. In our experiments, $g_\theta$ is simply a multi-layer perceptron (MLP).
With the prediction embedding $\hat{\mathbf{x}}(t)$, one can then measure its similarity to the embeddings of nodes in $\mathcal{I}$. In our experiments, we measure the similarity to a node $j$ by the inner product $\langle \hat{\mathbf{x}}(t), \mathbf{x}_j(t^-) \rangle$, and use the cross-entropy loss (equivalently, the softmax loss) for each event $(u, i, t, q)$ in the training set $\mathcal{D}$:
$\ell(\Theta) = -\sum_{(u,i,t,q) \in \mathcal{D}} \log \frac{\exp\big(\langle \hat{\mathbf{x}}(t), \mathbf{x}_i(t^-) \rangle\big)}{\exp\big(\langle \hat{\mathbf{x}}(t), \mathbf{x}_i(t^-) \rangle\big) + \sum_{j \in \mathcal{N}} \exp\big(\langle \hat{\mathbf{x}}(t), \mathbf{x}_j(t^-) \rangle\big)},$ (7)
where $\Theta$ includes all parameters and $\mathcal{N}$ is a set of 'negative' nodes that did not interact with $u$ at $t$. In our experiments, this set is randomly sampled from $\mathcal{I}$.
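The per-event loss with sampled negatives can be sketched as follows (our illustrative helper; the function name and list-based vectors are assumptions, not the released implementation):

```python
import math

def sampled_softmax_loss(pred, pos_emb, neg_embs):
    """Negative log-softmax of the true item's inner-product score against
    sampled negatives, i.e., the per-event cross-entropy term."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    scores = [dot(pred, pos_emb)] + [dot(pred, n) for n in neg_embs]
    m = max(scores)  # subtract max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[0]  # -log p(true item)
```

Summing this quantity over all training events (with freshly sampled negatives per event) yields the training objective.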
3 Efficient Training Algorithm
The bottleneck of previous training methods is the underutilization of GPUs, since the computation is hard to parallelize. We find that the computational complexity is limited by the longest path in the computational DAG, which motivates our design of d-nodes for shortening it. We describe the details in this section.
3.1 Why Longest Path?
Why is the computational complexity limited by the longest path in the computational DAG? To answer this question, we refer the reader to the literature on parallel algorithm analysis. Briefly speaking,


The so-called work-depth model of parallel computation was originally introduced by [shiloach1982n2log] in 1982 and has been used for many years to describe the performance of parallel algorithms.

Depth is defined as the longest chain of sequential dependencies in the computation [blelloch1996programming].

Minimizing the depth is important in designing parallel algorithms, because the depth determines the shortest possible execution time [mccool2012structured].
When the computations of a dynamic graph are viewed as the operations of a parallel algorithm, the "longest path in the computational DAG" corresponds to the "depth" in the work-depth model, which determines the shortest possible execution time.
3.2 A Simplified Illustration of D-nodes
Before going into the details of the d-nodes, we first explain the high-level idea through a highly simplified example. Instead of a complex dynamic graph model, consider a 6-layer feedforward network.
Clearly, its computational DAG is a path
$x \to h_1 \to h_2 \to h_3 \to h_4 \to h_5 \to h_6,$
where $h_l$ is the $l$-th hidden layer output. It requires 6 sequential steps to finish the computations layer by layer. However, we can speed this up to 3 sequential steps by decoupling the path at the node $h_3$.
More precisely, we select $h_3$ as the d-node and add a learnable vector $\tilde{h}_3$ for it. Then we can perform the computations of the two halves in parallel. The new computational DAG becomes 2 independent paths,
$x \to h_1 \to h_2 \to h_3 \quad \text{and} \quad \tilde{h}_3 \to h_4 \to h_5 \to h_6,$
each of length 3. The key is that $\tilde{h}_3$ is a learnable vector that does not depend on any other node, so we can start using $\tilde{h}_3$ for computing $h_4$ before we obtain $h_3$.
The final problem is that, with this decoupling, the model seems to differ from the original 6-layer feedforward network. Fortunately, to maintain consistency, we only need to enforce the constraint $\tilde{h}_3 = h_3$ during optimization. Clearly, if $\tilde{h}_3 = h_3$ is satisfied, the models before and after the decoupling are equivalent. Therefore, we can perform gradient updates to optimize both the network parameters $\theta$ and $\tilde{h}_3$ by solving the following constrained optimization:
$\min_{\theta, \tilde{h}_3} \ \mathcal{L}(\theta, \tilde{h}_3) \quad \text{s.t.} \quad \tilde{h}_3 = h_3.$ (8)
Experimentally, we enforce the constraint softly by:
$\min_{\theta, \tilde{h}_3} \ \mathcal{L}(\theta, \tilde{h}_3) + \lambda \, \lVert \tilde{h}_3 - h_3 \rVert_2^2.$ (9)
To summarize, the key idea of d-nodes is to selectively express some dependencies in computation as dependencies in optimization (via the training loss).
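The toy example above can be made concrete with scalar layers. This is our sketch, not the paper's code: `forward_chain` needs 6 sequential steps, while the two halves of `forward_decoupled` can run in parallel, at the price of a quadratic penalty tying the surrogate to the true midpoint.

```python
def forward_chain(x, w):
    """Original 6-layer chain f_l(h) = w_l * h: 6 sequential steps."""
    h = x
    for wl in w:
        h = wl * h
    return h

def forward_decoupled(x, w, h3_tilde):
    """Decoupled version: two length-3 halves that could run in parallel.
    h3_tilde is the learnable surrogate for the true midpoint h3."""
    h3 = w[2] * (w[1] * (w[0] * x))          # first half:  x -> h3
    out = w[5] * (w[4] * (w[3] * h3_tilde))  # second half: h3_tilde -> out
    penalty = (h3_tilde - h3) ** 2           # soft constraint (quadratic)
    return out, penalty

# With all weights 2 and input 1, the chain outputs 64; if the surrogate
# equals the true midpoint (8), the decoupled forward agrees exactly.
print(forward_chain(1.0, [2.0] * 6))            # prints 64.0
print(forward_decoupled(1.0, [2.0] * 6, 8.0))   # prints (64.0, 0.0)
```

During training, minimizing the penalty drives `h3_tilde` toward `h3`, at which point the two computations are equivalent.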
In this example, the computational DAG simply consists of paths. In the case of a dynamic graph, the computational DAG is far more complex, but the algorithm follows the same logic. We present the details in the following subsections.
3.3 Step 1: Selection of D-nodes
In the example in Section 3.2, the computational DAG is a path, so selecting the midpoint as the d-node is very effective in shortening it. However, in the case of dynamic graph models, the computational DAG is much more complex due to node dependency and time dependency (e.g., Fig. 1).
Ideally, we should choose a small number of d-nodes that can effectively shorten the longest path in the computational DAG. Therefore, in this section, we address the following question:
Given a limited quota $K$, how do we choose the most effective collection of d-nodes?
To begin with, we construct the computational DAG for computing all temporal node embeddings through the steps outlined in Algorithm 1, and denote it by $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. The goal is to decouple a subset of nodes from the computational DAG to shorten the longest path.
Mathematically, given the computational DAG $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, the d-node selection problem can be formulated as the following min-max integer programming (IP):
$\min_{b, s} \ \max_{v \in \mathcal{V}} \ s_v \quad \text{s.t.} \quad s_v \ge (s_w + 1)(1 - b_w) \ \ \forall (w, v) \in \mathcal{E}; \quad \textstyle\sum_{v \in \mathcal{V}} b_v \le K; \quad b_v \in \{0, 1\},\ s_v \ge 0,$
where the binary variable $b_v$ indicates whether node $v$ is selected as a d-node, and $s_v$ measures the longest distance between node $v$ and the root nodes in the decoupled graph.
Given the extremely large problem size of this IP (the number of nodes in $\mathcal{G}$ is proportional to the number of events in the dataset), we cannot use existing packages such as Gurobi [gurobi2018gurobi] to solve it. Therefore, we design a greedy heuristic to give an approximate solution, which proves effective in experiments.
The algorithm steps are outlined in Algorithm 2. Briefly speaking, it repeatedly finds the midpoint of the longest path in the computational DAG, takes it as a d-node, and adjusts the edges from this node, until $K$ d-nodes are found. The idea of this algorithm is similar to alternating minimization-maximization, solving a subproblem for each d-node sequentially:


Optimize $s$ with fixed $b$: suppose the values of $b$ are given by the current selections. Then the node that attains the inner maximization apparently lies on the longest path $P$ found in Algorithm 2. Furthermore, for all edges in this path, the corresponding constraints in the IP are binding (i.e., equality holds), and removing all other constraints does not change the optimal value. Therefore, we select the next d-node by solving a subproblem identified by this longest path $P$.

Select $b$ on the subproblem: this step selects a d-node and sets its $b_v = 1$, which is accomplished by solving the IP with the subset of constraints identified by the edges in the longest path $P$. As pointed out in Algorithm 2, the $v$ that optimizes this subproblem is the center node of the path $P$.
It is notable that the longest path of a DAG can be found in linear time, and Algorithm 2 only needs to run once as a preprocessing step.
3.4 Step 2: Express Dependency by Constraints
From now on, we index each selected d-node by the pair $(u, t)$, since it has a unique correspondence to a temporal state $\mathbf{x}_u(t)$.
Each d-node plays the role of detaching the events that happened to $u$ after $t$ from those before $t$. Similar to adding the vector $\tilde{h}_3$ for $h_3$ in Section 3.2, the decoupling is realized by creating an additional embedding $\tilde{\mathbf{x}}_u(t)$ for each d-node, which is used as a replacement of $\mathbf{x}_u(t)$ for any events after $t$ that involve the state of $u$. An apparent benefit is that $\tilde{\mathbf{x}}_u(t)$ can be used for computing future evolutions before the embedding $\mathbf{x}_u(t)$ is obtained.
It is important that the decoupling of the d-nodes does not break the original dependencies in the model. Fortunately, it is easy to observe that involving the d-nodes does not change the obtained node embeddings if the two embeddings associated with each d-node are equal: $\tilde{\mathbf{x}}_u(t) = \mathbf{x}_u(t)$.
Therefore, during the training phase, EDGE solves a constrained optimization:
$\min_{\Theta, \{\tilde{\mathbf{x}}_u(t)\}} \ \ell(\Theta) \quad \text{s.t.} \quad \tilde{\mathbf{x}}_u(t) = \mathbf{x}_u(t) \ \text{for every d-node } (u, t).$
Various algorithms are available for solving such a constrained optimization problem. For example, one can use the Lagrangian method with a quadratic penalty and solve it by alternating primal-dual updates. In our experiments, we enforce the constraints softly, adding a quadratic penalty to the loss whenever a d-node is reached in the computational DAG.
4 Experiments
We conduct experiments on four public datasets. Two of them are large datasets close to industrial scale, on which we show the scalability and prediction accuracy of EDGE by comparing it to a diverse set of SOTA methods. The other two are comparatively smaller datasets on which all dynamic graph baselines are runnable, so they are used for comparison with those baselines. The implementation of EDGE will be made publicly available upon acceptance.
Datasets include Taobao [pi2019practice], MovieLens-25M (ML25M) [harper2015movielens], lastfm-1K [Celma:Springer2010], and Reddit [baumgartner2020pushshift]. Dataset statistics are summarized in Table 1. Taobao and ML25M are large-scale; in particular, Taobao contains about 100 million interactions. All datasets provide timed user-item interaction data where, for each interaction, the user ID, item ID, timestamp, item feature, and event feature (if available) are given. For Reddit, subreddits are considered as items. For ML25M, we ignore the ratings and simply use it as interaction data. For Taobao and ML25M, we sort the interactions by timestamp and use a (0.7, 0.1, 0.2) data split for training, validation, and testing, respectively. For lastfm-1K and Reddit, we use the filtered version given by [kumar2019predicting] and follow their data split of (0.8, 0.1, 0.1). More details and downloadable links to the datasets are provided in Appendix A.

Table 1: Dataset statistics.
Dataset  #users  #items  #interactions
lastfm-1K  1,000  980  1,293,103
Reddit  10,000  984  672,447
ML25M  162,538  59,048  24,999,849
Taobao  987,975  4,111,798  96,678,667
Baselines. We compare EDGE to a wide range of baselines spanning 3 categories. (i) Dynamic graph models: JODIE [kumar2019predicting], dynAERNN [goyal2020dyngraph2vec], TGAT [xu2020inductive], CTDNE [nguyen2018continuous], and DeepCoevolve [dai2016deep]; the first three were proposed recently and are more advanced. (ii) GNNs: GCMC [berg2017graph] is a SOTA graph-based architecture for link prediction in user-item interaction networks. GCMC-SAGE is its stochastic variant, which is more efficient and uses techniques from GraphSAGE [hamilton2017inductive]. GCMC-GAT is another variant based on graph attention networks [velivckovic2017graph]. (iii) Deep sequence models: SumPooling is a simple yet effective model widely used in industry. GRU4REC [hidasi2015session] is a representative of RNN-based models. SASRec [kang2018self] is a 2-layer transformer decoder-like model. DIEN [zhou2019deep] is an advanced attention-based sequence model. MIMN [pi2019practice] is a memory-enhanced RNN for better capturing long sequential user behaviors; it is a strong baseline, and the Taobao dataset was publicized by its authors.
Table 2: Overall performance on the large datasets. Columns: Taobao MRR, Taobao Rec@10, ML25M MRR, ML25M Rec@10.
GNNs:
GCMC  0.303  0.542  0.171  0.378
GCMC-SAGE  0.149  0.366  0.109  0.264
GCMC-GAT  0.230  0.494  0.185  0.404
Sequence models:
SumPooling  0.415  0.664  0.302  0.591
GRU4REC  0.546  0.777  0.364  0.657
DIEN  0.605  0.834  0.356  0.638
MIMN  0.607  0.828  0.363  0.653
SASRec  0.488  0.702  0.360  0.653
Dynamic graphs:
dynAERNN  oom  oom  0.249  0.509
TGAT  0.016  0.025  0.114  0.219
JODIE  0.454  0.680  0.354  0.634
EDGE  0.675  0.841  0.397  0.673
* indicates that the released implementation was improved by us to make the method either runnable on these large datasets or better adapted to the evaluation metric, so the results are expected to be better than those of the original version. Modifications are described in Appendix B. 'oom' stands for out of memory.

Configuration. In all experiments, both the node embedding dimension and the feature embedding dimension are 64. Only Reddit contains event features; the other datasets may provide categorical item features. We experimentally find that including the item ID as an additional item feature is generally helpful. To conduct stochastic optimization, we sort the events by time and divide them into 300 batches, so that each batch contains a subgraph within a continuous time window. To use d-nodes, we choose a fixed number of d-nodes per batch. The threshold for active nodes is chosen to be 200 for users and 100 for items on both Taobao and ML25M. The other two datasets are filtered and small, so an active-node threshold is not necessary: we use $d = \infty$ for both (which means no static embedding is used for any node). Besides, we also tried other threshold values for Reddit and lastfm-1K.
Evaluation metric. We measure the performance of different models in terms of the mean reciprocal rank (MRR), which is the average of the reciprocal ranks of the true items, and Recall@10. For both metrics, higher is better. On Taobao and ML25M, the rank of the truly interacted item in each event is computed among 500 items, where the other 499 negative items are uniformly sampled from the set of all items. On Reddit and lastfm-1K, ranks are computed over all items. In the tables in this section, we mark the best performing method in red and the second best in blue.
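Both metrics reduce to simple statistics over the 1-based rank of the true item in each test event. A minimal sketch (our own helper, with illustrative names):

```python
def mrr_and_recall(ranks, k=10):
    """ranks: 1-based rank of the true item in each test event.
    Returns (MRR, Recall@k): mean reciprocal rank and the fraction of
    events whose true item lands in the top k."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    recall = sum(r <= k for r in ranks) / len(ranks)
    return mrr, recall
```

For example, ranks `[1, 2, 20]` give MRR = (1 + 1/2 + 1/20) / 3 and Recall@10 = 2/3.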
Figure 4: Training time per epoch and MRR performance on the test set, comparing both efficiency and accuracy.
Result 1: On large datasets.
(i) MRR and Rec@10: the overall performances on Taobao and ML25M are summarized in Table 2. EDGE shows consistent improvements over all baselines, achieving gains of 11.2% and 9.1% in MRR over the second best method. In fact, the released implementations of many baselines cannot run, or cannot give reasonable performance, on these large datasets. As indicated by the symbol * in Table 2, we improved these baselines to give them more of an advantage; please refer to Appendix B for details.
(ii) Training efficiency: in Fig. 4, the $x$-axis is the training time per epoch, which is a measure of training efficiency, and the $y$-axis is the MRR performance on the test set. While achieving good MRR performance, EDGE is very efficient. It has similar efficiency to the sequence models and is much faster than both the GNN-based models and existing dynamic graph models, which take 2 hours or more per epoch. Furthermore, we observe that EDGE actually takes fewer epochs to converge than the sequence models. The number of d-nodes and the active-node threshold are the same as those in Table 2; results with different configurations are compared later.
Table 3: Overall performance on the small datasets. Columns: lastfm-1K MRR, lastfm-1K Rec@10, Reddit MRR, Reddit Rec@10.
CTDNE  0.01  0.01  0.165  0.257
DeepCoevolve  0.019  0.039  0.171  0.275
JODIE  0.195  0.307  0.726  0.852
dynAERNN  0.021  0.038  0.157  0.301
TGAT  0.015  0.023  0.602  0.775
EDGE  0.199  0.332  0.725  0.855
Result 2: On small datasets. lastfm-1K and Reddit are datasets preprocessed by JODIE; they are relatively small, so all dynamic graph models can scale to them. (i) MRR and Rec@10 are summarized in Table 3. Overall, EDGE performs similarly to JODIE, which is expected because the architecture we use in EDGE is similar to that of JODIE. The advantage of EDGE is efficient training, but on small datasets JODIE can be well trained, too. (ii) Runtime: nevertheless, EDGE is more efficient than JODIE. Fig. 6 shows the training time of EDGE when the number of d-nodes per batch is 160 and the threshold for active nodes is $\infty$, so all of its speedup comes from d-nodes.
Result 3: Ablation study on the d-nodes. We validate the effectiveness of EDGE, especially the speedup from d-nodes, through a series of ablation experiments, summarized in Fig. 5. By gradually increasing the number of d-nodes, we observe a decrease in training time in the bar plots. The d-nodes are very effective in reducing the computational cost: they constitute a small portion of the computational nodes, especially on large datasets, and do not have much effect on the prediction performance. We can even observe some improvement in prediction accuracy when using more d-nodes, which might be attributed to the ease of training on shorter sequences.
Table 4: Longest path length in the computational DAG after decoupling the selected d-nodes.
Algorithm 2 (ours):
number of d-nodes  0  10  20  40  80  160  320  640
longest path length  209.75  164.05  142.81  115.01  89.63  66.12  45.47  29.43
cut-by-time:
number of parts  1  2  4  8  16  32  64  128
number of d-nodes  0  134  250  346  429  506  588  701
longest path length  209.75  202.6  198.95  196.29  193.02  190.84  187.39  179.51
Result 4: Reduction in the path length. To verify that our selection algorithm actually reduces the longest path in the computational DAG, we explicitly compute its length after decoupling the selected d-nodes. For comparison, we also implement a simpler approach, called 'cut-by-time', which selects d-nodes so as to divide the DAG into multiple disconnected parts based on time intervals. The results reported in Table 4 reveal two advantages of our heuristic. First, we can easily control the number of d-nodes, whereas for cut-by-time, or any other algorithm that cuts the DAG into separate parts, the number of d-nodes is determined by the number of parts. Second, our heuristic is much more effective at reducing the length of the longest path.
Result 5: Ablation study on the active nodes. What if active nodes do not use static embeddings? The main purpose of using static embeddings is to improve the efficiency and ease of training, which is especially important for large datasets. To demonstrate this, we gradually increase the active-node threshold $d$ and observe the change in runtime and accuracy. Note that as $d$ increases, fewer nodes are categorized as active and use static embeddings; in the extreme case of $d = \infty$, all nodes are treated as inactive and no static embedding is used. Table 5 shows that using more active nodes effectively improves training efficiency. In particular, on MovieLens-25M, the efficiency would be unacceptable without using static embeddings for active nodes. In terms of accuracy, on Taobao, the case of $d = (200, 100)$ achieves performance as good as the other thresholds; on MovieLens-25M, $d = (400, 200)$ seems a better choice for the accuracy-efficiency trade-off.
Table 5: Ablation on the active-node thresholds $d$ = (users, items).

Taobao:
$d$  time per epoch  MRR  Rec@10
(200,100)  13.14 mins  0.675  0.642
(400,200)  13.84 mins  0.674  0.841
(800,400)  14.52 mins  0.674  0.841
(1600,800)  15.57 mins  0.675  0.841
($\infty$, $\infty$)  27.50 mins  0.674  0.841

ML25M:
$d$  time per epoch  MRR  Rec@10
(200,100)  12.72 mins  0.397  0.674
(400,200)  22.79 mins  0.404  0.683
(800,400)  1 hr 4 mins  0.409  0.687
(1600,800)  3 hr 1 min  0.406  0.688
($\infty$, $\infty$)  20 hr 54 mins  -  -
5 Conclusion and discussion
In this paper, we have proposed an efficient framework, EDGE, to address the computational challenges of learning large-scale dynamic graphs. We evaluated EDGE on large-scale item ranking tasks and showed that it outperforms several classes of SOTA methods. The models discussed in this paper are mainly based on node-wise updates by a recurrent operator when new interactions appear, and lack a GNN-like aggregation from a node's neighbors. Future work includes adapting this efficient algorithm to more sophisticated aggregations, and exploring effective training algorithms for the constrained optimization.
References
Appendix A Dataset details
A.1 Overall Description
Taobao dataset contains useritem interaction data from November 25 to December 03, 2017 (one week). The details of this dataset can be found at https://tianchi.aliyun.com/dataset/dataDetail?dataId=649.
MovieLens-25M dataset contains about 25 million rating activities from MovieLens, from January 09, 1995 to November 21, 2019. The details of this dataset can be found at https://grouplens.org/datasets/movielens/25m/.
Lastfm1K dataset contains music listening data of 1000 users. We use the dataset prepared by [kumar2019predicting]. The details can be found at https://github.com/srijankr/jodie and http://ocelma.net/MusicRecommendationDataset/lastfm1K.html.
Reddit dataset contains one month of posts made by users on subreddits [baumgartner2020pushshift]. We use the dataset filtered by [kumar2019predicting], which contains the 1,000 most active subreddits as items and the 10,000 most active users. The details can be found at https://github.com/srijankr/jodie.
A.2 Node Degree Distribution
Appendix B Baseline specifications
Some of the compared baseline methods are not directly scalable to large-scale interaction graphs, or were not originally designed for the ranking task. To make the baselines runnable on large and sparse graphs and comparable to our proposed method, we made a few improvements to the released implementations.


GRU4Rec & DIEN & MIMN: We changed the data batching from per-user to per-example, to better adapt to time-ordered interaction data and to make full use of the training data. Besides, the implementation of GRU4Rec follows the implementation by the authors of the MIMN paper, which includes some modifications compared to the original version of GRU4Rec released by its authors and should be expected to perform better.

GCMC & GCMC-SAGE & GCMC-GAT: We changed the loss function from a softmax over different ratings to a softmax over [true item, 10 randomly selected items], to better adapt to the ranking task.

dynAERNN: This method is originally designed for discrete graph snapshots, while our tasks involve continuous-time interaction graphs. We manually transformed the interaction graph into 10 snapshots with equal edge-count increments.
For the downstream ranking task, we followed the evaluation method used in DySAT [sankar2020dysat]: after the node embeddings are trained, we train a logistic regression classifier to predict the link between a pair of user/item node embeddings combined via the Hadamard operator. The logits on the test set are used for ranking. For the Taobao dataset, we were not able to obtain results for dynAERNN given the memory constraint.

JODIE: we made the following adaptations: 1) replaced the static one-hot representations with 64-dimensional learnable embeddings, because the number of nodes is so large that the original implementation runs out of memory; 2) added a module to handle categorical features via a learnable embedding table, since the original implementation is more suitable for continuous features; 3) used a triplet loss with random negative examples rather than the original MSE loss, which empirically shows improvements.

TGAT: We noticed that the performance of TGAT is particularly bad on Taobao. We spent effort checking the experiments to make sure there is no bug, and identified two issues with using this method on the Taobao dataset. First, TGAT is an inductive model, which could potentially generalize better to newly observed nodes in the test set, but on the other hand it is not expressive enough to make full use of this large-scale dataset. Second, the original implementation of TGAT does not handle categorical features, but in the Taobao dataset the only feature is a categorical item feature. We added a module mapping the categorical feature to a learnable feature embedding table (similar to how we modified the feature module in JODIE) to achieve better performance. With this feature embedding module, we were able to increase its performance on Taobao to about 0.086 (MRR) and 0.164 (Rec@10).
Appendix C Ablation study results
We present the complete results for the ablation study below.
Appendix D Computing resources
Experiments on largescale datasets are run on NVIDIA Quadro RTX 6000 (24 GB memory) or on NVIDIA Tesla v100 (16 GB memory). Experiments on small datasets are either run on NVIDIA GeForce 2080 Ti (11 GB memory) or on NVIDIA Tesla v100 (16 GB memory). Each individual job is run on a single GPU.