jodie
A PyTorch implementation of ACM SIGKDD 2019 paper "Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks"
Modeling sequential interactions between users and items/products is crucial in domains such as e-commerce, social networking, and education. Representation learning presents an attractive opportunity to model the dynamic evolution of users and items, where each user/item can be embedded in a Euclidean space and its evolution can be modeled by an embedding trajectory in this space. However, existing dynamic embedding methods generate embeddings only when users take actions and do not explicitly model the future trajectory of the user/item in the embedding space. Here we propose JODIE, a coupled recurrent neural network model that learns the embedding trajectories of users and items. JODIE employs two recurrent neural networks to update the embedding of a user and an item at every interaction. Crucially, JODIE also models the future embedding trajectory of a user/item. To this end, it introduces a novel projection operator that learns to estimate the embedding of the user at any time in the future. These estimated embeddings are then used to predict future user-item interactions. To make the method scalable, we develop a tBatch algorithm that creates time-consistent batches and leads to 9x faster training. We conduct six experiments to validate JODIE on two prediction tasks—future interaction prediction and state change prediction—using four real-world datasets. We show that JODIE outperforms six state-of-the-art algorithms in these tasks by at least 20% in predicting future interactions and 12% on average in predicting user state change.
Users interact sequentially with items in many domains such as e-commerce (e.g., a customer purchasing an item) (zhang2017deep), education (a student enrolling in a MOOC course) (liyanagunawardena2013moocs), and social and collaborative platforms (a user posting in a group on Reddit) (iba2010analyzing; kumar2018community). The same user may interact with different items over a period of time, and these interactions change over time (DBLP:journals/debu/HamiltonYL17; DBLP:conf/recsys/PalovicsBKKF14; zhang2017deep; agrawal2014big; DBLP:conf/asunam/ArnouxTL17; raghavan2014modeling; DBLP:journals/corr/abs171110967). These interactions create a network of temporal interactions between users and items. Figure 1 (left) shows an example network between users and items, with each interaction marked with a timestamp and a feature vector (such as the review text or the purchase amount). Accurate real-time recommendation of items and prediction of changes in the state of users are fundamental problems in these domains (DBLP:conf/wsdm/QiuDMLWT18; DBLP:conf/asunam/ArnouxTL17; DBLP:conf/sdm/LiDLLGZ14; DBLP:journals/corr/abs180401465; DBLP:conf/cosn/SedhainSXKTC13; walker2015complex; DBLP:conf/icwsm/Junuthula0D18). For instance, predicting when a student is likely to drop out of a MOOC course is important for developing early intervention measures (kloft2014predicting; yang2013turn), and predicting when a user is likely to turn malicious on social platforms, like Reddit and Wikipedia, ensures platform integrity (kumar2018rev2; kumar2015vews; cheng2017anyone).
Representation learning, or learning low-dimensional embeddings of entities, is a powerful approach to represent the evolution of users' and items' properties (DBLP:journals/kbs/GoyalF18; zhang2017deep; dai2016deep; DBLP:conf/nips/FarajtabarWGLZS15; beutel2018latent; zhou2018dynamic). However, recent methods that generate dynamic embeddings suffer from four fundamental challenges. First, a majority of the existing methods generate an embedding for a user only when she takes an action (beutel2018latent; zhang2017deep; dai2016deep; zhou2018dynamic; you2019hierarchical). However, consider a user who makes a purchase today, at which point her embedding is updated. Her embedding will remain the same whether she returns to the platform the next day, a week later, or even a month later. As a result, the same predictions and recommendations will be made to her regardless of when she returns. However, a user's intent changes over time (cheng2017predicting), and thus her embedding needs to be updated (projected) to the query time. The challenge here is how to accurately predict the embedding trajectories of users/items as time progresses. Second, entities have both stationary properties that do not change over time and time-evolving properties. Some existing methods (zhang2017deep; dai2016deep; wang2016coevolutionary) consider only one of the two when generating embeddings. However, it is essential to consider both in a unified framework to leverage information at both scales. Third, many existing methods predict user-item interactions by scoring all items for each user (zhang2017deep; dai2016deep; beutel2018latent; zhou2018dynamic). This has linear time complexity and is not practical in scenarios with millions of items. Instead, methods are required that can recommend items in near-constant time.
Fourth, most models are trained by sequentially processing the interactions one at a time, so that the temporal dependencies between the interactions are maintained (zhang2017deep; dai2016deep; wang2016coevolutionary). This prevents such models from scaling to datasets with millions of interactions. New methods are needed that can be trained with batches of data to generate embedding trajectories.
Present work. Here we present JODIE, which learns to generate embedding trajectories of all users and items from temporal interactions (JODIE stands for Joint Dynamic User-Item Embeddings). The embedding trajectories of the example network are shown in Figure 1 (right). The embeddings of the user and item are updated when a user takes an action, and a projection operator predicts the future embedding trajectory of the user.
Present work: JODIE. Each user and item has two embeddings: a static embedding and a dynamic embedding. The static embedding represents the entity's long-term stationary properties, while the dynamic embedding represents its time-varying properties and is learned using the JODIE algorithm. Both embeddings are used to generate the trajectory. This enables JODIE to make predictions from both the stationary and time-varying properties of the user.
The JODIE model consists of two major components: an update operation and a projection operation.
The update operation of JODIE has two Recurrent Neural Networks (RNNs) to generate user and item embeddings. Crucially, the two RNNs are coupled to explicitly incorporate the interdependency between users and items. After each interaction, the user RNN updates the user embedding by using the embedding of the interacting item. Similarly, the item RNN uses the user embedding to update the item embedding. The model also has the ability to incorporate feature vectors from the interaction, for example, the text of a Reddit post. It should be noted that JODIE is easily extendable to multiple types of entities by training one RNN for each entity type. In the current work, we show how to apply JODIE to the case of bipartite interactions between users and items.
A major innovation of JODIE is that it also uses a projection operation that predicts the future embedding trajectory of the users. Intuitively, the embedding of a user will change slightly after a short time has elapsed since her previous interaction (with any item), while the embedding can change significantly after a long time has elapsed. As a result, JODIE trains a temporal attention layer to project the embedding of a user after some time has elapsed since her previous interaction. The projected user embedding is then used to predict the item that the user is most likely to interact with.
To predict the item that a user will interact with, an important design decision is to output the embedding of an item, instead of an interaction probability. Current methods generate a probability score of interaction between a user and an item, which takes linear time to find the most likely item because probability scores for all items have to be generated first. Instead, by directly generating the item embedding, we can recommend the item that is closest to the predicted item in the embedding space. This can be done efficiently in near-constant time using locality sensitive hashing (LSH) techniques (leskovec2014mining).

Present work: tBatch. Most existing models learn embeddings from a sequence of interactions by processing one interaction after the other, in increasing order of time, to maintain the temporal dependency among the interactions (dai2016deep; zhang2017learning; wang2016coevolutionary). This makes such algorithms unscalable to real datasets with millions of interactions. Therefore, we create a batching algorithm, called tBatch, to train JODIE by creating training batches of independent interactions such that the interactions in each batch can be processed in parallel. To do so, we iteratively select independent edge sets from the interaction network. In every batch, each user and item appears at most once, and the temporally-sorted interactions of each user (and item) appear in monotonically increasing batches. Experimentally, we show that tBatch makes JODIE 9.2× faster than its most similar dynamic embedding baselines.
Present work: Experiments. We conduct six experiments to evaluate the performance of JODIE on two tasks: predicting the next interaction of a user and predicting the change in state of users (when a user will be banned from social platforms and when a student will drop out of a MOOC course). We use four datasets from Reddit, Wikipedia, LastFM, and MOOC course activity in our experiments. We compare JODIE with six state-of-the-art algorithms from three categories: deep recurrent recommender algorithms (zhu2017next; beutel2018latent; wu2017recurrent), a temporal node embedding algorithm (nguyen2018continuous), and dynamic co-evolution models (dai2016deep). JODIE improves over the baseline algorithms on the interaction prediction task by at least 20% in terms of mean reciprocal rank, and by 12% in AUC on average for predicting user state change. We further show that JODIE is robust to the percentage of training data and the size of the embeddings.
Overall, in this paper, we make the following contributions:
Embedding algorithm: We propose a coupled recurrent neural network model called JODIE to learn embedding trajectories of users and items. Crucially, JODIE also learns a projection operator to predict the embedding trajectory of users and predicts future interactions in constant time.
Batching algorithm: We propose the tBatch algorithm to create independent but temporally consistent training data batches, which helps to train JODIE 9.2× faster than the closest baseline.
Effectiveness: JODIE outperforms six state-of-the-art algorithms in predicting future interactions and user state changes, performing at least 20% better in predicting future interactions and 12% better on average in predicting user state change.
The code and datasets are available on the project website:
https://snap.stanford.edu/jodie.
Here we discuss the research closest to our problem setting spanning three broad areas. Table 1 compares their differences.
Table 1: Comparison of JODIE with existing methods (deep recurrent models, temporal network embedding models, and co-evolution models).

Property | LSTM | Time-LSTM (zhu2017next) | RRN (wu2017recurrent) | LatentCross (beutel2018latent) | CTDNE (nguyen2018continuous) | IGE (zhang2017learning) | DeepCoevolve (dai2016deep) | JODIE (proposed)
Predicts embedding trajectory | | | | | | | | ✔
Predicts future item embedding | | | | | | | | ✔
Trains using batches of data | ✔ | ✔ | ✔ | ✔ | | | | ✔
Deep recurrent recommender models. Several recent models employ recurrent neural networks (RNNs) and their variants (LSTMs and GRUs) to build recommender systems. RRN (wu2017recurrent) uses RNNs to generate dynamic user and item embeddings from rating networks. Recent methods, such as Time-LSTM (zhu2017next) and LatentCross (beutel2018latent), learn how to incorporate features into the embeddings. However, most of these methods suffer from two major shortcomings. First, they take the one-hot vector of the item as input to update the user embedding. This only incorporates the item id and ignores the item's current state. The second shortcoming is that some models, such as Time-LSTM and LatentCross, generate dynamic embeddings only for users and not for items.
JODIE overcomes these shortcomings by learning embeddings for both users and items using mutually-recursive RNNs. In doing so, JODIE outperforms these methods by at least 20% in predicting the next interaction and by 12% on average in predicting user state change, while having running time comparable to these methods.
Dynamic co-evolution models. Methods that jointly learn representations of users and items have recently been developed using point-process modeling (wang2016coevolutionary; trivedi2017know) and RNN-based modeling (dai2016deep). The basic idea behind these models is similar to JODIE: user and item embeddings influence each other whenever they interact. However, the major differences between JODIE and these models are that JODIE trains a projection operation to forecast the user embedding at any time, outputs item embeddings instead of an interaction probability, and trains the model using batching. As a result, we observe that JODIE outperforms DeepCoevolve by at least 44.8% in predicting the next interaction and 14% in predicting state change. In addition, most of these existing models are not scalable because they process interactions in sequential order to maintain temporal dependency. JODIE overcomes this limitation by creating efficient training data batches, which makes JODIE 9× faster than these baselines.
Temporal network embedding models. Several models have recently been developed that generate embeddings for the nodes (users and items) in temporal networks. CTDNE (nguyen2018continuous) is a state-of-the-art algorithm that generates embeddings using temporally-increasing random walks, but it generates one final static embedding of the nodes. Similarly, IGE (zhang2017learning) generates one final embedding of users and items from interaction graphs. Therefore, both these methods (CTDNE and IGE) need to be re-run for every new edge to create dynamic embeddings. Another recent algorithm, DynamicTriad (zhou2018dynamic), learns dynamic embeddings but does not work on bipartite interaction networks as it requires the presence of triads. Other recent algorithms, such as DDNE (DBLP:journals/access/LiZYZY18), DANE (DBLP:conf/cikm/LiDHTCL17), DynGem (goyal2018dyngem), Zhu et al. (zhu2016scalable), and Rahman et al. (DBLP:journals/corr/abs180405755), learn embeddings from a sequence of graph snapshots, which is not applicable to our setting of continuous interaction data. Recent models such as the NP-GLM model (DBLP:journals/corr/abs171000818), DGNN (DBLP:journals/corr/abs181010627), and DyRep (trivedi2018representation) learn embeddings from persistent links between nodes, which do not exist in interaction networks as the edges represent instantaneous interactions.
Our proposed model, JODIE, overcomes these shortcomings by generating and predicting the trajectories of users and items. In doing so, JODIE performs 4.4× better than CTDNE in predicting the next interaction, while having comparable running time.
In this section, we propose JODIE, a method to learn embedding trajectories of users and items from an ordered sequence of temporal user-item interactions S_1, S_2, …, S_N. An interaction S_r = (u_r, i_r, t_r, f_r) happens between a user u_r and an item i_r at time t_r. Each interaction has an associated feature vector f_r (e.g., a vector representing the text of a post). Table 2 lists the symbols used. For ease of notation, we will drop the subscript r in the rest of the section.
Our proposed model, called JODIE, learns an embedding trajectory for users and items and is reminiscent of the popular Kalman filtering algorithm (julier1997new). (Kalman filtering is used to accurately measure the state of a system using a combination of system observations and state estimates given by the laws of the system.) JODIE uses the interactions to update the state of the interacting users and items via a trained update operation. JODIE trains a projection operation that uses the previously observed state and the elapsed time to predict the future embedding of the user. When the user's and item's next interactions are observed, their embeddings are updated again. We illustrate the model in Figure 2 and the projection operation in Figure 3.

Static and dynamic embeddings. Each user and item is assigned two embeddings: a static and a dynamic embedding. We use both embeddings to encode both the long-term stationary properties of the entities and their dynamic properties.
Static embeddings, denoted \bar{u} for user u and \bar{i} for item i, do not change over time. These are used to express stationary properties such as the long-term interest of users. We use one-hot vectors as static embeddings of all users and items, as advised in Time-LSTM (zhu2017next) and Time-Aware LSTM (baytas2017patient). Using node2vec (grover2016node2vec) gave empirically similar results, so we use one-hot vectors.
On the other hand, each user u and item i is assigned a dynamic embedding, represented as u(t) and i(t) at time t, respectively. These embeddings change over time to model their time-varying behavior and properties. The sequence of dynamic embeddings of a user/item is referred to as its trajectory.
Next, we describe the update and projection operations. Then, we will describe how we predict the future interaction item embeddings and how we train the model.
In the update operation, the interaction between a user u and an item i at time t is used to generate their dynamic embeddings u(t) and i(t). Figure 2 illustrates the update operations.
Our model uses two recurrent neural networks for updates: RNN_U is shared across all users to update user embeddings, and RNN_I is shared among all items to update item embeddings. The hidden states of the user RNN and the item RNN represent the user and item embeddings, respectively.
The two RNNs are mutually-recursive. When user u interacts with item i, RNN_U updates the embedding u(t) by using the embedding of item i right before time t, denoted i(t^-), as an input. i(t^-) is the same as item i's embedding after its previous interaction with any user. Notice that this design decision is in stark contrast with the popular use of items' one-hot vectors to update user embeddings (beutel2018latent; wu2017recurrent; zhu2017next), which has the following two disadvantages: (a) a one-hot vector only contains the information about the item's id and not the item's current state, and (b) the dimension of the one-hot vector becomes very large when real datasets have millions of items, making the model challenging to train and scale. Instead, we use the dynamic embedding of an item as it reflects the item's current state, leading to more meaningful dynamic user embeddings and easier training. For the same reason, RNN_I updates the dynamic embedding i(t) of item i by using the dynamic user embedding u(t^-) (which is u's embedding right before time t). This results in a mutually recursive dependency between the embeddings. More formally,

u(t) = \sigma(W_1^u u(t^-) + W_2^u i(t^-) + W_3^u f + W_4^u \Delta_u)
i(t) = \sigma(W_1^i i(t^-) + W_2^i u(t^-) + W_3^i f + W_4^i \Delta_i)

where \Delta_u denotes the time since u's previous interaction (with any item) and \Delta_i is the time since item i's previous interaction (with any user). f is the interaction feature vector. The matrices W_1^u, …, W_4^u are the parameters of RNN_U and the matrices W_1^i, …, W_4^i are the parameters of RNN_I. \sigma(\cdot) is a sigmoid function to introduce non-linearity. The matrices are trained to predict the embedding of the item at u's next interaction, as explained later in Section 3.3. Variants of RNNs, such as LSTM, GRU, and T-LSTM (zhu2017next), gave experimentally similar and sometimes worse performance, so we use RNNs in our model to reduce the number of trainable parameters.
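To make the coupled update operation above concrete, here is a minimal NumPy sketch (an illustration only, not the paper's PyTorch implementation; the dimensions and the randomly initialized parameter matrices are placeholders standing in for the trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d, f_dim = 4, 3  # illustrative embedding and feature dimensions

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Placeholder parameter matrices (W_1..W_4) for RNN_U and RNN_I;
# in the real model these are learned jointly during training.
W_u = [rng.standard_normal((d, d)), rng.standard_normal((d, d)),
       rng.standard_normal((d, f_dim)), rng.standard_normal((d, 1))]
W_i = [rng.standard_normal((d, d)), rng.standard_normal((d, d)),
       rng.standard_normal((d, f_dim)), rng.standard_normal((d, 1))]

def update(u_prev, i_prev, feat, delta_u, delta_i):
    """Mutually-recursive update: each new embedding is computed from the
    OTHER entity's embedding right before time t, plus the interaction
    features and the time elapsed since each entity's last interaction."""
    u_new = sigmoid(W_u[0] @ u_prev + W_u[1] @ i_prev
                    + W_u[2] @ feat + (W_u[3] * delta_u).ravel())
    i_new = sigmoid(W_i[0] @ i_prev + W_i[1] @ u_prev
                    + W_i[2] @ feat + (W_i[3] * delta_i).ravel())
    return u_new, i_new

u_t, i_t = update(np.zeros(d), np.zeros(d), np.ones(f_dim), 0.5, 1.2)
```

Because each update reads the other entity's most recent dynamic embedding, the item input carries the item's current state rather than just its id, as discussed above.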
Table 2: Notation.

Symbol | Meaning
u(t), i(t) | Dynamic embedding of user u and item i at time t
u(t^-), i(t^-) | Dynamic embedding of user u and item i right before time t
\bar{u}, \bar{i} | Static embedding of user u and item i
\hat{u}(t) | Projected embedding of user u at time t
\tilde{j}(t) | Predicted item embedding at time t
Here we explain one of the major contributions of our algorithm, the embedding projection operator, which predicts the future embedding trajectory of the user. This is done by projecting the embedding of the user at a future time. The projected embedding can then be used for downstream tasks, such as predicting items the user will interact with at a given query/prediction time in the future.
Figure 3 visualizes the main idea of projecting a user's embedding trajectory. The operation projects the embedding of a user u after a time \Delta has elapsed since its last interaction at time t. To give an example, a short duration \Delta_1 after time t, the user u's projected embedding \hat{u}(t + \Delta_1) is close to its previously observed embedding u(t). As more time elapses, the projected embeddings drift farther, to \hat{u}(t + \Delta_2) and \hat{u}(t + \Delta_3). When the next interaction is observed at time t + \Delta, the user's embedding is updated to u(t + \Delta) using the update operation.
Two inputs are required for the projection operation: u's embedding u(t) at time t and the elapsed time \Delta. We follow the method suggested in LatentCross (beutel2018latent) to incorporate time into the projected embedding via a Hadamard product. We do not simply concatenate the embedding and the time and pass them through a linear layer, as prior research has shown that neural networks are inefficient in modeling the interactions between concatenated inputs. Instead, we create a temporal attention vector as described below.
We first convert \Delta to a time-context vector w using a linear layer (represented by the vector W_p): w = W_p \Delta. We initialize W_p with a 0-mean Gaussian. The projected embedding is then obtained as an element-wise product of the time-context vector with the previous embedding, as follows:

\hat{u}(t + \Delta) = (1 + w) * u(t)

The vector (1 + w) acts as a temporal attention vector to scale the past user embedding. When \Delta = 0, then w = 0 and the projected embedding is the same as the input embedding vector. The larger the value of \Delta, the more the projected embedding vector differs from the input embedding vector, and the projected embedding vector drifts over time.
We find that a linear layer works the best to project the embedding as it is equivalent to a linear transformation in the embedding space. Adding nonlinearity to the transformation makes the projection operation nonlinear, which we find experimentally to reduce the prediction performance. Thus, we use the linear transformation as described above.
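The projection step can be sketched in a few lines of NumPy (an illustration under assumed dimensions; W_p here stands in for the trained time-context layer, initialized, as described above, from a 0-mean Gaussian):

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)
W_p = rng.normal(0.0, 0.1, size=(d, 1))  # time-context linear layer (0-mean Gaussian init)

def project(u_t, delta):
    """Project u(t) forward by elapsed time delta:
    w = W_p * delta, then u_hat = (1 + w) ⊙ u(t) (element-wise product)."""
    w = (W_p * delta).ravel()
    return (1.0 + w) * u_t

u_t = rng.standard_normal(d)
u_hat = project(u_t, 0.5)
```

At delta = 0 the projection returns u(t) unchanged, and the drift away from u(t) grows with the elapsed time, matching the behavior described above.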
Next, we describe how we train the model to efficiently project user embeddings such that they are useful in predicting the next item with which the user will interact.
Let user u interact with item i at time t and then with item j at time t + \Delta. Right before time t + \Delta, can we predict which item u will interact with? We use this task to train the update and projection operations in JODIE. We train JODIE to make this prediction using u's projected embedding \hat{u}(t + \Delta).
A crucial design decision here is that JODIE directly outputs an item embedding vector, \tilde{j}(t + \Delta), instead of an interaction probability between u and item j. This has the advantage of reducing the computation at inference time from linear (in the number of items) to near-constant. Most existing methods (dai2016deep; wu2017recurrent; beutel2018latent; DBLP:conf/kdd/DuDTUGS16) that output an interaction probability need to do the expensive neural-network forward pass |I| times (once for each of the |I| items) to find the item with the highest probability score. In contrast, JODIE only needs to do the forward pass of the prediction layer once to output a predicted item embedding. The item with the closest embedding can then be returned in near-constant time by using locality sensitive hashing (LSH) techniques (leskovec2014mining). To maintain the LSH data structure, we update it whenever an item's embedding is updated.
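The recommendation step then reduces to a nearest-neighbor query in the item embedding space. The brute-force scan below is only for illustration (it is O(|I|)); in the actual system an LSH index answers the same query in near-constant time. The embedding matrix and its dimensions are placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
item_embeddings = rng.standard_normal((1000, 8))  # one row per item (placeholder values)

def recommend(predicted_item_embedding):
    """Return the index of the item whose embedding is closest (in
    Euclidean distance) to the predicted item embedding. An LSH index
    over item_embeddings would answer the same query in near-constant
    time; that index must be refreshed whenever an item's embedding
    is updated."""
    dists = np.linalg.norm(item_embeddings - predicted_item_embedding, axis=1)
    return int(np.argmin(dists))

best = recommend(item_embeddings[42] + 0.01)  # query a point near item 42
```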
Thus, we train JODIE to minimize the difference between the predicted item embedding \tilde{j}(t + \Delta) and the real item embedding [\bar{j}, j(t + \Delta^-)]. Here, [x, y] represents the concatenation of vectors x and y, and the superscript '-' indicates the embedding immediately before the time.

We make this prediction using the projected user embedding \hat{u}(t + \Delta) and the embedding i(t + \Delta^-) of item i (the item from u's previous interaction) immediately before time t + \Delta. The reason we include i(t + \Delta^-) is twofold: (a) i may interact with other users between time t and t + \Delta, and thus the embedding contains more recent information, and (b) users often interact with the same item consecutively (i.e., j = i), and including the item embedding helps to ease the prediction. We use both the static and dynamic embeddings to predict the static and dynamic embedding of the predicted item j. The prediction is made using a fully connected linear layer as follows:

\tilde{j}(t + \Delta) = W_1 \hat{u}(t + \Delta) + W_2 \bar{u} + W_3 i(t + \Delta^-) + W_4 \bar{i} + B

where the matrices W_1, …, W_4 and the bias vector B make the linear layer.

Training the model. JODIE is trained to minimize the distance between the predicted item embedding and the ground truth item's embedding at every interaction. We calculate the total loss as follows:

Loss = \sum_{(u, j, t, f) \in S} ||\tilde{j}(t) - [\bar{j}, j(t^-)]||_2 + \lambda_U ||u(t) - u(t^-)||_2 + \lambda_I ||j(t) - j(t^-)||_2
The first loss term minimizes the predicted embedding error. The last two terms are added to regularize the loss and prevent the consecutive dynamic embeddings of a user and an item, respectively, from varying too much. \lambda_U and \lambda_I are scaling parameters to ensure that the losses are in the same range. It is noteworthy that we do not use negative sampling during training, as JODIE directly outputs the embedding of the predicted item.
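As a sketch, the per-interaction loss described above can be written as follows (an illustration using squared Euclidean distances for simplicity; dimensions, weights, and argument names are placeholders, and [x, y] denotes concatenation as in the text):

```python
import numpy as np

def jodie_loss(j_pred, j_static, j_dyn_prev, u_new, u_prev, j_new, j_prev,
               lambda_u=1.0, lambda_i=1.0):
    """One interaction's contribution to the total loss: the prediction
    error against the concatenated ground-truth item embedding, plus two
    regularizers that keep consecutive dynamic embeddings of the user and
    the item from varying too much."""
    target = np.concatenate([j_static, j_dyn_prev])      # [j_bar, j(t^-)]
    pred_err = np.sum((j_pred - target) ** 2)
    u_reg = lambda_u * np.sum((u_new - u_prev) ** 2)
    i_reg = lambda_i * np.sum((j_new - j_prev) ** 2)
    return pred_err + u_reg + i_reg
```

Note that no negative samples appear anywhere in this computation, consistent with the training procedure described above.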
Extending the loss for categorical prediction. In certain prediction tasks, such as user state change prediction, additional training labels may be present for supervision. The user state change labels are binary (categorical). In those cases, we can train another prediction function to predict the label using the embedding of the user after an interaction. We calculate the cross-entropy loss for the categorical labels and add it to the above loss function with another scaling parameter. We explicitly do not train to minimize only the cross-entropy loss, in order to prevent overfitting.
Here we explain the batching algorithm we propose to parallelize the training of JODIE. It is important to maintain the temporal dependencies between interactions during training, such that interaction S_r is processed before interaction S_{r+1}.
Existing methods that use a single RNN, such as T-LSTM (zhu2017next) and RRN (wu2017recurrent), split users into different batches and process them in parallel. This is possible because these approaches use one-hot vector encodings of items as inputs and can thus be trained using the standard Back-Propagation Through Time (BPTT) mechanism.
However, in JODIE, the mutually-recursive RNNs enable us to incorporate the item's embedding to update the user embedding, and vice versa. This creates interdependencies between two users that interacted with the same item, which prevents us from simply splitting users into separate batches and processing them in parallel.
Most existing methods that also use two mutually-recursive RNNs (dai2016deep; zhang2017learning) naively process all the interactions one at a time in sequential order. However, this is not scalable to a large number of interactions, as the training process is very slow. Therefore, we train JODIE using a training data batching algorithm that we call tBatch. This leads to an order-of-magnitude speedup in training JODIE compared to most existing training approaches.
Creating the training batches is challenging because batching has two requirements: (1) all interactions in each batch should be processable in parallel, and (2) processing the batches in increasing order of their index should maintain the temporal ordering of the interactions, and thus should generate the same embeddings as training without any batching.
To overcome these challenges, tBatch creates each batch by selecting independent edge sets of the interaction network, i.e., two interactions in the same batch do not share any common user or item. tBatch works iteratively in two steps: the select step and the reduce step. In the select step, a new batch is created by selecting a maximal edge set such that each selected edge (u, i) is the lowest-timestamped remaining edge incident on both u and i. This trivially makes the batch an independent edge set. In the reduce step, the selected edges are removed from the network. tBatch iterates the two steps till no edges remain in the graph. Thus, each batch is parallelizable, and processing the batches in order maintains the sequential dependencies.
In practice, we implement tBatch as a sequential algorithm as follows. The algorithm assigns each interaction S_r to a batch B_k. We initialize N empty batches (for the worst-case scenario in which each batch has only one interaction). We iterate through the temporally-sorted sequence of interactions and add each interaction to a batch. Let b be the largest index of a batch that so far contains an interaction involving user u_r or item i_r. Then the interaction S_r (say, between user u_r and item i_r) is assigned to the batch with index b + 1. The complexity of creating the batches is O(N), i.e., linear in the number of interactions, as each interaction is processed exactly once.
It is trivial to verify that the tBatch algorithm satisfies the two requirements. tBatch ensures that each user and item appears at most once in every batch, and thus each batch can be parallelized. In addition, the k-th and (k+1)-th interactions of every user and every item are assigned to batches B_a and B_b, respectively, such that a < b. So JODIE can process the batches in increasing order of their indices to ensure that the temporal ordering of the interactions is respected.
We do not predetermine the number and size of the batches because they depend on the interactions in the dataset. The number of batches can range between 1 and N. Let us illustrate these two extreme cases. When all interactions have unique users and items, then only one batch is created that contains all the interactions. On the other extreme, if all interactions involve the same user or the same item, then N batches are created. Therefore, we initialize N batches and discard all trailing empty batches after assignment.
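The sequential assignment rule described above can be sketched as follows (a simplified illustration that keys batches on user and item ids only and assumes the interactions are already sorted by timestamp):

```python
def t_batch(interactions):
    """Assign temporally-sorted (user, item) interactions to batches.
    Each interaction goes into batch 1 + (largest batch index already
    holding an interaction of its user or its item), so every batch is
    an independent edge set and processing batches in index order
    preserves temporal order. Runs in O(N) over N interactions."""
    last_batch = {}          # entity -> largest batch index used so far
    batches = []
    for user, item in interactions:          # must be sorted by timestamp
        k = max(last_batch.get(('u', user), -1),
                last_batch.get(('i', item), -1)) + 1
        if k == len(batches):
            batches.append([])
        batches[k].append((user, item))
        last_batch[('u', user)] = k
        last_batch[('i', item)] = k
    return batches

# Example: u1 appears twice, so its second interaction must land in a
# later batch than its first; within each batch no user or item repeats.
batches = t_batch([('u1', 'i1'), ('u2', 'i2'), ('u1', 'i2'), ('u3', 'i1')])
```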
DeepCoevolve is the closest state-of-the-art algorithm to JODIE because it also trains two mutually-recursive RNNs to generate embedding trajectories. However, the key differences between JODIE and DeepCoevolve are the following: (i) JODIE uses a novel projection function to predict the future trajectory of users. Instead, DeepCoevolve maintains the same embedding for a user between two of her consecutive interactions. Predicting the trajectory enables JODIE to make more effective predictions. (ii) JODIE predicts the embedding of the next item that a user will interact with. In contrast, DeepCoevolve predicts the probability of interaction between a user and an item. At inference time, DeepCoevolve requires one forward pass through the inference layer per item to recommend the item with the highest score. On the other hand, JODIE takes near-constant time. (iii) JODIE is trained with batches of interaction data, as opposed to individual interactions.
As a result, as we will see in the experiments section, JODIE significantly outperforms DeepCoevolve both in terms of prediction performance and training time. JODIE is 9.2× faster, 45% better in predicting future interactions, and 13.9% better in predicting user state change on average.
In this section, we experimentally validate the effectiveness of JODIE on two tasks: future interaction prediction and user state change prediction. We conduct experiments on three datasets each and compare with six strong baselines to show the following:
JODIE outperforms the baselines by at least 20% in terms of mean reciprocal rank in predicting the next item and 12% on average in predicting user state change.
We show that JODIE is 9.2× faster than DeepCoevolve and comparable in running time to the other baselines.
JODIE is robust in performance to the availability of training data and the dimension of the embedding.
Finally, in a case study on the MOOC dataset, we show that JODIE can predict student dropout five interactions in advance.
We first explain the experimental setting and the baseline methods and then describe the experimental results.
Experimental setting. We train all models by splitting the data by time to simulate the real-world setting: we train on the earliest interactions, validate on the interactions that follow, and test on the remaining interactions (the exact split percentages are given per task below).
For a fair comparison, we use 128 as the dimensionality of the dynamic embeddings for all algorithms and one-hot vectors as static embeddings. All algorithms are run for 50 epochs, and all reported numbers are on the test data at the epoch with the best validation performance.
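The chronological split described above can be sketched as follows (illustrative helper, not the authors' code; interactions are assumed to be sorted by timestamp):

```python
def chronological_split(interactions, train_frac, val_frac):
    """Split time-ordered interactions into train/validation/test without
    shuffling, so the model never trains on events that occur after the
    ones it is evaluated on."""
    n = len(interactions)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = interactions[:n_train]
    val = interactions[n_train:n_train + n_val]
    test = interactions[n_train + n_val:]
    return train, val, test
```

For example, `chronological_split(data, 0.8, 0.1)` reproduces the 80/10/10 split used for interaction prediction, and `chronological_split(data, 0.6, 0.2)` the 60/20/20 split used for state change prediction.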
Method  Reddit  Wikipedia  LastFM  Minimum % improvement of JODIE over method
  MRR  Recall@10  MRR  Recall@10  MRR  Recall@10  MRR  Recall@10
LSTM  0.355  0.551  0.329  0.455  0.062  0.119  104.5%  54.6%
Time-LSTM (zhu2017next)  0.387  0.573  0.247  0.342  0.068  0.137  87.6%  48.7%
RRN (wu2017recurrent)  0.603  0.747  0.522  0.617  0.089  0.182  20.4%  14.1%
LatentCross (beutel2018latent)  0.421  0.588  0.424  0.481  0.148  0.227  31.8%  35.2%
CTDNE (nguyen2018continuous)  0.165  0.257  0.035  0.056  0.010  0.010  340.0%  231.5%
DeepCoevolve (dai2016deep)  0.171  0.275  0.515  0.563  0.019  0.039  44.8%  46.0%
JODIE (proposed)  0.726  0.852  0.746  0.822  0.195  0.307  -  -
Baselines. We compare JODIE with six stateoftheart algorithms spanning three algorithmic categories:
Deep recurrent recommender models: in this category, we compare with RRN (wu2017recurrent), LatentCross (beutel2018latent), Time-LSTM (zhu2017next), and a standard LSTM. These algorithms are state-of-the-art in recommender systems and generate dynamic user embeddings. We use the Time-LSTM-3 cell for Time-LSTM, as it performs the best in the original paper (zhu2017next), and LSTM cells in the RRN and LatentCross models. As is standard, we use one-hot vectors of items as inputs to these models.
Dynamic co-evolution models: here we compare with the state-of-the-art algorithm DeepCoevolve (dai2016deep), which has been shown to outperform other co-evolutionary point-process algorithms (trivedi2017know; wang2016coevolutionary). We use 10 negative samples per interaction for computational tractability.
Temporal network embedding models: we compare JODIE with CTDNE (nguyen2018continuous), which is the state-of-the-art in generating embeddings from temporal networks. As it generates static embeddings, we generate new embeddings after each edge is added. We use uniform neighborhood sampling, as it performs the best in the original paper (nguyen2018continuous).
The prediction task here is: given all interactions up to time t, which item (out of all items) will user u interact with at time t?
We use three datasets in this experiment:
Reddit post dataset: this public dataset consists of one month of posts made by users on subreddits (pushshift). We selected the 1,000 most active subreddits as items and the 10,000 most active users. This results in 672,447 interactions. We convert the text of each post into a feature vector representing their LIWC categories (pennebaker2001linguistic).
Wikipedia edits: this public dataset is one month of edits made by editors on Wikipedia pages (wikidump). We selected the 1,000 most edited pages as items and editors who made at least 5 edits as users (a total of 8,227 users). This generates 157,474 interactions. Similar to the Reddit dataset, we convert the edit text into a LIWC feature vector.
LastFM song listens: this public dataset has one month of who-listens-to-which-song information (lastfm). We selected all 1,000 users and the 1,000 most listened songs, resulting in 1,293,103 interactions. In this dataset, interactions do not have features.
We select these datasets such that they vary in terms of users' repetitive behavior: in Wikipedia and Reddit, a user interacts with the same item consecutively in 79% and 61% of interactions, respectively, while in LastFM this happens in only 8.6% of interactions.
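The repetition statistic above can be computed per user as the fraction of interactions that repeat the immediately preceding item (illustrative helper, not the authors' code):

```python
def consecutive_repetition(item_sequence):
    """Fraction of a user's interactions (after the first) that involve the
    same item as the immediately preceding interaction."""
    if len(item_sequence) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(item_sequence, item_sequence[1:])
                  if prev == cur)
    return repeats / (len(item_sequence) - 1)
```

Aggregating this quantity over all users' interaction sequences gives the dataset-level percentages quoted above.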
Method  Reddit  Wikipedia  MOOC  Mean improvement of JODIE
LSTM  0.523  0.575  0.686  23.08%
Time-LSTM  0.556  0.671  0.711  12.63%
RRN  0.586  0.804  0.558  13.69%
LatentCross  0.574  0.628  0.686  15.62%
DeepCoevolve  0.577  0.663  0.671  13.94%
JODIE  0.599  0.831  0.756  -
Experimental setting. We use the first 80% of the data to train, the next 10% to validate, and the final 10% to test. We measure the performance of the algorithms in terms of mean reciprocal rank (MRR) and recall@10: MRR is the average reciprocal rank of the ground-truth item, and recall@10 is the fraction of interactions in which the ground-truth item is ranked in the top 10. Higher values are better for both. For every interaction, the rank of the ground-truth item is calculated with respect to all items in the dataset.
For JODIE, items are ranked by their distance from the predicted item embedding, and the rank of the ground-truth item is calculated in this ranked list.
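Given the 1-based ranks of the ground-truth items, the two metrics can be computed as follows (illustrative helper, not the authors' code):

```python
def mrr_and_recall_at_10(ranks):
    """ranks: 1-based rank of the ground-truth item for each test interaction.
    MRR averages the reciprocal ranks; recall@10 is the fraction of
    interactions whose ground-truth item lands in the top 10."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    recall_at_10 = sum(1 for r in ranks if r <= 10) / len(ranks)
    return mrr, recall_at_10
```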
Results. Table 3 compares JODIE with the six state-of-the-art methods. We observe that JODIE significantly outperforms all baselines on all three datasets across both metrics. Among the baselines, there is no clear winner: RRN performs better on Reddit and Wikipedia, while LatentCross performs better on LastFM. As CTDNE generates static embeddings, its performance is low. We calculate the percentage improvement of JODIE over a baseline as (performance of JODIE minus performance of baseline)/(performance of baseline). Across all datasets, the minimum improvement of JODIE is at least 20% in terms of MRR and 14% in terms of recall@10. Note that JODIE outperforms DeepCoevolve, the algorithmically closest baseline, by at least 44.8% in MRR across all datasets.
Noticeably, we observe that JODIE performs well irrespective of how repetitive users are: the MRR is at least 20.4% higher in Wikipedia and Reddit (high-repetition datasets) and at least 31.75% higher in LastFM (a low-repetition dataset). This means JODIE learns to balance users' personal preferences with their non-repetitive interaction behavior.
In this experiment, the task is to predict whether an interaction will lead to a state change of the user, in two use cases: predicting whether a user will be banned and predicting whether a student will drop out of a course. Until a user is banned or drops out, the label of each of their interactions is '0', and their last interaction has the label '1'. For users who are never banned or never drop out, the label is always '0'. This is a highly challenging task, as fewer than 1% of the labels are '1'.
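The labeling scheme just described can be sketched as follows (illustrative helper, not the authors' code):

```python
def state_change_labels(user_interactions, changed_state):
    """Label each of a user's interactions '0', except that the final
    interaction of a user who is banned / drops out is labeled '1'."""
    labels = [0] * len(user_interactions)
    if changed_state and labels:
        labels[-1] = 1
    return labels
```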
We use three datasets for this task:
Reddit bans: the Reddit post dataset (from Section 4.1) with ground-truth labels of banned users from Reddit. This gives 366 true labels among 672,447 interactions (= 0.05%).
Wikipedia bans: the Wikipedia edit data (from Section 4.1) with public ground-truth labels of banned users (wikidump). This results in 217 positive labels among 157,474 interactions (= 0.14%).
MOOC student dropout: this public dataset consists of actions (e.g., viewing a video, submitting an answer) done by students in a MOOC online course (kddcup). It consists of 7,047 users interacting with 98 items (videos, answers, etc.), resulting in 411,749 interactions. There are 4,066 dropout events (= 0.98%).
Experimental setting. In this experiment, we train the models on the first 60% of interactions, validate on the next 20%, and test on the last 20%. We evaluate the models using the area under the ROC curve (AUC), a standard metric for tasks with highly imbalanced labels.
For the baselines, we train a logistic-regression classifier on the training data using the dynamic user embeddings as input. As before, for all models we report the test AUC at the epoch with the highest validation AUC.
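The AUC metric can be computed directly from classifier scores via its rank (Mann-Whitney) formulation: the probability that a randomly chosen positive scores higher than a randomly chosen negative, counting ties as half (illustrative helper, not the authors' code):

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This rank-based form is well suited to the highly imbalanced labels here, since it is insensitive to the base rate of positives.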
Results. Table 4 compares the performance of JODIE with the baseline models on the three datasets. JODIE outperforms the baselines by at least 12% on average in predicting user state change across all datasets. JODIE outperforms RRN, its closest competitor in the ban prediction tasks, by at least 2.2%, and by 28% in the student dropout task. Note that DeepCoevolve, the algorithmically most similar baseline, is outperformed by JODIE by 13.9% on average. Thus, JODIE consistently performs best across datasets.
Here we compare the running time of JODIE with the baseline algorithms. Algorithmically, DeepCoevolve is the closest to JODIE, as it also trains two mutually-recursive RNNs; the other methods train only one RNN and are therefore more easily scalable.
Figure 4 shows the running time (in minutes) of one epoch on the Reddit dataset. (We ran the experiment on one NVIDIA Titan X Pascal GPU with 12 GB of RAM.) We find that JODIE is 9.2× faster than DeepCoevolve, its closest algorithmic competitor. At the same time, the running time of JODIE is comparable to the baselines that use only one RNN in their model. This shows that JODIE trains the mutually-recursive model in time equivalent to non-mutually-recursive models, thanks to the t-Batch training algorithm.
In addition, we find that JODIE without t-Batch took 43.53 minutes per epoch, while JODIE with t-Batch took 5.13 minutes. Thus, t-Batch yields an 8.4× speedup.
In this experiment, we validate the robustness of JODIE by varying the percentage of training data and comparing the performance of the algorithms on both tasks: future interaction prediction and user state change prediction.
For next-item prediction, we vary the training-data percentage from 10% to 80%. In each case, we take the 10% of interactions after the training data as validation and the following 10% as test, so that performance is compared on test sets of the same size. Figures 5(a–c) show the mean reciprocal rank (MRR) of all algorithms on the three datasets as the training data size is increased. We note that the performance of JODIE is stable and does not vary much across data sizes. Moreover, JODIE consistently outperforms the baseline models by a significant margin (by up to 33.1%).
We make similar observations in the user state change prediction task. Here, we vary the training-data percentage over 20%, 40%, and 60%, and in each case take the following 20% of interactions as validation and the next 20% as test. Figure 5(d) shows the AUC of all algorithms on the Wikipedia dataset; other datasets give similar results. Again, we find that JODIE is stable and consistently outperforms the baselines, irrespective of the training data size.
Finally, we validate the effect of the dynamic embedding size on the predictions. To do this, we vary the dynamic embedding dimension from 32 to 256 and calculate the mean reciprocal rank for interaction prediction on the LastFM dataset; the effect on other datasets is similar. The results are shown in Figure 6. We find that the embedding dimension has little effect on the performance of JODIE, and JODIE performs the best overall. Interestingly, JODIE's improvement is larger for smaller embedding dimensions. This is because JODIE uses both the static and the dynamic embedding for prediction, which lets it learn from both parts.
In this paper, we proposed a coupled recurrent neural network model called JODIE that learns dynamic embeddings of users and items from a sequence of temporal interactions. JODIE learns to predict the future embeddings of users and items, which leads it to give better prediction performance of future useritem interactions and change in user state. We also presented a training data batching method that makes JODIE an order of magnitude faster than similar baselines.
There are several directions for future work. Learning embeddings for individual users and items is expensive, and one could learn trajectories for groups of users or items to reduce the number of parameters. Another direction is characterizing the trajectories to cluster similar entities. Finally, an innovative direction would be to design new items based on missing predicted items that many users are likely to interact with.
Acknowledgements. JL is a Chan Zuckerberg Biohub investigator. This research has been supported in part by NSF OAC-1835598, DARPA MCS, DARPA ASED, ARO MURI, Amazon, Boeing, Docomo, Hitachi, JD, Siemens, and the Stanford Data Science Initiative. We thank Sagar Honnungar for help with the initial phase of the project.
Here we describe some technical details of the model.
The code and datasets are available on the project website:
https://snap.stanford.edu/jodie.
We coded all the models and the baselines in PyTorch. Table 6 lists the dataset details and Table 5 lists the model parameters.

Parameter  Value
Optimizer  Adam
Learning rate  1e-3
Model weight decay  1e-5
Dynamic embedding size  128
Number of epochs  50
Future interaction prediction experiment
Training data percent  80%
Validation data percent  10%
Test data percent  10%
User state change experiment
Training data percent  60%
Validation data percent  20%
Test data percent  20%
Data  Users  Items  Interactions  State Changes  Action Repetition
Reddit  10,000  984  672,447  366  79%
Wikipedia  8,227  1,000  157,474  217  61%
LastFM  980  1,000  1,293,103  -  8.6%
MOOC  7,047  97  411,749  4,066  -