Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks

08/03/2019 ∙ by Srijan Kumar, et al. ∙ University of Illinois at Urbana-Champaign ∙ Stanford University

Modeling sequential interactions between users and items/products is crucial in domains such as e-commerce, social networking, and education. Representation learning presents an attractive opportunity to model the dynamic evolution of users and items, where each user/item can be embedded in a Euclidean space and its evolution can be modeled by an embedding trajectory in this space. However, existing dynamic embedding methods generate embeddings only when users take actions and do not explicitly model the future trajectory of the user/item in the embedding space. Here we propose JODIE, a coupled recurrent neural network model that learns the embedding trajectories of users and items. JODIE employs two recurrent neural networks to update the embedding of a user and an item at every interaction. Crucially, JODIE also models the future embedding trajectory of a user/item. To this end, it introduces a novel projection operator that learns to estimate the embedding of the user at any time in the future. These estimated embeddings are then used to predict future user-item interactions. To make the method scalable, we develop a t-Batch algorithm that creates time-consistent batches and leads to 9× faster training. We conduct six experiments to validate JODIE on two prediction tasks (future interaction prediction and state change prediction) using four real-world datasets. We show that JODIE outperforms six state-of-the-art algorithms in these tasks by at least 20% in predicting future interactions and 12% on average in predicting user state change.




Code Repositories


A PyTorch implementation of ACM SIGKDD 2019 paper "Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks"


1. Introduction

Figure 1. Left: a temporal interaction network of three users and four items. Each arrow represents an interaction with an associated timestamp $t$ and a feature vector $f$. Right: embedding trajectory of the users and items. We predict the future trajectory of users (the dotted line shown for one user) by training an embedding projection operator.

Users interact sequentially with items in many domains such as e-commerce (e.g., a customer purchasing an item) (zhang2017deep), education (a student enrolling in a MOOC course) (liyanagunawardena2013moocs), and social and collaborative platforms (a user posting in a group in Reddit) (iba2010analyzing; kumar2018community). The same user may interact with different items over a period of time and these interactions change over time (DBLP:journals/debu/HamiltonYL17; DBLP:conf/recsys/PalovicsBKKF14; zhang2017deep; agrawal2014big; DBLP:conf/asunam/ArnouxTL17; raghavan2014modeling; DBLP:journals/corr/abs-1711-10967). These interactions create a network of temporal interactions between users and items. Figure 1 (left) shows an example network between users and items, with each interaction marked with a time stamp and a feature vector (such as the review text or the purchase amount). Accurate real-time recommendation of items and predicting change in the state of users are fundamental problems in these domains (DBLP:conf/wsdm/QiuDMLWT18; DBLP:conf/asunam/ArnouxTL17; DBLP:conf/sdm/LiDLLGZ14; DBLP:journals/corr/abs-1804-01465; DBLP:conf/cosn/SedhainSXKTC13; walker2015complex; DBLP:conf/icwsm/Junuthula0D18). For instance, predicting when a student is likely to drop out of a MOOC course is important to develop early intervention measures (kloft2014predicting; yang2013turn) and predicting when a user is likely to turn malicious on social platforms, like Reddit and Wikipedia, ensures platform integrity (kumar2018rev2; kumar2015vews; cheng2017anyone).

Representation learning, or learning low-dimensional embeddings of entities, is a powerful approach to represent the evolution of users’ and items’ properties (DBLP:journals/kbs/GoyalF18; zhang2017deep; dai2016deep; DBLP:conf/nips/FarajtabarWGLZS15; beutel2018latent; zhou2018dynamic). However, the recent methods that generate dynamic embeddings suffer from four fundamental challenges. First, a majority of the existing methods generate an embedding for a user only when she takes an action (beutel2018latent; zhang2017deep; dai2016deep; zhou2018dynamic; you2019hierarchical). Consider a user who makes a purchase today, at which point her embedding is updated. The embedding will remain the same whether she returns to the platform the next day, a week later, or even a month later. As a result, the same predictions and recommendations will be made to her regardless of when she returns. However, a user’s intent changes over time (cheng2017predicting) and thus her embedding needs to be updated (projected) to the query time. The challenge here is how to accurately predict the embedding trajectories of users/items as time progresses. Second, entities have both stationary properties that do not change over time and time-evolving properties. Some existing methods (zhang2017deep; dai2016deep; wang2016coevolutionary) consider only one of the two when generating embeddings. However, it is essential to consider both in a unified framework to leverage information at both scales. Third, many existing methods predict user-item interactions by scoring all items for each user (zhang2017deep; dai2016deep; beutel2018latent; zhou2018dynamic). This takes time linear in the number of items and is not practical in scenarios with millions of items. Instead, methods are required that can recommend items in near-constant time. Fourth, most models are trained by sequentially processing the interactions one at a time, so that the temporal dependencies between the interactions are maintained (zhang2017deep; dai2016deep; wang2016coevolutionary). This prevents such models from scaling to datasets with millions of interactions. New methods are needed that can be trained with batches of data to generate embedding trajectories.

Present work. Here we present JODIE (Joint Dynamic User-Item Embeddings), which learns to generate embedding trajectories of all users and items from temporal interactions. The embedding trajectories of the example network are shown in Figure 1 (right). The embeddings of the user and item are updated when a user takes an action, and a projection operator predicts the future embedding trajectory of the user.

Present work: JODIE. Each user and item has two embeddings: a static embedding and a dynamic embedding. The static embedding represents the entity’s long-term stationary properties, while the dynamic embedding represents its time-varying properties and is learned using the JODIE algorithm. Both embeddings are used to generate the trajectory. This enables JODIE to make predictions from both the stationary and time-varying properties of the user.

The JODIE model consists of two major components: an update operation and a projection operation.

The update operation of JODIE has two Recurrent Neural Networks (RNNs) to generate user and item embeddings. Crucially, the two RNNs are coupled to explicitly incorporate the interdependency between users and items. After each interaction, the user RNN updates the user embedding by using the embedding of the interacting item. Similarly, the item RNN uses the user embedding to update the item embedding. The model also has the ability to incorporate feature vectors from the interaction, for example, the text of a Reddit post. It should be noted that JODIE is easily extendable to multiple types of entities by training one RNN for each entity type. In the current work, we show how to apply JODIE to the case of bipartite interactions between users and items.

A major innovation of JODIE is that it also uses a projection operation that predicts the future embedding trajectory of the users. Intuitively, the embedding of a user will change slightly after a short time elapses since her previous interaction (with any item), while the embedding can change significantly after a long time elapses. As a result, JODIE trains a temporal attention layer to project the embedding of a user after some time has elapsed since her previous interaction. The projected user embedding is then used to predict the item that the user is most likely to interact with.

To predict the item that a user will interact with, an important design decision is to output the embedding of an item, instead of an interaction probability. Current methods generate a probability score of interaction between a user and an item, which takes linear time to find the most likely item because probability scores for all items have to be generated first. Instead, by directly generating the item embedding, we can recommend the item that is closest to the predicted item in the embedding space. This can be done efficiently, in near-constant time, using locality sensitive hashing (LSH) techniques (leskovec2014mining).


Present work: t-Batch. Most existing models learn embeddings from a sequence of interactions by processing one interaction after the other, in increasing order of time to maintain the temporal dependency among the interactions (dai2016deep; zhang2017learning; wang2016coevolutionary). This makes such algorithms unscalable to real datasets with millions of interactions. Therefore, we create a batching algorithm, called t-Batch, to train JODIE by creating training batches of independent interactions such that the interactions in each batch can be processed in parallel. To do so, we iteratively select independent edge sets from the interaction network. In every batch, each user and item appears at most once and the temporally-sorted interactions of each user (and item) appear in monotonically increasing batches. Experimentally, we show that t-Batch makes JODIE 9.2× faster than its most similar dynamic embedding baselines.

Present work: Experiments. We conduct six experiments to evaluate the performance of JODIE on two tasks: predicting the next interaction of a user and predicting the change in state of users (when a user will be banned from social platforms and when a student will drop out from a MOOC course). We use four datasets, from Reddit, Wikipedia, LastFM, and a MOOC course activity, for our experiments. We compare JODIE with six state-of-the-art algorithms from three categories: deep recurrent recommender algorithms (zhu2017next; beutel2018latent; wu2017recurrent), a temporal node embedding algorithm (nguyen2018continuous), and a dynamic co-evolution model (dai2016deep). JODIE improves over the baseline algorithms on the interaction prediction task by at least 20% in terms of mean reciprocal rank and by 12% in AUC on average for predicting user state change. We further show that JODIE is robust to the percentage of training data and the size of the embeddings.

Overall, in this paper, we make the following contributions:
Embedding algorithm: We propose a coupled recurrent neural network model called JODIE to learn embedding trajectories of users and items. Crucially, JODIE also learns a projection operator to predict the embedding trajectory of users and predicts future interactions in constant time.
Batching algorithm: We propose the t-Batch algorithm to create independent but temporally consistent training data batches that help to train JODIE 9.2× faster than the closest baseline.
Effectiveness: JODIE outperforms six state-of-the-art algorithms in predicting future interactions and user state change predictions, by performing at least 20% better in predicting future interactions and 12% better on average in predicting user state change.

The code and datasets are available on the project website:

2. Related Work

Here we discuss the research closest to our problem setting spanning three broad areas. Table 1 compares their differences.

Algorithms compared (columns): recurrent network models LSTM and Time-LSTM (zhu2017next), RRN (wu2017recurrent), and LatentCross (beutel2018latent); temporal network embedding models CTDNE (nguyen2018continuous) and IGE (zhang2017learning); Deep Co-evolution (dai2016deep); and the proposed model, JODIE.
Properties compared (rows): predict embedding trajectory; predict future item embedding; train using batches of data.
Table 1. Table comparing the desired properties of the existing algorithms and our proposed JODIE algorithm. JODIE satisfies all the desirable properties.

Deep recurrent recommender models. Several recent models employ recurrent neural networks (RNNs) and variants (LSTMs and GRUs) to build recommender systems. RRN (wu2017recurrent) uses RNNs to generate dynamic user and item embeddings from rating networks. Recent methods, such as Time-LSTM (zhu2017next) and LatentCross (beutel2018latent) learn how to incorporate features into the embeddings. However, most of these methods suffer from two major shortcomings. First, they take the one-hot vector of the item as input to update the user embedding. This only incorporates the item id and ignores the item’s current state. The second shortcoming is that some models, such as Time-LSTM and LatentCross, generate dynamic embeddings only for users and not for items.

JODIE overcomes these shortcomings by learning embeddings for both users and items using mutually-recursive RNNs. In doing so, JODIE outperforms these methods by at least 20% in predicting the next interaction and 12% on average in predicting user state change, while having a running time comparable to these methods.

Dynamic co-evolution models. Methods that jointly learn representations of users and items have recently been developed using point-process modeling (wang2016coevolutionary; trivedi2017know) and RNN-based modeling (dai2016deep). The basic idea behind these models is similar to JODIE: user and item embeddings influence each other whenever they interact. However, the major differences between JODIE and these models are that JODIE trains a projection operation to forecast the user embedding at any time, outputs item embeddings instead of interaction probabilities, and trains the model using batching. As a result, we observe that JODIE outperforms DeepCoevolve by at least 44.8% in predicting the next interaction and 14% in predicting state change. In addition, most of these existing models are not scalable because they process interactions in a sequential order to maintain temporal dependency. JODIE overcomes this limitation by creating efficient training data batches, which makes JODIE 9× faster than these baselines.

Temporal network embedding models. Several models have recently been developed that generate embeddings for the nodes (users and items) in temporal networks. CTDNE (nguyen2018continuous) is a state-of-the-art algorithm that generates embeddings using temporally-increasing random walks, but it generates one final static embedding of the nodes. Similarly, IGE (zhang2017learning) generates one final embedding of users and items from interaction graphs. Therefore, both these methods (CTDNE and IGE) need to be re-run for every new edge to create dynamic embeddings. Another recent algorithm, DynamicTriad (zhou2018dynamic) learns dynamic embeddings but does not work on bipartite interaction networks as it requires the presence of triads. Other recent algorithms such as DDNE (DBLP:journals/access/LiZYZY18), DANE (DBLP:conf/cikm/LiDHTCL17), DynGem (goyal2018dyngem), Zhu et al. (zhu2016scalable), and Rahman et al. (DBLP:journals/corr/abs-1804-05755) learn embeddings from a sequence of graph snapshots, which is not applicable to our setting of continuous interaction data. Recent models such as NP-GLM model (DBLP:journals/corr/abs-1710-00818), DGNN (DBLP:journals/corr/abs-1810-10627), and DyRep (trivedi2018representation) learn embeddings from persistent links between nodes, which do not exist in interaction networks as the edges represent instantaneous interactions.

Our proposed model, JODIE, overcomes these shortcomings by generating and predicting the trajectories of users and items. In doing so, JODIE performs 4.4× better than CTDNE in predicting the next interaction, while having comparable running time.

3. JODIE: Joint Dynamic User-Item Embedding Model

Figure 2. The JODIE model: After an interaction $(u, i, t, f)$ between user $u$ and item $i$, the dynamic embeddings of $u$ and $i$ are updated in the update operation with $RNN_U$ and $RNN_I$, respectively. The projection operation predicts the user embedding at a future time $t + \Delta$.

In this section, we propose JODIE, a method to learn embedding trajectories of users and items from an ordered sequence of temporal user-item interactions $S_r = (u_r, i_r, t_r, f_r)$. An interaction $S_r$ happens between a user $u_r$ and an item $i_r$ at time $t_r$. Each interaction has an associated feature vector $f_r$ (e.g., a vector representing the text of a post). Table 2 lists the symbols used. For ease of notation, we will drop the subscript $r$ in the rest of the section.

Our proposed model, called JODIE, learns an embedding trajectory for users and items and is reminiscent of the popular Kalman Filtering algorithm (julier1997new). (Kalman filtering is used to accurately measure the state of a system using a combination of system observations and state estimates given by the laws of the system.) JODIE uses the interactions to update the state of the interacting users and items via a trained update operation. JODIE trains a projection operation that uses the previous observed state and the elapsed time to predict the future embedding of the user. When the user’s and item’s next interactions are observed, their embeddings are updated again. We illustrate the model in Figure 2 and the projection operation in Figure 3.

Static and Dynamic Embeddings. Each user and item is assigned two embeddings: a static and a dynamic embedding. We use both embeddings to encode both the long-term stationary properties of the entities and their dynamic properties.

Static embeddings, $\overline{u}$ and $\overline{i}$, do not change over time. These are used to express stationary properties such as the long-term interest of users. We use one-hot vectors as static embeddings of all users and items, as advised in Time-LSTM (zhu2017next) and TimeAware-LSTM (baytas2017patient). Using node2vec (grover2016node2vec) gave empirically similar results, so we use one-hot vectors.

On the other hand, each user $u$ and item $i$ is assigned a dynamic embedding, represented as $u(t)$ and $i(t)$ at time $t$, respectively. These embeddings change over time to model their time-varying behavior and properties. The sequence of dynamic embeddings of a user/item is referred to as its trajectory.

Next, we describe the update and projection operations. Then, we will describe how we predict the future interaction item embeddings and how we train the model.

3.1. Embedding update operation

In the update operation, the interaction $S = (u, i, t, f)$ between a user $u$ and item $i$ at time $t$ is used to generate their dynamic embeddings $u(t)$ and $i(t)$. Fig. 2 illustrates the update operations.

Our model uses two recurrent neural networks for updates: $RNN_U$ is shared across all users to update user embeddings, and $RNN_I$ is shared among all items to update item embeddings. The hidden states of the user RNN and the item RNN represent the user and item embeddings, respectively.

The two RNNs are mutually-recursive. When user $u$ interacts with item $i$, $RNN_U$ updates the embedding $u(t)$ by using the embedding $i(t^-)$ of item $i$ right before time $t$ as an input. $i(t^-)$ is the same as item $i$’s embedding after its previous interaction with any user. Notice that this design decision is in stark contrast with the popular use of items’ one-hot vectors to update user embeddings (beutel2018latent; wu2017recurrent; zhu2017next), which has the following two disadvantages: (a) a one-hot vector only contains information about the item’s id and not the item’s current state, and (b) the dimension of the one-hot vector becomes very large when real datasets have millions of items, making the model challenging to train and scale. Instead, we use the dynamic embedding of an item as it reflects the item’s current state, leading to more meaningful dynamic user embeddings and easier training. For the same reason, $RNN_I$ updates the dynamic embedding $i(t)$ of item $i$ by using the dynamic user embedding $u(t^-)$ (which is $u$’s embedding right before time $t$). This results in a mutually recursive dependency between the embeddings. More formally,

$$u(t) = \sigma\left(W_1^u u(t^-) + W_2^u i(t^-) + W_3^u f + W_4^u \Delta_u\right)$$
$$i(t) = \sigma\left(W_1^i i(t^-) + W_2^i u(t^-) + W_3^i f + W_4^i \Delta_i\right)$$

where $\Delta_u$ denotes the time since $u$’s previous interaction (with any item) and $\Delta_i$ is the time since item $i$’s previous interaction (with any user). $f$ is the interaction feature vector. The matrices $W_1^u, \ldots, W_4^u$ are the parameters of $RNN_U$ and the matrices $W_1^i, \ldots, W_4^i$ are the parameters of $RNN_I$. $\sigma$ is a sigmoid function to introduce non-linearity. The matrices are trained to predict the embedding of the item at $u$’s next interaction, as explained later in Section 3.3.

Variants of RNNs, such as LSTM, GRU, and T-LSTM (zhu2017next), gave experimentally similar and sometimes worse performance, so we use RNNs in our model to reduce the number of trainable parameters.
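To make the coupling concrete, the update operation can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' PyTorch implementation: the class name `CoupledUpdate`, the helper functions, the Gaussian initialization, and the choice to stack the four weight matrices of each RNN into a single matrix applied to a concatenated input are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class CoupledUpdate:
    """Hypothetical sketch of two mutually-recursive one-step RNN updates."""

    def __init__(self, n, feat_dim, seed=0):
        rng = random.Random(seed)
        # Input = [own previous embedding, other entity's previous embedding,
        #          interaction features, elapsed time]. Stacking the four
        #          weight matrices into one is an assumption made for brevity.
        in_dim = n + n + feat_dim + 1
        self.W_user = rand_matrix(n, in_dim, rng)  # shared by all users
        self.W_item = rand_matrix(n, in_dim, rng)  # shared by all items

    def update(self, u_prev, i_prev, f, delta_u, delta_i):
        # Both updates read the embeddings from right before the interaction,
        # so the user update sees the item's state and vice versa.
        u_new = [sigmoid(z) for z in matvec(self.W_user, u_prev + i_prev + f + [delta_u])]
        i_new = [sigmoid(z) for z in matvec(self.W_item, i_prev + u_prev + f + [delta_i])]
        return u_new, i_new
```

Because the item's dynamic embedding is an input to the user update, two users who take the same action on the same item at different points of that item's history generally receive different updates, which a one-hot item input could not express.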

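Symbol  Meaning
$u(t)$ and $i(t)$  Dynamic embedding of user $u$ and item $i$ at time $t$
$u(t^-)$ and $i(t^-)$  Dynamic embedding of user $u$ and item $i$ before time $t$
$\overline{u}$ and $\overline{i}$  Static embedding of user $u$ and item $i$
$\widehat{u}(t)$  Projected embedding of user $u$ at time $t$
$\widetilde{j}(t)$  Predicted item embedding
Table 2. Table of symbols used in this paper.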
Figure 3. This figure shows the key idea behind the projection operation. The predicted embedding of user $u$ is shown for different elapsed times $\Delta_1 < \Delta_2 < \Delta$. The predicted embedding drifts farther as more time elapses. When the next interaction is observed, the embedding is updated again.

3.2. Embedding projection operation

Here we explain one of the major contributions of our algorithm, the embedding projection operator, which predicts the future embedding trajectory of the user. This is done by projecting the embedding of the user at a future time. The projected embedding can then be used for downstream tasks, such as predicting items the user will interact with at a given query/prediction time in the future.

Figure 3 visualizes the main idea of projecting a user’s embedding trajectory. The operation projects the embedding of a user after some time $\Delta$ has elapsed since its last interaction at time $t$. To give an example, a short duration $\Delta_1$ after time $t$, the user $u$’s projected embedding $\widehat{u}(t+\Delta_1)$ is close to its previously observed embedding $u(t)$. As more time elapses ($\Delta_2 > \Delta_1$ and $\Delta > \Delta_2$), the projected embeddings drift farther, to $\widehat{u}(t+\Delta_2)$ and $\widehat{u}(t+\Delta)$. When the next interaction is observed at time $t+\Delta$, the user’s embedding is updated to $u(t+\Delta)$ using the update operation.

Two inputs are required for the projection operation: $u$’s embedding at time $t$ and the elapsed time $\Delta$. We follow the method suggested in LatentCross (beutel2018latent) to incorporate time into the projected embedding via Hadamard product. We do not simply concatenate the embedding and the time and pass them through a linear layer, as prior research has shown that neural networks are inefficient in modeling the interactions between concatenated inputs. Instead, we create a temporal attention vector as described below.

We first convert $\Delta$ to a time-context vector $w$ using a linear layer (represented by vector $W_p$): $w = W_p \Delta$. We initialize $W_p$ by a 0-mean Gaussian. The projected embedding is then obtained as an element-wise product of the time-context vector with the previous embedding as follows:

$$\widehat{u}(t+\Delta) = (1 + w) * u(t)$$

The vector $(1 + w)$ acts as a temporal attention vector to scale the past user embedding. When $\Delta = 0$, then $w = 0$ and the projected embedding is the same as the input embedding vector. The larger the value of $\Delta$, the more the projected embedding vector differs from the input embedding vector, so the projected embedding vector drifts over time.

We find that a linear layer works the best to project the embedding as it is equivalent to a linear transformation in the embedding space. Adding non-linearity to the transformation makes the projection operation non-linear, which we find experimentally to reduce the prediction performance. Thus, we use the linear transformation as described above.
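The projection described in this section can be sketched as follows. This is a minimal illustration, assuming the time-context vector is the elapsed time scaled by a learned weight vector and the projected embedding is the element-wise product of one plus that vector with the last observed embedding. The function names and the standard deviation used for initialization are hypothetical.

```python
import random


def init_time_weights(n, std=0.01, seed=0):
    """Zero-mean Gaussian initialization of the projection weights (std is assumed)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(n)]


def project(u_t, delta, W_p):
    """Project the embedding u_t forward by an elapsed time delta.

    Each coordinate is scaled by (1 + W_p[k] * delta), so delta = 0
    returns the input embedding unchanged.
    """
    return [(1.0 + w_k * delta) * u_k for w_k, u_k in zip(W_p, u_t)]
```

For fixed weights, the displacement from the input embedding grows linearly with `delta`, which matches the intuition that the projected embedding drifts farther as more time elapses.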

Next, we describe how we train the model to efficiently project user embeddings such that they are useful in predicting the next item with which the user will interact.

3.3. Training to predict next item embedding

Let $u$ interact with item $i$ at time $t$ and then with item $j$ at time $t+\Delta$. Right before $t+\Delta$, can we predict which item $u$ will interact with? We use this task to train the update and projection operations in JODIE. We train JODIE to make this prediction using $u$’s projected embedding $\widehat{u}(t+\Delta)$.

A crucial design decision here is that JODIE directly outputs an item embedding vector, $\widetilde{j}(t+\Delta)$, instead of an interaction probability between $u$ and item $j$. This has the advantage of reducing the computation at inference time from linear (in the number of items) to near-constant. Most existing methods (dai2016deep; wu2017recurrent; beutel2018latent; DBLP:conf/kdd/DuDTUGS16) that output an interaction probability need to do the expensive neural-network forward pass $|\mathcal{I}|$ times (once for each of the $|\mathcal{I}|$ items) to find the item with the highest probability score. In contrast, JODIE only needs to do a forward pass of the prediction layer once and output a predicted item embedding. Then the item with the closest embedding can be returned in near-constant time by using Locality Sensitive Hashing (LSH) techniques (leskovec2014mining). To maintain the LSH data structure, we update it whenever an item’s embedding is updated.
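As an illustration of the lookup step, the following is a small sketch of a hash index for Euclidean nearest-item queries. It is not the data structure used by the authors, whose LSH details are not given here. It assumes a standard random-projection scheme with a single hash table, and the class name `L2LSHIndex`, its parameters, and the fallback to a full scan when a bucket is empty are hypothetical choices.

```python
import math
import random
from collections import defaultdict


class L2LSHIndex:
    """Toy single-table LSH index for Euclidean distance (illustrative only)."""

    def __init__(self, dim, num_hashes=4, bucket_width=1.0, seed=0):
        rng = random.Random(seed)
        self.bucket_width = bucket_width
        # Each hash is floor((a . v + b) / r) with Gaussian a and uniform b.
        self.hashes = [
            ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, bucket_width))
            for _ in range(num_hashes)
        ]
        self.buckets = defaultdict(set)
        self.vectors = {}

    def _key(self, vector):
        return tuple(
            math.floor((sum(a * x for a, x in zip(direction, vector)) + offset) / self.bucket_width)
            for direction, offset in self.hashes
        )

    def upsert(self, item_id, vector):
        """Insert an item, or move it to a new bucket after its embedding changes."""
        if item_id in self.vectors:
            self.buckets[self._key(self.vectors[item_id])].discard(item_id)
        self.vectors[item_id] = list(vector)
        self.buckets[self._key(vector)].add(item_id)

    def nearest(self, query):
        """Return the closest item in the query's bucket (full scan if the bucket is empty)."""
        candidates = self.buckets.get(self._key(query)) or self.vectors.keys()
        return min(candidates, key=lambda j: math.dist(query, self.vectors[j]))
```

Calling `upsert` after every embedding update mirrors the maintenance step described above. A practical index would use several hash tables and probe neighboring buckets to reduce missed neighbors.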

Thus, we train JODIE to minimize the difference between the predicted item embedding $\widetilde{j}(t+\Delta)$ and the real item embedding $[\overline{j}, j(t+\Delta^-)]$ as follows: $\|\widetilde{j}(t+\Delta) - [\overline{j}, j(t+\Delta^-)]\|_2$. Here, $[x, y]$ represents the concatenation of vectors $x$ and $y$, and the superscript ‘-’ indicates the embedding immediately before the time.

We make this prediction using the projected user embedding $\widehat{u}(t+\Delta)$ and the embedding $i(t+\Delta^-)$ of item $i$ (the item from $u$’s previous interaction) immediately before time $t+\Delta$. The reason we include $i(t+\Delta^-)$ is two-fold: (a) $i$ may interact with other users between time $t$ and $t+\Delta$, and thus the embedding contains more recent information, and (b) users often interact with the same item consecutively (i.e., $i = j$) and including the item embedding helps to ease the prediction. We use both the static and dynamic embeddings to predict the static and dynamic embedding of the predicted item $j$. The prediction is made using a fully connected linear layer as follows:

$$\widetilde{j}(t+\Delta) = W_1 \widehat{u}(t+\Delta) + W_2 \overline{u} + W_3 i(t+\Delta^-) + W_4 \overline{i} + B$$

where the matrices $W_1, \ldots, W_4$ and the bias vector $B$ make the linear layer.

Training the model. JODIE is trained to minimize the distance between the predicted item embedding and the ground truth item’s embedding at every interaction. We calculate the total loss as follows, where $t$ denotes the time of each interaction $(u, j, t, f)$:

$$Loss = \sum_{(u, j, t, f) \in S} \|\widetilde{j}(t) - [\overline{j}, j(t^-)]\|_2 + \lambda_U \|u(t) - u(t^-)\|_2 + \lambda_I \|j(t) - j(t^-)\|_2$$

The first loss term minimizes the predicted embedding error. The last two terms are added to regularize the loss and prevent the consecutive dynamic embeddings of a user and item, respectively, from varying too much. $\lambda_U$ and $\lambda_I$ are scaling parameters to ensure the losses are in the same range. It is noteworthy that we do not use negative sampling during training, as JODIE directly outputs the embedding of the predicted item.
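The prediction layer and the per-interaction loss can be sketched together as below. This is an illustrative reading of the description above rather than the authors' code: the function names, the use of a single stacked weight matrix for the linear layer, and the default scaling values are assumptions.

```python
import math


def l2(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def linear_layer(W, B, x):
    """Fully connected linear layer: W x + B, with no non-linearity."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W, B)]


def predict_item_embedding(W, B, u_projected, u_static, i_dynamic, i_static):
    # Input is the concatenation of the projected user embedding, the user's
    # static embedding, and the previous item's dynamic and static embeddings.
    return linear_layer(W, B, u_projected + u_static + i_dynamic + i_static)


def interaction_loss(j_pred, j_static, j_dyn_before, u_before, u_after,
                     j_dyn_after, lam_u=1.0, lam_i=1.0):
    """Loss of one interaction: prediction error plus two smoothness terms."""
    target = j_static + j_dyn_before  # concatenated static and dynamic embedding
    return (l2(j_pred, target)
            + lam_u * l2(u_after, u_before)
            + lam_i * l2(j_dyn_after, j_dyn_before))
```

The total training loss is the sum of `interaction_loss` over all interactions. No negative items appear anywhere in the computation, since the target is an embedding rather than a probability.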

Extending the loss for categorical prediction. In certain prediction tasks, such as user state change prediction, additional training labels may be present for supervision. The user state change labels are binary (categorical). In those cases, we can train another prediction function $\Theta$ to predict the label using the embedding of the user after an interaction. We calculate the cross-entropy loss for categorical labels and add it to the above loss function with another scaling parameter. We deliberately do not train on the cross-entropy loss alone, to prevent overfitting.
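A minimal sketch of this combined objective for a binary label is shown below. The function names, the clipping constant, and the default weight are hypothetical. The predicted probability is assumed to come from some classifier applied to the user's embedding after the interaction.

```python
import math


def binary_cross_entropy(p, label, eps=1e-12):
    """Cross-entropy of a predicted probability p against a 0/1 label."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def combined_loss(embedding_loss, p_state_change, label, weight=1.0):
    """Embedding loss plus a scaled cross-entropy term for the state label."""
    return embedding_loss + weight * binary_cross_entropy(p_state_change, label)
```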

3.4. t-Batch: Training data batching

Here we explain the batching algorithm we propose to parallelize the training of JODIE. It is important to maintain temporal dependencies between interactions during training, such that interaction $S_r$ is processed before $S_{r+1}$.

Existing methods that use a single RNN, such as T-LSTM (zhu2017next) and RRN (wu2017recurrent), split users into different batches and process them in parallel. This is possible because these approaches use one-hot vector encodings of items as inputs and can thus be trained using the standard Back Propagation Through Time (BPTT) mechanism.

However, in JODIE, the mutually-recursive RNNs enable us to incorporate the item’s embedding to update the user embedding and vice-versa. This creates interdependencies between two users that interacted with the same item and this prevents us from simply splitting users into separate batches and processing them in parallel.

Most existing methods that also use two mutually-recursive RNNs (dai2016deep; zhang2017learning) naively process all the interactions one at a time in sequential order. However, this is not scalable to a large number of interactions as the training process is very slow. Therefore, we train JODIE using a training data batching algorithm that we call t-Batch. This leads to an order of magnitude of speed-up in JODIE compared to most existing training approaches.

Creating the training batches is challenging because it has two requirements: (1) all interactions in each batch should be processed in parallel, and (2) processing the batches in increasing order of their index should maintain the temporal ordering of the interactions and thus, it should generate the same embedding as without any batching.

To overcome these challenges, t-Batch creates each batch by selecting independent edge sets of the interaction network, i.e., two interactions in the same batch do not share any common user or item. t-Batch works iteratively in two steps: the select step and the reduce step. In the select step, a new batch is created by selecting the maximal edge set such that each edge $(u, i)$ is the lowest time-stamped edge incident on both $u$ and $i$. This trivially makes the batch an independent edge set. In the reduce step, the selected edges are removed from the network. t-Batch iterates the two steps till no edges remain in the graph. Thus, each batch is parallelizable and processing batches in order maintains the sequential dependencies.

In practice, we implement t-Batch as a sequential algorithm as follows. The algorithm assigns each interaction $S_r$ to a batch $B_k$, where $k \in [1, |S|]$. We initialize $|S|$ empty batches (the worst-case scenario, in which each batch has only one interaction). We iterate through the temporally-sorted sequence of interactions and add each interaction to a batch $B_k$. Let $maxBatch(e, r)$ be the batch with the largest index that has an interaction involving an entity $e$ till interaction $S_r$. Then, the interaction $S_{r+1}$ (say, between user $u$ and item $i$) is assigned to the batch with index $\max(1 + maxBatch(u, r), 1 + maxBatch(i, r))$. The complexity of creating the batches is $O(|S|)$, i.e., linear in the number of interactions, as each interaction is used once.

It is trivial to verify that the t-Batch algorithm satisfies the two requirements. t-Batch ensures that each user and item appears at most once in every batch and thus, each batch can be parallelized. In addition, the $r$-th and $(r+1)$-st interactions of every user and every item are assigned to batches $B_k$ and $B_l$, respectively, such that $k < l$. So, JODIE can process the batches in increasing order of their indices to ensure that the temporal ordering of the interactions is respected.

We do not predetermine the number and size of the batches because they depend on the interactions in the dataset. The number of batches can range between 1 and $|S|$. Let us illustrate these two extreme cases. When all interactions have unique users and items, then only one batch is created that has all the interactions. On the other extreme, if all interactions are associated to the same user or the same item, then $|S|$ batches are created. Therefore, we initialize $|S|$ batches and discard all trailing empty batches after assignment.
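The sequential assignment described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function name `t_batch` and the tuple layout are hypothetical, and batches are created lazily as they are needed, which is assumed to be equivalent to pre-allocating one batch per interaction and discarding the trailing empty ones.

```python
from collections import defaultdict


def t_batch(interactions):
    """Assign temporally sorted (user, item, timestamp) tuples to batches.

    Returns a list of batches, each a list of interaction indices, such that
    no user or item appears twice in a batch and each entity's interactions
    fall into strictly increasing batches.
    """
    last_user_batch = defaultdict(int)  # 0 means "not seen yet"; batches are 1-indexed
    last_item_batch = defaultdict(int)
    batches = []
    for r, (user, item, _timestamp) in enumerate(interactions):
        # One more than the latest batch that already contains the user or the item.
        k = 1 + max(last_user_batch[user], last_item_batch[item])
        if k > len(batches):
            batches.append([])
        batches[k - 1].append(r)
        last_user_batch[user] = k
        last_item_batch[item] = k
    return batches


example = [("u1", "i1", 1), ("u2", "i2", 2), ("u1", "i2", 3), ("u3", "i1", 4)]
print(t_batch(example))  # [[0, 1], [2, 3]]
```

In the example, the third interaction shares user `u1` with the first and item `i2` with the second, so it must wait for the second batch. The fourth interaction shares item `i1` with the first and joins it there.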

3.5. Differences between JODIE and DeepCoevolve

DeepCoevolve is the closest state-of-the-art algorithm to JODIE because it also trains two mutually-recursive RNNs to generate embedding trajectories. However, the key differences between JODIE and DeepCoevolve are the following: (i) JODIE uses a novel projection function to predict the future trajectory of users. Instead, DeepCoevolve maintains the same embedding of a user between two of its consecutive interactions. Predicting the trajectory enables JODIE to make more effective predictions. (ii) JODIE predicts the embedding of the next item that a user will interact with. In contrast, DeepCoevolve predicts the probability of interaction between a user and an item. During inference time, DeepCoevolve requires $|\mathcal{I}|$ forward passes through the inference layer (for $|\mathcal{I}|$ items) to recommend the item with the highest score. On the other hand, JODIE takes near-constant time. (iii) JODIE is trained with batches of interaction data, as opposed to individual interactions.

As a result, as we will see in the experiments section, JODIE significantly outperforms DeepCoevolve in terms of both prediction performance and training time. JODIE is 9.2× faster, 45% better in predicting future interactions, and 13.9% better in predicting user state change on average.

4. Experiments

In this section, we experimentally validate the effectiveness of JODIE on two tasks: future interaction prediction and user state change prediction. We conduct experiments on three datasets each and compare with six strong baselines to show the following:

  1. JODIE outperforms the baselines by at least 20% in terms of mean reciprocal rank in predicting the next item and 12% on average in predicting user state change.

  2. JODIE is 9.2× faster than DeepCoevolve and comparable in running time to the other baselines.

  3. JODIE is robust in performance to the availability of training data and the dimension of the embedding.

  4. Finally, in a case study on the MOOC dataset, we show that JODIE can predict student drop-out five interactions in advance.

We first explain the experimental setting and the baseline methods and then describe the experimental results.

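Experimental setting. We train all models by splitting the data by time to simulate the real situation. Thus, we train all models on an initial fraction of the interactions, validate on the next fraction, and test on the last remaining interactions. The exact percentages for each task are given in the corresponding experimental setting.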

For a fair comparison, we use 128 dimensions for the dynamic embedding in all algorithms and one-hot vectors for static embeddings. All algorithms are run for 50 epochs, and all reported numbers for all models are test results from the epoch with the best validation performance.
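The time-based split can be sketched as below. This is a minimal illustration assuming interactions are `(user, item, time)` tuples; the function name `chronological_split` and the integer percentage arguments are hypothetical and not taken from the released code.

```python
def chronological_split(interactions, train_pct, val_pct):
    """Split interactions by time into train, validation, and test parts.

    train_pct and val_pct are integer percentages; the test part is
    whatever remains after the training and validation parts.
    """
    ordered = sorted(interactions, key=lambda x: x[2])  # sort by timestamp
    n = len(ordered)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Ten toy interactions given in reverse time order.
data = [("u", "i", t) for t in range(10, 0, -1)]
train, val, test = chronological_split(data, 80, 10)
print(len(train), len(val), len(test))
```

Integer percentages are used here to avoid floating-point rounding at the split boundaries.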

| Method | Reddit MRR | Reddit Recall@10 | Wikipedia MRR | Wikipedia Recall@10 | LastFM MRR | LastFM Recall@10 | Min. % improvement of JODIE over method (MRR) | Min. % improvement of JODIE over method (Recall@10) |
|---|---|---|---|---|---|---|---|---|
| LSTM (zhu2017next) | 0.355 | 0.551 | 0.329 | 0.455 | 0.062 | 0.119 | 104.5% | 54.6% |
| Time-LSTM (zhu2017next) | 0.387 | 0.573 | 0.247 | 0.342 | 0.068 | 0.137 | 87.6% | 48.7% |
| RRN (wu2017recurrent) | 0.603 | 0.747 | 0.522 | 0.617 | 0.089 | 0.182 | 20.4% | 14.1% |
| LatentCross (beutel2018latent) | 0.421 | 0.588 | 0.424 | 0.481 | 0.148 | 0.227 | 31.8% | 35.2% |
| CTDNE (nguyen2018continuous) | 0.165 | 0.257 | 0.035 | 0.056 | 0.01 | 0.01 | 340.0% | 231.5% |
| DeepCoevolve (dai2016deep) | 0.171 | 0.275 | 0.515 | 0.563 | 0.019 | 0.039 | 44.8% | 46.0% |
| JODIE (proposed) | 0.726 | 0.852 | 0.746 | 0.822 | 0.195 | 0.307 | - | - |

Table 3. Future interaction prediction experiment: performance of JODIE compared with state-of-the-art algorithms, in terms of mean reciprocal rank (MRR) and recall@10. The best algorithm in each column is colored blue and the second best is light blue. The last two columns show the minimum percentage improvement of JODIE over the method across all datasets. We see that JODIE outperforms all baselines by at least 20% in MRR and 14% in recall@10.

Baselines. We compare JODIE with six state-of-the-art algorithms spanning three algorithmic categories:

  1. Deep recurrent recommender models: in this category, we compare with RRN (wu2017recurrent), LatentCross (beutel2018latent), Time-LSTM (zhu2017next), and standard LSTM. These algorithms are state-of-the-art in recommender systems and generate dynamic user embeddings. We use Time-LSTM-3 cell for Time-LSTM as it performs the best in the original paper (zhu2017next), and LSTM cells in RRN and LatentCross models. As is standard, we use the one-hot vector of items as inputs to these models.

  2. Dynamic co-evolution models: here we compare with the state-of-the-art algorithm, DeepCoevolve (dai2016deep), which has been shown to outperform other co-evolutionary point-process algorithms (trivedi2017know; wang2016coevolutionary). We use 10 negative samples per interaction for computational tractability.

  3. Temporal network embedding models: we compare JODIE with CTDNE (nguyen2018continuous) which is the state-of-the-art in generating embeddings from temporal networks. As it generates static embeddings, we generate new embeddings after each edge is added. We use uniform sampling of neighborhood as it performs the best in the original paper (nguyen2018continuous).

4.1. Experiment 1: Future interaction prediction

The prediction task here is: given all interactions till time , which item will user interact with at time (out of all items)?

We use three datasets in this experiment:
Reddit post dataset: this public dataset consists of one month of posts made by users on subreddits (pushshift). We selected the 1,000 most active subreddits as items and the 10,000 most active users. This results in 672,447 interactions. We convert the text of each post into a feature vector representing its LIWC categories (pennebaker2001linguistic).
Wikipedia edits: this public dataset is one month of edits made by editors on Wikipedia pages (wikidump). We selected the 1,000 most edited pages as items and editors who made at least 5 edits as users (a total of 8,227 users). This generates 157,474 interactions. Similar to the Reddit dataset, we convert the edit text into a LIWC-feature vector.
LastFM song listens: this public dataset has one month of who-listens-to-which-song information (lastfm). We selected all 1,000 users and the 1,000 most listened-to songs, resulting in 1,293,103 interactions. In this dataset, interactions do not have features.

We select these datasets such that they vary in terms of users’ repetitive behavior: in Wikipedia and Reddit, a user interacts with the same item consecutively in 79% and 61% of interactions, respectively, while in LastFM, this happens in only 8.6% of interactions.
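The repetition statistic can be computed along the following lines. This sketch assumes one plausible reading of the definition: an interaction counts as repetitive when the user's immediately preceding interaction was with the same item, and the rate is taken over all interactions. The function name `repetition_rate` and the tuple layout are hypothetical.

```python
def repetition_rate(interactions):
    """Fraction of interactions that repeat the user's previous item.

    Expects (user, item, time) tuples already sorted by time.
    """
    if not interactions:
        return 0.0
    last_item = {}  # most recent item seen for each user
    repeats = 0
    for user, item, _time in interactions:
        if user in last_item and last_item[user] == item:
            repeats += 1
        last_item[user] = item
    return repeats / len(interactions)


toy = [("u1", "a", 1), ("u1", "a", 2), ("u2", "b", 3), ("u1", "b", 4)]
print(repetition_rate(toy))  # one repeat out of four interactions
```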

| Method | Reddit | Wikipedia | MOOC | Mean improvement of JODIE |
|---|---|---|---|---|
| LSTM | 0.523 | 0.575 | 0.686 | 23.08% |
| Time-LSTM | 0.556 | 0.671 | 0.711 | 12.63% |
| RRN | 0.586 | 0.804 | 0.558 | 13.69% |
| LatentCross | 0.574 | 0.628 | 0.686 | 15.62% |
| DeepCoevolve | 0.577 | 0.663 | 0.671 | 13.94% |
| JODIE | 0.599 | 0.831 | 0.756 | - |

Table 4. User state change prediction: performance of JODIE compared with state-of-the-art algorithms, in terms of AUC. The best algorithm in each column is colored blue and the second best is light blue. JODIE outperforms the baselines by at least 12.63% on average.

Experimental setting. We use the first 80% of the data to train, the next 10% to validate, and the final 10% to test. We measure the performance of the algorithms in terms of the mean reciprocal rank (MRR) and recall@10: MRR is the average of the reciprocal rank, and recall@10 is the fraction of interactions in which the ground-truth item is ranked in the top 10. Higher values for both are better. For every interaction, the rank of the ground-truth item is calculated with respect to all the items in the dataset.

For JODIE, items are ranked based on their distance from the predicted item embedding. The rank of the ground truth item is calculated in this ranked list.
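The ranking and the two metrics can be sketched as follows. This is an illustration only: it assumes Euclidean distance between the predicted embedding and each item embedding, and it breaks ties optimistically (items at equal distance do not push the ground-truth item down). The helper names `rank_of_true_item` and `mrr_and_recall` are hypothetical.

```python
import math


def rank_of_true_item(predicted, item_embeddings, true_item):
    """1-based rank of true_item when items are sorted by distance to predicted."""
    true_dist = math.dist(predicted, item_embeddings[true_item])
    # Count items strictly closer to the prediction than the ground truth.
    closer = sum(
        1 for emb in item_embeddings.values()
        if math.dist(predicted, emb) < true_dist
    )
    return closer + 1


def mrr_and_recall(ranks, k=10):
    """Mean reciprocal rank and recall@k over a list of 1-based ranks."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    recall = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, recall


items = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (3.0, 0.0)}
ranks = [
    rank_of_true_item((0.9, 0.0), items, "b"),  # closest item
    rank_of_true_item((0.9, 0.0), items, "a"),  # second closest
]
print(ranks, mrr_and_recall(ranks, k=1))
```

Because each prediction is a single embedding, ranking reduces to a nearest-neighbor search over the item embeddings rather than one model forward pass per item.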

Results. Table 3 compares the results of JODIE with the six state-of-the-art methods. We observe that JODIE significantly outperforms all baselines across both metrics on all three datasets. Among the baselines, there is no clear winner: while RRN performs better on Reddit and Wikipedia, LatentCross performs better on LastFM. As CTDNE generates static embeddings, its performance is low. We calculate the percentage improvement of JODIE over a baseline as (performance of JODIE minus performance of baseline)/(performance of baseline). Across all datasets, the minimum improvement of JODIE is at least 20% in terms of MRR and 14% in terms of recall@10. Note that JODIE outperforms DeepCoevolve, the closest baseline in terms of the algorithm, by at least 44.8% in MRR across all datasets.
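As a worked check of the improvement formula, the snippet below recomputes the minimum MRR improvement of JODIE over RRN from the rounded values reported in Table 3. The function name `min_improvement` is hypothetical, and small differences from the published percentages can arise for other rows because the table entries are rounded.

```python
def min_improvement(jodie_scores, baseline_scores):
    """Minimum percentage improvement of JODIE over a baseline across datasets."""
    return min(
        100.0 * (j - b) / b
        for j, b in zip(jodie_scores, baseline_scores)
    )


# MRR on Reddit, Wikipedia, and LastFM, as listed in Table 3.
jodie_mrr = [0.726, 0.746, 0.195]
rrn_mrr = [0.603, 0.522, 0.089]

print(round(min_improvement(jodie_mrr, rrn_mrr), 1))  # 20.4, attained on Reddit
```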

Notably, we observe that JODIE performs well irrespective of how repetitive users are: the MRR is at least 20.4% higher on Wikipedia and Reddit (high-repetition datasets), and at least 31.75% higher on LastFM (low-repetition dataset). This means JODIE is able to learn to balance personal preference with users’ non-repetitive interaction behavior.

4.2. Experiment 2: User state change prediction

In this experiment, the task is to predict whether an interaction will lead to a change in the state of the user, in two use cases: predicting whether a user will be banned and predicting whether a student will drop out of a course. Until a user is banned or drops out, the label of the user is ‘0’, and their last interaction has the label ‘1’. For users who are not banned or do not drop out, the label is always ‘0’. This is a highly challenging task because less than 1% of the labels are ‘1’.
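The labeling scheme can be sketched as below, assuming time-sorted `(user, item, time)` tuples and a known set of users whose state changed (banned or dropped out). The function name `state_change_labels` and its arguments are hypothetical.

```python
def state_change_labels(interactions, changed_users):
    """Return one label per interaction.

    The last interaction of each user in changed_users gets label 1;
    every other interaction gets label 0.
    """
    last_index = {}
    for idx, (user, _item, _time) in enumerate(interactions):
        last_index[user] = idx  # overwritten until the user's final interaction
    labels = [0] * len(interactions)
    for user in changed_users:
        if user in last_index:
            labels[last_index[user]] = 1
    return labels


toy = [("u1", "a", 1), ("u2", "a", 2), ("u1", "b", 3), ("u2", "c", 4)]
print(state_change_labels(toy, {"u1"}))  # only u1's final interaction is positive
```

Since each changed user contributes exactly one positive label, the positive class is necessarily a small fraction of all interactions.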

Figure 4. Running time of JODIE and all baselines on the Reddit dataset. JODIE is 9.2× faster than DeepCoevolve and is comparable to the other baselines.
Figure 5. Robustness of JODIE: Figures (a–c) compare the mean reciprocal rank (MRR) of JODIE with baselines on interaction prediction task, by varying the training data size. Figure (d) shows the AUC of user state change prediction task by varying the training data size. We see JODIE consistently has the highest scores.

We use three datasets for this task:
Reddit bans: Reddit post dataset (from Section 4.1) with ground-truth labels of banned users from Reddit. This gives 366 true labels among 672,447 interactions (= 0.05%).
Wikipedia bans: Wikipedia edit data (from Section 4.1) with public ground-truth labels of banned users (wikidump). This results in 217 positive labels among 157,474 interactions (= 0.14%).
MOOC student drop-out: this public dataset consists of actions, e.g., viewing a video, submitting an answer, etc., done by students on a MOOC online course (kddcup). This dataset consists of 7,047 users interacting with 98 items (videos, answers, etc.), resulting in 411,749 interactions. There are 4,066 drop-out events (= 0.98%).

Experimental setting. In this experiment, we train the models on the first 60% of the interactions, validate on the next 20%, and test on the last 20%. We evaluate the models using the area under the curve (AUC) metric, a standard metric for tasks with highly imbalanced labels.

For the baselines, we train a logistic regression classifier on the training data using the dynamic user embedding as input. As always, for all models, we report the test AUC for the epoch with the highest validation AUC.

Results. Table 4 compares the performance of JODIE on the three datasets with the baseline models. We see that JODIE outperforms the baselines by at least 12% on average in predicting user state change across all datasets. JODIE outperforms RRN, the closest competitor in the ban prediction task, by at least 2.2%, while it outperforms RRN by 35.5% in the student drop-out task (0.756 versus 0.558 AUC in Table 4). Note that DeepCoevolve, which is the most similar baseline algorithmically, is outperformed by JODIE by 13.9% on average. Thus, JODIE consistently performs the best across various datasets.

4.3. Experiment 3: Runtime experiment

Here we compare the running time of JODIE with the baseline algorithms. Algorithmically, DeepCoevolve is the closest to JODIE as it also trains two mutually-recursive RNNs. The other methods train only one RNN and are therefore easily scalable.

Figure 4 shows the running time (in minutes) of one epoch on the Reddit dataset. (We ran the experiment on one NVIDIA Titan X Pascal GPU with 12 GB of RAM at 10 Gbps speed.) We find that JODIE is 9.2× faster than DeepCoevolve (its closest algorithmic competitor). At the same time, the running time of JODIE is comparable to the other baselines that only use one RNN in their model. This shows that JODIE is able to train the mutually-recursive model in time equivalent to that of non-mutually-recursive models, because of the t-Batch training data batching algorithm.

In addition, we find that JODIE without t-Batch took 43.53 minutes while JODIE with t-Batch took 5.13 minutes. Thus, t-Batch results in an 8.4× speed-up.

4.4. Experiment 4: Robustness to the proportion of training data

In this experiment, we validate the robustness of JODIE by varying the percentage of training data and comparing the performance of the algorithms in both the tasks of future interaction prediction and user state change prediction.

For next-item prediction, we vary the training data percentage from 10% to 80%. In each case, we take the 10% of interactions after the training data as validation and the following 10% of interactions as testing. This is done so that performance is compared on the same testing data size. Figures 5(a–c) show the change in mean reciprocal rank (MRR) of all the algorithms on the three datasets as the training data size is increased. We note that the performance of JODIE is stable and does not vary much across the data points. Moreover, JODIE consistently outperforms the baseline models by a significant margin (by a maximum of 33.1%).

We make similar observations in the user state change prediction task. Here, we vary the training data percentage over 20%, 40%, and 60%, and in each case take the following 20% of interactions as validation and the next 20% of interactions as the test set. Figure 5(d) shows the AUC of all the algorithms on the Wikipedia dataset. Other datasets have similar results. Again, we find that JODIE is stable and consistently outperforms the baselines, irrespective of the training data size.
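The sliding evaluation windows used in this experiment can be sketched as index ranges over the time-sorted interactions. This is a minimal illustration; the function name `robustness_windows` and its integer percentage arguments are hypothetical.

```python
def robustness_windows(n, train_pcts, val_pct, test_pct):
    """Map each training percentage to (train, val, test) index ranges.

    Each range is a half-open (start, end) pair over n time-sorted
    interactions; validation and test windows directly follow training.
    """
    windows = {}
    for p in train_pcts:
        train_end = n * p // 100
        val_end = n * (p + val_pct) // 100
        test_end = n * (p + val_pct + test_pct) // 100
        windows[p] = ((0, train_end), (train_end, val_end), (val_end, test_end))
    return windows


# Interaction prediction: 10% to 80% training, then 10% validation and 10% test.
interaction_windows = robustness_windows(1000, range(10, 90, 10), 10, 10)
# State change prediction: 20%, 40%, 60% training, then 20% validation and 20% test.
state_windows = robustness_windows(1000, (20, 40, 60), 20, 20)
print(interaction_windows[10], state_windows[60])
```

Keeping the validation and test windows at a fixed size means only the amount of training data changes between runs.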

Figure 6. Robustness to dynamic embedding size: The performance of JODIE is stable with the change in dynamic embedding size, for the task of interaction prediction on LastFM dataset. Please refer to the legend in Figure 5.

4.5. Experiment 5: Embedding size

Finally, we validate the effect of the dynamic embedding size on the predictions. To do this, we vary the dynamic embedding dimension from 32 to 256 and calculate the mean reciprocal rank for interaction prediction on the LastFM dataset. The effect on other datasets is similar. The results are shown in Figure 6. We find that the embedding dimension has little effect on the performance of JODIE, and JODIE performs the best overall. Interestingly, the improvement of JODIE is higher for smaller embedding dimensions. This is because JODIE uses both the static and the dynamic embedding for prediction, which gives it the power to learn from both parts.

5. Conclusions

In this paper, we proposed a coupled recurrent neural network model called JODIE that learns dynamic embeddings of users and items from a sequence of temporal interactions. JODIE learns to predict the future embeddings of users and items, which leads it to give better prediction performance of future user-item interactions and change in user state. We also presented a training data batching method that makes JODIE an order of magnitude faster than similar baselines.

There are several directions for future work. Learning embeddings for individual users and items is expensive, and one could learn trajectories for groups of users or items to reduce the number of parameters. Another direction is characterizing the trajectories to cluster similar entities. Finally, an innovative direction would be to design new items based on missing predicted items that many users are likely to interact with.


Acknowledgements. JL is a Chan Zuckerberg Biohub investigator. This research has been supported in part by NSF OAC-1835598, DARPA MCS, DARPA ASED, ARO MURI, Amazon, Boeing, Docomo, Hitachi, JD, Siemens, and Stanford Data Science Initiative. We thank Sagar Honnungar for help with the initial phase of the project.


Appendix A

Here we describe some technical details of the model.

The code and datasets are available on the project website:

We coded all the models and the baselines in PyTorch. Table 6 lists the dataset details and Table 5 lists the model parameters.

| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Learning rate | 1e-3 |
| Model weight decay | 1e-5 |
| Dynamic embedding size | 128 |
| Number of epochs | 50 |
| Future interaction prediction experiment | |
| Training data percent | 80% |
| Validation data percent | 10% |
| Test data percent | 10% |
| User state change experiment | |
| Training data percent | 60% |
| Validation data percent | 20% |
| Test data percent | 20% |

Table 5. Model parameters.

| Data | Users | Items | Interactions | State Changes | Action Repetition |
|---|---|---|---|---|---|
| Reddit | 10,000 | 984 | 672,447 | 366 | 79% |
| Wikipedia | 8,227 | 1,000 | 157,474 | 217 | 61% |
| LastFM | 980 | 1,000 | 1,293,103 | - | 8.6% |
| MOOC | 7,047 | 97 | 411,749 | 4,066 | - |

Table 6. Dataset information.