Continuous-Time Sequential Recommendation with Temporal Graph Collaborative Transformer

08/14/2021 ∙ by Ziwei Fan, et al. ∙ IFM Lab, University of Illinois at Chicago, Pinterest, Inc., Fudan University

In order to model the evolution of user preferences, we should learn user/item embeddings based on time-ordered item purchasing sequences, which is defined as the Sequential Recommendation (SR) problem. Existing methods leverage sequential patterns to model item transitions. However, most of them ignore crucial temporal collaborative signals, which are latent in evolving user-item interactions and coexist with sequential patterns. Therefore, we propose to unify sequential patterns and temporal collaborative signals to improve the quality of recommendation, which is rather challenging. Firstly, it is hard to simultaneously encode sequential patterns and collaborative signals. Secondly, it is non-trivial to express the temporal effects of collaborative signals. Hence, we design a new framework, the Temporal Graph Sequential Recommender (TGSRec), upon our defined continuous-time bipartite graph. We propose a novel Temporal Collaborative Transformer (TCT) layer in TGSRec, which advances the self-attention mechanism by adopting a novel collaborative attention. The TCT layer can simultaneously capture collaborative signals from both users and items, while also considering temporal dynamics inside sequential patterns. We propagate the information learned from the TCT layer over the temporal graph to unify sequential patterns and temporal collaborative signals. Empirical results on five datasets show that TGSRec significantly outperforms other baselines, by an average of up to 22.5%.


1. Introduction

Recommender systems have become essential in providing personalized information filtering services in a variety of applications (liang2016modeling; wang19neural; Liu2019JSCNJS; wang2021dkg; peng2020m2). They learn user and item embeddings from historical records of user-item interactions (he2017neural; rendle2009bpr). In order to model the dynamics of user-item interactions, current research works (rendle2010factorizing; wu2017recurrent; hidasi2015session; tang2018personalized; fan2021Modeling) leverage historical time-ordered item purchasing sequences to predict future items for users, referred to as the sequential recommendation (SR) problem (hidasi2015session; kang2018self). One of the fundamental assumptions of SR is that users' interests change smoothly (hidasi2015session; wu2017recurrent; tang2018personalized; kang2018self). Thus, we can train a model to infer the items more likely to appear in the future sequence. For example, with the recent development of the Transformer (vaswani2017attention), current endeavors design a series of self-attention SR models to predict future item sequences (kang2018self; sun2019bert4rec; ssept20wu). A self-attention model infers the sequence embedding at each position by assigning an attention weight to each historical item and aggregating these items. The attention weights reveal the impacts of previous items on the current state at the corresponding time point.

Despite their effectiveness, existing works only leverage sequential patterns to model item transitions within sequences, which is still insufficient to yield satisfactory results. The reason is that they ignore crucial temporal collaborative signals, which are latent in evolving user-item interactions and coexist with sequential patterns. To be specific, we present the effects of temporal collaborative signals in Figure 1. The target is to recommend the next item to the target user at the current timestamp. By only considering sequential patterns, one candidate item is recommended since it follows the target user's last item more often in the other users' sequences than the alternative does. However, if we also take collaborative signals into account, we would recommend the alternative, because the target user and one of the other users interacted with the same items at close timestamps, which indicates their high similarity. Hence, that similar user's sequential patterns have more impact on the target user. This motivates us to unify sequential patterns and temporal collaborative signals.

Figure 1. A toy example of temporal collaborative signals. Given the items that each user liked at past timestamps, the target is to recommend the next item to the target user at the current timestamp.

However, incorporating temporal collaborative signals in SR is rather challenging. The first challenge is that it is hard to simultaneously encode collaborative signals and sequential patterns. Current models capture sequential patterns based on the transitions of items within sequences (hidasi2015session; kang2018self; ma2020disentangled), thus lacking a mechanism to model collaborative signals across sequences. JODIE (kumar2019predicting) and DGCF (li2020dynamic) employ LSTMs to model the dynamics and interactions of user and item embeddings, but they cannot learn the impacts of all historical items and are thus unable to encode sequences. SASRec (kang2018self) proposes a self-attention mechanism to encode item sequences, but the effects of users, i.e., collaborative signals, are omitted. SSE-PT (ssept20wu) implicitly models collaborative signals by directly adding the same user embedding into the sequence encoding. However, it fails to model the interactions between users and items, and is thus unable to explicitly express collaborative signals.

The second challenge is that it is hard to express the temporal effects of collaborative signals. In other words, it remains unclear how to measure the impacts of those signals from a temporal perspective. For example, in Figure 1, the two similar users interact with the same items, but at different timestamps. Since there is a lag, it is problematic to ignore the time gap and assume the interactions contribute equally. We should use temporal information to infer the importance of those collaborative signals to the recommendation. Existing works (hidasi2015session; kang2018self; sun2019bert4rec; li2020dynamic) assume that items appear discretely with equal time intervals. Thus, they only focus on the orders/positions of items in the sequence, which limits their capacity in expressing temporal information. Some recent works (li2020time; ye2020time) also notice the importance of time spans, but their models either fail to capture time differences between historical interactions or are unable to generalize to unseen future timestamps or time differences, and are thus still far from revealing the actual temporal effects of collaborative signals.

Figure 2. The associated CTBG of Figure 1 and the inference of temporal embeddings at a given timestamp: (a) example of the CTBG; (b) temporal inference.

Current transformer-based models (kang2018self; sun2019bert4rec) adopt the self-attention mechanism, which takes its query, key, and value inputs from item embeddings and employs dot products to learn their correlation scores. The limitation is that self-attention can only capture item-item relationships in sequences. Additionally, these models have no module to capture temporal correlations of items. To this end, we propose a new model, the Temporal Graph Sequential Recommender (TGSRec). It consists of two novel components: (1) the Temporal Collaborative Transformer (TCT) layer and (2) graph information propagation.

The first component advances current transformer-based models, as it can explicitly model collaborative signals in sequences and express temporal correlations of items. To be more specific, the TCT layer adopts collaborative attention over user-item interactions, where the query input to the collaborative attention comes from the target node (user/item), while the key and value inputs come from connected neighbors. As such, the TCT layer learns the importance of those interactions, thus well characterizing the collaborative signals. Moreover, the TCT layer fuses temporal information into the collaborative attention mechanism, which explicitly expresses the temporal effects of those interactions. Altogether, the TCT layer captures temporal collaborative signals.

The second module is devised upon our proposed Continuous-Time Bipartite Graph (CTBG). The CTBG consists of user/item nodes and interaction edges with timestamps, as shown in Figure 2(a). Given the timestamps, the neighboring items of users preserve sequential patterns. We propagate the temporal collaborative information learned around each node to its surrounding neighbors over the CTBG, which unifies sequential patterns with temporal collaborative signals.

In this work, we propose to use temporal embeddings of nodes for recommendation, which are dynamic and inferred at specified timestamps. For example, at a given time, we infer the temporal user embedding by aggregating the context. We illustrate the temporal inference of a user and an item in Figure 2(b). The temporal embeddings are inferred by our proposed TCT layer, which uses temporal information to discriminate the impacts of historical interactions and infers temporal node embeddings. The contributions of this paper are as follows:
  • Graph Sequential Recommendation: We connect the SR problem with graph embedding methods, focusing on unifying sequential patterns and temporal collaborative signals.
  • Temporal Collaborative Transformer: We propose a novel temporal collaborative attention mechanism to infer temporal node embeddings, which jointly models collaborative signals and temporal effects. This overcomes the inadequacy of the traditional self-attention mechanism in capturing temporal effects and user-item collaborative signals.
  • Extensive Experiments: We conduct comparison experiments on five real-world datasets. Comprehensive experiments demonstrate the state-of-the-art performance of TGSRec and its effectiveness in modeling temporal collaborative signals.

2. Related Work

In this section, we review related work on sequential recommendation (SR), temporal information, and graph-based recommender systems.

2.1. Sequential Recommendation

SR predicts the future items in a user's shopping sequence by mining sequential patterns. An initial solution to the SR problem is to build a Recurrent Neural Network (RNN) (hidasi2015session; wu2017recurrent; yu2016dynamic; 10.1145/3331184.3331329). GRU4Rec (hidasi2015session) predicts the next item in a session by employing GRU modules. Later, a Hierarchical RNN (quadrana2017personalizing) is proposed to enhance the RNN model with personalization information. Additionally, LSTMs (hochreiter1997long; wu2017recurrent) can be applied to explore both long-term and short-term sequential patterns. Moreover, in order to capture the intent of users in local sub-sequences, NARM (li2017neural) combines the RNN model with attention weights. The major drawback of RNN models is that they can only generate a single hidden vector, which limits their power to encode sequences (chen2018sequential).

Recently, owing to the success of self-attention models (vaswani2017attention; devlin2018bert; liu2021enriching; zhang2021pretrained) in NLP tasks, a series of attention-based SR models have been proposed (kang2018self; sun2019bert4rec; ma2020disentangled; wu2020deja; ji2020hybrid; liu2021augmenting; peng2021ham). SASRec (kang2018self) applies the transformer layer to assign weights to items in the sequence. Later, inspired by the BERT (devlin2018bert) model, BERT4Rec (sun2019bert4rec) is proposed with a bidirectional transformer layer. (ma2020disentangled) introduces a sequence-to-sequence training procedure for SR. SSE-PT (ssept20wu) designs a personalized transformer to improve SR performance. ASReP (liu2021augmenting) augments short sequences to alleviate the cold-start issue in the transformer. TiSASRec (li2020time) enhances SASRec with the time-interval information found in the training data. However, these models only focus on item transitions within sequences; they are unable to unify the important temporal collaborative signals with sequential patterns and do not generalize to unseen timestamps.

2.2. Temporal Information

The previously mentioned SR works are specifically designed to capture sequential patterns, while ignoring important temporal information (koren2009collaborative; xiong2010temporal; xiang2010temporal; kumar2019predicting; li2020time). In practice, the context of users and items changes over time, which is crucial for modeling temporal dynamics in SR. TimeSVD++ (koren2009collaborative) is a representative work that models temporal information in a collaborative filtering (CF) method; it simply treats the bias as a function over time. BPTF (xiong2010temporal) extends matrix factorization to tensor factorization and uses time as the third dimension. MS-IPF (xiang2010temporal) defines a temporal graph, on which it runs the PageRank algorithm for recommendation. Recently, CTDNE (nguyen2018continuous) applies temporal random walks over its defined continuous-time dynamic network. TGAT (xu2020inductive) introduces temporal attention for learning dynamic graph embeddings. JODIE (kumar2019predicting) develops user and item RNNs to update user and item embeddings. Regarding the SR problem, a few recent works (li2020time; ye2020time) also notice the importance of temporal information. CTA (wu2020deja), MTAM (ji2020hybrid), and TiSASRec (li2020time) all consider time intervals between successive items in sequences. TASER (ye2020time) encodes both absolute and relative time as vectors, which are fed into attention models to complete the SR task. However, these models are not able to unify temporal collaborative signals with sequential patterns.

2.3. Graph-based Recommendation

Because we solve the SR problem based on the graph structure (zhang2017learning; zhang2019ige), we also review graph-based recommender system models (nguyen2018continuous; wang19neural; he2020lightgcn; Liu2019JSCNJS; liu2020basconv; 9377917), especially those based on Graph Neural Network (GNN) methods (kipf17semi; wang19neural; he2020lightgcn; berg2017gcmc). Compared with directly learning from sequences, graph-based models can also capture structural information (nguyen2018continuous; berg2017gcmc). Both NGCF (wang19neural) and LightGCN (he2020lightgcn) argue that graph-based models are able to effectively model collaborative signals, which is crucial in learning user/item embeddings. The successes of GNNs in recommender systems (wang2019kgat; he2020lightgcn; wang19neural; berg2017gcmc) provide simple yet effective methods for learning user/item embeddings from graphs. GNN models learn the embeddings by aggregating neighbors (wang19neural; he2020lightgcn); therefore, it is easy to stack multiple layers to learn both first-order and high-order collaborative signals (wang19neural; he2020lightgcn). CTDNE (nguyen2018continuous) defines a temporal graph to learn dynamic embeddings of nodes. TGAT (xu2020inductive) learns dynamic graph embeddings based on the graph attention model. BasConv (liu2020basconv) characterizes heterogeneous graphs to learn user/item embeddings. These models argue that graphs are powerful in modeling both structural and temporal information. However, few works investigate the possibility of solving SR problems based on graphs. SR-GNN (wu2019session) learns embeddings of session graphs by using a GNN to aggregate item embeddings, but fails to model temporal collaborative signals.

3. Definitions and Preliminaries

In this section, we introduce some definitions and preliminaries. Different from using users' interaction sequences as inputs in SR, we introduce the Continuous-Time Bipartite Graph (CTBG) to represent all temporal interactions. Each edge in this graph has a timestamp as its attribute. The directly connected neighbors of every user/item node in this graph preserve the sequential order via the timestamps on edges. The formal definition of the CTBG is given in the following:

Definition 3.1 (Continuous-Time Bipartite Graph).

A continuous-time bipartite graph with nodes and edges for recommendation is defined as $\mathcal{G} = \{\mathcal{U}, \mathcal{I}, \mathcal{E}_T\}$, where $\mathcal{U}$ and $\mathcal{I}$ are two disjoint node sets of users and items, respectively. Every edge $e \in \mathcal{E}_T$ is denoted as a tuple $e = (u, i, t)$, where $u \in \mathcal{U}$, $i \in \mathcal{I}$, and $t \in \mathbb{R}^{+}$ is the edge attribute. Each triplet $(u, i, t)$ denotes the interaction of user $u$ with item $i$ at timestamp $t$.

This paper focuses on the SR problem with continuous timestamps. An example of the CTBG is presented in Figure 2(a). Let $\mathcal{I}_u(t)$ denote the set of items that user $u$ interacted with before timestamp $t$, and $\mathcal{I} \setminus \mathcal{I}_u(t)$ denote the remaining items. We define the continuous-time sequential recommendation problem studied in this paper as follows:
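To make the definition concrete, below is a minimal sketch of a CTBG as a timestamped edge list; the class and method names (`CTBG`, `items_before`) and the toy data are our illustration, not from the paper.

```python
from collections import defaultdict

class CTBG:
    """Minimal continuous-time bipartite graph: a set of (user, item, t) edges."""
    def __init__(self, edges):
        self.edges = sorted(edges, key=lambda e: e[2])  # chronological order
        self.user_hist = defaultdict(list)              # user -> [(item, t), ...]
        for u, i, t in self.edges:
            self.user_hist[u].append((i, t))

    def items_before(self, user, t):
        """I_u(t): items the user interacted with strictly before timestamp t."""
        return {i for (i, s) in self.user_hist[user] if s < t}

# Timestamps are arbitrary real values, so the graph is continuous in time.
g = CTBG([("u1", "i1", 3.0), ("u1", "i2", 7.5), ("u2", "i1", 4.2)])
assert g.items_before("u1", 5.0) == {"i1"}
```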

Definition 3.2 (Continuous-Time Recommendation).

At a specific timestamp $t$, given the user set $\mathcal{U}$, the item set $\mathcal{I}$, and the associated CTBG, the continuous-time recommendation for user $u$ is to generate a ranking list of items from $\mathcal{I} \setminus \mathcal{I}_u(t)$, where the items that $u$ is interested in should be ranked at the top of the list.

The SR problem is then equivalent to making continuous-time recommendations on a set of future timestamps for each user $u \in \mathcal{U}$:

Definition 3.3 (Continuous-Time Sequential Recommendation).

For a specific user $u$, given a set of future timestamps $\{t_1, t_2, \dots\}$, the continuous-time sequential recommendation for this user is to make a continuous-time recommendation for every timestamp in the set.

This is a generalized definition compared with other works (kang2018self; ma2020disentangled): we explicitly consider timestamps, while others only care about orders/positions. Therefore, differing from existing works that use next-item prediction to evaluate sequential recommendation, future timestamps must be given to make a prediction. If timestamps are position numbers in sequences, the studied problem reduces to the definition that uses only order/position information. Note that a timestamp can be any real value, thus being continuous.

4. Proposed Model

Figure 3. The framework of TGSRec. The query is a user node, whose final temporal embedding at the query time is output by the last TCT layer. The TCT layer samples the query's neighboring nodes and edges. Timestamps on edges are encoded as vectors by the mapping function $\Phi$. Node embeddings for the first TCT layer are long-term embeddings; node embeddings for the other TCT layers (e.g., layer 2) are propagated from the previous TCT layer and are thus temporal node embeddings.

In this section, we present the TGSRec model, which unifies sequential patterns and temporal collaborative signals. The framework of TGSRec is presented in Figure 3. There are three major components: 1) the embedding layer, which encodes nodes and timestamps in a consistent way to connect the SR problem with graph embedding methods; 2) the Temporal Collaborative Transformer (TCT) layer, which employs a novel temporal collaborative attention mechanism to discriminate the temporal impacts of neighbors and aggregates both node and time embeddings to infer temporal node embeddings; and 3) the prediction layer, which utilizes the output embeddings from the final TCT layer to calculate recommendation scores.

4.1. Embedding Layer

We encode two types of embeddings in this paper, one being the long-term embeddings of nodes, and the other being the continuous-time embeddings of timestamps on edges.

4.1.1. Long-Term User/Item Embeddings

Long-term embeddings of users and items are necessary (devooght2017long) for representing long-term collaborative signals. In the CTBG, they function as node features and are optimized to model the holistic structural information. A user (item) node is parameterized by a vector $\mathbf{e} \in \mathbb{R}^{d}$. Since we learn embeddings for all nodes in the CTBG, we retrieve the embedding of a node by indexing an embedding table $\mathbf{E} \in \mathbb{R}^{(|\mathcal{U}| + |\mathcal{I}|) \times d}$, where $d$ is the embedding dimension. Note that the embedding table serves as the starting state for the inference of temporal user/item embeddings, and $\mathbf{E}$ is optimized during training.

4.1.2. Continuous-Time Embedding

The continuous-time encoding (ye2020time; xu2019self) behaves as a function $\Phi: T \mapsto \mathbb{R}^{d_T}$ that maps scalar timestamps into vectors, where $d_T$ is the dimension of the time embedding. Based on previous SR models (ye2020time; li2020time; wu2020deja), the time span plays a vital role in expressing temporal effects and uncovering sequential patterns. The time encoding function embeds timestamps into vectors so as to represent the time span as the dot product of the corresponding encoded time embeddings. Therefore, we define the temporal effect as a function of the time span in continuous time space: given a pair of interactions $(u, i, t_1)$ and $(u, j, t_2)$ of the same user, the temporal effect is defined as a function $\psi(t_1, t_2)$, which is expressed as a kernel value of the time embeddings of $t_1$ and $t_2$:

(1)    $\psi(t_1, t_2) = \mathcal{K}(t_1, t_2) = \Phi(t_1) \cdot \Phi(t_2),$

where $\mathcal{K}$ is the temporal kernel and $\cdot$ denotes the dot product operation. The temporal effect measures the temporal correlation between two timestamps. Moreover, the time encoding function should generalize to any unseen timestamp, so that any time span not found in the training data can still be inferred from the encoded time embeddings. Unlike modeling the absolute time difference as in (li2020time), representing temporal effects as a kernel generalizes to any timestamp because it models the time representations directly. Therefore, the temporal effect of any pair of timestamps can be inductively inferred by the dot product of their time representations. Eq. (1) can be achieved by a continuous and translation-invariant kernel based on Bochner's Theorem (loomis2013introduction). By explicitly representing the temporal features, the time embedding is:

(2)    $\Phi(t) = \sqrt{\tfrac{1}{d_T}}\left[\cos(\omega_1 t), \sin(\omega_1 t), \dots, \cos(\omega_{d_T} t), \sin(\omega_{d_T} t)\right]^{\top},$

where $\omega = [\omega_1, \dots, \omega_{d_T}]$ are learnable parameters and $d_T$ is the dimension.
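The encoding in Eq. (2) is straightforward to implement. Below is a minimal PyTorch sketch, assuming randomly initialized learnable frequencies; the output layout (a cos/sin pair per frequency) and all names are our choices, and the released implementation may differ.

```python
import torch
import torch.nn as nn

class TimeEncoder(nn.Module):
    """Bochner-style continuous-time encoding, a sketch of Eq. (2)."""
    def __init__(self, d_time: int):
        super().__init__()
        self.d_time = d_time
        self.omega = nn.Parameter(torch.randn(d_time))  # learnable frequencies

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: [...] scalar timestamps -> [..., 2 * d_time] time embeddings.
        phase = t.unsqueeze(-1) * self.omega
        scale = self.d_time ** -0.5
        return scale * torch.cat([torch.cos(phase), torch.sin(phase)], dim=-1)

phi = TimeEncoder(d_time=4)
t1, t2 = torch.tensor(10.0), torch.tensor(7.0)
# Eq. (1): the dot product depends only on the span t1 - t2, so it acts as a
# translation-invariant temporal kernel and extends to unseen timestamps.
psi = torch.dot(phi(t1), phi(t2))
```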

4.2. Temporal Collaborative Transformer

Next, we present the novel TCT layer of TGSRec. A TCT layer has two strengths: (1) it constructs information from both user/item embeddings and temporal embeddings, which explicitly characterizes the temporal effects of correlations; and (2) it uses a collaborative attention module, which advances the existing self-attention mechanism by modeling the importance of user-item interactions and is thus able to explicitly capture collaborative signals. To achieve this, we first present the information construction and aggregation from the user perspective, then introduce the novel collaborative attention mechanism to infer the importance of interactions, and finally demonstrate how to generalize to items.

4.2.1. Information Construction

We construct the input information of each TCT layer as the combination of long-term node embeddings and time embeddings, which allows us to unify temporal information and collaborative signals. In particular, the query input information at the $l$-th layer for user $u$ at time $t$ is:

(3)    $\mathbf{h}_{u}^{(l-1)}(t) = \mathbf{e}_{u}^{(l-1)}(t)\,\|\,\Phi(t),$

where $\mathbf{h}_{u}^{(l-1)}(t) \in \mathbb{R}^{d + d_T}$ is the information for $u$ at $t$, $\mathbf{e}_{u}^{(l-1)}(t)$ is the temporal embedding of $u$, and $\Phi(t)$ denotes the time vector of $t$. $\|$ denotes the concatenation operation. Other operations, such as summation, are possible; however, we use concatenation for simplicity. It also provides an intuitive interpretation of the attention, as shown in Eq. (7). Note that when $l = 1$, i.e., at the first TCT layer, the temporal embedding is $\mathbf{e}_{u}^{(0)}(t) = \mathbf{e}_{u}$, the long-term user embedding. When $l > 1$, the temporal embedding is generated from the previous TCT layer.

In addition to the query node itself, we also propagate temporal collaborative information from its neighbors. We randomly sample $S$ interactions of $u$ before time $t$ as the temporal neighborhood $\mathcal{N}_u(t) = \{(i, t_k)\}$. The input information at the $l$-th layer for each pair $(i, t_k)$ is:

(4)    $\mathbf{h}_{i}^{(l-1)}(t_k) = \mathbf{e}_{i}^{(l-1)}(t_k)\,\|\,\Phi(t_k),$

where $\mathbf{h}_{i}^{(l-1)}(t_k)$ is the information for item $i$ at $t_k$ and $\mathbf{e}_{i}^{(l-1)}(t_k)$ denotes the temporal embedding of $i$ at $t_k$. Again, note that when $l = 1$, $\mathbf{e}_{i}^{(0)}(t_k) = \mathbf{e}_{i}$, i.e., the long-term item embedding. When $l > 1$, the temporal embedding is the output of the previous TCT layer.
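For illustration, the construction in Eq. (3) and Eq. (4) amounts to concatenating a (temporal) node embedding with the encoded time vector; a one-function sketch, reusing the `TimeEncoder` sketch from Section 4.1.2 (the helper name `build_info` is ours):

```python
import torch  # assumes the TimeEncoder sketch from Section 4.1.2 is in scope

def build_info(node_emb: torch.Tensor, t: torch.Tensor, phi) -> torch.Tensor:
    """Eq. (3)/(4): concatenate a (temporal) node embedding with its time vector."""
    return torch.cat([node_emb, phi(t)], dim=-1)  # shape [d + d_T]
```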

4.2.2. Information Propagation

After constructing the information, we propagate the information of the sampled neighbors to infer the temporal embeddings. Since the neighbors are associated with timestamps before $t$, this step unifies sequential patterns with temporal collaborative signals. We compute a linear combination of the information from all sampled interactions as:

(5)    $\mathbf{e}_{\mathcal{N}_u}^{(l)}(t) = \sum_{(i, t_k) \in \mathcal{N}_u(t)} \pi_{t}^{u}(i, t_k)\, \mathbf{W}_v \mathbf{h}_{i}^{(l-1)}(t_k),$

where $\pi_{t}^{u}(i, t_k)$ (which also carries a superscript for the layer number $l$, omitted for simplicity) denotes the importance of an interaction, and $\mathbf{W}_v$ is the linear transformation matrix. $\pi_{t}^{u}(i, t_k)$ represents the impact of a historical interaction on the temporal inference of $u$ at time $t$, and is calculated by the temporal collaborative attention.

4.2.3. Temporal Collaborative Attention

We adopt a novel temporal collaborative attention mechanism to measure the weights $\pi_{t}^{u}(i, t_k)$, which considers both the neighboring interactions and the temporal information on edges. Both factors contribute to the importance of historical interactions; thus, this mechanism captures temporal collaborative signals better than a self-attention mechanism that only models item-item correlations. The (unnormalized) attention weight is formulated as follows:

(6)    $\hat{\pi}_{t}^{u}(i, t_k) = \frac{\left(\mathbf{W}_q \mathbf{h}_{u}^{(l-1)}(t)\right)^{\top}\left(\mathbf{W}_k \mathbf{h}_{i}^{(l-1)}(t_k)\right)}{\sqrt{d + d_T}},$

where $\mathbf{W}_q$ and $\mathbf{W}_k$ are both linear transformation matrices, and the factor $\sqrt{d + d_T}$ protects the dot product from growing large with high dimensions. We adopt dot-product attention because, if we ignore the transformation matrices and the scaling factor, then based on Eq. (3) and Eq. (4) the right-hand side of Eq. (6) can be rewritten as:

(7)    $\mathbf{e}_{u}^{(l-1)}(t) \cdot \mathbf{e}_{i}^{(l-1)}(t_k) + \Phi(t) \cdot \Phi(t_k),$

where the first term denotes the user-item collaborative signal, and the second term models the temporal effect according to Eq. (1). With more stacked layers, collaborative signals and temporal effects become entangled and tightly connected. Hence, the dot-product attention can characterize the impacts of temporal collaborative signals.

Hereafter, we normalize the attention weights across all sampled interactions by employing a softmax function:

(8)    $\pi_{t}^{u}(i, t_k) = \frac{\exp\left(\hat{\pi}_{t}^{u}(i, t_k)\right)}{\sum_{(i', t') \in \mathcal{N}_u(t)} \exp\left(\hat{\pi}_{t}^{u}(i', t')\right)}.$

In practice, the computation is implemented by packing the information of all sampled interactions. To be more specific, we stack the information (Eq. (4)) of all $S$ sampled interactions into a matrix $\mathbf{H} \in \mathbb{R}^{S \times (d + d_T)}$, as illustrated in Figure 3 (in Figure 3 an embedding is drawn as a row vector, while in our notation it is a column vector). We then obtain $\mathbf{K} = \mathbf{H}\mathbf{W}_k^{\top}$ and $\mathbf{V} = \mathbf{H}\mathbf{W}_v^{\top}$ as the key and value inputs, and $\mathbf{q} = \mathbf{W}_q \mathbf{h}_{u}^{(l-1)}(t)$ as the query input for the temporal collaborative attention module, illustrated as green blocks in Figure 3. For simplicity and without ambiguity, we omit the superscripts and time, and combine Eq. (6) and Eq. (8). Then, we can rewrite Eq. (5) as:

(9)    $\mathbf{e}_{\mathcal{N}_u}^{(l)}(t) = \mathrm{softmax}\!\left(\frac{\mathbf{q}^{\top}\mathbf{K}^{\top}}{\sqrt{d + d_T}}\right)\mathbf{V},$

which is the form of dot-product attention in the Transformer (vaswani2017attention). Therefore, we can safely apply the multi-head attention operation and concatenate the output of each head as the information for aggregation, as presented in Figure 3. Note that our attention is not a self-attention but a temporal collaborative attention, which jointly models user-item interactions and temporal information.
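Putting Eqs. (5)-(9) together, a single-head sketch of the temporal collaborative attention might look as follows; the class name, the single query vector, and the omission of multi-head logic are our simplifications, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalCollaborativeAttention(nn.Module):
    """Single-head sketch of Eqs. (5)-(9): query = target node info (Eq. (3)),
    keys/values = stacked info of S sampled neighbor interactions (Eq. (4))."""
    def __init__(self, dim: int):
        super().__init__()
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5  # the 1/sqrt(d + d_T) factor in Eq. (6)

    def forward(self, query_info: torch.Tensor, neighbor_info: torch.Tensor):
        # query_info: [dim]; neighbor_info: [S, dim]
        q = self.W_q(query_info)                       # [dim]
        K = self.W_k(neighbor_info)                    # [S, dim]
        V = self.W_v(neighbor_info)                    # [S, dim]
        attn = F.softmax((K @ q) * self.scale, dim=0)  # Eqs. (6) + (8)
        return attn @ V                                # Eq. (5)/(9): weighted sum

# Toy usage: d + d_T = 12, S = 5 sampled neighbor interactions.
layer = TemporalCollaborativeAttention(dim=12)
summary = layer(torch.randn(12), torch.randn(5, 12))  # -> [12] neighbor summary
```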

4.2.4. Information Aggregation

To output the temporal node embedding, the final step of a TCT layer is to aggregate the query information in Eq. (3) and the neighbor information in Eq. (5). We concatenate them and send the result to a feed-forward neural network (FFN):

(10)    $\mathbf{e}_{u}^{(l)}(t) = \mathrm{FFN}\!\left(\mathbf{e}_{\mathcal{N}_u}^{(l)}(t)\,\|\,\mathbf{h}_{u}^{(l-1)}(t)\right),$

where $\mathbf{e}_{u}^{(l)}(t)$ is the temporal embedding of $u$ at $t$ on the $l$-th layer, and the FFN consists of two linear transformation layers with a ReLU activation in between (vaswani2017attention). The output temporal embedding can either be sent to the next layer or used as the final temporal node embedding for prediction.
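A matching sketch of the aggregation in Eq. (10), under the same assumptions (names ours; the FFN width is a free choice):

```python
import torch
import torch.nn as nn

class TCTAggregate(nn.Module):
    """Sketch of Eq. (10): an FFN over concatenated neighbor and query information."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        # Two linear layers with a ReLU in between, as the paper describes.
        self.ffn = nn.Sequential(
            nn.Linear(2 * d_in, d_out),
            nn.ReLU(),
            nn.Linear(d_out, d_out),
        )

    def forward(self, neighbor_summary: torch.Tensor, query_info: torch.Tensor):
        # neighbor_summary: Eq. (5) output; query_info: Eq. (3) output; both [d_in].
        return self.ffn(torch.cat([neighbor_summary, query_info], dim=-1))
```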

4.2.5. Generalization to Items

Though we have only presented the TCT layer from the user query perspective, the process is analogous when the query is an item at a specific time. We only need to replace the user query information with the item query information, and change the neighbor information in Eq. (4) and Eq. (5) accordingly to user-time pairs. Then, we can infer the temporal embedding of item $i$ at time $t$ as $\mathbf{e}_{i}^{(l)}(t)$, which is sent to the next layer.

4.3. Model Prediction

The TGSRec model consists of $L$ TCT layers. For each test triplet $(u, i, t)$, it yields temporal embeddings for both $u$ and $i$ at $t$ on the last TCT layer, denoted as $\mathbf{e}_{u}^{(L)}(t)$ and $\mathbf{e}_{i}^{(L)}(t)$, respectively. The prediction score is then:

(11)    $r(u, i, t) = \mathbf{e}_{u}^{(L)}(t) \cdot \mathbf{e}_{i}^{(L)}(t),$

where $r(u, i, t)$ denotes the score for recommending item $i$ to user $u$ at time $t$. With the generalized continuous-time embeddings and the proposed TCT layers, we can infer user/item embeddings at any timestamp, thus making multi-step recommendation feasible, while existing works only predict the next item. Recall that, based on Definition 3.3, we recommend to each user a ranking list of items at the given timestamp. Therefore, we use Eq. (11) to calculate the scores of all candidate items and sort them by score.
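Given the final-layer temporal embeddings, ranking candidates at a timestamp reduces to sorting dot-product scores; a small sketch (function name ours):

```python
import torch

def rank_items(user_emb: torch.Tensor, item_embs: torch.Tensor, top_n: int = 10):
    """Eq. (11): score candidates by dot product and return the top-N indices."""
    scores = item_embs @ user_emb  # [num_candidates] scores r(u, i, t)
    return torch.topk(scores, k=min(top_n, scores.numel())).indices
```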

4.4. Model Optimization

To learn the model parameters, we use the pairwise BPR loss (rendle2009bpr), which is widely used for top-N recommendation. The pairwise BPR loss assumes that observed implicit-feedback items have greater prediction scores than unobserved ones and is designed for ranking-based top-N recommendation. The objective function is:

(12)    $\mathcal{L} = \sum_{(u, i, t, j) \in \mathcal{O}} -\ln \sigma\!\left(r(u, i, t) - r(u, j, t)\right) + \lambda \|\Theta\|_2^2,$

where $\mathcal{O} = \{(u, i, t, j)\}$ denotes the training samples, $\Theta$ includes all learnable parameters, and $\sigma(\cdot)$ is the sigmoid function. In each training sample, the positive interaction $(u, i, t)$ comes from the edge set of the CTBG, and the negative item $j$ is sampled from the unobserved items of user $u$ at timestamp $t$; $\Theta$ includes the long-term embedding table $\mathbf{E}$, the time encoding parameters $\omega$, and all linear transformation matrices. The loss is optimized via mini-batch Adam (DBLP:journals/corr/KingmaB14) with an adaptive learning rate. Alternatively, we can optimize the model with a Binary Cross-Entropy (BCE) loss:

(13)    $\mathcal{L}_{\mathrm{BCE}} = \sum_{(u, i, t, j) \in \mathcal{O}} -\ln \sigma\!\left(r(u, i, t)\right) - \ln\!\left(1 - \sigma\!\left(r(u, j, t)\right)\right),$

which is compared with the BPR loss in our experiments.
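Both objectives are a few lines given the Eq. (11) scores of positive and sampled negative interactions; a minimal sketch (the function names and the placement of the L2 term are our assumptions):

```python
import torch
import torch.nn.functional as F

def bpr_loss(pos_scores, neg_scores, params=(), weight_decay=0.0):
    """Sketch of Eq. (12): positives should outscore sampled negatives."""
    loss = -F.logsigmoid(pos_scores - neg_scores).sum()
    # Optional L2 regularization over the learnable parameters (Theta).
    loss = loss + weight_decay * sum(p.pow(2).sum() for p in params)
    return loss

def bce_loss(pos_scores, neg_scores):
    """Sketch of Eq. (13): binary cross-entropy over positive/negative pairs."""
    # log(1 - sigmoid(x)) == logsigmoid(-x), which is numerically stabler.
    return -(F.logsigmoid(pos_scores) + F.logsigmoid(-neg_scores)).sum()
```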

5. Experiments

In this section, we present the experimental setups and results to demonstrate the effectiveness of TGSRec. The experiments answer the following Research Questions (RQs):

  • RQ1: Does TGSRec yield better recommendations?

  • RQ2: How do different hyper-parameters (e.g., the number of sampled neighbors) affect the performance of TGSRec?

  • RQ3: How do different modules (e.g., temporal collaborative attention, etc.) affect the performance of TGSRec?

  • RQ4: Can TGSRec effectively unify sequential patterns and temporal collaborative signals? (Reveal temporal correlations)

5.1. Datasets

We conduct our experiments on four Amazon review datasets (mcauley2015image) and the MovieLens ML-100K dataset (harper2015movielens). The Amazon datasets are collected from different domains (https://jmcauley.ucsd.edu/data/amazon/) of the Amazon website from May 1996 to July 2014. The MovieLens dataset is collected from September 19th, 1997 through April 22nd, 1998. We use Unix timestamps on all datasets. For each dataset, we chronologically split the interactions into train/validation/test sets in an 80%/10%/10% ratio based on the interaction timestamps. More details, such as data descriptions and statistics, are presented in Table 1. We can see that the Amazon datasets are much sparser and their time spans much longer compared with the ML-100K dataset. For the Amazon datasets, the time intervals between successive interactions are typically in days, while ML-100K has shorter intervals, ranging from seconds to days.

Dataset Toys Baby Tools Music ML100K
#Users 17,946 17,739 15,920 4,652 943
#Items 11,639 6,876 10,043 3,051 1,682
#Edges 154,793 146,775 127,784 54,932 48,569
#Train 134,632 128,833 107,684 51,765 80,003
#Valid 11,283 10,191 10,847 2,183 1,516
#Test 8,878 7,751 9,253 984 1,344
Density 0.07% 0.12% 0.08% 0.38% 6.30%
Avg. Int. 85 days 61 days 123 days 104 days 4.8 hours
  • “Avg. Int.” denotes the average time interval between successive interactions.

Table 1. Statistics of datasets.
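For reference, the chronological 80%/10%/10% split described above can be sketched by sorting interactions by timestamp; the function below is our illustration, not the paper's preprocessing code.

```python
def chrono_split(edges, train_ratio=0.8, valid_ratio=0.1):
    """Split (user, item, t) interactions chronologically into train/valid/test."""
    edges = sorted(edges, key=lambda e: e[2])  # order by interaction timestamp
    n_train = int(len(edges) * train_ratio)
    n_valid = int(len(edges) * (train_ratio + valid_ratio))
    return edges[:n_train], edges[n_train:n_valid], edges[n_valid:]
```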

5.2. Experimental Settings

Datasets Metric BPR LightGCN SR-GNN GRU4Rec Caser SSE-PT BERT4Rec SASRec TiSASRec CTDNE TGSRec Improv.
Toys Recall@10 0.0021 0.0016 0.0020 0.0274 0.0138 0.1213 0.1273 0.1452 0.1361 0.0016 0.3650 0.2198
Toys Recall@20 0.0036 0.0026 0.0033 0.0288 0.0238 0.1719 0.1865 0.2044 0.1931 0.0045 0.3714 0.1670
Toys MRR 0.0024 0.0018 0.0018 0.0277 0.0082 0.0595 0.0643 0.0732 0.0671 0.0025 0.3661 0.2929
Baby Recall@10 0.0028 0.0036 0.0030 0.0036 0.0077 0.0911 0.0884 0.0975 0.1040 0.0218 0.2235 0.1195
Baby Recall@20 0.0039 0.0045 0.0062 0.0048 0.0193 0.1418 0.1634 0.1610 0.1662 0.0292 0.2295 0.0663
Baby MRR 0.0019 0.0024 0.0024 0.0028 0.0071 0.0434 0.0511 0.0455 0.0521 0.0157 0.2147 0.1626
Tools Recall@10 0.0023 0.0021 0.0051 0.0048 0.0077 0.0775 0.1296 0.0913 0.0946 0.0186 0.2457 0.1161
Tools Recall@20 0.0036 0.0035 0.0092 0.0059 0.0161 0.1155 0.1784 0.1337 0.1356 0.0271 0.2559 0.0775
Tools MRR 0.0026 0.0023 0.0028 0.0051 0.0068 0.0419 0.0628 0.0460 0.0480 0.0203 0.2468 0.1840
Music Recall@10 0.0122 0.0142 0.0051 0.0549 0.0183 0.0915 0.1352 0.1372 0.1372 0.0071 0.5935 0.4563
Music Recall@20 0.0152 0.0183 0.0092 0.0589 0.0346 0.1494 0.2093 0.2094 0.1951 0.0163 0.5986 0.3892
Music MRR 0.0057 0.0064 0.0028 0.0540 0.0106 0.0423 0.0824 0.0768 0.0681 0.0037 0.3820 0.2996
ML100K Recall@10 0.0461 0.0565 0.0045 0.0996 0.0246 0.1079 0.1116 0.0945 0.1332 0.0350 0.3118 0.1786
ML100K Recall@20 0.0766 0.0960 0.0060 0.1168 0.0417 0.1801 0.1786 0.1808 0.2232 0.0536 0.3252 0.1020
ML100K MRR 0.0213 0.0252 0.0012 0.0938 0.0147 0.0519 0.0600 0.0448 0.0605 0.0162 0.2416 0.1478
Table 2. Overall performance w.r.t. Recall@{10,20} and MRR.
Figure 4. NDCG@10 performance on all five datasets (panels: Toys, Baby, Tools, Music, ML100K). We omit the remaining methods because of their low values.

5.2.1. Baselines

We compare TGSRec with state-of-the-art methods in four different groups. Static models: static models ignore the temporal information and generate static user/item embeddings for recommendation. We compare with the most standard baseline, BPRMF (rendle2009bpr), as well as a recent GNN-based model, LightGCN (he2020lightgcn). Temporal models: we compare relevant temporal methods that utilize time information, namely CTDNE (nguyen2018continuous) and the recent TiSASRec (li2020time). We also tried to compare with JODIE (kumar2019predicting), but we do not report it because it runs out of memory on most datasets. Transformer-based SR models: since our model is built upon the transformer, we mainly focus on comparing with recent transformer-based SR methods: SASRec (kang2018self), BERT4Rec (sun2019bert4rec), SSE-PT (ssept20wu), and TiSASRec (li2020time). Other SR models: in addition, we compare with other types of SR models, i.e., FPMC (rendle2010factorizing), GRU4Rec (hidasi2015session), Caser (tang2018personalized), and SR-GNN (wu2019session), for a comprehensive study.

For each testing interaction, our continuous-time sequential recommendation setting allows models to use any historical interactions during the prediction stage, regardless of whether those interactions are in the training, validation, or even the testing set. However, all model parameters are learned only from the training data.

We implement TGSRec with PyTorch on an Nvidia 1080Ti GPU. We grid search all hyper-parameters and report the test performance based on the best validation results. For all models, we search over the embedding dimension, the learning rate, and the L2 regularization weight. For sequential methods, we additionally search over the maximum sequence length, the number of layers, and the number of heads.

5.2.2. Evaluation Protocol

All models generate a ranking list of items for each testing interaction. Each evaluation metric is averaged over the total number of interactions as the final reported result. In order to accelerate the evaluation, we sample 1,000 negative items per interaction instead of using the full set of negative items. For each interaction $(u, i, t)$ in the validation/test sets, we treat items that $u$ has not interacted with before $t$ as negative items. To address the sampling bias of this evaluation (krichene2020sampled), we apply the unbiased estimator of (krichene2020sampled) to correct the sampled ranks. We evaluate top-N recommendation performance with the standard ranking-based metrics Recall@N, NDCG@N, and Mean Reciprocal Rank (MRR), setting N to 10 or 20 for a comprehensive comparison.

5.3. Performance Comparison (RQ1)

We compare the performance of all models and demonstrate the superiority of TGSRec. We report the Recall and MRR of all models in Table 2. Additionally, we visualize the comparisons of NDCG in Figure 4. We have the following observations:

  • TGSRec consistently and significantly outperforms all baselines on all datasets. In particular, relative to the 2nd-best method, TGSRec achieves the absolute gains at Recall@10, Recall@20, and MRR reported in the Improv. column of Table 2. TGSRec also significantly outperforms the others in NDCG, as shown in Figure 4. Several factors together determine the superiority of TGSRec: (1) TGSRec captures temporal collaborative signals; (2) TGSRec explicitly expresses temporal effects; and (3) TGSRec stacks multiple TCT layers to propagate the information.

  • The static methods achieve the worst performance among all models; even a simple GRU4Rec performs 10 times better. This indicates that static embeddings fail to utilize temporal information, limiting their recommendation ability in SR. Thus, it is important to model the dynamics.

  • CTDNE performs better than Caser and GRU4Rec on the Tools and Baby datasets, which suggests the benefit of modeling temporal information with a graph. However, it is still much worse than the transformer-based methods, which again proves the strength of the transformer in encoding sequences. We also notice the poor performance of SR-GNN. Analyzing the data, we find that the time intervals between successive interactions vary a lot; since SR-GNN is originally designed for session-based sequences, it is not suitable for SR with a long time span.

  • The transformer-based SR methods consistently outperform all other types of baselines, which demonstrates the effectiveness of using a transformer structure to encode sequences. Among them, TiSASRec is better than SASRec on two datasets, which proves the effectiveness of using time information, but it is still far worse than TGSRec. The reason is twofold: first, interval information alone is not enough to unify temporal information with sequential patterns; second, the proposed temporal collaborative attention in the TCT layer captures more precise and generalizable temporal effects. We find that BERT4Rec is better than the other baselines on the Tools dataset but not on the other datasets. Since the main difference between BERT4Rec and SASRec is the bidirectional sequence encoding, it may break causal relations among items within a sequence. TGSRec performs much better than all SR baselines, showing the necessity of unifying sequential patterns and temporal collaborative signals.

Architecture Toys Baby Tools Music ML100K
(0) Default 0.3649 0.2235 0.3623 0.5935 0.3118
(1) Mean 0.0027 0.0210 0.0055 0.0051 0.0647
(2) LSTM 0.0991 0.1237 0.1266 0.3740 0.3088
(3) Fixed 0.0854 0.0944 0.0910 0.3679 0.2789
(4) Position 0.0380 0.0243 0.0209 0.0742 0.0878
(5) Empty 0.0139 0.0240 0.0018 0.0346 0.0603
(6) BCELoss 0.2200 0.1916 0.1763 0.4624 0.3542
Table 3. Ablation analysis (Recall@10) on five datasets. A bold score indicates performance better than the default version, while most other variants drop by more than 50%.

5.4. Parameter Sensitivity (RQ2)

Figure 5. Recall@10 w.r.t. the number of layers, the embedding size, and the number of neighbors on all datasets.

In this section, we conduct sensitivity analyses of the hyper-parameters of TGSRec, including the number of layers, the embedding size, and the number of sampled neighbors. The results are reported in Figure 5.
The number of layers. The results for varying the number of TCT layers are shown in the top row of Figure 5. With 0 layers, TGSRec has no TCT layer and is thus unable to infer temporal embeddings; we observe that it performs the worst on all datasets, which justifies the benefit of temporal inference. With 1 layer, the model makes temporal inferences but does not propagate them to a next layer, so it still performs worse than deeper configurations on most datasets. With 2 or more layers, the model not only makes temporal inferences but also propagates the information to capture high-order signals, which alleviates the data sparsity problem.
Embedding size. The effect of the embedding size of the TCT layers is presented in the middle row of Figure 5. We find that the performance increases as the embedding size grows. However, when the embedding size becomes too large, the performance drops, which results from over-fitting due to too many parameters.
Number of neighbors. The effect of the number of sampled neighbors is illustrated in the bottom row of Figure 5. We observe that TGSRec gains performance on most datasets as the number of neighbors grows, because more neighbors provide more information for encoding both sequences and temporal collaborative signals.

5.5. Ablation Study (RQ3 & RQ4)

In this section, we analyze the different components of TGSRec by developing several variants. Table 3 shows the Recall@10 performance of the default TGSRec and the variants; we label each row with an index number for quick reference. The default, labeled (0), is TGSRec with all components. We develop the variants by substituting individual components: the temporal collaborative attention (rows 1-2), the continuous-time embeddings (rows 3-5), and the loss function (row 6):


Temporal collaborative attention. We replace the proposed temporal collaborative attention over sampled neighbors with a mean pooling layer or an LSTM module, both of which are widely used to encode sequences; the results are labeled (1) and (2) in Table 3. We observe that substituting the collaborative attention with mean pooling severely harms performance. In comparison, the LSTM variant does much better, indicating the necessity of encoding sequential patterns by considering item transitions. However, both are worse than the default, which implies the advantage of temporal collaborative attention in encoding sequences.
Continuous-time embedding. We use three variants to verify the efficacy of the time mapping function $\Phi$. In the first variant, we sample $\omega$ in Eq. (2) directly from a normal distribution and keep it fixed. The second and third variants replace the time embedding with a learnable positional embedding as in (kang2018self) and with all zeros, respectively. The results are labeled (3), (4), and (5) in Table 3. Because the positional embedding performs better than the empty embedding, we can conclude that TGSRec has the ability to encode sequential patterns. In addition, we find that even a fixed $\omega$ for the time embedding significantly outperforms the positional embedding, indicating the necessity of using the temporal kernel to capture temporal effects in sequences. Moreover, the default version, which uses a trainable $\omega$, achieves the best performance, indicating its capacity to learn temporal effects from data.
Loss function. We also compare the BPR loss with the BCE loss, labeled (6) in Table 3. The results indicate that the BCE loss performs worse than the BPR loss except on the ML100K dataset. This is because the BPR loss is optimized for ranking, while the BCE loss is designed for binary classification.

5.6. Temporal Correlations (RQ4)

Though we have already indicated the answer to RQ4 in Sec. 5.5, in this section we conduct detailed analyses of the temporal correlations within sequences to directly answer RQ4.

5.6.1. Temporal Information Construction

Variant Toys Baby Tools Music ML100K
TGSRec 0.3649 0.2235 0.3623 0.5935 0.3118
w/o user time 0.0103 0.0138 0.0106 0.0112 0.1555
w/o item time 0.1013 0.0961 0.0836 0.2785 0.2336
Table 4. Variants of temporal information construction: user queries without time vectors (Eq. (3)) and item neighbors without time vectors (Eq. (4)).

We develop two variants by removing the time vector from either Eq. (3) or Eq. (4), i.e., user queries without time vectors or item neighbors without time vectors. The results are presented in Table 4. The observations are two-fold. First, the variant with items without time vectors performs better than the one with users without time vectors, which implies that the temporal inference of user embeddings is rather important; this matches the intuition that user preferences are dynamic while items are relatively static. Second, the performance deteriorates significantly in both variants, indicating again that TGSRec models the temporal effects of collaborative signals while also encoding sequences.

5.6.2. Temporal Attention Weights Visualization

Figure 6. Temporal Attention Weights Visualization

We visualize the attention weights of TGSRec for one user on the Music dataset in Figure 6. Each row is associated with a time increment ('h' for hour and 'd' for day) from the last interaction timestamp $T$, where 'next' denotes the timestamp (+34d) of the test interaction. Each column is associated with an item. We observe that the attention weights for items are dynamic across timestamps, which indicates the temporal inference characteristics of TGSRec. Moreover, the time increments can be arbitrary values, which verifies the continuity of the model.

5.6.3. Recommendation Results.

Besides the attention visualization, we also present part of the recommendation results for the same user in Table 5, together with the results of SASRec and TiSASRec, which leverage only sequential patterns. We find that only TGSRec predicts the ground-truth item (Killing Joke) in its top-4 predictions at the time of interest. As the time increment (e.g., +30d) approaches the prediction timestamp 'next' (i.e., +34d), the ground-truth item appears in the top-4 predictions. We also observe that the top items predicted by SASRec are recommended by TGSRec as well, though at lower ranks. This again shows that TGSRec can unify sequential patterns and temporal collaborative signals.

Time Rank-1 Rank-2 Rank-3 Rank-4
T+5d Letoya H. of Blue L. Ult. Prince Veneer
T+30d J. of A Gemini Living Lgds. Killing Joke Crane Wife
next Buf. S.F. Killing Joke Empire Stadium Arc.
T+60d D. of Future P. Even Now L. Mks. Wd. Przts. Author
SAS. Crane Wife Empire H. Fna. Are You in Rev.
TiSAS. Crane Wife Empire WTE. P. S. Stadium Arc.
Table 5. Recommendations w.r.t. time increments after the last interaction at timestamp $T$. 'next' is the timestamp of the test interaction. The ground-truth item is in red; items also predicted by SASRec and TiSASRec are in blue.

6. Conclusion

In this paper, we design a new SR model, TGSRec, to unify sequential patterns and temporal collaborative signals. TGSRec is defined upon the proposed CTBG. We apply a temporal kernel to map the continuous timestamps on edges to vectors. We then introduce the TCT layer, which infers temporal embeddings of nodes by sampling neighbors and learning attention weights to aggregate both node embeddings and time vectors. In this way, a TCT layer encodes both sequential patterns and collaborative signals, and reveals temporal effects. Extensive experiments on five real-world datasets demonstrate the effectiveness of TGSRec: it significantly outperforms existing transformer-based sequential recommendation models, and the ablation study and detailed analyses verify the efficacy of its components. In conclusion, TGSRec is a better framework for solving the SR problem with temporal information.

7. Acknowledgments

This work is supported in part by NSF under grants III-1763325, III-1909323, III-2106758, and SaTC-1930941. This work is also partially supported by NSF through grant IIS-1763365 and by UC Davis. This work is also funded in part by the National Natural Science Foundation of China Projects No. U1936213
