1. Introduction
Recommender systems have become essential in providing personalized information filtering services in a variety of applications (liang2016modeling; wang19neural; Liu2019JSCNJS; wang2021dkg; peng2020m2). They learn user and item embeddings from historical records of user-item interactions (he2017neural; rendle2009bpr). To model the dynamics of user-item interactions, current research (rendle2010factorizing; wu2017recurrent; hidasi2015session; tang2018personalized; fan2021Modeling) leverages historical time-ordered item purchasing sequences to predict future items for users, which is referred to as the sequential recommendation (SR) problem (hidasi2015session; kang2018self). One of the fundamental assumptions of SR is that users' interests change smoothly (hidasi2015session; wu2017recurrent; tang2018personalized; kang2018self). Thus, we can train a model to infer the items that are more likely to appear later in the sequence. For example, with the recent development of the Transformer (vaswani2017attention), current endeavors design a series of self-attention SR models to predict future item sequences (kang2018self; sun2019bert4rec; ssept20wu). A self-attention model infers the sequence embedding at position $t$ by assigning an attention weight to each historical item and aggregating these items. The attention weights reveal the impact of previous items on the current state at time point $t$.

Despite their effectiveness, existing works only leverage sequential patterns to model item transitions within sequences, which is still insufficient to yield satisfactory results. The reason is that they ignore the crucial temporal collaborative signals, which are latent in evolving user-item interactions and coexist with sequential patterns. To be specific, we present the effects of temporal collaborative signals in Figure 1. The target is to recommend the next item to a user at a given time. Considering only sequential patterns, the recommended item is the one that most often follows the user's latest item, as it does in two other users' sequences, compared with an alternative item that follows it in only one sequence. However, if we also take collaborative signals into account, we would recommend the alternative item, because the target user and another user both interacted with the same items (one at the same timestamp, the other at different timestamps), which indicates their high similarity. Hence, that user's sequential patterns have more impact on the target user. This motivates us to unify sequential patterns and temporal collaborative signals.
However, incorporating temporal collaborative signals in SR is rather challenging. The first challenge is that it is hard to simultaneously encode collaborative signals and sequential patterns. Current models capture sequential patterns based on the transitions of items within sequences (hidasi2015session; kang2018self; ma2020disentangled), and thus lack a mechanism to model collaborative signals across sequences. JODIE (kumar2019predicting) and DGCF (li2020dynamic) employ LSTM to model the dynamics and interactions of user and item embeddings, but they cannot learn the impacts of all historical items and are thus unable to encode sequences. SASRec (kang2018self) uses a self-attention mechanism to encode item sequences, while the effects of users, i.e., collaborative signals, are omitted. SSE-PT (ssept20wu) implicitly models collaborative signals by directly adding the same user embedding into the sequence encoding. However, it fails to model the interactions between users and items, and is thus unable to explicitly express collaborative signals.
The second challenge is that it is hard to express the temporal effects of collaborative signals. In other words, it remains unclear how to measure the impacts of those signals from a temporal perspective. For example, in Figure 1, one node interacts with two neighbors at the same timestamp, while another node interacts with the same two neighbors at different timestamps. Since there is a lag, it is problematic to ignore the time gap and assume that these interactions contribute equally. We should use temporal information to infer the importance of those collaborative signals to the recommendation. Existing works (hidasi2015session; kang2018self; sun2019bert4rec; li2020dynamic) assume that items appear discretely with equal time intervals. Thus, they only focus on the orders/positions of items in the sequence, which limits their capacity to express temporal information. Some recent works (li2020time; ye2020time) also notice the importance of the time span. However, their models either fail to capture time differences between historical interactions or are unable to generalize to unseen future timestamps or time differences, and are thus still far from revealing the actual temporal effects of collaborative signals.
Current transformer-based models (kang2018self; sun2019bert4rec) adopt the self-attention mechanism, which takes query, key, and value inputs from item embeddings and employs the dot product to learn their correlation scores. The limitation is that self-attention is only able to capture item-item relationships within sequences. Additionally, these models have no module to capture temporal correlations of items. To this end, we propose a new model, the Temporal Graph Sequential Recommender (TGSRec). It consists of two novel components: (1) the Temporal Collaborative Transformer (TCT) layer and (2) graph information propagation.
The first component advances current transformer-based models, as it can explicitly model collaborative signals in sequences and express temporal correlations of items in sequences. To be more specific, the TCT layer adopts collaborative attention among user-item interactions, where the query input to the collaborative attention is from the target node (user/item), while the key and value inputs are from its connected neighbors. As such, the TCT layer learns the importance of those interactions and thus characterizes the collaborative signals well. Moreover, the TCT layer fuses temporal information into the collaborative attention mechanism, which explicitly expresses the temporal effects of those interactions. Altogether, the TCT layer captures temporal collaborative signals.
The second component is built upon our proposed Continuous-Time Bipartite Graph (CTBG). The CTBG consists of user/item nodes and interaction edges with timestamps, as shown in Figure 1(a). Given the timestamps, the neighbor items of a user preserve sequential patterns. We propagate the temporal collaborative information learned around each node to its surrounding neighbors over the CTBG. Therefore, the propagation unifies sequential patterns with temporal collaborative signals.
In this work, we propose to use temporal embeddings of nodes for recommendation, which are dynamic and inferred at specified timestamps. For example, at a given time $t$, we infer the temporal embedding of a user by aggregating its context. We illustrate the temporal inference of node embeddings at a specific time in Figure 1(b). The temporal embeddings are inferred by our proposed TCT layer, which uses temporal information to discriminate the impacts of historical interactions and infers the temporal node embeddings.
The contributions of this paper are as follows:
Graph Sequential Recommendation: We connect the SR problem with graph embedding methods, focusing on unifying sequential patterns and temporal collaborative signals.
Temporal Collaborative Transformer: We propose a novel temporal collaborative attention mechanism to infer temporal node embeddings, which jointly models collaborative signals and temporal effects. This overcomes the inadequacy of the traditional self-attention mechanism in capturing temporal effects and user-item collaborative signals.
Extensive Experiments: We conduct comparison experiments on five real-world datasets. Comprehensive experiments demonstrate the state-of-the-art performance of TGSRec and its effectiveness in modeling temporal collaborative signals.
2. Related Work
In this section, we review related work on sequential recommendation (SR), temporal information, and graph-based recommender systems.
2.1. Sequential Recommendation
SR predicts the future items in a user's shopping sequence by mining sequential patterns. An initial solution to the SR problem is to build a Recurrent Neural Network (RNN) (hidasi2015session; wu2017recurrent; yu2016dynamic; 10.1145/3331184.3331329). GRU4Rec (hidasi2015session) predicts the next item in a session by employing GRU modules. Later, a Hierarchical RNN (quadrana2017personalizing) is proposed to enhance the RNN model with personalization information. Additionally, LSTM (hochreiter1997long; wu2017recurrent) can be applied to explore both long-term and short-term sequential patterns. Moreover, to capture the intent of users in local subsequences, NARM (li2017neural) combines the RNN model with attention weights. The major drawback of the RNN model is that it can only generate a single hidden vector, which limits its power to encode sequences (chen2018sequential).

Recently, owing to the success of self-attention models (vaswani2017attention; devlin2018bert; liu2021enriching; zhang2021pretrained) in NLP tasks, a series of attention-based SR models have been proposed (kang2018self; sun2019bert4rec; ma2020disentangled; wu2020deja; ji2020hybrid; liu2021augmenting; peng2021ham). SASRec (kang2018self) applies the transformer layer to assign weights to items in the sequence. Later, inspired by the BERT (devlin2018bert) model, BERT4Rec (sun2019bert4rec) is proposed with a bidirectional transformer layer. (ma2020disentangled) introduce the sequence-to-sequence training procedure in SR. SSE-PT (ssept20wu) designs a personalized transformer to improve SR performance. ASReP (liu2021augmenting) proposes augmenting short sequences to alleviate the cold-start issue in the Transformer. TiSASRec (li2020time) enhances SASRec with the time-interval information found in the training data. However, these models only focus on item transitions within sequences; they are unable to unify the important temporal collaborative signals with sequential patterns and do not generalize to unseen timestamps.
2.2. Temporal Information
The previously mentioned SR works are specifically designed to capture sequential patterns, while ignoring the important temporal information (koren2009collaborative; xiong2010temporal; xiang2010temporal; kumar2019predicting; li2020time). In practice, the context of users and items changes over time, which is crucial for modeling the temporal dynamics in SR. TimeSVD++ (koren2009collaborative) is a representative work that incorporates temporal information into the collaborative filtering (CF) method. It simply treats the bias as a function of time. BPTF (xiong2010temporal) extends matrix factorization to tensor factorization and uses time as the third dimension. MSIPF (xiang2010temporal) defines a temporal graph, on which it runs the PageRank algorithm for recommendation. Recently, CTDNE (nguyen2018continuous) is proposed, which applies temporal random walks over its continuous-time dynamic network. TGAT (xu2020inductive) also introduces temporal attention for learning dynamic graph embeddings. JODIE (kumar2019predicting) develops user and item RNNs to update user and item embeddings. Regarding the SR problem, a few recent works (li2020time; ye2020time) also notice the importance of temporal information. CTA (wu2020deja), MTAM (ji2020hybrid), and TiSASRec (li2020time) all use the time intervals between successive items in sequences. TASER (ye2020time) encodes both the absolute time and the relative time as vectors, which are fed to attention models to complete the SR task. However, these models are not able to unify temporal collaborative signals with sequential patterns.
2.3. Graph-based Recommendation
Because we solve the SR problem based on the graph structure (zhang2017learning; zhang2019ige), we also review some graph-based recommender system models (nguyen2018continuous; wang19neural; he2020lightgcn; Liu2019JSCNJS; liu2020basconv; 9377917), especially those based on Graph Neural Network (GNN) methods (kipf17semi; wang19neural; he2020lightgcn; berg2017gcmc). Compared with directly learning from sequences, graph-based models can also capture structural information (nguyen2018continuous; berg2017gcmc). Both NGCF (wang19neural) and LightGCN (he2020lightgcn) argue that graph-based models are able to effectively model collaborative signals, which are crucial in learning user/item embeddings. The successes of GNNs in recommender systems (wang2019kgat; he2020lightgcn; wang19neural; berg2017gcmc) provide simple yet effective methods for learning user/item embeddings from graphs. GNN models learn the embeddings by aggregating neighbors (wang19neural; he2020lightgcn). Therefore, it is easy to stack multiple layers to learn both first-order and high-order collaborative signals (wang19neural; he2020lightgcn). CTDNE (nguyen2018continuous) defines a temporal graph to learn dynamic embeddings of nodes. TGAT (xu2020inductive) learns dynamic graph embeddings based on the graph attention model. BasConv (liu2020basconv) characterizes heterogeneous graphs to learn user/item embeddings. Those models argue that graphs are powerful in modeling both structural and temporal information. However, few works investigate the possibility of solving SR problems based on graphs. SR-GNN (wu2019session) learns embeddings of session graphs by using a GNN to aggregate item embeddings, but fails to model temporal collaborative signals.
3. Definitions and Preliminaries
In this section, we introduce some definitions and preliminaries. Instead of using users' interaction sequences as inputs, as is common in SR, we introduce the Continuous-Time Bipartite Graph (CTBG) to represent all temporal interactions. Each edge in this graph has a timestamp as its attribute. The directly connected neighbors of every user/item node in this graph preserve the sequential order via the timestamps on the edges. The formal definition of the CTBG is given as follows:
Definition 3.1 (Continuous-Time Bipartite Graph).
A continuous-time bipartite graph with $N$ nodes and $E$ edges for recommendation is defined as $\mathcal{B} = \{\mathcal{U}, \mathcal{I}, \mathcal{E}_T\}$, where $\mathcal{U}$ and $\mathcal{I}$ are two disjoint node sets of users and items, respectively. Every edge $e \in \mathcal{E}_T$ is denoted as a tuple $e = (u, i, t)$, where $u \in \mathcal{U}$, $i \in \mathcal{I}$, and $t$ (a real-valued timestamp) is the edge attribute. Each triplet $(u, i, t)$ denotes the interaction of user $u$ with item $i$ at timestamp $t$.
This paper focuses on the SR problem with continuous timestamps. An example of the CTBG is presented in Figure 1(a). Let $\mathcal{I}_u(t)$ denote the set of items that user $u$ interacted with before timestamp $t$, and $\mathcal{I} \setminus \mathcal{I}_u(t)$ denote the remaining items. We define the continuous-time sequential recommendation problem studied in this paper as follows:
Definition 3.2 (Continuous-Time Recommendation).
At a specific timestamp $t$, given the user set $\mathcal{U}$, the item set $\mathcal{I}$, and the associated CTBG, the continuous-time recommendation for user $u$ is to generate a ranking list of items from $\mathcal{I} \setminus \mathcal{I}_u(t)$, where the items that $u$ is interested in are ranked at the top of the list.
Then, the SR problem is equivalent to making continuous-time recommendations on a set of future timestamps for each user $u$:
Definition 3.3 (Continuous-Time Sequential Recommendation).
For a specific user $u$, given a set of future timestamps $\mathcal{T}_u$, the continuous-time sequential recommendation for this user is to make a continuous-time recommendation for every timestamp $t \in \mathcal{T}_u$.
This is a generalized definition compared with other works (kang2018self; ma2020disentangled). We explicitly consider timestamps, while others only consider orders/positions. Therefore, unlike existing works that use next-item prediction to evaluate sequential recommendation, our setting requires future timestamps to be given in order to make a prediction. If the timestamps are position numbers in sequences, the studied problem reduces to the definition that uses only order/position information. Note that a timestamp can be any real value and is thus continuous.
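To make the definitions above concrete, the following is a minimal sketch of a CTBG stored as timestamped edges, with the two queries the definitions rely on: a user's history before a timestamp and the remaining candidate items. The class name `CTBG` and its methods are hypothetical names chosen for illustration; the source does not describe a data structure.

```python
from bisect import bisect_left
from collections import defaultdict


class CTBG:
    """Toy continuous-time bipartite graph; each edge is (user, item, timestamp)."""

    def __init__(self, edges):
        edges = list(edges)
        # Per-user interaction lists, sorted by timestamp.
        self.by_user = defaultdict(list)
        for u, i, t in sorted(edges, key=lambda e: e[2]):
            self.by_user[u].append((t, i))
        self.items = {i for _, i, _ in edges}

    def history(self, u, t):
        """Items that user u interacted with strictly before time t, in time order."""
        events = self.by_user.get(u, [])
        cut = bisect_left(events, (t,))  # index of first event with timestamp >= t
        return [i for _, i in events[:cut]]

    def candidates(self, u, t):
        """Items that u has not interacted with before t (the pool to be ranked)."""
        return self.items - set(self.history(u, t))


graph = CTBG([("u1", "i1", 1.0), ("u1", "i2", 2.5), ("u2", "i1", 0.5), ("u1", "i3", 4.0)])
print(graph.history("u1", 3.0), graph.candidates("u1", 3.0))
```

Because the query time is an arbitrary real number, the same structure can serve any future timestamp, which is the point of the continuous-time formulation.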
4. Proposed Model
In this section, we present the TGSRec model, which unifies sequential patterns and temporal collaborative signals. The framework of the TGSRec model is presented in Figure 3. There are three major components: 1) Embedding layer, which encodes nodes and timestamps in a consistent way to connect the SR problem with graph embedding method; 2) Temporal Collaborative Transformer (TCT) layer, which employs a novel temporal collaborative attention mechanism to discriminate temporal impacts of neighbors, and aggregates both node and time embeddings to infer the temporal node embedding; 3) Prediction layer, which utilizes output embeddings from the final TCT layer to calculate the score.
4.1. Embedding Layer
We encode two types of embeddings in this paper: the long-term embeddings of nodes and the continuous-time embeddings of timestamps on edges.
4.1.1. Long-Term User/Item Embeddings
Long-term embeddings for users and items are necessary (devooght2017long) for representing long-term collaborative signals. In the CTBG, they function as node features and are optimized to model the holistic structural information. A user (item) node is parameterized by a vector $\mathbf{e}_u$ ($\mathbf{e}_i$) $\in \mathbb{R}^{d}$. Since we learn embeddings for nodes in the CTBG, we retrieve the embedding of a node by indexing an embedding table $\mathbf{E} \in \mathbb{R}^{d \times |\mathcal{V}|}$, where $\mathcal{V} = \mathcal{U} \cup \mathcal{I}$. Note that the embedding table serves as a starting state for the inference of temporal user/item embeddings. During the training process, $\mathbf{E}$ will be optimized.
4.1.2. Continuous-Time Embedding
The continuous-time encoding (ye2020time; xu2019self) is a function that maps scalar timestamps into vectors, i.e., $\Phi: t \mapsto \mathbb{R}^{d_T}$, where $t$ is a real-valued timestamp. According to previous SR models (ye2020time; li2020time; wu2020deja), the time span plays a vital role in expressing temporal effects and uncovering sequential patterns. The time encoding function embeds timestamps into vectors so that a time span is represented by the dot product of the corresponding time embeddings. Therefore, we define the temporal effect as a function of the time span in continuous time space: given a pair of interactions $(u, i, t_1)$ and $(u, j, t_2)$ of the same user, the temporal effect is defined as a function $\psi(t_1 - t_2) \mapsto \mathbb{R}$, which is expressed as a kernel value of the time embeddings of $t_1$ and $t_2$:
(1)  $\psi(t_1 - t_2) = \mathcal{K}(t_1, t_2) = \Phi(t_1) \cdot \Phi(t_2)$,
where $\mathcal{K}$ is the temporal kernel and $\cdot$ denotes the dot product operation. The temporal effect measures the temporal correlation between two timestamps. Moreover, the time encoding function should generalize to any unseen timestamp, so that any time span not found in the training data can still be inferred from the encoded time embeddings. Unlike modeling the absolute time difference as in (li2020time), representing temporal effects as a kernel generalizes to any timestamp, as it models the time representations directly. Therefore, the temporal effect of any pair of timestamps can be inductively inferred from the dot product of their time representations. Eq. (1) can be achieved by a continuous and translation-invariant kernel based on Bochner's Theorem (loomis2013introduction). By explicitly representing the temporal features, the temporal embedding is:
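(2)  $\Phi(t) \mapsto \sqrt{\frac{1}{d_T}} \left[\cos(\omega_1 t), \sin(\omega_1 t), \dots, \cos(\omega_{d_T} t), \sin(\omega_{d_T} t)\right]^{\top}$,
where $\boldsymbol{\omega} = [\omega_1, \dots, \omega_{d_T}]^{\top}$ are learnable and $d_T$ is the dimension.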
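As an illustration of the time encoding, the sketch below builds cosine/sine features from a set of frequencies and checks numerically that the dot product of two time embeddings depends only on the time span, which is the translation-invariance property that Eq. (1) relies on. The function names, the random frequencies, and the choice of eight frequencies are assumptions for illustration; in the model the frequencies are learned.

```python
import numpy as np


def time_encoding(t, omega):
    """Map timestamp(s) t to [cos(w_1 t), sin(w_1 t), ..., cos(w_n t), sin(w_n t)] / sqrt(n)."""
    t = np.asarray(t, dtype=float)[..., None]                   # shape (..., 1)
    phase = t * omega                                           # shape (..., n)
    feats = np.stack([np.cos(phase), np.sin(phase)], axis=-1)   # shape (..., n, 2)
    return feats.reshape(*phase.shape[:-1], -1) / np.sqrt(omega.shape[0])


def temporal_effect(t1, t2, omega):
    """Kernel value of two timestamps: the dot product of their time embeddings."""
    return float(time_encoding(t1, omega) @ time_encoding(t2, omega))


rng = np.random.default_rng(0)
omega = rng.normal(size=8)  # hypothetical frequencies, stand-ins for learned parameters

# Both pairs are 3 time units apart, so their temporal effects should agree.
print(temporal_effect(10.0, 7.0, omega), temporal_effect(103.0, 100.0, omega))
```

The agreement follows from the identity cos(a)cos(b) + sin(a)sin(b) = cos(a - b), so the kernel reduces to the mean of cos(w_k (t1 - t2)) over the frequencies. Note that with n frequencies this sketch produces 2n features.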
4.2. Temporal Collaborative Transformer
Next, we present the novel TCT layer of TGSRec. We highlight two strengths of a TCT layer: (1) it constructs information from both user/item embeddings and temporal embeddings, which explicitly characterizes the temporal effects of the correlations; and (2) it has a collaborative attention module, which advances the existing self-attention mechanism by modeling the importance of user-item interactions and is thus able to explicitly recognize collaborative signals. To achieve this, we first present the information construction and aggregation from a user perspective. Then, we introduce a novel collaborative attention mechanism to infer the importance of interactions. Finally, we demonstrate how to generalize to items.
4.2.1. Information Construction
We construct the input information of each TCT layer as the combination of long-term node embeddings and time embeddings. As such, we can unify temporal information and collaborative signals. In particular, the query input information at the $l$-th layer for user $u$ at time $t$ is:
(3)  $\mathbf{h}_u^{(l-1)}(t) = \mathbf{e}_u^{(l-1)}(t) \,\|\, \Phi(t)$,
where $l \in \{1, \dots, L\}$. $\mathbf{h}_u^{(l-1)}(t)$ is the information for $u$ at $t$, $\mathbf{e}_u^{(l-1)}(t)$ is the temporal embedding of $u$, and $\Phi(t)$ denotes the time vector of $t$. $\|$ denotes the concatenation operation. Other operations, including summation, are possible; however, we use concatenation for simplicity. It also provides an intuitive interpretation of the attention, as shown in Eq. (7). Note that when $l = 1$, it is the first TCT layer, and the temporal embedding is $\mathbf{e}_u^{(0)}(t) = \mathbf{e}_u$, i.e., the long-term user embedding. When $l > 1$, the temporal embedding is generated from the previous TCT layer.
In addition to the query node itself, we also propagate temporal collaborative information from its neighbors. We randomly sample $S$ different interactions of $u$ before time $t$ as $\mathcal{N}_u(t) = \{(i, t_s) \mid (u, i, t_s) \in \mathcal{E}_T, t_s < t\}$. The input information at the $l$-th layer for each $(i, t_s)$ pair is:
(4)  $\mathbf{h}_i^{(l-1)}(t_s) = \mathbf{e}_i^{(l-1)}(t_s) \,\|\, \Phi(t_s)$,
where $\mathbf{h}_i^{(l-1)}(t_s)$ is the information for item $i$ at $t_s$, and $\mathbf{e}_i^{(l-1)}(t_s)$ denotes the temporal embedding of $i$ at $t_s$. Again, note that when $l = 1$, $\mathbf{e}_i^{(0)}(t_s) = \mathbf{e}_i$, i.e., the long-term item embedding. When $l > 1$, the temporal embedding is output from the previous TCT layer.
4.2.2. Information Propagation
After constructing the information, we propagate the information of the sampled neighbors $\mathcal{N}_u(t)$ to infer the temporal embeddings. Since the neighbors carry timestamps, we can in this way unify the sequential patterns with temporal collaborative signals. We compute the linear combination of the information from all sampled interactions as:
(5)  $\mathbf{e}_{\mathcal{N}_u}^{(l)}(t) = \sum_{(i, t_s) \in \mathcal{N}_u(t)} \pi_t^u(i, t_s) \, \mathbf{W}_v^{(l)} \mathbf{h}_i^{(l-1)}(t_s)$,
where $\pi_t^u(i, t_s)$ denotes the importance of an interaction $(u, i, t_s)$ and $\mathbf{W}_v^{(l)}$ is the linear transformation matrix ($\pi_t^u(i, t_s)$ also has a superscript of the layer number $l$, which is omitted for simplicity). $\pi_t^u(i, t_s)$ represents the impact of a historical interaction $(u, i, t_s)$ on the temporal inference of $u$ at time $t$, which is calculated by the temporal collaborative attention.
4.2.3. Temporal Collaborative Attention
We adopt the novel temporal collaborative attention mechanism to measure the weights $\pi_t^u(i, t_s)$, which considers both the neighboring interactions and the temporal information on edges. Both factors contribute to the importance of historical interactions. Thus, it is a better mechanism for capturing temporal collaborative signals than the self-attention mechanism, which only models item-item correlations. The attention weight is formulated as follows:
(6)  $\pi_t^u(i, t_s) = \frac{1}{\sqrt{d + d_T}} \left(\mathbf{W}_k^{(l)} \mathbf{h}_i^{(l-1)}(t_s)\right)^{\top} \mathbf{W}_q^{(l)} \mathbf{h}_u^{(l-1)}(t)$,
where $\mathbf{W}_k^{(l)}$ and $\mathbf{W}_q^{(l)}$ are both linear transformation matrices, and the factor $\frac{1}{\sqrt{d + d_T}}$ protects the dot product from growing large with high dimensions. We adopt dot-product attention because, if we ignore the transformation matrices and the scalar factor, based on Eq. (3) and Eq. (4), the right-hand side of Eq. (6) can be rewritten as:
(7)  $\mathbf{e}_u^{(l-1)}(t) \cdot \mathbf{e}_i^{(l-1)}(t_s) + \Phi(t) \cdot \Phi(t_s)$,
where the first term denotes the user-item collaborative signal, and the second term models the temporal effect according to Eq. (1). With more stacked layers, collaborative signals and temporal effects are entangled and tightly connected. Hence, the dot-product attention can characterize the impacts of temporal collaborative signals.
Hereafter, we normalize the attention weights across all sampled interactions by employing a softmax function:
(8)  $\pi_t^u(i, t_s) = \frac{\exp\left(\pi_t^u(i, t_s)\right)}{\sum_{(i', t_s') \in \mathcal{N}_u(t)} \exp\left(\pi_t^u(i', t_s')\right)}$.
Moreover, the computation is implemented by packing the information of all sampled interactions. To be more specific, we stack the information (Eq. (4)) of all sampled interactions as a matrix $\mathbf{H}_{\mathcal{N}_u}^{(l-1)}(t)$, one column per sampled interaction, as illustrated in Figure 3 (there, an embedding is drawn as a row vector, while in our notation it is a column vector). We denote $\mathbf{K} = \mathbf{W}_k^{(l)} \mathbf{H}_{\mathcal{N}_u}^{(l-1)}(t)$, $\mathbf{V} = \mathbf{W}_v^{(l)} \mathbf{H}_{\mathcal{N}_u}^{(l-1)}(t)$, and $\mathbf{q} = \mathbf{W}_q^{(l)} \mathbf{h}_u^{(l-1)}(t)$, which are respectively the key, value, and query inputs for the temporal collaborative attention module. We illustrate this in Figure 3 as green blocks. For simplicity and without ambiguity, we omit the superscripts and time and combine Eq. (6) and Eq. (8). Then, we can rewrite Eq. (5) as:
(9)  $\mathbf{e}_{\mathcal{N}_u} = \mathbf{V} \cdot \mathrm{Softmax}\left(\frac{\mathbf{K}^{\top} \mathbf{q}}{\sqrt{d + d_T}}\right)$,
which is in the form of the dot-product attention in the Transformer (vaswani2017attention). Therefore, we can safely apply the multi-head attention operation and concatenate the output from each head as the information for aggregation, as presented in Figure 3. Note that our attention is not self-attention but temporal collaborative attention, which jointly models user-item interactions and temporal information.
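The single-head computation described in this subsection can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name, the square projection matrices, the toy sizes, and the row-per-neighbor layout (as in the figure, rather than the column convention of the notation) are all choices made for illustration.

```python
import numpy as np


def temporal_collaborative_attention(h_query, H_nbrs, W_q, W_k, W_v):
    """Scaled dot-product attention of one query node over its sampled neighbors.

    h_query: (D,) query information, D = d + d_T (node embedding || time vector).
    H_nbrs:  (S, D) neighbor information, one row per sampled (neighbor, time) pair.
    """
    q = W_q @ h_query                              # query, shape (D,)
    K = H_nbrs @ W_k.T                             # keys, shape (S, D)
    V = H_nbrs @ W_v.T                             # values, shape (S, D)
    scores = K @ q / np.sqrt(h_query.shape[0])     # Eq. (6): one score per neighbor
    weights = np.exp(scores - scores.max())        # shift for numerical stability
    weights /= weights.sum()                       # Eq. (8): softmax over neighbors
    return weights @ V, weights                    # Eq. (5)/(9): weighted sum of values


# Toy inputs with hypothetical sizes: d = 4, d_T = 6, S = 3 sampled neighbors.
rng = np.random.default_rng(0)
e_u, phi_t = rng.normal(size=4), rng.normal(size=6)
E_i, Phi_s = rng.normal(size=(3, 4)), rng.normal(size=(3, 6))
h_u = np.concatenate([e_u, phi_t])           # Eq. (3)
H_n = np.concatenate([E_i, Phi_s], axis=1)   # Eq. (4), stacked row-wise
I = np.eye(10)
e_nbrs, pi = temporal_collaborative_attention(h_u, H_n, I, I, I)
```

With identity projections, the unnormalized score of each neighbor is exactly the sum of a node-embedding dot product and a time-embedding dot product, scaled by the square root of the dimension, which mirrors the decomposition in Eq. (7).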
4.2.4. Information Aggregation
To output the temporal node embedding, the final step of a TCT layer is to aggregate the query information in Eq. (3) and the neighbor information in Eq. (5). We concatenate them and send them to a feed-forward neural network (FFN):
(10)  $\mathbf{e}_u^{(l)}(t) = \mathrm{FFN}\left(\mathbf{e}_{\mathcal{N}_u}^{(l)}(t) \,\|\, \mathbf{h}_u^{(l-1)}(t)\right)$,
where $\mathbf{e}_u^{(l)}(t)$ is the temporal embedding of $u$ at $t$ on the $l$-th layer, and the FFN consists of two linear transformation layers with a ReLU activation function in between (vaswani2017attention). The output temporal embedding can either be sent to the next layer or output as the final temporal node embedding for prediction.
4.2.5. Generalization to items
Though we only present the TCT layer from the user query perspective, the procedure is analogous if the query is an item at a specific time. We only need to replace the user query information with the item query information, and change the neighbor information in Eq. (4) and Eq. (5) accordingly to user-time pairs. Then, we can infer the temporal embedding of item $i$ at time $t$ as $\mathbf{e}_i^{(l)}(t)$, which is sent to the next layer.
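The aggregation step of Eq. (10), which applies equally to a user query or an item query, can be sketched as a two-layer feed-forward network over the concatenated neighbor and query information. The function name, the bias terms, and the toy weights are assumptions for illustration; the source only states that the FFN has two linear layers with a ReLU in between.

```python
import numpy as np


def ffn_aggregate(e_neighbors, h_query, W1, b1, W2, b2):
    """Concatenate neighbor and query information, then apply linear -> ReLU -> linear."""
    x = np.concatenate([e_neighbors, h_query])
    hidden = np.maximum(0.0, W1 @ x + b1)   # first linear layer followed by ReLU
    return W2 @ hidden + b2                 # second linear layer gives the new temporal embedding


# Tiny hand-checkable example with hypothetical weights.
out = ffn_aggregate(
    e_neighbors=np.array([1.0]),
    h_query=np.array([-2.0]),
    W1=np.eye(2), b1=np.zeros(2),
    W2=np.array([[1.0, 1.0]]), b2=np.array([0.5]),
)
```

In a stacked model, the output of this step would play the role of the temporal embedding that the next layer concatenates with the time vector again.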
4.3. Model Prediction
The TGSRec model consists of $L$ TCT layers. For each test triplet $(u, i, t)$, it yields temporal embeddings for both $u$ and $i$ at $t$ on the last TCT layer, denoted as $\mathbf{e}_u^{(L)}(t)$ and $\mathbf{e}_i^{(L)}(t)$, respectively. Then, the prediction score is:
(11)  $r(u, i, t) = \mathbf{e}_u^{(L)}(t) \cdot \mathbf{e}_i^{(L)}(t)$,
where $r(u, i, t)$ denotes the score to recommend $i$ for $u$ at time $t$. With the generalized continuous-time embeddings and the proposed TCT layers, we can infer user/item embeddings at any timestamp, which makes multi-step recommendation feasible, whereas existing work only predicts the next item. Recall that, based on the definition of continuous-time sequential recommendation in Section 3, we recommend to each user a ranking list of items at the given timestamp. Therefore, we use Eq. (11) to calculate the scores of all candidate items and sort them by score.
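Scoring and ranking with Eq. (11) can be sketched as below. The function name and the toy embeddings are hypothetical; the sketch assumes the final-layer temporal embeddings of the user and of every candidate item have already been inferred at the query time.

```python
import numpy as np


def rank_items(e_user, E_items, top_n=10):
    """Score each candidate by a dot product with the user embedding; return the top-N indices."""
    scores = E_items @ e_user                          # one score per candidate item
    order = np.argsort(-scores, kind="stable")[:top_n]  # highest score first
    return order, scores[order]


e_user = np.array([1.0, 0.0])
E_items = np.array([[0.2, 1.0], [0.9, 0.0], [0.5, 3.0]])   # three candidate items
order, top_scores = rank_items(e_user, E_items, top_n=3)
```

Repeating this at each timestamp in a user's set of future timestamps gives the multi-step recommendation described above.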
4.4. Model Optimization
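To learn the model parameters, we use the pairwise BPR loss (rendle2009bpr), which is widely used for top-N recommendation. The pairwise BPR loss assumes that observed implicit-feedback items have greater prediction scores than unobserved ones, and it is designed for ranking-based top-N recommendation. The objective function is:
(12)  $\mathcal{L}_{bpr} = \sum_{(u, i, j, t) \in \mathcal{O}_T} -\log \sigma\left(r(u, i, t) - r(u, j, t)\right) + \lambda \|\Theta\|_2^2$,
where $\mathcal{O}_T$ denotes the training samples, $\Theta$ includes all learnable parameters, $\lambda$ is the L2 regularization weight, and $\sigma(\cdot)$ is a sigmoid function. The training samples are $\mathcal{O}_T = \{(u, i, j, t) \mid (u, i, t) \in \mathcal{E}_T, j \in \mathcal{I} \setminus \mathcal{I}_u(t)\}$, where the positive interaction $(u, i, t)$ comes from the edge set of the CTBG, and the negative item $j$ is sampled from the unobserved items of user $u$ at timestamp $t$; $\Theta$ includes the long-term embedding $\mathbf{E}$, the time embedding parameter $\boldsymbol{\omega}$, and all linear transformation matrices. The loss is optimized via mini-batch Adam (DBLP:journals/corr/KingmaB14) with an adaptive learning rate. Alternatively, we can optimize the model with a Binary Cross Entropy (BCE) loss:
(13)  $\mathcal{L}_{bce} = \sum_{(u, i, j, t) \in \mathcal{O}_T} -\log \sigma\left(r(u, i, t)\right) - \log\left(1 - \sigma\left(r(u, j, t)\right)\right) + \lambda \|\Theta\|_2^2$,
which is compared with the BPR loss in experiments.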
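The data terms of the two objectives can be sketched as follows, given arrays of scores for positive and sampled negative items. This is a minimal sketch that leaves out the regularization term and the optimizer; the function names are hypothetical.

```python
import numpy as np


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))."""
    return -np.logaddexp(0.0, -np.asarray(x, dtype=float))


def bpr_loss(pos_scores, neg_scores):
    """Pairwise BPR: push each positive score above its paired negative score."""
    diff = np.asarray(pos_scores, dtype=float) - np.asarray(neg_scores, dtype=float)
    return float(-log_sigmoid(diff).sum())


def bce_loss(pos_scores, neg_scores):
    """Binary cross entropy: positives labeled 1, negatives labeled 0."""
    neg = np.asarray(neg_scores, dtype=float)
    # log(1 - sigmoid(x)) equals log_sigmoid(-x).
    return float(-(log_sigmoid(pos_scores).sum() + log_sigmoid(-neg).sum()))


print(bpr_loss([2.0, 0.5], [0.0, 1.0]), bce_loss([2.0, 0.5], [0.0, 1.0]))
```

BPR only constrains the difference between a positive and a negative score, while BCE constrains each score individually, which is one reason the two can behave differently in ranking experiments.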
5. Experiments
In this section, we present the experimental setups and results to demonstrate the effectiveness of TGSRec. The experiments answer the following Research Questions (RQs):

RQ1: Does TGSRec yield better recommendation?

RQ2: How do different hyperparameters (e.g., the number of sampled neighbors) affect the performance of TGSRec?

RQ3: How do different modules (e.g., temporal collaborative attention, etc.) affect the performance of TGSRec?

RQ4: Can TGSRec effectively unify sequential patterns and temporal collaborative signals? (Reveal temporal correlations)
5.1. Datasets
We conduct our experiments on four Amazon review datasets (mcauley2015image) and the MovieLens ML-100K dataset (harper2015movielens). The Amazon datasets are collected from different domains of the Amazon website (https://jmcauley.ucsd.edu/data/amazon/) from May 1996 to July 2014. The MovieLens dataset is collected from September 19th, 1997 through April 22nd, 1998. We use Unix timestamps on all datasets. For each dataset, we chronologically split the interactions into train/validation/test sets in an 80%/10%/10% ratio based on the interaction timestamps. More details, such as data descriptions and statistics, are presented in Table 1. The Amazon datasets are much sparser and their time spans are much longer than those of the ML-100K dataset. For the Amazon datasets, the time intervals between successive interactions are typically in days, while ML-100K has shorter time intervals, ranging from seconds to days.

Dataset  Toys  Baby  Tools  Music  ML-100K
#Users  17,946  17,739  15,920  4,652  943
#Items  11,639  6,876  10,043  3,051  1,682
#Edges  154,793  146,775  127,784  54,932  48,569
#Train  134,632  128,833  107,684  51,765  80,003
#Valid  11,283  10,191  10,847  2,183  1,516
#Test  8,878  7,751  9,253  984  1,344
Density  0.07%  0.12%  0.08%  0.38%  6.30%
Avg. Int.  85 days  61 days  123 days  104 days  4.8 hours

“Avg. Int.” denotes the average time interval.
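A chronological split can be sketched as below. This is only one plausible reading, assuming the 80%/10%/10% ratio is applied to interactions sorted by timestamp; the source does not spell out the exact procedure, and the function name is hypothetical.

```python
def chronological_split(edges, train_frac=0.8, valid_frac=0.1):
    """Sort (user, item, timestamp) edges by time and cut them into train/valid/test."""
    ordered = sorted(edges, key=lambda e: e[2])
    n_train = int(len(ordered) * train_frac)
    n_valid = int(len(ordered) * (train_frac + valid_frac))
    return ordered[:n_train], ordered[n_train:n_valid], ordered[n_valid:]


# Ten toy interactions with shuffled timestamps.
edges = [("u", "i", t) for t in [5, 1, 9, 3, 7, 2, 8, 4, 6, 0]]
train, valid, test = chronological_split(edges)
```

Splitting by time rather than per user means that every test interaction happens after all training interactions, which matches the continuous-time setting in which future timestamps are unseen during training.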
5.2. Experimental Settings
Dataset  Metric  BPR  LightGCN  SR-GNN  GRU4Rec  Caser  SSE-PT  BERT4Rec  SASRec  TiSASRec  CTDNE  TGSRec  Improv.
Toys  Recall@10  0.0021  0.0016  0.0020  0.0274  0.0138  0.1213  0.1273  0.1452  0.1361  0.0016  0.3650  0.2198
Toys  Recall@20  0.0036  0.0026  0.0033  0.0288  0.0238  0.1719  0.1865  0.2044  0.1931  0.0045  0.3714  0.1670
Toys  MRR  0.0024  0.0018  0.0018  0.0277  0.0082  0.0595  0.0643  0.0732  0.0671  0.0025  0.3661  0.2929
Baby  Recall@10  0.0028  0.0036  0.0030  0.0036  0.0077  0.0911  0.0884  0.0975  0.1040  0.0218  0.2235  0.1195
Baby  Recall@20  0.0039  0.0045  0.0062  0.0048  0.0193  0.1418  0.1634  0.1610  0.1662  0.0292  0.2295  0.0663
Baby  MRR  0.0019  0.0024  0.0024  0.0028  0.0071  0.0434  0.0511  0.0455  0.0521  0.0157  0.2147  0.1626
Tools  Recall@10  0.0023  0.0021  0.0051  0.0048  0.0077  0.0775  0.1296  0.0913  0.0946  0.0186  0.2457  0.1161
Tools  Recall@20  0.0036  0.0035  0.0092  0.0059  0.0161  0.1155  0.1784  0.1337  0.1356  0.0271  0.2559  0.0775
Tools  MRR  0.0026  0.0023  0.0028  0.0051  0.0068  0.0419  0.0628  0.0460  0.0480  0.0203  0.2468  0.1840
Music  Recall@10  0.0122  0.0142  0.0051  0.0549  0.0183  0.0915  0.1352  0.1372  0.1372  0.0071  0.5935  0.4563
Music  Recall@20  0.0152  0.0183  0.0092  0.0589  0.0346  0.1494  0.2093  0.2094  0.1951  0.0163  0.5986  0.3892
Music  MRR  0.0057  0.0064  0.0028  0.0540  0.0106  0.0423  0.0824  0.0768  0.0681  0.0037  0.3820  0.2996
ML-100K  Recall@10  0.0461  0.0565  0.0045  0.0996  0.0246  0.1079  0.1116  0.09450  0.1332  0.0350  0.3118  0.1786
ML-100K  Recall@20  0.0766  0.0960  0.0060  0.1168  0.0417  0.1801  0.1786  0.1808  0.2232  0.0536  0.3252  0.1020
ML-100K  MRR  0.0213  0.0252  0.0012  0.0938  0.0147  0.0519  0.0600  0.0448  0.0605  0.0162  0.2416  0.1478

“Improv.” denotes the absolute gain of TGSRec over the best baseline.
5.2.1. Baselines
We compare TGSRec with state-of-the-art methods in the following groups.
Static models: Static models ignore temporal information and generate static user/item embeddings for recommendation. We compare with the most standard baseline, BPRMF (rendle2009bpr), and with a recent GNN-based model, LightGCN (he2020lightgcn).
Temporal models: We compare with relevant temporal methods that utilize time information, namely CTDNE (nguyen2018continuous) and the recent model TiSASRec (li2020time). We also tried to compare with JODIE (kumar2019predicting); however, we do not report it because it has out-of-memory errors on most datasets.
Transformer-based SR models: Since our model is built upon the transformer, we mainly focus on comparing with recent transformer-based SR methods, which are SASRec (kang2018self), BERT4Rec (sun2019bert4rec), SSE-PT (ssept20wu), and TiSASRec (li2020time).
Other SR models: In addition, we also compare with other types of SR models, i.e., FPMC (rendle2010factorizing), GRU4Rec (hidasi2015session), Caser (tang2018personalized), and SR-GNN (wu2019session), for a comprehensive study.
For each testing interaction, our continuous-time sequential recommendation setting allows models to use any historical interactions during the prediction stage, regardless of whether those interactions are in the training, validation, or even testing portion. However, all model parameters are learned only from the training data.
We implement TGSRec with PyTorch on an Nvidia 1080Ti GPU. We grid search all hyperparameters and report the test performance based on the best validation results. For all models, we search the embedding dimension, the learning rate, and the L2 regularization weight. For sequential methods, we also search the maximum sequence length, the number of layers, and the number of heads.
5.2.2. Evaluation Protocol
All models generate a ranking list of items for each testing interaction. Each evaluation metric is averaged over the total number of interactions as the final reported result. To accelerate the evaluation, we sample 1,000 negative items instead of using the full set of negative items. For each interaction in the validation/test sets, we treat items that the user has not interacted with before as negative items. To address the sampling bias in evaluation (krichene2020sampled), we apply the unbiased estimator in (krichene2020sampled) to correct the sampled ranks. We evaluate the top-N recommendation performance with the standard ranking-based evaluation metrics Recall@N, NDCG@N, and Mean Reciprocal Rank (MRR). We set N to either 10 or 20 for a comprehensive comparison.
5.3. Performance Comparison (RQ1)
We compare the performance of all models and demonstrate the superiority of TGSRec. We report the Recall and MRR of all models in Table 2. Additionally, we visualize the comparisons of NDCG in Figure 4. We have the following observations:

TGSRec consistently and significantly outperforms all baselines on all datasets. In particular, relative to the second-best method, TGSRec achieves absolute gains at Recall@10, Recall@20, and MRR, as reported in Table 2. TGSRec also significantly outperforms the others in NDCG, as shown in Figure 4. Several factors together determine the superiority of TGSRec: (1) TGSRec captures temporal collaborative signals; (2) TGSRec explicitly expresses temporal effects; and (3) TGSRec stacks multiple TCT layers to propagate the information.

The static methods achieve the worst performance among all models. Even a simple GRU4Rec performs 10 times better than them. This indicates that static embeddings fail to utilize temporal information, which limits their recommendation ability in SR. Thus, it is important to model dynamics.

CTDNE performs better than Caser and GRU4Rec on the Tools and Baby datasets. This suggests the benefit of modeling temporal information with a graph. However, it is still much worse than the transformer-based methods, which again shows the strength of the transformer in encoding sequences. We also notice the poor performance of SR-GNN. We analyzed the data and found that the time intervals between successive interactions vary a lot. Since SR-GNN was originally designed for session-based sequences, it is not suitable for SR with a long time span.

The transformer-based SR methods consistently outperform all other types of baselines, which demonstrates the effectiveness of using the transformer structure to encode sequences. Among them, TiSASRec is better than SASRec on two datasets, which indicates the effectiveness of using time information. However, it is still far worse than TGSRec. The reason is twofold. First, interval information alone is not enough to unify the temporal information with sequential patterns. Second, the proposed temporal collaborative attention in the TCT layer captures more precise and generalized temporal effects. We find that BERT4Rec is better than the other baselines on the Tools dataset but not on the other datasets. Since the main difference between BERT4Rec and SASRec is the bidirectional sequence encoding, it may break causal relations among items within a sequence. TGSRec performs much better than the SR baselines, showing the necessity of unifying sequential patterns and temporal collaborative signals.
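The Recall@N, NDCG@N, and MRR values compared above are per-interaction ranking metrics averaged over all test interactions. The following is a minimal sketch of that computation in plain Python, assuming one ground-truth item per test interaction that is ranked against sampled negatives. The function names, the tie-handling rule, and the toy scores are illustrative assumptions, and the unbiased rank correction of (krichene2020sampled) is deliberately left out.

```python
import math
import random


def sample_negatives(all_items, seen_items, k, rng):
    """Sample up to k items the user has not interacted with (hypothetical helper)."""
    candidates = [i for i in all_items if i not in seen_items]
    return rng.sample(candidates, min(k, len(candidates)))


def rank_of_target(target_score, negative_scores):
    """1-based rank of the target; ties count against the target (an assumption)."""
    return 1 + sum(1 for s in negative_scores if s >= target_score)


def recall_at_n(rank, n):
    """1.0 if the ground-truth item is within the top n, else 0.0."""
    return 1.0 if rank <= n else 0.0


def ndcg_at_n(rank, n):
    """With a single relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1) if rank <= n else 0.0


def reciprocal_rank(rank):
    """Reciprocal of the 1-based rank of the ground-truth item."""
    return 1.0 / rank


def evaluate(interactions, n):
    """Average each metric over all test interactions."""
    ranks = [rank_of_target(t, negs) for t, negs in interactions]
    count = len(ranks)
    return {
        "recall": sum(recall_at_n(r, n) for r in ranks) / count,
        "ndcg": sum(ndcg_at_n(r, n) for r in ranks) / count,
        "mrr": sum(reciprocal_rank(r) for r in ranks) / count,
    }


# Toy example with made-up scores: (target score, scores of sampled negatives).
toy = [
    (0.9, [0.1, 0.2, 0.3, 0.4]),  # target ranked 1st
    (0.5, [0.6, 0.7, 0.1, 0.2]),  # target ranked 3rd
    (0.0, [0.6, 0.7, 0.1, 0.2]),  # target ranked 5th
]
metrics = evaluate(toy, n=3)
```

Because each interaction has exactly one relevant item, Recall@N reduces to a hit indicator and NDCG@N to a single discounted gain, which is why both can be derived from the rank alone.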
Architecture  Toys  Baby  Tools  Music  ML100K 

(0) Default  0.3649  0.2235  0.3623  0.5935  0.3118 
(1) Mean  0.0027  0.0210  0.0055  0.0051  0.0647 
(2) LSTM  0.0991  0.1237  0.1266  0.3740  0.3088 
(3) Fixed  0.0854  0.0944  0.0910  0.3679  0.2789 
(4) Position  0.0380  0.0243  0.0209  0.0742  0.0878 
(5) Empty  0.0139  0.0240  0.0018  0.0346  0.0603 
(6) BCELoss  0.2200  0.1916  0.1763  0.4624  0.3542 
5.4. Parameter Sensitivity (RQ2)
In this section, we conduct sensitivity analyses of the hyperparameters of TGSRec, including the number of layers, the embedding size, and the number of neighbors. The results are reported in Figure 5.
The number of layers. The number of TCT layers is searched over a set of candidate values. The results are shown in the top row of Figure 5. With zero layers, TGSRec has no TCT layer and is thus unable to infer temporal embeddings. We observe that it performs the worst on all datasets, which justifies the benefit of temporal inference. With a single layer, it makes temporal inference but without propagation to the next layer. Therefore, it still performs worse than the multi-layer setting on most datasets. With multiple layers, it can not only make temporal inference but also propagate the information to capture high-order signals, which alleviates the data sparsity problem.
Embedding size. The embedding size of the TCT layers is searched over a set of candidate values, and the results are presented in the middle row of Figure 5. We find that the performance increases as the embedding size grows. However, when the embedding size is too large, the performance drops, which we attribute to overfitting caused by too many parameters.
Number of neighbors. The number of neighbors is searched over a set of candidate values, and the results are illustrated in the bottom row of Figure 5. We observe that TGSRec gains performance on most datasets as the number of neighbors grows. This is because more neighbors provide more information for encoding both sequences and temporal collaborative signals.
5.5. Ablation Study (RQ3 & RQ4)
In this section, we conduct experiments to analyze different components in TGSRec. We develop several variants to better understand their effectiveness. Table 3 shows the performance w.r.t. Recall@10 of the default TGSRec and the other variants. We label each row with an index number for quick reference. The default is TGSRec with all components, labeled as (0). We develop the variants by substituting some components, which are temporal collaborative attention (1) and (2), continuous-time embeddings (3) to (5), and the loss function (6):
Temporal collaborative attention. We replace the proposed temporal collaborative attention over sampled neighbors with a mean pooling or an LSTM module, both of which are widely used to encode sequences. The results are labeled as (1) and (2) in Table 3. We observe that substituting collaborative attention with a mean pooling layer severely degrades the performance. In comparison, the LSTM variant is much better, indicating the necessity of encoding sequential patterns by considering item transitions. However, both are worse than the default, which implies the advantage of temporal collaborative attention in encoding sequences.
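To illustrate why replacing attention with mean pooling can lose information, the sketch below contrasts the two aggregators on a handful of neighbor vectors. It uses generic scaled dot-product attention in plain Python as a stand-in, which is an assumption: it is not the TCT layer itself, it omits time vectors and learned projections, and the LSTM variant is not shown. All names and toy vectors are hypothetical.

```python
import math


def mean_pool(values):
    """Average neighbor vectors with equal weight, regardless of the query."""
    dim = len(values[0])
    return [sum(v[d] for v in values) / len(values) for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return pooled, weights


# Toy neighbors (made-up vectors); keys and values are the same vectors here.
neighbors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]

pooled_mean = mean_pool(neighbors)
pooled_attn, attn_weights = attention_pool(query, neighbors, neighbors)
```

Mean pooling returns the same vector for every query, whereas the attention weights depend on how well each neighbor matches the query, so the second neighbor (orthogonal to the query) receives the smallest weight.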
Continuous-time embedding. We use three variants to verify the efficacy of the time mapping function. The first variant samples the parameters of the mapping function in Eq. (2) directly from a normal distribution and keeps them fixed. The second and third variants replace the time mapping function with a learnable positional embedding as in (kang2018self) and with an all-zero (empty) embedding, respectively. The results are labeled as (3) to (5) in Table 3. Because the position embedding performs better than the empty embedding, we can conclude that TGSRec has the ability to encode sequential patterns. In addition, we find that even a fixed mapping function significantly outperforms the position embedding, indicating the necessity of using the temporal kernel to capture temporal effects in sequences. Moreover, the default version, which uses a trainable mapping function, achieves the best performance, which indicates its capacity to learn temporal effects from data.
Loss function. We also compare the BPR loss with the BCE loss, which is labeled as (6) in Table 3. The results indicate that the BCE loss is inferior to the BPR loss, except on the ML100K dataset. This is because the BPR loss is optimized for ranking, while the BCE loss is designed for binary classification.
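The difference between a ranking loss and a classification loss can be made concrete with the standard textbook forms of the two objectives for one positive and one sampled negative score. The sketch below is a plain-Python illustration under that assumption; it is not taken from the TGSRec implementation, and the function names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bpr_loss(pos_score, neg_score):
    """Pairwise ranking loss: depends only on the margin pos_score - neg_score."""
    return -log_sigmoid(pos_score - neg_score)


def bce_loss(pos_score, neg_score):
    """Pointwise loss: the positive is labelled 1 and the negative is labelled 0."""
    # log(1 - sigmoid(x)) equals log(sigmoid(-x)).
    return -log_sigmoid(pos_score) - log_sigmoid(-neg_score)


# Two score pairs with the same margin but different absolute levels.
same_margin_low = bpr_loss(2.0, 1.0)
same_margin_high = bpr_loss(12.0, 11.0)
bce_low = bce_loss(2.0, 1.0)
bce_high = bce_loss(12.0, 11.0)
```

The BPR value is identical for both pairs because only the relative order matters, while the BCE value changes with the absolute scores since each score is pushed toward its own label.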
5.6. Temporal Correlations (RQ4)
Though we have already indicated the answer to RQ4 in Sec. 5.5, this section conducts detailed analyses of the temporal correlation within sequences to answer RQ4 directly.
5.6.1. Temporal Information Construction
Variant  Toys  Baby  Tools  Music  ML100K 

TGSRec  0.3649  0.2235  0.3623  0.5935  0.3118 
w/o user T  0.0103  0.0138  0.0106  0.0112  0.1555 
w/o item T  0.1013  0.0961  0.0836  0.2785  0.2336 
We develop two variants by removing the time vector in either Eq. (3) or Eq. (4), i.e., users without time vectors or items without time vectors. The results are presented in Table 4. The observations are twofold. First, items without time vectors perform better than users without time vectors. This implies that the temporal inference of user embeddings is rather important, which matches the intuition that the preferences of users are dynamic while items are relatively more static. Second, the performance deteriorates significantly in both variants, indicating again that TGSRec is able to model temporal effects of collaborative signals while also encoding sequences.
5.6.2. Temporal Attention Weights Visualization
We visualize the attention weights of TGSRec on the Music dataset for a user, as shown in Figure 6. Each row is associated with an increment (‘h’ for hour and ‘d’ for day) from the last interaction timestamp T, where ‘next’ denotes the timestamp (+34d) of the test interaction. Each column is associated with an item. We observe that the attention weights for items are dynamic at different timestamps, which indicates the temporal inference characteristics of TGSRec. Moreover, the time increments can be arbitrary values, which verifies its continuity.
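One way to see how attention weights can change with arbitrary, real-valued time increments is to sketch a continuous-time encoding and the weights it induces. The plain-Python sketch below assumes a cosine/sine feature map with frequencies drawn from a normal distribution; the exact form of the time mapping function in Eq. (2) is not reproduced here, node embeddings and learned projections are left out, and all function names and time values are hypothetical.

```python
import math
import random


def make_frequencies(num_freqs, seed=0):
    """Draw frequencies from a standard normal distribution (an assumption)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_freqs)]


def time_encoding(t, freqs):
    """Map a real-valued timestamp to cosine and sine features."""
    scale = math.sqrt(1.0 / len(freqs))
    cos_part = [scale * math.cos(w * t) for w in freqs]
    sin_part = [scale * math.sin(w * t) for w in freqs]
    return cos_part + sin_part


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def temporal_weights(query_time, neighbor_times, freqs):
    """Softmax over time-encoding similarities between a query time and neighbor times."""
    q = time_encoding(query_time, freqs)
    scores = [dot(q, time_encoding(t, freqs)) for t in neighbor_times]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


freqs = make_frequencies(8)
# The similarity of two encodings depends only on the time difference (2.5 here).
k_near = dot(time_encoding(3.7, freqs), time_encoding(1.2, freqs))
k_shifted = dot(time_encoding(103.7, freqs), time_encoding(101.2, freqs))

# Weights over the same two neighbors at two different query times.
weights_early = temporal_weights(0.0, [0.0, 1.0], [1.0])
weights_late = temporal_weights(1.0, [0.0, 1.0], [1.0])
```

With this feature map, the inner product of two encodings is an average of cosines of the time difference, so it is defined for any real-valued gap, and the same pair of neighbors receives different weights as the query time moves.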
5.6.3. Recommendation Results
Besides the attention visualization, we also present a part of the recommendation results for the same user in Table 5. Additionally, we show the results of SASRec and TiSASRec, which only leverage sequential patterns. We find that only TGSRec predicts the ground truth item (Killing Joke) in its top-4 predictions at the time of interest. When the time (e.g., +30d) becomes close to the prediction timestamp ‘next’ (i.e., +34d), the ground truth item appears in the top-4 predictions. We also observe that the top predicted items from SASRec are recommended by TGSRec as well, though at lower ranks. This again suggests that TGSRec can unify sequential patterns and temporal collaborative signals.
Time  Rank1  Rank2  Rank3  Rank4 
T+5d  Letoya  H. of Blue L.  Ult. Prince  Veneer 
T+30d  J. of A Gemini  Living Lgds.  Killing Joke  Crane Wife 
next  Buf. S.F.  Killing Joke  Empire  Stadium Arc. 
T+60d  D. of Future P.  Even Now  L. Mks. Wd.  Przts. Author 
SAS.  Crane Wife  Empire  H. Fna. Are  You in Rev. 
TiSAS.  Crane Wife  Empire  WTE. P. S.  Stadium Arc. 
6. Conclusion
In this paper, we design a new SR model, TGSRec, to unify sequential patterns and temporal collaborative signals. TGSRec is defined upon the proposed CTBG. We apply a temporal kernel to map continuous timestamps on edges to vectors. Then, we introduce the TCT layer, which can infer temporal embeddings of nodes. It samples neighbors and learns attention weights to aggregate both node embeddings and time vectors. In this way, a TCT layer is able to encode both sequential patterns and collaborative signals, as well as reveal temporal effects. Extensive experiments on five real-world datasets demonstrate the effectiveness of TGSRec. TGSRec significantly outperforms existing transformer-based sequential recommendation models. Moreover, the ablation study and detailed analyses verify the efficacy of the components in TGSRec. In conclusion, TGSRec is a better framework to solve the SR problem with temporal information.
7. Acknowledgments
This work is supported in part by NSF under grants III-1763325, III-1909323, III-2106758, and SaTC-1930941. This work is also partially supported by NSF through grant IIS-1763365 and by UC Davis. This work is also funded in part by the National Natural Science Foundation of China Projects No. U1936213