1 Introduction
Recommendation systems are increasingly prevalent with the growth of content delivery platforms, e-commerce websites, and mobile apps [shani2008mining]. Most recommendation problems can be naturally thought of as predicting a user's partial ranking of a large candidate pool of items. After obtaining the optimal ranking, the recommender system can simply recommend the top items in the list to each individual user. Rankings are usually personalized to cater to users' particular tastes; in the literature, this is formulated as the collaborative ranking problem [weimer2008cofi]. The temporal ordering, determined by when users engaged with items, has proven to be an important resource for further improving ranking performance. We call the collaborative ranking setting with temporal ordering information the Temporal Collaborative Ranking problem in this paper.
Recent advances in deep learning, especially the discovery of various attention mechanisms [bahdanau2014neural, sutskever2014sequence] and newer architectures [vaswani2017attention, devlin2018bert] in addition to the classical RNN and CNN architectures in natural language processing, have allowed us to make better use of the temporal ordering of items that each user has engaged with. In particular, the SASRec model [kang2018self], inspired by the popular Transformer model in natural language processing, has achieved state-of-the-art results on the temporal collaborative ranking problem and enjoys more than a 10x speedup compared to earlier RNN- [hidasi2015session] and CNN-based [tang2018personalized] methods. On closer inspection, however, SASRec is inherently an unpersonalized model: it does not introduce user embeddings, which often leads to an inferior recommendation model in terms of both ranking performance and model interpretability. Although personalization is not needed for the original Transformer model [vaswani2017attention] in natural language understanding or translation, personalization has played a crucial role throughout the recommender system literature [zhang2019deep] ever since the matrix factorization approach to the Netflix prize [koren2009bellkor].
In this work, we propose a novel method, the Personalized Transformer (SSE-PT), that introduces personalization into self-attentive neural network architectures.
[kang2018self] found that adding additional personalized embeddings did not improve the performance of their Transformer model, and postulated that this is because they already use the user history, so the embeddings only contribute to overfitting. Although introducing user embeddings into the model is indeed difficult with existing regularization techniques for embeddings, we show that personalization can greatly improve ranking performance with a recent regularization technique called Stochastic Shared Embeddings (SSE) [wu2019stochastic]. The Personalized Transformer (SSE-PT) model with SSE regularization works well on all 5 real-world datasets we consider, outperforming the previous state-of-the-art algorithm SASRec by almost 5% in terms of NDCG@10. Furthermore, after examining some random users' engagement histories and the corresponding attention heat maps used during the inference stage, we find our model is not only more interpretable but also able to focus on recent engagement patterns for each user. Moreover, our SSE-PT model with a slight modification, which we call SSE-PT++, can handle extremely long sequences and outperform SASRec in ranking results with comparable training speed, striking a balance between performance and speed requirements.
2 Related Work
2.1 Collaborative Filtering and Ranking
Recommender systems can be divided into those designed for explicit feedback, such as ratings [koren2009matrix], and those for implicit feedback, based on user engagement [hu2008collaborative]. Recently, implicit feedback datasets, such as user clicks on web pages, check-ins at restaurants, likes of posts, music listening behavior, watching history and purchase history, have become increasingly prevalent. Unlike explicit feedback, implicit feedback can be obtained without users' active participation or even awareness. Item-to-item [sarwar2001item], user-to-user [wang2006unifying], and user-to-item [koren2009matrix] are three different angles of utilizing user engagement data. In item-to-item approaches, the goal is to recommend items similar to those a user has engaged with. In user-to-user approaches, the goal is to recommend to a user items that similar users have engaged with previously. User-to-item approaches, on the other hand, focus on examining user and item relationships as a whole, which is also referred to as collaborative filtering. These relationships can also be viewed as graphs [wu2019graph].
Two main approaches to recommendation are: attempting to predict the explicit or implicit feedback with matrix (or tensor) completion, or attempting to predict the relative rankings derived from the feedback. Collaborative filtering algorithms, including matrix factorization [hill1995recommending, schafer2007collaborative, koren2008factorization, mnih2008probabilistic, hu2008collaborative], which predict the feedback in a pointwise fashion as if it were a supervised learning problem, fall into the first category. Predicting the feedback with supervised learning objectives suffers from the different rating standards of users, and it can be helpful to consider the data to simply be the ranking of the items based on feedback. There are two main approaches to the collaborative ranking problem, namely pairwise and listwise methods. Pairwise methods [rendle2009bpr, wu2017large] treat each pairwise comparison for a user as a label, which implicitly models the pairwise comparisons as independent observations. Listwise methods [wu2018sql], on the other hand, treat a user's entire engagement history as a single observation. In terms of ranking performance, listwise approaches normally outperform pairwise approaches, and pairwise approaches outperform pointwise collaborative filtering [wu2018sql].
2.2 Session-based and Sequential Recommendation
Both session-based and sequential (i.e., next-basket) recommendation algorithms take advantage of additional temporal information to make better personalized recommendations. The main difference between session-based recommendations [hidasi2015session] and sequential recommendations [kang2018self] is that the former assumes the user ids are not recorded, so the lengths of engagement sequences are relatively short. Consequently, session-based recommendations normally do not consider user factors. Sequential recommendation, on the other hand, treats each sequence as a user's engagement history [kang2018self]. Neither setting explicitly requires timestamps: only the relative temporal ordering is assumed known (in contrast to, for example, timeSVD++ [koren2009collaborative]).
Initially, sequence data in temporal order were usually modelled with Markov models, in which the future observation is conditioned on the last few observed items [rendle2010factorizing]. In [rendle2010factorizing], a personalized Markov model with user latent factors is proposed for more personalized results.
In recent years, deep learning techniques, borrowed from the natural language processing (NLP) literature, have become more widely used for tackling sequential data. Like sentences in NLP, sequence data in recommendations can similarly be modelled by recurrent neural network (RNN) [hidasi2015session, hidasi2018recurrent] and convolutional neural network (CNN) [tang2018personalized] models. More recently, attention models have attracted increasing interest in both NLP [vaswani2017attention, devlin2018bert] and recommender systems [liu2018stamp, kang2018self]. SASRec [kang2018self] is a recent method with state-of-the-art performance among the many deep learning models. Motivated by the Transformer model in neural machine translation [vaswani2017attention], SASRec utilizes an architecture similar to the encoder part of the Transformer model.
2.3 Regularization Techniques
In deep learning, models with many more parameters than data points can easily overfit the training data. This may prevent us from adding user embeddings as additional parameters to complicated models like the Transformer [kang2018self], which can easily have 20 layers with 6 self-attention blocks and millions of parameters for a medium-sized dataset like Movielens10M [harper2016movielens]. ℓ2 regularization [hoerl1970ridge] is the most widely used approach and has been used in many matrix factorization models in recommender systems; ℓ1 regularization [tibshirani1996regression] is used when a sparse model is preferred. For deep neural networks, it has been shown that ℓ2 regularization is often too weak, while dropout [hinton2012improving, srivastava2014dropout] is more effective in practice. There are many other regularization techniques, including parameter sharing [goodfellow2016deep], max-norm regularization [srebro2005maximum], gradient clipping [pascanu2013difficulty], etc. Very recently, a new regularization technique called Stochastic Shared Embeddings (SSE) [wu2019stochastic] was proposed as a new means of regularizing embedding layers. [wu2019stochastic] develops two versions of SSE, SSE-Graph and SSE-SE. We find that SSE-SE is essential to the success of our Personalized Transformer (SSE-PT) model.
3 Methodology
3.1 Temporal Collaborative Ranking
Let us formally define the temporal collaborative ranking problem: given n users, each engaging with a subset of m items in temporal order, the goal is to find an optimal personalized ranking of the top K items out of all m items for any given user at any given time point. We assume our data is in the format of n sequences of items that the users have interacted with so far, namely

(1)  s_i = (j_{i1}, j_{i2}, …, j_{iT}),  1 ≤ i ≤ n.
Each sequence of length T contains the indices of the last T items that user i has interacted with, in temporal order (from oldest to newest). For different users, the sequence lengths can be very different (we pad shorter sequences to obtain length T). We cannot simply split data points randomly into train/validation/test sets, because they come in temporal order. Instead, we need to make sure the training data temporally precedes the validation data, which in turn precedes the test data. We use the last item of each sequence as the test set, the second-to-last item as the validation set, and the rest as the training set. We use ranking metrics such as NDCG@K and Recall@K for evaluation, defined in (11) and (13).
3.2 Personalized Transformer Architecture
Our model is motivated by the Transformer model in [vaswani2017attention] and [kang2018self]. In the following sections, we examine each component of our Personalized Transformer (SSE-PT) model: the embedding layer, self-attention layer, point-wise feed-forward layer, prediction layer, layer normalization, dropout, weight decay, stochastic shared embeddings, and so on.
3.2.1 Embedding Layer
We define a learnable user embedding lookup table U ∈ R^{n×du} and item embedding lookup table V ∈ R^{m×di}, where n is the number of users, m is the number of items, and du, di are the numbers of hidden units for users and items respectively. We also specify a learnable positional encoding table P ∈ R^{T×d}, where d = du + di. Each input sequence s_i is then represented by the following embedding:

(2)  E = [e_1; e_2; …; e_T] ∈ R^{T×d},  e_t = [v_{j_{it}}; u_i] + p_t,

where [v_{j_{it}}; u_i] represents concatenating the item embedding v_{j_{it}} and the user embedding u_i into the embedding e_t for time t, and p_t is the positional encoding. Note that the main difference between our model and [kang2018self] is that we introduce the user embeddings u_i, making our model personalized.
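The concatenation in the embedding layer can be sketched with NumPy as follows (a minimal illustration with made-up dimensions; U, V and P denote the user, item and positional lookup tables):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, T = 4, 10, 5        # illustrative sizes
du, di = 3, 3                          # user / item embedding dims
d = du + di

U = rng.normal(size=(n_users, du))     # user embedding lookup table
V = rng.normal(size=(n_items, di))     # item embedding lookup table
P = rng.normal(size=(T, d))            # learnable positional encodings

def embed_sequence(user_id, item_ids):
    """Row t is [item embedding ; user embedding] + positional encoding,
    so the same user embedding is repeated at every time step."""
    user_part = np.tile(U[user_id], (len(item_ids), 1))
    E = np.concatenate([V[item_ids], user_part], axis=1)
    return E + P[: len(item_ids)]

E = embed_sequence(0, [1, 2, 3, 4, 5])
```

Repeating u_i at every position is what lets the user identity influence the self-attention layers, not just the prediction layer.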
3.2.2 Self-Attention Layer
The self-attention layer is defined as:

(3)  Z = SA(E) = Attention(E W^Q, E W^K, E W^V),

where W^Q, W^K, W^V ∈ R^{d×d} are learnable projection matrices and

(4)  Attention(Q, K, V) = softmax(Q K^T / √d) V.
The attention layer we actually use is a masked one, because we only want attention from the present to the past, not the opposite direction. Therefore, attention between query position t and key position t′ is masked out whenever t′ > t. We find that using bidirectional attention leads to significantly worse performance.
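A masked (causal) self-attention layer can be sketched in NumPy as follows; this is an illustrative single-head version, not the paper's TensorFlow implementation:

```python
import numpy as np

def masked_self_attention(E, Wq, Wk, Wv):
    """Scaled dot-product self-attention with a causal (lower-triangular)
    mask so that position t only attends to positions <= t."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    T = scores.shape[0]
    # Forbid attention to the future by setting those scores to -inf-like.
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(1)
T, d = 4, 6
E = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Z, A = masked_self_attention(E, Wq, Wk, Wv)
```

The attention matrix A is lower triangular, which is exactly the structure visualized in the attention heat maps later in the paper.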
3.2.3 Point-wise Feed-Forward Layer
After feeding the embeddings into the self-attention layer, we want to add nonlinearity to the resulting output Z for each sequence. Therefore, we add a point-wise feed-forward layer after the self-attention layer, consisting of two fully connected layers:

(5)  FFN(Z_t) = ReLU(Z_t W^{(1)} + b^{(1)}) W^{(2)} + b^{(2)},

where W^{(1)}, W^{(2)} are the weight matrices and b^{(1)}, b^{(2)} are the bias terms.
3.2.4 Self-Attention Blocks
We combine the self-attention layer and the point-wise feed-forward layer to form a self-attention (SA) block. One block consists of one self-attention layer and two fully connected layers. We can stack blocks by feeding the output of the first block in as the input of the second block, i.e.

(6)  Z^{(b)} = FFN(SA(Z^{(b−1)})),  Z^{(0)} = E.

We use B to denote the number of attention blocks.
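The block structure, a self-attention step followed by the point-wise feed-forward layer, stacked B times, can be sketched as follows. For brevity we pass the attention step in as a function and use a causal-averaging stand-in; a real model would plug in masked self-attention and add layer normalization and residual connections:

```python
import numpy as np

def ffn(Z, W1, b1, W2, b2):
    """Point-wise feed-forward layer: two fully connected layers with a
    ReLU in between, applied identically at every time step."""
    return np.maximum(0.0, Z @ W1 + b1) @ W2 + b2

def sa_block(Z, attn, params):
    """One self-attention block: attention followed by the point-wise FFN."""
    return ffn(attn(Z), *params)

def stack_blocks(E, attn, params_per_block):
    """Feed each block's output into the next; B = len(params_per_block)."""
    Z = E
    for params in params_per_block:
        Z = sa_block(Z, attn, params)
    return Z

rng = np.random.default_rng(2)
T, d, B = 5, 6, 2
E = rng.normal(size=(T, d))
# Stand-in attention: causal mean over past positions (illustration only).
causal_avg = lambda Z: np.tril(np.ones((len(Z), len(Z)))) @ Z / np.arange(1, len(Z) + 1)[:, None]
make_params = lambda: (rng.normal(size=(d, d)), np.zeros(d), rng.normal(size=(d, d)), np.zeros(d))
Z = stack_blocks(E, causal_avg, [make_params() for _ in range(B)])
```

Each block maps a T×d input to a T×d output, which is what makes the blocks stackable.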
3.2.5 Prediction Layer
As to the prediction layer, the predicted probability that user i engages item l at time t is:

(7)  p_{itl} = σ(r_{itl}),

where σ is the sigmoid function and r_{itl} is the predicted score of item l by user i at time point t, defined as:

(8)  r_{itl} = ⟨ z_t^{(B)}, [v_l; u_i] ⟩,

the inner product between the output of the last attention block at time t and the concatenation of the item and user output embeddings.
Although we could use a separate set of user and item embedding lookup tables for v_l and u_i in (8), we find it better to share the same lookup tables used in the embedding layer. To distinguish v_l and u_i in (8) from those in (2), we call the embeddings in (8) output embeddings and those in (2) input embeddings.
There are multiple ways to define the loss of our model; a previously popular choice is the BPR loss [rendle2009bpr, hidasi2018recurrent]:
(9)  − Σ_{i=1}^{n} Σ_{t=1}^{T−1} log σ( r_{it}^{+} − r_{it}^{−} ),

where σ is the sigmoid function, r_{it}^{+} is the predicted score of the positive item at time point t for user i, and r_{it}^{−} is the predicted score of a sampled negative item. At time point t, the positive item is the next item j_{i,t+1} in (1), and a negative item l is drawn from the items the user has not interacted with, i.e. l ∉ s_i.
We find the BPR loss does not perform as well as the binary cross-entropy loss in practice. The binary cross-entropy loss between the predicted probability for the positive item and one uniformly sampled negative item is −[log σ(r_{it}^{+}) + log(1 − σ(r_{it}^{−}))]. Summing over users i and time points t, the objective function we want to minimize is:

(10)  − Σ_{i=1}^{n} Σ_{t=1}^{T−1} [ log σ(r_{it}^{+}) + log(1 − σ(r_{it}^{−})) ].
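The binary cross-entropy objective over (positive, sampled-negative) score pairs can be sketched as follows (the scores here are made-up numbers, one pair per (user, time) position):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_objective(pos_scores, neg_scores, eps=1e-12):
    """Binary cross-entropy with one sampled negative per positive:
    -sum[ log sigma(r_pos) + log(1 - sigma(r_neg)) ]."""
    return -np.sum(np.log(sigmoid(pos_scores) + eps)
                   + np.log(1.0 - sigmoid(neg_scores) + eps))

# Scores for three (user, time) pairs; positives should score high.
pos = np.array([2.0, 1.5, 3.0])
neg = np.array([-1.0, 0.0, -2.0])
loss = bce_objective(pos, neg)
```

The loss decreases as the model pushes positive scores up and negative scores down, which is the behavior the objective is meant to reward.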
At inference time, top-K recommendations for user i at time t can be made by sorting the scores r_{itl} over all items and recommending the first K items in the sorted list.
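Top-K inference from the scores in (8) can be sketched as follows; the dimensions and names are illustrative, not from the paper's code:

```python
import numpy as np

def top_k_recommendations(hidden_state, output_item_emb, user_emb, k):
    """Score every item as the inner product of the top-block output at
    the current time with the concatenated [item ; user] output embedding,
    then return the k highest-scoring item indices."""
    n_items = output_item_emb.shape[0]
    concat = np.concatenate(
        [output_item_emb, np.tile(user_emb, (n_items, 1))], axis=1)
    scores = concat @ hidden_state
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(3)
n_items, du, di = 8, 3, 3
V_out = rng.normal(size=(n_items, di))   # output item embedding table
u = rng.normal(size=du)                   # output user embedding
z_t = rng.normal(size=di + du)            # top-block output, last time step
recs = top_k_recommendations(z_t, V_out, u, k=3)
```

In a real system one forward pass produces z_t, and the item scoring reduces to a single matrix-vector product.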
3.3 Personalized Transformer Regularization Techniques
3.3.1 Layer Normalization
Layer normalization [ba2016layer] normalizes the neurons within a layer. Previous studies [ba2016layer] show it is more effective than batch normalization for training recurrent neural networks (RNNs). An alternative is batch normalization [ioffe2015batch], but we find it does not work as well as layer normalization in practice, even for a reasonably large batch size of 128. Therefore, our SSE-PT model adopts layer normalization.
3.3.2 Residual Connections
Residual connections were first proposed in ResNet for image classification problems [he2016deep]. Recent research finds that residual connections can help in training very deep neural networks even when they are not convolutional [vaswani2017attention]. Using residual connections allows us to train very deep neural networks here as well. For example, the best performing model for the Movielens10M dataset in Table 10 is SSE-PT with 6 attention blocks, whose layers are all trained end-to-end.
3.3.3 Weight Decay
Weight decay [krogh1992simple], also known as ℓ2 regularization [hoerl1970ridge], is applied to all embeddings, including both user and item embeddings.
3.3.4 Dropout
Dropout [srivastava2014dropout] is applied to the embedding layer, self-attention layer and point-wise feed-forward layer by stochastically dropping some percentage of hidden units to prevent co-adaptation of neurons. Dropout has been shown to be an effective way of regularizing deep learning models.
3.3.5 Stochastic Shared Embeddings
Unlike the previous SASRec model [kang2018self], we use one more regularization technique in our SSE-PT model, specifically for the embedding layers, in addition to the ones listed above: Stochastic Shared Embeddings (SSE) [wu2019stochastic]. The reason we use this additional technique is that we find the well-known regularization techniques, layer normalization, dropout and weight decay, cannot prevent the model from overfitting badly once user embeddings are introduced. Applying SSE-SE to our SSE-PT model makes it possible to train this personalized model with more parameters.
The main idea of SSE is to stochastically replace embeddings with other embeddings during SGD, which has the effect of regularizing the embedding layers. Specifically, SSE-SE replaces one embedding with another, uniformly chosen, embedding with probability p, called the SSE probability in [wu2019stochastic]. There are 3 different places in our model where SSE-SE can be applied: input/output user embeddings, input item embeddings, and output item embeddings, with probabilities p_u, p_i and p_y respectively. Note that the input user embedding and output user embedding are always replaced at the same time, with SSE probability p_u. Empirically, we find that applying SSE-SE to user embeddings and output item embeddings always helps, but SSE-SE on input item embeddings is useful only when the average sequence length is large, e.g. more than 100 as in the Movielens1M and Movielens10M datasets.
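SSE-SE is simple to implement at the index-lookup level: before looking up an embedding, replace its index with a uniformly sampled one with probability p. A sketch (ours, not the released code):

```python
import numpy as np

def sse_se_lookup(indices, table_size, p, rng):
    """Stochastic Shared Embeddings (SSE-SE): during training, each
    embedding index is replaced, with probability p, by an index drawn
    uniformly from the table, so the corresponding embedding row is
    swapped for another one. At inference time this is disabled."""
    indices = np.asarray(indices).copy()
    replace = rng.random(indices.shape) < p
    indices[replace] = rng.integers(0, table_size, size=replace.sum())
    return indices

rng = np.random.default_rng(4)
item_ids = np.arange(1000) % 50            # a batch of item indices
noisy_ids = sse_se_lookup(item_ids, table_size=50, p=0.1, rng=rng)
frac_replaced = np.mean(noisy_ids != (np.arange(1000) % 50))
```

About a fraction p of lookups end up using a different row of the embedding table, which is what ties the embeddings together and regularizes them.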
In summary, layer normalization and dropout are used in all layers except the prediction layer. Residual connections are used in both the self-attention layer and the point-wise feed-forward layer. SSE-SE is used in the embedding layer and the prediction layer.
3.4 Handling Long Sequences: SSE-PT++
To handle extremely long sequences, a slight modification can be made to the way the input sequences s_i are fed into the SSE-PT neural network. We call the enhanced model SSE-PT++ to distinguish it from the standard SSE-PT model, which cannot handle sequences longer than T.
Sometimes we want to make use of extremely long sequences of length t > T. However, our SSE-PT model can only handle sequences of maximum length T. The simplest remedy is to sample a starting index v uniformly and use the length-T window beginning at v. Although sampling the starting index uniformly can accommodate long sequences, it does not work very well in practice, because uniform sampling ignores the importance of recent items in a long sequence. To solve this dilemma, we introduce an additional hyper-parameter p_s, which we call the sampling probability. With probability p_s, we sample the starting index v uniformly and use the length-T window starting at v as input; with probability 1 − p_s, we simply use the most recent T items as input. If the sequence is already no longer than T, we simply use the whole sequence for that user.
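The SSE-PT++ input construction can be sketched as follows (names are ours):

```python
import numpy as np

def sample_input_window(sequence, T, p_sample, rng):
    """SSE-PT++ input construction: with probability p_sample, start the
    length-T window at a uniformly chosen earlier index; otherwise use
    the most recent T items. Sequences no longer than T are used whole."""
    t = len(sequence)
    if t <= T:
        return sequence
    if rng.random() < p_sample:
        start = rng.integers(0, t - T + 1)
        return sequence[start:start + T]
    return sequence[-T:]

rng = np.random.default_rng(5)
seq = list(range(500))                     # a long engagement history
window = sample_input_window(seq, T=100, p_sample=0.3, rng=rng)
```

Most of the time the model still sees the most recent items, while the occasional earlier window exposes it to the rest of a long history.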
Our proposed SSE-PT++ model can work almost as well as SSE-PT with a much smaller T. One can see in Table 4 that SSE-PT++ with maximum length 100 performs almost as well as SSE-PT with maximum length 200. The time complexity of the SSE-PT model is of order O(T²d), so reducing T by one half leads to a theoretical 4x speedup in training and inference. As to space complexity, both SSE-PT and SSE-PT++ are of the same order, dominated by the user and item embedding lookup tables. When the numbers of users and items scale up, Tensorflow will automatically store the user and item embedding lookup tables in RAM instead of GPU memory.
dataset  #users  #items  avg sequence len  max sequence len
Beauty  52,024  57,289  7.6  291
Games  31,013  23,715  7.3  858
Steam  334,730  13,047  11.0  1,229
ML-1M  6,040  3,416  163.5  2,275
ML-10M  69,878  65,133  141.1  7,357
4 Experiments
In this section, we compare our proposed algorithms, the Personalized Transformer (SSE-PT) and SSE-PT++, with other state-of-the-art algorithms on real-world datasets. We implement our code in Tensorflow and conduct all experiments on a server with a 40-core Intel Xeon E5-2630 v4 @ 2.20GHz CPU, 256G RAM and Nvidia GTX 1080 GPUs.
4.1 Datasets
We use 5 datasets. The first 4 have exactly the same train/dev/test splits as in [kang2018self]:

- Beauty: the Beauty category from the Amazon product review datasets (http://jmcauley.ucsd.edu/data/amazon/).
- Games: the Games category from the same source.
- Steam: a dataset introduced in [kang2018self], containing reviews crawled from a large video game distribution platform.
- Movielens-1M [harper2016movielens]: a widely used benchmark dataset containing one million user movie ratings.
- Movielens-10M: from the same source, with ten million user ratings.
Detailed dataset statistics are given in Table 1. One can easily see that the first 3 datasets have very short sequences while the last 2 datasets have very long sequences.
4.2 Evaluation Metrics
The evaluation metrics we use are standard ranking metrics, namely NDCG@K and Recall@K for top-K recommendations:

- NDCG@K: defined as

(11)  NDCG@K = (1/n) Σ_{i=1}^{n} DCG@K(i, Π_i) / DCG@K(i, Π*_i),

where i indexes users and

(12)  DCG@K(i, Π_i) = Σ_{k=1}^{K} (2^{R_{i,Π_i(k)}} − 1) / log2(k + 1).

In the DCG definition, Π_i(k) represents the index of the k-th ranked item for user i in the test data, based on the learned scores. R is the rating matrix and R_{il} is the rating given to item l by user i. Π*_i is the ordering provided by the ground-truth ratings.

- Recall@K: defined as the fraction of positive items retrieved by the top-K recommendations the model makes:

(13)  Recall@K = (1/n) Σ_{i=1}^{n} 1{ the positive item for user i is ranked in the top K },

where we assume there is a single positive item that the user will engage next, and the indicator function checks whether this positive item falls into the top K positions of the ranked list obtained from the scores predicted in (8).
In the temporal collaborative ranking setting, at a given time point the rating matrix can be formed in two ways: one is to include all ratings after that time point; the other is to include only the rating at the next time point. We use the latter, which is the same setting as [kang2018self]. For a large dataset with numerous users and items, the evaluation procedure would be slow, because (11) requires computing the ranking of all items based on their predicted scores for every single user. To speed up evaluation, we sample a fixed number of negative candidates while always keeping the positive item that we know the user will engage next. This way, both the ranking and the rating matrix are narrowed down to a small set of item candidates, and prediction scores only need to be computed for those items through a single forward pass of the neural network.
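The sampled-candidate evaluation of NDCG@K and Recall@K for a single held-out positive item can be sketched as follows (a minimal illustration; in this one-positive setting NDCG reduces to 1/log2(rank + 1)):

```python
import numpy as np

def ndcg_recall_at_k(pos_rank, k):
    """NDCG@k and Recall@k with exactly one held-out positive item:
    both are 0 if the positive falls outside the top k; otherwise
    Recall@k is 1 and NDCG@k is 1 / log2(rank + 1), ranks starting at 1."""
    if pos_rank > k:
        return 0.0, 0.0
    return 1.0 / np.log2(pos_rank + 1), 1.0

def evaluate_user(scores, pos_idx, k):
    """Rank the positive item among the sampled candidates by score."""
    pos_rank = 1 + np.sum(scores > scores[pos_idx])
    return ndcg_recall_at_k(pos_rank, k)

rng = np.random.default_rng(6)
scores = rng.normal(size=101)          # 1 positive + 100 sampled negatives
scores[0] = scores.max() + 1.0         # put the positive on top
ndcg, recall = evaluate_user(scores, pos_idx=0, k=10)
```

Averaging these per-user values over all users yields the NDCG@K and Recall@K numbers reported in the tables.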
Ideally, we want both NDCG@K and Recall@K to be exactly 1: NDCG@K = 1 means the positive item is always put in the first position of the top-K ranking list, and Recall@K = 1 means the positive item is always contained in the top-K recommendations the model makes. In our evaluation procedure, a larger number of sampled negatives or a smaller K makes the recommendation problem harder, because it implies the candidate pool is larger or a higher ranking quality is required.
DATASET  METRIC  (A) POPREC  (B) BPR  (C) FMC  (D) FPMC  (E) TRANSREC  (F) GRU4REC  (G) STAMP  (H) GRU4REC+  (I) CASER  (J) SASREC  (K) SSE-PT  % GAIN OVER (A)-(E) / (F)-(J)

BEAUTY  Recall@10  0.4003  0.3775  0.3771  0.4310  0.4607  0.2125  0.3207  0.3949  0.4264  0.4837  0.5028  13.0 / 3.6
BEAUTY  NDCG@10  0.2277  0.2183  0.2477  0.2891  0.3020  0.1203  0.1801  0.2556  0.2547  0.3220  0.3370  11.6 / 4.7
GAMES  Recall@10  0.4724  0.4853  0.6358  0.6802  0.6838  0.2938  0.4358  0.6599  0.5282  0.7434  0.7757  13.4 / 4.3
GAMES  NDCG@10  0.2779  0.2875  0.4456  0.4680  0.4557  0.1837  0.2615  0.4759  0.3214  0.5401  0.5660  20.3 / 4.8
STEAM  Recall@10  0.7172  0.7061  0.7731  0.7710  0.7624  0.4190  0.6982  0.8018  0.7874  0.8732  0.8772  13.5 / 0.4
STEAM  NDCG@10  0.4535  0.4436  0.5193  0.5011  0.4852  0.2691  0.4296  0.5595  0.5381  0.6293  0.6378  22.8 / 1.4
ML-1M  Recall@10  0.4329  0.5781  0.6986  0.7599  0.6413  0.5581  0.7255  0.7501  0.7886  0.8233  0.8358  10.0 / 1.5
ML-1M  NDCG@10  0.2377  0.3287  0.4676  0.5176  0.3969  0.3381  0.4813  0.5513  0.5538  0.5936  0.6191  19.6 / 4.3

Methods  NDCG@10  Recall@10  user dim  item dim

SASREC  0.5936  0.8233  N/A  50
SASREC  0.5919  0.8202  N/A  100
SSE-PT  0.6191  0.8358  50  50
SSE-PT  0.6281  0.8341  50  100
SSE-PT++  0.6292  0.8389  50  100

Methods  NDCG@10  Recall@10  Max Len  user dim  item dim
SASREC  0.5919  0.8202  200  N/A  100
SSE-PT  0.6281  0.8341  200  50  100
SASREC  0.5769  0.8045  100  N/A  100
SSE-PT  0.6142  0.8212  100  50  100
SSE-PT++  0.6186  0.8318  100  50  100
4.3 Baselines
We include 5 non-deep-learning and 5 deep-learning algorithms in our comparisons:
4.3.1 Non-deep-learning Baselines

- PopRec: ranking items according to their popularity.
- BPR: Bayesian personalized ranking for the implicit feedback setting [rendle2009bpr]. It is a low-rank matrix factorization model with a pairwise loss function, but it does not utilize temporal information. Therefore, it serves as a strong non-temporal baseline.
- FMC: Factorized Markov Chains: a first-order Markov chain method in which predictions are made based only on the previously engaged item.
- FPMC: a personalized Markov chain model [rendle2010factorizing] that combines matrix factorization and a first-order Markov chain to take advantage of both users' latent long-term preferences and short-term item transitions.
- TransRec: a first-order sequential recommendation method [he2017translation] in which items are embedded into a transition space and users are modelled as translation vectors operating on item sequences.

SQL-Rank [wu2018sql] and item-based recommendations [sarwar2001item] are omitted because the former is similar to BPR [rendle2009bpr] except for using a listwise instead of a pairwise loss function, and the latter has been shown to be inferior to TransRec [he2017translation].
Movielens-1M: Model  NDCG@10  Recall@10  user dim  item dim  # blocks  sampling prob  SSE-SE parameters
SASRec  0.5961  0.8195  -  50  2  -  -
SASRec  0.5941  0.8182  -  100  2  -  -
SASRec  0.5996  0.8272  -  100  6  -  -
SSE-PT  0.6101  0.8343  50  50  2  -  0.92 / 0.1 / 0
SSE-PT  0.6164  0.8336  50  50  2  -  0.92 / 0 / 0.1
SSE-PT  0.5832  0.8091  50  50  2  -  0 / 0.1 / 0.1
SSE-PT  0.6174  0.8351  50  50  2  -  0.92 / 0.1 / 0.1
SSE-PT  0.5949  0.8205  75  25  2  -  0.92 / 0.1 / 0.1
SSE-PT  0.6214  0.8359  25  75  2  -  0.92 / 0.1 / 0.1
SSE-PT  0.6281  0.8341  50  100  2  -  0.92 / 0.1 / 0.1
SSE-PT++  0.6292  0.8389  50  100  2  0.3  0.92 / 0.1 / 0.1
Regularization  NDCG@10  GAIN (%)  Recall@10  GAIN (%)

NO REG (BASELINE)  0.4855  -  0.6500  -
PS  0.5065  4.3  0.6656  2.4
PS (JOB)  0.4938  1.7  0.6570  1.1
PS (GENDER)  0.5110  5.3  0.6672  2.6
PS (AGE)  0.5133  5.7  0.6743  3.7
ℓ2  0.5149  6.0  0.6786  4.4
DROPOUT  0.5165  6.4  0.6823  5.0
ℓ2 + DROPOUT  0.5293  9.0  0.6921  6.5
SSE-SE  0.5393  11.1  0.6977  7.3
ℓ2 + SSE-SE + DROPOUT  0.5870  20.9  0.7442  14.5
SASRec (ℓ2 + DROPOUT)  0.5601  -  0.7164  -

Movielens-1M: Model  NDCG  Hit Ratio  user dim  item dim  # blocks  SSE-SE parameters
SASRec  0.7268  0.9429  -  50  2  -
SASRec  0.7413  0.9474  -  100  2  -
SSE-PT  0.7199  0.9331  50  100  2  PS / 0.01 / 0.01
SSE-PT  0.7169  0.9296  50  100  2  0.0 / 0.01 / 0.01
SSE-PT  0.7398  0.9418  50  100  2  0.2 / 0.01 / 0.01
SSE-PT  0.7500  0.9500  50  100  2  0.4 / 0.01 / 0.01
SSE-PT  0.7484  0.9480  50  100  2  0.6 / 0.01 / 0.01
SSE-PT  0.7529  0.9485  50  100  2  0.8 / 0.01 / 0.01
SSE-PT  0.7503  0.9505  50  100  2  1.0 / 0.01 / 0.01
4.3.2 Deep-learning Baselines

- GRU4Rec: the first RNN-based method proposed for the session-based recommendation problem [hidasi2015session]. It utilizes the GRU structure [chung2014empirical] initially proposed for speech modelling.
- GRU4Rec+: follow-up work on GRU4Rec by the same authors; the model has a very similar architecture to GRU4Rec but a more complicated loss function [hidasi2018recurrent].
- Caser: a CNN-based method [tang2018personalized] which embeds a sequence of recent items in both time and latent spaces, forming an 'image' before learning local features through horizontal and vertical convolutional filters. In [tang2018personalized], user embeddings are included in the prediction layer only. In contrast, in our Personalized Transformer, user embeddings are also introduced in the lowest embedding layer so they can play an important role in the self-attention mechanism as well as in the prediction stage.
- STAMP: a session-based recommendation algorithm [liu2018stamp] using an attention mechanism. [liu2018stamp] only uses fully connected layers with one attention block that is not self-attentive.
- SASRec: a self-attentive sequential recommendation method [kang2018self] motivated by the Transformer in NLP [vaswani2017attention]. Unlike our SSE-PT, SASRec does not incorporate user embeddings and is therefore not a personalized method. The SASRec paper [kang2018self] also does not utilize SSE [wu2019stochastic] for further regularization: only dropout and weight decay are used.
4.4 Comparison Results
We use the same datasets as in [kang2018self] and follow the same procedure in that paper: the last item of each user's sequence is used as test data, the second-to-last as validation data, and the rest as training data. We implemented our method in Tensorflow and solve the optimization with the Adam optimizer [kingma2014adam]. In Table 3, since we use the same data, the performance numbers of previous methods except STAMP are those reported in [kang2018self]. We tune the dropout rate and the SSE probabilities for input user/item embeddings and output embeddings on the validation sets, and report the best NDCG@10 and Recall@10 on the test sets. As mentioned before, we sample negative items to speed up the evaluation.
For a fair comparison, we restrict all algorithms to use up to 50 hidden units for item embeddings. For the SSE-PT and SASRec models, we use the same number of 2 attention blocks and set the maximum length T = 200 for the Movielens-1M dataset and a smaller maximum length for the other, sparser datasets. We use top-K with K = 10 and 100 sampled negatives in the evaluation procedure. One can easily see in Table 2 that our proposed SSE-PT has the best performance among all previous methods on all four datasets considered. On most datasets, our SSE-PT improves NDCG by more than 4% when compared with SASRec [kang2018self] and more than 20% when compared with non-deep-learning methods.
When we relax these constraints, we find that increasing the number of attention blocks and hidden units allows our SSE-PT model to perform even better than in Table 2. In Table 5, when we increase the item embedding dimension from 50 to 100, our SSE-PT achieves 0.6281 NDCG@10 and SSE-PT++ achieves an even higher 0.6292, while SASRec drops to 0.5919 from 0.5936.
To show the effectiveness of SSE-PT++, we decrease the maximum length allowed from 200 to 100. We find in Table 4 that SSE-PT++ suffers the least, with NDCG@10 dropping to 0.6186 from 0.6281 and Recall@10 dropping to 0.8318 from 0.8341. The one that suffers the most is SASRec: its NDCG@10 drops to 0.5769 from 0.5919 and its Recall@10 drops to 0.8045 from 0.8202.
We vary the tuning parameters in Table 5, including user/item embedding dimensions, the number of attention blocks, the SSE probabilities for SSE-PT, and the sampling probability for SSE-PT++. It is obvious from Table 5 and Table 7 that these hyper-parameters play an important role in final prediction performance. For our SSE-PT model, a larger item dimension helps improve recommendations, but this is not the case for the baseline SASRec. Also, using SSE-SE in all three places achieves the best recommendation performance on the Movielens-1M dataset in Table 5. One can easily see from Table 7 that applying SSE-SE to the input user embeddings is again crucial to ensure a properly regularized model. SSE-SE, together with dropout and weight decay, is the best choice for regularization, as is evident from Table 6. In practice, these SSE probabilities, just like the dropout rate, can be treated as tuning parameters and easily tuned.
4.5 Attention Maps for Input Embeddings
Apart from evaluating our SSE-PT against SASRec using well-defined ranking metrics on 5 datasets, we use 2 other ways to visualize the comparison. The first is to visualize and compare the attention maps of both methods. Note that the attention map is a lower triangular matrix, since we only allow attention from the present to the past, not to the future. The attention maps for the first layer in Figure 3 show that our SSE-PT pays more attention to recent items in a long sequence than SASRec does. This is evident from comparing the attention intensity levels of the two plots (bottom right).
As our second way of visualizing the comparison, we examine some random users' engagement histories and the top recommendations the two models give. In Figure 2, a random user's engagement history in the Movielens-1M dataset is given in temporal order (column-wise). We hide the last item, whose index is 26, in the test set and hope that a temporal collaborative ranking model can figure out that item-26 is the one this user will watch next, using only the previous engagement history. One can see that a typical user tends to watch different styles of movies at different times. Earlier on, this user watched a variety of movies, including Sci-Fi, animation, thriller, romance, horror, action, comedy and adventure. Later on, in the last two columns of Figure 2, drama and thriller are the two genres they watch most, especially drama: they watched 9 drama movies out of the 10 most recent. It is not surprising that the item we hide from the models is also a drama. In the top-5 recommendations given by our SSE-PT, the hidden item-26 is put in the first place. Intelligently, SSE-PT recommends 3 drama movies and 2 thriller movies, interleaved in position. Interestingly, the top recommendation is 'Othello', which, like the recently watched 'Richard III', is an adaptation of a Shakespeare play, and this dependence is reflected in the attention weights. In contrast, SASRec's top-5 recommendations are not personalized enough: it recommends a variety of action, Sci-Fi, comedy, horror, and drama movies, but none of them match item-26. Although this user watched all of these genres in the past, they no longer do, as one can easily tell from the recent history. Unfortunately, SASRec cannot capture this and does not focus its recommendations for this user on drama and thriller movies. What we see in this particular example is consistent with our earlier findings from examining the attention maps.
Attention heat maps for both models during inference are included in Figure 2. It is easy to see that the SSE-PT model agrees with human reasoning that more emphasis should be placed on recent movies.
Table 8: Effect of the user-side SSE-SE probability on ranking performance.

User-Side SSE-SE Probability   NDCG@10   Recall@10
Parameter Sharing              0.6188    0.8294
1.0                            0.6258    0.8346
0.9                            0.6275    0.8321
0.8                            0.6244    0.8359
0.6                            0.6256    0.8341
0.4                            0.6237    0.8369
0.2                            0.6163    0.8281
0.0                            0.5908    0.8048

Table 9: Effect of the sampling probability on SSE-PT++.

Sampling Probability   NDCG@10   Recall@10
SASRec ()              0.5769    0.8045
SSE-PT ()              0.6142    0.8212
1.0                    0.5697    0.7977
0.8                    0.5735    0.7801
0.6                    0.6062    0.8242
0.4                    0.6113    0.8273
0.3                    0.6186    0.8318
0.2                    0.6193    0.8233
0.0                    0.6142    0.8212

Table 10: Effect of the number of attention blocks.

Dataset         # of blocks         NDCG@10   Recall@10
Movielens-1M    SASRec (6 blocks)   0.5984    0.8207
                1                   0.6162    0.8301
                2                   0.6280    0.8365
                3                   0.6293    0.8376
                4                   0.6270    0.8401
                5                   0.6308    0.8361
                6                   0.6270    0.8397
Movielens-10M   SASRec (6 blocks)   0.7531    0.9490
                1                   0.7454    0.9478
                2                   0.7512    0.9522
                3                   0.7543    0.9491
                4                   0.7608    0.9485
                5                   0.7619    0.9524
                6                   0.7683    0.9537
Table 11: Un-personalized vs. personalized models with varying numbers of sampled negatives during evaluation.

Model            NDCG@10   Recall@10   # of Negatives
Un-Personalized  0.3787    0.6119      500
Personalized     0.3846    0.6171      500
Un-Personalized  0.2791    0.4781      1000
Personalized     0.2860    0.4929      1000
Un-Personalized  0.1939    0.3515      2000
Personalized     0.1993    0.3667      2000
4.6 Training Speeds
In [kang2018self], it has been shown that SASRec is about 11 times faster than Caser and 17 times faster than GRU4Rec while achieving much better NDCG@10 results, so we do not include Caser and GRU4Rec in our comparisons. We therefore only compare the training speeds and ranking performances of SASRec, SSE-PT and SSE-PT++. Given that we add additional user embeddings to our SSE-PT model, it is expected to take slightly longer to train than the un-personalized SASRec. In Figure 4, different maximum sequence lengths are used for SSE-PT++ and for SSE-PT and SASRec. We find empirically that the training speeds of SSE-PT and SSE-PT++ are comparable to that of SASRec, with SSE-PT++ being the fastest and best-performing model. It is clear from Figure 4 that our SSE-PT and SSE-PT++ achieve much better ranking performances than the SASRec baseline for the same training time.
4.7 Ablation Study
4.7.1 SSE probability
Given the importance of SSE regularization for our SSE-PT model, we carefully examine the SSE probability for the input user embeddings in Table 8. We find that the results are not overly sensitive to this hyper-parameter: any SSE probability between 0.4 and 1.0 gives good results, better than both parameter sharing and not using SSE-SE at all. This is also evident from the comparison results in Table 6.
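For concreteness, SSE-SE on the user side can be sketched as follows: during training, each user index is replaced with a uniformly random user index with the SSE probability `p`, so that user embeddings are stochastically shared across users. This is a minimal illustration of the idea, not the paper's exact implementation:

```python
import numpy as np

def sse_se_user_indices(user_ids: np.ndarray, num_users: int,
                        p: float, rng: np.random.Generator) -> np.ndarray:
    """SSE-SE on the user side (training time only): with probability
    `p`, replace each user index with one drawn uniformly at random
    before the embedding lookup. `p = 0.0` recovers the unregularized
    model; `p = 1.0` replaces every index."""
    replace = rng.random(user_ids.shape) < p
    random_ids = rng.integers(0, num_users, size=user_ids.shape)
    return np.where(replace, random_ids, user_ids)
```

The perturbed indices are then fed to the ordinary embedding lookup; at inference time the true indices are used unchanged.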
4.7.2 Sampling Probability
Recall that the sampling probability is unique to our SSE-PT++ model. We show in Table 9 that using an appropriate sampling probability, such as 0.2 to 0.3, allows SSE-PT++ to outperform SSE-PT when the same maximum sequence length is used.
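One plausible sketch of how such a sampling probability can act on long histories (a hedged illustration, not necessarily the paper's exact scheme): with probability `p`, the training sub-sequence starts at a random position, exposing the model to older portions of the history; otherwise only the most recent `max_len` items are kept, as in SASRec and SSE-PT.

```python
import numpy as np

def truncate_sequence(items: list, max_len: int, p: float,
                      rng: np.random.Generator) -> list:
    """Hedged sketch of SSE-PT++-style truncation for long sequences:
    with probability `p`, start from a random valid position; otherwise
    keep the most recent `max_len` items. Sequences already within
    `max_len` are returned unchanged."""
    if len(items) <= max_len:
        return list(items)
    if rng.random() < p:
        start = int(rng.integers(0, len(items) - max_len + 1))
    else:
        start = len(items) - max_len
    return list(items[start:start + max_len])
```

Under this reading, `p = 0` reduces SSE-PT++ to SSE-PT, which matches the identical `p = 0.0` and SSE-PT rows in Table 9.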
4.7.3 Number of Attention Blocks
We find that for our SSE-PT model, a larger number of attention blocks is preferred. One can easily see in Table 10 that the optimal ranking performances are achieved at 5 blocks for the Movielens-1M dataset and at 6 blocks for the Movielens-10M dataset.
4.7.4 Number of Negatives Sampled
We want to make sure that neither the number of negatives sampled during evaluation nor differences in the usage of regularization techniques affects our final conclusion. We therefore add another set of experiments that removes personalization from our SSE-PT model while keeping all the regularization techniques we used. Based on the results in Table 11, we are confident that the personalized model always outperforms the un-personalized one, even when both use the same regularization techniques. This holds true regardless of how many negatives are sampled during evaluation.
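As a sketch of the sampled-negatives evaluation protocol (the standard setup in this line of work; the exact details here are illustrative assumptions): the held-out item is scored against N sampled negatives, its 1-based rank among the N+1 candidates is computed, and with a single relevant item NDCG@k reduces to 1/log2(rank+1) while Recall@k reduces to a hit indicator.

```python
import numpy as np

def ndcg_and_recall_at_k(rank: int, k: int = 10):
    """Per-user metrics when there is exactly one relevant item.
    `rank` is the 1-based position of the held-out item among the
    held-out item plus the sampled negatives."""
    if rank <= k:
        return 1.0 / np.log2(rank + 1), 1.0
    return 0.0, 0.0

def evaluate(scores_true, scores_neg, k=10):
    """`scores_true`: (U,) model score of the held-out item per user;
    `scores_neg`: (U, N) scores of the N sampled negatives per user.
    Returns mean NDCG@k and mean Recall@k over users."""
    ndcgs, hits = [], []
    for s_t, s_n in zip(scores_true, scores_neg):
        rank = 1 + int((s_n > s_t).sum())  # 1-based rank among N + 1 items
        n, h = ndcg_and_recall_at_k(rank, k)
        ndcgs.append(n)
        hits.append(h)
    return float(np.mean(ndcgs)), float(np.mean(hits))
```

With more negatives sampled (500, 1000, 2000 in Table 11), ranking the true item highly becomes harder, which is why the absolute metric values shrink as N grows while the personalized-vs-un-personalized gap persists.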
5 Conclusion
In this paper, we propose a novel neural network architecture, the Personalized Transformer, for the temporal collaborative ranking problem. It enjoys the benefits of being a personalized model, and therefore achieves better ranking results for individual users than the current state-of-the-art. By examining the attention mechanisms during inference, we also find that the model is more interpretable and tends to pay more attention to recent items in long sequences than un-personalized deep learning models.