Introduction
Recommendation is one of the key application areas of artificial intelligence in the big data era. Recommendation tasks are supported by large-scale data, and users need to select a specific item from many alternatives. This selection requirement motivates the use of attention mechanisms in recommendation. Attention is applied to item selection; in particular, sequential recommendation uses attention to select which past item choices to consider when making the recommendation at the current timestep [24, 15, 13, 26, 27, 11].
Given the relationship between attention and recommendation, adapting new attention mechanisms to recommendation has been a research trend. For instance, Self-Attentive Sequential Recommendation (SASRec) [12] adopted the self-attention mechanism of the Transformer [23] for the recommendation task. This adaptation is interesting, but it was only lightly customized to the specifics of the task. Recommendation often requires understanding items, users, browsing sequences, etc., and recommendation models need to consider such contexts, which SASRec does not provide. Following SASRec, there have been developments in using the self-attention mechanism of the Transformer to model task-specific features of sequential recommendation. For example, ATRank [30] utilized self-attention to consider the influences from heterogeneous behavior representations. To model the user's short-term intent, AttRec [29] applied self-attention to the user interaction history. Similar to ATRank and AttRec, BST [5] used self-attention to aggregate auxiliary user and item features.
Given the success of self-attention [21, 6, 28], the recommendation task can be improved by exploiting sequential information, which previous works used only to a limited extent. Moreover, using the sequential information provides a new way to customize the self-attention structure for the recommendation task. Figure 1 gives an example in which co-occurrence information may influence the attention weight. It is common to see a movie pair with a higher co-occurrence than others, and such a pair should inform the attention mechanism to increase the weight.
We renovate and customize the self-attention of the Transformer with a latent space model. Specifically, we add a latent space to the self-attention value of the Transformer, and we use the latent space to model the context from relations in the recommendation task. The latent space is modeled as a multivariate skew-normal (MSN) distribution [2] whose dimension is the number of unique items in the sequence. We model the covariance matrix of the MSN distribution, which captures the relations among a sequence, items, and a user, with a kernel function that provides flexibility in adapting to the recommendation task. After the kernel modeling, we provide a reparametrization of the MSN distribution to enable amortized inference on the introduced latent space. Since the relation modeling is done with kernelization, we call this model relation-aware kernelized self-attention (RKSA).
We designed RKSA with three innovations. First, the deterministic Transformer may not work well in the generalized task of recommendation because of sparsity, so we added a latent dimension and its corresponding reparameterization. Second, the covariance modeling with the relation-aware kernel enables a more fundamental adaptation of self-attention to recommendation. Third, the kernelized latent space of the self-attention provides a way to reason about the recommendation result. RKSA is evaluated against eight baseline models, including SASRec, HCRNN, and NARM, on five benchmark datasets, including Amazon review, MovieLens, and Steam. Our experiments showed that RKSA consistently and significantly improves performance over the baselines on the benchmarks.
Preliminary
Multi-Head Attention
We start the preliminaries by reviewing the self-attention structure that is the backbone of RKSA. [23] proposed the scaled dot-product attention, defined in Equation 1, where $Q$, $K$, and $V$ are the query, key, and value matrices, respectively. The scaled dot-product attention calculates importance weights from the dot product of the queries with the keys, scaled by $1/\sqrt{d}$, where $d$ is the dimensionality of the keys. These importance weights are normalized by the softmax, and the normalized weights are then multiplied by the values to form the scaled dot-product attention.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V \tag{1}$$
When the query, the key, and the value are all computed from the same input matrix $X$, as in Equation 2, the scaled dot-product attention is called self-attention. Self-attention with an additional predefined or learnable positional embedding [23, 12] is able to capture the latent information of the position, as earlier recurrent networks do.
$$\mathrm{SA}(X) = \mathrm{Attention}(XW^Q, XW^K, XW^V) \tag{2}$$
Multi-head attention uses $h$ scaled dot-product attentions whose attention weight parameters have an $h$-times smaller dimension. [23] found that multi-head attention is useful even though it uses a similar number of parameters to single-head attention.
$$\mathrm{MultiHead}(X) = [\mathrm{SA}_1(X); \ldots; \mathrm{SA}_h(X)]\,W^O \tag{3}$$
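The attention definitions above can be sketched in a few lines of NumPy. This is a minimal illustration of the standard formulation from [23], not the RKSA implementation; the function names and the per-head weight lists are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Equation 1: softmax(Q K^T / sqrt(d)) V, with d the key dimensionality.
    d = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))
    return weights @ V, weights

def multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v are lists with one projection matrix per head (hypothetical layout).
    heads = [
        scaled_dot_product_attention(X @ wq, X @ wk, X @ wv)[0]
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    # Concatenate the head outputs and mix them with the output projection.
    return np.concatenate(heads, axis=-1) @ W_o
```

With an input of shape (n, d) and h heads of width d / h, the output keeps the shape (n, d), and each row of the attention weights sums to one.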
[25] considered the dependencies, i.e. item co-occurrence, between the temporal state representations over a single sequence with the scaled dot-product attention. Their model introduces a context vector that is linearly combined with the query and the key in the self-attention. We extend this context modeling with stochasticity and a kernel method to add flexibility to the self-attention.
Multivariate Skew-Normal Distribution
As mentioned for the latent space model of RKSA, we introduce an explicit probability density model into the self-attention structure. Here, we choose the multivariate skew-normal (MSN) distribution as the explicit density because we intend to model 1) the covariance structure between items; and 2) the skewness of the attention value. It would be natural to consider the multivariate normal distribution for the covariance model, but the normal distribution cannot model skewness because it enforces a symmetric density curve. As the name suggests, the MSN distribution captures skewness through a shape parameter [2]. The MSN distribution needs four parameters: location $\xi$, scale $\omega$, correlation $\bar{\Omega}$, and shape $\alpha$. Following [1], a $k$-dimensional random variable $x$ follows the MSN distribution with the location parameter $\xi$; the correlation matrix $\bar{\Omega}$; the scale parameter $\omega$; and the shape parameter $\alpha$, as in Equation 4.
$$f(x) = 2\,\phi_k(x - \xi; \Omega)\,\Phi\left(\alpha^\top \omega^{-1}(x - \xi)\right) \tag{4}$$
Here, $\Omega = \omega\bar{\Omega}\omega$ is the covariance matrix; $\phi_k(x - \xi; \Omega)$ is the $k$-dimensional multivariate normal density with the mean $\xi$ and the covariance $\Omega$; and $\Phi$ is the cumulative distribution function of $N(0, 1)$. If $\alpha$ is a zero vector, the distribution reduces to the multivariate normal distribution with the mean $\xi$ and the covariance $\Omega$.
Kernel Function
Given that we intend to model the covariance of the MSN, we introduce how we provide a flexible covariance structure through kernels. A kernel function, $k(x, x')$, maps a pair of observations in the observation space $\mathcal{X}$ to a real value. In machine learning, kernel functions are widely used to compute the similarity between pairs of data points, arranged as a covariance matrix. Given observations $x_1, \ldots, x_n$, a function is a valid kernel if and only if it is (1) symmetric: $k(x_i, x_j) = k(x_j, x_i)$ for all $i, j$; and (2) positive semi-definite: $c^\top K c \ge 0$ for all $c \in \mathbb{R}^n$, where $K_{ij} = k(x_i, x_j)$ [17]. We apply a customized kernel function to model the relational covariance parameter of the MSN in RKSA, and we provide proofs of the validity of our customized kernels.
Methodology
This section explains the sequential recommendation task, the overall structure of Relation-Aware Kernelized Self-Attention (RKSA), and its detailed parameter modeling.
Problem Statement
A sequential recommendation uses datasets built upon a past action sequence of a user. Let be a set of users; let be a set of items; and let be a user ’s action sequence. The task of sequential recommendation is predicting the next item to interact by the user, as .
Self-Attention Block
We propose Relation-aware Kernelized Self-Attention (RKSA), which is a modification of the self-attention structure embedded in the Transformer [23]. Figure 2 illustrates that RKSA is a customized self-attention based on relations, such as the item, the user, and the global co-occurrence information. The detailed procedure is explained below.
Embedding Layer
Since the raw data of items and interactions follow a sparse one-hot encoding, we need to embed the information of items and positions of interactions. To create such embeddings, we use a fixed number of the latest actions from the user sequence. Specifically, we define the item embedding matrix E with a given embedding dimensionality. E is estimated by a hidden layer as a part of the modeled neural network, and the raw input to the hidden layer is the one-hot encoding of the item interacted with at each timestep. Similarly, we set a user embedding U to distinguish between users. We also define a positional embedding matrix P to introduce the sequential ordering information of the interactions, following the ideas of [12]. P and U are also estimated by hidden layers that match the dimensionality of E.
Afterward, we estimate the inputs to RKSA, which should convey the representation of items and positions in the sequences. The representation of the item at each timestep is estimated as the sum of its item embedding and its positional embedding. Finally, the input item sequence is expressed by combining the item embedding E and the positional embedding P.
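The construction of the RKSA input from the item and positional embeddings can be sketched as follows. This is a minimal illustration, assuming that sequences shorter than the window are left-padded with a reserved item id 0 whose representation is zeroed, as is common in SASRec-style pipelines; the function name and the padding convention are assumptions, not details confirmed by the text.

```python
import numpy as np

def build_inputs(item_ids, E, P):
    """Return the input matrix for one user sequence.

    E: item embedding matrix of shape (num_items + 1, d); row 0 is the padding item.
    P: positional embedding matrix of shape (n, d), with n the window length.
    """
    n = P.shape[0]
    recent = list(item_ids)[-n:]                   # keep only the latest n actions
    padded = [0] * (n - len(recent)) + recent      # left-pad short sequences
    idx = np.asarray(padded)
    X = E[idx] + P                                 # item embedding + positional embedding
    X[idx == 0] = 0.0                              # padded positions carry no signal
    return X
```

For a window of 50 actions and an embedding size of 64, the values used in the experiments, the result has shape (50, 64).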
Relation-Aware Kernelized Self-Attention
The core component of RKSA is the multihead attention structure that includes a latent variable of . Given that Equation 1 is deterministic, we intend to turn into a single latent variable . The changed part is originally the alignment score of the attention mechanism, so its range becomes . Additionally, we assume that there is a skewed shape in the alignment score distribution, so we designed to follow the multivariate skewnormal distribution (MSN), as Equation 5
. In other words, we sample the logit of the softmax function from the MSN distribution.
(5) 
In the above, the parameters of the MSN distribution include the location , the covariance , and the shape parameter . The details of the parameters are explained in Section Parameter Modeling. Additionally, in Equation 5, denotes the items in the sequence , and is the cooccurrence matrix of from our kernel model, which is explained in Section Covariance. The cooccurrence matrix is constructed by counting the cooccurrence number between item pairs in the whole dataset. We follow the amortized inference with a reparametrization on the MSN; and and are used as inputs to the inference.
Lastly, the output of RKSA is the hidden representation H in Equation 5, which is computed with the value vector estimated from the input item sequence representation. Since we modify the scaled dot-product attention, RKSA is easily extended to a multi-head variant by following the same procedure as Equation 3.
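The co-occurrence matrix used as an input to the inference can be built by a single pass over the dataset. The sketch below assumes that two items co-occur when they appear in the same user sequence and that each pair is counted once per sequence; the text does not spell out the counting window, so both choices are assumptions, as is the function name.

```python
from itertools import combinations

import numpy as np

def cooccurrence_matrix(sequences, num_items):
    """Count, over all sequences, how often each pair of items appears together."""
    C = np.zeros((num_items, num_items), dtype=np.int64)
    for seq in sequences:
        unique = sorted(set(seq))
        for i in unique:
            C[i, i] += 1              # diagonal: number of sequences containing item i
        for i, j in combinations(unique, 2):
            C[i, j] += 1              # off-diagonal: symmetric pair counts
            C[j, i] += 1
    return C
```

For example, with the sequences [0, 1, 2], [1, 2], and [2, 3, 2], items 1 and 2 co-occur twice, and item 2 appears in three sequences.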
Point-Wise Feed-Forward Network
We apply the point-wise feed-forward network of the Transformer to the output of RKSA at each position. The point-wise feed-forward network consists of two linear transformations with a ReLU nonlinear activation function between them. The final output of the point-wise feed-forward network is $\mathrm{ReLU}(HW^{(1)} + b^{(1)})W^{(2)} + b^{(2)}$, where $H$ is the output of RKSA.
Besides the above modeling structure, we stacked multiple self-attention blocks to learn complex transition patterns, and we added residual connections [7] to train a deeper network structure. We also applied layer normalization [3] and dropout [20] to the output of each layer, following [23].
Output Layer
Let be the number of selfattention blocks. The task requires predicting the ()th item with the th output of the th selfattention block. We use the same weights of the item embedding layer to rank the item prediction. The relevance score of the item is defined as :
(6) 
denotes the th output of the last selfattention block, and is the embedding of item . The prediction ranking of the ()th item is defined by the ranking of the items’ relevance scores.
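The ranking step can be illustrated with a short sketch. It assumes that the relevance score is the dot product between the last block's output at the current timestep and each item embedding, which is how shared item-embedding weights are typically used for scoring; Equation 6 is not reproduced here, so this form and the function name are assumptions.

```python
import numpy as np

def rank_items(block_output_t, item_embeddings, top_k=10):
    """Score every item against one timestep's output and return the best top_k."""
    scores = item_embeddings @ block_output_t      # one relevance score per item
    order = np.argsort(-scores)                    # highest score first
    return order[:top_k], scores
```

The same item embedding matrix serves both as the input lookup table and as the scoring matrix, so no additional output weights are introduced.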
Parameter Modeling
This section enumerates the detailed modeling of the MSN parameters, which is used for the latent variable in RKSA.
Location
The location has the same role of the mean of multivariate normal distribution. Given that we use the MSN to sample the alignment score, we still need to provide the deterministic alignment score with the most likelihood. Therefore, we allow the alignment score to be the location parameter as:
(7) 
Also, we can use activation function and scaling to with .
Covariance
The covariance represents the relation between items. Although the covariance is a square matrix of parameters, it has a limited size because we only use a fixed number of the latest items, and because there are not many unique items in those latest interactions. The relation can be measured by various methods, ranging from simple co-occurrence counting to a nonlinear kernel function. This paper designs a kernel function to measure the relation between a pair of items because a kernel function is known to be an efficient, nonlinear, high-dimensional distance metric that can also be learned by optimizing the kernel hyperparameters.
We compose a kernel function by considering the relations of the cooccurrence, the item and the user. For a given sequence, for timesteps and , we utilize the normalized representations of and
. Additionally, we infer the variance
of at timestep and , by an amortized inference as Equation 8.(8) 
In the above, we set the activation function of standard deviation as softplus to make the value of standard deviation positive. The following defines three different kernel functions, and we denote
as for simplicity.
Counting kernel is defined by the cooccurrence number of each item pair. The counting kernel is where are the number of occurrence of item and , respectively, and is the number of cooccurrence of item and .

Item kernel utilizes the representation of each item. There are two alternative kernels. The linear item kernel is where
denotes dot product; and the Radial Basis Function (RBF) kernel is
. 
User kernel utilizes the representation of each items and users. The user kernel is for user embedding and weight matrix where denotes Hadamard product.
Unlike the item and user kernels, the validity of the counting kernel should be checked because it does not have a well-known form such as the linear or RBF kernels. The counting kernel is always symmetric and positive semi-definite. Therefore, the counting kernel is a valid kernel function.
Because the sum of valid kernel functions is itself a valid kernel function, we combine the kernel functions by summation to make the final kernel function flexible. The final kernel function is defined as:
(9) 
With the above kernel function, our modeling on the correlation matrix is , similar to the definition of of Equation 4.
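The validity conditions and the summation of kernels can be checked numerically. The sketch below is only an illustration: it sums a linear and an RBF kernel over item representations, verifies symmetry and positive semi-definiteness of the resulting Gram matrix, and rescales it to a unit diagonal. The counting and user kernels are omitted because their formulas are not reproduced here, and the unweighted sum, the fixed length scale, and the rescaling step are assumptions.

```python
import numpy as np

def linear_kernel(X):
    # Gram matrix of dot products between item representations.
    return X @ X.T

def rbf_kernel(X, length_scale=1.0):
    # Squared Euclidean distances, clipped at zero to absorb rounding error.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return np.exp(-d2 / (2.0 * length_scale ** 2))

def is_valid_kernel_matrix(K, tol=1e-8):
    # A valid kernel yields a symmetric, positive semi-definite Gram matrix.
    symmetric = np.allclose(K, K.T)
    min_eig = np.linalg.eigvalsh((K + K.T) / 2.0).min()
    return bool(symmetric and min_eig >= -tol)

def to_correlation(K):
    # Rescale a covariance-like matrix so that its diagonal is one.
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)
```

Rescaling by the diagonal preserves positive semi-definiteness, so the result of `to_correlation` passes the same validity check as the summed kernel.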
This section describes the covariance modeling with the final kernel, so the kernel hyperparameters, such as r, need to be inferred. Learning the kernel hyperparameters requires supervision, so the loss of the recommendation task needs to be augmented with an additional loss that guides the kernel hyperparameters. Therefore, we model a loss that regularizes the covariance toward the item co-occurrence. Since there are other loss terms, i.e. the recommendation loss, the learned correlation does not become identical to the item co-occurrence, but the co-occurrence loss can act as prior knowledge. In particular, we measure the co-occurrence loss with a listwise ranking loss that aligns the ranking of the correlations with the ranking of the item co-occurrences. The co-occurrence loss is defined as maximizing the listwise ranking loss [4].
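A listwise ranking loss in the spirit of [4] can be sketched as a cross-entropy between two top-one probability distributions. Whether RKSA uses exactly this top-one form, and with which sign convention, is an assumption; the function names are hypothetical. Here `pred_scores` would be one row of the learned correlation and `target_scores` the matching row of co-occurrence counts.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log of the softmax over a 1-D score vector.
    x = np.asarray(x, dtype=float)
    x = x - np.max(x)
    return x - np.log(np.sum(np.exp(x)))

def listwise_top1_loss(pred_scores, target_scores):
    """Cross-entropy between the top-one distributions of targets and predictions."""
    p_target = np.exp(log_softmax(target_scores))
    return float(-np.sum(p_target * log_softmax(pred_scores)))
```

The loss is smallest when the predicted scores induce the same top-one distribution as the targets, and it grows as the two rankings disagree.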
Shape
The shape parameter reflects the relation between a final item and an item in a user sequence. We designate to items . We define by introducing a ratio parameter with the cooccurrence matrix ; and a learnable scaling parameter . Specifically, we assume , which is a scaled correlation between the final item and the item .
First, we calculate the ratio parameter with the cooccurrence matrix , by the summation of the linear alignments between the last time , and the aligned item . Here, let be the value of th row and th column of cooccurrence matrix, . For simplicity, we denote as . The following is the detailed formula of .
(10) 
Equation Shape computes by the dotproduct between th row and th column of the cooccurrence matrix , which means that we calculate the correlation between the cooccurrence of and . Having said that, the cooccurrence of the same item is semantically meaningless in , so such cases used the average of the remaining elements in each row in the dotproduct process. enables modeling the twohop dependency between and through .
Second, Equation 11 defines the scaling parameter, :
(11) 
We can apply the softplus activation to , so the shape parameter becomes positive.
Model Inference
Loss Function
Given the above model structure, this subsection introduces the inference on the latent variable following the MSN distribution. It is wellknown that the latent variable can be inferred by optimizing the evidence lower bound from the Jensen’s inequality, so we optimize the evidence lower bound on the marginal loglikelihood, , when predicting the th item . Equation 12
describes the loss function of this prediction task.
(12)  
utilizes the binary crossentropy loss with the negative sampling as conducted in [12] to calculate . It should be noted that the actual loss function is a combination of the prediction loss and the cooccurrence loss, which is . is the regularization weight hyperparameter of the cooccurrence loss.
Reparametrization of
We sample the values of from the distribution using the reparameterization trick. Equation 13 shows the reparametrization of the MSN distribution with the sample from the two Normal distributions.
(13)  
This reparametrization is utilized because needs to be instantiated for the forward path. Equation 13 shows how to sample given the amortized inference parameters of , , , and . Once the forward path is enabled, the neural network can be trained via the backpropagation method.
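Sampling from an MSN distribution with two independent normal draws can be sketched with the standard additive stochastic representation of the skew-normal family. Equation 13 is not reproduced here, so the exact form used by RKSA may differ; the function names and the argument layout (a scale vector `omega` and a correlation matrix `corr`) are assumptions.

```python
import numpy as np

def skewness_vector(corr, alpha):
    # delta = corr @ alpha / sqrt(1 + alpha^T corr alpha)
    return corr @ alpha / np.sqrt(1.0 + alpha @ corr @ alpha)

def sample_msn(xi, omega, corr, alpha, n_samples, rng):
    """Draw MSN samples as xi + omega * (delta * |u0| + L u1)."""
    k = len(xi)
    delta = skewness_vector(corr, alpha)
    # corr - delta delta^T is positive definite whenever corr is.
    L = np.linalg.cholesky(corr - np.outer(delta, delta))
    u0 = np.abs(rng.standard_normal((n_samples, 1)))   # half-normal draw
    u1 = rng.standard_normal((n_samples, k))           # independent normal draw
    z = u0 * delta + u1 @ L.T
    return xi + omega * z                              # location-scale transform

def msn_mean(xi, omega, corr, alpha):
    # Closed-form mean of the MSN distribution, useful as a sanity check.
    return xi + omega * skewness_vector(corr, alpha) * np.sqrt(2.0 / np.pi)
```

Because the noise enters only through `u0` and `u1`, the sample is a differentiable function of the location, scale, correlation, and shape, which is what allows gradients to flow through the sampled latent variable. With a zero shape vector the sampler reduces to an ordinary multivariate normal.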
Experiment Result
Datasets
We evaluate our model on five real-world datasets: Amazon (Beauty, Games) [8, 16], CiteULike, Steam, and MovieLens. We follow the preprocessing procedure of [12] for Beauty, Games, and Steam. For CiteULike and MovieLens, we follow the preprocessing procedure of [19]. We split all datasets into training, validation, and test sets following the procedure of [12]. Table 1 summarizes the statistics of the preprocessed datasets.
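Table 3: Performance of RKSA with each kernel combination on Beauty (B) and MovieLens (M). C, I, and U denote the counting, item, and user kernels, respectively.
Dataset | C      | I      | U      | C+I    | C+U    | I+U    | C+I+U
B       | 0.5015 | 0.4982 | 0.4958 | 0.5012 | 0.4955 | 0.4951 | 0.5011
M       | 0.5911 | 0.5966 | 0.5977 | 0.5960 | 0.5962 | 0.5997 | 0.5973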
Baselines
We compared RKSA with eight baselines.

Pop always recommends the most popular items.

ItemKNN [14] recommends an item based on its measured similarity to the last item.

BPRMF [18] recommends an item using user and item latent vectors learned by matrix factorization.

GRU4REC [10] models the sequential user history with a GRU and specialized recommendation loss functions, such as the TOP1 and BPR losses.

NARM [13] focuses on both the short-term and long-term dependencies of a sequence with an attention mechanism and a modified bilinear embedding function.

HCRNN [19] considers the user's sequential interest change with global, local, and temporary context modeling. It modifies the GRU cell structure to incorporate the various context models.

AttRec [29] models the short-term intent using self-attention and the long-term preference with metric learning.

SASRec [12] is a Transformer model that combines the strengths of Markov chains and RNNs. SASRec focuses on adaptively finding the relevant items with self-attention mechanisms.
Experiment Settings
For GRU4REC, NARM, HCRNN, and SASRec, we use the official code released by the corresponding authors. For GRU4REC, NARM, and HCRNN, we apply the data augmentation method proposed by NARM [13]. We use two self-attention blocks and one head for SASRec and RKSA, following the default setting of [12]. For fair comparison, we apply the same batch size (128), item embedding size (64), dropout rate (0.5), learning rate (0.001), and optimizer (Adam) to all methods. We use the authors' settings for the other hyperparameters. For RKSA, we set the co-occurrence loss weight to 0.001. Furthermore, we use learning rate decay and early stopping based on the validation accuracy for all methods. We use the latest 50 actions of each sequence for all datasets.
Quantitative Analysis
Table 2 presents the recommendation performance of the evaluated models. We adopt two widely used metrics: Hit Rate@K and NDCG@K [9]. Because scoring all user-item pairs requires heavy computation, we use 100 negative samples for the evaluation, following [12, 9]. We repeat each experiment five times and report the average for each method. The reported performance of RKSA comes from its best kernel variant, and RKSA outperforms all baseline models on all datasets and metrics. Beauty shows the largest improvement. Beauty is the sparsest dataset, so many of its items occur infrequently. This result suggests that relational information can help predict such infrequent items.
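The two metrics can be computed from the rank of the target item among the sampled negatives. The sketch below uses the common single-target definitions, in which the rank is zero-based and each test case has exactly one relevant item; the function names are hypothetical.

```python
import numpy as np

def rank_of_target(target_score, negative_scores):
    # Zero-based rank: the number of negatives scored above the target.
    return int(np.sum(np.asarray(negative_scores) > target_score))

def hit_rate_at_k(ranks, k):
    # Fraction of test cases whose target lands in the top k.
    ranks = np.asarray(ranks)
    return float(np.mean(ranks < k))

def ndcg_at_k(ranks, k):
    # With one relevant item, the ideal DCG is 1, so NDCG is 1 / log2(rank + 2).
    ranks = np.asarray(ranks)
    gains = np.where(ranks < k, 1.0 / np.log2(ranks + 2.0), 0.0)
    return float(np.mean(gains))
```

In the evaluation protocol described above, `negative_scores` would hold the scores of the 100 sampled negative items for one test case, and the metrics are averaged over all users.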
Ablation Study
We compared the kernel function combinations on the Beauty and MovieLens datasets. We consider Beauty a representative sparse dataset and MovieLens a representative dense dataset. Table 3 shows the performance of each kernel combination. We assume that it is hard to learn the representations of items and users from a sparse dataset with short sequences. Accordingly, RKSA with the counting kernel alone shows the best performance on the sparse dataset. In contrast, it is relatively easy to learn the item and user representations from the dense dataset, and Table 3 shows that the combination of the item and user kernels performs best there.
Qualitative Analysis
Item Embedding and Correlation Matrix
The item kernel utilizes the dependency between the items at each timestep. When the co-occurrence loss is learned, the kernel hyperparameters and the item embeddings capture the relational information of the co-occurrence. Figure 3a illustrates the item embeddings of movies. The embeddings of items with the same genre are located close together.
We generate a synthetic sequence to analyze the correlation from the trained kernel function. We use the combination of the counting and item kernels without the user kernel because the sequence is synthetic. The synthetic sequence includes four different movie series and an animation movie. Figure 3b shows that movies belonging to the same series have high correlations. In contrast, the correlations between the animation genre and the other genres are low.
Finally, we observed the weights of the counting, item, and user kernels (see Figure 4a), because the kernel weights also contribute to the construction of the correlation matrix. Since each dataset has different characteristics, each dataset emphasizes the counting, item, and user relations differently. Interestingly, the dominant kernel in MovieLens was the user kernel rather than the counting kernel. MovieLens is a relatively dense dataset with respect to the average number of actions per user, as shown in Table 1. Our proposed model, RKSA, adapts well to the properties of the dataset and focuses on the user kernel instead of the other kernels on MovieLens.
Predicted ranking of infrequent items
A sparse dataset, like Beauty, has many infrequent items, which are difficult to predict because of their information sparsity. To overcome this problem, RKSA utilizes the relational information of the whole dataset in the prediction, instead of that of a single sequence. Figure 4b shows that, as the information sparsity worsens, RKSA ranks the target item higher than SASRec does.
Attention Weight Case Study
Figure 5 shows the attention weights of SASRec and RKSA together with the co-occurrence information between the last item and each item of the sequence. The sequence instance in Figure 5 has a high co-occurrence value at timesteps 0, 1, 2, and 5, and Figure 5 confirms that RKSA places higher attention values than SASRec at these timesteps. In the opposite case, where the co-occurrence is low, the attention weight of RKSA is lower than that of SASRec.
Conclusion
We present relation-aware kernelized self-attention (RKSA) for the sequential recommendation task. RKSA introduces a new self-attention mechanism that is stochastic as well as kernelized by relational information. While past attention mechanisms are deterministic, we introduce a latent variable into the attention. Moreover, the latent variable utilizes the kernelized correlation matrix, so the kernel can be extended to include further relational information and modeling. With these innovations, RKSA achieved the best performance in all experimental settings. We expect further developments in stochastic attention for the Transformer in the near future.
Acknowledgments
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF2018R1C1B6008652).
References
[1] (1999) Statistical applications of the multivariate skew normal distribution. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61 (3), pp. 579–602.
[2] (1996) The multivariate skew-normal distribution. Biometrika 83 (4), pp. 715–726.
[3] (2016) Layer normalization. arXiv preprint arXiv:1607.06450.
[4] (2007) Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning, pp. 129–136.
[5] (2019) Behavior sequence transformer for e-commerce recommendation in alibaba. arXiv preprint arXiv:1905.06874.
[6] (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[7] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
[8] (2016) Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In proceedings of the 25th international conference on world wide web, pp. 507–517.
[9] (2017) Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web, pp. 173–182.
[10] (2015) Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939.
[11] (2018) CSAN: contextual self-attention network for user sequential recommendation. In 2018 ACM Multimedia Conference on Multimedia Conference, pp. 447–455.
[12] (2018) Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM), pp. 197–206.
[13] (2017) Neural attentive session-based recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1419–1428.
[14] (2003) Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet computing (1), pp. 76–80.
[15] (2018) STAMP: short-term attention/memory priority model for session-based recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1831–1839.
[16] (2015) Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 43–52.
[17] (2003) Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71.
[18] (2009) BPR: bayesian personalized ranking from implicit feedback. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, pp. 452–461.
[19] (2019) Hierarchical context enabled recurrent neural network for recommendation. In Proceedings of the AAAI.
[20] (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958.
[21] (2018) Deep semantic role labeling with self-attention. In Thirty-Second AAAI Conference on Artificial Intelligence.
[22] (2008) Visualizing data using t-SNE. Journal of machine learning research 9 (Nov).
[23] (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008.
[24] (2018) Attention-based transactional context embedding for next-item recommendation. In Thirty-Second AAAI Conference on Artificial Intelligence.
[25] (2019) Context-aware self-attention networks. arXiv preprint arXiv:1902.05766.
[26] (2018) Sequential recommender system based on hierarchical attention networks. In the 27th International Joint Conference on Artificial Intelligence.
[27] (2019) Multi-order attentive ranking model for sequential recommendation. In Proceedings of the AAAI.
[28] (2018) Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318.
[29] (2019) Next item recommendation with self-attentive metric learning. In Thirty-Third AAAI Conference on Artificial Intelligence, Vol. 9.
[30] (2018) ATRank: an attention-based user behavior modeling framework for recommendation. In Thirty-Second AAAI Conference on Artificial Intelligence.