Recommendation is one of the key application areas of artificial intelligence in the big data era. Recommendation tasks are supported by large-scale data, and users need to select a specific item from many alternatives. This selection requirement motivates the use of the attention mechanism in the recommendation task. Attention is applied to the item selection; in particular, sequential recommendation uses the attention mechanism to select which past item choice records to consider for the recommendation at the current timestep [24, 15, 13, 26, 27, 11].
Given the relationship between attention and recommendation, adopting new attention mechanisms for recommendation has been a research trend. For instance, Self-Attentive Sequential Recommendation (SASRec) [12] adopted the self-attention mechanism of the Transformer [23] for the recommendation task. This adaptation is interesting, but it was customized to the task specifics only to a limited extent. Recommendation often requires understanding items, users, browsing sequences, etc., and recommendation models need to consider such contexts, which SASRec does not provide. Following SASRec, there have been developments in using the self-attention mechanism of the Transformer to model task-specific features of sequential recommendation. For example, ATRank [30] utilized the self-attention mechanism to consider the influences from heterogeneous behavior representations. To model the user’s short-term intent, AttRec [29] adopted the self-attention mechanism on the user interaction history. Similar to ATRank and AttRec, BST [5] used the self-attention mechanism to aggregate the auxiliary user and item features.
Given the success of self-attention [21, 6, 28], the recommendation task can be improved by exploiting the sequential information, which was used only to a limited extent in previous works. Moreover, such use of the sequential information provides a new approach to customizing the self-attention structure for the recommendation task. Figure 1 gives an example in which the co-occurrence information may influence the attention weight. It is possible to observe a movie pair with a higher co-occurrence than others, and such a pair should inform the attention mechanism to increase the weight.
We renovate and customize the self-attention of the Transformer with a latent space model. Specifically, we add a latent space to the self-attention value of the Transformer, and we use the latent space to model the context from the relations of the recommendation task. The latent space is modeled as a multivariate skew-normal (MSN) distribution [2] whose dimension is the number of unique items in the sequence. The covariance matrix of the MSN distribution is the variable through which we model the relations of a sequence, items, and a user, using a kernel function that provides the flexibility to adapt to the recommendation task. After the kernel modeling, we provide the reparameterization of the MSN distribution to enable amortized inference on the introduced latent space. Since the relation modeling is done with kernelization, we call this model relation-aware kernelized self-attention (RKSA). We designed RKSA with three innovations. First, the deterministic Transformer may not work well in the generalized task of recommendation because of sparsity, so we added a latent dimension and its corresponding reparameterization. Second, the covariance modeling with the relation-aware kernel enables a more fundamental adaptation of the self-attention to the recommendation. Third, the kernelized latent space of the self-attention supports reasoning about the recommendation result. RKSA is evaluated against eight baseline models, including SASRec, HCRNN, and NARM, on five benchmark datasets, including Amazon review, MovieLens, and Steam. Our experiments showed that RKSA consistently and significantly improves the performance over the baselines on the benchmarks.
We start the preliminaries by reviewing the self-attention structure that is the backbone of RKSA. Recently, [23] proposed the scaled dot-product attention, which is defined by Equation 1, where $Q$, $K$, and $V$ are the query, key, and value matrices, respectively. The scaled dot-product attention calculates importance weights from the dot product of the query with the key, scaled by $1/\sqrt{d}$, where $d$ is the dimensionality of the key. This importance is normalized by the softmax, and the normalized importance is then multiplied by the value to form the scaled dot-product attention.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V \qquad (1)$$
When the query, the key, and the value are all derived from the same input matrix $X$, as in Equation 2, the scaled dot-product attention is called self-attention. Here, $W^{Q}$, $W^{K}$, and $W^{V}$ are learnable projection matrices.

$$\mathrm{SA}(X) = \mathrm{Attention}(XW^{Q}, XW^{K}, XW^{V}) \qquad (2)$$

Self-attention with an additional predefined or learnable positional embedding [23, 12] is able to capture the latent information of the position, like previous recurrent networks.
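As a concrete illustration of the scaled dot-product attention and its self-attention special case, the following is a minimal pure-Python sketch. The helper names (`softmax`, `matmul`, `attention_weights`, `self_attention`) and the toy input are assumptions made for this example, and identity projections are used only to keep the numbers easy to follow.

```python
import math

def softmax(row):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # Nested-list matrix product: (n x m) times (m x p) gives (n x p).
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def attention_weights(q, k):
    # softmax(Q K^T / sqrt(d)), applied row by row.
    d = len(k[0])
    scores = matmul(q, transpose(k))
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]

def self_attention(x, w_q, w_k, w_v):
    # Query, key, and value are all projections of the same input x.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    return matmul(attention_weights(q, k), v)

# Toy input: three timesteps with d = 2 (hypothetical values).
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(x, x)
output = self_attention(x, identity, identity, identity)
```

Each row of `weights` is a probability vector over the timesteps, so every output row is a convex combination of the value rows.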
Multi-head attention uses $h$ scaled dot-product attentions, each with an $h$ times smaller dimension for the attention weight parameters. [23] found that multi-head attention is useful even though it uses a similar number of parameters to single-head attention.
[25] considered the dependencies, i.e., item co-occurrence, between the temporal state representations over a single sequence with the scaled dot-product attention. Their model introduces a context vector that is linearly combined with $Q$ and $K$ in the self-attention. We extend this context modeling with stochasticity and a kernel method to add flexibility to the self-attention.
Multivariate Skew-Normal Distribution
As mentioned above, RKSA relies on a latent space model, so we introduce an explicit probability density model to the self-attention structure. Here, we choose the multivariate skew-normal (MSN) distribution as the explicit density because we intend to model 1) the covariance structure between items; and 2) the skewness of the attention value. It would be natural to consider the multivariate normal distribution to enable the covariance model, but the normal distribution is unable to model the skewness because it enforces a symmetric shape of the density curve. As the name suggests, the MSN distribution reflects the skewness through the shape parameter $\alpha$. The MSN distribution needs four parameters: location $\xi$, scale $\omega$, correlation $\psi$, and shape $\alpha$. Following [1, 2], a $k$-dimensional random variable $x$ follows the MSN distribution with the location parameter $\xi$; the correlation matrix $\psi$; the scale parameter $\omega$; and the shape parameter $\alpha$, as in Equation 4.

$$f(x) = 2\,\phi_k(x; \xi, \Sigma)\,\Phi\left(\alpha^{\top}\omega^{-1}(x - \xi)\right) \qquad (4)$$

Here, $\Sigma = \omega\psi\omega$ is the covariance matrix; $\phi_k$ is the $k$-dimensional multivariate normal density with the mean $\xi$ and the covariance $\Sigma$; and $\Phi(\cdot)$ is the cumulative distribution function of $N(0, 1)$. If $\alpha$ is a zero vector, the distribution reduces to the multivariate normal distribution with the mean $\xi$ and the covariance $\Sigma$.
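To make the role of the shape parameter concrete, the sketch below evaluates the one-dimensional special case of the skew-normal density, $2\,\phi(z)\,\Phi(\alpha z)/\omega$ with $z = (x - \xi)/\omega$. It is only an illustration of the standard univariate form; the function names and the parameter values are assumptions chosen for this example.

```python
import math

def normal_pdf(z):
    # Standard normal density.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    # Standard normal cumulative distribution function via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def skew_normal_pdf(x, xi, omega, alpha):
    # Univariate skew-normal density with location xi, scale omega, shape alpha.
    z = (x - xi) / omega
    return 2.0 * normal_pdf(z) / omega * normal_cdf(alpha * z)

def integrate(f, lo, hi, steps=20000):
    # Midpoint rule, used here only to check that the density integrates to one.
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) for i in range(steps)) * h

# With alpha = 0 the density coincides with the normal density.
symmetric_gap = abs(skew_normal_pdf(0.3, 0.0, 1.0, 0.0) - normal_pdf(0.3))
# Hypothetical parameters: a positive alpha puts more mass to the right of xi.
total_mass = integrate(lambda x: skew_normal_pdf(x, 1.0, 2.0, 3.0), -30.0, 30.0)
right = skew_normal_pdf(2.0, 1.0, 2.0, 3.0)
left = skew_normal_pdf(0.0, 1.0, 2.0, 3.0)
```

Setting `alpha` to zero recovers the symmetric normal curve, which mirrors the reduction stated in the text.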
Given that we intend to model the covariance of the MSN, we introduce how we provide a flexible covariance structure through kernels. A kernel function, $k(x, x')$, maps a pair of observations in the observation space $\mathcal{X}$ to a real value. In machine learning, kernel functions are widely used to compute the similarity between two data points, which can then be arranged as a covariance matrix. Given $n$ observations $x_1, \dots, x_n$, a function is a valid kernel if and only if it is (1) symmetric: $k(x_i, x_j) = k(x_j, x_i)$ for all $i, j$; and (2) positive semi-definite: $\sum_{i=1}^{n}\sum_{j=1}^{n} c_i c_j k(x_i, x_j) \geq 0$ for all $c \in \mathbb{R}^{n}$. We apply a customized kernel function to model the relational covariance parameter of the MSN in RKSA, and we provide proofs of the validity of our customized kernels.
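The two validity conditions can be probed numerically on a Gram matrix. The sketch below is a sanity check rather than a proof: it tests symmetry exactly and samples random coefficient vectors for the quadratic form, so it can only reveal violations. The RBF kernel, the counter-example function, and the sample points are assumptions made for illustration.

```python
import math
import random

def rbf_kernel(x, y, length_scale=1.0):
    # Radial basis function kernel, a well-known valid kernel.
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * length_scale ** 2))

def negative_distance(x, y):
    # Symmetric but NOT positive semi-definite; used as a counter-example.
    return -abs(x[0] - y[0])

def gram_matrix(kernel, points):
    return [[kernel(p, q) for q in points] for p in points]

def is_symmetric(gram, tol=1e-12):
    n = len(gram)
    return all(abs(gram[i][j] - gram[j][i]) <= tol
               for i in range(n) for j in range(n))

def min_quadratic_form(gram, trials=1000, seed=0):
    # Smallest value of c^T K c over random c; a negative value disproves PSD.
    rng = random.Random(seed)
    n = len(gram)
    best = float("inf")
    for _ in range(trials):
        c = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        value = sum(c[i] * c[j] * gram[i][j]
                    for i in range(n) for j in range(n))
        best = min(best, value)
    return best

good = gram_matrix(rbf_kernel, [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)])
bad = gram_matrix(negative_distance, [(0.0,), (1.0,)])
```

A kernel that passes this check is not guaranteed to be valid, which is why an analytical argument is still needed for a customized kernel.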
This section explains the sequential recommendation task, the overall structure of Relation-Aware Kernelized Self Attention (RKSA), and its detailed parameter modeling.
A sequential recommendation uses datasets built upon the past action sequences of users. Let $\mathcal{U}$ be a set of users; let $\mathcal{I}$ be a set of items; and let $S^{u} = (S^{u}_{1}, S^{u}_{2}, \dots, S^{u}_{|S^{u}|})$ be user $u$’s action sequence. The task of sequential recommendation is predicting the next item that the user $u$ will interact with, given $S^{u}$.
We propose Relation-aware Kernelized Self-Attention (RKSA), which is a modification of the self-attention structure embedded in the Transformer [23]. Figure 2 illustrates that RKSA is a customized self-attention based on relations, such as the item, the user, and the global co-occurrence information. The detailed procedure is explained below.
Since the raw data of items and interactions follow a sparse one-hot encoding, we need to embed the information of the items and the positions of the interactions. To create such embeddings, we use the latest $n$ actions from the user’s action sequence. Specifically, the item embedding matrix is defined as $E \in \mathbb{R}^{|\mathcal{I}| \times d}$, where $|\mathcal{I}|$ is the number of items and $d$ denotes the dimensionality of the embedding. Similarly, we set a user embedding $U$ to distinguish between users. Also, we define a positional embedding matrix as $P \in \mathbb{R}^{n \times d}$ to introduce the sequential ordering information of the interactions, following the ideas from [12]. $P$ and $U$ are also estimated by a hidden layer that matches the dimensionality of $E$ for the further construction of the input.
Afterward, we estimate the inputs to RKSA, which should convey the representation of the items and their positions in the sequence. Here, we assume that the item at timestep $t$, denoted $s_t$, is represented as $x_t$, because it is the input to RKSA. $x_t$ is estimated through the summation of the item embedding $E_{s_t}$ and the positional embedding $P_t$, as $x_t = E_{s_t} + P_t$. Finally, the input item sequence is expressed as $X = (x_1, \dots, x_n)$ by combining the item embedding $E$ and the positional embedding $P$.
Relation-Aware Kernelized Self-Attention
The core component of RKSA is the multi-head attention structure that includes a latent variable $Z$. Given that Equation 1 is deterministic, we intend to turn $QK^{\top}/\sqrt{d}$ into a single latent variable $Z$. The changed part is originally the alignment score of the attention mechanism, so its range becomes $(-\infty, \infty)$. Additionally, we assume that there is a skewed shape in the alignment score distribution, so we designed $Z$ to follow the multivariate skew-normal (MSN) distribution, as in Equation 5. In other words, we sample the logit of the softmax function from the MSN distribution.
In the above, the parameters of the MSN distribution include the location , the covariance , and the shape parameter . The details of the parameters are explained in Section Parameter Modeling. Additionally, in Equation 5, denotes the items in the sequence , and is the co-occurrence matrix of from our kernel model, which is explained in Section Covariance. The co-occurrence matrix is constructed by counting the co-occurrence number between item pairs in the whole dataset. We follow the amortized inference with a reparametrization on the MSN; and and are used as inputs to the inference.
Lastly, the output of RKSA is the hidden representation defined as $H$ in Equation 5. $V$ is the value matrix estimated from the input item sequence representation $X$. Since we modify only the scaled dot-product attention, RKSA is easily extended to a variant of multi-head attention by following the same procedure as the multi-head attention described above.
Point-Wise Feed-Forward Network
We apply the point-wise feed-forward network of the Transformer [23] to the output of RKSA at each position. The point-wise feed-forward network consists of two linear transformations with a ReLU nonlinear activation function between them. The final output of the point-wise feed-forward network is denoted as $F$.
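The point-wise feed-forward step can be sketched as follows. This is a minimal pure-Python illustration of two linear maps with a ReLU in between, applied to each position with shared weights; the weight values and function names are hypothetical.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]

def linear(vector, weight, bias):
    # vector: d_in, weight: d_in x d_out, bias: d_out.
    return [sum(vector[i] * weight[i][j] for i in range(len(vector))) + bias[j]
            for j in range(len(bias))]

def point_wise_ffn(hidden, w1, b1, w2, b2):
    # The same two-layer network is applied to every position independently.
    return [linear(relu(linear(row, w1, b1)), w2, b2) for row in hidden]

# Hypothetical attention output for two positions with d = 2.
hidden = [[1.0, -1.0], [0.0, 2.0]]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0], [1.0]], [0.5]
ffn_out = point_wise_ffn(hidden, w1, b1, w2, b2)
```

Because no information is mixed across positions in this step, all interaction between timesteps comes from the attention layer.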
Besides the above modeling structure, we stacked multiple self-attention blocks to learn complex transition patterns, and we added residual connections to train a deeper network structure. We also applied layer normalization [3] and dropout [20] to the output of each layer by following .
Let $b$ be the number of self-attention blocks. The task requires predicting the $(t+1)$-th item with the $t$-th output of the $b$-th self-attention block. We use the same weights of the item embedding layer to rank the item prediction. The relevance score $r_{i,t}$ of the item $i$ is defined as:

$$r_{i,t} = F^{(b)}_{t} E_{i}^{\top}$$

$F^{(b)}_{t}$ denotes the $t$-th output of the last self-attention block, and $E_{i}$ is the embedding of item $i$. The prediction ranking of the $(t+1)$-th item is defined by the ranking of the items’ relevance scores.
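The scoring and ranking step can be illustrated with a small sketch, assuming the relevance score is the dot product between the last block's output at a timestep and each item embedding. The vectors below are made-up values, and the function names are introduced only for this example.

```python
def relevance_scores(hidden, item_embeddings):
    # One score per item: dot product of the hidden state with the item embedding.
    return [sum(h * e for h, e in zip(hidden, emb)) for emb in item_embeddings]

def rank_items(scores):
    # Item indices sorted from the most to the least relevant.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical hidden state and shared item embedding table (three items, d = 2).
hidden_t = [1.0, 0.0]
item_table = [[0.2, 0.9], [0.9, 0.1], [0.5, 0.5]]
scores = relevance_scores(hidden_t, item_table)
ranking = rank_items(scores)
```

Sharing the item embedding table between the input layer and the scoring layer means no separate output projection has to be learned.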
This section enumerates the detailed modeling of the MSN parameters, which is used for the latent variable in RKSA.
The location $\xi$ plays the same role as the mean of the multivariate normal distribution. Given that we use the MSN to sample the alignment score, we still need to provide the deterministic alignment score with the highest likelihood. Therefore, we set the location parameter to the alignment score of Equation 1:

$$\xi = \frac{QK^{\top}}{\sqrt{d}}$$
Also, we can use activation function and scaling to with .
The covariance represents the relation between items. While is a square matrix of parameters, has a limited size because we only use the latest
items; and because there are not many unique items in those latest interactions. The relation can be measured by various methods, ranging from simple co-occurrence counting to a non-linear kernel function. This paper designs a kernel function to measure the relation between a pair of items because a kernel function is known to be an efficient, nonlinear, high-dimensional similarity measure that can also be learned by optimizing the kernel hyperparameters.
We compose a kernel function by considering the relations of the co-occurrence, the item and the user. For a given sequence, for timesteps and , we utilize the normalized representations of and
. Additionally, we infer the varianceof at timestep and , by an amortized inference as Equation 8.
In the above, we set the activation function of standard deviation as softplus to make the value of standard deviation positive. The following defines three different kernel functions, and we denoteas for simplicity.
Counting kernel is defined by the co-occurrence number of each item pair. The counting kernel is where are the number of occurrence of item and , respectively, and is the number of co-occurrence of item and .
Item kernel utilizes the representation of each item. There are two alternative kernels. The linear item kernel is where
denotes dot product; and the Radial Basis Function (RBF) kernel is.
User kernel utilizes the representation of each items and users. The user kernel is for user embedding and weight matrix where denotes Hadamard product.
Unlike the item and the user kernels, the validity of the counting kernel should be checked because it does not have a well-known form like the linear or the RBF kernels. The counting kernel is always symmetric and positive semi-definite. Therefore, the counting kernel is a valid kernel function.
From the property of kernel functions, we combine kernel functions by their summation to make the final kernel function flexible. The final kernel function is defined as:
With the above kernel function, our modeling on the correlation matrix is , similar to the definition of of Equation 4.
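The combination of kernels and the conversion from a covariance to a correlation matrix can be sketched as below. Several parts are assumptions made for illustration: the counting kernel is taken to be the co-occurrence count normalized by the square root of the two occurrence counts, the item kernel is a plain linear kernel, and the mixing weights are arbitrary non-negative numbers. The exact forms used by RKSA may differ.

```python
import math

def counting_kernel(sequences, items):
    # Assumed form: co-occurrence count / sqrt(occurrence_i * occurrence_j),
    # where counts are the numbers of sequences containing the item(s).
    sets = [set(s) for s in sequences]
    occ = {i: sum(1 for s in sets if i in s) for i in items}
    gram = []
    for i in items:
        row = []
        for j in items:
            co = sum(1 for s in sets if i in s and j in s)
            denom = math.sqrt(occ[i] * occ[j])
            row.append(co / denom if denom > 0 else 0.0)
        gram.append(row)
    return gram

def linear_kernel(vectors):
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]

def combine(grams, weights):
    # A non-negative weighted sum of valid kernels is again a valid kernel.
    n = len(grams[0])
    return [[sum(w * g[i][j] for g, w in zip(grams, weights))
             for j in range(n)] for i in range(n)]

def to_correlation(cov):
    # Rescale by the inverse standard deviations so that the diagonal becomes one.
    scale = [math.sqrt(cov[i][i]) for i in range(len(cov))]
    return [[cov[i][j] / (scale[i] * scale[j]) for j in range(len(cov))]
            for i in range(len(cov))]

# Hypothetical toy data: three items, three sequences, unit-norm item vectors.
items = ["a", "b", "c"]
sequences = [["a", "b"], ["a", "b", "c"], ["c"]]
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
k_count = counting_kernel(sequences, items)
k_item = linear_kernel(vectors)
cov = combine([k_count, k_item], [1.0, 2.0])
corr = to_correlation(cov)
```

With set-based counts, the assumed counting kernel is a normalized inner product of item incidence vectors, which is one way to see why such a kernel is symmetric and positive semi-definite.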
This section describes the covariance modeling with the final kernel, so the kernel hyperparameters, such as , and r, need to be inferred. Since learning the kernel hyperparameters requires supervision, the loss of the recommendation task needs to be augmented with an additional loss that guides the kernel hyperparameters. Therefore, we modeled a loss that regularizes the covariance toward the item co-occurrence. Since we have other loss terms, i.e., the recommendation loss, the learned correlation does not become identical to the item co-occurrence, but the co-occurrence loss can act as prior knowledge. In particular, we measure the co-occurrence loss with the listwise ranking loss [4] to match the ranking of the correlations with the ranking of the item co-occurrences. The co-occurrence loss is defined as maximizing the listwise ranking loss.
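One common listwise ranking objective is the top-one cross-entropy between two softmax distributions, as in ListNet. The sketch below uses that form as an assumption to show how a list of predicted correlations can be compared with a list of co-occurrence values; the sign convention and the exact loss used in RKSA are not reproduced here, and the numbers are hypothetical.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def listwise_loss(target_scores, predicted_scores):
    # Cross-entropy between the top-one probabilities of the two lists.
    p = softmax(target_scores)
    q = softmax(predicted_scores)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# Hypothetical co-occurrence values for one item against three others.
cooccurrence = [3.0, 2.0, 1.0]
aligned = listwise_loss(cooccurrence, [3.0, 2.0, 1.0])
reversed_order = listwise_loss(cooccurrence, [1.0, 2.0, 3.0])
```

The loss is smallest when the predicted list induces the same distribution as the target list, so it penalizes correlations whose ordering disagrees with the co-occurrence ordering.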
The shape parameter reflects the relation between a final item and an item in a user sequence. We designate to items . We define by introducing a ratio parameter with the co-occurrence matrix ; and a learnable scaling parameter . Specifically, we assume , which is a scaled correlation between the final item and the item .
First, we calculate the ratio parameter with the co-occurrence matrix , by the summation of the linear alignments between the last time , and the aligned item . Here, let be the value of -th row and -th column of co-occurrence matrix, . For simplicity, we denote as . The following is the detailed formula of .
Equation Shape computes by the dot-product between -th row and -th column of the co-occurrence matrix , which means that we calculate the correlation between the co-occurrence of and . Having said that, the co-occurrence of the same item is semantically meaningless in , so such cases used the average of the remaining elements in each row in the dot-product process. enables modeling the two-hop dependency between and through .
Second, Equation 11 defines the scaling parameter, :
We can apply the softplus activation to , so the shape parameter becomes positive.
Given the above model structure, this subsection introduces the inference on the latent variable following the MSN distribution. It is well-known that the latent variable can be inferred by optimizing the evidence lower bound from the Jensen’s inequality, so we optimize the evidence lower bound on the marginal log-likelihood, , when predicting the -th item . Equation 12
describes the loss function of this prediction task.
utilizes the binary cross-entropy loss with the negative sampling as conducted in  to calculate . It should be noted that the actual loss function is a combination of the prediction loss and the co-occurrence loss, which is . is the regularization weight hyperparameter of the co-occurrence loss.
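The prediction loss with negative sampling and its combination with the co-occurrence term can be sketched as follows. This is a minimal illustration assuming one positive score and a list of sampled negative scores per step, and assuming the co-occurrence term is added with a weight `lam`; the default of 0.001 matches the weight reported in the experiment settings, while the function names are hypothetical.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def prediction_loss(pos_score, neg_scores):
    # Binary cross-entropy: the positive item is labeled 1, sampled negatives 0.
    loss = -log_sigmoid(pos_score)
    for s in neg_scores:
        loss -= log_sigmoid(-s)  # log(1 - sigmoid(s)) equals log(sigmoid(-s))
    return loss

def total_loss(pred_loss, cooc_loss, lam=0.001):
    # Assumed combination: prediction loss plus weighted co-occurrence loss.
    return pred_loss + lam * cooc_loss

uninformed = prediction_loss(0.0, [0.0])
confident = prediction_loss(5.0, [-5.0])
```

A small weight keeps the co-occurrence term as a soft prior, so the recommendation loss still dominates training.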
We sample the values of from the distribution using the reparameterization trick. Equation 13 shows the reparametrization of the MSN distribution with the sample from the two Normal distributions.
This reparametrization is utilized because needs to be instantiated for the forward path. Equation 13 shows how to sample given the amortized inference parameters of , , , and . Once the forward path is enabled, the neural network can be trained via the back-propagation method.
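For intuition, the univariate skew-normal distribution has a well-known stochastic representation built from two independent standard normal draws, which makes the sample a deterministic function of the parameters and the noise. The sketch below shows that one-dimensional case only; it is not the multivariate form of Equation 13, and the parameter values are assumptions.

```python
import math
import random

def sample_skew_normal(xi, omega, alpha, rng):
    # delta controls how much of the half-normal component enters the sample.
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    u0 = rng.gauss(0.0, 1.0)
    u1 = rng.gauss(0.0, 1.0)
    z = delta * abs(u0) + math.sqrt(1.0 - delta * delta) * u1
    return xi + omega * z

def skew_normal_mean(xi, omega, alpha):
    # Closed-form mean of the univariate skew-normal distribution.
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    return xi + omega * delta * math.sqrt(2.0 / math.pi)

rng = random.Random(0)
draws = [sample_skew_normal(1.0, 2.0, 3.0, rng) for _ in range(100000)]
empirical_mean = sum(draws) / len(draws)
expected_mean = skew_normal_mean(1.0, 2.0, 3.0)
```

Because the randomness lives only in `u0` and `u1`, gradients can flow to the location, scale, and shape through the deterministic transformation (the absolute value is differentiable almost everywhere).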
We evaluate our model on five real world datasets: Amazon (Beauty, Games) [8, 16], CiteULike, Steam, and MovieLens. We follow the same preprocessing procedure on Beauty, Games, and Steam from . For preprocessing CiteULike and MovieLens, we follow the preprocessing procedure from . We split all datasets for training, validation, and testing following the procedure of . Table 1 summarizes the statistics of the preprocessed datasets.
We compared RKSA with eight baselines.
Pop always recommends the most popular items.
Item-KNN [14] recommends an item based on its measured similarity to the last item.
BPR-MF [18] recommends an item using the user and the item latent vectors obtained by matrix factorization.
GRU4REC [10] models the sequential user history with a GRU and specialized recommendation loss functions, such as the Top1 and BPR losses.
NARM [13] focuses on both the short- and long-term dependencies of a sequence with attention and a modified bi-linear embedding function.
HCRNN [19] considers the user’s sequential interest change with global, local, and temporary context modeling. It modifies the GRU cell structure to incorporate the various context modeling.
AttRec [29] models the short-term intent using self-attention and the long-term preference with metric learning.
SASRec [12] adopts the self-attention mechanism of the Transformer for sequential recommendation.
For GRU4REC, NARM, HCRNN, and SASRec, we use the official code released by the corresponding authors. For GRU4REC, NARM, and HCRNN, we apply the data augmentation method proposed by NARM [13]. We use two self-attention blocks and one head for SASRec and RKSA, following the default setting of [12]. For fair comparisons, we apply the same settings for the batch size (128), the item embedding dimension (64), the dropout rate (0.5), the learning rate (0.001), and the optimizer (Adam). We use the authors’ settings for the other hyperparameters. For RKSA, we set the co-occurrence loss weight to 0.001. Furthermore, we use learning rate decay and early stopping based on the validation accuracy for all methods. We use the latest 50 actions of each sequence for all datasets.
Table 2 presents the recommendation performance of the evaluated models. We adopt two widely used measurements: Hit Rate@K and NDCG@K. Since scoring all user-item pairs requires heavy computation, we use 100 negative samples for the evaluation, following [12, 9]. We repeat each experiment five times and report the average for each method. The reported performance of RKSA comes from its best kernel variant, and RKSA outperforms all baseline models on all datasets and metrics. Beauty shows the largest improvement. Beauty is the sparsest dataset, so many of its items occur infrequently. This result suggests that using the relational information can be helpful for predicting such infrequent items.
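The two metrics can be computed from the rank of the target item among the sampled negatives. The sketch below assumes one target item per test case, so NDCG@K reduces to $1/\log_2(\mathrm{rank} + 2)$ for a zero-based rank inside the top K; the function names and the toy scores are hypothetical.

```python
import math

def target_rank(target_score, negative_scores):
    # Zero-based rank: how many sampled negatives score strictly higher.
    return sum(1 for s in negative_scores if s > target_score)

def hit_rate_at_k(rank, k):
    return 1.0 if rank < k else 0.0

def ndcg_at_k(rank, k):
    # With a single relevant item, the ideal DCG is 1.
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0

def evaluate(cases, k=10):
    # cases: list of (target_score, negative_scores) pairs, one per test user.
    ranks = [target_rank(t, negs) for t, negs in cases]
    hr = sum(hit_rate_at_k(r, k) for r in ranks) / len(ranks)
    ndcg = sum(ndcg_at_k(r, k) for r in ranks) / len(ranks)
    return hr, ndcg

# Two hypothetical test cases with a handful of negatives each.
cases = [(0.9, [0.1, 0.5]), (0.2, [0.3, 0.4, 0.1])]
hr_at_2, ndcg_at_2 = evaluate(cases, k=2)
hr_at_3, ndcg_at_3 = evaluate(cases, k=3)
```

In the setting described above, each case would contain 100 sampled negatives instead of the two or three used in this toy example.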
We compared the kernel function combinations on the Beauty and MovieLens datasets. We consider Beauty a representative sparse dataset and MovieLens a representative dense dataset. Table 3 shows the performance of each kernel function. We conjecture that it is hard to learn the representations of items and users from a sparse dataset with short sequences. Accordingly, RKSA with the counting kernel function shows the best performance on the sparse dataset. In contrast, it is relatively easy to learn the representations of items and users from a dense dataset, and Table 3 shows that the combination of the item and the user kernels performs best there.
Item Embedding and Correlation Matrix
The item kernel utilizes the dependency between the items at each timestep. When learning with the co-occurrence loss, the kernel hyperparameters and the item embedding capture the relational information of the co-occurrence. Figure 3a illustrates the item embeddings of movies. Item embeddings of the same genre are located close together.
We generate a synthetic sequence to analyze the correlation from the trained kernel function. We use the combination of the counting and the item kernels without the user kernel because the sequence is synthetic. The synthetic sequence includes four different movie series and an animation movie. Figure 3b shows that movies belonging to the same series have high correlations. In contrast, the correlations between the animation genre and the other genres are low.
Finally, we observed the weights of the counting, the item, and the user kernels (see Figure 4a) because the kernel weights also contribute to the construction of the correlation matrix. Since each dataset has different characteristics, each dataset emphasizes the counting, the item, and the user relations differently. Interestingly, the user kernel, rather than the counting kernel, was the dominant kernel in MovieLens. MovieLens is a relatively dense dataset with respect to the average number of actions per user, as shown in Table 1. Our proposed model, RKSA, adapts well to the properties of the dataset and focuses on the user kernel instead of the other kernels on the MovieLens dataset.
Predicted ranking of infrequent items
A sparse dataset, like Beauty, has many infrequent items, which are difficult to predict because of their information sparsity. To overcome this problem, RKSA utilizes the relational information of the whole dataset in the prediction, instead of that of a single sequence. Figure 4b shows that, compared to SASRec, RKSA ranks the target item higher as the information sparsity worsens.
Attention Weight Case Study
Figure 5 shows the attention weights of SASRec and RKSA together with the co-occurrence information between the last item and each item of the sequence. The sequence instance in Figure 5 has high co-occurrence values at timesteps 0, 1, 2, and 5, and Figure 5 confirms that RKSA places higher attention values than SASRec at those timesteps. In the opposite case, the attention weight of RKSA is lower than that of SASRec.
We present relation-aware kernelized self-attention (RKSA) for the sequential recommendation task. RKSA introduces a new self-attention mechanism that is stochastic as well as kernelized by the relational information. While past attention mechanisms are deterministic, we introduce a latent variable into the attention. Moreover, the latent variable utilizes the kernelized correlation matrix, so the kernel can be extended to include further relational information and modeling. With these innovations, RKSA achieved the best performance in all experimental settings. We expect further developments on the stochastic attention of the Transformer in the near future.
This research was supported by Basic Science Research Program through the National Research Foundation of Korea(NRF) funded by the Ministry of Education (NRF-2018R1C1B6008652).
[1] (1999) Statistical applications of the multivariate skew normal distribution. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61 (3), pp. 579–602.
[2] (1996) The multivariate skew-normal distribution. Biometrika 83 (4), pp. 715–726.
[3] (2016) Layer normalization. arXiv preprint arXiv:1607.06450.
[4] (2007) Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pp. 129–136.
[5] (2019) Behavior sequence transformer for e-commerce recommendation in Alibaba. arXiv preprint arXiv:1905.06874.
[6] (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[7] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[8] (2016) Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, pp. 507–517.
[9] (2017) Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, pp. 173–182.
[10] (2015) Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939.
[11] (2018) CSAN: contextual self-attention network for user sequential recommendation. In 2018 ACM Multimedia Conference on Multimedia Conference, pp. 447–455.
[12] (2018) Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM), pp. 197–206.
[13] (2017) Neural attentive session-based recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1419–1428.
[14] (2003) Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing (1), pp. 76–80.
[15] (2018) STAMP: short-term attention/memory priority model for session-based recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1831–1839.
[16] (2015) Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 43–52.
[17] (2003) Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71.
[18] (2009) BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 452–461.
[19] (2019) Hierarchical context enabled recurrent neural network for recommendation. In Proceedings of the AAAI.
[20] (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
[21] (2018) Deep semantic role labeling with self-attention. In Thirty-Second AAAI Conference on Artificial Intelligence.
[22] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov).
[23] (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
[24] (2018) Attention-based transactional context embedding for next-item recommendation. In Thirty-Second AAAI Conference on Artificial Intelligence.
[25] (2019) Context-aware self-attention networks. arXiv preprint arXiv:1902.05766.
[26] (2018) Sequential recommender system based on hierarchical attention networks. In the 27th International Joint Conference on Artificial Intelligence.
[27] (2019) Multi-order attentive ranking model for sequential recommendation. In Proceedings of the AAAI.
[28] (2018) Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318.
[29] (2019) Next item recommendation with self-attentive metric learning. In Thirty-Third AAAI Conference on Artificial Intelligence, Vol. 9.
[30] (2018) ATRank: an attention-based user behavior modeling framework for recommendation. In Thirty-Second AAAI Conference on Artificial Intelligence.