Sequential Recommendation with Relation-Aware Kernelized Self-Attention

11/15/2019 ∙ by Mingi Ji, et al. ∙ KAIST Department of Mathematical Sciences

Recent studies have found that sequential recommendation is improved by the attention mechanism. Following this development, we propose Relation-Aware Kernelized Self-Attention (RKSA), which augments the self-attention mechanism of the Transformer with a probabilistic model. The original self-attention of the Transformer is a deterministic measure without relation-awareness. Therefore, we introduce a latent space to the self-attention, and the latent space models the recommendation context from relations as a multivariate skew-normal distribution with a kernelized covariance matrix built from co-occurrences, item characteristics, and user information. This work merges the self-attention of the Transformer and sequential recommendation by adding a probabilistic model of the recommendation task specifics. We evaluated RKSA on benchmark datasets, and RKSA shows significant improvements over recent baseline models. RKSA also produces a latent space model that helps explain the reasons for a recommendation.






Introduction

Recommendation is one of the key application areas of artificial intelligence in the big data era. Recommendation tasks are supported by large-scale data, and users need to select a specific item from many alternatives. This selection requirement motivates the use of the attention mechanism in the recommendation task. Attention is applied to item selection, and sequential recommendation in particular uses the attention mechanism to select which past item choices to consider when making a recommendation at the current timestep [24, 15, 13, 26, 27, 11].

Given the relationship between attention and recommendation, adopting new attention mechanisms for recommendation has been a research trend. For instance, Self-Attentive Sequential Recommendation (SASRec) [12] adopted the self-attention mechanism of the Transformer [23] for the recommendation task. This adaptation is interesting, but it was customized to the task specifics only to a limited extent. Recommendation often requires understanding items, users, browsing sequences, etc., and recommendation models need to consider such contexts, which SASRec does not provide. Following SASRec, there have been developments in using the self-attention mechanism of the Transformer to model task-specific features of sequential recommendation. For example, ATRank [30] utilized the self-attention mechanism to consider the influences of heterogeneous behavior representations. To model the user's short-term intent, AttRec [29] applied the self-attention mechanism to the user interaction history. Similar to ATRank and AttRec, BST [5] used the self-attention mechanism to aggregate auxiliary user and item features.

Figure 1: Each entry of the co-occurrence matrix is the number of users in the MovieLens dataset whose sequences contain both movies of the pair. Many users watched the Star Wars movies together. When Star Wars 6 is the query, this co-occurrence information allows the attention weight to be modified from the blue values to the red values.

Given the success of self-attention [21, 6, 28], the recommendation task can be improved by the sequential information, which was used only to a limited extent in previous works. Moreover, using the sequential information provides a new way to customize the self-attention structure for the recommendation task. Figure 1 gives an example in which the co-occurrence information may influence the attention weight. Some movie pairs have a higher co-occurrence than others, and such a pair should inform the attention mechanism to increase the weight.

We renovate and customize the self-attention of the Transformer with a latent space model. Specifically, we add a latent space to the self-attention value of the Transformer, and we use the latent space to model the context from relations of the recommendation task. The latent space is modeled as a multivariate skew-normal (MSN) distribution [2] whose dimension is the number of unique items in the sequence. The covariance matrix of the MSN distribution is the variable with which we model the relations of a sequence, items, and a user, through a kernel function that provides flexibility in adapting to the recommendation task. After the kernel modeling, we provide the reparametrization of the MSN distribution to enable amortized inference on the introduced latent space. Since the relation modeling is done with kernelization, we call this model relation-aware kernelized self-attention (RKSA). We designed RKSA with three innovations. First, the deterministic Transformer may not generalize well in recommendation because of sparsity, so we added a latent dimension and its corresponding reparameterization. Second, the covariance modeling with the relation-aware kernel enables a more fundamental adaptation of the self-attention to recommendation. Third, the kernelized latent space of the self-attention provides reasoning on the recommendation result. RKSA is evaluated against eight baseline models, including SASRec, HCRNN, and NARM, on five benchmark datasets, including Amazon review, MovieLens, and Steam. Our experiments showed that RKSA consistently and significantly improves the performance over the baselines on the benchmarks.


Multi-Head Attention

We start the preliminaries by reviewing the self-attention structure that is the backbone of RKSA. Recently, [23] proposed the scaled dot-product attention, which is defined by Equation 1, where $Q$, $K$, and $V$ are the query, the key, and the value matrices, respectively. The scaled dot-product attention calculates importance weights from the dot product of the queries with the keys, scaled by $1/\sqrt{d}$, where $d$ is the dimension of the keys. This importance is normalized by the softmax, and the normalized importance is then multiplied by the values to form the scaled dot-product attention.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right) V \tag{1}$$


When the query, the key, and the value are all computed from the same input matrix $X$ through learnable projection matrices $W^{Q}$, $W^{K}$, and $W^{V}$, as in Equation 2, the scaled dot-product attention is called self-attention. A self-attention with an additional predefined or learnable positional embedding [23, 12] is able to capture the latent information of the position, as earlier recurrent networks do.

$$\mathrm{SA}(X) = \mathrm{Attention}(XW^{Q}, XW^{K}, XW^{V}) \tag{2}$$
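The two definitions above can be illustrated with a minimal sketch in plain Python. The helper names (`matmul`, `self_attention`, and so on), the toy input, and the identity projection matrices are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(row):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    # Plain nested-list matrix product: (n x m) times (m x p).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def scaled_dot_product_attention(q, k, v):
    d = len(k[0])  # key dimension used for the 1 / sqrt(d) scaling
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights

def self_attention(x, w_q, w_k, w_v):
    # Query, key, and value are all projections of the same input x.
    return scaled_dot_product_attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))

# Toy input: three timesteps with two-dimensional representations (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` sums to one, and the first timestep attends equally to itself and to the third timestep because both keys have the same dot product with its query.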


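Multi-head attention uses $h$ scaled dot-product attentions, each with an $h$ times smaller dimension for the attention weight parameters. Following [23], the outputs of the heads are concatenated and projected by an output matrix $W^{O}$. [23] found that multi-head attention is useful even though it uses a similar number of parameters to single-head attention.

$$\mathrm{MultiHead}(X) = [\mathrm{head}_{1}; \dots; \mathrm{head}_{h}]\,W^{O}, \qquad \mathrm{head}_{i} = \mathrm{Attention}(XW_{i}^{Q}, XW_{i}^{K}, XW_{i}^{V}) \tag{3}$$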



Prior work considered the dependencies, i.e. item co-occurrence, between the temporal state representations over a single sequence with the scaled dot-product attention. That model introduces a context vector that is linearly combined with the attention inputs in the self-attention. We expand this context modeling with stochasticity and a kernel method to add flexibility to the self-attention.

Multivariate Skew-Normal Distribution

As mentioned for the latent space model of RKSA, we introduce an explicit probability density model to the self-attention structure. Here, we choose the multivariate skew-normal (MSN) distribution as the explicit density because we intend to model 1) the covariance structure between items; and 2) the skewness of the attention value. It would be natural to consider the multivariate normal distribution for the covariance model, but the normal distribution cannot model skewness because it enforces a symmetric density. As the name suggests, the MSN distribution reflects the skewness through a shape parameter [2]. The MSN distribution needs four parameters: location $\xi$, scale $\omega$, correlation $\psi$, and shape $\alpha$. Following [1], a $k$-dimensional random variable $x$ follows the MSN distribution with the location parameter $\xi \in \mathbb{R}^{k}$; the correlation matrix $\psi \in \mathbb{R}^{k \times k}$; the scale parameter $\omega = \mathrm{diag}(\omega_{1}, \dots, \omega_{k})$; and the shape parameter $\alpha \in \mathbb{R}^{k}$, as in Equation 4.

$$f(x) = 2\,\phi_{k}(x; \xi, \Sigma)\,\Phi\!\left(\alpha^{\top}\omega^{-1}(x - \xi)\right) \tag{4}$$

Here, $\Sigma = \omega\psi\omega$ is the covariance matrix; $\phi_{k}$ is the $k$-dimensional multivariate normal density with the mean $\xi$ and the covariance $\Sigma$; and $\Phi(\cdot)$ is the cumulative distribution function of $\mathcal{N}(0, 1)$. If $\alpha$ is a zero vector, the distribution reduces to the multivariate normal distribution with the mean $\xi$ and the covariance $\Sigma$.
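To make the role of the shape parameter concrete, the following sketch evaluates the skew-normal density in the one-dimensional special case, where the covariance reduces to the squared scale. The function names and the midpoint-rule integrator are illustrative assumptions; the multivariate density used by RKSA is not implemented here.

```python
import math

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def normal_cdf(z):
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def skew_normal_pdf(x, xi, omega, alpha):
    # One-dimensional case: 2 * phi(x; xi, omega^2) * Phi(alpha * (x - xi) / omega).
    return 2.0 * normal_pdf(x, xi, omega) * normal_cdf(alpha * (x - xi) / omega)

def integrate(f, lo, hi, steps=20000):
    # Midpoint rule, accurate enough for this smooth density.
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) for i in range(steps)) * h

# With a zero shape the density coincides with the normal density.
symmetric_gap = skew_normal_pdf(0.3, 0.0, 1.0, 0.0) - normal_pdf(0.3, 0.0, 1.0)

# With a positive shape most of the probability mass lies above the location.
mass_above_location = integrate(lambda x: skew_normal_pdf(x, 0.0, 1.0, 3.0), 0.0, 10.0)
```

A zero shape recovers the symmetric normal density, while a positive shape pushes mass above the location, which is the skewness that a normal distribution cannot express.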

Figure 2: (a) Graphical notation of RKSA, including the parameters of the MSN distribution; the dashed line denotes the sampling procedure. (b) The overall structure of RKSA with the MSN parameters. The scaled dot-product denotes the matrix multiplication between the query and the key matrices in the scaled dot-product attention.

Kernel Function

Given that we intend to model the covariance of the MSN, we introduce how we provide the flexible covariance structure through kernels. A kernel function, $k(\cdot, \cdot)$, maps a pair of observations in the observation space $\mathcal{X}$ to a real value. In machine learning, kernel functions are widely used to compute the similarity between two data points, collected into a covariance matrix. Given observations $x_{1}, \dots, x_{n} \in \mathcal{X}$, a function $k$ is a valid kernel if and only if it is (1) symmetric: $k(x_{i}, x_{j}) = k(x_{j}, x_{i})$ for all $i, j$; and (2) positive semi-definite: $\sum_{i}\sum_{j} c_{i} c_{j} k(x_{i}, x_{j}) \ge 0$ for all $c \in \mathbb{R}^{n}$ [17]. We apply a customized kernel function to model the relational covariance parameter of the MSN in RKSA, and we provide proofs of the validity of our customized kernels.
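The two validity conditions can be spot-checked numerically on a Gram matrix. The sketch below is only an illustration: the RBF kernel, the sample points, and the random-vector test are assumptions made for the example, and a random spot check is a necessary condition rather than a proof of positive semi-definiteness.

```python
import math
import random

def rbf_kernel(x, y, length_scale=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))

def gram_matrix(points, kernel):
    return [[kernel(p, q) for q in points] for p in points]

def is_symmetric(matrix, tol=1e-12):
    n = len(matrix)
    return all(abs(matrix[i][j] - matrix[j][i]) <= tol for i in range(n) for j in range(n))

def quadratic_form(matrix, c):
    # Computes sum_i sum_j c_i * c_j * K_ij.
    n = len(matrix)
    return sum(c[i] * c[j] * matrix[i][j] for i in range(n) for j in range(n))

def passes_psd_spot_check(matrix, trials=200, seed=0, tol=1e-9):
    # Tries random coefficient vectors; one negative quadratic form disproves PSD.
    rng = random.Random(seed)
    n = len(matrix)
    for _ in range(trials):
        c = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        if quadratic_form(matrix, c) < -tol:
            return False
    return True

points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [1.5, 1.5]]  # hypothetical observations
rbf_gram = gram_matrix(points, rbf_kernel)
not_a_kernel = [[1.0, 2.0], [2.0, 1.0]]  # symmetric, but not positive semi-definite
```

The RBF Gram matrix passes both checks, while `not_a_kernel` is symmetric yet fails the quadratic-form test for coefficient vectors such as (1, -1).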


This section explains the sequential recommendation task, the overall structure of Relation-Aware Kernelized Self Attention (RKSA), and its detailed parameter modeling.

Problem Statement

A sequential recommendation uses datasets built upon the past action sequence of each user. Let $\mathcal{U}$ be a set of users; let $\mathcal{I}$ be a set of items; and let $S^{u}$ be the action sequence of a user $u \in \mathcal{U}$. The task of sequential recommendation is to predict the next item that the user will interact with, given $S^{u}$.

Self-Attention Block

We propose Relation-Aware Kernelized Self-Attention (RKSA), which is a modification of the self-attention structure embedded in the Transformer [23]. Figure 2 illustrates that RKSA is a customized self-attention based on relations, such as the item, the user, and the global co-occurrence information. The detailed procedure is explained below.

Embedding Layer

Since the raw data of items and interactions follow a sparse one-hot encoding, we need to embed the information of items and the positions of interactions. To create such embeddings, we use the latest $n$ actions from the user sequence. Specifically, the item embedding matrix is defined as $E \in \mathbb{R}^{|\mathcal{I}| \times d}$, where $|\mathcal{I}|$ is the number of items and $d$ denotes the dimensionality of the embedding. E is estimated by a hidden layer as a part of the modeled neural network, and the raw input to the hidden layer is the one-hot encoding of the interacted item at time $t$. Similarly, we set a user embedding U to make a distinction between users. Also, we define a positional embedding matrix as $P \in \mathbb{R}^{n \times d}$ to introduce the sequential ordering information of the interactions, following the ideas from [12]. P and U are also estimated by a hidden layer that matches the dimensionality of E for the further construction of the model inputs.

Afterward, we estimate the inputs to RKSA, and the input should convey the representation of items and positions in the sequences. Here, the item at time $t$ is represented by a vector $x_{t}$ as a timestep of the sequence, which is the input to RKSA. $x_{t}$ is estimated through the summation of the item embedding and the positional embedding at time $t$. Finally, the input item sequence is expressed as the matrix $X$ that stacks $x_{1}, \dots, x_{n}$, combining the item embedding E and the positional embedding P.

Relation-Aware Kernelized Self-Attention

The core component of RKSA is the multi-head attention structure that includes a latent variable $Z$. Given that Equation 1 is deterministic, we turn the scaled dot product of the queries and the keys, $QK^{\top}/\sqrt{d}$, into a single latent variable $Z$. The changed part is originally the alignment score of the attention mechanism, so its range becomes $(-\infty, \infty)$. Additionally, we assume that there is a skewed shape in the alignment score distribution, so we designed $Z$ to follow the multivariate skew-normal (MSN) distribution, as in Equation 5. In other words, we sample the logit of the softmax function from the MSN distribution.

$$Z \sim \mathrm{MSN}(\xi, \Sigma, \alpha), \qquad H = \mathrm{softmax}(Z)\,V \tag{5}$$


In the above, the parameters of the MSN distribution are the location, the covariance, and the shape. The details of the parameters are explained in Section Parameter Modeling. Additionally, the parameters depend on the items in the sequence and on the co-occurrence matrix used by our kernel model, which is explained in Section Covariance. The co-occurrence matrix is constructed by counting the number of co-occurrences between item pairs in the whole dataset. We follow amortized inference with a reparametrization of the MSN, and the items in the sequence and the co-occurrence matrix are used as inputs to the inference.
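The counting step can be sketched as follows. The sketch assumes, in line with the caption of Figure 1, that a pair is counted once per user whose sequence contains both items; the function names and the toy sequences are hypothetical.

```python
from collections import Counter
from itertools import combinations

def co_occurrence_counts(user_sequences):
    # item_counts[a]: number of users whose sequence contains item a.
    # pair_counts[(a, b)]: number of users whose sequence contains both a and b (a < b).
    item_counts = Counter()
    pair_counts = Counter()
    for sequence in user_sequences:
        unique_items = sorted(set(sequence))  # count each item once per user
        item_counts.update(unique_items)
        pair_counts.update(combinations(unique_items, 2))
    return item_counts, pair_counts

def pair_count(pair_counts, a, b):
    # Pairs are stored in sorted order, so the lookup is order-independent.
    return pair_counts[tuple(sorted((a, b)))]

# Hypothetical user sequences.
sequences = [
    ["sw4", "sw5", "sw6"],
    ["sw5", "sw6", "toy_story", "sw6"],
    ["sw6", "toy_story"],
]
item_counts, pair_counts = co_occurrence_counts(sequences)
```

Storing only the pairs that actually occur keeps the structure sparse, which matters when the item set is large and most pairs never co-occur.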

Lastly, the output of RKSA is the hidden representation H defined in Equation 5, and the value matrix is estimated from the input item sequence representation. Since we modify only the scaled dot-product attention, RKSA is easily extended to a variant of multi-head attention by following the same procedure as the multi-head attention equation in Section Multi-Head Attention.

Point-Wise Feed-Forward Network

We apply the point-wise feed-forward network of the Transformer to the output of RKSA at each position. The point-wise feed-forward network consists of two linear transformations with a ReLU nonlinear activation function between them. The final output of the point-wise feed-forward network is $F = \mathrm{ReLU}(HW_{1} + b_{1})W_{2} + b_{2}$, where $W_{1}$, $W_{2}$ and $b_{1}$, $b_{2}$ are the weights and the biases of the two linear transformations.

Besides the above modeling structure, we stacked multiple self-attention blocks to learn complex transition patterns, and we added residual connections [7] to train a deeper network structure. We also applied layer normalization [3] and dropout [20] to the output of each layer, following [23].

Output Layer

Let $B$ be the number of self-attention blocks. The task requires predicting the $(t+1)$-th item with the $t$-th output of the $B$-th self-attention block. We use the same weights as the item embedding layer to rank the item prediction. The relevance score of item $i$ is defined as $r_{i,t}$:

$$r_{i,t} = F^{(B)}_{t} E_{i}^{\top}$$

$F^{(B)}_{t}$ denotes the $t$-th output of the last self-attention block, and $E_{i}$ is the embedding of item $i$. The prediction ranking of the $(t+1)$-th item is defined by the ranking of the items' relevance scores.
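As a small illustration of this scoring step, the sketch below takes one output vector of the last block and a shared item embedding table, computes a dot-product relevance score per item, and ranks the items. All names and numbers are hypothetical.

```python
def relevance_scores(final_output, item_embeddings):
    # One dot product per candidate item, reusing the item embedding table.
    return [sum(f * e for f, e in zip(final_output, emb)) for emb in item_embeddings]

def rank_items(scores):
    # Item indices sorted from the most to the least relevant.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

final_output = [0.5, -1.0, 2.0]  # hypothetical output at the last timestep
item_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
scores = relevance_scores(final_output, item_embeddings)
ranking = rank_items(scores)  # [2, 3, 0, 1]
```

Sharing the embedding table between the input layer and the output layer means that no separate output weight matrix has to be learned for the ranking.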

Parameter Modeling

This section details the modeling of the MSN parameters, which are used for the latent variable in RKSA.


Location

The location $\xi$ plays the same role as the mean of a multivariate normal distribution. Given that we use the MSN to sample the alignment score, we still need to provide the deterministic alignment score with the highest likelihood. Therefore, we set the location parameter to the deterministic alignment score:

$$\xi = \frac{QK^{\top}}{\sqrt{d}}$$

We can also apply an activation function and a scaling to the location.


Covariance

The covariance $\Sigma$ represents the relation between items. While $\Sigma$ is a square matrix of parameters, it has a limited size because we only use the latest $n$ items, and because there are not many unique items in those latest interactions. The relation can be measured by various methods, ranging from simple co-occurrence counting to a non-linear kernel function. This paper designs a kernel function to measure the relation between a pair of items because a kernel function is known to be an efficient, nonlinear, high-dimensional similarity measure that can also be learned by optimizing the kernel hyperparameters.

We compose a kernel function by considering the relations of the co-occurrence, the item, and the user. For a given sequence and a pair of timesteps $i$ and $j$, we use the normalized representations of the items at the two timesteps. Additionally, we infer the variance of the latent variable at timesteps $i$ and $j$ by amortized inference, as in Equation 8.

In the above, we use softplus as the activation function of the standard deviation so that the standard deviation is positive. The following defines three different kernel functions, each written in a shorthand form over the pair of timesteps for simplicity.

  • Counting kernel is defined by the co-occurrence number of each item pair. It is computed from the number of occurrences of each of the two items and the number of co-occurrences of the pair.

  • Item kernel utilizes the representation of each item. There are two alternative kernels: the linear item kernel, which is based on the dot product of the two item representations; and the Radial Basis Function (RBF) kernel.

  • User kernel utilizes the representations of the items and the user. It is computed from the item representations, the user embedding, and a weight matrix, combined through the Hadamard product.

Unlike the item and the user kernels, the validity of the counting kernel should be checked because it does not have a well-known form such as the linear or the RBF kernel. The counting kernel is always symmetric and positive semi-definite. Therefore, the counting kernel is a valid kernel function.

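Because a sum of valid kernel functions is again a valid kernel function, we combine the kernel functions by summation to make the final kernel function flexible. The final kernel function is defined as the combination of the counting, the item, and the user kernels.

With the above kernel function, the kernel values between all pairs of timesteps form our model of the correlation matrix, analogous to the correlation matrix in Equation 4.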
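The summation property can be illustrated with a short sketch that adds a linear and an RBF Gram matrix with non-negative weights and then rescales the result to unit diagonal. The weights, the choice of the two kernels, and the normalization step are assumptions for illustration; they are not the final kernel of RKSA, whose exact form is not reproduced here.

```python
import math

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, length_scale=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))

def combined_gram(points, weights=(0.5, 0.5)):
    # A non-negative weighted sum of valid kernels is again a valid kernel.
    w_lin, w_rbf = weights
    return [[w_lin * linear_kernel(p, q) + w_rbf * rbf_kernel(p, q) for q in points]
            for p in points]

def to_correlation(gram):
    # Rescale so that every diagonal entry equals one.
    n = len(gram)
    scale = [math.sqrt(gram[i][i]) for i in range(n)]
    return [[gram[i][j] / (scale[i] * scale[j]) for j in range(n)] for i in range(n)]

points = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]  # hypothetical item representations
gram = combined_gram(points)
correlation = to_correlation(gram)
```

Because the combined Gram matrix is positive semi-definite, every off-diagonal entry of the rescaled matrix lies between -1 and 1, as a correlation matrix requires.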

This section describes the covariance modeling with the final kernel, so the kernel hyperparameters need to be inferred. Since learning the kernel hyperparameters requires supervision, the loss of the recommendation task needs to be augmented with an additional loss that guides the kernel hyperparameters. Therefore, we modeled a loss that regularizes the covariance toward the item co-occurrence. Since we have other loss terms, i.e. the recommendation loss, the learned correlation does not become identical to the item co-occurrence, but the co-occurrence loss can act as prior knowledge. In particular, we measure the co-occurrence loss with the listwise ranking loss to align the ranking of the correlations with the ranking of the item co-occurrences. The co-occurrence loss is defined as maximizing the listwise ranking loss [4].
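For orientation, the sketch below implements the top-one listwise loss of [4] as a cross-entropy between two softmax distributions, written as a quantity to minimize. How the paper signs and applies this loss to correlations and co-occurrence counts is not recoverable from the text, so the inputs and the function name are assumptions.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def listwise_top_one_loss(predicted_scores, target_scores):
    # Cross-entropy between the top-one probabilities of the targets and the predictions.
    p_target = softmax(target_scores)
    p_pred = softmax(predicted_scores)
    return -sum(t * math.log(p) for t, p in zip(p_target, p_pred))

target = [3.0, 1.0, 0.0]  # hypothetical co-occurrence-based scores
aligned = listwise_top_one_loss([3.0, 1.0, 0.0], target)
reversed_order = listwise_top_one_loss([0.0, 1.0, 3.0], target)
```

The loss is smallest when the predicted scores induce the same top-one distribution as the targets, so `aligned` is lower than `reversed_order`.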


Shape

The shape parameter $\alpha$ reflects the relation between the final item and each item in a user sequence. We assign one element of $\alpha$ to each item in the sequence. We define $\alpha$ by introducing a ratio parameter computed from the co-occurrence matrix, and a learnable scaling parameter. Specifically, we assume that each element of $\alpha$ is the product of the two, which is a scaled correlation between the final item and the corresponding item.

First, we calculate the ratio parameter from the co-occurrence matrix by the summation of the linear alignments between the item at the last timestep and the aligned item. Here, let $C_{ij}$ be the value in the $i$-th row and the $j$-th column of the co-occurrence matrix $C$.

Equation Shape computes the ratio parameter by the dot product between the row of the last item and the column of the aligned item in the co-occurrence matrix, which means that we calculate the correlation between the co-occurrences of the two items. However, the co-occurrence of an item with itself is semantically meaningless in $C$, so in such cases we use the average of the remaining elements in each row in the dot-product process. The ratio parameter enables modeling the two-hop dependency between the last item and the aligned item through intermediate items.

Second, Equation 11 defines the scaling parameter.

We can apply the softplus activation to the scaling parameter, so the shape parameter becomes positive.

Model Inference

Loss Function

Given the above model structure, this subsection introduces the inference on the latent variable following the MSN distribution. It is well known that a latent variable can be inferred by optimizing the evidence lower bound obtained from Jensen's inequality, so we optimize the evidence lower bound on the marginal log-likelihood when predicting the $(t+1)$-th item. Equation 12 describes the loss function of this prediction task.


We utilize the binary cross-entropy loss with negative sampling, as conducted in [12], to calculate the prediction loss. It should be noted that the actual loss function is a combination of the prediction loss and the co-occurrence loss, in which a regularization weight hyperparameter scales the co-occurrence loss.
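The combination of the two losses can be sketched as below. The sketch covers only the binary cross-entropy term with sampled negatives and the weighted sum; the remaining terms of the evidence lower bound are omitted. The function names are hypothetical, and the default weight of 0.001 is the co-occurrence loss weight reported in the experiment settings.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def prediction_loss(positive_score, negative_scores):
    # Binary cross-entropy: the observed next item is the positive example and
    # the sampled items are negatives; log(1 - sigmoid(s)) equals log_sigmoid(-s).
    loss = -log_sigmoid(positive_score)
    loss -= sum(log_sigmoid(-s) for s in negative_scores)
    return loss

def total_loss(pred_loss, co_occurrence_loss, weight=0.001):
    # The co-occurrence loss acts as a weighted regularizer.
    return pred_loss + weight * co_occurrence_loss

well_ranked = prediction_loss(5.0, [-5.0])
badly_ranked = prediction_loss(-5.0, [5.0])
```

With such a small weight, the co-occurrence term nudges the kernel hyperparameters without dominating the recommendation objective.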

Reparametrization of the Latent Variable

We sample the values of the latent variable from the MSN distribution using the reparameterization trick. Equation 13 shows the reparametrization of the MSN distribution with samples from two normal distributions.

This reparametrization is needed because the latent variable must be instantiated for the forward pass. Equation 13 shows how to sample the latent variable given the amortized inference parameters: the location, the scale, the correlation, and the shape. Once the forward pass is enabled, the neural network can be trained via back-propagation.
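The idea of sampling from two normal draws can be illustrated in one dimension with the standard stochastic representation of the skew-normal distribution. This is not the multivariate reparametrization of Equation 13; the function names and parameter values are assumptions for illustration.

```python
import math
import random

def sample_skew_normal(xi, omega, alpha, rng):
    # All randomness comes from two standard normal draws; the sample is then
    # a deterministic function of the location, the scale, and the shape.
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    u0 = rng.gauss(0.0, 1.0)
    u1 = rng.gauss(0.0, 1.0)
    z = delta * abs(u0) + math.sqrt(1.0 - delta * delta) * u1
    return xi + omega * z

def expected_mean(xi, omega, alpha):
    # Closed-form mean of the one-dimensional skew-normal distribution.
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    return xi + omega * delta * math.sqrt(2.0 / math.pi)

def sample_mean(xi, omega, alpha, n, seed):
    rng = random.Random(seed)
    return sum(sample_skew_normal(xi, omega, alpha, rng) for _ in range(n)) / n

empirical = sample_mean(1.0, 2.0, 3.0, 20000, 0)
theoretical = expected_mean(1.0, 2.0, 3.0)
```

Because the parameters enter only through deterministic arithmetic on the two noise draws, gradients with respect to the location, the scale, and the shape can flow through the sample.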

Experiment Result


Datasets

We evaluate our model on five real-world datasets: Amazon (Beauty, Games) [8, 16], CiteULike, Steam, and MovieLens. We follow the preprocessing procedure of [12] for Beauty, Games, and Steam, and the preprocessing procedure of [19] for CiteULike and MovieLens. We split all datasets into training, validation, and test sets following the procedure of [12]. Table 1 summarizes the statistics of the preprocessed datasets.

Dataset     #users    #items   #actions  avg. actions/user  avg. actions/item
Beauty      52,024    57,289   0.4m      7.6                6.9
Games       31,013    23,715   0.3m      9.3                12.1
CiteULike   1,798     2,000    0.05m     30.6               27.5
Steam       334,730   13,047   3.7m      11.0               282.5
MovieLens   4,639     930      0.2m      40.9               204.0

Table 1: Statistics of evaluation datasets.

Dataset     Metric    Pop      Item-KNN  BPR-MF   GRU4REC  NARM     HCRNN    AttRec   SASRec   RKSA
Beauty      Hit@5     0.2972   0.0885    0.0735   0.3097   0.3663   0.3643   0.3341   0.3735   0.3999*
Beauty      NDCG@5    0.1478   0.0872    0.0486   0.2257   0.2785   0.2764   0.2535   0.2846   0.2998*
Beauty      Hit@10    0.4289   0.0885    0.1285   0.4174   0.4674   0.4653   0.4222   0.4720   0.5015*
Beauty      NDCG@10   0.1882   0.0872    0.0662   0.2604   0.3111   0.3091   0.2819   0.3164   0.3326*
Games       Hit@5     0.3416   0.1969    0.1291   0.5749   0.6224   0.6229   0.5673   0.6395   0.6544*
Games       NDCG@5    0.1730   0.1892    0.0920   0.4570   0.4927   0.4955   0.4358   0.5068   0.5168*
Games       Hit@10    0.4846   0.1969    0.1919   0.6733   0.7244   0.7233   0.6812   0.7373   0.7551*
Games       NDCG@10   0.2168   0.1892    0.1121   0.4889   0.5257   0.5281   0.4727   0.5385   0.5495*
CiteULike   Hit@5     0.1318   0.3563    0.1624   0.4310   0.4457   0.4442   0.4275   0.5044   0.5308
CiteULike   NDCG@5    0.0650   0.2666    0.1107   0.2982   0.3016   0.3053   0.2891   0.3447   0.3687*
CiteULike   Hit@10    0.2144   0.3815    0.2472   0.5879   0.6150   0.6077   0.5808   0.6757   0.6893*
CiteULike   NDCG@10   0.0902   0.2751    0.1378   0.3488   0.3565   0.3583   0.3388   0.4001   0.4202*
Steam       Hit@5     0.5545   0.2964    0.5724   0.7065   0.7095   0.7136   0.5936   0.7477   0.7514
Steam       NDCG@5    0.2873   0.2724    0.4144   0.5444   0.5476   0.5516   0.4182   0.5828   0.5841
Steam       Hit@10    0.7162   0.2965    0.7083   0.8293   0.8314   0.8344   0.7491   0.8610   0.8668*
Steam       NDCG@10   0.3370   0.2724    0.4587   0.5844   0.5873   0.5909   0.4687   0.6196   0.6217
MovieLens   Hit@5     0.1521   0.2950    0.1241   0.3883   0.4057   0.4039   0.3493   0.4260   0.4361*
MovieLens   NDCG@5    0.0733   0.2019    0.0767   0.2650   0.2775   0.2770   0.2217   0.2965   0.3023*
MovieLens   Hit@10    0.2547   0.4051    0.2088   0.5487   0.5617   0.5606   0.5094   0.5873   0.5997*
MovieLens   NDCG@10   0.1044   0.2376    0.1039   0.3167   0.3278   0.3275   0.2734   0.3485   0.3552*

Table 2: Performance comparison (higher is better). The best performing model is indicated in boldface, and the second-best model is underlined. * indicates that the result is statistically significant against the second-best result based on a t-test.

Dataset  C       I       U       C+I     C+U     I+U     C+I+U
B        0.5015  0.4982  0.4958  0.5012  0.4955  0.4951  0.5011
M        0.5911  0.5966  0.5977  0.5960  0.5962  0.5997  0.5973
Table 3: Ablation study on the Beauty and MovieLens datasets. The measure is Hit@10 and C, I and U denote counting, item and user kernel function respectively. B is the Beauty dataset; and M is the MovieLens dataset.


We compared RKSA with eight baselines.

  • Pop always recommends the most popular items.

  • Item-KNN [14] recommends an item based on the measured similarity of the last item.

  • BPR-MF [18] recommends an item by the user and the item latent vectors with the matrix factorization.

  • GRU4REC [10] models the sequential user history with GRU and the specialized recommendation loss function such as Top1 and BPR loss.

  • NARM [13] focuses on both short and long-term dependency of a sequence with an attention and a modified bi-linear embedding function.

  • HCRNN [19] considers the user’s sequential interest change with the global, the local, and the temporary context modeling. It modifies the GRU cell structure to incorporate the various context modeling.

  • AttRec [29] models the short-term intent using self-attention and the long-term preference with metric learning.

  • SASRec [12] is a Transformer-based model that combines the strengths of Markov chains and RNNs. SASRec focuses on finding the relevant items adaptively with self-attention mechanisms.

Experiment Settings

For GRU4REC, NARM, HCRNN, and SASRec, we use the official code released by the corresponding authors. For GRU4REC, NARM, and HCRNN, we apply the data augmentation method proposed by NARM [13]. We use two self-attention blocks and one head for SASRec and RKSA, following the default setting of [12]. For fair comparisons, we apply the same settings for the batch size (128), the item embedding dimension (64), the dropout rate (0.5), the learning rate (0.001), and the optimizer (Adam). For the other hyperparameters, we use the settings of the original authors. For RKSA, we set the co-occurrence loss weight to 0.001. Furthermore, we use learning rate decay and early stopping based on the validation accuracy for all methods. We use the latest 50 actions of each sequence for all datasets.

Quantitative Analysis

Table 2 presents the recommendation performance of the evaluated models. We adopt two widely used measurements: Hit Rate@K and NDCG@K [9]. Because ranking every user-item pair requires heavy computation, we use 100 negative samples for the evaluation, following [12, 9]. We repeat each experiment five times and report the average for each method. The reported performance of RKSA comes from its best kernel variant, and RKSA outperforms all baseline models on all datasets and metrics. Beauty shows the biggest improvement. Beauty is the sparsest dataset, so it contains many infrequently occurring items. This result suggests that using the relational information can help predict such infrequent items.
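For a single test case with one target item ranked among sampled negatives, the two metrics can be computed as in the sketch below. The function names and scores are hypothetical, and ties are resolved in favor of the target, which is an assumption of this sketch.

```python
import math

def rank_of_target(target_score, negative_scores):
    # Zero-based rank: the number of negatives scored strictly higher than the target.
    return sum(1 for s in negative_scores if s > target_score)

def hit_at_k(rank, k):
    return 1.0 if rank < k else 0.0

def ndcg_at_k(rank, k):
    # With a single relevant item the ideal DCG is 1, so NDCG is 1 / log2(rank + 2).
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0

rank = rank_of_target(0.9, [0.95, 0.5, 0.1])  # one negative outranks the target
hit_5 = hit_at_k(rank, 5)
ndcg_5 = ndcg_at_k(rank, 5)
```

Averaging these per-case values over all test users gives Hit@K and NDCG@K; Hit@K only checks whether the target enters the top K, while NDCG@K also rewards a higher position within it.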

Ablation Study

We compared the kernel function combinations on the Beauty and MovieLens datasets. We consider Beauty a representative sparse dataset and MovieLens a representative dense dataset. Table 3 shows the performance of each kernel function combination. We assume that it is hard to learn the representations of items and users from a sparse dataset with short sequences. Accordingly, RKSA with the counting kernel function shows the best performance on the sparse dataset. In contrast, it is relatively easy to learn the representations of items and users from the dense dataset, and Table 3 shows that the combination of the item and the user kernels is best there.

Qualitative Analysis

Item Embedding and Correlation Matrix

The item kernel utilizes the dependency between the items at each timestep. When learning with the co-occurrence loss, the kernel hyperparameters and the item embedding capture the relational information of the co-occurrence. Figure 3a illustrates the item embeddings of movies. Item embeddings of the same genre are distributed close together.

We generate a synthetic sequence to analyze the correlation from the trained kernel function. We use the combination of the counting and the item kernels without the user kernel because the sequence is synthetic. The synthetic sequence includes four different movie series and an animation movie. Figure 3b shows that movies belonging to the same series have high correlations. In contrast, the correlations between the animation genre and the other genres are low.

Finally, we observed the weights of the counting, the item, and the user kernels (see Figure 4a), because the kernel weights also contribute to the construction of the correlation matrix. Since each dataset has different characteristics, each dataset emphasizes the counting, the item, and the user relations differently. Interestingly, the counting kernel is not the most dominant kernel in MovieLens; the user kernel is. MovieLens is a relatively dense dataset with respect to the average number of actions per user, as shown in Table 1. Our proposed model, RKSA, adapts well to the properties of the dataset and focuses on the user kernel instead of the other kernels on the MovieLens dataset.

Figure 3: (a) Item embedding visualization with tSNE [22] of MovieLens dataset. (b) Correlation between movies by the counting and item kernel combination.
Figure 4: (a) The weights of the counting, the item, and the user kernels for the final kernel calculation. (b) Average predicted ranking of SASRec and RKSA by item occurrence in the Beauty dataset. Larger values on the x-axis indicate groups of more frequently occurring items. RKSA predicts higher rankings for infrequent items.

Predicted ranking of infrequent items

A sparse dataset, like Beauty, has many infrequent items, which are difficult to predict because of their information sparsity. To overcome this problem, RKSA utilizes the relational information of the whole dataset, instead of a single sequence, in the prediction. Figure 4b shows that, compared to SASRec, RKSA ranks the target item higher as the information sparsity worsens.

Attention Weight Case Study

Figure 5 shows the attention weights of SASRec and RKSA together with the co-occurrence information between the last item and each item of the sequence. The sequence instance in Figure 5 has high co-occurrence values at timesteps 0, 1, 2, and 5, and Figure 5 confirms that RKSA places higher attention values than SASRec at those timesteps. In the opposite case, the attention weight of RKSA is lower than the attention weight of SASRec.

Figure 5: Attention heatmap for a user sequence of MovieLens. The first row indicates the co-occurrence, and the last item does not have co-occurrence information. If the co-occurrence between the last item (query) and an item is larger than the average co-occurrence of the sequence, we fill that timestep in black, and the rest in white. The second row is the attention weight of SASRec, and the third row is the attention weight of RKSA.


We present relation-aware kernelized self-attention (RKSA) for the sequential recommendation task. RKSA introduces a new self-attention mechanism that is stochastic as well as kernelized by the relational information. While past attention mechanisms are deterministic, we introduce a latent variable in the attention. Moreover, the latent variable utilizes the kernelized correlation matrix, so the kernel can be expanded to include further relational information and modeling. With these innovations, RKSA achieved the best performance in all experimental settings. We expect further development of stochastic attention for the Transformer in the near future.


This research was supported by Basic Science Research Program through the National Research Foundation of Korea(NRF) funded by the Ministry of Education (NRF-2018R1C1B6008652).


  • [1] A. Azzalini and A. Capitanio (1999) Statistical applications of the multivariate skew normal distribution. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61 (3), pp. 579–602. Cited by: Multivariate Skew-Normal Distribution.
  • [2] A. Azzalini and A. D. Valle (1996) The multivariate skew-normal distribution. Biometrika 83 (4), pp. 715–726. Cited by: Introduction, Multivariate Skew-Normal Distribution.
  • [3] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: Point-Wise Feed-Forward Network.
  • [4] Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li (2007) Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning, pp. 129–136. Cited by: Covariance.
  • [5] Q. Chen, H. Zhao, W. Li, P. Huang, and W. Ou (2019) Behavior sequence transformer for e-commerce recommendation in alibaba. arXiv preprint arXiv:1905.06874. Cited by: Introduction.
  • [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: Introduction.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Point-Wise Feed-Forward Network.
  • [8] R. He and J. McAuley (2016) Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th international conference on world wide web, pp. 507–517. Cited by: Datasets.
  • [9] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua (2017) Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web, pp. 173–182. Cited by: Quantitative Analysis.
  • [10] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk (2015) Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939. Cited by: 4th item.
  • [11] X. Huang, S. Qian, Q. Fang, J. Sang, and C. Xu (2018) CSAN: contextual self-attention network for user sequential recommendation. In 2018 ACM Multimedia Conference on Multimedia Conference, pp. 447–455. Cited by: Introduction.
  • [12] W. Kang and J. McAuley (2018) Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM), pp. 197–206. Cited by: Introduction, Multi-Head Attention, Embedding Layer, Loss Function, 8th item, Quantitative Analysis, Datasets, Experiment Settings.
  • [13] J. Li, P. Ren, Z. Chen, Z. Ren, T. Lian, and J. Ma (2017) Neural attentive session-based recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1419–1428. Cited by: Introduction, 5th item, Experiment Settings.
  • [14] G. Linden, B. Smith, and J. York (2003) Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet computing (1), pp. 76–80. Cited by: 2nd item.
  • [15] Q. Liu, Y. Zeng, R. Mokhosi, and H. Zhang (2018) STAMP: short-term attention/memory priority model for session-based recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1831–1839. Cited by: Introduction.
  • [16] J. McAuley, C. Targett, Q. Shi, and A. Van Den Hengel (2015) Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 43–52. Cited by: Datasets.
  • [17] C. E. Rasmussen (2003) Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71. Cited by: Kernel Function.
  • [18] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme (2009) BPR: bayesian personalized ranking from implicit feedback. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, pp. 452–461. Cited by: 3rd item.
  • [19] K. Song, M. Ji, S. Park, and I. Moon (2019) Hierarchical context enabled recurrent neural network for recommendation. In Proceedings of the AAAI, Cited by: 6th item, Datasets.
  • [20] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: Point-Wise Feed-Forward Network.
  • [21] Z. Tan, M. Wang, J. Xie, Y. Chen, and X. Shi (2018) Deep semantic role labeling with self-attention. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: Introduction.
  • [22] L. Van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of machine learning research 9 (Nov). Cited by: Figure 3.
  • [23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: Introduction, Multi-Head Attention, Multi-Head Attention, Multi-Head Attention, Point-Wise Feed-Forward Network, Self-Attention Block.
  • [24] S. Wang, L. Hu, L. Cao, X. Huang, D. Lian, and W. Liu (2018) Attention-based transactional context embedding for next-item recommendation. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: Introduction.
  • [25] B. Yang, J. Li, D. F. Wong, L. S. Chao, X. Wang, and Z. Tu (2019) Context-aware self-attention networks. arXiv preprint arXiv:1902.05766. Cited by: Multi-Head Attention.
  • [26] H. Ying, F. Zhuang, F. Zhang, Y. Liu, G. Xu, X. Xie, H. Xiong, and J. Wu (2018) Sequential recommender system based on hierarchical attention networks. In the 27th International Joint Conference on Artificial Intelligence, Cited by: Introduction.
  • [27] L. Yu, C. Zhang, S. Liang, and X. Zhang (2019) Multi-order attentive ranking model for sequential recommendation. In Proceedings of the AAAI, Cited by: Introduction.
  • [28] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2018) Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318. Cited by: Introduction.
  • [29] S. Zhang, Y. Tay, L. Yao, A. Sun, and J. An (2019) Next item recommendation with self-attentive metric learning. In Thirty-Third AAAI Conference on Artificial Intelligence, Vol. 9. Cited by: Introduction, 7th item.
  • [30] C. Zhou, J. Bai, J. Song, X. Liu, Z. Zhao, X. Chen, and J. Gao (2018) ATRank: an attention-based user behavior modeling framework for recommendation. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: Introduction.