Incorporating User Micro-behaviors and Item Knowledge into Multi-task Learning for Session-based Recommendation

06/12/2020 ∙ by Wenjing Meng, et al. ∙ Fudan University

Session-based recommendation (SR) has become an important and popular component of various e-commerce platforms; it aims to predict the next interacted item based on a given session. Most existing SR models focus only on the consecutive items a user interacts with in a session, in order to capture the transition pattern among those items. Although some of them have proven effective, the following two insights are often neglected. First, a user's micro-behaviors, such as the manner in which the user locates an item and the activities the user performs on an item (e.g., reading comments, adding to cart), offer a fine-grained and deep understanding of the user's preference. Second, item attributes, also known as item knowledge, provide side information for modeling the transition pattern among interacted items and alleviate the data sparsity problem. These insights motivate us to propose a novel SR model, MKM-SR, which incorporates user Micro-behaviors and item Knowledge into Multi-task learning for Session-based Recommendation. Specifically, a given session is modeled at the micro-behavior level in MKM-SR, i.e., as a sequence of item-operation pairs rather than a sequence of items, to capture the transition pattern within the session sufficiently. Furthermore, we propose a multi-task learning paradigm in which learning knowledge embeddings serves as an auxiliary task that promotes the major task of SR. It enables our model to obtain better session representations, resulting in more precise recommendation results. Extensive evaluations on two benchmark datasets demonstrate MKM-SR's superiority over state-of-the-art SR models, justifying the strategy of incorporating knowledge learning.







1. Introduction

Recommender systems have played a very important role in many web applications, including web search, e-commerce, entertainment and so on. On many web sites, a user often exhibits a specific short-term intention (Jannach et al., 2015). The natural characteristics of session data reflect a user’s behavior pattern and current preference precisely. Therefore, modeling a user’s current intention based on his/her behaviors in a recent session often yields satisfactory recommendation results, which is the basic principle of session-based recommendation (SR for short) (Wang et al., 2019). As a representative class of sequential recommendation, SR systems aim to predict the item that a target user will interact with next, based on the recent behaviors committed by the user in a session.

In order to achieve precise recommendations, an SR model generally uses a certain algorithm to leverage the sequential information in a session. Existing algorithms include Markov-based models and deep neural network (DNN for short) based models. For example, the basic idea of FPMC (Rendle et al., 2010) is to calculate the transition probability between the items in a session based on a Markov chain. In recent years, recurrent neural networks (RNNs for short) have been applied in different ways to learn a user’s dynamic and time-shifted preference (Hidasi et al., 2016a; Jing et al., 2017; Quadrana et al., 2017; Liu et al., 2018), exhibiting better performance than traditional methods. Although these deep SR models have proven effective, some issues still exist, as follows.

The first issue is the lack of exploiting user micro-behaviors. Most previous session-based models (Hidasi et al., 2016a; Jing et al., 2017; Liu et al., 2018) only model a session from a macro perspective, i.e., they regard a session as a sequence of items without taking users' different operations into account. Even when a user interacts with the same item in a session, the different operations committed on this item reflect the user’s different intentions in this session and different preferences for this item. In this paper, we regard a user’s specific operation on a certain item as a micro-behavior, which has a finer granularity and offers a deeper understanding of the user than an item-level macro-behavior, i.e., a user-item interaction. We use a toy example in Fig. 1, extracted from a real music dataset, to illustrate the significance of incorporating micro-behaviors into SR’s session modeling.

Figure 1. A toy example of user micro-behaviors (depicted by red dashed rectangles) in two sessions (depicted by green dashed rectangles) from a music site. Different operations may be committed on the same item sequence, reflecting a user's different intentions and preferences on a fine-grained level. Therefore, modeling a session based on user micro-behaviors consisting of items and operations helps achieve more precise SR.

In Fig. 1, the user in the first session searches for a song that is a hip-hop song by the American band Maroon 5. Then, he adds it to his playlist and favorites in turn. Next, he clicks the button Artist More to get more of Maroon 5’s songs listed on a page, and then listens to a song on that page. After adding it to his favorites, he further listens to another Maroon 5 song through the Artist More page; this song belongs to a genre different from hip-hop although it is also performed by Maroon 5. In the second session, the user also first searches for the same song, and then clicks the button Genre More to get more hip-hop songs. Next, he selects a song from the Genre More page. After adding it to his favorites, he further listens to a song which also belongs to hip-hop but is performed by a singer other than Maroon 5. Suppose that we already have the session information in the green dashed rectangles for the two sessions, and we need to predict which item the user will interact with subsequently. Under the traditional SR paradigm of modeling a session only by its historical item sequence, the two sessions are learned as the same representation, since they consist of the same songs. Thus the same item may be recommended as the next interaction for both, which is not consistent with the observations in Fig. 1. If we instead use the user micro-behaviors depicted by the red dashed rectangles, rather than items, to model the sessions, the two sessions have different representations. Based on such fine-grained session representations, a song by the same singer will be recommended in the first session, because the same-artist transition pattern recurs there; similarly, a song of the same genre will be recommended in the second session.

The second issue is the insufficient utilization of item knowledge to address the sparsity problem of user-item interactions. Since most previous SR systems model a session only based on the interacted item sequence in the session, they cannot learn session representations sufficiently when historical user-item interactions are sparse, especially for cold-start items. Inspired by (Yang et al., 2018, 2019), we can also import item attributes as side information, generally distilled from open knowledge graphs (KGs for short) and recognized as item knowledge in this paper, to alleviate the data sparsity problem. Furthermore, the item transitions in terms of micro-behaviors can also be indicated by item knowledge. Recalling the example in Fig. 1, the transition between the two songs is indicated by two knowledge triplets of the form <·, song-artist, Maroon 5>. Therefore, incorporating item knowledge is helpful for micro-behavior based modeling. In this paper, we regard such latent relationships among the items in a session as semantic correlations.

To address the above issues, we propose a novel SR model, MKM-SR, which incorporates user Micro-behaviors and item Knowledge into Multi-task learning for Session-based Recommendation. In MKM-SR, a user's sequential pattern in a session is modeled at the micro-behavior level rather than the item level. Specifically, a session is constituted by a sequence of user micro-behaviors, each of which is a combination of an item and its corresponding operation. As a result, learning item embeddings and operation embeddings is the premise of learning micro-behavior embeddings, based on which a session's representation is generated. For this goal, we feed the operation sequence and the item sequence of a session into a Gated Recurrent Unit (GRU for short) (Cho et al., 2014) and a Gated Graph Neural Network (GGNN for short) (Li et al., 2016; Wu et al., 2019), respectively. We adopt different learning mechanisms in this step due to the different characteristics of operations and items. Moreover, we incorporate item knowledge to learn better item embeddings via TransH (Wang et al., 2014), which has proven to be an effective KG embedding model for handling many-to-many/one relations. Unlike previous models that use knowledge embeddings as pre-trained item embeddings (Wang et al., 2018b; Yang et al., 2019, 2018), we take knowledge embedding learning as an auxiliary task within a multi-task learning (MTL for short) paradigm whose major task is to predict the next interacted item. Our extensive experiments verify that our MTL paradigm is more effective at promoting SR performance than the previous pre-training paradigm.

In summary, our contributions in this paper are as follows:

1. In order to improve SR performance, we incorporate user micro-behaviors into session modeling to capture the transition pattern between the successive items in a session on a fine-grained level, and further investigate the effects of different algorithms on modeling micro-behaviors.

2. We incorporate item knowledge into our model through an MTL paradigm which takes knowledge embedding learning as an auxiliary (sub) task of SR. Furthermore, we validate an optimal training strategy for our MTL through extensive comparisons.

3. We provide deep insights on the rationales of our model’s design mechanisms and extensive evaluation results over two realistic datasets (KKBOX and JDATA), to justify our model’s superiority over the state-of-the-art recommendation models.

In the rest of this paper, we introduce related work in Section 2 followed by the detailed description of our model in Section 3. We display our extensive experiment results in Section 4 and conclude our work in Section 5.

2. Related Work

In this section, we provide a brief overview of the research related to our work in this paper.

2.1. Session-based and Sequential Recommendation

As a sub-task of sequential recommendation, the objective of SR is to predict the successive item(s) that an anonymous user is likely to interact with, according to the implicit feedback in a session. In general, there are two major classes of approaches to leveraging sequential information from users' historical records: Markov-based models and DNN-based models. In the first class, (Shani et al., 2005; Huang et al., 2009) use Markov chains to capture sequential patterns between consecutive user-item interactions. (Sarwar et al., 2001; Linden et al., 2003) try to characterize users' latest preferences with the last click, but neglect the previous clicks and discard the useful information in the long sequence. Rendle proposed a hybrid model, FPMC (Rendle, 2012), which combines Matrix Factorization (MF for short) and Markov Chains (MC for short) to model sequential behaviors for next-basket recommendation. A major problem of FPMC is that it still adopts static representations of user intentions. Recently, inspired by the power of DNNs in modeling sequences in NLP, some DNN-based solutions have been developed and have demonstrated state-of-the-art performance for SR. In particular, RNN-based models, including LSTM and GRU, are widely used to capture users' general interests and current interests together by encoding historical interactions into a hidden state (vector). As the pioneers in employing RNNs for SR, Hidasi et al. (Hidasi et al., 2016a) proposed a deep SR model which encodes items into one-hot embeddings and then feeds them into GRUs to achieve recommendation. Afterwards, Jing et al. (Jing et al., 2017) further improved the RNN-based solution by adding an extra mechanism to tackle the short-memory problem inherent in RNNs. In addition, the model in (Liu et al., 2018) utilizes an attention net to model a user's general states and current states separately; it explicitly takes into account the effects of a user's current actions on his/her next moves. Besides, Hidasi et al. proposed (Hidasi et al., 2016b) and (Hidasi and Karatzoglou, 2018) to improve their model's performance by adjusting loss functions. More recently, the authors of (Wu et al., 2019) adopted GGNN to capture the complex transition pattern among the items in a session rather than a single-way transition pattern. Although these models show promising performance on SR tasks, there is still room for improvement since they all neglect user micro-behaviors in sessions. (Liu et al., 2017; Zhou et al., 2018; Wan and McAuley, 2018; Gu et al., 2020) are the rare models considering micro-behaviors. (Wan and McAuley, 2018) only models monotonic behavior chains, where user behaviors are supposed to follow the same chain, ignoring the multiple types of behaviors. To tackle this problem, (Zhou et al., 2018) and (Gu et al., 2020) both adopt LSTMs to model micro-behaviors. However, they ignore the different transition patterns of items and operations. In this paper, we adopt an RNN and a GNN simultaneously to model micro-behaviors, which not only considers the differences between items and operations, but also keeps the logic of operation order as mentioned in (Wan and McAuley, 2018; Gu et al., 2020).

2.2. Knowledge-based Recommendation

Knowledge-based recommendation has long been recognized as an important family of recommender systems (Burke, 2000). Traditionally, the knowledge includes various item attributes which are used as constraints for filtering out a user's favorite items. As more open linked data emerge, many researchers use the abundant knowledge in KGs as side information to improve the performance of recommender systems. In (Yu et al., 2014), a heterogeneous information network (HIN for short) is constructed based on movie knowledge, and the relatedness between movies is then measured through the volume of meta-paths. The authors of (Palumbo et al., 2017) applied Node2Vec (Grover and Leskovec, 2016) to learn user/item representations according to the different relations between entities in KGs, but it is still a collaborative filtering (CF for short) based method, resulting in poor performance when user-item interactions are sparse. In recent years, many researchers have utilized DNNs to learn knowledge embeddings which are fed into downstream recommendation models. For example, Wang et al. proposed a deep model, DKN (Wang et al., 2018b), for news recommendation, in which a translation-based KG embedding model, TransD (Ji et al., 2015), is used to learn knowledge embeddings that enrich news representations. The authors of (Yang et al., 2018, 2019) utilized Metapath2Vec (Swami et al., 2017) to learn knowledge embeddings which are used to generate the representations of users and items. Different from MKM-SR's multi-task learning solution, these models use knowledge embeddings as pre-trained item embeddings. Another representative deep recommendation model incorporating knowledge is RippleNet (Wang et al., 2018a), where user representations are learned through iterative propagation over a KG including the entities of items and their attributes. KGs have also been employed for sequential recommendation. For example, (Huang et al., 2018) proposes a key-value memory network to incorporate movie attributes from KGs, in order to improve sequential movie recommendation. FDSA (Zhang et al., 2019) is a feature-level sequential recommendation model with self-attention, in which the item features can be regarded as item knowledge used to enrich user representations. Unlike our model, these two KG-based sequential recommendation models model user behavior sequences at the macro (item) level rather than the micro level. Moreover, our model incorporates item knowledge through an MTL paradigm which takes knowledge embedding learning as an auxiliary (sub) task of SR, guaranteeing that information is shared between user micro-behaviors and item attributes, so that better representations are learned.

3. Methodology

In this section, we introduce the details of MKM-SR including the related algorithms involved in the model. We first formalize the problem addressed in this paper, and then summarize the pipeline of MKM-SR followed by the detailed descriptions of each step (component). In the following introductions, we use a bold lowercase to represent a vector and a bold uppercase to represent a set, matrix or a cube (tensor).

Figure 2. The overall framework of our proposed MKM-SR. The arrows in the figure indicate data flows. At first, an item sequence and an operation sequence are extracted from a given session simultaneously, and then fed into GGNN and GRU to learn item embeddings and operation embeddings, respectively. These two types of embeddings assemble a sequence of micro-behavior embeddings which are used to generate the session’s representation . The final score is computed by an MLP followed by softmax operation. Furthermore, the knowledge embeddings learned by TransH are incorporated into a multi-task learning loss function to learn better item embeddings resulting in superior SR.

3.1. Problem Definition

Since we focus on user micro-behaviors rather than interacted items in sessions, we first denote a session by its micro-behavior sequence, whose length is the number of micro-behaviors it contains. Specifically, each micro-behavior in the sequence is a combination of an item and its corresponding operation. Furthermore, the item knowledge we incorporate into MKM-SR is represented in the form of triplets. Formally, a knowledge triplet states that a certain value is the value of an item's attribute, where the attribute is often recognized as a relation. For example, <Sugar, song-artist, Maroon 5> describes that Maroon 5 is the singer of the song Sugar. MKM-SR is trained with the observed user micro-behaviors in sessions and the obtained item knowledge.

The goal of our SR model is to predict the next interacted item based on a given session. To achieve this goal, our model is fed with the given session and a candidate (next interacted) item, and generates a matching score (probability) between them. With these scores, a top-k ranking list can be generated for a given session (one sample). In general, the item with the highest score is predicted as the next interacted item.
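The ranking step described above can be sketched in a few lines of Python; the item names and scores below are hypothetical stand-ins for the model's outputs:

```python
def topk_recommend(scores, k):
    """Rank candidate items by their matching score with the session
    and return the k items with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical matching scores for four candidate items:
scores = {"song_a": 0.61, "song_b": 0.12, "song_c": 0.85, "song_d": 0.42}
print(topk_recommend(scores, 2))  # ['song_c', 'song_a']
```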

3.2. Model Overview

The framework of MKM-SR is illustrated in Fig. 2. In model training, besides the operations and items involved in the training samples, item knowledge is also input into MKM-SR to learn better operation embeddings and item embeddings; better session representations are then obtained, which are crucial for computing precise scores.

For a clearer explanation, we introduce the pipeline of MKM-SR in reverse, from the right side of Fig. 2. For a given session and a candidate item, the final score is computed by a multi-layer perceptron (MLP for short) fed with the session's representation and the candidate item's embedding, which are both vectors of the same dimension. Specifically, the session's representation is obtained by aggregating a group of micro-behavior embeddings. Given that different objects in a session (sequence) have different levels of priority in representing this session, we adopt a soft-attention mechanism (Xu et al., 2015) to generate the session's global representation, which reflects a user's long-term preference. Furthermore, a micro-behavior embedding is the concatenation of its item embedding (green rectangles in Fig. 2) and operation embedding (red rectangles in Fig. 2), since we believe an operation and an item play different roles in representing a user's micro-behavior. The item embedding and the operation embedding in a micro-behavior embedding are learned by different models, i.e., GGNN and GRU, respectively. The reason for this design will be explained subsequently. The GGNN and the GRU in MKM-SR are fed with an item sequence and an operation sequence respectively, both of which have the same length as the micro-behavior sequence.

As mentioned before, item knowledge is helpful for discovering the semantic correlations between the items in a session. We therefore add knowledge learning as an auxiliary task in an MTL paradigm by designing a weighted sum of different loss functions. Specifically, we import TransH's loss function (Wang et al., 2014) as the loss of our knowledge learning, since TransH is a KG embedding model that can model many-to-many/one relations effectively. In addition, we adopt an alternating training strategy (Ren et al., 2015) for training MKM-SR.

3.3. Encoding Session Information

In this subsection, we introduce how to obtain the representation of a given session, which is crucial for our model to compute the final score. Based on the basic principle of SR (Hidasi et al., 2016a; Jing et al., 2017), the premise of obtaining a session representation is to learn each object's embedding in the session. In the setting of this paper, an object in the sequence of a session is a micro-behavior, i.e., the combination of an item and an operation committed on the item. In MKM-SR, we first learn item embeddings and operation embeddings separately, and then concatenate an item embedding and an operation embedding as the embedding of the micro-behavior, rather than directly learning a micro-behavior embedding as a whole. Our experiment results in the subsequent section verify that the former solution is better than the latter. We adopt this solution due to the following intuitions.

3.3.1. Intuitions of Embedding Learning

We argue that item sequences and operation sequences have different effects on session modeling and exhibit different transition patterns. For the item sequence of a session, the transition pattern is actually more complex than the one-way transitions between successive items captured by previous RNN-based sequential models (Hidasi et al., 2016a; Jing et al., 2017; Quadrana et al., 2017). In other words, not only are the subsequent items correlated to the preceding ones in a sequence, but the preceding items are also correlated to the subsequent ones. This is also why a user often interacts with an item that he/she has already interacted with before. Obviously, such a transition pattern relies on bidirectional contexts (preceding and subsequent items) rather than unidirectional contexts, and can thus be modeled by a graph-based model rather than a unidirectional sequential model such as GRU. Consequently, inspired by (Wu et al., 2019), we adopt GGNN to model item sequences and obtain item embeddings in MKM-SR.

Although the operations committed by a user in a session also form a sequence, their transition pattern differs from that of item sequences, and GGNN is not appropriate for modeling operation sequences for the following reasons. First, the unique types of operations are very limited on most platforms. One operation may recur in a sequence with high probability, so that most nodes (operations) would have similar neighbor groups if we converted operation sequences into a directed graph. Thus, most operation embeddings learned by applying GGNN over such a graph would be very similar and could not well characterize the diversity of a user's preference. On the other hand, the transitions between two successive operations often follow a clear sequential pattern. For example, a user often adds a product to the cart after reading its comments, or purchases the product after adding it to the cart. Therefore, we adopt GRU rather than GGNN to learn operation embeddings. Next, we introduce the details of learning item embeddings and operation embeddings in turn.

3.3.2. Learning Item Embeddings

In order to learn item embeddings by GGNN, we first convert an item sequence into a directed graph. Formally, given the micro-level item sequence of a session, in which each object is the item in a micro-behavior, each node of the corresponding directed graph represents a unique item in the sequence, and each directed edge links two successive items. Please note that an item often recurs in a session, e.g., when several operations are committed on it in turn, so self-loops may exist in the graph when an item directly follows itself. To better model the graph, we further construct it as a weighted directed graph: the normalized weight of an edge is calculated as the occurrence frequency of that edge divided by the frequency with which its source item occurs as a preceding item in the sequence.
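The graph construction described above can be sketched as follows. This is a minimal NumPy illustration with our own function and variable names; it builds the outgoing and incoming weighted adjacency matrices of a session's item sequence:

```python
import numpy as np

def session_graph(items):
    """Build the outgoing/incoming weighted adjacency matrices of a
    session's item sequence. An edge (u, v) is counted each time v
    directly follows u; each row is then normalized by the source
    node's total outgoing (resp. incoming) count."""
    nodes = sorted(set(items))
    idx = {v: i for i, v in enumerate(nodes)}
    counts = np.zeros((len(nodes), len(nodes)))
    for u, v in zip(items, items[1:]):
        counts[idx[u], idx[v]] += 1.0

    def row_normalize(m):
        s = m.sum(axis=1, keepdims=True)
        return np.divide(m, s, out=np.zeros_like(m), where=s > 0)

    return nodes, row_normalize(counts), row_normalize(counts.T)

# A toy sequence with a recurring item: v1 -> v2 -> v1 -> v3.
nodes, a_out, a_in = session_graph(["v1", "v2", "v1", "v3"])
# v1 transitions to v2 once and to v3 once, so each outgoing edge of v1
# receives the normalized weight 0.5.
```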

In the initial step of GGNN, the initial embedding of a given item node is obtained by a lookup of the item embedding matrix and used as its initial hidden state. Based on the basic principle of iterative propagation in GNNs (Scarselli et al., 2009), we use the hidden state after the final propagation step as the node's item embedding. Since the session graph is a weighted directed graph, we use an incoming adjacency matrix and an outgoing adjacency matrix, whose entries are edge weights indicating the extent to which the nodes in the graph communicate with each other. The hidden states are then computed according to GGNN's gated update functions, in which all bold lowercases are vectors of the same dimension and all bold uppercases are parameter matrices; the gating involves a sigmoid function, element-wise multiplication, and a reset gate and an update gate. As described in Eq. 1, the hidden state of an item at each step is calculated from its previous state and the candidate state. After the final step, we obtain the learned embeddings of all item nodes in the graph, from which the item embeddings of the sequence are obtained as


According to the graph's construction, given a session, an item has only one learned embedding no matter whether it recurs in the sequence. Consequently, the item embedding sequence may contain repeated embeddings. If an item occurs in multiple sessions, it may have different learned embeddings across them, since different sessions correspond to different graphs.
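One propagation step of the gated graph update described above can be sketched in NumPy. This is a simplified illustration in the spirit of (Li et al., 2016; Wu et al., 2019): the parameter names and shapes are our own assumptions, not the paper's exact formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(h, a_in, a_out, w):
    """One GGNN propagation step over all nodes. h: (n, d) hidden states;
    a_in/a_out: (n, n) weighted adjacency matrices; w: dict of
    illustrative parameter matrices. Each node first aggregates messages
    from its incoming and outgoing neighbors, then updates its state
    with GRU-style gates."""
    a = np.concatenate([a_in @ h @ w["H_in"], a_out @ h @ w["H_out"]], axis=1)
    z = sigmoid(a @ w["W_z"] + h @ w["U_z"])             # update gate
    r = sigmoid(a @ w["W_r"] + h @ w["U_r"])             # reset gate
    h_cand = np.tanh(a @ w["W_o"] + (r * h) @ w["U_o"])  # candidate state
    return (1.0 - z) * h + z * h_cand                    # gated update

rng = np.random.default_rng(0)
n, d = 3, 4
w = {k: rng.normal(scale=0.1, size=s) for k, s in
     [("H_in", (d, d)), ("H_out", (d, d)),
      ("W_z", (2 * d, d)), ("U_z", (d, d)),
      ("W_r", (2 * d, d)), ("U_r", (d, d)),
      ("W_o", (2 * d, d)), ("U_o", (d, d))]}
h = rng.normal(size=(n, d))
a = np.abs(rng.normal(size=(n, n)))
h_next = ggnn_step(h, a, a.T, w)  # shapes preserved: (3, 4)
```

Running this step several times and taking the last hidden states corresponds to using the final-step states as the learned item embeddings.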

3.3.3. Learning Operation Embeddings

Due to the aforementioned reasons, we adopt a GRU (Cho et al., 2014) fed with operation sequences to learn operation embeddings. GRU is an improved version of the standard RNN for modeling dynamic temporal behavior, which aims to solve the vanishing gradient problem.

Formally, we feed an operation sequence into our GRU. For each operation in the sequence, its initial embedding is also obtained by a lookup in the operation embedding matrix. Its learned embedding is then the hidden state (vector) output by the GRU at the corresponding step, which is calculated from the operation's initial embedding and the hidden state of the previous step as follows,


where the recurrence represents the calculation in one GRU unit over all GRU parameters. The hidden state at each step is the corresponding operation's learned embedding; to compute the first step, the initial hidden state is set to a zero vector. Thus, we obtain the learned embeddings of all operations in the sequence as


Please note that an operation may recur in the sequence, just as an item may. According to GRU's principle, an operation recurring in an operation sequence has multiple different learned embeddings. For example, in the operation sequence of the second session in Fig. 1, a recurring operation's learned embedding at the third position differs from its learned embedding at the fifth position. As a result, the operation embedding sequence has no repeated embeddings, which is different from the item embedding sequence.
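The operation-side encoding can be illustrated with a minimal NumPy GRU. The operation names and parameter names are hypothetical; the point is that a recurring operation receives a different hidden state at each occurrence, because each state depends on the history accumulated so far:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """A single GRU unit: the new hidden state, computed from the current
    input embedding x and the previous hidden state h_prev. p holds
    illustrative weight matrices."""
    z = sigmoid(x @ p["W_z"] + h_prev @ p["U_z"])      # update gate
    r = sigmoid(x @ p["W_r"] + h_prev @ p["U_r"])      # reset gate
    h_cand = np.tanh(x @ p["W_h"] + (r * h_prev) @ p["U_h"])
    return (1.0 - z) * h_prev + z * h_cand

rng = np.random.default_rng(1)
d = 4
p = {k: rng.normal(scale=0.1, size=(d, d))
     for k in ["W_z", "U_z", "W_r", "U_r", "W_h", "U_h"]}

# Run the GRU over an operation sequence with a repeated operation:
ops = ["search", "add_playlist", "favorite", "listen", "favorite"]
emb = {o: rng.normal(size=d) for o in set(ops)}
h = np.zeros(d)                    # initial hidden state (zero vector)
states = []
for o in ops:
    h = gru_step(emb[o], h, p)
    states.append(h)
# The two occurrences of "favorite" (positions 3 and 5) yield different
# learned embeddings, since each depends on the preceding hidden state.
```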

Then, we concatenate each item embedding and its corresponding operation embedding to obtain the embeddings of the micro-behaviors in the given session, as shown in Fig. 2. So we have


where the operator denotes concatenation. Based on such micro-behavior embeddings, two sessions having the same item sequence but different operation sequences still have different representations, which capture users' fine-grained intentions.

3.3.4. Generating Session Representations

To obtain a session representation, we should aggregate the embeddings of all micro-behaviors in this session. Inspired by (Wu et al., 2019), we take into account a session's local preference and global preference. A session's local preference is directly represented by the embedding of the most recent micro-behavior.

For a session's global preference, we use the soft-attention mechanism (Xu et al., 2015) to assign a proper weight to each micro-behavior's embedding in the session, since different micro-behaviors have different levels of priority. Specifically, the attention weight of each micro-behavior is computed as


where the attention parameters are learnable. Then, the global representation of the session is


At last, the session’s final representation is


where a learnable matrix maps the concatenation of the local and global representations to the final representation.
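Putting these pieces together, the session-representation step can be sketched as follows. The attention parametrization below (with made-up parameter names and a tanh nonlinearity) is an illustrative simplification of the soft-attention described above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def session_repr(item_emb, op_emb, w1, w2, q, w3):
    """Aggregate a session's micro-behavior embeddings. Each micro-behavior
    embedding is the concatenation of an item embedding and an operation
    embedding; the local preference is the last micro-behavior, and the
    global preference is a soft-attention-weighted sum over all of them."""
    m = np.concatenate([item_emb, op_emb], axis=1)   # (l, 2d) micro-behaviors
    last = m[-1]                                     # local preference
    # Attention weight of each micro-behavior w.r.t. the most recent one.
    alpha = np.array([q @ np.tanh(w1 @ last + w2 @ m_t) for m_t in m])
    s_global = softmax(alpha) @ m                    # global preference
    return w3 @ np.concatenate([last, s_global])     # final representation

rng = np.random.default_rng(2)
l, d = 5, 4                         # 5 micro-behaviors, embedding size 4
item_emb = rng.normal(size=(l, d))
op_emb = rng.normal(size=(l, d))
w1 = rng.normal(size=(2 * d, 2 * d))
w2 = rng.normal(size=(2 * d, 2 * d))
q = rng.normal(size=2 * d)
w3 = rng.normal(size=(2 * d, 4 * d))
s = session_repr(item_emb, op_emb, w1, w2, q, w3)   # shape (2d,)
```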

After obtaining the representation of a session, we compute the final score through an MLP fed with the session representation and the candidate item's embedding, followed by a softmax operation. Thus we have


To train MKM-SR, we first collect sufficient training samples, in which a sample's label is 1 if its candidate item is the next interacted item of the user following the session, and 0 otherwise. Then we adopt binary cross-entropy as the loss function of the SR task as follows,


where the summations range over the session set and the item set of the training samples.
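The SR loss can be illustrated directly; this is a generic binary cross-entropy over hypothetical (session, candidate) scores:

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy over (session, candidate item) training
    samples: y_true is 1 when the candidate is the actual next item,
    y_pred is the model's predicted matching probability."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

# One positive and two negative candidates for a session:
y_true = np.array([1.0, 0.0, 0.0])
y_pred = np.array([0.8, 0.1, 0.3])
loss = bce_loss(y_true, y_pred)  # lower when predictions match the labels
```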

3.4. Learning Knowledge Embeddings

Recall the toy example of Fig. 1: the two target songs are the next interacted items of the two sessions, respectively. In fact, they are both semantically correlated to the previous items in terms of shared knowledge (the same singer or genre). As a result, item embeddings learned based on such shared knowledge, which we regard as knowledge embeddings in this paper, are often consonant with the interaction sequences. This observation inspires us to use knowledge embeddings to enhance SR performance. In this subsection, we introduce how to learn knowledge embeddings given the observed item knowledge.

In a KG containing items, many-to-one and many-to-many relations are often observed. For example, many songs are sung by one singer, a movie may belong to several genres, and a movie genre includes many movies. Among the state-of-the-art KG embedding models, TransH (Wang et al., 2014) introduces hyperplanes to handle many-to-many/one relations effectively. Therefore, we import the training loss of TransH to learn knowledge embeddings in our model.

Specifically, for each attribute relation, we first position a relation-specific translation vector in a relation-specific hyperplane. Given a triplet, the item's embedding and the attribute's embedding are first projected onto the hyperplane along its normal vector. We expect that the two projections can be connected by the translation vector on the hyperplane with a low error if the triplet is correct. Thus a score function is used to measure the implausibility of the triplet. Since the normal vector has unit norm, the projections can be expressed in terms of the original embeddings and the normal vector as follows.


Therefore, the loss function for knowledge embedding learning is


where the summation ranges over the set of all observed knowledge triplets.
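The TransH score described above can be sketched as follows. The entity and relation vectors here are random stand-ins, so only the computation itself (projection onto the relation hyperplane plus translation) is meaningful:

```python
import numpy as np

def transh_score(e_item, e_attr, w_r, d_r):
    """TransH implausibility score for a triplet (item, relation, attribute):
    project both entity embeddings onto the relation-specific hyperplane
    with unit normal w_r, then measure the squared norm of
    proj(item) + d_r - proj(attr), where d_r is the relation's
    translation vector on the hyperplane."""
    w_r = w_r / np.linalg.norm(w_r)          # enforce the unit-norm constraint
    proj = lambda e: e - (w_r @ e) * w_r     # projection onto the hyperplane
    diff = proj(e_item) + d_r - proj(e_attr)
    return float(diff @ diff)

rng = np.random.default_rng(3)
d = 8
w_r, d_r = rng.normal(size=d), rng.normal(size=d)
e_sugar, e_maroon5 = rng.normal(size=d), rng.normal(size=d)
# After training, a correct triplet such as <Sugar, song-artist, Maroon 5>
# should receive a low score; here the embeddings are random, so the call
# only illustrates the computation.
score = transh_score(e_sugar, e_maroon5, w_r, d_r)
```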

3.5. The Objective of Multi-task Learning

Many previous knowledge-based recommendation models (Wang et al., 2018b; Yang et al., 2019, 2018) generally learn knowledge embeddings in advance and use them as pre-trained item embeddings. In other words, the knowledge loss is used to pre-train the item embeddings before the recommendation loss fine-tunes them. In such a scenario, knowledge embedding learning and recommendation are two separate learning tasks.

In general, incorporating two learning tasks into an MTL paradigm is more effective than achieving their respective goals separately if the two tasks are related to each other. In MTL, the learning results of one task can be used as hints to guide the other task to learn better (Ruder, 2017). Inspired by the observations on the example in Fig. 1, learning knowledge embeddings can be regarded as an auxiliary task that predicts the features (item embeddings) used for SR's prediction task. Consequently, in MKM-SR we import knowledge embedding learning as an auxiliary task into an MTL paradigm to assist the SR task.

In our scenario, the MTL's objective is to maximize the posterior probability of our model's parameters $\Theta$, given the knowledge triplet set $S_K$ and SR's training set $D$. According to Bayes' rule, this objective is

$$\max_{\Theta}\ p(\Theta \mid S_K, D) \propto p(\Theta)\, p(S_K \mid \Theta)\, p(D \mid \Theta, S_K),$$

where $p(\Theta)$ is $\Theta$'s prior probability, which is set to follow a Gaussian distribution of zero mean and 0.1 standard deviation.

$p(S_K \mid \Theta)$ is the likelihood of observing $S_K$ given $\Theta$, and $p(D \mid \Theta, S_K)$ is the likelihood of observing $D$ given $\Theta$ and $S_K$, which is defined as the product of Bernoulli distributions. Then, the comprehensive loss function of our MTL's objective is

$$\mathcal{L} = \mathcal{L}_{SR} + \lambda\,\mathcal{L}_K + \mu\,\|\Theta\|_2^2,$$

where $\|\Theta\|_2^2$ is the regularization term to prevent over-fitting, and $\lambda$ and $\mu$ are control parameters. We obtain the values of $\lambda$ and $\mu$ through tuning experiments.
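As a sanity check on how the three terms combine, here is a minimal sketch of the comprehensive loss (an illustrative helper, not the authors' code; no default values for $\lambda$ and $\mu$ are assumed since they are decided by tuning):

```python
def mtl_loss(loss_sr, loss_k, l2_theta, lam, mu):
    """Comprehensive MTL loss: L = L_SR + lambda * L_K + mu * ||Theta||^2.

    loss_sr  -- SR loss on the training sessions
    loss_k   -- knowledge-embedding loss over all triplets
    l2_theta -- squared L2 norm of all model parameters
    lam, mu  -- control parameters decided by tuning experiments
    """
    return loss_sr + lam * loss_k + mu * l2_theta
```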

During the optimization of loss $\mathcal{L}$, there are two training strategies for MTL, alternating training and joint training (Ren et al., 2015). For alternating training, we optimize the two objectives in turn:

$$\min_{\Theta}\ \mathcal{L}_{SR}(S, V) + \mu\,\|\Theta\|_2^2 \quad \text{and} \quad \min_{\Theta}\ \lambda\,\mathcal{L}_K(S_K),$$

where $S$ and $V$ represent the set of sessions and candidate items in the training set $D$, respectively. For joint training, we optimize the combined objective:

$$\min_{\Theta}\ \mathcal{L}_{SR}(S, V) + \lambda\,\mathcal{L}_K(S_K) + \mu\,\|\Theta\|_2^2.$$

Through empirical comparisons, we have verified that alternating training is a better strategy for MKM-SR.
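The difference between the two strategies can be sketched with plain gradient descent on a toy parameter (an illustrative sketch with hypothetical gradient callbacks, not the authors' training loop):

```python
def alternating_step(theta, grad_sr, grad_k, lr=0.01):
    """One alternating-training round: a separate update per objective."""
    theta = theta - lr * grad_sr(theta)   # update on the main SR objective
    theta = theta - lr * grad_k(theta)    # then update on the knowledge objective
    return theta

def joint_step(theta, grad_sr, grad_k, lam=1.0, lr=0.01):
    """One joint-training step on the combined loss L_SR + lambda * L_K."""
    return theta - lr * (grad_sr(theta) + lam * grad_k(theta))
```

With small learning rates both strategies approach the same stationary point on this toy problem; the difference observed in the paper comes from how often the frequently occurring items are updated by each loss within an epoch.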

4. Experiments

In this section, we try to answer the following research questions (RQs for short) through extensive experiments.

RQ1: Can our model MKM-SR outperform the state-of-the-art SR models?

RQ2: Is it useful to incorporate micro-behaviors and knowledge into our model?

RQ3: Is it rational to obtain a session’s representation through learning item embeddings by GGNN and learning operation embeddings by GRU separately?

RQ4: Which training strategy is better for incorporating knowledge learning into MKM-SR?

4.1. Experiment Settings

4.1.1. Datasets

We evaluate all compared models on the following realistic datasets:

KKBOX: This dataset was provided by the famous music streaming service KKBOX, and contains many users' historical music listening records in a given period. We take the 'source system tab' as user operations, such as 'tab my library' (manipulation on local storage) and 'tab search'. The music attributes used in our experiments include artist (singer), genre, language and release year.

JDATA: This dataset was extracted from a famous Chinese e-commerce website, and contains a stream of user actions within two months. The operation types include clicking, ordering, commenting, adding to cart and adding to favorites. The product attributes used in our experiments include brand, shop, category and launch year.

For both datasets, we considered four item attributes (relations) as knowledge to be incorporated in our model. As in (Hidasi et al., 2016a; Wu et al., 2019), we set the duration threshold of sessions in JDATA to one hour, and set the index gap of sessions in KKBOX to 2000 (according to statistical analysis), to divide different sessions. We also filtered out the sessions of length 1 and the items that appear fewer than 3 times in the datasets. For both datasets, we took the earlier 90% of user behaviors as the training set, and the subsequent (most recent) 10% as the test set. In model prediction, given a test session, the models first compute the matching scores of all items and then generate a top-$k$ list according to the scores.
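For instance, the hour-based session division for JDATA can be sketched as follows (an illustrative implementation; the event layout and function name are our own, not from the released code):

```python
from datetime import datetime, timedelta

def split_sessions(events, gap=timedelta(hours=1)):
    """Split one user's time-ordered (timestamp, item, operation) stream into
    sessions: a new session starts whenever the time gap between two
    consecutive events exceeds the threshold (one hour for JDATA)."""
    sessions, current = [], []
    for event in events:
        if current and event[0] - current[-1][0] > gap:
            sessions.append(current)
            current = []
        current.append(event)
    if current:
        sessions.append(current)
    # sessions of length 1 are filtered out, as described above
    return [s for s in sessions if len(s) > 1]
```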

To demonstrate the effectiveness of incorporating item knowledge in alleviating the problem of cold-start items, we applied two additional manipulations to our datasets, unlike previous SR evaluations. The first is to retain the items that only appear in the test set, i.e., the cold-start items. The second is to simulate a sparse JDATA dataset, denoted as Demo, by retaining only the earliest 1% of user behaviors. Such a sparse dataset has a larger proportion of cold-start items. In previous SR models such as (Liu et al., 2018; Wu et al., 2019; Zhou et al., 2018), these cold-start items' embeddings are initialized randomly and cannot be tuned during model training since they are not involved in any training sample. Thus, the recommendations about these items are often unsatisfactory.

The statistics of our datasets are shown in Table 1, where '(N)' indicates the datasets having some cold-start items, and 'new%' is the proportion of the behaviors involving cold-start items to all behaviors in the test set. We have taken into account all operation types provided by the two datasets. To facilitate reproducing our experiment results, our experiment samples and MKM-SR's source code have been published online.

session# avg. session length item#(new%) avg. item frequency operation#
KKBOX 180,047 4.713 33,454 25.365 23
KKBOX(N) 180,096 4.726 34,120(0.99%) 24.942 23
JDATA 455,481 5.372 134,548 18.186 5
JDATA(N) 456,005 5.383 139,099(3.69%) 17.654 5
Demo 5,633 5.330 12,195 2.301 5
Demo(N) 5,696 4.992 12,917(40.45%) 2.192 5
Table 1. Dataset statistics
KKBOX          KKBOX(N)       JDATA          JDATA(N)       Demo           Demo(N)
Hit@20 MRR@20  Hit@20 MRR@20  Hit@20 MRR@20  Hit@20 MRR@20  Hit@20 MRR@20  Hit@20 MRR@20
FPMC 5.614 1.166 5.530 1.147 7.531 2.623 7.049 2.493 3.787 1.808 3.261 1.644
GRU4REC+BPR 12.795 4.545 12.501 4.693 35.433 13.262 34.827 13.346 12.189 5.124 5.567 2.969
GRU4REC+CE 12.445 4.007 12.429 4.135 35.347 13.956 34.794 13.542 12.965 4.992 9.896 4.505
NARM 14.667 5.839 13.926 5.200 36.867 16.826 35.862 16.677 14.446 5.645 8.056 3.615
STAMP 14.475 4.783 14.287 4.544 35.555 12.936 34.691 12.187 14.609 5.796 9.317 2.902
SR-GNN 14.187 4.476 13.399 4.792 40.588 15.968 38.723 15.203 15.504 7.220 10.317 4.682
RIB 15.982 4.763 13.887 5.328 37.236 14.134 35.551 13.420 12.893 4.887 9.965 4.436
KM-SR 17.680 7.195 17.019 6.301 41.094 16.552 40.480 15.709 23.726 9.363 15.065 6.323
M(GRU)-SR 16.971 5.435 16.865 5.250 37.015 14.034 36.374 13.734 18.507 6.430 11.262 4.747
M(GGNN)-SR 13.262 4.347 13.035 4.098 38.270 16.532 37.231 15.663 16.141 6.811 9.168 3.572
M(GGNNx2)-SR 17.270 5.532 16.983 5.435 41.017 16.544 41.308 15.780 19.782 7.865 12.017 4.734
M-SR 20.998 5.878 20.523 5.707 41.440 16.851 41.019 15.850 20.631 7.969 12.914 5.228
MKM-SR 22.574 7.543 22.221 6.976 42.565 17.585 41.998 16.990 24.623 9.642 15.110 6.424
Table 2. All models' SR performance scores (percentage values) show that MKM-SR outperforms all competitors no matter whether the historical interactions are sparse.

4.1.2. Compared Models

To emphasize MKM-SR's superior performance, we compared it with the following state-of-the-art SR models:

FPMC (Rendle et al., 2010): It is a sequential prediction method based on personalized Markov chain which is often used as SR baseline.

GRU4REC+BPR/CE (Hidasi and Karatzoglou, 2018): These two baselines are improved versions of GRU4REC (Hidasi et al., 2016a), a state-of-the-art SR model. GRU4REC+BPR uses Bayesian personalized ranking (Rendle et al., 2012) as its loss function, and GRU4REC+CE uses cross-entropy.

NARM (Jing et al., 2017): It is a GRU-based SR model with an attention mechanism to capture the long-term dependency of user preferences.

STAMP (Liu et al., 2018): This SR model considers both current interests and general interests of users. In particular, STAMP uses an additional neural network to model current interests.

SR-GNN (Wu et al., 2019): It also utilizes GGNN to capture the complex transition patterns among the items in a session, but does not incorporate micro-behaviors and knowledge.

RIB (Zhou et al., 2018): It also incorporates user operations, whose embeddings are learned by Word2Vec (Mikolov et al., 2013), and adopts GRU to model the sequence of user micro-behaviors.

In addition, to justify the necessity and validity of incorporating micro-behaviors and knowledge in our model, we further propose some variants of MKM-SR to be compared as follows.

KM-SR: It removes all modules related to operations, and the rest of the components are the same as MKM-SR. We compare MKM-SR with KM-SR to verify the significance of incorporating micro-behaviors.

M-SR: It removes the auxiliary task of learning knowledge embeddings, i.e., $\mathcal{L}_K$ in Eq. 14, and the rest of the components are the same as MKM-SR. All of the following variants also remove the task of learning knowledge embeddings; the differences between them lie in the manipulations on session modeling.

M(GRU/GGNN)-SR: Unlike MKM-SR, these two variants directly learn micro-behavior (item-operation pair) embeddings. The only difference between them is that M(GRU)-SR feeds micro-behavior sequences to GRU while M(GGNN)-SR feeds them to GGNN.

M(GGNNx2)-SR: It uses two GGNNs to learn operation embeddings and item embeddings, respectively.

4.1.3. Evaluation Metrics

We use the following metrics, which have been widely used in previous SR evaluations, to evaluate all models' performance.

Hit@k: The proportion of test samples whose correct next interacted item appears in the top-$k$ ranking list.

MRR@k: The average reciprocal rank of the correct next interacted item in the top-$k$ ranking list. The reciprocal rank is set to zero if the correct item is ranked behind the top-$k$.
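Concretely, the two metrics can be computed as follows (an illustrative implementation; the function name is our own):

```python
def hit_and_mrr_at_k(ranked_lists, targets, k=20):
    """ranked_lists[i] holds the candidate items ranked by matching score for
    test sample i; targets[i] is the true next interacted item.
    Returns (Hit@k, MRR@k) as fractions in [0, 1]."""
    hits, rr_sum = 0, 0.0
    for ranking, target in zip(ranked_lists, targets):
        top_k = list(ranking)[:k]
        if target in top_k:
            hits += 1
            rr_sum += 1.0 / (top_k.index(target) + 1)
        # reciprocal rank stays zero when the target falls behind top-k
    n = len(targets)
    return hits / n, rr_sum / n
```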

4.1.4. Hyper-parameter Setup

For fair comparisons, we adopted the same dimension of operation and item embeddings for MKM-SR and all baselines. Due to space limitations, we only display the results of 100-dimensional embeddings; consistent conclusions were drawn from the results with embeddings of other dimensions. In addition, all embeddings were initialized by a Gaussian distribution with a mean of 0 and a standard deviation of 0.1. We set GGNN's step number to 1. MKM-SR was learned by alternating training rather than joint training, the reason for which will be verified in the following experiments. In addition, we used the Adam (Kingma and Ba, 2015) optimizer with learning rate 0.001 and batch size 128. For the baselines, we used the default hyper-parameter settings in their papers except for the embedding dimension. About the control parameters in Eq. 14, we set $\lambda$ to 0.0001 for each dataset, which was decided through our tuning experiments. For the L2 penalty coefficient $\mu$, we set it following previous SR models (Wang et al., 2019).
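The embedding initialization described above can be sketched as follows (illustrative; the helper name is ours, and the row counts are taken from Table 1 for KKBOX):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_embeddings(n_rows, dim=100, std=0.1):
    """Gaussian initialization with zero mean and 0.1 standard deviation."""
    return rng.normal(loc=0.0, scale=std, size=(n_rows, dim))

item_emb = init_embeddings(33454)   # KKBOX item count from Table 1
op_emb = init_embeddings(23)        # KKBOX operation count from Table 1
```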

Next, we display the results of our evaluations to answer the aforementioned RQs, based on which we provide some insights into the reasons for the superiority or inferiority of the compared models.

4.2. Global Performance Comparisons

At first, we compared all models' SR performance over different datasets to answer RQ1; the Hit@20 and MRR@20 scores (percentage values) are listed in Table 2. The displayed scores are the average of five runs for each model.

The comparison results show that MKM-SR outperforms all baselines and variants on all datasets (answering yes to RQ1). Especially on the datasets with cold-start items and on Demo, MKM-SR and KM-SR have a more remarkable superiority. Such results justify the effect of incorporating knowledge to alleviate the sparsity problem of cold-start items (answering yes to RQ2). As shown in Table 1, KKBOX has more unique operations than JDATA, which are useful to better capture user preferences on a fine-grained level. Therefore, besides MKM-SR and M-SR, RIB, another model incorporating user operations, also has a more remarkable advantage in KKBOX than in JDATA, compared with the GRU-based baselines that do not incorporate operations. These results justify the effect of incorporating micro-behaviors (answering yes to RQ2).

In addition, a user is more likely to interact with the same item repeatedly in a session of JDATA. The transition pattern between successive items in such a scenario can be captured better by GGNN than by GRU. This is why SR-GNN has a greater advantage in JDATA than in KKBOX, compared with the GRU-based models including GRU4REC+BPR/CE and NARM.

4.3. Ablation Study

We further compared MKM-SR with its variants to answer RQ2 and RQ3, and draw the following conclusions based on the results in Table 2. MKM-SR's advantage over KM-SR and M-SR shows that both micro-behaviors (operations) and item knowledge deserve to be incorporated to improve SR performance (answering yes to RQ2). In addition, M-SR outperforms M(GRU)-SR and M(GGNN)-SR, indicating that modeling a session by learning item embeddings and operation embeddings separately is more effective than learning micro-behavior embeddings directly. As we stated before, the transition pattern of an item sequence is different from that of an operation sequence. Therefore, it is less effective to combine an item with an operation as a micro-behavior and then learn the micro-behavior sequence with a single model. Furthermore, M-SR's superiority over M(GGNNx2)-SR shows that operation sequences should be learned by GRU rather than GGNN, for the reason explained in Subsec. 3.3. These results provide a yes answer to RQ3.

4.4. Strategies of Incorporating Knowledge Learning

As we mentioned before, there are two training strategies for our MTL loss in Eq. 14, i.e., alternating training (Eq. 15) and joint training (Eq. 16). To answer RQ4, we trained KM-SR with the two training strategies respectively and compared their learning curves. Furthermore, we added a pre-training variant for comparison, in which the item embeddings are first pre-trained by TransH and then input into KM-SR to be tuned only by the SR loss in Eq. 10. We did not adopt MKM-SR in this comparison experiment because the three training strategies are not relevant to operation embedding learning.

Figure 3. The learning curves of three different strategies to incorporate knowledge learning show that, incorporating knowledge learning into the MTL of alternating training is the best strategy for our SR task.

In Fig. 3, we only display the learning curves of MRR@20 on KKBOX(N) and JDATA(N) since MRR@k reflects ranking performance better than Hit@k. The curves in the figure show that, although the pre-training model has a better learning start, it is overtaken by the other two rivals at the stage of convergence. Such results demonstrate MTL's superiority over knowledge-based pre-training. According to Eq. 16, the embeddings of the items that often occur in the sessions of the training set are tuned multiple times by the loss $\mathcal{L}_K$ in each epoch of joint training. This biases the learned embeddings too much towards the auxiliary task of knowledge embedding learning, shrinking the learning effect of the main SR task. Therefore, alternating training is better than joint training in our SR scenario.

Figure 4. KKBOX(N)’s item embedding distributions under different learning mechanisms show that, the song embeddings learned by the MTL keep close distances across different groups (genres) on the premise of discriminating different groups. It conforms to the fact that the successive songs in a session may belong to different genres.

To further verify the significance of incorporating knowledge learning into the MTL paradigm (Eq. 14), we visualize in Fig. 4 the embedding distributions of some items sampled from KKBOX(N), which were learned by different mechanisms; the points of different colors represent the songs of different genres. In Fig. 4(a), the item embeddings were learned by feeding the item sequences of sessions into Word2Vec, thus two items are close in the space if they often co-occur in some sessions. As shown in the sub-figure, such learned embeddings make many songs of different genres too converged and thus cannot discriminate different genres. In Fig. 4(b), the item embeddings were learned solely by TransH. Although such learned embeddings obviously discriminate different genres, the gap between two groups of different genres is too big. It makes it hard for a model based on embedding distances to predict an item of a different genre as the next interacted item, which does not conform to some facts, such as the item sequence of the session shown in Fig. 1. It is also the reason why the pre-training model is defeated by the joint-training and alternating-training models in Fig. 3. The item embeddings shown in Fig. 4(c) were learned by MKM-SR through MTL, and exhibit two characteristics: they discriminate different genres for most items, and meanwhile keep close distances across different genres. Item embeddings with these two characteristics well indicate two kinds of correlations between the successive items in a session. The former characteristic indicates the semantic correlations among items, and the latter indicates the items' co-occurrence correlations across different sessions. In fact, these two correlations can be captured respectively through the learning task of $\mathcal{L}_K$ and the learning task of $\mathcal{L}_{SR}$. Obviously, it is useful for improving SR to capture these two correlations simultaneously.

4.5. MTL’s Control Parameter Tuning

At last, we investigate the influence of $\lambda$ in the MTL's loss (Eq. 14) on MKM-SR's final recommendation performance. Fig. 5 shows MKM-SR's MRR@20 scores on KKBOX(N), from which we find that MKM-SR's performance varies marginally (within 1%) when $\lambda$ is set within the tested range. What's more, MKM-SR gains the best score when $\lambda = 0.0001$. It implies that, as an auxiliary task, knowledge embedding learning will disturb the main SR task if it is assigned too much weight.

Figure 5. MKM-SR's performance on KKBOX(N) with different $\lambda$ shows that $\lambda = 0.0001$ is the best setting.

5. Conclusion

In this paper, we propose a novel session-based recommendation (SR) model, namely MKM-SR, which incorporates user micro-behaviors and item knowledge simultaneously. According to the different intuitions about item sequences and operation sequences in a session, we adopt different mechanisms to learn item embeddings and operation embeddings which are used to generate fine-grained session representations. We also investigate the significance of learning knowledge embeddings and the influences of different training strategies through sufficient comparison studies. MKM-SR’s superiority over the state-of-the-art SR models is justified by our extensive experiments and inspires a promising direction of improving SR.


  • R. Burke (2000) Knowledge-based recommender systems. Cited by: §2.2.
  • K. Cho, B. Van Merrienboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder-decoder approaches. Computer Science. Cited by: §1, §3.3.3.
  • A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In Proc. of KDD, Cited by: §2.2.
  • Y. Gu, Z. Ding, S. Wang, and D. Yin (2020) Hierarchical user profiling for e-commerce recommender systems. In Proceedings of WSDM, pp. 223–231. Cited by: §2.1.
  • B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk (2016a) Session-based recommendations with recurrent neural networks. In Proc. of ICLR, Cited by: §1, §1, §2.1, §3.3.1, §3.3, §4.1.1, §4.1.2.
  • B. Hidasi and A. Karatzoglou (2018) Recurrent neural networks with top-k gains for session-based recommendations. In Proceedings of CIKM, pp. 843–852. Cited by: §2.1, §4.1.2.
  • B. Hidasi, M. Quadrana, A. Karatzoglou, and D. Tikk (2016b) Parallel recurrent neural network architectures for feature-rich session-based recommendations. In Proc. of RecSys, Cited by: §2.1.
  • J. Huang, W. X. Zhao, H. Dou, J. Wen, and E. Y. Chang (2018) Improving sequential recommendation with knowledge-enhanced memory networks. In Proc. of SIGIR, Cited by: §2.2.
  • Y. M. Huang, T. C. Huang, K. T. Wang, and W. Y. Hwang (2009) A markov-based recommendation model for exploring the transfer of learning on the web. Journal of Educational Technology & Society 12 (2), pp. 144–162. Cited by: §2.1.
  • D. Jannach, L. Lerche, and M. Jugovac (2015) Adaptation and evaluation of recommendations for short-term shopping goals. pp. 211–218. Cited by: §1.
  • G. Ji, S. He, L. Xu, K. Liu, and J. Zhao (2015) Knowledge graph embedding via dynamic mapping matrix. In Proc. of ACL, Cited by: §2.2.
  • L. Jing, P. Ren, Z. Chen, Z. Ren, and J. Ma (2017) Neural attentive session-based recommendation. In Proc. of CIKM, Cited by: §1, §1, §2.1, §3.3.1, §3.3, §4.1.2.
  • J. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proc. of ICLR, Cited by: §4.1.4.
  • Y. Li, R. Zemel, M. Brockschmidt, and D. Tarlow (2016) GATED graph sequence neural networks. In Proceedings of ICLR, Cited by: §1.
  • G. Linden, B. Smith, and J. York (2003) Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing 7 (1), pp. 76–80. Cited by: §2.1.
  • Q. Liu, S. Wu, and L. Wang (2017) Multi-behavioral sequential prediction with recurrent log-bilinear model. IEEE Transactions on Knowledge and Data Engineering 29 (6), pp. 1254–1267. Cited by: §2.1.
  • Q. Liu, Y. Zeng, R. Mokhosi, and H. Zhang (2018) STAMP: short-term attention/memory priority model for session-based recommendation. In Proc. of SIGKDD, Cited by: §1, §1, §2.1, §4.1.1, §4.1.2.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. arXiv:1301.3781. Cited by: §4.1.2.
  • E. Palumbo, G. Rizzo, and R. Troncy (2017) Entity2rec: learning user-item relatedness from knowledge graphs for top-n item recommendation. In Proc. of RecSys., Cited by: §2.2.
  • M. Quadrana, A. Karatzoglou, B. Hidasi, and P. Cremonesi (2017) Personalizing session-based recommendations with hierarchical recurrent neural networks. In Proc. of RecSys, Cited by: §1, §3.3.1.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §3.2, §3.5.
  • S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme (2012) BPR: bayesian personalized ranking from implicit feedback. pp. 452–461. Cited by: §4.1.2.
  • S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme (2010) Factorizing personalized markov chains for next-basket recommendation. In Proc. of WWW, Cited by: §1, §4.1.2.
  • S. Rendle (2012) Factorization machines with libfm. Acm Transactions on Intelligent Systems and Technology 3 (3), pp. 1–22. Cited by: §2.1.
  • S. Ruder (2017) An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098. Cited by: §3.5.
  • B. Sarwar, G. Karypis, J. Konstan, and J. Riedl (2001) Item-based collaborative filtering recommendation algorithms. In Proceedings of WWW, pp. 285–295. Cited by: §2.1.
  • F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2009) The graph neural network model. IEEE Trans. Neural Networks 20 (1), pp. 61–80. Cited by: §3.3.2.
  • G. Shani, R. I. Brafman, and D. Heckerman (2005) An mdp-based recommender system. Journal of Machine Learning Research 6 (1), pp. 1265–1295. Cited by: §2.1.
  • Y. Dong, N. V. Chawla, and A. Swami (2017) Metapath2vec: scalable representation learning for heterogeneous networks. In Proc. of KDD, Cited by: §2.2.
  • M. Wan and J. McAuley (2018) Item recommendation on monotonic behavior chains. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 86–94. Cited by: §2.1.
  • H. Wang, F. Zhang, J. Wang, M. Zhao, W. Li, X. Xie, and M. Guo (2018a) RippleNet: propagating user preferences on the knowledge graph for recommender systems. In Proc. of CIKM, Cited by: §2.2.
  • H. Wang, F. Zhang, X. Xie, and M. Guo (2018b) DKN: deep knowledge-aware network for news recommendation. In Proceedings of WWW, Cited by: §1, §2.2, §3.5.
  • S. Wang, L. Cao, and Y. Wang (2019) A survey on session-based recommender systems. CoRR abs/1902.04864. Cited by: §1, §4.1.4.
  • Z. Wang, J. Zhang, J. Feng, and Z. Chen (2014) Knowledge graph embedding by translating on hyperplanes. In Twenty-Eighth AAAI Conference on Artificial Intelligence. Cited by: §1, §3.2, §3.4.
  • S. Wu, Y. Tang, Y. Zhu, L. Wang, X. Xie, and T. Tan (2019) Session-based recommendation with graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 346–353. Cited by: §1, §2.1, §3.3.1, §3.3.4, §4.1.1, §4.1.1, §4.1.2.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio: (2015) Show, attend and tell: neural image caption generation with visual attention. In Proc. of ICML, Cited by: §3.2, §3.3.4.
  • D. Yang, Z. Guo, Z. Wang, J. Jiang, Y. Xiao, and W. Wang (2018) Knowledge embedding towards the recommendation with sparse user-item interactions. In Proceedings of ICDM, Cited by: §1, §1, §2.2, §3.5.
  • D. Yang, Z. Wang, J. Jiang, and Y. Xiao (2019) Knowledge embedding towards the recommendation with sparse user-item interactions. In Proceedings of ASONAM, Cited by: §1, §1, §2.2, §3.5.
  • X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, and J. Han (2014) Personalized entity recommendation: a heterogeneous information network approach. In Proc. of WSDM, Cited by: §2.2.
  • T. Zhang, P. Zhao, Y. Liu, V. S. Sheng, J. Xu, D. Wang, G. Liu, and X. Zhou (2019) Feature-level deeper self-attention network for sequential recommendation. In Proc. of IJCAI, Cited by: §2.2.
  • M. Zhou, Z. Ding, J. Tang, and D. Yin (2018) Micro behaviors: a new perspective in e-commerce recommender systems. In Proc. of WSDM, Cited by: §2.1, §4.1.1, §4.1.2.