Recommender systems play an important role in many web applications, including web search, e-commerce and entertainment. On many web sites, a user often exhibits a specific short-term intention (Jannach et al., 2015). The natural characteristics of session data precisely reflect a user's behavior pattern and current preference. Therefore, modeling a user's current intention based on his/her behaviors in a recent session often yields satisfactory recommendation results, which is the basic principle of session-based recommendation (SR for short) (Wang et al., 2019). As a representative class of sequential recommendation, SR systems aim to predict the item that a target user will interact with next, based on the user's recent behaviors in a session.
In order to achieve precise recommendations, an SR model generally uses an algorithm to leverage the sequential information in a session. Existing algorithms include Markov-based models and deep neural network (DNN for short) based models. For example, the basic idea of FPMC (Rendle et al., 2010)
is to calculate the transition probabilities between the items in a session based on Markov chains. In recent years, recurrent neural networks (RNNs for short) have been applied in different ways to learn a user's dynamic and time-shifted preferences (Hidasi et al., 2016a; Jing et al., 2017; Quadrana et al., 2017; Liu et al., 2018), exhibiting better performance than traditional methods. Although these deep SR models have been proven effective, the following issues still exist.
The first issue is the lack of exploiting user micro-behaviors. Most previous session-based models (Hidasi et al., 2016a; Jing et al., 2017; Liu et al., 2018) only model a session from a macro perspective, i.e., they regard a session as a sequence of items without taking the users' different operations into account. Even when a user interacts with the same item in a session, the different operations committed on this item reflect the user's different intentions in this session and different preferences on this item. In this paper, we regard a user's specific operation on a certain item as a micro-behavior, which has finer granularity and offers a deeper understanding of the user than an item-level macro-behavior, i.e., a user-item interaction. We use a toy example in Fig. 1, extracted from a real music dataset, to illustrate the significance of incorporating micro-behaviors into SR's session modeling.
In Fig. 1, the user in the first session begins by searching for a hip-hop song sung by the American band Maroon 5. Then, he adds it to his playlist and favorites in turn. Next, he clicks the Artist More button to get a page listing more of Maroon 5's songs, and listens to a song on that page. After adding that song to his favorites, he further listens to another Maroon 5 song through the Artist More page; this last song is also sung by Maroon 5 but belongs to a genre other than hip-hop. In the second session, the user also first searches for the same song, but then clicks the Genre More button to get more hip-hop songs. Next, he also selects a song from the Genre More page. After adding it to his favorites, he further listens to a song which also belongs to hip-hop but is sung by a singer other than Maroon 5. Suppose that we already have the session information in the green dashed rectangles for the two sessions and need to predict which item the user will interact with subsequently. According to traditional SR's paradigm of modeling a session only based on the historical item sequence, the two sessions are learned as the same representation since they consist of the same items. Thus the same item may be recommended as the next interaction for both sessions, which is inconsistent with the observations in Fig. 1. If we instead model the sessions with the user micro-behaviors depicted by the red dashed rectangles rather than with items alone, the two sessions have different representations. Based on such fine-grained session representations, another Maroon 5 song will be recommended in the first session because its transition pattern (songs sharing the same singer) matches the session's earlier transitions; similarly, another hip-hop song will be recommended in the second session.
The second issue is the insufficient utilization of item knowledge to address the sparsity of user-item interactions. Since most previous SR systems model a session only based on the interacted item sequence, they cannot learn session representations sufficiently when historical user-item interactions are sparse, especially for cold-start items. Inspired by (Yang et al., 2018, 2019)
, we can also import item attributes as side information, which are generally distilled from open knowledge graphs (KGs for short) and recognized as item knowledge in this paper, to alleviate the data sparsity problem. Furthermore, the item transitions in terms of micro-behaviors can also be indicated by item knowledge. Recall the example in Fig. 1: the transition between the two Maroon 5 songs is indicated by two knowledge triplets that share the relation song-artist and the tail entity Maroon 5. Therefore, incorporating item knowledge is helpful for micro-behavior based modeling. In this paper, we regard such latent relationships among the items in a session as semantic correlations.
To address the above issues, we propose a novel SR model MKM-SR, which incorporates user Micro-behaviors and item Knowledge into Multi-task learning for Session-based Recommendation. In MKM-SR, a user's sequential pattern in a session is modeled at the micro-behavior level rather than the item level. Specifically, a session is constituted by a sequence of user micro-behaviors, each of which is a combination of an item and its corresponding operation. As a result, learning item embeddings and operation embeddings is the premise of learning micro-behavior embeddings, based on which a session's representation is generated. To this end, we feed the operation sequence and the item sequence of a session into a Gated Recurrent Unit (GRU for short) (Cho et al., 2014) and a Gated Graph Neural Network (GGNN for short) (Li et al., 2016; Wu et al., 2019), respectively. In this step, we adopt different learning mechanisms due to the different characteristics of operations and items. Moreover, we incorporate item knowledge to learn better item embeddings through TransH (Wang et al., 2014), which has been proven an effective KG embedding model for handling many-to-many/one relations. Unlike previous models that use knowledge embeddings as pre-trained item embeddings (Wang et al., 2018b; Yang et al., 2019, 2018), we take knowledge embedding learning as an auxiliary task and add it to a multi-task learning (MTL for short) paradigm in which the major task is to predict the next interacted item. Our extensive experiments verify that our MTL paradigm promotes SR performance more effectively than the previous pre-training paradigm.
In summary, our contributions in this paper are as follows:
1. In order to improve SR performance, we incorporate user micro-behaviors into session modeling to capture the transition patterns between successive items in a session at a fine-grained level, and further investigate the effects of different algorithms on modeling micro-behaviors.
2. We incorporate item knowledge into our model through an MTL paradigm which takes knowledge embedding learning as an auxiliary (sub) task of SR. Furthermore, we validate an optimal training strategy for our MTL through extensive comparisons.
3. We provide deep insights into the rationale of our model's design mechanisms and extensive evaluation results over two realistic datasets (KKBOX and JDATA), to justify our model's superiority over the state-of-the-art recommendation models.
In the rest of this paper, we introduce related work in Section 2 followed by the detailed description of our model in Section 3. We display our extensive experiment results in Section 4 and conclude our work in Section 5.
2. Related Work
In this section, we provide a brief overview of the research related to our work in this paper.
2.1. Session-based and Sequential Recommendation
As a sub-task of sequential recommendation, the objective of SR is to predict the successive item(s) that an anonymous user is likely to interact with, according to the implicit feedback in a session. In general, there are two major classes of approaches to leveraging sequential information from users' historical records, i.e., Markov-based models and DNN-based models. In the first class, (Shani et al., 2005; Huang et al., 2009) use Markov chains to capture sequential patterns between consecutive user-item interactions. (Sarwar et al., 2001; Linden et al., 2003) try to characterize users' latest preferences with the last click, but neglect the previous clicks and discard the useful information in long sequences. Rendle et al. proposed the hybrid model FPMC (Rendle et al., 2010)
, which combines Matrix Factorization (MF for short) and Markov Chains (MC for short) to model sequential behaviors for next-basket recommendation. A major problem of FPMC is that it still adopts static representations of user intentions. Recently, inspired by the power of DNNs in modeling sequences in NLP, some DNN-based solutions have been developed and have demonstrated state-of-the-art performance for SR. In particular, RNN-based models, including LSTM and GRU, are widely used to capture users' general and current interests together by encoding historical interactions into a hidden state (vector). As the pioneers in employing RNNs for SR, Hidasi et al. (Hidasi et al., 2016a) proposed a deep SR model which encodes items into one-hot embeddings and then feeds them into GRUs to achieve recommendation. Afterwards, Jing et al. (Jing et al., 2017) further improved the RNN-based solution by adding an extra mechanism to tackle the short-memory problem inherent in RNNs. In addition, the model in (Liu et al., 2018) utilizes an attention net to model a user's general states and current states separately; it explicitly takes into account the effects of users' current actions on their next moves. Besides, Hidasi et al. proposed (Hidasi et al., 2016b) and (Hidasi and Karatzoglou, 2018)
to improve their model's performance by adjusting the loss functions. More recently, the authors of (Wu et al., 2019) adopted GGNN to capture the complex transition patterns among the items in a session rather than merely one-way transitions. Although these models show promising performance on SR tasks, there is still room for improvement since they all neglect user micro-behaviors in sessions. (Liu et al., 2017; Zhou et al., 2018; Wan and McAuley, 2018; Gu et al., 2020) are among the rare models that consider micro-behaviors. (Wan and McAuley, 2018) only models monotonic behavior chains, where user behaviors are supposed to follow the same chain, ignoring the multiple types of behaviors. To tackle this problem, (Zhou et al., 2018) and (Gu et al., 2020) both adopt LSTM to model micro-behaviors. However, they ignore the different transition patterns of items and operations. In this paper, we adopt an RNN and a GNN simultaneously to model micro-behaviors, which not only considers the differences between items and operations but also preserves the logic of operation order as mentioned in (Wan and McAuley, 2018; Gu et al., 2020).
2.2. Knowledge-based Recommendation
Knowledge-based recommendation has long been recognized as an important family of recommender systems (Burke, 2000). Traditionally, the knowledge includes various item attributes which are used as constraints for filtering out a user's favorite items. As more open linked data emerge, many researchers use the abundant knowledge in KGs as side information to improve the performance of recommender systems. In (Yu et al., 2014), a heterogeneous information network (HIN for short) is constructed based on movie knowledge, and then the relatedness between movies is measured through the volume of meta-paths connecting them. The authors of (Palumbo et al., 2017) applied Node2Vec (Grover and Leskovec, 2016) to learn user/item representations according to different relations between entities in KGs, but it is still a collaborative filtering (CF for short) based method, resulting in poor performance when user-item interactions are sparse. In recent years, many researchers have utilized DNNs to learn knowledge embeddings which are fed into downstream recommendation models. For example, Wang et al. proposed a deep model DKN (Wang et al., 2018b) for news recommendation, in which a translation-based KG embedding model, TransD (Ji et al., 2015), is used to learn knowledge embeddings that enrich news representations. The authors of (Yang et al., 2018, 2019) utilized Metapath2Vec (Swami et al., 2017) to learn knowledge embeddings which are used to generate the representations of users and items. Different from MKM-SR's multi-task learning solution, these models use knowledge embeddings as pre-trained item embeddings. Another representative deep recommendation model incorporating knowledge is RippleNet (Wang et al., 2018a), in which user representations are learned through iterative propagation over a KG containing the entities of items and their attributes. KGs have also been employed for sequential recommendation.
For example, (Huang et al., 2018) proposes a key-value memory network to incorporate movie attributes from KGs, in order to improve sequential movie recommendation. FDSA (Zhang et al., 2019) is a feature-level sequential recommendation model with self-attention, in which the item features can be regarded as item knowledge used to enrich user representations. Unlike our model, these two KG-based sequential recommendation models model user behavior sequences at the macro (item) level rather than the micro level. What is more, our model incorporates item knowledge through an MTL paradigm which takes knowledge embedding learning as an auxiliary (sub) task of SR, guaranteeing that information is shared between user micro-behaviors and item attributes and thus that better representations are learned.
3. Methodology
In this section, we introduce the details of MKM-SR, including the related algorithms involved in the model. We first formalize the problem addressed in this paper, and then summarize the pipeline of MKM-SR, followed by detailed descriptions of each step (component). In the following, we use a bold lowercase letter to represent a vector and a bold uppercase letter to represent a set, matrix or cube (tensor).
3.1. Problem Definition
Since we focus on user micro-behaviors rather than only the interacted items in sessions, we first denote a session by its micro-behavior sequence. Specifically, each micro-behavior in the sequence is a combination of an item and its corresponding operation. Furthermore, the item knowledge we incorporate into MKM-SR is represented in the form of triplets. Formally, a knowledge triplet states that a value belongs to an item's attribute, where the attribute is often recognized as a relation. For example, ⟨Sugar, song-artist, Maroon 5⟩ describes that Maroon 5 is the singer of the song Sugar. MKM-SR is trained with the observed user micro-behaviors in sessions and the obtained item knowledge.
The goal of our SR model is to predict the next interacted item based on a given session. To achieve this goal, our model is fed with the given session and a (next interacted) candidate item to generate a matching score (probability) between them. With these scores, a top-K ranking list can be generated for a given session (one sample). In general, the item with the highest score is predicted as the next interacted item.
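As a concrete illustration of this matching process, the following sketch scores a session representation against a set of candidate item embeddings and returns a top-K list. The function names, dimensions, and the dot-product matcher are illustrative simplifications (the full model computes scores with an MLP, introduced later):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # illustrative embedding dimension

def match_scores(session_repr, item_embs):
    """Score a session against every candidate item and normalize
    the raw scores into probabilities with a softmax."""
    logits = item_embs @ session_repr    # one raw score per candidate item
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

def top_k(session_repr, item_embs, k=3):
    """Return the indices of the k items with the highest match scores."""
    return np.argsort(-match_scores(session_repr, item_embs))[:k]

item_embs = rng.normal(size=(100, d))    # embeddings of 100 candidate items
s = rng.normal(size=d)                   # a learned session representation
print(top_k(s, item_embs))
```

In practice the ranking is produced once per test session, so only the candidate with the highest probability (or the top-K list) needs to be materialized.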
3.2. Model Overview
The framework of MKM-SR is illustrated in Fig. 2. In model training, besides the operations and items involved in training samples, item knowledge is also input into MKM-SR to learn better operation embeddings and item embeddings, and thus better session representations, which are crucial for computing precise matching scores.
For a clearer explanation, we introduce the pipeline of MKM-SR in reverse, from the right side of Fig. 2. For a given session and a candidate item, the final matching score is computed by a multi-layer perceptron (MLP for short) fed with the session's representation and the item's embedding, which are vectors of the same dimension. Specifically, the session's representation is obtained by aggregating a group of micro-behavior embeddings. Given that different objects in a session (sequence) have different levels of priority in representing the session, we adopt a soft-attention mechanism (Xu et al., 2015) to generate the session's global representation, which reflects a user's long-term preference. Furthermore, a micro-behavior embedding is the concatenation of its item embedding (green rectangles in Fig. 2) and operation embedding (red rectangles in Fig. 2), since we believe an operation and an item play different roles in representing a user's micro-behavior. The item embedding and the operation embedding in a micro-behavior embedding are learned by different models, i.e., GGNN and GRU, respectively. The reasons for this design will be explained subsequently. The GGNN and the GRU in MKM-SR are fed with an item sequence and an operation sequence respectively, both of which have the same length as the micro-behavior sequence.
As mentioned before, item knowledge is helpful for discovering the semantic correlations between the items in a session. We therefore add knowledge learning as an auxiliary task in an MTL paradigm by designing a weighted sum of different loss functions. Specifically, we import TransH's (Wang et al., 2014) loss function as the loss of our knowledge learning, since TransH is a KG embedding model that can model many-to-many/one relations effectively. In addition, we adopt the alternating training strategy (Ren et al., 2015) for training MKM-SR.
3.3. Encoding Session Information
In this subsection, we introduce how to obtain the representation of a given session, which is crucial for our model to compute the final matching score. Based on the basic principle of SR (Hidasi et al., 2016a; Jing et al., 2017), the premise of obtaining a session representation is to learn the embedding of each object in the session. In the setting of this paper, an object in the sequence of a session is a micro-behavior. According to our definition, a micro-behavior is the combination of an item and an operation committed on the item. In MKM-SR, we first learn item embeddings and operation embeddings separately, and then concatenate an item embedding and an operation embedding into the embedding of the micro-behavior, rather than directly learning a micro-behavior embedding as a whole. Our experiment results in the subsequent section verify that the former solution is better than the latter. We adopt this solution due to the following intuitions.
3.3.1. Intuitions of Embedding Learning
We argue that item sequences and operation sequences have different effects on session modeling and exhibit different transition patterns. For the item sequence of a session, the transition pattern is actually more complex than the one-way transitions between successive items captured by previous RNN-based sequential models (Hidasi et al., 2016a; Jing et al., 2017; Quadrana et al., 2017). In other words, not only are the subsequent items correlated to the preceding ones in a sequence, but the preceding items are also correlated to the subsequent ones. This is also why a user often interacts with an item that he/she has already interacted with before. Obviously, such a transition pattern relies on bidirectional contexts (preceding and subsequent items) rather than unidirectional contexts, and can be modeled by a graph-based model rather than a one-way sequential model such as GRU. Consequently, inspired by (Wu et al., 2019), we adopt GGNN to model item sequences and obtain item embeddings in MKM-SR.
Although the operations committed by a user in a session also form a sequence, their transition pattern is different from that of item sequences. GGNN is not appropriate for modeling operation sequences due to the following reasons. First, the unique types of operations are very limited on most platforms, so one operation is likely to recur in a sequence; if we converted operation sequences into a directed graph, most nodes (operations) would have similar neighbor sets. Thus, most operation embeddings learned by applying GGNN over such a graph would be very similar and could not well characterize the diversity of a user's preferences. On the other hand, the transitions between two successive operations often demonstrate a certain sequential pattern. For example, a user often adds a product to the cart after he/she reads its comments, or purchases the product after he/she adds it to the cart. Therefore, we adopt GRU rather than GGNN to learn operation embeddings. Next, we introduce the details of learning item embeddings and operation embeddings in turn.
3.3.2. Learning Item Embeddings
In order to learn item embeddings by GGNN, we should first convert an item sequence into a directed graph.
Formally, given the micro-level item sequence of a session, in which each object is the item in a micro-behavior, we construct the corresponding directed graph as follows. Each node represents a unique item in the sequence, and each directed edge links two successive items. Please note that an item often recurs in a session; for example, the same item may appear at several positions of the sequence, and a self-loop exists when an item directly follows itself. To better model the graph, we further construct it as a weighted directed graph: the normalized weight of an edge is calculated as the occurrence frequency of that edge in the sequence divided by the frequency with which its source item occurs as a preceding item.
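A minimal sketch of this construction follows; the helper name and letter item IDs are hypothetical:

```python
from collections import Counter

def session_graph(items):
    """Build normalized edge weights for a session's item sequence.
    weight(u, v) = count of u->v transitions / count of u as a predecessor.
    A self-loop (u, u) arises whenever an item directly follows itself."""
    edge_counts = Counter(zip(items, items[1:]))   # directed edges with multiplicity
    out_counts = Counter(items[:-1])               # how often each item precedes another
    return {(u, v): c / out_counts[u] for (u, v), c in edge_counts.items()}

# Item 'a' recurs, so the graph has fewer nodes than the sequence has steps:
# weights are a->b 0.5, b->a 1.0, a->c 0.5
w = session_graph(['a', 'b', 'a', 'c'])
print(w)
```

The normalization ensures that the outgoing weights of every node sum to one, matching the frequency-based weighting described above.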
In the initial step of GGNN, the initial embedding of a given item node is obtained by lookup in the item embedding matrix and used as its initial hidden state. Based on the basic principle of iterative propagation in GNNs (Scarselli et al., 2009), we use the hidden state after the final propagation step as the node's item embedding. Since the session graph is weighted and directed, we use an incoming adjacency matrix and an outgoing adjacency matrix whose entries are edge weights indicating the extent to which the nodes communicate with each other. A node's hidden state is then computed according to the following update functions,
where all bold lowercases are vectors of the same dimension and all bold uppercases are parameter matrices. σ is the sigmoid function and ⊙ denotes element-wise multiplication. In addition, r and z are the reset gate and the update gate, respectively. As described in Eq. 1, an item's hidden state in the current step is calculated based on its previous state and the candidate state. After the propagation steps, we obtain the learned embeddings of all item nodes in the graph, based on which the item embeddings of the sequence are obtained as
According to the graph's construction, given a session, an item has only one learned embedding no matter whether it recurs in the sequence. Consequently, the resulting item-embedding sequence may contain recurrent embeddings. If an item occurs in multiple sessions, it may have different learned embeddings in them, since different sessions correspond to different session graphs.
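The gated propagation described above can be sketched as a single NumPy step. The exact weight shapes and the way incoming/outgoing messages are mixed are simplifying assumptions rather than the paper's precise parameterization:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(H, A_in, A_out, W, Wz, Wr, Wh):
    """One GGNN propagation step (sketch). H: (n, d) node hidden states;
    A_in/A_out: (n, n) weighted incoming/outgoing adjacency matrices;
    W, Wz, Wr, Wh: (2d, d) parameter matrices (illustrative shapes)."""
    # Messages gathered along incoming and outgoing edges, then mixed
    a = np.concatenate([A_in @ H, A_out @ H], axis=1) @ W     # (n, d)
    z = sigmoid(np.concatenate([a, H], axis=1) @ Wz)          # update gate
    r = sigmoid(np.concatenate([a, H], axis=1) @ Wr)          # reset gate
    h_tilde = np.tanh(np.concatenate([a, r * H], axis=1) @ Wh)  # candidate state
    return (1 - z) * H + z * h_tilde                          # gated update

n, d = 4, 8
rng = np.random.default_rng(1)
H = rng.normal(size=(n, d))          # initial states from the embedding lookup
A_in = rng.random((n, n))            # stand-ins for the session graph's weights
A_out = rng.random((n, n))
W = 0.1 * rng.normal(size=(2 * d, d))
Wz, Wr, Wh = (0.1 * rng.normal(size=(2 * d, d)) for _ in range(3))
H1 = ggnn_step(H, A_in, A_out, W, Wz, Wr, Wh)
print(H1.shape)
```

Running this step a fixed number of times and reading off the final hidden states yields the item embeddings used downstream.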
3.3.3. Learning Operation Embeddings
Due to the aforementioned reasons, we adopt a GRU (Cho et al., 2014) fed with operation sequences to learn operation embeddings. GRU is an improved version of the standard RNN for modeling dynamic temporal behavior, designed to alleviate the vanishing gradient problem.
Formally, an operation sequence is fed into our GRU. For each operation in the sequence, its initial embedding is also obtained by lookup in the operation embedding matrix. Its learned embedding is then the hidden state (vector) output by the GRU at the corresponding step, which is calculated based on the operation's initial embedding and the hidden state of the previous step as follows,
where the function represents the calculation in one GRU unit together with all GRU parameters. The hidden state at each step is, in fact, the learned embedding of the operation at that step; to start the recurrence, the initial hidden state is set to a zero vector. Thus, we obtain the learned embeddings of all operations in the sequence as
Please note that, like an item, an operation may also recur in the sequence. According to GRU's principle, an operation recurring in an operation sequence has multiple different embeddings: for example, if an operation occurs at both the third and the fifth positions of a sequence, its learned embedding in the third position differs from that in the fifth position, because each hidden state depends on the whole preceding context. As a result, the operation-embedding sequence has no recurrent embeddings, which is different from the item-embedding sequence.
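A minimal GRU cell illustrates why a recurring operation receives different embeddings at different positions; the weight shapes and the concatenated-input parameterization are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, Wz, Wr, Wh):
    """One GRU step: x is the current operation's initial embedding, h the
    previous hidden state; each (2d, d) matrix maps [x, h] back to d dims."""
    xh = np.concatenate([x, h])
    z = sigmoid(xh @ Wz)                              # update gate
    r = sigmoid(xh @ Wr)                              # reset gate
    h_tilde = np.tanh(np.concatenate([x, r * h]) @ Wh)  # candidate state
    return (1 - z) * h + z * h_tilde

d = 8
rng = np.random.default_rng(2)
Wz, Wr, Wh = (0.1 * rng.normal(size=(2 * d, d)) for _ in range(3))
ops = rng.normal(size=(4, d))
ops = np.vstack([ops, ops[0]])   # the first operation recurs at the last step
h = np.zeros(d)                  # initial hidden state set to zero
states = []
for x in ops:
    h = gru_cell(x, h, Wz, Wr, Wh)
    states.append(h)             # state t = learned embedding of operation t
print(len(states))
```

Although steps 0 and 4 receive the same operation embedding as input, their output states differ because the hidden state carries the preceding context, which is exactly the position-dependence described above.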
Then, we concatenate each item embedding with its corresponding operation embedding to obtain the embeddings of the micro-behaviors in the given session, as shown in Fig. 2. So we have
where ∥ denotes the concatenation operation. Based on such micro-behavior embeddings, two sessions having the same item sequence but different operation sequences still have different representations, which capture users' fine-grained intentions.
3.3.4. Generating Session Representations
To obtain a session representation, we should aggregate the embeddings of all micro-behaviors in the session. Inspired by (Wu et al., 2019), we take into account a session's local preference and global preference. A session's local preference is directly represented by the embedding of the most recent micro-behavior.
For representing a session's global preference, we use the soft-attention mechanism (Xu et al., 2015) to assign a proper weight to each micro-behavior's embedding in the session, since different micro-behaviors have different levels of priority. Specifically, given a micro-behavior, its attention weight is computed as
where the involved weight matrices and vectors are learnable attention parameters. Then, the global representation of the session is
At last, the session’s final representation is
After obtaining the representation of the session, we compute the final matching score through an MLP fed with it and the candidate item's embedding, followed by a Softmax operation. Thus we have
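The aggregation into local, global, and final representations can be sketched as follows; the tanh attention scorer and all parameter names are illustrative assumptions rather than the paper's exact formulation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def session_repr(M, q, W1, W2):
    """M: (L, 2d) micro-behavior embeddings (item || operation).
    The last micro-behavior gives the local preference; soft-attention over
    all steps gives the global preference; both are concatenated."""
    local = M[-1]
    alpha = softmax(np.tanh(M @ W1 + local @ W2) @ q)   # attention weights, (L,)
    global_pref = alpha @ M                             # weighted sum of micro-behaviors
    return np.concatenate([local, global_pref])         # final session representation

L, two_d = 5, 16
rng = np.random.default_rng(3)
M = rng.normal(size=(L, two_d))
W1 = rng.normal(size=(two_d, two_d))
W2 = rng.normal(size=(two_d, two_d))
q = rng.normal(size=two_d)
s = session_repr(M, q, W1, W2)
print(s.shape)
```

In the full model this concatenated vector is typically passed through one more linear layer before being matched against candidate item embeddings.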
To train MKM-SR, we first collect sufficient training samples of session-item pairs, where a pair is labeled 1 if the item is the next interacted item of the user following the session, and 0 otherwise. Then we adopt binary cross-entropy as the loss function of the SR task as follows,
where the sums range over the session set and the item set in the training samples.
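The binary cross-entropy over such labeled pairs can be written directly; the clipping is a numerical-safety detail added here, not part of the paper's formulation:

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy over (session, candidate item) training pairs:
    y = 1 if the item is the true next interaction, else 0."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1 - y_true) * np.log(1 - y_pred)))

y_true = np.array([1.0, 0.0, 0.0])           # one positive, two negatives
y_pred = np.array([0.9, 0.1, 0.2])           # model's predicted probabilities
print(round(bce_loss(y_true, y_pred), 4))    # 0.1446
```

The loss decreases as the positive pair's probability approaches 1 and the negative pairs' probabilities approach 0.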
3.4. Learning Knowledge Embeddings
Recall the toy example of Fig. 1: the next interacted item of each session is semantically correlated to the previous items in the session in terms of shared knowledge (the same singer or genre). As a result, item embeddings learned based on such shared knowledge, regarded as knowledge embeddings in this paper, are often in consonance with the interaction sequences. This observation inspires us to use knowledge embeddings to enhance SR performance. In this subsection, we introduce how to learn knowledge embeddings from the observed item knowledge.
In a KG containing items, many-to-one and many-to-many relations are often observed. For example, many songs are sung by one singer, a movie may belong to several genres, and a movie genre includes many movies. Among the state-of-the-art KG embedding models, TransH (Wang et al., 2014) introduces hyperplanes to handle many-to-many/one relations effectively. Therefore, we import the training loss of TransH to learn knowledge embeddings in our model.
Specifically, for each attribute relation we first position a relation-specific translation vector in a relation-specific hyperplane. Given a triplet, the item's embedding and the attribute value's embedding are first projected onto the hyperplane along its normal vector. If the triplet is correct, we expect the two projections to be connected by the translation vector on the hyperplane with a low error. Thus, a score function measuring the residual of this translation is used to quantify the implausibility of the triplet. Since the normal vector is restricted to unit norm, the projections can be expressed directly in terms of the original embeddings and the normal vector, as follows.
Therefore, the loss function for knowledge embedding learning is
where the sum ranges over the set of all observed knowledge triplets.
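The TransH projection and score can be sketched as follows; the variable names are illustrative, and the construction of the consistent tail embedding is a synthetic check rather than learned data:

```python
import numpy as np

def transh_score(h, t, d_r, w_r):
    """TransH implausibility score for a triplet (item h, relation r, value t):
    project h and t onto the hyperplane with unit normal w_r, then measure
    how far the projections are from being connected by the translation d_r."""
    w_r = w_r / np.linalg.norm(w_r)           # unit normal of the hyperplane
    h_p = h - (h @ w_r) * w_r                 # projection of the item embedding
    t_p = t - (t @ w_r) * w_r                 # projection of the value embedding
    return float(np.linalg.norm(h_p + d_r - t_p) ** 2)   # low = plausible

dim = 8
rng = np.random.default_rng(4)
w_u = rng.normal(size=dim)
w_u /= np.linalg.norm(w_u)
d_r = rng.normal(size=dim)
d_r -= (d_r @ w_u) * w_u                      # translation lies in the hyperplane
h = rng.normal(size=dim)
# A tail whose hyperplane projection equals h's projection plus d_r:
t_good = (h - (h @ w_u) * w_u) + d_r + 0.7 * w_u
print(round(transh_score(h, t_good, d_r, w_u), 6))   # 0.0: a plausible triplet
```

Because the projection discards the component along the normal, many items can share the same attribute value (e.g. many songs map near the same artist on that relation's hyperplane) without their full embeddings collapsing together, which is exactly why TransH handles many-to-many/one relations.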
3.5. The Objective of Multi-task Learning
Many previous knowledge-based recommendation models (Wang et al., 2018b; Yang et al., 2019, 2018) generally learn knowledge embeddings in advance and use them as pre-trained item embeddings. In other words, the knowledge loss is used to pre-train the item embeddings before the recommendation loss fine-tunes them. In such a scenario, knowledge embedding learning and recommendation are two separate learning tasks.
In general, incorporating two learning tasks into an MTL paradigm is more effective than achieving their respective goals separately, provided the two tasks are related to each other: in MTL, the learning results of one task can serve as hints to guide the other task to learn better (Ruder, 2017). Inspired by the observations on the example in Fig. 1, learning knowledge embeddings can be regarded as an auxiliary task that predicts the features (item embeddings) used for SR's prediction task. Consequently, in MKM-SR we import knowledge embedding learning as an auxiliary task into an MTL paradigm to assist the SR task.
In our scenario, the MTL's objective is to maximize the posterior probability of our model's parameters given the knowledge triplet set and SR's training set. According to the Bayes rule, this objective is
where the first factor is the likelihood of observing the knowledge triplets given the parameters, and the second is the likelihood of observing the training labels given the parameters and the triplets, which is defined as a product of Bernoulli distributions. Then, the comprehensive loss function of our MTL's objective is
where the last term is a regularization term to prevent over-fitting, and the weights of the knowledge loss and the regularization term are control parameters whose values we obtain through tuning experiments.
During the optimization of this loss, there are two MTL training strategies, alternating training and joint training (Ren et al., 2015). For alternating training, we have
where the two symbols represent the set of sessions and the set of candidate items in the training set, respectively. For joint training, we have
Through empirical comparisons, we have verified that alternating training is a better strategy for MKM-SR.
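The two strategies can be contrasted with a schematic training loop; `step_sr`, `step_kg`, and `step_joint` are hypothetical stand-ins for single optimizer steps on the respective losses:

```python
def alternating_training(sr_batches, kg_batches, step_sr, step_kg, epochs=1):
    """Alternating MTL (sketch): each epoch first optimizes the SR loss over
    all recommendation batches, then the knowledge loss over all triplet batches."""
    for _ in range(epochs):
        for batch in sr_batches:     # main task: next-item prediction
            step_sr(batch)
        for batch in kg_batches:     # auxiliary task: TransH knowledge loss
            step_kg(batch)

def joint_training(sr_batches, kg_batches, step_joint, lam=0.1, epochs=1):
    """Joint MTL (sketch): every step optimizes the weighted sum of the SR
    loss and the knowledge loss together."""
    for _ in range(epochs):
        for sr_b, kg_b in zip(sr_batches, kg_batches):
            step_joint(sr_b, kg_b, lam)

log = []
alternating_training([1, 2], ['t1'],
                     lambda b: log.append(('sr', b)),
                     lambda b: log.append(('kg', b)))
print(log)  # [('sr', 1), ('sr', 2), ('kg', 't1')]
```

The difference is purely in scheduling: alternating training updates the shared item embeddings under one loss at a time, while joint training blends both gradients in every step.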
4. Experiments
In this section, we try to answer the following research questions (RQs for short) through extensive experiments.
RQ1: Can our model MKM-SR outperform the state-of-the-art SR models?
RQ2: Is it useful to incorporate micro-behaviors and knowledge into our model?
RQ3: Is it rational to obtain a session’s representation through learning item embeddings by GGNN and learning operation embeddings by GRU separately?
RQ4: Which training strategy is better for incorporating knowledge learning into MKM-SR?
4.1. Experiment Settings
We evaluate all compared models on the following realistic datasets:
KKBOX (https://www.kaggle.com/c/kkbox-music-recommendation-challenge/data): This dataset was provided by the popular music streaming service KKBOX and contains many users' historical records of listening to music in a given period. We take the 'source system tab' as user operations, such as 'tab my library' (manipulation of local storage) and 'tab search'. The music attributes used in our experiments include artist (singer), genre, language and release year.
JDATA (https://jdata.jd.com/html/detail.html?id=8): This dataset was extracted from JD.com, a famous Chinese e-commerce website, and contains a stream of user actions on JD.com within two months. The operation types include clicking, ordering, commenting, adding to cart and adding to favorites. The product attributes used in our experiments include brand, shop, category and launch year.
For both datasets, we considered four item attributes (relations) as the knowledge incorporated into our model. As in (Hidasi et al., 2016a; Wu et al., 2019), we set the duration threshold of sessions in JDATA to one hour, and set the index gap of sessions in KKBOX to 2000 (according to statistical analysis), to divide different sessions. We also filtered out the sessions of length 1 and the items appearing fewer than 3 times in the datasets. For both datasets, we took the earlier 90% of user behaviors as the training set, and the subsequent (most recent) 10% as the test set. In model prediction, given a test session, the models first compute the matching scores of all items and then generate a top-K list according to the scores.
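The time-gap session-splitting rule for JDATA can be sketched as follows; the event field layout and helper name are assumptions, and KKBOX would use an index gap instead of a time gap:

```python
def split_sessions(events, gap=3600):
    """Split one user's time-ordered (timestamp, item, operation) events into
    sessions whenever the gap between consecutive events exceeds `gap` seconds
    (one hour for JDATA); sessions of length 1 are dropped."""
    sessions, cur = [], []
    last_t = None
    for t, item, op in events:
        if last_t is not None and t - last_t > gap:
            if len(cur) > 1:          # filter out length-1 sessions
                sessions.append(cur)
            cur = []
        cur.append((item, op))        # keep the micro-behavior (item, operation)
        last_t = t
    if len(cur) > 1:
        sessions.append(cur)
    return sessions

events = [(0, 'a', 'click'), (100, 'b', 'order'),
          (9000, 'c', 'click'), (9100, 'c', 'favorite')]
print(split_sessions(events))
# [[('a', 'click'), ('b', 'order')], [('c', 'click'), ('c', 'favorite')]]
```

Note that each session element keeps both the item and the operation, so the same splitting pass yields the micro-behavior sequences the model consumes.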
In order to demonstrate the effectiveness of incorporating item knowledge to alleviate the problem of cold-start items, we added two manipulations to our datasets that differ from previous SR evaluations. The first is to retain the items that only appear in the test set, i.e., the cold-start items. The second is to simulate a sparse JDATA dataset, denoted as Demo, by retaining only the earliest 1% of user behaviors. Such a sparse dataset has a larger proportion of cold-start items. In previous SR models such as (Liu et al., 2018; Wu et al., 2019; Zhou et al., 2018), these cold-start items’ embeddings are initialized randomly and cannot be tuned during model training, since they are not involved in any training sample. Thus, the recommendations for these items are often unsatisfactory.
The statistics of our datasets are shown in Table 1, where ’(N)’ indicates the datasets containing some cold-start items, and ’new%’ is the proportion of behaviors involving cold-start items among all behaviors in the test set. We have taken into account all operation types provided by the two datasets. To facilitate reproducing our experiment results, our experiment samples and MKM-SR’s source code have been published at https://github.com/ciecus/MKM-SR.
[Table 1. Dataset statistics: session#, session length, item# (new%), item frequency, operation#]
4.1.2. Compared Models
To demonstrate MKM-SR’s superior performance, we compared it with the following state-of-the-art SR models:
FPMC (Rendle et al., 2010): It is a sequential prediction method based on personalized Markov chains, which is often used as an SR baseline.
GRU4REC+BPR/CE (Hidasi and Karatzoglou, 2018): These two baselines are improved versions of GRU4REC (Hidasi et al., 2016a), a state-of-the-art SR model. GRU4REC+BPR uses Bayesian personalized ranking (Rendle et al., 2012) as its loss function, and GRU4REC+CE uses cross-entropy.
NARM (Jing et al., 2017): It is a GRU-based SR model with an attention mechanism that considers the long-term dependency of user preferences.
STAMP (Liu et al., 2018): This SR model considers both current interests and general interests of users. In particular, STAMP uses an additional neural network to model current interests.
SR-GNN (Wu et al., 2019): It also utilizes GGNN to capture the complex transition patterns among the items in a session, but does not incorporate micro-behaviors and knowledge.
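For intuition about the Markov-based baseline, its core can be sketched as first-order transition statistics. This is a simplification: FPMC additionally personalizes and factorizes the transition matrix, which the sketch below omits:

```python
from collections import defaultdict

def markov_scores(sessions):
    """Build first-order Markov transition counts from training sessions
    and return a scoring function: the empirical probability of moving
    from the previous item to a candidate next item."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in sessions:
        for prev, nxt in zip(s, s[1:]):
            counts[prev][nxt] += 1

    def score(prev, candidate):
        total = sum(counts[prev].values())
        return counts[prev][candidate] / total if total else 0.0

    return score
```

Ranking all items by this score for the last item of a test session yields the top-k list; FPMC improves on this by factorizing a per-user transition tensor.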
In addition, to justify the necessity and validity of incorporating micro-behaviors and knowledge into our model, we further propose the following variants of MKM-SR for comparison.
KM-SR: It removes all modules related to operations and the rest components are the same as MKM-SR. We compared MKM-SR with KM-SR to verify the significance of incorporating micro-behaviors.
M-SR: It removes the auxiliary task of learning knowledge embeddings (i.e., the corresponding term in Eq. 14), and the rest of the components are the same as MKM-SR. All of the following variants also remove the task of learning knowledge embeddings; the differences among them lie in how they model sessions.
M(GRU/GGNN)-SR: Unlike MKM-SR, these two variants directly learn micro-behavior embeddings. The only difference between them is that M(GRU)-SR feeds micro-behavior sequences to a GRU while M(GGNN)-SR feeds them to a GGNN.
M(GGNNx2)-SR: It uses two GGNNs to learn operation embeddings and item embeddings, respectively.
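To make the contrast between these variants concrete, the following sketch (with hypothetical dimensions; `micro_separate` and `micro_joint` are illustrative names, not the paper's notation) compares separate item/operation embedding tables, as in MKM-SR, with the single joint micro-behavior table learned directly by M(GRU)-SR and M(GGNN)-SR; the joint table grows multiplicatively with the numbers of items and operations:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_ops, d = 100, 5, 8

# MKM-SR style: separate tables; a micro-behavior pairs an item
# embedding with an operation embedding.
item_emb = rng.normal(size=(n_items, d))
op_emb = rng.normal(size=(n_ops, d))

def micro_separate(item, op):
    return np.concatenate([item_emb[item], op_emb[op]])

# M(GRU)-SR / M(GGNN)-SR style: one table indexed by the joint
# (item, operation) id, learned directly as a micro-behavior embedding.
joint_emb = rng.normal(size=(n_items * n_ops, 2 * d))

def micro_joint(item, op):
    return joint_emb[item * n_ops + op]

# The separate scheme needs (n_items + n_ops) * d parameters,
# while the joint scheme needs n_items * n_ops * 2 * d.
```

Besides the parameter blow-up, the joint table lets item and operation transition patterns interfere with each other, which is what the ablation below examines.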
4.1.3. Evaluation Metrics
We use the following metrics, which have been widely used in previous SR evaluations, to evaluate all models’ performance.
Hit@k: The proportion of hit samples, i.e., samples whose correct next interacted item appears in the top-k ranking list, among all samples.
MRR@k: The average reciprocal rank of the correct next interacted item in the top-k ranking list. The reciprocal rank is set to zero if the correct item is ranked outside the top-k.
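As a concrete reference, the two metrics can be computed as follows (a straightforward sketch; `rank_lists` holds each test sample's top-ranked items in order, and `targets` the ground-truth next items):

```python
def hit_at_k(rank_lists, targets, k=20):
    """Hit@k: fraction of samples whose ground-truth next item
    appears in the top-k ranking list."""
    hits = sum(t in ranks[:k] for ranks, t in zip(rank_lists, targets))
    return hits / len(targets)

def mrr_at_k(rank_lists, targets, k=20):
    """MRR@k: mean reciprocal rank of the ground-truth item,
    with rank contributions of 0 when it falls outside the top-k."""
    rr = 0.0
    for ranks, t in zip(rank_lists, targets):
        top = ranks[:k]
        rr += 1.0 / (top.index(t) + 1) if t in top else 0.0
    return rr / len(targets)
```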
4.1.4. Hyper-parameter Setup
For fair comparisons, we adopted the same dimension of operation and item embeddings for MKM-SR and all baselines. Due to space limitations, we only display the results of 100-dimensional embeddings; consistent conclusions were drawn from the experiment results with embeddings of other dimensions. All embeddings were initialized by a Gaussian distribution with mean 0 and standard deviation 0.1. We set GGNN’s step number to 1. MKM-SR was learned by alternating training rather than joint training, for reasons verified in the following experiments. In addition, we used the Adam (Kingma and Ba, 2015) optimizer with learning rate 0.001 and batch size 128. For the baselines, we used the default hyper-parameter settings from their papers except for the embedding dimension. The control parameter in Eq. 14 was set to 0.0001 for each dataset, as decided through our tuning experiments. For the L2 penalty, we followed the settings of previous SR models (Wang et al., 2019).
Next, we display the results of our evaluations to answer the aforementioned RQs, based on which we provide some insights into the reasons for the superiority or inferiority of the compared models.
4.2. Global Performance Comparisons
At first, we compared all models’ SR performance on the different datasets to answer RQ1; the Hit@20 and MRR@20 scores (in percentage) are listed in Table 2. The displayed scores are the average of five runs for each model.
The comparison results show that MKM-SR outperforms all baselines and variants on all datasets (answering yes to RQ1). Especially on the datasets with cold-start items and on Demo, MKM-SR and KM-SR show more remarkable superiority. Such results justify the effects of incorporating knowledge to alleviate the sparsity problem of cold-start items (answering yes to RQ2). As shown in Table 1, KKBOX has more unique operations than JDATA, which are useful for capturing user preferences at a fine-grained level. Therefore, besides MKM-SR and M-SR, RIB, another model incorporating user operations, also has a more remarkable advantage on KKBOX than on JDATA compared with the GRU-based baselines that do not incorporate operations. These results justify the effects of incorporating micro-behaviors (answering yes to RQ2).
In addition, a user is more likely to interact with the same item repeatedly within a session on JDATA. The transition patterns between successive items in such a scenario can be captured better by GGNN than by GRU. This is why SR-GNN has a greater advantage on JDATA than on KKBOX, compared with the GRU-based models including GRU4REC+BPR/CE and NARM.
4.3. Ablation Study
We further compared MKM-SR with its variants to answer RQ2 and RQ3, and draw the following conclusions based on the results in Table 2. MKM-SR’s advantage over KM-SR and M-SR shows that both micro-behaviors (operations) and item knowledge deserve to be incorporated for improving SR performance (answering yes to RQ2). In addition, M-SR outperforms M(GRU)-SR and M(GGNN)-SR, indicating that modeling a session by learning item embeddings and operation embeddings separately is more effective than learning micro-behavior embeddings directly. As stated before, the transition pattern of item sequences differs from that of operation sequences; therefore, it is less effective to combine an item with an operation into a micro-behavior and then learn the micro-behavior sequence with a single model. Furthermore, M-SR’s superiority over M(GGNNx2)-SR shows that operation sequences should be learned by GRU rather than GGNN, for the reason explained in Subsec. 3.3. These results provide a yes answer to RQ3.
4.4. Strategies of Incorporating Knowledge Learning
As mentioned before, there are two training strategies for our MTL loss in Eq. 14, i.e., alternating training (Eq. 15) and joint training (Eq. 16). To answer RQ4, we trained KM-SR with each of the two strategies and compared their learning curves. Furthermore, we added a pre-training variant for comparison, in which the item embeddings are first pre-trained by TransH and then input into KM-SR to be tuned only by the loss in Eq. 10. We did not adopt MKM-SR in this comparison experiment because the three training strategies are not relevant to operation embedding learning.
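The difference between the two strategies can be sketched with toy scalar gradients. This is an illustration only: the real losses of Eqs. 14–16 are over the network parameters, and `lam` stands for the control parameter weighting the knowledge loss:

```python
def joint_step(theta, grad_sr, grad_kg, lam=1e-4, lr=0.1):
    """Joint training: one gradient step on the combined loss
    L_SR + lam * L_KG, so both gradients touch theta every step."""
    return theta - lr * (grad_sr(theta) + lam * grad_kg(theta))

def alternating_epoch(theta, grad_sr, grad_kg, lam=1e-4, lr=0.1):
    """Alternating training: a step on the SR loss, then a separate
    step on the (weighted) knowledge loss."""
    theta = theta - lr * grad_sr(theta)
    theta = theta - lr * lam * grad_kg(theta)
    return theta
```

Under joint training, every occurrence of a frequent item in the training sessions also triggers a knowledge-loss update on its embedding, which is the bias discussed below; alternating training decouples the two update streams.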
In Fig. 3, we only display the learning curves of MRR@20 on KKBOX(N) and JDATA(N), since MRR@k reflects ranking performance better than Hit@k. The curves show that although the pre-training model has a better learning start, it is overtaken by the other two rivals at the convergence stage. Such results demonstrate MTL’s superiority over knowledge-based pre-training. According to Eq. 16, the items often occurring in the sessions of the training set will be tuned multiple times by the auxiliary loss in each epoch of joint training. This biases the learned embeddings too much toward the auxiliary task, shrinking the learning effect of the main task. Therefore, alternating training is better than joint training in our SR scenario.
To further verify the significance of incorporating knowledge learning into the MTL paradigm (Eq. 14), we visualize the embedding distributions of some items sampled from KKBOX(N), learned by different mechanisms, in Fig. 4, where points of different colors represent songs of different genres. In Fig. 4(a), the item embeddings were learned by feeding the item sequences of sessions into Word2Vec, so two items are close in the space if they often co-occur in some sessions. As shown in the sub-figure, such embeddings make many songs of different genres too converged and thus cannot discriminate different genres. In Fig. 4(b), the item embeddings were learned solely by TransH. Although these embeddings discriminate different genres obviously, the gap between groups of different genres is too big. This makes it hard for a model based on embedding distances to predict an item of a different genre as the next interacted item, which does not conform to some facts, such as the song sequence shown in Fig. 1. It is also the reason why the pre-training model is defeated by the joint-training and alternating-training models in Fig. 3. The item embeddings shown in Fig. 4(c) were learned by MKM-SR through MTL and exhibit two characteristics: they discriminate different genres for most items, while keeping close distances across different genres. Item embeddings with these two characteristics well indicate two kinds of correlations between successive items in a session. The former characteristic indicates the semantic correlations among items, and the latter indicates the items’ co-occurrence correlations across different sessions. In fact, these two correlations can be captured respectively through the knowledge learning task and the recommendation learning task. Obviously, capturing these two correlations simultaneously is useful for improving SR.
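Such 2-D views of embedding distributions can be produced with any projection; the paper does not specify the method behind Fig. 4, so the sketch below assumes a simple PCA via SVD:

```python
import numpy as np

def project_2d(embeddings):
    """Project item embeddings to 2-D for plotting: center the
    matrix, then keep the two principal directions given by the
    top right singular vectors."""
    X = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T
```

Coloring the projected points by genre then reproduces the kind of comparison made across Fig. 4(a)–(c).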
4.5. MTL’s Control Parameter Tuning
At last, we investigate the influence of the control parameter in the MTL loss on MKM-SR’s final recommendation performance. Fig. 5 shows MKM-SR’s MRR@20 scores on KKBOX(N), from which we find that MKM-SR’s performance varies marginally (within 1%) across the tested settings of the control parameter. Moreover, MKM-SR gains its best score with a small control parameter value, implying that, as an auxiliary task, knowledge embedding learning will disturb the main task of SR if it is assigned too much weight.
5. Conclusion
In this paper, we propose a novel session-based recommendation (SR) model, namely MKM-SR, which incorporates user micro-behaviors and item knowledge simultaneously. According to the different intuitions about item sequences and operation sequences in a session, we adopt different mechanisms to learn item embeddings and operation embeddings, which are used to generate fine-grained session representations. We also investigate the significance of learning knowledge embeddings and the influences of different training strategies through sufficient comparison studies. MKM-SR’s superiority over the state-of-the-art SR models is justified by our extensive experiments and inspires a promising direction for improving SR.
References
- Knowledge-based recommender systems.
- On the properties of neural machine translation: encoder-decoder approaches. Computer Science.
- Node2vec: scalable feature learning for networks. In Proc. of KDD.
- Hierarchical user profiling for e-commerce recommender systems. In Proc. of WSDM, pp. 223–231.
- Session-based recommendations with recurrent neural networks. In Proc. of ICLR.
- Recurrent neural networks with top-k gains for session-based recommendations. In Proc. of CIKM, pp. 843–852.
- Parallel recurrent neural network architectures for feature-rich session-based recommendations. In Proc. of RecSys.
- Improving sequential recommendation with knowledge-enhanced memory networks. In Proc. of SIGIR.
- A Markov-based recommendation model for exploring the transfer of learning on the web. Journal of Educational Technology & Society 12 (2), pp. 144–162.
- Adaptation and evaluation of recommendations for short-term shopping goals. pp. 211–218.
- Knowledge graph embedding via dynamic mapping matrix. In Proc. of ACL.
- Neural attentive session-based recommendation. In Proc. of CIKM.
- Adam: a method for stochastic optimization. In Proc. of ICLR.
- Gated graph sequence neural networks. In Proc. of ICLR.
- Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing 7 (1), pp. 76–80.
- Multi-behavioral sequential prediction with recurrent log-bilinear model. IEEE Transactions on Knowledge and Data Engineering 29 (6), pp. 1254–1267.
- STAMP: short-term attention/memory priority model for session-based recommendation. In Proc. of SIGKDD.
- Efficient estimation of word representations in vector space. arXiv:1301.3781.
- Entity2rec: learning user-item relatedness from knowledge graphs for top-n item recommendation. In Proc. of RecSys.
- Personalizing session-based recommendations with hierarchical recurrent neural networks. In Proc. of RecSys.
- Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
- BPR: Bayesian personalized ranking from implicit feedback. pp. 452–461.
- Factorizing personalized Markov chains for next-basket recommendation. In Proc. of WWW.
- Factorization machines with libFM. ACM Transactions on Intelligent Systems and Technology 3 (3), pp. 1–22.
- An overview of multi-task learning in deep neural networks. arXiv:1706.05098.
- Item-based collaborative filtering recommendation algorithms. In Proc. of WWW, pp. 285–295.
- The graph neural network model. IEEE Trans. Neural Networks 20 (1), pp. 61–80.
- An MDP-based recommender system. Journal of Machine Learning Research 6 (1), pp. 1265–1295.
- Metapath2vec: scalable representation learning for heterogeneous networks. In Proc. of KDD.
- Item recommendation on monotonic behavior chains. In Proc. of RecSys, pp. 86–94.
- RippleNet: propagating user preferences on the knowledge graph for recommender systems. In Proc. of CIKM.
- DKN: deep knowledge-aware network for news recommendation. In Proc. of WWW.
- A survey on session-based recommender systems. CoRR abs/1902.04864.
- Knowledge graph embedding by translating on hyperplanes. In Proc. of AAAI.
- Session-based recommendation with graph neural networks. In Proc. of AAAI, Vol. 33, pp. 346–353.
- Show, attend and tell: neural image caption generation with visual attention. In Proc. of ICML.
- Knowledge embedding towards the recommendation with sparse user-item interactions. In Proc. of ICDM.
- Knowledge embedding towards the recommendation with sparse user-item interactions. In Proc. of ASONAM.
- Personalized entity recommendation: a heterogeneous information network approach. In Proc. of WSDM.
- Feature-level deeper self-attention network for sequential recommendation. In Proc. of IJCAI.
- Micro behaviors: a new perspective in e-commerce recommender systems. In Proc. of WSDM.