Unlike conventional recommendation algorithms, which model each user-item interaction separately (KorenBV09), recent sequential recommendation approaches meet more realistic requirements because they can model dynamic user interest. Session-based target behavior prediction (hidasi2015session) is one of the main problems studied in this regard. It aims to predict the next item that a user will interact with under a specific type of behavior (e.g., clicking an item). Based on the predictions, information providers can effectively deliver items to appropriate users, and users can quickly find the items they actually want. Note that we use session-based prediction and session-based recommendation interchangeably throughout this paper.
Early studies of this problem assume that the appearance of the next item depends only on the previous item in the same sequence (rendle2010factorizing; ZhangW15). With such a strong assumption, they can model only the last item in each sequence and ignore the other information in the sequence. To relax this assumption, various methods adopt sequential models to learn from behavior sequences in session-based recommendation. Recurrent Neural Networks (RNN) (Hochreiter-NC97) are commonly leveraged and obtain promising performance. The relevant methods can be roughly divided into two categories: single-session based recommendation models (hidasi2015session; gc_san) and multi-session based recommendation models (quadrana2017personalizing; YouWPERL19). Because the latter category requires the user ID of each behavior sequence to be known in advance so that multiple sequences of the same user can be linked together, it is less widely applicable than the first category due to privacy issues and user scalability problems (e.g., a billion active users each day in WeChat). As such, we study session-based target behavior prediction from the perspective of single-session based modeling.
In the domain of single-session based behavior prediction, some studies (liu2018stamp; RenCLR0R19; sun2019bert4rec) adopt attention mechanisms (Bahdanau-Arxiv14; Vaswani-NIPS2017) and outperform the pioneering RNN based methods (hidasi2015session). Recent advances in graph neural networks (GNN) (DefferrardBV16; hamilton2017inductive) further boost the performance of session-based behavior prediction by modeling each session-based behavior sequence as a graph, achieving state-of-the-art performance (sr_gnn; gc_san). However, existing studies in this regard still suffer from several limitations. Firstly, they use only one type of user behavior as input for next item prediction and ignore the potential of leveraging other types of behavior as auxiliary information. This is particularly crucial when the target behavior is sparse but important (e.g., buying or sharing an item). Secondly, item-to-item relations are modeled separately and locally, since both RNN based and GNN based recommendation models utilize only one behavior sequence at a time. Intuitively, abundant item-to-item relations are hidden across the various behavior sequences. For example, if many other users have bought item B after buying item A, the relation between items A and B is especially informative when a target user has just bought item A.
To overcome these limitations, we propose a novel Multi-relational Graph Neural Network model for Session-based target behavior Prediction, MGNN-SPred for short. The target behavior we focus on is the aforementioned sparse behavior, as opposed to the dense click behavior. MGNN-SPred jointly considers target behavior and auxiliary behavior sequences and explores global item-to-item relations for accurate prediction. Specifically, to consider the global item-to-item relations, we build a Multi-Relational Item Graph (MRIG) based on the past behavior sequences of all sessions. There might exist multiple relations between two graph nodes, denoting target and auxiliary behavior types. Based on MRIG, MGNN-SPred encodes global item-to-item relations into node representations and further obtains local representations for the current target and auxiliary behavior sequences, respectively. In the end, MGNN-SPred leverages a gating mechanism to adaptively fuse the representations of the target behavior sequence and the auxiliary behavior sequence to produce the current user interest representation.
The main contributions of this work are summarized as follows:
1. We address the two limitations of existing methods by breaking the restriction of only using one type of behavior sequence in session-based recommendation and exploring another type of behavior as auxiliary information. We further construct the multi-relational item graph for learning global item-to-item relations.
2. To effectively model MRIG w.r.t. target and auxiliary behavior sequences, we develop the novel graph model MGNN-SPred, which learns global item-to-item relations through a graph neural network and integrates the representations of the current target and auxiliary sequences by a gating mechanism.
3. We carry out extensive experiments and demonstrate that MGNN-SPred achieves the best performance among strong competitors, showing the benefits of overcoming the two limitations.
2. Related Work
Session-Based Behavior Prediction. In the literature, the pioneering study (hidasi2015session) on single-session based recommendation first adopts a recurrent neural network based approach, with past interacted items as the input at different time steps. Following that, (tan2016improved) improves the model with data augmentation and by accounting for temporal shifts in user behavior. In addition to using an RNN, (li2017neural) also adopts an attention mechanism to capture a user’s sequential behavior and main purpose in the current session. Similarly, (liu2018stamp) proposes a novel attention mechanism to capture both users’ long-term general interests and their short-term attention. More recently, with the flourishing of Graph Neural Network (GNN) methodologies, (sr_gnn) first models each session sequence as a separate graph and uses graph neural networks to capture complex item transitions within that graph. Afterwards, each session is represented as the combination of the global preference and the current interests of this session using an attention network. (gc_san) is similar to (sr_gnn), but uses a multi-layered self-attention network as an alternative to capture long-range dependencies between items within a session. As discussed in the introduction, these existing methods suffer from two limitations, which motivate the model proposed in this paper.
Multi-Behavior Modeling. Multi-behavior modeling for recommender systems aims to leverage other types of user behavior to boost the recommendation performance on the target behavior. A few studies have already investigated this scenario from different perspectives. (Krohn-GrimbergheDFS12) leverages users’ social interactions as auxiliary behavior for target behavior prediction through collective matrix factorization (CMF) techniques. In a similar fashion, (ZhaoCHC15) builds multiple matrices from different user behaviors, covering resharing, commenting, posting, etc. CMF is again adopted to learn shared user representations for recommendation. (LoniPLH16) proposes multi-feedback Bayesian personalized ranking (BPR), an extension of the classical Bayesian personalized ranking approach tailored for different user behaviors. It differentiates the preference levels of different user behaviors in the sampling stage for ranking. (DingY0QLCJY18) also assigns different preference levels to various user behaviors. Instead of BPR, it incorporates this information into an element-wise alternating least squares learner. More recently, a neural network approach is proposed by (Gao0GCFLCJ19) to learn representations for user-item interactions with different behaviors. Multi-task learning is conducted to predict multiple behaviors with respect to a certain item in a cascading way. Our work fundamentally differs from the above studies, since all of them assume that different user-item interactions are independent, while our study is more realistic in that it models user behaviors in a sequential setting.
Graph Neural Networks. Graph neural networks are methods used to generate representations of graph structured data, such as social networks and knowledge graphs. (perozzi2014deepwalk) extends Word2vec (mikolov2013distributed) by proposing a model, DeepWalk, to learn node representations based on sequences sampled from graphs. LINE (tang2015line) encodes first-order and second-order proximity of nodes into a low-dimensional space. Recently, a surge of methods related to graph convolutional networks (GCN) has emerged. (bruna2013spectral) presents a method with a graph-based analogue of convolutional architectures, which is the original version of GCN. Later, a number of improvements, extensions, and approximations of these spectral convolutions were proposed (kipf2016semi; duvenaud2015convolutional; hamilton2017inductive; monti2017geometric). These approaches outperform other methods based on random walks (e.g., DeepWalk and node2vec). Following this success, a number of GCN based methods have been applied in various domains such as recommendation systems (monti2017geometric). However, most GCN based methods require that all nodes in the graph are present in each propagation step of the GNN. Different from GCN, GraphSAGE (hamilton2017inductive) can train a GNN in a minibatch setting. Inspired by this, we design our GNN to learn from the constructed multi-relational item graph for session-based behavior prediction.
3.1. Problem Definition
For a session in the session set , let denote the target behavior sequence and represent the auxiliary behavior sequence. Moreover, we construct a Multi-Relational Item Graph based on all behavior sequences from all sessions, where is the set of nodes in the graph containing all available items and is the edge set involving multiple types of directed edges. Each edge is a triple consisting of the head item, the tail item, and the type of this edge. For instance, if we construct the graph based on the behaviors of sharing and clicking, then an edge means that a user shared item and subsequently shared item , and an edge means that a user clicked item after clicking item . Given the above notations, we formulate the problem as follows:
Problem 1 (session-based target behavior prediction).
Given a session and its target and auxiliary behavior sequences and , along with MRIG , the goal of this problem is to learn a model that can recommend the items that the user of the session is most likely to interact with next.
The overall architecture of the proposed MGNN-SPred is depicted in Figure 1. The input to MGNN-SPred contains a Multi-Relational Item Graph (MRIG) and the two types of behavior sequences. MGNN-SPred first learns item correlations from MRIG by graph neural networks and encodes them into item representations. Afterwards, a user’s two behavior sequences are regarded as two sub-graphs in the MRIG, where the items in each sub-graph are connected to a virtual node (“T” or “A” in Figure 1), respectively. Subsequently, MGNN-SPred aggregates the nodes of each sub-graph into the corresponding virtual node, thus obtaining the representation of each behavior sequence. Finally, to fuse the two behavior representations and obtain the user preference representation, a gating mechanism is adopted to adaptively decide the importance of the different behaviors and perform a weighted summation over them. For the purpose of recommendation, MGNN-SPred calculates each item’s score from the user and item representations via a bi-linear product and uses the scores to rank the items.
3.3. Graph Construction
There are abundant relationships between items hidden in users’ historical behaviors. If a user buys item , and subsequently buys item in the same session, it indicates that item and item probably have some dependency, but it does not strongly reflect similarity, since a user is unlikely to buy two very similar items within a short duration. In comparison, if a user clicks item , and subsequently clicks item , it indicates that item and item are probably highly similar. This is intuitive because a user usually browses a number of similar items and picks the most suitable one to buy.
We construct the multi-relational item graph by taking all items as nodes and letting each type of behavior correspond to one type of directed edge, so that different edge types denote different relationships between items. The process of constructing MRIG is shown in Algorithm 1. Both the target and auxiliary behavior sequences from all sessions and () are provided as input. The algorithm traverses all behavior sequences, collects all items in the sequences as nodes of the graph, and constructs edges between every two consecutive items in the same sequence, with their behavior type as the edge type. After constructing the graph with target and auxiliary behaviors, there are two types of directed edges in the graph.
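The construction described above can be sketched in a few lines of Python. This is a minimal illustration and not a reproduction of Algorithm 1: the function name `build_mrig`, the edge-type labels `"target"` and `"auxiliary"`, and the choice to collapse repeated transitions into a single edge are all assumptions made for the example.

```python
def build_mrig(target_seqs, aux_seqs):
    """Build a multi-relational item graph from behavior sequences.

    Returns (nodes, edges), where edges is a set of
    (head_item, tail_item, edge_type) triples.
    Assumption: repeated transitions are stored once (no edge weights).
    """
    nodes, edges = set(), set()
    for edge_type, sequences in (("target", target_seqs),
                                 ("auxiliary", aux_seqs)):
        for seq in sequences:
            # Every item that appears in any sequence becomes a node.
            nodes.update(seq)
            # Connect each pair of consecutive items in the same sequence.
            for head, tail in zip(seq, seq[1:]):
                edges.add((head, tail, edge_type))
    return nodes, edges


nodes, edges = build_mrig(
    target_seqs=[["a", "b"], ["b", "c"]],
    aux_seqs=[["a", "d", "b"], ["c", "a"]],
)
```

Storing edges as typed triples keeps the two relations in one graph, so the same pair of items may be linked by both a target edge and an auxiliary edge.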
3.4. Item Representation Learning
For each node , we use to denote its one-hot representation. Before we feed the one-hot representations of nodes into the GNN, we first convert each of them into a low-dimensional dense vector by a learnable embedding matrix : .
After collecting the vectors , we feed them, together with MRIG, into the GNN to generate global representations of nodes . The representations are expected to encode multiple item-to-item relations. We take node as an example for illustration. First of all, we collect the neighbors of the node . Each node in the graph has four types of neighboring node sets. According to the edge type and direction, we name the four sets “target-forward”, “target-backward”, “auxiliary-forward”, and “auxiliary-backward”. Taking the “target” type as an example, we obtain the neighbor groups corresponding to the forward and backward directions as below:
For the “auxiliary” type, its neighbor groups, i.e., and , are acquired in the same way.
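As a rough illustration of the four neighbor groups, the sketch below collects them from a set of typed edge triples. The function name `neighbor_groups` is hypothetical, and the direction convention is an assumption: here “forward” means nodes reached by outgoing edges of the center node and “backward” means nodes with edges pointing to it, which may be the reverse of the convention used in the original definitions.

```python
def neighbor_groups(edges, node):
    """Collect the four typed, directed neighbor sets of `node`.

    `edges` is an iterable of (head, tail, edge_type) triples with
    edge_type in {"target", "auxiliary"}.
    Assumption: "forward" = tails of outgoing edges,
                "backward" = heads of incoming edges.
    """
    groups = {
        "target-forward": set(),
        "target-backward": set(),
        "auxiliary-forward": set(),
        "auxiliary-backward": set(),
    }
    for head, tail, edge_type in edges:
        if head == node:
            groups[f"{edge_type}-forward"].add(tail)
        if tail == node:
            groups[f"{edge_type}-backward"].add(head)
    return groups


example_edges = {
    ("a", "b", "target"),
    ("b", "c", "target"),
    ("d", "b", "auxiliary"),
}
groups_b = neighbor_groups(example_edges, "b")
```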
At each step of representation propagation in the GNN, we first aggregate each group of neighbors by mean-pooling. The representation of a group is defined as follows:
The representations of the three remaining groups are calculated in a similar fashion. Consequently, to jointly consider the different relations between items, we combine these four representations of different neighbor groups by sum-pooling:
Finally, we update the representation of the center node by:
After performing iterations, we take the node representation of the last step as the representation of the corresponding item: . In practice, we implement the GNN in a minibatch setting which is inspired by (hamilton2017inductive) to ensure scalability.
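The propagation procedure of this subsection (mean-pooling within each neighbor group, sum-pooling across the four groups, then updating the center node) can be sketched as follows. The update rule is an assumption: the sketch adds the combined neighbor representation to the current node vector, since the exact formula is not recoverable from the text here. The names `mean_pool` and `propagate`, and the full-graph (non-minibatch) loop, are likewise illustrative only.

```python
def mean_pool(vectors, dim):
    """Average a list of vectors; an empty group yields the zero vector."""
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def propagate(h, groups_of, dim, steps):
    """Run `steps` rounds of propagation over all nodes.

    h:         dict node -> vector (list of floats)
    groups_of: dict node -> list of four neighbor-id lists
    """
    for _ in range(steps):
        new_h = {}
        for v, h_v in h.items():
            # Mean-pooling inside each neighbor group.
            pooled = [mean_pool([h[u] for u in group], dim)
                      for group in groups_of[v]]
            # Sum-pooling across the groups.
            combined = ([sum(col) for col in zip(*pooled)]
                        if pooled else [0.0] * dim)
            # Assumed update: add neighbor information to the node vector.
            new_h[v] = [x + y for x, y in zip(h_v, combined)]
        h = new_h  # synchronous update: all nodes read the previous step
    return h


h0 = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [2.0, 2.0]}
groups_of = {
    "a": [["b", "c"], [], [], []],
    "b": [[], ["a"], [], []],
    "c": [[], ["a"], [], []],
}
h1 = propagate(h0, groups_of, dim=2, steps=1)
```

In this toy graph, node "a" has two neighbors in one group, so its pooled neighbor vector is the average of the vectors of "b" and "c".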
3.5. Sequence Representation Learning
We have tried different ways to compute the representation of the virtual node for the target and auxiliary behavior sequences, including using an attention mechanism to assign different importance weights to the nodes and performing several rounds of sub-graph propagation. Empirically, we have found that simple mean-pooling already achieves comparable performance while retaining low complexity. We denote the summarized representations of the target behavior sequence and the auxiliary behavior sequence as and , respectively, which are given as:
We argue that the two types of behavior sequence representations might contribute differently when building an integrated representation. This is because the auxiliary behavior is not exactly the same as the target behavior to be predicted, and different users might place different emphasis on different behaviors. For instance, some users might browse item pages frequently and click various items arbitrarily, while other users might only click the items they want to buy. The contribution of the auxiliary behavior sequence to next item prediction clearly differs between these situations. We define the following gating mechanism to calculate the relative importance weight :
where denotes the concatenation of the two representations, is the sigmoid function, and is a trainable parameter of our model. Finally, we obtain the user preference representation for the current session by the weighted summation of and :
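A minimal sketch of the mean-pooling and gated fusion described in this subsection is given below. It assumes that the gate is a single scalar obtained by applying a sigmoid to a linear map of the concatenated sequence representations, and that the fused vector is the convex combination of the two. The names `p`, `q`, `w_g`, and `gated_fusion` are hypothetical and do not come from the original notation.

```python
import math


def mean_pool(vectors):
    """Average the item vectors of one behavior sequence."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def gated_fusion(p, q, w_g):
    """Fuse target (p) and auxiliary (q) sequence vectors with a scalar gate.

    w_g is a weight vector of length 2 * len(p) (assumed shape).
    """
    concat = p + q  # concatenation of the two representations
    logit = sum(w * x for w, x in zip(w_g, concat))
    alpha = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    fused = [alpha * a + (1.0 - alpha) * b for a, b in zip(p, q)]
    return alpha, fused


p = mean_pool([[1.0, 0.0], [3.0, 2.0]])   # target sequence items
q = mean_pool([[0.0, 4.0]])               # auxiliary sequence items
alpha, o = gated_fusion(p, q, w_g=[0.0, 0.0, 0.0, 0.0])
```

With all-zero gate weights the gate equals 0.5, so the fused vector is the plain average of the two sequence representations. A trained gate would shift this weight per session.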
3.6. Model Prediction and Training
We further calculate the recommendation score of each item using the item embedding . A bi-linear matching scheme is employed:
where is a trainable parameter matrix of our model.
To learn the parameters of our model, we apply a softmax function to normalize the scores over all items to get the probability distribution:
Backpropagation is adopted to optimize the model by minimizing the cross-entropy loss of the predicted probability distribution w.r.t. the ground truth. The loss function is defined as follows:
where denotes the one-hot representation of the ground truth.
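The scoring and training objective of this subsection can be illustrated with a small pure-Python sketch. It assumes the bi-linear score takes the form of the user vector multiplied by a parameter matrix and then by the item vector. The names `o`, `W`, and `item_embs` are placeholders rather than the original symbols, and the operand order is an assumption.

```python
import math


def bilinear_scores(o, W, item_embs):
    """Score every item as o^T W e, where e is the item embedding."""
    oW = [sum(o[i] * W[i][j] for i in range(len(o)))
          for j in range(len(W[0]))]
    return [sum(a * b for a, b in zip(oW, e)) for e in item_embs]


def softmax(scores):
    """Normalize scores into a probability distribution (stable form)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def cross_entropy(probs, target_index):
    """Cross-entropy against a one-hot ground truth at target_index."""
    return -math.log(probs[target_index])


o = [1.0, 0.0]
W = [[1.0, 0.0], [0.0, 1.0]]             # identity, for illustration only
item_embs = [[1.0, 0.0], [0.0, 1.0]]
scores = bilinear_scores(o, W, item_embs)
probs = softmax(scores)
loss = cross_entropy(probs, target_index=0)
```

Because the ground truth is one-hot, the cross-entropy reduces to the negative log-probability assigned to the true next item.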
Table 1. Basic statistics of the two datasets.
|Statistic|WeChat|Yoochoose|
|#edge of target|217,774|225,879|
|#edge of auxiliary|1,546,220|3,277,411|
|Average length of target|9.76|3.31|
|Average length of auxiliary|33.49|8.56|
4.1. Experimental Setup
We evaluate our model on two real-world datasets named WeChat and Yoochoose. The Yoochoose dataset is obtained from the RecSys Challenge 2015. The user behavior sequences in the dataset are already segmented into sessions and all the users are anonymized. The WeChat dataset is collected from Top Stories (看一看) of WeChat, where videos are regarded as items. We randomly select one hundred thousand active users and collect their behavior records over a duration of one week. Since the duration is relatively short, we retain the entire behavior sequence of each user and treat it as a single session. We treat purchasing in Yoochoose and sharing in WeChat as the target behavior, and regard clicking in both datasets as the auxiliary behavior.
Given a session with the target behavior sequence and the auxiliary behavior sequence , we construct training examples in a way similar to (li2017neural; sr_gnn). That is, we treat each item , () as the label and use as the input of the target behavior. The treatment of the auxiliary behavior is slightly different, because a user is very likely to click an item before buying or sharing it. To prevent the auxiliary input from already containing the label, we keep only the items clicked before the target item that is also bought or shared by the user. We set a maximum length for both types of sequences and, for sequences longer than the maximum length, keep only the last items. Considering that the two datasets have different average sequence lengths (see Table 1), we set the maximum length to 10 for WeChat and 3 for Yoochoose. We discuss the impact of different maximum lengths in Section 4.4.3.
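One possible reading of this example construction is sketched below. Several details are assumptions made for illustration: labels start from the second target item, the auxiliary sequence is cut just before the first click on the label item, and the whole click sequence is kept when the label item was never clicked. The function name `make_examples` is hypothetical.

```python
def make_examples(target_seq, aux_seq, max_len):
    """Turn one session into (target_input, aux_input, label) examples.

    Assumptions: labels start at the second target item; the click
    sequence is cut before the first click on the label item; if the
    label was never clicked, the full click sequence is used.
    """
    examples = []
    for i in range(1, len(target_seq)):
        label = target_seq[i]
        # Target input: the prefix before the label, truncated to max_len.
        target_input = target_seq[:i][-max_len:]
        # Auxiliary input: clicks before the label item, so the input
        # never contains the item to be predicted.
        cut = aux_seq.index(label) if label in aux_seq else len(aux_seq)
        aux_input = aux_seq[:cut][-max_len:]
        examples.append((target_input, aux_input, label))
    return examples


examples = make_examples(
    target_seq=["a", "b", "c"],
    aux_seq=["a", "x", "b", "y", "c", "z"],
    max_len=2,
)
```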
We split the datasets in chronological order for evaluation, consistent with real situations. We take the first 6/7 of each dataset as the training data, and use 1/3 of the remaining data as the validation data to determine the optimal hyper-parameter settings. The MRIG used throughout the experiments is constructed based only on the training data. The basic statistics of the two datasets are summarized in Table 1.
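The split proportions can be made concrete with a short sketch. It assumes that sessions carry a timestamp, that the validation part is the earliest third of the held-out data, and that the rest of the held-out data serves as the test set. None of these details are stated explicitly above, and the name `chronological_split` is hypothetical.

```python
from fractions import Fraction


def chronological_split(sessions,
                        train_frac=Fraction(6, 7),
                        valid_frac_of_rest=Fraction(1, 3)):
    """Split (timestamp, session_id) pairs by time.

    Exact fractions avoid floating-point rounding in the split sizes.
    Assumption: whatever is not used for training or validation
    is treated as test data.
    """
    ordered = sorted(sessions, key=lambda s: s[0])
    n_train = int(len(ordered) * train_frac)
    rest = ordered[n_train:]
    n_valid = int(len(rest) * valid_frac_of_rest)
    return ordered[:n_train], rest[:n_valid], rest[n_valid:]


sessions = [(t, f"s{t}") for t in reversed(range(21))]
train, valid, test = chronological_split(sessions)
```

Under these assumptions, 21 sessions yield 18 for training, 1 for validation, and 2 for testing, and every training session precedes every held-out session in time.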
We compare the proposed model with several strong competitors, including state-of-the-art graph neural network based models for session-based recommendation.
POP. It just recommends the top-n frequent items in the training set regardless of behaviors in current sessions.
Item-KNN (sarwar2001item). It recommends the items most similar to the previously interacted items in the same session.
GRU4Rec (hidasi2015session). GRU4Rec is the pioneering RNN-based deep sequential model for session-based recommendation.
NARM (li2017neural). It employs attention mechanism to capture different importance of each item according to their hidden states obtained by RNN. A weighted integration of different item representations is performed to obtain final representation.
STAMP (liu2018stamp). This model learns users’ general interest from the long-term memory of session context and current interest from the short-term memory of their last behaviors.
SR-GNN (sr_gnn) and GC-SAN (gc_san). Both graph-based models use only the current session to construct a graph, on which a GNN is applied to learn item representations. The difference is that SR-GNN represents each session by a traditional attention network, while GC-SAN is based on a multi-layered self-attention mechanism.
R-DAN. Reasoning-DAN (R-DAN) (nam2017dual) is used to model both behavior sequences simultaneously.
CoAtt. Co-Attention (CoAtt) (lu2016hierarchical) with alternative calculation for interactive attention is adopted for comparison.
HetGNN. Heterogeneous graph neural network (ZhangSHSC19) is applied for recommendation, with two edge types and one node type.
It is worth noting that the above baselines originally developed for session-based recommendation, i.e., GRU4Rec, NARM, STAMP, SR-GNN, and GC-SAN, consider only the target behavior. To make the comparison fairer, we revise these methods in the following manner. We use their original forms to model the target behavior sequence and the auxiliary behavior sequence separately, and afterwards utilize the proposed gating mechanism to fuse the two types of representations, as in our model. In addition, we also compare our model with the baselines in the situation of considering only the target behavior (see Table 3 for details).
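Table 3. Performance when only the target behavior is considered (“w/o a”); one group of three metric columns per dataset.
|Method|H@100|M@100|N@100|H@100|M@100|N@100|
|GRU4Rec (w/o a)|16.889|1.2346|3.9128|14.817|1.6032|4.0012|
|NARM (w/o a)|17.773|1.3123|4.1298|14.443|1.5540|3.8900|
|SR-GNN (w/o a)|18.093|1.2621|4.1368|15.302|1.5782|4.0852|
|Ours (w/o a)|19.252|1.3933|4.4473|21.089|2.3798|5.8221|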
4.1.3. Implementation Details
We implement our proposed model based on TensorFlow. The dimension of the item embedding is set to 64. Adam with the default parameter setting is adopted to optimize the model, with a mini-batch size of 64. The GNN runs in a minibatch setting and the depth is set to 2. We terminate the learning process with an early stopping strategy. For the baselines based on attention mechanisms, we test different forms of attention computation and report their best results. The hyper-parameters of the baselines are tuned on the validation data as well.
4.2. Model Comparison
We consider the top-100 ranked predictions as recommended items. Following (sr_gnn; gc_san), we adopt HR@100 (H@100), MRR@100 (M@100), and NDCG@100 (N@100) to evaluate the recommendation performance of all models after obtaining their recommendation lists. Table 2 shows the performance comparison between our model and the adopted baselines. (1) The first part of the table corresponds to the simple baselines. We observe that their results are significantly worse than those of the other methods. (2) The second part involves standard sequential methods for session-based recommendation. We observe that their results stay at the same level, except for STAMP on WeChat. This shows that: 1) treating session-based recommendation as a sequential modeling task can improve performance; 2) although NARM and STAMP are more advanced approaches that use attention mechanisms to combine the hidden representations of different time steps, they do not show advantages on the sparse behavior prediction problem we study (which differs from previous studies focusing on click prediction). (3) The third part contains the GNN based models. SR-GNN and GC-SAN seem to be better than the sequential methods, and HetGNN further boosts the performance. (4) The second-to-last part involves approaches for learning from two sequences that come from other research domains. Their best results are worse than the best performance of the above recommendation methods, which suggests that considering the interaction of items across the two sequences might have no benefit for the studied problem. Finally, we can see that our method outperforms all the other methods, demonstrating the superiority of our model for session-based recommendation.
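For reference, the three metrics can be computed per test example as in the following sketch, which uses the standard definitions for the case of a single ground-truth item. The function name `metrics_at_k` is hypothetical. Reported numbers would be averages of these per-example values over the test set.

```python
import math


def metrics_at_k(ranked_items, truth, k):
    """Return (HR@k, MRR@k, NDCG@k) for one example with one true item."""
    top_k = ranked_items[:k]
    if truth not in top_k:
        return 0.0, 0.0, 0.0
    rank = top_k.index(truth) + 1           # 1-based rank of the true item
    hit = 1.0                               # the item is in the top-k list
    mrr = 1.0 / rank                        # reciprocal rank
    ndcg = 1.0 / math.log2(rank + 1)        # ideal DCG is 1 for one item
    return hit, mrr, ndcg


hit, mrr, ndcg = metrics_at_k(["x", "y", "z"], truth="y", k=3)
```

HR only records whether the true item appears in the list, while MRR and NDCG also reward placing it closer to the top, with NDCG decaying more slowly in the rank.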
4.3. Impact of Auxiliary Behavior Sequence
We choose several representative methods in Table 3 to test whether considering the auxiliary behavior sequence indeed boosts the performance of session-based recommendation. The methods marked with “(w/o a)” are the full versions with the auxiliary behavior sequence removed. Firstly, we observe that our proposed model still consistently achieves better performance in this situation. Moreover, by comparing each method in Table 2 with its “(w/o a)” version, we find that every full method beats its “(w/o a)” counterpart by a significant margin. These results demonstrate that considering the auxiliary behavior sequence is indeed meaningful.
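Table 4. Ablation results; one group of three metric columns per dataset.
|Method|H@100|M@100|N@100|H@100|M@100|N@100|
|Ours (w/o ae)|20.923|1.4665|4.7945|25.463|2.7678|6.8907|
|Ours (w/o asg)|19.742|1.3949|4.5167|22.517|2.6025|6.2631|
|Ours (w/o g)|20.363|1.3707|4.6154|27.577|3.3531|7.7896|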
4.4. Model Analysis
4.4.1. Ablation Study
We conduct ablation studies of our model, using “w/o ae” to denote removing the edges related to the auxiliary behavior, “w/o asg” to denote not modeling the sub-graph of the auxiliary behavior sequence when computing the user preference representation, and “w/o g” to denote merging the two representations of the target and auxiliary behavior sequences by simple summation instead of the gating mechanism. Table 4 shows the corresponding results. The “w/o ae” results show that incorporating the auxiliary edges into the built graph is beneficial for the problem. The “w/o asg” results show that integrating the auxiliary behavior sequence with the target behavior sequence makes a notable contribution. Besides, the “w/o g” results show that the performance becomes worse if we do not use the gating mechanism to merge the two representations of the target and auxiliary behavior sequences. From the above comparison, we conclude that the main components of our model are effective.
4.4.2. Impact of Depth of GNN
We test different depth settings (from 0 to 3) for graph representation propagation. A depth of 0 means that our model does not use the GNN and cannot learn any information from MRIG. Figure 2(b) shows the corresponding results. We can see that the performance at depth 0 is much worse than the results with depths from 1 to 3. This comparison clarifies the significance of considering MRIG for our model. Moreover, the performance becomes significantly better when the depth grows from 1 to 2, showing that modeling high-order relations between items through the GNN is indispensable. When the number of graph representation propagation steps is larger than 3, the representations of nodes might become less distinguishable, which is not ideal for further improving the performance.
4.4.3. Impact of Sequence Length
We visualize how the performance varies with the maximum behavior sequence length in Figure 3(b), where we set in the range from 1 to 20. As expected, as the maximum sequence length increases from small values, the performance of both our model and SR-GNN improves. After reaching their peaks, the results become slightly worse, and finally the trends become stable. Overall, our model outperforms SR-GNN consistently. Besides, we find that the lengths with the best performance are not the same in the two datasets. This is because the average sequence length of Yoochoose is much smaller than that of WeChat, as shown in Table 1.
In this paper, we study session-based target behavior prediction. Two limitations of existing relevant models are addressed: using only the target behavior for next item prediction and lacking a principled way to encode global item-to-item relations. To alleviate these issues, MGNN-SPred is proposed, whose major novelties are the building and modeling of the multi-relational item graph. In addition, a gating mechanism is adopted to adaptively fuse target behavior sequences and auxiliary behavior sequences into the user preference representation for next item prediction. Comprehensive experiments on two real-world datasets demonstrate that MGNN-SPred achieves the best performance and that its design is rational.