With the rapid growth of Internet services and mobile devices, personalized recommender systems play an increasingly important role in modern society. They can reduce information overload and help satisfy diverse service demands. Such systems bring significant benefits to at least two parties. They can: (i) help users easily discover products from millions of candidates, and (ii) create opportunities for product providers to increase revenue.
On the Internet, users access online products or items in a chronological order. The items a user will interact with in the future may depend strongly on the items he/she has accessed in the past. This property facilitates a practical application scenario—sequential recommendation. In the sequential recommendation task, in addition to the general user interest captured by all general recommendation models, we argue that there are three extra important factors to model: user short-term interests, user long-term interests, and item co-occurrence patterns. The user short-term interest describes the user preference given several recently accessed items in a short-term period. The user long-term interest captures the long-range dependency between earlier accessed items and the items users will access in the future. The item co-occurrence pattern illustrates the joint occurrences of commonly related items, such as a mobile phone and a screen protector.
Although many existing methods have proposed effective models, we argue that they do not fully capture the aforementioned factors. First, methods like Caser [DBLP:conf/wsdm/TangW18], MARank [DBLP:conf/aaai/YuZLZ19], and Fossil [DBLP:conf/icdm/HeM16] only model the short-term user interest and ignore the long-term dependencies of items in the item sequence. The importance of capturing the long-range dependency has been confirmed by [DBLP:conf/kdd/BellettiCC19]. Second, methods like SASRec [DBLP:conf/icdm/KangM18] do not explicitly model the user short-term interest. Neglecting the user short-term interest prevents the recommender system from understanding the time-varying user intention over a short-term period. Third, methods like GC-SAN [DBLP:conf/ijcai/XuZLSXZFZ19] and GRU4Rec+ [DBLP:conf/cikm/HidasiK18] do not explicitly capture the item co-occurrence patterns in the item sequences. Closely related item pairs often appear one after the other and a recommender system should take this into account.
To incorporate the factors mentioned above, we propose a memory augmented graph neural network (MA-GNN) to tackle the sequential recommendation task. MA-GNN consists of a general interest module, a short-term interest module, a long-term interest module, and an item co-occurrence module. In the general interest module, we adopt a matrix factorization term to model the general user interest without considering the item sequential dynamics. In the short-term interest module, we aggregate the neighbors of items using a GNN to form the user intentions over a short period. These can capture the local contextual information and structure [DBLP:journals/corr/abs-1806-01261] within this short-term period. To model the long-term interest of users, we use a shared key-value memory network to generate the interest representations based on users’ long-term item sequences. By doing this, other users with similar preferences will be taken into consideration when recommending an item. To combine the short-term and long-term interest, we introduce a gating mechanism in the GNN framework, which is similar to the long short-term memory (LSTM) [DBLP:journals/neco/HochreiterS97]. This gate controls how much the long-term or the short-term interest representation can contribute to the combined representation. In the item co-occurrence module, we apply a bilinear function to capture the closely related items that appear one after the other in the item sequence. We extensively evaluate our model on five real-world datasets, comparing it with many state-of-the-art methods using a variety of performance validation metrics. The experimental results not only demonstrate the improvements of our model over other baselines but also show the effectiveness of the proposed components.
To summarize, the major contributions of this paper are:
To model the short-term and long-term interests of users, we propose a memory augmented graph neural network to capture items’ short-term contextual information and long-range dependencies.
To effectively fuse the short-term and long-term interests, we incorporate a gating mechanism within the GNN framework to adaptively combine these two kinds of hidden representations.
To explicitly model the item co-occurrence patterns, we use a bilinear function to capture the feature correlations between items.
Experiments on five real-world datasets show that the proposed MA-GNN model significantly outperforms the state-of-the-art methods for sequential recommendation.
Early recommendation studies largely focused on explicit feedback [DBLP:conf/kdd/Koren08]. The recent research focus is shifting towards implicit data [DBLP:conf/www/TranLLK19, DBLP:conf/kdd/LiS17]. Collaborative filtering (CF) with implicit feedback is usually treated as a Top-K item recommendation task, where the goal is to recommend to each user a list of items that the user may be interested in. It is more practical and challenging [DBLP:conf/icdm/PanZCLLSY08], and accords more closely with the real-world recommendation scenario. Early works mostly rely on matrix factorization techniques [DBLP:conf/icdm/HuKV08, DBLP:conf/uai/RendleFGS09] to learn latent features of users and items. Due to their ability to learn salient representations, (deep) neural network-based methods [DBLP:conf/www/HeLZNHC17] are also adopted. Autoencoder-based methods [DBLP:conf/cikm/MaZWL18, DBLP:conf/wsdm/MaKWWL19] have also been proposed for Top-K recommendation. In [DBLP:conf/kdd/LianZZCXS18, DBLP:conf/ijcai/XueDZHC17], deep learning techniques are used to boost the traditional matrix factorization and factorization machine methods.
The sequential recommendation task takes as input the chronological item sequence. A Markov chain [DBLP:conf/ijcai/ChengYLK13] is a classical option for modelling the data. For example, FPMC [DBLP:conf/www/RendleFS10] factorizes personalized Markov chains in order to capture long-term preferences and short-term transitions. Fossil [DBLP:conf/icdm/HeM16] combines similarity-based models with high-order Markov chains. TransRec [DBLP:conf/recsys/HeKM17] proposes a translation-based method for sequential recommendation. Recently, inspired by the advantages of sequence learning in natural language processing, researchers have proposed (deep) neural network based methods to learn the sequential dynamics. For instance, Caser [DBLP:conf/wsdm/TangW18] captures sequential patterns via convolution operations, and recurrent neural network based methods [DBLP:conf/cikm/HidasiK18, DBLP:journals/corr/HidasiKBT15, DBLP:conf/cikm/LiRCRLM17] have been used to model the sequential patterns for the task of session-based recommendation. Self-attention [DBLP:conf/nips/VaswaniSPUJGKP17] exhibits promising performance in sequence learning and is starting to be used in sequential recommendation. SASRec [DBLP:conf/icdm/KangM18] leverages self-attention to adaptively take into account the interactions between items. Memory networks [DBLP:conf/wsdm/ChenXZT0QZ18, DBLP:conf/sigir/HuangZDWC18] are also adopted to memorize the items that will play a role in predicting future user actions.
However, our proposed model is different from previous models. We apply a graph neural network with external memories to capture the short-term item contextual information and long-term item dependencies. In addition, we also incorporate an item co-occurrence module to model the relationships between closely related items.
The recommendation task considered in this paper takes sequential implicit feedback as training data. The user preference is represented by a user-item sequence in chronological order, $\mathcal{S}^u = (I_1, I_2, \ldots, I_{|\mathcal{S}^u|})$, where $I_*$ are item indexes that user $u$ has interacted with. Given the earlier subsequence $\mathcal{S}^u_{1:t}$ ($t < |\mathcal{S}^u|$) of $M$ users, the problem is to recommend a list of $K$ items from a total of $N$ items ($K < N$) to each user and evaluate whether the items in $\mathcal{S}^u_{t+1:|\mathcal{S}^u|}$ appear in the recommended list.
In this section, we introduce the proposed model, MA-GNN, which applies a memory augmented graph neural network for the sequential recommendation task. We introduce four factors that have an impact on the user preference and intention learning. Then we introduce the prediction and training procedure of the proposed model.
General Interest Modeling
The general or static interest of a user captures the inherent preferences of the user and is assumed to be stable over time. To capture the general user interest, we employ a matrix factorization term without considering the sequential dynamics of items. This term takes the form
$$\mathbf{p}_u^\top \cdot \mathbf{q}_j,$$
where $\mathbf{p}_u \in \mathbb{R}^d$ is the embedding of user $u$, $\mathbf{q}_j \in \mathbb{R}^d$ is the output embedding of item $j$, and $d$ is the dimension of the latent space.
Short-term Interest Modeling
A user’s short-term interest describes the user’s current preference and is based on several recently accessed items in a short-term period. The items a user will interact with in the near future are likely to be closely related to the items she just accessed, and this property of user behaviors has been confirmed in many previous works [DBLP:conf/wsdm/TangW18, DBLP:conf/cikm/HidasiK18, DBLP:conf/icdm/HeM16]. Therefore, it is very important in sequential recommendation to effectively model the user’s short-term interest, as reflected by recently accessed items.
To explicitly model the user short-term interest, we adopt a sliding window strategy to split the item sequence into fine-grained sub-sequences. We can then focus on the recent sub-sequence to predict which items will appear next and ignore the irrelevant items that have less impact. For each user $u$, we extract every $|L|$ successive items as input and their next $|T|$ items as the targets to be predicted, where $L_{u,l} = (I_l, I_{l+1}, \ldots, I_{l+|L|-1})$ is the $l$-th sub-sequence of user $u$. Then the problem can be formulated as: in the user-item interaction sequence $\mathcal{S}^u$, given a sequence of $|L|$ successive items, how likely is it that the predicted items accord with the target $|T|$ items for that user. Due to their ability to perform neighborhood information aggregation and local structure learning [DBLP:journals/corr/abs-1806-01261], graph neural networks (GNNs) are a good match for the task of aggregating the items in $L_{u,l}$ to learn user short-term interests.
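The sliding window strategy can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `sliding_windows` and the parameters `input_len` and `target_len` (the number of input items and target items per window) are hypothetical names chosen for this example.

```python
from typing import List, Tuple


def sliding_windows(seq: List[int], input_len: int, target_len: int) -> List[Tuple[List[int], List[int]]]:
    """Split one user's chronological item sequence into (input, target) pairs."""
    pairs = []
    # The last valid start index still leaves room for the inputs and the targets.
    for start in range(len(seq) - input_len - target_len + 1):
        inputs = seq[start:start + input_len]
        targets = seq[start + input_len:start + input_len + target_len]
        pairs.append((inputs, targets))
    return pairs


windows = sliding_windows([10, 11, 12, 13, 14, 15], input_len=3, target_len=2)
```

A sequence that is shorter than `input_len + target_len` simply yields no windows.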
Item Graph Construction. Since item sequences are not inherently graphs for GNN training, we need to build a graph to capture the connections between items. For each item in item sequences, we extract several subsequent items (three items in our experiments) and add edges between them. We perform this for each user and count the number of edges of extracted item pairs across all users. Then we row-normalize the adjacency matrix. As such, relevant items that appear closer to one another in the sequence can be extracted. An example of how to extract item neighbors and build the adjacency matrix is shown in Figure 2. We denote the extracted adjacency matrix as $A$, where $A_{i,k}$ denotes the normalized node weight of item $k$ regarding item $i$. The neighboring items of item $i$ are denoted as $\mathcal{N}_i$.
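The graph construction can be illustrated with a sparse, dictionary-based sketch. The function name `build_item_graph` and the parameter `num_next` are hypothetical. The text does not state whether edges are directed, so this sketch assumes undirected edges and skips self-loops; both choices are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def build_item_graph(sequences: List[List[int]], num_next: int = 3) -> Dict[int, Dict[int, float]]:
    """Count co-occurrence edges across all users, then row-normalize."""
    counts: Dict[int, Dict[int, float]] = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for pos, item in enumerate(seq):
            # Link each item to the `num_next` items that follow it.
            for neighbor in seq[pos + 1:pos + 1 + num_next]:
                if neighbor == item:
                    continue  # assumption: skip self-loops caused by repeated items
                counts[item][neighbor] += 1.0
                counts[neighbor][item] += 1.0  # assumption: edges are undirected
    # Row-normalize so that the weights of each item's neighbors sum to one.
    graph: Dict[int, Dict[int, float]] = {}
    for item, row in counts.items():
        total = sum(row.values())
        graph[item] = {neighbor: weight / total for neighbor, weight in row.items()}
    return graph


graph = build_item_graph([[1, 2, 3], [1, 2]], num_next=1)
```

In this toy example, item 2 co-occurs twice with item 1 and once with item 3, so its normalized weights are 2/3 and 1/3.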
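Short-term Interest Aggregation. To capture the user short-term interest, we use a two-layer GNN to aggregate the neighboring items in $L_{u,l}$ for learning the user short-term interest representation. Formally, for an item $j$ in the $l$-th short-term window $L_{u,l}$, its input embedding is represented as $\mathbf{e}_j \in \mathbb{R}^d$. The user short-term interest is then:
$$\mathbf{h}_i = \tanh\Big(\mathbf{W}^{(1)} \cdot \Big[\sum_{k \in \mathcal{N}_i} \mathbf{e}_k A_{i,k};\ \mathbf{e}_i\Big]\Big), \quad \forall i \in L_{u,l}, \qquad (1)$$
$$\mathbf{p}^S_{u,l} = \tanh\Big(\mathbf{W}^{(2)} \cdot \Big[\frac{1}{|L|}\sum_{i \in L_{u,l}} \mathbf{h}_i;\ \mathbf{p}_u\Big]\Big), \qquad (2)$$
where $[\cdot\,;\cdot] \in \mathbb{R}^{2d}$ denotes vertical concatenation, $\mathbf{W}^{(1)}, \mathbf{W}^{(2)} \in \mathbb{R}^{d \times 2d}$ are the learnable parameters in the graph neural network, and the superscript $S$ denotes that the representation is from the user short-term interest. By aggregating neighbors of items in $L_{u,l}$, $\mathbf{p}^S_{u,l}$ represents a union-level summary [DBLP:conf/wsdm/TangW18, DBLP:conf/aaai/YuZLZ19] indicating which items are closely relevant to the items in $L_{u,l}$. Based on the summarized user short-term interest, the items that a user will access next can be inferred.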
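To make the two-layer aggregation concrete, the following NumPy sketch computes a short-term interest vector for one window. It is an illustration under stated assumptions rather than the authors' code: the tanh nonlinearity, the mean pooling over window items, and all names (`short_term_interest`, `E`, `A`, `W1`, `W2`) are choices made for this example.

```python
import numpy as np


def short_term_interest(E, A, window, p_u, W1, W2):
    """Two-layer aggregation over one short-term window.

    E: (n_items, d) input item embeddings; A: (n_items, n_items) row-normalized
    adjacency; window: item indices; p_u: (d,) user embedding; W1, W2: (d, 2d).
    """
    hidden = []
    for i in window:
        neighbor_sum = A[i] @ E  # weighted sum of neighbor embeddings, shape (d,)
        hidden.append(np.tanh(W1 @ np.concatenate([neighbor_sum, E[i]])))
    pooled = np.mean(hidden, axis=0)  # assumption: average over the items in the window
    return np.tanh(W2 @ np.concatenate([pooled, p_u]))


rng = np.random.default_rng(0)
n_items, d = 6, 4
E = rng.normal(size=(n_items, d))
A = rng.random((n_items, n_items))
A /= A.sum(axis=1, keepdims=True)  # row-normalize the toy adjacency matrix
p_short = short_term_interest(E, A, [0, 2, 5], rng.normal(size=d),
                              rng.normal(size=(d, 2 * d)), rng.normal(size=(d, 2 * d)))
```

The dense adjacency matrix is used only for brevity; a sparse representation would be the natural choice for large item catalogs.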
However, directly applying the above GNN to make predictions clearly neglects the long-term user interest in the past $H_{u,l} = (I_1, I_2, \ldots, I_{l-1})$. There may be some items outside the short-term window $L_{u,l}$ that can express the user preference or indicate the user state. These items can play an important role in predicting items that will be accessed in the near future. This long-term dependency has been confirmed in many previous works [DBLP:conf/kdd/LiuZMZ18, DBLP:conf/ijcai/XuZLSXZFZ19, DBLP:conf/kdd/BellettiCC19]. Thus, how to model the long-term dependency and balance it with the short-term context is a crucial question in sequential recommendation.
Long-term Interest Modeling
To capture the long-term user interest, we can use external memory units [DBLP:conf/nips/SukhbaatarSWF15, DBLP:conf/www/ZhangSKY17] to store the time-evolving user interests given the user accessed items in $H_{u,l}$. However, maintaining the memory unit for each user has a huge memory overhead to store the parameters. Meanwhile, the memory unit may capture information that is very similar to that represented by the user embedding $\mathbf{p}_u$. Therefore, we propose to use a memory network to store the latent interest representation shared by all users, where each memory unit represents a certain type of latent user interest, such as the user interest regarding different categories of movies. Given the items accessed by a user in the past $H_{u,l}$, we can learn a combination of different types of interest to reflect the user long-term interest (or state) before $L_{u,l}$.
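Instead of performing a summing operation to generate the query as in the original memory network [DBLP:conf/nips/SukhbaatarSWF15], we apply a multi-dimensional attention model to generate the query embedding. This allows informative items that better reflect the user preference to have a greater influence on the positioning of the corresponding external memory units. Formally, we denote the item embeddings in $H_{u,l}$ as $\mathbf{H}_{u,l} \in \mathbb{R}^{d \times |H_{u,l}|}$. The multi-dimensional attention to generate the query embedding is computed as:
$$\mathbf{H}_{u,l} := \mathbf{H}_{u,l} + \mathrm{PE}(H_{u,l}),$$
$$\mathbf{S}_{u,l} = \mathrm{softmax}\Big(\mathbf{W}_a^{(3)} \tanh\big(\mathbf{W}_a^{(1)} \mathbf{H}_{u,l} + (\mathbf{W}_a^{(2)} \mathbf{p}_u) \otimes \mathbf{1}_\phi\big)\Big),$$
$$\mathbf{Z}_{u,l} = \tanh\big(\mathbf{S}_{u,l} \cdot \mathbf{H}_{u,l}^\top\big),$$
$$\mathbf{z}_{u,l} = \mathrm{avg}(\mathbf{Z}_{u,l}), \qquad (3)$$
where $\mathrm{PE}(\cdot)$ is the sinusoidal positional encoding function that maps the item positions into position embeddings, which is the same as the one used in Transformer [DBLP:conf/nips/VaswaniSPUJGKP17]. $\phi$ equals $|H_{u,l}|$, and $\otimes$ denotes the outer product. $\mathbf{W}_a^{(1)}, \mathbf{W}_a^{(2)} \in \mathbb{R}^{d \times d}$ and $\mathbf{W}_a^{(3)} \in \mathbb{R}^{h \times d}$ are the learnable parameters in the attention model, and $h$ is the hyper-parameter to control the number of dimensions in the attention model. $\mathbf{S}_{u,l} \in \mathbb{R}^{h \times |H_{u,l}|}$ is the attention score matrix. $\mathbf{Z}_{u,l} \in \mathbb{R}^{h \times d}$ is the matrix representation of the query, and each of the $h$ rows represents a different aspect of the query. Finally, $\mathbf{z}_{u,l} \in \mathbb{R}^d$ is the combined query embedding that averages the different aspects.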
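The query generation can be sketched with NumPy as follows. This is a hedged illustration, not the authors' implementation: the exact placement of the tanh and softmax operations, the row-per-item layout of `H`, and the names `query_embedding`, `Wa1`, `Wa2`, `Wa3` are assumptions of this example. The positional encoding follows the standard sinusoidal form used in the Transformer.

```python
import numpy as np


def sinusoidal_positions(n, d):
    """Transformer-style positional encodings, one row per position."""
    pos = np.arange(n)[:, None]
    dim = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (dim // 2)) / d)
    return np.where(dim % 2 == 0, np.sin(angles), np.cos(angles))


def softmax(x, axis):
    shifted = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def query_embedding(H, p_u, Wa1, Wa2, Wa3):
    """H: (n, d) long-term item embeddings, one row per item (layout chosen for this sketch)."""
    n, d = H.shape
    H = H + sinusoidal_positions(n, d)
    hidden = np.tanh(H @ Wa1.T + Wa2 @ p_u)  # (n, d); the user term is broadcast to every item
    S = softmax(hidden @ Wa3.T, axis=0).T    # (h, n); each aspect attends over the items
    Z = np.tanh(S @ H)                       # (h, d); one row per aspect of the query
    return Z.mean(axis=0), S                 # average the aspects into a single query


rng = np.random.default_rng(0)
n, d, h = 5, 4, 3
z, S = query_embedding(rng.normal(size=(n, d)), rng.normal(size=d),
                       rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                       rng.normal(size=(h, d)))
```

Each of the `h` rows of `S` is a separate attention distribution over the long-term items, which is what makes the attention multi-dimensional.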
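Given the query embedding $\mathbf{z}_{u,l}$, we use this query to find the appropriate combination of the shared user latent interest in the memory network. Formally, the keys and values of the memory network [DBLP:conf/nips/SukhbaatarSWF15, DBLP:conf/emnlp/MillerFDKBW16] are denoted as $\mathbf{K} \in \mathbb{R}^{d \times m}$ and $\mathbf{V} \in \mathbb{R}^{d \times m}$, respectively, where $m$ is the number of memory units in the memory network. Therefore, the user long-term interest embedding can be modeled as:
$$s_i = \mathrm{softmax}(\mathbf{z}_{u,l}^\top \cdot \mathbf{k}_i),$$
$$\mathbf{o}_{u,l} = \sum_i s_i \mathbf{v}_i,$$
$$\mathbf{p}^H_{u,l} = \mathbf{z}_{u,l} + \mathbf{o}_{u,l}, \qquad (4)$$
where $\mathbf{k}_i, \mathbf{v}_i \in \mathbb{R}^d$ are the $i$-th memory unit and the superscript $H$ denotes that the representation is from the user long-term interest.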
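The memory read can be sketched as a standard key-value attention lookup. The function name `read_memory` is hypothetical, and the residual addition of the query to the memory output is an assumption of this sketch.

```python
import numpy as np


def read_memory(z, K, V):
    """z: (d,) query; K, V: (d, m) keys and values, one column per memory unit."""
    logits = z @ K                           # (m,) similarity of the query to each key
    weights = np.exp(logits - logits.max())  # subtract the max for numerical stability
    weights /= weights.sum()                 # softmax over the m memory units
    o = V @ weights                          # (d,) attention-weighted sum of the values
    return z + o, weights                    # assumption: residual connection to the query


z = np.array([1.0, -1.0])
V = np.array([[1.0, 3.0], [2.0, 4.0]])
p_long, weights = read_memory(z, np.zeros((2, 2)), V)
```

With all-zero keys the attention is uniform, so the memory output is the average of the value columns. Because the keys and values are shared by all users, the number of memory parameters does not grow with the number of users.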
We have obtained the user short-term interest representation and the long-term interest representation. The next aim is to combine these two kinds of hidden representations in the GNN framework to facilitate the user preference prediction on unrated items. Here, we modify Eq. 2 to bridge the user short-term interest and long-term interest.
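Specifically, we borrow the idea of LSTM [DBLP:journals/neco/HochreiterS97] that uses learnable gates to balance the current inputs and historical hidden states. Similarly, we propose a learnable gate to control how much the recent user interest representation and the long-term user interest representation can contribute to the combined user interest for item prediction:
$$\mathbf{g}_{u,l} = \sigma\Big(\mathbf{W}_g^{(1)} \cdot \frac{1}{|L|}\sum_{i \in L_{u,l}} \mathbf{h}_i + \mathbf{W}_g^{(2)} \cdot \mathbf{p}^H_{u,l} + \mathbf{W}_g^{(3)} \cdot \mathbf{p}_u\Big),$$
$$\mathbf{p}^C_{u,l} = \mathbf{g}_{u,l} \odot \frac{1}{|L|}\sum_{i \in L_{u,l}} \mathbf{h}_i + (\mathbf{1}_d - \mathbf{g}_{u,l}) \odot \mathbf{p}^H_{u,l}, \qquad (5)$$
where $\mathbf{W}_g^{(1)}, \mathbf{W}_g^{(2)}, \mathbf{W}_g^{(3)} \in \mathbb{R}^{d \times d}$ are the learnable parameters in the gating layer, $\odot$ denotes the element-wise multiplication, and $\mathbf{g}_{u,l} \in \mathbb{R}^d$ is the learnable gate. The superscript $C$ denotes the fusion of long- and short-term interests.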
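A minimal NumPy sketch of such a gate is given below. The names `fuse_interests`, `pooled_short`, and `Wg1` to `Wg3` are hypothetical, and the specific inputs to the gate (pooled short-term hidden states, long-term representation, user embedding) are assumptions of this example.

```python
import numpy as np


def fuse_interests(pooled_short, p_long, p_u, Wg1, Wg2, Wg3):
    """Blend short- and long-term representations with an element-wise gate."""
    logits = Wg1 @ pooled_short + Wg2 @ p_long + Wg3 @ p_u
    gate = 1.0 / (1.0 + np.exp(-logits))  # sigmoid, so every gate value lies in (0, 1)
    return gate * pooled_short + (1.0 - gate) * p_long, gate


d = 3
zeros = np.zeros((d, d))
fused, gate = fuse_interests(np.array([1.0, 2.0, 3.0]), np.array([3.0, 2.0, 1.0]),
                             np.ones(d), zeros, zeros, zeros)
```

With zero gate weights every gate value is 0.5, so the fused vector is the plain average of the two inputs. Learned weights let each latent dimension lean towards the short-term or the long-term representation independently.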
Item Co-occurrence Modeling
Successful learning of pairwise item relationships is a key component of recommender systems due to its effectiveness and interpretability. This has been studied and exploited in many recommendation models [DBLP:journals/tois/DeshpandeK04, DBLP:reference/sp/NingDK15]. In the sequential recommendation problem, the closely related items may appear one after another in the item sequence. For example, after purchasing a mobile phone, the user is much more likely to buy a mobile phone case or protector. To capture the item co-occurrence patterns, we use a bilinear function to explicitly model the pairwise relations between the items in $L_{u,l}$ and other items. This function takes the form
$$\mathbf{e}_i^\top \mathbf{W}_r \mathbf{q}_j,$$
where $\mathbf{W}_r \in \mathbb{R}^{d \times d}$ is a matrix of the learnable parameters that captures the correlations between item latent features.
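The bilinear co-occurrence score can be computed for all candidate items at once with two matrix products. In this sketch the names `cooccurrence_scores`, `E_window`, `W_r`, and `Q` are hypothetical, and averaging the pairwise scores over the window items is an assumption.

```python
import numpy as np


def cooccurrence_scores(E_window, W_r, Q):
    """E_window: (window_len, d) input embeddings of the window items;
    W_r: (d, d) bilinear weights; Q: (n_items, d) output item embeddings."""
    pairwise = E_window @ W_r @ Q.T  # (window_len, n_items) score per (window item, candidate)
    return pairwise.mean(axis=0)     # assumption: average over the window items


E_window = np.array([[1.0, 0.0], [0.0, 1.0]])
Q = np.array([[2.0, 4.0], [6.0, 8.0]])
scores = cooccurrence_scores(E_window, np.eye(2), Q)
```

With an identity `W_r` the bilinear form reduces to a dot product; a learned `W_r` lets different latent features of the two items interact.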
Prediction and Training
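To infer the user preference, we have a prediction layer to combine the aforementioned factors together:
$$\hat{r}_{u,j} = \mathbf{p}_u^\top \mathbf{q}_j + {\mathbf{p}^C_{u,l}}^\top \mathbf{q}_j + \frac{1}{|L|}\sum_{i \in L_{u,l}} \mathbf{e}_i^\top \mathbf{W}_r \mathbf{q}_j,$$
where the three terms correspond to the general interest, the fused short- and long-term interest, and the item co-occurrence score, respectively.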
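As the training data is derived from the user implicit feedback, we optimize the proposed model with respect to the Bayesian Personalized Ranking objective [DBLP:conf/uai/RendleFGS09] via gradient descent. This involves optimizing the pairwise ranking between the positive (observed) and negative (non-observed) items:
$$\arg\min_{\mathbf{P}, \mathbf{Q}, \mathbf{E}, \Theta} \sum_{u} \sum_{l} \sum_{(L_{u,l}, H_{u,l}, j, k)} -\log \sigma(\hat{r}_{u,j} - \hat{r}_{u,k}) + \lambda\big(\|\mathbf{P}\|^2 + \|\mathbf{Q}\|^2 + \|\mathbf{E}\|^2 + \|\Theta\|^2\big).$$
Here $j$ denotes the positive item in $T_{u,l}$, and $k$ denotes the randomly sampled negative item, $\sigma$ is the sigmoid function, $\Theta$ denotes other learnable parameters in the model, and $\lambda$ is the regularization parameter. $\mathbf{p}_u$, $\mathbf{q}_j$ and $\mathbf{e}_i$ are column vectors of $\mathbf{P}$, $\mathbf{Q}$ and $\mathbf{E}$, respectively. When minimizing the objective function, the partial derivatives w.r.t. all parameters are computed by back-propagation and the parameters are updated by gradient descent.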
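The pairwise ranking objective can be written compactly in NumPy. This sketch only evaluates the loss for given scores; the function name `bpr_loss` and its arguments are hypothetical, and gradient computation is left to an automatic differentiation framework.

```python
import numpy as np


def bpr_loss(pos_scores, neg_scores, params, reg):
    """Bayesian Personalized Ranking loss with an L2 penalty on the parameters."""
    diff = np.asarray(pos_scores, dtype=float) - np.asarray(neg_scores, dtype=float)
    ranking = np.logaddexp(0.0, -diff).sum()  # equals -log(sigmoid(diff)), computed stably
    penalty = reg * sum(float(np.sum(p ** 2)) for p in params)
    return ranking + penalty


loss_tie = bpr_loss([1.0], [1.0], params=[], reg=0.0)
loss_reg = bpr_loss([5.0], [0.0], params=[np.array([1.0, 2.0])], reg=0.1)
```

When a positive and a negative item receive the same score, the ranking term equals log 2; it shrinks towards zero as the positive item is scored increasingly higher than the negative one.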
In this section, we first describe the experimental set-up. We then report the results of conducted experiments and demonstrate the effectiveness of the proposed modules.
The proposed model is evaluated on five real-world datasets from various domains with different sparsities: MovieLens-20M [DBLP:journals/tiis/HarperK16], Amazon-Books and Amazon-CDs [DBLP:conf/www/HeM16], Goodreads-Children and Goodreads-Comics [DBLP:conf/recsys/WanM18]. MovieLens-20M is a user-movie dataset collected from the MovieLens website; the dataset has 20 million user-movie interactions. The Amazon-Books and Amazon-CDs datasets are adopted from the Amazon review dataset with different categories, i.e., CDs and Books, which cover a large amount of user-item interaction data, e.g., user ratings and reviews. The Goodreads-Children and Goodreads-Comics datasets were collected in late 2017 from the Goodreads website with a focus on the genres of Children and Comics. In order to be consistent with the implicit feedback setting, we keep the interactions with ratings no less than four (out of five) as positive feedback and treat all other ratings as missing entries on all datasets. To filter noisy data, we only keep the users with at least ten ratings and the items with at least ten ratings. The data statistics after preprocessing are shown in Table 1.
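The preprocessing rules can be sketched as follows. The function name `filter_implicit` and its parameters are hypothetical. The text does not specify whether the count thresholds are applied before or after the rating threshold, or whether filtering is repeated until stable, so this sketch assumes a single pass over the positive interactions.

```python
from collections import Counter
from typing import Hashable, List, Tuple

Interaction = Tuple[Hashable, Hashable, float]  # (user, item, rating)


def filter_implicit(interactions: List[Interaction], min_rating: float = 4.0,
                    min_count: int = 10) -> List[Tuple[Hashable, Hashable]]:
    """Keep positive interactions of sufficiently active users and popular items."""
    positives = [(u, i) for u, i, r in interactions if r >= min_rating]
    user_counts = Counter(u for u, _ in positives)
    item_counts = Counter(i for _, i in positives)
    # Assumption: a single filtering pass, with counts taken over positive feedback only.
    return [(u, i) for u, i in positives
            if user_counts[u] >= min_count and item_counts[i] >= min_count]


toy = [("a", "x", 5), ("a", "y", 4), ("b", "x", 5), ("b", "y", 3), ("c", "x", 2)]
kept = filter_implicit(toy, min_rating=4, min_count=2)
```

In the toy example only user `a` and item `x` reach the threshold of two positive interactions, so a single pair survives.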
For each user, we use the earliest 70% of the interactions in the user sequence as the training set and use the next 10% of interactions as the validation set for hyper-parameter tuning. The remaining 20% constitutes the test set for reporting model performance. Note that during the testing procedure, the input sequences include the interactions in both the training set and validation set. The learning of all the models is carried out five times to report the average results.
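The per-user chronological split can be sketched as below. The function name `chronological_split` is hypothetical, and the rounding rule (integer floor on the cut points) is an assumption, since the text does not say how fractional boundaries are handled.

```python
from typing import List, Tuple


def chronological_split(seq: List[int], train_pct: int = 70,
                        valid_pct: int = 10) -> Tuple[List[int], List[int], List[int]]:
    """Split one user's chronological sequence into train, validation, and test parts."""
    n = len(seq)
    # Integer arithmetic avoids floating-point surprises at the cut points.
    train_end = n * train_pct // 100
    valid_end = n * (train_pct + valid_pct) // 100
    return seq[:train_end], seq[train_end:valid_end], seq[valid_end:]


train, valid, test = chronological_split(list(range(10)))
test_input = train + valid  # at test time the input covers both earlier parts
```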
Statistical significance is assessed against the best baseline method using the paired t-test.
We evaluate all the methods in terms of Recall@K and NDCG@K. For each user, Recall@K (R@K) indicates what percentage of her rated items emerge in the top recommended items. NDCG@K (N@K) is the normalized discounted cumulative gain at , which takes the position of correctly recommended items into account. is set to .
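Both metrics can be computed per user as in the following sketch, which assumes binary relevance. The function names `recall_at_k` and `ndcg_at_k` are hypothetical. Recall is normalized by the number of held-out items, which matches the description above, although other normalizations exist in the literature.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[int], relevant: Set[int], k: int) -> float:
    """Fraction of the held-out items that appear in the top-k list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: List[int], relevant: Set[int], k: int) -> float:
    """Binary-relevance NDCG: hits are discounted by the log of their rank."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


recall = recall_at_k([3, 1, 2], {1, 5}, k=2)
ndcg = ndcg_at_k([3, 1, 2], {1, 5}, k=2)
```

In this example one of the two held-out items is retrieved at rank two, so the recall is 0.5 and the NDCG is lower than it would be if the hit had occurred at rank one.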
To demonstrate the effectiveness of our model, we compare to the following recommendation methods: (1) BPRMF, Bayesian Personalized Ranking based Matrix Factorization [DBLP:conf/uai/RendleFGS09], a classic method for learning pairwise item rankings; (2) GRU4Rec, Gated Recurrent Unit for Recommendation [DBLP:journals/corr/HidasiKBT15], which uses recurrent neural networks to model item sequences for session-based recommendation; (3) GRU4Rec+, an improved version of GRU4Rec [DBLP:conf/cikm/HidasiK18], which adopts an advanced loss function and sampling strategy; (4) GC-SAN, Graph Contextualized Self-Attention Network [DBLP:conf/ijcai/XuZLSXZFZ19], which uses a graph neural network and self-attention mechanism for session-based recommendation; (5) Caser, Convolutional Sequence Embedding Recommendation [DBLP:conf/wsdm/TangW18], which captures high-order Markov chains via convolution operations; (6) SASRec, Self-Attention based Sequential Recommendation [DBLP:conf/icdm/KangM18], which uses an attention mechanism to identify relevant items for prediction; (7) MARank, Multi-order Attentive Ranking model [DBLP:conf/aaai/YuZLZ19], which unifies individual- and union-level item interactions to infer user preference from multiple views; (8) MA-GNN, the proposed model, which applies a memory augmented GNN to combine the recent and historical user interests and adopts a bilinear function to explicitly capture the item-item relations.
In the experiments, the latent dimension of all the models is set to 50. For the session-based methods, we treat the items in a short-term window as one session. For GRU4Rec and GRU4Rec+, we find that a learning rate of and batch size of can achieve good performance. These two methods adopt Top1 loss and BPR-max loss, respectively. For GC-SAN, we set the weight factor to and the number of self-attention blocks to . For Caser, we follow the settings in the author-provided code to set , , the number of horizontal filters to , and the number of vertical filters to . For SASRec, we set the number of self-attention blocks to , the batch size to , and the maximum sequence length to . For MARank, we follow the original paper to set the number of depending items as and the number of hidden layers as . The network architectures of the above methods are configured to be the same as described in the original papers. The hyper-parameters are tuned on the validation set.
For MA-GNN, we follow the same setting in Caser to set and . Hyper-parameters are tuned by grid search on the validation set. The embedding size is also set to . The value of and are selected from . The learning rate and are set to and , respectively. The batch size is set to .
The performance comparison results are shown in Table 2.
Observations about our model. First, the proposed model, MA-GNN, achieves the best performance on all five datasets with all evaluation metrics, which illustrates the superiority of our model. Second, MA-GNN outperforms SASRec. Although SASRec adopts the attention model to distinguish the items users have accessed, it neglects the common item co-occurrence patterns between two closely related items, which are captured by our bilinear function. Third, MA-GNN achieves better performance than Caser, GC-SAN and MARank. One major reason is that these three methods only model the user interests in a short-term window or session, but fail to capture the long-term item dependencies. In contrast, we have a memory network to generate the long-term user interest. Fourth, MA-GNN obtains better results than GRU4Rec and GRU4Rec+. One possible reason is that GRU4Rec and GRU4Rec+ are session-based methods that do not explicitly model the user general interests. Fifth, MA-GNN outperforms BPRMF. BPRMF only captures the user general interests, and does not incorporate the sequential patterns of user-item interactions. As such, BPRMF fails to capture the user short-term interests.
Other observations. First, all the results reported on MovieLens-20M, Goodreads-Children and Goodreads-Comics are better than the results on other datasets. The major reason is that the other datasets are sparser and data sparsity negatively impacts recommendation performance. Second, MARank, SASRec and GC-SAN outperform Caser on most of the datasets. The main reason is that these methods can adaptively measure the importance of different items in the item sequence, which may lead to more personalized user representation learning. Third, Caser achieves better performance than GRU4Rec and GRU4Rec+ in most cases. One possible reason is that Caser explicitly inputs the user embeddings into its prediction layer, which allows it to learn general user interests. Fourth, GRU4Rec+ performs better than GRU4Rec on all datasets. The reason is that GRU4Rec+ not only captures the sequential patterns in the user-item sequence but also has a superior objective function, BPR-max. Fifth, all the methods perform better than BPRMF. This illustrates that a technique that can only perform effective modeling of the general user interests is incapable of adequately capturing the user sequential behavior.
To verify the effectiveness of the proposed short-term interest modeling module, long-term interest modeling module, and item co-occurrence modeling module, we conduct an ablation study in Table 3. This demonstrates the contribution of each module to the MA-GNN model. In (1), we utilize only the BPR matrix factorization without other components to show the performance of modeling user general interests. In (2), we incorporate the user short-term interest by the vanilla graph neural network (Eq. 1 and 2) on top of (1). In (3), we integrate the user long-term interest with the short-term interest via the proposed interest fusion module (Eq. 3, 4 and 5) on top of (2). In (4), we replace the interest fusion module in (3) with the concatenation operation to link the short-term interest and long-term interest. In (5), we replace the concatenation operation with a gated recurrent unit [DBLP:conf/emnlp/ChoMGBBSB14] (GRU). In (6), we present the overall MA-GNN model to show the effectiveness of the item co-occurrence modeling module.
From the results shown in Table 3, we make the following observations. First, comparing (1) and (2)-(6), we can observe that although the conventional BPR matrix factorization can capture the general user interests, it cannot effectively model the short-term user interests. Second, from (1) and (2), we observe that incorporating the short-term interest using the conventional aggregation function of the GNN slightly improves the model performance. Third, in (3), (4) and (5), we compare three ways to bridge the user short-term interest and long-term interest. From the results, we can observe that our proposed gating mechanism achieves considerably better performance than concatenation or the GRU, which demonstrates that our gating mechanism can adaptively combine these two kinds of hidden representations. Fourth, from (3) and (6), we observe that by incorporating the item co-occurrence pattern, the performance further improves. The results show the effectiveness of explicitly modeling the co-occurrence patterns of the items that a user has accessed and those items that the user may interact with in the future. The item co-occurrence pattern can provide a significant amount of supplementary information to help capture the user sequential dynamics.
Influence of Hyper-parameters
The dimension of the multi-dimensional attention model and the number of memory units are two important hyper-parameters in the proposed model. We investigate their effects on the CDs and Comics datasets in Figure 3.
From the results in Figure 3, we observe that both the multi-dimensional attention and the memory network contribute to capturing the long-term user interests. These two components lead to a larger improvement in performance for the CDs dataset compared to the Comics dataset, indicating that they may help to alleviate the data sparsity problem.
To validate whether each memory unit can represent a certain type of user interests, we conduct a case study on the MovieLens dataset to verify how each memory unit functions given different movies. We randomly select a user and several movies she watched. For simplicity, we treat each selected item as a query to visualize the attention weight computed by Eq. 4. In this case, we set the number of memory units to .
From Figure 4, we observe that our memory units perform differently given different types of movies, which may illustrate that each memory unit in the memory network can represent one type of the user interest. For example, the Three Colors trilogy has quite similar attention weights in the memory network, since these three movies are loosely based on three political ideals in the motto of the French Republic. Die Hard is an action thriller movie, which is distinct from any other movies in the case study, explaining why it has a different weight pattern.
In this paper, we propose a memory augmented graph neural network (MA-GNN) for sequential recommendation. MA-GNN applies a GNN to model items’ short-term contextual information, and utilizes a memory network to capture the long-range item dependency. In addition to the user interest modeling, we employ a bilinear function to model the feature correlations between items. Experimental results on five real-world datasets clearly validate the performance advantages of our model over many state-of-the-art methods and demonstrate the effectiveness of the proposed modules.