Recommendation, as an information filtering method, has been extended to a wide range of real-world applications such as product recommendation (Qu et al., 2018), video recommendation (Lu et al., 2018) and app recommendation (Yao et al., 2017). In recent years, an important trend in recommendation is to consider the order of user behaviors (Hidasi et al., 2016a; Jing and Smola, 2017; Wu et al., 2017; Yu et al., 2016), which has shown promising results by capturing behavior relationships. According to how they incorporate sequential information, existing methods fall into two categories. Traditional methods utilize sequential information as contextual features during training (Gao et al., 2013; Koenigstein et al., 2011; Yuan et al., 2013), and therefore cannot model high-order sequential relationships. Recent sequential recommendation methods inherently incorporate temporal information with Recurrent Neural Networks (RNNs) (Hidasi et al., 2016a; Wu et al., 2017; Yu et al., 2016). These methods have been further extended with attention mechanisms (Donkers et al., 2017; Pei et al., 2017), personalization (Donkers et al., 2017) and auxiliary information such as heterogeneous attributes and knowledge bases (Liu et al., 2018a; Huang et al., 2018; Chen et al., 2018) in order to better model sequential user behavior.
However, real-world user behaviors usually reflect a hybrid of sequential relationships and association relationships. The order observed in a behavior sequence does not necessarily reflect relationships among behaviors. For example, Figure 1 illustrates a user’s purchase sequence of a phone and its accessories. Clearly, the transition from the iPhone 6s (step 1) to the phone accessories (steps 2, 3, 4 and 5) contains sequential relationships due to purchase causality. On the other hand, the purchase order among the cellphone accessories could be arbitrary and actually reflects their association.
Although RNN-based sequential modeling methods are capable of capturing the sequential relationships among user behaviors, they ignore the association relationships among items. We thus propose a unified model named CAASR (Cascading: Association Augmented Sequential Recommendation) to simultaneously capture association relationships and sequential relationships for sequential recommendation. The general idea of CAASR is illustrated in Figure 2. The item association features are mined from the item graph by Graph Convolutional Networks (GCNs), and the sequential relationships are modeled by a widely used RNN-based sequential recommendation method named GRU4Rec (Hidasi et al., 2016a). The two are connected in cascade and learned with one single training target. Because the adopted GCNs do not support mini-batch training, we design a specific graph embedding lookup layer that combines GCNs and the RNN-based sequential modeling method adaptively. Table 1 compares CAASR with other methods from various perspectives. CAASR is the only method that both mines item association and generalizes RNN-based sequential recommendation.
, while P-GraphAE utilizes the widely used Autoencoder regularization technique in (Cao et al., 2017). CAASR is our proposed method in cascading style.
Our contributions are summarized in the following three points:
We concentrate on modeling both the item association relationships and the sequential relationships for sequential recommendation, and develop a novel method named CAASR. To the best of our knowledge, CAASR is the first model that concentrates on item association relationships in RNN-based sequential recommendation.
To seamlessly combine item association relationships and sequential relationships, we explore a cascading style to incorporate them rather than the widely used parallel regularization technique.
We conduct extensive experiments on three real-world datasets. Comprehensive results demonstrate the superiority of CAASR over state-of-the-art models both quantitatively and qualitatively.
The rest of this paper is organized as follows. Section 2 introduces recent related work, including sequential recommendation, item association relationships in recommendation and graph embedding methods. The problem definition and a concise introduction to GCNs are given in Section 3. Section 4 provides the details of the proposed methods. Experiments and results are presented in Section 5, where the results of our algorithm are verified both quantitatively and qualitatively. Finally, the conclusion and future work are given in Section 6.
2. Related Work
2.1. Towards Recommendation Methods for Sequential User Behavior
Most sequential recommendation methods are designed to capture and utilize the sequential relationships from sequence data.
Traditional Methods. Previously, sequential information was usually used as a contextual feature during the training phase, with the timestamp considered as an additional resource to enrich the model features. However, these methods (Karatzoglou et al., 2010; Gao et al., 2013) rely on handcrafted features, which are time-consuming to design and not widely applicable. TimeSVD++ (Koren, 2009) exploits temporal signals through the SVD technique. Tensor factorization is another way of applying temporal information, as in (Li et al., 2017). These methods typically cannot model high-order sequential relationships from chronological data (Donkers et al., 2017). There are also several sequential models that decompose the problem into two parts: user preference modeling and sequential modeling. For example, Factorized Personalized Markov Chain (FPMC) (Rendle et al., 2010) is a classic sequential recommendation method that fuses MF and a factorized Markov Chain to model user preference and sequential relationships respectively. Recently, inspired by translational metric embeddings (Bordes et al., 2013), TransRec (He et al., 2017) unifies sequential recommendation and a specific metric embedding approach, modeling each user as a translating vector from his/her last visited item to the next one.
Deep Learning Based Methods. Deep learning models applied to sequential recommendation include Recurrent Neural Networks (RNNs) (Rumelhart et al., 1986). RNNs are a promising approach for capturing the sequential relationships of user preference from user-item interaction data (Jing and Smola, 2017; Wu et al., 2017; Yu et al., 2016). Wu et al. (2017) and Hidasi et al. (2016a) utilize Long Short-Term Memory (LSTM) (Graves, 2012) and the Gated Recurrent Unit (GRU) (Cho et al., 2014) respectively to model sequential relationships of user representation from chronological data. Then, in (Hidasi et al., 2016b), both user representation and item one-hot vectors are merged into one model based on (Hidasi et al., 2016a). Donkers et al. (2017) point out that the approaches in (Hidasi et al., 2016a, b) do not explicitly model individual users, and thus introduce a variant of the GRU model that utilizes a pairwise loss to generate personalized recommendations. Owing to the strong performance of the attention mechanism in Neural Machine Translation (NMT) (Bahdanau et al., 2014) and video captioning tasks (Chen et al., 2017; Xu et al., 2015), attention has been introduced into recommendation as well (Donkers et al., 2017; Pei et al., 2017; Li et al., 2017; Liu et al., 2018b). Donkers et al. (2017) design an attention-gated cell to regulate the original gating process in GRU. Interacting Attention-gated Recurrent Networks (IARN), proposed in (Pei et al., 2017), extend recurrent networks to model both user and item dynamics with a novel gating mechanism; a novel attention scheme is also designed to allow the user-side and item-side recurrent networks to interact with each other. STAMP (Liu et al., 2018b) proposes a novel short-term attention/memory priority mechanism to emphasize general interests of users. As for auxiliary information, heterogeneous item attributes and knowledge bases have been applied to sequential recommendation in (Liu et al., 2018a) and (Huang et al., 2018; Chen et al., 2018) respectively.
2.2. Item Association in Recommendation
Although item association is rarely modeled in RNN-based sequential recommendation, it has been widely used in non-sequential recommendation methods (Mei et al., 2011; Liang et al., 2016; Cao et al., 2017). VideoReach (Mei et al., 2011) presents a video recommendation method based on multi-modal content relevance. In many recommendation scenarios, however, the contents of items are difficult to obtain. In item-based collaborative filtering, an item-item similarity matrix encoding item relevance is defined from user behavior data and applied directly to predict missing preferences. This method is highly sensitive to the choice of similarity metric and data normalization (Herlocker et al., 2002), which limits its practicality. Inspired by the co-factorization of multiple relational matrices in Collective Matrix Factorization (CMF) (Singh and Gordon, 2008), Cofactor (Liang et al., 2016) constructs an SPPMI matrix from the item co-occurrence matrix and applies co-factorization as a regularization term on the item representation in the Matrix Factorization (MF) method (Koren et al., 2009). This co-factorization technique is also applied to ranking recommendation in (Cao et al., 2017).
2.3. Graph Embedding Methods
Representation learning for nodes in graphs has become an active research area. Existing methods mainly follow two directions.
Random Walk Based. Recently, the skip-gram model (Mikolov et al., 2013) has shown its success in natural language processing. This method has opened new ways for feature learning of discrete objects such as words. Perozzi et al. (2014) discover that random walks over the nodes of a graph follow a power-law distribution similar to that of words in natural language. Thus they regard the random walks on a graph as sentences and leverage the skip-gram model to learn latent node representations. Tang et al. (2015) define a node’s context by its neighborhood. Node2Vec (Grover and Leskovec, 2016) defines a biased random walk that balances Breadth-First Search (BFS) and Depth-First Search (DFS). These random walk motivated works are based on skip-gram, and Levy and Goldberg (2014) point out that skip-gram with negative sampling (SGNS) (Mikolov et al., 2013) is actually conducting implicit matrix factorization.
Graph Convolutional Networks. Graph Convolutional Networks (GCNs) are the other representative methodology in this research area, originating from spectral graph analysis.
In spectral graph theory (Chung and Graham, 1997), complex geometric structures in graphs can be studied with spectral filters. A GCN learns graph structure from a spectral graph view: it is a kind of network that learns locally stationary features of nodes in the spectral graph domain.
By the convolution theorem (Mallat, 1999), convolution is defined as a linear operator that is diagonalized in the Fourier basis, and on a graph this basis is given by the eigenvectors of the graph Laplacian operator. However, a filter defined in the spectral domain is not naturally localized, and its computation is costly because of the multiplication with the graph Fourier basis. These limitations are overcome by a special choice of filter parameterization in (Bruna et al., 2013; Defferrard et al., 2016). In (Defferrard et al., 2016), the authors apply Chebyshev polynomials to reach a recursive formulation of the graph convolution operation. Compared with the work in (Bruna et al., 2013), the method in (Defferrard et al., 2016) provides strict control over the local support of filters and is more computationally efficient because it avoids the eigenvalue decomposition of the graph Laplacian matrix. Meanwhile, better performance is achieved as well. Graph convolution is similar to the convolution operation in CNNs, except that it is defined in the spectral domain for graph data. Kipf and Welling (2016) simplify the GCNs in (Defferrard et al., 2016) by stacking convolutional layers instead of summing over different Chebyshev orders. Velickovic et al. (2017) regard graph convolution as an aggregation problem over node representations and introduce a multi-head attention mechanism to learn attentive weights for each node during the aggregation process. GraphSAGE (Hamilton et al., 2017) generalizes graph convolutional networks to unseen nodes and proposes an inductive learning framework that leverages node feature information to efficiently generate node embeddings. Note that we mainly focus on (Defferrard et al., 2016) here. A concise introduction to the GCNs exploited in our paper is provided later.
3.1. Problem Definition
We first introduce the notations utilized in our paper and give a clear problem definition for recommendation with sequential user behavior data. We denote $\mathcal{U}$ as the set of users and $\mathcal{I}$ as the set of items. Our task concentrates on implicit feedback in user behavior data, where the feedback between user $u$ and item $i$ at time step $t$ is denoted as 1 if they have an interaction and 0 if not. By sorting the interaction data of each user in ascending order of time, we form the sequence data for user $u$ as $S^u = (i^u_1, i^u_2, \ldots, i^u_{T_u})$, where $i^u_t$ represents the item that user $u$ has interacted with at time step $t$ and $T_u$ is the length of the interaction data for user $u$. Given an interaction sequence $S^u$, our aim lies in predicting the top $k$ items that the user would most probably interact with at time step $T_u + 1$. The notations utilized in this paper are summarized in Table 2.
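| Notation | Description |
|---|---|
| $\mathcal{U}$ | the user set |
| $\mathcal{I}$ | the item set |
| $M$ | number of users (sequences) |
| $N$ | number of items |
| $i^u_t$ | the item that user $u$ interacted with at step $t$ |
| $A$ | item graph adjacency matrix |
| $D$ | degree matrix of $A$ |
| $I_N$ | the identity matrix of size $N$ |
| $L$ | symmetric normalized Laplacian matrix of item graph |
| $\tilde{L}$ | rescaled symmetric normalized Laplacian matrix of item graph |
| $K$ | Chebyshev order of graph convolution in this paper |
| $d$ | dimension of embedding |
| $T_k$ | Chebyshev polynomial with order $k$ |
| $\mathcal{L}_{\text{CAASR}}$ | loss function of CAASR method |
| $\mathcal{L}_{\text{seq}}$ | loss function of pure RNNs based sequential recommendation method |
| $\mathcal{L}_{\text{P-Cofactor}}$ | loss function of P-Cofactor method |
| $\mathcal{L}_{\text{P-GraphAE}}$ | loss function of P-GraphAE method |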
3.2. Graph Convolutional Networks
Graph Convolutional Networks (GCNs) are an essential ingredient of our CAASR method, so we give a concise introduction to them here. Graph convolution is an operation that aims to learn superior node representations by considering the dependence among nodes. It is limited neither to a specific task nor to a specific model (Defferrard et al., 2016; Kipf and Welling, 2016; Velickovic et al., 2017; Hamilton et al., 2017).
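Spectral convolution on graphs is defined as the multiplication of a signal $x \in \mathbb{R}^N$ with a parameterized filter $g_\theta$ in the Fourier domain, i.e.:
$$g_\theta \star x = U g_\theta(\Lambda) U^\top x, \tag{1}$$
where $\star$ represents the graph convolution operation. $U$ and $\Lambda$ denote the matrix of eigenvectors and the diagonal matrix of eigenvalues of the graph Laplacian $L$, respectively, and $U^\top x$ indicates the graph Fourier transform of $x$. A polynomial filter is taken in (Defferrard et al., 2016) as $g_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k \Lambda^k$. However, this convolution filter involves the eigen-decomposition of $L$, which might be computationally expensive for large graphs. To circumvent this problem, $g_\theta(\Lambda)$ can be well approximated by a truncated expansion in terms of Chebyshev polynomials $T_k$ up to order $K-1$ (Hammond et al., 2011):
$$g_\theta \star x \approx \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L}) x, \tag{2}$$
where $\tilde{L} = \frac{2}{\lambda_{\max}} L - I_N$ and $\theta \in \mathbb{R}^K$ is the vector of Chebyshev coefficients. $\lambda_{\max}$ denotes the largest eigenvalue of $L$. The Chebyshev polynomials are recursively defined as $T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)$ with $T_0(x) = 1$ and $T_1(x) = x$.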
From the above, GCNs learn each node representation from the spectral graph domain. The size of a node’s neighborhood is called the receptive field. It is enlarged by increasing the Chebyshev order, analogous to the $k$-hop neighborhood in a graph, which brings more neighborhood information into the learned node representation.
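The Chebyshev approximation above can be checked numerically. The following is a minimal NumPy sketch, not the implementation used in this paper: the path graph, the signal and the coefficient values are hypothetical, and the function name `chebyshev_filter` is introduced only for illustration. It applies the recursion directly to a signal and compares the result with the explicit filter in the Fourier domain.

```python
import numpy as np

def chebyshev_filter(L_tilde, x, theta):
    """Return sum_k theta[k] * T_k(L_tilde) @ x via the Chebyshev recursion."""
    t_prev = x                        # T_0(L~) x = x
    out = theta[0] * t_prev
    if len(theta) == 1:
        return out
    t_curr = L_tilde @ x              # T_1(L~) x = L~ x
    out = out + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_next = 2.0 * (L_tilde @ t_curr) - t_prev
        out = out + theta[k] * t_next
        t_prev, t_curr = t_curr, t_next
    return out

# Hypothetical toy graph: a path 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
L = np.eye(4) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
lam, U = np.linalg.eigh(L)
L_tilde = 2.0 * L / lam.max() - np.eye(4)

x = np.array([1.0, 0.0, 0.0, 0.0])    # signal concentrated on node 0
theta = np.array([0.5, -0.3, 0.2])    # K = 3 coefficients (orders 0, 1, 2)

y_cheb = chebyshev_filter(L_tilde, x, theta)

# Reference: the same filter applied explicitly in the Fourier domain.
lam_tilde = 2.0 * lam / lam.max() - 1.0
g = np.polynomial.chebyshev.chebval(lam_tilde, theta)
y_spec = U @ (g * (U.T @ x))
```

In this toy setting the two results agree, and the filtered value at node 3 is zero because node 3 lies three hops from node 0 while the filter only uses orders up to 2, which illustrates the localized receptive field.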
4. The Proposed Method
In this section, we demonstrate: 1) how to sufficiently mine item association relationships merely from user behavior data; 2) how to combine these item association relationships with sequential relationships in different styles. Both the cascading and the parallel styles are explored and shown in Figure 3.
We first introduce the general architecture of our proposed cascading CAASR model, illustrated in Figure 4. In addition, two parallel styles of combining the item association pattern are explored and analyzed later. Details are as follows.
4.1. Association Augmented Graph Embedding
As summarized before, a real-world user consumption sequence is usually a hybrid of association relationships and sequential relationships. Recent RNN-based sequential recommendation methods such as GRU4Rec are capable of capturing the sequential relationships but fail to emphasize the item association characteristic. We aim to model this item association pattern while performing the sequential recommendation task.
For mining item association relationships, there are several existing non-sequential recommendation methods, a typical one being (Liang et al., 2016), in which an item co-occurrence matrix is constructed from user behavior data and a co-factorization technique is utilized. However, we find that this technique does not work well in RNN-based sequential user behavior modeling. This happens mainly because it is incompatible to apply regularization between the deep representation learned by RNNs and the shallow representation learned by traditional matrix factorization. Instead, we observe that the co-occurrence matrix is one way of expressing graph data and intrinsically reflects the association relationships among items. In order to extract item association information, we use item co-occurrence data to construct an item graph, which can be easily processed by graph embedding methods.
Graph embedding methods can be categorized into two main groups: random walk based methods and GCNs. Random walk based graph embedding methods do not support end-to-end training and rely heavily on the quality of the random walk sequences. In contrast, GCNs, developed from spectral graph signal processing (Shuman et al., 2013), have shown robust and superior performance over random walk based methods (Defferrard et al., 2016; Kipf and Welling, 2016). They are limited neither to a specific model nor to a specific task, allowing an end-to-end and flexible combination with sequential recommendation. Therefore, we leverage graph convolution on the item co-occurrence graph to learn association augmented item embeddings as prior knowledge for describing items.
The item co-occurrence graph can be obtained from user-item interactions. We count the number of user consumption sequences in which two items co-occur. In this way, a matrix containing the co-occurrence frequency of every pair of items is formed. We set the entries whose frequency is larger than a threshold to 1 and the rest to 0 to form a sparse adjacency matrix $A \in \mathbb{R}^{N \times N}$ of the item graph ($N$ is the number of items). This adjacency matrix is symmetric, indicating the symmetric relationship between every two items. We give an example to illustrate this process in Figure 5.
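The construction just described can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions: the function name `build_item_graph`, the toy sequences and the threshold value are hypothetical, items are assumed to be integer ids in `0..num_items-1`, and each pair is counted at most once per sequence.

```python
from collections import Counter
from itertools import combinations

def build_item_graph(sequences, num_items, threshold):
    """Binary symmetric adjacency matrix from per-sequence item co-occurrence."""
    counts = Counter()
    for seq in sequences:
        # Count each unordered item pair once per sequence.
        for a, b in combinations(sorted(set(seq)), 2):
            counts[(a, b)] += 1
    adj = [[0] * num_items for _ in range(num_items)]
    for (a, b), c in counts.items():
        if c > threshold:             # keep pairs strictly above the threshold
            adj[a][b] = adj[b][a] = 1
    return adj

# Hypothetical toy data: four user sequences over four items.
sequences = [[0, 1, 2], [0, 1, 3], [1, 2], [0, 1]]
adj = build_item_graph(sequences, num_items=4, threshold=1)
```

Here items 0 and 1 co-occur in three sequences and items 1 and 2 in two, so only those two pairs become edges. A dense list of lists is used for readability; for a large catalog a sparse representation would be the natural choice.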
To prepare for the subsequent convolution operation in the spectral domain, several operations need to be performed according to the graph convolution theory in (Defferrard et al., 2016). First, we obtain the normalized graph Laplacian through $L = I_N - D^{-1/2} A D^{-1/2}$, where $D$ is the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$ and $I_N$ is the identity matrix of size $N$. Then, the rescaled version of $L$ is formulated as $\tilde{L} = \frac{2}{\lambda_{\max}} L - I_N$, where $\lambda_{\max}$ is the maximum eigenvalue of $L$.
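As a sketch of this preprocessing step, the snippet below computes the rescaled normalized Laplacian with dense NumPy arrays. The function name `rescaled_laplacian` and the triangle graph are hypothetical, and treating isolated nodes as having $D^{-1/2}_{ii} = 0$ is an assumption of this sketch rather than something stated in the paper.

```python
import numpy as np

def rescaled_laplacian(A):
    """Return L~ = 2 L / lambda_max - I with L = I - D^{-1/2} A D^{-1/2}."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros(n)
    nonzero = deg > 0                 # assumption: isolated nodes get 0
    d_inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    L = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    lam_max = np.linalg.eigvalsh(L).max()
    return 2.0 * L / lam_max - np.eye(n)

# Hypothetical toy graph: a triangle.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])
L_tilde = rescaled_laplacian(A)
eigs = np.linalg.eigvalsh(L_tilde)
```

The rescaling maps the spectrum of $L$ into $[-1, 1]$, which is the interval on which the Chebyshev polynomials are defined. For graphs with many items, a sparse eigenvalue solver would likely replace the dense `eigvalsh` call.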
Graph Convolution. In order to extract item association information from the item graph, this item graph needs to be filtered in the spectral domain. Suppose we have the item representation input $X \in \mathbb{R}^{N \times C}$, where $N$ is the number of items and $C$ is the number of input channels for each item. Here we only use the one-hot vector representation of all items as $X$ throughout our model training process, which means $C = N$ and $X$ is an identity matrix of size $N$. The convolution on the item graph can be generalized as below:
$$Z = g_\Theta \star X = \sum_{k=0}^{K-1} T_k(\tilde{L}) X \Theta_k, \tag{3}$$
$$T_k(\tilde{L}) = 2 \tilde{L}\, T_{k-1}(\tilde{L}) - T_{k-2}(\tilde{L}), \quad T_0(\tilde{L}) = I_N, \quad T_1(\tilde{L}) = \tilde{L}, \tag{4}$$
where $\star$ indicates the graph convolution operation defined in the spectral domain. $K$ is the number of orders of Chebyshev polynomials of the Laplacian. $\Theta_k \in \mathbb{R}^{C \times d}$ is the filter parameter that needs to be learned for order $k$. $Z \in \mathbb{R}^{N \times d}$ denotes the spectral graph embeddings of all items, and it contains all the associated structure information of items. Specifically, the Chebyshev polynomial $T_k(\tilde{L})$ is a sparse matrix, which can be recursively generated by using the definition in Eq. 4. And $T_k(\tilde{L}) X \Theta_k$ can be efficiently implemented as the product of a sparse matrix with a dense matrix by using TensorFlow (https://www.tensorflow.org/).
This association augmented graph embedding operation is illustrated in Figure 4 as the first shaded component. The input $X$ is filtered by $K$ filters, which output $K$ feature matrices, each with shape $N \times d$, that capture different orders of associated features. Summing the $K$ order features leads to the final convolutional features $Z$ of all items.
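The filtering and summation described above can be sketched as follows. This is a dense NumPy illustration under stated assumptions, not the sparse TensorFlow implementation mentioned in the text: the function name `chebyshev_item_embedding`, the random filter parameters and the triangle-graph Laplacian are hypothetical, and the one-hot input is used implicitly, so that the input times a filter equals the filter itself.

```python
import numpy as np

def chebyshev_item_embedding(L_tilde, thetas):
    """Sum of T_k(L~) @ Theta_k over all orders, with identity (one-hot) input."""
    n = L_tilde.shape[0]
    t_prev = np.eye(n)                       # T_0(L~) = I
    Z = t_prev @ thetas[0]
    if len(thetas) == 1:
        return Z
    t_curr = L_tilde                         # T_1(L~) = L~
    Z = Z + t_curr @ thetas[1]
    for theta_k in thetas[2:]:
        t_next = 2.0 * L_tilde @ t_curr - t_prev
        Z = Z + t_next @ theta_k
        t_prev, t_curr = t_curr, t_next
    return Z

# Hypothetical setting: 3 items, embedding size 2, 3 Chebyshev orders.
N, d, K = 3, 2, 3
L_tilde = np.array([[ 1/3, -2/3, -2/3],
                    [-2/3,  1/3, -2/3],
                    [-2/3, -2/3,  1/3]])     # rescaled Laplacian of a triangle
rng = np.random.default_rng(0)
thetas = [rng.normal(size=(N, d)) for _ in range(K)]

Z = chebyshev_item_embedding(L_tilde, thetas)
Z_order0 = chebyshev_item_embedding(L_tilde, thetas[:1])
```

With a single filter the output equals the zeroth filter parameter, that is, a plain embedding table; higher orders add terms that mix in neighbors through powers of the Laplacian.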
4.2. Embedding Lookup Layer
Through the spectral graph filtering, we get the associated embeddings of items. Note that in most neural networks, mini-batch training is utilized for both its low memory usage and better model stability. For the graph convolution in (Defferrard et al., 2016; Kipf and Welling, 2016), however, a mini-batch strategy cannot preserve the original graph structure during training. Therefore we cannot train the spectral filter parameters in a mini-batch manner. Instead, we input the whole rescaled normalized graph Laplacian matrix in sparse form. To seamlessly combine the association augmented graph embedding with the downstream sequential relationship modeling module under mini-batch training, we design a graph embedding lookup layer to bridge the gap. In this manner, not only is the graph structure preserved, but the subsequent RNN-based sequential relationship modeling module can also still be trained with the mini-batch approach.
This embedding lookup layer is shown in Figure 4 as the second shaded component. In each forward and backward pass of the final loss, suppose we have a mini-batch of item indices $b_t$ at step $t$; we then look up the corresponding association graph embeddings that belong to the items in $b_t$. The chosen mini-batch association item embedding at step $t$ is denoted as $E_t$. This means that within one training epoch, one item graph embedding may be trained several times, which makes the learning process more stable and faster to converge. This differentiable graph embedding lookup operation can be formulated as:
$$E_t = \mathrm{lookup}(Z, b_t), \quad E_t \in \mathbb{R}^{B \times d}, \tag{5}$$
where the function $\mathrm{lookup}(\cdot)$ denotes the lookup operation on the association augmented graph embeddings $Z$. $E_t$ denotes the graph embedding of the mini-batch with the corresponding item indices $b_t$, and $B$ is the mini-batch size.
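To make the lookup and its gradient concrete, the sketch below uses NumPy row indexing for the forward pass and a scatter-add for the backward pass. The function names and toy values are hypothetical; in a framework such as TensorFlow a gather operation (for example `tf.gather`) would typically play this role, with the gradient handled automatically.

```python
import numpy as np

def lookup(Z, item_idx):
    """Forward pass: select the embedding rows of the mini-batch items."""
    return Z[item_idx]

def lookup_grad(Z_shape, item_idx, grad_out):
    """Backward pass: scatter-add the mini-batch gradients into the full table."""
    grad_Z = np.zeros(Z_shape)
    np.add.at(grad_Z, item_idx, grad_out)    # repeated indices accumulate
    return grad_Z

# Hypothetical embeddings of 4 items with dimension 3.
Z = np.arange(12, dtype=float).reshape(4, 3)
item_idx = np.array([2, 0, 2])               # item 2 appears twice in the batch
E_t = lookup(Z, item_idx)
grad_Z = lookup_grad(Z.shape, item_idx, np.ones_like(E_t))
```

Because item 2 occurs twice in the mini-batch, its row receives twice the gradient, while items outside the batch receive none. The full graph embedding is still computed over all items, and only the selected rows are passed downstream.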
4.3. Sequential Recommendation
The association augmented item embeddings from the former part are not yet directed at any recommendation objective. The subsequent sequential recommendation part captures the sequential relationships and refines the associated embeddings for recommendation. This sequential recommendation part has been studied with many techniques such as attention mechanisms (Donkers et al., 2017; Pei et al., 2017), personalization (Donkers et al., 2017) and auxiliary information (e.g. heterogeneous attributes, knowledge base) (Liu et al., 2018a; Huang et al., 2018; Chen et al., 2018). Since our model concentrates on modeling item association relationships and sequential relationships jointly in sequential recommendation, we adopt one typical RNN-based sequential recommendation model named GRU4Rec (Hidasi et al., 2016a) as the sequential relationship modeling module in our experiments (note that this module can be substituted with other RNN-based sequential models). GRU4Rec is based on the GRU, a variant of the RNN unit that aims to deal with the vanishing gradient problem. GRU4Rec also handles the user cold-start problem because it does not specify a user in the training process, which is practical in real scenarios. Its superior performance has made it a state-of-the-art RNN-based sequential recommendation method.
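After obtaining the association augmented graph embeddings $Z$ of items, we sample the mini-batch spectral embeddings $E_t$ at step $t$, and write $e_t$ for the embedding of one item in $E_t$. The cascading manner allows the association information to flow through the GRU, which is formulated as follows:
(1) the update gate:
$$z_t = \sigma(W_z e_t + U_z h_{t-1}), \tag{6}$$
(2) the reset gate:
$$r_t = \sigma(W_r e_t + U_r h_{t-1}), \tag{7}$$
(3) the candidate activation:
$$\hat{h}_t = \tanh(W e_t + U (r_t \odot h_{t-1})), \tag{8}$$
(4) the output state at step $t$ is given by:
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t, \tag{9}$$
where $W_z$, $W_r$, $W$, $U_z$, $U_r$ and $U$ are all parameters in the GRU cell that we need to learn. $\sigma$ denotes the sigmoid activation function and $\odot$ is the element-wise product.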
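A single GRU step of the kind described above can be sketched in NumPy as follows. This follows the standard GRU formulation used by GRU4Rec; the parameter names (`W_z`, `U_z`, and so on), the row-vector batch layout and the toy sizes are assumptions made for illustration, and bias terms are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(e_t, h_prev, p):
    """One GRU step on a mini-batch; rows of e_t are item graph embeddings."""
    z = sigmoid(e_t @ p["W_z"] + h_prev @ p["U_z"])          # update gate
    r = sigmoid(e_t @ p["W_r"] + h_prev @ p["U_r"])          # reset gate
    h_cand = np.tanh(e_t @ p["W"] + (r * h_prev) @ p["U"])   # candidate state
    return (1.0 - z) * h_prev + z * h_cand                   # new hidden state

# Hypothetical sizes: batch of 2, embedding size 3, hidden size 4.
B, d, hidden = 2, 3, 4
params = {name: np.zeros((d, hidden)) for name in ("W_z", "W_r", "W")}
params.update({name: np.zeros((hidden, hidden)) for name in ("U_z", "U_r", "U")})

e_t = np.ones((B, d))
h_prev = np.full((B, hidden), 0.8)
h_t = gru_step(e_t, h_prev, params)
```

With all-zero weights both gates equal 0.5 and the candidate state is zero, so the new state is exactly half of the previous one, which gives a simple sanity check of the gating arithmetic.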
4.4. Network Learning
Since implicit feedback is more common in real recommendation scenarios, we train our model with the BPR loss (Rendle et al., 2009), a powerful pairwise objective for implicit feedback data. We adopt the same mini-batch negative sampling approach introduced in GRU4Rec. Suppose $u$ is one user, and $i$ and $j$ denote the positive and negative item respectively. The preference of user $u$ for positive item $i$ over negative item $j$ at time step $t$ is $\hat{x}^t_{uij}$, which is represented as:
$$\hat{x}^t_{uij} = h_t^\top Z_i - h_t^\top Z_j, \tag{10}$$
where $h_t$ is regarded as the user representation at time step $t$, and $Z_i$ and $Z_j$ are respectively the spectral graph embeddings of the positive sample item and the negative sample item at step $t+1$. The model aims to predict which item the user may interact with at the next step, thus the supervision comes from the next step.
The objective function is parameterized as:
$$\mathcal{L}_{\text{CAASR}} = -\sum_{(u,t,i,j)} \ln \sigma(\hat{x}^t_{uij}) + \lambda \lVert \Phi \rVert_2^2, \tag{11}$$
where $\sigma(\hat{x}^t_{uij})$ indicates the probability that user $u$ prefers item $i$ to item $j$. $\Phi$ denotes the network parameters. $\lVert \Phi \rVert_2^2$ serves as the regularization term in this objective function, and $\lambda$ is the regularization hyperparameter. A concise description of CAASR is given in Algorithm 1.
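The BPR loss with mini-batch negative sampling can be sketched as follows. This is an illustration of the general technique from GRU4Rec, in which the other target items of the same mini-batch serve as negatives; the function name `bpr_minibatch_loss` and the toy inputs are hypothetical, the loss is averaged over pairs, and the regularization term is left out.

```python
import numpy as np

def bpr_minibatch_loss(H, E_next):
    """Mean BPR loss; row b of H is a user state, row b of E_next its next item."""
    scores = H @ E_next.T                 # scores[b, c]: user b against item c
    pos = np.diag(scores)[:, None]        # score of the true next item
    diff = pos - scores                   # preference margin for every pair
    losses = np.logaddexp(0.0, -diff)     # equals -ln(sigmoid(diff)), stable
    off_diag = ~np.eye(H.shape[0], dtype=bool)
    return losses[off_diag].mean()        # average over the negative pairs

# Hypothetical mini-batch of 3 users with hidden size 3.
loss_uninformed = bpr_minibatch_loss(np.zeros((3, 3)), np.zeros((3, 3)))
loss_separated = bpr_minibatch_loss(10.0 * np.eye(3), 10.0 * np.eye(3))
```

When all scores are equal the loss is $\ln 2$, the value of an uninformed ranker, and it approaches zero as each user state scores its own next item far above the others.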
4.5. Connection to GRU4Rec
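When only the zeroth-order term of the graph convolution in Section 4.1 is kept, $T_0(\tilde{L}) = I_N$ and, because the input $X$ is an identity matrix, the graph embedding reduces to
$$Z = T_0(\tilde{L}) X \Theta_0 = \Theta_0, \tag{12}$$
where $\Theta_0$ is exactly the same as the embedding parameter for the input in GRU4Rec. That is to say, CAASR degrades to a general RNN-based sequential recommendation model when only the order $k = 0$ is used.
In addition, when the number of filters is one, the spectral filter takes the zero-hop neighborhood of each item, meaning that the item’s neighborhood and the item association information are not considered. Our model with a higher-order Chebyshev polynomial can capture high-order association patterns among items.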
4.6. Parallel Style Association Augmented Models
Apart from the proposed CAASR method in cascading style, item association relationships can also be combined with sequential relationships in two kinds of parallel styles. One typical way is co-factorization, developed in non-sequential recommendation. The other is the Autoencoder with regularization, which is widely used in deep learning models to merge one specific kind of information into another. We give a concise introduction to these two parallel styles to better demonstrate the advantages of our CAASR in cascading style.
4.6.1. Co-factorization Regularized
Inspired by the widely used co-factorization technique in non-sequential recommendation, we transfer it to sequential recommendation to introduce the item association information. This combination model named P-Cofactor is depicted in Figure 6.
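Following the method in (Liang et al., 2016), the point-wise mutual information (PMI) between items $i$ and $j$ is defined as:
$$\mathrm{PMI}(i, j) = \log \frac{P(i, j)}{P(i) P(j)}. \tag{13}$$
Empirically, it is estimated as:
$$\mathrm{PMI}(i, j) = \log \frac{\#(i, j) \cdot N_c}{\#(i) \cdot \#(j)}, \tag{14}$$
where $\#(i, j)$ denotes the number of times that items $i$ and $j$ co-occur, $\#(i) = \sum_j \#(i, j)$ and $\#(j) = \sum_i \#(i, j)$. $N_c$ is the total number of co-occurring pairs. Then the (sparse) Shifted Positive Point-wise Mutual Information (SPPMI) is defined as:
$$\mathrm{SPPMI}(i, j) = \max\{\mathrm{PMI}(i, j) - \log \kappa, 0\}, \tag{15}$$
where $\kappa$ is the hyperparameter that controls the sparsity of the SPPMI matrix. If we denote the item SPPMI matrix as $S$, the loss function for P-Cofactor could be formalized as Eq. 16.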
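$$\mathcal{L}_{\text{P-Cofactor}} = \mathcal{L}_{\text{seq}} + \sum_{s_{ij} \neq 0} (s_{ij} - e_i^\top \gamma_j)^2, \tag{16}$$
where $\mathcal{L}_{\text{seq}}$ denotes the typical sequential recommendation module loss, sharing the same formula as Eq. 11, and $s_{ij}$ is the $(i, j)$ entry of the item SPPMI matrix. $e_i$ is the item embedding shared by both the sequential recommendation module and the SPPMI matrix factorization module, and $\gamma_j$ is the item context embedding for item $j$. From the definition of $\mathcal{L}_{\text{P-Cofactor}}$, we can see that this co-factorization technique is actually performing regularization on the learning of the item embedding in sequential recommendation. The learning process is outlined in Algorithm 2.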
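The SPPMI construction used by P-Cofactor can be sketched in plain Python. This follows the empirical PMI estimate of Liang et al. (2016) as described above; the function name `sppmi`, the toy co-occurrence counts and the shift value are hypothetical, and the counts are assumed to be stored symmetrically, with both `(i, j)` and `(j, i)` present.

```python
import math
from collections import defaultdict

def sppmi(cooc, shift):
    """Sparse SPPMI entries from symmetric co-occurrence counts {(i, j): count}."""
    row_sum = defaultdict(float)
    for (i, _), c in cooc.items():
        row_sum[i] += c                      # total co-occurrence count of item i
    total = sum(cooc.values())               # total number of co-occurring pairs
    out = {}
    for (i, j), c in cooc.items():
        pmi = math.log(c * total / (row_sum[i] * row_sum[j]))
        value = pmi - math.log(shift)
        if value > 0:                        # keep only positive shifted PMI
            out[(i, j)] = value
    return out

# Hypothetical counts: items 0 and 1 co-occur often, item 2 rarely.
cooc = {(0, 1): 6, (1, 0): 6, (0, 2): 1, (2, 0): 1, (1, 2): 1, (2, 1): 1}
S = sppmi(cooc, shift=1.5)
```

In this toy example only the frequently co-occurring pair survives the shift, which shows how a larger shift makes the SPPMI matrix sparser.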
4.6.2. Graph Autoencoder Regularized
As one kind of unsupervised learning method, the Autoencoder, together with regularization, has been widely applied to combine two kinds of knowledge, for example the Stacked Denoising Autoencoder (SDAE) (Wang et al., 2015) and the Variational Autoencoder (VAE) (Li and She, 2017) in recommendation. Inspired by this, we construct an item graph Autoencoder to extract the item association pattern as a low-dimensional vector representation. Together with the regularization technique, the architecture of the graph Autoencoder regularized model, named P-GraphAE, is shown in Figure 7.
In P-GraphAE, the Graph Autoencoder (GraphAE) is an essential part that mines the inherent association relationships in the item graph. The graph convolution operation in the encoder is exactly the same as described in Section 4.1. For clarity, we denote $Z$ as the latent representation of nodes in the item graph for P-GraphAE; it shares the same definition as Eq. 3:
$$Z = \sum_{k=0}^{K-1} T_k(\tilde{L}) X \Theta_k. \tag{17}$$
As for the decoder, the aim lies in reconstructing the links in the item graph. In practice, reconstructing all non-existent links in $A$ would easily lead to over-fitting, so we randomly sample a subset of the non-existent links to reconstruct. In our experiments, the number of sampled non-existent links is 5 times larger than the number of existent links for all three datasets. Denoting the training link set as $\mathcal{E}$, the loss of P-GraphAE combines three parts: the typical sequential recommendation module loss, which shares the same formula as Eq. 11; a reconstruction term between the links in $\mathcal{E}$ and the predicted adjacency matrix $\hat{A}$; and a regularization term between the latent item representations of GraphAE and of the sequential recommendation module, which merges the item association information from the item graph into the sequential recommendation module. A concise illustration of P-GraphAE is given in Algorithm 3.
4.7. Comparison Between Cascading Style and Parallel Style
The way of combining item association relationships and sequential relationships matters a lot for both model performance and complexity. A typical recent way of introducing associated information into existing recommendation methods is parallel regularization of the corresponding representations (Liang et al., 2016; Cao et al., 2017). This parallel regularization style puts a distance restriction on two different representations from distinct domains, and it is difficult in practice to choose one specific distance measure and tune the regularization hyperparameter. In contrast, the item association relationships describe the relational features of each item, which constitute inherent knowledge about items. Hence, a cascading way of combining the item association relationships with sequential modeling is expected to perform better. In addition, parallel regularization involves additional reconstruction and regularization operations on the item graph, which makes it more complex than the cascading style.
5. Experiments and Results
5.1. Datasets and Experimental Settings
To demonstrate the effectiveness of our model, we use three real-world datasets: Amazon Instant Video (AIV) and Amazon Cell Phones and Accessories (ACPA), both available at http://jmcauley.ucsd.edu/data/amazon/links.html, and Taobao, available at https://tianchi.aliyun.com/datalab/dataSet.html?spm=5176.100073.0.0.61c835eeI6T5UL&dataId=52.
AIV and ACPA were both collected by McAuley et al. (He and McAuley, 2016). These datasets contain explicit feedback from Amazon (https://www.amazon.com/) with ratings ranging from 1 to 5, collected between 1996.05 and 2014.07. In our experimental settings, we binarize all the explicit feedback.
Taobao is a dataset from the clothing matching competition on the TianChi platform (https://tianchi.aliyun.com/). The user purchase history, from 2014.6.14 to 2015.6.15, is considered as our sequence data here. All three datasets contain URLs to images of products. Note that in our experiments, we utilize these images for analysis rather than for training. We only use user-item interactions with timestamps to train our model.
All three datasets are preprocessed for better model learning. Following (Hidasi et al., 2016a), we filter out users with fewer than $n_u$ feedbacks and items with fewer than $n_i$ interactions for all datasets, where $n_u$ is 5, 15 and 20 and $n_i$ is 5, 10 and 10 for AIV, ACPA and Taobao respectively. For the Taobao dataset in particular, we find a serious click farming problem, so we apply an additional user filter to it. Users whose number of unique items in history is less than 10 are also filtered out.
After preprocessing, the statistics of the three datasets are summarized in Table 3. Following (Hidasi et al., 2016a), for each dataset we randomly sample 80% of the sequences as training data and use the remaining 20% as test data. Note that each sequence is in either the training set or the test set, never both. As is common in recommendation research, items not seen during training are filtered out of the test set.
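The filtering and splitting procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the `(user, item, timestamp)` tuple layout, the single filtering pass (items first, then users) and the fixed random seed are all assumptions, and the TaoBao-specific click farming filter is not reproduced.

```python
import random
from collections import Counter, defaultdict

def filter_interactions(interactions, min_user_feedbacks, min_item_interactions):
    """Drop rare items, then users with too few remaining feedbacks (one pass)."""
    item_counts = Counter(item for _, item, _ in interactions)
    kept = [r for r in interactions if item_counts[r[1]] >= min_item_interactions]
    user_counts = Counter(user for user, _, _ in kept)
    return [r for r in kept if user_counts[r[0]] >= min_user_feedbacks]

def build_sequences(interactions):
    """Group interactions by user and order each user's items by timestamp."""
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))
    return {u: [item for _, item in sorted(events, key=lambda e: e[0])]
            for u, events in by_user.items()}

def split_sequences(sequences, train_ratio=0.8, seed=0):
    """Assign each whole sequence to train or test; drop unseen items from test."""
    users = sorted(sequences)
    random.Random(seed).shuffle(users)
    cut = int(len(users) * train_ratio)
    train = {u: sequences[u] for u in users[:cut]}
    seen = {item for seq in train.values() for item in seq}
    test = {u: [i for i in sequences[u] if i in seen] for u in users[cut:]}
    return train, test
```

Splitting by whole sequences, instead of by individual interactions, is what guarantees that no sequence appears in both sets.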
5.1.2. Evaluation Metrics
Sequential recommendation usually works with implicit feedback, where the goal is to correctly predict the next item that the user will interact with. A good recommendation means that the target item should be among the first few recommended items. Thus, in accordance with recent work on recommendation evaluation, we adopt Recall@k and MRR@k as our evaluation metrics.
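As a hedged illustration of these two metrics, the sketch below uses their common definitions for next-item prediction: Recall@k is the fraction of test cases whose target item is ranked in the top k, and MRR@k averages the reciprocal rank of the target, counting ranks beyond k as zero. The function names and the dictionary-of-scores input are assumptions made for this example; the paper does not spell out its formulas in this section.

```python
def rank_of_target(scores, target):
    """1-based rank of `target` when items are sorted by descending score."""
    target_score = scores[target]
    return 1 + sum(1 for s in scores.values() if s > target_score)

def recall_at_k(ranks, k):
    """Fraction of test cases whose target item is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mrr_at_k(ranks, k):
    """Mean reciprocal rank, where ranks beyond k contribute zero."""
    return sum(1.0 / r for r in ranks if r <= k) / len(ranks)
```

Recall@k only checks whether the target appears in the list, whereas MRR@k also rewards placing it near the top, so the two metrics can disagree.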
We compare our CAASR with the following baselines:
BPR. Bayesian Personalized Ranking (BPR) (Rendle et al., 2009) is a matrix factorization method for implicit feedback. BPR cannot be directly applied to sequential recommendation because a new user does not have a representation vector. Thus, following GRU4Rec, we use the average of the feature vectors of the items in the user’s history as the user representation.
BPR+KNN. K-Nearest Neighbors (KNN) is a common method in many practical recommendation systems. Using the item vectors learned by the BPR model, we apply KNN at each time step to predict the next item that the user will interact with. The similarity between items is defined as cosine similarity. Compared to BPR, this provides a baseline that concentrates on the user’s short-term interest.
GRU4Rec. GRU4Rec (Hidasi et al., 2016a) is a typical RNNs based sequential recommendation method which captures the sequential relationships in a user’s sequential data. It also supports the user cold-start setting because it does not specify a user in the training process, which is practical in real scenarios. Its strong performance makes it a state-of-the-art RNNs based method for sequential recommendation.
P-Cofactor. Following the item association mining approach in non-sequential recommendation (Liang et al., 2016), P-Cofactor introduces the co-factorization technique to incorporate item association relationships and sequential relationships in sequential recommendation. This co-factorization technique effectively acts as a kind of regularization for RNNs based sequential recommendation.
P-GraphAE. P-GraphAE is the parallel counterpart of our cascading CAASR model, which applies the popular autoencoder and regularization approach to combine item association relationships and sequential relationships in sequential recommendation. Alongside P-Cofactor, it serves as the other parallel-style method for comparison with CAASR.
Recall that our CAASR concentrates on modeling both item association relationships and sequential relationships in sequential user behavior, and that we adopt GRU4Rec as the sequential relationship modeling method. This makes GRU4Rec an ideal contrast baseline for CAASR. P-Cofactor and P-GraphAE aim to incorporate item association relationships with sequential relationships in parallel. Thus they serve as the parallel baselines for our cascading method.
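The two non-neural baselines can be sketched as below, assuming item vectors have already been learned by BPR. The function names are hypothetical, the dot-product scoring is the usual matrix factorization convention, and using only the most recent item as the KNN query is our assumption about what "at each time step" means.

```python
import math

def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity, defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def bpr_scores(history, item_vecs):
    """BPR baseline: user vector = mean of history item vectors; score = dot product."""
    user = mean_vector([item_vecs[i] for i in history])
    return {i: sum(a * b for a, b in zip(user, v)) for i, v in item_vecs.items()}

def knn_scores(last_item, item_vecs):
    """BPR+KNN baseline: score candidates by cosine similarity to the latest item."""
    query = item_vecs[last_item]
    return {i: cosine(query, v) for i, v in item_vecs.items() if i != last_item}
```

Averaging over the whole history reflects long-term preference, while querying with the latest item alone reflects the short-term interest mentioned above.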
5.1.4. Experimental Settings
For all models, we set the maximum number of iterations to 30. The batch size is 50. The latent dimension size ranges over [50, 100, 150, 200, 250, 300], and the best is chosen for each model according to the performance on the test set.
For BPR and BPR+KNN, we use a regularization term with hyper-parameter 0.01 to avoid over-fitting. Only one GRU layer is used for GRU4Rec and the models based on GRU4Rec. For simplicity, we set and apply Dropout (Srivastava et al., 2014) technique for regularization. The Dropout rate is 0.2 for all deep learning based methods. Parameters are initialized from a uniform distribution over the range -0.1 to 0.1. RMSprop is adopted for optimization. Table 4 shows the hyper-parameter settings for all models on the three datasets.
5.2. Quantitative Results
Overall performance of all models on three datasets. The relative improvement of our CAASR over GRU4Rec is given in the table. Model names with * refer to our proposed methods. The t-test results of the other baselines compared to CAASR are also shown in the table, with separate markers indicating p-value<0.01, p-value<0.05 and p-value>0.05.
5.2.1. Overall Performance
(1) From this table, we can observe that the deep learning based methods (GRU4Rec, P-GraphAE and CAASR) yield superior performance compared to BPR and BPR+KNN. This is reasonable given the stronger representation learning ability of deep learning.
(2) Our proposed model achieves the best performance on all three datasets. Compared to the state-of-the-art GRU4Rec, we achieve a 23.16% relative improvement at Recall@20 on ACPA and a 16.23% relative improvement at MRR@20 on AIV. Recall that the only difference between our CAASR and GRU4Rec is the additional item association embedding part. This demonstrates the significance of modeling item association relationships for sequential user behavior.
(3) From the table, it is clear that P-Cofactor, developed from GRU4Rec, performs worse than GRU4Rec on almost every dataset. On ACPA in particular, it is 1.43% and 1.15% lower than GRU4Rec on Recall@10 and Recall@20 respectively. Recall that P-Cofactor directly introduces the co-factorization technique to incorporate the item association relationships into the RNNs based sequential recommendation method. The result indicates that the co-factorization technique from non-sequential recommendation actually harms the performance of GRU4Rec. This happens mainly because regularization is a poor fit between the deep representations learned by RNNs and the shallow representations obtained through traditional matrix factorization.
(4) Unlike P-Cofactor, P-GraphAE and CAASR both adopt the graph convolution operation. The parallel-style P-GraphAE generally performs slightly better than GRU4Rec on AIV and TaoBao, while it fails to do so on ACPA. In contrast, our cascading-style CAASR performs better than GRU4Rec and P-GraphAE on all datasets. Regarding Recall@20 on ACPA, it achieves a 2.31% and 3.46% improvement over GRU4Rec and P-GraphAE respectively. These results support the cascading style over the parallel style. In addition, compared to P-GraphAE, the cascading CAASR has lower complexity and generalizes the input of RNNs based sequential recommendation. This allows it to reduce to a general RNNs based sequential recommendation model under specific conditions, which P-GraphAE cannot do.
(5) To verify whether the improvement of CAASR is statistically significant, we conduct a t-test, and the resulting p-values are shown in Table 5 with different markers. In this table, each p-value refers to the comparison between CAASR and one of the baselines. The p-values in Table 5 are all less than 0.01 or 0.05, which confirms that the improvements of our CAASR are statistically significant.
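For readers comparing the percentages in the observations above, a relative improvement is conventionally computed as follows. This is the standard definition, stated here as an assumption about how the reported figures were obtained; the symbols $m_{\text{CAASR}}$ and $m_{\text{GRU4Rec}}$ are our notation for the same metric measured on the two models.

```latex
\text{relative improvement}
  = \frac{m_{\text{CAASR}} - m_{\text{GRU4Rec}}}{m_{\text{GRU4Rec}}} \times 100\%
```

Under this reading, the 23.16% relative improvement in item (2) and the 2.31% improvement in item (4), both for Recall@20 on ACPA against GRU4Rec, would be consistent if the latter is an absolute difference in percentage points.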
5.2.2. Performance on Different Latent Dimensions
In Figure 8, we show the performance of all models on Recall@20 and MRR@20 with different latent dimensions. In Figure 8 (a)(c)(e), we can see that our model reaches better Recall@20 on the three datasets at almost every latent dimension. As for MRR@20 in Figure 8 (b)(d)(f), our model still performs better when the latent dimension is high. Note that Recall@20 and MRR@20 are different kinds of evaluation metrics: a better Recall@20 does not necessarily imply a better MRR@20. We therefore compare our model in terms of both recall ability and ranking ability.
Moreover, Figure 8 indicates that BPR, BPR+KNN and GRU4Rec tend to reach their best performance at low dimensions, while our CAASR tends to reach its best performance at high dimensions. This is probably because both BPR and GRU4Rec make insufficient use of the inputs. When the dimension is high, the information in the learned vectors tends to be redundant and is less robust on the test set. In contrast, CAASR learns item association relationships in the spectral graph domain with more information flow, which requires a higher dimension to adequately encode the data.
From Figure 8, it is also clear that P-Cofactor generally performs worse than GRU4Rec at every dimension, which reinforces the claim that co-factorization developed from non-sequential recommendation does not bring reliable gains for RNNs based sequential recommendation. As for Recall@20 and MRR@20 of P-GraphAE, it shares a similar trend with CAASR, tending to achieve better performance as the latent dimension increases. However, P-GraphAE, despite its higher complexity, still cannot perform better than CAASR, which underlines the advantage of the cascading style.
5.2.3. Training Loss and Performance.
In order to explore the influence of the association relationships and of the way they are combined with the sequential relationships captured by the RNNs based sequential recommendation method, we show the training BPR loss, test Recall@20 and test MRR@20 over training epochs on all three datasets in Figure 9.
It is clear from Figure 9 (a)(d)(g) that CAASR converges faster and achieves a lower BPR loss on all three datasets than the other three methods. As for Recall@20 and MRR@20 on AIV shown in Figure 9 (b)(c), CAASR reaches its best performance in fewer than 5 epochs and further training causes over-fitting, while the other three models need more training epochs. As for performance on ACPA and TaoBao in Figure 9 (e)(f)(h)(i), CAASR outperforms the others throughout the training process, and the gap is more obvious in the early epochs, which reflects the fast convergence of CAASR. In summary, combining item association relationships and sequential patterns in the cascading style accelerates the convergence of the BPR loss and helps the RNNs based sequential recommendation reach a better solution for the final recommendation target.
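For reference, the BPR loss tracked in these curves is a pairwise ranking loss: it pushes the score of the observed next item above the scores of sampled negative items. The sketch below follows the sampled form commonly used with GRU4Rec (the mean of the negative log-sigmoid of the score differences). The function name and the choice of averaging over negatives are assumptions, not details confirmed by this section.

```python
import math

def bpr_loss(positive_score, negative_scores):
    """Mean of -log(sigmoid(positive - negative)) over sampled negatives."""
    def log_sigmoid(x):
        # Numerically stable log(sigmoid(x)) for both signs of x.
        if x >= 0:
            return -math.log1p(math.exp(-x))
        return x - math.log1p(math.exp(x))

    total = sum(log_sigmoid(positive_score - s) for s in negative_scores)
    return -total / len(negative_scores)
```

The loss approaches zero as the positive item's score moves further above every negative score, and equals log 2 when the scores are tied.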
5.2.4. Parameter Analysis on Latent Dimension and Chebyshev Order.
As the Chebyshev polynomial order is involved in our method, it is interesting to see whether a higher Chebyshev polynomial order benefits the recommendation accuracy. To this end, we further investigate our CAASR with different orders on two datasets: AIV and ACPA (the TaoBao dataset is not included here due to our limited computational resources). The results for both Recall@20 and MRR@20 on the two datasets are summarized in Tables 6 and 7.
As we can see, at the same model capacity (with the same latent factors) and with the Chebyshev polynomial order ranging over [3, 4, 5], the order at which CAASR performs best differs between AIV and ACPA. Recall that the Chebyshev polynomial order denotes the number of hops from the central node. Too large an order may lead the model to consider uninformative neighborhoods. Therefore the proper order tends to depend on the specific dataset.
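To illustrate why the Chebyshev order controls how far information travels in the item graph, the following sketch applies the standard Chebyshev recurrence, T_k = 2 L T_{k-1} - T_{k-2}, to a unit signal placed on one node of a toy path graph. It is an illustration under assumptions, not the model's implementation: the toy graph, the helper names, and the simplification of taking the largest Laplacian eigenvalue as 2 are ours.

```python
import math

def scaled_laplacian(adj):
    """Return -D^{-1/2} A D^{-1/2}: the rescaled Laplacian when lambda_max is taken as 2."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    return [[-adj[i][j] / math.sqrt(deg[i] * deg[j]) if adj[i][j] else 0.0
             for j in range(n)] for i in range(n)]

def matvec(matrix, x):
    """Dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

def chebyshev_terms(lap, x, order):
    """Return [T_0(L)x, ..., T_order(L)x] via T_k = 2 L T_{k-1} - T_{k-2}."""
    terms = [list(x)]
    if order >= 1:
        terms.append(matvec(lap, x))
    for _ in range(2, order + 1):
        prev = matvec(lap, terms[-1])
        terms.append([2 * p - q for p, q in zip(prev, terms[-2])])
    return terms

# Path graph 0-1-2-3-4-5 with a unit signal on node 0.
n = 6
adj = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]
signal = [1.0] + [0.0] * (n - 1)
terms = chebyshev_terms(scaled_laplacian(adj), signal, order=3)
# Farthest node (in hops from node 0) that each term touches.
reach = [max(i for i, v in enumerate(t) if abs(v) > 1e-12) for t in terms]
```

In this toy setting each additional order extends the reach by exactly one hop, which is the sense in which a larger order pulls in more distant, and possibly less informative, neighbors.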
5.2.5. Sparse Train Data.
Training data of different sparsity may affect different methods differently. Sparse training data directly leads to a sparse item graph, which makes it meaningful to explore whether our graph embedding based methods can still outperform the general GRU4Rec. Therefore, we conduct an experiment on model performance with different ratios of training data while keeping the test data fixed. Figure 10 shows the corresponding results on AIV, ACPA and TaoBao.
From Figure 10, we can see that, in general, the performance of all models rises as the amount of training data increases. Our cascading CAASR is capable of outperforming the other methods under different training data conditions, including both the general GRU4Rec method and the parallel methods (P-Cofactor and P-GraphAE). In Figure 10 (c)(f) on TaoBao, P-GraphAE performs slightly better than CAASR under some training data ratios, while it clearly falls behind on AIV and ACPA in Figure 10 (a)(b)(d)(e). The cascading style, in contrast, has lower computational complexity and more robust performance than P-GraphAE, which supports our choice of combining item association relationships with sequential relationships in the cascading style for sequential user behavior.
5.2.6. Dropout Regularization.
In our method, we apply Dropout as the regularization technique for RNNs based methods. To study the influence of Dropout, we explore the model performance with different Dropout rates over the training process on the three datasets; the results are shown in Figure 11.
From the figure, we can see that Dropout affects different datasets differently. In Figure 11 (a)(d)(g) for the three datasets, it is clear that using no Dropout (Dropout=0.0) leads the model to a lower loss, that is, it fits the training data better. However, the model performance with Dropout=0.0 varies a lot across the three datasets. For instance, in Figure 11 (e)(f) for ACPA and (h)(i) for TaoBao, CAASR still reaches reasonable performance when the Dropout rate equals 0.0, which means no Dropout is used. This is mainly because the data distribution is complex, and omitting Dropout may help the model fit the data better. In contrast, omitting Dropout is not suitable for the AIV dataset in Figure 11 (b)(c), because CAASR with no Dropout clearly over-fits the data. A Dropout rate of 0.6 gives better performance on AIV, which indicates that CAASR needs some regularization to avoid over-fitting. In summary, the Dropout rate is best chosen according to the specific dataset in order to robustly match the data distribution.
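As a reminder of what the Dropout rate controls, here is a minimal sketch of the widely used "inverted" formulation, which rescales the surviving units during training so that nothing needs to change at test time. The function name and signature are assumptions for illustration, and this section does not state at which layer Dropout is applied in CAASR.

```python
import random

def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    # Survivors are divided by `keep` so the expected activation is unchanged.
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
hidden = [1.0] * 8
no_dropout = dropout(hidden, rate=0.0, rng=rng)      # identical to the input
heavy_dropout = dropout(hidden, rate=0.6, rng=rng)   # each entry is 0.0 or 2.5
```

With a rate of 0.0 the network sees every unit on every step, which matches the lower training loss reported for the no-Dropout setting.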
5.3. Qualitative Results
5.3.1. Sequential Recommendation.
Recall that our CAASR method jointly considers the item association relationships and sequential relationships for recommendation in sequential user behavior, while the typical RNNs based sequential recommendation method GRU4Rec mainly focuses on modeling the sequential relationships. Therefore, it is interesting to see whether CAASR benefits from item association in real recommendation sequences, and we conduct an experiment to verify this. In particular, Figure 12 (a)(c) shows two examples from the test sets of the ACPA and TaoBao datasets respectively. The corresponding MRR@20 of the predictions at each step is shown in Figure 12 (b) and Figure 12 (d) respectively.
ACPA is a dataset of user consumption history on cell phones and accessories. The items in user1’s history are shown in Figure 12 (a). Unlike the transition from the cell phone at step 1 to its accessories at the following steps, the accessories themselves (phone screen savers at step 2, vehicle charger at step 3, phone cases at steps 4, 5 and 7, and mobile source at step 6) do not have obvious sequential relationships among them; they reflect association relationships instead. We notice that in Figure 12 (b), our model CAASR exactly predicts the phone sticker at step 2 after the cell phone purchase at step 1, while GRU4Rec fails. This means our model is capable of capturing item association relationships that GRU4Rec misses. At step 4 and step 5, the items are both phone shells. By exploiting the association relationships among items, our model exactly predicts the phone shell at step 5 and ranks it first. This is very practical in real scenarios.
Items in the TaoBao dataset record user purchase history on fashion and clothing. From the purchase sequence of user2 in Figure 12 (c), we notice that the user was inclined to buy warm clothes, as at step 1, step 3 and step 4, mainly due to the weather. From step 4 to step 5, the user began to buy cool clothing. The change from warm clothes to cool ones does reflect a sequential relationship because of the change of seasons. Within one particular season or type of weather, however, as with the items at steps 1, 2, 3, 4 and the items at steps 5, 6, 7, 8, there is no particular sequential relationship. We note that the pale brown trousers at step 6 and the black trousers at step 8 reflect an association relationship rather than a sequential one. Therefore in Figure 12 (d), we can see that our CAASR, which incorporates this kind of association relationship, achieves better performance than GRU4Rec at these steps.
5.3.2. Item Embedding Analysis.
To verify the above hypothesis, we calculate the pair-wise cosine similarity for all item pairs based on the learned item embeddings. For each pair of items, we compute the cosine similarity based on the item embeddings of GRU4Rec and of CAASR respectively. We take the 8 item pairs whose CAASR similarity exceeds their GRU4Rec similarity by the largest margin. The corresponding item images are shown in Figure 13 (a-h). We can see that the item pairs are mainly complementary products, like the skirt with sandals in Figure 13 (b)(d). This suggests that, compared to the GRU4Rec embeddings, our CAASR model better captures the association relationships among complementary products.
We also take the 8 item pairs whose GRU4Rec similarity exceeds their CAASR similarity by the largest margin. The corresponding item images are shown in Figure 13 (i-p). In this figure, we see that the item pairs are mainly items in the same category, like the shoes in Figure 13 (i)(j), or irrelevant item pairs, like shoes with washing liquid in Figure 13 (l) and sandals with snow boots in Figure 13 (k). This suggests that, compared to the CAASR embeddings, GRU4Rec captures relatively uninformative association relationships among products.
This matches our expectation well, since we introduce the item association graph to model the association relationships in sequential user behavior. Our model is able to recommend associated and complementary items. This characteristic benefits recommendation accuracy in real scenarios, because a user tends to buy an associated or complementary item at the next step rather than an irrelevant product.
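The pair selection used in this analysis can be sketched as follows: compute the cosine similarity of every item pair under two embedding tables, then keep the pairs where the first table's similarity most exceeds the second's. The function names and the dictionary layout are assumptions for illustration, and the exhaustive loop over all pairs is only practical for small item sets.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity, defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_similarity_gaps(emb_a, emb_b, top_n=8):
    """Item pairs with the largest cos_a(i, j) - cos_b(i, j), highest gap first."""
    gaps = []
    for i, j in combinations(sorted(emb_a), 2):
        gap = cosine(emb_a[i], emb_a[j]) - cosine(emb_b[i], emb_b[j])
        gaps.append((gap, i, j))
    gaps.sort(reverse=True)
    return [(i, j, gap) for gap, i, j in gaps[:top_n]]
```

Calling `top_similarity_gaps(caasr_emb, gru4rec_emb)` would give pairs that CAASR considers more similar than GRU4Rec does, and swapping the two arguments gives the opposite direction; both variable names are placeholders.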
6. Conclusion and Future Work
In this paper, we point out that both association relationships and sequential relationships exist in sequential user behavior data. These two inherent characteristics make it essential to capture both of them for sequential recommendation. Although RNNs based sequential recommendation methods are capable of modeling sequential relationships, they fail to emphasize association relationships. We therefore propose the CAASR model, which incorporates item association relationships and sequential relationships for sequential recommendation in a unified, cascading style. In addition, two parallel styles of combining item association relationships and sequential relationships are explored and analyzed in this paper to demonstrate the advantages of the cascading style. To the best of our knowledge, CAASR is the first model that concentrates on item association relationships in RNNs based sequential recommendation. We conduct extensive experiments on three real-world datasets: AIV, ACPA and TaoBao. The results demonstrate the effectiveness of our model both quantitatively and qualitatively.
In the future, we will explore a more general and efficient framework for combining association relationships and sequential relationships when modeling sequential user behavior, in order to achieve better recommendation performance in this scenario.
- Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §2.1.
- Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pp. 2787–2795. Cited by: §2.1.
- Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203. Cited by: §2.3.
- Embedding factorization models for jointly recommending items and user generated lists. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 585–594. Cited by: Table 1, §2.2, §4.7.
- Sca-cnn: spatial and channel-wise attention in convolutional networks for image captioning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 6298–6306. Cited by: §2.1.
- Sequential recommendation with user memory networks. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 108–116. Cited by: §1, §2.1, §4.3.
- On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259. Cited by: §2.1.
- Spectral graph theory. American Mathematical Soc.. Cited by: §2.3.
- Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852. Cited by: §2.3, §3.2, §3.2, §4.1, §4.1, §4.2.
- Sequential user-based recurrent neural network recommendations. In Proceedings of the Eleventh ACM Conference on Recommender Systems, pp. 152–160. Cited by: §1, §2.1, §2.1, §4.3.
- Exploring temporal effects for location recommendation on location-based social networks. In Proceedings of the 7th ACM conference on Recommender systems, pp. 93–100. Cited by: §1, §2.1.
- Long short-term memory. Springer Berlin Heidelberg. Cited by: §2.1.
- Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. Cited by: §2.3.
- Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034. Cited by: §2.3, §3.2.
- Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis 30 (2), pp. 129–150. Cited by: §3.2.
- Translation-based recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems, pp. 161–169. Cited by: §2.1.
- Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In proceedings of the 25th international conference on world wide web, pp. 507–517. Cited by: §5.1.1.
- An empirical analysis of design choices in neighborhood-based collaborative filtering algorithms. Information retrieval 5 (4), pp. 287–310. Cited by: §2.2.
- Session-based recommendations with recurrent neural networks. Cited by: §1, §1, §2.1, §4.3, 3rd item, §5.1.1, §5.1.1.
- Parallel recurrent neural network architectures for feature-rich session-based recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 241–248. Cited by: §2.1.
- Improving sequential recommendation with knowledge-enhanced memory networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 505–514. Cited by: §1, §2.1, §4.3.
- Neural survival recommender. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 515–524. Cited by: §1, §2.1.
- Multiverse recommendation: n-dimensional tensor factorization for context-aware collaborative filtering. In Proceedings of the fourth ACM conference on Recommender systems, pp. 79–86. Cited by: §2.1.
- Convolutional neural networks. In Deep Learning with Python, pp. 63–78. Cited by: §2.1.
- Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §2.3, §3.2, §4.1, §4.2.
- Yahoo! music recommendations: modeling music ratings with temporal dynamics and item taxonomy. In Proceedings of the fifth ACM conference on Recommender systems, pp. 165–172. Cited by: §1.
- Matrix factorization techniques for recommender systems. Computer (8), pp. 30–37. Cited by: §2.2.
- Collaborative filtering with temporal dynamics. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 447–456. Cited by: §2.1.
- Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pp. 2177–2185. Cited by: §2.3.
- Neural attentive session-based recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1419–1428. Cited by: §2.1.
- Collaborative variational autoencoder for recommender systems. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 305–314. Cited by: §4.6.2.
- A time-aware personalized point-of-interest recommendation via high-order tensor factorization. ACM Transactions on Information Systems 35 (4), pp. 31:1–31:23. Cited by: §2.1.
- Factorization meets the item embedding: regularizing matrix factorization with item co-occurrence. In Proceedings of the 10th ACM conference on recommender systems, pp. 59–66. Cited by: Table 1, §2.2, §4.1, §4.6.1, §4.7, 4th item.
- A sequential embedding approach for item recommendation with heterogeneous attributes. arXiv preprint arXiv:1805.11008. Cited by: §1, §2.1, §4.3.
- STAMP: short-term attention/memory priority model for session-based recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1831–1839. Cited by: §2.1.
- A deep bayesian tensor-based system for video recommendation. ACM Transactions on Information Systems 37 (1), pp. 7:1–7:22. Cited by: §1.
- A wavelet tour of signal processing. Academic press. Cited by: §2.3.
- Contextual video recommendation by multimodal relevance and user feedback. ACM Transactions on Information Systems 29 (2), pp. 10:1–10:24. Cited by: §2.2.
- Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §2.3.
- Interacting attention-gated recurrent networks for recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1459–1468. Cited by: §1, §2.1, §4.3.
- Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. Cited by: §2.3.
- Product-based neural networks for user response prediction over multi-field categorical data. ACM Transactions on Information Systems 37 (1), pp. 5:1–5:35. Cited by: §1.
- BPR: bayesian personalized ranking from implicit feedback. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, pp. 452–461. Cited by: §4.4, 1st item.
- Factorizing personalized markov chains for next-basket recommendation. In Proceedings of the 19th international conference on World wide web, pp. 811–820. Cited by: §2.1.
- Learning representations by back-propagating errors. nature 323 (6088), pp. 533. Cited by: §2.1.
- The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine 30 (3), pp. 83–98. Cited by: §4.1.
- Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 650–658. Cited by: §2.2.
- Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §5.1.4.
- Line: large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077. Cited by: §2.3.
- Graph attention networks. arXiv preprint arXiv:1710.10903 1 (2). Cited by: §2.3, §3.2.
- Collaborative deep learning for recommender systems. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1235–1244. Cited by: §4.6.2.
- Recurrent recommender networks. In Proceedings of the tenth ACM international conference on web search and data mining, pp. 495–503. Cited by: §1, §2.1.
- Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §2.1.
- Version-aware rating prediction for mobile app recommendation. ACM Transactions on Information Systems 35 (4), pp. 38:1–38:33. Cited by: §1.
- A dynamic recurrent model for next basket recommendation. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 729–732. Cited by: §1, §2.1.
- Time-aware point-of-interest recommendation. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pp. 363–372. Cited by: §1.