I Introduction
Recommender systems play a critical role in the fast-developing Internet age, aiming to predict the items in which users are most likely to be interested. Collaborative filtering is an efficient and widely used approach in recommendation, which aims to capture latent user and item features from historical interactions. Early works like Matrix Factorization (MF) [17]
decompose a rating matrix into user and item embeddings to capture implicit semantics. As the scale of users and items has increased rapidly in recent years, more deep learning models have been built on collaborative filtering to characterize diverse user tastes over a large number of items. For example,
[1] and [37] build user-item graphs to integrate the multi-hop relationships of interactions. [20] and [19] introduce a variational autoencoder framework into the model and infer the representation as a Gaussian distribution.
Sequential recommendation is an important part of recommender systems. It models user behaviors as a sequence of items instead of a set of items. The Markov Chain (MC) [6] is a classic method, which models short-term item transitions and predicts the next item a user may like. The Factorized Personalized Markov Chain (FPMC) [29] combines the Markov chain and traditional matrix factorization to model user preferences. With the development of deep learning, Recurrent Neural Networks (RNNs) have achieved success in sequential recommendation. For example, Long Short-Term Memory (LSTM) [41] is a common variant of RNN that enhances the model's ability to maintain sequential information through memory cells. GRU4Rec [5] applies Gated Recurrent Units (GRU) to session-based recommendation by introducing session-parallel mini-batches. RNN-based methods face the challenge of maintaining long-range information, so self-attention networks have recently been applied to sequential recommendation to capture both long-term and short-term dependencies. SASRec and BERT4Rec both achieve good prediction results with attention mechanisms. SASRec [13] is able to capture long-term dependencies because it takes into account the attention weights over a whole historical sequence. BERT4Rec [31] employs a deep bidirectional self-attention network with Cloze tasks to increase the efficiency of the transformer model.
These previous works model user intentions with historical sequential interactions, ignoring the dynamic underlying relationships behind items. The edges that link pairwise items contain abundant semantic information about why and how users choose one item after another. These underlying factors are related to real-world concepts, and one certain factor often plays a leading role in a single situation. For example, suppose there are two users interacting with six items, as shown in Figure 1. The link graph shows that item 2 is adjacent to all the other five items, but these edges are intuitively motivated by different factors. Item 2 is linked to items 1 and 4 because they are of the same color, while it is linked to items 5 and 6 because they have short sleeves. Item 3 is connected to item 2 because it can be worn as a jacket over a T-shirt. These different factors show the intention transformation of user behaviors, and also reveal the shared features of pairwise items. Therefore, recognizing and distinguishing the underlying item-link factors can enhance the expressive ability of models, and disentangled representation learning
[22] is a common method proposed to achieve this goal.
Disentangled representation learning has gained great popularity in many fields such as Computer Vision [12, 26], and it has recently been applied to recommender systems. The general purpose of disentangled representation learning is to separate the distinct and informative factors from the variations of the data, where each unit is related to a single concept in the real world. A single change of one factor will lead to a change of the relevant unit. Many models have been proven to be able to learn disentangled representations and have been applied to realistic tasks. For example, by learning a disentangled representation of a face image, we can obtain independent representations of different features of the face. Then we are able to identify whether a person in a picture has bangs, is wearing glasses, is smiling, and so on. Further, we can change these features directionally by modifying the values of the corresponding dimensions. Therefore, learning disentangled representations can enhance the interpretability and controllability of a model.
The most prominent networks for learning disentangled representations are β-VAE and InfoGAN. β-VAE [11] adds a coefficient hyperparameter β to the KL-divergence term in the objective of the variational autoencoder to encourage more factorized latent representations. This extra hyperparameter puts heavy pressure on the posterior distribution to match the factorized prior distribution [2]. InfoGAN [4] maximizes the mutual information between a fixed small subset of the GAN's noise variables and the observations. With regard to recommender systems, MacridVAE [24] infers high-level concepts of user intentions at the macro level and applies VAE to enhance disentanglement at the micro level. The authors also propose a self-supervised seq2seq training strategy for sequential recommendation [25], which compares user intentions between subsequences generated by an intention-disentangled encoder. DGCF [38] devises intent-aware interaction graphs to distinguish user intentions over different items, focusing on user-item relationships. However, these studies do not consider item-link relationship patterns, and fail to distinguish the different user intentions behind sequences. As a result, the sequential model will be sensitive to noisy data and hardly interpretable.
In this paper, we propose an Edge-Enhanced Global Disentangled Graph Neural Network (EGD-GNN) model to capture the item-link information. We model item representations and user intentions at both the global and the local levels. At the global level, we build a global item-link graph over all sequences, and each item pair in the sequences is denoted as an edge in the graph. Figure 1 shows an example of the construction of a global graph over two sequences. We apply the channel-aware mechanism to decompose edges into several channels, where each channel corresponds to an influential factor. The channels extract features specific to one of the disentangled factors from the neighbors and aggregate the different factors jointly into the target item. At the local level, we model a disentangled user-intention representation over the current sequence. We first infer the latent variable as a Gaussian distribution in order to enforce disentanglement from the statistical perspective of the variational autoencoder. Then we use the channel-aware mechanism and aggregate item information through the edge channels from former items in the sequence. The aggregated item representation is used to express the current user intention. We conduct experiments on three real-world datasets and compare the prediction results with state-of-the-art baselines. Results show that the EGD-GNN model not only outperforms previous works on prediction tasks, but also learns good disentangled representations.
The major contributions of this paper are summarized as follows:

To the best of our knowledge, this is the first work to explore disentangled representations at both the global and the local levels to learn the factorized underlying factors of item relationships.

We propose a disentangled graph neural network to infer the factors behind pairwise items which motivate user intentions. We apply the GNN model to the global graph and the local sequences to learn item-link patterns, and employ a variational autoencoder to take advantage of its statistical properties.

We evaluate our model on three real-world datasets, and experimental results show that our proposed model is able to achieve good disentangled representations over sequences that indicate user intentions.
The rest of this paper is organized as follows. Firstly, we review work related to ours in Section II. Then we give the problem formulation and definitions of our model and introduce preliminaries in Section III. Next, we present the details of our proposed model in Section IV. Section V records and visualizes the experimental results, demonstrating the effectiveness of our model. Finally, we conclude the paper and outline future work in Section VI.
II Related Work
In this section, we present the recent works related to our model, including Sequential Recommendation, Disentangled Representation Learning, and Graph Neural Network.
II-A Sequential Recommendation
Recommender systems have been extensively studied over the last two decades. They aim to predict users' preferences from historical behaviors. Matrix Factorization (MF) [17] is the most common framework for prediction, which learns user and item embeddings respectively to model latent relationships between users and items. Further studies like SVD++ [18] combine the neighborhood model and the latent factor model, and propose a new globally optimized neighborhood model. Sequential recommendation is an important branch of recommender systems: given a chronological item sequence of a user's historical behaviors, a sequential recommender predicts the next item with which the user is likely to interact.
The Markov Chain (MC) [6] is a classical model to capture short-term item transitions. FPMC [29] further combines Matrix Factorization and the Markov Chain to model both long-term preferences and short-term transitions. Fossil [9] combines similarity-based models with high-order Markov Chains. TransRec [8] turns a user embedding into a translation vector and considers the third-order relationships between users, candidate items, and previous behaviors. With the proposal of the Recurrent Neural Network (RNN), researchers have proposed numerous works based on this sequential framework and its variants. For example, Time-LSTM [41] uses Long Short-Term Memory (LSTM) to model time intervals with time gates, and GRU4Rec [5] uses Gated Recurrent Units (GRU) to model click sequences for session-based recommendation. Recently, more deep neural networks have been applied to model sequence patterns. Caser [32] employs a Convolutional Neural Network (CNN) to capture sequential patterns as local features of an 'image' formed by embedding the most recent items of a sequence. SASRec [13] finds the relevance between items adaptively using the self-attention mechanism. BERT4Rec [31] employs a deep bidirectional self-attention network with a Cloze task to increase the efficiency of the transformer model. SVAE [30] leverages a Variational Autoencoder (VAE) to handle the temporal information of sequences. However, these previous works do not distinguish the various contributions of neighbors over different aspects.
II-B Disentangled Representation Learning
The purpose of learning disentangled representations is to find independent factors in the latent space. Each dimension of the representation has a specific, independent, and human-understandable meaning. For example, learning disentangled representations over face pictures yields representations regarding eyes, hair, smiles, etc., while learning disentangled representations over landscape pictures yields representations regarding trees, sky, buildings, etc. Distinguishing such features in the representations brings enhanced robustness, interpretability, and controllability. Therefore, it has been a popular task in many fields such as computer vision [12] and topic modeling [21]. In recent years, many methods have been proposed to improve disentanglement learning. InfoGAN [4] realizes unsupervised learning of disentangled representations by introducing mutual information to constrain the latent variables. β-VAE [11] takes the perspective of the information bottleneck and focuses on the KL-divergence term in the VAE objective. Further studies like β-TCVAE [3] and FactorVAE [14] decompose the KL-divergence and directly encourage a factorized distribution by penalizing the total correlation.
Recently, some studies have turned their attention to disentangled learning in recommendation. For instance, MacridVAE [24] is the first work to learn disentangled representations from user-item interactions in recommender systems. At the macro level, it divides user intentions into several high-level concepts and categorizes each item into a concept. At the micro level, it applies the VAE framework to the encoder layer to encourage the independence of dimensions. DGCF [38] devises a disentangled graph model to learn user intents based on neural graph collaborative filtering. DICE [40] disentangles interest and conformity representations with causal embeddings. Ma et al. [25] perform self-supervision in the latent space to classify user intentions; they reconstruct future sequences as a whole using a sequence-to-sequence training strategy, instead of individual items in the future sequences. However, these works fail to maintain the item-link relationships and disentangle their influential factors. In this paper, we solve this problem by introducing a graph neural network into sequential recommendation.
II-C Graph Neural Network
The Graph Neural Network (GNN) is a classical learning framework used to capture information from graph-structured data. It has achieved great success in various tasks, such as node classification [16] and link prediction [33]. Early work like ChebNet [7] realizes fast localized spectral filtering on graphs and avoids explicitly computing the graph Fourier basis. The Graph Attention Network (GAT) [36] aggregates neighbor nodes through a multi-head self-attention mechanism, adaptively weighting different neighbors and enhancing the ability of the Graph Convolutional Network (GCN). DisenGCN [23] proposes a disentangled graph convolutional network with a neighborhood routing mechanism to learn disentangled node representations from neighbors. CGAT [21] enhances the GAT framework by introducing a channel-aware attention mechanism that disentangles topic representations structurally and semantically over user-user interaction graphs.
Graph neural networks are also widely used in recommender systems. LightGCN [10] learns user and item embeddings by linearly propagating them on the user-item interaction graph. FGNN [27] investigates the inherent order of item transition patterns in session-based recommendation using a modified weighted GAT model. These existing studies have proved the effectiveness of graph neural networks in capturing item-link transition patterns, so we apply them in our work to model the transition of user intentions through sequences.
III Preliminary
We present the preliminary statements of this paper before the details of our model. We first describe the notations and the sequential recommendation problem. Then we put forward the channel-aware mechanism used in the representation space. Finally, we introduce the variational autoencoder and its contribution to learning disentanglement.
III-A Problem Formulation
Let U denote the set of users and V the set of items. For each user u ∈ U, S_u = [v_1, v_2, ..., v_t] represents the chronological sequence of items with which user u has interacted. Given the historical sequence up to time t, a sequential recommender model aims to predict the next item v_{t+1} at time t+1.
In this paper, we propose a global-level graph to capture item-link transition information. We define the global graph as G = (V, E), where V is the set of all items in the training data and E is the set of edges. Each edge (v_i, v_j) ∈ E means that a user interacts with v_j immediately after v_i in a sequence. N_i denotes the neighborhood of item v_i, i.e., the items adjacent to v_i in the sequences. We use an undirected graph in this paper, because for item-link pairs the similarities and influencing factors between the two items are order-independent; the temporal order of sequences is considered at the local level. Table I lists detailed explanations of the notations used in this paper.
Notations  Descriptions

U, V  sets of users and items
|U|, |V|  numbers of users and items
S_u  historical behaviors of user u
K  number of decomposed channels
n, L  lengths of sequences and sliding windows
G = (V, E)  global item-link graph
N_i  set of item v_i's neighbors
d, d_c  embedding dimensions of inputs and channels
e_i, p_i  initial embeddings of items and positions
z_i^k  representation of item v_i regarding channel k
h_i^G, h_u^L  representations of the global-level and local-level layers
s_u, z_u  outputs of the self-attention and VAE layers
III-B Channel-aware Mechanism
Given a historical sequence S_u, let z_i be the latent intention of the user while interacting with item v_i. Assuming that there are K factors related to user intentions, we divide the latent representation into K channels, i.e., z_i = [z_i^1, z_i^2, ..., z_i^K]. The k-th channel z_i^k corresponds to the k-th factor independently. For each pair of adjacent items, the correlation between z_i^k and z_j^k indicates the similarity between item v_i and item v_j regarding factor k, and also reveals why the two items are connected and how they influence each other.
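To make the channel-aware notation concrete, the following minimal Python sketch splits a latent vector into K channel sub-vectors and measures per-channel similarity between two items. It is an illustration only; the cosine similarity and the variable names are our assumptions rather than the paper's exact formulation.

```python
import numpy as np

def split_channels(z, K):
    """Split a latent vector z into K equally sized channel sub-vectors."""
    assert len(z) % K == 0, "latent dimension must be divisible by K"
    return np.split(np.asarray(z, dtype=float), K)

def channel_similarities(z_i, z_j, K):
    """Cosine similarity between two items for each of the K factors."""
    sims = []
    for c_i, c_j in zip(split_channels(z_i, K), split_channels(z_j, K)):
        sims.append(float(c_i @ c_j / (np.linalg.norm(c_i) * np.linalg.norm(c_j) + 1e-8)))
    return sims

# Example: 8-dimensional latent intentions, K = 4 channels of size 2.
z_i, z_j = np.random.randn(8), np.random.randn(8)
print(channel_similarities(z_i, z_j, K=4))  # one similarity score per factor
```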
III-C VAE for Disentangled Learning
The Variational Autoencoder (VAE) is a generative model which treats variables as random distributions based on Bayes' theorem. Assume a d-dimensional variable z_u is the latent representation sampled from sequence S_u; we aim to maximize the probability of the next item, that is, to maximize the probability of the whole sequence S_u:
p_{\theta}(S_u) = \int p_{\theta}(S_u \mid z_u)\, p(z_u)\, dz_u \qquad (1)
Since this probability is intractable, the variational inference method takes advantage of Bayes' theorem and introduces a posterior distribution q_{\phi}(z_u \mid S_u) to approximate the true posterior. Migrating to sequential recommendation, the log-likelihood of S_u can be bounded as follows:
\log p_{\theta}(S_u) \geq \mathbb{E}_{q_{\phi}(z_u \mid S_u)}\big[\log p_{\theta}(S_u \mid z_u)\big] - \mathrm{KL}\big(q_{\phi}(z_u \mid S_u)\,\|\,p(z_u)\big) \qquad (2)
The right-hand side of Equation (2) is the training objective of the variational autoencoder; it is called the Evidence Lower BOund (ELBO). By maximizing the ELBO, the model obtains an approximate posterior distribution for the encoder to generate the latent representation z_u.
In practice, the generative model assumes that the variables follow a Gaussian distribution and applies the 'Reparameterization Trick' to calculate the gradient. The latent variable can then be written as a deterministic function of the mean and the variance of the Gaussian distribution:
z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \qquad (3)
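As a quick illustration of Equation (3), the NumPy sketch below draws a sample via the reparameterization trick, so that the stochastic node becomes differentiable with respect to the mean and (log-)variance; the function and variable names are ours.

```python
import numpy as np

def reparameterize(mu, log_var, rng=None):
    """z = mu + sigma * eps with eps ~ N(0, I); gradients flow through mu and sigma."""
    if rng is None:
        rng = np.random.default_rng(0)
    sigma = np.exp(0.5 * np.asarray(log_var))
    eps = rng.standard_normal(size=np.shape(mu))
    return np.asarray(mu) + sigma * eps

z = reparameterize(mu=np.zeros(4), log_var=np.log(np.ones(4) * 0.25))
print(z)  # a sample from N(0, 0.25 I)
```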
β-VAE is a common modification of VAE. It introduces an adjustable hyperparameter β into the original VAE objective:
\mathcal{L} = \mathbb{E}_{q_{\phi}(z \mid x)}\big[\log p_{\theta}(x \mid z)\big] - \beta\, \mathrm{KL}\big(q_{\phi}(z \mid x)\,\|\,p(z)\big) \qquad (4)
Burgess et al. [2] discuss why β-VAE is able to learn an axis-aligned disentangled representation from the perspective of the information bottleneck: β acts as a constraint limiting the capacity of the latent bottleneck, while the model is still encouraged to improve the data log-likelihood.
Furthermore, the KL-divergence term can be decomposed into three parts, following β-TCVAE [3]:
\mathbb{E}_{p(x)}\big[\mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)\big] = I_{q}(x; z) + \mathrm{KL}\big(q(z)\,\|\,\textstyle\prod_{j} q(z_j)\big) + \textstyle\sum_{j} \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big) \qquad (5)
The three terms above are referred to as the Mutual Information (MI), the Total Correlation (TC), and the dimension-wise KL, respectively. A heavier penalty on the TC term forces the model to learn a factorized representation whose dimensions are independent. Therefore, if we put a strong penalty on the KL-divergence by adjusting β, β-VAE can find statistically independent factors in the observed data.
IV Methodology
In this section, we present our proposed model. Figure 3 illustrates its overall architecture, which consists of three parts: the Global-level Disentanglement Layer, the Local-level Disentanglement Layer, and the Prediction Layer. We claim that a sequential disentangled representation learning model should have three main characteristics: 1) items that have similar features should be close in the corresponding embedding space; 2) the change of factors between linked items should reveal the intention transition of user behaviors; 3) the separated representations should be independent of each other. We discuss how the model realizes these purposes in detail in the following sections.
IV-A Global-level Disentanglement Layer
We first introduce the global-level disentangled representation learning layer based on the channel-aware mechanism, which is the key framework of our work. We build a global item-link graph from the training sequences, where all item pairs that appear adjacently in the sequences are connected by undirected edges. We aim to extract the independent factors motivating user intentions and to determine the degree of mutual influence between two linked items with respect to these factors. Before introducing the mechanism, we first propose two hypotheses.
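The global item-link graph can be built in a single pass over the training sequences. The sketch below is our own illustration (function and variable names are hypothetical): every pair of adjacent items is connected by an undirected edge, and each item's neighborhood is recorded.

```python
from collections import defaultdict

def build_global_graph(sequences):
    """Build an undirected item-link graph: adjacent items in any sequence are connected."""
    neighbors = defaultdict(set)   # item id -> set of adjacent item ids
    edges = set()
    for seq in sequences:
        for v_prev, v_next in zip(seq, seq[1:]):
            if v_prev == v_next:
                continue
            edges.add(tuple(sorted((v_prev, v_next))))
            neighbors[v_prev].add(v_next)
            neighbors[v_next].add(v_prev)
    return neighbors, edges

# Example with two toy sequences.
train_sequences = [[1, 2, 4], [5, 2, 6, 3]]
nbrs, edges = build_global_graph(train_sequences)
print(sorted(nbrs[2]))  # items adjacent to item 2
```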
Hypothesis 1. There are K high-level concepts associated with user intentions, which means there are K latent factors to be disentangled.
Based on this hypothesis, given the global graph G, we divide the node (i.e., item) representations into K components in the latent space, and the edges are divided into K channels correspondingly. The k-th component is related to the k-th factor of user intention, and the k-th channel indicates how factor k contributes to the linkage of pairwise items.
Hypothesis 2. The k-th channel indicates the degree of similarity between two linked items in terms of factor k; that is, the representations of items v_i and v_j should be close in the k-th latent subspace if v_i and v_j have similar characteristics regarding this factor.
Intuitively, for a pair of linked items their similarity is symmetric, which means that on the same factor k, the degree of influence of item v_i on v_j is the same as that of v_j on v_i. Therefore, we can use undirected graphs instead of directed graphs to model the information transition.
We now introduce the channel-aware mechanism based on the above hypotheses. The illustration is shown in Figure 2. The item representations are divided into K components by sending the initial embeddings into K learning layers respectively. The edges are decomposed into K channels, and each channel transmits the information of the corresponding component of the item embedding. For a single node v_i in the graph, we aim to aggregate information from its neighborhood N_i. We first compute the probability p_{j,i}^k that factor k influences item v_i from its neighbor v_j ∈ N_i:
(6) 
where W_k is the parameter of the learning layer regarding factor k, e_j is the initial embedding of node v_j, and σ(·) is a nonlinear activation function.
The probability p_{j,i}^k reveals why the item pair (v_i, v_j) is linked adjacently, and how item v_j contributes to item v_i over factor k. The larger p_{j,i}^k is, the more similar items v_i and v_j are on factor k, the greater the information transmitted from v_j to v_i is, and the larger the width of the corresponding edge is in the graph. Moreover, the probabilities are normalized so that the total width of each factor is the same. We can then accumulate information from the neighbors of item v_i according to the channel probabilities and update the item representation:
(7) 
In order to ensure numerical stability, we normalize the updated representation as:
(8) 
By projecting item representations into different channels, we can aggregate item information from the perspective of different concepts. The global-level item representation h_i^G can then be denoted as the combination of the K channels:
(9) 
The design of the channel-aware mechanism based on neighborhoods fulfills our first claim of disentanglement learning: similar characteristics are passed through the corresponding channels to model item features, and different neighbors influence the target item differently. Taking Figure 2 as an example, if two items are linked because they are both black, information will be passed through the channel which corresponds to the factor 'color'. The representations of these two items should be close in the component related to 'color', but far apart in the component related to 'category' if one is a pair of trousers and the other is a shirt. Similarly, the representations of two items of the same category should be close in the component of 'category', but may be far apart in the component of 'color'.
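Since Equations (6)-(9) are not reproduced here, the following PyTorch sketch shows only one plausible instantiation of a channel-aware aggregation step in the spirit of the mechanism described above: each channel has its own projection, neighbor weights are obtained with a softmax, the aggregated channel vectors are normalized, and the channels are concatenated. The exact weighting and update rules are assumptions on our part, not the paper's definitive formulas.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAwareAggregation(nn.Module):
    """One plausible channel-aware aggregation step (illustrative only)."""
    def __init__(self, in_dim, channel_dim, num_channels):
        super().__init__()
        self.K = num_channels
        self.proj = nn.ModuleList([nn.Linear(in_dim, channel_dim) for _ in range(num_channels)])

    def forward(self, e_target, e_neighbors):
        # e_target: (in_dim,), e_neighbors: (num_neighbors, in_dim)
        channels = []
        for k in range(self.K):
            t_k = torch.tanh(self.proj[k](e_target))        # target in channel k
            n_k = torch.tanh(self.proj[k](e_neighbors))     # neighbors in channel k
            scores = n_k @ t_k                               # similarity per neighbor
            p = torch.softmax(scores, dim=0)                 # assumed neighbor weighting
            agg = F.normalize(t_k + p @ n_k, dim=0)          # aggregate and normalize
            channels.append(agg)
        return torch.cat(channels, dim=0)                    # global-level representation

layer = ChannelAwareAggregation(in_dim=100, channel_dim=20, num_channels=5)
h_global = layer(torch.randn(100), torch.randn(7, 100))     # an item with 7 neighbors
print(h_global.shape)                                        # torch.Size([100])
```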
IV-B Local-level Disentanglement Layer
Considering that items rarely repeat within one sequence, we model local-level user intentions based on sequential models instead of graphs. Given a sequence of a user's historical behaviors S_u, we transform it into a fixed-length sequence of length n as training data. For users whose sequence length is greater than n, we select the n most recently interacted items; for those whose sequence length is less than n, we repeatedly pad zero vectors on the left side of the sequence. In order to distinguish the item representations at different positions in the sequence, we add a learnable position embedding p_i to the initial item embedding e_i, and take the result x_i as the input of the learning layer:
x_i = e_i + p_i \qquad (10)
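Fixing sequences to length n is a simple preprocessing step; a minimal sketch is given below, where index 0 is assumed to be the padding item (our convention for illustration).

```python
def pad_or_truncate(seq, n, pad_id=0):
    """Keep the n most recent items; left-pad shorter sequences with pad_id."""
    seq = list(seq)[-n:]
    return [pad_id] * (n - len(seq)) + seq

print(pad_or_truncate([3, 7, 9], n=5))         # [0, 0, 3, 7, 9]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 5))  # [2, 3, 4, 5, 6]
```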
IV-B1 SA-VAE Layer
We first apply a self-attention network [13] in our local learning model, taking advantage of its ability to capture both long-range and short-range dependencies of items in a sequence. The scaled dot-product attention is defined following [35]:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Big(\frac{QK^{\top}}{\sqrt{d}}\Big) V \qquad (11)
where d is the dimension of the input embedding, and the scale factor \sqrt{d} is used to avoid overly large inner-product values. Q, K, and V denote the queries, keys, and values respectively. The three matrices are generated from the input X:
Q = XW^{Q}, \quad K = XW^{K}, \quad V = XW^{V} \qquad (12)
where W^{Q}, W^{K}, and W^{V} are the projection matrices of the attention layer.
By utilizing residual connections and layer normalization, we can propagate low-level features to high-level ones and obtain the final output of the self-attention layer:
(13) 
We then take the output s_u of the self-attention layer as the input of the variational autoencoder framework. Let z_u be the latent variable sampled from the sequence S_u, which obeys a Gaussian distribution. Following SVAE [30], we infer the approximate posterior distribution q(z_u | S_u), whose mean and variance vectors are computed from the self-attention vectors as follows:
\mu_u = f_{\mu}(s_u), \quad \sigma_u = f_{\sigma}(s_u) \qquad (14)
where f_{\mu} and f_{\sigma} represent linear transformations. By using the 'Reparameterization Trick' mentioned in the Preliminary section, the output of our SA-VAE layer is written as:
z_u = \mu_u + \sigma_u \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \qquad (15)
By sampling the random variable \epsilon from a standard Gaussian distribution, the latent representation of the sequence is reparameterized, and we can handle the uncertainty of user behaviors.
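A compact PyTorch sketch of an SA-VAE style encoder is given below: single-head self-attention over the padded sequence, a residual connection with layer normalization, linear heads producing the mean and log-variance, and reparameterized sampling. It is an illustrative approximation under our own naming, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SAVAEEncoder(nn.Module):
    """Self-attention followed by a Gaussian reparameterized latent (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.to_mu = nn.Linear(dim, dim)
        self.to_logvar = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, seq_len, dim) -- item embeddings plus position embeddings
        attn_out, _ = self.attn(x, x, x)
        s = self.norm(x + attn_out)                 # residual connection + layer norm
        mu, logvar = self.to_mu(s), self.to_logvar(s)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return z, mu, logvar

enc = SAVAEEncoder(dim=100)
z, mu, logvar = enc(torch.randn(2, 50, 100))        # batch of 2 sequences of length 50
print(z.shape)                                       # torch.Size([2, 50, 100])
```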
IV-B2 Disentangled Learning Layer
After the self-attention variational autoencoder model, we obtain a normally distributed representation of the overall sequence. Then, we apply the channel-aware aggregation mechanism for local-level disentanglement learning.
In the global-level learning layer, we assume that adjacent items have similar characteristics, so we apply the channel-aware mechanism to aggregate feature information for representation updating. At the local level, we focus on the transition of user preference by modeling the variation of item factors. In order to obtain the transition features, we use a sliding window strategy on top of the graph neural network.
The sliding window strategy is a popular partitioning algorithm. By applying it to some search tasks, a nested-loop problem can be converted into a single-loop problem, reducing time complexity. Specifically, the algorithm sets a fixed window, which moves from time 1 to time n along the sequence axis, and executes the channel-aware algorithm within the window at each step. Taking Figure 3 as an example, the user sequence is arranged in time order. Suppose the window length is 4; the window first covers the 4 items of the earliest interactions. We add edges from the first three items in the window to the last one, and apply the channel-aware mechanism to calculate the similarities and degrees of influence among these items over the channels. Then we accumulate the feature information into the last item of the window, achieving one step of information transmission. Next, the window slides one position forward, and we repeat the above steps in the new window. Finally, the item feature information is transmitted to the last item of the sequence through the channels.
We set the sliding window length as L. That means that for each target item, information will be aggregated from the former items within its window. The probability between two items is calculated by Equation (6) based on the channel-aware mechanism, and channel information is aggregated as follows:
(16)  
where W_k represents the learning parameter of channel k, which is shared with the global-level layer. We also apply the same normalization as in the global-level layer. Having aggregated information through the sliding window from the earliest items to the latest one, we can form the user intention at time t from the item embeddings in the sequence. The local sequential representation h_u^L is then the combination of the K factors.
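The sliding-window procedure can be sketched as follows. A placeholder update function stands in for the channel-aware mechanism of Equation (16); the window bookkeeping is the point of the sketch, while the update rule itself is assumed.

```python
import torch

def sliding_window_aggregate(item_reprs, window_len, channel_update):
    """item_reprs: (seq_len, dim) tensor; channel_update(target, neighbors) -> new target."""
    reprs = item_reprs.clone()
    seq_len = reprs.size(0)
    for end in range(1, seq_len):                   # window slides along the sequence
        start = max(0, end - window_len + 1)
        neighbors = reprs[start:end]                # former items inside the window
        reprs[end] = channel_update(reprs[end], neighbors)
    return reprs[-1]                                # representation of the last item

# Toy update rule standing in for the channel-aware mechanism.
toy_update = lambda target, nbrs: 0.5 * target + 0.5 * nbrs.mean(dim=0)
h_local = sliding_window_aggregate(torch.randn(10, 100), window_len=4, channel_update=toy_update)
print(h_local.shape)                                # torch.Size([100])
```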
In summary, we learn disentangled representations of the current sequence from both the channel and the statistical perspectives. The variational autoencoder helps the model learn statistically independent latent representations over the whole sequence, which will be discussed in the next part. The channel-aware sliding window strategy is able to distinguish the various factors behind user behaviors: different factors pass through channels that are related to different user intentions, and the dominant factor changes along the sequence with the transition of user intentions, realizing the second characteristic of our claim.
IV-C Prediction Layer
Based on the representations learned from the global-level and local-level layers, the final sequential representation is written as:
(17) 
where the weights are the combination parameters of the prediction layer.
We can estimate the final recommendation probability of candidate items based on the current sequential embedding and the initial item embeddings. Let ŷ_i denote the prediction probability of item v_i appearing as the next interaction in the current sequence:
(18)
Since we use the VAE framework in our model, the training objective is defined following the evidence lower bound:
(19) 
The first term is regarded as the reconstruction error, measuring the discrepancy between the prediction ŷ and the ground truth y. Here we compute the reconstruction loss using cross-entropy:
(20) 
The second term, the KL-divergence, is used to measure the distance between the posterior distribution and the prior distribution. In practice, it is computed from the intermediate variables of the VAE layer [15]:
\mathrm{KL} = -\frac{1}{2} \sum_{j=1}^{d} \big( 1 + \log \sigma_j^{2} - \mu_j^{2} - \sigma_j^{2} \big) \qquad (21)
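Putting the pieces together, a hedged sketch of the prediction and training objective is given below: scores are inner products between the sequence representation and candidate item embeddings, the reconstruction term is a cross-entropy over the next item, and the KL term uses the closed form of Equation (21) weighted by β. The exact scoring function and weighting are our assumptions.

```python
import torch
import torch.nn.functional as F

def predict_scores(h_seq, item_embeddings):
    """h_seq: (batch, dim); item_embeddings: (num_items, dim) -> (batch, num_items) scores."""
    return h_seq @ item_embeddings.t()

def elbo_loss(scores, next_item, mu, logvar, beta=0.1):
    """Cross-entropy reconstruction plus beta-weighted Gaussian KL (cf. Equation (21))."""
    recon = F.cross_entropy(scores, next_item)
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
    return recon + beta * kl

scores = predict_scores(torch.randn(8, 100), torch.randn(1000, 100))
loss = elbo_loss(scores, torch.randint(0, 1000, (8,)), torch.randn(8, 100), torch.randn(8, 100))
print(loss.item())
```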
We now discuss how our model can learn independent disentangled representations. According to Section III-C, the KL-divergence can be decomposed into three parts: the index-code mutual information, the total correlation, and the dimension-wise KL. The total correlation is a measure of redundancy, acting as the degree of interdependence between variables in the latent space. Therefore, applying the VAE framework in our model contributes to learning statistically independent factors of the data distribution, realizing the last characteristic of our claim about disentangled representation learning.
IV-D Complexity Analysis
IV-D1 Time Complexity
The time consumption of our model mainly consists of three parts. The first part is building the global graph: in order to construct it, we need to traverse every edge, which costs time linear in the number of edges. The second part is the channel-aware mechanism, whose procedure is shown in Algorithm 1; for each channel, the item embeddings are updated by aggregating over their neighborhoods. The third part is the sliding window strategy: the window slides from the start of each sequence to the end, which costs time linear in the sequence length, with the local-level channel-aware mechanism executed at each step. The total time complexity of our model is the sum of these three parts.
IV-D2 Space Complexity
The space consumption of our model mainly lies in the undirected graph and the channel-aware mechanism. In order to store the neighborhood of each node, an adjacency structure over all items is required. For the channel-aware mechanism, there is one parameter matrix of the channel dimension for each of the K channels, and the final item embedding generated from the channels has the full embedding dimension. The total space complexity of our model is the sum of these parts.
V Experiment
In this section, we present our experimental setup and results. Firstly, we introduce the datasets and the evaluation metrics used in our experiments; then we introduce the eight baseline methods, which are related to VAE models or disentangled learning models. Next, we compare the experimental results of these baseline methods with our method under the same experimental setting to verify the effectiveness of our proposed model. Moreover, we evaluate the influence of each part of our model and of the key hyperparameters. Finally, we perform visualization experiments on the sequential embeddings generated in the experiments, which shows that our disentanglement model is able to distinguish intention factors in the latent space. Specifically, our experiments aim to answer the following questions:

RQ1: Does our proposed model outperform the state-of-the-art works over various kinds of datasets?

RQ2: What is the influence of each component, i.e., the global and local layers in our model?

RQ3: How does our model realize the disentangled representation learning in the latent space?

RQ4: What is the influence of the hyperparameter setting on different datasets?
V-A Datasets
Dataset  Users  Items  Interactions  Sparsity (%)

ML-1M  6040  3416  999611  95.15
Beauty  52204  57289  394908  99.98
Games  31013  23715  287107  99.96
We adopt three real-world datasets to evaluate the effectiveness of our method. MovieLens is a time-series dataset containing users' ratings of movies; we use the MovieLens-1M version, which includes one million user ratings. Amazon is an e-commerce dataset which contains users' purchasing behaviors over a rich set of products; we choose two categories, 'Beauty' and 'Video Games', and use the 5-core version for our experiments.
We use timestamps to arrange the sequence order, that is, the items interacted with by the same user are ordered by their interaction time. Following previous work [13], we split the data into three parts: the last interacted item for testing, the second-to-last interacted item for validation, and the remaining items for training. We split each training sequence into subsequences, and the last element of each subsequence is regarded as the training ground truth. In validation and testing, we choose the last item of the sequence as the ground truth and rank it against 100 randomly sampled negative items. The detailed statistics are shown in Table II. The average sequence lengths of the three datasets are 163.5, 5.63, and 7.26 respectively.
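The leave-one-out protocol described above can be reproduced in a few lines. In the sketch below (our illustration), negative items are drawn uniformly at random from the items the user has not interacted with, which follows common practice but may differ in detail from the paper's sampler.

```python
import random

def leave_one_out_split(sequence, all_items, num_negatives=100, seed=0):
    """Last item -> test, second-to-last -> validation, the rest -> training."""
    train, valid, test = sequence[:-2], sequence[-2], sequence[-1]
    rng = random.Random(seed)
    candidates = list(set(all_items) - set(sequence))
    negatives = rng.sample(candidates, num_negatives)
    return train, valid, test, negatives

seq = [12, 5, 33, 7, 19]
train, valid, test, negs = leave_one_out_split(seq, all_items=range(1, 2000))
print(train, valid, test, len(negs))   # [12, 5, 33] 7 19 100
```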
V-B Metrics
We adopt two ranking based metrics to evaluate the recommendation performance: Normalized Discounted Cumulative Gain (NDCG) and Recall. The larger the values of metrics are, the better the performance is. We refer to the two metrics as N@K and R@K for short.

NDCG is a ranking metric which takes into account the positions of correctly recommended items. It is defined as follows:
\mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} \qquad (22)
where DCG is the Discounted Cumulative Gain. We hope that the most relevant items appear at the top of the list, so before summing the scores, each item's gain is divided by a logarithmically increasing discount. IDCG is the ideal DCG, which sorts the results into the best possible order and calculates the DCG under this arrangement. They are defined as:
\mathrm{DCG@K} = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i+1)}, \quad \mathrm{IDCG@K} = \sum_{i=1}^{|REL|} \frac{rel_i}{\log_2(i+1)} \qquad (23)
where rel_i represents the relevance of the item at position i, which is either 1 or 0, and REL is the set of relevant items sorted in the ideal order.

Recall describes the percentage of the items actually preferred by a user that are included in the recommendation list. Let R_u denote the recommendation list of the top-K predicted items for user u, and T_u the corresponding test set. Recall is then computed as:
\mathrm{Recall@K} = \frac{|R_u \cap T_u|}{|T_u|} \qquad (24)
Under our evaluation protocol with a single ground-truth item per sequence, both metrics can be computed directly from the rank of the ground truth among the candidates, as sketched below.
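A minimal sketch of both metrics under the single-ground-truth, 100-negative protocol used here (function names are ours):

```python
import numpy as np

def recall_at_k(rank, k):
    """rank: 0-based position of the ground-truth item in the sorted candidate list."""
    return 1.0 if rank < k else 0.0

def ndcg_at_k(rank, k):
    """With a single relevant item, DCG = 1/log2(rank + 2) and IDCG = 1."""
    return 1.0 / np.log2(rank + 2) if rank < k else 0.0

# Example: ground truth ranked 3rd (rank index 2) among 101 candidates.
print(recall_at_k(2, k=10), round(ndcg_at_k(2, k=10), 4))   # 1.0 0.5
```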
Dataset  Metric  POP  BPR  FPMC  TransRec  Caser  SASRec  DSS*  VSAN  EGD-GNN

ML-1M  N@5  0.1428  0.1856  0.2726  0.2816  0.2175  0.3922  0.2119  0.4224  0.4571 
R@5  0.2546  0.2972  0.4081  0.4166  0.3389  0.5450  0.3202  0.5851  0.6161  
N@10  0.1863  0.2440  0.3284  0.3352  0.2709  0.4419  0.2562  0.4593  0.5012  
R@10  0.4086  0.4738  0.5806  0.5826  0.5045  0.6982  0.4573  0.7054  0.7517  
Beauty  N@5  0.0483  0.0915  0.1429  0.1726  0.1928  0.2166  0.2139  0.2285  0.2380 
R@5  0.0754  0.1299  0.2220  0.2467  0.2787  0.3013  0.3155  0.3332  0.3158  
N@10  0.0659  0.1448  0.1839  0.2049  0.2295  0.2495  0.2527  0.2575  0.2710  
R@10  0.1303  0.2425  0.3492  0.3471  0.3923  0.4030  0.4351  0.4279  0.4180  
Games  N@5  0.1695  0.2195  0.2004  0.2428  0.2671  0.3978  0.2589  0.3956  0.4303 
R@5  0.2845  0.3204  0.3250  0.3468  0.3821  0.5377  0.3714  0.5491  0.5599  
N@10  0.2082  0.2606  0.2583  0.2845  0.3123  0.4388  0.3052  0.4288  0.4668  
R@10  0.3605  0.4459  0.4462  0.4762  0.5217  0.6641  0.5152  0.6672  0.6723 

* The code of DSS is not released by its authors; we re-implement it according to the paper.
V-C Baselines
We compare our method with the following competitive baselines, with particular emphasis on VAE-based and disentangled learning methods. All the baselines are applied to sequential recommendation tasks.

1) POP: a classical method that ranks items according to their popularity.

2) BPR: Bayesian Personalized Ranking [28], a classical model based on Matrix Factorization. It designs a pairwise optimization method to learn pairwise item rankings from implicit feedback.

3) FPMC: Factorized Personalized Markov Chains [29], a method combining Matrix Factorization and a first-order Markov Chain. It introduces a personalized transition matrix based on the Markov chain to capture temporal information and uses matrix factorization to alleviate the sparsity of the transition matrix.

4) TransRec: Translation-based Recommendation [8]. It embeds items into a transition space and models each user as a translation vector to obtain 'third-order' relationships, i.e., the interactions among a user, the previously visited item, and the next item.

5) Caser: Convolutional Sequence Embedding Recommendation [32]. The main idea is to form an 'image' from the most recent items of a sequence in the time and latent dimensions, and to apply a Convolutional Neural Network (CNN) to learn high-order sequential patterns as local features of this image.

6) SASRec: Self-Attention based Sequential Recommendation [13]. By applying the self-attention mechanism to sequential problems, the model can not only capture long-term information like RNNs but also handle short-term patterns based on a small number of behaviors like MCs.

7) DSS: Disentangled Self-Supervision [25], the first model that focuses on disentangled representation learning in sequential recommendation. It designs a disentangled sequence encoder to disentangle user intentions in the latent space over subsequences and proposes a seq2seq self-supervised strategy for training.

8) VSAN: Variational Self-attention Network [39]. It combines the self-attention mechanism with variational inference to model the long-range and short-range dependencies of sequences for sequential recommendation.
V-D Experiment Setting
We conduct experiments with PyTorch. In the experiments, the item embedding dimension of all methods is set to 100, and the channel embedding dimension of our model is set to 20. We set the batch size to 128 and the learning rate to 0.002. We limit the maximum sequence length to 200 for the MovieLens dataset and 50 for the Amazon datasets. The dropout rate is set to 0.5 for both the global and local layers. A single-head self-attention network is used as the sequential encoder. We use different random seeds for Gaussian sampling and report the average performance over five runs.
V-E Performance Analysis
To evaluate the effectiveness of our proposed model, we perform next-item recommendation with our model and the baselines under the same experimental setting. Specifically, we predict the item a user may be interested in at the next time step based on the former items, and choose the last item in each sequence as the ground truth for the metrics. Table III records the performance results, which we compare and analyze in detail in this section.
Firstly, we can observe from the table that our method outperforms the baselines over all the datasets. There is no doubt that our model obtains better prediction results than the classical baselines POP, BPR, and FPMC, since we take complex sequential interaction information into account. In terms of models based on neural networks, we can see that SASRec performs better than the aforementioned models, indicating that the self-attention network captures more sequential semantics covering both long-term and short-term patterns.
The VSAN method proposes a new self-attention network with a variational autoencoder and achieves the second-best results in our experiments. This proves that capturing long- and short-range dependencies with an attention-based network together with a statistical method does help the model obtain better prediction results. The effectiveness of the stochastic component, variational inference, in eliminating random noise in user behaviors is also confirmed. Although the disentangled self-supervised method performs well on the Beauty dataset, it does not obtain good results on the other two datasets. Nevertheless, its good performance on the Beauty dataset is sufficient to prove the effectiveness of learning disentangled user intentions over sequences. In DSS, one sequence of behaviors is encoded into one kind of user intention, ignoring the various factors hidden behind item transitions. Compared with disentangling user intention over whole sequences, we focus on the intention transition of pairwise items; therefore, our model obtains better prediction results than this previous work.
Turning back to our proposed model, we obtain the best experimental results in most circumstances. In particular, its relative improvements over the strongest baselines w.r.t. NDCG@5 are 8.21%, 4.16%, and 8.17% on the three datasets respectively. Compared with VSAN, our model builds a global item-link graph and disentangles the influential factors into channels for representation updating. Compared with DSS, our model pays more attention to the item-item relationship: we form the user intention by taking advantage of the transformation ability of self-attention instead of modeling the whole sequence with a single encoder. Given these improvements, there is no doubt that our model can obtain user intentions in an adaptive way and find more suitable items that users may be interested in.
Secondly, we achieve the largest improvements for all the metrics on the MovieLens dataset. This indicates that, by introducing the channel-aware mechanism, the model is able to capture item-link relationship information that is hard to capture with previous works. Moreover, by composing several high-level concepts, the movie items, which have few explicit features, are classified into implicit categories, and the model can infer user intentions from various high-level perspectives to predict users' true preferences.
Thirdly, we find that our model reaches high metric values very early compared with the baselines, as shown in Figure 4. The trends of the experimental results over 10 epochs indicate that our model can obtain good prediction results within the first five epochs. Even though the time complexity of our model is larger than that of the state-of-the-art baselines, we can still obtain high prediction results within a short time. This means that by distinguishing the latent factors hidden behind sequences, the model can learn item representations over various factors and find dynamic user intentions that are not shown explicitly.
Dataset  Metric  Global  Local  SA-VAE  SliWin  EGD-GNN 

ML-1M  N@10  0.4641  0.4772  0.2900  0.4628  0.5012 
R@10  0.7142  0.7377  0.5311  0.7065  0.7517  
Beauty  N@10  0.2562  0.2437  0.2274  0.2420  0.2710 
R@10  0.4043  0.4089  0.3990  0.4065  0.4180  
Games  N@10  0.4445  0.3988  0.3370  0.3665  0.4668 
R@10  0.6518  0.6242  0.5630  0.6050  0.6723 
Dataset  Metric  β=0  β=0.1  β=0.5  β=1  β=2 

ML-1M  N@10  0.4875  0.5012  0.4920  0.4881  0.4987 
R@10  0.7425  0.7517  0.7434  0.7478  0.7441  
Beauty  N@10  0.2559  0.2587  0.2606  0.2710  0.2538 
R@10  0.3995  0.4078  0.4051  0.4180  0.4021  
Games  N@10  0.4472  0.4668  0.4625  0.4613  0.4616 
R@10  0.6671  0.6723  0.6699  0.6685  0.6659 
V-F Ablation Study
V-F1 Influence of Each Layer of the Model
We first implement ablation studies to evaluate the effectiveness of each part of our model. Specifically, we perform four ablation experiments as follows:

Global only: remove the local-level learning layer, i.e., the SA-VAE layer and the sliding window layer, and keep only the global-level graph.

Local only: remove the global link graph and the corresponding learning layer, and keep only the local-level layer.

SA-VAE only: remove the sliding window strategy and keep only the self-attention and variational autoencoder layers.

SliWin only: remove the self-attention and variational autoencoder layers and keep only the sliding window mechanism for local-level learning.
Table IV lists the results of the ablation studies, showing how each part influences the final performance of our model. It is clear that the global-level and local-level layers both contribute to the improvement of our model. For the Amazon datasets, the global-level layer performs better than the local-level layer, indicating that the relationships between product items are close and worth exploring.
Besides, the sliding window strategy obtains better prediction results than the SA-VAE layer, which shows that the channel-aware mechanism plays a crucial role in disentangling user intentions over different factors. Moreover, we can observe that the improvement brought by the channel-aware mechanism is especially large on the MovieLens dataset, since our model can capture rich relevant information between items and explore high-level factors even on a small-scale dataset.
V-F2 Influence of the Penalty on the KL-divergence
We then run ablation experiments on the variational autoencoder framework. As introduced in the previous sections, the parameter β acts as a penalty on the KL-divergence term, which helps force the model to find independent latent variables. Therefore, we evaluate the role of β in disentangled representation learning. We vary β from 0 to 2 and list the results in Table V. We can see that when β is 0, the experimental results are clearly the worst, since the variational autoencoder degenerates into an ordinary autoencoder. When β is too large, the results also decrease, since the posterior distribution is pushed too close to the standard normal distribution. Therefore, we need to find a suitable value of β to strike a balance between reconstruction accuracy and disentangled learning.
V-G Visualization of Node Representations
In order to analyze the quality of the learned disentangled representations, we visualize the item representations using the t-SNE [34] algorithm on the two Amazon datasets. In detail, we learn the global-level item embeddings based on the channel-aware mechanism and project the embeddings into a 2-dimensional space. We choose the channel with the largest embedding value as the item category and color the nodes according to their categories in Figure 5.
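The visualization procedure can be sketched as follows, assuming scikit-learn and matplotlib; choosing the dominant channel by the largest channel-vector norm is our reading of the description above.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_item_channels(item_embeddings, num_channels):
    """Project item embeddings to 2-D with t-SNE and color by the dominant channel."""
    emb = np.asarray(item_embeddings)
    channels = emb.reshape(emb.shape[0], num_channels, -1)          # (items, K, d_c)
    dominant = np.linalg.norm(channels, axis=2).argmax(axis=1)      # strongest channel per item
    coords = TSNE(n_components=2, random_state=0).fit_transform(emb)
    plt.scatter(coords[:, 0], coords[:, 1], c=dominant, cmap="tab10", s=5)
    plt.show()

plot_item_channels(np.random.randn(500, 100), num_channels=5)
```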
We can observe that on both datasets the items with the same category are close in the latent space, indicating that they share similar features. Meanwhile, the factors which have close relationships are also close in the latent space. Taking the Beauty dataset as an example, the items colored pink are close to the items colored orange and cyan. This indicates that the items with these three features share similar characteristics: when a user interacts with an item colored pink, he or she is likely to choose an item colored orange or cyan next. The Games dataset shows the same characteristics. This visual experiment again demonstrates the effectiveness of learning disentangled representations from an intuitive perspective, and also shows their ability to enhance the interpretability of the model. In summary, learning disentangled representations based on item edges not only reveals the underlying features shared between items, but also helps the model predict what users may like.
V-H Hyperparameter Sensitivity
The most important hyperparameters in our model are the number of channels K and the length of the sliding window L. We fix the other parameters and vary one hyperparameter at a time in fixed steps. We record the prediction results on the three datasets and draw line charts to show their impacts, which we analyze in this section.
V-H1 Impact of the Number of Channels
We adjust the number of channels from 5 to 35 in steps of 5 and show the results in Figure 6. We can see that the recommendation performance improves as the channel number increases, and tends to remain unchanged after reaching a peak. The MovieLens dataset reaches the peak later than the Amazon datasets. The reason may be that product items do not have as many attributes as movie items, so the model does not require as many categories to achieve the best results.
V-H2 Impact of the Length of the Sliding Window
We adjust the length of the sliding window from 5 to 25 in steps of 5 and show the results in Figure 7. We can observe that the influence of the window length is quite different from that of the channel number. On the MovieLens dataset, the performance becomes slightly better as the window length grows, but there is no such trend on the Amazon datasets. Therefore, we speculate that the choice of window length is not a major factor affecting the recommendation results.
VI Conclusion
In this paper, we proposed an edge-enhanced model based on graph neural networks to learn sequential representations at both the global and local levels. We designed a disentangled learning layer, i.e., the channel-aware mechanism, to distinguish the various factors that motivate user intentions. The mechanism divides the information transmission into several channels and aggregates item information through different channels. At the global level, we built a global item-link graph based on the training data and updated item feature information through neighborhoods. At the local level, we applied the variational autoencoder framework to infer user behaviors as distributions, taking advantage of its statistical ability in learning disentangled representations, and adopted a sliding window strategy along with the channel-aware mechanism to capture the transition of user intentions through sequences. Experimental results showed that our proposed method achieves better performance than previous works. It is notable that user information is also important for learning disentangled representations; therefore, we will consider adding user nodes in future studies.
Acknowledgments
This research was partially supported by NSFC (No. 61876117, 61876217, 61872258, 61728205), ESP of the State Key Laboratory of Software Development Environment, and PAPD of Jiangsu Higher Education Institutions.
References
[1] (2017) Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263.
[2] (2018) Understanding disentangling in beta-VAE. arXiv preprint arXiv:1804.03599.
[3] (2018) Isolating sources of disentanglement in variational autoencoders. In NeurIPS, pp. 2615–2625.
[4] (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS, pp. 2180–2188.
[5] (2018) Sequential recommendation with user memory networks. In WSDM, pp. 108–116.
[6] (2013) Where you like to go next: successive point-of-interest recommendation. In Twenty-Third International Joint Conference on Artificial Intelligence.
[7] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. NeurIPS 29, pp. 3844–3852.
[8] (2017) Translation-based recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems, pp. 161–169.
[9] (2016) Fusing similarity models with Markov chains for sparse sequential recommendation. In ICDM, pp. 191–200.
[10] (2020) LightGCN: simplifying and powering graph convolution network for recommendation. In SIGIR, pp. 639–648.
[11] (2017) beta-VAE: learning basic visual concepts with a constrained variational framework. ICLR.
[12] (2018) Learning to decompose and disentangle representations for video prediction. In NeurIPS, pp. 517–526.
[13] (2018) Self-attentive sequential recommendation. In ICDM, pp. 197–206.
[14] (2018) Disentangling by factorising. In ICML, pp. 2649–2658.
[15] (2014) Auto-encoding variational Bayes. ICLR.
[16] (2017) Semi-supervised classification with graph convolutional networks. ICLR.
[17] (2009) Matrix factorization techniques for recommender systems. Computer 42 (8), pp. 30–37.
[18] (2008) Factorization meets the neighborhood: a multifaceted collaborative filtering model. In SIGKDD, pp. 426–434.
[19] (2017) Collaborative variational autoencoder for recommender systems. In SIGKDD, pp. 305–314.
[20] (2018) Variational autoencoders for collaborative filtering. In WWW, pp. 689–698.
[21] (2020) Graph attention networks over edge content-based channels. In SIGKDD, pp. 1819–1827.
[22] (2019) Challenging common assumptions in the unsupervised learning of disentangled representations. In ICML, pp. 4114–4124.
[23] (2019) Disentangled graph convolutional networks. In ICML, pp. 4212–4221.
[24] (2019) Learning disentangled representations for recommendation. In NeurIPS, pp. 5712–5723.
[25] (2020) Disentangled self-supervision in sequential recommenders. In SIGKDD, pp. 483–491.
[26] (2021) Disentangled motif-aware graph learning for phrase grounding. In AAAI.
[27] (2019) Rethinking the item order in session-based recommendation with graph neural networks. In CIKM, pp. 579–588.
[28] (2009) BPR: Bayesian personalized ranking from implicit feedback. In International Conference on Uncertainty in Artificial Intelligence, pp. 452–461.
[29] (2010) Factorizing personalized Markov chains for next-basket recommendation. In WWW, pp. 811–820.
[30] (2019) Sequential variational autoencoders for collaborative filtering. In WSDM, pp. 600–608.
[31] (2019) BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer. In CIKM, pp. 1441–1450.
[32] (2018) Personalized top-n sequential recommendation via convolutional sequence embedding. In WSDM, pp. 565–573.
[33] (2003) Link prediction in relational data. NeurIPS 16, pp. 659–666.
[34] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (11).
[35] (2017) Attention is all you need. In NeurIPS, pp. 5998–6008.
[36] (2018) Graph attention networks. ICLR.
[37] (2019) Neural graph collaborative filtering. In SIGIR, pp. 165–174.
[38] (2020) Disentangled graph collaborative filtering. In SIGIR, pp. 1001–1010.
[39] (2021) Variational self-attention network for sequential recommendation. In ICDE, pp. 1559–1570.
[40] (2021) Disentangling user interest and conformity for recommendation with causal embedding. In WWW, pp. 2980–2991.
[41] (2017) What to do next: modeling user behaviors by Time-LSTM. In IJCAI, Vol. 17, pp. 3602–3608.