Edge-Enhanced Global Disentangled Graph Neural Network for Sequential Recommendation

Sequential recommendation has been a widely popular topic in recommender systems. Existing works have improved the prediction ability of sequential recommendation systems with various methods, such as recurrent networks and self-attention mechanisms. However, they fail to discover and distinguish the various relationships between items, which could be the underlying factors that motivate user behaviors. In this paper, we propose an Edge-Enhanced Global Disentangled Graph Neural Network (EGD-GNN) model to capture the relation information between items for global item representation and local user intention learning. At the global level, we build a global-link graph over all sequences to model item relationships. Then a channel-aware disentangled learning layer is designed to decompose edge information into different channels, which can be aggregated to represent the target item from its neighbors. At the local level, we apply a variational auto-encoder framework to learn user intention over the current sequence. We evaluate our proposed method on three real-world datasets. Experimental results show that our model improves over state-of-the-art baselines and is able to distinguish item features.




I Introduction

Recommender systems play a critical role in the fast-developing Internet age, aiming to predict the items that users are most likely to be interested in. Collaborative filtering is an efficient and widely used approach to recommendation, which captures latent user and item features from historical interactions. Early works like Matrix Factorization (MF) [17] decompose a rating matrix into user and item embeddings to capture implicit semantics. As the number of users and items has grown rapidly in recent years, more deep learning models based on collaborative filtering have been proposed to characterize diverse user tastes over a large number of items. For example, [1] and [37] build user-item graphs to integrate the multi-hop relationships of interactions. [20] and [19] introduce a variational auto-encoder framework into the model and infer the representation as a Gaussian distribution.

Sequential recommendation is an important branch of recommender systems. It models user behaviors as a sequence of items instead of a set of items. Markov Chain (MC) [6] is a classic method, which models short-term item transitions and predicts the next item a user may like. Factorized Personalized Markov Chain (FPMC) [29] combines the Markov chain with traditional matrix factorization to model user preferences. With the development of deep learning, Recurrent Neural Networks (RNNs) have achieved success in sequential recommendation. For example, Long Short-Term Memory (LSTM) [41] is a common RNN variant that uses memory cells to better retain sequential information. GRU4Rec [5] applies Gated Recurrent Units (GRU) to session-based recommendation by introducing session-parallel mini-batches. However, RNN-based methods struggle to maintain long-range information. More recently, self-attention networks have been applied to sequential recommendation to capture both long-term and short-term dependencies, and SASRec and BERT4Rec both achieve good prediction results with attention mechanisms. SASRec [13] is able to capture long-term dependencies because it takes into account the influential weights of the whole historical sequence. BERT4Rec [31] employs a deep bidirectional self-attention network with Cloze tasks to increase the efficiency of a transformer model.

These previous works model user intentions with historical sequential interactions, ignoring the dynamic underlying relationships between items. The edges that link pairwise items contain abundant semantic information about why and how users choose one item after another. These underlying factors are related to real-world concepts, and one factor often plays a leading role in a given situation. For example, suppose there are two users interacting with six items, as shown in Figure 1. The link graph shows that item 2 is adjacent to all the other five items, but these edges are intuitively motivated by different factors. Item 2 is linked to items 1 and 4 because they have the same color, while it is linked to items 5 and 6 because they have short sleeves. Item 3 is connected to item 2 because it can be worn as a jacket over a T-shirt. These different factors show the intention transitions of user behaviors and also reveal the shared features of pairwise items. Therefore, recognizing and distinguishing the underlying item-link factors can enhance the expressive ability of models, and disentangled representation learning [22] is a common method proposed to achieve this goal.

Disentangled representation learning has been very popular in many fields such as computer vision [12, 26], and it has recently been applied to recommender systems. The general purpose of disentangled representation learning is to separate the distinct and informative factors from the variations of data, where each unit is related to a single concept in the real world. A change in one factor leads to a change in the relevant unit only. Many models have been shown to learn disentangled representations and have been applied to realistic tasks. For example, by learning a disentangled representation of a face image, we can obtain independent representations of different features of the face. Then we are able to identify whether a person in a picture has bangs, is wearing glasses, is smiling, and so on. Further, we can change these features in a targeted way by modifying the values of the corresponding dimensions. Therefore, learning disentangled representations can enhance the interpretability and controllability of a model.

Fig. 1: An example of item relationship patterns.

The most prominent networks for learning disentangled representations are β-VAE and InfoGAN. β-VAE [11] adds a coefficient hyperparameter β to the KL-divergence term in the objective of the variational auto-encoder to encourage more factorized latent representations. This extra hyperparameter puts heavy pressure on the posterior distribution to match the factorized prior distribution [2]. InfoGAN [4] maximizes the mutual information between a fixed small subset of the GAN's noise variables and observations. With regard to recommender systems, Macrid-VAE [24] infers high-level concepts of user intentions at the macro level and applies VAE to enhance disentanglement at the micro level. The authors also propose a self-supervised seq2seq training strategy for sequential recommendation [25], which compares user intentions between sub-sequences generated by an intention-disentangled encoder. DGCF [38] devises intent-aware interaction graphs to distinguish user intentions over different items, focusing on user-item relationships. However, these studies do not consider item-link relationship patterns, and fail to distinguish the different user intentions behind sequences. As a result, the sequential model is sensitive to noisy data and hard to interpret.

In this paper, we propose an Edge-Enhanced Global Disentangled Graph Neural Network (EGD-GNN) model to capture the item-link information. We model item representations and user intentions at both the global and the local levels. At the global level, we build a global item-link graph over all sequences, and each item pair in the sequences is denoted as an edge in the graph. Figure 1 shows an example of the construction of a global graph over two sequences. We apply the channel-aware mechanism to decompose edges into several channels, where each channel corresponds to an influential factor. Each channel extracts from the neighbors the features specific to one of the disentangled factors, and the different factors are aggregated jointly into the target item. At the local level, we model a disentangled user-intention representation over the current sequence. We first infer the latent variable as a Gaussian distribution in order to enforce disentanglement from the statistical perspective of the variational auto-encoder. Then we use the channel-aware mechanism to aggregate item information through the edge channels from the preceding items in the sequence. The aggregated item representation is used to express the current user intention. We conduct experiments with our proposed method on three real-world datasets and compare the prediction results with state-of-the-art baselines. Results show that the EGD-GNN model not only outperforms previous works in prediction tasks, but also forces the system to find good disentangled representations.

The major contributions of this paper are summarized as follows:

  • To the best of our knowledge, this is the first work to explore disentangled representation at both the global and the local levels to learn the underlying factors of item relationships.

  • We propose a disentangled graph neural network to infer the factors behind pairwise items which motivate user intentions. We apply the GNN model to the global graph and the local sequences to learn item link patterns, and employ variational auto-encoder taking advantage of its statistical property.

  • We evaluate our model on three real-world datasets, and experimental results show our proposed model is able to achieve a good disentangled representation over sequences to indicate user intentions.

The rest of this paper is organized as follows. We first review related work in Section II. Then we give the problem formulation and definitions and introduce preliminaries in Section III. Next, we present the details of our proposed model in Section IV. Section V reports and visualizes experimental results, demonstrating the effectiveness of our model. Finally, we conclude the paper and outline future work in Section VI.

II Related Work

In this section, we present the recent works related to our model, including Sequential Recommendation, Disentangled Representation Learning, and Graph Neural Network.

II-A Sequential Recommendation

Recommendation systems have been extensively studied over the last two decades. They aim to predict users' preferences from historical behaviors. Matrix Factorization (MF) [17] is the most common framework for prediction, which learns user and item embeddings to model latent relationships between users and items. Further work such as SVD++ [18] combines the neighborhood model with the latent factor model, and proposes a new globally optimized neighborhood model. Sequential recommendation is an important branch of recommendation systems. Given a chronological item sequence of a user's historical behaviors, sequential recommendation predicts the next item with which the user is likely to interact.

Markov Chain (MC) [6] is a classical model to capture short-term item transitions. FPMC [29] further combines Matrix Factorization and Markov Chain to model both long-term preferences and short-term transitions. Fossil [9] combines similarity-based models with high-order Markov Chains. TransRec [8] turns a user embedding into a translation vector and considers the third-order relationships between users, candidate items, and previous behaviors. Since the introduction of the Recurrent Neural Network (RNN), researchers have proposed numerous works based on this sequential framework and its variants. For example, Time-LSTM [41] uses Long Short-Term Memory (LSTM) to model time intervals with time gates, and GRU4Rec [5] uses Gated Recurrent Units (GRU) to model click sequences for session-based recommendation. Recently, more deep neural networks have been applied to model sequence patterns. Caser [32] employs a Convolutional Neural Network (CNN) to capture sequential patterns as local features of images by embedding recent sequential items into the images. SASRec [13] finds the relevance between items adaptively using the self-attention mechanism. BERT4Rec [31] employs a deep bidirectional self-attention network with a Cloze task to increase the efficiency of the transformer model. SVAE [30] leverages a Variational Auto-encoder (VAE) to handle temporal information of sequences. However, these previous works do not distinguish the various contributions of neighbors over different aspects.

II-B Disentangled Representation Learning

The purpose of learning disentangled representations is to find independent factors in the latent space. Each dimension of the representation has a specific, independent meaning and is human-understandable. For example, learning a disentangled representation over face pictures can yield representations regarding eyes, hair, smiles, etc., while learning a disentangled representation over landscape pictures can yield representations regarding trees, sky, buildings, etc. Distinguishing such features in the representations can bring enhanced robustness, interpretability, and controllability. Therefore, it has been a popular task in many fields such as computer vision [12] and topic modeling [21]. In recent years, many methods have been proposed to improve the ability to learn disentanglement. InfoGAN [4] realizes unsupervised learning of disentangled representations by introducing mutual information to constrain the latent variables. β-VAE [11] takes the perspective of the information bottleneck and focuses on the KL-divergence in the VAE objective. Further studies like β-TCVAE [3] and FactorVAE [14] decompose the KL-divergence and directly encourage a factorized distribution by penalizing the total correlation.

Recently, some studies have turned to disentangled learning in recommendation. For instance, Macrid-VAE [24] is the first work to learn disentangled representations from user-item interactions in recommender systems. At the macro level, it divides user intentions into several high-level concepts and categorizes each item into a concept. At the micro level, it applies the VAE framework to the encoder layer to encourage dimension independence. DGCF [38] devises a disentangled graph model to learn user intents based on neural graph collaborative filtering. DICE [40] disentangles the interest and conformity representations with causal embedding. Ma et al. [25] perform self-supervision in the latent space to classify user intentions. Their method reconstructs future sequences as a whole using a sequence-to-sequence training strategy, instead of reconstructing individual items in the future sequences. However, these works fail to maintain the item-link relationships and disentangle their influential factors. In this paper, we address this problem by introducing graph neural networks into sequential recommendation.

II-C Graph Neural Network

Graph Neural Network (GNN) is a classical learning network used to capture information of graph structure data. It has achieved great success in various tasks, such as node classification [16] and link prediction [33]. Early work like ChebNet [7] realizes fast localized convolutional filters on graphs by CNN and avoids the Fourier basis. Graph Attention Network (GAT) [36] aggregates neighbor nodes through multi-head self-attention mechanism, realizing the adaptive matching of the weights of different neighbors and enhancing the ability of Graph Convolutional Network (GCN). DisenGCN [23] proposes a disentangled graph convolutional network with neighborhood routing mechanism to learn disentangled node representations from its neighbors. CGAT [21] enhances the GAT framework by introducing a channel-aware attention mechanism. It disentangles topic representations structurally and semantically over user-user interaction graphs.

Graph neural network is also widely used in recommender systems. LightGCN [10] learns the user and item embeddings by linearly propagating them on the user-item interaction graph. FGNN [27] investigates the inherent order of item transition patterns in session recommendation using a modified weighted GAT model. These existing studies have proved the effectiveness of graph neural network in obtaining item-link transition patterns, so we apply it to our work to model the user intention transition through sequences.

III Preliminary

We will present the preliminary statements of this paper before the details of our model. We first describe the notations and the sequential recommendation problem of our paper. Then we put forward the channel-aware mechanism used in the representation space. Finally, we introduce the variational auto-encoder and its contribution to learning disentanglement.

III-A Problem Formulation

Given a set of users and a set of items, each user is associated with a chronologically ordered sequence of the items the user has interacted with. Given a user's historical sequence up to time t, a sequential recommender model aims to predict the next item at time t+1.

In this paper, we propose a global-level graph to capture item-link transition information. The global graph consists of a node set containing all items in the training data and an edge set. Each edge means that a user interacts with one item immediately after the other in a sequence. The neighborhood of an item is the set of items adjacent to it in the sequences. We use an undirected graph in this paper, because for item-link pairs, the similarities and influencing factors between the two items are order-independent. The temporal order of sequences is considered at the local level. Table I lists detailed explanations of the notations used in this paper.
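The construction of the undirected global item-link graph can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_global_graph`, the toy item ids, and the choice to skip self-loops from immediately repeated items are all assumptions.

```python
from collections import defaultdict

def build_global_graph(sequences):
    """Return {item: set(neighbors)}, linking items that are adjacent in any sequence."""
    neighbors = defaultdict(set)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            if a == b:
                continue  # assumption: ignore self-loops from repeated items
            # Undirected edge: each item becomes a neighbor of the other.
            neighbors[a].add(b)
            neighbors[b].add(a)
    return dict(neighbors)

# Two toy user sequences with hypothetical item ids.
graph = build_global_graph([[1, 2, 3], [4, 2, 5, 6]])
```

Here `graph[2]` collects every item that appears directly before or after item 2 in either sequence, which is the neighborhood used later by the channel-aware mechanism.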

Notations Descriptions
set of users and items
number of users and items
historical behaviors of user
number of decomposed channels
length of sequences and sliding windows
global item-link graph
set of item ’s neighbors
embedding dimensions of inputs and channels
initial embedding of items and positions
probability of the correlation between a pair of items regarding a channel
representation of item regarding channel
representation of global and local level layers
output of self-attention and VAE layer

mean and variance of Gaussian distribution in VAE layer
TABLE I: Details of notations

III-B Channel-aware Mechanism

Given a historical sequence, consider the latent intention of the user while interacting with each item. Assuming that a fixed number of factors are related to user intentions, we divide the latent representation into the same number of channels, with each channel corresponding to one factor independently. For each pair of adjacent items, the correlation between their representations in a given channel indicates how similar the two items are with respect to that factor, and also reveals why the two items are connected and how they influence each other.

III-C VAE for Disentangled Learning

The Variational Auto-encoder (VAE) is a generative model which models variables as random distributions based on Bayes' theorem. Let the latent representation be a fixed-dimensional variable sampled from the sequence. We aim to maximize the probability of the next item, that is, to maximize the probability of the whole sequence:


Since this probability is intractable, the variational inference method takes advantage of Bayes' theorem and introduces a variational posterior distribution to approximate the true posterior. Applied to sequential recommendation, the log-likelihood of the sequence can be derived as follows:
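For reference, the standard form of this bound is shown below. The symbols are assumptions taken from the usual VAE presentation rather than from this paper: x stands for the observed sequence, z for the latent representation, q_φ for the approximate posterior (encoder) and p_θ for the generative model (decoder).

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
\;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
```

The first term rewards accurate reconstruction, and the second keeps the approximate posterior close to the prior p(z).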


Equation 2 is the training objective of the variational auto-encoder, called the Evidence Lower BOund (ELBO). By maximizing the ELBO, the model obtains an approximate posterior distribution with which the encoder generates the latent representation.

In practice, the generative model assumes that the variables follow a Gaussian distribution and applies the reparameterization trick to calculate the gradient. The variables can then be written as an expression of the mean and the variance of the Gaussian distribution:
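A minimal sketch of the reparameterization trick in plain Python follows. Parameterizing the variance by its logarithm (`log_var`) is a common convention and an assumption here, as are the function name and the toy inputs.

```python
import math
import random

def reparameterize(mu, log_var, rng=random):
    """Sample z = mu + sigma * eps element-wise, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

# Toy mean and log-variance vectors (hypothetical values).
rng = random.Random(0)
z = reparameterize([0.5, -1.0], [0.0, -2.0], rng)
```

Because the randomness enters only through `eps`, the sample is a differentiable function of the mean and variance, which is what allows gradients to flow through the sampling step.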


β-VAE is a common modification of VAE. It introduces an adjustable hyperparameter β to the original objective of VAE:
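In the standard notation of [11] (the symbols x, z, q_φ and p_θ are assumptions about the notation used here), the modified objective weights the KL-divergence term by β:

```latex
\mathcal{L}_{\beta} \;=\;
\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
\;-\; \beta\,\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
```

Setting β = 1 recovers the ordinary ELBO, while β > 1 strengthens the pressure toward the factorized prior.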


Burgess et al. [2] discussed why β-VAE is able to learn an axis-aligned disentangled representation from the perspective of the information bottleneck. β acts as a constraint limiting the capacity of the bottleneck, and encourages β-VAE to improve data log-likelihood.

Furthermore, the KL-divergence term can be decomposed into three parts, following β-TCVAE [3]:
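As usually stated in [3], the decomposition reads as follows. The notation is an assumption borrowed from that presentation: n indexes the training samples, q(z) is the aggregated posterior, and z_j is the j-th latent dimension.

```latex
\mathbb{E}_{p(n)}\Big[\mathrm{KL}\big(q(z \mid n)\,\|\,p(z)\big)\Big]
= \mathrm{KL}\big(q(z, n)\,\|\,q(z)\,p(n)\big)
+ \mathrm{KL}\Big(q(z)\,\Big\|\,\prod_{j} q(z_j)\Big)
+ \sum_{j} \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big)
```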


The three terms above are referred to as the Mutual Information (MI), the Total Correlation (TC), and the dimension-wise KL respectively. A heavier penalty on the TC term forces the model to learn a factorized representation, each dimension of which is independent. Therefore, if we put a strong penalty on the KL-divergence by adjusting β, VAE can find statistically independent factors from the observed data.

IV Methodology

In this section, we will present our proposed model. Figure 3 illustrates its overall architecture, which consists of three parts: Global-level Disentanglement Layer, Local-level Disentanglement Layer, and Prediction Layer. We claim that sequential disentangled representation learning model should have three main characteristics: 1) items that have similar features should be close in the corresponding embedding space; 2) the changing of factors between linked items should reveal the intention transition of user behaviors; 3) separated representations should be independent of each other. We will discuss how the model realizes these purposes in detail in the following sections.

IV-A Global-level Disentanglement Layer

We first introduce the global-level disentangled representation learning layer based on the channel-aware mechanism, which is the key framework of our work. We build a global item-link graph from the training sequences, in which all item pairs that appear adjacently in the sequences are connected with undirected edges. We aim to extract the independent factors motivating user intentions and to find the degree of mutual influence between two linked items on these factors. Before introducing the mechanism, we first propose two hypotheses.

Hypothesis 1. There are high-level concepts associated with user intentions, which means there are latent factors to be disentangled.

Based on this hypothesis, given a global graph, we divide each node (i.e., each item) into one component per factor in the latent space, and the edges are divided into channels correspondingly. Each component is related to one factor of the user intention, and the corresponding channel indicates how that factor contributes to the linkage of pairwise items.

Hypothesis 2. Each factor indicates the degree of similarity between two linked items in terms of that factor, that is, the representations of the two items should be close in the corresponding latent subspace if they have similar characteristics regarding the factor.

Intuitively, for a pair of linked items, similarity is symmetric, which means that on the same factor, the degree of influence of the first item on the second is the same as that of the second on the first. Therefore, we can use undirected graphs instead of directed graphs to model the information transition.

Fig. 2: Illustration of the channel-aware mechanism, where one channel represents the factor 'color' and another channel represents the factor 'category'.
Fig. 3: Overall architecture of the proposed model. The sequential representation is combined with the global representation learned from the Global-level Disentangled Representation Learning Layer and the local representation learned from the SA-VAE Layer and Local-level Disentangled Learning Layer.

Next we introduce the channel-aware mechanism based on the above hypotheses, as illustrated in Figure 2. The item representations are divided into components by feeding the initial embeddings into separate learning layers, one per factor. The edges are decomposed into channels, and each channel transmits information of the corresponding item component. For a single node in the graph, we aim to aggregate information from its neighborhood. We first compute the probability that each factor influences the item through each of its neighbors:


where each factor has its own learning-layer parameter, applied to the initial embedding of a node and followed by a nonlinear activation function. The resulting probability reveals why a pair of items is linked and how the neighbor contributes to the target item over the given factor. The larger the probability is, the more similar the two items are on that factor, the more information is transmitted from the neighbor to the target item, and the wider the corresponding edge is in the graph. Moreover, the probabilities are normalized to ensure that the total width of each factor is the same.

Then we can accumulate information from the neighbors of an item according to the channel probabilities and update the item representation:


In order to ensure numerical stability, we apply L2-normalization:


By projecting item representations into different channels, we can aggregate item information from the perspective of different concepts. The global-level item representation can then be denoted as the combination of channels:


The neighborhood-based design of the channel-aware mechanism fulfills our first claim of disentanglement learning. Similar characteristics are passed through the corresponding channels to model item features, and different neighbors influence the target item differently. Taking Figure 2 as an example, two items are linked because they are both black, so information will be passed through the channel that corresponds to the factor 'color'. The representations of these two items should be close in the component related to 'color', but far apart in the component related to 'category', since one is a pair of trousers and the other is a shirt. Similarly, the representations of another linked pair should be close in the component of 'category', but far apart in the component of 'color'.
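The steps of this layer (per-channel projection, edge probabilities, weighted aggregation, normalization, and concatenation) can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions, not the paper's exact formulation: the tanh activation, the dot-product similarity, the softmax taken over channels for each edge, and the residual-style update are all assumptions, and the names `channel_aware_aggregate`, `emb`, and `weights` are hypothetical.

```python
import numpy as np

def channel_aware_aggregate(emb, neighbors, weights):
    """emb: (n_items, d); neighbors: {i: [j, ...]}; weights: (K, d, d_k)."""
    # Project every item into K channel-specific components: (K, n_items, d_k).
    comps = np.tanh(np.einsum("nd,kdc->knc", emb, weights))
    out = comps.copy()
    for i, nbrs in neighbors.items():
        nbrs = list(nbrs)
        if not nbrs:
            continue
        # Similarity of item i to each neighbor in each channel: (K, |N(i)|).
        scores = np.einsum("kc,kjc->kj", comps[:, i], comps[:, nbrs])
        # Assumed normalization: softmax over channels for every edge.
        scores = scores - scores.max(axis=0, keepdims=True)
        p = np.exp(scores)
        p /= p.sum(axis=0, keepdims=True)
        # Add the probability-weighted neighbor components to item i.
        out[:, i] += np.einsum("kj,kjc->kc", p, comps[:, nbrs])
    # L2-normalize each channel component, then concatenate the channels.
    out /= np.linalg.norm(out, axis=-1, keepdims=True) + 1e-12
    return np.concatenate(list(out), axis=-1)  # (n_items, K * d_k)

# Toy example: 4 items, embedding size 6, K = 2 channels of size 3.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 6))
W = rng.normal(size=(2, 6, 3))
h = channel_aware_aggregate(emb, {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}, W)
```

Each row of `h` holds one item's representation, with the first three entries belonging to the first channel and the last three to the second, so the similarity of two items can be inspected per factor.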

IV-B Local-level Disentanglement Layer

Considering that items appearing in one sequence are rarely repeated, we model local-level user intentions with sequential models instead of graphs. Given a sequence of a user's historical behavior, we transform it into a fixed-length sequence as training data. For users whose sequence is longer than the fixed length, we keep the most recent items, and for those whose sequence is shorter, we repeatedly add zero vectors to the left side of the sequence. In order to distinguish the item representations at different positions in the sequence, we add a learnable position embedding to the initial item embedding and take the sum as the input of the learning layer:
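The truncation and left-padding step can be sketched as follows. The helper name `pad_or_truncate` and the convention that item id 0 is reserved for padding (and mapped to a zero embedding vector) are assumptions for illustration; the target length `n` is assumed to be positive.

```python
def pad_or_truncate(seq, n, pad_id=0):
    """Keep the n most recent items; left-pad shorter sequences with pad_id."""
    recent = list(seq)[-n:]
    return [pad_id] * (n - len(recent)) + recent

short = pad_or_truncate([3, 7], 4)            # shorter than n: padded on the left
long = pad_or_truncate([1, 2, 3, 4, 5, 6], 4)  # longer than n: oldest items dropped
```

Padding on the left keeps the most recent interaction at the last position of every sequence, which is the position used to form the current user intention.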


IV-B1 SA-VAE Layer

We first apply the self-attention network [13] in our local learning model, taking advantage of its ability to capture both long- and short-range dependencies between items in a sequence. The scaled dot-product attention is defined following [35]:


where the scale factor, the square root of the input embedding dimension, keeps the inner-product values from becoming overly large, and the three inputs denote the queries, keys, and values respectively. They are generated from the layer input:


where the three projection matrices are the learnable parameters of the attention layer.
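A minimal NumPy sketch of scaled dot-product attention in the form given in [35] follows. The variable names, the random toy inputs, and the omission of any attention mask, dropout, residual connection, or layer normalization are simplifying assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d)) V together with the attention weights."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Subtract the row maximum for a numerically stable softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy input: a sequence of 5 positions with embedding size 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```

Each row of `attn` is a probability distribution over the positions of the sequence, so every output vector is a weighted mixture of all value vectors.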

By utilizing residual connection and layer normalization, we can propagate low-level features to the high-level ones and get the final output of the self-attention layer:


We then take the self-attention output as the input of the variational auto-encoder framework. Let the latent variable be sampled from the sequence and follow a Gaussian distribution. Following SVAE [30], we infer the posterior distribution as a multinomial layer. The mean and variance vectors are computed from the self-attention vectors as follows:



Here the mean and the variance are each produced by a linear transformation. By using the reparameterization trick introduced in Section III-C, the output of our SA-VAE layer is written as:



By sampling a random variable from a standard Gaussian distribution, the latent representation of the sequence is reparameterized, and we can handle the uncertainty of user behaviors.

IV-B2 Disentangled Learning Layer

After the self-attention variational auto-encoder model, we get item representations with normal distribution of the overall sequence. Then, we will apply the channel-aware aggregation mechanism for local-level disentanglement learning.

In the global-level learning layer, we assume that adjacent items have similar characteristics, so we apply the channel-aware mechanism to aggregate feature information for representation updating. When it comes to local level, we focus on the transition of user preference by modeling the variation of item factors. In order to obtain the transition features, we use the sliding window strategy based on graph neural network.

The sliding window strategy is a popular partitioning technique. In some search tasks, it converts a nested-loop problem into a single-loop problem, reducing time complexity. Specifically, the algorithm sets a fixed-size window that moves along the sequence axis from the first time step to the last, and executes the channel-aware algorithm within the window at each step. Taking Figure 3 as an example, the user sequence is arranged in chronological order. Suppose the window length is 4; the window first covers the 4 earliest interactions. We add edges between each of the first three items and the fourth item, and apply the channel-aware mechanism to calculate the similarities and degrees of influence between the three items and the fourth over the channels. Then we accumulate the feature information into the fourth item, achieving one step of information transmission. Next, the window slides forward by one position, and we repeat the above steps in this window. Finally, the item feature information is transferred to the last item through the channels.
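The window bookkeeping can be sketched as below: for every target position, the function lists the earlier positions that fall inside the window and would be linked to the target. The function name and the stride of one position per step are assumptions; positions are 0-indexed.

```python
def sliding_window_sources(seq_len, window):
    """Map each target position t to the earlier positions inside its window."""
    return {t: list(range(max(0, t - window + 1), t)) for t in range(seq_len)}

# A sequence of 6 interactions with window length 4, as in the example above.
pairs = sliding_window_sources(seq_len=6, window=4)
```

With a window of length 4, the fourth item (position 3) aggregates from positions 0 to 2, and the last item (position 5) aggregates from positions 2 to 4, so information propagates forward through the sequence in a single pass.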

The sliding window length is a hyperparameter. That means that for each target item, information is aggregated from the preceding items inside the window. The probability between the target item and each preceding item is calculated by Equation 6 based on the channel-aware mechanism, and channel information is aggregated as follows:


where the learning parameter of each channel is shared with the global-level layer. We also apply L2-normalization to the aggregated representation. Having aggregated information through the sliding window from the front of the sequence to the back, we can form the user intention at the current time step from the item embedding in the sequence. The local sequential representation is then the combination of the factors.

In summary, we learn a disentangled representation of the current sequence from both the channel and the statistical perspectives. The variational auto-encoder helps the model learn statistically independent latent representations over the whole sequence, which is discussed in the next part. The channel-aware sliding window strategy is able to distinguish the various factors of users. Different factors pass through channels that are related to different user intentions. The dominant factor changes along the sequence with the transition of user intentions, realizing characteristic (2) of our claim.

IV-C Predicting Layer

Based on the obtained representations learned from global and local level layers, the final sequential representation is written as:


where are combination parameters of predicting layer.

We can estimate the final recommendation probability of candidate items based on the current sequential embedding and the initial item embedding. Let

denote the prediction probability of item appearing as the next interaction in the current sequence:


Since we use the VAE framework in our model, the training objective is defined following the evidence lower bound:


The first term is regarded as the reconstruction error, measuring the accuracy between the prediction and ground truth . Here we compute the reconstruction loss using cross entropy:


The second term, , is used to measure the distance between posterior distribution and prior distribution. Practically, it is computed with the intermediate variables of VAE layer [15]:


We then discuss how our model can learn independent disentangled representations. According to Section III-C, the KL-divergence can be separated into three parts: the index-code mutual information, the total correlation, and the dimension-wise KL. The total correlation measures redundancy, i.e., the degree of interdependence between variables in the latent space. Therefore, applying the β-VAE framework in our model contributes to learning statistically independent factors of the data distribution, realizing our last claim of learning disentangled representations.
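The objective described above, a cross-entropy reconstruction term plus a weighted KL term, can be sketched as follows. This is a minimal sketch under stated assumptions: the posterior is a diagonal Gaussian parameterized by a mean and a log-variance, the prior is the standard normal, and the KL term uses the closed form from [15]. All function and parameter names are illustrative and are not taken from the authors' code.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax over candidate-item scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], target_index: int) -> float:
    """Reconstruction term: negative log-probability of the ground-truth item."""
    return -math.log(softmax(logits)[target_index])


def gaussian_kl(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def beta_vae_loss(logits, target_index, mu, log_var, beta):
    """Loss to minimize: reconstruction error plus beta-weighted KL penalty."""
    return cross_entropy(logits, target_index) + beta * gaussian_kl(mu, log_var)


# Hypothetical scores for three candidate items and a 2-dimensional latent code.
loss = beta_vae_loss([2.0, 0.5, -1.0], target_index=0,
                     mu=[0.3, -0.2], log_var=[-0.1, 0.2], beta=0.1)
print(f"loss = {loss:.4f}")
```

Setting `beta` to 0 removes the KL penalty and leaves only the reconstruction term, while larger values pull the posterior toward the prior.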

Input: initial item embeddings , channel number
Output: disentangled item embeddings

1:for each item  do
3:end for
4:for each item  do
5:     for each item  do
8:     end for
9:end for
10:for each item  do
12:end for
Algorithm 1 Channel-aware Algorithm

IV-D Complexity Analysis

IV-D1 Time Complexity

The time consumption of our model mainly consists of three parts. The first part is building the global graph: to construct it, we need to traverse every edge, which costs . The second part is the channel-aware mechanism, shown in Algorithm 1. For each channel, the cost of updating item embedding is . The third part is the sliding window strategy. The window slides from the start of each sequence to the end, which costs time of sequence length , and the local-level channel-aware mechanism costs . Therefore, the total time complexity of our model is .

IV-D2 Space Complexity

The space consumption of our model is mainly in the undirected graph and the channel-aware mechanism. In order to store the neighborhood of each node, we need an adjacency matrix of . For the channel-aware mechanism, there are parameters of dimension for all channels, and the final item embedding generated from channels is of dimension . Therefore, the total space complexity of our model is .

V Experiment

In this section, we present our experimental setup and results. First, we introduce the datasets and the evaluation metrics used in our experiments, followed by the eight baseline methods, which are related to VAE models or disentangled learning models. Next, we compare the experimental results of these baseline methods with our method under the same experimental setting to verify the effectiveness of our proposed model. Moreover, we evaluate the influence of each part of our model and the influence of the key parameters. Finally, we perform visualization experiments on the embeddings generated in the experiment, which indicate that our disentanglement model is able to distinguish intention factors in the latent space. Specifically, our experiments aim to answer the following questions:

  • RQ1: Does our proposed model outperform the state-of-the-art methods over various kinds of datasets?

  • RQ2: What is the influence of each component, i.e., the global and local layers in our model?

  • RQ3: How does our model realize the disentangled representation learning in the latent space?

  • RQ4: What is the influence of the hyperparameter setting on different datasets?

V-A Datasets

Dataset  Users  Items  Interactions  Sparsity (%)
ML-1M    6040   3416   999611        95.15
Beauty   52204  57289  394908        99.98
Games    31013  23715  287107        99.96
TABLE II: Details for three datasets

We adopt three real-world datasets to evaluate the effectiveness of our method. MovieLens is a time-series dataset containing users' ratings of movies. We use the MovieLens-1M version, which includes 1 million user ratings. Amazon is an e-commerce dataset which contains users' purchasing behaviors across a wide range of products. We choose two categories, 'Beauty' and 'Video Games', and use the 5-core version for our experiment.
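The last column of Table II can be cross-checked from the other three columns. The sketch below assumes that this column reports the percentage of empty user-item cells, i.e. 100 × (1 − interactions / (users × items)). Under that assumption the computed values agree with the listed ones to within 0.01 percentage points. The function name is illustrative.

```python
def sparsity_percent(users: int, items: int, interactions: int) -> float:
    """Percentage of user-item pairs with no recorded interaction."""
    density = interactions / (users * items)
    return 100.0 * (1.0 - density)


# (users, items, interactions) as listed in Table II.
datasets = {
    "ML-1M": (6040, 3416, 999611),
    "Beauty": (52204, 57289, 394908),
    "Games": (31013, 23715, 287107),
}

for name, (users, items, interactions) in datasets.items():
    print(f"{name}: {sparsity_percent(users, items, interactions):.4f}%")
```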

We use timestamps to arrange the sequence order, that is, the items interacted with by the same user are arranged in sequence according to their interaction time. Following the previous work [13], we split the data into three parts: the last interacted item for testing, the second-to-last interacted item for validation, and the remaining items for training. We regard the training sequence of length as sub-sequences, and the last element of each sub-sequence is regarded as the training ground truth. In validation and testing, we use the last item of the sequence as ground truth, together with 100 randomly sampled negative items. The detailed statistics are shown in Table II. The average sequence length of each dataset is 163.5, 5.63 and 7.26 respectively.
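The split described above can be sketched as follows. This is an illustration under an explicit assumption: the training sub-sequences are taken to be the prefixes of the training sequence, each with its last element held out as the label. The text does not spell out the exact sub-sequence construction, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def leave_one_out_split(sequence: Sequence[int]) -> Tuple[List[int], int, int]:
    """Split a chronologically ordered item list into (train, validation, test)."""
    if len(sequence) < 3:
        raise ValueError("need at least three interactions to split")
    items = list(sequence)
    return items[:-2], items[-2], items[-1]


def training_subsequences(train: Sequence[int]) -> List[Tuple[List[int], int]]:
    """Assumed construction: every prefix of length >= 2, split into (input, label)."""
    items = list(train)
    return [(items[:end - 1], items[end - 1]) for end in range(2, len(items) + 1)]


train, valid_item, test_item = leave_one_out_split([11, 12, 13, 14, 15])
print(train, valid_item, test_item)
for inputs, label in training_subsequences(train):
    print(inputs, "->", label)
```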

V-B Metrics

We adopt two ranking based metrics to evaluate the recommendation performance: Normalized Discounted Cumulative Gain (NDCG) and Recall. The larger the values of metrics are, the better the performance is. We refer to the two metrics as N@K and R@K for short.

  • NDCG is a ranking metric which takes into account the position of correctly recommended items. It is defined as follows:


    where DCG is the Discounted Cumulative Gain. We want the most relevant items to appear at the top of the list, so before summing the scores, we divide each item's score by a discount that grows with its position. IDCG is the ideal DCG, which sorts the results into the best possible order and calculates the DCG of the query under this arrangement. They are defined as:


    where represents the relevance of the item, which is either 1 or 0, and is the set of relevant items.

  • Recall describes the percentage of rated items that are actually preferred by users included in the recommendation list. It defines a recommendation list of top predicted items for a user as , and uses to represent the corresponding test set. The percentage of rated items is then computed as:
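For reference, under the evaluation protocol of Section V-A (one ground-truth item ranked against 100 sampled negatives), both metrics reduce to simple functions of the rank of the ground-truth item. The sketch below assumes binary relevance and the standard logarithmic discount log2(position + 1). With a single relevant item the ideal DCG is 1, so NDCG@K equals 1 / log2(rank + 1) when the item is in the top K and 0 otherwise. Function names are illustrative.

```python
import math
from typing import Sequence


def rank_of_positive(positive_score: float, negative_scores: Sequence[float]) -> int:
    """1-based rank of the ground-truth item; ties are counted against it."""
    return 1 + sum(1 for score in negative_scores if score >= positive_score)


def recall_at_k(rank: int, k: int) -> float:
    """1.0 if the single ground-truth item appears in the top-k list, else 0.0."""
    return 1.0 if rank <= k else 0.0


def ndcg_at_k(rank: int, k: int) -> float:
    """NDCG@k for one relevant item: DCG = 1 / log2(rank + 1), IDCG = 1."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


# Hypothetical scores: the ground-truth item is beaten by one negative.
rank = rank_of_positive(0.90, [0.10, 0.95, 0.50])
print(rank, recall_at_k(rank, 5), round(ndcg_at_k(rank, 5), 4))
```

The reported N@K and R@K values would then be averages of these per-sequence scores over all test sequences.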

Dataset  Metric  POP     BPR     FPMC    TransRec  Caser   SASRec  DSS¹    VSAN    EGD-GNN
ML-1M    N@5     0.1428  0.1856  0.2726  0.2816    0.2175  0.3922  0.2119  0.4224  0.4571
ML-1M    R@5     0.2546  0.2972  0.4081  0.4166    0.3389  0.5450  0.3202  0.5851  0.6161
ML-1M    N@10    0.1863  0.2440  0.3284  0.3352    0.2709  0.4419  0.2562  0.4593  0.5012
ML-1M    R@10    0.4086  0.4738  0.5806  0.5826    0.5045  0.6982  0.4573  0.7054  0.7517
Beauty   N@5     0.0483  0.0915  0.1429  0.1726    0.1928  0.2166  0.2139  0.2285  0.2380
Beauty   R@5     0.0754  0.1299  0.2220  0.2467    0.2787  0.3013  0.3155  0.3332  0.3158
Beauty   N@10    0.0659  0.1448  0.1839  0.2049    0.2295  0.2495  0.2527  0.2575  0.2710
Beauty   R@10    0.1303  0.2425  0.3492  0.3471    0.3923  0.4030  0.4351  0.4279  0.4180
Games    N@5     0.1695  0.2195  0.2004  0.2428    0.2671  0.3978  0.2589  0.3956  0.4303
Games    R@5     0.2845  0.3204  0.3250  0.3468    0.3821  0.5377  0.3714  0.5491  0.5599
Games    N@10    0.2082  0.2606  0.2583  0.2845    0.3123  0.4388  0.3052  0.4288  0.4668
Games    R@10    0.3605  0.4459  0.4462  0.4762    0.5217  0.6641  0.5152  0.6672  0.6723
  ¹ The code of DSS has not been released by its authors, so we re-implemented it according to the paper.

TABLE III: Results of recommendation performance

V-C Baselines

We compare our method with the following competitive baselines, with particular emphasis on VAE-based and disentangled learning methods. All the baselines put emphasis on sequential recommendation tasks.

  • 1) POP: a classical method that ranks items according to their popularity.

  • 2) BPR: Bayesian Personalized Ranking [28], a classical model based on Matrix Factorization. It designs a pair-wise optimization method to learn pairwise item rankings from implicit feedback.

  • 3) FPMC: Factorized Personalized Markov Chains [29], a method combining Matrix Factorization with a first-order Markov Chain. It introduces a personalized transition matrix based on the Markov chain to capture time information and uses matrix factorization to address the sparsity of the transition matrix.

  • 4) TransRec: Translation based Recommendation [8]. It embeds items into a transition space and models each user as a transition vector to obtain the ’three-order’ relationships, i.e., the interactions between a user, the previous visited items and the next item.

  • 5) Caser: Convolutional Sequence Embedding Recommendation [32]. The main idea is to form an ’image’ with the most recent items of a sequence in time and latent spaces, and apply Convolutional Neural Network (CNN) to learn the high-order sequential patterns as the local feature of the image.

  • 6) SASRec: Self-Attention based Sequential Recommendation [13]. By applying the self-attention mechanism into sequential problems, the model can not only capture the long term information like RNNs but also handle the short term patterns in terms of small number of behaviors like MCs.

  • 7) DSS: Disentangled Self-Supervision [25], the first model that focuses on disentangled representation learning for sequential recommendation. It designs a Disentangled Sequence Encoder to disentangle user intention in the latent space over sub-sequences and proposes a seq2seq self-supervised strategy for training.

  • 8) VSAN: Variational Self-attention Network [39]. It combines the self-attention mechanism with variational inference for sequential recommendation to model the long-range and short-range dependencies of sequences.

V-D Experiment Setting

We conduct experiments with PyTorch. In the experiments, the dimension of item embedding of all the methods is set to 100. The channel embedding dimension of our model is set to 20. We set the batch size to 128 and the learning rate to 0.002. We limit the maximum sequence length to 200 for the MovieLens dataset and 50 for Amazon. The dropout rate is set to 0.5 for both the global and local layers. A single-head self-attention network is used as the sequential encoder. We use random seeds for the generation of the Gaussian distribution and report the average performance over five runs.
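For convenience, the settings listed above can be collected in one configuration object. The values come from the paragraph above. The key names and the helper function are hypothetical and do not correspond to any released code.

```python
# Hyper-parameters as stated in the text; key names are illustrative only.
EXPERIMENT_CONFIG = {
    "item_embedding_dim": 100,
    "channel_embedding_dim": 20,
    "batch_size": 128,
    "learning_rate": 0.002,
    "max_sequence_length": {"MovieLens": 200, "Amazon": 50},
    "dropout": {"global": 0.5, "local": 0.5},
    "attention_heads": 1,
    "num_runs": 5,
}


def max_length_for(dataset_family: str) -> int:
    """Look up the sequence-length cap for a dataset family."""
    return EXPERIMENT_CONFIG["max_sequence_length"][dataset_family]


print(max_length_for("MovieLens"), max_length_for("Amazon"))
```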

V-E Performance Analysis

To evaluate the effectiveness of our proposed model, we perform next-item recommendation with our model and the baselines under the same experimental setting. Specifically, we predict the item a user may be interested in at time based on the former items and choose the item in each sequence as ground truth for the metrics. Table III records the performance results. We compare and analyze the results in detail in this section.

First, we can observe from the table that our method outperforms the baselines on all metrics for ML-1M and Games and on NDCG for Beauty; on Beauty, VSAN (R@5 and R@10) and DSS (R@10) obtain higher Recall. Our model gets better predicting results than the classical baselines, POP, BPR, and FPMC, since we take complex sequential interaction information into account. In terms of models based on neural networks, we can see that SASRec performs better than TransRec and Caser, indicating that the self-attention network captures more sequential semantics with both long and short term patterns.

The VSAN method proposes a new self-attention network with a variational auto-encoder and achieves the second-best results in most of our experiments. This suggests that capturing the long and short range dependencies together with the attention-based network and the statistical method does help the model get better prediction results. The effectiveness of the stochastic method, variational inference, in reducing the random noise in user behaviors is also supported. Although the disentangled self-supervised method performs well on the Beauty dataset, it does not have good results on the other two datasets. Nevertheless, its good performance on the Beauty dataset supports the effectiveness of learning disentangled user intention over sequences. In DSS, one sequence behavior is encoded into one kind of user intention, ignoring the various factors hidden behind item transitions. Instead of disentangling user intention over whole sequences, we focus on the intention transition between pairs of items, and our model therefore gets better predicting results than the previous work in most cases.

Turning back to our proposed model, we get the best experimental results in most circumstances. In particular, its relative improvements over the strongest baselines w.r.t. NDCG@5 are 8.21%, 4.16%, and 8.17% for the three datasets respectively. Compared with VSAN, our model builds a global item-link graph and disentangles the influential factors into channels for representation updating. Compared with DSS, our model pays more attention to the item-item relationship. We form the user intention taking advantage of the transformer ability of self-attention instead of modeling the whole sequence with an encoder. These improvements suggest that our model can obtain user intention in an adaptive way and find more suitable items that users may be interested in.
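The relative improvements quoted above can be reproduced from the N@5 rows of Table III as (ours − best baseline) / best baseline. The sketch below lists only SASRec, DSS and VSAN for brevity, since every other baseline scores lower on N@5. The function name is illustrative.

```python
from typing import Dict


def relative_improvement(ours: float, baselines: Dict[str, float]) -> float:
    """Percentage gain of `ours` over the strongest baseline score."""
    best = max(baselines.values())
    return 100.0 * (ours - best) / best


# N@5 values copied from Table III: (EGD-GNN, strongest baselines).
ndcg_at_5 = {
    "ML-1M": (0.4571, {"SASRec": 0.3922, "DSS": 0.2119, "VSAN": 0.4224}),
    "Beauty": (0.2380, {"SASRec": 0.2166, "DSS": 0.2139, "VSAN": 0.2285}),
    "Games": (0.4303, {"SASRec": 0.3978, "DSS": 0.2589, "VSAN": 0.3956}),
}

for dataset, (ours, baselines) in ndcg_at_5.items():
    print(f"{dataset}: {relative_improvement(ours, baselines):.2f}%")
```

Note that the strongest N@5 baseline is VSAN on ML-1M and Beauty but SASRec on Games.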

Second, we achieve the largest improvements across all the metrics on the MovieLens dataset. This indicates that by introducing the channel-aware mechanism, the model is able to capture more item-link relationship information that is hard for previous works to capture. Moreover, by composing several high-level concepts, the movie items, which have few explicit features, are classified into some implicit categories, and the model can obtain the user intention from various high-level perspectives to predict users' true preferences.

Third, we find that our model reaches high values very early compared with the baselines, as shown in Figure 4. The trends of the experimental results over 10 epochs indicate that our model can get good prediction results within the first five epochs. Even though the time complexity of our model is larger than that of the state-of-the-art baselines, we can still get high prediction results within a short time. That means by distinguishing the latent factors hidden behind sequences, the model can learn item representations over various factors and find dynamic user intentions that are not shown explicitly.

Fig. 4: The trend of performance results in 10 epochs. Panels: (a) ML-1M, (b) Beauty.

Dataset  Metric  Global  Local   SA-VAE  SliWin  EGD-GNN
ML-1M    N@10    0.4641  0.4772  0.2900  0.4628  0.5012
ML-1M    R@10    0.7142  0.7377  0.5311  0.7065  0.7517
Beauty   N@10    0.2562  0.2437  0.2274  0.2420  0.2710
Beauty   R@10    0.4043  0.4089  0.3990  0.4065  0.4180
Games    N@10    0.4445  0.3988  0.3370  0.3665  0.4668
Games    R@10    0.6518  0.6242  0.5630  0.6050  0.6723
TABLE IV: Influence of each part of model

Dataset  Metric  β=0     β=0.1   β=0.5   β=1     β=2
ML-1M    N@10    0.4875  0.5012  0.4920  0.4881  0.4987
ML-1M    R@10    0.7425  0.7517  0.7434  0.7478  0.7441
Beauty   N@10    0.2559  0.2587  0.2606  0.2710  0.2538
Beauty   R@10    0.3995  0.4078  0.4051  0.4180  0.4021
Games    N@10    0.4472  0.4668  0.4625  0.4613  0.4616
Games    R@10    0.6671  0.6723  0.6699  0.6685  0.6659
TABLE V: Influence of penalty on KL-divergence

V-F Ablation Study

V-F1 Influence of each layer of model

We first implement ablation studies to evaluate the effectiveness of each part of our model. Specifically, we perform four ablation experiments as follows:

  • Global only: remove the local-level learning layer, i.e., the SA-VAE layer and sliding window layer, only perform with the global-level graph.

  • Local only: remove the global link graph and corresponding learning layer, only perform with the local-level layer.

  • SA-VAE only: remove the sliding window strategy part, only reserve the self-attention and variational auto-encoder layers.

  • SliWin only: remove the self-attention and variational auto-encoder layers, only reserve the sliding window mechanism for local-level learning.

Table IV lists the results of the ablation studies, showing how each part influences the final performance of our model. The global and local-level layers both contribute to the improvement of our model. On the Amazon datasets, the global-level layer generally performs better than the local-level layer (the exception is R@10 on Beauty), indicating that the relationship between product items is close and worth exploring.

Besides, the sliding window strategy gets better prediction results than the SA-VAE layer on all three datasets. This suggests that the channel-aware mechanism plays a crucial role in disentangling user intentions over different factors. Moreover, we can observe that the improvement from the channel-aware mechanism is especially large on the MovieLens dataset, since our model can capture much relevant information between items and explore high-level factors even on a small-scale dataset.

V-F2 Influence of penalty on KL-divergence

Then we implement ablation experiments on the variational auto-encoder framework. As introduced in the previous sections, the parameter β acts as a penalty on the KL-divergence term, which pushes the model to find independent latent variables. Therefore, we evaluate the role of β in disentangled representation learning. We vary β from 0 to 2 and list the results in Table V. When β is 0, the results are among the worst on every dataset, since the variational auto-encoder degenerates to a plain auto-encoder. When β is too large, the results also decrease, since the posterior distribution is pushed close to the standard normal distribution. Therefore, we need to find a suitable value to strike a balance between reconstruction accuracy and disentangled learning.

Fig. 5: Visualization of the item embeddings on Amazon datasets. Panels: (a) Beauty, (b) Games.

V-G Visualization of Node Representations

In order to analyze the performance of learning disentangled representation, we visualize the item representations using the t-SNE [34] algorithm on the two Amazon datasets. In detail, we learn the global-level item embeddings based on the channel-aware mechanism and project the embeddings into a 2-dimension space. We choose the channel with the largest embedding value, i.e., , as the item category and color the nodes based on their categories in Figure 5.
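The category assignment used for coloring can be sketched as follows. This is only an illustration under explicit assumptions: the global-level item embedding is taken to be a concatenation of equally sized channel chunks, and "largest embedding value" is read as the chunk with the largest L2 norm. The text does not specify either detail, and the function name is hypothetical. The subsequent t-SNE projection is not shown.

```python
import math
from typing import Sequence


def dominant_channel(embedding: Sequence[float], num_channels: int) -> int:
    """Index of the channel chunk with the largest L2 norm (assumed criterion)."""
    if num_channels <= 0 or len(embedding) % num_channels != 0:
        raise ValueError("embedding length must be a multiple of num_channels")
    size = len(embedding) // num_channels
    norms = [
        math.sqrt(sum(x * x for x in embedding[c * size:(c + 1) * size]))
        for c in range(num_channels)
    ]
    # Ties resolve to the lowest channel index.
    return max(range(num_channels), key=norms.__getitem__)


# Hypothetical 6-dimensional embedding split into 3 channels of size 2.
item_embedding = [0.1, 0.1, 0.9, 0.2, 0.3, 0.3]
print(dominant_channel(item_embedding, num_channels=3))
```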

We can observe that on both datasets, the items with the same categories are close in the latent space, indicating they share similar features. Meanwhile, the factors which have close relationships are close in the latent space as well. Taking the Beauty dataset as an example, the items colored in pink are close to the items colored in orange and cyan. This indicates that the items which have these three features share similar characteristics. When a user interacts with an item colored pink, he/she is likely to choose an item colored in orange or cyan. The Games dataset also shows the same characteristics. This visual experiment again proves the effectiveness of learning disentangled representation from an intuitive perspective, and also shows its ability in enhancing the interpretability of model. In summary, learning disentangled representations based on item edges can not only observe the underlying features between items, but also help the model predict what users may like.

Fig. 6: Effect of the number of channels. Panels: (a) ML-1M, (b) Beauty, (c) Games.
Fig. 7: Effect of the length of sliding window. Panels: (a) ML-1M, (b) Beauty, (c) Games.

V-H Hyper-parameter Sensitivity

The most important hyper-parameters in our model are the number of channels and the length of sliding window . Specifically, we fix the other parameters and vary one hyper-parameter at a time in fixed steps. We record the predicting results on the three datasets and draw line charts to show their impacts, which we analyze in this section.

V-H1 Impact of the number of channels

We adjust the number of channels from 5 to 35 in steps of 5 and show the results in Figure 6. We can see that the recommendation performance improves as the channel number increases, and tends to remain unchanged after reaching the peak. The MovieLens dataset reaches the peak later than the Amazon dataset. The reason may be that product items do not have as many attributes as the movie items have, and the model does not require too many classifications to achieve the best results.

V-H2 Impact of the length of sliding window

We adjust the length of the sliding window from 5 to 25 in steps of 5 and show the results in Figure 7. We can observe that the influence of the window length is quite different from that of the number of channels. On the MovieLens dataset, the performance improves slightly as the window length grows, but there is no such trend on the Amazon datasets. Therefore, we speculate that the choice of window length is not a major factor in the recommendation results.

VI Conclusion

In this paper, we proposed an edge-enhanced model based on graph neural networks to learn sequential representations at both global and local levels. We designed a disentangled learning layer, i.e., the channel-aware mechanism, to distinguish various factors which motivate user intentions. The mechanism divided the information transition model into several channels and aggregated item information through different channels. At the global level, we built a global item-link graph based on training data and updated item feature information through neighborhoods. At the local level, we applied a variational auto-encoder framework to infer user behaviors as distributions, taking advantage of its statistical ability in learning disentangled representations. Then we adopted a sliding window strategy along with the channel-aware mechanism to capture the transition of user intentions through sequences. Experimental results showed that our proposed method achieves better performance than previous works. It is notable that user information is also important for learning disentangled representations. Therefore, we will consider adding user nodes in further studies.


This research was partially supported by NSFC (No. 61876117, 61876217, 61872258, 61728205), ESP of the State Key Laboratory of Software Development Environment, and PAPD of Jiangsu Higher Education Institutions.


  • [1] R. v. d. Berg, T. N. Kipf, and M. Welling (2017) Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263. Cited by: §I.
  • [2] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner (2018) Understanding disentangling in beta-vae. arXiv preprint arXiv:1804.03599. Cited by: §I, §III-C.
  • [3] R. T. Chen, X. Li, R. Grosse, and D. Duvenaud (2018) Isolating sources of disentanglement in variational autoencoders. In NeurIPS, pp. 2615–2625. Cited by: §II-B, §III-C.
  • [4] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS, pp. 2180–2188. Cited by: §I, §II-B.
  • [5] X. Chen, H. Xu, Y. Zhang, J. Tang, Y. Cao, Z. Qin, and H. Zha (2018) Sequential recommendation with user memory networks. In WSDM, pp. 108–116. Cited by: §I, §II-A.
  • [6] C. Cheng, H. Yang, M. R. Lyu, and I. King (2013) Where you like to go next: successive point-of-interest recommendation. In Twenty-Third international joint conference on Artificial Intelligence. Cited by: §I, §II-A.
  • [7] M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. NeurIPS 29, pp. 3844–3852. Cited by: §II-C.
  • [8] R. He, W. Kang, and J. McAuley (2017) Translation-based recommendation. In Proceedings of the eleventh ACM conference on recommender systems, pp. 161–169. Cited by: §II-A, 4th item.
  • [9] R. He and J. McAuley (2016) Fusing similarity models with markov chains for sparse sequential recommendation. In ICDM, pp. 191–200. Cited by: §II-A.
  • [10] X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, and M. Wang (2020) Lightgcn: simplifying and powering graph convolution network for recommendation. In SIGIR, pp. 639–648. Cited by: §II-C.
  • [11] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-vae: learning basic visual concepts with a constrained variational framework. ICLR. Cited by: §I, §II-B.
  • [12] J. Hsieh, B. Liu, D. Huang, L. Fei-Fei, and J. C. Niebles (2018) Learning to decompose and disentangle representations for video prediction. In NeurIPS, pp. 517–526. Cited by: §I, §II-B.
  • [13] W. Kang and J. McAuley (2018) Self-attentive sequential recommendation. In ICDM, pp. 197–206. Cited by: §I, §II-A, §IV-B1, 6th item, §V-A.
  • [14] H. Kim and A. Mnih (2018) Disentangling by factorising. In ICML, pp. 2649–2658. Cited by: §II-B.
  • [15] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. ICLR. Cited by: §IV-C.
  • [16] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. ICLR. Cited by: §II-C.
  • [17] Y. Koren, R. Bell, and C. Volinsky (2009) Matrix factorization techniques for recommender systems. Computer 42 (8), pp. 30–37. Cited by: §I, §II-A.
  • [18] Y. Koren (2008) Factorization meets the neighborhood: a multifaceted collaborative filtering model. In SIGKDD, pp. 426–434. Cited by: §II-A.
  • [19] X. Li and J. She (2017) Collaborative variational autoencoder for recommender systems. In SIGKDD, pp. 305–314. Cited by: §I.
  • [20] D. Liang, R. G. Krishnan, M. D. Hoffman, and T. Jebara (2018) Variational autoencoders for collaborative filtering. In WWW, pp. 689–698. Cited by: §I.
  • [21] L. Lin and H. Wang (2020) Graph attention networks over edge content-based channels. In SIGKDD, pp. 1819–1827. Cited by: §II-B, §II-C.
  • [22] F. Locatello, S. Bauer, M. Lucic, G. Raetsch, S. Gelly, B. Schölkopf, and O. Bachem (2019) Challenging common assumptions in the unsupervised learning of disentangled representations. In ICML, pp. 4114–4124. Cited by: §I.
  • [23] J. Ma, P. Cui, K. Kuang, X. Wang, and W. Zhu (2019) Disentangled graph convolutional networks. In ICML, pp. 4212–4221. Cited by: §II-C.
  • [24] J. Ma, C. Zhou, P. Cui, H. Yang, and W. Zhu (2019) Learning disentangled representations for recommendation. In NeurIPS, pp. 5712–5723. Cited by: §I, §II-B.
  • [25] J. Ma, C. Zhou, H. Yang, P. Cui, X. Wang, and W. Zhu (2020) Disentangled self-supervision in sequential recommenders. In SIGKDD, pp. 483–491. Cited by: §I, §II-B, 7th item.
  • [26] Z. Mu, S. Tang, J. Tan, Q. Yu, and Y. Zhuang (2021) Disentangled motif-aware graph learning for phrase grounding. In AAAI, Cited by: §I.
  • [27] R. Qiu, J. Li, Z. Huang, and H. Yin (2019) Rethinking the item order in session-based recommendation with graph neural networks. In CIKM, pp. 579–588. Cited by: §II-C.
  • [28] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme (2009) BPR: bayesian personalized ranking from implicit feedback. In International Conference on Uncertainty in Artificial Intelligence, pp. 452–461. Cited by: 2nd item.
  • [29] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme (2010) Factorizing personalized markov chains for next-basket recommendation. In WWW, pp. 811–820. Cited by: §I, §II-A, 3rd item.
  • [30] N. Sachdeva, G. Manco, E. Ritacco, and V. Pudi (2019) Sequential variational autoencoders for collaborative filtering. In WSDM, pp. 600–608. Cited by: §II-A, §IV-B1.
  • [31] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, and P. Jiang (2019) BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer. In CIKM, pp. 1441–1450. Cited by: §I, §II-A.
  • [32] J. Tang and K. Wang (2018) Personalized top-n sequential recommendation via convolutional sequence embedding. In WSDM, pp. 565–573. Cited by: §II-A, 5th item.
  • [33] B. Taskar, M. Wong, P. Abbeel, and D. Koller (2003) Link prediction in relational data. NeurIPS 16, pp. 659–666. Cited by: §II-C.
  • [34] L. Van der Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (11). Cited by: §V-G.
  • [35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, pp. 5998–6008. Cited by: §IV-B1.
  • [36] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. ICLR. Cited by: §II-C.
  • [37] X. Wang, X. He, M. Wang, F. Feng, and T. Chua (2019) Neural graph collaborative filtering. In SIGIR, pp. 165–174. Cited by: §I.
  • [38] X. Wang, H. Jin, A. Zhang, X. He, T. Xu, and T. Chua (2020) Disentangled graph collaborative filtering. In SIGIR, pp. 1001–1010. Cited by: §I, §II-B.
  • [39] J. Zhao, P. Zhao, L. Zhao, Y. Liu, V. S. Sheng, and X. Zhou (2021) Variational self-attention network for sequential recommendation. In ICDE, pp. 1559–1570. Cited by: 8th item.
  • [40] Y. Zheng, C. Gao, X. Li, X. He, Y. Li, and D. Jin (2021) Disentangling user interest and conformity for recommendation with causal embedding. In WWW, pp. 2980–2991. Cited by: §II-B.
  • [41] Y. Zhu, H. Li, Y. Liao, B. Wang, Z. Guan, H. Liu, and D. Cai (2017) What to do next: modeling user behaviors by time-lstm. In IJCAI, Vol. 17, pp. 3602–3608. Cited by: §I, §II-A.