CmnRec: Sequential Recommendations with Chunk-accelerated Memory Network

04/28/2020 ∙ by Shilin Qu, et al. ∙ Northeastern University, Huazhong University of Science & Technology, and Tencent

Recently, Memory-based Neural Recommenders (MNR) have demonstrated superior predictive accuracy in the task of sequential recommendation, particularly for modeling long-term item dependencies. However, typical MNR requires complex memory access operations, i.e., both writing and reading via a controller (e.g., RNN), at every time step. These frequent operations dramatically increase network training time, making MNR difficult to deploy in industrial-scale recommender systems. In this paper, we present a novel, general chunk framework to significantly accelerate MNR. Specifically, our framework divides proximal information units into chunks and performs memory access only at certain time steps, whereby the number of memory operations can be greatly reduced. We investigate two ways to implement effective chunking, i.e., PEriodic Chunk (PEC) and Time-Sensitive Chunk (TSC), to preserve and recover important recurrent signals in the sequence. Since a chunk-accelerated MNR takes into account more proximal information units than a single time step, it can largely remove the influence of noise in the item sequence and thus improve the stability of MNR. In this way, the proposed chunk mechanism leads not only to faster training and prediction, but even to slightly better results. Experimental results on three real-world datasets (weishi, ml-10M and ml-latest) show that our chunk framework notably reduces the running time of MNR (e.g., up to 7x speedup for training and 10x for inference on ml-latest), while achieving competitive performance.







1. Introduction

With the rapid development of Web 2.0, data is produced and streamed at an ever-increasing speed. Meanwhile, Internet users can easily access various online products and services, which results in a large amount of action feedback. This extensive user feedback provides a fundamental information source for building recommender systems, which assist users in finding relevant products or items of interest. Since users generally access items in chronological order, the item a user will interact with next may be closely related to the items accessed in a previous time window. The literature has shown that it is valuable to consider time information and preference drift for better recommendation performance (Hidasi et al., 2015; Quadrana et al., 2017; Tang and Wang, 2018; Yuan et al., 2019; Ma et al., 2019). In this paper, we focus on the task of sequential (a.k.a., session-based) recommendation, which is built upon the historical behavior trajectories of users.

A critical challenge for sequential recommendation is to effectively model the preference dynamics of users given the behavior sequence. Among all the existing methodologies, Recurrent Neural Networks (RNN) have become the most prevalent approaches with remarkable success (Hidasi et al., 2015; Quadrana et al., 2017). Different from feedforward networks, the internal state of an RNN is preserved and updated over time, which endows it with the ability to process sequences. However, learning vanilla RNN for long-term dependencies remains a fundamental challenge due to the vanishing gradient problem (Bengio et al., 1994), and long-range user sessions widely exist in real applications. For example, users on TikTok can watch hundreds of micro-videos in an hour since the average playing time of each video is only 15 seconds. To model long-term item dependencies for the sequential recommendation problem, previous attempts have introduced Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and Gated Recurrent Units (GRU) (Hidasi et al., 2015), temporal convolutional architectures with dilated layers (Yuan et al., 2019; Fajie et al., 2019), attention mechanisms (Vaswani et al., 2017; Kang and McAuley, 2018), and external memory (Chen et al., 2018; Wang et al., 2018).

Among these advanced methods, the External Memory Network (EMN) (Sukhbaatar et al., 2015; Graves et al., 2016) most closely resembles the human cognitive architecture due to its external memory mechanism. EMN is composed of a neural controller, e.g., an RNN, and an external memory, and can be regarded as an extension of standard RNNs, including LSTM and GRU. Unlike RNN, EMN stores useful past information in external memory rather than in a squeezed vector. EMN has shown high potential in areas such as visual reasoning (Johnson et al., 2017), question answering (Seo et al., 2016), and natural language processing (Cai et al., 2017). Since 2018, researchers have started to apply it in the field of recommendation to improve the accuracy of existing recurrent models (Chen et al., 2018; Wang et al., 2018; Ebesu et al., 2018; Huang et al., 2018); such models are referred to in the following as Memory-based Neural Recommenders (MNR).

Figure 1. Illustration of the chunk mechanism for better memorization. Digits and letters are chunked into numbers and words for faster and easier remembering.

In order to remember more information, all existing MNR implementations must repeat the memory access operations, including both reading and writing, at every time step. These read and write operations are much more expensive than the controller in terms of time complexity, which becomes a severe efficiency problem when modeling long-range sequences. One possible way to speed up MNR is to optimize the specific memory operations directly. However, there are many different implementations of the access operations, and a clear drawback of such methods is that they are not general. In this paper, we focus on developing a general acceleration framework that applies to various types of MNR.

Our core idea for accelerating MNR is originally inspired by the chunk technique in cognitive psychology (Gerrig, 2013). Psychology introduced the concept of chunk to improve human memory: a chunk is a meaningful unit of information that can be reorganized based on certain rules. For example, given the letter sequence "m-e-m-o-r-y", we can remember it as six separate letters, or memorize it as the single word "memory", as illustrated in Figure 1. The latter method greatly reduces our memory burden while maintaining the same amount of information. As such, we believe that applying the chunk strategy to MNR is a promising way to address its efficiency issue.

In this paper, we propose a sequential recommendation framework with a chunk-accelerated memory network (CmnRec for short), which speeds up the memory network by reducing the number of memory operations. Our chunk framework consists of the chunk region, the chunk rule and the attention machine. Specifically, the chunk region temporarily stores the information units (the output vectors of the controller) generated at non-chunk times; the chunk rule determines when (i.e., the chunk times) to perform memory operations; and the attention machine extracts the most valuable information in the chunk region, generating a new information unit with which to perform the memory operations. Through these modules and rules, chunking compresses the ingested information in advance with high quality, which not only reduces the workload of memorization but also improves memorization efficiency.
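The interplay of the chunk region, chunk rule and attention machine can be sketched as a simple control loop. The following is a minimal illustration, not the paper's implementation; `controller_step`, `attention`, `mem_read` and `mem_write` are hypothetical placeholders for the actual components.

```python
def chunk_forward(items, chunk_times, controller_step, attention, mem_read, mem_write):
    """Sketch of the chunk control flow: buffer controller outputs in a chunk
    area and only access memory at chunk time steps."""
    chunk_area = []   # temporarily stores information units (controller outputs)
    h = None          # controller hidden state
    for t, item in enumerate(items, start=1):
        h = controller_step(item, h)       # ordinary recurrent update
        chunk_area.append(h)
        if t in chunk_times:               # chunk time: memory access is triggered
            h = attention(chunk_area)      # compress the chunk area into one unit
            mem_write(h)                   # the only write for this chunk
            h = mem_read(h)
            chunk_area = []                # empty the chunk area
    return h
```

With m chunk times over a length-t sequence, memory is accessed m times instead of t times, which is the source of the speedup.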

To sum up, the main contributions of this paper include:

  • We propose a general chunk-based sequential recommendation framework, which significantly accelerates various MNRs without harming accuracy. To the best of our knowledge, this is the first work to show that less frequent memorization can achieve comparable accuracy for the recommendation task.

  • We also present two effective implementations for CmnRec: periodic chunk (PEC) and time-sensitive chunk (TSC), by taking into account both long and short-term dependencies.

  • We compare CmnRec with state-of-the-art sequential recommendation methods on three real-world datasets. Our experimental results demonstrate that CmnRec offers competitive and robust recommendations with substantially reduced training and inference time.

2. Related Work

This work can be regarded as an integration of sequential recommendation and memory networks. In the following, we briefly review related literature in the two directions.

2.1. Sequential Recommendation

Sequential (a.k.a., session-based) recommender systems are an emerging topic in the field of recommendation and have attracted much attention in recent years due to the advance of deep learning. Existing sequential recommendation models can be mainly categorized into three classes according to the models they involve (Wang et al., 2019a): Markov chain-based methods (Shani et al., 2005; He and McAuley, 2016), factorization-based methods (Rendle et al., 2010; Yuan et al., 2018; Yuan et al., 2016, 2017), and deep learning-based methods (Hidasi et al., 2015; Hidasi and Karatzoglou, 2018; Tang and Wang, 2018; Yuan et al., 2019). Specifically, for efficiency reasons, Markov chain-based recommenders are typically built on the first-order dependency assumption, and thus only capture the first-order dependency over items. As a result, these methods usually do not perform well when modeling long-term and higher-order item dependencies. Factorization-based recommenders (a.k.a., Factorization Machines (Rendle et al., 2010)) treat previous user actions as general features by merely summing all their embedding vectors, and are not able to explicitly model the sequential dynamics and patterns in the user session. Thanks to the development of deep neural networks, many deep learning-based sequential models have been proposed and have shown superior performance over the above-mentioned conventional methods by utilizing complex network structures.

A pioneering work by Hidasi et al. (Hidasi et al., 2015) introduced RNN into the field of recommender systems. They trained a Gated Recurrent Unit (GRU) architecture to model the evolution of user interests, referred to as GRU4Rec. Following this idea, many RNN variants have been proposed in the past three years. Specifically, (Tan et al., 2016) proposed an improved GRU4Rec by introducing data augmentation and embedding dropout techniques. Hidasi and Karatzoglou (Hidasi and Karatzoglou, 2017) further proposed a family of alternative ranking objective functions with effective sampling tricks to improve the cross-entropy and pairwise ranking losses. (Quadrana et al., 2017) proposed a personalized sequential recommendation model with hierarchical recurrent neural networks, while (Gu et al., 2016; Smirnova and Vasile, 2017) explored how to leverage content and context features to further improve recommendation accuracy. More recently, researchers have proposed several other neural network architectures, including the convolutional neural network (CNN) models Caser (Tang and Wang, 2018) and NextItNet (Yuan et al., 2019), and the self-attention model SASRec (Kang and McAuley, 2018). Compared with RNN models, CNN and attention architectures are much easier to parallelize on GPUs.

2.2. EMN and MNR

Recently, the External Memory Network (EMN) has attracted significant attention in research fields that process sequential data. Generally, EMN involves two main parts: an external memory matrix to maintain state, and a recurrent controller to operate on (i.e., read from and write to) the matrix (Chen et al., 2018). Compared with standard RNN models, which compress historical signals into a fixed-length vector, EMN is more powerful in dealing with complex relations and long distances due to the external memory. EMN has been successfully applied in domains such as neural language translation (Grefenstette et al., 2015), question answering (Miller et al., 2016) and knowledge tracking (Zhang et al., 2017). Recently, researchers in (Chen et al., 2018; Ebesu et al., 2018; Wang et al., 2018) have applied it in recommender systems to capture user sequential behaviors and evolving preferences.

As the first work to introduce EMN into recommender systems, Sequential Recommendation with User Memory Network (RUM) (Chen et al., 2018) successfully demonstrated superior advantages over traditional baselines. Similarly, Neural Memory Streaming Recommender Networks with Adversarial Training (NMRN) (Wang et al., 2018) proposed a key-value memory network for each user to capture and store both short-term and long-term interests in a unified way. Meanwhile, Ebesu et al. proposed the Collaborative Memory Network (CMN) (Ebesu et al., 2018), which treats the collection of all user embeddings as a user memory matrix and utilizes the associative addressing scheme of the memory operations as a nearest-neighborhood model.

Regardless of the implementation of the external memory network and the types of memory operations performed, all EMN-style models need to perform memory reading and writing at every time step. Such persistent memory operations significantly increase the model complexity and training/inference time, which limits the application of MNR in large-scale industrial recommender systems. In general, efficiency can be achieved by reducing either the complexity or the frequency of memory access. Since there are many ways to implement EMN, we aim to propose a general acceleration framework. To achieve this goal, we propose reducing the number of memory operations, which is suitable for accelerating MNR with various implementations of memory operations.

3. Memory-based Neural Recommendation (MNR)

In this section, we introduce the generic architecture of memory-based neural sequential recommendation. Let $\mathcal{I}$, $\mathcal{S}$ and $x = \{x_1, \ldots, x_t\}$ (interchangeably denoted by $x_{1:t}$) be the set of all items, the set of all sequences, and the items in a specific sequence, respectively. Denote $|\mathcal{I}|$ and $|\mathcal{S}|$ as the sizes of the item and sequence sets. The corresponding item embeddings are $\{\mathbf{e}_{x_1}, \ldots, \mathbf{e}_{x_t}\}$.

Figure 2 (except the chunk and attention parts) shows a generic memory-based neural recommendation architecture similar to GRU4Rec. From bottom to top, the model includes the input items, embedding layer, controller layer, memory network layer, feedforward layer and output items. The first three layers perform in the same manner as a classic RNN network. The essential difference between MNR and GRU4Rec lies in the memory network layer. Note that MNR can be seen as an extension of RNNs with an external memory network $\mathbf{M} \in \mathbb{R}^{m \times d}$, where $m$ is the number of memory slots and $d$ is the memory slot embedding size. Next, we elaborate on the details of the MNR.

As shown in Figure 2, at each step the controller concatenates the embedding of the current input item with the memory read from $\mathbf{M}$ at the previous moment as its external input. The memory storage is then updated according to the output of the controller. Finally, both the controller output and the updated memory are fed into the feedforward layer, which generates the item ID with the maximum probability of being the next item, formulated as follows:

$$\mathbf{y}_t = f(\mathbf{o}_t, \mathbf{r}_t) \qquad (1)$$

$$\hat{x}_{t+1} = \operatorname{argmax}(\mathbf{y}_t) \qquad (2)$$

where $f(\cdot)$ is a feed-forward operation that performs a non-linear transformation of the inputs and returns a feature vector $\mathbf{y}_t$ as the output, and $\operatorname{argmax}(\cdot)$ finds the item ID with the maximum value in the vector, that is, the maximum occurrence probability at the $t$-th moment predicted by MNR.
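Putting the pieces together, one step of a generic MNR (before any chunk acceleration) can be sketched as below. The component functions are hypothetical stand-ins for the controller, memory operations and feedforward layer referenced by Eq.(1)-(6), not the paper's exact implementations.

```python
import numpy as np

def mnr_step(x_emb, h_prev, r_prev, controller, write, read, feedforward):
    """One MNR time step: the controller consumes the item embedding
    concatenated with the previous read vector, memory is written and then
    read, and the feedforward layer scores all candidate items."""
    o, h = controller(np.concatenate([x_emb, r_prev]), h_prev)  # controller update
    M = write(h)                                                # memory writing
    r = read(M, h)                                              # memory reading
    scores = feedforward(o, r)                                  # non-linear scoring
    next_item = int(np.argmax(scores))                          # most probable next item
    return h, r, next_item
```

Note that every time step performs both a write and a read; this per-step memory traffic is exactly what the chunk framework later removes.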

Figure 2. Chunk acceleration on Memory-based Neural Recommendation (MNR). During non-chunk time, information units (the hidden state from the controller) are put into the chunk area. When the chunk time comes, the attention machine will extract the most valuable information in the chunk area and generate a new information unit to replace the current hidden state. Then, the information units stored in the chunk area will be emptied; the memory reading and writing operations are triggered. The dotted red box on the right illustrates that MNR performs a complete process from information input and memory update to the final generation of prediction results.

Let the hidden state at the last moment be $\mathbf{h}_{t-1}$. The writing operation and reading operation can be written as Eq.(3) and Eq.(4):

$$\mathbf{M}_t = \mathrm{write}(\mathbf{M}_{t-1}, \mathbf{h}_{t-1}) \qquad (3)$$

$$\mathbf{r}_t = \mathrm{read}(\mathbf{M}_t, \mathbf{h}_{t-1}) \qquad (4)$$

The controller output $\mathbf{o}_t$ and hidden state $\mathbf{h}_t$ are then updated as Eq.(5) and Eq.(6):

$$\mathbf{o}_t = \mathrm{out}(\mathbf{e}_{x_t}, \mathbf{r}_t, \mathbf{h}_{t-1}) \qquad (5)$$

$$\mathbf{h}_t = \mathrm{update}(\mathbf{e}_{x_t}, \mathbf{r}_t, \mathbf{h}_{t-1}) \qquad (6)$$

Depending on the memory type selected, $\mathrm{write}(\cdot)$, $\mathrm{read}(\cdot)$, $\mathrm{out}(\cdot)$ and $\mathrm{update}(\cdot)$ have different implementations. In this paper, we adopt the implementations of DNC (Graves et al., 2016) for simplicity.
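As a concrete (though heavily simplified) illustration of what write and read operations look like, the sketch below implements NTM/DNC-style content-based addressing with an erase-then-add write. It is a stripped-down stand-in, not the DNC implementation used in the paper: temporal links, usage vectors and multiple read heads are omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mem_read(M, key):
    """Content-based read: weight the m slots (rows of M) by cosine
    similarity to the key, then return the weighted sum as the read vector."""
    sims = M @ key / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
    w = softmax(sims)          # read weights over memory slots
    return w @ M               # read vector

def mem_write(M, w, erase, add):
    """Erase-then-add write: each slot is partially erased and then updated,
    proportionally to its write weight w."""
    M = M * (1 - np.outer(w, erase))
    return M + np.outer(w, add)
```

Even in this toy form, one read plus one write costs several matrix-vector products per step, which is why performing them at every time step dominates the running time of MNR.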

4. CmnRec

In this section, we will give a detailed description of the chunk framework, followed by the concrete implementations.

4.1. From Psychology to Recommendation

Psychology points out that people unconsciously use chunk strategies to reduce the number of "things" to be remembered and thus improve the efficiency of memorization (Gerrig, 2013). Motivated by this, our core idea of chunk acceleration for MNR is to combine nearby information units into new information units according to specific rules, so as to reduce the frequency of memory operations and improve memory efficiency. Therefore, finding an appropriate chunk rule is the critical problem.

From a more generic perspective, information units are all converted from discrete item sets. An intuitive method is to chunk the information units based on the positions of items. In practice, items are ordered chronologically in the sequence, so the chunk rule becomes a sequence segmentation problem: how should we segment item sequences to minimize information loss while improving mnemonic efficiency?

4.2. Framework

The core idea of the chunk-based memory neural network is formed on a specific sequence partitioning rule (described below), which first divides proximal items into different chunks by order, and then writes these chunks into memory. Suppose the memory slot number of the MNR is $m$ and the length of the sequence is $t$. The whole sequence is divided into $m$ subsequences $s_1, s_2, \ldots, s_m$ ($m \le t$). The controller hidden states corresponding to these subsequences are chunked $m$ times. The time steps corresponding to the ends of the subsequences are the chunk time steps.

In the chunk framework, the controller output does not operate on memory at every time step. Instead, a chunk area of varying size temporarily stores the controller outputs. During non-chunk time, the chunk area caches the hidden states. When a chunk time arrives, the attention machine converts the cached units into a new controller hidden state, which replaces the hidden state in the current controller. Finally, the chunk area is emptied and the memory is accessed. The attention machine works as follows:


where $c_i$ is the $i$-th element of the chunk area, $\alpha_i$ is the attention score of $c_i$ at the $i$-th time step, and $r_t$ is the read vector at time step $t$; $W_1$, $W_2$ and $v$ are parameters, and $d_a$ is the attention dimension. Algorithm 1 summarizes the whole process of chunk-accelerated MNR. The theoretical complexity analysis is attached in Appendix A.
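The compression step performed by the attention machine can be illustrated with a small additive-attention sketch. The parameter shapes and the scoring form below are illustrative assumptions (the paper's exact equations also involve the read vector); the point is that the chunk area is reduced to a single information unit by a learned weighted combination.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def chunk_attention(H, W1, v):
    """Compress the chunk area H (one row per buffered hidden state) into a
    single new hidden state via additive attention scores."""
    scores = np.tanh(H @ W1) @ v   # one scalar score per information unit
    alpha = softmax(scores)        # attention distribution over the chunk area
    return alpha @ H               # weighted combination = new information unit
```

Because the output is a convex combination of the cached hidden states, a single memory write can summarize an entire subsequence.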

Input: an original sequence of item IDs,
     a chunk area,
     a memory slot number m.
Output: the predicted sequential item IDs
1  Generate the chunk time step set using Eq.(16);
2  for each time step t do
3      Embed the current input item;
       // Controller and memory operations at a chunk time step.
4      if t is in the chunk time step set then
5          Use Eq.(8) and Eq.(9) to compute the attention over the chunk area;
6          Perform Eq.(5) and Eq.(6) to obtain the controller output and hidden state;
7          Perform Eq.(3): memory writing;
8          Perform Eq.(4): memory reading;
9          Empty the chunk area;
       // Controller operation at a non-chunk time step.
10     else
11         Perform Eq.(5) and Eq.(6);
12         Cache the hidden state in the chunk area;
       // Predict the item ID with the highest probability.
13     Perform Eq.(1): feedforward transformation;
14     Perform Eq.(2): argmax over the output vector
Algorithm 1 CmnRec

4.2.1. RNNs Analysis

The key to our chunk framework is how to partition the sequence and find the appropriate chunk time steps. First, we need to define the concept of "contribution" in the model. Let us start with RNN. According to the hidden state transformation formula $h_t = \sigma(Ux_t + Wh_{t-1})$, the state $h_t$ is jointly determined by the input $x_t$ and the previous state $h_{t-1}$. Here, we use the gradient norms $\|\partial h_t/\partial x_t\|$ and $\|\partial h_t/\partial h_{t-1}\|$ to represent the contributions of $x_t$ and $h_{t-1}$ to $h_t$. The gradient represents the rate of change, so the larger the gradient norm, the greater the contribution. Since RNN has a cyclic structure, we can also measure the contributions of $x_k$ and $h_k$ at the $k$-th time step to $h_t$ by $\|\partial h_t/\partial x_k\|$ and $\|\partial h_t/\partial h_k\|$. Tersely, let $\Delta_{x_k}$ and $\Delta_{h_k}$ denote these two terms. In general RNNs, these contributions increase as $k$ approaches $t$ (see Appendix B for proof): from past to future, the contribution of $x_k$ grows as $k \to t$.
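The claim that later inputs contribute more can be checked on a toy scalar linear RNN $h_j = w\,h_{j-1} + u\,x_j$ with $|w| < 1$, where $\partial h_t / \partial x_k = u\,w^{t-k}$, so the gradient norm grows as $k \to t$. A sketch under that simplifying assumption (the paper's RNNs are of course non-linear and vector-valued):

```python
def contribution(u, w, t, k):
    """|d h_t / d x_k| for a scalar linear RNN h_j = w*h_{j-1} + u*x_j."""
    return abs(u) * abs(w) ** (t - k)
```

For any |w| < 1 the sequence of contributions is strictly increasing in k, which is exactly the monotonicity the chunk analysis relies on.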

Based on the concept of “contribution”, we can define the total contributions of a sequence with length in RNN as follows:


4.2.2. Chunk Analysis

For the MNR with chunk acceleration, the contribution of each chunk area can be counted as a separate RNN contribution. Let the lengths of the chunk areas be $l_1, l_2, \ldots, l_m$, with $\sum_{i=1}^{m} l_i = t$. Each chunk operation integrates the outputs of the controller within its chunk area. Hence, the contributions of the chunk areas can be expressed as follows:


Assuming the contribution of the input to the hidden state at each time step is constant, and using the identity proved in Appendix C to omit the common terms, the terms in Eq.(11) are transformed to:


For brevity, denote each transformed term as the contribution of the corresponding chunk. In order to store more information in memory, each memory slot should carry as much information as possible. To this end, we should make the gaps among the contributions of the $m$ chunks as small as possible.

4.3. Chunk Implementation

Eq.(12) gives the theoretical explanation of each chunk's contribution to the final output. Each term in Eq.(12) contains two parts: the first represents the summation of the input-based contributions of the information units in a subsequence, which is the latest contribution; the second is the proportion between the hidden states at the chunk boundary and at the end of the sequence, which is the long-term dependence.

(a) Periodic Chunk (PEC).
(b) Time-sensitive Chunk (TSC)
(c) Extreme Chunk (EXC)
Figure 3. Chunk rule analysis. Given an input item sequence of length 20, the number of memory slots is 4. (a) shows the periodic chunk with a period length of 5 and sequence segmentation $\{x_{1:5}, x_{6:10}, x_{11:15}, x_{16:20}\}$, updating memory at time steps 5, 10, 15 and 20. (b) shows the time-sensitive chunk, where the sequence segmentation and memory update time steps are $\{x_{1:8}, x_{9:14}, x_{15:18}, x_{19:20}\}$ and 8, 14, 18, 20, respectively. (c) shows the extreme chunk, where the sequence is divided into $\{x_{1:17}, x_{18}, x_{19}, x_{20}\}$, and memory is updated at time steps 17, 18, 19 and 20.

4.3.1. Periodic Chunk (PEC)

When the change rate of user preference in the sequence is relatively slow, i.e., the preference drift of users is not obvious, the long-term dependencies of each chunk are similar, and the latest contribution increases with the length of the subsequence. Only when all chunks' latest contributions are the same (i.e., all subsequences have equal length) are the gaps among the chunk contributions smallest. Therefore, we propose the periodic chunk (PEC). Given an input sequence of length $t$ and the chunk cycle $p = t/m$, the chunk time steps are

$$t_i = i \cdot p, \quad i = 1, 2, \ldots, m.$$

The sequence segmentation results are

$$\{x_{1:p},\ x_{p+1:2p},\ \ldots,\ x_{(m-1)p+1:mp}\}.$$
Figure 3 (a) is a graphical illustration of PEC.
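Computing the PEC chunk time steps is straightforward. The helper below is a sketch assuming, as in the Figure 3 example, that the period divides the sequence length evenly.

```python
def pec_chunk_times(t, m):
    """Periodic chunk (PEC): with sequence length t and m memory slots,
    memory is accessed every t/m steps."""
    period = t // m
    return [period * i for i in range(1, m + 1)]
```

For the Figure 3 setting (t = 20, m = 4) this yields [5, 10, 15, 20].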

(a) weishi
(b) ml-10M
(c) ml-latest
Figure 4. Correlation between the target item and other items in the sequence. (a), (b) and (c) show the correlations on the three datasets, whose sequence lengths are 10, 50, and 100, respectively.


In the long run, however, user preferences shift, and users have different degrees of preference for different items, which means there is a distribution of user preference over the sequence. To demonstrate this preference distribution, we investigate the importance of items to the target item in a sequence. In a given sequence, the last item is treated as the target item. We calculate item importance as the correlation between the current item in the sequence and the target item, adopting cosine similarity as the correlation indicator:

$$\mathrm{sim}(x_i, x_t) = \frac{\mathbf{e}_{x_i} \cdot \mathbf{e}_{x_t}}{\|\mathbf{e}_{x_i}\| \, \|\mathbf{e}_{x_t}\|}$$
We use the item embeddings (trained by LSTM) as the input vectors for cosine similarity. Experimental results are shown in Figure 4, where the horizontal axis is the position of an item in the sequence and the vertical axis is the correlation between the target item and the current item. Although there are some small fluctuations in Figure 4 (a), the overall trend in (b) and (c) is stable and upward. These figures show that the correlation between the target item and another item increases as their distance decreases. Simply put, the newer an item is, the better it reflects future preference changes in the sequence; this can also be shown mathematically (see Appendix D for proof).
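The correlation analysis of Figure 4 boils down to comparing each item's embedding with the target (last) item's embedding. A minimal sketch:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def position_correlations(seq_emb):
    """Correlate every item embedding in the sequence with the embedding of
    the target item (the last one), as in the Figure 4 analysis."""
    target = seq_emb[-1]
    return [cosine_sim(e, target) for e in seq_emb[:-1]]
```

Plotting these values against position reproduces the kind of upward trend seen in Figure 4 when recent items are more similar to the target.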

4.3.2. Time-sensitive Chunk (TSC)

Since long-term dependence decays over distance, the stronger the long-term dependence of a chunk, the smaller its latest contribution. To achieve small gaps between chunks, sequence partitioning needs to balance long-term dependencies and latest contributions. We propose a time-sensitive chunk strategy in which the writing interval is larger at the beginning of the sequence and is reduced over time ($l_1 > l_2 > \cdots > l_m$), so as to keep the balance. From the start to the end of the sequence, the input length ratios between chunks are $m : m-1 : \cdots : 1$, so the sum of the ratios is $m(m+1)/2$ and the proportional step length is $p = 2t / (m(m+1))$. The chunk time steps are

$$t_i = p \sum_{j=1}^{i} (m - j + 1) = p\left(im - \frac{i(i-1)}{2}\right), \quad i = 1, \ldots, m.$$

The sequence segmentation results are

$$\{x_{1:t_1},\ x_{t_1+1:t_2},\ \ldots,\ x_{t_{m-1}+1:t_m}\}.$$
An example of TSC is shown in Figure 3 (b). In pursuing the goal that "the newer the item is, the greater its importance for the next prediction", there is another case worth examining. As shown in Figure 3 (c), TSC can degenerate into an extreme chunk (EXC for short), where all attention is paid to the latest contribution: the first $t - m + 1$ items form one large chunk, whereas each of the remaining $m - 1$ items is treated as a separate chunk.
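The TSC schedule can be computed directly from the ratio argument above. The sketch below assumes t is divisible by m(m+1)/2 so the step length is an integer, as in the Figure 3(b) example.

```python
def tsc_chunk_times(t, m):
    """Time-sensitive chunk (TSC): subsequence lengths shrink toward the end
    of the sequence in the ratio m : m-1 : ... : 1, so recent items trigger
    memory access more often."""
    step = 2 * t // (m * (m + 1))     # proportional step length
    times, cur = [], 0
    for i in range(m, 0, -1):         # chunk lengths: m*step, (m-1)*step, ..., step
        cur += i * step
        times.append(cur)
    return times
```

For t = 20 and m = 4 the step length is 2 and the chunk times are [8, 14, 18, 20], matching Figure 3(b).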

5. Experiments

In this section, we conduct extensive experiments to investigate the efficacy of the chunk-accelerated MNR. Specifically, we aim to answer the following research questions (RQs).

  1. RQ1: Does chunk speed up MNR significantly? What impact does the sequence length have on model acceleration?

  2. RQ2: Does the chunk-accelerated MNR perform comparably with the typical memory-based neural recommendation models in terms of recommendation accuracy?

  3. RQ3: How does chunk-accelerated MNR perform with TSC, EXC and PEC? Which setting performs best?

5.1. Datasets

We conduct experiments on three real-world recommendation datasets: ml-latest, ml-10M and weishi.

ml-latest (Harper and Konstan, 2015) is a widely used public dataset for both general and sequential recommendation (Tang et al., 2019; Fajie et al., 2019; Wang et al., 2019b; Kang and McAuley, 2018). The original dataset contains 27,753,444 interactions, 283,228 users and 58,098 video clips with timestamps. To reduce the impact of cold items, we filter out videos that appear fewer than 20 times, and generate a number of sequences, each of which belongs to one user in chronological order. Then, we split each sequence into subsequences of $T$ movies each. If the length of a subsequence is less than $T$, we pad zeros at the beginning of the sequence to reach $T$; subsequences shorter than a minimum length threshold are simply removed. In our experiments, we set $T = 100$ for ml-latest (see Table 1).

ml-10M contains 10,000,054 interactions, 10,681 movies and 71,567 users. We perform similar pre-processing as for ml-latest by setting $T$ to 50 and the minimum length threshold to 5.

weishi is a micro-video recommendation dataset collected by the Weishi Group of Tencent. Since both cold users and items have already been trimmed by the official provider, we do not need to perform pre-processing for the cold-start problem. Each user sequence contains at most 10 items. The statistics of our datasets after the above preprocessing are shown in Table 1.

Dataset   | # Interactions | # Sequences | # Items | s       | T
weishi    | 9,986,953      | 1,048,575   | 65,997  | 9.5243  | 10
ml-10M    | 7,256,224      | 178,768     | 10,670  | 40.5902 | 50
ml-latest | 25,240,741     | 300,624     | 18,226  | 83.9612 | 100
Table 1. The statistics of the experimental datasets. s: the average length of each sequence. T: the unified sequence length after zero padding.
M         | 2     | 3    | 4     | 6     | 9    | 12   | Average
weishi    | 1.51  | 1.52 | –     | –     | –    | –    | 1.515
ml-10M    | 6.28  | –    | 6.71  | 6.17  | 4.68 | –    | 5.96
ml-latest | 11.35 | –    | 12.16 | 10.51 | 8.28 | 8.03 | 10.07
Table 2. Inference speedup. The values denote multiples; M is the slot number; "–" marks slot numbers not evaluated for that dataset.

5.2. Comparative Methods & Evaluation Metrics

GRU4Rec (Hidasi et al., 2015): a seminal work that applies the Gated Recurrent Unit (GRU) to sequential recommendation. For a fair comparison, we use the cross-entropy loss function for all neural network models. LSTM4Rec: it simply replaces GRU with LSTM, since we observe that LSTM generally performs better than GRU for the item recommendation task. SRMN (Chen et al., 2018): a recently proposed sequential recommendation model with an external memory network architecture; for comparison purposes, we report results using LSTM as the controller. In addition, we also compare with two CNN-based sequential recommendation methods: Caser (Tang and Wang, 2018) and NextItNet (Yuan et al., 2019). As for our proposed methods, we report results with the three chunk variants, i.e., TSC, PEC and EXC.

Following (Yuan et al., 2019; Tang and Wang, 2018), we use three popular top-N metrics to evaluate the performance of these sequential recommendation models, namely MRR@N (Mean Reciprocal Rank) (Hidasi and Karatzoglou, 2018), HR@N (Hit Ratio) (Wang et al., 2018) and NDCG@N (Normalized Discounted Cumulative Gain) (Guo et al., 2016).

(a) weishi
(b) ml-10M
(c) ml-latest
Figure 5. Training time of each epoch on the three datasets.

5.3. Experiment Setup

To ensure fairness, the dimension of item embeddings is set to 128 for all neural network models, similar to (Tang and Wang, 2018; Yuan et al., 2019). We first tune the baselines GRU4Rec and LSTM4Rec to optimal performance. Specifically, we set the number of layers of GRU4Rec and LSTM4Rec to 1 and the hidden dimension to 256, which performs better than two hidden layers or a larger hidden dimension. We empirically find that all models except NextItNet benefit from a larger batch size. To make full use of the GPU, we set the batch size to 1024 for these models. As for NextItNet, we empirically find that it performs best when the batch size is between 64 and 256 on all these datasets, and we report results with its best-performing batch size. For SRMN, we set the embedding size of the memory slot to 256. The attention dimension of chunk is 64 on all datasets. Our datasets are randomly divided into training (80%), validation (2%) and testing (18%) sets. All methods are implemented in TensorFlow with Adam (Kingma and Ba, 2014) as the optimizer. Results are reported when models have converged on the validation set. Our implementation code will be released later.

(a) MRR@5
(b) HR@5
(c) NDCG@5
Figure 6. Performance comparisons with respect to top-N values.
Dataset weishi ml-10M ml-latest
2 3 2 4 6 9 2 4 6 9 12
MRR@5
SRMN 0.1001 0.1005 0.0739 0.0748 0.0756 0.0741 0.0749 0.0755 0.0764
TSC 0.0978 0.1010 0.0724 0.0738 0.0769 0.0733 0.0742 0.0778
PEC 0.0721 0.0753 0.0717 0.0750 0.0750
EXC 0.0958 0.0954 0.0651 0.0633 0.0641 0.0673 0.0636 0.0593 0.0622 0.0633 0.0653
HR@5
SRMN 0.1636 0.1638 0.1302 0.1316 0.1320 0.1317 0.1335 0.1340 0.1359
TSC 0.1599 0.1648 0.1287 0.1302 0.1356 0.1289 0.1318 0.1329 0.1381
PEC 0.1280 0.1336 0.1346 0.1347
EXC 0.1573 0.1567 0.1157 0.1125 0.1137 0.1188 0.1149 0.1088 0.1132 0.1161 0.1190
NDCG@5
SRMN 0.1158 0.1161 0.0878 0.0888 0.0895 0.0883 0.0893 0.0899 0.0911
TSC 0.1131 0.1168 0.0863 0.0876 0.0911 0.0860 0.0875 0.0882 0.0925
PEC 0.0857 0.0895 0.0897 0.0896
EXC 0.1110 0.1105 0.0774 0.0754 0.0763 0.0800 0.0758 0.0713 0.0746 0.0762 0.0784
Table 3. Performance comparison between SRMN and the proposed methods. Bold marks the best result; the second-best result is also highlighted. M is the slot number. Note that when $m = t$, the chunk framework reduces to SRMN, performing memory operations at each time step.

5.4. Experimental Result and Analysis

5.4.1. Run time (RQ1).

As analyzed before, the chunk framework is theoretically more efficient than SRMN because it reduces the number of memory accesses. To confirm this, we plot the running times of the two methods in Figure 5. The training time of SRMN is several times longer than that of TSC; with the maximum memory slot number (best accuracy for both SRMN and TSC), the speedups on the three datasets are 2.75x, 4.44x and 6.34x, respectively. We find that the relative improvements are much larger on ml-10M and ml-latest than on weishi. The larger improvements should be attributed to the lengths of the item sequences, since for longer sequences the interval between two memory accesses is also larger. Take weishi and ml-latest as examples: with the number of memory slots set to 2, the average interval between memory accesses is 5 on weishi but 50 on ml-latest. It is also worth noting that the relation between the number of memory slots and the running time is not linear. Increasing the number of memory slots decreases the size of the chunk area, which helps reduce the computing time of the attention machine. Therefore, the optimal slot number depends on the specific dataset. We also report the inference speedup in Table 2; as shown, similar conclusions hold for the inference phase.

5.4.2. Performance comparison with original SRMN (RQ2)

To verify the effectiveness of the proposed chunk framework, we focus on comparing it with the standard SRMN. We report the recommendation accuracy in Table 3 and make the following observations: (1) TSC achieves results comparable to SRMN on all datasets when a relatively large slot number is used. Both SRMN and our method are sensitive to the number of slots: better accuracy is obtained with a larger slot number. In particular, TSC and PEC with 9 memory slots even yield a 1.72% improvement over SRMN on ml-10M in terms of MRR@5. (2) In general, the performance of all chunk-based methods keeps growing as the number of memory slots increases at first, and then remains relatively stable once the number of memory slots is large enough. The optimal number can be found by hyperparameter tuning. Empirically, for a sequential dataset with session lengths longer than 50, setting the default number to 10 offers a favorable trade-off between performance and computational cost.

5.4.3. Performance comparison against baselines.

We report the results of all methods in Figure 6 and make the following observations. First, the CNN-based model Caser performs worse than GRU4Rec and LSTM4Rec. Second, the state-of-the-art temporal CNN model NextItNet yields clearly better results than these baselines; our findings here are consistent with those in previous works (Yuan et al., 2019; Tang et al., 2019). Third, SRMN and TSC outperform all other baselines, which demonstrates the effectiveness of memory-based neural networks.

(a) weishi
(b) ml-10M
Figure 7. Convergence behaviors in terms of NDCG@5. The numbers of memory slots for TSC and SRMN are 3 and 9, respectively, on the two datasets.

5.4.4. Denoising

We plot the convergence behaviors of GRU4Rec, LSTM4Rec, SRMN and TSC in Figure 7. As shown, the memory-based recommendation models (i.e., SRMN and TSC) have clear advantages over the RNN models in terms of both accuracy and robustness. We believe the external memory network enhances the storage and information-processing capacity of the RNN, thereby improving accuracy. In addition, abnormal input data or noise usually leads to overfitting after convergence. However, since the external memory network maintains more information than the recurrent unit, the impact of abnormal data from a small number of instances is restricted to a certain extent. Furthermore, we observe that TSC is even more robust than SRMN; the results on both weishi and ml-10M imply that TSC can effectively prevent overfitting. We argue that the memory update mechanism in TSC makes it insensitive to noise, since it takes into account data from previous timesteps rather than only the current timestep.

5.4.5. Performance comparison of TSC, PEC and EXC (RQ3)

Since we have introduced three chunk variants, i.e., TSC, PEC and EXC, we report their results in Table 3 for a clear comparison. First, we observe that PEC and TSC perform much better than EXC on all datasets. In fact, EXC performs even worse than the baseline models. We suspect that this is because EXC mainly focuses on modeling the most recent interactions, ignoring the earlier interactions that make up the vast majority of the interaction sequence. That is, such extreme partitioning cannot offer satisfactory performance in practice. Second, TSC achieves better results than PEC in terms of all evaluation metrics when a large slot number is set. This implies that the time-sensitive chunk strategy is better suited than the periodic setup to balancing long- and short-term sequential relations.

6. Conclusion

In this paper, we have introduced a novel sequential recommendation framework that combines chunking with the External Memory Network (EMN). The motivation is that the memory access pattern of existing EMNs introduces redundant computation, which results in very high time complexity when modeling long-range user session data. We proposed a chunk-accelerated memory network with two practical implementations: periodic chunk (PEC) and time-sensitive chunk (TSC). We demonstrated that the proposed chunk framework significantly reduces the computation time of memory-based sequential recommendation models while achieving competitive recommendation results.

Appendix A A rough complexity analysis

MNR consists of a controller and an external memory network (EMN). For ease of illustration, the controller and the EMN are often realized by an RNN and a DNC, respectively. The per-time-step cost of an RNN controller is O(h(d + h)), where d is the item embedding size and h is the hidden state size; let c_r denote this cost and c_m the per-time-step cost of the EMN's memory operations (Graves et al., 2016). Based on these costs, the total time consumption of MNR and CmnRec over a length-t sequence is t(c_r + c_m) and t*c_r + m*c_m, respectively, where m is the number of memory slots (i.e., the number of memory accesses CmnRec performs). When c_m dominates c_r and m is much smaller than t, the time consumption ratio of MNR to CmnRec approaches t/m. In practice, an even larger acceleration can be obtained, because the EMN performs many other complex operations when processing memory, such as memory addressing (Graves et al., 2016).
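The accounting above can be checked with a toy calculation. The per-step costs c_ctrl and c_mem below are assumed values, not measurements; the sketch only illustrates that the ratio approaches t/m when memory operations dominate.

```python
def speedup_ratio(t: int, m: int, c_ctrl: float, c_mem: float) -> float:
    """Rough MNR/CmnRec time ratio: MNR pays controller + memory cost
    at every step, while CmnRec pays the memory cost only m times.
    c_ctrl and c_mem are assumed per-step costs (illustrative only)."""
    mnr = t * (c_ctrl + c_mem)
    cmnrec = t * c_ctrl + m * c_mem
    return mnr / cmnrec

# With memory ops 100x the controller cost, 100 steps and 10 slots,
# the ratio is close to the limit t/m = 10.
print(round(speedup_ratio(100, 10, 1.0, 100.0), 1))  # 9.2
```

When m equals t, the ratio is exactly 1, matching the observation that the chunk framework then reduces to SRMN.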

Appendix B Find upper bound

The update equation of the standard RNN hidden state is:

h_t = tanh(W x_t + U h_{t-1} + b)    (18)

Starting from the derivative of h_t with respect to x_t, we make use of the bound on this derivative shown in (Le et al., 2019). We then take the derivative of h_t with respect to h_{t-1}. In Eq. (18), x_t and h_{t-1} play symmetric roles, with W and U interchangeable. Referring to the solving process for the derivative with respect to x_t, we replace x_t with h_{t-1} and W with U. As a result, we obtain the corresponding bound on the derivative of h_t with respect to h_{t-1}.

As for LSTM, the update equations are:

f_t = sigmoid(W_f x_t + U_f h_{t-1} + b_f)
i_t = sigmoid(W_i x_t + U_i h_{t-1} + b_i)
o_t = sigmoid(W_o x_t + U_o h_{t-1} + b_o)
c'_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t * c_{t-1} + i_t * c'_t
h_t = o_t * tanh(c_t)    (19)

Following the same procedure, we start by taking the derivative of h_t with respect to x_t; according to the results in (Le et al., 2019), this derivative is bounded. We then solve for the derivative of h_t with respect to h_{t-1}. In Eq. (19), x_t and h_{t-1} again play symmetric roles: swapping them leaves the structure of the gates f_t, i_t and o_t unchanged, with each W matrix replaced by the corresponding U matrix. Following the solution for the standard RNN, we replace x_t with h_{t-1} and each W with the corresponding U. As a result, there exists a corresponding constant that bounds the derivative of h_t with respect to h_{t-1}.

Appendix C Finding dependencies among discontinuous contributions


Appendix D Proof

The impact of items becomes increasingly important toward the end of a sequence (Figure 4), which means the contribution of x_i to h_t increases with i; that is, the contribution of x_{i+1} is larger than that of x_i. Writing out the two contributions according to Eq. (20) and canceling the terms common to both sides, the remaining factors are positive, from which the claimed inequality follows.


  • Bengio et al. (1994) Y. Bengio, P. Simard, and P. Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5, 2 (March 1994), 157–166.
  • Cai et al. (2017) Jonathon Cai, Richard Shin, and Dawn Song. 2017. Making neural programming architectures generalize via recursion. In Proceedings of Fifth International Conference on Learning Representations (ICLR ’17). ACM, New Orleans, Louisiana, USA, 108–116.
  • Chen et al. (2018) Xu Chen, Hongteng Xu, Yongfeng Zhang, Jiaxi Tang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2018. Sequential Recommendation with User Memory Networks. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM ’18). ACM, New York, NY, USA, 108–116.
  • Ebesu et al. (2018) Travis Ebesu, Bin Shen, and Yi Fang. 2018. Collaborative Memory Network for Recommendation Systems. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR ’18). ACM, New York, NY, USA, 515–524.
  • Fajie et al. (2019) Yuan Fajie, He Xiangnan, Guo Guibing, Xu Zhezhao, Xiong Jian, and He Xiuqiang. 2019. Modeling the Past and Future Contexts for Session-based Recommendation. arXiv preprint arXiv:1906.04473 (2019).
  • Gerrig (2013) Richard J. Gerrig. 2013. Psychology and life (20th ed.). Pearson Education, One Lake Street, Upper Saddle River, NJ, USA. 180 pages. (book).
  • Graves et al. (2016) Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. 2016. Hybrid computing using a neural network with dynamic external memory. Nature 538, 7626 (2016), 471.
  • Grefenstette et al. (2015) Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. 2015. Learning to Transduce with Unbounded Memory. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2 (NIPS’15). MIT Press, Cambridge, MA, USA, 1828–1836.
  • Gu et al. (2016) Youyang Gu, Tao Lei, Regina Barzilay, and Tommi S Jaakkola. 2016. Learning to refine text based recommendations.. In EMNLP. 2103–2108.
  • Guo et al. (2016) Weiyu Guo, Shu Wu, Liang Wang, and Tieniu Tan. 2016. Personalized Ranking with Pairwise Factorization Machines. Neurocomput. 214, C (Nov. 2016), 191–200.
  • Harper and Konstan (2015) F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. 5, 4, Article 19 (Dec. 2015), 19 pages.
  • He and McAuley (2016) Ruining He and Julian McAuley. 2016. Fusing similarity models with markov chains for sparse sequential recommendation. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 191–200.
  • Hidasi and Karatzoglou (2017) Balázs Hidasi and Alexandros Karatzoglou. 2017. Recurrent Neural Networks with Top-k Gains for Session-based Recommendations. arXiv preprint arXiv:1706.03847 (2017).
  • Hidasi and Karatzoglou (2018) Balázs Hidasi and Alexandros Karatzoglou. 2018. Recurrent Neural Networks with Top-k Gains for Session-based Recommendations. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM ’18). ACM, New York, NY, USA, 843–852.
  • Hidasi et al. (2015) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (Nov. 1997), 1735–1780.
  • Huang et al. (2018) Jin Huang, Wayne Xin Zhao, Hongjian Dou, Ji-Rong Wen, and Edward Y. Chang. 2018. Improving Sequential Recommendation with Knowledge-Enhanced Memory Networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR ’18). ACM, New York, NY, USA, 505–514.
  • Johnson et al. (2017) J. Johnson, B. Hariharan, L. Maaten, J. Hoffman, L. Fei-Fei, C. Zitnick, and R. Girshick. 2017. Inferring and Executing Programs for Visual Reasoning. In 2017 IEEE International Conference on Computer Vision (ICCV). IEEE Computer Society, Los Alamitos, CA, USA, 3008–3017.
  • Kang and McAuley (2018) Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 197–206.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. ICLR (2014).
  • Le et al. (2019) Hung Le, Truyen Tran, and Svetha Venkatesh. 2019. Learning to Remember More with Less Memorization. arXiv preprint arXiv:1901.01347 (2019).
  • Ma et al. (2019) Chen Ma, Peng Kang, and Xue Liu. 2019. Hierarchical Gating Networks for Sequential Recommendation. arXiv preprint arXiv:1906.09217 (2019).
  • Miller et al. (2016) Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (NLP ’16). 1400–1409.
  • Quadrana et al. (2017) Massimo Quadrana, Alexandros Karatzoglou, Balázs Hidasi, and Paolo Cremonesi. 2017. Personalizing session-based recommendations with hierarchical recurrent neural networks. In Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM, 130–137.
  • Rendle et al. (2010) Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2010. Factorizing personalized markov chains for next-basket recommendation. In Proceedings of the 19th international conference on World wide web. ACM, 811–820.
  • Seo et al. (2016) Min Joon Seo, Sewon Min, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Query-Reduction Networks for Question Answering. In ICLR.
  • Shani et al. (2005) Guy Shani, David Heckerman, and Ronen I Brafman. 2005. An MDP-based recommender system. Journal of Machine Learning Research 6, Sep (2005), 1265–1295.
  • Smirnova and Vasile (2017) Elena Smirnova and Flavian Vasile. 2017. Contextual Sequence Modeling for Recommendation with Recurrent Neural Networks. arXiv preprint arXiv:1706.07684 (2017).
  • Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, arthur szlam, Jason Weston, and Rob Fergus. 2015. End-To-End Memory Networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 2440–2448.
  • Tan et al. (2016) Yong Kiam Tan, Xinxing Xu, and Yong Liu. 2016. Improved recurrent neural networks for session-based recommendations. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 17–22.
  • Tang et al. (2019) Jiaxi Tang, Francois Belletti, Sagar Jain, Minmin Chen, Alex Beutel, Can Xu, and Ed H Chi. 2019. Towards neural mixture recommender for long range dependent user sequences. In The World Wide Web Conference. ACM, 1782–1793.
  • Tang and Wang (2018) Jiaxi Tang and Ke Wang. 2018. Personalized Top-N Sequential Recommendation via Convolutional Sequence Embedding. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM ’18). ACM, New York, NY, USA, 565–573.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
  • Wang et al. (2019b) Jingyi Wang, Qiang Liu, Zhaocheng Liu, and Shu Wu. 2019b. Towards Accurate and Interpretable Sequential Prediction: A CNN & Attention-Based Feature Extractor. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. ACM, 1703–1712.
  • Wang et al. (2018) Qinyong Wang, Hongzhi Yin, Zhiting Hu, Defu Lian, Hao Wang, and Zi Huang. 2018. Neural memory streaming recommender networks with adversarial training. (2018), 2467–2475.
  • Wang et al. (2019a) Shoujin Wang, Longbing Cao, and Yan Wang. 2019a. A survey on session-based recommender systems. arXiv preprint arXiv:1902.04864 (2019).
  • Yuan et al. (2016) Fajie Yuan, Guibing Guo, Joemon M Jose, Long Chen, Haitao Yu, and Weinan Zhang. 2016. Lambdafm: learning optimal ranking with factorization machines using lambda surrogates. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 227–236.
  • Yuan et al. (2017) Fajie Yuan, Guibing Guo, Joemon M Jose, Long Chen, Haitao Yu, and Weinan Zhang. 2017. BoostFM: Boosted factorization machines for top-n feature-based recommendation. In Proceedings of the 22nd International Conference on Intelligent User Interfaces. ACM, 45–54.
  • Yuan et al. (2019) Fajie Yuan, Alexandros Karatzoglou, Ioannis Arapakis, Joemon M. Jose, and Xiangnan He. 2019. A Simple Convolutional Generative Network for Next Item Recommendation. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (WSDM ’19). ACM, New York, NY, USA, 582–590.
  • Yuan et al. (2018) Fajie Yuan, Xin Xin, Xiangnan He, Guibing Guo, Weinan Zhang, Chua Tat-Seng, and Joemon M Jose. 2018. fBGD: Learning embeddings from positive unlabeled data with BGD. (2018).
  • Zhang et al. (2017) Jiani Zhang, Xingjian Shi, Irwin King, and Dit-Yan Yeung. 2017. Dynamic Key-Value Memory Networks for Knowledge Tracing. In Proceedings of the 26th International Conference on World Wide Web (WWW ’17). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 765–774.