Memory Augmented Neural Model for Incremental Session-based Recommendation

04/28/2020, by Fei Mi, et al. (EPFL)

Increasing concerns with privacy have stimulated interests in Session-based Recommendation (SR) using no personal data other than what is observed in the current browser session. Existing methods are evaluated in static settings which rarely occur in real-world applications. To better address the dynamic nature of SR tasks, we study an incremental SR scenario, where new items and preferences appear continuously. We show that existing neural recommenders can be used in incremental SR scenarios with small incremental updates to alleviate computation overhead and catastrophic forgetting. More importantly, we propose a general framework called Memory Augmented Neural model (MAN). MAN augments a base neural recommender with a continuously queried and updated nonparametric memory, and the predictions from the neural and the memory components are combined through another lightweight gating network. We empirically show that MAN is well-suited for the incremental SR task, and it consistently outperforms state-of-the-art neural and nonparametric methods. We analyze the results and demonstrate that it is particularly good at incrementally learning preferences on new and infrequent items.


1 Introduction

Due to new privacy regulations that prohibit building user preference models from historical user data, it is becoming important to utilize short-term dynamic preferences within a browser session. Session-based Recommendation (SR) is therefore increasingly used in interactive online computing systems. The goal of SR is to make recommendations based on user behavior observed in short web-browser sessions, and the task is to predict the user's next actions, such as clicks, based on previous actions in the same session.

To better address the dynamic nature of SR tasks, we study it from an incremental learning perspective, referred to as Incremental Session-based Recommendation. In this setting, new items and preferences appear incrementally, and models need to incorporate the new preferences incrementally while preserving old ones that are still useful. The setup requires a recommender to be incrementally updated with data observed during the last time period and evaluated on the events in the following time period. We summarize three main challenges of incremental SR scenarios: (i). catastrophic forgetting [19]: incorporating new patterns requires additional training and often reduces performance on old patterns. (ii). computation efficiency: models need to be efficient as they need to be frequently updated with new data. (iii). sample efficiency: the number of observations on new items and patterns is often small such that models need to capture them quickly with limited observations.

Recently proposed neural approaches [11, 16, 12, 17, 29] have shown great success in a static setting where the set of items and the distribution of preferences during testing are assumed to be the same as in the training phase. However, this assumption rarely holds in real-world recommendation applications. As our first contribution, we show that neural recommenders can also be efficiently and effectively used in incremental SR scenarios by applying small incremental updates. To the best of our knowledge, this is the first study of neural models for SR from an incremental learning perspective. We found that incrementally updating neural models with a single pass through new data using a small learning rate helps to capture new patterns incrementally. However, using large learning rates (or, equivalently, multiple passes) degrades performance badly due to overfitting new patterns and forgetting old ones.

Our main contribution is to propose a method called Memory Augmented Neural model (MAN), inspired by a framework proposed for language modeling [20, 5], neural machine translation [31], and image recognition [24] tasks. MAN augments a neural recommender with a nonparametric memory to capture new items and preferences incrementally. The predictions of the neural and memory components are combined by a lightweight gating network. MAN is agnostic to the neural model as long as it learns meaningful sequence representations; therefore, it can be easily and broadly applied to various neural recommenders. The nonparametric memory component of MAN helps to deal with all three challenges of incremental SR mentioned above. First, it serves as a long-term memory that remembers long histories of observations to mitigate catastrophic forgetting. Second, it is very efficient to update because it is not trained jointly with the base neural recommender. Third, its nonparametric nature helps to capture new patterns from a smaller number of observations, addressing the third challenge of sample efficiency. Through extensive experiments, we show that MAN boosts the performance of different neural methods and achieves state-of-the-art results. We also demonstrate that it is particularly good at capturing preferences on new and infrequent items.

2 Related Work

Incremental learning for recommendation is an important topic for practical recommendation systems where new users, items, and interactions are continuously streaming in. For the standard recommendation task of predicting ratings or clicks, different learners have been studied based on matrix factorization (MF) with online updating, including [10]. For sequential recommendation tasks, the interaction matrix to be decomposed is constructed from sequential user feedback. [27] proposes FPMC to factorize the transitions in Markov chains with low-rank representations. Later, [9] proposes FOSSIL with factorized Markov chains to incorporate sequential information. Recently, SMF is proposed by [18] for SR using session latent vectors. However, these MF-based methods are expensive to train. For example, [16, 33] report that even 120GB of memory is not enough to train FPMC. Therefore, they are in principle not suitable for incremental settings.

Session-based recommendation can be formulated as a sequence learning problem to predict the user's sequential behavior, which can be solved by recurrent neural networks (RNNs). The first work (GRU4Rec, [11]) uses a gated recurrent unit (GRU) to learn session representations from previous clicks and predict the next click. Based on GRU4Rec, [12] proposes new ranking losses on relevant sessions, and [30] proposes to augment training data. NARM [16] augments GRU4Rec with a bilinear decoder and an additional attention operation to attend to specific parts of the sequence. Based on NARM, [17] proposes STAMP to model users' general and short-term interests using two separate attention operations, and [26] proposes RepeatNet, which uses an additional repeat decoder based on an attention mechanism to predict repetitive actions in a session. Recently, [33] uses graph attention to capture complex transitions of items. Motivated by the recent success of the Transformer [32] and BERT [3] for language modeling tasks, [15] proposed SASRec using the Transformer operation, and [29] proposed BERT4Rec using the training scheme of BERT to model bi-directional information through Cloze tasks. Despite this broad exploration and success, these methods are all studied in a static SR scenario without considering new items and patterns that appear continuously.

Nonparametric methods [8] are ideal for incremental learning tasks thanks to their computation efficiency. The amount of information they capture increases as the number of observations grows. Simple item-based collaborative filtering methods using nearest neighbors have been proven to be effective and are widely employed in industry. Markov models [28] also support updating the transition probabilities incrementally. However, the Markov assumption is often too strong, and it limits recommendation performance. Recently, [13] proposed SKNN to compare the entire current session with historical sessions in the training data. They show that SKNN is very efficient and achieves strong results. Lately, variations [18, 4] of SKNN have been proposed to consider the position of items in a session or the timestamp of a past session. [21, 22] built a nonparametric recommender based on a structure called a context tree to model suffixes of a sequence. These simple nonparametric methods are independent from the neural approaches. In this paper, we combine the strengths of both neural and nonparametric models for incremental SR scenarios.

Memory-augmented neural models [7] have been well-known for the purpose of maintaining long-term memory, which has also recently been explored for sequential recommendation tasks [2, 25]. However, maintaining long-term memory is not the central concern in incremental SR scenarios. This category of memory-augmented neural models differs from ours in the sense that their memory components are parametric, as they need to be trained jointly with the rest of the model. Therefore, they are not suitable to be updated frequently without catastrophic forgetting in incremental SR scenarios.

Our work is largely inspired by a recently proposed nonparametric memory module that is not trained jointly with neural models. [6, 20] introduce a cache to augment RNNs for language modeling tasks. They later extend this cache to an unbounded size [5] and achieve significant performance improvements. A similar memory module has also been proposed for neural machine translation tasks [31] and image recognition tasks [24].

3 Model

Background

In session-based recommendation, an event is a click on an item, and the task is to predict the next event based on the sequence of events in the current web-browser session. Suppose I is the set of all candidate items and |I| is its size. Existing neural session-based recommenders typically contain two modules: an encoder to compute a compact sequence representation h_t of the sequence of events until time t, and a decoder to compute an output distribution P_neural over I to predict the next event. Recurrent neural networks (RNNs) and fully-connected layers are common choices for the encoder and decoder, respectively. In our later experiments, we include models that use different types of encoders and decoders. Our MAN framework is agnostic to the neural recommender; therefore, readers can use many other neural architectures with an encoder-decoder structure.
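To make this encoder-decoder structure concrete, below is a minimal PyTorch sketch of a GRU-based session encoder with a fully-connected decoder, roughly in the spirit of GRU4Rec; the class name, layer sizes, and return values are illustrative assumptions rather than the implementation used in the paper.

import torch
import torch.nn as nn

class SessionEncoderDecoder(nn.Module):
    # Minimal sketch: GRU encoder + fully-connected decoder over all items.
    def __init__(self, n_items, embed_size=50, hidden_size=100):
        super().__init__()
        self.embed = nn.Embedding(n_items, embed_size)
        self.encoder = nn.GRU(embed_size, hidden_size, batch_first=True)
        self.decoder = nn.Linear(hidden_size, n_items)

    def forward(self, session):               # session: (batch, seq_len) tensor of item ids
        emb = self.embed(session)              # (batch, seq_len, embed_size)
        _, h_t = self.encoder(emb)             # final GRU state: (1, batch, hidden_size)
        h_t = h_t.squeeze(0)                   # sequence representation, later used as memory key
        p_neural = torch.softmax(self.decoder(h_t), dim=-1)   # neural prediction over items
        return h_t, p_neural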

Next, we present the Memory Augmented Neural recommender (MAN) to augment a neural session-based recommender with a cache-like nonparametric memory to incrementally incorporate new items and preferences. We first introduce the architecture of the cache-like memory. Then we describe how to use it to generate nonparametric memory predictions, and how to merge memory and neural predictions through a lightweight gating network.

3.1 Nonparametric Memory Structure

Our memory M is an array of slots in the form of (key, value) pairs [23], as defined in Eq. (1). M is queried by keys and returns the corresponding values. We define the keys to be input sequence representations h_t computed by the neural encoder, and the values to be the corresponding labels of the next events. To scale to a large set of observations and long histories in practical recommendation scenarios, we do not restrict the size of M but store all pairs from previous events.

M = {(h_1, v_1), (h_2, v_2), ..., (h_|M|, v_|M|)}     (1)

3.2 Memory Prediction

To support efficient incremental learning, the memory module is not trained jointly with the neural recommender. Instead, it directly predicts the next event by computing a probability distribution using entries stored in M. For an input sequence, we first match the current sequence representation h_t against M to retrieve the k nearest neighbors {(h_j, v_j)} of h_t, where d_j is the Euclidean distance between the sequence representation of the j-th neighbor and h_t. Then, we use the k nearest neighbors to compute a nonparametric memory prediction P_mem using a variable kernel density estimation:

P_mem(i | h_t) = (1/Z) * sum_{j=1}^{k} 1[v_j = i] * exp(-d_j^2 / d_1^2)     (2)

where P_mem(i | h_t) is the probability of an item i, 1[v_j = i] is the Kronecker delta which equals one when the equality holds and zero otherwise, exp(-d_j^2 / d_1^2) is a Gaussian kernel, d_1 is the Euclidean distance between h_t and its closest neighbor in M, and Z is a normalization constant. P_mem assigns non-zero probability to at most k (the number of neighbors) items because the probabilities assigned to items that do not appear among the k nearest neighbors are zero. As a result, P_mem is a very sparse distribution formed as a mixture of the labels of the retrieved neighbors, weighted by their similarities to h_t. To capture new preference patterns incrementally, M is queried and updated incrementally during the testing phase so that it is always up-to-date.
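As a concrete illustration of Eq. (2), the following is a minimal numpy sketch of the memory query and kernel-weighted prediction; the function name and the exact kernel normalization are assumptions for illustration (the bandwidth is set to the distance of the closest neighbor, as described above), not the paper's exact implementation.

import numpy as np

def memory_predict(h_t, keys, values, n_items, k=50):
    # keys:   (N, d) array of stored sequence representations
    # values: (N,)   array of next-item labels stored alongside the keys
    d = np.linalg.norm(keys - h_t, axis=1)        # Euclidean distances to all stored keys
    nn = np.argsort(d)[:k]                        # indices of the k nearest neighbors
    d_nn = d[nn]
    d_1 = max(float(d_nn[0]), 1e-8)               # distance to the closest neighbor (bandwidth)
    w = np.exp(-(d_nn ** 2) / (d_1 ** 2))         # Gaussian kernel weights (assumed form)
    p = np.zeros(n_items)
    np.add.at(p, values[nn], w)                   # accumulate weight on each neighbor's label
    return p / p.sum()                            # sparse distribution over the item set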

Figure 1: Computation pipeline of Memory Augmented Neural recommender (MAN). Predictions from a neural network are augmented by predictions from a memory module through a gating network.

3.3 Gating Network to Combine Memory and Neural Predictions

The neural prediction P_neural mainly captures static, old preference patterns, while the nonparametric memory prediction P_mem can model infrequent and new preference patterns incrementally. To flexibly handle both cases, these two predictions are combined. A simple way proposed by [5, 24] is to linearly interpolate them with a fixed weight; we later call this version "MAN-Shallow" in our experiments in Section 4.5.

To better merge these two predictions in different sequential contexts, we propose to use a lightweight gating network [1] to learn the mixing coefficient as a function of the sequence representation h_t. We use a lightweight fully-connected neural network g defined in Eq. (3), with a single hidden layer of 100 hidden units, tanh as the hidden-layer activation function, and a Sigmoid function at the output layer:

g(h_t) = Sigmoid(w^T tanh(W h_t + b) + c)     (3)

The output of g is a scalar between 0 and 1 that measures the relative importance of P_neural, while the remaining fraction is multiplied with P_mem. The final prediction distribution P_final is a learned interpolation of P_neural and P_mem weighted by the output of g, computed in Eq. (4):

P_final = g(h_t) * P_neural + (1 - g(h_t)) * P_mem     (4)

The gating network is trained with the cross-entropy loss using P_final as the predictive distribution, after the normal training of the neural model and with both P_neural and P_mem fixed. Gradients are not computed for the large number of parameters of the neural model, to avoid computation overhead and interference with the trained neural model. Inspired by [1], the gating network is trained using only validation data. The idea is to train it on data not seen during the pre-training phase to better predict new preferences that might appear in incremental SR scenarios. We randomly select 90% of the validation data for training and the remaining 10% for early stopping. We compared this setup with training on the whole training set, and we found that our setup achieves better performance while being much more efficient.
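Below is a minimal PyTorch sketch of such a gating network and of the mixture in Eq. (4); the class name and exact parametrization are illustrative assumptions consistent with the description above (a single hidden layer of 100 tanh units and a sigmoid output), not the authors' code.

import torch
import torch.nn as nn

class GatingNetwork(nn.Module):
    # Lightweight gate g(h_t) in [0, 1]; sketch only.
    def __init__(self, hidden_size=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, 100),   # single hidden layer of 100 units
            nn.Tanh(),
            nn.Linear(100, 1),
            nn.Sigmoid(),
        )

    def forward(self, h_t):
        return self.net(h_t)               # (batch, 1) weight on the neural prediction

def combine(p_neural, p_mem, gate):
    # Eq. (4): learned interpolation of the neural and memory predictions.
    return gate * p_neural + (1.0 - gate) * p_mem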

3.4 Efficient Large-scale Nearest Neighbor Computation

As the number of events in practical SR scenarios is huge and we do not restrict the size of M, frequently computing nearest neighbors to generate P_mem can be expensive. We apply a scalable retrieval method used by [5]. To avoid exhaustive search, an inverted file table is maintained. Keys in M are first clustered into a set of clusters using k-means, so that each key in M is associated with one centroid. When we query h_t in M, the search first matches h_t against the set of centroids to find the closest cluster, and then searches only the keys in this cluster.

The clustered memory supports efficient querying, yet it is memory consuming because each key in M needs to be stored. This can be greatly reduced by Product Quantization (PQ [14]), which quantizes a vector by parts (sub-quantizers) and does not directly store the vector but its residual, i.e., the difference between the vector and its associated centroid. We cluster the keys into centroids and use 8 sub-quantizers per sequence representation, with 8 bits allocated per sub-quantizer; then we only need 16 bytes (quantization code + centroid id = 8 + 8) per vector. Therefore, a million sequence representations can be stored with only 16 MB of memory. With an inverted table and PQ, we have a fast approximate nearest neighbor retrieval method with a low memory footprint. For the implementation, we use the open-source FAISS library (https://github.com/facebookresearch/faiss), which also supports GPU acceleration.
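For illustration, an inverted-file index with product quantization can be built in FAISS roughly as follows; the number of centroids (nlist) and the nprobe setting are placeholder values, since the paper does not state them.

import numpy as np
import faiss

d = 100                                              # dimensionality of sequence representations
keys = np.random.rand(100000, d).astype('float32')   # stored memory keys (toy data)

nlist = 1024                              # number of k-means centroids (placeholder value)
m, nbits = 8, 8                           # 8 sub-quantizers with 8 bits each -> 8-byte PQ codes
quantizer = faiss.IndexFlatL2(d)          # coarse quantizer for the inverted file table
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(keys)                         # learn the centroids and PQ codebooks
index.add(keys)                           # add all memory keys
index.nprobe = 8                          # how many clusters to visit at query time (placeholder)

h_t = np.random.rand(1, d).astype('float32')
distances, neighbor_ids = index.search(h_t, 50)      # retrieve 50 approximate nearest neighbors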

1:procedure Train(D_train, D_valid)
2:     Train the neural encoder f_enc and decoder f_dec w.r.t. D_train
3:     for each (sequence s, next item v) in D_train do
4:         Compute h = f_enc(s); store (h, v) to M
5:     end for
6:     Build the inverted table T for M
7:     Fix f_enc, f_dec, M, and train the gating network g w.r.t. D_valid
8:end procedure
9:procedure Test(D_test)
10:     for each (sequence s, next item v) in D_test do
11:         Compute h = f_enc(s), and P_neural = f_dec(h)
12:         Query M and T with h to retrieve the k nearest neighbors
13:         Compute P_mem by Eq.(2), and P_final by Eq.(4)
14:         Update M and T with (h, v)
15:     end for
16:end procedure
Algorithm 1 Memory Augmented Neural Recommender

3.5 Overall MAN Algorithm

The computation pipeline and the algorithm of MAN are presented in Figure 1 and Algorithm 1, respectively. Next, we describe the training and testing procedures of MAN in detail and also analyze its computation efficiency.

Training procedure

MAN first trains the neural encoder and decoder on the training set. Then, it computes sequence representations for all training data and stores them with the corresponding labels in M. Afterwards, the clustered memory with the inverted table is built from the entries in M. Lastly, the gating network is trained on the validation set.

Testing procedure

The sequence representation h_t and the neural prediction P_neural are first computed. Then, M and the inverted table are queried with h_t to retrieve the k nearest neighbors used to compute P_mem. To generate the final recommendation, P_mem is merged with P_neural, weighted by the output of the gating network. Lastly, M and the inverted table are incrementally updated with the new testing pair. During testing, the clustered memory (i.e., the k-means centroids) is not updated, for the purpose of computation efficiency. Running k-means to update the clustered memory on huge datasets with large dimensions is computationally intensive. Therefore, one needs to decide when and how to update the clustering, and there is a trade-off between the performance benefits and the computation overhead. Studies on this part are left as interesting future work.
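Putting the pieces together, a rough Python sketch of this testing loop (reusing the hypothetical model, gate, and memory_predict helpers from the earlier sketches, with brute-force retrieval for clarity) might look as follows; it mirrors Algorithm 1 rather than reproducing the authors' code.

import numpy as np

def incremental_test(model, gate, keys, values, stream, n_items, k=50):
    # keys/values are assumed pre-populated from the training set (Algorithm 1, lines 3-5).
    hits = []
    for session, next_item in stream:                  # one test event at a time
        h_t, p_neural = model(session)                 # neural encoder + decoder
        g = float(gate(h_t))                           # gate weight on the neural prediction
        h = h_t.detach().numpy().ravel()
        p_mem = memory_predict(h, np.array(keys), np.array(values), n_items, k)   # Eq. (2)
        p_final = g * p_neural.detach().numpy().ravel() + (1 - g) * p_mem         # Eq. (4)
        hits.append(int(next_item in np.argsort(p_final)[-5:]))                   # HR@5-style hit
        keys.append(h); values.append(next_item)       # expand the memory incrementally
    return float(np.mean(hits))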

Dataset    | Training Events | Training Sessions | Training Items | Validation Events | Validation Sessions | Testing Events | Testing Sessions | New Events
YOOCHOOSE  | 6,245,412       | 1,535,693         | 22,594         | 693,935           | 170,633             | 748,269        | 178,920          | 8.6%
DIGINETICA | 636,506         | 130,994           | 42,294         | 70,723            | 14,555              | 286,254        | 59,240           | 3.3%

Table 1: Statistics of the two datasets. The last column ('New Events') indicates the percentage of testing events that involve new items not in the training set.

Computation efficiency analysis

During training, the additional training procedures of MAN on top of the regular neural recommender training are efficient. (i) Sequence representations of the training data can be obtained directly from the last regular training epoch of the neural recommender; therefore, no computation overhead is introduced at this step (lines 3-5). (ii) Building the clustered memory and the inverted table for entries in M (line 6) is also fast with the FAISS library. (iii) Training the lightweight gating network using only the validation split (line 7) is much more efficient than the regular training of the base neural recommender. During testing, querying the memory to retrieve nearest neighbors can be done very efficiently with FAISS, and a forward computation through the lightweight gating network is also efficient.

4 Experiments and Analyses

4.1 Datasets

YOOCHOOSE: This is a public dataset for RecSys Challenge 2015 (http://2015.recsyschallenge.com/challenge.html). It contains click-streams on an e-commerce site over 6 months. Events in the last week are tested. Following [30, 16], we use the latest 1/4 of training sequences because YOOCHOOSE is very large.

DIGINETICA: This dataset contains click-stream data on another e-commerce site over a span of 5 months for CIKM Cup 2016 (http://cikm2016.cs.iupui.edu/cikm-cup). Events in the last four weeks are tested.

Items that appear less than five times, and sessions of length shorter than two or longer than 20, are filtered out. Statistics of the two datasets after pruning are summarized in Table 1, and the last 10% of the training data based on time is used as the validation set. Different from previous static settings that remove items not seen in the training phase from the test sets, our test sets for the incremental SR task include events on new items that appear only during the testing phase. The last column indicates the percentage of events in the test data that involve items not part of the training data.

4.2 Evaluation Metrics

HR@k: Average hit rate when the desired item is amongst the top-k recommended items. It can be interpreted as precision [17, 33] or recall [11, 16, 13] because we predict the immediate next event.

MRR@k: HR@k does not consider the order of the items recommended. MRR@k measures the mean reciprocal ranks of the desired items in top-k recommended items.
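For concreteness, a minimal sketch of how these two metrics can be computed for a single test event is shown below; the function and variable names are illustrative.

import numpy as np

def hr_mrr_at_k(scores, target, k=5):
    # scores: 1-D array of predicted scores over all items; target: index of the true next item.
    topk = np.argsort(scores)[::-1][:k]          # indices of the top-k recommended items
    hit = int(target in topk)                    # HR@k contribution for this event
    if hit:
        rank = int(np.where(topk == target)[0][0]) + 1
        rr = 1.0 / rank                          # reciprocal rank within the top-k list
    else:
        rr = 0.0                                 # MRR@k counts misses beyond the top-k as zero
    return hit, rr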

4.3 Models and Training Details

Item-KNN: This simple baseline recommends top items similar to the single last item in the current session, based on the co-occurrence statistics of two items in other sessions.

(S)-SKNN: Instead of considering only the last item, SKNN [13] compares all items in the current session with items in other sessions. S-SKNN [18] is an improved version that assigns more weights to items that appear later in a session.

CT [22]: It builds a nonparametric recommender based on a structure called Context Tree to model suffixes of a sequence.

GRU4Rec [12]: It uses an RNN with GRU as encoder. It also uses specialized ranking-based losses computed w.r.t. the most relevant samples, which perform better than their initial version [11].

NARM [16]: It improves the encoder of GRU4Rec with an item-level attention and replaces the decoder by a bilinear decoding scheme.

MAN (proposed): The method proposed in this paper. Two versions (MAN-GRU4Rec and MAN-NARM) are tested, using GRU4Rec and NARM as the base neural models respectively. Unless mentioned specifically, MAN is based on NARM. (In this paper, we chose the two most representative base neural models, GRU4Rec and NARM, to evaluate MAN. As MAN is agnostic to the neural model, the exploration of using MAN on top of other neural recommenders is left for future work.)

Learning rate     | 1e-2  | 5e-3  | 1e-3  | 5e-4  | 1e-4  | 5e-5  | 0
NARM (YOOCHOOSE)  | 0.389 | 0.422 | 0.460 | 0.463 | 0.447 | 0.440 | 0.420
NARM (DIGINETICA) | 0.235 | 0.255 | 0.315 | 0.338 | 0.355 | 0.350 | 0.324
Table 2: HR@5 with different learning rates to update NARM incrementally. NARM is fixed during testing when the learning rate is 0.
            |          YOOCHOOSE           |          DIGINETICA
Model       | HR@5  MRR@5  HR@20  MRR@20   | HR@5  MRR@5  HR@20  MRR@20
Item-KNN    | 0.205 0.114  0.403  0.127    | 0.112 0.042  0.186  0.056
GRU4Rec     | 0.359 0.216  0.582  0.228    | 0.191 0.114  0.382  0.135
SKNN        | 0.411 0.245  0.625  0.268    | 0.262 0.156  0.489  0.177
S-SKNN      | 0.416 0.247  0.628  0.272    | 0.279 0.170  0.497  0.185
CT          | 0.427 0.263  0.618  0.286    | 0.290 0.189  0.515  0.206
MAN-GRU4Rec | 0.447 0.269  0.657  0.293    | 0.331 0.203  0.545  0.226
NARM        | 0.463 0.280  0.682  0.303    | 0.358 0.221  0.566  0.242
MAN-NARM    | 0.476 0.292  0.689  0.314    | 0.381 0.234  0.599  0.258
Table 3: Overall results of different models for the incremental SR task on two datasets. Models are ranked by HR@5, and the best method in each column is in bold.
Figure 2: Incremental performance (HR@5) as the number of tested events increases.

During training, the hidden layer size of GRU4Rec and NARM is set to 100, and the item embedding size of NARM is set to 50; the batch size is set to 512, and both models are trained for 30 epochs. Hyper-parameters of the different models are tuned on the validation splits to maximize HR@5. The number of nearest neighbors of MAN is set to 50 for YOOCHOOSE and 100 for DIGINETICA. During testing, we update the different neural models and MAN incrementally using a single gradient descent step after every batch of 100 events is tested. The learning rates for the neural models are 5e-4 for YOOCHOOSE and 1e-4 for DIGINETICA, and the learning rate to update the gating network is 1e-3. The pure nonparametric methods (Item-KNN, (S)-SKNN, CT) are also incrementally updated after every batch of 100 events is tested.
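As a rough sketch of this incremental update scheme, the following applies one small gradient step to the neural model after each tested batch of events; the optimizer setup and helper names are assumptions, not the exact training code.

import torch

def incremental_update(model, batch_sessions, batch_targets, lr=5e-4):
    # One small gradient step on the most recent batch of 100 tested events (sketch).
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = torch.nn.NLLLoss()
    model.train()
    optimizer.zero_grad()
    _, probs = model(batch_sessions)                            # encoder-decoder forward pass
    loss = criterion(torch.log(probs + 1e-12), batch_targets)   # cross-entropy on next items
    loss.backward()                                             # single gradient step only
    optimizer.step()
    model.eval()
    return float(loss)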

4.4 Experiment Results

In this section, we first study the effect of using different learning rates to update neural models incrementally. Then, we analyze the overall performance of different methods and plot the incremental performance of two representative methods as more events are tested.

Effect of incremental learning rate

Before we proceed to our main results, we first highlight that the degree of updates is important when neural models are updated incrementally. As an example, we show results of using different learning rates to update NARM in Table 2. (Using multiple incremental training epochs has a similar effect as using a large learning rate in one epoch.) We can see that using large learning rates to incrementally update NARM degrades performance severely, while using relatively small learning rates outperforms the version where NARM is fixed during testing (learning rate 0). We believe that large learning rates cause the model to overfit new patterns while catastrophically forgetting old ones. Therefore, we contend that incremental updates for neural models need to be small.

Overall Performance

The results of the different methods on the two datasets are summarized in Table 3. Several interesting empirical results can be noted. First, MAN-NARM achieves the top performance. Both MAN-NARM and MAN-GRU4Rec consistently outperform their individual neural modules (NARM and GRU4Rec) by notable margins. This result shows that our MAN architecture effectively helps standard neural recommenders in incremental SR scenarios. Second, MAN is not sensitive to the choice of neural recommender. We observed that even though NARM significantly outperforms GRU4Rec, MAN-GRU4Rec and MAN-NARM show comparable performance. We contend that the memory predictions effectively compensate for the failed predictions of GRU4Rec. Third, (S)-SKNN and CT are much stronger nonparametric methods than Item-KNN, with CT being slightly better. They both outperform GRU4Rec on the two datasets, and their performance gaps compared to MAN and NARM are smaller on the YOOCHOOSE dataset, which contains more new events.

Incremental Performance

In Figure 2, we present the incremental performance of MAN and NARM as more events are evaluated during testing. We also include versions that are fixed during the testing phase: in NARM-Fixed, the neural model is fixed; in MAN-Fixed, both the neural model and the gating network are fixed, and only the memory entries are incrementally expanded. Several results can be noted from Figure 2. First, MAN helps to boost performance both when the neural model is fixed and when it is incrementally updated. MAN outperforms NARM by consistent margins, and MAN-Fixed outperforms NARM-Fixed by increasing margins. Second, incrementally updated models consistently outperform their fixed versions. Both MAN and NARM outperform their corresponding fixed versions by significant and increasing margins.

4.5 In-depth Analysis

In the following experiments, we conduct in-depth analysis to further understand the effectiveness and efficiency of different components of MAN and for what types of events MAN most improves predictions.

            | YOOCHOOSE       | DIGINETICA
Model       | HR@5   MRR@5    | HR@5   MRR@5
MAN         | 0.476  0.292    | 0.381  0.234
MAN-Shallow | 0.469  0.286    | 0.374  0.228
MAN-50k     | 0.469  0.287    | 0.376  0.231
MAN-10k     | 0.466  0.284    | 0.369  0.226
Table 4: Ablation study for MAN. Both the large memory size and the gating network of MAN bring performance benefits.

Ablation Study

In Table 4, we further study the effect of two design choices in MAN, i.e., the large memory size and the combining scheme with a gating network. Two types of simpler versions are compared. (i) MAN-Shallow linearly combines neural and memory predictions with a fixed scalar rather than the weight output by the gating network. The scalar weight is tuned on the validation splits to be 0.7 for YOOCHOOSE and 0.8 for DIGINETICA. This simple method serves as a strong baseline for learning new vocabularies in language modeling tasks [20, 5]. (ii) MAN-50k/10k use fixed-size memories that only store a limited number of recent events (50k/10k). MAN-Shallow is consistently inferior to MAN, which means the gating network makes better decisions when combining neural and memory predictions. Furthermore, performance drops as the size of the memory decreases from unbounded (MAN) to 50k and to 10k, which means that keeping a big memory covering long histories achieves better recommendation performance. Despite the slight performance drop, the three simplified versions are more efficient; therefore, they are still suitable candidates for industry.

Figure 3: Visualizing the empirical distribution of the Gating Network outputs assigned to the parametric part on DIGINETICA. Outputs on old (in the training phase) and new (not in the training phase) items are plotted separately. MAN assigns small weights to the parametric part when items are new, and large weights to it when items are old.
       | YOOCHOOSE              | DIGINETICA
Model  | Train   Avg. Response  | Train   Avg. Response
NARM   | 375.5   25.5           | 42.5    23.5
MAN    | 392.2   29.5           | 45.8    27.8
Table 5: Computation time of MAN and NARM. Total training time (in minutes) and average response time (in milliseconds) per recommendation are reported for both datasets. The computation overhead of MAN on top of NARM is trivial.

Visualization of Gating Network Outputs

In this experiment, we further visualize the empirical distribution of the gating network outputs assigned to the parametric part. Figure 3 shows that the gating network indeed learns the desired weight distribution over the parametric and non-parametric parts. For new items, denoted by the red curve, outputs of the gating network tend to be small, so larger weights are assigned to the non-parametric memory prediction. Conversely, for old items, denoted by the blue curve, outputs of the gating network tend to be large, so smaller weights are given to the non-parametric memory prediction.

Computation Time

In Table 5, we report the total training time and the average response time per recommendation of MAN compared to NARM. Both models are trained using an NVIDIA TITAN X GPU with 12GB memory. We can see that the extra training time and response time of MAN on top of NARM are trivial. These empirical results, together with our analysis in Section 3.5, demonstrate that MAN is computationally efficient.

Disentangled Performance

To further understand for what types of events MAN most improves predictions, we study the disentangled performance when items are bucketed into five groups by their occurrence frequency in the training data. Bucket 1 contains the least frequent items, and bucket 5 contains the most frequent items. Bucket splitting intervals are chosen so that all buckets have the same size. The disentangled performance of MAN and NARM across the five buckets is reported in Table 6, and the results reveal that:

  • Infrequent items are more challenging to predict. The performances of both methods have an increasing trend as item frequency increases.

  • MAN is consistently better than NARM on all levels of item frequency.

  • The improvement margin of MAN over NARM is most significant on infrequent items (small bucket numbers). Therefore, we contend that MAN is especially good at learning new patterns and items with a small number of observations.

Bucket #          | 1     | 2     | 3     | 4     | 5
MAN (YOOCHOOSE)   | 0.318 | 0.263 | 0.319 | 0.378 | 0.517
NARM (YOOCHOOSE)  | 0.287 | 0.247 | 0.302 | 0.362 | 0.510
MAN (DIGINETICA)  | 0.145 | 0.187 | 0.251 | 0.329 | 0.515
NARM (DIGINETICA) | 0.117 | 0.169 | 0.230 | 0.303 | 0.494
Table 6: Disentangled performance (HR@5) at different item frequency buckets. MAN is consistently better at all buckets, and the improvement margin on infrequent items is the most significant.

5 Conclusion

In this paper, we study neural methods in a realistic incremental session-based recommendation scenario. We show that existing neural models can be used in this scenario with small incremental updates, and we propose a general method called Memory Augmented Neural recommender (MAN) that is widely applicable for augmenting different existing neural models. MAN uses an efficient nonparametric memory to compute a memory prediction, which is combined with the neural prediction through a lightweight gating network. We show that MAN consistently outperforms state-of-the-art neural and nonparametric methods, and that it is particularly good at learning new items with an insufficient number of observations.

References

  • [1] A. Bakhtin, A. Szlam, M. Ranzato, and E. Grave (2018) Lightweight adaptive mixture of neural and n-gram language models. arXiv preprint arXiv:1804.07705. Cited by: §3.3.
  • [2] X. Chen, H. Xu, Y. Zhang, J. Tang, Y. Cao, Z. Qin, and H. Zha (2018) Sequential recommendation with user memory networks. In WSDM, pp. 108–116. Cited by: §2.
  • [3] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.
  • [4] D. Garg, P. Gupta, P. Malhotra, L. Vig, and G. Shroff (2019) Sequence and time aware neighborhood for session-based recommendations: stan. In SIGIR, pp. 1069–1072. Cited by: §2.
  • [5] E. Grave, M. M. Cisse, and A. Joulin (2017) Unbounded cache model for online language modeling with open vocabulary. In NIPS, pp. 6042–6052. Cited by: §1, §2, §3.3, §3.4, §4.5.
  • [6] E. Grave, A. Joulin, and N. Usunier (2017) Improving neural language models with a continuous cache. In ICLR, Cited by: §2.
  • [7] A. Graves, G. Wayne, and I. Danihelka (2014) Neural turing machines. arXiv preprint arXiv:1410.5401. Cited by: §2.
  • [8] W. Härdle and O. Linton (1994) Applied nonparametric methods. Handbook of Econometrics 4, pp. 2295–2339. Cited by: §2.
  • [9] R. He and J. McAuley (2016) Fusing similarity models with markov chains for sparse sequential recommendation. In ICDM, pp. 191–200. Cited by: §2.
  • [10] X. He, H. Zhang, M. Kan, and T. Chua (2016) Fast matrix factorization for online recommendation with implicit feedback. In SIGIR, pp. 549–558. Cited by: §2.
  • [11] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk (2016) Session-based recommendations with recurrent neural networks. In ICLR, Cited by: §1, §2, §4.2, §4.3.
  • [12] B. Hidasi and A. Karatzoglou (2018) Recurrent neural networks with top-k gains for session-based recommendations. In CIKM, pp. 843–852. Cited by: §1, §2, §4.3.
  • [13] D. Jannach and M. Ludewig (2017) When recurrent neural networks meet the neighborhood for session-based recommendation. In RecSys, pp. 306–310. Cited by: §2, §4.2, §4.3.
  • [14] H. Jegou, M. Douze, and C. Schmid (2010) Product quantization for nearest neighbor search. PAMI 33 (1), pp. 117–128. Cited by: §3.4.
  • [15] W. Kang and J. McAuley (2018) Self-attentive sequential recommendation. In ICDM, pp. 197–206. Cited by: §2.
  • [16] J. Li, P. Ren, Z. Chen, Z. Ren, T. Lian, and J. Ma (2017) Neural attentive session-based recommendation. In CIKM, pp. 1419–1428. Cited by: §1, §2, §2, §4.1, §4.2, §4.3.
  • [17] Q. Liu, Y. Zeng, R. Mokhosi, and H. Zhang (2018) STAMP: short-term attention/memory priority model for session-based recommendation. In SIGKDD, pp. 1831–1839. Cited by: §1, §2, §4.2.
  • [18] M. Ludewig and D. Jannach (2018) Evaluation of session-based recommendation algorithms. User Modeling and User-Adapted Interaction 28 (4-5), pp. 331–390. Cited by: §2, §2, §4.3.
  • [19] M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of Learning and Motivation, Vol. 24, pp. 109–165. Cited by: §1.
  • [20] S. Merity, C. Xiong, J. Bradbury, and R. Socher (2017) Pointer sentinel mixture models. In ICLR, Cited by: §1, §2, §4.5.
  • [21] F. Mi and B. Faltings (2017) Adaptive sequential recommendation for discussion forums on moocs using context trees. In Proceedings of the 10th international conference on educational data mining, Cited by: §2.
  • [22] F. Mi and B. Faltings (2018) Context tree for adaptive session-based recommendation. arXiv preprint arXiv:1806.03733. Cited by: §2, §4.3.
  • [23] A. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes, and J. Weston (2016) Key-value memory networks for directly reading documents. In EMNLP, pp. 1400–1409. Cited by: §3.1.
  • [24] E. Orhan (2018) A simple cache model for image recognition. In NIPS, pp. 10107–10116. Cited by: §1, §2, §3.3.
  • [25] K. Ren, J. Qin, Y. Fang, W. Zhang, L. Zheng, W. Bian, G. Zhou, J. Xu, Y. Yu, X. Zhu, et al. (2019) Lifelong sequential modeling with personalized memorization for user response prediction. In SIGIR, pp. 565–574. Cited by: §2.
  • [26] P. Ren, Z. Chen, J. Li, Z. Ren, J. Ma, and M. de Rijke (2019) RepeatNet: a repeat aware neural recommendation machine for session-based recommendation. In AAAI, Cited by: §2.
  • [27] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme (2010) Factorizing personalized markov chains for next-basket recommendation. In WWW, pp. 811–820. Cited by: §2.
  • [28] G. Shani, D. Heckerman, and R. I. Brafman (2005) An mdp-based recommender system. JMLR 6 (Sep), pp. 1265–1295. Cited by: §2.
  • [29] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, and P. Jiang (2019) BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer. In CIKM, pp. 1441–1450. Cited by: §1, §2.
  • [30] Y. K. Tan, X. Xu, and Y. Liu (2016) Improved recurrent neural networks for session-based recommendations. In 1st Workshop on Deep Learning for Recommender Systems, pp. 17–22. Cited by: §2, §4.1.
  • [31] Z. Tu, Y. Liu, S. Shi, and T. Zhang (2018) Learning to remember translation history with a continuous cache. TACL 6, pp. 407–420. Cited by: §1, §2.
  • [32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §2.
  • [33] S. Wu, Y. Tang, Y. Zhu, L. Wang, X. Xie, and T. Tan (2019) Session-based recommendation with graph neural networks. In AAAI, Cited by: §2, §2, §4.2.

4 Experiments and Analyses

4.1 Datasets

YOOCHOOSE: This is a public dataset for RecSys Challenge 2015.222http://2015.recsyschallenge.com/challenge.html It contains click-streams on an e-commerce site over 6 months. Events in the last week are tested. Following [30, 16], we use the latest 1/4 training sequences because YOOCHOOSE is very large.

DIGINETICA: This dataset contains click-streams data on another e-commerce site over a span of 5 months for CIKM Cup 2016.333http://cikm2016.cs.iupui.edu/cikm-cup Events in the last four weeks are tested.

Items that appear less than five times, and sessions of length shorter than two or longer than 20 are filtered out. Statistics of the two datasets after pruning are summarized in Table 3.5, and the last 10% of the training data based on time is used as the validation set. Different from previous static settings that remove items not in the training phase from test sets, our test sets for the incremental SR task include events on new items that appear only during the testing phase. The last column indicates the percentage of events in the test data that involve items not part of the training data.

4.2 Evaluation Metrics

HR@k: Average hit rate when the desired item is amongst the top-k recommended items. It can be interpreted as precision [17, 33] or recall [11, 16, 13] because we predict the immediate next event.

MRR@k: HR@k does not consider the order of the items recommended. MRR@k measures the mean reciprocal ranks of the desired items in top-k recommended items.

4.3 Models and Training Details

Item-KNN

: This simple baseline recommends top items similar to the single last item in the current session based on co-occurrence statistics of two items in other sessions.

(S)-SKNN: Instead of considering only the last item, SKNN [13] compares all items in the current session with items in other sessions. S-SKNN [18] is an improved version that assigns more weights to items that appear later in a session.

CT [22]: It builds a nonparametric recommender based on a structure called Context Tree to model suffixes of a sequence.

GRU4Rec [12]: It uses an RNN with GRU as encoder. It also uses specialized ranking-based losses computed w.r.t. the most relevant samples, which perform better than their initial version [11].

NARM [16]: It improves the encoder of GRU4Rec with an item-level attention and replaces the decoder by a bilinear decoding scheme.

MAN (proposed): The method proposed in this paper. Two versions (MAN-GRU4Rec and MAN-NARM) are tested using GRU4Rec and NARM as base neural models respectively. Unless mentioned specifically, MAN is based on NARM. 444In this paper, we chose the two most representative base neural models (GRU4Rec and NARM) to evaluate MAN. As MAN is agnostic to the neural model, the exploration of using MAN on top of other neural recommenders is left for future work.

1e-2 5e-3 1e-3 5e-4 1e-4 5e-5 0
YOOCHOOSE
NARM 0.389 0.422 0.460 0.463 0.447 0.440 0.420
DIGINETICA
NARM 0.235 0.255 0.315 0.338 0.355 0.350 0.324
Table 2: HR@5 with different learning rate to update NARM incrementally. NARM is fixed during testing when .
YOOCHOOSE DIGINETICA
HR@5 MRR@5 HR@20 MRR@20 HR@5 MRR@5 HR@20 MRR@20
Item-KNN 0.205 0.114 0.403 0.127 0.112 0.042 0.186 0.056
GRU4Rec 0.359 0.216 0.582 0.228 0.191 0.114 0.382 0.135
SKNN 0.411 0.245 0.625 0.268 0.262 0.156 0.489 0.177
S-SKNN 0.416 0.247 0.628 0.272 0.279 0.170 0.497 0.185
CT 0.427 0.263 0.618 0.286 0.290 0.189 0.515 0.206
MAN-GRU4Rec 0.447 0.269 0.657 0.293 0.331 0.203 0.545 0.226
NARM 0.463 0.280 0.682 0.303 0.358 0.221 0.566 0.242
MAN-NARM 0.476 0.292 0.689 0.314 0.381 0.234 0.599 0.258
Table 3: Overall results of different models for the incremental SR task on two datasets. Models are ranked by HR@5, and the best method in each column is in bold.
Figure 2: Incremental performance (HR@5) as the number of tested events increases.

During training, hidden layer size of GRU4Rec and NARM is set to 100, and the item embedding size of NARM is set to 50; the batch size is set to 512 and 30 epochs are trained for both models. Hyper-parameters of different models are tuned on validation splits to maximize HR@5. The number of nearest neighbors of MAN is set to 50 for YOOCHOOSE and 100 for DIGINETICA. During testing, we update different neural models and MAN incrementally using a single gradient descent step as every batch of 100 events are tested. Learning rates for neural models are 5e-4 and 1e-4 for YOOCHOOSE and DIGINETICA, and the learning rate to update the gating network is 1e-3. Other pure nonparametric methods (Item-KNN, (S)-SKNN, CT) are also incrementally updated as every batch of 100 events is tested.

4.4 Experiment Results

In this section, we first study the effect of using different learning rates to update neural models incrementally. Then, we analyze the overall performance of different methods and plot the incremental performance of two representative methods as more events are tested.

Effect of incremental learning rate

Before we proceed to our main results, we first highlight that the degree of updates is important when neural models are updated incrementally. As an example, we show results of using different learning rates to update NARM in Table 2. We can see that using large learning rates 555 Using multiple incremental training epochs has a similar effect as using large earning rates in one epoch. to incrementally update NARM degrades performance severely, while using relatively small learning rates outperforms the version when NARM is fixed during testing (). We believe that large learning rates cause the model to overfit new patterns while catastrophically forgetting old patterns. Therefore, we contend that incremental updates for neural models needs to be small.

Overall Performance

The results of different methods on two datasets are summarized in Table 3. Several interesting empirical results can be noted: First, MAN-NARM achieves the top performance. Both MAN-NARM and MAN-GRU4Rec consistently outperform their individual neural modules (NARM and GRU4Rec) with notable margins. This result shows that our MAN architecture effectively helps standard neural recommenders for incremental SR scenarios. Second, MAN is not sensitive to neural recommenders. We observed that even though NARM significantly outperforms GRU4Rec, MAN-GRU4Rec and MAN-NARM show comparable performances. We contend that the memory predictions effectively compensate the failed predictions of GRU4Rec. Third, (S)-SKNN and CT are much stronger nonparametric methods than Item-KNN, with CT being slightly better. They both outperform GRU4Rec on two datasets, and their performance gaps compared to MAN and NARM are smaller on the YOUCHOOSE dataset that contains more new events.

Incremental Performance

In Figure 2, we present incremental performance of MAN and NARM as more events are evaluated during testing. We also included versions that are fixed during the testing phase. In NARM-Fixed, the neural model is fixed. In MAN-Fixed, both the neural model and the gating networks are fixed, and only memory entries are incrementally expanded. Several results can be noted from Figure 2. First, MAN helps to boost performance both when the neural model is fixed and incrementally updated. MAN outperforms NARM by consistent margins, and MAN-Fix outperforms NARM-Fix by increasing margins. Second, incrementally updated models consistently outperform their fixed versions. Both MAN and NARM outperform their corresponding fixed versions with significant and increasing margins.

4.5 In-depth Analysis

In the following experiments, we conduct in-depth analysis to further understand the effectiveness and efficiency of different components of MAN and for what types of events MAN most improves predictions.

YOOCHOOSE DIGINETICA
HR@5 MRR@5 HR@5 MRR@5
MAN 0.476 0.292 0.381 0.234
MAN-Shallow 0.469 0.286 0.374 0.228
MAN-50k 0.469 0.287 0.376 0.231
MAN-10k 0.466 0.284 0.369 0.226
Table 4: Ablation study for MAN. The large memory size and the gating network of MAN gain performance benefits.

Ablation Study

We further studied in Table 4 the effect of two setups in MAN, i.e., the large memory size and the combining scheme with a gating network. Two simpler versions are compared. (i) MAN-Shallow linearly combines neural and memory predictions with a fixed scalar rather than the weight output by the gating network. The scalar weight is tuned on validation splits to be 0.7 for YOOCHOOSE and 0.8 for DIGINETICA. This simple method serves as a strong baseline in learning new vocabularies in language model tasks [20, 5]. (ii) MAN-50k/10k use fixed-size memories that only store a limited number of recent events (50k/10k). MAN-Shallow is consistently inferior to MAN. It means the gating network makes a better decision to combine neural and memory predictions. Furthermore, the performance drops as the size of the memory decrease from unbounded (MAN) to 50k, and to 10k. It means that keeping a big memory that handles long histories achieves better recommendation performance. Despite the slight performance drop, the three simplified versions are more efficient, therefore, they are still suitable candidates for industry.

Figure 3: Visualizing the empirical distribution of the Gating Network outputs assigned to the parametric part on DIGINETICA. Outputs on old (in the training phase) and new (not in the training phase) items are plotted separately. MAN assigns small weights to the parametric part when items are new, and large weights to it when items are old.
YOOCHOOSE DIGINETICA
Train Avg. Response Train Avg. Response
NARM 375.5 25.5 42.5 23.5
MAN 392.2 29.5 45.8 27.8
Table 5: Computation time of MAN and NARM. Total training time (in minutes) and average response time (in milliseconds) per recommendation are reported for both datasets. The computation overhead of MAN on top of NARM is trivial.

Visualization of Gating Network Outputs

In this experiment, we further visualize the empirical distribution of the gating network outputs assigned to the parametric part. We demonstrate in Figure 3 that it indeed learns desired weight distributions to the parametric and non-parametric parts. For new items denoted by the red curve, outputs of the gating network tend to be small such that larger weights are assigned to the non-parametric memory prediction. Similarly, for old items denoted by the blue curve, outputs of the gating network tend to be large such that smaller weights are given to the non-parametric memory prediction.

Computation Time

In Table 5, we report the total training time and average response time per recommendation of MAN compared to NARM. Both models are trained using a NVIDIA TITAN X GPU with 12GB memory. We can see that the extra training time and response time of MAN is trivial on top of the NARM. This empirical results, together with our analysis in Section 3.5, demonstrate that MAN is computation efficient.

Disentangled Performance

To further understand for what types of events that MAN most improves predictions, we studied the disentangled performance when items are bucketed into five groups by their occurrence frequency in training data. Bucket 1 contains the least frequent items, and bucket 5 contains the most frequent items. Bucket splitting intervals are chosen to ensure the bucket size is the same. The disentangled performances of MAN and NARM across five buckets are reported in Table 6, and the results reveal that:

  • Infrequent items are more challenging to predict. The performances of both methods have an increasing trend as item frequency increases.

  • MAN is consistently better than NARM on all levels of item frequency.

  • The improvement margin of MAN over NARM is very significant on infrequent items (small bucket number). Therefore, we contend that MAN it is especially good at learning new patterns and items with a small number of observations.

Bucket # 1 2 3 4 5
YOOCHOOSE
MAN 0.318 0.263 0.319 0.378 0.517
NARM 0.287 0.247 0.302 0.362 0.510
DIGINETICA
MAN 0.145 0.187 0.251 0.329 0.515
NARM 0.117 0.169 0.230 0.303 0.494
Table 6: Disentangled performance (HR@5) at different item frequency buckets. MAN is consistently better at all buckets, and the improvement margin on infrequent items is the most significant.

5 Conclusion

In this paper, we study neural methods in a realistic incremental session-based recommendation scenario. We show that existing neural models can be used in this scenario with small incremental updates, and we propose a general method called Memory Augmented Neural recommender (MAN) that is widely applicable to augment different existing neural models. MAN uses a efficient nonparametric memory to compute a memory prediction, which is combined with the neural prediction through a lightweight gating network. We show that MAN consistently outperforms state-of-the-art neural and nonparametric methods and it is particularly good at learning new items with insufficient number of observations.

References

  • [1] A. Bakhtin, A. Szlam, M. Ranzato, and E. Grave (2018)

    Lightweight adaptive mixture of neural and n-gram language models

    .
    arXiv preprint arXiv:1804.07705. Cited by: §3.3.
  • [2] X. Chen, H. Xu, Y. Zhang, J. Tang, Y. Cao, Z. Qin, and H. Zha (2018) Sequential recommendation with user memory networks. In WSDM, pp. 108–116. Cited by: §2.
  • [3] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.
  • [4] D. Garg, P. Gupta, P. Malhotra, L. Vig, and G. Shroff (2019) Sequence and time aware neighborhood for session-based recommendations: stan. In SIGIR, pp. 1069–1072. Cited by: §2.
  • [5] E. Grave, M. M. Cisse, and A. Joulin (2017) Unbounded cache model for online language modeling with open vocabulary. In NIPS, pp. 6042–6052. Cited by: §1, §2, §3.3, §3.4, §4.5.
  • [6] E. Grave, A. Joulin, and N. Usunier (2017) Improving neural language models with a continuous cache. In ICLR, Cited by: §2.
  • [7] A. Graves, G. Wayne, and I. Danihelka (2014) Neural turing machines. arXiv preprint arXiv:1410.5401. Cited by: §2.
  • [8] W. Härdle and O. Linton (1994) Applied nonparametric methods. Handbook of Econometrics 4, pp. 2295–2339. Cited by: §2.
  • [9] R. He and J. McAuley (2016) Fusing similarity models with markov chains for sparse sequential recommendation. In ICDM, pp. 191–200. Cited by: §2.
  • [10] X. He, H. Zhang, M. Kan, and T. Chua (2016) Fast matrix factorization for online recommendation with implicit feedback. In SIGIR, pp. 549–558. Cited by: §2.
  • [11] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk (2016) Session-based recommendations with recurrent neural networks. In ICLR, Cited by: §1, §2, §4.2, §4.3.
  • [12] B. Hidasi and A. Karatzoglou (2018) Recurrent neural networks with top-k gains for session-based recommendations. In CIKM, pp. 843–852. Cited by: §1, §2, §4.3.
  • [13] D. Jannach and M. Ludewig (2017) When recurrent neural networks meet the neighborhood for session-based recommendation. In RecSys, pp. 306–310. Cited by: §2, §4.2, §4.3.
  • [14] H. Jegou, M. Douze, and C. Schmid (2010) Product quantization for nearest neighbor search. PAMI 33 (1), pp. 117–128. Cited by: §3.4.
  • [15] W. Kang and J. McAuley (2018) Self-attentive sequential recommendation. In ICDM, pp. 197–206. Cited by: §2.
  • [16] J. Li, P. Ren, Z. Chen, Z. Ren, T. Lian, and J. Ma (2017) Neural attentive session-based recommendation. In CIKM, pp. 1419–1428. Cited by: §1, §2, §2, §4.1, §4.2, §4.3.
  • [17] Q. Liu, Y. Zeng, R. Mokhosi, and H. Zhang (2018) STAMP: short-term attention/memory priority model for session-based recommendation. In SIGKDD, pp. 1831–1839. Cited by: §1, §2, §4.2.
  • [18] M. Ludewig and D. Jannach (2018) Evaluation of session-based recommendation algorithms. User Modeling and User-Adapted Interaction 28 (4-5), pp. 331–390. Cited by: §2, §2, §4.3.
  • [19] M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of Learning and Motivation, Vol. 24, pp. 109–165. Cited by: §1.
  • [20] S. Merity, C. Xiong, J. Bradbury, and R. Socher (2017) Pointer sentinel mixture models. In ICLR, Cited by: §1, §2, §4.5.
  • [21] F. Mi and B. Faltings (2017) Adaptive sequential recommendation for discussion forums on moocs using context trees. In Proceedings of the 10th international conference on educational data mining, Cited by: §2.
  • [22] F. Mi and B. Faltings (2018) Context tree for adaptive session-based recommendation. arXiv preprint arXiv:1806.03733. Cited by: §2, §4.3.
  • [23] A. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes, and J. Weston (2016) Key-value memory networks for directly reading documents. In EMNLP, pp. 1400–1409. Cited by: §3.1.
  • [24] E. Orhan (2018) A simple cache model for image recognition. In NIPS, pp. 10107–10116. Cited by: §1, §2, §3.3.
  • [25] K. Ren, J. Qin, Y. Fang, W. Zhang, L. Zheng, W. Bian, G. Zhou, J. Xu, Y. Yu, X. Zhu, et al. (2019) Lifelong sequential modeling with personalized memorization for user response prediction. In SIGIR, pp. 565–574. Cited by: §2.
  • [26] P. Ren, Z. Chen, J. Li, Z. Ren, J. Ma, and M. de Rijke (2019) RepeatNet: a repeat aware neural recommendation machine for session-based recommendation. In AAAI, Cited by: §2.
  • [27] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme (2010) Factorizing personalized markov chains for next-basket recommendation. In WWW, pp. 811–820. Cited by: §2.
  • [28] G. Shani, D. Heckerman, and R. I. Brafman (2005) An mdp-based recommender system. JMLR 6 (Sep), pp. 1265–1295. Cited by: §2.
  • [29] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, and P. Jiang (2019) BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer. In CIKM, pp. 1441–1450. Cited by: §1, §2.
  • [30] Y. K. Tan, X. Xu, and Y. Liu (2016) Improved recurrent neural networks for session-based recommendations. In

    1st Workshop on Deep Learning for Recommender Systems

    ,
    pp. 17–22. Cited by: §2, §4.1.
  • [31] Z. Tu, Y. Liu, S. Shi, and T. Zhang (2018) Learning to remember translation history with a continuous cache. TACL 6, pp. 407–420. Cited by: §1, §2.
  • [32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §2.
  • [33] S. Wu, Y. Tang, Y. Zhu, L. Wang, X. Xie, and T. Tan (2019) Session-based recommendation with graph neural networks. In AAAI, Cited by: §2, §2, §4.2.

5 Conclusion

In this paper, we study neural methods in a realistic incremental session-based recommendation scenario. We show that existing neural models can be used in this scenario with small incremental updates, and we propose a general method called Memory Augmented Neural recommender (MAN) that can be applied to augment a wide range of existing neural models. MAN uses an efficient nonparametric memory to compute a memory prediction, which is combined with the neural prediction through a lightweight gating network. We show that MAN consistently outperforms state-of-the-art neural and nonparametric methods, and that it is particularly good at learning new items from only a small number of observations.
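To make the combination step concrete, the snippet below gives a minimal sketch, not the implementation evaluated in this paper: it assumes the base neural recommender and the nonparametric memory each produce a score vector over items, and the class and function names (GatingNetwork, combine_predictions) are illustrative placeholders.

```python
# Minimal sketch (not the authors' released code) of the MAN combination step:
# a neural recommender's scores and a nonparametric memory's scores over items
# are mixed by a lightweight gating network conditioned on the session state.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatingNetwork(nn.Module):
    """Lightweight gate mapping the current session state to a mixing weight."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, session_state: torch.Tensor) -> torch.Tensor:
        # g in (0, 1): weight assigned to the neural prediction.
        return torch.sigmoid(self.linear(session_state))


def combine_predictions(neural_scores: torch.Tensor,
                        memory_scores: torch.Tensor,
                        session_state: torch.Tensor,
                        gate: GatingNetwork) -> torch.Tensor:
    """Return a mixture of the neural and memory distributions over items."""
    p_neural = F.softmax(neural_scores, dim=-1)   # (batch, n_items)
    p_memory = F.softmax(memory_scores, dim=-1)   # (batch, n_items)
    g = gate(session_state)                       # (batch, 1), broadcasts
    return g * p_neural + (1.0 - g) * p_memory


# Example with random tensors standing in for a real recommender and memory.
if __name__ == "__main__":
    batch, n_items, hidden = 4, 1000, 64
    gate = GatingNetwork(hidden)
    p = combine_predictions(torch.randn(batch, n_items),
                            torch.randn(batch, n_items),
                            torch.randn(batch, hidden),
                            gate)
    print(p.shape, p.sum(dim=-1))  # mixture probabilities sum to 1 per session
```

Because the memory component is nonparametric, only the small gating network and the base recommender require gradient updates, which is what keeps the incremental updates lightweight.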

References

  • [1] A. Bakhtin, A. Szlam, M. Ranzato, and E. Grave (2018) Lightweight adaptive mixture of neural and n-gram language models. arXiv preprint arXiv:1804.07705. Cited by: §3.3.
  • [2] X. Chen, H. Xu, Y. Zhang, J. Tang, Y. Cao, Z. Qin, and H. Zha (2018) Sequential recommendation with user memory networks. In WSDM, pp. 108–116. Cited by: §2.
  • [3] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.
  • [4] D. Garg, P. Gupta, P. Malhotra, L. Vig, and G. Shroff (2019) Sequence and time aware neighborhood for session-based recommendations: STAN. In SIGIR, pp. 1069–1072. Cited by: §2.
  • [5] E. Grave, M. M. Cisse, and A. Joulin (2017) Unbounded cache model for online language modeling with open vocabulary. In NIPS, pp. 6042–6052. Cited by: §1, §2, §3.3, §3.4, §4.5.
  • [6] E. Grave, A. Joulin, and N. Usunier (2017) Improving neural language models with a continuous cache. In ICLR. Cited by: §2.
  • [7] A. Graves, G. Wayne, and I. Danihelka (2014) Neural Turing machines. arXiv preprint arXiv:1410.5401. Cited by: §2.
  • [8] W. Härdle and O. Linton (1994) Applied nonparametric methods. Handbook of Econometrics 4, pp. 2295–2339. Cited by: §2.
  • [9] R. He and J. McAuley (2016) Fusing similarity models with Markov chains for sparse sequential recommendation. In ICDM, pp. 191–200. Cited by: §2.
  • [10] X. He, H. Zhang, M. Kan, and T. Chua (2016) Fast matrix factorization for online recommendation with implicit feedback. In SIGIR, pp. 549–558. Cited by: §2.
  • [11] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk (2016) Session-based recommendations with recurrent neural networks. In ICLR. Cited by: §1, §2, §4.2, §4.3.
  • [12] B. Hidasi and A. Karatzoglou (2018) Recurrent neural networks with top-k gains for session-based recommendations. In CIKM, pp. 843–852. Cited by: §1, §2, §4.3.
  • [13] D. Jannach and M. Ludewig (2017) When recurrent neural networks meet the neighborhood for session-based recommendation. In RecSys, pp. 306–310. Cited by: §2, §4.2, §4.3.
  • [14] H. Jegou, M. Douze, and C. Schmid (2010) Product quantization for nearest neighbor search. PAMI 33 (1), pp. 117–128. Cited by: §3.4.
  • [15] W. Kang and J. McAuley (2018) Self-attentive sequential recommendation. In ICDM, pp. 197–206. Cited by: §2.
  • [16] J. Li, P. Ren, Z. Chen, Z. Ren, T. Lian, and J. Ma (2017) Neural attentive session-based recommendation. In CIKM, pp. 1419–1428. Cited by: §1, §2, §2, §4.1, §4.2, §4.3.
  • [17] Q. Liu, Y. Zeng, R. Mokhosi, and H. Zhang (2018) STAMP: short-term attention/memory priority model for session-based recommendation. In SIGKDD, pp. 1831–1839. Cited by: §1, §2, §4.2.
  • [18] M. Ludewig and D. Jannach (2018) Evaluation of session-based recommendation algorithms. User Modeling and User-Adapted Interaction 28 (4-5), pp. 331–390. Cited by: §2, §2, §4.3.
  • [19] M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of Learning and Motivation, Vol. 24, pp. 109–165. Cited by: §1.
  • [20] S. Merity, C. Xiong, J. Bradbury, and R. Socher (2017) Pointer sentinel mixture models. In ICLR. Cited by: §1, §2, §4.5.
  • [21] F. Mi and B. Faltings (2017) Adaptive sequential recommendation for discussion forums on MOOCs using context trees. In Proceedings of the 10th International Conference on Educational Data Mining. Cited by: §2.
  • [22] F. Mi and B. Faltings (2018) Context tree for adaptive session-based recommendation. arXiv preprint arXiv:1806.03733. Cited by: §2, §4.3.
  • [23] A. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes, and J. Weston (2016) Key-value memory networks for directly reading documents. In EMNLP, pp. 1400–1409. Cited by: §3.1.
  • [24] E. Orhan (2018) A simple cache model for image recognition. In NIPS, pp. 10107–10116. Cited by: §1, §2, §3.3.
  • [25] K. Ren, J. Qin, Y. Fang, W. Zhang, L. Zheng, W. Bian, G. Zhou, J. Xu, Y. Yu, X. Zhu, et al. (2019) Lifelong sequential modeling with personalized memorization for user response prediction. In SIGIR, pp. 565–574. Cited by: §2.
  • [26] P. Ren, Z. Chen, J. Li, Z. Ren, J. Ma, and M. de Rijke (2019) RepeatNet: a repeat aware neural recommendation machine for session-based recommendation. In AAAI. Cited by: §2.
  • [27] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme (2010) Factorizing personalized Markov chains for next-basket recommendation. In WWW, pp. 811–820. Cited by: §2.
  • [28] G. Shani, D. Heckerman, and R. I. Brafman (2005) An MDP-based recommender system. JMLR 6 (Sep), pp. 1265–1295. Cited by: §2.
  • [29] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, and P. Jiang (2019) BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer. In CIKM, pp. 1441–1450. Cited by: §1, §2.
  • [30] Y. K. Tan, X. Xu, and Y. Liu (2016) Improved recurrent neural networks for session-based recommendations. In 1st Workshop on Deep Learning for Recommender Systems, pp. 17–22. Cited by: §2, §4.1.
  • [31] Z. Tu, Y. Liu, S. Shi, and T. Zhang (2018) Learning to remember translation history with a continuous cache. TACL 6, pp. 407–420. Cited by: §1, §2.
  • [32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §2.
  • [33] S. Wu, Y. Tang, Y. Zhu, L. Wang, X. Xie, and T. Tan (2019) Session-based recommendation with graph neural networks. In AAAI. Cited by: §2, §2, §4.2.
