ADER: Adaptively Distilled Exemplar Replay Towards Continual Learning for Session-based Recommendation

07/23/2020 ∙ by Fei Mi, et al. ∙ EPFL

Session-based recommendation has received growing attention recently due to the increasing privacy concern. Despite the recent success of neural session-based recommenders, they are typically developed in an offline manner using a static dataset. However, recommendation requires continual adaptation to take into account new and obsolete items and users, and requires "continual learning" in real-life applications. In this case, the recommender is updated continually and periodically with new data that arrives in each update cycle, and the updated model needs to provide recommendations for user activities before the next model update. A major challenge for continual learning with neural models is catastrophic forgetting, in which a continually trained model forgets user preference patterns it has learned before. To deal with this challenge, we propose a method called Adaptively Distilled Exemplar Replay (ADER) by periodically replaying previous training samples (i.e., exemplars) to the current model with an adaptive distillation loss. Experiments are conducted based on the state-of-the-art SASRec model using two widely used datasets to benchmark ADER with several well-known continual learning techniques. We empirically demonstrate that ADER consistently outperforms other baselines, and it even outperforms the method using all historical data at every update cycle. This result reveals that ADER is a promising solution to mitigate the catastrophic forgetting issue towards building more realistic and scalable session-based recommenders.




1. Introduction

Due to new privacy regulations that prohibit building user preference models from historical user data, utilizing anonymous short-term interaction data within a browser session becomes popular. Session-based Recommendation (SR) is therefore increasingly used in real-life online systems, such as E-commerce and social media. The goal of SR is to make recommendations based on user behavior obtained in short web browser sessions, and the task is to predict the user’s next actions, such as clicks, views, and even purchases, based on previous activities in the same session.

Despite the recent success of various neural approaches (hidasi2015session; li2017neural; liu2018stamp; kang2018self), they are developed in an offline manner, in which the recommender is trained on a very large static training set and evaluated on a very restrictive testing set in a one-time process. However, this setup does not reflect the realistic use cases of online recommendation systems. In reality, a recommender needs to be periodically updated with new data streaming in, and the updated model is supposed to provide recommendations for user activities before the next update. In this paper, we propose a continual learning setup to consider such realistic recommendation scenarios.

The major challenge of continual learning is catastrophic forgetting (mccloskey1989catastrophic; french1999catastrophic). That is, a neural model updated on new data distributions tends to forget old distributions it has learned before. A naive solution is to retrain the model using all historical data every time. However, it suffers from severe computation and storage overhead in large-scale recommendation applications.

To this end, we propose to store a small set of representative sequences from previous data, namely exemplars, and replay them each time the recommendation model needs to be trained on new data. Methods using exemplars have shown great success in different continual learning (rebuffi2017icarl; castro2018end) and reinforcement learning (schaul2015prioritized; andrychowicz2017hindsight) tasks. In this paper, we propose to select representative exemplars of an item using a herding technique (welling2009herding; rebuffi2017icarl), with an exemplar size proportional to the item frequency in the near past. To enforce a stronger constraint on not forgetting previous user preferences, we propose a regularization method based on the well-known knowledge distillation technique (hinton2015distilling). We propose to apply a distillation loss on the selected exemplars to preserve the model’s knowledge. The distillation loss is further adaptively interpolated with the regular cross-entropy loss on the new data by considering the difference between new data and old ones to flexibly deal with different new data distributions.

Altogether, (1) we are the first to study the practical continual learning setting for the session-based recommendation task; (2) we propose a method called Adaptively Distilled Exemplar Replay (ADER) for this task, and benchmark it with state-of-the-art continual learning techniques; (3) experiment results on two widely used datasets empirically demonstrate the superior performance of ADER and its ability to mitigate catastrophic forgetting. Code is available at:

2. Related Work

2.1. Session-based Recommendation

Session-based recommendation (SR) can be formulated as a sequence learning problem to be solved by recurrent neural networks (RNNs). The first work (GRU4Rec, hidasi2015session) uses a gated recurrent unit (GRU) to learn session representations from previous clicks. Based on GRU4Rec, (hidasi2018recurrent) proposes new ranking losses on relevant sessions, and (tan2016improved) proposes to augment training data. Attention operation is first used by NARM (li2017neural) to pay attention to specific parts of the sequence. Based on NARM, (liu2018stamp) proposes STAMP to model users’ general and short-term interests using two separate attention operations, and (ren2018repeatnet) proposes RepeatNet to predict repetitive actions in a session. Motivated by the recent success of Transformer (vaswani2017attention) and BERT (devlin2018bert) for language model tasks, (kang2018self) proposed SASRec using Transformer, and (sun2019bert4rec) proposed BERT4Rec to model bi-directional information. Despite the broad exploration and success, the above methods are all studied in a static and offline manner. Recently, the incremental and streaming nature of SR was pointed out by (guo2019streaming; mi2020memory).

Besides neural approaches, several non-parametric methods have been proposed. (jannach2017recurrent) proposed SKNN to compare the current session with historical sessions in the training data. Lately, variations (ludewig2018evaluation; garg2019sequence) of SKNN have been proposed to consider the position of items in a session or the timestamp of a past session. (garcin2013personalized; mi2016adaptive; mi2017adaptive; mi2018context) apply a non-parametric structure called context tree. Although these methods can be efficiently updated, the realistic continual learning setting and the corresponding forgetting issue remain to be explored.

2.2. Continual Learning

The major challenge for continual learning is catastrophic forgetting (mccloskey1989catastrophic; french1999catastrophic). Methods designed to mitigate catastrophic forgetting fall into three categories: regularization (li2017learning; kirkpatrick2017overcoming; zenke2017continual), Exemplar Replay (rebuffi2017icarl; chaudhry2019continual; castro2018end) and dynamic architectures (rusu2016progressive; maltoni2019continuous). Methods using dynamic architectures increase model parameters throughout the training process, which leads to an unfair comparison with other methods. In this work, we focus on the first two categories.

Regularization methods add specific regularization terms to consolidate knowledge learned before. (li2017learning) introduces knowledge distillation (hinton2015distilling) to penalize model logit change, and it is widely employed by (rebuffi2017icarl; castro2018end; wu2019large; hou2019learning; zhao2019maintaining). (kirkpatrick2017overcoming; zenke2017continual; aljundi2018memory) propose to penalize changes on parameters that are crucial to old knowledge according to various importance measures. Exemplar Replay methods store past samples, a.k.a. exemplars, and replay them periodically to prevent the model from forgetting previous knowledge. Besides selecting exemplars uniformly, (rebuffi2017icarl) incorporates the Herding technique (welling2009herding) to select exemplars, and it soon became popular (castro2018end; wu2019large; hou2019learning; zhao2019maintaining).

3. Methodology

In this section, we first introduce some background in Section 3.1 and a formulation of the continual learning setup in Section 3.2. In Section 3.3, we propose our method called “Adaptively Distilled Exemplar Replay” (ADER).

3.1. Background on Neural Session-based Recommenders

A user action in SR is a click or view on an item, and the task is to predict the next user action based on a sequence of user actions in the current web-browser session. Existing neural models $f(\theta)$ typically contain two modules: a feature extractor $\phi(x)$ to compute a compact sequence representation of the sequence of previous user actions $x$, and an output layer $\omega(\phi(x))$ to predict the next user action. Various recurrent neural networks (hidasi2015session; hidasi2018recurrent) and attention mechanisms (li2017neural; liu2018stamp; kang2018self) have been proposed for $\phi$, and the common choices for the output layer $\omega$ are fully-connected layers (hidasi2015session) or bi-linear decoders (li2017neural; kang2018self). In this paper, we base our comparison on SASRec (kang2018self), and we refer readers to the original paper for model details. Nevertheless, the techniques proposed and compared in this paper are agnostic to $f(\theta)$; therefore, a more thorough comparison using different $f(\theta)$ is left for future work.

Figure 1. A visualization of the continual learning setup. At each update cycle $t$, the model is trained with data $D_t$, and the updated model $f(\theta_t)$ is evaluated w.r.t. data $D_{t+1}$ before the next model update.

3.2. Formulation of Continual Learning for Session-based Recommendation

In this section, we formulate the continual learning setting for the session-based recommendation task to simulate the realistic use cases of training a recommendation model continually. To be specific, at an update cycle $t$, the recommendation model $f(\theta_{t-1})$ obtained until the last update cycle $t-1$ needs to be updated with new incoming data $D_t$. After $f(\theta_{t-1})$ is trained on $D_t$, the updated model $f(\theta_t)$ is evaluated w.r.t. the incoming data $D_{t+1}$ before the next update cycle $t+1$. A visualization of the continual learning setup is illustrated in Fig. 1, where a recommendation model is continually trained and tested upon receiving data in sequential update cycles.
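The train-then-test protocol described above can be sketched as a simple loop. This is only an illustrative sketch: `continual_evaluation`, `train`, and `evaluate` are hypothetical names that do not come from the paper, and the toy "model" here is just a set of seen items rather than a neural recommender.

```python
from typing import Callable, List, Sequence, Tuple


def continual_evaluation(
    cycles: Sequence[list],
    train: Callable[[object, list], object],
    evaluate: Callable[[object, list], float],
    initial_model: object = None,
) -> List[Tuple[int, float]]:
    """At each cycle t, update the model on D_t, then score it on D_{t+1}."""
    model = initial_model
    scores = []
    for t in range(len(cycles) - 1):
        model = train(model, cycles[t])  # update the previous model with D_t
        scores.append((t + 1, evaluate(model, cycles[t + 1])))  # test on D_{t+1}
    return scores


# Toy stand-ins (assumptions for illustration only).
def toy_train(model, data):
    seen = set(model or ())
    seen.update(data)
    return seen


def toy_eval(model, data):
    # Fraction of actions in the next cycle that touch already-seen items.
    return sum(1 for item in data if item in model) / len(data)


cycles = [["a", "b"], ["a", "c"], ["c", "d", "a", "b"]]
print(continual_evaluation(cycles, toy_train, toy_eval))
```

Note that the first cycle is only used for training and the last cycle only for testing, which matches the way the datasets are split in the experiments.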

3.3. Proposed Solution: Adaptively Distilled Exemplar Replay (ADER)

3.3.1. Exemplar Replay

To alleviate the widely recognized catastrophic forgetting issue in continual learning, the model needs to preserve old knowledge it has learned before. To this end, we propose to store past samples, a.k.a. exemplars, and replay them periodically to preserve previous knowledge. To maintain a manageable memory footprint, we only store a fixed total number of exemplars throughout the entire continual learning process. Two decisions need to be made at each cycle $t$: (1) how many exemplars should be stored for each item/label? (2) what is the criterion for selecting exemplars of an item/label?

First, we design the number of exemplars of each appeared item in $I_t$ (i.e., the set of appeared items until cycle $t$) to be proportional to its appearance frequency. In other words, more frequent and popular items contribute a larger portion of selected exemplars to be replayed to the next cycle. Suppose we store $N$ exemplars in total; the number of exemplars at cycle $t$ for an item $i \in I_t$ is:

$$m_{t,i} = N \cdot \frac{|\{(x, y) \in D_t \cup E_{t-1} : y = i\}|}{|D_t \cup E_{t-1}|} \tag{1}$$

where the second term is the probability that item $i$ appears in the current update cycle, as well as in the exemplars $E_{t-1}$ we kept from the last cycle. Therefore, the exemplar sizes of different items to be selected in cycle $t$ can be encoded as a vector $M_t = [m_{t,1}, m_{t,2}, \ldots, m_{t,|I_t|}]$.

Second, we need to decide which samples to select as exemplars for each item. We propose to use a herding technique (welling2009herding; rebuffi2017icarl) to select the most representative sequences of an item in an iterative manner based on the distance to the mean feature vector of the item. In each iteration, one sample from $D_t \cup E_{t-1}$ that best approximates the average feature vector ($\mu$) over all training examples of this item ($y$) is selected to $E_t$. The details are presented in Algorithm 1.
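The two decisions above (how many exemplars per item, and which ones) can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, feature vectors are assumed to be precomputed lists of floats, fractional budgets are rounded down, and an already selected sample is not selected again. None of these three choices is specified in the text.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

Vector = Sequence[float]


def exemplar_sizes(labels: Sequence[int], total: int) -> Dict[int, int]:
    """Split a budget of `total` exemplars proportionally to label frequency.

    `labels` holds the target item of every sample in the current data plus
    the exemplars kept from the previous cycle.
    """
    counts = Counter(labels)
    n = len(labels)
    # Assumption: round down; the text does not say how fractions are handled.
    return {item: (total * c) // n for item, c in counts.items()}


def herding_select(features: List[Vector], m: int) -> List[int]:
    """Greedily pick up to m indices whose running mean tracks the overall mean."""
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    chosen: List[int] = []
    running = [0.0] * dim  # sum of the feature vectors selected so far
    for k in range(1, min(m, len(features)) + 1):
        best_idx, best_dist = -1, math.inf
        for idx, f in enumerate(features):
            if idx in chosen:  # assumption: select without replacement
                continue
            dist = math.sqrt(
                sum((mean[d] - (running[d] + f[d]) / k) ** 2 for d in range(dim))
            )
            if dist < best_dist:
                best_idx, best_dist = idx, dist
        chosen.append(best_idx)
        running = [running[d] + features[best_idx][d] for d in range(dim)]
    return chosen


print(exemplar_sizes([1, 1, 1, 2], total=4))               # {1: 3, 2: 1}
print(herding_select([[0.0], [4.0], [5.0], [11.0]], m=2))  # [2, 1]
```

In the second call the mean feature is 5.0, so the sample with feature 5.0 is picked first; the sample with feature 4.0 comes next because it keeps the running mean (4.5) closest to 5.0.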

3.3.2. Adaptive Distillation on Exemplars

The number of exemplars should be reasonably small to reduce memory overhead. As a consequence, the constraint to prevent the recommender from forgetting previous user preference patterns is not strong enough. To enforce a stronger constraint on not forgetting old user preference patterns, we propose to use a knowledge distillation loss (hinton2015distilling) on exemplars to better consolidate old knowledge.

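Algorithm 1 ADER: ExemplarSelection at cycle $t$
  Input: $S = D_t \cup E_{t-1}$; $M_t = [m_{t,1}, \ldots, m_{t,|I_t|}]$
  for $y = 1, \ldots, |I_t|$ do
     $P_y \leftarrow \{x : (x, y) \in S\}$
     $\mu \leftarrow \frac{1}{|P_y|} \sum_{x \in P_y} \phi(x)$
     for $k = 1, \ldots, m_{t,y}$ do
        $x^k \leftarrow \arg\min_{x \in P_y} \| \mu - \frac{1}{k} [\phi(x) + \sum_{j=1}^{k-1} \phi(x^j)] \|$
     end for
     $E_y \leftarrow \{(x^1, y), \ldots, (x^{m_{t,y}}, y)\}$
  end for
  Output: exemplar set $E_t = \bigcup_{y=1}^{|I_t|} E_y$

Algorithm 2 ADER: UpdateModel at cycle $t$
  Input: $D_t$, $E_{t-1}$, $I_t$, $I_{t-1}$
  Initialize $\theta_t$ with $\theta_{t-1}$
  while $\theta_t$ not converged do
     Train $\theta_t$ with loss in Eq. (4)
  end while
  Compute $E_t$ using Algorithm 1 with $\theta_t$ and $M_t$ computed by Eq. (1)
  Output: updated $\theta_t$ and new exemplar set $E_t$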

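At a cycle $t$, the set of exemplars to be replayed is $E_{t-1}$ and the set of items till the last cycle is $I_{t-1}$; the proposed knowledge distillation (KD) loss is written as:

$$L_{KD}(\theta_t) = -\frac{1}{|E_{t-1}|} \sum_{(x,y) \in E_{t-1}} \sum_{i=1}^{|I_{t-1}|} \hat{p}_i \cdot \log(p_i) \tag{2}$$

where $[\hat{p}_1, \ldots, \hat{p}_{|I_{t-1}|}]$ is the predicted distribution (softmax of logits) over $I_{t-1}$ generated by $f(\theta_{t-1})$, and $[p_1, \ldots, p_{|I_{t-1}|}]$ is the prediction of $f(\theta_t)$ over $I_{t-1}$. $L_{KD}$ measures the difference between the outputs of the previous model and the current model on exemplars, and the idea is to penalize prediction changes on items in previous update cycles.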
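As a concrete illustration of a distillation loss of this kind, the sketch below computes the cross-entropy between the previous model's softmax output (the teacher) and the current model's softmax output (the student) on a batch of exemplars. The function names are hypothetical, and restricting the student's softmax to the logits of the old items is an assumption made here; the text only says that both predictions are taken over the old item set.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(
    old_logits: List[Sequence[float]], new_logits: List[Sequence[float]]
) -> float:
    """Mean cross-entropy between old-model and new-model predictions.

    old_logits[j] has one entry per old item; new_logits[j] may be longer
    because the current model also scores newly appeared items.
    """
    loss = 0.0
    for old, new in zip(old_logits, new_logits):
        n_old = len(old)          # number of items known to the previous model
        p_hat = softmax(old)      # teacher distribution
        p = softmax(new[:n_old])  # student, restricted to old items (assumption)
        loss -= sum(ph * math.log(pi) for ph, pi in zip(p_hat, p))
    return loss / len(old_logits)


unchanged = kd_loss([[2.0, 0.0]], [[2.0, 0.0, 1.0]])
flipped = kd_loss([[2.0, 0.0]], [[0.0, 2.0, 1.0]])
print(unchanged < flipped)  # the loss grows when predictions on old items drift
```

When the student reproduces the teacher exactly, the loss reduces to the entropy of the teacher distribution, which is its minimum for a fixed teacher.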

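$L_{KD}$ defined above is interpolated with a regular cross-entropy (CE) loss computed w.r.t. $D_t$, defined below:

$$L_{CE}(\theta_t) = -\frac{1}{|D_t|} \sum_{(x,y) \in D_t} \sum_{i=1}^{|I_t|} \delta_{i=y} \cdot \log(p_i) \tag{3}$$

where $\delta_{i=y}$ equals 1 if $i = y$ and 0 otherwise.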
In practice, the size of incoming data and the number of new items vary across cycles; therefore, the degree to which old knowledge needs to be preserved also varies. To this end, we propose an adaptive weight $\lambda_t$ to combine $L_{KD}$ with $L_{CE}$:

$$L_{ADER} = L_{CE} + \lambda_t \cdot L_{KD}, \quad \lambda_t = \lambda_{base} \sqrt{\frac{|I_{t-1}|}{|I_t|} \cdot \frac{|E_{t-1}|}{|D_t|}} \tag{4}$$

In general, $\lambda_t$ increases when the ratio of the number of old items to that of new items increases, and when the ratio of the exemplar size to the current data size increases. The idea is to rely more on $L_{KD}$ when the new cycle contains fewer new items or fewer data to be learned. The overall training procedure of ADER is summarized in Algorithm 2.
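The behaviour of the adaptive weight can be illustrated numerically. The sketch below assumes the weight is a base value scaled by the square root of the product of the two ratios named above (old items over all items, and exemplar count over new-data count); the function and parameter names are hypothetical, and the numbers are made up for illustration.

```python
import math


def adaptive_weight(
    lambda_base: float,
    n_old_items: int,
    n_items: int,
    n_exemplars: int,
    n_new_samples: int,
) -> float:
    """Weight on the distillation loss; assumed square-root form."""
    item_ratio = n_old_items / n_items        # share of items that are old
    data_ratio = n_exemplars / n_new_samples  # exemplars relative to new data
    return lambda_base * math.sqrt(item_ratio * data_ratio)


def ader_loss(ce: float, kd: float, weight: float) -> float:
    """Combine cross-entropy on new data with weighted distillation on exemplars."""
    return ce + weight * kd


# Few new items and little new data: lean more on distillation.
stable = adaptive_weight(0.8, n_old_items=990, n_items=1000,
                         n_exemplars=30000, n_new_samples=30000)
# Many new items and a lot of new data: lean less on distillation.
dynamic = adaptive_weight(0.8, n_old_items=800, n_items=1000,
                          n_exemplars=30000, n_new_samples=80000)
print(stable > dynamic)
```

With this form, a cycle that introduces no new items and has as many new samples as exemplars yields exactly the base weight.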

4. Experiments

4.1. Dataset

Two widely used datasets are adopted: (1) DIGINETICA: This dataset contains click-stream data on an e-commerce site over 5 months, and it was used for CIKM Cup 2016. (2) YOOCHOOSE: It is another dataset, used by RecSys Challenge 2015, for predicting click-streams on another e-commerce site over 6 months.

As in (hidasi2015session; li2017neural; liu2018stamp; kang2018self), we remove sessions of length 1 and items that appear fewer than 5 times. To simulate the continual learning scenario, we split the model update cycles of DIGINETICA by weeks and those of YOOCHOOSE by days, as its volume is much larger. Different time spans also resemble model update cycles at different granularities. In total, 16 update cycles are used to continually train the recommender on both datasets. 10% of the training data of each update cycle is randomly selected as a validation set. Statistics of the split datasets are summarized in Table 1. We can see that YOOCHOOSE is less dynamic, as indicated by the tiny fraction of actions on new items; that is, old items heavily reappear.
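The preprocessing and the "new actions" statistic reported in Table 1 can be sketched as below. This is a hedged illustration: the function names are hypothetical, sessions are assumed to be lists of item ids, and the filtering order (rare items first, then short sessions, in a single pass) is an assumption that the text does not spell out.

```python
from collections import Counter
from typing import Hashable, List, Set

Session = List[Hashable]


def filter_sessions(
    sessions: List[Session], min_item_count: int = 5, min_session_len: int = 2
) -> List[Session]:
    """Drop rare items, then drop sessions that become shorter than 2 actions."""
    counts = Counter(item for s in sessions for item in s)
    kept = []
    for s in sessions:
        s = [item for item in s if counts[item] >= min_item_count]
        if len(s) >= min_session_len:
            kept.append(s)
    return kept


def new_action_ratio(sessions: List[Session], seen_items: Set[Hashable]) -> float:
    """Fraction of actions in this cycle that touch items unseen in earlier cycles."""
    actions = [item for s in sessions for item in s]
    return sum(1 for item in actions if item not in seen_items) / len(actions)


raw = [[1, 2], [1, 2], [1, 2], [1, 2], [1, 2, 3], [4]]
print(filter_sessions(raw))                       # five sessions, each [1, 2]
print(new_action_ratio([[1, 2], [3, 1]], {1}))    # 0.5
```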


DIGINETICA (update cycles by week)
week            0        1        2        3        4        5        6        7        8
total actions   70,739   37,586   31,089   32,687   30,419   57,913   52,225   57,100   69,042
new actions     100.00%  18.25%   13.26%   11.29%   10.12%   9.08%    6.64%    6.35%    5.42%
week            9        10       11       12       13       14       15       16       Total
total actions   82,834   82,935   50,037   63,133   70,050   71,670   56,959   77,065   993,483
new actions     5.22%    3.02%    3.01%    1.78%    1.83%    0.78%    0.45%    0.27%    /

YOOCHOOSE (update cycles by day)
day             0         1         2         3         4         5         6         7         8
total actions   219,389   209,219   218,162   162,637   177,943   307,603   232,887   178,076   199,615
new actions     100.00%   3.04%     1.74%     1.29%     0.95%     0.57%     0.50%     1.09%     0.74%
day             9         10        11        12        13        14        15        16        Total
total actions   179,889   123,750   153,565   300,830   259,673   187,348   154,316   105,676   3,370,578
new actions     0.81%     1.08%     0.56%     0.56%     0.29%     0.41%     0.38%     0.35%     /
Table 1. Statistics of the two datasets; “new actions” indicate the percentage of actions on new items in this update cycle; week/day 0 is only used for training, while week/day 16 is only used for testing.

4.2. Evaluation Metrics

Two commonly used evaluation metrics are used: (1) Recall@k: the fraction of test cases in which the desired item is among the top-k recommended items. (2) MRR@k: Recall@k does not consider the order of the items recommended, while MRR@k measures the mean reciprocal rank of the desired items in the top-k recommended items. For easier comparison, we report the mean value of these two metrics averaged over all 16 update cycles.
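Both metrics can be computed in a few lines. The sketch below assumes each test case consists of one ranked list of recommended item ids and one desired item; the function names are illustrative only. A desired item that falls outside the top-k contributes 0 to both metrics.

```python
from typing import Hashable, List, Sequence


def recall_at_k(
    ranked_lists: Sequence[List[Hashable]], targets: Sequence[Hashable], k: int
) -> float:
    """Fraction of test cases whose desired item is within the top-k."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets)
               if target in ranked[:k])
    return hits / len(targets)


def mrr_at_k(
    ranked_lists: Sequence[List[Hashable]], targets: Sequence[Hashable], k: int
) -> float:
    """Mean reciprocal rank of the desired item, counting 0 outside the top-k."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top = ranked[:k]
        if target in top:
            total += 1.0 / (top.index(target) + 1)  # ranks are 1-based
    return total / len(targets)


ranked = [[3, 1, 2], [5, 4, 6]]
targets = [1, 6]
print(recall_at_k(ranked, targets, k=2))  # 0.5: only the first target is in its top-2
print(mrr_at_k(ranked, targets, k=2))     # 0.25: rank 2 for the first, miss for the second
```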

4.3. Baseline Methods

Several widely adopted baselines in continual learning literature are compared:


  • Finetune: At each cycle, the recommender trained till the last task is trained with the data from the current task.

  • Dropout (mirzadeh2020dropout): Dropout (hinton2012improving) was recently found by (mirzadeh2020dropout) to effectively alleviate catastrophic forgetting. Based on Finetune, we apply dropout to every self-attention and feed-forward layer.

  • EWC (kirkpatrick2017overcoming): It is a well-known method to alleviate forgetting by regularizing parameters important to previous data, with importance estimated by the diagonal of a Fisher information matrix computed w.r.t. exemplars.

  • ADER (c.f. Algorithm 2): The proposed method using adaptively distilled exemplars in each cycle with dropout.

  • Joint: In each cycle, the recommender is trained (with dropout) using data from the current and all historical cycles. This is a common performance “upper bound” for continual learning.

The above methods are applied on top of the state-of-the-art base SR recommender SASRec (kang2018self) using 150 hidden units and 2 stacked self-attention blocks. During continual training, we set the batch size to 256 on DIGINETICA and 512 on YOOCHOOSE. We use the Adam optimizer with a learning rate of 5e-4. Training runs for up to 100 epochs, and early stopping is applied if validation performance (Recall@20) does not improve for 5 consecutive epochs. Other hyper-parameters are tuned to maximize Recall@20. The dropout rate of Dropout, ADER, and Joint is set to 0.3; 30,000 exemplars are used by default for EWC and ADER; $\lambda_{base}$ of ADER is set to 0.8 on DIGINETICA and 1.0 on YOOCHOOSE.
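The early-stopping rule described above (at most 100 epochs, stop after 5 consecutive epochs without improvement in validation Recall@20) can be sketched as follows. `run_epoch` is a hypothetical callback that trains for one epoch and returns the validation score; it is not part of any library or of the authors' code.

```python
from typing import Callable, Tuple


def train_with_early_stopping(
    run_epoch: Callable[[int], float], max_epochs: int = 100, patience: int = 5
) -> Tuple[int, float]:
    """Return (best_epoch, best_score) under a patience-based stopping rule."""
    best_score, best_epoch, stale = float("-inf"), -1, 0
    for epoch in range(max_epochs):
        score = run_epoch(epoch)  # train one epoch, then validate
        if score > best_score:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_score


# Made-up validation curve: the late jump to 0.9 is never reached because
# training stops after five stale epochs following the peak at epoch 2.
curve = [0.1, 0.2, 0.3, 0.25, 0.25, 0.25, 0.25, 0.25, 0.9]
print(train_with_early_stopping(lambda e: curve[e], max_epochs=len(curve)))
```

In practice one would also keep a copy of the model parameters from the best epoch; that bookkeeping is omitted here for brevity.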

4.4. Overall Results on Two Datasets

DIGINETICA
            Finetune  Dropout  EWC     Joint   ADER
Recall@20   47.28%    49.07%   47.66%  50.03%  50.21%
Recall@10   35.00%    36.53%   35.48%  37.27%  37.52%
MRR@20      16.01%    16.86%   16.28%  17.31%  17.32%
MRR@10      15.16%    16.00%   15.44%  16.43%  16.45%

YOOCHOOSE
            Finetune  Dropout  EWC     Joint   ADER
Recall@20   71.86%    72.20%   71.91%  72.22%  72.38%
Recall@10   63.82%    64.15%   63.89%  64.16%  64.41%
MRR@20      36.49%    36.60%   36.53%  36.65%  36.71%
MRR@10      35.92%    36.03%   35.97%  36.08%  36.14%
Table 2. Performance averaged over 16 continual update cycles on two datasets.

Results averaged over 16 update cycles are presented in Table 2, and several interesting observations can be noted:


  • Finetune already works reasonably well, especially on the less dynamic YOOCHOOSE dataset. The performance gap between Finetune and Joint is smaller than in typical continual learning setups (rebuffi2017icarl; li2017learning; wu2019large; hou2019learning). The reason is that catastrophic forgetting is not severe, since old items can frequently reappear in recommendation tasks.

  • EWC only outperforms Finetune marginally, and it performs worse than Dropout.

  • Dropout is effective, and it notably outperforms Finetune, especially on the more dynamic DIGINETICA dataset.

  • ADER significantly outperforms other methods, and the improvement margin over other methods is larger on the more dynamic DIGINETICA dataset. Furthermore, it even outperforms Joint. This result empirically reveals that ADER is a promising solution for the continual recommendation setting by effectively preserving user preference patterns learned before.

Detailed disentangled performance at each update cycle is plotted in Figure 2. We can see that the advantage of ADER is significant on the more dynamic DIGINETICA dataset. On the less dynamic YOOCHOOSE dataset, the gain of ADER mainly comes from the more dynamic starting cycles with relatively more actions on new items. At later stable cycles with few new items, different methods show comparable performance, including the vanilla Finetune.

Figure 2. Disentangled Recall@20 (Top) and MRR@20 (Bottom) at each continual learning update cycle on two datasets.

4.5. In-depth Analysis

In the following experiments, we conduct an in-depth analysis of the results on the more dynamic DIGINETICA dataset.

4.5.1. Different number of Exemplars

We studied the effect of a varying number of exemplars for ADER. Besides using 30k exemplars, we tested using only 10k/20k exemplars, and results are shown in Table 3. We can see that the performance of ADER only drops marginally as exemplar size decreases from 30k to 10k. This result reveals that ADER is insensitive to the number of exemplars, and it works reasonably well with a smaller number of exemplars.

4.5.2. Ablation Study

In this experiment, we compared ADER to several simplified versions to justify our design choices. (i) ER_herding: A vanilla exemplar replay that differs from ADER by using a regular $L_{CE}$, rather than $L_{KD}$, on exemplars. (ii) ER_random: It differs from ER_herding by selecting exemplars of an item at random. (iii) ER_loss: It differs from ER_herding by selecting exemplars of an item with the smallest $L_{CE}$. (iv) ADER_equal: This version differs from ADER by selecting an equal number of exemplars for each item; that is, the assumption that more frequent items should be stored more is removed. (v) ADER_fix: This version differs from ADER by not using the adaptive $\lambda_t$ in Eq. (4), but a fixed $\lambda$.

Comparison results are presented in Table 4, and several observations can be noted: (1) Herding is effective for selecting exemplars, as indicated by the better performance of ER_herding over ER_random and ER_loss. (2) The distillation loss in Eq. (2) is helpful, as indicated by the better performance of the three versions of ADER over the three vanilla ER methods. (3) Selecting exemplars proportional to item frequency is helpful, as indicated by the better performance of ADER over ADER_equal. (4) The adaptive $\lambda_t$ in Eq. (4) is helpful, as indicated by the better performance of ADER over ADER_fix.

            10k      20k      30k
Recall@20   49.59%   50.05%   50.21%
Recall@10   36.92%   37.40%   37.52%
MRR@20      17.04%   17.29%   17.32%
MRR@10      16.17%   16.42%   16.45%
Table 3. Different exemplar sizes for ADER.

            ER_random  ER_loss  ER_herding  ADER_equal  ADER_fix  ADER
Recall@20   49.14%     49.31%   49.34%      49.92%      50.09%    50.21%
Recall@10   36.61%     36.65%   36.78%      37.21%      37.41%    37.52%
MRR@20      16.79%     16.90%   16.85%      17.23%      17.29%    17.32%
MRR@10      15.92%     16.02%   16.98%      16.35%      16.41%    16.45%
Table 4. Ablation study for ADER.

5. Conclusion

In this paper, we studied the practical and realistic continual learning setting for session-based recommendation tasks. To prevent the recommender from forgetting user preferences learned before, we propose ADER, which replays carefully chosen exemplars from previous cycles with an adaptive distillation loss. Experiment results on two widely used datasets empirically demonstrate the promising performance of ADER. Our work may inspire researchers working from a similar continual learning perspective to build more robust and scalable recommenders.