Modeling the Past and Future Contexts for Session-based Recommendation

06/11/2019 ∙ by Yuan Fajie, et al. ∙ Northeastern University ∙ Tencent

Long session-based recommender systems have attracted much attention recently. A single user may generate hundreds of click behaviors in a short time. To learn long-session item dependencies, previous sequential recommendation models resort either to data augmentation or to a left-to-right autoregressive training approach. While effective, an obvious drawback is that future user behaviors are always missing during training. In this paper, we claim that users' future action signals can be exploited to boost recommendation quality. To model both past and future contexts, we investigate three augmentation techniques from both data and model perspectives. Moreover, we carefully design two general neural network architectures: a pretrained two-way neural network model and a deep contextualized model trained on a text gap-filling task. Experiments on four real-world datasets show that our proposed two-way neural network models achieve competitive or even much better results. Empirical evidence confirms that modeling both past and future contexts is a promising way to offer better recommendation accuracy.




1. Introduction

Session-based recommender systems (SRS) (Hidasi et al., 2015) have become an emerging topic in the recommendation domain, where a recommender predicts the next item based on a history of observed (e.g., clicked, bought) items within a user session. While recent advances in deep neural networks (Quadrana et al., 2017; Tuan and Phuong, 2017; Li et al., 2017; Tang and Wang, 2018; Kang and McAuley, 2018; Yuan et al., 2019) have led to promising approaches to model the user interest distribution in sessions, one difficult task is to represent long-range user sessions (Yuan et al., 2019). In practice, however, long sessions are very common in scenarios such as short video, image or news recommendation. For example, users may view several hundred short videos within hours on TikTok, as the average running time of each video is only 15 seconds.

In early literature (Hidasi et al., 2015), it is common to build SRS by simply predicting the last item based on previous user behaviors in the session. Though this straightforward approach may provide reasonable recommendations in the short-range SRS scenario, it is largely unsatisfactory when modeling relatively long session data. A possible reason is that this way of modeling inevitably ignores internal item dependencies in the session. To address this issue, most recommenders resort to data augmentation techniques (Tan et al., 2016) to enhance the model when handling long sessions. The prevalent data augmentation method is to generate new subsessions by using prefixes of the original input (Tuan and Phuong, 2017; Tan et al., 2016; Tang and Wang, 2018), as illustrated in Figure 1 (a). While effective, this way of data augmentation has obvious side effects, since the generated subsessions may break the data integrity of the user's clicking distribution (Yuan et al., 2019). Moreover, the large number of new subsessions increases training time compared with training on only the input session (Quadrana et al., 2017). The other effective approach is to model the distribution of the entire user session instead of only the desired item. That is, the probabilistic distribution of each item in the user session is modeled conditioned on all of its left (i.e., past) context, as shown in Figure 1 (b). This idea results in a typical autoregressive generative model, referred to as NextItNet (Yuan et al., 2019), which achieves state-of-the-art recommendation accuracy for long session-based recommender systems with a dilated CNN architecture.

(a) Data augmentation
(b) Model augmentation
Figure 1. Two ways of augmentation in SRS. The numbers represent observed item IDs in each user session. “0” is the padding token. The red token represents future items predicted by SRS. (a) Typical data augmentation approach, with new training subsessions created by splitting the original input session. Clearly, for a 50-length original input, this processing can produce a 50 times larger training dataset. (b) The prediction output is determined only by previous timesteps. E.g., item 4 is predicted by items 1, 2 and 3, which achieves the same effect as session-1 in (a). The overall loss in (b) can be regarded as the sum of the separate losses of the original input and the subsessions in (a). In fact, it can also be viewed as a form of data augmentation from the model perspective.

In fact, the right (i.e., future) context of most predicted items (except the last one) in the user session is always available during training. It is intuitive that a user's future actions also have certain connections with the current clicking behaviors. Hence, it is reasonable to believe that leveraging the right context appropriately during training is very likely to help improve existing SRS. Motivated by this, we investigate several strategies to incorporate the right context from both data and model perspectives. Empirically, we first show that simple data augmentation by reversing the input session is not sufficient to solve this problem. We then develop deep contextualized models that take into account both the past and future context during training. The main difficulty is that a standard deep two-way neural network causes an information leakage problem for the SRS task, since higher-layer neurons are able to ‘peep’ at the predicted items through their connections. This leads to ineffective training since the ‘answer’ is already known by the model. To overcome this issue, we present two carefully designed optimization solutions: one is the combination of two independently trained unidirectional objective functions on the basis of NextItNet, while the other is trained by a real two-way objective function with a novel blank-masking technique inspired by the gap-filling task in language examinations. For the latter, we also build a dilated CNN architecture to model long-range user sessions and fit the proposed training objective. Extensive experiments on several real-world datasets show that our proposed approaches generally perform better than typical data augmentation methods and NextItNet-style learning algorithms.

2. Preliminaries

First, the problem of session-based recommendation is described. Then, two state-of-the-art temporal CNN models are briefly recapitulated. Finally, we review previous work on session-based recommender systems. The main claim of this section is that the existing left-to-right data or model augmentation methods are insufficient for better recommendations.

2.1. Top-$N$ Session-based Recommendation

The formulation of top-$N$ session-based recommendation in this paper closely follows that in NextItNet (Yuan et al., 2019). In SRS, a “session” is defined as a collection of items (referring to any objects, e.g., videos, songs or queries) that happened at one time or in a certain period of time (Li et al., 2017; Wang et al., 2019). For instance, both a list of browsed webpages and a collection of watched videos consumed in an hour by a user can be regarded as a session. Formally, let $\{x_1, x_2, \ldots, x_{t-1}, x_t\}$ be a set of items in a user session, where $x_i$ ($1 \le i \le t$) denotes the index of the clicked item out of a total number of $t$ items in the session. The task of SRS is to train a model such that for a given prefix session data, $\{x_1, \ldots, x_i\}$ ($1 \le i < t$), it generates the distribution $y$ for candidate items, where $y = [y_1, \ldots, y_n] \in \mathbb{R}^n$ and $n$ is the number of candidate items. $y_j$ represents the probability that item $j$ will occur in the future clicking event. In practice, SRS typically makes more than one recommendation by selecting the top-$N$ (e.g., $N = 5$) items from $y$, referred to as the top-$N$ session-based recommendations (Tan et al., 2016).

2.2. Issues of NextItNet-style Algorithms

In this section, we mainly review typical session-based recommendation models that share a similar left-to-right style, including but not limited to Improved GRURec (Tan et al., 2016) (GRURec for short), Caser (Tang and Wang, 2018) and NextItNet (Yuan et al., 2019). Among these models, GRURec and Caser fall in the line of standard data augmentation methods following Figure 1 (a), while NextItNet is a typical autoregressive generative model following Figure 1 (b).

2.2.1. Data Augmentation

The authors of Improved GRURec proposed a generic data augmentation (DA) method to improve recommendation quality, which has been applied in a majority of subsequent work, such as (Tang and Wang, 2018; Tuan and Phuong, 2017; Li et al., 2017). The basic idea of DA in SRS is to treat all prefixes in the user session as new training sequences (Hidasi et al., 2015). Specifically, for a given user session $\{x_1, \ldots, x_{t-1}, x_t\}$, the DA method generates a collection of sequences and target labels $\{(x_2 \mid x_1), (x_3 \mid x_1, x_2), \ldots, (x_t \mid x_1, x_2, \ldots, x_{t-1})\}$. Following this processing, the model learns all conditional relations instead of only the one between the last item $x_t$ and the prefix sequence $\{x_1, x_2, \ldots, x_{t-1}\}$. Because the additional subsessions carry more information, DA is an effective way to reduce overfitting, especially when the user session is long and the user-item matrix is sparse. Even though the DA method has been successfully applied in numerous SRS works, it may break the integrity of the original input distribution (Yuan et al., 2019).
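As a minimal sketch of the prefix-based DA described above (not the authors' code; the function name `prefix_augment` is hypothetical), every prefix of a session becomes a training sequence whose label is the item that follows it:

```python
from typing import List, Tuple


def prefix_augment(session: List[int]) -> List[Tuple[List[int], int]]:
    """Return one (prefix, target) pair per prefix of the session.

    For a session of length t this yields t - 1 training examples,
    which is why DA multiplies the size of the training set.
    """
    return [(session[:i], session[i]) for i in range(1, len(session))]


# Illustrative session of item IDs (made up for this sketch).
pairs = prefix_augment([1, 2, 3, 4])
for prefix, target in pairs:
    print(prefix, "->", target)
```

Each generated pair is trained as an independent example, so the relation between subsessions of the same original session is not visible to the model.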

2.2.2. Model Augmentation

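To overcome the above-mentioned suboptimality, NextItNet-style learning methods (Yuan et al., 2019; Kang and McAuley, 2018) propose optimizing all positions of the original input instead of only the final index. Specifically, the generative model of NextItNet uses $\{x_1, \ldots, x_{t-1}\}$ (or $x_{1:t-1}$) as the input and outputs distributions over $x_{2:t}$, where $x_t$ is the desired item. Mathematically, the joint distribution of a user session $\{x_1, \ldots, x_{t-1}, x_t\}$ can be factorized as a product of conditional distributions following the chain rule:

$$p(x) = \prod_{i=1}^{t} p(x_i \mid x_1, \ldots, x_{i-1}; \Theta) \qquad (1)$$

where $p(x_i \mid x_1, \ldots, x_{i-1}; \Theta)$ denotes the probability of the $i$-th item $x_i$ conditioned on all of its prefix $x_{1:i-1}$, and $\Theta$ denotes the parameters. For clarity, the difference between data augmentation (DA) and model augmentation (MA) is shown as follows:

$$\begin{aligned} \text{DA:}\quad & \{x_1\} \Rightarrow x_2,\ \ \{x_1, x_2\} \Rightarrow x_3,\ \ \ldots,\ \ \{x_1, \ldots, x_{t-1}\} \Rightarrow x_t \\ \text{MA:}\quad & \{x_1, \ldots, x_{t-1}\} \Rightarrow \{x_2, \ldots, x_t\} \end{aligned} \qquad (2)$$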
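The left-to-right model augmentation can be illustrated with a short sketch. It assumes that the per-position probabilities of the true next items are already available from some model; the helper names `shift_targets` and `autoregressive_nll` are hypothetical and only show how the chain-rule factorization turns into a sum of per-position losses.

```python
import math
from typing import List, Sequence, Tuple


def shift_targets(session: List[int]) -> Tuple[List[int], List[int]]:
    """Input is x_1..x_{t-1}; the target at each position is the next item."""
    return session[:-1], session[1:]


def autoregressive_nll(step_probs: Sequence[float]) -> float:
    """Negative log-likelihood of a session under the chain rule.

    step_probs[i] is assumed to be the model probability of the true item
    at position i given its left context, so log p(x) is the sum of logs.
    """
    return -sum(math.log(p) for p in step_probs)


inputs, targets = shift_targets([1, 2, 3, 4])
print(inputs, "=>", targets)
print(autoregressive_nll([0.5, 0.25, 0.5]))
```

One forward pass therefore covers every prefix that DA would have materialized as a separate training example.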


2.2.3. Issues of Modeling from Left-to-Right

As shown, the above optimization approaches simply train the user session from left to right. Although this matches the prediction scenario in practice, the right context is also available during training. For instance, with the given user session $\{x_1, \ldots, x_t\}$, when $x_i$ ($i < t$) is predicted in the training session, the future clicked items $\{x_{i+1}, \ldots, x_t\}$ are already there in the input sequence. Even though $x_i$ is not directly determined by $x_{i+1}$, it is very likely to be the reason why $x_{i+1}$ occurs in the future, since $x_{i+1}$ is conditioned on $x_i$. Moreover, leveraging the future context can be regarded as a way of data augmentation that helps models alleviate the well-known sparsity problem in SRS. Correspondingly, it is reasonable to argue that the right (i.e., future) context of the currently clicked item may have certain causal connections with the item itself. Hence, we believe it is worth investigating the impact on recommendation models of fusing the right context during training. However, two key concerns arise if NextItNet-style algorithms model the future context. One concern is that the right context is unavailable in the prediction phase, which results in a mismatch between training and the final generating task. The other concern is that higher-layer neurons are able to see the predicted item through the network connections. This incurs ineffective training since the ‘answer’ is already known. Motivated by these two difficulties, we propose two distinct optimization methods: one is built on the combination of two pretrained unidirectional NextItNets, while the other is based on a well-designed blank-filling mask borrowed from the gap-filling task in language examinations.

2.3. Related Work

Recently, powerful deep neural network based sequential models (e.g., RNN) have almost dominated the field of session-based recommender systems (SRS). Among these models, GRURec (Hidasi et al., 2015) is a pioneering work that employs Gated Recurrent Units to model the evolution of users' preferences. Inspired by its success, a class of RNN-based models has been researched extensively for SRS. For example, an improved RNN variant in (Tan et al., 2016) showed promising improvements over standard RNN models by proposing data augmentation techniques. Hidasi et al. (Hidasi et al., 2015) further proposed a family of alternative ranking objective functions along with effective sampling tricks to improve the cross entropy and pairwise losses. (Quadrana et al., 2017) proposed personalized SRS, while (Gu et al., 2016; Smirnova and Vasile, 2017) explored how to use content and context features to enhance the recommendation accuracy.

Another line of research work is based on convolutional neural networks (CNN) and attention mechanisms. The main claim is that RNN-based sequential models depend heavily on a hidden state of the entire past and therefore cannot fully utilize the parallel processing power of GPUs (Gehring et al., 2017; Yuan et al., 2019). As a result, their speed is limited in both training and evaluation. In contrast, CNN and purely attention based models are inherently easier to parallelize since all timesteps in the user session are known during training. The most typical CNN model for SRS is Caser (Tang and Wang, 2018), which treats the item embedding matrix as an image and then performs 2D convolution. In NextItNet (Yuan et al., 2019), the authors argued that the CNN architecture and max pooling operation of Caser are not suitable for modeling long-range user sequences. Correspondingly, they proposed using stacked dilated CNNs to increase the receptive field of higher-layer neurons. Moreover, the authors claimed that the data augmentation techniques in Caser or GRURec can be omitted by developing loss functions directly for the entire input. They showed that the autoregressive NextItNet is more powerful than Caser or state-of-the-art RNN models for the session-based recommendation task. Meanwhile, transformer-based self-attention models (Vaswani et al., 2017; Zhang et al., 2018) also demonstrated competitive results in the area of SRS.

While the above works have achieved promising improvements over traditional recommendation approaches (e.g., Markov chain style models (Rendle et al., 2010) and factorization models (Yuan et al., 2016; Yuan et al., 2018; Rendle et al., 2009)), one important drawback is that both data augmentation based models (Tan et al., 2016; Tang and Wang, 2018; Li et al., 2017; Tuan and Phuong, 2017) and autoregressive models (Zhang et al., 2018; Vaswani et al., 2017; Yuan et al., 2019) model only the past user behaviors during training. Intuitively, SRS may achieve further improvements if future user preferences are considered during training. Motivated by this, in what follows, we investigate several augmentation methods from both data and model perspectives.

3. Proposed Methods

In this section, augmentation methods are investigated to improve the item representations by modeling both the past and future contexts in user sessions. First, a simple data augmentation by reversing the input user session is presented as a baseline. Then, two model-level augmentation methods that follow a similar intuition are proposed.

3.1. Two-way Data Augmentation

In Section 2.2, the left-to-right optimization objective was identified as the main cause of potentially non-optimal results. That is, all useful user behaviors $\{x_{i+1}, \ldots, x_t\}$ in the future are ignored when learning the representation of item $x_i$.

A fairly straightforward approach to take advantage of future context is to reverse the original user session and train the recommendation model by feeding it both the original and the reversed session. The recommendation model of NextItNet or Caser can be used without any modification. Throughout this paper, we investigate the recommendation performance by using NextItNet (denoted by NextItNet+).

$$\begin{aligned} \text{input: } \{x_1, \ldots, x_{t-1}\} &\Rightarrow \text{output: } \{x_2, \ldots, x_t\} \\ \text{input: } \{x_t, \ldots, x_2\} &\Rightarrow \text{output: } \{x_{t-1}, \ldots, x_1\} \end{aligned} \qquad (3)$$

The above data augmentation is trivial to implement. However, this way of modeling has two drawbacks: (1) the left and right contexts of item $x_i$ are modeled by the same set of parameters, i.e., the same convolutional kernels of NextItNet. In practice, the impact of the left and right contexts on $x_i$ can be very different, so using exactly the same network representation for both is not accurate. (2) Training the left and right contexts separately easily results in suboptimal performance, since the parameters learned for the left context may be largely changed when the model trains on the right context. In view of this, a better solution is that (1) the optimization objective covers both the left and right contexts simultaneously, and (2) the left and right contexts are represented by different sets of model parameters. In the following, we introduce two model-level augmentation methods that are able to model the past and future contexts simultaneously.
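The two-way data augmentation above can be sketched as follows. This is an illustration under the assumption that both directions are fed to the same left-to-right model as shifted input/target pairs; the name `two_way_augment` is hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[List[int], List[int]]


def two_way_augment(session: List[int]) -> List[Pair]:
    """Build one forward and one reversed (input, target) pair.

    Both pairs go through the same model, which is exactly why the left
    and right contexts end up sharing one set of parameters.
    """
    forward = (session[:-1], session[1:])
    reversed_session = session[::-1]
    backward = (reversed_session[:-1], reversed_session[1:])
    return [forward, backward]


for model_input, target in two_way_augment([1, 2, 3, 4]):
    print(model_input, "=>", target)
```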

Figure 2. The two-way NextItNets. (a) and (b) are the forward and backward NextItNet respectively. The grey bar denotes the representation of the padded token. For simplicity, we use the dilated convolution with a dilation factor 1 for illustration.
Figure 3. The architecture of GfNextItNet with both past and future context. The annotations on each layer give its receptive field and dilation respectively. The red-line network covers the receptive field of item 9 on the fourth convolutional layer. The blue arcs are the skip connections of residual blocks.

3.2. Pretrained Two-way NextItNets

In this section, we introduce a new type of two-way NextItNets that model the past context in the forward direction and the future context in the backward direction. Similar to the forward NextItNet, the backward NextItNet runs over a user session in reverse, predicting the previous item conditioned on the future context. The claim here differs from that in NextItNet, where both the predicted item and its future context are masked. In this work, we only guarantee that the predicted item is not seen by higher-layer neurons. The formulation of the backward NextItNet is given as follows:

$$p(x) = \prod_{i=1}^{t} p(x_i \mid x_t, x_{t-1}, \ldots, x_{i+1}; \Theta) \qquad (4)$$

Both the forward and backward NextItNets produce the item embeddings of a user session in each convolutional layer. Let $\overrightarrow{h}_{x_i}$ and $\overleftarrow{h}_{x_i}$ be the embedding vectors of item $x_i$ calculated by the top layer of NextItNet from the forward and backward directions respectively. To form the two-way NextItNets, we concatenate the embeddings from the forward and backward NextItNets, i.e., $h_{x_i} = [\overrightarrow{h}_{x_i}; \overleftarrow{h}_{x_i}]$, as illustrated in Figure 2. To combine both directions in the objective function, we maximize the joint log likelihood of both directions:

$$\sum_{i=1}^{t} \Big( \log p(x_i \mid x_1, \ldots, x_{i-1}; \Theta_e, \overrightarrow{\Theta}_{NextItNet}, \Theta_s) + \log p(x_i \mid x_t, x_{t-1}, \ldots, x_{i+1}; \Theta_e, \overleftarrow{\Theta}_{NextItNet}, \Theta_s) \Big) \qquad (5)$$
Obviously, the parameters $\Theta$ consist of three parts, namely, the bottom-layer item embedding $\Theta_e$, the convolutional kernels of NextItNet $\overrightarrow{\Theta}_{NextItNet}$ & $\overleftarrow{\Theta}_{NextItNet}$, and the weights of the softmax layer $\Theta_s$. $\Theta_e$ and $\Theta_s$ are shared between directions, while $\overrightarrow{\Theta}_{NextItNet}$ and $\overleftarrow{\Theta}_{NextItNet}$ are independent of each other. The idea here has a similar motivation to the bidirectional language model (Peters et al., 2018), with the exception that the bidirectional language model is designed for word understanding or downstream NLP tasks and is not well suited to the generating task. This is because, although two-way NextItNets can learn better item representations by taking into account both directional contexts, the right context is unavailable during generation. That is, the backward NextItNet in Figure 2 (b) is useless when generating the distribution of the next item. The discrepancy between training and predicting may hurt the final recommendation performance (see Table 2), since the optimal parameters learned for the two-way NextItNets may be suboptimal for the unidirectional NextItNet. To leverage the future context while addressing this mismatch, we propose using the two-way NextItNets as the pretraining model and recording $\Theta_e$ and $\overrightarrow{\Theta}_{NextItNet}$ for the forward NextItNet. After the two-way NextItNets converge, we fine tune the forward NextItNet with only the left context. The pretraining and fine tuning process can be viewed as a transfer learning (Yosinski et al., 2014) task, where the source dataset contains both the past and future contexts while the target dataset contains only the past context. As for the specific learning strategy, we can either freeze $\Theta_e$ and $\overrightarrow{\Theta}_{NextItNet}$ and train only the top layers, or fine tune them while training the forward NextItNet with the past context. Empirically, we find that fine tuning all layers performs better than both the original forward NextItNet and the two-way NextItNets. We refer to the pretraining based two-way NextItNets as PtNextItNet.

3.3. Gap-filling Based NextItNet

The above two-way NextItNets are essentially a shallow concatenation of independently trained left-to-right and right-to-left models. We argue that a deep two-way network is more powerful than this shallow concatenation. In this section, we introduce an alternative approach to model both directional contexts, which does not rely on pretraining and fine tuning.

The main difficulty in building a real deep two-way neural network model is that the network could ‘cheat’, since the predicted item will be ‘indirectly seen’ through the high-layer network connections. To address this, we borrow an idea from the fill-in-the-blank or gap-filling task (Sakaguchi et al., 2013; Fedus et al., 2018) that is well known in student language testing. Specifically, we present a blank-mask technique that directly masks some percentage of items in a session and then predicts these masked items. We treat the items in a user session as words in a sentence, where we randomly remove some words by replacing them with the masked symbol “__”, e.g., “today, I went to the __ and bought a __ of milk with my __”. (Note that the collection of items, e.g., short videos, photos or songs, in a user session, i.e., the item ‘sentence’, intuitively has a very weak order relationship, which is different from language generation.) The goal is to predict the answers for these missing words, i.e., “shop”, “bottle” and “wife”. Our mask technique differs from that in NextItNet, which requires that all right contexts (including the predicted item) be covered, while here only the predicted items denoted by “__” need to be covered. The joint distribution of missing items is given as follows:

$$p(x_{\Delta_1}, \ldots, x_{\Delta_m} \mid \tilde{x}; \Theta) = \prod_{i=1}^{m} p(x_{\Delta_i} \mid \tilde{x}; \Theta) \qquad (6)$$

where $x_{\Delta_i}$ is the $i$-th masked item, $\tilde{x}$ is the user session with the masked positions replaced by “__”, and $m$ is the number of masked items. Figure 3 gives an example. The above equation indicates that each missing item is predicted by both its left and right contexts. By using the blank-masking technique, the information leakage problem can be well addressed.

In practice, we usually mask 30 to 40 percent of the items in a user session. In addition, to learn a robust model, we do not replace all of these masked items with “__”. This is in line with existing work (Song et al., 2019; Devlin et al., 2018; Sperber et al., [n. d.]) showing that a small percentage of noisy and real items helps reduce overfitting and improves the accuracy of deep learning models. Empirically, the best performance is achieved by the following procedure: 60% to 70% of the time (denoted by = 70% to 80%), masked items are replaced with “__”; half of the remaining time they keep the real items, and half of the remaining time they are replaced with randomly sampled fake items from the item pool. By doing this, the CNN decoder does not know which item will be predicted and which item is the real one, and as a result, it is forced to learn the contextual representation of all its surrounding items, i.e., both the left and right context. The gap-filling based SRS is referred to as GfNextItNet.
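The masking procedure can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: `MASK_ID`, `blank_mask`, and the default values `mask_ratio=0.3` and `blank_prob=0.7` are assumptions chosen from within the ranges quoted above, and item IDs are assumed to run from 1 to `num_items` with 0 reserved for padding.

```python
import random
from typing import List, Optional, Tuple

MASK_ID = -1  # hypothetical ID standing in for the "__" symbol


def blank_mask(
    session: List[int],
    num_items: int,
    mask_ratio: float = 0.3,   # assumed fraction of positions to predict
    blank_prob: float = 0.7,   # assumed chance a chosen position becomes "__"
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], List[int], List[int]]:
    """Return (corrupted session, predicted positions, ground-truth items)."""
    if rng is None:
        rng = random.Random()
    n_mask = max(1, round(mask_ratio * len(session)))
    positions = sorted(rng.sample(range(len(session)), n_mask))
    corrupted = list(session)
    for pos in positions:
        u = rng.random()
        if u < blank_prob:
            corrupted[pos] = MASK_ID                   # replace with "__"
        elif u < blank_prob + (1.0 - blank_prob) / 2:
            pass                                       # keep the real item
        else:
            corrupted[pos] = rng.randint(1, num_items)  # random fake item
    targets = [session[pos] for pos in positions]
    return corrupted, positions, targets


corrupted, positions, targets = blank_mask(list(range(1, 11)), num_items=100)
print(corrupted, positions, targets)
```

Only the chosen positions contribute to the loss; the unchanged and randomly replaced positions among them are what prevent the model from treating “__” as the sole cue for prediction.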

3.3.1. Model Structure

Following NextItNet, GfNextItNet consists of three components: the input representation tensor, the CNN layers and the output layer. The basic module of the input representation tensor is an item embedding matrix of the masked user session, obtained via a look-up table on the embedding matrix of all items. This is in contrast to NextItNet-style learning algorithms, where the embedding matrix is constructed from the vectors of the first $t-1$ items. As a result, the output softmax layer is also different: NextItNet-style models predict the next item of every individual item in the input session, while GfNextItNet predicts only the masked items in the input session. For clarity, we show the difference between the input and output sequences of GfNextItNet and NextItNet as follows.


where the items {3, 129, 10, 13, 15, 17} are required to be estimated by GfNextItNet. Specifically, items 3, 13, 15 and 17 are masked by “__”, item 8 is replaced by a random item (i.e., 129) from the item pool, and item 10 keeps the original item in the input session. In addition, GfNextItNet can flexibly model various context features by concatenating the feature embeddings with the item embedding matrix, such as a user profile vector (including but not limited to user age, gender and social relations) or an item position embedding vector.

Following NextItNet, the masked convolutions use dilation to increase the receptive field so as to model long-range user sessions. The dilation rates are doubled every layer from 1 up to 16 in this work. In addition, we can repeat this scheme several times for very long-range sessions, such as {1, 2, 4, 8, 16, 1, 2, 4, 8, 16}. To alleviate gradient vanishing issues, we investigate two types of convolutional structures: one wraps every two dilated layers in a residual block following NextItNet, while the other builds skip connections from all preceding layers to each layer following DenseNet (Huang et al., 2017). It is worth mentioning that DenseNet architectures have not been explored in the field of recommender systems. Figure 3 shows the architecture of GfNextItNet with four dilated CNN layers.
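To make the effect of the dilation schedule concrete, the following sketch computes the receptive field of a stack of dilated 1D convolutions. It assumes one convolution per dilation rate, stride 1, and a kernel size of 3 that looks at both sides of the current position; the kernel size and both function names are assumptions for illustration, not values stated in this paper.

```python
from typing import List


def dilation_schedule(max_dilation: int = 16, repeats: int = 1) -> List[int]:
    """Dilation rates doubled per layer from 1 to max_dilation, then repeated."""
    rates = []
    rate = 1
    while rate <= max_dilation:
        rates.append(rate)
        rate *= 2
    return rates * repeats


def receptive_field(dilations: List[int], kernel_size: int = 3) -> int:
    """Positions visible to one top-layer neuron.

    Each stride-1 layer widens the field by (kernel_size - 1) * dilation.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


schedule = dilation_schedule(max_dilation=16, repeats=2)
print(schedule)
print(receptive_field(schedule))
```

Because the widening is proportional to the dilation rate, doubling the rate each layer grows the receptive field exponentially with depth, whereas undilated layers would only grow it linearly.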

On top of the final convolutional layer, we apply the look-up table again to extract the hidden vectors of the predicted items. To obtain the probabilities of the predicted items, we simply perform a convolution whose input channel size equals the width of these hidden vectors and whose output channel size equals the number of items. Note that for a very large number of items, negative sampling techniques can be easily applied, such as the static sampled softmax in (Jean et al., 2014) or the dynamic negative sampler in (Yuan et al., 2016).

3.3.2. Model Training & Generating

To maximize the joint distribution of masked items (i.e., Eq. (6)), we compute the cross entropy loss between the predictions for these masked items and the ground truth items. Alternative objective functions such as pairwise ranking loss (BPR (Rendle et al., 2009)) and adversarial training based loss (He et al., 2018) are also theoretically feasible and worth exploring in the future. Since GfNextItNet predicts only around 30%-40% of the items in each batch, it needs more training steps to converge than NextItNet-style models (which predict all items), but fewer steps than left-to-right data augmentation based models (e.g., Caser) (which predict only the last item).

Once the model is well trained, we can use it for item generation. Unlike the training phase, which masks a certain percentage of items in each user session, the generating phase only needs to mask the position of the last item in each user session and keep all of its left-context items in their original form. In addition, it also makes sense to perform pretraining and fine tuning before generation, similar to PtNextItNet. That is, we fine tune the well-trained GfNextItNet using new training sessions that mask only the last item in each user session, which makes training and testing more consistent. In fact, we observe obvious improvements in our experiments even without fine tuning. This is probably because the random masking technique guarantees sufficient training of the last item with the left context in the user session.
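The generation-time input can be sketched as below, assuming the same hypothetical `MASK_ID` placeholder for “__” as a training-time masking routine would use; `mask_last_position` is an illustrative name, not part of any released code.

```python
from typing import List, Tuple

MASK_ID = -1  # hypothetical ID standing in for the "__" symbol


def mask_last_position(session: List[int]) -> Tuple[List[int], int, int]:
    """Mask only the final position and keep the left context untouched.

    Returns the model input, the index to read the prediction from, and
    the held-out ground-truth item used for evaluation.
    """
    if not session:
        raise ValueError("session must contain at least one item")
    last = len(session) - 1
    return session[:last] + [MASK_ID], last, session[last]


model_input, position, target = mask_last_position([5, 9, 2, 7])
print(model_input, position, target)
```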

4. Experiments

As the key contribution of this work is to improve the existing left-to-right style learning algorithms for SRS, we evaluate the proposed approaches on four real-world datasets and conduct detailed ablation studies to answer the following research questions:

  1. RQ1: Do the three proposed augmentation methods perform better than existing state-of-the-art models that train user sessions from left to right, i.e., using only the past context? Which of the three proposed methods performs best?

  2. RQ2: How does PtNextItNet perform without pretraining? How does GfNextItNet perform with different masking strategies?

  3. RQ3: What is the effect of other key hyperparameters of GfNextItNet, for example, the batch size of user sessions and the type of residual block?

In the following, we will first describe the experimental settings, followed by answering the above research questions.

4.1. Experimental Settings

4.1.1. Dataset Descriptions

We conduct experiments on four large-scale industry datasets, of which three are publicly accessible and one is produced by Tencent, China.

1. Yahoo! Music: This dataset is provided by the Yahoo! Research Alliance Webscope program. The full version of the dataset contains 200,000 users and 136,000 songs. To speed up our experiments, in this paper we randomly select 50,000 users, covering around 19 million user-item observations. The songs played by each user are ordered sequentially. For a fair comparison, we define the maximum session length (denoted by $t$) as 20 by simply splitting each long-range (>20) user sequence into several subsessions of length 20, padding with 0 if the length is less than 20. (Other session lengths, e.g., 10, 50 or 100, lead to similar conclusions.)

2. Byte-100M: This dataset is obtained from the ICME 2019 short video understanding challenge, which focuses on predicting the probability that each user finishes watching and likes a given video. Since the original data contains cold users and items, we perform standard preprocessing by filtering out observations if a user has fewer than 5 items or an item has fewer than 10 users. We set $t$ as 50 and perform the same preprocessing as above. Note that since the timestep information of the original data has been processed by data desensitization, we do not consider the order relations in each user session.

3. Weishi: Similar to Byte-100M, this is also a short video dataset, produced by Tencent, China. We set $t$ as 30 and perform similar preprocessing as above.

4. Last.fm: This is a music recommendation dataset built by randomly drawing 200,000 songs. We follow the same preprocessing as in (Yuan et al., 2019) by setting $t$ as 100.
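The session construction described for these datasets can be sketched as follows. The source states only that long sequences are split into fixed-length subsessions and that short ones are padded with 0; placing the padding on the left, as well as the function name `split_and_pad`, are assumptions of this sketch.

```python
from typing import List


def split_and_pad(sequence: List[int], max_len: int = 20, pad_id: int = 0) -> List[List[int]]:
    """Cut a user sequence into subsessions of exactly max_len items.

    The final, shorter chunk is padded with pad_id on the left (assumed).
    """
    sessions = []
    for start in range(0, len(sequence), max_len):
        chunk = sequence[start:start + max_len]
        padding = [pad_id] * (max_len - len(chunk))
        sessions.append(padding + chunk)
    return sessions


# Illustrative sequence of 7 item IDs split with a toy session length of 3.
print(split_and_pad([1, 2, 3, 4, 5, 6, 7], max_len=3))
```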

Table 1 summarizes the statistics of evaluated datasets after basic preprocessing.

4.1.2. Evaluation Protocols

We randomly split all user sessions into training (80%), validation (3%) and testing (17%) sets. We evaluate all models by three popular top-$N$ metrics, namely MRR@$N$ (Mean Reciprocal Rank) (Hidasi et al., 2015), HR@$N$ (Hit Ratio) (He et al., 2018) and NDCG@$N$ (Normalized Discounted Cumulative Gain) (Yuan et al., 2018). $N$ is set to 5 and 20 for comparison. HR intuitively measures whether the ground truth item is on the top-$N$ list, while NDCG and MRR account for the hitting position by assigning higher scores to hits at top ranks. We adopt leave-one-out evaluation for user sessions in the testing sets, i.e., evaluating the accuracy of the last (i.e., next) item of each session, similarly to (Yuan et al., 2019).
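Under leave-one-out evaluation there is a single ground truth item per test session, so the three metrics reduce to simple functions of its rank. The sketch below uses the common single-target definitions (an assumption, since the exact formulas are not spelled out here), and the helper names are hypothetical.

```python
import math
from typing import Dict, Sequence


def rank_of_target(scores: Sequence[float], target: int) -> int:
    """1-based rank of the target item: one plus the number of higher scores."""
    return 1 + sum(1 for s in scores if s > scores[target])


def metrics_at_n(rank: int, n: int) -> Dict[str, float]:
    """HR@N, MRR@N and NDCG@N for one session with a single target item."""
    if rank > n:
        return {"HR": 0.0, "MRR": 0.0, "NDCG": 0.0}
    return {
        "HR": 1.0,                           # target is in the top-N list
        "MRR": 1.0 / rank,                   # reciprocal of the hit position
        "NDCG": 1.0 / math.log2(rank + 1),   # ideal DCG is 1 for one target
    }


scores = [0.1, 0.7, 0.2]  # made-up scores over three candidate items
rank = rank_of_target(scores, target=2)
print(rank, metrics_at_n(rank, n=5))
```

The reported numbers are then the averages of these per-session values over all test sessions.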

4.1.3. Compared Methods

In this work, we mainly compare our improved augmentation methods with two state-of-the-art left-to-right style CNN baselines, namely, Caser (Tang and Wang, 2018) and NextItNet (Yuan et al., 2019). Specifically, Caser is a typical session-based recommendation model based on the left-to-right data augmentation, while NextItNet is a typical left-to-right style model-level augmentation approach. Since both Caser and NextItNet are convolutional neural network (CNN) models, our proposed methods are also learned by CNN for comparison purpose.

Note that comparisons with other well-known recommendation models, such as Bayesian personalized ranking (BPR) (Rendle et al., 2009), the Markov chain based models FMC and FPMC (Rendle et al., 2010) or GRURec (Hidasi et al., 2015; Tan et al., 2016), are omitted since these models significantly underperform either Caser or NextItNet in the existing literature (Tang and Wang, 2018; Yuan et al., 2019; Kang and McAuley, 2018; Xiao et al., 2019).

| Data | #users | #items | #actions | #sessions | Session length | Sparsity | Avg. sessions per user |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Yahoo! Music | 50,000 | 136,738 | 19.1M | 0.97M | 20 | 99.72% | 19.4 |
| Byte-100M | 548,254 | 513,879 | 67.4M | 1.02M | 50 | 99.97% | 1.86 |
| Weishi | 200,000 | 57,723 | 5.62M | 0.2M | 30 | 99.95% | 1.0 |
| Last.fm | 1000 | 200,000 | 11.2M | 0.11M | 100 | 94.40% | 110 |

Table 1. Statistics of the datasets. “M” is short for million.
| Data | Model | MRR@5 | MRR@20 | HR@5 | HR@20 | NDCG@5 | NDCG@20 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Yahoo! Music | MostPop | 0.0007 | 0.0012 | 0.0020 | 0.0075 | 0.0010 | 0.0025 |
| | Caser | 0.0655 | 0.0787 | 0.1196 | 0.2561 | 0.0788 | 0.1175 |
| | NextItNet | 0.0940 | 0.1113 | 0.1692 | 0.3473 | 0.1125 | 0.1631 |
| | NextItNet+ | 0.0926 | 0.1101 | 0.1672 | 0.3457 | 0.1110 | 0.1618 |
| | tNextItNet | 0.0884 | 0.1050 | 0.1583 | 0.3287 | 0.1056 | 0.1540 |
| | PtNextItNet | 0.0975 | 0.1156 | 0.1750 | 0.3604 | 0.1167 | 0.1693 |
| | GfNextItNet | 0.1016 | 0.12027 | 0.1833 | 0.3740 | 0.1218 | 0.1759 |
| Byte-100M | MostPop | 0.0007 | 0.0017 | 0.0021 | 0.0116 | 0.0011 | 0.0038 |
| | Caser | 0.0017 | 0.0024 | 0.0038 | 0.0120 | 0.0022 | 0.0045 |
| | NextItNet | 0.0033 | 0.0050 | 0.0071 | 0.0256 | 0.0043 | 0.0094 |
| | NextItNet+ | 0.0038 | 0.0057 | 0.0081 | 0.0287 | 0.0049 | 0.0105 |
| | tNextItNet | 0.0031 | 0.0046 | 0.0682 | 0.0237 | 0.0040 | 0.0086 |
| | PtNextItNet | 0.0039 | 0.0059 | 0.0085 | 0.0304 | 0.0051 | 0.0111 |
| | GfNextItNet | 0.0042 | 0.0060 | 0.0090 | 0.0296 | 0.0054 | 0.0110 |
| Weishi | MostPop | 0.0010 | 0.0019 | 0.0033 | 0.0135 | 0.0016 | 0.0044 |
| | Caser | 0.0099 | 0.0128 | 0.0189 | 0.0504 | 0.0121 | 0.0208 |
| | NextItNet | 0.0095 | 0.0131 | 0.0187 | 0.0570 | 0.0118 | 0.0224 |
| | NextItNet+ | 0.0105 | 0.0142 | 0.0203 | 0.0609 | 0.0129 | 0.0242 |
| | tNextItNet | 0.0089 | 0.0119 | 0.0181 | 0.0510 | 0.0112 | 0.0202 |
| | PtNextItNet | 0.0100 | 0.0138 | 0.0200 | 0.0613 | 0.0128 | 0.0240 |
| | GfNextItNet | 0.0108 | 0.0146 | 0.0215 | 0.0632 | 0.0135 | 0.0249 |
| Last.fm | MostPop | 0.0007 | 0.0009 | 0.0013 | 0.0047 | 0.0008 | 0.0017 |
| | Caser | 0.2118 | 0.2193 | 0.2657 | 0.3386 | 0.2253 | 0.2464 |
| | NextItNet | 0.2893 | 0.2970 | 0.3672 | 0.4531 | 0.3088 | 0.3326 |
| | NextItNet+ | 0.2839 | 0.2933 | 0.3555 | 0.4570 | 0.3019 | 0.3300 |
| | tNextItNet | 0.2150 | 0.2223 | 0.2706 | 0.3416 | 0.2289 | 0.2494 |
| | PtNextItNet | 0.3078 | 0.3151 | 0.3554 | 0.4268 | 0.3197 | 0.3403 |
| | GfNextItNet | 0.2942 | 0.3029 | 0.3505 | 0.4348 | 0.3082 | 0.3327 |

Table 2. Accuracy comparison. NextItNet+ represents NextItNet with the two-way data augmentation. tNextItNet is the two-way NextItNet without pretraining. MostPop returns an item list ranked by popularity. For each measure, the best result is indicated in bold.
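| Data | 10% | 20% | 30% | 40% | 50% |
| --- | --- | --- | --- | --- | --- |
| Yahoo! Music | 0.0979 | 0.0989 | 0.0990 | 0.1008 | 0.0969 |
| Byte-100M | 0.0041 | 0.0041 | 0.0042 | 0.0043 | 0.0040 |

Table 3. Impact of the masking percentage (with the second masking hyperparameter fixed at 80%) regarding MRR@5.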
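| Data | 50% | 60% | 70% | 90% | 100% |
| --- | --- | --- | --- | --- | --- |
| Yahoo! Music | 0.0962 | 0.0989 | 0.1005 | 0.0997 | 0.0986 |
| Byte-100M | 0.0039 | 0.0042 | 0.0042 | 0.0040 | 0.0038 |

Table 4. Impact of the second masking hyperparameter (with the masking percentage fixed at 40%) regarding MRR@5.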

4.1.4. Experimental Reproducibility

All models were trained on GPUs (Tesla P40) using TensorFlow. The reported results use an embedding dimension of 64. Results for embedding dimensions of 12, 32 and 128 show similar behaviors but are omitted to save space. The learning rates and batch sizes of the baseline methods are manually tuned according to performance on the validation sets. Empirically, NextItNet shows the best recommendation accuracy with a learning rate of 0.001 and a batch size of 3264, which is basically consistent with the report in (Yuan et al., 2019). A larger batch size does not further improve the recommendation accuracy. PtNextItNet adopts exactly the same hyperparameters as NextItNet since it can be regarded as a variant of NextItNet. GfNextItNet is relatively sensitive to batch size since it does not predict all items in each batch; we discuss this later. Unless otherwise mentioned, the two masking hyperparameters are set to 40% (the masking percentage) and 80%, respectively. In addition, we report results with residual block (b) (i.e., the blue arcs in Figure 3) of NextItNet for GfNextItNet, PtNextItNet, tNextItNet, NextItNet+ and NextItNet, since it empirically performs better than block (a) (see (Yuan et al., 2019)).

4.2. Performance Comparison (RQ1)

Table 2 presents the results of all methods on the four datasets. We observe that, except on Weishi, the NextItNet baseline largely outperforms Caser on all datasets. Even on the Weishi dataset, NextItNet is comparable to Caser on all ranking measures. This is consistent with the observation in (Yuan et al., 2019), as the optimization objective and CNN architecture of NextItNet are more suitable for the SRS scenario. Hence, in what follows, we focus on comparing our proposed improvements with NextItNet.

First, among the three proposed augmentation methods, NextItNet+ performs better than NextItNet on the Byte-100M and Weishi datasets, while it shows similar results on the other two datasets. In other words, the trivial two-way data augmentation method does not guarantee better results than the unidirectional model with only the left context. The results are predictable since the parameters learned from the left context may not completely suit the right context. Even though the model considers more context, the independent training process of the right context may hurt the model representation for the left context.

Second, the proposed model-level augmentation methods, i.e., PtNextItNet and GfNextItNet, significantly outperform NextItNet on the first three datasets. These results indicate that a suitable way of modeling more context does improve the recommendation accuracy. Surprisingly, neither PtNextItNet nor GfNextItNet achieves a substantial improvement over the standard NextItNet on Last.fm. By manually examining the datasets, we note two reasons that may lead to this result. First, the user behaviors on Last.fm are much denser than on the other datasets: each user listened to more than 10,000 songs and created 110 sessions on average. As a result, each predicted item already has sufficient context even when the future signals are ignored. Second, the user sessions are very long, and correspondingly, the predicted item with its surrounding context may appear more than once in the session, which can be learned by the model. In this case, the future context may not help much with prediction accuracy. Even in such a scenario, PtNextItNet and GfNextItNet are still competitive with NextItNet.

Regarding the comparison between PtNextItNet and GfNextItNet, we observe that GfNextItNet is empirically more powerful than PtNextItNet, which suggests that a deep two-way network is more suitable than a shallow concatenation of the two directions.

4.3. Impact of Pretraining and Masking (RQ2)

First, we show the results of a non-pretrained two-way NextItNet (i.e., tNextItNet) in Table 2. (In the following, we conduct the ablation studies on one or two datasets to speed up the experiments, since the conclusions on the other datasets are basically consistent.) It can be seen that tNextItNet consistently performs worse than the standard left-to-right style NextItNet, although tNextItNet uses both the left and right contexts. The reason is that tNextItNet consists of two independent CNN networks during training, while only the left-to-right network is available during generation. This mismatch between training and generation easily leads to suboptimal results. However, it is reasonable to argue that the embedding layer and the left-to-right network trained with two-way context may contain useful future information that cannot be learned from the left context alone. Hence, fine-tuning NextItNet from tNextItNet (i.e., PtNextItNet) takes into account both the pretrained two-way networks and the final generating objective function. We also show the convergence behaviors of PtNextItNet in Figure 4. The results demonstrate that PtNextItNet converges faster and to better accuracy than NextItNet and tNextItNet.

Tables 3 and 4 show the performance change of GfNextItNet on Yahoo! Music and Byte-100M with different masking hyperparameters. First, we fix the second masking hyperparameter at 80% and vary the masking percentage from 10% to 50%. As shown in Table 3, a masking percentage that is too large or too small typically yields suboptimal performance, and the highest recommendation accuracy is obtained at around 40%. This is because masking too large a percentage of items in the user session is very likely to miss important contexts of these masked items and thus harm the model’s understanding capacity, while with a very small percentage GfNextItNet needs more training steps to converge since very few items in each batch are predicted. Similar behaviors are observed in Table 4, where the optimal results are obtained when the second hyperparameter is around 70%. The reason is easy to understand: using too many noisy items (i.e., a small value such as 50%) results in low accuracy, while the robustness of GfNextItNet is reduced without noise (i.e., a value of 100%).
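The interplay of the two masking hyperparameters can be illustrated with a small sketch. It rests on an assumption that this section does not spell out: the second hyperparameter is treated as the probability that a selected position receives the mask token, with the remaining selected positions replaced by a random (noisy) item. The names `mask_session`, `mask_ratio`, `keep_mask_prob` and `mask_id` are hypothetical and chosen only for this example.

```python
import random


def mask_session(session, mask_ratio, keep_mask_prob, mask_id, num_items, rng):
    """Corrupt one session for a gap-filling objective (illustrative only).

    mask_ratio:     fraction of positions selected for prediction (e.g. 0.4).
    keep_mask_prob: assumed meaning of the second hyperparameter (e.g. 0.8);
                    a selected position gets `mask_id` with this probability
                    and a random noisy item otherwise.
    Returns (corrupted_session, positions), where `positions` are the indices
    whose original items the model would be asked to recover.
    """
    num_masked = max(1, round(len(session) * mask_ratio))
    positions = sorted(rng.sample(range(len(session)), num_masked))
    corrupted = list(session)
    for pos in positions:
        if rng.random() < keep_mask_prob:
            corrupted[pos] = mask_id  # ordinary masked position
        else:
            corrupted[pos] = rng.randrange(num_items)  # noisy item
    return corrupted, positions
```

With `mask_ratio=0.4` and a session of length 20, eight positions are selected per session, so only those eight items contribute to the loss. Under the stated assumption, `keep_mask_prob=1.0` corresponds to training without noise, and lower values inject more noisy items.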

| Model | Yahoo! Music | Byte-100M | Weishi | Last.fm |
| --- | --- | --- | --- | --- |
| ResNet | 0.1016 | 0.0042 | 0.0108 | 0.2942 |
| DenseNet | 0.0960 | 0.0045 | 0.0108 | 0.3102 |

Table 5. Impact of different convolutional structures on GfNextItNet regarding MRR@5. All other parameters are fixed. Results for NextItNet, NextItNet+ and PtNextItNet lead to the same conclusion but are omitted to save space.
Figure 4. Convergence behaviors of PtNextItNet on (a) Yahoo! Music and (b) Byte-100M. Each unit on the x-axis is 3000*64 training sessions, where 64 is the batch size.

4.4. Impact of Hyperparameters and Convolutional Architectures (RQ3)

In this subsection, we evaluate the impact of some basic hyperparameters (i.e., batch size and embedding dimension) and the residual block component. As mentioned before, compared with NextItNet or PtNextItNet, GfNextItNet is relatively sensitive to batch size since it predicts only the masked items rather than the entire input session. We show the convergence behaviors on Yahoo! Music in Figure 5 (a) with different batch sizes. Clearly, GfNextItNet achieves higher accuracy with a batch size of 1024 than with 64 or 256. A similar conclusion also applies to the other datasets. Figure 5 (b) shows the results of GfNextItNet with different embedding sizes. As with other embedding models, a relatively larger embedding dimension usually leads to better results.

To our knowledge, the impact of residual block structures in the field of recommender systems has not been well studied in the literature. As a supplemental study, we also apply a state-of-the-art convolutional architecture, referred to as densely connected convolutional networks (DenseNet) (Sercu and Goel, 2016), originally designed in the field of computer vision. The results are presented in Table 5. According to these results, it is hard to judge which type of convolutional architecture performs better in the task of SRS, although DenseNet significantly outperforms ResNet for the image processing task (Sercu and Goel, 2016). In other words, we may need to conduct more trials when choosing a convolutional architecture for different recommendation datasets.

Figure 5. Impact of (a) batch size and (b) embedding dimension on Yahoo! Music. Each unit on the x-axis is 5000*1024 training sessions. All results are reported based on the first 40*512 sessions of the testing set to speed up the experiments.

5. Conclusion

In this paper, we have shown how to incorporate future context into the standard left-to-right style learning algorithms for the task of SRS. The motivation is that state-of-the-art session-based recommendation models either do not model, or are not suitable for modeling, both the past and future preferences. Two carefully designed bidirectional objective functions are proposed. Moreover, we present pretrained two-way NextItNets and a gap-filling based NextItNet, which fit the objective functions and solve the information leakage issues of higher-layer neurons. Through ablations and controlled experiments, we demonstrate that the proposed two-way recommendation models are more powerful than the traditional unidirectional models, and thus achieve new state-of-the-art results. In the future, we are interested in studying whether the future context or the proposed two-way NextItNets will also improve the recommendation diversity for SRS.