Two-Stage Session-based Recommendations with Candidate Rank Embeddings

by   José Antonio Sánchez Rodríguez, et al.

Recent advances in session-based recommender systems have gained attention due to their potential to provide real-time personalized recommendations with high recall, especially when compared to traditional methods like matrix factorization and item-based collaborative filtering. Two of the most recent methods are the Short-Term Attention/Memory Priority Model for Session-based Recommendation (STAMP) and Neural Attentive Session-based Recommendation (NARM). However, when these two methods were applied to the similar-item recommendation dataset of Zalando (Fashion-Similar), they did not work out of the box compared to a simple collaborative filtering approach. Aiming to improve the similar-item recommendation, we propose to concentrate efforts on enhancing the rank of the few most relevant items from the original recommendations, by employing the information of the user's session encoded by an attention network. The efficacy of this strategy was confirmed when using a novel Candidate Rank Embedding that encodes the global ranking information of each candidate in the re-ranking process. Experimental results on Fashion-Similar show significant improvements over the baseline on Recall and MRR at 20, as well as improvements in Click Through Rate based on an online test. Additionally, the evaluation points to the potential of this method for the next click prediction problem: when applied to STAMP and NARM, it improves the Recall and MRR at 20 on two publicly available real-world datasets.



Session-based Complementary Fashion Recommendations

In modern fashion e-commerce platforms, where customers can browse thous...

Next-item Recommendations in Short Sessions

The changing preferences of users towards items trigger the emergence of...

Session-based Recommendations with Recurrent Neural Networks

We apply recurrent neural networks (RNN) on a new domain, namely recomme...

Session-based Recommendation with Hypergraph Attention Networks

Session-based recommender systems aim to improve recommendations in shor...

Time is of the Essence: a Joint Hierarchical RNN and Point Process Model for Time and Item Predictions

In recent years session-based recommendation has emerged as an increasin...

KnowledgeCheckR: Intelligent Techniques for Counteracting Forgetting

Existing e-learning environments primarily focus on the aspect of provid...

In-Session Personalization for Talent Search

Previous efforts in recommendation of candidates for talent search follo...

1. Introduction

Recommendation systems have become a major tool for users to explore websites with an extensive assortment. These websites offer different recommendation products in different locations that try to meet the specific needs of the user.

For Zalando, one of the most important products is the similar-item recommendation. It is shown to all customers who reach the product page and offers them alternatives related to the product they are browsing. Figure 1 illustrates this page and the fashion similar-item recommendation. In order to improve this recommendation product, a dataset containing users' historical view events and their clicks on the recommended items was collected in early 2018 for further research.

Recent studies on session-based recommendations have shown significant improvements compared to collaborative filtering approaches on several datasets (Hidasi et al., 2015; Hidasi and Karatzoglou, 2018; Tan et al., 2016). Following the state-of-the-art approaches, two of the best-performing session-based recommenders, STAMP (Liu et al., 2018) and NARM (Li et al., 2017), were selected to predict the next clicked item in the similar-item recommendation, and their results were compared with a basic collaborative filtering algorithm. However, we were unable to obtain any improvement over the baseline in terms of Recall@20.

This counter-intuitive phenomenon was also observed and reported by other studies in the literature (de Souza Pereira Moreira et al., 2018; Ludewig and Jannach, 2018). The underlying reasons can be multi-faceted and complicated. For example, one possible cause is the bias introduced by the feedback loop in the production system. Alternatively, it could stem from the intrinsic user behavior present in the dataset. Identifying the exact reasons and applying counter-measures belongs to another research area and is therefore not the focus of this paper.

The objective of this work, in contrast, is to improve the similar-item recommendation in an online test. Prior to an online test, the offline evaluation metrics must be improved in order to reduce the risk of impacting users negatively and to prioritize different online experiments. Our hypothesis is that with the proper use of personalized information, the resulting model should be able to outperform the baseline algorithms in both offline and online evaluations.

In order to use the personalized information of user sessions effectively, we have to overcome the performance issue of directly applying session-based models to the Zalando Fashion-Similar dataset. Earlier results indicate that a single session-based model struggles to capture global information between items in the dataset. To address this issue, we propose to use Candidate Rank Embeddings (CRE) together with a session-based recommender to collect such information from a pre-trained or pre-calculated model.

Following the approach of Covington et al. (Covington et al., 2016), in the first stage a Candidate Generator takes a given user's click history as input and generates a sorted list of candidates of fixed size. Specifically, collaborative filtering was used as the Candidate Generator. In the second stage, the candidate list and the click history are fed to the Re-ranker to produce fine-tuned recommendations. Borrowing the user session encoder of STAMP, the click history is encoded and used in the Re-ranker together with the novel Candidate Rank Embeddings (CREs). CREs learn the personalized candidate rank preference along with the item preference while the model is trained to predict the next item of interest. By using CREs, the Re-ranker can incorporate implicit information from the Candidate Generator as prior knowledge that helps to guide and calibrate its training, with the objective of improving the order of the original recommendation.

The offline experiments showed that our CRE-enhanced model can indeed outperform the collaborative filtering baseline. The good performance on the Fashion-Similar dataset suggests that the CRE technique may also be applicable to other recommendation tasks such as general next click prediction. This hypothesis was confirmed by combining the CRE-enhanced model with two baselines, STAMP and NARM, and evaluating on the YooChoose 1/4 and Diginetica next click prediction datasets.

The main contribution of this study lies in the following two aspects:

  • Candidate Rank Embeddings were used together with a session-based recommender to enhance the performance of I2I-CF in a two-stage approach. The model outperforms the baselines on the Fashion-Similar dataset in terms of Recall and MRR at 20. The improvement was also confirmed by an online test, in which a significant increase in Click Through Rate was observed.

  • Experiments and analysis were done to compare the baselines I2I-CF, STAMP and NARM, and the proposed method using those baselines as Candidate Generators on the task of predicting the next click. The results show that the model with CREs improves both Recall@20 and MRR@20 of these baselines on two publicly available datasets.

Figure 1. The product page offers similar items to the anchor.

2. Related Work

2.1. Session-based Recommender Systems

The concept of using user interaction history as a sequence was introduced to recommender systems with the success of the Recurrent Neural Network (RNN) family in the sequence modeling field. Before the rise of these models, user interactions were mainly used in simpler methods, such as item-item collaborative filtering (Aiolli, 2013) or matrix factorization (Koren et al., 2009; Kumar et al., 2014). One of the early attempts at using the click sequences of users as input data and treating recommendation as next-target prediction was proposed in (Devooght and Bersini, 2016). The authors of DREAM (Yu et al., 2016) used pooling to summarize the current basket, and a vanilla RNN was then used to recommend the next basket. With the advantage of the Gated Recurrent Unit (GRU) over the vanilla RNN having been proven for longer sequences, GRU-based recommender systems have been proposed with different loss functions, such as Bayesian Personalized Ranking (BPR) (Rendle et al., 2009), TOP-1 (Hidasi et al., 2015) and their enhancements (Hidasi and Karatzoglou, 2018). These models have shown better results than matrix factorization based methods on public sequence prediction datasets.

Various authors have proposed different improvements to the GRU family approach. In (Tan et al., 2016), two techniques were proposed to address the data sparsity issue and combat the behavior distribution shift problem. In addition, the NARM architecture (Li et al., 2017) further increased the prediction power of GRU-based session recommenders by adding an attention layer on top of the output. Despite its high recall and Mean Reciprocal Rank (MRR), NARM takes longer to train than pure GRU approaches.

Considering the time consumption problem of the RNN family, the authors of the Short-Term Attention model (Liu et al., 2018) managed to substantially reduce the training and prediction time by replacing the RNN structure with simpler components such as a feature average layer and feed-forward networks. As an improvement, the STAMP model additionally encodes the user feature from their click session with an attention operation. However, despite the promising results on several public datasets, some studies report that state-of-the-art session-based models do not outperform simple collaborative filtering methods on certain datasets (de Souza Pereira Moreira et al., 2018; Ludewig and Jannach, 2018). We found the same phenomenon in our dataset generated by an online system, which motivated us to design a solution that benefits from personalization without performance degradation.

2.2. Two-Stage Approaches

Two-stage approaches are widely adopted in different recommendation tasks in various domains. Covington et al. proposed the use of two neural networks for video recommendation on YouTube (Covington et al., 2016). With a clear separation of candidate selection and ranking generation, their solution is aimed especially at multimedia recommendations. This cascading approach was used to solve performance issues, mainly due to the enormous number of videos available on YouTube.

Other studies have employed this technique to improve accuracy. Rubtsov et al. (Rubtsov et al., 2018) applied a two-stage architecture to improve the quality of music playlist continuation. Similar to us, they used collaborative filtering to generate candidates and a more complicated algorithm based on gradient boosting for the final prediction. Likewise, we found that the two-stage architecture makes it easy to improve model performance when applying session-based recommendations to the fashion similar-item recommendation. Besides applying a two-stage approach to use the session information, we modeled the user-rank preference with Candidate Rank Embeddings.

3. Problem Statement

The session-based recommendation problem is usually modeled as predicting the next item of interest for a user, given their interaction history. Given an assortment from which all possible items come, the short-term history of a user consists of the sequence of items that the user has interacted with so far. Session-based recommenders aim to move the next item that the user will interact with to the top position of the recommendation list when given this history. A collection of user interaction sequences forms a dataset, which is composed of pairs of a user sequence and its target item.

A session-based recommender produces a sorted list from all items in a subset of the assortment. In most cases this subset is the full assortment, but it can also be a set of candidates selected by another recommender. To obtain the sorted list, a score is calculated for each item in the subset and the items are ranked by their scores in descending order. A sorting function applied to the subset and its scores yields the item list in the new order.
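The ranking step described above can be sketched in a few lines of Python. This is only an illustration: the function name `rank_by_score` and the tie-breaking rule (ties keep their input order) are assumptions made here, not details taken from the paper.

```python
def rank_by_score(items, scores):
    """Return `items` reordered by descending score.

    Assumption for this sketch: ties keep their original relative order,
    which follows from Python's stable sort.
    """
    if len(items) != len(scores):
        raise ValueError("items and scores must have the same length")
    # Sort positions by negated score so that the highest score comes first.
    order = sorted(range(len(items)), key=lambda i: -scores[i])
    return [items[i] for i in order]


ranked = rank_by_score(["shirt", "jeans", "coat"], [0.1, 0.9, 0.5])
```

The same helper applies whether the scored subset is the full assortment or a short candidate list handed over by another recommender.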


In the following sections, each item is represented by a latent vector. The feature vectors of a list or set of items are collected in a matrix whose shape is determined by the number of items and the dimension of the latent vectors.

4. Two-Stage Recommender with Candidate Rank Embeddings

Using the naming convention from (Covington et al., 2016), the first recommender of the cascade is the Candidate Generator, while the second, the Re-ranker, ranks the most relevant items from the output of the Candidate Generator. Both take a candidate set and a specific user history as input and assign a score to each candidate. The final recommendation of the proposed method is calculated as follows:


In this formulation, the candidate set is divided into the most relevant candidates computed by the Candidate Generator and the rest. The Candidate Generator and the Re-ranker each have their own trainable parameters; in this study, the two models are independent and do not share parameters. The training process of the proposed model is also two-staged. The Candidate Generator is trained first (if it needs training). The training of the Re-ranker starts once a well-trained Candidate Generator is available, and only considers the top-ranked candidates coming from it, as described in Eq. 2. The parameters of both models are optimized on the dataset. Figure 2 illustrates the overall architecture of the model.

Figure 2. The model architecture: The Candidate Generator is trained first and treated as a black box. Then, the Re-ranker is trained to re-score the candidates provided by the Generator. For calculating these scores, two components are considered: the candidates and the rank preference of them.

At inference time, given a specific user interaction history, both the Candidate Generator and the Re-ranker take it as input and operate sequentially. More details about the Candidate Generator and the Re-ranker are given in Sections 4.1 and 4.2.
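The sequential inference flow can be sketched as follows. This is a minimal illustration under stated assumptions: the function and parameter names are hypothetical, both scorers are passed in as plain callables, and the remaining candidates are assumed to be appended after the re-ranked ones in the Candidate Generator's original order, which is our reading of the description rather than a confirmed detail.

```python
def two_stage_recommend(history, candidates, generator_scores, reranker_scores, k):
    """Rank with the generator, then re-rank only its top-k candidates.

    generator_scores(history, candidates) -> list of scores, one per candidate
    reranker_scores(history, top)         -> list of scores, one per top candidate
    Assumption: candidates beyond the top-k keep the generator's order.
    """
    gen = generator_scores(history, candidates)
    order = sorted(range(len(candidates)), key=lambda i: -gen[i])
    ranked = [candidates[i] for i in order]

    top, rest = ranked[:k], ranked[k:]
    # The re-ranker sees the candidates in generator order, so the position
    # of each candidate in `top` is its candidate rank.
    new = reranker_scores(history, top)
    order = sorted(range(len(top)), key=lambda i: -new[i])
    return [top[i] for i in order] + rest


# Toy scorers (hypothetical): a popularity table and a "seen before" bonus.
popularity = {"a": 4, "b": 3, "c": 2, "d": 1}
toy_generator = lambda history, cands: [popularity[c] for c in cands]
toy_reranker = lambda history, top: [1.0 if c in history else 0.0 for c in top]
result = two_stage_recommend(["b"], ["d", "c", "b", "a"], toy_generator, toy_reranker, k=2)
```

Treating the generator as a callable mirrors the black-box role it plays in Figure 2: any recommender that can score a candidate set for a given history can be plugged in.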

4.1. The Candidate Generator

The Candidate Generator can be an arbitrary recommender that takes a user session as input and ranks the set . For training we consider .

The selection of the algorithm for the Candidate Generator depends mostly on the characteristics of the dataset and on how well the algorithm generates high-quality candidates.

4.2. The Re-ranker

The Re-ranker takes the same user click sequence as the Candidate Generator does, but concentrates on ranking a smaller set of candidates determined by the Candidate Generator it is connected to.

We employ a variant of STAMP in the Re-ranker to include the candidate rank information in the model. Specifically, the STAMP encoder for the user click history is reused. Its output is a simple element-wise multiplication between the history representation and the anchor item representation described in (Liu et al., 2018).


The STAMP encoder was chosen because of its architectural simplicity and promising performance in predicting the user preference.

Before defining the calculation of the scores, Candidate Rank Embeddings must be introduced. Given a sequence, we obtain a sorted list of candidates of fixed size from the Candidate Generator. The Candidate Rank is an integer that denotes the position of an item in this list. Each Candidate Rank is associated with a Candidate Rank Embedding. Therefore, CREs are positional embeddings shared among the different candidate lists produced by the Candidate Generator. The CREs for a Re-ranker that takes a given number of candidates into account can be represented as a matrix with one embedding vector per candidate rank.

With the CRE defined the scores are specified in the following equation:


In this equation, the learnable weight matrices and bias terms belong to the feed-forward network layers; the remaining inputs are the item embeddings of the candidates and the candidate rank embedding matrix. Note that the candidate rank embedding matrix is the same for all user sequences because it depends only on the rank of the candidates. All the embeddings are initialized randomly and trained together with the model.
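To make the scoring idea concrete, the sketch below adds an item-preference term and a rank-preference term for each candidate. It is an illustration only: the single-layer tanh projections, the helper names `dense_tanh` and `cre_scores`, and the argument layout are assumptions made here, and the exact layer structure of the paper's equation may differ.

```python
import math


def dense_tanh(x, weights, bias):
    """One feed-forward layer, tanh(W x + b); `weights` is a list of rows."""
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cre_scores(user_repr, item_embs, rank_embs, item_layer, rank_layer):
    """Score candidates as item preference plus rank preference.

    user_repr : encoded session (assumed to come from the session encoder)
    item_embs : embeddings of the candidates, in generator order
    rank_embs : candidate rank embeddings, row i belongs to rank i
    *_layer   : (weights, bias) of the two assumed projection layers
    """
    item_query = dense_tanh(user_repr, *item_layer)  # projected item preference
    rank_query = dense_tanh(user_repr, *rank_layer)  # projected rank preference
    return [dot(item_query, e) + dot(rank_query, r)
            for e, r in zip(item_embs, rank_embs)]


identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
scores = cre_scores(
    user_repr=[1.0, 0.0],
    item_embs=[[1.0, 0.0], [0.0, 1.0]],
    rank_embs=[[0.0, 0.0], [1.0, 0.0]],
    item_layer=identity,
    rank_layer=identity,
)
```

Because `rank_embs` is indexed by position rather than by item, the same rows are reused for every session, which is what lets the model pick up regularities in how users respond to the generator's ordering.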

We train the Re-ranker to predict the next click among the candidates by using the cross-entropy loss.
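For a single training example, the cross-entropy loss over the candidate scores can be written as below. This is a generic softmax cross-entropy sketch, not code from the paper; the function name and the use of the target's position in the candidate list as the label are assumptions for illustration.

```python
import math


def candidate_cross_entropy(scores, target_index):
    """Softmax cross-entropy of one example over its candidate list.

    `scores` are the re-ranker scores of the candidates and `target_index`
    is the position of the clicked item within that list.
    """
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_norm - scores[target_index]


loss_uniform = candidate_cross_entropy([0.0, 0.0], 0)
loss_confident = candidate_cross_entropy([10.0, 0.0], 0)
```

The loss is only defined when the clicked item is among the candidates, which is consistent with restricting training to examples whose target falls inside the candidate set.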

The main difference between our solution and STAMP is the use of the candidate rank embeddings and two non-linear projections of the user representation. These projections are used to approximate the click probability of the candidates and the rank preference of users. The first projection focuses on predicting the embedding of the target item, while the second one focuses on predicting the embedding of the position of the target item in the ranked candidate list.

The intuition behind learning the rank preference is that the information from the output of the Candidate Generator can flow into the model, and a balance between the newly learned item preference and the old rank can be obtained by summing the two user preferences.

Furthermore, since the ranking score comes from the dot product of the candidate rank embeddings and a projection of the user representation, the model can learn the relationship between the user and the position of the target. For example, it can recognize which types of users like to click the top positions, or the items that co-occur with the anchor very often when the Candidate Generator produces its candidates using co-occurrence information. This behavior can be difficult to learn with a model that considers only the user-item preference.

In Section 5.5 an analysis of the importance of using the candidate rank information is presented, where we compare our two-stage approach against one without CREs.

So far, we only tried using a one-to-one mapping between Candidate Ranks/positions and CREs. In applications with a large candidate set, having multiple ranks share one CRE could be beneficial because training signals can be shared among several unpopular positions.
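One way such sharing could look is sketched below. This is purely hypothetical and was not evaluated in the paper: the function name, the split into an unshared head and fixed-width buckets, and the parameter values are all assumptions chosen for illustration.

```python
def rank_to_cre_index(rank, head_size, bucket_width):
    """Map a zero-based candidate rank to the index of a shared CRE.

    Hypothetical scheme: the first `head_size` ranks keep their own
    embedding, later ranks share one embedding per `bucket_width` ranks.
    """
    if rank < 0:
        raise ValueError("rank must be non-negative")
    if rank < head_size:
        return rank
    return head_size + (rank - head_size) // bucket_width


# With 100 candidates, a head of 10 and buckets of 5 ranks,
# the number of embeddings to train drops from 100 to 28.
num_embeddings = rank_to_cre_index(99, head_size=10, bucket_width=5) + 1
```

Setting `bucket_width` to 1 recovers the one-to-one mapping used in the experiments.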

5. Experiments and Analysis

5.1. Baselines

The following recommendation algorithms are used in our experiments as the baselines.

  • Item-Item Collaborative Filtering (I2I-CF): it considers only the most similar items for the last-seen item in the user interaction/click sequence. It is pre-calculated using a variant of the cosine similarity function described in (Aiolli, 2013).

  • Attention-based GRU (NARM) (Li et al., 2017)

  • Short-Term Attention Model (STAMP) (Liu et al., 2018)
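As a rough illustration of how an item-item baseline of this kind can be pre-calculated, the sketch below computes a co-occurrence-based cosine similarity from sessions. The exact variant used for I2I-CF is not specified in the text, so everything here is an assumption: the function names, the use of sessions as the co-occurrence unit, and the `alpha` exponent (with `alpha=0.5` the formula is the standard cosine on binary data).

```python
from collections import defaultdict
from itertools import combinations


def item_similarities(sessions, alpha=0.5):
    """Cosine-style similarity between items that co-occur in sessions."""
    item_count = defaultdict(int)
    pair_count = defaultdict(int)
    for session in sessions:
        unique = sorted(set(session))  # count each item once per session
        for item in unique:
            item_count[item] += 1
        for a, b in combinations(unique, 2):
            pair_count[(a, b)] += 1

    sims = defaultdict(dict)
    for (a, b), together in pair_count.items():
        sims[a][b] = together / (item_count[a] ** alpha * item_count[b] ** (1 - alpha))
        sims[b][a] = together / (item_count[b] ** alpha * item_count[a] ** (1 - alpha))
    return sims


def most_similar(sims, item, n):
    """Top-n neighbours of `item`, highest similarity first, ties by item id."""
    neighbours = sims.get(item, {})
    return sorted(neighbours, key=lambda j: (-neighbours[j], j))[:n]


toy_sessions = [["a", "b"], ["a", "b", "c"], ["a", "c"], ["b"]]
toy_sims = item_similarities(toy_sessions)
neighbours_of_a = most_similar(toy_sims, "a", 2)
```

Since the table depends only on the last-seen item, it can be computed offline and looked up at request time.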

5.2. Experiment Setup

Each of the baselines is used as a Candidate Generator in two separate sets of experiments. The first set of experiments evaluates the model performance on the Fashion-Similar dataset, predicting the next similar item on the carousel that the user would click. The second set of experiments compares the performance of CREs on two next item prediction datasets when combined with different baseline models. Before training the Re-ranker, we first train the corresponding baseline and keep it fixed as the Candidate Generator. Not all training sequences are used: the Re-ranker only considers those whose target item falls within the candidate set returned by the Candidate Generator. Five percent of the training examples are randomly sampled for validation. During training, the model performance is checked every 1000 steps and the best model is selected by its Recall@5. Adam (Kingma and Ba, 2014) is used to train the Re-ranker for a fixed number of epochs with a learning rate of 0.001 and a fixed batch size. We set the number of candidates returned by the Candidate Generator to 100. The item embeddings and the model weights borrowed from STAMP are initialized following the best settings reported in (Liu et al., 2018). The weights of the feed-forward projection layers are initialized with Xavier initialization (Glorot and Bengio, 2010). To label the results, we use RRCRE-X as an abbreviation of re-ranking the output of an approach X with CREs.
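The selection of training examples for the Re-ranker can be sketched as follows. The function name, the shape of the examples, and the `generate_candidates` callable are hypothetical; only the rule itself (keep an example when its target is among the top candidates) comes from the setup described above.

```python
def select_reranker_examples(examples, generate_candidates, k=100):
    """Keep only the examples whose target is among the top-k candidates.

    examples            : iterable of (history, target) pairs
    generate_candidates : history -> ranked candidate list (best first)
    Returns (history, candidates, target_rank) triples, where target_rank
    is the zero-based position of the target in the candidate list.
    """
    kept = []
    for history, target in examples:
        candidates = generate_candidates(history)[:k]
        if target in candidates:
            kept.append((history, candidates, candidates.index(target)))
    return kept


# Toy generator (hypothetical): candidates depend on the last-seen item only.
def toy_generate(history):
    return ["a", "b", "c"] if history[-1] == "x" else ["d", "e", "f"]


toy_examples = [(["x"], "b"), (["y"], "a"), (["x"], "c")]
kept = select_reranker_examples(toy_examples, toy_generate, k=2)
```

The returned rank doubles as the class label for the cross-entropy loss and as the index of the Candidate Rank Embedding of the target.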

5.3. Predicting the Fashion Similar Item

In this task, our goal is to predict the clicked products in the similar-item-recommendation carousel, given the latest actions of a user.

Only the Fashion-Similar dataset was used because the other datasets are mainly used for general next click prediction tasks. The dataset was collected from customers' actions in several major European markets over multiple days. Every record in this dataset consists of a click on the similar-item recommendation, which is the target, along with the latest 12 items the user browsed before interacting with the similar-item recommendation.

The dataset was split into training and test sets by time, leaving 9 days for the training set and the last day for the test set. With this arrangement, the offline evaluation measures how well the model performs on the next day, assuming we retrain our algorithms daily.
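A time-based split of this kind can be sketched as below. The record layout, a `(day, example)` tuple, and the function name are assumptions made for illustration; the actual schema of the dataset is not described in the text.

```python
def split_by_day(records, test_days=1):
    """Split (day, example) records so the last `test_days` days form the test set."""
    days = sorted({day for day, _ in records})
    held_out = set(days[-test_days:])
    train = [r for r in records if r[0] not in held_out]
    test = [r for r in records if r[0] in held_out]
    return train, test


toy_records = [(1, "e1"), (2, "e2"), (3, "e3"), (3, "e4")]
train_set, test_set = split_by_day(toy_records)
```

Splitting on the calendar day rather than at random keeps every test interaction strictly later than the training data, which matches the daily retraining scenario.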

The training set contains 8,353,562 examples from 1,435,605 users who interacted with 650,228 items in total. The test set contains 624,559 examples from 233,051 users who interacted with 259,784 items in total.

Method      Recall@20   MRR@20
I2I-CF      0.8106      0.2611
STAMP       0.7244      0.2443
NARM        0.6989      0.2510
RRCRE-CF    0.8381      0.2981

Table 1. Performance of the baselines and the proposed method on predicting the similar item.
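For reference, the two metrics reported here are commonly computed as in the sketch below: Recall@K is the share of test cases whose target appears in the top K, and MRR@K averages the reciprocal rank of the target, counting zero when it is outside the top K. The function name and input layout are assumptions for illustration.

```python
def recall_and_mrr_at_k(ranked_lists, targets, k=20):
    """Return (Recall@k, MRR@k) over paired recommendation lists and targets."""
    if len(ranked_lists) != len(targets) or not targets:
        raise ValueError("need one non-empty, equally sized list of each")
    hits = 0
    reciprocal_sum = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top = ranked[:k]
        if target in top:
            hits += 1
            reciprocal_sum += 1.0 / (top.index(target) + 1)
    return hits / len(targets), reciprocal_sum / len(targets)


toy_lists = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
toy_targets = ["a", "a", "x"]
recall, mrr = recall_and_mrr_at_k(toy_lists, toy_targets, k=2)
```

Recall@K ignores where in the top K the target lands, while MRR@K rewards placing it near the top, so a re-ranker that only reorders a fixed candidate list is expected to affect MRR more directly.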

5.4. Predicting the next click

For this task, the objective is to predict the next user interaction given the past click history. We used the following datasets:

  • YooChoose 1/4: the dataset was preprocessed exactly as described in (Tan et al., 2016).

  • Diginetica: we preprocessed the dataset with the approach described in (Wu et al., 2019). Note that Wu et. al. introduced additional preprocessing to the dataset compared to (Li et al., 2017).

                YooChoose 1/4           Diginetica
Method          Recall@20   MRR@20      Recall@20   MRR@20
I2I-CF          0.5259      0.2001      0.3760      0.1211
STAMP           0.6983      0.2915      0.4834      0.1588
NARM            0.6973      0.2921      0.5015      0.1599
RRCRE-I2I-CF    0.5586      0.2446      0.3773      0.1220
RRCRE-STAMP     0.7086      0.3133      0.5046      0.1677
RRCRE-NARM      0.7029      0.3082      0.5116      0.1675

Table 2. Performance of the baselines and the proposed method on the task of predicting the next click.

5.5. Offline Results and Analysis

From Table 1 we can observe that STAMP and NARM do not outperform I2I-CF in the task of predicting the similar item in the offline evaluation. The result on Fashion-Similar is counter-intuitive, and further investigation is left as future work.

However, with RRCRE-I2I-CF (listed as RRCRE-CF in Table 1) we are able to improve on I2I-CF in the similar item prediction task in terms of Recall@20 and MRR@20. This is because the model can use the hidden information in the ranking of the baseline together with the session information captured by the attention network of the Re-ranker. An online test, described in Section 5.6, was performed to confirm the offline results.

As shown in Table 2, STAMP and NARM perform significantly better than I2I-CF in the next click prediction task.

We also applied our method with STAMP and NARM as Candidate Generators in the next click prediction task. The evaluation shows that it slightly improves the Recall@20 and MRR@20 of STAMP and NARM on both YooChoose 1/4 and Diginetica.

Figure 3. Our approach with and without Candidate Rank information.

To understand the improvement obtained by applying the Candidate Rank Embeddings, we compared the performance of the proposed method with and without CREs, i.e. with the final scores of the candidates computed without the rank preference term. We use RR-X as an abbreviation of re-ranking the output of an approach X without CREs. The result is illustrated in Figure 3. For the Fashion-Similar dataset, we can observe that simply re-ranking the most relevant candidates from I2I-CF without CREs does not lead to superior results. On the other hand, when training with CREs we obtain the better result listed in Table 1. We also compare RR-STAMP and RRCRE-STAMP on one of the next click prediction datasets. It turns out that RRCRE-STAMP outperforms the baseline from epoch 1, while RR-STAMP requires more iterations. This is because RR-STAMP has to learn the next clicked target from randomly initialized model parameters without the rank information from STAMP being present.

Figure 4. Recall@5 of the best approaches on each dataset given different numbers of candidates to re-rank.

Additionally, we illustrate the behaviour of the proposed method with respect to the number of candidates to re-rank. In Figure 4 we can observe improvements in recall@20 even with a small number of candidates. In addition, we found that the recall plateaus or decreases when the number of candidates exceeds a certain threshold. One possible reason is that the candidates associated with the low ranks rarely appear as targets in the dataset. As a result, the CREs for these ranks may not have been well trained and may have captured misleading information. However, since using a relatively small number of candidates simplifies training and serving in production, no further investigation was carried out.

5.6. Online test in Zalando

In the previous section, an improvement in the offline metrics with respect to I2I-CF was observed. As a consequence, an online test hosted by Zalando was performed to compare I2I-CF and RRCRE-I2I-CF.

RRCRE-I2I-CF was served using CPU machines running TensorFlow Serving. For this algorithm, the recommendations are calculated in real time. In contrast, I2I-CF was served from a static table stored in memory.

I2I-CF was giving static non-personalized similar recommendations and RRCRE-I2I-CF was adapting the similar item list depending on the user’s previous actions.

The models were updated every day to adapt to the latest user behaviour, and we compared their performance in major European markets for several days.

To ensure that the recommendations satisfy the similarity constraint of the product, some filters based on category trees were applied to the output of both methods.

The results showed a relative improvement in engagement, with a significant +2.84% increase in Click Through Rate. This indicates that using the session of users to generate a personalized ranking of similar items has a positive effect, and it supports the results of the offline experiments.

6. Conclusion and Future Work

In this study, the possibility of improving the fashion similar item recommendation was explored with a two-staged re-ranking approach that is able to benefit from the Candidate Rank information, the session of the user and a small set of candidates.

With this approach, the Recall@20 and MRR@20 of I2I-CF were improved on the Fashion-Similar dataset, and the success in the offline evaluation was confirmed by an online test.

The proposed approach was also confirmed to be useful to improve the performance of two advanced session-based recommendation algorithms, STAMP and NARM on the next click prediction datasets YooChoose 1/4 and Diginetica. Despite the success in the offline evaluation, further experiments are needed to confirm the impact of the proposed method in the context of session-based recommendation.

The authors are immensely grateful to Alan Akbik, Andrea Briceno, Humberto Corona, Antonino Freno, Francis Gonzalez, Romain Guigoures, Sebastian Heinz, Bowen Li, Max Moeller, Roberto Roverso, Rezar Shirvany, Julie Sanchez, Hao Su, Lina Weichbrodt and Nana Yamazaki for their support, revisions, suggestions, ideas and comments that greatly helped to improve the quality of this work.