Latent Variable Session-Based Recommendation

04/24/2019 ∙ by David Rohde, et al. ∙ Criteo, Durham University

Session-based recommendation provides an attractive alternative to the traditional feature engineering approach to recommendation. Feature engineering approaches require hand-tuned features of the user's history to be created to produce a context vector. In contrast, a session-based approach is able to dynamically model the user's state as they act. We present a probabilistic framework for session-based recommendation. A latent variable for the user state is updated as the user views more items and we learn more about their interests. The latent variable model is conceptually simple and elegant, yet requires sophisticated computational techniques to approximate the integral over the latent variable. We provide computational solutions using both the re-parameterization trick and the Bouchard bound for the softmax function, and we further explore employing a variational auto-encoder and a variational Expectation-Maximization algorithm for tightening the variational bound. The model performs well against a number of baselines. The intuitive nature of the model allows an elegant formulation combining correlations between items with their popularity, and sheds light on other popular recommendation methods. An attractive feature of the latent variable approach is that, as the user continues to act, the posterior on the user's state tightens, reflecting the recommender system's increased knowledge about that user.







1. Introduction

A traditional approach to building a recommender system is to use feature engineering techniques to summarize a user's history into a feature vector of fixed dimension, which enables machine learning algorithms to be applied to do next item prediction or to model the outcome of recommendations. Feature engineering of the variable dimension user history is often quite compromised; for example, the simple heuristic of looking at the most recent item is often employed. Session-based recommendation represents a significant step forward: instead of producing a feature vector, there is a representation of the recommender system's state of knowledge about the user's interests at a certain point in time.

Session-based models require the temporal and sequential features of user behavior to be modeled. In this approach, rather than feature engineering being used to build a model, a user's state is dynamically updated as the user acts and responds to recommendations. This has traditionally been approached in the recommender system community chiefly using Recurrent Neural Network (RNN) based approaches (Hidasi and Karatzoglou, 2018a; Quadrana et al., 2017, 2018; Zolna and Romanski, 2017; Smirnova and Vasile, 2017; Tan et al., 2016); for other approaches see (Ying et al., 2018; Kang and McAuley, 2018; Shani et al., 2005). RNNs are powerful temporal models which can capture subtle user dynamics; for example, users may visit certain items with higher probability in a certain order.

There are two reasons that the recommender system's representation of a user may change in time: first, the temporal nature of the user's interests, e.g. they are in the market for a product until they buy it and then they no longer are; second, the recommender system's understanding of the user will improve as the user performs more actions and reveals more of their implicit interests.

Employing session-based recommendation is also an important step towards modeling long term rewards, such as sales. This is because a short term reward (e.g. a click) might reasonably be measured using a within-subject study design (Greenwald, 1976), but a between-subject (i.e. session-based) design is needed for long term rewards.

Our primary contributions in this paper are as follows:

  • We demonstrate how much of the power of embedding methods can be recovered through the use of a low-rank multivariate Gaussian latent variable model, transformed to a categorical output via a softmax. Our framework is semi-Bayesian: we integrate over the latent variable, which, despite the model's apparent simplicity and elegance, brings associated computational challenges.

  • We provide an elegant way to convert a user history, containing a variable number of item views, into a fixed dimensional representation of the user’s history.

  • We derive an analytical variational bound for our model and show how to train embeddings using a variational auto-encoder. We also show that the model can be trained without the bound using the re-parameterization trick.

  • We show that the variational auto-encoder can be used at prediction time to produce a user representation, and we also derive a variational EM algorithm for the same purpose.

  • We show that the method performs well on both synthetic and real-world data.

To aid in the reproducibility of the approaches, we release all source code and relevant experimental scripts (code to be released post review).

In Section 2 we introduce the latent variable model and show that it has intuitively pleasing properties. In Section 3 we review relevant literature. In Section 4 we explain our computational framework for performing approximate inference. In Section 5 we outline the experimental setup, in Section 6 we present results, and finally Section 7 makes some concluding remarks.

2. Background

2.1. A Latent Variable Model For Item Views

The model we introduce in this paper posits that item views in a session are explained by a session-level latent variable. The model may be viewed either as a probabilistic matrix factorization model or as covariance estimation for a low-rank multivariate normal. Perhaps surprisingly, when combined with computational machinery to marginalize the latent variable, this simple model is able to reproduce many interesting features of a session-based recommender system.

Symbol | Dimension | Description
– | Scalar | A given user's id.
– | Scalar | Sequential time.
P | Scalar | Total number of products.
K | Scalar | The size of the embedding.
– | Scalar | Product id for a given user at a given time.
– | – | A given user's state.
– | – | Product embedding matrix.
– | – | Product embedding for a given item.
– | – | The mean of the user state.
– | – | The covariance of the user state.
– | – | Item popularity shift.
– | Scalar | Session length for a given user.
Table 1. Notations and Definitions

Our model describes a generative process for the types of products that users co-view in sessions. Throughout this paper, we will make use of the notation introduced in Table 1. We use an index to denote a user or a session, sequential time to denote position within the session, and a product id, between 1 and P where P is the total number of products, to denote which product they viewed. The user's interest is described by a K-dimensional latent variable which can be interpreted as the user's interest in topics, and each user has an associated session length. We then assume the following generative process for the views in each session:

For the moment we assume that the model parameters have already been estimated; we defer the topic of estimation to later sections. In production we have observed a user's online viewing history and we would like to produce a representation of the user's interests. Our proposal is to use Bayesian inference to infer the posterior over the user's latent state as a representation of their interests. This representation of interests can then be used as a feature for training a recommender system on so-called "bandit feedback", i.e. logs of the recommender system itself (this is distinct from the user history). The above is a "matrix factorization" view of our model; however, one could also view it as a "covariance estimation" task:

We neglect the mathematical complications due to the covariance matrix (usually) being low rank, with the consequence that the resulting distribution formally has no density.
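To make the model concrete, the generative process can be sketched in a few lines of numpy. The dimensions, parameter values, and names (W for the product embedding matrix, b for the item popularity shift, z for the user state) are illustrative assumptions rather than values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

P, K = 7, 5                        # number of products, embedding size (illustrative)
W = rng.normal(size=(P, K))       # product embedding matrix (assumed already estimated)
b = rng.normal(size=P)            # item popularity shift

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One session: draw a user state, then draw item views conditionally i.i.d. given it.
z = rng.normal(size=K)                      # latent user state, z ~ N(0, I)
probs = softmax(W @ z + b)                  # per-item view probabilities
session = rng.choice(P, size=10, p=probs)   # ten item views in this session
```

Marginalizing z, the log-odds W z + b are jointly Gaussian with mean b and low-rank covariance W Wᵀ, which is the "covariance estimation" view described above.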

2.2. Case Study

Before going into the details of the model and the related computational material, we demonstrate on a simple case study that this model is surprisingly powerful in producing an updateable representation of a user's interests.

Imagine we have a recommender system that has just seven products: Sleek Phone, City Phone, Rice, Couscous, Beer, Women's Shirt, Men's Shirt. We further imagine that an offline job has already generated the embeddings and parameters. These embeddings establish correlations such that users interested in the Sleek Phone are also interested in the City Phone, users interested in Rice are also interested in Couscous, users interested in Beer have some weak interest in Rice and Couscous, and finally users interested in the Men's Shirt are anti-correlated with users interested in the Women's Shirt. To make this concrete we assume that:

The fact that most entries in the embedding matrix are positive simplifies the discussion, as it means that high values imply interest. The only two negative values are used to model a negative correlation between female and male shirts.

An interesting facet of this model is that it is not trivial to establish which is the most popular product simply by examining the parameters: while the item popularity shift does reflect popularity, it is also affected by the embeddings in complex ways. In this constructed example we are able to label the five components of the latent variable as phones, grains, drinks, women's clothes and men's clothes.

We now consider how different user histories affect the posterior over the user's state. This quantity can be approximated accurately and easily using the Stan probabilistic programming language (Team, 2018), although later we will show that excellent performance can also be obtained using variational methods, which are able to scale to real world recommender systems at a cost comparable to methods such as Recurrent Neural Networks.

Figure 1. User representation (left) and next item prediction for a user with one sleek phone in their history
Figure 2. User representation (left) and next item prediction for a user with one sleek phone and two city phones in their history
Figure 3. User representation (left) and next item prediction for a user with one sleek phone and twenty city phones in their history
Figure 4. User representation (left) and next item prediction for a user with two female shirts and one sleek phone in their history

Figures 1–4 demonstrate the intuitive behavior of this simple model. The results of three approximate methods are presented and shown to be in good agreement; we defer discussion of the approximation methods here, except to note that we take Markov chain Monte Carlo to be the gold standard.

In Figure 1 we observe the case where a single Sleek Phone appears in the user's history. As a consequence of the short history there is significant uncertainty in the knowledge about the user, although the embedding reflects higher interest in the phone category; the next item prediction is high for both the Sleek Phone and the City Phone. In Figure 2 the Sleek Phone has been viewed once and the City Phone twice; as a consequence there is less uncertainty in the knowledge about the user, and the next item prediction is higher for both the Sleek Phone and the City Phone. In Figure 3 the user has viewed the City Phone twenty times and the Sleek Phone just once; as a consequence the user's embedding shows a strong interest in phones with low uncertainty, and the next item prediction is distributed among the two phones. In Figure 4 we observe a user who has viewed the Sleek Phone and Women's Shirts: the user's interests in phones and women's clothes are increased and their interest in men's clothing is decreased, indicating that the negative entry in the embedding has the desired effect. The next item predictions also reflect these preferences.

This simple model has shown a remarkable ability to summarize a user's interests and is able to reflect both strong and weak knowledge of the user. Having demonstrated the intuitive value of this model, we now show how to estimate the model parameters, how to efficiently approximate the posterior so that the user representation can be updated online as the user acts, and finally how to do next item prediction, which may serve as a proxy for the recommendation task.

3. Related literature

3.1. Scalable Variational Approximations

Two approaches for scalable Bayesian inference focus on approximating a posterior on a fixed dimensional parameter space rather than the latent variable case we care about; as such they are not appropriate for our setting (Kucukelbir et al., 2017; Ranganath et al., 2014).

Under a conditional independence assumption it is often possible to reduce the variational Expectation-Maximization algorithm to a finite sum fixed point iteration, where the finite sum is over the data plus another term associated with the prior and the entropy. This formulation rather directly admits the Robbins-Monro stochastic approximation algorithm (Robbins and Monro, 1951) and has been effective in complete data exponential family models (Hoffman et al., 2013). While this algorithm does indeed apply to simple versions of our model, the "M-step" for estimating the embeddings would require infeasibly large matrix inverses.

3.2. Latent Variable Models

Our model is a special case of (Liang et al., 2018) (also see (Rezende et al., 2014; Lafferty and Blei, 2006)), but has stronger analytical properties, including an analytical bound and an EM algorithm, which we exploit both to gain computational advantages and to highlight similarities with other methods.

An interesting suggestion made in (Liang et al., 2018) is to reduce the contribution of the Kullback-Leibler component of the lower bound, e.g. by multiplying it by a value lower than one, such as one half. They justify this with a combination of empirical results and by interpreting the model as containing a reconstruction error and regularization. In this paper we are primarily focused upon producing a user representation; multiplying the KL component by one half would have the effect of "squaring the likelihood", i.e. double counting the data, resulting in artificially reduced uncertainties on the user representation. As we are primarily interested in producing a user representation we do not pursue that method here, although we do acknowledge the excellent empirical results they present.

We use a semi-Bayesian or latent variable framework, integrating over the latent variable but estimating the parameters. There is a literature discussing the improved statistical properties of this procedure: for theoretical arguments see (Welling et al., 2008); for a demonstration of empirical performance see (Dikmen and Févotte, 2011). A critical observation is that for a traditional matrix factorization the parameter space grows with the number of users; this makes traditional statistical notions such as convergence difficult to apply, and indeed means that if a new user arrives a fit must be performed before a prediction can be made.

In contrast, if one of the matrices is integrated out then the dimensionality of the model is fixed and traditional statistical notions such as convergence again become relevant.

3.3. Word2Vec and Prod2Vec

The skipgram model and skipgram with negative sampling, collectively known as word2vec (Mikolov et al., 2013), caused a sensation in both the natural language and recommender systems communities (Gunawardana et al., 2009). If we define the event matrix of users by products, then word2vec operates on the co-event matrix; often the rows and columns are referred to as target and context. The co-event matrix is of size P × P, which while large is often much smaller than the N × P event matrix, where N is the number of users, so operating on this matrix is more efficient computationally. There are, however, disadvantages in modeling the co-event matrix directly. One is that it has some quite subtle properties, e.g. some values of the matrix are impossible. An example of a valid co-event matrix for P = 3 is:

This matrix is consistent with two user sessions: the first session visited products 1 and 2, and the second session visited product 3. Now consider:

This matrix is inconsistent with any event matrix containing only positive counts. Intuitively we can see this by noting that the diagonal implies that each product has been observed as associated with one user each. The first row (or by symmetry the first column) says that products 1, 2 and 3 all occur together. The only way we can have each product viewed exactly once and all occurring together is for all entries to be associated, yet the remaining off-diagonal entries are inconsistent with this. The skipgram model suggests modeling the rows of the co-event matrix as multinomial draws, which gives positive probability to events that cannot happen; given the complexity of the constraints on the co-event matrix, it is difficult to see how this can be avoided except by modeling the event matrix directly.
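The valid example above can be reproduced mechanically. A minimal numpy sketch, where the event matrix X (sessions × products, P = 3) is taken from the two-session example in the text:

```python
import numpy as np

# Event matrix X: rows are sessions, columns are products (P = 3).
# Session 1 viewed products 1 and 2; session 2 viewed product 3.
X = np.array([[1, 1, 0],
              [0, 0, 1]])

# The co-event matrix that word2vec-style methods effectively operate on.
C = X.T @ X

# By contrast, a matrix with a unit diagonal and all off-diagonal entries
# equal to one (each product viewed once, yet every pair co-occurring)
# cannot arise as X.T @ X for any non-negative integer event matrix X.
```

Here C comes out as [[1, 1, 0], [1, 1, 0], [0, 0, 1]], matching the valid example in the text.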

A further contribution in (Mikolov et al., 2013) was a negative sampling heuristic, which allowed these methods to scale to very large numbers of categories by avoiding large summations at every iteration. However, the meaning of negative sampling remains unclear and it complicates producing probabilistic algorithms. For example, within these classes of algorithms there is a tuning parameter deciding how many negative examples to generate. Of course, increasing the amount of (artificially) generated data will (artificially) reduce uncertainty on parameter estimates; while there have been attempts at a Bayesian skipgram model (Barkan, 2017), it is difficult to see how any method employing this heuristic can correctly control the uncertainties that it computes.

It is interesting to reflect on the widespread successful use of these methods. The heuristics employed do not make it easy to make a complete comparison with our method, but we can make a few comments. If the underlying model of the event matrix were Gaussian (this cannot be true, as it has support only on the natural numbers), then the co-event matrix would be the scatter matrix, which along with the mean gives the sufficient statistic of a Gaussian distribution. Taking the eigenvalue decomposition of this would result in principal component analysis (without the usual subtracting of the means step), and there is the well known result that PCA can be computed either by an eigenvalue decomposition of the scatter matrix or by the singular value decomposition of the event matrix, which, loosely accounting for the change of support and the integration, is what our method achieves. Loosely we can view our latent variable model in this sense, i.e. as estimating a low-rank covariance or, equivalently, doing a matrix factorization. The use of dot products between embeddings can then be viewed as covariances, and cosine distances as correlations.
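The eigendecomposition/SVD equivalence mentioned above is easy to check numerically. A sketch with a random synthetic event matrix (the shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 7))          # synthetic event matrix (sessions x products)

# Eigendecomposition of the scatter matrix X^T X ...
evals, evecs = np.linalg.eigh(X.T @ X)

# ... yields the same spectrum as the singular value decomposition of X:
# the squared singular values of X are the eigenvalues of X^T X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

assert np.allclose(np.sort(s**2), np.sort(evals))
```

The right singular vectors of X likewise span the same principal subspace as the eigenvectors of the scatter matrix, up to sign and ordering.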

The non-probabilistic nature of word2vec poses problems that are typically dealt with using heuristics, such as using these embeddings as features in the "feature engineering approach". Several questions are difficult to resolve, e.g.: How do you do next item prediction (combining popularity with the associated embeddings)? How do you do recommendation? How do you combine several items of history into a fixed dimensional user state?

3.4. RNN Session Based Recommendation

In the session-based recommender system literature there is a significant body of work applying RNNs to the recommendation problem; in this case, like us, they apply the model directly to the event data. The RNN is a more flexible model, able to capture more sophisticated sequences, e.g. a shopper transitioning to being interested in complementary products after a purchase event. This extra flexibility is powerful, but also requires more data to identify these effects.

In contrast, the latent variable model we introduce is effectively a low rank Gaussian prior on a categorical variable; as such, up to the capacity of the model, the law of large numbers applies, i.e. if a user had a long enough history and the embedding size K was greater than or equal to the number of products, then the next item prediction would converge to the empirical history (De Finetti, 1980). The RNN does not a priori incorporate the law of large numbers; it is a flexible sequential model, and if the law of large numbers holds, as we might approximately expect, then the RNN will need to see more data to recognize this. It is an important remark that if the user embedding size is less than the number of products then, regardless of whether an RNN or a latent factor model is used, it is not possible for the next item prediction for a user to converge to their empirical history, due to a lack of capacity. The non-linear model of (Liang et al., 2018) has the same limitation. This low capacity is typically not a problem, as user sequences are very short, but it does highlight that there are limitations introduced by using small embedding sizes, i.e. the ability to distinguish users interested in subtly different products may be lost; of course, the advantage is vastly improved tractability. The stronger assumptions of the latent variable method suggest its realm of applicability is when those assumptions are true, or when they are approximately true and there is insufficient data to learn a higher capacity model such as an RNN.

4. Approximate Inference

In previous sections we discussed the model and showed it has intuitively reasonable properties. In this section we show (i) how to learn the embeddings and (ii) how, at deployment, to make predictions by approximating the posterior over a user's representation, i.e. how to compute the posterior in real time.

4.1. Optimizing the lower bound

In order to make this method practically usable we need two components: first, to be able to estimate the model parameters efficiently, and second, to be able to rapidly produce and update user embeddings based on a user's activity. To solve both parts of the problem we will employ variational approximations, which work by turning integration problems into optimization problems.

The model we introduce has the form:

If we use a normal distribution as the variational posterior, then the variational bound has the form:

We see that there is a problematic term associated with the denominator of the softmax. We consider two possible computational approaches to this: the Bouchard bound (Bouchard, 2007) and the re-parameterization trick (Kingma and Welling, 2014).

4.1.1. Bouchard Bound

The Bouchard bound introduces a further approximation and additional variational parameters but produces an analytical bound:

where λ(ξ) is the Jaakkola and Jordan function (Jaakkola and Jordan, 1997):

The bound may be optimized using the following variational EM algorithm, which enjoys the coordinate descent property of an EM algorithm, guaranteeing that the bound tightens at each iteration. The algorithm here is the dual of the one presented in (Bouchard, 2007): we assume the embedding is fixed and update the variational parameters, where the algorithm they present does the opposite. The EM algorithm consists of cycling the following update equations:

There are other variational bounds that may be considered for this problem, most notably the tilted bound (Knowles and Minka, 2011). Even though the Bouchard bound is loose compared to the tilted bound, it does enjoy the availability of an EM algorithm with the stability properties of a coordinate descent algorithm. In the case of the tilted bound the known fixed point algorithms are not guaranteed to be stable, and are not always stable in practice (Nolan and Wand, 2017; Rohde and Wand, 2016), so extra methods such as line searches would need to be considered. We do not further consider alternative bounds.
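For reference, the Jaakkola-Jordan function and the resulting Bouchard upper bound on the log-sum-exp term can be written down and checked numerically. This transcription follows (Bouchard, 2007) and (Jaakkola and Jordan, 1997); the particular values for the variational parameters alpha and xi below are illustrative (any positive xi gives a valid bound):

```python
import numpy as np

def lam(xi):
    """Jaakkola-Jordan function: lambda(xi) = tanh(xi / 2) / (4 xi)."""
    return np.tanh(xi / 2.0) / (4.0 * xi)

def bouchard_upper_bound(x, alpha, xi):
    """Bouchard's analytical upper bound on log(sum_k exp(x_k))."""
    t = x - alpha
    return alpha + np.sum((t - xi) / 2.0
                          + lam(xi) * (t**2 - xi**2)
                          + np.log1p(np.exp(xi)))

x = np.array([1.0, -0.5, 2.0, 0.3])   # illustrative softmax logits
alpha = x.mean()
xi = np.abs(x - alpha) + 1e-6         # per-component variational parameters

lse = np.log(np.sum(np.exp(x)))       # exact log-sum-exp
assert bouchard_upper_bound(x, alpha, xi) >= lse
```

The bound is quadratic in the logits, which is what makes the expectation under a Gaussian variational posterior analytical and yields the closed-form EM updates.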

The computational cost of this algorithm grows linearly with the number of products P and cubically with the embedding size K; if P and K are modest, it can take less than a second, making it potentially deployable at prediction time. In practice we found the cost for large P might be prohibitive due to the sums over all embeddings; in these cases the variational auto-encoder, described in the next section, is to be preferred.

4.1.2. Re-parameterization Trick

The second approach to computing expectations with respect to the denominator of the softmax is to use the re-parameterization trick (Kingma and Welling, 2014), which allows us to take a sample of the latent variable from the variational distribution and compute a noisy derivative of the lower bound. Within each iteration we proceed by simulating:

and then computing:

where the noise is drawn from a standard normal distribution. We can then optimize the noisy lower bound:

Often the covariance is taken to be diagonal, which makes computing its square root simply element-wise.
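A minimal numpy sketch of the re-parameterized sample under a diagonal covariance; the dimensions and parameter values are illustrative, and in practice mu and log_var would be optimized rather than drawn at random:

```python
import numpy as np

rng = np.random.default_rng(2)

K = 5
mu = rng.normal(size=K)           # variational mean (illustrative values)
log_var = rng.normal(size=K)      # log of the diagonal variational covariance
sigma = np.exp(0.5 * log_var)     # element-wise square root of the diagonal

# Re-parameterization: z = mu + sigma * eps with eps ~ N(0, I), so that the
# sample is a deterministic, differentiable function of (mu, log_var) and
# gradients of the noisy lower bound can flow through it.
eps = rng.normal(size=K)
z = mu + sigma * eps
```

Plugging z into the log-likelihood term (and adding the analytical Gaussian KL term) gives an unbiased estimate of the lower bound.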

4.2. Latent variable size growing with data

A naive application of the algorithms discussed so far would have the number of variational parameters (and, for the Bouchard bound, its additional variational parameters) growing with the number of users. We propose to limit the number of parameters by the use of a variational auto-encoder (Kingma and Welling, 2014). This involves using a flexible function and optimizing it to do the job of the EM algorithm, i.e.

or in the case of the Bouchard bound:

where any function, e.g. a deep net, can be used for the encoder mean and covariance.

It is common to use the re-parameterization trick and an auto-encoder in combination, although this is not necessary. The choice between the two hinges on accepting Monte Carlo error or using a looser but analytical bound.
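To make the auto-encoder idea concrete, a minimal linear encoder can be sketched as follows. The weights here are random placeholders (in practice they are learned by maximizing the lower bound), and the names W_mu and W_lv are our own:

```python
import numpy as np

rng = np.random.default_rng(3)

P, K = 7, 5
# Hypothetical linear encoder weights; learned in practice, random here.
W_mu, b_mu = rng.normal(size=(K, P)), np.zeros(K)
W_lv, b_lv = rng.normal(size=(K, P)), np.zeros(K)

def encode(history_counts):
    """Map a bag-of-items user history (length P) to variational (mu, log-variance).

    The same fixed set of encoder weights serves every user, so the number of
    variational parameters no longer grows with the number of users.
    """
    mu = W_mu @ history_counts + b_mu
    log_var = W_lv @ history_counts + b_lv
    return mu, log_var

mu, log_var = encode(np.array([1.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0]))
```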

4.3. Next Item Prediction

Finally, and perhaps surprisingly, the predictive distribution required to do next item prediction is also non-trivial in this case: approximating it is not trivial even if the posterior over the user state is approximated with a Gaussian distribution. We are interested in computing:

We considered using a Monte Carlo based approximation, first by drawing samples:

as well as using a number of fast approximations such as:

While we investigated more complex approximations (such as normalizing the exponential of the lower bound), we did not find that they helped in practice. The two VB approximations shown in Figures 1–4, denoted (MC) and (approx), are the Monte Carlo and mean approximations respectively.
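The two approximations can be sketched as follows: the Monte Carlo average of the softmax over posterior samples, and the fast plug-in softmax of the posterior mean. They generally differ because the softmax is non-linear; all parameter values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

P, K = 7, 5
W, b = rng.normal(size=(P, K)), rng.normal(size=P)   # illustrative model parameters
mu, sigma = rng.normal(size=K), 0.5 * np.ones(K)     # approximate Gaussian posterior

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Monte Carlo approximation (denoted MC): average the softmax over samples.
z = mu + sigma * rng.normal(size=(1000, K))
p_mc = softmax(z @ W.T + b).mean(axis=0)

# Fast mean approximation (denoted mean / approx): plug in the posterior mean.
p_mean = softmax(W @ mu + b)
```

Both produce valid next item distributions; the MC version accounts for posterior uncertainty at the cost of sampling.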

5. Experimental Setup

We demonstrate that our method produces useful user representations on next item prediction using the RecoGym simulation environment (Rohde et al., 2018). RecoGym is a framework for simulating a recommender system and enables the simulation of A/B tests, although here we simply use it to create organic sequences of item views and test the model's ability to do next item prediction; this allows us to compute the same metrics as on standard offline datasets. We also present results on the YooChoose dataset (Ben-Shimon et al., 2015). We split both datasets into train and test so that sessions reside entirely in one of the two groups. We fit the model to the training set, then evaluate by providing the model with events and testing its ability to predict the next item.

5.1. Implementation Details

All the models, including the relevant baselines, have been implemented using the PyTorch automatic differentiation package in Python (Paszke et al., 2017). All models are updated via Stochastic Gradient Descent (SGD), specifically the RMSProp variant. We set the learning rate to 0.001 and tune the other hyper-parameters, including L2 regularization, for each dataset based upon a validation set. The dataset specific hyper-parameter values are reported in Section 6 with the relevant results.

5.2. Performance Metrics

The various models are evaluated using recall at K (RC@K) and truncated discounted cumulative gain at K (DCG@K), which are defined below.

Both metrics rank all items by their predicted next item probability. For all results presented in this paper, we set K to five.

We compute the average of these quantities over all sessions in the test set.
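These are the standard formulations we assume for RC@K and DCG@K with a single held-out next item; the function names and the example scores are ours:

```python
import numpy as np

def recall_at_k(scores, true_item, k=5):
    """RC@K: 1 if the true next item is among the K highest-scoring items, else 0."""
    top_k = np.argsort(scores)[::-1][:k]
    return float(true_item in top_k)

def dcg_at_k(scores, true_item, k=5):
    """Truncated DCG@K: 1 / log2(rank + 1) if the true item ranks in the top K, else 0."""
    rank = int(np.where(np.argsort(scores)[::-1] == true_item)[0][0]) + 1
    return 1.0 / np.log2(rank + 1) if rank <= k else 0.0

scores = np.array([0.1, 0.4, 0.2, 0.3])   # illustrative next item probabilities
```

Averaging these per-session quantities over the test set gives the numbers reported in the results tables.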

5.3. Latent Variable Inference

We consider three alternative methods for training the model:

  • Bouch/AE - A linear variational auto-encoder using the Bouchard bound.

  • RT/AE - A linear variational auto-encoder using the re-parameterization trick.

  • RT/Deep AE - A deep auto-encoder, again using the re-parameterization trick. The deep auto-encoder consists of mapping an input of size P through three linear rectifier layers of K units each. We encountered numerical problems using the Bouchard bound with a deep auto-encoder.

When we update the posterior over a user's latent variable representation at test time, we assess both using the auto-encoder (denoted AE) and using 100 iterations of the EM algorithm (denoted EM) in the results.

When we compute next item predictions we consider both a 100 sample Monte Carlo approximation (denoted MC) and taking the posterior mean as a point estimate (denoted mean), which uses only the mean and correspondingly ignores the posterior covariance.

5.4. Baselines

To demonstrate the effectiveness of our approach, we present results from the following baseline approaches:

5.4.1. Popularity

Item popularity provides no personalization, but is nonetheless a strong baseline for certain recommendation tasks.

5.4.2. Item KNN

Item K Nearest Neighbors (KNN) involves computing the correlation matrix of the sample data, adding the identity to prevent division by zero, and then using these correlations to make recommendations based on a user's most recent historical item. The limitations of this technique are that it ignores item popularity and multiple items in the user's history; despite these limitations it is often a strong baseline.
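One plausible reading of this baseline, sketched on synthetic session data; the correlation-plus-identity construction follows the description above, while the data and function name are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic session-item indicator matrix (sessions x products).
X = (rng.random((100, 7)) < 0.3).astype(float)

# Item-item correlation matrix with the identity added, per the ItemKNN
# baseline described above.
C = np.corrcoef(X, rowvar=False) + np.eye(X.shape[1])

def knn_recommend(last_item, k=5):
    """Score items by their correlation with the user's most recent item."""
    scores = C[last_item].copy()
    scores[last_item] = -np.inf      # do not recommend the item itself
    return np.argsort(scores)[::-1][:k]
```

Note the baseline conditions only on the most recent item, which is exactly the limitation discussed above.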

5.4.3. Recurrent Neural Network

For this baseline, we make use of a recurrent neural network to learn a user representation by predicting the next item in the session. The model architecture we employ is similar to that of (Hidasi and Karatzoglou, 2018b), in that we feed the output from an embedding layer into a Gated Recurrent Unit (GRU) (Cho et al., 2014) with 64 hidden units to learn the temporal dynamics of the user's session. The output from the GRU is then passed through a final softmax layer, which gives the probability of the next item in the sequence. The network is trained to minimize the categorical cross-entropy over the training sessions via RMSProp.

6. Results

Train Algorithm | Online Latent | Online Next Item | RC@5 | DCG@5
Pop | – | – | 0.456 | 0.440
ItemKNN | – | – | 0.461 | 0.492
RNN | – | – | 0.620 | 0.646
Bouch/AE | AE | MC | 0.712 | 0.796
Bouch/AE | AE | mean | 0.712 | 0.777
Bouch/AE | EM | MC | 0.738 | 0.796
Bouch/AE | EM | mean | 0.748 | 0.796
RT/AE | AE | MC | 0.707 | 0.802
RT/AE | AE | mean | 0.697 | 0.784
RT/AE | EM | MC | 0.738 | 0.802
RT/AE | EM | mean | 0.733 | 0.802
RT/Deep AE | AE | MC | 0.697 | 0.785
RT/Deep AE | AE | mean | 0.717 | 0.775
RT/Deep AE | EM | MC | 0.733 | 0.785
RT/Deep AE | EM | mean | 0.733 | 0.787
Table 2. Results on the test set for all approaches on the RecoGym dataset with 20 products. For both metrics, a higher value is better.

6.1. RecoGym

For our first experiment we use the RecoGym simulator with 20 products and a static user state. With this we generate a training set of 100 sessions and a test set of 1000 sessions, resulting in 17161 and 176804 events for train and test respectively. The latent variable algorithms were all trained using 5000 epochs with the RMSProp algorithm and an embedding dimension of 10. The RNN was trained for 5000 epochs with the same embedding size; again RMSProp was used in all cases. The results are presented in Table 2, which shows that the Bouchard method of training, using the EM algorithm for predicting latent variables and Monte Carlo for predicting the next item, was the best performing algorithm on the RC@5 metric; RT/AE performed slightly better on the DCG@5 metric using either the EM algorithm or the auto-encoder with Monte Carlo.

Train Algorithm | Online Latent | Online Next Item | RC@5 | DCG@5
ItemKNN | – | – | 0.020 | 0.024
Pop | – | – | 0.020 | 0.016
RNN | – | – | 0.035 | 0.033
Bouch/AE | AE | MC | 0.082 | 0.128
Bouch/AE | AE | mean | 0.082 | 0.079
Bouch/AE | EM | MC | 0.117 | 0.128
Bouch/AE | EM | mean | 0.117 | 0.130
RT/AE | AE | MC | 0.061 | 0.047
RT/AE | AE | mean | 0.056 | 0.059
RT/AE | EM | MC | 0.051 | 0.047
RT/AE | EM | mean | 0.051 | 0.047
RT/Deep AE | AE | MC | 0.090 | 0.105
RT/Deep AE | AE | mean | 0.080 | 0.068
RT/Deep AE | EM | MC | 0.090 | 0.105
RT/Deep AE | EM | mean | 0.090 | 0.106
Table 3. Results on the test set for all approaches on the RecoGym dataset with 2000 products. For both metrics, a higher value is better.

For our second experiment we use the RecoGym simulator with 2000 products and, again, a static user state. We generate a training set of 100 sessions and a test set of 100 sessions, resulting in 21852 and 19533 events for train and test respectively. The latent variable algorithms were all trained for 15000 epochs using the RMSProp algorithm, with the embedding size set to 10. The RNN was trained with K=200 for 5000 epochs (it performed slightly worse with a training run of 25000 epochs). The results are shown in Table 3; again the Bouchard method, trained using the EM algorithm for predicting latent variables and Monte Carlo for predicting the next item, was the best performing algorithm on both the RC@5 and DCG@5 metrics.

6.2. YooChoose

Train Algorithm   Online Latent   Online Next Item   RC@5    DCG@5
Pop               --              --                 0.143   0.147
ItemKNN           --              --                 0.804   0.921
RNN               --              --                 0.690   0.781
Bouch/AE          AE              MC                 0.433   0.420
Bouch/AE          AE              mean               0.451   0.562
Bouch/AE          EM              MC                 0.386   0.420
Bouch/AE          EM              mean               0.429   0.497
RT/AE             AE              MC                 0.495   0.731
RT/AE             AE              mean               0.616   0.658
RT/AE             EM              MC                 0.693   0.731
RT/AE             EM              mean               0.707   0.768
RT/Deep AE        AE              MC                 0.751   0.868
RT/Deep AE        AE              mean               0.771   0.876
RT/Deep AE        EM              MC                 0.772   0.868
RT/Deep AE        EM              mean               0.775   0.873
Table 4. Results on the testset for all approaches on the YooChoose dataset with 100 products. For both metrics, a higher value is better.

For our third experiment we use the YooChoose dataset filtered to the most popular 100 products. This is a strong filter of YooChoose's 60000 products, but it allows for effective experimentation and still results in 2905816 and 28286 events for the training and test sets respectively. The deep auto-encoder latent variable algorithm was trained for 100 epochs, as were the linear Bouchard auto-encoder and the re-parameterization trick auto-encoder; the RNN was trained for a single epoch with an embedding size of 20, as longer training runs were observed to cause overfitting and reduced performance. All latent variable models were trained at full rank. The results are shown in Table 4. In this case the ItemKNN model performs best on both metrics, with the deep auto-encoder trained using the re-parameterization trick performing slightly worse. The best performing setups involve predicting using the mean method; there was very little difference between predicting with the EM algorithm and with the auto-encoder on this dataset.

The ItemKNN baseline turned out to be very strong. This is most likely because we filtered the dataset to just 100 popular products, allowing full rank covariance estimation; the latent variable model, also operating at full rank, was unable to perform quite as well. Another notable difference between the two methods is that ItemKNN looks only at the most recent event, whereas the latent variable session model combines the entire history. If the most recent event carries the most relevant information, this may advantage ItemKNN.
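The ItemKNN scoring rule described above can be sketched as follows. Cosine similarities are computed between item co-occurrence vectors and candidates are ranked by similarity to the most recently viewed item only; the tiny co-occurrence matrix is invented for illustration:

```python
import numpy as np

# toy item-session co-occurrence matrix: rows = items, cols = sessions
cooc = np.array([[1, 1, 0, 0],
                 [1, 1, 1, 0],
                 [0, 0, 1, 1],
                 [0, 0, 1, 1]], dtype=float)

# item-item cosine similarity from row-normalized co-occurrence vectors
unit = cooc / np.linalg.norm(cooc, axis=1, keepdims=True)
sim = unit @ unit.T

last_item = 0                    # ItemKNN conditions only on the most recent view
scores = sim[last_item].copy()
scores[last_item] = -np.inf      # do not recommend the item just viewed
recommendation = int(np.argmax(scores))
```

By contrast, the latent variable model would fold every viewed item into the posterior over the user embedding before scoring candidates.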

6.3. Interpretation of Results

The model we present is very closely aligned with the internal model in the RecoGym simulator, hence the strong performance there of all the variants of our model. It is perhaps surprising that next item prediction using just the posterior mean performed similarly well to the Monte Carlo approach. The value gained by the EM algorithm was also marginal. Given that an RNN is built to model very complex data such as language, it is perhaps unsurprising that it performs poorly on the RecoGym 2000 product dataset given a relatively small sample.
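The distinction between the "mean" and "MC" prediction rules compared above can be sketched as follows, assuming (for illustration) a diagonal Gaussian posterior over the user embedding and a linear softmax likelihood; the dimensions and random parameters are not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(1)
P, K = 20, 10                              # products, embedding dimension
Psi = rng.normal(size=(P, K))              # item embeddings (illustrative)
mu, sigma = rng.normal(size=K), 0.5        # Gaussian posterior over the user embedding

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# "mean" prediction: plug the posterior mean straight into the softmax
p_mean = softmax(Psi @ mu)

# "MC" prediction: average the softmax over posterior samples,
# approximating the integral over the latent user state
samples = mu + sigma * rng.normal(size=(500, K))
p_mc = np.mean([softmax(Psi @ s) for s in samples], axis=0)
```

The MC rule integrates over the remaining uncertainty in the user state, while the mean rule collapses the posterior to a point; the results suggest the two often rank items similarly in practice.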

For the YooChoose 100 product dataset the ItemKNN algorithm proved to be very effective. The Deep AE was the closest performer, with the EM MC variant being the best by a small margin. That the Deep AE performs best, and that the linear auto-encoders improve substantially when using the EM algorithm, both suggest that a linear auto-encoder is not sufficient for this problem.

7. Conclusion

Recommender systems increasingly use embeddings to represent items, and a user's session then involves interactions with many of these items. We have demonstrated an elegant algorithm for taking a user's history of varying length and summarizing it with a posterior distribution over a user embedding that has the same dimension as the product embedding. Sensible behavior, such as higher uncertainty when the user has a short history and lower uncertainty when the user has a longer one, is a feature of this model formulation. We have demonstrated how to train the model to produce item embeddings using a variational auto-encoder, either with the re-parameterization technique or using the Bouchard bound. Similarly, it is possible to rapidly convert a user history containing multiple items into a user embedding using either a variational auto-encoder or the EM algorithm (although the latter is constrained to small numbers of products due to the required summations).
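The posterior tightening described above can be illustrated with a stylized conjugate Gaussian update (not the paper's softmax model): each observed item contributes precision, so a longer history yields a tighter posterior over the user embedding. The prior and noise variances below are arbitrary:

```python
# Stylized illustration of posterior tightening with session length.
prior_var = 1.0    # prior variance over a user-embedding coordinate
noise_var = 0.5    # per-observation noise variance (assumed)

def posterior_var(n_obs):
    # conjugate Gaussian update: precisions add, one term per observed item
    return 1.0 / (1.0 / prior_var + n_obs / noise_var)

# variance after observing 0, 1, 5, and 20 items in the session
variances = [posterior_var(n) for n in (0, 1, 5, 20)]
```

The monotone shrinkage of the variance mirrors the recommender system's increasing certainty about the user's interests as the session unfolds.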

A complexity of latent variable methods is the need for numerical integration at prediction time. The EM algorithm presented has excellent stability properties, but scales poorly when the number of items reaches the tens of thousands. There are several lines of interesting work that could speed up this evaluation. Alternatively, using already well understood techniques, we can simply use a variational auto-encoder, which also produces a rapid approximation of the integral.

There are numerous possible extensions to the training algorithm. Training requires a normalization whose cost grows with the size of the item catalogue, which can be prohibitive; methods such as those outlined in (Ruiz et al., 2018) may be adaptable to this model. Finally, the model can be extended to treat time in a more sophisticated way and to consider the feedback to recommendations, rather than being built exclusively for next item prediction.


  • Barkan (2017) Oren Barkan. 2017. Bayesian Neural Word Embedding, See Singh and Markovitch (2017), 3135–3143.
  • Ben-Shimon et al. (2015) David Ben-Shimon, Alexander Tsikinovsky, Michael Friedmann, Bracha Shapira, Lior Rokach, and Johannes Hoerle. 2015. Recsys challenge 2015 and the yoochoose dataset. In Proceedings of the 9th ACM Conference on Recommender Systems. ACM, New York, NY, USA, 357–358.
  • Bouchard (2007) Guillaume Bouchard. 2007. Efficient bounds for the softmax function, applications to inference in hybrid models.
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar. ACL, 1724–1734.
  • Cuzzocrea et al. (2018) Alfredo Cuzzocrea, James Allan, Norman W. Paton, Divesh Srivastava, Rakesh Agrawal, Andrei Z. Broder, Mohammed J. Zaki, K. Selçuk Candan, Alexandros Labrinidis, Assaf Schuster, and Haixun Wang (Eds.). 2018. Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM 2018, Torino, Italy, October 22-26, 2018. ACM.
  • De Finetti (1980) Bruno De Finetti. 1980. Foresight: Its logical laws, its subjective sources (1937). Studies in subjective probability (1980), 55–118.
  • Dikmen and Févotte (2011) Onur Dikmen and Cédric Févotte. 2011. Nonnegative dictionary learning in the exponential noise model for adaptive music signal representation. In Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 2267–2275.
  • Greenwald (1976) Anthony G Greenwald. 1976. Within-subjects designs: To use or not to use? Psychological Bulletin 83, 2 (1976), 314.
  • Gunawardana et al. (2009) Asela Gunawardana, Christopher Meek, et al. 2009. A unified approach to building hybrid recommender systems. RecSys 9 (2009), 117–124.
  • Hidasi and Karatzoglou (2018a) Balázs Hidasi and Alexandros Karatzoglou. 2018a. Recurrent Neural Networks with Top-k Gains for Session-based Recommendations, See Cuzzocrea et al. (2018), 843–852.
  • Hidasi and Karatzoglou (2018b) Balázs Hidasi and Alexandros Karatzoglou. 2018b. Recurrent Neural Networks with Top-k Gains for Session-based Recommendations, See Cuzzocrea et al. (2018), 843–852.
  • Hoffman et al. (2013) Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. The Journal of Machine Learning Research 14, 1 (2013), 1303–1347.
  • Jaakkola and Jordan (1997) Tommi Jaakkola and Michael Jordan. 1997. A variational approach to Bayesian logistic regression models and their extensions. In Sixth International Workshop on Artificial Intelligence and Statistics, Vol. 82. 4.
  • Kang and McAuley (2018) Wang-Cheng Kang and Julian McAuley. 2018. Self-Attentive Sequential Recommendation. In IEEE International Conference on Data Mining, ICDM 2018, Singapore, November 17-20, 2018. IEEE Computer Society, 197–206.
  • Kingma and Welling (2014) Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.).
  • Knowles and Minka (2011) David A. Knowles and Tom Minka. 2011. Non-conjugate Variational Message Passing for Multinomial and Binary Regression. In Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 1701–1709.
  • Kucukelbir et al. (2017) Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. 2017. Automatic Differentiation Variational Inference. Journal of Machine Learning Research 18, 14 (2017), 1–45.
  • Lafferty and Blei (2006) John D. Lafferty and David M. Blei. 2006. Correlated Topic Models. In Advances in Neural Information Processing Systems 18, Y. Weiss, B. Schölkopf, and J. C. Platt (Eds.). MIT Press, 147–154.
  • Liang et al. (2018) Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, and Tony Jebara. 2018. Variational Autoencoders for Collaborative Filtering. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018, Pierre-Antoine Champin, Fabien L. Gandon, Mounia Lalmas, and Panagiotis G. Ipeirotis (Eds.). ACM, 689–698.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 3111–3119.
  • Nolan and Wand (2017) Tui H Nolan and Matt P Wand. 2017. Accurate logistic variational message passing: algebraic and numerical details. Stat 6, 1 (2017), 102–112.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
  • Quadrana et al. (2018) Massimo Quadrana, Paolo Cremonesi, and Dietmar Jannach. 2018. Sequence-Aware Recommender Systems. ACM Comput. Surv. 51, 4 (2018), 1–66.
  • Quadrana et al. (2017) Massimo Quadrana, Alexandros Karatzoglou, Balázs Hidasi, and Paolo Cremonesi. 2017. Personalizing Session-based Recommendations with Hierarchical Recurrent Neural Networks. In Proceedings of the Eleventh ACM Conference on Recommender Systems, RecSys 2017, Como, Italy, August 27-31, 2017, Paolo Cremonesi, Francesco Ricci, Shlomo Berkovsky, and Alexander Tuzhilin (Eds.). ACM, New York, NY, USA, 130–137.
  • Ranganath et al. (2014) Rajesh Ranganath, Sean Gerrish, and David M. Blei. 2014. Black Box Variational Inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 22-25, 2014 (JMLR Workshop and Conference Proceedings), Vol. 33., 814–822.
  • Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014 (JMLR Workshop and Conference Proceedings), Vol. 32. 1278–1286.
  • Robbins and Monro (1951) Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. The annals of mathematical statistics 22, 3 (1951), 400–407.
  • Rohde et al. (2018) David Rohde, Stephen Bonner, Travis Dunlop, Flavian Vasile, and Alexandros Karatzoglou. 2018. RecoGym: A Reinforcement Learning Environment for the problem of Product Recommendation in Online Advertising. In REVEAL Workshop, ACM Conference on Recommender Systems 2018.
  • Rohde and Wand (2016) David Rohde and Matt P Wand. 2016. Semiparametric mean field variational Bayes: General principles and numerical issues. The Journal of Machine Learning Research 17, 1 (2016), 5975–6021.
  • Ruiz et al. (2018) Francisco J. R. Ruiz, Michalis K. Titsias, Adji B. Dieng, and David M. Blei. 2018. Augment and Reduce: Stochastic Inference for Large Categorical Distributions. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018 (Proceedings of Machine Learning Research), Jennifer G. Dy and Andreas Krause (Eds.), Vol. 80. PMLR, 4400–4409.
  • Shani et al. (2005) Guy Shani, David Heckerman, and Ronen I Brafman. 2005. An MDP-based recommender system. Journal of Machine Learning Research 6, Sep (2005), 1265–1295.
  • Singh and Markovitch (2017) Satinder P. Singh and Shaul Markovitch (Eds.). 2017. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA. AAAI Press.
  • Smirnova and Vasile (2017) Elena Smirnova and Flavian Vasile. 2017. Contextual Sequence Modeling for Recommendation with Recurrent Neural Networks. In Proceedings of the 2nd Workshop on Deep Learning for Recommender Systems (DLRS 2017). ACM, New York, NY, USA, 2–9.
  • Tan et al. (2016) Yong Kiam Tan, Xinxing Xu, and Yong Liu. 2016. Improved Recurrent Neural Networks for Session-based Recommendations. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRS@RecSys 2016, Boston, MA, USA, September 15, 2016, Alexandros Karatzoglou, Balázs Hidasi, Domonkos Tikk, Oren Sar Shalom, Haggai Roitman, Bracha Shapira, and Lior Rokach (Eds.). ACM, 17–22.
  • Team (2018) Stan Development Team. 2018. PyStan: the Python interface to Stan, Version
  • Welling et al. (2008) Max Welling, Chaitanya Chemudugunta, and Nathan Sutter. 2008. Deterministic Latent Variable Models and Their Pitfalls. In Proceedings of the SIAM International Conference on Data Mining, SDM 2008, April 24-26, 2008, Atlanta, Georgia, USA. SIAM, 196–207.
  • Ying et al. (2018) Haochao Ying, Fuzheng Zhang, Yanchi Liu, Guandong Xu, Xing Xie, Hui Xiong, and Jian Wu. 2018. Sequential Recommender System based on Hierarchical Attention Networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden., Jérôme Lang (Ed.)., 3926–3932.
  • Zolna and Romanski (2017) Konrad Zolna and Bartlomiej Romanski. 2017. User Modeling Using LSTM Networks, See Singh and Markovitch (2017), 5025–5027.