1. Introduction
A traditional approach to building a recommender system is to use feature engineering techniques to summarize a user's history into a feature vector of fixed dimension, which enables machine learning algorithms to be applied to do next item prediction or to model the outcome of recommendations. Feature engineering of the variable-dimension user history is often quite compromised; for example, the simple heuristic of looking only at the most recent item is often employed. Session based recommendation represents a significant step forward: instead of producing a feature vector, there is a representation of the recommender system's state of knowledge about the user's interests at a certain point in time.
Session based models require the temporal and sequential features of user behavior to be modeled. In this approach, rather than feature engineering being used to build a model, a user's state is dynamically updated as the user acts and responds to recommendations. This has traditionally been approached in the recommender system community chiefly using Recurrent Neural Network (RNN) based approaches
(Hidasi and Karatzoglou, 2018a; Quadrana et al., 2017, 2018; Zolna and Romanski, 2017; Smirnova and Vasile, 2017; Tan et al., 2016); for other approaches see (Ying et al., 2018; Kang and McAuley, 2018; Shani et al., 2005). RNNs are powerful temporal models which can capture subtle user dynamics; for example, users may visit certain items with higher probability in a certain order.
There are two reasons that the recommender system's representation of a user may change over time. The first is the temporal nature of the user's interests, e.g. they are in the market for a product until they buy it, and then they no longer are. The second is that the recommender system's understanding of the user will improve as the user performs more actions and reveals more of their implicit interests.
Employing session based recommendation is also an important step towards modeling long term rewards, such as sales. This is because a short term reward (e.g. a click) might reasonably be modeled using a within-subject study design (Greenwald, 1976), but a between-subject (i.e. session based) design is needed for long term rewards.
Our primary contributions in this paper are as follows:

We demonstrate how much of the power of embedding methods can be recovered through the use of a low-rank multivariate Gaussian latent variable model, transformed to a categorical output via a softmax. Despite the model's apparent simplicity and elegance, our framework is semi-Bayesian: integrating the latent variable brings associated computational challenges.

We provide an elegant way to convert a user history, containing a variable number of item views, into a fixed dimensional representation of the user's interests.

We derive an analytical variational bound for our model and show how to train embeddings using a variational autoencoder. We also show that the model can be trained without the bound using the reparameterization trick.

We show that the variational autoencoder can be used at prediction time to produce a user representation; we also derive a variational EM algorithm for the same purpose.

We show that the method performs well on both synthetic and real-world data.
To aid reproducibility of the approaches, we release all source code^1 and relevant experimental scripts. (^1 Code to be released post review.)
In Section 2 we introduce the latent variable model and show that it has intuitively pleasing properties. In Section 3 we review the relevant literature. In Section 4 we explain our computational framework for performing approximate inference. In Section 5 we outline the experimental setup and in Section 6 we present our results; finally, Section 7 makes some concluding remarks.
2. Background
2.1. A Latent Variable Model For Item Views
The model we introduce in this paper posits that item views in a session are explained by a session-level latent variable. The model may be viewed either as a probabilistic matrix factorization model or as covariance estimation of a low-rank multivariate normal. Perhaps surprisingly, when combined with computational machinery to marginalize the latent variable, this simple model is able to reproduce many interesting features of a session based recommender system.
Table 1. Notation.
Symbol  Dimension  Description
u  Scalar  A given user's id.
t  Scalar  Sequential time.
P  Scalar  Total number of products.
K  Scalar  The size of the embedding.
v_{u,t}  Scalar  Product id for user u at time t.
ω_u  K  A given user's state.
β  P × K  Product embedding matrix.
β_v  K  Product embedding for product v.
μ  K  The mean of the approximate posterior over ω_u.
Σ  K × K  The covariance of the approximate posterior over ω_u.
ρ  P  Item popularity shift.
T_u  Scalar  Session length for user u.
Our model describes a generative process for the types of products that users co-view in sessions. Throughout this paper we will make use of the notation introduced in Table 1. We use u to denote a user or a session and t to denote sequential time; v_{u,t} ∈ {1, …, P} denotes which product the user viewed at time t, where P is the number of products. The user's interests are described by a K-dimensional latent variable ω_u, which can be interpreted as the user's interest in K topics. The session length of user u is given by T_u. We then assume the following generative process for the views in each session:
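The generative process just described can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the paper's implementation: the parameter values are random placeholders rather than learned embeddings, and the sizes P = 7 and K = 5 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

P, K = 7, 5                      # number of products, embedding size (illustrative)
beta = rng.normal(size=(P, K))   # product embedding matrix (random placeholder)
rho = rng.normal(size=P)         # item popularity shift (random placeholder)

def sample_session(T):
    """Draw one session of T item views from the generative process."""
    omega = rng.normal(size=K)              # user state: omega ~ N(0, I_K)
    logits = beta @ omega + rho             # low-rank Gaussian pushed through an affine map
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                    # softmax over the P products
    return rng.choice(P, size=T, p=probs)   # views are conditionally i.i.d. given omega

views = sample_session(10)
```

Note that, given ω_u, the item views within a session are conditionally independent draws from a single softmax distribution; all within-session structure comes from the shared latent variable.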
For the moment we assume that β and ρ have already been estimated; we defer the topic of estimation to later sections. In production we observe a user's online viewing history v_{u,1}, …, v_{u,t} and we would like to produce a representation of the user's interests. Our proposal is to use Bayesian inference to infer the posterior p(ω_u | v_{u,1}, …, v_{u,t}) as a representation of their interests. This representation of interests can then be used as a feature for training a recommender system on so-called "bandit feedback", i.e. logs of the recommender system itself (this is distinct from the user history). The above is a "matrix factorization" view of our model; however, one could also view it as a "covariance estimation" task. We neglect the mathematical complications due to the covariance matrix (usually) being low-rank and the fact that it formally has no density.
2.2. Case Study
Before going into the details of the model and the related computational material, we demonstrate on a simple case study that this model is deceptively powerful at producing an updateable representation of a user's interests.
Imagine we have a recommender system with just seven products: Sleek Phone, City Phone, Rice, Couscous, Beer, Women's Shirt, Men's Shirt. We further imagine that an offline job has already generated the embeddings, i.e. the parameters. These embeddings establish correlations such that users interested in the Sleek Phone are also interested in the City Phone, users interested in Rice are also interested in Couscous, users interested in Beer have some weak interest in Rice and Couscous, and finally users interested in the Men's Shirt are anti-correlated with users interested in the Women's Shirt. To make this concrete we assume that:
The fact that most entries in β are positive simplifies the discussion, as it means that a high value of a component of ω_u implies interest. The only two negative values are used to model a negative correlation between women's and men's shirts.
An interesting facet of this model is that it is not trivial to establish which is the most popular product simply by examining the parameters: while ρ does reflect popularity, it is also affected by β in complex ways. In this constructed example we are able to label the five components of ω_u as phones, grains, drinks, women's clothes and men's clothes.
We now consider how different user histories affect the posterior p(ω_u | v_{u,1}, …, v_{u,t}). This quantity can be approximated accurately and easily using the Stan probabilistic programming language (Team, 2018), although later we will show that excellent performance can also be obtained using variational methods, which are able to scale to real-world recommender systems at a cost comparable to methods such as Recurrent Neural Networks.
The intuitive behavior of this simple model is demonstrated below. The results of three approximate methods are presented and shown to be in good agreement; we defer discussion of the approximation methods, except to note that we take Markov chain Monte Carlo to be the gold standard.
In Figure 4 we observe the case where a single Sleek Phone view is present in the user's history. As a consequence of the short history there is significant uncertainty in the knowledge about the user, although the embedding reflects a higher interest in the phone category; the next item prediction is high for both the Sleek Phone and the City Phone. In Figure 4 we observe the case where the Sleek Phone is observed twice and the City Phone once in the user's history; as a consequence there is less uncertainty in the knowledge about the user, and the next item prediction is higher for both the Sleek Phone and the City Phone. In Figure 4 the user has viewed the City Phone twenty times and the Sleek Phone just once; as a consequence the user's embedding shows a strong interest in phones with low uncertainty, and the next item prediction is distributed among the two phones. In Figure 4 we observe a user who has viewed a City Phone and a Women's Shirt; we see that the user's interests in phones and women's clothes are increased and their interest in men's clothing is decreased, indicating that the negative entry in the embedding has the desired effect. The next item predictions also reflect these preferences.
This simple model has shown a remarkable ability to summarize a user's interests and is able to reflect both strong and weak knowledge about the user. Having demonstrated the intuitive value of this model, we now show how to estimate β and ρ, how to efficiently approximate p(ω_u | v_{u,1}, …, v_{u,t}) such that the user representation can be updated online as the user acts, and finally how to do next item prediction, which may serve as a proxy for the recommendation task.
3. Related literature
3.1. Scalable Variational Approximations
Two approaches to scalable Bayesian inference (Kucukelbir et al., 2017; Ranganath et al., 2014) focus on approximating a posterior over a fixed-dimensional parameter space rather than the latent variable case we care about; as such they are not appropriate for our setting.
Under a conditional independence assumption it is often possible to reduce the variational Expectation Maximization algorithm to a finite-sum fixed point iteration, where the finite sum is over the data plus another term associated with the prior and the entropy. This formulation rather directly admits the Robbins-Monro stochastic approximation algorithm (Robbins and Monro, 1951) and has been effective in complete data exponential family models (Hoffman et al., 2013). While this algorithm does indeed apply to simple versions of our model, the "M-step" for estimating the embeddings would require infeasibly large matrix inverses.
3.2. Latent Variable Models
Our model is a special case of (Liang et al., 2018) (also see (Rezende et al., 2014; Lafferty and Blei, 2006)), but has stronger analytical properties, including an analytical bound and an EM algorithm, which we exploit both to gain computational advantages and to highlight similarities with other methods.
An interesting suggestion made in (Liang et al., 2018) is to reduce the contribution of the Kullback-Leibler component of the lower bound, e.g. by multiplying it by a value lower than one, such as one half. They justify this with a combination of empirical results and by interpreting the model as containing a reconstruction error term and a regularization term. In this paper we are primarily focused on producing a user representation; multiplying the KL component by a half would have the effect of "squaring the likelihood", i.e. double counting the data, resulting in artificially reduced uncertainties on the user representation. As we are primarily interested in producing a user representation we do not pursue that method here, although we acknowledge the excellent empirical results they present.
We use a semi-Bayesian or latent variable framework, integrating over the latent variable but estimating the parameters. There is a literature discussing the improved statistical properties of this procedure: for theoretical arguments see (Welling et al., 2008); for a demonstration of empirical performance see (Dikmen and Févotte, 2011). A critical observation is that for a traditional matrix factorization the parameter space grows with the number of users; this makes traditional statistical notions such as convergence difficult, and indeed means that if a new user arrives a fit must be performed before a prediction can be made.
In contrast, if one of the matrices is integrated out then the dimensionality of the model is fixed, and traditional statistical notions such as convergence again become relevant.
3.3. Word2Vec and Prod2Vec
The skip-gram model and skip-gram with negative sampling, collectively known as word2vec (Mikolov et al., 2013), caused a sensation in both the natural language processing and recommender systems communities (Gunawardana et al., 2009). If we define the event matrix X, where X_{u,p} counts the number of times user u viewed product p, then word2vec operates on the co-event matrix X^T X; often the rows and columns are referred to as target and context. This matrix is of size P × P which, while large, is often much smaller than the U × P event matrix, where U is the number of users, so operating on this matrix is more computationally efficient. There are however disadvantages in modeling X^T X directly. One is that X^T X obeys some quite subtle constraints, e.g. some values of X^T X are impossible. An example of a valid matrix for P = 3 is:
This matrix is consistent with two user sessions: the first session visited products 1 and 2, and the second session visited product 3. Now consider:
This matrix is inconsistent with any X containing only positive counts. Intuitively we can see this by noting that the diagonal implies that each product has been observed exactly once. The first row (or by symmetry the first column) says that products 1, 2 and 3 all occur together. The only way we can have each product viewed exactly once and all occurring together is for all entries to be associated, yet the matrix is inconsistent with this. The skip-gram model suggests modeling the rows of X^T X as multinomial draws, which gives positive probability to events that cannot happen; given the complexity of the constraints on X^T X, it is difficult to see how this can be avoided except by modeling X directly.
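The relationship between the event matrix X and the co-event matrix X^T X can be made concrete with a small sketch. The session data below mirrors the consistent two-session example described above; it is illustrative only, not taken from the paper's datasets.

```python
import numpy as np

# Event matrix X: one row per session, one column per product,
# X[u, p] = number of times session u viewed product p.
# Session 0 viewed products 1 and 2; session 1 viewed product 3
# (0-indexed here as products 0, 1 and 2).
X = np.array([[1, 1, 0],
              [0, 0, 1]])

# Co-event matrix: the diagonal holds per-product view counts and the
# off-diagonals hold within-session co-occurrence counts.
C = X.T @ X

# C is a Gram matrix and therefore positive semi-definite by construction.
eigenvalues = np.linalg.eigvalsh(C)
```

Positive semi-definiteness is one simple necessary condition that an arbitrary symmetric count matrix may violate; the full set of constraints characterizing realizable co-event matrices is considerably more subtle, which is the point made above.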
A further contribution of (Mikolov et al., 2013) was a negative sampling heuristic, which allowed these methods to scale to very large numbers of categories by avoiding large summations at every iteration. However the meaning of negative sampling remains unclear, and it complicates producing probabilistic algorithms. For example, within this class of algorithms there is a tuning parameter deciding how many negative examples to generate. Of course, increasing the amount of (artificially) generated data will (artificially) reduce uncertainty on parameter estimates; while there have been attempts at a Bayesian skip-gram model (Barkan, 2017), it is difficult to see how any method employing this heuristic can correctly control the uncertainties that it computes.
It is interesting to reflect on the widespread successful use of these methods. The heuristics employed do not make it easy to make a complete comparison with our method, but we can make a few comments. If the underlying model of X were Gaussian (this cannot be true, as X has support only on the natural numbers) then X^T X would be the scatter matrix which, along with the mean, gives the sufficient statistics of a Gaussian distribution. Taking the eigenvalue decomposition of the scatter matrix would result in principal component analysis (without the usual subtraction of the means), and there is the well-known result that PCA can be computed either by an eigenvalue decomposition of X^T X or by the singular value decomposition of X, which, loosely accounting for the change of support and the integration of the latent variable, is what our method achieves. Loosely, we can view our latent variable model in the same sense, i.e. as estimating the covariance or as performing a matrix factorization of X. The use of dot products between embeddings can then be viewed as covariances, and cosine distances as correlations. The non-probabilistic nature of word2vec poses problems that are typically dealt with using heuristics, such as using the embeddings as features in the "feature engineering approach"; several questions remain difficult to resolve, e.g. how do you do next item prediction (combining popularity with the associated embeddings)? How do you do recommendation? How do you combine several items of history into a fixed-dimensional user state?
3.4. RNN Session Based Recommendation
In the session based recommender system literature there is a significant body of work applying RNNs to the recommendation problem; in this case, like us, they apply the model directly to the sequence of item views. The RNN is a more flexible model, able to capture more sophisticated sequences, e.g. a shopper transitioning to an interest in complementary products after a purchase event. This extra flexibility is powerful, but also requires more data to identify these effects.
In contrast, the latent variable model we introduce is effectively a low-rank Gaussian prior on a categorical variable. As such, up to the capacity of the model, the law of large numbers applies: if a user had a long enough history and the embedding size K was greater than or equal to the number of products, then the next item prediction would converge to the user's empirical history (De Finetti, 1980). The RNN does not a priori incorporate the law of large numbers; it is a flexible sequential model, and if the law of large numbers approximately holds, as we might expect, then the RNN will need to see more data to recognize this. It is an important remark that if the user embedding size is less than the number of products then, regardless of whether an RNN or a latent factor model is used, it is not possible for the next item prediction to converge to the user's empirical history, due to a lack of capacity. The nonlinear model of (Liang et al., 2018) has the same limitation. This low capacity is typically not a problem as user sequences are very short, but it does highlight that there are limitations introduced by using small embedding sizes, i.e. the ability to distinguish users interested in subtly different products may be lost; of course, the advantage is vastly improved tractability. The stronger assumptions of the latent variable method suggest its realm of applicability is when those assumptions are true, or when they are approximately true and there is insufficient data to learn a higher capacity model such as an RNN.
4. Approximate Inference
In previous sections we discussed the model and showed that it has intuitively reasonable properties. In this section we show (i) how to learn the embeddings and (ii) how, at deployment time, to make predictions by approximating the posterior over a user's representation, i.e. how to compute p(ω_u | v_{u,1}, …, v_{u,t}) in real time.
4.1. Optimizing the lower bound
In order to make this method practically usable we need two components: firstly, the ability to estimate β and ρ efficiently; secondly, the ability to rapidly produce and update user embeddings based on a user's activity. To solve both parts of the problem we employ variational approximations, which work by turning integration problems into optimization problems.
The model we introduce has the form:

p(ω_u) = N(0, I_K),  p(v_{u,t} = p | ω_u) = exp(β_p ω_u + ρ_p) / Σ_{p′} exp(β_{p′} ω_u + ρ_{p′}).
If we use a normal variational distribution q(ω) = N(μ, Σ), the variational lower bound has the form:

L(μ, Σ) = Σ_{t=1}^{T_u} E_q[ β_{v_t} ω + ρ_{v_t} − log Σ_{p=1}^{P} exp(β_p ω + ρ_p) ] − KL( q(ω) ‖ p(ω) ).

We see that there is a problematic term associated with the denominator of the softmax. We consider two possible computational approaches to this: the Bouchard bound (Bouchard, 2007) and the reparameterization trick (Kingma and Welling, 2014).
4.1.1. Bouchard Bound
The Bouchard bound introduces a further approximation and additional variational parameters α and ξ, but produces an analytical bound:
where λ(·) is the Jaakkola and Jordan function (Jaakkola and Jordan, 1997): λ(ξ) = (1/(2ξ)) (1/(1 + e^{−ξ}) − 1/2).
The bound may be optimized using the following variational EM algorithm, which enjoys the coordinate descent properties of an EM algorithm, guaranteeing the bound will tighten at each iteration. The algorithm here is the dual of the one presented in (Bouchard, 2007): we assume the embedding β is fixed and update q(ω), where the algorithm they present does the opposite. The EM algorithm consists of cycling the following update equations:
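As a concrete illustration, the Jaakkola-Jordan function and the resulting Bouchard upper bound on the log-sum-exp (softmax denominator) term can be sketched as follows. This is a sketch of the standard bound rather than the paper's implementation; the test vector x and the choices α = max_k x_k and ξ_k = |x_k − α| are arbitrary illustrative values.

```python
import numpy as np

def lam(xi):
    """Jaakkola-Jordan function: lambda(xi) = tanh(xi / 2) / (4 xi), with limit 1/8 at 0."""
    xi = np.asarray(xi, dtype=float)
    safe = np.where(np.abs(xi) < 1e-8, 1.0, xi)          # avoid 0/0 at xi = 0
    return np.where(np.abs(xi) < 1e-8, 0.125, np.tanh(safe / 2.0) / (4.0 * safe))

def bouchard_upper_bound(x, alpha, xi):
    """Bouchard's upper bound on log(sum_k exp(x_k)) with variational parameters alpha, xi."""
    y = x - alpha
    return alpha + np.sum((y - xi) / 2.0
                          + lam(xi) * (y ** 2 - xi ** 2)
                          + np.log1p(np.exp(xi)))

x = np.array([1.0, -0.5, 0.3])    # hypothetical logits beta_p . omega + rho_p
alpha = x.max()
xi = np.abs(x - alpha)
bound = bouchard_upper_bound(x, alpha, xi)
lse = np.log(np.sum(np.exp(x)))
gap = bound - lse                 # non-negative: the bound lies above log-sum-exp
```

With ξ_k = |x_k − α| the inner Jaakkola-Jordan quadratic bounds are tight, so the gap shown is exactly the looseness of the outer product-over-sum step; in the EM algorithm above, α and ξ would instead be updated to tighten the bound in expectation under q(ω).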
There are other variational bounds that may be considered for this problem, most notably the tilted bound (Knowles and Minka, 2011). Even though the Bouchard bound is loose compared to the tilted bound, it enjoys the availability of an EM algorithm with the stability properties of a coordinate descent algorithm. In the case of the tilted bound, the known fixed point algorithms are not guaranteed to be stable and are not always stable in practice (Nolan and Wand, 2017; Rohde and Wand, 2016), so extra methods such as line searches would need to be considered. We do not further consider alternative bounds.
The computational cost of this algorithm is linear in the number of products P and cubic in the embedding size K; if P and K are modest it can take less than a second, making it potentially deployable at prediction time. In practice we found the cost for large P might be prohibitive due to the sums over all embeddings; in these cases the variational autoencoder described in the next section is to be preferred.
4.1.2. Reparameterization Trick
The second approach to computing expectations with respect to the denominator of the softmax is to use the reparameterization trick (Kingma and Welling, 2014), which allows us to take a sample of ω from the variational distribution and compute a noisy derivative of the lower bound. Within each iteration we proceed by simulating ε ∼ N(0, I_K) and then computing ω = μ + Σ^{1/2} ε; we can then optimize the noisy lower bound.
Often Σ is taken to be diagonal, which makes computing Σ^{1/2} simply an element-wise square root.
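A minimal sketch of the reparameterized sample in the diagonal case follows; the sizes and parameter values are illustrative placeholders, and in practice this step sits inside an automatic differentiation framework such as PyTorch so that gradients flow back to μ and log σ.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 5                                  # illustrative embedding size

# Variational parameters for q(omega) = N(mu, diag(exp(log_sigma))^2).
mu = rng.normal(size=K)
log_sigma = rng.normal(size=K) * 0.1

def reparameterized_sample():
    """omega = mu + sigma * eps with eps ~ N(0, I): deterministic in (mu, log_sigma) given eps."""
    eps = rng.normal(size=K)
    return mu + np.exp(log_sigma) * eps, eps

omega, eps = reparameterized_sample()
# Because omega is a deterministic function of the variational parameters given eps,
# derivatives of a Monte Carlo ELBO estimate pass through omega:
# d omega / d mu = 1 and d omega / d log_sigma = sigma * eps.
```

The design choice relative to the Bouchard bound is Monte Carlo noise in exchange for an unbiased estimate of the exact (un-bounded) objective.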
4.2. Latent variable size growing with data
A naive application of the algorithms discussed so far would have the number of variational parameters (μ_u, Σ_u, or additionally α_u, ξ_u for the Bouchard bound) growing with the number of sessions. We propose to limit the number of parameters through the use of a variational autoencoder (Kingma and Welling, 2014). This involves using a flexible function, optimized to do the job of the EM algorithm, i.e.
or in the case of the Bouchard bound:
where any function, e.g. a deep net, can be used for the encoders producing μ and Σ.
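A linear amortized encoder, corresponding loosely to the linear autoencoder variants evaluated later, might look as follows. The weights here are random placeholders rather than trained values, and the session is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
P, K = 7, 5   # illustrative sizes

# Amortization: rather than keeping per-session variational parameters,
# a single encoder maps a session's item-count vector to the variational
# parameters of q(omega). Weights below are untrained placeholders.
W_mu = rng.normal(size=(K, P)) * 0.1
W_ls = rng.normal(size=(K, P)) * 0.1

def encode(counts):
    """Linear encoder: item counts -> (mu, log_sigma) for q(omega)."""
    return W_mu @ counts, W_ls @ counts

session = np.zeros(P)
np.add.at(session, [0, 0, 1], 1)   # hypothetical session: item 0 twice, item 1 once
mu, log_sigma = encode(session)
```

The key property is that the number of parameters is now fixed at 2KP regardless of how many sessions are observed; a deep encoder simply replaces the two matrix multiplications with a deeper network.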
It is common to use the reparameterization trick and an autoencoder in combination, although this is not necessary. The choice between the two approaches hinges on accepting Monte Carlo error or using a looser but analytical bound.
4.3. Next Item Prediction
Finally, and perhaps surprisingly, the predictive distribution required to do next item prediction is also non-trivial in this case: approximating p(v_{u,t+1} | v_{u,1}, …, v_{u,t}) is not trivial even if p(ω_u | v_{u,1}, …, v_{u,t}) is approximated with a Gaussian distribution q(ω) = N(μ, Σ). We are interested in computing:
We considered a Monte Carlo based approximation, first drawing samples ω^{(s)} ∼ q(ω) and averaging the softmax over them, as well as a number of fast deterministic approximations, such as simply plugging in the posterior mean.
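The two prediction strategies evaluated later (denoted MC and mean in the results) can be sketched as follows, assuming a diagonal Gaussian approximate posterior; all parameter values below are random placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
P, K = 7, 5                      # illustrative sizes
beta = rng.normal(size=(P, K))   # placeholder embeddings
rho = rng.normal(size=P)         # placeholder popularity shifts
mu = rng.normal(size=K)          # approximate posterior mean
sigma = np.full(K, 0.5)          # approximate posterior std (diagonal)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# "MC": average the softmax over samples from q(omega).
S = 1000
mc = np.mean([softmax(beta @ (mu + sigma * rng.normal(size=K)) + rho)
              for _ in range(S)], axis=0)

# "mean": plug in the posterior mean mu and ignore Sigma entirely.
point = softmax(beta @ mu + rho)
```

Because the softmax is nonlinear, the two distributions differ: averaging over the posterior spreads probability mass more evenly than plugging in the mean, with the difference growing with the posterior uncertainty.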
5. Experimental Setup
We demonstrate that our method produces useful user representations via next item prediction using the RecoGym simulation environment (Rohde et al., 2018). RecoGym is a framework for simulating a recommender system and enables the simulation of A/B tests, although here we simply use it to create organic sequences of item views and to test the model's ability to do next item prediction; this allows us to compute the same metrics as on standard offline datasets. We also present results on the YooChoose dataset (Ben-Shimon et al., 2015). We split both datasets into train and test sets such that sessions reside entirely in one of the two groups. We fit the model to the training set; we then evaluate by providing the model with the first t events of a session and testing its ability to predict event t + 1.
5.1. Implementation Details
All the models, including the relevant baselines, have been implemented using the PyTorch automatic differentiation package in Python (Paszke et al., 2017). All models are updated via Stochastic Gradient Descent (SGD), specifically the RMSProp variant. We set the learning rate to 0.001 and tune the other hyperparameters, including L2 regularization, for each dataset based upon a validation set. The dataset-specific hyperparameter values are reported in Section 6 with the relevant results.
5.2. Performance Metrics
The various models are evaluated using recall at K (RC@K) and truncated discounted cumulative gain at K (DCG@K), which are defined below.
Let r_k denote the item with the kth highest predicted next item probability. Then RC@K = 1{v_{u,t+1} ∈ {r_1, …, r_K}} and DCG@K = Σ_{k=1}^{K} 1{v_{u,t+1} = r_k} / log_2(k + 1). For all results presented in this paper, we set K to five.
We compute the average of these quantities over all sessions in the test set.
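A straightforward implementation of these metrics for a single test event might look like the following sketch; the score vector is hypothetical.

```python
import numpy as np

def recall_at_k(scores, true_item, k=5):
    """RC@K: 1 if the held-out item appears among the K highest-scored items, else 0."""
    top_k = np.argsort(scores)[::-1][:k]
    return float(true_item in top_k)

def dcg_at_k(scores, true_item, k=5):
    """Truncated DCG@K: 1 / log2(rank + 1) if the held-out item ranks within K, else 0."""
    top_k = np.argsort(scores)[::-1][:k]
    hits = np.where(top_k == true_item)[0]
    return 0.0 if len(hits) == 0 else 1.0 / np.log2(hits[0] + 2)

scores = np.array([0.10, 0.50, 0.20, 0.15, 0.03, 0.02])  # hypothetical next-item scores
rc = recall_at_k(scores, true_item=2)    # item 2 has the 2nd highest score
dcg = dcg_at_k(scores, true_item=2)
```

Averaging these per-event values over all test sessions yields the reported RC@5 and DCG@5 figures.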
5.3. Latent Variable Inference
We consider three alternative methods for training the model:

Bouch/AE  A linear variational autoencoder using the Bouchard bound.

RT/AE  A linear variational autoencoder using the reparameterization trick.

RT/Deep AE  A deep autoencoder again using the reparameterization trick. The deep autoencoder consists of mapping an input of size P to three linear rectifier layers of K units each. We encountered numerical problems using the Bouchard bound with a deep autoencoder.
When we update the posterior over a user's latent variable representation at test time, we assess both using the autoencoder (denoted AE) and using 100 iterations of the EM algorithm (denoted EM) in the results.
When we compute next item predictions we consider both a 100-sample Monte Carlo approximation (denoted MC) and simply taking the mean as a point estimate (denoted mean), which uses only μ and correspondingly ignores Σ.
5.4. Baselines
To demonstrate the effectiveness of our approach, we present results from the following baseline approaches:
5.4.1. Popularity
Item popularity provides no personalization, but is nonetheless a strong baseline for certain recommendation tasks.
5.4.2. Item KNN
Item K Nearest Neighbors (KNN) involves computing the correlation matrix of the sample data, adding the identity to prevent division by zero, and then using these correlations as recommendations based on a user's most recent historical item. The limitations of this technique are that it ignores item popularity and multiple items in the user's history, but despite these limitations it is often a strong baseline.
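A minimal sketch of this baseline on a toy session-by-item count matrix follows; the data are hypothetical, and the cosine-style normalization shown here is one reasonable reading of the description above rather than the exact computation used in the experiments.

```python
import numpy as np

# Hypothetical session-by-item count matrix (rows: sessions, columns: items).
X = np.array([[2., 1., 0., 0.],
              [0., 1., 1., 0.],
              [1., 0., 0., 1.]])

co = X.T @ X                        # item-item co-occurrence counts
co = co + np.eye(co.shape[0])       # add the identity to avoid division by zero
norms = np.sqrt(np.diag(co))
sim = co / np.outer(norms, norms)   # normalized item-item similarity

def recommend(last_item, k=2):
    """Rank items by similarity to the user's most recent item only."""
    scores = sim[last_item].copy()
    scores[last_item] = -np.inf     # never recommend the item just viewed
    return np.argsort(scores)[::-1][:k]

recs = recommend(0)
```

Note how the method conditions only on `last_item`, which is exactly the limitation discussed above: earlier items in the session and overall popularity play no role.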
5.4.3. Recurrent Neural Network
For this baseline, we make use of a recurrent neural network to learn a user representation by predicting the next item in the session. The model architecture we employ is similar to that of (Hidasi and Karatzoglou, 2018b), in that we feed the output from an embedding layer into a Gated Recurrent Unit (GRU) (Cho et al., 2014) with 64 hidden units to learn the temporal dynamics of the user's session. The output from the GRU is then passed through a final softmax layer which gives the probability of the next item in the sequence. The network is trained to minimize the categorical cross-entropy over the training sessions via RMSProp.
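For illustration, a single GRU step and next-item softmax can be sketched in NumPy as follows. The actual baseline is implemented in PyTorch with trained parameters; the weights here are random placeholders, the sizes are illustrative (the baseline uses 64 hidden units), and the gate parameterization follows the standard Cho et al. formulation.

```python
import numpy as np

rng = np.random.default_rng(4)
P, H = 7, 8   # illustrative vocabulary and hidden sizes

# Embedding -> GRU -> softmax parameters (random placeholders, not trained).
E = rng.normal(size=(P, H)) * 0.1                        # item embedding layer
Wz, Wr, Wh = (rng.normal(size=(H, 2 * H)) * 0.1 for _ in range(3))
Wo = rng.normal(size=(P, H)) * 0.1                       # output softmax layer

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, item):
    """One GRU update on the embedded item (update gate z, reset gate r)."""
    x = E[item]
    hx = np.concatenate([h, x])
    z = sigmoid(Wz @ hx)                                 # update gate
    r = sigmoid(Wr @ hx)                                 # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h, x]))   # candidate state
    return (1 - z) * h + z * h_tilde

h = np.zeros(H)
for item in [0, 1, 0]:            # a hypothetical session
    h = gru_step(h, item)

logits = Wo @ h
probs = np.exp(logits - logits.max())
probs /= probs.sum()              # next-item distribution
```

In contrast to the latent variable model, the hidden state h is an arbitrary learned function of the ordered sequence, which is the source of both the RNN's extra flexibility and its greater data requirements.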
6. Results
Table 2. Results on RecoGym with 20 products.
Train Algorithm  Online Latent  Online Next Item  RC@5  DCG@5
Pop  0.456  0.440  
ItemKNN  0.461  0.492  
RNN  0.620  0.646  
Bouch/AE  AE  MC  0.712  0.796 
Bouch/AE  AE  mean  0.712  0.777 
Bouch/AE  EM  MC  0.738  0.796 
Bouch/AE  EM  mean  0.748  0.796 
RT/AE  AE  MC  0.707  0.802 
RT/AE  AE  mean  0.697  0.784 
RT/AE  EM  MC  0.738  0.802 
RT/AE  EM  mean  0.733  0.802 
RT/Deep AE  AE  MC  0.697  0.785 
RT/Deep AE  AE  mean  0.717  0.775 
RT/Deep AE  EM  MC  0.733  0.785 
RT/Deep AE  EM  mean  0.733  0.787 
6.1. RecoGym
For our first experiment we use the RecoGym simulator with 20 products and a static user state. With this we generate a training set of 100 sessions and a test set of 1000 sessions, resulting in 17,161 and 176,804 events for train and test respectively. The latent variable algorithms were all trained for 5000 epochs using the RMSProp algorithm and an embedding dimension of 10. The RNN was trained for 5000 epochs with the same embedding size, again using RMSProp. The results are presented in Table 2, which shows that the Bouchard method of training, using the EM algorithm for predicting latent variables and Monte Carlo for predicting the next item, was the best performing algorithm on the RC@5 metric; RT/AE performed slightly better on the DCG@5 metric using either the EM algorithm or the autoencoder with Monte Carlo.

Table 3. Results on RecoGym with 2000 products.
Train Algorithm  Online Latent  Online Next Item  RC@5  DCG@5
ItemKNN  0.020  0.024  
Pop  0.020  0.016  
RNN  0.035  0.033  
Bouch/AE  AE  MC  0.082  0.128 
Bouch/AE  AE  mean  0.082  0.079 
Bouch/AE  EM  MC  0.117  0.128 
Bouch/AE  EM  mean  0.117  0.130 
RT/AE  AE  MC  0.061  0.047 
RT/AE  AE  mean  0.056  0.059 
RT/AE  EM  MC  0.051  0.047 
RT/AE  EM  mean  0.051  0.047 
RT/Deep AE  AE  MC  0.090  0.105 
RT/Deep AE  AE  mean  0.080  0.068 
RT/Deep AE  EM  MC  0.090  0.105 
RT/Deep AE  EM  mean  0.090  0.106 
For our second experiment we use the RecoGym simulator with 2000 products and a static user state. We generate a training set of 100 sessions and a test set of 100 sessions, resulting in 21,852 and 19,533 events for train and test respectively. The latent variable algorithms were all trained for 15,000 epochs using the RMSProp algorithm, with the embedding size set to 10. The RNN was trained with K = 200 for 5000 epochs (it performed slightly worse with a training run of 25,000 epochs). The results are shown in Table 3; again the Bouchard method of training, using the EM algorithm for predicting latent variables and Monte Carlo for predicting the next item, was the best performing algorithm on both the RC@5 and DCG@5 metrics.
6.2. YooChoose
Table 4. Results on YooChoose (filtered to the 100 most popular products).
Train Algorithm  Online Latent  Online Next Item  RC@5  DCG@5
Pop  0.143  0.147  
ItemKNN  0.804  0.921  
RNN  0.690  0.781  
Bouch/AE  AE  MC  0.433  0.420 
Bouch/AE  AE  mean  0.451  0.562 
Bouch/AE  EM  MC  0.386  0.420 
Bouch/AE  EM  mean  0.429  0.497 
RT/AE  AE  MC  0.495  0.731 
RT/AE  AE  mean  0.616  0.658 
RT/AE  EM  MC  0.693  0.731 
RT/AE  EM  mean  0.707  0.768 
RT/Deep AE  AE  MC  0.751  0.868 
RT/Deep AE  AE  mean  0.771  0.876 
RT/Deep AE  EM  MC  0.772  0.868 
RT/Deep AE  EM  mean  0.775  0.873 
For our third experiment we use the YooChoose dataset filtered to the 100 most popular products. This is a strong filter of YooChoose's 60,000 products, but it allows for effective experimentation and still results in 2,905,816 events and 28,286 events for the training and test sets respectively. The deep autoencoder latent variable algorithm was trained for 100 epochs, as were the linear Bouchard autoencoder and the reparameterization trick autoencoder; the RNN was trained for a single epoch with an embedding size of 20, as longer training runs were observed to cause overfitting and reduced performance. All latent variable models were trained at full rank, i.e. K = P = 100. The results are shown in Table 4. In this case the ItemKNN model performs best on both metrics, with the deep autoencoder trained using the reparameterization trick performing slightly worse; the best performing setups involve predicting using the mean method, and there was very little difference between predicting with the EM algorithm and with the autoencoder on this dataset.
The ItemKNN baseline turned out to be very strong. This is most likely due to the fact that we filtered the dataset to just 100 popular products allowing full rank covariance estimation. The latent variable model also operating at full rank was unable to perform quite as well. Another notable difference in the two methods is that ItemKNN just looks at the most recent event where the latent variable session model combines all history. If the most recent event contains more relevant history this may advantage ItemKNN.
6.3. Interpretation of Results
The model we present is very closely aligned with the internal model of the RecoGym simulator, hence the strong performance here of all the variants of our model. It is perhaps surprising that next item prediction using just the posterior mean performed similarly well to the Monte Carlo approach; the value gained by the EM algorithm was also marginal. Given that an RNN has the capacity to model very complex data such as language, it is perhaps unsurprising that it performs poorly on the RecoGym 2000 product dataset, where the sample is relatively small.
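The two prediction rules being compared, plugging the posterior mean into the softmax versus Monte Carlo averaging of the softmax over posterior samples, can be sketched as follows. The setup here is hypothetical: a diagonal Gaussian posterior over the user embedding and randomly generated item embeddings stand in for the fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical posterior over the user embedding: N(mu, diag(sigma^2)).
K, P = 5, 10                    # latent dimension, number of items
Psi = rng.normal(size=(P, K))   # item embeddings (placeholder values)
mu = rng.normal(size=K)         # posterior mean
sigma = 0.5 * np.ones(K)        # posterior standard deviations

# "mean" prediction: plug the posterior mean into the softmax.
p_mean = softmax(Psi @ mu)

# "MC" prediction: average the softmax over posterior samples.
S = 1000
samples = mu + sigma * rng.normal(size=(S, K))
p_mc = np.array([softmax(Psi @ w) for w in samples]).mean(axis=0)

print(np.abs(p_mean - p_mc).max())  # gap between the two rules
```

When the posterior is tight or the softmax is close to linear over the posterior's support the two rules agree closely, which is consistent with the small differences observed above.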
For the YooChoose 100 product dataset the ItemKNN algorithm proved very effective. The Deep AE was the closest performer, with the EM MC variant the best by a small margin. The fact that the Deep AE performs best, and that the linear autoencoders improve substantially when using the EM algorithm, both suggest that a linear autoencoder is not sufficient for this problem.
7. Conclusion
Recommender systems are increasingly using embeddings to represent items, and a user's session on the recommender system will then involve interactions with many of these items. We have demonstrated an elegant algorithm for taking a user's history of varying length and summarizing it with a posterior distribution over a user embedding of the same dimension as the product embedding. Sensible behavior, such as higher uncertainty when the user has a short history and lower uncertainty when the user has a longer one, is a feature of this model formulation. We have demonstrated how to train the model to produce item embeddings using a variational autoencoder, either with the reparameterization trick or using the Bouchard bound. Similarly, a user history containing multiple items can be rapidly converted to a user embedding using a variational autoencoder or using the EM algorithm (although the latter is constrained to small numbers of products because it requires summations over the full catalogue).
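The behavior described here, uncertainty shrinking as the history grows, can be illustrated with a deliberately simplified conjugate stand-in. The paper's model uses a softmax likelihood; the sketch below assumes instead that each observed item embedding is a Gaussian pseudo-observation of the user embedding, which makes the posterior update closed form.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simplified sketch: prior omega ~ N(0, I); each observed event contributes
# a pseudo-observation psi_i ~ N(omega, tau2 * I). This linear-Gaussian
# stand-in (not the paper's softmax model) shows how posterior uncertainty
# falls with history length.
K, tau2 = 5, 1.0
events = rng.normal(size=(8, K))  # embeddings of 8 observed items

for n in (1, 4, 8):
    history = events[:n]
    post_var = 1.0 / (1.0 + n / tau2)         # per-dimension posterior variance
    post_mean = post_var * history.sum(axis=0) / tau2
    print(n, round(post_var, 3))              # variance: 0.5, 0.2, 0.111...
```

A one-event history leaves the posterior wide (variance 0.5 here), while eight events shrink it to 1/9, mirroring the qualitative behavior of the full model.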
A complication of latent variable methods is the need to perform a numerical integration at prediction time. The EM algorithm presented has excellent stability properties, but it scales poorly when the number of items reaches the tens of thousands. There are several lines of interesting work that could speed up this evaluation. Alternatively, using already well understood techniques, we could simply use a variational autoencoder, which also produces a rapid approximation of the integral.
There are numerous possible extensions to the training algorithm. Training requires a normalization over the full item catalogue, which can be prohibitive; methods such as those outlined in (Ruiz et al., 2018) may be adaptable to this model. Finally, the model could be extended to handle time in a more sophisticated way and to incorporate the feedback to recommendations, rather than being exclusively built for next item prediction.
References
 Barkan (2017) Oren Barkan. 2017. Bayesian Neural Word Embedding, See Singh and Markovitch (2017), 3135–3143. http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14653
 Ben-Shimon et al. (2015) David Ben-Shimon, Alexander Tsikinovsky, Michael Friedmann, Bracha Shapira, Lior Rokach, and Johannes Hoerle. 2015. RecSys challenge 2015 and the YooChoose dataset. In Proceedings of the 9th ACM Conference on Recommender Systems. ACM, New York, NY, USA, 357–358.
 Bouchard (2007) Guillaume Bouchard. 2007. Efficient bounds for the softmax function, applications to inference in hybrid models.

 Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar. ACL, 1724–1734. http://aclweb.org/anthology/D/D14/D141179.pdf
 Cuzzocrea et al. (2018) Alfredo Cuzzocrea, James Allan, Norman W. Paton, Divesh Srivastava, Rakesh Agrawal, Andrei Z. Broder, Mohammed J. Zaki, K. Selçuk Candan, Alexandros Labrinidis, Assaf Schuster, and Haixun Wang (Eds.). 2018. Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM 2018, Torino, Italy, October 22-26, 2018. ACM. http://dl.acm.org/citation.cfm?id=3269206
 De Finetti (1980) Bruno De Finetti. 1980. Foresight: Its logical laws, its subjective sources (1937). Studies in subjective probability (1980), 55–118.
 Dikmen and Févotte (2011) Onur Dikmen and Cédric Févotte. 2011. Nonnegative dictionary learning in the exponential noise model for adaptive music signal representation. In Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 2267–2275.
 Greenwald (1976) Anthony G Greenwald. 1976. Within-subjects designs: To use or not to use? Psychological Bulletin 83, 2 (1976), 314.
 Gunawardana et al. (2009) Asela Gunawardana, Christopher Meek, et al. 2009. A unified approach to building hybrid recommender systems. RecSys 9 (2009), 117–124.
 Hidasi and Karatzoglou (2018a) Balázs Hidasi and Alexandros Karatzoglou. 2018a. Recurrent Neural Networks with Top-k Gains for Session-based Recommendations, See Cuzzocrea et al. (2018), 843–852. https://doi.org/10.1145/3269206.3271761
 Hidasi and Karatzoglou (2018b) Balázs Hidasi and Alexandros Karatzoglou. 2018b. Recurrent Neural Networks with Top-k Gains for Session-based Recommendations, See Cuzzocrea et al. (2018), 843–852. https://doi.org/10.1145/3269206.3271761
 Hoffman et al. (2013) Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. The Journal of Machine Learning Research 14, 1 (2013), 1303–1347.

 Jaakkola and Jordan (1997) Tommi Jaakkola and Michael Jordan. 1997. A variational approach to Bayesian logistic regression models and their extensions. In Sixth International Workshop on Artificial Intelligence and Statistics, Vol. 82. 4.
 Kang and McAuley (2018) Wang-Cheng Kang and Julian McAuley. 2018. Self-Attentive Sequential Recommendation. In IEEE International Conference on Data Mining, ICDM 2018, Singapore, November 17-20, 2018. IEEE Computer Society, 197–206. https://doi.org/10.1109/ICDM.2018.00035
 Kingma and Welling (2014) Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1312.6114
 Knowles and Minka (2011) David A. Knowles and Tom Minka. 2011. Non-conjugate Variational Message Passing for Multinomial and Binary Regression. In Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 1701–1709.
 Kucukelbir et al. (2017) Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. 2017. Automatic Differentiation Variational Inference. Journal of Machine Learning Research 18, 14 (2017), 1–45. http://jmlr.org/papers/v18/16107.html
 Lafferty and Blei (2006) John D. Lafferty and David M. Blei. 2006. Correlated Topic Models. In Advances in Neural Information Processing Systems 18, Y. Weiss, B. Schölkopf, and J. C. Platt (Eds.). MIT Press, 147–154. http://papers.nips.cc/paper/2906correlatedtopicmodels.pdf

 Liang et al. (2018) Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, and Tony Jebara. 2018. Variational Autoencoders for Collaborative Filtering. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018, Pierre-Antoine Champin, Fabien L. Gandon, Mounia Lalmas, and Panagiotis G. Ipeirotis (Eds.). ACM, 689–698. https://doi.org/10.1145/3178876.3186150
 Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 3111–3119. http://papers.nips.cc/paper/5021distributedrepresentationsofwordsandphrasesandtheircompositionality.pdf
 Nolan and Wand (2017) Tui H Nolan and Matt P Wand. 2017. Accurate logistic variational message passing: algebraic and numerical details. Stat 6, 1 (2017), 102–112.
 Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPSW.
 Quadrana et al. (2018) Massimo Quadrana, Paolo Cremonesi, and Dietmar Jannach. 2018. Sequence-Aware Recommender Systems. ACM Comput. Surv. 51, 4 (2018), 1–66. https://doi.org/10.1145/3190616
 Quadrana et al. (2017) Massimo Quadrana, Alexandros Karatzoglou, Balázs Hidasi, and Paolo Cremonesi. 2017. Personalizing Session-based Recommendations with Hierarchical Recurrent Neural Networks. In Proceedings of the Eleventh ACM Conference on Recommender Systems, RecSys 2017, Como, Italy, August 27-31, 2017, Paolo Cremonesi, Francesco Ricci, Shlomo Berkovsky, and Alexander Tuzhilin (Eds.). ACM, New York, NY, USA, 130–137. https://doi.org/10.1145/3109859.3109896
 Ranganath et al. (2014) Rajesh Ranganath, Sean Gerrish, and David M. Blei. 2014. Black Box Variational Inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 22-25, 2014 (JMLR Workshop and Conference Proceedings), Vol. 33. JMLR.org, 814–822. http://jmlr.org/proceedings/papers/v33/ranganath14.html

Rezende
et al. (2014)
Danilo Jimenez Rezende,
Shakir Mohamed, and Daan Wierstra.
2014.
Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In
Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 2126 June 2014 (JMLR Workshop and Conference Proceedings), Vol. 32. JMLR.org, 1278–1286. http://jmlr.org/proceedings/papers/v32/rezende14.html  Robbins and Monro (1951) Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. The annals of mathematical statistics 22, 3 (1951), 400–407.

Rohde et al. (2018)
David Rohde, Stephen
Bonner, Travis Dunlop, Flavian Vasile,
and Alexandros Karatzoglou.
2018.
RecoGym: A Reinforcement Learning Environment for the problem of Product Recommendation in Online Advertising. In
REVEAL workshop, ACM Conference on Recommender Systems 2018.  Rohde and Wand (2016) David Rohde and Matt P Wand. 2016. Semiparametric mean field variational Bayes: General principles and numerical issues. The Journal of Machine Learning Research 17, 1 (2016), 5975–6021.
 Ruiz et al. (2018) Francisco J. R. Ruiz, Michalis K. Titsias, Adji B. Dieng, and David M. Blei. 2018. Augment and Reduce: Stochastic Inference for Large Categorical Distributions. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018 (Proceedings of Machine Learning Research), Jennifer G. Dy and Andreas Krause (Eds.), Vol. 80. PMLR, 4400–4409. http://proceedings.mlr.press/v80/ruiz18a.html
 Shani et al. (2005) Guy Shani, David Heckerman, and Ronen I Brafman. 2005. An MDP-based recommender system. Journal of Machine Learning Research 6, Sep (2005), 1265–1295.
 Singh and Markovitch (2017) Satinder P. Singh and Shaul Markovitch (Eds.). 2017. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA. AAAI Press. http://www.aaai.org/Library/AAAI/aaai17contents.php

Smirnova and
Vasile (2017)
Elena Smirnova and
Flavian Vasile. 2017.
Contextual Sequence Modeling for Recommendation
with Recurrent Neural Networks. In
Proceedings of the 2Nd Workshop on Deep Learning for Recommender Systems
(DLRS 2017). ACM, New York, NY, USA, 2–9. https://doi.org/10.1145/3125486.3125488  Tan et al. (2016) Yong Kiam Tan, Xinxing Xu, and Yong Liu. 2016. Improved Recurrent Neural Networks for Sessionbased Recommendations. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRS@RecSys 2016, Boston, MA, USA, September 15, 2016, Alexandros Karatzoglou, Balázs Hidasi, Domonkos Tikk, Oren Sar Shalom, Haggai Roitman, Bracha Shapira, and Lior Rokach (Eds.). ACM, 17–22. https://doi.org/10.1145/2988450.2988452
 Team (2018) Stan Development Team. 2018. PyStan: the Python interface to Stan, Version 2.17.1.0.
 Welling et al. (2008) Max Welling, Chaitanya Chemudugunta, and Nathan Sutter. 2008. Deterministic Latent Variable Models and Their Pitfalls. In Proceedings of the SIAM International Conference on Data Mining, SDM 2008, April 24-26, 2008, Atlanta, Georgia, USA. SIAM, 196–207. https://doi.org/10.1137/1.9781611972788.18
 Ying et al. (2018) Haochao Ying, Fuzheng Zhang, Yanchi Liu, Guandong Xu, Xing Xie, Hui Xiong, and Jian Wu. 2018. Sequential Recommender System based on Hierarchical Attention Networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, Jérôme Lang (Ed.). ijcai.org, 3926–3932. https://doi.org/10.24963/ijcai.2018/546
 Zolna and Romanski (2017) Konrad Zolna and Bartlomiej Romanski. 2017. User Modeling Using LSTM Networks, See Singh and Markovitch (2017), 5025–5027. http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14220