Optimizing Slate Recommendations via Slate-CVAE

03/05/2018 ∙ by Ray Jiang, et al. ∙ Google 0

The slate recommendation problem aims to find the "optimal" ordering of a subset of documents to be presented on a surface that we call "slate". The definition of "optimal" changes depending on the underlying applications but a typical goal is to maximize user engagement with the slate. Solving this problem at scale is hard due to the combinatorial explosion of documents to show and their display positions on the slate. In this paper, we introduce Slate Conditional Variational Auto-Encoders (Slate-CVAE) to generate optimal slates. To the best of our knowledge, this is the first conditional generative model that provides a unified framework for slate recommendation by direct generation. Slate-CVAE automatically takes into account the format of the slate and any biases that the representation causes, thus truly proposing the optimal slate. Additionally, to deal with large corpora of documents, we present a novel approach that uses pretrained document embeddings combined with a soft-nearest-neighbors layer within our CVAE model. Experiments show that on the simulated and real-world datasets, Slate-CVAE outperforms recommender systems that consists of greedily ranking documents by a significant margin while remaining scalable.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Recommender systems modeling is an important machine learning area in the IT industry, powering online advertisement, social networks and various content recommendation services

(Schafer et al., 2001; Lu et al., 2015). In the context of document recommendation, its aim is to generate and display an ordered list of “documents” to users (called a “slate” in Swaminathan et al. (2017); Sunehag et al. (2015)), based on both user preferences and documents content. For large scale recommender systems, a common scalable approach at inference time is to first select a small subset of candidate documents out of the entire document pool

. This step is called “candidate generation”. Then a function approximator such as a neural network (e.g., a Multi-Layer Perceptron (MLP)) called the “ranking model” is used to predict probabilities of user engagements for each document in the small subset

and greedily generates a slate by sorting the top documents from

based on estimated prediction scores

(Covington et al., 2016). This two-step process is widely popular to solve large scale recommendation problems due to its scalability and fast inference at serving time. The candidate generation step can decrease the number of candidates from millions to hundreds or less, effectively dealing with scalability when faced with a large corpus of documents . Since is much smaller than , the ranking model can be reasonably complicated without increasing latency.

However, there are two main problems with this approach. First the candidate generation and the ranking models are not trained jointly, which can lead to having candidates in that are not the highest scoring documents of the ranking model. Second and most importantly, the greedy ranking method suffers from numerous biases that come with the visual presentation of the slate and context in which documents are presented, both at training and serving time. For example, there exists positional biases caused by users paying more attention to prominent slate positions (Joachims et al., 2005), and contextual biases, due to interactions between documents presented together in the same slate, such as competition and complementarity, relative attractiveness, etc. (Yue et al., 2010).

In this paper, we propose a paradigm shift from the traditional viewpoint of solving a ranking problem to a direct slate generation framework. We consider a slate “optimal” when it maximizes some type of user engagement feedback, a typical desired scenario in recommender systems. For example, given a database of song tracks, the optimal slate can be an ordered list (in time or space) of songs such that the user ideally likes every song in that list. Another example considers news articles, the optimal slate has

ordered articles such that every article is read by the user. In general, optimality can be defined as a desired user response vector on the slate and the proposed model should be agnostic to these problem-specific definitions. Solving the slate recommendation problem by direct slate generation differs from ranking in that first, the entire slate is used as a training example instead of single documents, preserving numerous biases encoded into the slate that might influence user responses. Secondly, it does not assume that more relevant documents should necessarily be put in earlier positions in the slate at serving time. Our model directly generates slates, taking into account all the relevant biases learned through training.

In this paper, we apply Conditional Variational Auto-Encoders (CVAEs) (Kingma et al., 2014; Kingma and Welling, 2013; Jimenez Rezende et al., 2014) to model the distributions of all documents in the same slate conditioned on the user response. All documents in a slate along with their positional, contextual biases are jointly encoded into the latent space, which is then sampled and combined with desired conditioning for direct slate generation, i.e. sampling from the learned conditional joint distribution. Therefore, the model first learns which slates give which type of responses and then directly generates similar slates given a desired response vector as the conditioning at inference time. We call our proposed model List-CVAE. The key contributions of our work are:

  1. To the best of our knowledge, this is the first model that provides a conditional generative modeling framework for slate recommendation by direct generation. It does not necessarily require a candidate generator at inference time and is flexible enough to work with any visual presentation of the slate as long as the ordering of display positions is fixed throughout training and inference times.

  2. To deal with the problem at scale, we introduce an architecture that uses pretrained document embeddings combined with a negatively downsampled

    -head softmax layer within the List-CVAE model, where

    is the slate size.

The structure of this paper is the following. First we introduce related work using various CVAE-type models as well as other approaches to solve the slate generation problem. Next we introduce our List-CVAE modeling approach. The last part of the paper is devoted to experiments on both simulated and the real-world datasets.

2 Related Work

(a)
(b)
(c)
(d)
(e)
Figure 1: Comparison of related variants of VAE models. Note that user variables are not included in the graphs for clarity. LABEL:sub@fig:VAE_model VAE; LABEL:sub@fig:CVAECFAux_model CVAE-CF with auxiliary variables; LABEL:sub@fig:JVAECFJoint Variational Auto-Encoder-Collaborative Filtering (JVAE-CF); LABEL:sub@fig:JMVAE JMVAE; and, LABEL:sub@fig:CVAESlate List-CVAE (ours) with the whole slate as input.

Traditional matrix factorization techniques have been applied to recommender systems with success in modeling competitions such as the Netflix Prize (Koren et al., 2009)

. Later research emerged on using autoencoders to improve on the results of matrix factorization

(Wu et al., 2016; Wang et al., 2015)

(CDAE, CDL). More recently several works use Boltzmann Machines

(Abdollahi and Nasraoui, 2016) and variants of VAE models in the Collaborative Filtering (CF) paradigm to model recommender systems (Li and She, 2017; Lee et al., 2017; Liang et al., 2018) (Collaborative VAE, JMVAE, CVAE-CF, JVAE-CF). See Figure 1 for model structure comparisons. In this paper, unless specified otherwise, the user features and any context are routinely considered part of the conditioning variables (in Appendix A Personalization Test, we test List-CVAE generating personalized slates for different users). These models have primarily focused on modeling individual document or pairs of documents in the slate and applying greedy ordering at inference time.

Our model is also using a VAE type structure and in particular, is closely related to the Joint Multimodel Variational Auto-Encoder (JMVAE) architecture (Figure LABEL:fig:JMVAE). However, we use whole slates as input instead of single documents, and directly generate slates instead of using greedy ranking by prediction scores.

Other relevant work from the Information Retrieval (IR) literature are listwise ranking methods (Cao et al., 2007; Xia et al., 2008; Shi et al., 2010; Huang et al., 2015; Ai et al., 2018)

. These methods use listwise loss functions that take the contexts and positions of training examples into account. However, they eventually assign a prediction score for each document and greedily rank them at inference time.

In the Reinforcement Learning (RL) literature,

Sunehag et al. (2015) view the whole slates as actions and use a deterministic policy gradient update to learn a policy that generates these actions, given concatenated document features as input.

Finally, the framework proposed by (Wang et al., 2016) predicts user engagement for document and position pairs. It optimizes whole page layouts at inference time but may suffer from poor scalability due to the combinatorial explosion of all possible document position pairs.

3 Method

3.1 Problem Setup

We formally define the slate recommendation problem as follows. Let denote a corpus of documents and let be the slate size. Then let be the user response vector, where is the user’s response on document . For example, if the problem is to maximize the number of clicks on a slate, then let denote whether the document is clicked, and thus an optimal slate where is such that maximizes .

3.2 Variational Auto-Encoders

Variational Auto-Encoders (VAEs) are latent-variable models that define a joint density between observed variables and latent variables parametrized by a vector . Training such models requires marginalizing the latent variables in order to maximize the data likelihood . Since we cannot solve this marginalization explicitly, we resort to a variational approximation. For this, a variational posterior density parametrized by a vector is introduced and we optimize the variational Evidence Lower-Bound (ELBO) on the data log-likelihood:

(1)
(2)

where KL is the Kullback–Leibler divergence and where

is a prior distribution over latent variables. In a Conditional VAE (CVAE) we extend the distributions and to also depend on an external condition . The corresponding distributions are indicated by and . Taking the conditioning into account, we can write the variational loss to minimize as

(3)

MLP

Encoder

Embedding

Embedding

Embedding

MLP

MLP

Dot-product

Dot-product

Softmax

Softmax

Softmax

Decoder

Dot-product

Embedding

Embedding

Embedding

(a)

MLP

MLP

Dot-product

Dot-product

Max

Max

Max

Decoder

Dot-product

Embedding

Embedding

Embedding

(b)
Figure 2: Structure of List-CVAE for both LABEL:sub@fig:cvae_model_1 training and LABEL:sub@fig:cvae_model_2 inference. is the input slate. is the conditioning vector, where is the user responses on the slate . The concatenation of and makes the input vector to the encoder. The latent variable has a learned prior distribution . The raw output from the decoder are vectors , each of which is mapped to a real document through taking the dot product with the matrix containing all document embeddings. Thus produced

vectors of logits are then passed to the negatively downsampled

-head softmax operation. At inference time, is the ideal condition whose concatenation with sampled is the input to the decoder.

3.3 Our Model

We assume that the slates and the user response vectors are jointly drawn from a distribution . In this paper, we use a CVAE to model the joint distribution of all documents in the slate conditioned on the user responses , i.e. . At inference time, the List-CVAE model attempts to generate an optimal slate by conditioning on the ideal user response .

As we explained in Section 1, “optimality” of a slate depends on the task. With that in mind, we define the mapping . It transforms a user response vector into a vector in the conditioning space that encodes the user engagement metric we wish to optimize for. For instance, if we want to maximize clicks on the slate, we can use the binary click response vectors and set the conditioning to . Then at inference time, the corresponding ideal user response would be , and correspondingly the ideal conditioning would be .

(a)
(b)
(c)
Figure 3: Predictive prior distribution of the latent variable in , conditioned on ideal user response . The color map corresponds to the expected total responses of the corresponding slates. Plots are generated from the simulation experiment with 1000 documents and slate size 10.

As usual with CVAEs, the decoder models a distribution that, conditioned on z, is easy to represent. In our case, models an independent probability for each document on the slate, represented by a softmax distribution. Note that the documents are only independent to each other conditional on . In fact, the marginalized posterior can be arbitrarily complex. When the encoder encodes the input slate into the latent space, it learns the joint distribution of the documents in a fixed order, and thus also encodes any contextual, positional biases between documents and their respective positions into the latent variable . The decoder learns these biases through reconstruction of the input slates from latent variables with conditions. At inference time, the decoder reproduces the input slate distribution from the latent variable with the ideal conditioning, taking into account all the biases learned during training time.

To shed light onto what is encoded in the latent space, we simplify the prior distribution of

to be a fixed Gaussian distribution

in . We train List-CVAE and plot the predictive prior . As training evolves, generated output slates with low total responses are pushed towards the edge of the latent space while high response slates cluster towards a growing center area (Figure 3). Therefore after training, if we sample from its prior distribution and generate the corresponding output slates, they are more likely to have high total responses.

Since the number of documents in can be large, we first embed the documents into a low dimensional space. Let be that normalized embedding where denotes the unit sphere in . can easily be pretrained using a standard supervised model that predicts user responses from documents or through a standard auto-encoder technique. For the -th document in the slate, our model produces a vector in that is then matched to each document from via a dot-product. This operation produces vectors of logits for softmaxes, i.e. the -head softmax. At training time, for large document corpora , we uniformly randomly downsample negative documents and compute only a small subset of the logits for every training example, therefore efficiently scaling the nearest neighbor search to millions of documents with minimal model quality loss.

We train this model as a CVAE by minimizing the sum of the reconstruction loss and the KL-divergence term:

(4)

where is a function of the training step (Higgins et al., 2017).

During inference, output slates are generated by first sampling from the conditionally learned prior distribution , concatenating with the ideal condition , and passed into the decoder, generating from the learned , and finally taking over the dot-products with the full embedding matrix independently for each position .

4 Experiments

4.1 Simulation Data

Setup:

The simulator generates a random matrix

where each element represents the interaction between document at position and document at position , and . It simulates biases caused by layouts of documents on the slate (below, we set and ). Every document has a probability of engagement representing its innate attractiveness. User responses are computed by multiplying with interaction multipliers for each document presented before on the slate. Thus the user response

(5)

for , where

represents the Bernoulli distribution.

During training, all models see uniformly randomly generated slates and their generated responses . During inference time, we generate slates by conditioning on . We do not require document de-duplication since repetition may be desired in certain applications (e.g. generating temporal slates in an online advertisement session). Moreover List-CVAE should learn to produce the optimal slates whether those slates contain duplication or not from learning the training data distribution.

Evaluation:

For evaluation, we cannot use offline ranking evaluation metrics such as Normalized Discounted Cumulative Gain (NDCG)

(Järvelin and Kekäläinen, 2000), Mean Average Precision (MAP) (Baeza-Yates and Ribeiro-Neto, 1999) or Inverse Propensity Score (IPS) (Little and Rubin, 2002), etc. These metrics either require prediction scores for individual documents or assume that more relevant documents should appear in earlier ranking positions, unfairly favoring greedy ranking methods. Moreover, we find it limiting to use various diversity metrics since it is not always the case that a higher diversity-inclusive score gives better slates measured by user’s total responses. Even though these metrics may be more transparent, they do not measure our end goal, which is to maximize the expected number of total responses on the generated slates.

Instead, we evaluate the expected number of clicks over the distribution of generated slates and over the distribution of clicks on each document:

(6)

In practice, we distill the simulated environment of Eq. 5 using the cross-entropy loss onto a neural network model that officiates as our new simulation environment. The model consists of an embedding layer, which encodes documents into 8-dimensional embeddings. It then concatenates the embeddings of all the documents that form a slate and follows this concatenation with two hidden layers and a final softmax layer that predicts the slate response amongst the possible responses. Thus we call it the “response model”. We use the response model to predict user responses on 100,000 sampled output slates for evaluation purposes. This allows us to accurately evaluate our output slates by List-CVAE and all other baseline models.

Baselines:

Our experiments compare List-CVAE with several greedy ranking baselines that are often deployed in industry productions, namely Greedy MLP, Pairwise MLP, Position MLP and Greedy Long Short-Term Memory (LSTM) models. In addition to the greedy baselines, we also compare against auto-regressive (AR) versions of Position MLP and LSTM, as well as randomly-selected slates from the training set as a sanity check.

List-CVAE generates slates . The encoder and decoder of List-CVAE, as well as all the MLP-type models consist of two fully-connected neural network layers of the same size. Greedy MLP trains on pairs and outputs the greedy slate consisting of the top highest scoring documents. Pairwise MLP is an MLP model with a pairwise ranking loss where is the cross entropy loss and

are pairs of documents randomly selected with different user responses from the same slate. We sweep on hyperparameters

and in addition to the shared MLP model structure sweep. Position MLP uses position in the slate as a feature during training time and simply sets it to 0 for fast performance at inference time. AR Position MLP is identical to Position MLP with the exception that the position feature is set to each corresponding slate position at inference time (as such it takes into account position biases). Greedy LSTM is an LSTM model with fully-connect layers before and after the recurrent middle layers. We tune the hyperparameters corresponding to the number of layers and their respective widths. We use sequences of documents that form slates as the input at training time, and use single examples as inputs with sequence length 1 at inference time, which is similar to scoring documents as if they are in the first position of a slate of size 1. Then we greedily rank the documents based on their prediction scores. AR LSTM is identical to Greedy LSTM during training. During inference, however, it selects documents sequentially by first selecting the best document for position 1, then unrolling the LSTM for 2 steps to select the best document for position 2, and so on. This way it takes into account the context of all previous documents and their positions. Random selects slates uniformly randomly from the training set.

(a)
(b)
Figure 4:

Small-scale experiments. The shaded area represent the 95% confidence interval over 20 independent runs. We compare List-CVAE against all baselines on small-scale synthetic data.

Small-scale experiment ():

We use the trained document embeddings from the response model for List-CVAE and all the baseline models. For List-CVAE, we also use trained priors where and is modeled by a small MLP (16, 32). Additionally, since we found little difference between different hyperparameters, we fixed the width of all hidden layers to 128, the learning rate to and the number of latent dimensions to 16. For all other baseline models, we sweep on the learning rates, model structures and any model specific hyperparameters such as for Position MLP and the forget bias for the LSTM model.

Figures LABEL:fig:100_documents_10_slate_8_embedding and LABEL:fig:1000_documents_10_slate_8_embedding_sim show the performance comparison when the number of documents and slate size to . While List-CVAE is not quite capable of reaching a perfect performance of 10 clicks (which is probably even above the optimal upper bound), it easily outperforms all other ranking baselines after only a few training steps. Appendix A includes an additional personalization test.

(a)
(b)
(c)
(d)
Figure 5: Real data experiments: LABEL:sub@fig:100_documents_10_slate_8_embedding Distribution of user responses in the filtered RecSys 2015 YOOCHOOSE Challenge dataset; LABEL:sub@fig:1000_documents_10_slate_8_embedding_sim We compare List-CVAE against all greedy and auto-regressive ranking baselines as well as the Random baseline on a semi-synthetic dataset of 10,000 documents. The shaded area represent the 95% confidence interval over 20 independent runs; LABEL:sub@fig:1million, LABEL:sub@fig:2million We compare List-CVAE against all baselines on semi-synthetic datasets of 1 million and 2 million documents.

4.2 Real-world Data

Due to a lack of publicly available large scale slate datasets, we use the data provided by the RecSys 2015 YOOCHOOSE Challenge (Ben-Shimon et al., 2015). This dataset consists of 9.2M user purchase sessions around 53K products. Each user session contains an ordered list of products on which the user clicked, and whether they decided to buy them. The List-CVAE model can be used on slates with temporal ordering. Thus we form slates of size 5 by taking consecutive clicked products. We then build user responses from whether the user bought them. We remove a portion of slates with no positive responses such that after removal they only account for 50% of the total number of slates. After filtering out products that are rarely bought, we get 375K slates of size 5 and a corpus of 10,000 candidate documents. Figure LABEL:fig:distribution_real_slates shows the user response distribution of the training data. Notice that in the response vector, 0 denotes a click without purchase and 1 denotes a purchase. For example, (1,0,0,0,1) means the user clicked on all five products but bought only the first and the last products.

Medium-scale experiment ():

Similarly to the previous section, we train a two-layer response model that officiates as a new semi-synthetic simulation environment. We use the same hyperparameters used previously. Figure LABEL:fig:10000_documents_5_slate_8_embedding_real shows that List-CVAE outperforms all other baseline models within 500 training steps, which corresponds to having seen less than % of all possible slates.

Large-scale experiment ( 1 million, 2 millions, ):

We synthesize 1,990k documents by adding independent Gaussian noise to the original 10k documents and label the synthetic documents by predicted responses from the response model. Thus the new pool of candidate documents consists of 10k original documents and 1,990k synthetic ones, totaling 2 million documents. To match each of the decoder outputs with real documents, we uniformly randomly downsample the negative document examples keeping in total only 1000 logits (the dot product outputs in the decoder) during training. At inference time, we pick the argmax for each of dot products with the full embedding matrix without sampling. This technique speeds up the total training and inference time for 2 million documents to merely 4 minutes on 1 GPU for both the response model (with 40k training steps) and List-CVAE (with 5k training steps). We ran 2 experiments with 1 million and 2 millions document respectively. From the results shown in Figure LABEL:fig:1million and LABEL:fig:2million, List-CVAE steadily outperforms all other baselines again. The greatly increased number of training examples helped List-CVAE really learn all the interactions between documents and their respective positional biases. The resulting slates were able to receive close to 5 purchases on average due to the limited complexity provided by the response model.

(a)
(b)
(c)
(d)
Figure 6: Generalization test on List-CVAE. All training examples have total responses for . Any slates with higher total responses are eliminated from the training data. The other experiment setups are the same as in the Medium-scale experiment.

Generalization test: In practice, we may not have any close-to-optimal slates in the training data. Hence it is crucial that List-CVAE is able to generalize to unseen optimal conditions. To test its generalization capacity, we use the medium-scale experiment setup on RecSys 2015 dataset and eliminate from the training data all slates where the total user response exceeds some ratio of the slate size , i.e. for . Figure 6 shows test results on increasingly difficult training sets from which to infer on the optimal slates. Without seeing any optimal slates (Figure LABEL:fig:generalization_T1) or slates with 4 or 5 total purchases (Figure LABEL:fig:generalization_T2), List-CVAE can still produce close to optimal slates. Even training on slates with only 0, 1 or 2 total purchases (), List-CVAE still surpasses the performance of all greedy baselines within 1000 steps (Figure LABEL:fig:generalization_T3). Thus demonstrating the strong generalization power of the model. List-CVAE cannot learn much about the interactions between documents given only 0 or 1 total purchase per slate (Figure LABEL:fig:generalization_T4), whereas the MLP-type models learn purchase probabilities of single documents in the same way as in slates with higher responses.

Although evaluation of our model requires choosing the ideal conditioning at or near the edge of the support of , we can always compromise generalization versus performance by controlling in practice. Moreover, interactions between documents are governed by similar mechanisms whether they are from the optimal or sub-optimal slates. As the experiment results indicate, List-CVAE can learn these mechanisms from sub-optimal slates and generalize to optimal slates.

5 Discussion

The List-CVAE model moves away from the conventional greedy ranking paradigm, and provides the first conditional generative modeling framework that approaches slate recommendation problem using direct slate generation. By modeling the conditional probability distribution of documents in a slate directly, this approach not only automatically picks up the positional and contextual biases between documents at both training and inference time, but also gracefully avoids the problem of combinatorial explosion of possible slates when the candidate set is large. The framework is flexible and can incorporate different types of conditional generative models. In this paper we showed its superior performance over popular greedy and auto-regressive baseline models with a conditional VAE model.

In addition, the List-CVAE model has good scalability. We designed an architecture that uses pre-trained document embeddings combined with a negatively downsampled -head softmax layer that greatly speeds up the training, scaling easily to millions of documents.

References

  • Schafer et al. (2001) J. Ben Schafer, Joseph A. Konstan, and John Riedl. E-commerce recommendation applications. Data Mining and Knowledge Discovery, pages 115–153, 2001.
  • Lu et al. (2015) Jie Lu, Dianshuang Wu, Mingsong Mao, Wei Wang, and Guangquan Zhang. Recommender system application developments: A survey. 74, 04 2015.
  • Swaminathan et al. (2017) Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John Langford, Damien Jose, and Imed Zitouni. Off-policy evaluation for slate recommendation. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), 2017.
  • Sunehag et al. (2015) Peter Sunehag, Richard Evans, Gabriel Dulac-Arnold, Yori Zwols, Daniel Visentin, and Ben Coppin.

    Deep reinforcement learning with attention for slate markov decision processes with high-dimensional states and actions.

    2015.
  • Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys), New York, NY, USA, 2016.
  • Joachims et al. (2005) Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. Accurately interpreting clickthrough data as implicit feedback. In 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 154–161, 2005.
  • Yue et al. (2010) Yisong Yue, Rajan Patel, and Hein Roehrig. Beyond position bias: Examining result attractiveness as a source of presentation bias in clickthrough data. In 19th International Conference on World Wide Web (WWW), pages 1011–1018, 2010.
  • Kingma et al. (2014) Diederik P Kingma, Danilo Jimenez Rezende, Shakir Mohamed, and Max Welling. Semi-supervised learning with deep generative models. 2014.
  • Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In Proceedings of the 2nd international conference on Learning Representations (ICLR), 2013.
  • Jimenez Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
  • Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, August 2009. ISSN 0018-9162.
  • Wu et al. (2016) Yao Wu, Christopher DuBois, Alice X. Zheng, and Martin Ester. Collaborative denoising auto-encoders for top-n recommender systems. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (WSDM), pages 153–162, 2016.
  • Wang et al. (2015) Hao Wang, Naiyan Wang, and Dit-Yan Yeung.

    Collaborative deep learning for recommender systems.

    In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 1235–1244, 2015.
  • Abdollahi and Nasraoui (2016) Behnoush Abdollahi and Olfa Nasraoui.

    Explainable restricted boltzmann machines for collaborative filtering.

    In 2016 ICML Workshop on Human Interpretability in Machine Learning (WHI), 2016.
  • Li and She (2017) Xiaopeng Li and James She. Collaborative variational autoencoder for recommender systems. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining (SIGKDD), Halifax, NS, Canada, 2017.
  • Lee et al. (2017) Wonsung Lee, Kyungwoo Song, and Il-Chul Moon. Augmented variational autoencoders for collaborative filtering with auxiliary information. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (CIKM), 2017.
  • Liang et al. (2018) Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, and Tony Jebara. Variational autoencoders for collaborative filtering. 2018.
  • Cao et al. (2007) Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: From pairwise approach to listwise approach. Technical report, April 2007.
  • Xia et al. (2008) Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank - theory and algorithm. In Proceedings of the 25th International Conference on Machine Learning (ICML)., Helsinki, Finland, 2008.
  • Shi et al. (2010) Yue Shi, Martha Larson, and Alan Hanjalic. List-wise learning to rank with matrix factorization for collaborative filtering. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys, pages 269–272, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-906-0.
  • Huang et al. (2015) Shanshan Huang, Shuaiqiang Wang, Tie-Yan Liu, Jun Ma, Zhumin Chen, and Jari Veijalainen. Listwise collaborative filtering. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, pages 343–352, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3621-5.
  • Ai et al. (2018) Qingyao Ai, Keping Bi, Jiafeng Guo, and W. Bruce Croft. Learning a deep listwise context model for ranking refinement. CoRR, 2018.
  • Wang et al. (2016) Yue Wang, Dawei Yin, Luo Jie, Pengyuan Wang, Makoto Yamada, Yi Chang, and Qiaozhu Mei. Beyond ranking: Optimizing whole-page presentation. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, WSDM, pages 103–112, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-3716-8.
  • Higgins et al. (2017) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. -vae: Learning basic visual concepts with a constrained variational framework. In Proceedings of Fifth International Conference on Learning Representations (ICLR)., 2017.
  • Järvelin and Kekäläinen (2000) Kalervo Järvelin and Jaana Kekäläinen. Ir evaluation methods for retrieving highly relevant documents. SIGIR Forum, 51:243–250, 2000.
  • Baeza-Yates and Ribeiro-Neto (1999) R. Baeza-Yates and B. Ribeiro-Neto. Modern information retrieval. Addison Wesley, 1999.
  • Little and Rubin (2002) R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. John Wiley, 2002.
  • Ben-Shimon et al. (2015) David Ben-Shimon, Michael Friedman, Alexander Tsikinovsky, and Johannes Hörle. Recsys challenge 2015 and the yoochoose dataset, 2015.

Appendix A Personalization test

This test complements the small-scale experiment. To the 100 documents with slate size 10, we add user features into the conditioning , by adding a set of 50 different users to the simulation engine (), permuting the innate attractiveness of documents and their interactions matrix by a user-specific function . Let

(7)

be the response of the user on the document . During training, the condition is a concatenation of 16 dimensional user embeddings obtained from the response model, and responses . At inference time, the model conditions on for each randomly generated test user . We sweep over hidden layers of 512 or 1024 units in List-CVAE, and all baseline MLP structures. Figure 7 show that slates generated by List-CVAE have on average higher clicks than those produced by the greedy baseline models although its convergence took longer to reach than in the small-scale experiment.

Figure 7: Personalization test with . The shaded area represent the 95% confidence interval over 20 independent runs. We compare List-CVAE against Greedy MLP, Position MLP, Pairwise MLP, Greedy LSTM and Random baselines on small-scale synthetic data.