In this paper we propose an extension of the method proposed in [2]. The key points of our contribution are:
- we propose a Conditioned VAE (not to be confused with the Conditional VAE [4, 3]) that ranks items under the constraint of a user-selected category. In this way we are able to produce an accurate ranking of the items that belong to the chosen category. For example, in a movie recommendation system a user can ask for movies that belong to a specific genre;
- we use a conditioned ELBO loss function. With this loss our model learns the relationship between items and categories and manages to push items of the desired category to the top of the ranking;
- we propose a sampling strategy to speed up the training of our model even in the presence of a considerable number of conditions.
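In this section we provide the notation used throughout the paper. We refer to the set of users of a recommender system (RS) as $\mathcal{U}$, where $|\mathcal{U}| = n$. Similarly, the set of items is referred to as $\mathcal{I}$, such that $|\mathcal{I}| = m$. The set of ratings is denoted by $\mathcal{R} \subseteq \mathcal{U} \times \mathcal{I}$. Each item $i \in \mathcal{I}$ is assumed to belong to a set of categories $\mathcal{C}(i) \subseteq \mathcal{C}$, where $\mathcal{C}$ is the set of all possible categories s.t. $|\mathcal{C}| = s$.
Moreover, since we face top-N recommendation tasks, we consider the binary rating matrix $\mathbf{R} \in \{0,1\}^{n \times m}$, where users are on the rows and items on the columns, such that $r_{ui} = 1$ iff $(u, i) \in \mathcal{R}$. Given $\mathbf{R}$, $\mathbf{r}_u$ indicates the binary rating vector corresponding to the user $u$. We add a subscript to both the user and item sets to indicate, respectively, the set of items rated by a user $u$ (i.e., $\mathcal{I}_u$) and the set of users who rated the item $i$ (i.e., $\mathcal{U}_i$). Finally, given a user $u$ we indicate with $\mathbf{c}_u \in \{0,1\}^{s}$ the binary condition vector related to the user $u$, where $c_{uk} = 1$ if and only if there exists $i \in \mathcal{I}_u$ such that $i$ belongs to the category $k$.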
We propose an extension of the architecture proposed in [2]. We use a Conditioned Variational Autoencoder (C-VAE) in which the architectures of the encoder and the decoder are symmetric. Figure 1 presents our model. The input is composed of two parts: a user rating vector over the items of the dataset (i.e., $\mathbf{r}_u$), and a one-hot condition vector $\mathbf{c}$ expressing the categories of the items that have to be privileged in the final ranking. Following [2], the user rating vector is L2-normalized and a dropout layer is applied to it. The entire input is then fed to a fully connected layer of 600 neurons with tanh as activation function. The encoder has two outputs that compose the latent space of the C-VAE. They are fully connected layers of 200 neurons with linear activation, since they represent the mean and the standard deviation of a Gaussian distribution. The input of the decoder is a latent vector $\mathbf{z}$ that has to be sampled from the latent distribution. Since sampling is not a differentiable operation, the so-called reparameterization trick [1] has to be performed to be able to backpropagate the gradient. So, to sample $\mathbf{z}$ we need an input $\boldsymbol{\epsilon}$, sampled from a standard Gaussian distribution, that represents noise around the mean of the Gaussian distribution over the latent space. Then, $\mathbf{z}$ is fed to a fully connected, tanh-activated layer of 600 neurons. The output layer is a fully connected layer with linear activation and one neuron per item.
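As an illustration of the data flow described above, the following is a minimal sketch of one forward pass in plain Python. It is not the implementation used in our experiments: the helper names (`dense`, `init_layer`, `cvae_forward`), the toy layer sizes and the choice of treating the second encoder head as a log-variance are assumptions made only for this example, and dropout is omitted.

```python
import math
import random


def dense(x, weights, bias, activation=None):
    """Fully connected layer: activation(W x + b), with W stored row by row."""
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [activation(o) for o in out] if activation else out


def init_layer(rng, n_in, n_out):
    """Illustrative initialization: small Gaussian weights, zero biases."""
    scale = math.sqrt(2.0 / (n_in + n_out))
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def l2_normalize(x):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0 else list(x)


def cvae_forward(ratings, condition, params, rng):
    """Encode [ratings; condition], sample z by reparameterization, decode to item scores."""
    # Only the rating part is normalized; the condition is concatenated unchanged.
    x = l2_normalize(ratings) + list(condition)
    h = dense(x, *params["enc_hidden"], activation=math.tanh)
    mu = dense(h, *params["enc_mu"])            # linear head: mean
    log_var = dense(h, *params["enc_log_var"])  # linear head (assumed to be log-variance)
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]
    h_dec = dense(z, *params["dec_hidden"], activation=math.tanh)
    scores = dense(h_dec, *params["dec_out"])   # one un-normalized score per item
    return scores, mu, log_var


# Toy sizes (the text uses 600 hidden and 200 latent units).
rng = random.Random(0)
n_items, n_categories, hidden, latent = 6, 3, 8, 4
params = {
    "enc_hidden": init_layer(rng, n_items + n_categories, hidden),
    "enc_mu": init_layer(rng, hidden, latent),
    "enc_log_var": init_layer(rng, hidden, latent),
    "dec_hidden": init_layer(rng, latent, hidden),
    "dec_out": init_layer(rng, hidden, n_items),
}
scores, mu, log_var = cvae_forward([1, 0, 1, 1, 0, 0], [0, 1, 0], params, rng)
```

The sampling step is written as a deterministic function of the encoder outputs and an external noise vector, which is what allows the gradient to flow through the mean and variance heads.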
2.2 Conditioned Loss Function
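The loss function we minimize is the negative conditioned evidence lower bound (-ELBO), where the conditioned -ELBO for the binary rating vector $\mathbf{r}_u$ of a user $u$ is:

$$-\mathcal{L}_\beta(\mathbf{r}_u, \mathbf{c}; \theta, \phi) = -\mathbb{E}_{q_\phi(\mathbf{z}_u \mid \mathbf{r}_u, \mathbf{c})}\left[\log p_\theta(\mathbf{r}_u \mid \mathbf{z}_u, \mathbf{c})\right] + \beta \cdot \mathrm{KL}\left(q_\phi(\mathbf{z}_u \mid \mathbf{r}_u, \mathbf{c}) \,\|\, p(\mathbf{z}_u)\right)$$

where:
- $p_\theta$ is the decoder network and $\theta$ its parameters;
- $q_\phi$ is the encoder network and $\phi$ its parameters;
- $\mathbf{c}$ is the input one-hot condition vector;
- $\mathbf{z}_u$ is the latent representation of the user $u$, sampled from a standard Gaussian prior $p(\mathbf{z}_u)$;
- $\beta$ is the weight of the KL term.

The first term of the loss can be interpreted as the reconstruction error, while the second term, the so-called KL loss, can be viewed as a regularization. As in [2], $\beta$ is a hyper-parameter that controls the strength of the regularization, which introduces a trade-off between how well the model can reconstruct the (conditioned) user profile and how close the approximate latent posterior stays to the prior during training. As in [2], we tested values of $\beta$ that weaken the influence of the prior constraint, which makes the model less able to generate accurate user histories but better at making recommendations. [2] provides a thorough discussion of this phenomenon.

In particular, our reconstruction loss is a conditioned extension of the multinomial likelihood used in [2]:

$$\log p_\theta(\mathbf{r}_u \mid \mathbf{z}_u, \mathbf{c}) = \sum_{i} (\mathbf{g}_i \cdot \mathbf{c})\, r_{ui} \log \pi_i(\mathbf{z}_u)$$

where the sum runs over all items, $\pi(\mathbf{z}_u)$ is the softmax of the decoder output as in [2], $\cdot$ indicates the inner-product operation and $\mathbf{G}$ is the binary item-category matrix with rows $\mathbf{g}_i$ (one row per item, one column per category), where $g_{ik} = 1$ iff item $i$ belongs to the category $k$.

This reconstruction loss is what allows our model to learn the relationship between items and categories. In fact, for every item $i$ that does not belong to the category selected by $\mathbf{c}$ it holds that $\mathbf{g}_i \cdot \mathbf{c} = 0$, so the loss filters out the items of the user rating vector that do not belong to the category selected by the condition. In this way only the output scores related to items belonging to the selected category are taken into account to compute the gradient.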
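The loss described above can be sketched for a single user as follows. This is an illustrative sketch under stated assumptions, not our training code: the function names are hypothetical, the KL term uses the standard closed form for a diagonal Gaussian posterior parameterized by a log-variance, and the handling of the empty condition (all items kept) is an assumption, since the text does not specify it.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over the un-normalized item scores."""
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_norm for s in scores]


def conditioned_neg_elbo(scores, ratings, condition, item_category, mu, log_var, beta):
    """Negative conditioned ELBO for one user (the quantity to minimize)."""
    # Inner product between each item's category row and the one-hot condition:
    # 1 if the item belongs to the selected category, 0 otherwise.
    mask = [sum(g * c for g, c in zip(row, condition)) for row in item_category]
    # Assumption (not stated in the text): with an empty condition, keep every item.
    if not any(condition):
        mask = [1] * len(ratings)
    log_pi = log_softmax(scores)
    # Masked multinomial log-likelihood, negated to obtain the reconstruction error.
    reconstruction = -sum(m * r * lp for m, r, lp in zip(mask, ratings, log_pi))
    # Closed-form KL(N(mu, sigma^2) || N(0, I)), with log_var = log(sigma^2).
    kl = 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))
    return reconstruction + beta * kl


# Four items, two categories; item 3 belongs to both categories.
item_category = [[1, 0], [0, 1], [1, 0], [1, 1]]
loss = conditioned_neg_elbo(
    scores=[0.0, 0.0, 0.0, 0.0],
    ratings=[1, 1, 0, 1],
    condition=[1, 0],
    item_category=item_category,
    mu=[0.0, 0.0],
    log_var=[0.0, 0.0],
    beta=0.2,
)
```

In this toy call the rated item 1 does not belong to the selected category, so it contributes nothing to the reconstruction term; only the rated items 0 and 3 do.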
We have evaluated the performance of C-VAE and we highlight the following results:
- C-VAE reaches state-of-the-art performance when a condition is fed to the model;
- C-VAE can be used without the conditioning, since it generalizes the architecture of Mult-VAE [2];
- C-VAE learns how to push the items that belong to the target category to the top of the ranking, so it learns the relationships between items and categories.
A key requirement for the choice of the datasets was that they had to include side information from which to build conditions for our model, for example genres for movies or categories for businesses. Based on this requirement we chose three small- to medium-scale user-item consumption datasets from various domains:
- MovieLens-1M (ML-1M): This dataset contains user-movie ratings collected from a movie recommendation service. Since we work on implicit feedback we binarized the explicit data. We decided to keep all the ratings due to the small scale of the dataset. We only kept users who have watched at least five movies and we took the genres of the movies as conditions.
- MovieLens-20M (ML-20M): This dataset is a medium-scale version of the previous one. We binarized the explicit data by keeping ratings of four or higher, as in [2]. We only kept users who have watched at least five movies and we took the genres of the movies as conditions.
- Yelp dataset: This dataset is a subset of Yelp’s businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge, which is a chance for students to conduct research or analysis on Yelp’s data and share their discoveries. We chose this dataset because of its sparsity: we wanted to see how our model deals with very sparse data. First, we kept only the ratings of the twenty most popular business categories to reduce sparsity; then we only kept users who have reviewed at least five businesses and businesses that have been reviewed by at least ten users. We took the categories of the businesses as conditions.
Table 1 summarizes the dimensions of all the datasets after this first preprocessing.
| | ML-1M | ML-20M | Yelp |
|---|---|---|---|
| # of users | 6,038 | 136,677 | 125,679 |
| # of items | 3,655 | 20,108 | 22,824 |
| # of categories | 18 | 20 | 20 |
| # of interactions | 1.0M | 10.0M | 1.4M |
| % of interactions | 4.28 | 0.36 | 0.048 |
| # of held-out users | 440 | 10,000 | 9,000 |
3.1.1 Conditions preprocessing
Since we use a Conditioned Variational Autoencoder, we needed to compute conditions for our examples. Due to the size of the datasets, and to speed up the training of our models, we decided to build conditions with just one category (only one bit set to one). This means that conditions with multiple categories are not taken into account.
To compute conditions for the user rating matrix, we took every user rating vector and we created $k + 1$ conditions, where $k$ is the number of categories of the items that the user has rated. The last condition is an empty condition vector (a vector made of zeros), which we added so that our model is able to recommend items even when no condition is fed to it. So, if a user has rated three items and these items belong to seven categories in total, then the rating vector of the user is replicated eight times: seven replicas are each associated with a condition that corresponds to a specific category, and the last one is associated with the empty condition vector.
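The replication procedure described above can be sketched as follows. The function name and the dictionary-based data layout are hypothetical and chosen only to keep the example short.

```python
def build_conditioned_examples(user_items, item_categories, n_categories):
    """Replicate each user's profile once per category they rated, plus once with no condition.

    user_items: dict mapping a user id to the list of item ids the user rated.
    item_categories: dict mapping an item id to the set of category indices of that item.
    Returns a list of (user, items, condition) triples, where condition is a 0/1 list.
    """
    examples = []
    for user, items in user_items.items():
        profile = sorted(items)
        # Categories covered by at least one item rated by this user.
        seen = sorted({c for i in items for c in item_categories[i]})
        for category in seen:
            condition = [0] * n_categories
            condition[category] = 1  # exactly one bit set to one
            examples.append((user, profile, condition))
        # The empty condition lets the model also be trained without conditioning.
        examples.append((user, profile, [0] * n_categories))
    return examples


# The example from the text: three rated items covering seven categories in total.
user_items = {"u1": [0, 1, 2]}
item_categories = {0: {0, 1, 2}, 1: {2, 3}, 2: {4, 5, 6}}
examples = build_conditioned_examples(user_items, item_categories, n_categories=7)
```

With this input the user's rating vector is replicated eight times, matching the worked example in the paragraph above.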
Table 2 summarizes the dimensions of all the datasets after the computation of the conditions.
| | ML-1M | ML-20M | Yelp |
|---|---|---|---|
| # of training examples | 84,608 | 1,849,758 | 759,955 |
| # of validation examples | 7,061 | 154,239 | 47,364 |
| # of test examples | 7,018 | 153,146 | 46,847 |
3.3 Experimental setup
We studied the performance of various models under strong generalization, as in [2]. Following the procedure in [2], we split all users into training, validation and test sets. We trained our models using the entire rating history of the training users. To evaluate, we took part of the rating history of the held-out (validation and test) users to learn the necessary user-level representations for the model, and then computed metrics by looking at how well the model ranked the rest of the unseen rating history of the held-out users.
In particular, for each held-out user, we randomly chose 80% of the rating history as the “fold-in” set (we refer to this as the validation-training and test-training set) to learn the necessary user-level representation, and then reported metrics on the remaining 20% of the rating history (we refer to this as the validation-test and test-test set). For each validation set example (and the same holds for test set examples), its validation-test part has been created so that it contains only items that belong to the category expressed by the condition of the validation-training part; all items that belong only to categories different from the one expressed by the condition are removed. For the validation-training examples where no category is fed to the model (condition vector made of zeros), no filter is applied to the validation-test part (no items are removed). In this way we are able to better understand how well our model performs on the selected categories and whether it is able to push items of the selected category to the top of the ranking. For example, if a user in the validation-training set has watched movies that belong to the action, drama and comedy genres and the genre specified in the condition is drama, then we are interested in knowing how well our model ranks only the drama movies that are in the validation-test part of the user.
We used 80/20 proportions for the MovieLens 1M and MovieLens 20M datasets, while we chose 50/50 proportions for the Yelp dataset. Due to its sparsity, other split proportions led to examples with no ratings in the validation-test set, making the metrics impossible to compute.
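The fold-in split with category filtering can be sketched as follows. The function name, its parameters and the use of a seeded `random.Random` are assumptions made for this example only.

```python
import random


def split_held_out_user(items, item_categories, condition_category, fold_in_ratio=0.8, rng=None):
    """Split one held-out user's history into a fold-in part and an evaluation part.

    condition_category is the category index selected by the condition,
    or None for the empty condition (no filtering of the evaluation part).
    """
    rng = rng or random.Random(0)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_fold_in = int(round(fold_in_ratio * len(shuffled)))
    fold_in, evaluation = shuffled[:n_fold_in], shuffled[n_fold_in:]
    if condition_category is not None:
        # Keep only the evaluation items that belong to the selected category.
        evaluation = [i for i in evaluation if condition_category in item_categories[i]]
    return fold_in, evaluation


# Ten items: even ids belong to category 0, odd ids to category 1.
item_categories = {i: {i % 2} for i in range(10)}
fold_in, evaluation = split_held_out_user(
    range(10), item_categories, condition_category=0, rng=random.Random(1)
)
```

If the filtered evaluation part ends up empty, the metrics cannot be computed for that example, which is the situation that motivated the 50/50 split on Yelp.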
We selected the hyper-parameters by evaluating nDCG@100 for MovieLens 20M and Yelp, and nDCG@10 for MovieLens 1M, on the validation users. We kept the architecture of the encoder and the decoder symmetrical. For MovieLens 1M we used the architecture presented in 2.1 without hidden layers, while for the other datasets we used the architecture presented in 2.1 unchanged. We set the dimension of the latent representation to 200 and the dimension of the hidden layers to 600. To make the architecture explicit, with $m$ the total number of items and $s$ the size of the condition vector, the overall architecture of our model is $[m + s \to 600 \to 200 \to 600 \to m]$ for MovieLens 20M and Yelp, and $[m + s \to 200 \to m]$ for MovieLens 1M. We used the tanh non-linearity as the activation function between layers and we did not apply any activation to the output of the encoder, since it is used as the mean and variance of a Gaussian random variable. Moreover, we did not apply any activation to the output of the decoder because it has to be interpreted as an un-normalized score to rank the items. We tuned the regularization parameter $\beta$ using the procedure used in [2]. As a reminder, these are the steps we followed:
- we trained the model annealing $\beta$ so that it reaches 1.0 at the end of the training;
- we selected the value of $\beta$ corresponding to the highest validation score in terms of nDCG;
- we re-trained the model annealing $\beta$ so that it reaches the selected value at the end of the training.
To perform this tuning we annealed the KL term linearly with the number of gradient updates reported in Table 3. The table includes the best values we found for the parameter using the procedure explained above. We applied L2 normalization and dropout with 0.5 probability at the input layer (only to the rating part and not to the condition). We initialized the weights of our model with Xavier initialization and the biases with a truncated normal initializer with 0.0 mean and 0.001 standard deviation. We trained our model using the Adam optimizer with a learning rate equal to , and without weight decay. We trained the model for 100 epochs on every dataset, kept the model with the best validation nDCG and reported test set metrics with it. We used early stopping to stop the training if there was no improvement on the validation metric for 10 epochs.
| | ML-1M | ML-20M | Yelp |
|---|---|---|---|
| # gradient updates | 84,000 | 4,624,395 | 434,000 |
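The linear annealing of the KL weight used in the tuning procedure can be sketched as a function of the number of gradient updates performed so far. The function and parameter names are hypothetical; `total_anneal_steps` plays the role of the number of gradient updates in Table 3, and `beta_max` is either 1.0 (first run) or the selected value (second run).

```python
def beta_schedule(step, total_anneal_steps, beta_max=1.0):
    """KL weight after `step` gradient updates, growing linearly from 0 to beta_max."""
    if total_anneal_steps <= 0:
        return beta_max
    return beta_max * min(1.0, step / total_anneal_steps)


# First run: anneal towards 1.0 and record the validation nDCG along the way.
first_run = [beta_schedule(step, 100) for step in (0, 25, 50, 100)]
# Second run: anneal towards the value selected on the validation set (0.4 is illustrative).
second_run = [beta_schedule(step, 100, beta_max=0.4) for step in (0, 50, 100, 200)]
```

After `total_anneal_steps` updates the weight stays constant at `beta_max`.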
3.4 Experimental results and analysis
In this section, we quantitatively compare our proposed method with related work using three different evaluations:
- total: measures how well the model performs in general, that is, when the test set contains users with conditions together with users without conditions;
- normal: measures how well the model performs without conditions, that is, when the test set contains only users without conditions;
- conditioned: measures how well the model performs with conditions, that is, when the test set contains only users with conditions.
Table 4, Table 5 and Table 6 report the results we have obtained with our method and Mult-VAE. Each metric is averaged across all test users. To compute the total and conditioned evaluations for Mult-VAE we have filtered the model output based on the category selected by the condition, so we kept only the scores related to items of the correct category.
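The filtering applied to the Mult-VAE output can be sketched as follows, together with a truncated nDCG for binary relevance. Both functions are hypothetical helpers written for this example; the nDCG shown is the usual binary-relevance formulation and is assumed, not taken from our evaluation code. The exclusion of fold-in items from the ranking is omitted for brevity.

```python
import math


def filter_scores_by_category(scores, item_categories, category):
    """Keep the scores of items in `category`; all other items get -inf and rank last."""
    return [s if category in item_categories[i] else float("-inf")
            for i, s in enumerate(scores)]


def ndcg_at_k(scores, held_out_items, k):
    """Truncated nDCG with binary relevance over the top-k ranked items."""
    ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranking) if item in held_out_items)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(held_out_items))))
    return dcg / idcg if idcg > 0 else 0.0


# Four items; item 0 belongs to category 0, items 1-3 to category 1.
item_categories = [{0}, {1}, {1}, {1}]
raw_scores = [0.9, 0.8, 0.1, 0.7]
filtered = filter_scores_by_category(raw_scores, item_categories, category=1)
held_out = {3}  # held-out item of the selected category
ndcg_raw = ndcg_at_k(raw_scores, held_out, k=2)
ndcg_filtered = ndcg_at_k(filtered, held_out, k=2)
```

In this toy case the unfiltered ranking wastes its first position on an item of the wrong category, so the held-out item only enters the top two after filtering.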
From our results it is possible to observe that C-VAE obtains results that are similar to the filtered evaluation of Mult-VAE. This means that our method is able to learn the relationships between items and categories and to recommend the best items that belong to a specific category selected by a user. We think that Mult-VAE performs better in the total and conditioned evaluations because it is an accurate ranker in general, so by selecting from its output only the scores of the items that belong to a specific category we obtain an accurate ranking of the items of that category. We are confident that our method can converge to state-of-the-art results with more epochs of training. Regarding the normal evaluation, we think C-VAE is less accurate simply because there is a bias towards conditioned examples in the dataset; indeed, our main goal is to obtain the best results in the conditioned evaluation. To better understand the size of this bias, it is possible to observe in Table 2 the number of conditioned and unconditioned examples in our datasets.
The results we have obtained on ML-1M show that the gap between our method and Mult-VAE decreases as the density of the dataset increases.
From the results we have obtained on the Yelp dataset it is possible to see that our method brings no improvement when the items in the dataset belong to a restricted number of categories. In fact, due to the preprocessing we have performed on the dataset, the majority of items belong to just one category.
Comparison between C-VAE and Mult-VAE on ML-20M. Standard errors are around 0.001/0.002.
3.4.1 Analysis of the produced ranking
To better understand if our method was able to produce rankings with items of the correct category at the top, we have produced the graph in Figure 2. To produce it we simply counted the number of items of the correct category in every position of the ranking for all the rankings predicted by Mult-VAE on the test set. The graph clearly shows that the majority of items of the correct category have been placed in the top positions, as we expected.
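The counting behind this kind of graph can be sketched as follows. The function name and the input layout (one ranked list of item ids and one selected category per test example) are assumptions made for this example.

```python
def category_hits_per_position(rankings, conditions, item_categories, top_k):
    """Count, for each rank position, how many rankings place an item of the selected category there.

    rankings: list of ranked item-id lists, one per conditioned test example.
    conditions: list with the category index selected for each example.
    item_categories: list mapping an item id to the set of its category indices.
    """
    counts = [0] * top_k
    for ranking, category in zip(rankings, conditions):
        for position, item in enumerate(ranking[:top_k]):
            if category in item_categories[item]:
                counts[position] += 1
    return counts


# Three items; item 2 belongs to both categories.
item_categories = [{0}, {1}, {0, 1}]
rankings = [[0, 1, 2], [2, 0, 1]]
conditions = [0, 1]
counts = category_hits_per_position(rankings, conditions, item_categories, top_k=3)
```

Plotting `counts` against the rank position gives a histogram of where items of the requested category land in the ranking.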
4 Conclusions and future work
In this paper we implemented a conditioned extension of VAE for top-N item recommendation. We found that our model is able to learn the relationships between items and categories, which is new for generative methods applied to top-N recommendation. We compared our method with the state of the art, that is Mult-VAE [2], and we found that it reaches state-of-the-art performance using a novel training procedure that includes a new conditioned loss (conditioned -ELBO).
Since we obtained promising results, we intend to improve the performance of our model and to explore its latent representation in order to explain its behaviour, in particular what the latent representation captures about the relationships between items and categories after training. In conclusion, we think that our method can be used for context-aware recommendation, where a context is fed to the model as the condition.
- [1] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: Bengio, Y., LeCun, Y. (eds.) 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014 (2014)
- [2] Liang, D., Krishnan, R.G., Hoffman, M.D., Jebara, T.: Variational autoencoders for collaborative filtering. In: Proceedings of the 2018 World Wide Web Conference. pp. 689–698. WWW ’18, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE (2018). https://doi.org/10.1145/3178876.3186150
- [3] Pagnoni, A., Liu, K., Li, S.: Conditional variational autoencoder for neural machine translation. ArXiv abs/1812.04405 (2018)
- [4] Zhang, B., Xiong, D., Su, J., Duan, H., Zhang, M.: Variational neural machine translation. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. pp. 521–530. Association for Computational Linguistics, Austin, Texas (Nov 2016). https://doi.org/10.18653/v1/D16-1050, https://www.aclweb.org/anthology/D16-1050