
Conditioned Variational Autoencoder for top-N item recommendation

by   Mirko Polato, et al.

In this paper, we propose a Conditioned Variational Autoencoder to improve state-of-the-art performances in top-N item recommendation.



1 Introduction

In this paper we propose an extension of the method proposed in [2]. These are the key points of our contribution:

  1. we propose a Conditioned VAE (not to be confused with Conditional VAE [4, 3]) to rank items under the constraint of a user-selected category. In this way we are able to produce an accurate ranking of the items that belong to the chosen category. For example, in a movie recommendation system a user can ask for movies that belong to a specific genre;

  2. we use a conditioned ELBO loss function. With this loss our model learns the relationship between items and categories, and manages to push items of the desired category to the top of the ranking;

  3. we propose a sampling strategy to speed up the training of our model even in the presence of a considerable number of conditions.

1.1 Notation

In this section we provide some useful notation used throughout the paper. We refer to the set of users of a RS with $\mathcal{U}$, where $|\mathcal{U}| = n$. Similarly, the set of items is referred to as $\mathcal{I}$ such that $|\mathcal{I}| = m$. The set of ratings is denoted by $\mathcal{R} \subseteq \mathcal{U} \times \mathcal{I}$. Each item $i$ is assumed to belong to a set of categories $\mathcal{C}_i \subseteq \mathcal{C}$, where $\mathcal{C}$ is the set of all possible categories s.t. $|\mathcal{C}| = s$.

Moreover, since we face top-N recommendation tasks, we consider the binary rating matrix $\mathbf{R} \in \{0,1\}^{n \times m}$, where users are on the rows and items on the columns, such that $r_{ui} = 1$ iff $(u, i) \in \mathcal{R}$. Given $\mathbf{R}$, $\mathbf{r}_u$ indicates the binary rating vector corresponding to the user $u$. We add a subscript to both user and item sets to indicate, respectively, the set of items rated by a user $u$ (i.e., $\mathcal{I}_u$) and the set of users who rated the item $i$ (i.e., $\mathcal{U}_i$). Finally, given a user $u$ we indicate with $\mathbf{c}_u \in \{0,1\}^{s}$ the binary condition vector related to the user $u$, where $c_{uc} = 1$ if and only if $\exists\, i \in \mathcal{I}_u$ such that $i$ belongs to the category $c$.
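As an illustration of the notation, the following sketch builds the binary condition vector of a user from the items that user rated. The toy item-category assignment and the function name `condition_vector` are assumptions made for this example only.

```python
def condition_vector(rated_items, item_categories, num_categories):
    """Return the condition vector of a user: entry c is 1 iff at least
    one rated item belongs to category c."""
    c_u = [0] * num_categories
    for item in rated_items:
        for cat in item_categories[item]:
            c_u[cat] = 1
    return c_u


# Hypothetical toy data: 4 items, 3 categories.
item_categories = {0: {0}, 1: {0, 2}, 2: {1}, 3: {2}}
print(condition_vector({0, 3}, item_categories, 3))  # [1, 0, 1]
```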

2 Method

2.1 Architecture

We propose an extension of the architecture proposed in [2]. We use a Conditioned Variational Autoencoder (C-VAE) where the architectures of the encoder and the decoder are symmetric. In Figure 1 we present our model. The input is composed of two parts: a user rating vector over the items of the dataset (i.e., $\mathbf{r}_u$), and a one-hot condition vector $\mathbf{c}$ expressing the category of the items that have to be privileged in the final ranking. Following [2], the user rating vector is L2-normalized and a dropout layer is applied. The entire input is then fed to a fully connected layer composed of 600 neurons with tanh as activation function. The encoder has two outputs that compose the latent space of the C-VAE. They are fully connected layers of 200 neurons with linear activation, since they represent the mean and the standard deviation of a Gaussian distribution. The input of the decoder is a latent vector $\mathbf{z}_u$ that has to be sampled from the latent distribution. Since sampling is not a differentiable operation, the so-called reparameterization trick [1] has to be performed to be able to backpropagate the gradient. So, to sample $\mathbf{z}_u$ we need an input $\boldsymbol{\epsilon}$, sampled from a standard Gaussian distribution, that represents noise around the mean of the Gaussian distribution over the latent space. Then, $\mathbf{z}_u$ is fed to a fully connected layer with tanh activation and 600 neurons. The output layer is a fully connected layer of $m$ neurons (one per item) with linear activation.
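The sampling step can be sketched as follows. This is a minimal illustration of the reparameterization trick, not the authors' implementation: it assumes the second encoder head outputs the log-variance (a common convention that the text does not confirm), and the function name `reparameterize` is hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps element-wise, with eps ~ N(0, 1).

    The randomness lives entirely in eps, so z stays a deterministic
    (and differentiable) function of mu and log_var.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


rng = random.Random(0)
mu = [0.5, -1.0, 2.0]
log_var = [0.0, -2.0, -20.0]  # assumed toy values; last dimension has almost no noise
z = reparameterize(mu, log_var, rng)
print(z)
```

In an autodiff framework the same expression lets gradients flow to the mean and variance heads, because the noise is treated as an external input.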

Figure 1: High level illustration of the Conditional Mult-VAE architecture.

2.2 Conditioned Loss Function

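The loss function we minimize is the negative of the conditioned evidence lower bound (ELBO) with a $\beta$-weighted KL term, where the conditioned ELBO is:

$$\mathcal{L}_\beta(\mathbf{r}_u, \mathbf{c}; \theta, \phi) = \mathbb{E}_{q_\phi(\mathbf{z}_u \mid \mathbf{r}_u, \mathbf{c})}\left[\log p_\theta(\mathbf{r}_u \mid \mathbf{z}_u, \mathbf{c})\right] - \beta \cdot \mathrm{KL}\left(q_\phi(\mathbf{z}_u \mid \mathbf{r}_u, \mathbf{c}) \,\|\, p(\mathbf{z}_u)\right)$$

  • $p_\theta$ is the decoder network and $\theta$ its parameters;

  • $q_\phi$ is the encoder network and $\phi$ its parameters;

  • $\mathbf{c}$ is the input one-hot condition vector;

  • $\mathbf{z}_u$ is the latent representation of the user $u$, sampled from a standard Gaussian prior $p(\mathbf{z}_u)$;

  • $\beta$ is the weight of the KL term.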

The first term of the loss can be interpreted as the reconstruction error, while the second term, the so-called KL loss, can be viewed as a regularization. As in [2], $\beta$ is a hyper-parameter that controls the strength of the regularization, which introduces a trade-off between how well the model can reconstruct the (conditioned) user profile and how close the approximate latent posterior stays to the prior during training. As in [2], we tested values of $\beta$ below 1, which weaken the influence of the prior constraint: the model becomes less able to generate accurate user histories, but it makes better recommendations. [2] provides a thorough discussion of this phenomenon.

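In particular, our reconstruction loss is a conditioned extension of the multinomial likelihood used in [2]:

$$\log p_\theta(\mathbf{r}_u \mid \mathbf{z}_u, \mathbf{c}) = \sum_{i=1}^{m} r_{ui}\, f_i \log \pi_i(\mathbf{z}_u)$$

  • $\mathbf{f} = \mathbf{G} \cdot \mathbf{c}$, where $\cdot$ indicates the inner-product operation and $\mathbf{G} \in \{0,1\}^{m \times s}$ is the item-category matrix, where $g_{ic} = 1$ iff item $i$ belongs to the category $c$;

  • $\pi(\mathbf{z}_u)$ is the softmax-normalized output of the decoder.

This reconstruction loss is what makes our model able to learn the relationship between items and categories. In fact, for every item $i$ that does not satisfy the condition $\mathbf{c}$ it holds that $f_i = 0$, so the loss filters out the items of the user rating vector that do not belong to the category selected by the condition. In this way only the output scores related to items belonging to the selected category are taken into account to compute the gradient.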
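The filtering effect of the conditioned loss can be sketched for a single user as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the encoder is assumed to output a mean and a log-variance, the empty condition is assumed to keep every item (the text does not spell this case out), and all function names are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    mx = max(scores)
    log_norm = mx + math.log(sum(math.exp(s - mx) for s in scores))
    return [s - log_norm for s in scores]


def category_mask(item_category, condition):
    """Inner product of each item's category row with the condition vector."""
    if not any(condition):
        # Assumption: an all-zero condition leaves the rating vector unfiltered.
        return [1] * len(item_category)
    return [1 if sum(g * c for g, c in zip(row, condition)) > 0 else 0
            for row in item_category]


def conditioned_neg_elbo(ratings, scores, mask, mu, log_var, beta):
    """Negative beta-weighted ELBO: masked multinomial term plus KL term."""
    log_pi = log_softmax(scores)
    recon = sum(r * f * lp for r, f, lp in zip(ratings, mask, log_pi))
    # KL between N(mu, sigma^2) and the standard Gaussian prior.
    kl = 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                   for m, lv in zip(mu, log_var))
    return -recon + beta * kl


# Toy example: 4 items, 2 categories, condition selects category 0.
G = [[1, 0], [0, 1], [0, 1], [1, 1]]
mask = category_mask(G, [1, 0])
loss = conditioned_neg_elbo([1, 1, 0, 1], [0.0] * 4, mask, [0.0], [0.0], 0.2)
print(mask, loss)
```

Item 1 is rated but lies outside the selected category, so it contributes nothing to the reconstruction term and therefore nothing to the gradient.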

3 Experiments

We have evaluated the performance of C-VAE and we highlight the following results:

  • C-VAE reaches state-of-the-art performance when a condition is fed to the model;

  • C-VAE can be used without the conditioning since it generalizes the architecture of Mult-VAE [2];

  • C-VAE learns how to push the items that belong to the target category to the top of the ranking, so it learns the relationships between items and categories.

3.1 Datasets

A key requirement for the choice of the dataset was that it had to include side information for making conditions for our model, for example genres for movies or categories for businesses. Based on this requirement we chose three small- to medium-scale user-item consumption datasets from various domains:

MovieLens-1M (ML-1M): This dataset contains user-movie ratings collected from a movie recommendation service. Since we work on implicit feedback, we binarized the explicit data. We decided to keep all the ratings due to the small scale of the dataset. We only kept users who have watched at least five movies and we took the genres of the movies as conditions.

MovieLens-20M (ML-20M): This dataset is a medium-scale version of the previous one. We binarized explicit data by keeping ratings of four or higher, as in [2]. We only kept users who have watched at least five movies and we took the genres of the movies as conditions.

Yelp dataset: This dataset is a subset of Yelp’s businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge, which is a chance for students to conduct research or analysis on Yelp’s data and share their discoveries. We chose this dataset because of its sparsity: we wanted to see how our model deals with very sparse data. First, we kept only the ratings of the twenty most popular business categories to reduce sparsity, then we only kept users who have reviewed at least five businesses and businesses that have been reviewed by at least ten users. We took the categories of the businesses as conditions.

Table 1 summarizes the dimensions of all the datasets after this first preprocessing.

                     ML-1M     ML-20M       Yelp
# of users           6,038    136,677    125,679
# of items           3,655     20,108     22,824
# of categories         18         20         20
# of interactions     1.0M      10.0M       1.4M
% of interactions     4.28       0.36      0.048
# of held-out users    440     10,000      9,000

Table 1: Datasets compositions after pre-processing.

3.1.1 Conditions preprocessing

Since we use a Conditioned Variational Autoencoder, we needed to compute conditions for our examples. Due to the size of the datasets, and to speed up the training of our models, we decided to build conditions with just one category (only one bit set to one). This means that conditions with multiple categories are not taken into account.

To compute conditions for the user rating matrix, we took every user rating vector and created $k + 1$ conditions, where $k$ is the number of categories of the items that the user has rated. The last condition is an empty condition vector (a vector made of zeros), which we added so that our model is able to recommend items even when no condition is fed to it. So, if a user has rated three items and these items belong to seven categories in total, then the rating vector of the user is replicated eight times: seven replicas are each associated with a condition that corresponds to a specific category, and the last one is associated with the empty condition vector.
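The replication described above can be sketched as follows. The function name `expand_with_conditions` and the toy item-category matrix are assumptions for illustration; the toy data reproduces the example of three rated items spanning seven categories.

```python
def expand_with_conditions(rating_vector, item_category, num_categories):
    """Replicate a rating vector once per category covered by the rated
    items, plus once with the empty (all-zero) condition."""
    touched = sorted({c
                      for i, r in enumerate(rating_vector) if r
                      for c in range(num_categories) if item_category[i][c]})
    examples = []
    for c in touched:
        one_hot = [1 if k == c else 0 for k in range(num_categories)]
        examples.append((list(rating_vector), one_hot))
    examples.append((list(rating_vector), [0] * num_categories))
    return examples


# Hypothetical data: 4 items, 8 categories; the user rated items 0, 1 and 2.
item_category = [
    [1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1],
]
examples = expand_with_conditions([1, 1, 1, 0], item_category, 8)
print(len(examples))  # 8
```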

Table 2 summarizes the dimensions of all the datasets after the computation of the conditions.

                            ML-1M      ML-20M       Yelp
# of training examples     84,608   1,849,758    759,955
# of validation examples    7,061     154,239     47,364
# of test examples          7,018     153,146     46,847

Table 2: Number of training, validation and test examples after conditions computation.

3.2 Metrics

We used the same ranking-based metrics used in [2]: recall@k and nDCG@k. For each user, both metrics compare the predicted rank of the held-out items with their true rank. For C-VAE we obtained the predicted rank by sorting the un-normalized multinomial probabilities that our model outputs.
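For concreteness, both metrics can be sketched for a single user as below. The sketch assumes binary relevance and the normalisation of recall by min(k, number of held-out items), which is the definition used in [2]; the function names and the toy scores are illustrative, and the fold-in items are excluded from the ranking.

```python
import math


def top_k(scores, k, exclude=()):
    """Indices of the k highest-scoring items, skipping excluded items."""
    candidates = [i for i in range(len(scores)) if i not in exclude]
    return sorted(candidates, key=lambda i: -scores[i])[:k]


def recall_at_k(ranked, held_out, k):
    """Fraction of held-out items found in the first k positions."""
    hits = sum(1 for i in ranked[:k] if i in held_out)
    return hits / min(k, len(held_out))


def ndcg_at_k(ranked, held_out, k):
    """DCG with binary gains, normalised by the best achievable DCG."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, i in enumerate(ranked[:k]) if i in held_out)
    ideal = sum(1.0 / math.log2(pos + 2)
                for pos in range(min(k, len(held_out))))
    return dcg / ideal


scores = [0.1, 2.0, -1.0, 0.7, 1.5]   # assumed un-normalized model outputs
fold_in, held_out = {1}, {2, 3}
ranked = top_k(scores, 3, exclude=fold_in)
print(ranked)                            # [4, 3, 0]
print(recall_at_k(ranked, held_out, 2))  # 0.5
print(ndcg_at_k(ranked, held_out, 2))
```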

3.3 Experimental setup

We studied the performance of various models under strong generalization, as in [2]. Following the procedure in [2], we split all users into training, validation and test sets. We trained our models using the entire rating history of the training users. To evaluate, we took part of the rating history of the held-out (validation and test) users to learn the necessary user-level representations for the model, and then computed metrics by looking at how well the model ranked the rest of the unseen rating history of the held-out users.

In particular, for each held-out user, we randomly chose 80% of the rating history as the “fold-in” set (we refer to this as the validation-training or test-training set) to learn the necessary user-level representation, and then reported metrics on the remaining 20% of the rating history (we refer to this as the validation-test or test-test set). For each validation set example (and the same holds for test set examples), the validation-test part contains only items that belong to the category expressed by the condition of the validation-training part, so all items that belong to other categories are removed. For the validation-training examples where no category is fed to the model (condition vector made of zeros), no filter is applied to the validation-test part (no items are removed). In this way we can better understand how well our model performs on the selected categories and whether it is able to push items of the selected category to the top of the ranking. For example, if a user in the validation-training set has watched movies that belong to the action, drama and comedy genres and the genre specified in the condition is drama, then we are interested in knowing how well our model ranks only the drama movies that are in the validation-test part of the user.
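The construction of a held-out example can be sketched as follows. This is a minimal illustration with hypothetical names (`split_held_out_user`, `fold_in_ratio`) and toy data; a `category` of `None` stands for the empty condition, for which nothing is filtered.

```python
import random


def split_held_out_user(rated_items, item_categories, category,
                        fold_in_ratio, rng):
    """Split one user's history into a fold-in part and a target part,
    keeping in the target only items of the conditioned category."""
    items = sorted(rated_items)
    rng.shuffle(items)
    cut = int(round(fold_in_ratio * len(items)))
    fold_in, target = items[:cut], items[cut:]
    if category is not None:
        target = [i for i in target if category in item_categories[i]]
    return set(fold_in), set(target)


# Hypothetical user with 10 rated items; even items are in category 0.
cats = {i: ({0} if i % 2 == 0 else {1}) for i in range(10)}
fold_in, target = split_held_out_user(set(range(10)), cats, 0, 0.8,
                                      random.Random(1))
print(sorted(fold_in), sorted(target))
```

With sparse data the filtered target part can end up empty, which is why the choice of split proportions matters for computing the metrics.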

We used 80/20 proportions for the MovieLens 1M and MovieLens 20M datasets, while we chose 50/50 proportions for the Yelp dataset. Due to its sparsity, other split proportions led to examples with no ratings in the validation-test set, making the metrics impossible to compute.

We selected the hyper-parameters by evaluating nDCG@100 for MovieLens 20M and Yelp, and nDCG@10 for MovieLens 1M, on the validation users. We kept the architecture of encoder and decoder symmetrical. For MovieLens 1M we used the architecture presented in 2.1 without hidden layers, while for the other datasets we used the architecture presented in 2.1. We set the dimension of the latent representation to 200 and the dimension of the hidden layers to 600. Recalling that $m$ is the total number of items and $s$ the size of the condition vector, the overall architecture of our model is $[m + s \to 600 \to 200 \to 600 \to m]$ for MovieLens 20M and Yelp, and $[m + s \to 200 \to m]$ for MovieLens 1M. We used tanh non-linearity as the activation function between layers, and we did not apply any activation to the output of the encoder since it is used as the mean and variance of a Gaussian random variable. Moreover, we did not apply any activation to the output of the decoder because it has to be interpreted as an un-normalized score to rank the items. We tuned the regularization parameter $\beta$ using the procedure used in [2]. As a reminder, these are the steps we followed:

  • we trained the model annealing $\beta$ so that it reaches 1.0 at the end of the training;

  • we selected the value of $\beta$ corresponding to the highest validation score in terms of nDCG;

  • we re-trained the model annealing $\beta$ so that it reaches the selected value at the end of the training.

To perform this tuning we annealed the KL term linearly over the number of gradient updates reported in Table 3. The table includes the best values we found for the parameter $\beta$ using the procedure explained above. We applied L2 normalization and dropout with 0.5 probability at the input layer (only to the rating part and not to the condition). We initialized the weights of our model with Xavier initialization and the biases with a truncated normal initializer with 0.0 mean and 0.001 standard deviation. We trained our model using the Adam optimizer with a learning rate equal to , and without weight decay. We trained the model for at most 100 epochs on every dataset, kept the model with the best validation nDCG and reported test set metrics with it. We used early stopping to stop the training if there had been no improvement in the validation metric for 10 epochs.

                       ML-1M      ML-20M       Yelp
# gradient updates    84,000   4,624,395    434,000
optimal beta             0.2        0.08       0.35

Table 3: Gradient updates and best $\beta$ values.
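The linear annealing schedule can be sketched as follows. This is an illustrative sketch, with a hypothetical function name; it assumes that the weight of the KL term starts at zero and grows linearly until it reaches its cap at the last gradient update.

```python
def annealed_beta(step, total_steps, beta_cap):
    """Linearly anneal the KL weight from 0 up to beta_cap."""
    if total_steps <= 0:
        return beta_cap
    return min(beta_cap, beta_cap * step / total_steps)


# ML-1M values from Table 3: 84,000 gradient updates, cap of 0.2.
for step in (0, 42_000, 84_000):
    print(step, annealed_beta(step, 84_000, 0.2))
```

In the first tuning run the cap would be 1.0; in the re-training run it is the selected value, for example 0.2 for ML-1M.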

3.4 Experimental results and analysis

In this section, we quantitatively compare our proposed method with related work using three different evaluations:

  1. total: measures how well the model performs in general, that is, when the test set contains users with conditions together with users without conditions;

  2. normal: measures how well the model performs without conditions, that is, when the test set contains only users without conditions;

  3. conditioned: measures how well the model performs with conditions, that is, when the test set contains only users with conditions.

Table 4, Table 5 and Table 6 report the results we have obtained with our method and Mult-VAE. Each metric is averaged across all test users. To compute the total and conditioned evaluations for Mult-VAE we have filtered the model output based on the category selected by the condition, so we kept only the scores related to items of the correct category.
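The filtered evaluation of Mult-VAE can be sketched as a post-processing step on the model scores. The function name and the toy data below are assumptions for illustration; the sketch simply ranks only the items of the selected category.

```python
def rank_within_category(scores, item_categories, category):
    """Rank, by decreasing score, only the items of the given category."""
    kept = [i for i in range(len(scores)) if category in item_categories[i]]
    return sorted(kept, key=lambda i: scores[i], reverse=True)


# Hypothetical unconditioned scores for 4 items and their categories.
scores = [0.9, 0.1, 0.5, 0.7]
item_categories = [{0}, {1}, {1}, {0, 1}]
print(rank_within_category(scores, item_categories, 1))  # [3, 2, 1]
```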

Our results show that C-VAE obtains results that are similar to the filtered evaluation of Mult-VAE. This means that our method is able to learn the relationships between items and categories, and how to recommend the best items that belong to a specific category selected by a user. We think that Mult-VAE performs better in the total and conditioned evaluations because it is an accurate ranker in general: by selecting from its output only the scores of the items that belong to a specific category, we obtain an accurate ranking of the items of the selected category. We are confident that our method can converge to state-of-the-art results with more epochs of training. Regarding the normal evaluation, we think C-VAE is less accurate simply because there is a bias towards conditioned examples in the dataset; in fact, our main goal is to obtain the best results in the conditioned evaluation. To better understand the size of this bias, Table 2 shows the number of conditioned and unconditioned examples in our datasets.

The results on ML-1M show that the gap between our method and Mult-VAE shrinks as the density of the dataset increases.

The results on the Yelp dataset show that our method brings no improvement when the items in the dataset belong to a restricted number of categories. In fact, due to the preprocessing we performed on the dataset, the majority of items belong to just one category.

Evaluation   Total                    Normal                   Conditioned
Metric       R@20    R@50    N@100    R@20    R@50    N@100    R@20    R@50    N@100
C-VAE        0.645   0.793   0.520    0.386   0.526   0.416    0.673   0.822   0.531
Mult-VAE     0.654   0.799   0.528    0.395   0.535   0.425    0.682   0.828   0.539

Table 4: Comparison between C-VAE and Mult-VAE on ML-20M. Standard errors are around 0.001/0.002.

Evaluation   Total                    Normal                   Conditioned
Metric       R@20    R@50    N@100    R@20    R@50    N@100    R@20    R@50    N@100
C-VAE        0.311   0.459   0.238    0.139   0.235   0.143    0.392   0.564   0.282
Mult-VAE     0.311   0.460   0.237    0.134   0.233   0.143    0.394   0.567   0.282

Table 5: Comparison between C-VAE and Mult-VAE on Yelp. Standard errors are between 0.001 and 0.003.

Evaluation   Total            Normal           Conditioned
Metric       R@10    N@10     R@10    N@10     R@10    N@10
C-VAE        0.572   0.478    0.380   0.403    0.590   0.485
Mult-VAE     0.563   0.474    0.383   0.408    0.580   0.480

Table 6: Comparison between C-VAE and Mult-VAE on ML-1M. Standard errors are between 0.001 and 0.003.

3.4.1 Analysis of the produced ranking

To better understand if our method was able to produce rankings with items of the correct category at the top, we have produced the graph in Figure 2. To produce it we simply counted the number of items of the correct category in every position of the ranking for all the rankings predicted by Mult-VAE on the test set. The graph clearly shows that the majority of items of the correct category have been placed in the top positions, as we expected.

Figure 2: Distribution of the movies in the ranking belonging to the selected category.
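The counting behind such a graph can be sketched as below. The function name and the toy rankings are assumptions for illustration: for every ranking position, the sketch counts how often the item placed there belongs to the category requested by the condition.

```python
def category_hits_per_position(rankings, conditions, item_categories, depth):
    """counts[p] = number of rankings whose item at position p belongs
    to the category requested by the corresponding condition."""
    counts = [0] * depth
    for ranked, category in zip(rankings, conditions):
        for pos, item in enumerate(ranked[:depth]):
            if category in item_categories[item]:
                counts[pos] += 1
    return counts


# Hypothetical data: two predicted rankings with their conditions.
item_categories = {0: {0}, 1: {1}, 2: {0, 1}}
rankings = [[0, 1, 2], [2, 0, 1]]
conditions = [0, 1]
print(category_hits_per_position(rankings, conditions, item_categories, 3))  # [2, 0, 2]
```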

4 Conclusions and future work

In this paper we implemented a conditioned extension of VAE for top-N item recommendation. We found that our model is able to learn the relationships between items and categories, which is something new for generative methods applied to top-N recommendation. We compared our method with the state of the art, that is Mult-VAE [2], and we found that it reaches state-of-the-art performance using a novel training procedure that includes a new conditioned loss (conditioned -ELBO).

Since we obtained promising results, we intend to improve the performance of our model and to explore its latent representation to try to explain its behaviour, in particular what the latent representation captures about the relationships between items and categories after training. In conclusion, we think that our method can be used for context-aware recommendation, where a context is fed to the model as a condition.