Variational Collaborative Learning for User Probabilistic Representation

09/22/2018 ∙ by Kenan Cui, et al. ∙ Shanghai Jiao Tong University

Collaborative filtering (CF) has been successfully employed by many modern recommender systems. Conventional CF-based methods use user-item interaction data as the sole information source for recommending items to users. However, CF-based methods are known to suffer from cold-start and data-sparsity problems. Hybrid models that utilize auxiliary information on top of interaction data have therefore gained increasing attention. A few "collaborative learning"-based models, which tightly bridge two heterogeneous learners through mutual regularization, have recently been proposed for hybrid recommendation. However, the "collaboration" in the existing methods is actually asynchronous because the two learners are optimized alternately. Leveraging recent advances in the variational autoencoder (VAE), we propose a model consisting of two streams of mutually linked VAEs, named the variational collaborative model (VCM). Unlike the mutual regularization used in previous works, where the two learners are optimized asynchronously, VCM enables a synchronous collaborative learning mechanism. In addition, the two-stream VAE setup allows VCM to fully leverage the Bayesian probabilistic representations in collaborative learning. Extensive experiments on three real-life datasets show that VCM outperforms several state-of-the-art methods.





With the rapid growth of information online, recommender systems play an increasingly important role in alleviating information overload. Existing models for recommender systems can be broadly classified into three categories [Adomavicius and Tuzhilin2005]: content-based models, CF-based models, and hybrid models. Content-based models [Lang1995, Pazzani and Billsus1997] use user profiles or item descriptions to recommend items similar to what the user liked in the past. CF-based methods [Mnih and Salakhutdinov2008, He et al.2017, Liang et al.2018] model user preferences based on historical user-item interactions and recommend what people with similar preferences have liked. Although CF-based models generally achieve higher recommendation accuracy than content-based methods, their accuracy drops significantly when the interaction data are sparse. Therefore, hybrid methods [Li, Yeung, and Zhang2011, Wang and Blei2011], which utilize both interaction data and auxiliary information, have been widely adopted in real-world recommender systems.

Collaborative Deep Learning (CDL) [Wang, Wang, and Yeung2015] and the Collaborative Variational Autoencoder (CVAE) [Li and She2017] have recently been proposed as unified models that integrate interaction data and auxiliary information, and both have shown promising results. Both methods leverage probabilistic matrix factorization (PMF) [Mnih and Salakhutdinov2008] to learn user/item latent factors from interaction data through point estimation. Meanwhile, a stacked denoising autoencoder (SDAE) [Vincent et al.2010] (or a VAE [Kingma and Welling2013]) is employed to learn a latent representation from the auxiliary information. The two learners are integrated through mutual regularization, i.e., the latent representation in the SDAE/VAE and the corresponding latent factor in PMF are used to regularize each other. However, the two learners are actually optimized alternately, which makes the "collaboration" asynchronous: the regularization is one-directional in any given iteration. Besides, because the latent factors in PMF are point estimates, the regularization fails to fully leverage the Bayesian representation of the latent variable from the SDAE/VAE.

To address the aforementioned problems, we propose a deep generative probabilistic model under the collaborative learning framework, named the variational collaborative model for user preference (VCM). The overall architecture of the model is illustrated in Figure 1. Two parallel extended VAEs are employed collaboratively to simultaneously learn comprehensive representations of the user latent variable from user interaction data and auxiliary review text data.

Unlike CVAE and CDL, which learn separate user/item latent factors as point estimates through PMF, VCM uses the VAE for CF [Liang et al.2018] to efficiently infer the variational distribution from interaction data as the probabilistic representation of the user latent variable (without item latent factors). We also provide an alternative interpretation of the Kullback-Leibler (KL) divergence regularization in the VAE for CF: we view it as imposing an upper bound on the amount of information preserved in the variational distribution, which allocates a proper user-level capacity and avoids over-fitting, especially for the sparse signals from inactive users.

Benefiting from the probabilistic representations of both the interaction data and the auxiliary information, we design a synchronous collaborative learning mechanism: unlike the asynchronous "collaboration" of CDL and CVAE, we adopt the KL divergence to make the probabilistic representations learned from the two data views match each other at every iteration of the optimization. Compared with previous works, this provides a simple but more effective way to let information flow between user interaction data and auxiliary user information in both directions rather than in one direction. Furthermore, because of the versatility of the VAE, VCM is not limited to using reviews as the auxiliary information. Different multimedia modalities, e.g., images and other texts, can be unified in the framework. Our contributions can be summarized as follows:

  • Unlike previous hybrid models that learn user/item latent factors by attaining maximum a posteriori estimates for interaction data, we propose a two-stream VAE setup that learns a probabilistic representation of the user latent variable and provides user-level capacity.

  • Unlike the asynchronous mutual regularization used in previous models, we let the two components learn from each other under a synchronous collaborative learning mechanism, which allows the model to make full use of the Bayesian probabilistic representations from interaction data and auxiliary information.

  • Extensive experiments on three real-world datasets show that VCM significantly outperforms state-of-the-art models. Ablation studies further show that the improvements come from specific components.


Similar to the work in [Hu, Koren, and Volinsky2008], the recommendation task we address in this paper takes implicit feedback as input. We use a binary matrix $X \in \{0,1\}^{U \times I}$ to indicate the click history between users and items (we use the verb "click" for concreteness to indicate any interaction, including "check-in," "purchase," and "watch"). We use $u \in \{1, \dots, U\}$ to index users and $i \in \{1, \dots, I\}$ to index items. The lower-case $x_u = [x_{u1}, \dots, x_{uI}]^\top$ is a binary vector indicating the click history over all items for user $u$. Each user's reviews are merged into one document; let $Y \in \mathbb{N}^{U \times V}$ be the bag-of-words representation of the users' review documents (where $V$ is the size of the vocabulary). We use $j \in \{1, \dots, V\}$ to index words. The lower-case $y_u = [y_{u1}, \dots, y_{uV}]^\top$ is a bag-of-words vector holding the count of each word in the document of user $u$.
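As a minimal sketch of the two input representations described above, the following builds a binary click vector and a bag-of-words count vector for one user. The function names, the toy item indices, and the three-word vocabulary are hypothetical and chosen only for illustration.

```python
from collections import Counter

def click_vector(clicked_items, num_items):
    """Binary vector: 1 where the user interacted with the item, else 0."""
    x = [0] * num_items
    for item in clicked_items:
        x[item] = 1
    return x

def bow_vector(review_tokens, vocab):
    """Bag-of-words vector: count of each vocabulary word in the merged reviews."""
    counts = Counter(review_tokens)
    # Words outside the vocabulary are ignored; missing words count as 0.
    return [counts[word] for word in vocab]

# Toy user: clicked items 0 and 3 out of 5, and wrote one short merged document.
x_u = click_vector([0, 3], num_items=5)
y_u = bow_vector("great vegan food great sauce".split(), ["great", "vegan", "music"])
```

In practice both matrices are very sparse, so a sparse storage format would be preferable to dense lists.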


The architecture of our proposed model is shown in Figure 1. The model consists of two parallel extended VAEs: one VAE ($\mathrm{VAE}_x$) takes users' click history $x_u$ as input and outputs a probability distribution over items, and the other VAE ($\mathrm{VAE}_y$) takes users' review text $y_u$ as input and outputs a probability distribution over words. Each VAE uses its encoder to compress the input into a variational distribution and then passes a latent variable sampled from this posterior to the decoder to obtain the generative distribution used for prediction. The KL divergence between the two variational distributions is employed for the cooperation between $\mathrm{VAE}_x$ and $\mathrm{VAE}_y$.

Figure 1: VCM model architecture.


We assume that the click history $x_u$ can be generated from a user latent variable $z_x$, and that the review document $y_u$ can be generated from another user latent variable $z_y$. We introduce the variational distributions $q_{\phi_x}(z_x \mid x_u)$ and $q_{\phi_y}(z_y \mid y_u)$ to approximate the true posteriors $p(z_x \mid x_u)$ and $p(z_y \mid y_u)$, which represent the user's click behavior preference and the semantic content of the review document, respectively. Here, we employ the parameterised diagonal Gaussian $\mathcal{N}(\mu_x, \mathrm{diag}(\sigma_x^2))$ as $q_{\phi_x}(z_x \mid x_u)$ and $\mathcal{N}(\mu_y, \mathrm{diag}(\sigma_y^2))$ as $q_{\phi_y}(z_y \mid y_u)$. We define the inference process of the probabilistic encoders as follows:

  1. Construct vector representations of the observed data for user $u$:

     $$h_x = f_x(x_u), \qquad h_y = f_y(y_u)$$

  2. Parameterise the variational distributions over the user latent variables $z_x$ and $z_y$:

     $$[\mu_x, \log \sigma_x^2] = l_x(h_x), \qquad [\mu_y, \log \sigma_y^2] = l_y(h_y)$$

Here $f_x$ and $f_y$ can be any type of deep neural network (DNN) suitable for the observed data, and $l_x$ and $l_y$ are linear transformations that compute the parameters of the variational distributions. $\phi_x$ consists of the parameters of $f_x$ and $l_x$, whereas $\phi_y$ consists of the parameters of $f_y$ and $l_y$.


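We define the generation process of the two softmax decoders as follows:

  1. Draw samples $z_x$ and $z_y$ from the variational posteriors $q_{\phi_x}(z_x \mid x_u)$ and $q_{\phi_y}(z_y \mid y_u)$, respectively.

  2. Produce the probability distributions over items and words for each user through a DNN and the softmax function:

     $$\pi(z_x) = \mathrm{softmax}(f_{\theta_x}(z_x)), \qquad \pi(z_y) = \mathrm{softmax}(f_{\theta_y}(z_y))$$

  3. Reconstruct the data from the two multinomial distributions, respectively:

     $$x_u \sim \mathrm{Mult}(N_u, \pi(z_x)), \qquad y_u \sim \mathrm{Mult}(W_u, \pi(z_y))$$

where $f_{\theta_x}$ and $f_{\theta_y}$ are two DNNs with parameters $\theta_x$ and $\theta_y$, $N_u = \sum_i x_{ui}$ is the total number of clicks, and $W_u = \sum_j y_{uj}$ is the total number of words in the review document of user $u$; the observed data $x_u$ and $y_u$ are generated from the two multinomial distributions, respectively. Therefore, a suitable goal for learning the distribution of the latent variable $z_x$ is to maximize the marginal log-likelihood of the click behavior data in expectation over the whole distribution of $z_x$:

$$\max_{\theta_x, \phi_x} \; \mathbb{E}_{q_{\phi_x}(z_x \mid x_u)}\left[\log p_{\theta_x}(x_u \mid z_x)\right], \qquad \log p_{\theta_x}(x_u \mid z_x) \overset{c}{=} \sum_{i} x_{ui} \log \pi_i(z_x)$$

A similar likelihood function can be obtained for the review document; we omit the analogous derivation for space.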
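To make the softmax decoder and the multinomial likelihood concrete, here is a small sketch that turns decoder outputs into log-probabilities and scores an observed count vector. The helper names and the toy logits are hypothetical; the logits stand in for the output of the decoder network, and the multinomial coefficient is dropped because it does not depend on the model parameters.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_norm for v in logits]

def multinomial_log_likelihood(counts, logits):
    """sum_i counts[i] * log pi_i, up to the (parameter-free) multinomial coefficient."""
    log_pi = log_softmax(logits)
    return sum(c * lp for c, lp in zip(counts, log_pi))

x_u = [1, 0, 1]             # toy click vector over three items
good_logits = [2.0, 0.0, 2.0]   # puts mass on the clicked items
bad_logits = [0.0, 2.0, 0.0]    # puts mass on the unclicked item
ll_good = multinomial_log_likelihood(x_u, good_logits)
ll_bad = multinomial_log_likelihood(x_u, bad_logits)
```

The same function applies unchanged to a bag-of-words vector, since word counts are also modelled as a multinomial draw.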

User-level Capacity

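We introduce a constraint $\epsilon_x$ on $q_{\phi_x}(z_x \mid x_u)$ to control the capacity for different users. This can be achieved by matching $q_{\phi_x}(z_x \mid x_u)$ with an uninformative prior $p(z_x)$, such as the isotropic unit Gaussian $\mathcal{N}(0, I)$ used in [Higgins et al.2016, Higgins et al.2017]. Hence, we obtain the following constrained optimization problem for the marginal log-likelihood of the click behavior data:

$$\max_{\theta_x, \phi_x} \; \mathbb{E}_{q_{\phi_x}(z_x \mid x_u)}\left[\log p_{\theta_x}(x_u \mid z_x)\right] \quad \text{subject to} \quad \mathrm{KL}\left(q_{\phi_x}(z_x \mid x_u) \,\|\, p(z_x)\right) < \epsilon_x$$

The term $\mathrm{KL}(q_{\phi_x}(z_x \mid x_u) \,\|\, p(z_x))$ is zero if the posterior distribution equals the uninformative prior, which means that the model learns nothing from the data. Thus, $\epsilon_x$ can be seen as an upper bound on the amount of information preserved in the variational distribution for each user's preference. According to the complementary slackness KKT conditions [Kuhn1951, Karush1939], solving this optimization problem is equivalent to maximizing the following lower bound:

$$\mathcal{L}(x_u; \theta_x, \phi_x) = \mathbb{E}_{q_{\phi_x}(z_x \mid x_u)}\left[\log p_{\theta_x}(x_u \mid z_x)\right] - \beta_x \, \mathrm{KL}\left(q_{\phi_x}(z_x \mid x_u) \,\|\, p(z_x)\right)$$

This gives the lower bound for $\mathrm{VAE}_x$; a similar derivation gives the lower bound for $\mathrm{VAE}_y$:

$$\mathcal{L}(y_u; \theta_y, \phi_y) = \mathbb{E}_{q_{\phi_y}(z_y \mid y_u)}\left[\log p_{\theta_y}(y_u \mid z_y)\right] - \beta_y \, \mathrm{KL}\left(q_{\phi_y}(z_y \mid y_u) \,\|\, p(z_y)\right)$$

Varying the KKT multipliers $\beta_x$ and $\beta_y$ changes how strongly the variational distributions are pushed to align with the unit Gaussian prior. A proper choice of $\beta_x$ and $\beta_y$ balances the trade-off between the reconstruction loss and the capacity constraint.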
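The capacity term has a closed form when the variational distribution is a diagonal Gaussian and the prior is the unit Gaussian. The sketch below computes it and combines it with a reconstruction log-likelihood into a multiplier-weighted bound. The function names and the log-variance parameterisation are assumptions made for illustration, not details confirmed by the text.

```python
import math

def kl_to_unit_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def weighted_lower_bound(recon_log_lik, mu, log_var, beta):
    """Reconstruction term minus the multiplier-weighted capacity term."""
    return recon_log_lik - beta * kl_to_unit_gaussian(mu, log_var)

# A posterior equal to the prior stores no information about the user.
kl_prior = kl_to_unit_gaussian([0.0, 0.0], [0.0, 0.0])
# Shifting one mean by 1 costs 0.5 nats.
kl_shifted = kl_to_unit_gaussian([1.0, 0.0], [0.0, 0.0])
bound = weighted_lower_bound(-2.0, [1.0, 0.0], [0.0, 0.0], beta=0.2)
```

A larger multiplier makes the same posterior shift more expensive, which is how the weighting limits the capacity used per user.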

Collaborative Learning Mechanism

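To improve the generalization of the recommendation performance of the variational CF model $\mathrm{VAE}_x$, we use $\mathrm{VAE}_y$ as the teacher, which provides the semantic content of the reviews in the form of the posterior $q_{\phi_y}(z_y \mid y_u)$ to guide the learning process of $\mathrm{VAE}_x$. To measure the match between the two posterior distributions $q_{\phi_x}(z_x \mid x_u)$ and $q_{\phi_y}(z_y \mid y_u)$, we adopt the KL divergence. The KL distance from $q_{\phi_x}$ to $q_{\phi_y}$ is computed as:

$$\mathrm{KL}\left(q_{\phi_y}(z_y \mid y_u) \,\|\, q_{\phi_x}(z_x \mid x_u)\right)$$

Similarly, to improve the ability of $\mathrm{VAE}_y$ to learn representations of semantic meaning, we use $\mathrm{VAE}_x$ as the teacher, which provides click behavior preference information in the form of its posterior $q_{\phi_x}(z_x \mid x_u)$ to guide $\mathrm{VAE}_y$ in capturing the semantic content of the review document. The KL distance from $q_{\phi_y}$ to $q_{\phi_x}$ is computed as:

$$\mathrm{KL}\left(q_{\phi_x}(z_x \mid x_u) \,\|\, q_{\phi_y}(z_y \mid y_u)\right)$$

We adopt this bi-directional KL divergence so that the probabilistic representations learned from the two data views match each other, which allows VCM to fully leverage both probabilistic representations.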
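Because both posteriors are diagonal Gaussians, each direction of the KL divergence between them is available in closed form. The following sketch evaluates both directions and their sum; the function names and the log-variance parameterisation are hypothetical, and summing the two directions with equal weight is an assumption made for illustration.

```python
import math

def kl_diag_gaussians(mu_p, log_var_p, mu_q, log_var_q):
    """KL( N(mu_p, diag(exp(log_var_p))) || N(mu_q, diag(exp(log_var_q))) )."""
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, log_var_p, mu_q, log_var_q):
        total += lq - lp + (math.exp(lp) + (mp - mq) ** 2) / math.exp(lq) - 1.0
    return 0.5 * total

def bidirectional_kl(mu_x, log_var_x, mu_y, log_var_y):
    """Sum of both KL directions between the two posteriors."""
    return (kl_diag_gaussians(mu_y, log_var_y, mu_x, log_var_x)
            + kl_diag_gaussians(mu_x, log_var_x, mu_y, log_var_y))

# Toy one-dimensional posteriors: same mean, variances 1 and 4.
narrow_to_wide = kl_diag_gaussians([0.0], [0.0], [0.0], [math.log(4.0)])
wide_to_narrow = kl_diag_gaussians([0.0], [math.log(4.0)], [0.0], [0.0])
both = bidirectional_kl([0.0], [0.0], [0.0], [math.log(4.0)])
```

The two directions differ even though the means coincide, which illustrates why a penalty that ignores the variances cannot capture the full mismatch between the two representations.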

Objective Function

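We form the objective for user $u$ under the collaborative learning mechanism as follows (the objective for the whole dataset is obtained by averaging the per-user objectives):

$$\mathcal{L}_u = \mathcal{L}(x_u; \theta_x, \phi_x) + \mathcal{L}(y_u; \theta_y, \phi_y) - \mathrm{KL}\left(q_{\phi_y}(z_y \mid y_u) \,\|\, q_{\phi_x}(z_x \mid x_u)\right) - \mathrm{KL}\left(q_{\phi_x}(z_x \mid x_u) \,\|\, q_{\phi_y}(z_y \mid y_u)\right)$$

Note that the parameters to be optimized are $\theta = \{\theta_x, \theta_y\}$ and $\phi = \{\phi_x, \phi_y\}$. We can obtain an unbiased estimate of $\mathcal{L}_u$ by sampling $z_x \sim q_{\phi_x}(z_x \mid x_u)$ and $z_y \sim q_{\phi_y}(z_y \mid y_u)$, and then perform stochastic gradient ascent to optimize it. With the reparameterization trick [Kingma and Welling2013], we sample $\xi \sim \mathcal{N}(0, I)$ and reparameterize $z_x = \mu_x + \xi \odot \sigma_x$ and $z_y = \mu_y + \xi \odot \sigma_y$; the stochasticity of the sampling process is thereby isolated, and the gradient with respect to $\phi$ can be back-propagated through the sampled $z_x$ and $z_y$. With $\mathcal{L}_u$ as the final lower bound, we train the two VAEs synchronously at each iteration according to Algorithm 1.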
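The reparameterization trick can be illustrated without any deep learning library: the sample is written as a deterministic function of the mean, the log-variance, and an independent standard normal draw. This is a sketch under the assumption that the encoder outputs a mean and a log-variance per latent dimension; the function name is hypothetical.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Return mu + sigma * noise, with noise ~ N(0, I) drawn from rng."""
    noise = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + n * math.exp(0.5 * lv) for m, n, lv in zip(mu, noise, log_var)]

rng = random.Random(0)
# Many draws from a one-dimensional posterior with mean 1 and variance 1.
draws = [reparameterize([1.0], [0.0], rng)[0] for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / len(draws)
```

Since all randomness sits in the noise draw, the sample is differentiable with respect to the mean and log-variance, which is what lets gradients reach the encoder parameters.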

  Input: click matrix $X$, bag-of-words representation of reviews $Y$, $\beta$, anneal steps
  Randomly initialize $\theta$, $\phi$
  for iteration in anneal steps do
     Sample a batch of users
     for all users $u$ in the batch do
        Compute $z_x$ and $z_y$ via the reparameterization trick
        Compute the noisy gradients $\nabla_\theta \mathcal{L}_u$, $\nabla_\phi \mathcal{L}_u$ with $z_x$ and $z_y$
     end for
     Average the noisy gradients over the batch
     Update $\theta$ and $\phi$ by taking a gradient step with $\nabla_\theta \mathcal{L}$, $\nabla_\phi \mathcal{L}$
  end for
  return $\theta$, $\phi$

Algorithm 1: VCM collaborative training with annealed stochastic gradient descent
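The text states that the KL weights are annealed linearly but does not spell out the schedule. One common choice, shown here purely as an assumption, is to ramp the weight from 0 to a maximum value over a fixed number of steps and hold it there afterwards. The function name and the example maximum of 0.2 are hypothetical.

```python
def linear_anneal(step, anneal_steps, beta_max):
    """Linearly ramp the KL weight from 0 to beta_max, then keep it constant."""
    if anneal_steps <= 0:
        return beta_max
    return beta_max * min(1.0, step / anneal_steps)

# Weight at the start, halfway through, and after the ramp has finished.
schedule = [linear_anneal(s, anneal_steps=100, beta_max=0.2) for s in (0, 50, 100, 200)]
```

Starting from a small weight lets the model first learn to reconstruct before the capacity constraint is enforced at full strength.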


We now describe how we make predictions given a trained model. Given a user's click history $x_u$, we rank all items based on the predicted multinomial probability $\pi(z_x)$. The latent variable $z_x$ for $x_u$ is constructed as follows: we simply take the mean of the variational distribution, $z_x = \mu_x$. We denote this prediction method as VCM.

Benefiting from collaborative learning, our model allows bi-directional prediction (review2click and click2review). To predict click behavior from the semantic content of a user's reviews, we infer the latent variable by presenting the reviews $y_u$ to the encoder of $\mathrm{VAE}_y$; we again simply take the mean, $z_y = \mu_y$, as the latent variable and use the decoder of $\mathrm{VAE}_x$ with $z_y$ as input to generate the predicted multinomial probability over items. Thus, given only a user's review document, our model can encode the text into the latent variable and decode it into click behavior. We denote this cross-domain prediction method as VCM-CD.



We experimented with three publicly accessible datasets from various domains with different scales and sparsity levels.

  • Yelp-2013 (Yelp): This dataset [Seo et al.2017] contains user-business check-in records and reviews from the RecSys Challenge 2013. We only keep users who have checked in at at least five businesses.

  • Amazon Clothing (Clothing): The Amazon dataset consists of consumption records with reviews. We use the 5-core Clothing, Shoes and Jewelry category [He and McAuley2016]. We only keep users with at least five products in their shopping record and products that are bought by at least five users.

  • Amazon Movies (Movies): This dataset [He and McAuley2016] contains user-movie ratings with reviews from the 5-core Movies and TV category. We only keep users with at least five watching records and movies that are watched by at least 10 users.

For each dataset, we binarize the explicit data by keeping ratings of four or higher and interpreting them as implicit feedback. We merge each user's reviews into one document, remove stop words from each document following the same process as [Miao, Yu, and Blunsom2016], and keep the most common words across all documents as the vocabulary. Table 1 summarizes the characteristics of all the datasets after pre-processing.
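The binarization and user filtering steps can be sketched as follows. The function names, the tuple layout of the rating records, and the order in which the two steps are applied are assumptions made for illustration; the text does not specify them.

```python
from collections import Counter

def binarize(ratings, threshold=4.0):
    """Keep (user, item) pairs whose rating is at least the threshold."""
    return {(user, item) for user, item, rating in ratings if rating >= threshold}

def keep_active_users(clicks, min_clicks=5):
    """Drop users with fewer than min_clicks interactions."""
    per_user = Counter(user for user, _ in clicks)
    return {(user, item) for user, item in clicks if per_user[user] >= min_clicks}

ratings = [("a", 1, 5.0), ("a", 2, 3.0), ("a", 3, 4.0), ("b", 1, 4.0)]
clicks = binarize(ratings)
active = keep_active_users(clicks, min_clicks=2)
```

Filtering after binarization, as done here, guarantees the minimum count refers to positive interactions; filtering before would keep users whose remaining history may be shorter.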

                     Yelp       Clothing   Movies
# of users           6,784      21,181     81,780
# of items           10,003     17,710     24,628
# of interactions    106,630    145,281    1,028,839
sparsity             0.16%      0.04%      0.05%

Table 1: Statistical characteristics of the datasets after preprocessing


We use two ranking-based metrics: the truncated normalized discounted cumulative gain (NDCG@$R$) and Recall@$R$. For each user, both metrics compare the predicted rank of the held-out items with their true rank. We obtain the predicted rank by sorting the multinomial probability $\pi(z_x)$. Formally, we define $\omega(r)$ as the item at rank $r$, $\mathbb{I}[\cdot]$ as the indicator function, and $I_u$ as the set of held-out items that user $u$ clicked on. Recall@$R$ for user $u$ is

$$\mathrm{Recall@}R(u, \omega) = \frac{\sum_{r=1}^{R} \mathbb{I}[\omega(r) \in I_u]}{\min(R, |I_u|)}$$

The expression in the denominator is the minimum of $R$ and the number of items clicked by user $u$. Recall@$R$ considers all items ranked within the first $R$ to be equally important, and it reaches its maximum of 1 when the model ranks all relevant items within the top $R$ positions. The truncated discounted cumulative gain (DCG@$R$) is

$$\mathrm{DCG@}R(u, \omega) = \sum_{r=1}^{R} \frac{2^{\mathbb{I}[\omega(r) \in I_u]} - 1}{\log(r + 1)}$$

DCG@$R$ assigns higher scores to higher ranks than to lower ones. NDCG@$R$ is DCG@$R$ linearly normalized to $[0, 1]$ by dividing by the best possible DCG@$R$, which is obtained when all the held-out items are ranked at the top.
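The two metrics can be computed for a single user as in the sketch below, assuming a non-empty held-out set and binary relevance (a held-out item contributes a gain of 1, any other item 0). The function names and the toy ranking are hypothetical. The base of the logarithm cancels in the normalization, so base 2 is used here for convenience.

```python
import math

def recall_at_r(ranked_items, held_out, r):
    """Fraction of held-out items found in the top r, capped by min(r, |held_out|)."""
    hits = sum(1 for item in ranked_items[:r] if item in held_out)
    return hits / min(r, len(held_out))

def ndcg_at_r(ranked_items, held_out, r):
    """DCG of the top r divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked_items[:r]) if item in held_out)
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(r, len(held_out))))
    return dcg / ideal

ranked = [3, 1, 0, 2]     # items sorted by predicted probability, best first
held_out = {3, 0}         # items the user actually clicked in the held-out set
recall_2 = recall_at_r(ranked, held_out, 2)
recall_4 = recall_at_r(ranked, held_out, 4)
ndcg_4 = ndcg_at_r(ranked, held_out, 4)
```

In this toy case one relevant item sits at rank 1 and the other at rank 3, so recall reaches 1 only once the cutoff covers rank 3, while NDCG penalizes the late second hit.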


As previous works [Wang, Wang, and Yeung2015, Li and She2017, Zheng, Noroozi, and Yu2017] have demonstrated, hybrid recommendation with auxiliary information performs significantly better than CF-based models, so only hybrid models are used for comparison. The baselines included in our comparison are as follows:

  • CDL: Collaborative Deep Learning [Wang, Wang, and Yeung2015] tightly combines the SDAE with PMF. The middle layer of the neural network acts as a bridge between the SDAE and PMF.

  • CVAE: The Collaborative Variational Autoencoder [Li and She2017] is a probabilistic feedforward model for joint learning of a VAE and collaborative filtering. CVAE is a very strong baseline and achieves the best performance among our baseline methods.

  • DeepCoNN: Deep Cooperative Neural Networks [Zheng, Noroozi, and Yu2017] jointly model users and items from textual reviews for rating prediction. To make it comparable, we adapt the model to implicit feedback with negative sampling [He et al.2017].

Experimental setup

We randomly split the interaction data into training, validation, and test sets. For each user, we take 60% of the entire click history as $x_u$ and the review document as $y_u$ to train the models.

For evaluation, we use 20% of the click history as the validation set to tune hyper-parameters and the remaining 20% as the held-out test set. We can take either the click history in the training set (VCM prediction) or the review document (VCM-CD prediction) to learn the necessary user representations, and then compute the metrics by looking at how well the model ranks the unseen clicked items from the held-out set.

We select model hyper-parameters and architectures by evaluating NDCG@100 on the validation sets. For VCM, we explore multilayer perceptrons (MLPs) with 0, 1, and 2 hidden layers and select the best overall architecture for $\mathrm{VAE}_x$ and for $\mathrm{VAE}_y$; we find that going deeper does not improve performance. We use tanh as the activation function between layers. Note that since the outputs of the encoders' final linear layers are used as the mean and variance of the Gaussian random variables, we do not apply an activation function to them. We apply dropout at the input layer. We do not apply weight decay to any part of the model. We train our model using Adam [Kingma and Ba2015] with mini-batches of users for 200 epochs on both datasets. We save the model with the best validation NDCG@100 and report test set metrics with it. For simplicity, we set $\beta_x$ and $\beta_y$ to the same value and anneal them linearly over the anneal steps, using the schedule described in Algorithm 1.
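A per-user random split of the click history can be sketched as follows. The text fixes the validation and test shares at 20% each; treating the remainder as the training share, the rounding rule, and the function name are assumptions made for illustration.

```python
import random

def split_user_clicks(items, rng, train_frac=0.6, valid_frac=0.2):
    """Shuffle one user's clicked items and cut them into train/validation/test."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(round(train_frac * len(shuffled)))
    n_valid = int(round(valid_frac * len(shuffled)))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]   # whatever remains after the first two cuts
    return train, valid, test

train, valid, test = split_user_clicks(range(10), random.Random(0))
```

For users with very short histories the rounded validation or test part can be empty, so such users may need special handling before computing metrics.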

Figure 2 shows the NDCG@100 on the Clothing validation set during training. We also empirically studied the effects of two important parameters of VCM: the latent dimension and the regularization coefficients $\beta_x$ and $\beta_y$. Figure 3 shows the performance of VCM on the Clothing validation set as the latent dimension and $\beta_x$, $\beta_y$ vary, to investigate its sensitivity. As can be seen, performance does not improve when the dimension of the latent space in $\mathrm{VAE}_x$ is greater than 100. Results on Movies and Yelp show the same trend and are omitted due to space limitations.

Figure 2: NDCG@100 of all models w.r.t. the number of epochs on the Clothing validation set
Figure 3: NDCG@100 of VCM w.r.t. the latent dimension and $\beta$ on the Clothing validation set

Quantitative result

Figure 4 summarizes the results of our proposed methods and the baselines. The experiments are repeated 10 times, and the averages are reported. Each metric is averaged across all users. Both VCM and VCM-CD significantly outperform the baselines across datasets and metrics.

As can be seen, CVAE is a very strong baseline and outperforms the other baselines in most situations. Compared with CDL, the inference network of CVAE learns a better probabilistic representation of the latent variable for auxiliary information, leading to better performance, whereas CDL requires an additional noise criterion in the observation space of the auxiliary information, which makes it less robust. The inferior results of DeepCoNN may be because it uses only a single learner to learn user/item representations, with only auxiliary information as input, in contrast to a hybrid model. Therefore, it cannot capture the implicit relationships between users stored in the interaction data very well.

Focusing on the comparison of CVAE and VCM, we can see that although both use deep learning models to extract representations of the auxiliary information, the proposed VCM achieves better and more robust recommendation, especially for large $R$. This is because VCM learns the user probabilistic representation with the two-stream VAE setup, instead of learning user/item latent factors through the point estimates of PMF. Besides, the collaborative learning mechanism allows the model to fully leverage the Bayesian deep representations from the two views of information and lets the two learners be optimized synchronously. CVAE, on the other hand, fails to achieve this robust performance because of the point-estimate nature of the latent factors learned by PMF and the alternating optimization. VCM-CD, which uses cross-domain inference to make predictions, can achieve better performance than VCM because the review text used here contains more specific information about user preferences when the interaction data are extremely sparse. This improvement is especially obvious on the sparsest dataset, Clothing.

Figure 4: Evaluation of top-$R$ item recommendation on three datasets. Panels: (a) Yelp NDCG@$R$, (b) Yelp Recall@$R$, (c) Clothing NDCG@$R$, (d) Clothing Recall@$R$, (e) Movies NDCG@$R$, (f) Movies Recall@$R$. Standard errors of NDCG@100 are around 8e-4 for Yelp, 4e-4 for Clothing, and 2e-4 for Movies. For each subplot, a paired t-test is performed, and statistical significance compared to the best baseline is marked. We could not finish DeepCoNN within a reasonable amount of time on Movies.

Ablation Study

In this subsection, we conduct an ablation study to better understand how the collaborative learning mechanism works. We develop the following variants:

  • VCM-Se: The collaborative learning mechanism of VCM is removed, and VCM is separated into two independent variational models.

  • VCM-OD: We first train $\mathrm{VAE}_y$ on the reviews alone without the influence of $\mathrm{VAE}_x$. Then we fix $\mathrm{VAE}_y$ and train $\mathrm{VAE}_x$ under its guidance. This means that information can only flow from $\mathrm{VAE}_y$ to $\mathrm{VAE}_x$ in one direction, unlike the bi-directional flow in the collaborative learning mechanism.

  • VCM-NV: The bi-directional KL regularization in the collaborative learning mechanism is replaced with a constraint on the means $\mu_x$ and $\mu_y$ only, which does not consider the variances $\sigma_x^2$ and $\sigma_y^2$ of the probabilistic representations.

The performance of VCM and its variants on Movies, Yelp, and Clothing is given in Table 2. To demonstrate that the cooperation between the two VAEs enhances recommendation performance, VCM-Se trains two independent VAEs without the collaborative learning mechanism. In this manner, the two variational distributions $q_{\phi_x}(z_x \mid x_u)$ and $q_{\phi_y}(z_y \mid y_u)$ are learned without considering the information from each other. As can be seen in Table 2, VCM achieves the best performance, which verifies that modeling users' preferences from two views does improve the performance of $\mathrm{VAE}_x$. To investigate the importance of the bi-directional information flow in the collaborative learning mechanism, VCM-OD considers only a one-directional information flow. The performance gap between VCM-OD and VCM suggests that the synchronous collaborative training scheme is better than only using $\mathrm{VAE}_y$ to enhance $\mathrm{VAE}_x$. Furthermore, although VCM-NV can also learn probabilistic representations for the two views of data, its constraint ignores the variance, so the two learners cannot leverage all the information stored in the representations; the performance of VCM-NV therefore drops for the same reason as that of CVAE.

Model     Yelp     Clothing   Movies
VCM-Se    0.036    0.015      0.046
VCM-OD    0.047    0.025      0.047
VCM-NV    0.044    0.024      0.051
VCM       0.051    0.027      0.057

Table 2: Comparison of variants of the proposed model in terms of NDCG@10. VCM achieves the best result on every dataset. Statistical significance is assessed at p < 0.01, compared to the best variant.

The impact of collaborative learning on $\mathrm{VAE}_x$

It is natural to wonder how collaborative learning improves the performance of $\mathrm{VAE}_x$. Intuitively, by modeling the user latent variable with click behavior and review text collaboratively, VCM can learn a more expressive representation than VCM-Se. Therefore, VCM could be more robust when a user's click behavior data are scarce. To study this, we break users down into five groups based on their activity level, i.e., the number of items each user has clicked on. According to the complementary slackness KKT condition [Kuhn1951, Karush1939], we can use $\mathrm{KL}(q_{\phi_x}(z_x \mid x_u) \,\|\, p(z_x))$ as an approximation of the capacity limit $\epsilon_x$ after optimization. It indicates the amount of information stored in the variational distribution. We compute NDCG@10 and this KL value for each group using VCM and VCM-Se. Table 3 summarizes how performance differs across users of different activity levels.

It is interesting to find that, as the activity level increases, the variational distribution capacity of both VCM and VCM-Se increases monotonically. This shows that, by using $\mathrm{VAE}_x$ to learn the probabilistic representation of the user latent variable, both VCM and VCM-Se can automatically allocate a proper user-level capacity to store the information for users of different activity levels.

We also find that the variational distribution capacity of VCM is greater than that of VCM-Se for users of all activity levels in the three datasets. This shows that the collaborative learning mechanism allows the information in the review text to flow from $\mathrm{VAE}_y$ to $\mathrm{VAE}_x$, which makes $\mathrm{VAE}_x$ more expressive, and $\mathrm{VAE}_x$ then automatically allocates more capacity to store the more comprehensive information. The gain in capacity between the two models is particularly prominent for users who click only a small number of items (shown in bold in Table 3).

Activity   VCM-Se KL   VCM-Se NDCG@10   VCM KL         VCM NDCG@10

5-20       9.7         0.032            18.3 (+87%)    0.048 (+49%)
20-40      18.4        0.051            32.5 (+76%)    0.064 (+26%)
40-60      25.2        0.049            42.9 (+70%)    0.064 (+30%)
60-80      33.7        0.049            49.7 (+47%)    0.069 (+42%)
80-max     47.1        0.093            59.0 (+25%)    0.107 (+15%)

5-6        6.7         0.015            9.0 (+33%)     0.027 (+70%)
6-7        8.4         0.016            10.7 (+27%)    0.031 (+85%)
8-9        10.3        0.017            12.7 (+23%)    0.034 (+94%)
10-11      11.0        0.022            13.4 (+21%)    0.036 (+55%)
12-max     14.4        0.023            17.4 (+20%)    0.036 (+53%)

5-20       10.0        0.046            19.7 (+96%)    0.056 (+22%)
20-40      20.9        0.046            33.4 (+59%)    0.059 (+27%)
40-60      30.1        0.048            43.8 (+45%)    0.062 (+30%)
60-80      36.3        0.051            50.4 (+38%)    0.064 (+25%)
80-max     66.2        0.087            72.1 (+8%)     0.100 (+14%)

Table 3: NDCG@10 and the approximation of capacity (KL) for users with increasing levels of activity, where the activity level is measured by how many items a user clicked on. Each block of five rows corresponds to one of the three datasets. The larger the value of KL, the more information the distribution contains. Although details vary across datasets, VCM consistently improves NDCG@10 and KL for users of all activity levels. The relative improvement over VCM-Se is shown in parentheses.

The impact of collaborative learning on $\mathrm{VAE}_y$

The multinomial distribution $\pi(z_y)$ models the probability of each word appearing in the review document of user $u$. Without collaborative learning, the likelihood of the review document rewards $\mathrm{VAE}_y$ only for putting probability mass on the high-frequency words in $y_u$. However, under the influence of $\mathrm{VAE}_x$ through the collaborative learning mechanism, $\mathrm{VAE}_y$ should also assign more probability mass to the keywords that represent user preference.

We highlight words that have high probability in Figure 5. We randomly sample review examples from two users in the Yelp dataset. Words with the highest probability are colored dark green, high-probability words are colored light green, and low/medium-probability words are not colored. In Figure 5, we compare $\pi(z_y)$ for VCM-Se and VCM. For convenient comparison, we use blue and red rectangles to emphasize their differences.

For User I, $\pi(z_y)$ of VCM puts more probability on the words "vegetarian," "healthy," "vegan," and "sauce," which suggest that the user may be a vegetarian who pays attention to healthy habits. Without the collaborative learning mechanism, $\pi(z_y)$ of VCM-Se instead puts more probability on less meaningful words such as "helpful," "wrong," and "large." A similar result is observed for User II, where the words "music" and "museum" show an obvious preference. This demonstrates that the collaborative learning mechanism has a beneficial influence on both learners: it not only enhances the recommendation performance of $\mathrm{VAE}_x$ but also makes $\mathrm{VAE}_y$ capture more representative words.

Figure 5: Words highlighted by $\pi(z_y)$ in two users' reviews

Related work

Compared to the CF-based approach, the hybrid model relies on two sources of information to mitigate the sparsity problem. Based on how tightly the interaction data and auxiliary information are integrated, hybrid models can be divided into two subcategories: loosely coupled and tightly coupled methods [Wang, Wang, and Yeung2015]. A loosely coupled method combines the outputs of separate collaborative and content-based systems into a final recommendation through a linear combination [Miranda et al.1999] or a voting scheme [Pazzani1999]. A tightly coupled method takes the processed auxiliary information as a feature of the collaborative method [Li, Yeung, and Zhang2011]. However, these methods all assume that the features are good representations, which is usually not the case. Collaborative topic regression (CTR) [Wang and Blei2011] explicitly integrates latent Dirichlet allocation (LDA) [Blei, Ng, and Jordan2003] and PMF for the two sources of information, with promising results. However, its representation ability is limited by the topic model.

On the other hand, deep learning models have shown great potential for learning effective representations [Vincent et al.2010]. Very recently, neural collaborative filtering [He et al.2017] and VAE-CF [Liang et al.2018], which use neural networks, have shown promising results, but they belong to the CF-based methods. CDL [Wang, Wang, and Yeung2015] and the collaborative recurrent autoencoder have been proposed for jointly learning an SDAE [Vincent et al.2010] (respectively, a denoising recurrent autoencoder) with PMF. Both models try to learn representations from auxiliary information with additional denoising criteria. Going a step further, CVAE proposes to infer the stochastic distribution of the latent variable for auxiliary information through a neural network. However, most previous works use an asynchronous mutual regularization between the learners, which cannot fully leverage the representations of the two sources of information.

There is also another line of research that utilizes a single learner with only auxiliary information, such as review text, as input for rating regression [Chen et al.2018, Seo et al.2017]. DeepCoNN [Zheng, Noroozi, and Yu2017], which models users and items using review text for rating prediction, has shown promising results. Although these methods utilize word embeddings [Mikolov et al.2013] and convolutional neural networks (CNNs) [Collobert et al.2011] to learn good representations of text data, they use only a single learner to learn user/item representations with only the auxiliary information as input, in contrast to hybrid models, so they cannot capture well the implicit relationships between users stored in the interaction data.


This paper proposes the variational collaborative model, which jointly and collaboratively models the generation of auxiliary information and interaction data. It is a deep generative probabilistic model that learns a probabilistic representation of the user latent variable through VAEs, leading to robust recommendation performance. To the best of our knowledge, VCM is the first pure deep learning model that can fully leverage the probabilistic representations learned from different sources of data, owing to its synchronous collaborative learning mechanism. The experiments show that the proposed VCM significantly outperforms state-of-the-art methods for hybrid recommendation, with more robust performance.


  • [Adomavicius and Tuzhilin2005] Adomavicius, G., and Tuzhilin, A. 2005. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge & Data Engineering (6):734–749.
  • [Blei, Ng, and Jordan2003] Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent dirichlet allocation. Journal of Machine Learning Research.
  • [Chen et al.2018] Chen, C.; Zhang, M.; Liu, Y.; and Ma, S. 2018. Neural attentional rating regression with review-level explanations. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, 1583–1592. International World Wide Web Conferences Steering Committee.
  • [Collobert et al.2011] Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(Aug):2493–2537.
  • [He and McAuley2016] He, R., and McAuley, J. 2016. Vbpr: Visual bayesian personalized ranking from implicit feedback. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 144–150.
  • [He et al.2017] He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee.
  • [Higgins et al.2016] Higgins, I.; Matthey, L.; Glorot, X.; Pal, A.; Uria, B.; Blundell, C.; Mohamed, S.; and Lerchner, A. 2016. Early visual concept learning with unsupervised deep learning. arXiv preprint arXiv:1606.05579.
  • [Higgins et al.2017] Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; and Lerchner, A. 2017. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations.
  • [Hu, Koren, and Volinsky2008] Hu, Y.; Koren, Y.; and Volinsky, C. 2008. Collaborative filtering for implicit feedback datasets. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 263–272. IEEE Computer Society.
  • [Karush1939] Karush, W. 1939. Minima of functions of several variables with inequalities as side constraints. M. Sc. Dissertation. Dept. of Mathematics, Univ. of Chicago.
  • [Kingma and Ba2015] Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
  • [Kingma and Welling2013] Kingma, D. P., and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • [Kuhn1951] Kuhn, H. W., and Tucker, A. W. 1951. Nonlinear programming. In Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability, 481–492.
  • [Lang1995] Lang, K. 1995. Newsweeder: Learning to filter netnews. In Machine Learning Proceedings 1995. Elsevier. 331–339.
  • [Li and She2017] Li, X., and She, J. 2017. Collaborative variational autoencoder for recommender systems. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 305–314. ACM.
  • [Li, Yeung, and Zhang2011] Li, W.-J.; Yeung, D.-Y.; and Zhang, Z. 2011. Generalized latent factor models for social network analysis. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI), Barcelona, Spain, 1705.
  • [Liang et al.2018] Liang, D.; Krishnan, R. G.; Hoffman, M. D.; and Jebara, T. 2018. Variational autoencoders for collaborative filtering. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, 689–698. International World Wide Web Conferences Steering Committee.
  • [Miao, Yu, and Blunsom2016] Miao, Y.; Yu, L.; and Blunsom, P. 2016. Neural variational inference for text processing. In International Conference on Machine Learning, 1727–1736.
  • [Mikolov et al.2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, 3111–3119.
  • [Miranda et al.1999] Miranda, T.; Claypool, M.; Gokhale, A.; Mir, T.; Murnikov, P.; Netes, D.; and Sartin, M. 1999. Combining content-based and collaborative filters in an online newspaper. In Proceedings of ACM SIGIR Workshop on Recommender Systems. Citeseer.
  • [Mnih and Salakhutdinov2008] Mnih, A., and Salakhutdinov, R. R. 2008. Probabilistic matrix factorization. In Advances in neural information processing systems, 1257–1264.
  • [Pazzani and Billsus1997] Pazzani, M., and Billsus, D. 1997. Learning and revising user profiles: The identification of interesting web sites. Machine learning 27(3):313–331.
  • [Pazzani1999] Pazzani, M. J. 1999. A framework for collaborative, content-based and demographic filtering. Artificial intelligence review 13(5-6):393–408.
  • [Seo et al.2017] Seo, S.; Huang, J.; Yang, H.; and Liu, Y. 2017. Interpretable convolutional neural networks with dual local and global attention for review rating prediction. In Proceedings of the Eleventh ACM Conference on Recommender Systems, 297–305. ACM.
  • [Vincent et al.2010] Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; and Manzagol, P.-A. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research 11(Dec):3371–3408.
  • [Wang and Blei2011] Wang, C., and Blei, D. M. 2011. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, 448–456. ACM.
  • [Wang, Wang, and Yeung2015] Wang, H.; Wang, N.; and Yeung, D.-Y. 2015. Collaborative deep learning for recommender systems. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1235–1244. ACM.
  • [Zheng, Noroozi, and Yu2017] Zheng, L.; Noroozi, V.; and Yu, P. S. 2017. Joint deep modeling of users and items using reviews for recommendation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 425–434. ACM.