Recommender systems have become increasingly indispensable. Applications include top-N recommendations, which are widely adopted to recommend ranked lists of items to users. In e-commerce, typically only a few recommendations are shown to the user at a time, and recommender systems are often evaluated on the performance of their top-N recommendations.
Collaborative Filtering (CF) based methods are a fundamental building block in many recommender systems. CF-based recommender systems predict what items a user will prefer by discovering and exploiting similarity patterns across users and items. The performance of CF-based methods often drops significantly when ratings are very sparse. With the increased availability of so-called side information, that is, additional information associated with items such as product reviews or movie plots, there is great interest in taking advantage of such information to compensate for the sparsity of ratings.
Existing methods utilizing side information are linear models (Ning and Karypis, 2012), which have a restricted model capacity. A growing body of work generalizes linear models with deep learning to explore non-linearities for large-scale recommendations (He et al., 2017; Sedhain et al., 2015; Wu et al., 2016; Zheng et al., 2016). State-of-the-art performance is achieved by applying Variational Autoencoders (VAEs) (Kingma and Welling, 2013) to CF (Li and She, 2017; Liang et al., 2018; Lee et al., 2017). These deep models learn item representations from side information. Thus, the dimension of the side information determines the input dimension of the network, which dominates the overall size of the model. This is problematic since side information is generally high-dimensional (Chen et al., 2017). As shown in our experiments, existing deep models fail to beat linear models because of the high dimensionality of the side information and an insufficient number of samples.
To avoid the impact of high dimensionality while retaining the effectiveness of VAEs, we propose to learn feature representations from side information. In this way, the dimension of the side information corresponds to the number of samples rather than the input dimension of the deep network. To instantiate this idea, we propose the collective Variational Autoencoder (cVAE), which learns to recover user ratings and side information simultaneously through a VAE. While user ratings and side information are different sources of information, both are associated with items. Thus, we take the ratings from each user and each dimension of side information over all items as the input for the VAE, so that samples from both sources have the same dimensionality (the number of items). We can then feed ratings and side information into the same inference network and generation network. cVAE complements the sparse ratings with side information, as feeding side information into the same VAE increases the number of training samples. The high dimensionality of side information is not a problem for cVAE, as it increases the sample size rather than the network scale. To account for the heterogeneity of user ratings and side information, the final layer of the generation network follows different distributions depending on the type of information. Training a VAE by feeding it side information as input acts like a pre-training step, which is crucial for developing a robust deep network. Our experiments show that the proposed model, cVAE, achieves state-of-the-art performance for top-N recommendation with side information.
We introduce relevant notation in this section. We use , and to denote the number of users, items and the dimension of side information, respectively. We study the problem of top-N recommendation with high-dimensional side information, where . We write for the matrix for side information and for user ratings. We summarize our notation in Table 1.
|number of users|
|number of items|
|dimension of side information|
|dimension of latent item representation|
|number of recommended items|
|matrix of side information|
|matrix of user rating|
|matrix of latent user representation|
|matrix of latent item representation|
|matrix of latent feature representation|
|hidden layer of inference network|
|hidden layer of generation network|
|the mean of latent input representation|
|the variance of latent user or feature representation|
|non-linear transformation of inference network|
|non-linear transformation of generation network|
|the activation function to get|
|the activation function to get|
|the sigmoid function|
2.2. Linear models for top-N recommendation
Sparse LInear Method (SLIM) (Ning and Karypis, 2011) achieves state-of-the-art performance for top-N recommendation. SLIM learns to reproduce the user rating matrix through:
Here, is the coefficient matrix, which is analogous to the item similarity matrix. The performance of SLIM is heavily affected by rating sparsity (Kabbur et al., 2013). Side information has been utilized to overcome this issue (Ning and Karypis, 2012; Zhao et al., 2016; Chen et al., 2017). As a typical example of a method that uses side information, collective SLIM (cSLIM) learns from both user ratings and side information. Specifically, are both reproduced through:
cSLIM learns the coefficient matrix collectively from both side information and user rating , a strategy that can help to overcome rating sparsity by side information. However, cSLIM is restricted by the fact that it is a linear model, which has limited model capacity.
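The collective idea can be illustrated with a deliberately simplified sketch: the rating matrix and a feature-by-item matrix are stacked, and a single item-item coefficient matrix is fitted to reproduce both. This is not the published cSLIM solver. It assumes plain ridge regression with a closed-form solution and omits the sparsity, non-negativity and zero-diagonal constraints of SLIM-style models, so the result leans toward the identity matrix. The names `ratings`, `features_by_item`, `alpha` and `ridge` are hypothetical.

```python
import numpy as np

def fit_collective_linear(ratings, features_by_item, alpha=1.0, ridge=1.0):
    """Fit one item-item matrix W so that ratings @ W ~ ratings and
    features_by_item @ W ~ features_by_item (ridge-regularized least squares).

    ratings:          (num_users, num_items) array
    features_by_item: (num_features, num_items) array
    alpha:            assumed weight of the side-information term
    """
    # Stack both sources as rows; every row has one entry per item.
    stacked = np.vstack([ratings, np.sqrt(alpha) * features_by_item])
    gram = stacked.T @ stacked
    num_items = gram.shape[0]
    # Closed-form ridge solution: W = (A^T A + ridge * I)^{-1} A^T A
    return np.linalg.solve(gram + ridge * np.eye(num_items), gram)

rng = np.random.default_rng(0)
ratings = (rng.random((6, 4)) < 0.4).astype(float)   # toy implicit feedback
features_by_item = rng.random((10, 4))               # toy side information
W = fit_collective_linear(ratings, features_by_item, alpha=0.5, ridge=1.0)
scores = ratings @ W   # predicted preference scores, one row per user
```

The point of the sketch is only that both sources contribute rows to the same least-squares problem, so the side information shapes the learned item-item coefficients even when ratings are sparse.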
2.3. Autoencoders for collaborative filtering
Autoencoders are neural networks popularized by Kramer (1991). They are unsupervised networks where the output of the network aims to be a reconstruction of the input.
In the context of CF, the autoencoder is fed with incomplete rows (resp. columns) of the user rating matrix. It then outputs a vector that predicts the missing entries. These approaches perform a non-linear low-rank approximation of the rating matrix in two different ways, using a User-side Autoencoder (UAE) (Figure 2(a)) or an Item-side Autoencoder (IAE) (Figure 2(b)), which recover it respectively through:
where is the user representation and is the item representation. Moreover, and are the encode network and decode network, respectively. UAEs encode to learn a user latent representation and then recover from . In contrast, IAEs encode the transpose of to learn item latent representation and then recover the transpose of from . Note that UAEs work in a similar way as SLIM, as both can be viewed as reproducing through , which also captures item similarities.
When side information associated with items is available, the Feature-side Autoencoder (FAE) is utilized to learn item representations:
where is the item representation. Existing hybrid methods incorporate FAE with IAE, as both learn item representations. However, this way of incorporating side information requires estimating two separate VAEs, which is not an effective way to address rating sparsity. Such methods are also vulnerable to the high dimensionality of side information.
In this section, we propose a new way to incorporate side information with user ratings by combining the effectiveness of both cSLIM and autoencoders. We propose to reproduce the side information by an FAE and the user ratings by a UAE. In this way, the inputs to both autoencoders have the same dimension, i.e., the number of items. Thus, we can feed both into the same autoencoder rather than into two different autoencoders, which helps to overcome rating sparsity.
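The shape argument can be checked on toy data. The sketch below assumes that the side information is stored with one row per item and one column per feature dimension; the sizes and variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

num_users, num_items, num_features = 5, 4, 7   # toy sizes

rng = np.random.default_rng(0)
# Binary implicit-feedback ratings: one row per user, one column per item.
ratings = (rng.random((num_users, num_items)) < 0.3).astype(float)
# Side information: one row per item, one column per feature dimension.
side_info = rng.random((num_items, num_features))

# Each user row and each feature dimension (a column of side_info) is a
# vector with one entry per item, so both fit the same input layer.
feature_samples = side_info.T                       # (num_features, num_items)
all_samples = np.vstack([feature_samples, ratings])
# Remember which output distribution applies to each sample.
is_rating = np.array([False] * num_features + [True] * num_users)
```

Adding feature dimensions increases the number of rows in `all_samples` while the input width stays at the number of items.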
3.1. Collective variational autoencoder
We propose a collective Variational Autoencoder (cVAE) to generalize the linear models for top-N recommendation with side information to non-linear models, by taking advantage of Variational Autoencoders (VAEs). Specifically, we propose to recover the user ratings and the side information through
where and correspond to the inference network and generation network parameterized by and , respectively. An overview of cVAE is depicted in Figure 1. Unlike previous work utilizing VAEs, the proposed model encodes and decodes user rating and side information through the same inference and generation networks. Our model can be viewed as a non-linear generalization of cSLIM, so as to learn item similarities collectively from user ratings and side information. While user ratings and side information are two different types of information, cSLIM fails to distinguish them. In contrast, cVAE assumes the output of the generation network to follow different distributions according to the type of input it has been fed.
Next, we describe the cVAE model in detail. Following common practice for VAEs, we first assume the latent variables for users and for feature dimensions to follow a Gaussian distribution:
is an identity matrix. While user ratings and side information are fed into the same network, we would like to distinguish them via different distributions. In this paper, we assume that the user rating matrix is binarized to capture implicit feedback, which is a common setting for top-N recommendation (Ning and Karypis, 2011). Thus we follow Lee et al. (2017) and assume that the ratings of a user over all items follow a Bernoulli distribution:
is the sigmoid function. This defines the loss function when feeding user rating as input, i.e., the logistic log-likelihood for user:
where is the -th element of the vector and is normalized through a sigmoid function so that is within .
For side information, we study numerical features so that we assume the -th dimension of side information from all items follows a Gaussian distribution:
This defines the loss function when feeding side information as input, i.e., the Gaussian log-likelihood for dimension :
where is the -th element of vector . Note that although we assume and to be generated from and respectively, the generation has shared parameters .
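The two output distributions translate into two reconstruction terms computed from the same decoder output. The following is a minimal sketch under stated assumptions: the Gaussian is taken to have unit variance with the additive constant dropped, no weighting between positive and negative entries is applied, and all function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bernoulli_log_likelihood(targets, logits):
    """Logistic log-likelihood of binary ratings given decoder outputs."""
    probs = np.clip(sigmoid(logits), 1e-7, 1.0 - 1e-7)  # avoid log(0)
    return np.sum(targets * np.log(probs)
                  + (1.0 - targets) * np.log(1.0 - probs))

def gaussian_log_likelihood(targets, means):
    """Unit-variance Gaussian log-likelihood without the additive constant."""
    return -0.5 * np.sum((targets - means) ** 2)

def log_likelihood(targets, decoder_output, is_rating):
    # Same decoder output, different output distribution per input type.
    if is_rating:
        return bernoulli_log_likelihood(targets, decoder_output)
    return gaussian_log_likelihood(targets, decoder_output)
```

Only the final transformation and the loss depend on the input type; the parameters that produce `decoder_output` are shared.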
The generation procedure is summarized as follows:
for each user :
for each dimension of side information :
Once the cVAE is trained, we can generate recommendations for each user with items ranked in descending order of . Here, is calculated as , that is, we take the mean of for prediction.
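Producing the ranked list from predicted scores can be sketched as follows. Excluding items the user has already rated is an assumption of this sketch, and the function name and toy scores are hypothetical.

```python
def top_n_items(scores, rated_items, n):
    """Return the n unrated item indices with the highest predicted scores."""
    candidates = [i for i in range(len(scores)) if i not in rated_items]
    # Sort by descending score; the sort is stable, so ties keep the
    # lower item index first.
    candidates.sort(key=lambda i: -scores[i])
    return candidates[:n]

predicted = [0.10, 0.85, 0.40, 0.92, 0.05]   # hypothetical predicted means for one user
already_rated = {3}                          # items seen during training
print(top_n_items(predicted, already_rated, n=2))   # [1, 2]
```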
Next, we discuss how to perform inference for cVAE.
3.2. Variational inference
The log-likelihood of cVAE is intractable due to the non-linear transformations of the generation network. Thus, we resort to variational inference to approximate the distribution. Variational inference approximates the true intractable posterior with a simpler variational distribution . We follow the mean-field assumption (Xing et al., 2002) by setting to be a fully factorized Gaussian distribution:
While we can optimize by minimizing the Kullback-Leibler divergence , the number of parameters to learn grows with the number of users and dimensions of side information. This can become a bottleneck for real-world recommender systems with millions of users and high-dimensional side information. The VAE replaces individual variational parameters with a data-dependent function through an inference network parameterized by , i.e., , where and are generated as:
Putting together and with and forms the proposed cVAE (Figure 1).
We follow to derive the Evidence Lower Bound (ELBO):
We use a Monte Carlo gradient estimator (Paisley et al., 2012) to infer the expectation in Equation (3). We draw samples of and from and perform stochastic gradient ascent to optimize the ELBO. In order to take gradients with respect to through sampling, we follow the reparameterization trick (Kingma and Welling, 2013) to sample and as:
As the KL-divergence can be analytically derived (Kingma and Welling, 2013), we can then rewrite as:
We then maximize ELBO given in Equation (4) to learn and .
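The two ingredients of this section, the reparameterization trick and the closed-form KL term for a diagonal Gaussian against a standard normal prior, can be sketched in a few lines. Parameterizing the inference network output as a mean and a log-variance is an assumption of this sketch, as are the function names.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var)

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, -2.0])
z = reparameterize(mu, log_var, rng)       # one Monte Carlo sample
kl = kl_to_standard_normal(mu, log_var)
```

Because the noise `eps` is sampled independently of the parameters, gradients can flow through `mu` and `log_var` when this is written in an automatic differentiation framework.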
3.3. Implementation details
We discuss the implementation of cVAE in detail. As we feed the user rating matrix and the item side information through the same input layer, which has one neuron per item, we need to ensure that the inputs from both types of information have the same format. In this paper, we assume that user ratings are binarized to capture implicit feedback and that side information is represented as a bag-of-words. We propose to train cVAEs through a two-phase algorithm. We first train the model on side information, which works as pre-training. We then refine the VAE by feeding it user ratings.
We follow the typical setting by taking as a Multi-Layer Perceptron (MLP); is also taken to be an MLP with a network structure identical to that of . We also introduce two parameters, i.e., and , to extend the model and make it more suitable for the recommendation task.
The ELBO derived in Equation (3) can be interpreted term by term: the first term is the reconstruction error, while the second term acts as regularization. The ELBO is often over-regularized for recommendation tasks (Liang et al., 2018). Therefore, a parameter is introduced to control the strength of regularization, so that the ELBO becomes:
We propose to train the cVAE in two phases. We first pre-train the cVAE by feeding it side information only. We then refine the model by feeding it user ratings. While Liang et al. (2018) suggest setting small to avoid over-regularization, we opt for a larger value for during refinement, for two reasons: (1) the model is effectively pre-trained with side information, so it is reasonable to require the posterior to comply more closely with the prior; and (2) refinement with user ratings can easily overfit due to the sparsity of ratings, so it is reasonable to regularize more heavily to avoid overfitting.
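The weighted objective and the order of the two phases can be sketched schematically. The regularization weights below are hypothetical placeholders, not the values selected in the experiments, and the schedule covers a single pass over each source rather than a full training procedure.

```python
def weighted_elbo(log_likelihood, kl_divergence, beta):
    """Reconstruction term minus beta times the KL regularizer."""
    return log_likelihood - beta * kl_divergence

def two_phase_schedule(num_features, num_users, beta_pretrain, beta_refine):
    """Yield (phase, sample_kind, sample_index, beta) in training order."""
    # Phase 1: pre-train on the dimensions of the side information.
    for j in range(num_features):
        yield ("pretrain", "feature", j, beta_pretrain)
    # Phase 2: refine on user ratings with heavier regularization.
    for u in range(num_users):
        yield ("refine", "rating", u, beta_refine)

schedule = list(two_phase_schedule(num_features=3, num_users=2,
                                   beta_pretrain=0.1, beta_refine=1.0))
```

In a real implementation each tuple would correspond to a mini-batch and the loop would run for several epochs per phase.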
4.1. Experimental setup
We conduct experiments on two datasets, Games and Sports, constructed from different categories of Amazon products (McAuley and Leskovec, 2013). For each category, the original dataset contains transactions between users and items, indicating implicit user feedback. The statistics of the datasets are presented in Table 2. We use the product reviews as item features. We extract unigram features from the reviews and remove stopwords. We represent each product item as a bag-of-words feature vector.
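A bag-of-words vector of the kind described above could be built roughly as follows. The tokenizer, the stopword list and the vocabulary are toy assumptions made for this sketch; the actual preprocessing pipeline is not specified here.

```python
import re
from collections import Counter

# Toy stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "is", "it", "this", "of", "to", "for"}

def bag_of_words(reviews, vocabulary):
    """Count unigram occurrences over all reviews of one item."""
    counts = Counter()
    for text in reviews:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in STOPWORDS:
                counts[token] += 1
    return [counts[word] for word in vocabulary]

vocabulary = ["controller", "fun", "game", "great"]
reviews = ["This game is great fun.", "Great controller for the game!"]
print(bag_of_words(reviews, vocabulary))   # [1, 1, 2, 2]
```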
4.1.2. Methods for comparison
We contrast the performance of cVAE with that of existing VAE-based methods for CF: cfVAE (Li and She, 2017) and rVAE (Liang et al., 2018). Note that the performance of cfVAE will be affected greatly by the high dimensionality of side information. Besides, as cfVAE was originally designed for the rating prediction task, the recommendations provided by cfVAE will be less effective. While rVAE is effective for top-N recommendation, it suffers from rating sparsity as side information is not utilized.
We also compare with the state-of-the-art linear model for top-N recommendation with side information, i.e., cSLIM (Ning and Karypis, 2012). By comparing with cSLIM, we can evaluate the capacity of cVAE as it can be regarded as a deep extension of cSLIM. We also compare with fVAE, which is the pre-trained model of cVAE with side information only. Note that cVAE is the refinement of fVAE with user ratings.
For all the VAE-based methods, we follow Kingma and Welling (2013) to set the batch size as 100 so that we can set . We choose a two-layer network architecture for the inference network and generation network. For cfVAE and rVAE, the scale is 200-100 for the inference network and 100-200 for the generation network. For fVAE and cVAE, the scale is 1000-100 and 100-1000, respectively. The network scale for cfVAE and rVAE is relatively smaller because (1) the input for cfVAE is high-dimensional with relatively few samples; and (2) the input for rVAE is sparse, which easily leads to overfitting at a larger network scale. In comparison, we can select more hidden neurons for fVAE as it takes each dimension of the features over all items as input, so that the input for the network has relatively few dimensions and the number of samples is sufficient. The same holds for cVAE, which uses side information to overcome rating sparsity.
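The effect of the input width on model size can be made concrete by counting parameters. The item and feature counts below are hypothetical and do not describe the datasets used here; the count covers a plain fully connected encoder and ignores separate mean and variance heads.

```python
def mlp_parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network."""
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))

num_items = 5000        # hypothetical
num_features = 50000    # hypothetical, high-dimensional side information

# Item features as network input: input width = num_features,
# and only num_items training samples are available.
item_side = mlp_parameter_count([num_features, 200, 100])
# Feature dimensions as samples: input width = num_items,
# and num_features training samples are available.
feature_side = mlp_parameter_count([num_items, 1000, 100])
print(item_side, feature_side)
```

Under these toy numbers the feature-side encoder has fewer parameters despite its wider hidden layer, and it is trained on ten times as many samples.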
4.1.3. Evaluation method
To evaluate the performance of top-N recommendation, we split the user rating matrix into training, validation and test parts, used respectively for training the model, selecting parameters and testing the recommendation accuracy. Specifically, for each user, we randomly hold out 10% of the ratings for the validation set and 10% for the test set, and put the remaining ratings in the training set. For each user, the unrated items are sorted in decreasing order of predicted score and the first N items are returned as the top-N recommendations for that user.
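A per-user hold-out split of this kind might look as follows. Rounding the hold-out size down, so that users with fewer than ten ratings contribute nothing to validation or test, is an assumption of this sketch, as are the function and variable names.

```python
import random

def split_user_items(items, rng, holdout_fraction=0.1):
    """Split one user's rated items into train / validation / test lists."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    validation = shuffled[:n_holdout]
    test = shuffled[n_holdout:2 * n_holdout]
    train = shuffled[2 * n_holdout:]
    return train, validation, test

rng = random.Random(0)                      # fixed seed for reproducibility
train, validation, test = split_user_items(range(20), rng)
```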
Given the list of top-N recommended items for user , Precision at N (Pre@N) and Recall at N (Rec@N) are defined as
Average precision at N (AP@N) is a ranked precision metric that gives larger credit to correctly recommended items in the top-N ranks. AP@N is defined as the average of precisions computed at all positions with an adopted item, namely
where Pre@k is the precision at cut-off k in the top-N recommended list. Here, is an indicator function
Mean average precision at N (MAP@N) is defined as the mean of the AP scores for all users. Following Wu et al. (2016), the list of recommended items is evaluated using Rec@N and MAP@N.
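The metrics can be sketched in plain Python. The normalizer of AP@N is taken here to be the smaller of N and the number of held-out items of the user, which is an assumption based on the convention commonly used with this metric; the function names and the toy lists are hypothetical.

```python
def precision_at(recommended, relevant, n):
    hits = sum(1 for item in recommended[:n] if item in relevant)
    return hits / n

def recall_at(recommended, relevant, n):
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:n] if item in relevant)
    return hits / len(relevant)

def average_precision_at(recommended, relevant, n):
    if not relevant:
        return 0.0
    total = 0.0
    for k, item in enumerate(recommended[:n], start=1):
        if item in relevant:              # indicator: item at rank k is adopted
            total += precision_at(recommended, relevant, k)
    return total / min(n, len(relevant))  # assumed normalizer

def mean_average_precision_at(all_recommended, all_relevant, n):
    scores = [average_precision_at(rec, rel, n)
              for rec, rel in zip(all_recommended, all_relevant)]
    return sum(scores) / len(scores)

recommended = [3, 1, 7, 5]   # ranked list for one user
relevant = {1, 5, 9}         # held-out items of that user
```

For this toy user, two of the four recommended items are relevant, at ranks 2 and 4, so Pre@4 is 0.5 and Rec@4 is 2/3.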
4.2. Experimental results
4.2.1. Parameter selection
To compare the performance of alternative top-N recommendation methods, we first select parameters for all the methods through validation. Specifically, for cSLIM, we select and from , , , , , , . For cfVAE, we select from and from . For rVAE and fVAE, we select from and from . For cVAE, we select from and from . Note that we tune with larger values to allow heavier regularization during the refinement.
The result of parameter selection is shown in Table 3.
4.2.2. Performance comparison
We present the results in terms of Rec@N and MAP@N in Table 4, where N is respectively set as . We show the best score in boldface. We attach asterisks to the best score if the improvement over the second best score is statistically significant; to this end, we conducted two-sided tests for the null hypothesis that cVAE and the second best have identical average values; we use one asterisk if and two asterisks if .
As shown in Table 4, cVAE outperforms the other methods according to all metrics and on both datasets. The improvement is also significant in many settings. A general trend is that the significance of the improvements becomes more evident as N gets larger. Note that the other three VAE-based methods are less effective with high-dimensional side information; they even fail to beat linear models. In contrast, cVAE improves over cSLIM by using a VAE for non-linear low-rank approximation. This demonstrates the effectiveness of our proposed cVAE model.
Specifically, on the Games dataset, cVAE shows significant improvements over the state-of-the-art methods. Apart from cVAE, cfVAE provides the best recommendation among all VAE-based CF methods, although it fails to beat cSLIM. This is followed by fVAE, which utilizes side information only. rVAE performs the worst, due to the rating sparsity.
On the Sports dataset, significant improvements can only be observed for Rec@15 and Rec@20. The results yield an interesting insight. If we look at the parameter selection for cSLIM, we can see that is set to 0, which means cSLIM performs the best recommendation when no side information is utilized. This does not necessarily mean that the side information of Sports is useless for recommendation. Actually, fVAE provides acceptable recommendations by utilizing side information only. Therefore, the way of incorporating side information by cSLIM is not effective. In comparison, cVAE improves over cSLIM by utilizing side information.
4.2.3. Effect of the number of recommended items
As depicted in Figure 3(a), the gap between cVAE and the other methods grows with N. It is interesting to note that fVAE surpasses cfVAE when and . This further demonstrates the effectiveness of the pre-training phase with side information proposed in this model.
In Figure 3(c), both fVAE and cfVAE outperform cSLIM when , and fVAE outperforms cfVAE when . This shows that deep models are superior to linear models when more items are recommended. In comparison, the improvement achieved by cVAE is more evident when , and the gap between cVAE and the second best method is always substantial.
On the other hand, the performance w.r.t. MAP@N does not show big differences as N grows. Note that on the Games dataset (Figure 3(b)), cVAE performs much better than cSLIM when N is small. The improvement becomes less evident as N grows.
5. Related Work
We review related work on linear models for top-N recommendation with side information and on deep models for collaborative filtering.
5.1. Top-N recommendation with side information
Various methods have been developed to incorporate side information in recommender systems. Most of these methods have been developed in the context of the rating prediction problem, whereas the top-N recommendation problem has received less attention. In the rest of this section we only review methods addressing top-N recommendation problems.
Ning and Karypis (2012) propose several methods to incorporate side information with SLIM (Ning and Karypis, 2011). Among these methods, cSLIM generally achieves the best performance as it can compensate for sparse ratings with side information. Zhao et al. (2016) and Zhao and Guo (2017) propose a joint model that combines self-recovery of user ratings with prediction from side information. Side information is also utilized to address cold-start top-N recommendation. Elbadrawy and Karypis (2015) learn feature weights for calculating item similarities. Sharma et al. (2015) further improve over Elbadrawy and Karypis (2015) by studying feature interactions. While these methods achieve state-of-the-art performance for top-N recommendation, they are all linear models, which have restricted model capacity.
5.2. Deep learning for hybrid recommendation
Several authors have attempted to combine deep learning with collaborative filtering. Wu et al. (2016) utilize a denoising autoencoder to encode ratings and recover the score prediction. Zhuang et al. (2017) propose a dual-autoencoder to learn representations for both users and items. He et al. (2017) generalize matrix factorization for collaborative filtering by a neural network. These methods utilize user ratings only, that is, side information is not utilized. Wang et al. (2015) propose stacked denoising autoencoders to learn item representations from side information and form a collaborative deep learning method. Later, Li et al. (2015) reduce the computational cost of training by replacing stacked denoising autoencoders with a marginalized denoising autoencoder. Rather than manually corrupting the input, variational autoencoders were later utilized for representation learning (Li and She, 2017). These models achieve state-of-the-art performance among hybrid recommender systems, but they are less effective when side information is high-dimensional. For more discussion of deep learning based recommender systems, we refer to a recent survey (Zhang et al., 2017).
6. Conclusion
In this paper, we have proposed an alternative way to feed side information to a neural network so as to overcome its high dimensionality. We propose the collective Variational Autoencoder (cVAE), which can be regarded as a non-linear generalization of cSLIM. cVAE overcomes rating sparsity by feeding both ratings and side information into the same inference network and generation network. To cater for the heterogeneity of information (ratings and side information), we assume different sources of information to follow different distributions, which is reflected in the use of different loss functions. As for the implementation, we introduce a parameter to balance the positive samples and negative samples. We also introduce as the parameter for regularization, which controls how closely the latent variable should comply with the prior distribution. We conduct experiments on Amazon datasets. The results show the superiority of cVAE over other methods in the scenario with high-dimensional side information.
In conclusion, deep models are effective as long as the number of inputs is sufficient. Thus, using side information to pre-train cVAE helps to overcome the high dimensionality. A general rule of thumb is to regularize cVAE lightly during pre-training and heavily during refinement.
References
- Chen et al. (2017) Yifan Chen, Xiang Zhao, and Maarten de Rijke. 2017. Top-N Recommendation with High-Dimensional Side Information via Locality Preserving Projection. In SIGIR. ACM, 985–988.
- Elbadrawy and Karypis (2015) Asmaa Elbadrawy and George Karypis. 2015. User-Specific Feature-Based Similarity Models for Top-n Recommendation of New Items. TIST 6, 3 (2015), 33:1–33:20.
- He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In WWW. 173–182.
- Kabbur et al. (2013) Santosh Kabbur, Xia Ning, and George Karypis. 2013. FISM: factored item similarity models for top-N recommender systems. In SIGKDD. ACM, 659–667.
- Kingma and Welling (2013) Diederik P. Kingma and Max Welling. 2013. Auto-Encoding Variational Bayes. CoRR abs/1312.6114 (2013).
- Kramer (1991) Mark A. Kramer. 1991. Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal 37, 2 (1991), 233–243.
- Lee et al. (2017) Wonsung Lee, Kyungwoo Song, and Il-Chul Moon. 2017. Augmented Variational Autoencoders for Collaborative Filtering with Auxiliary Information. In CIKM. ACM, 1139–1148.
- Li et al. (2015) Sheng Li, Jaya Kawale, and Yun Fu. 2015. Deep Collaborative Filtering via Marginalized Denoising Auto-encoder. In CIKM. ACM, 811–820.
- Li and She (2017) Xiaopeng Li and James She. 2017. Collaborative Variational Autoencoder for Recommender Systems. In SIGKDD. ACM, 305–314.
- Liang et al. (2018) Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, and Tony Jebara. 2018. Variational Autoencoders for Collaborative Filtering. In WWW. ACM, 689–698.
- McAuley and Leskovec (2013) Julian J. McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: understanding rating dimensions with review text. In RecSys. ACM, 165–172.
- Ning and Karypis (2011) Xia Ning and George Karypis. 2011. SLIM: Sparse Linear Methods for Top-N Recommender Systems. In ICDM. IEEE, 497–506.
- Ning and Karypis (2012) Xia Ning and George Karypis. 2012. Sparse linear methods with side information for top-n recommendations. In RecSys. ACM, 155–162.
- Paisley et al. (2012) John William Paisley, David M. Blei, and Michael I. Jordan. 2012. Variational Bayesian Inference with Stochastic Search. In ICML. JMLR.
- Sedhain et al. (2015) Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. AutoRec: Autoencoders Meet Collaborative Filtering. In WWW. ACM, 111–112.
- Sharma et al. (2015) Mohit Sharma, Jiayu Zhou, Junling Hu, and George Karypis. 2015. Feature-based factorized Bilinear Similarity Model for Cold-Start Top-n Item Recommendation. In SDM. SIAM, 190–198.
- Strub et al. (2016) Florian Strub, Romaric Gaudel, and Jérémie Mary. 2016. Hybrid Recommender System based on Autoencoders. In DLRS. ACM, 11–16.
- Wang et al. (2015) Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative Deep Learning for Recommender Systems. In SIGKDD. ACM, 1235–1244.
- Wu et al. (2016) Yao Wu, Christopher DuBois, Alice X. Zheng, and Martin Ester. 2016. Collaborative Denoising Auto-Encoders for Top-N Recommender Systems. In WSDM. ACM, 153–162.
- Xing et al. (2002) Eric P Xing, Michael I Jordan, and Stuart Russell. 2002. A generalized mean field algorithm for variational inference in exponential families. In UAI. Morgan Kaufmann Publishers Inc., 583–591.
- Zhang et al. (2017) Shuai Zhang, Lina Yao, and Aixin Sun. 2017. Deep Learning based Recommender System: A Survey and New Perspectives. CoRR abs/1707.07435 (2017).
- Zhao and Guo (2017) Feipeng Zhao and Yuhong Guo. 2017. Learning Discriminative Recommendation Systems with Side Information. In IJCAI. 3469–3475.
- Zhao et al. (2016) Feipeng Zhao, Min Xiao, and Yuhong Guo. 2016. Predictive Collaborative Filtering with Side Information. In IJCAI. 2385–2391.
- Zheng et al. (2016) Yin Zheng, Bangsheng Tang, Wenkui Ding, and Hanning Zhou. 2016. A Neural Autoregressive Approach to Collaborative Filtering. In ICML, Vol. 48. JMLR, 764–773.
- Zhuang et al. (2017) Fuzhen Zhuang, Zhiqiang Zhang, Mingda Qian, Chuan Shi, Xing Xie, and Qing He. 2017. Representation learning via Dual-Autoencoder for recommendation. Neural Networks 90 (2017), 83–89.