Recommender systems play a critical role in the retail, social networking, and entertainment industries. Providing personalized recommendations is an important commercial strategy for online websites and mobile applications. There are two major recommendation tasks: rating prediction and personalized ranking. The former usually requires explicit ratings (e.g., 1-5 stars), while the latter aims to generate a ranked list of items in descending order of each user's estimated preferences. In many real-world scenarios where only implicit feedback is available, personalized ranking is the more appropriate and popular choice [Rendle et al.2009]. Collaborative filtering (CF) is a de facto approach that has been widely used in many real-world recommender systems [Ricci et al.2015]. CF assumes that user-item interactions can be modelled by the inner product of user and item latent factors in a low-dimensional space. An effective and widely adopted ranking model based on CF is Bayesian Personalized Ranking (BPR) [Rendle et al.2009], which optimizes ranking lists with a personalized pairwise loss. Another state-of-the-art model is the sparse linear method (SLIM) [Ning and Karypis2011], which recommends top-$N$ items via sparse linear regression. While BPR and SLIM have been shown to perform well on the ranking task, we argue that they are hindered by a critical limitation: both are built on the assumption that there exists a linear relationship between users and items, whereas the relationship is likely to be more complex in real-life scenarios.
In recent years, researchers have demonstrated the efficacy of deep neural models for recommendation problems [Zhang et al.2017a, Karatzoglou and Hidasi2017]. Deep neural networks can be integrated into classic recommendation models such as collaborative filtering [He et al.2017, Tay et al.2018a] and content-based approaches [Cheng et al.2016, Tay et al.2018b] to enhance their performance. Many deep neural techniques, such as the multi-layered perceptron (MLP), autoencoder (AE), recurrent neural network (RNN) and convolutional neural network (CNN), can be applied to recommendation models. AE is usually used to incorporate side information of users/items. For example, [Wang et al.2015] and [Zhang et al.2017b] proposed integrated models that combine the latent factor model (LFM) with different variants of the autoencoder; AE can also be adopted to reconstruct the rating matrix directly [Sedhain et al.2015]. CNN is mainly used to extract features from textual [Kim et al.2016, Zheng et al.2017], audio [Van den Oord et al.2013] or visual [He and McAuley2016] content. RNN can be used to model the sequential patterns of rating data or session-based recommendation [Hidasi et al.2015]. For example, [Wu et al.2017] designed a recurrent neural network based rating prediction model to capture the temporal dynamics of rating data; [Hidasi et al.2015] proposed using RNN to capture the interconnections between sessions. Some works attempted to generalize traditional recommendation models into neural versions. For example, [He et al.2017, He and Chua2017] designed neural translations of the LFM and factorization machine to model user-item interactions; [Xue et al.2017] proposed a deep matrix factorization model to predict users' preferences from historical explicit feedback.
Most previous works focused on either explicit feedback (the rating prediction task) or representation learning from abundant auxiliary information, rather than interpreting user-item relationships in depth. In this work, we aim to model the intricate user-item relationships from implicit feedback, instead of explicit ratings, by applying multi-layered nonlinear transformations. The main contributions are as follows:
We propose two deep neural network based recommendation models, user-based NeuRec (U-NeuRec) and item-based NeuRec (I-NeuRec), for the personalized ranking task. We present an elegant integration of LFM and neural networks which can capture both the linearity and non-linearity in real-life datasets.
With deep neural networks, we manage to reduce the number of parameters of existing advanced models while achieving superior performance.
To make this paper self-contained, we first define the research problem and introduce two highly relevant previous works.
2.1 Problem Statement
Let $M$ and $N$ denote the total number of users and items in a recommender system, so we have an $M \times N$ interaction matrix $X$. We use lower-case letters $u$ and $i$ to denote a user and an item respectively, and $X_{ui}$ represents the preference of user $u$ for item $i$. In our work, we will use two important vectors: $X_{u*}$ and $X_{*i}$. $X_{u*}$ denotes user $u$'s preferences toward all items; $X_{*i}$ denotes the preferences for item $i$ received from all users in the system. We focus on recommendation with implicit feedback here. Implicit feedback such as clicks, browses and purchases is widely accessible and easy to collect. We set $X_{ui}$ to $1$ if an interaction between user $u$ and item $i$ exists, and to $0$ otherwise. Here, $X_{ui} = 0$ does not necessarily mean that user $u$ dislikes item $i$; it may also mean that the user is unaware of item $i$.
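The binarization step above can be sketched in a few lines. This is a minimal illustration, not the authors' code; the function name and the toy count matrix are hypothetical.

```python
import numpy as np

def binarize_interactions(raw_counts):
    """Convert raw implicit feedback (e.g., click or purchase counts)
    into the binary interaction matrix X described above:
    X[u, i] = 1 if any interaction exists, else 0."""
    raw = np.asarray(raw_counts)
    return (raw > 0).astype(np.float32)

# Hypothetical toy data: 3 users x 4 items, entries are interaction counts.
counts = np.array([[0, 3, 1, 0],
                   [2, 0, 0, 0],
                   [0, 0, 5, 1]])
X = binarize_interactions(counts)
```

Note that a zero entry conflates "disliked" with "never seen", which is exactly the ambiguity discussed above.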
2.2 Latent Factor Model
Latent factor model (LFM) is an effective methodology for model-based collaborative filtering. It assumes that the user-item affinity can be derived from low-dimensional representations of users and items. The latent factor method has been widely studied and many variants have been developed [Koren et al.2009, Koren2008, Zhang et al.2017b, Salakhutdinov and Mnih2007]. One of the most successful realizations of LFM is matrix factorization. It factorizes the interaction matrix into two low-rank matrices sharing a latent space of dimensionality $k$ ($k$ is much smaller than $M$ and $N$), such that user-item interactions are approximated as inner products in that space:

$$\hat{X}_{ui} = U_u^{\top} V_i$$

where $U_u \in \mathbb{R}^k$ is the user latent factor and $V_i \in \mathbb{R}^k$ is the item latent factor. With this low-rank approximation, it compresses the original matrix down to two much smaller matrices.
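As a quick numerical illustration of the low-rank approximation, the sketch below builds random user and item factor matrices and scores every user-item pair by an inner product (the sizes and names are arbitrary, chosen only for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, k = 5, 7, 3            # users, items, latent dimension (k << M, N)
U = rng.normal(size=(M, k))  # user latent factors, one row per user
V = rng.normal(size=(N, k))  # item latent factors, one row per item

# LFM approximates the full M x N interaction matrix by an inner product:
X_hat = U @ V.T              # shape (M, N)

def predict(u, i):
    """Predicted preference of user u for item i."""
    return U[u] @ V[i]
```

Storing `U` and `V` requires only $(M + N)k$ numbers instead of $MN$, which is the compression the text refers to.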
2.3 Sparse Linear Method
SLIM [Ning and Karypis2011] is a sparse linear model for top-$N$ recommendation. It aims to learn a sparse aggregation coefficient matrix $W \in \mathbb{R}^{N \times N}$. $W$ is reminiscent of the similarity matrix in item-based neighbourhood CF (itemCF) [Linden et al.2003], but SLIM learns the similarity matrix as a least squares problem rather than determining it with predefined similarity metrics (e.g., cosine, Jaccard, etc.). It finds the optimal coefficient matrix $W$ by solving the following optimization problem:

$$\min_{W} \; \frac{1}{2}\|X - XW\|_F^2 + \frac{\beta}{2}\|W\|_F^2 + \lambda\|W\|_1 \quad \text{s.t. } W \geq 0, \; \mathrm{diag}(W) = 0$$

The constraints are intended to avoid trivial solutions and ensure positive similarities. The $\ell_1$ norm is adopted to introduce sparsity into the matrix $W$. SLIM can be considered a special case of LFM in which one factor matrix is fixed to $X$ itself and the other is $W$. SLIM has been demonstrated to outperform numerous models in terms of top-$N$ recommendation. Nevertheless, we argue that it has two main drawbacks: (1) By definition, the size of $W$ is far larger than that of the two latent factor matrices ($N^2$ versus $(M + N)k$), which also results in higher model complexity. Even though this can be improved via feature selection by first learning an itemCF model, doing so sacrifices model generalization, as it heavily relies on other pre-trained recommendation models; (2) SLIM assumes a strong linear relationship between the interaction matrix $X$ and the learned scores $XW$. However, this assumption does not necessarily hold. Intuitively, the relationship is likely to be far more complex in real-world applications due to the dynamicity of user preferences and item changes. In this work, we aim to address these two problems. Inspired by LFM and recent advances of deep neural networks on recommendation tasks, we propose employing a deep neural network to tackle the above disadvantages by introducing non-linearity into top-$N$ recommendation.
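To make the SLIM idea concrete, the sketch below fits an item-item coefficient matrix by ridge regression. This is a deliberately simplified stand-in, not the SLIM algorithm itself: it uses a closed-form $\ell_2$-only solution and omits SLIM's $\ell_1$ sparsity, non-negativity and zero-diagonal constraints, which require coordinate-descent solvers.

```python
import numpy as np

def fit_slim_like(X, l2=1.0):
    """Fit an item-item coefficient matrix W so that X @ W approximates X.
    Simplified sketch: closed-form ridge solution, no sparsity or
    non-negativity constraints and no zero-diagonal constraint."""
    X = np.asarray(X, dtype=float)
    n_items = X.shape[1]
    gram = X.T @ X
    W = np.linalg.solve(gram + l2 * np.eye(n_items), gram)
    return W

# Toy binary interaction matrix: 3 users x 3 items.
X = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.]])
W = fit_slim_like(X, l2=0.1)
scores = X @ W   # recommendation scores for all user-item pairs
```

Even this toy version makes the first drawback visible: `W` is $N \times N$, so its size grows quadratically with the item catalogue.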
3 Proposed Methodology
In this section, we present a novel nonlinear model based on neural networks for top-$N$ recommendation, which we denote NeuRec. Unlike SLIM, which directly applies a linear mapping to the interaction matrix $X$, NeuRec first maps $X$ into a low-dimensional space with multi-layer neural networks. This transformation not only reduces the parameter size, but also introduces non-linearity into the recommendation model. The user-item interaction is then modeled by an inner product in the low-dimensional space. Based on this approach, we further devise two variants, namely U-NeuRec and I-NeuRec.
3.1 User-based NeuRec
For user-based NeuRec, we first obtain high-level dense representations from the rows of $X$ with feed-forward neural networks. Note that $X$ is constructed from training data, so there is no leakage of test data in this model. Let $W_l$ and $b_l$, $l \in \{1, \ldots, L\}$ ($L$ is the number of layers), denote the weights and biases of layer $l$. For each user, we have

$$h_u^{(1)} = f(W_1 X_{u*} + b_1), \qquad h_u^{(l)} = f(W_l h_u^{(l-1)} + b_l), \; l = 2, \ldots, L$$

where $f(\cdot)$ is a non-linear activation function such as sigmoid, tanh or relu. The dimension of the output $h_u^{(L)}$ is usually much smaller than that of the original input $X_{u*}$. Suppose the output dimension is $k$ (we reuse the latent factor size $k$ here); then we have an output $h_u^{(L)} \in \mathbb{R}^k$ for each user. As in latent factor models, we define an item latent factor $V_i \in \mathbb{R}^k$ for each item, and consider $h_u^{(L)}$ as the user latent factor. The recommendation score is computed by the inner product of these two latent factors:

$$\hat{X}_{ui} = (h_u^{(L)})^{\top} V_i$$
To train this model, we minimize the regularized squared error in the following form:

$$\min_{W_*, b_*, V} \sum_{u,i} (X_{ui} - \hat{X}_{ui})^2 + \lambda \left( \|W_*\|_F^2 + \|V\|_F^2 \right)$$

Here, $\lambda$ is the regularization rate. We adopt the Frobenius norm to regularize the weights $W_*$ and the item latent factors $V$. Since the learned parameters are no longer a similarity matrix but a mapping into a low-dimensional latent space, the constraints in SLIM and the $\ell_1$ norm can be relaxed. For optimization, we apply the Adam algorithm [Kingma and Ba2014] to this objective function. Figure 1 (left) illustrates the architecture of U-NeuRec.
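The U-NeuRec forward pass and training loss can be sketched as below. This is an illustrative NumPy re-implementation under assumed shapes (a two-layer sigmoid network, toy sizes), not the authors' TensorFlow code, and it omits the Adam update step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def u_neurec_forward(x_u, weights, biases, V):
    """U-NeuRec forward pass for one user: map the user's interaction row
    x_u through stacked non-linear layers, then score every item by the
    inner product with its latent factor (one row of V)."""
    h = x_u
    for W_l, b_l in zip(weights, biases):
        h = sigmoid(W_l @ h + b_l)
    return V @ h            # scores for all items, shape (n_items,)

rng = np.random.default_rng(1)
n_items, hidden, k = 6, 4, 3
weights = [rng.normal(scale=0.1, size=(hidden, n_items)),
           rng.normal(scale=0.1, size=(k, hidden))]
biases = [np.zeros(hidden), np.zeros(k)]
V = rng.normal(scale=0.1, size=(n_items, k))   # item latent factors

x_u = np.array([1., 0., 1., 0., 0., 1.])       # a user's implicit feedback row
scores = u_neurec_forward(x_u, weights, biases, V)

# Regularized squared error against the observed row (lambda = 0.01 here):
loss = np.sum((x_u - scores) ** 2) + 0.01 * sum(np.sum(W ** 2) for W in weights)
```

In practice the loss would be summed over all users and minimized with Adam, as described above.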
3.2 Item-based NeuRec
Likewise, we use the columns of $X$ as input and learn a dense representation for each item with a multi-layered neural network:

$$h_i^{(1)} = f(W_1 X_{*i} + b_1), \qquad h_i^{(l)} = f(W_l h_i^{(l-1)} + b_l), \; l = 2, \ldots, L$$

Let $U_u \in \mathbb{R}^k$ denote the user latent factor for user $u$; the preference score of user $u$ for item $i$ is then computed by

$$\hat{X}_{ui} = U_u^{\top} h_i^{(L)}$$

We also employ a regularized squared error as the training loss. Thus, the objective function of item-based NeuRec is formulated as

$$\min_{W_*, b_*, U} \sum_{u,i} (X_{ui} - \hat{X}_{ui})^2 + \lambda \left( \|W_*\|_F^2 + \|U\|_F^2 \right)$$

The optimal parameters can likewise be learned with the Adam optimizer. The architecture of I-NeuRec is illustrated in Figure 1 (right).
3.3 Dropout Regularization
Dropout [Srivastava et al.2014] is an effective regularization technique for neural networks. It reduces co-adaptation between neurons by randomly dropping some of them during training. Unlike traditional dropout, which is usually applied to hidden layers, we propose applying the dropout operation to the input layer $X_{u*}$ or $X_{*i}$ (we found that the improvement from applying dropout to hidden layers is subtle in our case). By randomly dropping some historical interactions, we prevent the model from learning the identity function and increase the robustness of NeuRec.
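Input dropout can be sketched as below. The function is a hypothetical standalone helper using the common "inverted dropout" convention (rescaling kept entries at training time); the paper does not specify this detail, so treat the rescaling as an assumption.

```python
import numpy as np

def input_dropout(x, drop_rate, rng):
    """Randomly zero out a fraction of a user's historical interactions.
    Applied to the input row itself rather than to hidden layers, as the
    text proposes. Uses inverted dropout: kept entries are rescaled so the
    expected magnitude of the input is unchanged."""
    keep = 1.0 - drop_rate
    mask = rng.random(x.shape) < keep
    return x * mask / keep

rng = np.random.default_rng(42)
x_u = np.ones(1000)            # dense toy input row
x_dropped = input_dropout(x_u, drop_rate=0.3, rng=rng)
```

Because part of each user's history is hidden at every step, the network cannot simply copy its input through to the output, which is the identity-function failure mode mentioned above.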
3.4 Relation to LFM and SLIM
In this section, we shed some light on the relationships between NeuRec and LFM / SLIM. NeuRec can be regarded as a neural integration of LFM and the sparse linear model. NeuRec utilizes the concept of latent factors from LFM; the major difference is that either the item or the user latent factors of NeuRec are learned from the rating matrix with a deep neural network. In addition, NeuRec manages to capture both negative and positive feedback in an integrated manner by using rows or columns of $X$ as inputs. More precisely, U-NeuRec is a neural extension of SLIM: if we set $f(\cdot)$ to the identity function and omit the biases, the stacked layers collapse into a single linear map, so the predicted scores become a linear function of $X_{u*}$ and U-NeuRec degrades to a SLIM-style model whose coefficient matrix is the product of the layer weight matrices and the item latent factors. Note that the sparsity and non-negativity constraints are dropped. I-NeuRec has no direct relationship with SLIM; nonetheless, it can be viewed as a symmetric version of U-NeuRec. Since the objective functions of NeuRec and SLIM are similar, the complexities of both models are linear in the size of the interaction matrix; yet NeuRec has fewer model parameters.
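The degeneration argument (identity activation, no biases, U-NeuRec reduces to a single linear map) can be verified numerically. The sketch below uses arbitrary toy dimensions; it only checks the linear-algebra identity, not the full training setup.

```python
import numpy as np

rng = np.random.default_rng(3)
n_items, hidden, k = 5, 4, 3
W1 = rng.normal(size=(hidden, n_items))   # first layer weights
W2 = rng.normal(size=(k, hidden))         # second layer weights
V = rng.normal(size=(n_items, k))         # item latent factors

x_u = rng.random(n_items)                 # a user's interaction row

# Layer-by-layer forward pass with identity activation and no biases:
scores_net = V @ (W2 @ (W1 @ x_u))

# Equivalent single linear map: a SLIM-style n_items x n_items
# coefficient matrix formed by multiplying out the weights.
W_slim = V @ W2 @ W1
scores_lin = W_slim @ x_u
```

The two score vectors coincide, confirming that without the non-linearity the model expresses nothing beyond a linear method.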
3.5 Pairwise Learning Approach
NeuRec can also be trained with a pairwise scheme using the Bayesian log loss:

$$\min_{\Theta} \sum_{(u,i,j)} -\ln \sigma(\hat{X}_{ui} - \hat{X}_{uj}) + \lambda \|\Theta\|_F^2$$

where $\Theta$ denotes the model parameters ($W_*$, $b_*$ and $V$ for U-NeuRec; $W_*$, $b_*$ and $U$ for I-NeuRec), $\|\cdot\|_F$ is the Frobenius regularization, and $i$ and $j$ represent an observed and an unobserved item respectively. The above pairwise method is intended to maximize the difference between positive and negative items. However, previous studies have shown that optimizing this pairwise loss does not necessarily lead to the best ranking performance [Zhang et al.2013]. To overcome this issue, we adopt a non-uniform sampling strategy: in each epoch, for each user we randomly sample a number of items from the negative samples, calculate their ranking scores, and then treat the item with the highest score as the negative sample. The intuition behind this algorithm is that we should rank all positive samples higher than negative samples.
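The non-uniform (hard) negative sampling step can be sketched as follows. The function name, candidate count and toy score vector are hypothetical; the essential logic is picking the highest-scored item among randomly drawn unobserved items.

```python
import numpy as np

def sample_hard_negative(scores, positives, n_candidates, rng):
    """Adaptive negative sampling sketch: draw n_candidates unobserved
    items at random and return the one the current model scores highest.
    That 'hardest' negative violates the desired ranking the most, so it
    yields the most informative pairwise update."""
    negatives = np.setdiff1d(np.arange(len(scores)), positives)
    candidates = rng.choice(negatives,
                            size=min(n_candidates, len(negatives)),
                            replace=False)
    return candidates[np.argmax(scores[candidates])]

rng = np.random.default_rng(7)
scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7])   # current model scores
positives = np.array([0])                       # observed items for this user
j = sample_hard_negative(scores, positives, n_candidates=4, rng=rng)
```

The sampled negative `j` is then paired with an observed item in the pairwise loss above.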
4 Experiments

In this section, we conduct experiments on four real-world datasets and analyze the impact of hyper-parameters.
4.1 Experimental Setup
4.1.1 Datasets Description
We conduct experiments on four real-world datasets: Movielens HetRec, Movielens 1M, FilmTrust and Frappe. The two Movielens datasets (https://grouplens.org/datasets/movielens/) were collected by GroupLens research [Harper and Konstan2015]. Movielens HetRec was released at HetRec 2011 (http://recsys.acm.org/2011) and consists of interactions between users and movies. These datasets are widely used as benchmarks for evaluating the performance of recommender algorithms. FilmTrust was crawled from a movie sharing and rating website by Guo et al. [Guo et al.2013]. Frappe [Baltrunas et al.2015] is an Android application recommendation dataset which contains around a hundred thousand records from users on over four thousand mobile applications. The interactions of all four datasets are binarized with the approach introduced in Section 2.1.
4.1.2 Evaluation Metrics
To appropriately evaluate the overall performance on the ranking task, the evaluation metrics include Precision and Recall at different cut-off values (e.g., P@5, P@10, R@5 and R@10), Mean Average Precision (MAP), Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG). These metrics evaluate the quality of recommendation lists from different aspects [Liu and others2009, Shani and Gunawardana2011]: Precision, Recall and MAP assess recommendation accuracy, as they consider only hit counts and ignore rank positions; MRR and NDCG are rank-aware measures in which higher-ranked positive items are prioritized, so they are more suitable for assessing the quality of ranked lists. We omit the details for brevity.
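For readers unfamiliar with the rank-aware metrics, a minimal reference implementation of MRR and binary-relevance NDCG@k is sketched below (the item ids and relevance set are hypothetical; the formulas are the standard ones).

```python
import numpy as np

def mrr(ranked_items, relevant):
    """Reciprocal rank of the first relevant item in a ranked list."""
    for pos, item in enumerate(ranked_items, start=1):
        if item in relevant:
            return 1.0 / pos
    return 0.0

def ndcg_at_k(ranked_items, relevant, k):
    """Binary-relevance NDCG@k: discounted cumulative gain over the top-k
    list, normalized by the ideal DCG (all relevant items ranked first)."""
    gains = [1.0 if item in relevant else 0.0 for item in ranked_items[:k]]
    dcg = sum(g / np.log2(pos + 1) for pos, g in enumerate(gains, start=1))
    ideal = sum(1.0 / np.log2(pos + 1)
                for pos in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

ranked = [4, 2, 7, 1]     # hypothetical ranked list of item ids
relevant = {2, 1}         # items the user actually interacted with
```

Both metrics reward placing relevant items near the top, which is exactly what distinguishes them from Precision and Recall.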
4.2 Implementation Details
We implemented our proposed model with TensorFlow (https://www.tensorflow.org/) and tested it on an NVIDIA TITAN X Pascal GPU. All models are learned with mini-batch Adam. We perform grid search to determine the hyper-parameters. For all datasets, the neural part of NeuRec is a network with five constant-width hidden layers, with sigmoid as the activation function. The layer width, latent factor dimension $k$, dropout rate, learning rate and regularization rate are tuned separately for each dataset (ML-HetRec, ML-1M, FilmTrust and Frappe); dropout is not used for FilmTrust. For simplicity, we adopt the same parameter settings for the pairwise training method. We use 80% of the user-item pairs as training data and hold out 20% as the test set, and estimate performance over five random train-test splits.
4.3 Results and Discussions
Since NeuRec is designed to overcome the drawbacks of LFM and SLIM, these two models are strong baselines for demonstrating whether our methods overcome their disadvantages. Specifically, we choose BPRMF [Rendle et al.2009], a personalized ranking algorithm based on matrix factorization, as the representative latent factor model. Similar to [Ning and Karypis2011], we adopt a neighbourhood approach to accelerate the training of SLIM. For a fair comparison, we also report the results of mostPOP and two neural network based models, GMF and NeuMF [He et al.2017], following the configuration proposed in [He et al.2017]. The recent work DMF [Xue et al.2017] is tailored for explicit-feedback datasets and is not suitable for recommendation from implicit feedback, so comparing our method with it would be unfair.
4.3.1 Parameter Size
The parameter size of SLIM is $N^2$, while I-NeuRec and U-NeuRec require only the network weights plus $M \times k$ user latent factors and $N \times k$ item latent factors, respectively. Usually, our model reduces the number of parameters substantially (by up to a factor of 10).
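A back-of-the-envelope comparison makes the gap concrete. The sizes below ($M$, $N$, $k$, layer widths) are hypothetical, chosen only to illustrate the scaling; they are not the experimental settings of the paper.

```python
# Assumed (hypothetical) sizes: M users, N items, latent dimension k,
# and a small fully connected network with five constant-width layers.
M, N, k = 2000, 10000, 50
hidden = [300, 300, 300, 300, 300]

def network_params(input_dim, hidden_dims, output_dim):
    """Total weights + biases of a fully connected network."""
    dims = [input_dim] + hidden_dims + [output_dim]
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))

slim_params = N * N                                # dense N x N matrix
u_neurec = network_params(N, hidden, k) + N * k    # network + item factors
i_neurec = network_params(M, hidden, k) + M * k    # network + user factors
```

Under these sizes SLIM needs $10^8$ parameters while U-NeuRec needs a few million, because the dominant cost is a single $N \times$ width input layer rather than an $N \times N$ matrix.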
4.3.2 Overall Comparisons
Table 1 and Figure 2 summarize the overall performance of the baselines and NeuRec. From the comparison, we observe that our methods consistently achieve the best performance on these four datasets, not only in terms of prediction accuracy but also ranking quality. Higher MRR and NDCG mean that our models can effectively rank the items users prefer in top positions. NeuRec achieves performance gains over the best baseline on all four datasets (Movielens HetRec, Movielens 1M, FilmTrust and Frappe). The results of I-NeuRec and U-NeuRec are very close and better than the competing baselines. The subtle difference between U-NeuRec and I-NeuRec might be due to distributional differences between user and item historical interactions (or the numbers of users and items). We found that the improvement of NeuMF over GMF is not significant, which might be due to overfitting caused by the use of dual embedding spaces [Tay et al.2018a]. Although the improvements of pairwise-based U-NeuRec and I-NeuRec are subtle (Tables 2 and 3), they are still worth investigating. From the results, we observe that U-NeuRec is more suitable for pairwise training. In U-NeuRec, a positive item and a negative item are represented by two independent latent vectors, while in I-NeuRec they must share the same network with inputs $X_{*i}$ or $X_{*j}$; therefore, the negative and positive samples will undesirably influence each other.
4.4 Sensitivity to Neural Network Parameters
In the following, we systematically investigate the impact of the neural hyper-parameters on U-NeuRec using the FilmTrust dataset (I-NeuRec exhibits a similar pattern). In each comparison, we keep the other settings unchanged and vary only the corresponding parameter.
4.4.1 Latent Factor Size
As in latent factor models [Koren and Bell2015], the latent factor dimension $k$ has a great influence on ranking performance. A larger latent factor size will not necessarily increase performance and may even result in overfitting. In our case, setting $k$ to a moderate value is a reasonable choice.
4.4.2 Number of Neurons
We set the neuron count per layer to 50, 150, 250, 350 and 450 with a constant-width structure. As shown in Figure 3(b), both too simple and too complex a model decrease performance: a simple model suffers from under-fitting, while a complex model does not generalize well on test data.
4.4.3 Activation Function
We mainly investigate four activation functions: identity, sigmoid, tanh and relu, applied to all hidden layers. Empirical study shows that the identity function performs poorly with NeuRec, which also demonstrates the effectiveness of introducing non-linearity. Sigmoid outperforms the other three activation functions. One possible reason is that sigmoid restricts values to the range $(0, 1)$, so it is more suitable for binary implicit feedback.
4.4.4 Depth of Neural Network
Another key factor is the depth of the neural network. From Figure 3(d), we observe that our model achieves comparable performance with the number of hidden layers set between 3 and 7. However, when we continue to increase the depth, performance drops significantly. Thus, we avoid an over-complex model by setting the depth to an appropriately small number.
5 Conclusion and Future Work
In this paper, we proposed NeuRec and its two variants, which provide a better understanding of the complex and non-linear relationships between items and users. Experiments show that NeuRec outperforms the competing methods by a large margin while substantially reducing the number of parameters. In the future, we would like to investigate methods to balance the performance of I-NeuRec and U-NeuRec, and to incorporate item/user side information and context information to further enhance recommendation quality. In addition, more advanced regularization techniques such as batch normalization could also be explored.
- [Baltrunas et al.2015] Linas Baltrunas, Karen Church, et al. Frappe: Understanding the usage and perception of mobile app recommendations in-the-wild. arXiv preprint arXiv:1505.03014, 2015.
- [Cheng et al.2016] Heng-Tze Cheng, Levent Koc, et al. Wide & deep learning for recommender systems. In DLRS, pages 7–10. ACM, 2016.
- [Guo et al.2013] G. Guo, J. Zhang, and N. Yorke-Smith. A novel bayesian similarity measure for recommender systems. In IJCAI, pages 2619–2625, 2013.
- [Harper and Konstan2015] F. Maxwell Harper and Joseph A. Konstan. The movielens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5(4):19:1–19:19, December 2015.
- [He and Chua2017] Xiangnan He and Tat-Seng Chua. Neural factorization machines for sparse predictive analytics. In SIGIR, pages 355–364, NY, USA, 2017. ACM.
- [He and McAuley2016] Ruining He and Julian McAuley. Vbpr: Visual bayesian personalized ranking from implicit feedback. In AAAI, pages 144–150, 2016.
- [He et al.2017] Xiangnan He, Lizi Liao, et al. Neural collaborative filtering. In WWW, pages 173–182, 2017.
- [Hidasi et al.2015] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939, 2015.
- [Karatzoglou and Hidasi2017] Alexandros Karatzoglou and Balázs Hidasi. Deep learning for recommender systems. In RecSys, RecSys ’17, pages 396–397, New York, NY, USA, 2017. ACM.
- [Kim et al.2016] Donghyun Kim, Chanyoung Park, et al. Convolutional matrix factorization for document context-aware recommendation. In RecSys, pages 233–240. ACM, 2016.
- [Kingma and Ba2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [Koren and Bell2015] Yehuda Koren and Robert Bell. Advances in collaborative filtering. In Recommender systems handbook, pages 77–118. Springer, 2015.
- [Koren et al.2009] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, August 2009.
- [Koren2008] Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In SIGKDD, pages 426–434. ACM, 2008.
- [Linden et al.2003] G. Linden, B. Smith, and J. York. Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, Jan 2003.
- [Liu and others2009] Tie-Yan Liu et al. Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval, 3(3):225–331, 2009.
- [Ning and Karypis2011] X. Ning and G. Karypis. Slim: Sparse linear methods for top-n recommender systems. In ICDM, pages 497–506, Dec 2011.
- [Rendle et al.2009] Steffen Rendle, Christoph Freudenthaler, et al. Bpr: Bayesian personalized ranking from implicit feedback. In UAI, pages 452–461, 2009.
- [Ricci et al.2015] Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B Kantor. Recommender systems handbook. Springer, 2015.
- [Salakhutdinov and Mnih2007] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In NIPS, pages 1257–1264, USA, 2007. Curran Associates Inc.
- [Sedhain et al.2015] Suvash Sedhain, Aditya Krishna Menon, et al. Autorec: Autoencoders meet collaborative filtering. In WWW, pages 111–112. ACM, 2015.
- [Shani and Gunawardana2011] Guy Shani and Asela Gunawardana. Evaluating recommendation systems. Recommender systems handbook, pages 257–297, 2011.
- [Srivastava et al.2014] Nitish Srivastava, Geoffrey Hinton, et al. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958, 2014.
- [Tay et al.2018a] Yi Tay, Luu Anh Tuan, et al. Latent relational metric learning via memory-based attention for collaborative ranking. In WWW, pages 729–739, 2018.
- [Tay et al.2018b] Yi Tay, Luu Anh Tuan, et al. Multi-pointer co-attention networks for recommendation. CoRR, abs/1801.09251, 2018.
- [Van den Oord et al.2013] Aaron Van den Oord, Sander Dieleman, and Benjamin Schrauwen. Deep content-based music recommendation. In NIPS, pages 2643–2651, 2013.
- [Wang et al.2015] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. Collaborative deep learning for recommender systems. In SIGKDD, pages 1235–1244. ACM, 2015.
- [Wu et al.2017] Chao-Yuan Wu, Amr Ahmed, et al. Recurrent recommender networks. In WSDM, pages 495–503, NY, USA, 2017. ACM.
- [Xue et al.2017] HongJian Xue, Xinyu Dai, et al. Deep matrix factorization models for recommender systems. In IJCAI, pages 3203–3209, 2017.
- [Zhang et al.2013] Weinan Zhang, Tianqi Chen, Jun Wang, and Yong Yu. Optimizing top-n collaborative filtering via dynamic negative item sampling. In SIGIR, pages 785–788, New York, NY, USA, 2013. ACM.
- [Zhang et al.2017a] Shuai Zhang, Lina Yao, and Aixin Sun. Deep learning based recommender system: A survey and new perspectives. arXiv preprint arXiv:1707.07435, 2017.
- [Zhang et al.2017b] Shuai Zhang, Lina Yao, and Xiwei Xu. Autosvd++: An efficient hybrid collaborative filtering model via contractive auto-encoders. In SIGIR, pages 957–960, New York, NY, USA, 2017. ACM.
- [Zheng et al.2017] Lei Zheng, Vahid Noroozi, et al. Joint deep modeling of users and items using reviews for recommendation. In WSDM, pages 425–434. ACM, 2017.