Neural Collaborative Ranking

08/15/2018 ∙ by Bo Song, et al. ∙ Zhejiang University

Recommender systems aim to generate a personalized ranked list of items that an end user might be interested in. With the unprecedented success of deep learning in computer vision and speech recognition, bridging the gap between recommender systems and deep neural networks has recently become a hot topic, and deep learning methods have been shown to achieve state-of-the-art performance on many recommendation tasks. For example, a recent model, NeuMF, first projects users and items into a shared low-dimensional latent feature space, and then employs neural networks to model the interaction between the user and item latent features, obtaining state-of-the-art performance on recommendation tasks. NeuMF assumes that non-interacted items are inherently negative and uses negative sampling to relax this assumption. In this paper, we examine an alternative approach that does not assume non-interacted items are necessarily negative, only that they are less preferred than interacted items. Specifically, we develop a new classification strategy based on the widely used pairwise ranking assumption. We combine our classification strategy with the recently proposed neural collaborative filtering framework, and propose a general collaborative ranking framework called Neural Network based Collaborative Ranking (NCR). We resort to a neural network architecture to model a user's pairwise preference between items, with the belief that a neural network can effectively capture the nonlinear structure of the latent factors. Experimental results on two real-world datasets show the superior performance of our models in comparison with several state-of-the-art approaches.

1. Introduction

With the explosive growth of social media and e-commerce, we are living in an era of information explosion. Personalized recommendation was developed to alleviate the dilemma of information overload, and has become a core component of many popular e-commerce and social media services. Collaborative filtering (CF) (Hu et al., 2008; Salakhutdinov and Mnih, 2007; Pan et al., [n. d.]) is the most popular approach to personalized recommendation and has been extensively studied in the past years. The two broad categories of CF are neighborhood-based approaches and model-based approaches. Neighborhood-based approaches, such as itemkNN (Sarwar et al., 2001), first employ a similarity metric to identify a set of similar items, and then generate the top-$N$ recommended items based on those similar items. They can give explainable recommendations, but the relevance of their recommendations is lower in comparison with model-based methods. Model-based methods (Hu et al., 2008; Cremonesi et al., 2010; Salakhutdinov and Mnih, 2007)

, especially latent factor models (LFM), map both users and items into a joint low-dimensional latent space. The prediction for a user’s preference on an item is estimated by the inner product of the corresponding user and item latent vectors.

Implicit feedback refers to scenarios where there are examples of items users prefer, but no examples of items they dislike, e.g., retweeting history on Twitter or purchase history in e-commerce. LFM have achieved state-of-the-art performance on many recommendation tasks; however, traditional LFM methods suffer from the implicitness of implicit feedback. To address this issue, several variants of LFM have been proposed. For example, Hu et al. (Hu et al., 2008) proposed a Weighted Regularized Matrix Factorization (WRMF) method that weights observed and unobserved ratings differently and solves a regularized least-squares problem. Rendle et al. (Rendle et al., 2009) proposed Bayesian Personalized Ranking (BPR), which formulates top-$N$ recommendation as a ranking problem and optimizes the Bayesian pairwise ranking criterion, i.e., the maximum a posteriori (MAP) estimation of users' pairwise preferences between observed and unobserved items.

In the literature (He et al., 2017; Hsieh et al., 2017), it has been pointed out that the performance of LFM is hindered by using the inner product as the user-item interaction function. For example, the inner product does not satisfy the triangle inequality, so it is hard for the latent vectors of LFM to reliably capture item-item or user-user similarity. To address this problem, Hsieh et al. (Hsieh et al., 2017) proposed to use metric learning (Kulis et al., 2013) to simultaneously capture users' preferences and the user-user and item-item similarities. He et al. (He et al., 2017), in turn, argued that the inner product only captures linear interactions, and proposed a general framework named NCF that employs neural networks to learn an interaction function from data that effectively captures the nonlinear interaction between user and item factors. NCF replaces the inner product with the learned interaction function and gives promising results.

Nevertheless, we argue that the learning strategy of NCF might hinder its performance. NCF labels observed interactions as positive instances, and labels all unobserved interactions, or some sampled unobserved interactions, as negative instances. If all unobserved interactions are treated as negative instances, NCF suffers from two limitations: first, the negative class dominates the training data, which can degrade the predictive accuracy for the infrequent positive class (known as the class imbalance problem); second, treating a non-interacted item as negative feedback does not conform to the fact that a non-interacted item may simply mean the user is unaware of it. Sampling some unobserved interactions as negative instances can alleviate the first problem, but it still risks introducing false negative examples.

To tackle the problems above, in this paper we develop a novel classification strategy for collaborative ranking. Based on the widely employed pairwise preference assumption that a user prefers observed items over all other unobserved items, we construct a positive preference set and a negative preference set from the rating data. The elements in the positive preference set are labeled as 1, and those in the negative preference set as 0, as described in Section 4.2. Our classification strategy has the following advantages: (1) the total number of positive examples is equal to the total number of negative examples, which avoids the class imbalance problem; (2) under our classification strategy, negative instances do not assume non-interacted items to be negative feedback, only that they are less preferred than interacted items. Finally, we combine the proposed classification strategy with NCF to present a neural collaborative ranking framework.

2. Related Work

2.1. One-Class Collaborative Filtering

When it comes to implicit feedback, we cannot simply treat non-interacted items as negative examples, because the reason why a user does not interact with an item is ambiguous: the user may dislike it, or the user may simply be unaware of it. Implicit feedback scenarios are also referred to as one-class collaborative filtering (OCCF) problems (Pan et al., [n. d.]). One crucial issue of OCCF is the lack of negative feedback. Matrix factorization (MF) is the most popular collaborative filtering technique; however, traditional MF approaches are incapable of handling the missing negative feedback in OCCF, because if the missing user-item interactions are treated as negative samples or simply ignored, the learner cannot generalize well.

To tackle the above problem, several approaches have been proposed. According to how the missing data is used, existing OCCF methods can be classified into two categories: sampling-based approaches (He et al., 2017; Pan et al., [n. d.]; Rendle et al., 2009) and whole-data-based approaches (Hu et al., 2008). The former sample negative examples from the unobserved user-item interactions, while the latter include all the unobserved user-item interactions as negative examples and use a conditional weight to demote the influence of these ambiguous examples. We can also categorize these approaches into point-wise methods and pairwise methods, according to how the relevance order is learned. Point-wise approaches generally regard user ratings as categorical labels or numerical values and try to learn the relevance scores of missing data directly, while pairwise approaches try to capture the preference order between items. Pairwise approaches generally improve the ranking performance over point-wise approaches (Balakrishnan and Chopra, 2012; Lee et al., 2014).

2.2. Deep Neural Networks

There are many existing works trying to bridge the gap between deep neural networks (DNNs) and the task of collaborative filtering. A pioneering work along this direction is (Salakhutdinov et al., 2007), which adopted a variant of the Restricted Boltzmann Machine, a two-layer undirected graphical model consisting of softmax visible units and binary hidden units, to perform the task of rating prediction. Hybrid collaborative filtering methods, which combine deep learning with MF, have received much attention recently. These works mainly focus on leveraging deep learning models such as autoencoders (Wang et al., 2015; Li et al., 2015) or CNNs to model side information (texts or images) in order to regularize latent user or item factors. Typical approaches include Collaborative Deep Learning (CDL) (Wang et al., 2015) and Convolutional Matrix Factorization (ConvMF) (Kim et al., 2016). The former employed SDAE (Vincent et al., 2010) to model texts, while the latter argued that a bag-of-words model like SDAE has an inherent drawback and proposed to use a CNN to learn more effective latent features. In spite of their promising results, these approaches generally try to integrate deep learning with conventional recommender systems; not much attention has been paid to applying deep learning to develop pure collaborative filtering approaches to OCCF.

Another line of work tries to use deep learning to make recommendations directly. For example, Cheng et al. (Cheng et al., 2016) proposed a context-aware recommendation method called Wide&Deep, which first embeds features into a latent space and then applies a multi-layer perceptron (MLP) to the concatenated latent vectors to learn the latent structure. The idea of using an MLP on concatenated latent vectors was later adopted by He et al. (He et al., 2017), who proposed a general framework for neural network based collaborative filtering (NCF). NCF takes advantage of the one-class nature of implicit feedback and casts OCCF as a binary classification problem. More recently, the attention mechanism has been introduced to the task of collaborative filtering. In (Chen et al., 2017), the proposed model ACF adopted item- and component-level attention to address the implicitness in users' interactions with multimedia content. ACF used two attention sub-networks to capture a user's preference degree at the item level and the component level: item-level attention was employed to score item preferences, while component-level attention was employed to capture interesting components in multimedia content. Again, it is worth highlighting that most of these works focus on recommendation scenarios with rich features, while not much attention has been paid to deep learning for pure collaborative filtering approaches to OCCF.

3. Preliminaries

Assume that we have a set of users $U$ and a set of items $I$, with $|U| = m$ and $|I| = n$ respectively. $R \in \{0, 1\}^{m \times n}$ is the user-item rating matrix, where $r_{ui}$ indicates whether user $u$ rated item $i$ or not. We denote by $I_u^+$ the set of the items rated by user $u$. Matrices $P \in \mathbb{R}^{k \times m}$ and $Q \in \mathbb{R}^{k \times n}$ are the latent representations of users and items respectively; $p_u$ denotes the $u$-th column of $P$, and $q_i$ likewise denotes the $i$-th column of $Q$. The goal in OCCF is to obtain, for each user, a predicted ranking over the unobserved items.

3.1. Latent Factor Models

The basic idea of latent factor models is to transform both users and items into a shared low-dimensional latent feature space. Matrix factorization is the most popular technique for deriving latent factor models. Formally, let $W \in \mathbb{R}^{m \times n}$ denote a weighting matrix; the objective of MF is to minimize the following regularized squared loss:

$\min_{P, Q} \sum_{u=1}^{m} \sum_{i=1}^{n} w_{ui} \, (r_{ui} - p_u^T q_i)^2 + \lambda \, (\|P\|_F^2 + \|Q\|_F^2) \qquad (1)$

where $\lambda$ is the regularization hyperparameter. One classical MF approach is Singular Value Decomposition (SVD). In SVD, $w_{ui}$ is conditioned on whether a user has interacted with an item or not, e.g., $w_{ui} = \mathbb{I}(r_{ui} = 1)$, where $\mathbb{I}(\cdot)$ denotes the indicator function that returns 1 if the statement is true and 0 otherwise. This weighting scheme is inappropriate in implicit scenarios, as it leads to trivial but useless solutions (e.g., all the missing entries of $R$ being predicted as 1). An alternative is to use a weighting scheme that gives larger weight to observed ratings and a small but non-zero weight to unobserved ratings, which leads to the WRMF method (Hu et al., 2008).

3.2. Bayesian Personalized Ranking (BPR)

An alternative strategy to address the implicitness in OCCF is BPR. Instead of predicting relevance scores directly, BPR models a user's preference over two items, where one of the items is observed and the other is not. BPR is a well-known pairwise ranking optimization framework, which assumes that the known positive preferences over observed items rank higher than the unknown preferences over unobserved items. Let $D_S$ denote the set of triplets $(u, i, j)$, where $u$ is a user, $i$ is an observed item, and $j$ has not been observed yet:

$D_S = \{(u, i, j) \mid i \in I_u^+ \wedge j \in I \setminus I_u^+\} \qquad (2)$

BPR optimizes a loss over (user, item, item) triplets; the following optimization criterion is used for personalized ranking (BPR-OPT):

$\text{BPR-OPT} = \sum_{(u, i, j) \in D_S} \ln \sigma(\hat{x}_{uij}) - \lambda_\Theta \|\Theta\|^2 \qquad (3)$

where $\sigma(\cdot)$ is the sigmoid function, $\Theta$ is the parameter vector, and $\lambda_\Theta$ is the regularization hyperparameter. In particular, BPR-MF is obtained when $\hat{x}_{uij}$ is predicted by matrix factorization:

$\hat{x}_{uij} = p_u^T q_i - p_u^T q_j \qquad (4)$
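To make the criterion concrete, here is a minimal NumPy sketch of BPR-OPT with the BPR-MF predictor; the factor layout, the `lam` constant, and the triplet format are illustrative assumptions, not the original implementation.

```python
import numpy as np

def bpr_opt(P, Q, triplets, lam=0.01):
    """BPR-OPT (Eq. 3) with the BPR-MF predictor (Eq. 4), to be maximized.

    P: (k, m) user factors; Q: (k, n) item factors;
    triplets: iterable of (u, i, j) with i observed and j unobserved.
    """
    total = 0.0
    for u, i, j in triplets:
        x_uij = P[:, u] @ Q[:, i] - P[:, u] @ Q[:, j]   # Eq. (4)
        total += np.log(1.0 / (1.0 + np.exp(-x_uij)))   # ln sigma(x_uij)
    return total - lam * (np.linalg.norm(P) ** 2 + np.linalg.norm(Q) ** 2)
```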

4. Proposed Method

In this section, we will introduce our Neural Collaborative Ranking (NCR) model in detail. We first describe our neural network based pairwise ranking model, elaborating how to learn NCR with a probabilistic model that emphasizes user preference over a pair of observed and unobserved items. We then show the relations between our model and BPR-MF, and develop a shallow model using linear interactions between latent vectors. Next, a deep instantiation of NCR using multi-layer perceptron to model latent features is proposed to investigate deep neural networks for collaborative ranking. MLP endows our model with a high level of nonlinearities. Finally, a new pairwise ranking model unifying the strengths of linear and nonlinear interactions for modeling latent features is presented.

4.1. General Framework

We now elaborate NCR, our proposed general framework for collaborative ranking based on neural networks. In order to obtain a full neural treatment of collaborative ranking, following Wide&Deep, we adopt feed-forward neural networks to model a (user, item, item) triplet interaction $(u, i, j)$, as shown in Figure 1. Our model consists of three parts: the bottom embedding layer, the middle hidden layers, and the output prediction layer. Hereinafter, we elaborate the neural network architecture layer by layer.

Figure 1. Network Architecture for Neural Collaborative Ranking

Embedding Layer. The goal of the embedding layer is to transform both users and items into a shared low-dimensional latent feature space. After embedding, we acquire a dense vector representation for each user and item. This shares the same spirit as the LFM mentioned before. Formally, let $(u, i, j)$ be an input triplet; we use embedding table lookups to obtain the three embedding vectors $p_u$, $q_i$ and $q_j$, respectively. The embedding layer can be easily extended to cover a wide range of auxiliary information, such as topic information (Wang and Blei, 2011) and multimedia content (Chen et al., 2017). Since in this work we focus on the pure collaborative ranking setting, we do not take any side information into account.

Hidden Layers. The hidden layers are a stack of fully connected layers built above the embedding layer. The dense vectors obtained from the embedding layer are concatenated, resulting in a dense vector that jointly encodes user preference and item attributes. The concatenated vector is then fed into the hidden layers. Hidden layers are the key to endowing our model with the capacity to learn highly nonlinear interactions between latent features. In particular, the size of the last hidden layer determines the model's capability, so we term it the number of predictive factors.

Let $L$ be the number of hidden layers. The concatenated vector $z_0 = [p_u; q_i; q_j]$ is propagated forward layer by layer, so we can formulate the interaction function as follows:

$z_l = \phi_l(z_{l-1}), \quad l = 1, \dots, L \qquad (5)$

where $\phi_l$ denotes the mapping function for the $l$-th hidden layer.

Prediction Layer. The prediction layer maps the last hidden layer's output $z_L$ to the prediction score $\hat{x}_{uij}$, which expresses the extent to which user $u$ prefers item $i$ over item $j$. The prediction score given by NCR can be formulated as follows:

$\hat{x}_{uij} = \phi_{out}(z_L) \qquad (6)$

where $\phi_{out}$ denotes the mapping function for the output layer; in our case, it is the sigmoid function.

In this paper we choose to use a set of unified hidden layers to model the latent structure of a (user, observed item, unobserved item) triplet. Another choice is to employ two sets of hidden layers to model the (user, observed item) and (user, unobserved item) pairs, respectively. The prediction score is then formulated as follows:

$\hat{y}_{ui} = \phi_{out}\big(\phi_L(\cdots \phi_1([p_u; q_i]))\big) \qquad (7)$
$\hat{y}_{uj} = \phi'_{out}\big(\phi'_L(\cdots \phi'_1([p_u; q_j]))\big) \qquad (8)$
$\hat{x}_{uij} = \sigma(\hat{y}_{ui} - \hat{y}_{uj}) \qquad (9)$

We consider the former more feasible; the intuition is that it also takes the nonlinear interactions between the two items into account. We leave the latter as future work.
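As a concrete illustration of the unified-tower variant (Eqs. 5 and 6), below is a minimal Keras sketch of the NCR forward pass. The user/item counts (taken from ML1m in Table 2 as an example), the layer widths, and the tanh activations are assumptions for illustration, not the exact training script.

```python
from keras.layers import Input, Embedding, Flatten, Concatenate, Dense
from keras.models import Model

num_users, num_items, emb_dim = 6040, 3260, 32   # ML1m-sized example

user = Input(shape=(1,), dtype='int32')
item_i = Input(shape=(1,), dtype='int32')        # observed item
item_j = Input(shape=(1,), dtype='int32')        # unobserved item

user_emb = Embedding(num_users, emb_dim)
item_emb = Embedding(num_items, emb_dim)         # items share one embedding table

p_u = Flatten()(user_emb(user))
q_i = Flatten()(item_emb(item_i))
q_j = Flatten()(item_emb(item_j))

# Unified hidden layers over the concatenated (p_u, q_i, q_j) vector, Eq. (5)
z = Concatenate()([p_u, q_i, q_j])
for width in [32, 16, 8]:                        # tower pattern; last width = predictive factors
    z = Dense(width, activation='tanh')(z)

# Prediction layer, Eq. (6): the sigmoid keeps the score in [0, 1]
x_uij = Dense(1, activation='sigmoid')(z)

model = Model([user, item_i, item_j], x_uij)
model.compile(optimizer='adam', loss='binary_crossentropy')
```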

4.2. Model Learning

To learn the parameters of our models, it is straightforward to adopt the widely used logistic ranking loss as the loss function:

$L = -\sum_{(u, i, j) \in D_S} \ln \sigma(\hat{x}_{uij}) \qquad (10)$

However, the logistic loss may suffer from vanishing gradients for correctly ranked pairs (Rendle and Freudenthaler, 2014). Besides, prior work (Burges et al., 2005; He et al., 2017) shows that the binary cross-entropy loss is a good choice for neural network based ranking, so we adopt the binary cross-entropy loss for our model. Hereinafter, we demonstrate that under our classification strategy, it is natural to formulate the binary cross-entropy loss for learning with NCR, and we elaborate how to construct the positive and negative instances for training.

Classification Strategy. The use of the log ranking loss is based on the assumption that a user prefers observed items to unobserved items:

$D^+ = \{(u, i, j) \mid i \in I_u^+ \wedge j \in I \setminus I_u^+\} \qquad (11)$

where each $(u, i, j) \in D^+$ encodes $i >_u j$, and $>_u$ is the personalized total ranking (Rendle et al., 2009). We call $D^+$ the positive preference set. Similarly, we can construct a negative preference set by swapping the item order:

$D^- = \{(u, j, i) \mid i \in I_u^+ \wedge j \in I \setminus I_u^+\} \qquad (12)$

For all triplets in the set $D = D^+ \cup D^-$, we define the following indicator function:

$y_{uij} = \mathbb{I}\big((u, i, j) \in D^+\big) \qquad (13)$

Then we view the value of $y_{uij}$ as a label 1 for triplets in $D^+$; for all triplets in $D^-$, we view the value of $y_{uij}$ as a label 0. It is obvious that the size of $D^+$ is equal to the size of $D^-$; thus, we successfully avoid the class imbalance problem of NCF. We constrain the prediction score $\hat{x}_{uij}$ to the range $[0, 1]$ and interpret it as how likely the triplet $(u, i, j)$ belongs to $D^+$. With the above settings, the likelihood function is defined as follows:

$p(D \mid \Theta) = \prod_{(u, i, j) \in D} \hat{x}_{uij}^{\,y_{uij}} \, (1 - \hat{x}_{uij})^{1 - y_{uij}} \qquad (14)$

Taking the negative logarithm of the likelihood function, we endow our NCR with the binary cross-entropy loss:

$L = -\sum_{(u, i, j) \in D} y_{uij} \log \hat{x}_{uij} + (1 - y_{uij}) \log (1 - \hat{x}_{uij}) \qquad (15)$

Discussion. The classification strategy above is based on the widely used pairwise preference assumption. By labeling a triplet $(u, i, j) \in D^+$ as a positive instance, our model learns to rank item $i$ (known preference) higher than item $j$ (unknown preference). Likewise, by labeling a triplet $(u, j, i) \in D^-$ as a negative instance, our model learns to rank item $j$ (unknown preference) lower than item $i$ (known preference). As a result, both positive and negative instances contribute to the pairwise ranking process. Since we employ the binary cross-entropy loss, the negative instances are necessary, which differs from the log ranking loss.

By utilizing a probabilistic treatment for NCR, we address pairwise ranking based recommendation as a binary classification problem. At the training stage, we uniformly sample positive and negative instances from $D^+$ and $D^-$ respectively. In practice, we iteratively update the parameters until the loss does not decrease (by 0.1%) or the maximum iteration limit is reached. In a later section, we also conduct experiments to study the influence of the number of negative samples on the results.
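The following sketch shows one way to materialize labeled triplets from $D^+$ and $D^-$; the data structures and the `num_neg` sampling knob are illustrative assumptions.

```python
import random

def sample_training_instances(user_items, all_items, num_neg=1):
    """Build labeled triplets under the NCR classification strategy.

    user_items: dict mapping user -> set of observed items (I_u^+);
    all_items: list of all item ids.
    For each observed (u, i) we draw unobserved items j and emit the
    positive triplet ((u, i, j), 1) from D+ and the swapped negative
    triplet ((u, j, i), 0) from D-, so both sets stay the same size.
    """
    instances = []
    for u, items in user_items.items():
        for i in items:
            for _ in range(num_neg):
                j = random.choice(all_items)
                while j in items:                 # ensure j is unobserved
                    j = random.choice(all_items)
                instances.append(((u, i, j), 1))  # positive preference
                instances.append(((u, j, i), 0))  # negative preference
    random.shuffle(instances)
    return instances
```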

4.3. Relations to Other Methods

4.3.1. Relations to Bayesian Personalized Ranking

BPR-MF can be seen as a special case of NCR without hidden layers. In what follows, we show concretely that if we choose a specific interaction function, output function, and edge weight, our NCR model degenerates into BPR-MF. As BPR-MF is the most popular method for pairwise ranking based recommendation, the fact that BPR-MF can be explained as a special case of NCR suggests that NCR can easily accommodate a wide range of pairwise ranking approaches.

To recover BPR-MF, we set the interaction function of the latent vectors as

$\phi(p_u, q_i, q_j) = p_u \odot q_i - p_u \odot q_j \qquad (16)$

where $\odot$ denotes the element-wise product of vectors. This vector is then projected to the output layer:

$\hat{x}_{uij} = f\big(w^T \phi(p_u, q_i, q_j)\big) \qquad (17)$

where $f$ and $w$ are the activation function and the edge weight of the output layer, respectively. Then we define $f$ as

$f(x) = \sigma(x) = \frac{1}{1 + e^{-x}} \qquad (18)$

and let $w$ be a vector with all elements equal to 1. In this way, we can obtain the BPR-MF model.

In this work, we use the sigmoid function as $f$, since we want to constrain the output to the range $[0, 1]$. For the edge weight $w$, instead of constraining it to be a vector of ones, we learn it from data, which allows $w$ to vary the importance of the latent dimensions. We term the degenerated model NBPR, short for neural Bayesian personalized ranking.
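A minimal Keras sketch of NBPR follows, assuming hypothetical user/item counts and an embedding size of 8; the bias-free `Dense` kernel plays the role of the learned edge weight $w$ in Eq. (17).

```python
from keras.layers import Input, Embedding, Flatten, Multiply, Subtract, Dense
from keras.models import Model

num_users, num_items, emb_dim = 6040, 3260, 8    # hypothetical sizes

user = Input(shape=(1,), dtype='int32')
item_i = Input(shape=(1,), dtype='int32')
item_j = Input(shape=(1,), dtype='int32')

p_u = Flatten()(Embedding(num_users, emb_dim)(user))
item_emb = Embedding(num_items, emb_dim)
q_i = Flatten()(item_emb(item_i))
q_j = Flatten()(item_emb(item_j))

# Eq. (16): element-wise product interactions, then their difference
phi = Subtract()([Multiply()([p_u, q_i]), Multiply()([p_u, q_j])])

# Eq. (17): learned edge weight w (instead of all ones), sigmoid output
x_uij = Dense(1, activation='sigmoid', use_bias=False)(phi)

model = Model([user, item_i, item_j], x_uij)
model.compile(optimizer='adam', loss='binary_crossentropy')
```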

4.3.2. Relations to RankNet

RankNet (Burges et al., 2005) is a well-known pairwise ranking method for information retrieval tasks. Our proposed model NCR shares some similarities with RankNet, e.g., both models are based on neural networks and adopt the binary cross-entropy loss. Nevertheless, RankNet was originally proposed for information retrieval tasks with dense features, so it might not be directly applicable in the OCCF setting with no context information. We also highlight the following difference: RankNet is a "point-wise" model endowed with a pairwise ranking policy, while NCR itself is a "pairwise" model using a pairwise ranking policy. By "point-wise", we mean that both the input and output of RankNet are point-wise, i.e., it takes as input one training sample at a time, and the output is also a predicted score for a single sample. Pairwise ranking in RankNet is conducted by minimizing the loss function of two consecutive training samples. NCR, in contrast, is inherently "pairwise": it takes as input a pair of items at a time, and the output score is a predicted preference over the two items.

4.4. Predictive Rule

As for how to make recommendations, we cannot sort the output scores directly to obtain the top-$N$ ranked items, because an output score is associated with a pair of items rather than a single item. To address this, we provide a heuristic approach. Before making recommendations, let us discuss the consistency requirement in pairwise ranking.

Ideally, given a user $u$ and three items $i$, $j$ and $k$, if our model asserts $i >_u j$ and $j >_u k$, we also want it to assert $i >_u k$; otherwise it would be hard to rank the three items correctly. Note that the consistency requirement in our case is different from RankNet, as we cannot calculate the combined probabilities for $i >_u j$ and $j >_u k$. In what follows, we show that under certain conditions, our model indeed meets the consistency requirement. Recall that in Section 4.3.1 we presented the NCR model NBPR. We rewrite the edge weight $w$ in NBPR as $\bar{w}$, where the dimensionality of $\bar{w}$ is equal to the embedding size $k$. Then, we have the following predicted scores:

$\hat{x}_{uij} = f\big(\bar{w}^T (p_u \odot q_i) - \bar{w}^T (p_u \odot q_j)\big) = f(s_{ui} - s_{uj}) \qquad (19)$

where $f$ is given by Equation 18 and is a monotonically increasing function, and $s_{ui} = \bar{w}^T (p_u \odot q_i)$. We have similar results for $\hat{x}_{ujk}$ and $\hat{x}_{uik}$. The first problem is how to rank two items $i$ and $j$. Intuitively, $\hat{x}_{uij}$ indicates the probability that $u$ likes $i$ more than $j$, while $\hat{x}_{uji}$ indicates the probability that $u$ likes $j$ more than $i$. So we can infer that if $\hat{x}_{uij} > \hat{x}_{uji}$, then the user likes $i$ more than $j$. As a result, we give the following predictive rule: if $\hat{x}_{uij} > \hat{x}_{uji}$, then $i >_u j$; otherwise $j >_u i$. According to the predictive rule above, if $i >_u j$ and $j >_u k$, then since $f$ is monotonically increasing we have

$s_{ui} - s_{uj} > s_{uj} - s_{ui} \qquad (20)$
$s_{uj} - s_{uk} > s_{uk} - s_{uj} \qquad (21)$

Adding the above two inequalities and eliminating the duplicate terms, we have

$s_{ui} - s_{uk} > s_{uk} - s_{ui} \qquad (22)$

In other words, we have $f(s_{ui} - s_{uk}) > f(s_{uk} - s_{ui})$, i.e., $\hat{x}_{uik} > \hat{x}_{uki}$, and thus $i >_u k$. In consequence, the consistency requirement is met. For an NCR model with hidden layers, if we use two sets of hidden layers to model the user's interactions with the observed item and the unobserved item respectively, the above conclusion still holds. However, if we use a unified set of hidden layers to model the interactions, it is unclear whether the predictive rule above meets the consistency requirement, so we call it a "heuristic" approach; our experimental results show that it works well.

Based on the analysis above, we propose a simple algorithm to find the top-$N$ ranked items, as shown in Algorithm 1. To rank two items $i$ and $j$, we compare the predicted scores $\hat{x}_{uij}$ and $\hat{x}_{uji}$. Algorithm 1 scans the candidate item set $N$ times, each pass choosing the most preferred remaining item.

Input : a user $u$; a set of unobserved items $C$; number of recommendations $N$; ranked list $S$ (initially empty)
Output : top-$N$ items $S$
for $t = 1$ to $N$ do
      $C' = C - S$
      $max$ = first item of $C'$
      for each remaining item $j$ in $C'$ do
            if $\hat{x}_{u, j, max} > \hat{x}_{u, max, j}$ then
                  $max = j$
            end if
      end for
      Append $max$ to the end of $S$
end for
Algorithm 1 NCR Recommendation
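For concreteness, here is a small Python sketch of Algorithm 1 built on a trained NCR model; the `model.predict` call assumes a Keras model that takes (user, item, item) arrays, and the candidate handling is an illustrative assumption rather than the paper's exact code.

```python
import numpy as np

def recommend_top_n(model, u, candidates, n=10):
    """Heuristic top-N selection following Algorithm 1.

    model: a trained NCR model scoring (user, item_i, item_j) triplets;
    candidates: items user u has not interacted with.
    Each pass keeps the item preferred over the current best under the
    predictive rule of Section 4.4 (compare x_ujb against x_ubj).
    """
    remaining = list(candidates)
    ranked = []
    for _ in range(n):
        best = remaining[0]
        for j in remaining[1:]:
            x_jb = model.predict([np.array([u]), np.array([j]), np.array([best])],
                                 verbose=0)[0, 0]
            x_bj = model.predict([np.array([u]), np.array([best]), np.array([j])],
                                 verbose=0)[0, 0]
            if x_jb > x_bj:
                best = j
        ranked.append(best)
        remaining.remove(best)
    return ranked
```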

4.5. Deep Neural Collaborative Ranking

In order to make full use of DNNs' capacity, in this section we investigate how to go deep with NCR. In the embedding layer of NCR, there are separate embeddings for users and items. For every triplet in the training set, we have three latent vectors. Intuitively, we can concatenate these latent vectors together; this is a widely used technique in many existing deep learning works (He et al., 2017; Srivastava and Salakhutdinov, 2012). However, simply concatenating latent features is insufficient to capture the user-item latent structures, because it does not take any interactions between latent dimensions into consideration. To address this problem, following Wide&Deep, we add hidden layers on top of the concatenated vector. More precisely, we employ a standard MLP to capture the user-item latent structures. The MLP's multi-layer nature enables it to learn user-item latent structures at a variety of levels, especially the nonlinear interactions between latent features. In comparison with NBPR, which uses a fixed element-wise product of the user and item latent vectors, the MLP is more flexible in dealing with the concatenated vector. Proceeding along this track, we define a deep neural collaborative ranking (DNCR) model under the NCR framework as

$z_0 = [\,p_u;\, q_i;\, q_j\,], \quad z_l = a\big(W_l^T z_{l-1} + b_l\big),\ l = 1, \dots, L, \quad \hat{x}_{uij} = \sigma\big(w^T z_L\big) \qquad (23)$

where $W_l$ and $b_l$ denote the weights and biases respectively, and $a$ is the activation function. We use tanh as the activation function of the hidden layers. As for the network architecture, we follow the popular tower pattern (e.g., (Cheng et al., 2016; He et al., 2017)): the width (number of neurons) of a layer decreases with its height (the number of layers below it). More precisely, we first set the number of neurons in the bottom layer, then halve the layer width for each successive higher layer. In this way, higher layers can obtain more abstract features.
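A small sketch of this tower sizing rule, assuming the convention that the last hidden layer holds the predictive factors (the function name and defaults are illustrative):

```python
def tower_widths(predictive_factors=8, num_layers=4):
    """Tower pattern: each layer halves the width of the layer below,
    so the last hidden layer has `predictive_factors` neurons."""
    return [predictive_factors * 2 ** (num_layers - 1 - l)
            for l in range(num_layers)]

print(tower_widths(8, 4))   # [64, 32, 16, 8]
```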

Figure 2. Neural pairwise ranking model

4.6. Neural Personalized Ranking

In this section, we develop a model that combines NBPR and DNCR. The intuition is twofold. First, NBPR is a shallow model with limited capacity, while DNCR is a deep model at risk of overfitting; fusing them can increase the model's capacity while preventing overfitting. Second, NBPR applies a linear mapping to model the interactions of the latent user and item vectors, while DNCR applies a nonlinear kernel to model the latent structures of the features; by fusing them, we obtain a model that enjoys the advantages of linearity and nonlinearity simultaneously. As for how to fuse them, a trivial solution is to let NBPR and DNCR share the same embedding layer, and then fuse them by combining the outputs of their learned interaction functions:

$\hat{x}_{uij} = \sigma\big(w^T (\phi^{NBPR} + z_L)\big) \qquad (24)$

where $\phi^{NBPR} = p_u \odot q_i - p_u \odot q_j$ and $z_L$ is the $L$-th layer's output of the MLP (Equation 23). This solution is similar to the Neural Tensor Network (NTN) (Socher et al., 2013). However, due to the different learning processes of the two models, their optimal embedding dimensionalities and weights might be very different. Thus, constraining the two models to share the same embedding is not flexible enough and may degrade the prediction performance. With this in mind, we propose to allow NBPR and DNCR to learn separate embeddings, and then fuse the two models by concatenating their last hidden layers, as shown in Figure 2. We formulate this solution as

$\hat{x}_{uij} = \sigma\left(w^T \begin{bmatrix} p_u^N \odot q_i^N - p_u^N \odot q_j^N \\ z_L \end{bmatrix}\right) \qquad (25)$

where $p_u^N$, $q_i^N$ and $q_j^N$ denote the user and item embeddings for the NBPR part, respectively, and $z_L$ for the DNCR part is defined similarly to Equation 23. We name this model NeuPR, short for neural personalized ranking. As discussed before, NeuPR enjoys the linearity of BPR and the nonlinearity of the MLP at the same time, and thus may yield better results than NBPR and DNCR.
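Below is a minimal Keras sketch of the separate-embedding fusion in Eq. (25); the embedding sizes, tower widths, and dataset sizes are illustrative assumptions.

```python
from keras.layers import (Input, Embedding, Flatten, Multiply, Subtract,
                          Concatenate, Dense)
from keras.models import Model

num_users, num_items = 6040, 3260   # ML1m-sized example
user = Input(shape=(1,), dtype='int32')
item_i = Input(shape=(1,), dtype='int32')
item_j = Input(shape=(1,), dtype='int32')

def lookup(emb, x):
    return Flatten()(emb(x))

# NBPR branch with its own embeddings (Eq. 16)
u_emb_n, i_emb_n = Embedding(num_users, 8), Embedding(num_items, 8)
phi_nbpr = Subtract()([
    Multiply()([lookup(u_emb_n, user), lookup(i_emb_n, item_i)]),
    Multiply()([lookup(u_emb_n, user), lookup(i_emb_n, item_j)])])

# DNCR branch with separate embeddings and a tower MLP (Eq. 23)
u_emb_d, i_emb_d = Embedding(num_users, 32), Embedding(num_items, 32)
z = Concatenate()([lookup(u_emb_d, user),
                   lookup(i_emb_d, item_i), lookup(i_emb_d, item_j)])
for width in [64, 32, 16, 8]:
    z = Dense(width, activation='tanh')(z)

# Fuse by concatenating the two branches' last layers (Eq. 25)
fused = Concatenate()([phi_nbpr, z])
x_uij = Dense(1, activation='sigmoid', use_bias=False)(fused)

model = Model([user, item_i, item_j], x_uij)
model.compile(optimizer='adam', loss='binary_crossentropy')
```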

4.7. Pre-training

For NeuPR, randomly initialized weights and embeddings do not carry any information, so it is hard for the output layer to capture meaningful features; as a result, the neural network cannot be trained effectively. On the other hand, improper initialization may lead NeuPR to be trapped in a local optimum at an early stage, hurting convergence and performance. To alleviate these problems, it is intuitive to first train NBPR and DNCR with random initialization until convergence or the maximum iteration limit, and then initialize NeuPR's NBPR part and DNCR part with the pre-trained NBPR and DNCR models, respectively. The only modification is the edge weight; like prior work (He et al., 2017), we concatenate the edge weights of the two pre-trained models with

$w \leftarrow \begin{bmatrix} \alpha\, w^{NBPR} \\ (1 - \alpha)\, w^{DNCR} \end{bmatrix} \qquad (26)$

where $w^{NBPR}$ and $w^{DNCR}$ denote the edge weight vectors $w$ of NBPR and DNCR, respectively, and $\alpha$ is a hyperparameter which balances the trade-off between the two pre-trained models.
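A minimal sketch of this initialization with Keras models, assuming each model's output is a bias-free Dense layer at index -1 and that NeuPR's other layers have already been copied from the pre-trained parts (the helper name and layer indexing are illustrative):

```python
import numpy as np

def init_neupr_output(neupr, nbpr, dncr, alpha=0.5):
    """Initialize NeuPR's output layer from pre-trained NBPR/DNCR (Eq. 26)."""
    w_nbpr = nbpr.layers[-1].get_weights()[0]    # edge weight of NBPR
    w_dncr = dncr.layers[-1].get_weights()[0]    # edge weight of DNCR
    w = np.concatenate([alpha * w_nbpr, (1 - alpha) * w_dncr], axis=0)
    neupr.layers[-1].set_weights([w])            # trade-off set by alpha
```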

Factors | Dataset | Metric | ItemPop | BPR | eALS | NeuMF | NBPR | DNCR | NeuPR | NeuPR vs. NeuMF
8 | ML1m | HR | 0.4227 | 0.5096 | 0.4861 | 0.5480 | 0.5402 | 0.5664 | 0.5661 | 3.30%
8 | ML1m | NDCG | 0.1815 | 0.2724 | 0.2503 | 0.2921 | 0.2854 | 0.3037 | 0.2997 | 2.60%
8 | Amusic | HR | 0.2710 | 0.2752 | 0.3263 | 0.3476 | 0.3554 | 0.3654 | 0.3645 | 4.86%
8 | Amusic | NDCG | 0.1222 | 0.1586 | 0.1819 | 0.2048 | 0.2062 | 0.2207 | 0.2124 | 3.71%
16 | ML1m | HR | 0.454 | 0.5234 | 0.5156 | 0.5515 | 0.5492 | 0.5672 | 0.5692 | 3.21%
16 | ML1m | NDCG | 0.254 | 0.2756 | 0.2691 | 0.2989 | 0.2935 | 0.3076 | 0.3124 | 4.52%
16 | Amusic | HR | 0.229 | 0.2821 | 0.3247 | 0.3531 | 0.3643 | 0.3862 | 0.3707 | 4.98%
16 | Amusic | NDCG | 0.126 | 0.1623 | 0.1841 | 0.2016 | 0.2115 | 0.2228 | 0.2089 | 3.06%
24 | ML1m | HR | 0.454 | 0.5419 | 0.5227 | 0.5495 | 0.5478 | 0.5690 | 0.5724 | 4.17%
24 | ML1m | NDCG | 0.254 | 0.2876 | 0.2759 | 0.2960 | 0.3002 | 0.3082 | 0.3104 | 4.86%
24 | Amusic | HR | 0.229 | 0.2860 | 0.3028 | 0.3505 | 0.3667 | 0.3853 | 0.3736 | 6.59%
24 | Amusic | NDCG | 0.126 | 0.1659 | 0.1729 | 0.2022 | 0.2139 | 0.2229 | 0.2166 | 7.12%
32 | ML1m | HR | 0.454 | 0.5513 | 0.5344 | 0.5493 | 0.5435 | 0.5652 | 0.5740 | 4.50%
32 | ML1m | NDCG | 0.254 | 0.2895 | 0.2833 | 0.2912 | 0.2973 | 0.3089 | 0.3096 | 4.14%
32 | Amusic | HR | 0.229 | 0.2958 | 0.2995 | 0.3441 | 0.3676 | 0.3941 | 0.3758 | 9.21%
32 | Amusic | NDCG | 0.126 | 0.1738 | 0.1727 | 0.2006 | 0.2165 | 0.2264 | 0.2185 | 8.92%
64 | ML1m | HR | 0.454 | 0.5478 | 0.5386 | 0.5400 | 0.5258 | 0.5753 | 0.5801 | 7.42%
64 | ML1m | NDCG | 0.254 | 0.2903 | 0.2891 | 0.2987 | 0.2885 | 0.3116 | 0.3158 | 5.72%
64 | Amusic | HR | 0.229 | 0.2920 | 0.2961 | 0.3564 | 0.3793 | 0.3904 | 0.3826 | 7.35%
64 | Amusic | NDCG | 0.126 | 0.1707 | 0.1710 | 0.2103 | 0.2291 | 0.2261 | 0.2245 | 6.75%
Table 1. HR@10 and NDCG@10 comparisons of different methods w.r.t. the number of predictive factors

5. Experiments

In this section, we conducted experiments to show the effectiveness of our proposed models. Moreover, extensive experiments were conducted to analyze the performance with different experimental settings, such as the number of hidden layers, negative sampling ratio, size of predictive factors, and so on.

Dataset | #users | #items | #interactions | density
ML1m | 6,040 | 3,260 | 998,539 | 5.07%
Amusic | 5,729 | 9,267 | 65,344 | 0.12%
Table 2. Statistics of the two real-world datasets

5.1. Experimental Settings

Datasets. We evaluated our models on two real-world datasets from different domains, each of which has been widely used for evaluation in previous works: MovieLens 1M (ML1m, available at https://grouplens.org/datasets/movielens/1m/) and Amazon Digital Music (Amusic, available at http://jmcauley.ucsd.edu/data/amazon/). For both datasets, we discarded users and items associated with fewer than 10 interactions. Table 2 shows the statistics of the two datasets.

Evaluation Protocols. We adopted leave-one-out evaluation to assess the performance of item recommendation. For both datasets, we held out each user's latest interaction as the test item and the second latest interaction as a validation item. The remaining data was used for training. Since it is too time-consuming to rank all items for every user during testing, following (Koren, 2008; He et al., 2017), we randomly sampled 100 items that the user had not interacted with and ranked the test item among these sampled items. The ranking performance is evaluated by Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG) (He et al., 2015). For both metrics, the results are based on the ranked list truncated at 10.
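A compact sketch of the two metrics under this protocol; the input format (the 0-based rank of each user's held-out item among its 101 candidates) is an assumption of this sketch.

```python
import math
import numpy as np

def hr_ndcg_at_k(rank_positions, k=10):
    """Leave-one-out HR@k and NDCG@k.

    rank_positions: one 0-based rank per user, i.e., the position of the
    held-out test item among the 100 sampled negatives plus itself.
    """
    hits = [1.0 if r < k else 0.0 for r in rank_positions]
    ndcgs = [math.log(2) / math.log(r + 2) if r < k else 0.0
             for r in rank_positions]
    return np.mean(hits), np.mean(ndcgs)

print(hr_ndcg_at_k([0, 3, 15, 7]))   # example: 3 of 4 users hit in the top 10
```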

Baselines. We compared our proposed methods with the following baselines. We leave out the comparison with item-item models such as CDAE (Wu et al., 2016), because they lack a user model for personalization, which may cause performance differences.

  • ItemPop. Items are ranked by the number of interactions. It is a non-personalized method that is widely used as the baseline for personalized methods.

  • BPR (Rendle et al., 2009) is a pairwise ranking method which optimizes the matrix factorization model with a pairwise ranking loss.

  • eALS (He et al., 2016) is a state-of-the-art matrix factorization method with square loss for collaborative filtering with implicit feedback.

  • NeuMF (He et al., 2017) is a state-of-the-art neural network based collaborative filtering method with a binary cross-entropy loss. For a fair comparison, we employ the same embedding size, number of hidden layers, and number of predictive factors for NeuMF and our models.

Figure 3. Evaluation of top-$N$ item recommendation, where $N$ ranges from 1 to 10, on the two datasets
Figure 4. Performance of NeuMF, NeuPR, DNCR and NBPR w.r.t. the number of negative samples per positive instance (factors = 8).

Parameter Settings. We implemented our proposed approaches based on Keras (https://github.com/fchollet/keras). For learning NCR, we randomly sampled one interaction for each user as validation data and tuned hyperparameters on it. We varied the learning rate among [0.001, 0.0005, 0.0001], randomly initialized model weights with a Gaussian distribution (mean 0 and standard deviation 0.01), set the batch size to 256, and chose the Adam optimizer. For methods relying on negative samples, we sampled one negative instance per positive instance. Recall that in Section 4.1 we term the size of the last hidden layer of NCR the number of predictive factors; we conducted experiments with predictive factors of [8, 16, 24, 32, 64]. Unless otherwise mentioned, we employed four hidden layers for DNCR; for example, if the number of predictive factors is 8, then the architecture of the hidden layers is 64 → 32 → 16 → 8, and the embedding size is 32.

Performance Comparison. Table 1 shows the comparison results w.r.t. the number of predictive factors. For BPR and eALS, the number of predictive factors is equal to the dimensionality of the latent factors. The table demonstrates the effectiveness of our proposed models: DNCR and NeuPR achieve the best performance on both metrics, NDCG and HR, in most cases. On both datasets, our models outperform the state-of-the-art matrix factorization methods eALS and BPR by a considerable margin. Even in comparison with the strongest baseline, NeuMF, our NeuPR consistently performs better, achieving relative improvements of 3.21%–7.42% in HR on MovieLens. On the same dataset, NeuPR achieves relative improvements of 2.60%–5.72% in NDCG over NeuMF. On Amusic, the corresponding improvements are 4.86%–9.21% and 3.06%–8.92%. If we take the best of all NCR models into consideration, NCR models achieve relative improvements of 3.21%–7.42% in HR and 3.97%–5.72% in NDCG on the MovieLens dataset, while on the extremely sparse Amusic dataset our models significantly outperform the strongest baseline NeuMF, with corresponding improvements of 5.12%–14.53% in HR and 7.1%–12.86% in NDCG (paired t-tests, p < 0.01).

Figure 3 shows the performance of top-$N$ recommendation lists, where the number of recommended items $N$ ranges from 1 to 10. Here we employed 8 predictive factors for all methods, and ItemPop is omitted due to its weak performance. DNCR generally achieves prediction accuracy similar to NeuMF, and their performance curves are so close that we can hardly distinguish them. On Amusic, NBPR achieves prediction accuracy comparable to NeuMF, while DNCR and NeuPR beat NeuMF by a considerable margin. On both datasets, neural-network-based methods outperform conventional matrix-factorization-based methods. The characteristics of the datasets also influence the results: on the extremely sparse Amusic dataset, the performance gaps between different methods are relatively large, while on the relatively dense MovieLens dataset, they are relatively small. From Table 1 and Figure 3, we can conclude that on a dense dataset like MovieLens, NeuPR performs best and DNCR performs better than NBPR, while on an extremely sparse dataset like Amusic, DNCR performs best, NeuPR second best, and NBPR worst. This indicates that on extremely sparse datasets, a model's capacity is more important than its ability to avoid overfitting; NeuPR's NBPR part may drag down its performance on Amusic.

Impact of Negative Sampling Ratio. We also conducted extensive experiments to compare the performance of NCR models with the strongest baseline, NeuMF, under different negative sampling ratios. Figure 4 shows the performance of NeuMF and the NCR models w.r.t. the number of negative samples per positive instance. As can be seen, on both datasets, NCR methods beat all other methods in terms of both metrics across the different negative sampling ratios. Among the three NCR methods, DNCR consistently outperforms the other two on Amusic, while on MovieLens, NeuPR performs best. On Amusic, DNCR performs best with 5 negative samples per positive sample; on MovieLens, NeuPR performs best with 2 negative samples per positive sample.

Training Loss. To compare NeuPR, DNCR and NBPR more clearly, we further investigated the training loss (averaged over all instances) of the NCR methods at each iteration on the two datasets. For a fair comparison, we used a learning rate of 0.0005 and a negative sampling ratio of 1, and report the training loss within 100 iterations. As can be seen in Figure 5, NeuPR achieves the lowest training loss on both datasets. However, lower training loss does not always mean higher performance. For example, on Amusic, DNCR has the highest training loss while at the same time achieving the best performance.

Figure 5. Evaluation of pre-training w.r.t. $\alpha$, where $\alpha$ ranges from 0.0 to 1.0.


Impact of Depth of Layers in DNCR. We conducted extensive experiments to investigate DNCR with different numbers of hidden layers. The results are shown in Figure 6. Here DNCR1 means DNCR with 1 hidden layer, and the other DNCR notations have similar meanings.

Figure 6. Evaluation of DNCR w.r.t. the number of hidden layers, where the number of hidden layers ranges from 1 to 6.

We evaluated the performance using the same number of predictive factors (8) for DNCR with two or more hidden layers. As can be seen, DNCR with only one hidden layer (in this case, the hidden layer is simply a concatenation of input features) performs worst: it is only slightly better than ItemPop and underperforms eALS and BPR by a huge margin. Although DNCR2 has only one more hidden layer than DNCR1, it performs far better. This result shows that simply concatenating latent vectors is insufficient to capture the interactions between latent factors. On both datasets, when the number of layers is smaller than 5, increasing the number of hidden layers brings better performance; with more than 5 hidden layers, the performance does not improve further.

Utility of Pre-training. We conducted an extensive experiment to investigate the utility of pre-training for NeuPR. Table 3 shows the performance of NeuPR with and without pre-training. As can be seen, NeuPR with pre-training consistently outperforms NeuPR without pre-training on Amusic. On MovieLens, pre-training achieves better performance in most cases, but not significantly. On the Amusic dataset, pre-training improves the recommendation quality by a large margin. In general, pre-training is beneficial to recommendation quality.

Dataset | Factors | HR@10 (w/o pre-training) | NDCG@10 (w/o pre-training) | HR@10 (with pre-training) | NDCG@10 (with pre-training)
ML1m | 8 | 0.5661 | 0.2997 | 0.5702 | 0.3024
ML1m | 16 | 0.5692 | 0.3124 | 0.5750 | 0.3085
ML1m | 24 | 0.5724 | 0.3104 | 0.5776 | 0.3127
ML1m | 32 | 0.5740 | 0.3096 | 0.5793 | 0.3122
ML1m | 64 | 0.5801 | 0.3158 | 0.5840 | 0.3125
Amusic | 8 | 0.3645 | 0.2124 | 0.4083 | 0.2342
Amusic | 16 | 0.3707 | 0.2089 | 0.4036 | 0.2365
Amusic | 24 | 0.3736 | 0.2166 | 0.3992 | 0.2327
Amusic | 32 | 0.3758 | 0.2185 | 0.4025 | 0.2380
Amusic | 64 | 0.3826 | 0.2245 | 0.4125 | 0.2429
Table 3. Impact of pre-training on NeuPR (HR@10 and NDCG@10)

6. Conclusion and Future Work

In this work we proposed a novel, general neural network based collaborative ranking framework for personalized ranking. We experimentally demonstrated the effectiveness of our novel pairwise classification strategy for recommendation. The results on two real-world datasets illustrate the effectiveness of our three proposed NCR instantiations: NBPR, DNCR and NeuPR. In the future, we will study how to solve the problem of information loss caused by concatenating latent vectors, and how to extend our proposed framework to incorporate auxiliary information to enrich the latent features.

Acknowledgement

This research is supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61672449. We thank Dr. Weike Pan of Shenzhen University for helpful discussions on pairwise ranking techniques.

References

  • Balakrishnan and Chopra (2012) Suhrid Balakrishnan and Sumit Chopra. 2012. Collaborative ranking. In Proceedings of the 5th ACM International Conference on Web Search and Data Mining. ACM, 143–152.
  • Burges et al. (2005) Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning. ACM, 89–96.
  • Chen et al. (2017) Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive collaborative filtering: Multimedia recommendation with item-and component-level attention. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 335–344.
  • Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 7–10.
  • Cremonesi et al. (2010) Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of Recommender Algorithms on Top-n Recommendation Tasks. In Proceedings of the 4th ACM Conference on Recommender Systems. ACM, 39–46.
  • He et al. (2015) Xiangnan He, Tao Chen, Min-Yen Kan, and Xiao Chen. 2015. Trirank: Review-aware explainable recommendation by modeling aspects. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 1661–1670.
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 173–182.
  • He et al. (2016) Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. 2016. Fast matrix factorization for online recommendation with implicit feedback. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. ACM, 549–558.
  • Hsieh et al. (2017) Cheng-Kang Hsieh, Longqi Yang, Yin Cui, Tsung-Yi Lin, Serge Belongie, and Deborah Estrin. 2017. Collaborative Metric Learning. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 193–201.
  • Hu et al. (2008) Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In Proceedings of the 8th IEEE International Conference on Data Mining. IEEE, 263–272.
  • Kim et al. (2016) Donghyun Kim, Chanyoung Park, Jinoh Oh, Sungyoung Lee, and Hwanjo Yu. 2016. Convolutional Matrix Factorization for Document Context-Aware Recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 233–240.
  • Koren (2008) Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 426–434.
  • Kulis et al. (2013) Brian Kulis et al. 2013. Metric learning: A survey. Foundations and Trends® in Machine Learning 5, 4 (2013), 287–364.
  • Lee et al. (2014) Joonseok Lee, Samy Bengio, Seungyeon Kim, Guy Lebanon, and Yoram Singer. 2014. Local collaborative ranking. In Proceedings of the 23rd International Conference on World Wide Web. ACM, 85–96.
  • Li et al. (2015) Sheng Li, Jaya Kawale, and Yun Fu. 2015. Deep Collaborative Filtering via Marginalized Denoising Auto-encoder. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 811–820.
  • Pan et al. ([n. d.]) Rong Pan, Yunhong Zhou, Bin Cao, Nathan N. Liu, Rajan Lukose, Martin Scholz, and Qiang Yang. [n. d.]. One-Class Collaborative Filtering. In Proceedings of the 8th IEEE International Conference on Data Mining. 502–511.
  • Rendle and Freudenthaler (2014) Steffen Rendle and Christoph Freudenthaler. 2014. Improving Pairwise Learning for Item Recommendation from Implicit Feedback. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining. ACM, 273–282.
  • Rendle et al. (2009) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence. AUAI Press, 452–461.
  • Salakhutdinov and Mnih (2007) Ruslan Salakhutdinov and Andriy Mnih. 2007. Probabilistic Matrix Factorization. In Proceedings of the 20th International Conference on Neural Information Processing Systems. 1257–1264.
  • Salakhutdinov et al. (2007) Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. 2007. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning. ACM, 791–798.
  • Sarwar et al. (2001) Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based Collaborative Filtering Recommendation Algorithms. In Proceedings of the 10th International Conference on World Wide Web. ACM, 285–295.
  • Socher et al. (2013) Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. 2013. Reasoning with Neural Tensor Networks for Knowledge Base Completion. In Advances in Neural Information Processing Systems. 926–934.
  • Srivastava and Salakhutdinov (2012) Nitish Srivastava and Ruslan R Salakhutdinov. 2012. Multimodal Learning with Deep Boltzmann Machines. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2222–2230.
  • Vincent et al. (2010) Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. 2010. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research 11 (2010), 3371–3408.
  • Wang and Blei (2011) Chong Wang and David M Blei. 2011. Collaborative Topic Modeling for Recommending Scientific Articles. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 448–456.
  • Wang et al. (2015) Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative Deep Learning for Recommender Systems. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1235–1244.
  • Wu et al. (2016) Yao Wu, Christopher DuBois, Alice X. Zheng, and Martin Ester. 2016. Collaborative Denoising Auto-Encoders for Top-N Recommender Systems. In Proceedings of the 9th ACM International Conference on Web Search and Data Mining. ACM, 153–162.