Recommender systems are an effective solution to help people cope with an increasingly complex information landscape. Collaborative Filtering (CF) approaches have been widely investigated and used for personalized recommendation (Zhang et al., 2017; Adomavicius and Tuzhilin, 2005). Many traditional CF techniques are based on Matrix Factorization (MF) (Zhang et al., 2017). They characterize users and items by latent factors that are extracted from the user-item rating matrix. In the latent space, traditional CF methods, such as the Latent Factor Model (LFM) (Koren et al., 2009), often predict a user’s preference for an item with a linear kernel, i.e., a dot product of their latent factors, which may not be able to capture the complex structure of user-item interactions well.
Recently introduced Deep Learning (DL)-based approaches to recommender systems overcome shortcomings of conventional approaches, such as their difficulty in handling dynamic user preferences and intricate relationships within the data itself, and are able to achieve high recommendation quality. Today’s DL-based approaches mostly use DL to explore auxiliary information, e.g., textual descriptions of items or audio features of music, which is then used to model item features (Kim et al., 2016; Wang and Blei, 2011; Wang et al., 2015). For the user-item rating matrix, recent work mostly continues to use traditional MF-based approaches. Restricted Boltzmann Machines (Salakhutdinov et al., 2007) seem to have been the first model to use neural networks to model the user-item rating matrix and obtain competitive results over traditional methods; however, it is a two-layer network rather than a deep learning structure. Another recent approach, Collaborative Denoising Auto-Encoder (CDAE) (Wu et al., 2016), is mainly designed for rating prediction with a one-hidden-layer neural network. Neural Collaborative Filtering (NCF) (He et al., 2017) uses deep neural networks for learning the interaction function from data with multi-layer perceptrons, yet it does not explore users’ and items’ features that are known to be helpful in improving CF recommendation performance. CDAE and NCF only exploit implicit feedback for recommendations instead of explicit rating feedback. Deep Matrix Factorization (DMF) (Hong-Jian et al., 2017) models the user-item rating matrix with a neural network that maps the users’ and items’ features into a low-dimensional space with non-linear projections; however, it computes interactions between users and items with an inner product, i.e., the same linear kernel (dot product) as LFM (Koren et al., 2009).
We hypothesize that DL should be able to effectively capture both non-linear and non-trivial user-item relationships as well as users’ (items’) characteristics with multi-layer projections (Zhang et al., 2017). We propose a Joint Neural Collaborative Filtering (J-NCF) model that enables two processes, feature extraction and user-item interaction modeling, to be trained jointly in a unified DL structure. The J-NCF model contains two main networks for recommendation. The first network uses the rating information of a user (an item) as the network input, and outputs a vector representation for the user (the item). Then, using the combination of a user’s and an item’s vectors as input, the second neural network models the user-item interactions and outputs the predicted rating for the user-item pair. Thus, these two networks can be coupled tightly and trained jointly in a unified structure. Interaction modeling can optimize the feature learning process and more accurate feature representations can, in turn, improve the user-item interaction prediction. We take both implicit and explicit feedback, as well as both point-wise and pair-wise loss, into account to enhance the prediction performance. In contrast, previous neural approaches such as CDAE, NCF and DMF are all optimized only with point-wise loss functions and leave dealing with pair-wise loss as future work.
To the best of our knowledge, in the area of recommender systems ours is the first attempt to use a joint neural network to tightly couple feature learning and interaction modeling with the rating matrix. J-NCF allows these two processes to optimize each other through joint training and thereby improve the recommendation performance.
Our experiments on real-world datasets, including the MovieLens dataset and the Amazon Movies dataset, show that J-NCF outperforms the state-of-the-art baselines in prediction accuracy, with improvements of up to 8.24% on the MovieLens 100K dataset, 10.81% on the MovieLens 1M dataset, and 10.21% on the Amazon Movies dataset in terms of HR@10. NDCG@10 improvements are 12.42% on the MovieLens 100K dataset, 14.24% on the MovieLens 1M dataset, and 15.06% on the Amazon Movies dataset, respectively, over the best baseline model. In addition, we investigate the scalability and sensitivity of J-NCF with different degrees of sparsity and different numbers of users’ ratings. Our experimental results indicate that J-NCF achieves competitive recommendation performance when compared to the best state-of-the-art model.
Our contributions in this paper are:
We design a Joint Neural Collaborative Filtering model (J-NCF) for recommendation, which enables deep feature learning and deep user-item interaction modeling to be coupled tightly and jointly optimized in a single neural network.
We design a new loss function that explores the information contained in both point-wise and pair-wise loss as well as implicit and explicit feedback.
We analyse the recommendation performance of J-NCF as well as baseline models and find that J-NCF consistently yields the best performance. J-NCF also shows competitive improvements over the best baseline model when applied with inactive users and different degrees of data sparsity.
We summarize related work in Section 2. Our approach, J-NCF, is described in Section 3. Section 4 presents our experimental setup. In Section 5, we report our results to demonstrate the recommendation performance of J-NCF. We also investigate the scalability and sensitivity of our model as well as other baselines in Section 6. Finally, we conclude our work in Section 7, where we also suggest future research directions.
2. Related work
We first look back at traditional approaches to recommender systems in Section 2.1, which focus on modeling the similarity between users (items) for recommendation. Then, as the application of deep learning techniques to recommender systems is gaining momentum due to its state-of-the-art performance and high-quality recommendations, we summarize recent work on deep learning-based recommender systems in Section 2.2. Such systems can provide a better understanding of users’ demands, items’ characteristics, and the historical interactions between them, for instance by extracting item features from auxiliary information such as the content of movies.
2.1. Traditional recommender systems
In many commercial systems, “best bet” recommendations are shown, but the predicted rating values are not. This is usually referred to as a top-N recommendation task, where the goal of the recommender system is to find a few specific items that are supposed to be most appealing to the user. A similar prediction schema, denoted as Top Popular (Item-pop), recommends the top-N items with the highest popularity (largest number of ratings).
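The Item-pop schema described above can be sketched in a few lines. This is a minimal illustration only: the function name `item_pop_top_n`, the triple format of `ratings`, the exclusion of items the user has already rated, and the tie-breaking by item identifier are all assumptions made for the example, not details taken from the text.

```python
from collections import Counter

def item_pop_top_n(ratings, user, n):
    """Return the n most-rated items that `user` has not rated yet.

    `ratings` is a list of (user, item, rating) triples (hypothetical format).
    """
    # Popularity = number of ratings an item has received.
    popularity = Counter(item for _, item, _ in ratings)
    # Assumption: items the user already rated are not recommended again.
    seen = {item for u, item, _ in ratings if u == user}
    # Sort by decreasing popularity; ties are broken by item identifier.
    ranked = sorted(popularity, key=lambda item: (-popularity[item], item))
    return [item for item in ranked if item not in seen][:n]

toy_ratings = [
    ("a", "x", 5), ("b", "x", 3), ("c", "x", 4),
    ("a", "y", 2), ("b", "y", 4),
    ("c", "z", 1),
]
top_for_new_user = item_pop_top_n(toy_ratings, "d", 2)
top_for_c = item_pop_top_n(toy_ratings, "c", 2)
```

Because the ranking does not depend on the target user beyond the filter on already-rated items, every new user receives the same list, which is why the schema is non-personalized.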
Most top-N recommender systems are based on collaborative filtering (Adomavicius and Tuzhilin, 2005), where recommendations rely on past behavior (ratings) from users, regardless of domain knowledge (Su and Khoshgoftaar, 2009). We group these CF approaches into two categories, i.e., neighborhood-based methods (Sarwar et al., 2001; Linden et al., 2003) and latent factor-based models (Koren et al., 2009; Kabbur et al., 2013). Neighborhood-based models share the typical merits of CF, which concentrate on exploring the similarity among either users or items. For instance, two users are similar because they have rated similarly the same set of items. A dual concept of similarity can be defined among items. Latent factor-based approaches generally model users and items as vectors in the same “latent factor” space by means of a reduced number of hidden factors. In such a space, users and items are directly comparable: the rating of a user on an item is predicted by the proximity (e.g., inner-product) between the related latent factor vectors.
For neighborhood-based models, algorithms that are centered around user-user similarity typically predict a user’s rating of an item based on the ratings that similar users have given to that item. On the other hand, algorithms centered around item-item similarity compute the user’s preference for an item based on her own ratings of similar items. The similarity between two items is measured as the tendency of users to rate them similarly. It is typically based either on the cosine, the adjusted cosine, or (most commonly) the Pearson correlation coefficient (Sarwar et al., 2001). The kNN (k-nearest-neighbor) approach is a representative enhanced neighborhood model (Adeniyi et al., 2016), which, when predicting a user’s rating of a target item, considers only the k items rated by that user that are most similar to the target item. kNN-based approaches discard items that are poorly correlated to the target item, thus decreasing noise and improving the quality of recommendations. Neighborhood-based approaches are similar to the item-item model for user personalization, which is different from our approach based on the user-item model (Sarwar et al., 2001). Thus, we focus on the latent factor modeling approach.
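As a rough illustration of the item-item kNN idea above, the following sketch uses cosine similarity and a similarity-weighted average of the user’s own ratings. The weighted-average prediction rule, the dictionary layout of `item_ratings`, and the function names are assumptions made for the example; the text only specifies that the k most similar items rated by the user are considered.

```python
import math

def cosine(a, b):
    """Cosine similarity of two items given as {user: rating} dicts.

    Missing ratings are treated as 0, so only common users contribute
    to the numerator.
    """
    common = set(a) & set(b)
    num = sum(a[u] * b[u] for u in common)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return num / den if den else 0.0

def knn_predict(item_ratings, user, target, k):
    """Predict `user`'s rating of `target` from the k most similar items
    that the user has rated (similarity-weighted average)."""
    candidates = [
        (cosine(item_ratings[target], ratings), ratings[user])
        for other, ratings in item_ratings.items()
        if other != target and user in ratings
    ]
    # Keep only the k neighbours with the highest similarity.
    top = sorted(candidates, reverse=True)[:k]
    norm = sum(sim for sim, _ in top)
    return sum(sim * r for sim, r in top) / norm if norm else 0.0

item_ratings = {
    "i1": {"u1": 5, "u2": 3},
    "i2": {"u1": 5, "u2": 3, "u3": 4},
    "i3": {"u1": 1, "u3": 2},
}
pred_k1 = knn_predict(item_ratings, "u3", "i1", 1)
pred_k2 = knn_predict(item_ratings, "u3", "i1", 2)
```

With k = 1 only the closest neighbour (`i2`) is used, so the poorly correlated item `i3` is discarded, which is the noise-reduction effect described above.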
Most research on latent factor modeling is based on factoring the user-item rating matrix, which is known as Singular Value Decomposition (SVD) (Koren et al., 2009). SVD factorizes the user-item rating matrix into a product of two lower-rank matrices, one containing the “user factors,” the other containing the “item factors.” The user’s preference towards an item is then generated by adding a bias term to the inner product of the corresponding user-factor and item-factor vectors.
Since the conventional SVD is undefined in the presence of unknown values, i.e., missing ratings, several solutions have been proposed. Earlier work addresses this issue by filling the missing ratings with a baseline estimation (Sarwar et al., 2000). However, this leads to a very large, dense user rating matrix, for which the factorization process becomes computationally infeasible. Recent work learns factor vectors directly on known ratings through a suitable objective function that minimizes a prediction error. The proposed objective functions are usually regularized in order to avoid overfitting (Paterek, 2007). Typically, gradient descent is applied to minimize the objective function. An advantage of SVD-based approaches is that, once a new user has rated some items, they can provide recommendations for that user immediately without re-estimating the parameters of the model.
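Learning factor vectors directly on the known ratings can be sketched as follows. This is an illustrative stochastic gradient descent loop with L2 regularization, not the exact procedure of any cited work: the bias term is omitted for brevity, and the hyper-parameter values, function names, and toy data are assumptions.

```python
import random

def train_mf(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02, epochs=1000, seed=0):
    """Fit user factors P and item factors Q on observed (u, i, r) triples only."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            # Prediction error on one known rating; missing ratings are never visited.
            err = r - sum(pf * qf for pf, qf in zip(P[u], Q[i]))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on the squared error plus the L2 penalty.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    """Inner product of the user's and the item's factor vectors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def mean_abs_error(P, Q, ratings):
    return sum(abs(r - predict(P, Q, u, i)) for u, i, r in ratings) / len(ratings)

known = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
P_init, Q_init = train_mf(known, 3, 3, epochs=0)   # untrained factors
P_fit, Q_fit = train_mf(known, 3, 3)               # trained factors
```

Only the six observed entries of the 3×3 toy matrix enter the objective, so no dense, imputed matrix ever needs to be factorized.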
Another model based on SVD, SVD++ (Koren, 2008), incorporates both explicit and implicit feedback, and shows improved performance over many MF models. This is consistent with our motivation of combining explicit and implicit feedback in J-NCF. However, applying traditional MF methods to sparse ratings matrices can be a non-trivial challenge with high computational costs for decomposing the rating matrix.
Many traditional recommender systems apply a linear kernel, i.e., an inner product of user and item vectors, to model user-item interactions. Linear functions may not be able to give an accurate description of the characteristics of users (items) and of user-item interactions: previous work has shown through extensive experiments that non-linearities have potential advantages for improving the performance of recommender systems (Li et al., 2015; Wu et al., 2016; Sedhain et al., 2015).
2.2. Deep learning-based recommender system
DL-based recommender systems can be divided into two categories, i.e., single neural network models and deep integration models, depending on whether they rely solely on deep learning techniques or integrate traditional recommendation models with deep learning (Zhang et al., 2017; Su and Khoshgoftaar, 2009; Basiliyos et al., 2017; Liu and Wu, 2017; Zheng et al., 2016; Huang et al., 2013; Onal et al., 2018; He and Chua, 2017; Wang et al., 2019).
For the first category, RBM (Salakhutdinov et al., 2007; Truyen et al., 2009; Liu et al., 2015) is an early neural recommender system. It uses a two-layer undirected graph to model tabular data, such as users’ explicit ratings of movies. RBM targets rating prediction, not top-N recommendation, and its loss function considers only the observed ratings. It is technically challenging to incorporate negative sampling into the training of RBMs (Wu et al., 2016), which would be required for top-N recommendation. AutoRec (Sedhain et al., 2015) uses an Auto-Encoder for rating prediction. It only considers the observed ratings in the loss function, which does not guarantee good performance for top-N recommendation. To prevent the Auto-Encoder from learning an identity function and failing to generalize to unseen data, Denoising Auto-Encoders (DAEs) (Li et al., 2015) have been applied to learn from intentionally corrupted inputs. Most of the publications listed so far focus on explicit feedback and, hence, fail to learn users’ preference from implicit feedback. CDAE (Wu et al., 2016) extends DAEs; its input is a user’s partially observed implicit feedback. Unlike our work, both DAEs and CDAE use an item-item model for personalization that represents a user with their rated items (Sarwar et al., 2001) and the outputs are the item scores decoded from the learned user’s representation. Our work is a kind of user-item model, which learns users’ as well as items’ representations first and then calculates the relevance between them. The proposed J-NCF model is a user-item model that personalizes by modeling user-item interactions. Also, CDAE applies a linear kernel to model the relationship between users and items, whereas J-NCF applies a non-linear kernel.
Other neural architectures are used to extract item features with auxiliary information, e.g., review text or contextual information, which we will incorporate in our future work. As for Recurrent Neural Networks, they are used in recommender systems that address the temporal dynamics of ratings and sequential features (Hidasi et al., 2016a; Trapit et al., 2016).
Most closely related to our model is Neural Collaborative Filtering (NCF) (He et al., 2017). It uses multi-layer perceptrons to model the two-way interaction between users and items, which is meant to capture the non-linear relationship between users and items. Given the side information (e.g., the feature information) of a user and an item, NCF predicts their interaction score by feeding the corresponding user and item representations into a multilayer perceptron, whose parameters are learned from data. However, NCF randomly initializes the representation of users and items, with just a one-hot identifier of user and item respectively, which only explores the users’ and items’ features in a limited manner. J-NCF adopts a joint neural network structure to capture both user and item features, and user-item relationships, as we hypothesize that the two parts can be optimized through tight coupling and joint training. In addition, NCF only exploits implicit feedback for item recommendations and ignores explicit feedback.
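As a sketch only, the NCF prediction rule discussed here can be written as follows. The symbols are assumptions chosen for illustration: $s_u$ and $s_i$ stand for the side information (or one-hot identifiers) of user $u$ and item $i$, $P$ and $Q$ for the user and item embedding matrices, $f(\cdot)$ for the multilayer perceptron, and $\theta$ for its parameters.

```latex
\hat{y}_{ui} = f\left(P^{\top} s_u,\; Q^{\top} s_i \,\middle|\, P, Q, \theta\right)
```

When $s_u$ and $s_i$ are one-hot identifiers, $P^{\top} s_u$ and $Q^{\top} s_i$ simply select one randomly initialized embedding row each, which is the limitation the paragraph above points out.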
An extension based on NCF is CCCFNet (Cross-domain Content-boosted Collaborative Filtering neural Network) (Lian et al., 2017). The basic building block of CCCFNet is also a dual network (for users and items, respectively). It models the user-item interactions in the last layer with the dot product. Unlike our work, it applies content information with a neural network to capture the user’s preferences and item features. In addition, DeepFM (Deep Factorization Machine) (Guo et al., 2017) is an end-to-end model that seamlessly integrates factorization machine and MLP. However, it also applies content information and thus models higher-order feature interactions via a deep neural network and low-order interactions via a factorization machine. In contrast, J-NCF adopts the rating information to explore both user and item features, which are easier to collect.
As to deep integration models, Collaborative Deep Learning (CDL) (Wang et al., 2015) is a hierarchical Bayesian model that integrates stacked DAEs into traditional probabilistic MF. It differs from our work in two ways: (1) it extracts deep feature representations of items from the content information, which we do not explore, and (2) it uses a linear kernel to model relations between users and items, i.e., the dot product of user and item vectors.
A well-known integration model is DeepCoNN (Deep Cooperative Neural Network) (Zheng et al., 2017), which adopts two parallel convolutional neural networks to model user behavior and item properties from review texts. In the final layer, a factorization machine is applied to capture their interactions for rating prediction. It alleviates the sparsity problem and enhances model interpretability by exploiting a rich semantic representation of the reviews, which could be investigated in J-NCF as future work.
Wide & Deep learning (Cheng et al., 2016) and DeepFM (Guo et al., 2017) are two state-of-the-art recommendation works with deep learning techniques. While they focus on incorporating various features of users and items, we aim at exploring deep learning methods for pure collaborative filtering systems. Another integration model that is directly relevant to our work is Deep Matrix Factorization (DMF) (Hong-Jian et al., 2017). It uses a deep MF model with a neural network that maps users and items into a common low-dimensional space. It follows the LFM, which uses the inner product to compute interactions between users and items. This may partially explain why using deep layers does not help to improve the performance of DMF (see (Hong-Jian et al., 2017, Section 4.4)). Unlike DMF, we apply multi-layer perceptrons to model user-item interactions using a combination of user and item feature vectors as input. This does not only help our model to be more expressive in modeling user-item interactions than linear products, but it also helps to improve the accuracy of user and item feature extraction.
On top of the previous work discussed above, our proposed model J-NCF combines feature learning and interaction modeling into an end-to-end trainable neural network, which enables the two processes to be optimized jointly. Besides this, we design a new loss function that combines point-wise and pair-wise losses to explore the integration of different types of information, i.e., both implicit and explicit feedback.
The proposed model, J-NCF, has a joint structure with a layer used for modeling users’ and items’ features (the DF network) and a higher layer used for modeling user-item interactions (the DI network). These two layers can be trained in a joint manner to give a predicted score of a user’s interactions with an item with minimum prediction error. We first describe the notation used and then detail J-NCF. We also describe the loss function that we use for optimization.
3.1. Problem formulation and notation
First we describe the task of top-N recommendation that we study in this paper. Suppose that we are given a set of users, a set of items, and a rating matrix whose entries are the ratings given by users to items. The task of top-N recommendation is to return, for an individual user, a list of items that maximizes the user’s satisfaction.
The main notation we use in this paper is listed in Table 1.
|the set of users|
|the set of items|
|an explicit rating of user to item|
|a vector containing a user’s ratings; serves as input to Net|
|a vector containing an item’s ratings; serves as input to Net|
|the number of unique users|
|the number of unique items|
|the weight matrix for the -th layer in Net|
|the bias for the -th layer in Net|
|the activation function for the -th layer in Net|
|the number of layers in DF network|
|the weight matrix for the -th layer in the DI network|
|a combination of user and item vectors; serves as input to the DI network|
|the bias for the -th layer in the DI network|
|the activation function for the -th layer in the DI network|
|the number of layers in the DI network|
|the predicted score of the interaction between user and item|
|the set of items that a user rates|
|the set of items that are not rated by a user|
|a tradeoff parameter controlling the contributions of the point-wise loss and pair-wise loss|
3.2. Joint Neural Collaborative Filtering
The joint architecture of the proposed J-NCF model is shown in Fig. 1. The model contains two main networks: a DF network for modeling features and a DI network for modeling interactions between items and users, where the output of the first network serves as the input of the second.
The DF network is used for modeling users’ and items’ features. It contains two parallel neural networks coupled in the last layer, one network for users (the user Net) and another for items (the item Net). We give the ratings of a user and an item as inputs to the user Net and the item Net, respectively: each input is a rating vector in which every known rating keeps its value and every unknown rating is set to zero (Eq. (3)).
We think of ratings as non-trivial explicit feedback from users, as different ratings indicate different levels of a user’s preference towards items. There are also many unknown ratings between users and items, which indicate a user’s non-preference towards an item. Following (He et al., 2017; Hong-Jian et al., 2017), we regard these unknown ratings as a kind of implicit feedback and mark them as zeros. When pursuing a top-N recommendation task, we are interested only in a correct item ranking and care less about the exact rating scores. This grants us some flexibility, such as considering all missing values in the user rating matrix as zeros (Cremonesi et al., 2010). Thus we can take both explicit and implicit feedback into consideration with Eq. (3).
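The construction of the network inputs described above can be illustrated with a small sketch. The function name `rating_vectors` and the triple format of the input are assumptions for the example; the only behaviour taken from the text is that known ratings keep their values and unknown ratings are marked as zeros.

```python
def rating_vectors(ratings, n_users, n_items):
    """Build the input vectors for the user and item networks.

    `ratings` is a list of (user_index, item_index, rating) triples.
    Returns (user_vectors, item_vectors): one row of item ratings per user,
    and one row of user ratings per item, with 0.0 for unknown ratings.
    """
    matrix = [[0.0] * n_items for _ in range(n_users)]
    for u, i, r in ratings:
        matrix[u][i] = float(r)          # explicit feedback keeps its value
    user_vectors = matrix                 # row u: all ratings given by user u
    item_vectors = [                      # row i: all ratings received by item i
        [matrix[u][i] for u in range(n_users)] for i in range(n_items)
    ]
    return user_vectors, item_vectors

user_vectors, item_vectors = rating_vectors([(0, 0, 5), (0, 2, 3), (1, 1, 4)], 2, 3)
```

Each user vector has one entry per item and each item vector one entry per user, so the two networks differ in their input dimensionality.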
Then, with multi-layer perceptrons (MLP), the initial high-dimensional rating vectors of users and items are mapped to lower-dimensional vectors. Since the user Net and the item Net only differ in their inputs, we focus on illustrating the process for the user Net; the same process is applied for the item Net with similar layers. In the DF network, each layer applies a weight matrix and a bias vector to the output of the previous layer, followed by an activation function; a ReLU is used as the activation function (He et al., 2017). The output of the final layer of the user Net is a deep representation of the user features; likewise, the output of the final layer of the item Net is the deep representation of the item features.
As to modeling user-item interactions, traditional LFM methods have been widely used. Such methods are based on the dot product of user and item vectors, which models a user’s preference with a linear kernel. In order to investigate the differences between non-linear and linear functions in modeling user-item interactions, we propose two ways to obtain fused users’ and items’ feature vectors as the input of the DI network:
The first way is to concatenate the user and item feature vectors, which we regard as a non-linear fusion. The second way is to use the element-wise product of the two vectors, which uses a linear kernel to generate user-item interactions. Based on these two ways of fusing the feature vectors, we propose two versions of J-NCF, which we discuss in detail in our experiments.
Generating the fused vector is the first step for modeling user-item interactions. However, it is insufficient for modeling the complex relationship between users and items. Thus, we feed the fused vector into intermediate hidden layers so as to obtain a multi-layer non-linear projection of user-item interactions. Each layer in the DI network again applies a weight matrix and a bias vector to the output of the previous layer, followed by an activation function; a ReLU is applied again as the activation function. The output of the network is the predicted score of the interaction between a user and an item, obtained by applying a sigmoid function to a weighted sum of the final hidden layer’s output. The sigmoid function restricts the output to (0,1), and the weight vector, which controls the weight of each dimension of the final hidden layer’s output, is learnt through the training process with back propagation.
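The forward pass through the DF and DI networks can be sketched as below. This is a minimal, untrained illustration under stated assumptions: the layer sizes, the random initialization, and all names (`jncf_score`, `init_layers`, `h`) are hypothetical, and no training step is shown.

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def dense(W, b, x):
    """One affine layer: W is a list of rows, b a bias vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def mlp(layers, x):
    """Apply each (W, b) layer followed by a ReLU activation."""
    for W, b in layers:
        x = relu(dense(W, b, x))
    return x

def init_layers(sizes, rng):
    """Random (W, b) pairs for consecutive layer sizes (illustrative init)."""
    return [
        ([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes, sizes[1:])
    ]

def jncf_score(user_vec, item_vec, net_user, net_item, di_layers, h, fusion="concat"):
    z_u = mlp(net_user, user_vec)            # DF network: user features
    z_i = mlp(net_item, item_vec)            # DF network: item features
    if fusion == "concat":
        a = z_u + z_i                        # concatenation of the two vectors
    else:
        a = [x * y for x, y in zip(z_u, z_i)]  # element-wise product
    a = mlp(di_layers, a)                    # DI network: hidden layers
    logit = sum(hk * ak for hk, ak in zip(h, a))
    return 1.0 / (1.0 + math.exp(-logit))    # sigmoid keeps the score in (0, 1)

rng = random.Random(0)
user_vec = [5.0, 0.0, 3.0, 0.0]              # ratings over 4 items
item_vec = [5.0, 0.0, 4.0]                   # ratings from 3 users
net_user = init_layers([4, 8, 3], rng)
net_item = init_layers([3, 8, 3], rng)
h = [rng.uniform(-0.5, 0.5) for _ in range(2)]
score_concat = jncf_score(user_vec, item_vec, net_user, net_item,
                          init_layers([6, 4, 2], rng), h, fusion="concat")
score_mult = jncf_score(user_vec, item_vec, net_user, net_item,
                        init_layers([3, 4, 2], rng), h, fusion="mult")
```

Note that concatenation doubles the input width of the DI network, whereas the element-wise product requires the user and item feature vectors to have the same length.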
3.3. Loss function
Objective functions for training recommender systems can be divided into three groups: point-wise, pair-wise and list-wise. Point-wise objectives aim at obtaining accurate ratings, which is more applicable in rating prediction tasks (Kabbur et al., 2013). Pair-wise objectives are usually focused on users’ preferences towards pairs of items and are usually considered more suitable for top-N recommendation (He et al., 2016, 2017; Kabbur et al., 2013; Rendle et al., 2009). List-wise objectives are focused on users’ interests towards a list of items, which are also used in some deep learning algorithms. We briefly summarize the three groups of loss functions.
Each objective combines a loss function with a regularization term that controls the model complexity and encodes prior information such as sparsity, non-negativity, or graph regularization.
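For a point-wise loss function, the loss is generally calculated by summing a per-instance loss between the predicted and the observed value over all training instances and adding the regularization term.
There are several types of point-wise loss function. For example, the squared loss, which sums the weighted squared differences between observed and predicted values, is more suitable for explicit feedback than for implicit feedback; the weight of each training instance is a hyper-parameter.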
The use of squared loss is based on the assumption that observations are generated from a Gaussian distribution; however, this assumption may not tally well with implicit data (Salakhutdinov and Mnih, 2007). For implicit feedback, there is a point-wise loss function mainly used for classification tasks (Hong-Jian et al., 2017; He et al., 2017), named log loss (Kabbur et al., 2013), which can perform better with implicit feedback than squared loss; it is the binary cross-entropy between the observed label and the predicted score (Eq. (10)).
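For reference, the point-wise objectives discussed here are commonly written in the following forms. The notation is an assumption made for this sketch: $y_{ui}$ is the observed value and $\hat{y}_{ui}$ the predicted score for user $u$ and item $i$, $w_{ui}$ is the instance weight, $\Omega(\Theta)$ is the regularization term over the model parameters $\Theta$, and $\lambda$ is its weight.

```latex
% General point-wise objective
L = \sum_{(u,i)} \ell\left(\hat{y}_{ui}, y_{ui}\right) + \lambda\, \Omega(\Theta)

% Squared loss
L_{sq} = \sum_{(u,i)} w_{ui} \left(y_{ui} - \hat{y}_{ui}\right)^{2}

% Log loss (binary cross-entropy)
L_{log} = -\sum_{(u,i)} \left[ y_{ui} \log \hat{y}_{ui} + \left(1 - y_{ui}\right) \log\left(1 - \hat{y}_{ui}\right) \right]
```

The log loss treats $\hat{y}_{ui}$ as a probability, which is why it requires predicted scores in $(0,1)$, as produced by a sigmoid output.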
Pair-wise loss considers the relative order of the predictions for pairs of items, which is a more reliable kind of information for top-N recommendation. Hidasi and Karatzoglou (2018) investigate several popular pair-wise loss functions, i.e., TOP1, BPR-max and TOP1-max, which we briefly introduce. TOP1 is the regularized approximation of the relative rank of the relevant item. It is computed from the prediction scores of a positive item and of the negative items in a set of negative samples. The first part of TOP1 aims to ensure that the target score is higher than the score of the negative samples, while the second part pushes the score of the negative samples down. As for BPR-max and TOP1-max, they have been proposed by Hidasi and Karatzoglou (2018) to overcome vanishing gradients as the number of negative samples increases. The idea is to compare the target score with the most relevant sample score, which is the maximum score amongst the samples. As the maximum operation is non-differentiable, softmax scores are used to preserve differentiability. TOP1-max is calculated by summing the individual TOP1 losses weighted by the corresponding softmax scores.
The BPR-max loss function is calculated analogously, with the softmax scores weighting the pair-wise comparisons between the target score and the negative sample scores.
For list-wise loss, many deep learning-based methods combine cross-entropy loss with softmax, which introduces list-wise properties into the loss. We refer to it as softmax+cross-entropy (XE) loss.
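The ranking losses above can be sketched for a single positive item and a list of sampled negatives. The formulas follow the commonly cited forms from Hidasi and Karatzoglou (2018) as an assumption of this sketch; the function names are hypothetical, and the additional score regularization term that accompanies BPR-max in that work is omitted here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    m = max(xs)                               # subtract the max for stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top1(pos, negs):
    """Mean over negatives of sigmoid(neg - pos) + sigmoid(neg^2)."""
    return sum(sigmoid(n - pos) + sigmoid(n * n) for n in negs) / len(negs)

def top1_max(pos, negs):
    """TOP1 terms weighted by the softmax scores of the negatives."""
    weights = softmax(negs)
    return sum(w * (sigmoid(n - pos) + sigmoid(n * n)) for w, n in zip(weights, negs))

def bpr_max(pos, negs):
    """Negative log of the softmax-weighted probability that pos outranks a negative."""
    weights = softmax(negs)
    return -math.log(sum(w * sigmoid(pos - n) for w, n in zip(weights, negs)))

def softmax_xe(pos, negs):
    """Softmax + cross-entropy: negative log softmax score of the target item."""
    return -math.log(softmax([pos] + list(negs))[0])

negatives = [-1.0, 0.5, 1.0]
losses_low_target = [f(0.0, negatives) for f in (top1, top1_max, bpr_max, softmax_xe)]
losses_high_target = [f(3.0, negatives) for f in (top1, top1_max, bpr_max, softmax_xe)]
```

In the two "-max" variants the softmax weights concentrate on the highest-scoring negatives, so the comparison is dominated by the most relevant sample even when many negatives are drawn.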
Most deep learning-based models only use the point-wise loss function for optimization and leave the pair-wise loss function for future work (Hong-Jian et al., 2017; He et al., 2017). Point-wise loss only uses the rating information and ignores the information contained in the relative order of pairs of items. Pair-wise loss, in contrast, ignores the information of a user’s individual preference for a certain item. Thus, unlike previous work such as NCF and DMF, our proposed J-NCF model considers both point-wise and pair-wise loss for the top-N recommendation task and combines them into a new loss function (Eq. (15)): a weighted sum of the point-wise loss and the pair-wise loss, in which a tradeoff parameter controls the weights of the two parts.
For point-wise loss, we adopt the log loss (Eq. (10)), which can integrate both implicit and explicit feedback. As to pair-wise loss, combining with different pair-wise losses yields different new loss functions, i.e., point-wise+TOP1, point-wise+BPR-max, and point-wise+TOP1-max. We analyze the performance of these different combined loss functions with experiments in Section 5.
Acknowledging that explicit and implicit feedback both contain information about a user’s preference towards items, we combine both kinds of feedback in our loss function for optimization and rewrite Eq. (15) in detail as Eq. (16). In this loss, each rating is normalized by the largest rating score that the user has given to any item, so that different rating values have a different influence on the loss. For example, if the largest rating score that a user has given is 4 and the user rates an item with 2, the normalized rating is 2/4 = 0.5. We refer to our loss function Eq. (16) as a “hybrid” loss function.
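The ingredients of the hybrid loss can be illustrated with a short sketch. Several details are assumptions made for the example rather than facts from the text: the function names, the clipping constant `eps`, and in particular which of the two terms the tradeoff parameter `alpha` multiplies.

```python
import math

def normalized_rating(rating, user_max_rating):
    """Scale a rating by the largest rating the user has given (0 stays 0)."""
    return rating / user_max_rating

def pointwise_log_loss(target, predicted, eps=1e-12):
    """Log loss with a soft target in [0, 1], e.g., a normalized rating."""
    predicted = min(max(predicted, eps), 1.0 - eps)   # avoid log(0)
    return -(target * math.log(predicted) + (1.0 - target) * math.log(1.0 - predicted))

def hybrid_loss(point_loss, pair_loss, alpha):
    """Weighted sum of the two losses.

    Assumption: alpha weights the pair-wise part and (1 - alpha) the
    point-wise part; the opposite convention would work the same way.
    """
    return alpha * pair_loss + (1.0 - alpha) * point_loss

target = normalized_rating(2, 4)                      # the example from the text
point = pointwise_log_loss(target, 0.5)
combined = hybrid_loss(point, 3.0, 0.25)
```

Because unknown ratings are marked as zeros, their normalized target is 0, so the same point-wise term covers both explicit ratings and implicit non-preference.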
We have developed the joint neural network structure of the J-NCF model. The training process of J-NCF is shown in Algorithm 1. We first initialize the parameters in the network and modify the rating matrix from step 1 to 3. Then, in step 9 and 10, we generate deep feature representations for both users and items with the DF network. In step 11 and 12, we calculate the predicted scores for the user-item interactions with the DI network. Finally, we use the hybrid loss function in Eq. (16) and back propagation to optimize the network parameters with step 13 and 14.
4. Experimental setup
We design experiments on a variety of datasets to examine the effectiveness of J-NCF. We first explain the research questions and the models we use for comparison in Section 4.1. The datasets and experiments are described in Section 4.2.
4.1. Model summary and research questions
We conduct experiments with the aim of answering the following research questions:
Does our proposed J-NCF method outperform state-of-the-art collaborative filtering baselines for recommender systems?
How is the performance of J-NCF impacted by different choices for the pair-wise loss in Eq. (16)?
Does the hybrid loss function Eq. (16), which combines point-wise and pair-wise loss, help to improve the performance of J-NCF?
Are deeper layers of hidden units in the DF network and DI network helpful for the recommendation performance of J-NCF?
Does the combination of explicit and implicit feedback help to improve the performance of J-NCF?
How does the performance of J-NCF vary across users with different numbers of interactions?
Is J-NCF sensitive to different degrees of data sparsity?
How does J-NCF perform on a large and sparse dataset?
How do the training and inference times of J-NCF compare against those of other neural models?
We compare J-NCF against a number of traditional collaborative filtering baselines and against state-of-the-art deep learning based models:
Item-pop: This method ranks items based on the number of interactions, which is a non-personalized approach to determine recommendation scores (Adomavicius and Tuzhilin, 2005).
BPR: This method uses a pairwise loss function to optimize an MF model based on implicit feedback. We use it as a strong baseline for traditional collaborative filtering methods (Rendle et al., 2009).
NCF: This is a state-of-the-art neural network-based method for recommender systems. It aims to capture the non-linear relationship between users and items. Unlike J-NCF, it simply uses one-hot vectors representing users and items as the input for modeling user-item interactions. In addition, it only uses implicit feedback and a point-wise loss function (He et al., 2017).
DMF: This method uses multi-layer perceptrons for rating matrix factorization. Unlike our work, after projecting users and items into low-dimensional vectors, it applies an inner product to calculate interactions between users and items, which is a linear kernel. It uses a point-wise loss function for optimization (Hong-Jian et al., 2017).
In addition, following the choices that we identified in Eq. (5), we consider two versions of J-NCF:
This is J-NCF using element-wise multiplication to combine a user and an item feature vector into the input for the DI layer, which corresponds to a linear kernel.
This is J-NCF using concatenation to combine a user and an item feature vector into the input for the DI layer, which lets the DI layers combine them in a non-linear way.
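The two ways of combining feature vectors can be illustrated with a minimal NumPy sketch. The function name `combine`, the mode strings, and the toy vectors are illustrative assumptions and are not taken from the paper's implementation.

```python
import numpy as np

def combine(user_vec, item_vec, mode):
    """Build the DI-network input from a user and an item feature vector.

    `mode` is a hypothetical switch between the two variants in the text.
    """
    if mode == "multiply":
        # Element-wise product: output has the same length as each input.
        return user_vec * item_vec
    if mode == "concat":
        # Concatenation: output length is the sum of the input lengths.
        return np.concatenate([user_vec, item_vec], axis=-1)
    raise ValueError(f"unknown mode: {mode}")

# Toy feature vectors (made-up values, for illustration only).
user_vec = np.array([0.5, -1.0, 2.0])
item_vec = np.array([2.0, 3.0, 0.5])

multiplied = combine(user_vec, item_vec, "multiply")
concatenated = combine(user_vec, item_vec, "concat")
```

Summing the entries of `multiplied` gives the dot product of the two vectors, which is why the multiplication variant is closely tied to the linear kernel used by MF-style models, whereas the concatenated vector leaves the form of the interaction entirely to the subsequent layers.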
We list all the models to be discussed in Table 2.
|Model||Description||Source|
|Item-pop||A typical recommendation approach, which ranks items based on the number of interactions.||(Adomavicius and Tuzhilin, 2005)|
|BPR||A recommendation method using a pairwise loss function to optimize an MF model based on implicit feedback.||(Rendle et al., 2009)|
|NCF||A state-of-the-art neural based method for recommender systems.||(He et al., 2017)|
|DMF||A method using multi-layer perceptrons for rating matrix factorization.||(Hong-Jian et al., 2017)|
|J-NCF||A J-NCF model using element-wise multiplication for combining a user and an item feature vector as the input for the DI layer.||This paper|
|J-NCF||A J-NCF model using concatenation for combining a user and an item feature vector as the input for the DI layer.||This paper|
|J-NCF||A J-NCF model with only point-wise loss based on Eq. (10).||This paper|
|J-NCF||A J-NCF model with only pair-wise loss based on Eq. (11).||This paper|
|J-NCF||A J-NCF model with our designed loss function in Eq. (13).||This paper|
|J-NCF||A J-NCF model with both explicit and implicit feedback in the input and the loss function.||This paper|
|J-NCF||A J-NCF model with only implicit feedback in the input and the loss function.||This paper|
4.2. Datasets and experimental setup
We use three publicly available datasets to evaluate our models and the baselines:
MovieLens, which contains several rating datasets from the MovieLens web site (https://grouplens.org/datasets/movielens/). The datasets are collected over various periods of time, depending on the size of the set (He et al., 2017; Hong-Jian et al., 2017). We use two sets for our experiments, i.e., MovieLens 100K (ML100K), containing 100,000 ratings from 943 users on 1,682 movies, and MovieLens 1M (ML1M), containing more than 1 million ratings from 6,040 users on 3,706 movies.
Amazon Electronics (AEle), which is a larger and sparser dataset than the other datasets used in our paper (http://jmcauley.ucsd.edu/data/amazon/). It contains 7,824,482 ratings of users on different electronics products. We use it to test the performance of our model when applied to a large and sparse dataset.
For the two MovieLens datasets, we do no further processing because they are already filtered. For the AMovies dataset, following (Hong-Jian et al., 2017; He et al., 2017), we filter the dataset so that, similar to the MovieLens data, only users with at least 20 interactions and items with at least 5 interactions are retained. For the larger dataset AEle, we only do minor filtering on the data, i.e., we remove users with fewer than 2 interactions and items with fewer than 5 interactions. To answer RQ1 to RQ7, we use the ML100K, ML1M, and AMovies datasets to evaluate our models and baselines. As for RQ8 to RQ9, we test the models on all of the datasets. The characteristics of the datasets after preprocessing are summarized in Table 3.
In order to answer RQ6, we plot distributions of users with different numbers of interactions in the ML100K, ML1M, and AMovies datasets in Figure 2.
The x-axis denotes the number of ratings while the y-axis indicates the number of users corresponding to the ratings. We see that the majority of users in the three datasets only have a few ratings, which we regard as “inactive users,” and few “active users” have far more ratings. E.g., in the ML100K dataset, 61.72% of the users have fewer than 100 ratings, 32.66% have between 100 and 300 ratings, and only 5.6% of the users have more than 300 ratings.
As we will see below, the models considered in this paper achieve different scores on datasets with different characteristics, i.e., number of users and number of items (see Section 5). Thus, for RQ7, in order to evaluate the performance of our model on datasets with different degrees of sparsity, we keep the number of users and items the same. Namely, following (Kabbur et al., 2013), for each of the three datasets, i.e., ML100K, ML1M, and AMovies, we create three versions at different sparsity levels with the following steps:
Step 1: We start by randomly choosing a subset of users and items from the original dataset. This dataset is represented with a ‘-1’ suffix.
Step 2: We randomly choose a rating record and check whether the numbers of users and items in the sub-dataset would remain unchanged after removing this record. If so, we remove the record; otherwise, we repeat Step 2 with another record.
Step 3: After several repetitions of Step 2, the first sparser version of the dataset, with the ‘-2’ suffix, is created.
Step 4: We repeat Step 2 and Step 3 on the dataset with the ‘-2’ suffix to create the second sparser version of the dataset, with the ‘-3’ suffix, in the same way.
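The record-removal loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `sparsify`, its parameters, and the toy rating list are hypothetical, the initial random selection of users and items is omitted, and the number of records removed per level is an arbitrary choice rather than the value used in the paper.

```python
import random
from collections import Counter

def sparsify(ratings, n_remove, seed=0, max_tries=100_000):
    """Remove up to `n_remove` records while keeping every user and item.

    ratings: list of (user, item, rating) tuples.
    A record is removed only if its user and its item each keep at least
    one other record, so the numbers of users and items stay unchanged.
    """
    rng = random.Random(seed)
    remaining = list(ratings)
    user_count = Counter(u for u, _, _ in remaining)
    item_count = Counter(i for _, i, _ in remaining)
    removed, tries = 0, 0
    while removed < n_remove and tries < max_tries and remaining:
        tries += 1
        idx = rng.randrange(len(remaining))
        user, item, _ = remaining[idx]
        if user_count[user] > 1 and item_count[item] > 1:
            # Swap-remove the chosen record in O(1).
            remaining[idx] = remaining[-1]
            remaining.pop()
            user_count[user] -= 1
            item_count[item] -= 1
            removed += 1
    return remaining

# Toy dense dataset: 5 users x 6 items (made-up ratings).
level_1 = [(u, i, 1 + (u + i) % 5) for u in range(5) for i in range(6)]
level_2 = sparsify(level_1, n_remove=10, seed=1)
level_3 = sparsify(level_2, n_remove=10, seed=2)
```

Each level keeps the same user and item sets as the previous one, so the density (number of records divided by the number of users times the number of items) is the only property that changes between levels.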
The characteristics of the datasets are summarized in Table 4.
4.2.2. Experimental setup.
For evaluation, we use a leave-one-out strategy, which has been used widely in DL-based recommender systems (Hong-Jian et al., 2017; He et al., 2017, 2016). The training set consists of all but the last interaction of every user; the test set contains the latest interaction of every user. When testing, it is time-consuming to give ranking predictions to all items for every user. Thus, following He et al. (2017) and Hong-Jian et al. (2017), we randomly sample 100 items with which the user has not interacted and then rank the test item among these 100 samples. Although using this sampling strategy during evaluation may overestimate the performance of all algorithms, Bellogin et al. (2011) and Hidasi and Karatzoglou (2018) have pointed out that the comparison among algorithms still remains fair.
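The leave-one-out split and the sampling of 100 unseen items can be sketched as follows. The function names, the `(user, item, timestamp)` record layout, and the toy data are assumptions made for illustration; they do not come from the paper's code.

```python
import random
from collections import defaultdict

def leave_one_out(interactions):
    """Split (user, item, timestamp) records into train and test.

    Returns (train, test): train maps user -> list of earlier items,
    test maps user -> the item of the user's latest interaction.
    """
    by_user = defaultdict(list)
    for user, item, timestamp in interactions:
        by_user[user].append((timestamp, item))
    train, test = {}, {}
    for user, events in by_user.items():
        events.sort()  # oldest first
        *history, (_, last_item) = events
        train[user] = [item for _, item in history]
        test[user] = last_item
    return train, test

def sample_candidates(user, train, test, all_items, n_negatives=100, seed=0):
    """Return the test item followed by `n_negatives` unseen items.

    Requires at least `n_negatives` items the user has not interacted with.
    """
    rng = random.Random(seed)
    seen = set(train[user]) | {test[user]}
    unseen = [item for item in all_items if item not in seen]
    return [test[user]] + rng.sample(unseen, n_negatives)

# Toy data: one user, three interactions, 200 items in the catalogue.
interactions = [(0, 5, 1), (0, 7, 3), (0, 9, 2)]
train, test = leave_one_out(interactions)
candidates = sample_candidates(0, train, test, all_items=range(200))
```

A model would then score the 101 candidates, and the position of the first candidate (the held-out item) in the sorted list determines the evaluation metrics.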
The majority of the recommender system literature applies error metrics for evaluation, e.g., Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). Such classical error criteria do not really measure the top-N recommendation performance (Cremonesi et al., 2010). An extensive evaluation of several state-of-the-art recommender algorithms suggests that algorithms optimized for minimizing RMSE do not necessarily perform as expected in terms of the top-N recommendation task (Cremonesi et al., 2010; Herlocker et al., 2004). Experimental results also show that improvements in terms of RMSE often do not translate into accuracy improvements (Herlocker et al., 2004). Thus, here we choose to use accuracy metrics to examine the recommendation performance (He et al., 2017). Specifically, we use HR and NDCG to evaluate the performance of our models. Hit Ratio (HR) is used to evaluate the precision of the recommender system, i.e., whether the test item is contained in the top-N list. The Normalized Discounted Cumulative Gain (NDCG) measures the ranking accuracy of the recommender system, i.e., whether the test item is ranked at the top of the list.
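With a single held-out test item per user, both metrics reduce to simple functions of the test item's rank. The sketch below uses the common convention for this setting (ideal DCG equal to 1, zero-based rank); the function names and example scores are illustrative assumptions.

```python
import math

def rank_of_test_item(scores, test_index=0):
    """Zero-based rank of the test item: number of candidates scored higher."""
    target = scores[test_index]
    return sum(1 for s in scores if s > target)

def hit_ratio_at_n(rank, n=10):
    """1.0 if the test item appears in the top-n list, else 0.0."""
    return 1.0 if rank < n else 0.0

def ndcg_at_n(rank, n=10):
    """Discounted gain of the single relevant item; the ideal DCG is 1."""
    return 1.0 / math.log2(rank + 2) if rank < n else 0.0

# Made-up scores for five candidates; index 0 is the held-out test item.
scores = [0.8, 0.9, 0.1, 0.85, 0.3]
rank = rank_of_test_item(scores)
hr_10 = hit_ratio_at_n(rank, n=10)
hr_2 = hit_ratio_at_n(rank, n=2)
ndcg_10 = ndcg_at_n(rank, n=10)
```

Averaging these per-user values over all test users gives HR@N and NDCG@N. HR only records whether the item made the list, while NDCG additionally rewards placing it near the top, which is why the two metrics can react differently to a change in the loss function.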
As for parameters, we optimize the hyperparameters by running 100 experiments at randomly selected points of the parameter space. Optimization is done on a validation set, which is partitioned from the training set with the same procedure as the test set (Chen et al., 2018). As for the loss function, we test the parameter from to with step size of in our experiment.
For the neural networks, we randomly initialize model parameters with a Gaussian distribution (mean of 0 and standard deviation of 0.01) and optimize the model with mini-batch Adam (Kingma and Ba, 2014). The batch size and learning rate are set to 256 and 0.0001, respectively. For the baselines, we set the parameters of DMF and NCF following (Hong-Jian et al., 2017) and (He et al., 2017), respectively. For DMF and NCF, we set the batch size to 256, and the learning rate to 0.0001 and 0.001, respectively. For the DF network in the DMF model, we apply two layers with sizes [128, 64]. For the DI network in the NCF model, we employ three hidden layers with sizes [128, 64, 8]. For the DF and DI networks in J-NCF, unless stated otherwise, we employ three layers in the DF network with sizes [256, 128, 64] and two layers in the DI network with sizes [128, 8]. Thus the embedding sizes of users and items are the same in all baseline models and in J-NCF. We also keep the size of the last hidden layer of the DI network in J-NCF the same as in NCF, since it may determine the model capability. We also test our model as well as the baseline models with different numbers of layers to see if deeper layers are beneficial to the overall performance of these models. Unless specified, for all the results presented in this paper, the number of recommendations (N) is equal to 10 (Hong-Jian et al., 2017; He et al., 2017).
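To make the default layer configuration concrete, the following NumPy sketch builds a forward pass with the stated sizes and the stated Gaussian initialization. Several details are assumptions rather than facts from the text: ReLU activations, a sigmoid output unit, zero-initialized biases, reading [128, 8] as two hidden layers on top of a concatenated 128-dimensional input, and random toy inputs. Training with Adam is omitted.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Weights drawn from N(0, 0.01^2) as stated in the text; biases are zero (assumed)."""
    return [(rng.normal(0.0, 0.01, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def relu_mlp(layers, x):
    """Apply each affine layer followed by ReLU (activation is an assumption)."""
    for w, b in layers:
        x = np.maximum(x @ w + b, 0.0)
    return x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_users, n_items = 943, 1682            # ML100K sizes quoted in the text

df_user = init_mlp([n_items, 256, 128, 64], rng)   # DF network for users
df_item = init_mlp([n_users, 256, 128, 64], rng)   # DF network for items
di_hidden = init_mlp([64 + 64, 128, 8], rng)       # DI network on concatenated features
di_out = init_mlp([8, 1], rng)                     # scalar prediction layer (assumed)

def predict(user_rows, item_cols):
    """Score a batch of (user rating row, item rating column) pairs."""
    u = relu_mlp(df_user, user_rows)
    v = relu_mlp(df_item, item_cols)
    h = relu_mlp(di_hidden, np.concatenate([u, v], axis=-1))
    w, b = di_out[0]
    return sigmoid(h @ w + b).ravel()

batch_size = 256
user_rows = rng.integers(0, 6, size=(batch_size, n_items)).astype(float)
item_cols = rng.integers(0, 6, size=(batch_size, n_users)).astype(float)
scores = predict(user_rows, item_cols)
```

The 64-dimensional DF outputs match the embedding size used for DMF, and the 8-unit last hidden layer matches NCF, which is the basis for the comparability argument made in the paragraph above.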
5. Results and Discussion
5.1. Overall performance
To answer RQ1, we examine the recommendation performance of the baselines and the two J-NCF variants; see Table 5.
Let us first consider the baselines. From Table 5, we see that DMF achieves a better performance than the other baselines in terms of HR@10 and NDCG@10. Hence, we only use DMF as the best baseline for comparisons in later experiments. Bayesian Personalized Ranking (BPR) clearly shows higher improvements over the Item-pop baseline in terms of NDCG@10 than in terms of HR@10, which shows that pairwise loss has a strong performance for ranking prediction. The NCF and DMF models both show better performance than the two traditional CF models, which indicates the utility of DL techniques in improving recommendation performance.
Next, we compare the baselines against the J-NCF models. NCF and DMF both lose against the J-NCF models in terms of HR@10 and NDCG@10. This shows that a joint neural network structure that tightly couples deep feature learning and deep interaction modeling helps to improve the recommendation performance. Regarding the J-NCF models, independent of the choice of combining the users’ and items’ vectors, J-NCF achieves a better performance than the DMF baseline, resulting in HR@10 improvements ranging from 5.04% to 8.24% on the ML100K dataset, 5.62% to 10.81% on the ML1M dataset, and 7.21% to 10.21% on the AMovies dataset. NDCG@10 improvements range from 7.22% to 12.42% on the ML100K dataset, 6.25% to 14.24% on the ML1M dataset, and 10.44% to 15.06% on the AMovies dataset. Significant improvements against the baseline in terms of HR@10 and NDCG@10 are observed for both J-NCF and J-NCF at the level, except for J-NCF on the ML100K dataset, for which we observe significant improvements at the level in terms of HR@10 and NDCG@10. The higher improvements in NDCG@10 over HR@10 may be due to the fact that we incorporate pair-wise loss in our loss function, which motivates us to conduct a further investigation to answer RQ3.
Comparing the two J-NCF variants, we see that J-NCF with concatenation achieves the best performance, with improvements of 3.05%, 3.51% and 2.81% in terms of HR@10, and 4.85%, 7.51% and 4.18% in terms of NDCG@10 over J-NCF with element-wise multiplication on the three datasets, respectively. The complex relationship between users and items can be described better with a non-linear kernel than with a linear kernel, which is consistent with the findings in (Liu et al., 2015; He et al., 2017).
5.2. Impact of different loss functions
As we have mentioned in Section 3.3, there are several kinds of pair-wise loss functions that can be incorporated in Eq. (15). When J-NCF combines the point-wise loss, i.e., log loss, with TOP1, TOP1-max, and BPR-max pair-wise losses, it gives rise to the J-NCF, J-NCF and J-NCF models, respectively. Additionally, list-wise loss, i.e., softmax+cross-entropy (XE), can also be applied with J-NCF, which gives rise to the J-NCF model. In order to investigate the impact of various loss functions on J-NCF, we examine the recommendation performance of J-NCF, J-NCF, J-NCF as well as J-NCF models where the parameter in Eq. (15) ranges from to with a step size of . Fig. 3 shows the results.
As for the overall performance, we can see that when applied with a list-wise loss function, J-NCF has the worst performance among the four models. The other three models, which combine pair-wise and point-wise losses, show relatively similar results in terms of HR@10 and NDCG@10. When , it results in J-NCF with only the point-wise loss. When , it leads to J-NCF with only the corresponding pair-wise loss function. With only the point-wise loss, J-NCF performs better in terms of HR@10 but worse in terms of NDCG@10 than J-NCF with only a pair-wise loss. This can be explained by the fact that a pair-wise loss helps J-NCF learn to rank items in the right positions.
In Fig. 3(a), the performance of all models increases from to before a short-term decrease and then a dramatic drop after reaching the peak at . The performance of the three models that use a pair-wise loss is comparable in terms of HR@10. As for NDCG@10, shown in Fig. 3(b), J-NCF shows better performance than the other two models and achieves the highest point at .
Regarding the performance on the ML1M dataset, similar trends can be found in Fig. 3(c) and Fig. 3(d) as in Fig. 3(a) and Fig. 3(b), respectively. For the AMovies dataset shown in Fig. 3(e) and Fig. 3(f), J-NCF shows slightly better performance than both J-NCF and J-NCF in terms of HR@10, while the performance of J-NCF and J-NCF is similar in terms of NDCG@10, which is a little better than that of J-NCF.
As discussed in (Hidasi and Karatzoglou, 2018), the BPR-max and TOP1-max loss functions have been proposed to overcome vanishing gradients as the number of negative samples increases. Since we use a small number of negative samples in our paper, the performance is relatively similar between the three models, J-NCF, J-NCF and J-NCF. As BPR-max and TOP1-max losses need additional softmax calculations for all negative samples, we apply the TOP1 pair-wise loss in Eq. (15) for J-NCF in the experiments on which we report below.
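A hedged sketch of how a point-wise log loss and the TOP1 pair-wise loss might be mixed is given here. The TOP1 form follows Hidasi and Karatzoglou (2018); the mixing convention (`alpha` weighting the pair-wise term and `1 - alpha` the point-wise term), the function names, and the example scores are assumptions for illustration and may differ from Eq. (15).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(pred, label, eps=1e-12):
    """Point-wise log loss; `label` may be a normalized rating in [0, 1]."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(label * np.log(pred)
                          + (1.0 - label) * np.log(1.0 - pred)))

def top1_loss(pos_score, neg_scores):
    """TOP1 loss for one positive item and its sampled negatives."""
    return float(np.mean(sigmoid(neg_scores - pos_score)
                         + sigmoid(neg_scores ** 2)))

def hybrid_loss(pos_score, neg_scores, pos_label, alpha):
    """Assumed convention: alpha * pair-wise + (1 - alpha) * point-wise."""
    scores = np.concatenate([[pos_score], neg_scores])
    labels = np.concatenate([[pos_label], np.zeros_like(neg_scores)])
    point = log_loss(scores, labels)
    pair = top1_loss(pos_score, neg_scores)
    return alpha * pair + (1.0 - alpha) * point

# Made-up predictions: one observed item and three sampled negatives.
pos_score, neg_scores = 0.9, np.array([0.2, 0.4, 0.1])
point_only = hybrid_loss(pos_score, neg_scores, pos_label=1.0, alpha=0.0)
pair_only = hybrid_loss(pos_score, neg_scores, pos_label=1.0, alpha=1.0)
mixed = hybrid_loss(pos_score, neg_scores, pos_label=1.0, alpha=0.5)
```

Under this convention the two endpoints of `alpha` recover the single-loss models, and TOP1 only needs element-wise operations on the negatives, whereas the max-based variants additionally require a softmax over all negative scores.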
5.3. Utility of hybrid loss function
For RQ3, in order to further investigate the utility of the hybrid loss function (Eq. (15)), we examine the recommendation performance of the J-NCF models under different settings, i.e., J-NCF with only point-wise loss based on Eq. (10) (we incorporate explicit feedback in the same way as Eq. (16)), J-NCF with only pair-wise loss based on Eq. (11), and J-NCF with our designed loss function from Eq. (16). Fig. 4 shows the results.
The overall performance in terms of HR and NDCG increases when the size of the top-N recommended list ranges from 1 to 10, as a larger value of N increases the probability of including a user’s preferred item in the recommendation list. J-NCF with the hybrid loss consistently achieves improvements over DMF as well as the two models with a single loss function across positions, which demonstrates the utility of our newly designed loss function. On the ML100K dataset, it improves by 2.68% and 7.61% over J-NCF with only point-wise loss and J-NCF with only pair-wise loss, respectively, in terms of HR@10; the corresponding improvements in terms of NDCG@10 are 3.99% and 2.36%.
Comparing the two single-loss models, we find that J-NCF with only point-wise loss beats J-NCF with only pair-wise loss in terms of HR, while J-NCF with only pair-wise loss shows more competitive performance in terms of NDCG. This confirms the findings in (Rendle et al., 2009; He et al., 2016) that a pair-wise ranking-aware learner has a strong performance for ranking prediction. This finding motivates us to incorporate both point-wise loss and pair-wise loss into the hybrid loss function. Clearly, all J-NCF based models, i.e., those with point-wise, pair-wise and hybrid loss, show a better performance than DMF, which also indicates that the joint neural structure is effective, i.e., deep interaction modeling can optimize neural matrix factorization and thus improve the recommendation performance.
Comparing the left and right hand sides of Fig. 4, we see that the improvements of J-NCF in terms of NDCG are more significant than those in terms of HR, as indicated by the relative improvements over DMF with different sizes of the recommendation list. In Fig. 4(a), J-NCF shows an 8.78% improvement over DMF in terms of HR at cutoff , a 5.91% improvement at and an 8.24% improvement at on the ML100K dataset. In Fig. 4(b), the improvements in terms of NDCG at cutoff , and are 19.01%, 15.72% and 12.42%, respectively. J-NCF with the hybrid loss function can not only recommend the correct item to a user, but is also competitive in terms of ranking it at the top of the list.
5.4. Number of layers in the networks
In J-NCF, we not only learn features of users and items through the DF neural network with multiple hidden layers, but also model user-item interactions with multi-layer perceptrons in the DI network. Thus it is crucial to see whether DL is helpful in our model. We conduct experiments to examine the performance of J-NCF with various numbers of layers in the DF and DI networks, respectively. In addition, we also test the performance of the best baseline model, i.e., DMF, with different DF networks. The results are shown in Table 6. The number following DF- and DI- in Table 6 denotes the number of layers in the DF network and DI network of J-NCF, respectively.
As shown in Table 6, in terms of HR@10, we can see that with the number of layers increasing, the recommendation performance of J-NCF is improved, which verifies the effectiveness of DL techniques for recommender systems.
Comparing the number of layers in the DI and DF networks, we can find that stacking more layers in the DF network of J-NCF seems more helpful than in the DI network in enhancing the recommendation performance. For example, based on the ML100K dataset, the improvements of the configuration (DF-3, DI-2) over (DF-2, DI-2) are 2.82% and 4.31% in terms of HR@10 and NDCG@10, while the improvements are 1.05% and 2.62% for (DF-2, DI-3) over (DF-2, DI-2). When we stack more than 4 layers in the DI network (e.g., DI-5), the performance of J-NCF no longer increases. However, stacking more layers in the DF network (e.g., DF-5) still seems helpful and the best results produced for each dataset are all based on J-NCF with the (DF-5, DI-4) configuration. This may be because deep layers are more helpful in extracting users’ as well as items’ features and thus enhancing the user-item interactions predictions. It motivates us to incorporate more auxiliary information for exploring users’ and items’ features with deep learning techniques in future work.
As for NDCG@10, a similar phenomenon can be found. However, when comparing the scores of HR@10 and NDCG@10 under the same configurations, we can find that deeper layers can lead to more obvious improvements in terms of NDCG@10 than HR@10 on all of the three datasets. The best performance of J-NCF with (DF-5, DI-4) outperforms the worst performance of J-NCF with (DF-1, DI-1) by 20.52%, 25.37% and 34.52% in terms of HR@10 on the three datasets, respectively. However, the improvements are 28.96%, 63.05% and 53.37% in terms of NDCG@10 on the three datasets.
As for the baseline model DMF shown in the bottom rows in Table 6, when applied with DF-1, J-NCF with DI-1 loses to DMF on all datasets. Similar results can be found with DF-2, except on the ML100K dataset. This can be explained by the fact that the simple concatenation of a user’s and an item’s embeddings with only one MLP layer in J-NCF is not effective for modeling user-item interactions. When applied with more DI layers, J-NCF has better performance than DMF with the same number of DF layers. Additionally, we find that DMF achieves its best performance with DF-2 and that deeper layers do not seem useful for the DMF model, which corresponds to the results in (Hong-Jian et al., 2017). However, J-NCF achieves further improvements when stacking more layers in either the DI or DF network, or both.
5.5. Impact of feedback
In J-NCF, we consider different kinds of user feedback. On the one hand, we use the interaction matrix as the input of the network with Eq. (3), which contains not only implicit feedback but also explicit feedback. On the other hand, our loss function in Eq. (16) employs a normalized strategy in the form of , where denotes the largest rating score of user given to items, to incorporate the explicit feedback. In order to answer RQ5, we conduct experiments to investigate whether the combination of explicit and implicit feedback works for J-NCF with different settings, i.e., J-NCF with both kinds of feedback in the input and the loss function as well as J-NCF with only implicit feedback by labeling 1 for the interactions and 0 for unknown ratings in the input and the loss function. Fig. 5 shows the recommendation performance of J-NCF, J-NCF, DMF and NCF across different numbers of training iterations, respectively.
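The difference between the two feedback settings can be illustrated on a toy rating matrix. The normalization formula is elided in the text above, so the sketch assumes, based on the verbal description, that each rating is divided by the largest rating the same user has given; the matrix values and variable names are made up.

```python
import numpy as np

# Toy user-item rating matrix; 0 marks an unknown rating (made-up values).
ratings = np.array([
    [5, 0, 3, 0],
    [0, 4, 0, 0],
    [1, 0, 0, 2],
], dtype=float)

# Explicit + implicit setting: raw ratings are kept as the network input.
explicit_input = ratings

# Implicit-only setting: 1 for an observed interaction, 0 for unknown.
implicit_input = (ratings > 0).astype(float)

# Assumed normalization: rating divided by the user's largest rating.
user_max = ratings.max(axis=1, keepdims=True)
safe_max = np.where(user_max > 0, user_max, 1.0)  # guard users with no ratings
normalized_labels = ratings / safe_max
```

In the implicit-only setting every observed interaction gets the same label of 1, while the normalized labels keep a graded signal in [0, 1], so a 3-star rating from a user whose maximum is 5 contributes a weaker target than that user's 5-star rating.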
First, from Fig. 5 we can see that J-NCF with both kinds of feedback achieves a competitive performance across all iterations in terms of HR@10 and NDCG@10 on the three datasets. This indicates that the combination of explicit and implicit feedback in the input and the specially designed loss function of J-NCF does help to improve the recommendation performance. Second, as the number of training iterations increases, the recommendation performance of all models first improves and then degrades after reaching a peak. More iterations may lead to overfitting, which hurts the recommendation performance. However, comparing the J-NCF model with the baselines, i.e., DMF and NCF, we find that J-NCF converges to its best performance faster than the other models. For example, on the ML100K dataset, the best result of J-NCF is obtained after the first 9 effective iterations, while DMF and NCF need more training iterations to obtain their best results, i.e., 16 and 14 iterations, respectively. The same phenomenon can be observed on the other two datasets. The optimal numbers of updates needed for J-NCF, DMF and NCF are around 10, 17 and 19 on the ML1M dataset, and 14, 18 and 19 on the AMovies dataset, respectively. Third, comparing the performance in terms of HR@10 and NDCG@10, we find that J-NCF with both kinds of feedback shows larger improvements over J-NCF with only implicit feedback in terms of NDCG@10 than in terms of HR@10. For example, the improvements are 3.72%, 5.22% and 4.89% in terms of HR@10 on the ML100K, ML1M and AMovies datasets, respectively, vs. improvements of 4.61%, 5.58% and 5.31% in terms of NDCG@10. This confirms our hypothesis that incorporating both explicit and implicit feedback can improve the ranking precision for recommendation.
6. Scalability and Sensitivity
In order to answer RQ6 to RQ9, we study the scalability and sensitivity of J-NCF as well as the best baseline DMF when applied in different settings, i.e., with users with various numbers of ratings in Section 6.1, and with datasets with different levels of sparsity in Section 6.2. In addition, we also investigate the performance of the deep learning-based approaches, i.e., J-NCF, DMF and NCF, when applied with a large and sparse dataset in Section 6.3. Moreover, the training and inference time needed for these models on all datasets is discussed in Section 6.4.
6.1. Model scalability with user ratings
In Fig. 2, we have shown that in every dataset most users only have a few ratings; thus it is meaningful to investigate how the performance of J-NCF and DMF varies with different numbers of user ratings. Following (Prem et al., 2013), we look at the performance for users of varying degrees of activity, measured by percentile. For example, in Table 7, we first rank the users according to their number of activities. The 10% mark shows the mean performance across the bottom 10% of users, who are least active; the 90% mark shows the mean performance for all but the top 10% most active users.
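The percentile-based breakdown can be sketched as follows. The function name, the rounding of the cutoff, and the toy activity counts and hit indicators are hypothetical and serve only to illustrate the computation.

```python
import numpy as np

def mean_metric_up_to_percentile(activity, metric, percent):
    """Mean of `metric` over the `percent`% least active users.

    activity: number of interactions per user.
    metric: per-user evaluation score (e.g., the hit indicator for HR@10).
    """
    activity = np.asarray(activity)
    metric = np.asarray(metric, dtype=float)
    order = np.argsort(activity, kind="stable")   # least active users first
    cutoff = max(1, int(round(len(order) * percent / 100.0)))
    return float(metric[order[:cutoff]].mean())

# Made-up data for ten users: interaction counts and per-user hits.
activity = [3, 50, 7, 120, 20, 5, 300, 15, 9, 40]
hits = [0, 1, 0, 1, 1, 0, 1, 1, 0, 1]

bottom_10 = mean_metric_up_to_percentile(activity, hits, 10)
bottom_50 = mean_metric_up_to_percentile(activity, hits, 50)
bottom_90 = mean_metric_up_to_percentile(activity, hits, 90)
```

Because each mark is cumulative, moving from a lower to a higher percentile adds progressively more active users to the average, which is how the change in improvement between the 50% and 90% marks should be read.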
As shown in Table 7, J-NCF outperforms the best baseline model DMF for users across all activity levels, i.e., both the “inactive” users who constitute the majority, and the relatively few “very active” users who give more ratings. In addition, J-NCF always achieves the best performance in terms of HR@10 and NDCG@10. In order to test the robustness of J-NCF under different settings, i.e., for its two variants, we conduct t-tests between each of the two versions of J-NCF and DMF. Significant improvements against the baseline DMF in terms of HR@10 and NDCG@10 are observed for both J-NCF variants at the level across all activity levels, except for J-NCF on the ML100K dataset with 50% and 90% users, for which we observe significant improvements at the level in terms of HR@10 and NDCG@10.
Specifically, J-NCF shows larger improvements over the DMF model for “inactive” users than for “very active” users. For example, when incorporating users with more interactions, i.e., from 50% to 90%, the improvements change from 11.08% to 7.85% in terms of HR@10, and from 9.57% to 7.32% in terms of NDCG@10 on the ML100K dataset. This may be because the “very active” users have many interactions with items that have few ratings, and collaborative filtering lacks information for recommending such items based only on the rating matrix. This naturally suggests a line of future work in which one would extend J-NCF with more auxiliary information, such as content information, to explore more accurate relationships between items.
To conclude and answer RQ6, the J-NCF models can beat the best baseline model for users across all activity levels. J-NCF shows the best performance in all datasets. In addition, for “inactive” users, J-NCF shows larger improvements over DMF than for “very active” users.
6.2. Sensitivity to data sparsity
To investigate the sensitivity of J-NCF to different levels of data sparsity, we examine the recommendation performance on datasets with different levels of sparsity, as presented in Table 4. Fig. 6 shows the results.
The overall performance of all models on the AMovies dataset is better than that on the other two datasets. That is to say, the recommendation performance may be influenced by the size of a dataset. Thus, in order to investigate the model sensitivity across datasets with different degrees of sparsity, it is essential to keep the number of users and items in the same scale for the datasets.
From Fig. 6, in particular, for the ML100K dataset, the ML1M dataset and the AMovies dataset respectively, we see that the J-NCF models outperform the baseline model DMF across all sub datasets with different degrees of sparsity in terms of HR@10 and NDCG@10. In addition, we find that when the density of those datasets goes down, the performance of all models decreases. Thus it is interesting to investigate the robustness of J-NCF when it is applied to sparse datasets. We find that when applied on small datasets, e.g., subsets of ML100K, our best model, i.e., J-NCF, shows higher improvements against DMF on sparser datasets. For example, J-NCF achieves 4.91% and 9.12% improvements over DMF in terms of HR@10 and NDCG@10 on the ML100K-1 subset (Density), while the improvements on the ML100K-3 subset (Density) are 7.77% and 12.02% in terms of HR@10 and NDCG@10, respectively. However, when applied on larger datasets with more users and items, i.e., subsets of ML1M and AMovies, J-NCF shows higher improvements against DMF on denser datasets. For instance, J-NCF achieves 11.13% improvements over DMF in terms of HR@10 on the ML1M-1 subset (Density), while the improvements on the ML1M-3 subset (Density) are 6.53% in terms of HR@10. These results may indicate that when the dataset becomes larger and sparser, it will be more difficult for models to improve their recommendation performances, which motivates us to conduct a further investigation to answer RQ8; see Section 6.3 below.
In addition, comparing the left and right-hand side plots in Fig. 6, we find that J-NCF shows a better performance in terms of NDCG@10 than HR@10. For example, the improvements of J-NCF over DMF are 9.19%, 8.28% and 15.11% in terms of HR@10 on ML100K-1, ML100K-2 and ML100K-3 datasets, respectively, while the improvements are 10.11%, 10.65% and 20.55% in terms of NDCG@10. This result is consistent with our findings in Section 5.3.
Thus in answer to RQ7, the J-NCF models outperform the best baseline model DMF across all datasets with different degrees of sparsity in terms of both metrics. Specifically, when applied on large datasets, i.e., ML1M and AMovies, J-NCF shows higher improvements against DMF on denser datasets. In addition, the improvements of J-NCF over DMF in terms of NDCG@10 are larger than in terms of HR@10.
6.3. Performance with a large and sparse dataset
For RQ8, in order to see if our model is able to work well on a large and sparse dataset, we examine our model as well as two baseline models, i.e., NCF and DMF, on the Amazon Electronics (AEle) dataset, which is larger and sparser than the MovieLens and Amazon Movies datasets. Fig. 7 shows the performance of the three models with different sizes of top-N recommended lists.
It is clear that J-NCF outperforms DMF as well as NCF in terms of HR and NDCG across different numbers of recommendations. With the size of top-N recommended lists ranging from 1 to 10, the overall performance of all models increases, which is consistent with the conclusion in Section 5.3. Comparing the results shown in Fig. 7(a) and Fig. 7(b), the improvements of J-NCF over DMF in terms of NDCG are more significant than those in terms of HR. For example, when and , the improvements of J-NCF over DMF in terms of HR are 5.88% and 4.62%, while the improvements are 6.12% and 5.82% in terms of NDCG, respectively. To conclude and answer RQ8, J-NCF can also work well with large and sparse datasets, especially in ranking items correctly.
6.4. Training and inference time
To answer RQ9, we investigate the scalability of J-NCF regarding training and inference time in Table 8. In the “Training” part of Table 8, “Total time” denotes the time needed for training the model to its best performance, and “Average epoch” denotes the average training time for a single epoch. In the “Prediction” part, “Total time” denotes the prediction time needed for the whole test set. Since the test set contains the latest interaction of every user, “Average ranking” indicates the time needed for providing a ranked list containing the top 10 recommendations for a single user.
Table 8. Training time (total and average per epoch) and prediction time (total and average per ranking), in seconds.
As we can see in Table 8, when the size of the dataset becomes larger, the time needed for both training and prediction increases significantly for all models. NCF consistently costs the least time among the three models for both the training and prediction processes on all datasets. For the training process, the average training time for one epoch of J-NCF is slightly higher than that of DMF. However, the total training time for J-NCF is less than for DMF. This can be explained by the fact that J-NCF needs fewer iterations to obtain its best results than DMF, as indicated in Section 5. Thus, J-NCF costs less time for training to its best performance than DMF. For the prediction process, although the total time needed for J-NCF and DMF is more than for NCF, the three models cost roughly similar amounts of time for providing a top 10 ranked list for a single user, which is around a few milliseconds.
7. Conclusions and Future Work
We have proposed a joint neural collaborative filtering model, J-NCF, for recommender systems. J-NCF uses a unified deep neural network to tightly couple two important parts in a recommender system, i.e., deep feature learning of users and items, and deep modeling of user-item interactions. For the user and item feature extraction, we use a deep neural network with matrix factorization and a combination of explicit and implicit feedback as input. Then we adopt another neural network for modeling user-item interactions using the feature vectors as inputs. Thus, J-NCF enables the two parts to be optimized with each other through a joint training process. In order to make J-NCF fit the top-N recommendation task, we design a new loss function that incorporates information from both pair-wise and point-wise loss.
The experimental results confirm the effectiveness of J-NCF. We have also experimentally investigated the performance of J-NCF under various settings, e.g., with different loss functions, with varying numbers of layers in the networks, and with different types of feedback as input. The results confirm the effectiveness of our hybrid loss function and demonstrate that J-NCF performs better with more layers in the networks and when using the combination of implicit and explicit feedback as input.
In addition, we have investigated the robustness of J-NCF under different degrees of data sparsity and different numbers of user ratings. J-NCF outperforms the best baseline model, DMF, for users across all activity levels, especially for “inactive users,” who constitute the majority of users in the datasets. As for datasets with different levels of sparsity, J-NCF in general shows more competitive recommendation performance than the state-of-the-art baseline model DMF on all datasets. Moreover, we have tested the J-NCF model on a large and sparse dataset, i.e., AEle, and the results show that J-NCF also outperforms state-of-the-art baseline models on this dataset.
As to future work, first, we plan to extend J-NCF with more auxiliary information (Zheng et al., 2017; Wang et al., 2017; Cai and de Rijke, 2016a, b), such as the content information of items as well as reviews, to obtain more informed representations of users and items. As collaborative filtering usually suffers from limited performance due to the sparsity of user-item interactions (Shi et al., 2017), auxiliary information could be used to boost the performance. It would also be interesting to explore heterogeneous information in a knowledge base to improve the quality of recommender systems with deep learning (Zhang et al., 2016). Second, we plan to explore the context information of a user in a session with recurrent neural networks to deal with dynamic aspects of recommender systems (Chatzis et al., 2017; Hidasi et al., 2016b; Cai et al., 2016b, a). In addition, an attention mechanism could be applied to J-NCF, which can filter out uninformative content and select the most representative items while providing good interpretability (Chen et al., 2017). Finally, as we have found that J-NCF is computationally more expensive than NCF, we plan to optimize the structure and implementation details of our model to make it more efficient.
Acknowledgements. We would like to thank our anonymous reviewers for their helpful comments and valuable suggestions.
- Adeniyi et al. (2016) David Adedayo Adeniyi, Zhaoqiang Wei, and Yongquan Yang. 2016. Automated web usage data mining and recommendation system using K-Nearest Neighbor (KNN) classification method. Applied Computing and Informatics 12, 1 (2016), 90–108.
- Adomavicius and Tuzhilin (2005) Gediminas Adomavicius and Alexander Tuzhilin. 2005. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Transactions on Knowledge and Data Engineering 17, 6 (2005), 734–749.
- Basiliyos et al. (2017) Betru Basiliyos, Tilahun, Onana Charles, Awono, and Batchakui Bernabe. 2017. Deep Learning Methods on Recommender System: A Survey of State-of-the-art. International Journal of Computer Applications 162, 10 (2017), 17–22.
- Bellogin et al. (2011) Alejandro Bellogin, Pablo Castells, and Ivan Cantador. 2011. Precision-oriented Evaluation of Recommender Systems: An Algorithmic Comparison. In RecSys ’11. ACM, 333–336.
- Cai and de Rijke (2016a) Fei Cai and Maarten de Rijke. 2016a. Learning from homologous queries and semantically related terms for query auto completion. Information Processing & Management 52, 4 (2016), 628–643.
- Cai and de Rijke (2016b) Fei Cai and Maarten de Rijke. 2016b. A Survey of Query Auto Completion in Information Retrieval. Foundations and Trends in Information Retrieval 10, 4 (2016), 273–363.
- Cai et al. (2016a) Fei Cai, Shangsong Liang, and Maarten de Rijke. 2016a. Prefix-Adaptive and Time-Sensitive Personalized Query Auto Completion. IEEE Transactions on Knowledge and Data Engineering 28, 9 (Sep 2016), 2452–2466.
- Cai et al. (2016b) Fei Cai, Ridho Reinanda, and Maarten de Rijke. 2016b. Diversifying Query Auto-Completion. ACM Transactions on Information Systems 34, 4 (June 2016), 25:1–25:33.
- Chatzis et al. (2017) Sotirios P. Chatzis, Panayiotis Christodoulou, and Andreas S. Andreou. 2017. Recurrent Latent Variable Networks for Session-Based Recommendation. In DLRS ’17. 38–45.
- Chen et al. (2017) Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive Collaborative Filtering: Multimedia Recommendation with Item- and Component-Level Attention. In SIGIR ’17. ACM, 335–344.
- Chen et al. (2018) Wanyu Chen, Fei Cai, Honghui Chen, and Maarten de Rijke. 2018. Attention-based Hierarchical Neural Query Suggestion. In SIGIR ’18. ACM, 1093–1096.
- Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. 2016. Wide & Deep Learning for Recommender Systems. In DLRS 2016. ACM, 7–10.
- Cremonesi et al. (2010) Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of Recommender Algorithms on Top-n Recommendation Tasks. In RecSys ’10. ACM, 39–46.
- Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-machine Based Neural Network for CTR Prediction. In IJCAI ’17. AAAI Press, 1725–1731.
- He and Chua (2017) Xiangnan He and Tat-Seng Chua. 2017. Neural Factorization Machines for Sparse Predictive Analytics. In SIGIR ’17. ACM, 355–364.
- He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In WWW ’17. ACM, 173–182.
- He et al. (2016) Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. 2016. Fast Matrix Factorization for Online Recommendation with Implicit Feedback. In SIGIR ’16. ACM, 549–558.
- Herlocker et al. (2004) Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl. 2004. Evaluating Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems 22, 1 (2004), 5–53.
- Hidasi and Karatzoglou (2018) Balázs Hidasi and Alexandros Karatzoglou. 2018. Recurrent Neural Networks with Top-k Gains for Session-based Recommendations. In CIKM ’18. ACM, 843–852.
- Hidasi et al. (2016a) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016a. Session-based Recommendations with Recurrent Neural Networks. In ICLR ’16.
- Hidasi et al. (2016b) Balázs Hidasi, Massimo Quadrana, Alexandros Karatzoglou, and Domonkos Tikk. 2016b. Parallel Recurrent Neural Network Architectures for Feature-rich Session-based Recommendations. In RecSys ’16. ACM, 241–248.
- Hong-Jian et al. (2017) Xue Hong-Jian, Dai Xinyu, Zhang Jianbing, Huang Shujian, and Chen Jiajun. 2017. Deep Matrix Factorization Models for Recommender Systems. In IJCAI ’17. 3203–3209.
- Huang et al. (2013) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning Deep Structured Semantic Models for Web Search Using Clickthrough Data. In CIKM ’13. ACM, 2333–2338.
- Kabbur et al. (2013) Santosh Kabbur, Xia Ning, and George Karypis. 2013. FISM: Factored item similarity models for Top-N recommender systems. In KDD ’13. ACM, 659–667.
- Kim et al. (2016) Donghyun Kim, Chanyoung Park, Jinoh Oh, Sungyoung Lee, and Hwanjo Yu. 2016. Convolutional Matrix Factorization for Document Context-Aware Recommendation. In RecSys ’16. 233–240.
- Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 (2014).
- Koren (2008) Yehuda Koren. 2008. Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model. In KDD ’08. ACM, 426–434.
- Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (2009), 30–37.
- Li et al. (2015) Sheng Li, Jaya Kawale, and Yun Fu. 2015. Deep Collaborative Filtering via Marginalized Denoising Auto-encoder. In CIKM ’15. ACM, 811–820.
- Lian et al. (2017) Jianxun Lian, Fuzheng Zhang, Xing Xie, and Guangzhong Sun. 2017. CCCFNet: A Content-Boosted Collaborative Filtering Neural Network for Cross Domain Recommender Systems. In WWW ’17. ACM, 817–818.
- Linden et al. (2003) Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon.com Recommendations: Item-to-Item Collaborative Filtering. IEEE Internet Computing 7, 1 (2003), 76–80.
- Liu and Wu (2017) Juntao Liu and Caihua Wu. 2017. Deep Learning Based Recommendation: A Survey. In ICISA ’17. 451–458.
- Liu et al. (2015) Xiaomeng Liu, Yuanxin Ouyang, Wenge Rong, and Zhang Xiong. 2015. Item Category Aware Conditional Restricted Boltzmann Machine Based Recommendation. In ICONIP ’15. 609–616.
- Onal et al. (2018) Kezban Dilek Onal, Ye Zhang, Ismail Sengor Altingovde, Md Mustafizur Rahman, Pinar Karagoz, Alex Braylan, Brandon Dang, Heng-Lu Chang, Henna Kim, Quinten McNamara, Aaron Angert, Edward Banner, Vivek Khetan, Tyler McDonnell, An Thanh Nguyen, Dan Xu, Byron C. Wallace, Maarten de Rijke, and Matthew Lease. 2018. Neural information retrieval: At the end of the early years. Information Retrieval Journal 21, 2–3 (June 2018), 111–182.
- Paterek (2007) Arkadiusz Paterek. 2007. Improving regularized singular value decomposition for collaborative filtering. In KDD ’07. ACM.
- Prem et al. (2013) Gopalan Prem, Jake M. Hofman, and David M. Blei. 2013. Scalable Recommendation with Poisson Factorization. arXiv preprint arXiv:1311.1704 (2013).
- Rendle et al. (2009) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In UAI ’09. 452–461.
- Salakhutdinov and Mnih (2007) Ruslan Salakhutdinov and Andriy Mnih. 2007. Probabilistic Matrix Factorization. In NIPS ’07. Curran Associates Inc., 1257–1264.
- Salakhutdinov et al. (2007) Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. 2007. Restricted Boltzmann Machines for Collaborative Filtering. In ICML ’07. 791–798.
- Sarwar et al. (2001) Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based Collaborative Filtering Recommendation Algorithms. In WWW ’01. ACM, 285–295.
- Sarwar et al. (2000) Badrul Munir Sarwar, George Karypis, Joseph A. Konstan, and John Thomas Riedl. 2000. Application of Dimensionality Reduction in Recommender System–A Case Study. In ACM WebKDD Workshop. ACM.
- Sedhain et al. (2015) Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. AutoRec: Autoencoders Meet Collaborative Filtering. In WWW ’15. ACM, 111–112.
- Shi et al. (2017) Lei Shi, Wayne Xin Zhao, and Yi-Dong Shen. 2017. Local Representative-Based Matrix Factorization for Cold-Start Recommendation. ACM Transactions on Information Systems 36, 2 (Aug. 2017), 22:1–22:28.
- Su and Khoshgoftaar (2009) Xiaoyuan Su and Taghi M. Khoshgoftaar. 2009. A Survey of Collaborative Filtering Techniques. Advances in Artificial Intelligence 2009 (2009), Article 4.
- Trapit et al. (2016) Bansal Trapit, Belanger David, and McCallum Andrew. 2016. Ask the GRU: Multi-task Learning for Deep Text Recommendations. In RecSys ’16. 107–114.
- Truyen et al. (2009) Tran The Truyen, Dinh Q. Phung, and Svetha Venkatesh. 2009. Ordinal Boltzmann Machines for Collaborative Filtering. In UAI ’09. 548–556.
- van den Oord et al. (2013) Aaron van den Oord, Sander Dieleman, and Benjamin Schrauwen. 2013. Deep Content-based Music Recommendation. In NIPS ’13. 2643–2651.
- Wang and Blei (2011) Chong Wang and David M. Blei. 2011. Collaborative Topic Modeling for Recommending Scientific Articles. In KDD ’11. 448–456.
- Wang et al. (2015) Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative Deep Learning for Recommender Systems. In KDD ’15. ACM, 1235–1244.
- Wang et al. (2017) Suhang Wang, Yilin Wang, Jiliang Tang, Kai Shu, Suhas Ranganath, and Huan Liu. 2017. What Your Images Reveal: Exploiting Visual Contents for Point-of-Interest Recommendation. In WWW ’17. 391–400.
- Wang et al. (2019) Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural Graph Collaborative Filtering. In SIGIR ’19. ACM.
- Wu et al. (2016) Yao Wu, Christopher DuBois, Alice X. Zheng, and Martin Ester. 2016. Collaborative Denoising Auto-Encoders for Top-N Recommender Systems. In WSDM ’16. ACM, 153–162.
- Zhang et al. (2016) Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. 2016. Collaborative Knowledge Base Embedding for Recommender Systems. In KDD ’16. ACM, 353–362.
- Zhang et al. (2017) Shuai Zhang, Lina Yao, and Aixin Sun. 2017. Deep Learning based Recommender System: A Survey and New Perspectives. arXiv preprint arXiv:1707.07435 (2017).
- Zheng et al. (2017) Lei Zheng, Vahid Noroozi, and Philip S. Yu. 2017. Joint Deep Modeling of Users and Items Using Reviews for Recommendation. In WSDM ’17. ACM, 425–434.
- Zheng et al. (2016) Yin Zheng, Bangsheng Tang, Wenkui Ding, and Hanning Zhou. 2016. A Neural Autoregressive Approach to Collaborative Filtering. In ICML ’16. 764–773.