1. Introduction
Recommender systems are an effective solution to help people cope with an increasingly complex information landscape. Collaborative Filtering (CF) approaches have been widely investigated and used for personalized recommendation (Zhang et al., 2017; Adomavicius and Tuzhilin, 2005). Many traditional CF techniques are based on Matrix Factorization (MF) (Zhang et al., 2017). They characterize users and items by latent factors that are extracted from the user-item rating matrix. In the latent space, traditional CF methods, such as the Latent Factor Model (LFM) (Koren et al., 2009), often predict a user's preference for an item with a linear kernel, i.e., a dot product of their latent factors, which may not capture the complex structure of user-item interactions well.
Recently introduced Deep Learning (DL)-based approaches to recommender systems overcome shortcomings of conventional approaches, such as handling dynamic user preferences and intricate relationships within the data itself, and are able to achieve high recommendation quality. Today's DL-based approaches to recommender systems mostly use DL to explore auxiliary information, e.g., textual descriptions of items or audio features of music, which is then used to model item features (Kim et al., 2016; Wang and Blei, 2011; Wang et al., 2015). For the user-item rating matrix, recent work mostly continues to use traditional MF-based approaches. Restricted Boltzmann Machines (Salakhutdinov et al., 2007) seem to have been the first neural model of the user-item rating matrix to obtain competitive results over traditional methods; they use a two-layer network rather than a deep learning structure. Another recent approach, Collaborative Denoising AutoEncoder (CDAE) (Wu et al., 2016), is mainly designed for rating prediction with a one-hidden-layer neural network. Neural Collaborative Filtering (NCF) (He et al., 2017)
uses deep neural networks to learn the interaction function from data with multi-layer perceptrons, yet it does not explore users' and items' features that are known to be helpful in improving
CF recommendation performance. CDAE and NCF only exploit implicit feedback for recommendations instead of explicit rating feedback. Deep Matrix Factorization (DMF) (HongJian et al., 2017) models the user-item rating matrix with a neural network that maps the users' and items' features into a low-dimensional space with non-linear projections; it then uses an inner product to compute interactions between users and items, applying the same linear kernel (i.e., dot product) as LFM (Koren et al., 2009). We hypothesize that DL should be able to effectively capture both non-linear and non-trivial user-item relationships as well as users' (items') characteristics with multi-layer projections (Zhang et al., 2017). We propose a Joint Neural Collaborative Filtering (JNCF
) model that enables two processes, feature extraction and user-item interaction modeling, to be trained jointly in a unified
DL structure. The JNCF model contains two main networks for recommendation. The first network uses the rating information of a user (an item) as the network input, and outputs a vector representation for the user (the item). Then, using the combination of a user's and an item's vectors as input, the second neural network models the user-item interactions and outputs the predicted rating of the user for the item. Thus, these two networks can be coupled tightly and trained jointly in a unified structure. Interaction modeling can optimize the feature learning process, and more accurate feature representations can, in turn, improve the user-item interaction prediction. We take both implicit and explicit feedback, as well as pointwise and pairwise loss, into account to enhance the prediction performance. In contrast, previous neural approaches such as
CDAE, NCF and DMF are all optimized only with pointwise loss functions and leave dealing with pairwise loss as future work. To the best of our knowledge, in the area of recommender systems ours is the first attempt to use a joint neural network to tightly couple feature learning and interaction modeling with the rating matrix. JNCF allows these two processes to optimize each other through joint training and thereby improves recommendation performance.
Our experiments on real-world datasets, including the MovieLens dataset and the Amazon Movies dataset, show that JNCF outperforms state-of-the-art baselines in prediction accuracy, with improvements of up to 8.24% on the MovieLens 100K dataset, 10.81% on the MovieLens 1M dataset, and 10.21% on the Amazon Movies dataset in terms of HR@10. NDCG@10 improvements over the best baseline model are 12.42% on the MovieLens 100K dataset, 14.24% on the MovieLens 1M dataset, and 15.06% on the Amazon Movies dataset, respectively. In addition, we investigate the scalability and sensitivity of JNCF with different degrees of sparsity and different numbers of users' ratings. Our experimental results indicate that JNCF achieves competitive recommendation performance when compared to the best state-of-the-art model.
Our contributions in this paper are:

We design a Joint Neural Collaborative Filtering model (JNCF) for recommendation, which enables deep feature learning and deep user-item interaction modeling to be coupled tightly and jointly optimized in a single neural network.

We design a new loss function that explores the information contained in both pointwise and pairwise loss as well as implicit and explicit feedback.

We analyse the recommendation performance of JNCF as well as baseline models and find that JNCF consistently yields the best performance. JNCF also shows competitive improvements over the best baseline model for inactive users and different degrees of data sparsity.
We summarize related work in Section 2. Our approach, JNCF, is described in Section 3. Section 4 presents our experimental setup. In Section 5, we report our results to demonstrate the recommendation performance of JNCF. We also investigate the scalability and sensitivity of our model as well as other baselines in Section 6. Finally, we conclude our work in Section 7, where we also suggest future research directions.
2. Related work
We first look back at traditional approaches to recommender systems in Section 2.1, which focus on modeling the similarity between users (items) for recommendation. Then, as applying deep learning techniques to recommender systems is gaining momentum due to their state-of-the-art performance and high-quality recommendations, we summarize recent work on deep learning-based recommender systems in Section 2.2. Such systems can provide a better understanding of users' demands, items' characteristics, and the historical interactions between them by extracting item features from auxiliary information, e.g., the content of movies.
2.1. Traditional recommender systems
In many commercial systems, "best bet" recommendations are shown, but the predicted rating values are not. This is usually referred to as a top-N recommendation task, where the goal of the recommender system is to find a few specific items that are supposed to be most appealing to the user. A simple non-personalized scheme, denoted Top Popular (Itempop), recommends the top-N items with the highest popularity (largest number of ratings).
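To make the Itempop scheme concrete, here is a minimal sketch; the interaction data and the helper name `itempop_topn` are illustrative, not from the paper:

```python
from collections import Counter

def itempop_topn(interactions, n):
    """Rank items by interaction count (popularity) and return the top-N.

    `interactions` is a list of (user, item) pairs; this non-personalized
    ranking is identical for every user.
    """
    counts = Counter(item for _, item in interactions)
    return [item for item, _ in counts.most_common(n)]

interactions = [(1, "a"), (2, "a"), (3, "a"), (1, "b"), (2, "b"), (3, "c")]
print(itempop_topn(interactions, 2))  # most popular items first
```

Because the ranking ignores the individual user entirely, Itempop serves as a sanity-check baseline rather than a personalized recommender.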
Most top-N recommender systems are based on collaborative filtering (Adomavicius and Tuzhilin, 2005), where recommendations rely on past behavior (ratings) from users, regardless of domain knowledge (Su and Khoshgoftaar, 2009). We group these CF approaches into two categories, i.e., neighborhood-based methods (Sarwar et al., 2001; Linden et al., 2003) and latent factor-based models (Koren et al., 2009; Kabbur et al., 2013). Neighborhood-based models share the typical merits of CF: they concentrate on exploring the similarity among either users or items. For instance, two users are similar because they have rated the same set of items similarly; a dual concept of similarity can be defined among items. Latent factor-based approaches generally model users and items as vectors in the same "latent factor" space by means of a reduced number of hidden factors. In such a space, users and items are directly comparable: the rating of a user on an item is predicted by the proximity (e.g., the inner product) between the related latent factor vectors.
For neighborhood-based models, algorithms centered around user-user similarity typically predict a user's rating of an item based on the ratings expressed for that item by similar users. On the other hand, algorithms centered around item-item similarity compute a user's preference for an item based on her own ratings of similar items. The similarity between two items is measured as the tendency of users to rate them similarly. It is typically based either on the cosine, the adjusted cosine, or (most commonly) the Pearson correlation coefficient (Sarwar et al., 2001). The kNN (k-nearest-neighbor) approach is a representative enhanced neighborhood model (Adeniyi et al., 2016), which, when predicting a user's rating of an item, considers only the k items rated by that user that are most similar to the target item. kNN-based approaches discard items that are poorly correlated to the target item, thus decreasing noise and improving the quality of recommendations. Neighborhood-based approaches resemble the item-item model for user personalization, which is different from our approach based on the user-item model (Sarwar et al., 2001). Thus, we focus on the latent factor modeling approach.

Most research on latent factor modeling is based on factoring the user-item rating matrix, which is known as Singular Value Decomposition (SVD) (Koren et al., 2009). SVD factorizes the user-item rating matrix into a product of two lower-rank matrices, one containing the "user factors," the other containing the "item factors." Then, with an inner product and a bias term, the user's preference towards an item can be generated, i.e.,
(1)  $\hat{r}_{ui} = b_{ui} + \mathbf{p}_u^{\top} \mathbf{q}_i$

where $\mathbf{p}_u$ and $\mathbf{q}_i$ denote the "user factors" and "item factors," respectively, and $b_{ui}$ is a bias term.
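To make the inner-product prediction concrete, a minimal sketch with toy, hand-set factors (not learned values):

```python
import numpy as np

# Latent factors for one user and one item (toy values, not learned).
p_u = np.array([0.5, 1.0, -0.3])   # user factors
q_i = np.array([0.8, 0.2, 0.1])    # item factors
b_ui = 3.1                         # bias term (e.g., global mean + user/item offsets)

# LFM prediction: bias plus the dot product of the latent factors.
r_hat = b_ui + p_u @ q_i
print(round(r_hat, 2))  # prints 3.67
```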
Since conventional SVD is undefined in the presence of unknown values, i.e., missing ratings, several solutions have been proposed. Earlier work addresses this issue by filling the missing ratings with a baseline estimation (Sarwar et al., 2000). However, this leads to a very large, dense user rating matrix, where the factorization process becomes computationally infeasible. Recent work learns factor vectors directly on known ratings through a suitable objective function that minimizes a prediction error. The proposed objective functions are usually regularized in order to avoid overfitting (Paterek, 2007). Typically, gradient descent is applied to minimize the objective function. An advantage of SVD-based approaches is that they can provide recommendations for new users, given their ratings of some items, without retraining the parameters of the model. Thus, for a new user, SVD-based approaches can provide recommendations immediately according to their current ratings. Another model based on SVD, SVD++ (Koren, 2008), incorporates both explicit and implicit feedback, and shows improved performance over many MF models. This is consistent with our motivation of combining explicit and implicit feedback in JNCF. However, applying traditional MF methods to sparse rating matrices can be a non-trivial challenge, with high computational costs for decomposing the rating matrix.
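The learn-on-known-ratings scheme described above can be sketched with stochastic gradient descent on a regularized squared error; the toy dataset, learning rate, and regularization strength are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 2
# Observed ratings only: (user, item, rating) triples; missing entries are skipped.
ratings = [(0, 0, 5.0), (0, 2, 3.0), (1, 0, 4.0), (2, 3, 2.0), (3, 4, 1.0)]

P = 0.1 * rng.standard_normal((n_users, k))  # user factors
Q = 0.1 * rng.standard_normal((n_items, k))  # item factors
lr, reg = 0.05, 0.01

# SGD over known ratings, minimizing the regularized squared prediction error.
for _ in range(200):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

print(float(P[0] @ Q[0]))  # should approach the observed rating 5.0
```

Note that only the observed triples drive the updates, so the dense filled-in matrix of the earlier approach is never materialized.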
Many traditional recommender systems apply a linear kernel, i.e., an inner product of user and item vectors, to model user-item interactions. Linear functions may not be able to give an accurate description of the characteristics of users (items) and user-item interactions: previous work has pointed out, with extensive experiments, that non-linearities have potential advantages for improving the performance of recommender systems (Li et al., 2015; Wu et al., 2016; Sedhain et al., 2015).
2.2. Deep learning-based recommender systems
DL-based recommender systems can be divided into two categories, i.e., single neural network models and deep integration models, depending on whether they rely solely on deep learning techniques or integrate traditional recommendation models with deep learning (Zhang et al., 2017; Su and Khoshgoftaar, 2009; Basiliyos et al., 2017; Liu and Wu, 2017; Zheng et al., 2016; Huang et al., 2013; Onal et al., 2018; He and Chua, 2017; Wang et al., 2019).
For the first category, RBM (Salakhutdinov et al., 2007; Truyen et al., 2009; Liu et al., 2015) is an early neural recommender system. It uses a two-layer undirected graph to model tabular data, such as users' explicit ratings of movies. RBM targets rating prediction, not top-N recommendation, and its loss function considers only the observed ratings. It is technically challenging to incorporate negative sampling into the training of RBMs (Wu et al., 2016), which would be required for top-N recommendation. AutoRec (Sedhain et al., 2015) uses an AutoEncoder for rating prediction. It only considers the observed ratings in the loss function, which does not guarantee good performance for top-N recommendation. To prevent the AutoEncoder from learning an identity function and failing to generalize to unseen data, Denoising AutoEncoders (DAEs) (Li et al., 2015) have been applied to learn from intentionally corrupted inputs. Most of the publications listed so far focus on explicit feedback and, hence, fail to learn users' preferences from implicit feedback. CDAE (Wu et al., 2016) extends DAEs; its input is a user's partially observed implicit feedback. Unlike our work, both DAEs and CDAE use an item-item model for personalization that represents a user with their rated items (Sarwar et al., 2001), and the outputs are the item scores decoded from the learned user representation. The proposed JNCF model, in contrast, is a user-item model: it first learns users' as well as items' representations and then models the interactions between them. Also, CDAE applies a linear kernel to model the relationship between users and items, whereas JNCF applies a non-linear kernel.
Several Convolutional Neural Network (CNN)-based recommendation models have been proposed (Kim et al., 2016; Wang and Blei, 2011; van den Oord et al., 2013). They primarily use CNNs to extract item features from auxiliary information, e.g., review text or contextual information, which we will incorporate in our future work. As for Recurrent Neural Networks, they are used in recommender systems that address the temporal dynamics of ratings and sequential features
(Hidasi et al., 2016a; Trapit et al., 2016). Most closely related to our model is Neural Collaborative Filtering (NCF) (He et al., 2017). It uses multi-layer perceptrons to model the two-way interaction between users and items, which is meant to capture the non-linear relationship between them. Let $\mathbf{v}_u^U$ and $\mathbf{v}_i^I$ denote the side information (e.g., feature information) of user $u$ and item $i$; the prediction rule of NCF is then formulated as follows:
(2)  $\hat{r}_{ui} = f\left(P^{\top} \mathbf{v}_u^U, Q^{\top} \mathbf{v}_i^I \mid P, Q, \Theta_f\right)$

where the function $f$ defines the multi-layer perceptron, $P$ and $Q$ are the latent factor matrices for users and items, and $\Theta_f$ are the parameters of the network. However, NCF randomly initializes the representations of users and items, with just a one-hot identifier for each user and item, which only explores the users' and items' features in a limited manner. JNCF adopts a joint neural network structure to capture both user and item features and user-item relationships, as we hypothesize that the two parts can be optimized through tight coupling and joint training. In addition, NCF only exploits implicit feedback for item recommendations and ignores explicit feedback. An extension based on NCF is CCCFNet (Cross-domain Content-boosted Collaborative Filtering neural Network) (Lian et al., 2017). The basic building block of CCCFNet is also a dual network (for users and items, respectively). It models the user-item interactions in the last layer with the dot product. Unlike our work, it applies content information with a neural network to capture the user's preferences and item features. In addition, DeepFM (Deep Factorization Machine) (Guo et al., 2017) is an end-to-end model that seamlessly integrates a factorization machine and an MLP. However, it also applies content information and thus models higher-order feature interactions via a deep neural network and low-order interactions via a factorization machine. In contrast, JNCF adopts the rating information to explore both user and item features, which is easier to collect.
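The NCF-style prediction rule, an MLP replacing the dot product as the interaction function, can be sketched as follows; all sizes and random weights are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, d = 3, 4, 8

# Embedding tables map one-hot user/item ids to dense vectors.
P = rng.standard_normal((n_users, d))
Q = rng.standard_normal((n_items, d))

# A small MLP plays the role of the interaction function f(. | Theta_f).
W1 = rng.standard_normal((2 * d, 16)); b1 = np.zeros(16)
w2 = rng.standard_normal(16); b2 = 0.0

def ncf_score(u, i):
    x = np.concatenate([P[u], Q[i]])              # concatenated embeddings
    h = np.maximum(0.0, x @ W1 + b1)              # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))   # sigmoid output in (0, 1)

print(ncf_score(0, 2))
```

Indexing `P` and `Q` by id is equivalent to multiplying the embedding tables by one-hot vectors, which is the limitation the text points out: the input carries no rating information.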
As to deep integration models, Collaborative Deep Learning (CDL) (Wang et al., 2015) is a hierarchical Bayesian model that integrates stacked DAEs into traditional probabilistic MF. It differs from our work in two ways: (1) it extracts deep feature representations of items from content information, which we do not explore, and (2) it uses a linear kernel, i.e., the dot product of user and item vectors, to model relations between users and items.
A well-known integration model is DeepCoNN (Deep Cooperative Neural Network) (Zheng et al., 2017), which adopts two parallel convolutional neural networks to model user behavior and item properties from review texts. In the final layer, a factorization machine is applied to capture their interactions for rating prediction. It alleviates the sparsity problem and enhances model interpretability by exploiting a rich semantic representation of the reviews, which could be investigated for JNCF in future work.
Wide & Deep learning (Cheng et al., 2016) and DeepFM (Guo et al., 2017) are two state-of-the-art deep learning-based recommendation approaches. While they focus on incorporating various features of users and items, we aim at exploring deep learning methods for pure collaborative filtering systems. Another integration model that is directly relevant to our work is Deep Matrix Factorization (DMF) (HongJian et al., 2017). It uses a deep MF model with a neural network that maps users and items into a common low-dimensional space. It follows the LFM, using the inner product to compute interactions between users and items. This may partially explain why using deep layers does not help to improve the performance of DMF (see (HongJian et al., 2017, Section 4.4)). Unlike DMF, we apply multi-layer perceptrons to model user-item interactions, using a combination of user and item feature vectors as input. This not only makes our model more expressive in modeling user-item interactions than linear products, but also helps to improve the accuracy of user and item feature extraction.
Compared to the previous work discussed above, our proposed JNCF model combines feature learning and interaction modeling into an end-to-end trainable neural network, which enables the two processes to be optimized jointly. Besides this, we design a new loss function that combines pointwise and pairwise losses and explores the integration of different types of information, i.e., both implicit and explicit feedback.
3. Approach
The proposed model, JNCF, has a joint structure with a lower network used for modeling users' and items' features (the DF network) and a higher network used for modeling user-item interactions (the DI network). These two networks can be trained in a joint manner to give a predicted score of a user's interaction with an item with minimum prediction error. We first describe the notation used and then detail JNCF. We also describe the loss function that we use for optimization.
3.1. Problem formulation and notation
First we describe the task of top-N recommendation that we study in this paper. Suppose that there are $M$ users and $N$ items, denoted as $U$ and $I$, respectively. $R$ denotes the rating information, where $R_{ui}$ is the rating given by user $u$ to item $i$. The task of top-N recommendation is to return a list containing a set of items for an individual user so as to maximize the user's satisfaction.
The main notation we use in this paper is listed in Table 1.
Notation  Description

$U$  the set of users
$I$  the set of items
$R_{ui}$  an explicit rating of user $u$ to item $i$
$\mathbf{v}_u$  a vector containing a user's ratings; serves as input to Net$_u$
$\mathbf{v}_i$  a vector containing an item's ratings; serves as input to Net$_i$
$M$  the number of unique users
$N$  the number of unique items
$W_l$  the weight matrix for the $l$-th layer in Net$_u$ (Net$_i$)
$b_l$  the bias for the $l$-th layer in Net$_u$ (Net$_i$)
$f_l$  the activation function for the $l$-th layer in Net$_u$ (Net$_i$)
$L_1$  the number of layers in the DF network
$W'_l$  the weight matrix for the $l$-th layer in the DI network
$\mathbf{a}_0$  a combination of user and item vectors; serves as input to the DI network
$b'_l$  the bias for the $l$-th layer in the DI network
$g_l$  the activation function for the $l$-th layer in the DI network
$L_2$  the number of layers in the DI network
$\hat{Y}_{ui}$  the predicted score of the interaction between user $u$ and item $i$
$I^+$  the set of items that a user has rated
$I^-$  the set of items that are not rated by a user
$\alpha$  a trade-off parameter controlling the contributions of the pointwise loss and pairwise loss
3.2. Joint Neural Collaborative Filtering
The joint architecture of the proposed JNCF model is shown in Fig. 1. The model contains two main networks: a DF network for modeling features and a DI network for modeling interactions between items and users, where the output of the first network serves as the input of the second.
The DF network is used for modeling users' and items' features. It contains two parallel neural networks coupled in the last layer, one network for users (Net$_u$) and another for items (Net$_i$). We give the ratings of a user and an item as inputs to Net$_u$ and Net$_i$, respectively, which are defined as $\mathbf{v}_u$ and $\mathbf{v}_i$, where
(3)  $\mathbf{v}_u = R_{u*}, \qquad \mathbf{v}_i = R_{*i}$, i.e., the $u$-th row and the $i$-th column of the rating matrix $R$, with unknown ratings set to 0.
We think of ratings as non-trivial explicit feedback from users, as different ratings indicate different levels of user preference towards items. Naturally, there are many unknown ratings between users and items, indicating the non-preference of a user towards an item. Following (He et al., 2017; HongJian et al., 2017), we regard these unknown ratings as a kind of implicit feedback and mark them as zeroes. When pursuing a top-N recommendation task, we are interested only in a correct item ranking and care less about the exact rating scores. This grants us some flexibility, like considering all missing values in the user rating matrix as zeros (Cremonesi et al., 2010). Thus we can take both explicit and implicit feedback into consideration with Eq. (3).
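Building these rating-vector inputs can be sketched as follows; the toy matrix `R` and the variable names are ours:

```python
import numpy as np

# Toy user-item rating matrix; 0 marks an unknown rating, treated as implicit feedback.
R = np.array([
    [5, 0, 3, 0],
    [4, 0, 0, 1],
    [0, 2, 0, 0],
], dtype=float)

v_u = R[0, :]   # input to the user network: user 0's ratings over all items
v_i = R[:, 2]   # input to the item network: all users' ratings of item 2
print(v_u, v_i)
```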
Then, with multi-layer perceptrons (MLPs), the initial high-dimensional rating vectors of users and items are mapped to lower-dimensional vectors. Since Net$_u$ and Net$_i$ only differ in their inputs, we focus on illustrating the process for Net$_u$; the same process is applied for Net$_i$ with similar layers. The MLP model in the DF network is defined as:
(4)  $\mathbf{z}_1 = f_1(W_1 \mathbf{v}_u + b_1), \qquad \mathbf{z}_l = f_l(W_l \mathbf{z}_{l-1} + b_l) \text{ for } l = 2, \ldots, L_1, \qquad \mathbf{z}_u = \mathbf{z}_{L_1}$
where $W_l$, $b_l$ and $f_l$ denote the weight matrix, the bias vector and the activation function for the $l$-th layer. Here, we use a ReLU as the activation function, as it has been shown to be more expressive than other choices and can effectively deal with the vanishing gradient problem (HongJian et al., 2017; He et al., 2017). $L_1$ indicates the number of layers used in the DF network. The output of the final layer, $\mathbf{z}_u$, is a deep representation of the user features; likewise, $\mathbf{z}_i$ is the deep representation of the item features.

As to modeling user-item interactions, traditional LFM methods have been widely used. Such methods are based on the dot product of user and item vectors, which models a user's preference with a linear kernel. In order to investigate the differences between non-linear and linear functions in modeling user-item interactions, we propose two ways to obtain the fused users' and items' feature vectors that serve as the input of the DI network:
(5)  $\mathbf{a}_0 = [\mathbf{z}_u; \mathbf{z}_i] \qquad \text{or} \qquad \mathbf{a}_0 = \mathbf{z}_u \odot \mathbf{z}_i$
The first way is to concatenate the two input vectors $\mathbf{z}_u$ and $\mathbf{z}_i$, which we regard as a non-linear fusion. The second way is to use the element-wise product of the vectors, which uses a linear kernel to generate user-item interactions. Based on these two ways of fusing the input vectors $\mathbf{z}_u$ and $\mathbf{z}_i$, we propose two versions of JNCF, which we discuss in detail in our experiments.
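A sketch of the DF network and the two fusion options follows; the layer sizes, random weights, and the helper name `mlp` are illustrative, and the deep user and item feature vectors are called `z_u` and `z_i` in the code:

```python
import numpy as np

rng = np.random.default_rng(2)

def mlp(x, weights, biases):
    """Multi-layer projection with ReLU activations."""
    for W, b in zip(weights, biases):
        x = np.maximum(0.0, x @ W + b)
    return x

n_items, n_users, d = 6, 4, 3
# The user network maps an n_items-dimensional rating vector to a d-dimensional feature.
Wu = [rng.standard_normal((n_items, 8)), rng.standard_normal((8, d))]
bu = [np.zeros(8), np.zeros(d)]
# The item network does the same for an n_users-dimensional rating vector.
Wi = [rng.standard_normal((n_users, 8)), rng.standard_normal((8, d))]
bi = [np.zeros(8), np.zeros(d)]

v_u = np.array([5, 0, 3, 0, 0, 1.0])
v_i = np.array([4, 0, 2, 0.0])
z_u, z_i = mlp(v_u, Wu, bu), mlp(v_i, Wi, bi)

a0_cat = np.concatenate([z_u, z_i])  # concatenation: input size 2d
a0_mul = z_u * z_i                   # element-wise product: input size d
print(a0_cat.shape, a0_mul.shape)
```

Note the practical consequence of the choice: the concatenation variant doubles the input dimension of the first DI layer, while the product variant keeps it at d.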
Generating $\mathbf{a}_0$ is the first step in modeling user-item interactions. However, it is insufficient for modeling the complex relationship between users and items. Thus, we adopt intermediate hidden layers, to which $\mathbf{a}_0$ is fed, so as to obtain a multi-layer non-linear projection of user-item interactions:
(6)  $\mathbf{a}_l = g_l(W'_l \mathbf{a}_{l-1} + b'_l), \qquad l = 1, \ldots, L_2$
where $W'_l$, $b'_l$ and $g_l$ denote the weight matrix, the bias vector and the activation function for the $l$-th layer in the DI network. A ReLU is applied again as the activation function. $L_2$ indicates the number of layers used in the network. The output of the network is the predicted score of the interaction between user $u$ and item $i$:
(7)  $\hat{Y}_{ui} = \sigma\left(\mathbf{h}^{\top} \mathbf{a}_{L_2}\right)$
where the sigmoid function $\sigma(x) = 1 / (1 + e^{-x})$ restricts the output to $(0, 1)$. The weight vector $\mathbf{h}$ can be learnt through the training process with back propagation to control the weight of each dimension in $\mathbf{a}_{L_2}$.

3.3. Loss function
Objective functions for training recommender systems can be divided into three groups: pointwise, pairwise and listwise. Pointwise objectives aim at obtaining accurate ratings, which is more applicable to rating prediction tasks (Kabbur et al., 2013). Pairwise objectives focus on users' preferences towards pairs of items and are usually considered more suitable for top-N recommendation (He et al., 2016, 2017; Kabbur et al., 2013; Rendle et al., 2009). Listwise objectives focus on users' interests towards a list of items and are also used in some deep learning algorithms. We briefly summarize the three groups of loss functions.
We use $\ell$ to denote a loss function and $\Omega(\Theta)$ to represent a regularization term that controls the model complexity and encodes prior information, such as sparsity, non-negativity, or graph regularization.
For a pointwise loss function, the general calculation is:
(8)  $\mathcal{L} = \sum_{(u,i)} \ell\left(\hat{Y}_{ui}, Y_{ui}\right) + \lambda \Omega(\Theta)$
There are several types of pointwise loss functions. E.g., squared loss is more suitable for explicit feedback than implicit feedback; it is calculated as:
(9)  $\mathcal{L}_{sqr} = \sum_{(u,i)} w_{ui} \left(Y_{ui} - \hat{Y}_{ui}\right)^2$
where $w_{ui}$ is a hyperparameter denoting the weight of training instance $(u, i)$.
The use of squared loss is based on the assumption that observations are generated from a Gaussian distribution; however, this may not tally well with implicit data (Salakhutdinov and Mnih, 2007). For implicit feedback, there is a pointwise loss function mainly used for classification tasks (HongJian et al., 2017; He et al., 2017), named log loss (Kabbur et al., 2013), which can perform better with implicit feedback than squared loss:

(10)  $\mathcal{L}_{point} = -\sum_{(u,i)} \left( Y_{ui} \log \hat{Y}_{ui} + (1 - Y_{ui}) \log\left(1 - \hat{Y}_{ui}\right) \right)$
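The log loss can be sketched in a few lines; the clipping constant and toy targets are our own choices (targets may be 0/1 implicit feedback or normalized explicit ratings):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy (log loss) summed over user-item training instances."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

y_true = np.array([1.0, 0.0, 0.5])  # observed targets
y_pred = np.array([0.9, 0.2, 0.6])  # predicted scores in (0, 1)
print(round(log_loss(y_true, y_pred), 3))
```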
A pairwise loss considers the relative order of the predictions for pairs of items, which is a more reliable kind of information for top-N recommendation. Hidasi and Karatzoglou (2018) investigate several popular pairwise loss functions, i.e., TOP1, BPR-max and TOP1-max, which we briefly introduce. TOP1 is the regularized approximation of the relative rank of the relevant item, which can be calculated as:
(11)  $\mathcal{L}_{TOP1} = \frac{1}{|N_S|} \sum_{j \in N_S} \sigma\left(\hat{Y}_{uj} - \hat{Y}_{ui}\right) + \sigma\left(\hat{Y}_{uj}^2\right)$
where $\hat{Y}_{uj}$ and $\hat{Y}_{ui}$ denote the prediction scores for a negative item $j$ and a positive item $i$, respectively; $N_S$ is the set of negative samples. The first part of TOP1 aims to ensure that the target score is higher than the score of the negative samples, while the second part pushes the score of the negative samples down. As for BPR-max and TOP1-max, they have been proposed by Hidasi and Karatzoglou (2018) to overcome vanishing gradients as the number of negative samples increases. The idea is to compare the target score with the most relevant sample score, i.e., the maximum score amongst the samples. As the maximum operation is non-differentiable, softmax scores are used to preserve differentiability. By summing over the individual losses weighted by the corresponding softmax scores $s_j$, TOP1-max can be calculated as:
(12)  $\mathcal{L}_{TOP1\text{-}max} = \sum_{j \in N_S} s_j \left( \sigma\left(\hat{Y}_{uj} - \hat{Y}_{ui}\right) + \sigma\left(\hat{Y}_{uj}^2\right) \right)$
The BPR-max loss function can be calculated as:

(13)  $\mathcal{L}_{BPR\text{-}max} = -\log \sum_{j \in N_S} s_j\, \sigma\left(\hat{Y}_{ui} - \hat{Y}_{uj}\right)$
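The three pairwise losses can be sketched as follows, assuming a scalar score for the positive item and a vector of scores for the sampled negatives; the function names are ours:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def top1(pos, neg):
    """TOP1: push the positive score above the negatives and the negatives towards zero."""
    return float(np.mean(sigmoid(neg - pos) + sigmoid(neg ** 2)))

def top1_max(pos, neg):
    """TOP1-max: TOP1 terms weighted by softmax scores of the negative samples."""
    s = softmax(neg)
    return float(np.sum(s * (sigmoid(neg - pos) + sigmoid(neg ** 2))))

def bpr_max(pos, neg):
    """BPR-max: negative log of the softmax-weighted probability of correct ranking."""
    s = softmax(neg)
    return float(-np.log(np.sum(s * sigmoid(pos - neg))))

pos = 0.8                          # predicted score of the positive item
neg = np.array([0.1, -0.3, 0.5])   # predicted scores of sampled negatives
print(top1(pos, neg), top1_max(pos, neg), bpr_max(pos, neg))
```

The softmax weighting in the last two losses concentrates the gradient on the highest-scoring negatives, which is how they avoid the vanishing gradients described above.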
For listwise loss, many deep learning-based methods combine cross-entropy loss with softmax, which introduces listwise properties into the loss. We refer to it as softmax + cross-entropy (XE) loss, which can be calculated as:
(14)  $\mathcal{L}_{XE} = -\log s_i$, where $s_i$ is the softmax score of the target item $i$.
Most deep learning-based models only use a pointwise loss function for optimization and leave pairwise loss functions for future work (HongJian et al., 2017; He et al., 2017). Pointwise loss only uses the rating information and ignores the information contained in the relative order of pairs of items. Pairwise loss, in contrast, ignores the information of a user's individual preference for a certain item. Thus, unlike previous work such as NCF and DMF, our proposed JNCF model considers both pointwise and pairwise loss for the top-N recommendation task and combines them into a new loss function:
(15)  $\mathcal{L} = \alpha \mathcal{L}_{pair} + (1 - \alpha) \mathcal{L}_{point}$
where $\alpha$ is used to control the weights of the two parts.
For the pointwise loss, we adopt the log loss (Eq. (10)), which can integrate both implicit and explicit feedback. As to the pairwise loss, combining Eq. (15) with different pairwise losses yields different new loss functions, i.e., pointwise + TOP1, pointwise + BPR-max, and pointwise + TOP1-max. We analyze the performance of these different combined loss functions with experiments in Section 5.
Acknowledging that explicit and implicit feedback both contain information about a user's preference towards items, we combine both kinds of feedback in our loss function for optimization and rewrite Eq. (15) in detail as
(16)  $\mathcal{L}_{hybrid} = \alpha \mathcal{L}_{pair} - (1 - \alpha) \sum_{(u,i)} \left( Y_{ui} \log \hat{Y}_{ui} + (1 - Y_{ui}) \log\left(1 - \hat{Y}_{ui}\right) \right)$
where $Y_{ui} = R_{ui} / \max(R_u)$, and $\max(R_u)$ denotes the largest rating score that user $u$ has given to items, so that different rating values have a different influence on the loss. For example, if the largest rating score a user has given to items is 4 and he rates an item with 2, we generate $Y_{ui} = 2/4 = 0.5$. We refer to our loss function Eq. (16) as a "hybrid" loss function.
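A sketch of the hybrid loss with TOP1 as the pairwise part; the rating normalization follows the example above, and all names and toy values are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hybrid_loss(r_ui, r_max, y_pos, y_negs, alpha):
    """Hedged sketch of a hybrid loss: alpha * pairwise (TOP1) + (1 - alpha) * pointwise.

    r_ui / r_max normalizes the explicit rating into [0, 1], so higher ratings
    contribute more strongly to the pointwise log-loss term.
    """
    y_ui = r_ui / r_max                                             # e.g. 2 of max 4 -> 0.5
    point = -(y_ui * np.log(y_pos) + (1 - y_ui) * np.log(1 - y_pos))
    pair = np.mean(sigmoid(y_negs - y_pos) + sigmoid(y_negs ** 2))  # TOP1 over negatives
    return float(alpha * pair + (1 - alpha) * point)

loss = hybrid_loss(r_ui=2.0, r_max=4.0, y_pos=0.7, y_negs=np.array([0.2, 0.4]), alpha=0.5)
print(loss)
```

Setting `alpha=0` recovers a pure pointwise loss and `alpha=1` a pure pairwise loss, which is how the trade-off parameter can be probed experimentally.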
This completes the joint neural network structure of the JNCF model. The training process of JNCF is shown in Algorithm 1. We first initialize the parameters in the network and modify the rating matrix in steps 1 to 3. Then, in steps 9 and 10, we generate deep feature representations for both users and items with the DF network. In steps 11 and 12, we calculate the predicted scores for the user-item interactions with the DI network. Finally, we use the hybrid loss function in Eq. (16) and back propagation to optimize the network parameters in steps 13 and 14.
4. Experimental setup
We design experiments on a variety of datasets to examine the effectiveness of JNCF. We first explain the research questions and the models we use for comparison in Section 4.1. The datasets and experiments are described in Section 4.2.
4.1. Model summary and research questions
We conduct experiments with the aim of answering the following research questions:

Does our proposed JNCF method outperform state-of-the-art collaborative filtering baselines for recommender systems?

How is the performance of JNCF impacted by different choices for the pairwise loss in Eq. (16)?

Does the hybrid loss function of Eq. (16), which combines pointwise and pairwise loss, help to improve the performance of JNCF?

Are deeper layers of hidden units in the DF network and DI network helpful for the recommendation performance of JNCF?

Does the combination of explicit and implicit feedback help to improve the performance of JNCF?

How does the performance of JNCF vary across users with different numbers of interactions?

Is JNCF sensitive to different degrees of data sparsity?

How does JNCF perform on a large and sparse dataset?

How do the training and inference times of JNCF compare against those of other neural models?
We compare JNCF against a number of traditional collaborative filtering baselines and against state-of-the-art deep learning-based models:

Itempop: This method ranks items based on the number of interactions, which is a non-personalized approach to determine recommendation scores (Adomavicius and Tuzhilin, 2005).

BPR: This method uses a pairwise loss function to optimize an MF model based on implicit feedback. We use it as a strong baseline for traditional collaborative filtering methods (Rendle et al., 2009).

NCF: This is a state-of-the-art neural network-based method for recommender systems. It aims to capture the non-linear relationship between users and items. Unlike JNCF, it simply uses one-hot vectors representing users and items as the input for modeling user-item interactions, and it only uses implicit feedback and a pointwise loss function (He et al., 2017).

This method uses multilayer perceptrons for rating matrix factorization. Unlike our work, after projecting users and items into low dimensional vectors, it applies an inner product to calculate interactions between users and items, which is a linear kernel. It uses a pointwise loss function for optimization (HongJian et al., 2017).
In addition, following the choices that we identified in Eq. (5), we consider two versions of JNCF:

JNCF (mul): This is JNCF using element-wise multiplication, i.e., a linear kernel, for combining a user and an item feature vector as the input for the DI layers.

JNCF (cat): This is JNCF using concatenation, a non-linear way of combining a user and an item feature vector, as the input for the DI layers.
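The two combination operators can be illustrated with a minimal NumPy sketch (the function names and the 64-dimensional feature size are assumptions for illustration; in the model, the combined vector is fed into the DI layers):

```python
import numpy as np

def combine_mul(user_vec, item_vec):
    # Element-wise multiplication: a linear (GMF-style) interaction kernel.
    return user_vec * item_vec

def combine_cat(user_vec, item_vec):
    # Concatenation: the interaction itself is left to be learned
    # non-linearly by the subsequent DI layers.
    return np.concatenate([user_vec, item_vec])

u, v = np.random.rand(64), np.random.rand(64)
print(combine_mul(u, v).shape)  # (64,)
print(combine_cat(u, v).shape)  # (128,)
```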
We list all the models to be discussed in Table 2.
Model  Description  Source
ItemPop  A typical recommendation approach, which ranks items based on the number of interactions.  (Adomavicius and Tuzhilin, 2005)
BPR  A recommendation method using a pairwise loss function to optimize an MF model based on implicit feedback.  (Rendle et al., 2009)
NCF  A state-of-the-art neural network-based method for recommender systems.  (He et al., 2017)
DMF  A method using multi-layer perceptrons for rating matrix factorization.  (Hong-Jian et al., 2017)
JNCF (mul)  A JNCF model using element-wise multiplication for combining a user and an item feature vector as the input for the DI layers.  This paper
JNCF (cat)  A JNCF model using concatenation for combining a user and an item feature vector as the input for the DI layers.  This paper
JNCF (pointwise)  A JNCF model with only pointwise loss based on Eq. (10).  This paper
JNCF (pairwise)  A JNCF model with only pairwise loss based on Eq. (11).  This paper
JNCF (hybrid)  A JNCF model with our designed hybrid loss function in Eq. (13).  This paper
JNCF (explicit+implicit)  A JNCF model with both explicit and implicit feedback in the input and the loss function.  This paper
JNCF (implicit only)  A JNCF model with only implicit feedback in the input and the loss function.  This paper
4.2. Datasets and experimental setup
4.2.1. Datasets.
We use three publicly available datasets to evaluate our models and the baselines:

MovieLens, which contains several rating datasets from the MovieLens web site (https://grouplens.org/datasets/movielens/). The datasets are collected over various periods of time, depending on the size of the set (He et al., 2017; Hong-Jian et al., 2017). We use two sets for our experiments, i.e., MovieLens 100K (ML100K), containing 100,000 ratings from 943 users on 1,682 movies, and MovieLens 1M (ML1M), containing more than 1 million ratings from 6,040 users on 3,706 movies.

Amazon Movies (AMovies), which contains 4,607,047 ratings for movies from Amazon (http://jmcauley.ucsd.edu/data/amazon/). It is bigger and sparser than the MovieLens datasets and widely used for evaluation in the recommender systems literature (Zhang et al., 2017; Hong-Jian et al., 2017).

Amazon Electronics (AEle), which is larger and sparser than the other datasets used in our paper (http://jmcauley.ucsd.edu/data/amazon/). It contains 7,824,482 ratings of users on different electronics products. We use it to test the performance of our model when applied to a large and sparse dataset.
We do not further process the two MovieLens datasets because they are already filtered. For the AMovies dataset, following (Hong-Jian et al., 2017; He et al., 2017), we filter the data so that, similar to the MovieLens data, only users with at least 20 interactions and items with at least 5 interactions are retained. For the larger AEle dataset, we only apply minor filtering, i.e., removing users with fewer than 2 interactions and items with fewer than 5 interactions. To answer RQ1 to RQ7, we use the ML100K, ML1M, and AMovies datasets to evaluate our models and baselines. For RQ8 and RQ9, we test the models on all of the datasets. The characteristics of the datasets after preprocessing are summarized in Table 3.
Dataset  #Users  #Items  #Ratings  Density (%)

ML100K  943  1,682  100,000  6.3047 
ML1M  6,040  3,706  1,000,209  4.4685 
AMovies  15,067  69,629  877,736  0.0837 
AEle  1,221,341  157,003  4,486,501  0.00234 
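The threshold filtering described above can be sketched with pandas (a hedged sketch: the column names `user_id`/`item_id` and the function name are assumptions, and the loop iterates because dropping items can push users back under their threshold):

```python
import pandas as pd

def filter_interactions(df, min_user=20, min_item=5):
    """Keep only users with at least min_user interactions and items with
    at least min_item interactions, iterating until the dataset is stable."""
    while True:
        before = len(df)
        df = df.groupby("user_id").filter(lambda g: len(g) >= min_user)
        df = df.groupby("item_id").filter(lambda g: len(g) >= min_item)
        if len(df) == before:
            return df

ratings = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "item_id": ["a", "b", "c", "a"],
})
print(filter_interactions(ratings, min_user=2, min_item=1))  # u2's row is dropped
```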
In order to answer RQ6, we plot distributions of users with different numbers of interactions in the ML100K, ML1M, and AMovies datasets in Figure 2.
The x-axis denotes the number of ratings, while the y-axis indicates the number of users with that many ratings. We see that the majority of users in the three datasets only have a few ratings, which we regard as "inactive users," while a few "active users" have far more ratings. For example, in the ML100K dataset, 61.72% of the users have fewer than 100 ratings, 32.66% have between 100 and 300 ratings, and only 5.6% of the users have more than 300 ratings.
As we will see below, the models considered in this paper achieve different scores when used on datasets with different characteristics, i.e., numbers of users and items (see Section 5). Thus, for RQ7, in order to evaluate the performance of our model on datasets with different degrees of sparsity, we keep the number of users and items fixed. Following (Kabbur et al., 2013), for each of the three datasets, i.e., ML100K, ML1M, and AMovies, we create three versions at different sparsity levels with the following steps:

Step 1: We start by randomly choosing a subset of users and items from the original dataset. This dataset is denoted with a '1' suffix.

Step 2: We randomly choose a rating record and check whether removing it would leave the numbers of users and items in the sub-dataset unchanged. If so, we remove the record; otherwise, we choose another record.

Step 3: After several repetitions of Step 2, the first sparser version of the dataset, denoted with a '2' suffix, is created.

Step 4: Repeating Step 2 and Step 3 on the dataset with the '2' suffix, the second sparser version of the dataset, denoted with a '3' suffix, is created in the same way.
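The record-removal procedure can be sketched as follows (an illustrative sketch with hypothetical names; it assumes the target size is reachable without emptying any user or item):

```python
import random
from collections import Counter

def sparsify(records, target_size, seed=42):
    """Randomly remove (user, item) rating records until target_size remain,
    never removing a record that would drop a user's or item's last rating."""
    rng = random.Random(seed)
    records = list(records)
    user_counts = Counter(u for u, _ in records)
    item_counts = Counter(i for _, i in records)
    while len(records) > target_size:
        idx = rng.randrange(len(records))
        u, i = records[idx]
        if user_counts[u] > 1 and item_counts[i] > 1:
            # Swap-and-pop removal keeps each deletion O(1).
            records[idx] = records[-1]
            records.pop()
            user_counts[u] -= 1
            item_counts[i] -= 1
    return records
```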
The characteristics of the datasets are summarized in Table 4.
Dataset  #Users  #Items  #Ratings  Density (%)

ML100K1  943  1,682  69,999  4.4132 
ML100K2  943  1,682  39,999  2.2522 
ML100K3  943  1,682  9,999  0.6304 
ML1M1  6,040  3,706  850,208  3.7982 
ML1M2  6,040  3,706  350,207  1.5645 
ML1M3  6,040  3,706  167,870  0.7499 
AMovies1  7,402  12,080  87,807  0.0982 
AMovies2  7,402  12,080  37,823  0.0423 
AMovies3  7,402  12,080  18,867  0.0211 
4.2.2. Experimental setup.
For evaluation, we use a leave-one-out strategy, which has been widely used in DL-based recommender systems (Hong-Jian et al., 2017; He et al., 2017, 2016). The training set consists of all but the last interaction of every user; the test set contains the latest interaction of every user. At test time, it is time-consuming to compute ranking predictions over all items for every user. Thus, following He et al. (2017); Hong-Jian et al. (2017), we randomly sample 100 items with which the user has not interacted and rank the test item among these 100 samples. Although this sampling strategy may overestimate the performance of all algorithms, Bellogin et al. (2011); Hidasi and Karatzoglou (2018) have pointed out that the comparison among algorithms remains fair.
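The sampled evaluation protocol can be sketched as follows (function and variable names are hypothetical; the model's predicted scores would then rank the test item among the 101 candidates):

```python
import random

def build_eval_candidates(interacted, all_items, test_item, n_neg=100, seed=0):
    """Sample n_neg items the user has not interacted with; the held-out
    test item is then ranked among these candidates."""
    rng = random.Random(seed)
    pool = [i for i in all_items if i not in interacted and i != test_item]
    return rng.sample(pool, n_neg) + [test_item]
```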
The majority of the recommender systems literature applies error metrics for evaluation, i.e., Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). Such classical error criteria do not really measure top-N recommendation performance (Cremonesi et al., 2010). An extensive evaluation of several state-of-the-art recommender algorithms suggests that algorithms optimized for minimizing RMSE do not necessarily perform as expected on the top-N recommendation task (Cremonesi et al., 2010; Herlocker et al., 2004). Experimental results also show that improvements in terms of RMSE often do not translate into accuracy improvements (Herlocker et al., 2004). Thus, we choose accuracy metrics to examine recommendation performance (He et al., 2017). Specifically, we use HR and NDCG to evaluate the performance of our models. The Hit Ratio (HR) evaluates the precision of the recommender system, i.e., whether the test item is contained in the top-N list. The Normalized Discounted Cumulative Gain (NDCG) measures the ranking accuracy of the recommender system, i.e., whether the test item is ranked at the top of the list.
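For a single leave-one-out test case, the two metrics reduce to the following (a minimal sketch; the reported numbers average these values over all users):

```python
import math

def hr_ndcg_at_n(ranked_items, test_item, n=10):
    """HR@N is 1 if the held-out item appears in the top-N list; NDCG@N
    discounts the hit by its log2 position (with a single relevant item,
    the ideal DCG is 1, so no further normalization is needed)."""
    topn = ranked_items[:n]
    if test_item not in topn:
        return 0.0, 0.0
    rank = topn.index(test_item)  # 0-based position in the list
    return 1.0, 1.0 / math.log2(rank + 2)

print(hr_ndcg_at_n(["a", "b", "c"], "b", n=10))  # (1.0, ~0.631)
```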
As for parameters, we optimize the hyperparameters by running 100 experiments at randomly selected points of the parameter space. Optimization is done on a validation set, which is partitioned from the training set with the same procedure as the test set (Chen et al., 2018). For the loss function, we test the trade-off parameter in Eq. (13) over its full range in fixed steps. For the neural networks, we randomly initialize model parameters with a Gaussian distribution (mean of 0 and standard deviation of 0.01) and optimize the model with mini-batch Adam (Kingma and Ba, 2014). The batch size and learning rate are set to 256 and 0.0001, respectively. For the baselines, we set the parameters of DMF and NCF following (Hong-Jian et al., 2017) and (He et al., 2017), respectively. For DMF and NCF, we set the batch size to 256, and the learning rates to 0.0001 and 0.001. For the DF network in the DMF model, we apply two layers with sizes [128, 64]. For the DI network in the NCF model, we employ three hidden layers with sizes [128, 64, 8]. For the DF and DI networks in JNCF, unless mentioned otherwise, we employ three layers in the DF network with sizes [256, 128, 64] and two layers in the DI network with sizes [128, 8]. Thus, the embedding sizes of users and items are the same in all baseline models and in JNCF. We also keep the size of the last hidden layer of the DI network in JNCF the same as in NCF, since it may determine model capability. We further test our model as well as the baseline models with different numbers of layers to see whether deeper layers are beneficial to overall performance. Unless specified otherwise, for all the results presented in this paper, the number of recommendations (N) is set to 10 (Hong-Jian et al., 2017; He et al., 2017).

5. Results and Discussion
5.1. Overall performance
To answer RQ1, we examine the recommendation performance of the baselines and the two JNCF variants. See Table 5.
ML100K  ML1M  AMovies  
Model  HR@10  NDCG@10  HR@10  NDCG@10  HR@10  NDCG@10 
ItemPop  .3832  .2018  .4513  .2315  .5925  .3493 
BPR  .5762  .3021  .6097  .3711  .6288  .3903 
NCF  .6066  .3488  .6498  .3951  .6782  .4135 
DMF  .6309  .3616  .6748  .4221  .7151  .4616 
JNCF (mul)  .6627  .3877  .7127  .4485  .7666  .5098 
JNCF (cat)  .6829  .4065  .7377  .4822  .7881  .5311 
Let us first consider the baselines. From Table 5, we see that DMF achieves a better performance than the other baselines in terms of HR@10 and NDCG@10. Hence, we only use DMF as the best baseline for comparisons in later experiments. Bayesian Personalized Ranking (BPR) shows clearly higher improvements over the ItemPop baseline in terms of NDCG@10 than in terms of HR@10, which indicates that a pairwise loss is effective for ranking prediction. The NCF and DMF models both perform better than the two traditional CF models, which indicates the utility of DL techniques in improving recommendation performance.
Next, we compare the baselines against the JNCF models. NCF and DMF both lose against the JNCF models in terms of HR@10 and NDCG@10. This shows that a joint neural network structure that tightly couples deep feature learning and deep interaction modeling helps to improve recommendation performance. Regarding the JNCF models, independent of the choice for combining the users' and items' vectors, JNCF achieves a better performance than the DMF baseline, resulting in HR@10 improvements ranging from 5.04% to 8.24% on the ML100K dataset, 5.62% to 10.81% on the ML1M dataset, and 7.21% to 10.21% on the AMovies dataset. NDCG@10 improvements range from 7.22% to 12.42% on the ML100K dataset, 6.25% to 14.24% on the ML1M dataset, and 10.44% to 15.06% on the AMovies dataset. The improvements over the baseline in terms of HR@10 and NDCG@10 are statistically significant for both JNCF variants, although for JNCF (mul) on the ML100K dataset only at a weaker significance level. The higher improvements in NDCG@10 compared to HR@10 may be due to the fact that we incorporate a pairwise loss in our loss function, which motivates us to conduct a further investigation to answer RQ3.
Comparing the two JNCF variants, we see that JNCF (cat) achieves the best performance, with improvements of 3.05%, 3.51% and 2.81% in terms of HR@10, and 4.85%, 7.51% and 4.18% in terms of NDCG@10, over JNCF (mul) on the three datasets, respectively. The complex relationship between users and items can be described better with a non-linear kernel than with a linear kernel, which is consistent with the findings in (Liu et al., 2015; He et al., 2017).
5.2. Impact of different loss functions
As mentioned in Section 3.3, there are several kinds of pairwise loss functions that can be incorporated in Eq. (15). When JNCF combines the pointwise loss, i.e., log loss, with the TOP1, TOP1-max, and BPR-max pairwise losses, it gives rise to three corresponding JNCF models. Additionally, a listwise loss, i.e., softmax + cross-entropy (XE), can also be applied with JNCF, which gives rise to a fourth, listwise JNCF model. In order to investigate the impact of these loss functions on JNCF, we examine the recommendation performance of all four models as the trade-off parameter in Eq. (15) ranges over its interval in fixed steps. Fig. 3 shows the results.
As for overall performance, we see that when applied with the listwise loss function, JNCF has the worst performance among the four models. The other three models, which combine pairwise and pointwise losses, show relatively similar results in terms of HR@10 and NDCG@10. At one endpoint of the trade-off parameter, the hybrid loss reduces to JNCF with only pointwise loss; at the other endpoint, it reduces to JNCF with only the corresponding pairwise loss. Based solely on pointwise loss, JNCF performs better in terms of HR@10 but worse in terms of NDCG@10 than JNCF with only pairwise loss. This can be explained by the fact that pairwise loss helps JNCF learn to rank items in the right positions.
In Fig. 3(a), the performance of all models first increases with the weight of the pairwise loss, then shows a short-term decrease, and finally drops dramatically after the peak. The performance of the three pointwise+pairwise variants is comparable in terms of HR@10. As for NDCG@10, shown in Fig. 3(b), one of the variants performs somewhat better than the other two and reaches the highest score at the peak.
Regarding the performance on the ML1M dataset, Fig. 3(c) and Fig. 3(d) show trends similar to those in Fig. 3(a) and Fig. 3(b), respectively. For the AMovies dataset, shown in Fig. 3(e) and Fig. 3(f), the differences between the three pointwise+pairwise variants remain small: one variant is slightly ahead of the other two in terms of HR@10, while the other two are similar, and slightly better, in terms of NDCG@10.
As discussed in (Hidasi and Karatzoglou, 2018), the BPR-max and TOP1-max loss functions were proposed to overcome vanishing gradients as the number of negative samples increases. Since we use a small number of negative samples in this paper, performance is relatively similar across the TOP1, TOP1-max and BPR-max variants. As the BPR-max and TOP1-max losses need additional softmax calculations over all negative samples, we apply the TOP1 pairwise loss in Eq. (15) for JNCF in the experiments on which we report below.
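For reference, the BPR and TOP1 pairwise losses compared above can be sketched for a single positive/negative score pair (an illustrative sketch; in the model these are applied over sampled negatives inside the network):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bpr_loss(pos_score, neg_score):
    # BPR: maximize the probability that the positive item outscores the negative.
    return -math.log(sigmoid(pos_score - neg_score))

def top1_loss(pos_score, neg_score):
    # TOP1: rank the positive item above the negative, with a regularization
    # term that pushes negative scores towards zero.
    return sigmoid(neg_score - pos_score) + sigmoid(neg_score ** 2)
```

Both losses shrink as the positive item's score pulls ahead of the negative's, which is the ranking behavior the hybrid objective relies on.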
5.3. Utility of hybrid loss function
For RQ3, in order to further investigate the utility of the hybrid loss function (Eq. (15)), we examine the recommendation performance of the JNCF models under different settings: JNCF with only pointwise loss based on Eq. (10) (incorporating explicit feedback in the same way as Eq. (16)), JNCF with only pairwise loss based on Eq. (11), and JNCF with our designed hybrid loss function from Eq. (16). Fig. 4 shows the results.
The overall performance in terms of HR and NDCG increases as the size N of the top-N recommended list ranges from 1 to 10, since a larger value of N increases the probability of including a user's preferred item in the recommendation list.
JNCF with the hybrid loss consistently achieves improvements over DMF as well as over the two single-loss models across positions, which demonstrates the utility of our newly designed loss function. On the ML100K dataset, the hybrid-loss model improves by 2.68% and 7.61% in terms of HR@10 over the pointwise-only and pairwise-only models, respectively; the corresponding NDCG@10 improvements are 3.99% and 2.36%. Comparing the two single-loss models, we find that the pointwise variant beats the pairwise variant in terms of HR, while the pairwise variant is more competitive in terms of NDCG. This confirms the findings in (Rendle et al., 2009; He et al., 2016) that a pairwise ranking-aware learner has strong performance for ranking prediction, and it motivates us to incorporate both pointwise and pairwise loss into the hybrid loss function. Clearly, all three JNCF-based models show better performance than DMF, which further confirms that the joint neural structure is effective, i.e., deep interaction modeling can optimize neural matrix factorization and thus improve recommendation performance.
Comparing the left- and right-hand sides of Fig. 4, we see that the improvements of the hybrid-loss JNCF in terms of NDCG are larger than those in terms of HR, as indicated by the relative improvements over DMF at different sizes of the recommendation list. In Fig. 4(a), JNCF shows improvements over DMF in terms of HR of 8.78%, 5.91% and 8.24% at increasing cutoffs on the ML100K dataset. In Fig. 4(b), the improvements in terms of NDCG at the same cutoffs are 19.01%, 15.72% and 12.42%, respectively. JNCF with the hybrid loss function can not only recommend the correct item to a user, but is also competitive in ranking it at the top of the list.
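The hybrid objective evaluated above can be sketched as a convex combination of the two losses (an illustrative form consistent with the endpoints described in Section 5.2; the exact weighting in Eq. (13) is not reproduced here):

```python
def hybrid_loss(pointwise_loss, pairwise_loss, alpha):
    # Convex combination of the two objectives: one endpoint of alpha
    # recovers the pointwise-only model, the other the pairwise-only model.
    return alpha * pairwise_loss + (1.0 - alpha) * pointwise_loss
```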
5.4. Number of layers in the networks
In JNCF, we not only learn features of users and items through the DF network with multiple hidden layers, but also model user-item interactions with multi-layer perceptrons in the DI network. Thus, it is crucial to see whether DL is helpful in our model. We conduct experiments to examine the performance of JNCF with various numbers of layers in the DF and DI networks. In addition, we also test the performance of the best baseline model, i.e., DMF, with different DF networks. The results are shown in Table 6, where DFi and DIj denote a DF network with i layers and a DI network with j layers, respectively.
HR@10  NDCG@10  

DF1  DF2  DF3  DF4  DF5  DF1  DF2  DF3  DF4  DF5  
ML100K  DI1  .6242  .6511  .6713  .6955  .7213  .3581  .3721  .3971  .4123  .4313 
DI2  .6351  .6642  .6829  .7183  .7388  .3694  .3899  .4067  .4277  .4426  
DI3  .6493  .6712  .7144  .7309  .7479  .3811  .4001  .4197  .4388  .4535  
DI4  .6571  .6832  .7277  .7411  .7523  .3945  .4183  .4311  .4481  .4618  
DI5  .6501  .6799  .7254  .7408  .7501  .3903  .4111  .4287  .4433  .4587  
DMF  .6285  .6309  .6301  .6297  .6298  .3598  .3616  .3614  .3607  .3598  
ML1M  DI1  .6451  .6671  .7121  .7389  .7619  .3622  .3911  .4399  .4893  .5301 
DI2  .6531  .6999  .7377  .7531  .7814  .3889  .4233  .4822  .5211  .5525  
DI3  .6766  .7198  .7589  .7728  .7929  .4195  .4601  .5177  .5437  .5777  
DI4  .7134  .7472  .7683  .7834  .8088  .4581  .5101  .5389  .5663  .5906  
DI5  .7099  .7411  .7653  .7821  .8021  .4517  .5078  .5333  .5644  .5878  
DMF  .6673  .6748  .6738  .6722  .6725  .3955  .4221  .4201  .4197  .4199  
AMovies  DI1  .6611  .6922  .7481  .7911  .8188  .4041  .4533  .5004  .5413  .5622 
DI2  .6872  .7378  .7881  .8101  .8411  .4327  .4911  .5311  .5597  .5803  
DI3  .6989  .7633  .8078  .8378  .8787  .4632  .5204  .5501  .5714  .6102  
DI4  .7414  .7999  .8293  .8612  .8893  .5137  .5461  .5644  .5966  .6198  
DI5  .7379  .7922  .8201  .8589  .8821  .5111  .5402  .5599  .5934  .6145  
DMF  .7478  .7515  .7491  .7483  .7479  .4551  .4616  .4612  .4603  .4591 
As shown in Table 6, in terms of HR@10, we see that as the number of layers increases, the recommendation performance of JNCF improves, which verifies the effectiveness of DL techniques for recommender systems.
Comparing the number of layers in the DI and DF networks, we find that stacking more layers in the DF network of JNCF is more helpful than in the DI network for enhancing recommendation performance. For example, on the ML100K dataset, the improvements of the (DF3, DI2) configuration over (DF2, DI2) are 2.82% and 4.31% in terms of HR@10 and NDCG@10, while the improvements of (DF2, DI3) over (DF2, DI2) are 1.05% and 2.62%. When we stack more than 4 layers in the DI network (e.g., DI5), the performance of JNCF no longer increases. However, stacking more layers in the DF network (e.g., DF5) still helps, and the best results for each dataset are all produced by JNCF with the (DF5, DI4) configuration. This may be because deeper layers are more helpful for extracting users' and items' features and thus for improving user-item interaction predictions. It motivates us to incorporate more auxiliary information for exploring users' and items' features with deep learning techniques in future work.
As for NDCG@10, a similar pattern can be observed. However, when comparing the HR@10 and NDCG@10 scores under the same configurations, we find that deeper layers lead to more pronounced improvements in terms of NDCG@10 than HR@10 on all three datasets. The best performance of JNCF, with (DF5, DI4), outperforms the worst performance, with (DF1, DI1), by 20.52%, 25.37% and 34.52% in terms of HR@10 on the three datasets, respectively, while the corresponding improvements in terms of NDCG@10 are 28.96%, 63.05% and 53.37%.
As for the baseline model DMF, shown in the bottom rows of each block in Table 6, JNCF with (DF1, DI1) loses to DMF with DF1 on all datasets. Similar results can be found with DF2, except on the ML100K dataset. This can be explained by the fact that a simple concatenation of the user's and item's embeddings followed by only one MLP layer in JNCF is not sufficient for modeling user-item interactions. With more DI layers, JNCF performs better than DMF with the same number of DF layers. Additionally, DMF achieves its best performance with DF2, and deeper layers do not seem useful for the DMF model, which corresponds to the results in (Hong-Jian et al., 2017). In contrast, JNCF achieves further improvements when stacking more layers in the DI network, the DF network, or both.
5.5. Impact of feedback
In JNCF, we consider different kinds of user feedback. On the one hand, we use the interaction matrix as the input of the network with Eq. (3), which contains not only implicit but also explicit feedback. On the other hand, our loss function in Eq. (16) employs a normalization strategy that divides each rating by the largest rating score the user has given to items, so as to incorporate the explicit feedback. In order to answer RQ5, we conduct experiments to investigate whether the combination of explicit and implicit feedback helps JNCF, comparing two settings: JNCF with both kinds of feedback in the input and the loss function, and JNCF with only implicit feedback, obtained by labeling interactions with 1 and unknown ratings with 0 in the input and the loss function. Fig. 5 shows the recommendation performance of these two JNCF variants, DMF and NCF across different numbers of training iterations.
First, from Fig. 5 we see that JNCF with both kinds of feedback achieves a competitive performance across all iterations in terms of HR@10 and NDCG@10 on the three datasets. This indicates that the combination of explicit and implicit feedback in the input and the specially designed loss function of JNCF does help to improve recommendation performance. Second, as the number of training iterations increases, the recommendation performance of all models first improves and then degrades after reaching a peak; more iterations may lead to overfitting, which hurts recommendation performance. However, comparing the JNCF models with the baselines, i.e., DMF and NCF, we find that JNCF converges to its best performance faster than the other models. For example, on the ML100K dataset, the best result of JNCF is reached after the first 9 effective iterations, while DMF and NCF need more training iterations to obtain their best results, i.e., 16 and 14 iterations, respectively. The same phenomenon can be observed on the other two datasets: the optimal numbers of updates needed for JNCF, DMF and NCF are around 10, 17 and 19 on the ML1M dataset, and 14, 18 and 19 on the AMovies dataset, respectively. Third, comparing the performance in terms of HR@10 and NDCG@10, we find that JNCF with both kinds of feedback shows larger improvements over the implicit-only variant in terms of NDCG@10 than HR@10. For example, the improvements are 3.72%, 5.22% and 4.89% in terms of HR@10 on the ML100K, ML1M and AMovies datasets, respectively, vs. improvements of 4.61%, 5.58% and 5.31% in terms of NDCG@10. This confirms our hypothesis that incorporating both explicit and implicit feedback can improve ranking precision for recommendation.
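The per-user rating normalization described above can be sketched as follows (hypothetical names; the sketch follows the description of dividing each rating by the user's largest rating score):

```python
def normalize_feedback(ratings):
    """Map each observed rating r_ui to r_ui / max-rating-of-user, preserving
    explicit rating levels in (0, 1]; unobserved entries stay at 0."""
    out = {}
    for user, items in ratings.items():
        r_max = max(items.values())
        out[user] = {item: r / r_max for item, r in items.items()}
    return out

print(normalize_feedback({"u1": {"a": 5, "b": 3}}))  # {'u1': {'a': 1.0, 'b': 0.6}}
```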
6. Scalability and Sensitivity
In order to answer RQ6 to RQ9, we study the scalability and sensitivity of JNCF as well as of the best baseline, DMF, in different settings: with users with various numbers of ratings in Section 6.1, and with datasets with different levels of sparsity in Section 6.2. In addition, we investigate the performance of the deep learning-based approaches, i.e., JNCF, DMF and NCF, on a large and sparse dataset in Section 6.3. Finally, the training and inference times needed for these models on all datasets are discussed in Section 6.4.
6.1. Model scalability with user ratings
HR@10  NDCG@10  

DMF  JNCF (mul)  JNCF (cat)  DMF  JNCF (mul)  JNCF (cat)  
ML100K  10%  .7001  .7400  .8015  .4358  .4786  .5001 
50%  .6813  .7349  .7568  .4200  .4379  .4602  
90%  .6279  .6585  .6772  .3813  .3897  .4092  
ML1M  10%  .7548  .7927  .8511  .5111  .5417  .5952 
50%  .7211  .7532  .7982  .4855  .5266  .5587  
90%  .6601  .6981  .7277  .4217  .4432  .4751  
AMovies  10%  .7851  .8611  .9191  .5349  .5998  .6611 
50%  .7519  .7855  .8411  .5033  .5466  .5821  
90%  .7013  .7411  .7732  .4597  .5038  .5301 
In Fig. 2, we have shown that in every dataset most users only have a few ratings; it is thus meaningful to investigate how the performance of JNCF and DMF varies with different numbers of user ratings. Following (Prem et al., 2013), we look at the performance for users of varying degrees of activity, measured by percentile. For example, in Table 7, we first rank the users according to their numbers of interactions. The 10% mark shows the mean performance across the bottom 10% of users, who are least active; the 90% mark shows the mean performance for all but the top 10% most active users.
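The percentile grouping can be sketched as follows (a minimal sketch with hypothetical names):

```python
def bottom_percentile_users(interaction_counts, pct):
    """Return the bottom pct% of users ranked by activity, least active first.
    interaction_counts maps user id -> number of interactions."""
    ranked = sorted(interaction_counts, key=interaction_counts.get)
    k = max(1, int(len(ranked) * pct / 100))
    return ranked[:k]
```

Evaluation scores would then be averaged over the returned user group, e.g. `bottom_percentile_users(counts, 10)` for the 10% mark.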
As shown in Table 7, JNCF outperforms the best baseline model DMF for users across all activity levels, i.e., both for the "inactive" users who constitute the majority and for the relatively few "very active" users who give more ratings. In addition, JNCF (cat) always achieves the best performance in terms of HR@10 and NDCG@10. In order to test the robustness of JNCF under both settings, we conduct t-tests comparing each of the two JNCF variants with DMF. The improvements over DMF in terms of HR@10 and NDCG@10 are statistically significant for both variants across all activity levels, although for JNCF (mul) on the ML100K dataset at the 50% and 90% marks only at a weaker significance level. Specifically, JNCF shows larger improvements over the DMF model for "inactive" users than for "very active" users. For example, when incorporating users with more interactions, i.e., moving from the 50% to the 90% mark, the improvements change from 11.08% to 7.85% in terms of HR@10, and from 9.57% to 7.32% in terms of NDCG@10, on the ML100K dataset. This may be because the "very active" users have many interactions with items that have few ratings, and collaborative filtering lacks information for recommending such items based only on the rating matrix. This naturally suggests a line of future work in which one would extend JNCF with more auxiliary information, such as content information, to capture more accurate relationships between items.
To conclude and answer RQ6: the JNCF models beat the best baseline model for users across all activity levels, with JNCF (cat) performing best on all datasets. In addition, JNCF shows larger improvements over DMF for "inactive" users than for "very active" users.
6.2. Sensitivity to data sparsity
To investigate the sensitivity of JNCF to different levels of data sparsity, we examine the recommendation performance on datasets with different levels of sparsity, as presented in Table 4. Fig. 6 shows the results.
The overall performance of all models on the AMovies dataset is better than on the other two datasets. That is to say, recommendation performance may be influenced by the size of a dataset. Thus, in order to investigate model sensitivity across datasets with different degrees of sparsity, it is essential to keep the numbers of users and items on the same scale across the compared datasets.
From Fig. 6 we see that, for the ML100K, ML1M and AMovies datasets alike, the JNCF models outperform the baseline model DMF across all sub-datasets with different degrees of sparsity in terms of HR@10 and NDCG@10. In addition, we find that as the density of these datasets goes down, the performance of all models decreases; it is thus interesting to investigate the robustness of JNCF when applied to sparse datasets. When applied to small datasets, e.g., the subsets of ML100K, our best model, JNCF (cat), shows higher improvements over DMF on sparser datasets. For example, it achieves 4.91% and 9.12% improvements over DMF in terms of HR@10 and NDCG@10 on the ML100K1 subset (density 4.41%), while the improvements on the ML100K3 subset (density 0.63%) are 7.77% and 12.02% in terms of HR@10 and NDCG@10, respectively. However, when applied to larger datasets with more users and items, i.e., the subsets of ML1M and AMovies, JNCF shows higher improvements over DMF on denser datasets. For instance, JNCF achieves an 11.13% improvement over DMF in terms of HR@10 on the ML1M1 subset (density 3.80%), while the improvement on the ML1M3 subset (density 0.75%) is 6.53%. These results may indicate that as a dataset becomes larger and sparser, it becomes more difficult for models to improve their recommendation performance, which motivates a further investigation to answer RQ8; see Section 6.3 below.
In addition, comparing the left- and right-hand side plots in Fig. 6, we find that JNCF shows larger improvements in terms of NDCG@10 than HR@10. For example, the improvements of JNCF over DMF are 9.19%, 8.28% and 15.11% in terms of HR@10 on the ML100K1, ML100K2 and ML100K3 datasets, respectively, while the improvements are 10.11%, 10.65% and 20.55% in terms of NDCG@10. This result is consistent with our findings in Section 5.3.
Thus, in answer to RQ7: the JNCF models outperform the best baseline model DMF across all datasets with different degrees of sparsity in terms of both metrics. Specifically, when applied to the larger datasets, i.e., ML1M and AMovies, JNCF shows higher improvements over DMF on denser datasets. In addition, the improvements of JNCF over DMF in terms of NDCG@10 are larger than in terms of HR@10.
6.3. Performance with a large and sparse dataset
For RQ8, in order to see whether our model works well on a large and sparse dataset, we examine our model as well as two baseline models, NCF and DMF, on the Amazon Electronics (AEle) dataset, which is larger and sparser than the MovieLens and Amazon Movies datasets. Fig. 7 shows the performance of the three models with different sizes of the top-N recommended list.
It is clear that JNCF outperforms both DMF and NCF in terms of HR and NDCG across different numbers of recommendations. As the size of the top-N recommended list ranges from 1 to 10, the overall performance of all models increases, which is consistent with the conclusion in Section 5.3. Comparing the results shown in Fig. 7(a) and Fig. 7(b), the improvements of JNCF over DMF in terms of NDCG are more pronounced than those in terms of HR. For example, at two of the cut-off values of N, the improvements of JNCF over DMF in terms of HR are 5.88% and 4.62%, while the corresponding improvements in terms of NDCG are 6.12% and 5.82%. To conclude and answer RQ8, JNCF also works well with large and sparse datasets, especially at ranking items correctly.
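Both metrics are standard for leave-one-out top-N evaluation: HR@N checks whether the held-out test item appears in the top-N list at all, while NDCG@N additionally rewards placing it near the top of the list. A minimal sketch (function and variable names are ours):

```python
import math

def hr_at_n(ranked_items, test_item, n):
    """Hit Ratio: 1 if the held-out item appears in the top-n list, else 0."""
    return 1.0 if test_item in ranked_items[:n] else 0.0

def ndcg_at_n(ranked_items, test_item, n):
    """NDCG with a single relevant item: 1/log2(rank + 1) if the item is
    ranked within the top n (rank is 1-based), else 0."""
    if test_item in ranked_items[:n]:
        rank = ranked_items.index(test_item) + 1
        return 1.0 / math.log2(rank + 1)
    return 0.0

# The held-out item at rank 1 gives NDCG = 1.0; at rank 3 it gives 0.5,
# which is why NDCG is more sensitive to ranking quality than HR.
```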
6.4. Training and inference time
To answer RQ9, we investigate the scalability of JNCF in terms of training and inference time; the results are shown in Table 8. In the “Training” part, “Total time” denotes the time needed to train the model to its best performance, and “Average epoch” denotes the average training time per epoch. In the “Prediction” part, “Total time” denotes the prediction time needed for the whole test set. Since the test set contains the latest interaction of every user, “Average ranking” indicates the time needed to produce a ranked list of top 10 recommendations for a single user.
Table 8. Training and prediction time of NCF, DMF and JNCF on the four datasets.

                        Training                             Prediction
Dataset   Model    Total time (s)   Average epoch (s)   Total time (s)   Average ranking (s)
ML100K    NCF            46.344             1.943             1.389             0.00147
          DMF           180.017             9.587             1.558             0.00165
          JNCF          116.023            10.925             1.607             0.00170
ML1M      NCF           494.038            17.751             8.251             0.00137
          DMF         5,451.671           320.687            12.376             0.00205
          JNCF        3,539.059           340.048            13.858             0.00229
AMovies   NCF           977.265            25.836            25.599             0.00170
          DMF        39,249.657         2,180.537            34.955             0.00232
          JNCF       31,414.628         2,206.084            37.818             0.00251
AEle      NCF        61,812.187           326.828         2,919.005             0.00239
          DMF       788,138.604        43,785.478         4,360.187             0.00357
          JNCF      723,586.192        45,224.137         4,775.443             0.00391
As we can see in Table 8, as the size of the dataset grows, the time needed for both training and prediction increases significantly for all models. NCF consistently costs the least time of the three models, for both training and prediction, on all datasets. For training, the average time per epoch of JNCF is slightly higher than that of DMF; however, the total training time of JNCF is lower. This can be explained by the fact that JNCF needs fewer iterations than DMF to obtain its best results, as indicated in Section 5. Thus, JNCF takes less time than DMF to train to its best performance. For prediction, although the total time needed by JNCF and DMF exceeds that of NCF, the three models take roughly similar amounts of time, a few milliseconds, to produce a top 10 ranked list for a single user.
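Timings like those in Table 8 can be gathered with straightforward wall-clock instrumentation; below is a hedged sketch of the bookkeeping (the helper names and the ranking function are illustrative placeholders, not the paper's implementation):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def measure_ranking(rank_fn, users):
    """Total and average per-user prediction time for a ranking function,
    mirroring the 'Total time' and 'Average ranking' columns."""
    times = [timed(rank_fn, u)[1] for u in users]
    return sum(times), sum(times) / len(times)

# Example with a trivial stand-in ranker:
total, avg = measure_ranking(lambda u: sorted(range(100)), range(10))
```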
7. Conclusions and Future Work
We have proposed a joint neural collaborative filtering model, JNCF, for recommender systems. JNCF uses a unified deep neural network to tightly couple two important parts of a recommender system, i.e., deep feature learning of users and items, and deep modeling of user-item interactions. For user and item feature extraction, we use a deep neural network with matrix factorization and a combination of explicit and implicit feedback as input. We then adopt another neural network to model user-item interactions, using the feature vectors as inputs. Thus, JNCF enables the two parts to be optimized together through a joint training process. In order to make JNCF fit the top-N recommendation task, we design a new loss function that incorporates information from both pairwise and pointwise loss.
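The exact form of the hybrid loss is given earlier in the paper; as a hedged sketch of the general idea only, a weighted combination of a pointwise term (e.g., binary cross-entropy on the observed interaction) and a pairwise term (e.g., a BPR-style ranking loss) might look as follows, where alpha and the particular loss terms are our illustrative assumptions, not the paper's definition:

```python
import math

def pointwise_loss(y_pos_pred, y_pos_true):
    """Binary cross-entropy on the predicted score of the observed item."""
    eps = 1e-12  # numerical guard against log(0)
    return -(y_pos_true * math.log(y_pos_pred + eps)
             + (1 - y_pos_true) * math.log(1 - y_pos_pred + eps))

def pairwise_loss(y_pos_pred, y_neg_pred):
    """BPR-style loss: the positive item should score above the negative."""
    return -math.log(1.0 / (1.0 + math.exp(-(y_pos_pred - y_neg_pred))))

def hybrid_loss(y_pos_pred, y_neg_pred, y_pos_true=1.0, alpha=0.5):
    """Weighted combination of the pairwise and pointwise terms; alpha is
    a trade-off hyperparameter and its value here is illustrative."""
    return (alpha * pairwise_loss(y_pos_pred, y_neg_pred)
            + (1 - alpha) * pointwise_loss(y_pos_pred, y_pos_true))
```

In such a combination, the pairwise term drives correct ordering of positives over sampled negatives, while the pointwise term anchors the absolute scale of the predicted preference scores.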
The experimental results confirm the effectiveness of JNCF. In addition, we have experimentally investigated the performance of JNCF under various settings, e.g., with different loss functions, with varying numbers of layers in the networks, and with different types of feedback as input. The results confirm the effectiveness of our hybrid loss function and demonstrate that JNCF performs better with more layers in the networks and with the combination of implicit and explicit feedback as input.
In addition, we have investigated the robustness of JNCF with different degrees of data sparsity and different numbers of user ratings. JNCF outperforms the best baseline model DMF for users across all activity levels, especially for “inactive users,” who constitute the majority of users in the datasets. As for datasets with different levels of sparsity, JNCF in general shows more competitive recommendation performance than the state-of-the-art baseline model DMF on all datasets. Moreover, we have also tested the JNCF model on a large and sparse dataset, i.e., AEle, and the results show that JNCF outperforms state-of-the-art baseline models on this dataset as well.
As to future work, first, we plan to extend JNCF with more auxiliary information (Zheng et al., 2017; Wang et al., 2017; Cai and de Rijke, 2016a, b), such as the content information of items as well as reviews, to obtain a more informed representation of users as well as items. As collaborative filtering usually suffers from limited performance due to the sparsity of user-item interactions (Shi et al., 2017), auxiliary information could be used to boost performance. It would also be interesting to explore heterogeneous information in a knowledge base to improve the quality of recommender systems with deep learning (Zhang et al., 2016). Second, we plan to explore the context information of a user in a session with recurrent neural networks to deal with dynamic aspects of recommender systems (Chatzis et al., 2017; Hidasi et al., 2016b; Cai et al., 2016b, a). In addition, an attention mechanism could be applied to JNCF, which can filter out uninformative content and select the most representative items while providing good interpretability (Chen et al., 2017). Finally, as we have found that JNCF is computationally more expensive than NCF, we plan to optimize the structure and implementation details of our model to make it more efficient.
Acknowledgements.
We would like to thank our anonymous reviewers for their helpful comments and valuable suggestions.

References
 Adeniyi et al. (2016) David Adedayo Adeniyi, Zhaoqiang Wei, and Yongquan Yang. 2016. Automated web usage data mining and recommendation system using K-Nearest Neighbor (KNN) classification method. Applied Computing and Informatics 12, 1 (2016), 90–108.
 Adomavicius and Tuzhilin (2005) Gediminas Adomavicius and Alexander Tuzhilin. 2005. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Transactions on Knowledge and Data Engineering 17, 6 (2005), 734–749.
 Basiliyos et al. (2017) Basiliyos Tilahun Betru, Charles Awono Onana, and Bernabe Batchakui. 2017. Deep Learning Methods on Recommender System: A Survey of State-of-the-art. International Journal of Computer Applications 162, 10 (2017), 17–22.
 Bellogin et al. (2011) Alejandro Bellogin, Pablo Castells, and Ivan Cantador. 2011. Precision-oriented Evaluation of Recommender Systems: An Algorithmic Comparison. In RecSys ’11. ACM, 333–336.
 Cai and de Rijke (2016a) Fei Cai and Maarten de Rijke. 2016a. Learning from homologous queries and semantically related terms for query auto completion. Information Processing & Management 52, 4 (2016), 628–643.
 Cai and de Rijke (2016b) Fei Cai and Maarten de Rijke. 2016b. A Survey of Query Auto Completion in Information Retrieval. Foundations and Trends in Information Retrieval 10, 4 (2016), 273–363.
 Cai et al. (2016a) Fei Cai, Shangsong Liang, and Maarten de Rijke. 2016a. Prefix-Adaptive and Time-Sensitive Personalized Query Auto Completion. IEEE Transactions on Knowledge and Data Engineering 28, 9 (Sep 2016), 2452–2466.
 Cai et al. (2016b) Fei Cai, Ridho Reinanda, and Maarten de Rijke. 2016b. Diversifying Query Auto-Completion. ACM Transactions on Information Systems 34, 4 (June 2016), 25:1–25:33.
 Chatzis et al. (2017) Sotirios P. Chatzis, Panayiotis Christodoulou, and Andreas S. Andreou. 2017. Recurrent Latent Variable Networks for Session-Based Recommendation. In DLRS ’17. 38–45.
 Chen et al. (2017) Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive Collaborative Filtering: Multimedia Recommendation with Item- and Component-Level Attention. In SIGIR ’17. ACM, 335–344.
 Chen et al. (2018) Wanyu Chen, Fei Cai, Honghui Chen, and Maarten de Rijke. 2018. Attention-based Hierarchical Neural Query Suggestion. In SIGIR ’18. ACM, 1093–1096.
 Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. 2016. Wide & Deep Learning for Recommender Systems. In DLRS 2016. ACM, 7–10.
 Cremonesi et al. (2010) Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of Recommender Algorithms on Top-n Recommendation Tasks. In RecSys ’10. ACM, 39–46.
 Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-machine Based Neural Network for CTR Prediction. In IJCAI ’17. AAAI Press, 1725–1731.
 He and Chua (2017) Xiangnan He and Tat-Seng Chua. 2017. Neural Factorization Machines for Sparse Predictive Analytics. In SIGIR ’17. ACM, 355–364.
 He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In WWW ’17. ACM, 173–182.
 He et al. (2016) Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. 2016. Fast Matrix Factorization for Online Recommendation with Implicit Feedback. In SIGIR ’16. ACM, 549–558.
 Herlocker et al. (2004) Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl. 2004. Evaluating Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems 22, 1 (2004), 5–53.
 Hidasi and Karatzoglou (2018) Balázs Hidasi and Alexandros Karatzoglou. 2018. Recurrent Neural Networks with Top-k Gains for Session-based Recommendations. In CIKM ’18. ACM, 843–852.
 Hidasi et al. (2016a) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016a. Session-based Recommendations with Recurrent Neural Networks. In ICLR ’16.
 Hidasi et al. (2016b) Balázs Hidasi, Massimo Quadrana, Alexandros Karatzoglou, and Domonkos Tikk. 2016b. Parallel Recurrent Neural Network Architectures for Feature-rich Session-based Recommendations. In RecSys ’16. ACM, 241–248.
 Hong-Jian et al. (2017) Xue Hong-Jian, Dai Xinyu, Zhang Jianbing, Huang Shujian, and Chen Jiajun. 2017. Deep Matrix Factorization Models for Recommender Systems. In IJCAI ’17. 3203–3209.
 Huang et al. (2013) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning Deep Structured Semantic Models for Web Search Using Clickthrough Data. In CIKM ’13. ACM, 2333–2338.
 Kabbur et al. (2013) Santosh Kabbur, Xia Ning, and George Karypis. 2013. FISM: Factored item similarity models for Top-N recommender systems. In KDD ’13. ACM, 659–667.
 Kim et al. (2016) Donghyun Kim, Chanyoung Park, Jinoh Oh, Sungyoung Lee, and Hwanjo Yu. 2016. Convolutional Matrix Factorization for Document Context-Aware Recommendation. In RecSys ’16. 233–240.
 Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 (2014).
 Koren (2008) Yehuda Koren. 2008. Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model. In KDD ’08. ACM, 426–434.
 Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (2009), 30–37.
 Li et al. (2015) Sheng Li, Jaya Kawale, and Yun Fu. 2015. Deep Collaborative Filtering via Marginalized Denoising Autoencoder. In CIKM ’15. ACM, 811–820.
 Lian et al. (2017) Jianxun Lian, Fuzheng Zhang, Xing Xie, and Guangzhong Sun. 2017. CCCFNet: A Content-Boosted Collaborative Filtering Neural Network for Cross Domain Recommender Systems. In WWW ’17. ACM, 817–818.
 Linden et al. (2003) Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon.com Recommendations: Item-to-Item Collaborative Filtering. IEEE Internet Computing 7, 1 (2003), 76–80.
 Liu and Wu (2017) Juntao Liu and Caihua Wu. 2017. Deep Learning Based Recommendation: A Survey. In ICISA ’17. 451–458.
 Liu et al. (2015) Xiaomeng Liu, Yuanxin Ouyang, Wenge Rong, and Zhang Xiong. 2015. Item Category Aware Conditional Restricted Boltzmann Machine Based Recommendation. In ICONIP ’15. 609–616.
 Onal et al. (2018) Kezban Dilek Onal, Ye Zhang, Ismail Sengor Altingovde, Md Mustafizur Rahman, Pinar Karagoz, Alex Braylan, Brandon Dang, Heng-Lu Chang, Henna Kim, Quinten McNamara, Aaron Angert, Edward Banner, Vivek Khetan, Tyler McDonnell, An Thanh Nguyen, Dan Xu, Byron C. Wallace, Maarten de Rijke, and Matthew Lease. 2018. Neural information retrieval: At the end of the early years. Information Retrieval Journal 21, 2–3 (June 2018), 111–182.
 Paterek (2007) Arkadiusz Paterek. 2007. Improving regularized singular value decomposition for collaborative filtering. In KDD ’07. ACM.
 Prem et al. (2013) Gopalan Prem, Jake M. Hofman, and David M. Blei. 2013. Scalable Recommendation with Poisson Factorization. arXiv preprint arXiv:1311.1704 (2013).
 Rendle et al. (2009) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars SchmidtThieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In UAI ’09. 452–461.
 Salakhutdinov and Mnih (2007) Ruslan Salakhutdinov and Andriy Mnih. 2007. Probabilistic Matrix Factorization. In NIPS’07. Curran Associates Inc., 1257–1264.
 Salakhutdinov et al. (2007) Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. 2007. Restricted Boltzmann Machines for Collaborative Filtering. In ICML ’07. 791–798.
 Sarwar et al. (2001) Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based Collaborative Filtering Recommendation Algorithms. In WWW ’01. ACM, 285–295.
 Sarwar et al. (2000) Badrul Munir Sarwar, George Karypis, Joseph A. Konstan, and John Thomas Riedl. 2000. Application of Dimensionality Reduction in Recommender System–A Case Study. In ACM WebKDD Workshop. ACM.
 Sedhain et al. (2015) Suvash Sedhain, Aditya Menon, Scott Sanner, and Lexing Xie. 2015. AutoRec: Autoencoders Meet Collaborative Filtering. In WWW ’15. ACM, 111–112.
 Shi et al. (2017) Lei Shi, Wayne Xin Zhao, and Yi-Dong Shen. 2017. Local Representative-Based Matrix Factorization for Cold-Start Recommendation. ACM Transactions on Information Systems 36, 2 (Aug. 2017), 22:1–22:28.
 Su and Khoshgoftaar (2009) Xiaoyuan Su and Taghi M. Khoshgoftaar. 2009. A Survey of Collaborative Filtering Techniques. Advances in Artificial Intelligence 2009 (2009), Article 4.
 Trapit et al. (2016) Bansal Trapit, Belanger David, and McCallum Andrew. 2016. Ask the GRU: Multi-task Learning for Deep Text Recommendations. In RecSys ’16. 107–114.
 Truyen et al. (2009) Tran The Truyen, Dinh Q. Phung, and Svetha Venkatesh. 2009. Ordinal Boltzmann Machines for Collaborative Filtering. In UAI ’09. 548–556.
 van den Oord et al. (2013) Aaron van den Oord, Sander Dieleman, and Benjamin Schrauwen. 2013. Deep Content-based Music Recommendation. In NIPS ’13. 2643–2651.
 Wang and Blei (2011) Chong Wang and David M. Blei. 2011. Collaborative Topic Modeling for Recommending Scientific Articles. In KDD ’11. 448–456.
 Wang et al. (2015) Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative Deep Learning for Recommender Systems. In KDD ’15. ACM, 1235–1244.
 Wang et al. (2017) Suhang Wang, Yilin Wang, Jiliang Tang, Kai Shu, Suhas Ranganath, and Huan Liu. 2017. What Your Images Reveal: Exploiting Visual Contents for Point-of-Interest Recommendation. In WWW ’17. 391–400.
 Wang et al. (2019) Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural Graph Collaborative Filtering. In SIGIR ’19. ACM.
 Wu et al. (2016) Yao Wu, Christopher DuBois, Alice X. Zheng, and Martin Ester. 2016. Collaborative Denoising Auto-Encoders for Top-N Recommender Systems. In WSDM ’16. ACM, 153–162.
 Zhang et al. (2016) Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. 2016. Collaborative Knowledge Base Embedding for Recommender Systems. In KDD ’16. ACM, 353–362.
 Zhang et al. (2017) Shuai Zhang, Lina Yao, and Aixin Sun. 2017. Deep Learning based Recommender System: A Survey and New Perspectives. arXiv preprint arXiv:1707.07435 (2017).
 Zheng et al. (2017) Lei Zheng, Vahid Noroozi, and Philip S. Yu. 2017. Joint Deep Modeling of Users and Items Using Reviews for Recommendation. In WSDM ’17. ACM, 425–434.
 Zheng et al. (2016) Yin Zheng, Bangsheng Tang, Wenkui Ding, and Hanning Zhou. 2016. A Neural Autoregressive Approach to Collaborative Filtering. In ICML’16. 764–773.