Introduction
Over the past decades, with the rapid development of web-based service platforms such as e-commerce and news/music/movie platforms, recommender systems have been extensively studied and widely deployed in many different scenarios to alleviate the information overload problem [Hu et al. 2018, Srivastava et al. 2018]. Due to their distinguishing capability of utilizing collective wisdom and experience, Collaborative Filtering (CF) algorithms, especially Matrix Factorization (MF) algorithms, have been widely used to build recommender systems [Wang et al. 2018a, Zhao et al. 2018, Hu et al. 2017].
Matrix factorization assumes that a relationship can be established between users and items through latent factors. By mapping users and items into a common representation space where they can be compared directly, the similarity between them can then be used to estimate how well they match. In this case, the model learns a low-dimensional dense representation for each user and item, and adopts the dot product as the matching function to calculate the matching score. Since Deep Neural Networks (DNNs) are extremely good at representation learning, deep learning methods have been widely explored and have shown promising results in various areas such as computer vision and natural language processing
[He et al. 2016a, Serban et al. 2016]. In the past few years, many works have also adopted DNNs to introduce auxiliary data, such as images, text descriptions and demographic information, to improve the representation learning process. Moreover, in vanilla matrix factorization, the mapping between the original representation space and the latent space is assumed to be linear, which cannot always be guaranteed. To better learn the complex mapping between these two spaces, Xue et al. [Xue et al. 2017] proposed Deep Matrix Factorization (DMF), which uses a two-pathway neural network architecture to replace the linear embedding operation used in vanilla matrix factorization. However, when it comes to matching score prediction, matrix factorization methods still resort to the dot product, which simply combines latent factors linearly and seriously limits the expressiveness of the model.

In addition to learning better representations for users and items, DNNs are also very suitable for learning the complex matching function, since they are capable of approximating any continuous function [Hornik, Stinchcombe, and White 1989]. For example, He et al. [He et al. 2017]
proposed NeuMF under the Neural Collaborative Filtering (NCF) framework, which takes the concatenation of the user embedding and the item embedding as the input of a Multi-Layer Perceptron (MLP) to make predictions. The high capacity and non-linearity of DNNs are used to learn the complex mapping between the user-item representation and the matching score. In this case, the MLP replaces the dot product used in traditional matrix factorization methods. However, as revealed in
[Beutel et al. 2018], MLP is very inefficient in capturing low-rank relations. In fact, using the dot product to estimate the matching score in traditional matrix factorization methods artificially limits the model to learning similarity, a low-rank relation that is thought to be positively related to the matching score according to human experience. Although using an MLP to learn the matching function directly endows the model with great flexibility, the learning process may be inefficient without incorporating such human experience. This is also why NeuMF needs to combine the MLP with a shallow matrix factorization model.

According to the above discussion, there are two types of methods for implementing collaborative filtering [Xu, He, and Li 2018]: one is based on representation learning and the other on matching function learning. To overcome the shortcomings of these two types of methods and further improve the performance of CF methods, we incorporate them under the proposed DeepCF framework. In particular, we first use the two types of CF methods to obtain different representations for the input user-item pair. Since the two types of methods have different advantages and learn representations from different perspectives, concatenating their learned representations yields a stronger and more robust joint representation for the user-item pair. To calculate the matching score, we then pass this joint representation into a fully connected layer, which enables the model to assign different weights to the features. Besides, since the quantity of implicit data far outweighs that of explicit data in the real world, designing recommendation algorithms that can work with implicit feedback data is extremely important and has been a hot research topic in recommender systems. We therefore focus on implicit feedback in this paper.
The main contributions of this work are as follows.

We point out the significance of incorporating collaborative filtering methods based on representation learning and on matching function learning, and present a general Deep Collaborative Filtering (DeepCF) framework. The proposed framework abandons the traditional deep+shallow pattern and adopts deep models only to implement collaborative filtering with implicit feedback.

We propose a novel model named Collaborative Filtering Network (CFNet), based on the vanilla MLP model, under the DeepCF framework. It has great flexibility to learn the complex matching function while being efficient at learning low-rank relations between users and items.

We conduct extensive experiments on four real-world datasets to demonstrate the effectiveness and rationality of the proposed DeepCF framework.
Related Work
Collaborative Filtering with Implicit Data
Since most users do not tend to rate items, it is often difficult to collect explicit feedback. As a result, the quantity of implicit data, such as clicks, views, collects, or purchases, far outweighs the quantity of explicit data, such as ratings or likes. It is therefore very important to design recommendation algorithms that can work with implicit feedback data [Oard, Kim, and others 1998]. The well-known ALS model [Hu, Koren, and Volinsky 2008] and SVD++ model [Koren 2008] are early explorations of collaborative filtering on datasets with implicit feedback. Both models factorize the binary interaction matrix and assume that users dislike unselected items, i.e., they assign 0 to unselected items in the binary interaction matrix. Several works have further improved collaborative filtering with implicit data by assuming that users prefer selected items over unselected ones [Rendle et al. 2009, Mnih and Teh 2012, He and McAuley 2016].
Collaborative Filtering based on Representation Learning
Since Simon Funk proposed FunkSVD [Funk 2006] in the famous Netflix Prize competition, matrix factorization for collaborative filtering has been widely studied and constantly developed over the past decade [Salakhutdinov and Mnih 2008, Koren, Bell, and Volinsky 2009, Koren 2009, Ma 2013, Hu, Sun, and Liu 2014]. Although these works tried to improve matrix factorization in different ways, e.g., by introducing time, social information, text descriptions, or location, their main idea is still to map users and items into a common representation space where they can be compared directly. Recently, deep learning methods have shown promising results in various areas such as computer vision, speech recognition and natural language processing. Some works have also proposed using DNNs for collaborative filtering based on representation learning. AutoRec [Sedhain et al. 2015]
is the first model attempting to learn user and item representations by using an autoencoder to reconstruct the input ratings. Collaborative Denoising Auto-Encoders (CDAE) [Wu et al. 2016] further improved it by taking both ratings and IDs as input. On the other hand, DMF [Xue et al. 2017] uses a two-pathway neural network architecture to factorize the rating matrix and learn representations. Overall, representation learning-based methods learn representations in different ways and can flexibly incorporate auxiliary data such as images, text descriptions, and demographic information. However, they still resort to the dot product or cosine similarity when predicting the matching score.
Collaborative Filtering based on Matching Function Learning
NeuMF [He et al. 2017] is a recently proposed framework that replaces the dot product used in vanilla MF with a neural network to learn the matching function. To offset the weakness of MLP in capturing low-rank relations, NeuMF unifies MF and MLP in one model. NNCF [Bai et al. 2017] is a variant of NeuMF that takes user neighbors and item neighbors as inputs. ConvNCF [He et al. 2018] uses an outer product to replace the concatenation used in NeuMF so that the model can better learn pairwise correlations between embedding dimensions. Besides NeuMF, many other works attempt to learn the matching function directly by making full use of auxiliary data. For example, Wide&Deep [Cheng et al. 2016] adopts LR and MLP to learn the matching function from the input continuous and categorical features of users and items. DeepFM [Guo et al. 2017] replaces LR with Factorization Machines (FM) to avoid manual feature engineering. NFM [He and Chua 2017] proposes a bi-interaction pooling layer to learn feature crosses. Moreover, tree-based models have also been studied and proven effective [Zhao, Shi, and Hong 2017, Zhu et al. 2017, Wang et al. 2018b]. In this paper, we focus on pure collaborative filtering without auxiliary data; we therefore mainly discuss NeuMF and compare it with the proposed DeepCF framework.
According to the above discussion, both representation learning-based and matching function learning-based collaborative filtering methods have been broadly studied and proven effective. Despite their strengths, both types of methods have weaknesses, namely the limited expressiveness of the dot product and the inefficiency of MLPs in capturing low-rank relations. To the best of our knowledge, no prior work has pointed out the significance of combining the strengths of the two types of collaborative filtering methods to overcome these weaknesses. In this paper, we present a general framework that ensembles the two types of methods to endow the model with great flexibility in learning the matching function while maintaining the ability to learn low-rank relations efficiently.
Preliminaries
Problem Statement
Suppose there are $M$ users and $N$ items in the system. Following [Wu et al. 2016, He et al. 2017], we construct the user-item interaction matrix $\mathbf{Y} \in \mathbb{R}^{M \times N}$ from users' implicit feedback as follows,
$$y_{ui} = \begin{cases} 1, & \text{if the interaction between user } u \text{ and item } i \text{ is observed;} \\ 0, & \text{otherwise.} \end{cases} \quad (1)$$
Compared with explicit feedback, implicit feedback has two major problems. First, unlike ratings, an observed interaction ($y_{ui} = 1$) can only reflect a user's preference indirectly, i.e., it cannot tell how much exactly user $u$ likes item $i$. Second, an unobserved interaction ($y_{ui} = 0$) does not mean user $u$ does not like item $i$; in fact, user $u$ may have never seen item $i$, since there are too many items in a system. These two problems pose huge challenges for learning from implicit data, especially the second one.
Performing collaborative filtering on implicit data, which lacks real negative feedback, is also known as the One-Class Collaborative Filtering (OCCF) problem [Pan et al. 2008]. In general, there are two ways to tackle this problem: one is to treat all unobserved interactions as weak negative instances [Hu, Koren, and Volinsky 2008, Pan et al. 2008], and the other is to sample some negative instances from the unobserved interactions [Pan et al. 2008, Wu et al. 2016, He et al. 2017]. In this paper, we adopt the second method, i.e., we uniformly sample negative instances from the unobserved interactions.
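The uniform negative sampling strategy can be sketched as follows. This is a minimal NumPy illustration with hypothetical names (`sample_negatives` is ours, not from the paper's released code): for each observed pair, we draw a fixed number of unobserved items for the same user.

```python
import numpy as np

def sample_negatives(interactions, num_items, ratio, rng):
    """For each observed (user, item) pair, uniformly sample `ratio`
    unobserved items for the same user as negative instances."""
    observed = set(interactions)
    negatives = []
    for u, _ in interactions:
        for _ in range(ratio):
            j = int(rng.integers(num_items))
            while (u, j) in observed:      # resample until unobserved
                j = int(rng.integers(num_items))
            negatives.append((u, j))
    return negatives

rng = np.random.default_rng(0)
pos = [(0, 1), (0, 3), (1, 2)]             # toy observed interactions
neg = sample_negatives(pos, num_items=10, ratio=4, rng=rng)
```

With a sampling ratio of 4, each of the 3 positives contributes 4 negatives, so `neg` contains 12 pairs, none of which appear in `pos`.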
The recommendation problem with explicit feedback is usually formulated as a rating prediction problem which estimates the missing values in the rating matrix $\mathbf{R}$. The predicted scores are then used to rank items, and finally the top-ranking items are recommended to users. Similarly, the recommendation problem with implicit feedback can be formulated as an interaction prediction problem which estimates the missing values in the interaction matrix $\mathbf{Y}$, i.e., estimates whether the unobserved interactions would happen or not. However, unlike explicit feedback, implicit feedback is discrete and binary, so solving this binary classification problem cannot by itself help us rank and recommend items. One feasible solution is to employ a probabilistic treatment of the interaction matrix $\mathbf{Y}$. We can assume $y_{ui}$ obeys a Bernoulli distribution:

$$P(y_{ui} \mid p_{ui}) = \begin{cases} p_{ui}, & y_{ui} = 1 \\ 1 - p_{ui}, & y_{ui} = 0 \end{cases} = p_{ui}^{\,y_{ui}} (1 - p_{ui})^{1 - y_{ui}}, \quad (2)$$

where $p_{ui}$ is the probability of $y_{ui}$ being equal to 1. Moreover, $p_{ui}$ can be further interpreted as the probability that user $u$ is matched by item $i$. In this case, a value of 1 for $p_{ui}$ indicates that item $i$ perfectly matches user $u$, and a value of 0 indicates that user $u$ and item $i$ do not match at all. Rather than modeling $y_{ui}$, which is discrete and binary, our method models $p_{ui}$ instead. In this manner, we transform the binary classification problem, i.e., the interaction prediction problem, into a matching score prediction problem.

Learning the Model
A model-based method generally assumes that the data can be generated by an underlying model as $\hat{y}_{ui} = f(u, i \mid \Theta)$, where $\hat{y}_{ui}$ denotes the prediction of $p_{ui}$, i.e., the predicted probability that user $u$ is matched by item $i$, $\Theta$ denotes the model parameters, and $f$ denotes the function that maps the model parameters to the predicted score. We thus need to answer two key questions: how to define function $f$ and how to estimate the parameters $\Theta$. We answer the first question in the next section.
For the second question, most existing works estimate parameters by optimizing an objective function. Two types of objective functions are commonly used in recommender systems, namely pointwise loss [Hu, Koren, and Volinsky 2008, He et al. 2016b] and pairwise loss [Rendle et al. 2009, Mnih and Teh 2012, He and McAuley 2016]. In this paper, we explore the pointwise loss only and leave the pairwise loss for future work. Pointwise loss has been widely studied in collaborative filtering with explicit feedback under the regression framework [Funk 2006, Salakhutdinov and Mnih 2008]. The most commonly used pointwise loss is the squared loss. However, the squared loss is not suitable for implicit feedback, because it is derived by assuming that the error between the given rating and the predicted rating obeys a normal distribution, which does not hold in the implicit feedback scenario since $y_{ui}$ is discrete and binary. As mentioned in the Problem Statement, we instead assume $y_{ui}$ obeys a Bernoulli distribution. By replacing $p_{ui}$ with $\hat{y}_{ui}$ in Equation 2, we can define the likelihood function as

$$L(\Theta) = \prod_{(u,i) \in \mathcal{Y}^+ \cup \mathcal{Y}^-} \hat{y}_{ui}^{\,y_{ui}} (1 - \hat{y}_{ui})^{1 - y_{ui}}, \quad (3)$$
where $\mathcal{Y}^+$ denotes all the observed interactions in $\mathbf{Y}$ and $\mathcal{Y}^-$ denotes the sampled unobserved interactions, i.e., the negative instances. Taking the negative logarithm of the likelihood (NLL), we obtain

$$\ell = -\sum_{(u,i) \in \mathcal{Y}^+ \cup \mathcal{Y}^-} \left( y_{ui} \log \hat{y}_{ui} + (1 - y_{ui}) \log (1 - \hat{y}_{ui}) \right). \quad (4)$$
Based on the above assumptions and formulations, we finally obtain an objective function that is suitable for learning from implicit feedback data, i.e., the binary cross-entropy loss function.
To sum up, the recommendation problem with implicit feedback can be formulated as an interaction prediction problem. To endow the algorithm with the ability to rank items, we employ a probabilistic treatment of the interaction matrix $\mathbf{Y}$ so that $y_{ui}$ is assumed to obey a Bernoulli distribution. Instead of modeling $y_{ui}$, we model $p_{ui}$, the probability of $y_{ui}$ being equal to 1. Since $p_{ui}$ can also be interpreted as the probability that user $u$ is matched by item $i$, the interaction prediction problem can be transformed into a matching score prediction problem. In this manner, using maximum likelihood estimation to estimate the model parameters is equivalent to minimizing the binary cross-entropy between $y_{ui}$ and $\hat{y}_{ui}$.
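As a quick numeric illustration of this equivalence, the negative Bernoulli log-likelihood of Equation 4 is exactly the (summed) binary cross-entropy. The sketch below uses made-up predictions and a hypothetical helper name:

```python
import numpy as np

def bce(y_true, y_pred):
    """Negative log-likelihood of y_ui under Bernoulli(y_hat_ui),
    summed over observed positives and sampled negatives (Eq. 4)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return -np.sum(y_true * np.log(y_pred)
                   + (1.0 - y_true) * np.log(1.0 - y_pred))

# two positives predicted high, two negatives predicted low
loss = bce([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
```

Confident predictions on both classes yield a small loss; flipping the predictions would make the `log` terms large and the loss blow up, which is what drives $\hat{y}_{ui}$ toward $y_{ui}$ during training.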
The Proposed Framework
In this section, we first introduce the general process of representation learning-based CF methods and matching function learning-based CF methods. Then we elaborate on these two types of methods and the MLP implementations we use in this paper. Finally, we illustrate how to fuse the two methods in the proposed DeepCF framework and how to learn the final model.
The General Process
The processes of representation learning-based CF methods and matching function learning-based CF methods can be summarized by the workflow shown in Figure 1. Both types of methods start from extracting data from the database. IDs, historical behaviors and other auxiliary data can all be used to construct the initial representations of user $u$ and item $i$, denoted by $\mathbf{v}_u^U$ and $\mathbf{v}_i^I$ respectively. The CF model then calculates the latent representations $\mathbf{p}_u$ and $\mathbf{q}_i$ for the user and the item. Next, a non-parametric operation is performed on $\mathbf{p}_u$ and $\mathbf{q}_i$ to aggregate the latent representations. Finally, a mapping function is used to calculate the matching score $\hat{y}_{ui}$. The last two steps together are referred to as the matching function.
Representation Learning
For representation learning-based CF methods, the model focuses more on learning the representation functions, while the matching function is usually assumed to be simple and non-parametric, e.g., the dot product or cosine similarity. In this manner, the model is supposed to learn to map users and items into a common space where they can be compared directly. For example, taking one-hot IDs as inputs, vanilla MF [Funk 2006] adopts linear embedding functions to learn the latent representations. The latent representations $\mathbf{p}_u$ and $\mathbf{q}_i$ are then aggregated by the dot product to calculate the matching score; in this case, the mapping function is assumed to be the identity function. As another example, taking ratings as inputs, DMF [Xue et al. 2017] adopts MLPs as the representation functions to learn better latent representations by making full use of the non-linearity and high capacity of neural networks. The cosine similarity between $\mathbf{p}_u$ and $\mathbf{q}_i$ is then used as the matching score.
In this paper, we focus on implicit data only, so no auxiliary data are used. The user-item interaction matrix $\mathbf{Y}$ is taken as input, i.e., user $u$ is represented by the corresponding row $\mathbf{Y}_{u*}$ and item $i$ by the corresponding column $\mathbf{Y}_{*i}$. We adopt MLPs to learn the latent representations for users and items. The representation learning part for users can therefore be defined as:
$$\mathbf{p}_u = a_n\!\left(W_n^T \left( \cdots a_2\!\left(W_2^T\, a_1\!\left(W_1^T \mathbf{Y}_{u*} + \mathbf{b}_1\right) + \mathbf{b}_2\right) \cdots \right) + \mathbf{b}_n\right), \quad (5)$$

where $W_\ell$, $\mathbf{b}_\ell$, and $a_\ell$ denote the weight matrix, bias vector, and activation function for the $\ell$-th layer's perceptron respectively; we use the ReLU activation function in this paper. The latent representation $\mathbf{q}_i$ for item $i$ is calculated from $\mathbf{Y}_{*i}$ in the same manner. Different from the existing representation learning-based CF methods, the matching function part is defined as:

$$\hat{y}_{ui} = \sigma\!\left(W_{out}^T \left(\mathbf{p}_u \odot \mathbf{q}_i\right)\right), \quad (6)$$

where $W_{out}$ and $\sigma$ denote the weight matrix of the output layer and the sigmoid function respectively, and $\odot$ denotes the element-wise product. By substituting the non-parametric dot product or cosine similarity with an element-wise product followed by a parametric neural network layer, the model still focuses on capturing low-rank relations between users and items but is more expressive, since the importance of the latent dimensions can differ and the mapping can be non-linear. We refer to the representation learning component implemented by Equations 5 and 6 as CFNet-rl.
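A forward pass of this component can be sketched in NumPy as follows. This is a toy illustration under our own naming and dimensions (one-layer towers, random weights), not the paper's released implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cfnet_rl(v_u, v_i, user_tower, item_tower, w_out):
    """Representation learning component: two MLP towers map the sparse
    interaction vectors to latent factors p_u and q_i (Eq. 5); their
    element-wise product feeds a weighted sigmoid output layer (Eq. 6)."""
    p_u, q_i = v_u, v_i
    for W, b in user_tower:
        p_u = relu(W.T @ p_u + b)
    for W, b in item_tower:
        q_i = relu(W.T @ q_i + b)
    return sigmoid(w_out @ (p_u * q_i))    # scalar matching score

rng = np.random.default_rng(0)
n_users, n_items, dim = 5, 8, 4
v_u = rng.integers(0, 2, n_items).astype(float)  # row Y_u* of Y
v_i = rng.integers(0, 2, n_users).astype(float)  # column Y_*i of Y
user_tower = [(rng.standard_normal((n_items, dim)), np.zeros(dim))]
item_tower = [(rng.standard_normal((n_users, dim)), np.zeros(dim))]
w_out = rng.standard_normal(dim)
score = cfnet_rl(v_u, v_i, user_tower, item_tower, w_out)
```

The sigmoid keeps the output in (0, 1), so it can be read directly as the predicted matching probability $\hat{y}_{ui}$.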
Matching Function Learning
Although matching function learning-based CF methods focus more on learning the matching function, a representation learning part is still necessary: the initial representations are usually extremely sparse and high-dimensional, making it difficult for the model to learn the matching function directly. Therefore, matching function learning-based CF methods usually use a linear embedding layer to learn the latent representations for users and items. With dense low-dimensional latent representations, the model is able to learn the matching function more efficiently.
In this paper, we adopt an MLP to learn the matching function. Instead of IDs, we take the interaction matrix $\mathbf{Y}$ as input. Therefore, the matching function learning component can be formalized as:
$$\mathbf{a}_0 = \begin{bmatrix} P^T \mathbf{Y}_{u*} \\ Q^T \mathbf{Y}_{*i} \end{bmatrix}, \qquad \mathbf{a}_\ell = a_\ell\!\left(W_\ell^T \mathbf{a}_{\ell-1} + \mathbf{b}_\ell\right), \qquad \hat{y}_{ui} = \sigma\!\left(W_{out}^T \mathbf{a}_n\right), \quad (7)$$

where $P$ and $Q$ are the parameter matrices of the linear embedding layers; the meanings of the other notations are the same as in CFNet-rl. In this manner, the representation learning functions are implemented by the linear embedding layers. The latent representations $P^T \mathbf{Y}_{u*}$ and $Q^T \mathbf{Y}_{*i}$ are then aggregated by a simple concatenation operation. Finally, an MLP is used as the mapping function to calculate the matching score $\hat{y}_{ui}$. Notice that although concatenation is the simplest aggregation operation, it maximally retains the information passed from the previous layer and allows the model to make full use of the flexibility of the MLP.
In summary, the matching function learning component used in this paper is implemented by Equation 7 and is called CFNet-ml.
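Analogously, the matching function learning component can be sketched as below, again a toy NumPy version with hypothetical names and dimensions rather than the authors' code: linear embeddings, concatenation, then an MLP ending in a sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cfnet_ml(v_u, v_i, P, Q, layers, w_out):
    """Matching function learning component (Eq. 7): linear embeddings
    of the interaction vectors are concatenated and passed through an
    MLP that outputs the matching score."""
    a = np.concatenate([P.T @ v_u, Q.T @ v_i])   # a_0: joint input
    for W, b in layers:
        a = np.maximum(W.T @ a + b, 0.0)         # ReLU hidden layers
    return sigmoid(w_out @ a)                    # scalar matching score

rng = np.random.default_rng(1)
n_users, n_items, dim = 5, 8, 4
v_u = rng.integers(0, 2, n_items).astype(float)  # row Y_u* of Y
v_i = rng.integers(0, 2, n_users).astype(float)  # column Y_*i of Y
P = rng.standard_normal((n_items, dim))          # user embedding matrix
Q = rng.standard_normal((n_users, dim))          # item embedding matrix
layers = [(rng.standard_normal((2 * dim, dim)), np.zeros(dim))]
w_out = rng.standard_normal(dim)
score = cfnet_ml(v_u, v_i, P, Q, layers, w_out)
```

Compared with the CFNet-rl sketch, the only structural difference is where the user and item signals meet: here they are concatenated before the MLP, so the network itself learns how they interact.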
Fusion and Learning
Fusion
In the previous two subsections, we presented the MLP implementations of the two types of methods, i.e., the CFNet-rl model and the CFNet-ml model. To incorporate these two models, we need to design a fusion strategy so that they can enhance each other. One of the most common strategies is to concatenate the learned representations to obtain a joint representation and then feed it into a fully connected layer. In our case, for CFNet-rl, the matching function shown in Equation 6 can be split into two steps: the model first calculates the element-wise product of the user and item latent factors, and then sums it up with different weights. The product vector obtained in the first step is called the predictive vector in this paper. For CFNet-ml, the last layer of the MLP is likewise called the predictive vector. In both cases, the predictive vector can be viewed as the representation of the corresponding user-item pair. Since the two types of CF methods have different advantages and learn the predictive vectors from different perspectives, concatenating the two predictive vectors yields a stronger and more robust joint representation of the user-item pair. Moreover, the subsequent fully connected layer enables the model to assign different weights to the features contained in the joint representation. Denoting the predictive vectors of the representation learning component and the matching function learning component by $\mathbf{h}^{rl}$ and $\mathbf{h}^{ml}$ respectively, the output of the fusion model can be defined as:
$$\hat{y}_{ui} = \sigma\!\left(W_{out}^T \begin{bmatrix} \mathbf{h}^{rl} \\ \mathbf{h}^{ml} \end{bmatrix}\right). \quad (8)$$
Using Equation 8 to incorporate CFNet-rl and CFNet-ml, we finally obtain the proposed CFNet model. The architecture of CFNet is shown in Figure 2.
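The fusion step of Equation 8 reduces to a single weighted layer over the concatenated predictive vectors. A toy sketch with made-up vectors and weights (the helper name is ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cfnet_fuse(h_rl, h_ml, w_out):
    """Fusion layer (Eq. 8): concatenate the two predictive vectors and
    feed the joint representation into one weighted sigmoid layer."""
    return sigmoid(w_out @ np.concatenate([h_rl, h_ml]))

h_rl = np.array([0.2, -0.5, 1.0])   # predictive vector from CFNet-rl
h_ml = np.array([0.7, 0.1])         # predictive vector from CFNet-ml
w_out = np.array([0.3, -0.2, 0.5, 1.0, -0.4])
score = cfnet_fuse(h_rl, h_ml, w_out)
```

Because the two halves of `w_out` are learned independently, the model can weight the features contributed by each component differently, which is exactly the flexibility the fully connected fusion layer is meant to provide.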
Table 1: Statistics of the four datasets.

Statistics   | ml-1m   | lastfm | AMusic | AToy
# of Users   | 6040    | 1741   | 1776   | 3137
# of Items   | 3706    | 2665   | 12929  | 33953
# of Ratings | 1000209 | 69149  | 46087  | 84642
Sparsity     | 0.9553  | 0.9851 | 0.9980 | 0.9992
Table 2: Comparison results of existing methods and CFNet (HR@10 and NDCG@10). The last column reports the improvement of CFNet over NeuMF.

Datasets | Measures | ItemPop | eALS   | DMF    | NeuMF  | CFNet-rl | CFNet-ml | CFNet  | CFNet vs. NeuMF
ml-1m    | HR       | 0.4535  | 0.7018 | 0.6565 | 0.7210 | 0.7127   | 0.7073   | 0.7253 | 0.6%
ml-1m    | NDCG     | 0.2542  | 0.4280 | 0.3761 | 0.4387 | 0.4336   | 0.4264   | 0.4416 | 0.7%
lastfm   | HR       | 0.6628  | 0.8265 | 0.8840 | 0.8868 | 0.8840   | 0.8834   | 0.8995 | 1.4%
lastfm   | NDCG     | 0.3862  | 0.5162 | 0.5804 | 0.6007 | 0.6001   | 0.5919   | 0.6186 | 3.0%
AMusic   | HR       | 0.2483  | 0.3711 | 0.3744 | 0.3891 | 0.3947   | 0.4071   | 0.4116 | 5.8%
AMusic   | NDCG     | 0.1304  | 0.2352 | 0.2149 | 0.2391 | 0.2504   | 0.2420   | 0.2601 | 8.8%
AToy     | HR       | 0.2840  | 0.3717 | 0.3535 | 0.3650 | 0.3746   | 0.3931   | 0.4150 | 13.7%
AToy     | NDCG     | 0.1518  | 0.2434 | 0.2016 | 0.2155 | 0.2271   | 0.2293   | 0.2513 | 16.6%
Learning
As discussed in the previous section, the objective function minimized for the DeepCF framework is the binary cross-entropy function. To optimize the model, we use mini-batch Adam [Kingma and Ba 2014]. The batch size is fixed to 256 and the learning rate to 0.001. The model parameters are randomly initialized from a Gaussian distribution (with a mean of 0 and a standard deviation of 0.01), and the negative instances are uniformly sampled from the unobserved interactions in each iteration.

Pretraining
According to [Erhan et al. 2010], initialization is of great significance to the convergence and performance of deep learning models. Using pre-trained models to initialize the ensemble model can significantly speed up convergence and improve the final performance. Since CFNet is composed of two components, i.e., CFNet-rl and CFNet-ml, we can pre-train these two components and use them to initialize CFNet. Notice that CFNet-rl and CFNet-ml are trained from scratch using Adam, while CFNet with pre-training is optimized by vanilla SGD. This is because Adam requires momentum information from previously updated parameters, which is not preserved when initializing CFNet from the pre-trained components.
Experiments
In this section, we conduct experiments to demonstrate the effectiveness of the proposed DeepCF framework and its MLP implementation (i.e., the CFNet model). We also verify the utility of pre-training by comparing the CFNet models with and without pre-training. Hyperparameter sensitivity analysis is discussed in the last part of this section. We implement the proposed model based on Keras (https://github.com/keras-team/keras) and TensorFlow (https://github.com/tensorflow/tensorflow); the code will be released publicly upon acceptance.

Experimental Settings
Dataset
We evaluate our models on four public datasets: MovieLens 1M (ml-1m, https://grouplens.org/datasets/movielens/), LastFM (lastfm, http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/), Amazon Music (AMusic) and Amazon Toys (AToy) (http://jmcauley.ucsd.edu/data/amazon/). The ml-1m dataset has been preprocessed by its provider: each user has at least 20 ratings and each item has been rated by at least 5 users. We process the other three datasets in the same way. The statistics of the four datasets are summarized in Table 1.
Evaluation Protocols
Following [He et al. 2017], we adopt the leave-one-out evaluation, i.e., the latest interaction of each user is held out for testing. Since ranking all items is time-consuming, we randomly sample 100 unobserved interactions for each user and rank the test item among these 100 items according to the predictions. Two widely adopted evaluation measures, Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG), are used to evaluate the ranking performance. The ranked list is truncated at 10 for both measures. Intuitively, HR measures whether the test item is present in the top-10 list, and NDCG measures the ranking quality by assigning higher scores to hits at top positions.
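The per-user computation of these two measures can be sketched as follows (our own helper, not the authors' evaluation script); `scores` holds the predictions for the test item together with the sampled negatives, and `test_idx` marks the test item's position:

```python
import numpy as np

def hr_ndcg_at_k(scores, test_idx, k=10):
    """Leave-one-out HR@k and NDCG@k for a single user: HR is 1 if the
    test item appears in the top-k list; NDCG discounts the hit by its
    rank, 1 / log2(rank + 2) with a 0-based rank."""
    ranked = np.argsort(scores)[::-1][:k]           # top-k item positions
    if test_idx not in ranked:
        return 0.0, 0.0
    rank = int(np.where(ranked == test_idx)[0][0])  # 0-based hit position
    return 1.0, 1.0 / np.log2(rank + 2)

scores = np.array([0.1, 0.9, 0.3, 0.8, 0.05])       # toy candidate scores
hr, ndcg = hr_ndcg_at_k(scores, test_idx=3, k=3)
```

In the toy run, item 3 is ranked second of the top-3, so HR is 1 and NDCG is 1/log2(3); averaging these per-user values over all users gives the reported HR@10 and NDCG@10.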
Comparison Results
The comparison methods are as follows.

ItemPop is a non-personalized method that is often used as a benchmark for recommendation tasks. Items are ranked by their popularity, i.e., the number of interactions.

eALS [He et al. 2016b] is a state-of-the-art MF method. It uses all unobserved interactions as negative instances and weights them non-uniformly by item popularity.

DMF [Xue et al. 2017] is a state-of-the-art representation learning-based MF method which performs deep matrix factorization with a normalized cross-entropy loss. We ignore the explicit ratings and take the implicit feedback as input.

NeuMF [He et al. 2017] is a state-of-the-art matching function learning-based MF method which combines the hidden layers of GMF and MLP to learn the interaction function based on a cross-entropy loss. It is the work most closely related to the proposed models. Different from our models, it adopts the deep+shallow pattern that has been widely used in many works such as [Cheng et al. 2016, Guo et al. 2017]. Moreover, NeuMF takes IDs as input, while the proposed CFNet takes the interaction matrix as input.
Since the proposed models focus on modeling the relationship between users and items, we mainly compare with user-item models. The comparison results are listed in Table 2. According to the table, we have the following key observations:

CFNet achieves the best performance in general and obtains substantial improvements over the state-of-the-art methods. Most importantly, the improvement increases along with data sparsity (the datasets in Table 2 are arranged in order of increasing sparsity). This justifies the effectiveness of the proposed DeepCF framework, which combines representation learning-based and matching function learning-based CF methods.

The performance of DMF degrades severely when taking implicit feedback as input, while the proposed CFNet-rl consistently outperforms it. This indicates that replacing the non-parametric cosine similarity with an element-wise product followed by a parametric neural network layer significantly improves performance.
Impact of Pretraining
Different from CFNet with pre-training, CFNet without pre-training is learned with mini-batch Adam from random initializations. As shown in Table 3, CFNet with pre-training outperforms CFNet without pre-training in all cases. This result verifies the utility of the pre-training process, which ensures that CFNet-rl and CFNet-ml learn features from different perspectives and therefore allows the fused model to generate better results.
Sensitivity Analysis of Hyperparameters
Negative Sampling Ratio
To analyze the effect of the negative sampling ratio, i.e., the number of negative samples per positive instance, we test different ratios on the ml-1m dataset. From the results shown in Figure 3, we find that sampling merely one or two negative instances per positive instance is not enough, and sampling more negative instances is helpful. The best HR@10 is obtained with a negative sampling ratio of 3 and the best NDCG@10 with a ratio of 6; overall, the optimal ratio is around 3 to 7. Sampling even more negative instances not only requires more training time but also degrades performance, which is consistent with the results reported in [He et al. 2017].
The Number of Predictive Factors
Another hyperparameter of the CFNet model is the number of predictive factors, i.e., the dimensions of the predictive vectors $\mathbf{h}^{rl}$ and $\mathbf{h}^{ml}$. As shown in Table 4, the proposed model achieves the best performance with 64 predictive factors on all datasets except AMusic, where the best performance is achieved with 16 factors. In general, more predictive factors lead to better performance, since they endow the model with a larger capacity and greater representation ability.
Table 3: Performance of CFNet without and with pre-training (HR@10 and NDCG@10).

Datasets | HR (w/o pre-training) | NDCG (w/o pre-training) | HR (with pre-training) | NDCG (with pre-training)
ml-1m    | 0.6962 | 0.4222 | 0.7253 | 0.4416
lastfm   | 0.8685 | 0.5920 | 0.8995 | 0.6186
AMusic   | 0.3530 | 0.2204 | 0.4116 | 0.2601
AToy     | 0.3067 | 0.1653 | 0.4150 | 0.2513
Table 4: Performance of CFNet with different dimensions of the predictive vectors (HR@10 and NDCG@10).

Datasets | Measures | 8      | 16     | 32     | 64
ml-1m    | HR       | 0.6820 | 0.6982 | 0.7157 | 0.7253
ml-1m    | NDCG     | 0.3992 | 0.4161 | 0.4351 | 0.4416
lastfm   | HR       | 0.8840 | 0.8857 | 0.8937 | 0.8995
lastfm   | NDCG     | 0.6049 | 0.6111 | 0.6143 | 0.6186
AMusic   | HR       | 0.4003 | 0.4313 | 0.4262 | 0.4116
AMusic   | NDCG     | 0.2480 | 0.2617 | 0.2661 | 0.2601
AToy     | HR       | 0.3797 | 0.3902 | 0.4026 | 0.4150
AToy     | NDCG     | 0.2273 | 0.2331 | 0.2383 | 0.2513
Conclusion and Future Work
In this work, we have explored the possibility of fusing representation learning-based CF methods and matching function learning-based CF methods. We have devised a general framework, DeepCF, and proposed its MLP implementation, CFNet. The DeepCF framework is simple but effective. Although we have implemented the two components with MLPs in this paper, different types of representation learning-based methods and matching function learning-based methods can be integrated under the DeepCF framework. This work points out the significance of incorporating these two types of methods, allowing the model to have both great flexibility to learn the complex matching function and high efficiency in learning low-rank relations between users and items. In future work, we will study the following problems. First, auxiliary data can be used to further improve the initial representations of users and items; richer information usually leads to better performance. Second, beyond element-wise product and concatenation, it is also very interesting to explore other aggregation methods. Third, DeepCF does not only support pointwise loss; using pairwise loss is also a feasible solution for learning the model. Finally, although we use DeepCF to solve the top-N recommendation problem with implicit data, it is also suitable for other data mining tasks that study the relations between two kinds of objects.
Acknowledgments
This work was supported by NSFC (61502543, 61876193 and 61672313), Guangdong Natural Science Funds for Distinguished Young Scholar (2016A030306014), Tip-top Scientific and Technical Innovative Youth Talents of Guangdong special support program (2016TQ03X542), National Key Research and Development Program of China (2016YFB1001003), and NSF through grants IIS-1526499, IIS-1763325, and CNS-1626432.
References
 [Bai et al.2017] Bai, T.; Wen, J.-R.; Zhang, J.; and Zhao, W. X. 2017. A neural collaborative filtering model with interaction-based neighborhood. In CIKM, 1979–1982.
 [Beutel et al.2018] Beutel, A.; Covington, P.; Jain, S.; Xu, C.; Li, J.; Gatto, V.; and Chi, E. H. 2018. Latent cross: Making use of context in recurrent recommender systems. In WSDM, 46–54.
 [Cheng et al.2016] Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10.
 [Erhan et al.2010] Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S. 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research 11(Feb):625–660.
 [Funk2006] Funk, S. 2006. Netflix update: Try this at home. http://sifter.org/~simon/journal/20061211.html. Online; accessed 27-June-2018.
 [Guo et al.2017] Guo, H.; Tang, R.; Ye, Y.; Li, Z.; and He, X. 2017. DeepFM: A factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247.
 [He and Chua2017] He, X., and Chua, T.-S. 2017. Neural factorization machines for sparse predictive analytics. In SIGIR, 355–364.
 [He and McAuley2016] He, R., and McAuley, J. 2016. VBPR: Visual Bayesian personalized ranking from implicit feedback. In AAAI, 144–150.
 [He et al.2016a] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016a. Deep residual learning for image recognition. In CVPR, 770–778.
 [He et al.2016b] He, X.; Zhang, H.; Kan, M.-Y.; and Chua, T.-S. 2016b. Fast matrix factorization for online recommendation with implicit feedback. In SIGIR, 549–558.
 [He et al.2017] He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S. 2017. Neural collaborative filtering. In WWW, 173–182.
 [He et al.2018] He, X.; Du, X.; Wang, X.; Tian, F.; Tang, J.; and Chua, T.-S. 2018. Outer product-based neural collaborative filtering. In IJCAI.
 [Hornik, Stinchcombe, and White1989] Hornik, K.; Stinchcombe, M.; and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural networks 2(5):359–366.
 [Hu et al.2017] Hu, Q.-Y.; Zhao, Z.-L.; Wang, C.-D.; and Lai, J.-H. 2017. An item orientated recommendation algorithm from the multi-view perspective. Neurocomputing 269:261–272.
 [Hu et al.2018] Hu, Q.-Y.; Huang, L.; Wang, C.-D.; and Chao, H.-Y. 2018. Item orientated recommendation by multi-view intact space learning with overlapping. Knowledge-Based Systems.
 [Hu, Koren, and Volinsky2008] Hu, Y.; Koren, Y.; and Volinsky, C. 2008. Collaborative filtering for implicit feedback datasets. In ICDM, 263–272.
 [Hu, Sun, and Liu2014] Hu, L.; Sun, A.; and Liu, Y. 2014. Your neighbors affect your ratings: on geographical neighborhood influence to rating prediction. In SIGIR, 345–354.
 [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 [Koren, Bell, and Volinsky2009] Koren, Y.; Bell, R.; and Volinsky, C. 2009. Matrix factorization techniques for recommender systems. Computer 42(8).
 [Koren2008] Koren, Y. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD, 426–434.
 [Koren2009] Koren, Y. 2009. Collaborative filtering with temporal dynamics. In KDD, 447–456.
 [Ma2013] Ma, H. 2013. An experimental study on implicit social recommendation. In SIGIR, 73–82.
 [Mnih and Teh2012] Mnih, A., and Teh, Y. W. 2012. Learning label trees for probabilistic modelling of implicit feedback. In NIPS, 2816–2824.
 [Oard, Kim, and others1998] Oard, D. W.; Kim, J.; et al. 1998. Implicit feedback for recommender systems. In Proceedings of the AAAI workshop on recommender systems, volume 83.
 [Pan et al.2008] Pan, R.; Zhou, Y.; Cao, B.; Liu, N. N.; Lukose, R.; Scholz, M.; and Yang, Q. 2008. One-class collaborative filtering. In ICDM, 502–511.
 [Rendle et al.2009] Rendle, S.; Freudenthaler, C.; Gantner, Z.; and Schmidt-Thieme, L. 2009. BPR: Bayesian personalized ranking from implicit feedback. In UAI, 452–461.
 [Salakhutdinov and Mnih2008] Salakhutdinov, R., and Mnih, A. 2008. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In ICML, 880–887.
 [Sedhain et al.2015] Sedhain, S.; Menon, A. K.; Sanner, S.; and Xie, L. 2015. AutoRec: Autoencoders meet collaborative filtering. In WWW, 111–112.
 [Serban et al.2016] Serban, I. V.; Sordoni, A.; Bengio, Y.; Courville, A. C.; and Pineau, J. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, volume 16, 3776–3784.
 [Srivastava et al.2018] Srivastava, R.; Palshikar, G. K.; Chaurasia, S.; and Dixit, A. 2018. What’s next? a recommendation system for industrial training. Data Science and Engineering 3(3):232–247.
 [Wang et al.2018a] Wang, C.-D.; Deng, Z.-H.; Lai, J.-H.; and Yu, P. S. 2018a. Serendipitous recommendation in e-commerce using innovator-based collaborative filtering. IEEE TCYB.
 [Wang et al.2018b] Wang, X.; He, X.; Feng, F.; Nie, L.; and Chua, T.-S. 2018b. TEM: Tree-enhanced embedding model for explainable recommendation. In WWW, 1543–1552.
 [Wu et al.2016] Wu, Y.; DuBois, C.; Zheng, A. X.; and Ester, M. 2016. Collaborative denoising auto-encoders for top-n recommender systems. In WSDM, 153–162.
 [Xu, He, and Li2018] Xu, J.; He, X.; and Li, H. 2018. Deep learning for matching in search and recommendation. In SIGIR Tutorial, 1365–1368.
 [Xue et al.2017] Xue, H.-J.; Dai, X.-Y.; Zhang, J.; Huang, S.; and Chen, J. 2017. Deep matrix factorization models for recommender systems. In IJCAI, 3203–3209.
 [Zhao et al.2018] Zhao, Z.-L.; Huang, L.; Wang, C.-D.; and Huang, D. 2018. Low-rank and sparse cross-domain recommendation algorithm. In DASFAA.
 [Zhao, Shi, and Hong2017] Zhao, Q.; Shi, Y.; and Hong, L. 2017. GB-CENT: Gradient boosted categorical embedding and numerical trees. In WWW, 1311–1319.
 [Zhu et al.2017] Zhu, J.; Shan, Y.; Mao, J.; Yu, D.; Rahmanian, H.; and Zhang, Y. 2017. Deep embedding forest: Forest-based serving with deep embedding features. In KDD, 1703–1711.