BCFNet
Collaborative Filtering (CF) based recommendation methods have been widely studied, and can be generally categorized into two types, i.e., representation learning-based CF methods and matching function learning-based CF methods. Representation learning tries to learn a common low-dimensional space for the representations of users and items. In this case, a user and an item match better if they have higher similarity in that common space. Matching function learning tries to directly learn the complex matching function that maps user-item pairs to matching scores. Although both methods are well developed, they suffer from two fundamental flaws, i.e., representation learning resorts to applying a dot product which has limited expressiveness on the latent features of users and items, while matching function learning has weakness in capturing low-rank relations. To overcome such flaws, we propose a novel recommendation model named Balanced Collaborative Filtering Network (BCFNet), which has the strengths of both types of methods. In addition, an attention mechanism is designed to better capture the hidden information within implicit feedback and strengthen the learning ability of the neural network. Furthermore, a balance module is designed to alleviate the overfitting issue in DNNs. Extensive experiments on eight real-world datasets demonstrate the effectiveness of the proposed model.
Over the past decades, recommender systems have been extensively studied and widely deployed in many different scenarios to alleviate the information overload problem. Due to the distinguishing capability of utilizing collective wisdom and experiences, Collaborative Filtering (CF) algorithms have been widely used to build recommender systems [46, 56, 26, 5, 47].
Matrix factorization is an important model in CF [25], which assumes that a relationship can be established between users and items through latent factors. By learning a common low-dimensional space for the representations of users and items, where they can be compared directly, the relevance of a user and an item can be calculated by their similarity. In this way, matrix factorization can predict a personalized ranking for an individual user over a set of items. Unfortunately, in matrix factorization, the mapping relationship between the original representation space and the latent space is assumed to be linear, which cannot always be guaranteed.
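As a minimal sketch of the scoring step described above (not the paper's implementation), the dot product between user and item latent factors directly yields a personalized ranking. The factor matrices `P` and `Q` below are random stand-ins for factors that would normally be learned:

```python
import numpy as np

# Toy latent factors; in a real system these are learned from ratings.
rng = np.random.default_rng(0)
num_users, num_items, k = 4, 5, 3
P = rng.normal(size=(num_users, k))   # user latent factors
Q = rng.normal(size=(num_items, k))   # item latent factors

def mf_score(u, i):
    """Relevance of item i for user u: a plain dot product in latent space."""
    return float(P[u] @ Q[i])

def rank_items(u):
    """Personalized ranking of all items for user u, best first."""
    scores = Q @ P[u]
    return list(np.argsort(-scores))
```

Because the score is a bilinear function of the embeddings, the mapping from the original space to the score is linear in each factor, which is exactly the limitation the paragraph above points out.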
Since Deep Neural Networks (DNNs) are extremely good at representation learning of complex relationships, deep learning methods have been widely explored and have shown promising results in various areas such as computer vision, speech recognition and natural language processing [16, 14, 43, 3]. In the past few years, many works have also adopted DNNs for recommendation and generated more accurate predictions. To better learn the complex mapping between these two spaces, Deep Structured Semantic Models (DSSM) were proposed [28], which use a deep neural network to rank results for web search. Motivated by DSSM, Xue et al. [54] proposed Deep Matrix Factorization (DMF), which uses a two-pathway neural network architecture to replace the linear embedding operation used in vanilla matrix factorization. However, these models still resort to using the inner product as the matching function, which simply combines the multiplication of latent features linearly and seriously limits the expressiveness of the model when predicting matching scores. In order to learn better representations for users and items, it is a good choice to replace the dot product with a deep neural network, which can lead to better recommendation performance [53]. DNNs are very suitable for learning the complex matching function, since they are capable of approximating any continuous function [23]. For example, He et al. [21] proposed NeuMF under the Neural Collaborative Filtering (NCF) framework, which replaces the dot product operations in matrix factorization with a multi-layer neural network to capture the nonlinear relationship between users and items. By taking the concatenation of the user embedding and the item embedding as the input of a Multi-Layer Perceptron (MLP) model, NeuMF is able to learn the interaction between users and items, from which the prediction can be made. In particular, it is capable of learning the complex mapping relationship between the user-item representation and the matching score. Therefore, compared with traditional MF methods, using an MLP to replace the dot product can learn a better matching function.
However, as revealed in [4], MLP is very inefficient in capturing low-rank relations. In fact, using the dot product to estimate the matching score in traditional matrix factorization methods artificially restricts the model to learning similarity, a low-rank relation that is thought to be positively related to the matching score according to human experience.
Moreover, since many of the training samples in recommender systems suffer from the sparsity issue, only a relatively small number of ratings can be fed into the MLP. A DNN-based model with massive parameters may thus easily suffer from overfitting. According to the above discussion, we can see that there are two types of methods for implementing collaborative filtering: one is mainly based on representation learning and the other is mainly based on matching function learning. Since these two types of methods have different advantages in learning representations from different perspectives, a stronger and more robust joint representation for a user-item pair can be obtained by concatenating their learned representations.
In our previous work [10], we first used these two types of CF methods to obtain different representations for the input user-item pair, which are integrated together to form the Deep Collaborative Filtering (DeepCF) framework. In this paper, as an extension of DeepCF, before feeding the vectors into DNNs, we first input them into a feedforward attention layer, which can improve the representation ability of the deep neural networks. By allowing different parts to contribute differently when compressing them into a single representation, attention-based architectures can learn to focus their "attention" on specific parts. Higher weights indicate that the corresponding factors are more informative for the recommendation. In addition, to alleviate the overfitting issue and offset the weakness of MLP in capturing low-rank relations, a balance module is introduced by means of the generalized matrix factorization (GMF) model. Therefore, a novel model named Balanced Collaborative Filtering Network (BCFNet) is proposed, which consists of three sub-models, namely attentive representation learning (BCFNet-rl), attentive matching function learning (BCFNet-ml) and a balance module (BCFNet-bm).
The main contributions of this work are as follows.
We point out the significance of incorporating collaborative filtering methods based on representation learning and matching function learning, and then propose a novel BCFNet model that combines attentive representation learning, attentive matching function learning and a balance module. The proposed model adopts the Deep+Shallow pattern and employs an attention mechanism for collaborative filtering with implicit feedback.
A feedforward attention mechanism is utilized to better capture the hidden information within implicit feedback and strengthen the learning ability of the neural network. A balance module is also designed to alleviate the overfitting issue caused by the high sparsity of interaction information. These two strategies enable the proposed BCFNet model to have great flexibility in learning the complex matching function and to effectively learn low-rank relations between users and items.
Extensive experiments are conducted on eight real-world datasets to demonstrate the effectiveness and rationality of the BCFNet model. The results show that the proposed BCFNet model consistently outperforms the state-of-the-art methods.
The rest of this paper is organized as follows. Section 2 briefly reviews the related work. Section 3 presents the preliminaries. Section 4 introduces the BCFNet model in detail. Section 5 presents and analyzes the experimental results. Finally, Section 6 concludes this paper.
Compared with implicit data, it is more difficult to collect explicit feedback (e.g., product ratings) because most users tend not to rate items. In fact, since users do not need to express their preference explicitly, implicit feedback, such as clicks, view times, collect or purchase history, can be collected at a larger scale and with a much lower cost than explicit feedback. In this case, it is very important to design recommendation algorithms that can work with implicit feedback data [36, 33]. There are many well-known methods that study collaborative filtering with implicit feedback, such as ALS [27] and SVD++ [31].
Both models factorize the binary interaction matrix and assume users dislike unobserved items, i.e., they assign 0 to unobserved items in the binary interaction matrix. However, several works consider that a user may simply have never seen the unobserved items [39, 35, 17], and tend to assume that a user prefers the selected items over the unobserved ones. For example, Bayesian Personalized Ranking (BPR) is an effective learning algorithm for implicit CF that has been widely adopted in many related domains, and it focuses on a pairwise loss rather than a pointwise loss.
Since Simon Funk proposed FunkSVD [12] in the famous Netflix Prize competition, matrix factorization for collaborative filtering has been widely studied and constantly developed over the past ten years [40, 30, 32, 24]. The main idea of these works is to map users and items into a common representation space where they can be compared directly. Recently, deep learning methods have shown promising results in various areas such as computer vision and natural language processing. Inspired by these significant successes, some attempts have been made to introduce deep neural networks (DNNs) into recommender systems. In [48], a model named Collaborative Deep Learning (CDL) was proposed, which performs deep representation learning for the content information and collaborative filtering for the rating matrix. Besides, AutoRec [42], the first model attempting to learn user and item representations by using an autoencoder to reconstruct the input ratings, has been applied to recommendation. In addition, a deep learning architecture called DMF [54] uses the rating matrix directly as input and maps users and items into a common low-dimensional space via a deep neural network. Overall, representation learning-based methods learn representations in different ways and can flexibly incorporate auxiliary data.
However, despite their effectiveness and many subsequent developments, they still resort to using the dot product or cosine similarity as the interaction function when predicting matching scores.
Matrix factorization (MF) has shown its effectiveness in many recommender systems. However, most MF methods still use the dot product, which limits the expressiveness of the model when making predictions. Several recent works on neural recommender models have shown that learning the interaction function from data can yield better recommendation predictions. NeuMF [21] is a recently proposed framework that replaces the dot product used in vanilla MF with a neural network to learn the matching function. To offset the weakness of MLP in capturing low-rank relations, NeuMF unifies MF and MLP in one model. NNCF [2] is a variant of NeuMF that takes user neighbors and item neighbors as inputs. Besides NeuMF, many other works attempt to learn the matching function directly by making full use of auxiliary data. For example, Wide&Deep [8] adapts LR and MLP to learn the matching function from continuous and categorical input features of users and items. DeepFM [15] replaces LR with Factorization Machines (FM) to avoid manual feature engineering. Neural Factorization Machines (NFM) [18] use a bi-interaction pooling layer to learn feature crosses. What's more, tree-based models have also been studied and proven to be effective [55, 59, 49]. The Neural network based Aspect-level Collaborative Filtering model (NeuACF) exploits different aspect latent factors by using an attention mechanism with NCF [44]. ConvNCF [19] uses an outer product operation to replace the concatenation used in NeuMF and utilizes 2D convolution layers to learn the joint representation of user-item pairs. In this paper, we mainly focus on pure collaborative filtering without using auxiliary data.
The attention mechanism has shown effectiveness in various machine learning tasks such as machine translation and computer vision [1]. Recently, several works have utilized attention mechanisms in recommender systems [20, 44, 45, 6, 51]. For instance, in [7], a model named Attentive Collaborative Filtering (ACF) was proposed to employ attention modeling in CF. In [9], the attention mechanism was introduced to capture the varying attention vectors of each specific user-item pair. To improve the performance of factorization, an attentional factorization machine that learns the weight of feature interactions via attention networks was designed [52]. In [57], an attention-based user model called ATRank was proposed, which utilizes a novel attention mechanism to model a user behavior after considering the influences brought by other behaviors. The attention mechanism derives from the idea that human recognition usually cannot process a whole signal at once; instead, one only focuses on a few selective parts at a time. Attention's success is mainly due to its advantage in assigning attentive weights to the input vectors, where higher weights indicate that the corresponding factors are more informative for the recommendation. In this paper, we adopt the attention mechanism in our model to make more accurate predictions.
According to the above discussion, both representation learning-based and matching function learning-based collaborative filtering methods have been broadly studied and proven to be effective. Despite their strengths, both types of methods have weaknesses, i.e., the limited expressiveness of the dot product and the weakness in capturing low-rank relations. In our previous work [10], we pointed out the significance of combining the two types of collaborative filtering methods to overcome these weaknesses. In this paper we present a novel model that ensembles these two types of methods, and add a balance module and the attention mechanism to endow the model with great flexibility in learning the matching function while maintaining the ability to learn low-rank relations efficiently. For clarity, Table I summarizes the main notations used in this paper.
TABLE I: Main notations.
M: The number of users
N: The number of items
Y: Binary user-item interaction matrix
y_ui: The interaction of user u to item i
ρ: Negative sample ratio
Y+: All the observed interactions in Y
Y-: All the unobserved interactions in Y
p_ui: The probability that user u is matched by item i
ŷ_ui: The predicted interaction of user u to item i
Θ: The model parameters
x_u: The initial representation of user u
x_i: The initial representation of item i
p_u: The latent representation of user u
q_i: The latent representation of item i
x: The encoder vector of the feedforward attention layer
y: The decoder vector of the feedforward attention layer
α: The attention ratio of the feedforward attention layer
W: Weight matrix
b: Bias vector
y_rl: The predictive vector of BCFNet-rl
y_ml: The predictive vector of BCFNet-ml
y_bm: The predictive vector of BCFNet-bm
Suppose there are M users and N items in the system. Following [50, 21], we construct the user-item interaction matrix Y from users' implicit feedback as follows:

y_ui = 1 if the interaction of user u with item i is observed, and y_ui = 0 otherwise. (1)
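Under Eq. (1), the interaction matrix can be built directly from an implicit feedback log. The sketch below uses a made-up toy log; the variable names are illustrative:

```python
import numpy as np

# Toy implicit-feedback log: (user, item) pairs that were observed
# (e.g., clicks or purchases). Eq. (1): y_ui = 1 iff observed, else 0.
interactions = [(0, 1), (0, 3), (1, 0), (2, 2)]
num_users, num_items = 3, 4

Y = np.zeros((num_users, num_items), dtype=np.int8)
for u, i in interactions:
    Y[u, i] = 1
```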
Although implicit feedback is easier to obtain than explicit feedback, it is more challenging to utilize because of two major problems. First, unlike ratings, implicit feedback is inherently noisy. While we track a user-item interaction (y_ui = 1), we can only guess users' preferences indirectly. For example, an observed interaction does not provide any specific information about how much a user actually likes an item. Second, the absence of an observed interaction (y_ui = 0) does not mean that user u does not like item i. In fact, user u may simply never have seen item i, since there are too many items in a system. The non-observed user-item interactions may be a mixture of real negative feedback and missing values. These two problems pose huge challenges in learning from implicit data, especially the second one.
Performing collaborative filtering on implicit data that lacks real negative feedback is also known as the One-Class Collaborative Filtering (OCCF) problem [37].
To tackle the problem of unobserved negative samples, several approaches have been proposed, which can be classified into two categories: whole-data-based learning and sample-based learning. The former assumes that all the unobserved data are weak negative instances and are equally weighted [27, 37], while the latter samples some negative instances from the unobserved interactions [37, 50, 21]. In this paper, we adopt the second approach, i.e., we uniformly sample negative instances from the unobserved interactions with negative sample ratio ρ, i.e., the number of negative samples per positive instance. Later we will conduct experiments to verify the impact of ρ on the proposed model. Let Y+ denote all the observed interactions in Y and Y- denote the sampled unobserved interactions, i.e., the negative instances.
To tackle the recommendation problem with implicit feedback, we can formulate it as an interaction prediction problem which estimates the missing values in the interaction matrix Y, i.e., estimates whether the unobserved interactions would happen or not (whether the user would give a rating to the item or not). However, unlike explicit feedback, implicit feedback is discrete and binary. When dealing with implicit feedback in which each entry is a binary value of 1 or 0, we often consider the learning of a recommender model as a binary classification problem. However, solving this binary classification problem alone cannot help us further rank and recommend items. One feasible solution is to employ a probabilistic treatment of the interaction matrix Y. We can assume y_ui obeys a Bernoulli distribution:

P(y_ui | p_ui) = p_ui^{y_ui} (1 - p_ui)^{1 - y_ui}, (2)

where p_ui is the probability of y_ui being equal to 1. Moreover, p_ui can be further interpreted as the probability that user u is matched by item i. In this case, a value of 1 for p_ui indicates that item i perfectly matches user u, and a value of 0 indicates that user u and item i do not match at all. Rather than modeling y_ui, which is discrete and binary, our method models p_ui instead. In this manner, we transform the binary classification problem, i.e., the interaction prediction problem, into a matching score prediction problem.
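The uniform negative sampling with ratio ρ described above can be sketched as follows; `sample_negatives` is an illustrative helper, not the paper's code:

```python
import random

def sample_negatives(positives, num_items, ratio, seed=0):
    """For each observed (u, i) pair in Y+, uniformly sample `ratio`
    unobserved items for user u, forming the negative set Y-."""
    rng = random.Random(seed)
    seen = {}
    for u, i in positives:
        seen.setdefault(u, set()).add(i)
    negatives = []
    for u, _ in positives:
        for _ in range(ratio):
            j = rng.randrange(num_items)
            while j in seen[u]:          # resample until unobserved
                j = rng.randrange(num_items)
            negatives.append((u, j))
    return negatives
```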
In order to enhance the learning ability of Deep Neural Networks (DNNs), we utilize a feedforward attention mechanism [1, 38] before the learning process. The attention mechanism has been shown to be effective in various machine learning tasks such as machine translation, recommendation and computer vision [58, 52]. It has the advantage that it can assign different attentive scores to the input vectors, with higher values indicating that the corresponding vectors are more informative. For representation learning-based CF methods, a feedforward attention layer can be added before the representation function to enhance its learning ability for user features and item features respectively. Similarly, for matching function learning-based CF methods, we can also add a feedforward attention layer before the matching function, which can likewise improve learning performance.
Suppose that there is a k-dimensional vector x as the input of a feedforward attention layer, which is called the encoder vector. The output of the attention layer is also a k-dimensional vector y, which is called the decoder vector. In this paper, we adopt a BP neural network [13] to learn the relationship between x and y. The calculation process of y can be formulated as:

α = softmax(W x + b),  y = α ⊙ x, (3)

where softmax denotes the activation function, W and b denote the weight matrix and bias vector of the BP neural network, α denotes the attention ratio of the feedforward attention layer, and ⊙ denotes the element-wise product of α and x. We utilize a softmax function as the activation function to obtain the attention ratio α, i.e., the probability vector produced by the BP neural network. The decoder vector y is ultimately calculated from α and x, and can reflect the importance of each element in x. In this formulation, the attention mechanism can be regarded as computing an adaptive weight for the encoder vector x. The decoder vector y can then be used for representation learning and matching function learning. We will also conduct experiments to demonstrate that the feedforward attention mechanism can improve the learning ability of our model.
A model-based method generally assumes that the data can be generated by an underlying model as ŷ_ui = f(u, i | Θ), where ŷ_ui denotes the prediction of y_ui, i.e., the predicted probability that user u is matched by item i, Θ denotes the model parameters, and f denotes the function that maps model parameters to the predicted score. In this manner, we need to figure out two key questions, i.e., how to define the function f and how to estimate the parameters Θ. We will answer the first question in the next section.
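The feedforward attention layer of Eq. (3) can be sketched in NumPy as follows; the function name and weight shapes are illustrative:

```python
import numpy as np

def feedforward_attention(x, W, b):
    """Eq. (3): attention ratio alpha = softmax(W x + b);
    the decoder vector is the element-wise product alpha * x."""
    logits = W @ x + b
    alpha = np.exp(logits - logits.max())   # numerically stable softmax
    alpha /= alpha.sum()
    return alpha * x, alpha
```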
For the second question, to estimate the parameters Θ, most existing works optimize an objective function. Two types of objective functions are commonly used in recommender systems, namely pointwise loss [27, 22] and pairwise loss [39, 35, 17]. Pointwise learning methods usually try to minimize the loss between ŷ_ui and its target value y_ui, while pairwise learning maximizes the margin between an observed entry ŷ_ui and an unobserved entry ŷ_uj. In this paper, we explore pointwise loss only and leave pairwise loss for future work. Pointwise loss has been widely studied in collaborative filtering with explicit feedback under the regression framework [12, 40].
Existing pointwise methods usually perform regression with a squared error (SE) loss to learn the recommender model. However, the squared loss can be derived by assuming that the error between the given rating and the predicted rating is generated from a Gaussian distribution, which does not hold in the implicit feedback scenario since y_ui is discrete and binary. Thus we point out that it may be unsuitable for implicit data. As aforementioned, to adapt to the binary and discrete nature of implicit feedback data, we assume y_ui obeys a Bernoulli distribution, i.e., Eq. (2). By replacing p_ui with ŷ_ui in Eq. (2), we can define the likelihood function as

P(Y+, Y- | Θ) = ∏_{(u,i)∈Y+} ŷ_ui ∏_{(u,j)∈Y-} (1 - ŷ_uj), (4)

where Y+ denotes all the observed interactions in Y and Y- denotes the sampled unobserved interactions, i.e., the negative instances. Furthermore, taking the negative logarithm of the likelihood (NLL), we obtain

L = - ∑_{(u,i)∈Y+} log ŷ_ui - ∑_{(u,j)∈Y-} log(1 - ŷ_uj)
  = - ∑_{(u,i)∈Y+∪Y-} [ y_ui log ŷ_ui + (1 - y_ui) log(1 - ŷ_ui) ]. (5)
Based on all the above assumptions and formulations, we finally obtain an objective function which is suitable for learning from implicit feedback data, i.e., the binary cross-entropy loss function [34]. By applying gradient descent we can minimize this objective function for the BCFNet model.
To sum up, the recommendation problem with implicit feedback can be formulated as an interaction prediction problem. To endow our algorithm with the ability to rank items for the recommendation task, we employ a probabilistic treatment of the interaction matrix Y. We use a Logistic function as the activation function for the output layer, so that ŷ_ui is constrained to the range [0, 1]. Instead of modeling y_ui, we model p_ui, which is the probability of y_ui being equal to 1. Since p_ui can also be interpreted as the probability that user u is matched by item i, the interaction prediction problem can be transformed into a matching score prediction problem. In this manner, using maximum likelihood estimation to estimate the model parameters is equivalent to minimizing the binary cross-entropy between y_ui and ŷ_ui.
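The binary cross-entropy objective of Eq. (5), averaged over the training pairs, can be written as a small NumPy function; the clipping constant `eps` is a common numerical safeguard, not part of the paper:

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-12):
    """Eq. (5) averaged over samples: binary cross-entropy between
    observed labels y_ui and predicted matching scores."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1.0 - y_true) * np.log(1.0 - y_pred)))
```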
In this section, we introduce the proposed Balanced Collaborative Filtering Network (BCFNet) in detail. First, we present the architecture of the BCFNet model. Then, we respectively introduce the three sub-modules of the model, namely attentive representation learning (BCFNet-rl), attentive matching function learning (BCFNet-ml) and the balance module (BCFNet-bm). Finally, we describe how to fuse these three sub-modules and how to learn the final BCFNet model.
The proposed BCFNet model consists of three sub-modules, namely attentive representation learning, attentive matching function learning and the balance module. The architecture of the BCFNet model is shown in Fig. 1.
All three modules start by extracting data from the database. IDs, historical behaviors and other auxiliary data can all be used to construct the initial representations of user u and item i, which are denoted by x_u and x_i respectively. The CF models then calculate p_u and q_i, i.e., the latent representations of user u and item i. Next, a non-parametric operation is performed on p_u and q_i to aggregate the latent representations. Finally, a mapping function is used to calculate the matching score ŷ_ui. In what follows, we introduce these three sub-modules and their implementations in detail.
For representation learning-based CF methods, the model focuses more on learning the representation function, and the matching function is usually assumed to be simple and non-parametric, e.g., dot product or cosine similarity. In this manner, the model is supposed to map users and items into a common space where they can be directly compared. For example, taking one-hot IDs as inputs, vanilla MF [12] adopts linear embedding functions to learn the latent representations. The latent representations p_u and q_i are then aggregated by the dot product to calculate the matching score; in this case, the mapping function is assumed to be the identity function. For another example, taking ratings as inputs, DMF [54] adopts MLPs as the user and item representation functions to learn better latent representations by making full use of the non-linearity and high capacity of neural networks. The cosine similarity between p_u and q_i is then calculated and used as the matching score.
In this paper, we focus on implicit feedback data only in the BCFNet model, so no auxiliary data are used. In particular, the user-item interaction matrix Y is taken as input. That is, the initial representation of user u is the u-th row vector of Y, i.e., Y_u*, and the initial representation of item i is the i-th column vector of Y, i.e., Y_*i.
In the proposed BCFNet model, we design an attentive representation learning-based CF method that combines MLP with feedforward attention to learn latent representations for users and items. Suppose the size of the encoder layer of the feedforward attention layer is k. First, for user u, the decoder vector y_u can be computed by applying the feedforward attention layer to Y_u* as follows:

x_u = W_x Y_u*,  α_u = softmax(W_a x_u + b_a),  y_u = α_u ⊙ x_u, (6)

where W_x denotes the weight matrix of the encoder layer of the feedforward attention layer, x_u denotes the activation of the input layer, softmax denotes the activation function, W_a and b_a denote the weight matrix and bias vector of the BP neural network respectively, and α_u denotes the attention ratio of the feedforward attention layer. Then, the representation learning part based on the MLP implementation for users can be defined as:

z_1 = y_u,
z_l = a(W_l z_{l-1} + b_l), l = 2, ..., L,
p_u = z_L, (7)

where z_1 denotes the input of the MLP, and W_l, b_l and z_l denote the weight matrix, bias vector and activation of the l-th layer's perceptron respectively. a is the activation function, for which we use the ReLU function in this paper. The latent representation q_i for item i is calculated in the same manner. Different from existing representation learning-based CF methods, the matching function part is defined as:

y_rl = a_out(W_out (p_u ⊙ q_i)), (8)

where y_rl, W_out and a_out denote the predictive vector, the weight matrix and the activation function respectively. By substituting the non-parametric dot product or cosine similarity with an element-wise product followed by a parametric neural network layer, our model still focuses on capturing low-rank relations between users and items but is more expressive, since the importance of latent dimensions can differ and the mapping can be non-linear.
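A forward-pass sketch of this branch, under simplifying assumptions: all sizes are toy values, the weights are random rather than learned, and the user and item towers share parameters here only to keep the sketch short (in the model they are separate):

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hid, d_lat = 6, 8, 4          # toy sizes, not from the paper

Wa, ba = rng.normal(size=(d_in, d_in)), np.zeros(d_in)
W1, b1 = rng.normal(size=(d_hid, d_in)), np.zeros(d_hid)
W2, b2 = rng.normal(size=(d_lat, d_hid)), np.zeros(d_lat)
Wo = rng.normal(size=(1, d_lat))      # output layer of Eq. (8)

def relu(z): return np.maximum(z, 0.0)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attentive_tower(x):
    alpha = softmax(Wa @ x + ba)       # attention ratio, Eq. (6)
    h = relu(W1 @ (alpha * x) + b1)    # MLP layers, Eq. (7)
    return relu(W2 @ h + b2)

def bcfnet_rl_vector(x_u, x_i):
    """Element-wise product of the latent vectors plus one parametric
    layer, instead of a plain dot product (Eq. (8))."""
    p_u, q_i = attentive_tower(x_u), attentive_tower(x_i)
    return sigmoid(Wo @ (p_u * q_i))
```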
Matching function learning-based CF methods focus more on the matching function. A representation learning part is still necessary, since the initial representations of users and items, namely x_u and x_i, are usually extremely sparse and high-dimensional, making it difficult for the model to directly learn the matching function. Therefore, matching function learning-based CF methods usually use a linear embedding layer to learn latent representations for users and items. With dense low-dimensional latent representations, the model is able to learn the matching function more efficiently.
In the proposed BCFNet model, we design an attentive matching function learning-based CF method that combines MLP with feedforward attention to learn the matching function. Instead of IDs, we take the interaction matrix Y as input. Suppose the size of the embedding layer is k. First, for user u and item i, the latent representations are computed by the linear embedding layers, concatenated, and passed through the feedforward attention layer as follows:

p_u = P^T Y_u*,  q_i = Q^T Y_*i,  α = softmax(W_a [p_u; q_i] + b_a),  y = α ⊙ [p_u; q_i], (9)

where P and Q are the parameter matrices of the linear embedding layers, p_u denotes the MLP user vector, and q_i denotes the MLP item vector. The matching function learning part based on the MLP implementation can then be defined as:

z_1 = y,
z_l = a(W_l z_{l-1} + b_l), l = 2, ..., L,
y_ml = z_L, (10)

where z_1 denotes the input of the MLP, W_l, b_l and z_l denote the weight matrix, bias vector and activation of the l-th layer's perceptron respectively, and y_ml denotes the predictive vector. In this manner, the attentive representation learning functions are implemented by the linear embedding layers. The latent representations p_u and q_i are then aggregated by a simple concatenation operation. After the feedforward attention layer, the MLP is used as the mapping function to calculate the matching score. Notice that although concatenation is the simplest aggregation operation, it maximally preserves the information passed from the previous layer and allows the model to make full use of the flexibility of the MLP.
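A sketch of this branch under stated assumptions: the exact placement of the attention layer relative to the concatenation is inferred from the text, and all sizes and weights below are illustrative toys:

```python
import numpy as np

rng = np.random.default_rng(2)
num_users, num_items, d_emb, d_hid, d_out = 3, 4, 4, 6, 2

P = rng.normal(size=(num_items, d_emb))   # embeds a row of Y (a user)
Q = rng.normal(size=(num_users, d_emb))   # embeds a column of Y (an item)
Wa, ba = rng.normal(size=(2 * d_emb, 2 * d_emb)), np.zeros(2 * d_emb)
W1, b1 = rng.normal(size=(d_hid, 2 * d_emb)), np.zeros(d_hid)
W2, b2 = rng.normal(size=(d_out, d_hid)), np.zeros(d_out)

def relu(z): return np.maximum(z, 0.0)
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bcfnet_ml_vector(y_row, y_col):
    p_u = P.T @ y_row                   # Eq. (9): linear embeddings
    q_i = Q.T @ y_col
    z = np.concatenate([p_u, q_i])      # aggregate by concatenation
    alpha = softmax(Wa @ z + ba)        # feedforward attention
    h = relu(W1 @ (alpha * z) + b1)     # Eq. (10): MLP layers
    return relu(W2 @ h + b2)            # predictive vector y_ml
```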
After introducing the attention mechanism, the unified framework of representation learning and matching function learning (i.e., the DeepCF framework proposed in our previous work [10]) can be greatly improved. However, due to its DNN structure, it may also suffer from partial information loss and overfitting. In a real-life recommender system, there exist a large number of users and items that are subject to sparsity problems. In this case, only relatively few interactions can be fed into the MLP components, which makes them prone to overfitting and leads to mediocre results. In addition, during the deep learning process, some features of users and items may simply be ignored, and some important implicit feedback may be given low weight in the MLP.
Inspired by some shallow recommendation model without neural networks and attention mechanism, we add the generalized matrix factorization (GMF) model [54] to the BCFNet model as a balance module. As a shallow matrix factorization model, GMF adopts linear embedding function as representation function and uses dot product as matching function, which can offset the weakness of MLP in capturing lowrank relations and alleviate the overfitting issue in DNNs. Therefore, assuming the size of the embedding layer is , the balance module can be formulated as:
$$\mathbf{p}_u = P^{T}\mathbf{v}_u^U, \qquad \mathbf{q}_i = Q^{T}\mathbf{v}_i^I, \qquad \mathbf{y}_{bm} = \mathbf{p}_u \odot \mathbf{q}_i, \tag{11}$$

where $\mathbf{v}_u^U$ and $\mathbf{v}_i^I$ are the one-hot ID vectors of user $u$ and item $i$, $P$ and $Q$ are the parameter matrices of the linear embedding layers, $\mathbf{p}_u$ denotes the MF user vector, $\mathbf{q}_i$ denotes the MF item vector, $\odot$ denotes the element-wise product, and $\mathbf{y}_{bm}$ denotes the predictive vector. In the following experiments, it will be verified that the balance module helps alleviate the overfitting issue caused by the high sparsity of interaction information.
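A minimal sketch of the balance module follows (toy sizes and random initial weights; the helper name `balance_vector` is ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, d = 8, 12, 4   # toy sizes; d is the embedding size

# Linear embeddings: a one-hot ID times the parameter matrix is a row lookup.
P = rng.normal(0.0, 0.01, (n_users, d))   # MF user vectors
Q = rng.normal(0.0, 0.01, (n_items, d))   # MF item vectors

def balance_vector(u, i):
    """GMF predictive vector: element-wise product of MF user and item vectors."""
    return P[u] * Q[i]
```

In standalone GMF this vector would be reduced to a score via a weighted sum; in BCFNet it is instead kept as the predictive vector and passed on to the fusion layer.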
In the previous three subsections, we have introduced the three modules of the proposed BCFNet model, each of which can be regarded as a separate recommendation model. To incorporate these three modules, we need to design a strategy to fuse them so that they can enhance each other and improve the accuracy of the recommender system. One of the most common fusion strategies is to concatenate the learned representations into a joint representation and then feed it into a fully connected layer. As described in the previous three subsections, BCFNet-rl, BCFNet-ml and BCFNet-bm generate the predictive vectors $\mathbf{y}_{rl}$, $\mathbf{y}_{ml}$ and $\mathbf{y}_{bm}$ respectively, each of which can be viewed as a representation of the corresponding user-item pair. Since the three types of CF methods have different advantages and learn the predictive vectors from different perspectives, concatenating the three predictive vectors yields a stronger and more robust joint representation for the user-item pair. Moreover, the subsequent fully connected layer enables the model to assign different weights to the features contained in the joint representation. Therefore, the output of the fusion model can be defined as:
$$\hat{y}_{ui} = \sigma\!\left(\mathbf{h}^{T}\begin{bmatrix}\mathbf{y}_{rl}\\ \mathbf{y}_{ml}\\ \mathbf{y}_{bm}\end{bmatrix}\right), \tag{12}$$

where $\mathbf{h}$ denotes the weight vector of the fully connected layer and $\sigma$ is the sigmoid function.
Using Eq. (12) to incorporate BCFNet-rl, BCFNet-ml and BCFNet-bm, we finally obtain the proposed BCFNet model.
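The fusion step above can be sketched as follows (toy predictive vectors and a random fusion weight vector stand in for the learned ones):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4                                   # toy predictive-vector dimension

# Placeholder predictive vectors from the three modules.
y_rl = rng.normal(size=d)               # representation learning branch
y_ml = rng.normal(size=d)               # matching function learning branch
y_bm = rng.normal(size=d)               # balance module branch

# Fully connected output layer over the joint representation.
h = rng.normal(size=3 * d)

joint = np.concatenate([y_rl, y_ml, y_bm])   # joint representation
y_hat = 1.0 / (1.0 + np.exp(-(h @ joint)))   # sigmoid of the weighted sum
```

The single weight vector `h` is what lets the model trade off the three perspectives feature by feature rather than averaging the branch scores.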
As discussed in the previous section, the objective function for the BCFNet model is the binary cross-entropy function. To optimize the model, we use mini-batch Adam [29]. The batch size is fixed to 256 and the learning rate to 0.00001. The model parameters are randomly initialized with a Gaussian distribution (mean 0, standard deviation 0.01), and the negative instances are uniformly sampled from unobserved interactions in each iteration. The learning algorithm for the proposed BCFNet model is summarized in Algorithm 1. According to [11], initialization is significant for the convergence and performance of a deep learning model. Using pre-trained models to initialize the ensemble model can significantly increase the convergence speed and improve the final performance. Since BCFNet is composed of three components, i.e., BCFNet-rl, BCFNet-ml and BCFNet-bm, we can pre-train these three components and use them to initialize BCFNet. Notice that BCFNet-rl, BCFNet-ml and BCFNet-bm are trained from scratch using Adam, while BCFNet with pre-training is optimized by vanilla SGD. This is because Adam requires momentum information of the previously updated parameters, which is not saved in BCFNet with pre-training.
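The uniform negative sampling performed in each iteration can be sketched as follows (toy interaction data; the default of 4 negatives per positive matches the ratio used in our experimental settings):

```python
import numpy as np

rng = np.random.default_rng(3)
n_items = 20
observed = {0: {1, 4, 7}, 1: {2, 3}}    # toy map: user -> interacted items

def sample_negatives(user, ratio=4):
    """Uniformly sample `ratio` unobserved items per positive instance."""
    positives = observed[user]
    negatives = []
    for _ in positives:                 # ratio negatives for each positive
        for _ in range(ratio):
            j = int(rng.integers(n_items))
            while j in positives:       # reject observed interactions
                j = int(rng.integers(n_items))
            negatives.append(j)
    return negatives
```

Resampling every iteration exposes the model to a different slice of the unobserved interactions each epoch instead of a fixed negative set.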
Datasets  # of Users  # of Items  # of Ratings  Sparsity 

ml-100k  943  1682  100000  0.9370 
ml-1m  6040  3706  1000209  0.9553 
lastfm  1741  2665  69149  0.9851 
filmtrust  1508  2071  35497  0.9886 
ABaby  746  5193  21262  0.9945 
ABeauty  1248  8942  42269  0.9962 
AMusic  1776  12929  46087  0.9980 
AToy  3137  33953  84642  0.9992 
In this section, we conduct experiments to demonstrate the effectiveness of the BCFNet model. First of all, we compare the proposed BCFNet model with seven existing models, including the previous version, namely CFNet [10]. Then, we conduct experiments to validate the effectiveness of the feed-forward attention layer and the balance module. We also verify the utility of pre-training by comparing the BCFNet models with and without pre-training. Finally, we analyze the effect of hyper-parameters on the performance of the BCFNet model.
We implement the proposed model based on Keras^1 and TensorFlow^2, which will be released publicly upon acceptance.

^1 https://github.com/keras-team/keras
^2 https://github.com/tensorflow/tensorflow

Datasets  Measures  Existing methods  

ItemPop  ItemKNN  BPR  MLP  DMF  NeuMF  CFNet  
ml-100k  HR  0.3998  0.5891  0.6320  0.6755  0.6797  0.6766  0.6819 
NDCG  0.2264  0.3283  0.3568  0.3995  0.3936  0.3945  0.3981  
ml-1m  HR  0.4535  0.6624  0.6725  0.7073  0.6565  0.7210  0.7253 
NDCG  0.2542  0.3905  0.3908  0.4264  0.3761  0.4387  0.4416  
lastfm  HR  0.6628  0.8771  0.6249  0.8834  0.8840  0.8868  0.8995 
NDCG  0.3862  0.5617  0.3466  0.5919  0.5804  0.6007  0.6186  
filmtrust  HR  0.8966  0.8601  0.8680  0.9151  0.9071  0.9171  0.9158 
NDCG  0.7952  0.7582  0.7632  0.8024  0.7896  0.8067  0.8074  
ABaby  HR  0.5416  0.2064  0.5751  0.5938  0.5697  0.6046  0.6032 
NDCG  0.3223  0.1170  0.3569  0.3663  0.3479  0.3860  0.3794  
ABeauty  HR  0.5938  0.5321  0.6755  0.7099  0.6931  0.7260  0.7123 
NDCG  0.3548  0.3994  0.4738  0.4958  0.4795  0.5227  0.5099  
AMusic  HR  0.3148  0.3851  0.3987  0.4071  0.3744  0.3891  0.4116 
NDCG  0.1752  0.2825  0.2420  0.2420  0.2149  0.2391  0.2601  
AToy  HR  0.3143  0.3460  0.3975  0.3931  0.3535  0.3650  0.4090 
NDCG  0.1794  0.2254  0.2673  0.2293  0.2016  0.2155  0.2457 
Datasets  Measures  BCFNet  Improvement of  Improvement of  

BCFNet-rl  BCFNet-bm  BCFNet-ml  BCFNet  BCFNet vs. NeuMF  BCFNet vs. CFNet  
ml-100k  HR  0.6903  0.6681  0.6776  0.7010  3.61%  2.80% 
NDCG  0.4003  0.3944  0.4011  0.4096  3.83%  2.89%  
ml-1m  HR  0.7199  0.7084  0.7141  0.7358  2.05%  1.45% 
NDCG  0.4358  0.4342  0.4376  0.4496  2.48%  1.81%  
lastfm  HR  0.8943  0.8897  0.8955  0.9110  2.73%  1.28% 
NDCG  0.6058  0.6202  0.5970  0.6328  5.34%  2.28%  
filmtrust  HR  0.9151  0.9171  0.9198  0.9290  1.30%  1.44% 
NDCG  0.8129  0.8099  0.8067  0.8231  2.03%  1.94%  
ABaby  HR  0.6032  0.6059  0.6046  0.6086  0.66%  0.90% 
NDCG  0.3778  0.3821  0.3708  0.3865  0.13%  1.87%  
ABeauty  HR  0.7179  0.7244  0.7212  0.7364  1.43%  3.38% 
NDCG  0.5075  0.5272  0.5020  0.5299  1.38%  3.92%  
AMusic  HR  0.4026  0.3958  0.4206  0.4448  14.32%  8.07% 
NDCG  0.2482  0.2537  0.2492  0.2694  12.67%  3.58%  
AToy  HR  0.3915  0.4080  0.3927  0.4201  15.10%  2.71% 
NDCG  0.2277  0.2541  0.2270  0.2531  17.45%  3.01% 
We evaluate our models on eight real-world publicly available datasets: MovieLens 100K (ml-100k), MovieLens 1M (ml-1m)^3, LastFM (lastfm)^4, FilmTrust (filmtrust)^5, Amazon baby (ABaby), Amazon beauty (ABeauty), Amazon music (AMusic) and Amazon toys (AToy)^6. They are obtained from the following four main sources.

^3 https://grouplens.org/datasets/movielens/
^4 http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/
^5 https://www.librec.net/datasets.html
^6 http://jmcauley.ucsd.edu/data/amazon/

MovieLens: The MovieLens datasets have been widely used for movie recommendation. They are collected from the MovieLens website by the GroupLens Research group. We use the ml-100k and ml-1m versions in our experiments.
Lastfm: The lastfm dataset records the sequences of songs that users listen to. It is crawled from the Last.fm online system, the world's largest social music platform.
Filmtrust: The filmtrust dataset is crawled from the entire FilmTrust website in June 2011 and contains 1508 users, 2071 items and 35497 ratings.
Amazon: The Amazon datasets contain users' rating data from Amazon. In our experiments, four of these datasets, namely Baby, Beauty, Music and Toy, are adopted.
Following [21], we adopt the leave-one-out evaluation, i.e., the latest interaction of each user is used for testing, while the remaining data are used for training. Since ranking all items is time-consuming, we randomly sample 100 unobserved interactions as negative samples for each user and rank the 100 items for each user according to the prediction. We evaluate the ranking performance through two widely adopted evaluation measures, namely Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG), which are defined respectively as follows:
$$\mathrm{HR} = \frac{\#hits}{\#users}, \tag{13}$$

$$\mathrm{NDCG} = \frac{1}{\#users}\sum_{i=1}^{\#hits}\frac{1}{\log_2(p_i+1)}, \tag{14}$$
where $\#hits$ is the number of users whose test item appears in the recommended list and $p_i$ is the position of the test item in the list for the $i$-th hit. The ranked list is truncated at 10 for both measures, i.e., HR@10 and NDCG@10. Intuitively, HR@10 measures whether the test item is present in the top-10 list, and NDCG@10 measures the ranking quality, assigning higher scores to hits at top positions in the top-10 list. Larger values of HR@10 and NDCG@10 indicate better performance.
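Both measures can be computed directly from the per-user ranked lists; the sketch below (our own helper names) follows Eqs. (13) and (14):

```python
import math

def hr_at_k(ranked_lists, test_items, k=10):
    """Fraction of users whose test item appears in their top-k list."""
    hits = sum(1 for r, t in zip(ranked_lists, test_items) if t in r[:k])
    return hits / len(test_items)

def ndcg_at_k(ranked_lists, test_items, k=10):
    """Sum of 1/log2(p+1) over hits (p = 1-based hit position), over #users."""
    total = 0.0
    for r, t in zip(ranked_lists, test_items):
        if t in r[:k]:
            p = r[:k].index(t) + 1
            total += 1.0 / math.log2(p + 1)
    return total / len(test_items)

# Three toy users: hits at positions 1 and 3, and one miss.
ranked = [[5, 9, 2], [7, 8, 6], [1, 2, 3]]
tests = [5, 6, 99]
```

For this toy example HR@10 is 2/3, and NDCG@10 is (1 + 1/log2 4)/3 = 0.5, reflecting that the second hit sits lower in its list.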
For a fair comparison, we set the weight of observed interactions for each user to 1 in all methods. We sample four negative instances per positive instance, i.e., set the negative sampling ratio to 4 by default. We set the number of predictive factors to 128 on all the datasets except AMusic, on which it is set to 64. We generally employ two hidden layers for BCFNet-rl, and three for BCFNet-ml.
We compare the proposed BCFNet model with the following seven methods.
ItemPop is a non-personalized method that is often used as a benchmark for recommendation tasks. Items are simply ranked by their popularity, i.e., the number of interactions.
ItemKNN [41] is a standard item-based collaborative filtering method.
BPR [39] is a widely used learning framework for item recommendation with implicit feedback. It is a sample-based method that optimizes the MF model with a pairwise ranking loss.
MLP [21] is a matching function learning-based collaborative filtering method. It uses multiple layers of nonlinearities to model the relationships between users and items.
DMF [54] is a state-of-the-art representation learning-based MF method which performs deep matrix factorization to learn a common low-dimensional space, with the normalized cross-entropy loss as the loss function. It uses a two-pathway neural network architecture to replace the linear embedding operation used in vanilla matrix factorization. We ignore the explicit ratings and take the implicit feedback as input in this paper.
NeuMF [21] is a state-of-the-art matching function learning-based MF method which combines the hidden layers of GMF and MLP to learn the interaction function based on the cross-entropy loss. NeuMF takes IDs as input and adopts the deep+shallow pattern that has been widely used in many works such as [8, 15].
CFNet [10] is the previous version of BCFNet, which incorporates collaborative filtering methods based on representation learning and matching function learning to learn the complex matching function and low-rank relations between users and items.
The comparison results are listed in Table III. The best scores among the BCFNet model and its submodels and the best scores among other methods are highlighted respectively in bold. According to the table, we have the following key observations:
The proposed BCFNet model achieves the best performance on all the datasets except for the NDCG on AMusic and AToy, and obtains substantial improvements over the state-of-the-art methods. More importantly, most of the improvements grow as data sparsity increases, where the datasets are arranged in order of increasing sparsity. This justifies the effectiveness of the proposed BCFNet model, which combines attentive representation learning-based CF, attentive matching function learning-based CF and the balance module.
As a typical representation learning method, DMF has some merit compared with the traditional methods, but the proposed BCFNet-rl model consistently outperforms it. This indicates that adding a feed-forward attention layer and a parametric neural network layer significantly improves the learning ability of representation learning.
BCFNet-ml outperforms the MLP model on most datasets. This fully demonstrates the effectiveness of the attention mechanism in improving matching function learning.
On the basis of BCFNet-rl and BCFNet-ml, BCFNet-bm also contributes greatly to the performance of the proposed BCFNet model, especially in terms of NDCG. In most cases, the improvement increases with data sparsity. This indicates the effectiveness of the balance module in addressing the overfitting issue caused by data sparsity, which will be further confirmed in the next subsection.

In order to investigate the impact of the feed-forward attention layer and the balance module in BCFNet, we conduct experiments on BCFNet without both the attention and balance modules (abbr. BCFNet-without-AB, i.e., CFNet in [10]), BCFNet without attention (abbr. BCFNet-without-A) and BCFNet without the balance module (abbr. BCFNet-without-B). As shown in Fig. 2, BCFNet outperforms BCFNet-without-AB, BCFNet-without-A and BCFNet-without-B in all cases. This result verifies the effectiveness of the feed-forward attention layer in enhancing the learning ability of the proposed neural network model. Moreover, BCFNet-without-AB outperforms BCFNet-without-A and BCFNet-without-B on some datasets, which shows the necessity of combining the feed-forward attention layer and the balance module.
In addition, we conduct further experiments on BCFNet with the balance module (i.e., the full BCFNet model) and BCFNet without the balance module (abbr. BCFNet-without-B) to verify the effectiveness of the balance module in alleviating the overfitting issue of the neural network. In order to simulate the overfitting caused by the high sparsity of item interaction information in a recommender system, we divide each original dataset into three sub-datasets according to item popularity, termed popularity levels 1, 2 and 3 respectively. A higher popularity level means that items in the sub-dataset are more popular and have more interaction information. In particular, the item set of each original dataset is evenly partitioned into three subsets according to item popularity, and all the interactions associated with the items in each subset form the corresponding sub-dataset. Therefore, the sparsity of interaction information decreases as the item popularity level increases. Since some of the datasets used in the experiments are not very sparse, or their three sub-datasets cannot satisfy the leave-one-out evaluation condition that requires 100 negative samples for each user, only five datasets are used in this experiment, namely lastfm, ABaby, ABeauty, AMusic and AToy. We run BCFNet and BCFNet-without-B on each sub-dataset. As shown in Fig. 3, the performance of both BCFNet and BCFNet-without-B improves greatly as the item popularity level increases. However, the improvement brought by the balance module mostly weakens as the item popularity level increases, i.e., the largest gain is obtained at the lowest item popularity level. This fully illustrates that the balance module helps alleviate the overfitting issue caused by the high sparsity of item interaction information.
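The partition by item popularity described above can be sketched as follows (toy interaction log; for simplicity the sketch assumes the number of items is divisible by three):

```python
from collections import Counter

# Toy interaction log as (user, item) pairs.
interactions = [(0, 'a'), (1, 'a'), (2, 'a'), (0, 'b'), (1, 'b'),
                (0, 'c'), (2, 'd'), (1, 'e'), (2, 'f')]

popularity = Counter(item for _, item in interactions)
items = [item for item, _ in popularity.most_common()]   # most popular first
third = len(items) // 3

# Level 3 holds the most popular items, level 1 the least popular.
levels = {3: set(items[:third]),
          2: set(items[third:2 * third]),
          1: set(items[2 * third:])}

# Each sub-dataset keeps every interaction whose item falls in that level.
subsets = {lvl: [(u, i) for u, i in interactions if i in s]
           for lvl, s in levels.items()}
```

Because the split is even over items but not over interactions, the level-1 sub-dataset ends up far sparser than level 3, which is exactly the regime the balance module targets.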

In contrast to BCFNet with pre-training, we learn BCFNet without pre-training using mini-batch Adam with random initialization. As shown in Table IV, BCFNet with pre-training (i.e., BCFNet) outperforms BCFNet without pre-training (abbr. BCFNet-without-P) in all cases. This result verifies the utility of the pre-training process, which ensures that BCFNet-rl, BCFNet-ml and BCFNet-bm learn features from different perspectives and therefore allows the model to generate better results.
To analyze the effect of the negative sampling ratio, we test different ratios, i.e., different numbers of negative samples per positive instance, on the eight datasets. From the results shown in Fig. 4, we find that sampling fewer than three negative instances is not enough, while sampling more negative instances is helpful. In most cases, the best HR@10 and NDCG@10 are obtained when the negative sampling ratio is set to 4. Overall, the optimal sampling ratio is around 4 to 8. Sampling too many negative instances not only requires more time to train the model but also degrades the performance, which is consistent with the results shown in [21].
Another hyper-parameter used in the BCFNet model is the number of predictive factors, i.e., the dimension of the predictive vectors $\mathbf{y}_{rl}$, $\mathbf{y}_{ml}$ and $\mathbf{y}_{bm}$. To this end, we test the number of predictive factors in {16, 32, 64, 128}, and the results are listed in Table V. As shown in Table V, the proposed model generates the best performance with 128 predictive factors on most of the datasets; the exception is AMusic, on which the best performance is achieved with 64 factors. According to our observation, more predictive factors usually lead to better performance, since they endow the model with larger capacity and greater representation ability.
Datasets  Measures  BCFNet-without-P  BCFNet  Improvement 

ml-100k  HR  0.5769  0.7010  21.51% 
NDCG  0.3216  0.4096  27.36%  
ml-1m  HR  0.6843  0.7358  7.53% 
NDCG  0.4099  0.4496  9.69%  
lastfm  HR  0.8621  0.9110  5.67% 
NDCG  0.5871  0.6328  7.78%  
filmtrust  HR  0.8899  0.9290  4.39% 
NDCG  0.7903  0.8231  4.15%  
ABaby  HR  0.5456  0.6086  11.55% 
NDCG  0.3309  0.3865  16.80%  
ABeauty  HR  0.7099  0.7364  3.73% 
NDCG  0.4805  0.5299  10.28%  
AMusic  HR  0.3874  0.4448  14.82% 
NDCG  0.2356  0.2694  14.35%  
AToy  HR  0.3028  0.4201  38.74% 
NDCG  0.1631  0.2531  55.18% 
In this paper, we have presented a novel recommendation model called Balanced Collaborative Filtering Network (BCFNet), which combines attentive representation learning (BCFNet-rl), attentive matching function learning (BCFNet-ml) and a balance module (BCFNet-bm), and therefore has the advantages of both representation learning and matching function learning. In addition, by introducing a feed-forward attention layer, the learning ability of both attentive representation learning and attentive matching function learning can be further improved. Furthermore, adding a balance module that uses neither neural networks nor attention mechanisms can alleviate the overfitting issue and capture low-rank relations. Extensive experiments on eight real-world datasets demonstrate the effectiveness and rationality of the proposed BCFNet model.
This work was supported by NSFC (61876193), Guangdong Natural Science Funds for Distinguished Young Scholar (2016A030306014), and NSF through grants IIS1526499, IIS1763325, and CNS1626432.
Datasets  Measures  Dimensions of predictive vectors  

16  32  64  128  
ml-100k  HR  0.6660  0.6723  0.6702  0.7010 
NDCG  0.3850  0.3896  0.3912  0.4096  
ml-1m  HR  0.6980  0.7078  0.7230  0.7358 
NDCG  0.4162  0.4261  0.4396  0.4496  
lastfm  HR  0.8926  0.8909  0.9012  0.9110 
NDCG  0.6219  0.6231  0.6250  0.6328  
filmtrust  HR  0.9045  0.9131  0.9204  0.9290 
NDCG  0.7947  0.8032  0.8101  0.8231  
ABaby  HR  0.6005  0.5965  0.6072  0.6086 
NDCG  0.3765  0.3800  0.3855  0.3865  
ABeauty  HR  0.7163  0.7171  0.7212  0.7364 
NDCG  0.5020  0.5056  0.5133  0.5299  
AMusic  HR  0.4110  0.4245  0.4448  0.4240 
NDCG  0.2581  0.2608  0.2694  0.2638  
AToy  HR  0.4080  0.4013  0.4074  0.4201 
NDCG  0.2464  0.2418  0.2495  0.2531  