Collaborative Filtering (CF) based recommendation methods have been widely studied, which can be generally categorized into two types, i.e., representation learning-based CF methods and matching function learning-based CF methods. Representation learning tries to learn a common low dimensional space for the representations of users and items. In this case, a user and item match better if they have higher similarity in that common space. Matching function learning tries to directly learn the complex matching function that maps user-item pairs to matching scores. Although both methods are well developed, they suffer from two fundamental flaws, i.e., the representation learning resorts to applying a dot product which has limited expressiveness on the latent features of users and items, while the matching function learning has weakness in capturing low-rank relations. To overcome such flaws, we propose a novel recommendation model named Balanced Collaborative Filtering Network (BCFNet), which has the strengths of the two types of methods. In addition, an attention mechanism is designed to better capture the hidden information within implicit feedback and strengthen the learning ability of the neural network. Furthermore, a balance module is designed to alleviate the over-fitting issue in DNNs. Extensive experiments on eight real-world datasets demonstrate the effectiveness of the proposed model.
Over the past decades, recommender systems have been extensively studied and widely deployed in many different scenarios to alleviate the information overload problem. Due to the distinguishing capability of utilizing collective wisdom and experiences, Collaborative Filtering (CF) algorithms have been widely used to build recommender systems [46, 56, 26, 5, 47].
Matrix factorization is an important model in CF, which assumes that some relationship can be established between users and items through latent factors. By learning a common low-dimensional space for the representations of users and items where they can be compared directly, the relevance of users and items can be calculated from their similarity. In this way, matrix factorization can predict a personalized ranking for an individual user over a set of items. Unfortunately, in matrix factorization, the mapping between the original representation space and the latent space is assumed to be linear, which cannot always be guaranteed.
Since Deep Neural Networks (DNNs) are extremely good at representation learning of complex relationships, deep learning methods have been widely explored and have shown promising results in various areas such as computer vision, speech recognition and natural language processing [16, 14, 43, 3]. In the past few years, many works have also adopted DNNs for recommendation and generated more accurate predictions. To better learn the complex mapping between these two spaces, Deep Structured Semantic Models (DSSM) were proposed, which use a deep neural network to rank for web search. Motivated by DSSM, Xue et al. proposed Deep Matrix Factorization (DMF), which uses a two-pathway neural network architecture to replace the linear embedding operation used in vanilla matrix factorization. However, DMF still resorts to the inner product as the matching function, which simply combines the multiplication of latent features linearly and seriously limits the expressiveness of the model when predicting matching scores.
In order to learn better representations for users and items, it is a good choice to replace the dot product with a deep neural network, which can lead to better recommendation performance. DNNs are very suitable for learning the complex matching function, since they are capable of approximating any continuous function. For example, He et al. proposed NeuMF under the Neural Collaborative Filtering (NCF) framework, which replaces the dot product operation in matrix factorization with a multi-layer neural network to capture the nonlinear relationship between users and items. By taking the concatenation of the user embedding and the item embedding as the input of a Multi-Layer Perceptron (MLP) model, NeuMF is able to learn the interaction between users and items, from which the prediction can be made. In particular, it is capable of learning the complex mapping between the user-item representation and the matching score. Therefore, compared with traditional MF methods, using an MLP in place of the dot product can learn a better matching function.
However, MLP has been shown to be very inefficient in capturing low-rank relations. In fact, using the dot product to estimate the matching score in traditional matrix factorization methods artificially restricts the model to learning similarity, a low-rank relation that is thought to be positively related to the matching score according to human experience. Moreover, since the training data in recommender systems are typically sparse, only a relatively small number of ratings can be fed into the MLP. A DNN-based model with massive parameters may therefore easily suffer from over-fitting.
According to the above discussion, we can see that there are two types of methods for implementing collaborative filtering. One is mainly based on representation learning and the other one is mainly based on matching function learning. Since these two types of methods have different advantages in learning the representation from different perspectives, a stronger and more robust joint representation for the user-item pair can be obtained by concatenating their learned representations.
In our previous work, we first used these two types of CF methods to obtain different representations for the input user-item pair, which are integrated together to form the Deep Collaborative Filtering (DeepCF) framework. In this paper, as an extension of DeepCF, before feeding the vectors into DNNs, we first input them into a feed-forward attention layer, which can improve the representation ability of the deep neural networks. By allowing different parts to contribute differently when compressing them into a single representation, attention-based architectures can learn to focus their “attention” on specific parts. Higher weights indicate that the corresponding factors are more informative for the recommendation. In addition, to alleviate the over-fitting issue and offset the weakness of MLP in capturing low-rank relations, a balance module is introduced by means of the generalized matrix factorization (GMF) model. Therefore, a novel model named Balanced Collaborative Filtering Network (BCFNet) is proposed, which consists of three sub-models, namely attentive representation learning (BCFNet-rl), attentive matching function learning (BCFNet-ml) and the balance module (BCFNet-bm).
The main contributions of this work are as follows.
We point out the significance of incorporating collaborative filtering methods based on representation learning and matching function learning, and then propose a novel BCFNet model that combines attentive representation learning, attentive matching function learning and balance module. The proposed model adopts the Deep+Shallow pattern and employs attention mechanism for collaborative filtering with implicit feedback.
A feed-forward attention mechanism is utilized to better capture the hidden information within implicit feedback and strengthen the learning ability of the neural network. A balance module is also designed to alleviate the over-fitting issue caused by the high sparsity of interaction information. These two strategies enable the proposed BCFNet model to have great flexibility in learning the complex matching function and to effectively learn low-rank relations between users and items.
Extensive experiments are conducted on eight real-world datasets to demonstrate the effectiveness and rationality of the BCFNet model. The results show that the proposed BCFNet model consistently outperforms the state-of-the-art methods.
The rest of this paper is organized as follows. Section 2 briefly reviews the related work. Section 3 is the preliminaries. Section 4 introduces the BCFNet model in detail. Section 5 presents and analyzes the experimental results. At last, Section 6 draws the conclusion of this paper.
Compared with implicit data, explicit feedback (e.g., product ratings) is more difficult to collect because most users do not tend to rate items. In fact, since users do not need to express their preference explicitly, implicit feedback such as clicks, view times, collections or purchase history can be collected at a larger scale and at a much lower cost than explicit feedback. It is therefore very important to design recommendation algorithms that can work with implicit feedback data [36, 33]. There are many well-known methods that study collaborative filtering with implicit feedback, such as ALS and SVD++.
Both models factorize the binary interaction matrix and assume that users dislike unobserved items, i.e., they assign 0 to unobserved items in the binary interaction matrix. However, several works consider that a user may simply never have seen the unobserved items [39, 35, 17], and instead assume that the user prefers the selected items to the unobserved ones. For example, Bayesian personalized ranking (BPR) is an effective learning algorithm for implicit CF that has been widely adopted in many related domains; it focuses on a pair-wise loss rather than a point-wise loss.
Since Simon Funk proposed Funk-SVD in the famous Netflix Prize competition, matrix factorization for collaborative filtering has been widely studied and constantly developed over the past ten years [40, 30, 32, 24]. The main idea of these works is to map users and items into a common representation space where they can be compared directly.
Recently, deep learning methods have shown promising results in various areas such as computer vision and natural language processing. Inspired by this success, some attempts have been made to introduce deep neural networks (DNNs) into recommender systems. A model named Collaborative Deep Learning (CDL) was proposed, which performs deep representation learning for the content information and collaborative filtering for the rating matrix. Besides, AutoRec, the first model attempting to learn user and item representations by using an auto-encoder to reconstruct the input ratings, has been applied to recommendation.
In addition, a deep learning architecture called DMF uses the rating matrix directly as the input and maps users and items into a common low-dimensional space via a deep neural network. Overall, representation learning-based methods learn representations in different ways and can flexibly incorporate auxiliary data. However, despite their effectiveness and many subsequent developments, they still resort to the dot product or cosine similarity as the interaction function when predicting the matching score.
Matrix factorization (MF) has shown its effectiveness in many recommender systems. However, most MF methods still use the dot product, which limits the expressiveness of the model when making predictions. Several recent works on neural recommender models have shown that learning the interaction function from data can yield better recommendation performance. NeuMF is a recently proposed framework that replaces the dot product used in vanilla MF with a neural network to learn the matching function. To offset the weakness of MLP in capturing low-rank relations, NeuMF unifies MF and MLP in one model. NNCF is a variant of NeuMF that takes user neighbors and item neighbors as inputs. Other than NeuMF, many other works attempt to learn the matching function directly by making full use of auxiliary data. For example, Wide&Deep adopts LR and MLP to learn the matching function from the continuous and categorical input features of users and items. DeepFM replaces LR with Factorization Machines (FM) to avoid manual feature engineering. Neural Factorization Machines (NFM) use a bi-interaction pooling layer to learn feature crosses. Moreover, tree-based models have also been studied and proven to be effective [55, 59, 49]. The Neural network based Aspect-level Collaborative Filtering model (NeuACF) exploits different aspect-level latent factors by using an attention mechanism with NCF. ConvNCF uses an outer product operation to replace the concatenation used in NeuMF and utilizes 2D convolution layers to learn the joint representation of user-item pairs. In this paper, we mainly focus on pure collaborative filtering without using auxiliary data.
Attention mechanism has shown effectiveness in various machine learning tasks such as machine translation and computer vision. Recently, several works have utilized the attention mechanism in recommender systems [20, 44, 45, 6, 51]. For instance, a model named Attentive Collaborative Filtering (ACF) was proposed to employ attention modeling in CF. In another work, the attention mechanism was introduced to capture the varying attention vectors of each specific user-item pair. To improve the performance of factorization, a model named attention factorization machine was designed, which learns the weights of feature interactions via attention networks. An attention-based user model called ATRank was also proposed, which utilizes a novel attention mechanism to model user behavior after considering the influences brought by other behaviors. The attention mechanism derives from the idea that human recognition usually cannot process an entire signal at once; instead, one focuses on only a few selective parts at a time. Its success is mainly due to its ability to assign attentive weights to the input vectors, where higher weights indicate that the corresponding factors are more informative for the recommendation. In this paper, we adopt the attention mechanism in our model to make more accurate predictions.
According to the above discussion, both representation learning-based and matching function learning-based collaborative filtering methods have been broadly studied and proven to be effective. Despite their strengths, both types of methods have weaknesses, i.e., the limited expressiveness of the dot product and the weakness in capturing low-rank relations, respectively. In our previous work, we pointed out the significance of combining the two types of collaborative filtering methods to overcome these weaknesses. In this paper, we present a novel model that ensembles these two types of methods and adds a balance module and an attention mechanism, which endows the model with great flexibility in learning the matching function while maintaining the ability to learn low-rank relations efficiently. For clarity, Table I summarizes the main notations used in this paper.
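Table I: Summary of the main notations.
| Symbol | Description |
|---|---|
| $M$ | The number of users |
| $N$ | The number of items |
| $\mathbf{Y}$ | Binary user-item interaction matrix |
| $y_{ui}$ | The interaction of user $u$ to item $i$ |
| $\rho$ | Negative sample ratio |
| $\mathcal{Y}^{+}$ | All the observed interactions in $\mathbf{Y}$ |
| $\mathcal{Y}^{-}$ | All the unobserved interactions in $\mathbf{Y}$ |
| $p_{ui}$ | The probability that user $u$ is matched by item $i$ |
| $\hat{y}_{ui}$ | The predicted interaction of user $u$ to item $i$ |
| $\Theta$ | The model parameters |
| $\mathbf{v}_u$ | The initial representation of user $u$ |
| $\mathbf{v}_i$ | The initial representation of item $i$ |
| $\mathbf{p}_u$ | The latent representation of user $u$ |
| $\mathbf{q}_i$ | The latent representation of item $i$ |
| $\mathbf{v}_{enc}$ | The encoder vector of the feed-forward attention layer |
| $\mathbf{v}_{dec}$ | The decoder vector of the feed-forward attention layer |
| $\boldsymbol{\alpha}$ | The attention ratio of the feed-forward attention layer |
| $\mathbf{z}^{rl}$ | The predictive vector of BCFNet-rl |
| $\mathbf{z}^{ml}$ | The predictive vector of BCFNet-ml |
| $\mathbf{z}^{bm}$ | The predictive vector of BCFNet-bm |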
Although implicit feedback is easier to obtain than explicit feedback, it is more challenging to utilize because it has two major problems. First, unlike ratings, implicit feedback is inherently noisy. When we track a user-item interaction ($y_{ui} = 1$), we can only guess the user's preference indirectly. For example, the observed interaction does not provide any specific information about how much a user likes an item. Second, the absence of an observed interaction ($y_{ui} = 0$) does not mean that user $u$ does not like item $i$. In fact, user $u$ may never have seen item $i$, since there are too many items in a system. The non-observed user-item interactions may be a mixture of real negative feedback and missing values. These two problems pose huge challenges in learning from implicit data, especially the second one.
Performing collaborative filtering on implicit data that lacks real negative feedback is also known as the One-Class Collaborative Filtering (OCCF) problem. To tackle the problem of unobserved negative samples, several approaches have been proposed, which can be classified into two categories: whole data based learning and sample based learning. The former assumes that all the unobserved data are weak negative instances and are equally weighted [27, 37], while the latter samples some negative instances from the unobserved interactions [37, 50, 21]. In this paper, we adopt the second approach, i.e., we uniformly sample negative instances from the unobserved interactions with the negative sample ratio $\rho$, i.e., the number of negative samples per positive instance. Later we will conduct experiments to verify the impact of $\rho$ on the proposed model. Let $\mathcal{Y}^{+}$ denote all the observed interactions in $\mathbf{Y}$ and $\mathcal{Y}^{-}$ denote the sampled unobserved interactions, i.e., the negative instances.
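The sampling step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_negatives`, the choice to draw the negatives for each positive instance from the same user's unobserved items (with replacement), and the fixed seed are not taken from the paper.

```python
import random


def sample_negatives(observed, num_items, rho, seed=0):
    """Draw `rho` unobserved (user, item) pairs per observed pair.

    Assumption: negatives for a positive (u, i) are sampled uniformly,
    with replacement, from the items that user u has not interacted with.
    """
    rng = random.Random(seed)
    observed_set = set(observed)

    # Count each user's observed items so that users who have interacted
    # with every item can be skipped (no negative exists for them).
    items_per_user = {}
    for user, _ in observed_set:
        items_per_user[user] = items_per_user.get(user, 0) + 1

    negatives = []
    for user, _item in observed:
        if items_per_user[user] >= num_items:
            continue
        drawn = 0
        while drawn < rho:
            candidate = (user, rng.randrange(num_items))
            if candidate not in observed_set:  # keep unobserved pairs only
                negatives.append(candidate)
                drawn += 1
    return negatives


positives = [(0, 0), (0, 1), (1, 2)]
negatives = sample_negatives(positives, num_items=5, rho=2)
print(len(negatives))  # rho negatives for each of the 3 positives
```

Resampling the negatives at the start of every epoch is a common variant; whether the paper does so is not stated in this section.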
To tackle the recommendation problem with implicit feedback, we can formulate it as an interaction prediction problem which estimates the missing values in the interaction matrix $\mathbf{Y}$, i.e., estimates whether the unobserved interactions would happen or not (whether the user would give a rating on the item or not). However, unlike explicit feedback, implicit feedback is discrete and binary. When each entry is a binary value of 1 or 0, the learning of a recommender model is often treated as a binary classification problem. Solving this binary classification problem alone does not help us to further rank and recommend items. One feasible solution is to employ a probabilistic treatment for the interaction matrix $\mathbf{Y}$. We can assume that $y_{ui}$ obeys a Bernoulli distribution:

$$P(y_{ui} = k \mid p_{ui}) = p_{ui}^{k} (1 - p_{ui})^{1 - k}, \quad k \in \{0, 1\},$$

where $p_{ui}$ is the probability of $y_{ui}$ being equal to 1. Moreover, $p_{ui}$ can be further interpreted as the probability that user $u$ is matched by item $i$. In this case, a value of 1 for $p_{ui}$ indicates that item $i$ perfectly matches user $u$, and a value of 0 indicates that user $u$ and item $i$ do not match at all. Rather than modeling $y_{ui}$, which is discrete and binary, our method models $p_{ui}$ instead. In this manner, we transform the binary classification problem, i.e., the interaction prediction problem, into a matching score prediction problem.
In order to enhance the learning ability of Deep Neural Networks (DNNs), we utilize a feed-forward attention mechanism [1, 38] before the learning process. Attention mechanism has been shown to be effective in various machine learning tasks such as machine translation, recommendation and computer vision [58, 52]. It has the advantage that it can assign different attentive scores to the input vectors, with higher values indicating that the corresponding vectors are more informative. For the representation learning-based CF methods, a feed-forward attention layer can be added before the representation function to enhance its learning ability for user features and item features respectively. Similarly, for the matching function learning-based CF methods, we also can add a feed-forward attention layer before its matching function, which can also improve its learning performance.
Suppose that there is a $k$-dimensional vector $\mathbf{v}_{enc}$ as the input of a feed-forward attention layer, which is called the encoder vector. Then the output of the attention layer is also a $k$-dimensional vector $\mathbf{v}_{dec}$, which is called the decoder vector. In this paper, we adopt a BP neural network to learn the relationship between $\mathbf{v}_{enc}$ and $\mathbf{v}_{dec}$. Therefore, the calculation process of $\mathbf{v}_{dec}$ can be formulated as:

$$\boldsymbol{\alpha} = a\left(\mathbf{W}^{\top} \mathbf{v}_{enc} + \mathbf{b}\right), \qquad \mathbf{v}_{dec} = \boldsymbol{\alpha} \odot \mathbf{v}_{enc},$$

where $a(\cdot)$ denotes the activation function, $\mathbf{W}$ and $\mathbf{b}$ denote the weight matrix and bias vector of the BP neural network, $\boldsymbol{\alpha}$ denotes the attention ratio of the feed-forward attention layer, and $\odot$ denotes the element-wise product of $\boldsymbol{\alpha}$ and $\mathbf{v}_{enc}$. The activation function is chosen so that the attention ratio $\boldsymbol{\alpha}$ produced by the BP neural network is a probability vector. The decoder vector $\mathbf{v}_{dec}$ is ultimately calculated from $\boldsymbol{\alpha}$ and $\mathbf{v}_{enc}$, so it reflects the importance of each element in $\mathbf{v}_{enc}$. In this formulation, the attention mechanism can be regarded as computing an adaptive weighting of the encoder vector $\mathbf{v}_{enc}$. The decoder vector $\mathbf{v}_{dec}$ can then be used for representation learning and matching function learning. We will also conduct experiments to demonstrate that the feed-forward attention mechanism can improve the learning ability of our model.
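A minimal Python sketch of such a feed-forward attention layer is given below. It is an illustration, not the authors' implementation: the name of the activation is not recoverable from this section, so a softmax is assumed here because it yields a probability vector, and the names `softmax` and `feed_forward_attention` as well as the row-major weight layout are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax; returns a probability vector."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def feed_forward_attention(v_enc, weight, bias):
    """Return (attention ratio, decoder vector) for an encoder vector.

    weight is a k x k matrix stored as weight[row][col]; the scores are
    W^T v_enc + b. Softmax is an ASSUMED activation (see lead-in).
    """
    k = len(v_enc)
    scores = [
        sum(weight[r][c] * v_enc[r] for r in range(k)) + bias[c]
        for c in range(k)
    ]
    alpha = softmax(scores)
    # Element-wise product: each input element is rescaled by its weight.
    v_dec = [a * v for a, v in zip(alpha, v_enc)]
    return alpha, v_dec


# With all-zero parameters every element receives the same weight 1/k.
alpha, v_dec = feed_forward_attention(
    [1.0, 2.0, 4.0], [[0.0] * 3 for _ in range(3)], [0.0] * 3
)
print(alpha, v_dec)
```

Because the output has the same dimension as the input, the layer can be placed in front of either the representation function or the matching function without changing the rest of the network.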
A model-based method generally assumes that data can be generated by an underlying model as $\hat{y}_{ui} = f(u, i \mid \Theta)$, where $\hat{y}_{ui}$ denotes the prediction of $y_{ui}$, i.e., the predicted probability that user $u$ is matched by item $i$, $\Theta$ denotes the model parameters, and $f$ denotes the function that maps the model parameters to the predicted score. We therefore need to answer two key questions: how to define the function $f$ and how to estimate the parameters $\Theta$. We will answer the first question in the next section.
For the second question, to estimate the parameters $\Theta$, most existing works optimize an objective function. Two types of objective functions are commonly used in recommender systems, namely, point-wise loss [27, 22] and pair-wise loss [39, 35, 17]. Point-wise learning methods usually try to minimize the loss between $\hat{y}_{ui}$ and its target value $y_{ui}$, while pair-wise learning maximizes the margin between an observed entry $\hat{y}_{ui}$ and an unobserved entry $\hat{y}_{uj}$. In this paper, we explore the point-wise loss only and leave the pair-wise loss to future work. Point-wise loss has been widely studied in collaborative filtering with explicit feedback under the regression framework [12, 40]. Existing point-wise methods usually perform a regression with the squared loss (SE) to learn the recommender model. However, the squared loss is derived by assuming that the error between the given rating and the predicted rating is generated from a Gaussian distribution, which does not hold in the implicit feedback scenario since $y_{ui}$ is discrete and binary. The squared loss may therefore be unsuitable for implicit data. As mentioned above, to accommodate the binary and discrete nature of implicit feedback data, we assume that $y_{ui}$ obeys a Bernoulli distribution with parameter $p_{ui}$. By replacing $p_{ui}$ with $\hat{y}_{ui}$ in the Bernoulli distribution above, we can define the likelihood function as

$$L(\Theta) = \prod_{(u,i) \in \mathcal{Y}^{+} \cup \mathcal{Y}^{-}} P(y_{ui} \mid \Theta) = \prod_{(u,i) \in \mathcal{Y}^{+} \cup \mathcal{Y}^{-}} \hat{y}_{ui}^{\,y_{ui}} \left(1 - \hat{y}_{ui}\right)^{1 - y_{ui}},$$

where $\mathcal{Y}^{+}$ denotes all the observed interactions in $\mathbf{Y}$ and $\mathcal{Y}^{-}$ denotes the sampled unobserved interactions, i.e., the negative instances. Furthermore, taking the negative logarithm of the likelihood (NLL), we obtain

$$\ell(\Theta) = -\sum_{(u,i) \in \mathcal{Y}^{+} \cup \mathcal{Y}^{-}} \left[ y_{ui} \log \hat{y}_{ui} + \left(1 - y_{ui}\right) \log\left(1 - \hat{y}_{ui}\right) \right].$$

Based on the above assumptions and formulations, we finally obtain an objective function that is suitable for learning from implicit feedback data, i.e., the binary cross-entropy loss function. The BCFNet model is trained by minimizing this objective function with gradient descent.
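The binary cross-entropy objective can be written directly from its definition as a sum over the positive and sampled negative instances. The sketch below is illustrative only; the function name `binary_cross_entropy` and the clipping constant `eps` (added to avoid taking the logarithm of zero) are assumptions rather than details from the paper.

```python
import math


def binary_cross_entropy(targets, predictions, eps=1e-12):
    """Negative log-likelihood of Bernoulli targets given predicted scores.

    targets are 0/1 interactions, predictions are matching scores in
    [0, 1]. Predictions are clipped to [eps, 1 - eps] (an assumption)
    so that log(0) is never evaluated.
    """
    total = 0.0
    for y, y_hat in zip(targets, predictions):
        y_hat = min(max(y_hat, eps), 1.0 - eps)
        total -= y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat)
    return total


# One positive and one negative instance.
uninformative = binary_cross_entropy([1, 0], [0.5, 0.5])
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
print(uninformative, confident)
```

Predictions that agree more closely with the targets give a smaller loss, which is what maximizing the likelihood requires.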
To sum up, the recommendation problem with implicit feedback can be formulated as an interaction prediction problem. To endow our algorithm with the ability to rank items for the recommendation task, we employ a probabilistic treatment for the interaction matrix $\mathbf{Y}$. We use a logistic function as the activation function of the output layer, so that $\hat{y}_{ui}$ is constrained to the range $[0, 1]$. Instead of modeling $y_{ui}$, we model $p_{ui}$, which is the probability of $y_{ui}$ being equal to 1. Since $p_{ui}$ can also be interpreted as the probability that user $u$ is matched by item $i$, the interaction prediction problem can be transformed into a matching score prediction problem. In this manner, using maximum likelihood estimation to estimate the model parameters is equivalent to minimizing the binary cross-entropy between $y_{ui}$ and $\hat{y}_{ui}$.
In this section, we introduce the proposed model named Balanced Collaborative Filtering Network (BCFNet) in detail. First, we present the architecture of the BCFNet model. Then, we introduce the three sub-modules of the model in turn, namely attentive representation learning (BCFNet-rl), attentive matching function learning (BCFNet-ml) and the balance module (BCFNet-bm). Finally, we describe how to fuse these three sub-modules and how to learn the final BCFNet model.
The proposed BCFNet model consists of three sub-modules, namely, attentive representation learning, attentive matching function learning and balance module. The architecture of the BCFNet model is shown in Fig. 1.
All three modules start by extracting data from the database. IDs, historical behaviors and other auxiliary data can all be used to construct the initial representations of user $u$ and item $i$, which are denoted by $\mathbf{v}_u$ and $\mathbf{v}_i$ respectively. The CF models then calculate $\mathbf{p}_u$ and $\mathbf{q}_i$, i.e., the latent representations for user $u$ and item $i$. Next, a non-parametric operation is performed on $\mathbf{p}_u$ and $\mathbf{q}_i$ to aggregate the latent representations. Finally, a mapping function is used to calculate the matching score $\hat{y}_{ui}$. In what follows, we introduce these three sub-modules and their implementations in detail.
For representation learning-based CF methods, the model focuses more on learning the representation functions, and the matching function is usually assumed to be simple and non-parametric, e.g., the dot product or cosine similarity. In this manner, the model is supposed to map users and items into a common space where they can be directly compared. For example, taking one-hot IDs as inputs, vanilla MF adopts linear embedding functions as the user and item representation functions to learn the latent representations. The latent representations $\mathbf{p}_u$ and $\mathbf{q}_i$ are then aggregated by the dot product to calculate the matching score. In this case, the mapping function is assumed to be the identity function. As another example, taking ratings as inputs, DMF adopts MLPs as the user and item representation functions to learn better latent representations by making full use of the non-linearity and high capacity of neural networks. The cosine similarity between $\mathbf{p}_u$ and $\mathbf{q}_i$ is then calculated and used as the matching score.
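The two non-parametric matching functions mentioned above are simple enough to state in code. The following Python sketch is for illustration only; the function names `dot_product` and `cosine_similarity` and the toy latent vectors are hypothetical.

```python
import math


def dot_product(p_u, q_i):
    """Matching score used by vanilla MF on latent vectors."""
    return sum(p * q for p, q in zip(p_u, q_i))


def cosine_similarity(p_u, q_i):
    """Matching score used by DMF-style models on latent vectors."""
    norm = math.sqrt(dot_product(p_u, p_u)) * math.sqrt(dot_product(q_i, q_i))
    return dot_product(p_u, q_i) / norm if norm > 0.0 else 0.0


p_u = [1.0, 2.0]  # toy latent representation of a user
q_i = [3.0, 4.0]  # toy latent representation of an item
print(dot_product(p_u, q_i), cosine_similarity(p_u, q_i))
```

Neither function has trainable parameters, so all of the model capacity sits in the representation functions that produce the latent vectors.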
In this paper, we focus on implicit feedback data only in the BCFNet model, so no auxiliary data are used. In particular, the user-item interaction matrix $\mathbf{Y}$ is taken as input. That is, the initial representation of user $u$ is the $u$-th row vector of $\mathbf{Y}$, i.e., $\mathbf{y}_{u*}$, and the initial representation of item $i$ is the $i$-th column vector of $\mathbf{Y}$, i.e., $\mathbf{y}_{*i}$.
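As a small illustration of this input construction, the Python sketch below reads the initial user and item representations off a toy interaction matrix. The helper names `user_vector` and `item_vector` and the dense list-of-lists storage are assumptions made for readability; a real system would most likely store the matrix in a sparse format.

```python
def user_vector(interactions, u):
    """Initial representation of user u: the u-th row of the matrix."""
    return list(interactions[u])


def item_vector(interactions, i):
    """Initial representation of item i: the i-th column of the matrix."""
    return [row[i] for row in interactions]


# Toy binary interaction matrix with 3 users (rows) and 4 items (columns).
Y = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
print(user_vector(Y, 0))  # length equals the number of items
print(item_vector(Y, 1))  # length equals the number of users
```

Note that the two representations have different lengths (the number of items versus the number of users), which is why separate encoder or embedding matrices are needed on the user side and the item side.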
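In the proposed BCFNet model, we design an attentive representation learning-based CF method that combines MLP with feed-forward attention to learn latent representations for users and items. Suppose the size of the encoder layer of the feed-forward attention layer is $k$. First, for user $u$, the decoder vector $\mathbf{v}_{dec}$ can be computed by applying the feed-forward attention layer to $\mathbf{y}_{u*}$ as follows:

$$\mathbf{v}_{enc} = \mathbf{W}_{enc}^{\top} \mathbf{y}_{u*}, \qquad \boldsymbol{\alpha} = a_{att}\left(\mathbf{W}_{att}^{\top} \mathbf{v}_{enc} + \mathbf{b}_{att}\right), \qquad \mathbf{v}_{dec} = \boldsymbol{\alpha} \odot \mathbf{v}_{enc},$$

where $\mathbf{W}_{enc}$ denotes the weight matrix of the encoder layer of the feed-forward attention layer, $\mathbf{v}_{enc}$ denotes the activation of the input layer, $a_{att}(\cdot)$ denotes the activation function, $\mathbf{W}_{att}$ and $\mathbf{b}_{att}$ denote the weight matrix and bias vector of the BP neural network respectively, and $\boldsymbol{\alpha}$ denotes the attention ratio of the feed-forward attention layer. Then, the MLP-based representation learning part for users can be defined as:

$$\mathbf{a}_0 = \mathbf{v}_{dec}, \qquad \mathbf{a}_x = a\left(\mathbf{W}_x^{\top} \mathbf{a}_{x-1} + \mathbf{b}_x\right), \quad x = 1, \dots, X, \qquad \mathbf{p}_u = \mathbf{a}_X,$$

where $\mathbf{a}_0$ denotes the input of the MLP, and $\mathbf{W}_x$, $\mathbf{b}_x$ and $\mathbf{a}_x$ denote the weight matrix, bias vector and activation of the $x$-th layer's perceptron respectively. $a(\cdot)$ is the activation function of the hidden layers. The latent representation $\mathbf{q}_i$ for item $i$ is calculated in the same manner. Different from the existing representation learning-based CF methods, the matching function part is defined as:

$$\mathbf{z}^{rl} = \mathbf{p}_u \odot \mathbf{q}_i, \qquad \hat{y}_{ui} = \sigma\left(\mathbf{W}_{out}^{\top} \mathbf{z}^{rl}\right),$$

where $\mathbf{z}^{rl}$, $\mathbf{W}_{out}$ and $\sigma(\cdot)$ denote the predictive vector, the weight matrix and the activation function respectively. By substituting the non-parametric dot product or cosine similarity with an element-wise product followed by a parametric neural network layer, our model still focuses on capturing low-rank relations between users and items but is more expressive, since the importance of the latent dimensions can differ and the mapping can be non-linear.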
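To make the last point concrete, the Python sketch below computes a matching score from an element-wise product followed by a weighted output layer. It is a hedged illustration: the function name `rl_matching_score` and the weight vector `w_out` are hypothetical, and a logistic sigmoid is assumed as the output activation, in line with the logistic output layer described in the preliminaries.

```python
import math


def rl_matching_score(p_u, q_i, w_out):
    """Element-wise product of latent vectors, then a weighted sigmoid.

    Returns (predictive vector, matching score). The sigmoid output
    activation is an assumption based on the logistic output layer
    mentioned earlier in the text.
    """
    z = [p * q for p, q in zip(p_u, q_i)]          # predictive vector
    logit = sum(w * z_j for w, z_j in zip(w_out, z))
    return z, 1.0 / (1.0 + math.exp(-logit))


p_u, q_i = [1.0, 2.0], [3.0, 4.0]

# With all-ones weights the logit is exactly the dot product (3 + 8 = 11),
# so the dot-product model is recovered as a special case.
z, score = rl_matching_score(p_u, q_i, [1.0, 1.0])
print(z, score)

# Learned weights can emphasise or suppress individual latent dimensions.
_, reweighted = rl_matching_score(p_u, q_i, [1.0, -0.5])
print(reweighted)
```

The all-ones case shows why this layer can still capture the low-rank similarity relation, while non-uniform weights give it the extra expressiveness described above.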
Matching function learning-based CF methods focus more on learning the matching function. The representation learning part is still necessary, since the initial representations of users and items, namely $\mathbf{v}_u$ and $\mathbf{v}_i$, are usually extremely sparse and high-dimensional, making it difficult for the model to learn the matching function directly. Therefore, matching function learning-based CF methods usually use a linear embedding layer to learn latent representations for users and items. With dense low-dimensional latent representations, the model is able to learn the matching function more efficiently.
In the proposed BCFNet model, we design an attentive matching function learning-based CF method that combines MLP with feed-forward attention to learn the matching function. Instead of IDs, we take the interaction matrix as input. Suppose the size of the embedding layer is . First, for user and item , the decoder vector can be computed by applying the feed-forward attention layer to and as follows:
where and are the parameter matrices of the linear embedding layers, denotes the MLP user vector, and denotes the MLP item vector. The matching function learning part based on MLP implementation can be defined as:
where denotes the input of MLP, , , and denote the weight matrix, bias vector and activation for the -th layer’s perceptron respectively, and denotes the predictive vector. In this manner, the attentive representation learning functions and are implemented by the linear embedding layers. The latent representations and are then aggregated by a simple concatenation operation. After the process of the feed-forward attention layer, MLP is used as the mapping function to calculate the matching score . Notice that although concatenation is the simplest aggregation operation, it maintains maximally the information passed from the previous layer and allows to make full use of the flexibility of the MLP model.
With the attention mechanism added, the unified framework of representation learning and matching function learning (i.e., the DeepCF framework proposed in the previous version) can be greatly improved. However, due to its DNN structure, it may also suffer from partial information loss and over-fitting. A real-life recommender system involves a large number of users and items and is subject to the sparsity problem. In this case, only relatively few interactions can be fed into the MLP components, which makes them prone to over-fitting and leads to mediocre results. In addition, during the deep learning process, some features of users and items may simply be ignored, and some important implicit feedback may be given low weight in the MLP.
Inspired by some shallow recommendation model without neural networks and attention mechanism, we add the generalized matrix factorization (GMF) model  to the BCFNet model as a balance module. As a shallow matrix factorization model, GMF adopts linear embedding function as representation function and uses dot product as matching function, which can offset the weakness of MLP in capturing low-rank relations and alleviate the over-fitting issue in DNNs. Therefore, assuming the size of the embedding layer is , the balance module can be formulated as:
where and are the parameter matrices of the linear embedding layers, denotes the MF user vector, denotes the MF item vector, and denotes the predictive vector. In the following experiments, it will be verified that the balance module is helpful to alleviate the over-fitting issue caused by the high sparsity of interaction information.
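A minimal sketch of the balance module as described above: two linear embedding layers followed by an element-wise product. All names and numbers here are hypothetical, and the form of the input vectors (one-hot IDs or rows and columns of the interaction matrix) is an assumption, since the original equation is not recoverable from the text.

```python
def linear_embed(x, weights):
    """Linear embedding: returns weights^T x.

    x is an input vector of length n; weights has n rows and k columns.
    """
    k = len(weights[0])
    out = [0.0] * k
    for x_j, row in zip(x, weights):
        if x_j != 0:  # skip zero entries of the sparse input
            for d in range(k):
                out[d] += x_j * row[d]
    return out


def balance_predictive_vector(user_input, item_input, P, Q):
    """MF user vector and MF item vector combined by element-wise product."""
    p_u = linear_embed(user_input, P)
    q_i = linear_embed(item_input, Q)
    return [a * b for a, b in zip(p_u, q_i)]


# Hypothetical toy setting: 3 items, 2 users, embedding size 2.
P = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # user-side embedding matrix
Q = [[1.0, 2.0], [3.0, 4.0]]              # item-side embedding matrix
user_input = [1, 0, 1]  # the user interacted with items 0 and 2
item_input = [0, 1]     # the item was interacted with by user 1
pred_bm = balance_predictive_vector(user_input, item_input, P, Q)
```

Because the module has no hidden layers, its only parameters are the two embedding matrices, which is what keeps it shallow compared with the MLP-based modules.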
In the previous three subsections, we have introduced the three modules of the proposed BCFNet model, each of which can be regarded as a separate model for recommender system. To incorporate these three modules, we need to design a strategy to fuse them so that they can enhance each other and improve the accuracy of the recommendation system. One of the most common fusing strategies is to concatenate the learned representations to obtain a joint representation and then feed it into a fully connected layer. As described in the previous three subsections, for BCFNet-rl, BCFNet-ml and BCFNet-bm, they generate the predictive vectors respectively, which are denoted as , and . And the predictive vectors can be viewed as the representation for the corresponding user-item pair. Since the three types of CF methods have different advantages and learn the predictive vectors from different perspectives, the concatenation of the three predictive vectors will result in a stronger and more robust joint representation for the user-item pair. What’s more, the consequent fully connected layer enables the model to assign different weights on the features contained in the joint representation. Therefore, the output of the fusion model can be defined as:
Using Eq. (12) to incorporate BCFNet-rl, BCFNet-ml and BCFNet-bm, we finally obtain the proposed BCFNet model.
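The fusion strategy described above (concatenate the three predictive vectors, then apply a fully connected output layer) can be sketched as follows. The function name `fuse`, the absence of a bias term and the sigmoid output are assumptions of this sketch, not a reproduction of Eq. (12).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(pred_rl, pred_ml, pred_bm, w_out):
    """Concatenate the three predictive vectors and apply one output layer.

    w_out holds one weight per feature of the joint representation, so the
    layer can weight features coming from different modules differently.
    """
    joint = list(pred_rl) + list(pred_ml) + list(pred_bm)
    if len(joint) != len(w_out):
        raise ValueError("w_out must have one weight per joint feature")
    return sigmoid(sum(w * z for w, z in zip(w_out, joint)))


# Hypothetical predictive vectors of size 2 from each module.
score = fuse([0.2, 0.4], [1.0, -1.0], [0.5, 0.5],
             [1.0, 1.0, 0.5, 0.5, 2.0, 0.0])
```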
As discussed in the previous section, the objective function for the BCFNet model is the binary cross-entropy function. To optimize the model, we use mini-batch Adam. The batch size is fixed to 256 and the learning rate is 0.00001. The model parameters are randomly initialized from a Gaussian distribution (with a mean of 0 and a standard deviation of 0.01), and the negative instances are uniformly sampled from unobserved interactions in each iteration. The learning algorithm for the proposed BCFNet model is summarized in Algorithm 1.
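The sampling and loss described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' Algorithm 1: the helper names, the toy data and the clipping constant `eps` are hypothetical, and the optimizer step is omitted.

```python
import math
import random


def sample_training_instances(user_items, num_items, neg_ratio, rng):
    """Pair every observed interaction with `neg_ratio` uniformly sampled
    unobserved items of the same user (label 1 = observed, 0 = sampled)."""
    instances = []
    for user in sorted(user_items):
        observed = user_items[user]
        unobserved = [j for j in range(num_items) if j not in observed]
        for item in sorted(observed):
            instances.append((user, item, 1))
            if not unobserved:
                continue  # nothing to sample for this user
            for _ in range(neg_ratio):
                instances.append((user, rng.choice(unobserved), 0))
    return instances


def binary_cross_entropy(labels, predictions, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, predictions):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# Hypothetical toy data: 2 users, 6 items, 4 negatives per positive.
user_items = {0: {1, 3}, 1: {0}}
batch = sample_training_instances(user_items, num_items=6, neg_ratio=4,
                                  rng=random.Random(0))
```

Calling the sampler again in every iteration, as the text describes, exposes the model to different negatives over the course of training.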
According to , the initialization is of significance to the convergence and performance of deep learning model. Using pre-trained models to initialize the ensemble model can significantly increase the convergence speed and improve the final performance. Since BCFNet is composed of three components, i.e., BCFNet-rl, BCFNet-ml and BCFNet-bm, we can pre-train these three components and use them to initialize BCFNet. Notice that BCFNet-rl, BCFNet-ml and BCFNet-bm are trained from scratch using Adam while the BCFNet with pre-training is optimized by the vanilla SGD. This is because Adam requires momentum information of the previous updated parameters which is not saved in BCFNet with pre-training.
|Datasets||# of Users||# of Items||# of Ratings||Sparsity|
In this section, we conduct experiments to demonstrate the effectiveness of the BCFNet model. First of all, we compare the proposed BCFNet model with seven existing models including the previous version, namely CFNet. Then, we conduct experiments to validate the effectiveness of the feed-forward attention layer and the balance module. We also verify the utility of pre-training by comparing the BCFNet models with and without pre-training. Finally, we analyze the effect of hyperparameters on the performance of the BCFNet model.
We implement the proposed model based on Keras (https://github.com/keras-team/keras) and TensorFlow (https://github.com/tensorflow/tensorflow); the code will be released publicly upon acceptance.
|Datasets||Measures||ACFNet||Improvement of||Improvement of|
|ACFNet-rl||ACFNet-bm||ACFNet-ml||ACFNet||ACFNet vs. NeuMF||ACFNet vs. CFNet|
We evaluate our models on eight real-world publicly available datasets: MovieLens 100k (ml-100k) and MovieLens 1M (ml-1m) (https://grouplens.org/datasets/movielens/), LastFM (lastfm) (http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/), FilmTrust (filmtrust) (https://www.librec.net/datasets.html), and Amazon baby (ABaby), Amazon beauty (ABeauty), Amazon music (AMusic) and Amazon toys (AToy) (http://jmcauley.ucsd.edu/data/amazon/). They are obtained from the following four main sources.
MovieLens: The MovieLens datasets have been widely used for movie recommendation. These datasets are collected from the MovieLens website by the GroupLens Research. We use the versions ml-100k and ml-1m in our experiments.
Lastfm: The lastfm dataset records the sequences of songs that users listen to. It is crawled from the Last.fm online system, which is the world’s largest social music platform.
Filmtrust: The filmtrust dataset is a dataset crawled from the entire filmtrust website in June 2011, which contains 1508 users, 2071 items and 35497 ratings.
Amazon: The Amazon datasets contain users’ rating data in Amazon. In our experiment, four datasets namely Baby, Beauty, Music and Toy are adopted.
Following , we adopt the leave-one-out evaluation, i.e., the latest interaction of each user is used for testing, while the remaining data for training. Since ranking all items is time-consuming, we randomly select 100 unobserved interactions as negative samples for each user. We then rank the 100 items for each user according to the prediction. We evaluate the model ranking performance through two widely adopted evaluation measures, namely Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG), which are defined respectively as follows
$$\mathrm{HR} = \frac{\#\mathrm{hits}}{\#\mathrm{users}}, \qquad \mathrm{NDCG} = \frac{1}{\#\mathrm{users}} \sum_{i=1}^{\#\mathrm{hits}} \frac{1}{\log_2(p_i + 1)},$$

where $\#\mathrm{hits}$ is the number of users whose test item appears in the recommended list, $\#\mathrm{users}$ is the total number of users, and $p_i$ is the position of the test item in the list for the $i$-th hit. The ranked list is truncated at 10 for both measures, i.e., HR@10 and NDCG@10. Intuitively, HR@10 measures whether the test item is present in the top-10 list, and NDCG@10 measures the ranking quality by assigning higher scores to hits at higher positions in the top-10 list. Larger values of HR@10 and NDCG@10 indicate better performance.
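The two measures described above can be computed with a short routine. This is a minimal sketch assuming the standard leave-one-out definitions, in which each user has exactly one test item and positions are counted from 1; the function name and the toy lists are hypothetical.

```python
import math


def hr_ndcg_at_k(ranked_lists, test_items, k=10):
    """Return (HR@k, NDCG@k) averaged over all users.

    ranked_lists[u] is the list of candidate items of user u sorted by
    predicted score (best first); test_items[u] is the held-out item.
    """
    hits = 0
    ndcg_sum = 0.0
    for ranked, target in zip(ranked_lists, test_items):
        top_k = ranked[:k]
        if target in top_k:
            hits += 1
            position = top_k.index(target) + 1  # 1-based rank of the hit
            ndcg_sum += 1.0 / math.log2(position + 1)
    num_users = len(test_items)
    return hits / num_users, ndcg_sum / num_users


# Hypothetical example with three users and a cutoff of 3.
ranked_lists = [[7, 2, 9, 4], [5, 1, 8, 3], [6, 0, 2, 9]]
test_items = [7, 8, 9]  # ranks 1 and 3 are hits; rank 4 misses the cutoff
hr, ndcg = hr_ndcg_at_k(ranked_lists, test_items, k=3)
```

In the evaluation protocol of this section, each ranked list would contain the test item together with the 100 sampled negatives, and the cutoff would be 10.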
For a fair comparison, we set the weight of observed interactions for each user to 1 for all methods. We sample four negative instances per positive instance, i.e., the negative sampling ratio is set to 4 by default. We set the number of predictive factors to 128 on all the datasets except AMusic, on which the number of predictive factors is set to 64. We generally employ two hidden layers for BCFNet-rl and three for BCFNet-ml.
We compare the proposed BCFNet model with the following seven methods.
ItemPop is a non-personalized method that is often used as a benchmark for recommendation tasks. Items are simply ranked by their popularity, i.e., the number of interactions.
ItemKNN  is a standard item-based collaborative filtering method.
BPR  is a widely used learning framework for item recommendation with implicit feedback. It is a sample-based method that optimizes the MF model with a pair-wise ranking loss.
MLP  is a matching function learning-based collaborative filtering method. It uses multiple layers of nonlinearities to model the relationships between users and items.
DMF  is a state-of-the-art representation learning-based MF method which performs deep matrix factorization to learn a common low dimensional space with normalized cross entropy loss as loss function. It uses a two-pathway neural network architecture to replace the linear embedding operation used in vanilla matrix factorization. We ignore the explicit ratings and take the implicit feedback as input in this paper.
NeuMF  is a state-of-the-art matching function learning-based MF method which combines hidden layers of GMF and MLP to learn the interaction function based on cross entropy loss. NeuMF takes IDs as input and adopts the deep+shallow pattern which has been widely used in many works such as [8, 15].
CFNet  is the previous version of BCFNet, which incorporates collaborative filtering methods based on representation learning and matching function learning to learn the complex matching function and low-rank relations between users and items.
The comparison results are listed in Table III. The best scores among the BCFNet model and its sub-models and the best scores among other methods are highlighted respectively in bold. According to the table, we have the following key observations:
The proposed BCFNet model achieves the best performance on all the datasets except for NDCG on AMusic and AToy, and obtains large improvements over the state-of-the-art methods. More importantly, most of the improvements grow as data sparsity increases, where the datasets are arranged in order of increasing data sparsity. This supports the effectiveness of the proposed BCFNet model, which combines attentive representation learning-based CF methods, attentive matching function learning-based CF methods and the balance module.
As a typical representation learning method, DMF has some merit compared with the traditional methods, but the proposed BCFNet-rl model consistently outperforms it. This indicates that adding a feed-forward attention layer and a parametric neural network layer significantly improves the learning ability of representation learning.
BCFNet-ml outperforms the MLP model on most datasets. This demonstrates the effectiveness of the attention mechanism in improving matching function learning.
On top of BCFNet-rl and BCFNet-ml, BCFNet-bm also contributes substantially to the performance of the proposed BCFNet model, especially in terms of NDCG. In most cases, the improvement grows as data sparsity increases. This indicates the effectiveness of the balance module in addressing the over-fitting issue caused by data sparsity, which will be further confirmed in the next subsection.
In order to investigate the impact of feed-forward attention layer and balance module in BCFNet, we conduct experiments on BCFNet without attention and balance module (abbr. BCFNet-without-AB, i.e. CFNet in ), BCFNet without attention (abbr. BCFNet-without-A) and BCFNet without balance module (abbr. BCFNet-without-B). As shown in Fig. 2, BCFNet outperforms BCFNet-without-AB, BCFNet-without-A and BCFNet-without-B in all cases. This result verifies the effectiveness of the feed-forward attention layer in enhancing the learning ability of the proposed neural network model. Moreover, BCFNet-without-AB outperforms BCFNet-without-A and BCFNet-without-B on some datasets, which shows the necessity of combining the feed-forward attention layer and the balance module.
In addition, we conduct further experiments on BCFNet with the balance module (i.e., the BCFNet model) and BCFNet without the balance module (abbr. BCFNet-without-B) to verify the effectiveness of the balance module in alleviating the over-fitting issue of the neural network. To simulate the over-fitting issue caused by the high sparsity of item interaction information in a recommender system, we divide each selected original dataset into three sub-datasets according to item popularity, which are termed popularity levels 1, 2 and 3 respectively. A higher popularity level means that the items in the sub-dataset are more popular and have more interaction information. In particular, the item set is evenly partitioned into three subsets according to item popularity, and all the interactions associated with the items in each subset form the corresponding sub-dataset. Therefore, the sparsity of interaction information decreases as the item popularity level increases. Some of the datasets used in our experiments are not very sparse, or their three sub-datasets cannot satisfy the leave-one-out evaluation condition that requires 100 negative samples for each user. Therefore, only five datasets are used in this experiment, namely lastfm, ABaby, ABeauty, AMusic and AToy. We run BCFNet and BCFNet-without-B on each sub-dataset. As shown in Fig. 3, the performance of both BCFNet and BCFNet-without-B improves greatly as the item popularity level increases. However, in most cases the gain brought by the balance module shrinks as the item popularity level increases, i.e., the largest gain is obtained at the lowest item popularity level. This illustrates that the balance module helps to alleviate the over-fitting issue caused by the high sparsity of item interaction information.
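The popularity-based partition described above can be sketched as follows. The function name, the tie-breaking rule (by item identifier) and the use of interaction counts as the popularity measure are assumptions of this sketch, since the text does not specify them.

```python
from collections import Counter


def split_by_popularity(interactions, levels=3):
    """Partition items evenly into `levels` groups by interaction count and
    return the interactions of each group (level 1 = least popular items)."""
    counts = Counter(item for _, item in interactions)
    # Least popular first; ties are broken by item id (an assumption).
    ranked = sorted(counts, key=lambda item: (counts[item], item))
    n = len(ranked)
    level_of = {item: idx * levels // n + 1 for idx, item in enumerate(ranked)}
    subsets = {level: [] for level in range(1, levels + 1)}
    for user, item in interactions:
        subsets[level_of[item]].append((user, item))
    return subsets


# Hypothetical toy data: item k (k = 1..6) has exactly k interactions.
interactions = [(user, item) for item in range(1, 7) for user in range(item)]
subsets = split_by_popularity(interactions)
```

Each level contains the same number of items, so the higher levels hold more interactions and are therefore less sparse, which matches the setting described in the text.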
Different from the BCFNet with pre-training, we use mini-batch Adam to learn the BCFNet without pre-training with random initializations. As shown in Table IV, the BCFNet with pre-training (i.e. BCFNet) outperforms the BCFNet without pre-training (abbr. BCFNet-without-P) in all cases. This result verifies the utility of the pre-training process which ensures BCFNet-rl, BCFNet-ml and BCFNet-bm to learn features from different perspectives and therefore allows the model to generate better results.
To analyze the effect of the negative sampling ratio, i.e., the number of negative samples per positive instance, we test different ratios on the eight datasets. From the results shown in Fig. 4, we find that sampling fewer than three negative instances is not enough and that sampling more negative instances is helpful. In most cases, the best HR@10 and NDCG@10 are obtained when the negative sampling ratio is set to 4. Overall, the optimal sampling ratio is around 4 to 8. Sampling too many negative instances not only requires more time to train the model but also degrades the performance, which is consistent with the results shown in .
Another hyperparameter used in the BCFNet model is the number of predictive factors, i.e., the dimensions of , and . To this end, we test the number of predictive factors in , and the results are listed in Table V. As shown in Table V, the proposed model generates the best performance with 128 predictive factors on most of the datasets except the AMusic dataset. On the AMusic dataset, the best performance is achieved with 64 factors. According to our observation, more predictive factors usually lead to better performances since it endows the model with larger capability and greater ability of representation.
In this paper, we have presented a novel recommendation model called Balanced Collaborative Filtering Network (BCFNet), which combines attentive representation learning (BCFNet-rl), attentive matching function learning (BCFNet-ml) and a balance module (BCFNet-bm). It therefore has the advantages of both representation learning and matching function learning. In addition, by introducing a feed-forward attention layer, the learning ability of both attentive representation learning and attentive matching function learning is further improved. Furthermore, adding a balance module that uses neither a neural network nor an attention mechanism alleviates the over-fitting issue and captures low-rank relations. Extensive experiments on eight real-world datasets demonstrate the effectiveness and rationality of the proposed BCFNet model.
This work was supported by NSFC (61876193), Guangdong Natural Science Funds for Distinguished Young Scholar (2016A030306014), and NSF through grants IIS-1526499, IIS-1763325, and CNS-1626432.
|Datasets||Measures||Dimensions of predictive vectors|
Speech recognition with deep recurrent neural networks. In ICASSP, pp. 6645–6649.
Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In ICML, pp. 880–887.
AutoRec: autoencoders meet collaborative filtering. In WWW, pp. 111–112.
GB-CENT: gradient boosted categorical embedding and numerical trees. In WWW, pp. 1311–1319.