In the era of information explosion, information overload is one of the dilemmas we are confronted with. Recommender systems (RSs) are instrumental in addressing this problem, because they help users identify the information they prefer [Xue et al.2017]. Further, to better model users’ preferences, neural architectures based on deep learning are employed [He et al.2017b, Xue et al.2017]. Recent work in this direction includes NeuMF [He et al.2017b] and DMF [Xue et al.2017]. Basically, most methods represent the user and item as hidden semantic vectors and then compare the hidden representations by cosine similarity or a multi-layer perceptron (MLP) to predict the rating.
Despite the success of previous methods, they are still too simple to characterize users’ complex preferences. In movie recommendation, for example, a user usually judges a movie from multiple perspectives, such as acting quality and movie style. All of these perspectives affect the preference, which is difficult for traditional neural methods to characterize. To tackle this problem, in this paper, we encode the user and item into hidden representations from multiple perspectives and then compare the hidden representations to predict the preference.
However, there still exist two challenges for the encoding process: to model hierarchically organized perspectives and to capture the correlation between user and item.
First, the perspectives are hierarchically organized from specific elements to abstract summarization. In the movie domain, for example, there are basic aspects such as actor, director and shooting technique, on which abstract aspects such as acting quality and movie style are built. In detail, movie style is decided by director and shooting technique, while actor and director mostly determine the acting quality. Regarding the neural model, the output of each perspective is the representation of the user or item in that perspective. For example, the encoded representation of the user in the actor perspective represents the user’s preference for actors, while the encoded representation of the item in the movie style perspective indicates the style of this movie. The low-level representations should support the high-level analysis, which motivates us to employ a hierarchical deep neural architecture. Thus, it is reasonable to apply multiple sequential stages and to encode the user and item from multiple perspectives in each stage.
Second, the correlation between user and item is weak in the encoding process of current models. However, according to psychological studies [Carlson et al.2009], a user’s preference is subjective and is slightly adjusted according to the specific item, while the features of a specific item may be perceived slightly differently by different users. Therefore, we employ the attention mechanism [Schmidhuber and rgen2015] to address the correlated effects between user and item.
Specifically, to model the user’s complex preference for an item, we propose a novel neural architecture for the top-N recommendation task. Overall, our model encodes the user and item into hidden semantic representations and then maps the hidden representations to a predicted preference degree with cosine similarity. Regarding the encoding process, our model leverages several sequential stages to model the hierarchically organized perspectives. Each stage contains several perspectives, and in each perspective the representations of user and item adjust each other through the attention mechanism. Besides, we study two methods for constructing the attention signal, which are referred to as “Softmax-ATT” and “Correlated-ATT”.
We evaluate the effectiveness of our neural architecture for the top-N recommendation task on six datasets from five domains (i.e., Movie, Book, Music, Baby Product, Office Product). Experimental results on these datasets demonstrate that our model consistently outperforms the baselines and achieves state-of-the-art performance among deep recommendation models.
In summary, our contributions are outlined as follows:
- We propose a novel neural architecture for recommender systems, which focuses on the hierarchically organized perspectives and the correlation between user and item.
- To the best of our knowledge, this is the first paper to introduce the attention mechanism into neural recommender systems.
- Experimental results show the effectiveness of our proposed architecture, which outperforms other state-of-the-art methods in the top-N recommendation task.
The organization of this paper is as follows. First, problem formulation and related work are introduced. Second, our neural architecture is discussed. Third, we conduct the experiments to verify our model. Last, concluding remarks are in the final section.
2 Problem Formulation & Related Work
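Suppose there is a set of users and a set of items. Let the rating matrix record the rating of each user on each item, with a dedicated marker for unknown ratings. There are two ways to construct the user-item interaction matrix, which indicates whether a user has interacted with an item. Equation (1) marks every observed entry with a constant indicator, whereas Equation (2) keeps the observed rating as the entry; unobserved entries are zero in both cases.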
Most traditional recommendation models take Equation (1) as the input [Wu et al.2016, He et al.2017b], while some recent work uses the ratings themselves as the known entries, as Equation (2) shows [Xue et al.2017]. We adopt the second setting, because we suppose the explicit ratings in Equation (2) reflect the preference level of a user for an item.
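The two input settings can be sketched as follows. This is a minimal illustration, assuming that the first setting stores 1 for every observed entry and that unobserved entries are 0 in both settings; the names `ratings`, `implicit_matrix` and `explicit_matrix` are hypothetical and do not come from the paper.

```python
# Illustrative sketch only: function and variable names are hypothetical.
# `ratings` maps (user_index, item_index) -> observed rating.

def implicit_matrix(ratings, n_users, n_items):
    """Equation (1) style (assumed): 1 for an observed entry, 0 otherwise."""
    matrix = [[0.0] * n_items for _ in range(n_users)]
    for (user, item) in ratings:
        matrix[user][item] = 1.0
    return matrix


def explicit_matrix(ratings, n_users, n_items):
    """Equation (2) style: the rating itself for an observed entry, 0 otherwise."""
    matrix = [[0.0] * n_items for _ in range(n_users)]
    for (user, item), rating in ratings.items():
        matrix[user][item] = float(rating)
    return matrix


ratings = {(0, 0): 5, (0, 2): 3, (1, 1): 4}
print(implicit_matrix(ratings, 2, 3))
print(explicit_matrix(ratings, 2, 3))
```

The second function keeps the rating magnitude, which is the property the text relies on when it argues that explicit ratings reflect the preference level.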
Recommender systems are conventionally formulated as the problem of estimating the rating of each unobserved entry in the interaction matrix, which is then used to rank the items. Model-based approaches, the mainstream methodology, leverage an underlying model to generate all the ratings: given its parameters, the recommendation model maps a user and an item to the predicted score of their interaction. With the scores predicted by the model, we can rank the items for an individual user to conduct personalized recommendation.
First, matrix factorization was proposed for this task as a semantic latent space methodology. The classical latent factor model [Koren, Bell, and Volinsky2009] predicts an entry as the inner product of the hidden representation of the user and the hidden representation of the item. Many related studies follow, such as [Koren2008, Mcauley2013Hidden, Bao2014TopicMF].
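As a small worked example of the latent factor prediction, the score is simply the inner product of two latent vectors. The vectors and the name `predict_lfm` below are made up for illustration.

```python
# Illustrative sketch only: `predict_lfm` and the vectors are hypothetical.

def predict_lfm(user_factors, item_factors):
    """Predicted score as the inner product of user and item latent vectors."""
    return sum(p * q for p, q in zip(user_factors, item_factors))


score = predict_lfm([0.5, 1.0, -0.5], [2.0, 1.0, 1.0])
print(score)  # 0.5*2.0 + 1.0*1.0 + (-0.5)*1.0 = 1.5
```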
Then, auxiliary data such as social relationships were incorporated into recommendation for further improvement [Ma et al.2008]. However, because such additional data are difficult to obtain and often noisy, this methodology remains limited.
Recently, to learn non-linear interactions, neural collaborative filtering (NeuCF) [He et al.2017b] presents an approach in which users and items are embedded into numerical vectors and the embeddings are then processed by a multi-layer perceptron to learn the users’ preferences. Deep matrix factorization (DMF) [Xue et al.2017] combines the spirit of the latent factor model and the neural collaborative filtering method. Specifically, DMF independently encodes the user and item by a multi-layer perceptron (MLP) and then compares the hidden representations of user and item from the MLP in the manner of Equation (4) to predict the preference degree. In fact, DMF takes advantage of deep representation learning to achieve state-of-the-art performance.
We list the notation used in the following sections. Users and items are each referred to by an index. The user-item interaction matrix is formulated as in Equation (2); we distinguish the observed interactions, the zero elements of the interaction matrix, and the negative instances generated by sampling. Notably, the notation also distinguishes the training and development data from the source of the test data. Further, for a matrix, we refer to its rows, its columns and its individual entries by their indices.
In this section, we first introduce the overall sketch of our proposed neural architecture, which is illustrated in Fig.2. Then, we discuss the details of each component in a bottom-up manner, namely the interaction matrix, the sequential stages and the cosine similarity. The implementation of each stage and of the attention mechanism (demonstrated in Fig.3 and Fig.4) is also analyzed. Last, we present our loss function and training algorithm.
3.1 Neural Architecture
Our neural architecture is demonstrated in Fig.2. Basically, our model is composed of three components, namely the interaction matrix, the sequential stages and the cosine similarity.
Interaction Matrix. As mentioned in the previous section, we form the interaction matrix as in Equation (2), which is the input of our model. From the interaction matrix, each user is represented as a high-dimensional vector (the corresponding row), which holds the user’s ratings across all items, while each item is represented as a high-dimensional vector (the corresponding column), which holds the item’s ratings across all users. Notably, it is a conventional trick to fill unknown entries with zero. To overcome the sparsity of the interaction matrix, the inputs of user and item are each transformed by a linear layer with the ReLU activation function. The input of the layer is the user’s row or the item’s column of the interaction matrix, the parameters of the linear layer are learned, and the output is the resulting representation of the user or item.
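The input transformation can be sketched as a linear layer followed by ReLU. This is a minimal sketch under assumptions: the layer is taken to have a weight matrix and a bias vector, and the names `relu`, `linear`, `input_layer` and the toy values are hypothetical.

```python
# Illustrative sketch only: names, shapes and values are assumptions.

def relu(vector):
    """Element-wise ReLU: max(0, x)."""
    return [max(0.0, v) for v in vector]


def linear(vector, weights, bias):
    """Affine map: one output per weight row (weights is out_dim x in_dim)."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def input_layer(sparse_input, weights, bias):
    """Dense representation of a sparse user row or item column."""
    return relu(linear(sparse_input, weights, bias))


user_row = [5.0, 0.0, 3.0]            # one user's ratings over three items
weights = [[0.1, 0.2, 0.3],
           [-1.0, 0.0, 0.0]]
bias = [0.0, 1.0]
print(input_layer(user_row, weights, bias))
```

The second output unit receives a negative pre-activation and is clipped to zero, which is the only non-linearity in this layer.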
Sequential Stages. In order to model the hierarchically organized perspectives shown in Fig.1, we leverage multiple sequential stages, shown in Fig.2. Each stage contains several perspectives that model the user and item representations from multiple aspects. Every perspective takes the output of the previous stage as its input, and the outputs of all the perspectives in one stage are concatenated, separately for user and item, to form the output representations of this stage, as shown in Fig.2.
Specifically, in one perspective, the inputs of this perspective, namely the output representations of user and item from the previous stage, are first transformed by a linear layer with the ReLU activation function. Each perspective of each stage has its own linear-layer parameters for the user and for the item.
Then, an attention signal is generated from the output of the linear layer by the attention mechanism. There is one attention function for the user and one for the item, and each yields the attention signal of the user or item in this perspective of this stage.
Last, the output of this perspective is generated by weighting the output of the linear layer with the attention signal through the element-wise product. That is, the output of a perspective for the user (or item) is the element-wise product of the user’s (or item’s) attention signal and the user’s (or item’s) linear-layer output in that perspective and stage.
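The three steps of a perspective and the concatenation within a stage can be sketched as below. This is a sketch under stated assumptions, not the authors’ implementation: the attention function is passed in as a placeholder (`constant_attention` is purely illustrative), the linear layers are assumed to have biases, and all names are hypothetical.

```python
# Illustrative sketch only: names, parameter layout and the placeholder
# attention function are assumptions.

def relu(vector):
    return [max(0.0, v) for v in vector]


def linear(vector, weights, bias):
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def perspective(prev_user, prev_item, params, attention):
    """Linear layer + ReLU, attention signal, then element-wise product."""
    lin_user = relu(linear(prev_user, params["w_user"], params["b_user"]))
    lin_item = relu(linear(prev_item, params["w_item"], params["b_item"]))
    att_user, att_item = attention(lin_user, lin_item)
    out_user = [a * h for a, h in zip(att_user, lin_user)]
    out_item = [a * h for a, h in zip(att_item, lin_item)]
    return out_user, out_item


def stage(prev_user, prev_item, perspectives, attention):
    """Concatenate the outputs of all perspectives, separately per side."""
    stage_user, stage_item = [], []
    for params in perspectives:
        out_user, out_item = perspective(prev_user, prev_item, params, attention)
        stage_user.extend(out_user)
        stage_item.extend(out_item)
    return stage_user, stage_item


def constant_attention(lin_user, lin_item):
    """Placeholder attention: weights every dimension by 0.5."""
    return [0.5] * len(lin_user), [0.5] * len(lin_item)


identity = {"w_user": [[1.0, 0.0], [0.0, 1.0]], "b_user": [0.0, 0.0],
            "w_item": [[1.0, 0.0], [0.0, 1.0]], "b_item": [0.0, 0.0]}
user_out, item_out = stage([2.0, -4.0], [6.0, 8.0],
                           [identity, identity], constant_attention)
print(user_out, item_out)
```

Stacking several calls to `stage`, each fed with the previous stage’s output, gives the sequential structure described in the text.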
Cosine Similarity. To generate the user’s preference for the item, we compare the output representations of user and item in the final stage with cosine similarity, which is a conventional operation in neural architectures [Wang, Mi, and Ittycheriah2016]. The predicted preference of the user for the item is the inner product of the two final-stage representations divided by the product of their vector lengths.
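For concreteness, the cosine similarity between two final-stage vectors can be computed as in the following sketch. Returning 0 for a zero-length vector is an assumption made here to avoid division by zero; the name `cosine_similarity` is illustrative.

```python
# Illustrative sketch only; the zero-vector convention is an assumption.
import math


def cosine_similarity(user_vec, item_vec):
    """Inner product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(user_vec, item_vec))
    norm_user = math.sqrt(sum(a * a for a in user_vec))
    norm_item = math.sqrt(sum(b * b for b in item_vec))
    if norm_user == 0.0 or norm_item == 0.0:
        return 0.0  # assumed convention for an all-zero representation
    return dot / (norm_user * norm_item)


print(cosine_similarity([3.0, 4.0], [3.0, 4.0]))  # identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```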
3.2 Attention Mechanism
As motivated in the Introduction, to characterize the correlations between user and item, we leverage the attention mechanism to refine the encoded representations of user and item, as Equation (9) and Equation (10) show. With the attention mechanism, the final representations of user and item are more flexible and characterize the user’s complex preference for the item more precisely.
First, as shown in Fig.3, we directly employ a softmax layer to construct the attention signal, which is a conventional and common form for attention-based methods [Yang et al.2017, cui2016attention, yin2015abcnn]. Each perspective of each stage has one attention matrix for the user and one for the item; the attention function applies the softmax operation to a vector computed from the attention matrix and the output of the linear layer introduced in the last subsection.
Notably, the attention matrices are model parameters to be learned. Specifically, the attention signal for the user is generated from the representation of the item, while the attention signal for the item is generated from the representation of the user, which accords with our motivation of correlation. We call this attention setting “Softmax-ATT”.
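A minimal sketch of “Softmax-ATT” follows, assuming the attention signal is the softmax of the attention matrix multiplied by the other side’s linear-layer output. The exact form of the paper’s equation is not reproduced here; matrix shapes, the function names and the toy values are assumptions.

```python
# Illustrative sketch only: the matrix-vector form and all names are assumed.
import math


def softmax(vector):
    """Numerically stable softmax over a list of floats."""
    peak = max(vector)
    exps = [math.exp(v - peak) for v in vector]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax_att(lin_user, lin_item, att_user_matrix, att_item_matrix):
    """User attention comes from the item side, and vice versa."""
    att_user = softmax(matvec(att_user_matrix, lin_item))  # len(lin_user) weights
    att_item = softmax(matvec(att_item_matrix, lin_user))  # len(lin_item) weights
    return att_user, att_item


lin_user = [1.0, 2.0]
lin_item = [0.5, 0.0, 1.5]
att_user_matrix = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]      # 2 x 3, learned in practice
att_item_matrix = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]    # 3 x 2, learned in practice
att_user, att_item = softmax_att(lin_user, lin_item,
                                 att_user_matrix, att_item_matrix)
print(att_user, att_item)
```

Because each signal is a softmax output, its entries are positive and sum to one, so the element-wise product rescales rather than flips the linear-layer output.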
However, the correlation modeled by a simple softmax operation can still be improved. For more effective correlation modeling, we propose a novel attention structure, shown in Fig.4. First, we compute the softmax vectors for user and item as the first attention method does, which gives the output of the softmax layer in each perspective of each stage. Then, we construct the correlation matrix between the representations of user and item from these softmax outputs; the correlation matrix of a perspective contains the correlated information of all the dimensions of user and item. Last, we process the correlation matrix with an activation function and average over its rows and columns to obtain the attention vectors for user and item. With the explicit computation of the correlation matrix, the correlated effects between user and item are characterized to a better extent. We call this attention setting “Correlated-ATT”.
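The three steps of “Correlated-ATT” can be sketched as below. Several details are assumptions made for illustration only: the correlation matrix is taken to be the outer product of the two softmax vectors, the activation is taken to be tanh, and the user’s attention vector is taken from the row averages while the item’s comes from the column averages. All names are hypothetical.

```python
# Illustrative sketch only: outer product, tanh and the row/column
# assignment are assumptions, not confirmed details of the paper.
import math


def softmax(vector):
    peak = max(vector)
    exps = [math.exp(v - peak) for v in vector]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def correlated_att(lin_user, lin_item, att_user_matrix, att_item_matrix):
    # Step 1: softmax vectors, as in the first attention setting.
    soft_user = softmax(matvec(att_user_matrix, lin_item))
    soft_item = softmax(matvec(att_item_matrix, lin_user))
    # Step 2: correlation matrix (assumed outer product), then activation.
    corr = [[math.tanh(u * i) for i in soft_item] for u in soft_user]
    # Step 3: average each row for the user, each column for the item.
    att_user = [sum(row) / len(row) for row in corr]
    att_item = [sum(row[c] for row in corr) / len(corr)
                for c in range(len(soft_item))]
    return att_user, att_item


zeros_2x3 = [[0.0] * 3 for _ in range(2)]
zeros_3x2 = [[0.0] * 2 for _ in range(3)]
att_user, att_item = correlated_att([1.0, 2.0], [0.5, 0.0, 1.5],
                                    zeros_2x3, zeros_3x2)
print(att_user, att_item)
```

With all-zero attention matrices both softmax vectors are uniform, so every attention weight equals tanh(1/2 · 1/3), which makes the shapes easy to check by hand.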
The definition of the objective function for model optimization is critical for recommendation models. For our model, we adopt a point-wise objective function with the cross-entropy loss. Although the square loss is widely used in many existing models [Hu, Koren, and Volinsky2008, mnih2008probabilistic], neural architectures usually employ the cross-entropy loss [He et al.2017a, wu2017sequence]. Thus, our objective function is the cross-entropy between the golden rating and the predicted score, where the other symbols are introduced in Related Work. Following previous literature [He et al.2017a, Xue et al.2017], the target value is a binarized 1 or 0 for the rating, denoting whether the user has interacted with the item or not. Besides, the model is trained using stochastic gradient descent (SGD) with Adam [Kingma and Ba2014], which is an adaptive learning rate algorithm.
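A point-wise cross-entropy objective over binarized targets can be sketched as follows. Summing rather than averaging over instances, and clipping the prediction into (0, 1) with a small `eps` so that the logarithm stays finite, are assumptions made for this sketch; the name `cross_entropy_loss` is hypothetical.

```python
# Illustrative sketch only: summation and the clipping constant are assumed.
import math


def cross_entropy_loss(targets, predictions, eps=1e-6):
    """Sum of binary cross-entropy terms over (target, prediction) pairs."""
    total = 0.0
    for target, pred in zip(targets, predictions):
        pred = min(max(pred, eps), 1.0 - eps)  # keep log() well defined
        total -= target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred)
    return total


# One positive and one negative instance, both predicted at 0.5.
print(cross_entropy_loss([1.0, 0.0], [0.5, 0.5]))
```

A confident correct prediction yields a smaller loss than a confident wrong one, which is what drives the ranking of positives above sampled negatives.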
The training process needs negative samples, whereas all the ratings in the training set are positive. Thus, for each positive sample, we randomly sample several negative samples that appear in none of the training, development and test sets. The negative sample ratio denotes how many negative samples are generated for one positive instance.
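The sampling procedure can be sketched as below, assuming uniform sampling without replacement among the items a user has never interacted with in any split. The names `sample_negatives`, `observed` and the fixed seed are illustrative assumptions.

```python
# Illustrative sketch only: uniform sampling and all names are assumptions.
import random


def sample_negatives(positives, observed, n_items, ratio, rng):
    """Return (user, item, label) triples: each positive plus `ratio` negatives.

    `observed[user]` must contain every item the user interacted with in the
    training, development and test data, so that negatives avoid all of them.
    """
    samples = []
    for user, item in positives:
        samples.append((user, item, 1.0))
        candidates = [j for j in range(n_items) if j not in observed[user]]
        for negative in rng.sample(candidates, ratio):
            samples.append((user, negative, 0.0))
    return samples


observed = {0: {0, 2}, 1: {1}}
positives = [(0, 0), (1, 1)]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
training_samples = sample_negatives(positives, observed, n_items=10,
                                    ratio=5, rng=rng)
print(len(training_samples))
```

With a negative sample ratio of 5, each positive instance contributes six training triples in total.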
Table 1: NDCG@10 and HR@10 comparisons of different methods (columns: Datasets, Metrics, Baselines, Our Methods, Improvements over the Best Baseline). We conduct a t-test for statistical significance, which indicates that all of the improvements are statistically significant.
In this section, first, we will introduce the basic experimental settings, namely datasets, evaluation and implementation. Then, we will conduct the experiments about model performance. Last, we will analyze the sensitivity to hyper-parameters for our model.
4.1 Experimental Setting
Datasets. We evaluate our models on six widely used datasets from five domains in recommender systems: MovieLens 100K (Movie), MovieLens 1M (Movie-1M), Amazon music (Music), Amazon Kindle books (Book), Amazon office product (Office) and Amazon baby product (Baby) (https://grouplens.org/datasets/movielens/, http://jmcauley.ucsd.edu/data/amazon/). We process the datasets according to previous literature [Wu et al.2016, Xue et al.2017, He et al.2017b]. The Movie and Movie-1M datasets are left unprocessed, because they are already filtered. The other datasets are filtered to be similar to the MovieLens data: only users with at least 20 interactions and items with at least 5 interactions are retained. (We will publish our filtered datasets once accepted.) The statistics of all six processed datasets are listed in Tab.2.
Evaluation. To verify the performance of our model for item recommendation, we adopt the leave-one-out evaluation, which has been widely used in the related literature [He et al.2017b, Xue et al.2017]. We hold out the latest interaction of each user as the test item and use the remaining data for training. Since it is too time-consuming to rank all the items for every user during testing, following [Koren, Bell, and Volinsky2009, He et al.2017b, Xue et al.2017], we randomly sample 100 items that the corresponding user has not interacted with as the test set for this user. The test item is then ranked among these 100 items according to the prediction scores. We use Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG) to evaluate the ranking performance [Xue et al.2017, He et al.2017a]. By default, in our experiments, we truncate the ranked list at 10 for both metrics, so that HR/NDCG means HR@10/NDCG@10, as in previous literature [Xue et al.2017]. The notation HR@K/NDCG@K is defined analogously.
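Under the leave-one-out protocol there is a single relevant item per user, so both metrics reduce to simple functions of that item’s rank. The sketch below assumes a 0-based rank, counts only strictly higher-scored sampled items as ranked ahead (the tie-breaking rule is an assumption), and uses hypothetical function names.

```python
# Illustrative sketch only: tie handling and names are assumptions.
import math


def rank_of_test_item(test_score, negative_scores):
    """0-based rank: number of sampled items scored strictly higher."""
    return sum(1 for score in negative_scores if score > test_score)


def hit_ratio_at_k(rank, k=10):
    """1 if the held-out item appears in the top-k list, else 0."""
    return 1.0 if rank < k else 0.0


def ndcg_at_k(rank, k=10):
    """With one relevant item, NDCG is 1 / log2(rank + 2) inside the top-k."""
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0


rank = rank_of_test_item(0.8, [0.9, 0.95, 0.1])  # two items score higher
print(rank, hit_ratio_at_k(rank), ndcg_at_k(rank))
```

Averaging these per-user values over all users gives the reported HR@K and NDCG@K.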
Implementation. We implement our proposed methods based on Tensorflow (https://www.tensorflow.org) and the released code of DMF [Xue et al.2017]. Our code will be released publicly upon acceptance. To determine the hyper-parameters of our model, we randomly sample one interaction for each user as the development data and tune the hyper-parameters on it. For the neural part of our model, we randomly initialize the model parameters from a Gaussian distribution.
We search over the batch size, the number of negative instances per positive instance, the learning rate, the number of stages, the number of perspectives in each stage, the dimension of all the linear layers, the dimension of the output of the non-final stages and the dimension of the output of the final stage, and select the best-performing value of each on the development data.
4.2 Performance Verification
Baselines. As our proposed methods aim to model the relationship between users and items, we follow [Xue et al.2017] and [He et al.2017b] and mainly compare with user-item models. Thus, we leave out the comparison with item-item models, such as CDAE [Wu et al.2016]. Since neural recommendation has only recently begun to receive attention, we list two suitable recent baseline models.
NeuMF. This is a neural matrix factorization method for item recommendation. This method embeds the user and item as hidden representations and then leverages a multi-layer perceptron to learn the user-item interaction function based on the embeddings of user and item. We implement the pre-training version of NeuMF and tune its hyper-parameters in the same way as [He et al.2017b].
DMF. This is the state-of-the-art neural recommendation method. This method encodes the user and item into hidden representations independently and compares the representations of user and item to predict the user’s preference degree for the item. We implement DMF and tune its hyper-parameters in the same way as [Xue et al.2017].
Conclusions. The comparisons are reported in Tab.1, from which we draw the following conclusions:
- Our method outperforms the baselines on all the datasets, which justifies the effectiveness of our model.
- “Correlated-ATT” performs better than “Softmax-ATT”, which means that characterizing the correlations between user and item improves the model performance.
- In some domains, the improvement is clearly larger than in the others. We suppose these domains have clearer hierarchical perspectives. In the Music domain, for example, there are many low-level aspects such as singer, writer, composer, volume and speed, on which high-level aspects such as genre, style and melody are built and analyzed.
4.3 Sensitivity to Hyper-Parameters
In this subsection, to verify the effect of the hyper-parameters, we use the “Correlated-ATT” setting for the attention mechanism and, by default, the optimal experimental settings introduced in Implementation.
HR@K & NDCG@K. Fig.6 shows the performance of top-K recommended lists over a range of ranking positions K. Our method demonstrates consistent improvements over the other methods across different K. On the Movie dataset, our model outperforms DMF by 0.0239 for HR@K and 0.010 for NDCG@K on average, while on the Music dataset, our method improves over DMF by 0.0360 for HR@K and 0.0261 for NDCG@K on average. This comparison demonstrates the consistent effectiveness of our methods.
Effect of Number of Negative Samples. As argued in the previous section, our method samples negative instances from unobserved data for training. In this experiment, different negative sampling ratios are tested for their effect on performance (e.g., neg-5 indicates that the negative sampling ratio is 5, that is, we sample 5 negative instances per positive instance). From the results in Tab.3, we find that a larger negative sample ratio can lead to better performance, while an overly large ratio seems to harm the results. For example, for NDCG on the Movie dataset, the performance increases up to neg-5, while it drops after neg-9. In detail, the optimal negative sample ratio is around 5, which is consistent with previous research [He et al.2017a, Xue et al.2017].
Table 3: Effect of the negative sample ratio (columns: Datasets, Metric, Negative Sample Ratio).
Effect of Number of Layers. Since we model the hierarchically organized perspectives, the depth, or the number of layers, could be a critical factor in our method. Thus, we conduct experiments to test the effect of depth. As shown in Fig.7, the 3-layer architectures work best among all the tested models. Specifically, on the Movie dataset, the optimal performance of layer-3 exceeds that of layer-2 by 0.021 for HR and 0.019 for NDCG, while on the Music dataset, the optimal performance of layer-3 exceeds that of layer-2 by 0.072 for HR and 0.014 for NDCG. Thus, we conjecture that deeper models can extract more abstract perspectives, which helps to boost the performance.
Effect of Final Latent Dimension. Besides the negative sample ratio and the number of layers, the final latent dimension is also a sensitive factor, since it directly determines how the predicted user preference is generated. We vary the final latent dimension in the experiments. As shown in Tab.4, a larger final dimension leads to better performance. On the Movie dataset, for example, HR increases with the latent dimension. Thus, we suppose a larger latent dimension can encode more information into the final representations, which leads to better prediction accuracy.
Table 4: Effect of the final latent dimension (columns: Datasets, Metric, Final Latent Dimension).
Training Loss and Performance. Fig.5 shows the training loss (averaged over all the training instances) and the recommendation performance of our method and the state-of-the-art baselines at each iteration on the Movie dataset. Results on the other datasets show the same trend and are omitted due to space limitations. From the results, we draw two observations. First, with more iterations, the training loss of our method gradually decreases and the recommendation performance improves. The most effective updates occur in the first 10 iterations, and more iterations increase the risk of overfitting, which accords with common knowledge. Second, our method achieves a lower training loss than DMF, which illustrates that our model fits the data better. Thus, a better performance than DMF is expected. Overall, the experiments show the effectiveness of our method.
In this paper, we propose a novel neural architecture for recommender systems. Our model encodes the user and item from multiple hierarchically organized perspectives with the attention mechanism and then compares the abstract representations to predict the user’s preference for the item. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our proposed methods. We will publish our poster, slides, datasets and code at https://www.github.com/....
- [Carlson et al.2009] Carlson, N. R.; Heth, D.; Miller, H.; Donahoe, J.; and Martin, G. N. 2009. Psychology: the science of behavior. Pearson.
- [He et al.2017a] He, S.; Liu, C.; Liu, K.; and Zhao, J. 2017a. Generating natural answers by incorporating copying and retrieving mechanisms in sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, 199–208.
- [He et al.2017b] He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T. 2017b. Neural collaborative filtering. 25th international world wide web conference 173–182.
- [Hu, Koren, and Volinsky2008] Hu, Y.; Koren, Y.; and Volinsky, C. 2008. Collaborative filtering for implicit feedback datasets. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on, 263–272. IEEE.
- [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [Koren, Bell, and Volinsky2009] Koren, Y.; Bell, R.; and Volinsky, C. 2009. Matrix factorization techniques for recommender systems. Computer 42(8):30–37.
- [Koren2008] Koren, Y. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 426–434.
- [Li, Kawale, and Fu2015] Li, S.; Kawale, J.; and Fu, Y. 2015. Deep collaborative filtering via marginalized denoising auto-encoder. In ACM International on Conference on Information and Knowledge Management, 811–820.
- [Ma et al.2008] Ma, H.; Yang, H.; Lyu, M. R.; and King, I. 2008. SoRec: social recommendation using probabilistic matrix factorization. In ACM Conference on Information and Knowledge Management, 931–940.
- [Salakhutdinov, Mnih, and Hinton2007] Salakhutdinov, R.; Mnih, A.; and Hinton, G. 2007. Restricted boltzmann machines for collaborative filtering. In International Conference on Machine Learning, 791–798.
- [Schmidhuber and rgen2015] Schmidhuber, J. 2015. Deep learning in neural networks: An overview. Neural Networks 61:85–117.
- [Wang, Mi, and Ittycheriah2016] Wang, Z.; Mi, H.; and Ittycheriah, A. 2016. Semi-supervised clustering for short text via deep representation learning. In the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL).
- [Wu et al.2016] Wu, Y.; Dubois, C.; Zheng, A. X.; and Ester, M. 2016. Collaborative denoising auto-encoders for top-n recommender systems. In ACM International Conference on Web Search and Data Mining, 153–162.
- [Xue et al.2017] Xue, H. J.; Dai, X. Y.; Zhang, J.; Huang, S.; and Chen, J. 2017. Deep matrix factorization models for recommender systems. In International Joint Conference on Artificial Intelligence, 3203–3209.
- [Yang et al.2017] Yang, Z.; Hu, J.; Salakhutdinov, R.; and Cohen, W. W. 2017. Semi-supervised qa with generative domain-adaptive nets. arXiv preprint arXiv:1702.02206.