With the rapid development of the internet, people spend more time online searching for interesting and useful information on platforms such as Twitter, Quora and Amazon. To improve long-term user engagement, recommender systems (RS) Koren et al. (2009) have been extensively developed and widely applied, and improving their accuracy is of great concern. Based on the types of input data, methods for RS can be categorized into collaborative filtering (CF) methods Breese et al. (1998), content-based filtering methods Pazzani and Billsus (2007) and hybrid methods Burke (2002). CF methods learn from historical interactions between users and items. Content-based methods compare the auxiliary contents of users and items. Hybrid methods combine the above two methods.
It has been shown that RS tasks can be formulated as a matrix completion (MC) problem Candès and Recht (2009); Recht (2011). Given a small set of observed entries, the MC model aims to fill in the missing entries of a matrix. Although MC with a low-rank constraint is an NP-hard problem, solid recovery theories for it have been established in Candès and Recht (2009); Recht (2011); Ge et al. (2017); Zhu et al. (2018). By replacing the nuclear norm, which arises as a convex relaxation of the low-rank constraint, with its equivalent factorized form, MC can be efficiently solved by matrix factorization (MF) techniques Cabral et al. (2013).
In many real applications, side information such as user/item features can be collected in addition to the rating matrix. Motivated by this, matrix completion with side information, called inductive matrix completion (IMC), was proposed Xu et al. (2013); Jain and Dhillon (2013) to boost the performance of MC. Similar to MC, IMC also has an equivalent factorized formulation. In the factorized formulation, IMC first projects user/item features with a learnable matrix to get a low-dimensional latent factor for each user/item, and then makes a prediction by the dot product of the user and item latent factors. In real applications, the factorized version of IMC is favored Jain and Dhillon (2013); Si et al. (2016); Chiang et al. (2015) because it is efficient to optimize and implement. Many variants have been proposed to boost the performance of IMC, such as deep learning based IMC He et al. (2017); Wang et al. (2015). Unless otherwise stated, IMC refers to the factorized version in the rest of this paper.
In the factorized version, IMC models can be interpreted as learning an individual representation for each feature, independently of the other features. Moreover, the representation of a given feature is shared across all users/items. Realistically, it is more intuitive that representations for features should depend on both user and item scenarios. For example, people of the same age in different regions may like watching movies of different genres due to the different cultures of the regions. That means the same age should be summarized differently, in terms of interest in movies, depending on the region. Similarly, one person may have different interests in music of the same genre from different countries. This indicates that the same features of a person should be summarized differently depending on the music in question. However, because features are represented independently and the representation of each feature is shared across all users/items, IMC models cannot achieve this goal.
Unfortunately, few existing IMC models attempt to break the above limitation and learn context-aware representations for features, which is more intuitive in real applications as stated above. For example, He et al. (2017); Dziugaite and Roy (2015); Oord et al. (2013); Tay et al. (2018); Chen et al. (2017) applied an attention mechanism to learn and assign different weights to features in the linear combination of feature representations that yields the representations for users/items. Wang et al. (2015) applied an autoencoder to reconstruct the original features, and used the hidden states of the middle layer as representations for users/items. He et al. (2018); Hsieh et al. (2017) adopted different functions for predictions instead of the dot product. Even though many works have been proposed to improve IMC, the independence between features and the sharing of the same feature representations across all users/items still exist in existing IMC models, including the above works.
In this paper, we propose a novel IMC method, called collaborative self-attention (CSA), for recommender systems. The contributions of CSA are listed as follows:
We propose a novel self-attention based deep architecture to collaboratively learn context-aware representations for features.
To the best of our knowledge, CSA is the first context-aware IMC model that is free of both the independence between features and the sharing of the same feature representations across all users/items, two characteristics that limit the expressiveness of existing IMC models.
Extensive experiments demonstrate the effectiveness of CSA.
In this section, we introduce some preliminaries related to CSA, including matrix completion (MC), inductive matrix completion (IMC) and the self-attention mechanism.
Bold uppercase letters like $\mathbf{Y}$ denote matrices, where rows represent users, columns represent items and entries represent ratings on items by users. $\mathbf{Y}_{i*}$ denotes the $i$th row of $\mathbf{Y}$ and $\mathbf{Y}_{*j}$ denotes the $j$th column of $\mathbf{Y}$. $Y_{ij}$ denotes the element of $\mathbf{Y}$ located at the $i$th row and $j$th column. $\mathbf{Y}^T$ denotes the transpose of $\mathbf{Y}$. Bold lowercase letters like $\mathbf{x}$ denote vectors, and $x_i$ denotes the $i$th element of $\mathbf{x}$. $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. $\|\cdot\|_*$ denotes the nuclear norm of a matrix. $\odot$ denotes the Hadamard product operation.
2.2 Matrix Completion
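Suppose $\mathbf{Y} \in \mathbb{R}^{m \times n}$ is the matrix that we want to recover. Given a small set of observed entries $\{Y_{ij}\}$, $(i,j) \in \Omega$, MC aims to fill in the missing entries of $\mathbf{Y}$. The objective of the MC problem can be formulated as follows Candès and Recht (2009); Recht (2011):
$$\min_{\mathbf{Z}} \sum_{(i,j) \in \Omega} \ell(Y_{ij}, Z_{ij}) + \lambda \|\mathbf{Z}\|_*,$$
where $\lambda$ is the hyper-parameter for the nuclear norm.
$\ell(\cdot,\cdot)$ is the loss function. Many different loss functions can be used according to the application. For example, $\ell(Y_{ij}, Z_{ij}) = (Y_{ij} - Z_{ij})^2$ for square loss, where $Y_{ij}$ could be a real-valued or binary label, and $\ell(Y_{ij}, Z_{ij}) = -Y_{ij}\log\sigma(Z_{ij}) - (1 - Y_{ij})\log(1 - \sigma(Z_{ij}))$ for negative log logistic loss if $Y_{ij}$ is a binary label and $\sigma(z) = 1/(1 + e^{-z})$. Recovery theories have been established for the above MC problem. It has been shown that the matrix can be exactly recovered given a sufficiently large number of observed entries. As shown in Cabral et al. (2013); Yao and Li (2018); Fan et al. (2017); Jain et al. (2013), MC can be solved by the matrix factorization (MF) technique.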
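$$\min_{\mathbf{U}, \mathbf{V}} \sum_{(i,j) \in \Omega} \ell(Y_{ij}, \mathbf{U}_{i*}\mathbf{V}_{j*}^T) + \lambda_1 \|\mathbf{U}\|_F^2 + \lambda_2 \|\mathbf{V}\|_F^2,$$
where $\lambda_1$ and $\lambda_2$ are hyper-parameters for regularization terms, and $\mathbf{U} \in \mathbb{R}^{m \times k}$ and $\mathbf{V} \in \mathbb{R}^{n \times k}$ for $m$ users and $n$ items. The rows of $\mathbf{U}$ ($\mathbf{V}$) denote latent factors of users (items). $k$ is the dimension of the latent factors. Many variants have been proposed to boost the performance of MC. For example, Hsieh et al. (2017); He et al. (2018) adopted different score functions instead of the dot product. He et al. (2017); Dziugaite and Roy (2015) learned a non-linear transformation for the latent factors of users (items) via a multilayer perceptron (MLP). Sedhain et al. (2015) used interacted items to represent user latent factors. Yao and Kwok (2018) adopted a non-convex loss function.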
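As a minimal sketch of the factorized formulation above, the following pure-Python snippet fits user and item latent factors to a handful of observed entries with square loss and Frobenius-norm regularization, using plain stochastic gradient descent. The function names, the toy ratings and all hyper-parameter values are illustrative assumptions rather than settings taken from this paper.

```python
import random

def factorize(observed, n_users, n_items, k=2, lr=0.02, reg=0.01, epochs=2000, seed=0):
    """Fit latent factors U (users) and V (items) by SGD on observed entries only.

    observed maps (user, item) -> rating. Names and defaults are assumptions.
    """
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    entries = list(observed.items())
    for _ in range(epochs):
        rng.shuffle(entries)
        for (i, j), y in entries:
            err = sum(U[i][f] * V[j][f] for f in range(k)) - y
            for f in range(k):
                u, v = U[i][f], V[j][f]
                # Gradient of 0.5 * err^2 plus the L2 penalty on each factor.
                U[i][f] -= lr * (err * v + reg * u)
                V[j][f] -= lr * (err * u + reg * v)
    return U, V

def predict(U, V, i, j):
    """Predicted rating: dot product of user and item latent factors."""
    return sum(a * b for a, b in zip(U[i], V[j]))

# Hypothetical toy data: 3 users, 3 items, 6 observed ratings.
observed = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (2, 1): 1.0, (2, 2): 2.0, (1, 2): 3.0}
U, V = factorize(observed, n_users=3, n_items=3)
train_error = max(abs(predict(U, V, i, j) - y) for (i, j), y in observed.items())
filled = predict(U, V, 0, 2)  # an entry that was never observed
```

The unobserved entry `(0, 2)` is filled in from the learned factors, which is the sense in which MF completes the matrix.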
2.3 Inductive Matrix Completion
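In many real applications, side information such as user/item features can be collected in addition to the rating matrix. Motivated by this, inductive matrix completion (IMC) Xu et al. (2013); Jain and Dhillon (2013) was proposed to utilize side information. Similar to MC, the factorized version of IMC can be formulated as follows Jain and Dhillon (2013); Zhang et al. (2018):
$$\min_{\mathbf{W}^u, \mathbf{W}^v} \sum_{(i,j) \in \Omega} \ell\big(Y_{ij}, \mathbf{X}^u_{i*}\mathbf{W}^u (\mathbf{X}^v_{j*}\mathbf{W}^v)^T\big) + \lambda_1 \|\mathbf{W}^u\|_F^2 + \lambda_2 \|\mathbf{W}^v\|_F^2,$$
where $\lambda_1$ and $\lambda_2$ are hyper-parameters for regularization terms. $\mathbf{X}^u \in \mathbb{R}^{m \times d_u}$ ($\mathbf{X}^v \in \mathbb{R}^{n \times d_v}$) denotes the input user (item) features. $d_u$ ($d_v$) denotes the dimension of the features. $\mathbf{W}^u \in \mathbb{R}^{d_u \times k}$ ($\mathbf{W}^v \in \mathbb{R}^{d_v \times k}$) is the mapping matrix of user (item) features. Because $\mathbf{X}^u_{i*}\mathbf{W}^u = \sum_{p=1}^{d_u} X^u_{ip}\mathbf{W}^u_{p*}$, we find that IMC actually learns an individual representation for each feature (e.g., $\mathbf{W}^u_{p*}$ for feature $p$), and these representations are independent of each other. Moreover, the representation for feature $p$ with value $X^u_{ip}$ is shared across all users. As stated in the previous section, although many variants have been proposed to boost the performance of the IMC model, the independence between features and the sharing of the same feature representations across all users/items still exist in existing IMC models.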
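The observation that IMC assigns one shared representation to each feature can be illustrated with a small sketch. Assuming the usual factorized form, in which a latent factor is the feature vector multiplied by a mapping matrix, row `p` of the mapping matrix acts as the representation of feature `p` for every user. The names `project` and `imc_score` and the toy numbers are hypothetical.

```python
def project(features, W):
    """Latent factor = features @ W, i.e. a value-weighted sum of the rows of W."""
    k = len(W[0])
    return [sum(x * row[f] for x, row in zip(features, W)) for f in range(k)]

def imc_score(user_feats, item_feats, W_user, W_item):
    """Dot product of the projected user and item latent factors."""
    u = project(user_feats, W_user)
    v = project(item_feats, W_item)
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy values: 3 user features, 2 item features, latent dimension 2.
W_user = [[0.5, 0.1], [0.2, 0.4], [0.3, 0.3]]
W_item = [[1.0, 0.0], [0.0, 1.0]]
alice = [1.0, 0.0, 1.0]   # has features 0 and 2
bob = [1.0, 1.0, 0.0]     # has features 0 and 1
movie = [1.0, 1.0]

score_alice = imc_score(alice, movie, W_user, W_item)
score_bob = imc_score(bob, movie, W_user, W_item)
# Feature 0 contributes the same vector W_user[0] to both users,
# regardless of which other features they have or which item is scored.
shared_representation = W_user[0]
```

Both users receive exactly the same contribution from feature 0, which is the limitation that context-aware representations are meant to remove.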
2.4 Self-Attention Mechanism
Recently, the self-attention mechanism Vaswani et al. (2017); Parikh et al. (2016); Cheng et al. (2016) was proposed to capture the relationships between entities, such as words in a sequence, via an attention operation. In other words, it can learn context-aware representations for entities, which is more intuitive in a wide range of applications. Self-attention has achieved great success in many tasks, such as machine translation Vaswani et al. (2017), question answering (QA) Yu et al. (2018), semantic role labeling Tan et al. (2018) and so on. However, to the best of our knowledge, no existing work applies self-attention in RS to learn context-aware representations for features.
In this section, we present the details of our CSA model. CSA mainly contains four parts: user within-encoder, item within-encoder, cross-encoder and prediction layer. Within-encoder is proposed to generate context-aware representations for user (item) features conditioned on user (item) scenarios, named as ‘within’ context-aware representations. Cross-encoder is proposed to generate context-aware representations for user (item) features conditioned on item (user) scenarios, named as ‘cross’ context-aware representations. Prediction layer is proposed to make predictions based on the learned context-aware representations of features.
Figure 1 illustrates the model architecture of CSA. First, the within-encoder receives the input representations of user (item) features and generates ‘within’ context-aware representations for user (item) features. Then, the cross-encoder receives the outputs of the within-encoders as inputs and further generates ‘cross’ context-aware representations for user (item) features. Finally, the outputs of the cross-encoder are fed to the prediction layer to predict the score.
We have separate within-encoders for user features and item features. Each within-encoder block consists of two sub-layers. The first sub-layer is a multi-head self-attention layer. The second sub-layer is a feed-forward network layer. A residual connection He et al. (2016) is applied to each sub-layer, followed by layer normalization Ba et al. (2016).
For features of user/item, we need to learn embedding vectors for them. Let () denote the embedding vectors of user (item) features. Taking the values of features into consideration, the input representations with respect to a specific user (item ) for the following parts of CSA are denoted as (), which are formulated as follows:
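One plausible reading of the step above is sketched below, under the assumption that taking feature values into consideration means scaling the embedding vector of each feature by the value that feature takes for the given user. The function name and the toy numbers are hypothetical, and the exact formulation used by CSA may differ.

```python
def input_representations(feature_values, embeddings):
    """Scale row p of the embedding table by the value of feature p.

    feature_values: one number per feature for a single user (or item).
    embeddings: one embedding vector per feature, shared by all users.
    Assumption: value-weighted embeddings; absent features give zero rows.
    """
    return [[x * e for e in row] for x, row in zip(feature_values, embeddings)]

# Hypothetical example: 3 features, embedding dimension 2.
embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
user_features = [1.0, 0.0, 2.0]
H = input_representations(user_features, embeddings)
```

Under this assumption a feature with value zero produces an all-zero row, which is the kind of row-sparse input that a masked softmax would skip in the attention layer.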
In order to generate ‘within’ context-aware representations for features, we adapt self-attention mechanism to achieve this goal. Here, we take a specific user as an example to describe the mechanism of this sub-layer. For a specific within-encoder layer, we denote as input representations of this layer with respect to a specific user . ‘Within’ context-aware representations for are formulated as follows:
where and they are parameters we need to learn. If is sparse with respect to row, masked softmax would be performed instead of softmax. From the above formulations, we can see that each row of is conditioned on all rows of . It indicates that can capture the relationships between different rows of within the scenarios of user . Instead of applying single-head attention on the inputs, we adopt multi-head attention to get stable outputs:
where and .
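The mechanism described above can be sketched in plain Python as follows. This is a minimal illustration that assumes the scaled dot-product form of Vaswani et al. (2017) with learned query, key and value projections; the helper names, the output mixing matrix `Wo` and the toy values are assumptions, not the authors' implementation.

```python
import math

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def masked_softmax(scores, mask):
    """Softmax over positions where mask is True; masked positions get weight 0."""
    kept = [s for s, m in zip(scores, mask) if m]
    if not kept:
        return [0.0] * len(scores)
    top = max(kept)
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(H, Wq, Wk, Wv, mask=None):
    """Single-head scaled dot-product self-attention over the rows of H."""
    mask = mask or [True] * len(H)
    Q, K, V = matmul(H, Wq), matmul(H, Wk), matmul(H, Wv)
    scale = math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        weights = masked_softmax(scores, mask)
        out.append([sum(w * v[f] for w, v in zip(weights, V)) for f in range(len(V[0]))])
    return out

def multi_head(H, heads, Wo, mask=None):
    """Concatenate the outputs of several heads row by row, then mix with Wo."""
    per_head = [self_attention(H, Wq, Wk, Wv, mask) for Wq, Wk, Wv in heads]
    concat = [sum((head[i] for head in per_head), []) for i in range(len(H))]
    return matmul(concat, Wo)

# Toy input: three feature rows of width 2; the third row is empty and masked out.
H = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
mask = [True, True, False]
I2 = [[1.0, 0.0], [0.0, 1.0]]
heads = [(I2, I2, I2), (I2, I2, I2)]
Wo = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]
out = multi_head(H, heads, Wo, mask)
```

Each output row is a weighted combination of all unmasked rows, so the representation of one feature depends on the other features present for the same user.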
Feed-Forward Network Layer
To enable high flexibility of the model, a feed-forward network layer follows the self-attention layer. Suppose with a specific user to be the output of the self-attention layer, then output of the feed-forward network layer is formulated as follows:
where , and they are parameters we need to learn.
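A minimal sketch of this sub-layer for a single feature row is given below, including the residual connection and layer normalization that wrap each sub-layer. The ReLU activation and the omission of the learnable gain and bias in layer normalization are simplifying assumptions, as are all names and toy weights.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def feed_forward(x, W1, b1, W2, b2):
    """Two-layer network with ReLU (assumed activation), applied to one row x."""
    hidden = [max(0.0, sum(a * w for a, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(h * w for h, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

def ffn_sublayer(x, W1, b1, W2, b2):
    """Residual connection around the feed-forward network, then layer normalization."""
    y = feed_forward(x, W1, b1, W2, b2)
    return layer_norm([a + b for a, b in zip(x, y)])

# Hypothetical toy weights: input width 2, hidden width 3.
W1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.0]
row = [1.0, -2.0]
raw = feed_forward(row, W1, b1, W2, b2)
normalized = ffn_sublayer(row, W1, b1, W2, b2)
```

The same function is applied to every row independently, so this sub-layer adds capacity without mixing information across features.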
Given definitions of the self-attention layer and feed-forward network layer, we denote the parameters that we need to learn as . Then, with user can be summarized as follows:
where summarizes the self-attention layer and feed-forward layer. denotes the parameters with respect to user within-encoder. Similarly, we can get the output representation for a specific item :
Taking () as inputs for the first user (item) within-encoder layer, repeated application of can generate ‘within’ context-aware representations for features of user (item ), denoted as (), which are outputs of the multilayer user (item) within-encoder. As we can see from the above formulations, () is conditioned on scenarios of user (item ). Hence, () is no longer shared across all users (items).
The architecture of the cross-encoder is the same as that of the within-encoder except for its input. The input to the cross-encoder is the concatenation of the outputs of the user within-encoder and the item within-encoder.
Let . ‘Cross’ context-aware representations for and with respect to the specific user and item are formulated as follows:
Taking as inputs for the first cross-encoder layer, repeated application of can generate ‘cross’ context-aware representations for features of user (item ), denoted as , which are outputs of the multilayer cross-encoder. Hence, and are conditioned on scenarios of both user and item .
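The effect of attending over the concatenated user and item rows can be sketched as follows. For brevity the sketch uses parameter-free scaled dot-product attention (learned projections, multiple heads and the feed-forward sub-layer are omitted), so it illustrates only the conditioning on both sides; the names and toy values are assumptions.

```python
import math

def attend(H):
    """Parameter-free scaled dot-product self-attention over the rows of H."""
    d = len(H[0])
    out = []
    for q in H:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in H]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * row[f] for e, row in zip(exps, H)) for f in range(d)])
    return out

def cross_encode(user_rows, item_rows):
    """Attend over user and item feature rows jointly, then split the result."""
    joint = attend(user_rows + item_rows)
    return joint[:len(user_rows)], joint[len(user_rows):]

# Hypothetical 'within' representations: two user features, one item feature.
user = [[1.0, 0.0], [0.0, 1.0]]
item_a = [[1.0, 1.0]]
item_b = [[-1.0, 2.0]]
user_given_a, item_a_out = cross_encode(user, item_a)
user_given_b, item_b_out = cross_encode(user, item_b)
```

The same user rows yield different outputs for `item_a` and `item_b`, which is what it means for the user feature representations to be conditioned on the item scenario.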
3.3 Prediction Layer
After getting the context-aware representations for features of user and item , we feed them to the prediction layer and make prediction for .
3.4 Objective Function
By combining the above parts, CSA is formulated as follows:
where . () denotes the set of parameters in user (item) within-encoder. denotes the set of parameters in cross-encoder. and denote outputs of multilayer cross-encoder for user and item . is -norm regularization on parameters in CSA.
We evaluate our proposed CSA and the baselines on three publicly available real-world datasets. CSA and all baselines are implemented in PyTorch Paszke et al. (2017) and run on a server with an NVIDIA TitanXP GPU.
ShortVideo-Track1 (https://www.biendata.com/competition/icmechallenge2019/) is a short video recommendation dataset collected from a real industrial application. Each short video is about 15 seconds long. The dataset mainly consists of categorical features (e.g., device, channel, city). There are two tasks for this dataset. The first task, named the finish task, is to predict whether a given user will watch a given short video to the end. The second task, named the like task, is to predict whether a given user will like a given short video.
ShortVideo-Track2 comes from the same source as ShortVideo-Track1. The difference is that the users of ShortVideo-Track1 are all from the same city, while the users of ShortVideo-Track2 are from many cities. That means the underlying patterns of these two datasets differ from each other. Moreover, ShortVideo-Track1 is larger than ShortVideo-Track2. The tasks for this dataset are the same as for ShortVideo-Track1.
LastFM (https://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/) is a music artist recommendation dataset Celma (2010), which is also collected from a real industrial application. The dataset mainly consists of categorical features (e.g., gender, age, location). Interactions between users and music artists are play counts, which are summarized from the play history lists of users. Following Pacula (2017), we transform play counts to frequencies and rescale them to the range [0, 4]. We further binarize the frequencies with a threshold of 2 to generate positive and negative examples. Positive examples indicate that users play the artists with high frequency, which implies that the users are more interested in these artists. Negative examples indicate that users play the artists with low frequency, which implies that the users are less interested in these artists. The task for this dataset is to predict whether a given user will be interested in a given artist.
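The labeling procedure can be sketched as follows. The text does not specify how frequencies are computed or rescaled, so this sketch assumes per-user relative frequencies followed by min-max rescaling to [0, 4], and treats a rescaled value of at least 2 as positive. These choices, the function name and the toy play counts are all assumptions.

```python
def label_interactions(play_counts, threshold=2.0, top=4.0):
    """Turn one user's play counts into binary labels.

    Assumed steps: relative frequency per artist, min-max rescale to [0, top],
    then label 1 if the rescaled value reaches the threshold, else 0.
    """
    total = sum(play_counts.values())
    freqs = {artist: count / total for artist, count in play_counts.items()}
    lo, hi = min(freqs.values()), max(freqs.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all counts are equal
    scaled = {artist: top * (f - lo) / span for artist, f in freqs.items()}
    return {artist: int(s >= threshold) for artist, s in scaled.items()}

# Hypothetical play counts for a single user.
labels = label_interactions(
    {"artist_a": 120, "artist_b": 15, "artist_c": 70, "artist_d": 5}
)
```

With these toy counts, the two most played artists become positive examples and the two least played become negative examples.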
The statistics of the above three datasets are summarized in Table 3.
4.2 Baselines and Settings
For all datasets, we split all data into train, validation and test set, where , . denotes the sampling rate. For ShortVideo-Track1 and ShortVideo-Track2, we split the interaction samples in time order, since the interaction samples are ordered in time. For LastFM, we randomly split the data, since there is no time information about the interaction samples.
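The two splitting strategies can be sketched with one helper. The split fractions below are hypothetical placeholders, since the actual proportions are not recoverable from the text here; samples are assumed to be already sorted by time when `shuffle` is false.

```python
import random

def split(samples, train_frac=0.8, valid_frac=0.1, shuffle=False, seed=0):
    """Split samples into train, validation and test sets.

    shuffle=False keeps the given (time) order, as for the ShortVideo datasets;
    shuffle=True splits randomly, as for LastFM. Fractions are assumptions.
    """
    samples = list(samples)
    if shuffle:
        random.Random(seed).shuffle(samples)
    n_train = int(len(samples) * train_frac)
    n_valid = int(len(samples) * valid_frac)
    train = samples[:n_train]
    valid = samples[n_train:n_train + n_valid]
    test = samples[n_train + n_valid:]
    return train, valid, test

interactions = list(range(10))  # stand-in for time-ordered interaction samples
train, valid, test = split(interactions)
r_train, r_valid, r_test = split(interactions, shuffle=True)
```

Splitting in time order means the model is always evaluated on interactions that occur after those it was trained on.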
is selected from . The maximum number of epochs is set to 20. We use the validation set to tune these hyper-parameters. The dropout rate is set to 0.2 and is applied to the inputs of all layers. The number of layers is set to 1 for the user within-encoder, the item within-encoder and the cross-encoder. is set to 64. The above settings are the same for all datasets. For ShortVideo-Track1 and ShortVideo-Track2, we apply attention head to within-encoder and cross-encoder. For LastFM, we apply attention head to within-encoder and cross-encoder.
We use the Adam optimizer Bengio and LeCun (2015) to optimize CSA. The initial learning rate is 0.01 for LastFM and 0.001 for ShortVideo-Track1 and ShortVideo-Track2. For LastFM, the learning rate is decayed by a factor of 0.2 every two epochs. For ShortVideo-Track1 and ShortVideo-Track2, the learning rate is decayed by a factor of 0.1 every two epochs.
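The step decay described above can be written as a one-line schedule. The sketch assumes that epochs are counted from zero and that the decay is applied after every two completed epochs; the function name is hypothetical.

```python
def learning_rate(epoch, initial, factor, every=2):
    """Step decay: multiply the initial rate by `factor` once per `every` epochs."""
    return initial * factor ** (epoch // every)

# Schedules for the first six epochs under the stated settings.
lastfm = [learning_rate(e, initial=0.01, factor=0.2) for e in range(6)]
shortvideo = [learning_rate(e, initial=0.001, factor=0.1) for e in range(6)]
```

Under these assumptions the LastFM rate is 0.01 for epochs 0 and 1, about 0.002 for epochs 2 and 3, and about 0.0004 for epochs 4 and 5.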
For IMC and NeuMF, we adopt the same settings as for CSA. Hence, all comparisons between CSA and the baselines are fair. In all experiments, we use AUC (area under the ROC curve) Bradley (1997) as the evaluation criterion. The larger the AUC is, the better the performance is. All methods are run with the best hyper-parameters, which are tuned on the validation set of each dataset. We repeat each experiment 5 times and report the mean of the results. Since the standard deviation is very small (approximately 0.0002), we omit it in the tables and figures.
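For reference, AUC can be computed directly from its pairwise interpretation: it is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative example, with ties counted as one half. The sketch below uses this definition; it is quadratic in the number of examples and meant only as an illustration, with a hypothetical function name and toy scores.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: one positive (score 0.4) is ranked below one negative (score 0.8).
example = auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
perfect = auc([1, 1, 0, 0], [0.9, 0.7, 0.3, 0.2])
```

In the toy example three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.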
To compare CSA against the baselines, we empirically set . The results on ShortVideo-Track1, ShortVideo-Track2 and LastFM are summarized in Tables 3, 3 and 4. From Tables 3, 3 and 4, we can see that CSA outperforms all baselines in most cases, which demonstrates the effectiveness of CSA. We observe that the margin between CSA and the baselines becomes smaller as the sampling rate increases. The main reason is that IMC models are able to capture the underlying pattern when the number of observed entries is large enough, according to the recovery theory in Xu et al. (2013); Jain and Dhillon (2013). Please note that cases with a small number of observed samples are more common in real recommender systems. Moreover, users/items with a small number of observed samples are often new to the system, and accurate recommendation for new users/items is necessary. Hence, good performance at small sampling rates is more meaningful.
4.4 Sensitivity to Hyper-parameters
In CSA, embedding dimension and are two important hyper-parameters. Here, we study the sensitivity of with on LastFM. And we study the sensitivity of with and on LastFM. Similar conclusions can be drawn on other settings with respect to and and other datasets, which are omitted here for space saving.
4.5 Visualization of Representations
We perform visualization on the LastFM dataset to verify that the feature representations learned by CSA are context-aware. First, we pick users whose country feature is ‘United States’ as the database. Second, we randomly pick a user from the database as the center user. We define the distance between another user and the center user as the number of features on which they differ. Third, we pick users from the database whose distance to the center user is 1, 3 or 5. For each distance, 100 users are selected. Finally, we visualize the ‘within’ context-aware representations for the country feature of these selected users and the center user.
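The selection procedure can be sketched as follows, with each user represented as a dictionary of categorical features. The function names, the feature names and the toy users are hypothetical, and sampling without replacement is an assumption.

```python
import random

def feature_distance(a, b):
    """Number of features on which two users differ."""
    return sum(1 for key in a if a[key] != b.get(key))

def select_neighbors(center, database, distances=(1, 3, 5), per_distance=100, seed=0):
    """For each target distance, sample up to `per_distance` users at that distance."""
    rng = random.Random(seed)
    picked = {}
    for d in distances:
        pool = [u for u in database if feature_distance(center, u) == d]
        picked[d] = rng.sample(pool, min(per_distance, len(pool)))
    return picked

# Hypothetical toy database in which every user shares the same country.
center = {"country": "United States", "gender": "m", "age": "25-34"}
database = [
    {"country": "United States", "gender": "f", "age": "25-34"},
    {"country": "United States", "gender": "f", "age": "18-24"},
    {"country": "United States", "gender": "m", "age": "18-24"},
]
picked = select_neighbors(center, database, distances=(1, 2), per_distance=5)
```

Because all selected users share the same country value, any spread in the visualized country representations must come from their other features.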
We use t-SNE van der Maaten and Hinton (2008) to visualize the representations. The visualization is presented in Figure 4. Label ‘0’ denotes the center user. Labels ‘1’, ‘3’ and ‘5’ denote the distance of the users to the center user with label ‘0’. From Figure 4, we can see that the representations for the same country feature differ across users. Furthermore, taking the center user as the reference, we can conclude that the more similar two users’ context scenarios are, the more similar their representations for the country feature are. This indicates that the representations learned by CSA are context-aware, which verifies the effectiveness of CSA.
In this paper, we propose a novel IMC method, called collaborative self-attention (CSA), for recommender systems. CSA is a novel self-attention based deep architecture that can collaboratively learn context-aware representations for features. To the best of our knowledge, CSA is the first context-aware IMC model that is free of both the independence between features and the sharing of the same feature representations across all users/items, two characteristics that limit the expressiveness of existing IMC models. Extensive experiments on three large-scale datasets from RS applications demonstrate the effectiveness of CSA.
- Ba et al. (2016) L. J. Ba, R. Kiros, and G. E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
- Bengio and LeCun (2015) Y. Bengio and Y. LeCun. Adam: a method for stochastic optimization. In International Conference on Learning Representations, 2015.
- Bradley (1997) A. P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997.
- Breese et al. (1998) J. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of Conference on Uncertainty in Artificial Intelligence, 1998.
- Burke (2002) R. D. Burke. Hybrid recommender systems: survey and experiments. User Modeling and User-Adapted Interaction, 12(4):331–370, 2002.
- Cabral et al. (2013) R. S. Cabral, F. Torre, J. P. Costeira, and A. Bernardino. Unifying nuclear norm and bilinear factorization approaches for low-rank matrix decomposition. In IEEE International Conference on Computer Vision, 2013.
- Candès and Recht (2009) E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
- Celma (2010) O. Celma. Music recommendation and discovery in the long tail. Springer, 2010.
- Chen et al. (2017) J. Chen, H. Zhang, X. He, L. Nie, W. Liu, and T. Chua. Attentive collaborative filtering: multimedia recommendation with item- and component-level attention. In ACM SIGIR Conference on Research and Development in Information Retrieval, 2017.
- Cheng et al. (2016) J. Cheng, L. Dong, and M. Lapata. Long short-term memory-networks for machine reading. In Empirical Methods in Natural Language Processing, 2016.
- Chiang et al. (2015) K. Chiang, C. Hsieh, and I. S. Dhillon. Matrix completion with noisy side information. In Neural Information Processing Systems, 2015.
- Dziugaite and Roy (2015) G. K. Dziugaite and D. M. Roy. Neural network matrix factorization. CoRR, abs/1511.06443, 2015.
- Fan et al. (2017) H. Fan, Z. Zhang, Y. Shao, and C. Hsieh. Improved bounded matrix completion for large-scale recommender systems. In International Joint Conference on Artificial Intelligence, 2017.
- Ge et al. (2017) R. Ge, C. Jin, and Y. Zheng. No spurious local minima in nonconvex low rank problems: a unified geometric analysis. In International Conference on Machine Learning, 2017.
- He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- He et al. (2017) X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua. Neural collaborative filtering. In International World Wide Web Conference, 2017.
- He et al. (2018) X. He, X. Du, X. Wang, F. Tian, J. Tang, and T. Chua. Outer product-based neural collaborative filtering. In International Joint Conference on Artificial Intelligence, 2018.
- Hsieh et al. (2017) C. Hsieh, L. Yang, Y. Cui, T. Lin, S. J. Belongie, and D. Estrin. Collaborative metric learning. In International Conference on World Wide Web, 2017.
- Jain and Dhillon (2013) P. Jain and I. S. Dhillon. Provable inductive matrix completion. CoRR, abs/1306.0626, 2013.
- Jain et al. (2013) P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. In Symposium on Theory of Computing Conference, 2013.
- Koren et al. (2009) Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30–37, 2009.
- Oord et al. (2013) A. Oord, S. Dieleman, and B. Schrauwen. Deep content-based music recommendation. In Neural Information Processing Systems, 2013.
- Pacula (2017) M. Pacula. A matrix factorization algorithm for music recommendation using implicit user feedback. 2017.
- Parikh et al. (2016) A. P. Parikh, O. Täckström, D. Das, and J. Uszkoreit. A decomposable attention model for natural language inference. In Empirical Methods in Natural Language Processing, 2016.
- Paszke et al. (2017) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
- Pazzani and Billsus (2007) M. Pazzani and D. Billsus. Content-based recommendation systems. In Proceedings of The Adaptive Web, Methods and Strategies of Web Personalization, 2007.
- Recht (2011) B. Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12:3413–3430, 2011.
- Sedhain et al. (2015) S. Sedhain, A. K. Menon, S. Sanner, and L. Xie. AutoRec: autoencoders meet collaborative filtering. In International Conference on World Wide Web Companion, 2015.
- Si et al. (2016) S. Si, K. Chiang, C. Hsieh, N. Rao, and I. S. Dhillon. Goal-directed inductive matrix completion. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
- Tan et al. (2018) Z. Tan, M. Wang, J. Xie, Y. Chen, and X. Shi. Deep semantic role labeling with self-attention. In AAAI Conference on Artificial Intelligence, 2018.
- Tay et al. (2018) Y. Tay, A. T. Luu, and S. C. Hui. Multi-pointer co-attention networks for recommendation. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018.
- van der Maaten and Hinton (2008) L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
- Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Neural Information Processing Systems, 2017.
- Wang et al. (2015) H. Wang, N. Wang, and D. Yeung. Collaborative deep learning for recommender systems. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.
- Xu et al. (2013) M. Xu, R. Jin, and Z. Zhou. Speedup matrix completion with side information: application to multi-label learning. In Neural Information Processing Systems, 2013.
- Yao and Li (2018) K. Yao and W. Li. Convolutional geometric matrix completion. CoRR, abs/1803.00754, 2018.
- Yao and Kwok (2018) Q. Yao and J. T. Kwok. Scalable robust matrix factorization with nonconvex loss. In Neural Information Processing Systems, 2018.
- Yu et al. (2018) A. W. Yu, D. Dohan, M. Luong, R. Zhao, K. Chen, M. Norouzi, and Q. V. Le. QANet: combining local convolution with global self-attention for reading comprehension. In International Conference on Learning Representations, 2018.
- Zhang et al. (2018) X. Zhang, S. S. Du, and Q. Gu. Fast and sample efficient inductive matrix completion via multi-phase procrustes flow. In International Conference on Machine Learning, 2018.
- Zhu et al. (2018) Z. Zhu, Q. Li, G. Tang, and M. B. Wakin. Global optimality in low-rank matrix optimization. IEEE Transactions on Signal Processing, 66(13):3614–3628, 2018.