TensorFlow implementation of paper "On Sampling Strategies for Neural Network-based Collaborative Filtering" by Chen, Ting, et al.
Learning a good representation of text is key to many recommendation applications. Examples include news recommendation, where the texts to be recommended are constantly published every day. However, most existing recommendation techniques, such as matrix factorization based methods, mainly rely on interaction histories to learn representations of items. While latent factors of items can be learned effectively from user interaction data, in many cases such data is not available, especially for newly emerged items. In this work, we aim to address the problem of personalized recommendation for completely new items with text information available. We cast the problem as a personalized text ranking problem and propose a general framework that combines text embedding with personalized recommendation. Users and textual content are embedded into a latent feature space. The text embedding function can be learned end-to-end by predicting user interactions with items. To alleviate sparsity in interaction data, and to leverage large amounts of text data with little or no user interactions, we further propose a joint text embedding model that incorporates unsupervised text embedding with a combination module. Experimental results show that our model can significantly improve the effectiveness of recommendation systems on real-world datasets.
Personalized recommendation has gained a lot of attention during the past few years (Koren et al., 2009; Salakhutdinov et al., 2007; Wang and Blei, 2011). Many models and algorithms have been proposed for personalized recommendation, among which collaborative filtering techniques such as matrix factorization (Salakhutdinov and Mnih, 2011; Koren, 2008) have proven most effective. For these approaches, historical behavior data is critical for learning the latent factors of both users and items. However, in many scenarios behavior data is unavailable or very sparse, which motivates us to incorporate content/text information for recommendation. In this work, we study the problem of content-based recommendation for completely new items/texts, where historical user behavior data is not available for the new items at the time of recommendation. When it comes to text article recommendation, however, it is not straightforward to incorporate text content into existing collaborative filtering models.
In order to understand the content of new items/texts for better recommendation, a good representation based on textual information is essential. This issue is challenging and has not yet been satisfactorily solved. On one hand, traditional content-based recommendation methods (Pazzani and Billsus, 2007) usually rely on simple text processing such as cosine similarity or logistic regression, where both texts and users are represented as bags of words. The limitations of such representations include the inability to encode similarity between words, as well as the loss of word-order information (Mikolov and Dean, 2013; Johnson and Zhang, 2014). On the other hand, although some collaborative filtering methods have been extended to incorporate auxiliary information, their text feature extraction functions are usually simple and cannot leverage recently proposed representation learning techniques for text (Singh and Gordon, 2008; Rendle, 2010; Chen et al., 2012).
We address these issues with an approach that marries text embedding to personalized recommendation. In our proposed model, users and texts are simultaneously embedded into a latent space where preferences can be captured by a simple dot product. While each user is directly associated with an embedding vector, text embedding requires an embedding function that maps a text sequence into a vector. Both the user embeddings and the text embedding function can be trained end-to-end on user-item interactions directly. With sophisticated neural networks (e.g., Convolutional Neural Networks) as the text embedding function, high-level textual features can be better captured.
While end-to-end training of the embedding function delivers focused supervision for learning task-related representations, interaction data is usually sparse, and large amounts of data/corpora remain unlabeled. Hence, we further propose a joint text embedding model that leverages unsupervised text embeddings pre-trained on large-scale unlabeled corpora. To effectively fuse both types of information, a novel combination module is constructed and incorporated into the unified framework. Experimental results on two real-world data sets demonstrate the effectiveness of the proposed joint text embedding framework.
We use $\mathcal{X} = \{x_1, \dots, x_n\}$ to denote the set of texts; the $i$-th text is represented by a sequence of words, i.e. $x_i = (w_1, w_2, \dots)$. A matrix $R$ is used to denote the historical interactions between users and texts, where $R_{ui} = 1$ indicates an interaction between user $u$ and text article $i$, such as click-or-not or like-or-not. (We consider $R$ as implicit feedback in this work: only positive interactions are provided, and non-interactions are implicitly treated as negative feedback.)
Given the text information and the historical interaction data, our goal is to learn a model that can rank completely new texts for an existing user based on that user's interests and the text content.
Existing personalized recommendation algorithms can be roughly divided into three categories: (1) collaborative filtering methods, (2) content-based methods, and (3) hybrid methods.
Matrix factorization (MF) techniques (Koren, 2008; Salakhutdinov and Mnih, 2011) are among the most effective collaborative filtering (CF) methods. In MF, each user $u$ and item $i$ is associated with a latent factor vector $p_u$ or $q_i$, and the score between a user-item pair is computed by their dot product, i.e. $\hat{r}_{ui} = p_u^\top q_i$. Since each item is associated with its own latent factors $q_i$, a completely new item cannot be handled properly, as training $q_i$ depends on the item's interactions with users.
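A minimal sketch of MF scoring may make the cold-start limitation concrete. The toy sizes and random factors below are illustrative only, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 6, 8        # toy sizes; k is the latent dimension
P = rng.normal(size=(n_users, k))    # user latent factors p_u
Q = rng.normal(size=(n_items, k))    # item latent factors q_i

# The score for user u and item i is the dot product p_u . q_i;
# computing all pairs at once gives the full score matrix.
scores = P @ Q.T                     # shape (n_users, n_items)

# A completely new item has no row of Q trained from interactions,
# which is exactly the cold-start limitation described above.
```

Note that every row of Q is learned only from that item's observed interactions; an unseen item leaves its row at its (meaningless) initialization.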
Content-based methods (Pazzani and Billsus, 2007) usually build models of users and content based on term weighting schemes like TF-IDF, and cosine similarity or logistic regression can then be used to match a pair of user and item. With such representations it is difficult to encode similarities between words, or to capture word order.
Hybrid methods can alleviate the so-called “cold-start” issue by incorporating side information (Rendle, 2010; Chen et al., 2012; Singh and Gordon, 2008) or item content information (Wang and Blei, 2011; Gopalan et al., 2014; Wang et al., 2015). However, most of these methods cannot deal with completely new items.
There is some work aiming at leveraging neural networks for better text recommendation, such as Collaborative Deep Learning (Wang et al., 2015) and others (Bansal et al., 2016). Compared to their work, 1) we treat the problem as a ranking problem instead of a rating prediction problem, and thus adopt pairwise loss functions; 2) our model provides a more general framework enabling various text embedding functions, and thus subsumes (Bansal et al., 2016) as a special case; and 3) our model incorporates unsupervised text embeddings learned from large-scale unlabeled corpora.
Recent advances in deep learning have demonstrated the importance of learning good representations for text and other types of data (Mikolov and Dean, 2013; Mikolov et al., 2013; Le and Mikolov, 2014; Kim, 2014; Chen et al., 2016; Chen and Sun, 2017). Text embedding techniques aim at mapping text into vector representations that can be utilized for downstream predictive tasks, and such models have been proposed for text classification/categorization problems (Kim, 2014; Johnson and Zhang, 2014; Le and Mikolov, 2014). Our task resembles a personalized text classification/ranking problem, in the sense that we try to classify/rank an article according to its interestingness w.r.t. a given user. Also, we utilize user behavior, instead of text labels, as the supervision signal.
In this section, we first introduce the supervised text embedding framework, which is trained in an end-to-end fashion for predicting user-item interactions. Then we propose a joint text embedding model by incorporating unsupervised text embedding with a combination function.
To simultaneously capture interests of users and semantics of texts, we embed both user and text into a common latent feature space, where dot product can be used to quantify their proximity.
Each user $u$ is directly associated with an embedding vector $\mathbf{u}$, which represents the user's interests. The text sequence $x_i$ of the $i$-th item is mapped into a fixed-size vector by an embedding function $f$. The proximity score between the user-item pair is computed by the dot product of their embeddings, as follows: $s(u, i) = \mathbf{u}^\top f(x_i)$.
Text embedding function $f$. In our framework, the text embedding function is very flexible. It can be specified by any differentiable function that maps a text sequence into a fixed-size embedding vector, and many neural network structures can be applied, such as Convolutional Neural Networks and Recurrent Neural Networks. Here we introduce two such functions, MoV and CNN; other extensions are straightforward.
Mean of Vectors (MoV). To represent a text sequence of length $t$, we first embed each word in the text with a word embedding vector (Mikolov and Dean, 2013; Mikolov et al., 2013), and then use the average of the word embeddings to form the text/article embedding, as follows: $\bar{\mathbf{x}} = \frac{1}{t} \sum_{j=1}^{t} \mathbf{w}_j$, where $\mathbf{w}_j$ is the embedding vector of the $j$-th word.
To better extract non-linear interactions among words, a densely-connected layer with a non-linear activation can be applied. A single layer of such transformation is given by $\sigma(W \bar{\mathbf{x}} + b)$, where $\bar{\mathbf{x}}$ is the averaged word vector, $W$ and $b$ are the layer's weights and bias, and $\sigma$ is a non-linear activation.
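The MoV embedding can be sketched in a few lines of NumPy. The vocabulary size, dimensions, random weights, and tanh activation below are illustrative assumptions, not the paper's trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d, k = 100, 16, 50             # hypothetical vocabulary and dims
W_emb = rng.normal(size=(vocab_size, d))   # word embedding table
W_dense = rng.normal(size=(d, k))          # dense layer weights
b_dense = np.zeros(k)

def mov_embed(word_ids):
    """Mean-of-Vectors: average the word embeddings of the text,
    then apply one densely-connected layer with a tanh activation."""
    avg = W_emb[word_ids].mean(axis=0)     # bag-of-words average
    return np.tanh(avg @ W_dense + b_dense)

text_vec = mov_embed([3, 17, 42, 7])       # a toy 4-word "text"
```

In practice the table `W_emb` and the dense layer would be trained end-to-end through the ranking objective rather than drawn at random.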
Convolutional Neural Networks (CNN). Although the MoV model is simple and relatively efficient, the text sequence is treated as a bag of words, so orderings among words are ignored. As demonstrated in (Johnson and Zhang, 2014), word-order information can be helpful. To address this issue, Convolutional Neural Networks are adopted for text embedding.
In CNN, instead of averaging over all word embeddings, several filters of given size(s) are maintained, and each filter slides over the whole text sequence. At each position, an activation is computed by a dot product between the filter and the local embedding vectors. More specifically, we use $X = [\mathbf{w}_1; \mathbf{w}_2; \dots; \mathbf{w}_t]$ to denote the concatenation of word vectors for the text. Applying convolution on the text sequence $X$, the $j$-th entry produced by the $r$-th filter is computed according to $c^r_j = \sigma(F^r \cdot X_{j:j+h-1} + b^r)$, where $F^r$ is the $r$-th filter of size $h$ and $b^r$ is the bias term. The output of the convolution layer can be down-sized by a pooling operator, such as taking the max over all temporal dimensions of $c$, so that a fixed-size vector is produced. Due to the page limit, we refer the reader to (Kim, 2014) for a more detailed description.
Objective Function and Training. To learn the user embeddings and the text embedding function, the output scores for each pair of user and item are used to predict their interactions. For a given user, we want to rank his/her interested articles higher than those he/she is not interested in. So for each user $u$, a positive item $i$ and a negative item $j$ are both sampled, and, similar to (Rendle et al., 2009), the score difference between the positive and negative items is maximized, leading to a pairwise ranking loss function as follows: $\mathcal{L} = \mathbb{E}_{(u,i,j) \sim p_D}\left[-\log \sigma\big(s(u,i) - s(u,j)\big)\right]$, where $i$ is a positive item for user $u$, $j$ is a negative item, and each triplet $(u, i, j)$ is drawn from a predefined data distribution $p_D$. For training, positive interactions are sampled first, and then negative items are sampled according to some predefined distribution (e.g. item frequency).
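Since the text cites (Rendle et al., 2009), one natural instantiation of this pairwise loss is the BPR-style log-sigmoid of the score difference; the sketch below assumes that form and uses toy hand-picked vectors:

```python
import numpy as np

def score(u_vec, item_vec):
    """Proximity score: dot product of user and item embeddings."""
    return u_vec @ item_vec

def pairwise_loss(u_vec, pos_vec, neg_vec):
    """BPR-style loss for one (user, positive, negative) triplet:
    -log sigmoid(score(u, pos) - score(u, neg)). Minimizing it pushes
    the positive item's score above the negative item's score."""
    diff = score(u_vec, pos_vec) - score(u_vec, neg_vec)
    return np.log1p(np.exp(-diff))   # numerically stable -log(sigmoid(diff))

u = np.array([1.0, 0.5])
pos = np.array([2.0, 0.0])           # embedding of an item the user clicked
neg = np.array([-1.0, 0.0])          # embedding of a sampled negative item
```

Swapping the positive and negative items increases the loss, which is exactly the ranking behavior the objective encodes.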
The framework is demonstrated in Figure 1. We name the proposed framework TER, short for Text Embedding for content-based Recommendation.
There are two challenges faced by the supervised text embedding framework proposed above: 1) user-item interaction data may be sparse, and 2) many texts have little or no user interactions. These issues can lead to over-fitting. To alleviate sparsity in interaction data and to leverage the large amount of text data with little or no user interactions, we propose to incorporate unsupervised text embedding with a new combination function. The overall framework with joint text embedding is summarized in Figure 1(a).
Different from the supervised model, a pre-trained text embedding module is added, so each text is first mapped into two embedding vectors: one from the text embedding function $f$ and one from a pre-trained embedding matrix $V$. Then, to generate a cohesive text embedding vector for the item, we propose a combination function $g$ to explicitly combine the two. Below we introduce these two additional components in detail.
Unsupervised Text Embedding Matrix $V$. Unlike supervised text embedding, which requires user interactions to train the mapping function $f$, unsupervised text embedding can be pre-trained with only the text articles themselves, requiring no additional labels. To leverage a large-scale text corpus, we adopt Paragraph Vector (Le and Mikolov, 2014) in our framework.
Given a set of text articles, Paragraph Vector associates each word with a word embedding vector and each document with a document embedding vector. To learn both types of embedding vectors simultaneously, a prediction task is formed: for each word occurrence, we first hide the word, and the model is then asked to predict the exact word given the neighboring word embeddings and the document embedding. The probability of the word is given by a softmax over the vocabulary, computed from the context word vectors and the document vector.
As introduced in (Le and Mikolov, 2014), the model is trained by maximum likelihood with negative sampling.
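A toy sketch of the Paragraph Vector prediction step, written with the full softmax for clarity (negative sampling, used in practice, approximates this sum); the way the context and document vectors are averaged here is one of several possible combination choices:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 50, 8
W_out = rng.normal(size=(vocab_size, d))   # output word vectors

def word_prob(context_vecs, doc_vec, target_id):
    """P(target word | context words, document): average the context
    word vectors together with the document vector, then take a
    softmax over the vocabulary."""
    h = np.mean(np.vstack(context_vecs + [doc_vec]), axis=0)
    logits = W_out @ h
    logits -= logits.max()                 # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[target_id]

ctx = [rng.normal(size=d) for _ in range(4)]   # 4 neighboring word vectors
p = word_prob(ctx, rng.normal(size=d), target_id=7)
```

Training maximizes this probability for observed words, updating word and document vectors jointly.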
After training Paragraph Vector on the whole corpus, which includes text articles that have no associated user interactions, we obtain a pre-trained text embedding module with embedding matrix $V$, where each row $v_i$ is the unsupervised text embedding vector for the $i$-th text article.
Combination Function $g$. To combine the two text embedding vectors, i.e. $f(x_i)$ from the text embedding function and $v_i$ from the pre-trained embedding matrix $V$, we introduce a combination function $g$ whose output dimension is user-defined. Since the relation between the two text embedding vectors can be complicated and non-linear, in order to combine them effectively we specify the combination function with a small neural network: first a concatenation of the two vectors is formed, i.e. $z = [f(x_i); v_i]$, and then it is further transformed by a densely-connected layer with a non-linear activation, i.e. $g = \sigma(W_c z + b_c)$.
Although unsupervised text embeddings can provide useful text features (Le and Mikolov, 2014), they might not be directly relevant to the task. To control the degree of trust in the unsupervised text embeddings, we apply dropout (Srivastava et al., 2014) to the unsupervised text vectors, which randomly selects entries and sets them to zero. On one hand, when the dropout rate is zero, the whole embedding vector is utilized; on the other hand, when the dropout rate is one, the whole text vector is set to zero, which is equivalent to using none of the pre-trained embeddings. A dropout rate between zero and one can thus be seen as a trade-off for the unsupervised module. Figure 1(b) illustrates the combination module.
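The combination module can be sketched as follows; the dimensions, dropout rate, random weights, and tanh activation are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d_out, rate = 50, 50, 0.3           # hypothetical dims and dropout rate
W_c = rng.normal(size=(2 * k, d_out))  # dense layer over the concatenation
b_c = np.zeros(d_out)

def combine(f_vec, v_vec, train=True):
    """Combine the supervised embedding f_vec with the pre-trained
    embedding v_vec. Dropout on v_vec alone controls how much the
    model trusts the unsupervised part; the two vectors are then
    concatenated and passed through one non-linear dense layer."""
    if train:
        v_vec = v_vec * (rng.random(k) >= rate)   # randomly zero entries
    z = np.concatenate([f_vec, v_vec])
    return np.tanh(z @ W_c + b_c)

out = combine(rng.normal(size=k), rng.normal(size=k))
```

Setting `rate` to 0 or 1 recovers the two extremes described above: full trust in the pre-trained vector, or ignoring it entirely.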
Training of the Joint Model. The training procedure is separated into two stages. In the first stage, the unsupervised text embedding matrix $V$ is trained using unlabeled texts. In the second stage, similar to the supervised framework, the training objective is the pairwise ranking objective in Eq. 1; the parameters learned in this stage include both the user embeddings and the parameters of $f$ and $g$. We name the extended model TER+.
In this section, we present our empirical studies on two real-world text recommendation data sets.
Two real-world data sets are used. The first, CiteULike, containing user-bookmarking-article behavior data from CiteULike.org, was provided in (Wang and Blei, 2011). It contains 5,551 users, 16,980 items, and 204,986 interactions. The second data set is Yahoo! News Feed (https://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=75). We randomly sampled 10,000 users (with at least 10 click behaviors) and their clicked news to form the data set, which contains 58,579 items and 515,503 interactions. Since the CiteULike and News data sets have both a title and an abstract/summary, for each we create the following two variants: one containing only title information (i.e. short text), and the other containing both title and summary/abstract (i.e. long text). The average lengths of the short texts in CiteULike and News are 9 and 11 words respectively, and those of the long texts are 194 and 89 respectively.
To ensure that items at test time are completely new, we first select a portion (20%) of items to form the pool of test items. All user interactions with those test items are held out during training, and only the remaining user-item interactions are used as training data. For unsupervised text embedding pre-training, we also include many texts that have no user interaction data: for the CiteULike data set, an additional 339,150 papers from DBLP (a superset of CiteULike) are included; for the News data set, an additional 3,935,228 news articles are included.
| Dataset | # of users | # of items | # of interactions |
|---|---|---|---|
| CiteULike | 5,551 | 16,980 | 204,986 |
| News | 10,000 | 58,579 | 515,503 |
We compare the following methods in our experiments:
Cosine similarity matching (Pazzani and Billsus, 2007), which is based on TF-IDF similarities between a candidate item and the user's historical items.
Regularized multi-task logistic regression (Evgeniou and Pontil, 2004), which can be seen as a one-layer linear text model.
CDL (Collaborative Deep Learning) (Wang et al., 2015), which simultaneously trains an auto-encoder for encoding text content and matrix factorization for encoding user behavior.
Content Pre-trained, which first pre-trains text embeddings with Paragraph Vector and then uses them as fixed item features for matrix factorization.
TER. This is our proposed supervised framework. Note that two variants of text embedding function are compared: MoV and CNN.
TER+. This is the joint text embedding framework. Both text embedding functions, MoV and CNN, are compared.
Parameter Settings: For CDL, TER, and TER+, we set the dimensions of both the user embedding and the final text embedding vector to 50 for a fair comparison. For CNN, we use 50 filters with a filter size of 3. Regularization is added using both weight decay on the user embeddings and dropout on the item embeddings. We use Adam (Kingma and Ba, 2015) with a learning rate of 0.001. For both the baselines and our models, we tune the parameters with grid search.
Evaluation Metrics: We adopt MAP (Mean Average Precision) and average AUC for evaluation. For each interaction between a user and a test item in the test set, we sample 10 negative items from the test item pool to form the candidate set. AP and AUC are then computed based on the rankings given by the model, and the final MAP and average AUC are averaged over all users.
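For reference, the two metrics on a single candidate set can be computed as follows (a self-contained sketch; the toy example uses 3 sampled negatives rather than the 10 used in the experiments):

```python
import numpy as np

def average_precision(scores, labels):
    """AP over one ranked candidate set: the average of precision@rank
    taken at the rank of each positive item."""
    order = np.argsort(-scores)            # indices sorted by score, descending
    hits, ap = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            ap += hits / rank
    return ap / max(labels.sum(), 1)

def pairwise_auc(scores, labels):
    """AUC: the fraction of (positive, negative) pairs ranked correctly."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    return float((pos[:, None] > neg[None, :]).mean())

# One test interaction (label 1) plus 3 sampled negatives.
scores = np.array([0.9, 0.2, 0.8, 0.1])
labels = np.array([1, 0, 0, 0])
```

Here the positive item is ranked first, so both AP and AUC evaluate to 1.0 for this candidate set; MAP and average AUC then average these values over all users.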
Table 3: MAP / AUC of the compared methods on the four data sets.

| Method | CiteULike (title) | CiteULike (title&abs.) | News (title) | News (title&sum.) |
|---|---|---|---|---|
| Cosine | 0.5535 / 0.8194 | 0.7116 / 0.9162 | 0.3526 / 0.6950 | 0.4580 / 0.7721 |
| Multitask | 0.6129 / 0.8441 | 0.7355 / 0.9258 | 0.4051 / 0.7085 | 0.4560 / 0.7760 |
| Content Pre-trained | 0.6250 / 0.8961 | 0.7310 / 0.9372 | 0.4512 / 0.8145 | 0.4778 / 0.8352 |
| CDL | 0.6182 / 0.8839 | 0.7484 / 0.9410 | 0.3549 / 0.7648 | 0.4477 / 0.8060 |
| TER (MoV) | 0.6789 / 0.9201 | 0.7476 / 0.9432 | 0.4970 / 0.8294 | 0.5272 / 0.8515 |
| TER (CNN) | 0.6908 / 0.9264 | 0.7519 / 0.9458 | 0.5069 / 0.8470 | 0.5227 / 0.8580 |
| TER+ (MoV) | 0.7073 / 0.9309 | 0.7641 / 0.9485 | 0.5020 / 0.8462 | 0.5294 / 0.8628 |
| TER+ (CNN) | 0.6990 / 0.9274 | 0.7620 / 0.9478 | 0.5149 / 0.8541 | 0.5353 / 0.8626 |
Table 3 shows the MAP and AUC results of the different methods on the four data sets. As shown, our methods (both TER and TER+) consistently beat the other baselines and achieve state-of-the-art performance. Several other important observations can also be made: 1) representation learning or embedding methods (our methods, the pre-trained method, and CDL) achieve better results than traditional TF-IDF based methods; 2) joint supervised and unsupervised text embedding achieves better results than supervised or unsupervised text embedding alone; and 3) the advantage of our model is more significant on short texts than on longer ones. We also observe that MoV outperforms CNN in some cases (e.g. on the CiteULike data sets); we conjecture that this is because words in CiteULike may be more indicative of user interests, so simpler embedding functions can already capture the semantics well.
Figure 3 shows the performance at different dropout rates for the pre-trained text embedding vector in the combination function $g$. We observe that, as the dropout rate increases, most of the curves go up and then go down. The peak mostly occurs at intermediate dropout rates, while both extreme points (zero and one) give worse results. This further confirms the effectiveness of incorporating unsupervised text embedding, and also shows that a certain level of noise injected into the pre-trained text embedding can improve performance.
To further understand the proposed model, we conduct several case studies looking into the layout or nearest neighbors of words and articles in the embedding space.
To visualize the text embeddings learned by different models, we first choose top conferences in five domains (ML, DM, CV, HCI, BIO), and then randomly select articles published in those conferences. We apply t-SNE (Maaten and Hinton, 2008) to produce a 2-D map of these articles, colored according to their domain of publication. The results are shown in Figure 4, where we find that our combined model best distinguishes papers from the different domains.
Table 4 shows similar words for given query words, i.e. “neural” and “learning”, in the CiteULike data set. From the results we clearly see the distinction between the word meanings learned by the two methods. For example, the nearest neighbors of “neural” learned by unsupervised text embedding (over articles with and without user like behavior) are mostly related to artificial neural networks, while in supervised text embedding they are mostly related to neuroscience, which is closer to biology. This is because the CiteULike data set contains many biologists, so the word embeddings learned from supervised text embedding are likely to be dominated by the neuroscience perspective. However, by incorporating the unsupervised text embedding learned from a larger corpus, more meanings of the words can be recovered.
Table 5 shows the similar articles given a randomly selected queried article. We find that although unsupervised text embedding can provide some similar articles, the proposed framework (both TER and TER+) can better capture the similarity of articles.
Our work is related to both personalized recommendation and text embedding and understanding.
Collaborative filtering (Koren et al., 2009) has been one of the most effective approaches in recommender systems. Methods like matrix factorization (Koren, 2008; Salakhutdinov and Mnih, 2011) are widely adopted, and recently some methods based on neural networks have also been explored (Wang et al., 2015; Sedhain et al., 2015; Zheng et al., 2016). Content-based methods have been proposed (Pazzani and Billsus, 2007; Chen and Sun, 2017), but have not been well developed to exploit the deep semantics of content information. Hybrid methods can alleviate the so-called “cold-start” issue by incorporating side information (Rendle, 2010; Chen et al., 2012; Singh and Gordon, 2008) or item content information (Wang and Blei, 2011; Gopalan et al., 2014; Wang et al., 2015). In our case, although we have historical data about users' interactions with items, at the time of recommendation we consider items that have never been seen before, which cannot be handled directly by most existing matrix factorization based methods. Our model is similar to CDL (Wang et al., 2015), but with the following differences: (1) we treat the problem as ranking instead of rating prediction; (2) we provide a general framework which allows a flexible choice of the text embedding function $f$; and (3) our model can explicitly incorporate unsupervised text embedding.
To understand text data, both supervised and unsupervised methods have been proposed. Supervised methods are usually guided by text labels, such as sentiment or category labels. Different from traditional text classification, which trains SVM or logistic regression classifiers on n-gram features (Joachims, 1998; Pang et al., 2002), recent work takes advantage of the distributed representations produced by embedding methods, including CNNs (Collobert et al., 2011; Kim, 2014; Zhang et al., 2015), RNNs (Tang et al., 2015), and others (Joulin et al., 2016). Those methods cannot be directly applied to recommendation, as they provide only a global classification/ranking model. Also, instead of using labels as in existing supervised text embedding methods, we utilize user-item interactions as supervision to learn the text embedding function. There are also unsupervised text embedding techniques (Mikolov et al., 2013; Mikolov and Dean, 2013; Le and Mikolov, 2014), which do not require labels but cannot adapt to the task of interest.
We further generalize the proposed model and develop efficient training techniques in (Chen et al., 2017).
In this work, we tackle the problem of content-based recommendation for completely new texts. A novel joint text embedding based framework is proposed, in which the user embeddings and the text embedding function are learned end-to-end from interactions between users and items. The text embedding function is flexible and can be specified by deep neural networks. Both supervised and unsupervised text embeddings are fused together by a combination module as part of a unified model. Empirical evaluations on real-world data sets demonstrate that our model achieves state-of-the-art results for recommending new texts. As for future work, it would be interesting to explore other ways of incorporating unsupervised text embeddings.
The authors would like to thank Qian Zhao, Yue Ning, and Qingyun Wu for helpful discussions. Yizhou Sun is partially supported by NSF CAREER #1741634.
Entity Embedding-based Anomaly Detection for Heterogeneous Categorical Events. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI'16).
Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning. Springer, 137–142.
AutoRec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web. ACM, 111–112.