Recommender systems are widely used across domains and e-commerce platforms, for example to help consumers buy products on Amazon, watch videos on YouTube, and read articles on Google News. Collaborative filtering (CF) is among the most effective approaches; it rests on the simple intuition that users who rated items similarly in the past are likely to rate items similarly in the future. Matrix factorization (MF) techniques, which learn latent factors for users and items, are its main cornerstone [35, 25]. Recently, neural networks such as the multilayer perceptron (MLP) have been used to learn the interaction function from data [10, 17]. Both MF and neural CF suffer from data sparsity and cold-start issues.
One solution is to integrate CF with content information, leading to hybrid methods. Items are usually associated with content such as unstructured text, for example news articles and product reviews. These additional sources of information can alleviate the sparsity issue and are essential for recommendation beyond user-item interaction data. In application domains such as recommending research papers and news articles, the unstructured text associated with an item is its text content [51, 1]. In other domains such as product recommendation, the unstructured text associated with an item is its user reviews, which justify the rating behavior of consumers [33, 61]. Topic modelling and neural networks have been proposed to exploit the item content and lead to performance improvements. Memory networks are widely used in question answering and reading comprehension to perform reasoning. The memories can naturally be used to model additional sources such as the item content, or to model the neighborhood of a user, i.e., the users who consumed the same items as this user.
Another solution is to transfer knowledge from relevant domains, which is the problem addressed by cross-domain recommendation techniques [27, 39, 3]. In real life, a user typically participates in several systems to acquire different information services. For example, a user installs applications from an app store and reads news on a website at the same time. This gives us an opportunity to improve the recommendation performance in the target service (or in all services) by learning across domains. Following the above example, we can represent the app installation feedback by a binary matrix whose entries indicate whether a user has installed an app. Similarly, we use another binary matrix to indicate whether a user has read a news article. Typically these two matrices are highly sparse, and it is beneficial to learn them simultaneously. This idea is sharpened into the collective matrix factorization (CMF) approach, which jointly factorizes the two matrices by sharing the user latent factors. It combines CF on a target domain with CF on an auxiliary domain, enabling knowledge transfer [38, 59]. In terms of neural networks, given two activation maps from two tasks, the cross-stitch convolutional network (CSN) and its sparse variant learn linear combinations of both input activations and feed these combinations as input to the filters of the successive layers, thereby enabling knowledge transfer between the two domains.
These two threads motivate us to exploit both content and cross-domain information for recommender systems (RSs) in this paper. To capture the text content and to transfer cross-domain knowledge, we propose a novel neural model, TMH, for cross-domain recommendation with unstructured text in an end-to-end manner. TMH attentively extracts useful content via a memory network (MNet) and selectively transfers knowledge across domains via a novel transfer network (TNet). A shared layer of feature interactions is stacked on top to couple the high-level representations learned by the individual networks. On real-world datasets, TMH achieves better performance than various baselines in terms of ranking metrics. We conduct thorough analyses to understand how the content and the transferred knowledge help TMH.
To the best of our knowledge, TMH is the first deep model that transfers cross-domain knowledge for recommendation with unstructured text in an end-to-end manner. Our contributions are summarized as follows:
The proposed TMH exploits the text content and transfers source domain knowledge using an attention mechanism that is trained in an end-to-end manner. It is the first deep model that transfers cross-domain knowledge for recommendation with unstructured text using attention-based neural networks.
We adapt memory networks to attentively exploit the text content by matching word semantics with user preferences. It is among the few recent works that adopt memory networks for hybrid recommendation.
The transfer component selectively transfers source items under the guidance of target user-item interactions, by means of attentive weights. It is a novel transfer network for cross-domain recommendation.
The proposed model alleviates the sparsity issue, including the cold-user and cold-item start problems, and outperforms various baselines in terms of ranking metrics on two real-world datasets.
The paper is organized as follows. We first introduce the problem formulation in Section 3. We then present the memory component, which exploits the text content, in Section 5.1, and the transfer component, which transfers cross-domain knowledge, in Section 5.2. We propose the neural model TMH for cross-domain recommendation with unstructured text in Section 5, followed by its model learning (Section 5.4) and complexity analysis (Section 5.5). In Section 6, we experimentally demonstrate the superior performance of the proposed model over various baselines (Section 6.2) and show the benefit of both the transferred knowledge and the text content (Section 6.3). We also show that the model reduces the number of cold users and cold items that are difficult to predict accurately, and hence alleviates the cold-start issues (Section 6.4). We review related work in Section 2 and conclude the paper in Section 7.
2. Related Works
Collaborative filtering. Recommender systems aim at learning user preferences on unknown items from their past history. Content-based recommendation is based on matching user profiles with item descriptions. It is difficult to build a profile for each user when there is little or no content. Collaborative filtering (CF) alleviates this issue by predicting user preferences from the user-item interaction behavior, agnostic to the content. Latent factor models learn feature vectors for users and items mainly based on matrix factorization (MF), which has probabilistic interpretations. Factorization machines (FM) can mimic MF. To address data sparsity, the CoFactor model constructs an item-item matrix (SPPMI) from the user-item interaction matrix. It then simultaneously factorizes the interaction matrix and the SPPMI matrix in a shared item latent space, which allows co-click information to regularize the learning of the user-item matrix. In contrast, our method uses independent unstructured text and source domain information to alleviate the data sparsity issue in the user-item matrix.
Neural networks have been proposed to push the learning of feature vectors towards non-linear representations, including neural network matrix factorization (NNMF) and the multilayer perceptron (MLP) [10, 17]. The basic MLP architecture has been extended to regularize the factors of users and items with social and geographical information. Other neural approaches learn from explicit feedback for the rating prediction task [61, 5]. We focus on learning from implicit feedback for top-N recommendation. CF models, however, suffer from the data sparsity issue.
Hybrid filtering. Items are usually associated with content information such as unstructured text (e.g., abstracts of articles and reviews of products). CF approaches can be extended to exploit the content information [51, 52, 1] and user reviews [33, 20, 16]. Combining matrix factorization with topic modelling (Topic MF) is an effective way to integrate ratings with item content [33, 29, 2]. Item reviews justify the rating behavior of a user, and item ratings are associated with the item attributes hidden in reviews. Topic MF methods combine the item latent factors in ratings with the latent topics in reviews [33, 2]. The behavior factors and topic factors are aligned by a link function, such as the softmax transformation in the hidden factors and hidden topics (HFT) model or an offset deviation in the collaborative topic regression (CTR) model. The CTR model assumes that the item latent vector learnt from the interaction data is close to the corresponding topic proportions learnt from the text content, but allows them to diverge from each other if necessary. Additional sources of information, including knowledge graphs [60, 53], have been integrated into CF to alleviate the data sparsity issue. Convolutional neural networks (CNNs) have been used to extract features from audio signals for music recommendation and from images for product and multimedia recommendation. Autoencoders have been used to learn intermediate representations from text [52, 58]. Recurrent networks and convolutional networks [23, 61, 5] can exploit the word order when learning text representations.
Memory networks can reason with an external memory. Because they naturally learn word embeddings, which addresses the problems of word sparseness and the semantic gap, a memory module can be used to model the item content or the neighborhood of users. Memory networks can learn to match word semantics with a specific user. We follow this research thread by using neural networks to attentively extract important information from the text content.
Cross-domain recommendation. Cross-domain recommendation is an effective technique for alleviating the sparsity issue. One class of methods applies MF to each domain. Typical methods include the collective matrix factorization (CMF) approach, which jointly factorizes two rating matrices by sharing the user latent factors and hence enables knowledge transfer. CMF has heterogeneous variants, and there is also codebook transfer. Coordinate system transfer can exploit heterogeneous feedback [40, 57]. Multiple source domains and multi-view learning have also been proposed for integrating information from several domains. Transfer learning (TL) aims at improving the performance on the target domain by exploiting knowledge from source domains. Similar to TL, multitask learning (MTL) leverages useful knowledge in multiple related tasks so that they help each other [4, 59]. The cross-stitch network and its sparse variant enable deep information sharing between two base networks, one for each domain. Robust learning has also been considered during knowledge transfer. These methods treat knowledge transfer as a global process with shared global parameters and do not match source items with the specific target item given a user. We follow this research thread by using neural networks to selectively transfer knowledge from the source items, and we introduce a transfer component to exploit the source domain knowledge.
3. Problem Formulation
For collaborative filtering with implicit feedback, a binary matrix describes the user-item interactions: an entry is 1 (an observed entry) if the user has interacted with the item and 0 (unobserved) otherwise.
We consider a set of users and a set of items. The interaction matrix is usually very sparse, since a user has consumed only a very small subset of all items. For the task of item recommendation, each user is only interested in the top-K items. The items are ranked by their predicted scores, which are computed by an interaction function with learnable model parameters.
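As a minimal illustration of the ranking step, the following Python sketch ranks the items a user has not yet interacted with by a score function and keeps the top K. The function `toy_score` is a hypothetical stand-in for a learned interaction function, and all names are illustrative assumptions rather than part of the model.

```python
from typing import Callable, List, Set


def top_k_items(user: int,
                n_items: int,
                score: Callable[[int, int], float],
                seen: Set[int],
                k: int) -> List[int]:
    """Rank the unseen items of one user by predicted score, highest first."""
    candidates = [i for i in range(n_items) if i not in seen]
    candidates.sort(key=lambda i: score(user, i), reverse=True)
    return candidates[:k]


# Toy binary interaction matrix: rows are users, columns are items.
R = [[1, 0, 0, 1],
     [0, 1, 0, 0]]


def toy_score(u: int, i: int) -> float:
    """Hypothetical scoring rule used only to make the example runnable."""
    return 1.0 / (1 + abs(u - i))


# Items with an observed entry (value 1) are excluded from the ranking.
seen_by_user0 = {i for i, r in enumerate(R[0]) if r == 1}
ranking = top_k_items(0, 4, toy_score, seen_by_user0, k=2)
```

Excluding observed items before ranking is a common evaluation convention that we assume here.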
For MF-based CF approaches, the interaction function is fixed and computed by a dot product between the user and item vectors. For neural CF, neural networks parameterize the interaction function and learn it from the interaction data (see Section 4).
The network input is the concatenation of the user and item embeddings, which are projections of their one-hot encodings by the user and item embedding matrices, respectively. The output and hidden layers of the network are computed by their respective layer functions.
Similarly, for cross-domain recommendation, we have a user-item interaction matrix for the target domain (e.g., the news domain) and another for the source domain (e.g., the app domain). The users are shared between the two domains, and hence we can transfer knowledge across domains. We use separate indices for users, target items, and source items. For each user, we consider the set of source items that the user has interacted with in the source domain. Neural CF can be extended to leverage the source domain, in which case the interaction function additionally takes these source items as input (see Section 5.2).
Usually, for hybrid filtering, the (target) domain also has content information (e.g., product reviews). Each user-item pair is associated with a content text, which is a sequence of words drawn from a vocabulary. Neural CF can be extended to leverage the text, in which case the interaction function additionally takes the content text as input (see Section 5.1).
For the task of item recommendation, the goal is to generate a ranked list of items for each user based on her history records, i.e., top-N recommendation. We aim to improve the recommendation performance in the target domain with the help of both the content and the source domain information.
The intuition behind the synthetic model is that the conditional probability of whether a user will like an item is determined by three factors: 1) the user’s individual preferences, 2) the corresponding content text, and 3) the user’s behavior in a related source domain. The likelihood function of the entire matrix is then defined over all of its entries, observed and unobserved.
We propose a novel neural model to learn this conditional probability in an end-to-end manner, through a network function with learnable model parameters (see Section 5).
The model consists of a memory network to model unstructured text (Sec. 5.1) and a transfer network to transfer knowledge from the source domain (Sec. 5.2). A shared feature interaction layer, which also takes the non-linear representation of the user-item interaction as input, is stacked on top of the high-level representations learned by the individual networks (Sec. 5).
4. A Basic Neural CF Network
We adopt a feedforward neural network (FFNN) as the base neural CF model to parameterize the interaction function (see Eq. (2)). The basic network is similar to the Deep model in [8, 7] and to the MLP model. The base network consists of four modules, with information flowing from the input to the output as follows.
Input: This module encodes the user-item interaction indices. We adopt the one-hot encoding. It takes a user and an item and maps them into one-hot encodings, where only the element corresponding to the index is 1 and all others are 0.
Embedding: This module first embeds the one-hot encodings into continuous representations by the user and item embedding matrices, respectively, and then concatenates them to form the input of the following building blocks.
Hidden layers: This module takes the continuous representations from the embedding module and transforms them through several layers into a final latent representation. It consists of hidden layers that learn the nonlinear interaction between users and items.
Output: This module predicts the score for the given user-item pair based on the representation from the last layer of the hidden layers module. Since we focus on one-class collaborative filtering, the output is the probability that the input pair is a positive interaction. This can be achieved by a softmax layer with its own parameter.
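The four modules above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the exact network used in the paper: the class name `BaseNeuralCF`, the ReLU activations, the random initialization, and the layer sizes are all illustrative choices. For a single positive-class probability we use a logistic sigmoid, which coincides with a two-class softmax.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(W: Matrix, x: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


class BaseNeuralCF:
    """Input -> embedding -> hidden layers -> output (assumed layout)."""

    def __init__(self, n_users: int, n_items: int, dim: int,
                 hidden_sizes: List[int], seed: int = 0) -> None:
        rng = random.Random(seed)
        self.P = rand_matrix(n_users, dim, rng)   # user embedding matrix
        self.Q = rand_matrix(n_items, dim, rng)   # item embedding matrix
        sizes = [2 * dim] + list(hidden_sizes)
        self.layers = [(rand_matrix(o, i, rng), [0.0] * o)
                       for i, o in zip(sizes[:-1], sizes[1:])]
        self.h = [rng.uniform(-0.1, 0.1) for _ in range(sizes[-1])]

    def predict(self, u: int, i: int) -> float:
        # Selecting a row equals multiplying a one-hot vector by the matrix.
        x = self.P[u] + self.Q[i]                 # list concatenation
        for W, b in self.layers:
            pre = [a + c for a, c in zip(matvec(W, x), b)]
            x = [max(0.0, v) for v in pre]        # ReLU (assumption)
        return sigmoid(sum(hk * zk for hk, zk in zip(self.h, x)))


model = BaseNeuralCF(n_users=3, n_items=5, dim=4, hidden_sizes=[8, 4])
prob = model.predict(1, 2)
```

The hidden sizes halve from layer to layer here only to mimic a tower pattern; any other sizes would work with the same code.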
|User embedding matrix|
|Target item embedding matrix|
|Source item embedding matrix|
|Internal memory matrix|
|External memory matrix|
|Linear mapping weight and bias for the user-item interaction|
|Linear mapping for outputs of individual networks|
|Weight of the shared layer|
5. The Proposed TMH Model
We describe the proposed Transfer Meet Hybrid (TMH) model in this section. TMH models user preferences in the target domain by exploiting the text content and transferring knowledge from a source (auxiliary) domain. TMH learns high-level representations of the unstructured text and of the source domain items such that the learned representations can estimate the conditional probability of whether a user will like an item. This is done with a memory network (Sec. 5.1) and a transfer network (Sec. 5.2), coupled by the shared embeddings at the bottom and an interaction layer at the top (Sec. 5.3). The entire network can be trained efficiently by back-propagation to minimize a binary cross-entropy loss (Sec. 5.4). We begin by describing the recommendation problem and the model formulation before introducing the network architecture.
5.1. Matching Word Semantics with User Preferences
We adapt a memory network (MNet) to integrate unstructured text, since it can learn to match word semantics with user preferences. Memory networks have been used in recommendation to model item content, to model the neighborhood of users, and to learn latent relationships. The local and centralized memories recommender (LCMR) uses a local memory module (LMM) to exploit the text content by means of MNet. We use MNet to attentively extract important information from the text content via the attention mechanism, which matches word semantics with the specific user and determines which words are highly relevant to the user preferences.
MNet is a variant of memory augmented neural network which can learn high-level representations of unstructured text with respect to the given user-item interaction. The attention mechanism inherent in the memory component can determine which words are highly relevant to the user preferences.
MNet consists of one internal memory matrix, whose number of rows is the vocabulary size and whose number of columns is the dimension of each memory slot, and one external memory matrix with the same dimensions as the internal one. The two memory matrices work as follows.
Given the document corresponding to a user-item interaction, we form the memory slots by mapping each word into an embedding vector with the internal memory matrix; the length of the longest document is the memory size. We then form a preference vector for the given document and user-item interaction, where each element encodes the relevance of the user to the corresponding word given the item. Each element is the sum of two dot products, one between the user embedding and the user part of the memory slot and one between the item embedding and the item part of the memory slot.
Here each memory slot is split into a user part and an item part, and the user and item embeddings are obtained from the user and item embedding matrices, respectively. The first term captures the match between the preferences of the user and the word semantics; for example, a user who is a machine learning researcher may be more interested in words such as “optimization” and “Bayesian” than in words such as “history” and “philosophy”. The second term computes the support of the item for the words; for example, a machine learning article may support words such as “optimization” and “Bayesian” more than “history” and “philosophy”. Together, the content-based (associative) addressing scheme determines the internal memories that are highly relevant to the target user, over the words and given the specific item.
In fact, the two terms can be compacted into a single dot product between the memory slot and the concatenation of the user and item embeddings.
The neural attention mechanism adaptively learns a weighting function over the words so as to focus on a subset of them. Traditional ways of combining words predefine a heuristic weighting function, such as averaging or weighting by tf-idf scores. Instead, we compute attentive weights over the words for a given user-item interaction to infer the importance of each word’s unique contribution. The weights are obtained by applying a softmax function to the scaled preference scores, which produces a probability distribution over the words in the document. The neural attention mechanism allows the memory component to focus on specific words while placing little importance on other words that may be less relevant. The scaling parameter is introduced to stabilize the numerical computation when the exponentials in the softmax function are very large; it can also amplify or attenuate the precision of the attention like a temperature, where a higher temperature (i.e., a smaller value of the parameter) produces a softer probability distribution over the words. We set this parameter by scaling it with the embedding dimensionality.
We construct the high-level representation, which is the output of MNet, by interpolating the external memories with the attentive weights. Here each external memory slot is another embedding vector of the word, obtained by mapping it with the external memory matrix. The external memories allow the storage of long-term knowledge pertaining specifically to each word’s role in matching the user preference. In other words, the content-based addressing scheme identifies the important words in a document, which act as keys to retrieve the relevant values stored in the external memory matrix via the neural attention mechanism. The attention mechanism adaptively weights the words according to the specific user and item. The final output is a high-level summary extracted attentively from the text content, capturing the relations between the user-item interaction and the corresponding words.
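The MNet computation described in this section (preference scores against the internal memory, a scaled softmax, and an attention-weighted sum of the external memory) can be sketched as follows. This is a minimal sketch with toy values. The function names are illustrative, and the scaling `beta = 1 / sqrt(d)` is our assumption, borrowed from scaled dot-product attention, because the exact setting is not recoverable from the text above.

```python
import math
from typing import List, Sequence, Tuple


def scaled_softmax(scores: Sequence[float], beta: float) -> List[float]:
    """Softmax of beta * scores; beta is assumed positive.

    Subtracting the maximum keeps the exponentials numerically stable.
    """
    m = max(scores)
    exps = [math.exp(beta * (s - m)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mnet_output(user_emb: Sequence[float],
                item_emb: Sequence[float],
                word_ids: Sequence[int],
                internal_mem: Sequence[Sequence[float]],
                external_mem: Sequence[Sequence[float]],
                beta: float) -> Tuple[List[float], List[float]]:
    # Query: concatenation of the user and item embeddings.
    query = list(user_emb) + list(item_emb)
    # One preference score per word: dot product with its internal slot.
    scores = [sum(q * m for q, m in zip(query, internal_mem[w]))
              for w in word_ids]
    weights = scaled_softmax(scores, beta)
    # Output: external memory slots interpolated with the attention weights.
    dim = len(external_mem[0])
    out = [sum(wt * external_mem[w][j] for wt, w in zip(weights, word_ids))
           for j in range(dim)]
    return weights, out


# Toy setup: vocabulary of 3 words, embedding dimension d = 2, slots of 2d.
user_emb, item_emb = [1.0, 0.0], [0.0, 1.0]
internal = [[1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], [-1.0, 0.0, 0.0, -1.0]]
external = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
weights, text_repr = mnet_output(user_emb, item_emb, [0, 1, 2],
                                 internal, external, beta=1 / math.sqrt(2))
```

In this toy example the first word matches both the user and the item, so it receives the largest attention weight and dominates the text representation.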
A more detailed discussion on Eq. (8) is in order. One alternative way to form a memory slot is to use a pre-trained embedding of the word. If the dimensions of the pre-trained embedding and the memory slot differ, a matrix can be used to connect them through a linear mapping. We, however, compute the attention over words using the embeddings of the corresponding user and item, rather than the pre-trained word embeddings themselves. The reason is as follows. Although sentiment words such as ‘good’ and ‘bad’ are somewhat important, their meaning also depends on who wrote the review. Some people are critical, and hence when they use the word ‘good’ in a review, the product is really good. Other people are very kind and use the word ‘good’ for all products. We address this issue by taking the user information into account when computing the attention weights. Moreover, sentiment words only occur in review-related corpora and not in non-emotional datasets such as scientific articles. As a result, our way of computing attention over words is general and applicable to many settings.
Remark I. MNet attentively extracts useful content to match word semantics with a specific user, where different words in the text document have different weights in the semantic factor. Memory networks were first proposed for the question answering (QA) task, where the memories are a short story and the query is a question about the text whose answer can be reasoned out by the network. We can think of recommendation with text as a QA problem: the question to be answered is how likely a user is to prefer an item. The unstructured text is analogous to the story, and the user-item interaction is analogous to the query.
5.2. Selecting Source Items to Transfer
We introduce a transfer component to exploit the source domain knowledge. A user may participate in several systems to meet different information needs; for example, a user installs apps from an app store and reads news on another website. Cross-domain recommendation is an effective technique for alleviating the sparsity issue, and transfer learning (including multitask learning) [38, 4, 59] is a class of underlying methods. Typical methods include the collective matrix factorization (CMF) approach, which jointly factorizes two rating matrices by sharing the user latent factors and hence enables knowledge transfer. The cross-stitch network and its sparse variant enable deep information sharing between two base networks, one for each domain. These methods treat knowledge transfer as a global process (with shared global parameters) and do not match source items with the specific target item given a user.
We propose a novel transfer network (TNet) that selectively transfers source knowledge for a specific target item. Since the relationships between items have been shown to be important for improving recommendation performance in a single domain [41, 26, 37, 36], we want to capture the relationships between the target item and the source items of a user. The central idea is to learn adaptive weights over the source items, specific to the given target item, during knowledge transfer.
Given the source items with which the user has interacted in the source domain, TNet learns a transfer vector that captures the relations between the target item and the source items for this user. The underlying observation can be illustrated by the example of improving movie recommendation by transferring knowledge from the book domain. When we predict the preference of a user for the movie “The Lord of the Rings”, the books she has read such as “The Hobbit” and “The Silmarillion” may be much more important than books such as “Call Me by Your Name”.
The similarities between the target item and the source items are computed by the dot products of their embeddings, where the embedding of each source item is obtained from the source item embedding matrix. This score measures the compatibility between the target item and the source items consumed by the user. For example, the similarity of the target movie “The Lord of the Rings” to the source book “The Hobbit” may be larger than that to the source book “Call Me by Your Name” (given a user).
We normalize the similarity scores into a probability distribution over the source items, and the transfer vector is then the sum of the corresponding source item embeddings weighted by these probabilities.
A more detailed discussion on Eq. (13) is in order. Eq. (13) sharpens the idea that the transfer component selectively transfers source items under the guidance of target user-item interactions. This is achieved by the attentive weights. When a source item is highly relevant to the target item given the user, knowledge from the source domain flows easily into the target domain with a high influence weight. When a source item is irrelevant to the target item given the user, knowledge from the source domain hardly flows into the target domain, since its weight is small. This selection is determined automatically by the transfer component, which is not easily achieved by existing cross-domain recommendation techniques such as the multitask models collective matrix factorization and collaborative cross networks (a variant of cross-stitch networks), which rely on multi-objective optimization. Besides, we use the label information from the source domain implicitly when generating the source items for a user, whereas CMF and CSN exploit the label information explicitly by learning to predict the labels. As a result, the transfer component benefits from the source domain knowledge in two steps: selecting instances (source items) to transfer via the source domain labels, and re-weighting the instances with attentive weights.
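The selection mechanism of TNet can be sketched in a few lines of Python. This is a minimal sketch with toy two-dimensional embeddings. The function names are illustrative, and we assume that the normalization is a plain softmax and that the user has at least one source item.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)                      # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def transfer_vector(target_emb: Sequence[float],
                    source_embs: Sequence[Sequence[float]]
                    ) -> Tuple[List[float], List[float]]:
    """Attention over a user's source items, guided by the target item."""
    # Similarity of the target item to each source item (dot product).
    scores = [sum(t * s for t, s in zip(target_emb, src))
              for src in source_embs]
    # Normalize the similarities into a distribution over source items.
    weights = softmax(scores)
    # Transfer vector: attention-weighted sum of source item embeddings.
    dim = len(target_emb)
    vec = [sum(w * src[j] for w, src in zip(weights, source_embs))
           for j in range(dim)]
    return weights, vec


# Toy example: one target item and three source items of the same user.
target = [1.0, 0.0]
sources = [[2.0, 0.0],    # closely related to the target item
           [0.0, 2.0],    # unrelated
           [-2.0, 0.0]]   # opposite direction
alphas, c_ui = transfer_vector(target, sources)
```

The related source item receives the largest weight, so the transfer vector points mostly in its direction, which mirrors the book-to-movie example given earlier in this section.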
Remark II. The computational processes of MNet and TNet are similar. We first compute attentive weights over a collection of objects (words in MNet and items in TNet). Then we summarize the high-level representation as the output (the text representation in MNet and the transfer vector in TNet), weighted by the attentive probabilities, which are computed by a content-based addressing scheme.
The architecture of the proposed TMH model is illustrated in Figure 1 as a feedforward neural network (FFNN). The input layer specifies the embeddings of a user, a target item, and the corresponding source items. The content text is modelled by the memories in MNet to produce a high-level representation. The source items are turned into the transfer vector in TNet, under the guidance of the target user-item interaction. These computational paths are introduced in Sec. 5.1 and Sec. 5.2, respectively.
First, we use a simple neural CF model (CFNet) with one hidden layer to learn a nonlinear representation of the user-item interaction, with a weight and a bias as the parameters of the hidden layer. In a typical tower-pattern architecture, the dimension of this representation is usually half that of its input.
The outputs of the three individual networks can be viewed as high-level features of the content text, the source domain knowledge, and the user-item interaction. They come from different feature spaces learned by different networks. Thus, we use a shared layer, with its own weight parameter, on top of all the features. The joint representation fed to this layer is the concatenation of the linearly mapped outputs of the individual networks, where each network has its own linear mapping matrix.
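The coupling of the three networks can be sketched as follows. This is a minimal sketch with hand-picked toy values. The function name `tmh_score`, the ordering of the three outputs, and the sigmoid on top of the shared layer are our assumptions; the sigmoid is chosen because the model is trained with a binary cross-entropy loss.

```python
import math
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def matvec(W: Matrix, x: Sequence[float]) -> List[float]:
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def tmh_score(outputs: Sequence[Sequence[float]],
              mappings: Sequence[Matrix],
              h: Sequence[float]) -> Tuple[float, List[float]]:
    """Map each network output linearly, concatenate, apply the shared layer."""
    joint: List[float] = []
    for W, out in zip(mappings, outputs):
        joint.extend(matvec(W, out))
    logit = sum(a * b for a, b in zip(h, joint))
    return 1.0 / (1.0 + math.exp(-logit)), joint


# Toy outputs of MNet, TNet, and CFNet (different dimensions on purpose).
o_text = [1.0, 2.0]
c_transfer = [0.5]
z_cf = [1.0, -1.0, 0.0]

# One hypothetical linear mapping per network.
W_text = [[1.0, 0.0], [0.0, 1.0]]
W_transfer = [[2.0]]
W_cf = [[1.0, 1.0, 1.0]]

h = [0.1, 0.1, 0.1, 0.1]  # weight of the shared layer
score, joint = tmh_score([o_text, c_transfer, z_cf],
                         [W_text, W_transfer, W_cf], h)
```

The separate linear mappings let outputs of different dimensions, learned in different feature spaces, be combined in a single joint vector before the shared layer.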
Due to the nature of implicit feedback and the task of item recommendation, the squared loss may not be suitable, since it is usually used for rating prediction. Instead, we adopt the binary cross-entropy loss, where the training samples are the union of the observed target interactions and randomly sampled negative pairs. We do not perform negative sampling in advance, since this can only generate a fixed training set of negative samples. Instead, we generate negative samples during each epoch, so that diverse and augmented sets of negative examples are used in training.
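Per-epoch negative sampling combined with the binary cross-entropy loss can be sketched as follows. This is a minimal sketch: the function names, the `ratio` of negatives per positive, and the constant predictor are illustrative assumptions, and the sampler assumes that no user has interacted with every item.

```python
import math
import random
from typing import Callable, List, Tuple

Pair = Tuple[int, int]


def sample_negatives(positives: List[Pair], n_items: int, ratio: int,
                     rng: random.Random) -> List[Pair]:
    """Draw `ratio` unobserved items for every observed (user, item) pair."""
    observed = set(positives)
    negatives: List[Pair] = []
    for u, _ in positives:
        for _ in range(ratio):
            j = rng.randrange(n_items)
            while (u, j) in observed:     # resample until unobserved
                j = rng.randrange(n_items)
            negatives.append((u, j))
    return negatives


def bce_loss(samples: List[Tuple[int, int, int]],
             predict: Callable[[int, int], float]) -> float:
    """Binary cross-entropy summed over (user, item, label) triples."""
    eps = 1e-12
    total = 0.0
    for u, i, label in samples:
        p = min(max(predict(u, i), eps), 1.0 - eps)   # avoid log(0)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total


positives = [(0, 0), (0, 1), (1, 2)]
rng = random.Random(0)

# A fresh set of negatives is drawn at the start of every epoch.
for epoch in range(2):
    negatives = sample_negatives(positives, n_items=5, ratio=2, rng=rng)
    samples = ([(u, i, 1) for u, i in positives]
               + [(u, j, 0) for u, j in negatives])
    loss = bce_loss(samples, lambda u, i: 0.5)
```

In an actual training loop the constant predictor would be replaced by the model output, and the loss would be minimized by a gradient-based optimizer.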
This objective function has a probabilistic interpretation: it is the negative log-likelihood of a likelihood function over the training samples, parameterized by the model parameters. Compared with Eq. (6), instead of modeling all zero entries (i.e., the whole target matrix), we learn from only a small subset of such unobserved entries and treat them as negative samples, picking them randomly during each optimization iteration (i.e., the negative sampling technique). The objective function can be optimized by stochastic gradient descent (SGD) and its variants such as adaptive moment estimation (Adam). Each update moves the parameters along the negative gradient of the objective, scaled by the learning rate. Typical deep learning libraries such as TensorFlow (https://www.tensorflow.org) provide automatic differentiation, and hence we omit the gradient equations, which can be computed by the chain rule in back-propagation (BP).
5.5. Complexity Analysis
Among the model parameters, the embedding matrices of users, target items, and source items contain a large number of parameters, since they depend on the numbers of users and (target and source) items, which are on the scale of hundreds of thousands. The other relevant sizes are the number of words (i.e., the vocabulary size) and the dimension of the embeddings. Since the architecture follows a tower pattern, the dimension of the outputs of the three individual networks is also limited to the hundreds. In total, the number of model parameters is linear in the input size and is close to that of typical latent factor models and one-hidden-layer neural CF approaches.
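The linear dependence on the input size can be checked with a back-of-the-envelope count. This is a rough sketch that counts only the three embedding matrices and the two memory matrices and ignores the small dense layers. The embedding dimension of 64 and the memory slot dimension of 128 are hypothetical values chosen for illustration; the other sizes are the rounded Amazon statistics reported in Section 6.1.

```python
def embedding_parameter_count(n_users: int,
                              n_target_items: int,
                              n_source_items: int,
                              vocab_size: int,
                              dim: int,
                              memory_dim: int) -> int:
    """Count the entries of the embedding and memory matrices only."""
    embeddings = (n_users + n_target_items + n_source_items) * dim
    memories = 2 * vocab_size * memory_dim   # internal and external memory
    return embeddings + memories


# Rounded sizes: 8.5K users, 28K target items, 41K source items, 8,000 words.
base = embedding_parameter_count(8_500, 28_000, 41_000, 8_000,
                                 dim=64, memory_dim=128)

# Doubling the number of users adds exactly n_users * dim parameters.
doubled = embedding_parameter_count(17_000, 28_000, 41_000, 8_000,
                                    dim=64, memory_dim=128)
```

Under these assumed dimensions the count is about seven million entries, dominated by the item embedding matrices.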
During training, we compute the outputs of the three individual networks in parallel using mini-batch stochastic optimization, and the model can be trained efficiently by back-propagation. TMH scales with the amount of training data. It can easily be updated when new data examples arrive, simply by feeding them into the training mini-batches. Thus, TMH can handle the scale and dynamics of items and users in an online fashion. In contrast, techniques based on topic modeling have difficulty in benefitting from these advantages to the same extent.
In this section, we conduct an empirical study to answer the following questions: 1) how does the proposed TMH model perform compared with state-of-the-art recommender systems; and 2) how do the text content and the source domain information each contribute to the proposed framework? We first introduce the evaluation protocols and experimental settings, and then compare the performance of different recommender systems. We further analyze the TMH model to understand the impact of the memory and transfer components. We also investigate to what extent the improved performance comes from cold users and cold items.
6.1. Experimental Settings
Table 2 (excerpt): Statistics of the two datasets.
|Avg. Words Per News|7.2|
|Avg. Words Per Review|32.9|
Dataset We evaluate on two real-world cross-domain datasets. The first dataset, Mobile (an anonymous version can be released later), is provided by a large internet company, Cheetah Mobile (http://www.cmcm.com/en-us/). It contains logs of users reading news, the history of app installations, and some metadata such as news publisher and user gender, collected over one month in the US. We removed users with fewer than 10 feedback records. For each item, we use the news title as its text content. Following prior work, we filter stop words and use tf-idf to choose the top 8,000 distinct words as the vocabulary. This yields a corpus of 612K words. The average number of words per news item is less than 10. The dataset we used contains 477K user-news reading records and 817K user-app installations. There are 15.8K shared users, which enables knowledge transfer between the two domains. We aim to improve news recommendation by transferring knowledge from the app domain. The data sparsity is over 99.6%.
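The vocabulary selection step can be sketched as follows. The text does not specify how tf-idf scores are aggregated across documents, so this sketch assumes that each word is ranked by its tf-idf score summed over all documents; the tiny stop-word list, the whitespace tokenizer, and the function name are also illustrative assumptions rather than the preprocessing actually used.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the paper).
STOP_WORDS = {"the", "a", "of", "in", "to", "and"}


def top_tfidf_vocabulary(documents, vocab_size, stop_words=STOP_WORDS):
    """Return the vocab_size words with the highest tf-idf summed over documents."""
    tokenized = [[w for w in doc.lower().split() if w not in stop_words]
                 for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each word.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = Counter()
    for tokens in tokenized:
        if not tokens:
            continue
        term_freq = Counter(tokens)
        for word, count in term_freq.items():
            idf = math.log(n_docs / doc_freq[word]) + 1.0
            scores[word] += (count / len(tokens)) * idf

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:vocab_size]


titles = ["the cat sat", "the cat ran", "a dog ran in the park"]
vocab = top_tfidf_vocabulary(titles, vocab_size=2)
```

With `vocab_size=8000` and news titles or reviews as `documents`, the same procedure would yield a vocabulary of the size used in the experiments.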
The second dataset is a public Amazon dataset (http://snap.stanford.edu/data/web-Amazon.html), which has been widely used to evaluate the performance of collaborative filtering approaches. We use the two categories Amazon Men and Amazon Sports as the cross-domain pair [16, 20]. The original ratings range from 1 to 5, where five stars indicate that the user has a positive preference for the item and one star indicates that the user does not. We treat ratings of 4-5 as positive samples. The dataset we used contains 56K positive ratings on Amazon Men and 81K positive ratings on Amazon Sports. There are 8.5K shared users, 28K Men products, and 41K Sports goods. We aim to improve recommendation on the Men domain by transferring knowledge from the related Sports domain. The data sparsity is over 99.7%. We filter stop words and use tf-idf to choose the top 8,000 distinct words as the vocabulary. The average number of words per review is 32.9.
The statistics of the two datasets are summarized in Table 2. As we can see, both datasets are very sparse, and hence we hope to improve performance by transferring knowledge from the auxiliary domain and by exploiting the text content. Note that the Amazon text consists of long product reviews (about 32 words per item on average), while the Cheetah Mobile text consists of short news titles (about 7 words per item on average).
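The conversion of explicit ratings to positive samples and the sparsity figure can be sketched as below. The helper names are hypothetical, and the sparsity check uses the rounded counts quoted above (56K positive ratings, 8.5K users, 28K Men products), so it only approximates the exact value.

```python
def to_implicit_feedback(ratings, threshold=4):
    """Keep the (user, item) pairs whose rating is at least `threshold`."""
    return {(user, item) for (user, item, rating) in ratings
            if rating >= threshold}


def sparsity(n_interactions, n_users, n_items):
    """Fraction of empty cells in the user-item interaction matrix."""
    return 1.0 - n_interactions / (n_users * n_items)


toy_ratings = [("u1", "a", 5), ("u1", "b", 3), ("u2", "a", 4)]
positives = to_implicit_feedback(toy_ratings)

# Rounded Amazon Men counts from the text above.
men_sparsity = sparsity(56_000, 8_500, 28_000)
print(f"{men_sparsity:.4%}")
```

With these rounded counts the matrix is well over 99.7% empty, in line with the reported sparsity.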
Evaluation Protocol For the item recommendation task, leave-one-out (LOO) evaluation is widely used, and we follow the protocol of prior work. That is, we reserve one interaction per user as the test item. We determine hyper-parameters by randomly sampling another interaction per user as the validation/development set. We follow the common strategy of randomly sampling 99 (negative) items that the user has not interacted with and then evaluating how well the recommender ranks the test item against these negatives. Since we aim at top-K item recommendation, the typical evaluation metrics are hit ratio (HR), normalized discounted cumulative gain (NDCG), and mean reciprocal rank (MRR), where the ranked list is cut off at $topK$. HR intuitively measures whether the reserved test item is present in the top-K list, defined as $HR = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \delta(p_u \le topK)$, where $\mathcal{U}$ is the set of users, $p_u$ is the hit position for the test item of user $u$, and $\delta(\cdot)$ is the indicator function. NDCG and MRR also account for the rank of the hit position, respectively defined as $NDCG = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{\log 2}{\log(p_u + 1)}$ and $MRR = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{1}{p_u}$. A higher value at a lower cutoff indicates better performance.
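The protocol above can be sketched in a few lines. This is a minimal illustration, not the evaluation code used for the experiments: it assumes each user's hit position is the 1-based rank of the test item among itself and the 99 sampled negatives, that ties are broken in favour of the test item, and that users whose test item falls outside the top K contribute zero to all three metrics. All function names are hypothetical.

```python
import math


def hit_position(test_score, negative_scores):
    """1-based rank of the test item among itself and the sampled negatives.

    Assumption: ties are resolved in favour of the test item.
    """
    return 1 + sum(1 for s in negative_scores if s > test_score)


def hr_at_k(positions, k):
    """Fraction of users whose test item is ranked within the top k."""
    return sum(1 for p in positions if p <= k) / len(positions)


def ndcg_at_k(positions, k):
    """Average of log(2)/log(p+1) over users, counting only hits within the top k."""
    return sum(math.log(2) / math.log(p + 1)
               for p in positions if p <= k) / len(positions)


def mrr_at_k(positions, k):
    """Average of 1/p over users, counting only hits within the top k."""
    return sum(1.0 / p for p in positions if p <= k) / len(positions)


# Three users whose test items are ranked 1st, 3rd, and 12th.
positions = [1, 3, 12]
print(hr_at_k(positions, 10), ndcg_at_k(positions, 10), mrr_at_k(positions, 10))
```

For the three toy users, only the first two count as hits at cutoff 10, so HR is 2/3, NDCG is (1 + 0.5)/3 = 0.5, and MRR is (1 + 1/3)/3.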
Table 3 (excerpt): Relative improvement of TMH over the best baseline on the Amazon dataset.
|Improvement of TMH|10.04%|6.90%|6.01%|15.63%|9.43%|7.34%|12.65%|10.52%|7.60%|
Baselines We compare with various baselines, categorized as single/cross domain, shallow/deep, and hybrid methods.
BPRMF, Bayesian personalized ranking, is a latent factor model based on matrix factorization and a pair-wise loss. It learns on the target domain only.
HFT, Hidden Factors and hidden Topics, adopts topic distributions to learn latent factors from review text. It is a hybrid method.
CDCF, Cross-domain CF with factorization machines (FM), is a cross-domain recommender which extends FM. It is a context-aware approach which applies factorization on the merged domains (aligned by the shared users). That is, the auxiliary domain is used as context. On the Mobile dataset, the context for a user in the target news domain is his/her history of app installations in the source app domain. The input feature vector is sparse, and its non-zero entries are: 1) the index for the user id, 2) the index for the target news id (target domain), and 3) all indices for his/her installed apps (source domain).
CDCF++: We extend the above CDCF model to exploit the text content. The feature vector for the input is a sparse vector where the non-zero entries are augmented by the word features corresponding to the given user-item interaction. In this way, CDCF++ can learn from both the source domain and unstructured text information.
CMF, Collective matrix factorization, is a multi-relation learning approach which jointly factorizes the matrices of individual domains. Here, the relation is the user-item interaction. On Mobile, the two matrices are “user by news” and “user by app” respectively. CMF factorizes these two matrices simultaneously, approximating each by the product of the shared user latent factors and the item latent factors of its own domain. The shared user factors enable knowledge transfer between the two domains. CMF is a multi-objective shallow model that learns jointly on the two domains; it can be thought of as a non-deep transfer/multitask learning approach for cross-domain recommendation.
TextBPR extends the basic BPRMF model by integrating text content. It computes the prediction scores from two parts: one is the standard latent factors, the same as in BPRMF; the other is the text factors learned from the text content. It has two implementations, the VBPR model and the TBPR model, which are the same in essence.
MLP, multilayer perceptron, is a neural CF approach which learns the nonlinear interaction function using neural networks. It is a deep model learning on the target domain only.
MLP++: We combine two MLPs by sharing the user embedding matrix, enabling knowledge transfer between the two domains through the shared users. It is a naive knowledge transfer approach applied to cross-domain recommendation.
CSN, Cross-stitch network, is a deep multitask learning model originally proposed for visual recognition tasks. We use the cross-stitch units to stitch two MLP networks. It learns a linear combination of the activation maps from the two networks, so that each network benefits from the other. Compared with MLP++, CSN enables knowledge transfer in the hidden layers as well as in the lower embedding matrices. CSN optimizes a multi-objective problem and is a deep transfer learning approach for cross-domain recommendation.
LCMR, Local and Centralized Memory Recommender, is a deep model for collaborative filtering with unstructured text. Its local memory module is similar to our MNet except that we only have one layer. LCMR corresponds to the MNet component of our model. It is a deep hybrid method.
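The sparse input described for CDCF and CDCF++ above can be sketched as a list of non-zero indices. The block layout (users, then news items, then apps, then words) and the function name are assumptions made for illustration; the text does not say how libFM features were ordered or what values were assigned to the non-zero entries.

```python
def cdcf_feature_indices(user, news, installed_apps, sizes, word_ids=()):
    """Non-zero indices of the sparse input for one (user, news) interaction.

    Assumed layout: [users | news items | apps | words].
    `sizes` is (n_users, n_news, n_apps). Passing `word_ids` adds the word
    features used by the CDCF++ variant.
    """
    n_users, n_news, n_apps = sizes
    news_offset = n_users
    app_offset = n_users + n_news
    word_offset = n_users + n_news + n_apps

    indices = [user, news_offset + news]
    indices += sorted(app_offset + app for app in installed_apps)
    indices += sorted(word_offset + word for word in word_ids)
    return indices


# User 2 reads news 1 and has installed apps 0 and 3.
cdcf = cdcf_feature_indices(2, 1, {0, 3}, sizes=(5, 4, 6))
# Same interaction with one title word (id 7) added, as in CDCF++.
cdcf_pp = cdcf_feature_indices(2, 1, {0, 3}, sizes=(5, 4, 6), word_ids=[7])
```

Keeping each feature type in its own index block means the factorization machine learns a separate latent vector for every user, item, app, and word.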
Implementation For BPRMF, we use LightFM’s implementation (https://github.com/lyst/lightfm), which is a popular CF library. For CDCF and CDCF++, we adapt the official libFM implementation (http://www.libfm.org). For CMF, we use a Python version based on the original Matlab code (http://www.cs.cmu.edu/~ajit/cmf/). For HFT and TextBPR, we use the code released by their authors (http://cseweb.ucsd.edu/~jmcauley/). The word embeddings used in TextBPR are pre-trained by GloVe (https://nlp.stanford.edu/projects/glove/). For latent factor models, we vary the number of factors from 10 to 100 with step size 10. For MLP, we use the code released by its authors (https://github.com/hexiangnan/neural_collaborative_filtering). MLP++ and CSN are implemented based on MLP. The LCMR model is similar to our MNet model and is thus implemented together with it. Our methods are implemented using TensorFlow. Parameters are randomly initialized from a Gaussian distribution. The optimizer is Adam with an initial learning rate of 0.001. The mini-batch size is 128. The negative sampling ratio is 1. MLP and MLP++ follow a tower pattern, halving the layer size for each successive higher layer. Specifically, the configuration of hidden layers in the base MLP network follows the original paper. CSN requires the number of neurons in each hidden layer to be the same, and we configure it accordingly. We investigate several typical configurations. The dimension of embeddings is .
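The two layer layouts mentioned above can be sketched as follows. The concrete sizes (64 units, four layers) are illustrative assumptions only and are not taken from the experimental configuration; the function names are hypothetical.

```python
def tower_layer_sizes(bottom_size, n_layers):
    """Hidden-layer sizes for a tower pattern: each layer is half the one below."""
    sizes = []
    size = bottom_size
    for _ in range(n_layers):
        sizes.append(size)
        size //= 2
    return sizes


def constant_layer_sizes(size, n_layers):
    """Hidden-layer sizes when every layer must have the same width (as for CSN)."""
    return [size] * n_layers


print(tower_layer_sizes(64, 4))
print(constant_layer_sizes(64, 4))
```

The constant-width layout is what allows two networks to be stitched layer by layer, since activations at the same depth must have matching shapes.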
Table 4 (excerpt): Relative improvement of TMH over the best baseline on the Mobile dataset.
|Improvement of TMH|2.04%|2.42%|2.51%|1.75%|2.32%|2.47%|0.81%|1.86%|2.34%|
6.2. Comparison Results
In this section, we report the recommendation performance of different methods and discuss the findings. The comparison results on the Mobile and Amazon datasets are shown in Table 4 and Table 3 respectively, where the last row is the relative improvement of our model over the best baseline. We have the following observations. First, our proposed neural model outperforms all baselines on the two datasets in every setting, including the base MLP network, shallow cross-domain models (CMF and CDCF), deep cross-domain models (MLP++ and CSN), and hybrid methods (HFT, TextBPR, and LCMR). These results demonstrate the effectiveness of the proposed neural model.
On the Mobile dataset, the differences between TMH and other methods are more pronounced for small numbers of recommended items (top-5 or top-10), where we achieve an average relative improvement of 2.25% over the best baseline. This is a desirable feature, since we often recommend only a small number of top-ranked items to consumers to alleviate information overload.
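As a quick arithmetic check of the 2.25% figure, the sketch below averages entries of the "Improvement of TMH" row reported for the Mobile dataset. The column headers of that table are not shown here, so the ordering is an assumption: the nine values are taken to be three metrics at each of the cutoffs 5, 10, and 20, in that order. The helper names are hypothetical.

```python
def relative_improvement(ours, best_baseline):
    """Relative improvement of our score over the best baseline, in percent."""
    return 100.0 * (ours - best_baseline) / best_baseline


def mean(values):
    return sum(values) / len(values)


# "Improvement of TMH" row for the Mobile dataset, in percent.
mobile_improvements = [2.04, 2.42, 2.51, 1.75, 2.32, 2.47, 0.81, 1.86, 2.34]

# Assumed column order: the first six entries are the top-5 and top-10 columns.
top5_top10_average = mean(mobile_improvements[:6])
print(round(top5_top10_average, 2))  # 2.25
```

Under this column ordering, the first six entries average to 2.25%, matching the figure quoted for top-5 and top-10.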
Note that the relative improvement of the proposed model over the best baseline is larger on the Amazon dataset than on the Mobile dataset, with an average relative improvement of 9.56% over the best baseline, CSN, even though Amazon is sparser than Mobile (see Table 2). We show the benefit of combining text content by comparing with CSN. One explanation is that the Men and Sports domains are more closely related than the news and app domains. This benefits all cross-domain methods, including CMF, CDCF, MLP++, and CSN, since they exploit information from both domains. Another explanation is that the text content is richer on the Amazon dataset. As shown in Table 2, the product reviews contain more words on average than the news titles. This benefits all hybrid methods, including HFT, TextBPR, and LCMR. We show the benefit of transferring source items by comparing with LCMR.
The hybrid TextBPR model composes a document representation by averaging the embeddings of its words. This cannot distinguish the words that are important for matching user preferences, which may explain why it has difficulty improving recommendation performance when integrating text content. For example, it cannot consistently outperform the pure CF method, MLP. The cross-domain CSN model transfers every representation from the source network with the same coefficient. This risks transferring noise and harming performance, as pointed out in its sparse variant. On the Amazon dataset, it loses to the proposed model by a large margin (though TMH also leverages content information). In contrast, the memory and transfer components both selectively extract useful information through the attention mechanism. This may explain why our model is consistently the best in all settings.
Noise from the auxiliary domain and irrelevant information in the unstructured text may pose a challenge for exploiting them. The results suggest that the proposed model is more effective because it can select useful representations from the source network and attentively focus on the words that are important for matching user preferences.
In summary, the empirical comparison results demonstrate the superiority of the proposed neural model in exploiting the text content and source domain knowledge for recommendation.
6.3. Impact of Unstructured Text and Auxiliary Domain
We have shown the effectiveness of the memory and transfer components working together in the proposed framework. We now investigate the contribution of each network to TMH by eliminating the impact of the text content and of the source domain in turn:
TMHMT: Eliminating the impact of both content and source information from TMH. This is a collaborative filtering recommender. Actually, it is equivalent to a single hidden layer MLP model.
TMHM: Eliminating the impact of content information (MNet) from TMH. This is a novel cross-domain recommender which can adaptively select source items to transfer via the attentive weights.
TMHT: Eliminating the impact of source information (TNet) from TMH. This is a novel hybrid filtering recommender which can attentively match word semantics with user preferences.
The ablation analyses of TMH and its components are shown in Figure 2. Performance degrades when either the memory or the transfer module is removed. This is understandable, since some information is lost. In other words, both components extract useful knowledge that improves recommendation performance. For example, relative to TMH, TMHT and TMHM reduce NDCG@10 by 1.1% and 4.3% respectively on the Mobile dataset (8.5% and 16.1% on Amazon), suggesting that both the memory and transfer networks learn essential knowledge for recommendation. On both datasets, removing the memory component degrades performance more than removing the transfer component. This may be because the text content contains richer information, because the source domain contains more noise, or both.
6.4. Improvement on Cold Users and Items
The cold-user and cold-item problems are common issues in recommender systems. When new users enter a system, they have no history that the recommender system can exploit to learn their preferences, leading to the cold-user start problem. Similarly, when the latest news is released on Google News, there are no reading records that the recommender system can exploit to learn users’ preferences for it, leading to the cold-item start problem. In general, it is very hard to train a reliable recommender system and make predictions for users and items that have few interactions. Intuitively, the proposed model can alleviate both the cold-user and cold-item start issues. TMH alleviates the cold-user start issue in the target domain by transferring the user’s history from the related source domain. TMH alleviates the cold-item start issue by exploiting the associated text content to reveal the item’s properties, semantics, and topics. We now investigate whether TMH indeed improves performance on cold users and items by comparing it with the pure neural collaborative filtering method, MLP.
We analyse the distribution of missed hit users (MHUs) of TMH and MLP (at cutoff 10). We expect that the number of cold users among the MHUs of MLP can be reduced by using the TMH model. The larger the reduction, the more effectively TMH alleviates the cold-user start issue. The results are shown in Figure 3, where the number of training examples measures the “coldness” of a user. Naturally, most MHUs are cold users who have few training examples. As we can see, the number of cold users among the MHUs of MLP is higher than that of TMH. If cold users are defined as those with fewer than seven training examples, then TMH reduces the number of cold users from 4,218 to 3,746 on the Amazon dataset, achieving relative 12.1% reduction. On the Mobile dataset, if cold users are those with fewer than ten training examples (Mobile is denser than Amazon), then TMH reduces the number of cold users from 1,385 to 1,145, achieving relative 20.9% reduction. These results show that the proposed model is effective in alleviating the cold-user start issue. The results on cold items are similar and we omit them due to the page limit.
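The counting behind this analysis can be sketched as follows. The data structures and the function name are hypothetical: each user is assumed to have a number of training examples and a 1-based hit position for the test item, and a user counts as a missed hit when that position falls outside the cutoff.

```python
def count_cold_missed_hit_users(train_counts, hit_positions,
                                cutoff=10, cold_threshold=7):
    """Count users who are both missed at `cutoff` and cold.

    A user is a missed hit user if the test item is ranked below `cutoff`,
    and cold if the user has fewer than `cold_threshold` training examples.
    """
    return sum(1 for user, position in hit_positions.items()
               if position > cutoff and train_counts[user] < cold_threshold)


# Toy example with four users and the hit positions from two models.
train_counts = {"u1": 3, "u2": 20, "u3": 5, "u4": 2}
positions_model_a = {"u1": 15, "u2": 30, "u3": 4, "u4": 11}
positions_model_b = {"u1": 8, "u2": 30, "u3": 4, "u4": 11}

cold_missed_a = count_cold_missed_hit_users(train_counts, positions_model_a)
cold_missed_b = count_cold_missed_hit_users(train_counts, positions_model_b)
```

In the toy data, the second model recovers user `u1`, so its count of cold missed hit users drops from 2 to 1; user `u2` is missed by both models but is not cold and is therefore never counted.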
7. Conclusion
We have shown that the text content and the source domain knowledge can help improve recommendation performance and can be effectively integrated under a neural architecture. The sparse target user-item interaction matrix can be reconstructed with guidance from both kinds of information, alleviating the data sparsity issue. We proposed a novel deep neural model, TMH, for cross-domain recommendation with unstructured text. TMH combines transfer learning with hybrid recommendation in a single model. TMH consists of a memory component, which attentively focuses on important words to match user preferences, and a transfer component, which selectively transfers useful source items to benefit the target domain. Both are achieved by attentive weights that are learned automatically. TMH shows better performance than various baselines on two real-world datasets under different settings. The results demonstrate that our combined model outperforms the baseline that relies only on memory networks (LCMR) and the baseline that relies only on transfer networks (CSN). Additionally, we conducted ablation analyses to understand the contributions of the memory and transfer components, showing the necessity of combining transfer and hybrid methods. We quantified the number of missed hit cold users (and items) that TMH can reduce compared with the pure CF method, showing that TMH is able to alleviate the cold-start issue.
In real-world services, data sources may belong to different providers (e.g. product reviews provided by Amazon and social relations provided by Facebook). Data privacy is a major concern when multiple data sources are combined. In future work, it is worth developing new learning techniques to learn a combined model while protecting user privacy.
References
-  T. Bansal, D. Belanger, and A. McCallum. Ask the gru: Multi-task learning for deep text recommendations. In ACM RecSys, 2016.
-  Yang Bao, Hui Fang, and Jie Zhang. Topicmf: Simultaneously exploiting ratings and reviews for recommendation. In AAAI, volume 14, pages 2–8, 2014.
-  I. Cantador, I. Fernández-Tobías, S. Berkovsky, and P. Cremonesi. Cross-domain recommender systems. In Recommender Systems Handbook. 2015.
-  R. Caruana. Multitask learning. Machine Learning, 1997.
-  R. Catherine and W. Cohen. Transnets: Learning to transform for recommendation. In ACM RecSys, 2017.
-  Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. Attentive collaborative filtering: Multimedia recommendation with item-and component-level attention. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 335–344. ACM, 2017.
-  H.-T. Cheng, L. Koc, J. Harmsen, et al. Wide & deep learning for recommender systems. In Workshop on Deep Learning for Recommender Systems, 2016.
-  Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube recommendations. In RecSys, 2016.
-  M. Deshpande and G. Karypis. Item-based top-n recommendation algorithms. ACM Transactions on Information Systems, 2004.
-  G. Dziugaite and D. Roy. Neural network matrix factorization. arXiv:1511.06443, 2015.
-  T. Ebesu, B. Shen, and Y. Fang. Collaborative memory network for recommendation systems. In ACM SIGIR, 2018.
-  Travis Ebesu, Bin Shen, and Yi Fang. Collaborative memory network for recommendation systems. SIGIR, 2018.
-  A. Elkahky, Y. Song, and X. He. A multi-view deep learning approach for cross domain user modeling in recommendation systems. In WWW, 2015.
-  Gayatree Ganu, Noemie Elhadad, and Amélie Marian. Beyond the stars: improving rating predictions using review text content. In WebDB, volume 9, pages 1–6, 2009.
-  Ming He, Jiuling Zhang, Peng Yang, and Kaisheng Yao. Robust transfer learning for cross-domain collaborative filtering using multiple rating patterns approximation. In WSDM ’18, pages 225–233. ACM, 2018.
-  R. He and J. McAuley. Vbpr: visual bayesian personalized ranking from implicit feedback. In AAAI, 2016.
-  X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua. Neural collaborative filtering. In WWW, 2017.
-  G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
-  G. Hu, Y. Zhang, and Q. Yang. Lcmr: Local and centralized memories for collaborative filtering with unstructured text. arXiv preprint arXiv:1804.06201, 2018.
-  Guang-Neng Hu and Xin-Yu Dai. Integrating reviews into personalized ranking for cold start recommendation. In Pacific-Asia Knowledge Discovery and Data Mining, 2017.
-  Guangneng Hu, Yu Zhang, and Qiang Yang. Conet: Collaborative cross networks for cross-domain recommendation. CIKM, 2018.
-  Haoran Huang, Qi Zhang, Xuanjing Huang, et al. Mention recommendation for twitter with end-to-end memory network. In Proc. IJCAI, volume 17, pages 1872–1878, 2017.
-  Donghyun Kim, Chanyoung Park, Jinoh Oh, Sungyoung Lee, and Hwanjo Yu. Convolutional matrix factorization for document context-aware recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 233–240. ACM, 2016.
-  D. Kingma and J. Ba. Adam: A method for stochastic optimization. 2015.
-  Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 2009.
-  Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 426–434. ACM, 2008.
-  B. Li, Q. Yang, and X. Xue. Can movies and books collaborate?: cross-domain collaborative filtering for sparsity reduction. In IJCAI, 2009.
-  Dawen Liang, Jaan Altosaar, Laurent Charlin, and David M Blei. Factorization meets the item embedding: Regularizing matrix factorization with item co-occurrence. In Proceedings of the 10th ACM conference on recommender systems, pages 59–66. ACM, 2016.
-  Guang Ling, Michael R Lyu, and Irwin King. Ratings meet reviews, a combined approach to recommend. In Proceedings of the 8th ACM Conference on Recommender systems, pages 105–112. ACM, 2014.
-  Bo Liu, Ying Wei, Yu Zhang, Zhixian Yan, and Qiang Yang. Transferable contextual bandit for cross-domain recommendation. 2018.
-  B. Loni, Y. Shi, M. Larson, and A. Hanjalic. Cross-domain collaborative filtering with factorization machines. In European conference on information retrieval, 2014.
-  Z. Lu, E. Zhong, L. Zhao, E. Xiang, W. Pan, and Q. Yang. Selective transfer learning for cross domain recommendation. In SIAM International Conference on Data Mining, 2013.
-  J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In ACM RecSys, 2013.
-  I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. In IEEE CVPR, 2016.
-  A. Mnih and R. Salakhutdinov. Probabilistic matrix factorization. In NIPS, 2008.
-  ThaiBinh Nguyen and Atsuhiro Takasu. Npe: Neural personalized embedding for collaborative filtering. In IJCAI, 2018.
-  Xia Ning and George Karypis. Slim: Sparse linear methods for top-n recommender systems. In 2011 11th IEEE International Conference on Data Mining, pages 497–506. IEEE, 2011.
-  S. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 2010.
-  W. Pan, N. Liu, E. Xiang, and Q. Yang. Transfer learning to predict missing ratings via heterogeneous user feedbacks. In IJCAI, 2011.
-  Weike Pan, Evan W Xiang, Nathan N Liu, and Qiang Yang. Transfer learning in collaborative filtering for sparsity reduction. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, pages 230–235. AAAI Press, 2010.
-  Arkadiusz Paterek. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD cup and workshop, volume 2007, pages 5–8, 2007.
-  Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In EMNLP, 2014.
-  S. Rendle. Factorization machines with libfm. ACM Transactions on Intelligent Systems and Technology, 2012.
-  S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. Bpr: Bayesian personalized ranking from implicit feedback. In UAI, 2009.
-  A. Singh and G. Gordon. Relational learning via collective matrix factorization. In ACM SIGKDD, 2008.
-  Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in neural information processing systems, pages 2440–2448, 2015.
-  Duyu Tang, Bing Qin, Ting Liu, and Yuekui Yang. User modeling with neural network for review rating prediction. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 1340–1346. AAAI Press, 2015.
-  Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. Latent relational metric learning via memory-based attention for collaborative ranking. In WWW, 2018.
-  Aäron van den Oord, Sander Dieleman, and Benjamin Schrauwen. Deep content-based music recommendation. In NIPS, 2013.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
-  C. Wang and D. Blei. Collaborative topic modeling for recommending scientific articles. In ACM SIGKDD, 2011.
-  Hao Wang, Naiyan Wang, and Dit-Yan Yeung. Collaborative deep learning for recommender systems. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1235–1244. ACM, 2015.
-  Hongwei Wang, Fuzheng Zhang, Xing Xie, and Minyi Guo. Dkn: Deep knowledge-aware network for news recommendation. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 1835–1844. International World Wide Web Conferences Steering Committee, 2018.
-  Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. ICLR, 2015.
-  Y. Wu, C. DuBois, A. Zheng, and M. Ester. Collaborative denoising auto-encoders for top-n recommender systems. In ACM WSDM, 2016.
-  C. Yang, L. Bai, C. Zhang, Q. Yuan, and J. Han. Bridging collaborative filtering and semi-supervised learning: A neural approach for poi recommendation. In ACM SIGKDD, 2017.
-  D. Yang, J. He, H. Qin, Y. Xiao, and W. Wang. A graph-based recommendation across heterogeneous domains. In ACM CIKM, 2015.
-  Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. Collaborative knowledge base embedding for recommender systems. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 353–362. ACM, 2016.
-  Y. Zhang and Q. Yang. A survey on multi-task learning. arXiv:1707.08114, 2017.
-  Huan Zhao, Quanming Yao, Jianda Li, Yangqiu Song, and Dik Lun Lee. Meta-graph based recommendation fusion over heterogeneous information networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 635–644. ACM, 2017.
-  L. Zheng, V. Noroozi, and P. Yu. Joint deep modeling of users and items using reviews for recommendation. In ACM WSDM, 2017.